ERR@HRI 2024 Challenge: Multimodal Detection of Errors and Failures in Human-Robot Interactions

Micol Spitale∗ (ms2871@cam.ac.uk), University of Cambridge, Cambridge, UK
Maria Teresa Parreira, Cornell University, Ithaca, NY, USA
Maia Stiber, Johns Hopkins University, Baltimore, MD, USA
Minja Axelsson, University of Cambridge, Cambridge, UK
Neval Kara†, Cankaya University, Ankara, Turkey
Garima Kankariya‡, Indian Institute of Technology, Delhi, India
Chien-Ming Huang, Johns Hopkins University, Baltimore, MD, USA
Malte Jung, Cornell University, Ithaca, NY, USA
Wendy Ju, Cornell Tech, New York, NY, USA
Hatice Gunes, University of Cambridge, Cambridge, UK

Abstract
Despite the recent advancements in robotics and machine learning (ML), the deployment of autonomous robots in our everyday lives is still an open challenge. This is due to multiple reasons among which are their frequent mistakes, such as interrupting people or having delayed responses, as well as their limited ability to understand human speech, i.e., failure in tasks like transcribing speech to text. These mistakes may disrupt interactions and negatively influence human perception of these robots. To address this problem, robots need to have the ability to detect human-robot interaction (HRI) failures. The ERR@HRI 2024 challenge tackles this by offering a benchmark multimodal dataset of robot failures during human-robot interactions, encouraging researchers to develop and benchmark multimodal machine learning models to detect these failures. We created a dataset featuring multimodal non-verbal interaction data, including facial, speech, and pose features from video clips of interactions with a robotic coach, annotated with labels indicating the presence or absence of robot mistakes, user awkwardness, and interaction ruptures, allowing for the training and evaluation of predictive models.
Challenge participants have been invited to submit their multimodal ML models for detection of robot errors, to be evaluated against various performance metrics such as accuracy, precision, recall, and F1 score, with and without a margin of error reflecting the time-sensitivity of these metrics. The results of this challenge will help the research field to better understand robot failures in human-robot interactions and to design autonomous robots that can mitigate their own errors after successfully detecting them.

CCS Concepts
• Human-centered computing → Empirical studies in HCI; • Computing methodologies → Machine learning algorithms.

Keywords
Robot Failure, Error Detection, Human-Robot Interaction, Multimodal Interaction, Benchmarking.

ACM Reference Format:
Micol Spitale, Maria Teresa Parreira, Maia Stiber, Minja Axelsson, Neval Kara, Garima Kankariya, Chien-Ming Huang, Malte Jung, Wendy Ju, and Hatice Gunes. 2024. ERR@HRI 2024 Challenge: Multimodal Detection of Errors and Failures in Human-Robot Interactions. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION (ICMI ’24), November 04–08, 2024, San Jose, Costa Rica. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3678957.3689030

∗The author is also affiliated with Politecnico di Milano, Milan, Italy.
†Contributed to this work while undertaking a remote visiting studentship at the Department of Computer Science and Technology, University of Cambridge.
‡Contributed to this work while undertaking a remote visiting studentship at the Department of Computer Science and Technology, University of Cambridge.

This work is licensed under a Creative Commons Attribution International 4.0 License (https://creativecommons.org/licenses/by/4.0/).
ICMI ’24, November 04–08, 2024, San Jose, Costa Rica
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0462-8/24/11
https://doi.org/10.1145/3678957.3689030

1 Introduction
Human-Robot Interaction (HRI) research is currently placing a greater emphasis on the development of autonomous robots that can be deployed in real-world scenarios to understand the implications of integrating such robots in our lives. However, past works [8, 12, 13] have shown that such autonomous robots often make mistakes, for example when the robot interrupts people or takes a very long time to respond. These robot failures may disrupt the interaction and negatively impact people's perception of the robot [11]. To overcome this problem, robots should be able to detect HRI failures.

The ERR@HRI 2024 challenge aims at addressing this issue by providing the community with a benchmark multimodal dataset of robot failures during human-robot interaction. The challenge encourages researchers to benchmark and develop multimodal machine learning-based models designed to identify when failures occur during HRI.

We recruited challenge participants through email advertisements (e.g., ICMI announcements, robotics-worldwide) that included a link to our website¹ where they could fill out the registration form. An end-user license agreement (EULA), approved by both the DPO and the Departmental Ethics Committee of the University of Cambridge, was shared with the teams who signed up. The signed EULA was then sent to the research office of the University of Cambridge for a final review and approval.
We provided participants with a dataset that includes 1) multimodal non-verbal features (i.e., facial, speech, and pose features) of interaction clips where individuals interact with a robotic coach delivering positive psychology exercises, and 2) binary labels in the form of ‘interaction rupture present’ (1) or ‘interaction rupture absent’ (0). These features and labels were to be used to train the predictive models. The dataset was annotated as a time series with the following labels, each coded as (0) absent or (1) present: robot mistake (e.g., interrupting or not responding), user awkwardness (e.g., when the participant feels uncomfortable interacting with the robot without any robot mistakes), and interaction rupture (i.e., when the user displays cues of awkwardness towards the robot, when the robot makes a mistake, or both). We invited the teams to submit their multimodal ML models for error detection to be evaluated and benchmarked against the pre-determined performance metrics, including accuracy, precision, recall, and F1 score, with and without an error margin [4, 8].

1.1 Relevance to Multimodal Interaction
This challenge addresses the problem of detecting robot failures in human-robot interaction, and as such it is directly relevant to the multimodal interaction community. HRI is multimodal by nature because interactions often involve multiple types of social signaling, such as the facial expressions, speech and body language of both humans and robots, that, if better understood, can be used as cues to facilitate more natural interactions. ERR@HRI provides a novel multimodal dataset that can be used by participants to develop multimodal machine learning failure detection models.
By highlighting the use of multimodal datasets and ML models for detecting failures, the ERR@HRI challenge contributes to advancing the understanding and enhancement of interactions between humans and autonomous robots in real-world settings. The increased interest of the ICMI community in HRI is also evident from recent contributions on HRI published in the ICMI proceedings, which include 5 papers at ICMI’23 (e.g., [5]) and 2 at ICMI’22 (e.g., [14]), as well as a keynote talk by Prof Maja Mataric at ICMI’23 entitled “A Robot Just for You: Multimodal Personalized Human-Robot Interaction and the Future of Work and Care”. The talk focused on multimodal aspects of HRI in healthcare, demonstrating the ICMI community’s increasing attention to this field.

¹ https://sites.google.com/cam.ac.uk/err-hri/home

2 Related Work
Past work has shown that robot failures commonly occur during human-robot interactions, and that they can negatively impact the user’s trust in the robot. For example, Spitale et al. [10] demonstrated that participants experienced frustration when the robot interrupted them, e.g., by erroneously detecting the end of the user’s speech while they were still talking, or when the robot exhibited prolonged response times due to internet connectivity issues. Analogously, Kontogiorgos et al. examined human non-verbal reactions to conversational failures during a cooking instruction class delivered by a Furhat robot [6, 7]. They found that severe errors may decrease users’ trust in the robot [7].

However, very few works have attempted to address this problem by automatically detecting such failures. Spitale et al. [11] introduced a new multimodal LLM-based system that allows robotic coaches to autonomously adapt to individuals’ multimodal behaviours (facial valence and speech duration) and detect ruptures while delivering well-being coaching. Bremers et al.
[2] used the bystander reaction dataset as input to a deep-learning model, BADNet, to predict failure occurrence without leveraging multimodal information. These studies represent a first stepping stone toward identifying robot failures during HRI, but they neither focused on benchmarking nor organised a challenge event to enable such comparisons under pre-defined settings and metrics.

The ERR@HRI initiative will provide a unique opportunity for benchmarking not only HRI data but also multimodal machine learning models to detect interaction ruptures, which is fundamental for the success of human-robot interactions. In this first edition, the challenge will focus on using a multimodal dataset collected in a real-world setting where a robotic coach delivered well-being coaching practices to each participant over four weeks. For future editions of the challenge, we plan to focus on additional datasets, such as REACT and Response to Errors in HRI [13], which have already been collected by the co-organizers of this challenge. This will enable a sustained engagement of the research community over the next couple of years and push the state of the art in multimodal robot failure analysis, detection and understanding.

3 The ERR@HRI 2024 Challenge
This section describes the dataset provided, including feature extraction, tasks, and evaluation process.

3.1 Materials
A challenge website was set up² with a commitment to maintain it for at least the next 3 years. A GitHub repository³ has been established along with the official website to guide and support the challenge participants.

3.2 Feature Extraction
We used a dataset collected in a previous study [1, 12], in which we deployed a robotic positive psychology coach at a workplace over four weeks. A total of 43 employees took part, of whom 23 gave consent to share their data in processed and aggregated form.
The robotic coach conducted four positive psychology exercises over four weeks. Please check the paper [10] for more detail on the study. During the interaction, we collected video recordings (coachee’s face and a side view of the interaction) and audio recordings (both the coachee’s and robot’s speech) using two cameras (a frontal video camera and a lateral GoPro) and a Jabra microphone.

² https://sites.google.com/cam.ac.uk/err-hri/home
³ https://github.com/ERR-HRI-Challenge/baseline2024

We used off-the-shelf state-of-the-art methods to extract multimodal behavioural features from the audio-visual data collected from the side-view camera as follows:
(1) Facial Features: We used the OpenFace 2.2.0 toolkit to extract the presence and intensity of 17 facial action units (AUs), for a total of 35 facial features per frame, at a rate of 30 fps.
(2) Audio Features: We used the openSMILE toolbox and extracted a total of 25 features, corresponding to the low-level descriptors of the eGeMAPSv02 feature set, using a time window of 0.02 s and at a rate of 100 data points per second.
(3) Pose Features: We used the OpenPose toolbox [3] and extracted the 25 2D body keypoints per frame to estimate the movement of the torso, hands, arms, and head. The features provided (at 30 fps) do not correspond directly to the features extracted from OpenPose, but rather to the relational distance and velocity for pairs of spatial body points, for a total of 44 features, corresponding to relational features of the 25 body points.

3.3 Labels
The video clips were labelled by 2 annotators using the ELAN video annotation tool.
We marked instances of user awkwardness and robot mistakes with binary labels (i.e., 1: present, or 0: absent), marking the time when the displays of user awkwardness or robot mistakes start and end. These labels have been defined in [12] as follows:
• User Awkwardness (UA): The coachee displays behaviours that signal the interaction is awkward; they may look confused, uncertain, distressed or uncomfortable.
• Robot Mistake (RM): The robot makes a mistake such as interrupting or not responding to the coachee, or responding with an utterance that is not appropriate for what the coachee has just said.
• Interaction Rupture (IR): We define an interaction rupture as either the presence of user awkwardness, a robot mistake, or both.

3.4 Sub-challenges
Accordingly, the ERR@HRI 2024 Challenge consists of the following three sub-challenges:
(1) Detection of robot mistakes (e.g., interrupting or not responding to the coachee).
(2) Detection of user awkwardness (e.g., when the coachee feels uncomfortable interacting with the robot without any robot mistakes).
(3) Detection of interaction ruptures (i.e., when the robot makes mistakes as described in (1), or when the user displays awkwardness towards the robot as described in (2)).

3.5 Dataset
The dataset contains data from 23 users, for a total of 89 sessions and 700 minutes of interaction. ERR@HRI 2024 participants are provided with 4 suggested dataset splits (i.e., subject-independent folds), with no overlapping participant data. Details of the data are provided in Table 1.

3.6 Metrics
This challenge comprises three binary classification tasks.
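The three binary targets behind these tasks can be illustrated with a short sketch that turns annotated start/end times into per-frame labels and derives IR as the union of UA and RM, following the definitions in Section 3.3. This is only a sketch under stated assumptions: the `(start, end)` tuples in seconds, the default 30 fps label rate, and the function names `intervals_to_frames` and `interaction_rupture` are hypothetical and are not taken from the challenge repository or from an ELAN export format.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation of one annotated event: (start_s, end_s) in seconds.
Interval = Tuple[float, float]


def intervals_to_frames(intervals: Sequence[Interval],
                        duration_s: float,
                        fps: float = 30.0) -> List[int]:
    """Convert annotated intervals into a per-frame 0/1 label sequence.

    A frame is labelled 1 if it overlaps any annotated interval.
    The frame rate is an assumption made for this illustration.
    """
    n_frames = int(round(duration_s * fps))
    labels = [0] * n_frames
    for start_s, end_s in intervals:
        first = max(0, int(math.floor(start_s * fps)))
        last = min(n_frames, int(math.ceil(end_s * fps)))
        for frame in range(first, last):
            labels[frame] = 1
    return labels


def interaction_rupture(ua: Sequence[int], rm: Sequence[int]) -> List[int]:
    """IR is present whenever UA, RM, or both are present."""
    if len(ua) != len(rm):
        raise ValueError("UA and RM label sequences must have the same length")
    return [int(bool(u) or bool(r)) for u, r in zip(ua, rm)]


# Toy example: a 4 s clip sampled at 2 labels per second.
rm_labels = intervals_to_frames([(1.0, 2.0)], duration_s=4.0, fps=2.0)
ua_labels = intervals_to_frames([(1.5, 3.0)], duration_s=4.0, fps=2.0)
ir_labels = interaction_rupture(ua_labels, rm_labels)
```

Because IR is the union of the two other labels, its share of positive frames is never smaller than that of UA or RM, which matches the ordering of the label proportions reported in Table 1.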
The metrics used to evaluate model performance are accuracy, precision, recall, and F1-score, as well as their counterparts with a margin of error [4, 8]: for a sample margin of size $k$ and a sample $i$, the model prediction is considered right if

$y^{i}_{pred} \in [y^{i-k}_{pred},\, y^{i+k}_{pred}]$

The motivation for considering metrics with a margin of error comes from real-life settings, where effectiveness may still be achieved even if the model is slightly early or delayed in its error detection. Other options for real-use systems could be to use the median or mode of predictions within an interval, among others. Metrics with a margin of error, in this challenge, include accuracy, precision, recall and F1-score.

3.7 Evaluation
Challenge participants were given access to the training and validation sets to develop their ML models. They were then asked to submit the developed models and weights, and the organisers evaluated the submitted models on the test set (the test set was released to the challenge participants without labels one week prior to the submission deadline). Each participating group was allowed to submit their models and results for the test set up to three times. The submitted models and predictions were automatically evaluated and ranked using various performance metrics, under two categories: overall performance and marginal performance. For both categories, models are ranked based on the combined rankings of accuracy and F1-score (for the marginal category, we use the accuracy and F1-score considering an error margin of 1 sample). Metrics were calculated using the same script provided to participants in the study repository. Challenge participants were also asked to submit a paper describing their model via the EasyChair system, and their works were reviewed by the Technical Program Committee members of the challenge.
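The margin-of-error idea from Section 3.6 can be sketched in a few lines. The sketch rests on an assumption: the prediction for sample i is counted as right if it equals any ground-truth label in the window from i−k to i+k, since a prediction can only be scored against annotations. The function names `tolerant_hits` and `margin_metrics` are hypothetical, the scores are positive-class rather than macro-averaged for brevity, and this is not the official evaluation script from the study repository.

```python
from typing import Dict, List, Sequence


def tolerant_hits(y_true: Sequence[int], y_pred: Sequence[int], k: int = 1) -> List[bool]:
    """True at index i if y_pred[i] equals any ground-truth label in [i-k, i+k]."""
    n = len(y_true)
    if len(y_pred) != n:
        raise ValueError("y_true and y_pred must have the same length")
    hits = []
    for i, pred in enumerate(y_pred):
        lo, hi = max(0, i - k), min(n, i + k + 1)  # clip the window at the sequence edges
        hits.append(pred in y_true[lo:hi])
    return hits


def margin_metrics(y_true: Sequence[int], y_pred: Sequence[int], k: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class with a k-sample tolerance.

    Assumed scoring rule: a tolerant hit is scored as if the ground truth
    at that sample agreed with the prediction; misses keep the original label.
    With k=0 this reduces to the usual sample-wise metrics.
    """
    hits = tolerant_hits(y_true, y_pred, k)
    y_adj = [p if h else t for t, p, h in zip(y_true, y_pred, hits)]
    tp = sum(1 for t, p in zip(y_adj, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_adj, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_adj, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_adj, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: the predicted event is shifted one sample early.
truth = [0, 0, 1, 1, 0, 0]
preds = [0, 1, 1, 0, 0, 0]
strict = margin_metrics(truth, preds, k=0)
tolerant = margin_metrics(truth, preds, k=1)
```

In the toy example the prediction is one sample early, so the strict scores penalise two samples while the 1-sample tolerance counts every sample as right; this is the behaviour the margin is meant to capture for slightly early or delayed detections.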
4 Challenge Baseline
We have provided a deep-learning multimodal baseline for each of the three tasks, as in [2] and [11] (where we reported results for interaction rupture prediction).

4.1 Training
For baseline models, and following previous work on a similar dataset, we decided to use Recurrent Neural Network models, which can conserve some feature history and are common approaches for time-series classification problems in HRI. Namely, we made use of Long Short-Term Memory networks (LSTMs), Bidirectional LSTMs (BiLSTMs), and Gated Recurrent Unit networks (GRUs), which tend to overfit less than LSTMs on smaller datasets. We used single-layer models, with dropout and a fully-connected layer. We wanted to provide a standard approach to model development, leaving room for participants to innovate their approaches for detection and classification.

For training, we did hyperparameter tuning using test accuracy as the metric to pick the top performing hyperparameters. We used a 3-1 train-validation fold split, with the suggested folds provided in the study repository. For each task and each model architecture, we picked the top 3 performing hyperparameter configurations, for a total of 9 models per task (3 architectures × 3 configurations). These models were then trained using cross-validation on the 4 folds and the final model was picked based on the average metrics across all folds. In the end, each of these models was trained on the 4 folds and predictions on the test set were reported to all participants.

Table 1: Dataset and ground truth characteristics. Time per label includes the total amount of time within the dataset labeled as that type of interaction failure. Percentage (reported as a fraction between 0 and 1) refers to time per label over total time, which provides a sense of dataset label balance.

Subset       Subjects  Sessions  Total time (s)  Time RM (s)  % RM  Time UA (s)  % UA  Time IR (s)  % IR
Train + Val  18        71        33308           5320         0.16  5182         0.16  8679         0.26
Test         5         18        8048            1399         0.17  1875         0.23  2738         0.34

Table 2: Hyperparameters of best performing models. SL: sequence length. LR: learning rate.

Task  Model   Hyperparameters
RM    GRU     SL=5, Units=128, Dropout=0.2, LR=0.0001, Activation: softmax, Optimizer: Adam, Loss: Categorical Cross-Entropy, Batch size=2048, Epochs=500
UA    BiLSTM  SL=5, Units=256, Dropout=0.2, LR=0.0001, Activation: sigmoid, Optimizer: Adam, Loss: Categorical Cross-Entropy, Batch size=512, Epochs=200
IR    BiLSTM  SL=5, Units=256, Dropout=0.2, LR=0.0001, Activation: softmax, Optimizer: Adam, Loss: Categorical Cross-Entropy, Batch size=4096, Epochs=500

Table 3: Baseline (macro) performances. Margin of error metrics are noted with a subscript e and represent a 1-sample tolerance.

Task  Accuracy  Precision  Recall   F1-Score  Accuracy_e  Precision_e  Recall_e  F1-Score_e
RM    0.71349   0.55593    0.54089  0.54184   0.71417     0.55756      0.54219   0.54335
UA    0.73074   0.56358    0.57356  0.56698   0.73207     0.56617      0.57676   0.56978
IR    0.68460   0.55541    0.50268  0.41964   0.68592     0.58794      0.50478   0.42395

4.2 Results & Discussion
The hyperparameters and performance for each model, for each task, are described in Tables 2 and 3. The best performing models have short sequence lengths (5 samples) and the BiLSTM model performed best across two of the sub-challenges. The obtained performances on the test set are slightly above chance level. While this baseline did not intend to be an exhaustive exploration of model architectures and training methods to generate the best possible performance, it is nonetheless notable that the performance results are not higher. This illustrates previously reported [9] challenges in obtaining generalizable models, due to the high range and diversity of human reactions to failure.

Interestingly, the task of detecting user awkwardness (UA) demonstrates the best overall performance with the highest scores in accuracy, precision, recall, and F1-score among the three tasks.
This suggests that the models are effective at picking up the behavioural expressions that signal UA. This aligns with our previous analysis [12], which showed that expressions of user awkwardness are characterized by laughter, often marked by the intense activation of the cheek raiser (AU6) and lip corner puller (AU12) action units, which correspond to the facial features that were fed into the model. The task of detecting robot mistakes (RM) shows moderate performance, with lower scores than UA but higher than IR, especially in accuracy and F1-score, while the task of detecting interaction ruptures (IR) performs the worst across all four standard metrics, with lower recall and a markedly lower F1-score compared to UA and RM. The task of detecting robot mistakes may be more difficult because the robot’s mistakes caused coachees to limit their self-disclosure and, in turn, express less via their facial or auditory cues, as highlighted in [12]. Overall, these findings suggest varying levels of complexity in detecting user awkwardness and robot mistakes, as evidenced by the models performing the worst at detecting interaction ruptures (IR), which combines elements of both RM and UA. This highlights the importance of tailored approaches to improve model performance for each specific task.

5 Participation and Conclusion
This paper introduced the ERR@HRI 2024 Challenge organised in conjunction with the ACM International Conference on Multimodal Interaction 2024 (ACM ICMI’24), which focuses on detecting robot failures in human-robot interactions. A total of 10 teams from 5 countries registered for this challenge, and 3 teams from 3 European countries submitted their results for benchmarking and evaluation. The submitted models will be ranked under identical conditions using the specified evaluation protocol and metrics.
We aim for the challenge data, code, systems, and results from competing teams to be valuable resources for researchers and practitioners focusing on detecting failures in human-robot interactions. Our future efforts will be directed at continuing to organize ERR@HRI challenge events in conjunction with well-known conferences while introducing new datasets and modalities.

Acknowledgments
Funding: This challenge is possible due to the EPSRC/UKRI grant EP/R030782/1 (ARoEQ) and EP/R511675/1 that supported the HRI studies, and the work of M. Spitale and H. Gunes, that generated the data used in this challenge. M. Spitale’s current work involving the organisation of this challenge and the writing of this paper is supported by PNRR-PE-AI FAIR project funded by the NextGeneration EU program.
Open Access: For open access purposes, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
Data access: Raw data related to this publication cannot be openly released due to anonymity and privacy issues. However, challenge participants who signed the EULA agreement have been granted access to processed data in the form of aggregated feature statistics and models.

References
[1] Minja Axelsson, Micol Spitale, and Hatice Gunes. 2024. "Oh, Sorry, I Think I Interrupted You": Designing Repair Strategies for Robotic Longitudinal Well-being Coaching. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction. 13–22.
[2] Alexandra Bremers, Maria Teresa Parreira, Xuanyu Fang, Natalie Friedman, Adolfo Ramirez-Aristizabal, Alexandria Pabst, Mirjana Spasojevic, Michael Kuniavsky, and Wendy Ju. 2023. The Bystander Affect Detection (BAD) Dataset for Failure Detection in HRI. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 11443–11450.
[3] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[4] IA de Kok and Dirk KJ Heylen. 2012. A survey on evaluation metrics for backchannel prediction models. In Interdisciplinary Workshop on Feedback Behaviors in Dialog, Stevenson, Washington, USA: Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog. University of Texas, 15–18.
[5] Apostolos Kalatzis, Saidur Rahman, Vishnunarayan Girishan Prabhu, Laura Stanley, and Mike Wittie. 2023. A Multimodal Approach to Investigate the Role of Cognitive Workload and User Interfaces in Human-robot Collaboration. In Proceedings of the 25th International Conference on Multimodal Interaction. 5–14.
[6] Dimosthenis Kontogiorgos, Andre Pereira, Boran Sahindal, Sanne van Waveren, and Joakim Gustafson. 2020. Behavioural responses to robot conversational failures. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction. 53–62.
[7] Dimosthenis Kontogiorgos, Minh Tran, Joakim Gustafson, and Mohammad Soleymani. 2021. A systematic cross-corpus analysis of human reactions to robot conversational failures. In Proceedings of the 2021 International Conference on Multimodal Interaction. 112–120.
[8] Maria Teresa Parreira, Sarah Gillet, and Iolanda Leite. 2023. Robot Duck Debugging: Can Attentive Listening Improve Problem Solving?. In Proceedings of the 25th International Conference on Multimodal Interaction. 527–536.
[9] Maria Teresa Parreira, Sukruth Gowdru Lingaraju, Adolfo Ramirez-Aristizabal, Manaswi Saha, Michael Kuniavsky, and Wendy Ju. 2024. A Study on Domain Generalization for Failure Detection through Human Reactions in HRI. arXiv:2403.06315 [cs.RO] https://arxiv.org/abs/2403.06315
[10] Micol Spitale, Minja Axelsson, and Hatice Gunes. 2023. Robotic mental well-being coaches for the workplace: An in-the-wild study on form. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. 301–310.
[11] Micol Spitale, Minja Axelsson, and Hatice Gunes. 2023. VITA: A Multi-modal LLM-based System for Longitudinal, Autonomous, and Adaptive Robotic Mental Well-being Coaching. arXiv preprint arXiv:2312.09740 (2023).
[12] Micol Spitale, Minja Axelsson, Neval Kara, and Hatice Gunes. 2023. Longitudinal evolution of coachees’ behavioural responses to interaction ruptures in robotic positive psychology coaching. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 315–322.
[13] Maia Stiber, Russell H Taylor, and Chien-Ming Huang. 2023. On using social signals to enable flexible error-aware HRI. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction. 222–230.
[14] Xiang Zhi Tan, Elizabeth Jeanne Carter, Prithu Pareek, and Aaron Steinfeld. 2022. Group formation in multi-robot human interaction during service scenarios. In Proceedings of the 2022 International Conference on Multimodal Interaction. 159–169.