Transparent Analysis of Multi-Modal Embeddings

Anita Lilla Verő

King’s College

This thesis is submitted for the degree of Doctor of Philosophy

November, 2021

Declaration

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. I further state that no substantial part of my thesis has already been submitted, or is being concurrently submitted, for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the Preface and specified in the text. It does not exceed the prescribed word limit for the relevant Degree Committee.

Anita Lilla Verő
November, 2021

Abstract

Vector Space Models of Distributional Semantics – or Embeddings – serve as useful statistical models of word meanings, which can be applied as proxies to learn about human concepts. One of their main benefits is that not only textual data, but a wide range of data types can be mapped to a space where they are comparable or can be fused together. Multi-modal semantics aims to enhance Embeddings with perceptual input, based on the assumption that the representation of meaning in humans is grounded in sensory experience.

Most multi-modal research focuses on downstream tasks involving direct visual input, such as Visual Question Answering. Fewer papers have exploited visual information for meaning representations when the evaluation tasks involve no direct visual input, such as semantic similarity. When such research has been undertaken, the results on the impact of visual information have often been inconsistent, due to the lack of comparison and the ambiguity of intrinsic evaluation. Does visual data bolster performance on non-visual tasks?
If it does, is this only because we add more data, or does it convey complementary quality information compared to a higher quantity of text? Can we achieve comparable performance using small data if it comes from the right data distribution? Is it the modality, the size or the distributional properties of the data that matters?

Evaluating on downstream or similarity-type tasks is a good start to compare models and data sources. However, if we want to resolve the ambiguity of intrinsic evaluations and the spurious correlations of downstream results, creating more transparent and human interpretable models is necessary.

This thesis proposes diverse studies to scrutinize the inner “cognitive models” of Embeddings, trained on various data sources and modalities. Our contribution is threefold. Firstly, we present comprehensive analyses of how various visual and linguistic models behave in semantic similarity and brain imaging evaluation tasks. We analyse the effect of various image sources on the performance of semantic models, as well as the impact of the quantity of images in visual and multi-modal models. Secondly, we introduce a new type of modality: a visually structured, text based semantic representation, lying in-between visual and linguistic modalities. We show that this type of embedding can serve as an efficient modality when combined with low resource text data. Thirdly, we propose and present proof-of-concept studies of a transparent, interpretable semantic space analysis framework.

Acknowledgements

I am especially thankful to my supervisors, Stephen Clark and Ann Copestake, who guided me on my path to the PhD at different stages and in different ways. I am immensely grateful to Steve for the opportunity of starting a PhD at Cambridge. I learned a lot from our discussions and enjoyed his openness to any out-of-the-box ideas. Ann helped me greatly with organising my work after a break I had to take in the middle of the programme.
She helped me clarify my thoughts with her insightful questions and motivated me to start planning and writing down ideas early. I feel I benefited greatly from their very different but equally supportive mentoring styles.

I owe special thanks to my collaborators Douwe Kiela, Luana Bulat, Ekaterina Shutova and Christopher Davis, whose intellect and creativity I was lucky to experience first hand.

I feel lucky to have a very supportive family, which helped me through difficult times during the course of this programme. My dad has always shown a great interest in whatever I was doing and often had insightful comments and questions about it too. My mom is always there for me when I have difficulties, which means the world.

My dear friend, Krisztián Gergely, provided invaluable support during the past years, for which I will always be grateful. I would like to thank my good friend, Jonathan Kanen, for his friendship and occasional English corrections. In the last few years I was lucky enough to enjoy the immeasurable support of József Konczer, who not only helped me with finding strength but has always been ready to discuss details of my work as well.

The past years would have been much less bearable without the deep conversations with my dear old friends Klára Békés and Fruzsina Balogh, and my close friends from Cambridge, Akemi Herraez Vossbrink, Paula Fayos Pérez, Eugenia Biral and Kaho Sato.

Finally, I would like to thank all my colleagues in the NLIP group and the visiting guests I had a chance to meet, with whom we had many enlightening and fun conversations in and outside the office.

Contents

1 Introduction
  1.1 Key Contributions
  1.2 Thesis Outline
  1.3 Publications
2 Background and Motivation for Interpretable Multi-Modal Word Embedding Analysis
  2.1 What does Word Meaning Mean, and Why should We Care?
    2.1.1 Philosophical Accounts
    2.1.2 (Cognitive) Linguistics and Neuroimaging
  2.2 Linguistic Embeddings: From Text to Meaning
    2.2.1 Distributional Semantics
    2.2.2 Shallow Networks
  2.3 Visual Embeddings: From Images to Meaning
    2.3.1 CNN Models
  2.4 Multi-modal Semantics
    2.4.1 Symbol Grounding
    2.4.2 Early-, Late- and Mid-fusion
    2.4.3 Multi-modal RNNs and Transformers
  2.5 Structured Embeddings: Motivation for a New Modality
  2.6 Generalisation of Embeddings: Proposed Framework and Formalism
    2.6.1 Embedding Modalities
  2.7 Modalities as Partial Observers of Meaning
    2.7.1 Background and Motivation for Model Transparency
    2.7.2 Transparency Testing and Efficient Multi-Modal Fusion
    2.7.3 “Cognitive Model” of Embeddings: How do Models Conceptualise?
    2.7.4 Information Theory Background
    2.7.5 Proposal for Measuring Independence of Embeddings
    2.7.6 A Utility Based Model of Embedding Independence
  2.8 Summary: Comprehensive and Interpretable Word Semantic Analysis
3 Methodology of Data Selection and Proposal for Interpretable Evaluation
  3.1 Training Data Matters
    3.1.1 Image Data
    3.1.2 Text Corpora
  3.2 From Intrinsic Evaluation to Interpretable Model Anatomy
    3.2.1 Behavioural Tasks
    3.2.2 Brain Imaging as Embedding Analysis
    3.2.3 How do Models Conceptualise? – Cluster Analysis
      3.2.3.1 Clustering Methods and Metrics
    3.2.4 Information Gain from Modalities
      3.2.4.1 Empirical Mutual Information Estimation
  3.3 Analysis Scheme
4 Impact of Visual Information in Semantics
  4.1 Comparing Visual Models and Data Sources for Semantics
    4.1.1 Evaluation
    4.1.2 Results
    4.1.3 Conclusion
  4.2 Visual Context in the Linguistic Domain
    4.2.1 Scene Graph Context
    4.2.2 Results
    4.2.3 Conclusion
  4.3 Modalities, Sources and Models: a Thorough Analysis
    4.3.1 Studied Embeddings
      4.3.1.1 Linguistic Embeddings
      4.3.1.2 Visual Embeddings
      4.3.1.3 Structured Embeddings
    4.3.2 Mid-fusion methods
    4.3.3 Evaluation Methods
      4.3.3.1 Concreteness
      4.3.3.2 Qualitative Analysis on Nouns of the Brain Datasets
    4.3.4 Results
      4.3.4.1 Correlations on the Behavioural Tasks
      4.3.4.2 Results on Brain Data
      4.3.4.3 Concreteness
      4.3.4.4 Qualitative Analysis
    4.3.5 Conclusion
  4.4 Model Initialization on a Textual Entailment Task
    4.4.1 Results
    4.4.2 Conclusion
  4.5 Conclusion
5 Effects of Data Size and Distribution
  5.1 Counting in the “Effort”
  5.2 Experiments
    5.2.1 Control for Data Quantity
    5.2.2 Control for Frequency Ranges
    5.2.3 Expected Results
    5.2.4 Results
  5.3 Conclusion
6 Informativeness of Semantic Spaces
  6.1 Hypotheses
  6.2 Qualitative Analysis of Semantic Spaces
    6.2.1 Cluster Structure Results
    6.2.2 Inspecting the Clusters
      6.2.2.1 Size Distribution and Visualisation
      6.2.2.2 Cluster Similarities
      6.2.2.3 Gamified Data Collection
    6.2.3 Supervised Visualisation
      6.2.3.1 Automatic Class Label Annotation
      6.2.3.2 Results
  6.3 Information Gain from Multi-modal Data
    6.3.1 Hyper Parameters and Dimensionality Reduction
    6.3.2 Results
  6.4 Dataset Distribution
  6.5 Conclusion
7 Summary and Conclusions
  7.1 Main Findings
  7.2 Conclusion and Future Work
Bibliography
A Cross-validated Semantic Relatedness and Similarity
B WordNet Concreteness
C EmbEval Toolkit
D Cluster Structure
E Mutual Information of Semantic Spaces
F Centroid Contexts

Chapter 1
Introduction

The anatomy of human language has long intrigued researchers. In the late twentieth century, Information Technology introduced new, ever-improving computational tools which opened a wide range of opportunities to perform empirical investigations on the written and spoken (recorded) realisations of language. This technology gave birth to new fields such as Computational Linguistics and Natural Language Processing (NLP). Data driven analysis of language provided another boost to NLP after the deep learning revolution (or renaissance) in the first half of the 2010s.

The motivations for creating computational models of language, however, vary considerably across communities. Probably the most dominant branch of research is driven by what we may call engineering incentives, and pursues the mission of creating systems that understand and generate language at a human level. This area has become even more prominent since Machine Learning (ML), and NLP in particular, has woven itself into a rapidly developing commercial market. ML and NLP have become ubiquitous in our everyday lives in domains ranging from criminal justice and public policy to healthcare and education [Kaur et al., 2020].
The other, less prominent direction concerns itself with employing technological tools in order to empirically test research hypotheses about language and cognition or social phenomena. Here, computational models are a means rather than an end: they can generate more knowledge using large scale statistical analysis. This area involves sub-fields which can be labelled as Computational Linguistics or Computational Sociology.

The two approaches can differ at the level of the applied models as well, which partially follows from the purpose of the investigation. Applied NLP involves more end-to-end models trained for tasks which are close to end-user applications, such as Question Answering or dialogue systems. More theoretical work often focuses on models which are more interpretable and evaluations which are more intrinsic, such as semantic similarity or predicting concept representations in the brain. Machine Learning practitioners cannot debug their models if they do not understand their behaviour [Kaur et al., 2020]. Thus, this type of analytic research can also serve as an important component of a system of checks and balances for commercial NLP.

The topic of this thesis is related to the aims of the latter area. We concentrate on word semantic models. Even though words primarily acquire their meaning within context and use, thinking in concepts and categories is a basic human strategy by which to operate [Bowker and Star, 2000]. Semantic models of words, and vector space models in particular, provide a compelling instrument for statistical analysis of concepts, realised in language. Therefore, investigations on lexical semantics can be useful for other interdisciplinary research, such as Computational Sociology.

Here, we are concerned with analysing the behaviour as well as the internal “cognitive model” of semantic representations with a focus on multi-modal input.
Symbol grounding [Harnad, 1990], or the hypothesis that human semantic representation depends on sensori-motor experience, has been given much attention in the past decades. Dual coding theory [Bucci, 1985], the idea in cognitive science that meaning might be represented in the human brain in multiple modalities, has inspired much research in NLP and Computational Linguistics.

Most multi-modal research focuses on engineering-type evaluation tasks (and therefore on models which perform well on them) which involve direct visual input, such as Visual Question Answering (VQA) [Antol et al., 2015, Srivastava and Salakhutdinov, 2012, Kiros et al., 2014, Socher et al., 2014, Tsai et al., 2019, Lu et al., 2019, Su et al., 2019, Majumdar et al., 2020]. These are usually referential tasks, for which the usefulness of visual input is not surprising. Moreover, evaluating solely on downstream tasks is prone to exhibit spurious correlations.

Unlike most studies, this work investigates the contribution of visual information to semantic meaning representations when the evaluation tasks involve no direct visual input. Instead of evaluating on referential tasks like VQA, we are interested in the impact of visual information on higher level word and concept representations. A minority of papers have exploited visual information for meaning representations when the evaluation tasks involve no direct visual input, such as semantic similarity [Bruni et al., 2014, Kiela and Bottou, 2014, Kiela et al., 2016, Lazaridou et al., 2015, Davis et al., 2019, Lin and Parikh, 2015, Vendrov et al., 2015]. There are three main issues in the literature which we address in this thesis.

Problems of Intrinsic Analyses. As a start, we focus on two types of intrinsic evaluation: human judgement based semantic tasks and brain activity prediction.
The type of evaluation the community uses has an effect on the model selection process, hence the questions we ask will influence the future direction of model development as well. Working on intrinsic evaluations, such as semantic similarity, can contribute both to basic research questions about linguistic phenomena and to developing higher quality end-user applications, by recognising potential pitfalls. However, due to the ambiguous notion of similarity and the low inter-annotator agreement, it is difficult to draw robust conclusions on the differences between models based solely on this type of evaluation [Batchkarov et al., 2016]. To overcome this problem, our first key contribution is a comprehensive analysis of multi-modal models. We perform large scale evaluations on different data sources, model architectures and modalities.

Efficiency of Models and Data. Most multi-modal models require huge image and text training datasets. Our second key contribution is the proposal and analysis of a new type of hybrid modality based on small, structured data, lying in-between visual and linguistic modalities.

Lack of Model Transparency. A further crucial issue with embeddings (and recent ML models in general) is that the learnt representations are not interpretable for humans. Thus, we are prone to overlook spurious correlations, or data and model biases [Kaur et al., 2020, Hooker, 2021, Bender et al., 2021]. To mitigate this problem, the third main proposal of this work is a framework of transparent and interpretable analyses of semantic space representations. Interpretability has gained traction in AI in the past few years, not just for downstream performance but also for AI Safety and Fairness reasons [Barocas et al., 2019, Bender et al., 2021, Kaur et al., 2020]. We introduce various quantitative and qualitative analyses to understand how our models conceptualise the “world”, which depends on model architecture, data source and modality.
To address the above problems, we propose, and present proof-of-concept studies of, a three-pillar analysis framework of multi-modal embeddings:

1. Black-Box Performance testing – How do representations of different modalities perform on intrinsic evaluation tasks? We extend previous work with the following:
   (a) Comprehensive analysis of models across data sources, machine learning models and modalities,
   (b) New modality based on small data, lying in-between low level visual information and high level linguistic / symbolic data, and
   (c) Efficiency analyses, controlling for data size, data distribution and model size.
2. Transparency testing – Qualitative / Quantitative structural analysis: How do representations of different modalities differ? An analysis of concept structures captured by modalities.
3. Transparency testing – Independence analysis: An information-theory based analysis to measure how much representations differ.

This thesis was inspired by a series of previous works, detailed in Chapter 2, where we introduce the background. To highlight a few influential ones: [Kiela et al., 2014] introduced enlightening analyses of multi-modal embeddings. They showcased how image dispersion affects multi-modal embedding performance, and how word concreteness is a relevant factor. Our methodology of structural embedding analysis was partially inspired by [Minnema and Herbelot, 2019], who used various metrics to measure the similarity between a linguistic embedding space and a brain image embedding space. Our theoretical semantic embedding framework generalises Katrin Erk’s definition of distributional models [Erk, 2016]. Our information-theoretical framework and experiments were supported by the work of Zoltán Szabó [Szabó, 2014], who kindly offered consulting on the theoretical background.
Understanding how machine learning models “understand” concepts is a crucial step towards managing model and data bias, which impacts billions of users who interact daily with AI models on social media platforms and in judicial or health care practice. We hope that our methodology for analysing model conceptualisation will inspire other researchers to release more interpretable model analyses, thereby contributing to safer and fairer AI system development.

1.1 Key Contributions

The contributions of this thesis can be summarised in three key points:

I. A comprehensive analysis of multi-modal models, involving visual and linguistic data, across data sources, model architectures and modalities.
II. Introduction and analysis of a new type of modality: a visually structured, text based semantic representation, lying in-between visual and linguistic modalities.
III. Proposing and presenting proof-of-concept studies of a transparent, interpretable semantic space analysis framework.

The course of this research and the design of the experiments were led by the pursuit of answers to the following questions:

1. How does the source of images affect the performance of multi-modal semantic representations?
2. Does the number of images have an impact on performance?
3. Do previous findings on complementary visual information scale to different types and sizes of linguistic corpora?
4. Does visual data bolster performance only because we add more data, or does it convey complementary quality information compared to a higher quantity of text?
   (a) Can we achieve comparable performance using small data if it comes from the right data distribution?
5. Can we move beyond performance evaluation? Are there any emergent concepts in embeddings? Can we quantify the difference between the concept structures of semantic spaces?
6. Can we quantify the difference between semantic spaces, based on the useful information they contribute to the meaning representation?

1.2 Thesis Outline

Chapter 2 gives an overview of the background and literature in Distributional Semantics, Computer Vision and multi-modal semantics, and also introduces our framework of transparency analysis. Details and discussion of the data sources and evaluation methodology are presented in Chapter 3. Chapters 4, 5 and 6 contain implementation details and results of experiments designed to answer the research questions from Section 1.1.

Chapters 4 and 5 implement our first and second key contributions: I. a comprehensive analysis of multi-modal models and II. the introduction and analysis of a new type of modality. The experiments in Chapter 4 focus on Questions 1, 2 and 3. Section 4.1 addresses Questions 1 and 2, evaluating different visual data sources for semantics in terms of the impact of image quantity and quality. Section 4.2 introduces a novel structured embedding as a new modality. In Section 4.3 a broader study is presented which, tackling Question 3, performs a wide range of evaluations across several different visual, linguistic and multi-modal models. As an outlook on applications of word embedding initialisation, we investigate a textual entailment task in Section 4.4. Chapter 5 provides a more in-depth investigation of the effects of data size and frequency distributions in linguistic and multi-modal embeddings (Questions 4 and 4a).

Finally, in Chapter 6 we implement the third key contribution of this thesis: III. a transparent, interpretable semantic space analysis. We address Question 5, where we employ qualitative structural analysis of semantic spaces, and Question 6, by presenting a method for estimating the information that different modalities add to the linguistic representations. A summary, conclusions and ideas for future directions based on this research are discussed in Chapter 7.
Appendices A, B, C, D, E and F contain extra results, which were omitted from the main text for space and readability considerations.

1.3 Publications

Content involving thesis material:

• Anita L. Verő and Ann Copestake. Efficient Multi-Modal Embeddings from Structured Data. arXiv preprint arXiv:2110.02577, 2021.
• Douwe Kiela, Anita L. Verő, and Stephen Clark. Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-16), 2016.

Thesis-related content:

• Christopher Davis, Luana Bulat, Anita L. Verő, and Ekaterina Shutova. Deconstructing multimodality: visual properties and visual context in human semantic processing. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 118–124, 2019.
• Christopher Davis, Luana Bulat, Anita L. Verő and Ekaterina Shutova. Modelling Visual Properties and Visual Context in Multimodal Semantics. In Workshop on Visually Grounded Interaction and Language, NIPS, Montreal, Canada, 2018.

Not directly thesis-related content:

• Douwe Kiela, Luana Bulat, Anita L. Verő and Stephen Clark. Virtual Embodiment: A Scalable Long-Term Strategy for Artificial Intelligence Research. In NIPS Workshop on Machine Intelligence (MAIN), Barcelona, Spain, 2016.

Software

• EmbEval: The implementation of the transparent evaluation methodology and the majority of the experiments are available as open source software¹. This code was used in Chapters 4, 5 and 6. Details on its usage can be found in the documentation².
• MMFeat - Flickr API: I implemented a Flickr API and some experiment and demo code in the MMFeat software³, which is used in Chapter 4.⁴
• Concept Game: A two-player, collaborative gamified data collection app⁵ (see Section 6.2.2.3). This code is also publicly available on GitHub⁶.

¹ https://github.com/anitavero/embeval
² https://anitavero.github.io/embeval/
³ https://github.com/douwekiela/mmfeat
⁴ https://github.com/anitavero/mmfeat/commits?author=anitavero
⁵ http://concept-guessing-game.com/
⁶ https://github.com/anitavero/concept_game

Chapter 2
Background and Motivation for Interpretable Multi-Modal Word Embedding Analysis

In this chapter we place the thesis into the context of previous work. We explain the motivation for our intrinsic and information-theory based analyses. Furthermore, we introduce the framework and notation used throughout the thesis.

2.1 What does Word Meaning Mean, and Why should We Care?

2.1.1 Philosophical Accounts

Traditionally, word semantics has been discussed in the framework of lexical competence. According to the externalist view, words have an objective meaning known by a “perfect competent speaker”; people, however, are imperfect speakers, hence the differences between our levels of understanding [Kripke, 1972, Putnam, 1970]. This has been criticised by many, including Chomsky in 2000 [Chomsky et al., 2000]. The most notable criticism came from the contextualist and pragmatic point of view. Similarly to Wittgenstein [Wittgenstein, 1953, p. 20], it identifies meaning with use, and highlights the contextual nature of word meanings [Grice, 1975, Searle, 1985].

To demonstrate the two opposing positions, take the following example sentence: “There is milk in the fridge”. According to the contextualists, in the context of morning breakfast it will be considered true if there is a carton of milk in the fridge and false if there is a patch of milk on a tray in the fridge, whereas in the context of cleaning up the kitchen the truth conditions are reversed [Gasparri and Marconi, 2021]. The externalist could object by challenging the contextualist’s intuitions about truth conditions. “There is milk in the fridge”, she could argue, is true if and only if there is a certain amount of milk in the fridge (a few molecules will do¹).
The contextualist’s reply is that, in fact, neither the speaker nor the interpreter is aware of such alleged literal content, if there is even such a thing.

A cognitive approach characterizes Marconi’s [Marconi, 1997] account of lexical semantic competence. In his view, lexical competence has two aspects: an inferential aspect, underlying performances such as semantically based inference and the command of synonymy, hyponymy and other semantic relations; and a referential aspect, which is in charge of performances such as naming (e.g., calling a horse “horse”) and application (e.g., answering the question “Are there any spoons in the drawer?”). According to his theory of individual competence, communication depends both on the uniformity of cognitive interactions with the external world and on communal norms concerning the use of language, together with speakers’ deferential attitude toward semantic authorities.

Recanati [Recanati, 2004] has extended the contextualised view by including the history of a word’s meaning. He says a word has a “semantic potential”, defined as the collection of past uses of the word, on the basis of which similarities can be established between source situations (i.e., the circumstances in which a speaker has used a word) and target situations (i.e., candidate occasions of application of the word).

¹ This example was given in [Gasparri and Marconi, 2021]; however, we would point out that there is no such thing as “milk molecules” [Lucey et al., 2017], which supports scepticism towards an extreme externalist approach.

2.1.2 (Cognitive) Linguistics and Neuroimaging

At the beginning of the 1970s a new cognitive theory of the mental representation of categories surfaced [Mervis and Rosch, 1981]. It put forward the notion of prototypes, which revolutionized the existing approaches to category concepts and was a leading force behind the birth of cognitive linguistics. Later a whole paradigm, called Simulationism, emerged with a body of evidence linking the mental realisation of concepts to sensory-motor activation. For example, listening to sentences that describe actions performed with the mouth, hand, or leg activates the visuomotor circuits [Tettamanti et al., 2005], and odor-related words (“jasmine”, “garlic”, “cinnamon”) differentially activate the primary olfactory cortex [González et al., 2006]. All this led to theories such as the dual coding hypothesis, which relates to the philosophical problem of symbol grounding, discussed in detail in Section 2.4.

Distributional Hypothesis. According to the summary of [Lenci, 2008], although the linguistic context appears as one of the ingredients of human conceptualization, the emphasis of cognitive semantics is on an intrinsically embodied conceptual representation of aspects of the world, grounded in action and perception systems. On the other hand, the Contextual Hypothesis in psychology, arguing for a “usage-based” characterization of semantic representations, incited linguistics towards statistical corpus analysis. According to Lenci, this view is related to Wittgenstein’s claim that “the meaning of a word is its use in the language”. This led to the Distributional Hypothesis (DH), according to which at least certain aspects of the meaning of lexical expressions depend on their distributional properties, so that the semantic similarity between two expressions reflects the similarity of the contexts in which they occur. Or as Firth [Firth, 1957] put it, “Words that occur in similar contexts tend to have similar meanings” [Turney, 2010].

There is increasing evidence for the “strong” version of DH, which does not only assume a correlation between semantic content and linguistic distributions. This version is a cognitive hypothesis stating that repeated encounters with words in different linguistic contexts eventually lead to the formation of a contextual representation.
That is an abstract characterization of the most significant contexts with which the word is used [Lenci, 2008]. Baroni and Lenci found important similarities between distributional models and human-generated properties, but also striking differences [Baroni and Lenci, 2008]. Statistical representations of word meaning have since become a prevalent approach, forming the basis of computational linguistics. Boleda [Boleda, 2020] summarised the reasons behind this in three factors. First, distributional representations are learnt from natural language data, scaling up to very large vocabularies, thus providing a coherent system where systematic explorations are possible. Second, recent models involve high dimensional representations. Third, they use continuous values and similarity metrics. Both of the latter allow for rich and nuanced information to be encoded and analysed.

Concepts, words and senses. In philosophy, there have historically been many different definitions of the term concept [Margolis and Laurence, 2021]. We use an empiricist, embodied definition, which treats concepts as internal human cognitive knowledge representations that probably involve multi-modal, sensory-based representation, as mentioned earlier. Words are elements of a language with meaning. However, human language is ambiguous, so many words can be interpreted in multiple ways depending on the context in which they occur. For instance, consider the following sentences (from [Navigli, 2009]):

(a) I calculated the interest rate.
(b) They have an interest in music.

The occurrences of the word interest in the two sentences clearly denote different meanings: financial earnings and passion, respectively. These different meanings of a word are called word senses, which are abstractions over word meanings [Lenci, 2008].
Neuroimaging. The development of neuroimaging techniques such as PET, fMRI and ERP has provided further means to adjudicate hypotheses about lexical semantic processes in the brain, which have been studied in relation to statistical semantic models, e.g. [Mitchell et al., 2008, Pereira et al., 2018, Handjaras et al., 2016]. Mitchell et al. found a correlation between distributional models of word meanings and brain imaging representations in human participants [Mitchell et al., 2008]. Handjaras et al. found that conceptual knowledge in the human brain relies on a distributed, modality-independent cortical representation that integrates the partial category and modality specific information retained at a regional level [Handjaras et al., 2016]. This thesis also complements standard semantic evaluations with tests on neuroimaging datasets, introduced in Section 3.2.2.

Introducing Model-Concepts. In this thesis, similarly to Lenci and Boleda, we treat distributional semantic models of word meaning as a proxy to empirically investigate “aggregated meanings”, which are not the semantic model of any particular individual (and most likely not even of a particular society). Since human concept representations seem at least partially perceptual, we focus on multi-modal distributional models involving visual perceptual data. We start from statistical models of word meaning, but we proceed towards more in-depth model interpretation analysis. We investigate whether there are structures in our learnt representations which represent some kind of conceptualisation by the machine. We call these model-concepts. Model-concepts are different from human cognition. They are also not directly word meaning representations, as we are looking for further emerging structures or clusters. Since we are studying the fusion of linguistic and perceptual data, the resulting model-concepts are assumed to be closer to human concepts than purely text-based ones.
Throughout the thesis we will use “concept” and “model-concept” interchangeably, as our investigation only involves model-concepts, not human conceptual representations. We introduce the history of Distributional Semantic models in more detail in Section 2.2, visual models from Computer Vision in Section 2.3 and the multi-modal literature in Section 2.4.

2.2 Linguistic Embeddings: From Text to Meaning

This section reviews the history of statistical models of word semantics based on text corpora.

2.2.1 Distributional Semantics

In Natural Language Processing, word meaning representation models have been primarily inspired by Firth’s distributional hypothesis [Firth, 1957], commonly summarised as “Words that occur in similar contexts tend to have similar meanings” [Turney, 2010]. Contemporary corpus-based approaches implement this idea by using vector representations of words, also known as distributional semantic models or embeddings. The representation vector of each word can be computed from its co-occurrence frequencies with other terms in the same context. Here, we give a short overview of the development of distributional semantic models; for a detailed survey, see Clark’s book chapter in The Handbook of Contemporary Semantic Theory [Clark, 2015] or a more recent overview of Distributional Models of Word Meaning by Lenci [Lenci, 2018].

The history of word representations by vectors goes back to Karen Spärck Jones’ 1967 work in Computational Linguistics, which first used a principled technique for comparing contexts [Spärck Jones, 1967]. Vector representation was widely popularised for the document retrieval problem in Information Retrieval [Schütze et al., 2008]. At the beginning, both the query and the documents were represented with a “bag of words”, i.e., a vector of word frequencies. This was a successful model despite the fact that it does not account for word order.
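As a small illustration of the bag-of-words representation just described, the following sketch builds frequency vectors for two toy documents. The function name `bag_of_words`, the vocabulary and the example sentences are our own illustrative choices, not taken from the cited work.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return a vector of raw word frequencies over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Toy vocabulary and documents (hypothetical examples).
vocabulary = ["bites", "dog", "man", "the"]
doc_a = bag_of_words("the dog bites the man", vocabulary)
doc_b = bag_of_words("the man bites the dog", vocabulary)

print(doc_a)           # [1, 1, 1, 2]
print(doc_a == doc_b)  # True: the two word orders are indistinguishable
```

The two sentences describe different events but receive the same vector, which is the loss of word order mentioned above.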
To circumvent bias towards frequent words, weighted versions have been introduced, such as term frequency-inverse document frequency (tf-idf), which combines the frequency of a term in a document with the inverse of the number of documents in which the term occurs. One useful way to think about document vectors is in terms of a term-document matrix. In this matrix, rows can correspond to document vectors, whereas columns are word representations. A popular method was to apply a dimensionality reduction technique, such as singular value decomposition (SVD), to such matrices. The application of SVD to the term-document matrix was introduced by Deerwester et al. [Deerwester et al., 1990], who called the method Latent Semantic Analysis (LSA). The name comes from the intuition that LSA teases out a latent meaning from the co-occurrence data by clustering words along a small number (typically a few hundred) of semantic, or topical, dimensions [Turney, 2010].

From the term-document matrix we can easily arrive at the concept of a term-term matrix. Instead of treating the document as the context in which similar words co-occur, we can narrow it down to a smaller window around a word. The elements of the matrix are then the frequencies with which two words occur in the same context window. A popular method for normalising raw frequencies is to use the Positive Pointwise Mutual Information (PPMI) of two words $(w_1, w_2)$:

$$\mathrm{PPMI}(w_1, w_2) = \max\left(\log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},\ 0\right) \tag{2.1}$$

Applying SVD can also be useful on these types of matrices.

Representing the meaning of multiple-word phrases or sentences still proves to be a challenging problem. Many researchers have studied compositional semantics using vector operations on word vectors [Mitchell and Lapata, 2010] or tensor-based representations [Clark, 2015].

2.2.2 Shallow Networks

Recent research has presented several neural network-based approaches to learning word vector representations. Such distributed representations have become known as embeddings.
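Before turning to the neural models, the count-based pipeline of Section 2.2.1 (window co-occurrence counts, PPMI weighting as in Equation 2.1, then SVD) can be sketched as follows. This is a minimal sketch under our own assumptions: the toy corpus, the window size, the choice of two retained dimensions and the helper names `cooccurrence_matrix` and `ppmi` are illustrative only.

```python
import numpy as np


def cooccurrence_matrix(tokens, vocabulary, window=2):
    """Count how often each word pair occurs within the same context window."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros((len(vocabulary), len(vocabulary)))
    for pos, word in enumerate(tokens):
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos != pos:
                counts[index[word], index[tokens[ctx_pos]]] += 1
    return counts


def ppmi(counts):
    """Positive PMI: max(log2(P(w1, w2) / (P(w1) P(w2))), 0)."""
    total = counts.sum()
    p_joint = counts / total
    p_row = counts.sum(axis=1, keepdims=True) / total
    p_col = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_row * p_col))
    pmi[~np.isfinite(pmi)] = 0.0  # unseen pairs carry no association
    return np.maximum(pmi, 0.0)


# Toy corpus (hypothetical example).
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocabulary = sorted(set(tokens))

weights = ppmi(cooccurrence_matrix(tokens, vocabulary))

# Truncated SVD: keep the two strongest dimensions as dense word vectors.
u, s, _ = np.linalg.svd(weights, full_matrices=False)
embeddings = u[:, :2] * s[:2]
print(embeddings.shape)
```

Each row of `embeddings` is then a low-dimensional vector for one vocabulary word; in practice the corpus is far larger and a few hundred dimensions are kept.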
The most well known and widely used models were introduced by Mikolov et al. [Mikolov et al., 2013a, Mikolov et al., 2013b] and have become popular as part of the word2vec toolkit. They introduced two models, both consisting of a shallow, two-layer neural network which learns an approximation of co-occurrence statistics [Levy and Goldberg, 2014b]. The network is trained to predict neighbouring words, and in doing so learns dense embeddings for the words. It is much faster than SVD and easy to train.

The skip-gram (SG) model [Mikolov et al., 2013b] learns to predict the words that can occur in the context of a target word. Its objective function is as follows:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) \tag{2.2}$$

where $T$ is the size of the corpus, $c$ is the context window size, and $w_i$ is a word ($1 \le i \le T$). Let $d$ be the embedding dimension and $V$ the vocabulary. The model learns two embeddings, or lookup matrices: 1) an input embedding $W \in \mathbb{R}^{d \times |V|}$, where column $i$ gives the $d \times 1$ embedding $v_i$ for word $w_i$ in the vocabulary; 2) an output embedding $W' \in \mathbb{R}^{|V| \times d}$, where row $i$ is the $1 \times d$ embedding $v'_i$ for word $w_i$ in $V$. $v_I$ and $v'_O$ are the “input” and “output” vector representations of the input word $w_I$ and the output word $w_O$, respectively. The probability of a word occurring in a context is given by the softmax function:

$$p(w_O \mid w_I) = \frac{\exp(v'_O \cdot v_I)}{\sum_{j=1}^{|V|} \exp(v'_j \cdot v_I)} \tag{2.3}$$

This architecture is illustrated in Figure 2.1.

Because the denominator sums over the whole vocabulary, training this model directly would be computationally infeasible. For this reason Mikolov et al. introduced the tricks of hierarchical softmax and negative sampling, the latter giving skip-gram with negative sampling (SGNS).

Figure 2.1: Skip-gram and CBOW architectures.²

Since we have two embeddings $v_j$ and $v'_j$ for each word $w_j$, we can either use just $v_j$, or sum or concatenate the two. If we multiply $WW'^{T}$ (with the word vectors arranged as the rows of both factors), we get a matrix $M$, each entry $m_{ij}$ corresponding to some association between input word $i$ and output word $j$.
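The softmax of Equation 2.3 and the association matrix $M$ can be sketched numerically as below. This is only an illustration under our own assumptions: the vectors are random rather than trained, the vocabulary has five words, and both lookup matrices store one word per row for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3

# Untrained toy parameters; rows index words (an assumption made for convenience).
W_in = rng.normal(size=(vocab_size, dim))   # input vectors v_i
W_out = rng.normal(size=(vocab_size, dim))  # output vectors v'_j


def context_probabilities(input_id):
    """p(w_O | w_I) for every candidate output word, as in Equation 2.3."""
    scores = W_out @ W_in[input_id]  # v'_j . v_I for all j
    scores -= scores.max()           # numerical stability; softmax is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # the sum runs over the whole vocabulary


probs = context_probabilities(2)

# Association matrix: m_ij = v_i . v'_j
M = W_in @ W_out.T
print(probs.sum(), M.shape)
```

Every call normalises over all $|V|$ output vectors, which is the cost that hierarchical softmax and negative sampling are designed to avoid.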
Levy and Goldberg [Levy and Goldberg, 2014b] show that skip-gram with negative sampling reaches its optimum just when this matrix is a shifted version of the PMI matrix:

$$WW'^{T} = M^{\mathrm{PMI}} - \log k \tag{2.4}$$

where $k$ is the number of negative samples. Thus, skip-gram is implicitly factorising a shifted version of the PMI matrix into the two embedding matrices.

In the other model of Mikolov et al., called Continuous Bag of Words (CBOW) [Mikolov et al., 2013a], training is similar, except that instead of predicting the context around a word in a window, the objective is to predict the middle word of the context window. The two model architectures are illustrated in Figure 2.1.

The Global Vectors model (GloVe) [Pennington et al., 2014] aims to learn a version of the PMI matrix which is weighted toward more frequent word-context pairs. Its authors theorise that, because their model can be optimised directly, as opposed to the on-line training of SGNS, it introduces more global frequency information. However, Levy et al. showed that, after tuning hyperparameters, it does not produce any performance gain [Levy et al., 2015].

Other versions of skip-gram have been proposed, such as a dependency-based word embedding [Levy and Goldberg, 2014a], where instead of using a simple sliding window, the context of each word is derived from its neighbourhood in the dependency graph.

² https://web.stanford.edu/~jurafsky/li15/lec3.vector.pdf

Deep Recurrent Neural Networks [Bengio et al., 2003, Bahdanau et al., 2015, Cho et al., 2014, Kiros et al., 2015, Wang and Jiang, 2015, Rocktäschel et al., 2016] and Transformers with self-attention [Peters et al., 2018, Radford et al., 2018, Devlin et al., 2019, Yang et al., 2019] have come to the forefront of NLP research in the past few years. They achieve state-of-the-art performance on various sentence-level tasks included in the GLUE multi-task benchmark for Natural Language Understanding [Wang et al., 2018a]. The tasks involve textual entailment, sentiment analysis, paraphrasing and question answering.
Since the main objectives of this thesis were creating and testing a framework for comprehensive, transparent and interpretable semantic analysis, we use the smallest possible models which allow us to incorporate visual embeddings, and thus to study multi-modality. Therefore, in this work we apply shallow network type models: they are the simplest neural models, and visual embeddings fit into them more easily than into count-based models. Due to their few parameters, these models are also much easier to train than bigger neural models, allowing us to run comprehensive studies across several datasets and model types. Throughout this work we use SGNS and FastText, which uses the CBOW model extended with subword information [Mikolov et al., 2018]. Furthermore, we use different versions of PMI in Section 6.4 for analysing our training corpora. Applying our framework to the latest transformer-type models would be a straightforward extension of this thesis. Although running broad-scale analysis is much more challenging with these large models, it would be interesting to see how attention affects multi-modal fusion.

2.3 Visual Embeddings: From Images to Meaning

Our research focuses on the most efficient fusion of vision and language for meaning representations. Thus we review the basics of Computer Vision approaches for encoding images, as well as, in Section 2.3.1, the state-of-the-art models which we rely on.

Similar to language embeddings, representing the content of an image or a video also involves producing a vector representation. This is expected to capture a compressed representation of interesting features over the high dimensional, raw pixel input that correspond to human semantic constructs. These can include low level features such as edges and corners, or higher level ones such as the objects in an image or temporal patterns in a video. The selection of these features, however, is not a trivial task.
Traditional Computer Vision methods applied hand-crafted features similar to the above mentioned edge and corner detectors from which they could build a Bag-of-words type model [Sivic and Zisserman, 2003].

Neural Networks revolutionized this area as well with the introduction of Convolutional Neural Networks (CNNs). These are biologically inspired networks motivated by the visual cortex [LeCun et al., 1998]. They are capable of learning high level features gradually by exploiting a deep structure where every layer learns a higher abstraction based on the lower ones. Such networks can be trained for many different tasks such as object classification [Simonyan and Zisserman, 2014, Krizhevsky et al., 2012, Szegedy et al., 2015, He et al., 2016], image segmentation [Kendall et al., 2017] or action recognition [Sharma et al., 2015]. The learned vectors proved to be a good basis for learning high performing image embeddings [Kiela and Bottou, 2014].

The core building block of such networks is the convolutional layer. This refers to the mathematical convolution of a filter function across the pixels of an image. In traditional Computer Vision this filter function (or kernel) was crafted manually, whereas in a CNN it is learned from data. Down-sampling and learning compressed local (globally invariant) features is done by the pooling layers. CNNs usually involve fully connected layers on the top and activation functions similar to other neural networks. They are usually trained with an objective for a supervised task, such as object classification.

Figure 2.2 illustrates the architecture of LeNet [LeCun et al., 1989], the first CNN successfully trained by back-propagation to classify hand-written digits. It performed better than manual coefficient design, and was suited to a broader range of image recognition problems. Thus, it became the foundation of modern Computer Vision.

Figure 2.2: Architecture of the LeNet-5 for digit recognition. Each plane is a feature map, i.e.
a set of units whose weights are constrained to be identical.

2.3.1 CNN Models

In our study, CNN models serve the role of encoding images into visual word semantic vectors. We used four architectures which differ in size and structure. See Table 2.1 for an overview.

AlexNet. The network by Krizhevsky et al. [Krizhevsky et al., 2012] introduces the following architecture: first, there are five convolutional layers, followed by two fully-connected layers, where the final layer is fed into a softmax which produces a distribution over the class labels. All layers apply rectified linear units (ReLUs) [Nair and Hinton, 2010] and use dropout for regularization [Hinton et al., 2012]. This network won the ILSVRC 2012 ImageNet classification challenge.

GoogLeNet. The ILSVRC 2014 challenge winning GoogLeNet [Szegedy et al., 2015] uses “inception modules” as a network-in-network method [Lin et al., 2013] for enhancing model discriminability for local patches within the receptive field. It uses much smaller receptive fields and explicitly focuses on efficiency: while it is much deeper than AlexNet, it has fewer parameters. Its architecture consists of two convolutional layers, followed by inception layers that culminate in an average pooling layer that feeds into the softmax decision. That is, it has no fully connected layers. Dropout is only applied on the final layer. All connections use rectified units.

                         AlexNet   GoogLeNet        VGGNet   ResNet
ILSVRC winner            2012      2014             2014     2015
#Layers                  7         22               19       152
#Parameters (million)    ~60       ~6.7             ~144     ~60
Receptive field size     11×11     1×1, 3×3, 5×5    3×3      3×3
Fully connected layers   Yes       No               Yes      Yes

Table 2.1: Network architectures. Layer counts only include layers with parameters.

VGGNet. VGGNet [Simonyan and Zisserman, 2014] won the localisation task and was the runner-up in the classification task of the ILSVRC 2014 challenge. Like GoogLeNet, it is much deeper than AlexNet and uses smaller receptive fields. It has many more parameters than the other networks.
It consists of a series of convolutional layers followed by the fully connected ones. All layers are rectified and dropout is applied to the first two fully connected layers.

ResNet. ResNet [He et al., 2016] revolutionized the CNN architectural race by introducing the concept of residual learning in CNNs and devising an efficient methodology for training deep nets. He et al. proposed a 152-layer deep CNN, which won the ILSVRC 2015 competition. ResNet, which was 20 and 8 times deeper than AlexNet and VGG respectively, showed lower computational complexity than previously proposed nets. They empirically showed that ResNets with 50/101/152 layers have lower error on the image classification task than a 34-layer plain net.

These networks were selected because they are very well-known in the Computer Vision community. They exhibit interesting qualitative differences in terms of their depth (i.e., the number of layers), the number of parameters, regularization methods and the use of fully connected layers. They have all been winning network architectures in the ILSVRC ImageNet challenges.³

³ https://image-net.org/challenges/LSVRC/

2.4 Multi-modal Semantics

2.4.1 Symbol Grounding

Despite their undeniable success, textual embeddings have their own limitations regarding the grounding of meaning in the outside world, often referred to as Harnad’s symbol grounding problem [Harnad, 1990]. Similarly, Computer Vision research has reached a point where leveraging non-visual common sense knowledge is necessary for further improvement, even on purely vision-based applications. Grounding is motivated by an insight from cognitive science (Section 2.1.2): the human semantic representation of symbols (e.g., words or objects) is based on multi-modal sensory inputs perceived on a lifelong basis [Roy, 2005].

When it comes to applications and models, the question arises: What do we mean by grounding in practice? In what way can multi-modal data contribute to meaning representations?
We can distinguish between two main approaches to grounding. Referential grounding refers to the task of determining the referent that a word denotes in the context of the other modality (e.g., a specific object in an image). The core issue here is finding a mapping between the two spaces [Lazaridou et al., 2016]. In contrast, representational grounding addresses the problem of multi-modal semantics: representing the grounded meaning of a word in the sense of fusing different modalities into one, richer semantic representation [Bruni et al., 2014]. While all these results are promising, some fundamental questions are still unexplored.

Non-Visual Tasks. Most work focuses on evaluation tasks (and therefore on models which perform well on them) which involve direct visual input. These are usually referential type tasks such as Visual Question Answering (VQA) [Srivastava and Salakhutdinov, 2012, Kiros et al., 2014, Socher et al., 2014, Tsai et al., 2019, Lu et al., 2019, Su et al., 2019, Majumdar et al., 2020]. In these cases the usefulness of visual input is not surprising. Fewer papers have exploited visual information for representational grounding, where the evaluation tasks involve no direct visual input, such as semantic similarity [Bruni et al., 2014, Kiela and Bottou, 2014, Kiela et al., 2016, Lazaridou et al., 2015, Davis et al., 2019, Lin and Parikh, 2015, Vendrov et al., 2015]. Lin and Parikh [Lin and Parikh, 2015] introduced a fill-in-the-blank task, which, however, uses abstract images. A further interesting proposal relates to the so-called order-embeddings, a general hierarchical framework for hypernymy, textual entailment, and image captioning [Vendrov et al., 2015]. However, it still does not involve a thorough investigation of multi-modal fusion possibilities. Some papers, including [Kiela and Bottou, 2014, Kiela et al., 2016, Lazaridou et al., 2015, Davis et al., 2019], perform intrinsic analysis of multi-modal embeddings.
However, the reasons for the impact of visual information are not well understood, since we see only correlations on intrinsic evaluation tasks.

This work investigates visual information’s contribution to meaning representations on evaluation tasks involving no direct visual input. We aim to showcase a proof-of-concept framework for deeper analysis of unsupervised multi-modal representations. We study the concepts which emerge in grounded meaning representations.

Cost of Data. All the mentioned tasks require huge image datasets with expensive human annotation. In the case of multi-modal tasks these annotations are even more difficult to acquire, since annotating combinations of texts and images/videos can be even more complicated and time consuming than in the uni-modal cases. We try to circumvent the problem of cost by studying model and data size efficiency (introduced in Section 2.7.2) as well as alternatives for new modalities based on small data (Section 2.5).

2.4.2 Early-, Late- and Mid-fusion

In the literature, we can find three ways of performing the fusion of textual and perceptual information:

• In early fusion, one learns a joint representation from the two spaces, then computes a function for the specific task (e.g., cosine similarity for measuring semantic relatedness) [Lazaridou et al., 2015, Kottur et al., 2015].
• Mid-fusion techniques learn separate representations for each modality, then combine them into a multi-modal representation, and finally compute the function for the task [Kiela et al., 2014].
• Late fusion methods also learn uni-modal representations separately, then compute a function for each modality individually, and combine the function outputs at the end [Silberer and Lapata, 2014].

Figure 2.3 illustrates the three types of fusion techniques. In this work we focus on mid-fusion based models, since they allow us to study the information preserved in the individual modalities.
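The difference between mid- and late fusion can be sketched in a few lines. The sketch below is illustrative only and rests on our own assumptions: the embeddings are random toy matrices, the mid-fusion combination is a concatenation of L2-normalised vectors, the task function is cosine similarity, and the late-fusion combiner is an unweighted mean. Early fusion is omitted because it requires training a joint model.

```python
import numpy as np


def l2_normalise(matrix):
    """Scale every row to unit length (rows of zeros are left unchanged)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms == 0, 1.0, norms)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def mid_fusion(e_text, e_image):
    """Combine separately learnt embeddings first; the task function comes after."""
    return np.hstack([l2_normalise(e_text), l2_normalise(e_image)])


def late_fusion_similarity(e_text, e_image, i, j):
    """Apply the task function per modality, then combine the outputs."""
    return 0.5 * (cosine(e_text[i], e_text[j]) + cosine(e_image[i], e_image[j]))


rng = np.random.default_rng(0)
e_text = rng.normal(size=(4, 5))   # toy linguistic embedding: 4 words, 5 dimensions
e_image = rng.normal(size=(4, 3))  # toy visual embedding: same 4 words, 3 dimensions

fused = mid_fusion(e_text, e_image)
print(fused.shape)
print(cosine(fused[0], fused[1]), late_fusion_similarity(e_text, e_image, 0, 1))
```

With these particular choices the two scores coincide, because the cosine of two concatenated unit vectors is the mean of the per-modality cosines. The approaches diverge as soon as the modalities are weighted differently or the combination is learnt; the practical appeal of mid-fusion is that the uni-modal matrices remain available for inspection.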
Figure 2.3: Fusion methods for combining textual and perceptual information. V and W are representations learnt from either Text or Images. f is a function that fuses two representations in Early and Middle fusion. In Late fusion f combines the outputs of functions g which embed uni-modal data. (Figure is borrowed from the “Multimodal Learning and Reasoning” ACL 2016 tutorial.⁴)

⁴ http://multimodalnlp.github.io/mlr_tutorial.pdf

2.4.3 Multi-modal RNNs and Transformers

Neural networks and recurrent networks have been used on multi-modal input since they became popular, even going back to Boltzmann machines [Srivastava and Salakhutdinov, 2012, Kiros et al., 2014]. They were mainly tested on image retrieval and caption generation tasks. Architectures such as Tree RNNs have also been applied to cross-modal tasks [Socher et al., 2014].

The latest NLP models have also inspired the creation of new multi-modal representations. Tsai et al. [Tsai et al., 2019] developed a multi-modal Transformer model using cross-modal attention and tested it on sentiment analysis tasks in videos. Lu et al. [Lu et al., 2019] created ViLBERT, a multi-modal model based on BERT. They pre-trained it on the Conceptual Captions dataset and then transferred it to multiple vision-and-language tasks: visual question answering, visual common-sense reasoning, referring expressions, and caption-based image retrieval.

2.5 Structured Embeddings: Motivation for a New Modality

The multi-modal framework we introduce in this thesis can be applied to any modality (such as text, image, video, audio). In the experimental part of this work we focus on fusing linguistic and visual information. As we saw in the previous section, much research has exploited large visual datasets and CNN models with an increasingly large number of parameters. This is a fairly expensive way of injecting visual information into meaning representations.
The second key contribution of this thesis is a thorough exploration of a structured visual dataset, called Visual Genome [Krishna et al., 2016], and of the way it can enrich meaning representations. Visual Genome contains images with bounding box annotations as well as text annotations in a graph structure (it is detailed, along with all the other datasets we use, in Section 3.1). Using such data is beneficial for two reasons. First, structured data can serve as a bridge over the semantic gap between low level image data and high level symbolic information in text. Second, it can provide a small-data alternative to big-data driven models, which could become the basis of essential tools in situations where a huge amount of text is not available, but where more structured data could be easier to collect.

By exploiting this textual dataset based on a visual structure, this work introduces a new type of embedding, which we consider a new, hybrid modality. In the next section we introduce our general framework of modalities. The new embedding modality, called Structured Embeddings, will be introduced in Section 2.6.1. The details of its creation are explained in Section 4.2.

2.6 Generalisation of Embeddings: Proposed Framework and Formalism

In this work we use a general notion of Embedding, which refers to a vector space representation of word meanings. The weights of each vector, however, can be set by any machine learning algorithm, trained on any data type, such as text, images, sound, structured datasets etc. The only criterion for calling a vector space a word embedding space is that we find an interpretation of the dataset where it represents words.

We formally define Semantic Embedding models as tuples of their relevant parameters. We generalise Katrin Erk’s definition of distributional models [Erk, 2016] to include word representations based on other modalities as well.
We denote modality by $m \in \{L, V, S\}$, which can take the value linguistic $L$, visual $V$ or structural $S$. The parameters of a semantic embedding model of modality $m$ are the following: a set $T$ of target words that receive vector representations, a set $O_m$ of observable context items in a dataset $D_m$, an extraction function $X_m$ which chooses relevant contexts in which to look for context items, and a mapping function $A_m$, which maps from target and context items to a $d_m$ dimensional space $\mathbb{R}^{d_m}$. The mapping for all target elements is represented by an Embedding matrix $E_m \in \mathbb{R}^{|T| \times d_m}$.

$T$ is an arbitrary set of words and $D_m$ is a set of data items. $D_m$ includes target representations $r \in D_m$ which stand in a relation $r \sim t$ to target elements $t \in T$. $O_m$ is all the potential target contexts in the dataset:

$$O_m : T \to \mathcal{P}(D_m), \qquad O_m(t) = \{\, U \subset D_m \mid \exists\, r \in U,\ r \sim t \,\},$$

where $\mathcal{P}$ is the power set.

The extraction function $X_m$ returns “relevant” context items from $O_m$ for each target element of $T$; that is, it returns a mapping from target/context item pairs to numbers in $\mathbb{N}$, representing a relevance score of context pairs:

$$X_m : T \to (O_m(T) \to \mathbb{N}).$$

We use “relevance” here in a fairly general sense: it can, for example, be co-occurrence counts within a text window, image search engine result relevance, or scores based on other prior assumptions about relevancy in the dataset, such as graph neighbourhood, which we will exploit for structured data.

The mapping function $A_m$ is a combination of a (usually machine learning) algorithm and any further pre- and post-processing methods, which together take the output of $X_m$ and turn it into a mapping from targets to real-valued vectors:

$$A_m : (T, D_m, X_m) \to (T \to \mathbb{R}^{d_m}).$$

The output mapping is represented by a matrix, called an Embedding $E_m \in \mathbb{R}^{|T| \times d_m}$, which is a vector space consisting of vector representations for each target word in $T$.
In summary, we define Semantic Embedding models for a modality $m$ as tuples comprising the set of target elements, the observable context items, the dataset, the extraction function, the mapping function, and the embedding dimensionality:

$$\mathcal{S}_m = \langle T, O_m, D_m, X_m, A_m, d_m \rangle \tag{2.5}$$

The output of the model is the learnt embedding $E_m$.

For example, a Google Image based Semantic Embedding model would have the following parameters:

$$\mathcal{S}_G = \langle T, O_G, D_G, X_G, A_G, d_G \rangle \tag{2.6}$$

where $T$ is our target vocabulary and $D_G$ is the dataset consisting of words from $T$ and Google Image Search results for each $t \in T$. $O_G$ is all the potential subsets of image results for a given word $t$ in Google Image Search; for example, we can use any number of images from the search results. The extraction function $X_G$ selects which contexts we choose, e.g., it selects the first 10 image results in Google Image Search’s relevance order. $A_G$ will include a CNN which maps each image to a vector representation, plus an aggregation function which creates one image vector representation for each word $t$. In this case, $d_G$ will be the dimensionality of the last layer of the CNN which we use as the image representation. Thus, it will be the dimensionality of our learnt Google Image Embedding $E_G$.

Note that in general, $A_m$ is a very broad notion. It can involve any learning algorithm. If our training data is text, for example, it can involve any traditional count-based method, shallow or deep neural network, or any other type of method which maps targets from a dataset, with an extraction function to choose relevant contexts, to a vector representation. In the next section we will introduce the three types of Semantic Embedding models which we study in this thesis.
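To make the tuple of Equation 2.5 concrete, the following sketch encodes a Semantic Embedding model as a small Python data class and instantiates it for an image-based example. It is a simplified illustration under our own assumptions: the class and field names are hypothetical, the images are replaced by random stand-in feature vectors (no CNN is run), $O_m$ is left implicit as the possible subsets of each word's items, and the extraction function returns the selected items directly rather than relevance scores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

import numpy as np

Item = np.ndarray  # one data item r in D_m, here an already encoded image


@dataclass
class SemanticEmbeddingModel:
    """Container mirroring <T, O_m, D_m, X_m, A_m, d_m> (field names are ours)."""

    targets: Sequence[str]                       # T
    dataset: Dict[str, List[Item]]               # D_m, grouped by the relation r ~ t
    extract: Callable[[List[Item]], List[Item]]  # X_m: picks one context from O_m(t)
    mapping: Callable[[List[Item]], np.ndarray]  # A_m: context -> vector in R^{d_m}
    dim: int                                     # d_m

    def embed(self) -> np.ndarray:
        """Return E_m with one row per target word."""
        rows = [self.mapping(self.extract(self.dataset[t])) for t in self.targets]
        embedding = np.vstack(rows)
        if embedding.shape != (len(self.targets), self.dim):
            raise ValueError("mapping output does not match d_m")
        return embedding


rng = np.random.default_rng(0)
targets = ["elephant", "spoon"]
# Stand-in for CNN features of 20 ranked search results per word (random toy data).
dataset = {t: [rng.normal(size=4) for _ in range(20)] for t in targets}

model = SemanticEmbeddingModel(
    targets=targets,
    dataset=dataset,
    extract=lambda items: items[:10],              # first 10 results in rank order
    mapping=lambda items: np.mean(items, axis=0),  # aggregate to one vector per word
    dim=4,
)
E_G = model.embed()
print(E_G.shape)  # (2, 4)
```

Swapping the `extract` and `mapping` callables is all that is needed to describe a different modality with the same target set, which is the point of the tuple formulation.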
2.6.1 Embedding Modalities

In this work we distinguish between three different types of embedding, one for each modality $m \in \{L, V, S\}$, which are produced by three classes of semantic embedding models varying in all parameters but $T$:

Linguistic Embeddings $E_L \in \mathbb{R}^{|T| \times d_L}$ are vector spaces which are learnt by an algorithm $A_L$ trained on large text data $D_L$. The learning algorithm can be any of the standard shallow neural models which approximate co-occurrence statistics of words, such as SGNS, CBOW or FastText. $X_L$ corresponds to co-occurrence counts for target/context word pairs within a context window around target words.

Visual Embeddings $E_V \in \mathbb{R}^{|T| \times d_V}$ consist of vectors which have been trained on images $D_V$, which are associated with words by $X_V$ (e.g. images labelled with words). In this case the learning algorithm is typically a CNN (see Section 2.3), which has a specified architecture for learning abstract patterns from image data. However, after mapping images to a vector space, we need a method which associates one vector with each word. In our case we usually have multiple image results for a word, hence this method has to be a vector aggregation, such as the element-wise maximum, mean or median (discussed in Sections 4.1 and 4.3). The learning algorithm and the aggregation method together constitute $A_V$.

Structured Embeddings $E_S \in \mathbb{R}^{|T| \times d_S}$ are the result of an $X_S$ which extracts relevant contexts from data $D_S$ which has a more developed structure than raw text or images on the internet. These datasets usually involve some manual design and labour for the collection, therefore they are much smaller in terms of the storage they occupy in bytes. One example is the Visual Genome Scene Graph annotations (introduced in Section 3.1.2), which we study in detail in Chapters 4, 5 and 6. $A_S$ is a similar algorithm to $A_L$, trained on the co-occurrence statistics of the extracted pairs.
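The aggregation step mentioned for Visual Embeddings above can be sketched as follows. The three image vectors are made-up toy values standing in for CNN features of three images of the same word, and the dictionary name `AGGREGATORS` is our own.

```python
import numpy as np

# Element-wise aggregators that turn several image vectors into one word vector.
AGGREGATORS = {
    "max": lambda vectors: np.max(vectors, axis=0),
    "mean": lambda vectors: np.mean(vectors, axis=0),
    "median": lambda vectors: np.median(vectors, axis=0),
}

# Toy stand-ins for the CNN features of three images of one word.
image_vectors = np.array([
    [0.0, 2.0, 1.0],
    [4.0, 0.0, 1.0],
    [2.0, 1.0, 7.0],
])

word_vectors = {name: fn(image_vectors) for name, fn in AGGREGATORS.items()}
for name, vector in word_vectors.items():
    print(name, vector)
# max    [4. 2. 7.]
# mean   [2. 1. 3.]
# median [2. 1. 1.]
```

The third dimension shows how the choices differ: one outlying image dominates the maximum, shifts the mean, and leaves the median untouched.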
$d_L$ and $d_S$ depend on the output size of the shallow network model in use, and usually equal 300. $d_V$ is the size of the last layer of a CNN.

The above embedding types can be combined using one of the three fusion techniques (Section 2.4.2). Throughout this thesis we use mid-fusion, as it allows us to examine the information coming from each embedding more easily. We denote multi-modal embeddings by $E_{m_1} + E_{m_2}$, where $m_1, m_2 \in \{L, V, S\}$ and $m_1 \neq m_2$.

2.7 Modalities as Partial Observers of Meaning

The ancient Indian parable of the Blind men and an elephant tells the story of a group of blind men who have never come across an elephant before and who learn and conceptualise what the elephant is like by touching it. Their observations go as follows in James Baldwin's English version⁵:

...The first one happened to put his hand on the elephant's side. “Well, well!” he said, “now I know all about this beast. He is exactly like a wall.” The second felt only of the elephant's tusk. “My brother,” he said, “you are mistaken. He is not at all like a wall. He is round and smooth and sharp. He is more like a spear than anything else.” The third happened to take hold of the elephant's trunk. “Both of you are wrong,” he said. “Anybody who knows anything can see that this elephant is like a snake.”...

Another, whose hand was upon its leg, said that the elephant is a pillar, like a tree. To the fifth, whose hand reached its ear, it seemed like a kind of fan. The last one, who felt its tail, described it as a rope.

Will they be able to combine their observations into one description which is more accurate than any of their individual ones? Or will they just disagree and become more confused than they had been? If the blind men were touching different objects, or were in completely different universes, they would probably struggle to reach an agreement.
Since, however, they are feeling the same animal, they do have a common ground, which is at first hidden from them, but which they have a chance to comprehend better together through collaboration. It only makes sense to collaborate if none of them is already an elephant expert, or is talking about a completely irrelevant or random subject. Similarly, our Semantic Embedding models have a chance to combine their knowledge, provided the combination is done properly. Figure 2.4 presents an illustration of our multi-modal framework, with one target concept of the elephant and two Semantic Embedding models with different perspectives.⁶

⁵ https://americanliterature.com/author/james-baldwin/short-story/the-blind-men-and-the-elephant

⁶ Icons made by Good Ware (https://www.flaticon.com/authors/good-ware) from www.flaticon.com. Photo of an Indian elephant is from Wikipedia (http://web.archive.org/web/20210907113830/https://de.wikipedia.org/wiki/Datei:Elephas_maximus_%28Bandipur%29.jpg), elephant drawing is from http://web.archive.org/web/20210907105456/https://www.drawingtutorials101.com/how-to-draw-an-indian-elephant.

Figure 2.4: Modalities and the elephant. Illustration of the Semantic Embedding models for different modalities, which include different perspectives. Data $D$ includes the target concept $T$ of the elephant plus the observable contexts $O_{m_1}$, $O_{m_2}$, which are the trunk and a tusk. Each of the two Semantic Embedding models $S_{m_1}$, $S_{m_2}$ receives the data from its own perspective: $D_{m_1} = (T, O_{m_1})$ and $D_{m_2} = (T, O_{m_2})$ respectively.

Analogously to the imperfect lexical competence framework mentioned in Section 2.1.1, we treat modalities as partial observers of meaning. Like the men above, we assume that they have different perspectives on the same object. This object in our case is word meaning, or rather an aggregated statistical representation of words at a specific point in time (described in Section 2.1.2).

Using the notation introduced before, let us say we have Semantic Embedding models $[S_{m_1}, \ldots, S_{m_M}]$ of $M$ different modalities. We assume:

1. Common ground: Each of them captures some aspect of word meanings. That is, we assume that the vector weights of none of the learnt embeddings $[E_{m_1}, \ldots, E_{m_M}]$ are random.
2. Perspectives: They do not share the same knowledge; they represent different perspectives.
3. Imperfect knowledge: None of them has perfect knowledge: none of the Semantic Embedding models is an oracle which represents the ground truth.

In some versions of the parable the men get into a disagreement (or a fight of varying degrees of violence, depending on the version); in others they learn that they were all partially correct and partially wrong. In the following, we search for the best way to ensure that our models of different modalities can collaborate in the most effective way.

2.7.1 Background and Motivation for Model Transparency

From the existing multi-modal literature we know that textual and visual modalities can collaborate and improve performance in various cases when combined. Most work, however, evaluates solely on intrinsic tasks, such as semantic similarity, or on downstream tasks, such as Visual Question Answering (VQA). It has been shown by many researchers that this traditional way of evaluating models in Machine Learning is prone to various flaws, which can be fatally misleading for the field. Kuhnle [Kuhnle, 2020, Chapter 2] gave a comprehensive discussion of these problems. Building on this, we summarise the issues in the following categories:

Black-Box Model Performance. Since the recent deep learning revolution, ML evaluation has appeared to be solely concerned with beating benchmarks on downstream tasks, while the models are often treated as black boxes. This has often led to models which learn “weird behaviour”.
For example, vision models may rely on the image background to recognise an object [Ponce et al., 2006], deep CNNs have blind spots [Zhang et al., 2018], and neural models mistranslate low-frequency words into context-fitting but content-changing alternatives [Arthur et al., 2016]. Good evaluation performance on one task often does not transfer to downstream tasks either. In Section 4.4 we also present our own finding that a deep LSTM with randomly initialised input word vectors performs on par with one using pretrained word embeddings as input on a Textual Entailment task (SNLI). Zhang and Bowman found the related phenomenon of high performing randomly initialised LSTM models [Zhang and Bowman, 2018]. This is in line with current findings concerning the recent transformer type models, which have been shown to be far from solving general tasks (e.g., document question answering). Rather, these models overfit to the quirks of particular datasets [Yogatama et al., 2019]. This all leads us to conclude that looking only at performance improvements between models is mostly meaningless without further analysis.

Dataset Bias. Data in the context of ML is supposed to convey patterns which are characteristic of a certain task. Kuhnle defines dataset bias as coincidental systematic artefacts in the data which are not characteristic of the task in question. Because these artefacts are incidental, using such datasets as training data can result in unintended behaviour. For instance, Wang et al. [Wang et al., 2018b] found that image captioning models for MS-COCO [Lin et al., 2014] can learn to produce reasonable captions merely by knowing about the objects in an image, while ignoring, for instance, their location and relation. On VQA tasks, modality bias has been shown, which refers to the systematic tendency that one modality suffices to infer the correct output with high confidence.
Multiple examples have been reported, such as a language-only model which completely ignores the image but can answer almost half of the questions correctly [Zhang et al., 2016]. Agrawal et al. [Agrawal et al., 2016] observed how seemingly well-performing models jump to conclusions after only the first few question words, and concluded that they fail at complete question and image understanding. Although Kuhnle does not include ethical bias in his definition, we think it could fit into it if we include ethical goals in our task definition. The field of AI fairness is shifting towards concentrating on harms rather than bias in the political sense [Barocas et al., 2019, p. 136-143]; however, once mitigating harm is included in our task objective, we can use Kuhnle's data bias definition. There is a line of research on cultural stereotypes reflected in word embeddings [Barocas et al., 2019, p. 141]. Even though word embeddings per se do not correspond to any linguistic or decision-making task, analysing them before incorporating them into applications is a crucial step from an ethical point of view as well.

Model Bias. Hooker [Hooker, 2021] argued that bias materialises not only in data but in the algorithms as well. She argues that the key reason why model design choices amplify algorithmic bias is that notions of fairness often coincide with how underrepresented protected features are treated by the model. Most real-world data naturally have a skewed distribution, with a small number of well-represented features and a “long-tail” of features that are relatively underrepresented. The skew in feature frequency leads to disparate error rates on the underrepresented attribute.

Problems of Metrics. Lastly, evaluating meaning representations is inherently limited by the methods and possibilities of human annotation collection. On top of this, as mentioned in [Kuhnle, 2020, p. 23-24], evaluations are often prone to statistical flaws in interpreting performance scores, such as missing baseline scores, reported confidence intervals with no reference or explanation, and a lack of formal comparison or hypothesis testing [Faruqui et al., 2016].

Solutions. A range of papers have been published recently which attempt to fix some of the identified evaluation issues. Several attempts have been made to fix the data; however, Torralba and Efros [Torralba and Efros, 2011] argued that such a process is likely doomed to result in a “vicious cycle” of ad hoc improvements, unless one reconsiders the underlying mechanisms which cause undesired dataset bias. Artificial data and unit testing [Fouhey and Zitnick, 2014, Johnson et al., 2017, Kuhnle and Copestake, 2017] is a promising paradigm for amending ML evaluations. Probing is an increasingly popular approach to “stress-testing”, which involves testing the model on an auxiliary predictive task and testing the sensitivity of the model output to modifications of the input [Conneau et al., 2018, Voita and Titov, 2020]. Approaches for interpretable models and post-hoc model explanation techniques are also growing areas [Ghorbani et al., 2019, Kaur et al., 2020].

In the next section we propose transparency analysis as an extension of the solutions above, aiming to prevent “vicious cycles” by promoting a more informed model development process.

2.7.2 Transparency Testing and Efficient Multi-Modal Fusion

A key objective of this thesis is to propose and demonstrate a framework for overcoming the inconsistency of multi-modal results. Our approach is somewhat related to the probing paradigm and partially inspired by interpretability research and cognitive science. Beyond “stress-tests” for our models, we propose to extend standard evaluation techniques with an in-depth model and data analysis.
We propose both going wider, towards a more comprehensive model comparison across modalities and data sources, and going deeper, into studying the “cognition” of our models. We choose to analyse our datasets and models in a transparent way, which could serve as a preprocessing step before performing data or model debiasing. We propose performing and automating such data and model analyses in order to prevent the “vicious cycles” of ad hoc improvements mentioned in the previous section.

We postulate that amending performance evaluation with more in-depth transparency testing of semantic models is a useful way of developing more efficient and also safer models. Getting to know our models' inner “cognitive models” can be a way towards AI methods which are capable of communicating their reasoning, and also their potential biases, to humans. This would make them easier to debug and to maintain safely in the future.

We propose an embedding analysis resting on three pillars. We postulate that together they form a comprehensive, interpretable semantic analysis, but that none of them is sufficient on its own. The three types of analysis are categorised as black-box or transparency testing, and aim to answer the following questions:

1. Performance testing (black-box testing): How do representations of different modalities, trained on different datasets, perform on evaluation tasks?
2. Qualitative / quantitative structural analysis (transparency testing): How do representations of different modalities differ?
3. Independence analysis (transparency testing): How much do representations differ?

By learning about how and how much our different embeddings $E_L$, $E_V$, $E_S$ differ, while also looking at the performance scores, we can reach a conclusion on the question: What is the most efficient way of combining our different resources?

Efficiency. What do we mean by efficiency? Performance testing is only one way to account for efficiency. Whether we hold a machine learning model to be efficient depends on our costs and resources.
Data is often a limited resource, so in most cases it makes sense to take data size into account. Required computational resources, running times and electricity costs are also important factors to consider. Efficiency in the context of economic footprint was famously thematised by Bender et al. [Bender et al., 2021]. In this work we account for performance, data size and distribution, as well as model size, as these are metrics we could easily control for. Including hardware, electricity costs and running time could be a relevant extension of our studies.

None of the three types of analysis is sufficient on its own to answer the above question, but together they have the potential to provide meaningful insight into the anatomy of multi-modal semantic models. In Chapter 3 we will discuss the details of our approach to all three types of analysis. In the following sections we introduce our framework for transparency analysis of multi-modal models.

2.7.3 “Cognitive Model” of Embeddings: How do Models Conceptualise?

As the second pillar, or the first transparency analysis, we ask whether each of these vector spaces represents meaningful concepts as clusters, and how these concept structures relate to each other.

Comparing semantic spaces is central in Lexical Semantic Change (LSC). Dubossarsky et al. introduced Temporal Referencing⁷ for robust modelling of LSC on diachronic corpora [Dubossarsky et al., 2019]. They treat all time-specific corpora $C_a, C_b, \ldots, C_n$ as one corpus $C$ and learn word representations on the full corpus. However, they first replace each target word $w \in C_t$ with a time-specific token $w_t$. This way, they learn one single space that contains a vector for each target-time pair $w_t$, and these vectors may be compared directly without the need for mapping different spaces to each other.

⁷ https://github.com/Garrafao/TemporalReferencing

In Statistical Machine Translation, semantic spaces have been compared in order to perform unsupervised learning of bilingual lexicons. Artetxe et al. [Artetxe et al., 2018] developed a cross-lingual word embedding mapping in order to align two languages without the need for parallel corpora. They propose a self-learning method based on the observation that, given the similarity matrix of all words in the vocabulary, each word has a different distribution of similarity values. Their assumption is that two equivalent words in different languages should have similar distributions.

Minnema and Herbelot [Minnema and Herbelot, 2019] used various metrics to measure the similarity between a linguistic embedding space and a brain image embedding space. Besides testing pairwise and rank correlation between vectors for the same word from the two spaces, their metrics included the Nearest Neighbour structure of the two spaces and Representational Similarity Analysis (the Pearson correlation between their respective similarity matrices). The latter is somewhat related to the method of Artetxe et al. [Artetxe et al., 2018], as they also initialise with correlation matrices of the two vector spaces, which in their case correspond to linguistic spaces of two different languages. Dubossarsky et al. [Dubossarsky et al., 2019] also performed nearest neighbour analysis in the Lexical Semantic Change context.

For measurements such as nearest neighbour in high dimensional vector spaces, one has to take the threat of the curse of dimensionality into account. Dinu et al. [Dinu et al., 2015] showed that nearest neighbour search suffers from the hubness problem. This phenomenon is known to occur as an effect of the curse of dimensionality, and causes a few points (known as hubs) to be nearest neighbours of many other points [Radovanović et al., 2010]. This is a problem because these hub vectors tend to be near a high proportion of items, pushing their correct labels (e.g., words which are semantically similar) down the neighbour list.
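Two of the space-comparison measurements discussed above can be made concrete with a small sketch: Representational Similarity Analysis as the Pearson correlation between the pairwise similarities of two spaces, and a k-occurrence count, which is one common way of exposing hubs. The function names (`rsa`, `k_occurrence`), the use of cosine similarity, and the toy 2-dimensional vectors are assumptions made for illustration; they are not taken from the cited papers or from the thesis code.

```python
import math
from itertools import combinations
from typing import Dict, List

Space = Dict[str, List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_similarities(space: Space, words: List[str]) -> List[float]:
    """Upper triangle of the similarity matrix, in a fixed word-pair order."""
    return [cosine(space[a], space[b]) for a, b in combinations(words, 2)]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rsa(space_a: Space, space_b: Space) -> float:
    """Correlate the similarity structure of two spaces over their shared vocabulary."""
    words = sorted(set(space_a) & set(space_b))
    return pearson(pairwise_similarities(space_a, words),
                   pairwise_similarities(space_b, words))


def k_occurrence(space: Space, k: int = 1) -> Dict[str, int]:
    """Count how often each word is among the k nearest neighbours of the other words."""
    counts = {word: 0 for word in space}
    for word in space:
        others = sorted((o for o in space if o != word),
                        key=lambda o: cosine(space[word], space[o]), reverse=True)
        for neighbour in others[:k]:
            counts[neighbour] += 1
    return counts


space_a = {"cat": [1.0, 0.2], "dog": [0.9, 0.3], "car": [0.1, 1.0], "bus": [0.2, 0.9]}
# A rescaled copy: cosine similarities are unchanged, so the structures agree.
space_b = {word: [2.0 * x for x in vec] for word, vec in space_a.items()}

rsa_score = rsa(space_a, space_b)
hub_counts = k_occurrence(space_a, k=1)
print(rsa_score, hub_counts)
```

Note that RSA never compares vectors across spaces directly, so the two spaces may have different dimensionalities. In a space affected by hubness, the k-occurrence counts would be strongly skewed, with a few words collecting most of the neighbour slots.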
Concept based interpretability analysis using clustering is a new area in ML, which is related to our approach in spirit. Ghorbani et al. [Ghorbani et al., 2019] introduced post-training analysis of computer vision models using clustering of image segments. Clustering and visualisations have previously been used for multi-modal embedding analysis in [Gupta et al., 2019].

As a qualitative / quantitative structural analysis we employ standard clustering metrics, an approach most closely related to [Minnema and Herbelot, 2019], and cluster visualisations somewhat similar to those of [Gupta et al., 2019]. Unlike previous work, we zoom even further into our embeddings and perform a thorough qualitative cluster analysis along with visualisations to discover model-concepts (introduced in Section 2.1.2), and analyse a new structured embedding type (Section 2.5). This is complemented with an information-theoretical analysis framework, which we introduce in the following sections.

2.7.4 Information Theory Background

The third pillar of our semantic analysis seeks the answer to the question: How much do semantic embeddings $E_m$ of different modalities differ? We reformulate this question as follows: How much extra information do we gain if we combine two modalities? We could also phrase it this way: How much less confused does a model $S_{m_1}$ get after combining it with another model $S_{m_2}$? We turn to information theory to formalise our question. We start with a review of the basics, then formulate our approach.

The standard unit of information in computer science is the bit. The most widespread way of measuring information is the Shannon entropy [MacKay, 2003], introduced by Claude Shannon in 1948 [Shannon, 2001]. In information theory, the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable's possible outcomes. Shannon was searching for an information measure satisfying the following conditions. Let $p$ be the probability of an event, then

1. $H(p)$ is monotonically decreasing in $p$.
2. $H(p) \geq 0$: information is a non-negative quantity.
3. $H(1) = 0$: events that always occur do not communicate information.
4. $H(p_1, p_2) = H(p_1) + H(p_2)$: the information learned from independent events is the sum of the information learned from each event.

Shannon discovered that the only suitable choice of $H$, where $X$ is a discrete random variable with possible outcomes $x_1, \ldots, x_n$ and $P(X)$ is its probability mass function, is:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i) \tag{2.7}$$

where $b$ is the base of the logarithm used ($b = 2$ measures information in bits).

We rely on the concept of Mutual Information, which is intimately linked to entropy. It is also known as Information Gain, and measures the information that two random variables $X$ and $Y$ share: it measures how much knowing one of these variables reduces uncertainty about the other. Using the entropy, it is defined by:

$$I(X, Y) = H(X) - H(X|Y) \tag{2.8}$$

where $H(X|Y)$ is the conditional entropy [MacKay, 2003]. Let $(X, Y)$ be a pair of continuous random variables with values over the space $\mathcal{X} \times \mathcal{Y}$. If their joint distribution is $P_{X,Y}$ and the marginal distributions are $P_X$ and $P_Y$, the mutual information is defined as

$$I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} P_{X,Y}(x, y) \log \left( \frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)} \right) dy \, dx \tag{2.9}$$

It follows that

$$I(X, Y) = D_{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y) \tag{2.10}$$

where $D_{KL}$ is the Kullback–Leibler divergence:

$$D_{KL}(P \,\|\, Q) = \int_{\mathbb{R}^d} dP \log \frac{dP}{dQ} \tag{2.11}$$

If $p(x)$ and $q(x)$ are densities, then

$$D_{KL}(p \,\|\, q) = \int_{\mathbb{R}^d} p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{2.12}$$

2.7.5 Proposal for Measuring Independence of Embeddings

The phenomenon that human multi-modal sensory information fusion happens in a statistically optimal fashion has been studied in Cognitive Psychology [Ernst and Banks, 2002]. Ernst and Banks found that humans combine visual and haptic information by weighting each in inverse proportion to its uni-modal variance.
Interestingly, although not directly analogous, the finding of Kiela et al. [Kiela et al., 2014] for multi-modal (visuolinguistic) word embeddings is somewhat related. They filtered the visual input for words based on the dispersion of the corresponding images, which measures the average pairwise distance of the image vectors for a word. They found that filtering out “noisy” images improved the multi-modal representation. This does not necessarily mean that one should ignore all new conflicting information, but it highlights that it is possible to add more data to the system and obtain worse performance. In this thesis, we pursue a deeper understanding of the exact circumstances under which visual information enhances meaning representations and when it does not, by learning more about the relationship between semantic spaces of different modalities.

The informativeness of new data has been studied in learning purely linguistic embeddings as well. Kabbach et al. [Kabbach et al., 2019] developed a method to train word embeddings on a smaller corpus with maximal information gain, after pretraining them on a large corpus. Their model is designed to simulate new word acquisition by an adult speaker who has already mastered a substantial vocabulary. Their system uses a pretrained CBOW as this “background knowledge”, which they then use to train an SGNS on much smaller data in such a way that the context is maximally informative (has minimal entropy) given the previous knowledge.

To our knowledge, we are the first to propose measuring the independence of different modalities by estimating the Mutual Information between their embedding spaces. In order to do so, we treat each embedding space $E_{m_i}$ as a set of samples from a multivariate random distribution. By estimating the mutual information we can compare which embedding pairs differ more from each other. We would like to know whether the perspective of $E_S$ or $E_V$ is “farther” from $E_L$: which one is “more independent”?
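To make the quantities behind this proposal tangible, the discrete definitions from Section 2.7.4 (Equations 2.7 and 2.8) can be computed with a simple plug-in estimate from paired samples. This is a hedged sketch only: it assumes discrete (or already discretised) observations, and the function names `entropy`, `conditional_entropy` and `mutual_information` are introduced here for illustration. It is not the estimator applied to continuous embedding spaces in the thesis, which is introduced in Section 3.2.4.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def entropy(samples: Sequence[Hashable]) -> float:
    """Plug-in Shannon entropy in bits: H(X) = -sum_i P(x_i) log2 P(x_i)."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def conditional_entropy(pairs: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """H(X|Y) computed via the chain rule H(X|Y) = H(X, Y) - H(Y)."""
    ys = [y for _, y in pairs]
    return entropy(list(pairs)) - entropy(ys)


def mutual_information(pairs: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """I(X, Y) = H(X) - H(X|Y), as in Equation 2.8."""
    xs = [x for x, _ in pairs]
    return entropy(xs) - conditional_entropy(pairs)


# Y is a copy of X: knowing Y removes all uncertainty about X.
dependent: List[Tuple[int, int]] = [(0, 0), (1, 1), (0, 0), (1, 1)]
# Every combination is equally likely: knowing Y says nothing about X.
independent: List[Tuple[int, int]] = [(0, 0), (0, 1), (1, 0), (1, 1)]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
print(mi_dependent, mi_independent)
```

The two toy cases mark the extremes between which a pair of embeddings would be expected to fall: fully redundant observers share all of their information, fully independent ones share none.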
Let us reformulate our three assumptions on partial observers from Section 2.7 in the information-theoretical framework. Let $m_i, m_j, m_k$ be modalities, where $i, j, k$ are distinct, then:

1. Common ground: No two embeddings $E_{m_i}$, $E_{m_j}$ are completely independent, as they have all learnt some pattern related to the same hidden concepts in a language: $I(E_{m_i}, E_{m_j}) \neq 0$.
2. Perspectives: They are not completely correlated: $I(E_{m_i}, E_{m_j})$ is not maximal.
3. Imperfect knowledge: None of them is an oracle: they do not predict the evaluation data perfectly, $P(D|S_{m_i}) \neq 1$.

Thus, if the efficiencies (Section 2.7.2) of $E_{m_j}$ and $E_{m_k}$ are similar, and

$$I(E_{m_i}, E_{m_j}) > I(E_{m_i}, E_{m_k}) \tag{2.13}$$

then we hypothesise that there is a combination method with which combining $E_{m_i}$ with $E_{m_k}$ is more efficient than using $E_{m_i} + E_{m_j}$, as $E_{m_i}$ and $E_{m_k}$ convey more complementary information which can be combined. How this combination is realised depends on all the parameters in $S_{m_i}$ and $S_{m_k}$ and on the combination method itself. In this work we explore mid-fusion combination, as it allows us to study the information from different modalities separately as well as combined, and it makes it straightforward to compare individual embeddings.

2.7.6 A Utility Based Model of Embedding Independence

In this section we introduce a toy model based on probabilistic games, which serves as a theoretical backing for Mutual Information minimisation. As it is just a toy model, it is not a fundamental part of the framework of this thesis. However, it provides an interesting perspective on learning multi-modal semantic representations based on information theory, which could be generalised in the future. Before we create our own model of multi-modal fusion, we introduce Kelly's framework of betting in a game through a noisy binary channel [Kelly jr, 1956], [Cover and Thomas, 2012, p. 162].
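As a numerical companion to Kelly's binary betting game, the following sketch considers a gambler who guesses right with probability $q$ (and wrong with probability $p = 1 - q$) and stakes a fraction $l$ of their capital each round at even odds. It evaluates the long-run growth rate $q \log_2(1 + l) + p \log_2(1 - l)$ and checks that the stake $l = q - p$, which is the standard Kelly result and is stated here as background, attains $1 - H(X)$. The function names and the value $q = 0.75$ are illustrative assumptions.

```python
import math


def growth_rate(stake: float, q: float) -> float:
    """Long-run doubling rate G = q*log2(1 + l) + p*log2(1 - l) for stake fraction l."""
    p = 1.0 - q
    return q * math.log2(1.0 + stake) + p * math.log2(1.0 - stake)


def binary_entropy(q: float) -> float:
    """Shannon entropy in bits of a binary variable with outcome probabilities q and 1 - q."""
    p = 1.0 - q
    return -(q * math.log2(q) + p * math.log2(p))


q = 0.75                      # probability of a right guess (illustrative value)
kelly_stake = q - (1.0 - q)   # closed-form maximiser l = q - p
g_max = growth_rate(kelly_stake, q)

# Brute-force check over stakes 0.00, 0.01, ..., 0.99 (a stake of 1 risks ruin).
grid = [i / 100 for i in range(100)]
best_on_grid = max(grid, key=lambda stake: growth_rate(stake, q))

print(kelly_stake, g_max, 1.0 - binary_entropy(q), best_on_grid)
```

Staking more than the Kelly fraction raises the expected capital after a single round but lowers the long-run growth rate, which is why the entropy of the channel, rather than the expected payoff alone, governs the optimal strategy.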
Rate of Growth. Let us consider a repeatable game, where in each round a gambler can bet some amount of their wealth (including the whole) on either of two outcomes. After each round the gambler wins double their bet if they guessed right, and loses the bet otherwise. If $p$ is the probability of error and $q$ is the probability of a right guess, how much should they bet? Let $V_0$ be the starting capital and $V_N$ the capital after $N$ bets. Betting their entire capital each time would, in fact, maximise the expected value of their capital $\langle V_N \rangle$, which in this case would be given by

$$\langle V_N \rangle = (2q)^N V_0 \tag{2.14}$$

This would be little comfort, however, since if they continued indefinitely ($N \to \infty$), they would be broke with probability one. Let us, instead, assume that the gambler bets a fraction $l$ of their capital each time. Then

$$V_N = (1 + l)^W (1 - l)^L V_0 \tag{2.15}$$

where $W$ and $L$ are the number of wins and losses in the $N$ bets. Then the doubling factor, or rate of growth, of the gambler's capital $G$ is⁸

$$G = \lim_{N \to \infty} \left[ \frac{W}{N} \log(1 + l) + \frac{L}{N} \log(1 - l) \right] = q \log(1 + l) + p \log(1 - l) \quad \text{with probability one} \tag{2.16}$$

⁸ Here, $\log$ denotes $\log_2$.

We want to maximise this gain. Setting the derivative of $G$ with respect to $l$ to zero gives the maximiser $l = q - p$, and we get

$$G_{max} = 1 + p \log p + q \log q = 1 - H(X) \tag{2.17}$$

which is 1 minus the Shannon entropy, where $X$ is a binary random variable whose two outcomes have probabilities $p$ and $q$. The model has been generalised by Kelly to more than two outcomes in [Kelly jr, 1956].

Gain of Multi-Modal Fusion. Now, let us imagine learning concepts in a language from data as such a game. Figure 2.4 illustrates the model in the following way. Winning corresponds to learning a semantic model of target concepts $T$ which correlates highly with human semantic judgement. The noisy channel corresponds to the dataset $D$ via which our models can learn embedding representations $E_{m_1}$, $E_{m_2}$. In this game we are interested in maximising our gain by combining two modalities in the most efficient way.
Let $X$ denote a perfect “ground-truth” semantic representation, which maximally correlates with human judgement on our task. For the sake of readability let $Y := E_{m_1}$, $Z := E_{m_2}$. Then the maximal rates of growth for each model and for the ground-truth are:

$$G_Y = 1 - H(X|Y), \qquad G_Z = 1 - H(X|Z), \qquad G_0 = 1 - H(X) \tag{2.18}$$

The rate of growth, or gain, with the combination of $Y$ and $Z$ is

$$G_{YZ} = 1 - H(X|Y, Z) \tag{2.19}$$

We are interested in maximising the rate of growth after we combine the information from both modalities. Let us maximise the following difference, with $\Delta G_Y = G_Y - G_0$ and $\Delta G_Z = G_Z - G_0$ defined analogously:

$$\Delta G_{YZ} = G_{YZ} - G_0 \tag{2.20}$$

Thus, the following theorem holds:

Theorem 1. $\Delta G_{YZ} = \Delta G_Y + \Delta G_Z - I(X, Y, Z)$.

Proof. From Equations 2.18, 2.19 and 2.20:

$$\Delta G_Y = H(X) - H(X|Y) = I(X, Y), \qquad \Delta G_Z = H(X) - H(X|Z) = I(X, Z), \qquad \Delta G_{YZ} = H(X) - H(X|Y, Z) \tag{2.21}$$

(This is also Kelly's result for the general case, with more than two outcomes to bet on, with independent transmitted symbols and fair odds. Fair odds means that the odds paid on the occurrence of the $s$-th transmitted symbol are proportional to the probability that the transmitted symbol is the $s$-th one [Kelly jr, 1956].)

Furthermore, we apply the I-Diagram in Figure 2.5, a geometrical representation of the relationship among the information measures. It is analogous to the Venn Diagram in set theory, and makes several information-theoretical proofs easier [Yeung, 1991]. Therefore,

$$\Delta G_{YZ} = \Delta G_Y + \Delta G_Z - I(X, Y, Z) \quad \text{(see Figure 2.5a)} \tag{2.22}$$

Figure 2.5: Three Random Variables $X$, $Y$ and $Z$. Here $X$ represents a “ground-truth” variable, a perfect semantic representation. $Y$ and $Z$ are two random variables, corresponding to embeddings of two modalities $E_{m_1}$, $E_{m_2}$. Panel (a): low inter-modality dependence, independently from $X$: $I(Y, Z|X)$. Panel (b): high inter-modality dependence, independently from $X$: $I(Y, Z|X)$.

Furthermore, the following inequality holds:

Theorem 2. $I(X, Y, Z) \leq I(Y, Z)$. Mutual Information is an upper bound to minimise, in order to maximise the rate of growth after multi-modal fusion.

Proof. $\Delta G_Y$ and $\Delta G_Z$ are given, because the individual embeddings have already been trained. Therefore, from Theorem 1 it follows that we need to minimise $I(X, Y, Z)$ in order to maximise $\Delta G_{YZ}$. Furthermore, using the I-Diagram in Figure 2.5a:

$$I(X, Y, Z) \leq I(Y, Z) \tag{2.23}$$

Let us notice that if $I(Y, Z)$ is high, the reason might be independent of $X$. Therefore, $I(X, Y, Z)$ can be small while $I(Y, Z|X)$ is high, as illustrated in Figure 2.5b. In practice, however, this would mean that the two embeddings $E_{m_1}$, $E_{m_2}$ are correlated in some way which is irrelevant to learning semantic representations. For example, two corpora may have a similar number of documents, or be written in the same verse, etc. If this spurious correlation is too high, minimising $I(Y, Z)$ may not be a good approximation. Our investigation of the datasets we use did not reveal such spurious correlations. Therefore, we treat $I(X, Y, Z)$ as being very close to $I(Y, Z)$.

Maximising the gain from multi-modal embedding combination serves as a framework for analysing efficient multi-modal fusion. An exciting future extension of this model would be to generalise it further to odds which are not fair, based on [Kelly jr, 1956]. In Section 3.2.4 we will introduce empirical MI estimation methods, which we apply in the experiments presented in Chapter 6.

2.8 Summary: Comprehensive and Interpretable Word Semantic Analysis

In this chapter we reviewed the philosophical and theoretical background of word semantics and motivated researching distributional word semantic models as a proxy for the statistical analysis of concepts. After reviewing the literature on textual distributional semantics, visual embeddings and multi-modal approaches, we proposed a new type of embedding in between the linguistic and visual modalities, based on small data. Furthermore, we introduced a general framework and formalism for investigating multi-modal semantic embedding models.
Lastly, we presented a framework, based on information theory, for treating modalities as partial observers of meaning.

To tackle inconsistencies and the lack of systematic comparisons in the multi-modal literature, we proposed extending the analyses of previous work with an interpretable analysis framework of three pillars:

1. Performance testing (black-box testing): How do representations of different modalities perform on evaluation tasks? We extended previous work with:
   (a) Comprehensive analysis of models across data sources, machine learning models and modalities.
   (b) A new modality, based on small data and lying in between low level visual information and high level linguistic, symbolic data.
   (c) Efficiency analysis controlling for data size, data distribution and model size.
2. Qualitative / quantitative structural analysis (transparency testing): How do representations of different modalities differ? An analysis of the model-concept structures captured by modalities.
3. Independence analysis (transparency testing): How much do representations differ?

We postulated that none of these pillars alone is sufficient for an interpretable semantic embedding analysis; when combined, however, they can offer a fuller picture of what our models capture and how. We need (1.) comprehensive performance testing combined with efficiency metrics as a goal. Within this context we can carry out a transparency analysis involving (2.) zooming into the structural properties of embeddings and (3.) quantifying the optimal information gain from multi-modal fusion. Within this proof-of-concept framework we showcase that structured small data can be an efficient alternative to expensive big data and models when resources are scarce.

Chapter 3

Methodology of Data Selection and Proposal for Interpretable Evaluation

In this chapter we introduce the training and evaluation datasets which form the basis of this study.
Understanding how each training and evaluation dataset was created is crucial for interpreting the results. Using the notation from Section 2.6, Section 3.1 describes the image, text and structured corpora $D_V$, $D_L$, $D_S$ used as training data. Section 3.2 gives an overview of the evaluation data and methodology. Finally, we summarise the roadmap of our three-pillar analysis in Section 3.3.

3.1 Training Data Matters

One of the main objectives of this thesis is to analyse the data sources that are being used during model training. Recalling our notation of semantic embedding models of modality $m$ (with output embedding $E_m$):

$$S_m = \langle T, O_m, D_m, X_m, A_m, d_m \rangle \tag{3.1}$$

The dataset $D_m$, comprising observable items and target elements, is an essential parameter. Analysing these datasets is, therefore, the basis for all three contributions. In our I. comprehensive analysis we aim to overcome the often inconsistent or hard-to-compare results of previous work. Introducing a new mapping $X_S$ from a structured data source, as well as analysing the properties of the data, is at the centre of our study of a II. new type of semantic embedding model $S_S$. Lastly, getting more familiar with the training data is imperative if we want to create III. transparent and interpretable semantic models.

Section 3.1.1 gives a summary of the properties of the image datasets $D_V$ which are used throughout the thesis for visual models $S_V$. Section 3.1.2 introduces the text corpora $D_L$ for linguistic semantic embedding models $S_L$. Let us highlight that Visual Genome is included in both categories, since it is used both as an image dataset $D_V$ and, after extracting text from its structured annotations, as a structured text corpus $D_S$ of $S_S$.

3.1.1 Image Data

This section introduces the details of processing image data and the image datasets which deliver the observable context $O_V$ in visual semantic embedding models $S_V$.
Processing Image Data

We used the MMFeat toolkit¹ (based on Caffe²) to obtain image representations for three different convolutional network architectures: AlexNet [Krizhevsky et al., 2012], GoogLeNet [Szegedy et al., 2015] and VGGNet [Simonyan and Zisserman, 2014], and our own toolkit, EmbEval³, for ResNet [He et al., 2016] and AlexNet based on Pytorch-torchvision⁴. Image representations are turned into an overall word-level visual representation by taking the mean of the relevant image representations. All four networks are trained to maximize the multinomial logistic regression objective using mini-batch gradient descent with momentum:

$$\sum_{i=1}^{D} \sum_{k=1}^{K} \mathbb{1}\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)})} \tag{3.2}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function, and $x^{(i)}$ and $y^{(i)}$ are the input and output, respectively. $D$ is the number of training examples and $K$ is the number of classes.

¹ https://github.com/douwekiela/mmfeat
² https://caffe.berkeleyvision.org/
³ https://github.com/anitavero/embeval
⁴ https://pytorch.org/docs/stable/torchvision/index.html

                 Google         Bing           Flickr         ImageNet        Visual Genome
Type             Search engine  Search engine  Photo sharing  Image database  Image database
Annotation       Automatic      Automatic      Human          Human           Human
Coverage         Unlimited      Unlimited      Unlimited      Limited         Limited
Sorted           Yes            Yes            Yes            No              No
Tag specificity  Unknown        Unknown        Loose          Specific        Dense

Table 3.1: Sources of image data.

The networks are trained on the ImageNet classification task and we transfer layers from the pre-trained network. As we use CNN models pre-trained on ImageNet, the other datasets do not serve as CNN training data. However, all CNN networks work as a mapping from our $O_V$ images to a vector space. The vector representations are obtained by running a feed-forward step in the network and extracting the last layer as the representation of the image.
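The word-level aggregation step can be sketched in a few lines. The following is a minimal illustration in plain Python, not the MMFeat or EmbEval implementation: the function name `aggregate_image_vectors` and the toy three-dimensional vectors are hypothetical, and real image representations would come from a CNN feed-forward step. Besides the mean, it also shows the element-wise maximum and median as alternative aggregations.

```python
from statistics import mean, median


def aggregate_image_vectors(image_vectors, method="mean"):
    """Combine the image vectors of one word into a single word-level vector.

    image_vectors: list of equal-length lists of floats (one per image).
    method: "mean", "max" or "median", applied element-wise.
    """
    reducers = {"mean": mean, "max": max, "median": median}
    if method not in reducers:
        raise ValueError(f"unknown aggregation method: {method}")
    if not image_vectors:
        raise ValueError("at least one image vector is required")
    reduce_dimension = reducers[method]
    # zip(*image_vectors) walks over the dimensions, collecting the value
    # of that dimension from every image vector.
    return [reduce_dimension(values) for values in zip(*image_vectors)]


# Hypothetical toy vectors standing in for CNN outputs of three images.
elephant_images = [
    [0.0, 2.0, 4.0],
    [1.0, 2.0, 0.0],
    [5.0, 2.0, 2.0],
]

for method in ("mean", "max", "median"):
    print(method, aggregate_image_vectors(elephant_images, method))
```

Keeping the aggregation separate from feature extraction makes it straightforward to swap the reducer without recomputing the image representations.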
We use the last fully connected layer from AlexNet and VGGNet (both 4096 dimensional vectors), and the last pooling layer from GoogLeNet (1024 dimensions) and ResNet (512 dimensions). We have multiple image results for a word, hence the image vectors have to be aggregated, for example by element-wise maximum, mean or median (studied in Section 4.1). The learning algorithm and the aggregation method together constitute the mapping function $A_V$ in $S_V$.

Image Datasets

Previous systematic studies of parameters for text-based distributional methods have found that the source corpus has a large impact on representational quality [Sahlgren and Lenci, 2016, Kiela and Clark, 2014]. The same is likely to hold in the case of visual representations. Various sources of image data have been used in multi-modal semantics, but there have not been many comparisons: [Bergsma and Goebel, 2011] compare Google and Flickr, and [Kiela and Bottou, 2014] compare ImageNet [Deng et al., 2009] and the ESP Game dataset [von Ahn and Dabbish, 2004], but most works use a single data source. In this work, one of our objectives is to assess the quality of various sources of image data $D_V$.

We selected the presented datasets because they are all standard in Computer Vision or NLP, while they all differ in at least one of the following properties:

• Type: search engines; photo sharing social networks or hand-crafted image datasets.
• Annotation: Automatic by an algorithm or annotated by humans.
• Coverage: Unlimited – crowd-sourced on the internet or a prepared dataset of limited size.
• Sorted: Whether there is a relevance score assigned to each image that indicates how descriptive it is of a word (e.g., search engine order).
• Tag specificity: Whether the annotations of images are: specific – describing objects / scenes in the image; loose – related to the image on a higher semantic level or from an annotator's personal angle; dense – detailed labels of objects and relationships within an image.
Table 3.1 provides an overview of the data sources. Descriptions of each dataset follow:

Google Images
Google's image search⁵ results have been found to be comparable to hand-crafted image datasets [Fergus et al., 2005].

Bing Images
An alternative image search engine is Bing Images⁶. It uses different underlying technology from Google Images, but offers the same functionality as an image search engine.

Flickr
Although [Bergsma and Goebel, 2011] have found that Google Images works better in one experiment, the photo sharing service Flickr⁷ is an interesting data source because its images are tagged by human annotators.

ImageNet
ImageNet [Deng et al., 2009] is a large ontology of images developed for a variety of Computer Vision applications. It serves as a benchmarking standard for various image processing and Computer Vision tasks. ImageNet is constructed along the same hierarchical structure as WordNet [Miller, 1995], by attaching images to the corresponding synset (synonym set).

Visual Genome
Visual Genome [Krishna et al., 2016] is a human annotated dataset which contains images with bounding box annotations around objects and relations, among many other types of information, such as scene and region descriptions, object attributes, semantic relationships between image regions and objects, and Visual Question Answering (VQA) pairs. The objects, attributes, relationships, and noun phrases in region descriptions, and VQA pairs are also canonicalised to WordNet [Miller, 1995] synsets.

⁵ https://images.google.com/
⁶ https://www.bing.com/images
⁷ https://www.flickr.com

All of the dataset properties can be relevant; however, it is not immediately obvious whether any of the above sources is superior to the others. While search engines provide full data coverage for virtually any vocabulary of various languages, they fall behind in tag specificity, as the search word is in an associative relationship with the images, not a hand-crafted label.
Search engines and Flickr all come with a relevance order, which can be useful for image based meaning representations. However, in the case of search engines we rely too much on black-box algorithms and automatic annotation. Hand-crafted datasets, while they certainly fall behind in size and thus coverage, contain more carefully collected human annotations, which are usually more specific and detailed. In both ImageNet and Visual Genome, annotations are aligned with WordNet, which is a standard knowledge base.

Figure 3.1 contains image samples from the datasets, which serve as observable contexts $O_V$ that are mapped to vectors by a feed-forward step in a CNN. All networks are pre-trained on ImageNet, thus our models do not differ in this regard. While there is less difference for the more specific concept of elephant, results for animal are more diverse across sources. Visual Genome (Figure 3.1a) includes several bounding boxes with dense annotations, whereas the others are ordered by relevance. Flickr tends to include more personal photos, such as pets in Figure 3.1d. Google and Bing have more versatile results (Figure 3.1b, 3.1c).

In order to see more clearly how each property affects model performance, we propose measuring the effect of image source choice and discuss its effectiveness regarding the costs of dataset creation.

Figure 3.1: Example images for animal and elephant from the various data sources used as observable contexts $O_V$. Panels: (a) Visual Genome, (b) Google, (c) Bing, (d) Flickr. While there is less difference for the more specific concept of elephant, results for animal are more diverse across sources. Visual Genome includes several bounding boxes with dense annotations, whereas the others are ordered by relevance.

3.1.2 Text Corpora

Linguistic models $S_L$ are naturally trained on text corpora $D_L$.
Structured embeddings $S_S$ are also trained on text; however, the main difference from traditional text corpora is that these are ordered in a specific structure instead of free text, e.g., a graph of expressions, hence the distinct notation $D_S$. We used different versions of Wikipedia and Common Crawl datasets as $D_L$ training data. $D_S$ consists of Visual Genome Scene Graphs. All these are described in the following.

Wikipedia
Wikipedia⁸ is a widely used corpus in NLP applications. It is a crowd-sourced encyclopaedia, which covers various common sense and scientific concepts. Its topic structure has been directly exploited in Explicit Semantic Analysis [Gabrilovich et al., 2007]. It has been used as a general training corpus for its wide topic coverage and long history of crowd-sourced quality control. In this work we use models trained on 2013 and 2020 Wikipedia dumps as baseline models.

FastText
In Section 4.3 we use more recent pretrained word embeddings from the FastText framework⁹. These models use the traditional CBOW model, with versions extended with subword information [Mikolov et al., 2018]. The following training datasets were used:

1. wiki-news-300d-1M: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
2. wiki-news-300d-1M-subword: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
3. crawl-300d-2M: 2 million word vectors trained on Common Crawl (600B tokens).
4. crawl-300d-2M-subword: 2 million word vectors trained with subword information on Common Crawl (600B tokens).

Visual Genome Scene Graph
In this work we do not only use Visual Genome [Krishna et al., 2016] as an image dataset, but we also exploit its dense and structured human annotation as a text corpus.
The Visual Genome dataset contains a complete set of descriptions and QAs for each image based on multiple image regions, and a formalized representation of the components of an image. It consists of seven main components: region descriptions, objects, attributes, relationships, region graphs, scene graphs, and question answer pairs. Figure 3.2 shows examples of each component for one image. Although it falls behind the above mentioned text corpora in terms of size, its highly structured nature can convey semantic information in itself.

This dataset is special in terms of its “modality”. It includes dense textual annotation of image objects and scenes which people normally do not write about. Therefore, even though the annotation consists of character sequences, it conveys some high level visual common-sense knowledge. Besides its relevance for research, this type of annotation collection methodology could benefit data acquisition for low-resource languages, where there is no abundance (or there is an absence) of corpora. Applying tools where speakers can point out the visually grounded meaning of their language could be a highly efficient way of documenting and analysing these languages. Moreover, automatic Scene Graph Generation algorithms [Xu et al., 2020] can further boost the efficiency of such methods.

Some statistics¹⁰ on the size of the different annotation types are summarized in Table 3.2. Preliminary studies on a new embedding type based on this dataset are discussed in Section 4.2. The model is thoroughly studied in Section 4.3 and in Chapters 5 and 6.

⁸ https://www.wikipedia.org/
⁹ https://fasttext.cc/docs/en/english-vectors.html
¹⁰ https://visualgenome.org/data_analysis/statistics

We decided to use Wikipedia and corpora from the FastText system because they are all standard in the literature and are also easily and openly available. While we used pretrained models in our studies, we also trained our own SGNS model on various subsets of Wikipedia for quantity and distribution control experiments. Experimenting with even bigger datasets would be a potential improvement. However, given our resources and the number of experiments planned, this was a sensible data size limit. Visual Genome is a unique data source for its structured annotations. We chose it to investigate the potential of such a dataset for multi-modal semantics.

Figure 3.2: A representation of the Visual Genome dataset. Each image contains region descriptions that describe a localized portion of the image. There are two types of question answer pairs (QAs): free form QAs and region-based QAs. Each region is converted to a region graph representation of objects, attributes, and pairwise relationships. Finally, each of these region graphs are combined to form a scene graph with all the objects grounded to the image. [Krishna et al., 2016]

Total region descriptions                    4,297,502
Total image object instances                 1,366,673
Unique image objects                            75,729
Total object-object relationship instances   1,531,448
Unique relationships                            40,480
Total attribute-object instances             1,670,182
Unique attributes                               40,513
Total Scene Graphs                             108,249
Total Region Graphs                          3,788,715
Total Question Answers                       1,773,258

Table 3.2: Visual Genome annotation statistics.
3.2 From Intrinsic Evaluation to Interpretable Model Anatomy

In this section we discuss the evaluation datasets, metrics and analysis methodology which we applied to implement our three-pillar transparent testing of multi-modal embeddings, laid out in Section 2.7.2. Section 3.2.1 describes the tools for (1) Performance testing. Section 3.2.2 describes analysis on brain data as embedding analysis. Section 3.2.3 introduces cluster analysis as (2) Qualitative / Quantitative structural analysis. Finally, Section 3.2.4 introduces empirical Mutual Information estimation methods for (3) Independence analysis.

3.2.1 Behavioural Tasks

Most multi-modal word embedding work evaluates on semantic similarity and relatedness tasks in the hope of gathering information about the intrinsic behaviour of abstract semantic representations. However, the ambiguous notion of similarity and the low inter-annotator agreement make it difficult to draw robust conclusions on the differences between models [Batchkarov et al., 2016]. As a first black-box step, we will also evaluate on these standard datasets. Unlike previous work, however, we first aim to create an extensive study comparing several semantic models $S_m$ with varying parameters $T, O_m, D_m, X_m, A_m, d_m, E_m$. Then we gradually move towards more in-depth transparency analysis. We briefly describe the standard evaluation datasets and metrics we use in our experiments:

MEN
The MEN data set [Bruni et al., 2014] consists of 3,000 word pairs, randomly selected from words that occur at least 700 times in the freely available ukWaC and Wackypedia corpora combined and at least 50 times (as tags) in the open-sourced subset of the ESP game dataset.¹¹ Pairs were sampled so that they represent a balanced range of relatedness levels according to a text-based semantic score.
Each pair was randomly matched with a comparison pair and rated in this setting (as either more or less related than the comparison point) by an annotator on Amazon Mechanical Turk. This binary comparison task is both more natural for an individual annotator, and also permits seamless integration of the supervision from many annotators. The downside is that, this way, there is no well-defined inter-subject agreement. In total, each pair was rated against 50 comparison pairs, thus obtaining a final score on a 50-point scale, although the Turkers' choices were binary.

¹¹ https://staff.fnwi.uva.nl/e.bruni/MEN

SimLex-999
SimLex-999 [Hill et al., 2015] is a dataset structurally similar to MEN, including 999 word pairs for intrinsic semantic evaluation. Its objective is, however, to measure how well models capture similarity, rather than relatedness or association. The scores in SimLex-999 therefore differ from other well-known evaluation datasets such as MEN. For example, “coast” and “shore” would have a high score in both MEN and SimLex. On the other hand, “cloth” and “closet” would have a low score in SimLex but a high score in MEN, since they have different materials, function etc., even though they are very much related. This task is challenging for computational models to replicate because, in order to perform well, they must learn to capture similarity independently of relatedness/association. These two relationships between words show up in different contextual features. Similarity is inferred from similar co-occurrences with other words. Similarity or relatedness is then captured by the type of co-occurrence / window size [Kilgarriff and Yallop, 2000]. In addition, SimLex includes concreteness, Part-Of-Speech and association scores from the University of South Florida (USF) Free Association Norms [Nelson et al., 2004].

SimVerb-3500
SimVerb-3500 [Gerz et al., 2016] is an evaluation resource that provides human ratings for the similarity of 3,500 verb pairs.
It covers all normed verb types from the USF Free Association database, providing at least three examples for every VerbNet [Schuler, 2005] class. Verb pairs are rated on a scale of 0-10, for example: “to reply” / “to respond” - 9.79; “to participate” / “to join” - 5.64; “to stay” / “to leave” - 0.17. We included this dataset in Section 4.2, where predicate - object relationships are in focus, to test how they affect verb representations in particular.

Evaluation metric
Model performance is assessed through the Spearman $\rho_s$ rank correlation between the embedding similarity scores for the word pairs and the human judgements in each evaluation dataset. Pearson correlation has also been considered; however, humans find it much harder to attach a numerical score to a pairwise comparison like “cat”–“dog” than to judge whether that pair is more similar than “cat”–“television”. Furthermore, the Pearson correlation coefficient should also be avoided because even if humans give numerical scores as similarity ratings, these are unlikely to be normally distributed. Embedding similarity scores are computed using the cosine similarity of the two word vectors, $\vec{w_1}, \vec{w_2}$, of a word pair, $w_1, w_2$:

$$\mathrm{Cosine}(\vec{w_1}, \vec{w_2}) = \frac{\vec{w_1} \cdot \vec{w_2}}{\|\vec{w_1}\| \, \|\vec{w_2}\|} \tag{3.3}$$

$$= \frac{\vec{w_1} \cdot \vec{w_2}}{\sqrt{\sum_i w_{1i}^2} \, \sqrt{\sum_i w_{2i}^2}} \tag{3.4}$$

The dot product in the numerator calculates the numerical overlap between the word vectors, and dividing by the respective lengths provides a length normalisation which leads to the cosine of the angle between the vectors. Normalisation is important because we would not want two word vectors to score highly for similarity simply because those words were frequent in the corpus. The cosine measure is commonly used in studies of distributional semantics; however, we could use any other vector space metric [Clark, 2015].
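The evaluation metric can be illustrated with a short sketch: the cosine of Equations 3.3 and 3.4, followed by a Spearman rank correlation computed as the Pearson correlation of average ranks. This is a plain-Python illustration under stated assumptions, not the evaluation code used for the experiments; the function names, the toy vectors and the ratings in the usage part are hypothetical.

```python
import math


def cosine(w1, w2):
    """Cosine of the angle between two word vectors (Equations 3.3 and 3.4)."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    return dot / (norm1 * norm2)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: made-up vectors and made-up human ratings.
vectors = {
    "coast": [0.9, 0.1, 0.3], "shore": [0.8, 0.2, 0.3],
    "cloth": [0.1, 0.9, 0.2], "closet": [0.3, 0.5, 0.8],
    "cat": [0.2, 0.1, 0.9], "television": [0.9, 0.8, 0.1],
}
word_pairs = [("coast", "shore"), ("cloth", "closet"), ("cat", "television")]
human_ratings = [9.0, 2.0, 1.0]
model_scores = [cosine(vectors[a], vectors[b]) for a, b in word_pairs]
print(spearman(model_scores, human_ratings))
```

Because only ranks enter the correlation, any monotone rescaling of the model scores or of the human ratings leaves the result unchanged.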
It is difficult to reach a conclusion from the literature regarding which similarity measure is best; we use cosine similarity here because it has become standard in NLP. Future work could involve revisiting these standard metrics because they may behave differently depending on the task and the source/modality of training data.

3.2.2 Brain Imaging as Embedding Analysis

Evaluation on brain imaging data has been introduced as an NLP evaluation task on various occasions [Mitchell et al., 2008, Anderson et al., 2016] (Section 2.1.2). In some cases visually grounded models have been included in the evaluation [Davis et al., 2019, Anderson et al., 2017, Bulat et al., 2017]. The measured impact of multi-modal information, however, varies across studies, thus in this work we included a broader analysis on these tasks as well. We aim to use correlation studies with brain data as a type of black-box analysis, which is substantially different from behavioural tasks and as such can shed new light on differences between our Semantic Embedding models of different modalities. The findings in cognitive neuroscience (Section 2.1.2) on multi-modal human brain activities while performing semantic tasks further motivate us to include brain data in our studies.

We evaluate on two brain image datasets which were collected while participants viewed 60 concrete nouns with line drawings [Mitchell et al., 2008, Sudre et al., 2012]. One dataset was collected using fMRI (Functional Magnetic Resonance Imaging) and one with MEG (Magnetoencephalography). Each dataset has 9 participants, but the participant sets are disjoint, thus there are 18 unique participants in total. Though the stimuli are shared across the two experiments, MEG and fMRI are very different recording modalities and thus the data are not redundant [Xu et al., 2016].

fMRI dataset
fMRI measures the change in blood oxygen levels in the brain, which varies according to the amount of work being done by a particular brain area.
In this fMRI dataset, collected by Mitchell et al. [Mitchell et al., 2008], participants were presented with line drawings and noun labels of 60 concrete nouns from 12 semantic categories: animals, body parts, buildings, building parts, clothing, furniture, insects, kitchen items, tools, vegetables, vehicles and man-made objects. The experimental task was to think about the properties of the noun concept they were shown. The set of 60 concepts was presented in a random order six times to each participant. Each concept was presented for 3 seconds, with seven-second gaps between presentations.

MEG dataset
This experiment involved the same task as the previous one but using an MEG machine, a large helmet with 306 sensors that measure aspects of the magnetic fields at different locations in the brain. A MEG brain image is the time signals recorded from each of these sensors. Each of the words was presented 20 times (in random order) for a total of 1200 brain images.

Both brain image datasets have been preprocessed by the BrainBench Test Suite [Xu et al., 2016]. They used a “partialling out” process in order to remove low level activity attributable to visual properties from the brain images. They used the methodology from Mitchell et al. to select the most stable brain image features for each of the 18 participants. The stability metric assigns a high score to features that show strong self-correlation over presentations of the same word.

Two vs. two test
To evaluate on brain data we need to compare representation similarities from brain imaging vectors and meaning representation vectors. This type of evaluation is fundamentally different from the behavioural tasks, as we do not have human similarity score labels for word pairs. We use leave-two-out cross validation, the testing methodology from Mitchell et al., which has become standard for brain imaging evaluation of semantic embeddings.
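A minimal sketch of the two vs. two decision rule follows, assuming two square similarity matrices over the same ordered list of nouns, one computed from an embedding model and one from brain images. It is an illustration in plain Python rather than the BrainBench code; the function names and the toy matrix are hypothetical. For each pair of nouns, the two corresponding columns are taken from each matrix, the entries belonging to the two tested nouns are dropped, and the summed Pearson correlations of the correct and of the swapped pairings are compared.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; a constant vector yields 0.0 (assumption of this sketch)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / norm if norm > 0 else 0.0


def similarity_code(matrix, noun, left_out):
    """Column `noun` of a similarity matrix without the rows of the two tested nouns."""
    return [row[noun] for index, row in enumerate(matrix) if index not in left_out]


def two_vs_two_accuracy(model_sim, brain_sim):
    """Fraction of noun pairs for which the correct pairing beats the swapped one."""
    hits = 0
    pairs = 0
    for i, j in combinations(range(len(model_sim)), 2):
        s_i = similarity_code(model_sim, i, (i, j))
        s_j = similarity_code(model_sim, j, (i, j))
        a_i = similarity_code(brain_sim, i, (i, j))
        a_j = similarity_code(brain_sim, j, (i, j))
        correct = pearson(s_i, a_i) + pearson(s_j, a_j)
        swapped = pearson(s_i, a_j) + pearson(s_j, a_i)
        hits += correct > swapped
        pairs += 1
    return hits / pairs


# Hypothetical symmetric similarity matrix over five nouns.
toy_sim = [
    [1.0, 0.9, 0.1, 0.3, 0.2],
    [0.9, 1.0, 0.2, 0.4, 0.1],
    [0.1, 0.2, 1.0, 0.8, 0.5],
    [0.3, 0.4, 0.8, 1.0, 0.6],
    [0.2, 0.1, 0.5, 0.6, 1.0],
]
# Using the same matrix for both sides is only a sanity check of the rule.
print(two_vs_two_accuracy(toy_sim, toy_sim))
```

At least four nouns are needed so that each similarity code keeps two or more entries after the tested nouns are removed.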
Our implementation is based on BrainBench, with modifications so that we can perform analysis on individual participants. The evaluation starts from two similarity matrices: one computed from the embedding model and one from the brain data. Columns of these matrices are called similarity codes. Embedding similarity codes $(\vec{s_i}, \vec{s_j})$ and brain activity similarity codes $(\vec{a_i}, \vec{a_j})$ are selected for two nouns. Elements $i$ and $j$ are removed from each of the similarity codes, as these entries correspond to the nouns being tested. Figure 3.3 visualises an example of the decoding procedure. Decoding is successful if the sum of Pearson correlations for the correct pairings is greater than the sum of Pearson correlations for the incorrect pairings, resulting in a decoding accuracy of 1 for this pair and 0 otherwise. Thus, the expected chance-level decoding accuracy is 50%.

Figure 3.3: Visualisation of leave-two-out cross validation from [Anderson et al., 2016].

3.2.3 How do Models Conceptualise? – Cluster Analysis

As introduced in Section 2.7.3, the second pillar (2) of our analysis is a transparent investigation of the concepts our embedding spaces $E_L$, $E_V$, $E_S$ capture. We are interested in how much these model-concepts differ from each other, to understand under what circumstances the modalities can complement each other. As mentioned before, this qualitative / quantitative structural analysis is meant to be used in the context of the previous performance analyses and the third pillar, (3) independence analysis, which we detail in Section 3.2.4.

By model-concept, here, we mean clusters in the embedding spaces based on some similarity metric, which do not necessarily correspond to the meaning of one word, but rather to some higher level or different structure. As a straightforward implementation, we chose to use standard clustering algorithms and metrics to compare our different embeddings.
In order to grasp how the concept structures of our embedding spaces differ from each other, we first searched for ways to quantify their cluster structure. We do not know the ground truth labels of our clusters or even the number of clusters each embedding space should be broken into. Therefore, we experiment with three standard clusterization metrics which are designed for the case when a ground truth labelling is not available. Furthermore, we report results for a range of numbers of clusters.

In Chapter 6 we present the design, implementation and results of our transparency studies. Section 6.2 includes qualitative and quantitative cluster analysis. In Section 6.2.2 we compare our embeddings' cluster structures and visualise the learnt clusterings. In Section 6.2.3 we present supervised visualisations of the embedding spaces alongside an automatic label generation method and compare the results against the clusterization metric scores. As an effective visualisation we use the t-SNE algorithm [Maaten and Hinton, 2008, Wattenberg et al., 2016]. Clustering and t-SNE have been previously used for multi-modal embedding analysis, e.g., [Gupta et al., 2019].

In Section 6.2 we report qualitative analyses by investigating the elements of the clusters, as well as reporting further quantitative cluster structure comparison analyses. One of our clustering analyses is based on the pre-defined cluster labels of [Gupta et al., 2019]. They also use Visual Genome; otherwise, their work is fundamentally different from ours, as they use different models, do not exploit the Visual Genome graph structure and evaluate on downstream tasks. In the following we present all the standard algorithms and metrics used for the clustering studies.

3.2.3.1 Clustering Methods and Metrics

We ran the K-means [MacQueen et al., 1967] clusterization algorithm on all three embeddings to see if it can reveal more about the underlying structure of the spaces.
We used the k-means++ initialization scheme [Arthur and Vassilvitskii, 2006], which has been implemented in the Scikit-learn package¹². This initializes the centroids to be (generally) distant from each other, leading to probably better results than random initialization. As a control for consistency of clustering we also present results using Agglomerative Clustering¹³. To measure the rate of clusterization when the labels are not known, we used three standard metrics implemented in the Scikit-learn package¹⁴. One drawback of these metrics is that they are generally higher for convex clusters than for other concepts of clusters. However, convexity is not always given. They respond poorly to elongated clusters, or manifolds with irregular shapes.

¹² https://scikit-learn.org/stable/modules/clustering.html#k-means
¹³ https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
¹⁴ https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

1. Davies–Bouldin Index can be calculated by the following formula:

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right) \tag{3.5}$$

where $\sigma_x$ is the average distance of all elements in cluster $C_x$ from the cluster centroid, and $d(c_i, c_j)$ is the distance between centroids $c_i$, $c_j$. Since clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the smaller this number is, the better the clusterization is considered to be. The computation of Davies–Bouldin is simpler than that of Silhouette scores. The index is solely based on quantities and features inherent to the dataset, as its computation only uses point-wise distances.

2. Calinski-Harabasz Index – also known as the Variance Ratio Criterion – can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with better defined clusters.
The index is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters (where dispersion is defined as the sum of distances squared):

$$CH = \frac{\mathrm{tr}(B_K)}{\mathrm{tr}(W_K)} \times \frac{N - K}{K - 1} \tag{3.6}$$

where $\mathrm{tr}(B_K)$ is the trace of the between-group dispersion matrix and $\mathrm{tr}(W_K)$ is the trace of the within-cluster dispersion matrix, defined by:

$$W_K = \sum_{k} \sum_{e \in C_k} (e - c_k)(e - c_k)^T \tag{3.7}$$

$$B_K = \sum_{k} (c_k - c_E)(c_k - c_E)^T \tag{3.8}$$

with $N$ being the number of data points, $K$ the number of clusters and $c_E$ the centroid of $E$. The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

3. Silhouette Coefficient value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). For each data point $e_i$ we define:

$$a(e_i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i, \, i \neq j} d(e_i, e_j) \tag{3.9}$$

$$b(e_i) = \min_{k \neq i} \frac{1}{|C_k|} \sum_{j \in C_k} d(e_i, e_j) \tag{3.10}$$

We now define a silhouette (value) of one data point $e_i$:

$$S(e_i) = \frac{b(e_i) - a(e_i)}{\max\{a(e_i), b(e_i)\}}, \quad \text{if } |C_i| > 1 \tag{3.11}$$

The Silhouette Coefficient is also higher when clusters are dense and well separated.

3.2.4 Information Gain from Modalities

The third pillar of our analysis is the second transparency study, which aims to uncover how much representations differ. We formulated it as an independence analysis (Pillar 3) of our embeddings $E_L$, $E_V$, $E_S$ as multivariate random variables in Section 2.7.5. Applying Equation 2.13 to the three modalities (including the same three assumptions), we aim to measure whether

$$I(E_L, E_V) > I(E_L, E_S) \tag{3.12}$$

in which case we hypothesise that there is a combination method with which combining $E_L$ with $E_S$ is more efficient than using $E_L + E_V$, as they convey more complementary information which can be combined. The experiment design and the results are reported in Section 6.3. We need to estimate the empirical Mutual Information of our vector spaces from data, which is a hard problem. In the following we introduce the standard methods and tools we used for this purpose.
3.2.4.1 Empirical Mutual Information Estimation

Since Mutual Information is a special case of divergence (such as $D_{KL}$ in Equation 2.10), divergence estimators can be employed to estimate it. To recall the definition of $D_{KL}$ (Equation 2.12): if $p(x)$ and $q(x)$ are densities, then

\[ D_{KL}(p \,\|\, q) = \int_{\mathbb{R}^d} p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{3.13} \]

The estimators then approximate Equation 2.10:

\[ I(X, Y) = D_{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y) \tag{3.14} \]

In our application, the samples of $P_{X,Y}$ come from a multi-modal embedding created by mid-fusion, whereas the marginals correspond to the uni-modal embeddings.

To estimate the densities $p(x)$ and $q(x)$, the traditional approach is to use histograms with equally sized bins [Wang et al., 2005]. However, the computational complexity of such methods is exponential in $d$, and the estimation accuracy deteriorates quickly as the dimension increases. A more robust way of estimating multidimensional Mutual Information is therefore to use k-Nearest Neighbor distances ($I_{KNN}$), which bypasses the difficulties associated with partitioning a high-dimensional space [Wang et al., 2009]. This method estimates a density by computing the average frequency of each point's k nearest neighbours in the Euclidean ball centred on the point, which provides a consistent estimate of $D_{KL}(p \,\|\, q)$. In practice, these methods also become unreliable in high-dimensional spaces because of the sparsity of the data points.

To overcome this, another approach is to introduce non-linearity by using a kernel when calculating the distances. In this work we use a kernel method called the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005], because it has been shown to work in practical applications [Jitkrittum et al., 2017]. Consider a reproducing kernel Hilbert space $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathbb{R}$. To each point $X \in \mathcal{X}$ there corresponds an element $\phi(X) \in \mathcal{F}$ such that $\langle \phi(X), \phi(X') \rangle_{\mathcal{F}} = k(X, X')$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a unique positive definite kernel.
Then the HSIC estimate is given by the following:

\[ I_{HSIC}(X, Y) = \| C_{X,Y} \|_{HS}, \tag{3.15} \]

where $\| \cdot \|_{HS}$ is the Hilbert-Schmidt norm and $C_{X,Y}$ is the cross-covariance operator between $X$ and $Y$:

\[ C_{X,Y} = \mathbb{E}_{X,Y} \left( [k_1(\cdot, X) - \mu_X] \otimes [k_2(\cdot, Y) - \mu_Y] \right) \tag{3.16} \]

where $\mu_X = \mathbb{E}_X[k_1(\cdot, X)]$ and $\mu_Y = \mathbb{E}_Y[k_2(\cdot, Y)]$ are the mean embeddings of $X$ and $Y$, respectively, into a Reproducing Kernel Hilbert Space, and $k_1$ and $k_2$ are kernels on $X$ and $Y$, respectively. For more details on the theoretical background see [Gretton et al., 2005].

We apply an open source Python implementation of the above algorithms from the Information Theoretical Estimators Toolbox¹⁵ [Szabó, 2014].¹⁶

Figure 3.4: Roadmap of analyses. On the top: Pillar 1, Performance testing: broad comparison across data sources, ML models and modalities. Based on this, $S_L$, $S_V$, $S_S$ are narrowed down to a particular combination of model and data source. In the middle: Pillar 2, structural cluster analysis to discover embedding concepts. At the bottom: Pillar 3, independence analysis of embeddings.

3.3 Analysis Scheme

Figure 3.4 represents a roadmap of our three pillar analysis.¹⁷ On the top is Pillar 1, Performance testing: a broad comparison across data sources, ML models and modalities, which will be presented in Chapter 4. Based on this, $S_L$, $S_V$, $S_S$ are narrowed down to a particular combination of model and data source. In Chapter 5 we shift our focus to a more in-depth analysis of fewer models, based on the findings of the preceding blanket studies. There, we restrict ourselves to behavioural tests, but we inspect our models in a more fine-grained fashion with regard to size and distribution ranges. In the middle of the figure is Pillar 2, structural cluster analysis to discover embedding concepts, and at the bottom is Pillar 3, independence analysis of embeddings. Chapter 6 includes these two parts of our transparency analysis. There, we first focus on the structure of each embedding type $E_L$, $E_S$ and $E_V$.
Lastly, we measure the information gain that $E_S$ and $E_V$ entail when combined with $E_L$.

Narrowing the umbrella studies down to a few model, data and modality combinations is another layer on top of the three pillar analysis framework. However, this layer is not necessary for our proposed evaluation methodology. Costly large-scale studies with numerous current models would shortly become obsolete. Our aim is rather to provide a general framework with proof-of-concept studies, which can be applied to various models in the future.

¹⁵ https://bitbucket.org/szzoli/ite
¹⁶ We would also like to thank Zoltán Szabó for his counsel on the theoretical background.
¹⁷ Icons made by Freepik, Smashicons, Good Ware, Eucalyp and Becris from https://flaticon.com/authors/. Voronoi diagrams were generated using http://alexbeutel.com/webgl/voronoi.html.

Chapter 4

Impact of Visual Information in Semantics

This chapter covers experiments which form an implementation of Pillar 1, Performance testing (Figure 3.4, top). The experiments work towards a comprehensive analysis of models across data sources, machine learning models and modalities. We introduce the implementation of a new structured hybrid modality, which is based on small data and lies in between low-level visual information and high-level linguistic, symbolic data. We use evaluations which we refer to as black-box testing, since they look only at performance numbers. However, by performing a broad study we aim to offer a more comprehensive analysis of multi-modal models than previous work.

The experiments are designed to address our research Questions 1, 2 and 3, laid out in Section 1.1. To recap and frame them in our Semantic Embedding model framework (Section 2.6):

1. How does the source of images $D_V$ affect the performance of multi-modal semantic representations?

2. Does the number of images have an impact on performance? (Variability of the visual extraction function $X_V$.)

3. Do previous findings on complementary visual information scale to different types and sizes of linguistic corpora? (Variability of the observable context data $O_L$, $O_V$, $O_S$, and the introduction of a new extraction function $X_S$ for structured data.)

In Section 4.1 we present a systematic study of the performance of state-of-the-art image data sources and CNN architectures, and measure the impact of image quantity (Questions 1 and 2). In Section 4.2 we introduce a new embedding type based on a visually structured, textual data source, the Visual Genome Scene Graphs [Krishna et al., 2016], and present preliminary studies of its performance as a sanity check. In Section 4.3 we present a broader analysis involving the models from the previous sections, extended with new ones. We tackle Question 3 by comparing several data sources of different sizes and modalities. Section 4.4 presents a study of how pretrained word embedding initialisation affects sequence model performance on textual entailment.

4.1 Comparing Visual Models and Data Sources for Semantics

This section focuses on the analysis of $E_L + E_V$ type multi-modal word embeddings with mid-fusion and various Convolutional Neural Network based visual representations $E_V$. The study explores the following questions regarding semantic similarity and relatedness tasks:

1. How important is the source of images $D_V$? Is there a difference between search engines and manually annotated data sources?

2. How should we aggregate the image representations for a search key into one visual representation? (Post-processing part of the visual mapping function $A_V$.)

3. Does the number of images obtained for each search key matter? (Variability of the visual extraction function $X_V$.)

4. Does the choice of the CNN architecture have an impact on the performance of visual and multi-modal models? (ML algorithm part of the visual mapping function $A_V$.)
To address the first question, we decided to use different search engines and other existing image datasets. For that purpose, we extended Douwe Kiela's MMFeat toolkit¹ with an API for the Flickr search engine. Later on we continued working on a joint project addressing the above questions in multi-modal distributional word semantics. The results have been published in an EMNLP long paper [Kiela et al., 2016].²

In this project, we systematically compared deep visual representation learning techniques, experimenting with three well-known network architectures: AlexNet, GoogLeNet and VGGNet (see Section 2.3.1). In addition, we explored the various data sources (described in Section 3.1.1) that can be used for retrieving relevant images, showing that images from search engines perform as well as, or better than, those from manually crafted resources such as ImageNet. Furthermore, we explored the optimal number of images and the multi-lingual applicability of multi-modal semantics.

4.1.1 Evaluation

We employ the behavioural evaluation tasks described in detail in Section 3.2.1. In summary, model performance is assessed through the Spearman $\rho_s$ rank correlation between the system's similarity scores for the given word pairs and the corresponding human judgements. We evaluate on two well-known similarity and relatedness judgement datasets: MEN [Bruni et al., 2014] and SimLex-999 [Hill et al., 2015].

In each experiment, we examine the performance of the visual representations compared to text-based representations, as well as the performance of the multi-modal representation that fuses the two. In this case, we apply mid-level fusion, a popular technique in multi-modal semantics (described earlier), concatenating the L2-normalized representations. Linguistic representations are 300-dimensional and are obtained by applying skip-gram with negative sampling to a 2013 dump of Wikipedia.
Visual vectors based on AlexNet and VGGNet are both 4096-dimensional, while GoogLeNet vectors have 1024 dimensions. The normalization step that is performed before fusion ensures that both modalities contribute equally to the overall multi-modal representation.

¹ https://github.com/douwekiela/mmfeat
² I implemented the Flickr API and all the data collection, experiments and evaluations presented in this thesis.

We evaluated the different architectures and data sources using either the mean or the elementwise maximum to aggregate image representations into visual ones ($A_V$ post-processing). However, we found no significant difference between these two methods.

4.1.2 Results

Figure 4.1: The effect of the number of images on representation quality.

We found that multi-modal representation learning yields better performance across the board: for different network architectures, different data sources and different aggregation methods (Figure 4.1).

We examined AlexNet, GoogLeNet and VGGNet, all three winners of the ILSVRC ImageNet classification challenge, and found that they perform very similarly. If efficiency or memory is an issue, AlexNet or GoogLeNet are the most suitable architectures. For overall best performance, AlexNet and VGGNet are the best choices.

The choice of data source has a bigger impact: Google, Bing, Flickr and ImageNet were much better than the ESP Game dataset. Google, Flickr and Bing have the advantage of potentially unlimited coverage. Google and Bing are particularly suited to full-coverage experiments, even when these include abstract words [Kiela et al., 2016].

Another question is the number of images we want to use: does performance increase with more images? There is an obvious trade-off here, since downloading and processing images takes time (and may incur financial costs). This experiment only applies to relevance-sorted image search data sources.
We found that the number of images has an impact on performance, but that performance stabilizes at around 10-20 images, indicating that it is usually not necessary to obtain more than 10 images per word. For Flickr, obtaining more images is detrimental to performance. The effect of the number of images on performance is shown in Figure 4.1.

4.1.3 Conclusion

This work explores some important factors in choosing visual models and data sources for multi-modal semantics. It is important to note that the multi-modal results only apply to the mid-level fusion method of concatenating normalized vectors: although these findings are indicative of performance for other fusion methods, different architectures or data sources may be more suitable for different fusion methods.

Understanding what it is that makes these representations perform so well is another important question. Is it more data or the multi-modal nature of the data that increases performance? Building on these preliminary findings, in Section 4.3 we explore a broader range of factors which may shed more light on the behaviour of visual models in multi-modal semantics.

4.2 Visual Context in the Linguistic Domain

Despite the indisputable success of data-driven methods in NLP, humans' ability to generalise after having been exposed to only a small amount of data provides motivation to further explore alternative machine learning methods. An appealing option is to exploit structured prior information combined with multi-modal input. There is a need for more work on applying and automatically acquiring structured prior information that can help us to take a step towards human-level and interpretable language generation and understanding.

The second key contribution (II.) of this thesis is the introduction and analysis of a new modality (Section 2.5). The study presented here aims to explore the possibilities for learning semantic word representations based on structured and visually grounded prior information.
This way we further explore the types of text corpora we use, expanding on Question 3. We use the Visual Genome (VG) dataset's scene graphs and bounding boxes as structured training data (introduced in Section 3.1.2). Visual Genome images are annotated with region graph representations of objects, attributes and pairwise relationships. These region graphs are combined to form a scene graph with all the objects grounded to the image (see Figure 3.2).

The main questions this work aims to examine are the following: What is the information coming from (structured) image data? Is it the high-level information of visual scene structure that enhances linguistic information, or do low-level visual features matter as well?

4.2.1 Scene Graph Context

We introduce a new Semantic Embedding model $S_S$. There could be many ways to incorporate structured, visually grounded prior information from VG, such as using graph neural networks [Scarselli et al., 2008] as part of the mapping function $A_S$. In this work, we implemented a much simpler method in order to see whether a small, fast-to-train model performs well. Instead of developing a new mapping function, we introduce a new extraction function $X_S$, which extracts the relevant context information from the scene graphs and then feeds it into a simple shallow network as $A_S$.

Using the scene graph annotations as a corpus, $X_S$ takes as input the whole scene graph dataset $D_S$ and returns "relevant" context items from $O_S$ for each target element from $T$. That is, it returns a mapping from target/context item pairs to numbers in $\mathbb{N}$, representing a relevance score of context pairs: $X_S : T \to (O_S(T) \to \mathbb{N})$. In this case the score is binary, representing whether a context node $o \in O_S$ is in the graph neighbourhood of the surface representations of $t \in T$. The relevant context corresponds to a radius in this graph around an object or predicate node. The radius is the number of steps we take starting from a node in a breadth-first search manner.
The context words are all the node labels within this sub-graph. Algorithm 1 presents the pseudocode for the Scene Graph Context Generation Algorithm. $G$ denotes the scene graph and $rad$ is the radius. The algorithm returns a list of (word, context) pairs $[\langle t_1, o_1 \rangle, \ldots, \langle t_n, o_n \rangle]$. Each node in $G$ can have several word labels or "names" (e.g. elephant and animal can be names of the same object node). We take all the combinations of the names of two nodes which are in each other's context. This operation is denoted by the direct product of the two name lists, $\times$. For example, if node {elephant, animal} is in the neighbourhood of the node with label {sleep}, then we generate the context pairs $[\langle$elephant, sleep$\rangle$, $\langle$animal, sleep$\rangle]$.

In this case the mapping function $A_S$ is a Skip-gram algorithm [Mikolov et al., 2013b], which maps from context items to a word embedding space $E_S \in \mathbb{R}^{|T| \times d_S}$, $d_S = 300$. Figure 4.2 shows an example of creating contexts for embeddings from Visual Genome Scene Graphs. The context words (orange) used are up to three links from a target node (black).

Algorithm 1: Scene Graph Context Generation Algorithm
    Input: G, rad
    Result: contexts = [<t_1, o_1>, ..., <t_n, o_n>]
    for node ∈ G do
        context_nodes = breadth_first_traverse(node, rad);
        for cnode ∈ context_nodes do
            contexts += [node.names × cnode.names]
        end
    end

Visual Genome scene graphs have been used for word meaning representations before [Kuzmenko and Herbelot, 2019, Herbelot, 2020]. These works build a truth-theoretic model including predicate/entity pairs before feeding it to a skip-gram model. Our method is more relaxed, since we directly process the Scene Graphs into contexts of a given size (radius), without any further restriction based on grammatical information. The results are compared in Section 5.2.4.

This model is linguistic in the sense that it only uses text context in the graph neighbourhood, without grounding it to visual features.
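The context generation of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the adjacency-dictionary graph format, the `names` mapping, the toy graph and both function names are hypothetical, as are the choices to follow edges in both directions and to exclude a node from its own context.

```python
from collections import deque
from itertools import product


def neighbours_within_radius(graph, start, radius):
    """Breadth-first search: nodes reachable from `start` in 1..radius steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == radius:
            continue  # do not expand beyond the radius
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, depth + 1))
    return found


def scene_graph_contexts(graph, names, radius):
    """Return (target word, context word) pairs as in Algorithm 1."""
    contexts = []
    for node in graph:
        for cnode in neighbours_within_radius(graph, node, radius):
            # Direct product of the two name lists.
            contexts.extend(product(names[node], names[cnode]))
    return contexts


# Hypothetical toy scene graph: object o1 -- predicate p1 -- object o2.
toy_graph = {"o1": ["p1"], "p1": ["o1", "o2"], "o2": ["p1"]}
toy_names = {"o1": ["elephant", "animal"], "p1": ["sleep"], "o2": ["grass"]}

pairs_rad1 = scene_graph_contexts(toy_graph, toy_names, radius=1)
pairs_rad2 = scene_graph_contexts(toy_graph, toy_names, radius=2)
```

On this toy graph, a radius of 1 pairs both elephant and animal with sleep, while grass only enters their context at radius 2. The resulting pair list is the kind of input that would then be passed to a Skip-gram trainer.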
Although it is trained on text only, the model still uses visual information implicitly, since the scene graph represents relationships in visual scenes. Different versions of the above model are compared to the following baselines:

1. w2v-wikipedia: A traditional skip-gram model trained on a 2013 dump of Wikipedia.

2. w2v-descriptions: A skip-gram model trained on the Visual Genome image descriptions.

Figure 4.2: Generating contexts for embeddings from Visual Genome Scene Graphs. The context words (orange) used are up to three links from a target node (black). The pairs are then fed to a Skip-gram algorithm. Photos are from https://visualgenome.org/

For evaluation we perform the following intrinsic and extrinsic tests:

• Semantic relatedness/similarity on the MEN [Bruni et al., 2014], SimLex [Hill et al., 2015] and SimVerb [Gerz et al., 2016] datasets.

• Brain data: predicting patterns of brain activity associated with the meaning of nouns, making use of two datasets, one with fMRI (Functional Magnetic Resonance Imaging) [Mitchell et al., 2008] and one with MEG (Magnetoencephalography) [Sudre et al., 2012] (see Section 4.3.4).

4.2.2 Results

Table 4.1 shows some preliminary results using Scene Graph context, that is, context based on the proximity of words in the Visual Genome Scene Graph. N in "radN" indicates the number of steps we take around a node in a breadth-first search manner. The context words are all the node labels within this radius. Results are shown for both lemmatised and non-lemmatised versions of the scene graph corpus. There is no substantial difference after using this preprocessing step (non-lemmatised versions even perform slightly better on MEN and SimLex), therefore we do not lemmatise in the following experiments.

Lemmatised  Method            MEN    SimLex  SimVerb
No          VG rad3           0.433  0.274    0.008
            w2v-wikipedia     0.680  0.238    0.149
Yes         VG rad3           0.433  0.274    0.132
            w2v-wikipedia     0.673  0.257    0.134
No          VG rad1           0.211  0.16    -0.031
            w2v-wikipedia     0.680  0.238    0.238
Yes         VG rad1           0.206  0.154    0.040
            w2v-wikipedia     0.673  0.257    0.134
Yes         w2v-description   0.427  0.289    0.127

Table 4.1: Pearson correlations of the different versions of the model and the Skip-gram baseline on the MEN, SimLex and SimVerb datasets. N in "radN" indicates the number of steps we take around a node in a breadth-first search manner. The context words are all the node labels within this radius. Results are shown for both lemmatised and non-lemmatised versions of the scene graph corpus.

Using a radius of three, our model outperforms the w2v-wikipedia and w2v-description baselines on SimLex, but it performs worse on the other datasets. Further results on behavioural tasks and brain imaging datasets are discussed in Section 4.3.

4.2.3 Conclusion

Based on these preliminary results, using structured small data is a promising area to explore. Despite its small size, structured training data can achieve results comparable to our baseline, which is based on a big corpus. Collecting such data by manual labour is expensive, but it is probably worthwhile to explore crowd-sourced, gamified or even (semi-)automatic techniques [Xu et al., 2020] for collecting structured training data. We report on a broader-scale analysis of various models, including the ones introduced in this section and in Section 4.1, in the following section.

4.3 Modalities, Sources and Models: a Thorough Analysis

In the previous sections we investigated the impact of visual models and data sources on non-visual evaluations. We compared different convolutional networks for visual embeddings and different image sources. We also experimented with a "small-data" based embedding, using structured information somewhere between the visual and the linguistic domains.

There are two main problems, however, from which the multi-modal literature (including the above studies) suffers:

1. Evaluation datasets that are too small and probably not well formed [Faruqui et al., 2016].

2. Lack of standardized comparative studies involving many different models.
The first problem is a challenging one due to the cost of data collection. Traditional semantic similarity and relatedness tasks can provide a good starting point for evaluating word semantics, but we certainly need a more thorough analysis if we really want to compare semantic embedding spaces. Recently, the NLP community has started evaluating on brain imaging data as well (see Section 3.2.2), in the hope of learning about the relationship between word embeddings and the brain activation of people thinking of the corresponding concepts. These datasets are relatively expensive to create, hence they are not very large. While evaluating on them can provide interesting insights, we should be cautious when drawing conclusions from these results.

In the following study we use both semantic similarity/relatedness and brain datasets for evaluation. Unlike previous work, however, we try to take a further step towards a more in-depth analysis of the results, in order to filter out the potential noise we face in these experiments, which comes from the different models and the small evaluation sets.

As for the second problem, multi-modal models are usually compared to only one linguistic baseline and, perhaps with the exception of our study in Section 4.1, to only one visual source/model combination. Here, we present a broader study involving several different visual and linguistic embeddings in order to get a better picture of the variance in performance, tackling our Question 3.

All the experiments have been implemented as part of the EmbEval toolkit (see Section 1.3), including the creation of uni-modal embeddings as well as new mid-fusion techniques (described in Section 4.3.2).

4.3.1 Studied Embeddings

In the following we summarise the parameters of the studied Semantic Embedding models, which were described in detail in Chapters 2 and 3.

4.3.1.1 Linguistic Embeddings

For the $S_L$ models we use pretrained embeddings from the FastText system [Mikolov et al., 2018].
Each model has been trained on different sources $D_L$:

1. wiki-news-300d-1M: 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens).

2. wiki-news-300d-1M-subword: 1 million word vectors trained with subword information on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens).

3. crawl-300d-2M: 2 million word vectors trained on Common Crawl (600B tokens).

Furthermore, for comparison with earlier work we also use the same Skip-Gram model as before, trained on a Wikipedia dump from 2013.

4.3.1.2 Visual Embeddings

Based on the findings in Section 4.1 we test the following datasets and models for $S_V$:

Image Source $D_V$. We use Google Images as a source, as it had a stable performance across models and is widely used. We compare this big data source to visual representations trained on Visual Genome images, a smaller but systematically annotated dataset.

ML part of $A_V$. Based on previous findings, we use AlexNet, which was among both the best-performing and the fastest CNN models. Since the publication of the results in Section 4.1, a new CNN architecture called Deep Residual Network (ResNet) [He et al., 2016] has appeared, which is the current state of the art in object recognition on images in terms of both classification accuracy and speed. Therefore, we include this model in this broader study as well. We also compare two AlexNet models trained on Visual Genome images, one on the internal object bounding box images and one on the whole images, similarly to [Davis et al., 2019].³

Post-processing part of $A_V$. Since our findings in [Kiela et al., 2016] suggest no obvious difference between the two methods, here we only use the mean of the image embedding vectors (as opposed to taking the maximum) to create one visual representation for a word.

Extraction function $X_V$. Furthermore, since performance plateaus across the board after 10-20 images, in this study we always use 10 images for each word representation.
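The post-processing and extraction choices above can be illustrated with a short sketch. It is a hedged example rather than the MMFeat implementation: the function name, its parameters and the plain-list vector format are assumptions made for illustration (real CNN features would typically be NumPy arrays with 4096 or 1024 dimensions).

```python
from statistics import fmean


def aggregate_image_vectors(image_vectors, n_images=10, method="mean"):
    """Collapse the first `n_images` image vectors of a word into one vector.

    `image_vectors` is assumed to be relevance-sorted, one feature vector
    (list of floats) per retrieved image.
    """
    selected = image_vectors[:n_images]
    if not selected:
        raise ValueError("no image vectors available for this word")
    dimensions = list(zip(*selected))  # regroup values by dimension
    if method == "mean":
        return [fmean(dim) for dim in dimensions]
    if method == "max":
        return [max(dim) for dim in dimensions]
    raise ValueError(f"unknown aggregation method: {method}")


# Toy 2-dimensional "CNN features" for three images of one word.
toy_vectors = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
mean_of_first_two = aggregate_image_vectors(toy_vectors, n_images=2)
max_of_all = aggregate_image_vectors(toy_vectors, method="max")
```

In this sketch, fixing `n_images=10` and `method="mean"` corresponds to the setting used in this study, while `method="max"` gives the elementwise maximum that was compared against it in Section 4.1.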
4.3.1.3 Structured Embeddings

We analyse the version of $X_S$ from Section 4.2 in which we take three steps around a node in a breadth-first search manner.

4.3.2 Mid-fusion methods

To create multi-modal embeddings using mid-fusion we applied two methods:

1. Intersection: As in previous work, a multi-modal embedding is the concatenation of the visual and linguistic vectors. Therefore, we only have representations for the intersection of the two vocabularies. This is mainly relevant in the case of Visual Genome, where we might not have full coverage (as opposed to Google).

2. Padding: In order to have full coverage in every case, if one modality does not cover a word in the vocabulary, we pad the multi-modal vector with as many zeros as the dimensionality of the modality space with the missing vector. This way we have multi-modal embeddings for all the words in the intersection of the vocabularies, and zero-padded uni-modal vectors where one of the modalities failed to cover the word.

³ The models were trained by Christopher Davis. For that paper, I supervised the experiments, helped with the use of MMFeat and provided helper code for processing Visual Genome.

4.3.3 Evaluation Methods

Evaluation of word embeddings on similarity tasks has been shown to be problematic due to 1) the lack of train/development/test splits, 2) the absence of statistical significance tests, 3) low correlation with downstream performance, 4) the hubness problem and 5) their inability to account for polysemy [Faruqui et al., 2016]. To tackle the first problem we performed three-way cross-validation on MEN and SimLex, leaving out one third of the word pairs at random. Based on the results, reported in Appendix A, we present correlation figures to two decimal places. As for the second issue, we present a series of detailed evaluation methods in the next chapters, which aim to unearth the reasons behind the behaviour of our models beyond correlation.
We also report p-values for every correlation score. Problems 4) and 5) are addressed in Chapters 5 and 6. As we discussed in Section 2.1.2, in this work we view semantic space analysis as a statistical tool for dataset analysis which provides value on its own, without downstream applications; therefore 3) is beyond the scope of this thesis.

We cannot directly compare models trained on different data sources, because they have different coverage, but we can look at absolute performance and compare network architectures and modalities. We also present results on the common subset of the evaluation datasets, where all word pairs have images in each of the data sources.

For embedding comparison, results on the Brain datasets are averaged over participants. We present further analyses in which results are averaged over modalities, so that we can focus more on the variability between participants.

4.3.3.1 Concreteness

Concreteness of words has been studied before in the context of multi-modal semantics and for Brain imaging evaluation. Kiela et al. [2014] applied a dispersion metric in the visual domain to filter out words whose image results are noisier than a threshold. They hypothesised that abstract words have higher image dispersion, whereas concrete words have lower image dispersion. Anderson et al. [2017] systematically selected word categories for their Italian dataset based on concreteness.

In this work we developed an automatic concreteness score based on WordNet. The concreteness score of a word is its distance (one minus similarity) from its root hypernyms in the Synset graph. Since WordNet has multiple synsets for one surface form, we compare two different techniques to aggregate the distances of the synsets from the root:

1. Taking the median of the distances of all synsets of a word.

2. Selecting the synset with the maximum distance from the root, which gives the most concrete sense of the word.
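The two aggregation techniques can be sketched on a toy hypernym hierarchy. This is an illustrative example only: the hierarchy, the synset identifiers and the function names are hypothetical, and the similarity is assumed to be a path similarity of the form 1 / (1 + number of hypernym steps to the root), which may differ from the WordNet similarity actually used.

```python
from statistics import median

# Hypothetical toy hypernym hierarchy (child -> parent); None marks the root.
HYPERNYM = {
    "entity": None,
    "object": "entity",
    "animal": "object",
    "dog#animal": "animal",
    "person": "entity",
    "dog#scoundrel": "person",
}

# Hypothetical mapping from surface forms to their synsets.
SYNSETS = {"dog": ["dog#animal", "dog#scoundrel"], "entity": ["entity"]}


def steps_to_root(synset):
    """Number of hypernym links between a synset and its root."""
    steps = 0
    while HYPERNYM[synset] is not None:
        synset = HYPERNYM[synset]
        steps += 1
    return steps


def root_distance(synset):
    """One minus the (assumed) path similarity between synset and root."""
    return 1.0 - 1.0 / (1.0 + steps_to_root(synset))


def wn_concreteness(word, aggregate="median"):
    """Aggregate the root distances of all synsets of `word`."""
    distances = [root_distance(s) for s in SYNSETS[word]]
    if aggregate == "median":
        return median(distances)
    if aggregate == "max":
        return max(distances)
    raise ValueError(f"unknown aggregation: {aggregate}")


dog_median = wn_concreteness("dog", "median")
dog_max = wn_concreteness("dog", "max")
```

In this toy hierarchy the animal sense of dog lies three steps from the root and the figurative sense two steps, so the maximum picks the deeper, more concrete sense, while the median lies between the two.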
Hence, the formula for our WordNet concreteness score is:

\[ WNConc(w) = \mathrm{Agg}_w \left[ d(s_i, r_i) \mid i \in \{1, \ldots, N_w\} \right], \tag{4.1} \]

where $w$ is a word, $\mathrm{Agg}_w(\cdot)$ is the synset aggregation method and $d(\cdot, \cdot)$ is the WordNet distance. $s_i$ are the synsets of $w$, $r_i$ are the roots of each synset in the WordNet hypernym hierarchy, and $N_w$ is the number of synsets of $w$.

Another question is how we should combine the concreteness scores of the two words in a word pair in the behavioural tasks. We present two methods to do this:

1. Taking the sum of the two words' concreteness scores.

2. Taking the absolute difference of the two words' concreteness scores.

4.3.3.2 Qualitative Analysis on Nouns of the Brain Datasets

Lastly, we performed a qualitative analysis of the 60 nouns in the Brain evaluation datasets. Looking at the word concreteness scores did not show any pattern, but this is unsurprising, since this dataset already consists mainly of concrete nouns.

Instead, in this work we included an analysis of the relationship between all studied models in terms of their performance for individual words, averaged over participants (Figures 4.5 and 4.6). Even though this evaluation set is small in terms of vocabulary size, it can still be useful for looking into the nuances we may find regarding individual concepts.

4.3.4 Results

The tables in this section show evaluation scores for each task using different versions of the evaluation methods. The notation for all tables is the following. Each line corresponds to an embedding. Separator lines divide embeddings by modality: Linguistic $E_L$, Visual $E_V$, Structured $E_S$ and the Multi-modal models $E_L + E_V$ and $E_L + E_S$. wikinews, wikinews sub and crawl signify FastText vectors trained on the corresponding corpora. w2v13 is a Skip-Gram model trained on a 2013 Wikipedia dump. The names of Visual Embeddings trained on Google images pair the image source with the CNN model, as in Google AlexNet.
VG-internal|external denotes training on Visual Genome images, either on the internal object images or on the whole images, as done in [Davis et al., 2019]. Finally, VG SceneGraph stands for the Visual Genome Scene Graph Embeddings from Section 4.2. Multi-modal embeddings have a "+" in their names, which separates the names of the two embeddings they are built on.

Red colour indicates the best performance; blue means that the multi-modal embedding outperformed the corresponding uni-modal ones. In the case of results aggregated for each modality, the best performance is signified by bold font.

4.3.4.1 Correlations on the Behavioural Tasks

Tables 4.2-4.7 present the standard Spearman's correlation scores of the different embeddings on the Semantic Similarity and Relatedness tasks. Tables 4.2, 4.3 and 4.4 present results on the full datasets, while Tables 4.6 and 4.7 include results on the embeddings' common coverage subsets. Results using the padding mid-fusion method are shown in Tables 4.2 and 4.3; results applying the intersection method are presented in Tables 4.4, 4.5, 4.6 and 4.7.

Except for the relatively small common subset on SimLex (Table 4.7), the crawl linguistic embedding outperforms all the others. Multi-modal models outperform uni-modal ones mainly on SimLex, but only in the case of the 2013 Skip-Gram model, which is in line with the previous results in Section 4.2. The only multi-modal model outperforming uni-modal ones on MEN is the combination of w2v13 and VG Scene Graph.
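To make explicit how the scores in these tables come about, the following sketch combines the two mid-fusion methods of Section 4.3.2 with a Spearman evaluation on word pairs. It is a simplified illustration and not the EmbEval code: all function names and the toy data are hypothetical, L2 normalisation before concatenation is carried over from Section 4.1.1 as an assumption, and Spearman's correlation is computed by hand to keep the example free of dependencies (p-values are omitted).

```python
import math


def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mid_fusion(ling, vis, mode="padding"):
    """Concatenate normalised vectors; zero-pad a missing modality if requested."""
    dim_l = len(next(iter(ling.values())))
    dim_v = len(next(iter(vis.values())))
    vocab = set(ling) & set(vis) if mode == "intersection" else set(ling) | set(vis)
    fused = {}
    for word in vocab:
        left = l2_normalise(ling[word]) if word in ling else [0.0] * dim_l
        right = l2_normalise(vis[word]) if word in vis else [0.0] * dim_v
        fused[word] = left + right
    return fused


def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the rank-transformed scores."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var) if var > 0 else float("nan")


def evaluate(embedding, pairs):
    """Return (Spearman correlation, coverage) over the covered word pairs."""
    covered = [(a, b, s) for a, b, s in pairs if a in embedding and b in embedding]
    system = [cosine(embedding[a], embedding[b]) for a, b, _ in covered]
    human = [s for _, _, s in covered]
    return spearman(system, human), len(covered)


# Hypothetical toy data: "car" has no visual vector.
ling = {"cat": [1, 0], "dog": [0.9, 0.1], "car": [0, 1]}
vis = {"cat": [2, 0, 0], "dog": [1.8, 0.2, 0]}
pairs = [("cat", "dog", 9.0), ("cat", "car", 1.0), ("dog", "car", 2.0)]

padded = mid_fusion(ling, vis, mode="padding")
intersected = mid_fusion(ling, vis, mode="intersection")
rho_padded, coverage_padded = evaluate(padded, pairs)
_, coverage_intersected = evaluate(intersected, pairs)
```

With padding, all three toy pairs are covered, whereas with intersection only the pair whose words have both modalities remains. This mirrors the difference in the Coverage column between the uni-modal Visual Genome rows and their padded multi-modal counterparts.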
| Modality | Embedding | Spearman | P-value | Coverage |
|---|---|---|---|---|
| $E_L$ | wikinews | 0.79 | 0 | 3000 |
| | wikinews sub | 0.80 | 0 | 3000 |
| | crawl | 0.85 | 0 | 3000 |
| | w2v13 | 0.68 | 0 | 3000 |
| $E_V$ | Google AlexNet | 0.50 | 0 | 3000 |
| | Google VGG | 0.51 | 0 | 3000 |
| | VG-internal | 0.37 | 0 | 2784 |
| | VG-whole | 0.41 | 0 | 2784 |
| | Google ResNet-152 | 0.47 | 0 | 3000 |
| $E_S$ | VG SceneGraph | 0.42 | 0 | 2574 |
| $E_L + E_V$ | wikinews+Google AlexNet | 0.50 | 0 | 3000 |
| | wikinews+Google VGG | 0.51 | 0 | 3000 |
| | wikinews+VG-internal | 0.36 | 0 | 3000 |
| | wikinews+VG-whole | 0.39 | 0 | 3000 |
| | wikinews+Google ResNet-152 | 0.48 | 0 | 3000 |
| | wikinews sub+Google AlexNet | 0.50 | 0 | 3000 |
| | wikinews sub+Google VGG | 0.51 | 0 | 3000 |
| | wikinews sub+VG-internal | 0.36 | 0 | 3000 |
| | wikinews sub+VG-whole | 0.39 | 0 | 3000 |
| | wikinews sub+Google ResNet-152 | 0.47 | 0 | 3000 |
| | crawl+Google AlexNet | 0.51 | 0 | 3000 |
| | crawl+Google VGG | 0.52 | 0 | 3000 |
| | crawl+VG-internal | 0.37 | 0 | 3000 |
| | crawl+VG-whole | 0.40 | 0 | 3000 |
| | crawl+Google ResNet-152 | 0.51 | 0 | 3000 |
| | w2v13+Google AlexNet | 0.50 | 0 | 3000 |
| | w2v13+Google VGG | 0.51 | 0 | 3000 |
| | w2v13+VG-internal | 0.36 | 0 | 3000 |
| | w2v13+VG-whole | 0.40 | 0 | 3000 |
| | w2v13+Google ResNet-152 | 0.48 | 0 | 3000 |
| $E_L + E_S$ | w2v13+VG SceneGraph | 0.64 | 0 | 3000 |
| | crawl+VG SceneGraph | 0.78 | 0 | 3000 |
| | wikinews sub+VG SceneGraph | 0.37 | 0 | 3000 |
| | wikinews+VG SceneGraph | 0.57 | 0 | 3000 |

Table 4.2: Spearman correlation on the MEN dataset. Multi-modal embeddings are created using the Padding technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance. Blue would mean that the multi-modal embedding outperformed the corresponding uni-modal ones, which did not happen here.
| Modality | Embedding | Spearman | P-value | Coverage |
|---|---|---|---|---|
| $E_L$ | wikinews | 0.45 | 0 | 999 |
| | wikinews sub | 0.44 | 0 | 999 |
| | crawl | 0.50 | 0 | 999 |
| | w2v13 | 0.31 | 0 | 999 |
| $E_V$ | Google AlexNet | 0.34 | 0 | 999 |
| | Google VGG | 0.34 | 0 | 999 |
| | VG-internal | 0.31 | 0 | 103 |
| | VG-whole | 0.19 | 0.06 | 103 |
| | Google ResNet-152 | 0.35 | 0 | 999 |
| $E_S$ | VG SceneGraph | 0.26 | 0 | 593 |
| $E_L + E_V$ | wikinews+Google AlexNet | 0.34 | 0 | 999 |
| | wikinews+Google VGG | 0.34 | 0 | 999 |
| | wikinews+VG-internal | 0.31 | 0 | 999 |
| | wikinews+VG-whole | 0.31 | 0 | 999 |
| | wikinews+Google ResNet-152 | 0.35 | 0 | 999 |
| | wikinews sub+Google AlexNet | 0.34 | 0 | 999 |
| | wikinews sub+Google VGG | 0.34 | 0 | 999 |
| | wikinews sub+VG-internal | 0.30 | 0 | 999 |
| | wikinews sub+VG-whole | 0.30 | 0 | 999 |
| | wikinews sub+Google ResNet-152 | 0.35 | 0 | 999 |
| | crawl+Google AlexNet | 0.34 | 0 | 999 |
| | crawl+Google VGG | 0.34 | 0 | 999 |
| | crawl+VG-internal | 0.32 | 0 | 999 |
| | crawl+VG-whole | 0.32 | 0 | 999 |
| | crawl+Google ResNet-152 | 0.37 | 0 | 999 |
| | w2v13+Google AlexNet | 0.34 | 0 | 999 |
| | w2v13+Google VGG | 0.34 | 0 | 999 |
| | w2v13+VG-internal | 0.23 | 0 | 999 |
| | w2v13+VG-whole | 0.23 | 0 | 999 |
| | w2v13+Google ResNet-152 | 0.35 | 0 | 999 |
| $E_L + E_S$ | w2v13+VG SceneGraph | 0.29 | 0 | 999 |
| | crawl+VG SceneGraph | 0.45 | 0 | 999 |
| | wikinews sub+VG SceneGraph | 0.20 | 0 | 999 |
| | wikinews+VG SceneGraph | 0.35 | 0 | 999 |

Table 4.3: Spearman correlation on the SimLex dataset. Multi-modal embeddings are created using the Padding technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance. Blue would mean that the multi-modal embedding outperformed the corresponding uni-modal ones, which did not happen here.
| Modality | Embedding | Spearman | P-value | Coverage |
|---|---|---|---|---|
| $E_L$ | wikinews | 0.79 | 0 | 3000 |
| | wikinews sub | 0.80 | 0 | 3000 |
| | crawl | 0.85 | 0 | 3000 |
| | w2v13 | 0.68 | 0 | 3000 |
| $E_V$ | Google AlexNet | 0.50 | 0 | 3000 |
| | Google VGG | 0.51 | 0 | 3000 |
| | VG-internal | 0.37 | 0 | 2784 |
| | VG-whole | 0.41 | 0 | 2784 |
| | Google ResNet-152 | 0.47 | 0 | 3000 |
| $E_S$ | VG SceneGraph | 0.42 | 0 | 2574 |
| $E_L + E_V$ | wikinews+Google AlexNet | 0.50 | 0 | 3000 |
| | wikinews+Google VGG | 0.51 | 0 | 3000 |
| | wikinews+VG-internal | 0.38 | 0 | 2784 |
| | wikinews+VG-whole | 0.41 | 0 | 2784 |
| | wikinews+Google ResNet-152 | 0.48 | 0 | 3000 |
| | wikinews sub+Google AlexNet | 0.50 | 0 | 3000 |
| | wikinews sub+Google VGG | 0.51 | 0 | 3000 |
| | wikinews sub+VG-internal | 0.37 | 0 | 2784 |
| | wikinews sub+VG-whole | 0.41 | 0 | 2784 |
| | wikinews sub+Google ResNet-152 | 0.47 | 0 | 3000 |
| | crawl+Google AlexNet | 0.51 | 0 | 3000 |
| | crawl+Google VGG | 0.52 | 0 | 3000 |
| | crawl+VG-internal | 0.38 | 0 | 2784 |
| | crawl+VG-whole | 0.42 | 0 | 2784 |
| | crawl+Google ResNet-152 | 0.51 | 0 | 3000 |
| | w2v13+Google AlexNet | 0.50 | 0 | 3000 |
| | w2v13+Google VGG | 0.51 | 0 | 3000 |
| | w2v13+VG-internal | 0.38 | 0 | 2784 |
| | w2v13+VG-whole | 0.41 | 0 | 2784 |
| | w2v13+Google ResNet-152 | 0.48 | 0 | 3000 |
| $E_L + E_S$ | w2v13+VG SceneGraph | 0.70 | 0 | 2574 |
| | crawl+VG SceneGraph | 0.81 | 0 | 2574 |
| | wikinews sub+VG SceneGraph | 0.45 | 0 | 2574 |
| | wikinews+VG SceneGraph | 0.65 | 0 | 2574 |

Table 4.4: Spearman correlation on the MEN dataset. Multi-modal embeddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance, blue means that the multi-modal embedding outperformed the corresponding uni-modal ones.
| Modality | Embedding | Spearman | P-value | Coverage |
|---|---|---|---|---|
| $E_L$ | wikinews | 0.45 | 0 | 999 |
| | wikinews sub | 0.44 | 0 | 999 |
| | crawl | 0.50 | 0 | 999 |
| | w2v13 | 0.31 | 0 | 999 |
| $E_V$ | Google AlexNet | 0.34 | 0 | 999 |
| | Google VGG | 0.34 | 0 | 999 |
| | VG-internal | 0.31 | 0 | 103 |
| | VG-whole | 0.19 | 0.06 | 103 |
| | Google ResNet-152 | 0.35 | 0 | 999 |
| $E_S$ | VG SceneGraph | 0.26 | 0 | 593 |
| $E_L + E_V$ | wikinews+Google AlexNet | 0.34 | 0 | 999 |
| | wikinews+Google VGG | 0.34 | 0 | 999 |
| | wikinews+VG-internal | 0.31 | 0 | 103 |
| | wikinews+VG-whole | 0.18 | 0.06 | 103 |
| | wikinews+Google ResNet-152 | 0.35 | 0 | 999 |
| | wikinews sub+Google AlexNet | 0.34 | 0 | 999 |
| | wikinews sub+Google VGG | 0.34 | 0 | 999 |
| | wikinews sub+VG-internal | 0.31 | 0 | 103 |
| | wikinews sub+VG-whole | 0.18 | 0.06 | 103 |
| | wikinews sub+Google ResNet-152 | 0.35 | 0 | 999 |
| | crawl+Google AlexNet | 0.34 | 0 | 999 |
| | crawl+Google VGG | 0.34 | 0 | 999 |
| | crawl+VG-internal | 0.31 | 0 | 103 |
| | crawl+VG-whole | 0.19 | 0.06 | 103 |
| | crawl+Google ResNet-152 | 0.37 | 0 | 999 |
| | w2v13+Google AlexNet | 0.34 | 0 | 999 |
| | w2v13+Google VGG | 0.34 | 0 | 999 |
| | w2v13+VG-internal | 0.31 | 0 | 103 |
| | w2v13+VG-whole | 0.18 | 0.06 | 103 |
| | w2v13+Google ResNet-152 | 0.35 | 0 | 999 |
| $E_L + E_S$ | w2v13+VG SceneGraph | 0.29 | 0 | 593 |
| | crawl+VG SceneGraph | 0.44 | 0 | 593 |
| | wikinews sub+VG SceneGraph | 0.30 | 0 | 593 |
| | wikinews+VG SceneGraph | 0.35 | 0 | 593 |

Table 4.5: Spearman correlation on the SimLex dataset. Multi-modal embeddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance. Blue would mean that the multi-modal embedding outperformed the corresponding uni-modal ones, which did not happen here.
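The difference in coverage between the Padding and Intersection mid-fusion variants can be illustrated with a short sketch. This is a hypothetical illustration and not the EmbEval code: it assumes that fusion is the concatenation of L2-normalised vectors, that Intersection keeps only the words present in both vocabularies, and that Padding keeps the union of the vocabularies and fills a missing modality with zeros. The function names and toy vectors are likewise invented for this example.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse_intersection(ling, vis):
    """Concatenate vectors for words that have BOTH a linguistic and a visual vector."""
    shared = ling.keys() & vis.keys()
    return {w: l2_normalise(ling[w]) + l2_normalise(vis[w]) for w in shared}


def fuse_padding(ling, vis):
    """Concatenate vectors for ALL words; a missing modality is zero-filled (assumption)."""
    ling_dim = len(next(iter(ling.values())))
    vis_dim = len(next(iter(vis.values())))
    fused = {}
    for w in ling.keys() | vis.keys():
        left = l2_normalise(ling[w]) if w in ling else [0.0] * ling_dim
        right = l2_normalise(vis[w]) if w in vis else [0.0] * vis_dim
        fused[w] = left + right
    return fused


# Hypothetical toy vocabularies: only "cat" is present in both modalities.
linguistic = {"cat": [3.0, 4.0], "idea": [1.0, 0.0]}
visual = {"cat": [0.0, 2.0, 0.0], "barn": [1.0, 0.0, 0.0]}

intersected = fuse_intersection(linguistic, visual)
padded = fuse_padding(linguistic, visual)
print(sorted(intersected))  # words covered by the Intersection variant
print(sorted(padded))       # words covered by the Padding variant
```

Under these assumptions Padding covers every word known to either modality, at the price of storing fused vectors for the whole union vocabulary, while Intersection evaluates only on the smaller shared vocabulary.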
| Modality | Embedding | Spearman | P-value | Coverage |
|---|---|---|---|---|
| $E_L$ | wikinews | 0.80 | 0 | 2481 |
| | wikinews sub | 0.80 | 0 | 2481 |
| | crawl | 0.84 | 0 | 2481 |
| | w2v13 | 0.67 | 0 | 2481 |
| $E_V$ | Google AlexNet | 0.52 | 0 | 2481 |
| | Google VGG | 0.51 | 0 | 2481 |
| | VG-internal | 0.38 | 0 | 2481 |
| | VG-whole | 0.41 | 0 | 2481 |
| | Google ResNet-152 | 0.47 | 0 | 2481 |
| $E_S$ | VG SceneGraph | 0.44 | 0 | 2481 |
| $E_L + E_V$ | wikinews+Google AlexNet | 0.52 | 0 | 2481 |
| | wikinews+Google VGG | 0.52 | 0 | 2481 |
| | wikinews+VG-internal | 0.38 | 0 | 2481 |
| | wikinews+VG-whole | 0.41 | 0 | 2481 |
| | wikinews+Google ResNet-152 | 0.48 | 0 | 2481 |
| | wikinews sub+Google AlexNet | 0.52 | 0 | 2481 |
| | wikinews sub+Google VGG | 0.51 | 0 | 2481 |
| | wikinews sub+VG-internal | 0.38 | 0 | 2481 |
| | wikinews sub+VG-whole | 0.41 | 0 | 2481 |
| | wikinews sub+Google ResNet-152 | 0.47 | 0 | 2481 |
| | crawl+Google AlexNet | 0.52 | 0 | 2481 |
| | crawl+Google VGG | 0.52 | 0 | 2481 |
| | crawl+VG-internal | 0.38 | 0 | 2481 |
| | crawl+VG-whole | 0.42 | 0 | 2481 |
| | crawl+Google ResNet-152 | 0.51 | 0 | 2481 |
| | w2v13+Google AlexNet | 0.52 | 0 | 2481 |
| | w2v13+Google VGG | 0.52 | 0 | 2481 |
| | w2v13+VG-internal | 0.38 | 0 | 2481 |
| | w2v13+VG-whole | 0.41 | 0 | 2481 |
| | w2v13+Google ResNet-152 | 0.49 | 0 | 2481 |
| $E_L + E_S$ | w2v13+VG SceneGraph | 0.70 | 0 | 2481 |
| | crawl+VG SceneGraph | 0.81 | 0 | 2481 |
| | wikinews sub+VG SceneGraph | 0.46 | 0 | 2481 |
| | wikinews+VG SceneGraph | 0.66 | 0 | 2481 |

Table 4.6: Spearman correlation on the common subset of the MEN dataset. Multi-modal embeddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance, blue means that the multi-modal embedding outperformed the corresponding uni-modal ones.

Interestingly, using ResNet does not provide any performance gain over AlexNet, similarly to the other more complicated models in Section 4.1. Both models are fast to run, and AlexNet sometimes performs even better, so there is no good reason to use ResNet in this task. Padding multi-modal vectors for bigger coverage does not help; in the case of w2v13+VG SceneGraph it even hurts performance.
However, this may be due to including more, and perhaps "harder", concept pairs in the test set than in the smaller intersection set.

The success of combining w2v13 and VG Scene Graph over other visual vectors is interesting. While image embeddings did not help on MEN, this embedding, which lies in between the visual and linguistic modalities, conveys some complementary information to this linguistic baseline.

Note that in this study we used our EmbEval toolkit for creating multi-modal embeddings, with two different types of mid-fusion methods. In Section 4.1 we used MMFeat, which includes slightly different mid-fusion techniques; therefore, the results are not directly comparable. The main point of this comprehensive study was to reveal patterns across several different sources, architectures and modalities. We also used the EmbEval toolkit in the efficiency studies in Chapter 5 and for the transparency analysis in Chapter 6.

4.3.4.2 Results on Brain Data

Results on the Brain datasets include scores from the 2 vs. 2 test, described in Section 3.2.2. These experiments have all been run using the Intersection mid-fusion technique, because padding did not make much of a difference in performance but requires much more memory.

In addition to the previous visual models, here we use the best performing models from [Davis et al., 2019], namely the internal bounding box images, the whole images and the combined image representations of Visual Genome. In some cases we created a nested multi-modal model where we combined their initial multi-modal models (denoted by MM) with all our linguistic models.

Tables 4.8, 4.9, 4.10 and 4.11 show the scores of each embedding for every participant, and their averages over participants. On average, multi-modal models are clearly bigger winners on this task than on the previous one. In all settings a multi-modal model achieved the highest performance.
In all but one case (MEG scores on the common subset vocabularies in Table 4.11), about half of the multi-modal models outperformed their corresponding uni-modal ones.

On the full datasets (Tables 4.8 and 4.9), VG SceneGraph and AlexNet improved the most on the fMRI and MEG datasets respectively. On the common subset evaluations, ResNet performed best. Interestingly, in all cases the combination with the older w2v13 linguistic model outperformed the combinations with FastText embeddings.

When it comes to individual participants, we see substantial variance. Tables 4.12, 4.13, 4.14 and 4.15 average performance over modalities for each participant. In all settings except for the common subset of the fMRI dataset (Table 4.14), multi-modal models achieve a higher average performance than the uni-modal ones in more than 50% of the cases. On the common subset of the fMRI dataset, the visual, structured and multi-modal averages are higher than the linguistic ones for all participants.

In a recent paper, [Pereira et al., 2018] also report high variance between subjects in the way their different systems (linguistic, visual, etc.) encode conceptual information: some are more visually oriented, others more linguistic, etc. Their dataset, however, includes abstract concepts as well, which may explain the lesser involvement of visual information.

One important observation is that the standard deviations on the maximally covered fMRI data are around 0.1, whereas on MEG data they are around 0.06. On the common subsets the numbers are around 0.08 and 0.09 respectively. In many cases the differences between models' average performances fall within these error margins. However, in most cases the improvements over uni-modal models go beyond this error.

4.3.4.3 Concreteness

Figures 4.3 and 4.4 show Spearman's correlation scores of each of our embeddings on splits of the MEN and SimLex datasets, of size 100.
On the x axis we see the index of word pairs in our respective evaluation sets, ordered by WordNet Concreteness, where concreteness for a word is computed using Equation 4.1. The dark blue line indicates the WordNet Concreteness score for each word pair; it is therefore plotted on a different axis than all the other lines, which represent correlation scores. Figure 4.3 depicts the case when the concreteness of a word pair is the sum of the concreteness scores of the individual words. In Figure 4.4 it is computed using their absolute difference. Both versions of synset aggregation for a word are presented: median distance and maximum distance (most concrete) selection.

Figure 4.3: Spearman's correlation on the full Semantic Similarity dataset splits, ordered by the sum of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Padding. The x axis shows the index of word pairs, ordered by WordNet Concreteness. There are two plots on top of each other for displaying the trend. The left y axis is the scale for the WordNet concreteness score (blue). The right y axis is the scale for Spearman's correlations for all the embeddings $E_{\langle modality \rangle}$.

Perhaps because of the size of the datasets, we can see a tendency in the scores on MEN but much less on SimLex. When word pairs are ordered by the sum of concreteness scores, we see a slightly upward trend as the concreteness score increases, especially in the median synset aggregation case. In the absolute difference concreteness ordering there is a steep growth for the first 5-10 splits, after which the rate of increase drops sharply.

Since we have a lot of embeddings, we use colour codes to separate embeddings by modality. Furthermore, we distinguish Visual Genome images, $E_{VG}$, from Google images, denoted by $E_{Google}$. An interesting observation is that $E_S$ behaves more like a visual embedding in this experiment.

Figure 4.4: Spearman's correlation on the full Semantic Similarity dataset splits, ordered by the difference of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Padding. The x axis shows the index of word pairs, ordered by WordNet Concreteness. There are two plots on top of each other for displaying the trend. The left y axis is the scale for the WordNet concreteness score (blue). The right y axis is the scale for Spearman's correlations for all the embeddings $E_{\langle modality \rangle}$.

A potential hypothesis is that for such abstract semantic tasks (as opposed to traditional multi-modal tasks, such as VQA) we may not need low level visual features. Instead, it is rather the co-occurrence statistics, learned on this visually ordered graph structure, which can convey complementary information to a linguistic semantic embedding trained on a "natural" text corpus. One potential way to test this hypothesis could be to gradually reduce the resolution of the images we use for the visual embeddings and see how the performance changes, in particular at what rate it starts to decline. We would expect it to plateau or only decline slowly until a point when the objects are not distinguishable any more. This way we would see how much visual detail we can omit and keep the same gain for these conceptually abstract tasks.

Further results for evaluation on common subsets and the Intersection type mid-fusion method can be found in Appendix B. They are consistent with the results presented here.

4.3.4.4 Qualitative Analysis

Our automatic WordNet concreteness score is not a distinguishing metric for the 60 nouns in the Brain datasets; nevertheless, there can be some pattern when we look at the results for individual words.

Figure 4.5: Scores on the words of the full Brain datasets, ordered by the $E_{VG}$ score. The scores are the number of hits per word, averaged over all participants. Mid-fusion method: Intersection.

Figure 4.5 and Figure 4.6 show the number of hits for individual words, averaged over participants.
A word gets a hit whenever it was in a word pair with a positive 2 vs. 2 test score. Here we order the plot by a Visual Genome based embedding on combined image segments. Some words, e.g., barn, airplane and spoon, got a very high rank in both the fMRI and the MEG dataset. Note that the participant sets of these two datasets are disjoint. It is harder to see such similarity for words with lower hit numbers. Embeddings trained on different datasets and of different modalities follow a similar trend. In order to get a better understanding of the type of words which behave differently in these brain imaging experiments we would need more evaluation data.

Figure 4.6: Scores on the words of the embeddings' common subset of the Brain datasets, ordered by the $E_{VG}$ score. The scores are the number of hits per word, averaged over all participants. Mid-fusion method: Intersection.

4.3.5 Conclusion

In this study we took a step towards a more detailed analysis of the impact of visual information on high level semantic tasks with no direct visual input. Furthermore, we investigated two brain imaging evaluation sets, involving two different imaging methods: fMRI (Functional Magnetic Resonance Imaging) and MEG (Magnetoencephalography).

The results show that comparing several different visual and linguistic sources and models on various evaluation tasks is indeed necessary in order to avoid fooling ourselves by overfitting certain types of evaluation sets. On several occasions, previous literature showed a performance gain using multi-modal embeddings of linguistic and visual input. This is indeed the case on certain tasks and using certain embeddings, but not in every case. In this work we aimed to shed light on the various factors that might play a role.

Models behave differently on MEN and SimLex, and the performance gain of multi-modal models when using linguistic vectors trained on huge textual sources is not well supported on these tasks.
Visual information is complementary when our linguistic model has been trained on a smaller corpus, but this effect does not necessarily scale with corpus size. Multi-modal models achieved a more convincing improvement on the brain imaging data; however, these datasets are fairly small, so we refrain from drawing far-reaching conclusions.

An interesting outcome of this study is that the model trained on the visually structured scene graph of Visual Genome achieved a surprising success across the board, despite its small size compared to all the other datasets. This is an interesting model, since it is linguistic in the sense that it is trained on text, but the word contexts are organised in a visually motivated structure. This suggests that images may indeed convey complementary statistical information about the co-occurrence of objects in visual scenes. It is even possible that this information is more important for abstract semantic tasks than lower level visual properties of words. This would be intuitive, since unlike multi-modal tasks with direct visual input, such as Visual Question Answering, in our case we are aiming for abstract meaning representations of concepts. It would make sense if detailed visual information about what a table looks like mattered less when we talk about table as an abstract concept.

4.4 Model Initialization on a Textual Entailment Task

This section is a brief digression into the possibilities of applying word embeddings as initialisations on a sentence level task: textual entailment. We evaluate on the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015].

We compared five different neural network models for encoding sentences and four different word embeddings to initialize these models, following the baselines of [Bowman et al., 2015]. The task is a three-way classification, where the input is a sentence pair and the classification labels are entailment, contradiction and neutral.
We included words in the multi-modal representation only if they have a visual representation. On top of each model there is a three-way classifier on the concatenated sentence embeddings of the premise and the hypothesis sentences.

The five sentence encoding models are the following:⁴

1. Addition: Vector addition of the word embedding vectors in the sentence.
2. Addition + translation layer: The previous model extended with an additional layer that learns another sentence embedding above the fixed word embedding based sentence representation.
3. Addition + translation layer + full size image embeddings: The model above, but instead of using dimensionality reduced visual vectors, we use the original image embeddings and smaller (100 dimensional) linguistic vectors. In the previous models we used PCA to keep the first 300 components out of 4094.
4. GRU: Gated Recurrent Unit based recurrent sentence encoding model.
5. LSTM: Recurrent sentence encoding model with Long Short-Term Memory units.

All of them were initialized with four different word embeddings:

1. Linguistic only: Skip-gram embedding trained on a 2013 Wikipedia dump.
2. Visual only: Image embeddings, extracting CNN representations of Google images for the individual words in the sentence.
3. Multi-modal: The concatenation of linguistic and visual vectors for each word.
4. Random: The initial vector weights are sampled from a normal distribution.

⁴ Some of the base code was written in collaboration with Amandla Mabona.

4.4.1 Results

The results are shown in Table 4.16. The experiments indicate two phenomena:

1. The translation layer plays an important role in models 1-3. In the simplest model without the translation layer (model 1), the linguistic initialisation performs best. After adding the translation layer, however, multi-modal embeddings outperform all the others, in the case of full size image vectors (model 3) by a substantial margin in classification accuracy.
2. In the case of the more sophisticated recurrent models (4-5), however, we found that the performance difference across initialisations vanishes. Even the random initial embeddings do not achieve significantly lower classification accuracy than the other methods.

4.4.2 Conclusion

The second finding may suggest that we could create more time efficient models, since we do not necessarily need to spend time on pre-training word embeddings. It also alerts us, however, to the danger of overfitting. Note that we ignored multi-modal representations for words where visual information is missing, which may hurt performance. The high performance of random initialisations, however, is more telling. Our findings are in line with those of Zhang and Bowman, who found the related phenomenon of high performing randomly initialized LSTM models [Zhang and Bowman, 2018]. [Yogatama et al., 2019] recently found that transformer type models overfit to the quirks of particular datasets. Possible future work could be to gradually increase model complexity and to perform more ablation studies, in order to better understand the models' capacity.

4.5 Conclusion

In this chapter we demonstrated the effectiveness of image search engines in multi-modal mid-fusion embeddings. We found that around the first 10 image results are sufficient; beyond that the performance plateaus.

We introduced a new visually structured textual embedding based on Visual Genome and showed that it enriches linguistic models trained on smaller corpora; it can therefore be useful for low resource languages.

We found that pretrained word embeddings do not necessarily help sequence model training. However, they can be valuable on their own for discovering concept structures in a data source.

Based on these findings we move on to an in-depth study of our embeddings of different modalities and their combinations.
The following chapters showcase the second and third pillars of our methodology, which involve transparency analysis (see Section 3.3). We narrow our focus to a few models, as such analyses would be fairly time consuming for all the above combinations of sources, modalities and models. Furthermore, such studies on numerous current models would shortly become obsolete. Our aim is rather to provide a general framework with proof-of-concept studies, which can be applied to various models in the future.

| Modality | Embedding | Spearman | P-value | Coverage |
|---|---|---|---|---|
| $E_L$ | wikinews | 0.28 | 0 | 103 |
| | wikinews sub | 0.25 | 0.01 | 103 |
| | crawl | 0.37 | 0 | 103 |
| | w2v13 | 0.11 | 0.25 | 103 |
| $E_V$ | Google AlexNet | 0.55 | 0 | 103 |
| | Google VGG | 0.53 | 0 | 103 |
| | VG-internal | 0.31 | 0 | 103 |
| | VG-whole | 0.19 | 0.06 | 103 |
| | Google ResNet-152 | 0.50 | 0 | 103 |
| $E_S$ | VG SceneGraph | 0.30 | 0 | 103 |
| $E_L + E_V$ | wikinews+Google AlexNet | 0.55 | 0 | 103 |
| | wikinews+Google VGG | 0.53 | 0 | 103 |
| | wikinews+VG-internal | 0.31 | 0 | 103 |
| | wikinews+VG-whole | 0.18 | 0.06 | 103 |
| | wikinews+Google ResNet-152 | 0.50 | 0 | 103 |
| | wikinews sub+Google AlexNet | 0.55 | 0 | 103 |
| | wikinews sub+Google VGG | 0.53 | 0 | 103 |
| | wikinews sub+VG-internal | 0.31 | 0 | 103 |
| | wikinews sub+VG-whole | 0.18 | 0.06 | 103 |
| | wikinews sub+Google ResNet-152 | 0.50 | 0 | 103 |
| | crawl+Google AlexNet | 0.55 | 0 | 103 |
| | crawl+Google VGG | 0.52 | 0 | 103 |
| | crawl+VG-internal | 0.31 | 0 | 103 |
| | crawl+VG-whole | 0.19 | 0.06 | 103 |
| | crawl+Google ResNet-152 | 0.49 | 0 | 103 |
| | w2v13+Google AlexNet | 0.55 | 0 | 103 |
| | w2v13+Google VGG | 0.53 | 0 | 103 |
| | w2v13+VG-internal | 0.31 | 0 | 103 |
| | w2v13+VG-whole | 0.18 | 0.06 | 103 |
| | w2v13+Google ResNet-152 | 0.49 | 0 | 103 |
| $E_L + E_S$ | w2v13+VG SceneGraph | 0.25 | 0.01 | 103 |
| | crawl+VG SceneGraph | 0.34 | 0 | 103 |
| | wikinews sub+VG SceneGraph | 0.30 | 0 | 103 |
| | wikinews+VG SceneGraph | 0.29 | 0 | 103 |

Table 4.7: Spearman correlation on the common subset of the SimLex dataset. Multi-modal embeddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance.
Blue would mean that the multi-modal embedding outperformed the corresponding uni-modal ones, which here did not happen. 111 Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Covr. EL w2v13 0.79 0.54 0.66 0.76 0.58 0.65 0.47 0.60 0.67 0.64 0.1 45 wikinews sub 0.83 0.66 0.68 0.83 0.61 0.54 0.59 0.56 0.70 0.67 0.1 60 wikinews 0.83 0.68 0.63 0.81 0.64 0.54 0.56 0.48 0.65 0.65 0.11 60 crawl 0.86 0.68 0.61 0.88 0.65 0.58 0.58 0.55 0.60 0.67 0.12 60 EV Google-VIS whole 0.89 0.65 0.64 0.75 0.51 0.61 0.64 0.55 0.60 0.65 0.11 52 Google ResNet-152 0.88 0.63 0.64 0.73 0.46 0.56 0.64 0.50 0.56 0.62 0.12 52 VG-VIS internal 0.85 0.70 0.63 0.72 0.52 0.55 0.57 0.47 0.57 0.62 0.11 57 Google AlexNet 0.89 0.61 0.66 0.72 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52 VG-VIS combined 0.85 0.71 0.65 0.76 0.55 0.57 0.60 0.44 0.66 0.64 0.11 57 ES VG SceneGraph 0.83 0.68 0.57 0.77 0.59 0.63 0.58 0.59 0.64 0.65 0.09 58 EL + EV VG-MM internal 0.88 0.66 0.66 0.78 0.59 0.64 0.67 0.47 0.65 0.67 0.11 57 VG-MM combined 0.88 0.64 0.67 0.79 0.61 0.65 0.67 0.48 0.68 0.67 0.11 57 Google-MM whole 0.89 0.67 0.67 0.80 0.61 0.60 0.65 0.52 0.64 0.67 0.1 52 wikinews+Google ResNet-152 0.88 0.63 0.64 0.75 0.47 0.56 0.63 0.49 0.55 0.62 0.12 52 wikinews+Google AlexNet 0.89 0.61 0.66 0.73 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52 wikinews+VG-VIS internal 0.84 0.69 0.66 0.80 0.65 0.55 0.62 0.47 0.66 0.66 0.11 57 wikinews+VG-MM internal 0.83 0.67 0.66 0.80 0.65 0.56 0.62 0.47 0.67 0.66 0.1 57 wikinews+VG-VIS combined 0.84 0.68 0.66 0.80 0.65 0.56 0.63 0.47 0.67 0.66 0.11 57 wikinews+VG-MM combined 0.83 0.67 0.66 0.80 0.65 0.56 0.63 0.48 0.67 0.66 0.1 57 wikinews+Google-VIS whole 0.85 0.72 0.70 0.79 0.68 0.50 0.60 0.55 0.65 0.67 0.11 52 wikinews+Google-MM whole 0.83 0.72 0.68 0.80 0.69 0.49 0.58 0.55 0.65 0.67 0.11 52 wikinews sub+Google ResNet-152 0.88 0.63 0.63 0.74 0.46 0.56 0.63 0.49 0.55 0.62 0.12 52 wikinews sub+Google AlexNet 0.89 0.61 0.66 0.73 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52 wikinews sub+VG-VIS internal 0.87 
0.68 0.67 0.78 0.59 0.57 0.57 0.50 0.61 0.65 0.11 57 wikinews sub+VG-MM internal 0.88 0.64 0.69 0.80 0.63 0.63 0.66 0.51 0.66 0.68 0.1 57 wikinews sub+VG-VIS combined 0.87 0.70 0.67 0.81 0.60 0.58 0.62 0.48 0.67 0.67 0.11 57 wikinews sub+VG-MM combined 0.87 0.64 0.69 0.81 0.63 0.63 0.67 0.52 0.67 0.68 0.1 57 wikinews sub+Google-VIS whole 0.89 0.67 0.66 0.77 0.52 0.58 0.64 0.55 0.62 0.66 0.11 52 wikinews sub+Google-MM whole 0.88 0.69 0.70 0.81 0.64 0.57 0.63 0.55 0.65 0.68 0.1 52 crawl+Google ResNet-152 0.88 0.64 0.64 0.75 0.47 0.55 0.62 0.50 0.55 0.62 0.12 52 crawl+Google AlexNet 0.89 0.61 0.66 0.73 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52 crawl+VG-VIS internal 0.87 0.68 0.62 0.86 0.67 0.60 0.62 0.53 0.61 0.67 0.11 57 crawl+VG-MM internal 0.87 0.67 0.62 0.86 0.67 0.60 0.62 0.53 0.61 0.67 0.11 57 crawl+VG-VIS combined 0.87 0.68 0.62 0.86 0.67 0.60 0.63 0.53 0.62 0.67 0.11 57 crawl+VG-MM combined 0.87 0.67 0.62 0.86 0.67 0.60 0.62 0.53 0.61 0.67 0.11 57 crawl+Google-VIS whole 0.87 0.72 0.69 0.87 0.72 0.51 0.60 0.60 0.57 0.69 0.12 52 crawl+Google-MM whole 0.86 0.72 0.69 0.87 0.73 0.51 0.60 0.61 0.57 0.68 0.12 52 w2v13+Google ResNet-152 0.89 0.65 0.68 0.75 0.50 0.56 0.56 0.54 0.67 0.64 0.11 40 w2v13+Google AlexNet 0.90 0.66 0.71 0.74 0.53 0.64 0.58 0.57 0.77 0.68 0.11 40 w2v13+VG-VIS internal 0.81 0.55 0.66 0.76 0.60 0.68 0.49 0.59 0.67 0.65 0.09 44 w2v13+VG-MM internal 0.80 0.55 0.65 0.76 0.60 0.67 0.50 0.59 0.68 0.65 0.09 44 w2v13+VG-VIS combined 0.81 0.54 0.66 0.76 0.60 0.68 0.50 0.59 0.68 0.65 0.09 44 w2v13+VG-MM combined 0.80 0.55 0.65 0.76 0.60 0.67 0.50 0.59 0.68 0.64 0.09 44 w2v13+Google-VIS whole 0.84 0.61 0.68 0.75 0.62 0.59 0.44 0.64 0.66 0.65 0.1 40 w2v13+Google-MM whole 0.82 0.59 0.67 0.74 0.62 0.59 0.45 0.64 0.65 0.64 0.1 40 EL + ES wikinews+VG SceneGraph 0.84 0.71 0.59 0.80 0.63 0.60 0.58 0.58 0.65 0.66 0.09 58 wikinews sub+VG SceneGraph 0.84 0.68 0.57 0.78 0.60 0.63 0.59 0.60 0.64 0.66 0.09 58 crawl+VG SceneGraph 0.87 0.71 0.59 0.86 0.66 0.60 0.60 0.60 
0.64 0.68 0.1 58 w2v13+VG SceneGraph 0.87 0.65 0.66 0.81 0.66 0.69 0.52 0.61 0.76 0.69 0.1 45 Table 4.8: fMRI scores for each participant and embedding. Multi-modal em- beddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance, blue means that the multi-modal embedding outperformed the corresponding uni-modal ones. 112 Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Coverage EL w2v13 0.65 0.64 0.58 0.64 0.74 0.65 0.75 0.56 0.69 0.66 0.06 45 wikinews sub 0.63 0.59 0.48 0.70 0.72 0.65 0.71 0.66 0.73 0.65 0.07 60 wikinews 0.63 0.61 0.50 0.71 0.71 0.64 0.72 0.63 0.76 0.66 0.07 60 crawl 0.65 0.58 0.57 0.69 0.67 0.63 0.73 0.65 0.71 0.65 0.05 60 EV Google-VIS whole 0.70 0.51 0.56 0.71 0.69 0.73 0.70 0.62 0.69 0.66 0.07 52 Google ResNet-152 0.65 0.55 0.52 0.69 0.70 0.68 0.63 0.61 0.66 0.63 0.06 52 VG-VIS internal 0.62 0.55 0.54 0.66 0.62 0.69 0.64 0.49 0.59 0.60 0.06 57 Google AlexNet 0.66 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.63 0.06 52 VG-VIS combined 0.69 0.60 0.55 0.68 0.70 0.76 0.68 0.56 0.69 0.66 0.07 57 ES VG SceneGraph 0.63 0.60 0.55 0.65 0.70 0.62 0.67 0.50 0.73 0.63 0.07 58 EL + EV VG-MM internal 0.66 0.65 0.56 0.73 0.68 0.64 0.70 0.60 0.69 0.65 0.05 57 VG-MM combined 0.68 0.67 0.56 0.74 0.71 0.69 0.72 0.62 0.72 0.68 0.05 57 Google-MM whole 0.72 0.59 0.53 0.72 0.71 0.67 0.72 0.65 0.72 0.67 0.06 52 wikinews+Google ResNet-152 0.65 0.55 0.51 0.68 0.70 0.69 0.63 0.61 0.66 0.63 0.06 52 wikinews+Google AlexNet 0.66 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.64 0.06 52 wikinews+VG-VIS internal 0.63 0.67 0.53 0.71 0.74 0.69 0.72 0.60 0.76 0.67 0.07 57 wikinews+VG-MM internal 0.62 0.67 0.54 0.70 0.75 0.68 0.73 0.62 0.76 0.67 0.07 57 wikinews+VG-VIS combined 0.64 0.66 0.53 0.71 0.74 0.71 0.73 0.62 0.77 0.68 0.07 57 wikinews+VG-MM combined 0.62 0.67 0.53 0.70 0.75 0.69 0.73 0.62 0.76 0.67 0.07 57 wikinews+Google-VIS whole 0.66 0.61 0.53 
0.70 0.76 0.70 0.72 0.61 0.75 0.67 0.07 52 wikinews+Google-MM whole 0.66 0.63 0.53 0.69 0.76 0.66 0.71 0.61 0.76 0.67 0.07 52 wikinews sub+Google ResNet-152 0.65 0.55 0.51 0.68 0.70 0.68 0.62 0.61 0.66 0.63 0.06 52 wikinews sub+Google AlexNet 0.66 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.64 0.06 52 wikinews sub+VG-VIS internal 0.64 0.61 0.56 0.72 0.68 0.70 0.70 0.54 0.67 0.65 0.06 57 wikinews sub+VG-MM internal 0.66 0.66 0.56 0.75 0.72 0.67 0.72 0.63 0.73 0.68 0.06 57 wikinews sub+VG-VIS combined 0.70 0.64 0.57 0.73 0.73 0.75 0.72 0.59 0.71 0.68 0.06 57 wikinews sub+VG-MM combined 0.68 0.67 0.55 0.76 0.74 0.71 0.73 0.63 0.75 0.69 0.06 57 wikinews sub+Google-VIS whole 0.70 0.53 0.55 0.70 0.73 0.73 0.71 0.63 0.70 0.66 0.07 52 wikinews sub+Google-MM whole 0.71 0.62 0.53 0.71 0.76 0.67 0.72 0.65 0.74 0.68 0.07 52 crawl+Google ResNet-152 0.65 0.56 0.51 0.68 0.71 0.68 0.63 0.62 0.66 0.63 0.06 52 crawl+Google AlexNet 0.67 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.64 0.06 52 crawl+VG-VIS internal 0.65 0.63 0.60 0.69 0.69 0.68 0.73 0.65 0.73 0.67 0.04 57 crawl+VG-MM internal 0.65 0.64 0.60 0.69 0.69 0.67 0.73 0.65 0.73 0.67 0.04 57 crawl+VG-VIS combined 0.65 0.63 0.60 0.69 0.69 0.68 0.73 0.65 0.73 0.67 0.04 57 crawl+VG-MM combined 0.65 0.64 0.61 0.69 0.69 0.67 0.73 0.65 0.73 0.67 0.04 57 crawl+Google-VIS whole 0.67 0.62 0.56 0.68 0.77 0.67 0.73 0.62 0.75 0.67 0.06 52 crawl+Google-MM whole 0.67 0.63 0.57 0.68 0.77 0.66 0.73 0.62 0.75 0.67 0.06 52 w2v13+Google ResNet-152 0.68 0.66 0.60 0.72 0.74 0.69 0.73 0.62 0.66 0.68 0.05 40 w2v13+Google AlexNet 0.69 0.59 0.67 0.75 0.77 0.74 0.73 0.62 0.69 0.69 0.06 40 w2v13+VG-VIS internal 0.65 0.65 0.61 0.62 0.75 0.65 0.75 0.53 0.71 0.66 0.07 44 w2v13+VG-MM internal 0.64 0.64 0.60 0.62 0.75 0.65 0.74 0.54 0.70 0.66 0.06 44 w2v13+VG-VIS combined 0.66 0.64 0.61 0.62 0.76 0.67 0.74 0.54 0.71 0.66 0.06 44 w2v13+VG-MM combined 0.64 0.64 0.60 0.63 0.75 0.65 0.74 0.54 0.70 0.66 0.06 44 w2v13+Google-VIS whole 0.69 0.59 0.56 0.64 0.73 0.65 0.69 
0.56 0.71 0.65 0.06 40 w2v13+Google-MM whole 0.68 0.59 0.54 0.63 0.71 0.63 0.70 0.56 0.71 0.64 0.06 40 EL + ES wikinews+VG SceneGraph 0.62 0.61 0.55 0.69 0.73 0.62 0.71 0.55 0.75 0.65 0.07 58 wikinews sub+VG SceneGraph 0.62 0.60 0.55 0.67 0.70 0.62 0.68 0.51 0.73 0.63 0.07 58 crawl+VG SceneGraph 0.66 0.63 0.55 0.69 0.72 0.62 0.74 0.62 0.76 0.67 0.06 58 w2v13+VG SceneGraph 0.69 0.63 0.56 0.68 0.81 0.68 0.78 0.59 0.72 0.68 0.08 45 Table 4.9: MEG scores for each participant and embedding. Multi-modal em- beddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance, blue means that the multi-modal embedding outperformed the corresponding uni-modal ones. 113 Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Coverage EL w2v13 0.41 0.41 0.33 0.46 0.44 0.49 0.43 0.51 0.55 0.45 0.06 39 wikinews sub 0.47 0.47 0.52 0.49 0.48 0.45 0.50 0.52 0.52 0.49 0.03 39 wikinews 0.54 0.44 0.50 0.49 0.51 0.52 0.53 0.46 0.54 0.50 0.03 39 crawl 0.41 0.49 0.44 0.54 0.56 0.48 0.47 0.39 0.56 0.48 0.06 39 EV Google-VIS whole 0.47 0.33 0.54 0.46 0.44 0.41 0.24 0.49 0.45 0.43 0.08 39 Google ResNet-152 0.42 0.39 0.58 0.54 0.53 0.53 0.57 0.62 0.39 0.51 0.08 39 VG-VIS internal 0.66 0.37 0.60 0.65 0.58 0.57 0.52 0.56 0.56 0.56 0.08 39 Google AlexNet 0.59 0.58 0.54 0.58 0.53 0.61 0.53 0.59 0.56 0.57 0.03 39 VG-VIS combined 0.57 0.34 0.57 0.58 0.53 0.52 0.49 0.53 0.46 0.51 0.07 39 ES VG SceneGraph 0.51 0.45 0.49 0.55 0.63 0.38 0.55 0.48 0.36 0.49 0.08 39 EL + EV VG-MM internal 0.70 0.61 0.46 0.45 0.55 0.48 0.57 0.60 0.51 0.55 0.08 39 VG-MM combined 0.48 0.43 0.45 0.51 0.47 0.66 0.44 0.58 0.56 0.51 0.07 39 Google-MM whole 0.45 0.32 0.48 0.44 0.46 0.49 0.27 0.49 0.40 0.43 0.07 39 wikinews+Google ResNet-152 0.33 0.38 0.59 0.42 0.43 0.39 0.69 0.50 0.64 0.49 0.12 39 wikinews+Google AlexNet 0.42 0.41 0.47 0.49 0.49 0.41 0.65 0.48 0.55 0.48 0.07 39 wikinews+VG-VIS internal 0.64 0.69 
0.43 0.66 0.64 0.50 0.53 0.62 0.55 0.58 0.08 39 wikinews+VG-MM internal 0.66 0.68 0.40 0.67 0.61 0.50 0.55 0.64 0.52 0.58 0.09 39 wikinews+VG-VIS combined 0.65 0.67 0.42 0.63 0.63 0.51 0.54 0.63 0.55 0.58 0.08 39 wikinews+VG-MM combined 0.69 0.70 0.42 0.65 0.60 0.49 0.61 0.66 0.48 0.59 0.1 39 wikinews+Google-VIS whole 0.43 0.46 0.49 0.40 0.51 0.41 0.63 0.40 0.50 0.47 0.07 39 wikinews+Google-MM whole 0.42 0.48 0.47 0.41 0.53 0.41 0.63 0.41 0.49 0.47 0.07 39 wikinews sub+Google ResNet-152 0.33 0.39 0.59 0.42 0.43 0.39 0.69 0.50 0.65 0.49 0.12 39 wikinews sub+Google AlexNet 0.42 0.41 0.47 0.49 0.48 0.41 0.65 0.48 0.55 0.48 0.07 39 wikinews sub+VG-VIS internal 0.60 0.57 0.52 0.68 0.54 0.57 0.43 0.58 0.52 0.55 0.06 39 wikinews sub+VG-MM internal 0.68 0.57 0.40 0.75 0.57 0.51 0.43 0.64 0.46 0.56 0.11 39 wikinews sub+VG-VIS combined 0.57 0.53 0.50 0.65 0.53 0.55 0.46 0.65 0.49 0.55 0.06 39 wikinews sub+VG-MM combined 0.69 0.59 0.40 0.70 0.59 0.50 0.50 0.65 0.42 0.56 0.1 39 wikinews sub+Google-VIS whole 0.43 0.42 0.50 0.38 0.45 0.38 0.58 0.43 0.59 0.46 0.08 39 wikinews sub+Google-MM whole 0.38 0.47 0.46 0.38 0.47 0.42 0.66 0.44 0.57 0.47 0.09 39 crawl+Google ResNet-152 0.33 0.38 0.59 0.42 0.44 0.39 0.69 0.50 0.63 0.48 0.12 39 crawl+Google AlexNet 0.42 0.41 0.47 0.49 0.49 0.40 0.65 0.48 0.55 0.48 0.07 39 crawl+VG-VIS internal 0.63 0.62 0.41 0.67 0.59 0.56 0.50 0.59 0.60 0.57 0.07 39 crawl+VG-MM internal 0.60 0.60 0.39 0.68 0.57 0.55 0.51 0.61 0.59 0.57 0.08 39 crawl+VG-VIS combined 0.64 0.62 0.41 0.67 0.58 0.55 0.50 0.59 0.60 0.57 0.07 39 crawl+VG-MM combined 0.60 0.62 0.43 0.65 0.58 0.59 0.55 0.62 0.54 0.58 0.06 39 crawl+Google-VIS whole 0.48 0.49 0.52 0.46 0.55 0.36 0.65 0.39 0.52 0.49 0.08 39 crawl+Google-MM whole 0.48 0.49 0.50 0.46 0.55 0.35 0.65 0.38 0.51 0.49 0.08 39 w2v13+Google ResNet-152 0.87 0.67 0.69 0.73 0.60 0.61 0.56 0.58 0.69 0.67 0.09 39 w2v13+Google AlexNet 0.74 0.64 0.60 0.67 0.58 0.54 0.62 0.61 0.71 0.63 0.06 39 w2v13+VG-VIS internal 0.35 0.47 0.28 0.38 
0.55 0.51 0.54 0.48 0.40 0.44 0.09 39 w2v13+VG-MM internal 0.37 0.50 0.35 0.48 0.39 0.47 0.59 0.45 0.47 0.45 0.07 39 w2v13+VG-VIS combined 0.36 0.47 0.29 0.37 0.55 0.50 0.54 0.47 0.41 0.44 0.08 39 w2v13+VG-MM combined 0.35 0.51 0.39 0.54 0.42 0.39 0.57 0.45 0.43 0.45 0.07 39 w2v13+Google-VIS whole 0.76 0.57 0.61 0.66 0.61 0.53 0.49 0.60 0.69 0.61 0.08 39 w2v13+Google-MM whole 0.75 0.56 0.60 0.67 0.61 0.53 0.48 0.61 0.68 0.61 0.08 39 EL + ES wikinews+VG SceneGraph 0.48 0.51 0.47 0.49 0.50 0.50 0.59 0.32 0.48 0.48 0.07 39 wikinews sub+VG SceneGraph 0.54 0.55 0.49 0.50 0.40 0.51 0.58 0.45 0.59 0.51 0.06 39 crawl+VG SceneGraph 0.56 0.59 0.43 0.47 0.43 0.47 0.59 0.38 0.49 0.49 0.07 39 w2v13+VG SceneGraph 0.59 0.49 0.63 0.48 0.45 0.60 0.42 0.53 0.67 0.54 0.08 39 Table 4.10: fMRI scores for each participant and embedding on the common sub- set of vocabularies. Multi-modal embeddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embed- dings in this order. Red colour signifies the best performance, blue means that the multi-modal embedding outperformed the corresponding uni-modal ones. 
114 Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Coverage EL w2v13 0.56 0.52 0.36 0.46 0.34 0.55 0.50 0.44 0.51 0.47 0.07 39 wikinews sub 0.55 0.64 0.40 0.52 0.59 0.42 0.62 0.46 0.53 0.53 0.08 39 wikinews 0.59 0.62 0.38 0.39 0.64 0.42 0.62 0.37 0.65 0.52 0.12 39 crawl 0.50 0.45 0.66 0.41 0.60 0.40 0.55 0.54 0.45 0.51 0.08 39 EV Google-VIS whole 0.49 0.52 0.56 0.35 0.44 0.64 0.45 0.65 0.52 0.51 0.09 39 Google ResNet-152 0.56 0.55 0.65 0.24 0.38 0.60 0.50 0.60 0.45 0.50 0.12 39 VG-VIS internal 0.46 0.54 0.51 0.53 0.66 0.46 0.49 0.52 0.65 0.54 0.07 39 Google AlexNet 0.35 0.52 0.54 0.45 0.52 0.52 0.53 0.51 0.50 0.49 0.05 39 VG-VIS combined 0.33 0.44 0.49 0.62 0.68 0.46 0.49 0.47 0.54 0.50 0.1 39 ES VG SceneGraph 0.48 0.60 0.49 0.49 0.53 0.54 0.59 0.39 0.49 0.51 0.06 39 EL + EV VG-MM internal 0.50 0.22 0.50 0.55 0.39 0.54 0.50 0.54 0.52 0.47 0.1 39 VG-MM combined 0.29 0.36 0.54 0.45 0.40 0.38 0.35 0.39 0.50 0.41 0.07 39 Google-MM whole 0.39 0.51 0.48 0.29 0.40 0.54 0.46 0.65 0.57 0.48 0.1 39 wikinews+Google ResNet-152 0.46 0.44 0.57 0.43 0.64 0.53 0.35 0.61 0.45 0.50 0.09 39 wikinews+Google AlexNet 0.52 0.36 0.42 0.46 0.48 0.57 0.31 0.63 0.58 0.48 0.1 39 wikinews+VG-VIS internal 0.39 0.34 0.46 0.58 0.49 0.50 0.52 0.57 0.67 0.50 0.09 39 wikinews+VG-MM internal 0.41 0.33 0.49 0.59 0.47 0.51 0.41 0.57 0.64 0.49 0.09 39 wikinews+VG-VIS combined 0.39 0.34 0.46 0.58 0.48 0.49 0.51 0.57 0.66 0.50 0.09 39 wikinews+VG-MM combined 0.42 0.38 0.51 0.59 0.49 0.53 0.42 0.59 0.65 0.51 0.09 39 wikinews+Google-VIS whole 0.42 0.38 0.48 0.31 0.50 0.66 0.55 0.59 0.45 0.48 0.1 39 wikinews+Google-MM whole 0.41 0.38 0.47 0.31 0.48 0.66 0.56 0.59 0.42 0.48 0.1 39 wikinews sub+Google ResNet-152 0.46 0.44 0.57 0.43 0.64 0.53 0.35 0.61 0.45 0.50 0.09 39 wikinews sub+Google AlexNet 0.52 0.35 0.42 0.46 0.48 0.57 0.31 0.63 0.58 0.48 0.1 39 wikinews sub+VG-VIS internal 0.54 0.40 0.40 0.47 0.52 0.43 0.59 0.54 0.56 0.49 0.07 39 wikinews sub+VG-MM internal 0.51 0.43 0.44 0.51 0.48 0.55 0.53 
0.55 0.66 0.52 0.06 39 wikinews sub+VG-VIS combined 0.52 0.40 0.40 0.50 0.49 0.45 0.58 0.53 0.57 0.49 0.06 39 wikinews sub+VG-MM combined 0.50 0.48 0.50 0.56 0.48 0.56 0.52 0.57 0.65 0.53 0.05 39 wikinews sub+Google-VIS whole 0.48 0.39 0.46 0.40 0.58 0.60 0.42 0.63 0.48 0.49 0.08 39 wikinews sub+Google-MM whole 0.44 0.37 0.45 0.39 0.53 0.61 0.46 0.57 0.44 0.47 0.08 39 crawl+Google ResNet-152 0.47 0.44 0.57 0.43 0.63 0.53 0.36 0.61 0.45 0.50 0.09 39 crawl+Google AlexNet 0.52 0.35 0.42 0.46 0.48 0.57 0.31 0.63 0.58 0.48 0.1 39 crawl+VG-VIS internal 0.43 0.35 0.46 0.50 0.50 0.45 0.54 0.50 0.60 0.48 0.07 39 crawl+VG-MM internal 0.45 0.34 0.47 0.49 0.51 0.40 0.48 0.49 0.61 0.47 0.07 39 crawl+VG-VIS combined 0.42 0.35 0.46 0.50 0.49 0.45 0.54 0.50 0.60 0.48 0.07 39 crawl+VG-MM combined 0.49 0.44 0.50 0.50 0.54 0.42 0.47 0.50 0.65 0.50 0.06 39 crawl+Google-VIS whole 0.57 0.42 0.43 0.32 0.48 0.67 0.60 0.60 0.55 0.52 0.11 39 crawl+Google-MM whole 0.57 0.42 0.43 0.31 0.47 0.67 0.61 0.60 0.55 0.52 0.11 39 w2v13+Google ResNet-152 0.67 0.61 0.62 0.65 0.73 0.73 0.67 0.58 0.70 0.66 0.05 39 w2v13+Google AlexNet 0.53 0.48 0.57 0.65 0.61 0.68 0.59 0.50 0.57 0.58 0.06 39 w2v13+VG-VIS internal 0.52 0.46 0.51 0.56 0.45 0.54 0.48 0.68 0.53 0.53 0.07 39 w2v13+VG-MM internal 0.55 0.42 0.53 0.56 0.43 0.49 0.47 0.67 0.42 0.50 0.08 39 w2v13+VG-VIS combined 0.52 0.46 0.51 0.56 0.44 0.54 0.48 0.68 0.52 0.52 0.07 39 w2v13+VG-MM combined 0.55 0.44 0.49 0.55 0.54 0.56 0.56 0.68 0.47 0.54 0.06 39 w2v13+Google-VIS whole 0.66 0.59 0.59 0.55 0.68 0.74 0.66 0.47 0.72 0.63 0.08 39 w2v13+Google-MM whole 0.66 0.59 0.59 0.52 0.66 0.72 0.66 0.45 0.72 0.62 0.08 39 EL + ES wikinews+VG SceneGraph 0.56 0.65 0.53 0.56 0.45 0.33 0.64 0.52 0.41 0.52 0.1 39 wikinews sub+VG SceneGraph 0.62 0.67 0.62 0.56 0.49 0.35 0.60 0.44 0.46 0.54 0.1 39 crawl+VG SceneGraph 0.69 0.52 0.52 0.48 0.42 0.55 0.67 0.66 0.45 0.55 0.09 39 w2v13+VG SceneGraph 0.52 0.49 0.54 0.54 0.40 0.55 0.52 0.51 0.50 0.51 0.04 39 Table 4.11: MEG 
scores for each participant and embedding on the common subset of vocabularies. Multi-modal embeddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order. Red colour signifies the best performance, blue means that the multi-modal embedding outperformed the corresponding uni-modal ones.

Modality       P1    P2    P3    P4    P5    P6    P7    P8    P9
$E_L$          0.83  0.64  0.65  0.82  0.62  0.58  0.55  0.55  0.65
$E_V$          0.87  0.66  0.65  0.74  0.51  0.58  0.61  0.50  0.61
$E_S$          0.83  0.68  0.57  0.77  0.59  0.63  0.58  0.59  0.64
$E_L + E_V$    0.86  0.65  0.66  0.79  0.60  0.59  0.60  0.54  0.64
$E_L + E_S$    0.86  0.69  0.60  0.81  0.64  0.63  0.57  0.60  0.67
Table 4.12: fMRI scores averaged over each modality. Bold signifies the highest average performance for each participant.

Modality       P1    P2    P3    P4    P5    P6    P7    P8    P9
$E_L$          0.64  0.60  0.53  0.69  0.71  0.64  0.73  0.63  0.72
$E_V$          0.66  0.54  0.55  0.68  0.68  0.72  0.67  0.57  0.66
$E_S$          0.63  0.60  0.55  0.65  0.70  0.62  0.67  0.50  0.73
$E_L + E_V$    0.66  0.62  0.56  0.69  0.73  0.68  0.71  0.60  0.71
$E_L + E_S$    0.65  0.62  0.55  0.68  0.74  0.64  0.73  0.57  0.74
Table 4.13: MEG scores averaged over each modality. Bold signifies the highest average performance for each participant.

Modality       P1    P2    P3    P4    P5    P6    P7    P8    P9
$E_L$          0.46  0.45  0.45  0.50  0.50  0.48  0.48  0.47  0.54
$E_V$          0.54  0.40  0.57  0.56  0.52  0.53  0.47  0.56  0.48
$E_S$          0.51  0.45  0.49  0.55  0.63  0.38  0.55  0.48  0.36
$E_L + E_V$    0.53  0.53  0.47  0.55  0.53  0.48  0.56  0.54  0.54
$E_L + E_S$    0.54  0.53  0.50  0.48  0.45  0.52  0.54  0.42  0.56
Table 4.14: fMRI scores averaged over each modality on the common subset of vocabularies. Bold signifies the highest average performance for each participant.
Modality       P1    P2    P3    P4    P5    P6    P7    P8    P9
$E_L$          0.55  0.56  0.45  0.44  0.55  0.45  0.57  0.45  0.53
$E_V$          0.44  0.51  0.55  0.44  0.54  0.54  0.49  0.55  0.53
$E_S$          0.48  0.60  0.49  0.49  0.53  0.54  0.59  0.39  0.49
$E_L + E_V$    0.49  0.41  0.49  0.48  0.52  0.55  0.49  0.57  0.56
$E_L + E_S$    0.60  0.58  0.55  0.53  0.44  0.45  0.61  0.53  0.45
Table 4.15: MEG scores averaged over each modality on the common subset of vocabularies. Bold signifies the highest average performance for each participant.

Architecture               Embedding         Accuracy (%)
Add                        linguistic only   77.54
                           visual only       72.70
                           multi-modal       76.56
                           random            69.87
Add+Translation            linguistic only   81.21
                           visual only       79.75
                           multi-modal       81.81
                           random            78.33
Add+Translation+FullVis    linguistic only   79.85
                           visual only       79.11
                           multi-modal       81.29
                           random            78.79
GRU                        linguistic only   79.77
                           visual only       77.34
                           multi-modal       79.48
                           random            79.25
LSTM                       linguistic only   79.80
                           visual only       78.22
                           multi-modal       79.61
                           random            76.16
Table 4.16: Classification accuracy of the different architectures and embedding initialisations.

Chapter 5
Effects of Data Size and Distribution

This chapter shifts the focus towards a more in-depth analysis of selected model, data source and modality combinations, based on the results of the previous chapter. Our main metric is still performance accuracy, thus this analysis forms the last part of pillar 1.

We aim our attention at studying model efficiency regarding size and performance. In this study we dig deeper into the effect of the training data size and distribution. The presented experiments address the following questions:

• Does visual data bolster performance only because we add more data or does it convey complementary quality information compared to a higher quantity of text? (Question 4)
• Can we achieve comparable performance using small data if it comes from the right data distribution? (Question 4a)

We perform different experiments in order to test the effect of data size and data distribution on semantic similarity and relatedness tasks.
We will compare linguistic, visual and structured embeddings, based on various criteria.

5.1 Counting in the “Effort”

The work presented here is related to a recently published information-theoretic probing framework based on minimum description length (MDL) [Voita and Titov, 2020], i.e. the minimum number of bits needed to transmit the labels knowing the representations. Our idea is to count in the “effort” of data collection and quantity when assessing the performance of our multi-modal word meaning representations. Unlike Voita and Titov, instead of testing on supervised tasks, we focus on unsupervised evaluation. We do not train a multi-layer perceptron for probing. This is relevant because this way we avoid distorting our results by a network functioning as supervised fine-tuning.

In Section 4.4 we found that a shallow neural network and a deep LSTM, both with randomly initialised input word vectors, perform on par with the same models initialised with pretrained word embeddings on a Textual Entailment task (SNLI). Zhang and Bowman observed the related phenomenon of high-performing randomly initialised LSTM models [Zhang and Bowman, 2018]. This is in line with recent findings on transformer-type models, which have been shown to be far from solving general tasks (e.g., document question answering). Rather, these models overfit to the quirks of particular datasets [Yogatama et al., 2019]. Motivated by these results, in this work we decided to focus on unsupervised representation learning.

In unsupervised representation learning we are learning $P(x)$ instead of $P(y|x)$, where $x$ is the input data and $y$ is the corresponding label determined by the supervised evaluation task. Hence, our approach is more related to Voita and Titov's MDL framework with “online” code, where the code length is simply calculated by the entropy of the training data.

We aim to measure how hard it is to achieve a high-performing representation with small data.
In the previous chapter we controlled for image quantity for $D_V$ (Section 4.1) and the context size (radius) of $D_S$ (Section 4.2). In this chapter we focus on controlling for the size and distribution of the text data $D_L$. Our question is: What is the corpus size where visual information is helpful? We count in the “effort” by discussing performance in the context of data and model size. In the following, we describe our implementation of controlling for data quantity and word frequency distribution.

5.2 Experiments

Here, we summarise the notation and specify the models used in the following experiments, based on our previous findings in Chapter 4.

$E_L \in \mathbb{R}^{|T| \times d_L}$: Linguistic Embedding. Here, we present results using Skip-Gram with Negative Sampling (SGNS) [Mikolov et al., 2013a, Mikolov et al., 2013b] trained on a 2020 English Wikipedia dump. Due to its simplicity, it is suitable for running a wide range of experiments.

$E_V \in \mathbb{R}^{|T| \times d_V}$: Visual Embedding. We ran a forward pass of ResNet-152 [He et al., 2016] on Google Images. We apply mean aggregation on the first 10 image results, which was found to be one of the best performing settings in Section 4.1.

$E_S \in \mathbb{R}^{|T| \times d_S}$: Structured Embedding. We use our in-between visual and linguistic embedding, trained on the visually structured text of Visual Genome Scene Graphs (Section 4.2).

In the following we show results according to $e_1, \dots, e_l$ samples from the linguistic training corpus $D_L$. $T = |V \cap V_{task}| \approx |V_{task}|$, $V_{task} \subset V$, where $V$ is the vocabulary of the text corpus and $V_{task}$ is the vocabulary of the evaluation tasks.

5.2.1 Control for Data Quantity

We perform experiments where we restrict the training data size of $E_L$. Similarly to Sahlgren and Lenci [Sahlgren and Lenci, 2016], we randomly sample the corpus into subsets with an increasing number of tokens: $e_1, \dots, e_N$.
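The sampling step described in Section 5.2.1 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the thesis: it assumes the corpus is held as a list of tokenised sentences, that whole sentences are drawn without replacement, and that each token budget gets its own independent draw. The names `sample_subsets` and `count_tokens`, the toy corpus and the budgets are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def count_tokens(subset: Sequence[List[str]]) -> int:
    """Total number of tokens in a list of tokenised sentences."""
    return sum(len(sentence) for sentence in subset)


def sample_subsets(
    sentences: Sequence[List[str]],
    token_budgets: Sequence[int],
    seed: int = 0,
) -> Dict[int, List[List[str]]]:
    """Draw one random sub-corpus per token budget.

    Assumption (not from the thesis): sentences are visited in a shuffled
    order and added only while the running token count stays within budget.
    """
    subsets: Dict[int, List[List[str]]] = {}
    for budget in sorted(token_budgets):
        rng = random.Random(f"{seed}-{budget}")  # independent draw per budget
        order = list(range(len(sentences)))
        rng.shuffle(order)
        chosen: List[List[str]] = []
        n_tokens = 0
        for idx in order:
            length = len(sentences[idx])
            if n_tokens + length > budget:
                continue  # this sentence would overshoot the budget
            chosen.append(sentences[idx])
            n_tokens += length
            if n_tokens == budget:
                break
        subsets[budget] = chosen
    return subsets


# Toy corpus with 10 tokens in total.
toy_corpus = [["a", "dog", "barks"], ["cats", "sleep"], ["the", "sun", "is", "out"], ["hi"]]
subsets = sample_subsets(toy_corpus, token_budgets=[3, 6, 10], seed=1)
for budget, subset in subsets.items():
    print(budget, count_tokens(subset))
```

Repeating the call with different `seed` values gives the repeated down-sampling runs over which a variance can be reported.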
5.2.2 Control for Frequency Ranges

In the second phase we test how models trained on different word frequency ranges interact with the other types of embeddings. Similarly to [Sahlgren and Lenci, 2016], we split the vocabulary into three equally large parts: HIGH, MEDIUM and LOW range. This way we generate samples for $E_L$, $E_V$ and $E_S$ for the different frequency ranges in the text corpus.

5.2.3 Expected Results

These experiments will potentially shed light on patterns across modalities and sources. One interesting result will be to see whether the $E_V$ and $E_S$ embeddings contribute more if there is a smaller amount of text data for $E_L$. If this is the case, the experiments where we control for word frequencies can reveal whether $E_V$ and $E_S$ contribute differently for words with different data distributions, or whether the effect is more due to data quantity. Similar questions can be answered in the reverse direction when we perform experiments where we control for image data size and distributional properties, such as image resolution or dispersion of image sets.

5.2.4 Results

Figure 5.1 shows the effect of $E_L$ corpus size on the performance of uni-modal $E_L$ and the combined $E_L + E_S$ and $E_L + E_V$ on the embeddings' common coverage subsets of MEN (Figure 5.1a) and SimLex (Figure 5.1b). The common coverage is 73% on MEN and 56% on SimLex. $E_S$ and $E_V$ are constant since only $E_L$'s training data is varied. Results on the full datasets are presented in Figure 5.2. The x axis represents the size of the training corpus (in number of tokens). Error bars indicate variance after three runs of random down-sampling of the data.

Table 5.1 gives an account of the amount of training data each model requires. The last line shows the size after compression by Lempel-Ziv coding (LZ77). Since ImageNet images are already in jpg format, LZ77 was not able to achieve any further compression.
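As an illustration of how a compressed size such as those discussed above could be measured, the sketch below streams a file through Python's `zlib` module. Two assumptions should be kept in mind: `zlib` implements DEFLATE, which combines LZ77 with Huffman coding, so it is a stand-in and not necessarily the tool used for Table 5.1; and the helper names `compressed_size` and `demo_ratio`, as well as the toy payloads, are hypothetical.

```python
import os
import tempfile
import zlib


def compressed_size(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the DEFLATE-compressed size of a file in bytes, reading it in chunks."""
    compressor = zlib.compressobj(level=9)
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(compressor.compress(chunk))
    total += len(compressor.flush())  # emit whatever is still buffered
    return total


def demo_ratio(payload: bytes) -> float:
    """Write the payload to a temporary file and return compressed / original size."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        return compressed_size(tmp.name) / len(payload)
    finally:
        os.remove(tmp.name)


# Repetitive text shrinks a lot; random bytes (used here as a stand-in for
# already-compressed data such as JPEG images) do not shrink at all.
text_ratio = demo_ratio(b"the cat sat on the mat . " * 4000)
noise_ratio = demo_ratio(os.urandom(100_000))
print(f"text: {text_ratio:.3f}, random bytes: {noise_ratio:.3f}")
```

The contrast between the two ratios mirrors the observation that text corpora compress substantially while image collections stored as jpg do not.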
The first striking result is that $E_S$ alone, with ~9M tokens, outperforms $E_L$, with ~1G tokens, on both evaluation tasks. Secondly, when combined with linguistic data, $E_S$ greatly outperforms $E_V$ on MEN and underperforms it on SimLex; however, their difference becomes marginal as text data increases. Importantly, $E_S$ achieves this result with orders of magnitude less data than required by $E_V$ (Table 5.1). Moreover, ResNet-152, with ~6.8G parameters, outputs a 1.7 times bigger model (4.8MB) than SGNS, used for $E_L$ and $E_S$ (2.8MB), consisting of 151,200 parameters. A summary of model sizes is included in Table 5.2 for the common subset of their vocabularies of 1203 words.

(a) MEN (b) SimLex
Figure 5.1: Effect of $E_L$ training corpus (token) quantity on performance on the common coverage subsets of evaluation pairs (73% on MEN, 56% on SimLex). $E_S$ and $E_V$ are constant since only $E_L$'s training data is varied.

Figures 5.2c and 5.2d report the effect of word frequency on performance on the same tasks. Similarly to [Sahlgren and Lenci, 2016], we split the vocabulary into three equally large parts: HIGH, MEDIUM and LOW range. On MEN we see a slight performance gain of the baseline $E_L$ model on medium range frequency words, whereas on SimLex, low frequency words dominate the performance within the whole data (MIXED). On SimLex visual information helps more with HIGH frequency words. This could be due to narrowing down the meaning of ambiguous words. Checking this hypothesis would be an interesting future analysis.

$E_S$ performs similarly to the FastText VG description model of [Herbelot, 2020] on SimLex. The increase in $E_L$ performance is in line with [Sahlgren and Lenci, 2016], who stopped at 1G tokens; in our experiments it continues until 2G tokens, after which it plateaus. The best Spearman correlation of [Kuzmenko and Herbelot, 2019] using relations on MEN is 0.5499, with roughly a third of the coverage (847) of ours on the common subset: $E_S$ achieves 0.44 with a coverage of 2481.
Their word2vec model is consistent with the results reported by [Sahlgren and Lenci, 2016] and with our word2vec based $E_L$ model trained on a similar amount of data.

(a) MEN, quantity (b) SimLex, quantity (c) MEN, frequency (d) SimLex, frequency
Figure 5.2: Effect of $E_L$ training corpus quantity and word frequency on performance. Numbers on top of the bars and on the lines indicate the coverage of evaluation dataset pairs (where both words are in the embedding vocabulary) in percentages. $E_S$ and $E_V$ are constant since only $E_L$'s training data is varied.

                  $E_L$            $E_S$                       $E_V$
Model             SGNS             SGNS                        ResNet-152
Training data     Wikipedia 2020   Visual Genome annotations   ImageNet + Google Images
Size in units     13G tokens       ~9M tokens                  ~1.28M + 15,770 images (jpg)
Storage size      14GB             ~1.8GB                      ~140GB
Compressed size   ~5GB             ~0.2GB                      ~140GB
Table 5.1: Training data sizes.

                             $E_L$     $E_S$     $E_V$
Model                        SGNS      SGNS      ResNet-152
Number of model parameters   151,200   151,200   6.8G
Embedding size               2.8MB     2.8MB     4.8MB
Table 5.2: Model sizes on the common subset of vocabularies ($|V_{common}| = 1203$).

5.3 Conclusion

Overall, we conclude that our structured visuo-linguistic embedding contributes to a linguistic model in a much more economical way than the image-based ones. We saw that when the linguistic sources are limited, visual or structured information can greatly improve semantic similarity and relatedness predictions. As the volume of our text corpus increases, its usefulness plateaus and the performance gain from other modalities shrinks; however, in most cases some improvement remains. These findings suggest that in certain cases one can save valuable training time and storage space by balancing the trade-off between training on different modalities and acquiring more text data.

Our structured embedding trained on Visual Genome Scene Graphs requires orders of magnitude less data than either of the other two modalities, while still contributing substantially to the meaning representation.
This may be due to the amount of human effort that went into creating the dataset. Applying automatically generated scene graphs [Xu et al., 2020] would mitigate this problem. This would serve as a highly effective tool with important applications for low-resource languages. Our findings support the intuition of “no free lunch” when it comes to effort, but depending on the task at hand and the available resources it can be crucial to optimise the types of resources we use. Here we only focused on data and model size. Including processing time and costs would be an important future extension of the efficiency analysis.

Exactly how $E_S$ contributes to the linguistic $E_L$ representation cannot be interpreted based solely on performance metrics. Therefore, we investigate the interpretation of our representations and the type of information they convey in the next chapter.

Chapter 6
Informativeness of Semantic Spaces

In this chapter, we introduce the third key contribution of this thesis (Chapter 1.1), presenting proof-of-concept studies of interpretable Transparency analysis. We present experiments demonstrating pillar 2 (Qualitative / Quantitative structural analysis) and pillar 3 (Independence analysis).

We aim to take the systematic studies in Chapters 4 and 5 a step further, and perform a quantitative and qualitative comparison of embedding space structures. We showcase an implementation in the framework of modalities as partial observers of meaning, introduced in Section 2.7.

Section 6.1 introduces our two hypotheses.

In Section 6.2 we tackle Question 5: Can we move beyond performance evaluation? Are there any emergent concepts in embeddings? Can we quantify the difference between the concept structures of semantic spaces? We hypothesise that each embedding space represents clusters of word representations which can be interpreted as each embedding's own “idea” of concepts in the world.
They can “disagree” depending on the data distributions of the specific modality and data source they were trained on. By zooming into our embeddings' structure we aim to find out how much their models of concepts differ from each other, if they differ at all. We are looking for quantitative ways of measuring the difference between embedding spaces to complement the qualitative analysis.

Section 6.3 addresses Question 6: Can we quantify the difference between semantic spaces, based on the useful information they contribute to the meaning representation? We apply the information-theoretical framework laid out in Section 2.7.5 to estimate the Mutual Information of two semantic spaces using the methods described in Section 3.2.4.

Finally, Section 6.4 investigates the results in the context of the distributional properties of the linguistic and structured data sources, $D_L$ and $D_S$.

Our main contribution is a proof-of-concept framework for quantifying the information that different data sources, models and modalities bring into multi-modal word representations. It can easily be applied to further data, model or modality types beyond the ones showcased in this study. This set of methods can help us look under the hood of accuracy numbers on evaluation tasks and better understand how these different concept models interact with each other when they are combined in multi-modal models of word meaning.

6.1 Hypotheses

Within our generalised embedding framework (Section 2.6) we use the same models as in Section 5.2. We propose investigating the structure of the learnt embedding spaces $E_L$, $E_V$, $E_S$. The aim is to qualitatively compare embedding spaces according to various metrics. These metrics aim to capture the distributional properties of vector spaces. Furthermore, we put the results in the context of analysing the training data distributions. Based on our previous findings we form the following hypotheses:

I. $E_V$ can be complementary to $E_L$ when the training corpus size is small. It is not clear whether in this case $E_V$ comes from a different and complementary distribution or whether the performance gain is only relative to the size of the additional data. In the latter case, we would achieve the same result by training on the same amount of additional text.

II. Due to the manufactured way of collecting data for $E_S$, it is possible that this dataset comes from a substantially different distribution than our linguistic data. Therefore, it can provide useful information and can facilitate learning from small data.

6.2 Qualitative Analysis of Semantic Spaces

As described in Section 3.2.3.1, in order to grasp how the concept structures of our embedding spaces differ from each other we first searched for ways to quantify their cluster structure. We do not know the ground truth labels of our clusters or even the number of clusters each embedding space should be broken into. Therefore, in Section 6.2.1 we present the results of experiments with three clustering metrics which are designed for the case when a ground truth labelling is not available. Furthermore, we report results for a range of numbers of clusters.

Following the desire to interpret how our different models conceptualise, in Sections 6.2.2 and 6.2.3 we zoom into our embedding spaces even further. In Section 6.2.2 we compare our embeddings' cluster structures and visualise the learnt clusterings. In Section 6.2.3 we present supervised visualisations of the embedding spaces alongside an automatic label generation method and compare the results against the clustering metric scores.

6.2.1 Cluster Structure Results

Clustering metric results are presented for increasing numbers of clusters, using K-means clustering, in Figure 6.1 (see the definition of the metrics in Section 3.2.3.1). We compare the common subset of our embedding vocabularies, resulting in 1204 words.
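To make the procedure of scoring K-means clusterings for an increasing number of clusters more concrete, the sketch below implements one of the three metrics, the Davies-Bouldin Index, in plain Python. It is a toy illustration under stated assumptions, not the code behind Figure 6.1: the 2-D points stand in for an embedding matrix, and the function names `kmeans` and `davies_bouldin`, the blob centres and the seeds are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(points: List[Vector], k: int, iters: int = 50, seed: int = 0) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster id per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels


def davies_bouldin(points: List[Vector], labels: List[int]) -> float:
    """Average over clusters of the worst (s_i + s_j) / d(c_i, c_j); lower is better."""
    clusters: Dict[int, List[Vector]] = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    centroids = {c: mean(members) for c, members in clusters.items()}
    # s_i: average distance of the members of cluster i to its centroid
    scatter = {
        c: sum(math.dist(p, centroids[c]) for p in members) / len(members)
        for c, members in clusters.items()
    }
    ids = list(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j]) for j in ids if j != i)
        for i in ids
    ]
    return sum(worst) / len(worst)


# Toy data: three well-separated blobs of 30 points each.
rng = random.Random(42)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)) for cx, cy in blob_centres for _ in range(30)]

for k in range(2, 7):
    print(k, round(davies_bouldin(points, kmeans(points, k)), 3))
```

In practice one would probably rely on a library instead; scikit-learn, for instance, provides `silhouette_score`, `calinski_harabasz_score` and `davies_bouldin_score` in `sklearn.metrics`. Whether the experiments reported here used it is not stated in this section.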
The Calinski-Harabasz Index and Davies-Bouldin Index results (Figures 6.1b and 6.1c) are fairly consistent with each other, while we see a different pattern for the Silhouette Coefficient in Figure 6.1a. This is unsurprising since the first two are based on node and centroid distances, whereas the latter calculates distances solely between nodes in the space.

On the Davies-Bouldin Index (Figure 6.1c) all models significantly outperform the baseline Random embedding $E_R \in \mathbb{R}^{|V_{common}| \times 300}$. All models achieve similar scores, with the visual, the structured and the linguistic-visual multi-modal models performing the best. This index represents the ratio between intra-cluster distances from the centroids and inter-cluster distances of centroids.

Calinski-Harabasz Index scores (Figure 6.1b) show a similar tendency among the models, with $E_V$ and $E_L + E_V$ performing best across the number of clusters, while all models beat the Random baseline. As the number of clusters grows the results converge to a lower (worse) score. This score can be interpreted as a measurement of how well defined the clusters are in terms of the ratio between inter- and intra-cluster dispersions, therefore a higher score means better defined clusters.

(a) Silhouette Coefficient. Higher is better. (b) Calinski-Harabasz Index. Higher is better. (c) Davies-Bouldin Index. Lower is better.
Figure 6.1: Clustering metrics for increasing number of K-means clusters.

The Silhouette Coefficient compares each data point's distances to the points within its own cluster with its distances to the data points in other clusters. It gives a ratio of cluster cohesion and separation. In Figure 6.1a we see a similar tendency across models as before (with $E_V$ as the best), with the exception of the structured model $E_S$. It outperforms all models up to ~20 clusters, then drops below the Random baseline by 40. Furthermore, the other models do not converge as they did in the previous two cases.
This suggests that $E_S$ has a much more cohesive structure of ~20 clusters, but becomes incohesive if we try to break it into more clusters. This phenomenon might be related to the statistical properties of the Visual Genome dataset $E_S$ is trained on. In the original paper [Krishna et al., 2016] the authors report results on clustering region descriptions. They found that, on average, each image contains descriptions from 17 different clusters, and the image with the most diverse descriptions contains descriptions from 26 clusters. Unlike our model, they clustered averaged pretrained word representations of region descriptions, therefore, their results are not directly comparable to ours. Nevertheless, we think this can indicate why this dramatic drop occurs at around 20 clusters in our experiments.

6.2.2 Inspecting the Clusters

In the following we inspect the individual clusters in all three embeddings after clustering them into 20 clusters. We also look at $E_S$ after clustering it into 40 clusters, where the drop in Silhouette Coefficient happens.

6.2.2.1 Size Distribution and Visualisation

In Figure 6.2 we present the distribution of cluster sizes (number of cluster members) for each case. Firstly, we observe that $E_L$ and $E_V$ cluster sizes move between 10 and ~100, whereas in both cases the $E_S$ cluster size distribution ranges between 1 and ~400. In the $E_S$ 20 clusters case (Figure 6.2a) most clusters range between 10 and 117; there are two one-element clusters and one with size 444. Clustering it into 40 clusters (Figure 6.2b) we get three one-element clusters and two salient clusters of sizes 148 and 310.

To check the consistency of clustering, in Figure 6.3 we present similar histograms after clustering the embeddings using Agglomerative Clustering. We see a very similar pattern in cluster size distribution as with K-means in all three embeddings. $E_S$ has a saliently big cluster of 351 elements.
The red line shows the average frequency of words (AF) in each cluster in the corresponding textual dataset (Visual Genome Scene Graphs for ES and Wikipedia2020 for EL). In the visual case the notion of word frequency is not applicable. We were mainly interested in whether the saliently big clusters in ES are an artefact of word frequencies. Whereas in the case of 20 K-means clusters we only see a slight drop in AF, in the 40 cluster case the two biggest clusters have relatively low AF, although there are other low AF clusters among the smaller ones as well (Figure 6.2). After Agglomerative Clustering (Figure 6.3) we observe a more substantial drop in AF for the two biggest clusters. In EL we see no such patterns, but the cluster sizes are less varied there.

Figure 6.2: K-means cluster size distributions. (a) ES, 20 clusters. (b) ES, 40 clusters. (c) EL, 20 clusters. (d) EV, 20 clusters. The y axis shows the number of cluster members in log scale. The red line shows the average frequency of words in each cluster in the corresponding textual dataset.

As an effective visualisation we use the T-SNE algorithm [Maaten and Hinton, 2008, Wattenberg et al., 2016] to zoom further into the structure of our embedding spaces. We used Tensorboard¹ for the projections, as well as its implementation of T-SNE. Following the guidelines in [Wattenberg et al., 2016] we tried different perplexity settings (running the algorithm multiple times). In most cases we did not find much difference between the results on our data; following the suggested range of 5–50, we present results for perplexity = 30 unless indicated otherwise.

Figures 6.5–6.8 and D.10 contain T-SNE visualisations of the clusterings. The salience of the biggest ES K-means clusters is visible in all cases (Figures 6.5, 6.8 and D.10). Based on the average frequency results, we think that the reason for this huge separable cluster is, at least partially, that it includes more low frequency words.
The breakdown of cluster cohesion is visible in the 40 cluster cases. In general, the clusters are fairly separated in all projections.

¹ https://www.tensorflow.org/tensorboard

Figure 6.3: Agglomerative cluster size distributions. (a) ES, 20 clusters. (b) EL, 20 clusters. (c) EV, 20 clusters. The y axis shows the number of cluster members in log scale. The red line shows the average frequency of words in each cluster in the corresponding textual dataset.

6.2.2.2 Cluster Similarities

Next, we looked into the individual clusters in each embedding. Each row in Tables 6.2–6.4 contains the members of example clusters for the corresponding embedding. (See tables including all clusters in Appendix D.) Rows are ordered by the number of cluster members in increasing order. Words in column “Members” are ordered by their distance from the cluster centroid in increasing order. (In the tables of ES clusters, D.2 and D.4, we shortened the biggest cluster, indicated by three dots, for better readability.) We labelled each cluster post-factum in two ways:

1. WordNet label was generated by querying the synset closure up to a depth of 3 in the hypernym hierarchy for each word in the cluster. Then we took each synset name in the closure lists and created a set from each of them (by removing duplicates). Next, we concatenated all the sets (each corresponding to one word) into one list. The generated cluster label consists of the three most common lemmas in this list. An example is shown in Figure 6.4. This can be considered as a form of “crowd-sourced” annotation, as it relies on a dataset created by human linguistic experts.

2. Own label is our annotation (made without looking at the WordNet labels). “Misc” stands for Miscellaneous, used where we could not find an appropriate concept to describe the cluster.

Figure 6.4: WordNet label generation example.
  1. Cluster = ['apple', 'pizza']
  2. closures('apple') = [Synset('edible_fruit.n.01'), Synset('pome.n.01'), Synset('fruit.n.01'), Synset('produce.n.01'), Synset('reproductive_structure.n.01'), Synset('food.n.02'), Synset('apple_tree.n.01'), Synset('fruit_tree.n.01'), Synset('angiospermous_tree.n.01'), Synset('apple.n.01'), Synset('apple.n.02')]
  3. closures('pizza') = [Synset('dish.n.02'), Synset('nutriment.n.01'), Synset('food.n.01'), Synset('pizza.n.01')]
  4. list of synset names in decreasing frequency order = ['food', 'nutriment', 'pizza', 'dish', 'apple', 'pome', 'fruit', 'apple tree', 'edible fruit', 'fruit tree', 'produce', 'angiospermous tree', 'reproductive structure']
  5. labels = ['food', 'nutriment', 'pizza']

Our own annotations and the WordNet labels are fairly consistent with each other, and often use the same words or synonyms, e.g., “drink”–“beverage”. One interesting exception is the fifth row in Table 6.4 of the image based clusters, which we interpreted as female visual stereotypes, whereas the WordNet label is “person, organism, causal agent”. We find our interpretation supported by previous work on the bias of Google Images [Kay et al., 2015], however, with the disclaimer of coherence being “in the eye of the beholder” [Bender et al., 2021]. WordNet labels can sometimes be more generic than our annotation. This may be because WordNet was created by multiple experts, as opposed to our own annotations.

In general, the Wikipedia based EL has more clusters with abstract topics, such as verbs, activities and communication. ES has more concrete clusters, e.g., train, vehicles, building structures, containers or furnishing, whereas the image based EV includes more clusters related to the outdoors, such as “travel”, “transportation”, “landscape” and “vacation”, and to appearance, such as “colours & materials”. These differences may not be surprising given each data source, but we would highlight the fact that these statistics are on the exact same vocabulary.
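The WordNet label generation illustrated in Figure 6.4 can be sketched in a few lines. This is a hedged illustration only: the closures are hard-coded from Figure 6.4 as lemma names instead of being queried from WordNet, the function name `cluster_label` is hypothetical, and ties between equally frequent lemmas are broken by first occurrence, which is an assumption and may order tied lemmas differently from the figure.

```python
from collections import Counter

# Toy stand-in for the depth-3 hypernym closures, copied from Figure 6.4 as
# lemma names. A real pipeline would query WordNet instead (assumption).
CLOSURES = {
    "apple": ["edible fruit", "pome", "fruit", "produce",
              "reproductive structure", "food", "apple tree", "fruit tree",
              "angiospermous tree", "apple", "apple"],
    "pizza": ["dish", "nutriment", "food", "pizza"],
}


def cluster_label(words, closures, top_n=3):
    """Return the top_n most frequent closure lemmas over all cluster words."""
    pooled = []
    for word in words:
        # dict.fromkeys removes per-word duplicates while keeping their order
        pooled.extend(dict.fromkeys(closures.get(word, [])))
    return [lemma for lemma, _ in Counter(pooled).most_common(top_n)]


labels = cluster_label(["apple", "pizza"], CLOSURES)
```

Only “food” occurs in both closures, so it is the only lemma whose rank is fully determined; the remaining two label slots are filled from lemmas that each occur once.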
The difference between these data sources is therefore not simply that they include different vocabularies, but that they “understand” the same words differently. This is the type of information we think is important to be conscious of when building on any data source or modality.

There are also some concepts that all three embeddings capture consistently, such as “food”, “colours”, “plants”, “animals” and “body parts”. The embeddings differ, however, in the number of clusters they have related to similar concepts, and of course their exact content differs to various extents.

In order to capture how similar the clusters are across the different embeddings, we measured the pairwise Jaccard similarity coefficient between each pair of embeddings. The Jaccard similarity coefficient between two clusters A, B is defined as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{6.1}$$

Note that $0 \le J(A, B) \le 1$. We calculated Jaccard similarity scores between each pair of clusters which represent concepts. Cluster maps of similarities are presented in Figures 6.9, 6.10 and 6.11. These are heat maps of Jaccard similarities, where the rows and columns of the matrix have been clustered for better visibility. Each row and column is labelled with its respective WordNet cluster label.

We observe that “food”, “plants”, “animals”, “body parts” and “travel / vehicle” related clusters are distinctly more similar between each pair of embeddings than the other clusters. Beyond this, ES and EL have similar clusters related to “visual property”, “clothing” and “structures / buildings”, and a “food” related ES cluster is close to a “container” cluster in EL. ES and EV contain more similar travel related clusters: “travel, change, object” – “physical entity, body of water, thing”, and a pair of containers / instruments: “artifact, whole, instrumentality” – “instrumentality, container, substance”.
EL and EV have similar clusters on “structure / area”, and an EL “artifact, whole, instrumentality” cluster is close to “food, beverage, produce” in EV.

Similar cluster maps are presented for Agglomerative Clustering in Appendix D, Figures D.7–D.9. Figures D.1–D.6 include heat maps where clusters are ordered by size. We did not find any pattern in similarities based on size. We also compared K-means and Agglomerative clusters of the same modalities in Figures 6.12–6.15. We found the cluster structures fairly similar; the most similar clusters are related to food, body parts, animals, plants, vehicles and visual properties.

In order to quantify how similar each pair of cluster structures is, in Table 6.1 we summarise the number of cluster pairs with Jaccard similarities above thresholds of [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]. In the case of K-means, even though EL and EV have 9 cluster pairs with 0.2 < J(·, ·) ≤ 0.479, ES has 12 cluster pairs with EL and 8 with EV above a similarity of 0.2. With Agglomerative Clustering this relative closeness of ES and EL disappears, while the other two pairs show similar patterns to K-means. K-means and Agglomerative clusterings are fairly similar, with EV sharing the most similar cluster structure.

Figure 6.14 includes a heat map of K-means vs. Agglomerative ES clusters ordered by size. Here, we can see that the two saliently biggest clusters are relatively similar, reaching 0.65 Jaccard similarity. Their labels also share the words “person” and “change”, which indicates that there is more meaningful coherence in those sizeable clusters than merely including low frequency words. Note that this coherence is hard to see with the naked eye because of the number of words to review.
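The comparison summarised in Table 6.1 amounts to building a matrix of Jaccard coefficients between the clusters of two clusterings and counting the entries above each threshold. A minimal sketch of that computation follows; the toy clusters and the function names are hypothetical and are not taken from our experiments.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(clusters_x, clusters_y):
    """Rows follow clusters_x, columns follow clusters_y."""
    return [[jaccard(cx, cy) for cy in clusters_y] for cx in clusters_x]


def count_above(matrix, thresholds):
    """For each threshold, count the cluster pairs strictly above it."""
    flat = [value for row in matrix for value in row]
    return {t: sum(value > t for value in flat) for t in thresholds}


# Two toy clusterings of the same vocabulary (hypothetical data).
clusters_a = [{"apple", "pizza", "bread"}, {"dog", "cat"}, {"car", "bus", "train"}]
clusters_b = [{"apple", "pizza"}, {"dog", "cat", "bread"}, {"car", "bus", "train"}]

sim = jaccard_matrix(clusters_a, clusters_b)
counts = count_above(sim, [0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
best = max(value for row in sim for value in row)  # the "Max" column
```

With 20 clusters on each side the matrix has 20 × 20 entries, which is why more than 20 pairs can exceed a low threshold.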
K-means                    >0.2   >0.3   >0.4   Max
ES-EL                      12     1      0      0.358
ES-EV                      8      2      0      0.363
EL-EV                      9      5      4      0.479

Agglomerative              >0.2   >0.3   >0.4   Max
ES-EL                      6      1      0      0.347
ES-EV                      9      4      2      0.5
EL-EV                      9      5      2      0.467

K-means – Agglomerative    >0.2   >0.3   >0.4   >0.5   >0.6   >0.7   Max
EL-EL                      18     14     9      4      2      1      0.79
ES-ES                      16     15     13     9      5      1      0.729
EV-EV                      23     16     13     8      4      0      0.644

Table 6.1: Number of cluster pairs out of 20² with Jaccard similarities above thresholds of [0.2, ..., 0.7]. Last column shows the maximum similarity.

WordNet label | Own label | Members

food, nutriment, foodstuff | food | butter, cheese, bread, chicken, soup, sauce, dessert, beef, salad, meat, cake, steak, tomato, potato, pizza, flour, milk, meal, vinegar, bacon, pie, cooking, sushi, sandwich, breakfast, burger, menu

vascular plant, plant organ, plant part | plants | flower, flowers, tree, blossom, dandelion, foliage, fruit, weed, cactus, lily, bloom, shade, leaf, grass, sunflower, poppy, vine, plant, garden, iris, grow, daisy, oak, bulb, rust, herb, moss, tulip, palm, maple, root, tall, bush, seed, family

atmospheric phenomenon, physical phenomenon, change | weather | rain, snow, fog, weather, mist, drizzle, frost, dew, cold, wet, wind, smoke, sunlight, misty, sunrise, winter, storm, sunset, haze, sunshine, fire, spring, dusk, autumn, heavy, atmosphere, cloud, sunny, burn, flood, desert, sun, hot, ice, tropical

artifact, covering, clothing | clothing / fashion | wig, clothes, dress, shoes, jacket, sweater, skirt, sunglasses, leather, hair, costume, shirt, haircut, cloth, socks, waist, mannequin, collar, jewelry, tattoo, lingerie, beard, blonde, mask, fabric, uniform, necklace, linen, outfit, glove, hat, fashion, blanket, bikini, knitting, swimsuit, crochet, badge, coat, carpet, bracelet, arms, makeup

artifact, structure, whole | classical architecture | tower, building, marble, staircase, fountain, doorway, roof, chapel, steeple, porch, ceiling, mural, glass, wall, brick, statue, stone, arch, monument, dome, window, gravestone, sculpture, aisle, tiles, gate, interior, painted,
decoration, concrete, church, graveyard, cathedral, curtain, painting, palace, clock, grave, portrait, choir, architecture, pyramid, memorial, square, castle, skyscraper, museum, cemetery, temple, organ

change, color, visual property | colour / decor | blue, bright, green, pink, black, yellow, dark, white, purple, red, brown, violet, rainbow, colour, orange, sky, rusty, silhouette, grey, diamond, redhead, light, flame, peacock, mirror, color, tiny, shadow, stripes, dull, rose, neon, colorful, crystal, bell, moon, horizon, arrow, silver, ivy, gold, swan, dragon, lantern, star, pearl, horn, ray, fox, globe, planet, bold, belt

body part, part, artifact | body parts | skin, spine, neck, bone, chest, throat, shoulder, wrist, stomach, ear, jaw, cheek, lips, nose, eyes, eye, limb, toe, belly, skull, abdomen, finger, teeth, elbow, cord, whiskers, knee, thumb, tooth, muscle, ankle, tail, paws, lip, brain, flesh, leg, body, calf, heart, blood, tongue, brow, pain, tear, blade, mouth, liver, gut, arm, marrow, curled, canine, feathers, foot, vein, hip, cancer

change, act, be | verbs | bring, get, come, want, go, keep, take, know, find, say, give, make, understand, put, listen, enjoy, feel, leave, think, learn, imagine, gather, believe, fail, arrange, add, lose, create, way, hear, send, meet, collect, carry, avoid, buy, remain, allow, appear, might, enter, arrive, seem, entertain, break, steal, receive, stop, stand, build, locked, compare, retain, sell, handle, danger, eat, wander, face, unhappy, protect, please, pray, become, walk, expand, travel, plenty, greet, inspect, comfort, huge, possess, dominate, attach, roam, participate, speak, step, drawn, construct, replace, divide, great, living

Table 6.2: Examples of the 20 clusters in EL. Clusters are ordered by size.
See all clusters in Appendix Table D.1

WordNet label | Own label | Members

artifact, line, whole | train | railway, railroad, subway, curve, tunnel, run, shelter, train, station, tram, highway, track, rail, way, engine, stop, gate, bridge, smoke

structure, area, room | room | classroom, hallway, hall, closet, bedroom, room, bathroom, garage, office, cafe, museum, doorway, kitchen, shop, restaurant, store, mannequin, stadium, market, ceiling, corner

bird, vertebrate, person | animals | hummingbird, gull, peacock, hawk, pelican, crow, parrot, seagull, wing, swan, pigeon, owl, goose, flamingo, nest, eagle, tail, bird, silhouette, duck, chest, body, ledge, giraffe, zebra

travel, wheeled vehicle, self-propelled vehicle | vehicles | cab, car, taxi, police, vehicle, automobile, drive, racing, scooter, bike, van, street, road, motorcycle, truck, speak, wagon, bus, parade, drawn, asphalt, cop, parking, bicycle, sidewalk, traffic, driver, carriage, meter

plant organ, plant, vascular plant | plants | bloom, foliage, grave, dead, vine, blossom, ivy, pod, cactus, tree, moss, root, leave, limb, forest, bush, plant, lily, branch, weed, leaf, vein, sunshine, log, fence, flower, sunlight, wood, palm, bench, sun

structure, artifact, whole | building parts | chapel, cottage, steeple, castle, dome, story, cathedral, build, skyscraper, arch, lighthouse, apartment, hut, angel, shed, hotel, monument, window, staircase, home, cabin, house, roof, porch, tower, sculpture, patio, bell, deck, brick, church, cross, clock, step, statue

instrumentality, container, substance | vessel | champagne, tea, beverage, alcohol, honey, milk, pencil, tulip, juice, oil, bakery, ceramic, container, coffee, tin, cup, beer, sunflower, daisy, wine, rose, marble, bowl, sweet, maker, jar, vessel, mug, money, bottle, pumpkin, straw, glass, basket, box, pot, bucket, bunch

body part, artifact, part | pets & body parts | jaw, throat, pupil, cheek, canine, belly, brow, mouth, stomach, tongue, eye, nose, poodle, ear, hamster, lip, fur, tooth, teeth, pet, leg,
wool, head, feline, toe, panda, smile, neck, face, beard, puppy, collar, horn, skin, cat, kitty, calf, nail, dog, tag, mother

physical entity, body of water, thing | water | rapid, village, coast, bay, mist, horizon, canal, skyline, valley, sea, cliff, fog, town, waterfall, stream, water, sunset, pier, harbor, boardwalk, break, ocean, lake, fountain, shore, island, river, wave, splash, city, rock, ship, building, sand, hill, crane, mountain, beach, pond, surf, boat, pool

location, artifact, region | farm animal | dandelion, boundary, grass, wild, deer, stork, field, mud, farm, windmill, garden, landscape, desert, cattle, dirt, area, barn, yard, zoo, ox, path, footprint, garbage, puddle, lawn, cow, sheep, concrete, snow, eat, lamb, goat, stone, cone, trail, rain, day, park, animal, cage, horse, bull, elephant

change, color, visual property | colors | bright, beautiful, big, dirty, small, colorful, grey, long, purple, dark, round, men, tiny, pink, eyes, painted, brown, gold, medium, white, hang, iron, silver, old, black, left, tall, red, safety, large, metal, blue, steel, yellow, leather, hanging, make, walk, green, right, color, bath, pair, washing, sitting, carry

food, produce, solid | food | drizzle, nuts, herb, beef, flour, season, cereal, cherry, breakfast, sugar, steak, bacon, burger, butter, rice, meat, meal, sauce, dinner, pie, raspberry, lunch, sushi, bean, mustard, pepper, seed, salt, soup, cheese, tomato, hot, berry, potato, dessert, strawberry, salad, cardboard, food, bone, lemon, burn, frost, chocolate, bread, turkey, sandwich, spoon, pizza, chicken, shell, candy, peel, cooking, bubble, knife, fruit, fish, donut, cake, apple, ice, banana, orange

Table 6.3: Examples of the 20 clusters in ES. Clusters are ordered by size.
See all clusters in Appendix Table D.2

WordNet label | Own label | Members

bird, aquatic bird, seabird | birds | seagull, gull, goose, duck, pelican, swan, mallard, stork, eagle, flamingo

furnishing, furniture, instrumentality | furnishing | furniture, stand, booth, desk, modern, display, bed, chair, container, door, appliance, drawer, sofa, curtain, couch, bench, crib, frame, box, table, tv, window, computer, cradle, television, mac

instrumentality, self-propelled vehicle, wheeled vehicle | car related | accident, cord, vehicle, auto, automobile, skate, photography, truck, race, arrive, ford, chopper, cab, rally, seat, industrial, smart, mechanic, racing, car, demolition, triumph, construction, motorcycle, machine, taxi, engine, driver, crane, carriage, van, bus, cannon, motor, tank, hockey, wagon, camera

vascular plant, plant, grow | plants | weed, bunch, maple, cancer, iris, poppy, dandelion, leave, flower, rose, foliage, grow, plant, cactus, spring, tulip, ivy, palm, lily, leaf, daisy, tree, root, wheat, wool, raspberry, tobacco, flowers, blossom, butterfly, sunflower, cotton, herb, violet, oak, moss, strawberry, nest, dew, berry, rice, branch, coal

person, organism, causal agent | “female topics” | woman, model, brandy, pink, actress, lady, girl, young, wife, tiny, haircut, blonde, women, girls, hot, mother, hair, portrait, body, makeup, cheek, wig, neck, muscle, chest, lingerie, waist, redhead, child, face, bride, belly, bikini, kid, swimsuit, baby, brow, skirt, dress, short

food, nutriment, substance | food | sushi, meal, sandwich, pie, breakfast, lunch, food, supper, flour, cereal, sweet, dessert, dinner, subway, diet, cake, date, steak, sauce, bread, copper, nuts, bacon, cooking, beef, meat, bakery, knitting, eat, potato, salad, donut, pizza, burger, coffee, soup, bean, cheese, vitamin, fruit, pumpkin, rock, marrow, market, timber

artifact, change, cover | colours & materials | texture, fabric, cloth, metal, rain, concrete, paper, suds, rough, words, stone, wall, square, dense, leather,
quote, wood, frost, mud, noise, text, purple, carpet, blue, tiles, dirt, droplets, red, sand, fog, formula, mist, pattern, handwriting, green, straw, linen, asphalt, stripes, crowd, marble, yellow, black, brown, grey, grass, white

body part, artifact, part | body parts | gut, throat, wrist, burn, ear, thumb, elbow, listen, shoulder, liver, pain, knee, arms, hand, toe, finger, give, tongue, limb, abdomen, jaw, receive, nail, arm, feet, hear, skin, washing, head, ankle, hip, teeth, tear, stomach, brain, foot, lip, mouth, leg, flesh, mask, eyes, nose, skull, eye, socks, lips

structure, artifact, area | room | museum, garage, hall, classroom, kitchen, cellar, interior, office, diner, decoration, exhibition, hotel, ceiling, restaurant, store, bathroom, trial, pub, class, closet, cafe, room, porch, stairs, deck, hospital, living, corridor, aisle, bar, staircase, doorway, hallway, chapel, floor, lab, station, bedroom, gate, elevator, theatre, escalator, tunnel, organ, alley, library, jail, tram

travel, change, object | vacation | island, view, reflection, harbor, nice, side, sea, summer, tropical, pollution, port, aircraft, pier, travel, surfers, journey, sunny, coast, flying, morning, ocean, seashore, horizon, mare, holiday, lake, surf, shore, vacation, bay, airport, cliff, sunlight, air, river, storm, ship, fishing, beach, desert, harbour, puddle, flight, sailing, evening, sunrise, skyline, vessel, lighthouse, dawn, sunset, rocket, mountain, whale, underwater, boat, swimming, swim, plane, dusk, jet, cloud, sky, airplane, ski

change, abstraction, state | festival | theme, wisdom, soul, image, possess, large, confidence, happiness, beautiful, joy, love, ceremony, festival, movement, abundance, dead, depth, celebration, lover, run, demon, blurred, pray, happy, remain, wet, dance, navy, family, carnival, angel, sculpture, ray, dragon, drive, atmosphere, night, shadow, band, god, believe, party, dark, hanging, abstract, show, christmas, monster, devil, jump, lighting, sunshine, warrior, painting, water, aquarium, zombie, concert, haze, crystal, statue, explosion, jazz, jellyfish, wave, bright, rainbow, ice, light, smoke, club, neon, colorful, hole, protest, autumn, rust, reef, flame, fire

person, organism, causal agent | animals | animals, animal, picture, painted, zoo, turkey, curled, goat, companion, pets, canine, pet, prey, relaxed, horse, spirit, tail, dog, chipmunk, squirrel, pigeon, fox, cute, please, sheep, owl, birds, military, giraffe, lion, lamb, bee, insect, hamster, hawk, licking, bird, cat, puppy, feline, terrier, deer, calf, rat, chicken, camel, dragonfly, whiskers, poodle, cow, hound, cattle, lizard, fish, bunny, crow, wolf, tiger, parrot, zebra, cheetah, fur, panda, bull, wasp, ox, hen, frog, crab, snake, boxer, hummingbird, rabbit, elephant, pupil, husky, peacock, spider, pug, ant

Table 6.4: Examples of the 20 clusters in EV. Clusters are ordered by size. See all clusters in Appendix Table D.3

Figure 6.5: T-SNE plot of ES with 20 cluster labels obtained by K-means clustering.

Figure 6.6: T-SNE plot of EL with 20 cluster labels obtained by K-means clustering.

Figure 6.7: T-SNE plot of EV with 20 cluster labels obtained by K-means clustering.

Figure 6.8: T-SNE plot of ES with 40 cluster labels obtained by K-means clustering. T-SNE perplexity = 10.

Figure 6.9: Cluster map of Jaccard coefficients between K-means clusters of ES (axis y) and EL (axis x).

Figure 6.10: Cluster map of Jaccard coefficients between K-means clusters of ES (axis y) and EV (axis x).

Figure 6.11: Cluster map of Jaccard coefficients between K-means clusters of EL (axis y) and EV (axis x).

Figure 6.12: Cluster map of Jaccard coefficients between K-means (axis y) and Agglomerative (axis x) clusters of EL.

Figure 6.13: Cluster map of Jaccard coefficients between K-means (axis y) and Agglomerative (axis x) clusters of ES.

Figure 6.14: Heatmap of Jaccard coefficients between K-means (axis y) and Agglomerative (axis x) clusters of ES.
Clusters are ordered by size.

Figure 6.15: Cluster map of Jaccard coefficients between K-means (axis y) and Agglomerative (axis x) clusters of EV.

6.2.2.3 Gamified Data Collection

Figure 6.16: Screen-shot of Concept Game, a two player, collaborative gamified data collection app, for acquiring cluster label annotations.

We developed a two player, collaborative gamified data collection app, called Concept Game², similar to the ESP Game [Von Ahn and Dabbish, 2004], but with word lists (clusters) instead of images (Figure 6.16). The pair of players have to guess the concept for a list of words, which are the elements of the clusters from this section. They get a score if their guesses have one word/expression in common. This way we aim to collect more human cluster label annotations for different modalities in the future.

The back-end involves an SQLite database on an AWS server³, where we collect data. The dataset includes two tables:

• Game: It stores each game round, that is, each time the users see a new word list they are guessing a concept for. We log the following attributes:
  – game_id = TextField()
  – start_time = DateTimeField()
  – cluster_id = TextField()
  – user1 = TextField(): first user's id
  – user2 = TextField(): second user's id
  – guess = TextField(): the guessed word; NONE if they ran out of time

• Answer: This table stores the log for each word the users typed in, with time stamps. This way, later, the time needed for agreeing on a cluster label can be used to infer the difficulty / ambiguity of a cluster word list. It logs the following attributes:
  – game = ForeignKeyField(Game, backref='answers'): reference to a game_id in Game.
  – cluster_id = TextField()
  – user = TextField(): id of the user who typed in a word as an answer
  – word = TextField()
  – e_time = TimeField(): elapsed time since the beginning of the game

² http://concept-guessing-game.com/
³ https://aws.amazon.com/

The project is still under development in order to make it more accessible.
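As an illustration of how the two tables could be used to estimate the difficulty of a word list, the following sketch rebuilds them in an in-memory SQLite database and computes the average time to agreement per cluster. It is a hedged example rather than the Concept Game back-end: it uses Python's standard `sqlite3` module instead of ORM field declarations, stores `e_time` as seconds (an assumed unit), and all rows are made-up toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game (
    game_id    TEXT PRIMARY KEY,
    start_time TEXT,
    cluster_id TEXT,
    user1      TEXT,
    user2      TEXT,
    guess      TEXT   -- 'NONE' if the players ran out of time
);
CREATE TABLE answer (
    game       TEXT REFERENCES game(game_id),
    cluster_id TEXT,
    user       TEXT,
    word       TEXT,
    e_time     REAL   -- seconds since the round started (assumed unit)
);
""")

# Toy rounds: two successful rounds on one cluster, one timed-out round.
conn.executemany("INSERT INTO game VALUES (?, ?, ?, ?, ?, ?)", [
    ("g1", "2021-01-01 10:00:00", "c_food", "u1", "u2", "food"),
    ("g2", "2021-01-01 10:05:00", "c_misc", "u1", "u2", "NONE"),
    ("g3", "2021-01-01 10:10:00", "c_food", "u3", "u4", "food"),
])
conn.executemany("INSERT INTO answer VALUES (?, ?, ?, ?, ?)", [
    ("g1", "c_food", "u1", "fruit", 3.0),
    ("g1", "c_food", "u2", "food", 5.0),
    ("g1", "c_food", "u1", "food", 8.0),
    ("g2", "c_misc", "u1", "thing", 10.0),
    ("g2", "c_misc", "u2", "stuff", 20.0),
    ("g3", "c_food", "u3", "food", 4.0),
    ("g3", "c_food", "u4", "food", 6.0),
])

# Time to agreement of a round = elapsed time of its last logged answer;
# rounds that ended without agreement (guess = 'NONE') are left out.
rows = conn.execute("""
    SELECT g.cluster_id, AVG(t.agree_time)
    FROM game AS g
    JOIN (SELECT game, MAX(e_time) AS agree_time
          FROM answer GROUP BY game) AS t ON t.game = g.game_id
    WHERE g.guess != 'NONE'
    GROUP BY g.cluster_id
    ORDER BY g.cluster_id
""").fetchall()
```

Treating the last logged answer as the moment of agreement is a simplification made for this sketch; other definitions, such as the share of rounds ending with NONE, would be equally plausible difficulty signals.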
Currently, people can only play if there are enough players active on the platform. So far only test data has been collected. In the future an auto replay functionality would greatly improve the usability of the game. The code is publicly available on Github⁴. The web technology development was helped by Krisztián Gergely⁵.

⁴ https://github.com/anitavero/concept_game
⁵ http://krisoft.hu/

6.2.3 Supervised Visualisation

In this section we use the same T-SNE algorithm as in Section 6.2.2. However, for the labelled projections we apply a WordNet based automatic labelling technique on the words beforehand. This is fundamentally different from the previous section, where the labelling came from the clustering method in an unsupervised fashion. In that case, WordNet was used only for analysing the cluster outputs, whereas here we label the data first. This way we can inspect our embedding spaces based on pre-defined concepts. The previous method is more generic, whereas this approach contributes to the interpretation of embeddings.

6.2.3.1 Automatic Class Label Annotation

Figures 6.20–6.24 show coloured plots where the colours correspond to 13 class labels. We used the same coarse categories as in [Gupta et al., 2019]. They labelled their data manually, which we were not able to do due to the size of our data. Therefore, we developed a technique to automatically label our words using the WordNet hierarchy. Let C be the set of class labels, C = {transport, food, building, animal, appliance, action, clothes, utensil, body, colour, electronics, number, human}. All words in the embeddings' common subset vocabulary $V_{common}$ were labelled with a class in the following way: First, we queried the synset list $S(c)$ for each class $c \in C$. Then we obtained the synset closure of each word $w$ up to the third level in the hypernym hierarchy: $S_{cl_3}(w)$.
The class whose synsets overlap most with the word's synset closure is assigned as the word's class label: $\mathrm{class}(w) = \arg\max_{c \in C} |S(c) \cap S_{cl_3}(w)|$. We only show words where this maximum exists.

6.2.3.2 Results

Figure 6.17 depicts a 2D projection of a 3D T-SNE plot of a sample of 100 000 words from the SGNS Wikipedia 2020 model. After looking at the word labels, clear clusters became apparent, such as words in different languages and topics (e.g., math, mental health, numbers). The thin curves usually contain numbers with the same number of digits, in order. Figures 6.18 and 6.19 show two examples of the clusters.

Figure 6.20 shows a 2D T-SNE plot of our Wikipedia 2020 model trained on the whole corpus. Despite the simple heuristic we used to generate class labels, clearly separable clusters emerged for many of them. We can see colours indicated by orange, numbers by blue, clothes by red, food related words by light green, buildings by brown, animals by purple, etc. Some of the confused labels visibly come from the failure of our labelling technique, but, on inspection, many mislabelled words cluster around other words in the same topic / category.

In Figures 6.21–6.24 we show similar projections for EL, EV, ES and a random embedding ER, where we restricted the vocabulary to the intersection of the three modalities, then kept the words with an existing WordNet label, resulting in 252 words. EL, EV and ES all clearly show much more distinct clusters with much better defined class labels than the random embedding. This may seem obvious; however, it is worth noting, since in very high dimensions even random vector spaces can show some structure. In our projection in Figure 6.24, both data points and labels are uniformly distributed.

Looking at the projections in Figures 6.21–6.23, the three modalities have different cluster shapes, with EV having the most and ES having the least coherent and separable clusters.
This is consistent with the results on clustering metrics in Figure 6.1. In general, the classes transport, food, building, animal, clothes, colour, number and action appear to be better captured by this labelling and projection technique than appliance, utensil, body, electronics and human. This is probably due to the coarse labelling method, and could be alleviated by collecting human annotation. [Gupta et al., 2019] reported that their visual-context model showed more distinct clusters than their linguistic one using GloVe. In our T-SNE projections we did not find such patterns, although our method is fundamentally different from theirs, as they use early fusion and GloVe, they do not exploit the Visual Genome graph structure, and they apply manual labelling. Overall, it is remarkable how much structure can already be revealed without the need for additional human effort.

6.3 Information Gain from Multi-modal Data

So far we compared our embedding spaces based on their cluster structure. In this section we move on to pillar 3 in our analysis. This second type of transparency analysis involves experiments for measuring similarity between distributions, based on an information-theoretical approach introduced in Section 2.7.5. We aim to measure the information gain ES and EV each contribute when combined with EL. By treating the embedding spaces as samples from multivariate distributions we formulate the question in the following way: Are two semantic spaces from different modalities independent from each other? We employ empirical Mutual Information estimation methods, described in Section 3.2.4. Section 6.3.1 describes details of the analysis; results are presented in Section 6.3.2.⁶

Figure 6.17: T-SNE plot of a trained SGNS model on a 2020 dump of Wikipedia.

Figure 6.18: Cluster containing the word “pancakes” on the T-SNE plot of a trained SGNS model on a 2020 dump of Wikipedia.
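The independence question posed above can be illustrated with a small kernel-based dependence estimate. The sketch below computes the biased empirical HSIC statistic between two samples whose rows are aligned (in our setting, the rows would be the vectors of the same words in two embedding spaces). It is an assumption-laden illustration, not the estimator implementation used in this chapter: the function names are hypothetical, the Gaussian RBF bandwidth falls back to the median pairwise distance when none is given, the data are synthetic, and the returned value is the raw HSIC statistic rather than a calibrated mutual information estimate.

```python
import math
import random
from statistics import median


def rbf_gram(X, sigma=None):
    """Gaussian RBF Gram matrix; sigma defaults to the median pairwise distance."""
    n = len(X)
    sq = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in X] for x in X]
    if sigma is None:
        pairwise = [math.sqrt(sq[i][j]) for i in range(n) for j in range(i + 1, n)]
        sigma = median(pairwise) or 1.0  # fall back to 1.0 if all points coincide
    return [[math.exp(-d / (2 * sigma ** 2)) for d in row] for row in sq]


def center(K):
    """Double-centre a Gram matrix, i.e. compute H K H with H = I - 1/n."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(X, Y, sigma=None):
    """Biased empirical HSIC, trace(K H L H) / n^2, for row-aligned samples."""
    n = len(X)
    Kc = center(rbf_gram(X, sigma))
    L = rbf_gram(Y, sigma)
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2


# Synthetic 1-D samples (assumption): Y_dep depends on X, Y_ind does not.
rng = random.Random(0)
X = [[rng.gauss(0, 1)] for _ in range(40)]
Y_dep = [[x[0] ** 2] for x in X]
Y_ind = [[rng.gauss(0, 1)] for _ in range(40)]
Y_const = [[1.0] for _ in range(40)]

self_dependence = hsic(X, X)
independent = hsic(X, Y_ind)
constant = hsic(X, Y_const)
```

A value close to zero is consistent with independence, while larger values indicate dependence; a sample that carries no information, such as the constant one, yields a statistic of zero up to floating point error.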
6.3.1 Hyper Parameters and Dimensionality Reduction

Since IKNN is not robust in very high dimensions, we explore the hyper parameters of IHSIC. We used the Gaussian Radial Basis Function (RBF) Kernel [Vert et al., 2004] with parameter setting σ = 1 and with σ chosen by the median heuristic [Garreau et al., 2017].

Furthermore, in order to test the robustness of the results we ran the method after projecting our spaces onto lower dimensional spaces using Principal Component Analysis (PCA) [Wold et al., 1987]. We tested the embeddings with dimensions d = {10, 100, max}, where max is the full dimension of each space. For further robustness, we ran the IHSIC algorithm for d = {3, 11, 12, 13, 50} (Appendix E).

⁶ We would like to thank Zoltán Szabó for his counsel on the theoretical background for these studies.

Figure 6.19: Cluster containing the number “1505” on the T-SNE plot of a trained SGNS model on a 2020 dump of Wikipedia.

6.3.2 Results

The main benefit of this experiment is that we may be able to understand how data of different modalities contribute to the performance of multi-modal embeddings, if they contribute at all. In case they do, is it just an artefact of introducing more data, or is it due to meaningful information which changes the structure of the vector space in a useful way?

In Figures 6.25 and 6.26 the y axis shows I(EL, EV) (red) and I(EL, ES) (blue), where I is the estimated Shannon mutual information using either a k-Nearest Neighbor based, linear algorithm (IKNN) or the HSIC kernel method (IHSIC). In Figure 6.25 the x axis represents the size of the training corpus e1, . . . , eN (in terms of the number of tokens) for EL. Apart from IHSIC with σ = 1, the models agree on I(EL, EV) being greater than I(EL, ES), which suggests that the Visual Genome Scene Graph based structured embedding ES is “more independent” from the linguistic model EL than the image based EV. This is surprising after observing the two models behaving similarly in Chapter 4. Moreover, the results are interesting, since, while the creation of this type of training data was highly
Moreover, the results are interesting because, although the creation of this type of training data was highly visually directed, it is still a text based model. Nevertheless, it is “farther” from the linguistic model in distribution than the visual one. $I(E_L, E_V)$ appears to be lower for lower volumes of text data. This may be because with more data the two spaces contain more related information. However, in the case of $I_{HSIC}$ with maximal dimensions, using the median heuristic for $\sigma$, this pattern cannot be seen. In $I(E_L, E_S)$ no such tendency can be observed.

Figure 6.20: T-SNE plot of a trained SGNS model on a 2020 dump of Wikipedia. The colours correspond to 13 classes automatically generated using the WordNet hierarchy: transport, food, building, animal, appliance, action, clothes, utensil, body, colour, electronics, number, human.

Figure 6.21: T-SNE plot of $E_L$ with its vocabulary restricted to the common subset of $E_L$, $E_V$, $E_S$ and the words with an existing automatic WordNet class label, resulting in 252 words. The colours correspond to 13 classes automatically generated using the WordNet hierarchy: transport, food, building, animal, appliance, action, clothes, utensil, body, colour, electronics, number, human.

Figure 6.26 reports the effect of word frequency (in the $E_L$ training corpus) on the estimated $I$. Similarly to [Sahlgren and Lenci, 2016], we split the vocabulary into three equally large parts: HIGH, MEDIUM and LOW range. This way we generate samples for $E_L$, $E_V$ and $E_S$ for the different frequency ranges in the text corpus. Again, higher mutual information between the linguistic and the visual embeddings can be observed. The negative $I_{KNN}$ in Figure 6.26a is due to the oscillating nature of the approximation, and shows that the k-Nearest Neighbor method is not robust enough in this high dimension. In terms of the effect of word frequency, the only pattern that emerges is the relatively low mutual information between $E_L$ and $E_V$ on low frequency words.
However, this may be an artefact of sparse data, since the coverage drops dramatically when filtering pairs which fall in the same frequency category (see Figure 5.2).

In order to further test the robustness of the results, we ran the $I_{HSIC}$ algorithm for further dimensions in the very low range and one medium size: $d \in \{3, 11, 12, 13, 50\}$. The results are shown in Appendix E. They support the overall pattern in the above figures, adding that the results lose their robustness for $d = 3$.

Figure 6.22: T-SNE plot of $E_V$ with its vocabulary restricted to the common subset of $E_L$, $E_V$, $E_S$ and the words with an existing automatic WordNet class label, resulting in 252 words. The colours correspond to 13 classes automatically generated using the WordNet hierarchy: transport, food, building, animal, appliance, action, clothes, utensil, body, colour, electronics, number, human.

Figure 6.23: T-SNE plot of $E_S$ with its vocabulary restricted to the common subset of $E_L$, $E_V$, $E_S$ and the words with an existing automatic WordNet class label, resulting in 252 words. The colours correspond to 13 classes automatically generated using the WordNet hierarchy: transport, food, building, animal, appliance, action, clothes, utensil, body, colour, electronics, number, human.

Figure 6.24: T-SNE plot of a random embedding $E_R \in \mathbb{R}^{252 \times 300}$. The colours correspond to 13 classes automatically generated using the WordNet hierarchy: transport, food, building, animal, appliance, action, clothes, utensil, body, colour, electronics, number, human. The colour labels are evenly distributed on the projection.

Figure 6.25: Estimated mutual information $I(E_L, E_V)$ (red) and $I(E_L, E_S)$ (blue) for different corpus sizes. Panels: (a) $I_{KNN}$; (b) $I_{HSIC}$, $\sigma = 1$, $d = \max$; (c) $I_{HSIC}$, $\sigma$: median, $d = \max$; (d) $I_{HSIC}$, $\sigma$: median, $d = 100$; (e) $I_{HSIC}$, $\sigma$: median, $d = 10$.

6.4 Dataset Distribution

Finally, we analyse the text based data source distributions $D_L$ and $D_S$ directly to get another perspective on the type of information they convey. We present the words in the respective datasets with the 10 highest probabilities of co-occurrence with each centroid word from Section 6.2.2 (see footnote 7). To estimate this probability we calculated Pointwise Mutual Information (PMI), Positive PMI (PPMI) (Equation 2.1), a modified PMI (PMI3), $\chi^2$ [Manning and Schutze, 1999, Section 5.3.3] and Fisher’s exact test [Pedersen, 1996]. PMI3 has an exponent of 3 for the numerator and no logarithm. We used the NLTK package implementations of all the above metrics (see footnote 8).

Footnote 7: Words duplicated by appearing as both left and right context are removed. Therefore the number of words is $\leq 10$.

Footnote 8: https://www.nltk.org/api/nltk.html#module-nltk.collocations

Since PMI, PPMI and Fisher’s test suffered from over-representing low frequency bigrams, we only present results for $\chi^2$ and PMI3, which produced fairly similar results. Table 6.5 presents examples for words closest to cluster centroids, with the 10 highest $\chi^2$ scores. Results for the full set of centroid words using $\chi^2$ and PMI3 can be found in Appendix F.
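The two measures retained above can be written out directly from bigram counts. The sketch below is an illustration only, not the NLTK code used for the reported results: it assumes adjacent-token bigrams, implements Pearson's $\chi^2$ on the 2x2 contingency table of a bigram, and implements PMI3 as the cubed bigram count divided by the product of the two marginal counts, with no logarithm. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (chi_sq, pmi3)} for all adjacent token pairs."""
    bigrams = list(zip(tokens, tokens[1:]))
    n_xx = len(bigrams)                       # total number of bigrams
    pair_counts = Counter(bigrams)
    left_counts = Counter(w1 for w1, _ in bigrams)
    right_counts = Counter(w2 for _, w2 in bigrams)

    scores = {}
    for (w1, w2), n_ii in pair_counts.items():
        n_ix = left_counts[w1]                # bigrams starting with w1
        n_xi = right_counts[w2]               # bigrams ending with w2
        n_io = n_ix - n_ii                    # w1 followed by another word
        n_oi = n_xi - n_ii                    # w2 preceded by another word
        n_oo = n_xx - n_ix - n_xi + n_ii      # neither w1 first nor w2 second
        denom = n_ix * n_xi * (n_xx - n_ix) * (n_xx - n_xi)
        if denom:
            chi_sq = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2 / denom
        else:
            chi_sq = 0.0                      # degenerate table
        pmi3 = n_ii ** 3 / (n_ix * n_xi)      # exponent 3, no logarithm
        scores[(w1, w2)] = (chi_sq, pmi3)
    return scores


def top_contexts(tokens, centroid, k=10, use_chi_sq=True):
    """Left and right neighbours of `centroid`, best-scoring first."""
    ranked = []
    for (w1, w2), (chi_sq, pmi3) in bigram_scores(tokens).items():
        if centroid in (w1, w2):
            context = w2 if w1 == centroid else w1
            ranked.append((chi_sq if use_chi_sq else pmi3, context))
    ranked.sort(key=lambda item: -item[0])
    top = [context for _, context in ranked[:k]]
    # A word may occur as both left and right context; keep one copy,
    # so fewer than k words can be returned.
    return list(dict.fromkeys(top))


# Hypothetical toy corpus; real input would be a tokenised dataset.
corpus = ("the plate is lying on top of the table and "
          "the plate is on the table").split()
print(top_contexts(corpus, "plate", k=3))
```

Tokens can be any hashable items, so multi-word relations such as “lying on top of” could be treated as single units if the corpus is pre-segmented that way. Low frequency pairs still receive high $\chi^2$ scores in very small corpora, so the toy output is not indicative of behaviour on real data.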
Centroid | Wikipedia | Visual Genome
plate | tectonics, nazca, restrictor, farallon, subducts, license, cribriform, tectonic, subducting, eurasian | plate, lying on top of, on, has, on top of, in
rust | epique, cronartium, oleum, cohle, obritzberg, blister, belt, puccinia, windexed, colored | rust, stains down, around side of, rusted onto, on fire, with a lot
hummingbird | amazilia, selasphorus, mellisuga, calypte, cynanthus, berylline, scintillant, orthorhyncus, eupherusa, chinned | hummingbird, eat nectar from, in flight below, flapping its, flapping, windspan
fun | poked, poking, pokes, poke, loving, lot, lovin, yidishn, fun, weäsell | are having, are having great, fun, facing away, planning, having
hand | right, sleight, grenades, left, hand, cranked, grenade, claps, gloved, upper | hand, holding, held in, on, in mans, man
bird | passerine, migratory, caged, sanctuary, watchers, watching, topley, species, prey, furnariidae | bird, perched on, flying in, flying over, beak, flying ahead of

Table 6.5: Examples of context words of cluster centroids with the 10 highest $\chi^2$ scores. See all cluster centroids in Appendix F.

The samples reveal that while Wikipedia includes more encyclopaedic synonyms as its most likely bigrams, Visual Genome conveys more functional, specific types of context, including more actions and attributes. For example, compare “tectonics” in Wikipedia with “lying on top of” in Visual Genome as the most likely co-occurrence for “plate”.

Our observations are in line with the word distributions in VG published in [Krishna et al., 2016]. The most common concepts (Figure 6.27), objects (Figure 6.28), attributes (Figure 6.29) and relationships (Figure 6.30) all paint a picture of how visually oriented VG annotations are. The published statistics also support our observation that VG mostly includes specific descriptions of smaller scenes.
These support our previous findings that Visual Genome can contribute complementary information to a text based meaning representation by having denser annotations of visual scenes.

Figure 6.26: Estimated mutual information $I(E_L, E_V)$ (red) and $I(E_L, E_S)$ (blue) for different word frequency ranges. Panels: (a) $I_{KNN}$; (b) $I_{HSIC}$, $\sigma = 1$, $d = \max$; (c) $I_{HSIC}$, $\sigma$: median, $d = \max$; (d) $I_{HSIC}$, $\sigma$: median, $d = 100$; (e) $I_{HSIC}$, $\sigma$: median, $d = 10$.

6.5 Conclusion

In this chapter we presented proof-of-concept studies of interpretable Transparency analysis, forming the second and third pillars of our analysis (Section 3.3).

Qualitative / Quantitative Structural Analysis. Firstly, our aim was to interpret our models by zooming into the distributional properties of linguistic, visual, structured and multi-modal embeddings. We ran K-means and Agglomerative clusterings on each embedding and used standard clustering metrics for evaluation when class labels are not given. The results indicate that while the image based model may have better defined clusters, the Visual Genome Scene Graph structured model can outperform the other ones in terms of consistency when the number of clusters is chosen well. We visualised the clustered embeddings and inspected the individual clusters from the best K-means clustering. We introduced a WordNet based cluster label annotation technique. Furthermore, we compared the clustering to Agglomerative Clustering results.

The supervised T-SNE visualisations provide further insight into the structure of our semantic spaces, and are in line with the above findings. We introduced a simple method to automatically annotate our data with topic labels, saving a large amount of human effort. Remarkably, the results already give further insight into our data, despite the simple heuristic of label generation. We believe the method could be easily improved to gain better coverage of the vocabulary and higher accuracy of labels.
Independence Analysis. Secondly, we implemented our information theory based framework to measure the information gain that visual and structured embeddings may provide when combined with text based linguistic models. We found that the Visual Genome Scene Graph based structured model is more independent from the Wikipedia based SGNS model than the visual embeddings trained on images. This may help explain why this structural data, on its own as well as combined with linguistic information, can achieve such high accuracies despite having orders of magnitude less training data than either of the other modalities (as we saw in Chapter 5). Analysing the effect of VG and image data size on this metric would be an important future direction, as we saw that the mutual information of image and text based embeddings increases with corpus size. However, in the context of the structured model’s comparable performance, we think that the estimated mutual information is a promising metric for deciding on the usefulness of a new data source.

Summary of Transparency Analysis. Let us examine the two hypotheses we made in Section 6.1. All three embedding types show different cluster structures. However, the image based embedding is closer to the linguistic one than our visually structured, textual embedding, both in terms of cluster structure and in being more mutually dependent. Considering this result in relation to the performance numbers in the previous chapters, we conclude that the image based embedding requires orders of magnitude more data and training time, while not necessarily providing additional useful information to a text based representation in the context of word semantic similarity. Therefore, we weakly reject Hypothesis I. On the other hand, based on the three pillars of our analyses: 1. reaching comparable performance despite being based on a small model trained on small data, 2.
the quantitative and qualitative analysis of its cluster structure and 3. independence analysis, we conclude that our structured embedding provides complementary information to our linguistic representation while being highly efficient. Hence, we accept Hypothesis II.

Investigating transformers, Bayesian MI estimators and other evaluations could be potential extensions of these studies. Applying automatically generated scene graphs [Xu et al., 2020] would mitigate the main limitation of this approach, which is the manual labour required for creating VG. This would serve as a highly effective tool with important applications for low resource languages.

Figure 6.27: (a) A plot of the most common visual concepts or phrases that occur in region descriptions. The most common phrases refer to universal visual concepts like “blue sky,” “green grass,” etc. (b) A plot of the most frequently used words in region descriptions. Colours occur the most frequently, followed by common objects like “man” and “dog” and universal visual concepts like “sky.” [Krishna et al., 2016]

Figure 6.28: (a) Examples of objects in VG. Each object is localized in its image with a tightly drawn bounding box. (b) Plot of the most frequently occurring objects in images. People are the most frequently occurring objects in the dataset, followed by common objects and visual elements like “building”, “shirt”, and “sky”. [Krishna et al., 2016]

Figure 6.29: (a) Distribution showing the most common attributes in VG. Colours (“white”, “red”) and materials (“wooden”, “metal”) are the most common. (b) Distribution showing the number of attributes describing people. State-of-motion verbs (“standing”, “walking”) are the most common, while certain sports (“skiing”, “surfing”) are also highly represented due to an image source bias in the image set. [Krishna et al., 2016]

Figure 6.30: (a) A sample of the most frequent relationships in VG. In general, the most common relationships are spatial (“on top of”, “on side of”, etc.). (b) A sample of the most frequent relationships involving humans in the dataset. The relationships involving people tend to be more action oriented (“walk”, “speak”, “run”, etc.). [Krishna et al., 2016]

Chapter 7

Summary and Conclusions

This thesis has pursued a better understanding of the impact of visual information on semantic models in non-visual tasks. Since the literature on these tasks is narrower and more inconclusive, we aimed to construct a broader evaluation and analysis. We introduced a general embedding formalism and a three pillar framework for transparent analysis of multi-modal semantic embedding models. We proposed and implemented a new type of embedding in between linguistic and visual modalities, based on small data. We analysed its contribution to linguistic representations within our analytical framework.
Furthermore, we presented and showcased a framework for treating modalities as partial observers of meaning, based on information theory.

7.1 Main Findings

The main findings are the following:

• The source of images affects the performance of multi-modal mid-fused semantic representations.
• The number of images in ordered sources has an impact on performance, but it stabilizes at around 10–20 images.
• Visual information can be complementary for smaller linguistic corpora, but this effect does not necessarily scale with corpus size.
• Images convey complementary statistical information about the co-occurrence of objects in visual scenes, but there is no direct indication of how low level visual features contribute.
• Cluster analysis can provide a useful framework for analysing emergent concept structures. Combined with independence analysis, it can serve as a useful framework for transparent embedding analysis.
• VG Scene Graph based, visually structured, textual models achieve comparable or better performance in an economic way, by using orders of magnitude fewer resources than visual models. When combined with our linguistic model, the structured model enriches it with more divergent information than the image based one. Its clusters represent more concrete concepts, in-between visual and linguistic domains.

7.2 Conclusion and Future Work

Instead of comparing all the latest models at the time, we developed a general analysis framework and presented proof-of-concept studies, which can be applied to various models in the future. To present our methodology, we employed the smallest possible models which allow us to incorporate visual embeddings, and thus to study multi-modality. Therefore, in this work we applied the shallow skip-gram network: it is among the simplest neural models, and visual embeddings fit into it more easily than into count based models. Furthermore, we used the mid-fusion technique, which made it straightforward to study individual modalities.
Incorporating this methodology into the evaluation of various recent models would be the next step. In parallel, the analysis methodology can also be further developed. One direction is to test the level of visual information that impacts abstract semantic representations. One potential test is to gradually reduce the resolution of the images we use for visual embeddings and see how the performance changes, in particular at what rate it starts to decline. This way we would see how much visual detail can be omitted while keeping the same gain for conceptually abstract tasks.

Another exciting direction would be to extend the notion of modality and compare semantic representations trained across different data sources in general, such as corpora of different authors, from different times or different styles and social circles. A further extension of the notion of semantic representation could be measuring semantic change in time, such as the polarisation of political discourse. This has the potential to have positive social impact if we are capable of detecting the time and “place” of the source of miscommunication.

Applying automatically generated scene graphs would mitigate the main limitation of the presented Visual Genome based approach, which is the manual labour required for creating it. This would serve as a highly effective tool with important applications for low resource languages.

For measuring information gain, experimenting with Bayesian Mutual Information estimation methods and other evaluation and training datasets would also be a viable future route.

Understanding the information our various data sources convey, and the biases our different models have on them, is essential work in Artificial Intelligence. Data driven AI applications surround us, thus we believe there is a surging need for such meta analyses in order to advance this technology in a more conscious way.

Bibliography

[Agrawal et al., 2016] Agrawal, A., Batra, D., and Parikh, D. (2016).
Analyzing the behavior of visual question answering models. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960.
[Anderson et al., 2017] Anderson, A. J., Kiela, D., Clark, S., and Poesio, M. (2017). Visually grounded and textual semantic models differentially decode brain activity associated with concrete and abstract nouns. Transactions of the Association for Computational Linguistics, 5:17–30.
[Anderson et al., 2016] Anderson, A. J., Zinszer, B. D., and Raizada, R. D. (2016). Representational similarity encoding for fMRI: Pattern-based synthesis to predict brain activity using stimulus-model-similarities. NeuroImage, 128:44–53.
[Antol et al., 2015] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). VQA: Visual question answering. Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
[Artetxe et al., 2018] Artetxe, M., Labaka, G., and Agirre, E. (2018). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. ACL.
[Arthur and Vassilvitskii, 2006] Arthur, D. and Vassilvitskii, S. (2006). k-means++: The advantages of careful seeding. Stanford.
[Arthur et al., 2016] Arthur, P., Neubig, G., and Nakamura, S. (2016). Incorporating discrete translation lexicons into neural machine translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1557–1567.
[Bahdanau et al., 2015] Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation By Jointly Learning To Align and Translate. ICLR 2015, 26(1):1–15.
[Barocas et al., 2019] Barocas, S., Hardt, M., and Narayanan, A. (2019). Fairness and Machine Learning. http://www.fairmlbook.org.
[Baroni and Lenci, 2008] Baroni, M. and Lenci, A. (2008). Concepts and properties in word spaces. Italian Journal of Linguistics, 20(1):55–88.
[Batchkarov et al., 2016] Batchkarov, M., Kober, T., Ren, J., Weeds, J., and Weir, D. (2016). A critique of word similarity as a method for evaluating distributional semantic models. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 7–12.
[Bender et al., 2021] Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of FAccT 2021.
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3:1137–1155.
[Bergsma and Goebel, 2011] Bergsma, S. and Goebel, R. (2011). Using visual information to predict lexical preference. Proceedings of RANLP, pages 399–405.
[Boleda, 2020] Boleda, G. (2020). Distributional semantics and linguistic theory. Annual Review of Linguistics, 6:213–234.
[Bowker and Star, 2000] Bowker, G. C. and Star, S. L. (2000). Sorting things out: Classification and its consequences.
[Bowman et al., 2015] Bowman, S., Angeli, G., Potts, C., and Manning, C. D. (2015). A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.
[Bruni et al., 2014] Bruni, E., Tran, N.-K., and Baroni, M. (2014). Multimodal distributional semantics. J. Artif. Intell. Res. (JAIR), 49(2014):1–47.
[Bucci, 1985] Bucci, W. (1985). Dual coding: A cognitive model for psychoanalytic research. Journal of the American Psychoanalytic Association, 33(3):571–607.
[Bulat et al., 2017] Bulat, L., Clark, S., and Shutova, E. (2017). Speaking, seeing, understanding: Correlating semantic models with conceptual representation in the brain. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1081–1091.
[Cho et al., 2014] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
[Chomsky et al., 2000] Chomsky, N. et al. (2000). New horizons in the study of language and mind.
[Clark, 2015] Clark, S. (2015). Vector space models of lexical meaning. Handbook of Contemporary Semantics, 10:9781118882139.
[Conneau et al., 2018] Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. (2018). What you can cram into a single &!#* vector: Probing sentence embeddings for linguistic properties. ACL 2018-56th Annual Meeting of the Association for Computational Linguistics, 1:2126–2136.
[Cover and Thomas, 2012] Cover, T. and Thomas, J. (2012). Elements of Information Theory.
[Davis et al., 2019] Davis, C., Bulat, L., Verő, A. L., and Shutova, E. (2019). Deconstructing multimodality: visual properties and visual context in human semantic processing. Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 118–124.
[Deerwester et al., 1990] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407.
[Deng et al., 2009] Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F. (2009). ImageNet: A large-scale hierarchical image database. Proceedings of CVPR, pages 248–255.
[Devlin et al., 2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT (1).
[Dinu et al., 2015] Dinu, G., Lazaridou, A., and Baroni, M. (2015). Improving zero-shot learning by mitigating the hubness problem.
International Conference on Learning Representations, Workshop Track.
[Dubossarsky et al., 2019] Dubossarsky, H., Hengchen, S., Tahmasebi, N., and Schlechtweg, D. (2019). Time-out: Temporal referencing for robust modeling of lexical semantic change. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 457–470.
[Erk, 2016] Erk, K. (2016). What do you know about an alligator when you know the company it keeps? Semantics and Pragmatics, 9:17–1.
[Ernst and Banks, 2002] Ernst, M. O. and Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870):429.
[Faruqui et al., 2016] Faruqui, M., Tsvetkov, Y., Rastogi, P., and Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 30–35.
[Fergus et al., 2005] Fergus, R., Li, F., Perona, P., and Zisserman, A. (2005). Learning object categories from Google’s image search. Proceedings of ICCV, pages 1816–1823.
[Firth, 1957] Firth, J. R. (1957). A synopsis of linguistic theory. Studies in Linguistic Analysis, Oxford: Philological Society, (1–32), reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952-1959, London: Longman (1968).
[Fouhey and Zitnick, 2014] Fouhey, D. F. and Zitnick, C. L. (2014). Predicting object dynamics in scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2019–2026.
[Gabrilovich et al., 2007] Gabrilovich, E., Markovitch, S., et al. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. IJCAI, 7:1606–1611.
[Garreau et al., 2017] Garreau, D., Jitkrittum, W., and Kanagawa, M. (2017). Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269.
[Gasparri and Marconi, 2021] Gasparri, L. and Marconi, D. (2021). Word Meaning. The Stanford Encyclopedia of Philosophy.
[Gerz et al., 2016] Gerz, D., Vulić, I., Hill, F., Reichart, R., and Korhonen, A. (2016). SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity. EMNLP.
[Ghorbani et al., 2019] Ghorbani, A., Wexler, J., Zou, J. Y., and Kim, B. (2019). Towards automatic concept-based explanations. Advances in Neural Information Processing Systems, 32:9277–9286.
[González et al., 2006] González, J., Barros-Loscertales, A., Pulvermüller, F., Meseguer, V., Sanjuán, A., Belloch, V., and Ávila, C. (2006). Reading cinnamon activates olfactory brain regions. Neuroimage, 32(2):906–912.
[Gretton et al., 2005] Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. International conference on algorithmic learning theory, pages 63–77.
[Grice, 1975] Grice, H. P. (1975). Logic and conversation. pages 41–58.
[Gupta et al., 2019] Gupta, T., Schwing, A., and Hoiem, D. (2019). ViCo: Word embeddings from visual co-occurrences. Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7425–7434.
[Handjaras et al., 2016] Handjaras, G., Ricciardi, E., Leo, A., Lenci, A., Cecchetti, L., Cosottini, M., Marotta, G., and Pietrini, P. (2016). How concepts are encoded in the human brain: a modality independent, category-based cortical organization of semantic knowledge. Neuroimage, 135:232–242.
[Harnad, 1990] Harnad, S. (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.
[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
[Herbelot, 2020] Herbelot, A. (2020). Re-solve it: simulating the acquisition of core semantic competences from small data. Proceedings of the 24th Conference on Computational Natural Language Learning, pages 344–354.
[Hill et al., 2015] Hill, F., Reichart, R., and Korhonen, A. (2015).
SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Association for Computational Linguistics.
[Hinton et al., 2012] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[Hooker, 2021] Hooker, S. (2021). Moving beyond “algorithmic bias is a data problem”. Patterns, 2(4):100241.
[Jitkrittum et al., 2017] Jitkrittum, W., Szabó, Z., and Gretton, A. (2017). An adaptive test of independence with analytic kernel embeddings. International Conference on Machine Learning, pages 1742–1751.
[Johnson et al., 2017] Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910.
[Kabbach et al., 2019] Kabbach, A., Gulordava, K., and Herbelot, A. (2019). Towards incremental learning of word embeddings using context informativeness. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 162–168.
[Kaur et al., 2020] Kaur, H., Nori, H., Jenkins, S., Caruana, R., Wallach, H., and Wortman Vaughan, J. (2020). Interpreting interpretability: Understanding data scientists’ use of interpretability tools for machine learning. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–14.
[Kay et al., 2015] Kay, M., Matuszek, C., and Munson, S. A. (2015). Unequal representation and gender stereotypes in image search results for occupations. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3819–3828.
[Kelly jr, 1956] Kelly jr, J. (1956). A new interpretation of information rate. The Bell System Technical Journal.
[Kendall et al., 2017] Kendall, A., Badrinarayanan, V., and Cipolla, R. (2017). Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder ar- chitectures for scene understanding. British Machine Vision Conference 2017, BMVC 2017. [Kiela and Bottou, 2014] Kiela, D. and Bottou, L. (2014). Learning image em- beddings using convolutional neural networks for improved multi-modal se- mantics. Proceedings of EMNLP, pages 36–45. [Kiela and Clark, 2014] Kiela, D. and Clark, S. (2014). A Systematic Study of Semantic Vector Space Model Parameters. Proceedings of EACL 2014, Work- shop on Continuous Vector Space Models and their Compositionality (CVSC). [Kiela et al., 2014] Kiela, D., Hill, F., Korhonen, A., and Clark, S. (2014). Im- proving multi-modal representations using image dispersion: Why less is some- times more. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2:835–841. [Kiela et al., 2016] Kiela, D., Vero˝, A. L., and Clark, S. (2016). Comparing Data Sources and Architectures for Deep Visual Representation Learning in Seman- tics. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-16). [Kilgarri↵ and Yallop, 2000] Kilgarri↵, A. and Yallop, C. (2000). What’s in a thesaurus? LREC, pages 1371–1379. [Kiros et al., 2014] Kiros, R., Salakhutdinov, R., and Zemel, R. (2014). Multi- modal neural language models. International Conference on Machine Learning, pages 595–603. 187 [Kiros et al., 2015] Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Tor- ralba, A., Urtasun, R., and Fidler, S. (2015). Skip-Thought Vectors. ArxiV, 58(786):1–11. [Kottur et al., 2015] Kottur, S., Vedantam, R., Moura, J. M. F., and Parikh, D. (2015). Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes. arXiv preprint. [Kripke, 1972] Kripke, S. A. (1972). Naming and necessity. pages 253–355. 
[Krishna et al., 2016] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M., and Fei-Fei, L. (2016). Visual genome: Connecting language and vision using crowdsourced dense image annotations. [Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Proceedings of NIPS, pages 1106–1114. [Kuhnle, 2020] Kuhnle, A. (2020). Evaluating visually grounded language capa- bilities using microworlds. Technical report, University of Cambridge, Com- puter Laboratory. [Kuhnle and Copestake, 2017] Kuhnle, A. and Copestake, A. (2017). Shapeworld-a new test methodology for multimodal language understanding. arXiv preprint arXiv:1704.04517. [Kuzmenko and Herbelot, 2019] Kuzmenko, E. and Herbelot, A. (2019). Distri- butional semantics in the real world: building word vector representations from a truth-theoretic model. Proceedings of the 13th International Conference on Computational Semantics-Short Papers, pages 16–23. [Lazaridou et al., 2015] Lazaridou, A., Baroni, M., et al. (2015). Combining lan- guage and vision with a multimodal skip-gram model. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, pages 153–163. [Lazaridou et al., 2016] Lazaridou, A., Pham, N. T., and Baroni, M. (2016). To- wards Multi-Agent Communication-Based Language Learning. 188 [LeCun et al., 1989] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551. [Lecun et al., 1998] Lecun, Y., Bottou, L., Bengio, Y., and Ha↵ner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324. [Lenci, 2008] Lenci, A. (2008). 
Distributional semantics in linguistic and cogni- tive research. Italian journal of linguistics, 20(1):1–31. [Lenci, 2018] Lenci, A. (2018). Distributional models of word meaning. Annual review of Linguistics, 4:151–171. [Levy and Goldberg, 2014a] Levy, O. and Goldberg, Y. (2014a). Dependency- based word embeddings. ACL (2), pages 302–308. [Levy and Goldberg, 2014b] Levy, O. and Goldberg, Y. (2014b). Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, pages 2177–2185. [Levy et al., 2015] Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving dis- tributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225. [Lin et al., 2013] Lin, M., Chen, Q., and Yan, S. (2013). Network in network. CoRR, abs/1312.4400. [Lin et al., 2014] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra- manan, D., Dolla´r, P., and Zitnick, C. L. (2014). Microsoft coco: Common objects in context. European conference on computer vision, pages 740–755. [Lin and Parikh, 2015] Lin, X. and Parikh, D. (2015). Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 07-12-June:2984–2993. [Lu et al., 2019] Lu, J., Batra, D., Parikh, D., and Lee, S. (2019). Vilbert: Pre- training task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32. 189 [Lucey et al., 2017] Lucey, J. A., Otter, D., and Horne, D. S. (2017). A 100-year review: Progress on the chemistry of milk and its components. Journal of Dairy Science, 100(12):9916–9932. [Maaten and Hinton, 2008] Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605. [MacKay, 2003] MacKay, D. J. (2003). 
Information theory, inference and learning algorithms. [MacQueen et al., 1967] MacQueen, J. et al. (1967). Some methods for classifica- tion and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1(14):281–297. [Majumdar et al., 2020] Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., and Batra, D. (2020). Improving vision-and-language navigation with image-text pairs from the web. European Conference on Computer Vision, pages 259–274. [Manning and Schutze, 1999] Manning, C. and Schutze, H. (1999). Foundations of statistical natural language processing. [Marconi, 1997] Marconi, D. (1997). Lexical competence. [Margolis and Laurence, 2021] Margolis, E. and Laurence, S. (2021). Concepts. The Stanford Encyclopedia of Philosophy. [Mervis and Rosch, 1981] Mervis, C. B. and Rosch, E. (1981). Categorization of natural objects. Annual review of psychology, 32(1):89–115. [Mikolov et al., 2013a] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Ecient estimation of word representations in vector space. 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings. [Mikolov et al., 2018] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2018). Advances in pre-training distributed word representations. Proceedings of the International Conference on Language Resources and Eval- uation (LREC 2018). 190 [Mikolov et al., 2013b] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, pages 3111–3119. [Miller, 1995] Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41. [Minnema and Herbelot, 2019] Minnema, G. and Herbelot, A. (2019). 
From brain space to distributional space: the perilous journeys of fmri decoding. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 155–161. [Mitchell and Lapata, 2010] Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive science, 34(8):1388–429. [Mitchell et al., 2008] Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.- M., Malave, V. L., Mason, R. A., and Just, M. A. (2008). Predicting human brain activity associated with the meanings of nouns. science, 320(5880):1191– 1195. [Nair and Hinton, 2010] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. Proceedings of ICML, pages 807–814. [Navigli, 2009] Navigli, R. (2009). Word sense disambiguation: A survey. ACM computing surveys (CSUR), 41(2):1–69. [Nelson et al., 2004] Nelson, D. L., McEvoy, C. L., and Schreiber, T. A. (2004). The university of south florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402– 407. [Pedersen, 1996] Pedersen, T. (1996). Fishing for exactness. arXiv preprint cmp- lg/9608010. [Pennington et al., 2014] Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 1532–1543. 191 [Pereira et al., 2018] Pereira, F., Lou, B., Pritchett, B., Ritter, S., Gershman, S. J., Kanwisher, N., Botvinick, M., and Fedorenko, E. (2018). Toward a universal decoder of linguistic meaning from brain activation. Nature commu- nications, 9(1):1–13. [Peters et al., 2018] Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representa- tions. 
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. [Ponce et al., 2006] Ponce, J., Berg, T. L., Everingham, M., Forsyth, D. A., Hebert, M., Lazebnik, S., Marszalek, M., Schmid, C., Russell, B. C., Torralba, A., et al. (2006). Dataset issues in object recognition. pages 29–48. [Putnam, 1970] Putnam, H. (1970). Is semantics possible? Metaphilosophy, 1(3):187–201. [Radford et al., 2018] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by genera- tive pre-training. URL https://s3-us-west-2. amazonaws. com/openai- assets/researchcovers/languageunsupervised/language understanding paper.pdf. [Radovanovic´ et al., 2010] Radovanovic´, M., Nanopoulos, A., and Ivanovic´, M. (2010). On the existence of obstinate results in vector space models. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 186–193. [Recanati, 2004] Recanati, F. (2004). Literal meaning. [Rockta¨schel et al., 2016] Rockta¨schel, T., Grefenstette, E., Hermann, K. M., Kocˇisky´, T., and Blunsom, P. (2016). Reasoning about Entailment with Neural Attention. ICLR. [Roy, 2005] Roy, D. (2005). Grounding words in perception and action: Compu- tational insights. [Sahlgren and Lenci, 2016] Sahlgren, M. and Lenci, A. (2016). The e↵ects of data size and frequency range on distributional semantic models. EMNLP 2016. 192 [Scarselli et al., 2008] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2008). The graph neural network model. IEEE transactions on neural networks, 20(1):61–80. [Schuler, 2005] Schuler, K. K. (2005). Verbnet: A broad-coverage, comprehensive verb lexicon. [Schu¨tze et al., 2008] Schu¨tze, H., Manning, C. D., and Raghavan, P. (2008). Introduction to information retrieval. 
Proceedings of the international commu- nication of association for computing machinery conference, 4. [Searle, 1985] Searle, J. R. (1985). Expression and meaning: Studies in the theory of speech acts. [Shannon, 2001] Shannon, C. E. (2001). A mathematical theory of communica- tion. ACM SIGMOBILE mobile computing and communications review, 5(1):3– 55. [Sharma et al., 2015] Sharma, S., Kiros, R., and Salakhutdinov, R. (2015). Ac- tion Recognition using Visual Attention. arXiv preprint, pages 1–11. [Silberer and Lapata, 2014] Silberer, C. and Lapata, M. (2014). Learning Grounded Meaning Representations with Autoencoders. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June 23-25:721–732. [Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015. [Sivic and Zisserman, 2003] Sivic, J. and Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. IEEE International Conference on Computer Vision, (Iccv):1470–1477. [Socher et al., 2014] Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., and Ng, A. Y. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218. 193 [Spa¨rck Jones, 1967] Spa¨rck Jones, K. (1967). A small semantic classification experiment using cooccurrence data. Report ML, 196. [Srivastava and Salakhutdinov, 2012] Srivastava, N. and Salakhutdinov, R. R. (2012). Multimodal learning with deep boltzmann machines. Advances in neural information processing systems, pages 2222–2230. [Su et al., 2019] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J. (2019). Vl-bert: Pre-training of generic visual-linguistic representations. International Conference on Learning Representations. 
[Sudre et al., 2012] Sudre, G., Pomerleau, D., Palatucci, M., Wehbe, L., Fyshe, A., Salmelin, R., and Mitchell, T. (2012). Tracking neural coding of perceptual and semantic features of concrete nouns. NeuroImage, 62(1):451–463. [Szabo´, 2014] Szabo´, Z. (2014). Information theoretical estimators toolbox. The Journal of Machine Learning Research, 15(1):283–287. [Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9. [Tettamanti et al., 2005] Tettamanti, M., Buccino, G., Saccuman, M. C., Gallese, V., Danna, M., Scifo, P., Fazio, F., Rizzolatti, G., Cappa, S. F., and Perani, D. (2005). Listening to action-related sentences activates fronto-parietal motor circuits. Journal of cognitive neuroscience, 17(2):273–281. [Torralba and Efros, 2011] Torralba, A. and Efros, A. A. (2011). Unbiased look at dataset bias. CVPR 2011, pages 1521–1528. [Tsai et al., 2019] Tsai, Y.-H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency, L.-P., and Salakhutdinov, R. (2019). Multimodal transformer for unaligned multimodal language sequences. Proceedings of the Annual Meeting of the Association for Computational Linguistics. [Turney, 2010] Turney, P. D. (2010). From Frequency to Meaning : Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188. 194 [Vendrov et al., 2015] Vendrov, I., Kiros, R., Fidler, S., and Urtasun, R. (2015). Order-Embeddings of Images and Language. arXiv preprint, (2005):1–13. [Vert et al., 2004] Vert, J.-P., Tsuda, K., and Scho¨lkopf, B. (2004). A primer on kernel methods. Kernel methods in computational biology, 47:35–70. [Voita and Titov, 2020] Voita, E. and Titov, I. (2020). Information-theoretic probing with minimum description length. 
Proceedings of the 2020 Confer- ence on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196. [von Ahn and Dabbish, 2004] von Ahn, L. and Dabbish, L. (2004). Labeling images with a computer game. CHI, pages 319–326. [Von Ahn and Dabbish, 2004] Von Ahn, L. and Dabbish, L. (2004). Labeling im- ages with a computer game. Proceedings of the SIGCHI conference on Human factors in computing systems, pages 319–326. [Wang et al., 2018a] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2018a). Glue: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355. [Wang et al., 2018b] Wang, J., Madhyastha, P. S., and Specia, L. (2018b). Object counts! bringing explicit detections back into image captioning. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2180–2193. [Wang et al., 2005] Wang, Q., Kulkarni, S. R., and Verdu´, S. (2005). Diver- gence estimation of continuous distributions based on data-dependent parti- tions. IEEE Transactions on Information Theory, 51(9):3064–3074. [Wang et al., 2009] Wang, Q., Kulkarni, S. R., and Verdu´, S. (2009). Diver- gence estimation for multidimensional densities via k-nearest-neighbor dis- tances. IEEE Transactions on Information Theory, 55(5):2392–2405. 195 [Wang and Jiang, 2015] Wang, S. and Jiang, J. (2015). Learning Natural Lan- guage Inference with LSTM. Naacl. [Wattenberg et al., 2016] Wattenberg, M., Vie´gas, F., and Johnson, I. (2016). How to use t-sne e↵ectively. Distill. [Wittgenstein, 1953] Wittgenstein, L. (1953). Philosophical investigations. [Wold et al., 1987] Wold, S., Esbensen, K., and Geladi, P. (1987). Principal com- ponent analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37– 52. 
[Xu et al., 2016] Xu, H., Murphy, B., and Fyshe, A. (2016). Brainbench: A brain-image test suite for distributional semantic models. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2017–2021. [Xu et al., 2020] Xu, P., Chang, X., Guo, L., Huang, P.-Y., Chen, X., and Haupt- mann, A. G. (2020). A survey of scene graph: Generation and application. IEEE Trans. Neural Netw. Learn. Syst. 2020. [Yang et al., 2019] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32. [Yeung, 1991] Yeung, R. W. (1991). A new outlook on shannon’s information measures. IEEE transactions on information theory, 37(3):466–474. [Yogatama et al., 2019] Yogatama, D., d’Autume, C. d. M., Connor, J., Kocisky, T., Chrzanowski, M., Kong, L., Lazaridou, A., Ling, W., Yu, L., Dyer, C., et al. (2019). Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373. [Zhang and Bowman, 2018] Zhang, K. and Bowman, S. (2018). Language model- ing teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. Proceedings of the 2018 EMNLP Workshop Black- boxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 359–361. 196 [Zhang et al., 2016] Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., and Parikh, D. (2016). Yin and yang: Balancing and answering binary visual ques- tions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5014–5022. [Zhang et al., 2018] Zhang, Q., Wang, W., and Zhu, S.-C. (2018). Examining cnn representations with respect to dataset bias. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). 
Appendix A

Cross-validated Semantic Relatedness and Similarity

Embedding                       Spearman       P-value        Coverage

wikinews                        0.797 (0.004)  <1‰ (<1‰)      2000
wikinews sub                    0.805 (0.005)  <1‰ (<1‰)      2000
crawl                           0.843 (0.001)  <1‰ (<1‰)      2000
w2v13                           0.684 (0.007)  <1‰ (<1‰)      2000

Google AlexNet                  0.506 (0.009)  <1‰ (<1‰)      2000
VG SceneGraph                   0.427 (0.006)  <1‰ (<1‰)      1716
Google VGG                      0.516 (0.005)  <1‰ (<1‰)      2000
VG-internal                     0.377 (0.008)  <1‰ (<1‰)      1856
VG-whole                        0.415 (0.006)  <1‰ (<1‰)      1856
Google ResNet-152               0.469 (0.003)  <1‰ (<1‰)      2000

wikinews+Google AlexNet         0.499 (0.003)  <1‰ (<1‰)      2000
wikinews+VG SceneGraph          0.568 (0.013)  <1‰ (<1‰)      2000
wikinews+Google VGG             0.512 (0.005)  <1‰ (<1‰)      2000
wikinews+VG-internal            0.367 (0.008)  <1‰ (<1‰)      2000
wikinews+VG-whole               0.402 (0.007)  <1‰ (<1‰)      2000
wikinews+Google ResNet-152      0.479 (0.008)  <1‰ (<1‰)      2000
wikinews sub+Google AlexNet     0.506 (0.011)  <1‰ (<1‰)      2000
wikinews sub+VG SceneGraph      0.380 (0.010)  <1‰ (<1‰)      2000
wikinews sub+Google VGG         0.514 (0.012)  <1‰ (<1‰)      2000
wikinews sub+VG-internal        0.364 (0.009)  <1‰ (<1‰)      2000
wikinews sub+VG-whole           0.387 (0.013)  <1‰ (<1‰)      2000
wikinews sub+Google ResNet-152  0.463 (0.004)  <1‰ (<1‰)      2000
crawl+Google AlexNet            0.501 (0.004)  <1‰ (<1‰)      2000
crawl+VG SceneGraph             0.778 (0.006)  <1‰ (<1‰)      2000
crawl+Google VGG                0.516 (0.006)  <1‰ (<1‰)      2000
crawl+VG-internal               0.357 (0.012)  <1‰ (<1‰)      2000
crawl+VG-whole                  0.398 (0.008)  <1‰ (<1‰)      2000
crawl+Google ResNet-152         0.514 (0.007)  <1‰ (<1‰)      2000
w2v13+Google AlexNet            0.501 (0.012)  <1‰ (<1‰)      2000
w2v13+VG SceneGraph             0.645 (0.008)  <1‰ (<1‰)      2000
w2v13+Google VGG                0.518 (0.010)  <1‰ (<1‰)      2000
w2v13+VG-internal               0.372 (0.005)  <1‰ (<1‰)      2000
w2v13+VG-whole                  0.403 (0.004)  <1‰ (<1‰)      2000
w2v13+Google ResNet-152         0.486 (0.002)  <1‰ (<1‰)      2000

Table A.1: Cross-validated Spearman correlations on the MEN dataset. The Spearman and P-value columns summarise three samples, each obtained by leaving out a third of the evaluation pairs. Multi-modal embeddings are created using the Padding technique.
The table sections contain linguistic, visual and multi-modal embeddings in this order.

Embedding                       Spearman       P-value        Coverage

wikinews                        0.463 (0.009)  <1‰ (<1‰)      666
wikinews sub                    0.412 (0.025)  <1‰ (<1‰)      666
crawl                           0.506 (0.019)  <1‰ (<1‰)      666
w2v13                           0.316 (0.020)  <1‰ (<1‰)      666

Google AlexNet                  0.348 (0.025)  <1‰ (<1‰)      666
VG SceneGraph                   0.274 (0.019)  <1‰ (<1‰)      395
Google VGG                      0.363 (0.017)  <1‰ (<1‰)      666
VG-internal                     0.311 (0.059)  0.023 (0.027)  68
VG-whole                        0.169 (0.024)  0.178 (0.068)  68
Google ResNet-152               0.354 (0.007)  <1‰ (<1‰)      666

wikinews+Google AlexNet         0.332 (0.032)  <1‰ (<1‰)      666
wikinews+VG SceneGraph          0.348 (0.018)  <1‰ (<1‰)      666
wikinews+Google VGG             0.332 (0.014)  <1‰ (<1‰)      666
wikinews+VG-internal            0.300 (0.002)  <1‰ (<1‰)      666
wikinews+VG-whole               0.326 (0.017)  <1‰ (<1‰)      666
wikinews+Google ResNet-152      0.350 (0.028)  <1‰ (<1‰)      666
wikinews sub+Google AlexNet     0.329 (0.022)  <1‰ (<1‰)      666
wikinews sub+VG SceneGraph      0.187 (0.027)  <1‰ (<1‰)      666
wikinews sub+Google VGG         0.353 (0.011)  <1‰ (<1‰)      666
wikinews sub+VG-internal        0.299 (0.013)  <1‰ (<1‰)      666
wikinews sub+VG-whole           0.304 (0.015)  <1‰ (<1‰)      666
wikinews sub+Google ResNet-152  0.348 (0.011)  <1‰ (<1‰)      666
crawl+Google AlexNet            0.349 (0.025)  <1‰ (<1‰)      666
crawl+VG SceneGraph             0.434 (0.017)  <1‰ (<1‰)      666
crawl+Google VGG                0.346 (0.017)  <1‰ (<1‰)      666
crawl+VG-internal               0.310 (0.038)  <1‰ (<1‰)      666
crawl+VG-whole                  0.321 (0.007)  <1‰ (<1‰)      666
crawl+Google ResNet-152         0.364 (0.009)  <1‰ (<1‰)      666
w2v13+Google AlexNet            0.345 (0.024)  <1‰ (<1‰)      666
w2v13+VG SceneGraph             0.312 (0.007)  <1‰ (<1‰)      666
w2v13+Google VGG                0.362 (0.017)  <1‰ (<1‰)      666
w2v13+VG-internal               0.209 (0.017)  <1‰ (<1‰)      666
w2v13+VG-whole                  0.225 (0.007)  <1‰ (<1‰)      666
w2v13+Google ResNet-152         0.352 (0.020)  <1‰ (<1‰)      666

Table A.2: Cross-validated Spearman correlations on the SimLex dataset. The Spearman and P-value columns summarise three samples, each obtained by leaving out a third of the evaluation pairs. Multi-modal embeddings are created using the Padding technique.
The table sections contain linguistic, visual and multi-modal embeddings in this order.

Embedding                       Spearman       P-value        Coverage

wikinews                        0.792 (0.002)  <1‰ (<1‰)      2000
wikinews sub                    0.804 (0.001)  <1‰ (<1‰)      2000
crawl                           0.845 (0.001)  <1‰ (<1‰)      2000
w2v13                           0.684 (0.003)  <1‰ (<1‰)      2000

Google AlexNet                  0.509 (0.005)  <1‰ (<1‰)      2000
VG SceneGraph                   0.413 (0.004)  <1‰ (<1‰)      1716
Google VGG                      0.508 (0.008)  <1‰ (<1‰)      2000
VG-internal                     0.374 (0.015)  <1‰ (<1‰)      1856
VG-whole                        0.412 (0.002)  <1‰ (<1‰)      1856
Google ResNet-152               0.464 (0.007)  <1‰ (<1‰)      2000

wikinews+Google AlexNet         0.497 (0.004)  <1‰ (<1‰)      2000
wikinews+VG SceneGraph          0.654 (0.006)  <1‰ (<1‰)      1716
wikinews+Google VGG             0.504 (0.011)  <1‰ (<1‰)      2000
wikinews+VG-internal            0.374 (0.003)  <1‰ (<1‰)      1856
wikinews+VG-whole               0.415 (0.006)  <1‰ (<1‰)      1856
wikinews+Google ResNet-152      0.476 (0.004)  <1‰ (<1‰)      2000
wikinews sub+Google AlexNet     0.501 (0.008)  <1‰ (<1‰)      2000
wikinews sub+VG SceneGraph      0.452 (0.021)  <1‰ (<1‰)      1716
wikinews sub+Google VGG         0.503 (0.002)  <1‰ (<1‰)      2000
wikinews sub+VG-internal        0.370 (0.005)  <1‰ (<1‰)      1856
wikinews sub+VG-whole           0.415 (0.005)  <1‰ (<1‰)      1856
wikinews sub+Google ResNet-152  0.475 (0.005)  <1‰ (<1‰)      2000
crawl+Google AlexNet            0.502 (0.009)  <1‰ (<1‰)      2000
crawl+VG SceneGraph             0.813 (0.001)  <1‰ (<1‰)      1716
crawl+Google VGG                0.512 (0.008)  <1‰ (<1‰)      2000
crawl+VG-internal               0.392 (0.005)  <1‰ (<1‰)      1856
crawl+VG-whole                  0.427 (0.006)  <1‰ (<1‰)      1856
crawl+Google ResNet-152         0.514 (0.003)  <1‰ (<1‰)      2000
w2v13+Google AlexNet            0.502 (0.004)  <1‰ (<1‰)      2000
w2v13+VG SceneGraph             0.696 (0.003)  <1‰ (<1‰)      1716
w2v13+Google VGG                0.528 (0.005)  <1‰ (<1‰)      2000
w2v13+VG-internal               0.369 (0.011)  <1‰ (<1‰)      1856
w2v13+VG-whole                  0.423 (0.010)  <1‰ (<1‰)      1856
w2v13+Google ResNet-152         0.484 (0.010)  <1‰ (<1‰)      2000

Table A.3: Cross-validated Spearman correlations on the MEN dataset. The Spearman and P-value columns summarise three samples, each obtained by leaving out a third of the evaluation pairs. Multi-modal embeddings are created using the Intersection technique.
The table sections contain linguistic, visual and multi-modal embeddings in this order.

Embedding                       Spearman       P-value        Coverage

wikinews                        0.457 (0.006)  <1‰ (<1‰)      666
wikinews sub                    0.443 (0.015)  <1‰ (<1‰)      666
crawl                           0.493 (0.013)  <1‰ (<1‰)      666
w2v13                           0.300 (0.010)  <1‰ (<1‰)      666

Google AlexNet                  0.348 (0.004)  <1‰ (<1‰)      666
VG SceneGraph                   0.249 (0.023)  <1‰ (<1‰)      395
Google VGG                      0.344 (0.008)  <1‰ (<1‰)      666
VG-internal                     0.289 (0.034)  0.022 (0.015)  68
VG-whole                        0.118 (0.032)  0.354 (0.135)  68
Google ResNet-152               0.351 (0.022)  <1‰ (<1‰)      666

wikinews+Google AlexNet         0.331 (0.021)  <1‰ (<1‰)      666
wikinews+VG SceneGraph          0.362 (0.017)  <1‰ (<1‰)      395
wikinews+Google VGG             0.318 (0.019)  <1‰ (<1‰)      666
wikinews+VG-internal            0.289 (0.043)  0.024 (0.021)  68
wikinews+VG-whole               0.269 (0.017)  0.028 (0.009)  68
wikinews+Google ResNet-152      0.370 (0.017)  <1‰ (<1‰)      666
wikinews sub+Google AlexNet     0.356 (0.015)  <1‰ (<1‰)      666
wikinews sub+VG SceneGraph      0.304 (0.022)  <1‰ (<1‰)      395
wikinews sub+Google VGG         0.336 (0.021)  <1‰ (<1‰)      666
wikinews sub+VG-internal        0.270 (0.058)  0.046 (0.048)  68
wikinews sub+VG-whole           0.090 (0.119)  0.528 (0.350)  68
wikinews sub+Google ResNet-152  0.348 (0.005)  <1‰ (<1‰)      666
crawl+Google AlexNet            0.358 (0.014)  <1‰ (<1‰)      666
crawl+VG SceneGraph             0.428 (0.027)  <1‰ (<1‰)      395
crawl+Google VGG                0.332 (0.008)  <1‰ (<1‰)      666
crawl+VG-internal               0.305 (0.024)  0.013 (0.006)  68
crawl+VG-whole                  0.160 (0.074)  0.271 (0.247)  68
crawl+Google ResNet-152         0.370 (0.026)  <1‰ (<1‰)      666
w2v13+Google AlexNet            0.338 (0.002)  <1‰ (<1‰)      666
w2v13+VG SceneGraph             0.278 (0.008)  <1‰ (<1‰)      395
w2v13+Google VGG                0.337 (0.019)  <1‰ (<1‰)      666
w2v13+VG-internal               0.306 (0.049)  0.017 (0.011)  68
w2v13+VG-whole                  0.233 (0.058)  0.086 (0.080)  68
w2v13+Google ResNet-152         0.367 (0.004)  <1‰ (<1‰)      666

Table A.4: Cross-validated Spearman correlations on the SimLex dataset. The Spearman and P-value columns summarise three samples, each obtained by leaving out a third of the evaluation pairs. Multi-modal embeddings are created using the Intersection technique.
The table sections contain linguistic, visual and multi-modal embeddings in this order.

Embedding                       Spearman       P-value        Coverage

wikinews                        0.798 (0.005)  <1‰ (<1‰)      1654
wikinews sub                    0.806 (0.004)  <1‰ (<1‰)      1654
crawl                           0.844 (0.003)  <1‰ (<1‰)      1654
w2v13                           0.667 (0.003)  <1‰ (<1‰)      1654

Google AlexNet                  0.511 (0.006)  <1‰ (<1‰)      1654
VG SceneGraph                   0.431 (0.015)  <1‰ (<1‰)      1654
Google VGG                      0.524 (0.007)  <1‰ (<1‰)      1654
VG-internal                     0.381 (0.008)  <1‰ (<1‰)      1654
VG-whole                        0.405 (0.009)  <1‰ (<1‰)      1654
Google ResNet-152               0.472 (0.014)  <1‰ (<1‰)      1654

wikinews+Google AlexNet         0.518 (0.004)  <1‰ (<1‰)      1654
wikinews+VG SceneGraph          0.654 (0.006)  <1‰ (<1‰)      1654
wikinews+Google VGG             0.516 (0.003)  <1‰ (<1‰)      1654
wikinews+VG-internal            0.376 (0.002)  <1‰ (<1‰)      1654
wikinews+VG-whole               0.412 (0.008)  <1‰ (<1‰)      1654
wikinews+Google ResNet-152      0.476 (0.014)  <1‰ (<1‰)      1654
wikinews sub+Google AlexNet     0.516 (0.007)  <1‰ (<1‰)      1654
wikinews sub+VG SceneGraph      0.452 (0.008)  <1‰ (<1‰)      1654
wikinews sub+Google VGG         0.515 (0.004)  <1‰ (<1‰)      1654
wikinews sub+VG-internal        0.364 (0.002)  <1‰ (<1‰)      1654
wikinews sub+VG-whole           0.406 (0.017)  <1‰ (<1‰)      1654
wikinews sub+Google ResNet-152  0.483 (0.012)  <1‰ (<1‰)      1654
crawl+Google AlexNet            0.514 (0.015)  <1‰ (<1‰)      1654
crawl+VG SceneGraph             0.813 (0.001)  <1‰ (<1‰)      1654
crawl+Google VGG                0.524 (0.008)  <1‰ (<1‰)      1654
crawl+VG-internal               0.393 (0.007)  <1‰ (<1‰)      1654
crawl+VG-whole                  0.423 (0.013)  <1‰ (<1‰)      1654
crawl+Google ResNet-152         0.512 (0.005)  <1‰ (<1‰)      1654
w2v13+Google AlexNet            0.507 (0.007)  <1‰ (<1‰)      1654
w2v13+VG SceneGraph             0.695 (0.004)  <1‰ (<1‰)      1654
w2v13+Google VGG                0.521 (0.008)  <1‰ (<1‰)      1654
w2v13+VG-internal               0.378 (0.005)  <1‰ (<1‰)      1654
w2v13+VG-whole                  0.405 (0.002)  <1‰ (<1‰)      1654
w2v13+Google ResNet-152         0.487 (0.006)  <1‰ (<1‰)      1654

Table A.5: Cross-validated Spearman correlations on the common subset of the MEN dataset. The Spearman and P-value columns summarise three samples, each obtained by leaving out a third of the evaluation pairs. Multi-modal embeddings are created using the Intersection technique.
The table sections contain linguistic, visual and multi-modal embeddings in this order.

Embedding                       Spearman       P-value        Coverage

wikinews                        0.299 (0.064)  0.029 (0.030)  68
wikinews sub                    0.233 (0.074)  0.095 (0.064)  68
crawl                           0.361 (0.055)  0.005 (0.003)  68
w2v13                           0.101 (0.033)  0.428 (0.145)  68

Google AlexNet                  0.536 (0.042)  <1‰ (<1‰)      68
VG SceneGraph                   0.257 (0.038)  0.044 (0.032)  68
Google VGG                      0.464 (0.031)  <1‰ (<1‰)      68
VG-internal                     0.295 (0.030)  0.018 (0.014)  68
VG-whole                        0.213 (0.049)  0.108 (0.087)  68
Google ResNet-152               0.527 (0.034)  <1‰ (<1‰)      68

wikinews+Google AlexNet         0.584 (0.025)  <1‰ (<1‰)      68
wikinews+VG SceneGraph          0.353 (0.070)  0.008 (0.006)  68
wikinews+Google VGG             0.547 (0.024)  <1‰ (<1‰)      68
wikinews+VG-internal            0.326 (0.022)  0.008 (0.003)  68
wikinews+VG-whole               0.128 (0.074)  0.377 (0.305)  68
wikinews+Google ResNet-152      0.456 (0.023)  <1‰ (<1‰)      68
wikinews sub+Google AlexNet     0.605 (0.027)  <1‰ (<1‰)      68
wikinews sub+VG SceneGraph      0.317 (0.059)  0.020 (0.024)  68
wikinews sub+Google VGG         0.538 (0.054)  <1‰ (<1‰)      68
wikinews sub+VG-internal        0.319 (0.062)  0.019 (0.022)  68
wikinews sub+VG-whole           0.165 (0.106)  0.313 (0.220)  68
wikinews sub+Google ResNet-152  0.540 (0.023)  <1‰ (<1‰)      68
crawl+Google AlexNet            0.564 (0.027)  <1‰ (<1‰)      68
crawl+VG SceneGraph             0.339 (0.072)  0.014 (0.016)  68
crawl+Google VGG                0.602 (0.023)  <1‰ (<1‰)      68
crawl+VG-internal               0.335 (0.053)  0.011 (0.012)  68
crawl+VG-whole                  0.178 (0.055)  0.189 (0.158)  68
crawl+Google ResNet-152         0.501 (0.018)  <1‰ (<1‰)      68
w2v13+Google AlexNet            0.495 (0.020)  <1‰ (<1‰)      68
w2v13+VG SceneGraph             0.227 (0.084)  0.136 (0.164)  68
w2v13+Google VGG                0.485 (0.044)  <1‰ (<1‰)      68
w2v13+VG-internal               0.333 (0.059)  0.014 (0.018)  68
w2v13+VG-whole                  0.251 (0.049)  0.055 (0.043)  68
w2v13+Google ResNet-152         0.498 (0.028)  <1‰ (<1‰)      68

Table A.6: Cross-validated Spearman correlations on the common subset of the SimLex dataset. The Spearman and P-value columns summarise three samples, each obtained by leaving out a third of the evaluation pairs.
Multi-modal embeddings are created using the Intersection technique. The table sections contain linguistic, visual and multi-modal embeddings in this order.

Appendix B

WordNet Concreteness

Further WordNet concreteness analysis (Section 4.3.4.3) on the common subset of the datasets for the behavioural tasks, and for the Intersection-type mid-fusion method.

Figure B.1: Scores on the embeddings’ common subset of Semantic Similarity dataset splits, ordered by the sum of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Padding.

Figure B.2: Scores on the full Semantic Similarity dataset splits, ordered by the sum of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Intersection.

Figure B.3: Scores on the embeddings’ common subset of Semantic Similarity dataset splits, ordered by the sum of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Intersection.

Figure B.4: Scores on the embeddings’ common subset of Semantic Similarity dataset splits, ordered by the difference of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Padding.

Figure B.5: Scores on the full Semantic Similarity dataset splits, ordered by the difference of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Intersection.

Figure B.6: Scores on the embeddings’ common subset of Semantic Similarity dataset splits, ordered by the difference of WordNet concreteness scores of the two words in every word pair. Mid-fusion method: Intersection.

Appendix C

EmbEval Toolkit

The code we used to generate the results in this work is openly available.¹ It performs a general evaluation of word embeddings (which we used in Chapters 4, 5 and 6).
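As a rough illustration of the kind of evaluation tabulated in Appendix A, the sketch below scores an embedding by the Spearman correlation between cosine similarities and human ratings, repeated on three samples that each leave out a third of the word pairs. It is not the EmbEval implementation: the function names, the input format, the way the samples are drawn, and the choice to summarise them as a mean with a standard deviation are all assumptions made for this example.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def cross_validated_spearman(vectors, pairs, ratings, n_samples=3, seed=0):
    """Return (mean, std, coverage) over samples that each leave out one fold.

    vectors: dict mapping word -> list of floats (hypothetical input format)
    pairs:   list of (word1, word2) tuples
    ratings: human scores aligned with `pairs`
    """
    # Only pairs whose two words are both in the embedding count as covered.
    covered = [i for i, (w1, w2) in enumerate(pairs) if w1 in vectors and w2 in vectors]
    sims = [cosine(vectors[pairs[i][0]], vectors[pairs[i][1]]) for i in covered]
    gold = [ratings[i] for i in covered]

    # Assumed sampling scheme: shuffle once, split into folds, drop one fold per sample.
    positions = list(range(len(covered)))
    random.Random(seed).shuffle(positions)
    folds = [positions[k::n_samples] for k in range(n_samples)]

    scores = []
    for held_out in folds:
        skip = set(held_out)
        keep = [p for p in positions if p not in skip]
        scores.append(spearman([sims[p] for p in keep], [gold[p] for p in keep]))
    return statistics.fmean(scores), statistics.pstdev(scores), len(covered)
```

The third return value plays the role of the Coverage column: pairs containing a word that is missing from the embedding are dropped before any correlation is computed, which is why embeddings with smaller vocabularies are scored on fewer pairs.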
The code base loads several embedding models, generates multi-modal embeddings and runs all the evaluations on the semantic similarity and relatedness datasets as well as the brain datasets. The software can also be used to generate the various visualisations and tables of results as well as visualisations of embedding spaces. Details on its usage can be found in the documentation².

¹ https://github.com/anitavero/embeval
² https://anitavero.github.io/embeval/

Appendix D

Cluster Structure

WordNet label | Own label | Members
food, nutriment, foodstuff | food | butter, cheese, bread, chicken, soup, sauce, dessert, beef, salad, meat, cake, steak, tomato, potato, pizza, flour, milk, meal, vinegar, bacon, pie, cooking, sushi, sandwich, breakfast, burger, menu
vascular plant, plant organ, plant part | plants | flower, flowers, tree, blossom, dandelion, foliage, fruit, weed, cactus, lily, bloom, shade, leaf, grass, sunflower, poppy, vine, plant, garden, iris, grow, daisy, oak, bulb, rust, herb, moss, tulip, palm, maple, root, tall, bush, seed, family
atmospheric phenomenon, physical phenomenon, change | weather | rain, snow, fog, weather, mist, drizzle, frost, dew, cold, wet, wind, smoke, sunlight, misty, sunrise, winter, storm, sunset, haze, sunshine, fire, spring, dusk, autumn, heavy, atmosphere, cloud, sunny, burn, flood, desert, sun, hot, ice, tropical
food, beverage, produce | sweets alcohol tobacco “legal drugs” | coffee, lemon, candy, juice, chocolate, sugar, strawberry, honey, tea, beer, bottle, bean, banana, cocktail, whiskey, pumpkin, beverage, pepper, cereal, brandy, sweet, wine, tobacco, mug, cherry, donut, nuts, liquor, berry, rice, mustard, cigar, cigarette, alcohol, raspberry, champagne, pot, apple, peel
substance, material, artifact | material – farm animals | cow, wool, charcoal, sheep, cattle, food, animal, wood, goat, wheat, sand, animals, salt, water, timber, fish, mud, straw, cotton, copper, washing, oil, ox, iron, lamb, fresh, abundance, fur, coal, fishing,
exotic, dye, ceramic, camel, pollution, tin, licking, smoking, diet, vitamin
artifact, covering, clothing | clothing / fashion | wig, clothes, dress, shoes, jacket, sweater, skirt, sunglasses, leather, hair, costume, shirt, haircut, cloth, socks, waist, mannequin, collar, jewelry, tattoo, lingerie, beard, blonde, mask, fabric, uniform, necklace, linen, outfit, glove, hat, fashion, blanket, bikini, knitting, swimsuit, crochet, badge, coat, carpet, bracelet, arms, makeup
artifact, structure, whole | classical architecture | tower, building, marble, staircase, fountain, doorway, roof, chapel, steeple, porch, ceiling, mural, glass, wall, brick, statue, stone, arch, monument, dome, window, gravestone, sculpture, aisle, tiles, gate, interior, painted, decoration, concrete, church, graveyard, cathedral, curtain, painting, palace, clock, grave, portrait, choir, architecture, pyramid, memorial, square, castle, skyscraper, museum, cemetery, temple, organ
change, color, visual property | colour / decor | blue, bright, green, pink, black, yellow, dark, white, purple, red, brown, violet, rainbow, colour, orange, sky, rusty, silhouette, grey, diamond, redhead, light, flame, peacock, mirror, color, tiny, shadow, stripes, dull, rose, neon, colorful, crystal, bell, moon, horizon, arrow, silver, ivy, gold, swan, dragon, lantern, star, pearl, horn, ray, fox, globe, planet, bold, belt
body part, part, artifact | body parts | skin, spine, neck, bone, chest, throat, shoulder, wrist, stomach, ear, jaw, cheek, lips, nose, eyes, eye, limb, toe, belly, skull, abdomen, finger, teeth, elbow, cord, whiskers, knee, thumb, tooth, muscle, ankle, tail, paws, lip, brain, flesh, leg, body, calf, heart, blood, tongue, brow, pain, tear, blade, mouth, liver, gut, arm, marrow, curled, canine, feathers, foot, vein, hip, cancer
attribute, whole, artifact | measures & Misc | flexible, reflection, pattern, sharp, ripples, large, elastic, normal, angle, object, spiral, fragile, dense, different, relaxed, frame,
strong, fast, target, small, bottom, wave, long, rough, illusion, cone, narrow, texture, pair, noise, curve, bubble, depth, droplets, display, footprint, condition, wide, sphere, reduce, hole, blurred, lamp, short, shell, rapid, medium, plate, size, lens, instrument, feet, helium, chain, meter, inch, cell, adult, formula, males
artifact, instrumentality, move | objects | bag, cardboard, bucket, wire, hand, nail, pencil, hanging, rope, skateboard, knife, garbage, splash, button, scratch, pipe, ink, dripping, dirty, boot, spoon, drawer, hard, dirt, cage, suds, miniature, box, puddle, graffiti, hang, drum, jar, swing, metal, collage, pin, pillow, tough, rock, surf, cradle, vintage, stencil, origami, keyboard, disc, rod, big, rattle, racket, ipod, vinyl, lego, surfers, odd, basket, tag, van, mac
person, organism, bird | animals | bird, cat, squirrel, owl, rabbit, dog, birds, parrot, zebra, giraffe, stork, duck, goose, pelican, deer, elephant, rat, snake, eagle, pigeon, hamster, wolf, cheetah, hawk, mallard, crab, poodle, chipmunk, frog, flamingo, mouse, tiger, pets, crow, whale, gull, wild, insect, feline, prey, hummingbird, hound, pug, lion, panda, pet, lizard, bee, ant, dragonfly, nest, zoo, jellyfish, hen, seagull, spider, wasp, terrier, aquarium, butterfly
structure, artifact, area | room | kitchen, room, bedroom, bathroom, garage, shop, cafe, motel, cellar, diner, closet, hallway, cottage, hotel, sidewalk, restaurant, barn, house, apartment, door, pub, alley, stairs, sofa, patio, bed, floor, couch, cabin, bakery, store, booth, crib, dinner, desk, furniture, hut, parking, fence, inn, pool, corner, shelter, hall, farm, lawn, street, shed, bar, mill, lab, windmill, sitting, office, hospital, log, classroom, shopping, supper, bath, jail, lunch, theatre, yard
person, organism, causal agent | social roles: family members & professions | father, friend, mother, lover, uncle, wife, daughter, lawyer, woman, brother, teacher, son, child, nurse, nephew, banker, soldier, couple,
maid, gentleman, husband, author, bride, doctor, priest, wedding, partner, photographer, worker, actor, lady, captain, employee, sailor, groom, appointment, leader, student, king, secretary, scientist, singer, queen, guardian, professor, president, princess, actress, justice, children, instructor, monk, prince, birthday, maker, sheriff, bishop, manager, mayor, companion, chair, minister, politician, boxer, age, pupil, saint, jean, rabbi
object, artifact, physical entity | places | shore, corridor, trail, bridge, road, harbour, river, tunnel, area, park, beach, pond, valley, lake, hill, ledge, city, railroad, island, highway, harbor, rail, downtown, seashore, canyon, west, canal, border, coast, north, town, mountain, pier, path, traffic, bay, ocean, cliff, forest, swamp, port, abandoned, skyline, stream, line, south, boundary, waterfall, station, loop, sea, railway, construction, boardwalk, scenery, reef, branch, lighthouse, demolition, landscape, underground, airport, zone, urban, metro, region, capital, gauge, village, population
instrumentality, travel, vehicle | transportation | vehicle, airplane, truck, car, elevator, automobile, aircraft, cab, carriage, bike, jet, chopper, scooter, balloon, bicycle, pilot, deck, train, wagon, gasoline, motorcycle, plane, craft, machine, engine, boat, taxi, cannon, crane, tank, escalator, mechanic, ship, hose, driver, steel, rocket, container, gun, safety, auto, motor, explosion, flying, factory, air, flight, camera, appliance, accident, drive, aluminum, telephone, bus, underwater, lighting, vessel, aerial, phone, emergency, ford, exit, subway, company, police, pod, tram, industrial, asphalt, wing
change, act, be | verbs | bring, get, come, want, go, keep, take, know, find, say, give, make, understand, put, listen, enjoy, feel, leave, think, learn, imagine, gather, believe, fail, arrange, add, lose, create, way, hear, send, meet, collect, carry, avoid, buy, remain, allow, appear, might, enter, arrive, seem, entertain,
break, steal, receive, stop, stand, build, locked, compare, retain, sell, handle, danger, eat, wander, face, unhappy, protect, please, pray, become, walk, expand, travel, plenty, greet, inspect, comfort, huge, possess, dominate, attach, roam, participate, speak, step, drawn, construct, replace, divide, great, living
person, organism, causal agent | art / entertainment | smile, fun, happy, love, girl, kid, kids, boy, baby, dad, mom, kiss, dude, friends, funny, man, joy, angel, beautiful, christmas, cute, movie, night, spirit, beast, bunny, mad, sing, puppy, monster, soul, zombie, song, devil, dance, kitty, guy, bunch, happiness, snowman, show, holiday, buddy, music, restless, theme, sketch, nice, boys, dead, clown, young, quest, girls, vacation, celebration, emotion, carnival, dreary, dawn, bad, cop, sleep, journey, concert, pride, hero, evening, story, demon, sad, morning, warrior, jazz, band, guest, film, god, piano, punk, doodle, guitar, tv, television, husky, violin, festival, female
travel, act, group | sport | time, day, year, second, course, run, win, game, home, sports, ball, trip, season, week, country, match, track, dropped, club, parade, trick, world, crowd, august, month, horse, winner, swimming, field, football, left, men, triumph, women, gymnastics, basketball, bench, table, racing, round, jump, outdoor, cup, top, swim, race, side, baseball, sailing, opponent, champion, goal, held, school, trial, played, camp, cross, flag, bowl, summer, rally, squad, head, old, ceremony, military, hockey, exhibition, skating, state, bull, college, purse, army, pole, stadium, ski, chess, navy, minute, class, posted, skate, anchor, colt, seat, stud, turkey, santa, mare
abstraction, communication, act | writing / Misc | fact, discussion, work, idea, read, sense, quote, manner, words, conversation, information, book, picture, value, image, reader, view, person, advertisement, paper, vision, impression, communication, nature, phrase, page, paragraph, proof,
article, interest, job, definition, money, abstract, poster, formal, wisdom, reading, skill, choice, attention, literature, letter, handwriting, art, business, smart, awareness, confidence, word, key, design, new, essential, model, date, computer, action, collection, payment, note, law, graphic, figure, bible, library, protest, task, news, violent, chapter, umbrella, movement, dollar, magazine, symbol, photography, modern, newspaper, web, activity, circle, number, people, peace, market, map, self, card, code, psychology, text, right, parent, dictionary, order, party, language, journal, written, tax, style, era, calendar, cent, ad, ancient

Table D.1: Members of the 20 clusters in EL. Clusters are ordered by size.

WordNet label | Own label | Members
base, layer, flatware | plate | plate
lick, cream, beating | licking | licking
communication, promotion, message | ad | ad, advertisement
change, passage, tube | pipe | rust, pipe, hose, tank, graffiti, chain
artifact, line, whole | train | railway, railroad, subway, curve, tunnel, run, shelter, train, station, tram, highway, track, rail, way, engine, stop, gate, bridge, smoke
structure, area, room | room | classroom, hallway, hall, closet, bedroom, room, bathroom, garage, office, cafe, museum, doorway, kitchen, shop, restaurant, store, mannequin, stadium, market, ceiling, corner
bird, vertebrate, person | animals | hummingbird, gull, peacock, hawk, pelican, crow, parrot, seagull, wing, swan, pigeon, owl, goose, flamingo, nest, eagle, tail, bird, silhouette, duck, chest, body, ledge, giraffe, zebra
travel, wheeled vehicle, self-propelled vehicle | vehicles | cab, car, taxi, police, vehicle, automobile, drive, racing, scooter, bike, van, street, road, motorcycle, truck, speak, wagon, bus, parade, drawn, asphalt, cop, parking, bicycle, sidewalk, traffic, driver, carriage, meter
plant organ, plant, vascular plant | plants | bloom, foliage, grave, dead, vine, blossom, ivy, pod, cactus, tree, moss, root, leave, limb, forest, bush, plant, lily, branch, weed, leaf, vein,
sunshine, log, fence, flower, sunlight, wood, palm, bench, sun
structure, artifact, whole | building parts | chapel, cottage, steeple, castle, dome, story, cathedral, build, skyscraper, arch, lighthouse, apartment, hut, angel, shed, hotel, monument, window, staircase, home, cabin, house, roof, porch, tower, sculpture, patio, bell, deck, brick, church, cross, clock, step, statue
instrumentality, container, substance | vessel | champagne, tea, beverage, alcohol, honey, milk, pencil, tulip, juice, oil, bakery, ceramic, container, coffee, tin, cup, beer, sunflower, daisy, wine, rose, marble, bowl, sweet, maker, jar, vessel, mug, money, bottle, pumpkin, straw, glass, basket, box, pot, bucket, bunch
body part, artifact, part | pets & body parts | jaw, throat, pupil, cheek, canine, belly, brow, mouth, stomach, tongue, eye, nose, poodle, ear, hamster, lip, fur, tooth, teeth, pet, leg, wool, head, feline, toe, panda, smile, neck, face, beard, puppy, collar, horn, skin, cat, kitty, calf, nail, dog, tag, mother
physical entity, body of water, thing | water | rapid, village, coast, bay, mist, horizon, canal, skyline, valley, sea, cliff, fog, town, waterfall, stream, water, sunset, pier, harbor, boardwalk, break, ocean, lake, fountain, shore, island, river, wave, splash, city, rock, ship, building, sand, hill, crane, mountain, beach, pond, surf, boat, pool
location, artifact, region | farm animal | dandelion, boundary, grass, wild, deer, stork, field, mud, farm, windmill, garden, landscape, desert, cattle, dirt, area, barn, yard, zoo, ox, path, footprint, garbage, puddle, lawn, cow, sheep, concrete, snow, eat, lamb, goat, stone, cone, trail, rain, day, park, animal, cage, horse, bull, elephant
change, color, visual property | colors | bright, beautiful, big, dirty, small, colorful, grey, long, purple, dark, round, men, tiny, pink, eyes, painted, brown, gold, medium, white, hang, iron, silver, old, black, left, tall, red, safety, large, metal, blue, steel, yellow, leather, hanging, make, walk, green,
right, color, bath, pair, washing, sitting, carry
food, produce, solid | food | drizzle, nuts, herb, beef, flour, season, cereal, cherry, breakfast, sugar, steak, bacon, burger, butter, rice, meat, meal, sauce, dinner, pie, raspberry, lunch, sushi, bean, mustard, pepper, seed, salt, soup, cheese, tomato, hot, berry, potato, dessert, strawberry, salad, cardboard, food, bone, lemon, burn, frost, chocolate, bread, turkey, sandwich, spoon, pizza, chicken, shell, candy, peel, cooking, bubble, knife, fruit, fish, donut, cake, apple, ice, banana, orange
artifact, whole, instrumentality | furnishing | crochet, calendar, linen, map, painting, work, frog, skull, note, code, stud, lantern, art, telephone, scratch, furniture, information, collection, menu, ipod, page, table, mural, piano, spring, movie, magazine, poster, cell, spine, portrait, appliance, desk, paper, graphic, frame, bed, date, crib, pattern, text, picture, card, globe, butterfly, wall, pillow, fabric, cord, sofa, carpet, guitar, square, cloth, image, tv, book, heart, lamp, star, television, blanket, couch, newspaper, night, decoration, mirror, time, computer, design, keyboard, word, mouse, border, drawer, floor, button, chair, key, display, curtain, reading
person, artifact, covering | people | fun, nurse, lingerie, violin, jewelry, makeup, haircut, cigar, wig, monk, instructor, santa, pug, brother, doctor, dad, terrier, huge, parent, scientist, gentleman, bikini, pearl, badge, bracelet, shirt, swimsuit, sweater, jean, costume, hip, jacket, sleep, daughter, mom, short, skirt, snowman, hat, man, muscle, instrument, necklace, young, basketball, wrist, hair, smoking, glove, outfit, music, coat, rabbit, pets, woman, band, football, father, dude, boot, hand, elbow, tattoo, arm, ankle, soldier, lab, waist, clown, dress, belt, racket, blonde, bunny, uniform, loop, lens, friend, cigarette, held, finger, girl, photographer, purse, person, knee, pin, boy, female, trick, thumb, guy, mask, foot, son, swing, clothes, lady,
bride, skate, squirrel, bag, phone, disc, ski, tiger, child, groom, adult, shoulder, student, kid, camera, skateboard, baseball, ball, baby
change, act, artifact | Misc | pain, downtown, capital, condition, theatre, motel, cemetery, elevator, journey, class, zone, captain, coal, military, navy, school, craft, gauge, texture, exit, storm, language, moon, company, create, club, anchor, country, construction, meet, rainbow, weather, port, alley, hospital, party, take, flight, pilot, dragon, booth, interior, business, race, sky, library, drum, sunny, door, motor, employee, light, model, hen, bulb, goal, gun, wind, cloud, diner, pole, aircraft, course, fox, rod, skating, letter, jump, show, written, flame, symbol, reflection, plane, shadow, object, diamond, airport, ray, circle, line, airplane, swimming, bottom, arrow, flag, crowd, balloon, top, number, aquarium, fire, flying, seat, side, stand, figure, air, handle, game, winter, view, match, blade, bar, machine, family, wire, lion, hole, people, shade, worker, jet, rope, umbrella, couple
person, change, organism | Misc | ant, news, jellyfish, protest, add, imagine, inn, journal, liver, essential, marrow, rattle, arrange, wasp, paragraph, brandy, fact, aerial, devil, unhappy, emotion, chipmunk, god, oak, explosion, prey, proof, vision, activity, chess, movement, danger, gasoline, secretary, jazz, song, send, mayor, tobacco, soul, urban, violent, quote, demon, replace, fragile, manner, misty, receive, ancient, flowers, skill, reef, ripples, rally, living, diet, sketch, awareness, illusion, pollution, abstract, value, wisdom, squad, remain, arrive, saint, trial, impression, avoid, vinyl, minister, maid, concert, believe, jail, learn, please, politician, great, guardian, population, holiday, cancer, psychology, become, college, demolition, payment, brain, army, rabbi, lawyer, literature, prince, task, tropical, bring, lover, bold, inch, interest, companion, exhibition, leader, noise, actor, underwater, supper,
communication, helium, sense, happiness, win, sad, gymnastics, entertain, champion, banker, odd, conversation, planet, dawn, dense, camp, law, locked, pray, lose, plenty, abundance, fail, mallard, vacation, chapter, dreary, warrior, origami, might, joy, timber, choice, underground, depth, stencil, formula, friends, allow, retain, participate, understand, paws, mad, pride, stairs, wander, comfort, theme, give, nephew, reduce, funny, bad, idea, droplets, age, ..., surfers

Table D.2: Members of the 20 clusters in ES. Clusters are ordered by size.

WordNet label | Own label | Members
bird, aquatic bird, seabird | birds | seagull, gull, goose, duck, pelican, swan, mallard, stork, eagle, flamingo
furnishing, furniture, instrumentality | furnishing | furniture, stand, booth, desk, modern, display, bed, chair, container, door, appliance, drawer, sofa, curtain, couch, bench, crib, frame, box, table, tv, window, computer, cradle, television, mac
instrumentality, artifact, device | objects | inspect, protect, collar, find, skateboard, gasoline, heavy, key, belt, steal, instrument, hang, justice, glove, handle, knife, scooter, horn, shoes, pipe, bone, telephone, mouse, bag, hat, spoon, guitar, gun, colt, purse, drum, iron, boot, violin, spine, umbrella, sunglasses
instrumentality, self-propelled vehicle, wheeled vehicle | car related | accident, cord, vehicle, auto, automobile, skate, photography, truck, race, arrive, ford, chopper, cab, rally, seat, industrial, smart, mechanic, racing, car, demolition, triumph, construction, motorcycle, machine, taxi, engine, driver, crane, carriage, van, bus, cannon, motor, tank, hockey, wagon, camera
person, organism, causal agent | “female topics” | woman, model, brandy, pink, actress, lady, girl, young, wife, tiny, haircut, blonde, women, girls, hot, mother, hair, portrait, body, makeup, cheek, wig, neck, muscle, chest, lingerie, waist, redhead, child, face, bride, belly, bikini, kid, swimsuit, baby, brow, skirt, dress, short
instrumentality, artifact, device |
metals & writing | object, aluminum, journal, author, capital, lawyer, step, cardboard, law, silver, elastic, bible, written, book, tin, literature, chocolate, wire, money, cigarette, stud, steel, payment, glass, charcoal, blanket, gold, newspaper, page, cigar, appointment, brick, butter, pencil, mirror, log, phone, ipod, match, pillow, rod, piano, keyboard
vascular plant, plant, grow | plants | weed, bunch, maple, cancer, iris, poppy, dandelion, leave, flower, rose, foliage, grow, plant, cactus, spring, tulip, ivy, palm, lily, leaf, daisy, tree, root, wheat, wool, raspberry, tobacco, flowers, blossom, butterfly, sunflower, cotton, herb, violet, oak, moss, strawberry, nest, dew, berry, rice, branch, coal
food, nutriment, substance | food | sushi, meal, sandwich, pie, breakfast, lunch, food, supper, flour, cereal, sweet, dessert, dinner, subway, diet, cake, date, steak, sauce, bread, copper, nuts, bacon, cooking, beef, meat, bakery, knitting, eat, potato, salad, donut, pizza, burger, coffee, soup, bean, cheese, vitamin, fruit, pumpkin, rock, marrow, market, timber
artifact, change, cover | colours & materials | texture, fabric, cloth, metal, rain, concrete, paper, suds, rough, words, stone, wall, square, dense, leather, quote, wood, frost, mud, noise, text, purple, carpet, blue, tiles, dirt, droplets, red, sand, fog, formula, mist, pattern, handwriting, green, straw, linen, asphalt, stripes, crowd, marble, yellow, black, brown, grey, grass, white
body part, artifact, part | body parts | gut, throat, wrist, burn, ear, thumb, elbow, listen, shoulder, liver, pain, knee, arms, hand, toe, finger, give, tongue, limb, abdomen, jaw, receive, nail, arm, feet, hear, skin, washing, head, ankle, hip, teeth, tear, stomach, brain, foot, lip, mouth, leg, flesh, mask, eyes, nose, skull, eye, socks, lips
structure, artifact, area | room | museum, garage, hall, classroom, kitchen, cellar, interior, office, diner, decoration, exhibition, hotel, ceiling, restaurant, store, bathroom, trial, pub, class,
closet, cafe, room, porch, stairs, deck, hospital, living, corridor, aisle, bar, staircase, doorway, hallway, chapel, floor, lab, station, bedroom, gate, elevator, theatre, escalator, tunnel, organ, alley, library, jail, tram
artifact, whole, instrumentality | fruit, drinks & sport | compare, sad, ceramic, tea, rattle, honey, mustard, weather, champagne, pearl, button, wine, sugar, peel, pepper, jewelry, milk, orange, balloon, bulb, lemon, beer, cocktail, salt, beverage, sphere, juice, sports, planet, sun, whiskey, lantern, world, cup, football, pin, diamond, banana, basket, cherry, cent, basketball, globe, ripples, vinegar, pot, bottle, jar, tomato, baseball, plate, bucket, bowl, bubble, mug, ball, moon
travel, change, object | vacation | island, view, reflection, harbor, nice, side, sea, summer, tropical, pollution, port, aircraft, pier, travel, surfers, journey, sunny, coast, flying, morning, ocean, seashore, horizon, mare, holiday, lake, surf, shore, vacation, bay, airport, cliff, sunlight, air, river, storm, ship, fishing, beach, desert, harbour, puddle, flight, sailing, evening, sunrise, skyline, vessel, lighthouse, dawn, sunset, rocket, mountain, whale, underwater, boat, swimming, swim, plane, dusk, jet, cloud, sky, airplane, ski
change, abstraction, state | festival | theme, wisdom, soul, image, possess, large, confidence, happiness, beautiful, joy, love, ceremony, festival, movement, abundance, dead, depth, celebration, lover, run, demon, blurred, pray, happy, remain, wet, dance, navy, family, carnival, angel, sculpture, ray, dragon, drive, atmosphere, night, shadow, band, god, believe, party, dark, hanging, abstract, show, christmas, monster, devil, jump, lighting, sunshine, warrior, painting, water, aquarium, zombie, concert, haze, crystal, statue, explosion, jazz, jellyfish, wave, bright, rainbow, ice, light, smoke, club, neon, colorful, hole, protest, autumn, rust, reef, flame, fire
person, organism, causal agent | animals | animals, animal, picture,
painted, zoo, turkey, curled, goat, companion, pets, canine, pet, prey, relaxed, horse, spirit, tail, dog, chipmunk, squirrel, pigeon, fox, cute, please, sheep, owl, birds, military, giraffe, lion, lamb, bee, insect, hamster, hawk, licking, bird, cat, puppy, feline, terrier, deer, calf, rat, chicken, camel, dragonfly, whiskers, poodle, cow, hound, cattle, lizard, fish, bunny, crow, wolf, tiger, parrot, zebra, cheetah, fur, panda, bull, wasp, ox, hen, frog, crab, snake, boxer, hummingbird, rabbit, elephant, pupil, husky, peacock, spider, pug, ant
change, abstraction, travel | Misc | think, condition, understand, know, meet, sing, symbol, bring, speak, awareness, say, strong, sense, music, song, come, stencil, badge, loop, avoid, long, tag, idea, feel, bell, helium, guest, held, heart, proof, film, tall, information, oil, meter, anchor, female, drawn, flexible, smile, peace, break, note, paragraph, figure, attach, gauge, apple, wander, kitty, paws, silhouette, footprint, hose, locked, vinyl, corner, round, divide, curve, cross, target, wing, lens, necklace, tooth, border, rope, lamp, bracelet, minute, north, time, illusion, cone, swing, racket, angle, circle, chain, clock, bike, bicycle, pole, spiral
person, organism, causal agent | people | monk, manager, student, males, banker, instructor, parent, politician, minister, worker, adult, professor, played, employee, pilot, bottom, husband, style, uncle, business, men, boys, son, captain, dude, teacher, man, mayor, top, beard, dad, boy, retain, cop, fail, uniform, outfit, company, priest, nurse, daughter, maid, opponent, father, scientist, police, children, sailor, friends, beast, restless, sitting, kids, old, bishop, prince, punk, costume, people, tattoo, groom, president, couple, blade, secretary, saint, sheriff, singer, mad, walk, pod, doctor, photographer, guy, skating, person, formal, bush, actor, gentleman, rabbi, queen, sleep, funny, soldier, jacket, sweater, coat, shirt, jean
structure, artifact, whole |
landmark | village, mill, cemetery, country, graveyard, boardwalk, bath, memorial, outdoor, wide, ancient, temple, inn, path, town, abandoned, windmill, landscape, canal, downtown, trip, cottage, scenery, architecture, farm, patio, roam, palace, camp, drizzle, factory, monument, road, apartment, street, shelter, nature, tower, grave, wind, fountain, season, way, flood, castle, barn, exotic, city, cabin, shade, school, aerial, arch, ledge, garbage, motel, railroad, railway, hill, house, bridge, highway, dreary, garden, train, dome, trail, day, church, winter, urban, parade, home, waterfall, dull, canyon, traffic, cathedral, building, yard, skyscraper, steeple, pool, rail, wild, stadium, forest, mural, pyramid, track, park, field, hut, pond, roof, shed, fence, sidewalk, stream, valley, snow, swamp, lawn
change, act, artifact | Misc | learn, seem, course, dropped, reading, gather, create, reader, impression, might, champion, partner, advertisement, friend, hard, dye, comfort, trick, vision, construct, craft, small, goal, violent, poster, movie, conversation, participate, communication, read, population, huge, smoking, discussion, underground, tough, become, build, carry, leader, college, pair, tax, fashion, fast, graphic, misty, miniature, odd, big, imagine, cold, collage, shopping, shop, graffiti, magazine, color, dirty, choir, ink, unhappy, different, vintage, wedding, king, seed, arrange, psychology, kiss, birthday, cell, plenty, bloom, princess, boundary, lego, snowman, crochet, sketch, gymnastics, emotion, santa, art, origami, clown, narrow, mannequin, army, chess, rusty, blood, collection, dripping, cage, colour, clothes, alcohol, liquor, candy, flag, age, metro, dollar, gravestone, feathers, map
act, change, abstraction | Misc | activity, great, put, replace, lose, want, order, buy, allow, august, reduce, south, essential, keep, posted, bold, pride, fun, west, game, job, action, safety, buddy, story, entertain, get, week, maker, collect, skill,
language, fact, normal, interest, hero, value, work, bad, self, attention, brother, greet, chapter, danger, appear, nephew, ad, size, medium, year, dominate, enjoy, era, task, mom, emergency, sell, news, go, zone, guardian, send, take, left, second, choice, word, card, web, quest, add, make, phrase, dictionary, sharp, winner, line, scratch, arrow, vein, number, shell, splash, parking, enter, rapid, disc, new, right, win, stop, manner, fresh, calendar, squad, month, vine, exit, fragile, region, article, expand, menu, design, area, state, inch, definition, doodle, code, letter, star

Table D.3: Members of the 20 clusters in EV. Clusters are ordered by size.

WordNet label | Own label | Members
baby, organism, work | baby | baby
device, weapon, hurt | knife | knife
area, communication, mark | footprint | footprint
atmosphere, condition, obscure | sky | cloud, sky
line, brandish, gesticulate | ocean | wave, ocean
artifact, animal tissue, implementation | teeth | tooth, teeth
way, road, artifact | road | road, street, highway
organism, animal, bad person | animal | fox, hen, game
substance, food, grass | food | cereal, soup, oil
nonvascular organism, moss, bryophyte | alpine plant | moss, ivy, cliff
aircraft, craft, airplane | airplane | aircraft, airplane, jet, plane
instrumentality, device, artifact | computer | keyboard, mouse, computer, key
food, beverage, substance | drink | beverage, wine, beer, juice
body part, process, part | body parts | ear, head, eye, horn, tail
instrumentality, artifact, substance | pottery | ceramic, tin, pencil, marble, hot
bird, vertebrate, artifact | flying animal | parrot, limb, hummingbird, hawk, owl, dragon, squirrel, branch, butterfly
bird, aquatic bird, seabird | bird | gull, seagull, pelican, swan, peacock, crow, pigeon, goose, flamingo, wing, bird, duck, eagle
thing, body of water, physical entity | water | bay, canal, harbor, water, lake, sea, pier, river, ship, pond, shore, boat, splash, pool
change, move, visual property | body, color | left, long, small, big, muscle, purple, pink, right, washing, green, pair, color, sitting,
palm
food, fruit, change | desserts | nuts, sugar, cherry, frost, chocolate, raspberry, flour, dessert, butter, pie, strawberry, candy, lemon, ice, donut, cake
group, event, act | event | party, parade, crowd, booth, race, cafe, stadium, show, family, restaurant, match, people, market, stand, park, airport, student, couple
change, color, visual property | visual property | bright, grey, dark, round, painted, white, gold, silver, black, red, old, brown, blue, tall, yellow, metal, large, hanging
object, structure, artifact | landscape | horizon, skyline, fog, valley, sunset, town, skyscraper, waterfall, moon, lighthouse, stream, city, building, castle, island, fountain, mountain, crane, hill
container, instrumentality, measure | drink, vessel | tea, champagne, alcohol, honey, milk, coffee, cup, container, salt, bowl, mug, maker, spoon, jar, bottle, money, vessel, straw, diner, glass, bucket, basket, pot, bubble
artifact, whole, furnishing | furnishing, pet | linen, sleep, furniture, blanket, bed, spring, crib, pillow, carpet, couch, pattern, sofa, feline, fabric, bunny, cloth, piano, floor, chair, square, cat, leather, chest, patio, kitty, button
clothing, covering, consumer goods | clothing | wig, instructor, jacket, bikini, costume, badge, sweater, shirt, swimsuit, outfit, gentleman, skirt, short, jean, boot, hat, coat, dude, dress, glove, uniform, clothes, soldier, belt, mask, cop, pin, ski
reproductive structure, plant organ, vascular plant | plants | pod, bloom, tulip, daisy, cactus, sunflower, berry, blossom, sweet, rose, lily, vine, tiny, root, vein, pumpkin, garden, flower, plant, leave, leaf, peel, fruit, bunch, desert, banana, orange, apple
artifact, part, body part | body parts house animals | jaw, throat, canine, belly, pupil, cheek, stomach, hamster, tongue, poodle, mouth, nose, fur, pet, lip, leg, wool, panda, toe, neck, collar, puppy, skin, licking, body, calf, dog, tag, lamb
food, nutriment, meat | food | beef, herb, season, steak, meat, breakfast, bacon, burger, rice, meal, sauce, lunch, mustard,
cheese, pepper, dinner, bean, sushi, tomato, seed, potato, salad, food, bone, sandwich, turkey, bread, chicken, pizza, cooking, plate, fish
artifact, instrumentality, substance | office | crochet, calendar, collection, telephone, menu, note, movie, ipod, appliance, magazine, table, frog, cardboard, date, desk, paper, hospital, skull, card, library, box, shell, book, cord, picture, television, steel, tv, drawer, object, newspaper, garbage, night, top, ledge, machine, corner, display, fire
abstraction, communication, change | communication | language, code, information, text, ad, company, graphic, painting, map, written, exit, mural, letter, word, work, art, scratch, poster, symbol, heart, advertisement, star, graffiti, image, page, spine, border, time, arrow, frame, diamond, say, portrait, number, birthday, design, circle, decoration, reading
structure, artifact, area | building | elevator, chapel, hallway, apartment, closet, garage, hall, window, classroom, bedroom, doorway, cathedral, door, bathroom, story, interior, build, museum, cabin, room, arch, mannequin, shop, office, club, staircase, store, hotel, reflection, kitchen, tunnel, mirror, pilot, house, ceiling, aquarium, view, curtain, shade, church
artifact, travel, whole | transportation | zone, railway, construction, curve, create, taxi, run, subway, car, cab, drive, automobile, railroad, business, parking, alley, shelter, tram, vehicle, stop, asphalt, course, way, light, train, police, station, bus, rail, gate, van, sidewalk, home, line, truck, track, concrete, traffic, bridge, cross, meter, brick
artifact, structure, whole | farm & wild animals | deer, dandelion, wild, grass, farm, foliage, windmill, field, mud, bush, forest, weed, landscape, shed, barn, zoo, hut, tree, cattle, area, dirt, fence, rock, log, goal, ox, yard, cow, sheep, goat, lawn, eat, animal, giraffe, stone, cage, wood, zebra, mother, horse, lion, bull, elephant, hole
artifact, body part, instrumentality | body accessories | cigar, haircut, makeup, brow, pug, hip,
bracelet, wrist, pearl, tattoo, elbow, stud, smile, ankle, hand, necklace, finger, arm, band, smoking, hair, snowman, beard, waist, thumb, lens, cigarette, loop, woman, burn, cell, knee, purse, racket, face, nail, foot, shoulder, bride, phone, bag, camera, lady, groom, skateboard
change, travel, object | travel | rapid, village, journey, seashore, swamp, theatre, mist, storm, scientist, stork, boundary, sunny, coast, country, boardwalk, sunshine, wet, weather, break, rainbow, dirty, aisle, flight, rain, meet, ray, sand, day, puddle, escalator, lab, trail, beach, path, surf, silhouette, nest, walk, snow, wind, shadow, sunlight, flying, cone, sun, balloon, umbrella
artifact, instrumentality, device | building vehicle | pain, capital, minute, gauge, coal, cottage, rust, lantern, anchor, angel, speak, steeple, motor, dome, port, iron, pole, globe, rod, pipe, bulb, engine, hose, bell, model, seat, roof, porch, sculpture, monument, flame, handle, tank, lamp, gun, flag, bar, chain, wall, deck, bike, side, bottom, figure, wagon, rope, tower, wire, clock, scooter, step, blade, motorcycle, bench, bicycle, smoke, statue, carriage
person, organism, causal agent | people activities | fun, violin, nurse, brother, lingerie, monk, parent, dad, jewelry, huge, played, santa, doctor, basketball, terrier, instrument, music, captain, take, football, man, father, young, daughter, drum, mom, trick, son, jump, held, pets, men, blonde, friend, employee, colorful, skating, person, guy, boy, swing, girl, safety, photographer, racing, swimming, female, clown, disc, skate, adult, kid, winter, guitar, child, baseball, driver, ball, air, carry, worker
person, change, organism | Misc | rattle, news, song, ant, imagine, send, emotion, arrange, living, jazz, ripples, inn, god, learn, please, violent, fragile, marrow, aerial, misty, inch, unhappy, devil, essential, avoid, squad, tobacco, prey, flowers, banker, urban, protest, replace, saint, psychology, demon, movement, holiday, rabbi, pollution, mayor,
illusion, dense, entertain, wisdom, underwater, manner, awareness, politician, pray, give, lawyer, become, participate, supper, trial, vinyl, law, gymnastics, droplets, odd, believe, dawn, brain, secretary, brandy, retain, fail, communication, wasp, interest, gasoline, plenty, concert, helium, noise, locked, demolition, activity, payment, lose, great, literature, allow, bring, nephew, abstract, soul, paws, guardian, win, funny, might, expand, dreary, lover, tax, friends, skill, jail, put, uncle, ancient, joy, tough, tropical, happiness, boys, population, underground, understand, wander, stairs, abundance, value, idea, exhibition, cancer, choice, males, professor, reduce, mad, depth, hockey, discussion, flexible, compare, collect, appointment, exotic, think, seem, confidence, bad, steal, get, birds, dull, ceremony, abandoned, relaxed, sailing, industrial, lips, sunglasses, normal, surfers
change, person, causal agent | Misc | jellyfish, add, fact, journal, proof, paragraph, oak, liver, impression, danger, chipmunk, explosion, vision, chess, quote, rally, diet, prince, remain, receive, minister, sketch, sad, arrive, reef, task, college, leader, origami, stencil, planet, maid, champion, bold, chapter, army, actor, mallard, camp, sense, companion, formula, timber, conversation, warrior, pride, dew, theme, queen, vacation, comfort, age, self, mare, morning, redhead, mill, cold, celebration, reader, flood, phrase, era, cent, evening, zombie, partner, construct, know, violet, cellar, gut, august, manager, winner, copper, hard, autumn, mechanic, singer, month, tiles, bishop, poppy, miniature, festival, justice, attention, spider, blurred, children, listen, colour, animals, women, carnival, hound, girls, definition, triumph, hero, kids, peace, vitamin, week, dusk, dragonfly, job, web, wolf, sunrise, go, smart, author, president, quest, auto, graveyard, heavy, fashion, article, atmosphere, summer, flesh, restless, gather, emergency, cannon, suds,
north, sell, vinegar, cute, world, pyramid, ford, handwriting, formal, wife, architecture, ..., wedding

Table D.4: Members of the 40 clusters in ES. Clusters are ordered by size.

Figure D.1: Heatmap of Jaccard coefficients between K-means clusters of ES and EL (y and x axes respectively).
Figure D.2: Heatmap of Jaccard coefficients between K-means clusters of ES and EV (y and x axes respectively).
Figure D.3: Heatmap of Jaccard coefficients between K-means clusters of EL and EV (y and x axes respectively).
Figure D.4: Heatmap of Jaccard coefficients between Agglomerative clusters of ES and EL (y and x axes respectively).
Figure D.5: Heatmap of Jaccard coefficients between Agglomerative clusters of ES and EV (y and x axes respectively).
Figure D.6: Heatmap of Jaccard coefficients between Agglomerative clusters of EL and EV (y and x axes respectively).
Figure D.7: Cluster map of Jaccard coefficients between Agglomerative clusters of ES and EL (y and x axes respectively).
Figure D.8: Cluster map of Jaccard coefficients between Agglomerative clusters of ES and EV (y and x axes respectively).
Figure D.9: Cluster map of Jaccard coefficients between Agglomerative clusters of EL and EV (y and x axes respectively).
Figure D.10: t-SNE plot of ES with 40 cluster labels obtained by K-means clustering. t-SNE perplexity = 52.

Appendix E Mutual Information of Semantic Spaces

(a) IHSIC , : median, d = 3 (b) IHSIC , : median, d = 11 (c) IHSIC , : median, d = 12 (d) IHSIC , : median, d = 13 (e) IHSIC , : median, d = 50
Figure E.1: Estimated Mutual Informations: I(EL, EV) (red) and I(EL, ES) (blue) for different corpus sizes.

(a) IHSIC , : median, d = 3 (b) IHSIC , : median, d = 11 (c) IHSIC , : median, d = 12 (d) IHSIC , : median, d = 13 (e) IHSIC , : median, d = 50
Figure E.2: Estimated Mutual Informations: I(EL, EV) (red) and I(EL, ES) (blue) for different word frequency ranges.
Appendix F Centroid Contexts

Centroid | Wikipedia | VG
plate | tectonics, nazca, restrictor, farallon, subducts, license, cribriform, tectonic, subducting, eurasian | plate, lying_on_top_of, on, has, on_top_of, in
licking | dipmix, upsumid, dzeger, muwe, cȪ, glucorticoid, nmeru, schlecken, udhuml, bormetz | licking, celing, tougue, tongue, girrafe, lioness
ad | hoc, din, libbed, valorem, hominem, libs, alcorcón, litem, infinitum, libbing | ad, of_different, are_an_ad, features_view, templeton, lining
rust | epique, cronartium, oleum, cohle, obritzberg, blister, belt, puccini, windexed, colored | rust, stains_down, around_side_of, rusted_onto, on_fire, with_a_lot
railway | station, line, midland, bnsf, nearest, gauge, junction, western, rhaetian, stations | railway, detach, elevated_on_platform_over, passes_over_a, spliced_through, traditional
classroom | teachers, instruction, collaborize, embiggening, h 쓶 훊, space, 257327273628, borrowred, classbank, approachěshe | classroom, discussing_in, standing_inside, sitting_inside, student, attending
hummingbird | amazilia, selasphorus, mellisuga, calypte, cynanthus, berylline, scintillant, orthorhyncus, eupherusa, chinned | hummingbird, eat_nectar_from, in_flight_below, flapping_its, flapping, windspan
cab | cutie, calloway, hansom, obradoiro, driver, taxi, muzen, signalling, abstracts, susbde | cab, on_hood_of, pintin, rding, back_window_of_taxi, driving_side_of
bloom | dooryard, algal, slieve, jazmine, harold, vlstnik, flowers, smuth, twillerbuds, orlando | bloom, cherry_blossom_tree_in, in_full_summer, rose_has_fully, buttercup, at_a
chapel | sistine, hill, ease, carolina, methodist, calvary, chantry, unc, mortuary, brancacci | chapel, church, outside, home, trim, to
champagne | ardenne, ishihik, bottle, stakes, lanson, chalons, noďll, beugrnd, crayeuse, 1172–1219 | champagne, iin, in_a_womans, carafe, cling_to, wrapped_around
jaw | moose, droppingly, lower, phossy, osteonecrosis, upper, dropping, Éehu'p, plinl, wkmow | jaw, strong, searching, facing_opposite_of, of, wearing_no
rapid | transit, interborough, growth, wien, prototyping, intensification, bucurești, expansion, bus, industrialization | rapid, powering_through, maneuver_through, rides_of, crashing_over, goes_over
dandelion | taraxacum, burdock, c4h4n2os, cicǁri, insubrico, vliegezwam, zngune, freely, oduvanchik, paardebloem | dandelion, in_empty_spot_of, against_some, are_distributed_in, among, growing_in
bright | yellow, colors, red, orange, lights, colours, kellie, spots, sunshine, green | bright, light_green_ten, lit_red_arrow_pointing, sky_cloudy_but, street_light, tv_screen, yellow_painted_wood
drizzle | 嬱嬱곝, drzzle, chisper, bbuch, ourdelt, , miggelen, dampy, fzdz | drizzle, a_donut, decorated_same_as, adorning, ont_he, topping
crochet | yarntainers, sperƦ, kolose, freeformations, networksěsignaled, knitting, filet, tatting, mrndti | crochet, clit, needlepoint, are_for, sitting_o_top_of, yarn
fun | poked, poking, pokes, poke, loving, lot, lovin, yidishn, fun, we | sell are_having, are_having_great, fun, facing_away, planning, having
pain | neuropathic, abdominal, chronic, excruciating, chest, velyt, orofacial, myofascial, swelling, khiybn | choppy_near, waving_from, sitting_back, sticking_up_out_of, travel_on
ant | dec, formicid, anoplolepis, arboreal, blueblack, etkenmen, solenopsis, leafcutter, genus | ant, showing_through, reflecting_off, are_behind, vase, are_on
bus | routes, service, terminal, services, intercity, stops, stop, lines, shuttle, rapid | bus, on_front_of, on_side_of, in, wearing, lit_up_on
giraffe | giraffa, reticulated, masai, gogolick, erdmännchen, four–horned, gerneous, giraffe, mtundu, qȕribȕ | giraffe, has, of, on, wearing, man
glass | stained, windows, magnifying, borosilicate, leaded, ionomer, window, menagerie, panes, beds | glass, wine, wearing, on, liquid, half_full_of
hand | right, sleight, grenades, left, hand, cranked, grenade, claps, gloved, upper | hand, holding, held_in, on, in_mans, man
window | transfer, openings, transom, glass, palladian, oriel, lancet, sills, sash, panes | window, built_into, wearing, man, on_side_of, on
plane | crash, projective, focal, euclidean, crashed, hyperbolic, inclined, astral, transpyloric, crashes | plane, flying_in, wearing, man, on_side_of, of
white | sox, black, supremacist, house, 0, creamy, collar, supremacists, tiled, stripes | white, colored, grouped, rustic, tile_on_a, blossem
grass | marram, roots, splendour, günter, molini, tussock, rosmalen, yelloweyed, cogon, tussocks | grass, eating, grazing_on, standing_in, has, grazing_in
tree | trunks, banyan, noisily, bodhi, christmas, beerbohm, fig, porcupine, frog, lined | tree, growing_on, behind, leave, on, man
room | dining, locker, dressing, waiting, billiard, temperature, schoolhouse, boiler, hotel, romper | room, in, in_other, on, in_corner_of, smiling_in
water | polo, drinking, supply, potable, fresh, sanitation, brackish, vapor, shallow, soluble | water, in, swimming_in, on, floating_in, wading_through
wall | street, hadrian, curtain, dodd–frank, antonine, retaining, hangings, berlin, qibla, cell | wall, hanging_on, against, has, hung_on, man
dog | bonzo, naughty, hound, sled, mad, whelks, junkyard, stray, shaggy | dog, on, man, in, chasing, has
sky | perfectv, mdren, big, gocheok, sports, tg24, blue, costěse, news | in, sky, hanging_in, flying_in, cloud, on
train | kmph, passenger, wagon, halts, derailed, station, freight, express, southbound, services | train, on_front_of, in, wearing, man, waiting_for
table | tennis, lists, periodic, shows, following, summarises, sortable, summarizes, lookup, hash | table, sitting_at, on_top_of, has, in, at
man | spider, isle, young, meg, yǔshɟ, beenie, iron, tt, old, pc | wearing, on, man, in, wears, holding
living | couples, someone, lone, together, 18, families, poverty, daylights, quarters, people | living, island, docked_near, plant, horizon, top
say | goodbye, needless, anything, yes, goodbyes, darndest, goes, sources, went, hello | say, tan, hilton, adopt, allerton, esden, mod
seagull | 60033, chekhov, fluoxytine, , rcdin, treplev, en160, hmtm, krrku, merikjks | seagull, floating_with, over_and_in, beside_of_a, has_webbed, about_to_dive_in
furniture | upholstered, antique, fittings, store, homestores, designer, maker, widdicomb, depository, harrods | furniture, sniffing_under, occupying, has_a_reflection_in, has_shadows, lid_hidden_by, matched_with
inspect | cerny2012, ensende, evasioněarrives, mniktoll, shutterěthe, ofstin, sokotő, Ɠsupervises, goodsĘ, skagerrak | inspect, zebra, hoof, branch, mane, train
accident | car, automobile, fatal, motorcycle, occurred, boating, freak, traffic, tragic, investigation | accident, scurrying_around, arrived_at_an, was_in_an, whit, wearing_one
woman | wonder, young, suffrage, first, pregnant, bionic, named, elderly, man, beautiful | woman, wearing, on, holding, man, in
object | oriented, neptunian, inanimate, limerent, thorne–żytkow, substellar, xmlhttprequest, permanence, datasource, remotable | object, on_bottom_of_an, on_bottom_of_a, supporting_a, in_front_of_an, in_to
weed | noxious, forestin, jimson, thurlow, whacker, invasive, cbomb, sot, teyck, forrestin | weed, are_growing_in, obscures, growing_through, growing_in, grow_out_of
sushi | sashimi, nigiri, nudoki, tempura, benky, cremiere, dishesělist, funtost, itcho | sushi, rolls, avocado, fishes, are_partially, are_below
texture | leathery, velvety, crunchy, porphyritic, mapping, crumbly, shaders, chewy, papery, waxy | texture, angle_creates, h_different, _clear, tells, casts_on
gut | microbiota, wrenching, microbiome, volkseigenes, riechst, stjepko, aiderbichl, lkly, wrenchingly, byeolsin | gut, stringy, pumpkin, wearing, all, this
museum | art, stedelijk, ashmolean, kunsthistorisches, hirshhorn, metropolitan, whitney, guggenheim, boijmans | parked_outside_of, museum, mounted_to_side_of, of_airplane, parked_outside, parked_in_front_of
compare | cmurek, d'nello, itimtres, kolskys, quoteboxes, ydhikenu, dispositioně to, favorably, bubenspitzle, hetiristri | compare, kid, phone
island | rhode, staten, coney, long, baffin, whidbey, canvey, mackinac, rikers, vancouver | island, living, rimmed_in, on_front_part_of, opened_next_to, building_into
theme | song, ending, opening, tune, anatolic, recurring, armeniac, park, thracesian, parks | theme, gorilla, opens_on, depicting, stands_at, spatula
animals | cruelty, plants, furry, wild, domesticated, nonhuman, impaling, humans, stuffed | animals, stuffed, cubicle, insdie, sit, brown
think | tank, don't, tanks, dzi, shudder, it's, said, saying, dance, bozos | think, grey, trunk
monk | thelonious, buddhist, benedictine, tonsured, bretton, fryston, xuanzang, huifn, 匃娇, cntell | monk, grows_sparsely_near, helping_a, navigating, have_on, monastery
village | administrative, municipality, administrated, howmeh, population, greenwich, locality, makeup, pomerania, kielce | village, at_foot_of_a, overlooks, traveling_through, traveling, harbour
learn | shocked, scikit, surprised, ready, perfyct, horrified, students, basics, opportunity, dismayed | learn, ski, child, for, book, on
activity | volcanic, thunderstorm, economic, enzymatic, sexual, seismic, progestogenic, extravehicular, fumarolic, estrogenic | activity, attending, 3, watching, around, spectator
phone | mobile, call, calls, cell, hacking, windows, pls, cellular, phreaks, booth | phone, talking_on, holding, using, looking_at, talking_on_a
drawer | wbstrtup, desk, doorbnds, cryd, cigreet, gr ffs, h900, jhngeer, pnphobic, trinchnte | drawer, face_in, built_underneath, dresser, handle, knob
poppy | opium, papaver, jhkr, eschscholzia, delevingne, breadseed, sovi, pipoppo, seeds, pricklyhead | poppy, cluster_of_pink, stamen, iris, lapel, gentleman
wine | sparkling, grape, tasting, cellar, cellars, mulled, grapes, jancis, tastings, bread | wine, pouring, glass, drinking, tasting, half_filled
swan | whooper, nautor, serind, silvertones, led, odile, tuonela, coscoroba, lake, puk­ | swan, swimming_above_a, swimming_on, swimming_upon, swimming_in, wading_on
blanket | nonpartisan, eject, bog, statementstein, usb, bingo, primary, blackfriargate, 200–600mm, srolig | blanket, sewn_into, adorning, color_in, draped_over, wrapped_up_in
bike | mountain, racks, lanes, dockless, trails, paths, path, citi, orienteer, ride | bike, riding, locked_to, in, riding_a, chained_to
beach | palm, myrtle, delray, vero, daytona, pompano, redondo, pebble, volleyball, long | beach, at, walking_on, breaking_on, playing_on, standing_on
calf | roping, fatted, cow, golden, stovl, injury, svts, edy, vitlo, milkcow | calf, nursing_from, nursing_in, nursing, has_head_on, and_street_with
yellow | jackets, fever, pale, greenish, bright, perch, bellied, flowers, orange, peril | yellow, green_face_of, colored_sauce_on, track_has, yellow_ground, colored_doughnut
cooking | utensils, pots, cleaning, vinyl, pot, ǵkoken, oil, dotch, baking, sewing | cooking, grill, pan, hotdog, oil, steak
horn | hardart, cape, rimmed, kimley, kniehtiio, blinkey, africa, tiio, flugel, trevor | horn, curving_away_from, horns_on_giraffe, has, has_curved, grow_from
nail | tooth, biter, biting, Ïkørhn, mzitovich, yakupov, coffin, polishes, bitingly, salons | nail, used_in, impaled_on, holding_together, hung_up, has_thumb
buddy | holly, ebsen, desylva, defranco, lazier, landel, valastro, stoodios, cianci, roemer | buddy, best, animals_in, three, stands_by, stand
haze | dizee, evot, transboundary, purple, dingleberry, bgilgul, defringe, hmeshumsh, rykee, unidentifiable– uniform | haze, over_top, are_higher_then, are_very_far_from, blob, in_distance
protest | peaceful, resigned, marches, liulitun, mecht, protestirm, rosenstrasse, strike, nonviolent, izik | protest, for, sign, road, holding, on
sleep | apnea, rem, deprivation, nrem, obstructive, dreamless, polyphasic, apnoea, disorders, bruxism | sleep, with_a_blue, with_green_shirt, covered_with_a, has_his_head, shoes_are_on
clothes | swaddling, oxxford, washing, civilian, dryers, plain, borlo, dryer, mrilly, ngchump | clothes, on_clos, stores, have_pockets_in, are_piled_on_top_of, strown_on
butter | peanut, margarine, cocoa, bread, clarified, cheese, melted, jelly, unsalted, murumuru | butter, large_near, partly_outside, sauteeing_in, margarine_on, in_a_small
flower | buds, heads, antlike, travellin, spikes, pasque, lotus, anthos, beds, stalks | flower, vase, blooming_in, blooming_inside, in, in_mid_of
rain | torrential, singin, heavy, pouring, soaked, snowfall, snow, hatful, shine, freezing | rain, falling_on, getting_wet_by, towns, walking_on, taking_shelter_from
coffee | shop, roasters, beans, shops, plantations, arabica, decaffeinated, starbucks, roaster, tea | coffee, cup, mug, poured_from, of_steaming, blowing_across
cow | clarabelle, dung, milk, calf, hydrodamalis, parsnip, hocked, reined, mad, kowemerk | cow, on, standing_in, lying_in, milking, being_shown_at
wig | wm, jgbgs, wmni, blonde, wg, sheinhrdt, blond, apologiseěit, hairpieceě, lacefront | wig, in_clown, wears, guiding, not_wearing_a, pulled_by
tower | conning, eiffel, bell, clock, hamlets, martello, babel, spasskaya, drug, spire | tower, housing_a, along_front_of, at_top_of, clock, containing_a
blue | jays, ribbon, ƺyster, jackets, heelers, collar, devils, riband, bombers, ridge | blue, clear, and_white_ocean, cloudless, air_on, brand_name_rusty_on_its
skin | irritation, rashes, grafts, lesions, pigmentation, irritations, goldbeater, mucous, nonmelanoma, tanned | skin, hanging_h, lady_light, appearing_on, mating_with, on_cat
flexible | compacting, sigmoidoscopy, fiberoptic, bendsome, conformationally, adaptable, plankostenrechnung, liquidtight, linkers, helic | flexible, green_rim, to_catch_a, gren, in_dogs, frisbee
bag | duffel, plastic, duffle, bidit, vinbo, punching, sleeping, grab, colostomy, airsickness | bag, carrying, carrying_a, carries, beyween, placed_inside
bird | passerine, migratory, caged, sanctuary, watchers, watching, topley, species, prey, furnariidae | bird, perched_on, flying_in, flying_over, beak, flying_ahead_of
kitchen | sink, soup, hell, utensils, dining, hell's, bathroom, pantry, scullery, laundry | kitchen, in, face_in, prepared_in, working_in, with_interior
father | succeeded, died, footsteps, adoptive, biological, death, law, inherited, son | father, leaning_over_to_touch, taking_a_picture_in, taking_a_self_in, and_son_learning_to_ice, cleans_sons
shore | dinah, lake, north, eastern, batteries, geordie, bombardment, pauly, jersey | shore, breaking_on, washing_on, coming_to, coming_in_to, crashing_on
vehicle | motor, launch, registration, electric, mv, wheeled, reentry, rov, utility, uv | vehicle, parked_alongside_of, parked_alongside, parked_on, parking_on_side_of, are_parked_alongside_of
bring | back, helped, together, would, horizon, forth, could, attention, able, attempt | bring, says, wall, on_a, has_a, has
smile | smiley, crnt, vojdnov, تُٜؕ, Ŀŀ, solmi, toothy, dk, sonrőe, face | smile, exposing, revealing, exposes, in_smiling, expressing
time | first, full, long, extra, spend, spent, real, consuming, slot, around | time, having_great, counts, scene_during_day, to_tell, tells
fact | despite, due, checking, spite, finding, complicated, matter, checker, compounded, evidenced | fact, listed_on, with_some, ingredient, jug, bottle
football | league, team, club, college, player, victorian, national, american, professional, coach | football, chases, to_hinder, trying_to_save, player_in, experiencing_a
fish | finned, cyprinid, wildlife, hatchery, bony, mrdy, freshwater, anadromous, demersal, shellfish | fish, mole_on_a, writhes_inside, has_a_gray, caught_with, about_to_be_fed_to
film | festival, directed, cannes, drama, feature, comedy, sundance, documentary, horror, thriller | film, taken_with, are_being, taped_to, being, playing_on_a
arms | coat, coats, embargo, municipality's, ammunition, legs, pursuivant, canting, small, gules | arms, outstretched, bre, fold, skater, skateboarder
plant | flowering, power, pathogen, desalination, figwort, ornamental, host, hardiness, herbaceous, bottling | plant, growing_in, growing_on, growing_up_a, pot, leaf
food | fast, beverage, drug, drink, shortages, supplies, insecurity, neivethnm, gri, staple | food, added_on, sauteing_in, wrapped_inside, placed, cut_in_to
make | sure, amends, way, failed, would, easier, decisions, wanted, room, sense | make, exchange, fishpond, blower, construct, are_not
army | salvation, liberation, u, british, states, corps, potomac, red, kwantung, officer | patch_for, army, getting_out_of, calendar, on_back_of, green
body | governing, whorl, student, snatchers, sanctioning, politic, ecliptic, cremated, human, lifeless | body, blown_up_in, boat_in, drifting_in, away_from_thier, bent_away_from
school | high, elementary, secondary, grammar, district, primary, boarding, middle, preparatory, districts | school, bus_that, front, second, stenciled_on, yellow
forest | nottingham, wake, teutoburg, epping, boreal, montane, sclerophyll, jarrah, lawn, deciduous | forest, road_in, tipped_over_in, stand_above, form_a_distict_line, a_ground_in
new | york, zealand, jersey, orleans, hampshire, guinea, papua, brunswick, yorker, testament | new, urban, beard_and, generation_wide_screen_smart, model_white, spire
city | york, kansas, council, makeup, mexico, limits, quezon, oklahoma, holby, residing | city, shining_in, building_in_a, in, wandering, in_sin
people | surname, notable, 000, per, republic, young, indigenous, disabilities, employed, aboriginal | people, are_enjoying, has, on, watching, are_watching
family | moth, cerambycidae, mollusk, beetle, crambidae, geometridae, erebidae, noctuidae, tortricidae, size | family, seated_around, stand_with, clearly, where_are, on_ground_for
house | representatives, commons, lords, manor, opera, white, publishing, delegates, speaker, random | house, on_fde_of, cemented_on, in_front_of, crossing_over_to, adorning
year | old, following, contract, olds, every, per, fiscal, rookie, next, previous | year, pub, 1893, states, age, was_taken_in
party | communist, democratic, labour, liberal, conservative, janata, socialist, republican, political, labor | party, at_a_birthday, crying_at, sneaking_up_on, having, are_at_a
company | parent, brewing, insurance, worshipful, steamship, manufacturing, publishing, holding, production, founded | company, of_photography, are_enjoying_each_others, calls, boeing, that_owns

Table F.1: Context words of cluster centroids with the 10 highest χ² score.

Centroid | Wikipedia | Visual Genome
plate | tectonics, license, river, nazca, restrictor, tectonic, umpire, eurasian, farallon, home | on, plate, on_top_of, on_a, lying_on_top_of, with
licking | county, river, muwe, grooming, fork, cȪ, lips, wounds, dipmix, upsumid | licking, tongue, giraffe, cat, girrafe, celing
ad | hoc, century, din, libbed, valorem, libs, alcorcón, lib, hominem, infinitum | ad, on, of_different, lining, for, has
rust | belt, colored, fungi, blister, cronartium, epique, coloured, fungus, oleum, cohle | rust, on, has, stains_down, on_fire, around_side_of
railway | station, line, western, midland, nearest, stations, company, junction, gauge, eastern | railway, detach, traditional, beside, elevated_on_platform_over, passes_over_a
classroom | teachers, instruction, space, language, building, outside, future, 0, every | classroom, in, standing_inside, sitting_inside, discussing_in, student
hummingbird | amazilia, selasphorus, throated, mellisuga, chinned, calypte, cynanthus, hawkmoth, species, scintillant | hummingbird, flapping, eat_nectar_from, flapping_its, has, in_flight_below
cab | cutie, calloway, hansom, driver, taxi, obradoiro, signalling, death, abstracts | cab, on, has, on_hood_of, driving_on, of
bloom | algal, harold, flowers, slieve, dooryard, orlando, jazmine, claire, leopold, full | bloom, cherry_blossom_tree_in, in_full_summer, rose_has_fully, on, in
chapel | hill, sistine, carolina, ease, methodist, built, dedicated, calvary, st, chantry | chapel, on, outside, church, of, to
champagne | ardenne, stakes, bottle, ishihik, en, chalons, brie, lanson, reims | champagne, iin, in_a_womans, carafe, glass, wrapped_around
jaw | moose, lower, upper, droppingly, broken, dropping, phossy, osteonecrosis, muscles, saskatchewan | jaw, of, has, of_a, strong, on
rapid | transit, growth, wien, expansion, bus, interborough, intensification, prototyping, succession, bucurești | rapid, powering_through, crashing_over, goes_over, maneuver_through, rides_of
dandelion | taraxacum, burdock, tribe, wine, superfine, plants, dxc, barleycup, gerlt, krindle | dandelion, in, growing_in, in_empty_spot_of, among, against_some
bright | yellow, red, colors, orange, lights, green, colours, light, eyes, blue | bright, sleep_on, blue, eyes, green, browƢ
drizzle | 嬱嬱곝, drzzle, freezing, chisper, bbuch, ourdelt, , miggelen, rain, german | drizzle, a_donut, adorning, ont_he, decorated_same_as, on
crochet | knitting, hook, filet, yarntainers, stitches, sperƦ, kolose, tatting, knit | crochet, are_for, needlepoint, clit, under, are_on
fun | poked, poking, pokes, poke, loving, lot, much, making, make, fun | fun, are_having, are_having_great, facing_away, having, planning
pain | abdominal, neuropathic, chronic, chest, suffering, excruciating, relief, swelling, back, severe | choppy_near, waving_from, sitting_back, sticking_up_out_of, travel_on
ant | dec, species, genus, subfamily, man, arboreal, dm, fire, tupolev | ant, showing_through, reflecting_off, are_behind, vase, are_on
bus | routes, service, services, terminal, station, stop, lines, stops, rapid, route | on, bus, has, on_side_of, on_front_of, of
giraffe | giraffa, reticulated, masai, rothschild, zebra, melman, mb, gogolick, nubian | has, giraffe, of, on, behind, has_a
glass | stained, windows, window, magnifying, looking, leaded, beds, menagerie, borosilicate, bottles | glass, on, wearing, has, in, with
hand | right, left, hand, grenades, one, sleight, side, upper, grenade, combat | hand, has, holding, in, on, of
window | transfer, glass, openings, transom, rear, lancet, palladian, frames, oriel, arched | on, window, has, build, on_a, on_side_of
plane | crash, projective, focal, crashed, euclidean, inclined, hyperbolic, astral, crashes, complex | on, plane, has, of, on_side_of, flying_in
white | black, sox, house, 0, supremacist, collar, red, tiled, blue, creamy | white, on, colored, of, black, small
grass | roots, courts, marram, splendour, günter, outdoor, tussock, natural, perennial, tall | grass, on, in, eating, standing_in, has
tree | trunks, christmas, oak, banyan, lined, frog, planting, fig, joshua, palm | tree, behind, on, in, near, has
room | dining, locker, dressing, waiting, temperature, living, hotel, reading, make, drawing | room, in, has, inside_of, in_a, in_corner_of
water | polo, supply, drinking, fresh, potable, quality, shallow, census, sanitation, resources | in, water, near, on, swimming_in, by
wall | street, hadrian, curtain, berlin, retaining, cell, paintings, outer, brick, stone | wall, on, hanging_on, against, in, behind
dog | hound, naughty, mad, hot, sled, bonzo, pet, breeds, stray | dog, has, on, of, with, has_a
sky | big, sports, news, blue, night, perfectv, conference, mdren, survey | in, sky, cloud, flying_in, hanging_in, above
train | station, passenger, wagon, express, services, freight, kmph, service, derailed, halts | on, train, has, of, on_front_of, on_side_of
table | tennis, following, shows, lists, periodic, round, mid, summarizes, summarises, bottom | on, table, on_top_of, sitting_at, sitting_on, at
man | spider, isle, young, old, meg, iron, named, one, match, tg | wearing, man, has, on, holding, of
living | couples, someone, lone, together, 18, families, people, poverty, room, conditions | living, island, docked_near, plant, horizon, top
say | goodbye, needless, anything, went, sources, goes, yes, would, something, never | say, tan, letter, word, clmette, centre
seagull | chekhov, 60033, livingston, fluoxytine, , supermarine, treplev, zrechny, rcdin, chekhov's | seagull, over_and_in, floating_with, beside_of_a, flies_over, flying_above
furniture | store, antique, designer, maker, pieces, fittings, upholstered, factory, makers, design | furniture, on, occupying, sniffing_under, cusion, has_a_reflection_in
inspect | organise, skagerrak, proceedings, visually, damage, goodsĘ, unpiggable, cerny2012, ensende, evasioněarrives | inspect, zebra, hoof, branch, of, train
accident | car, automobile, fatal, motorcycle, occurred, traffic, investigation, boating, freak, cause | accident, arrived_at_an, scurrying_around, was_in_an, whit, say
woman | young, wonder, first, suffrage, named, man, old, pregnant, beautiful, american | wearing, woman, has, holding, on, wears
object | oriented, neptunian, subject, inanimate, indirect, direct, verb, relational, affection | object, on, on_bottom_of_an, in, on_bottom_of_a, has
weed | noxious, thurlow, jimson, invasive, forestin, whacker, control, sot, invasion, alligator | weed, growing_in, are_growing_in, in, growing_through, growing_next_to
sushi | sashimi, restaurant, nigiri, tempura, chef, bar, nudoki, yo, restaurants | sushi, on, rolls, avocado, near, are_below
texture | mapping, leathery, velvety, color, crunchy, waxy, shaders, papery, porphyritic, chewy | texture, angle_creates, h_different, _clear, tells, has
gut | microbiota, wrenching, microbiome, flora, stjepko, volkseigenes, strings, riechst, wrenchingly, microflora | gut, pumpkin, stringy, wearing, all, has
museum | art, national, metropolitan, modern, natural, fine, british, whitney, history | parked_outside_of, museum, parked_in_front_of, parked_outside, mounted_to_side_of, of_airplane
compare | favorably, used, favourably, swap, difficult, contrast, different, tables, prices, results | compare, kid, phone
island | rhode, staten, long, coney, edward, vancouver, platform, mre, baffin, whidbey | island, living, rimmed_in, in, on, on_front_part_of
theme | song, ending, opening, park, tune, recurring, main, parks, music, songs | theme, gorilla, depicting, opens_on, stands_at, has
animals | plants, wild, cruelty, humans, domesticated, furry, nonhuman, farm, domestic | animals, stuffed, cubicle, insdie, sit, brown
think | tank, tanks, don't, said, people, dzi, saying, dance, it's, really | think, grey, trunk
monk | thelonious, buddhist, benedictine, tonsured, bretton, xuanzang, cistercian, mr, jin, meredith | monk, grows_sparsely_near, navigating, helping_a, wearing, holds
village | administrative, population, municipality, locality, small, makeup, greenwich, administrated, located, howmeh | village, overlooks, in, at_foot_of_a, traveling_through, traveling
learn | students, shocked, ready, surprised, opportunity, children, must, read, horrified, basics | learn, ski, child, for, book, on
activity | volcanic, economic, sexual, physical, thunderstorm, seismic, criminal, human, enzymatic, paranormal | activity, attending, 3, watching, around, in
phone | mobile, call, cell, calls, windows, hacking, number, numbers, cellular, booth | phone, holding, on, talking_on, has, using
drawer | desk, top, soccer, dresser, wbstrtup, slides, crisper, painter, doorbnds | drawer, on, in, has, handle, under
poppy | opium, papaver, seeds, seed, delevingne, jhkr, eschscholzia, cultivation, straw, remembrance | poppy, cluster_of_pink, stamen, iris, lapel, gentleman
wine | grape, sparkling, cellar, tasting, cellars, bread, grapes, red, region, spirits | wine, glass, pouring, of, in, drinking
swan | lake, river, black, whooper, nautor, led, hunter, coastal, odile, districts | swan, swimming_in, makes, swimming_on, in, swimming_above_a
blanket | nonpartisan, eject, bog, primary, bn, wrapped, bingo, bogs, beach, usb | blanket, on, has, bed, adorning, under
bike | mountain, lanes, racks, path, trails, paths, trail, ride, sharing, dirt | bike, on, riding, has, of, near
beach | palm, long, myrtle, florida, volleyball, daytona, boys, delray, miami, california | beach, on, at, walking_on, standing_on, playing_on
calf | roping, golden, injury, cow, fatted, muscle, strain, muscles, edy, stovl | calf, has, nursing_from, nursing, of, nursing_in
yellow | fever, jackets, pale, bright, orange, greenish, flowers, perch, card | yellow, on, painted, green_face_of, green, colored_sauce_on
cooking | utensils, oil, vinyl, pots, cleaning, pot, show, used, techniques, sewing | cooking, grill, pan, hotdog, pizza, food
horn | cape, africa, hardart, trevor, big, section, van, rimmed, golden, french | horn, has, on, of, giraffe, head
nail | tooth, biting, biter, coffin, polish, salon, records, salons, rusty, polishes | nail, on, has, of, has_a, used_in
buddy | holly, ebsen, guy, defranco, lazier, desylva, rich, roemer, collette, landel | buddy, best, animals_in, three, stand, in
haze | dizee, purple, transboundary, evot, dingleberry, smoke, booger, angel, jessik, fog | haze, over_top, are_higher_then, are_very_far_from, blob, below
protest | resigned, peaceful, movement, strike, marches, staged, rally, rallies, sit, violent | protest, for, sign, on, road, holding
sleep | apnea, rem, deprivation, nrem, obstructive, disorders, died, paralysis, dreamless, deep | sleep, with_a_blue, covered_with_a, with_green_shirt, shoes_are_on, wearing_a_blue
clothes | civilian, plain, washing, swaddling, wearing, wear, shoes, accessories, dryers, worn | clothes, wearing, on, on_clos, wears, stores
butter | peanut, bread, cocoa, margarine, cheese, clarified, melted, jelly, milk | butter, on, large_near, with, partly_outside, sauteeing_in
flower | buds, heads, spikes, lotus, beds, leaves, antlike, petals, stalks, garden | flower, in, on, vase, has, with
rain | heavy, torrential, singin, snow, forest, forests, fell, snowfall, shine, pouring | rain, walking_on, falling_on, walking_in, towns, getting_wet_by
coffee | shop, shops, beans, plantations, tea, roasters, table, starbucks, house | coffee, cup, mug, in, of, filled_with
cow | milk, dung, clarabelle, calf, mad, henry, slaughter, parsnip, pasture, milking | cow, has, of, in, on, standing_in
wig | blonde, wm, wg, blond, wearing, jgbgs, wmni, brunette, wore, mask | wig, in_clown, wearing, wears, has, wearing_a
tower | conning, bell, clock, eiffel, hamlets, built, london, water, observation, babel | tower, on, has, clock, on_top_of, in
blue | jays, ribbon, jackets, devils, collar, bombers, ridge, dark, white, toronto | blue, clear, on, wearing, and_white, in
skin | irritation, lesions, color, rashes, grafts, cancer, pigmentation, diseases, mucous, graft | skin, has, of, appearing_on, on, hanging_h
flexible | compacting, enough, sigmoidoscopy, thin, fuel, scheduling, highly, adaptable, stalk, plastic | flexible, frisbee, green_rim, to_catch_a, gren, in_dogs
bag | plastic, duffel, sleeping, punching, paper, mixed, grab, duffle, containing, bidit | bag, carrying, on, in, holding, has
bird | species, passerine, sanctuary, migratory, watching, prey, family, important, caged | bird, has, on, of, in, flying_in
kitchen | hell, sink, soup, dining, utensils, bathroom, room, garden, pantry, laundry | kitchen, in, in_a, inside_of, working_in, cabinet
father | died, succeeded, death, law, son, biological, footsteps, inherited, adoptive | father, taking_a_picture_in, taking_a_self_in, leaning_over_to_touch, rolling_a, walking_down_a
shore | north, lake, eastern, dinah, batteries, south, western, jersey, southern | shore, on, breaking_on, washing_on, coming_to, crashing_on
vehicle | motor, launch, electric, registration, aerial, utility, wheeled, armored, armoured, reentry | vehicle, on, of, parked_on, has, parked_alongside_of
bring | back, would, together, helped, could, able, attention, order, attempt, forth | bring, says, wall, on_a, has_a, has
smile | smiley, face, dk, smile, operation, make, lis, crnt, toothy, frown | smile, exposing, revealing, has, on, face
time | first, full, long, spent, around, real, extra, second, short, spend | time, having_great, counts, on, tells, shows
fact | despite, due, finding, matter, spite, checking, many, complicated, refers, attributed | fact, listed_on, with_some, ingredient, jug, has
football | league, team, club, college, national, american, player, professional, coach, victorian | football, chases, playing, trying_to_save, to_hinder, placing
fish | finned, wildlife, cyprinid, freshwater, bony, hatchery, species, mrdy, shellfish |
}b?- BM- QM- b2`p2/nQM- KQH2nQMn- [m`BmK }HK 72biBpH- /B`2+i2/- /`K- 72im`2- +QK@ 2/v- +MM2b- /Q+mK2Mi`v- BMi2`M@ iBQMH- ?Q``Q`- b?Q`i }HK- iF2MnrBi?- `2n#2BM;- iT2/niQ- #2BM;- BM `Kb +Qi- +Qib- bKHH- 2K#`;Q- KKmMB@ iBQM- H2;b- KmMB+BTHBivǶb- #2`- ;mH2b- /2H2` `Kb- bFi2`- bFi2#Q`/2`- Qmi@ bi`2i+?2/- #`2- 7QH/ THMi ~Qr2`BM;- TQr2`- bT2+B2b- 7KBHv- ?Qbi- Ti?Q;2M- Q`MK2MiH- KMm7+im`BM;- i`2iK2Mi THMi- QM- BM- ;`QrBM;nBM- ;`QrBM;nQM- TQi 7QQ/ 7bi- /`m;- /`BMF- #2p2`;2- bmTTHB2b- b?Q`i;2b- T`Q+2bbBM;- b72iv- bQm`+2- ;`B+mHim`2 7QQ/- QM- BM- QMniQTnQ7- THi2- rBi? KF2 bm`2- rQmH/- rv- 7BH2/- K2M/b- +QmH/- rMi2/- Q`/2`- `QQK- #H2 KF2- 2t+?M;2- #2BM;- bTHb?- }b?TQM/- +QMbi`m+i `Kv m- bii2b- #`BiBb?- `2/- +Q`Tb- HB#2`iBQM- B`- Q{+2`- bHpiBQM- mb Ti+?n7Q`- `Kv- ;2iiBM;nQminQ7- QMn#+FnQ7- ;`22M- +H2M/` #Q/v ;Qp2`MBM;- r?Q`H- bim/2Mi- ?mKM- r2B;?i- /2/- T`ib- H2M;i?- bM+iBQMBM;- KBM #Q/v- Q7- ?b- QM- BM- Q7n b+?QQH ?B;?- 2H2K2Mi`v- b2+QM/`v- /Bbi`B+i- ;`KK`- T`BK`v- KB//H2- Hr- /Bb@ i`B+ib- Tm#HB+ b+?QQH- #mbni?i- QM- 7`QMi- #mb- v2HHQr 7Q`2bi MQiiBM;?K- rF2- MiBQMH- b2`pB+2- KQMiM2- 2TTBM;- HrM- #Q`2H- i2miQ@ #m`;- HQrHM/ 7Q`2bi- BM- BMn- i`22- #2?BM/- }HH2/nrBi? 
M2r vQ`F- x2HM/- D2`b2v- Q`H2Mb- ?KT@ b?B`2- ;mBM2- bQmi?- K2tB+Q- #`mMbrB+F- TTm M2r- m`#M- bTB`2- #2`/nM/- ;2M2`iBQMnrB/2nb+`22MnbK`i- KQ/2Hnr?Bi2 +Biv vQ`F- FMbb- +QmM+BH- KF2mT- K2tB+Q- HBKBib- TQTmHiBQM- +2Mi`2- +TBiH- QFH@ ?QK +Biv- BM- BMn- b?BMBM;nBM- #mBH/- #mBH/@ BM;nBMn T2QTH2 MQi#H2- bm`MK2- yyy- T2`- vQmM;- `2@ Tm#HB+- KMv- HBpBM;- KBHHBQM- 2KTHQv2/ T2QTH2- QM- BM- ri+?BM;- rHFBM;nQM- `2n2MDQvBM; 7KBHv KQi?- #22iH2- +2`K#v+B/2- KQHHmbF- bBx2- p2`;2- BM+QK2- +`K#B/2- ;2@ QK2i`B/2- 2`2#B/2 7KBHv- b2i2/n`QmM/- biM/nrBi?- ?pBM;- QMn;`QmM/n7Q`- bBiiBM;n`QmM/ ?Qmb2 `2T`2b2MiiBp2b- +QKKQMb- HQ`/b- r?Bi2- QT2`- KMQ`- Tm#HBb?BM;- #mBHi- ?BbiQ`B+- /2H2;i2b ?Qmb2- QM- BMn7`QMinQ7- ?b- #2?BM/- M2` v2` QH/- 7QHHQrBM;- +QMi`+i- 2p2`v- T2`- QM2- M2ti- Hi2`- T`2pBQmb v2`- Tm#- bii2b- R3Nj- ;2- rbniF2MnBM keN T`iv +QKKmMBbi- /2KQ+`iB+- H#Qm`- HB#@ 2`H- +QMb2`piBp2- `2Tm#HB+M- bQ+BHBbi- TQHBiB+H- DMi- H#Q` T`iv- ?pBM;- BM- /M+2- inn#B`i?/v- +`vBM;ni +QKTMv T`2Mi- T`Q/m+iBQM- 7QmM/2/- BMbm`@ M+2- Tm#HBb?BM;- ?QH/BM;- KMm7+im`@ BM;- BM/B- #`2rBM;- i?2i`2 +QKTMv- Q7nT?QiQ;`T?v- `2n2MDQvBM;n2+?nQi?2`b- +HHb- #H2M/2`- #Q2BM; h#H2 6Xk, *QMi2ti rQ`/b Q7 +Hmbi2` +2Mi`QB/b rBi? i?2 Ry ?B;?2bi SJA3 b+Q`2X kdy