Transparent Analysis of
Multi-Modal Embeddings
Anita Lilla Verő
King’s College
This thesis is submitted for the degree of Doctor of Philosophy
November, 2021

Declaration
This thesis is the result of my own work and includes nothing which is the outcome
of work done in collaboration except as declared in the Preface and specified in
the text. I further state that no substantial part of my thesis has already been
submitted, or, is being concurrently submitted for any such degree, diploma or
other qualification at the University of Cambridge or any other University or
similar institution except as declared in the Preface and specified in the text. It
does not exceed the prescribed word limit for the relevant Degree Committee.
Anita Lilla Verő
November, 2021

Abstract
Vector Space Models of Distributional Semantics – or Embeddings – serve as use-
ful statistical models of word meanings, which can be applied as proxies to learn
about human concepts. One of their main benefits is that not only textual, but a
wide range of data types can be mapped to a space, where they are comparable
or can be fused together.
Multi-modal semantics aims to enhance Embeddings with perceptual input,
based on the assumption that the representation of meaning in humans is grounded
in sensory experience. Most multi-modal research focuses on downstream tasks,
involving direct visual input, such as Visual Question Answering. Fewer papers
have exploited visual information for meaning representations when the evalua-
tion tasks involve no direct visual input, such as semantic similarity. When such
research has been undertaken, the results on the impact of visual information
have often been inconsistent, due to the lack of comparison and the ambiguity of
intrinsic evaluation.
Does visual data bolster performance on non-visual tasks? If it does, is this
only because we add more data, or does it convey complementary quality
information compared to a higher quantity of text? Can we achieve comparable
performance using small data if it comes from the right data distribution? Is
it the modality, the size or the distributional properties of the data that matter?
Evaluating on downstream or similarity-type tasks is a good start to compare
models and data sources. However, if we want to resolve the ambiguity of in-
trinsic evaluations and the spurious correlations of downstream results, creating
more transparent and human interpretable models is necessary.
This thesis proposes diverse studies to scrutinize the inner “cognitive models”
of Embeddings, trained on various data sources and modalities. Our contribution
is threefold. Firstly, we present comprehensive analyses of how various visual
and linguistic models behave in semantic similarity and brain imaging evaluation
tasks. We analyse the effect of various image sources on the performance of
semantic models, as well as the impact of the quantity of images in visual and
multi-modal models. Secondly, we introduce a new type of modality: a visually
structured, text-based semantic representation, lying in-between visual and
linguistic modalities. We show that this type of embedding can serve as an efficient
modality when combined with low-resource text data. Thirdly, we propose and
present proof-of-concept studies of a transparent, interpretable semantic space
analysis framework.
Acknowledgements
I am especially thankful to my supervisors, Stephen Clark and Ann Copestake,
who guided me on my path to the PhD at different stages and in different ways.
I am immensely grateful to Steve for the opportunity of starting a PhD at
Cambridge. I learned a lot from our discussions and enjoyed his openness to any
out-of-the-box ideas. Ann helped me greatly with organising my work after a
break I had to take in the middle of the programme. She helped me clarify my
thoughts with her insightful questions and motivated me to start planning and
writing down ideas early. I feel I benefited greatly from their very different but
equally supportive mentoring styles.
I owe special thanks to my collaborators Douwe Kiela, Luana Bulat, Ekaterina
Shutova and Christopher Davis, whose intellect and creativity I was lucky to
experience first hand.
I feel lucky to have a very supportive family, which helped me through difficult
times during the course of this programme. My dad has always shown a great
interest in whatever I was doing and often had insightful comments and questions
about it too. My mom is always there for me when I have difficulties, which means
the world.
My dear friend, Krisztián Gergely, provided invaluable support during the
past years, for which I will always be grateful. I would like to thank my good
friend, Jonathan Kanen, for his friendship and occasional English corrections.
In the last few years I was lucky enough to enjoy the immeasurable support of
József Konczer, who not only helped me find strength but has always
been ready to discuss details of my work as well.
The past years would have been much less bearable without the deep
conversations with my dear old friends Klára Békés and Fruzsina Balogh, and my close
friends from Cambridge, Akemi Herraez Vossbrink, Paula Fayos Pérez, Eugenia
Biral and Kaho Sato.
Finally, I would like to thank all my colleagues in the NLIP group and the
visiting guests I had a chance to meet, with whom I had many enlightening and
fun conversations in and outside the office.

Contents
1 Introduction
1.1 Key Contributions
1.2 Thesis Outline
1.3 Publications
2 Background and Motivation for Interpretable Multi-Modal Word Embedding Analysis
2.1 What does Word Meaning Mean, and Why should We Care?
2.1.1 Philosophical Accounts
2.1.2 (Cognitive) Linguistics and Neuroimaging
2.2 Linguistic Embeddings: From Text to Meaning
2.2.1 Distributional Semantics
2.2.2 Shallow Networks
2.3 Visual Embeddings: From Images to Meaning
2.3.1 CNN Models
2.4 Multi-modal Semantics
2.4.1 Symbol Grounding
2.4.2 Early-, Late- and Mid-fusion
2.4.3 Multi-modal RNNs and Transformers
2.5 Structured Embeddings: Motivation for a New Modality
2.6 Generalisation of Embeddings: Proposed Framework and Formalism
2.6.1 Embedding Modalities
2.7 Modalities as Partial Observers of Meaning
2.7.1 Background and Motivation for Model Transparency
2.7.2 Transparency Testing and Efficient Multi-Modal Fusion
2.7.3 “Cognitive Model” of Embeddings: How do Models Conceptualise?
2.7.4 Information Theory Background
2.7.5 Proposal for Measuring Independence of Embeddings
2.7.6 A Utility Based Model of Embedding Independence
2.8 Summary: Comprehensive and Interpretable Word Semantic Analysis
3 Methodology of Data Selection and Proposal for Interpretable Evaluation
3.1 Training Data Matters
3.1.1 Image Data
3.1.2 Text Corpora
3.2 From Intrinsic Evaluation to Interpretable Model Anatomy
3.2.1 Behavioural Tasks
3.2.2 Brain Imaging as Embedding Analysis
3.2.3 How do Models Conceptualise? – Cluster Analysis
3.2.3.1 Clustering Methods and Metrics
3.2.4 Information Gain from Modalities
3.2.4.1 Empirical Mutual Information Estimation
3.3 Analysis Scheme
4 Impact of Visual Information in Semantics
4.1 Comparing Visual Models and Data Sources for Semantics
4.1.1 Evaluation
4.1.2 Results
4.1.3 Conclusion
4.2 Visual Context in the Linguistic Domain
4.2.1 Scene Graph Context
4.2.2 Results
4.2.3 Conclusion
4.3 Modalities, Sources and Models: a Thorough Analysis
4.3.1 Studied Embeddings
4.3.1.1 Linguistic Embeddings
4.3.1.2 Visual Embeddings
4.3.1.3 Structured Embeddings
4.3.2 Mid-fusion methods
4.3.3 Evaluation Methods
4.3.3.1 Concreteness
4.3.3.2 Qualitative Analysis on Nouns of the Brain Datasets
4.3.4 Results
4.3.4.1 Correlations on the Behavioural Tasks
4.3.4.2 Results on Brain Data
4.3.4.3 Concreteness
4.3.4.4 Qualitative Analysis
4.3.5 Conclusion
4.4 Model Initialization on a Textual Entailment Task
4.4.1 Results
4.4.2 Conclusion
4.5 Conclusion
5 Effects of Data Size and Distribution
5.1 Counting in the “Effort”
5.2 Experiments
5.2.1 Control for Data Quantity
5.2.2 Control for Frequency Ranges
5.2.3 Expected Results
5.2.4 Results
5.3 Conclusion
6 Informativeness of Semantic Spaces
6.1 Hypotheses
6.2 Qualitative Analysis of Semantic Spaces
6.2.1 Cluster Structure Results
6.2.2 Inspecting the Clusters
6.2.2.1 Size Distribution and Visualisation
6.2.2.2 Cluster Similarities
6.2.2.3 Gamified Data Collection
6.2.3 Supervised Visualisation
6.2.3.1 Automatic Class Label Annotation
6.2.3.2 Results
6.3 Information Gain from Multi-modal Data
6.3.1 Hyper Parameters and Dimensionality Reduction
6.3.2 Results
6.4 Dataset Distribution
6.5 Conclusion
7 Summary and Conclusions
7.1 Main Findings
7.2 Conclusion and Future Work
Bibliography
A Cross-validated Semantic Relatedness and Similarity
B WordNet Concreteness
C EmbEval Toolkit
D Cluster Structure
E Mutual Information of Semantic Spaces
F Centroid Contexts

Chapter 1
Introduction
The anatomy of human language has long intrigued researchers. In the late
twentieth century, Information Technology introduced new, ever-improving
computational tools which opened a wide range of opportunities to perform empirical
investigations on the written and spoken (recorded) realisations of language. This
technology gave birth to new fields such as Computational Linguistics and Natural
Language Processing (NLP). Data-driven analysis of language provided another
boost to NLP after the deep learning revolution (or renaissance) in the first half
of the 2010s.
The motivations for creating computational models of language, however,
vary widely across communities. Probably the most dominant branch of
research is driven by what we may call engineering incentives, and pursues
the mission of creating systems with human-level language understanding and
generation. This area has become even more prominent since Machine Learning
(ML), and NLP in particular, has woven itself into a rapidly developing
commercial market. ML and NLP have become ubiquitous in our everyday lives
in domains ranging from criminal justice and public policy to healthcare and
education [Kaur et al., 2020].
The other, less prominent direction concerns itself with employing
technological tools in order to empirically test research hypotheses about language
and cognition or social phenomena. Here, computational models are a means
rather than an end: they can generate more knowledge using large-scale statistical
analysis. This area involves sub-fields which can be labelled as Computational
Linguistics or Computational Sociology.
The two approaches can also differ at the level of the applied models, which are
partially derived from the purpose of investigation. Applied NLP involves more
end-to-end models trained for tasks which are close to end-user applications,
such as Question Answering or dialogue systems. More theoretical work often
focuses on models which are more interpretable and evaluations which are more
intrinsic, such as semantic similarity or predicting concept representations in the
brain. Machine Learning practitioners cannot debug their models if they do
not understand their behaviour [Kaur et al., 2020]. Thus, this type of analytic
research can also serve as an important component of a checks and balances system
for commercial NLP.
The topic of this thesis is related to the aims of the latter area. We concentrate
on word semantic models. Even though words primarily acquire their meaning
within context and use, thinking in concepts and categories is a basic human
strategy by which to operate [Bowker and Star, 2000]. Semantic models of words
– and vector space models in particular – provide a compelling instrument for
statistical analysis of concepts, realised in language. Therefore, investigations
on lexical semantics can be useful for other interdisciplinary research, such as
Computational Sociology.
Here, we are concerned with analysing the behaviour as well as the internal
“cognitive model” of semantic representations with a focus on multi-modal input.
Symbol grounding [Harnad, 1990], or the hypothesis that human semantic
representation depends on sensory-motor experience, has been given much attention in
the past decades. Dual coding theory [Bucci, 1985], the idea in cognitive science
that meaning might be represented in the human brain in multiple modalities, has
inspired much research in NLP and Computational Linguistics.
Most multi-modal research focuses on engineering-type evaluation tasks (and
therefore on models which perform well on them) which involve direct visual input,
such as Visual Question Answering (VQA) [Antol et al., 2015, Srivastava and
Salakhutdinov, 2012, Kiros et al., 2014, Socher et al., 2014, Tsai et al., 2019, Lu
et al., 2019, Su et al., 2019, Majumdar et al., 2020]. These are usually
referential-type tasks, in which case the usefulness of visual input is not surprising. Moreover,
evaluating solely on downstream tasks is prone to exhibit spurious correlations.
Unlike most studies, this work investigates visual information’s contribution
to semantic meaning representations when the evaluation tasks involve no direct
visual input. Instead of evaluating on referential type tasks like VQA, we are
interested in the impact of visual information in higher level word and concept
representations. A minority of papers have exploited visual information for mean-
ing representations when the evaluation tasks involve no direct visual input, such
as semantic similarity [Bruni et al., 2014, Kiela and Bottou, 2014, Kiela et al.,
2016, Lazaridou et al., 2015, Davis et al., 2019, Lin and Parikh, 2015, Vendrov
et al., 2015].
There are three main issues in the literature that we address in this
thesis.
Problems of Intrinsic Analyses As a start, we focus on two types of intrinsic
evaluation: human judgement based semantic tasks and brain activity prediction.
The type of evaluation the community uses has an effect on the model selection
process, hence the questions we ask will influence the future direction of model
development as well. Working on intrinsic evaluations, such as semantic similarity,
can contribute positively both to basic research questions about linguistic
phenomena and to developing higher quality end-user applications, by recognising
potential pitfalls. However, due to the ambiguous notion of similarity and the
low inter-annotator agreement, it is difficult to draw robust conclusions on the
differences between models based solely on this type of evaluation [Batchkarov
et al., 2016]. To overcome this problem, our first key contribution is a
comprehensive analysis of multi-modal models. We perform large-scale evaluations on
different data sources, model architectures and modalities.
Efficiency of Models and Data Most multi-modal models require huge image
and text training datasets. Our second key contribution is the proposal and
analysis of a new type of hybrid modality based on small, structured data, lying
in-between visual and linguistic modalities.
Lack of Model Transparency A further crucial issue with embeddings (and
recent ML models in general) is that the learnt representations are not inter-
pretable for humans. Thus, we are prone to overlook spurious correlations, or
data and model biases [Kaur et al., 2020, Hooker, 2021, Bender et al., 2021].
To mitigate this problem, the third main proposal of this work is a framework
of transparent and interpretable analyses of semantic space representations. In-
terpretability has gained traction in AI in the past few years not just for down-
stream performance but also for AI Safety and Fairness reasons [Barocas et al.,
2019, Bender et al., 2021, Kaur et al., 2020]. We introduce various quantitative
and qualitative analyses to understand how our models conceptualise the “world”,
which depends on model architecture, data source and modality.
To address the above problems, we propose, and present proof-of-concept
studies of a three-pillar analysis framework of multi-modal embeddings:
1. Black-Box Performance testing – How do representations of different
modalities perform on intrinsic evaluation tasks? We extend previous work
with the following:
(a) Comprehensive analysis of models across data sources, machine
learning models and modalities,
(b) New modality based on small data, lying in-between low level visual
information and high level linguistic / symbolic data, and
(c) Efficiency analyses, controlling for data size, data distribution and
model size.
2. Transparency testing – Qualitative / Quantitative structural
analysis: How do representations of different modalities differ? An analysis of
concept structures captured by modalities.
3. Transparency testing – Independence analysis: An information-theory
based analysis to measure how much representations differ.
This thesis was inspired by a series of previous works, which are detailed in
Chapter 2 where we introduce the background. To highlight a few influential
related works: [Kiela et al., 2014] introduced enlightening
analyses of multi-modal embeddings. They showcased how image dispersion affects
multi-modal embedding performance, and how word concreteness is a relevant
factor. Our methodology of structural embedding analysis was partially inspired
by [Minnema and Herbelot, 2019], who used various metrics to measure the
similarity between a linguistic embedding space and a brain image embedding space.
Our theoretical semantic embedding framework generalises Katrin Erk’s
definition of distributional models [Erk, 2016]. Our information-theoretical framework
and experiments were supported by the work of Zoltán Szabó [Szabó, 2014], who
kindly offered consultation on the theoretical background.
Understanding how machine learning models “understand” concepts is a
crucial step towards managing model and data bias, which impacts billions of users
who interact daily with AI models on social media platforms, or in judicial
or health care practices. We hope that our methodology for analysing model
conceptualisation will inspire other researchers to release more interpretable model
analyses, thereby contributing to safer and fairer AI system development.
1.1 Key Contributions
The contributions of this thesis can be summarised in three key points:
I. A comprehensive analysis of multi-modal models – involving visual
and linguistic data – across data sources, model architectures and modali-
ties.
II. Introduction and analysis of a new type of modality: a visually struc-
tured, text based semantic representation, lying in-between visual and lin-
guistic modalities.
III. Proposing and presenting proof-of-concept studies of a transparent, inter-
pretable semantic space analysis framework.
The course of this research and the design of the experiments were guided by the
pursuit of answers to the following questions:
1. How does the source of images affect the performance of multi-modal
semantic representations?
2. Does the number of images have an impact on performance?
3. Do previous findings on complementary visual information scale to different
types and sizes of linguistic corpora?
4. Does visual data bolster performance only because we add more data or
does it convey complementary quality information compared to a higher
quantity of text?
(a) Can we achieve comparable performance using small-data if it comes
from the right data distribution?
5. Can we move beyond performance evaluation? Are there any emergent
concepts in embeddings? Can we quantify the difference between the concept
structures of semantic spaces?
6. Can we quantify the difference between semantic spaces, based on the useful
information they contribute to the meaning representation?
1.2 Thesis Outline
Chapter 2 gives an overview of the background and literature in Distributional
Semantics, Computer Vision and multi-modal semantics, and also introduces our
framework of transparency analysis. Details and discussion of the data sources
and evaluation methodology are presented in Chapter 3.
Chapters 4, 5 and 6 involve implementation details and results of experiments,
designed to answer the research questions from Section 1.1. Chapters 4 and 5
implement our first and second key contributions: I. comprehensive analysis
of multi-modal models and II. introduction and analysis of a new type of
modality. The experiments in Chapter 4 focus on Questions 1, 2 and 3. Section 4.1
addresses Questions 1 and 2, evaluating different visual data sources for semantics
in terms of the impact of image quantity and quality. Section 4.2 introduces a
novel structured embedding as a new modality. Section 4.3 presents a broader study
which, tackling Question 3, performs a wide range of evaluations
across several different visual, linguistic and multi-modal models. As an outlook
on the application of word embedding initialisations, we investigate a textual
entailment task in Section 4.4. Chapter 5 provides a more in-depth investigation
of the effects of data size and frequency distributions in linguistic and multi-modal
embeddings (Questions 4 and 4a).
Finally, in Chapter 6 we implement the third key contribution of this thesis:
III. a transparent, interpretable semantic space analysis. We address
Question 5, where we employ qualitative structural analysis of semantic spaces, and
Question 6 by presenting a method for estimating the information different
modalities add to the linguistic representations.
A summary, conclusions and ideas for future directions based on this research
are discussed in Chapter 7. Appendices A, B, C, D, E and F contain extra results,
which were omitted from the main text for space and readability considerations.
1.3 Publications
Content involving thesis material:
• Anita L. Verő and Ann Copestake. Efficient Multi-Modal Embeddings from
Structured Data. arXiv preprint arXiv:2110.02577, 2021.
• Douwe Kiela, Anita L. Verő, and Stephen Clark. Comparing Data Sources
and Architectures for Deep Visual Representation Learning in Semantics.
In Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP-16), 2016.
Thesis-related content:
• Christopher Davis, Luana Bulat, Anita L. Verő, and Ekaterina Shutova.
Deconstructing multimodality: visual properties and visual context in
human semantic processing. In Proceedings of the Eighth Joint Conference on
Lexical and Computational Semantics (*SEM 2019), pages 118–124, 2019.
• Christopher Davis, Luana Bulat, Anita L. Verő and Ekaterina Shutova.
Modelling Visual Properties and Visual Context in Multimodal Semantics.
In Workshop on Visually Grounded Interaction and Language, NIPS,
Montreal, Canada, 2018.
Not directly thesis-related content:
• Douwe Kiela, Luana Bulat, Anita L. Verő and Stephen Clark. Virtual
Embodiment: A Scalable Long-Term Strategy for Artificial Intelligence
Research. In NIPS Workshop on Machine Intelligence (MAIN), Barcelona,
Spain, 2016.
Software
• EmbEval: The implementation of the transparent evaluation methodology and
the majority of experiments is available as open source software
(https://github.com/anitavero/embeval). This code was used in Chapters 4, 5
and 6. Details on its usage can be found in the documentation
(https://anitavero.github.io/embeval/).
• MMFeat - Flickr API: I implemented a Flickr API and some experiment
and demo code in the MMFeat software (https://github.com/douwekiela/mmfeat),
which is used in Chapter 4 (see
https://github.com/anitavero/mmfeat/commits?author=anitavero).
• Concept Game: A two-player, collaborative gamified data collection app
(http://concept-guessing-game.com/; see Section 6.2.2.3). This code is also
publicly available on GitHub (https://github.com/anitavero/concept_game).
Chapter 2
Background and Motivation for
Interpretable Multi-Modal Word
Embedding Analysis
In this chapter we place the thesis into the context of previous work. We explain
the motivation for our intrinsic and information-theory based analyses. Further-
more, we introduce the framework and notation used throughout the thesis.
2.1 What does Word Meaning Mean, and Why
should We Care?
2.1.1 Philosophical Accounts
Traditionally, word semantics has been discussed in the framework of lexical
competence. According to the externalist view, words have an objective meaning
known by a “perfect competent speaker”; people, however, are imperfect speakers,
hence the differences between our levels of understanding [Kripke, 1972, Putnam,
1970]. This view has been criticised by many, including Chomsky [Chomsky
et al., 2000]. The most notable criticism came from the contextualist and
pragmatic point of view. Similarly to Wittgenstein [Wittgenstein, 1953, p. 20], it
identifies meaning with use, and highlights the contextual nature of word
meanings [Grice, 1975, Searle, 1985].
To demonstrate the two opposing positions, take the following example
sentence: “There is milk in the fridge”. According to the contextualists, in the
context of morning breakfast it will be considered true if there is a carton of milk
in the fridge and false if there is a patch of milk on a tray in the fridge, whereas
in the context of cleaning up the kitchen the truth conditions are reversed [Gasparri
and Marconi, 2021]. The externalist could object by challenging the
contextualist’s intuitions about truth conditions. “There is milk in the fridge”, she could
argue, is true if and only if there is a certain amount of milk in the fridge (a few
molecules will do¹). The contextualist’s reply is that, in fact, neither the speaker
nor the interpreter is aware of such alleged literal content, if there is even such a thing.
¹ This example was given in [Gasparri and Marconi, 2021]; however, we would point out
that there is no such thing as “milk molecules” [Lucey et al., 2017], which supports scepticism
towards an extreme externalist approach.
A cognitive approach characterizes Marconi’s [Marconi, 1997] account of
lexical semantic competence. In his view, lexical competence has two aspects: an
inferential aspect, underlying performances such as semantically based inference
and the command of synonymy, hyponymy and other semantic relations; and a
referential aspect, which is in charge of performances such as naming (e.g.,
calling a horse “horse”) and application (e.g., answering the question “Are there
any spoons in the drawer?”). According to his theory of individual competence,
communication depends both on the uniformity of cognitive interactions with the
external world and on communal norms concerning the use of language, together
with speakers’ deferential attitude toward semantic authorities.
Recanati [Recanati, 2004] has extended the contextualised view by including
the history of a word’s meaning. He says a word has a “semantic potential”,
defined as the collection of past uses of the word, which relates source situations (i.e.,
the circumstances in which a speaker has used a word) to target situations (i.e.,
candidate occasions of application of the word).
2.1.2 (Cognitive) Linguistics and Neuroimaging
At the beginning of the 1970s a new cognitive theory of the mental
representation of categories surfaced [Mervis and Rosch, 1981]. It put forward the notion
of prototypes, which revolutionized the existing approaches to category concepts
and was a leading force behind the birth of cognitive linguistics. Later a whole
paradigm called Simulationism emerged, with a body of evidence linking the
mental realisation of concepts to sensory-motor activation. For example, listening
to sentences that describe actions performed with the mouth, hand, or leg
activates the visuomotor circuits [Tettamanti et al., 2005], and odor-related words
(“jasmine”, “garlic”, “cinnamon”) differentially activate the primary olfactory
cortex [González et al., 2006]. All this led to theories such as the dual coding
hypothesis, which is related to the philosophical problem of symbol grounding,
discussed in detail in Section 2.4.
Distributional Hypothesis According to the summary of [Lenci, 2008],
although the linguistic context appears as one of the ingredients of human
conceptualization, the emphasis of cognitive semantics is on an intrinsically embodied
conceptual representation of aspects of the world, grounded in action and
perception systems. On the other hand, the Contextual Hypothesis in psychology,
arguing for a “usage-based” characterization of semantic representations, steered
linguistics towards statistical corpus analysis. According to Lenci, this view is
related to Wittgenstein’s claim that “the meaning of a word is its use in the
language”. This led to the Distributional Hypothesis (DH), according to which
at least certain aspects of the meaning of lexical expressions depend on their
distributional properties, and so does the semantic similarity between two such
expressions. Or, in the spirit of Firth [Firth, 1957], “Words that occur in similar
contexts tend to have similar meanings” [Turney, 2010].
There is increasing evidence for the “strong” version of the DH, which does
not merely assume a correlation between semantic content and linguistic
distributions. This version is a cognitive hypothesis stating that repeated
encounters with words in different linguistic contexts eventually lead to the
formation of a contextual representation, that is, an abstract characterization
of the most significant contexts with which the word is used [Lenci, 2008].
Baroni and Lenci found important similarities between distributional models and
human-generated properties, but also striking differences [Baroni and Lenci,
2008]. Statistical representations of word meaning have since become a
prevalent approach, forming the basis of computational linguistics. [Boleda,
2020] summarised the reasons behind this in three factors. First,
distributional representations are learnt from natural language data, scaling
up to very large vocabularies, thus providing a coherent system where
systematic explorations are possible. Second, recent models involve high
dimensional representations. Third, they use continuous values and similarity
metrics. Both of the latter allow for rich and nuanced information to be
encoded and analysed.
Concepts, words and senses In philosophy, historically there have been many
different definitions of the term concept [Margolis and Laurence, 2021]. We use
an empiricist, embodied definition which treats concepts as internal human
cognitive knowledge representations, which probably involve multi-modal,
sensory-based representation, as mentioned earlier. Words are elements of a language
with meaning. However, human language is ambiguous, so many words can be
interpreted in multiple ways depending on the context in which they occur. For
instance, consider the following sentences (from [Navigli, 2009]):
(a) I calculated the interest rate.
(b) They have an interest in music.
The occurrences of the word interest in the two sentences clearly denote different
meanings: financial earnings and passion, respectively. These different meanings
of a word are called word senses, which are abstractions over word meanings
[Lenci, 2008].
Neuroimaging The development of neuroimaging techniques such as PET,
fMRI and ERP has provided further means to adjudicate hypotheses about lexical
semantic processes in the brain, which have been studied in relation to
statistical semantic models, e.g. [Mitchell et al., 2008, Pereira et al., 2018,
Handjaras et al., 2016]. Mitchell et al. found a correlation between
distributional models of word meanings and brain imaging representations in
human participants [Mitchell
et al., 2008]. Handjaras et al. found that conceptual knowledge in the human
brain relies on a distributed, modality-independent cortical representation that
integrates the partial category and modality specific information retained at a
regional level [Handjaras et al., 2016]. This thesis also complements standard
semantic evaluations with tests on neuroimaging datasets, introduced in Sec-
tion 3.2.2.
Introducing Model-Concepts In this thesis, similarly to Lenci and Boleda,
we treat distributional semantic models of word meaning as a proxy to
empirically investigate “aggregated meanings”, which are not the semantic model
of any particular individual (and most likely not even of a particular
society). Since human concept representations seem at least partially
perceptual, we focus on multi-modal distributional models involving visual
perceptual data. We start from statistical models of word meaning, but we
proceed towards more in-depth model interpretation analysis. We investigate
whether there are structures in our learnt representations which represent some
kind of conceptualisation of the machine. We call these model-concepts.
Model-concepts are different from human cognition. They are also not directly
word meaning representations, as we are looking for further emerging structures
or clusters. Since we are studying the fusion of linguistic and perceptual
data, model-concepts are assumed to be closer to human concepts than purely
text-based ones. Throughout the thesis we will use “concept” and
“model-concept” interchangeably, as our investigation only involves
model-concepts, not human conceptual representations.
We introduce the history of Distributional Semantic models in more detail in
Section 2.2, visual models from Computer Vision in Section 2.3 and multi-modal
literature in Section 2.4.
2.2 Linguistic Embeddings: From Text to Meaning
This section reviews the history of statistical models of word semantics based on
text corpora.
2.2.1 Distributional Semantics
In Natural Language Processing, word meaning representation models have been
primarily inspired by Firth’s distributional hypothesis [Firth, 1957], saying “Words
that occur in similar contexts tend to have similar meanings” [Turney, 2010]. Con-
temporary corpus-based approaches implement this idea by using vector repre-
sentations of words also known as distributional semantic models or embeddings.
The representation vector of each word can be computed from the co-occurrence
frequencies with other terms in the same context. Here, we give a short overview
of the development of distributional semantic models; for a detailed survey, see
Clark’s book chapter in The Handbook of Contemporary Semantic Theory [Clark,
2015] or a more recent overview of Distributional Models of Word Meaning by
Lenci [Lenci, 2018].
The history of word representations by vectors goes back to Karen Spärck
Jones’ 1967 work in Computational Linguistics, which first used a principled
technique for comparing contexts [Spärck Jones, 1967]. Vector representation was
widely popularised for the document retrieval problem in Information Retrieval
[Schütze et al., 2008]. At the beginning, both the query and the documents were
represented with a “bag of words”, i.e., a vector of word frequencies. This was
a successful model despite the fact that it does not account for word order. To
circumvent bias towards frequent words, weighted versions have been introduced,
such as the term frequency-inverse document frequency (tf-idf) based on the fre-
quency of terms in a document, and the inverse of the number of documents
in which a term occurs. One useful way to think about document vectors is in
terms of a term-document matrix. This way, rows can correspond to document
vectors, whereas columns are word representations. A popular method was to
apply a dimensionality reduction technique on such matrices, such as singular
value decomposition (SVD). The application of SVD to the term-document ma-
trix was introduced by Deerwester et al. [Deerwester et al., 1990], who called
the method Latent Semantic Analysis (LSA). The name comes from the intuition
that LSA teases out a latent meaning from the co-occurrence data, by clustering
words along a small number — typically a few hundred — of semantic, or topical,
dimensions [Turney, 2010].
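The pipeline described above (raw counts, tf-idf weighting, then SVD) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the setup of any cited work: the three-document corpus is invented, idf is taken as the plain logarithm of the inverse document frequency (several variants exist), and rows are documents while columns are terms, following the orientation mentioned in the text.

```python
import numpy as np

# Toy corpus, invented for illustration only.
docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "stocks fell on the market",
]
vocab = sorted({w for d in docs for w in d.split()})
word_index = {w: i for i, w in enumerate(vocab)}

# Count matrix: one row per document, one column per term.
counts = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc.split():
        counts[d, word_index[w]] += 1

# tf-idf: term frequency times log inverse document frequency
# (assumed variant; terms occurring in every document get weight 0).
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / doc_freq)
tfidf = counts * idf

# LSA: keep only the k largest singular values of the weighted matrix.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_vectors = U[:, :k] * s[:k]        # one k-dimensional vector per document
word_vectors = Vt[:k, :].T * s[:k]    # one k-dimensional vector per term
```

In practice k would be a few hundred and the matrix would be sparse, so a truncated sparse solver would replace the dense SVD used here.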
From the term-document matrix we can easily arrive at the concept of a
term-term matrix. Instead of treating the document as the context similar words
co-occur in, we can narrow it down to a smaller window around a word. This
way the elements of the matrix are the frequencies of two words occurring in
the same context window. A popular way to normalise raw frequencies is the
Positive Pointwise Mutual Information (PPMI) of two words $(w_1, w_2)$:
\[ \mathrm{PPMI}(w_1, w_2) = \max\left( \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)},\; 0 \right) \tag{2.1} \]
Applying SVD can also be useful on these types of matrices.
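As a small worked instance of Equation 2.1, the sketch below computes PPMI from a term-term count matrix. It assumes that the probabilities are estimated from the matrix itself (joint counts over the grand total, marginals from row and column sums) and that unseen pairs receive a score of zero; the function name `ppmi` and the toy counts are illustrative only.

```python
import numpy as np

def ppmi(cooc):
    """Positive PMI (base 2) of a term-term co-occurrence count matrix."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    p_joint = cooc / total                             # estimate of P(w1, w2)
    p_row = cooc.sum(axis=1, keepdims=True) / total    # estimate of P(w1)
    p_col = cooc.sum(axis=0, keepdims=True) / total    # estimate of P(w2)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_joint / (p_row * p_col))
    pmi[~np.isfinite(pmi)] = 0.0                       # unseen pairs: log(0) -> 0
    return np.maximum(pmi, 0.0)

# Invented window counts for three words.
cooc = np.array([[0, 4, 0],
                 [4, 0, 1],
                 [0, 1, 0]])
M = ppmi(cooc)
```

For the pair (0, 1) the joint estimate is 0.4 and the product of the marginals is 0.4 × 0.5 = 0.2, so the PMI is log2(2) = 1.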
Representing the meaning of multi-word phrases or sentences still proves to
be a challenging problem. Many researchers have studied compositional semantics
using vector operations on word vectors [Mitchell and Lapata, 2010] or
tensor-based representations [Clark, 2015].
2.2.2 Shallow Networks
Recent research has presented several neural network-based approaches to learn
word vector representations. Such distributed representations have become known
as embeddings. The most well known and widely used models were introduced
by Mikolov et al. [Mikolov et al., 2013a, Mikolov et al., 2013b] and have become
popular as part of the word2vec toolkit. They introduced two models, both con-
sisting of a shallow, two-layer neural network which learns an approximation of
co-occurrence statistics [Levy and Goldberg, 2014b]. They train a neural
network to predict neighbouring words, in doing so learning dense embeddings for
the words. This approach is much faster than SVD and easy to train.
The skip-gram (SG) model [Mikolov et al., 2013b] learns to predict the words
that can occur in the context of a target word. Its objective function is as follows:
\[ \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t) \tag{2.2} \]
where $T$ is the size of the corpus, $c$ is the context window size and $w_i$
is the $i$-th word of the corpus ($1 \le i \le T$).
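The double sum in Equation 2.2 ranges over every corpus position and every non-zero offset inside the window. The sketch below enumerates exactly those (target, context) pairs; the helper name `skipgram_pairs` and the example sentence are hypothetical, and offsets that fall outside the corpus are simply skipped, which is one common convention rather than the only one.

```python
def skipgram_pairs(tokens, c):
    """Return (target, context) pairs for all positions t and offsets -c <= j <= c, j != 0."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the target itself and offsets outside the corpus.
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((target, tokens[t + j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, c=1)
```

Each pair contributes one term log p(context | target) to the objective.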
Let $d$ be the embedding dimension and $V$ the vocabulary. The model learns two
embeddings, or lookup matrices: 1) an input embedding $W \in \mathbb{R}^{|V| \times d}$,
where row $i$ gives the $1 \times d$ embedding $v_i$ for word $w_i$ in the vocabulary; 2) an
output embedding $W' \in \mathbb{R}^{|V| \times d}$, where row $i$ is the $1 \times d$ embedding $v'_i$ for word $w_i$
in $V$. $v_I$ and $v'_O$ are the “input” and “output” vector representations of
the input word $w_I$ and the output word $w_O$, respectively. The
probability of a word occurring in a context is given by the softmax function:
\[ p(w_O \mid w_I) = \frac{\exp(v'_O \cdot v_I)}{\sum_{j=1}^{|V|} \exp(v'_j \cdot v_I)} \tag{2.3} \]
This architecture is illustrated in Figure 2.1.
Because of the denominator term, training this model directly would be com-
putationally infeasible. For this reason Mikolov et al. introduced the trick of
hierarchical softmax and skip-gram with negative sampling (SGNS).
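To make the cost of the denominator concrete, the following sketch evaluates Equation 2.3 directly. It is an illustration under assumptions: the matrices are random rather than trained, the sizes are toy values, and both matrices store one word vector per row. The names `W_in`, `W_out` and `p_context_given_target` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 3                        # toy sizes, chosen for illustration
W_in = rng.normal(size=(vocab_size, d))     # row i: input vector of word i
W_out = rng.normal(size=(vocab_size, d))    # row i: output vector of word i

def p_context_given_target(target_id):
    """Softmax over the whole vocabulary for one input word."""
    scores = W_out @ W_in[target_id]        # dot product with every output vector
    scores = scores - scores.max()          # stabilise the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # normaliser touches all |V| words

probs = p_context_given_target(2)
```

Every single prediction requires a pass over all |V| output vectors, which is what hierarchical softmax and negative sampling avoid.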
Figure 2.1: Skip-gram and CBOW architectures.2
Since we have two embeddings $v_j$ and $v'_j$ for each word $w_j$, we can either
use just $v_j$, or sum or concatenate the two.
If we compute $W W'^{\top}$, we get a matrix $M$, each entry $m_{ij}$ corresponding to
some association between input word $i$ and output word $j$. Levy and Goldberg
[Levy and Goldberg, 2014b] show that skip-gram with negative sampling reaches
its optimum just when this matrix is a shifted version of the PMI matrix:
\[ W W'^{\top} = M^{\mathrm{PMI}} - \log k \tag{2.4} \]
where $k$ is the number of negative samples. Thus, skip-gram is implicitly
factoring a shifted version of the PMI matrix into the two embedding matrices.
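The relation in Equation 2.4 suggests an explicit alternative to training: build the shifted matrix and factorise it directly. The sketch below does this with a positive (clipped) variant and a symmetric split of the singular values, in the spirit of the SVD comparison discussed by Levy and Goldberg. It is not SGNS training itself; the counts are invented, natural-log PMI is assumed, and the names `shifted_ppmi` and `factorise` are hypothetical.

```python
import numpy as np

def shifted_ppmi(cooc, k):
    """max(PMI - log k, 0) with natural-log PMI, from word-context counts."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    expected = cooc.sum(axis=1, keepdims=True) * cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        shifted = np.log(cooc / expected) - np.log(k)
    shifted[~np.isfinite(shifted)] = 0.0      # unseen pairs are set to zero
    return np.maximum(shifted, 0.0)

def factorise(M, d):
    """Rank-d factorisation M ~ W @ W_ctx.T, splitting singular values evenly."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:d])
    return U[:, :d] * root, Vt[:d, :].T * root

# Invented counts; k plays the role of the number of negative samples.
cooc = np.array([[0, 8, 0, 0],
                 [8, 0, 0, 0],
                 [0, 0, 0, 2],
                 [0, 0, 2, 0]])
M = shifted_ppmi(cooc, k=2)
W, W_ctx = factorise(M, d=2)
```

With d equal to the full rank the product of the two factors reproduces M exactly; smaller d gives the best low-rank approximation in the least-squares sense.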
In the other model of Mikolov et al., called Continuous Bag of Words (CBOW)
[Mikolov et al., 2013a], a similar training happens, except instead of predicting the
context around a word in a window, the objective is to predict the middle word
in the context window. The two model architectures are illustrated in Figure 2.1.
The Global Vectors model (GloVe) [Pennington et al., 2014] aims to learn a
version of the PMI matrix which is weighted toward more frequent word-context
pairs. Its authors theorise that, because the model can be optimised directly,
as opposed to the on-line training of SGNS, it introduces more global frequency
information. However, Levy et al. showed that, after tuning hyperparameters, it
does not produce any performance gain [Levy et al., 2015].
Other versions of skip-gram have been proposed, such as a dependency-based
word embedding [Levy and Goldberg, 2014a], where, instead of a simple sliding
window, the context of each word is taken from its dependency graph.
2https://web.stanford.edu/~jurafsky/li15/lec3.vector.pdf
Deep Recurrent Neural Networks [Bengio et al., 2003, Bahdanau et al., 2015,
Cho et al., 2014, Kiros et al., 2015, Wang and Jiang, 2015, Rocktäschel et al.,
2016] and Transformers with self-attention [Peters et al., 2018, Radford et al.,
2018, Devlin et al., 2019, Yang et al., 2019] have been at the forefront of
NLP research in the past few years. They achieve state-of-the-art performance
on various sentence-level tasks, included in the GLUE multi-task benchmark for
Natural Language Understanding [Wang et al., 2018a]. The tasks involve textual
entailment, sentiment analysis, paraphrasing and question answering. Since the
main objectives of this thesis were creating and testing a framework for
comprehensive, transparent and interpretable semantic analysis, we use the
smallest possible models which allow us to incorporate visual embeddings, thus
studying multi-modality. Therefore, in this work we apply shallow network type
models, as visual embeddings fit into them more easily than into count-based
models, while being the simplest neural models. Because these models have few
parameters, they are also much easier to train than bigger neural models,
allowing us to run comprehensive studies across several datasets and model
types. Throughout this work we use SGNS and FastText, which uses the CBOW
model, with versions extended with subword information [Mikolov et al., 2018].
Furthermore, we use different versions of PMI in Section 6.4 for analysing our
training corpora. Applying our framework to the latest transformer-type models
would be a straightforward extension of this thesis. Although running
broad-scale analysis is much more challenging using these large models, it
would be interesting to see how attention affects multi-modal fusion.
2.3 Visual Embeddings: From Images to Meaning
Our research focuses on the most efficient fusion of vision and language for
meaning representations. Thus we review the basics of Computer Vision
approaches for encoding images, as well as the state-of-the-art models we rely
on (Section 2.3.1).
Similar to language embeddings, representing the content of an image or a
video also involves producing a vector representation. This is expected to capture
a compressed representation of interesting features over the high dimensional, raw
pixel input that corresponds to human semantic constructs. This can include low
level features such as edges and corners, or higher level ones such as objects in an
image or temporal patterns in a video. The selection of these features, however,
is not a trivial task. Traditional Computer Vision methods applied hand-crafted
features similar to the above mentioned edge and corner detectors from which
they could build a Bag-of-words type model [Sivic and Zisserman, 2003].
Neural Networks revolutionized this area as well with the introduction of
Convolutional Neural Networks (CNNs). These are biologically inspired net-
works motivated by the visual cortex [Lecun et al., 1998]. They are capable of
learning high level features gradually by exploiting a deep structure where every
layer learns a higher abstraction based on the lower ones. Such networks can
be trained for many different tasks such as object classification [Simonyan and
Zisserman, 2014, Krizhevsky et al., 2012, Szegedy et al., 2015, He et al., 2016],
image segmentation [Kendall et al., 2017] or action recognition [Sharma et al.,
2015]. The learned vectors proved to be a good basis for learning high performing
image embeddings [Kiela and Bottou, 2014].
The core building block of such networks is the convolutional layer. This
refers to the mathematical convolution of a filter function across the pixels of
an image. In traditional Computer Vision this filter function (or kernel) was
crafted manually, whereas in a CNN it is learned from data. Down-sampling
and learning compressed local (globally invariant) features is done by the pooling
layers. CNNs usually involve fully connected layers on the top and activation
functions similar to other neural networks. They are usually trained with an
objective for a supervised task, such as object classification.
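The two building blocks described above, a filter slid across the pixels and a pooling step that down-samples the result, can be sketched without any deep learning library. This is an illustration under assumptions: the filter is a fixed, hand-crafted vertical edge detector (in a CNN its weights would be learnt), the operation is implemented as cross-correlation as is customary in CNN libraries, and the function names and the toy image are hypothetical.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one filter."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                                       # dark left half, bright right half
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)           # hand-crafted vertical edge filter
features = np.maximum(conv2d(image, edge_kernel), 0.0)   # ReLU activation
pooled = max_pool(features)
```

The feature map responds only where the window straddles the dark-to-bright boundary, and pooling keeps that response while halving the resolution.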
Figure 2.2 illustrates the architecture of LeNet [LeCun et al., 1989], the first
CNN successfully trained by back-propagation to classify hand-written digits. It
performed better than manual coefficient design, and was suited to a broader
range of image recognition problems. Thus, it became the foundation of modern
Computer Vision.
Figure 2.2: Architecture of the LeNet-5 for digit recognition. Each plane is a
feature map i.e. a set of units whose weights are constrained to be identical.
2.3.1 CNN Models
In our study, CNN models serve the role of encoding images into visual word
semantic vectors. We used four architectures which differ in size and structure.
See Table 2.1 for an overview.
AlexNet The network by Krizhevsky et al. [Krizhevsky et al., 2012] has the
following architecture: first, there are five convolutional layers, followed
by two fully-connected layers, where the final layer is fed into a softmax which
produces a distribution over the class labels. All layers apply rectified linear units
(ReLUs) [Nair and Hinton, 2010] and use dropout for regularization [Hinton et al.,
2012]. This network won the ILSVRC 2012 ImageNet classification challenge.
GoogLeNet The ILSVRC 2014 challenge winning GoogLeNet [Szegedy et al.,
2015] uses “inception modules” as a network-in-network method [Lin et al., 2013]
for enhancing model discriminability for local patches within the receptive field.
It uses much smaller receptive fields and explicitly focuses on efficiency: while it
is much deeper than AlexNet, it has fewer parameters. Its architecture consists
of two convolutional layers, followed by inception layers that culminate in an
average pooling layer that feeds into the softmax decision. That is, it has no fully
connected layers. Dropout is only applied on the final layer. All connections use
rectified units.
                         AlexNet   GoogLeNet         VGGNet   ResNet
ILSVRC winner            2012      2014              2015     2015
#Layers                  7         22                19       152
#Parameters (million)    ∼60       ∼6.7              ∼144     ∼6.8
Receptive field size     11×11     1×1, 3×3, 5×5     3×3      3×3
Fully connected layers   Yes       No                Yes      Yes
Table 2.1: Network architectures. Layer counts only include layers with
parameters.
VGGNet The ILSVRC 2015 ImageNet classification challenge was won by VG-
GNet [Simonyan and Zisserman, 2014]. Like GoogLeNet, it is much deeper than
AlexNet and uses smaller receptive fields. It has many more parameters than the
other networks. It consists of a series of convolutional layers followed by the fully
connected ones. All layers are rectified and dropout is applied to the first two
fully connected layers.
ResNet ResNet [He et al., 2016] revolutionized the CNN architectural race
by introducing the concept of residual learning in CNNs and devising an
efficient methodology for training deep nets. He et al. proposed a 152-layer
deep CNN, which won the ILSVRC 2015 competition. ResNet, which was 20 and 8
times deeper than AlexNet and VGG respectively, showed less computational
complexity than previously proposed nets. They empirically showed that ResNets
with 50/101/152 layers have lower error on the image classification task than a
34-layer plain net.
These networks were selected because they are very well-known in the Computer
Vision community. They exhibit interesting qualitative differences in terms
of their depth (i.e., the number of layers), the number of parameters, regulariza-
tion methods and the use of fully connected layers. They have all been winning
network architectures in the ILSVRC ImageNet classification challenges3.
3https://image-net.org/challenges/LSVRC/
2.4 Multi-modal Semantics
2.4.1 Symbol Grounding
Despite their undeniable success, textual embeddings have their own limitations
regarding the grounding of meaning to the outside world, often referred to as
Harnad’s symbol grounding problem [Harnad, 1990]. Similarly, Computer Vision
research has reached a point where leveraging non-visual common sense knowledge
is necessary for further improvement even on purely vision-based applications. This
is motivated by an insight from cognitive science (Section 2.1.2): the human
semantic representation of symbols (e.g., words or objects) is based on multi-
modal sensory inputs perceived on a lifelong basis [Roy, 2005].
When it comes to applications and models the question arises: What do we
mean by grounding in practice? In what way can multi-modal data contribute to
meaning representations? We can distinguish between two main approaches for
grounding:
Referential grounding refers to the task of determining the referent that a word
denotes in the context of the other modality (e.g., a specific object in an image).
The core issue here is finding a mapping between the two spaces [Lazaridou et al.,
2016].
In contrast, representational grounding addresses the problem of multi-modal
semantics: representing the grounded meaning of a word in the sense of fusing
different modalities into one, richer semantic representation [Bruni et al., 2014].
While all these results are promising, some fundamental questions are still
unexplored.
Non-Visual Tasks Most work focuses on evaluation tasks (and therefore on
models which perform well on them) which involve direct visual input. These are
usually referential type tasks such as Visual Question Answering (VQA) [Srivas-
tava and Salakhutdinov, 2012, Kiros et al., 2014, Socher et al., 2014, Tsai et al.,
2019, Lu et al., 2019, Su et al., 2019, Majumdar et al., 2020]. In these cases the
usefulness of visual input is not surprising. Fewer papers have exploited visual
information for representational grounding, when the evaluation tasks involve no
direct visual input, such as semantic similarity [Bruni et al., 2014, Kiela and
Bottou, 2014, Kiela et al., 2016, Lazaridou et al., 2015, Davis et al., 2019, Lin
and Parikh, 2015, Vendrov et al., 2015]. Lin and Parikh [Lin and Parikh, 2015]
introduced a fill-in-the-blank task, which, however, relies on abstract images. A
further interesting proposal relates to the so-called order-embeddings, a general
hierarchical framework for hypernymy, textual entailment, and image captioning
[Vendrov et al., 2015]. However, it still does not involve a thorough investigation
of multi-modal fusion possibilities. Some papers including [Kiela and Bottou,
2014, Kiela et al., 2016, Lazaridou et al., 2015, Davis et al., 2019] perform in-
trinsic analysis of multi-modal embeddings. However, the reasons for the impact
of visual information are not well understood, for we see only correlations on
intrinsic evaluation tasks.
This work investigates visual information’s contribution to meaning represen-
tations on evaluation tasks involving no direct visual input. We aim to showcase
a proof-of-concept framework for deeper analysis of unsupervised multi-modal
representations. We study the concepts which emerge in grounded meaning rep-
resentations.
Cost of Data All the mentioned tasks require huge image datasets with ex-
pensive human annotation. In the case of multi-modal tasks these annotations
are even more difficult to acquire, since annotating combinations of texts and
images/videos can be even more complicated and time consuming than in the
uni-modal cases.
We try to circumvent the problem of cost by studying model and data size
efficiency (introduced in Section 2.7.2) as well as alternatives for new modalities
based on small data (Section 2.5).
2.4.2 Early-, Late- and Mid-fusion
In the literature, we can find three ways for performing the fusion of textual and
perceptual information:
• In early fusion, one learns a joint representation from the two spaces, then
computes a function for the specific task (e.g., cosine distance for measuring
semantic relatedness) [Lazaridou et al., 2015, Kottur et al., 2015].
• Mid-fusion techniques learn separate representations for each modality,
then combine them into a multi-modal representation, and finally compute
the function for the task [Kiela et al., 2014].
• Late fusion methods also learn uni-modal representations separately, then
compute a function for each modality individually, and combine function
outputs at the end [Silberer and Lapata, 2014].
Figure 2.3 illustrates the three types of fusion techniques. In this work we focus
on mid-fusion based models since they allow us to study the information preserved
in the individual modalities.
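The difference between mid- and late fusion can be made concrete with cosine similarity as the task function. The sketch below is illustrative only: the embeddings are random stand-ins, concatenation of L2-normalised vectors is assumed as the combination step for mid-fusion, and an unweighted mean is assumed for combining the outputs in late fusion. Early fusion is omitted because it requires learning a joint representation.

```python
import numpy as np

def l2_normalise(E):
    """Scale every row of E to unit Euclidean length."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
E_text = rng.normal(size=(4, 6))     # stand-in textual embeddings for 4 words
E_image = rng.normal(size=(4, 3))    # stand-in visual embeddings for the same words

# Mid-fusion: combine the uni-modal representations, then apply the task function.
E_mid = np.hstack([l2_normalise(E_text), l2_normalise(E_image)])
sim_mid = cosine(E_mid[0], E_mid[1])

# Late fusion: apply the task function per modality, then combine the outputs.
sim_late = 0.5 * (cosine(E_text[0], E_text[1]) + cosine(E_image[0], E_image[1]))
```

Under these particular assumptions the two scores coincide, because the cosine of two concatenations of unit vectors is the mean of the per-modality cosines. They diverge as soon as the modalities are weighted unequally or the fused vectors are transformed before the task function is applied.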
Figure 2.3: Fusion methods for combining textual and perceptual information. V
and W are representations learnt from either Text or Images. f is a function that
fuses two representations in Early and Middle fusion. In Late fusion f combines
the outputs of functions g which embed uni-modal data. (Figure is borrowed
from the “Multimodal Learning and Reasoning” ACL 2016 tutorial4.)
2.4.3 Multi-modal RNNs and Transformers
Neural networks and recurrent networks have been used on multi-modal input
since they became popular, even going back to Boltzmann machines [Srivastava and
Salakhutdinov, 2012, Kiros et al., 2014]. They were mainly tested on image
retrieval and caption generation tasks. Architectures, such as Tree RNNs have
also been applied to cross-modal tasks [Socher et al., 2014].
The latest NLP models have also inspired the creation of new multi-modal
representations. Tsai et al. [Tsai et al., 2019] developed a multi-modal Trans-
former model using cross-modal attention and tested it on sentiment analysis
tasks in videos. Lu et al. [Lu et al., 2019] created ViLBERT, a multi-modal
model based on BERT. They pre-trained it on the Conceptual Captions dataset and
then transferred it to multiple vision-and-language tasks: visual question
answering, visual common-sense reasoning, referring expressions, and
caption-based image retrieval.
4http://multimodalnlp.github.io/mlr_tutorial.pdf
2.5 Structured Embeddings: Motivation for a New Modality
The multi-modal framework we introduce in this thesis can be applied to any
modality (such as text, image, video, audio). In the experimental part of this
work we focus on fusing linguistic and visual information. As we saw in the
previous section, ample research has exploited large visual datasets and CNN
models with an increasingly large number of parameters. This is a fairly
expensive way of injecting visual information into meaning representations.
The second key contribution of this thesis is thoroughly exploring a structured
visual dataset, called Visual Genome [Krishna et al., 2016], and the way it can
enrich meaning representations. Visual Genome contains images with bounding
box annotations as well as text annotation in a graph structure (it is detailed,
among all the other datasets we use, in Section 3.1). This is beneficial
for two reasons. First, structured data can serve as a bridge over the semantic
gap between low level image data and high level symbolic information in text.
Secondly, it can provide a small data alternative to big data driven models, which
could become the basis of essential tools in situations where a huge amount of
text is not available, but where more structured data could be easier to collect.
By exploiting this textual dataset based on a visual structure, this work in-
troduces a new type of embedding, which we consider as a new, hybrid modality.
In the next section we introduce our general framework of modalities. The new
embedding modality called Structured Embeddings will be introduced in
Section 2.6.1. The details of its creation are explained in Section 4.2.
2.6 Generalisation of Embeddings: Proposed Framework and Formalism
In this work we use a general notion of Embedding, which refers to a vector space
representation of word meanings. The weights of each vector, however, can be
set by any machine learning algorithm, trained on any data type, such as text,
images, sound, structured datasets etc. The only criterion for calling a vector
space a word embedding space is that we find an interpretation of the dataset
where it represents words.
We formally define Semantic Embedding models as tuples of their relevant
parameters. We generalise Katrin Erk’s definition of distributional models [Erk,
2016] to include word representations based on other modalities as well. We
denote modality by $m \in \{L, V, S\}$, which can take the value of linguistic $L$,
visual $V$ or structural $S$. The parameters of a semantic embedding model of
modality $m$ are the following: a set $T$ of target words that receive vector
representations, a set $O_m$ of observable context items in a dataset $D_m$, an
extraction function $X_m$ which chooses relevant contexts in which to look for
context items, and a mapping function $A_m$, which maps from target and context
items to a $d_m$-dimensional space $\mathbb{R}^{d_m}$. The mapping for all target
elements is represented by an Embedding matrix $E_m \in \mathbb{R}^{|T| \times d_m}$.
$T$ is an arbitrary set of words and $D_m$ is a set of data items. $D_m$ includes
target representations $r \in D_m$ that stand in a relation $r \sim t$ to target
elements $t \in T$. $O_m$ gives all the potential target contexts in the dataset:
$O_m : T \to \mathcal{P}(D_m)$, $O_m(t) = \{U \subset D_m \mid \exists\, r \in U,\ r \sim t\}$,
where $\mathcal{P}$ is the power set. The extraction function $X_m$ returns
“relevant” context items from $O_m$ for each target element from $T$; that is,
it returns a mapping from target/context item pairs to numbers in $\mathbb{N}$,
representing a relevance score of context pairs:
$X_m : T \to (O_m(T) \to \mathbb{N})$. We use “relevance” here in a fairly
general sense: it can for example be co-occurrence counts within a text window,
image search engine result relevance, or scores based on other prior
assumptions about relevancy in the dataset, such as graph neighbourhood, which
we will exploit for structured data. The mapping function $A_m$ is a
combination of a (usually machine learning) algorithm and any further pre- and
post-processing method which together take the output of $X_m$ and turn it into
a mapping from targets to real-valued vectors,
$A_m : (T, D_m, X_m) \to (T \to \mathbb{R}^{d_m})$. The output mapping is
represented by a matrix, called an Embedding $E_m \in \mathbb{R}^{|T| \times d_m}$,
which is a vector space consisting of vector representations for each target
word in $T$.
In summary, we define Semantic Embedding models for a modality $m$ as tuples
comprising the sets of target elements, observable context items, the dataset, the
extraction function, the mapping function, and the embedding dimensionality:
\[ S_m = \langle T, O_m, D_m, X_m, A_m, d_m \rangle \tag{2.5} \]
The output of the model is the learnt embedding $E_m$.
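One way to read the tuple in Equation 2.5 is as a record whose fields are data and functions. The sketch below is a deliberately simplified rendering, not the implementation used in this thesis: all class, field and function names are hypothetical, the context set is reduced to a flat list of items rather than subsets of the dataset, and the mapping function receives precomputed relevance scores instead of the extraction function itself. The toy instance is a linguistic model whose mapping simply lays out window co-occurrence counts as vectors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Sequence
import numpy as np

@dataclass
class SemanticEmbeddingModel:
    targets: Sequence[str]                                  # T
    contexts: Callable[[str], Sequence[Hashable]]           # O_m (simplified)
    dataset: Sequence[Hashable]                             # D_m
    extract: Callable[[str], Dict[Hashable, int]]           # X_m: relevance scores
    mapping: Callable[[Sequence[str], Dict[str, Dict[Hashable, int]]], np.ndarray]  # A_m
    dim: int                                                # d_m

    def embed(self) -> np.ndarray:
        """Run extraction for every target, then map the scores to E_m."""
        scores = {t: self.extract(t) for t in self.targets}
        E = self.mapping(self.targets, scores)
        assert E.shape == (len(self.targets), self.dim)
        return E

# Toy linguistic instance: contexts are neighbours within a one-word window.
corpus = "the cat sat on the mat".split()
context_vocab = sorted(set(corpus))

def window_counts(target, window=1):
    counts = {}
    for i, w in enumerate(corpus):
        if w != target:
            continue
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                counts[corpus[j]] = counts.get(corpus[j], 0) + 1
    return counts

def count_vectors(targets, scores):
    E = np.zeros((len(targets), len(context_vocab)))
    for row, t in enumerate(targets):
        for ctx, n in scores[t].items():
            E[row, context_vocab.index(ctx)] = n
    return E

model = SemanticEmbeddingModel(
    targets=["cat", "mat"],
    contexts=lambda t: context_vocab,
    dataset=corpus,
    extract=window_counts,
    mapping=count_vectors,
    dim=len(context_vocab),
)
E_L = model.embed()
```

Swapping `extract` and `mapping` for an image search and a CNN with an aggregation step would give a visual instance over the same target set.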
For example, a Google Image based Semantic Embedding model would have
the following parameters:
\[ S_G = \langle T, O_G, D_G, X_G, A_G, d_G \rangle \tag{2.6} \]
where $T$ is our target vocabulary and $D_G$ is the dataset consisting of words
from $T$ and Google Image Search results for each $t \in T$. $O_G$ are all the
potential subsets of image results for a given word $t$ in Google Image Search.
For example, we can use any number of images from the search results. The
extraction function $X_G$ selects which contexts we choose, e.g., it selects the
first 10 image results in Google Search Engine’s relevance order. $A_G$ will
include a CNN which maps each image to a vector representation, plus an
aggregation function which creates one image vector representation for each
word $t$. In this case, $d_G$ will be the dimensionality of the last layer of
the CNN which we use as image representation. Thus, it will be the
dimensionality of our learnt Google Image Embedding $E_G$.
Note that in general, $A_m$ is a very broad notation. It can involve any learning
algorithm. If our training data is text, for example, it can involve any
traditional count-based method, shallow or deep neural network, or any other
type of method which, given a dataset and an extraction function that chooses
relevant contexts, maps targets to a vector representation.
In the next section we will introduce three types of Semantic Embedding
models which we study in this thesis.
2.6.1 Embedding Modalities
In this work we are going to distinguish between three different types of
embedding, one for each modality $m \in \{L, V, S\}$, which are produced by three
classes of semantic embedding models varying in all parameters but $T$:
Linguistic Embeddings $E_L \in \mathbb{R}^{|T| \times d_L}$ are vector spaces which are learnt by
an algorithm $A_L$ trained on large text data $D_L$. The learning algorithm can
be any of the standard shallow neural models, which approximate co-occurrence
statistics of words, such as SGNS, CBOW or FastText. $X_L$ corresponds to
co-occurrence counts for target/context word pairs within a context window around
target words.
Visual Embeddings $E_V \in \mathbb{R}^{|T| \times d_V}$ consist of vectors which have been trained
on images $D_V$, which are associated with words by $X_V$ (e.g. images labelled with
words). In this case the learning algorithm is typically a CNN (see
Section 2.3), which has a specified architecture for learning abstract patterns from
image data. However, after mapping images to a vector space, we need a method
which associates one vector with each word. In our case we usually have multiple
image results for a word, hence this method has to be a vector aggregation, such
as element-wise maximum, mean or median (discussed in Sections 4.1 and 4.3).
The learning algorithm and the aggregation method together constitute $A_V$.
Structured Embeddings $E_S \in \mathbb{R}^{|T| \times d_S}$ are the result of an $X_S$ which extracts
relevant contexts from data $D_S$ which has a more developed structure than raw
text or images on the internet. These datasets usually involve some manual
design and labour for the collection, therefore they are much smaller in terms of
storage size. One example is Visual Genome Scene Graph
annotations (introduced in Section 3.1.2), which we study in detail in Chapters 4,
5 and 6. $A_S$ is an algorithm similar to $A_L$, trained on the extracted pairs with
co-occurrence statistics.
$d_L$ and $d_S$ depend on the output size of the shallow network model in use, usually
300. $d_V$ is the size of the last layer of a CNN.
The above embedding types can be combined using one of
the three fusion techniques (Section 2.4.2). Throughout this thesis we will use
mid-fusion, as it allows us to examine the information coming from each embedding
more easily. We denote multi-modal embeddings by $E_{m_1} + E_{m_2}$, with
$m_1, m_2 \in \{L, V, S\}$ and $m_1 \neq m_2$.
2.7 Modalities as Partial Observers of Meaning
The ancient Indian parable called Blind men and an elephant tells a story of a
group of blind men who have never come across an elephant before and who learn
and conceptualise what the elephant is like by touching it. Their observations go
as follows in James Baldwin’s English version5:
...The first one happened to put his hand on the elephant’s side. “Well,
well!” he said, “now I know all about this beast. He is exactly like a
wall.”
The second felt only of the elephant’s tusk. “My brother,” he said,
“you are mistaken. He is not at all like a wall. He is round and
smooth and sharp. He is more like a spear than anything else.”
The third happened to take hold of the elephant’s trunk. “Both of you
are wrong,” he said. “Anybody who knows anything can see that this
elephant is like a snake.”...
Another, whose hand was upon its leg, said that the elephant is a
pillar, like a tree. To the fifth, whose hand reached its ear, it seemed like a kind
of fan. The last one, who felt its tail, described it as a rope.
Will they be able to combine their observations into one description more
accurate than any of their individual ones? Or will they just disagree and become
more confused than they had been?
If the blind men were touching different objects, or were in completely different
universes, they would probably struggle to reach an agreement. Since, however,
they are feeling the same animal, they do have a common ground, which is at
first hidden from them, but which they have a chance to comprehend better
together through collaboration. It only makes sense to collaborate if none of
them is already an elephant expert, or talking about a completely irrelevant or
random subject. Similarly, our Semantic Embedding models have a chance to
combine their knowledge if done properly. Figure 2.4 presents an illustration of
our multi-modal framework, with one target concept of the elephant and two
Semantic Embedding models with different perspectives.6
5 https://americanliterature.com/author/james-baldwin/short-story/the-blind-men-and-the-elephant
Figure 2.4: Modalities and the elephant. Illustration of the Semantic Embedding
models for different modalities, which include different perspectives. Data $D$
includes the target concept $T$ of the elephant plus the observable contexts $O_{m_1}, O_{m_2}$,
which are the trunk and a tusk. Each of the two Semantic Embedding models
$S_{m_1}, S_{m_2}$ receives the data from their different perspectives: $D_{m_1} = (T, O_{m_1})$ and
$D_{m_2} = (T, O_{m_2})$ respectively.
Analogously to the imperfect lexical competence framework, mentioned in
Section 2.1.1, we treat modalities as partial observers of meaning. Like the men
above, we assume that they have different perspectives on the same object. This
object in our case is word meaning, or rather an aggregated statistical represen-
tation of words at a specific point in time (described in Section 2.1.2).
Using the notation above, suppose we have Semantic Embedding models
$[S_{m_1}, \ldots, S_{m_M}]$ of $M$ different modalities. We assume:
1. Common ground: Each of them captures some aspect of word meanings.
That is, we assume that the vector weights of none of the learnt embeddings
$[E_{m_1}, \ldots, E_{m_M}]$ are random.
2. Perspectives: They do not share the same knowledge, they represent dif-
ferent perspectives.
3. Imperfect knowledge: None of them has perfect knowledge: none of the
Semantic Embedding models is an oracle which represents the ground truth.
In some versions of the parable the men get into a disagreement (or a fight
of varying degrees of violence, depending on the version); in others they learn that
they were all partially correct and partially wrong. In the following, we will search
for the best way to ensure that our models of different modalities can collaborate
as effectively as possible.
6 Icons made by Good Ware (https://www.flaticon.com/authors/good-ware)
from www.flaticon.com. Photo of an Indian elephant is from Wikipedia
(http://web.archive.org/web/20210907113830/https://de.wikipedia.org/wiki/Datei:Elephas_maximus_%28Bandipur%29.jpg),
elephant drawing is from
http://web.archive.org/web/20210907105456/https://www.drawingtutorials101.com/how-to-draw-an-indian-elephant.
2.7.1 Background and Motivation for Model
Transparency
From the existing multi-modal literature we know that combining textual and
visual modalities can improve performance in various cases. Most
work, however, evaluates solely on tasks such as semantic similarity, or on
downstream tasks such as Visual Question Answering (VQA). It has been shown
by many researchers that this traditional way of evaluating models in Machine
Learning is prone to various flaws, which can be fatally misleading for the field.
Kuhnle [Kuhnle, 2020, Chapter 2] gave a comprehensive discussion of these
problems. Building on this, we summarise the issues in the following categories:
Black-Box Model Performance Since the recent deep learning revolution,
ML evaluation has appeared to be solely concerned with beating benchmarks on
downstream tasks, while the models are often treated as black boxes. This has often
led to models which learn “weird behaviour”. For example, vision models may
rely on the image background to recognise an object [Ponce et al., 2006], deep
CNNs have blind spots [Zhang et al., 2018], and neural translation models mistranslate
low-frequency words into context-fitting but content-changing alternatives [Arthur
et al., 2016]. Good evaluation performance on one task often does not transfer
to downstream tasks either. In Section 4.4 we also present our own finding that
a deep LSTM with randomly initialised input word vectors performs on par with
one that takes pretrained word embeddings as input on a Textual Entailment task (SNLI).
Zhang and Bowman found the related phenomenon of high-performing randomly
initialised LSTM models [Zhang and Bowman, 2018]. This is in line with current
findings on the recent transformer-type models, which are shown to be
far from solving general tasks (e.g., document question answering). Rather, these
models overfit to the quirks of particular datasets [Yogatama et al., 2019].
This all leads us to conclude that looking only at performance improvements
between models is mostly meaningless without further analysis.
Dataset Bias Data in the context of ML is supposed to convey patterns which
are characteristic of a certain task. Kuhnle defines dataset bias as coincidental
systematic artefacts in the data which are not characteristic of the task in
question. Because these artefacts are incidental, using such datasets as training data can
result in unintended behaviour. For instance, Wang et al. [Wang et al., 2018b]
found that image captioning models for MS-COCO [Lin et al., 2014] can learn to
produce reasonable captions merely by knowing about the objects in an image
while ignoring, for instance, their location and relation. On VQA tasks, modality
bias has been shown, which refers to the systematic tendency that one modality
suffices to infer the correct output with high confidence. Multiple examples were
reported, such as a language-only model which completely ignores the image but
can answer almost half of the questions correctly [Zhang et al., 2016]. Agrawal et
al. [Agrawal et al., 2016] observed how seemingly well-performing models jump
to conclusions after only the first few question words, thus concluding that they
fail at complete question and image understanding. Although Kuhnle does not
include ethical bias in his definition, we think it could fit into it if ethical
goals are included in the task definition. The field of AI fairness is shifting towards
concentrating on harms rather than bias in the political sense [Barocas et al.,
2019, p. 136-143]; however, once mitigating harm is included in our task objective,
we can use Kuhnle’s dataset bias definition. There is a line of research on cultural
stereotypes reflected in word embeddings [Barocas et al., 2019, p. 141]. Even
though word embeddings per se do not correspond to any linguistic or decision-making
task, analysing them before incorporating them into applications is a
crucial step from an ethical point of view as well.
Model Bias Hooker [Hooker, 2021] argued that bias materialises not only
in data but in the algorithms as well. She argues that the key reason why model
design choices amplify algorithmic bias is that notions of fairness often coincide
with how underrepresented protected features are treated by the model.
Most real-world data naturally have a skewed distribution with a small number
of well-represented features and a “long tail” of features that are relatively
underrepresented. The skew in feature frequency leads to disparate error rates on
the underrepresented attribute.
Problems of Metrics Lastly, evaluating meaning representations is inherently
limited by the methods and possibilities of human annotation collection. On top
of this, as mentioned in [Kuhnle, 2020, p. 23-24], evaluations are often prone
to statistical flaws in interpreting performance scores, such as missing baseline
scores, confidence intervals reported with no reference or explanation, and a lack of
formal comparison/hypothesis testing [Faruqui et al., 2016].
Solutions A range of papers have been published recently which attempt to fix
some of the identified evaluation issues. Several attempts have been made at fixing
data; however, Torralba and Efros [Torralba and Efros, 2011] argued that such
a process is likely doomed to result in a “vicious cycle” of ad hoc improvements,
unless one reconsiders the underlying mechanisms which cause undesired dataset
bias. Artificial data and unit testing [Fouhey and Zitnick, 2014, Johnson et al.,
2017, Kuhnle and Copestake, 2017] is a promising paradigm to amend ML evaluations.
Probing is an increasingly popular approach to “stress-testing”,
which involves testing the model on an auxiliary predictive task and testing
the sensitivity of the model output to modifications of the input [Conneau et al.,
2018, Voita and Titov, 2020]. Approaches for interpretable models and post-hoc
model explanation techniques are also growing areas [Ghorbani et al., 2019, Kaur
et al., 2020].
In the next section we propose transparency analysis as an extension of the
above proposed solutions aiming to prevent “vicious cycles” by promoting a more
informed model development process.
2.7.2 Transparency Testing and Efficient Multi-Modal
Fusion
A key objective of this thesis is to propose and demonstrate a framework for
overcoming the inconsistency of multi-modal results. Our approach is somewhat
related to the probing paradigm and partially inspired by interpretability research
and cognitive science. Beyond “stress-tests” for our models we propose to extend
standard evaluation techniques with an in-depth model and data analysis. We
propose both going wider towards a more comprehensive model comparison across
modalities and data sources, as well as deeper into studying the “cognition” of
our models. We choose to analyse our datasets and models in a transparent
way, which could serve as a preprocessing step before performing data or model
debiasing. We propose performing and automating such data and model analyses,
in order to prevent “vicious cycles” of ad hoc improvements, mentioned in the
previous section.
We postulate that amending performance evaluation with more in-depth
transparency testing of semantic models is a useful way of developing more efficient
and also safer models. Getting to know our models’ inner “cognitive models” can
be a way towards AI methods which are capable of communicating their reasoning
and also potential biases to humans. This would make them easier to
debug and maintain safely in the future.
We propose an embedding analysis leaning on three pillars. We postulate that
they together form a comprehensive, interpretable semantic analysis but none of
them is sufficient on its own. The three types of analysis are categorised as
black-box or transparency testing and aim to answer the following questions:
1. Performance testing: Black-Box testing – How do representations of different
modalities, trained on different datasets, perform on evaluation tasks?
2. Qualitative / Quantitative structural analysis: Transparency testing –
How do representations of different modalities differ?
3. Independence analysis: Transparency testing – How much do representations
differ?
By learning about how and how much our different embeddings $E_L$, $E_V$, $E_S$
differ while looking at the performance scores, we can reach a conclusion on:
What is the most efficient way of combining our different resources?
Efficiency What do we mean by efficiency? Performance testing is only one
way to account for efficiency. Whether we hold a machine learning model to be
efficient depends on our costs and resources. Data is often a limited resource, so in
most cases it makes sense to take data size into account. Required computational
resources, running times and electricity costs are also important factors to consider.
Efficiency in the context of economic footprint was famously thematised
by Bender et al. [Bender et al., 2021]. In this work we account for performance,
data size and distribution as well as model size, as these are metrics we could
easily control for. Including hardware, electricity costs and running time could
be a relevant extension of our studies.
None of the three types of analysis on its own is sufficient to answer the
above question, but together they have the potential to provide meaningful
insight into the anatomy of multi-modal semantic models.
In Chapter 3 we will discuss the details of our approach to all three types of
analysis. In the following sections we introduce our framework for transparency
analysis of multi-modal models.
2.7.3 “Cognitive Model” of Embeddings: How do
Models Conceptualise?
As the second pillar, or the first transparency analysis, we ask
whether each of these vector spaces represents meaningful concepts as clusters,
and how these concept structures relate to each other.
Comparing semantic spaces is central in Lexical Semantic Change (LSC).
Dubossarsky et al. introduced Temporal Referencing7 for robust modelling of
LSC on diachronic corpora [Dubossarsky et al., 2019]. They treat all time-specific
corpora $C_a, C_b, \ldots, C_n$ as one corpus $C$ and learn word representations on the
full corpus. However, they first replace each target word $w \in C_t$ with a
time-specific token $w_t$. This way, they learn one single space that contains a vector for
each target-time pair $w_t$, which may be compared directly without the need for
mapping different spaces to each other.
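The token replacement step of Temporal Referencing can be illustrated with a short sketch. It is not the authors' implementation: the function name, the `word_period` token format and the toy corpora are assumptions made for illustration only.

```python
def temporal_referencing(corpora, targets):
    """Merge time-specific corpora into one corpus C, replacing each target
    word w in corpus C_t by a time-specific token (here written 'w_t').

    corpora: dict mapping a period label to a list of tokenised sentences.
    targets: set of target words whose change over time is studied.
    """
    merged = []
    for period, sentences in corpora.items():
        for sentence in sentences:
            merged.append(
                [f"{tok}_{period}" if tok in targets else tok for tok in sentence]
            )
    return merged

# Toy diachronic corpora (hypothetical data).
corpora = {
    "1900": [["the", "mouse", "ate", "cheese"]],
    "2000": [["click", "the", "mouse"]],
}
corpus = temporal_referencing(corpora, targets={"mouse"})
```

Any embedding model trained on `corpus` then learns separate vectors for `mouse_1900` and `mouse_2000` in one shared space, while the context words remain common to both periods.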
In Statistical Machine Translation, semantic spaces have been compared
in order to perform unsupervised learning of bilingual lexicons. Artetxe et
al. [Artetxe et al., 2018] developed a cross-lingual word embedding mapping in
order to align two languages without the need for parallel corpora. They propose
a self-learning method based on the observation that, given the similarity matrix
of all words in the vocabulary, each word has a different distribution of similarity
values. Their assumption is that two equivalent words in different languages
should have similar distributions.
Minnema and Herbelot [Minnema and Herbelot, 2019] used various metrics
to measure the similarity between a linguistic embedding space and a brain
image embedding space. Besides testing pairwise and rank correlation between
vectors for the same word from the two spaces, their metrics included the Nearest
Neighbour structure of the two spaces and Representational Similarity Analysis
(Pearson correlation between their respective similarity matrices). The latter is
somewhat related to the method of Artetxe et al. [Artetxe et al., 2018], as they
also initialise with correlation matrices of the two vector spaces – which, in their
case, correspond to linguistic spaces of two different languages. Dubossarsky et
al. [Dubossarsky et al., 2019] also performed nearest neighbour analysis in the
Lexical Semantic Change context.
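Representational Similarity Analysis, as described above, compares two spaces through their similarity matrices rather than through their coordinates, so the spaces may have different dimensionalities. A minimal sketch follows, assuming cosine similarity as the within-space measure (our choice for illustration) and using toy vectors; the function names are ours.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rsa(space_a, space_b):
    """Pearson correlation between the upper triangles of the two
    similarity matrices. Both spaces are dicts over the same vocabulary."""
    pairs = list(combinations(sorted(space_a), 2))
    sims_a = [cosine(space_a[w], space_a[v]) for w, v in pairs]
    sims_b = [cosine(space_b[w], space_b[v]) for w, v in pairs]
    return pearson(sims_a, sims_b)

# Toy spaces with different dimensionalities (2-d and 3-d).
space_a = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
space_b = {"cat": [1.0, 0.0, 0.0], "dog": [1.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
score = rsa(space_a, space_b)
```

A score close to 1 means that word pairs which are similar in one space are also similar in the other, regardless of how the individual axes are oriented.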
7 https://github.com/Garrafao/TemporalReferencing
As regards measurements such as nearest neighbour in high-dimensional
vector spaces, one has to take the threat of the curse of dimensionality into
account. Dinu et al. [Dinu et al., 2015] showed that nearest neighbour suffers
from the hubness problem. This phenomenon is known to occur as an effect
of the curse of dimensionality, and causes a few points (known as hubs) to be
nearest neighbours of many other points [Radovanović et al., 2010]. This is a
problem because these hub vectors tend to be near a high proportion of items,
pushing their correct labels (e.g., words which are semantically similar) down the
neighbour list.
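Hubness is commonly quantified by counting how often each point occurs among the k nearest neighbours of the other points. The sketch below computes these counts with cosine similarity on toy 2-d vectors; the data and names are hypothetical, and such a low-dimensional example only shows how the count is obtained, not the high-dimensional effect itself.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def k_occurrence(vectors, k=1):
    """For each item, count how often it appears among the k nearest
    neighbours (by cosine similarity) of the other items."""
    counts = Counter({name: 0 for name in vectors})
    for query, q_vec in vectors.items():
        others = [(cosine(q_vec, v), name) for name, v in vectors.items() if name != query]
        for _, name in sorted(others, reverse=True)[:k]:
            counts[name] += 1
    return counts

vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "hub": [1.0, 1.0]}
counts = k_occurrence(vectors, k=1)
```

In this toy example the vector named `hub` is the nearest neighbour of both other points. In real embedding spaces, a strongly right-skewed distribution of these counts indicates hubness.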
Concept based interpretability analysis using clustering is a new area in ML,
which is related to our approach in spirit. Ghorbani et al. [Ghorbani et al., 2019]
introduced post-training analysis of computer vision models using clustering of
image segments. Clustering and visualisations have been previously used for
multi-modal embedding analysis in [Gupta et al., 2019].
As a qualitative / quantitative structural analysis we will employ standard
clustering metrics, which is most related to [Minnema and Herbelot, 2019],
and cluster visualisations somewhat similar to [Gupta et al., 2019]. Unlike previous
work, we will zoom even further into our embeddings and perform a thorough
qualitative cluster analysis along with visualisations to discover model-concepts
(introduced in Section 2.1.2), and analyse a new structured embedding type
(Section 2.5). This will be complemented with an information-theoretical analysis
framework, which we introduce in the following sections.
2.7.4 Information Theory Background
The third pillar of our semantic analysis seeks the answer to the question: How
much do semantic embeddings $E_m$ of different modalities differ? We reformulate
this question as follows: How much extra information do we gain if we combine
two modalities? We could also phrase it this way: How much less confused does a
model $S_{m_1}$ get after combining it with another model $S_{m_2}$? We turn to
information theory to formalise our question. We start with a review of the
basics, then formulate our approach.
The standard unit of information in computer science is the bit. The most
widespread way of measuring information is the Shannon entropy [MacKay, 2003],
introduced by Claude Shannon in 1948 [Shannon, 2001]. In information theory,
the entropy of a random variable is the average level of “information”, “surprise”,
or “uncertainty” inherent in the variable’s possible outcomes. Shannon
was searching for an information measure with the following conditions: Let $p$ be
the probability of an event, then
1. $H(p)$ is monotonically decreasing in $p$.
2. $H(p) \geq 0$: information is a non-negative quantity.
3. $H(1) = 0$: events that always occur do not communicate information.
4. $H(p_1, p_2) = H(p_1) + H(p_2)$: the information learned from independent
events is the sum of the information learned from each event.
Shannon discovered that the only suitable choice of $H$, where $X$ is a random variable
with possible values $x_1, \ldots, x_n$ and $P(X)$ is its probability mass function, is:
\[ H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i) \tag{2.7} \]
where $b$ is the base of the logarithm used ($b = 2$ measures information in bits).
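Equation 2.7 translates directly into code. The sketch below is a plain-Python illustration (the function name is ours); zero-probability outcomes are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum_i P(x_i) log_b P(x_i).
    Outcomes with zero probability contribute nothing to the sum."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])        # maximal uncertainty over two outcomes
biased_coin = entropy([0.9, 0.1])      # less uncertain than the fair coin
certain = entropy([1.0])               # an event that always occurs
uniform_four = entropy([0.25] * 4)     # four equally likely outcomes
```

With base 2 the fair coin carries one bit and the four-way uniform variable two bits, while the certain event carries none, in line with condition 3 above.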
We rely on the concept of Mutual Information, which is intimately linked
to entropy. It is also known as Information Gain and measures the information
that two random variables $X$ and $Y$ share: it measures how much knowing one
of these variables reduces uncertainty about the other. Using the entropy, it is
defined by:
\[ I(X, Y) = H(X) - H(X|Y) \tag{2.8} \]
where $H(X|Y)$ is the conditional entropy [MacKay, 2003].
Let $(X, Y)$ be a pair of continuous random variables with values over the space
$\mathcal{X} \times \mathcal{Y}$. If their joint distribution is $P_{X,Y}$ and the marginal distributions are $P_X$
and $P_Y$, the mutual information is defined as
\[ I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} P_{X,Y}(x, y) \log \left( \frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)} \right) dy \, dx \tag{2.9} \]
It follows that
\[ I(X, Y) = D_{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y) \tag{2.10} \]
where $D_{KL}$ is the Kullback–Leibler divergence:
\[ D_{KL}(P \,\|\, Q) = \int_{\mathbb{R}^d} dP \, \log \frac{dP}{dQ} \tag{2.11} \]
If $p(x)$ and $q(x)$ are densities then
\[ D_{KL}(p \,\|\, q) = \int_{\mathbb{R}^d} p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{2.12} \]
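The two characterisations of mutual information (Equations 2.8 and 2.10) can be checked against each other numerically. The sketch below uses the discrete analogue of the definitions, with sums in place of integrals, on a small hypothetical joint probability table; the function names are ours.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = -sum_{x,y} P(x, y) log2 P(x|y), with joint[x][y] = P(x, y)."""
    py = [sum(col) for col in zip(*joint)]
    return -sum(
        p * math.log2(p / py[j]) for row in joint for j, p in enumerate(row) if p > 0
    )

def mutual_information(joint):
    """I(X, Y) as the KL divergence between the joint distribution
    and the product of its marginals (discrete analogue of Eq. 2.10)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

joint = [[0.4, 0.1], [0.1, 0.4]]                   # X and Y mostly agree
px = [sum(row) for row in joint]
mi_kl = mutual_information(joint)                  # Eq. 2.10
mi_entropy = entropy(px) - conditional_entropy(joint)  # Eq. 2.8
```

Both routes give the same value for this table. For an independent pair the mutual information is zero, and for two identical binary variables it equals the entropy of one of them.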
2.7.5 Proposal for Measuring Independence of
Embeddings
The phenomenon that human multi-modal sensory information fusion happens in
a statistically optimal fashion has been studied in Cognitive Psychology [Ernst
and Banks, 2002]. Ernst and Banks found that humans combine visual and haptic
information by weighting each in inverse proportion to its uni-modal variance.
Not directly analogous, but somewhat related, is the finding of Kiela et al. [Kiela et al.,
2014] for multi-modal (visuolinguistic) word embeddings. They filtered visual
input for words based on the corresponding images’ dispersion, which measures
the average pairwise distance of image vectors for a word. They found that
filtering out “noisy” images improved the multi-modal representation. This
does not necessarily mean that one should ignore all new conflicting information,
but highlights that it is possible to add more data to the system and obtain worse
performance. In this thesis, we are pursuing a deeper understanding of the exact
circumstances under which visual information enhances meaning representations
and when it does not, by learning more about the relationship between semantic
spaces of different modalities.
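The dispersion measure described above, the average pairwise distance among the image vectors of a word, can be sketched as follows. We assume cosine distance as the distance function and a hypothetical threshold of 0.5 for the filtering step; neither the threshold nor the toy vectors come from the cited work.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def dispersion(image_vectors):
    """Average pairwise cosine distance among the image vectors of one word."""
    pairs = list(combinations(image_vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Toy image vectors: one visually consistent word, one visually noisy word.
images = {
    "banana": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]],
    "freedom": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
}
scores = {word: dispersion(vecs) for word, vecs in images.items()}

THRESHOLD = 0.5  # hypothetical cut-off, chosen only for this example
keep_visual = [word for word, d in scores.items() if d < THRESHOLD]
```

Words whose images point in similar directions receive a low dispersion and keep their visual input, while words with scattered images are filtered out.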
The informativeness of new data has been studied in learning purely linguistic
embeddings as well. Kabbach et al. [Kabbach et al., 2019] developed a method to
train word embeddings on a smaller corpus with maximal information gain, after
pretraining them on a large corpus. Their model is designed to simulate new word
acquisition by an adult speaker who already masters a substantial vocabulary.
Their system uses a pretrained CBOW as this “background knowledge”, which
they then use to train an SGNS on a much smaller dataset in such a way that the context
is maximally informative (has minimal entropy) given the previous knowledge.
To our knowledge we are the first to propose measuring the independence of
different modalities by estimating the Mutual Information between their embedding
spaces. In order to do so, we treat each embedding space $E_{m_i}$ as a vector space
representing samples from a multivariate random distribution. By estimating
the mutual information we can compare which embedding pairs differ more from
each other. We would like to know whether the perspective of $E_S$ or $E_V$ is
“farther” from $E_L$: which one is “more independent”? Let us reformulate our three
assumptions on partial observers from Section 2.7 in the information-theoretical
framework. Let $m_i, m_j, m_k$ be modalities, where $i, j, k$ are distinct, then:
1. Common ground: No two embeddings $E_{m_i}, E_{m_j}$ are completely
independent, as they have all learnt some pattern related to the same hidden
concepts in a language: $I(E_{m_i}, E_{m_j}) \neq 0$.
2. Perspectives: They are not completely correlated: $I(E_{m_i}, E_{m_j})$ is not
maximal.
3. Imperfect knowledge: None of them is an oracle; they do not predict the
evaluation data perfectly: $P(D|S_{m_i}) \neq 1$.
Thus if the efficiency (Section 2.7.2) of $E_{m_j}$ and that of $E_{m_k}$ are similar, and
\[ I(E_{m_i}, E_{m_j}) > I(E_{m_i}, E_{m_k}) \tag{2.13} \]
then we hypothesise that there is a combination method with which combining
$E_{m_i}$ with $E_{m_k}$ is more efficient than using $E_{m_i} + E_{m_j}$, as they convey more
complementary information which can be combined. The question of how this
combination is realised depends on all the parameters in $S_{m_i}$ and $S_{m_k}$ and the
combination method itself. In this work we explore mid-fusion combination as it
allows us to study the information from different modalities separately as well as
combined, and it makes it straightforward to compare individual embeddings.
2.7.6 A Utility Based Model of Embedding
Independence
In this section we introduce a toy model based on probabilistic games, which
serves as a theoretical backing for Mutual Information minimisation. As it is
just a toy model, it is not a fundamental part of the framework of this thesis.
However, it provides an interesting perspective on learning multi-modal semantic
representations based on information-theory, which could be generalised in the
future.
Before we create our own model of multi-modal fusion, we introduce Kelly’s
framework of betting in a game through a noisy binary channel [Kelly jr, 1956],
[Cover and Thomas, 2012, p. 162].
Rate of Growth Let us consider a repeatable game, where in each round a
gambler can bet some amount of their wealth (including the whole) on either of
two outcomes. After each round the gambler receives double their stake if they
guessed right, and loses the stake otherwise. If $p$ is the probability of error and $q$ is the
probability of a right guess, how much should they bet? Let $V_0$ be the starting
capital and $V_N$ the capital after $N$ bets. If they bet their entire capital each time,
this would, in fact, maximise the expected value of their capital $\langle V_N \rangle$, which in
this case would be given by
\[ \langle V_N \rangle = (2q)^N V_0 \tag{2.14} \]
This would be little comfort, however, since if they continued indefinitely
($N \to \infty$), they would be broke with probability one. Let us, instead, assume
that the gambler bets a fraction $l$ of their capital each time. Then
\[ V_N = (1 + l)^W (1 - l)^L V_0 \tag{2.15} \]
where $W$ and $L$ are the number of wins and losses in the $N$ bets. Then the
doubling factor or rate of growth of the gambler’s capital $G$ is8
\[ G = \lim_{N \to \infty} \left[ \frac{W}{N} \log(1 + l) + \frac{L}{N} \log(1 - l) \right] = q \log(1 + l) + p \log(1 - l) \quad \text{with probability one} \tag{2.16} \]
8 Here, log denotes $\log_2$.
We want to maximise this gain. Since $G$ is concave in $l$, we can set its
derivative with respect to $l$ to zero, which gives $l = q - p$, and we get
\[ G_{max} = 1 + p \log p + q \log q = 1 - H(X) \tag{2.17} \]
which is 1 minus the Shannon entropy, where $X$ is a binary random variable whose
two outcomes have probabilities $p$ and $q$. The model has been generalised by Kelly
to more than two outcomes in [Kelly jr, 1956].
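The maximisation of the growth rate can be verified numerically. The sketch below evaluates $G(l)$ on a grid of betting fractions for an assumed win probability $q = 0.7$ (an arbitrary illustrative value) and compares the best grid point with the closed-form optimum $l = q - p$ and with $1 - H(X)$; the function names are ours.

```python
import math

def growth_rate(l, q):
    """G(l) = q log2(1 + l) + p log2(1 - l), with p = 1 - q."""
    return q * math.log2(1 + l) + (1 - q) * math.log2(1 - l)

def binary_entropy(q):
    p = 1 - q
    return -(p * math.log2(p) + q * math.log2(q))

q = 0.7                                      # assumed probability of a right guess
grid = [i / 1000 for i in range(1000)]       # betting fractions l in [0, 0.999]
best_l = max(grid, key=lambda l: growth_rate(l, q))
best_growth = growth_rate(best_l, q)
closed_form = 1 - binary_entropy(q)          # G_max = 1 - H(X)
```

The grid search recovers the fraction $q - p = 0.4$, and the growth rate at that point agrees with $1 - H(X)$ up to floating-point error. Betting more than this fraction lowers the long-run growth rate even though it raises the expected capital.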
Gain of Multi-Modal Fusion Now, let us imagine learning concepts in a
language from data as such a game. Figure 2.4 illustrates the model in the following
way. Winning corresponds to learning a semantic model of target concepts
$T$ which highly correlates with human semantic judgement. The noisy channel
corresponds to the dataset $D$ via which our models can learn embedding
representations $E_{m_1}, E_{m_2}$. In this game we are interested in maximising our gain
by combining two modalities in the most efficient way. Let $X$ denote a perfect
“ground-truth” semantic representation, which maximally correlates with human
judgement on our task. For the sake of readability let $Y := E_{m_1}$, $Z := E_{m_2}$. Then
the maximal rates of growth for each model and for the ground-truth are:
\[ \begin{aligned}
G_Y &= 1 - H(X|Y) \\
G_Z &= 1 - H(X|Z) \\
G_0 &= 1 - H(X)
\end{aligned} \tag{2.18} \]
The rate of growth or gain with the combination of $Y$ and $Z$ is
\[ G_{YZ} = 1 - H(X|Y, Z) \tag{2.19} \]
We are interested in maximising the rate of growth after we combine the
information from both modalities. Let us maximise the following difference
(with $\Delta G_Y = G_Y - G_0$ and $\Delta G_Z = G_Z - G_0$ defined analogously):
\[ \Delta G_{YZ} = G_{YZ} - G_0 \tag{2.20} \]
Thus, the following theorem holds:
Theorem 1. $\Delta G_{YZ} = \Delta G_Y + \Delta G_Z - I(X, Y, Z)$.
Proof. From Equations 2.18, 2.19 and 2.20:
\[ \begin{aligned}
\Delta G_Y &= H(X) - H(X|Y) = I(X, Y) \\
\Delta G_Z &= H(X) - H(X|Z) = I(X, Z) \\
\Delta G_{YZ} &= H(X) - H(X|Y, Z)
\end{aligned} \tag{2.21} \]
(This is also Kelly’s result for the general case, with more than two outcomes
to bet on, with independent transmitted symbols with fair odds. Fair odds means
that the odds paid on the occurrence of the $s$-th transmitted symbol are the
reciprocal of the probability that the transmitted symbol is the $s$-th one [Kelly jr,
1956].)
Furthermore, we apply the I-Diagram in Figure 2.5, a geometrical representation
of the relationship among the information measures. It is analogous to the
Venn Diagram in set theory, which makes several information-theoretical proofs
easier [Yeung, 1991].
Therefore,
\[ \Delta G_{YZ} = \Delta G_Y + \Delta G_Z - I(X, Y, Z) \quad \text{(see Figure 2.5a)} \tag{2.22} \]
Figure 2.5: Three Random Variables $X$, $Y$ and $Z$. (a) Low inter-modality dependence,
independently from $X$: $I(Y, Z|X)$. (b) High inter-modality dependence,
independently from $X$: $I(Y, Z|X)$. Here $X$ represents a “ground-truth”
variable, a perfect semantic representation. $Y$ and $Z$ are two random
variables, corresponding to embeddings of two modalities $E_{m_1}, E_{m_2}$.
Furthermore, the following inequality holds:
Theorem 2. $I(X, Y, Z) \leq I(Y, Z)$. Mutual Information is an upper bound to
minimise, in order to maximise the rate of growth after multi-modal fusion.
Proof. $\Delta G_Y$ and $\Delta G_Z$ are given because the individual embeddings have already
been trained. Therefore, from Theorem 1 it follows that we need to minimise
$I(X, Y, Z)$ in order to maximise $\Delta G_{YZ}$.
Furthermore, using the I-Diagram in Figure 2.5a, $I(X, Y, Z) = I(Y, Z) - I(Y, Z|X)$
with $I(Y, Z|X) \geq 0$, so
\[ I(X, Y, Z) \leq I(Y, Z) \tag{2.23} \]
Let us notice that if $I(Y, Z)$ is high, the reason might be independent of
$X$. Therefore, $I(X, Y, Z)$ can be small while $I(Y, Z|X)$ is high, as is illustrated
in Figure 2.5b. In practice, however, this would mean that two embeddings
$E_{m_1}, E_{m_2}$ are correlated in some way which is irrelevant to learning semantic
representations. For example, two corpora may have a similar number of documents,
or be written in the same verse, etc. If this spurious correlation is too high, minimising
$I(Y, Z)$ may not be a good approximation. Our investigation of the datasets we
use did not reveal such spurious correlations. Therefore, we treat $I(X, Y, Z)$ as being
very close to $I(Y, Z)$.
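Both theorems can be checked numerically on a small discrete example. The sketch below draws a random joint distribution over three binary variables (a hypothetical stand-in for $X$, $Y$ and $Z$, which in the thesis are continuous embeddings), computes all quantities from marginal entropies, and compares the two sides of Theorem 1 and of the bound in Theorem 2. The helper names are ours.

```python
import math
import random
from itertools import product

def H(joint, keep):
    """Entropy in bits of the marginal over the variable positions in `keep`."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

# Random joint distribution P(x, y, z) over three binary variables.
random.seed(0)
weights = [random.random() for _ in range(8)]
total = sum(weights)
joint = {xyz: w / total for xyz, w in zip(product((0, 1), repeat=3), weights)}

X, Y, Z = 0, 1, 2
dG_Y = H(joint, [X]) + H(joint, [Y]) - H(joint, [X, Y])            # I(X, Y)
dG_Z = H(joint, [X]) + H(joint, [Z]) - H(joint, [X, Z])            # I(X, Z)
dG_YZ = H(joint, [X]) + H(joint, [Y, Z]) - H(joint, [X, Y, Z])     # H(X) - H(X|Y,Z)

I_YZ = H(joint, [Y]) + H(joint, [Z]) - H(joint, [Y, Z])
I_YZ_given_X = (
    H(joint, [X, Y]) + H(joint, [X, Z]) - H(joint, [X, Y, Z]) - H(joint, [X])
)
I_XYZ = I_YZ - I_YZ_given_X   # central region of the I-Diagram

theorem_1_gap = dG_YZ - (dG_Y + dG_Z - I_XYZ)
```

For any joint distribution `theorem_1_gap` is zero up to floating-point error, and `I_XYZ` never exceeds `I_YZ`, because the conditional mutual information `I_YZ_given_X` is non-negative.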
Maximising the gain from multi-modal embedding combination serves as a
framework for analysing efficient multi-modal fusion. An exciting future extension
of this model would be to generalise it further to odds which are not fair, based
on [Kelly jr, 1956]. In Section 3.2.4 we will introduce empirical MI estimation
methods, which we will apply in experiments presented in Chapter 6.
2.8 Summary: Comprehensive and
Interpretable Word Semantic Analysis
In this chapter we reviewed the philosophical and theoretical background of word
semantics and motivated researching distributional word semantic models as a
proxy for statistical analysis of concepts. After reviewing the literature on tex-
tual distributional semantics, visual embeddings and multi-modal approaches, we
proposed a new type of embedding in between linguistic and visual modalities,
based on small data. Furthermore, we introduced a general framework and for-
malism for investigating multi-modal semantic embedding models. Lastly, we
presented a framework for treating modalities as partial observers of meaning
based on information theory.
To tackle inconsistencies and the lack of systematic comparisons in the
multi-modal literature, we proposed extending the analyses of previous work
with an interpretable analysis framework of three pillars:
1. Performance testing: Black-Box testing. How do representations of different
modalities perform on evaluation tasks? We extended previous work with:
(a) Comprehensive analysis of models across data sources, machine learning
models and modalities.
(b) A new modality based on small data, in between low-level visual
information and high-level linguistic, symbolic data.
(c) Efficiency analysis controlling for data size, data distribution and
model size.
2. Qualitative / Quantitative structural analysis: Transparency testing. How do
representations of different modalities differ? An analysis of model-concept
structures captured by modalities.
3. Independence analysis: Transparency testing. How much do representations
differ?
We postulated that none of these pillars alone is sufficient for an
interpretable semantic embedding analysis; when combined, however, they can
offer a fuller picture of what our models capture and how. We need (1)
comprehensive performance testing combined with efficiency metrics as a goal.
Within this context we can carry out transparency analysis involving (2)
zooming into the structural properties of embeddings and (3) quantifying the
optimal information gain from multi-modal fusion.
Within this proof-of-concept framework we showcase that structured small data
can be an efficient alternative to expensive big data and models when resources
are scarce.
Chapter 3
Methodology of Data Selection
and Proposal for Interpretable
Evaluation
In this chapter we introduce the training and evaluation datasets which form the
basis of this study. Understanding how each training and evaluation dataset was
created is crucial for interpreting the results. Using the notation from
Section 2.6, Section 3.1 describes the image, text and structured corpora $D_V$,
$D_L$, $D_S$ used as training data. Section 3.2 gives an overview of the
evaluation data and methodology. Finally, we summarise the roadmap of our
three-pillar analysis in Section 3.3.
3.1 Training Data Matters
One of the main objectives of this thesis is to analyse the data sources that are
being used during model training. Recalling our notation of semantic embedding
models of modality $m$ (with output embedding $E_m$):
$$S_m = \langle T, O_m, D_m, X_m, A_m, d_m \rangle \tag{3.1}$$
The dataset $D_m$, comprising observable items and target elements, is an
essential parameter. Analysing it is therefore the basis for all three
contributions. In our I. comprehensive analysis we aim to overcome the often
inconsistent or hard to compare results in previous work. Introducing a new
mapping $X_S$ from a structured data source, as well as analysing the
properties of the data, is at the centre of our study of a II. new type of
semantic embedding model $S_S$. Lastly, getting more familiar with the
training data is imperative if we want to create III. transparent and
interpretable semantic models.
Section 3.1.1 gives a summary of the properties of the image datasets $D_V$
which are used throughout the thesis for visual models $S_V$. Section 3.1.2
introduces text corpora $D_L$ for linguistic semantic embedding models $S_L$.
Let us highlight that Visual Genome is included in both categories, since it is
used both as an image dataset $D_V$ and as a structured text corpus $D_S$ of
$S_S$, after extracting text from its structured annotations.
3.1.1 Image Data
This section introduces the details of processing image data and the image
datasets which deliver the observable context $O_V$ in visual semantic
embedding models $S_V$.
Processing Image Data We used the MMFeat toolkit¹ (based on Caffe²) to obtain
image representations for three different convolutional network architectures:
AlexNet [Krizhevsky et al., 2012], GoogLeNet [Szegedy et al., 2015] and VGGNet
[Simonyan and Zisserman, 2014], and our own toolkit, EmbEval³, for ResNet [He
et al., 2016] and AlexNet based on Pytorch-torchvision⁴. Image representations
are turned into an overall word-level visual representation by taking the mean of
the relevant image representations. All four networks are trained with the
multinomial logistic regression objective, i.e. by minimising the following
negative log-likelihood using mini-batch gradient descent with momentum:
$$-\sum_{i=1}^{D} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)})} \tag{3.2}$$
where $1\{\cdot\}$ is the indicator function, $x^{(i)}$ and $y^{(i)}$ are the
input and output, respectively, $\theta^{(k)}$ denotes the parameters of class
$k$, $D$ is the number of training examples and $K$ is the number of classes.
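To make Equation 3.2 concrete, the following is a minimal NumPy sketch of the
objective for a single linear layer. It is an illustration only, not the
training code of the toolkits above: the function name, the (K, F) layout of
`theta` and the toy data are assumptions made here.

```python
import numpy as np


def multinomial_logistic_loss(theta, x, y):
    """Negative log-likelihood of Equation 3.2.

    theta: (K, F) array, one parameter vector per class (assumed layout).
    x:     (D, F) array of inputs.
    y:     (D,) integer array of class indices in [0, K).
    """
    logits = x @ theta.T  # (D, K): entry (i, k) is theta^(k)^T x^(i)
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # The indicator 1{y^(i) = k} selects the log-probability of the true class.
    return -log_probs[np.arange(len(y)), y].sum()


# Toy data, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = np.array([0, 2, 1, 1, 0])
theta = np.zeros((4, 3))  # all-zero parameters give uniform class probabilities
loss = multinomial_logistic_loss(theta, x, y)
```

With all-zero parameters every class receives probability 1/K, so the loss
reduces to D log K, which is a convenient sanity check for an implementation.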
¹ https://github.com/douwekiela/mmfeat
² https://caffe.berkeleyvision.org/
³ https://github.com/anitavero/embeval
⁴ https://pytorch.org/docs/stable/torchvision/index.html
Property         Google         Bing           Flickr         ImageNet        Visual Genome
Type             Search engine  Search engine  Photo sharing  Image database  Image database
Annotation       Automatic      Automatic      Human          Human           Human
Coverage         Unlimited      Unlimited      Unlimited      Limited         Limited
Sorted           Yes            Yes            Yes            No              No
Tag specificity  Unknown        Unknown        Loose          Specific        Dense
Table 3.1: Sources of image data.
The networks are trained on the ImageNet classification task and we transfer
layers from the pre-trained network.
As we use CNN models pre-trained on ImageNet, the other datasets do not serve
as CNN training data. Instead, each CNN works as a mapping from our $O_V$
images to a vector space. The vector representations are obtained by running a
feed-forward step in the network and extracting the last layer as the
representation of the image. We use the last fully connected layer from AlexNet
and VGGNet (both 4096-dimensional vectors), and the last pooling layer from
GoogLeNet (1024 dimensions) and ResNet (512 dimensions). We have multiple image
results for a word, hence the word-level representation has to be obtained by a
vector aggregation, such as the element-wise maximum, mean or median (studied in
Section 4.1). The learning algorithm and the aggregation method together
constitute the mapping function $A_V$ in $S_V$.
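The aggregation step can be sketched as follows. This is a minimal NumPy
illustration under the assumption that the per-image CNN features for one word
have already been extracted and stacked into an array of shape
(n_images, dim); the function name and the toy numbers are hypothetical and do
not come from MMFeat or EmbEval.

```python
import numpy as np


def aggregate_image_vectors(image_vectors, method="mean"):
    """Combine per-image features of shape (n_images, dim) into one word vector."""
    aggregators = {"mean": np.mean, "max": np.max, "median": np.median}
    if method not in aggregators:
        raise ValueError(f"unknown aggregation method: {method}")
    # Element-wise aggregation over the image axis.
    return aggregators[method](np.asarray(image_vectors, dtype=float), axis=0)


# Toy stand-in for the CNN features of three images retrieved for one word.
features = np.array([[0.0, 2.0, 4.0],
                     [1.0, 2.0, 0.0],
                     [5.0, 2.0, 2.0]])
word_mean = aggregate_image_vectors(features, "mean")
word_max = aggregate_image_vectors(features, "max")
word_median = aggregate_image_vectors(features, "median")
```

The three aggregators differ in how sensitive they are to outlier images: the
maximum keeps the strongest activation per dimension, whereas the median
ignores extreme values.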
Image Datasets Previous systematic studies of parameters for text-based
distributional methods have found that the source corpus has a large impact on
representational quality [Sahlgren and Lenci, 2016, Kiela and Clark, 2014]. The
same is likely to hold in the case of visual representations. Various sources of
image data have been used in multi-modal semantics, but there have not been
many comparisons: [Bergsma and Goebel, 2011] compare Google and Flickr, and
[Kiela and Bottou, 2014] compare ImageNet [Deng et al., 2009] and the ESP
Game dataset [von Ahn and Dabbish, 2004], but most works use a single data
source. In this work, one of our objectives is to assess the quality of various
sources of image data $D_V$.
We selected the presented datasets because they are all standard in Computer
Vision or NLP, while they all differ in at least one of the following properties:
• Type: search engines; photo sharing social networks or hand crafted image
datasets.
• Annotation: Automatic by an algorithm or annotated by humans.
• Coverage: Unlimited – crowd sourced on the internet or a prepared dataset
of limited size.
• Sorted : Whether there is a relevance score assigned to each image that
indicates how descriptive it is of a word (e.g., search engine order).
• Tag specificity : Whether the annotation of images is: specific to objects /
scenes in the image; loose – related to the image on a higher semantic level
or from a personal annotator’s angle; dense – detailed labels of objects and
relationships within an image.
Table 3.1 provides an overview of the data sources. Descriptions of each dataset
follow:
Google Images Google’s image search⁵ results have been found to be comparable
to hand-crafted image datasets [Fergus et al., 2005].
Bing Images An alternative image search engine is Bing Images⁶. It uses
different underlying technology from Google Images, but offers the same
functionality as an image search engine.
Flickr Although [Bergsma and Goebel, 2011] have found that Google Images
works better in one experiment, the photo sharing service Flickr⁷ is an interesting
data source because its images are tagged by human annotators.
ImageNet ImageNet [Deng et al., 2009] is a large ontology of images developed
for a variety of Computer Vision applications. It serves as a benchmarking
standard for various image processing and Computer Vision tasks. ImageNet is
constructed along the same hierarchical structure as WordNet [Miller, 1995], by
attaching images to the corresponding synset (synonym set).
⁵ https://images.google.com/
⁶ https://www.bing.com/images
⁷ https://www.flickr.com
Visual Genome Visual Genome [Krishna et al., 2016] is a human annotated
dataset which contains images with bounding box annotations around objects
and relations among many other types of information, such as scene and region
descriptions, object attributes, semantic relationships between image regions and
objects, and Visual Question Answering (VQA) pairs. The objects, attributes,
relationships, and noun phrases in region descriptions, and VQA pairs are also
canonicalised to WordNet [Miller, 1995] synsets.
All of the dataset properties can be relevant; however, it is not immediately
obvious whether any of the above sources is superior to the others. While
search engines provide full data coverage for virtually any vocabulary in
various languages, they fall behind in tag specificity, as the search word is in
an associative relationship with the images, rather than being a hand-crafted
label. Search engines and Flickr all come with a relevance order, which can be
useful for image-based meaning representations. However, in the case of search
engines we rely heavily on black-box algorithms and automatic annotation.
Hand-crafted datasets, while they certainly fall behind in size and thus
coverage, contain more carefully collected human annotations, which are usually
more specific and detailed. In both ImageNet and Visual Genome, annotations are
aligned with WordNet, which is a standard knowledge base.
Figure 3.1 contains image samples from all datasets which serve as observable
contexts $O_V$ and are mapped to vectors by a feed-forward step in a CNN. All
networks are pre-trained on ImageNet, thus our models do not differ in this
regard. While there is less difference for the more specific concept of elephant,
results for animal are more diverse across sources. Visual Genome (Figure 3.1a)
includes several bounding boxes with dense annotations, whereas the others are
ordered by relevance. Flickr tends to include more personal photos, such as pets
in Figure 3.1d. Google and Bing have more versatile results (Figure 3.1b, 3.1c).
To see more clearly how each property affects model performance, we propose
measuring the effect of image source choice and discuss its effectiveness with
regard to the costs of dataset creation.
Figure 3.1: Example images for animal and elephant from the various data
sources used as observable contexts $O_V$: (a) Visual Genome, (b) Google,
(c) Bing, (d) Flickr. While there is less difference for the more specific
concept of elephant, results for animal are more diverse across sources.
Visual Genome includes several bounding boxes with dense annotations, whereas
the others are ordered by relevance.
3.1.2 Text Corpora
Linguistic models $S_L$ are naturally trained on text corpora $D_L$. Structured
embeddings $S_S$ are also trained on text; the main difference from traditional
text corpora is that these are ordered in a specific structure instead of free
text, e.g., a graph of expressions, hence the distinct notation $D_S$. We used
different versions of Wikipedia and Common Crawl datasets as $D_L$ training
data. $D_S$ consists of Visual Genome Scene Graphs. All of these are described
in the following.
Wikipedia Wikipedia⁸ is a widely used corpus in NLP applications. It is a
crowd-sourced encyclopaedia, which covers various common sense and scientific
concepts. Its topic structure has been directly exploited in Explicit Semantic
Analysis [Gabrilovich et al., 2007]. It has been used as a general training corpus
for its wide topic coverage and long history of crowd-sourced quality control. In
this work we use models trained on the 2013 and 2020 Wikipedia dumps as
baseline models.
FastText In Section 4.3 we use more recent pretrained word embeddings from
the FastText framework⁹. These models use the traditional CBOW model, with
versions extended with subword information [Mikolov et al., 2018]. The following
training datasets were used:
1. wiki-news-300d-1M : 1 million word vectors trained on Wikipedia 2017,
UMBC webbase corpus and statmt.org news dataset (16B tokens).
2. wiki-news-300d-1M-subword : 1 million word vectors trained with subword
information on Wikipedia 2017, UMBC webbase corpus and statmt.org news
dataset (16B tokens).
3. crawl-300d-2M : 2 million word vectors trained on Common Crawl (600B
tokens).
4. crawl-300d-2M-subword : 2 million word vectors trained with subword in-
formation on Common Crawl (600B tokens).
⁸ https://www.wikipedia.org/
⁹ https://fasttext.cc/docs/en/english-vectors.html
Visual Genome Scene Graph In this work we do not only use Visual Genome
[Krishna et al., 2016] as an image dataset, but we also exploit its dense and
structured human annotation as a text corpus. The Visual Genome dataset
contains a complete set of descriptions and QAs for each image based on
multiple image regions, and a formalized representation of the components of an
image. It consists of seven main components: region descriptions, objects,
attributes, relationships, region graphs, scene graphs, and question answer
pairs. Figure 3.2 shows examples of each component for one image. Although it
falls behind the above mentioned text corpora in terms of size, its highly
structured nature can convey semantic information in itself. This dataset is
special in terms of its “modality”. It includes dense textual annotation of
image objects and scenes which people normally do not write about. Therefore,
even though the annotation consists of sequences of characters, it conveys some
high-level visual common-sense knowledge. Besides its relevance for research,
this type of annotation collection methodology could benefit data acquisition
for low-resource languages, where corpora are scarce or absent. Applying tools
with which speakers can point out the visually grounded meaning of their
language could be a highly efficient way of documenting and analysing these
languages. Moreover, automatic Scene Graph Generation algorithms [Xu et al.,
2020] can further boost the efficiency of such methods. Some statistics¹⁰ on the
size of different annotation types are summarized in Table 3.2. Preliminary
studies on a new embedding type based on this dataset are discussed in
Section 4.2. The model is thoroughly studied in Section 4.3 and in Chapters 5
and 6.
We decided to use Wikipedia and corpora from the FastText system, because
they are all standard in the literature and are also easily and openly available.
While we used pretrained models in our studies, we also trained our own SGNS
model on various subsets of Wikipedia for quantity and distribution control exper-
iments. Experimenting with even bigger datasets would be a potential improve-
ment. However, given our resources and the number of experiments planned,
this was a sensible data size limit. Visual Genome is a unique data source for
its structured annotations. We chose it to investigate the potential of such a
dataset for multi-modal semantics.
¹⁰ https://visualgenome.org/data_analysis/statistics
Figure 3.2: A representation of the Visual Genome dataset. Each image contains
region descriptions that describe a localized portion of the image. There are two
types of question answer pairs (QAs): free form QAs and region-based QAs. Each
region is converted to a region graph representation of objects, attributes, and
pairwise relationships. Finally, each of these region graphs are combined to form
a scene graph with all the objects grounded to the image. [Krishna et al., 2016]
Annotation type                                 Count
Total region descriptions                   4,297,502
Total image object instances                1,366,673
Unique image objects                           75,729
Total object-object relationship instances  1,531,448
Unique relationships                           40,480
Total attribute-object instances            1,670,182
Unique attributes                              40,513
Total Scene Graphs                            108,249
Total Region Graphs                         3,788,715
Total Question Answers                      1,773,258
Table 3.2: Visual Genome annotation statistics.
3.2 From Intrinsic Evaluation to Interpretable
Model Anatomy
In this section we discuss the evaluation datasets, metrics and analysis
methodology which we applied to implement our three-pillar transparent testing
of multi-modal embeddings, laid out in Section 2.7.2. Section 3.2.1 describes the
tools for (1) Performance testing. Section 3.2.2 describes analysis on brain data
as embedding analysis. Section 3.2.3 introduces cluster analysis as (2)
Qualitative / Quantitative structural analysis. Finally, Section 3.2.4 introduces
empirical Mutual Information estimation methods for (3) Independence analysis.
3.2.1 Behavioural Tasks
Most multi-modal word embedding work evaluates on semantic similarity and
relatedness tasks in the hope of gathering information about the intrinsic
behaviour of abstract semantic representations. However, the ambiguous notion
of similarity and the low inter-annotator agreement make it difficult to draw
robust conclusions on the differences between models [Batchkarov et al., 2016].
As a first black-box step, we will also evaluate on these standard datasets.
Unlike previous work, however, we first aim to create an extensive study
comparing several semantic models $S_m$ with varying parameters
$T, O_m, D_m, X_m, A_m, d_m, E_m$. Then we gradually move towards more in-depth
transparency analysis.
We briefly describe the standard evaluation datasets and metrics we use in
our experiments:
MEN The MEN data set [Bruni et al., 2014] consists of 3,000 word pairs,
randomly selected from words that occur at least 700 times in the freely available
ukWaC and Wackypedia corpora combined and at least 50 times (as tags) in
the open-sourced subset of the ESP game dataset.¹¹ Pairs were sampled so that
they represent a balanced range of relatedness levels according to a text-based
semantic score. Each pair was randomly matched with a comparison pair and
rated in this setting (as either more or less related than the comparison point) by
an annotator on Amazon Mechanical Turk. This binary comparison task is both
more natural for an individual annotator, and also permits seamless integration
of the supervision from many annotators. The downside is that this way, there is
no well-defined inter-subject agreement. In total, each pair was rated against 50
comparison pairs, thus obtaining a final score on a 50-point scale, although the
Turkers’ choices were binary.
SimLex-999 SimLex-999 [Hill et al., 2015] is a dataset structurally similar to
MEN, including 999 word pairs for intrinsic semantic evaluation. Its objective is,
however, to measure how well models capture similarity, rather than relatedness
or association. The scores in SimLex-999 therefore differ from other well-known
evaluation datasets such as MEN. For example, “coast” and “shore” would have
a high score in both MEN and SimLex. On the other hand, “cloth” and “closet”
would have a low score in SimLex but a high score in MEN, since they have
different materials, function etc., even though they are very much related. This
task is challenging for computational models to replicate because, in order to
perform well, they must learn to capture similarity independently of
relatedness/association. These two relationships between words show up in
different contextual features. Similarity is inferred from similar
co-occurrences with other words. Similarity or relatedness is then captured by
the type of co-occurrence / window size [Kilgarriff and Yallop, 2000]. In
addition, SimLex includes concreteness, Part-Of-Speech and association scores
from the University of South Florida (USF) Free Association Norms [Nelson et
al., 2004].
¹¹ https://staff.fnwi.uva.nl/e.bruni/MEN
SimVerb-3500 SimVerb-3500 [Gerz et al., 2016] is an evaluation resource that
provides human ratings for the similarity of 3,500 verb pairs. It covers all normed
verb types from the USF Free Association database, providing at least three
examples for every VerbNet [Schuler, 2005] class. Verb pairs are rated on a scale
of 0-10, for example: “to reply” / “to respond” - 9.79; “to participate” / “to
join” - 5.64; “to stay” / “to leave” - 0.17. We included this dataset in
Section 4.2, where predicate-object relationships are in focus, to test how it
affects verb representations in particular.
Evaluation metric Model performance is assessed through the Spearman $\rho_s$
rank correlation between the embedding similarity scores of the word pairs and
the human judgements in each evaluation dataset. Pearson correlation has also
been considered; however, humans find it much harder to attach a numerical
score to a pairwise comparison like “cat”–“dog” than to judge whether that
comparison is more similar than “cat”–“television”. Furthermore, the Pearson
correlation coefficient should also be avoided because even if humans give
numerical scores as similarity ratings, these are unlikely to be normally
distributed.
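Embedding similarity scores are computed using the cosine similarity of the
two word vectors $\vec{w}_1$, $\vec{w}_2$ of a word pair $w_1$, $w_2$:
$$\mathrm{Cosine}(\vec{w}_1, \vec{w}_2) = \frac{\vec{w}_1 \cdot \vec{w}_2}{\|\vec{w}_1\| \, \|\vec{w}_2\|} \tag{3.3}$$
$$= \frac{\vec{w}_1 \cdot \vec{w}_2}{\sqrt{\sum_i w_{1i}^2} \, \sqrt{\sum_i w_{2i}^2}} \tag{3.4}$$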
The dot product in the numerator calculates the numerical overlap between the
word vectors, and dividing by the respective lengths provides a length
normalisation which leads to the cosine of the angle between the vectors.
Normalisation is important because we would not want two word vectors to score
highly for similarity simply because those words were frequent in the corpus.
The cosine measure is commonly used in studies of distributional semantics;
however, we could use any other vector space metric [Clark, 2015]. It is
difficult to reach a conclusion from the literature regarding which similarity
measure is best; we use cosine similarity here because it has become standard in
NLP. Future work could involve revisiting these standard metrics because they
may behave differently depending on the task and the source/modality of
training data.
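The evaluation procedure described above can be sketched in a few lines. This
is a minimal illustration, not the EmbEval implementation: the dictionary-based
embedding format, the function names, the policy of skipping out-of-vocabulary
pairs and the toy vectors and ratings are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine(w1, w2):
    """Cosine similarity of two vectors (Equation 3.3)."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))


def evaluate(embedding, gold_pairs):
    """Spearman correlation between model similarities and human scores.

    embedding:  dict mapping word -> 1-D numpy array (assumed toy format).
    gold_pairs: list of (word1, word2, human_score) tuples.
    Pairs containing an out-of-vocabulary word are skipped (an assumption).
    """
    model_scores, human_scores = [], []
    for w1, w2, score in gold_pairs:
        if w1 in embedding and w2 in embedding:
            model_scores.append(cosine(embedding[w1], embedding[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho, len(model_scores)


# Made-up vectors and ratings, for illustration only.
toy_embedding = {
    "coast": np.array([1.0, 0.1, 0.0]),
    "shore": np.array([0.9, 0.2, 0.0]),
    "cat": np.array([0.0, 1.0, 0.2]),
    "dog": np.array([0.1, 0.9, 0.4]),
    "television": np.array([0.0, 0.1, 1.0]),
}
toy_gold = [("coast", "shore", 9.0), ("cat", "dog", 7.0),
            ("cat", "television", 1.0), ("coast", "unicorn", 5.0)]
rho, n_used = evaluate(toy_embedding, toy_gold)
```

Because Spearman correlation only depends on the ranking of the pairs, any
monotone rescaling of either the model scores or the human ratings leaves the
result unchanged.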
3.2.2 Brain Imaging as Embedding Analysis
Evaluation on brain imaging data has been introduced as an NLP evaluation task
on various occasions [Mitchell et al., 2008, Anderson et al., 2016]
(Section 2.1.2). In some cases visually grounded models have been included in
the evaluation [Davis et al., 2019, Anderson et al., 2017, Bulat et al., 2017].
The measured impact of multi-modal information, however, varies across studies,
thus in this work we included a broader analysis on these tasks as well. We aim
to use correlation studies with brain data as a type of black-box analysis,
which is substantially different from behavioural tasks and as such can shed new
light on differences between our Semantic Embedding models of different
modalities. The findings in cognitive neuroscience (Section 2.1.2) on
multi-modal human brain activity while performing semantic tasks further
motivate us to include brain data in our studies.
We evaluate on two brain image datasets which were collected while participants
viewed 60 concrete nouns with line drawings [Mitchell et al., 2008, Sudre
et al., 2012]. One dataset was collected using fMRI (Functional Magnetic
Resonance Imaging) and one with MEG (Magnetoencephalography). Each dataset
has 9 participants, but the participant sets are disjoint, thus there are 18 unique
participants in total. Though the stimuli are shared across the two experiments,
MEG and fMRI are very different recording modalities and thus the data are not
redundant [Xu et al., 2016].
fMRI dataset fMRI measures the change in blood oxygen levels in the brain,
which varies according to the amount of work being done by a particular brain
area. In this fMRI dataset, collected by [Mitchell et al., 2008], participants
were presented with line drawings and noun labels of 60 concrete nouns from 12
semantic categories: animals, body parts, buildings, building parts, clothing,
furniture, insects, kitchen items, tools, vegetables, vehicles and man-made
objects. The experimental task was to think about the properties of the noun
concept they were shown. The set of 60 concepts was presented in a random
order six times to each participant. Each concept was presented for 3 seconds,
with seven-second gaps between presentations.
MEG dataset This experiment involved the same task as the previous one but
used an MEG machine, a large helmet with 306 sensors that measure aspects of the
magnetic fields at different locations in the brain. An MEG brain image is the
time signals recorded from each of these sensors. Each of the words was
presented 20 times (in random order) for a total of 1200 brain images.
Both brain image datasets have been preprocessed by the BrainBench Test Suite
[Xu et al., 2016]. They used a “partialling out” process in order to remove
low-level activity attributable to visual properties from the brain images. They
used the methodology from Mitchell et al. to select the most stable brain image
features for each of the 18 participants. The stability metric assigns a high
score to features that show strong self-correlation over presentations of the
same word.
Two vs. two test To evaluate on brain data we need to compare representation
similarities from brain imaging vectors and meaning representation vectors.
This type of evaluation is fundamentally different from the behavioural tasks,
as we do not have human similarity score labels for word pairs. We use
leave-two-out cross validation, the testing methodology from Mitchell et al.
which has become standard for brain imaging evaluation of semantic embeddings.
Our implementation is based on BrainBench with modifications so we can perform
analysis on individual participants. The evaluation starts from two similarity
matrices, one computed from the embedding vectors and one from the brain
images. Columns of these matrices are called similarity codes. Embedding
similarity codes ($\vec{s}_i$, $\vec{s}_j$) and brain activity similarity codes
($\vec{a}_i$, $\vec{a}_j$) are selected for two nouns. Elements $i$ and $j$ are
removed from each of the similarity codes, as these entries correspond to the
nouns being tested. Figure 3.3 visualises an example of the decoding procedure.
Decoding is successful if the sum of Pearson correlations for the correct
pairings is greater than the sum of Pearson correlations for the incorrect
pairings, resulting in a decoding accuracy of 1 for this pair and 0 otherwise.
Thus, the expected chance-level decoding accuracy is 50%.
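The decoding procedure can be sketched as follows. This is a minimal NumPy
illustration and not the BrainBench code: the function names are hypothetical,
the similarity matrices are assumed here to be built with Pearson correlation
between noun vectors, and random arrays stand in for real embeddings and brain
images.

```python
import itertools

import numpy as np


def pearson(u, v):
    """Pearson correlation of two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]


def two_vs_two_accuracy(embedding_vectors, brain_vectors):
    """Mean leave-two-out decoding accuracy over all noun pairs.

    Both inputs are (n_nouns, n_features) arrays with rows in the same noun
    order; the two feature dimensions may differ.
    """
    # Assumption of this sketch: similarity = Pearson correlation between rows.
    S = np.corrcoef(embedding_vectors)  # embedding similarity matrix
    A = np.corrcoef(brain_vectors)      # brain similarity matrix
    n = S.shape[0]
    outcomes = []
    for i, j in itertools.combinations(range(n), 2):
        keep = np.ones(n, dtype=bool)
        keep[[i, j]] = False  # remove the entries of the two tested nouns
        s_i, s_j = S[keep, i], S[keep, j]
        a_i, a_j = A[keep, i], A[keep, j]
        correct = pearson(s_i, a_i) + pearson(s_j, a_j)
        incorrect = pearson(s_i, a_j) + pearson(s_j, a_i)
        outcomes.append(1.0 if correct > incorrect else 0.0)
    return float(np.mean(outcomes))


# Random stand-ins for 10 nouns; not real embeddings or brain images.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 6))
noise = rng.normal(size=(10, 8))
acc_identical = two_vs_two_accuracy(vectors, vectors)
acc_noise = two_vs_two_accuracy(vectors, noise)
```

Comparing a representation with itself decodes every pair correctly, whereas
unrelated noise is expected to fluctuate around the 50% chance level as the
number of nouns grows.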
Figure 3.3: Visualisation of leave-two-out cross validation from [Anderson et al.,
2016].
3.2.3 How do Models Conceptualise? – Cluster Analysis
As introduced in Section 2.7.3, the second pillar (2) of our analysis is a
transparent investigation of the concepts our embedding spaces $E_L$, $E_V$,
$E_S$ capture. We are interested in how much these model-concepts differ from
each other, to understand under what circumstances the modalities can
complement each other. As mentioned before, this qualitative / quantitative
structural analysis is meant to be used in the context of the previous
performance analyses and the third pillar, (3) independence analysis, which we
detail in Section 3.2.4.
By model-concept, here, we mean clusters in the embedding spaces based on some
similarity metric, which do not necessarily correspond to the meaning of one
word, but rather to some higher-level or different structure. As a
straightforward implementation, we chose to use standard clustering algorithms
and metrics to compare our different embeddings.
In order to grasp how the concept structures of our embedding spaces differ
from each other, we first searched for ways to quantify their cluster structure.
We do not know the ground truth labels of our clusters or even the number of
clusters each embedding space should be broken into. Therefore, we experiment
with three standard clusterization metrics which are designed for the case when
a ground truth labelling is not available. Furthermore, we report results for a
range of numbers of clusters.
In Chapter 6 we present the design, implementation and results of our
transparency studies. Section 6.2 includes qualitative and quantitative cluster
analysis. In Section 6.2.2 we compare our embeddings’ cluster structures and
visualise the learnt clusterings. In Section 6.2.3 we present supervised
visualisations of the embedding spaces alongside an automatic label generation
method and compare the results against the clusterization metric scores. As an
effective visualisation we use the T-SNE algorithm [Maaten and Hinton, 2008,
Wattenberg et al., 2016]. Clustering and T-SNE have been previously used for
multi-modal embedding analysis, e.g., [Gupta et al., 2019]. In Section 6.2 we
report qualitative analyses by investigating the elements of the clusters, as
well as reporting further quantitative cluster structure comparison analyses.
One of our clustering analyses is based on the pre-defined cluster labels of
[Gupta et al., 2019]. They also use Visual Genome; otherwise, their work is
fundamentally different from ours as they use different models, they do not
exploit the Visual Genome graph structure and they evaluate on downstream
tasks.
In the following we present all the standard algorithms and metrics used for
the clustering studies.
3.2.3.1 Clustering Methods and Metrics
We ran the K-means [MacQueen et al., 1967] clusterization algorithm on all three
embeddings to see if it can reveal more about the underlying structure of the
spaces. We used the k-means++ initialization scheme [Arthur and Vassilvitskii,
2006], which has been implemented in the Scikit-learn package¹². This
initializes the centroids to be (generally) distant from each other, which tends
to lead to better results than random initialization. As a control for
consistency of clustering we also present results using Agglomerative
Clustering¹³. To measure the rate of clusterization when the labels are not
known, we used three standard metrics implemented in the Scikit-learn
package¹⁴. One drawback of these metrics is that they generally favour convex
clusters over other concepts of clusters. However, convexity is not always
given. They respond poorly to elongated clusters, or manifolds with irregular
shapes.
1. Davies–Bouldin Index can be calculated by the following formula:
$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right) \tag{3.5}$$
where $\sigma_x$ is the average distance of all elements in cluster $C_x$ from
the cluster centroid $c_x$, and $d(c_i, c_j)$ is the distance between centroids
$c_i$, $c_j$. Since clusters with low intra-cluster distances (high
intra-cluster similarity) and high inter-cluster distances (low inter-cluster
similarity) will have a low Davies–Bouldin index, the smaller this number is the
better the clusterization is considered to be.
The computation of Davies–Bouldin is simpler than that of Silhouette scores.
The index is solely based on quantities and features inherent to the dataset
as its computation only uses point-wise distances.
12 https://scikit-learn.org/stable/modules/clustering.html#k-means
13 https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
14 https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
2. Calinski-Harabasz Index – also known as the Variance Ratio Criterion –
can be used to evaluate the model, where a higher Calinski-Harabasz score
relates to a model with better defined clusters. The index is the ratio of
the between-cluster dispersion to the within-cluster dispersion, summed over all
clusters (where dispersion is defined as the sum of squared distances):
\[
CH = \frac{\mathrm{tr}(B_K)}{\mathrm{tr}(W_K)} \times \frac{N - K}{K - 1} \tag{3.6}
\]
where $\mathrm{tr}(B_K)$ is the trace of the between group dispersion matrix and
$\mathrm{tr}(W_K)$ is the trace of the within-cluster dispersion matrix defined by:
\[
W_K = \sum_{k} \sum_{e \in C_k} (e - c_k)(e - c_k)^T \tag{3.7}
\]
\[
B_K = \sum_{k} (c_k - c_E)(c_k - c_E)^T \tag{3.8}
\]
with $c_E$ being the centroid of $E$.
The score is higher when clusters are dense and well separated, which relates
to a standard concept of a cluster.
3. Silhouette Coefficient value is a measure of how similar an object is to
its own cluster (cohesion) compared to other clusters (separation).
For each data point $e_i$ we define:
\[
a(e_i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, i \neq j} d(e_i, e_j) \tag{3.9}
\]
\[
b(e_i) = \min_{k \neq i} \frac{1}{|C_k|} \sum_{j \in C_k} d(e_i, e_j) \tag{3.10}
\]
We now define a silhouette (value) of one data point $e_i$:
\[
S(e_i) = \frac{b(e_i) - a(e_i)}{\max\{a(e_i), b(e_i)\}}, \quad \text{if } |C_i| > 1 \tag{3.11}
\]
Silhouette Coefficient is also higher when clusters are dense and well separated.
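To make the definitions above concrete, the following is a minimal sketch that evaluates Equation 3.5 and Equations 3.9 to 3.11 directly on a toy 2-D dataset. It is not the code used for the experiments, which rely on the Scikit-learn implementations; the helper names, the toy points and the two labellings are assumptions made purely for illustration, and the Calinski-Harabasz index is left out for brevity.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def group_by_label(points, labels):
    """Collect the points of each cluster under its label."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    return clusters


def centroid(cluster):
    """Coordinate-wise mean of a cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]


def davies_bouldin(points, labels):
    """Eq. 3.5: average over clusters of the worst (sigma_i + sigma_j) / d(c_i, c_j)."""
    clusters = group_by_label(points, labels)
    cents = {k: centroid(c) for k, c in clusters.items()}
    sigma = {k: sum(euclidean(p, cents[k]) for p in c) / len(c)
             for k, c in clusters.items()}
    worst = [max((sigma[i] + sigma[j]) / euclidean(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


def mean_silhouette(points, labels):
    """Eqs. 3.9-3.11 averaged over all points (singleton clusters score 0)."""
    clusters = group_by_label(points, labels)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # d(p, p) = 0, so summing over the whole cluster equals summing over j != i.
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the two groups

for name, labels in [("good", good_labels), ("bad", bad_labels)]:
    print(name, davies_bouldin(points, labels), mean_silhouette(points, labels))
```

On this toy data the labelling that follows the visible grouping receives a lower Davies–Bouldin index and a higher mean Silhouette value than the mixed labelling, in line with the interpretation of the two metrics given above.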
3.2.4 Information Gain from Modalities
The third pillar of our analysis is the second transparency study, which aims to
uncover how much the representations differ. We formulated it as an independence
analysis (Pillar 3) of our embeddings E_L, E_V, E_S as multivariate random variables
in Section 2.7.5. Applying Equation 2.13 to the three modalities (including the
same three assumptions), we aim to measure whether
\[
I(E_L, E_V) > I(E_L, E_S) \tag{3.12}
\]
in which case we hypothesise that there is a combination method with which
combining E_L with E_S is more efficient than using E_L + E_V, as they convey more
complementary information which can be combined. The experiment design and
the results are reported in Section 6.3. We need to estimate the empirical Mutual
Information of our vector spaces from data, which is a hard problem. In the
following we introduce the standard methods and tools we used for this purpose.
3.2.4.1 Empirical Mutual Information Estimation
Since Mutual Information is a special case of divergence (such as DKL in Equa-
tion 2.10), divergence estimators can be employed to estimate it. To recall the
definition of DKL (Equation 2.12): if p(x) and q(x) are densities then
\[
D_{KL}(p \,\|\, q) = \int_{\mathbb{R}^d} p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{3.13}
\]
The estimators then approximate Equation 2.10:
\[
I(X, Y) = D_{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y) \tag{3.14}
\]
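As a worked illustration of Equation 3.14, the sketch below computes Mutual Information for a small discrete joint distribution, where the integral of Equation 3.13 reduces to a finite sum. This discrete setting is an assumption chosen for readability; the embeddings studied here are continuous and require the estimators discussed in this section. The function name and the two example tables are hypothetical.

```python
import math


def mutual_information(joint):
    """I(X, Y) = D_KL(P_XY || P_X x P_Y) for a discrete joint table, in nats."""
    p_x = [sum(row) for row in joint]            # marginal of X (rows)
    p_y = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:                         # 0 * log(0) is taken as 0
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


# Hypothetical joint distributions over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]   # P_XY equals the product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]         # Y is fully determined by X

print(mutual_information(independent))   # zero: nothing is shared
print(mutual_information(dependent))     # log(2) nats: one full bit is shared
```

When the joint distribution factorises into its marginals the divergence vanishes, and it grows as the two variables become more dependent, which is the quantity compared in Equation 3.12.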
In our application, PX,Y is a sample from a multi-modal embedding created
by mid-fusion, whereas the marginals are the uni-modal embeddings. To estimate
the densities p(x) and q(x), the traditional approach is to use histograms with
equally sized bins [Wang et al., 2005]. However, the computational complexity
of such methods is exponential in d and the estimation accuracy deteriorates
quickly as the dimension increases. Hence, a more robust way of estimating
multidimensional Mutual Information is using k-Nearest Neighbor distances (I_KNN),
which bypasses the difficulties associated with partitioning in a high-dimensional
space [Wang et al., 2009]. This method estimates a density by computing the
average frequency of each point’s KNNs in the Euclidean ball centred around
the point. This provides a consistent estimate of DKL(p||q). In practice these
methods become unreliable in a high-dimensional space due to the sparsity of the
data objects.
To overcome this, another approach is to introduce non-linearity using a ker-
nel, when calculating the distances. In this work we use a kernel method called
the Hilbert-Schmidt Independence Criterion (HSIC) algorithm [Gretton et al.,
2005], because it has been shown to work in practical applications [Jitkrittum
et al., 2017].
Consider a reproducing kernel Hilbert space $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathbb{R}$.
To each point $X \in \mathcal{X}$, there corresponds an element $\phi(X) \in \mathcal{F}$ such that
$\langle \phi(X), \phi(X') \rangle_{\mathcal{F}} = k(X, X')$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a unique positive
definite kernel.
Then the HSIC estimate is given by the following:
\[
I_{HSIC}(X, Y) = \| C_{X,Y} \|_{HS}, \tag{3.15}
\]
where $\| \cdot \|_{HS}$ is the Hilbert-Schmidt norm and $C_{X,Y}$ is a cross-covariance operator
between X and Y:
\[
C_{X,Y} = \mathbb{E}_{X,Y}\left( [k_1(\cdot, X) - \mu_X] \otimes [k_2(\cdot, Y) - \mu_Y] \right) \tag{3.16}
\]
where $\mu_X = \mathbb{E}_X[k_1(\cdot, X)]$ and $\mu_Y = \mathbb{E}_Y[k_2(\cdot, Y)]$ are the mean embeddings of X
and Y respectively to a Reproducing Kernel Hilbert Space. $k_1$ and $k_2$ are kernels
on $\mathcal{X}$ and $\mathcal{Y}$ respectively. For more details on the theoretical background see
[Gretton et al., 2005].
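For intuition, the following is a minimal sketch of the biased empirical estimator from [Gretton et al., 2005], $(n-1)^{-2}\,\mathrm{tr}(KHLH)$, which estimates the squared Hilbert-Schmidt norm of the cross-covariance operator from $n$ paired samples. It is not the ITE implementation; the Gaussian kernel, the bandwidth of 1.0, the function names and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def gaussian_gram(samples, bandwidth=1.0):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [[math.exp(-sqdist(p, q) / (2.0 * bandwidth ** 2)) for q in samples]
            for p in samples]


def center(gram):
    """Double-centre a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(col) / n for col in zip(*gram)]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(xs, ys, bandwidth=1.0):
    """Biased HSIC estimate tr(K H L H) / (n - 1)^2 for paired samples xs, ys."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, bandwidth))
    l_gram = gaussian_gram(ys, bandwidth)
    # tr(K H L H) = tr((H K H) L) = sum_ij (H K H)_ij * L_ji
    trace = sum(k_centered[i][j] * l_gram[j][i]
                for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Hypothetical 1-D samples: ys_dep is a noisy copy of xs, ys_ind is unrelated.
rng = random.Random(0)
xs = [(rng.gauss(0.0, 1.0),) for _ in range(60)]
ys_dep = [(x[0] + 0.1 * rng.gauss(0.0, 1.0),) for x in xs]
ys_ind = [(rng.gauss(0.0, 1.0),) for _ in range(60)]

print("dependent:  ", hsic(xs, ys_dep))
print("independent:", hsic(xs, ys_ind))
```

The estimate for the dependent pair should be clearly larger than for the independent pair. In practice the choice of kernel and bandwidth matters, and a significance test is needed before reading a small non-zero value as evidence of dependence.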
We apply an open source Python implementation of the above algorithms
from the Information Theoretical Estimators Toolbox15 [Szabó, 2014].16
Figure 3.4: Roadmap of analyses. On the top: Pillar 1, Performance testing: broad
comparison across data sources, ML models and modalities. Based on this, S_L, S_V, S_S
are narrowed down to a particular combination of model and data source. Following,
in the middle: Pillar 2, structural cluster analysis to discover embedding concepts.
At the bottom: Pillar 3, independence analysis of embeddings.
3.3 Analysis Scheme
Figure 3.4 represents a roadmap of our three-pillar analysis.17 On the top: Pillar
1, Performance testing: a broad comparison across data sources, ML models and
modalities, which will be presented in Chapter 4. Based on this, S_L, S_V, S_S are
narrowed down to a particular combination of model and data source. In Chapter 5
we shift our focus to a more in-depth analysis of fewer models, based on the
findings of the previous blanket studies. Here, we restrict ourselves to behavioural
tests, but we inspect our models in a more fine-grained fashion, regarding size
and distribution ranges. Following, in the middle: Pillar 2, structural cluster
analysis to discover embedding concepts. At the bottom: Pillar 3, independence
analysis of embeddings. Chapter 6 includes the two parts of our transparency
analysis. Here, we focus on the structure of each embedding type E_L, E_S and
E_V. Lastly, we measure the information gain E_S and E_V entail when combined
with E_L.
Narrowing the umbrella studies down to a few model, data and modality combinations
is another layer on top of the three-pillar analysis framework. However,
this layer is not necessary for our proposed evaluation methodology. Costly
large-scale studies with numerous current models would soon become
obsolete. Our aim is rather to provide a general framework with proof-of-concept
studies, which can be applied to various models in the future.
15 https://bitbucket.org/szzoli/ite
16 We would also like to thank Zoltán Szabó for his counsel on the theoretical background.
17 Icons made by Freepik, Smashicons, Good Ware, Eucalyp and Becris from https://flaticon.com/authors/<author name>. Voronoi diagrams were generated using http://alexbeutel.com/webgl/voronoi.html.
Chapter 4
Impact of Visual Information in
Semantics
This chapter covers experiments which form an implementation of Pillar 1, Performance
testing (Figure 3.4, on top). We cover experiments towards a comprehensive
analysis of models across data sources, machine learning models and modalities.
We introduce the implementation of a new structured hybrid modality, which is based
on small data and lies between low-level visual information and high-level linguistic,
symbolic data. We use evaluations which we refer to as black-box testing, since they
look only at performance numbers. However, by performing a broad study
we aim to offer a more comprehensive analysis of multi-modal studies than
previous work.
The experiments are designed to address our research Questions 1, 2 and 3,
laid out in Chapter 1.1. To recap and frame them in our Semantic Embedding
model framework (Section 2.6):
1. How does the source of images D_V affect the performance of multi-modal
semantic representations?
2. Does the number of images have an impact on performance? – Variability
of the visual extraction function X_V.
3. Do previous findings on complementary visual information scale to different
types and sizes of linguistic corpora? – Variability of observable context
data O_L, O_V, O_S and introducing a new extraction function for structured
data X_S.
In Section 4.1 we present a systematic study of the performance of state-of-the-art
image data sources and CNN architectures, and measure the impact of
image quantity (Questions 1 and 2). In Section 4.2 we introduce a new embedding
type based on a visually structured, textual data source, the Visual Genome Scene
Graphs [Krishna et al., 2016], and show preliminary studies of its performance
as a sanity check. In Section 4.3 we present a broader analysis involving the
models from the previous sections, extended with new ones. We tackle Question 3
by comparing several data sources of different sizes and modalities. Section 4.4
involves a study on how pretrained word embedding initialisation affects sequence
model performance on textual entailment.
4.1 Comparing Visual Models and Data
Sources for Semantics
This section focuses on the analysis of E_L + E_V type multi-modal word embeddings
with mid-fusion and various Convolutional Neural Network based E_V visual
representations. The study explores the following questions regarding semantic
similarity and relatedness tasks:
1. How important is the source of images D_V? Is there a difference between
search engines and manually annotated data sources?
2. How should we aggregate the image representations for a search key into
one visual representation? – Post-processing part of the visual mapping
function A_V.
3. Does the number of images obtained for each search key matter? – Variability
of the visual extraction function X_V.
4. Does the choice of the CNN architecture have an impact on the performance
of visual and multi-modal models? – ML algorithm part of the visual mapping
function A_V.
To address the first question, we decided to use different search engines and
other existing image datasets. For that purpose, we extended Douwe Kiela’s
MMFeat toolkit1 with an API for the Flickr search engine. Later on we continued
working on a joint project addressing the above questions in multi-modal
distributional word semantics. The results have been published in an EMNLP
long paper [Kiela et al., 2016].2 In this project, we systematically compared deep
visual representation learning techniques, experimenting with three well-known
network architectures, AlexNet, GoogLeNet and VGGNet (see Section 2.3.1). In
addition, we explored the various data sources (described in Section 3.1.1) that
can be used for retrieving relevant images, showing that images from search en-
gines perform as well as, or better than, those from manually crafted resources
such as ImageNet. Furthermore, we explored the optimal number of images and
the multi-lingual applicability of multi-modal semantics.
4.1.1 Evaluation
We employ the behavioural evaluation tasks described in detail in Section 3.2.1. In
summary, model performance is assessed through the Spearman ρ_s rank correlation
between the system’s similarity scores for given pairs of words and the corresponding
human judgements. We evaluate on two well-known similarity and relatedness
judgement datasets: MEN [Bruni et al., 2014] and SimLex-999 [Hill et al.,
2015].
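The evaluation procedure can be sketched as follows. This is a minimal, self-contained illustration rather than the evaluation code behind the reported numbers: it assumes cosine similarity as the system's similarity score, and the toy embedding, the word pairs and their human scores are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def evaluate(embedding, gold_pairs):
    """Correlate model similarities with human scores on the covered pairs."""
    model, human = [], []
    for w1, w2, score in gold_pairs:
        if w1 in embedding and w2 in embedding:
            model.append(cosine(embedding[w1], embedding[w2]))
            human.append(score)
    return spearman(model, human)


# Hypothetical 2-D embedding and human judgements, for illustration only.
embedding = {"cat": [1.0, 0.1], "dog": [0.9, 0.2],
             "car": [0.1, 1.0], "bus": [0.2, 0.9]}
gold_pairs = [("cat", "dog", 9.0), ("cat", "car", 2.0), ("dog", "bus", 3.0)]

print(evaluate(embedding, gold_pairs))
```

Pairs with a word missing from the embedding are skipped in this sketch, so the coverage of a model affects which pairs enter the correlation.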
In each experiment, we examine performance of the visual representations
compared to text-based representations, as well as performance of the multi-
modal representation that fuses the two. In this case, we apply mid-level fusion
– a popular technique in multi-modal semantics (described earlier) – concate-
nating the L2-normalized representations. Linguistic representations are 300-
dimensional and are obtained by applying skip-gram with negative sampling to
a 2013 dump of Wikipedia. Visual vectors based on AlexNet and VGGNet are
both 4096-dimensional, GoogLeNet vectors are of 1024 dimensions. The normal-
ization step that is performed before applying fusion ensures that both modalities
contribute equally to the overall multi-modal representation.
We evaluated the different architectures and data sources using either the
mean or elementwise maximum method for aggregating image representations
into visual ones (A_V post-processing). However, we found no significant difference
between these two methods.
1 https://github.com/douwekiela/mmfeat
2 I implemented the Flickr API and all the data collection, experiments and evaluations
presented in this thesis.
4.1.2 Results
Figure 4.1: The effect of the number of images on representation quality.
We found that multi-modal representation learning yields better performance
across the board: for different network architectures, different data sources and
different aggregation methods (Figure 4.1).
We examined AlexNet, GoogLeNet and VGGNet, all three winners of the
ILSVRC ImageNet classification challenge, and found that they perform very
similarly. If efficiency or memory are issues, AlexNet or GoogLeNet are the most
suitable architectures. For overall best performance, AlexNet and VGGNet are
the best choices.
The choice of data sources has a bigger impact: Google, Bing, Flickr and
ImageNet were much better than the ESP Game dataset. Google, Flickr and
Bing have the advantage that they have potentially unlimited coverage. Google
and Bing are particularly suited to full-coverage experiments, even when these
include abstract words [Kiela et al., 2016].
Another question is the number of images we want to use: does performance
increase with more images? There is an obvious trade-off here, since downloading
and processing images takes time (and may incur financial costs). This experi-
ment only applies to relevance-sorted image search data sources. We found that
the number of images has an impact on performance, but that it stabilizes at
around 10-20 images, indicating that it is usually not necessary to obtain more
than 10 images per word. For Flickr, obtaining more images is detrimental to
performance. The effect of the number of images on the performance is shown in
Figure 4.1.
4.1.3 Conclusion
This work explores some important factors for choosing visual models and data
sources for multi-modal semantics. It is important to note that the multi-modal
results only apply to the mid-level fusion method of concatenating normalized
vectors: although these findings are indicative of performance for other fusion
methods, different architectures or data sources may be more suitable for different
fusion methods.
Understanding what it is that makes these representations perform so well is
another important question. Is it more data or the multi-modal nature of the
data which is increasing performance? Building on these preliminary findings, in
Section 4.3 we explore a broader range of factors which may shed more light on
the behaviour of visual models in multi-modal semantics.
4.2 Visual Context in the Linguistic Domain
Despite the indisputable success of data driven methods in NLP, humans’ ability
to generalise after having been exposed to only a small amount of data provides
motivation to further explore alternative machine learning methods. An appeal-
ing option is to exploit structured prior information combined with multi-modal
input. There is a need for more work on applying and automatically acquiring
structured prior information that can help us to take a step towards human level
and interpretable language generation and understanding.
The second key contribution (II.) of this thesis is the introduction and analysis
of a new modality (Section 2.5). The study presented here aims to explore
the possibilities for learning semantic word representations based on structured
and visually grounded prior information. This way we further explore the types
of text corpora we use, expanding on Question 3.
We use the Visual Genome (VG) dataset’s scene graphs and bounding boxes as
structured training data (introduced in Section 3.1.2). Visual Genome images are
annotated with region graph representations of objects, attributes, and pairwise
relationships. These region graphs are combined to form a scene graph
with all the objects grounded to the image (see Figure 3.2).
The main questions this work aims to examine are the following: What
information comes from (structured) image data? Is it the high-level information
of visual scene structure which enhances linguistic information, or do low-level
visual features matter as well?
4.2.1 Scene Graph Context
We introduce a new Semantic Embedding model SS. There could be many ways
to incorporate structured, visually grounded prior information from VG, such
as using graph neural networks [Scarselli et al., 2008] as part of the mapping
function AS. In this work, we implemented a much simpler method in order to
see if a small, fast to train model performs well. Instead of developing a new
mapping function, we introduce a new extraction function XS, which extracts
the relevant context information from the scene graphs then feeds it into a simple
shallow-network as AS.
Using the scene graph annotations as a corpus, X_S takes as input the whole
scene graph dataset D_S and returns “relevant” context items from O_S for each
target element from T. That is, it returns a mapping from target/context item
pairs to numbers in $\mathbb{N}$, representing a relevance score of context pairs:
$X_S : T \to (O_S(T) \to \mathbb{N})$. In this case the score is a binary number representing whether a
context node $o \in O_S$ is in the graph neighbourhood of the surface representations
of $t \in T$. The relevant context corresponds to a radius in this graph around an
object or predicate node. The radius is the number of steps we take starting
from a node in a breadth-first search manner. The context words are all the node
labels within this sub-graph. Algorithm 1 presents the pseudocode for the Scene
Graph Context Generation Algorithm. G denotes the scene graph and rad is the
radius. It returns a list of word, context pairs $[\langle t_1, o_1 \rangle, \ldots, \langle t_n, o_n \rangle]$. Each node
in G can have several word labels or “names” (e.g. elephant and animal can be names
of the same object node). We take all the combinations of the node names
of two nodes which are in each other’s context. This operation is denoted by the
direct product of the two name lists, ×. E.g., if node {elephant, animal} is in
the neighbourhood of a node with label {sleep}, then we generate the context pairs
[⟨elephant, sleep⟩, ⟨animal, sleep⟩].
In this case the mapping function A_S is a Skip-gram algorithm [Mikolov
et al., 2013b], which maps from context items to a word embedding space
$E_S \in \mathbb{R}^{|T| \times d_S}$, $d_S = 300$. Figure 4.2 shows an example of creating contexts for
embeddings from Visual Genome Scene Graphs. The context words (orange) used are
up to three links from a target node (black).
Algorithm 1: Scene Graph Context Generation Algorithm
Input: G, rad
Result: contexts = [⟨t_1, o_1⟩, ..., ⟨t_n, o_n⟩]
for node ∈ G do
    context_nodes = breadth_first_traverse(node, rad);
    for cnode ∈ context_nodes do
        contexts += [node.names × cnode.names]
    end
end
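Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the code used to build the embeddings: it assumes the scene graph is given as an undirected adjacency dictionary, that the start node itself is excluded from its own context, and that node identifiers and names are the toy values shown.

```python
from collections import deque
from itertools import product


def breadth_first_traverse(graph, start, rad):
    """Nodes reachable from `start` in at most `rad` steps (start excluded)."""
    seen = {start}
    frontier = deque([(start, 0)])
    reached = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == rad:
            continue  # do not expand beyond the radius
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                reached.append(neighbour)
                frontier.append((neighbour, depth + 1))
    return reached


def generate_contexts(graph, names, rad):
    """Return (target, context) word pairs for every node and its neighbourhood."""
    contexts = []
    for node in graph:
        for cnode in breadth_first_traverse(graph, node, rad):
            # Direct product of the two name lists, as in Algorithm 1.
            contexts.extend(product(names[node], names[cnode]))
    return contexts


# Hypothetical toy scene graph: an object, a predicate and a second object.
graph = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2"]}
names = {"n1": ["elephant", "animal"], "n2": ["sleep"], "n3": ["grass"]}

for pair in generate_contexts(graph, names, rad=1):
    print(pair)
```

With a radius of one the elephant node only sees the predicate, so ⟨elephant, sleep⟩ and ⟨animal, sleep⟩ are generated but ⟨elephant, grass⟩ is not; raising the radius to two adds it. The resulting pairs would then be passed to a Skip-gram trainer.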
Visual Genome scene graphs have been used for word meaning representa-
tions [Kuzmenko and Herbelot, 2019, Herbelot, 2020]. They build a truth the-
oretic model including predicate / entity pairs before feeding it to a skip-gram
model. Our method is more relaxed since we directly process the Scene Graphs
into contexts of a given size (radius), without any further restriction based on
grammatical information. The results are compared in Section 5.2.4.
This model is linguistic in a sense that it only uses text context in the graph
neighbourhood, without grounding it to visual features. However, it still uses
visual information implicitly, since the graph represents relationships in visual
scenes.
Different versions of the above model are compared to the following baselines:
1. w2v-wikipedia: A traditional skip-gram model trained on a 2013 dump of Wikipedia.
2. w2v-descriptions: A skip-gram model trained on the Visual Genome image
descriptions.
Figure 4.2: Generating contexts for embeddings from Visual Genome Scene
Graphs. The context words (orange) used are up to three links from a target
node (black). The <target, context word> pairs are then fed to a Skip-gram
algorithm. Photos are from https://visualgenome.org/
For evaluation we perform the following intrinsic and extrinsic tests:
• Semantic relatedness/similarity on the MEN [Bruni et al., 2014], SimLex
[Hill et al., 2015] and SimVerb [Gerz et al., 2016] datasets.
• Brain data: Predicting patterns of brain activity associated with the meaning
of nouns, making use of two datasets: one with fMRI (Functional Magnetic
Resonance Imaging) [Mitchell et al., 2008] and one with MEG
(Magnetoencephalography) [Sudre et al., 2012] (see Section 4.3.4).
4.2.2 Results
Table 4.1 shows some preliminary results using Scene Graph context, that is, context
based on the proximity of words in the Visual Genome Scene Graph. N in “radN”
indicates the number of steps we take around a node in a breadth-first search
manner. The context words are all the node labels within this radius. Results
are shown for both lemmatised and non-lemmatised versions of the scene graph
corpus. There is no substantial difference after using this preprocessing step
(non-lemmatised versions even perform slightly better on MEN and SimLex), therefore
we do not lemmatise in the following experiments. Using a radius of three, our
model outperforms the w2v-wikipedia and w2v-description baselines on
SimLex, but it performs worse on the other datasets.
Lemmatised  Method           MEN    SimLex  SimVerb
No          VG rad3          0.433  0.274    0.008
No          w2v-wikipedia    0.680  0.238    0.149
Yes         VG rad3          0.433  0.274    0.132
Yes         w2v-wikipedia    0.673  0.257    0.134
No          VG rad1          0.211  0.16    -0.031
No          w2v-wikipedia    0.680  0.238    0.238
Yes         VG rad1          0.206  0.154    0.040
Yes         w2v-wikipedia    0.673  0.257    0.134
Yes         w2v-description  0.427  0.289    0.127
Table 4.1: Pearson correlations of the different versions of the model and the
Skip-gram baseline on the MEN, SimLex and SimVerb datasets. N in “radN”
indicates the number of steps we take around a node in a breadth-first search
manner. The context words are all the node labels within this radius. Results
are shown for both lemmatised and non-lemmatised versions of the scene graph
corpus.
Further results on behavioural tasks and brain imaging datasets are discussed
in Section 4.3.
4.2.3 Conclusion
Based on these preliminary results, using structured small data is a promising
area to explore. Despite its small size, structured training data can achieve results
comparable to our big-corpus-based baseline. Collecting such data by manual labour
is expensive, but it is probably worthwhile to explore crowd-sourced, gamified
or even (semi-)automatic techniques [Xu et al., 2020] for collecting structured
training data. In the next section we report on a broader-scale analysis of various
models, including the ones we introduced in this section and in Section 4.1.
4.3 Modalities, Sources and Models: a
Thorough Analysis
In the previous sections we investigated the impact of visual models and data
sources for non-visual evaluations. We compared different convolutional networks
for visual embeddings and different image sources. We also experimented with a
“small-data” based embedding, using structured information somewhere between
the visual and the linguistic domains.
There are two main problems, however, which the multi-modal literature (including
the above studies) suffers from:
1. Evaluation datasets that are too small and probably not well formed [Faruqui et al.,
2016].
2. A lack of standardized comparative studies involving many different models.
The first problem is a challenging one due to the cost of data collection.
Traditional semantic similarity and relatedness tasks can provide a good starting
point to evaluate word semantics, but we certainly need a more thorough analysis
if we really want to compare semantic embedding spaces. Recently, the NLP
community started evaluating on Brain imaging data as well (see Section 3.2.2),
in the hope of learning about the relationship between word embeddings and brain
activation of people while thinking of corresponding concepts. These datasets are
relatively expensive to create, hence they are not very large. While evaluating on
them can provide interesting insights, we should be cautious when drawing
conclusions from these results.
In the following study we use both semantic similarity / relatedness and brain
datasets for evaluation. Unlike previous work, however, we try to take a further
step towards a more in-depth analysis of the results, to filter out the potential noise
we face in these experiments, coming from different models and small evaluation
sets.
As for the second problem, multi-modal models are usually compared to only
one linguistic baseline and, with the possible exception of our study in Section 4.1,
only one visual source / model combination. Here, we present a broader study involving
several different visual and linguistic embeddings in order to get a better picture
of the variance we have in performance, tackling our Question 3.
All the experiments have been implemented as part of the EmbEval toolkit
(see Section 1.3), including the creation of uni-modal embeddings as well as new
mid-fusion techniques (described in Section 4.3.2).
4.3.1 Studied Embeddings
In the following we summarise the parameters of the studied Semantic Embedding
models, which were described in detail in Chapters 2 and 3.
4.3.1.1 Linguistic Embeddings
For the S_L models we use pretrained embeddings from the FastText system
[Mikolov et al., 2018]. Each model has been trained on different sources D_L:
1. wiki-news-300d-1M: 1 million word vectors trained on Wikipedia 2017,
UMBC webbase corpus and statmt.org news dataset (16B tokens).
2. wiki-news-300d-1M-subword: 1 million word vectors trained with subword
information on Wikipedia 2017, UMBC webbase corpus and statmt.org news
dataset (16B tokens).
3. crawl-300d-2M : 2 million word vectors trained on Common Crawl (600B
tokens).
Furthermore, for comparison with earlier works we also use the same Skip-Gram
model as in Section 4.1, trained on a Wikipedia dump from 2013.
4.3.1.2 Visual Embeddings
Based on the findings in Section 4.1 we test the following datasets and models
for SV :
Image Source DV We use Google Images as a source, as it had a stable per-
formance across models, and is widely used. We compare this big data source to
visual representations trained on Visual Genome Images. This way we compare
a big data source to a smaller, but systematically annotated dataset.
ML part of A_V For CNN models we use AlexNet, the best and fastest model
according to our previous findings. Since publishing the results in Section 4.1 a new CNN
architecture, called Deep Residual Network (ResNet) [He et al., 2016], appeared,
which is the current state-of-the-art in object recognition on images both in terms
of classification accuracy and speed. Therefore, in this broader study we included
this model as well. We also compare two AlexNet models trained on Visual
Genome: one on the images of the internal object bounding boxes and one on the
whole images, similarly to [Davis et al., 2019].3
Post-processing part of AV Since our findings in [Kiela et al., 2016] suggest
no obvious di↵erence between the two methods, here we only use the mean of
image embedding vectors (as opposed to taking the maximum) to create one
visual representation for a word.
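The two aggregation options mentioned here amount to a column-wise mean or a column-wise maximum over the image vectors retrieved for one word. The following minimal sketch shows both; the function names and the three 3-dimensional vectors are hypothetical and stand in for the much higher-dimensional CNN features.

```python
def aggregate_mean(image_vectors):
    """Elementwise mean of the image vectors retrieved for one word."""
    n = len(image_vectors)
    return [sum(dims) / n for dims in zip(*image_vectors)]


def aggregate_max(image_vectors):
    """Elementwise maximum of the image vectors retrieved for one word."""
    return [max(dims) for dims in zip(*image_vectors)]


# Hypothetical CNN features of three images for the same word.
image_vectors = [[0.0, 2.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 4.0]]

print(aggregate_mean(image_vectors))   # [1.0, 1.0, 2.0]
print(aggregate_max(image_vectors))    # [2.0, 2.0, 4.0]
```

The mean smooths over individual images, whereas the maximum keeps the strongest activation per dimension and is therefore more sensitive to a single outlying image.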
Extraction function XV Furthermore, since after 10-20 images the perfor-
mance plateaus across the board, in this study we always use 10 images for each
word representation.
4.3.1.3 Structured Embeddings
We analyse the X_S version from Section 4.2 in which we take three steps around a
node in a breadth-first search manner.
4.3.2 Mid-fusion methods
To create multi-modal embeddings using mid-fusion we applied two methods:
1. Intersection: Similarly to previous work a multi-modal embedding is the
concatenation of visual and linguistic vectors. Therefore, we only have
representations for the intersection of their vocabularies. This is mainly
relevant in the case of Visual Genome, where we might not have full coverage
(as opposed to Google).
3The training of the models has been done by Christopher Davis. In this paper, I provided
supervision with the experiments, help with using MMFeat and helper code for processing
Visual Genome.
2. Padding : In order to have full coverage in every case, in this method if one
modality does not cover a word in the vocabulary we just pad the multi-
modal vector with as many zeros as the dimensionality of the modality
space with the missing vector. This way we have multi-modal embeddings
for all the words in the intersection of their vocabularies, and uni-modal
vectors, where one of the modalities failed to cover the word.
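The two mid-fusion variants can be sketched as follows. This is an illustration under stated assumptions rather than the EmbEval implementation: embeddings are plain dictionaries from words to vectors, each modality is L2-normalised before concatenation (as described for mid-fusion in Section 4.1.1), and the toy vocabularies are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_intersection(ling, vis):
    """Concatenate modalities only for words covered by both embeddings."""
    return {w: l2_normalize(ling[w]) + l2_normalize(vis[w])
            for w in ling if w in vis}


def fuse_padding(ling, vis):
    """Concatenate modalities for all words, zero-padding a missing modality."""
    dim_l = len(next(iter(ling.values())))
    dim_v = len(next(iter(vis.values())))
    fused = {}
    for w in set(ling) | set(vis):
        left = l2_normalize(ling[w]) if w in ling else [0.0] * dim_l
        right = l2_normalize(vis[w]) if w in vis else [0.0] * dim_v
        fused[w] = left + right
    return fused


# Hypothetical embeddings: "idea" has no visual vector.
ling = {"cat": [3.0, 4.0], "idea": [1.0, 0.0]}
vis = {"cat": [0.0, 2.0, 0.0]}

intersection = fuse_intersection(ling, vis)
padded = fuse_padding(ling, vis)
print(sorted(intersection))   # ['cat']
print(sorted(padded))         # ['cat', 'idea']
print(padded["idea"])         # [1.0, 0.0, 0.0, 0.0, 0.0]
```

With padding, a word missing from one modality keeps a usable vector of the full multi-modal dimensionality, at the cost of contributing nothing from the missing modality to any similarity computation.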
4.3.3 Evaluation Methods
Evaluation of word embeddings on similarity tasks has been shown to be prob-
lematic due to 1) the lack of train/development/test splits, 2) the absence of
statistical significance, 3) low correlation with downstream performance, 4) the
hubness problem and 5) their inability to account for polysemy [Faruqui et al.,
2016]. To tackle the first problem we performed three-way cross-validation on
MEN and SimLex, leaving out one third of the word pairs randomly. Based on
the results – reported in Appendix A – we present correlation figures to two
decimal places. As for the second issue, we present a series of detailed evaluation
methods in the next chapters, which aim to unearth the reasons behind
the behaviour of our models beyond correlation, and we report
p-values for every correlation score. 4) and 5) are addressed in Chapters 5 and 6.
As we discussed in Section 2.1.2, in this work, we view semantic space analysis
as a statistical tool for dataset analysis which provides value on its own without
downstream applications, therefore 3) is beyond the scope of this thesis.
We cannot directly compare models trained on different data sources, because
they have different coverage, but we can look at absolute performance and
compare network architectures and modalities. We also present results on the
common subset of the evaluation datasets, where all word pairs have images in
each of the data sources.
Results on the Brain datasets are analysed averaged over participants for em-
bedding comparison. We present further analyses, where results are averaged over
modalities, therefore we can focus more on the variability between participants.
4.3.3.1 Concreteness
Concreteness of words has been studied before in the context of multi-modal
semantics and for Brain imaging evaluation. Kiela et al. [2014] applied
a dispersion metric on the visual domain to filter out words with image results
which are noisier than a threshold, based on their metric. They hypothesised that
abstract words have higher, whereas concrete words have lower image dispersion.
Anderson et al. [2017] systematically selected word categories
for their Italian dataset based on concreteness.
In this work we developed an automatic concreteness score based on WordNet.
The concreteness score of a word is its distance (one minus similarity) from its
root hypernyms in the Synset graph.
Since in WordNet we have multiple synsets for one surface form we compare
two di↵erent techniques to aggregate each sysnets’ distances from the root:
1. Taking the median of all sysnet’s distances for a word.
2. Selecting the synset with the maximum distance from the root, so we have
the most concrete sense of the word.
Hence, the formula for our WordNet concreteness score is:
$$\mathrm{WNConc}(w) = \mathrm{Agg}_w \left[\, d(s_i, r_i) \mid i \in \{1, \dots, N_w\} \,\right], \tag{4.1}$$
where $\mathrm{Agg}_w(\cdot)$ is the synset aggregation method, $d(\cdot, \cdot)$ is the WordNet distance,
and $w$ is a word. $s_i$ are the synsets for $w$ and $r_i$ are the roots of each synset in the
WordNet hypernym hierarchy. $N_w$ is the number of synsets for word $w$.
Another question is how to combine the concreteness scores of the two words
in a word pair for the behavioural tasks. We present two methods to do this:
1. Taking the sum of the two words’ concreteness scores.
2. Taking the absolute difference of the two words’ concreteness scores.
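The two aggregation steps above (over the synsets of a word, then over the two words of a pair) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names `wn_conc` and `pair_conc` are hypothetical, the per-synset distances are assumed to be precomputed from WordNet, and the numbers are made up.

```python
from statistics import median


def wn_conc(synset_distances, agg="median"):
    """Aggregate the root distances d(s_i, r_i) of a word's synsets (Eq. 4.1)."""
    if not synset_distances:
        raise ValueError("word has no synsets")
    if agg == "median":
        return median(synset_distances)
    if agg == "max":  # keep the most concrete sense of the word
        return max(synset_distances)
    raise ValueError(f"unknown aggregation: {agg}")


def pair_conc(conc_a, conc_b, method="sum"):
    """Combine the concreteness scores of the two words in a word pair."""
    if method == "sum":
        return conc_a + conc_b
    if method == "abs_diff":
        return abs(conc_a - conc_b)
    raise ValueError(f"unknown combination method: {method}")


# Hypothetical per-synset distances, for illustration only.
dog_distances = [0.9, 0.8, 0.5]
idea_distances = [0.3, 0.4]

dog = wn_conc(dog_distances, agg="median")
idea = wn_conc(idea_distances, agg="max")
pair_sum = pair_conc(dog, idea, method="sum")
pair_diff = pair_conc(dog, idea, method="abs_diff")
```

Keeping the two steps separate makes it easy to cross both synset aggregations with both pair combinations, which is the grid of settings compared in the concreteness results.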
4.3.3.2 Qualitative Analysis on Nouns of the Brain Datasets
Lastly, we performed qualitative analysis regarding the 60 nouns in the Brain
evaluation datasets. Looking at the word concreteness scores did not show any
pattern, but this is unsurprising, since this dataset already consists of mainly
concrete nouns.
Instead, in this work we included an analysis of the relationship between all
studied models in terms of their performance for individual words, averaged over
participants (Figures 4.5 and 4.6). Even though this evaluation set is small in
terms of vocabulary size, it still can be useful for looking into the nuances we
may find regarding individual concepts.
4.3.4 Results
The tables in this section show evaluation scores for each task using different
versions of the evaluation methods. The notation for all tables is the following: each
line corresponds to an embedding. Separator lines divide embeddings by modality:
Linguistic EL, Visual EV, Structured ES, and Multi-modal models EL+EV
and EL+ES. wikinews, wikinews sub and crawl signify FastText vectors trained
on the corresponding corpora. w2v13 is a Skip-Gram model trained on a 2013
Wikipedia dump. The names of Visual Embeddings trained on Google images are in
the format <image source> <CNN model>. VG-internal|whole denotes
training on Visual Genome images, either on the internal object images or on the
whole images, as is done in [Davis et al., 2019]. Finally, VG SceneGraph stands
for the Visual Genome Scene Graph Embeddings from Section 4.2. Multi-modal
embeddings have a “+” in their names which separates the two embedding names
they are built on.
Red colour indicates best performance, blue means that the multi-modal em-
bedding outperformed the corresponding uni-modal ones. In case of aggregated
results for each modality, best performance is signified by bold font.
4.3.4.1 Correlations on the Behavioural Tasks
Tables 4.2-4.7 present the standard Spearman’s correlation scores of different
embeddings on the Semantic Similarity and Relatedness tasks. Tables 4.2-4.5
present results on the full datasets; Tables 4.6 and 4.7 include results on the
embeddings’ common coverage subsets. Results using the padding mid-fusion method
are shown in Tables 4.2 and 4.3; results applying the intersection method are
presented in Tables 4.4, 4.5, 4.6 and 4.7. Except for the relatively small common
subset on SimLex (Table 4.7), the crawl linguistic embedding outperforms all the
others. Multi-modal models outperform uni-modal ones mainly on SimLex, but
only in the case of the 2013 Skip-Gram model, which is in line with previous
results in Section 4.2. The only multi-modal model outperforming uni-modal
ones on MEN is the combination of w2v13 and VG Scene Graph.
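To make the Spearman and Coverage columns of the following tables concrete, the sketch below scores one embedding on a list of human-rated word pairs. It is an illustration under stated assumptions rather than the EmbEval code: cosine similarity is assumed as the model's similarity measure, the helper names (`ranks`, `spearman`, `cosine`, `evaluate`) are hypothetical, and the toy vectors and ratings are invented.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def evaluate(embedding, rated_pairs):
    """Return (Spearman's rho, coverage) over the pairs the embedding covers."""
    covered = [(a, b, r) for a, b, r in rated_pairs
               if a in embedding and b in embedding]
    model = [cosine(embedding[a], embedding[b]) for a, b, _ in covered]
    human = [r for _, _, r in covered]
    return spearman(model, human), len(covered)


# Toy data, for illustration only.
toy_embedding = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b", 9.0), ("a", "c", 1.0), ("a", "d", 5.0), ("a", "zzz", 3.0)]
rho, coverage = evaluate(toy_embedding, toy_pairs)
```

Pairs with an out-of-vocabulary word are dropped before scoring, which is why coverage differs between embeddings and why scores on different coverages are not directly comparable.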
Modality Embedding Spearman P-value Coverage
EL
wikinews 0.79 0 3000
wikinews sub 0.80 0 3000
crawl 0.85 0 3000
w2v13 0.68 0 3000
EV
Google AlexNet 0.50 0 3000
Google VGG 0.51 0 3000
VG-internal 0.37 0 2784
VG-whole 0.41 0 2784
Google ResNet-152 0.47 0 3000
ES VG SceneGraph 0.42 0 2574
EL + EV
wikinews+Google AlexNet 0.50 0 3000
wikinews+Google VGG 0.51 0 3000
wikinews+VG-internal 0.36 0 3000
wikinews+VG-whole 0.39 0 3000
wikinews+Google ResNet-152 0.48 0 3000
wikinews sub+Google AlexNet 0.50 0 3000
wikinews sub+Google VGG 0.51 0 3000
wikinews sub+VG-internal 0.36 0 3000
wikinews sub+VG-whole 0.39 0 3000
wikinews sub+Google ResNet-152 0.47 0 3000
crawl+Google AlexNet 0.51 0 3000
crawl+Google VGG 0.52 0 3000
crawl+VG-internal 0.37 0 3000
crawl+VG-whole 0.40 0 3000
crawl+Google ResNet-152 0.51 0 3000
w2v13+Google AlexNet 0.50 0 3000
w2v13+Google VGG 0.51 0 3000
w2v13+VG-internal 0.36 0 3000
w2v13+VG-whole 0.40 0 3000
w2v13+Google ResNet-152 0.48 0 3000
EL + ES
w2v13+VG SceneGraph 0.64 0 3000
crawl+VG SceneGraph 0.78 0 3000
wikinews sub+VG SceneGraph 0.37 0 3000
wikinews+VG SceneGraph 0.57 0 3000
Table 4.2: Spearman correlation on the MEN dataset. Multi-modal embeddings
are created using the Padding technique. The table sections contain linguistic,
visual and multi-modal embeddings in this order. Red colour signifies the best
performance. Blue would mean that the multi-modal embedding outperformed
the corresponding uni-modal ones, which here did not happen.
Modality Embedding Spearman P-value Coverage
EL
wikinews 0.45 0 999
wikinews sub 0.44 0 999
crawl 0.50 0 999
w2v13 0.31 0 999
EV
Google AlexNet 0.34 0 999
Google VGG 0.34 0 999
VG-internal 0.31 0 103
VG-whole 0.19 0.06 103
Google ResNet-152 0.35 0 999
ES VG SceneGraph 0.26 0 593
EL + EV
wikinews+Google AlexNet 0.34 0 999
wikinews+Google VGG 0.34 0 999
wikinews+VG-internal 0.31 0 999
wikinews+VG-whole 0.31 0 999
wikinews+Google ResNet-152 0.35 0 999
wikinews sub+Google AlexNet 0.34 0 999
wikinews sub+Google VGG 0.34 0 999
wikinews sub+VG-internal 0.30 0 999
wikinews sub+VG-whole 0.30 0 999
wikinews sub+Google ResNet-152 0.35 0 999
crawl+Google AlexNet 0.34 0 999
crawl+Google VGG 0.34 0 999
crawl+VG-internal 0.32 0 999
crawl+VG-whole 0.32 0 999
crawl+Google ResNet-152 0.37 0 999
w2v13+Google AlexNet 0.34 0 999
w2v13+Google VGG 0.34 0 999
w2v13+VG-internal 0.23 0 999
w2v13+VG-whole 0.23 0 999
w2v13+Google ResNet-152 0.35 0 999
EL + ES
w2v13+VG SceneGraph 0.29 0 999
crawl+VG SceneGraph 0.45 0 999
wikinews sub+VG SceneGraph 0.20 0 999
wikinews+VG SceneGraph 0.35 0 999
Table 4.3: Spearman correlation on the SimLex dataset. Multi-modal embeddings
are created using the Padding technique. The table sections contain linguistic,
visual and multi-modal embeddings in this order. Red colour signifies the best
performance. Blue would mean that the multi-modal embedding outperformed
the corresponding uni-modal ones, which here did not happen.
Modality Embedding Spearman P-value Coverage
EL
wikinews 0.79 0 3000
wikinews sub 0.80 0 3000
crawl 0.85 0 3000
w2v13 0.68 0 3000
EV
Google AlexNet 0.50 0 3000
Google VGG 0.51 0 3000
VG-internal 0.37 0 2784
VG-whole 0.41 0 2784
Google ResNet-152 0.47 0 3000
ES VG SceneGraph 0.42 0 2574
EL + EV
wikinews+Google AlexNet 0.50 0 3000
wikinews+Google VGG 0.51 0 3000
wikinews+VG-internal 0.38 0 2784
wikinews+VG-whole 0.41 0 2784
wikinews+Google ResNet-152 0.48 0 3000
wikinews sub+Google AlexNet 0.50 0 3000
wikinews sub+Google VGG 0.51 0 3000
wikinews sub+VG-internal 0.37 0 2784
wikinews sub+VG-whole 0.41 0 2784
wikinews sub+Google ResNet-152 0.47 0 3000
crawl+Google AlexNet 0.51 0 3000
crawl+Google VGG 0.52 0 3000
crawl+VG-internal 0.38 0 2784
crawl+VG-whole 0.42 0 2784
crawl+Google ResNet-152 0.51 0 3000
w2v13+Google AlexNet 0.50 0 3000
w2v13+Google VGG 0.51 0 3000
w2v13+VG-internal 0.38 0 2784
w2v13+VG-whole 0.41 0 2784
w2v13+Google ResNet-152 0.48 0 3000
EL + ES
w2v13+VG SceneGraph 0.70 0 2574
crawl+VG SceneGraph 0.81 0 2574
wikinews sub+VG SceneGraph 0.45 0 2574
wikinews+VG SceneGraph 0.65 0 2574
Table 4.4: Spearman correlation on the MEN dataset. Multi-modal embeddings
are created using the Intersection technique. The table sections contain linguis-
tic, visual and multi-modal embeddings in this order. Red colour signifies the
best performance, blue means that the multi-modal embedding outperformed the
corresponding uni-modal ones.
Modality Embedding Spearman P-value Coverage
EL
wikinews 0.45 0 999
wikinews sub 0.44 0 999
crawl 0.50 0 999
w2v13 0.31 0 999
EV
Google AlexNet 0.34 0 999
Google VGG 0.34 0 999
VG-internal 0.31 0 103
VG-whole 0.19 0.06 103
Google ResNet-152 0.35 0 999
ES VG SceneGraph 0.26 0 593
EL + EV
wikinews+Google AlexNet 0.34 0 999
wikinews+Google VGG 0.34 0 999
wikinews+VG-internal 0.31 0 103
wikinews+VG-whole 0.18 0.06 103
wikinews+Google ResNet-152 0.35 0 999
wikinews sub+Google AlexNet 0.34 0 999
wikinews sub+Google VGG 0.34 0 999
wikinews sub+VG-internal 0.31 0 103
wikinews sub+VG-whole 0.18 0.06 103
wikinews sub+Google ResNet-152 0.35 0 999
crawl+Google AlexNet 0.34 0 999
crawl+Google VGG 0.34 0 999
crawl+VG-internal 0.31 0 103
crawl+VG-whole 0.19 0.06 103
crawl+Google ResNet-152 0.37 0 999
w2v13+Google AlexNet 0.34 0 999
w2v13+Google VGG 0.34 0 999
w2v13+VG-internal 0.31 0 103
w2v13+VG-whole 0.18 0.06 103
w2v13+Google ResNet-152 0.35 0 999
EL + ES
w2v13+VG SceneGraph 0.29 0 593
crawl+VG SceneGraph 0.44 0 593
wikinews sub+VG SceneGraph 0.30 0 593
wikinews+VG SceneGraph 0.35 0 593
Table 4.5: Spearman correlation on the SimLex dataset. Multi-modal embeddings
are created using the Intersection technique. The table sections contain linguistic,
visual and multi-modal embeddings in this order. Red colour signifies the best
performance. Blue would mean that the multi-modal embedding outperformed
the corresponding uni-modal ones, which here did not happen.
Modality Embedding Spearman P-value Coverage
EL
wikinews 0.80 0 2481
wikinews sub 0.80 0 2481
crawl 0.84 0 2481
w2v13 0.67 0 2481
EV
Google AlexNet 0.52 0 2481
Google VGG 0.51 0 2481
VG-internal 0.38 0 2481
VG-whole 0.41 0 2481
Google ResNet-152 0.47 0 2481
ES VG SceneGraph 0.44 0 2481
EL + EV
wikinews+Google AlexNet 0.52 0 2481
wikinews+Google VGG 0.52 0 2481
wikinews+VG-internal 0.38 0 2481
wikinews+VG-whole 0.41 0 2481
wikinews+Google ResNet-152 0.48 0 2481
wikinews sub+Google AlexNet 0.52 0 2481
wikinews sub+Google VGG 0.51 0 2481
wikinews sub+VG-internal 0.38 0 2481
wikinews sub+VG-whole 0.41 0 2481
wikinews sub+Google ResNet-152 0.47 0 2481
crawl+Google AlexNet 0.52 0 2481
crawl+Google VGG 0.52 0 2481
crawl+VG-internal 0.38 0 2481
crawl+VG-whole 0.42 0 2481
crawl+Google ResNet-152 0.51 0 2481
w2v13+Google AlexNet 0.52 0 2481
w2v13+Google VGG 0.52 0 2481
w2v13+VG-internal 0.38 0 2481
w2v13+VG-whole 0.41 0 2481
w2v13+Google ResNet-152 0.49 0 2481
EL + ES
w2v13+VG SceneGraph 0.70 0 2481
crawl+VG SceneGraph 0.81 0 2481
wikinews sub+VG SceneGraph 0.46 0 2481
wikinews+VG SceneGraph 0.66 0 2481
Table 4.6: Spearman correlation on the common subset of the MEN dataset.
Multi-modal embeddings are created using the Intersection technique. The table
sections contain linguistic, visual and multi-modal embeddings in this order. Red
colour signifies the best performance, blue means that the multi-modal embedding
outperformed the corresponding uni-modal ones.
Interestingly, using ResNet does not provide any performance gain over AlexNet,
similarly to the other more complicated models in Section 4.1. Both models are
fast to run, and AlexNet sometimes performs even better, so there is no good
reason to use ResNet in this task.
Padding multi-modal vectors for bigger coverage does not help; in the case of
w2v13+VG SceneGraph it even hurts performance. However, this may be due to
including more, and perhaps “harder” concept-pairs in the test set than in the
smaller intersection set.
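The difference in coverage between the two mid-fusion variants can be illustrated with a small sketch. It assumes that fusion is plain concatenation and that Padding fills a missing modality with a zero vector; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the EmbEval toolkit.

```python
def fuse_intersection(ling, vis):
    """Concatenate vectors only for words present in both vocabularies."""
    return {w: ling[w] + vis[w] for w in ling if w in vis}


def fuse_padding(ling, vis):
    """Concatenate vectors for every word, zero-filling a missing modality."""
    dim_l = len(next(iter(ling.values())))
    dim_v = len(next(iter(vis.values())))
    fused = {}
    for w in set(ling) | set(vis):
        l_vec = ling.get(w, [0.0] * dim_l)  # assumed padding value: zeros
        v_vec = vis.get(w, [0.0] * dim_v)
        fused[w] = l_vec + v_vec
    return fused


# Toy vocabularies, for illustration only.
toy_ling = {"cat": [1.0, 0.0], "idea": [0.5, 0.5]}
toy_vis = {"cat": [0.2, 0.3, 0.4]}

inter = fuse_intersection(toy_ling, toy_vis)
padded = fuse_padding(toy_ling, toy_vis)
```

Under these assumptions the padded vocabulary is the union of the two vocabularies, so the test set grows to include pairs for which only one modality carries information.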
The success of combining w2v13 and VG Scene Graph over other visual vectors
is interesting. While image embeddings did not help on MEN, this embedding
in-between visual and linguistic conveys some complementary information to this
linguistic baseline.
Note that in this study we used our EmbEval toolkit for creating multi-modal
embeddings, with two different types of mid-fusion methods. In Section 4.1 we
used MMFeat, which includes slightly different mid-fusion techniques; therefore,
the results are not directly comparable. The main point of this comprehensive
study was to reveal patterns across several different sources, architectures and
modalities. In the efficiency studies in Chapter 5 and for the transparency analysis
in Chapter 6, we also used the EmbEval toolkit.
4.3.4.2 Results on Brain Data
Results on the Brain datasets include scores from the 2 vs. 2 test, described in
Section 3.2.2. These experiments have all been run using the Intersection
mid-fusion technique. This is because padding did not make much of a difference in
performance, but it requires much more memory.
In addition to the previous visual models, here, we use the best performing
models from [Davis et al., 2019], namely the internal bounding box images, the
whole images and the combined image representations of Visual Genome. In
some cases we created a nested multi-modal model where we combined their
initial multi-modal models (denoted by MM) with all our linguistic models.
Tables 4.8, 4.9, 4.10, 4.11 show the scores of each embedding for every
participant, and their averages over participants. On average, multi-modal models
benefit more clearly on this task than on the previous one. In all settings a
multi-modal model achieved the highest performance. In all but one case (MEG scores
on the common subset vocabularies in Table 4.11) about half of multi-modal
models outperformed their corresponding uni-modal ones.
On the full datasets (Tables 4.8 and 4.9) VG SceneGraph and AlexNet improved
the most on the fMRI and MEG datasets respectively. On the common
subset evaluations ResNet performed best. Interestingly, in all cases the
combination with the older w2v13 linguistic model outperformed the combinations
with FastText embeddings.
Across individual participants we see substantial variance. Tables
4.12, 4.13, 4.14, 4.15 average performance over modalities for each participant. In
all settings except for the common subset of fMRI dataset (Table 4.14), multi-
modal models achieve a higher average performance than the uni-modal ones in
more than 50% of the cases. On the common subset of fMRI dataset, for all
participants, visual, structured and multi-modal averages are higher than the
linguistic ones.
In a recent paper, Pereira et al. [Pereira et al., 2018] also report high variance
between subjects in the way their different systems (linguistic, visual, etc.) encode
conceptual information: some are more visually oriented, others more linguistic, etc.
Their dataset, however, includes abstract concepts as well, which may explain the
lesser involvement of visual information.
One important observation is that the standard deviations on the maximally
covered fMRI data are around 0.1, whereas on MEG data they are around
0.06. On the common subsets the numbers are around 0.08 and 0.09 respectively.
In many cases the differences between models’ average performances fall within
these error margins. However, in most cases the improvements over uni-modal
models go beyond this error.
4.3.4.3 Concreteness
Figures 4.3 and 4.4 show Spearman’s correlation scores of each of our embeddings
on splits of size 100 of the MEN and SimLex datasets. On the x axis we see
the index of word pairs in our respective evaluation sets, ordered by WordNet
Concreteness, where concreteness for a word is computed using Equation 4.1.
The dark blue line indicates the WordNet Concreteness score for each word pair;
therefore, it is on a different axis than all the other lines, which represent correlation
scores. Figure 4.3 depicts the case when the concreteness of a word pair is the
sum of the concreteness scores of the individual words. In Figure 4.4 this is computed
using their absolute difference. The two versions of synset aggregation for a word
are both presented: median distance and maximum distance (most concrete)
selection.
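The ordering and splitting behind these figures can be sketched as below. The sketch assumes that the splits are consecutive, non-overlapping chunks of the concreteness-ordered word pairs, which the text does not state explicitly; `concreteness_splits` and `score_splits` are hypothetical names, and `score_fn` stands in for whichever per-split score is plotted (here, Spearman's correlation of one embedding).

```python
def concreteness_splits(pairs, pair_concreteness, size=100):
    """Order word pairs by concreteness and cut them into consecutive splits."""
    ordered = sorted(pairs, key=pair_concreteness)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def score_splits(splits, score_fn):
    """Apply a scoring function (e.g. Spearman's rho) to every split."""
    return [score_fn(split) for split in splits]


# Toy data: 250 word pairs with a made-up concreteness value each.
toy_pairs = [(f"w{i}", f"v{i}") for i in range(250)]
toy_conc = {pair: (i * 37) % 250 for i, pair in enumerate(toy_pairs)}

splits = concreteness_splits(toy_pairs, toy_conc.get, size=100)
split_sizes = score_splits(splits, len)  # stand-in for a correlation score
```

With a dataset size that is not a multiple of the split size, the last split is smaller than the others, so its score is estimated from fewer pairs.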
Figure 4.3: Spearman’s correlation on the full Semantic Similarity dataset splits,
ordered by the sum of WordNet concreteness scores of the two words in every
word pair. Mid-fusion method: Padding. Axis x shows the index of word pairs,
ordered by WordNet Concreteness. There are two plots on top of each other
for displaying the trend. Left y axis is the scale for WordNet concreteness score
(blue). Right y axis is the scale for Spearman’s correlations for all the embeddings
E⟨modality⟩.
Perhaps because of the size of the datasets, we can see a tendency in the scores
on MEN but much less on SimLex. When word pairs are ordered by the sum
concreteness, we see a slightly upward trend as the concreteness score increases,
especially in the median synset aggregation case. In the absolute difference
concreteness ordering there is a steep growth for the first 5-10 splits, after
which the growth slows sharply.
Since we have a lot of embeddings, we use colour codes to separate embeddings
by modality. Furthermore, we distinguish Visual Genome images EVG from
Google, denoted by EGoogle.
Figure 4.4: Spearman’s correlation on the full Semantic Similarity dataset splits,
ordered by the difference of WordNet concreteness scores of the two words in
every word pair. Mid-fusion method: Padding. Axis x shows the index of word
pairs, ordered by WordNet Concreteness. There are two plots on top of each
other for displaying the trend. Left y axis is the scale for WordNet concreteness
score (blue). Right y axis is the scale for Spearman’s correlations for all the
embeddings E⟨modality⟩.
An interesting observation is that ES behaves more like a visual embedding in
this experiment. A potential hypothesis is that for such abstract semantic tasks
(as opposed to traditional multi-modal tasks, such as VQA), we may not need
low-level visual features. Instead, it is rather the co-occurrence statistics, learned on
this visually ordered graph structure, which can convey complementary information
to a linguistic semantic embedding, trained on a “natural” text corpus. One
potential way to test this hypothesis could be to gradually reduce the resolution
of images we use for the visual embeddings and see how the performance changes,
in particular at what rate it starts to decline. We would expect it to plateau or
only decline slowly until a point when the objects are not distinguishable any
more. This way we would see how much visual detail we can omit and keep the
same gain for these conceptually abstract tasks.
Further results for evaluation on common subsets and Intersection type mid-
fusion method can be found in Appendix B. They are consistent with the results
presented here.
4.3.4.4 Qualitative Analysis
Our automatic WordNet concreteness score is not a distinguishing metric for the
60 nouns in the Brain datasets; nevertheless, some patterns may emerge when we
look at the results for individual words.
Figure 4.5: Scores on the words of the full Brain datasets, ordered by the EVG
score. The scores are the number of hits per word, averaged over all participants.
Mid-fusion method: Intersection.
Figure 4.5 and Figure 4.6 show the number of hits for individual words, aver-
aged over participants. A word gets a hit whenever it was in a word pair with a
positive 2 vs. 2 test score.
Here we order the plot by a Visual Genome based embedding on combined
image segments. Some words, e.g., barn, airplane and spoon, ranked very high in
both the fMRI and the MEG dataset. Note that the participant sets of these two
datasets are disjoint. It is harder to see such similarity for words with lower hit
numbers. Embeddings trained on different datasets and of different modalities
follow a similar trend.
Figure 4.6: Scores on the words of the embeddings’ common subset of the Brain
datasets, ordered by the EVG score. The scores are the number of hits per word,
averaged over all participants. Mid-fusion method: Intersection.
In order to get a better understanding of the type of words which behave
differently in these brain imaging experiments we would need more evaluation
data.
4.3.5 Conclusion
In this study we took a step towards a more detailed analysis on the impact
of visual information on high level semantic tasks, with no direct visual input.
Furthermore, we investigated two brain imaging evaluation sets, involving two
different imaging methods: fMRI (Functional Magnetic Resonance Imaging) and
MEG (Magnetoencephalography).
The results show that indeed, comparing several different visual and linguistic
sources and models on various different evaluation tasks is necessary in order
to avoid fooling ourselves by overfitting to certain types of evaluation sets. On
several occasions, previous literature showed performance gain using multi-modal
embeddings of linguistic and visual input. This is indeed the case on certain tasks
and using certain embeddings, but not in every case.
In this work we aimed to shed light on the various factors that might play a
role. Models behave differently on MEN and SimLex, and the performance gain of
multi-modal models when using linguistic vectors trained on huge textual sources
is not well supported on these tasks. Visual information is complementary when
our linguistic model has been trained on a smaller corpus, but this effect does
not necessarily scale with corpus size.
Multi-modal models achieved a more convincing improvement on the brain
imaging data; however, these datasets are fairly small, so we would refrain from
drawing far-reaching conclusions.
An interesting outcome of this study is that the model trained on the visually
structured scene graph of Visual Genome achieved a surprising success across
the board, despite its small size compared to all the other datasets. This is an
interesting model, since it is linguistic in the sense that it is trained on text, but
the word contexts are organised in a visually motivated structure. This suggests
that images may indeed convey complementary statistical information about the
co-occurrence of objects in visual scenes. It is even possible that this information
is more important for abstract semantic tasks than lower level visual properties
of words. This would be intuitive, since unlike multi-modal tasks with direct
visual input, such as Visual Question Answering, in our case we are aiming for
abstract meaning representations of concepts. It would make sense if detailed
visual information about what a table looks like mattered less when we talk
about table as an abstract concept.
4.4 Model Initialization on a Textual
Entailment Task
This section is a brief digression to studying the application possibilities of word
embeddings as initialisations on a sentence level task: textual entailment. We
evaluate on the Stanford Natural Language Inference (SNLI) corpus [Bowman
et al., 2015].
We compared five different neural network models for encoding sentences and
four different word embeddings to initialize these models, following the baselines
of [Bowman et al., 2015]. The task is a three-way classification, where the input
is a sentence pair and the classification labels are entailment, contradiction and
neutral. We included words in the multi-modal representation only if they have
a visual representation.
On top of each model there is a three-way classifier on the concatenated
sentence embeddings for the premise and the hypothesis sentences. The five
sentence encoding models are the following⁴:
1. Addition: Vector addition of word embedding vectors in the sentence.
2. Addition + translation layer: The previous model extended with an
additional layer that learns another sentence embedding above the fixed
word embedding based sentence representation.
3. Addition + translation layer + full size image embeddings: The
model above, but instead of using dimensionality reduced visual vectors,
we use the original image embeddings and smaller (100 dimensional) lin-
guistic vectors. In the previous models we used PCA to keep the first 300
components out of 4094.
4. GRU: Gated Recurrent Unit based recurrent sentence encoding model.
5. LSTM: Recurrent sentence encoding model with Long short-term memory
units.
All of them were initialized with four different word embeddings:
1. Linguistic only: Skip-gram embedding trained on a 2013 Wikipedia dump.
2. Visual only: Image embeddings, extracting CNN representations of Google
images for the individual words in the sentence.
3. Multi-modal: The concatenation of linguistic and visual vectors for each
word.
4. Random: The initial vector weights are sampled from a normal distribu-
tion.
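As an illustration of the simplest setting, the sketch below builds the classifier input for model 1 (Addition) with the multi-modal initialisation. It is a minimal sketch under stated assumptions, not the code used for the experiments: the function names are hypothetical, the toy vectors are invented, and the classifier itself is omitted.

```python
def multimodal_vector(word, ling, vis):
    """Concatenate linguistic and visual vectors; None if either is missing."""
    if word not in ling or word not in vis:
        return None
    return ling[word] + vis[word]


def encode_by_addition(tokens, lookup):
    """Sentence vector as the element-wise sum of the available word vectors."""
    vectors = [v for v in (lookup(t) for t in tokens) if v is not None]
    if not vectors:
        raise ValueError("no token in the sentence has a vector")
    return [sum(dims) for dims in zip(*vectors)]


def classifier_input(premise, hypothesis, lookup):
    """Concatenate premise and hypothesis encodings for the 3-way classifier."""
    return encode_by_addition(premise, lookup) + encode_by_addition(hypothesis, lookup)


# Toy vectors, for illustration only.
toy_ling = {"dog": [1.0, 0.0], "runs": [0.0, 1.0], "idea": [0.5, 0.5]}
toy_vis = {"dog": [2.0], "runs": [3.0]}


def toy_lookup(word):
    return multimodal_vector(word, toy_ling, toy_vis)


features = classifier_input(["dog", "runs", "idea"], ["dog"], toy_lookup)
```

In this toy example the word "idea" has no visual vector and therefore does not contribute to the sentence encoding, which mirrors the restriction to words that have a visual representation.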
⁴ Some of the base code was written in collaboration with Amandla Mabona.
4.4.1 Results
The results are shown in Table 4.16. The experiments indicate two phenomena:
1. The translation layer plays an important role in models 1-3. In the
simplest model, without the translation layer (model 1), the linguistic
initialisation performs best. After adding the translation layer, however,
multi-modal embeddings outperform all the others, in the case of full
size image vectors (model 3) by a substantial margin in classification
accuracy.
2. In the case of the more sophisticated recurrent models (4-5), however, we
found that the performance difference across different initialisations
vanishes. Even the random initial embeddings do not achieve significantly
lower classification accuracy than the other methods.
4.4.2 Conclusion
The second finding may suggest that we could create more time efficient models,
since we do not necessarily need to spend time on pre-training word embeddings.
It also alerts us, however, to the danger of overfitting. Note that we ignored
multi-modal representations for words where visual information is missing, which may
hurt performance. The high performance of random initialisations, however, is
more telling. Our findings are in line with those of Zhang and Bowman, who found the
related phenomenon of high performing randomly initialised LSTM models [Zhang
and Bowman, 2018]. [Yogatama et al., 2019] recently found that transformer
type models overfit to the quirks of particular datasets. Possible future
work could be to gradually increase model complexity and to perform more
ablation studies, in order to better understand the models’ capacity.
4.5 Conclusion
In this chapter we demonstrated the effectiveness of image search engines in
multi-modal mid-fusion embeddings. We found that around the first 10 image
results are sufficient; beyond that the performance plateaus.
We introduced a new visually structured textual embedding based on Visual
Genome and showed that it enriches linguistic models trained on smaller corpora;
it can therefore be useful for low resource languages.
We found that pretrained word embeddings do not necessarily help sequence
model training. However, they can be valuable on their own for discovering
concept structures in a data source.
Based on these findings we move on to an in-depth study of our embeddings of
different modalities and their combinations. The following chapters showcase the
second and third pillars of our methodology, which involve transparency analysis
(see Section 3.3). We narrow our focus to a few models, as such analyses would be
fairly time consuming for all the above combinations of sources, modalities and
models. Furthermore, such studies on numerous current models would shortly
become obsolete. Our aim is rather to provide a general framework with
proof-of-concept studies, which can be applied to various models in the future.
Modality Embedding Spearman P-value Coverage
EL
wikinews 0.28 0 103
wikinews sub 0.25 0.01 103
crawl 0.37 0 103
w2v13 0.11 0.25 103
EV
Google AlexNet 0.55 0 103
Google VGG 0.53 0 103
VG-internal 0.31 0 103
VG-whole 0.19 0.06 103
Google ResNet-152 0.50 0 103
ES VG SceneGraph 0.30 0 103
EL + EV
wikinews+Google AlexNet 0.55 0 103
wikinews+Google VGG 0.53 0 103
wikinews+VG-internal 0.31 0 103
wikinews+VG-whole 0.18 0.06 103
wikinews+Google ResNet-152 0.50 0 103
wikinews sub+Google AlexNet 0.55 0 103
wikinews sub+Google VGG 0.53 0 103
wikinews sub+VG-internal 0.31 0 103
wikinews sub+VG-whole 0.18 0.06 103
wikinews sub+Google ResNet-152 0.50 0 103
crawl+Google AlexNet 0.55 0 103
crawl+Google VGG 0.52 0 103
crawl+VG-internal 0.31 0 103
crawl+VG-whole 0.19 0.06 103
crawl+Google ResNet-152 0.49 0 103
w2v13+Google AlexNet 0.55 0 103
w2v13+Google VGG 0.53 0 103
w2v13+VG-internal 0.31 0 103
w2v13+VG-whole 0.18 0.06 103
w2v13+Google ResNet-152 0.49 0 103
EL + ES
w2v13+VG SceneGraph 0.25 0.01 103
crawl+VG SceneGraph 0.34 0 103
wikinews sub+VG SceneGraph 0.30 0 103
wikinews+VG SceneGraph 0.29 0 103
Table 4.7: Spearman correlation on the common subset of the SimLex dataset.
Multi-modal embeddings are created using the Intersection technique. The table
sections contain linguistic, visual and multi-modal embeddings in this order. Red
colour signifies the best performance. Blue would mean that the multi-modal
embedding outperformed the corresponding uni-modal ones, which here did not
happen.
Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Covr.
EL
w2v13 0.79 0.54 0.66 0.76 0.58 0.65 0.47 0.60 0.67 0.64 0.1 45
wikinews sub 0.83 0.66 0.68 0.83 0.61 0.54 0.59 0.56 0.70 0.67 0.1 60
wikinews 0.83 0.68 0.63 0.81 0.64 0.54 0.56 0.48 0.65 0.65 0.11 60
crawl 0.86 0.68 0.61 0.88 0.65 0.58 0.58 0.55 0.60 0.67 0.12 60
EV
Google-VIS whole 0.89 0.65 0.64 0.75 0.51 0.61 0.64 0.55 0.60 0.65 0.11 52
Google ResNet-152 0.88 0.63 0.64 0.73 0.46 0.56 0.64 0.50 0.56 0.62 0.12 52
VG-VIS internal 0.85 0.70 0.63 0.72 0.52 0.55 0.57 0.47 0.57 0.62 0.11 57
Google AlexNet 0.89 0.61 0.66 0.72 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52
VG-VIS combined 0.85 0.71 0.65 0.76 0.55 0.57 0.60 0.44 0.66 0.64 0.11 57
ES VG SceneGraph 0.83 0.68 0.57 0.77 0.59 0.63 0.58 0.59 0.64 0.65 0.09 58
EL + EV
VG-MM internal 0.88 0.66 0.66 0.78 0.59 0.64 0.67 0.47 0.65 0.67 0.11 57
VG-MM combined 0.88 0.64 0.67 0.79 0.61 0.65 0.67 0.48 0.68 0.67 0.11 57
Google-MM whole 0.89 0.67 0.67 0.80 0.61 0.60 0.65 0.52 0.64 0.67 0.1 52
wikinews+Google ResNet-152 0.88 0.63 0.64 0.75 0.47 0.56 0.63 0.49 0.55 0.62 0.12 52
wikinews+Google AlexNet 0.89 0.61 0.66 0.73 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52
wikinews+VG-VIS internal 0.84 0.69 0.66 0.80 0.65 0.55 0.62 0.47 0.66 0.66 0.11 57
wikinews+VG-MM internal 0.83 0.67 0.66 0.80 0.65 0.56 0.62 0.47 0.67 0.66 0.1 57
wikinews+VG-VIS combined 0.84 0.68 0.66 0.80 0.65 0.56 0.63 0.47 0.67 0.66 0.11 57
wikinews+VG-MM combined 0.83 0.67 0.66 0.80 0.65 0.56 0.63 0.48 0.67 0.66 0.1 57
wikinews+Google-VIS whole 0.85 0.72 0.70 0.79 0.68 0.50 0.60 0.55 0.65 0.67 0.11 52
wikinews+Google-MM whole 0.83 0.72 0.68 0.80 0.69 0.49 0.58 0.55 0.65 0.67 0.11 52
wikinews sub+Google ResNet-152 0.88 0.63 0.63 0.74 0.46 0.56 0.63 0.49 0.55 0.62 0.12 52
wikinews sub+Google AlexNet 0.89 0.61 0.66 0.73 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52
wikinews sub+VG-VIS internal 0.87 0.68 0.67 0.78 0.59 0.57 0.57 0.50 0.61 0.65 0.11 57
wikinews sub+VG-MM internal 0.88 0.64 0.69 0.80 0.63 0.63 0.66 0.51 0.66 0.68 0.1 57
wikinews sub+VG-VIS combined 0.87 0.70 0.67 0.81 0.60 0.58 0.62 0.48 0.67 0.67 0.11 57
wikinews sub+VG-MM combined 0.87 0.64 0.69 0.81 0.63 0.63 0.67 0.52 0.67 0.68 0.1 57
wikinews sub+Google-VIS whole 0.89 0.67 0.66 0.77 0.52 0.58 0.64 0.55 0.62 0.66 0.11 52
wikinews sub+Google-MM whole 0.88 0.69 0.70 0.81 0.64 0.57 0.63 0.55 0.65 0.68 0.1 52
crawl+Google ResNet-152 0.88 0.64 0.64 0.75 0.47 0.55 0.62 0.50 0.55 0.62 0.12 52
crawl+Google AlexNet 0.89 0.61 0.66 0.73 0.48 0.63 0.62 0.54 0.66 0.65 0.11 52
crawl+VG-VIS internal 0.87 0.68 0.62 0.86 0.67 0.60 0.62 0.53 0.61 0.67 0.11 57
crawl+VG-MM internal 0.87 0.67 0.62 0.86 0.67 0.60 0.62 0.53 0.61 0.67 0.11 57
crawl+VG-VIS combined 0.87 0.68 0.62 0.86 0.67 0.60 0.63 0.53 0.62 0.67 0.11 57
crawl+VG-MM combined 0.87 0.67 0.62 0.86 0.67 0.60 0.62 0.53 0.61 0.67 0.11 57
crawl+Google-VIS whole 0.87 0.72 0.69 0.87 0.72 0.51 0.60 0.60 0.57 0.69 0.12 52
crawl+Google-MM whole 0.86 0.72 0.69 0.87 0.73 0.51 0.60 0.61 0.57 0.68 0.12 52
w2v13+Google ResNet-152 0.89 0.65 0.68 0.75 0.50 0.56 0.56 0.54 0.67 0.64 0.11 40
w2v13+Google AlexNet 0.90 0.66 0.71 0.74 0.53 0.64 0.58 0.57 0.77 0.68 0.11 40
w2v13+VG-VIS internal 0.81 0.55 0.66 0.76 0.60 0.68 0.49 0.59 0.67 0.65 0.09 44
w2v13+VG-MM internal 0.80 0.55 0.65 0.76 0.60 0.67 0.50 0.59 0.68 0.65 0.09 44
w2v13+VG-VIS combined 0.81 0.54 0.66 0.76 0.60 0.68 0.50 0.59 0.68 0.65 0.09 44
w2v13+VG-MM combined 0.80 0.55 0.65 0.76 0.60 0.67 0.50 0.59 0.68 0.64 0.09 44
w2v13+Google-VIS whole 0.84 0.61 0.68 0.75 0.62 0.59 0.44 0.64 0.66 0.65 0.1 40
w2v13+Google-MM whole 0.82 0.59 0.67 0.74 0.62 0.59 0.45 0.64 0.65 0.64 0.1 40
EL + ES
wikinews+VG SceneGraph 0.84 0.71 0.59 0.80 0.63 0.60 0.58 0.58 0.65 0.66 0.09 58
wikinews sub+VG SceneGraph 0.84 0.68 0.57 0.78 0.60 0.63 0.59 0.60 0.64 0.66 0.09 58
crawl+VG SceneGraph 0.87 0.71 0.59 0.86 0.66 0.60 0.60 0.60 0.64 0.68 0.1 58
w2v13+VG SceneGraph 0.87 0.65 0.66 0.81 0.66 0.69 0.52 0.61 0.76 0.69 0.1 45
Table 4.8: fMRI scores for each participant and embedding. Multi-modal em-
beddings are created using the Intersection technique. The table sections contain
linguistic, visual and multi-modal embeddings in this order. Red colour signifies
the best performance, blue means that the multi-modal embedding outperformed
the corresponding uni-modal ones.
Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Coverage
EL
w2v13 0.65 0.64 0.58 0.64 0.74 0.65 0.75 0.56 0.69 0.66 0.06 45
wikinews sub 0.63 0.59 0.48 0.70 0.72 0.65 0.71 0.66 0.73 0.65 0.07 60
wikinews 0.63 0.61 0.50 0.71 0.71 0.64 0.72 0.63 0.76 0.66 0.07 60
crawl 0.65 0.58 0.57 0.69 0.67 0.63 0.73 0.65 0.71 0.65 0.05 60
EV
Google-VIS whole 0.70 0.51 0.56 0.71 0.69 0.73 0.70 0.62 0.69 0.66 0.07 52
Google ResNet-152 0.65 0.55 0.52 0.69 0.70 0.68 0.63 0.61 0.66 0.63 0.06 52
VG-VIS internal 0.62 0.55 0.54 0.66 0.62 0.69 0.64 0.49 0.59 0.60 0.06 57
Google AlexNet 0.66 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.63 0.06 52
VG-VIS combined 0.69 0.60 0.55 0.68 0.70 0.76 0.68 0.56 0.69 0.66 0.07 57
ES VG SceneGraph 0.63 0.60 0.55 0.65 0.70 0.62 0.67 0.50 0.73 0.63 0.07 58
EL + EV
VG-MM internal 0.66 0.65 0.56 0.73 0.68 0.64 0.70 0.60 0.69 0.65 0.05 57
VG-MM combined 0.68 0.67 0.56 0.74 0.71 0.69 0.72 0.62 0.72 0.68 0.05 57
Google-MM whole 0.72 0.59 0.53 0.72 0.71 0.67 0.72 0.65 0.72 0.67 0.06 52
wikinews+Google ResNet-152 0.65 0.55 0.51 0.68 0.70 0.69 0.63 0.61 0.66 0.63 0.06 52
wikinews+Google AlexNet 0.66 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.64 0.06 52
wikinews+VG-VIS internal 0.63 0.67 0.53 0.71 0.74 0.69 0.72 0.60 0.76 0.67 0.07 57
wikinews+VG-MM internal 0.62 0.67 0.54 0.70 0.75 0.68 0.73 0.62 0.76 0.67 0.07 57
wikinews+VG-VIS combined 0.64 0.66 0.53 0.71 0.74 0.71 0.73 0.62 0.77 0.68 0.07 57
wikinews+VG-MM combined 0.62 0.67 0.53 0.70 0.75 0.69 0.73 0.62 0.76 0.67 0.07 57
wikinews+Google-VIS whole 0.66 0.61 0.53 0.70 0.76 0.70 0.72 0.61 0.75 0.67 0.07 52
wikinews+Google-MM whole 0.66 0.63 0.53 0.69 0.76 0.66 0.71 0.61 0.76 0.67 0.07 52
wikinews sub+Google ResNet-152 0.65 0.55 0.51 0.68 0.70 0.68 0.62 0.61 0.66 0.63 0.06 52
wikinews sub+Google AlexNet 0.66 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.64 0.06 52
wikinews sub+VG-VIS internal 0.64 0.61 0.56 0.72 0.68 0.70 0.70 0.54 0.67 0.65 0.06 57
wikinews sub+VG-MM internal 0.66 0.66 0.56 0.75 0.72 0.67 0.72 0.63 0.73 0.68 0.06 57
wikinews sub+VG-VIS combined 0.70 0.64 0.57 0.73 0.73 0.75 0.72 0.59 0.71 0.68 0.06 57
wikinews sub+VG-MM combined 0.68 0.67 0.55 0.76 0.74 0.71 0.73 0.63 0.75 0.69 0.06 57
wikinews sub+Google-VIS whole 0.70 0.53 0.55 0.70 0.73 0.73 0.71 0.63 0.70 0.66 0.07 52
wikinews sub+Google-MM whole 0.71 0.62 0.53 0.71 0.76 0.67 0.72 0.65 0.74 0.68 0.07 52
crawl+Google ResNet-152 0.65 0.56 0.51 0.68 0.71 0.68 0.63 0.62 0.66 0.63 0.06 52
crawl+Google AlexNet 0.67 0.52 0.57 0.66 0.69 0.71 0.69 0.57 0.65 0.64 0.06 52
crawl+VG-VIS internal 0.65 0.63 0.60 0.69 0.69 0.68 0.73 0.65 0.73 0.67 0.04 57
crawl+VG-MM internal 0.65 0.64 0.60 0.69 0.69 0.67 0.73 0.65 0.73 0.67 0.04 57
crawl+VG-VIS combined 0.65 0.63 0.60 0.69 0.69 0.68 0.73 0.65 0.73 0.67 0.04 57
crawl+VG-MM combined 0.65 0.64 0.61 0.69 0.69 0.67 0.73 0.65 0.73 0.67 0.04 57
crawl+Google-VIS whole 0.67 0.62 0.56 0.68 0.77 0.67 0.73 0.62 0.75 0.67 0.06 52
crawl+Google-MM whole 0.67 0.63 0.57 0.68 0.77 0.66 0.73 0.62 0.75 0.67 0.06 52
w2v13+Google ResNet-152 0.68 0.66 0.60 0.72 0.74 0.69 0.73 0.62 0.66 0.68 0.05 40
w2v13+Google AlexNet 0.69 0.59 0.67 0.75 0.77 0.74 0.73 0.62 0.69 0.69 0.06 40
w2v13+VG-VIS internal 0.65 0.65 0.61 0.62 0.75 0.65 0.75 0.53 0.71 0.66 0.07 44
w2v13+VG-MM internal 0.64 0.64 0.60 0.62 0.75 0.65 0.74 0.54 0.70 0.66 0.06 44
w2v13+VG-VIS combined 0.66 0.64 0.61 0.62 0.76 0.67 0.74 0.54 0.71 0.66 0.06 44
w2v13+VG-MM combined 0.64 0.64 0.60 0.63 0.75 0.65 0.74 0.54 0.70 0.66 0.06 44
w2v13+Google-VIS whole 0.69 0.59 0.56 0.64 0.73 0.65 0.69 0.56 0.71 0.65 0.06 40
w2v13+Google-MM whole 0.68 0.59 0.54 0.63 0.71 0.63 0.70 0.56 0.71 0.64 0.06 40
EL + ES
wikinews+VG SceneGraph 0.62 0.61 0.55 0.69 0.73 0.62 0.71 0.55 0.75 0.65 0.07 58
wikinews sub+VG SceneGraph 0.62 0.60 0.55 0.67 0.70 0.62 0.68 0.51 0.73 0.63 0.07 58
crawl+VG SceneGraph 0.66 0.63 0.55 0.69 0.72 0.62 0.74 0.62 0.76 0.67 0.06 58
w2v13+VG SceneGraph 0.69 0.63 0.56 0.68 0.81 0.68 0.78 0.59 0.72 0.68 0.08 45
Table 4.9: MEG scores for each participant and embedding. Multi-modal em-
beddings are created using the Intersection technique. The table sections contain
linguistic, visual and multi-modal embeddings in this order. Red colour signifies
the best performance, blue means that the multi-modal embedding outperformed
the corresponding uni-modal ones.
Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Coverage
EL
w2v13 0.41 0.41 0.33 0.46 0.44 0.49 0.43 0.51 0.55 0.45 0.06 39
wikinews sub 0.47 0.47 0.52 0.49 0.48 0.45 0.50 0.52 0.52 0.49 0.03 39
wikinews 0.54 0.44 0.50 0.49 0.51 0.52 0.53 0.46 0.54 0.50 0.03 39
crawl 0.41 0.49 0.44 0.54 0.56 0.48 0.47 0.39 0.56 0.48 0.06 39
EV
Google-VIS whole 0.47 0.33 0.54 0.46 0.44 0.41 0.24 0.49 0.45 0.43 0.08 39
Google ResNet-152 0.42 0.39 0.58 0.54 0.53 0.53 0.57 0.62 0.39 0.51 0.08 39
VG-VIS internal 0.66 0.37 0.60 0.65 0.58 0.57 0.52 0.56 0.56 0.56 0.08 39
Google AlexNet 0.59 0.58 0.54 0.58 0.53 0.61 0.53 0.59 0.56 0.57 0.03 39
VG-VIS combined 0.57 0.34 0.57 0.58 0.53 0.52 0.49 0.53 0.46 0.51 0.07 39
ES VG SceneGraph 0.51 0.45 0.49 0.55 0.63 0.38 0.55 0.48 0.36 0.49 0.08 39
EL + EV
VG-MM internal 0.70 0.61 0.46 0.45 0.55 0.48 0.57 0.60 0.51 0.55 0.08 39
VG-MM combined 0.48 0.43 0.45 0.51 0.47 0.66 0.44 0.58 0.56 0.51 0.07 39
Google-MM whole 0.45 0.32 0.48 0.44 0.46 0.49 0.27 0.49 0.40 0.43 0.07 39
wikinews+Google ResNet-152 0.33 0.38 0.59 0.42 0.43 0.39 0.69 0.50 0.64 0.49 0.12 39
wikinews+Google AlexNet 0.42 0.41 0.47 0.49 0.49 0.41 0.65 0.48 0.55 0.48 0.07 39
wikinews+VG-VIS internal 0.64 0.69 0.43 0.66 0.64 0.50 0.53 0.62 0.55 0.58 0.08 39
wikinews+VG-MM internal 0.66 0.68 0.40 0.67 0.61 0.50 0.55 0.64 0.52 0.58 0.09 39
wikinews+VG-VIS combined 0.65 0.67 0.42 0.63 0.63 0.51 0.54 0.63 0.55 0.58 0.08 39
wikinews+VG-MM combined 0.69 0.70 0.42 0.65 0.60 0.49 0.61 0.66 0.48 0.59 0.1 39
wikinews+Google-VIS whole 0.43 0.46 0.49 0.40 0.51 0.41 0.63 0.40 0.50 0.47 0.07 39
wikinews+Google-MM whole 0.42 0.48 0.47 0.41 0.53 0.41 0.63 0.41 0.49 0.47 0.07 39
wikinews sub+Google ResNet-152 0.33 0.39 0.59 0.42 0.43 0.39 0.69 0.50 0.65 0.49 0.12 39
wikinews sub+Google AlexNet 0.42 0.41 0.47 0.49 0.48 0.41 0.65 0.48 0.55 0.48 0.07 39
wikinews sub+VG-VIS internal 0.60 0.57 0.52 0.68 0.54 0.57 0.43 0.58 0.52 0.55 0.06 39
wikinews sub+VG-MM internal 0.68 0.57 0.40 0.75 0.57 0.51 0.43 0.64 0.46 0.56 0.11 39
wikinews sub+VG-VIS combined 0.57 0.53 0.50 0.65 0.53 0.55 0.46 0.65 0.49 0.55 0.06 39
wikinews sub+VG-MM combined 0.69 0.59 0.40 0.70 0.59 0.50 0.50 0.65 0.42 0.56 0.1 39
wikinews sub+Google-VIS whole 0.43 0.42 0.50 0.38 0.45 0.38 0.58 0.43 0.59 0.46 0.08 39
wikinews sub+Google-MM whole 0.38 0.47 0.46 0.38 0.47 0.42 0.66 0.44 0.57 0.47 0.09 39
crawl+Google ResNet-152 0.33 0.38 0.59 0.42 0.44 0.39 0.69 0.50 0.63 0.48 0.12 39
crawl+Google AlexNet 0.42 0.41 0.47 0.49 0.49 0.40 0.65 0.48 0.55 0.48 0.07 39
crawl+VG-VIS internal 0.63 0.62 0.41 0.67 0.59 0.56 0.50 0.59 0.60 0.57 0.07 39
crawl+VG-MM internal 0.60 0.60 0.39 0.68 0.57 0.55 0.51 0.61 0.59 0.57 0.08 39
crawl+VG-VIS combined 0.64 0.62 0.41 0.67 0.58 0.55 0.50 0.59 0.60 0.57 0.07 39
crawl+VG-MM combined 0.60 0.62 0.43 0.65 0.58 0.59 0.55 0.62 0.54 0.58 0.06 39
crawl+Google-VIS whole 0.48 0.49 0.52 0.46 0.55 0.36 0.65 0.39 0.52 0.49 0.08 39
crawl+Google-MM whole 0.48 0.49 0.50 0.46 0.55 0.35 0.65 0.38 0.51 0.49 0.08 39
w2v13+Google ResNet-152 0.87 0.67 0.69 0.73 0.60 0.61 0.56 0.58 0.69 0.67 0.09 39
w2v13+Google AlexNet 0.74 0.64 0.60 0.67 0.58 0.54 0.62 0.61 0.71 0.63 0.06 39
w2v13+VG-VIS internal 0.35 0.47 0.28 0.38 0.55 0.51 0.54 0.48 0.40 0.44 0.09 39
w2v13+VG-MM internal 0.37 0.50 0.35 0.48 0.39 0.47 0.59 0.45 0.47 0.45 0.07 39
w2v13+VG-VIS combined 0.36 0.47 0.29 0.37 0.55 0.50 0.54 0.47 0.41 0.44 0.08 39
w2v13+VG-MM combined 0.35 0.51 0.39 0.54 0.42 0.39 0.57 0.45 0.43 0.45 0.07 39
w2v13+Google-VIS whole 0.76 0.57 0.61 0.66 0.61 0.53 0.49 0.60 0.69 0.61 0.08 39
w2v13+Google-MM whole 0.75 0.56 0.60 0.67 0.61 0.53 0.48 0.61 0.68 0.61 0.08 39
EL + ES
wikinews+VG SceneGraph 0.48 0.51 0.47 0.49 0.50 0.50 0.59 0.32 0.48 0.48 0.07 39
wikinews sub+VG SceneGraph 0.54 0.55 0.49 0.50 0.40 0.51 0.58 0.45 0.59 0.51 0.06 39
crawl+VG SceneGraph 0.56 0.59 0.43 0.47 0.43 0.47 0.59 0.38 0.49 0.49 0.07 39
w2v13+VG SceneGraph 0.59 0.49 0.63 0.48 0.45 0.60 0.42 0.53 0.67 0.54 0.08 39
Table 4.10: fMRI scores for each participant and embedding on the common sub-
set of vocabularies. Multi-modal embeddings are created using the Intersection
technique. The table sections contain linguistic, visual and multi-modal embed-
dings in this order. Red colour signifies the best performance, blue means that
the multi-modal embedding outperformed the corresponding uni-modal ones.
Modality Embedding P1 P2 P3 P4 P5 P6 P7 P8 P9 Avg STD Coverage
EL
w2v13 0.56 0.52 0.36 0.46 0.34 0.55 0.50 0.44 0.51 0.47 0.07 39
wikinews sub 0.55 0.64 0.40 0.52 0.59 0.42 0.62 0.46 0.53 0.53 0.08 39
wikinews 0.59 0.62 0.38 0.39 0.64 0.42 0.62 0.37 0.65 0.52 0.12 39
crawl 0.50 0.45 0.66 0.41 0.60 0.40 0.55 0.54 0.45 0.51 0.08 39
EV
Google-VIS whole 0.49 0.52 0.56 0.35 0.44 0.64 0.45 0.65 0.52 0.51 0.09 39
Google ResNet-152 0.56 0.55 0.65 0.24 0.38 0.60 0.50 0.60 0.45 0.50 0.12 39
VG-VIS internal 0.46 0.54 0.51 0.53 0.66 0.46 0.49 0.52 0.65 0.54 0.07 39
Google AlexNet 0.35 0.52 0.54 0.45 0.52 0.52 0.53 0.51 0.50 0.49 0.05 39
VG-VIS combined 0.33 0.44 0.49 0.62 0.68 0.46 0.49 0.47 0.54 0.50 0.1 39
ES VG SceneGraph 0.48 0.60 0.49 0.49 0.53 0.54 0.59 0.39 0.49 0.51 0.06 39
EL + EV
VG-MM internal 0.50 0.22 0.50 0.55 0.39 0.54 0.50 0.54 0.52 0.47 0.1 39
VG-MM combined 0.29 0.36 0.54 0.45 0.40 0.38 0.35 0.39 0.50 0.41 0.07 39
Google-MM whole 0.39 0.51 0.48 0.29 0.40 0.54 0.46 0.65 0.57 0.48 0.1 39
wikinews+Google ResNet-152 0.46 0.44 0.57 0.43 0.64 0.53 0.35 0.61 0.45 0.50 0.09 39
wikinews+Google AlexNet 0.52 0.36 0.42 0.46 0.48 0.57 0.31 0.63 0.58 0.48 0.1 39
wikinews+VG-VIS internal 0.39 0.34 0.46 0.58 0.49 0.50 0.52 0.57 0.67 0.50 0.09 39
wikinews+VG-MM internal 0.41 0.33 0.49 0.59 0.47 0.51 0.41 0.57 0.64 0.49 0.09 39
wikinews+VG-VIS combined 0.39 0.34 0.46 0.58 0.48 0.49 0.51 0.57 0.66 0.50 0.09 39
wikinews+VG-MM combined 0.42 0.38 0.51 0.59 0.49 0.53 0.42 0.59 0.65 0.51 0.09 39
wikinews+Google-VIS whole 0.42 0.38 0.48 0.31 0.50 0.66 0.55 0.59 0.45 0.48 0.1 39
wikinews+Google-MM whole 0.41 0.38 0.47 0.31 0.48 0.66 0.56 0.59 0.42 0.48 0.1 39
wikinews sub+Google ResNet-152 0.46 0.44 0.57 0.43 0.64 0.53 0.35 0.61 0.45 0.50 0.09 39
wikinews sub+Google AlexNet 0.52 0.35 0.42 0.46 0.48 0.57 0.31 0.63 0.58 0.48 0.1 39
wikinews sub+VG-VIS internal 0.54 0.40 0.40 0.47 0.52 0.43 0.59 0.54 0.56 0.49 0.07 39
wikinews sub+VG-MM internal 0.51 0.43 0.44 0.51 0.48 0.55 0.53 0.55 0.66 0.52 0.06 39
wikinews sub+VG-VIS combined 0.52 0.40 0.40 0.50 0.49 0.45 0.58 0.53 0.57 0.49 0.06 39
wikinews sub+VG-MM combined 0.50 0.48 0.50 0.56 0.48 0.56 0.52 0.57 0.65 0.53 0.05 39
wikinews sub+Google-VIS whole 0.48 0.39 0.46 0.40 0.58 0.60 0.42 0.63 0.48 0.49 0.08 39
wikinews sub+Google-MM whole 0.44 0.37 0.45 0.39 0.53 0.61 0.46 0.57 0.44 0.47 0.08 39
crawl+Google ResNet-152 0.47 0.44 0.57 0.43 0.63 0.53 0.36 0.61 0.45 0.50 0.09 39
crawl+Google AlexNet 0.52 0.35 0.42 0.46 0.48 0.57 0.31 0.63 0.58 0.48 0.1 39
crawl+VG-VIS internal 0.43 0.35 0.46 0.50 0.50 0.45 0.54 0.50 0.60 0.48 0.07 39
crawl+VG-MM internal 0.45 0.34 0.47 0.49 0.51 0.40 0.48 0.49 0.61 0.47 0.07 39
crawl+VG-VIS combined 0.42 0.35 0.46 0.50 0.49 0.45 0.54 0.50 0.60 0.48 0.07 39
crawl+VG-MM combined 0.49 0.44 0.50 0.50 0.54 0.42 0.47 0.50 0.65 0.50 0.06 39
crawl+Google-VIS whole 0.57 0.42 0.43 0.32 0.48 0.67 0.60 0.60 0.55 0.52 0.11 39
crawl+Google-MM whole 0.57 0.42 0.43 0.31 0.47 0.67 0.61 0.60 0.55 0.52 0.11 39
w2v13+Google ResNet-152 0.67 0.61 0.62 0.65 0.73 0.73 0.67 0.58 0.70 0.66 0.05 39
w2v13+Google AlexNet 0.53 0.48 0.57 0.65 0.61 0.68 0.59 0.50 0.57 0.58 0.06 39
w2v13+VG-VIS internal 0.52 0.46 0.51 0.56 0.45 0.54 0.48 0.68 0.53 0.53 0.07 39
w2v13+VG-MM internal 0.55 0.42 0.53 0.56 0.43 0.49 0.47 0.67 0.42 0.50 0.08 39
w2v13+VG-VIS combined 0.52 0.46 0.51 0.56 0.44 0.54 0.48 0.68 0.52 0.52 0.07 39
w2v13+VG-MM combined 0.55 0.44 0.49 0.55 0.54 0.56 0.56 0.68 0.47 0.54 0.06 39
w2v13+Google-VIS whole 0.66 0.59 0.59 0.55 0.68 0.74 0.66 0.47 0.72 0.63 0.08 39
w2v13+Google-MM whole 0.66 0.59 0.59 0.52 0.66 0.72 0.66 0.45 0.72 0.62 0.08 39
EL + ES
wikinews+VG SceneGraph 0.56 0.65 0.53 0.56 0.45 0.33 0.64 0.52 0.41 0.52 0.1 39
wikinews sub+VG SceneGraph 0.62 0.67 0.62 0.56 0.49 0.35 0.60 0.44 0.46 0.54 0.1 39
crawl+VG SceneGraph 0.69 0.52 0.52 0.48 0.42 0.55 0.67 0.66 0.45 0.55 0.09 39
w2v13+VG SceneGraph 0.52 0.49 0.54 0.54 0.40 0.55 0.52 0.51 0.50 0.51 0.04 39
Table 4.11: MEG scores for each participant and embedding on the common sub-
set of vocabularies. Multi-modal embeddings are created using the Intersection
technique. The table sections contain linguistic, visual and multi-modal embed-
dings in this order. Red colour signifies the best performance, blue means that
the multi-modal embedding outperformed the corresponding uni-modal ones.
Modality P1 P2 P3 P4 P5 P6 P7 P8 P9
EL 0.83 0.64 0.65 0.82 0.62 0.58 0.55 0.55 0.65
EV 0.87 0.66 0.65 0.74 0.51 0.58 0.61 0.50 0.61
ES 0.83 0.68 0.57 0.77 0.59 0.63 0.58 0.59 0.64
EL + EV 0.86 0.65 0.66 0.79 0.60 0.59 0.60 0.54 0.64
EL + ES 0.86 0.69 0.60 0.81 0.64 0.63 0.57 0.60 0.67
Table 4.12: fMRI scores averaged over each modality. Bold signifies the highest
average performance for each participant.
Modality P1 P2 P3 P4 P5 P6 P7 P8 P9
EL 0.64 0.60 0.53 0.69 0.71 0.64 0.73 0.63 0.72
EV 0.66 0.54 0.55 0.68 0.68 0.72 0.67 0.57 0.66
ES 0.63 0.60 0.55 0.65 0.70 0.62 0.67 0.50 0.73
EL + EV 0.66 0.62 0.56 0.69 0.73 0.68 0.71 0.60 0.71
EL + ES 0.65 0.62 0.55 0.68 0.74 0.64 0.73 0.57 0.74
Table 4.13: MEG scores averaged over each modality. Bold signifies the highest
average performance for each participant.
Modality P1 P2 P3 P4 P5 P6 P7 P8 P9
EL 0.46 0.45 0.45 0.50 0.50 0.48 0.48 0.47 0.54
EV 0.54 0.40 0.57 0.56 0.52 0.53 0.47 0.56 0.48
ES 0.51 0.45 0.49 0.55 0.63 0.38 0.55 0.48 0.36
EL + EV 0.53 0.53 0.47 0.55 0.53 0.48 0.56 0.54 0.54
EL + ES 0.54 0.53 0.50 0.48 0.45 0.52 0.54 0.42 0.56
Table 4.14: fMRI scores averaged over each modality on the common subset of
vocabularies. Bold signifies the highest average performance for each participant.
Modality P1 P2 P3 P4 P5 P6 P7 P8 P9
EL 0.55 0.56 0.45 0.44 0.55 0.45 0.57 0.45 0.53
EV 0.44 0.51 0.55 0.44 0.54 0.54 0.49 0.55 0.53
ES 0.48 0.60 0.49 0.49 0.53 0.54 0.59 0.39 0.49
EL + EV 0.49 0.41 0.49 0.48 0.52 0.55 0.49 0.57 0.56
EL + ES 0.60 0.58 0.55 0.53 0.44 0.45 0.61 0.53 0.45
Table 4.15: MEG scores averaged over each modality on the common subset of
vocabularies. Bold signifies the highest average performance for each participant.
Architecture             Embedding        Accuracy (%)
Add                      linguistic only  77.54
Add                      visual only      72.70
Add                      multi-modal      76.56
Add                      random           69.87
Add+Translation          linguistic only  81.21
Add+Translation          visual only      79.75
Add+Translation          multi-modal      81.81
Add+Translation          random           78.33
Add+Translation+FullVis  linguistic only  79.85
Add+Translation+FullVis  visual only      79.11
Add+Translation+FullVis  multi-modal      81.29
Add+Translation+FullVis  random           78.79
GRU                      linguistic only  79.77
GRU                      visual only      77.34
GRU                      multi-modal      79.48
GRU                      random           79.25
LSTM                     linguistic only  79.80
LSTM                     visual only      78.22
LSTM                     multi-modal      79.61
LSTM                     random           76.16
Table 4.16: Classification accuracy of the different architectures and embedding
initialisations.
Chapter 5
Effects of Data Size and
Distribution
This chapter shifts the focus towards a more in-depth analysis of selected
model, data source and modality combinations, based on the results of the
previous chapter. Our main metric is still performance accuracy, so this
analysis forms the last part of pillar 1.
We focus on model efficiency with regard to size and performance, and examine
more closely the effect of training data size and distribution. The presented
experiments address the following questions:
• Does visual data bolster performance only because we add more data or
does it convey complementary quality information compared to a higher
quantity of text? (Question 4)
• Can we achieve comparable performance using small data if it comes from
the right data distribution? (Question 4a)
We perform several experiments to test the effect of data size and
data distribution on semantic similarity and relatedness tasks. We compare
linguistic, visual and structured embeddings, based on various criteria.
5.1 Counting in the “Effort”
The work presented here is related to a recently published information-theoretic
probing framework based on minimum description length (MDL) [Voita and Titov,
2020], i.e. the minimum number of bits needed to transmit the labels knowing
the representations. Our idea is to count in the “effort” of data collection and
quantity when assessing the performance of our multi-modal word meaning
representations. Unlike Voita and Titov, instead of testing on supervised tasks,
we focus on unsupervised evaluation. We do not train a multi-layer perceptron
for probing. This is relevant because it avoids distorting our results with a
network that functions as supervised fine-tuning. In Section 4.4 we found that a
shallow neural network and a deep LSTM, both with randomly initialised input
word vectors, perform on par with the same models given pretrained word
embeddings as input on a Textual Entailment task (SNLI). Zhang and Bowman found
the related phenomenon of high-performing randomly initialised LSTM models
[Zhang and Bowman, 2018]. This is in line with recent findings on
transformer-type models, which have been shown to be far from solving general
tasks (e.g., document question answering). Rather, these models overfit to the
quirks of particular datasets [Yogatama et al., 2019]. Motivated by these
results, in this work we focus on unsupervised representation learning.
In unsupervised representation learning we learn P(x) instead of P(y|x),
where x is the input data and y is the corresponding label determined by the
supervised evaluation task. Hence, our approach is more closely related to Voita
and Titov’s MDL framework with “online” code, where the code length is simply
calculated by the entropy of the training data.
We aim to measure how hard it is to obtain a high-performing representation
with small data. In the previous chapter we controlled for image quantity
for DV (Section 4.1) and the context size (radius) of DS (Section 4.2). In this
chapter we focus on controlling for the size and distribution of the text data
DL. Our question is: What is the corpus size where visual information is
helpful? We count in the “effort” by discussing performance in the context of
data and model size. In the following, we describe our implementation of
controlling for data quantity and word frequency distribution.
5.2 Experiments
Here, we summarise the notation and specify the models used in the following
experiments, based on our previous findings in Chapter 4.
$E_L \in \mathbb{R}^{|T| \times d_L}$: Linguistic Embedding. Here, we present
results using Skip-Gram with Negative Sampling (SGNS) [Mikolov et al., 2013a,
Mikolov et al., 2013b] trained on a 2020 English Wikipedia dump. Due to its
simplicity, it is suitable for running a wide range of experiments.
$E_V \in \mathbb{R}^{|T| \times d_V}$: Visual Embedding. We ran a feedforward
step of ResNet-152 [He et al., 2016] on Google Images. We apply mean aggregation
on the first 10 image results, which was found to be one of the best performing
settings in Section 4.1.
$E_S \in \mathbb{R}^{|T| \times d_S}$: Structured Embedding. We use our
in-between visual and linguistic embedding, trained on the visually structured
text of Visual Genome Scene Graphs (Section 4.2).
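The mean aggregation used for the visual embedding can be illustrated with a
minimal sketch. It assumes that the per-image feature vectors have already been
extracted (the network forward pass is not shown); the function name
`aggregate_mean` and the toy 3-dimensional features are hypothetical and chosen
only for illustration.

```python
from statistics import fmean


def aggregate_mean(image_features, top_k=10):
    """Average the feature vectors of the first `top_k` images of one word."""
    selected = image_features[:top_k]
    if not selected:
        raise ValueError("need at least one image feature vector")
    # zip(*selected) iterates over dimensions, so each output entry is the
    # mean of that dimension across the selected images.
    return [fmean(dim) for dim in zip(*selected)]


# Toy 3-dimensional "CNN features" for two images retrieved for the same word
# (hypothetical values).
features = [[1.0, 0.0, 2.0], [3.0, 4.0, 2.0]]
word_vector = aggregate_mean(features)
```

With fewer than `top_k` images available, the sketch simply averages whatever is
there; how such cases were handled in the experiments is not stated here.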
In the following we show results according to $e_1, \ldots, e_l$ samples from
the linguistic training corpus $D_L$. $T = |V \cap V_{task}| \approx |V_{task}|$,
$V_{task} \subset V$, where $V$ is the vocabulary of the text corpus and
$V_{task}$ is the vocabulary of the evaluation tasks.
5.2.1 Control for Data Quantity
We perform experiments where we restrict the training data size of EL. Similarly
to [Sahlgren and Lenci, 2016], we randomly sample the corpus into subsets with
an increasing number of tokens: $e_1, \ldots, e_N$.
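One possible way to implement such down-sampling is sketched below. It assumes
sampling at sentence level with a fixed random seed, which the text does not
specify; `sample_corpus`, the toy corpus and the token budgets are illustrative
names and values, not those used in the experiments.

```python
import random


def sample_corpus(sentences, token_budget, seed=0):
    """Randomly pick sentences until at least `token_budget` tokens are collected.

    The result may overshoot the budget by up to one sentence, and is capped
    by the size of the corpus.
    """
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)
    subset, n_tokens = [], 0
    for i in order:
        if n_tokens >= token_budget:
            break
        subset.append(sentences[i])
        n_tokens += len(sentences[i])
    return subset, n_tokens


# Toy tokenised corpus with 10 tokens in total (hypothetical data).
corpus = [["a", "dog", "barks"], ["the", "cat", "sleeps"],
          ["birds", "sing"], ["fish", "swim"]]
budgets = [3, 6, 10]
subsets = {b: sample_corpus(corpus, b, seed=1) for b in budgets}
```

Because the same seed yields the same shuffled order, each smaller subset in
this sketch is a prefix of the larger ones; repeating the procedure with
different seeds gives independent runs of random down-sampling.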
5.2.2 Control for Frequency Ranges
In the second phase we test how models trained on different word frequency
ranges interact with the other types of embeddings. Similarly to [Sahlgren
and Lenci, 2016], we split the vocabulary into three equally large parts: HIGH,
MEDIUM and LOW range. This way we generate samples for EL, EV and ES
for the different frequency ranges in the text corpus.
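The frequency split can be sketched as follows. The sketch assumes that
"equally large" refers to the number of vocabulary items (word types) in each
range and that ties in frequency are broken alphabetically; both are assumptions
made for illustration, and `split_by_frequency` is a hypothetical helper.

```python
from collections import Counter


def split_by_frequency(tokens):
    """Split the vocabulary into HIGH, MEDIUM and LOW thirds by corpus frequency."""
    counts = Counter(tokens)
    # Sort by descending count; ties are broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    third = len(ranked) // 3
    return {
        "HIGH": ranked[:third],
        "MEDIUM": ranked[third:2 * third],
        "LOW": ranked[2 * third:],  # absorbs the remainder if |V| % 3 != 0
    }


# Toy corpus: a occurs 4 times, b 3 times, c and d twice, e and f once.
tokens = "a a a a b b b c c d d e f".split()
ranges = split_by_frequency(tokens)
```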
5.2.3 Expected Results
These experiments will potentially shed light on patterns across modalities and
sources. One interesting result will be to see whether EV and ES embeddings
contribute more when there is a smaller amount of text data for EL. If this is
the case, the experiments where we control for word frequencies can reveal
whether EV and ES contribute differently for words with different data
distributions, or whether the effect is mainly due to data quantity. Similar
questions can be answered in the reverse direction by experiments that control
for image data size and distributional properties, such as image resolution or
dispersion of image sets.
5.2.4 Results
Figure 5.1 shows the effect of EL corpus size on the performance of uni-modal EL
and the combined EL + ES and EL + EV on the embeddings’ common coverage
subsets of MEN (Figure 5.1a) and SimLex (Figure 5.1b). The common coverage
is 73% on MEN and 56% on SimLex. ES and EV are constant since only EL’s
training data is varied. Results on the full datasets are presented in Figure 5.2.
The x axis shows the size of the training corpus (in number of tokens). Error
bars indicate variance over three runs of random down-sampling of the data.
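To make the notions of coverage and correlation-based evaluation concrete, the
following is a minimal, dependency-free sketch. It assumes that model similarity
is the cosine between word vectors and that a pair counts as covered only when
both words are in the embedding vocabulary; the toy vectors, gold scores and all
function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def evaluate(embedding, pairs):
    """Return (Spearman correlation, coverage in %) over the covered pairs."""
    covered = [(w1, w2, s) for w1, w2, s in pairs if w1 in embedding and w2 in embedding]
    gold = [s for _, _, s in covered]
    pred = [cosine(embedding[w1], embedding[w2]) for w1, w2, _ in covered]
    # Spearman correlation is the Pearson correlation of the rank vectors.
    rho = pearson(ranks(gold), ranks(pred))
    return rho, 100 * len(covered) / len(pairs)


# Toy embedding and gold similarity ratings; "tree" is out of vocabulary.
emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.2, 0.8]}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0), ("dog", "tree", 5.0)]
rho, coverage = evaluate(emb, pairs)
```

Restricting several embeddings to the pairs covered by all of them, as done for
the common coverage subsets, amounts to applying the same filter with the
intersection of their vocabularies.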
Table 5.1 gives an account of the amount of training data each model requires.
The last line shows the size after compression by Lempel-Ziv coding (LZ77). Since
ImageNet images are already in jpg format, LZ77 was not able to achieve any
further compression.
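The contrast between compressible text and already-compressed image data can be
reproduced in miniature. The sketch below uses Python's `zlib` module, whose
DEFLATE format combines LZ77 with Huffman coding; it is not necessarily the tool
used to produce Table 5.1, and the random bytes merely stand in for data that
has already been compressed, such as jpg files.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, level=9)) / len(data)


# Highly repetitive text-like data (hypothetical example).
text_like = b"the cat sat on the mat . " * 1000
# Random bytes as a stand-in for already-compressed content.
noise_like = os.urandom(25_000)

text_ratio = compression_ratio(text_like)
noise_ratio = compression_ratio(noise_like)
```

In this toy setting the repetitive text shrinks to a small fraction of its size,
whereas the random bytes do not shrink at all, mirroring the pattern in the
compressed-size row of the table.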
The first striking result is that ES alone, with ~9M tokens, outperforms
EL, with ~1G tokens, on both evaluation tasks. Secondly, when combined with
linguistic data, ES greatly outperforms EV on MEN and underperforms it on
SimLex; however, their difference becomes marginal as text data increases.
Importantly, ES achieves this result with orders of magnitude less data than
required by EV (Table 5.1). Moreover, ResNet-152, with ~6.8G parameters, yields
an embedding that is 1.7 times larger (4.8MB) than that of SGNS (2.8MB), which
is used for EL and ES and consists of 151,200 parameters. A summary of model
sizes is included in Table 5.2 for the common subset of their vocabularies of
1203 words.
(a) MEN (b) SimLex
Figure 5.1: Effect of EL training corpus (token) quantity on performance on the
common coverage subsets of evaluation pairs (73% on MEN, 56% on SimLex).
ES and EV are constant since only EL’s training data is varied.
Figures 5.2c and 5.2d report the effect of word frequency on performance on
the same tasks. As described in Section 5.2.2, following [Sahlgren and Lenci,
2016], the vocabulary is split into three equally large parts: HIGH, MEDIUM and
LOW range. On MEN we see a slight performance gain of the baseline EL model on
medium range frequency words, whereas on SimLex, low frequency words dominate
the performance within the whole data (MIXED). On SimLex visual information
helps more with HIGH frequency words. This could be because visual information
narrows down the meaning of ambiguous words. Checking this hypothesis would be
an interesting future analysis.
ES performs similarly to the FastText VG description model of [Herbelot,
2020] on SimLex. The increase of EL performance is in line with [Sahlgren and
Lenci, 2016] until 2G tokens (they stopped at 1G), after which it plateaus. The
best Spearman correlation of [Kuzmenko and Herbelot, 2019] using relations on
MEN is 0.5499, with roughly a third (847) of our coverage on the common
subset: ES achieves 0.44 with a coverage of 2481. Their word2vec model is
consistent with results reported by [Sahlgren and Lenci, 2016] and with our
word2vec based EL model trained on a similar amount of data.
(a) MEN, quantity (b) SimLex, quantity
(c) MEN, frequency (d) SimLex, frequency
Figure 5.2: Effect of EL training corpus quantity and word frequency on
performance. Numbers on top of the bars and on the lines indicate the coverage
of evaluation dataset pairs (where both words are in the embedding vocabulary)
in percentages. ES and EV are constant since only EL’s training data is varied.

                  EL              ES                         EV
Model             SGNS            SGNS                       ResNet-152
Training data     Wikipedia 2020  Visual Genome annotations  ImageNet + Google Images
Size in units     13G tokens      ~9M tokens                 ~1.28M + 15,770 images (jpg)
Storage size      14GB            ~1.8GB                     ~140GB
Compressed size   ~5GB            ~0.2GB                     ~140GB
Table 5.1: Training data sizes.

                            EL       ES       EV
Model                       SGNS     SGNS     ResNet-152
Number of model parameters  151,200  151,200  6.8G
Embedding size              2.8MB    2.8MB    4.8MB
Table 5.2: Model sizes on the common subset of vocabularies (|Vcommon| = 1203).

5.3 Conclusion
Overall, we conclude that our structured visuo-linguistic embedding contributes
to a linguistic model in a much more economical way than the image-based ones.
We saw that when the linguistic sources are limited, visual or structured
information can greatly improve semantic similarity and relatedness predictions.
As the volume of our text corpus increases, its usefulness plateaus and the
performance gain from other modalities shrinks; however, in most cases
some improvement remains. These findings suggest that in certain cases one can
save valuable training time and storage space by balancing the trade-off between
training on different modalities and acquiring more text data.
Our structured embedding trained on Visual Genome Scene Graphs requires
orders of magnitude less data than either of the other two modalities, while
still contributing substantially to the meaning representation. This may be due
to the amount of human effort that went into creating the dataset. Applying
automatically generated scene graphs [Xu et al., 2020] would mitigate this
problem. This would serve as a highly effective tool with important applications
for low-resource languages. Our findings support the intuition of “no free lunch”
when it comes to effort, but depending on the task at hand and the available
resources it can be crucial to optimise the types of resources we use. Here we
focused only on data and model size. Including processing time and costs would
be an important future extension of the efficiency analysis.
Exactly how ES contributes to the linguistic EL representation cannot be
interpreted based solely on performance metrics. Therefore, we investigate the
interpretation of our representations and the type of information they convey in
the next chapter.
Chapter 6
Informativeness of Semantic
Spaces
In this chapter, we introduce the third key contribution of this thesis
(Chapter 1.1), presenting proof-of-concept studies of interpretable Transparency
analysis. We present experiments demonstrating pillar 2 (qualitative/quantitative
structural analysis) and pillar 3 (independence analysis).
We aim to take the systematic studies in Chapters 4 and 5 a step further,
and perform a quantitative and qualitative comparison of embedding space
structures. We showcase an implementation in the framework of modalities as
partial observers of meaning, introduced in Section 2.7.
Section 6.1 introduces our two hypotheses. In Section 6.2 we tackle Question 5:
Can we move beyond performance evaluation? Are there any emergent concepts
in embeddings? Can we quantify the difference between the concept structures of
semantic spaces? We hypothesise that each embedding space represents clusters
of word representations which can be interpreted as each embedding’s own “idea”
of concepts in the world. They can “disagree” depending on the data distributions
of the specific modality and data source they were trained on. By zooming
into our embeddings’ structure we aim to find out how much their models of
concepts differ from each other, if they differ at all. We look for quantitative
ways of measuring the difference between embedding spaces to complement the
qualitative analysis.
Section 6.3 addresses Question 6: Can we quantify the difference between
semantic spaces, based on the useful information they contribute to the meaning
representation? We apply an information-theoretical framework laid out in
Section 2.7.5 to estimate the Mutual Information of two semantic spaces using
methods described in Section 3.2.4.
Finally, Section 6.4 investigates the results in the context of distributional
properties of the linguistic and structured data sources, DL and DS.
Our main contribution is a proof-of-concept framework for quantifying the
information that different data sources, models and modalities bring into
multi-modal word representations. It can easily be applied to further data,
model or modality types beyond the ones showcased in this study. This set of
methods can help us look under the hood of accuracy numbers on evaluation tasks
and better understand how these different concept models interact with each
other when they are combined in multi-modal models of word meaning.
6.1 Hypotheses
Within our generalised embedding framework (Section 2.6) we use the same
models as in Section 5.2. We propose investigating the structure of the learnt
embedding spaces EL, EV, ES. The aim is to compare embedding spaces
qualitatively according to various metrics that capture the distributional
properties of vector spaces. Furthermore, we put the results in the context of
analysing the training data distributions.
Based on our previous findings we form the following hypotheses:
I. EV can be complementary to EL when the training corpus size is small.
It is not clear whether in this case EV comes from a different and
complementary distribution or the performance gain is only relative to the size
of the additional data. In the latter case, we would achieve the same result by
training on the same amount of additional text.
II. Due to the manufactured way of collecting data for ES, it is possible that
this dataset comes from a substantially different distribution than our
linguistic data. Therefore, it can provide useful information and can facilitate
learning from small data.
6.2 Qualitative Analysis of Semantic Spaces
As described in Section 3.2.3.1, in order to grasp how the concept structures
of our embedding spaces differ from each other, we first searched for ways to
quantify their cluster structure. We do not know the ground truth labels of
our clusters or even the number of clusters each embedding space should be
broken into. Therefore, in Section 6.2.1 we present the results of experiments
with three clusterization metrics which are designed for the case when a ground
truth labelling is not available. Furthermore, we report results for a range of
cluster numbers.
To interpret how our different models conceptualise, in Sections 6.2.2 and
6.2.3 we zoom into our embedding spaces even further. In
Section 6.2.2 we compare our embeddings’ cluster structures and visualise the
learnt clusterings. In Section 6.2.3 we present supervised visualisations of the
embedding spaces alongside an automatic label generation method and compare
the results against the clusterization metric scores.
6.2.1 Cluster Structure Results
Clustering metric results are presented for increasing numbers of clusters,
using K-means clustering, in Figure 6.1 (see the definitions of the metrics in
Section 3.2.3.1). We compare the common subset of our embedding vocabularies,
which contains 1204 words. Calinski-Harabasz Index and Davies-Bouldin Index
scores (Figures 6.1b and 6.1c) are fairly consistent with each other, while we
see a different pattern for the Silhouette Coefficient in Figure 6.1a. This is
unsurprising since the first two are based on node and centroid distances,
whereas the latter calculates distances solely between nodes in the space.
In Davies-Bouldin Index (Figure 6.1c) all models significantly outperform the
baseline Random embedding $E_R \in \mathbb{R}^{|V_{common}| \times 300}$. All
models achieve similar scores, with the visual, the structured and the
linguistic-visual multi-modal models performing the best. This index
represents the ratio between intra-cluster distances from the centroids and
inter-cluster distances of centroids.
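The ratio described above can be illustrated on toy data. The following Python sketch follows the standard definition of the Davies-Bouldin Index (for each cluster, the worst ratio of summed scatters to centroid distance, averaged over clusters). The function names and the two-dimensional toy points are assumptions made for illustration; this is not the implementation behind Figure 6.1.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of vectors. Lower is better.

    Assumes at least two clusters with distinct centroids.
    """
    cents = [centroid(c) for c in clusters]
    # scatter[i]: mean distance of the members of cluster i to its centroid
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / k


# Toy data (hypothetical): two compact, distant clusters vs. two spread-out,
# nearby clusters.
tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 4.0]], [[3.0, 0.0], [3.0, 4.0]]]
db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
```

On these toy points the compact clustering obtains the lower (better) score, as expected.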
Calinski-Harabasz Index scores (Figure 6.1b) show a similar tendency among
the models, with EV and EL + EV performing best across the numbers of
clusters, while all models outperform the Random baseline. As the number of
clusters grows, the results converge to a lower (worse) score. This score can
be interpreted as a measurement of how well defined the clusters are in terms
of the ratio between inter- and intra-cluster dispersions; therefore, a higher
score means better defined clusters.
Figure 6.1: Clustering metrics for increasing number of K-means clusters.
(a) Silhouette Coefficient. Higher is better.
(b) Calinski-Harabasz Index. Higher is better.
(c) Davies-Bouldin Index. Lower is better.
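As a hedged illustration of the dispersion ratio behind the Calinski-Harabasz Index, the sketch below uses the standard definition: between-cluster dispersion divided by (k - 1), over within-cluster dispersion divided by (n - k). Function names and toy points are illustrative assumptions, not the code used for the reported results.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of vectors. Higher is better.

    Assumes 2 <= k < n and a non-zero within-cluster dispersion.
    """
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    overall = mean(all_points)
    cents = [mean(c) for c in clusters]
    # Between-cluster dispersion: size-weighted squared centroid distances
    between = sum(len(c) * sq_dist(m, overall) for c, m in zip(clusters, cents))
    # Within-cluster dispersion: squared distances of points to own centroid
    within = sum(sq_dist(p, m) for c, m in zip(clusters, cents) for p in c)
    return (between / (k - 1)) / (within / (n - k))


# Toy data (hypothetical)
tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 4.0]], [[3.0, 0.0], [3.0, 4.0]]]
ch_tight = calinski_harabasz(tight)
ch_loose = calinski_harabasz(loose)
```

The compact, well separated toy clustering scores far higher than the spread-out one, matching the "higher is better" reading of Figure 6.1b.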
The Silhouette Coefficient measures pairwise distances of data points within
their own clusters and each point’s distances to data points in other
clusters. It gives a ratio of cluster cohesion and separation. In Figure 6.1a
we see a similar tendency across models as before (with EV as the best), with
the exception of the structured model ES. It outperforms all models up to ~20
clusters, then drops below the Random baseline by 40. Furthermore, unlike in
the previous two cases, the other models do not converge. This suggests that
ES has a much more cohesive structure of ~20 clusters, but becomes incohesive
if we try to break it into
more clusters. This phenomenon might be related to the statistical properties
of the Visual Genome dataset that ES is trained on. In the original paper
[Krishna et al., 2016] the authors report results on clustering region
descriptions. They found that, on average, each image contains descriptions
from 17 different clusters, and the image with the most diverse descriptions
contains descriptions from 26 clusters. Unlike our model, they clustered
averaged pretrained word representations of region descriptions; therefore,
their results are not directly comparable to ours. Nevertheless, we think
this can indicate why this dramatic drop occurs at around 20 clusters in our
experiments.
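For reference, the cohesion and separation ratio of the Silhouette Coefficient can be sketched as follows, using the standard per-point score s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster and b is the smallest mean distance to the members of any other cluster. Scoring one-element clusters as 0 is a common convention that is assumed here; names and toy data are illustrative only.

```python
import math


def silhouette(clusters):
    """Mean silhouette score over all points. Higher is better.

    clusters: list of clusters, each a list of vectors (at least 2 clusters).
    Points in one-element clusters get a score of 0 (assumed convention).
    Duplicate points that make both a and b zero are not handled.
    """
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: cohesion, mean distance to the other members of own cluster
            a = sum(
                math.dist(p, q) for qi, q in enumerate(cluster) if qi != pi
            ) / (len(cluster) - 1)
            # b: separation, mean distance to the nearest other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (hypothetical)
tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 4.0]], [[3.0, 0.0], [3.0, 4.0]]]
sil_tight = silhouette(tight)
sil_loose = silhouette(loose)
```

Because the score depends only on point-to-point distances, splitting a cohesive group into more clusters lowers b for its members, which is one way a drop like the one observed for ES can arise.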
6.2.2 Inspecting the Clusters
In the following we inspect the individual clusters in all three embeddings
after clustering each of them into 20 clusters. We also look at ES after
clustering it into 40 clusters, where the drop in Silhouette Coefficient
happens.
6.2.2.1 Size Distribution and Visualisation
In Figure 6.2 we present the distribution of cluster sizes (number of cluster
members) for each case. Firstly, we observe that EL and EV cluster sizes range
between 10 and ~100, whereas in both cases the ES cluster size distribution
ranges between 1 and ~400. In the ES 20 clusters case (Figure 6.2a) most
clusters range between 10 and 117, but there are two one-element clusters and
one with size 444. Clustering it into 40 clusters (Figure 6.2b) we get three
one-element clusters and two salient clusters of sizes 148 and 310.
To check the consistency of clustering, in Figure 6.3 we present similar his-
tograms after clustering the embeddings using Agglomerative Clustering. We see
a very similar pattern in cluster size distribution as with K-means in all three
embeddings. ES has a saliently big cluster of 351 elements.
The red line shows the average frequency of words (AF) in each cluster in the
corresponding textual dataset (Visual Genome Scene Graphs for ES and
Wikipedia2020 for EL). In the visual case the notion of word frequency is not
applicable. We were mainly interested in whether the saliently big clusters
in ES are an artefact of word frequencies. Whereas in the case of 20 K-means
clusters we only see a slight drop of AF, in the 40 cluster case the two
biggest clusters have relatively low values, although there are other low AF
clusters among the smaller ones as well (Figure 6.2). After Agglomerative
Clustering (Figure 6.3) we observe a more substantial drop in AF for the two
biggest clusters. In EL we see no such patterns, but the cluster sizes are
less varied there.
Figure 6.2: K-means cluster size distributions. (a) ES, 20 clusters. (b) ES,
40 clusters. (c) EL, 20 clusters. (d) EV, 20 clusters. The Y axis shows the
number of cluster members on a log scale. The red line shows the average
frequency of words in each cluster in the corresponding textual dataset.
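The two quantities plotted in these histograms, cluster size and average word frequency (AF) per cluster, can be computed along the following lines. This is a minimal sketch: the function name, the input dictionaries and all numbers are made up for illustration.

```python
from collections import defaultdict


def cluster_stats(assignments, frequencies):
    """Return {cluster_id: (size, average word frequency)}.

    assignments: word -> cluster id
    frequencies: word -> count in the corresponding textual dataset
                 (words missing from the dictionary count as 0)
    """
    members = defaultdict(list)
    for word, cluster_id in assignments.items():
        members[cluster_id].append(word)
    stats = {}
    for cluster_id, words in members.items():
        avg_freq = sum(frequencies.get(w, 0) for w in words) / len(words)
        stats[cluster_id] = (len(words), avg_freq)
    return stats


# Made-up toy input, not taken from the thesis data
assignments = {"dog": 0, "cat": 0, "hamster": 0, "train": 1}
frequencies = {"dog": 900, "cat": 600, "hamster": 30, "train": 400}
stats = cluster_stats(assignments, frequencies)
```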
As an effective visualisation we use the T-SNE algorithm [Maaten and Hinton,
2008, Wattenberg et al., 2016] to zoom further into the structure of our
embedding spaces. We used TensorBoard¹ for the projections as well as its
implementation of T-SNE. Following the guidelines in [Wattenberg et al., 2016]
we tried different perplexity settings (running it multiple times). In most
cases we did not find much difference between the results on our data.
Following the suggested range of 5 – 50, we present results for perplexity =
30 unless indicated otherwise. Figures 6.5-6.8 and D.10 contain T-SNE
visualisations of the clusterings. The salience of the biggest ES K-means
clusters is visible in all cases (Figures 6.5, 6.8, D.10). Based on the
average frequency results, we think that the reason for this huge separable
cluster is at least partially that it includes more low frequency words. The
breakdown of cluster cohesion is visible in the 40 cluster cases. In general,
the clusters are fairly separated in all projections.
¹ https://www.tensorflow.org/tensorboard
Figure 6.3: Agglomerative cluster size distributions. (a) ES, 20 clusters.
(b) EL, 20 clusters. (c) EV, 20 clusters. The Y axis shows the number of
cluster members on a log scale. The red line shows the average frequency of
words in each cluster in the corresponding textual dataset.
6.2.2.2 Cluster Similarities
Next, we looked into the individual clusters in each embedding. Each row in
Tables 6.2-6.4 contains the members of example clusters for the corresponding
embedding. (See tables including all clusters in Appendix D.) Rows are ordered
by the number of cluster members in increasing order. Words in column
“Members” are ordered by their distance from the cluster centroid in
increasing order. (In the tables of ES clusters in Figures D.2 and D.4 we
shortened the biggest cluster, indicated by three dots, for better
readability.)
We labelled each cluster post hoc in two ways:
1. WordNet label was generated by querying the synset closure up to a depth
of 3 in the hypernym hierarchy for each word in the cluster. Then we took
each synset name in the closure lists and created a set from each list (by
removing duplicates). Next, we concatenated all the sets (each corresponding
to one word) into one list. The generated cluster label consists of the three
most common lemmas in this list. An example is shown in Figure 6.4. This can
be considered as a form of “crowd-sourced” annotation, as it relies on a
dataset created by human linguistic experts.
2. Own label is our annotation (without looking at the WordNet labels).
“Misc” stands for Miscellaneous, where we could not find an appropriate
concept to describe the cluster.
1. Cluster = ['apple', 'pizza']
2. closures('apple') = [
Synset('edible_fruit.n.01'), Synset('pome.n.01'),
Synset('fruit.n.01'), Synset('produce.n.01'),
Synset('reproductive_structure.n.01'), Synset('food.n.02'),
Synset('apple_tree.n.01'), Synset('fruit_tree.n.01'),
Synset('angiospermous_tree.n.01'), Synset('apple.n.01'),
Synset('apple.n.02')]
3. closures('pizza') = [
Synset('dish.n.02'), Synset('nutriment.n.01'),
Synset('food.n.01'), Synset('pizza.n.01')]
4. list of synset names in decreasing frequency order = [
'food', 'nutriment', 'pizza', 'dish', 'apple', 'pome',
'fruit', 'apple_tree', 'edible_fruit', 'fruit_tree',
'produce', 'angiospermous_tree', 'reproductive_structure']
5. labels = ['food', 'nutriment', 'pizza']
Figure 6.4: WordNet label generation example.
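The WordNet labelling steps can be sketched in a few lines of Python. To stay self-contained, the sketch does not query WordNet; it hard-codes the closures of the 'apple' and 'pizza' example as plain strings. The function names are hypothetical, and the way ties between equally frequent names are broken is an assumption, since the text does not specify it.

```python
from collections import Counter

# Hypernym closures copied from the 'apple' / 'pizza' example, as synset names.
CLOSURES = {
    "apple": [
        "edible_fruit.n.01", "pome.n.01", "fruit.n.01", "produce.n.01",
        "reproductive_structure.n.01", "food.n.02", "apple_tree.n.01",
        "fruit_tree.n.01", "angiospermous_tree.n.01", "apple.n.01",
        "apple.n.02",
    ],
    "pizza": ["dish.n.02", "nutriment.n.01", "food.n.01", "pizza.n.01"],
}


def lemma(synset_name):
    """'edible_fruit.n.01' -> 'edible_fruit'."""
    return synset_name.split(".")[0]


def cluster_label(words, closures, top_n=3):
    """Most common names across the de-duplicated closures of the words."""
    counts = Counter()
    for word in words:
        # Remove duplicates within one word's closure, keeping first-seen order
        unique_names = list(dict.fromkeys(lemma(s) for s in closures.get(word, [])))
        counts.update(unique_names)
    # Ties are resolved by first occurrence here (an assumption)
    return [name for name, _ in counts.most_common(top_n)]


labels = cluster_label(["apple", "pizza"], CLOSURES)
```

In this toy case 'food' is the only name shared by both words, so it ranks first; the remaining two labels depend on the tie-breaking rule and may therefore differ from those listed in Figure 6.4.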
Our own annotations and the WordNet labels are fairly consistent with each
other and often use the same words or synonyms, e.g., “drink”-“beverage”. One
interesting exception is the fifth row in Table 6.4 of the image based
clusters, which we interpreted as female visual stereotypes, whereas the
WordNet label is: “person, organism, causal agent”. We find our interpretation
supported by previous work on the bias of Google Images [Kay et al., 2015],
however, with the disclaimer of coherence being “in the eye of the beholder”
[Bender et al., 2021]. WordNet labels can sometimes be more generic than our
annotation. This may be because WordNet was created by multiple experts, as
opposed to our own annotations.
In general, the Wikipedia based EL has more clusters with abstract topics,
such as verbs, activities and communication. ES has more concrete clusters,
e.g., train, vehicles, building structures, containers or furnishing. The
image based EV, in turn, includes more clusters related to the outdoors, such
as “travel”, “transportation”, “landscape” and “vacation”, and to appearance,
such as “colours & materials”. These differences may not be surprising given
each data source, but we would highlight the fact that these statistics are
computed on the exact same vocabulary. Therefore, the difference between these
data sources is not simply that they include different vocabularies, but that
they “understand” the same words differently. This is the type of information
we think is important to be conscious of when building on any data source or
modality.
There are also some concepts that all three embeddings capture consistently,
such as “food”, “colours”, “plants”, “animals” and “body parts”. Different
embeddings differ, however, in the number of clusters they have related to
similar concepts, and of course their exact content differs to various
extents.
In order to capture how similar the clusters are across the different
embeddings, we measured the pairwise Jaccard similarity coefficient between
the clusters of each pair of embeddings. The Jaccard similarity coefficient
between two clusters A, B is defined as
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{6.1}$$
Note that $0 \le J(A, B) \le 1$.
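Equation (6.1) translates directly into code. The sketch below treats each cluster as a set of words; returning 0 for two empty clusters is an added convention, since the ratio is undefined in that case, and the example words are chosen for illustration only.

```python
def jaccard(cluster_a, cluster_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two clusters of words."""
    a, b = set(cluster_a), set(cluster_b)
    union = a | b
    if not union:
        return 0.0  # convention for two empty clusters (assumption)
    return len(a & b) / len(union)


# Illustrative word lists, not actual clusters from the tables
example = jaccard({"pizza", "bread", "soup"}, {"pizza", "bread", "apple", "fish"})
```

Here two of the five distinct words are shared, so the similarity is 2/5.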
We calculated Jaccard similarity scores between each pair of clusters which
represent concepts. Cluster maps of similarities are presented in Figures 6.9,
6.10 and 6.11. These are heat maps of Jaccard similarities, where the rows and
columns of the matrix have been clustered for better visibility. Each row and
column is labelled with their respective WordNet cluster label.
We observe that “food”, “plants”, “animals”, “body parts” and “travel /
vehicle” related clusters are distinctly more similar between each pair of
embeddings than the other clusters. Beyond this, ES and EL have similar
clusters related to “visual property”, “clothing” and “structures /
buildings”, and a “food” related ES cluster is close to a “container” cluster
in EL. ES and EV contain more similar travel related clusters: “travel,
change, object” – “physical entity, body of water, thing” and a pair of
containers / instruments: “artifact, whole, instrumentality” –
“instrumentality, container, substance”. EL and EV have similar clusters on
“structure / area”, and an EL “artifact, whole, instrumentality” cluster is
close to “food, beverage, produce” in EV.
Similar cluster maps are presented for Agglomerative Clustering in Appendix D,
Figures D.7–D.9. Figures D.1–D.6 include heat maps, where clusters are ordered
by size. We did not find any pattern in similarities based on size.
We also compared K-means and Agglomerative clusters of the same modalities
in Figures 6.12–6.15. We found the cluster structures fairly similar; the most
similar clusters are those related to food, body parts, animals, plants,
vehicles and visual property.
In order to quantify how similar each pair of cluster structures is, in
Table 6.1 we summarise the number of cluster pairs with Jaccard similarities
above thresholds of [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]. In the case of K-means,
even though EL and EV have 9 cluster pairs with 0.3 < J(., .) < 0.479, ES has
12 cluster pairs with EL and 8 with EV above a similarity of 0.2. With
Agglomerative Clustering this relative closeness of ES and EL disappears,
while the other two pairs show similar patterns to K-means. K-means and
Agglomerative clusterings are fairly similar, with EV sharing the most
similar cluster structure.
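A summary of this kind (the number of cluster pairs above each threshold, plus the maximum similarity) can be derived from the full pairwise Jaccard matrix. The following sketch assumes each clustering is given as a list of word sets; the helper names and the toy clusterings are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_pairs_above(clusters_x, clusters_y, thresholds):
    """Count cluster pairs with similarity strictly above each threshold.

    Returns (counts, max_similarity), where counts maps threshold -> count.
    With 20 clusters on each side, 20 * 20 = 400 pairs are compared.
    """
    sims = [jaccard(a, b) for a in clusters_x for b in clusters_y]
    counts = {t: sum(1 for s in sims if s > t) for t in thresholds}
    return counts, max(sims)


# Toy clusterings of a shared vocabulary (made up)
clusters_x = [{"cat", "dog", "owl"}, {"bus", "car"}]
clusters_y = [{"cat", "dog"}, {"owl", "bus", "car", "van"}]
counts, max_sim = count_pairs_above(clusters_x, clusters_y, [0.2, 0.4, 0.6])
```

Because the counts are cumulative, each row of such a summary is non-increasing from left to right.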
Figure 6.14 includes a heat map of K-means vs. Agglomerative ES clusters
ordered by size. Here, we can see that the two saliently biggest clusters are
relatively similar, reaching 0.65 Jaccard similarity. Their labels also share
the words “person” and “change”, which indicates that there is more meaningful
coherence in those sizeable clusters than merely including low frequency
words. Note that this coherence is hard to see with the naked eye because of
the number of words to review.
K-means
         >0.2  >0.3  >0.4  Max
ES-EL    12    1     0     0.358
ES-EV    8     2     0     0.363
EL-EV    9     5     4     0.479

Agglomerative
         >0.2  >0.3  >0.4  Max
ES-EL    6     1     0     0.347
ES-EV    9     4     2     0.5
EL-EV    9     5     2     0.467

K-means – Agglomerative
         >0.2  >0.3  >0.4  >0.5  >0.6  >0.7  Max
EL-EL    18    14    9     4     2     1     0.79
ES-ES    16    15    13    9     5     1     0.729
EV-EV    23    16    13    8     4     0     0.644

Table 6.1: Number of cluster pairs out of 20² with Jaccard similarities above
thresholds of [0.2, ..., 0.7]. Last column shows the maximum similarity.
WordNet label | Own label | Members
food, nutriment, foodstuff | food | butter, cheese, bread, chicken, soup, sauce, dessert, beef, salad, meat, cake, steak, tomato, potato, pizza, flour, milk, meal, vinegar, bacon, pie, cooking, sushi, sandwich, breakfast, burger, menu
vascular plant, plant organ, plant part | plants | flower, flowers, tree, blossom, dandelion, foliage, fruit, weed, cactus, lily, bloom, shade, leaf, grass, sunflower, poppy, vine, plant, garden, iris, grow, daisy, oak, bulb, rust, herb, moss, tulip, palm, maple, root, tall, bush, seed, family
atmospheric phenomenon, physical phenomenon, change | weather | rain, snow, fog, weather, mist, drizzle, frost, dew, cold, wet, wind, smoke, sunlight, misty, sunrise, winter, storm, sunset, haze, sunshine, fire, spring, dusk, autumn, heavy, atmosphere, cloud, sunny, burn, flood, desert, sun, hot, ice, tropical
artifact, covering, clothing | clothing / fashion | wig, clothes, dress, shoes, jacket, sweater, skirt, sunglasses, leather, hair, costume, shirt, haircut, cloth, socks, waist, mannequin, collar, jewelry, tattoo, lingerie, beard, blonde, mask, fabric, uniform, necklace, linen, outfit, glove, hat, fashion, blanket, bikini, knitting, swimsuit, crochet, badge, coat, carpet, bracelet, arms, makeup
artifact, structure, whole | classical architecture | tower, building, marble, staircase, fountain, doorway, roof, chapel, steeple, porch, ceiling, mural, glass, wall, brick, statue, stone, arch, monument, dome, window, gravestone, sculpture, aisle, tiles, gate, interior, painted, decoration, concrete, church, graveyard, cathedral, curtain, painting, palace, clock, grave, portrait, choir, architecture, pyramid, memorial, square, castle, skyscraper, museum, cemetery, temple, organ
change, color, visual property | colour / decor | blue, bright, green, pink, black, yellow, dark, white, purple, red, brown, violet, rainbow, colour, orange, sky, rusty, silhouette, grey, diamond, redhead, light, flame, peacock, mirror, color, tiny, shadow, stripes, dull, rose, neon, colorful, crystal, bell, moon, horizon, arrow, silver, ivy, gold, swan, dragon, lantern, star, pearl, horn, ray, fox, globe, planet, bold, belt
body part, part, artifact | body parts | skin, spine, neck, bone, chest, throat, shoulder, wrist, stomach, ear, jaw, cheek, lips, nose, eyes, eye, limb, toe, belly, skull, abdomen, finger, teeth, elbow, cord, whiskers, knee, thumb, tooth, muscle, ankle, tail, paws, lip, brain, flesh, leg, body, calf, heart, blood, tongue, brow, pain, tear, blade, mouth, liver, gut, arm, marrow, curled, canine, feathers, foot, vein, hip, cancer
change, act, be | verbs | bring, get, come, want, go, keep, take, know, find, say, give, make, understand, put, listen, enjoy, feel, leave, think, learn, imagine, gather, believe, fail, arrange, add, lose, create, way, hear, send, meet, collect, carry, avoid, buy, remain, allow, appear, might, enter, arrive, seem, entertain, break, steal, receive, stop, stand, build, locked, compare, retain, sell, handle, danger, eat, wander, face, unhappy, protect, please, pray, become, walk, expand, travel, plenty, greet, inspect, comfort, huge, possess, dominate, attach, roam, participate, speak, step, drawn, construct, replace, divide, great, living
Table 6.2: Examples of the 20 clusters in EL. Clusters are ordered by size. See
all clusters in Appendix Table D.1
WordNet label | Own label | Members
artifact, line, whole | train | railway, railroad, subway, curve, tunnel, run, shelter, train, station, tram, highway, track, rail, way, engine, stop, gate, bridge, smoke
structure, area, room | room | classroom, hallway, hall, closet, bedroom, room, bathroom, garage, office, cafe, museum, doorway, kitchen, shop, restaurant, store, mannequin, stadium, market, ceiling, corner
bird, vertebrate, person | animals | hummingbird, gull, peacock, hawk, pelican, crow, parrot, seagull, wing, swan, pigeon, owl, goose, flamingo, nest, eagle, tail, bird, silhouette, duck, chest, body, ledge, giraffe, zebra
travel, wheeled vehicle, self-propelled vehicle | vehicles | cab, car, taxi, police, vehicle, automobile, drive, racing, scooter, bike, van, street, road, motorcycle, truck, speak, wagon, bus, parade, drawn, asphalt, cop, parking, bicycle, sidewalk, traffic, driver, carriage, meter
plant organ, plant, vascular plant | plants | bloom, foliage, grave, dead, vine, blossom, ivy, pod, cactus, tree, moss, root, leave, limb, forest, bush, plant, lily, branch, weed, leaf, vein, sunshine, log, fence, flower, sunlight, wood, palm, bench, sun
structure, artifact, whole | building parts | chapel, cottage, steeple, castle, dome, story, cathedral, build, skyscraper, arch, lighthouse, apartment, hut, angel, shed, hotel, monument, window, staircase, home, cabin, house, roof, porch, tower, sculpture, patio, bell, deck, brick, church, cross, clock, step, statue
instrumentality, container, substance | vessel | champagne, tea, beverage, alcohol, honey, milk, pencil, tulip, juice, oil, bakery, ceramic, container, coffee, tin, cup, beer, sunflower, daisy, wine, rose, marble, bowl, sweet, maker, jar, vessel, mug, money, bottle, pumpkin, straw, glass, basket, box, pot, bucket, bunch
body part, artifact, part | pets & body parts | jaw, throat, pupil, cheek, canine, belly, brow, mouth, stomach, tongue, eye, nose, poodle, ear, hamster, lip, fur, tooth, teeth, pet, leg, wool, head, feline, toe, panda, smile, neck, face, beard, puppy, collar, horn, skin, cat, kitty, calf, nail, dog, tag, mother
physical entity, body of water, thing | water | rapid, village, coast, bay, mist, horizon, canal, skyline, valley, sea, cliff, fog, town, waterfall, stream, water, sunset, pier, harbor, boardwalk, break, ocean, lake, fountain, shore, island, river, wave, splash, city, rock, ship, building, sand, hill, crane, mountain, beach, pond, surf, boat, pool
location, artifact, region | farm animal | dandelion, boundary, grass, wild, deer, stork, field, mud, farm, windmill, garden, landscape, desert, cattle, dirt, area, barn, yard, zoo, ox, path, footprint, garbage, puddle, lawn, cow, sheep, concrete, snow, eat, lamb, goat, stone, cone, trail, rain, day, park, animal, cage, horse, bull, elephant
change, color, visual property | colors | bright, beautiful, big, dirty, small, colorful, grey, long, purple, dark, round, men, tiny, pink, eyes, painted, brown, gold, medium, white, hang, iron, silver, old, black, left, tall, red, safety, large, metal, blue, steel, yellow, leather, hanging, make, walk, green, right, color, bath, pair, washing, sitting, carry
food, produce, solid | food | drizzle, nuts, herb, beef, flour, season, cereal, cherry, breakfast, sugar, steak, bacon, burger, butter, rice, meat, meal, sauce, dinner, pie, raspberry, lunch, sushi, bean, mustard, pepper, seed, salt, soup, cheese, tomato, hot, berry, potato, dessert, strawberry, salad, cardboard, food, bone, lemon, burn, frost, chocolate, bread, turkey, sandwich, spoon, pizza, chicken, shell, candy, peel, cooking, bubble, knife, fruit, fish, donut, cake, apple, ice, banana, orange
Table 6.3: Examples of the 20 clusters in ES. Clusters are ordered by size. See
all clusters in Appendix Table D.2
WordNet label | Own label | Members
bird, aquatic bird, seabird | birds | seagull, gull, goose, duck, pelican, swan, mallard, stork, eagle, flamingo
furnishing, furniture, instrumentality | furnishing | furniture, stand, booth, desk, modern, display, bed, chair, container, door, appliance, drawer, sofa, curtain, couch, bench, crib, frame, box, table, tv, window, computer, cradle, television, mac
instrumentality, self-propelled vehicle, wheeled vehicle | car related | accident, cord, vehicle, auto, automobile, skate, photography, truck, race, arrive, ford, chopper, cab, rally, seat, industrial, smart, mechanic, racing, car, demolition, triumph, construction, motorcycle, machine, taxi, engine, driver, crane, carriage, van, bus, cannon, motor, tank, hockey, wagon, camera
vascular plant, plant, grow | plants | weed, bunch, maple, cancer, iris, poppy, dandelion, leave, flower, rose, foliage, grow, plant, cactus, spring, tulip, ivy, palm, lily, leaf, daisy, tree, root, wheat, wool, raspberry, tobacco, flowers, blossom, butterfly, sunflower, cotton, herb, violet, oak, moss, strawberry, nest, dew, berry, rice, branch, coal
person, organism, causal agent | “female topics” | woman, model, brandy, pink, actress, lady, girl, young, wife, tiny, haircut, blonde, women, girls, hot, mother, hair, portrait, body, makeup, cheek, wig, neck, muscle, chest, lingerie, waist, redhead, child, face, bride, belly, bikini, kid, swimsuit, baby, brow, skirt, dress, short
food, nutriment, substance | food | sushi, meal, sandwich, pie, breakfast, lunch, food, supper, flour, cereal, sweet, dessert, dinner, subway, diet, cake, date, steak, sauce, bread, copper, nuts, bacon, cooking, beef, meat, bakery, knitting, eat, potato, salad, donut, pizza, burger, coffee, soup, bean, cheese, vitamin, fruit, pumpkin, rock, marrow, market, timber
artifact, change, cover | colours & materials | texture, fabric, cloth, metal, rain, concrete, paper, suds, rough, words, stone, wall, square, dense, leather, quote, wood, frost, mud, noise, text, purple, carpet, blue, tiles, dirt, droplets, red, sand, fog, formula, mist, pattern, handwriting, green, straw, linen, asphalt, stripes, crowd, marble, yellow, black, brown, grey, grass, white
body part, artifact, part | body parts | gut, throat, wrist, burn, ear, thumb, elbow, listen, shoulder, liver, pain, knee, arms, hand, toe, finger, give, tongue, limb, abdomen, jaw, receive, nail, arm, feet, hear, skin, washing, head, ankle, hip, teeth, tear, stomach, brain, foot, lip, mouth, leg, flesh, mask, eyes, nose, skull, eye, socks, lips
structure, artifact, area | room | museum, garage, hall, classroom, kitchen, cellar, interior, office, diner, decoration, exhibition, hotel, ceiling, restaurant, store, bathroom, trial, pub, class, closet, cafe, room, porch, stairs, deck, hospital, living, corridor, aisle, bar, staircase, doorway, hallway, chapel, floor, lab, station, bedroom, gate, elevator, theatre, escalator, tunnel, organ, alley, library, jail, tram
travel, change, object | vacation | island, view, reflection, harbor, nice, side, sea, summer, tropical, pollution, port, aircraft, pier, travel, surfers, journey, sunny, coast, flying, morning, ocean, seashore, horizon, mare, holiday, lake, surf, shore, vacation, bay, airport, cliff, sunlight, air, river, storm, ship, fishing, beach, desert, harbour, puddle, flight, sailing, evening, sunrise, skyline, vessel, lighthouse, dawn, sunset, rocket, mountain, whale, underwater, boat, swimming, swim, plane, dusk, jet, cloud, sky, airplane, ski
change, abstraction, state | festival | theme, wisdom, soul, image, possess, large, confidence, happiness, beautiful, joy, love, ceremony, festival, movement, abundance, dead, depth, celebration, lover, run, demon, blurred, pray, happy, remain, wet, dance, navy, family, carnival, angel, sculpture, ray, dragon, drive, atmosphere, night, shadow, band, god, believe, party, dark, hanging, abstract, show, christmas, monster, devil, jump, lighting, sunshine, warrior, painting, water, aquarium, zombie, concert, haze, crystal, statue, explosion, jazz, jellyfish, wave, bright, rainbow, ice, light, smoke, club, neon, colorful, hole, protest, autumn, rust, reef, flame, fire
person, organism, causal agent | animals | animals, animal, picture, painted, zoo, turkey, curled, goat, companion, pets, canine, pet, prey, relaxed, horse, spirit, tail, dog, chipmunk, squirrel, pigeon, fox, cute, please, sheep, owl, birds, military, giraffe, lion, lamb, bee, insect, hamster, hawk, licking, bird, cat, puppy, feline, terrier, deer, calf, rat, chicken, camel, dragonfly, whiskers, poodle, cow, hound, cattle, lizard, fish, bunny, crow, wolf, tiger, parrot, zebra, cheetah, fur, panda, bull, wasp, ox, hen, frog, crab, snake, boxer, hummingbird, rabbit, elephant, pupil, husky, peacock, spider, pug, ant
Table 6.4: Examples of the 20 clusters in EV. Clusters are ordered by size. See
all clusters in Appendix Table D.3
Figure 6.5: T-SNE plot of ES with 20 cluster labels obtained by K-means
clustering.
Figure 6.6: T-SNE plot of EL with 20 cluster labels obtained by K-means
clustering.
Figure 6.7: T-SNE plot of EV with 20 cluster labels obtained by K-means
clustering.
Figure 6.8: T-SNE plot of ES with 40 cluster labels obtained by K-means
clustering. T-SNE perplexity = 10.
Figure 6.9: Cluster map of Jaccard coefficients between K-means clusters of ES
(axis y) and EL (axis x).
Figure 6.10: Cluster map of Jaccard coefficients between K-means clusters of
ES (axis y) and EV (axis x).
Figure 6.11: Cluster map of Jaccard coefficients between K-means clusters of
EL (axis y) and EV (axis x).
Figure 6.12: Cluster map of Jaccard coefficients between K-means (axis y) and
Agglomerative (axis x) clusters of EL.
Figure 6.13: Cluster map of Jaccard coefficients between K-means (axis y) and
Agglomerative (axis x) clusters of ES.
Figure 6.14: Heatmap of Jaccard coefficients between K-means (axis y) and
Agglomerative (axis x) clusters of ES. Clusters are ordered by size.
Figure 6.15: Cluster map of Jaccard coefficients between K-means (axis y) and
Agglomerative (axis x) clusters of EV.
6.2.2.3 Gamified Data Collection
Figure 6.16: Screen-shot of Concept Game, a two player, collaborative gamified
data collection app, for acquiring cluster label annotations.
We developed a two player, collaborative gamified data collection app, called
Concept Game², similar to ESP Game [Von Ahn and Dabbish, 2004], but with
word lists (clusters) instead of images (Figure 6.16). The pair of players has
to guess the concept for a list of words; the lists are the clusters from
this section. They get a score if their guesses have one word/expression in
common. This way we aim to collect more human cluster label annotations for
different modalities in the future.
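The scoring rule can be expressed as a set intersection over the two players' guesses. The sketch below is illustrative only: the function names are hypothetical, and the normalisation (trimming whitespace and lower-casing) is an assumption, not a documented behaviour of Concept Game.

```python
def normalise(guess):
    """Trim and lower-case a guess (an assumed normalisation)."""
    return guess.strip().lower()


def common_guesses(guesses_player1, guesses_player2):
    """Words or expressions typed by both players, after normalisation."""
    first = {normalise(g) for g in guesses_player1}
    second = {normalise(g) for g in guesses_player2}
    return first & second


def round_scores(guesses_player1, guesses_player2):
    """A round scores if the players share at least one guess."""
    return bool(common_guesses(guesses_player1, guesses_player2))


shared = common_guesses(["Food", "meal"], ["drink", "food "])
```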
The back-end involves a SQLite database on an AWS server³, where we collect
data. The database includes two tables:
• Game: It stores each game round, that is, each time the users see a new
word list they are guessing a concept for. We log the following attributes:
– game_id = TextField()
– start_time = DateTimeField()
– cluster_id = TextField()
– user1 = TextField(): first user’s id
– user2 = TextField(): second user’s id
– guess = TextField(): the guessed word – NONE if they ran out of time
• Answer: This table stores the log for each word the users typed in, with
time stamps. This way, later, the time needed for agreeing on a cluster
label can be used to infer the difficulty / ambiguity of a cluster word list.
It logs the following attributes:
– game = ForeignKeyField(Game, backref='answers'): reference to a game_id in
Game.
– cluster_id = TextField()
– user = TextField(): id of the user who typed in a word as an answer
– word = TextField()
– e_time = TimeField(): elapsed time since the beginning of the game
² http://concept-guessing-game.com/
³ https://aws.amazon.com/
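To make the intended use of the two tables concrete, the following sketch recreates them with Python's standard sqlite3 module and derives the time needed to agree on a label. It is not the project's code: the field declarations above look like ORM model fields, whereas this sketch uses plain SQL; storing e_time as a number of seconds, the ids, the sample rows and the reading of "time to agreement" as the latest answer matching the agreed guess are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE game (
        game_id    TEXT PRIMARY KEY,
        start_time TEXT,
        cluster_id TEXT,
        user1      TEXT,
        user2      TEXT,
        guess      TEXT
    );
    CREATE TABLE answer (
        game       TEXT REFERENCES game(game_id),
        cluster_id TEXT,
        "user"     TEXT,
        word       TEXT,
        e_time     REAL
    );
    """
)

# Made-up rows: one round in which both players end up typing 'food'
conn.execute(
    "INSERT INTO game VALUES (?, ?, ?, ?, ?, ?)",
    ("g1", "2021-01-01 12:00:00", "EL_03", "u1", "u2", "food"),
)
conn.executemany(
    "INSERT INTO answer VALUES (?, ?, ?, ?, ?)",
    [
        ("g1", "EL_03", "u1", "meal", 4.0),
        ("g1", "EL_03", "u2", "food", 6.5),
        ("g1", "EL_03", "u1", "food", 9.0),
    ],
)

# Time to agreement per round: the latest answer equal to the agreed guess.
# Rounds that timed out (guess = 'NONE') are skipped.
row = conn.execute(
    """
    SELECT g.cluster_id, MAX(a.e_time)
    FROM game AS g
    JOIN answer AS a ON a.game = g.game_id AND a.word = g.guess
    WHERE g.guess <> 'NONE'
    GROUP BY g.game_id
    """
).fetchone()
```

Averaging this value per cluster_id over many rounds would give one possible proxy for the difficulty or ambiguity of a word list.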
The project is still under development in order to make it more accessible.
Currently, people can only play if there are enough players active on the platform.
So far only test data has been collected. In the future an auto replay functionality
would greatly improve the usability of the game.
The code is publicly available on GitHub⁴. The web technology development
was helped by Krisztián Gergely⁵.
⁴ https://github.com/anitavero/concept_game
⁵ http://krisoft.hu/
6.2.3 Supervised Visualisation
In this section we use the same T-SNE algorithm as in Section 6.2.2. However,
for the labelled projections we apply a WordNet based automatic labelling
technique on the words beforehand. This is fundamentally different from the
previous section, where the labelling came from the clustering method in an
unsupervised fashion. In that case, WordNet was used only for analysing the
cluster outputs, whereas here we label the data first. This way we can inspect
our embedding spaces based on pre-defined concepts. While the previous method
is more generic, this approach contributes to the interpretation of
embeddings.
6.2.3.1 Automatic Class Label Annotation
Figures 6.20 – 6.24 show coloured plots where the colours correspond to 13
class labels. We used the same coarse categories as in [Gupta et al., 2019].
They labelled their data manually, which we were not able to do due to the
size of our data. Therefore, we developed a technique to automatically label
our words using the WordNet hierarchy. Let $C$ be the set of class labels,
$C$ = {transport, food, building, animal, appliance, action, clothes, utensil,
body, colour, electronics, number, human}. All words in the embeddings’ common
subset vocabulary $V_{common}$ were labelled with a class in the following
way: First, we queried the synset list $S(c)$ for each class $c \in C$. Then
we obtained the synset closure of each word $w$ up to the third level in the
hypernym hierarchy: $S_{cl_3}(w)$. The class with the maximum synset overlap
with the word’s synset closure is assigned as the word’s class label:
$\mathrm{class}(w) = \arg\max_{c \in C} |S(c) \cap S_{cl_3}(w)|$. We only show
words where this maximum exists.
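The assignment rule can be sketched with plain sets. In the example below, the class synset lists and the word closures are small hand-written stand-ins rather than WordNet queries, and the reading of "this maximum exists" as "there is a unique class with non-empty overlap" is an assumption.

```python
def assign_class(word_closure, class_synsets):
    """Return the class with the largest synset overlap, or None.

    word_closure:  set of synset names in the word's hypernym closure
    class_synsets: class label -> set of synset names for that class
    None is returned if no class overlaps or if the maximum is tied
    (an assumed interpretation of 'the maximum exists').
    """
    overlaps = {c: len(s & word_closure) for c, s in class_synsets.items()}
    best = max(overlaps.values(), default=0)
    winners = [c for c, n in overlaps.items() if n == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


# Hypothetical stand-ins for S(c) and for the depth-3 closures
CLASS_SYNSETS = {
    "food": {"food.n.01", "food.n.02", "nutriment.n.01"},
    "animal": {"animal.n.01", "organism.n.01"},
}
pizza_closure = {"pizza.n.01", "dish.n.02", "nutriment.n.01", "food.n.01"}
theme_closure = {"idea.n.01", "content.n.05"}

pizza_class = assign_class(pizza_closure, CLASS_SYNSETS)
theme_class = assign_class(theme_closure, CLASS_SYNSETS)
```

Words such as the second example, whose closure overlaps with no class, stay unlabelled and would not be shown in the plots.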
6.2.3.2 Results
Figure 6.17 depicts a 2D projection of a 3D T-SNE plot of a sample of 100 000
words from the SGNS Wikipedia 2020 model. After looking at the word labels,
clear clusters became apparent, such as words in different languages and
topics (e.g., math, mental health, numbers). The thin curves usually contain
numbers with the same number of digits, in order. Figures 6.18 and 6.19 show
two examples of the clusters.
Figure 6.20 shows a 2D T-SNE plot of our Wikipedia 2020 model trained
157
on the whole corpus. Despite the simple heuristic we used to generate class
labels, clearly separable clusters emerged for many of them. We can see colours
indicated by orange, numbers by blue, clothes by red, food related words by light
green, buildings by brown, animals by purple, etc. Some of the confused labels
visibly stem from failures of our labelling technique; on closer inspection,
however, many mislabelled words cluster around other words of the same topic or
category.
In Figures 6.21 – 6.24 we show similar projections for EL, EV, ES and a random
embedding ER, where we restricted the vocabulary to the intersection of the
three modalities, then kept the words with an existing WordNet label, resulting
in 252 words. EL, EV and ES all show much more distinct clusters, with much
better defined class labels, than the random embedding. This may seem obvious;
however, it is worth noting, since in very high dimensions even random vector
spaces can show some structure. In our projection in Figure 6.24, both the data
points and the labels are uniformly distributed.
Looking at the projections in Figures 6.21 – 6.23, the three modalities have
different cluster shapes, with EV having the most and ES the least coherent
and separable clusters. This is consistent with the results on clustering
metrics in Figure 6.1. In general, the classes transport, food, building, animal,
clothes, colour, number and action appear to be better captured by this labelling
and projection technique than appliance, utensil, body, electronics and human.
This is probably due to the coarse labelling method, and could be alleviated by
collecting human annotation. [Gupta et al., 2019] reported that their
visual-context model showed more distinct clusters than their linguistic one
using GloVe. In our T-SNE projections we did not find such patterns, although our
method is fundamentally different from theirs: they use early fusion and GloVe,
they do not exploit the Visual Genome graph structure, and they apply manual
labelling. Overall, it is remarkable how much structure can already be revealed
without additional human effort.
6.3 Information Gain from Multi-modal Data
So far we compared our embedding spaces based on their cluster structure. In
this section we move on to pillar 3 in our analysis. This second type of
transparency analysis involved experiments for measuring similarity between
distributions, based on an information-theoretical approach introduced in
Section 2.7.5. We aim to measure the information gain ES and EV each contribute
when combined with EL. By treating the embedding spaces as samples from
multivariate distributions we formulate the question in the following way: Are
two semantic spaces from different modalities independent from each other?
Figure 6.17: T-SNE plot of a trained SGNS model on a 2020 dump of Wikipedia.
Figure 6.18: Cluster containing the word “pancakes” on the T-SNE plot of a
trained SGNS model on a 2020 dump of Wikipedia.
We employ empirical Mutual Information estimation methods, described in
Section 3.2.4. Section 6.3.1 describes the details of the analysis; results are
presented in Section 6.3.2.⁶
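For reference, the independence question corresponds to the textbook definition
of Shannon mutual information (see e.g. [Cover and Thomas, 2012]). The random
variables X, Y and the densities p below are notation introduced here for
illustration only; they are not taken from the formalism of Section 2.7.5.

```latex
I(X; Y) \;=\; \int\!\!\int p_{X,Y}(x, y)\,
        \log \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)} \, dx \, dy \;\ge\; 0,
\qquad
I(X; Y) = 0 \iff X \text{ and } Y \text{ are independent.}
```

Under this reading, a smaller estimated value for a pair of embedding spaces
suggests that the two spaces are closer to being independent samples.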
6.3.1 Hyper Parameters and Dimensionality Reduction
Since IKNN is not robust in very high dimensions, we explore the hyper parameters
of IHSIC. We used the Gaussian Radial Basis Function (RBF) kernel [Vert et al.,
2004] with parameter settings σ = 1 and σ set by the median heuristic [Garreau
et al., 2017].
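To make the kernel settings concrete, the following is a minimal pure-Python
sketch of the standard biased HSIC statistic of [Gretton et al., 2005],
(n − 1)^-2 tr(KHLH), with a Gaussian RBF kernel whose bandwidth is either fixed
or set by the median heuristic. It is an illustration under assumptions, not the
estimator described in Section 3.2.4: the function names are hypothetical, the
parametrisation exp(−‖x − y‖² / (2σ²)) is one common convention, and the raw
statistic is a dependence measure rather than a calibrated mutual information
value.

```python
import math
from statistics import median
from typing import List, Optional, Sequence

Matrix = List[List[float]]
Sample = Sequence[Sequence[float]]


def pairwise_sq_dists(X: Sample) -> Matrix:
    """Squared Euclidean distances between all rows of X."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in X] for x in X]


def median_heuristic(D2: Matrix) -> float:
    """Bandwidth = median Euclidean distance over all distinct pairs."""
    n = len(D2)
    return median(math.sqrt(D2[i][j]) for i in range(n) for j in range(i + 1, n))


def rbf_gram(X: Sample, sigma: Optional[float] = None) -> Matrix:
    """Gaussian RBF Gram matrix; sigma=None selects the median heuristic."""
    D2 = pairwise_sq_dists(X)
    if sigma is None:
        sigma = median_heuristic(D2)
    return [[math.exp(-d / (2.0 * sigma ** 2)) for d in row] for row in D2]


def center(K: Matrix) -> Matrix:
    """Double-centre a symmetric K, i.e. compute HKH with H = I - (1/n) 11^T."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]  # equals the column means (K symmetric)
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def hsic(X: Sample, Y: Sample,
         sigma_x: Optional[float] = None,
         sigma_y: Optional[float] = None) -> float:
    """Biased HSIC statistic (n - 1)^-2 tr(K H L H) for paired samples X, Y."""
    n = len(X)
    Kc = center(rbf_gram(X, sigma_x))
    L = rbf_gram(Y, sigma_y)
    # tr(K H L H) = tr((H K H) L) = sum_ij (HKH)_ij * L_ij for symmetric matrices.
    trace = sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Toy paired samples: Y_perm is a fixed re-ordering of X, so it is less
# dependent on X than X is on itself.
X = [[i / 10.0] for i in range(30)]
Y_perm = [[((7 * i) % 30) / 10.0] for i in range(30)]
hsic_same = hsic(X, X)
hsic_perm = hsic(X, Y_perm)
```

Because every pairwise distance is materialised, this sketch is quadratic in the
number of samples and is only meant for small illustrative inputs.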
Furthermore, in order to test the robustness of the results we ran the method
after projecting our spaces onto lower dimensional spaces using Principal
Component Analysis (PCA) [Wold et al., 1987]. We tested the embeddings with
dimensions d ∈ {10, 100, max}, where max is the full dimension of each space.
For further robustness, we ran the IHSIC algorithm for d ∈ {3, 11, 12, 13, 50}
(Appendix E).
⁶ We would like to thank Zoltán Szabó for his counsel on the theoretical
background for these studies.
Figure 6.19: Cluster, containing the number “1505” on the T-SNE plot of a
trained SGNS model on a 2020 dump of Wikipedia.
6.3.2 Results
The main benefit of this experiment is that it may help us understand whether,
and how, data of different modalities contribute to the performance of
multi-modal embeddings. If they do contribute, is it just an artefact of
introducing more data or is it due to meaningful information which changes the
structure of the vector space in a useful way?
Figure 6.20: T-SNE plot of a trained SGNS model on a 2020 dump of Wikipedia.
The colours correspond to 13 classes automatically generated using the WordNet
hierarchy: transport, food, building, animal, appliance, action, clothes, utensil,
body, colour, electronics, number, human.
In Figures 6.25 and 6.26 the y axis shows I(EL, EV) (red) and I(EL, ES) (blue),
where I is the estimated Shannon mutual information using either a k-Nearest
Neighbor based, linear algorithm (IKNN) or the HSIC kernel method (IHSIC).
In Figure 6.25 the x axis represents the size of the training corpus e1, . . . , eN
(in terms of the number of tokens) for EL. Apart from IHSIC with σ = 1, the models
agree on I(EL, EV) being greater than I(EL, ES), which suggests that the Visual
Genome Scene Graph based structured embedding ES is “more independent”
from the linguistic model EL than the image based EV. This is surprising after
observing the two models behaving similarly in Chapter 4. Moreover, the result
is interesting since, although the creation of this type of training data was
highly visually directed, ES is still a text based model. Nevertheless, it is
“farther” from the linguistic model in distribution than the visual one.
I(EL, EV) appears to be lower for lower volumes of text data. This may be because
with more data the two embeddings contain more related information. However, in
the case of IHSIC with maximal dimensions, using the median heuristic for σ, this
pattern cannot be seen. For I(EL, ES) no such tendency can be observed.
Figure 6.26 reports the effect of word frequency (in the EL training corpus) on
the estimated I. Similarly to [Sahlgren and Lenci, 2016] we split the vocabulary
into three equally large parts: HIGH, MEDIUM and LOW range. This way we
generate samples for EL, EV and ES for the different frequency ranges in the text
corpus. Again, higher mutual information between the linguistic and the visual
embeddings can be observed. The negative IKNN in Figure 6.26a is due to the
oscillating nature of the approximation, and shows that the k-Nearest Neighbor
method is not robust enough in this high dimension.
Figure 6.21: T-SNE plot of EL with its vocabulary restricted to the common
subset of EL, EV, ES and the words with an existing automatic WordNet class label,
resulting in 252 words. The colours correspond to 13 classes automatically
generated using the WordNet hierarchy: transport, food, building, animal,
appliance, action, clothes, utensil, body, colour, electronics, number, human.
In terms of the effect of word frequency, the only pattern that emerges is the
relatively low mutual information between EL and EV on low frequency words.
However, this may be an artefact of sparse data, since the coverage drops
dramatically when filtering for pairs which fall in the same frequency category
(see Figure 5.2).
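The three-way frequency split can be sketched as follows. This is a minimal
illustration with invented counts; the function names, the alphabetical
tie-breaking and the dictionary representation of an embedding are assumptions
made here, not details taken from [Sahlgren and Lenci, 2016] or from our
implementation.

```python
from typing import Dict, List, Sequence


def frequency_bands(freq: Dict[str, int]) -> Dict[str, List[str]]:
    """Split a vocabulary into three (near-)equally large frequency bands."""
    # Rank by descending corpus frequency; ties are broken alphabetically
    # so that the split is deterministic (an assumption of this sketch).
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    n = len(ranked)
    cut1, cut2 = n // 3, (2 * n) // 3
    return {
        "HIGH": ranked[:cut1],
        "MEDIUM": ranked[cut1:cut2],
        "LOW": ranked[cut2:],
    }


def band_sample(embedding: Dict[str, Sequence[float]],
                words: List[str]) -> List[Sequence[float]]:
    """Rows of an embedding restricted to one band, in band order."""
    return [embedding[w] for w in words if w in embedding]


# Invented token counts, for illustration only.
toy_freq = {"the": 1000, "of": 800, "dog": 50, "cat": 40, "zebra": 3, "quokka": 1}
bands = frequency_bands(toy_freq)
toy_embedding = {"dog": [0.1, 0.2], "zebra": [0.3, 0.4], "quokka": [0.5, 0.6]}
low_sample = band_sample(toy_embedding, bands["LOW"])
```

Restricting each embedding to the words of one band, as `band_sample` does, also
makes the coverage issue visible: a band may retain only a few vectors.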
In order to further test the robustness of the results we ran the IHSIC
algorithm for further dimensions in the very low range and one medium size:
d ∈ {3, 11, 12, 13, 50}. The results are shown in Appendix E. They support
the overall pattern in the above figures, adding that the results lose their
robustness for d = 3.
Figure 6.22: T-SNE plot of EV with its vocabulary restricted to the common
subset of EL, EV, ES and the words with an existing automatic WordNet class label,
resulting in 252 words. The colours correspond to 13 classes automatically
generated using the WordNet hierarchy: transport, food, building, animal,
appliance, action, clothes, utensil, body, colour, electronics, number, human.
6.4 Dataset Distribution
Finally, we analyse the text based data source distributions DL and DS directly
to get another perspective on the type of information they convey. We present
the words in the respective datasets with the 10 highest probabilities of
co-occurrence with each centroid word from Section 6.2.2.⁷ To estimate this
probability we calculated Pointwise Mutual Information (PMI), Positive PMI (PPMI)
(Equation 2.1), a modified PMI (PMI3), χ² [Manning and Schutze, 1999,
Section 5.3.3] and Fisher’s exact test [Pedersen, 1996]. PMI3 has an exponent of 3
for the numerator and no logarithm. We used the NLTK package implementations of
all the above metrics.⁸
⁷ Duplicated words, which appear as both left and right context, are removed.
Therefore the number of words is ≤ 10.
Figure 6.23: T-SNE plot of ES with its vocabulary restricted to the common
subset of EL, EV, ES and the words with an existing automatic WordNet class label,
resulting in 252 words. The colours correspond to 13 classes automatically
generated using the WordNet hierarchy: transport, food, building, animal,
appliance, action, clothes, utensil, body, colour, electronics, number, human.
Figure 6.24: T-SNE plot of a random embedding $E_R \in \mathbb{R}^{252 \times 300}$.
The colours correspond to 13 classes automatically generated using the WordNet
hierarchy: transport, food, building, animal, appliance, action, clothes, utensil,
body, colour, electronics, number, human. The colour labels are evenly distributed
on the projection.
Since PMI, PPMI and Fisher’s test suffered from over-representing low frequency
bigrams, we only present results for χ² and PMI3, which produced fairly similar
results. Table 6.5 presents, for example words closest to cluster centroids, the
context words with the 10 highest χ² scores. Results for the full set of centroid
words using χ² and PMI3 can be found in Appendix F.
⁸ https://www.nltk.org/api/nltk.html#module-nltk.collocations
Figure 6.25: Estimated Mutual Information: I(EL, EV) (red) and I(EL, ES) (blue)
for different corpus sizes. Panels: (a) IKNN; (b) IHSIC, σ = 1, d = max;
(c) IHSIC, σ: median, d = max; (d) IHSIC, σ: median, d = 100; (e) IHSIC,
σ: median, d = 10.
Centroid | Wikipedia | Visual Genome
plate | tectonics, nazca, restrictor, farallon, subducts, license, cribriform, tectonic, subducting, eurasian | plate, lying on top of, on, has, on top of, in
rust | epique, cronartium, oleum, cohle, obritzberg, blister, belt, puccinia, windexed, colored | rust, stains down, around side of, rusted onto, on fire, with a lot
hummingbird | amazilia, selasphorus, mellisuga, calypte, cynanthus, berylline, scintillant, orthorhyncus, eupherusa, chinned | hummingbird, eat nectar from, in flight below, flapping its, flapping, windspan
fun | poked, poking, pokes, poke, loving, lot, lovin, yidishn, fun, weäsell | are having, are having great, fun, facing away, planning, having
hand | right, sleight, grenades, left, hand, cranked, grenade, claps, gloved, upper | hand, holding, held in, on, in mans, man
bird | passerine, migratory, caged, sanctuary, watchers, watching, topley, species, prey, furnariidae | bird, perched on, flying in, flying over, beak, flying ahead of
Table 6.5: Examples of context words of cluster centroids with the 10 highest χ²
scores. See all cluster centroids in Appendix F.
The samples reveal that while Wikipedia includes more encyclopaedic synonyms
among the most likely bigrams, Visual Genome conveys more functional, specific
types of contexts, including more actions and attributes. For example, the most
likely co-occurrence for “plate” is “tectonics” in Wikipedia vs. “lying on top
of” in Visual Genome.
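For orientation, the association scores used to rank these contexts can be
computed from bigram counts as sketched below. The sketch is pure Python and does
not call NLTK; it follows the textbook formulas, and the reading that PMI3
corresponds to NLTK’s `mi_like` measure with exponent 3 is an assumption based on
the description “an exponent of 3 for the numerator and no logarithm”. All
function names and counts are invented for illustration.

```python
import math


def pmi(n_xy: int, n_x: int, n_y: int, n: int) -> float:
    """Pointwise mutual information: log2( P(x, y) / (P(x) P(y)) )."""
    return math.log2(n_xy * n) - math.log2(n_x * n_y)


def ppmi(n_xy: int, n_x: int, n_y: int, n: int) -> float:
    """Positive PMI: negative PMI values are clipped to zero."""
    return max(0.0, pmi(n_xy, n_x, n_y, n))


def pmi3(n_xy: int, n_x: int, n_y: int) -> float:
    """Count-based PMI variant: joint count cubed, no logarithm (assumed form)."""
    return n_xy ** 3 / (n_x * n_y)


def chi_sq(n_xy: int, n_x: int, n_y: int, n: int) -> float:
    """Pearson's chi-square statistic for the 2x2 bigram contingency table."""
    a = n_xy                   # x followed by y
    b = n_x - n_xy             # x followed by something other than y
    c = n_y - n_xy             # something other than x followed by y
    d = n - n_x - n_y + n_xy   # neither x first nor y second
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Invented counts in a corpus of N = 1000 bigrams.
N = 1000
frequent = dict(n_xy=20, n_x=100, n_y=50)   # a well-attested bigram
hapax = dict(n_xy=1, n_x=1, n_y=1)          # both words occur exactly once

pmi_prefers_hapax = pmi(n=N, **hapax) > pmi(n=N, **frequent)
pmi3_prefers_frequent = pmi3(**frequent) > pmi3(**hapax)
```

With these invented counts, PMI scores the hapax bigram above the well-attested
one, while PMI3 reverses the order, which illustrates the low-frequency bias
mentioned above.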
Our observations are in line with the word distributions in VG published
in [Krishna et al., 2016]. The most common concepts (Figure 6.27), objects
(Figure 6.28), attributes (Figure 6.29) and relationships (Figure 6.30) all paint a
picture of how visually oriented VG annotations are. The published statistics also
support our observation that VG mostly includes specific descriptions of smaller
scenes.
These support our previous findings that Visual Genome can contribute with
complementary information to a text based meaning representation by having
denser annotations of visual scenes.
Figure 6.26: Estimated Mutual Information: I(EL, EV) (red) and I(EL, ES) (blue)
for different word frequency ranges. Panels: (a) IKNN; (b) IHSIC, σ = 1, d = max;
(c) IHSIC, σ: median, d = max; (d) IHSIC, σ: median, d = 100; (e) IHSIC,
σ: median, d = 10.
6.5 Conclusion
In this chapter we presented proof-of-concept studies of interpretable Trans-
parency analysis, forming the second and third pillars of our analysis (Section 3.3).
Qualitative / Quantitative Structural Analysis. Firstly, our aim was to
interpret our models by zooming into the distributional properties of linguistic,
visual, structured and multi-modal embeddings. We ran K-means and Agglomerative
clusterings on each embedding and used standard clustering metrics for
evaluation when class labels are not given. The results indicate that while the
image based model may have better defined clusters, the Visual Genome Scene
Graph structured model can outperform the other ones in terms of consistency
when the number of clusters is chosen well. We visualised the clustered
embeddings and inspected the individual clusters from the best K-means clustering.
We introduced a WordNet based cluster label annotation technique. Furthermore,
we compared the clustering to Agglomerative Clustering results.
The supervised T-SNE visualisations provide further insight into the structure
of our semantic spaces, which is in line with the above findings. We introduced
a simple method to automatically annotate our data with topic labels, saving a
huge amount of human effort. Remarkably, the results already give further insight
into our data, despite the simple heuristic of label generation. We believe the
method could be easily improved to gain better coverage of the vocabulary and
higher accuracy of labels.
Independence Analysis. Secondly, we created an implementation of our
information theory based framework to measure the information gain visual and
structured embeddings may provide when combined with text based linguistic
models. We found that the Visual Genome Scene Graph based structured model is
more independent from the Wikipedia based SGNS model than the visual embeddings
trained on images. This may reveal something about why this structural
data on its own, as well as combined with linguistic information, can achieve such
high accuracies, despite having orders of magnitude less training data than either
of the other modalities (as we saw in Chapter 5). Analysing the effect of VG and
image data size on this metric would be an important future direction, as we saw
that the mutual information of image and text based embeddings increases with
corpus size. However, in the context of the structured model’s comparable
performance, we think that the estimated mutual information is a promising metric
for deciding on the usefulness of a new data source.
Summary of Transparency Analysis. Let us examine the two hypotheses
we made in Section 6.1. All three embedding types show different cluster
structures; however, the image based embedding is closer to the linguistic one
than our visually structured, textual embedding, both in terms of cluster
structure and in being more mutually dependent. Considering this result in
relation to the performance numbers in the previous chapters, we conclude that
the image based embedding requires orders of magnitude more data and training
time, while not necessarily providing additional useful information to a text
based representation in the context of word semantic similarity. Therefore, we
weakly reject Hypothesis I. On the other hand, based on the three pillars of our
analyses: 1. reaching comparable performance despite being based on a small model
trained on small data, 2. the quantitative and qualitative analysis of its cluster
structure and 3. independence analysis, we conclude that our structured embedding
provides complementary information to our linguistic representation while being
highly efficient. Hence, we accept Hypothesis II.
Investigating transformers, Bayesian MI estimators and other evaluations
could be potential extensions of these studies. Applying automatically generated
scene graphs [Xu et al., 2020] would mitigate the main limitation of this
approach, which is the manual labour required for creating VG. This would serve
as a highly effective tool with important applications for low resource languages.
Figure 6.27: (a) A plot of the most common visual concepts or phrases that
occur in region descriptions. The most common phrases refer to universal visual
concepts like “blue sky,” “green grass,” etc. (b) A plot of the most frequently
used words in region descriptions. Colours occur the most frequently, followed by
common objects like “man” and “dog” and universal visual concepts like “sky.”
[Krishna et al., 2016]
Figure 6.28: (a) Examples of objects in VG. Each object is localized in its image
with a tightly drawn bounding box. (b) Plot of the most frequently occurring
objects in images. People are the most frequently occurring objects in the dataset,
followed by common objects and visual elements like “building”, “shirt”, and
“sky”. [Krishna et al., 2016]
Figure 6.29: (a) Distribution showing the most common attributes in VG. Colours
(“white”, “red”) and materials (“wooden”, “metal”) are the most common. (b)
Distribution showing the number of attributes describing people. State-of-motion
verbs (“standing”, “walking”) are the most common, while certain sports
(“skiing”, “surfing”) are also highly represented due to an image source bias in
the image set. [Krishna et al., 2016]
Figure 6.30: (a) A sample of the most frequent relationships in VG. In general,
the most common relationships are spatial (“on top of”, “on side of”, etc.). (b) A
sample of the most frequent relationships involving humans in the dataset. The
relationships involving people tend to be more action oriented (“walk”, “speak”,
“run”, etc.). [Krishna et al., 2016]
Chapter 7
Summary and Conclusions
This thesis has been pursuing a better understanding of the impact of visual
information on semantic models in non-visual tasks. Since the literature on these
tasks is narrower and more inconclusive, we aimed to construct a
broader evaluation and analysis. We introduced a general embedding formalism
and a three pillar framework for transparent analysis of multi-modal semantic
embedding models. We proposed and implemented a new type of embedding in
between linguistic and visual modalities, based on small data. We analysed its
contribution to linguistic representations within our analytical framework. Fur-
thermore, we presented and showcased a framework for treating modalities as
partial observers of meaning based on information-theory.
7.1 Main Findings
The main findings are the following:
• The source of images affects the performance of multi-modal mid-fused
semantic representations.
• The number of images in ordered sources has an impact on performance,
but it stabilizes at around 10-20 images.
• Visual information can be complementary for smaller linguistic corpora, but
this effect does not necessarily scale with corpus size.
• Images convey complementary statistical information about the co-occurrence
of objects in visual scenes, but there is no direct indication of how low level
visual features contribute.
• Cluster analysis can provide a useful framework for analysing emergent
concept structures. Combined with independence analysis, it can serve
as a useful framework for transparent embedding analysis.
• VG Scene Graph based, visually structured, textual models achieve
comparable or better performance economically, using orders of magnitude
fewer resources than visual models. When combined, they enrich our
linguistic model with more divergent information than the image based
one. Their clusters represent more concrete concepts, in-between visual and
linguistic domains.
7.2 Conclusion and Future Work
Instead of comparing all the latest models at the time, we developed a general
analysis framework and presented proof-of-concept studies, which can be applied
to various models in the future. To present our methodology, we employed the
smallest possible models which allow us to incorporate visual embeddings, thus
studying multi-modality. Therefore, in this work we applied the shallow
skip-gram network, as visual embeddings fit into it more easily than into count
based models, while it is the simplest neural model. Furthermore, we used a
mid-fusion technique, which made it straightforward to study individual
modalities. Incorporating this methodology into the evaluation of various recent
models would be the next step.
In parallel, the analysis methodology can also be further developed. One
direction is to test the level of visual information that impacts abstract semantic
representations. One potential test is to gradually reduce the resolution of images
we use for visual embeddings and see how the performance changes, in particular
at what rate it starts to decline. This way we would see how much visual detail
can be omitted while keeping the same gain for conceptually abstract tasks.
Another exciting direction would be to extend the notion of modality and
compare semantic representations trained across different data sources in general,
such as corpora of different authors, from different times or different styles and
social circles. A further extension of the notion of semantic representation could
be measuring semantic change in time, such as the polarisation of political
discourse. This has the potential to have positive social impact if we are capable
of detecting the time and “place” of the source of miscommunication.
Applying automatically generated scene graphs would mitigate the main
limitation of the presented Visual Genome based approach, which is the manual
labour required for creating it. This would serve as a highly effective tool with
important applications for low resource languages.
For measuring information gain, experimenting with Bayesian Mutual
Information estimation methods and other evaluation and training datasets would
also be a viable future route.
Understanding the information our various data sources convey and the biases
our different models have on them is essential work in Artificial Intelligence.
Data driven AI applications surround us, thus we believe there is a surging need
for such meta analyses in order to advance this technology in a more conscious
way.
Bibliography
[Agrawal et al., 2016] Agrawal, A., Batra, D., and Parikh, D. (2016). Analyzing
the behavior of visual question answering models. Proceedings of the 2016
Conference on Empirical Methods in Natural Language Processing, pages 1955–
1960.
[Anderson et al., 2017] Anderson, A. J., Kiela, D., Clark, S., and Poesio, M.
(2017). Visually grounded and textual semantic models differentially decode
brain activity associated with concrete and abstract nouns. Transactions of
the Association for Computational Linguistics, 5:17–30.
[Anderson et al., 2016] Anderson, A. J., Zinszer, B. D., and Raizada, R. D.
(2016). Representational similarity encoding for fMRI: Pattern-based synthesis
to predict brain activity using stimulus-model-similarities. NeuroImage,
128:44–53.
[Antol et al., 2015] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D.,
Zitnick, C. L., and Parikh, D. (2015). VQA: Visual question answering. Proceedings
of the IEEE international conference on computer vision, pages 2425–2433.
[Artetxe et al., 2018] Artetxe, M., Labaka, G., and Agirre, E. (2018). A robust
self-learning method for fully unsupervised cross-lingual mappings of word em-
beddings. ACL.
[Arthur and Vassilvitskii, 2006] Arthur, D. and Vassilvitskii, S. (2006). k-
means++: The advantages of careful seeding. Stanford.
[Arthur et al., 2016] Arthur, P., Neubig, G., and Nakamura, S. (2016). Incorpo-
rating discrete translation lexicons into neural machine translation. Proceedings
of the 2016 Conference on Empirical Methods in Natural Language Processing,
pages 1557–1567.
[Bahdanau et al., 2015] Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural
Machine Translation By Jointly Learning To Align and Translate. ICLR 2015,
26(1):1–15.
[Barocas et al., 2019] Barocas, S., Hardt, M., and Narayanan, A. (2019). Fair-
ness and Machine Learning. http://www.fairmlbook.org.
[Baroni and Lenci, 2008] Baroni, M. and Lenci, A. (2008). Concepts and prop-
erties in word spaces. Italian Journal of Linguistics, 20(1):55–88.
[Batchkarov et al., 2016] Batchkarov, M., Kober, T., Reffin, J., Weeds, J., and
Weir, D. (2016). A critique of word similarity as a method for evaluating
distributional semantic models. Proceedings of the 1st Workshop on Evaluating
Vector-Space Representations for NLP, pages 7–12.
[Bender et al., 2021] Bender, E. M., Gebru, T., McMillan-Major, A., and
Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language
models be too big? Proceedings of FAccT 2021.
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C.
(2003). A Neural Probabilistic Language Model. The Journal of Machine
Learning Research, 3:1137–1155.
[Bergsma and Goebel, 2011] Bergsma, S. and Goebel, R. (2011). Using visual
information to predict lexical preference. Proceedings of RANLP, pages 399–
405.
[Boleda, 2020] Boleda, G. (2020). Distributional semantics and linguistic theory.
Annual Review of Linguistics, 6:213–234.
[Bowker and Star, 2000] Bowker, G. C. and Star, S. L. (2000). Sorting things
out: Classification and its consequences.
[Bowman et al., 2015] Bowman, S., Angeli, G., Potts, C., and Manning, C. D.
(2015). A large annotated corpus for learning natural language inference. Pro-
ceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing, pages 632–642.
[Bruni et al., 2014] Bruni, E., Tran, N.-K., and Baroni, M. (2014). Multimodal
distributional semantics. J. Artif. Intell. Res.(JAIR), 49(2014):1–47.
[Bucci, 1985] Bucci, W. (1985). Dual coding: A cognitive model for psychoana-
lytic research. Journal of the American Psychoanalytic Association, 33(3):571–
607.
[Bulat et al., 2017] Bulat, L., Clark, S., and Shutova, E. (2017). Speaking, seeing,
understanding: Correlating semantic models with conceptual representation in
the brain. Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing, pages 1081–1091.
[Cho et al., 2014] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D.,
Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Rep-
resentations using RNN Encoder-Decoder for Statistical Machine Translation.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1724–1734.
[Chomsky et al., 2000] Chomsky, N. et al. (2000). New horizons in the study of
language and mind.
[Clark, 2015] Clark, S. (2015). Vector space models of lexical meaning. Handbook
of Contemporary Semantics, 10:9781118882139.
[Conneau et al., 2018] Conneau, A., Kruszewski, G., Lample, G., Barrault, L.,
and Baroni, M. (2018). What you can cram into a single &!#* vector: Probing
sentence embeddings for linguistic properties. ACL 2018-56th Annual Meeting
of the Association for Computational Linguistics, 1:2126–2136.
[Cover and Thomas, 2012] Cover, T. and Thomas, J. (2012). Elements of Infor-
mation Theory.
[Davis et al., 2019] Davis, C., Bulat, L., Verő, A. L., and Shutova, E. (2019).
Deconstructing multimodality: visual properties and visual context in human
semantic processing. Proceedings of the Eighth Joint Conference on Lexical and
Computational Semantics (*SEM 2019), pages 118–124.
[Deerwester et al., 1990] Deerwester, S., Dumais, S. T., Furnas, G. W., Lan-
dauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis.
Journal of the American society for information science, 41(6):391–407.
[Deng et al., 2009] Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Li, F.
(2009). ImageNet: A large-scale hierarchical image database. Proceedings of
CVPR, pages 248–255.
[Devlin et al., 2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional transformers for language
understanding. NAACL-HLT (1).
[Dinu et al., 2015] Dinu, G., Lazaridou, A., and Baroni, M. (2015). Improving
zero-shot learning by mitigating the hubness problem. International Conference
on Learning Representations, Workshop Track.
[Dubossarsky et al., 2019] Dubossarsky, H., Hengchen, S., Tahmasebi, N., and
Schlechtweg, D. (2019). Time-out: Temporal referencing for robust modeling
of lexical semantic change. Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages 457–470.
[Erk, 2016] Erk, K. (2016). What do you know about an alligator when you know
the company it keeps? Semantics and Pragmatics, 9:17–1.
[Ernst and Banks, 2002] Ernst, M. O. and Banks, M. S. (2002). Humans inte-
grate visual and haptic information in a statistically optimal fashion. Nature,
415(6870):429.
[Faruqui et al., 2016] Faruqui, M., Tsvetkov, Y., Rastogi, P., and Dyer, C.
(2016). Problems with evaluation of word embeddings using word similarity
tasks. Proceedings of the 1st Workshop on Evaluating Vector-Space Represen-
tations for NLP, pages 30–35.
[Fergus et al., 2005] Fergus, R., Li, F., Perona, P., and Zisserman, A. (2005).
Learning object categories from Google’s image search. Proceedings of ICCV,
pages 1816–1823.
[Firth, 1957] Firth, J. R. (1957). A synopsis of linguistic theory. Studies in Lin-
guistic Analysis, Oxford: Philological Society, (1–32), reprinted in F.R. Palmer
(ed.), Selected Papers of J.R. Firth 1952-1959, London: Longman (1968).
[Fouhey and Zitnick, 2014] Fouhey, D. F. and Zitnick, C. L. (2014). Predicting
object dynamics in scenes. Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2019–2026.
184
[Gabrilovich et al., 2007] Gabrilovich, E., Markovitch, S., et al. (2007). Com-
puting semantic relatedness using wikipedia-based explicit semantic analysis.
IJcAI, 7:1606–1611.
[Garreau et al., 2017] Garreau, D., Jitkrittum, W., and Kanagawa, M.
(2017). Large sample analysis of the median heuristic. arXiv preprint
arXiv:1707.07269.
[Gasparri and Marconi, 2021] Gasparri, L. and Marconi, D. (2021). Word Mean-
ing. The Stanford Encyclopedia of Philosophy.
[Gerz et al., 2016] Gerz, D., Vulic´, I., Hill, F., Reichart, R., and Korhonen,
A. (2016). SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity.
EMNLP.
[Ghorbani et al., 2019] Ghorbani, A., Wexler, J., Zou, J. Y., and Kim, B. (2019).
Towards automatic concept-based explanations. Advances in Neural Informa-
tion Processing Systems, 32:9277–9286.
[Gonza´lez et al., 2006] Gonza´lez, J., Barros-Loscertales, A., Pulvermu¨ller, F.,
Meseguer, V., Sanjua´n, A., Belloch, V., and A´vila, C. (2006). Reading cin-
namon activates olfactory brain regions. Neuroimage, 32(2):906–912.
[Gretton et al., 2005] Gretton, A., Bousquet, O., Smola, A., and Scho¨lkopf, B.
(2005). Measuring statistical dependence with hilbert-schmidt norms. Inter-
national conference on algorithmic learning theory, pages 63–77.
[Grice, 1975] Grice, H. P. (1975). Logic and conversation. pages 41–58.
[Gupta et al., 2019] Gupta, T., Schwing, A., and Hoiem, D. (2019). Vico: Word
embeddings from visual co-occurrences. Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pages 7425–7434.
[Handjaras et al., 2016] Handjaras, G., Ricciardi, E., Leo, A., Lenci, A., Cec-
chetti, L., Cosottini, M., Marotta, G., and Pietrini, P. (2016). How concepts
are encoded in the human brain: a modality independent, category-based cor-
tical organization of semantic knowledge. Neuroimage, 135:232–242.
[Harnad, 1990] Harnad, S. (1990). The symbol grounding problem. Physica D:
Nonlinear Phenomena, 42(1-3):335–346.
185
[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual
learning for image recognition. Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 770–778.
[Herbelot, 2020] Herbelot, A. (2020). Re-solve it: simulating the acquisition of
core semantic competences from small data. Proceedings of the 24th Conference
on Computational Natural Language Learning, pages 344–354.
[Hill et al., 2015] Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999:
Evaluating Semantic Models With (Genuine) Similarity Estimation. Associa-
tion for Computational Linguistics.
[Hinton et al., 2012] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. R. (2012). Improving neural networks by preventing
co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[Hooker, 2021] Hooker, S. (2021). Moving beyond “algorithmic bias is a data
problem”. Patterns, 2(4):100241.
[Jitkrittum et al., 2017] Jitkrittum, W., Szabo´, Z., and Gretton, A. (2017). An
adaptive test of independence with analytic kernel embeddings. International
Conference on Machine Learning, pages 1742–1751.
[Johnson et al., 2017] Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei,
L., Lawrence Zitnick, C., and Girshick, R. (2017). Clevr: A diagnostic dataset
for compositional language and elementary visual reasoning. Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 2901–2910.
[Kabbach et al., 2019] Kabbach, A., Gulordava, K., and Herbelot, A. (2019). To-
wards incremental learning of word embeddings using context informativeness.
Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics: Student Research Workshop, pages 162–168.
[Kaur et al., 2020] Kaur, H., Nori, H., Jenkins, S., Caruana, R., Wallach, H., and
Wortman Vaughan, J. (2020). Interpreting interpretability: Understanding
data scientists’ use of interpretability tools for machine learning. Proceedings
of the 2020 CHI Conference on Human Factors in Computing Systems, pages
1–14.
186
[Kay et al., 2015] Kay, M., Matuszek, C., and Munson, S. A. (2015). Unequal
representation and gender stereotypes in image search results for occupations.
Proceedings of the 33rd Annual ACM Conference on Human Factors in Com-
puting Systems, pages 3819–3828.
[Kelly jr, 1956] Kelly jr, J. (1956). A new interpretation of information rate. the
bell system technical journal.
[Kendall et al., 2017] Kendall, A., Badrinarayanan, V., and Cipolla, R. (2017).
Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder ar-
chitectures for scene understanding. British Machine Vision Conference 2017,
BMVC 2017.
[Kiela and Bottou, 2014] Kiela, D. and Bottou, L. (2014). Learning image em-
beddings using convolutional neural networks for improved multi-modal se-
mantics. Proceedings of EMNLP, pages 36–45.
[Kiela and Clark, 2014] Kiela, D. and Clark, S. (2014). A Systematic Study of
Semantic Vector Space Model Parameters. Proceedings of EACL 2014, Work-
shop on Continuous Vector Space Models and their Compositionality (CVSC).
[Kiela et al., 2014] Kiela, D., Hill, F., Korhonen, A., and Clark, S. (2014). Im-
proving multi-modal representations using image dispersion: Why less is some-
times more. Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), 2:835–841.
[Kiela et al., 2016] Kiela, D., Vero˝, A. L., and Clark, S. (2016). Comparing Data
Sources and Architectures for Deep Visual Representation Learning in Seman-
tics. Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP-16).
[Kilgarri↵ and Yallop, 2000] Kilgarri↵, A. and Yallop, C. (2000). What’s in a
thesaurus? LREC, pages 1371–1379.
[Kiros et al., 2014] Kiros, R., Salakhutdinov, R., and Zemel, R. (2014). Multi-
modal neural language models. International Conference on Machine Learning,
pages 595–603.
187
[Kiros et al., 2015] Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Tor-
ralba, A., Urtasun, R., and Fidler, S. (2015). Skip-Thought Vectors. ArxiV,
58(786):1–11.
[Kottur et al., 2015] Kottur, S., Vedantam, R., Moura, J. M. F., and Parikh,
D. (2015). Visual Word2Vec (vis-w2v): Learning Visually Grounded Word
Embeddings Using Abstract Scenes. arXiv preprint.
[Kripke, 1972] Kripke, S. A. (1972). Naming and necessity. pages 253–355.
[Krishna et al., 2016] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K.,
Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M.,
and Fei-Fei, L. (2016). Visual genome: Connecting language and vision using
crowdsourced dense image annotations.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
ImageNet classification with deep convolutional neural networks. Proceedings
of NIPS, pages 1106–1114.
[Kuhnle, 2020] Kuhnle, A. (2020). Evaluating visually grounded language capa-
bilities using microworlds. Technical report, University of Cambridge, Com-
puter Laboratory.
[Kuhnle and Copestake, 2017] Kuhnle, A. and Copestake, A. (2017).
Shapeworld-a new test methodology for multimodal language understanding.
arXiv preprint arXiv:1704.04517.
[Kuzmenko and Herbelot, 2019] Kuzmenko, E. and Herbelot, A. (2019). Distri-
butional semantics in the real world: building word vector representations from
a truth-theoretic model. Proceedings of the 13th International Conference on
Computational Semantics-Short Papers, pages 16–23.
[Lazaridou et al., 2015] Lazaridou, A., Baroni, M., et al. (2015). Combining lan-
guage and vision with a multimodal skip-gram model. Proceedings of the 2015
Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, pages 153–163.
[Lazaridou et al., 2016] Lazaridou, A., Pham, N. T., and Baroni, M. (2016). To-
wards Multi-Agent Communication-Based Language Learning.
188
[LeCun et al., 1989] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard,
R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to
handwritten zip code recognition. Neural computation, 1(4):541–551.
[Lecun et al., 1998] Lecun, Y., Bottou, L., Bengio, Y., and Ha↵ner, P. (1998).
Gradient-based learning applied to document recognition. Proceedings of the
IEEE, 86(11):2278–2324.
[Lenci, 2008] Lenci, A. (2008). Distributional semantics in linguistic and cogni-
tive research. Italian journal of linguistics, 20(1):1–31.
[Lenci, 2018] Lenci, A. (2018). Distributional models of word meaning. Annual
review of Linguistics, 4:151–171.
[Levy and Goldberg, 2014a] Levy, O. and Goldberg, Y. (2014a). Dependency-
based word embeddings. ACL (2), pages 302–308.
[Levy and Goldberg, 2014b] Levy, O. and Goldberg, Y. (2014b). Neural word
embedding as implicit matrix factorization. Advances in neural information
processing systems, pages 2177–2185.
[Levy et al., 2015] Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving dis-
tributional similarity with lessons learned from word embeddings. Transactions
of the Association for Computational Linguistics, 3:211–225.
[Lin et al., 2013] Lin, M., Chen, Q., and Yan, S. (2013). Network in network.
CoRR, abs/1312.4400.
[Lin et al., 2014] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra-
manan, D., Dolla´r, P., and Zitnick, C. L. (2014). Microsoft coco: Common
objects in context. European conference on computer vision, pages 740–755.
[Lin and Parikh, 2015] Lin, X. and Parikh, D. (2015). Don’t just listen, use your
imagination: Leveraging visual common sense for non-visual tasks. Proceedings
of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 07-12-June:2984–2993.
[Lu et al., 2019] Lu, J., Batra, D., Parikh, D., and Lee, S. (2019). Vilbert: Pre-
training task-agnostic visiolinguistic representations for vision-and-language
tasks. Advances in Neural Information Processing Systems, 32.
189
[Lucey et al., 2017] Lucey, J. A., Otter, D., and Horne, D. S. (2017). A 100-year
review: Progress on the chemistry of milk and its components. Journal of
Dairy Science, 100(12):9916–9932.
[Maaten and Hinton, 2008] Maaten, L. v. d. and Hinton, G. (2008). Visualizing
data using t-sne. Journal of machine learning research, 9(Nov):2579–2605.
[MacKay, 2003] MacKay, D. J. (2003). Information theory, inference and learning
algorithms.
[MacQueen et al., 1967] MacQueen, J. et al. (1967). Some methods for classifica-
tion and analysis of multivariate observations. Proceedings of the fifth Berkeley
symposium on mathematical statistics and probability, 1(14):281–297.
[Majumdar et al., 2020] Majumdar, A., Shrivastava, A., Lee, S., Anderson, P.,
Parikh, D., and Batra, D. (2020). Improving vision-and-language navigation
with image-text pairs from the web. European Conference on Computer Vision,
pages 259–274.
[Manning and Schutze, 1999] Manning, C. and Schutze, H. (1999). Foundations
of statistical natural language processing.
[Marconi, 1997] Marconi, D. (1997). Lexical competence.
[Margolis and Laurence, 2021] Margolis, E. and Laurence, S. (2021). Concepts.
The Stanford Encyclopedia of Philosophy.
[Mervis and Rosch, 1981] Mervis, C. B. and Rosch, E. (1981). Categorization of
natural objects. Annual review of psychology, 32(1):89–115.
[Mikolov et al., 2013a] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
E cient estimation of word representations in vector space. 1st International
Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona,
USA, May 2-4, 2013, Workshop Track Proceedings.
[Mikolov et al., 2018] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and
Joulin, A. (2018). Advances in pre-training distributed word representations.
Proceedings of the International Conference on Language Resources and Eval-
uation (LREC 2018).
190
[Mikolov et al., 2013b] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013b). Distributed representations of words and phrases and their
compositionality. Advances in neural information processing systems, pages
3111–3119.
[Miller, 1995] Miller, G. A. (1995). Wordnet: a lexical database for english.
Communications of the ACM, 38(11):39–41.
[Minnema and Herbelot, 2019] Minnema, G. and Herbelot, A. (2019). From
brain space to distributional space: the perilous journeys of fmri decoding.
Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics: Student Research Workshop, pages 155–161.
[Mitchell and Lapata, 2010] Mitchell, J. and Lapata, M. (2010). Composition in
distributional models of semantics. Cognitive science, 34(8):1388–429.
[Mitchell et al., 2008] Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-
M., Malave, V. L., Mason, R. A., and Just, M. A. (2008). Predicting human
brain activity associated with the meanings of nouns. science, 320(5880):1191–
1195.
[Nair and Hinton, 2010] Nair, V. and Hinton, G. E. (2010). Rectified linear units
improve restricted boltzmann machines. Proceedings of ICML, pages 807–814.
[Navigli, 2009] Navigli, R. (2009). Word sense disambiguation: A survey. ACM
computing surveys (CSUR), 41(2):1–69.
[Nelson et al., 2004] Nelson, D. L., McEvoy, C. L., and Schreiber, T. A. (2004).
The university of south florida free association, rhyme, and word fragment
norms. Behavior Research Methods, Instruments, & Computers, 36(3):402–
407.
[Pedersen, 1996] Pedersen, T. (1996). Fishing for exactness. arXiv preprint cmp-
lg/9608010.
[Pennington et al., 2014] Pennington, J., Socher, R., and Manning, C. (2014).
Glove: Global vectors for word representation. Proceedings of the 2014 con-
ference on empirical methods in natural language processing (EMNLP), pages
1532–1543.
191
[Pereira et al., 2018] Pereira, F., Lou, B., Pritchett, B., Ritter, S., Gershman,
S. J., Kanwisher, N., Botvinick, M., and Fedorenko, E. (2018). Toward a
universal decoder of linguistic meaning from brain activation. Nature commu-
nications, 9(1):1–13.
[Peters et al., 2018] Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark,
C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representa-
tions. Proceedings of the 2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237.
[Ponce et al., 2006] Ponce, J., Berg, T. L., Everingham, M., Forsyth, D. A.,
Hebert, M., Lazebnik, S., Marszalek, M., Schmid, C., Russell, B. C., Torralba,
A., et al. (2006). Dataset issues in object recognition. pages 29–48.
[Putnam, 1970] Putnam, H. (1970). Is semantics possible? Metaphilosophy,
1(3):187–201.
[Radford et al., 2018] Radford, A., Narasimhan, K., Salimans, T., and
Sutskever, I. (2018). Improving language understanding by genera-
tive pre-training. URL https://s3-us-west-2. amazonaws. com/openai-
assets/researchcovers/languageunsupervised/language understanding paper.pdf.
[Radovanovic´ et al., 2010] Radovanovic´, M., Nanopoulos, A., and Ivanovic´, M.
(2010). On the existence of obstinate results in vector space models. Proceedings
of the 33rd international ACM SIGIR conference on Research and development
in information retrieval, pages 186–193.
[Recanati, 2004] Recanati, F. (2004). Literal meaning.
[Rockta¨schel et al., 2016] Rockta¨schel, T., Grefenstette, E., Hermann, K. M.,
Kocˇisky´, T., and Blunsom, P. (2016). Reasoning about Entailment with Neural
Attention. ICLR.
[Roy, 2005] Roy, D. (2005). Grounding words in perception and action: Compu-
tational insights.
[Sahlgren and Lenci, 2016] Sahlgren, M. and Lenci, A. (2016). The e↵ects of data
size and frequency range on distributional semantic models. EMNLP 2016.
192
[Scarselli et al., 2008] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and
Monfardini, G. (2008). The graph neural network model. IEEE transactions
on neural networks, 20(1):61–80.
[Schuler, 2005] Schuler, K. K. (2005). Verbnet: A broad-coverage, comprehensive
verb lexicon.
[Schu¨tze et al., 2008] Schu¨tze, H., Manning, C. D., and Raghavan, P. (2008).
Introduction to information retrieval. Proceedings of the international commu-
nication of association for computing machinery conference, 4.
[Searle, 1985] Searle, J. R. (1985). Expression and meaning: Studies in the theory
of speech acts.
[Shannon, 2001] Shannon, C. E. (2001). A mathematical theory of communica-
tion. ACM SIGMOBILE mobile computing and communications review, 5(1):3–
55.
[Sharma et al., 2015] Sharma, S., Kiros, R., and Salakhutdinov, R. (2015). Ac-
tion Recognition using Visual Attention. arXiv preprint, pages 1–11.
[Silberer and Lapata, 2014] Silberer, C. and Lapata, M. (2014). Learning
Grounded Meaning Representations with Autoencoders. Proceedings of the
52nd Annual Meeting of the Association for Computational Linguistics, June
23-25:721–732.
[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very
deep convolutional networks for large-scale image recognition. International
Conference on Learning Representations (ICLR), 2015.
[Sivic and Zisserman, 2003] Sivic, J. and Zisserman, A. (2003). Video Google:
a text retrieval approach to object matching in videos. IEEE International
Conference on Computer Vision, (Iccv):1470–1477.
[Socher et al., 2014] Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., and
Ng, A. Y. (2014). Grounded compositional semantics for finding and describing
images with sentences. Transactions of the Association for Computational
Linguistics, 2:207–218.
193
[Spa¨rck Jones, 1967] Spa¨rck Jones, K. (1967). A small semantic classification
experiment using cooccurrence data. Report ML, 196.
[Srivastava and Salakhutdinov, 2012] Srivastava, N. and Salakhutdinov, R. R.
(2012). Multimodal learning with deep boltzmann machines. Advances in
neural information processing systems, pages 2222–2230.
[Su et al., 2019] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai,
J. (2019). Vl-bert: Pre-training of generic visual-linguistic representations.
International Conference on Learning Representations.
[Sudre et al., 2012] Sudre, G., Pomerleau, D., Palatucci, M., Wehbe, L., Fyshe,
A., Salmelin, R., and Mitchell, T. (2012). Tracking neural coding of perceptual
and semantic features of concrete nouns. NeuroImage, 62(1):451–463.
[Szabo´, 2014] Szabo´, Z. (2014). Information theoretical estimators toolbox. The
Journal of Machine Learning Research, 15(1):283–287.
[Szegedy et al., 2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going
deeper with convolutions. Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–9.
[Tettamanti et al., 2005] Tettamanti, M., Buccino, G., Saccuman, M. C., Gallese,
V., Danna, M., Scifo, P., Fazio, F., Rizzolatti, G., Cappa, S. F., and Perani,
D. (2005). Listening to action-related sentences activates fronto-parietal motor
circuits. Journal of cognitive neuroscience, 17(2):273–281.
[Torralba and Efros, 2011] Torralba, A. and Efros, A. A. (2011). Unbiased look
at dataset bias. CVPR 2011, pages 1521–1528.
[Tsai et al., 2019] Tsai, Y.-H. H., Bai, S., Liang, P. P., Kolter, J. Z., Morency,
L.-P., and Salakhutdinov, R. (2019). Multimodal transformer for unaligned
multimodal language sequences. Proceedings of the Annual Meeting of the
Association for Computational Linguistics.
[Turney, 2010] Turney, P. D. (2010). From Frequency to Meaning : Vector Space
Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188.
194
[Vendrov et al., 2015] Vendrov, I., Kiros, R., Fidler, S., and Urtasun, R. (2015).
Order-Embeddings of Images and Language. arXiv preprint, (2005):1–13.
[Vert et al., 2004] Vert, J.-P., Tsuda, K., and Scho¨lkopf, B. (2004). A primer on
kernel methods. Kernel methods in computational biology, 47:35–70.
[Voita and Titov, 2020] Voita, E. and Titov, I. (2020). Information-theoretic
probing with minimum description length. Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Language Processing (EMNLP), pages
183–196.
[von Ahn and Dabbish, 2004] von Ahn, L. and Dabbish, L. (2004). Labeling
images with a computer game. CHI, pages 319–326.
[Von Ahn and Dabbish, 2004] Von Ahn, L. and Dabbish, L. (2004). Labeling im-
ages with a computer game. Proceedings of the SIGCHI conference on Human
factors in computing systems, pages 319–326.
[Wang et al., 2018a] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and
Bowman, S. (2018a). Glue: A multi-task benchmark and analysis platform for
natural language understanding. Proceedings of the 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages
353–355.
[Wang et al., 2018b] Wang, J., Madhyastha, P. S., and Specia, L. (2018b). Object
counts! bringing explicit detections back into image captioning. Proceedings
of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers), pages 2180–2193.
[Wang et al., 2005] Wang, Q., Kulkarni, S. R., and Verdu´, S. (2005). Diver-
gence estimation of continuous distributions based on data-dependent parti-
tions. IEEE Transactions on Information Theory, 51(9):3064–3074.
[Wang et al., 2009] Wang, Q., Kulkarni, S. R., and Verdu´, S. (2009). Diver-
gence estimation for multidimensional densities via k-nearest-neighbor dis-
tances. IEEE Transactions on Information Theory, 55(5):2392–2405.
195
[Wang and Jiang, 2015] Wang, S. and Jiang, J. (2015). Learning Natural Lan-
guage Inference with LSTM. Naacl.
[Wattenberg et al., 2016] Wattenberg, M., Vie´gas, F., and Johnson, I. (2016).
How to use t-sne e↵ectively. Distill.
[Wittgenstein, 1953] Wittgenstein, L. (1953). Philosophical investigations.
[Wold et al., 1987] Wold, S., Esbensen, K., and Geladi, P. (1987). Principal com-
ponent analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–
52.
[Xu et al., 2016] Xu, H., Murphy, B., and Fyshe, A. (2016). Brainbench: A
brain-image test suite for distributional semantic models. Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing, pages
2017–2021.
[Xu et al., 2020] Xu, P., Chang, X., Guo, L., Huang, P.-Y., Chen, X., and Haupt-
mann, A. G. (2020). A survey of scene graph: Generation and application.
IEEE Trans. Neural Netw. Learn. Syst. 2020.
[Yang et al., 2019] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,
R. R., and Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for
language understanding. Advances in neural information processing systems,
32.
[Yeung, 1991] Yeung, R. W. (1991). A new outlook on shannon’s information
measures. IEEE transactions on information theory, 37(3):466–474.
[Yogatama et al., 2019] Yogatama, D., d’Autume, C. d. M., Connor, J., Kocisky,
T., Chrzanowski, M., Kong, L., Lazaridou, A., Ling, W., Yu, L., Dyer, C.,
et al. (2019). Learning and evaluating general linguistic intelligence. arXiv
preprint arXiv:1901.11373.
[Zhang and Bowman, 2018] Zhang, K. and Bowman, S. (2018). Language model-
ing teaches you more than translation does: Lessons learned through auxiliary
syntactic task analysis. Proceedings of the 2018 EMNLP Workshop Black-
boxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 359–361.
196
[Zhang et al., 2016] Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., and
Parikh, D. (2016). Yin and yang: Balancing and answering binary visual ques-
tions. Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 5014–5022.
[Zhang et al., 2018] Zhang, Q., Wang, W., and Zhu, S.-C. (2018). Examining
cnn representations with respect to dataset bias. Proceedings of the AAAI
Conference on Artificial Intelligence, 32(1).
197
198
Appendix A
Cross-validated Semantic
Relatedness and Similarity
Embedding                       Spearman       P-value        Coverage

wikinews                        0.797 (0.004)  <1‰ (<1‰)      2000
wikinews sub                    0.805 (0.005)  <1‰ (<1‰)      2000
crawl                           0.843 (0.001)  <1‰ (<1‰)      2000
w2v13                           0.684 (0.007)  <1‰ (<1‰)      2000

Google AlexNet                  0.506 (0.009)  <1‰ (<1‰)      2000
VG SceneGraph                   0.427 (0.006)  <1‰ (<1‰)      1716
Google VGG                      0.516 (0.005)  <1‰ (<1‰)      2000
VG-internal                     0.377 (0.008)  <1‰ (<1‰)      1856
VG-whole                        0.415 (0.006)  <1‰ (<1‰)      1856
Google ResNet-152               0.469 (0.003)  <1‰ (<1‰)      2000

wikinews+Google AlexNet         0.499 (0.003)  <1‰ (<1‰)      2000
wikinews+VG SceneGraph          0.568 (0.013)  <1‰ (<1‰)      2000
wikinews+Google VGG             0.512 (0.005)  <1‰ (<1‰)      2000
wikinews+VG-internal            0.367 (0.008)  <1‰ (<1‰)      2000
wikinews+VG-whole               0.402 (0.007)  <1‰ (<1‰)      2000
wikinews+Google ResNet-152      0.479 (0.008)  <1‰ (<1‰)      2000
wikinews sub+Google AlexNet     0.506 (0.011)  <1‰ (<1‰)      2000
wikinews sub+VG SceneGraph      0.380 (0.010)  <1‰ (<1‰)      2000
wikinews sub+Google VGG         0.514 (0.012)  <1‰ (<1‰)      2000
wikinews sub+VG-internal        0.364 (0.009)  <1‰ (<1‰)      2000
wikinews sub+VG-whole           0.387 (0.013)  <1‰ (<1‰)      2000
wikinews sub+Google ResNet-152  0.463 (0.004)  <1‰ (<1‰)      2000
crawl+Google AlexNet            0.501 (0.004)  <1‰ (<1‰)      2000
crawl+VG SceneGraph             0.778 (0.006)  <1‰ (<1‰)      2000
crawl+Google VGG                0.516 (0.006)  <1‰ (<1‰)      2000
crawl+VG-internal               0.357 (0.012)  <1‰ (<1‰)      2000
crawl+VG-whole                  0.398 (0.008)  <1‰ (<1‰)      2000
crawl+Google ResNet-152         0.514 (0.007)  <1‰ (<1‰)      2000
w2v13+Google AlexNet            0.501 (0.012)  <1‰ (<1‰)      2000
w2v13+VG SceneGraph             0.645 (0.008)  <1‰ (<1‰)      2000
w2v13+Google VGG                0.518 (0.010)  <1‰ (<1‰)      2000
w2v13+VG-internal               0.372 (0.005)  <1‰ (<1‰)      2000
w2v13+VG-whole                  0.403 (0.004)  <1‰ (<1‰)      2000
w2v13+Google ResNet-152         0.486 (0.002)  <1‰ (<1‰)      2000
Table A.1: Cross-validated Spearman correlations on the MEN dataset. The Spearman
and P-value columns report the mean (STD) over three samples, each computed after
leaving out a third of the evaluation pairs. Multi-modal embeddings are created
using the Padding technique. The table sections contain linguistic, visual and
multi-modal embeddings, in this order.
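
The "mean (STD) of three samples" reported in the Spearman column can be illustrated with a minimal sketch. It is not the code used to produce the tables. It assumes that the evaluation pairs are split at random into three folds, that each sample drops one fold, and that STD is the sample standard deviation; the function names and the `seed` parameter are hypothetical.

```python
import random
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def cross_validated_spearman(model_scores, human_scores, n_folds=3, seed=0):
    """Mean and sample STD of Spearman over n_folds samples.

    Assumption: each sample is computed on the pairs that remain after
    leaving out one randomly assigned fold (a third when n_folds is 3).
    """
    indices = list(range(len(human_scores)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    correlations = []
    for held_out in folds:
        skip = set(held_out)
        keep = [i for i in indices if i not in skip]
        correlations.append(
            spearman([model_scores[i] for i in keep],
                     [human_scores[i] for i in keep])
        )
    return mean(correlations), stdev(correlations)
```

Here `model_scores` would hold the embedding similarities (for example cosine) of the evaluation pairs and `human_scores` the corresponding gold ratings, in the same order.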
Embedding                       Spearman       P-value        Coverage

wikinews                        0.463 (0.009)  <1‰ (<1‰)      666
wikinews sub                    0.412 (0.025)  <1‰ (<1‰)      666
crawl                           0.506 (0.019)  <1‰ (<1‰)      666
w2v13                           0.316 (0.020)  <1‰ (<1‰)      666

Google AlexNet                  0.348 (0.025)  <1‰ (<1‰)      666
VG SceneGraph                   0.274 (0.019)  <1‰ (<1‰)      395
Google VGG                      0.363 (0.017)  <1‰ (<1‰)      666
VG-internal                     0.311 (0.059)  0.023 (0.027)  68
VG-whole                        0.169 (0.024)  0.178 (0.068)  68
Google ResNet-152               0.354 (0.007)  <1‰ (<1‰)      666

wikinews+Google AlexNet         0.332 (0.032)  <1‰ (<1‰)      666
wikinews+VG SceneGraph          0.348 (0.018)  <1‰ (<1‰)      666
wikinews+Google VGG             0.332 (0.014)  <1‰ (<1‰)      666
wikinews+VG-internal            0.300 (0.002)  <1‰ (<1‰)      666
wikinews+VG-whole               0.326 (0.017)  <1‰ (<1‰)      666
wikinews+Google ResNet-152      0.350 (0.028)  <1‰ (<1‰)      666
wikinews sub+Google AlexNet     0.329 (0.022)  <1‰ (<1‰)      666
wikinews sub+VG SceneGraph      0.187 (0.027)  <1‰ (<1‰)      666
wikinews sub+Google VGG         0.353 (0.011)  <1‰ (<1‰)      666
wikinews sub+VG-internal        0.299 (0.013)  <1‰ (<1‰)      666
wikinews sub+VG-whole           0.304 (0.015)  <1‰ (<1‰)      666
wikinews sub+Google ResNet-152  0.348 (0.011)  <1‰ (<1‰)      666
crawl+Google AlexNet            0.349 (0.025)  <1‰ (<1‰)      666
crawl+VG SceneGraph             0.434 (0.017)  <1‰ (<1‰)      666
crawl+Google VGG                0.346 (0.017)  <1‰ (<1‰)      666
crawl+VG-internal               0.310 (0.038)  <1‰ (<1‰)      666
crawl+VG-whole                  0.321 (0.007)  <1‰ (<1‰)      666
crawl+Google ResNet-152         0.364 (0.009)  <1‰ (<1‰)      666
w2v13+Google AlexNet            0.345 (0.024)  <1‰ (<1‰)      666
w2v13+VG SceneGraph             0.312 (0.007)  <1‰ (<1‰)      666
w2v13+Google VGG                0.362 (0.017)  <1‰ (<1‰)      666
w2v13+VG-internal               0.209 (0.017)  <1‰ (<1‰)      666
w2v13+VG-whole                  0.225 (0.007)  <1‰ (<1‰)      666
w2v13+Google ResNet-152         0.352 (0.020)  <1‰ (<1‰)      666
Table A.2: Cross-validated Spearman correlations on the SimLex dataset. The
Spearman and P-value columns report the mean (STD) over three samples, each
computed after leaving out a third of the evaluation pairs. Multi-modal embeddings
are created using the Padding technique. The table sections contain linguistic,
visual and multi-modal embeddings, in this order.
Embedding                       Spearman       P-value        Coverage

wikinews                        0.792 (0.002)  <1‰ (<1‰)      2000
wikinews sub                    0.804 (0.001)  <1‰ (<1‰)      2000
crawl                           0.845 (0.001)  <1‰ (<1‰)      2000
w2v13                           0.684 (0.003)  <1‰ (<1‰)      2000

Google AlexNet                  0.509 (0.005)  <1‰ (<1‰)      2000
VG SceneGraph                   0.413 (0.004)  <1‰ (<1‰)      1716
Google VGG                      0.508 (0.008)  <1‰ (<1‰)      2000
VG-internal                     0.374 (0.015)  <1‰ (<1‰)      1856
VG-whole                        0.412 (0.002)  <1‰ (<1‰)      1856
Google ResNet-152               0.464 (0.007)  <1‰ (<1‰)      2000

wikinews+Google AlexNet         0.497 (0.004)  <1‰ (<1‰)      2000
wikinews+VG SceneGraph          0.654 (0.006)  <1‰ (<1‰)      1716
wikinews+Google VGG             0.504 (0.011)  <1‰ (<1‰)      2000
wikinews+VG-internal            0.374 (0.003)  <1‰ (<1‰)      1856
wikinews+VG-whole               0.415 (0.006)  <1‰ (<1‰)      1856
wikinews+Google ResNet-152      0.476 (0.004)  <1‰ (<1‰)      2000
wikinews sub+Google AlexNet     0.501 (0.008)  <1‰ (<1‰)      2000
wikinews sub+VG SceneGraph      0.452 (0.021)  <1‰ (<1‰)      1716
wikinews sub+Google VGG         0.503 (0.002)  <1‰ (<1‰)      2000
wikinews sub+VG-internal        0.370 (0.005)  <1‰ (<1‰)      1856
wikinews sub+VG-whole           0.415 (0.005)  <1‰ (<1‰)      1856
wikinews sub+Google ResNet-152  0.475 (0.005)  <1‰ (<1‰)      2000
crawl+Google AlexNet            0.502 (0.009)  <1‰ (<1‰)      2000
crawl+VG SceneGraph             0.813 (0.001)  <1‰ (<1‰)      1716
crawl+Google VGG                0.512 (0.008)  <1‰ (<1‰)      2000
crawl+VG-internal               0.392 (0.005)  <1‰ (<1‰)      1856
crawl+VG-whole                  0.427 (0.006)  <1‰ (<1‰)      1856
crawl+Google ResNet-152         0.514 (0.003)  <1‰ (<1‰)      2000
w2v13+Google AlexNet            0.502 (0.004)  <1‰ (<1‰)      2000
w2v13+VG SceneGraph             0.696 (0.003)  <1‰ (<1‰)      1716
w2v13+Google VGG                0.528 (0.005)  <1‰ (<1‰)      2000
w2v13+VG-internal               0.369 (0.011)  <1‰ (<1‰)      1856
w2v13+VG-whole                  0.423 (0.010)  <1‰ (<1‰)      1856
w2v13+Google ResNet-152         0.484 (0.010)  <1‰ (<1‰)      2000
Table A.3: Cross-validated Spearman correlations on the MEN dataset. The Spearman
and P-value columns report the mean (STD) over three samples, each computed after
leaving out a third of the evaluation pairs. Multi-modal embeddings are created
using the Intersection technique. The table sections contain linguistic, visual
and multi-modal embeddings, in this order.
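
The Coverage columns suggest how the two fusion techniques differ. With Padding, multi-modal embeddings cover every evaluation pair that the linguistic embedding covers. With Intersection, coverage falls to that of the visual embedding. The sketch below is one reading consistent with those numbers, not the thesis implementation. It assumes that fusion is plain concatenation and that Padding fills a missing modality with zeros; the function names and the `fill` parameter are hypothetical.

```python
def fuse_intersection(ling, vis):
    """Concatenate vectors only for words present in both modalities."""
    shared = sorted(set(ling) & set(vis))
    return {w: ling[w] + vis[w] for w in shared}


def fuse_padding(ling, vis, fill=0.0):
    """Concatenate vectors for every word found in either modality.

    Assumption: a missing modality is replaced by a constant vector
    (zeros by default) of that modality's dimensionality.
    """
    ling_dim = len(next(iter(ling.values())))
    vis_dim = len(next(iter(vis.values())))
    fused = {}
    for w in sorted(set(ling) | set(vis)):
        fused[w] = ling.get(w, [fill] * ling_dim) + vis.get(w, [fill] * vis_dim)
    return fused


def coverage(pairs, vocab):
    """Number of evaluation pairs whose two words are both in the vocabulary."""
    return sum(1 for a, b in pairs if a in vocab and b in vocab)


# Toy data: "idea" has a linguistic vector but no visual one.
ling = {"cat": [0.1, 0.2], "dog": [0.3, 0.1], "idea": [0.5, 0.4]}
vis = {"cat": [1.0], "dog": [0.8]}
pairs = [("cat", "dog"), ("cat", "idea")]

print(coverage(pairs, fuse_intersection(ling, vis)))  # 1
print(coverage(pairs, fuse_padding(ling, vis)))       # 2
```

In this toy setting Intersection loses the pair that contains "idea", while Padding keeps it at the cost of a visual part that carries no information.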
Embedding                       Spearman       P-value        Coverage

wikinews                        0.457 (0.006)  <1‰ (<1‰)      666
wikinews sub                    0.443 (0.015)  <1‰ (<1‰)      666
crawl                           0.493 (0.013)  <1‰ (<1‰)      666
w2v13                           0.300 (0.010)  <1‰ (<1‰)      666

Google AlexNet                  0.348 (0.004)  <1‰ (<1‰)      666
VG SceneGraph                   0.249 (0.023)  <1‰ (<1‰)      395
Google VGG                      0.344 (0.008)  <1‰ (<1‰)      666
VG-internal                     0.289 (0.034)  0.022 (0.015)  68
VG-whole                        0.118 (0.032)  0.354 (0.135)  68
Google ResNet-152               0.351 (0.022)  <1‰ (<1‰)      666

wikinews+Google AlexNet         0.331 (0.021)  <1‰ (<1‰)      666
wikinews+VG SceneGraph          0.362 (0.017)  <1‰ (<1‰)      395
wikinews+Google VGG             0.318 (0.019)  <1‰ (<1‰)      666
wikinews+VG-internal            0.289 (0.043)  0.024 (0.021)  68
wikinews+VG-whole               0.269 (0.017)  0.028 (0.009)  68
wikinews+Google ResNet-152      0.370 (0.017)  <1‰ (<1‰)      666
wikinews sub+Google AlexNet     0.356 (0.015)  <1‰ (<1‰)      666
wikinews sub+VG SceneGraph      0.304 (0.022)  <1‰ (<1‰)      395
wikinews sub+Google VGG         0.336 (0.021)  <1‰ (<1‰)      666
wikinews sub+VG-internal        0.270 (0.058)  0.046 (0.048)  68
wikinews sub+VG-whole           0.090 (0.119)  0.528 (0.350)  68
wikinews sub+Google ResNet-152  0.348 (0.005)  <1‰ (<1‰)      666
crawl+Google AlexNet            0.358 (0.014)  <1‰ (<1‰)      666
crawl+VG SceneGraph             0.428 (0.027)  <1‰ (<1‰)      395
crawl+Google VGG                0.332 (0.008)  <1‰ (<1‰)      666
crawl+VG-internal               0.305 (0.024)  0.013 (0.006)  68
crawl+VG-whole                  0.160 (0.074)  0.271 (0.247)  68
crawl+Google ResNet-152         0.370 (0.026)  <1‰ (<1‰)      666
w2v13+Google AlexNet            0.338 (0.002)  <1‰ (<1‰)      666
w2v13+VG SceneGraph             0.278 (0.008)  <1‰ (<1‰)      395
w2v13+Google VGG                0.337 (0.019)  <1‰ (<1‰)      666
w2v13+VG-internal               0.306 (0.049)  0.017 (0.011)  68
w2v13+VG-whole                  0.233 (0.058)  0.086 (0.080)  68
w2v13+Google ResNet-152         0.367 (0.004)  <1‰ (<1‰)      666
Table A.4: Cross-validated Spearman correlations on the SimLex dataset. The
Spearman and P-value columns report the mean (STD) over three samples, each
taken after leaving out a third of the evaluation pairs. Multi-modal embeddings
are created using the Intersection technique. The table sections contain
linguistic, visual and multi-modal embeddings, in this order.
Embedding Spearman P-value Coverage
wikinews 0.798 (0.005) <1‰(<1‰) 1654
wikinews sub 0.806 (0.004) <1‰(<1‰) 1654
crawl 0.844 (0.003) <1‰(<1‰) 1654
w2v13 0.667 (0.003) <1‰(<1‰) 1654
Google AlexNet 0.511 (0.006) <1‰(<1‰) 1654
VG SceneGraph 0.431 (0.015) <1‰(<1‰) 1654
Google VGG 0.524 (0.007) <1‰(<1‰) 1654
VG-internal 0.381 (0.008) <1‰(<1‰) 1654
VG-whole 0.405 (0.009) <1‰(<1‰) 1654
Google ResNet-152 0.472 (0.014) <1‰(<1‰) 1654
wikinews+Google AlexNet 0.518 (0.004) <1‰(<1‰) 1654
wikinews+VG SceneGraph 0.654 (0.006) <1‰(<1‰) 1654
wikinews+Google VGG 0.516 (0.003) <1‰(<1‰) 1654
wikinews+VG-internal 0.376 (0.002) <1‰(<1‰) 1654
wikinews+VG-whole 0.412 (0.008) <1‰(<1‰) 1654
wikinews+Google ResNet-152 0.476 (0.014) <1‰(<1‰) 1654
wikinews sub+Google AlexNet 0.516 (0.007) <1‰(<1‰) 1654
wikinews sub+VG SceneGraph 0.452 (0.008) <1‰(<1‰) 1654
wikinews sub+Google VGG 0.515 (0.004) <1‰(<1‰) 1654
wikinews sub+VG-internal 0.364 (0.002) <1‰(<1‰) 1654
wikinews sub+VG-whole 0.406 (0.017) <1‰(<1‰) 1654
wikinews sub+Google ResNet-152 0.483 (0.012) <1‰(<1‰) 1654
crawl+Google AlexNet 0.514 (0.015) <1‰(<1‰) 1654
crawl+VG SceneGraph 0.813 (0.001) <1‰(<1‰) 1654
crawl+Google VGG 0.524 (0.008) <1‰(<1‰) 1654
crawl+VG-internal 0.393 (0.007) <1‰(<1‰) 1654
crawl+VG-whole 0.423 (0.013) <1‰(<1‰) 1654
crawl+Google ResNet-152 0.512 (0.005) <1‰(<1‰) 1654
w2v13+Google AlexNet 0.507 (0.007) <1‰(<1‰) 1654
w2v13+VG SceneGraph 0.695 (0.004) <1‰(<1‰) 1654
w2v13+Google VGG 0.521 (0.008) <1‰(<1‰) 1654
w2v13+VG-internal 0.378 (0.005) <1‰(<1‰) 1654
w2v13+VG-whole 0.405 (0.002) <1‰(<1‰) 1654
w2v13+Google ResNet-152 0.487 (0.006) <1‰(<1‰) 1654
Table A.5: Cross-validated Spearman correlations on the common subset of the
MEN dataset. The Spearman and P-value columns report the mean (STD) over three
samples, each taken after leaving out a third of the evaluation pairs.
Multi-modal embeddings are created using the Intersection technique. The table
sections contain linguistic, visual and multi-modal embeddings, in this order.
Embedding Spearman P-value Coverage
wikinews 0.299 (0.064) 0.029 (0.030) 68
wikinews sub 0.233 (0.074) 0.095 (0.064) 68
crawl 0.361 (0.055) 0.005 (0.003) 68
w2v13 0.101 (0.033) 0.428 (0.145) 68
Google AlexNet 0.536 (0.042) <1‰(<1‰) 68
VG SceneGraph 0.257 (0.038) 0.044 (0.032) 68
Google VGG 0.464 (0.031) <1‰(<1‰) 68
VG-internal 0.295 (0.030) 0.018 (0.014) 68
VG-whole 0.213 (0.049) 0.108 (0.087) 68
Google ResNet-152 0.527 (0.034) <1‰(<1‰) 68
wikinews+Google AlexNet 0.584 (0.025) <1‰(<1‰) 68
wikinews+VG SceneGraph 0.353 (0.070) 0.008 (0.006) 68
wikinews+Google VGG 0.547 (0.024) <1‰(<1‰) 68
wikinews+VG-internal 0.326 (0.022) 0.008 (0.003) 68
wikinews+VG-whole 0.128 (0.074) 0.377 (0.305) 68
wikinews+Google ResNet-152 0.456 (0.023) <1‰(<1‰) 68
wikinews sub+Google AlexNet 0.605 (0.027) <1‰(<1‰) 68
wikinews sub+VG SceneGraph 0.317 (0.059) 0.020 (0.024) 68
wikinews sub+Google VGG 0.538 (0.054) <1‰(<1‰) 68
wikinews sub+VG-internal 0.319 (0.062) 0.019 (0.022) 68
wikinews sub+VG-whole 0.165 (0.106) 0.313 (0.220) 68
wikinews sub+Google ResNet-152 0.540 (0.023) <1‰(<1‰) 68
crawl+Google AlexNet 0.564 (0.027) <1‰(<1‰) 68
crawl+VG SceneGraph 0.339 (0.072) 0.014 (0.016) 68
crawl+Google VGG 0.602 (0.023) <1‰(<1‰) 68
crawl+VG-internal 0.335 (0.053) 0.011 (0.012) 68
crawl+VG-whole 0.178 (0.055) 0.189 (0.158) 68
crawl+Google ResNet-152 0.501 (0.018) <1‰(<1‰) 68
w2v13+Google AlexNet 0.495 (0.020) <1‰(<1‰) 68
w2v13+VG SceneGraph 0.227 (0.084) 0.136 (0.164) 68
w2v13+Google VGG 0.485 (0.044) <1‰(<1‰) 68
w2v13+VG-internal 0.333 (0.059) 0.014 (0.018) 68
w2v13+VG-whole 0.251 (0.049) 0.055 (0.043) 68
w2v13+Google ResNet-152 0.498 (0.028) <1‰(<1‰) 68
Table A.6: Cross-validated Spearman correlations on the common subset of the
SimLex dataset. The Spearman and P-value columns report the mean (STD) over
three samples, each taken after leaving out a third of the evaluation pairs.
Multi-modal embeddings are created using the Intersection technique. The table
sections contain linguistic, visual and multi-modal embeddings, in this order.
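The cross-validation protocol described in the captions of Tables A.3 to A.6 can be illustrated with a minimal sketch. It is not the EmbEval implementation. It assumes that the three samples correspond to three disjoint, randomly assigned thirds of the evaluation pairs, each left out in turn, and that STD is the sample standard deviation. The function names, the seed and the fold assignment are hypothetical, and the P-value computation is omitted.

```python
import math
import random
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def cross_validated_spearman(model_scores, human_scores, n_folds=3, seed=0):
    """Leave out each fold in turn; return (mean, sample std) of rho.

    Assumption: folds are disjoint and assigned at random.
    """
    indices = list(range(len(human_scores)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    rhos = []
    for held_out in folds:
        skip = set(held_out)
        kept = [i for i in indices if i not in skip]
        rhos.append(spearman([model_scores[i] for i in kept],
                             [human_scores[i] for i in kept]))
    return mean(rhos), stdev(rhos)
```

Here `model_scores` would hold the embedding similarities (for example cosine similarities) of the evaluation word pairs and `human_scores` the corresponding human ratings, in the same order.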
Appendix B
WordNet Concreteness
This appendix presents further WordNet concreteness analysis (Section 4.3.4.3)
on the common subset of the datasets for the behavioural tasks, and for the
Intersection mid-fusion method.
Figure B.1: Scores on the embeddings’ common subset of Semantic Similarity
dataset splits, ordered by the sum of WordNet concreteness scores of the two
words in every word pair. Mid-fusion method: Padding.
Figure B.2: Scores on the full Semantic Similarity dataset splits, ordered by the
sum of WordNet concreteness scores of the two words in every word pair. Mid-
fusion method: Intersection.
Figure B.3: Scores on the embeddings’ common subset of Semantic Similarity
dataset splits, ordered by the sum of WordNet concreteness scores of the two
words in every word pair. Mid-fusion method: Intersection.
Figure B.4: Scores on the embeddings’ common subset of Semantic Similarity
dataset splits, ordered by the difference of WordNet concreteness scores of the
two words in every word pair. Mid-fusion method: Padding.
Figure B.5: Scores on the full Semantic Similarity dataset splits, ordered by the
difference of WordNet concreteness scores of the two words in every word pair.
Mid-fusion method: Intersection.
Figure B.6: Scores on the embeddings’ common subset of Semantic Similarity
dataset splits, ordered by the difference of WordNet concreteness scores of the
two words in every word pair. Mid-fusion method: Intersection.
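The ordering named in the captions of Figures B.1 to B.6 can be sketched as follows. This is only an illustration under stated assumptions: the `concreteness` mapping, the number of splits, the use of equally sized contiguous splits and the use of the absolute difference are all hypothetical choices, not details taken from the thesis.

```python
def concreteness_splits(pairs, concreteness, n_splits=4, mode="sum"):
    """Order word pairs by a concreteness statistic and cut them into splits.

    pairs        -- list of (word1, word2) tuples
    concreteness -- dict mapping each word to a concreteness score (hypothetical)
    mode         -- "sum" or "difference" (absolute difference is an assumption)
    """
    if mode == "sum":
        def key(pair):
            return concreteness[pair[0]] + concreteness[pair[1]]
    elif mode == "difference":
        def key(pair):
            return abs(concreteness[pair[0]] - concreteness[pair[1]])
    else:
        raise ValueError("mode must be 'sum' or 'difference'")

    ordered = sorted(pairs, key=key)

    # Cut the ordered list into n_splits contiguous, near-equal chunks.
    size, rest = divmod(len(ordered), n_splits)
    splits, start = [], 0
    for k in range(n_splits):
        end = start + size + (1 if k < rest else 0)
        splits.append(ordered[start:end])
        start = end
    return splits
```

Each resulting split could then be scored separately, which gives one value per split along the concreteness ordering.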
Appendix C
EmbEval Toolkit
The code we used to generate the results in this work is openly available¹. It
performs a general evaluation of word embeddings, which we used in Chapters
4, 5 and 6.
The code base loads several embedding models, generates multi-modal embeddings
and runs all the evaluations on the semantic similarity and relatedness
datasets as well as the brain datasets.
The software can also be used to generate the tables and visualisations of
results, as well as visualisations of embedding spaces. Details on its usage can
be found in the documentation².
¹ https://github.com/anitavero/embeval
² https://anitavero.github.io/embeval/
Appendix D
Cluster Structure
WordNet label Own label Members
food
nutriment
foodstuff
food butter, cheese, bread, chicken, soup, sauce,
dessert, beef, salad, meat, cake, steak,
tomato, potato, pizza, flour, milk, meal, vine-
gar, bacon, pie, cooking, sushi, sandwich,
breakfast, burger, menu
vascular plant
plant organ
plant part
plants flower, flowers, tree, blossom, dandelion, fo-
liage, fruit, weed, cactus, lily, bloom, shade,
leaf, grass, sunflower, poppy, vine, plant, gar-
den, iris, grow, daisy, oak, bulb, rust, herb,
moss, tulip, palm, maple, root, tall, bush,
seed, family
atmospheric phenomenon
physical phenomenon
change
weather rain, snow, fog, weather, mist, drizzle, frost,
dew, cold, wet, wind, smoke, sunlight, misty,
sunrise, winter, storm, sunset, haze, sun-
shine, fire, spring, dusk, autumn, heavy, at-
mosphere, cloud, sunny, burn, flood, desert,
sun, hot, ice, tropical
food
beverage
produce
sweets
alcohol
tobacco
“legal drugs”
coffee, lemon, candy, juice, chocolate, sugar,
strawberry, honey, tea, beer, bottle, bean,
banana, cocktail, whiskey, pumpkin, bev-
erage, pepper, cereal, brandy, sweet, wine,
tobacco, mug, cherry, donut, nuts, liquor,
berry, rice, mustard, cigar, cigarette, alcohol,
raspberry, champagne, pot, apple, peel
substance
material
artifact
material –
farm
animals
cow, wool, charcoal, sheep, cattle, food, ani-
mal, wood, goat, wheat, sand, animals, salt,
water, timber, fish, mud, straw, cotton, cop-
per, washing, oil, ox, iron, lamb, fresh, abun-
dance, fur, coal, fishing, exotic, dye, ceramic,
camel, pollution, tin, licking, smoking, diet,
vitamin
artifact
covering
clothing
clothing /
fashion
wig, clothes, dress, shoes, jacket, sweater,
skirt, sunglasses, leather, hair, costume,
shirt, haircut, cloth, socks, waist, man-
nequin, collar, jewelry, tattoo, lingerie,
beard, blonde, mask, fabric, uniform, neck-
lace, linen, outfit, glove, hat, fashion, blan-
ket, bikini, knitting, swimsuit, crochet,
badge, coat, carpet, bracelet, arms, makeup
artifact
structure
whole
classical
architecture
tower, building, marble, staircase, fountain,
doorway, roof, chapel, steeple, porch, ceiling,
mural, glass, wall, brick, statue, stone, arch,
monument, dome, window, gravestone, sculp-
ture, aisle, tiles, gate, interior, painted, dec-
oration, concrete, church, graveyard, cathe-
dral, curtain, painting, palace, clock, grave,
portrait, choir, architecture, pyramid, memo-
rial, square, castle, skyscraper, museum,
cemetery, temple, organ
change
color
visual property
colour /
decor
blue, bright, green, pink, black, yellow, dark,
white, purple, red, brown, violet, rainbow,
colour, orange, sky, rusty, silhouette, grey, di-
amond, redhead, light, flame, peacock, mir-
ror, color, tiny, shadow, stripes, dull, rose,
neon, colorful, crystal, bell, moon, horizon,
arrow, silver, ivy, gold, swan, dragon, lantern,
star, pearl, horn, ray, fox, globe, planet, bold,
belt
body part
part
artifact
body parts skin, spine, neck, bone, chest, throat, shoul-
der, wrist, stomach, ear, jaw, cheek, lips,
nose, eyes, eye, limb, toe, belly, skull, ab-
domen, finger, teeth, elbow, cord, whiskers,
knee, thumb, tooth, muscle, ankle, tail, paws,
lip, brain, flesh, leg, body, calf, heart, blood,
tongue, brow, pain, tear, blade, mouth, liver,
gut, arm, marrow, curled, canine, feathers,
foot, vein, hip, cancer
attribute
whole
artifact
measures &
Misc
flexible, reflection, pattern, sharp, ripples,
large, elastic, normal, angle, object, spiral,
fragile, dense, different, relaxed, frame,
strong, fast, target, small, bottom, wave,
long, rough, illusion, cone, narrow, texture,
pair, noise, curve, bubble, depth, droplets,
display, footprint, condition, wide, sphere, re-
duce, hole, blurred, lamp, short, shell, rapid,
medium, plate, size, lens, instrument, feet,
helium, chain, meter, inch, cell, adult, for-
mula, males
artifact
instrumentality
move
objects bag, cardboard, bucket, wire, hand, nail, pen-
cil, hanging, rope, skateboard, knife, garbage,
splash, button, scratch, pipe, ink, dripping,
dirty, boot, spoon, drawer, hard, dirt, cage,
suds, miniature, box, puddle, graffiti, hang,
drum, jar, swing, metal, collage, pin, pil-
low, tough, rock, surf, cradle, vintage, sten-
cil, origami, keyboard, disc, rod, big, rattle,
racket, ipod, vinyl, lego, surfers, odd, basket,
tag, van, mac
person
organism
bird
animals bird, cat, squirrel, owl, rabbit, dog, birds,
parrot, zebra, giraffe, stork, duck, goose, pelican,
deer, elephant, rat, snake, eagle, pigeon,
hamster, wolf, cheetah, hawk, mallard,
crab, poodle, chipmunk, frog, flamingo,
mouse, tiger, pets, crow, whale, gull, wild, in-
sect, feline, prey, hummingbird, hound, pug,
lion, panda, pet, lizard, bee, ant, dragonfly,
nest, zoo, jellyfish, hen, seagull, spider, wasp,
terrier, aquarium, butterfly
structure
artifact
area
room kitchen, room, bedroom, bathroom, garage,
shop, cafe, motel, cellar, diner, closet, hall-
way, cottage, hotel, sidewalk, restaurant,
barn, house, apartment, door, pub, alley,
stairs, sofa, patio, bed, floor, couch, cabin,
bakery, store, booth, crib, dinner, desk, fur-
niture, hut, parking, fence, inn, pool, corner,
shelter, hall, farm, lawn, street, shed, bar,
mill, lab, windmill, sitting, office, hospital,
log, classroom, shopping, supper, bath, jail,
lunch, theatre, yard
person
organism
causal agent
social roles:
family members
& professions
father, friend, mother, lover, uncle, wife,
daughter, lawyer, woman, brother, teacher,
son, child, nurse, nephew, banker, sol-
dier, couple, maid, gentleman, husband, au-
thor, bride, doctor, priest, wedding, part-
ner, photographer, worker, actor, lady, cap-
tain, employee, sailor, groom, appointment,
leader, student, king, secretary, scientist,
singer, queen, guardian, professor, president,
princess, actress, justice, children, instructor,
monk, prince, birthday, maker, sheriff,
bishop, manager, mayor, companion, chair,
minister, politician, boxer, age, pupil, saint,
jean, rabbi
object
artifact
physical entity
places shore, corridor, trail, bridge, road, harbour,
river, tunnel, area, park, beach, pond, val-
ley, lake, hill, ledge, city, railroad, island,
highway, harbor, rail, downtown, seashore,
canyon, west, canal, border, coast, north,
town, mountain, pier, path, traffic, bay,
ocean, cliff, forest, swamp, port, abandoned,
skyline, stream, line, south, boundary, water-
fall, station, loop, sea, railway, construction,
boardwalk, scenery, reef, branch, lighthouse,
demolition, landscape, underground, airport,
zone, urban, metro, region, capital, gauge,
village, population
instrumentality
travel
vehicle
transportation vehicle, airplane, truck, car, elevator, auto-
mobile, aircraft, cab, carriage, bike, jet, chop-
per, scooter, balloon, bicycle, pilot, deck,
train, wagon, gasoline, motorcycle, plane,
craft, machine, engine, boat, taxi, cannon,
crane, tank, escalator, mechanic, ship, hose,
driver, steel, rocket, container, gun, safety,
auto, motor, explosion, flying, factory, air,
flight, camera, appliance, accident, drive,
aluminum, telephone, bus, underwater, light-
ing, vessel, aerial, phone, emergency, ford,
exit, subway, company, police, pod, tram, in-
dustrial, asphalt, wing
change
act
be
verbs bring, get, come, want, go, keep, take, know,
find, say, give, make, understand, put, lis-
ten, enjoy, feel, leave, think, learn, imag-
ine, gather, believe, fail, arrange, add, lose,
create, way, hear, send, meet, collect, carry,
avoid, buy, remain, allow, appear, might, en-
ter, arrive, seem, entertain, break, steal, re-
ceive, stop, stand, build, locked, compare, re-
tain, sell, handle, danger, eat, wander, face,
unhappy, protect, please, pray, become, walk,
expand, travel, plenty, greet, inspect, com-
fort, huge, possess, dominate, attach, roam,
participate, speak, step, drawn, construct, re-
place, divide, great, living
person
organism
causal agent
art /
entertainment
smile, fun, happy, love, girl, kid, kids, boy,
baby, dad, mom, kiss, dude, friends, funny,
man, joy, angel, beautiful, christmas, cute,
movie, night, spirit, beast, bunny, mad, sing,
puppy, monster, soul, zombie, song, devil,
dance, kitty, guy, bunch, happiness, snow-
man, show, holiday, buddy, music, rest-
less, theme, sketch, nice, boys, dead, clown,
young, quest, girls, vacation, celebration,
emotion, carnival, dreary, dawn, bad, cop,
sleep, journey, concert, pride, hero, evening,
story, demon, sad, morning, warrior, jazz,
band, guest, film, god, piano, punk, doodle,
guitar, tv, television, husky, violin, festival,
female
travel
act
group
sport time, day, year, second, course, run, win,
game, home, sports, ball, trip, season, week,
country, match, track, dropped, club, pa-
rade, trick, world, crowd, august, month,
horse, winner, swimming, field, football, left,
men, triumph, women, gymnastics, basket-
ball, bench, table, racing, round, jump,
outdoor, cup, top, swim, race, side, base-
ball, sailing, opponent, champion, goal, held,
school, trial, played, camp, cross, flag, bowl,
summer, rally, squad, head, old, ceremony,
military, hockey, exhibition, skating, state,
bull, college, purse, army, pole, stadium, ski,
chess, navy, minute, class, posted, skate, an-
chor, colt, seat, stud, turkey, santa, mare
abstraction
communication
act
writing /
Misc
fact, discussion, work, idea, read, sense,
quote, manner, words, conversation, infor-
mation, book, picture, value, image, reader,
view, person, advertisement, paper, vision,
impression, communication, nature, phrase,
page, paragraph, proof, article, interest, job,
definition, money, abstract, poster, formal,
wisdom, reading, skill, choice, attention, lit-
erature, letter, handwriting, art, business,
smart, awareness, confidence, word, key,
design, new, essential, model, date, com-
puter, action, collection, payment, note, law,
graphic, figure, bible, library, protest, task,
news, violent, chapter, umbrella, movement,
dollar, magazine, symbol, photography, mod-
ern, newspaper, web, activity, circle, number,
people, peace, market, map, self, card, code,
psychology, text, right, parent, dictionary, or-
der, party, language, journal, written, tax,
style, era, calendar, cent, ad, ancient
Table D.1: Members of the 20 clusters in EL. Clusters are ordered by size.
WordNet label Own label Members
base
layer
flatware
plate plate
lick
cream
beating
licking licking
communication
promotion
message
ad ad, advertisement
change
passage
tube
pipe rust, pipe, hose, tank, graffiti, chain
artifact
line
whole
train railway, railroad, subway, curve, tunnel, run, shelter,
train, station, tram, highway, track, rail, way, engine,
stop, gate, bridge, smoke
structure
area
room
room classroom, hallway, hall, closet, bedroom, room, bathroom,
garage, office, cafe, museum, doorway, kitchen,
shop, restaurant, store, mannequin, stadium, market,
ceiling, corner
bird
vertebrate
person
animals hummingbird, gull, peacock, hawk, pelican, crow, par-
rot, seagull, wing, swan, pigeon, owl, goose, flamingo,
nest, eagle, tail, bird, silhouette, duck, chest, body,
ledge, giraffe, zebra
travel
wheeled vehicle
self-propelled vehicle
vehicles cab, car, taxi, police, vehicle, automobile, drive, rac-
ing, scooter, bike, van, street, road, motorcycle, truck,
speak, wagon, bus, parade, drawn, asphalt, cop, parking,
bicycle, sidewalk, traffic, driver, carriage, meter
plant organ
plant
vascular plant
plants bloom, foliage, grave, dead, vine, blossom, ivy, pod,
cactus, tree, moss, root, leave, limb, forest, bush,
plant, lily, branch, weed, leaf, vein, sunshine, log,
fence, flower, sunlight, wood, palm, bench, sun
structure
artifact
whole
building
parts
chapel, cottage, steeple, castle, dome, story, cathe-
dral, build, skyscraper, arch, lighthouse, apartment,
hut, angel, shed, hotel, monument, window, staircase,
home, cabin, house, roof, porch, tower, sculpture, pa-
tio, bell, deck, brick, church, cross, clock, step, statue
instrumentality
container
substance
vessel champagne, tea, beverage, alcohol, honey, milk, pencil,
tulip, juice, oil, bakery, ceramic, container, coffee,
tin, cup, beer, sunflower, daisy, wine, rose, marble,
bowl, sweet, maker, jar, vessel, mug, money, bottle,
pumpkin, straw, glass, basket, box, pot, bucket, bunch
body part
artifact
part
pets &
body parts
jaw, throat, pupil, cheek, canine, belly, brow, mouth,
stomach, tongue, eye, nose, poodle, ear, hamster, lip,
fur, tooth, teeth, pet, leg, wool, head, feline, toe,
panda, smile, neck, face, beard, puppy, collar, horn,
skin, cat, kitty, calf, nail, dog, tag, mother
physical entity
body of water
thing
water rapid, village, coast, bay, mist, horizon, canal, skyline,
valley, sea, cliff, fog, town, waterfall, stream, water,
sunset, pier, harbor, boardwalk, break, ocean, lake,
fountain, shore, island, river, wave, splash, city, rock,
ship, building, sand, hill, crane, mountain, beach,
pond, surf, boat, pool
location
artifact
region
farm
animal
dandelion, boundary, grass, wild, deer, stork, field,
mud, farm, windmill, garden, landscape, desert, cat-
tle, dirt, area, barn, yard, zoo, ox, path, footprint,
garbage, puddle, lawn, cow, sheep, concrete, snow,
eat, lamb, goat, stone, cone, trail, rain, day, park,
animal, cage, horse, bull, elephant
change
color
visual property
colors bright, beautiful, big, dirty, small, colorful, grey, long,
purple, dark, round, men, tiny, pink, eyes, painted,
brown, gold, medium, white, hang, iron, silver, old,
black, left, tall, red, safety, large, metal, blue, steel,
yellow, leather, hanging, make, walk, green, right,
color, bath, pair, washing, sitting, carry
food
produce
solid
food drizzle, nuts, herb, beef, flour, season, cereal, cherry,
breakfast, sugar, steak, bacon, burger, butter, rice,
meat, meal, sauce, dinner, pie, raspberry, lunch,
sushi, bean, mustard, pepper, seed, salt, soup, cheese,
tomato, hot, berry, potato, dessert, strawberry, salad,
cardboard, food, bone, lemon, burn, frost, chocolate,
bread, turkey, sandwich, spoon, pizza, chicken, shell,
candy, peel, cooking, bubble, knife, fruit, fish, donut,
cake, apple, ice, banana, orange
artifact
whole
instrumentality
furnishing crochet, calendar, linen, map, painting, work, frog,
skull, note, code, stud, lantern, art, telephone, scratch,
furniture, information, collection, menu, ipod, page,
table, mural, piano, spring, movie, magazine, poster,
cell, spine, portrait, appliance, desk, paper, graphic,
frame, bed, date, crib, pattern, text, picture, card,
globe, butterfly, wall, pillow, fabric, cord, sofa, carpet,
guitar, square, cloth, image, tv, book, heart, lamp,
star, television, blanket, couch, newspaper, night, dec-
oration, mirror, time, computer, design, keyboard,
word, mouse, border, drawer, floor, button, chair, key,
display, curtain, reading
person
artifact
covering
people fun, nurse, lingerie, violin, jewelry, makeup, haircut,
cigar, wig, monk, instructor, santa, pug, brother,
doctor, dad, terrier, huge, parent, scientist, gentle-
man, bikini, pearl, badge, bracelet, shirt, swimsuit,
sweater, jean, costume, hip, jacket, sleep, daughter,
mom, short, skirt, snowman, hat, man, muscle, instru-
ment, necklace, young, basketball, wrist, hair, smok-
ing, glove, outfit, music, coat, rabbit, pets, woman,
band, football, father, dude, boot, hand, elbow, tat-
too, arm, ankle, soldier, lab, waist, clown, dress, belt,
racket, blonde, bunny, uniform, loop, lens, friend,
cigarette, held, finger, girl, photographer, purse, per-
son, knee, pin, boy, female, trick, thumb, guy, mask,
foot, son, swing, clothes, lady, bride, skate, squirrel,
bag, phone, disc, ski, tiger, child, groom, adult, shoul-
der, student, kid, camera, skateboard, baseball, ball,
baby
change
act
artifact
Misc pain, downtown, capital, condition, theatre, motel,
cemetery, elevator, journey, class, zone, captain, coal,
military, navy, school, craft, gauge, texture, exit,
storm, language, moon, company, create, club, an-
chor, country, construction, meet, rainbow, weather,
port, alley, hospital, party, take, flight, pilot, dragon,
booth, interior, business, race, sky, library, drum,
sunny, door, motor, employee, light, model, hen, bulb,
goal, gun, wind, cloud, diner, pole, aircraft, course,
fox, rod, skating, letter, jump, show, written, flame,
symbol, reflection, plane, shadow, object, diamond,
airport, ray, circle, line, airplane, swimming, bottom,
arrow, flag, crowd, balloon, top, number, aquarium,
fire, flying, seat, side, stand, figure, air, handle, game,
winter, view, match, blade, bar, machine, family, wire,
lion, hole, people, shade, worker, jet, rope, umbrella,
couple
person
change
organism
Misc ant, news, jellyfish, protest, add, imagine, inn, journal,
liver, essential, marrow, rattle, arrange, wasp, para-
graph, brandy, fact, aerial, devil, unhappy, emotion,
chipmunk, god, oak, explosion, prey, proof, vision, ac-
tivity, chess, movement, danger, gasoline, secretary,
jazz, song, send, mayor, tobacco, soul, urban, violent,
quote, demon, replace, fragile, manner, misty, receive,
ancient, flowers, skill, reef, ripples, rally, living, diet,
sketch, awareness, illusion, pollution, abstract, value,
wisdom, squad, remain, arrive, saint, trial, impres-
sion, avoid, vinyl, minister, maid, concert, believe, jail,
learn, please, politician, great, guardian, population,
holiday, cancer, psychology, become, college, demoli-
tion, payment, brain, army, rabbi, lawyer, literature,
prince, task, tropical, bring, lover, bold, inch, interest,
companion, exhibition, leader, noise, actor, underwa-
ter, supper, communication, helium, sense, happiness,
win, sad, gymnastics, entertain, champion, banker,
odd, conversation, planet, dawn, dense, camp, law,
locked, pray, lose, plenty, abundance, fail, mallard,
vacation, chapter, dreary, warrior, origami, might,
joy, timber, choice, underground, depth, stencil, for-
mula, friends, allow, retain, participate, understand,
paws, mad, pride, stairs, wander, comfort, theme,
give, nephew, reduce, funny, bad, idea, droplets, age,
..., surfers
Table D.2: Members of the 20 clusters in ES. Clusters are ordered by size.
WordNet label Own label Members
bird
aquatic bird
seabird
birds seagull, gull, goose, duck, pelican, swan, mallard,
stork, eagle, flamingo
furnishing
furniture
instrumentality
furnishing furniture, stand, booth, desk, modern, display, bed,
chair, container, door, appliance, drawer, sofa, cur-
tain, couch, bench, crib, frame, box, table, tv, win-
dow, computer, cradle, television, mac
instrumentality
artifact
device
objects inspect, protect, collar, find, skateboard, gasoline,
heavy, key, belt, steal, instrument, hang, justice,
glove, handle, knife, scooter, horn, shoes, pipe,
bone, telephone, mouse, bag, hat, spoon, guitar,
gun, colt, purse, drum, iron, boot, violin, spine,
umbrella, sunglasses
instrumentality
self-propelled vehicle
wheeled vehicle
car
related
accident, cord, vehicle, auto, automobile, skate,
photography, truck, race, arrive, ford, chopper, cab,
rally, seat, industrial, smart, mechanic, racing, car,
demolition, triumph, construction, motorcycle, ma-
chine, taxi, engine, driver, crane, carriage, van, bus,
cannon, motor, tank, hockey, wagon, camera
person
organism
causal agent
“female
topics”
woman, model, brandy, pink, actress, lady, girl,
young, wife, tiny, haircut, blonde, women, girls,
hot, mother, hair, portrait, body, makeup, cheek,
wig, neck, muscle, chest, lingerie, waist, redhead,
child, face, bride, belly, bikini, kid, swimsuit, baby,
brow, skirt, dress, short
instrumentality
artifact
device
metals &
writing
object, aluminum, journal, author, capital, lawyer,
step, cardboard, law, silver, elastic, bible, writ-
ten, book, tin, literature, chocolate, wire, money,
cigarette, stud, steel, payment, glass, charcoal,
blanket, gold, newspaper, page, cigar, appoint-
ment, brick, butter, pencil, mirror, log, phone,
ipod, match, pillow, rod, piano, keyboard
vascular plant
plant
grow
plants weed, bunch, maple, cancer, iris, poppy, dande-
lion, leave, flower, rose, foliage, grow, plant, cactus,
spring, tulip, ivy, palm, lily, leaf, daisy, tree, root,
wheat, wool, raspberry, tobacco, flowers, blossom,
butterfly, sunflower, cotton, herb, violet, oak, moss,
strawberry, nest, dew, berry, rice, branch, coal
food
nutriment
substance
food sushi, meal, sandwich, pie, breakfast, lunch, food,
supper, flour, cereal, sweet, dessert, dinner, subway,
diet, cake, date, steak, sauce, bread, copper, nuts,
bacon, cooking, beef, meat, bakery, knitting, eat,
potato, salad, donut, pizza, burger, coffee, soup,
bean, cheese, vitamin, fruit, pumpkin, rock, mar-
row, market, timber
artifact
change
cover
colours &
materials
texture, fabric, cloth, metal, rain, concrete, pa-
per, suds, rough, words, stone, wall, square, dense,
leather, quote, wood, frost, mud, noise, text, pur-
ple, carpet, blue, tiles, dirt, droplets, red, sand, fog,
formula, mist, pattern, handwriting, green, straw,
linen, asphalt, stripes, crowd, marble, yellow, black,
brown, grey, grass, white
body part
artifact
part
body parts gut, throat, wrist, burn, ear, thumb, elbow, lis-
ten, shoulder, liver, pain, knee, arms, hand, toe,
finger, give, tongue, limb, abdomen, jaw, receive,
nail, arm, feet, hear, skin, washing, head, ankle,
hip, teeth, tear, stomach, brain, foot, lip, mouth,
leg, flesh, mask, eyes, nose, skull, eye, socks, lips
structure
artifact
area
room museum, garage, hall, classroom, kitchen, cellar,
interior, o ce, diner, decoration, exhibition, ho-
tel, ceiling, restaurant, store, bathroom, trial, pub,
class, closet, cafe, room, porch, stairs, deck, hospi-
tal, living, corridor, aisle, bar, staircase, doorway,
hallway, chapel, floor, lab, station, bedroom, gate,
elevator, theatre, escalator, tunnel, organ, alley, li-
brary, jail, tram
artifact
whole
instrumentality
fruit, drinks &
sport
compare, sad, ceramic, tea, rattle, honey, mus-
tard, weather, champagne, pearl, button, wine,
sugar, peel, pepper, jewelry, milk, orange, balloon,
bulb, lemon, beer, cocktail, salt, beverage, sphere,
juice, sports, planet, sun, whiskey, lantern, world,
cup, football, pin, diamond, banana, basket, cherry,
cent, basketball, globe, ripples, vinegar, pot, bottle,
jar, tomato, baseball, plate, bucket, bowl, bubble,
mug, ball, moon
travel
change
object
vacation island, view, reflection, harbor, nice, side, sea,
summer, tropical, pollution, port, aircraft, pier,
travel, surfers, journey, sunny, coast, flying, morn-
ing, ocean, seashore, horizon, mare, holiday, lake,
surf, shore, vacation, bay, airport, cliff, sunlight,
air, river, storm, ship, fishing, beach, desert, har-
bour, puddle, flight, sailing, evening, sunrise, sky-
line, vessel, lighthouse, dawn, sunset, rocket, moun-
tain, whale, underwater, boat, swimming, swim,
plane, dusk, jet, cloud, sky, airplane, ski
change
abstraction
state
festival theme, wisdom, soul, image, possess, large, con-
fidence, happiness, beautiful, joy, love, ceremony,
festival, movement, abundance, dead, depth, cele-
bration, lover, run, demon, blurred, pray, happy,
remain, wet, dance, navy, family, carnival, angel,
sculpture, ray, dragon, drive, atmosphere, night,
shadow, band, god, believe, party, dark, hanging,
abstract, show, christmas, monster, devil, jump,
lighting, sunshine, warrior, painting, water, aquar-
ium, zombie, concert, haze, crystal, statue, explo-
sion, jazz, jellyfish, wave, bright, rainbow, ice, light,
smoke, club, neon, colorful, hole, protest, autumn,
rust, reef, flame, fire
person
organism
causal agent
animals animals, animal, picture, painted, zoo, turkey,
curled, goat, companion, pets, canine, pet, prey,
relaxed, horse, spirit, tail, dog, chipmunk, squirrel,
pigeon, fox, cute, please, sheep, owl, birds, military,
giraffe, lion, lamb, bee, insect, hamster, hawk, licking,
bird, cat, puppy, feline, terrier, deer, calf, rat,
chicken, camel, dragonfly, whiskers, poodle, cow,
hound, cattle, lizard, fish, bunny, crow, wolf, tiger,
parrot, zebra, cheetah, fur, panda, bull, wasp, ox,
hen, frog, crab, snake, boxer, hummingbird, rabbit,
elephant, pupil, husky, peacock, spider, pug, ant
change
abstraction
travel
Misc think, condition, understand, know, meet, sing,
symbol, bring, speak, awareness, say, strong, sense,
music, song, come, stencil, badge, loop, avoid,
long, tag, idea, feel, bell, helium, guest, held,
heart, proof, film, tall, information, oil, meter, an-
chor, female, drawn, flexible, smile, peace, break,
note, paragraph, figure, attach, gauge, apple, wan-
der, kitty, paws, silhouette, footprint, hose, locked,
vinyl, corner, round, divide, curve, cross, target,
wing, lens, necklace, tooth, border, rope, lamp,
bracelet, minute, north, time, illusion, cone, swing,
racket, angle, circle, chain, clock, bike, bicycle,
pole, spiral
person
organism
causal agent
people monk, manager, student, males, banker, instruc-
tor, parent, politician, minister, worker, adult, pro-
fessor, played, employee, pilot, bottom, husband,
style, uncle, business, men, boys, son, captain,
dude, teacher, man, mayor, top, beard, dad, boy,
retain, cop, fail, uniform, outfit, company, priest,
nurse, daughter, maid, opponent, father, scientist,
police, children, sailor, friends, beast, restless, sitting,
kids, old, bishop, prince, punk, costume, people,
tattoo, groom, president, couple, blade, secretary,
saint, sheriff, singer, mad, walk, pod, doctor,
photographer, guy, skating, person, formal, bush,
actor, gentleman, rabbi, queen, sleep, funny, sol-
dier, jacket, sweater, coat, shirt, jean
structure
artifact
whole
landmark village, mill, cemetery, country, graveyard, board-
walk, bath, memorial, outdoor, wide, ancient, tem-
ple, inn, path, town, abandoned, windmill, land-
scape, canal, downtown, trip, cottage, scenery, ar-
chitecture, farm, patio, roam, palace, camp, drizzle,
factory, monument, road, apartment, street, shel-
ter, nature, tower, grave, wind, fountain, season,
way, flood, castle, barn, exotic, city, cabin, shade,
school, aerial, arch, ledge, garbage, motel, railroad,
railway, hill, house, bridge, highway, dreary, gar-
den, train, dome, trail, day, church, winter, urban,
parade, home, waterfall, dull, canyon, traffic, cathedral,
building, yard, skyscraper, steeple, pool, rail,
wild, stadium, forest, mural, pyramid, track, park,
field, hut, pond, roof, shed, fence, sidewalk, stream,
valley, snow, swamp, lawn
change
act
artifact
Misc learn, seem, course, dropped, reading, gather, cre-
ate, reader, impression, might, champion, partner,
advertisement, friend, hard, dye, comfort, trick, vi-
sion, construct, craft, small, goal, violent, poster,
movie, conversation, participate, communication,
read, population, huge, smoking, discussion, under-
ground, tough, become, build, carry, leader, col-
lege, pair, tax, fashion, fast, graphic, misty, minia-
ture, odd, big, imagine, cold, collage, shopping,
shop, gra ti, magazine, color, dirty, choir, ink,
unhappy, di↵erent, vintage, wedding, king, seed,
arrange, psychology, kiss, birthday, cell, plenty,
bloom, princess, boundary, lego, snowman, crochet,
sketch, gymnastics, emotion, santa, art, origami,
clown, narrow, mannequin, army, chess, rusty,
blood, collection, dripping, cage, colour, clothes, al-
cohol, liquor, candy, flag, age, metro, dollar, grave-
stone, feathers, map
act
change
abstraction
Misc activity, great, put, replace, lose, want, order,
buy, allow, august, reduce, south, essential, keep,
posted, bold, pride, fun, west, game, job, action,
safety, buddy, story, entertain, get, week, maker,
collect, skill, language, fact, normal, interest, hero,
value, work, bad, self, attention, brother, greet,
chapter, danger, appear, nephew, ad, size, medium,
year, dominate, enjoy, era, task, mom, emergency,
sell, news, go, zone, guardian, send, take, left, sec-
ond, choice, word, card, web, quest, add, make,
phrase, dictionary, sharp, winner, line, scratch, ar-
row, vein, number, shell, splash, parking, enter,
rapid, disc, new, right, win, stop, manner, fresh,
calendar, squad, month, vine, exit, fragile, region,
article, expand, menu, design, area, state, inch, def-
inition, doodle, code, letter, star
Table D.3: Members of the 20 clusters in EV. Clusters are ordered by size.
WordNet label Own label Members
baby
organism
work
baby baby
device
weapon
hurt
knife knife
area
communication
mark
footprint footprint
atmosphere
condition
obscure
sky cloud, sky
line
brandish
gesticulate
ocean wave, ocean
artifact
animal tissue
implementation
teeth tooth, teeth
way
road
artifact
road road, street, highway
organism
animal
bad person
animal fox, hen, game
substance
food
grass
food cereal, soup, oil
nonvascular organism
moss
bryophyte
alpine plant moss, ivy, cliff
aircraft
craft
airplane
airplane aircraft, airplane, jet, plane
instrumentality
device
artifact
computer keyboard, mouse, computer, key
food
beverage
substance
drink beverage, wine, beer, juice
body part
process
part
body parts ear, head, eye, horn, tail
instrumentality
artifact
substance
pottery ceramic, tin, pencil, marble, hot
bird
vertebrate
artifact
flying animal parrot, limb, hummingbird, hawk, owl, dragon,
squirrel, branch, butterfly
bird
aquatic bird
seabird
bird gull, seagull, pelican, swan, peacock, crow, pi-
geon, goose, flamingo, wing, bird, duck, eagle
thing
body of water
physical entity
water bay, canal, harbor, water, lake, sea, pier, river,
ship, pond, shore, boat, splash, pool
change
move
visual property
body, color left, long, small, big, muscle, purple, pink, right,
washing, green, pair, color, sitting, palm
food
fruit
change
desserts nuts, sugar, cherry, frost, chocolate, raspberry,
flour, dessert, butter, pie, strawberry, candy,
lemon, ice, donut, cake
group
event
act
event party, parade, crowd, booth, race, cafe, stadium,
show, family, restaurant, match, people, market,
stand, park, airport, student, couple
change
color
visual property
visual property bright, grey, dark, round, painted, white, gold,
silver, black, red, old, brown, blue, tall, yellow,
metal, large, hanging
object
structure
artifact
landscape horizon, skyline, fog, valley, sunset, town,
skyscraper, waterfall, moon, lighthouse, stream,
city, building, castle, island, fountain, mountain,
crane, hill
container
instrumentality
measure
drink, vessel tea, champagne, alcohol, honey, milk, coffee, cup,
container, salt, bowl, mug, maker, spoon, jar,
bottle, money, vessel, straw, diner, glass, bucket,
basket, pot, bubble
artifact
whole
furnishing
furnishing, pet linen, sleep, furniture, blanket, bed, spring, crib,
pillow, carpet, couch, pattern, sofa, feline, fab-
ric, bunny, cloth, piano, floor, chair, square, cat,
leather, chest, patio, kitty, button
clothing
covering
consumer goods
clothing wig, instructor, jacket, bikini, costume, badge,
sweater, shirt, swimsuit, outfit, gentleman, skirt,
short, jean, boot, hat, coat, dude, dress, glove,
uniform, clothes, soldier, belt, mask, cop, pin,
ski
reproductive structure
plant organ
vascular plant
plants pod, bloom, tulip, daisy, cactus, sunflower, berry,
blossom, sweet, rose, lily, vine, tiny, root, vein,
pumpkin, garden, flower, plant, leave, leaf, peel,
fruit, bunch, desert, banana, orange, apple
artifact
part
body part
body parts
house animals
jaw, throat, canine, belly, pupil, cheek, stomach,
hamster, tongue, poodle, mouth, nose, fur, pet,
lip, leg, wool, panda, toe, neck, collar, puppy,
skin, licking, body, calf, dog, tag, lamb
food
nutriment
meat
food beef, herb, season, steak, meat, breakfast, ba-
con, burger, rice, meal, sauce, lunch, mustard,
cheese, pepper, dinner, bean, sushi, tomato,
seed, potato, salad, food, bone, sandwich, turkey,
bread, chicken, pizza, cooking, plate, fish
artifact
instrumentality
substance
o ce crochet, calendar, collection, telephone, menu,
note, movie, ipod, appliance, magazine, table,
frog, cardboard, date, desk, paper, hospital,
skull, card, library, box, shell, book, cord, pic-
ture, television, steel, tv, drawer, object, newspa-
per, garbage, night, top, ledge, machine, corner,
display, fire
abstraction
communication
change
communication language, code, information, text, ad, company,
graphic, painting, map, written, exit, mural, let-
ter, word, work, art, scratch, poster, symbol,
heart, advertisement, star, graffiti, image, page,
spine, border, time, arrow, frame, diamond, say,
portrait, number, birthday, design, circle, deco-
ration, reading
structure
artifact
area
building elevator, chapel, hallway, apartment, closet,
garage, hall, window, classroom, bedroom, door-
way, cathedral, door, bathroom, story, interior,
build, museum, cabin, room, arch, mannequin,
shop, o ce, club, staircase, store, hotel, reflec-
tion, kitchen, tunnel, mirror, pilot, house, ceiling,
aquarium, view, curtain, shade, church
artifact
travel
whole
transportation zone, railway, construction, curve, create, taxi,
run, subway, car, cab, drive, automobile, rail-
road, business, parking, alley, shelter, tram, ve-
hicle, stop, asphalt, course, way, light, train, po-
lice, station, bus, rail, gate, van, sidewalk, home,
line, truck, track, concrete, traffic, bridge, cross,
meter, brick
artifact
structure
whole
farm &
wild animals
deer, dandelion, wild, grass, farm, foliage, wind-
mill, field, mud, bush, forest, weed, landscape,
shed, barn, zoo, hut, tree, cattle, area, dirt, fence,
rock, log, goal, ox, yard, cow, sheep, goat, lawn,
eat, animal, giraffe, stone, cage, wood, zebra,
mother, horse, lion, bull, elephant, hole
artifact
body part
instrumentality
body
accessories
cigar, haircut, makeup, brow, pug, hip, bracelet,
wrist, pearl, tattoo, elbow, stud, smile, ankle,
hand, necklace, finger, arm, band, smoking, hair,
snowman, beard, waist, thumb, lens, cigarette,
loop, woman, burn, cell, knee, purse, racket, face,
nail, foot, shoulder, bride, phone, bag, camera,
lady, groom, skateboard
change
travel
object
travel rapid, village, journey, seashore, swamp, the-
atre, mist, storm, scientist, stork, boundary,
sunny, coast, country, boardwalk, sunshine, wet,
weather, break, rainbow, dirty, aisle, flight, rain,
meet, ray, sand, day, puddle, escalator, lab, trail,
beach, path, surf, silhouette, nest, walk, snow,
wind, shadow, sunlight, flying, cone, sun, bal-
loon, umbrella
artifact
instrumentality
device
building
vehicle
pain, capital, minute, gauge, coal, cottage, rust,
lantern, anchor, angel, speak, steeple, motor,
dome, port, iron, pole, globe, rod, pipe, bulb,
engine, hose, bell, model, seat, roof, porch, sculp-
ture, monument, flame, handle, tank, lamp, gun,
flag, bar, chain, wall, deck, bike, side, bottom,
figure, wagon, rope, tower, wire, clock, scooter,
step, blade, motorcycle, bench, bicycle, smoke,
statue, carriage
person
organism
causal agent
people
activities
fun, violin, nurse, brother, lingerie, monk, par-
ent, dad, jewelry, huge, played, santa, doctor,
basketball, terrier, instrument, music, captain,
take, football, man, father, young, daughter,
drum, mom, trick, son, jump, held, pets, men,
blonde, friend, employee, colorful, skating, per-
son, guy, boy, swing, girl, safety, photographer,
racing, swimming, female, clown, disc, skate,
adult, kid, winter, guitar, child, baseball, driver,
ball, air, carry, worker
person
change
organism
Misc rattle, news, song, ant, imagine, send, emotion,
arrange, living, jazz, ripples, inn, god, learn,
please, violent, fragile, marrow, aerial, misty,
inch, unhappy, devil, essential, avoid, squad, to-
bacco, prey, flowers, banker, urban, protest, re-
place, saint, psychology, demon, movement, hol-
iday, rabbi, pollution, mayor, illusion, dense,
entertain, wisdom, underwater, manner, aware-
ness, politician, pray, give, lawyer, become, par-
ticipate, supper, trial, vinyl, law, gymnastics,
droplets, odd, believe, dawn, brain, secretary,
brandy, retain, fail, communication, wasp, in-
terest, gasoline, plenty, concert, helium, noise,
locked, demolition, activity, payment, lose, great,
literature, allow, bring, nephew, abstract, soul,
paws, guardian, win, funny, might, expand,
dreary, lover, tax, friends, skill, jail, put, un-
cle, ancient, joy, tough, tropical, happiness, boys,
population, underground, understand, wander,
stairs, abundance, value, idea, exhibition, can-
cer, choice, males, professor, reduce, mad, depth,
hockey, discussion, flexible, compare, collect, ap-
pointment, exotic, think, seem, confidence, bad,
steal, get, birds, dull, ceremony, abandoned, re-
laxed, sailing, industrial, lips, sunglasses, normal,
surfers
change
person
causal agent
Misc jellyfish, add, fact, journal, proof, paragraph,
oak, liver, impression, danger, chipmunk, explo-
sion, vision, chess, quote, rally, diet, prince, re-
main, receive, minister, sketch, sad, arrive, reef,
task, college, leader, origami, stencil, planet,
maid, champion, bold, chapter, army, actor, mal-
lard, camp, sense, companion, formula, timber,
conversation, warrior, pride, dew, theme, queen,
vacation, comfort, age, self, mare, morning,
redhead, mill, cold, celebration, reader, flood,
phrase, era, cent, evening, zombie, partner, con-
struct, know, violet, cellar, gut, august, manager,
winner, copper, hard, autumn, mechanic, singer,
month, tiles, bishop, poppy, miniature, festival,
justice, attention, spider, blurred, children, lis-
ten, colour, animals, women, carnival, hound,
girls, definition, triumph, hero, kids, peace, vita-
min, week, dusk, dragonfly, job, web, wolf, sun-
rise, go, smart, author, president, quest, auto,
graveyard, heavy, fashion, article, atmosphere,
summer, flesh, restless, gather, emergency, can-
non, suds, north, sell, vinegar, cute, world, pyra-
mid, ford, handwriting, formal, wife, architec-
ture, ..., wedding
Table D.4: Members of the 40 clusters in ES. Clusters are ordered by size.
Figure D.1: Heatmap of Jaccard coefficients between K-means clusters of ES and
EL (y and x axes respectively).
Figure D.2: Heatmap of Jaccard coefficients between K-means clusters of ES and
EV (y and x axes respectively).
Figure D.3: Heatmap of Jaccard coefficients between K-means clusters of EL and
EV (y and x axes respectively).
Figure D.4: Heatmap of Jaccard coefficients between Agglomerative clusters of
ES and EL (y and x axes respectively).
Figure D.5: Heatmap of Jaccard coefficients between Agglomerative clusters of
ES and EV (y and x axes respectively).
Figure D.6: Heatmap of Jaccard coefficients between Agglomerative clusters of
EL and EV (y and x axes respectively).
Figure D.7: Cluster map of Jaccard coefficients between Agglomerative clusters
of ES and EL (y and x axes respectively).
Figure D.8: Cluster map of Jaccard coefficients between Agglomerative clusters
of ES and EV (y and x axes respectively).
Figure D.9: Cluster map of Jaccard coefficients between Agglomerative clusters
of EL and EV (y and x axes respectively).
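As a rough illustration of what each heatmap cell holds, the sketch below computes the Jaccard coefficient |A ∩ B| / |A ∪ B| for every pair of clusters from two clusterings. The function names and the toy clusters are hypothetical and only loosely modelled on the cluster tables in this appendix; this is not the code used to produce the figures.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(rows, cols):
    """rows/cols: dicts mapping a cluster label to its member words.

    Returns one list of coefficients per cluster in `rows` (the y axis),
    with one entry per cluster in `cols` (the x axis).
    """
    return [[jaccard(r, c) for c in cols.values()] for r in rows.values()]


# Toy clusters (illustrative only, heavily truncated).
clusters_y = {"road": ["road", "street", "highway"],
              "sky": ["cloud", "sky"]}
clusters_x = {"landmark": ["road", "street", "castle", "barn"],
              "water": ["lake", "sea"]}

matrix = jaccard_matrix(clusters_y, clusters_x)
for label, row in zip(clusters_y, matrix):
    print(label, row)
```

A cell close to 1 means the two clusters contain nearly the same words, while a cell of 0 means they share none.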
Figure D.10: t-SNE plot of ES with 40 cluster labels obtained by K-means
clustering. t-SNE perplexity = 52.
Appendix E
Mutual Information of Semantic
Spaces
(a) I_HSIC, σ: median, d = 3
(b) I_HSIC, σ: median, d = 11
(c) I_HSIC, σ: median, d = 12
(d) I_HSIC, σ: median, d = 13
(e) I_HSIC, σ: median, d = 50
Figure E.1: Estimated Mutual Informations: I(EL, EV) (red) and I(EL, ES)
(blue) for different corpus sizes.
(a) I_HSIC, σ: median, d = 3
(b) I_HSIC, σ: median, d = 11
(c) I_HSIC, σ: median, d = 12
(d) I_HSIC, σ: median, d = 13
(e) I_HSIC, σ: median, d = 50
Figure E.2: Estimated Mutual Informations: I(EL, EV) (red) and I(EL, ES)
(blue) for different word frequency ranges.
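For orientation, the following is a minimal sketch of the biased empirical HSIC statistic, trace(KHLH) / n², with Gaussian kernels. It assumes that "median" in the panel titles refers to the median heuristic for the kernel bandwidth, which is an assumption on our part. How the statistic is turned into the plotted I_HSIC values, and what d denotes, is not reproduced here. All function names are hypothetical.

```python
import math
from statistics import median


def sq_dists(points):
    """Matrix of squared Euclidean distances between all pairs of vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def gaussian_gram(points):
    """Gaussian kernel matrix; bandwidth set to the median pairwise distance.

    Needs at least two points so that the median is defined.
    """
    d2 = sq_dists(points)
    n = len(points)
    off_diag = [math.sqrt(d2[i][j]) for i in range(n) for j in range(i + 1, n)]
    sigma = median(off_diag) or 1.0  # fall back to 1 if the median is zero
    return [[math.exp(-v / (2 * sigma ** 2)) for v in row] for row in d2]


def center(k):
    """Double-centre a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(k)
    row = [sum(r) / n for r in k]
    col = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[k[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def hsic(xs, ys):
    """Biased empirical HSIC between paired samples xs and ys."""
    n = len(xs)
    kc = center(gaussian_gram(xs))
    lc = center(gaussian_gram(ys))
    # trace(K H L H) equals the elementwise product sum of the centred matrices.
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / n ** 2


xs = [[0.0], [1.0], [2.0], [3.0]]
dependent = hsic(xs, xs)            # a variable compared with itself
constant = hsic(xs, [[5.0]] * 4)    # a constant carries no information
print(dependent, constant)
```

The statistic is zero when one of the two samples is constant and grows with the strength of the dependence; other normalisations, such as 1 / (n - 1)², are also in use.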
Appendix F
Centroid Contexts
Centroid Wikipedia VG
THi2 i2+iQMB+b- Mx+- `2bi`B+iQ`- 7`HHQM-
bm#/m+ib- HB+2Mb2- +`B#`B7Q`K- i2+@
iQMB+- bm#/m+iBM;- 2m`bBM
THi2- HvBM;nQMniQTnQ7- QM- ?b-
QMniQTnQ7- BM
HB+FBM; /BTKBt- mTbmKB/- ǳ2;2`- Kmr2-
+Ȫ- ;Hm+Q`iB+QB/- MK2`m-
b+?H2+F2M- m/?mKH- #Q`K2ix
HB+FBM;- +2HBM;- iQm;m2- iQM;m2- ;B`@
`72- HBQM2bb
/ ?Q+- /BM- HB##2/- pHQ`2K- ?QKBM2K-
HB#b- H+Q`+ƦM- HBi2K- BM}MBimK- HB#@
#BM;
/- Q7n/Bz2`2Mi- `2nMn/- 72@
im`2bnpB2r- i2KTH2iQM- HBMBM;
`mbi 2TB[m2- +`QM`iBmK- QH2mK- +Q?H2-
Q#`Bix#2`;- #HBbi2`- #2Hi- Tm++BMB-
rBM/2t2/- +QHQ`2/
`mbi- biBMbn/QrM- `QmM/nbB/2nQ7-
`mbi2/nQMiQ- QMn}`2- rBi?nnHQi
`BHrv biiBQM- HBM2- KB/HM/- #Mb7- M2`2bi-
;m;2- DmM+iBQM- r2bi2`M- `?2iBM-
biiBQMb
`BHrv- /2i+?- 2H2@
pi2/nQMnTHi7Q`KnQp2`-
Tbb2bnQp2`n- bTHB+2/ni?`Qm;?-
i`/BiBQMH
+Hbb`QQK i2+?2`b- BMbi`m+iBQM- +QHH#Q`Bx2-
2K#B;;2MBM;- ? 쓶 훊- bT+2-
k8djkdkdjek3- #Q``Qr`2/- +Hbb@
#MF- TT`Q+?ěb?2
+Hbb`QQK- /Bb+mbbBM;nBM- biM/@
BM;nBMbB/2- bBiiBM;nBMbB/2- bim/2Mi-
ii2M/BM;
?mKKBM;#B`/ KxBHB- b2HbT?Q`mb- K2HHBbm;- +@
HvTi2- +vMMi?mb- #2`vHHBM2- b+BM@
iBHHMi- Q`i?Q`?vM+mb- 2mT?2`mb-
+?BMM2/
?mKKBM;#B`/- 2inM2+i`n7`QK-
BMn~B;?in#2HQr- ~TTBM;nBib-
~TTBM;- rBM/bTM
+# +miB2- +HHQrv- ?MbQK- Q#`/QB`Q-
/`Bp2`- itB- Kmx2M- bB;MHHBM;- #@
bi`+ib- bmb#/2
+#- QMn?QQ/nQ7- TBMiBM- `/@
BM;- #+FnrBM/QrnQ7nitB- /`Bp@
BM;nbB/2nQ7
#HQQK /QQ`v`/- H;H- bHB2p2- DxKBM2-
?`QH/- pHbiMBF- ~Qr2`b- bKmi?-
irBHH2`#m/b- Q`HM/Q
#HQQK- +?2``vn#HQbbQKni`22nBM-
BMn7mHHnbmKK2`- `Qb2n?bn7mHHv-
#mii2`+mT- in
+?T2H bBbiBM2- ?BHH- 2b2- +`QHBM-
K2i?Q/Bbi- +Hp`v- +?Mi`v- mM+-
KQ`im`v- #`M+++B
+?T2H- +?m`+?- QmibB/2- ?QK2- i`BK-
iQ
+?KT;M2 `/2MM2- Bb?B?BF- #QiiH2- biF2b-
HMbQM- +?HQMb- MQďHH- #2m;`M/-
+`v2mb2- RRdkĜRkRN
+?KT;M2- BBM- BMnnrQKMb-
+`72- +HBM;niQ- r`TT2/n`QmM/
Dr KQQb2- /`QTTBM;Hv- HQr2`- T?Qbbv-
Qbi2QM2+`QbBb- mTT2`- /`QTTBM;-
É2?mǶT- THBMH- rFKQr
Dr- bi`QM;- b2`+?BM;- 7+@
BM;nQTTQbBi2nQ7- Q7- r2`BM;nMQ
`TB/ i`MbBi- BMi2`#Q`Qm;?- ;`Qri?- rB2M-
T`QiQivTBM;- BMi2MbB}+iBQM- #m+m`2șiB-
2tTMbBQM- #mb- BM/mbi`BHBxiBQM
`TB/- TQr2`BM;ni?`Qm;?- KM2m@
p2`ni?`Qm;?- `B/2bnQ7- +`b?BM;nQp2`-
;Q2bnQp2`
/M/2HBQM i`t+mK- #m`/Q+F- +9?9MkQb- +B@
+ǁ`B- BMbm#`B+Q- pHB2;2xrK- xM;mM2-
7`22Hv- Q/mpM+?BF- T`/2#HQ2K
/M/2HBQM- BMn2KTivnbTQinQ7-
;BMbinbQK2- `2n/Bbi`B#mi2/nBM-
KQM;- ;`QrBM;nBM
#`B;?i v2HHQr- +QHQ`b- `2/- Q`M;2- HB;?ib-
+QHQm`b- F2HHB2- bTQib- bmMb?BM2- ;`22M
#`B;?i- HB;?in;`22Mni2M-
HBin`2/n``QrnTQBMiBM;-
bFvn+HQm/vn#mi- bi`22inHB;?i-
ipnb+`22M- v2HHQrnTBMi2/nrQQ/
/`BxxH2 嬱嬱곝- /`xxH2- +?BbT2`- ##m+?-
Qm`/2Hi-    - KB;;2H2M- /KTv- 7x/x
/`BxxH2- n/QMmi- /2+Q@
`i2/nbK2nb- /Q`MBM;- QMin?2-
iQTTBM;
+`Q+?2i v`MiBM2`b- bT2`Ʀ- FQHQb2- 7`227Q`@
KiBQMb- M2irQ`FběbB;MH2/- FMBiiBM;-
}H2i- iiiBM;- K`M/iB
+`Q+?2i- +HBi- M22/H2TQBMi- `2n7Q`- bBi@
iBM;nQniQTnQ7- v`M
7mM TQF2/- TQFBM;- TQF2b- TQF2- HQpBM;- HQi-
HQpBM- vB/Bb?M- 7mM- r2 b2HH
`2n?pBM;- `2n?pBM;n;`2i- 7mM-
7+BM;nrv- THMMBM;- ?pBM;
TBM M2m`QTi?B+- #/QKBMH- +?`QMB+- 2t@
+`m+BiBM;- +?2bi- p2Hvi- Q`Q7+BH-
KvQ7b+BH- br2HHBM;- F?Bv#M
+?QTTvnM2`- rpBM;n7`QK- bBi@
iBM;n#+F- biB+FBM;nmTnQminQ7-
i`p2HnQM
Mi /2+- 7Q`KB+B/- MQTHQH2TBb- `#Q`2H-
#Hm2#H+F- 2iF2MK2M- bQH2MQTbBb- H27@
+mii2`- ;2Mmb
Mi- b?QrBM;ni?`Qm;?- `2~2+iBM;nQz-
`2n#2?BM/- pb2- `2nQM
#mb `Qmi2b- b2`pB+2- i2`KBMH- b2`pB+2b- BM@
i2`+Biv- biQTb- biQT- HBM2b- b?miiH2-
`TB/
#mb- QMn7`QMinQ7- QMnbB/2nQ7- BM-
r2`BM;- HBinmTnQM
;B`z2 ;B`z- `2iB+mHi2/- KbB- ;Q;QHB+F-
2`/K MM+?2M- 7Qm`Ĝ?Q`M2/- ;2`M2Qmb-
;B`z2- KimM/m- [ȕ`B#ȕ
;B`z2- ?b- Q7- QM- r2`BM;- KM
;Hbb biBM2/- rBM/Qrb- K;MB7vBM;- #Q`QbBH@
B+i2- H2/2/- BQMQK2`- rBM/Qr-
K2M;2`B2- TM2b- #2/b
;Hbb- rBM2- r2`BM;- QM- HB[mB/-
?H7n7mHHnQ7
?M/ `B;?i- bH2B;?i- ;`2M/2b- H27i- ?M/-
+`MF2/- ;`2M/2- +HTb- ;HQp2/- mTT2`
?M/- ?QH/BM;- ?2H/nBM- QM- BMnKMb-
KM
rBM/Qr i`Mb72`- QT2MBM;b- i`MbQK- ;Hbb- TH@
H/BM- Q`B2H- HM+2i- bBHHb- bb?- TM2b
rBM/Qr- #mBHinBMiQ- r2`BM;- KM-
QMnbB/2nQ7- QM
THM2 +`b?- T`QD2+iBp2- 7Q+H- 2m+HB/2M-
+`b?2/- ?vT2`#QHB+- BM+HBM2/- bi`H-
i`MbTvHQ`B+- +`b?2b
THM2- ~vBM;nBM- r2`BM;- KM-
QMnbB/2nQ7- Q7
r?Bi2 bQt- #H+F- bmT`2K+Bbi- ?Qmb2- y-
+`2Kv- +QHH`- bmT`2K+Bbib- iBH2/-
bi`BT2b
r?Bi2- +QHQ`2/- ;`QmT2/- `mbiB+-
iBH2nQMn- #HQbb2K
;`bb K``K- `QQib- bTH2M/Qm`- ;ɃMi2`-
KQHBMB- imbbQ+F- `QbKH2M- v2HHQr2v2/-
+Q;QM- imbbQ+Fb
;`bb- 2iBM;- ;`xBM;nQM- biM/@
BM;nBM- ?b- ;`xBM;nBM
i`22 i`mMFb- #MvM- MQBbBHv- #Q/?B- +?`Bbi@
Kb- #22`#Q?K- };- TQ`+mTBM2- 7`Q;-
HBM2/
i`22- ;`QrBM;nQM- #2?BM/- H2p2- QM-
KM
`QQK /BMBM;- HQ+F2`- /`2bbBM;- rBiBM;- #BH@
HB`/- i2KT2`im`2- b+?QQH?Qmb2- #QBH2`-
?Qi2H- `QKT2`
`QQK- BM- BMnQi?2`- QM- BMn+Q`M2`nQ7-
bKBHBM;nBM
ri2` TQHQ- /`BMFBM;- bmTTHv- TQi#H2- 7`2b?-
bMBiiBQM- #`+FBb?- pTQ`- b?HHQr-
bQHm#H2
ri2`- BM- brBKKBM;nBM- QM- ~Qi@
BM;nBM- r/BM;ni?`Qm;?
rHH bi`22i- ?/`BM- +m`iBM- /Q//Ĝ7`MF-
MiQMBM2- `2iBMBM;- ?M;BM;b- #2`HBM-
[B#H- +2HH
rHH- ?M;BM;nQM- ;BMbi- ?b-
?mM;nQM- KM
/Q; #QMxQ- Mm;?iv- ?QmM/- bH2/- K/-
r?2HFb- DmMFv`/- bi`v- b?;;v
/Q;- QM- KM- BM- +?bBM;- ?b
bFv T2`72+ip- K/`2M- #B;- ;Q+?2QF-
bTQ`ib- i;k9- #Hm2- +Qbiěb2- M2rb
BM- bFv- ?M;BM;nBM- ~vBM;nBM- +HQm/-
QM
i`BM FKT?- Tbb2M;2`- r;QM- ?Hib- /2@
`BH2/- biiBQM- 7`2B;?i- 2tT`2bb- bQmi?@
#QmM/- b2`pB+2b
i`BM- QMn7`QMinQ7- BM- r2`BM;- KM-
rBiBM;n7Q`
i#H2 i2MMBb- HBbib- T2`BQ/B+- b?Qrb- 7QHHQr@
BM;- bmKK`Bb2b- bQ`i#H2- bmKK`Bx2b-
HQQFmT- ?b?
i#H2- bBiiBM;ni- QMniQTnQ7- ?b- BM-
i
KM bTB/2`- BbH2- vQmM;- K2;- vǔb?ɟ- #22@
MB2- B`QM- ii- QH/- T+
r2`BM;- QM- KM- BM- r2`b- ?QH/BM;
HBpBM; +QmTH2b- bQK2QM2- HQM2- iQ;2i?2`- R3-
7KBHB2b- TQp2`iv- /vHB;?ib- [m`i2`b-
T2QTH2
HBpBM;- BbHM/- /Q+F2/nM2`- THMi-
?Q`BxQM- iQT
bv ;QQ/#v2- M22/H2bb- Mvi?BM;- v2b- ;QQ/@
#v2b- /`M/2bi- ;Q2b- bQm`+2b- r2Mi-
?2HHQ
bv- iM- ?BHiQM- /QTi- HH2`iQM- 2b@
/2M- KQ/
b2;mHH eyyjj- +?2F?Qp- ~mQtviBM2-      - `+/@
BM- i`2TH2p- 2MRey- ?KiK- F``Fm-
K2`BFDFb
b2;mHH- ~QiBM;nrBi?- Qp2`nM/nBM-
#2bB/2nQ7n- ?bnr2##2/-
#QminiQn/Bp2nBM
7m`MBim`2 mT?QHbi2`2/- MiB[m2- }iiBM;b- biQ`2-
?QK2biQ`2b- /2bB;M2`- KF2`- rB//B@
+QK#- /2TQbBiQ`v- ?``Q/b
7m`MBim`2- bMB{M;nmM/2`- Q++mTvBM;-
?bnn`2~2+iBQMnBM- ?bnb?/Qrb-
HB/n?B//2Mn#v- Ki+?2/nrBi?
BMbT2+i +2`MvkyRk- 2Mb2M/2- 2pbBQMě``Bp2b-
KMBFiQHH- b?mii2`ěi?2- Q7biBM-
bQFQiő- ƓbmT2`pBb2b- ;QQ/bĘ- bF;2``F
BMbT2+i- x2#`- ?QQ7- #`M+?- KM2-
i`BM
++B/2Mi +`- miQKQ#BH2- 7iH- KQiQ`+v+H2- Q+@
+m``2/- #QiBM;- 7`2F- i`{+- i`;B+-
BMp2biB;iBQM
++B/2Mi- b+m``vBM;n`QmM/- `@
`Bp2/ninM- rbnBMnM- r?Bi-
r2`BM;nQM2
rQKM rQM/2`- vQmM;- bmz`;2- }`bi- T`2;@
MMi- #BQMB+- MK2/- 2H/2`Hv- KM-
#2miB7mH
rQKM- r2`BM;- QM- ?QH/BM;- KM- BM
Q#D2+i Q`B2Mi2/- M2TimMBM- BMMBKi2- HBK2`@
2Mi- i?Q`M2ĜʊviFQr- bm#bi2HH`- tKH@
?iiT`2[m2bi- T2`KM2M+2- /ibQm`+2-
`2KQi#H2
Q#D2+i- QMn#QiiQKnQ7nM-
QMn#QiiQKnQ7n- bmTTQ`iBM;n-
BMn7`QMinQ7nM- BMniQ
r22/ MQtBQmb- 7Q`2biBM- DBKbQM- i?m`HQr-
r?+F2`- BMpbBp2- +#QK#- bQi- i2v+F-
7Q``2biBM
r22/- `2n;`QrBM;nBM- Q#b+m`2b-
;`QrBM;ni?`Qm;?- ;`QrBM;nBM-
;`QrnQminQ7
bmb?B bb?BKB- MB;B`B- Mm/QFB- i2KTm`-
#2MFv- +`2KB2`2- /Bb?2běHBbi- 7mM@
iQbi- Bi+?Q
bmb?B- `QHHb- p+/Q- }b?2b-
`2nT`iBHHv- `2n#2HQr
i2tim`2 H2i?2`v- p2Hp2iv- +`mM+?v- TQ`T?v`BiB+-
KTTBM;- +`mK#Hv- b?/2`b- +?2rv- T@
T2`v- rtv
i2tim`2- M;H2n+`2i2b- ?n/Bz2`2Mi-
n+H2`- i2HHb- +bibnQM
;mi KB+`Q#BQi- r`2M+?BM;- KB+`Q#BQK2-
pQHFb2B;2M2b- `B2+?bi- biD2TFQ- B/2`@
#B+?H- HFHv- r`2M+?BM;Hv- #v2QHbBM
;mi- bi`BM;v- TmKTFBM- r2`BM;- HH- i?Bb
Kmb2mK `i- bi2/2HBDF- b?KQH2M- FmMbi?Bb@
iQ`Bb+?2b- ?B`b??Q`M- K2i`QTQHBiM-
r?BiM2v- ;m;;2M?2BK- #QBDKMb
T`F2/nQmibB/2nQ7- Kmb2mK-
KQmMi2/niQnbB/2nQ7- Q7nB`THM2-
T`F2/nQmibB/2- T`F2/nBMn7`QMinQ7
+QKT`2 +Km`2F- /ǶM2HHQ- BiBKi`2b- FQHbFvb-
[mQi2#Qt2b- v/?BF2Mm- /BbTQbBiBQMě
iQ- 7pQ`#Hv- #m#2MbTBixH2- ?2iB`Bbi`B
+QKT`2- FB/- T?QM2
BbHM/ `?Q/2- bii2M- +QM2v- HQM;- #{M-
r?B/#2v- +Mp2v- K+FBM+- `BF2`b-
pM+Qmp2`
BbHM/- HBpBM;- `BKK2/nBM-
QMn7`QMinT`inQ7- QT2M2/nM2tiniQ-
#mBH/BM;nBMiQ
i?2K2 bQM;- 2M/BM;- QT2MBM;- imM2- MiQHB+-
`2+m``BM;- `K2MB+- T`F- i?`+2bBM-
T`Fb
i?2K2- ;Q`BHH- QT2MbnQM- /2TB+iBM;-
biM/bni- bTimH
MBKHb +`m2Hiv- THMib- 7m``v- rBH/- /QK2biB@
+i2/- MQM?mKM- BKTHBM;- ?mKMb-
bimz2/
MBKHb- bimz2/- +m#B+H2- BMb/B2- bBi-
#`QrM
i?BMF iMF- /QMǶi- iMFb- ǳB- b?m//2`- BiǶb-
bB/- bvBM;- /M+2- #QxQb
i?BMF- ;`2v- i`mMF
KQMF i?2HQMBQmb- #m//?Bbi- #2M2/B+iBM2- iQM@
bm`2/- #`2iiQM- 7`vbiQM- tmMxM;-
?mB7M- 匃娇- +Mi2HH
KQMF- ;`QrbnbT`b2HvnM2`- ?2HT@
BM;n- MpB;iBM;- ?p2nQM- KQMbi2`v
pBHH;2 /KBMBbi`iBp2- KmMB+BTHBiv- /KBMBb@
i`i2/- ?QrK2?- TQTmHiBQM- ;`22MrB+?-
HQ+HBiv- KF2mT- TQK2`MB- FB2H+2
pBHH;2- in7QQinQ7n- Qp2`HQQFb- i`p@
2HBM;ni?`Qm;?- i`p2HBM;- ?`#Qm`
H2`M b?Q+F2/- b+BFBi- bm`T`Bb2/- `2/v-
T2`7v+i- ?Q``B}2/- bim/2Mib- #bB+b-
QTTQ`imMBiv- /BbKv2/
H2`M- bFB- +?BH/- 7Q`- #QQF- QM
+iBpBiv pQH+MB+- i?mM/2`biQ`K- 2+QMQKB+- 2M@
xvKiB+- b2tmH- b2BbKB+- T`Q;2biQ;2MB+-
2ti`p2?B+mH`- 7mK`QHB+- 2bi`Q;2MB+
+iBpBiv- ii2M/BM;- j- ri+?BM;-
`QmM/- bT2+iiQ`
T?QM2 KQ#BH2- +HH- +HHb- +2HH- ?+FBM;- rBM@
/Qrb- THb- +2HHmH`- T?`2Fb- #QQi?
T?QM2- iHFBM;nQM- ?QH/BM;- mbBM;-
HQQFBM;ni- iHFBM;nQMn
/`r2` r#bi`imT- /2bF- /QQ`#M/b- +`v/-
+B;`22i- ;` zb- ?Nyy- D?M;22`- TM@
T?Q#B+- i`BM+?Mi2
/`r2`- 7+2nBM- #mBHinmM/2`M2i?-
/`2bb2`- ?M/H2- FMQ#
TQTTv QTBmK- TTp2`- D?F`- 2b+?b+?QHxB-
/2H2pBM;M2- #`2/b22/- bQpB- TBTQTTQ-
b22/b- T`B+FHv?2/
TQTTv- +Hmbi2`nQ7nTBMF- biK2M- B`Bb-
HT2H- ;2MiH2KM
rBM2 bT`FHBM;- ;`T2- ibiBM;- +2HH`- +2HH`b-
KmHH2/- ;`T2b- DM+Bb- ibiBM;b- #`2/
rBM2- TQm`BM;- ;Hbb- /`BMFBM;- ibiBM;-
?H7n}HH2/
brM r?QQT2`- MmiQ`- b2`BM/- bBHp2`iQM2b-
H2/- Q/BH2- imQM2H- +Qb+Q`Q#- HF2-
TmF­
brM- brBKKBM;n#Qp2n- brBK@
KBM;nQM- brBKKBM;nmTQM- brBK@
KBM;nBM- r/BM;nQM
#HMF2i MQMT`iBbM- 2D2+i- #Q;- bii2@
K2Mibi2BM- mb#- #BM;Q- T`BK`v-
#H+F7`B`;i2- kyyĜeyyKK- b`QHB;
#HMF2i- b2rMnBMiQ- /Q`M@
BM;- +QHQ`nBM- /`T2/nQp2`-
r`TT2/nmTnBM
#BF2 KQmMiBM- `+Fb- HM2b- /Q+FH2bb- i`BHb-
Ti?b- Ti?- +BiB- Q`B2Mi22`- `B/2
#BF2- `B/BM;- HQ+F2/niQ- BM- `B/BM;n-
+?BM2/niQ
#2+? THK- Kv`iH2- /2H`v- p2`Q- /viQM-
TQKTMQ- `2/QM/Q- T2##H2- pQHH2v#HH-
HQM;
#2+?- i- rHFBM;nQM- #`2FBM;nQM-
THvBM;nQM- biM/BM;nQM
+H7 `QTBM;- 7ii2/- +Qr- ;QH/2M- biQpH- BM@
Dm`v- bpib- 2/v- pBiHQ- KBHF+Qr
+H7- Mm`bBM;n7`QK- Mm`bBM;nBM- Mm`b@
BM;- ?bn?2/nQM- M/nbi`22inrBi?
v2HHQr D+F2ib- 72p2`- TH2- ;`22MBb?- #`B;?i-
T2`+?- #2HHB2/- ~Qr2`b- Q`M;2- T2`BH
v2HHQr- ;`22Mn7+2nQ7- +QH@
Q`2/nbm+2nQM- i`+Fn?b- v2H@
HQrn;`QmM/- +QHQ`2/n/Qm;?Mmi
+QQFBM; mi2MbBHb- TQib- +H2MBM;- pBMvH- TQi- ǵFQ@
F2M- QBH- /Qi+?- #FBM;- b2rBM;
+QQFBM;- ;`BHH- TM- ?Qi/Q;- QBH- bi2F
?Q`M ?`/`i- +T2- `BKK2/- FBKH2v-
FMB2?iBBQ- #HBMF2v- 7`B+- iBBQ- ~m;2H-
i`2pQ`
?Q`M- +m`pBM;nrvn7`QK-
?Q`MbnQMn;B`z2- ?b- ?bn+m`p2/-
;`Qrn7`QK
MBH iQQi?- #Bi2`- #BiBM;- ÏFø`?M- KxB@
iQpB+?- vFmTQp- +Q{M- TQHBb?2b- #Bi@
BM;Hv- bHQMb
MBH- mb2/nBM- BKTH2/nQM- ?QH/@
BM;niQ;2i?2`- ?mM;nmT- ?bni?mK#
#m//v ?QHHv- 2#b2M- /2bvHp- /27`M+Q- HxB2`-
HM/2H- pHbi`Q- biQQ/BQb- +BM+B- `Q2K2`
#m//v- #2bi- MBKHbnBM- i?`22-
biM/bn#v- biM/
?x2 /Bx22- 2pQi- i`Mb#QmM/`v- Tm`@
TH2- /BM;H2#2``v- #;BH;mH- /27`BM;2-
?K2b?mKb?- `vF22- mMB/2MiB}#H2Ĝ
mMB7Q`K
?x2- Qp2`niQT- `2n?B;?2`ni?2M-
`2np2`vn7`n7`QK- #HQ#- BMn/BbiM+2
T`Qi2bi T2+27mH- `2bB;M2/- K`+?2b- HBmHBimM-
K2+?i- T`Qi2biB`K- `Qb2Mbi`bb2-
bi`BF2- MQMpBQH2Mi- BxBF
T`Qi2bi- 7Q`- bB;M- `Q/- ?QH/BM;- QM
bH22T TM2- `2K- /2T`BpiBQM- M`2K- Q#bi`m+@
iBp2- /`2KH2bb- TQHvT?bB+- TMQ2- /Bb@
Q`/2`b- #`mtBbK
bH22T- rBi?nn#Hm2- rBi?n;`22Mnb?B`i-
+Qp2`2/nrBi?n- ?bn?Bbn?2/-
b?Q2bn`2nQM
+HQi?2b br//HBM;- Qtt7Q`/- rb?BM;- +BpBHBM-
/`v2`b- THBM- #Q`HQ- /`v2`- K`BHHv-
M;+?mKT
+HQi?2b- QMn+HQb-
biQ`2b- ?p2nTQ+F2ibnBM-
`2nTBH2/nQMniQTnQ7- bi`QrMnQM
#mii2` T2Mmi- K`;`BM2- +Q+Q- #`2/- +H`B@
}2/- +?22b2- K2Hi2/- D2HHv- mMbHi2/- Km@
`mKm`m
#mii2`- H`;2nM2`- T`iHvnQmibB/2-
bmi22BM;nBM- K`;`BM2nQM-
BMnnbKHH
~Qr2` #m/b- ?2/b- MiHBF2- i`p2HHBM- bTBF2b-
Tb[m2- HQimb- Mi?Qb- #2/b- biHFb
~Qr2`- pb2- #HQQKBM;nBM- #HQQK@
BM;nBMbB/2- BM- BMnKB/nQ7
`BM iQ``2MiBH- bBM;BM- ?2pv- TQm`BM;-
bQF2/- bMQr7HH- bMQr- ?i7mH- b?BM2-
7`22xBM;
`BM- 7HHBM;nQM- ;2iiBM;nr2in#v-
iQrMb- rHFBM;nQM- iF@
BM;nb?2Hi2`n7`QK
+Qz22 b?QT- `Qbi2`b- #2Mb- b?QTb- THMi@
iBQMb- `#B+- /2+z2BMi2/- bi`#m+Fb-
`Qbi2`- i2
+Qz22- +mT- Km;- TQm`2/n7`QK-
Q7nbi2KBM;- #HQrBM;n+`Qbb
+Qr +H`#2HH2- /mM;- KBHF- +H7- ?v/`Q@
/KHBb- T`bMBT- ?Q+F2/- `2BM2/- K/-
FQr2K2`F
+Qr- QM- biM/BM;nBM- HvBM;nBM- KBHF@
BM;- #2BM;nb?QrMni
rB; rK- D;#;b- rKMB- #HQM/2- r;-
b?2BM?`/i- #HQM/- TQHQ;Bb2ěBi- ?B`@
TB2+2ě- H+27`QMi
rB;- BMn+HQrM- r2`b- ;mB/BM;-
MQinr2`BM;n- TmHH2/n#v
iQr2` +QMMBM;- 2Bz2H- #2HH- +HQ+F- ?KH2ib-
K`i2HHQ- ##2H- bTbbFv- /`m;-
bTB`2
iQr2`- ?QmbBM;n- HQM;n7`QMinQ7-
iniQTnQ7- +HQ+F- +QMiBMBM;n
#Hm2 Dvb- `B##QM- ƺvbi2`- D+F2ib- ?22H2`b-
+QHH`- /2pBHb- `B#M/- #QK#2`b- `B/;2
#Hm2- +H2`- M/nr?Bi2nQ+2M-
+HQm/H2bb- B`nQM-
#`M/nMK2n`mbivnQMnBib
bFBM B``BiiBQM- `b?2b- ;`7ib- H2bBQMb- TB;@
K2MiiBQM- B``BiiBQMb- ;QH/#2i2`- Km@
+Qmb- MQMK2HMQK- iMM2/
bFBM- ?M;BM;n?- H/vnHB;?i- TT2`@
BM;nQM- KiBM;nrBi?- QMn+i
~2tB#H2 +QKT+iBM;- bB;KQB/Qb+QTv- }#2`QTiB+-
#2M/bQK2- +QM7Q`KiBQMHHv- /Ti@
#H2- THMFQbi2M`2+?MmM;- HB[mB/iB;?i-
HBMF2`b- ?2HB+
~2tB#H2- ;`22Mn`BK- iQn+i+?n- ;`2M-
BMn/Q;b- 7`Bb#22
#; /mz2H- THbiB+- /m|2- #B/Bi- pBM#Q-
TmM+?BM;- bH22TBM;- ;`#- +QHQbiQKv-
B`bB+FM2bb
#;- +``vBM;- +``vBM;n- +``B2b- #2v@
r22M- TH+2/nBMbB/2
#B`/ Tbb2`BM2- KB;`iQ`v- +;2/- bM+im`v-
ri+?2`b- ri+?BM;- iQTH2v- bT2+B2b-
T`2v- 7m`M`BB/2
#B`/- T2`+?2/nQM- ~vBM;nBM- ~v@
BM;nQp2`- #2F- ~vBM;n?2/nQ7
FBi+?2M bBMF- bQmT- ?2HH- mi2MbBHb- /BMBM;- ?2HHǶb-
#i?`QQK- TMi`v- b+mHH2`v- HmM/`v
FBi+?2M- BM- 7+2nBM- T`2T`2/nBM-
rQ`FBM;nBM- rBi?nBMi2`BQ`
7i?2` bm++22/2/- /B2/- 7QQibi2Tb- /QTiBp2- #B@
QHQ;B+H- /2i?- Hr- BM?2`Bi2/- bQM
7i?2`- H2MBM;nQp2`niQniQm+?- iF@
BM;nnTB+im`2nBM- iFBM;nnb2H7nBM-
M/nbQMnH2`MBM;niQnB+2-
+H2MbnbQMb
b?Q`2 /BM?- HF2- MQ`i?- 2bi2`M- #ii2`B2b-
;2Q`/B2- #QK#`/K2Mi- TmHv- D2`b2v
b?Q`2- #`2FBM;nQM- rb?BM;nQM- +QK@
BM;niQ- +QKBM;nBMniQ- +`b?BM;nQM
p2?B+H2 KQiQ`- HmM+?- `2;Bbi`iBQM- 2H2+i`B+-
Kp- r?22H2/- `22Mi`v- `Qp- miBHBiv- mp
p2?B+H2- T`F2/nHQM;bB/2nQ7-
T`F2/nHQM;bB/2- T`F2/nQM-
T`FBM;nQMnbB/2nQ7-
`2nT`F2/nHQM;bB/2nQ7
#`BM; #+F- ?2HT2/- iQ;2i?2`- rQmH/- ?Q`BxQM-
7Q`i?- +QmH/- ii2MiBQM- #H2- ii2KTi
#`BM;- bvb- rHH- QMn- ?bn- ?b
bKBH2 bKBH2v- +`Mi- pQD/MQp- تُٜؕ-
Ŀ        ŀ- bQHKB- iQQi?v- /F- bQM`ő2- 7+2
bKBH2- 2tTQbBM;- `2p2HBM;- 2tTQb2b-
BMnbKBHBM;- 2tT`2bbBM;
iBK2 }`bi- 7mHH- HQM;- 2ti`- bT2M/- bT2Mi-
`2H- +QMbmKBM;- bHQi- `QmM/
iBK2- ?pBM;n;`2i- +QmMib-
b+2M2n/m`BM;n/v- iQni2HH- i2HHb
7+i /2bTBi2- /m2- +?2+FBM;- bTBi2- }M/@
BM;- +QKTHB+i2/- Kii2`- +?2+F2`- +QK@
TQmM/2/- 2pB/2M+2/
7+i- HBbi2/nQM- rBi?nbQK2- BM;`2/B2Mi-
Dm;- #QiiH2
7QQi#HH H2;m2- i2K- +Hm#- +QHH2;2- THv2`- pB+@
iQ`BM- MiBQMH- K2`B+M- T`Q72bbBQMH-
+Q+?
7QQi#HH- +?b2b- iQn?BM/2`- i`v@
BM;niQnbp2- THv2`nBM- 2tT2`B2M+@
BM;n
}b? }MM2/- +vT`BMB/- rBH/HB72- ?i+?2`v-
#QMv- K`/v- 7`2b?ri2`- M/`QKQmb-
/2K2`bH- b?2HH}b?
}b?- KQH2nQMn- r`Bi?2bnBMbB/2-
?bnn;`v- +m;?inrBi?-
#QminiQn#2n72/niQ
}HK 72biBpH- /B`2+i2/- +MM2b- /`K- 72@
im`2- +QK2/v- bmM/M+2- /Q+mK2Mi`v-
?Q``Q`- i?`BHH2`
}HK- iF2MnrBi?- `2n#2BM;- iT2/niQ-
#2BM;- THvBM;nQMn
`Kb +Qi- +Qib- 2K#`;Q- KmMB+BTHBivǶb-
KKmMBiBQM- H2;b- Tm`bmBpMi- +MiBM;-
bKHH- ;mH2b
`Kb- Qmibi`2i+?2/- #`2- 7QH/- bFi2`-
bFi2#Q`/2`
THMi ~Qr2`BM;- TQr2`- Ti?Q;2M- /2bHBM@
iBQM- };rQ`i- Q`MK2MiH- ?Qbi- ?`/B@
M2bb- ?2`#+2Qmb- #QiiHBM;
THMi- ;`QrBM;nBM- ;`QrBM;nQM- ;`Qr@
BM;nmTn- TQi- H27
7QQ/ 7bi- #2p2`;2- /`m;- /`BMF- b?Q`i;2b-
bmTTHB2b- BMb2+m`Biv- M2Bp2i?MK- ;`B-
biTH2
7QQ/- //2/nQM- bmi2BM;nBM-
r`TT2/nBMbB/2- TH+2/- +minBMniQ
KF2 bm`2- K2M/b- rv- 7BH2/- rQmH/- 2bB2`-
/2+BbBQMb- rMi2/- `QQK- b2Mb2
KF2- 2t+?M;2- }b?TQM/- #HQr2`- +QM@
bi`m+i- `2nMQi
`Kv bHpiBQM- HB#2`iBQM- m- #`BiBb?- bii2b-
+Q`Tb- TQiQK+- `2/- FrMimM;- Q{+2`
Ti+?n7Q`- `Kv- ;2iiBM;nQminQ7- +H@
2M/`- QMn#+FnQ7- ;`22M
#Q/v ;Qp2`MBM;- r?Q`H- bim/2Mi- bMi+?2`b-
bM+iBQMBM;- TQHBiB+- 2+HBTiB+- +`2Ki2/-
?mKM- HB72H2bb
#Q/v- #HQrMnmTnBM- #QinBM-
/`B7iBM;nBM- rvn7`QKni?B2`-
#2Minrvn7`QK
b+?QQH ?B;?- 2H2K2Mi`v- b2+QM/`v- ;`KK`-
/Bbi`B+i- T`BK`v- #Q`/BM;- KB//H2-
T`2T`iQ`v- /Bbi`B+ib
b+?QQH- #mbni?i- 7`QMi- b2+QM/- bi2M@
+BH2/nQM- v2HHQr
7Q`2bi MQiiBM;?K- rF2- i2miQ#m`;- 2TTBM;-
#Q`2H- KQMiM2- b+H2`QT?vHH- D``?-
HrM- /2+B/mQmb
7Q`2bi- `Q/nBM- iBTT2/nQp2`nBM-
biM/n#Qp2- 7Q`Knn/BbiB+inHBM2-
n;`QmM/nBM
M2r vQ`F- x2HM/- D2`b2v- Q`H2Mb- ?KT@
b?B`2- ;mBM2- TTm- #`mMbrB+F- vQ`F2`-
i2biK2Mi
M2r- m`#M- #2`/nM/- ;2M@
2`iBQMnrB/2nb+`22MnbK`i-
KQ/2Hnr?Bi2- bTB`2
+Biv vQ`F- FMbb- +QmM+BH- KF2mT- K2tB+Q-
HBKBib- [m2xQM- QFH?QK- ?QH#v- `2bB/@
BM;
+Biv- b?BMBM;nBM- #mBH/BM;nBMn- BM-
rM/2`BM;- BMnbBM
T2QTH2 bm`MK2- MQi#H2- yyy- T2`- `2Tm#@
HB+- vQmM;- BM/B;2MQmb- /Bb#BHBiB2b- 2K@
THQv2/- #Q`B;BMH
T2QTH2- `2n2MDQvBM;- ?b- QM- ri+?BM;-
`2nri+?BM;
7KBHv KQi?- +2`K#v+B/2- KQHHmbF- #22iH2-
+`K#B/2- ;2QK2i`B/2- 2`2#B/2- MQ+@
imB/2- iQ`i`B+B/2- bBx2
7KBHv- b2i2/n`QmM/- biM/nrBi?-
+H2`Hv- r?2`2n`2- QMn;`QmM/n7Q`
?Qmb2 `2T`2b2MiiBp2b- +QKKQMb- HQ`/b- KMQ`-
QT2`- r?Bi2- Tm#HBb?BM;- /2H2;i2b-
bT2F2`- `M/QK
?Qmb2- QMn7/2nQ7- +2K2Mi2/nQM-
BMn7`QMinQ7- +`QbbBM;nQp2`niQ- /Q`M@
BM;
v2` QH/- 7QHHQrBM;- +QMi`+i- QH/b- 2p2`v-
T2`- }b+H- `QQFB2- M2ti- T`2pBQmb
v2`- Tm#- R3Nj- bii2b- ;2-
rbniF2MnBM
T`iv +QKKmMBbi- /2KQ+`iB+- H#Qm`- HB#@
2`H- +QMb2`piBp2- DMi- bQ+BHBbi- `2@
Tm#HB+M- TQHBiB+H- H#Q`
T`iv- inn#B`i?/v- +`vBM;ni-
bM2FBM;nmTnQM- ?pBM;- `2nin
+QKTMv T`2Mi- #`2rBM;- BMbm`M+2- rQ`b?BT7mH-
bi2Kb?BT- KMm7+im`BM;- Tm#HBb?BM;-
?QH/BM;- T`Q/m+iBQM- 7QmM/2/
+QKTMv- Q7nT?QiQ;`T?v-
`2n2MDQvBM;n2+?nQi?2`b- +HHb-
#Q2BM;- i?inQrMb
h#H2 6XR, *QMi2ti rQ`/b Q7 +Hmbi2` +2Mi`QB/b rBi? i?2 Ry ?B;?2bi χ2 b+Q`2X
Centroid | Wikipedia | Visual Genome
THi2 i2+iQMB+b- HB+2Mb2- `Bp2`- Mx+- `2@
bi`B+iQ`- i2+iQMB+- mKTB`2- 2m`bBM-
7`HHQM- ?QK2
QM- THi2- QMniQTnQ7- QMn- Hv@
BM;nQMniQTnQ7- rBi?
HB+FBM; +QmMiv- `Bp2`- Kmr2- ;`QQKBM;-
7Q`F- +Ȫ- HBTb- rQmM/b- /BTKBt- mT@
bmKB/
HB+FBM;- iQM;m2- ;B`z2- +i- ;B``72-
+2HBM;
/ ?Q+- +2Mim`v- /BM- HB##2/- pHQ`2K-
HB#b- H+Q`+ƦM- HB#- ?QKBM2K- BM}MB@
imK
/- QM- Q7n/Bz2`2Mi- HBMBM;- 7Q`- ?b
`mbi #2Hi- +QHQ`2/- 7mM;B- #HBbi2`- +`QM`@
iBmK- 2TB[m2- +QHQm`2/- 7mM;mb-
QH2mK- +Q?H2
`mbi- QM- ?b- biBMbn/QrM- QMn}`2-
`QmM/nbB/2nQ7
`BHrv biiBQM- HBM2- r2bi2`M- KB/HM/- M2`@
2bi- biiBQMb- +QKTMv- DmM+iBQM-
;m;2- 2bi2`M
`BHrv- /2i+?- i`/BiBQMH- #2@
bB/2- 2H2pi2/nQMnTHi7Q`KnQp2`-
Tbb2bnQp2`n
+Hbb`QQK i2+?2`b- BMbi`m+iBQM- bT+2- HM@
;m;2- #mBH/BM;- QmibB/2- 7mim`2- y-
2p2`v
+Hbb`QQK- BM- biM/BM;nBMbB/2- bBi@
iBM;nBMbB/2- /Bb+mbbBM;nBM- bim/2Mi
?mKKBM;#B`/ KxBHB- b2HbT?Q`mb- i?`Qi2/- K2H@
HBbm;- +?BMM2/- +HvTi2- +vMMi?mb-
?rFKQi?- bT2+B2b- b+BMiBHHMi
?mKKBM;#B`/- ~TTBM;-
2inM2+i`n7`QK- ~TTBM;nBib-
?b- BMn~B;?in#2HQr
+# +miB2- +HHQrv- ?MbQK- /`Bp2`- itB-
Q#`/QB`Q- bB;MHHBM;- /2i?- #@
bi`+ib
+#- QM- ?b- QMn?QQ/nQ7- /`Bp@
BM;nQM- Q7
#HQQK H;H- ?`QH/- ~Qr2`b- bHB2p2- /QQ`@
v`/- Q`HM/Q- DxKBM2- +HB`2-
H2QTQH/- 7mHH
#HQQK- +?2``vn#HQbbQKni`22nBM-
BMn7mHHnbmKK2`- `Qb2n?bn7mHHv-
QM- BM
+?T2H ?BHH- bBbiBM2- +`QHBM- 2b2-
K2i?Q/Bbi- #mBHi- /2/B+i2/- +H@
p`v- bi- +?Mi`v
+?T2H- QM- QmibB/2- +?m`+?- Q7- iQ
+?KT;M2 `/2MM2- biF2b- #QiiH2- Bb?B?BF- 2M-
+?HQMb- #`B2- HMbQM- `2BKb
+?KT;M2- BBM- BMnnrQKMb-
+`72- ;Hbb- r`TT2/n`QmM/
Dr KQQb2- HQr2`- mTT2`- /`QTTBM;Hv-
#`QF2M- /`QTTBM;- T?Qbbv- Qb@
i2QM2+`QbBb- Kmb+H2b- bbFi+?2rM
Dr- Q7- ?b- Q7n- bi`QM;- QM
`TB/ i`MbBi- ;`Qri?- rB2M- 2tTMbBQM-
#mb- BMi2`#Q`Qm;?- BMi2MbB}+iBQM-
T`QiQivTBM;- bm++2bbBQM- #m+m`2șiB
`TB/- TQr2`BM;ni?`Qm;?- +`b?@
BM;nQp2`- ;Q2bnQp2`- KM2m@
p2`ni?`Qm;?- `B/2bnQ7
/M/2HBQM i`t+mK- #m`/Q+F- i`B#2- rBM2- bm@
T2`}M2- THMib- /t+- #`H2v+mT- ;2`@
Hi- F`BM/H2
/M/2HBQM- BM- ;`QrBM;nBM-
BMn2KTivnbTQinQ7- KQM;-
;BMbinbQK2
bright | yellow, red, colors, orange, lights, green, colours, light, eyes, blue | bright, sleep_on, blue, eyes, green, browƢ
drizzle | 嬱嬱곝, drazzle, freezing, chisper, bbuch, ourdelt, , miggelen, rain, german | drizzle, a_donut, adorning, ont_he, decorated_same_as, on
crochet | knitting, hook, filet, yarntainers, stitches, speró, kolose, tatting, knit | crochet, are_for, needlepoint, clit, under, are_on
fun | poked, poking, pokes, poke, loving, lot, much, making, make, fun | fun, are_having, are_having_great, facing_away, having, planning
pain | abdominal, neuropathic, chronic, chest, suffering, excruciating, relief, swelling, back, severe | choppy_near, waving_from, sitting_back, sticking_up_out_of, travel_on
ant | dec, species, genus, subfamily, man, arboreal, adam, fire, tupolev | ant, showing_through, reflecting_off, are_behind, vase, are_on
bus | routes, service, services, terminal, station, stop, lines, stops, rapid, route | on, bus, has, on_side_of, on_front_of, of
giraffe | giraffa, reticulated, masai, rothschild, zebra, melman, mb, gogolick, nubian | has, giraffe, of, on, behind, has_a
glass | stained, windows, window, magnifying, looking, leaded, beads, menagerie, borosilicate, bottles | glass, on, wearing, has, in, with
hand | right, left, hand, grenades, one, sleight, side, upper, grenade, combat | hand, has, holding, in, on, of
window | transfer, glass, openings, transom, rear, lancet, palladian, frames, oriel, arched | on, window, has, build, on_a, on_side_of
plane | crash, projective, focal, crashed, euclidean, inclined, hyperbolic, astral, crashes, complex | on, plane, has, of, on_side_of, flying_in
white | black, sox, house, 0, supremacist, collar, red, tailed, blue, creamy | white, on, colored, of, black, small
grass | roots, courts, marram, splendour, günter, outdoor, tussock, natural, perennial, tall | grass, on, in, eating, standing_in, has
tree | trunks, christmas, oak, banyan, lined, frog, planting, fig, joshua, palm | tree, behind, on, in, near, has
room | dining, locker, dressing, waiting, temperature, living, hotel, reading, make, drawing | room, in, has, inside_of, in_a, in_corner_of
water | polo, supply, drinking, fresh, potable, quality, shallow, census, sanitation, resources | in, water, near, on, swimming_in, by
wall | street, hadrian, curtain, berlin, retaining, cell, paintings, outer, brick, stone | wall, on, hanging_on, against, in, behind
dog | hound, naughty, mad, hot, sled, bonzo, pet, breeds, stray | dog, has, on, of, with, has_a
sky | big, sports, news, blue, night, perfectv, conference, madrean, survey | in, sky, cloud, flying_in, hanging_in, above
train | station, passenger, wagon, express, services, freight, kmph, service, derailed, halts | on, train, has, of, on_front_of, on_side_of
table | tennis, following, shows, lists, periodic, round, mid, summarizes, summarises, bottom | on, table, on_top_of, sitting_at, sitting_on, at
man | spider, isle, young, old, mega, iron, named, one, match, tag | wearing, man, has, on, holding, of
living | couples, someone, alone, together, 18, families, people, poverty, room, conditions | living, island, docked_near, plant, horizon, top
say | goodbye, needless, anything, went, sources, goes, yes, would, something, never | say, tan, letter, word, calmette, centre
seagull | chekhov, 60033, livingston, fluoxytine, , supermarine, treplev, zarechnaya, arcadina, chekhov’s | seagull, over_and_in, floating_with, beside_of_a, flies_over, flying_above
furniture | store, antique, designer, maker, pieces, fittings, upholstered, factory, makers, design | furniture, on, occupying, sniffing_under, cusion, has_a_reflection_in
inspect | organise, skagerrak, proceedings, visually, damage, goodsĘ, unpiggable, cerny2012, ensende, evasioněarrives | inspect, zebra, hoof, branch, of, train
accident | car, automobile, fatal, motorcycle, occurred, traffic, investigation, boating, freak, cause | accident, arrived_at_an, scurrying_around, was_in_an, whit, say
woman | young, wonder, first, suffrage, named, man, old, pregnant, beautiful, american | wearing, woman, has, holding, on, wears
object | oriented, neptunian, subject, inanimate, indirect, direct, verb, relational, affection | object, on, on_bottom_of_an, in, on_bottom_of_a, has
weed | noxious, thurlow, jimson, invasive, forestin, whacker, control, sot, invasion, alligator | weed, growing_in, are_growing_in, in, growing_through, growing_next_to
sushi | sashimi, restaurant, nigiri, tempura, chef, bar, nudoki, yo, restaurants | sushi, on, rolls, avacado, near, are_below
i2tim`2 KTTBM;- H2i?2`v- p2Hp2iv- +QHQ`-
+`mM+?v- rtv- b?/2`b- TT2`v- TQ`@
T?v`BiB+- +?2rv
i2tim`2- M;H2n+`2i2b- ?n/Bz2`2Mi-
n+H2`- i2HHb- ?b
;mi KB+`Q#BQi- r`2M+?BM;- KB+`Q#BQK2-
~Q`- biD2TFQ- pQHFb2B;2M2b- bi`BM;b-
`B2+?bi- r`2M+?BM;Hv- KB+`Q~Q`
;mi- TmKTFBM- bi`BM;v- r2`BM;- HH- ?b
Kmb2mK `i- MiBQMH- K2i`QTQHBiM- KQ/2`M-
Mim`H- }M2- #`BiBb?- r?BiM2v- ?BbiQ`v
T`F2/nQmibB/2nQ7- Kmb2mK-
T`F2/nBMn7`QMinQ7- T`F2/nQmibB/2-
KQmMi2/niQnbB/2nQ7- Q7nB`THM2
+QKT`2 7pQ`#Hv- mb2/- 7pQm`#Hv- brT- /B{@
+mHi- +QMi`bi- /Bz2`2Mi- i#H2b- T`B+2b-
`2bmHib
+QKT`2- FB/- T?QM2
BbHM/ `?Q/2- bii2M- HQM;- +QM2v- 2/r`/-
pM+Qmp2`- THi7Q`K- K`2- #{M-
r?B/#2v
BbHM/- HBpBM;- `BKK2/nBM- BM- QM-
QMn7`QMinT`inQ7
i?2K2 bQM;- 2M/BM;- QT2MBM;- T`F- imM2- `2@
+m``BM;- KBM- T`Fb- KmbB+- bQM;b
i?2K2- ;Q`BHH- /2TB+iBM;- QT2MbnQM-
biM/bni- ?b
MBKHb THMib- rBH/- +`m2Hiv- ?mKMb- /QK2biB@
+i2/- 7m``v- MQM?mKM- 7`K- /QK2b@
iB+
MBKHb- bimz2/- +m#B+H2- BMb/B2- bBi-
#`QrM
i?BMF iMF- iMFb- /QMǶi- bB/- T2QTH2- ǳB- bv@
BM;- /M+2- BiǶb- `2HHv
i?BMF- ;`2v- i`mMF
KQMF i?2HQMBQmb- #m//?Bbi- #2M2/B+iBM2- iQM@
bm`2/- #`2iiQM- tmMxM;- +Bbi2`+BM-
K`- DBM- K2`2/Bi?
KQMF- ;`QrbnbT`b2HvnM2`- MpB;i@
BM;- ?2HTBM;n- r2`BM;- ?QH/b
pBHH;2 /KBMBbi`iBp2- TQTmHiBQM- KmMB+BTH@
Biv- HQ+HBiv- bKHH- KF2mT- ;`22MrB+?-
/KBMBbi`i2/- HQ+i2/- ?QrK2?
pBHH;2- Qp2`HQQFb- BM- in7QQinQ7n-
i`p2HBM;ni?`Qm;?- i`p2HBM;
H2`M bim/2Mib- b?Q+F2/- `2/v- bm`T`Bb2/- QT@
TQ`imMBiv- +?BH/`2M- Kmbi- `2/- ?Q``B@
}2/- #bB+b
H2`M- bFB- +?BH/- 7Q`- #QQF- QM
+iBpBiv pQH+MB+- 2+QMQKB+- b2tmH- T?vbB+H-
i?mM/2`biQ`K- b2BbKB+- +`BKBMH- ?m@
KM- 2MxvKiB+- T`MQ`KH
+iBpBiv- ii2M/BM;- j- ri+?BM;-
`QmM/- BM
T?QM2 KQ#BH2- +HH- +2HH- +HHb- rBM/Qrb- ?+F@
BM;- MmK#2`- MmK#2`b- +2HHmH`- #QQi?
T?QM2- ?QH/BM;- QM- iHFBM;nQM- ?b-
mbBM;
/`r2` /2bF- iQT- bQ++2`- /`2bb2`- r#bi`imT-
bHB/2b- +`BbT2`- TBMi2`- /QQ`#M/b
/`r2`- QM- BM- ?b- ?M/H2- mM/2`
TQTTv QTBmK- TTp2`- b22/b- b22/- /2H2p@
BM;M2- D?F`- 2b+?b+?QHxB- +mHiBpiBQM-
bi`r- `2K2K#`M+2
TQTTv- +Hmbi2`nQ7nTBMF- biK2M- B`Bb-
HT2H- ;2MiH2KM
rBM2 ;`T2- bT`FHBM;- +2HH`- ibiBM;- +2HH`b-
#`2/- ;`T2b- `2/- `2;BQM- bTB`Bib
rBM2- ;Hbb- TQm`BM;- Q7- BM- /`BMFBM;
brM HF2- `Bp2`- #H+F- r?QQT2`- MmiQ`-
H2/- ?mMi2`- +QbiH- Q/BH2- /Bbi`B+ib
brM- brBKKBM;nBM- KF2b- brBK@
KBM;nQM- BM- brBKKBM;n#Qp2n
#HMF2i MQMT`iBbM- 2D2+i- #Q;- T`BK`v- #M-
r`TT2/- #BM;Q- #Q;b- #2+?- mb#
#HMF2i- QM- ?b- #2/- /Q`MBM;- mM/2`
#BF2 KQmMiBM- HM2b- `+Fb- Ti?- i`BHb-
Ti?b- i`BH- `B/2- b?`BM;- /B`i
#BF2- QM- `B/BM;- ?b- Q7- M2`
#2+? THK- HQM;- Kv`iH2- ~Q`B/- pQHH2v#HH-
/viQM- #Qvb- /2H`v- KBKB- +HB7Q`MB
#2+?- QM- i- rHFBM;nQM- biM/@
BM;nQM- THvBM;nQM
+H7 `QTBM;- ;QH/2M- BMDm`v- +Qr- 7ii2/-
Kmb+H2- bi`BM- Kmb+H2b- 2/v- biQpH
+H7- ?b- Mm`bBM;n7`QK- Mm`bBM;- Q7-
Mm`bBM;nBM
v2HHQr 72p2`- D+F2ib- TH2- #`B;?i- Q`M;2-
;`22MBb?- ~Qr2`b- T2`+?- +`/
v2HHQr- QM- TBMi2/- ;`22Mn7+2nQ7-
;`22M- +QHQ`2/nbm+2nQM
+QQFBM; mi2MbBHb- QBH- pBMvH- TQib- +H2MBM;- TQi-
b?Qr- mb2/- i2+?MB[m2b- b2rBM;
+QQFBM;- ;`BHH- TM- ?Qi/Q;- TBxx- 7QQ/
?Q`M +T2- 7`B+- ?`/`i- i`2pQ`- #B;- b2+@
iBQM- pM- `BKK2/- ;QH/2M- 7`2M+?
?Q`M- ?b- QM- Q7- ;B`z2- ?2/
MBH iQQi?- #BiBM;- #Bi2`- +Q{M- TQHBb?- bHQM-
`2+Q`/b- bHQMb- `mbiv- TQHBb?2b
MBH- QM- ?b- Q7- ?bn- mb2/nBM
#m//v ?QHHv- 2#b2M- ;mv- /27`M+Q- HxB2`- /2@
bvHp- `B+?- `Q2K2`- +QHH2ii2- HM/2H
#m//v- #2bi- MBKHbnBM- i?`22- biM/-
BM
?x2 /Bx22- Tm`TH2- i`Mb#QmM/`v- 2pQi-
/BM;H2#2``v- bKQF2- #QQ;2`- M;2H- D2b@
bBF- 7Q;
?x2- Qp2`niQT- `2n?B;?2`ni?2M-
`2np2`vn7`n7`QK- #HQ#- #2HQr
T`Qi2bi `2bB;M2/- T2+27mH- KQp2K2Mi- bi`BF2-
K`+?2b- bi;2/- `HHv- `HHB2b- bBi- pB@
QH2Mi
T`Qi2bi- 7Q`- bB;M- QM- `Q/- ?QH/BM;
bH22T TM2- `2K- /2T`BpiBQM- M`2K- Q#bi`m+@
iBp2- /BbQ`/2`b- /B2/- T`HvbBb- /`2K@
H2bb- /22T
bH22T- rBi?nn#Hm2- +Qp2`2/nrBi?n-
rBi?n;`22Mnb?B`i- b?Q2bn`2nQM-
r2`BM;nn#Hm2
+HQi?2b +BpBHBM- THBM- rb?BM;- br//HBM;-
r2`BM;- r2`- b?Q2b- ++2bbQ`B2b- /`v@
2`b- rQ`M
+HQi?2b- r2`BM;- QM- QMn+HQb- r2`b-
biQ`2b
#mii2` T2Mmi- #`2/- +Q+Q- K`;`BM2-
+?22b2- +H`B}2/- K2Hi2/- D2HHv- KBHF
#mii2`- QM- H`;2nM2`- rBi?-
T`iHvnQmibB/2- bmi22BM;nBM
~Qr2` #m/b- ?2/b- bTBF2b- HQimb- #2/b- H2p2b-
MiHBF2- T2iHb- biHFb- ;`/2M
~Qr2`- BM- QM- pb2- ?b- rBi?
`BM ?2pv- iQ``2MiBH- bBM;BM- bMQr- 7Q`2bi-
7Q`2bib- 72HH- bMQr7HH- b?BM2- TQm`BM;
`BM- rHFBM;nQM- 7HHBM;nQM- rHF@
BM;nBM- iQrMb- ;2iiBM;nr2in#v
+Qz22 b?QT- b?QTb- #2Mb- THMiiBQMb- i2-
`Qbi2`b- i#H2- bi`#m+Fb- ?Qmb2
+Qz22- +mT- Km;- BM- Q7- }HH2/nrBi?
+Qr KBHF- /mM;- +H`#2HH2- +H7- K/- ?2M`v-
bHm;?i2`- T`bMBT- Tbim`2- KBHFBM;
+Qr- ?b- Q7- BM- QM- biM/BM;nBM
wig | blonde, wam, wag, blond, wearing, jagbags, wamania, brunette, wore, mask | wig, in_clown, wearing, wears, has, wearing_a
tower | conning, bell, clock, eiffel, hamlets, built, london, water, observation, babel | tower, on, has, clock, on_top_of, in
blue | jays, ribbon, jackets, devils, collar, bombers, ridge, dark, white, toronto | blue, clear, on, wearing, and_white, in
skin | irritation, lesions, color, rashes, grafts, cancer, pigmentation, diseases, mucous, graft | skin, has, of, appearing_on, on, hanging_h
flexible | compacting, enough, sigmoidoscopy, thin, fuel, scheduling, highly, adaptable, stalk, plastic | flexible, frisbee, green_rim, to_catch_a, gren, in_dogs
bag | plastic, duffel, sleeping, punching, paper, mixed, grab, duffle, containing, bidit | bag, carrying, on, in, holding, has
bird | species, passerine, sanctuary, migratory, watching, prey, family, important, caged | bird, has, on, of, in, flying_in
kitchen | hell, sink, soup, dining, utensils, bathroom, room, garden, pantry, laundry | kitchen, in, in_a, inside_of, working_in, cabinet
father | died, succeeded, death, law, son, biological, footsteps, inherited, adoptive | father, taking_a_picture_in, taking_a_self_in, leaning_over_to_touch, rolling_a, walking_down_a
shore | north, lake, eastern, dinah, batteries, south, western, jersey, southern | shore, on, breaking_on, washing_on, coming_to, crashing_on
vehicle | motor, launch, electric, registration, aerial, utility, wheeled, armored, armoured, reentry | vehicle, on, of, parked_on, has, parked_alongside_of
bring | back, would, together, helped, could, able, attention, order, attempt, forth | bring, says, wall, on_a, has_a, has
smile | smiley, face, dk, smile, operation, make, lisa, crnt, toothy, frown | smile, exposing, revealing, has, on, face
time | first, full, long, spent, around, real, extra, second, short, spend | time, having_great, counts, on, tells, shows
fact | despite, due, finding, matter, spite, checking, many, complicated, refers, attributed | fact, listed_on, with_some, ingredient, jug, has
football | league, team, club, college, national, american, player, professional, coach, victorian | football, chases, playing, trying_to_save, to_hinder, placing
fish | finned, wildlife, cyprinid, freshwater, bony, hatchery, species, mardy, shellfish | fish, in, on, served_on, mole_on_a, aquarium
film | festival, directed, drama, feature, comedy, cannes, documentary, international, horror, short | film, taken_with, are_being, taped_to, being, in
arms | coat, coats, small, embargo, ammunition, legs, municipality’s, bear, gules, dealer | arms, skater, skateboarder, outstretched, bare, fold
plant | flowering, power, species, family, host, pathogen, ornamental, manufacturing, treatment | plant, on, in, growing_in, growing_on, pot
food | fast, drug, drink, beverage, supplies, shortages, processing, safety, source, agriculture | food, on, in, on_top_of, plate, with
make | sure, would, way, failed, amends, could, wanted, order, room, able | make, exchange, being, splash, fishpond, construct
army | u, states, british, red, corps, liberation, air, officer, salvation, us | patch_for, army, getting_out_of, on_back_of, green, calendar
body | governing, whorl, student, human, weight, dead, parts, length, sanctioning, main | body, of, has, on, in, of_a
school | high, elementary, secondary, district, grammar, primary, middle, law, districts, public | school, bus_that, on, front, bus, yellow
forest | nottingham, wake, national, service, montane, epping, lawn, boreal, teutoburg, lowland | forest, in, in_a, tree, behind, filled_with
new | york, zealand, jersey, orleans, hampshire, guinea, south, mexico, brunswick, papua | new, urban, spire, beard_and, generation_wide_screen_smart, model_white
city | york, kansas, council, makeup, mexico, limits, population, centre, capital, oklahoma | city, in, in_a, shining_in, build, building_in_a
people | notable, surname, 000, per, young, republic, many, living, million, employed | people, on, in, watching, walking_on, are_enjoying
family | moth, beetle, cerambycidae, mollusk, size, average, income, crambidae, geometridae, erebidae | family, seated_around, stand_with, having, on_ground_for, sitting_around
house | representatives, commons, lords, white, opera, manor, publishing, built, historic, delegates | house, on, in_front_of, has, behind, near
year | old, following, contract, every, per, one, next, later, previous | year, pub, states, 1893, age, was_taken_in
T`iv +QKKmMBbi- /2KQ+`iB+- H#Qm`- HB#@
2`H- +QMb2`piBp2- `2Tm#HB+M- bQ+BHBbi-
TQHBiB+H- DMi- H#Q`
T`iv- ?pBM;- BM- /M+2-
inn#B`i?/v- +`vBM;ni
+QKTMv T`2Mi- T`Q/m+iBQM- 7QmM/2/- BMbm`@
M+2- Tm#HBb?BM;- ?QH/BM;- KMm7+im`@
BM;- BM/B- #`2rBM;- i?2i`2
+QKTMv- Q7nT?QiQ;`T?v-
`2n2MDQvBM;n2+?nQi?2`b- +HHb-
#H2M/2`- #Q2BM;
h#H2 6Xk, *QMi2ti rQ`/b Q7 +Hmbi2` +2Mi`QB/b rBi? i?2 Ry ?B;?2bi SJA3 b+Q`2X