STANDER: An Expert-Annotated Dataset for News Stance Detection and Evidence Retrieval

Costanza Conforti1, Jakob Berndt2, Mohammad Taher Pilehvar1,3, Chryssi Giannitsarou2, Flavio Toxvaerd2, Nigel Collier1
1 Language Technology Lab, University of Cambridge
2 Faculty of Economics, University of Cambridge
3 Tehran Institute for Advanced Studies, Iran
{cc918,jb2088}@cam.ac.uk

Abstract

We present a new challenging news dataset that targets both stance detection (SD) and fine-grained evidence retrieval (ER). With its 3,291 expert-annotated articles, the dataset constitutes a high-quality benchmark for future research in SD and multi-task learning. We provide a detailed description of the corpus collection methodology and carry out an extensive analysis of the sources of disagreement between annotators, observing a correlation between their disagreement and the diffusion of uncertainty around a target in the real world. Our experiments show that the dataset poses a strong challenge to recent state-of-the-art models. Notably, our dataset aligns with an existing Twitter SD dataset: their union thus addresses a key shortcoming of previous work, by providing the first dedicated resource to study multi-genre SD as well as the interplay of signals from social media and news sources in rumour verification.

1 Introduction

Starting from early work by Agrawal et al. (2003), Stance Detection (SD) has gained increasing interest from the research community (Zubiaga et al., 2018a). Recent work in SD has mostly focused on modeling user-generated data (Mohammad et al., 2017; Küçük and Can, 2020). However, SD on complex and articulated texts, such as news articles, has been considerably less studied, mainly due to the scarcity of published datasets (Pomerleau and Rao, 2017; Hanselowski et al., 2019). Moreover, research on user-generated SD and news SD has proceeded on parallel and independent tracks, neglecting the deep mutual influence that exists between social media and news sources (Canter, 2015; Kostkova et al., 2017).

In this paper, we seek to fill this gap, introducing STANDER (STANce Detection & Evidence Retrieval), a new expert-annotated dataset which is labeled for both news SD and fine-grained ER. STANDER collects news articles in English from high-reputation sources which discuss four recent mergers and acquisitions (M&A) operations between major healthcare companies in the US (Table 1). The term M&A refers to the process by which the ownership of a company (the target) is transferred to another company (the buyer). An M&A process (merger) ranges from informal talks between the companies to the closing of the deal; high secrecy is involved and discussions are usually not publicly disclosed during its early stages (Bruner and Perella, 2004). Thus, the analysis of the evolution of opinions and concerns about a potential merger is a process similar to rumor verification (Zubiaga et al., 2018b).

Notably, the news articles in STANDER discuss the same targets as in WT–WT (Conforti et al., 2020), a large Twitter SD dataset: thus, their union provides aligned signals from both authoritative (articles) and user-generated (tweets) sources, constituting the first resource of this kind for SD.
In this paper, we make the following contributions: (1) We construct STANDER, a large expert-annotated news dataset[1] labeled for SD and fine-grained ER. To our knowledge, it is the first news SD dataset to provide evidence snippets, along with their exact location in the corresponding article. (2) We provide detailed statistics of our data, as well as the first diachronic analysis of the sources of disagreement among annotators in an SD paper, shedding light on the potential correlation between uncertainty in the world and increased ambiguity in journalistic prose. This suggests that considering SD in a controlled domain, such as mergers, could allow model builders to develop deeper insights into the factors influencing model performance. (3) We report results obtained for several state-of-the-art models on our dataset, and show that STANDER constitutes a challenging benchmark for future research in SD, ER and multi-task learning. (4) We provide a correlation analysis of the articles from STANDER and the tweets from WT–WT, observing a moderately strong correlation. While the interplay between social media and news sources has been widely studied in other research fields, such as journalism studies (Johnson et al., 2018; Orellana-Rodriguez and Keane, 2018), very little work exists in computer science (Dredze et al., 2016), and notably, none considering SD.

[1] https://github.com/cambridge-wtwt/emnlp2020-stander-news  Data is released according to Factiva (https://library.princeton.edu/resource/3791) and the University of Cambridge's data policy.

Merger    Buyer    Target           Outcome
AET HUM   Aetna    Humana           rejected
ANTM CI   Anthem   Cigna            rejected
CI ESRX   Cigna    Express Scripts  succeeded
CVS AET   CVS      Aetna            succeeded

Table 1: Mergers considered in this work. Note that two companies appear both as Buyer and as Target.

2 Background

The Task. SD is the task of automatically identifying the opinion expressed in a text with respect to a target (Mohammad et al., 2017). Note that SD constitutes a related, but different, task from both sentiment analysis and textual entailment. The first considers the emotions conveyed in a text (Alhothali and Hoey, 2015; Tang et al., 2016), while in the second the goal is to predict whether a logical implication exists between two sentences (Bowman et al., 2015). Consider the following example:

• Target: Aetna will merge with Humana
• Text: Aetna & Humana CEOs met again to talk about deal, can't stand those bla-bla people!!!

The text's sentiment is negative, as the author is complaining about the meeting; concerning entailment, it is positive: the target entails the text because, in order to merge, two companies need to discuss the deal; finally, its stance is commenting, as it is just talking about the merger, without expressing the orientation that it will happen (or not).

SD as a Sub-Task. SD is often integrated into rumor verification (Zubiaga et al., 2018b), as testified by popular shared tasks (Derczynski et al., 2017; Gorrell et al., 2018). Starting from Vlachos and Riedel (2014), SD has been identified as a key step in fake news detection (Lillie and Middelboe, 2019) and automated fact-checking (Popat et al., 2017; Thorne and Vlachos, 2018; Baly et al., 2018): in this context, textual entailment is sometimes preferred to SD as the penultimate sub-step before verification (Thorne et al., 2018).
Twitter SD. Traditionally, research on SD focused on user-generated data, such as blogs and commenting sections on websites (Skeppstedt et al., 2017; Hercig et al., 2017), apps (Vamvas and Sennrich, 2020), and Facebook posts (Klenner et al., 2017); above all, mainly due to the handiness of its API, Twitter was used as a data source (Mohammad et al., 2016; Zubiaga et al., 2016; Inkpen et al., 2017; Aker et al., 2017; Conforti et al., 2020).

News SD. At the time of writing, only a very small number of SD datasets collecting news have been released, usually building on platforms originally developed by professional journalists, like Emergent (Ferreira and Vlachos, 2016) or Snopes (Hanselowski et al., 2019). Note that in Twitter SD, the task consists of defining the stance of a tweet with respect to a short target (usually a named entity like Hillary Clinton (Inkpen et al., 2017), or a concept like feminism (Mohammad et al., 2016)); in news SD, on the contrary, the input article is much longer than a tweet, and the target is a complete sentence (Hanselowski et al., 2018a).

Comparison with corpora for News SD. EMERGENT (Ferreira and Vlachos, 2016) constitutes the first released corpus for news SD: it collects 300 targets and 2,595 articles (with an average of 8.6 articles/target), labeled using a 3-class classification schema. For the first edition of the Fake News Challenge (Pomerleau and Rao, 2017), it was enriched with randomly generated unrelated samples. Neither of the two corpora is annotated with evidence. To our knowledge, the only other news dataset to be annotated for both SD and ER is that of Hanselowski et al. (2019), which annotates fact-checking instances from the debunking website Snopes[2]. Our work differs in a number of aspects:

• Statistics. While Snopes is larger in size, it provides relatively few samples per target (14,296 samples and 6,422 targets, with an average of 2.22 articles/target); STANDER, in contrast, collects 3,291 articles on 4 targets, with an average of more than 800 articles/target.
• Annotators. Snopes is annotated by crowdsourcing; we employ domain-expert annotators.
• Evidence Annotations. Snopes provides entire sentences as evidence; importantly, STANDER is the first to provide the exact start and end indices of evidence snippets inside the sentences (Figure 4): this will enable future research on more fine-grained evidence extraction.
• Multi-Genre. At the time of writing, almost all released SD corpora collect data from one genre, with a prevalence of user-generated content. Snopes constitutes the only exception, as some of the collected documents (11%) come from Facebook or Twitter. Note, however, that they do not provide aligned signals from news and user-generated sources for all considered targets, but only for a limited portion of them. In contrast, our news dataset is the first to completely align with an existing resource for Twitter SD, providing a relevant amount of samples from two genres for all considered targets (Section 6).

This will open a number of interesting research directions: while adversarial domain adaptation – using data of the same news genre, but from another domain – proved to be useful for news SD (Xu et al., 2019), the impact of considering data of the same domain but from another genre has never been studied in SD.

[2] https://www.snopes.com/about-snopes/
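To make the fine-grained evidence annotation discussed above concrete, the sketch below shows how a single STANDER sample with character-level evidence offsets could be represented in Python. The field names and offset values are illustrative only (the released format is documented in Appendix A.3); this is not an exact rendering of the distributed files.

```python
# Illustrative only (not the exact released schema): one article annotated with
# a stance towards a target merger and a character-level evidence span.
sample = {
    "merger": "AET HUM",     # target merger (Aetna / Humana)
    "stance": "support",     # one of: support / refute / comment / unrelated
    "title": "Aetna to Acquire Humana for $37 Billion",
    "paragraphs": [
        "Aetna and Humana today announced that they have entered into a "
        "definitive agreement under which Aetna will acquire all outstanding "
        "shares of Humana ...",
    ],
    "evidence": [
        # paragraph index plus start/end character offsets of the snippet
        # within that paragraph (offset values here are only indicative)
        {"paragraph": 0, "start": 0, "end": 63},
    ],
}
```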
3 Building the Dataset

In this section, we report on data collection and annotation, and provide a detailed analysis of the findings from the annotation process.

3.1 Data Retrieval

We consider four recent mergers involving US companies in the healthcare industry (Table 1). To retrieve news articles related to the mergers, we used Factiva (Johal, 2009), a database by Dow Jones which collects more than 32,000 general and finance-specific sources, including newspapers, journals and magazines. For each merger, we searched for the involved companies and selected articles in English tagged as Acquisitions/Mergers/Shareholdings. We retrieved articles from one month before the first contact of the firms up to one month after any final decision on the merger. Refer to Appendix A for details on the crawl settings and the crawling timeline.

Figure 1: Inter-rater agreement (normalized).

3.2 Annotation Guidelines

The annotation process was initiated by a pilot, after which the annotation guidelines were written in close collaboration with three domain experts. Extracts from the annotation guidelines are reported in Appendix A.

Stance Annotation. Following Pomerleau and Rao (2017), we consider four stance labels:
1. Support: the article is voicing confidence that the two companies will merge.
2. Refute: the article is voicing doubts that the two companies will merge.
3. Comment: the article is talking about the merger, neither directly supporting nor refuting it.
4. Unrelated: the article is unrelated to the merger. Note that the article might be talking about one or both of the considered companies, but without discussing their merger.

Evidence Annotation. In addition to the stance label, annotators were asked to select the text snippets or sentences from the article which were decisive for them in classifying its stance, which we refer to as evidence (Thorne and Vlachos, 2018).

3.3 Data Annotation

In line with previous work on news SD (Vlachos and Riedel, 2014; Ferreira and Vlachos, 2016), in which data was labeled by professional journalists, we rely on domain experts for annotation. Specifically, we provided articles to eight economists[3] in batches and asked them to annotate no more than 100 articles per day[4]; the annotation process lasted 4 months. Each article was independently labeled by 2 to 4 annotators.

[3] Six PhD students and two lecturers in Economics (Faculty of Economics, University of Cambridge).
[4] Reported annotation speed is ~55 articles/hour; annotators were asked not to spend more than 2 hrs/day on the task.

To aggregate stance labels, we used majority voting. For evidence snippets, we merged the provided snippets to obtain a list of selected evidence snippets; a further annotator, who did not take part in the first phase, manually checked the overlapping snippets.
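A minimal sketch of the majority-vote aggregation described above is given below. The function is illustrative, and the tie-handling behaviour (returning None for further adjudication) is an assumption, since the paper does not specify how ties between annotators were resolved.

```python
from collections import Counter

def aggregate_stance(labels):
    """Aggregate 2-4 independent stance labels by majority vote.

    `labels` is a list such as ["support", "comment", "support"].
    Returns the majority label, or None when no strict majority exists
    (such ties would need a further adjudication step).
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no strict majority
    return counts[0][0]

# Example: three annotators, one disagreeing
print(aggregate_stance(["support", "comment", "support"]))  # -> "support"
```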
3.4 Analysis of Annotators' Disagreement

The most common source of disagreement between annotators is on support/comment (Figure 1): note that, sometimes, the given stance depends on subtle nuances in the article's argumentative structure and is therefore somewhat subjective; such samples are difficult to discriminate for ML systems as well (Riedel et al., 2017). With respect to datasets with randomly generated unrelated samples (Pomerleau and Rao, 2017; Hanselowski et al., 2018b), we report a slightly higher unrelated/comment disagreement between annotators, which reflects the higher complexity of the task in our setting.

To further understand the sources of disagreement between annotators, we perform a diachronic analysis of the samples which received different labels and their time of publication. As shown in Figure 2, a correlation exists between some relevant events (such as the first joint press release) and the number of articles published. However, a higher volume of articles does not always correlate with higher disagreement rates between annotators: interestingly, it seems that some events (such as the merger agreement) spread more uncertainty around the merger than others (such as the start of the antitrust trial). This uncertainty is transmitted to the press, resulting in journalists writing speculative articles: such articles seem to be more prone to the reader's subjective biases, eventually producing a higher inter-annotator disagreement. The interplay of different layers of uncertainty until the resolution of the event (i.e. confirmation of merger talks or the complaint before the DOJ) makes our domain choice particularly insightful for model builders.

Figure 2: Timestamp of publication of articles whose stance annotators disagreed on (AET HUM merger).

3.5 Quality Assessment

To assess the quality of our dataset, we asked a domain expert to annotate a random 10% of the samples, which are used as an upper bound for evaluation. First, she received targets together with the gold evidence snippets selected in the first annotation round; in a second phase, the same annotator received the complete articles and was asked to re-annotate the samples. In the former setting, and similar to Hanselowski et al. (2019), we wanted to assess whether the selected evidence snippets alone are sufficient to provide the correct stance: the Cohen's κ between those labels and the gold is 75.2, which is substantial (Cohen, 1960) and reflects the good quality of the extracted snippets. The Cohen's κ obtained when considering the entire article texts is 59.5 (moderate).

This drop indicates that: (1) SD on long, unstructured texts is complex and more prone to subjective biases than SD on evidence snippets; interestingly, a similarly low inter-annotator agreement (Fleiss' κ of 0.55, (Fleiss, 1971)) has been observed also for the related news articles in the Fake News Challenge dataset (Hanselowski et al., 2018a), which does not contain evidence annotations; unfortunately, Hanselowski et al. (2019) do not report on the agreement considering the entire sample texts; (2) therefore, providing evidence annotation is fundamental to building a reliable dataset that can be used to train supervised stance classifiers.

4 Corpus Analysis

4.1 Desiderata and Challenges

Notably, STANDER satisfies all four desired properties outlined in Mohammad et al. (2017):
1. Topics should be commonly understood by a wide number of people. We consider some of the major US healthcare providers, with which almost everyone has interacted at different levels (insurers, pharmacy chains, ...): thus, not only finance experts (example (a) in Figure 3) and local sources (b), but also politicians (c), physicians (d), policymakers (e) and the general public are interested in their outcome, resulting in a dataset which collects different registers.
2. The topics convey different opinions, producing a significant amount of data for all stance labels.
The considered mergers are controversial, because their outcome might change the US healthcare landscape; moreover, as they happened during the change from the Obama to the Trump administration, with the introduction and partial rollback of Obamacare, there is considerable interference with politics (f).
3. The dataset contains indirect references to the targets, as when the involved companies are not explicitly mentioned: for example, given a merger between A and B, if a source states that A is in talks with C, this implicitly undermines the likelihood of the A-B merger happening (g).
4. The dataset contains samples where the target of opinion is different. This is the case of articles that discuss one or both companies without taking a stance on their merger (h).

(a) Aetna shares rose 1.3% premarket after climbing 10% just before the market closed Thursday following a Wall Street Journal report that CVS Health is in talks to buy the insurer [...]
(b) Commercial real estate office experts [...] agree that the [...] planned acquisition [...] by Connecticut-based Aetna could have a significant negative impact on Humana's major office footprint in [...] Louisville.
(c) Rep. EG and state Senator DH are asking the state insurance commissioner to receive a guarantee of zero job reductions within Humana's state locations if its proposed merger with Aetna proceeds.
(d) February survey of physicians [...] found that 28% are so concerned by the potential merger that they would be likely to retire early [...] said the CMS [Colorado Medical Society] president.
(e) Justice Department attorneys, arguing before a federal court judge on Monday, contended that Aetna's (AET) planned acquisition of Humana (HUM) violated antitrust law [...]
(f) The second thing [...] Aetna must persuade Bates of is that [...] the merger won't harm individuals who receive their health coverage through Obamacare.
(g) Meanwhile, UnitedHealth is said to be interested in scooping up Aetna.
(h) Aetna, with eye on regulators, sells Medicare drug business to WellCare [...]
(i) Besides the possible Anthem deal, Humana is considering a sale, possibly to Cigna or Aetna.
(j) Even amid the Anthem talks [...] Cigna continues to examine a potential purchase of Louisville-based Humana Inc., people familiar with the matter said.

Figure 3: Example snippets from STANDER.
Moreover, as the mergers happened simultaneously, there is considerable interplay between companies; a successful classifier thus requires modeling the deep relationship between the target merger and the article, not just simple keyword matching (i, j).

In addition, the task is challenging as the underlying argumentative structure is needed in order to correctly classify the article. Considering the support example in Figure 4, both the title and the body contain the same information. In the comment example, the evidence is in the title, while the body provides additional information. In the refute example, the evidence is in the body, while the title does not contain information regarding the stance. These characteristics contribute to making STANDER a challenging benchmark for news SD.

Stance: support | Target: AET HUM
Title: Aetna to Acquire Humana for $37 Billion
Body: Aetna (NYSE: AET) and Humana Inc. (NYSE: HUM) today announced that they have entered into a definitive agreement under which Aetna will acquire all outstanding shares of Humana for a combination of cash and stock [...]

Stance: comment | Target: CI ESRX
Title: Cigna's Purchase of Express Scripts Unlikely to Affect Workers' Comp
Body: According to Joe Paduda, principal of Health Strategy Associates, these kinds of purchases don't really impact worker's comp stakeholders. "Health plans and PBMs are merging to better control care delivery and cost," [...]

Stance: refute | Target: CVS AET
Title: Health Care up Amid Deal Activity
Body: A federal judge voiced concern about the Justice Department's decision to allow CVS Health's nearly USD 70 billion acquisition of Aetna, and said he may require CVS to hold Aetna's assets separately while he considers the settlement between the companies and the government [...]

Figure 4: A supporting, a commenting and a refuting sample from STANDER (evidence snippets underlined).

4.2 Corpus Statistics

Dataset Statistics. The final dataset collects 3,291 labeled news articles from heterogeneous news outlets (Figure 5): while finance-specific publications constitute the majority of the most frequent sources, the corpus also contains many general newspapers (such as Reuters News or The New York Times) as well as local journals (such as the Louisville Business First). News articles present an asymmetric and hierarchical structure: they are formed by a concise and short title and a (usually) long body, which in turn is composed of ordered paragraphs (Table 2). Note that, while articles might be very long (Figure 8), evidence snippets are usually located in the title or in the first few paragraphs (Figure 6). This is in line with the inverted pyramid (Scanlan, 2000; Pöttker, 2003) or summary news lead (Errico et al., 1997) style – widely adopted in modern journalistic prose – in which the most relevant information is concentrated at the beginning of the article.

avg articles/source     11.2
avg paragraphs/body     24.4
avg tokens/title        12.8
avg tokens/paragraph    24.3
avg tokens/body        592.1
avg evidence/article     1.9
avg tokens/evidence     22.7

Table 2: Relevant statistics from the STANDER corpus.

          support         refute          comment         unrelated       Total
          #samples   %    #samples   %    #samples   %    #samples   %
CVS AET      372   46.4      104   12.9      294   36.7       31    3.8     831
CI ESRX      207   59.8       64   18.4       70   20.2        5    1.4     376
ANTM CI      367   31.4      537   46.0      248   21.2       14    1.2   1,199
AET HUM      463   47.3      313   32.0      197   20.1        5    0.5   1,009
Total      1,409            1,018             809              55

Table 3: Label distribution across different mergers in the STANDER corpus (refer to Table 1).

Label Distribution. A clear correlation can be observed between the merger's outcome (blocked/succeeded) and the relative proportion of supporting and refuting samples (Table 3). Contrary to many popular SD datasets (Derczynski et al., 2017; Pomerleau and Rao, 2017; Hanselowski et al., 2018b), the related labels present a relatively balanced distribution: this is in line with property (2) (Section 4.1); however, in contrast to Mohammad et al. (2017), who employed query keywords to "force" it, such a balanced distribution arose naturally from our data.

Figure 5: 15 most frequent news sources in the dataset.
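For readers who want to reproduce Table 3-style counts, the snippet below sketches how the per-merger label distribution could be recomputed from the released data. The file name and field names are assumptions loosely based on the fields listed in Appendix A.3, not the exact released schema.

```python
import json
from collections import Counter

# Hypothetical path and field names ("merger", "stance"), loosely following
# the fields described in Appendix A.3.
with open("stander.json") as f:
    samples = json.load(f)

counts = Counter((s["merger"], s["stance"]) for s in samples)

for merger in ["CVS AET", "CI ESRX", "ANTM CI", "AET HUM"]:
    row = {lab: counts[(merger, lab)]
           for lab in ["support", "refute", "comment", "unrelated"]}
    print(merger, row, "total:", sum(row.values()))
```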
5 Baselines and Discussion

This section provides results for a number of recent techniques. While more complex models could possibly achieve better results, our aim was to set baselines for our dataset with a number of strong models. A detailed description of the experimental setting is provided in Appendix B and C for replication.

5.1 Experiments

Models. We consider two dummy baselines – a random and a majority vote baseline – and, following Hanselowski et al. (2019), three neural baselines: BertEmb, an MLP leveraging Sentence-BERT embeddings (Reimers and Gurevych, 2019); UseEmb, an MLP leveraging Universal Sentence Encoder's sentence embeddings (Cer et al., 2018); and a BiLSTM over Glove embeddings (Pennington et al., 2014). As upper bound, we consider the performance of a domain expert against the aggregated gold data (see Section 3, Quality Assessment, for further details).

Experimental Setting. We first test the models' ability to perform SD given the correct set of sentences which contain an evidence snippet (SD in isolation). Secondly, we consider both SD and ER: while the tasks could be approached with a pipelined strategy (as in Thorne et al. (2018)), we follow a multi-task training approach, which has proven to be more effective (Yin and Roth, 2018). When jointly training, we employ a simple ER strategy, by taking the title and the first 4 paragraphs from each article as candidates. We train in a cross-target setting (train on three mergers, test on the fourth), and consider two training settings: first, we select only related samples, which present a balanced distribution (Table 3); then, we consider all stances: this is more difficult because unrelated samples are very infrequent, resulting in a skewed distribution as in RumourEval (Derczynski et al., 2017; Gorrell et al., 2018). To account for performance fluctuations (Reimers and Gurevych, 2017), we run 5 simulations for each model and take the average of the results. We leave the identification of the evidence's indices in the sentences, as well as the usage of more sophisticated ER methods, to future work.

Figure 6: Distribution of the evidence locations.

                         Stance Detection: F1 across mergers    avg Stance Detection     avg Evidence Retrieval
Model                    CVS AET  CI ESRX  ANTM CI  AET HUM     avgP   avgR   avgF1      avgP@5  avgR@5

3 classes (only related)
Random Base                25.0     24.3     26.0     24.5      25.3   25.3   25.0        15.3    08.2
Majority Base              15.2     15.2     15.2     15.2      10.9   25.0   15.2        58.3    46.1
BiLSTM                     44.1     67.2     46.5     60.2      64.0   56.6   52.7          –       –
UseEmb                     44.0     59.6     55.2     56.4      59.3   55.5   53.3          –       –
BertEmb                    47.4     55.6     50.1     59.4      56.6   56.6   52.8          –       –
BiLSTM (+ER)               46.4     60.8     56.5     55.5      60.9   57.0   54.2        54.6    57.7
UseEmb (+ER)               47.8     54.4     48.3     58.1      57.6   54.8   51.8        56.4    58.5
BertEmb (+ER)              54.2     70.0     52.8     60.3      63.5   59.6   57.3        54.1    53.8

4 classes (+ unrelated)
Random Base                17.5     17.4     17.1     16.5      19.6   19.8   17.1        15.1    07.9
Majority Base              12.0     12.0     12.0     12.0       8.6   20.0   12.0        58.0    46.7
BiLSTM                     38.8     42.9     42.8     43.8      46.2   43.9   42.1          –       –
UseEmb                     35.8     33.2     39.7     43.3      44.0   40.6   39.1          –       –
BertEmb                    42.5     33.2     46.4     43.9      50.5   45.6   43.2          –       –
BiLSTM (+ER)               40.2     35.1     41.1     43.8      44.4   42.4   41.0        55.4    57.1
UseEmb (+ER)               31.8     36.1     35.5     43.0      41.6   39.6   36.9        56.9    57.4
BertEmb (+ER)              47.3     53.6     45.3     41.8      51.7   47.8   45.7        54.2    55.0

Upper Bound                72.3     85.2     64.2     75.6      72.9   73.2   71.9          –       –

Table 4: Results of baseline experiments on Stance Detection (SD), both in isolation and jointly with Evidence Retrieval (+ER). We consider both SD on the complete stance tagset (4 classes) and on only the related classes (3 classes; note that in this case the sample distribution is balanced). Macro F1 refers to testing on the target merger while training on the other three. Performances over all operations are averaged weighting by merger's size.

Evaluation. We follow recent work (Thorne et al., 2018; Hanselowski et al., 2018b, 2019) and consider macro-averaged precision, recall and F1 for SD, and precision and recall on the 5 selected evidence candidates (P@5 and R@5) for ER.
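The SD metrics can be computed directly with scikit-learn (as noted in Appendix C); the ER scores below are a simplified, set-based stand-in for the FEVER-style P@5/R@5 scorer referenced there, written only to make the evaluation concrete rather than to reproduce that scorer exactly.

```python
from sklearn.metrics import precision_recall_fscore_support

def sd_scores(y_true, y_pred):
    # Macro-averaged precision, recall and F1 for stance detection.
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return p, r, f1

def er_precision_recall_at_5(retrieved, gold):
    # Set-based precision/recall over the (at most) 5 retrieved evidence
    # candidates for one article; a simplified stand-in for the FEVER-style
    # scorer used in the paper.
    top5 = list(retrieved)[:5]
    gold = set(gold)
    hits = sum(1 for cand in top5 if cand in gold)
    precision = hits / len(top5) if top5 else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```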
5.2 Results and Error Analysis

Results of the experiments are reported in Table 4. As expected, we observe a drop in performance when moving from the 3-class (only related) setting to the full 4-class tagset. While all considered models obtain significant gains over the two dummy baselines, the BertEmb model – as observed also in Hanselowski et al. (2019) – obtains the best results overall for SD. Note, however, the wide gap between BertEmb's performance and the upper bound, which confirms the difficulty of our dataset. Considering ER results, we observe a smaller gap in performance between models, with UseEmb obtaining the best results overall.

Interestingly, we observe a gain in stance classification when BertEmb is jointly trained to perform both SD and ER: this seems to indicate that, by learning to classify whether an input sentence constitutes an evidence snippet or not, the system is indirectly gathering knowledge which is also useful to solve the SD task. An error analysis of BertEmb's predictions shows that most mis-classifications happen between the comment and the support labels: this is in line with findings from both previous work (Riedel et al., 2017) and the analysis of the inter-annotator agreement (Section 3). A relatively high number of comment samples are also mis-classified as refute: note that – while in news SD corpora refuting samples coming from popular newspapers can sometimes be easily spotted by the presence of words such as fake, hoax, or similar – STANDER contains articles from high-reputation sources, which usually do not use sensationalist language.

6 Integrating News and Twitter Signal

As outlined in the Introduction, STANDER contains the same targets as the Twitter SD WT–WT corpus (Conforti et al., 2020). The union of both corpora thus provides a great opportunity for studying the interplay between authoritative and user-generated signals: the first refers to long and articulated texts written by professional journalists, while the second refers to a very abundant but potentially noisy stream of posts, which are published without any editorial review. While a detailed time series analysis (Lim and Tucker, 2019) is beyond the scope of this paper, we provide a first data description and a correlation analysis, which show the potential of the obtained aligned corpus and the challenges it may pose to future research.

6.1 Statistical Analysis

The relative frequency of samples between mergers is similar in both the news and the Twitter signals (Figure 7), with CI ESRX being the less popular target (refer to Conforti et al. (2020) for a detailed analysis of the WT–WT corpus). The same holds true for the relative distribution of related labels, with refuting samples being more frequent in the case of blocked mergers.

Figure 7: Label distribution across the considered mergers in the news (left) and Twitter dataset (right).

However, there are a number of differences between the two signals: notably, the Twitter signal presents a high number of noisy unrelated samples, which is not surprising when dealing with user-generated data (Zubiaga et al., 2015); we also observe a higher proportion of commenting samples, which has often been observed in financial microblogging (Žnidaršič et al., 2018). On the contrary, the news signal is cleaner, but around one order of magnitude less abundant (Figure 7).
Apart from this asymmetry in label distribution, a further asymmetry in length can be observed between the corpora: tweets tend to be short and compact, while pieces of news are long and articulated (Figure 8), thus posing interesting challenges for future work on multi-genre SD.

Figure 8: Asymmetry in length in STANDER (left) and in the WT–WT Twitter SD corpus (right).

6.2 Signal Correlation

A diachronic analysis of the volume of tweets and articles discussing CVS AET (Figure 9) shows a relatively similar distribution between the two signals, with some notable differences. While the Twitter signal presents some constant but minor activity from the very beginning of the process, the news signal remains completely silent until the companies' views are reported by a major news outlet. For some of the mergers, we even observe a notable spike in the Twitter activity before, but close to, the first mention of the merger in the press (see the analysis of the ANTM CI merger in Appendix D). This is in line with studies on the usage of social media, especially Twitter, as sources of information for journalists (Van Leuven and Deprez, 2017; Rony et al., 2018; Johnson, 2019).

Figure 9: Volume of tweets and news over time for the CVS AET merger (for further visualizations see Appendix D).

As reported in Table 5, the two signals exhibit moderate levels of correlation, which is further increased when only considering related tweets. This follows from the observation that large spikes in both the Twitter and the news signal occur around dates of milestones within the merger process (Bruner and Perella, 2004; Piesse et al., 2013) and that many of the unrelated tweets occur before the first news article is published, when no activity is registered for the news signal (see Appendix D for further discussion).

Merger    all stances        only related       obs. (days)
AET HUM   0.5527 (0.0244)    0.6116 (0.0220)       815
ANTM CI   0.4793 (0.0230)    0.5535 (0.0207)     1,124
CI ESRX   0.4878 (0.0350)    0.5398 (0.0326)       475
CVS AET   0.6260 (0.0235)    0.6470 (0.0225)       671

Table 5: Spearman correlation and approximate standard errors between the Twitter and the news signals.

7 Conclusions and Future Work

We presented STANDER, a new expert-annotated resource for news SD and ER. We provided a detailed description of the annotation process and corpus statistics, as well as of the findings from the annotation process. Our experiments with a set of strong models indicated a consistent (up to 30%) performance gap between SoA models and the human upper bound: this proves that our corpus constitutes a strong challenge and leaves plenty of room for future work on news SD, ER, domain adaptation and multi-task training.

Moreover, our corpus enables future research in a number of new areas, including: fine-grained ER for news SD – where the goal is not only to retrieve evidence snippets, but also their exact location in the text – which goes in the direction of improving the interpretability of a model's predictions; and multi-genre SD – due to the fact that our corpus aligns with an existing resource for Twitter SD – which would open new interesting scenarios in the wider field of rumour verification.

Acknowledgments

We thank the anonymous reviewers of this paper for their efforts and for the constructive comments and suggestions. We gratefully acknowledge funding from the Keynes Fund, University of Cambridge (grant no. JHOQ). CC is grateful to NERC DREAM CDT (grant no. 1945246) for partially funding this work.
CG and FT are thankful to the Cambridge Endowment for Research in Finance (CERF). References Rakesh Agrawal, Sridhar Rajagopalan, Ramakrishnan Srikant, and Yirong Xu. 2003. Mining newsgroups using networks arising from social behavior. In Pro- ceedings of the Twelfth International World Wide Web Conference, WWW 2003, Budapest, Hungary, May 20-24, 2003, pages 529–535. ACM. Ahmet Aker, Arkaitz Zubiaga, Kalina Bontcheva, Anna Kolliakou, Rob Procter, and Maria Liakata. 2017. Stance classification in out-of-domain ru- mours: A case study around mental health disorders. In Social Informatics - 9th International Conference, SocInfo 2017, Oxford, UK, September 13-15, 2017, Proceedings, Part II, volume 10540 of Lecture Notes in Computer Science, pages 53–64. Springer. Areej Alhothali and Jesse Hoey. 2015. Good news or bad news: Using affect control theory to analyze readers’ reaction towards news articles. In Proceed- ings of the 2015 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1548–1558, Denver, Colorado. Association for Com- putational Linguistics. Ramy Baly, Mitra Mohtarami, James R. Glass, Lluı´s Ma`rquez, Alessandro Moschitti, and Preslav Nakov. 2018. Integrating stance detection and fact check- ing in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Pa- pers), pages 21–27. Association for Computational Linguistics. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large an- notated corpus for learning natural language infer- ence. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642. The Association for Compu- tational Linguistics. 4095 Robert F Bruner and Joseph R Perella. 2004. Applied mergers and acquisitions, volume 173. John Wiley & Sons. Lily Canter. 2015. Personalised tweeting: The emerg- ing practices of journalists on twitter. Digital Jour- nalism, 3(6):888–907. Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Con- stant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175. Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological mea- surement, 20(1):37–46. Costanza Conforti, Jakob Berndt, M. Taher Pilehvar, Chryssi Giannitsarou, Flavio Toxvaerd, and Nigel Collier. 2020. Will-they-won’t-they: A very large dataset for stance detection on twitter. In Proceed- ings of the 2020 Annual Conference of the Associa- tion for Computational Linguistics. Association for Computational Linguistics. Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for ru- mours. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, pages 69–76. Mark Dredze, Prabhanjan Kambadur, Gary Kazantsev, Gideon Mann, and Miles Osborne. 2016. How twit- ter is changing the nature of financial news discov- ery. 
In Proceedings of the Second International Workshop on Data Science for Macro-Modeling, DSMM@SIGMOD 2016, San Francisco, CA, USA, June 26 - July 1, 2016, pages 2:1–2:5. ACM. Marcus Errico, J April, A Asch, L Khalfani, M Smith, and X Ybarra. 1997. The evolution of the summary news lead. Media History Monographs, 1(1). William Ferreira and Andreas Vlachos. 2016. Emer- gent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, pages 1163–1168, San Diego, California. Associa- tion for Computational Linguistics. Joseph L Fleiss. 1971. Measuring nominal scale agree- ment among many raters. Psychological bulletin, 76(5):378. Genevieve Gorrell, Kalina Bontcheva, Leon Derczyn- ski, Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. 2018. Rumoureval 2019: Determining rumour veracity and support for rumours. CoRR, abs/1809.06683. Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018a. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th International Conference on Computational Lin- guistics, pages 1859–1874, Santa Fe, New Mexico, USA. Association for Computational Linguistics. Andreas Hanselowski, Avinesh P. V. S., Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018b. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th In- ternational Conference on Computational Linguis- tics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1859–1874. Association for Computational Linguistics. Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych. 2019. A richly anno- tated corpus for different tasks in automated fact- checking. In Proceedings of the 23rd Confer- ence on Computational Natural Language Learning (CoNLL), pages 493–503, Hong Kong, China. Asso- ciation for Computational Linguistics. Toma´sˇ Hercig, Peter Krejzl, Barbora Hourova´, Josef Steinberger, and Ladislav Lenc. 2017. Detecting stance in czech news commentaries. In Proceed- ings of the 17th ITAT: Slovenskocesky` NLP work- shop (SloNLP 2017), volume 1885, pages 176–180. Diana Inkpen, Xiaodan Zhu, and Parinaz Sobhani. 2017. A dataset for multi-target stance detection. In Proceedings of the 15th Conference of the Euro- pean Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 551–557. As- sociation for Computational Linguistics. Rajiv Johal. 2009. Factiva: Gateway to business infor- mation. Journal of Business & Finance Librarian- ship, 15(1):60–64. Michiel Johnson. 2019. Sourcing Twitter: a multi- methodological study on the role of Twitter in eco- nomic journalism. Ph.D. thesis, University of Antwerp. Michiel Johnson, Steve Paulussen, and Peter Van Aelst. 2018. Much ado about nothing? the low importance of twitter as a sourcing tool for economic journalists. Digital Journalism, 6(7):869–888. Manfred Klenner, Don Tuggener, and Simon Clematide. 2017. Stance detection in facebook posts of a german right-wing party. In Proceed- ings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, LSDSem@EACL 2017, Valencia, Spain, April 3, 2017, pages 31–40. Association for Computational Linguistics. 
4096 Patty Kostkova, Vino Mano, Heidi J. Larson, and William S. Schulz. 2017. Who is spreading rumours about vaccines? influential user impact modelling in social networks. In Proceedings of the 2017 Interna- tional Conference on Digital Health, DH ’17, page 48–52, New York, NY, USA. Association for Com- puting Machinery. Dilek Ku¨c¸u¨k and Fazli Can. 2020. Stance detection: A survey. ACM Comput. Surv., 53(1). Anders Edelbo Lillie and Emil Refsgaard Middelboe. 2019. Fake news detection using stance classifica- tion: A survey. CoRR, abs/1907.00181. Sung Hoon Lim and Conrad S. Tucker. 2019. Mining twitter data for causal links between tweets and real- world outcomes. Expert Syst. Appl. X, 3:100007. Edward Loper and Steven Bird. 2002. Nltk: the natural language toolkit. arXiv preprint cs/0205028. Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, pages 56 – 61. Saif Mohammad, Svetlana Kiritchenko, Parinaz Sob- hani, Xiao-Dan Zhu, and Colin Cherry. 2016. A dataset for detecting stance in tweets. In Proceed- ings of the Tenth International Conference on Lan- guage Resources and Evaluation LREC 2016, Por- torozˇ, Slovenia, May 23-28, 2016. European Lan- guage Resources Association (ELRA). Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2017. Stance and sentiment in tweets. ACM Trans. Internet Techn., 17(3):26:1–26:23. Claudia Orellana-Rodriguez and Mark T. Keane. 2018. Attention to news and its dissemination on twitter: A survey. Computer Science Review, 29:74 – 94. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- esnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word rep- resentation. In Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 1532–1543. Jenifer Piesse, Cheng-Few Lee, Lin Lin, and Hsien- Chang Kuo. 2013. Merger and acquisition: Defini- tions, motives, and market responses. Encyclopedia of Finance, pages 411–420. Dean Pomerleau and Delip Rao. 2017. Fake news chal- lenge. Kashyap Popat, Subhabrata Mukherjee, Jannik Stro¨tgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, April 3-7, 2017, pages 1003–1012. Horst Po¨ttker. 2003. News and its communicative qual- ity: The inverted pyramid—when and why did it ap- pear? Journalism Studies, 4(4):501–511. Nils Reimers and Iryna Gurevych. 2017. Report- ing score distributions makes a difference: Perfor- mance study of lstm-networks for sequence tag- ging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9- 11, 2017, pages 338–348. Association for Computa- tional Linguistics. Nils Reimers and Iryna Gurevych. 2019. Sentence- bert: Sentence embeddings using siamese bert- networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Benjamin Riedel, Isabelle Augenstein, Georgios P Sp- ithourakis, and Sebastian Riedel. 2017. 
A sim- ple but tough-to-beat baseline for the fake news challenge stance detection task. arXiv preprint arXiv:1707.03264. Md Main Uddin Rony, Mohammad Yousuf, and Naeemul Hassan. 2018. A large-scale study of so- cial media sources in news articles. arXiv preprint arXiv:1810.13078. Christopher Scanlan. 2000. Reporting and writing: Ba- sics for the 21st century. Oxford University Press. Maria Skeppstedt, Andreas Kerren, and Manfred Stede. 2017. Automatic detection of stance towards vacci- nation in online discussion forums. In Proceedings of the International Workshop on Digital Disease Detection using Social Media, DDDSM@IJCNLP 2017, Taipei, Taiwan, November 27, 2017, pages 1– 8. Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2016. Effective lstms for target-dependent senti- ment classification. In COLING 2016, 26th Inter- national Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3298– 3307. ACL. James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and fu- ture directions. In Proceedings of the 27th Inter- national Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, Au- gust 20-26, 2018, pages 3346–3359. Association for Computational Linguistics. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 4097 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 809–819. Association for Computational Linguistics. Jannis Vamvas and Rico Sennrich. 2020. X-stance: A multilingual multi-target dataset for stance detection. CoRR, abs/2003.08385. Sarah Van Leuven and Annelore Deprez. 2017. ‘to fol- low or not to follow?’: How belgian health journal- ists use twitter to monitor potential sources. Journal of Applied Journalism & Media Studies, 6(3):545– 566. Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the Workshop on Language Tech- nologies and Computational Social Science@ACL 2014, Baltimore, MD, USA, June 26, 2014, pages 18–22. Association for Computational Linguistics. Brian Xu, Mitra Mohtarami, and James Glass. 2019. Adversarial domain adaptation for stance detection. CoRR, abs/1902.02401. Wenpeng Yin and Dan Roth. 2018. Twowingos: A two- wing optimization strategy for evidential claim veri- fication. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 105–114. Association for Computational Lin- guistics. Martin Zˇnidarsˇicˇ, Jasmina Smailovic´, Jan Gorsˇe, Miha Grcˇar, Igor Mozeticˇ, and Senja Pollak. 2018. Trust and doubt terms in financial tweets and periodic re- ports. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Re- sources Association (ELRA). Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. 2018a. Detection and resolution of rumours in social media: A survey. ACM Comput. Surv., 51(2):32:1–32:36. Arkaitz Zubiaga, Geraldine Wong Sak Hoi, Maria Liakata, Rob Procter, and Peter Tolmie. 2015. 
Analysing how people orient to and spread rumours in social media by looking at conversational threads. CoRR, abs/1511.07487. Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, and Michal Lukasik. 2016. Stance classifi- cation in rumours as a sequential task exploiting the tree structure of social media conversations. In COL- ING 2016, 26th International Conference on Compu- tational Linguistics, December 11-16, 2016, Osaka, Japan, pages 2438–2448. ACL. Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, Michal Lukasik, Kalina Bontcheva, Trevor Cohn, and Isabelle Augenstein. 2018b. Discourse- aware rumour stance classification in social media using sequential classifiers. Inf. Process. Manage., 54(2):273–290. A Corpus-related Specifications A.1 Screenshot from Factiva Below, we report a screenshot from the Factiva interface (Johal, 2009), while crawling for the CVS AET merger: A.2 Crawling Timelines Table 6 gives an overview of the considered M&A operations, their respective crawling timelines and the total number of articles. Merger Crawl start Crawl end Articles CVS AET 15/02/2017 17/12/2018 831 CI ESRX 27/05/2017 17/08/2018 376 ANTM CI 01/04/2014 28/04/2017 1,199 AET HUM 01/09/2014 23/01/2017 1,009 Table 6: Crawling specifications. A.3 Metadata Included in the Corpus We provide a sample of the data in the Supple- mentary material. Each sample in the dataset is associated with the following fields: • Target merger; one from {CVS AET, CI ESRX, ANTM CI, AET HUM}. • Stance of the article with respect to the target merger; one from {support, refute, comment, un- related}. • Title of the article, followed by a ordered list of the article’s Paragraphs. 4098 • A list of Evidence Snippets, indicating 1) the index of the paragraph in the article where the evidence is located; and 2) the exact start and end indices of the snippet in the corresponding paragraph. A.4 Annotation Guidelines The following is an extract from the annotation guidelines sent to the annotators. Each label de- scription was correlated with a number of exam- ples, which we don’t report due to space limitation. You will be sent a number of news articles. The an- notation process consists of choosing one of 4 pos- sible labels for each article and marking which part of the article (e.g., the title or a specific sentence, phrase, paragraph) led you to your assessment. The four labels to choose from are Support, Com- ment, Refute, and Unrelated. Label: Support This label should be chosen if the article is supporting the theory that the merger is happening. That is, after reading the article the reader feels more confident that the two companies will merge. Articles that mention the merger as a fact and then talk about e.g. the implications or consequences of the merger should not be labelled as supporting but as commenting. Label: Refute This label should be chosen if the article is refuting the theory that the merger is hap- pening. That is, after reading the article the reader feels less confident that the two companies will merge. Articles that are voicing doubts or men- tion potential roadblocks (such as antitrust issues) should be labelled refute as well. Label: Comment This label should be chosen if the article is commenting on one of the merg- ers. The article should neither directly state that the merger is happening, nor refute that it will be completed successfully. Articles that mention the merger as a fact and then talk about e.g. 
the implications or consequences of the merger should also be labelled as commenting. For articles that are long, presenting both positive and negative evidence, annotators should weigh the evidence and conclude whether the article is 'mostly' positive or negative. Only if the assessment of the annotator is that the evidence is balanced should the article be labelled as commenting.

Label: Unrelated This label should be chosen if the article is unrelated to the merger in question. Since the articles have been collected from a news aggregation service, some of them may not in fact be about one of the mergers. This label will only have a few articles and should be the easiest to identify. Note that for an article that is mainly about a different topic/merger, but talks about the relevant merger in one paragraph or just a sentence, annotators should choose the label based on this paragraph or sentence.

B Baselines-Related Specifications

Below, we report on the implementation details for the baselines presented in Section 5. SD stands for Stance Detection and ER for Evidence Retrieval.

B.1 Dummy Baselines

Two dummy baselines have been considered as lower bound.
• Random Baseline. SD: outputs a random stance; ER: outputs two random sentences chosen from the title and all body's paragraphs.
• Majority Baseline. SD: always outputs support (the most frequent label in the corpus); ER: always outputs the title and the first paragraph (the most frequent locations of evidence in the dataset, Figure 6).

B.2 Neural Baselines

Three strong neural baselines, which obtained state-of-the-art results in previous work (Hanselowski et al., 2019), are considered for future reference.

Inputs. The models receive as input n + 1 sequences {t, s1, ..., sn}, where t is the target and {s1, ..., sn} is the list of n sentences from the articles. If training for SD in isolation, such sentences are the gold evidences; if jointly training for SD+ER, they are the evidence candidates: as a simple sentence retrieval method, we always retrieve the title and the first four paragraphs of the article, where evidence snippets are most frequently located in the corpus (Figure 6). For a target merger between companies A and B (with acronyms a and b), we employ as target a string containing the text: "A (a) will merge with B (b)."

Encoders. We employ three neural encoders to obtain a target-aware representation hi of each input sentence si:
• BiLSTM. We employ 300-dimension word embeddings to encode each input token. The embedding matrix is initialized with Glove embeddings[5] (Pennington et al., 2014), which are kept fixed over training to prevent overfitting. We concatenate each input evidence with the target, and we obtain a hidden representation for each pair of inputs with a BiLSTM network with 128 hidden units.
• UseEmb. We obtain sentence embeddings for each input sequence with the Universal Sentence Encoder (Cer et al., 2018), and we concatenate each input sentence with the input target. We use the large model for English[6]. We then pass the obtained encoded representation through a position-specific dense layer with 128 hidden units.
• BERTEmb. We follow the same principle as above, but using Sentence-BERT (Reimers and Gurevych, 2019) to obtain sentence embeddings for each input sentence. We use the bert-base model trained on the SNLI and MultiNLI datasets[7].

[5] We use 300-dimensional word embeddings pretrained on Wikipedia 2014 + Gigaword 5, https://nlp.stanford.edu/projects/glove/
[6] https://tfhub.dev/google/universal-sentence-encoder-large
[7] https://github.com/UKPLab/sentence-transformers

Decoders. After encoding, we obtain n representations {h1, ..., hn}, where hi is the target-aware representation of the sentence at position i. Inspired by Yin and Roth (2018), we obtain a probability αi ∈ (0, 1) of the sentence si being an evidence as:

    αi = sigmoid(v · hi)    (1)

where v is a learned parameter vector. To model the entire set of input sentences as a whole, we construct their joint representation e as:

    e = Σ_{i=1}^{n} αi · hi    (2)

We then consider two decoders, depending on the task(s) we are training for (only SD, or SD+ER):
• Only SD. We predict the stance label with a softmax operation over the stance tagset on e.
• ER and SD. If we jointly perform both ER and SD with a multi-task training setting, we binarize the probability vector α = [α1, ..., αn] by rounding at 0.5; we consider all input sentences si where αi > 0.5 as evidence snippets.
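A minimal sketch of the decoder defined by Equations (1)-(2) and the two output heads is given below. PyTorch is an assumption (the paper does not state the framework used), and all names and dimensions are illustrative rather than a reproduction of the authors' implementation.

```python
import torch
import torch.nn as nn

class EvidenceDecoder(nn.Module):
    """Sketch of Eqs. (1)-(2): a per-sentence evidence probability,
    a weighted sum of sentence representations, and a stance head
    (plus the binarized ER prediction used in the SD+ER setting)."""

    def __init__(self, hidden_size: int, num_stances: int):
        super().__init__()
        self.v = nn.Linear(hidden_size, 1, bias=False)      # Eq. (1): v . h_i
        self.stance_head = nn.Linear(hidden_size, num_stances)

    def forward(self, h):                                    # h: (batch, n, hidden_size)
        alpha = torch.sigmoid(self.v(h)).squeeze(-1)         # (batch, n), Eq. (1)
        e = (alpha.unsqueeze(-1) * h).sum(dim=1)             # (batch, hidden_size), Eq. (2)
        stance_logits = self.stance_head(e)                  # softmax over the stance tagset
        evidence_pred = alpha > 0.5                          # ER head: binarize alpha at 0.5
        return stance_logits, alpha, evidence_pred
```

In the joint SD+ER setting, a natural choice – not specified in the paper – would be to combine a cross-entropy loss on the stance logits with a binary cross-entropy loss on αi against the gold evidence labels.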
C Experimental Setting Specifications

Data Preprocessing. We perform minimal data preprocessing. The following refers to the BiLSTM model: we include all types in the corpus without selecting any minimal frequency; for tokenization, we use NLTK's word_tokenize tokenizer (Loper and Bird, 2002)[8]; we pad/cut input sentences up to 10 tokens (in the case of the article's title) or 25 tokens (in the case of the article's paragraphs).

[8] https://www.nltk.org/api/nltk.tokenize.html

(Hyper)-Parameters and Runtime Specifications. Refer to Appendix B for a description of the considered models' architectures (complete with embedding size and number of hidden units per layer). We train all models with Adagrad, setting the learning rate to 0.02. We train with batches of 32 samples for a maximum of 70 epochs, using Early Stopping with a patience of 10. To prevent overfitting, dropout of 0.2 has been used during training on all layers of the models. Note that, given that this is a resource paper, our goal is to provide a set of robust baselines for future research. For this reason, we don't perform extensive hyper-parameter tuning on the selected models. Table 7 reports the total number of (trainable) parameters for each considered model.

           Model     #parameters   #trainable parameters
3 classes  BiLSTM      1,701,832         201,832
           UseEmb        657,032         657,032
           BertEmb       984,712         984,712
4 classes  BiLSTM      1,701,969         201,969
           UseEmb        657,161         657,161
           BertEmb       984,841         984,841

Table 7: Number of (trainable) parameters for all considered models and training settings.

This resulted in the average runtime/step reported in Table 8 (the average runtime is calculated over five different runs of the same model, trained on the ANTM CI, AET HUM and CVS AET mergers).

Training Setting  Model     3 classes   4 classes
SD                BiLSTM    33s 11ms    37s 13ms
                  UseEmb    0s 147µs    1s 513µs
                  BertEmb   0s 161µs    1s 408µs
SD+ER             BiLSTM    41s 14ms    37s 13ms
                  UseEmb    0s 167µs    1s 418µs
                  BertEmb   0s 163µs    1s 422µs

Table 8: Average runtime/step for each considered model and training setting.

Training Setting. All models are trained using cross-validation, testing on one merger and training on the other three.
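This cross-target protocol amounts to a leave-one-merger-out split; the helper below is only an illustration of that setting, with a hypothetical "merger" field on each sample.

```python
MERGERS = ["CVS AET", "CI ESRX", "ANTM CI", "AET HUM"]

def leave_one_merger_out(samples):
    """Yield (held_out, train, test) splits for the cross-target setting:
    test on one merger, train on the other three.

    `samples` is a list of dicts with a (hypothetical) "merger" field."""
    for held_out in MERGERS:
        train = [s for s in samples if s["merger"] != held_out]
        test = [s for s in samples if s["merger"] == held_out]
        yield held_out, train, test
```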
To account for performance fluctuations (Reimers and Gurevych, 2017), we run 5 simulations for each model and take the average of the results, weighting according to the size of the collected articles for each merger. Table 9 reports the standard deviation between different runs of the same model. Interestingly, UseEmb is the most stable model for SD, while BertEmb is the most stable for ER.

                               SD                      ER
           Model        P      R      F1       P@5      R@5
3 classes  BiLSTM     3.039  7.909  9.817       –        –
           UseEmb     1.246  3.681  5.897       –        –
           BertEmb    1.972  2.967  4.700       –        –
           BiLSTM     1.353  2.490  5.273     8.362    10.58
           UseEmb     1.287  2.806  4.295     8.337    10.26
           BertEmb    4.723  6.131  6.770     6.154    11.60
4 classes  BiLSTM     1.214  2.646  1.943       –        –
           UseEmb     0.102  2.876  3.825       –        –
           BertEmb    5.016  3.413  4.986       –        –
           BiLSTM     1.657  2.440  3.148     7.562     9.806
           UseEmb     2.304  3.027  4.064     7.457    10.745
           BertEmb    4.882  4.637  4.311     4.592    11.50

Table 9: Standard deviation between results obtained with the considered models over different runs. For each training setting (3 vs 4 classes) we first report σ on SD in isolation, then on jointly training SD+ER.

Computing Infrastructure. We run experiments on an NVIDIA GeForce GTX 1080 GPU.

Evaluation Specifications. For SD, we use sklearn's (Pedregosa et al., 2011) implementation of macro-averaged precision, recall and F1 score[9]. For ER, we use Thorne et al. (2018)'s implementation of P@5 and R@5[10], which has also been used by Hanselowski et al. (2019).

[9] https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
[10] https://github.com/sheffieldnlp/fever-scorer/

D Correlation Analysis

D.1 Implementation Details

For the correlation analysis in Section 6, we used pandas' implementation of the Spearman correlation[11] (Wes McKinney, 2010). We calculate the standard error as:

    σx = (1 − rx²) / √(n − 2)    (3)

where rx is the correlation coefficient and n is the number of observations (i.e. the number of days of observations collected for each merger).

[11] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
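A small sketch of this computation, combining pandas' Spearman correlation with the approximate standard error of Equation (3); the daily-count inputs and the numbers in the example are illustrative only.

```python
import numpy as np
import pandas as pd

def spearman_with_stderr(news_daily, tweets_daily):
    """Spearman correlation between two aligned daily-count series,
    with the approximate standard error of Eq. (3)."""
    df = pd.DataFrame({"news": news_daily, "tweets": tweets_daily})
    r = df["news"].corr(df["tweets"], method="spearman")
    n = len(df)
    stderr = (1 - r**2) / np.sqrt(n - 2)  # Eq. (3)
    return r, stderr

# Example with toy daily volumes (illustrative numbers only)
r, se = spearman_with_stderr([0, 2, 5, 40, 3], [1, 4, 9, 120, 7])
print(round(r, 4), round(se, 4))
```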
D.2 The Case of the Anthem/Cigna Merger

Figure 10 shows the distribution of tweets and articles over time for ANTM CI. Three distinct phases can be distinguished in the timeline of the merger.

The first phase goes from the beginning of the data collection to the first report on the companies' talks which appeared on a major news outlet. During this phase, we observe minor movements in the Twitter signal and some sparse news articles. Considering only related samples, most of the tweets and articles in this phase disappear. However, at the end of this phase there are spikes in the Twitter signal. This suggests that during this period the ongoing talks between the companies are not publicly known, but at the very end information may be leaked. The tweet signal spikes during the first phase on 20.05.2015, around one month before the first news report.

The second phase begins with the first report by a major news outlet and lasts until the beginning of the antitrust process. It is characterised by large spikes in both the volume of tweets posted and the number of published articles. The first spike in the news articles occurs on 15.06.2015, when the Wall Street Journal – as it happens for most considered mergers – reports on the ongoing talks between the two companies. The second spike occurs on 24.07.2015, when the companies publicly announce the merger with a joint press release. These two spikes in the news signal are mirrored in the tweets. After the initial reporting about the two companies' intentions, most news articles and tweets discuss the implications of the merger, displaying a constant but not heavy activity.

The third phase begins with the antitrust process and lasts until the end of the merger's timeline. Spikes in the volume of tweets and articles can be observed around specific events, such as when the official antitrust complaint is presented to the Department of Justice (DOJ), at the start of the antitrust trial and around the date of the court decision. During this phase, spikes present a very similar distribution for both signals.

Figure 10: Evolution of the ANTM CI merger over time. From top to bottom: volume of all posted tweets discussing this target in the WT–WT corpus (Conforti et al., 2020); volume of tweets annotated as related; volume of published news articles in STANDER.