Proceedings of the Sixth Fact Extraction and VERification Workshop (FEVER), pages 49–57
May 5, 2023 ©2023 Association for Computational Linguistics

"World Knowledge" in Multiple Choice Reading Comprehension

Adian Liusie∗
ALTA Institute, Cambridge University

al826@cam.ac.uk

Vatsal Raina∗

ALTA Institute, Cambridge University
vr311@cam.ac.uk

Mark Gales
ALTA Institute, Cambridge University

mjfg@cam.ac.uk

Abstract

Recently it has been shown that without any ac-
cess to the contextual passage, multiple choice
reading comprehension (MCRC) systems are
able to answer questions significantly better
than random on average. These systems use
their accumulated "world knowledge" to di-
rectly answer questions, rather than using infor-
mation from the passage. This paper examines
the possibility of exploiting this observation as
a tool for test designers to ensure that the form
of "world knowledge" is acceptable for a partic-
ular set of questions. We propose information-
theory based metrics that enable the level of
"world knowledge" exploited by systems to
be assessed. Two metrics are described: the
effective number of options, which measures
whether a passage-free system can identify the
answer to a question using world knowledge;
and the contextual mutual information, which
measures the importance of context for a given
question. We demonstrate that questions with
low effective number of options, and hence
answerable by the shortcut system, are often
similarly answerable by humans without con-
text. This highlights that the general knowledge
‘shortcuts’ could be equally used by exam can-
didates, and that our proposed metrics may be
helpful for future test designers to monitor the
quality of questions.

1 Introduction

Reading comprehension (RC) exams are used ex-
tensively in a wide range of language competency
examinations (Alderson, 2000), and have become
a ubiquitous assessment method to probe how well
candidates can read a passage and understand the
text’s core meaning. A fundamental assumption of
RC exams is that to answer any of the questions
correctly, one has to read the passage, comprehend

∗Equal Contribution

Figure 1: Output probabilities of our model (trained
with contexts omitted) on real RACE++ (Liang et al.,
2019) examples. ‘Effective number of options’ is a
metric that captures the model’s confidence.

its meaning, and identify the relevant information
of a given question. However, recent work has
shown that multiple-choice machine reading com-
prehension (MCMRC) systems without access to
the passage can achieve reasonable performance
(Pang et al., 2022), showing that the models may
be doing less comprehension than anticipated.

In this paper we analyse this phenomenon and for
several standard MCMRC datasets (Liang et al.,
2019; Huang et al., 2019; Yu et al., 2020) verify
that passage-free baselines are able to achieve per-
formance significantly better than random. We
show that a subset of questions can be answered
accurately and confidently without access to the
contextual passage, where further analysis shows
this is partly due to the presence of low-quality
distractors, i.e. options that can be eliminated us-
ing only the question. As an example, given the
question “Mina’s sister’s name is:", one can elimi-
nate any options that use a traditionally male name
(see Figure 1). This highlights a potential ‘shortcut’
candidates could use to answer questions while bypassing
the context. Our work raises awareness of
this potential flaw, and proposes a simple solution
to catch questions that can be answered without
comprehension. The proposed metrics might be
a useful tool for future multiple-choice RC test
designers to ensure that all questions truly assess



Figure 2: Model architecture.

reading comprehension ability.
Machine reading comprehension (MRC) is a

highly researched area, with state-of-the-art (SoTA)
systems (Zhang et al., 2021; Yamada et al., 2020;
Zaheer et al., 2020; Wang et al., 2022) often ap-
proaching or even exceeding human level per-
formance on public benchmarking leaderboards
(Clark et al., 2018; Lai et al., 2017; Trischler et al.,
2017; Yang et al., 2018). Existing work has anal-
ysed the robustness of MRC systems, where re-
searchers have questioned whether systems fully
leverage context and whether they accomplish the
underlying comprehension task (Sugawara et al.,
2020; Rajpurkar et al., 2016; Kaushik and Lipton,
2018; Jia and Liang, 2017; Si et al., 2019). Most
notably Kaushik and Lipton (2018) show that for
a range of question-answering tasks, passage-only
systems can often achieve remarkable performance,
which has been observed in the MCRC setting
(Pang et al., 2022).

Most existing work has discussed model robust-
ness, demonstrating that for some tasks it is possi-
ble to obtain high average system performance with
no context information. In contrast, this paper fo-
cuses on the attributes of individual questions and
options, identifying questions where "world knowl-
edge" can be leveraged, and the extent to which
this knowledge can be leveraged. This could be a
useful tool to enable test designers to monitor the
questions being proposed, and whether alternative
distractors or questions should be considered.

2 Multiple choice reading comprehension

Multiple-choice reading comprehension is a
popular task where given a context passage C and
question Q, the correct answer must be deduced
from a set of answer options {O}. Current
SoTA MRC systems are dominated by pre-trained
language models (PrLMs) based on the transformer
encoder architecture (Devlin et al., 2019; Liu et al.,
2019; Clark et al., 2020).

Model Architecture Our question-answering
system follows the standard MCMRC architecture
of Figure 2 (Yu et al., 2020; Raina and Gales,
2022). Each option is individually encoded along
with the question and the context into a score, and
a softmax layer converts the 4 options’ scores
into a probability distribution. At inference, the
predicted answer is the option with the greatest
probability.

‘No Context’ Shortcut System A requirement for
good MCRC questions is that information from
both the question and the context passage must be
used to determine the correct answer. To probe
whether this requirement is met in MCMRC datasets, we train systems
using ‘context free’ inputs (similar to Pang
et al. (2022)). The standard set-up (Figure 2) is
still followed, however the input is now altered to
exclude the context, as shown in Figure 3.
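
The difference between the standard and the ‘no context’ inputs can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_option_inputs`, the `[SEP]` separator string, and the ordering of the parts are assumptions made here, not details taken from Figures 2 and 3, and the example options are invented.

```python
from typing import List, Optional


def build_option_inputs(
    question: str,
    options: List[str],
    context: Optional[str] = None,
    sep: str = " [SEP] ",  # assumed separator, for illustration only
) -> List[str]:
    """Return one input string per answer option.

    With a context, each input joins context, question and option
    (standard system). With context=None the context is simply left
    out (shortcut system).
    """
    inputs = []
    for option in options:
        if context is None:
            parts = [question, option]
        else:
            parts = [context, question, option]
        inputs.append(sep.join(parts))
    return inputs


# Invented example in the spirit of the question quoted in the introduction.
options = ["Tom", "Anna", "Jack", "Peter"]
standard = build_option_inputs(
    "Mina's sister's name is:", options, context="Mina has a sister called Anna."
)
shortcut = build_option_inputs("Mina's sister's name is:", options)
```

Each of the resulting strings would then be scored separately by the encoder, with the softmax taken over the option scores.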

Figure 3: System inputs for shortcut system.

Effective Number of Options Consider the out-
put probability distribution of the predicted answer,
P(y). One can determine the entropy, H(Y), which
can be converted into the more interpretable effective
number of options, N(Y), a value bounded
between 1 and the maximum number of options:

$$N(Y) = 2^{H(Y)}, \qquad H(Y) = -\sum_{y \in Y} P(y) \log_2 P(y) \tag{1}$$

For well designed questions, one would expect
systems with missing information (i.e. the
‘shortcut’ models) to have no information of what
the answer is. This would correspond to a uniform
distribution output (the distribution of maximum
entropy), with an effective number of options equal
to the total number of answer options. However,
if the effective number of options is significantly
lower than the total number of answer options,



then this implies that prior information stored
during training can be used to answer the question,
without comprehension.
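
As a minimal sketch of Equation 1, the entropy (in bits) and the effective number of options can be computed from a single output distribution as follows. The function names are chosen here for illustration, and zero-probability options are skipped using the usual convention that 0 log 0 = 0.

```python
import math
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy H(Y) in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def effective_number_of_options(probs: Sequence[float]) -> float:
    """N(Y) = 2 ** H(Y): between 1 and the number of options."""
    return 2.0 ** entropy_bits(probs)


uniform = [0.25, 0.25, 0.25, 0.25]    # no information about the answer
two_left = [0.5, 0.5, 0.0, 0.0]       # two distractors eliminated
confident = [1.0, 0.0, 0.0, 0.0]      # answer known with certainty
```

Under this definition the uniform distribution over four options gives four effective options, eliminating two distractors gives two, and a fully confident prediction gives one.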

Mutual Information To probe how much informa-
tion is gained by the context, one can additionally
look at an approximation of mutual information of
the context. This looks at how much the entropy
decreases between the ‘no context’ shortcut system
and the ‘context’ baseline system:

$$I(Y; C \mid Q, \{O\}) = H(Y \mid Q, \{O\}) - H(Y \mid Q, \{O\}, C) \tag{2}$$

An alternative approach would be to use random
contexts (Creswell and Shanahan, 2022); however,
we consider the stricter ‘no context’ setting.
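
The approximation in Equation 2 amounts to subtracting two per-question entropies, one from each system. A minimal sketch follows, with illustrative function names and made-up probability vectors; since the two entropies come from two separately trained models, nothing in the computation prevents a negative result.

```python
import math
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability options."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def contextual_mutual_information(
    p_no_context: Sequence[float], p_with_context: Sequence[float]
) -> float:
    """Approximate I(Y; C | Q, {O}) as H(Y | Q, {O}) - H(Y | Q, {O}, C)."""
    return entropy_bits(p_no_context) - entropy_bits(p_with_context)


# Context resolves all uncertainty: the maximum of 2 bits for four options.
high_mi = contextual_mutual_information([0.25] * 4, [1.0, 0.0, 0.0, 0.0])

# Shortcut system is certain but the context system is not: negative value.
negative_mi = contextual_mutual_information([1.0, 0.0, 0.0, 0.0], [0.25] * 4)
```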

3 Experiments

Data We consider three popular MCMRC datasets:
RACE++ (Lai et al., 2017), COSMOSQA (Huang
et al., 2019) and ReClor (Yu et al., 2020). RACE++
is a dataset of English comprehension questions
for Chinese high school students, COSMOSQA
is a large scale commonsense-based reading com-
prehension dataset, while ReClor is a challenging
logical reasoning dataset at a graduate student
level. All datasets have 4 options per question, one
of which is the correct answer.

              TRN      DEV      EVL

RACE++    100,388    5,599    5,642
COSMOS     25,262    2,985        –
ReClor      4,638      500    1,000

Table 1: Dataset statistics

Training An ELECTRA-large¹ model is fine-tuned
on the training split TRN, hyper-parameters are
tuned on the development set DEV, and performance
is reported on the test split EVL for RACE++
(DEV splits are used for COSMOS and ReClor due
to unavailability of the EVL splits). All model
parameters (transformer and classifier) are updated
during fine-tuning. Additionally, models are
trained and evaluated using the ‘no context’ setting, as
described in Section 2. Final hyperparameters are
given in Appendix B.1. Three seeds are trained,
and ensemble accuracy is used as the default metric
when reporting performance.²
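
How the three seeds are combined is not spelled out here; one common choice for deep ensembles, assumed in the sketch below, is to average the per-seed probability distributions for a question and predict the option with the highest averaged probability. The function names and the probabilities are illustrative only.

```python
from typing import List


def ensemble_probs(seed_probs: List[List[float]]) -> List[float]:
    """Average the option distributions of several seeds for one question."""
    n_seeds = len(seed_probs)
    return [sum(per_option) / n_seeds for per_option in zip(*seed_probs)]


def predict(probs: List[float]) -> int:
    """Index of the option with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up outputs of three seeds for one four-option question.
seeds = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
averaged = ensemble_probs(seeds)
answer = predict(averaged)
```

Averaging probabilities rather than taking a majority vote keeps a full distribution per question, which is what the entropy-based metrics of Section 2 operate on.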

¹ https://huggingface.co/docs/transformers/model_doc/electra

² Code for experiments available at: https://github.com/adianliusie/MCRC

3.1 Results
Context-Free Performance We compare the per-
formance of the baseline ‘standard’ system against
the shortcut ‘no context’ systems for the various
MCMRC datasets. Table 2 illustrates that the shortcut
systems achieve high in-domain performance across all
MCMRC datasets, all above 50%, significantly
above the expected random performance of 25%.
Further, we find that the shortcut rules can generalise
across domains, most notably seen with the
54% performance when training the shortcut system
on RACE++ and evaluating on COSMOS. This
highlights that the shortcut performance cannot be
explained purely by dataset bias, but that there is a
skill, unrelated to comprehension, that the systems
are meaningfully leveraging.

Training data   System     RACE++   COSMOS   ReClor

–               –          25.00    25.00    25.00

RACE++          stan.      85.01    70.05    48.60
                no con.    57.32    54.04    34.80

COSMOS          stan.      66.81    84.49    41.20
                no con.    38.73    68.51    27.80

ReClor          stan.      52.69    41.68    69.80
                no con.    31.27    33.13    51.80

Table 2: Cross-performance of systems on RACE++,
COSMOSQA and ReClor (standard vs no context).

RACE++ Effective Number of Options Figure 4
presents the count and accuracy plots of the effec-
tive number of options (bin width of 0.2) for both
the standard and shortcut systems on RACE++ (see
Appendix A for other datasets). The counts plot shows
the number of questions within the bin range, while
accuracy refers to the accuracy over all the exam-
ples within the bin. Since the systems are slightly
overconfident³, the systems’ output probabilities
are calibrated using temperature annealing (Guo
et al., 2017) (see Appendix B.3).

The baseline system has high certainty for most
points, which is somewhat expected given the
baseline’s high accuracy. However the shortcut
system, without any contextual information,
has a significant number of examples in the
very low entropy region. This shows that for a
subset of questions, the system can confidently
answer questions correctly without doing any
comprehension at all. In other cases, the shortcut
system can leverage some information from the

³ For both models, the mean of the maximum probability
is 5% above the overall accuracy.



Figure 4: Distribution of effective number of options
and corresponding (binned) accuracy.

question and can reduce the number of effective
options to between 2 and 3, which implies that
certain poor distractors can be eliminated by
the question alone. We also show that for both
models, there is a clear linear relationship between
uncertainty and accuracy, illustrating that the
context-free system’s use of world knowledge
is sensible and that it leverages meaningful task
information (see Appendix D for low-entropy
examples). This confirms that the systems are well
calibrated and that the effective number of op-
tions is a good measure of actual model uncertainty.
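
The count and accuracy curves described for Figure 4 can be reproduced from per-question values with a simple binning routine. The sketch below assumes bins of width 0.2 starting at 1 (the smallest possible effective number of options); the function name and the toy inputs are illustrative and are not taken from the released code.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def binned_accuracy(
    eff_nums: Sequence[float],
    correct: Sequence[bool],
    bin_width: float = 0.2,
    lowest: float = 1.0,
) -> Dict[int, Tuple[float, int, float]]:
    """Map bin index -> (bin lower edge, question count, accuracy in bin)."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for n, is_correct in zip(eff_nums, correct):
        idx = int((n - lowest) // bin_width)
        totals[idx] += 1
        hits[idx] += int(is_correct)
    return {
        idx: (round(lowest + idx * bin_width, 10), totals[idx], hits[idx] / totals[idx])
        for idx in sorted(totals)
    }


# Toy data: two confident correct answers, two uncertain incorrect ones.
stats = binned_accuracy([1.05, 1.1, 2.5, 3.9], [True, True, False, False])
```

Plotting the count and the accuracy of each bin against its lower edge gives the two panels of the figure.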

Mutual Information To further look at the
influence of context, the mutual information (MI)
between prediction and context was approximated
for each example using Equation 2. Examples
with a high MI are questions where the model
is certain of the answer with context, but is
uncertain without context - a desired property
for comprehension questions. Figure 5 shows
the counts when all the examples are ordered by
MI (see Equation 2) along with both the baseline
and shortcut system accuracies. We note that the
count distribution has a mix of high and low MI
questions, which shows that the benefit of context
is not a system-wide property but instead varies
over questions. The accuracy of the baseline
system increases considerably when context is
useful, while accuracy falls for the shortcut system.
It is interesting that a small fraction of questions
have negative MI. Though MI should always
be non-negative, negative values can be observed
since models are only approximations of the true
underlying distributions. The low accuracy of the
shortcut model on negative MI questions occurs
when standard world knowledge is not consistent

Figure 5: Distribution of counts and corresponding ac-
curacy when points are sorted by MI approximation.

with the information in the context.

Human Evaluation of Metrics We perform human
evaluation to judge the practical use of our metrics.
We select the 100 questions with the lowest entropy and the 100 with the highest
entropy, and three volunteer graduate students independently
answer the questions without access
to the context. We further select 50 questions with
lowest and highest MI, and get our volunteers to
first answer questions without context, then with
context, and calculate the accuracy increase. All
questions are shuffled, and volunteers attempt to
best answer all questions. We find that our met-
rics are very effective in measuring their desired
properties. Without context, humans are often able
to answer the questions that the shortcut systems
answer confidently, with humans achieving an aver-
age accuracy of 92% on the 100 lowest entropy and
32% on the 100 highest entropy examples respec-
tively. Further, for high MI questions humans get
a performance boost of 71% when context is pro-
vided, and only 22% for low mutual information
questions.

          low ent.    high ent.   high MI      low MI

human     91.7±1.9    31.7±2.9    ∆69.3±0.9    ∆24.7±5.0
system    99.0±0.0    24.3±6.2    ∆68.0±0.9    ∆3.3±4.7

Table 3: Human and system ‘no context’ accuracy on
lowest and highest entropy questions as well as human
and system change in accuracies on lowest and highest
mutual information questions.

4 Conclusions

For popular MCMRC datasets, systems can achieve
reasonably high performance without performing
any comprehension. Without passage information,
‘shortcut’ systems can confidently determine some



correct answer options, eliminate some unlikely
distractors, and use general knowledge to gain in-
formation. Rather than focusing on average system
performance, our work analyses individual questions’
reliance on world knowledge. We propose
a metric based on the shortcut systems to automat-
ically flag questions that are answerable without
comprehension. We further provide evidence that
the flagged questions are answerable by humans
without any context. Lastly, using an approxima-
tion of the mutual information, we show that the
importance of context varies over the questions in
the dataset, and reason that high MI questions can
be thought of as candidates for high-quality ques-
tions that truly measure comprehension abilities.

5 Limitations

We propose an approach that can automatically flag
questions that can be answered without contextual
information. However, the remaining questions are
not necessarily high-quality questions, since many
other aspects make up question quality. Second,
the experiments are conducted using only the ELECTRA
model, though it is expected similar trends will
be picked up by alternative transformer-based lan-
guage models. Further, exams might be aimed at
a level where a lack of specific knowledge may be
assumed. Our work does not consider variable can-
didate knowledge levels, and our evaluation was
only done by highly educated (we’d like to think)
graduate students. Finally, we acknowledge that
our human evaluation was limited in size and questions;
however, it is clearly demonstrated that for
low ‘shortcut entropy’ questions, comprehension
is not necessarily required.

6 Acknowledgements

This research is funded by the EPSRC (The En-
gineering and Physical Sciences Research Coun-
cil) Doctoral Training Partnership (DTP) PhD stu-
dentship and supported by Cambridge Assessment,
University of Cambridge and ALTA.

7 Ethics Statement

There are no serious ethical concerns with this
work. The human volunteers all performed the
human evaluation tasks willingly without any co-
ercion. The human evaluation took 2 hours per
person.

References
J. Charles Alderson. 2000. Assessing Reading, 1st edition.
Cambridge University Press, Cambridge.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and
Christopher D. Manning. 2020. Electra: Pre-training
text encoders as discriminators rather than generators.
In International Conference on Learning Representa-
tions.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot,
Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. 2018. Think you have solved question an-
swering? try arc, the ai2 reasoning challenge. ArXiv,
abs/1803.05457.

Antonia Creswell and Murray Shanahan. 2022. Faith-
ful reasoning using large language models. arXiv
preprint arXiv:2208.14271.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Wein-
berger. 2017. On calibration of modern neural net-
works. In International conference on machine learn-
ing, pages 1321–1330. PMLR.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and
Yejin Choi. 2019. Cosmos qa: Machine reading com-
prehension with contextual commonsense reasoning.
In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 2391–2401.

Robin Jia and Percy Liang. 2017. Adversarial examples
for evaluating reading comprehension systems. In
EMNLP.

Divyansh Kaushik and Zachary C Lipton. 2018. How
much reading does reading comprehension require? a
critical investigation of popular benchmarks. In Pro-
ceedings of the 2018 Conference on Empirical Meth-
ods in Natural Language Processing, pages 5010–
5015.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,
and E. Hovy. 2017. Race: Large-scale reading com-
prehension dataset from examinations. In EMNLP.

Yichan Liang, Jianheng Li, and Jian Yin. 2019. A new
multi-choice reading comprehension dataset for cur-
riculum learning. In Proceedings of The Eleventh
Asian Conference on Machine Learning, volume 101
of Proceedings of Machine Learning Research, pages
742–757, Nagoya, Japan. PMLR.



Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du,
Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining approach.
ArXiv, abs/1907.11692.

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi,
Nikita Nangia, Jason Phang, Angelica Chen, Vishakh
Padmakumar, Johnny Ma, Jana Thompson, He He,
and Samuel Bowman. 2022. QuALITY: Question
answering with long input texts, yes! In Proceedings
of the 2022 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, pages 5336–5358,
Seattle, United States. Association for Computational
Linguistics.

Vatsal Raina and Mark Gales. 2022. Answer uncertainty
and unanswerability in multiple-choice machine read-
ing comprehension. In Findings of the Association
for Computational Linguistics: ACL 2022, pages
1020–1034, Dublin, Ireland. Association for Compu-
tational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Percy Liang. 2016. SQuAD: 100,000+ questions for
machine comprehension of text. In Proceedings of
the 2016 Conference on Empirical Methods in Natu-
ral Language Processing, pages 2383–2392, Austin,
Texas. Association for Computational Linguistics.

Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing
Jiang. 2019. What does bert learn from multiple-
choice reading comprehension datasets? arXiv
preprint arXiv:1910.12391.

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and
Akiko Aizawa. 2020. Assessing the benchmarking
capacity of machine reading comprehension datasets.
In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 34, pages 8918–8927.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris,
Alessandro Sordoni, Philip Bachman, and Kaheer
Suleman. 2017. Newsqa: A machine comprehension
dataset. In Rep4NLP@ACL.

Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu
Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and Nan
Duan. 2022. Logic-driven context extension and data
augmentation for logical reasoning of text. In Find-
ings of the Association for Computational Linguistics:
ACL 2022, pages 1619–1629.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki
Takeda, and Yuji Matsumoto. 2020. Luke: Deep con-
textualized entity representations with entity-aware
self-attention. In EMNLP.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben-
gio, William W. Cohen, Ruslan Salakhutdinov, and
Christopher D. Manning. 2018. Hotpotqa: A dataset
for diverse, explainable multi-hop question answer-
ing. In EMNLP.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng.
2020. Reclor: A reading comprehension dataset re-
quiring logical reasoning. In International Confer-
ence on Learning Representations.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava
Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontañón, Philip Pham, Anirudh Ravula, Qifan
Wang, Li Yang, and Amr Ahmed. 2020. Big
bird: Transformers for longer sequences. ArXiv,
abs/2007.14062.

Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2021.
Retrospective reader for machine reading compre-
hension. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 35, pages 14506–
14514.



Appendix A Additional Results

Appendix A.1 COSMOSQA

Figure Appendix A.1: Distribution of effective number
of options and binned accuracy for COSMOSQA.

Figure Appendix A.2: Distribution of counts and cor-
responding accuracy when points are sorted by MI ap-
proximation for COSMOSQA.

We repeat the entropy plot (Figure Appendix
A.1) for COSMOSQA and find similar trends to
those seen in RACE++. The shortcut no-context
system has a very flat distribution with a substantial
number of questions answerable without context,
with the effective number of options again having
a clean linear relationship with accuracy. The re-
peated mutual information plot (Figure Appendix
A.2) for COSMOSQA also has the same trend seen
in RACE++, validating that our findings are more
general than just for RACE++.

Appendix A.2 ReClor

ReClor shows roughly the same trends; however,
the questions of ReClor are much more challenging
than in either RACE++ or COSMOSQA, and
so we notice that the counts distribution is pushed
considerably to the higher entropy side. Addition-
ally, since ReClor is much smaller than RACE++

Figure Appendix A.3: Distribution of effective number
of options and binned accuracy for ReClor.

Figure Appendix A.4: Distribution of counts and cor-
responding accuracy when points are sorted by MI ap-
proximation for ReClor.

and COSMOSQA (see Table 1), the curves are less
smooth and largely suffer from noise.

Appendix A.3 Other Shortcuts
We also consider other shortcut approaches, such as
having context and options (i.e. missing question)
and only options (Figure Appendix A.5). Perfor-
mance of the systems is shown in Table Appendix
A.1.

Figure Appendix A.5: System inputs for alternative
shortcut systems.



Training data   Input      RACE++   COS.     ReClor

–               –          25.00    25.00    25.00

RACE++          {O}        41.76    21.44    34.00
                Q+{O}      57.32    54.04    34.80
                {O}+C      68.20    54.61    46.00
                Q+{O}+C    85.01    70.05    48.60

COSMOS          {O}        29.95    57.39    25.20
                Q+{O}      38.73    68.51    27.80
                {O}+C      52.41    78.96    40.40
                Q+{O}+C    66.81    84.49    41.20

ReClor          {O}        26.07    18.29    49.00
                Q+{O}      31.27    33.13    51.80
                {O}+C      39.83    36.88    68.40
                Q+{O}+C    52.69    41.68    69.80

Table Appendix A.1: Cross-performance of systems on
RACE++, COSMOSQA and ReClor using accuracy.

Appendix B Model Information

B.1 Training Details

For all systems, deep ensembles of 3 models are
trained with the large⁴ ELECTRA PrLM as a part
of the multiple-choice MRC architecture depicted
in Figure 2. Each model has 340M parameters.
Grid search was performed for hyperparameter
tuning with the initial setting of the hyperparam-
eter values dictated by the baseline systems from
Yu et al. (2020); Raina and Gales (2022). Apart
from the default values used for various hyperparameters,
the grid search was performed for the
maximum number of epochs ∈ {2, 5, 10}; learning
rate ∈ {2e−7, 2e−6, 2e−5}; batch size ∈ {2, 4}.
For RACE++, training was performed for 2 epochs
at a learning rate of 2e-6 with a batch size of 4
and inputs truncated to 512 tokens. For systems
trained on ReClor the final hyperparameter settings
included training for 10 epochs at a learning rate
of 2e-6 with a batch size of 4 and inputs truncated
to 512 tokens. For COSMOSQA, training was per-
formed for 5 epochs at a learning rate of 2e-6 with
a batch size of 4 and inputs truncated to 512 tokens.
Cross-entropy loss was used at training time with
models built using NVIDIA A100 graphics processing
units, with training time under 3 hours per
model for ReClor, 5 hours for COSMOSQA and 4
hours for RACE++. All hyperparameter tuning was
performed by training on TRN and selecting values
that achieved optimal performance on DEV. For
fairness, the ‘shortcut’ systems (omitting various

⁴ Configuration at: https://huggingface.co/google/electra-large-discriminator/blob/main/config.json

forms of the input) for each dataset were trained
with the same hyperparameter settings as their cor-
responding baseline systems.

B.2 Evaluation Details
For each dataset, the systems are trained on the
training split and hyperparameter tuned on the de-
velopment split. For RACE++, systems are evalu-
ated on the held out test data, but for COSMOSQA
and ReClor, the evaluations are performed on the
development split because their test splits have their
labels hidden.

B.3 Calibration
The trained models were calibrated post-hoc using
single parameter temperature annealing (Guo et al.,
2017). Uncalibrated, model probabilities are deter-
mined by applying the softmax to the output logit
scores $s_i$:

$$P(y = k; \theta) \propto \exp(s_k) \tag{3}$$

where k denotes a possible output class for a predic-
tion y. Temperature annealing ‘softens’ the output
probability distribution by dividing all logits by a
single parameter T before the softmax.

$$P_{\text{CAL}}(y = k; \theta) \propto \exp(s_k / T) \tag{4}$$

As the parameter T does not change the relative
rankings of the logits, the model’s prediction will
be unchanged and so temperature scaling does not
affect the model’s accuracy. The parameter T is
chosen such that the accuracy of the system is equal
to the mean of the maximum probability (which
would be expected for a calibrated system).
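
The calibration criterion above can be sketched as a one-dimensional search for T. The search procedure is not specified in this appendix, so the bisection below (on a log scale, with assumed bounds of 0.05 and 100) is only one possible choice; it relies on the fact that the mean maximum probability can only decrease as T grows. All names and the toy logits are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits divided by a temperature (Equation 4)."""
    scaled = [s / temperature for s in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_prob(all_logits: Sequence[Sequence[float]], temperature: float) -> float:
    """Average confidence: mean over questions of the largest probability."""
    return sum(max(softmax(l, temperature)) for l in all_logits) / len(all_logits)


def fit_temperature(
    all_logits: Sequence[Sequence[float]],
    accuracy: float,
    low: float = 0.05,    # assumed search bounds
    high: float = 100.0,
    iterations: int = 60,
) -> float:
    """Find T such that the mean maximum probability matches the accuracy."""
    for _ in range(iterations):
        mid = math.sqrt(low * high)  # midpoint on a log scale
        if mean_max_prob(all_logits, mid) > accuracy:
            low = mid   # still overconfident: soften further
        else:
            high = mid
    return math.sqrt(low * high)


# Toy logits for three questions and an assumed measured accuracy of 0.6.
toy_logits = [[4.0, 1.0, 0.0, 0.0], [2.0, 1.5, 0.0, -1.0], [0.5, 3.0, 0.2, 0.1]]
T = fit_temperature(toy_logits, accuracy=0.6)
```

Because dividing by a positive T preserves the ranking of the logits, the predicted option for every question is the same before and after calibration.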

Appendix C Licenses

This section details the license agreements of the
scientific artifacts used in this work. The dataset
COSMOSQA (Huang et al., 2019) has a BSD 3-Clause
License. The datasets RACE++ (Lai et al.,
2017) and ReClor (Yu et al., 2020) are freely avail-
able with the limitation on the latter that it can
only be used for non-commercial research purposes.
Huggingface transformer models are released under
the Apache License 2.0. All the scientific artifacts
are consistent with their intended uses.



Appendix D Low Entropy Examples
