MQAG: Multiple-choice Question Answering and Generation for
Assessing Information Consistency in Summarization

Potsawee Manakul, Adian Liusie, Mark J. F. Gales
ALTA Institute, Department of Engineering, University of Cambridge
pm574@cam.ac.uk, al826@cam.ac.uk, mjfg@eng.cam.ac.uk

Abstract

State-of-the-art summarization systems can
generate highly fluent summaries. These sum-
maries, however, may contain factual inconsis-
tencies and/or information not present in the
source. Hence, an important component of as-
sessing the quality of summaries is to determine
whether there is information consistency be-
tween the source and the summary. Existing ap-
proaches are typically based on lexical match-
ing or representation-based methods. In this
work, we introduce an alternative scheme based
on standard information-theoretic measures in
which the information present in the source and
summary is directly compared. We propose a
Multiple-choice Question Answering and Gen-
eration framework, MQAG, which approxi-
mates the information consistency by comput-
ing the expected statistical distance between
summary and source answer distributions over
automatically generated multiple-choice ques-
tions. This approach exploits multiple-choice
answer probabilities, as predicted answer dis-
tributions can be compared. We conduct exper-
iments on four summary evaluation datasets:
QAG-CNNDM/XSum, XSum-Hallucination,
Podcast Assessment, and SummEval. Experi-
ments show that MQAG, using models trained
on SQuAD or RACE, outperforms existing
evaluation methods on the majority of tasks.1

1 Introduction

The objective of summary evaluation is to quan-
tify the quality of summaries, either on a relative
or an absolute scale. Accurate and reliable auto-
matic summary evaluation systems are useful to
researchers, as they provide an easy and cheap way
to compare new summarization models to exist-
ing ones. Although current summarization sys-
tems have improved dramatically in the last decade,
and are capable of generating highly fluent outputs
(Lewis et al., 2020; Zhang et al., 2020a; Brown

1 Code and model weights are available at https://github.com/potsawee/mqag0.

[Figure 1 diagram. Components: Source X; Summary Y; Multiple-Choice Question Generation; Question with options a) to d); Answering System; prob. dist. given X; prob. dist. given Y; Statistical Distance (e.g. KL-Div); MQAG score.]

Figure 1: Multiple-choice Question Answering and Gen-
eration (MQAG) framework. The answers are repre-
sented by probability distributions over choices instead
of text spans in existing question-answering approaches.

et al., 2020), it has been shown that generated sum-
maries are prone to exhibit factual errors or halluci-
nations (Kryscinski et al., 2019; Huang et al., 2021;
Cao et al., 2022; Ji et al., 2022). Thus, information
consistency between the summary and source is an
important assessment criterion.

Existing methods that measure information con-
sistency generally perform lexical matching, either
directly such as ROUGE (Lin, 2004) and BLEU
(Papineni et al., 2002), or indirectly using more
complex representations such as triple matching
(Goodrich et al., 2019). Some recent approaches
adopt question answering (QA) pipelines to detect
factual inconsistencies (Chen et al., 2018; Wang
et al., 2020; Durmus et al., 2020; Deutsch et al.,
2021; Nan et al., 2021). They are based on the
assumption that if the source extracted answer is
consistent with the summary extracted answer then



the summary and source are consistent. The an-
swers are compared using either lexical matching
(Scialom et al., 2019; Wang et al., 2020; Durmus
et al., 2020; Scialom et al., 2021) or representation-
based matching (Deutsch and Roth, 2022). These
span-based QA approaches may have lexical biases,
and struggle with highly abstractive summaries or
when dealing with multiple answer spans.

In this work, a measure of consistency be-
tween the source and summary is defined from
an information-theoretic perspective. We propose
a Multiple-choice Question Answering and Gener-
ation framework, MQAG, where instead of com-
paring text-based answer spans, multiple-choice
questions are generated and the resulting answer
distributions from the source and summary are com-
pared. The main contributions of this paper are:

• We provide an alternative and novel question
answering-based approach for assessing information
consistency. Our approach represents the answers
via probability distributions instead of lexical
matches or embeddings.

• We show that our approach, MQAG, achieves
state-of-the-art performance on four out of six
summary evaluation tasks.

2 Background and Related Work

Standard summary evaluation metrics such as
ROUGE (Lin, 2004) and METEOR (Banerjee and
Lavie, 2005) are designed to assess summaries
against ground-truth summaries, i.e. reference sum-
maries. However, these metrics have been shown
to have a low correlation with human judgements
(Fabbri et al., 2021). In practice, there is no ground-
truth summary to be used as the reference, and
evaluation methods need to compare the summary
against the source. Therefore, the scope of this
work is assessing the summary against the source.

Although there are several aspects of good sum-
maries, including fluency, coherency, coverage
or consistency, generation systems are becoming
much more capable of generating fluent texts, so
the fluency/coherency aspects are less of a concern
compared to consistency and hallucination prob-
lems (Ji et al., 2023). Thus, this work focuses on
consistency. Because the definition of consistent
information can depend on one’s interpretation, we
follow the definition of ‘faithfulness’ in Maynez
et al. (2020) such that we determine if the informa-
tion in the summary is consistent with information

in the source, and we do not consider ‘factuality’
where valid external facts are acceptable. Existing
unsupervised evaluation methods are categorized
and explained in the following part.2

Textual overlap scores
n-gram based metrics, including BLEU (Pap-
ineni et al., 2002), ROUGE (Lin, 2004), and ME-
TEOR (Banerjee and Lavie, 2005) measure n-gram
overlap between two texts. Instead of n-grams,
BERTScore (Zhang et al., 2020b) and BLEURT
(Sellam et al., 2020) compare texts in their rep-
resentation space. These metrics measure textual
similarity, so they are not necessarily a good mea-
sure of consistency. We note that the original works
that proposed these metrics compare the summary
against the ground-truth summary, but this work
focuses on the scenario where there is no ground-
truth summary, and these metrics are used as base-
lines to compare the summary against the source.

Knowledge representation
Goodrich et al. (2019) assess factual consistency
by comparing relation triples from the source and
the summary. The relation triples are in the format
of Subject-Relation-Object and can be obtained
using a model-free method such as OpenIE (Etzioni
et al., 2008) or using a trained relation extraction
model. The factual accuracy score based on the
triple matching approach is then defined as,

$$\text{Score} = \frac{|T_x \cap T_y|}{|T_y|}$$

where Tx and Ty are relation triples extracted from
the source and the summary, respectively.
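
As a rough illustration of the triple-matching score above, the following Python sketch treats each relation triple as a (subject, relation, object) tuple and computes the fraction of summary triples that also occur in the source. The function name `triple_match_score`, the exact-match comparison of triples, and the choice to return 0.0 when the summary yields no triples are assumptions made for this example; they are not taken from Goodrich et al. (2019).

```python
def triple_match_score(source_triples, summary_triples):
    """Return |Tx ∩ Ty| / |Ty| for two collections of relation triples."""
    t_x = set(source_triples)
    t_y = set(summary_triples)
    if not t_y:
        # Assumption: with no summary triples there is nothing to verify.
        return 0.0
    return len(t_x & t_y) / len(t_y)


# Toy triples written by hand for illustration (not extracted by OpenIE).
source = [
    ("Cambridge", "is in", "England"),
    ("MQAG", "uses", "multiple-choice questions"),
]
summary = [
    ("Cambridge", "is in", "England"),
    ("Cambridge", "is in", "France"),
]
print(triple_match_score(source, summary))  # 0.5
```

Because triples are matched exactly, a paraphrased but correct relation in the summary would count as a mismatch, which is one reason such scores can be sensitive to surface form.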

Textual Entailment
Simulated data, such as real or fake summaries
created by pre-defined transformations, have been
used to train classifiers to detect inconsistent sum-
maries (Kryscinski et al., 2020; Bao et al., 2022).
Alternatively, Maynez et al. (2020) trained a textual
entailment classifier on the Multi-NLI (MNLI)
dataset (Williams et al., 2018). Given a context, the
entailment model is to classify the hypothesis into
one of the three classes (entail/neutral/contradict).
When applied to assess summaries, the context is
the source document and the hypothesis is the sum-
mary. The probability of being the entail class is

2Supervised approaches, with systems trained on human
evaluation annotations, are outside the scope of this work.



then used as the consistency score,

$$\text{Score} = P(\text{entail} \mid x, y) \tag{1}$$

Span-based Question Answering (SpanQAG)
A question-answering approach consists of a
question-generation model and an answering
model. Given automatically generated questions,
the first answer is derived from the source and the
second answer is derived from the evaluated sum-
mary, and then the two answers are compared.

For example, Eyal et al. (2019) proposed a QA-
based method where questions are generated from
the ground-truth summary. QAGS (Wang et al.,
2020) and FEQA (Durmus et al., 2020) generate
questions from the evaluated summary, so these
two methods are designed to measure the amount
of information in the summary that is consistent
with the source. In contrast, SummaQA (Scialom
et al., 2019) generates questions from the source
document, so it assesses the coverage of the sum-
mary. As an extension to the ideas in QAGS/FEQA
and SummaQA, QuestEval (Scialom et al., 2021)
generates questions from both the source and the
summary separately to obtain a precision score and
a recall score. QuestEval also assigns a weight-
ing function to take into account the importance of
each query/question.

Nevertheless, existing QA methods are span-
based where the answering system extracts answer
spans before two answer spans are compared. Due
to the nature of span-based answers, answer verifi-
cation (i.e. answer comparison) is typically through
exact matching, token F1, BERTScore, or a learned
metric (Deutsch and Roth, 2022). This answer verification
step exposes a drawback of existing QA
methods: they have to compare the similarity
between two texts. To avoid span-based answer
verification, we propose an alternative question
answering-based approach that uses multiple-choice
question generation and answering systems, so that
the answers are in the form of
probability distributions rather than text spans.
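
To make the span-comparison issue concrete, the sketch below implements a simple token-level F1 between two answer spans, in the style commonly used for extractive QA evaluation. The function name `token_f1`, the lowercasing, and the whitespace tokenization are assumptions for this example and do not reproduce the exact answer-verification code of any method cited above.

```python
from collections import Counter


def token_f1(answer_a, answer_b):
    """Token-overlap F1 between two answer strings (whitespace tokens)."""
    tokens_a = answer_a.lower().split()
    tokens_b = answer_b.lower().split()
    if not tokens_a or not tokens_b:
        # Assumption: two empty answers match, one empty answer does not.
        return float(tokens_a == tokens_b)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


# Two spans that refer to the same entity share no tokens.
print(token_f1("UK", "the United Kingdom"))  # 0.0
```

The example shows how two answers that a human would accept as equivalent can receive a score of zero under lexical matching.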

3 Multiple-choice Question Answering
and Generation (MQAG)

3.1 Motivation and Theory
Since current summarization systems generate
highly fluent summaries, this work focuses on as-
sessing whether summaries contain the same infor-
mation as that of the source, or whether it is con-
tradictory. One way to view information would be

to consider the set of questions that are answerable
given a certain passage. If a summary is consistent
with the source, then one would expect the set of
answerable questions by the summary to overlap
with those of the source and yield similar answers.
Though span-based QA approaches are similarly
motivated, existing span-based frameworks use text
similarity measures, either in the form of lexical
or representation space. In contrast, we attempt to
measure information using multiple-choice ques-
tions, which allows for a more abstract understand-
ing of information and enables convenient use of
standard information-theoretic measures.

3.2 MQAG Score
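Let $x$ = source, $y$ = summary, $q$ = question, and $\mathbf{o}$ =
options associated with the question $q$. We define
information inconsistency as,

$$
\begin{aligned}
I(x, y) &= \int_{q,\mathbf{o}} D\big(P_A(\mathbf{o}|q, x),\, P_A(\mathbf{o}|q, y)\big)\, P_G(q, \mathbf{o}|y)\, d\mathbf{o}\, dq \\
&\approx \frac{1}{N} \sum_{i=1}^{N} D\big(P_A(\mathbf{o}^{(i)}|q^{(i)}, x),\, P_A(\mathbf{o}^{(i)}|q^{(i)}, y)\big)
\end{aligned}
\tag{2}
$$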

where $\{q^{(i)}, \mathbf{o}^{(i)}\}$ is sampled from $P_G(q, \mathbf{o}|y)$, the
question-option generation model, $P_A(\mathbf{o}^{(i)}|q^{(i)}, x)$
and $P_A(\mathbf{o}^{(i)}|q^{(i)}, y)$ are the option distributions
given the source and summary respectively, and
$D$ is a statistical distance such as KL-divergence.
Based on the information inconsistency score in
Equation 2, we define the MQAG score as,3

$$\text{MQAG-Score}(x, y) = 1 - I(x, y) \tag{3}$$

We refer to Equation 3 as the MQAG-Sum score
as the questions are generated from the summary.
Furthermore, it is possible to generate questions
$\{q, \mathbf{o}\}$ using the source $x$ instead of the summary $y$,
i.e. $\{q^{(i)}, \mathbf{o}^{(i)}\}$ is sampled from $P_G(q, \mathbf{o}|x)$. We
refer to this variant as the MQAG-Src score. MQAG-
Src is expected to measure the amount of source
information present in the summary, i.e. the cover-
age of the summary, while MQAG-Sum is expected
to measure the consistency of the summary with re-
spect to the source. To account for consistency and
coverage, we also consider a simple combination,

$$\text{MQAG-F1} = 2 \cdot \frac{\text{MQAG-Sum} \times \text{MQAG-Src}}{\text{MQAG-Sum} + \text{MQAG-Src}} \tag{4}$$
3 If $D > 1$, for example when using KL-divergence, the MQAG score can be negative, but the maximum value is 1.0.
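
The Monte-Carlo approximation in Equation 2 and the scores in Equations 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration that assumes the answer distributions have already been computed by some answering system and are passed in as lists of probabilities; the function names and the handling of a zero denominator in the F1 combination are assumptions of this example, and total variation is used as the distance only because it is bounded.

```python
def total_variation(p, q):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def mqag_score(dists_given_source, dists_given_summary, distance=total_variation):
    """1 minus the mean distance over N questions (Equations 2 and 3)."""
    if not dists_given_source or len(dists_given_source) != len(dists_given_summary):
        raise ValueError("need one pair of distributions per question")
    n = len(dists_given_source)
    inconsistency = sum(
        distance(p_x, p_y)
        for p_x, p_y in zip(dists_given_source, dists_given_summary)
    ) / n
    return 1.0 - inconsistency


def mqag_f1(mqag_sum, mqag_src):
    """Harmonic-mean style combination of the two variants (Equation 4)."""
    denominator = mqag_sum + mqag_src
    # Assumption: return 0.0 when the denominator is zero.
    return 0.0 if denominator == 0 else 2 * mqag_sum * mqag_src / denominator


# Two questions: the source and summary agree on the first and disagree on the second.
given_source = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
given_summary = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(mqag_score(given_source, given_summary))  # 0.5
```

Calling `mqag_score` with distributions for questions generated from the summary gives an MQAG-Sum style value, and with questions generated from the source an MQAG-Src style value; the two can then be passed to `mqag_f1`.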



3.3 Statistical Distances D
Given two probability distributions over options o
(e.g. one conditioned on source x, and the other
conditioned on summary y), a statistical distance
D measures the distance between the probability
distributions. There are multiple distances, which
can be used, and in this work, we consider some
of the main distances and investigate their proper-
ties as well as their empirical performance in our
MQAG framework as follows,

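• KL-Divergence:

$$D_{\text{KL}} = \sum_{o \in \mathbf{o}} P_A(o|q, x) \log\left(\frac{P_A(o|q, x)}{P_A(o|q, y)}\right)$$

• One-Best (i.e. argmax matching):

$$D_{\text{OB}} = \begin{cases} 0, & \text{if } o_x = o_y \\ 1, & \text{otherwise} \end{cases}$$

where $o_x = \arg\max_o P_A(o|q, x)$ and $o_y = \arg\max_o P_A(o|q, y)$. $D_{\text{OB}}$ simply determines
whether the two answers match or not.

• Total Variation:

$$D_{\text{TV}} = \frac{1}{2} \left\lVert P_A(\mathbf{o}|q, x) - P_A(\mathbf{o}|q, y) \right\rVert_1$$

• Hellinger:

$$D_{\text{HL}} = \frac{1}{\sqrt{2}} \left\lVert \sqrt{P_A(\mathbf{o}|q, x)} - \sqrt{P_A(\mathbf{o}|q, y)} \right\rVert_2$$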

KL divergence is unbounded, which means the
value can be exceedingly large. In contrast, one-
best is bounded but discontinuous. Both total vari-
ation and Hellinger distance are bounded and con-
tinuous. We illustrate examples of the properties of
these statistical distances on Bernoulli distributions
in Figure 4 in the appendix.
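
The four distances above can be written directly for discrete distributions over the options. The sketch below is illustrative only: the function names are chosen for this example, the natural logarithm is assumed for KL-divergence (the base is not stated in the text), and KL-divergence is taken to be infinite when the summary-conditioned distribution assigns zero probability to an option that the source-conditioned distribution supports.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) with natural log; infinite if q is 0 where p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def one_best(p, q):
    """0.0 if both distributions have the same argmax option, else 1.0."""
    argmax_p = max(range(len(p)), key=p.__getitem__)
    argmax_q = max(range(len(q)), key=q.__getitem__)
    return 0.0 if argmax_p == argmax_q else 1.0


def total_variation(p, q):
    """Half the L1 distance; bounded in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def hellinger(p, q):
    """(1 / sqrt(2)) times the L2 distance between square roots; bounded in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2.0)


# Source-conditioned and summary-conditioned distributions over four options.
p_source = [0.7, 0.1, 0.1, 0.1]
p_summary = [0.1, 0.7, 0.1, 0.1]
for name, fn in [("KL", kl_divergence), ("OB", one_best),
                 ("TV", total_variation), ("HL", hellinger)]:
    print(name, round(fn(p_source, p_summary), 4))
```

For two distributions with disjoint support, total variation and Hellinger both reach their maximum of 1, whereas KL-divergence diverges, which matches the boundedness properties described above.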

4 Experimental Setup

4.1 System Development Data
RACE (Lai et al., 2017) is a multiple-choice read-
ing comprehension dataset where each example
consists of context, question, answer, and 3 distrac-
tors (i.e. incorrect options). SQuAD (Rajpurkar
et al., 2016) is a collection of question-answer pairs
derived from Wikipedia articles, and the correct an-
swers can be any sequence of tokens in the given
context. The statistics are provided in Table 1
where abstractiveness is measured by 1.0 minus the
length of the longest sequence that exists in both
the context and the answer per the answer length,
i.e. $1.0 - \text{ROUGE-L}_{\text{Precision}}(\text{Answer}, \text{Context})$.
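
The abstractiveness measure described above can be sketched with a longest-common-subsequence (LCS) computation, since ROUGE-L is based on the LCS between two token sequences. This is an approximation for illustration: it uses lowercased whitespace tokens rather than the tokenizer of any particular ROUGE implementation, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def abstractiveness(answer, context):
    """1.0 minus ROUGE-L precision of the answer against the context."""
    answer_tokens = answer.lower().split()
    context_tokens = context.lower().split()
    if not answer_tokens:
        # Assumption: an empty answer is treated as fully extractive.
        return 0.0
    return 1.0 - lcs_length(answer_tokens, context_tokens) / len(answer_tokens)


context = "the cat sat on the mat"
print(abstractiveness("the cat sat", context))  # 0.0
print(abstractiveness("a dog slept", context))  # 1.0
```

An answer copied from the context scores 0.0, while an answer sharing no tokens with the context scores 1.0.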

Dataset   Size    Length (Context)   Length (Answer)   Abstractive
SQuAD     98.2k   317.8              11.0              0.0%
RACE      97.7k   138.3              11.3              39.1%

Table 1: Statistics of datasets for training MQAG sys-
tems. Length = #words. Abstractiveness of 0% indicates
that in SQuAD the answer always exists in the context.

4.2 Evaluation Data
We evaluate performance by measuring the correlation
against human judgements. Correlations are computed
at the summary level on QAG-CNNDM (Hermann et al., 2015),
QAG-XSum (Narayan et al., 2018), and XSum-Hallucination,
and at the system level on Podcast Assessment and
SummEval. The definitions of summary-level
and system-level correlations are provided in
Appendix C. The statistics are provided in Table 2.

Eval Dataset   Size       Length (Source)   Length (Summary)
QAG-CNNDM      235        355.8             54.4
QAG-XSum       239        403.7             19.7
XSum-H         2500       442.1             20.5
Podcast        ∗20×179    5950              88.3
SummEval       ∗16×100    404.0             63.7

Table 2: Statistics of evaluation datasets. Length is the
number of words calculated using the NLTK tokenizer.
∗#systems×documents.

QAG. Wang et al. (2020) annotated 235 CNNDM
summaries of the system in Gehrmann et al. (2018)
and 239 XSum summaries of fine-tuned BART
(Lewis et al., 2020). The annotation was performed
at the sentence level indicating if hallucination oc-
curs or not. Subsequently, for each summary, the
faithfulness (or consistency) score is then obtained
by averaging all sentence-level human scores.

XSum-Hallucination (XSum-H). Maynez et al.
(2020) annotated 2500 XSum summaries using 3
crowd-sourced workers on two metrics: 1) Faithfulness
= whether the information is faithful w.r.t. the
source, annotated at the token level, with the
judgements then averaged; 2) Factuality = whether the
summary is factual, judged at the summary level,
w.r.t. the source and external knowledge.

Podcast Assessment. Manakul and Gales (2022)
compiled 3580 podcast summaries of abstractive
and extractive summarization systems from Spotify
Podcast Challenge 2020 (Jones et al., 2021). The



human evaluation was performed on a 4-point scale
considering a combination of consistency, cover-
age, and fluency.

SummEval. Fabbri et al. (2021) assessed 1600
CNNDM summaries from 16 different summariza-
tion systems on four aspects, including relevancy,
consistency, coherency, and fluency. In this work,
we use the consistency scores.
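
The two correlation granularities used in this section can be sketched as follows. The precise definitions are given in Appendix C, which is not reproduced here, so this example follows a common convention and should be read as an assumption: summary-level correlation is computed over all individual summaries, while system-level correlation first averages the metric and human scores per system and then correlates the per-system means. The record layout and function names are hypothetical.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summary_level_pcc(records):
    """records: iterable of (system_id, metric_score, human_score)."""
    records = list(records)
    return pearson([r[1] for r in records], [r[2] for r in records])


def system_level_pcc(records):
    """Average scores within each system, then correlate the system means."""
    by_system = defaultdict(list)
    for system_id, metric, human in records:
        by_system[system_id].append((metric, human))
    metric_means = [sum(m for m, _ in v) / len(v) for v in by_system.values()]
    human_means = [sum(h for _, h in v) / len(v) for v in by_system.values()]
    return pearson(metric_means, human_means)


# Toy scores for three hypothetical systems with two summaries each.
records = [
    ("A", 0.2, 1.0), ("A", 0.4, 3.0),
    ("B", 0.6, 2.0), ("B", 0.6, 4.0),
    ("C", 0.9, 5.0), ("C", 0.9, 3.0),
]
print(round(summary_level_pcc(records), 3), round(system_level_pcc(records), 3))
```

In this toy data the per-system means are perfectly linearly related even though individual summaries are not, which illustrates why system-level correlations tend to be higher than summary-level ones.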

4.3 Baselines
All of the considered methods compare the sum-
mary y against the source document x without
the ground-truth summary, and we implement
these methods as described in Section 2 using
code/repository from the relevant previous works.

ROUGE. We use the ROUGE-1 (F1) score in the
rouge-score Python package.

OpenIE-TripleMatch. The relation extraction is
based on an open scheme, and we use the imple-
mentation in FactSumm (Heo, 2021).

BERTScore. We use DeBERTa-base (He et al.,
2021) fine-tuned to MNLI as the backbone.

Entailment model. Following the method in
Maynez et al. (2020), we trained BERT-large (Devlin
et al., 2019) on MNLI and we use the probability
of the summary being entailed by the source
as the assessment score as shown in Equation 1.

Span-based QAG Baselines. We use three ex-
isting span-based question-answering methods as
our baselines: QAGS proposed by Wang et al.
(2020), FEQA proposed by Durmus et al. (2020),
and QuestEval proposed by Scialom et al. (2021).

4.4 MQAG Implementation
Question Generation (G1, G2)
The multiple-choice question generation is imple-
mented in two stages.4 First model G1 generates
the question q and answer a, then model G2 gener-
ates the distractors o\a given q and a.

$$P_G(q, \mathbf{o}|y) = P_{G2}(\mathbf{o}_{\backslash a}|q, a, y)\, P_{G1}(q, a|y) \tag{5}$$

where o = {a,o\a} denotes all options/choices.
We set the number of options to four. Both G1
and G2 are sequence-to-sequence T5-large models
(Raffel et al., 2020). The question-answer gener-
ation system G1 is fine-tuned to either RACE or

4The motivation is based on our initial experiments that
a single generation system (generating the question and 4
options together) often gave low-quality distractors, and using
two generation systems improved the quality of distractors.

SQuAD, and the distractor generation system G2 is
fine-tuned to RACE.

Question Answering (A)
The answering stage contains one model A, which
is Longformer-large (Beltagy et al., 2020) with a
multiple-choice setup following Yu et al. (2020);
Raina and Gales (2022). The input to the model
is a concatenation of context, question and option.
The answering model A is fine-tuned to RACE.
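
The way a multiple-choice answering model yields a distribution over options can be sketched as follows: each option is paired with the context and question, scored, and the scores are normalized with a softmax. This is a schematic example and not the Longformer setup itself; the separator string, the function names, and the `score_fn` callable (standing in for a trained model that returns one logit per input) are all hypothetical.

```python
import math


def build_inputs(context, question, options, sep=" </s> "):
    """Concatenate context, question and each option into one input string.

    Assumption: the separator is illustrative; a real tokenizer would insert
    its own special tokens.
    """
    return [sep.join([context, question, option]) for option in options]


def option_distribution(logits):
    """Numerically stable softmax over per-option logits."""
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    normalizer = sum(exps)
    return [value / normalizer for value in exps]


def answer_distribution(context, question, options, score_fn):
    """Score every (context, question, option) input and normalize."""
    inputs = build_inputs(context, question, options)
    return option_distribution([score_fn(text) for text in inputs])


# A stand-in scorer that gives every option the same logit.
probs = answer_distribution(
    "The cat sat on the mat.", "Where did the cat sit?",
    ["mat", "sofa", "bed", "floor"], score_fn=lambda text: 0.0,
)
print(probs)  # [0.25, 0.25, 0.25, 0.25]
```

Running the same questions and options once with the source as context and once with the summary as context gives the two distributions that the statistical distance compares.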

Answerability of Generated Questions
Because not all generated questions are of high
quality, we consider filtering out low-quality ques-
tions through question-context answerability mea-
sures (Kundu and Ng, 2018; Hu et al., 2019). We
consider a simple answerability measure based on
the entropy of the probability distribution over the
options. We define the effective number of options,

$$N_y(q, \mathbf{o}) = 2^{H[P_A(\mathbf{o}|q, y)]} \tag{6}$$

where $H[\cdot]$ is base-2 entropy, so $N_y(q, \mathbf{o})$ ranges
from 1.0 to the number of options, e.g. 4.0. When
$q$ is generated from $y$ but $N_y(q, \mathbf{o})$ is high, this
question $q$ should be deemed unanswerable as it is
not answerable even when using the same context.
As a result, we use $N_y(q, \mathbf{o})$ as an answerability
criterion to reject questions which have $N_y(q, \mathbf{o})$
higher than a threshold denoted by $N_y^{\tau}$.
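
The effective number of options in Equation 6 and the resulting filter can be sketched as below. The function names, the pairing of each question with its summary-conditioned distribution, and the default threshold argument are assumptions of this example.

```python
import math


def effective_num_options(probs):
    """2 raised to the base-2 entropy of a distribution over options."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def filter_answerable(questions, threshold=2.0):
    """Keep questions whose effective number of options is not above the threshold.

    questions: list of (question_text, probs_given_summary) pairs.
    """
    return [
        (question, probs)
        for question, probs in questions
        if effective_num_options(probs) <= threshold
    ]


questions = [
    ("confident question", [1.0, 0.0, 0.0, 0.0]),      # N_y = 1.0
    ("two-way tie", [0.5, 0.5, 0.0, 0.0]),              # N_y = 2.0
    ("uniform guess", [0.25, 0.25, 0.25, 0.25]),        # N_y = 4.0
]
kept = filter_answerable(questions, threshold=2.0)
print([question for question, _ in kept])  # ['confident question', 'two-way tie']
```

A one-hot distribution gives an effective number of 1.0 and a uniform distribution over four options gives 4.0, so lowering the threshold rejects progressively more uncertain questions.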

5 Experimental Results

5.1 Analysis of the Components in MQAG
In this subsection, we carry out experiments to
find the best configuration of MQAG, including the
analysis of statistical distances, variants of MQAG,
and answerability. We build two MQAG variants:
MQAGSQuAD and MQAGRACE, which differ in the
training data of the question+answer generator G1,
while the distractor generator G2 and answering
system A are both trained on RACE.

Statistical Distances
In Table 3, our results compare statistical distances.
It can be seen that in both configurations, KL-
divergence yields lower correlations than other dis-
tances, and on average total variation slightly out-
performs Hellinger and one-best distances. Hence,
total variation will be used as the main distance.
The next observation is that MQAGSQuAD, despite
generating more extractive questions, achieves
higher correlations than MQAGRACE on most tasks
except on Podcast and SummEval.



[Figure 2 plots. Panels: QAG-CNNDM, XSum-Hallucination-Faithful, SummEval. X-axis: $N_y$ threshold (1.0 to 4.0). Curves: MQAG_SQuAD and MQAG_RACE.]

Figure 2: ∆PCC of MQAG-Sum with total variation (i.e. $\text{PCC} - \text{PCC}_{N_y^{\tau}=4.0}$) against the answerability threshold
$N_y^{\tau}$ on the X-axis. MQAG without answerability is equivalent to setting $N_y^{\tau} = 4.0$, and the results at this operating
point can be seen at the right-most point in each plot. As we reduce the threshold ($N_y^{\tau}$ ↓), more questions are
rejected. The results on QAG-XSum and Podcast are provided in Figure 5 in the appendix.

D      QAG-CNN   QAG-XSum   XSum-H Faith   XSum-H Fact   Podcast   SummEval
MQAG-Sum, G1 = SQuAD
DKL    0.478     0.374      0.177          0.226         0.251     0.936
DOB    0.476     0.354      0.295          0.254         0.677     0.872
DTV    0.508     0.396      0.269          0.267         0.225     0.870
DHL    0.499     0.399      0.266          0.269         0.201     0.870
MQAG-Sum, G1 = RACE
DKL    0.450     0.283      0.135          0.179         0.789     0.954
DOB    0.453     0.225      0.240          0.221         0.839     0.928
DTV    0.462     0.309      0.221          0.244         0.770     0.933
DHL    0.473     0.323      0.215          0.244         0.751     0.927

Table 3: Comparison of Statistical Distances using
MQAG-Sum without answerability.

MQAG-Sum, MQAG-Src, MQAG-F1
Here, we compare three variants of MQAG scores.
Our results in Table 4 show that MQAG-Src, which
assesses how much source information is contained
in the summary by generating questions from the
source, achieves lower PCCs than MQAG-Sum in
all settings except Podcast with G1 = SQuAD. This
finding aligns with our expectation, as the summaries were graded by humans
predominantly on the consistency aspect (which
MQAG-Sum was designed to measure) rather than
the quantity of source information present (which
MQAG-Src measures). When combining MQAG-
Src and MQAG-Sum into MQAG-F1, we only ob-
serve a small gain on two test settings. Therefore,
MQAG-Sum is selected as our main MQAG con-
figuration for the remaining investigations.

Answerability
In Figure 2, the answerability threshold is swept from 4.0
(keeping all questions) to 1.0 (only keeping those
for which the answering system A is highly confident).
It can be seen that as we filter out high-entropy

Variant   QAG-CNN   QAG-XSum   XSum-H Faith   XSum-H Fact   Podcast   SummEval
G1 = SQuAD, D = Total Variation
Sum       0.508     0.396      0.269          0.267         0.225     0.870
Src       0.272     0.017      0.093          0.037         0.470     0.707
F1        0.490     0.393      0.286          0.261         0.475     0.863
G1 = RACE, D = Total Variation
Sum       0.462     0.309      0.221          0.244         0.770     0.933
Src       0.233     0.143      0.069          0.087         0.144     0.588
F1        0.468     0.301      0.217          0.252         0.731     0.866

Table 4: Comparison of MQAG-Src, MQAG-Sum, and
MQAG-F1 without answerability.

questions, there is an upward trend in performance
across all tasks. In addition, as shown in the figure,
setting $N_y^{\tau}$ at 2.0 seems to be a reasonable answerability
threshold. At this threshold, $N_y^{\tau} = 2.0$, out
of 50 automatically generated questions, about 36
questions are kept for MQAGSQuAD and about 30
questions are kept for MQAGRACE. The number of
remaining questions is similar across all datasets
as shown in Table 9 in the appendix. Thus, we
set $N_y^{\tau} = 2.0$, and the performance of MQAG
using this answerability criterion is presented and
compared against baseline systems in Table 5.

5.2 Comparison Against Existing Baselines

The baseline and MQAG results are shown in Ta-
ble 5. The observation is that MQAG achieves a
higher correlation than the best SpanQAG on 5 out
of 6 tasks. When compared to all existing baselines,
MQAG achieves state-of-the-art performance on
4 out of 6 tasks. To investigate the impact of the
abstractiveness of summaries on the performance,



Method                     QAG-CNNDM   QAG-XSum   XSum-H Faithful   XSum-H Factual   Podcast   SummEval
Baselines: Other Approaches
ROUGE-1                    0.337       0.012      -0.050            0.008            0.326     0.458
OpenIE-TripleMatching      0.381       0.131      0.019             -0.020           0.706     0.548
BERTScore                  0.584       0.008      0.185             0.154            0.718     0.645
Entailment (BERT Model)    0.159       0.169      0.362             0.209            0.228     0.619
Baselines: SpanQAG
QAGS                       0.437       0.200      0.101             0.080            0.464     0.812
FEQA                       0.322       0.283      0.297             0.171            0.603     0.464
QuestEval                  0.250       0.173      0.421             0.197            0.579     0.838
Multiple-choice Question Answering and Generation (MQAG)
MQAGSQuAD                  0.519       0.407      0.324             0.292            0.502     0.890
MQAGRACE                   0.502       0.313      0.306             0.270            0.855     0.945

Table 5: Pearson Correlation Coefficient (PCC) between the scores of summary evaluation methods and human
judgements. PCCs are computed at the summary level on QAG and XSum-H, and at the system level on Podcast and
SummEval. PCCs on Podcast are computed on 15 abstractive systems. Our best performing MQAG configuration
consists of (i) generation stage G generates questions from summary y (i.e. MQAG-Sum), (ii) statistical distance is
total variation, (iii) the answerability threshold $N_y^{\tau}$ is set to 2.0. Underline denotes where MQAG outperforms the
best SpanQAG system, which is 5 out of 6 tasks. When compared to all baselines, MQAG achieves the highest PCC
on 4 out of 6 tasks. The results of all MQAG configurations are provided in Table 10, and Spearman’s correlation
results are provided in Table 11 in the appendix.

we split QAG-XSum and XSum-H datasets5 into
two portions of the same size by abstractiveness
as measured by the longest sequence in the sum-
mary that exists in the source per the summary
length (i.e. ROUGE-L precision of summary y
using source x as the reference). The results in
Table 6 show that although MQAGRACE achieves
lower PCCs than MQAGSQuAD (in Table 5), when
evaluated on the more abstractive split, the performance
of MQAGRACE is much closer to that of
MQAGSQuAD. In addition, compared to MQAG,
SpanQAG methods show a larger drop in PCCs
in the more abstractive split. This finding further
illustrates the benefits of comparing answer distri-
butions rather than text spans.

6 Ablation Studies

6.1 Number of Questions (N )
We analyse the impact of the number of gener-
ated questions on the performance of MQAG. The
mean and standard deviation are presented in Fig-
ure 3. The results show a smooth increase in corre-
lation, which is as expected because the framework
is based on a Monte-Carlo approximation (in Equa-
tion 2), and a similar finding was also observed in

5XSum summaries are more abstractive than CNNDM
summaries, so using XSum should enable us to investigate the
impact of abstractiveness better than CNNDM.

Method       QAG-XSum Low   QAG-XSum High   XSum-H Low   XSum-H High
QAGS         0.190          0.184           0.101        0.159
FEQA         0.296          0.163           0.290        0.124
QuestEval    0.215          0.061           0.398        0.326
MQAGSQuAD    0.431          0.328           0.334        0.254
MQAGRACE     0.277          0.295           0.319        0.249

Table 6: Performance as measured by Pearson corre-
lation coefficient on the low abstractiveness and high
abstractiveness of QAG-XSum and XSum-H (Faithful).
The results on the entire datasets are in Table 5.

QAGS (Wang et al., 2020). Figure 3 also shows
that the variance decreases with N , showing the
stability of the approach. Though the performance
curve has not completely plateaued at N=50, since
the computational cost of MQAG scales linearly
with N , 50 questions seem to be a reasonable com-
promise between computational efficiency and per-
formance. An interesting next step would be to
investigate if the same or similar performance can
be achieved with as low N as possible, for exam-
ple, by generating a smaller but more diverse set
of questions and options such as varifocal question
generation where questions are generated based on
different focal points (Ousidhoum et al., 2022).
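
The effect of the number of questions on the stability of the score can be explored with a resampling experiment along the following lines. The exact bootstrapping procedure behind Figure 3 is not described in this section, so the sketch makes explicit assumptions: for each summary a pool of per-question distances is available, each trial draws N of them with replacement, the score is 1 minus their mean, and the Pearson correlation with human scores is recorded per trial. All function and parameter names are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def bootstrap_pcc(distances_per_summary, human_scores, n_questions,
                  n_trials=100, seed=0):
    """Mean and standard deviation of the PCC when using n_questions per summary."""
    rng = random.Random(seed)
    pccs = []
    for _ in range(n_trials):
        scores = []
        for distances in distances_per_summary:
            # Resample question-level distances with replacement.
            sample = rng.choices(distances, k=n_questions)
            scores.append(1.0 - sum(sample) / n_questions)
        pccs.append(pearson(scores, human_scores))
    return statistics.mean(pccs), statistics.pstdev(pccs)


# Toy pools of per-question distances for three summaries.
pools = [[0.1, 0.2, 0.1, 0.3], [0.5, 0.4, 0.6, 0.5], [0.9, 0.8, 0.7, 0.9]]
human = [0.9, 0.5, 0.1]
for n in (1, 2, 4):
    mean_pcc, std_pcc = bootstrap_pcc(pools, human, n_questions=n)
    print(n, round(mean_pcc, 3), round(std_pcc, 3))
```

Since the cost of such a run grows with N, this kind of experiment gives a way to choose the smallest N at which the spread of the correlation becomes acceptable.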



[Figure 3 plots. Panels: QAG-CNNDM (Mean), QAG-CNNDM (Std). X-axis: number of questions (0 to 50).]

Figure 3: Mean and standard deviation of Pearson correlation (Y-axis) of MQAGRACE on QAG-CNNDM when the
number of generated questions N is varied from 1 to 50 (X-axis). Standard deviation is obtained via bootstrapping.
The results on other datasets are provided in Figure 6 in the appendix.

6.2 Model Choices

Pre-trained Backbone
We investigate model choices by swapping to less
capable models, e.g. T5-large → T5-base for gen-
eration, and Longformer(4096) → RoBERTa(512)
(Liu et al., 2019) for answering. The results in
Table 8 in the appendix show: (1) For the generation
stage, using a smaller model does not result in
lower performance. This could be because T5-base
has higher perplexity and yields more diverse questions.
(2) In contrast, for the answering stage, when using
RoBERTa, which has a shorter input length, the performance
on SummEval (where the input length is mostly
shorter than 512) remains almost the same. However,
as the input length is longer in other datasets,
we observe a drop in PCC when using RoBERTa.

Zero-shot Multiple-choice Question Generation
Given the impressive results of large language
models (LLMs) across natural language generation
tasks, we investigate the performance of LLMs in a
zero-shot fashion instead of using fine-tuned T5 for
multiple-choice question generation. Specifically,
we use OpenAI GPT-3 (Brown et al., 2020)
(text-davinci-003) where we query 50 questions
and 4 options using the following prompt format:

Write 50 diverse multiple-choice
questions with 4 options from the
following context: {context}.

We found that GPT-3 generated 50 questions as
specified in the prompt in around 26% of the examples,
while the remaining examples have only 20 questions. The
majority of questions (more than 95%) have 4 options,
while the remaining have 2 options. In Table 7,
the results show that zero-shot GPT-3 performs
worse than our fine-tuned T5 systems on both
QAG-CNNDM and QAG-XSum. This illustrates
that there is some sensitivity to the
quality of generated questions, and using our fine-
tuned T5 is a better option than zero-shot GPT-3.

Backbone      QAG-CNNDM   QAG-XSum
T5 (SQuAD)    0.508       0.396
T5 (RACE)     0.462       0.309
GPT-3         0.392       0.130

Table 7: GPT-3 versus fine-tuned T5 using DTV without
answerability for multiple-choice question generation.

7 Conclusion

This work proposes MQAG – a novel scheme
for assessing information consistency between
source and summary based on the distance between
multiple-choice answer distributions instead of text-
based answer spans in existing question-answering
methods. Our experiments demonstrate the poten-
tial of this alternative approach which outperforms
existing techniques on various datasets. The real-
ization of the framework exploits current multiple-
choice question generation and answering systems.
Its performance is expected to increase as back-
bone systems improve, for example, the diversity
of questions generated and the selection of options.
Also, the framework is highly interpretable, allow-
ing more insight into summary assessment.



Limitations

Domain. Our approach is designed to assess the
information content, so it may not work well with
other aspects of summary evaluation such as flu-
ency or coherency. Our analysis is based on the
systems trained on RACE, which is collected from
English examinations in China. Hence, the gener-
ated questions and answer distributions could be
biased towards the style of the examinations.

Efficiency. Given the realization of the MQAG
framework where two generators G1 and G2 are
adopted, the MQAG framework can be slow on
older hardware; for example, it takes
around 3 seconds per question on one NVIDIA
P100 GPU. To address this issue, future work could
explore a more efficient realization of MQAG.

Acknowledgments

This work is supported by Cambridge University
Press & Assessment (CUP&A), a department of
The Chancellor, Masters, and Scholars of the Uni-
versity of Cambridge, and the Cambridge Com-
monwealth, European & International Trust. We
would like to thank the anonymous reviewers for
their helpful comments.

References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
An automatic metric for MT evaluation with im-
proved correlation with human judgments. In Pro-
ceedings of the ACL Workshop on Intrinsic and Ex-
trinsic Evaluation Measures for Machine Transla-
tion and/or Summarization, pages 65–72, Ann Arbor,
Michigan. Association for Computational Linguis-
tics.

Forrest Bao, Ge Luo, Hebi Li, Minghui Qiu, Yinfei
Yang, Youbiao He, and Cen Chen. 2022. SueNes:
A weakly supervised approach to evaluating single-
document summarization via negative sampling. In
Proceedings of the 2022 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
pages 2450–2458, Seattle, United States. Association
for Computational Linguistics.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020.
Longformer: The long-document transformer. arXiv
preprint arXiv:2004.05150.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma-
teusz Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish, Alec
Radford, Ilya Sutskever, and Dario Amodei. 2020.
Language models are few-shot learners. In Ad-
vances in Neural Information Processing Systems,
volume 33, pages 1877–1901. Curran Associates,
Inc.

Meng Cao, Yue Dong, and Jackie Cheung. 2022. Hal-
lucinated but factual! inspecting the factuality of
hallucinations in abstractive summarization. In Pro-
ceedings of the 60th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long
Papers), pages 3340–3354, Dublin, Ireland. Associa-
tion for Computational Linguistics.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2018. A
semantic QA-based approach for text summarization
evaluation. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 32.

Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth.
2021. Towards question-answering as an automatic
metric for evaluating the content quality of a sum-
mary. Transactions of the Association for Computa-
tional Linguistics, 9:774–789.

Daniel Deutsch and Dan Roth. 2022. Benchmarking
answer verification methods for question answering-
based summarization evaluation metrics. In Find-
ings of the Association for Computational Linguis-
tics: ACL 2022, pages 3759–3765, Dublin, Ireland.
Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A
question answering evaluation framework for faith-
fulness assessment in abstractive summarization. In
Proceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 5055–
5070, Online. Association for Computational Lin-
guistics.

Oren Etzioni, Michele Banko, Stephen Soderland, and
Daniel S Weld. 2008. Open information extrac-
tion from the web. Communications of the ACM,
51(12):68–74.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019.
Question answering as an automatic evaluation met-
ric for news article summarization. In Proceedings
of the 2019 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and
Short Papers), pages 3938–3948, Minneapolis, Min-
nesota. Association for Computational Linguistics.

Alexander R. Fabbri, Wojciech Kryściński, Bryan Mc-
Cann, Caiming Xiong, Richard Socher, and Dragomir
Radev. 2021. SummEval: Re-evaluating summariza-
tion evaluation. Transactions of the Association for
Computational Linguistics, 9:391–409.

Sebastian Gehrmann, Yuntian Deng, and Alexander
Rush. 2018. Bottom-up abstractive summarization.
In Proceedings of the 2018 Conference on Empiri-
cal Methods in Natural Language Processing, pages
4098–4109, Brussels, Belgium. Association for Com-
putational Linguistics.

Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad
Saleh. 2019. Assessing the factual accuracy of gener-
ated text. In Proceedings of the 25th ACM SIGKDD
International Conference on Knowledge Discovery
& Data Mining, KDD ’19, pages 166–175, New York,
NY, USA. Association for Computing Machinery.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and
Weizhu Chen. 2021. DeBERTa: Decoding-enhanced
BERT with disentangled attention. In International
Conference on Learning Representations.

Hoon Heo. 2021. FactSumm: Factual consistency scorer
for abstractive summarization.
https://github.com/Huffon/factsumm.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-
stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. 2015. Teaching machines to read
and comprehend. In Advances in Neural Information
Processing Systems, volume 28. Curran Associates,
Inc.

Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang,
Nan Yang, and Dongsheng Li. 2019. Read+ verify:
Machine reading comprehension with unanswerable
questions. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 33, pages 6529–6537.

Yi-Chong Huang, Xiachong Feng, Xiaocheng Feng, and
Bing Qin. 2021. The factual inconsistency problem
in abstractive text summarization: A survey. ArXiv,
abs/2104.14839.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan
Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea
Madotto, and Pascale Fung. 2023. Survey of halluci-
nation in natural language generation. ACM Comput.
Surv., 55(12).

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu,
Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea
Madotto, and Pascale Fung. 2022. Survey of hal-
lucination in natural language generation. ArXiv,
abs/2202.03629.

Rosie Jones, Ben Carterette, Ann Clifton, Maria Es-
kevich, Gareth JF Jones, Jussi Karlgren, Aasish
Pappu, Sravana Reddy, and Yongze Yu. 2021. TREC
2020 podcasts track overview. arXiv preprint
arXiv:2103.15953.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan Mc-
Cann, Caiming Xiong, and Richard Socher. 2019.
Neural text summarization: A critical evaluation. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 540–551, Hong
Kong, China. Association for Computational Linguis-
tics.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong,
and Richard Socher. 2020. Evaluating the factual
consistency of abstractive text summarization. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 9332–9346, Online. Association for Computa-
tional Linguistics.

Souvik Kundu and Hwee Tou Ng. 2018. A nil-aware
answer extraction framework for question answering.
In Proceedings of the 2018 Conference on Empiri-
cal Methods in Natural Language Processing, pages
4243–4252, Brussels, Belgium. Association for Com-
putational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,
and Eduard Hovy. 2017. RACE: Large-scale ReAd-
ing comprehension dataset from examinations. In
Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing, pages 785–
794, Copenhagen, Denmark. Association for Compu-
tational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. 2020.
BART: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and com-
prehension. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 7871–7880, Online. Association for Computa-
tional Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A robustly optimized BERT pretraining
approach. arXiv preprint arXiv:1907.11692.

Potsawee Manakul and Mark JF Gales. 2022. Pod-
cast Summary Assessment: A resource for evaluat-
ing summary assessment methods. arXiv preprint
arXiv:2208.13265.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and
Ryan McDonald. 2020. On faithfulness and factu-
ality in abstractive summarization. In Proceedings
of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 1906–1919, On-
line. Association for Computational Linguistics.

Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu,
Patrick Ng, Kathleen McKeown, Ramesh Nallapati,
Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, and
Bing Xiang. 2021. Improving factual consistency
of abstractive summarization via question answering.
In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the
11th International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers), pages
6881–6894, Online. Association for Computational
Linguistics.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata.
2018. Don’t give me the details, just the summary!
topic-aware convolutional neural networks for ex-
treme summarization. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 1797–1807, Brussels, Bel-
gium. Association for Computational Linguistics.

Nedjma Ousidhoum, Zhangdie Yuan, and Andreas Vla-
chos. 2022. Varifocal question generation for fact-
checking. In Proceedings of the 2022 Conference on
Empirical Methods in Natural Language Processing,
pages 2532–2544, Abu Dhabi, United Arab Emirates.
Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu-
ation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu-
tational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. 2020. Exploring the limits
of transfer learning with a unified text-to-text trans-
former. Journal of Machine Learning Research.

Vatsal Raina and Mark Gales. 2022. Answer uncertainty
and unanswerability in multiple-choice machine read-
ing comprehension. In Findings of the Association
for Computational Linguistics: ACL 2022, pages
1020–1034, Dublin, Ireland. Association for Compu-
tational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and
Percy Liang. 2016. SQuAD: 100,000+ questions for
machine comprehension of text. In Proceedings of
the 2016 Conference on Empirical Methods in Natu-
ral Language Processing, pages 2383–2392, Austin,
Texas. Association for Computational Linguistics.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier,
Benjamin Piwowarski, Jacopo Staiano, Alex Wang,
and Patrick Gallinari. 2021. QuestEval: Summariza-
tion asks for fact-based evaluation. In Proceedings of
the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, pages 6594–6604, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.

Thomas Scialom, Sylvain Lamprier, Benjamin Pi-
wowarski, and Jacopo Staiano. 2019. Answers unite!
unsupervised metrics for reinforced summarization
models. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
3246–3256, Hong Kong, China. Association for Com-
putational Linguistics.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020.
BLEURT: Learning robust metrics for text genera-
tion. In Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, pages
7881–7892, Online. Association for Computational
Linguistics.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020.
Asking and answering questions to evaluate the fac-
tual consistency of summaries. In Proceedings of the
58th Annual Meeting of the Association for Compu-
tational Linguistics, pages 5008–5020, Online. Asso-
ciation for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of the 2018 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume
1 (Long Papers), pages 1112–1122, New Orleans,
Louisiana. Association for Computational Linguis-
tics.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng.
2020. ReClor: A reading comprehension dataset
requiring logical reasoning. In International Confer-
ence on Learning Representations.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and
Peter Liu. 2020a. PEGASUS: Pre-training with extracted
gap-sentences for abstractive summarization. In In-
ternational Conference on Machine Learning, pages
11328–11339. PMLR.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q.
Weinberger, and Yoav Artzi. 2020b. BERTScore:
Evaluating text generation with BERT. In International
Conference on Learning Representations.

A More Details about Models and Data

Training QG and QA systems
We train the question+answer generation model
(G1) on RACE or SQuAD, and train the distractor
generation model (G2) and the answering model
(A) on RACE. We apply early stopping when the
performance on the validation set stops improving.
We use a batch size of 8 for the G1 and G2 models
(T5) and 2 for the A model (Longformer). The
learning rate is set to 1e-6, and we use the Adam
optimizer. We
carried out training on one NVIDIA A100-80GB
GPU. Training one generation model (T5-large)
takes around 8 hours, and training the answering
model (Longformer-4096) takes up to 2 days. Run-
ning MQAG inference with generation=T5-large
and answering=Longformer-4096 on one NVIDIA
P100 GPU takes around 3 seconds per question.

Licenses
The licenses of the datasets are CC-BY-4.0 for
XSum-Hallucination and Podcast Assessment, and
MIT license for SummEval. For QAG, we were
unable to find its license. The licenses of T5 and
Longformer backbone models are apache-2.0.

Open-Sourcing Trained Models
To allow the trained models in MQAG to be used
for research purposes in other question genera-
tion and answering tasks, we have made them
available online. The links to these models on
HuggingFace can be found on our project page at
https://github.com/potsawee/mqag0.

B Statistical Distances

[Figure 4 plot: four panels for p_1 = 0.00, 0.25, 0.50, 0.75. X-axis: p_2 (0.0 to 1.0). Y-axis: Distance (0.0 to 1.4). Curves: KL-Div, One-Best, TotalVar, Hellinger.]

Figure 4: Statistical distances between two Bernoulli
distributions [p1; 1 − p1] and [p2; 1 − p2]. The four
plots correspond to p1 = 0.00, 0.25, 0.50, 0.75; the
Y-axis shows the distance D and the X-axis shows p2.
KL divergence is unbounded, so its value can be
exceedingly large. One-best, in contrast, is bounded
between 0.0 and 1.0 but is discontinuous. Total
variation and Hellinger distance are continuous and
bounded between 0.0 and 1.0.
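
The behaviour described in the Figure 4 caption can be reproduced with a short sketch. The code below uses the standard textbook definitions of the four distances. The direction of the KL divergence, the natural logarithm, and all function names are assumptions made for illustration; this section does not specify them.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats. Infinite if q has zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf  # this is what makes KL unbounded
        total += pi * math.log(pi / qi)
    return total


def one_best(p, q):
    """1.0 if the most probable option differs, otherwise 0.0."""
    return 0.0 if p.index(max(p)) == q.index(max(q)) else 1.0


def total_variation(p, q):
    """Half the L1 distance; bounded in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def hellinger(p, q):
    """Hellinger distance; bounded in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2.0)


def bernoulli(p):
    """Two-outcome distribution [p, 1 - p]."""
    return [p, 1.0 - p]


if __name__ == "__main__":
    first = bernoulli(0.25)
    for p2 in (0.0, 0.25, 0.5, 0.75, 1.0):
        second = bernoulli(p2)
        print(
            p2,
            kl_divergence(first, second),
            one_best(first, second),
            total_variation(first, second),
            hellinger(first, second),
        )
```

Sweeping p2 over a finer grid shows the one-best distance jumping between 0 and 1 at p2 = 0.5, while total variation and Hellinger change smoothly.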

C Computing Correlation

Following the notation in Deutsch et al. (2021), let
z_i^j and z̄_i^j be the two scores of metrics Z and Z̄
for the summary output by system i ∈ {1, ..., N} on
the document j ∈ {1, ..., M}. In this work, Z is the
evaluation method, and Z̄ is the human judgement.
The correlations, e.g. Pearson or Spearman’s rank
correlation coefficient, are defined as follows:

• System-level (i.e. Corpus-level)

\[
\rho = \mathrm{Corr}\left\{ \frac{\sum_j z_i^j}{M}, \frac{\sum_j \bar{z}_i^j}{M} \right\}_{i=1}^{N}
\]


• Summary-level (i.e. Sentence-level)

\[
\rho = \frac{1}{M} \sum_j \mathrm{Corr}\left( \left\{ z_i^j, \bar{z}_i^j \right\}_{i=1}^{N} \right)
\]
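
The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in this work: the nested-list layout `z[j][i]` (document j, system i), the function names, and the toy scores are all assumptions. Pearson correlation is used for Corr, and a rank-based function could be passed in its place.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def system_level(z, z_bar, corr=pearson):
    """Average each system's scores over documents, then correlate across systems."""
    num_docs = len(z)
    num_systems = len(z[0])
    metric_means = [sum(z[j][i] for j in range(num_docs)) / num_docs
                    for i in range(num_systems)]
    human_means = [sum(z_bar[j][i] for j in range(num_docs)) / num_docs
                   for i in range(num_systems)]
    return corr(metric_means, human_means)


def summary_level(z, z_bar, corr=pearson):
    """Correlate across systems within each document, then average over documents."""
    per_document = [corr(z[j], z_bar[j]) for j in range(len(z))]
    return sum(per_document) / len(per_document)


# Toy scores (hypothetical): M = 2 documents, N = 3 systems.
z = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]       # metric scores
z_bar = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]   # human judgements

print("system-level:", system_level(z, z_bar))
print("summary-level:", summary_level(z, z_bar))
```

In this toy case the metric agrees with the human ranking on the first document and reverses it on the second, so the summary-level correlation averages out to zero, whereas the system-level correlation remains high because the per-system means still line up. The two levels can therefore give quite different pictures of the same metric.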

D Additional Results

D.1 Ablation: Model Choices
For generation models, we measure cross-entropy
losses on the RACE test set:

• T5-base (223M): G1 = 1.612, G2 = 1.875

• T5-large (738M): G1 = 1.478, G2 = 1.741

where G1 denotes question+answer generation, and
G2 denotes distractor generation. For answering
models, we measure accuracy on the RACE test set:

• RoBERTa (355M): Accuracy = 84.84

• Longformer (435M): Accuracy = 81.67

                          Pearson Corr.
Generation  Answering     SumE    QAG-X   Podc
T5-base     RoBERTa       0.949   0.242   0.471
T5-base     Longformer    0.949   0.293   0.647
T5-large    RoBERTa       0.930   0.211   0.350
T5-large    Longformer    0.930   0.229   0.772

Table 8: Ablation on model choices in MQAG using
N=20. SumE = SummEval (Consistency aspect), QAG-
X = QAG-XSum, Podc = Podcast Assessment.

D.2 MQAG Results
Here, we provide results that are complementary
to those presented in the main text. Figure 5 illus-
trates the answerability results on QAG-XSum and
Podcast, and Figure 6 illustrates the impact of N
on the remaining datasets not presented in the main
text. Table 10 shows the results of all MQAG con-
figurations. Table 11 shows the Spearman’s rank
correlation coefficient of the main results.



Method       QAG-CNNDM  QAG-XSum  XSum-H  Podcast  SummEval
MQAG_SQuAD   35.0       37.4      34.0    34.7     37.0
MQAG_RACE    30.5       30.0      30.0    30.5     31.1

Table 9: The number of remaining questions at N_y^τ = 2.0.
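
As a rough illustration of how such a threshold could filter questions, the sketch below assumes that answerability is measured by the effective number of options, 2 raised to the entropy (in bits) of the answer distribution given the summary, and that a question is kept when this value does not exceed the threshold. This is an assumption based on the threshold notation; this appendix does not define it, and the function names, the dictionary key `summary_probs`, and the example distributions are hypothetical.

```python
import math


def effective_num_options(probs):
    """2 ** entropy (bits): 1.0 for a one-hot distribution, K for uniform over K."""
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy_bits


def filter_answerable(questions, threshold=2.0):
    """Keep questions whose effective number of options is at most the threshold.

    Each question is a dict with a hypothetical key 'summary_probs' holding
    the answer distribution over options given the summary.
    """
    return [q for q in questions
            if effective_num_options(q["summary_probs"]) <= threshold]


questions = [
    {"id": "confident", "summary_probs": [0.97, 0.01, 0.01, 0.01]},
    {"id": "uncertain", "summary_probs": [0.25, 0.25, 0.25, 0.25]},
]
kept = filter_answerable(questions, threshold=2.0)
print([q["id"] for q in kept])
```

Under this reading, a threshold of 2.0 discards questions for which the answering system is, in effect, undecided between more than two options.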

G’s Inp. G1-trained Dist. Ans. QAG-CNNDM QAG-XSum XSum-H-Faithful XSum-H-Factual Podcast SumEvl

Src x SQuAD DKL ✗ 0.219 0.008 0.070 0.027 0.432 0.726
Src x SQuAD DOB ✗ 0.264 0.003 0.165 0.064 0.788 0.703
Src x SQuAD DTV ✗ 0.272 0.017 0.093 0.037 0.470 0.707
Src x SQuAD DHL ✗ 0.266 0.010 0.081 0.032 0.517 0.713
Sum y SQuAD DKL ✗ 0.478 0.374 0.177 0.226 0.251 0.936
Sum y SQuAD DOB ✗ 0.476 0.354 0.295 0.254 0.677 0.872
Sum y SQuAD DTV ✗ 0.508 0.396 0.269 0.267 0.225 0.870
Sum y SQuAD DHL ✗ 0.499 0.399 0.266 0.269 0.201 0.870

F1 SQuAD DKL ✗ 0.508 0.361 0.197 0.213 0.531 0.921
F1 SQuAD DOB ✗ 0.416 0.161 0.296 0.199 0.825 0.869
F1 SQuAD DTV ✗ 0.490 0.393 0.286 0.261 0.475 0.863
F1 SQuAD DHL ✗ 0.481 0.387 0.274 0.255 0.487 0.862

Sum y SQuAD DKL Ny 0.483 0.396 0.229 0.249 0.545 0.943
Sum y SQuAD DOB Ny 0.517 0.385 0.286 0.256 0.711 0.914
Sum y SQuAD DTV Ny 0.519 0.407 0.324 0.292 0.502 0.890
Sum y SQuAD DHL Ny 0.512 0.413 0.323 0.299 0.385 0.889
Src x RACE DKL ✗ 0.143 0.097 0.088 0.054 0.321 0.599
Src x RACE DOB ✗ 0.226 0.091 0.160 0.091 0.534 0.612
Src x RACE DTV ✗ 0.233 0.143 0.069 0.087 0.144 0.588
Src x RACE DHL ✗ 0.221 0.148 0.056 0.083 0.222 0.592
Sum y RACE DKL ✗ 0.450 0.283 0.135 0.179 0.789 0.954
Sum y RACE DOB ✗ 0.453 0.225 0.240 0.221 0.839 0.928
Sum y RACE DTV ✗ 0.462 0.309 0.221 0.244 0.770 0.933
Sum y RACE DHL ✗ 0.473 0.323 0.215 0.244 0.751 0.927

F1 RACE DKL ✗ 0.480 0.266 0.156 0.198 0.830 0.908
F1 RACE DOB ✗ 0.379 0.192 0.268 0.206 0.796 0.815
F1 RACE DTV ✗ 0.468 0.301 0.217 0.252 0.731 0.866
F1 RACE DHL ✗ 0.472 0.317 0.206 0.252 0.693 0.858

Sum y RACE DKL Ny 0.460 0.302 0.208 0.206 0.857 0.961
Sum y RACE DOB Ny 0.466 0.233 0.266 0.226 0.822 0.954
Sum y RACE DTV Ny 0.502 0.313 0.306 0.270 0.855 0.945
Sum y RACE DHL Ny 0.501 0.328 0.305 0.273 0.860 0.936

Table 10: Pearson correlation coefficients of all MQAG configurations. Our MQAG results are based on N=50.
When applying the answerability mechanism, the threshold N_y^τ is set to 2.0.



[Figure 5 plot: two panels, QAG-XSum and Podcast. X-axis: Ny threshold (1.0 to 4.0). Curves: MQAG_SQuAD, MQAG_RACE.]

Figure 5: ∆PCC of MQAG-Sum with total variation against the answerability threshold N_y^τ on the X-axis. This
figure extends Figure 2 in the main text.

[Figure 6 plot: eight panels. Top row: QAG-XSum (Mean), XSum-H-Faithful (Mean), Podcast (Mean), SummEval (Mean). Bottom row: QAG-XSum (Std), XSum-H-Faithful (Std), Podcast (Std), SummEval (Std). X-axis: 0 to 50.]

Figure 6: Mean (top row) and standard deviation (bottom row) of Pearson correlation (Y-axis) of MQAG_RACE when
the number of generated questions N is varied from 1 to 50 (X-axis). This figure extends Figure 3 in the main text.

Method                    QAG-CNNDM  QAG-XSum  XSum-H-Faithful  XSum-H-Factual  Podcast  SumEvl

Baselines: Other Approaches
ROUGE-1                   0.318      0.053     -0.030           0.001           0.282    0.627
OpenIE-TripleMatching     0.337      0.130     0.019            -0.025          0.700    0.671
BERTScore                 0.523      0.018     0.183            0.153           0.686    0.835
Entailment (BERT Model)   0.167      0.190     0.380            0.202           0.207    0.141

Baselines: SpanQAG
QAGS                      0.341      0.166     0.085            0.052           0.357    0.421
FEQA                      0.275      0.277     0.300            0.155           0.504    0.270
QuestEval                 0.181      0.175     0.415            0.176           0.425    0.812

Multiple-choice Question Answering and Generation (MQAG)
MQAG_SQuAD                0.470      0.409     0.335            0.284           0.441    0.773
MQAG_RACE                 0.460      0.308     0.322            0.266           0.779    0.920

Table 11: Spearman’s rank correlation coefficient between the scores of summary evaluation methods and human
judgements. This table is complementary to Table 5 which reports Pearson’s correlation coefficient results.



Source: A G4S security van has been robbed outside a branch of royal bank of Scotland in
Glasgow city centre. Police said three armed men took a five-figure sum from the vehicle in the city’s
Sauchiehall street on Monday at about 21:45. A spokesman said no-one had been injured although two
security guards aged 47 and 49 were left badly shaken. The area around the bank, which is near the
Buchanan galleries shopping centre, has been cordoned off by police. Police said the security guards had
been making their delivery when they were approached by the three armed men, who threatened them and
demanded they hand over a box of money. It is understood the cash taken was in the region of £50,000.
Following the robbery, the three men got into a white seat Leon car, which sped off along west Nile street
towards the cowcaddens area. [...]

Summary: Two security guards have been threatened during a robbery at a bank in Edinburgh.

Generated question (using summary): The robbery happened in _ .
Generated options (using summary): (1) Edinburgh (2) a bank (3) a shop (4) a small town.

Prob. over options given Source: 0.077, 0.895, 0.018, 0.010
Prob. over options given Summary: 0.687, 0.295, 0.000, 0.018

Table 12: Example from QAG-XSum (documentID=1). Factual inconsistency in the summary is highlighted in red.
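
To make the example in Table 12 concrete, the following sketch computes the total variation and one-best distances between the two answer distributions listed above. The probabilities are copied from the table; the function names are illustrative, and the definitions are the standard ones rather than code taken from the MQAG implementation.

```python
# Answer probabilities over the four options, copied from Table 12.
source_probs = [0.077, 0.895, 0.018, 0.010]
summary_probs = [0.687, 0.295, 0.000, 0.018]


def total_variation(p, q):
    """Half the L1 distance between two distributions; bounded in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def one_best(p, q):
    """1.0 if the most probable option differs, otherwise 0.0."""
    return 0.0 if p.index(max(p)) == q.index(max(q)) else 1.0


tv = total_variation(summary_probs, source_probs)
ob = one_best(summary_probs, source_probs)

print(f"total variation = {tv:.3f}")
print(f"one-best = {ob}")
```

The total variation comes out at about 0.618, and the one-best distance is 1.0 because the summary favours option (1) "Edinburgh" while the source favours option (2) "a bank". Both values signal that the summary and the source disagree on this question, which matches the inconsistency in the summary: the source places the robbery in Glasgow, not Edinburgh.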
