LLM Comparative Assessment: Zero-shot NLG Evaluation through
Pairwise Comparisons using Large Language Models

Adian Liusie, Potsawee Manakul, Mark J. F. Gales
ALTA Institute, Department of Engineering, University of Cambridge
al826@cam.ac.uk, pm574@cam.ac.uk, mjfg@eng.cam.ac.uk

Abstract

Current developments in large language models
(LLMs) have enabled impressive zero-shot ca-
pabilities across various natural language tasks.
An interesting application of these systems is
in the automated assessment of natural lan-
guage generation (NLG), a highly challeng-
ing area with great practical benefit. In this
paper, we explore two options for exploiting
the emergent abilities of LLMs for zero-shot
NLG assessment: absolute score prediction,
and comparative assessment which uses rela-
tive comparisons between pairs of candidates.
Though comparative assessment has not been
extensively studied in NLG assessment, we
note that humans often find it more intuitive
to compare two options rather than scoring
each one independently. This work examines
comparative assessment from multiple perspec-
tives: performance compared to absolute grad-
ing; positional biases in the prompt; and effi-
cient ranking in terms of the number of com-
parisons. We illustrate that LLM comparative
assessment is a simple, general and effective
approach for NLG assessment. For moderate-
sized open-source LLMs, such as FlanT5 and
Llama2-chat, comparative assessment is supe-
rior to prompt scoring, and in many cases can
achieve performance competitive with state-of-
the-art methods. Additionally, we demonstrate
that LLMs often exhibit strong positional bi-
ases when making pairwise comparisons, and
we propose debiasing methods that can further
improve performance.

1 Introduction

With the current rapid advances in generative AI,
pre-trained models are increasingly utilized in a
range of NLP tasks, necessitating reliable evalua-
tions of these models. Human evaluation, where an-
notators critically assess the quality of the outputs
of natural language generation (NLG) systems, has
been the gold standard approach (Lita et al., 2005;
Belz and Reiter, 2006; Lai and Tetreault, 2018;
Fabbri et al., 2021). However, human evaluation
has its drawbacks, and is notably labor-intensive,
time-consuming, and costly. As such, automating
the evaluation process and assessing NLG systems
without human intervention is highly desirable.

[Figure 1 panels: "LLM Prompt Scoring" (each summary is given a score between 1 and 10, and the scores are sorted into a ranking) and "LLM Comparative Assessment" (each pair of summaries is compared, giving a matrix of A/B decisions that is converted into a ranking).]

Figure 1: Prompt Scoring vs. Comparative Assessment.
Comparative Assessment prompts an LLM to compare candidates
in a pairwise manner, and the comparisons are subsequently
converted into scores or ranks.

Though there has been considerable progress in
automatic evaluation methods, many proposed ap-
proaches have certain restrictions that limit their
effectiveness. A large body of existing work uses
evaluation methods designed for particular tasks
and attributes (Mehri and Eskenazi, 2020a; Rei
et al., 2020; Manakul et al., 2023b), for example,
measuring the consistency of summaries (Wang
et al., 2020; Manakul et al., 2023a). Though effec-
tive within their domain, these approaches are not
extensible to different NLG aspects and cannot be
used by practitioners wishing to evaluate systems
on inputs or properties that are less common.

The recent emergence of new abilities in LLMs (Wei et al., 2022)
has enabled them to achieve impressive zero-shot performance
for a range of language tasks. This has led to general
prompt-based assessment approaches, such as prompt-scoring,
where an LLM is probed to score outputs on a particular aspect
(Wang et al., 2023; Kocmi and Federmann, 2023). These approaches
are often only effective with massive LLMs with 175B+ parameters,
which may limit their applicability, especially when such models
can only be accessed through an API.

With the insight that for humans, it is often eas-
ier to select which of two options is better than
it is to score options independently, we question
whether pairwise comparisons may be more effec-
tive at leveraging the impressive emergent ability of
LLMs. In this work, we consider LLM comparative
assessment, where an LLM is prompted to compare
pairs of NLG candidates and predict which one is
better. We demonstrate empirically that comparative
assessment performs much better than prompt-scoring
for FlanT5 and Llama style models, and enables
moderate-sized open-source LLMs to achieve
near (or above) state-of-the-art performance across
a range of NLG tasks, for a diverse set of attributes.
Our approach is general: it can be applied to a
diverse range of tasks and textual attributes, is
simple, and requires minimal prompt
engineering. Further, we demonstrate that pairwise
LLM comparisons often exhibit strong positional
biases, where the ordering of candidates impacts
the decisions. We introduce a simple debiasing
method and empirically illustrate that debiasing
can provide further performance improvements, es-
pecially when large biases are present.

Our contributions are: 1) This is the first work
that comprehensively analyzes pairwise comparative
assessment for NLG evaluation; 2) We demonstrate
that comparative assessment is far more effective
than prompt-scoring for moderately-sized LLMs, and
yields performance that is state-of-the-art for
particular attributes; 3) We demonstrate that
positional bias impacts comparative decisions, and
introduce a method to debias LLMs which leads to
performance boosts, especially when only a subset
of comparisons is considered.

2 Background and Related Work

2.1 Reference-based Evaluation

In NLG evaluation, a standard approach is the
comparison of annotator-provided gold-standard
references with the generated response. Estab-
lished heuristics, such as the N-gram overlap met-

rics ROUGE (Lin, 2004) and METEOR (Baner-
jee and Lavie, 2005), have extensively been ap-
plied for assessing summarization and machine
translation respectively. Recently, the paradigm
has evolved to incorporate embedding-based meth-
ods like BERTScore (Zhang et al., 2019), which
not only compares generated texts with references,
but also factors in semantic considerations beyond
word overlap.

2.2 Tailored NLG Evaluation Approaches

Tailored approaches have been proposed for assessing
specific properties of generated texts. For example,
question-answering systems are used to probe
information consistency in summary consistency
assessment (Wang et al., 2020; Scialom et al., 2021).
For dialogue quality assessment, the language model
probability from a DialoGPT system is used as a proxy
for response quality (Mehri and Eskenazi, 2020b).
A survey of NLG evaluation methods was conducted by
Celikyilmaz et al. (2020).

2.3 Zero-shot LLM Evaluation

Given the current capabilities of LLMs such as
ChatGPT and GPT4, the zero-shot ability of these
systems for a wide range of tasks, including NLG
evaluation, has been investigated. Existing works
have looked at using LLMs to evaluate open-ended
story generation and adversarial attacks (Chiang
and Lee, 2023) and using ChatGPT to score the
quality of texts along a certain axis (Wang et al.,
2023; Kocmi and Federmann, 2023), demonstrat-
ing that ChatGPT can be used in a zero-shot setting
and achieve reasonable performance.

2.4 LLM Pairwise Comparisons

Pairwise comparative judgement (Thurstone, 1927)
has been a popular approach for assessing exam
candidates, though typically with human assessors.
The use of LLMs for pairwise comparisons has been
relatively underexplored, with concurrent work
using pairwise rankings for information retrieval
(Qin et al., 2023) and, separately, for assessing
LLM-based chat assistants on open-ended questions,
where outputs are compared to those of a baseline
system (Chiang et al., 2023; Zheng et al., 2023).


3 Comparative Assessment

3.1 Notation

In this work, we investigate using LLM comparative
judgements for NLG assessment. Assume that there is a
context $d$ (e.g., a text passage or dialogue) and a set
of $N$ candidate responses, $x_{1:N}$. For a given
attribute (e.g., coherence, consistency, fluency) the $N$
candidates have true underlying scores, $s_{1:N}$. As
scores often only have relative meaning, in this work
only the ranks of the candidates will be evaluated. The
objective is therefore to accurately predict the true
ranks, $r_{1:N}$, of the candidate scores. In comparative
assessment, one uses pairwise comparisons to determine
which of the two input responses is better. Let
$y_{ij} \in \{0, 1\}$ represent the true outcome of
whether $x_i$ is higher ranked than $x_j$, such that
$y_{ij} = \mathbb{1}(s_i > s_j)$. Here, an LLM is used to
model the probability that response $i$ is better than
response $j$, $p_{ij}$,

$$p_{ij} = P(y_{ij} \mid x_i, x_j, d) \tag{1}$$

which can alternatively be converted into hard decisions,
$\hat{y}_{ij}$, by selecting the most likely outcome,

$$\hat{y}_{ij} = \begin{cases} 1, & \text{if } p_{ij} > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Let $C = \{c_k\}_{k=1 \dots R}$ represent a set of
comparisons, where $R$ is the total number of comparisons,
and each comparison $c = (i, j)$ indicates the indices of
the two considered candidate responses. For example, the
set of all possible comparisons,
$C = \{(i, j) \mid i, j \in [1 \dots N],\ i \neq j\}$,
could be used, or alternatively a smaller subset of
comparisons.
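
As a minimal sketch of the notation above, the following Python enumerates the full comparison set and the true outcomes $y_{ij}$ for a toy example. The scores are made-up values, candidates are indexed from 0 rather than 1, and the helper names are illustrative assumptions rather than part of the method.

```python
from itertools import permutations


def all_comparisons(n):
    """Every ordered pair (i, j) with i != j, i.e. N(N-1) comparisons."""
    return list(permutations(range(n), 2))


def true_outcome(scores, i, j):
    """y_ij = 1 if candidate i has the higher true score, else 0."""
    return int(scores[i] > scores[j])


# Hypothetical true scores s_{1:N} for N = 4 candidates (0-indexed here).
scores = [9, 8, 1, 5]
C = all_comparisons(len(scores))
y = {(i, j): true_outcome(scores, i, j) for (i, j) in C}
```

Both orderings of each pair appear in `C`, which is what later allows positional bias to be measured.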

3.2 Prompt Design

To leverage the emergent ability of LLMs, we use
comparative prompts that probe a model to decide
which of the two candidates is better. Let $T$ be a
prompt template that converts candidate responses
$x_i$ and $x_j$, as well as context $d$, into an output text,
the prompt $\mathcal{P} = T(x_i, x_j, d)$. This work aims to find
a simple, general and robust assessment method,
and as such extensive prompt engineering is not
in the scope of this work (despite possible perfor-
mance gains). We evaluate two simple and suitable
prompts in our initial investigations. Our prompts
for comparative assessment are shown in Figure 2.

Passage:
<context>

Summary A: <Summary 1>
Summary B: <Summary 2>

Which Summary is more consistent relative 
to the passage, Summary A or Summary B?

<context>

Summary A: <Summary 1>
Summary B: <Summary 2>

Which Summary is more consistent, 
Summary A or Summary B?

Prompt 1

Prompt 2

Figure 2: Comparative prompt template 1 and 2. When
assessing different attributes, only the attribute is changed
(e.g., consistent → engaging) and for response assessment,
the word ‘summary’ is replaced with ‘response’.
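
The template function $T$ can be sketched as a plain string formatter. The example below follows the shorter of the two templates in Figure 2 (the one without the "Passage:" header); the function name, argument names and defaults are illustrative assumptions.

```python
def comparative_prompt(context, cand_a, cand_b,
                       attribute="consistent", noun="Summary"):
    """Fill the shorter Figure 2 template; all argument names are illustrative."""
    return (
        f"{context}\n\n"
        f"{noun} A: {cand_a}\n"
        f"{noun} B: {cand_b}\n\n"
        f"Which {noun} is more {attribute}, {noun} A or {noun} B?"
    )


# Both orderings are needed when all permutations are compared.
forward = comparative_prompt("<context>", "<Summary 1>", "<Summary 2>")
reverse = comparative_prompt("<context>", "<Summary 2>", "<Summary 1>")
```

Changing `attribute` to "engaging" and `noun` to "Response" gives the dialogue variant described in the caption.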

3.3 Comparative Decisions
A central aspect of LLM comparative assessment is
the method for obtaining comparative decisions.
In this section, we consider two approaches for
leveraging LLMs for comparative assessment: first,
for when one has access to output token-level
probabilities (Prompt-Based Classifier), and second,
for when only the output texts are available
(Text Generation).

Prompt-Based Classifier: If one has access to the
output probabilities, an efficient method to get
probability estimates of the predictions is to leverage
prompt-based classifiers. Let $P_\theta(w \mid x)$
represent an LLM's conditional language model
distribution of the output sequence $w$ given the input
text $x$. For prompt-based classifiers, the LM
probabilities of specific label words ($w_k$) are used as
a proxy for the class decisions (Liusie et al., 2023).
For example, in summarization assessment, given a prompt
$\mathcal{P}$ ending in '... which summary is better',
one can set $w_i$ = 'Summary A' and $w_j$ = 'Summary B'
and define the probability that response $i$ is better
than response $j$ as:

$$p_{ij} = \frac{P_\theta(w_i \mid \mathcal{P})}{P_\theta(w_i \mid \mathcal{P}) + P_\theta(w_j \mid \mathcal{P})} \tag{3}$$
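
Equation (3) is a two-way normalisation of the label-word probabilities. The sketch below assumes the log-probabilities of the two label words have already been obtained from some model (how they are obtained is outside this example), and the function name is hypothetical.

```python
import math


def comparative_probability(logprob_a, logprob_b):
    """Eq. (3): normalise the two label-word probabilities so they sum to one.

    Inputs are assumed to be log P(w_i | prompt) and log P(w_j | prompt).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logprob_a, logprob_b)
    pa = math.exp(logprob_a - m)
    pb = math.exp(logprob_b - m)
    return pa / (pa + pb)


# Example: label words with probabilities 0.3 and 0.1 under the LM.
p_ij = comparative_probability(math.log(0.3), math.log(0.1))
```

Because only the ratio of the two probabilities matters, any probability mass the model places on other tokens drops out.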

Text Generation: Alternatively, if only limited API
access is available, one can sample responses from
the conditional LM given the input prompt $\mathcal{P}$,

$$\tilde{w}^{(k)} \sim P_\theta(w \mid \mathcal{P}) \tag{4}$$

Let $f(\tilde{w}) \in \{0, 1\}$ be a function that maps
the text response to the comparative decision. By
generating $K$ samples from the LLM, one can estimate
the comparative probability $p_{ij}$ as the fraction of
the samples that select $x_i$ over $x_j$,

$$p_{ij} = \frac{1}{K} \sum_{k=1}^{K} f(\tilde{w}^{(k)}) \tag{5}$$
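
A possible realisation of $f$ and of the estimate in Equation (5) is sketched below. The parsing rule (a case-insensitive search for "summary a" or "response a") and the handling of unparseable samples are assumptions made for illustration; the text only requires that $f$ maps a response to a decision.

```python
import re

_PICKS_A = re.compile(r"\b(summary|response) a\b")
_PICKS_B = re.compile(r"\b(summary|response) b\b")


def parse_decision(text):
    """f(w): 1 if the text picks A, 0 if it picks B, None if ambiguous."""
    lowered = text.lower()
    picks_a = _PICKS_A.search(lowered) is not None
    picks_b = _PICKS_B.search(lowered) is not None
    if picks_a == picks_b:  # neither or both mentioned
        return None
    return int(picks_a)


def estimate_probability(samples):
    """Eq. (5): fraction of parseable samples that select the first candidate."""
    decisions = [d for d in map(parse_decision, samples) if d is not None]
    if not decisions:
        return 0.5  # assumption: fall back to an uninformative estimate
    return sum(decisions) / len(decisions)


samples = ["Summary A is the more coherent summary", "Summary B", "summary a"]
p_ij = estimate_probability(samples)
```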

3.4 Comparisons to Ranks
Although the full set of possible comparisons yields
the most information for the rankings, this requires
$R = N(N-1)$ comparisons, which can be computationally
expensive. For computational efficiency, we can consider
3 different comparison selection strategies: random,
no-repeat and symmetric. For random, comparisons are
randomly selected from the set of all possible
comparisons. For no-repeat, if $(x_i, x_j)$ is selected
then $(x_j, x_i)$ will not be selected. For symmetric,
if $(x_i, x_j)$ is selected, then $(x_j, x_i)$ will also
be selected.
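
The three selection strategies can be sketched as follows. Sampling without replacement, the random orientation of each pair under no-repeat, and the function signature are assumptions; the text only specifies which reversed pairs may or may not co-occur.

```python
import random
from itertools import combinations, permutations


def select_comparisons(n, r, strategy, rng=None):
    """Pick r ordered comparisons (i, j) among n candidates (0-indexed)."""
    if rng is None:
        rng = random.Random(0)
    if strategy == "random":
        # Any ordered pair may be chosen; (i, j) and (j, i) can both appear.
        return rng.sample(list(permutations(range(n), 2)), r)
    pairs = list(combinations(range(n), 2))
    if strategy == "no-repeat":
        # Each unordered pair is used at most once, in a random orientation.
        chosen = rng.sample(pairs, r)
        return [(i, j) if rng.random() < 0.5 else (j, i) for i, j in chosen]
    if strategy == "symmetric":
        # Each chosen pair is compared in both orderings (r // 2 pairs).
        chosen = rng.sample(pairs, r // 2)
        return [c for i, j in chosen for c in ((i, j), (j, i))]
    raise ValueError(f"unknown strategy: {strategy}")
```

Note that no-repeat can supply at most $N(N-1)/2$ comparisons, so `rng.sample` raises a `ValueError` if `r` exceeds that.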

Given a set of selected comparisons $C$ and a
comparative assessment system with weights $\theta$,
one can generate a predicted rank ordering
$\hat{r}_{1:N}$ of the candidate responses. A simple but
effective approach is to score the candidates by
win-ratio,

$$\hat{s}_i = \frac{\#\text{wins of } x_i}{\#\text{comparisons involving } x_i} \tag{6}$$

and then order the scores to obtain the predicted
ranks $\hat{r}_{1:N}$.
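
The win-ratio of Equation (6) and the conversion to ranks can be sketched as below. The dictionary layout, the score of 0.0 for a candidate that appears in no comparison, and tie-breaking by index are assumptions of this example.

```python
def win_ratio_scores(n, decisions):
    """Eq. (6). decisions[(i, j)] is 1 if x_i was judged better than x_j, else 0."""
    wins = [0] * n
    counts = [0] * n
    for (i, j), y_hat in decisions.items():
        wins[i] += y_hat
        wins[j] += 1 - y_hat
        counts[i] += 1
        counts[j] += 1
    # Assumption: a candidate never compared receives a score of 0.0.
    return [w / c if c else 0.0 for w, c in zip(wins, counts)]


def scores_to_ranks(scores):
    """Rank 1 is the highest score; ties are broken by candidate index."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


# Toy example: decisions from a perfect comparator over true scores [9, 8, 1, 5].
true_scores = [9, 8, 1, 5]
decisions = {(i, j): int(true_scores[i] > true_scores[j])
             for i in range(4) for j in range(4) if i != j}
ranks = scores_to_ranks(win_ratio_scores(4, decisions))
```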

3.5 Debiased Comparative Assessment
Let $\tilde{y}_{ij}$ represent the outcome of the
comparison when considered in the opposite ordering,
such that $\tilde{y}_{ij} = 1 - \hat{y}_{ji}$. For a
positionally unbiased comparator, reversing the ordering
should have no impact on the outcome of the comparison,

$$\tilde{y}_{ij} = \hat{y}_{ij} \quad \forall\, i, j \in [1 \dots N],\ i \neq j \tag{7}$$

Systems may, however, have systematic positional
biases and could, for example, favor the first position
over the second position. To quantify the level of
systematic bias, one can determine $P(A)$, the prior
associated with the first position, and $P(B)$, the
prior for the second position. These can be estimated
for a given set of comparisons by calculating, over all
comparisons, the fraction of times that each position is
selected,

$$P(A) = \frac{\sum_{(i,j) \in C} \hat{y}_{ij}}{|C|} \qquad P(B) = \frac{\sum_{(i,j) \in C} \tilde{y}_{ij}}{|C|} \tag{8}$$
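
The positional priors of Equation (8) can be computed directly from the hard decisions. This sketch assumes a symmetric comparison set (so that every reversed pair is present) and uses made-up decisions from a comparator that leans towards the first position.

```python
def positional_priors(decisions):
    """Eq. (8) for a symmetric set: decisions[(i, j)] is y_hat_ij, x_i shown first."""
    size = len(decisions)
    p_a = sum(decisions.values()) / size
    # y_tilde_ij = 1 - y_hat_ji is the outcome under the reversed ordering.
    p_b = sum(1 - decisions[(j, i)] for (i, j) in decisions) / size
    return p_a, p_b


# Hypothetical decisions: the first position wins 5 of the 6 comparisons.
decisions = {(0, 1): 1, (1, 0): 1, (0, 2): 1, (2, 0): 0, (1, 2): 1, (2, 1): 1}
p_a, p_b = positional_priors(decisions)
```

For a symmetric set the two priors sum to one, so a single number such as $P(A)$ is enough to summarise the bias.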

When using a symmetric comparative set $C$, for
an unbiased system, both $P(A)$ and $P(B)$ should
be 0.5, and any large deviation is symptomatic of
positional bias. To address possible positional bias,
one may reweight system probabilities, $\hat{p}_{ij}$,
through

$$\hat{p}_{ij} = \frac{\alpha \cdot p_{ij}}{\alpha \cdot p_{ij} + (1 - p_{ij})} \tag{9}$$

where $\alpha \in \mathbb{R}^+$ is a weight that can be
set such that $P(A) = P(B) = 0.5$. Reweighting in this
fashion is equivalent to

$$\hat{y}_{ij} = \begin{cases} 1, & \text{if } p_{ij} > \tau \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

where $\tau \in [0, 1]$ is a decision threshold
corresponding to $\alpha$, set such that
$P(A) = P(B) = 0.5$.
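
The equivalence follows because $\hat{p}_{ij} > 0.5$ exactly when $\alpha \cdot p_{ij} > 1 - p_{ij}$, i.e. when $p_{ij} > 1/(1+\alpha)$, so $\tau = 1/(1+\alpha)$. The sketch below picks $\tau$ as the median of the comparative probabilities, which is one simple way to make half of the decisions favor the first position; that choice, and the toy probabilities, are assumptions of this example rather than a procedure stated in the text.

```python
from statistics import median


def debias_threshold(probs):
    """Assumed choice of tau: the median p_ij, so that P(A) is about 0.5."""
    return median(probs.values())


def debiased_decisions(probs):
    """Eq. (10) with the threshold chosen by debias_threshold."""
    tau = debias_threshold(probs)
    return {pair: int(p > tau) for pair, p in probs.items()}


def reweight(p, alpha):
    """Eq. (9)."""
    return alpha * p / (alpha * p + (1 - p))


# Hypothetical probabilities from a comparator biased towards the first position.
probs = {(0, 1): 0.90, (1, 0): 0.60, (0, 2): 0.95,
         (2, 0): 0.70, (1, 2): 0.85, (2, 1): 0.80}
tau = debias_threshold(probs)
decisions = debiased_decisions(probs)
alpha = (1 - tau) / tau  # the weight corresponding to tau = 1 / (1 + alpha)
```

With ties at the median the split may not be exactly even, in which case $P(A)$ is only approximately 0.5.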

4 Experimental Setup

4.1 Datasets
To investigate the general applicability of compara-
tive assessment, we cover a range of standard NLG
evaluation tasks and datasets as follows:

SummEval (Fabbri et al., 2021) is a summary eval-
uation benchmark of 100 passages, each with 16
machine-generated summaries. Each summary is
evaluated for coherency (COH), consistency (CON),
fluency (FLU), and relevancy (REL).

Podcast (Manakul and Gales, 2022) is for bench-
marking podcast summary assessment methods. It
contains 179 podcasts each with 15 abstractive sum-
maries. Each summary was evaluated for its overall
quality on a 4-point scale.

TopicalChat with the USR annotations (Mehri and
Eskenazi, 2020b) is for benchmarking dialogue
evaluation. It includes 60 dialogue contexts and
six system responses per context. These responses
were assessed on coherency (COH), continuity (CNT),
engagingness (ENG), and naturalness (NAT).

WebNLG (Gardent et al., 2017) is for benchmark-
ing data-to-text evaluation methods. It contains 223
semantic triple groups, each paired with outputs
from 8 triple-to-text generation systems. These
texts were evaluated for fluency (FLU), grammar
(GRA) and semantic equivalence (SEM).

4.2 Base Large Language Models (LLMs)
We investigate two families of open-source
instruction-tuned LLMs. The first is FlanT5
(Chung et al., 2022), a family of T5 models
(Raffel et al., 2020) that have been instruction-tuned
on a diverse set of 1000 NLP tasks (Wang et al., 2022).
The second is Llama2-chat (Touvron et al., 2023), which
is Llama2 tuned on instruction datasets. We investigate
a range of model sizes: 220M, 770M, 3B and 11B for
FlanT5, and 7B and 13B for Llama2.

4.3 Baselines

The NLG evaluation methods can be categorized
into reference-based and reference-free. Reference-
based methods compare the output against the refer-
ence such as n-gram metrics (e.g., BLEU (Papineni
et al., 2002) and ROUGE (Lin, 2004)), or embed-
ding based metrics (e.g., BERTScore (Zhang et al.,
2019)). In contrast, reference-free methods com-
pare the generated texts against the original source
(or context for generation) directly.

4.3.1 Bespoke Methods
Bespoke methods require specific data, which
could be supervised labels (e.g., human judgements
for the summaries) or data for model training (e.g.,
question-answering). Although bespoke methods
could work in a similar domain (e.g., developed
for summarization, but applied on dialogue genera-
tion), they are not as general as zero-shot methods.

UniEval (Zhong et al., 2022) converts NLG evaluation
into a Boolean QA problem. This method uses
pre-defined schemes for selected aspects (e.g., co-
herence) and generates synthetic data to fine-tune
a T5 system for NLG assessment. References are
used for particular aspects (e.g. relevancy), and
schemes/systems are bespoke for a particular at-
tribute (though a sequentially trained system that
scores multiple attributes is also explored).

QuestEval (Scialom et al., 2021) and MQAG
(Manakul et al., 2023a) are QA-based approaches
for assessing consistency in summarization tasks.
QuestEval uses extracted answer spans while
MQAG represents information using multiple-
choice questions. Both methods are reference-free.

Longformer-SFT: For podcast summarization, we
follow Manakul and Gales (2022) in using a Su-
pervised Fine-Tuned longformer (Beltagy et al.,
2020) as a baseline. The input is the document and
the summary, and human judgement is used as the
supervised target label at training, and the perfor-
mance is reported using 5-fold cross-validation.

4.3.2 Zero-shot Methods
Zero-shot methods can be applied generally to any
task without further training or fine-tuning. Com-
parative assessment is a zero-shot method.

GPTScore (Fu et al., 2023) evaluates texts using
conditional language model scores. By condition-
ing the language model on instruction and context,
GPTScore assumes that it will assign a higher prob-
ability to a high-quality generated text.

Prompt Scoring. Another baseline is prompt-scoring.
With this approach, for a particular attribute,
the LLM is asked to score the response quality
between 1 and 10. Simple prompts are used
with the general templates shown in Figure 3.
Prompt-scoring is run for all open-source LLMs
considered (FlanT5 and Llama2), and is used as the
main baseline to compare comparative assessment
against. During generation, the maximum gener-
ation length is set to 5 and the temperature is set
to 1.0. Similarly, ChatGPT prompt-scoring has re-
cently been proposed in Wang et al. (2023); Kocmi
and Federmann (2023), which we also include as a
baseline where applicable.

Passage: 
<context>

Summary: <Summary>

Score the response between 1 and 10 based 
on how consistent the summary is

<context>

Summary: <Summary>

Provide a score between 1 and 10 that 
measures the summary’s consistency

Prompt 1

Prompt 2

Figure 3: Scoring template 1 and template 2. Only the at-
tribute is changed (e.g., consistent → engaging) and response
description (‘summary’→ ‘response’) for different tasks.

G-Eval (Liu et al., 2023) extends standard
prompt-scoring by using detailed prompts and then
generating a continuous score, calculated as the
expected score over a score range (e.g., 1-5), with
each score weighted by its normalized probability. We
apply G-Eval to the various base LLMs and contrast
performance to the other approaches for SummEval, since
the prompts for different attributes have been made
publicly available.1
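
The expected-score calculation described for G-Eval can be sketched as a probability-weighted average. The input here is an assumed mapping from each candidate score to the probability the LLM assigns to it; the function name and the example values are hypothetical.

```python
def expected_score(score_probs):
    """Probability-weighted mean score, after normalising the probabilities."""
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total


# Hypothetical probabilities over the scores 1 to 5.
score = expected_score({1: 0.1, 2: 0.1, 3: 0.2, 4: 0.4, 5: 0.2})
```

Normalising by the total means the probabilities need not sum to one, for example when some mass falls on tokens that are not valid scores.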

4.4 Methodology
Each LLM is used for both prompt-scoring and
comparative assessment. For the main comparative
assessment results, we consider the full set of
possible comparisons, where all pairs of candidates in
both permutations are compared by the framework.
Comparisons are made using the prompt-based
classifier (as described in §3.3) using the prompt
templates shown in Fig. 2, where the system outputs
a probability for Response A and Response B. The
winner of the comparison is the response with the
highest probability, and candidates are then ranked
in order of the win-ratio (as described in §3.4). For
Llama2, comparative prompts are appended with
'Answer:' while scoring prompts end with 'Score:'.
The Spearman correlation between predicted scores and
human judgements is used as the performance metric.

1 https://github.com/nlpyang/geval
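
For reference, Spearman correlation is the Pearson correlation of the rank-transformed values. The dependency-free sketch below uses average ranks for ties; the function names are illustrative, and how correlations are aggregated across contexts is not covered here.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(predicted, gold):
    return pearson(average_ranks(predicted), average_ranks(gold))
```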

5 Experiments

5.1 NLG Evaluation Results

Summary Assessment: Table 1 analyzes the effec-
tiveness of comparative assessment on SummEval,
where the following observations can be made:

(1) Moderate-sized LLMs are ineffective in the
prompt-scoring set-up, with the best system
(FlanT5-3B) achieving Spearman correlations of
10-20. The performance difference with ChatGPT
prompt-scoring implies that scoring is likely an
emergent ability only effective for larger LLMs.

(2) G-Eval, which uses task specific detailed
prompts and continuous scores, yields significant
improvements over prompt-scoring. Nonetheless,
comparative assessment remains more effective
than G-Eval in the majority of settings.

(3) LLMs are able to achieve considerably higher
correlations in the comparative assessment set-up,
with performance higher for nearly all systems.
Further, comparative assessment leads to more ro-
bust performance, with most 3B+ models achieving
correlations within the range of 30-50.

(4) Comparative assessment enables LLMs of un-
der 1B to perform well, with FlanT5-770M achiev-
ing moderate correlations. However, performance
improves significantly when using 3B+ LLMs, al-
though for SummEval there are diminishing (if any)
performance gains by scaling up.

(5) The best comparative assessment LLM (FlanT5-
3B) is competitive with all other zero-shot methods,
including ChatGPT scoring (an LLM with two or-
ders of magnitude more parameters), and achieves
the best correlation in 3 of the 4 aspects.

Approach COH CON FLU REL

Baselines (§4.3)
BERTScore (w/ Ref) 25.9 19.7 23.7 34.7
QuestEval 18.2 30.6 22.8 26.8
MQAG 17.0 28.8 19.3 16.6
UniEval (single-best) 54.6 47.2 43.3 46.3
UniEval (continual) 57.5 44.6 44.9 42.6
GPTScore FlanT5-3B 47.0 43.6 42.1 34.4
GPTScore FlanT5-11B 45.6 43.8 42.4 34.3
GPTScore GPT3 40.1 47.5 41.0 34.3
ChatGPT scoring† 45.1 43.2 38.0 43.9
Prompt Scoring (§4.3.2)
FlanT5-220M 4.0 -0.2 0.2 2.8
FlanT5-770M -3.6 -1.6 -1.5 -0.0
FlanT5-3B 14.5 19.8 3.9 15.2
FlanT5-11B 0.7 11.2 3.2 5.7
Llama2-chat-7B 8.6 9.0 1.8 7.8
Llama2-chat-13B 9.9 6.9 1.2 9.2
G-Eval (§4.3.2)
FlanT5-220M 3.6 0.6 2.7 8.0
FlanT5-770M 8.5 7.0 15.3 24.1
FlanT5-3B 10.5 29.1 9.8 23.8
FlanT5-11B 19.2 29.3 20.7 35.8
Llama2-chat-7B 28.2 29.4 23.0 27.4
Llama2-chat-13B 53.2 33.7 16.5 38.3
Comparative Assessment (§3)
FlanT5-220M 4.0 -0.2 0.2 2.8
FlanT5-770M 29.8 26.3 20.6 35.1
FlanT5-3B 51.2 47.1 32.5 44.8
FlanT5-11B 44.2 37.2 30.2 43.4
Llama2-chat-7B 27.9 24.6 20.2 35.6
Llama2-chat-13B 40.9 39.9 30.8 45.3

Table 1: Spearman correlation coefficient for SummEval,
averaged over both prompts per system (for prompt-scoring
and comparative). †ChatGPT performance is quoted from
Wang et al. (2023), which uses more detailed scoring prompts.

Approach System-lvl Summary-lvl

Baselines (§4.3)
BERTScore (w/ Ref) 73.9 25.1
UniEval (continual) 42.0 22.8
QuestEval 42.5 20.4
MQAG 77.9 12.6
Longformer-SFT 89.6 19.6
Prompt Scoring (§4.3.2)
Llama2-chat-7B 88.5 2.6
Llama2-chat-13B 80.0 25.3
Comparative Assessment (§3)
Llama2-chat-7B 88.2 37.4
Llama2-chat-13B 97.1 45.5

Table 2: Spearman correlation coefficient for Podcast.

(6) Comparative assessment achieves competitive
performance with UniEval. Although UniEval has
better overall performance, it was designed for
bespoke tasks and aspects (it is fine-tuned on
synthetic data created for particular attributes),
and the results in Tables 2 and 4 show that UniEval
degrades noticeably in out-of-domain settings. In
contrast, comparative assessment is zero-shot and
general.


Approach COH CNT ENG NAT

Baselines (§4.3)
UniEval (single-best) 60.7 - 59.6 54.7
UniEval (continual) 61.3 - 60.5 44.4
GPTScore GPT3 56.9 32.9 49.6 52.4
ChatGPT scoring† 54.7 57.7 37.9 58.0
Prompt Scoring (§4.3.2)
FlanT5-220M -2.2 0.2 -8.4 2.1
FlanT5-770M 3.7 3.1 -4.3 3.8
FlanT5-3B 31.9 28.8 17.4 23.7
FlanT5-11B 15.3 8.0 4.3 24.3
Llama2-chat-7B 16.4 17.0 20.6 21.4
Llama2-chat-13B 21.7 19.9 31.4 23.2
Comparative Assessment (§3)
FlanT5-220M -0.3 8.2 -10.5 2.2
FlanT5-770M 38.5 36.3 25.3 35.3
FlanT5-3B 49.4 49.4 37.3 47.4
FlanT5-11B 54.3 42.2 54.7 54.2
Llama2-chat-7B 28.9 33.7 36.1 30.3
Llama2-chat-13B 32.4 43.2 55.5 33.5

Table 3: Spearman correlation coefficient for TopicalChat.
†ChatGPT is prompted using our prompt-scoring prompts.

Podcast Assessment: When considering podcast
summarization with long inputs of over 5k tokens
on average, only Llama2 models (which have a
limit of 4k tokens) were used (as FlanT5 has a
limit of 1k tokens). Table 2 shows that comparative
assessment yields highly impressive performance
for long-spoken summarization, with comparative
assessment out-competing all other baselines. Fur-
ther, although prompt-scoring has good system-
level correlations, the lack of granularity leads to
poor summary-level performance.

Dialogue Assessment: Next, we analyze comparative
assessment on TopicalChat, for evaluating
conversational responses. Table 3 shows that the
findings for TopicalChat are similar to those for
SummEval, with comparative assessment again achieving
higher correlations than prompt-scoring.

Data-to-Text Assessment: For data-to-text generation,
the context is highly abstract and is a list of
triples in the form of (subject, relation, object).
This makes assessing the semantics challenging, as
the LLM needs to parse and understand semantic
triples. Table 4 shows that understanding triples is
an emergent ability of LLMs: for grammar and fluency
the correlations are quite similar between the 3B and
11B/13B systems, whereas for semantic understanding,
the 10B+ systems substantially outperform the smaller
systems. Note that when evaluating UniEval, we used
the closest attribute that it was designed for, which
was naturalness for both fluency and grammar.

Approach FLU GRA SEM

Baselines (§4.3)
BLEU 36.3 34.7 50.3
METEOR 44.3 42.9 62.7
NLI Model∗ - - 63.7
UniEval (continual) 21.7 16.3 -
Prompt Scoring (§4.3.2)
FlanT5-220M 18.5 17.4 8.0
FlanT5-770M 14.5 13.6 17.1
FlanT5-3B 30.8 32.7 38.5
FlanT5-11B -0.7 6.9 20.8
Llama2-chat-7B 3.8 2.4 17.0
Llama2-chat-13B 1.8 0.5 5.6
Comparative Assessment (§3)
FlanT5-220M -13.6 -17.9 0.1
FlanT5-770M 36.2 35.2 11.4
FlanT5-3B 40.6 41.4 12.8
FlanT5-11B 41.4 44.8 52.4
Llama2-chat-7B 22.9 37.8 -5.3
Llama2-chat-13B 44.9 45.1 53.5

Table 4: Spearman correlation coefficient for WebNLG.
∗Quoted from the NLI method with the backoff template in
Dušek and Kasner (2020).

5.2 Positional Bias

We investigate whether the comparative prompts
have any implicit positional bias, and whether
systems prefer the first or second position. Table 5
shows the fraction of comparisons that selected the
candidate in the first position for SummEval. Since
all comparisons in both permutations are considered,
this fraction should be 0.50 for an unbiased system.
However, we observe considerable bias, with some
set-ups selecting the first option only around 20%
of the time. Further, we observe that larger systems
appear to be more susceptible to bias than smaller
systems, which may explain the similarity in
performance for the 3B and 11B/13B systems in the
previous main results. Similar results for other
datasets are provided in Appendix A.2.

System            Prompt   COH    CON    FLU    REL

FlanT5-3B         1        0.37   0.46   0.39   0.41
FlanT5-3B         2        0.43   0.47   0.40   0.44

FlanT5-11B        1        0.18   0.20   0.13   0.23
FlanT5-11B        2        0.24   0.24   0.17   0.26

Llama2-chat-7B    1        0.41   0.17   0.26   0.18
Llama2-chat-7B    2        0.68   0.56   0.48   0.45

Llama2-chat-13B   1        0.31   0.37   0.18   0.32
Llama2-chat-13B   2        0.29   0.30   0.19   0.26

Table 5: Positional bias P (A) for both prompt templates, for
various systems in the comparative setup on SummEval.


                           SummEval                  TopicalChat               WebNLG
System            Debias   COH   CON   FLU   REL     COH   CNT   ENG   NAT     FLU   GRA     Avg.

FlanT5-3B         ✗        51.2  47.1  32.5  44.8    49.4  49.4  37.3  47.4    41.0  41.8    44.2
FlanT5-3B         ✓        51.8  46.9  33.0  45.3    49.6  50.2  38.0  46.3    40.7  42.3    44.4

FlanT5-11B        ✗        44.2  37.2  30.2  43.4    54.3  42.2  54.7  54.2    41.4  44.8    44.7
FlanT5-11B        ✓        45.3  39.7  30.7  44.7    57.2  59.5  59.5  58.8    44.5  44.6    48.5

Llama2-chat-7B    ✗        29.4  24.6  19.7  35.2    28.2  33.1  36.3  28.7    22.9  37.8    29.6
Llama2-chat-7B    ✓        28.8  24.8  19.7  35.5    29.1  34.5  39.7  28.5    24.3  37.1    30.2

Llama2-chat-13B   ✗        40.9  39.9  30.8  45.3    32.4  43.2  55.5  33.5    44.9  45.1    41.2
Llama2-chat-13B   ✓        42.8  40.3  31.9  47.1    32.5  44.5  56.9  38.4    45.9  43.7    42.4

Table 6: Spearman correlation coefficient on different aspects of the NLG evaluation tasks, averaged over all prompts considered,
using all pairs and ordering considered (i.e. full matrix comparisons).

5.3 Debiasing

The previous section demonstrates that comparative
assessment exhibits positional bias which may
impact system decisions. We therefore investigate
whether debiasing can improve evaluation performance.
Table 6 shows standard and debiased LLM comparative
assessment performance for the considered tasks and
scores, with WebNLG SEM and Podcast omitted due to
the required emergent ability and large context
length respectively. We observe that debiasing can
lead to performance boosts, and note that the prompts
which have a high bias (seen in Table 5 and Table 9
in the appendix) benefit most from debiasing. In
particular, for TopicalChat we observe large gains
for the FlanT5-11B system, which enables
state-of-the-art performance. To explain why
debiasing can lead to large performance boosts,
consider a very biased system where the first
response is always selected as better. Over both
permutations, each candidate then wins exactly half
of its comparisons, so although no candidate is
favored overall, the system effectively treats all
candidates as being of the same quality. By reducing
the bias of each comparison, the system may be able
to pick up subtler quality differences between the
samples.
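
The extreme case described above can be checked with a short worked example. The sketch below simulates a hypothetical comparator that always picks the first position, over all permutations of four candidates, and computes each candidate's win-ratio.

```python
from itertools import permutations

n = 4
# Fully position-biased comparator: the candidate shown first always wins.
decisions = {(i, j): 1 for i, j in permutations(range(n), 2)}

wins = [0] * n
counts = [0] * n
for (i, j), y_hat in decisions.items():
    wins[i] += y_hat        # x_i wins when shown first
    wins[j] += 1 - y_hat    # x_j would win only if the decision were 0
    counts[i] += 1
    counts[j] += 1

win_ratios = [w / c for w, c in zip(wins, counts)]
```

Every candidate wins the three comparisons in which it appears first and loses the three in which it appears second, so all win-ratios are 0.5 and the ranking carries no information.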

5.4 Comparative Accuracy

One can also measure the accuracy of the comparative
system at a comparison level. Table 7 shows the
pairwise comparison accuracy for SummEval, over all
candidate pairs where the true scores of the two
candidates differ. We observe accuracies of roughly
60-80% across all attributes and observe that
debiasing can substantially increase accuracy. This
highlights that LLMs are able to compare the quality
of responses fairly well, though the moderately sized
LLMs may not always select the best response (with
respect to labels).
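
Comparison-level accuracy can be sketched as below, skipping pairs whose gold scores are tied as described above. The data layout, the function name and the toy values are assumptions for illustration.

```python
def comparative_accuracy(decisions, gold_scores):
    """Fraction of comparisons whose decision matches the gold ordering.

    decisions[(i, j)] is 1 if x_i was judged better than x_j, else 0.
    Pairs with equal gold scores are ignored.
    """
    correct = 0
    total = 0
    for (i, j), y_hat in decisions.items():
        if gold_scores[i] == gold_scores[j]:
            continue
        total += 1
        correct += int(y_hat == int(gold_scores[i] > gold_scores[j]))
    return correct / total if total else float("nan")


# Hypothetical gold scores and decisions; (0, 1) is skipped because of the tie.
gold = [3, 3, 1]
decisions = {(0, 1): 1, (0, 2): 1, (2, 0): 1, (1, 2): 1}
accuracy = comparative_accuracy(decisions, gold)
```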

System            Debias   COH    CON    FLU    REL

FlanT5-3B         ✗        68.6   82.0   68.2   67.2
FlanT5-3B         ✓        69.8   82.1   68.8   67.8

FlanT5-11B        ✗        61.6   70.3   60.3   63.3
FlanT5-11B        ✓        66.2   76.7   65.9   67.4

Llama2-chat-7B    ✗        59.6   63.8   59.6   61.0
Llama2-chat-7B    ✓        60.3   65.7   60.4   63.1

Llama2-chat-13B   ✗        62.6   75.4   61.1   65.4
Llama2-chat-13B   ✓        65.8   76.9   67.2   68.5

Table 7: Accuracy of the comparative systems, at a compari-
son level, for SummEval.

5.5 Self-Consistency

SummEval has 16 summaries per context which
leads to 240 possible comparisons. If one were to
instead randomly sample $N$ outputs and consider
all $N(N-1)$ comparisons, how consistent would the
rankings obtained from this subset be with the final
predicted rankings? Table 8 illustrates the
self-consistency, measured by the accuracy when
comparing pairs, and demonstrates that even when
using few outputs, the model is very consistent with
the final rankings that would be achieved by using
many more examples.

N       2     3     4     6     8     12    16

Final   84.0  88.3  90.7  93.7  95.5  98.0  100
Gold    68.0  69.1  69.7  70.3  70.6  70.8  70.9

Table 8: Accuracy when using fewer systems with respect
to final rankings (using all 16 systems) and the ground truth
labels. Results shown for SummEval COH using FlanT5-xl (3B).
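
The self-consistency measurement can be sketched as follows. This is only an illustration under stated assumptions: rankings are taken from win counts, tied pairs are skipped, and the names `win_counts` and `self_consistency` as well as the toy decision matrix `D` are hypothetical.

```python
import random
from itertools import combinations

def win_counts(D, subset):
    """Wins of each candidate, using only comparisons within `subset`.

    D[i][j] == 1 means candidate i was preferred when shown before j.
    """
    return {
        i: sum(D[i][j] + (1 - D[j][i]) for j in subset if j != i)
        for i in subset
    }

def self_consistency(D, n_sub, rng, trials=200):
    """Pairwise agreement between subset rankings and the full ranking."""
    n = len(D)
    full = win_counts(D, range(n))
    agree = total = 0
    for _ in range(trials):
        subset = rng.sample(range(n), n_sub)
        sub = win_counts(D, subset)
        for i, j in combinations(subset, 2):
            if sub[i] == sub[j] or full[i] == full[j]:
                continue  # assumption: tied pairs are skipped
            total += 1
            agree += (sub[i] > sub[j]) == (full[i] > full[j])
    return agree / total if total else float("nan")

# Toy judge that is perfectly transitive: lower index always wins.
D = [[1 if i < j else 0 for j in range(16)] for i in range(16)]
print(16 * 15)  # number of ordered comparisons for 16 candidates
print(self_consistency(D, 4, random.Random(0)))
```

With a perfectly transitive toy judge every subset ranking agrees with the full ranking; a real comparison matrix would be noisier, which is what Table 8 measures.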

5.6 Subset of Comparisons

Due to the O(N^2) number of comparisons required
for the full comparison matrix, it might be practical
to only consider a subset of comparisons. Figure 4
shows the downstream Spearman correlation for
SummEval coherency, averaged over 50 runs, for
different comparison selection strategies. Of the
three schemes, we observe that for small R (i.e. less
than half the total number of comparisons) selecting
comparisons with no repeats leads to a marginal
improvement over random selection. Further, with
the symmetric selection scheme, despite the number
of distinct pairs being half that of no-repeat (since
each pair is compared twice, once in each
permutation), interestingly there is a performance
difference of only 1 point in terms of Spearman.
Finally, we observe that debiasing can be very
effective in efficient set-ups, and leads to larger
benefits when the number of comparisons is small.
Equivalent plots for other tasks/scores can be found
in Appendix A.1.

[Figure 4 plot: Spearman correlation (y-axis) against R, the
number of comparisons (x-axis), with curves for no-repeat,
random and symmetric selection, each with and without debiasing.]

Figure 4: FlanT5-3B performance for SummEval COH when
a subset of the comparisons are selected by either random,
no-repeat or symmetric (as described in §3.4). For no-repeat,
each pair is compared once, hence has a smaller maximum R.
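
The three selection schemes are defined in §3.4; the sketch below is inferred only from how they are described in this section, so the details (in particular the random choice of presentation order for no-repeat) are assumptions, and `select_comparisons` is a hypothetical name.

```python
import random
from itertools import combinations, permutations

def select_comparisons(n, r, scheme, rng):
    """Return r ordered pairs (i, j): candidate i shown first, j second."""
    if scheme == "random":
        # Any ordered pair may be drawn; (i, j) and (j, i) can both appear.
        return rng.sample(list(permutations(range(n), 2)), r)
    if scheme == "no-repeat":
        # Each unordered pair at most once, so r <= n * (n - 1) / 2.
        pairs = rng.sample(list(combinations(range(n), 2)), r)
        # Assumption: the presentation order of each pair is random.
        return [(i, j) if rng.random() < 0.5 else (j, i) for i, j in pairs]
    if scheme == "symmetric":
        # r // 2 unordered pairs, each evaluated in both permutations.
        pairs = rng.sample(list(combinations(range(n), 2)), r // 2)
        return [p for i, j in pairs for p in ((i, j), (j, i))]
    raise ValueError(f"unknown scheme: {scheme}")

rng = random.Random(0)
for scheme in ("random", "no-repeat", "symmetric"):
    chosen = select_comparisons(16, 100, scheme, rng)
    distinct_pairs = len({frozenset(p) for p in chosen})
    print(scheme, len(chosen), distinct_pairs)
```

For the same budget R, no-repeat covers R distinct pairs while symmetric covers R/2 pairs in both orders, which is the trade-off discussed above.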

6 Conclusions

This paper investigates LLM comparative assess-
ment, a simple zero-shot approach to NLG evalu-
ation. We demonstrate that for moderately sized
LLMs, comparative assessment outperforms abso-
lute scoring, and is an effective automatic assess-
ment, achieving near state-of-the-art performance
for a range of NLG evaluation tasks. Furthermore,
we show that LLMs are prone to positional bias
that could impact their decisions. However, we
introduce a simple debiasing approach that leads to
performance boosts, especially for biased systems.

Limitations

Computational Cost. The comparative assessment
framework with the full set of comparisons uses
N · (N − 1) comparisons, which for large N can
be computationally prohibitive. This paper
investigated datasets with at most 16 candidates, and
the approach may not scale when more candidates
are required.
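
To make the quadratic growth concrete, the short sketch below evaluates N · (N − 1) for a few values of N. The function name and the values of N beyond 16 are chosen purely for illustration.

```python
def full_comparisons(n: int) -> int:
    """Number of ordered comparisons (both permutations of every pair)."""
    return n * (n - 1)

for n in (16, 100, 1000):
    print(n, full_comparisons(n))
```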

Base LLMs. The empirical findings are for LLMs
of up to 13B parameters. By using larger models
(with 100B+ parameters) one may expect further
performance improvements. However, due to API
costs and the O(N^2) number of comparisons,
results are limited to open-source LLMs.

Selection of the subset of comparisons. This work
only considered static comparison selection schemes.
Future work may investigate dynamic selection
schemes, for example by considering sorting
algorithms, Elo competition schemes, or methods
similar to those studied in information retrieval by
Qin et al. (2023).

Ethics Statement

For some tasks/datasets, comparative assessment
could be ineffective and have poor generalisa-
tion over the task. Deploying machine learning
classifiers in real-world classification settings has
many associated risks, and careful analysis should
be made before deploying such systems. Mis-
use/overconfidence in the approach may lead to
mistrust of users towards LLM solutions.

Acknowledgements

This paper reports on research supported by Cam-
bridge University Press & Assessment (CUP&A), a
department of The Chancellor, Masters, and Schol-
ars of the University of Cambridge. This research
is further supported by the Cambridge International
& St John’s College scholarship.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
An automatic metric for MT evaluation with im-
proved correlation with human judgments. In Pro-
ceedings of the ACL Workshop on Intrinsic and Ex-
trinsic Evaluation Measures for Machine Transla-
tion and/or Summarization, pages 65–72, Ann Arbor,
Michigan. Association for Computational Linguis-
tics.

Iz Beltagy, Matthew E. Peters, and Arman Cohan.
2020. Longformer: The long-document transformer.
arXiv:2004.05150.

Anja Belz and Ehud Reiter. 2006. Comparing auto-
matic and human evaluation of NLG systems. In
11th Conference of the European Chapter of the As-
sociation for Computational Linguistics, pages 313–
320, Trento, Italy. Association for Computational
Linguistics.

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao.
2020. Evaluation of text generation: A survey. arXiv
preprint arXiv:2006.14799.

Cheng-Han Chiang and Hung-yi Lee. 2023. Can large
language models be an alternative to human evalua-
tions? In Proceedings of the 61st Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 15607–15631, Toronto,
Canada. Association for Computational Linguistics.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng,
Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan
Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion
Stoica, and Eric P. Xing. 2023. Vicuna: An open-
source chatbot impressing gpt-4 with 90%* chatgpt
quality. https://lmsys.org/blog/2023-03-30-vicuna/

Hyung Won Chung, Le Hou, Shayne Longpre, Bar-
ret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, et al.
2022. Scaling instruction-finetuned language models.
arXiv preprint arXiv:2210.11416.

Ondřej Dušek and Zdeněk Kasner. 2020. Evaluating
semantic accuracy of data-to-text generation with nat-
ural language inference. In Proceedings of the 13th
International Conference on Natural Language Gen-
eration, pages 131–137, Dublin, Ireland. Association
for Computational Linguistics.

Alexander R Fabbri, Wojciech Kryściński, Bryan Mc-
Cann, Caiming Xiong, Richard Socher, and Dragomir
Radev. 2021. Summeval: Re-evaluating summariza-
tion evaluation. Transactions of the Association for
Computational Linguistics, 9:391–409.

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei
Liu. 2023. Gptscore: Evaluate as you desire. arXiv
preprint arXiv:2302.04166.

Claire Gardent, Anastasia Shimorina, Shashi Narayan,
and Laura Perez-Beltrachini. 2017. Creating training
corpora for NLG micro-planners. In Proceedings
of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 179–188, Vancouver, Canada. Association for
Computational Linguistics.

Tom Kocmi and Christian Federmann. 2023. Large
language models are state-of-the-art evaluators of
translation quality. arXiv preprint arXiv:2302.14520.

Alice Lai and Joel Tetreault. 2018. Discourse coherence
in the wild: A dataset, evaluation and methods. In
Proceedings of the 19th Annual SIGdial Meeting on
Discourse and Dialogue, pages 214–223, Melbourne,
Australia. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.

Lucian Lita, Monica Rogati, and Alon Lavie. 2005.
BLANC: Learning evaluation metrics for MT. In
Proceedings of Human Language Technology Con-
ference and Conference on Empirical Methods in
Natural Language Processing, pages 740–747, Van-
couver, British Columbia, Canada. Association for
Computational Linguistics.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang,
Ruochen Xu, and Chenguang Zhu. 2023. G-eval:
NLG evaluation using gpt-4 with better human align-
ment. In Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing,
pages 2511–2522, Singapore. Association for Com-
putational Linguistics.

Adian Liusie, Potsawee Manakul, and Mark J. F. Gales.
2023. Mitigating word bias in zero-shot prompt-
based classifiers. arXiv preprint arXiv:2309.04992.

Potsawee Manakul and Mark JF Gales. 2022. Pod-
cast summary assessment: A resource for evaluat-
ing summary assessment methods. arXiv preprint
arXiv:2208.13265.

Potsawee Manakul, Adian Liusie, and Mark JF Gales.
2023a. Mqag: Multiple-choice question answering
and generation for assessing information consistency
in summarization. arXiv preprint arXiv:2301.12307.

Potsawee Manakul, Adian Liusie, and Mark JF Gales.
2023b. Selfcheckgpt: Zero-resource black-box hal-
lucination detection for generative large language
models. arXiv preprint arXiv:2303.08896.

Shikib Mehri and Maxine Eskenazi. 2020a. Unsuper-
vised evaluation of interactive dialog with DialoGPT.
In Proceedings of the 21th Annual Meeting of the
Special Interest Group on Discourse and Dialogue,
pages 225–235, 1st virtual meeting. Association for
Computational Linguistics.

Shikib Mehri and Maxine Eskenazi. 2020b. USR: An
unsupervised and reference free evaluation metric
for dialog generation. In Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 681–707, Online. Association for
Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu-
ation of machine translation. In Proceedings of the
40th Annual Meeting of the Association for Compu-
tational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang,
Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu,
Donald Metzler, Xuanhui Wang, et al. 2023.
Large language models are effective text rankers
with pairwise ranking prompting. arXiv preprint
arXiv:2306.17563.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. 2020. Exploring the limits
of transfer learning with a unified text-to-text trans-
former. The Journal of Machine Learning Research,
21(1):5485–5551.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon
Lavie. 2020. COMET: A neural framework for MT
evaluation. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Process-
ing (EMNLP), pages 2685–2702, Online. Association
for Computational Linguistics.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier,
Benjamin Piwowarski, Jacopo Staiano, Alex Wang,
and Patrick Gallinari. 2021. QuestEval: Summariza-
tion asks for fact-based evaluation. In Proceedings of
the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, pages 6594–6604, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.

Louis L Thurstone. 1927. A law of comparative judg-
ment. Psychological review, 34(4):273.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
bert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
Bhosale, et al. 2023. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint
arXiv:2307.09288.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020.
Asking and answering questions to evaluate the fac-
tual consistency of summaries. In Proceedings of the
58th Annual Meeting of the Association for Compu-
tational Linguistics, pages 5008–5020, Online. Asso-
ciation for Computational Linguistics.

Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang
Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou.
2023. Is chatgpt a good nlg evaluator? a preliminary
study. arXiv preprint arXiv:2303.04048.

Yizhong Wang, Swaroop Mishra, Pegah Alipoor-
molabashi, Yeganeh Kordi, Amirreza Mirzaei,
Anjana Arunkumar, Arjun Ashok, Arut Selvan
Dhanasekaran, Atharva Naik, David Stap, et al. 2022.
Super-naturalinstructions: Generalization via declar-
ative instructions on 1600+ nlp tasks. arXiv preprint
arXiv:2204.07705.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel,
Barret Zoph, Sebastian Borgeaud, Dani Yogatama,
Maarten Bosma, Denny Zhou, Donald Metzler, et al.
2022. Emergent abilities of large language models.
arXiv preprint arXiv:2206.07682.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Wein-
berger, and Yoav Artzi. 2019. Bertscore: Evaluating
text generation with bert. In International Confer-
ence on Learning Representations.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan
Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin,
Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023.
Judging llm-as-a-judge with mt-bench and chatbot
arena. arXiv preprint arXiv:2306.05685.

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu
Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and
Jiawei Han. 2022. Towards a unified multi-
dimensional evaluator for text generation. In Pro-
ceedings of the 2022 Conference on Empirical Meth-
ods in Natural Language Processing, pages 2023–
2038, Abu Dhabi, United Arab Emirates. Association
for Computational Linguistics.


A Additional Results

A.1 Partial Comparison Curves

[Figure 5 panels: each plots Spearman correlation (y-axis)
against R, the number of comparisons (x-axis), with curves for
no-repeat, random and symmetric selection, each with and
without debiasing.]

(a) FlanT5-3B, SummEval, CON
(b) FlanT5-3B, SummEval, FLU
(c) FlanT5-3B, SummEval, REL
(d) FlanT5-11B, TopicalChat, COH
(e) FlanT5-11B, TopicalChat, ENG
(f) FlanT5-11B, TopicalChat, NAT
(g) FlanT5-11B, SummEval, REL
(h) Llama-chat-7B, SummEval, CON
(i) Llama-chat-13B, SummEval, FLU

Figure 5: Assessment performance when only a subset of comparisons is considered (extending the results of Figure 4).
Multiple different base LLMs, datasets and scores are displayed.

A.2 Positional Bias

                         SummEval                  TopicalChat               WebNLG              Podcast
System          Prompt   COH   CON   FLU   REL     COH   CNT   ENG   NAT     FLU   GRA   SEM

FlanT5-3B       1        0.37  0.46  0.41  0.42    0.47  0.44  0.50  0.49    0.46  0.41  0.89    -
                2        0.43  0.47  0.42  0.44    0.46  0.44  0.47  0.47    0.38  0.36  0.85    -

FlanT5-11B      1        0.18  0.25  0.16  0.23    0.25  0.17  0.27  0.26    0.15  0.19  0.56    -
                2        0.24  0.29  0.19  0.26    0.27  0.13  0.29  0.31    0.19  0.21  0.42    -

Llama2-chat-7B  1        0.41  0.21  0.28  0.18    0.57  0.26  0.25  0.36    0.36  0.53  0.98    0.33
                2        0.68  0.57  0.50  0.45    0.56  0.37  0.22  0.35    0.37  0.48  0.90    0.24

Llama2-chat-13B 1        0.31  0.43  0.20  0.32    0.69  0.73  0.67  0.74    0.23  0.38  0.50    0.22
                2        0.29  0.37  0.22  0.26    0.65  0.65  0.62  0.68    0.28  0.40  0.29    0.40

Table 9: Fraction of comparisons where the candidate in the first position was selected by the LLM when using the full
(symmetric) set of comparisons. The bias is presented for both prompts, over all datasets and scores, extending the results in
Table 5.


A.3 Accuracy of Pairwise Comparisons

                         SummEval                  TopicalChat               WebNLG              Podcast
System          Debias   COH   CON   FLU   REL     COH   CNT   ENG   NAT     FLU   GRA   SEM

FlanT5-3B       ✗        68.6  82.0  68.2  67.2    75.3  71.0  65.6  70.3    66.2  65.5  51.8    -
                ✓        69.8  82.1  68.8  67.8    75.4  72.2  65.6  69.9    66.7  66.6  51.3    -

FlanT5-11B      ✗        61.6  70.3  60.3  63.3    70.0  60.5  68.0  68.9    60.8  62.7  69.6    -
                ✓        66.2  76.7  65.9  67.4    76.6  74.2  74.4  74.7    67.6  67.3  69.9    -

Llama2-chat-7B  ✗        59.6  63.8  59.6  61.0    64.0  62.0  61.0  60.4    56.6  61.1  48.3    63.4
                ✓        60.3  65.7  60.4  63.1    64.0  64.3  65.9  61.6    57.1  61.1  50.2    -

Llama2-chat-13B ✗        62.6  75.4  61.1  65.4    64.5  66.8  72.0  62.3    64.7  67.6  67.3    70.3
                ✓        65.8  76.9  67.2  68.5    65.9  69.4  73.8  65.2    66.7  67.4  68.9    -

Table 10: Accuracy of pairwise comparisons over all candidate pairs which differ in true score. Accuracies are shown for all datasets
and scores, extending the results of Table 7.

B Alternate Ranking Strategies

In the main paper, we only consider the win ratio
as an approach for converting comparisons to ranks,
since the win ratio is simple and intuitive. However,
alternate ranking strategies are possible; a
well-motivated decoding approach is to select the
ranks with the highest probability given the observed
comparisons. By Bayes' theorem, and assuming a
uniform prior over rankings, this is equivalent to
finding the ranks that maximise the likelihood of
the observations,

$$\hat{r}_{1:N} = \underset{r_{1:N}}{\arg\max}\; P(C \mid r_{1:N}) \tag{11}$$

For a set of ranks $r_{1:N}$, let $z_{ij} = \mathbb{1}(r_i < r_j) \in \{0, 1\}$,
i.e. whether the ranks imply $x_i$ is better than $x_j$.
Given the probability of each comparison, the like-
lihood of the ranks can be defined as

$$P(C \mid r_{1:N}) = \prod_{(i,j) \in C} p_{ij}^{z_{ij}} \, (1 - p_{ij})^{1 - z_{ij}} \tag{12}$$
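
As a hedged illustration of this decoding rule, the sketch below scores a candidate ranking with a log-likelihood that is a product of per-comparison Bernoulli terms, and finds the best ranking by exhaustive search. The names `log_likelihood` and `best_ranking` and the toy probabilities are assumptions; exhaustive search is only feasible for very small N.

```python
import math
from itertools import permutations

def log_likelihood(ranks, probs):
    """log P(C | r_{1:N}) for soft comparisons.

    ranks[i] is the rank of candidate i (1 = best).
    probs[(i, j)] is the system's probability that x_i is better than x_j,
    assumed to lie strictly between 0 and 1.
    """
    total = 0.0
    for (i, j), p in probs.items():
        z = ranks[i] < ranks[j]  # z_ij = 1 if the ranks say x_i is better
        total += math.log(p if z else 1.0 - p)
    return total

def best_ranking(n, probs):
    """Exhaustive search over all n! rankings (small n only)."""
    return max(
        permutations(range(1, n + 1)),
        key=lambda ranks: log_likelihood(ranks, probs),
    )

# Toy comparisons over three candidates.
probs = {(0, 1): 0.8, (1, 2): 0.7, (0, 2): 0.9, (2, 0): 0.2}
ranks = best_ranking(3, probs)
print(ranks)
```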

If only hard decisions are available (i.e. the proba-
bilities are not), then one can instead approximate
the likelihood and find the ranks that maximise the
approximate-likelihood.

$$P(C \mid r_{1:N}) \approx \prod_{(i,j) \in C} P(\hat{y}_{ij} \mid z_{ij}) \tag{13}$$

Since $\hat{y}_{ij} \in \{0, 1\}$ and $z_{ij} \in \{0, 1\}$, there are 4
conditional probabilities $P(\hat{y}_{ij} \mid z_{ij})$. Setting one
probability will set the other 3, which can be
estimated with the system's comparative statistics.
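
A minimal sketch of the hard-decision case is given below. It assumes a single shared parameter `acc` for the probability that a decision agrees with the ranking, i.e. P(ŷ = 1 | z = 1) = P(ŷ = 0 | z = 0) = acc; this symmetry, the value 0.7 and the function name are illustrative assumptions rather than the exact procedure used.

```python
import math

def approx_log_likelihood(ranks, decisions, acc):
    """Approximate log-likelihood of a ranking from hard decisions.

    ranks[i] is the rank of candidate i (1 = best).
    decisions[(i, j)] is 1 if the system judged x_i better than x_j, else 0.
    acc is the assumed probability that a decision agrees with the ranking.
    """
    total = 0.0
    for (i, j), y_hat in decisions.items():
        z = int(ranks[i] < ranks[j])
        total += math.log(acc if y_hat == z else 1.0 - acc)
    return total

# Toy hard decisions over three candidates.
decisions = {(0, 1): 1, (1, 0): 0, (0, 2): 1, (1, 2): 0}

ll_a = approx_log_likelihood((1, 2, 3), decisions, acc=0.7)  # 3 of 4 agree
ll_b = approx_log_likelihood((1, 3, 2), decisions, acc=0.7)  # 4 of 4 agree
print(ll_a, ll_b)
```

In practice `acc` could be set from a statistic such as the pairwise accuracy of the system, though the choice here is only an example.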

B.1 Initial Results
Table 11 presents initial results for FlanT5-3B on
SummEval, comparing the maximum likelihood
ranking to the win-ratio approach. The initial
finding was that performance was similar between the
two conversion schemes. However, maximising the
likelihood exactly is intractable, since the search
is over all N! possible rankings, necessitating an
approximate greedy search. For the sake of simplicity,
our main paper focused on the win-ratio method,
while future research may explore more advanced
conversion strategies.
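
The greedy search is not specified further here. One possible form, shown purely as an assumption, is a hill climb over adjacent swaps that stops when no swap improves the likelihood; the names and toy probabilities below are hypothetical, and such a search may stop at a local optimum.

```python
import math

def log_likelihood(order, probs):
    """Log-likelihood of an ordering (candidate indices, best first).

    probs[(i, j)] is the probability that x_i is better than x_j,
    assumed to lie strictly between 0 and 1.
    """
    rank = {c: pos for pos, c in enumerate(order)}
    return sum(
        math.log(p if rank[i] < rank[j] else 1.0 - p)
        for (i, j), p in probs.items()
    )

def greedy_ranking(n, probs):
    """Swap adjacent candidates while doing so increases the likelihood."""
    order = list(range(n))  # arbitrary starting order
    best = log_likelihood(order, probs)
    improved = True
    while improved:
        improved = False
        for k in range(n - 1):
            order[k], order[k + 1] = order[k + 1], order[k]
            cand = log_likelihood(order, probs)
            if cand > best + 1e-12:
                best, improved = cand, True  # keep the swap
            else:
                order[k], order[k + 1] = order[k + 1], order[k]  # undo
    return order

# Toy comparisons whose most likely ordering is 2, 0, 1.
probs = {(2, 0): 0.8, (0, 1): 0.7, (2, 1): 0.9, (1, 2): 0.2}
order = greedy_ranking(3, probs)
print(order)
```

Each accepted swap strictly increases the likelihood, so the loop terminates; starting from the win-ratio ordering instead of an arbitrary one would be a natural variation.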

             SummEval
             COH    CON    FLU    REL

win-ratio    51.4   46.4   31.9   45.0
likelihood   51.7   46.0   31.5   44.7

Table 11: Spearman correlation when the comparisons are
converted using either win-ratio or maximum likelihood, for
FlanT5-3B on SummEval.
