LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models

Adian Liusie, Potsawee Manakul, Mark J. F. Gales
ALTA Institute, Department of Engineering, University of Cambridge
al826@cam.ac.uk, pm574@cam.ac.uk, mjfg@eng.cam.ac.uk

Abstract

Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. An interesting application of these systems is in the automated assessment of natural language generation (NLG), a highly challenging area with great practical benefit. In this paper, we explore two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment: absolute score prediction, and comparative assessment which uses relative comparisons between pairs of candidates. Though comparative assessment has not been extensively studied in NLG assessment, we note that humans often find it more intuitive to compare two options rather than scoring each one independently. This work examines comparative assessment from multiple perspectives: performance compared to absolute grading; positional biases in the prompt; and efficient ranking in terms of the number of comparisons. We illustrate that LLM comparative assessment is a simple, general and effective approach for NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring, and in many cases can achieve performance competitive with state-of-the-art methods. Additionally, we demonstrate that LLMs often exhibit strong positional biases when making pairwise comparisons, and we propose debiasing methods that can further improve performance.

1 Introduction

With the current rapid advances in generative AI, pre-trained models are increasingly utilized in a range of NLP tasks, necessitating reliable evaluations of these models.
Human evaluation, where annotators critically assess the quality of the outputs of natural language generation (NLG) systems, has been the gold standard approach (Lita et al., 2005; Belz and Reiter, 2006; Lai and Tetreault, 2018; Fabbri et al., 2021). However, human evaluation has its drawbacks, and is notably labor-intensive, time-consuming, and costly. As such, automating the evaluation process and assessing NLG systems without human intervention is highly desirable.

Figure 1: Prompt Scoring vs. Comparative Assessment. Comparative Assessment prompts an LLM to compare candidates in a pairwise manner, and the comparisons are subsequently converted into scores or ranks.
[Figure panels: "LLM Prompt Scoring", where each summary is scored independently with the prompt "Provide a score between 1 and 10 that measures the summaries' coherence" and the scores are sorted into a ranking; and "LLM Comparative Assessment", where pairs are compared with the prompt "Which Summary is more coherent, Summary A or Summary B?" and the resulting matrix of A/B decisions is converted into a ranking.]

Though there has been considerable progress in automatic evaluation methods, many proposed approaches have certain restrictions that limit their effectiveness. A large body of existing work uses evaluation methods designed for particular tasks and attributes (Mehri and Eskenazi, 2020a; Rei et al., 2020; Manakul et al., 2023b), for example, measuring the consistency of summaries (Wang et al., 2020; Manakul et al., 2023a). Though effective within their domain, these approaches are not extensible to different NLG aspects and cannot be used by practitioners wishing to evaluate systems on inputs or properties that are less common.

The recent development in the emergent abilities of LLMs (Wei et al., 2022) has enabled LLMs to achieve impressive zero-shot performance for a slew of language tasks.
This has led to general prompt-based assessment approaches, such as prompt-scoring, where an LLM is probed to score outputs on a particular aspect (Wang et al., 2023; Kocmi and Federmann, 2023). These approaches are often only effective with massive LLMs with 175B+ parameters, which may limit the applicability of the approach, especially when access is limited to API access.

With the insight that for humans, it is often easier to select which of two options is better than it is to score options independently, we question whether pairwise comparisons may be more effective at leveraging the impressive emergent ability of LLMs.

In this work, we consider LLM comparative assessment, where an LLM is prompted to compare pairs of NLG candidates and predict which one is better. We demonstrate empirically that comparative assessment performs much better than prompt-scoring for FlanT5 and Llama style models, and enables moderate-sized open-source LLMs to achieve near (or above) state-of-the-art performance across a range of NLG tasks, for a diverse set of attributes. Our approach is general, applicable to a diverse range of tasks and textual attributes, and simple, requiring minimal prompt engineering. Further, we demonstrate that pairwise LLM comparisons often exhibit strong positional biases, where the ordering of candidates impacts the decisions. We introduce a simple debiasing method and empirically illustrate that debiasing can provide further performance improvements, especially when large biases are present.
Our contributions are: 1) We present the first work that comprehensively analyzes pairwise comparative assessment for NLG evaluation; 2) We demonstrate that comparative assessment is far more effective than prompt-scoring for moderately-sized LLMs, and yields performance that is state-of-the-art for particular attributes; 3) We demonstrate that positional bias impacts comparative decisions, and introduce a method to debias LLMs which leads to performance boosts, especially when only a subset of comparisons are considered.

2 Background and Related Work

2.1 Reference-based Evaluation

In NLG evaluation, a standard approach is the comparison of annotator-provided gold-standard references with the generated response. Established heuristics, such as the N-gram overlap metrics ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005), have extensively been applied for assessing summarization and machine translation respectively. Recently, the paradigm has evolved to incorporate embedding-based methods like BERTScore (Zhang et al., 2019), which not only compare generated texts with references, but also factor in semantic considerations beyond word overlap.

2.2 Tailored NLG Evaluation Approaches

Tailored approaches have been proposed for assessing specific properties of generated texts. For example, question-answering systems are used for summary consistency assessment (Wang et al., 2020; Scialom et al., 2021) to probe information consistency. For dialogue quality assessment, the language model probability from a DialoGPT system is used as a proxy for response quality (Mehri and Eskenazi, 2020b). A survey of NLG evaluation methods was conducted by Celikyilmaz et al. (2020).

2.3 Zero-shot LLM Evaluation

Given the current capabilities of LLMs such as ChatGPT and GPT4, the zero-shot ability of these systems for a wide range of tasks, including NLG evaluation, has been investigated.
Existing works have looked at using LLMs to evaluate open-ended story generation and adversarial attacks (Chiang and Lee, 2023) and using ChatGPT to score the quality of texts along a certain axis (Wang et al., 2023; Kocmi and Federmann, 2023), demonstrating that ChatGPT can be used in a zero-shot setting and achieve reasonable performance.

2.4 LLM Pairwise Comparisons

Pairwise comparative judgement (Thurstone, 1927) has been a popular approach for assessing exam candidates, though typically with human assessors. The ability and application of pairwise comparisons via LLMs remain relatively underexplored, with concurrent work using pairwise rankings for information retrieval (Qin et al., 2023) and separately for assessing LLM-based chat assistants on open-ended questions where outputs are compared to that of a baseline system (Chiang et al., 2023; Zheng et al., 2023).

3 Comparative Assessment

3.1 Notation

In this work, we investigate using LLM comparative judgements for NLG assessment. Assume that there is a context $d$ (e.g., a text passage or dialogue) and a set of $N$ candidate responses, $x_{1:N}$. For a given attribute (e.g., coherence, consistency, fluency) the $N$ candidates have true underlying scores, $s_{1:N}$. As scores often only have relative meaning, in this work only the ranks of the candidates will be evaluated. The objective is therefore to accurately predict the true ranks, $r_{1:N}$, of the candidate scores. In comparative assessment, one uses pairwise comparisons to determine which of the two input responses is better. Let $y_{ij} \in \{0, 1\}$ represent the true outcome of whether $x_i$ is higher ranked than $x_j$, such that $y_{ij} = \mathbb{1}(s_i > s_j)$. Here, an LLM is used to model the probability that response $i$ is better than response $j$, $p_{ij}$,

$$p_{ij} = P(y_{ij} \mid x_i, x_j, d) \tag{1}$$

which can alternatively be converted into hard decisions, $\hat{y}_{ij}$, by selecting the most likely outcome.
$$\hat{y}_{ij} = \begin{cases} 1, & \text{if } p_{ij} > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Let $C = \{c_k\}_{k=1 \ldots R}$ represent a set of comparisons, where $R$ is the total number of comparisons, and each comparison $c = (i, j)$ indicates the indices of the two considered candidate responses. For example, the set of all possible comparisons, $C = \{(i, j) \mid i, j \in [1 \ldots N],\ i \neq j\}$, could be used, or alternatively a smaller subset of comparisons.

3.2 Prompt Design

To leverage the emergent ability of LLMs, we use comparative prompts that probe a model to decide which of the two candidates is better. Let $T$ be a prompt template that converts candidate responses $x_i$ and $x_j$ as well as context $d$ into an output text, prompt $P = T(x_i, x_j, d)$. This work aims to find a simple, general and robust assessment method, and as such extensive prompt engineering is not in the scope of this work (despite possible performance gains). We evaluate two simple and suitable prompts in our initial investigations. Our prompts for comparative assessment are shown in Figure 2.

Prompt 1:
    Passage:
    Summary A:
    Summary B:
    Which Summary is more consistent relative to the passage, Summary A or Summary B?

Prompt 2:
    Summary A:
    Summary B:
    Which Summary is more consistent, Summary A or Summary B?

Figure 2: Comparative prompt template 1 and 2. When assessing different attributes, only the attribute is changed (e.g., consistent → engaging) and for response assessment, the word 'summary' is replaced with 'response'.

3.3 Comparative Decisions

A central aspect of LLM comparative assessment is the methodology of getting comparative decisions. In this section, we consider two approaches for leveraging LLMs for comparative assessment: the first for when one has output token-level probabilities (Prompt-Based Classifier), and the second for when only the output texts are available (Text Generation).

Prompt-Based Classifier: If one has access to the output probabilities, an efficient method to get probability estimates of the predictions is to leverage prompt-based classifiers.
Let $P_\theta(w \mid x)$ represent an LLM's conditional language model distribution of the output sequence $w$ given the input text $x$. For prompt-based classifiers, the LM probabilities of specific label words ($w_k$) are used as a proxy for the class decisions (Liusie et al., 2023). For example in summarization assessment, given a prompt $P$ ending in '... which summary is better', one can set $w_i$ = 'Summary A' and $w_j$ = 'Summary B' and define the probability that response $i$ is better than response $j$ as:

$$p_{ij} = \frac{P_\theta(w_i \mid P)}{P_\theta(w_i \mid P) + P_\theta(w_j \mid P)} \tag{3}$$

Text Generation: Alternatively, if only limited API access is available, one can sample responses from the conditional LM given the input prompt $P$,

$$\tilde{w}^{(k)} \sim P_\theta(w \mid P) \tag{4}$$

Let $f(\tilde{w}) \in \{0, 1\}$ be a function that maps the text response to the comparative decision. By generating $K$ samples from the LLM, one can estimate the comparative probability $p_{ij}$ by looking at the fraction of the samples that select $x_i$ over $x_j$.

$$p_{ij} = \frac{1}{K} \sum_{k=1}^{K} f(\tilde{w}^{(k)}) \tag{5}$$

3.4 Comparisons to Ranks

Although the full set of possible comparisons yields the most information for the rankings, this requires $R = N(N-1)$ comparisons, which can be computationally expensive. For computational efficiency, we can consider 3 different comparison selection strategies: random, no-repeat and symmetric. For random, comparisons are randomly selected from the set of all possible comparisons. For no-repeat, if $(x_i, x_j)$ is selected then $(x_j, x_i)$ will not be selected. For symmetric, if $(x_i, x_j)$ is selected, then $(x_j, x_i)$ will also be selected.

Given a set of selected comparisons $C$ and weights of a comparative assessment system $\theta$, one can generate a predicted rank ordering $\hat{r}_{1:N}$ of the candidate responses. A simple but effective approach is to sort the candidates by win-ratio,

$$\hat{s}_i = \frac{\#\text{wins of } x_i}{\#\text{comparisons involving } x_i} \tag{6}$$

which can then be ordered to convert the scores into predicted ranks $\hat{r}_{1:N}$.
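The selection strategies and the win-ratio of Eq. (6) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the zero-based candidate indices, the random orientation chosen for no-repeat pairs, and the `decide(i, j)` callback (standing in for the LLM's hard decision $\hat{y}_{ij}$) are all assumptions made for the example.

```python
import random
from itertools import combinations, permutations


def select_comparisons(n, r, strategy, rng):
    """Pick r ordered comparisons (i, j), i != j, among n candidates."""
    if strategy == "random":
        # Any ordered pair may be drawn, each at most once.
        return rng.sample(list(permutations(range(n), 2)), r)
    if strategy == "no-repeat":
        # Each unordered pair appears at most once; the orientation is
        # chosen at random here (an assumption of this sketch).
        pairs = rng.sample(list(combinations(range(n), 2)), r)
        return [(i, j) if rng.random() < 0.5 else (j, i) for i, j in pairs]
    if strategy == "symmetric":
        # r // 2 unordered pairs, each compared in both orderings.
        pairs = rng.sample(list(combinations(range(n), 2)), r // 2)
        return [c for i, j in pairs for c in ((i, j), (j, i))]
    raise ValueError(f"unknown strategy: {strategy}")


def win_ratio_scores(n, comparisons, decide):
    """Eq. (6): wins of x_i divided by comparisons involving x_i.

    decide(i, j) returns 1 if the candidate in the first position (i)
    is judged better than the one in the second position (j), else 0.
    """
    wins = [0] * n
    counts = [0] * n
    for i, j in comparisons:
        y = decide(i, j)
        wins[i] += y
        wins[j] += 1 - y
        counts[i] += 1
        counts[j] += 1
    # Candidates that were never compared get a score of 0.0.
    return [w / c if c else 0.0 for w, c in zip(wins, counts)]


def scores_to_ranks(scores):
    """Rank 1 is the highest score; ties are broken by index order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks
```

With a comparator that always agrees with hypothetical true scores `[0.1, 0.9, 0.5]` and the full set of six comparisons, the win-ratios are `[0.0, 1.0, 0.5]` and the ranks `[3, 1, 2]`.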
3.5 Debiased Comparative Assessment

Let $\tilde{y}_{ij}$ represent the outcome of the comparison when considered in the opposite ordering, such that $\tilde{y}_{ij} = 1 - \hat{y}_{ji}$. For a positionally unbiased comparator, reversing the ordering should have no impact on the outcome of the comparison

$$\tilde{y}_{ij} = \hat{y}_{ij} \quad \forall\, i, j \in [1 \ldots N],\ i \neq j \tag{7}$$

Systems may, however, have systematic positional biases and could for example favor the first position over the second position. To quantify the level of systematic bias, one can determine $P(A)$, the prior associated with the first position, and $P(B)$, the prior for the second position. This can be estimated for a given set of comparisons by using the statistics over all comparisons, and by calculating the fraction of times that each position is selected.

$$P(A) = \frac{\sum_{(i,j) \in C} \hat{y}_{ij}}{|C|} \qquad P(B) = \frac{\sum_{(i,j) \in C} \tilde{y}_{ij}}{|C|} \tag{8}$$

When using a symmetric comparative set $C$, for an unbiased system, both $P(A)$ and $P(B)$ should be 0.5 and any large deviation is symptomatic of positional bias. To address possible positional bias, one may reweight system probabilities, $\hat{p}_{ij}$, through

$$\hat{p}_{ij} = \frac{\alpha \cdot p_{ij}}{\alpha \cdot p_{ij} + (1 - p_{ij})} \tag{9}$$

where $\alpha \in \mathbb{R}^{+}$ is a weight that can be set such that $P(A) = P(B) = 0.5$. Reweighting in this fashion is equivalent to,

$$\hat{y}_{ij} = \begin{cases} 1, & \text{if } p_{ij} > \tau \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

where $\tau \in [0, 1]$ is a decision threshold corresponding to $\alpha$, set such that $P(A) = P(B) = 0.5$.

4 Experimental Setup

4.1 Datasets

To investigate the general applicability of comparative assessment, we cover a range of standard NLG evaluation tasks and datasets as follows:

SummEval (Fabbri et al., 2021) is a summary evaluation benchmark of 100 passages, each with 16 machine-generated summaries. Each summary is evaluated for coherency (COH), consistency (CON), fluency (FLU), and relevancy (REL).

Podcast (Manakul and Gales, 2022) is for benchmarking podcast summary assessment methods. It contains 179 podcasts each with 15 abstractive summaries.
Each summary was evaluated for its overall quality on a 4-point scale.

TopicalChat with the USR annotations (Mehri and Eskenazi, 2020b) is for benchmarking dialogue evaluation. It includes 60 dialogue contexts and six system responses per context. These responses were assessed on coherency (COH), continuity (CNT), engagingness (ENG), and naturalness (NAT).

WebNLG (Gardent et al., 2017) is for benchmarking data-to-text evaluation methods. It contains 223 semantic triple groups, each paired with outputs from 8 triple-to-text generation systems. These texts were evaluated for fluency (FLU), grammar (GRA) and semantic equivalence (SEM).

4.2 Base Large Language Models (LLMs)

We investigate two families of open-source instruction-tuned LLMs. The first system is FlanT5 (Chung et al., 2022), T5 models (Raffel et al., 2020) that have been instruction-tuned on a diverse set of 1000 NLP tasks (Wang et al., 2022). The second system is Llama2-chat (Touvron et al., 2023), which is Llama2 tuned on instruction datasets. We investigate a range of model sizes: 220M, 770M, 3B and 11B for FlanT5, and 7B and 13B for Llama2.

4.3 Baselines

The NLG evaluation methods can be categorized into reference-based and reference-free. Reference-based methods compare the output against the reference, such as n-gram metrics (e.g., BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004)), or embedding-based metrics (e.g., BERTScore (Zhang et al., 2019)). In contrast, reference-free methods compare the generated texts against the original source (or context for generation) directly.

4.3.1 Bespoke Methods

Bespoke methods require specific data, which could be supervised labels (e.g., human judgements for the summaries) or data for model training (e.g., question-answering). Although bespoke methods could work in a similar domain (e.g., developed for summarization, but applied on dialogue generation), they are not as general as zero-shot methods.
UniEval (Zhong et al., 2022) converts NLG evaluation into a Boolean QA problem. This method uses pre-defined schemes for selected aspects (e.g., coherence) and generates synthetic data to fine-tune a T5 system for NLG assessment. References are used for particular aspects (e.g. relevancy), and schemes/systems are bespoke for a particular attribute (though a sequentially trained system that scores multiple attributes is also explored).

QuestEval (Scialom et al., 2021) and MQAG (Manakul et al., 2023a) are QA-based approaches for assessing consistency in summarization tasks. QuestEval uses extracted answer spans while MQAG represents information using multiple-choice questions. Both methods are reference-free.

Longformer-SFT: For podcast summarization, we follow Manakul and Gales (2022) in using a Supervised Fine-Tuned longformer (Beltagy et al., 2020) as a baseline. The input is the document and the summary, and human judgement is used as the supervised target label at training, and the performance is reported using 5-fold cross-validation.

4.3.2 Zero-shot Methods

Zero-shot methods can be applied generally to any task without further training or fine-tuning. Comparative assessment is a zero-shot method.

GPTScore (Fu et al., 2023) evaluates texts using conditional language model scores. By conditioning the language model on instruction and context, GPTScore assumes that it will assign a higher probability to a high-quality generated text.

Prompt Scoring. Another baseline is prompt-scoring. With this approach, for a particular attribute, the LLM is asked to assess the response quality on a scale of 1-10. Simple prompts are used with the general templates shown in Figure 3. Prompt-scoring is run for all open-source LLMs considered (FlanT5 and Llama2), and is used as the main baseline to compare comparative assessment against. During generation, the maximum generation length is set to 5 and the temperature is set to 1.0.
Similarly, ChatGPT prompt-scoring has recently been proposed in Wang et al. (2023); Kocmi and Federmann (2023), which we also include as a baseline where applicable.

Prompt 1:
    Passage:
    Summary:
    Score the response between 1 and 10 based on how consistent the summary is

Prompt 2:
    Summary:
    Provide a score between 1 and 10 that measures the summary's consistency

Figure 3: Scoring template 1 and template 2. Only the attribute is changed (e.g., consistent → engaging) and response description ('summary' → 'response') for different tasks.

G-Eval (Liu et al., 2023) extends standard prompt-scoring by using detailed prompts and then generating a continuous score by calculating the expected score over a score range (e.g. scores 1-5, weighted by their normalized probabilities). We apply G-Eval to the various base LLMs and contrast performance to the other approaches for SummEval, since the prompts for different attributes have been made publicly available.[1]

[1] https://github.com/nlpyang/geval

4.4 Methodology

Each LLM is used for both prompt-scoring and comparative assessment. For the main comparative assessment results, we consider the full set of possible comparisons, where all pairs of candidates in both permutations are compared by the framework. Comparisons are made using the prompt-based classifier (as described in §3.3) using the prompt templates shown in Fig. 2, where the system outputs a probability for Response A and Response B. The winner of the comparison is the response with the highest probability, and candidates are then ranked in order of the win-ratio (as described in §3.4). For Llama2, comparative prompts are appended with 'Answer:' while scoring prompts end with 'Score:'. The Spearman correlation between predicted scores and human judgements is used as the performance metric.
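Two steps of this methodology are easy to make concrete: normalising the two label-word probabilities as in Eq. (3), and computing the Spearman correlation between predicted scores and human judgements. The following is a minimal pure-Python sketch under stated assumptions: the inputs `logp_a` and `logp_b` are hypothetical log-probabilities that the LLM assigns to the label words for Response A and Response B, the function names are invented for illustration, and ties are handled with average ranks (the paper does not state its tie handling).

```python
import math


def pairwise_prob(logp_a, logp_b):
    """Eq. (3): probability that the first response is better, from the
    log-probabilities of the two label words (computed stably)."""
    m = max(logp_a, logp_b)
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def average_ranks(values):
    """Ranks starting at 1 (smallest value); tied values share the
    average of the rank positions they occupy."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one input is constant
    return cov / math.sqrt(vx * vy)
```

For instance, `pairwise_prob(math.log(0.6), math.log(0.2))` gives 0.75, since only the relative mass on the two label words matters. The tables in this paper report correlations multiplied by 100.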
5 Experiments

5.1 NLG Evaluation Results

Summary Assessment: Table 1 analyzes the effectiveness of comparative assessment on SummEval, where the following observations can be made:

(1) Moderate-sized LLMs are ineffective in the prompt-scoring set-up, with the best system (FlanT5-3B) achieving Spearman correlations of 10-20. The performance difference with ChatGPT prompt-scoring implies that scoring is likely an emergent ability only effective for larger LLMs.

(2) G-Eval, which uses task-specific detailed prompts and continuous scores, yields significant improvements over prompt-scoring. Nonetheless, comparative assessment remains more effective than G-Eval in the majority of settings.

(3) LLMs are able to achieve considerably higher correlations in the comparative assessment set-up, with performance higher for nearly all systems. Further, comparative assessment leads to more robust performance, with most 3B+ models achieving correlations within the range of 30-50.

(4) Comparative assessment enables LLMs of under 1B to perform well, with FlanT5-770M achieving moderate correlations. However, performance improves significantly when using 3B+ LLMs, although for SummEval there are diminishing (if any) performance gains by scaling up.

(5) The best comparative assessment LLM (FlanT5-3B) is competitive with all other zero-shot methods, including ChatGPT scoring (an LLM with two orders of magnitude more parameters), and achieves the best correlation in 3 of the 4 aspects.
| Approach | COH | CON | FLU | REL |
|---|---|---|---|---|
| Baselines (§4.3) | | | | |
| BERTScore (w/ Ref) | 25.9 | 19.7 | 23.7 | 34.7 |
| QuestEval | 18.2 | 30.6 | 22.8 | 26.8 |
| MQAG | 17.0 | 28.8 | 19.3 | 16.6 |
| UniEval (single-best) | 54.6 | 47.2 | 43.3 | 46.3 |
| UniEval (continual) | 57.5 | 44.6 | 44.9 | 42.6 |
| GPTScore FlanT5-3B | 47.0 | 43.6 | 42.1 | 34.4 |
| GPTScore FlanT5-11B | 45.6 | 43.8 | 42.4 | 34.3 |
| GPTScore GPT3 | 40.1 | 47.5 | 41.0 | 34.3 |
| ChatGPT scoring† | 45.1 | 43.2 | 38.0 | 43.9 |
| Prompt Scoring (§4.3.2) | | | | |
| FlanT5-220M | 4.0 | -0.2 | 0.2 | 2.8 |
| FlanT5-770M | -3.6 | -1.6 | -1.5 | -0.0 |
| FlanT5-3B | 14.5 | 19.8 | 3.9 | 15.2 |
| FlanT5-11B | 0.7 | 11.2 | 3.2 | 5.7 |
| Llama2-chat-7B | 8.6 | 9.0 | 1.8 | 7.8 |
| Llama2-chat-13B | 9.9 | 6.9 | 1.2 | 9.2 |
| G-Eval (§4.3.2) | | | | |
| FlanT5-220M | 3.6 | 0.6 | 2.7 | 8.0 |
| FlanT5-770M | 8.5 | 7.0 | 15.3 | 24.1 |
| FlanT5-3B | 10.5 | 29.1 | 9.8 | 23.8 |
| FlanT5-11B | 19.2 | 29.3 | 20.7 | 35.8 |
| Llama2-chat-7B | 28.2 | 29.4 | 23.0 | 27.4 |
| Llama2-chat-13B | 53.2 | 33.7 | 16.5 | 38.3 |
| Comparative Assessment (§3) | | | | |
| FlanT5-220M | 4.0 | -0.2 | 0.2 | 2.8 |
| FlanT5-770M | 29.8 | 26.3 | 20.6 | 35.1 |
| FlanT5-3B | 51.2 | 47.1 | 32.5 | 44.8 |
| FlanT5-11B | 44.2 | 37.2 | 30.2 | 43.4 |
| Llama2-chat-7B | 27.9 | 24.6 | 20.2 | 35.6 |
| Llama2-chat-13B | 40.9 | 39.9 | 30.8 | 45.3 |

Table 1: Spearman correlation coefficient for SummEval, averaged over both prompts per system (for prompt-scoring and comparative). †ChatGPT performance is quoted from Wang et al. (2023), which use more detailed scoring prompts.

| Approach | System-lvl | Summary-lvl |
|---|---|---|
| Baselines (§4.3) | | |
| BERTScore (w/ Ref) | 73.9 | 25.1 |
| UniEval (continual) | 42.0 | 22.8 |
| QuestEval | 42.5 | 20.4 |
| MQAG | 77.9 | 12.6 |
| Longformer-SFT | 89.6 | 19.6 |
| Prompt Scoring (§4.3.2) | | |
| Llama2-chat-7B | 88.5 | 2.6 |
| Llama2-chat-13B | 80.0 | 25.3 |
| Comparative Assessment (§3) | | |
| Llama2-chat-7B | 88.2 | 37.4 |
| Llama2-chat-13B | 97.1 | 45.5 |

Table 2: Spearman correlation coefficient for Podcast.

(6) Comparative assessment achieves competitive performance with UniEval. Although UniEval has better overall performance, UniEval was designed for bespoke tasks and aspects (it is fine-tuned on synthetic data created for particular attributes), and the results in Tables 2 and 4 show that UniEval has noticeable degradation in out-of-domain settings.
In contrast, comparative assessment is zero-shot and general.

| Approach | COH | CNT | ENG | NAT |
|---|---|---|---|---|
| Baselines (§4.3) | | | | |
| UniEval (single-best) | 60.7 | - | 59.6 | 54.7 |
| UniEval (continual) | 61.3 | - | 60.5 | 44.4 |
| GPTScore GPT3 | 56.9 | 32.9 | 49.6 | 52.4 |
| ChatGPT scoring† | 54.7 | 57.7 | 37.9 | 58.0 |
| Prompt Scoring (§4.3.2) | | | | |
| FlanT5-220M | -2.2 | 0.2 | -8.4 | 2.1 |
| FlanT5-770M | 3.7 | 3.1 | -4.3 | 3.8 |
| FlanT5-3B | 31.9 | 28.8 | 17.4 | 23.7 |
| FlanT5-11B | 15.3 | 8.0 | 4.3 | 24.3 |
| Llama2-chat-7B | 16.4 | 17.0 | 20.6 | 21.4 |
| Llama2-chat-13B | 21.7 | 19.9 | 31.4 | 23.2 |
| Comparative Assessment (§3) | | | | |
| FlanT5-220M | -0.3 | 8.2 | -10.5 | 2.2 |
| FlanT5-770M | 38.5 | 36.3 | 25.3 | 35.3 |
| FlanT5-3B | 49.4 | 49.4 | 37.3 | 47.4 |
| FlanT5-11B | 54.3 | 42.2 | 54.7 | 54.2 |
| Llama2-chat-7B | 28.9 | 33.7 | 36.1 | 30.3 |
| Llama2-chat-13B | 32.4 | 43.2 | 55.5 | 33.5 |

Table 3: Spearman correlation coefficient for TopicalChat. †ChatGPT is prompted using our prompt-scoring prompts.

Podcast Assessment: When considering podcast summarization with long inputs of over 5k tokens on average, only Llama2 models, which have a limit of 4k tokens, were used, as FlanT5 has a limit of 1k tokens. Table 2 shows that comparative assessment yields highly impressive performance for long-spoken summarization, with comparative assessment out-competing all other baselines. Further, although prompt-scoring has good system-level correlations, the lack of granularity leads to poor summary-level performance.

Dialogue Assessment: Next, we analyze comparative assessment on TopicalChat, for evaluating conversational responses. Table 3 shows similar findings for TopicalChat to those for SummEval, where comparative assessment again outperforms the correlations seen from prompt-scoring.

Data-to-Text Assessment: For data-to-text generation, the context is highly abstract and is a list of triples in the form of (object, relation, subject). This makes assessing the semantics challenging, as the LLM needs to parse and understand semantic triples.
Table 4 shows that understanding triples is an emergent ability of LLMs: for grammar and fluency the correlations are quite similar between the 3B and 11B/13B systems, whereas for semantic understanding the 10B+ systems substantially outperform the smaller systems. Note that when evaluating UniEval, we used the closest attribute that it was designed for, which was naturalness for both fluency and grammar.

| Approach | FLU | GRA | SEM |
|---|---|---|---|
| Baselines (§4.3) | | | |
| BLEU | 36.3 | 34.7 | 50.3 |
| METEOR | 44.3 | 42.9 | 62.7 |
| NLI Model∗ | - | - | 63.7 |
| UniEval (continual) | 21.7 | 16.3 | - |
| Prompt Scoring (§4.3.2) | | | |
| FlanT5-220M | 18.5 | 17.4 | 8.0 |
| FlanT5-770M | 14.5 | 13.6 | 17.1 |
| FlanT5-3B | 30.8 | 32.7 | 38.5 |
| FlanT5-11B | -0.7 | 6.9 | 20.8 |
| Llama2-chat-7B | 3.8 | 2.4 | 17.0 |
| Llama2-chat-13B | 1.8 | 0.5 | 5.6 |
| Comparative Assessment (§3) | | | |
| FlanT5-220M | -13.6 | -17.9 | 0.1 |
| FlanT5-770M | 36.2 | 35.2 | 11.4 |
| FlanT5-3B | 40.6 | 41.4 | 12.8 |
| FlanT5-11B | 41.4 | 44.8 | 52.4 |
| Llama2-chat-7B | 22.9 | 37.8 | -5.3 |
| Llama2-chat-13B | 44.9 | 45.1 | 53.5 |

Table 4: Spearman correlation coefficient for WebNLG. ∗Quoted from the NLI method with the backoff template in Dušek and Kasner (2020).

5.2 Positional Bias

We investigate whether the comparative prompts have any implicit positional bias, and whether systems prefer the first/second position. Table 5 shows the fraction of comparisons that selected the candidate in the first position for SummEval. Since all comparisons in both permutations are considered, this fraction should be 0.50 for an unbiased system. However, we observe considerably high bias, with some set-ups selecting the same position over 80% of the time. Further, we observe that larger systems appear to be more susceptible to bias than smaller systems, which may explain the similarity in performance for the 3B and 11B/13B systems in the previous main results.
Similar results for other datasets are provided in Appendix A.2.

| System | Prompt | COH | CON | FLU | REL |
|---|---|---|---|---|---|
| FlanT5-3B | 1 | 0.37 | 0.46 | 0.39 | 0.41 |
| FlanT5-3B | 2 | 0.43 | 0.47 | 0.40 | 0.44 |
| FlanT5-11B | 1 | 0.18 | 0.20 | 0.13 | 0.23 |
| FlanT5-11B | 2 | 0.24 | 0.24 | 0.17 | 0.26 |
| Llama2-chat-7B | 1 | 0.41 | 0.17 | 0.26 | 0.18 |
| Llama2-chat-7B | 2 | 0.68 | 0.56 | 0.48 | 0.45 |
| Llama2-chat-13B | 1 | 0.31 | 0.37 | 0.18 | 0.32 |
| Llama2-chat-13B | 2 | 0.29 | 0.30 | 0.19 | 0.26 |

Table 5: Positional bias P(A) for both prompt templates, for various systems in the comparative setup on SummEval.

| System | Debias | SummEval COH | SummEval CON | SummEval FLU | SummEval REL | TopicalChat COH | TopicalChat CNT | TopicalChat ENG | TopicalChat NAT | WebNLG FLU | WebNLG GRA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FlanT5-3B | ✗ | 51.2 | 47.1 | 32.5 | 44.8 | 49.4 | 49.4 | 37.3 | 47.4 | 41.0 | 41.8 | 44.2 |
| FlanT5-3B | ✓ | 51.8 | 46.9 | 33.0 | 45.3 | 49.6 | 50.2 | 38.0 | 46.3 | 40.7 | 42.3 | 44.4 |
| FlanT5-11B | ✗ | 44.2 | 37.2 | 30.2 | 43.4 | 54.3 | 42.2 | 54.7 | 54.2 | 41.4 | 44.8 | 44.7 |
| FlanT5-11B | ✓ | 45.3 | 39.7 | 30.7 | 44.7 | 57.2 | 59.5 | 59.5 | 58.8 | 44.5 | 44.6 | 48.5 |
| Llama2-chat-7B | ✗ | 29.4 | 24.6 | 19.7 | 35.2 | 28.2 | 33.1 | 36.3 | 28.7 | 22.9 | 37.8 | 29.6 |
| Llama2-chat-7B | ✓ | 28.8 | 24.8 | 19.7 | 35.5 | 29.1 | 34.5 | 39.7 | 28.5 | 24.3 | 37.1 | 30.2 |
| Llama2-chat-13B | ✗ | 40.9 | 39.9 | 30.8 | 45.3 | 32.4 | 43.2 | 55.5 | 33.5 | 44.9 | 45.1 | 41.2 |
| Llama2-chat-13B | ✓ | 42.8 | 40.3 | 31.9 | 47.1 | 32.5 | 44.5 | 56.9 | 38.4 | 45.9 | 43.7 | 42.4 |

Table 6: Spearman correlation coefficient on different aspects of the NLG evaluation tasks, averaged over all prompts considered, using all pairs and orderings (i.e. full matrix comparisons).

5.3 Debiasing

The previous section demonstrates that comparative assessment exhibits positional bias which may impact system decisions. We therefore investigate whether debiasing can improve evaluation performance. Table 6 shows standard and debiased LLM comparative assessment performance for the considered tasks and scores, with WebNLG SEM and Podcast omitted due to the required emergent ability and large context length respectively. We observe that debiasing can lead to performance boosts, and note that the prompts which have a high bias (seen in Table 5 and Table 9 in the appendix) benefit most from debiasing. In particular, for TopicalChat we observe large gains for the FlanT5-11B system, which enables state-of-the-art performance.
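The debiasing of §3.5 can be made concrete with a short sketch. Setting $\hat{p}_{ij} > 0.5$ in Eq. (9) gives $\alpha \cdot p_{ij} > 1 - p_{ij}$, i.e. $p_{ij} > 1/(1+\alpha)$, so the threshold of Eq. (10) is $\tau = 1/(1+\alpha)$ and conversely $\alpha = (1-\tau)/\tau$. The paper does not say how $\tau$ is searched for; the sketch below assumes, purely for illustration, that $\tau$ is taken as the median of the comparison probabilities, so that half of the comparisons select the first position. All function names are hypothetical.

```python
def position_prior(probs, tau=0.5):
    """P(A) of Eq. (8): fraction of comparisons in which the candidate
    in the first position is selected, using decision threshold tau."""
    return sum(p > tau for p in probs) / len(probs)


def debias_threshold(probs):
    """Assumed search rule: use the median of the p_ij values, so that
    (without ties) half of the comparisons fall above the threshold."""
    s = sorted(probs)
    mid = len(s) // 2
    if len(s) % 2 == 0:
        return (s[mid - 1] + s[mid]) / 2
    return s[mid]


def alpha_from_tau(tau):
    """Weight alpha of Eq. (9) matching threshold tau of Eq. (10).
    Requires 0 < tau < 1."""
    return (1 - tau) / tau


def reweight(p, alpha):
    """Eq. (9): reweighted probability for the first position."""
    return alpha * p / (alpha * p + (1 - p))
```

As a small usage example with made-up probabilities `[0.9, 0.8, 0.7, 0.4]`, the first position is selected in 3 of 4 comparisons at the default threshold; the median threshold is 0.75, the matching weight is 1/3, and after reweighting exactly half of the comparisons select the first position.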
To explain why debiasing can lead to large performance boosts, consider a very biased system where the first response is always selected as better. Over both permutations the system appears unbiased for any pair, since each candidate wins exactly the comparisons in which it is placed first; every candidate therefore ends up with the same win-ratio, and the system in effect treats all candidates as being of the same quality. By reducing the bias of each comparison, the system may be able to pick up subtler quality differences between the samples.

5.4 Comparative Accuracy

One can also measure the accuracy of the comparative system at a comparison level. Table 7 shows the pairwise comparison accuracy for SummEval, over all candidate pairs where the true scores of the two candidates differ. We observe accuracies of roughly 60-80% across all aspects and observe that debiasing can substantially increase accuracy. This highlights that LLMs are able to compare the quality of responses fairly well, though the moderately sized LLMs may not always select the best response (with respect to labels).

| System | Debias | COH | CON | FLU | REL |
|---|---|---|---|---|---|
| FlanT5-3B | ✗ | 68.6 | 82.0 | 68.2 | 67.2 |
| FlanT5-3B | ✓ | 69.8 | 82.1 | 68.8 | 67.8 |
| FlanT5-11B | ✗ | 61.6 | 70.3 | 60.3 | 63.3 |
| FlanT5-11B | ✓ | 66.2 | 76.7 | 65.9 | 67.4 |
| Llama2-chat-7B | ✗ | 59.6 | 63.8 | 59.6 | 61.0 |
| Llama2-chat-7B | ✓ | 60.3 | 65.7 | 60.4 | 63.1 |
| Llama2-chat-13B | ✗ | 62.6 | 75.4 | 61.1 | 65.4 |
| Llama2-chat-13B | ✓ | 65.8 | 76.9 | 67.2 | 68.5 |

Table 7: Accuracy of the comparative systems, at a comparison level, for SummEval.

5.5 Self-Consistency

SummEval has 16 summaries per context, which leads to 240 possible comparisons. If one were to instead randomly sample $N$ outputs and consider all $N \cdot (N-1)$ comparisons, how consistent would the rankings obtained from this subset be with the final predicted rankings? Table 8 illustrates the self-consistency measured by the accuracy when comparing pairs, and demonstrates that even when using few outputs, the model is very consistent with the final rankings that would be achieved by using many more examples.
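One way to read this self-consistency measure is sketched below. It is an interpretation rather than the authors' exact procedure: the sketch assumes a hypothetical matrix `wins[i][j]` holding the hard decision for the ordered comparison (i, j), recomputes win-ratios using only comparisons inside a sampled subset, and reports the fraction of candidate pairs that the subset ranking orders the same way as a reference ranking (the final ranking, or the gold labels). Pairs tied in either ranking are skipped, which is also an assumption.

```python
from itertools import combinations


def subset_win_ratios(wins, subset):
    """Win-ratio of each candidate in `subset`, using only the ordered
    comparisons between members of the subset (both permutations)."""
    scores = []
    for i in subset:
        # Wins when i is in the first position, plus wins when i is in
        # the second position (the first-position candidate lost).
        won = sum(wins[i][j] + (1 - wins[j][i]) for j in subset if j != i)
        scores.append(won / (2 * (len(subset) - 1)))
    return scores


def pairwise_agreement(scores_a, scores_b):
    """Fraction of pairs ordered identically by the two score lists;
    pairs tied in either list are skipped."""
    agree = total = 0
    for i, j in combinations(range(len(scores_a)), 2):
        da = scores_a[i] - scores_a[j]
        db = scores_b[i] - scores_b[j]
        if da == 0 or db == 0:
            continue
        total += 1
        agree += (da > 0) == (db > 0)
    return agree / total if total else float("nan")
```

Given the win-ratios `full` from all candidates and a sampled `subset` of indices, the self-consistency for that sample would be `pairwise_agreement(subset_win_ratios(wins, subset), [full[i] for i in subset])`, averaged over many random subsets.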
| N (number of outputs) | 2 | 3 | 4 | 6 | 8 | 12 | 16 |
|---|---|---|---|---|---|---|---|
| Final | 84.0 | 88.3 | 90.7 | 93.7 | 95.5 | 98.0 | 100 |
| Gold | 68.0 | 69.1 | 69.7 | 70.3 | 70.6 | 70.8 | 70.9 |

Table 8: Accuracy when using fewer systems with respect to final rankings (using all 16 systems) and the ground truth labels. Results shown for SummEval COH using FlanT5-xl.

5.6 Subset of Comparisons

Due to the $O(N^2)$ number of comparisons required for the full comparison matrix, it might be practical to only consider a subset of comparisons. Figure 4 shows the downstream Spearman correlation for SummEval coherency, when averaged over 50 runs, for different comparison selection strategies. Of the three schemes, we observe that for small $R$ (i.e. less than half the total number of comparisons) selecting comparisons with no repeats leads to a marginal improvement over random selection. Further, by using the symmetric selection scheme, despite the number of comparisons being half that of no-repeat (although each comparison is done twice, once in each permutation), interestingly there is only a performance difference of 1 in terms of Spearman. Finally, we observe that debiasing can be very effective in efficient set-ups, and leads to larger benefits when the number of comparisons is small. Equivalent plots for other tasks/scores can be found in Appendix A.1.

[Figure 4 plot: x-axis R (number of comparisons), y-axis Spearman; curves for no-repeat, random and symmetric selection, each with and without debiasing.]

Figure 4: FlanT5-3B performance for SummEval COH when a subset of the comparisons are selected by either random, no-repeat or symmetric (as described in §3.4). For no-repeat, each pair is compared once, hence has a smaller maximum R.

6 Conclusions

This paper investigates LLM comparative assessment, a simple zero-shot approach to NLG evaluation.
We demonstrate that for moderately sized LLMs, comparative assessment outperforms absolute scoring and is an effective automatic assessment approach, achieving near state-of-the-art performance for a range of NLG evaluation tasks. Furthermore, we show that LLMs are prone to positional bias that can impact their decisions. However, we introduce a simple debiasing approach that leads to performance boosts, especially for biased systems.

Limitations

Computational Cost. The comparative assessment framework with the full set of comparisons uses N · (N − 1) comparisons, which for large N can be computationally prohibitive. This paper investigated datasets with at most 16 candidates, and the approach may not scale when more candidates are required.

Base LLMs. The empirical findings are for LLMs of up to 13B parameters. By using larger models (with 100B+ parameters) one may expect further performance improvements. However, due to API costs and the O(N^2) number of comparisons, results are limited to open-source LLMs.

Selection of the subset of comparisons. For our comparison selection scheme, this work only considered static selection schemes. Future work may investigate dynamic selection schemes, either by considering sorting algorithms or Elo competition schemes, and methods similar to those studied in information retrieval by Qin et al. (2023).

Ethics Statement

For some tasks/datasets, comparative assessment could be ineffective and have poor generalisation over the task. Deploying machine learning classifiers in real-world classification settings has many associated risks, and careful analysis should be made before deploying such systems. Misuse of, or overconfidence in, the approach may lead users to mistrust LLM solutions.

Acknowledgements

This paper reports on research supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge.
This research is further supported by the Cambridge International & St John’s College scholarship.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–320, Trento, Italy. Association for Computational Linguistics.
Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.
Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Ondřej Dušek and Zdeněk Kasner. 2020.
Evaluating semantic accuracy of data-to-text generation with natural language inference. In Proceedings of the 13th International Conference on Natural Language Generation, pages 131–137, Dublin, Ireland. Association for Computational Linguistics.
Alexander R Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.
Alice Lai and Joel Tetreault. 2018. Discourse coherence in the wild: A dataset, evaluation and methods. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 214–223, Melbourne, Australia. Association for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Lucian Lita, Monica Rogati, and Alon Lavie. 2005. BLANC: Learning evaluation metrics for MT. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 740–747, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023.
G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
Adian Liusie, Potsawee Manakul, and Mark J. F. Gales. 2023. Mitigating word bias in zero-shot prompt-based classifiers.
Potsawee Manakul and Mark JF Gales. 2022. Podcast summary assessment: A resource for evaluating summary assessment methods. arXiv preprint arXiv:2208.13265.
Potsawee Manakul, Adian Liusie, and Mark JF Gales. 2023a. Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization. arXiv preprint arXiv:2301.12307.
Potsawee Manakul, Adian Liusie, and Mark JF Gales. 2023b. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
Shikib Mehri and Maxine Eskenazi. 2020a. Unsupervised evaluation of interactive dialog with DialoGPT. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–235, 1st virtual meeting. Association for Computational Linguistics.
Shikib Mehri and Maxine Eskenazi. 2020b. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707, Online. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, et al. 2023.
Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021.
QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Louis L Thurstone. 1927. A law of comparative judgment. Psychological review, 34(4):273.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.
Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048.
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023.
Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

A Additional Results

A.1 Partial Comparison Curves

[Figure 5 panels: each plots Spearman correlation against R (number of comparisons) for the no-repeat, random and symmetric selection schemes, with and without debiasing.]
(a) FlanT5-3B, SummEval, CON
(b) FlanT5-3B, SummEval, FLU
(c) FlanT5-3B, SummEval, REL
(d) FlanT5-11B, TopicalChat, COH
(e) FlanT5-11B, TopicalChat, ENG
(f) FlanT5-11B, TopicalChat, NAT
(g) FlanT5-11B, SummEval, REL
(h) Llama2-chat-7B, SummEval, CON
(i) Llama2-chat-13B, SummEval, FLU

Figure 5: Assessment performance when only a subset of comparisons is considered (extending the results of Figure 4). Multiple different base LLMs, datasets and scores are displayed.

A.2 Positional Bias

                          SummEval                TopicalChat             WebNLG            Podcast
System            Prompt  COH   CON   FLU   REL   COH   CNT   ENG   NAT   FLU   GRA   SEM
FlanT5-3B         1       0.37  0.46  0.41  0.42  0.47  0.44  0.50  0.49  0.46  0.41  0.89  -
                  2       0.43  0.47  0.42  0.44  0.46  0.44  0.47  0.47  0.38  0.36  0.85  -
FlanT5-11B        1       0.18  0.25  0.16  0.23  0.25  0.17  0.27  0.26  0.15  0.19  0.56  -
                  2       0.24  0.29  0.19  0.26  0.27  0.13  0.29  0.31  0.19  0.21  0.42  -
Llama2-chat-7B    1       0.41  0.21  0.28  0.18  0.57  0.26  0.25  0.36  0.36  0.53  0.98  0.33
                  2       0.68  0.57  0.50  0.45  0.56  0.37  0.22  0.35  0.37  0.48  0.90  0.24
Llama2-chat-13B   1       0.31  0.43  0.20  0.32  0.69  0.73  0.67  0.74  0.23  0.38  0.50  0.22
                  2       0.29  0.37  0.22  0.26  0.65  0.65  0.62  0.68  0.28  0.40  0.29  0.40

Table 9: Fraction of comparisons where the candidate in the first position was selected by the LLM when using the full (symmetric) set of comparisons. The bias is presented for both prompts, over all datasets and scores, extending the results in Table 5.
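The statistic reported in Table 9 is simple to compute once the judge's decisions are recorded. The sketch below is illustrative only. It assumes each comparison is summarised by a probability that the first-position candidate is preferred, and the names `first_position_rate` and `balancing_threshold` are hypothetical. The median-based threshold is one simple way to make the two positions equally likely to be selected; it is shown as an assumption and is not necessarily the debiasing method defined earlier in the paper.

```python
from statistics import median


def first_position_rate(p_first, threshold=0.5):
    """Fraction of comparisons in which the first-position candidate is selected.

    p_first[k] is the judge's probability that the first candidate
    wins comparison k; it is selected when that probability exceeds
    `threshold`.
    """
    if not p_first:
        raise ValueError("need at least one comparison")
    return sum(p > threshold for p in p_first) / len(p_first)


def balancing_threshold(p_first):
    """Threshold at which about half of the comparisons favour the first position."""
    return median(p_first)


# Hypothetical probabilities from a judge that leans towards the first position.
p_first = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
print(first_position_rate(p_first))
print(first_position_rate(p_first, balancing_threshold(p_first)))
```

Over the full symmetric set each pair appears in both orders, so a judge with no positional preference would give a rate close to 0.5; values far from 0.5 indicate a preference for one position.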
A.3 Accuracy of Pairwise Comparisons

                          SummEval                TopicalChat             WebNLG            Podcast
System            Debias  COH   CON   FLU   REL   COH   CNT   ENG   NAT   FLU   GRA   SEM
FlanT5-3B         ✗       68.6  82.0  68.2  67.2  75.3  71.0  65.6  70.3  66.2  65.5  51.8  -
                  ✓       69.8  82.1  68.8  67.8  75.4  72.2  65.6  69.9  66.7  66.6  51.3  -
FlanT5-11B        ✗       61.6  70.3  60.3  63.3  70.0  60.5  68.0  68.9  60.8  62.7  69.6  -
                  ✓       66.2  76.7  65.9  67.4  76.6  74.2  74.4  74.7  67.6  67.3  69.9  -
Llama2-chat-7B    ✗       59.6  63.8  59.6  61.0  64.0  62.0  61.0  60.4  56.6  61.1  48.3  63.4
                  ✓       60.3  65.7  60.4  63.1  64.0  64.3  65.9  61.6  57.1  61.1  50.2  -
Llama2-chat-13B   ✗       62.6  75.4  61.1  65.4  64.5  66.8  72.0  62.3  64.7  67.6  67.3  70.3
                  ✓       65.8  76.9  67.2  68.5  65.9  69.4  73.8  65.2  66.7  67.4  68.9  -

Table 10: Accuracy of pairwise comparisons of all candidates which differ in true value. Accuracies are shown for all datasets and scores, extending the results of Table 7.

B Alternate Ranking Strategies

In the main paper, we only consider the win-ratio as an approach for converting comparisons to ranks, since it is simple and intuitive. However, alternate ranking strategies are possible; a well-motivated decoding approach is to select the ranks with the highest probability given the observed comparisons. By Bayes’ theorem (assuming a uniform prior over rankings), this is equivalent to finding the ranks that maximise the likelihood of the observations.

\hat{r}_{1:N} = \arg\max_{r_{1:N}} P(C \mid r_{1:N})    (11)

For a set of ranks r1:N , let zij=1(ri