Nature | Vol 652 | 2 April 2026

Article

General scales unlock AI evaluation with explanatory and predictive power

Lexin Zhou1,2,3,4 ✉, Lorenzo Pacchiardi2, Fernando Martínez-Plumed4, Katherine M. Collins5, Yael Moros-Daval4, Seraphina Zhang2,6, Qinlin Zhao3, Yitian Huang3, Luning Sun7, Jonathan E. Prunty2, Zongqian Li8, Pablo Sánchez-García9, Kexin Jiang-Chen4, Pablo A. M. Casares4, Jiyun Zu10, John Burden2, Behzad Mehrbakhsh4, David Stillwell7, Manuel Cebrian11, Jindong Wang12, Peter Henderson1, Sherry Tongshuang Wu13, Patrick C. Kyllonen10, Lucy Cheke2,6, Xing Xie3 ✉ & José Hernández-Orallo2,4 ✉

Ensuring safe and effective use of artificial intelligence (AI) requires understanding and anticipating its performance on new tasks, from advanced scientific challenges to transformed workplace activities1–3. So far, benchmarking has guided progress in AI but has offered limited explanatory and predictive power for general-purpose AI systems4–8, attributed to limited transferability across specific tasks9–11. Here we introduce general scales for AI evaluation that elicit demand profiles explaining what capabilities common AI benchmarks truly measure, extract ability profiles quantifying the general strengths and limits of AI systems and robustly predict AI performance for new task instances. Our fully automated methodology builds on 18 rubrics, capturing a broad range of cognitive and intellectual demands, which place different task instances on the same general scales, illustrated on 15 large language models (LLMs) and 63 tasks. Both the demand and the ability profiles on these scales bring new insights, such as construct validity through benchmark sensitivity and specificity, and explain conflicting claims about whether AI has reasoning capabilities.
Ultimately, high predictive power at the instance level becomes possible using the general scales, providing superior estimates over strong black-box baseline predictors, especially in out-of-distribution settings (new tasks and benchmarks). The scales, rubrics, battery, techniques and results presented here constitute a solid foundation for a science of AI evaluation, underpinning the reliable deployment of AI in the years ahead.

Present general-purpose AI systems, such as LLMs, are highly unreliable and unpredictable6,12. This places a large burden on AI evaluation in terms of explanatory and predictive power: we need to understand why the AI system is failing and anticipate where it can be applied successfully. The traditional performance-oriented evaluation approach has shown limited predictive power at the instance level, inside or outside the benchmark9,10. If DeepSeek-R1 achieves 79.8% average performance13 on a popular mathematical benchmark such as the American Invitational Mathematics Examination dataset14, we cannot make informed estimates of success on individual items sampled from that benchmark. This performance score is even less informative for out-of-distribution instances from other mathematical benchmarks, let alone benchmarks from other domains. Indeed, aggregate performance scores are a function of both the benchmark and the AI system, not invariable properties of the system only—its ‘capabilities’—that delineate the limits of the system, generalizable across a wide range of scenarios. Instead of aggregating performance, other evaluation paradigms do estimate some properties of the subject (the human or the AI system), which, jointly with some properties of the item (the specific problem instance), can predict performance; we provide a glossary for technical terms such as subject, item, ability and contamination in Supplementary Information Section 1.16.
Several techniques from psychometrics and other behavioural sciences have been applied to AI evaluation15, such as factor analysis16,17 and item response theory (IRT)18. However, the extracted factors or parameters are populational: they depend heavily on the population of systems and benchmarks used, which makes them quickly outdated with the fast pace of AI progress. More recently, score prediction metamodels related to uncertainty estimation and calibration methods, known as ‘assessors’19,20, have been used to anticipate performance for new tasks at the instance level, by means of latent features. Nonetheless, these features are difficult to interpret and typically extrapolate poorly out of distribution21,22. Alternatively, these features can be engineered by humans through cognitively inspired approaches23, but the scalability of this approach is limited by the need for experts who develop the cognitive models and annotate the testing items.

https://doi.org/10.1038/s41586-026-10303-2
Received: 29 March 2025
Accepted: 19 February 2026
Published online: 1 April 2026
Open access

1Princeton University, Princeton, NJ, USA. 2Leverhulme Centre for the Future of Intelligence, University of Cambridge, Cambridge, UK. 3Microsoft Research Asia, Beijing, China. 4Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, València, Spain. 5Department of Engineering, University of Cambridge, Cambridge, UK. 6Department of Psychology, University of Cambridge, Cambridge, UK. 7The Psychometrics Centre, University of Cambridge, Cambridge, UK. 8Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK. 9KU Leuven, Leuven, Belgium. 10Educational Testing Service, Princeton, NJ, USA. 11Center for Automation and Robotics (CAR), Spanish National Research Council (CSIC-UPM), Madrid, Spain. 12William & Mary, Williamsburg, VA, USA. 13Carnegie Mellon University, Pittsburgh, PA, USA. ✉e-mail: lz5066@princeton.edu; xing.xie@microsoft.com; josephorallo@gmail.com

These perspectives differ in what is measured and how8, but they have all grappled with explanatory depth and predictive power. Also, most of these frameworks derive features, parameters or scales that are regularly saturated by an extremely volatile space of AI systems and benchmarks, soon becoming obsolete24,25. Lack of construct validity10,26–28 is also an issue in the common benchmarking paradigm8. Solving all of these issues is a prerequisite for more robust assessment in the real world9,29, such as interactive, subjective and adaptive evaluations30–32. Table 1 summarizes the problems and associated findings presented in this paper, the solutions it brings and its numerous new applications. Supplementary Information Section 1.1 further details related work.

We present a new methodology that can accompany, map and inform AI progress, regulation and deployment in the coming decades. This is instantiated and demonstrated for LLMs—the most popular form of general-purpose AI—but the methodology is extendable to AI systems with other architectures and affordances. The core element is an array of 18 scales in the range (0, ∞) corresponding to general capabilities relevant to tasks expressed in natural language—such as verbal comprehension and logical reasoning—and broad areas of knowledge—such as natural and formal sciences.
The precise values on these scales (the demand levels) are obtained through 18 carefully crafted demand-level-annotation (DeLeAn) rubrics in the range 0 to 5+, which humans can interpret and apply to any testing instance, but which are ultimately applied by an LLM judge for scalability. By running the rubrics through a collection of 20 benchmarks, we obtain the annotated-demand-levels (ADeLe) battery, whose 18 histograms of demand levels form a demand profile examining the sensitivity of each benchmark (measuring what it claims to measure) and its specificity (not measuring other capabilities beyond what it claims to measure). For each LLM to which ADeLe is applied, we get 18 characteristic curves, delineating LLM performance as a function of the demand levels. Each curve is summarized into an ability estimate that is commensurate to each demand scale, hence composing an ability profile of 18 ability levels. Notably, the demand levels for a particular task or benchmark and the ability levels for an AI system are independent of other benchmarks and systems and any population thereof. Most notably, the demand levels can be used to build strong predictive models for the success of AI systems on unseen in-distribution and, particularly, out-of-distribution instances (new tasks and benchmarks).

As an example, by annotating several benchmarks that claim to evaluate ‘reasoning’ (Fig. 1) and comparing the annotated demands with the measured capabilities for an AI system, we can obtain causal explanation and prediction: if an AI system such as DeepSeek-R1-Distilled-Qwen-14B has a profile with quantitative reasoning (QLq), logical reasoning (QLl) and inductive reasoning (CL) abilities of 4.5, 4.3 and 4.2, respectively, as shown in Fig. 1a, we can anticipate success on a typical instance from GSM8K, with demands of 2, 1 and 0 in these same dimensions (and low demands on the others).
We can also predict a less optimistic outcome on a typical instance from OlymMATH Hard, with values around 4 and even 5 for some dimensions (Fig. 1b). We can also perform counterfactual analyses, such as arguing that, if the capability of DeepSeek-R1-Distilled-Qwen-14B in QLq were reduced to 3, its performance on GSM8K would be marginally affected. However, it would be greatly affected if its capability in QLq were reduced to 1. Thus, with our methodology, we unlock the following possibilities, beyond the reach of previous approaches:

1. We can carve the space of capabilities into a hierarchical catalogue of general scales. The DeLeAn rubrics v1.0 (see Supplementary Information Section 2 for the dimensions in Extended Data Table 5) are applied systematically to the 16,108 instances of the ADeLe battery v1.0 (Supplementary Table 28), yielding 289,944 annotations across 18 general scales. The clarity of the rubrics is validated by the agreement between human and LLM annotations. The existence of instances that differ on any pair of capabilities and the moderate demand correlations between the 19 dimensions (Extended Data Fig. 1) suggest that the set of scales maps potentially distinctive capabilities, not dependent on present systems and likely to remain informative for future AI systems.

2. We can explain what common benchmarks truly measure.
We discover the presence of demands in extraneous dimensions such as atypicality (from common to unique), volume (from small to large) and unguessability (from multiple-choice to open-ended), indicating contamination (overestimation because similar data were seen during training33), amalgamation (underestimation because examples are made more difficult by agglomerating more things to the task34) and funnelling (underestimation or overestimation by changing the difficulty of a task by reducing or increasing options or distractors35), respectively (Fig. 2 shows the levels of these demands and Supplementary Table 2 shows how predictive these dimensions are). Beyond these effects, many benchmarks lack either sensitivity or specificity: they do not contain instances of all demand levels for the dimensions their designers claimed to measure or they include non-zero demands on other dimensions they should not be measuring (Fig. 2). Identifying what each instance really measures paves the way for interoperability of benchmarks and AI evaluation with construct validity.

Table 1 | Diagnosis of the challenges of present AI evaluation paradigms, associated new findings revealed by the methodological solutions contributed in this paper and the potential applications of the new methodology (expanded in Methods section ‘Pipeline and guidelines for applications and extensions’)
• Challenge: Construct validity: common benchmarks do not measure the abilities that they claim to measure. Finding: Narrow ranges of task demand levels on targeted abilities, confounded by unwanted demands. Solution: Benchmark profiles quantifying what abilities the benchmark truly measures. Applications: Resolution of contradictory claims (for example, LLMs can and can’t reason); better benchmarks with construct validity by design.
• Challenge: Commensurability: incomparable measurements across benchmarks and distributions. Finding: Aggregate percentages mostly reflect sufficiency of capabilities, not capability estimates. Solution: Standardized general scales fixing demand levels to infer general capability profiles of AI. Applications: Interoperability of benchmarks; instance reuse into new batteries; meaningful scaling laws through non-saturated general scales.
• Challenge: Population independence: frequent benchmark saturation and replacement dynamics. Finding: Instances in saturated benchmarks can still be informative, owing to uneven demand profiles. Solution: Non-populational benchmark demand and model ability profiles through standardized scales. Applications: Measurements robust to changing populations (of benchmarks and AI systems); capability catalogue accommodating AI progress.
• Challenge: Explanatory power: benchmarks do not explain why LLMs fail on particular instances and what they lack. Finding: Failures monotonically increase in a sigmoidal way as demands increase. Solution: Defining general scales and rubrics that are interpretable by humans. Applications: Capability profiles bringing explanatory power, enabling model diagnosis and counterfactual explanations of AI failures across domains.
• Challenge: Predictive power: poor anticipation of AI performance for new tasks and new domains. Finding: Capability-based instance-level predictions are highly accurate, robust to out-of-distribution cases. Solution: Predicting performance with demand levels as features, optionally with system profiles. Applications: Routing instances to the LLM with highest predicted probability or rejecting queries; monitoring AIs; guiding red teaming.

3. We can explain the general strengths and limits of AI systems through commensurate scales. In our experiments with three families of LLMs, we find that the ability scores at knowledge dimensions are mostly determined by model size, whereas quantitative and logical reasoning, learning and abstraction and (perhaps surprisingly) mind modelling and social capabilities are boosted in chain-of-thought, inference-heavy models such as OpenAI’s o1 and DeepSeek-R1-Distilled (Figs. 3 and 4).
Because the dependent variable is not a relative percentage on a benchmark but a level on commensurate ratio scales that do not saturate, we have been able to clarify conflicting evaluation results (Supplementary Information Section 1.12) and demonstrate diminishing returns in scaling laws (Supplementary Information Section 1.4).

4. We can robustly predict AI performance for instances from new tasks and benchmarks. High predictive power at the instance level is possible, superior to black-box assessor baselines based on embeddings or fine-tuning, especially in out-of-distribution settings (new tasks and benchmarks), supporting both internal and external validity of the scales. These are also superior to domain-based36 or learning-levels taxonomies37 (Supplementary Information Section 1.9). This opens up a range of applications, such as better routing methods to choose what model to use38, safety operating areas in which assurance is guaranteed7 and anticipatory reject rules when harm or cost is anticipated39,40. See Extended Data Tables 2, 3 and 4 and Supplementary Fig. 8.

These processes are fully automated through open-source pipelines that can be easily customized by AI researchers, policymakers and regulators by extending the scales to other capabilities, traits or propensities (for example, affecting safety or fairness) and to agents with affordances (see Extended Data Fig. 5 and full explanation of the collaborative platform in Methods section ‘Pipeline and guidelines for applications and extensions’). This endeavour is seminal in creating a measurement standard for AI, mimicking the measurement efforts that have been pivotal in other sciences41–43.

The key element for our overhauling of AI evaluation is the configuration of scales that are understandable, general and well-grounded in measurement theory.
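The instance-level prediction described in point 4 uses demand levels as features. Below is a minimal sketch of a logistic-regression 'assessor' in this spirit. The three-dimension setup, the synthetic annotations and the hidden ability rule are invented for illustration; the paper's assessors and baselines are more elaborate.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

def train_assessor(X, y, lr=0.1, epochs=300):
    """Train a logistic-regression 'assessor' that predicts instance-level
    success from demand-level features (one feature per dimension)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

# Synthetic stand-in for annotated instances: three demand dimensions with
# levels 0-5; the hypothetical model succeeds when no demand exceeds 3.
random.seed(0)
X = [[random.randint(0, 5) for _ in range(3)] for _ in range(400)]
y = [1.0 if max(xi) <= 3 else 0.0 for xi in X]
w, b = train_assessor(X, y)
print(round(predict(w, b, [1, 1, 0]), 2))  # low demands: high predicted success
print(round(predict(w, b, [5, 4, 5]), 2))  # high demands: low predicted success
```

Because the features are interpretable demand levels rather than opaque embeddings, such a predictor can in principle transfer to instances from new benchmarks, which is the out-of-distribution advantage claimed in the text.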
We work with a catalogue of 18 scales, following a hierarchical structure (Supplementary Information Section 2), chosen by following a set of criteria fully explained in Methods section ‘General scales’. We refer to the first 11 as ‘elemental’, capturing general capabilities such as verbal expression and metacognition. The second group includes five ‘knowledge’ dimensions measuring expertise in different broad areas of science. There are also three ‘extraneous’ dimensions (two are proper scales and the third is a control variable for funnelling), AT (Atypicality), VO (Volume) and UG (Unguessability), which do not directly capture cognitive demands but, rather, reflect those elements making items more difficult in other ways. The full scale rubrics can be found in Supplementary Information Section 2. We also explore alternative ablations with subsets of the catalogue as well as other taxonomies36,37 in Supplementary Information Sections 1.7 and 1.9, with none of them coming close to what the DeLeAn catalogue achieves in predictive or explanatory power.

Fig. 1 | Commensurate LLM and benchmark profiles can be compared to explain and predict performance. Here we show LLM capability profiles (a; DeepSeek-R1-Distilled-Qwen-14B, estimated as discussed in the section titled ‘Explanatory power analysis: profiling LLM abilities’) and four different ‘reasoning’ benchmark demand profiles (b; GSM8K, OlymMATH Easy, GPQA and OlymMATH Hard, for which each slice represents the frequency of demand levels for each capability, with darker colours representing higher frequency). Researchers, developers and users can intuit that performance is expected to be high for the benchmark GSM8K but worse for the other three. Moreover, these profiles explain apparently contradictory findings, such as the accuracy of DeepSeek-R1-Distilled-Qwen-14B on GSM8K, OlymMATH Easy, GPQA and OlymMATH Hard being 90.50%, 61.80%, 59.10% and 13.30%, respectively, despite all of these benchmarks supposedly testing mathematical reasoning according to their creators. Indeed, OlymMATH Easy has lower demands for quantitative reasoning (QLq), logical reasoning (QLl) and inductive reasoning (CL) than OlymMATH Hard but similar demands for all other dimensions. Instead, GPQA yields worse performance than OlymMATH Easy, despite being easier in reasoning dimensions, because of its low specificity, with high demands for some knowledge dimensions beyond KNf (Formal Sciences), such as KNn (Natural Sciences) and KNa (Applied Sciences). Further details of these ‘reasoning’ benchmarks are discussed in Supplementary Information Section 1.12.

In Methods sections ‘Ratio scales’ and ‘Dissecting the demand-ability space’, we explain how the scales are defined using rubrics that serve as measurement instruments for the instance demands and then build the methodology around them; this is applicable to whatever catalogue we use, be it DeLeAn v1.0, its extension or others. Our main goal with these scales is to achieve AI evaluation with both explanatory and predictive power.
We now demonstrate that this is indeed the case with four specific research questions, comparing our approach with standard practice or best baselines in AI evaluation.

Annotation scales distinguish levels and dimensions

First we address the following research question: can humans distinguish the levels in the rubrics and the dimensions? The scales will only serve for explanatory purposes if they can be understood. In Methods section ‘LLM annotators and inter-rater analysis’, we describe how a group of five humans were selected, how the rubrics were presented and to what sample of data. The inter-rater agreement (rWG index) of these five humans for the 18 demands ranges between 0.70 and 0.91 (with an average of 0.83). After applying the Delphi method, we have a consensus annotation, which we compare against GPT-4o, the LLM annotator, resulting in high agreement rates (rWG scores between 0.75 and 0.94, averaging 0.86). These agreement rates show a common understanding among humans and with the automated annotations performed by GPT-4o. Another source of necessary support for a rubric would be whether it leads to high predictive power, which we will explore in the section ‘Predictive power analysis: anticipating performance with assessors’, while still representing the construct in an understandable way.

Fig. 2 | Distribution of level frequencies for the 18 demands (that is, demand profiles) of the 20 benchmarks in the ADeLe v1.0 battery: ChemLLMBench, Civil Service Examination, Data Analysis, Date Arithmetic, GRE & GMAT, Language, LSAT, Math, MCTACO, MedCalcBench, MenatQA, MMLU-Pro, OmniMath, Reasoning, SAT, SciBench, TempReason, TimeDial, TimeQA and TruthQuest. Supplementary Information Section 1.12 reconciles common myths in LLM ‘reasoning’ and also describes the demand profiles for 20 so-called ‘reasoning’ benchmarks.

The dimensions could be understandable by humans but conceptually redundant, in the sense that we could not conceive an instance for which one dimension level is high and the other is low.
If such an instance does not exist, humans will find it hard to distinguish the dimensions. The dimensions can still be correlated in a particular benchmark (for example, because the design or selection bias always makes one increase along with the other), but if the correlation is not near-maximal, we could conclude that there must be instances with very different levels.

Fig. 3 | Characteristic curves for the 18 demands and the 15 LLMs. The x-axis shows the demand levels for that dimension and the y-axis the average performance (probability of success) for each level. In the fit, all bins are weighted the same as the largest one (except those bins with fewer than 100 instances, which use a proportional weight for robustness). The curve is a logistic fit with an anchor at coordinates (20, 0), accounting for 50% of the total weight. The curves thus extend beyond level 5, which is why we show the x-axis from 0 to 10, even though the present version of the scales only has levels up to 5. The models shown are Babbage-002, Davinci-002, GPT-3.5-Turbo, GPT-4o, OpenAI o1-mini, OpenAI o1, LLaMA-3.2-1B/3B/11B/90B-Instruct, LLaMA-3.1-405B-Instruct and DeepSeek-R1-Distilled-Qwen-1.5B/7B/14B/32B.

In Extended Data Fig. 1, we show the Spearman correlations of the demand levels for all of the dimensions in the ADeLe battery, a representative sample selected mainly from AI benchmarks in 2024. The generally low or moderate correlations indicate that most dimensions seem to carve different parts of the intelligence space, still allowing for cases in which the level for one dimension is low and the level for the other dimension is high. These examples do not abound but are not impossible. Only two correlations are greater than 0.8 and they fall on CL (Conceptualisation, Learning, and Abstraction), which looks slightly central in the manifold, given its strong correlation with MC (Metacognition and Critical Thinking) and with QLl (Quantitative and Logical Reasoning). We also see that the correlations for the extraneous dimensions are high with other demands (except for UG). In general, these positive or negative correlations can have several interpretations, as they are contingent on our choice of benchmarks. The overall conclusion is that the annotations by GPT-4o seem understandable for humans across all dimensions, and the dimensions can be well distinguished. This is valuable, as other rubrics in AI evaluation practice tend to be specific, rarely quantitative and only occasionally meant to be explanatory44,45, despite the recognition that this understanding is a key factor in AI adoption27.
Also, the correlations between dimensions do not seem to suggest that some combinations of demand levels are impossible, but simply that they are infrequent in the present ADeLe battery v1.0. In this paper, our choice of instances and benchmarks was meant to be representative of the landscape of AI benchmarks, rather than a cherry-picked selection to minimize correlations. This was conditioned by our interest in exploring what the benchmarks measure, as we study next.

Explanatory power through benchmark demand profiles

The research question we address in this section is: what is the sensitivity and specificity of ADeLe and its constituent benchmarks? We can first look at the demand profiles per benchmark (Fig. 2). This is informative to understand what the benchmarks actually measure and whether they measure what their designers claim to measure. Overall, the profiles are considerably distinct, so apparently they measure different things. Benchmarks that focus on specialized topics (for example, ChemLLMBench, OmniMath, MedCalcBench and SciBench) show high demands in their respective domains (KNa (Applied Sciences), KNn (Natural Sciences) and KNf (Formal Sciences)), whereas benchmarks such as TempReason and TruthQuest, which target a single domain, often peak in further dimensions. Other benchmarks—such as Date Arithmetic, GRE & GMAT, MCTACO, TimeDial and TimeQA—have uniformly low demands. By contrast, broader assessments such as Civil Service Examination, LSAT and MMLU-Pro show mixed profiles. To determine whether they measure what they claim to measure, we must compare Fig. 2 with the list of capabilities or domains these benchmarks are said to be measuring (Supplementary Table 28).
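This comparison against claims can be mechanized once thresholds are fixed. Below is a minimal sketch, assuming per-instance demand annotations per dimension and using the sensitivity thresholds (mean ≥ 2, s.d. ≥ 1) and specificity threshold (mean < 2 for non-claimed dimensions) adopted in this paper; the profile data are invented:

```python
def check_construct_validity(profile, claimed, mean_thr=2.0, sd_thr=1.0):
    """Check a benchmark demand profile against the two criteria.
    profile: dimension code -> list of per-instance demand levels.
    claimed: set of dimension codes the benchmark claims to measure."""
    def stats(levels):
        m = sum(levels) / len(levels)
        sd = (sum((x - m) ** 2 for x in levels) / len(levels)) ** 0.5
        return m, sd
    sensitive = {d for d, lv in profile.items()
                 if stats(lv)[0] >= mean_thr and stats(lv)[1] >= sd_thr}
    # Sensitivity: every claimed dimension shows a wide range of demand levels.
    sensitivity_ok = claimed <= sensitive
    # Specificity: all non-claimed dimensions stay at low mean demand.
    specificity_ok = all(stats(lv)[0] < mean_thr
                         for d, lv in profile.items() if d not in claimed)
    return sensitivity_ok, specificity_ok

# Toy profile: claims to measure QLq, but KNn demands leak in (low specificity).
profile = {"QLq": [0, 1, 2, 3, 4, 5], "QLl": [0, 0, 1, 1, 0, 1],
           "KNn": [2, 3, 2, 3, 2, 3]}
print(check_construct_validity(profile, {"QLq"}))  # → (True, False)
```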
To better illustrate the issues of construct validity, we systematize sensitivity and specificity thresholds through two criteria:
• The sensitivity criterion: if a new benchmark claims to measure X, we should expect to see a wide distribution of levels for the demands related to X in that benchmark; we characterize this by requiring mean ≥ 2 and standard deviation (s.d.) ≥ 1.0 in dimension X.
• The specificity criterion: moreover, we should expect to see low levels for all dimensions that are not related to what the benchmark claims to measure; we characterize this by requiring mean < 2.0 for all other ‘confounding’ dimensions.

Table 2 quantitatively shows a list of benchmarks and whether they meet these specificity and sensitivity criteria. In a few particular cases, there is some overlap between what a benchmark claims to measure and what capabilities it is sensitive to. However, this occurs for less than half of the capabilities that the benchmark claims to measure and does not happen for most benchmarks and dimensions (that is, aggregates have little sensitivity and specificity). For instance, benchmarks such as SAT are saturated for different reasons (low atypicality, that is, high contamination), whereas MedCalcBench actually measures whether the LLM has sufficient attention and scanning capability to process the given information, rather than purely measuring medical calculation capabilities. Further, in Supplementary Information Section 1.12, we reconcile common myths in LLM ‘reasoning’, observing the same issue of lacking either sensitivity or specificity for a batch of 20 ‘reasoning’ benchmarks.

Fig. 4 | Ability profiles of the 15 LLMs. An ability of l means that there is a 50% probability that the model succeeds on questions at demand level l (which is why some abilities go beyond 5). In contrast to the radial plots usually shown for LLMs in the literature47,48, the values shown here are actual abilities on a ratio scale (0, ∞) and the values (in expectation) are more robust to changes in the difficulty distribution of the benchmarks used. In Supplementary Information Section 1.4, we show clear scaling curves of model abilities as a function of the number of parameters.

Taking all of this into account, the specificity and sensitivity of common benchmarks are poor and variable. These results indicate that, by assigning one or more benchmarks to one ‘capability’ and aggregating their accuracy (as is the present standard practice), different demand levels and dimensions are averaged, leading to highly confounded results. If this is the baseline for common AI evaluation practice, it is simply insufficient to detect problems of specificity and sensitivity10,27.
This issue becomes even more pronounced when integrating numerous benchmarks, such as BIG-bench46 and other mega-benchmarks. Even if sensitivity may be increased by this integration (as we see for the whole of ADeLe; Extended Data Fig. 2), specificity is lost if aggregate scores are used. Instead, with our scales, we can compare mixed subsets of items from different benchmarks whose demand levels now become commensurate, create recombinations of instances to test specific capabilities and systematically select or discard benchmarks altogether based on their profile quality, before even using them.

Table 2 | Sensitivity and specificity analysis of a subset of 20 benchmarks in ADeLe
ChemLLMBench — claims to measure (their terminology): generation of descriptions for molecules; generation of new molecules; chemical name understanding; chemical reaction products prediction; identification of target molecules (our terminology: CEc•, CEe˚, KNf•, KNn˚). Sensitive to: AS˚ (mean 3.2, s.d. 1.0), CEc• (2.6, 1.2), KNf• (4.0, 1.2), MCr˚ (3.1, 1.1), QLq˚ (2.2, 1.5), SNs˚ (3.2, 1.3). GPT-4o accuracy: 27.2.
Civil Service Examination — claims: logical reasoning (QLl˚). Sensitive to: KNa˚ (2.1, 1.8), KNs˚ (2.0, 1.8). Accuracy: 73.5.
Data Analysis — claims: data analysis (CL˚, KNf˚, QLq˚). Sensitive to: KNc˚ (2.2, 1.0). Accuracy: 69.7.
LSAT — claims: analytical reasoning; logical reasoning; reading comprehension (CEc˚, CL˚, MC˚, QLl˚). Sensitive to: KNc˚ (2.9, 1.1), KNs˚ (2.1, 1.8). Accuracy: 81.6.
MMLU-Pro — claims: knowledge; reasoning (KNa•, KNc˚, KNf˚, KNn˚, KNs˚, QL˚). Sensitive to: KNa• (2.9, 1.6). Accuracy: 87.6.
Math — claims: mathematics (CL˚, KNf˚, QLl•, QLq˚). Sensitive to: AT˚ (2.4, 1.1), MCu˚ (2.4, 1.0), QLl• (2.7, 1.2). Accuracy: 59.0.
MedCalcBench — claims: medical calculation knowledge; patient attributes extraction; final results arithmetic (KNf˚, KNn˚, QLq˚). Sensitive to: AS˚ (2.0, 1.2). Accuracy: 88.0.
MenatQA — claims: event temporal reasoning (QL˚). Sensitive to: KNc˚ (2.8, 1.0). Accuracy: 72.8.
OmniMath — claims: mathematical reasoning at Olympiad level (CL•, KNf˚, QLl•, QLq˚). Sensitive to: AS˚ (2.3, 1.3), CL• (3.1, 1.2), MCr˚ (2.6, 1.1), QLl• (3.4, 1.0). Accuracy: 34.4.
Reasoning — claims: spatial reasoning; logical reasoning (QLl˚, SN˚). Sensitive to: CL˚ (2.6, 1.1), MCr˚ (2.8, 1.1). Accuracy: 48.2.
SAT — claims: critical thinking; problem-solving; analytical skills (MC˚). Sensitive to: AT˚ (2.2, 1.1). Accuracy: 98.3.
SciBench — claims: scientific problem-solving (KNa•, KNf˚, KNn•, KNs˚, MC˚). Sensitive to: KNa• (3.0, 1.6), KNn• (2.8, 1.7). Accuracy: 83.7.
TempReason — claims: event temporal reasoning (QL˚). Sensitive to: KNc˚ (2.8, 1.1). Accuracy: 71.2.
TimeQA — claims: event temporal reasoning (QL˚). Sensitive to: KNc˚ (2.4, 1.1). Accuracy: 89.0.
Benchmarks are chosen such that at least one dimension satisfies our two sensitivity criteria (s.d. ≥ 1 and mean ≥ 2). For each benchmark, we list what it claims to measure in its own terminology (sources are described in Supplementary Table 28) and in ours, followed by the dimensions it is actually measuring or sensitive to, with the mean and s.d. of the levels for each such dimension, and the average accuracy of GPT-4o per benchmark as a reference. A superscript ˚ marks a missed dimension (lack of sensitivity, in the claimed dimensions) or an extra dimension (lack of specificity, in the sensitive dimensions); a superscript • marks dimensions that are both claimed and actually measured (high sensitivity). The other six benchmarks from the ADeLe battery that do not meet the two aforementioned sensitivity criteria are Date Arithmetic, GRE & GMAT, Language, MCTACO, TimeDial and TruthQuest (with GPT-4o accuracies of 98.9, 95.6, 72.4, 95.1, 98.8 and 43.0, respectively). Takeaway: no benchmark has both high sensitivity and high specificity, indicating a clear lack of construct validity.

Explanatory power through LLM ability profiles
Another research question about explanatory power moves the focus to the AI systems: can we understand the capabilities of models and their evolution in non-saturating plots? To answer this question, we selected 15 LLMs (Extended Data Table 1) and ran them on the ADeLe battery.
As will be explained in more detail in Methods section 'Subject characteristic curves', we use a dominant slice procedure: for each demand level l along a dimension, we aggregate the results of only those task instances for which the demands in all remaining dimensions do not exceed l. We apply a logistic fit to these points, yielding 18 per-dimension characteristic curves that capture how model success rates decline with increasing demand (Fig. 3). For example, the curves of certain dimensions, such as AS (Attention and Scan) and MCu (Calibrating Knowns and Unknowns), are steep, with low variability across models. They explain success very well for instances in the low range (success for demands between 1 and 2) and the high range (failure for demand 5 or higher). By contrast, the curves of other dimensions, such as KNs (Knowledge of Social Sciences), are flatter; for these, the discrimination (between success and failure) is the lowest. Notably, several dimensions show particularly distinct behaviours. The characteristic curves for MCr (Identifying Relevant Information) and MS (Mind Modelling and Social Cognition) clearly distinguish the performance of reasoning models (whether distilled or not) from non-reasoning ones. All subject characteristic curves, in independent plots, can be found in Supplementary Information Section 1.14. We use the area under the subject characteristic curve to estimate ability, as explained in Methods section 'Subject characteristic curves'. Note that an ability of 4 does not mean that the model can solve all or most of the items at level 4; it actually means that it can solve half of those at exactly level 4 in expectation. Figure 4 shows the ability profiles of the 15 LLMs, arranged into families. It is now more evident that the dimensions related to knowledge are high for larger models and reduced for small and distilled models.
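The dominant slice procedure, the logistic fit and the area-based ability estimate can be sketched as follows. This is an illustrative reimplementation under our own assumptions (synthetic data, a plain least-squares logistic fit by gradient descent and a numerical integral for the area), not the authors' code:

```python
# Sketch of a subject characteristic curve for one dimension. Each instance
# carries a demand vector over dimensions and a 0/1 success result.
import math

def dominant_slice_rates(instances, dim, max_level=5):
    """Success rate per demand level l of `dim`, keeping only instances whose
    demands on every other dimension do not exceed l (the dominant slice)."""
    rates = {}
    for l in range(max_level + 1):
        results = [r for demands, r in instances
                   if demands[dim] == l
                   and all(v <= l for d, v in demands.items() if d != dim)]
        if results:
            rates[l] = sum(results) / len(results)
    return rates

def fit_logistic(rates, steps=20000, lr=0.05):
    """Least-squares fit of p(success | level x) = 1 / (1 + exp(b * (x - a)));
    `a` is the level at which the success probability crosses 50%."""
    a, b = 2.0, 1.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in rates.items():
            p = 1 / (1 + math.exp(b * (x - a)))
            ga += 2 * (p - y) * p * (1 - p) * b         # d(loss)/da
            gb += -2 * (p - y) * p * (1 - p) * (x - a)  # d(loss)/db
        a -= lr * ga
        b -= lr * gb
    return a, b

def ability(a, b, hi=8.0, step=0.01):
    """Area under the fitted characteristic curve, used as the ability estimate."""
    return sum(1 / (1 + math.exp(b * (i * step - a)))
               for i in range(int(hi / step))) * step

# Toy subject that succeeds below demand level 3 on dimension QLl.
data = [({"QLl": l, "AS": min(l, 2)}, 1 if l < 3 else 0)
        for l in range(6) for _ in range(20)]
a, b = fit_logistic(dominant_slice_rates(data, "QLl"))
print(round(ability(a, b), 2))  # ability close to 2.5 for this toy subject
```

The area under the curve behaves like the 50% crossing point when the curve is steep, which matches the reading that an ability of l means a 50% chance of success at demand level l.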
The reasoning models (such as OpenAI's o1 and DeepSeek-R1-Distilled) show clear improvements on the two kinds of QL (Quantitative and Logical Reasoning) but also on MCr (Identifying Relevant Information) and MS (Mind Modelling and Social Cognition), even down to 7B in the distilled models. Finally, the increase of model abilities with the number of parameters seems to be marginal for the two largest LLMs in the LLaMA and DeepSeek-R1-Distilled-Qwen families; this is further confirmed in Supplementary Information Section 1.4, in which we introduce the very first scaling laws of the actual abilities of LLMs. The use of open ratio scales commensurate with the demand levels stands in opposition to traditional scaling laws using performance, which easily saturate close to 100% accuracy and fluctuate heavily depending on the demand-level distributions of the selected benchmarks. Aggregation, even if sliced by benchmarks, domains or some tags46–48, leads to values in each dimension that are not commensurate, hard to explain and volatile with respect to the distribution of difficulties. For instance, 70% aggregate accuracy on all logical reasoning benchmarks does not mean more capability than 50% aggregate accuracy on all metacognition benchmarks, nor even more capability than 50% aggregate accuracy on another set of logical reasoning benchmarks. Reflecting what we saw in Fig. 1, a moderate increase in demands can be associated with a large drop in performance (as seen in the two versions of OlymMATH). Making differences commensurate is one of the advantages of having demands and capabilities on the same scale.
By looking at standardized scales on several dimensions, we can explain many conflicting claims made in the literature, ranging from LLMs being considered capable of 'complex reasoning'49 in 2022 to claims of LLMs 'not capable of the non-trivial reasoning'50 three years later, which seems inconsistent with the substantial progress in chain-of-thought and reasoning models in the past few years. These contradictory statements about reasoning are explored and clarified with our scales in Supplementary Information Section 1.12. In general, through our approach, we can investigate the capabilities of models and their evolution in a comprehensive and granular way, with characteristic curves explaining why each model succeeds or fails in different regions, depending on the demand profile of the instance. This explanation originates from the information collected from the AI system under observation only: unlike IRT and other latent variable approaches (factor analysis or principal component analysis), which are derived from the results of many systems and instances, the abilities and explanations we obtain for one LLM with our methodology are not affected by the results or the choice of the other 14 LLMs.

Predictive power through assessors anticipating performance
The last research question is: can we predict AI performance on unseen instances, both in distribution and out of distribution? As shown in the bottom row of Extended Data Fig. 1, most dimensions are negatively correlated with success, suggesting that, in aggregate, higher demands tend to reduce performance. This is promising for their predictive power when used in a multivariate way.
To quantify this predictive power precisely, we trained three types of instance-level probabilistic classifiers, known as assessors: a random forest (RF) that maps the 19-dimensional demand annotation vector directly into a predicted probability of success, another RF model that relies on precomputed GloVe embeddings extracted from the raw text of each question and a fine-tuned LLaMA model trained end-to-end on the question text to predict success. Further details are provided in Methods section 'Assessors and metrics'. In-distribution results (Extended Data Table 2) show that, despite the large imbalance in some of the subject models' own accuracies (from 0.102 for Babbage-002 to 0.843 for OpenAI o1), the demand-based RF achieves high discrimination (between success and failure), as measured by the area under the receiver operating characteristic curve (AUROC), and near-perfect calibration, as quantified by the expected calibration error (ECE). In terms of discrimination, the best result is achieved for GPT-4o (0.882 in AUROC), which is the most predictable LLM for all three assessors, whereas small models are less predictable. Averaged across all 15 LLMs, the demand-based RF produces an accuracy-weighted average AUROC of approximately 0.84, on par with the performance of the fine-tuned LLaMA assessor, whereas its average ECE (0.01) is much lower than that of the other approaches (0.03 for the GloVe-based model and 0.04 for the fine-tuned LLaMA model). Calibration plots demonstrating these results are provided in Supplementary Information Section 1.5. This strong in-distribution performance supports the internal validity of the methodology. For the analysis of external validity, we further evaluated predictive performance under out-of-distribution conditions, by withholding entire tasks (task out of distribution) or entire benchmarks (benchmark out of distribution) from training (Extended Data Tables 3 and 4, respectively).
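As a rough sketch (on synthetic data, not the ADeLe annotations), a demand-based assessor of this kind can be built with a standard random forest; the success rule, data sizes and the ECE binning choice below are our own illustrative assumptions:

```python
# Demand-based assessor sketch: a random forest maps an instance's
# 19-dimensional demand vector to a probability of success, scored with
# AUROC (discrimination) and expected calibration error (calibration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(2000, 19))  # one demand annotation vector per instance
y = X[:, 0] + X[:, 1] < 5.0             # toy subject: fails when two demands add up high

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
assessor = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
probs = assessor.predict_proba(X_te)[:, 1]

def ece(y_true, p, bins=10):
    """Expected calibration error: weighted mean |accuracy - confidence| over bins."""
    idx = np.minimum((p * bins).astype(int), bins - 1)
    return sum(np.mean(idx == b) * abs(y_true[idx == b].mean() - p[idx == b].mean())
               for b in range(bins) if np.any(idx == b))

print(f"AUROC = {roc_auc_score(y_te, probs):.3f}, ECE = {ece(y_te, probs):.3f}")
```

On this easy synthetic rule the assessor discriminates well; the real setting replaces the toy rule with the observed successes and failures of each LLM on the annotated battery.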
In the task out of distribution set-up, the predictive power of the demand-based assessor remains robust (weighted AUROC = 0.81, ECE = 0.02), only slightly lower than in distribution, and outperforms the rest of the assessors, whose performance considerably decreases (weighted AUROC values of 0.79 for the LLaMA-based and 0.74 for the GloVe-based assessors). In the more challenging benchmark out of distribution set-up, the performance of the demand-based assessor decreases slightly more (weighted AUROC = 0.75 and ECE = 0.04). By contrast, the predictive power of the other two assessors suffers a much greater decrease. This suggests that the demand-based predictor is less prone to overfitting to spurious features than its counterparts. In Supplementary Information Section 1.9, we also demonstrate that the predictive power of our demand-based assessor is superior to that of assessors built on two alternative taxonomies, based on domains and on learning levels. Although many traditional IRT methods can explain performance on seen items, they cannot be used to predict performance for new instances (except linear logistic test models; see Supplementary Information Section 1.1). IRT requires the item in question to be included in the pool of items that are used to extract the parameters, and multidimensional IRT extracts the difficulty dimensions that matter for that pool only. In our case, any new instance, coming from any new benchmark or batch of examples, can be annotated automatically to obtain its vector of demands, which is independent of population and from which we can predict performance. Predictive power is paramount in a deployment setting in which the goal is to anticipate whether AI will perform well in unseen scenarios, rather than merely grading subjects in a testing environment. The natural baseline in AI evaluation practice is just average accuracy.
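The task out of distribution protocol amounts to a grouped, leave-one-task-out split: all instances of the held-out task are excluded from training and used only for evaluation. A minimal sketch on synthetic data (the task names, success rule and model settings are illustrative assumptions, not the paper's set-up):

```python
# Leave-one-task-out evaluation of a demand-based assessor: every instance
# of the held-out task is withheld from training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
tasks = np.repeat(["math", "logic", "social", "medical"], 500)  # task label per instance
X = rng.uniform(0, 5, size=(2000, 19))  # demand annotation vectors
y = X[:, 0] + X[:, 1] < 5.0             # same toy success rule for all tasks

aurocs = []
for held_out in np.unique(tasks):
    train = tasks != held_out           # drop the whole held-out task from training
    assessor = RandomForestClassifier(n_estimators=100, random_state=0)
    assessor.fit(X[train], y[train])
    probs = assessor.predict_proba(X[~train])[:, 1]
    aurocs.append(roc_auc_score(y[~train], probs))
print(f"mean task-OOD AUROC = {np.mean(aurocs):.3f}")
```

In the real setting, the held-out tasks differ in their demand distributions, which is exactly what stresses text-based assessors more than demand-based ones.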
This can extrapolate to an extent for system selection but, in the case of instance prediction, it leads to no discriminative power at all (an AUROC of 0.5) and calibration that is only good in distribution. Uncertainty estimation from the LLM itself, on the other hand, requires running the model or, in many cases, white-box or grey-box access, and its results are no better than those of external assessors20 such as the ones used here. Overall, the supremacy in predictive power observed for the demand-based assessors is clear. They are based on interpretable demands, in comparison with the two much larger and uninterpretable baselines. This is strongly encouraging, shedding light on a promising future for the reliable deployment of AI.

Discussion
So far, AI evaluation is not meeting the needs of a fast-changing and increasingly diverse AI ecosystem. Understanding and anticipating performance has become an urgent requirement for many general-purpose AI systems. By building and exploiting absolute demand scales for annotating thousands of instances by means of automated rubrics, we have set a promising new direction for AI evaluation. The methodology we have presented and illustrated is comprehensive, scalable and standardizing, addressing many of the issues of conventional AI evaluation practice: a lack of explanatory and predictive power, as well as saturation and overfitting to specific populations of benchmarks and AI systems, respectively. With the pace and penetration of general-purpose AI, a rigorous, scalable and pipelined evaluation has been urgently demanded by researchers, companies, third-party evaluators, policymakers and regulators. It is paradoxical that powerful LLMs as annotators have made this new methodology possible and scalable. The explanatory value of LLM annotations has been independently validated by humans through inter-rater analysis and the Delphi method, and their predictive power holds across diverse tasks.
Nonetheless, our work is not without limitations. First, the DeLeAn v1.0 rubrics do not fully cover certain dimensions, such as navigation, and exclude capabilities in other modalities and paradigms of AI, such as multimodal systems and robotics, given that we limited our analysis to LLMs. We encourage other researchers to extend the set of rubrics to further dimensions (including propensities, values and other elements specifically conceived for safety or fairness) and to evaluate other kinds of AI systems with them. Second, there are very few high-quality level 5+ items in our present battery. Given the pace of progress in AI, the present scales (up to 5+) will need to be extended in a way that remains backward compatible with existing scales. Third, we could increase the predictive power in and, most importantly, out of distribution, especially if we introduce more benchmarks with 'purer' items loaded on only a few demands, and as LLMs improve as annotators. Fourth, we used LLM judges for grading model outputs, with excellent results compared with human graders; however, some more open-ended or agential tasks in the future may require more advanced automated grading. Overall, the new methodology showcases the successful development of the construct-oriented paradigm in AI evaluation8, integrating perspectives from different disciplines. A streamlined collaborative platform (https://kinds-of-intelligence-cfi.github.io/ADELE/), and an associated catalogue of rubrics, will grow in the years to come, ready to explain and predict the performance and safety of AI systems. At a moment when AI evaluation is at the crux of research and regulations, and the science of evaluation has not yet digested the pace of general-purpose AI, our work takes crucial steps to make AI evaluation fit for purpose.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-026-10303-2.
1. Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
2. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
3. Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPTs are GPTs: labor market impact potential of LLMs. Science 384, 1306–1308 (2024).
4. Rahwan, I. et al. Machine behaviour. Nature 568, 477–486 (2019).
5. Shiffrin, R. & Mitchell, M. Probing the psychology of AI models. Proc. Natl Acad. Sci. USA 120, e2300963120 (2023).
6. Zhou, L. et al. Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024).
7. Zhou, L. et al. Predictable artificial intelligence. Artif. Intell. 353, 104491 (2026).
8. Burden, J., Tešić, M., Pacchiardi, L. & Hernández-Orallo, J. Paradigms of AI evaluation: mapping goals, methodologies and culture. In Proc. Thirty-Fourth International Joint Conference on Artificial Intelligence 10381–10390 (IJCAI, 2025).
9. Burnell, R. et al. Rethink reporting of evaluation results in AI. Science 380, 136–138 (2023).
10. Eriksson, M. et al. Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation. In Proc. Eighth AAAI/ACM Conference on AI, Ethics, and Society 850–864 (AAAI Press, 2025).
11. Mitchell, M. The metaphors of artificial intelligence. Science 386, eadt6140 (2024).
12. Yang, Z. et al. Can large language models always solve easy problems if they can solve harder ones? In Proc.
2024 Conference on Empirical Methods in Natural Language Processing 1531–1555 (Association for Computational Linguistics, 2024).
13. Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
14. Mathematical Association of America. American Invitational Mathematics Examination (AIME). https://maa.org/maa-invitational-competitions/ (2024).
15. Wang, X. et al. Evaluating general-purpose AI with psychometrics. Preprint at https://arxiv.org/abs/2310.16379 (2023).
16. Burnell, R., Hao, H., Conway, A. R. & Orallo, J. H. Revealing the structure of language model capabilities. Preprint at https://arxiv.org/abs/2306.10062 (2023).
17. Ilić, D. & Gignac, G. E. Evidence of interrelated cognitive-like capabilities in large language models: indications of artificial general intelligence or achievement? Intelligence 106, 101858 (2024).
18. Martínez-Plumed, F., Prudêncio, R. B., Martínez-Usó, A. & Hernández-Orallo, J. Item response theory in AI: analysing machine learning classifiers at the instance level. Artif. Intell. 271, 18–42 (2019).
19. Hernández-Orallo, J., Schellaert, W. & Martínez-Plumed, F. Training on the test set: mapping the system-problem space in AI. In Proc. 36th AAAI Conference on Artificial Intelligence 12256–12261 (AAAI Press, 2022).
20. Schellaert, W. The Evaluation of Artificial Intelligence as a Prediction Problem. PhD thesis, Universitat Politecnica de Valencia (2025).
21. Schellaert, W., Martínez-Plumed, F. & Hernández-Orallo, J. Analysing the predictability of language model performance. ACM Trans. Intell. Syst. Technol. 16, 1–26 (2025).
22. Pacchiardi, L., Cheke, L. G. & Hernández-Orallo, J. 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances. Preprint at https://arxiv.org/abs/2409.03563 (2024).
23. Burden, J. et al. Inferring capabilities from task performance with Bayesian triangulation.
Preprint at https://arxiv.org/abs/2309.11975 (2023).
24. Schlangen, D. Targeting the benchmark: on methodology in current natural language processing research. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) 670–674 (Association for Computational Linguistics, 2021).
25. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. & Choi, Y. HellaSwag: can a machine really finish your sentence? In Proc. 57th Annual Meeting of the Association for Computational Linguistics 4791–4800 (Association for Computational Linguistics, 2019).
26. Hernandez-Orallo, J. AI evaluation: on broken yardsticks and measurement scales. In Workshop on Evaluating Evaluation of AI Systems at AAAI (AAAI Press, 2020).
27. Hardy, A. et al. More than marketing? On the information value of AI benchmarks for practitioners. In Proc. 30th International Conference on Intelligent User Interfaces 1032–1047 (ACM, 2025).
28. Burden, J. Evaluating AI evaluation: perils and prospects. Preprint at https://arxiv.org/abs/2407.09221 (2024).
29. Stahl, B. C. et al. A systematic review of artificial intelligence impact assessments. Artif. Intell. Rev. 56, 12799–12831 (2023).
30. Lee, M. et al. Evaluating human-language model interaction. Transact. Mach. Learn. Res. https://openreview.net/forum?id=hjDYJUn9l1 (2023).
31. Collins, K. M. et al. Evaluating language models for mathematics through interactions. Proc. Natl Acad. Sci. USA 121, e2318124121 (2024).
32. Cohn, A. G.
& Hernandez-Orallo, J. Dialectical language model evaluation: an initial appraisal of the commonsense spatial reasoning abilities of LLMs. Preprint at https://arxiv.org/abs/2304.11164 (2023).
33. Roberts, M., Thakur, H., Herlihy, C., White, C. & Dooley, S. Data contamination through the lens of time. Preprint at https://arxiv.org/abs/2310.10628 (2023).
34. Levy, M., Jacoby, A. & Goldberg, Y. Same task, more tokens: the impact of input length on the reasoning performance of large language models. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics 15339–15353 (Association for Computational Linguistics, 2024).
35. Wang, Y. et al. MMLU-Pro: a more robust and challenging multi-task language understanding benchmark. Adv. Neural Inf. Process. Syst. 37, 95266–95290 (2025).
36. Miller, J. K. & Tang, W. Evaluating LLM metrics through real-world capabilities. Preprint at https://arxiv.org/abs/2505.08253 (2025).
37. Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H. & Krathwohl, D. R. Taxonomy of Educational Objectives Vol. 2 (Longmans, 1964).
38. Ong, I. et al. RouteLLM: learning to route LLMs from preference data. In Proc. Thirteenth International Conference on Learning Representations (ICLR, 2025).
39. Zhou, L., Martínez-Plumed, F., Hernández-Orallo, J., Ferri, C. & Schellaert, W. Reject before you run: small assessors anticipate big language models. In Proc. Workshop on AI Evaluation Beyond Metrics (CEUR Workshop Proceedings, 2022).
40. Pacchiardi, L. et al. PredictaBoard: benchmarking LLM score predictability. In Findings of the Association for Computational Linguistics: ACL 2025 (eds Che, W. et al.) 15245–15266 (Association for Computational Linguistics, 2025).
41. Clayton, H. H. Thermometric scales for meteorological use. Nature 60, 491 (1899).
42. Davies, C. N. Measurement of particles. Nature 195, 768–770 (1962).
43. Alessandretti, L., Aslak, U. & Lehmann, S. The scales of human mobility. Nature 587, 402–407 (2020).
44. Jin, Z.
et al. CLadder: assessing causal reasoning in language models. Adv. Neural Inf. Process. Syst. 36, 31038–31065 (2023).
45. Saparov, A. & He, H. Language models are greedy reasoners: a systematic formal analysis of chain-of-thought. In Proc. Eleventh International Conference on Learning Representations (ICLR, 2023).
46. Srivastava, A. et al. Beyond the Imitation Game: quantifying and extrapolating the capabilities of language models. Transact. Mach. Learn. Res. https://openreview.net/forum?id=uyTL5Bvosj (2023).
47. Balachandran, V. et al. Eureka: evaluating and understanding large foundation models. Preprint at https://arxiv.org/abs/2409.10566 (2024).
48. Fountas, Z. et al. Human-inspired episodic memory for infinite context LLMs. In Proc. Thirteenth International Conference on Learning Representations (ICLR, 2025).
49. Zhou, D. et al. Least-to-most prompting enables complex reasoning in large language models. In Proc. Eleventh International Conference on Learning Representations (ICLR, 2023).
50. Fang, M., Wan, X., Lu, F., Xing, F. & Zou, K. Mathodyssey: benchmarking mathematical problem-solving skills in large language models using odyssey math data. Sci. Data 12, 1392 (2025).
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2026

Methods
General scales
For more than a century, psychology has introduced many constructs with explanatory and predictive power about human behaviour, from conscientiousness to metacognition. On the basis of experimental data and theories of human cognition, these constructs are usually organized into hierarchical taxonomies, such as the Cattell–Horn–Carroll structure of human cognitive abilities51 or the Big Five personality traits52. In principle, we could build a similar taxonomy for artificial cognition, based on theory and experiments about machine behaviour4. However, as the base population of machines is much more arbitrary and changing than that of humans, it makes more sense to devise a taxonomy that could encompass any kind of natural and artificial intelligence, by considering capabilities that are meaningful for more general theories of cognition53. Under this paradigm and by integrating and generalizing taxonomies from human psychology, comparative cognition and AI53, a general taxonomy of 14 capabilities was designed54 and later extended with 14 corresponding rubrics by Tolan et al.55 for the study of AI and human capabilities in the workplace.
These rubrics assigned the presence or absence of the need for each capability in generic tasks extracted from worker surveys, occupational databases and AI benchmarks. This taxonomy serves as a basis to construct a catalogue of capabilities following four criteria: (1) the capabilities are general rather than specific, enabling the characterization of a wide range of tasks usually present in human activities; (2) the capabilities represent concepts that are understandable to humans (and LLMs), enabling their levels to be expressed through rubrics in plain natural language; (3) there is no a priori assumption of correlation or orthogonality in these capabilities as observed in humans or LLMs, to accommodate various present and future AI paradigms (rather than overfitting to a specific state of the art of AI); and (4) two capabilities are considered distinct as long as many tasks could conceivably require a high level of one but not the other. Following criteria (1) and (2), we use capabilities that are familiar in human and non-human cognition and AI practice (see Supplementary Information Section 1.1 for a coverage of taxonomies in humans and AI). Despite these inspirations, we follow (3) to ensure that the catalogue does not replicate human intelligence hierarchies or taxonomies derived from populational methods. But we do not look for a middle ground either: we do not assume that humans and AI systems share a common capability structure. Finally, ensuring (4), we consider two dimensions to be different (for example, metacognition and logical reasoning) if it is possible to conceive tasks that require one but not the other, independently of whether they are correlated in human or AI populations. Indeed, we include dimensions that may not be the most discriminative ones for the population of benchmarks or LLMs used in this paper but can be useful to detect emergent properties in the future.
This population independence is especially critical in the present era, in which benchmarks and models are replaced every few months: for instance, for models without chain-of-thought, dominant until 2024, the set of reasoning capabilities we use may not have been very discriminative; however, with the advent of models with reinforcement learning and integrated chain-of-thought in 2025, reasoning capabilities become more informative. If our catalogue had not included them, we would have been unable to detect this shift, and the same applies to capabilities that may not be discriminative now but can be conceived of as different from others and may be informative in the future. As we may nevertheless miss some capabilities that will become relevant, the catalogue is expected to expand to include new dimensions in the future, provided they are understandable to humans. As mentioned, our work builds on that of Tolan et al.55. First, we extend the taxonomy by including both knowledge and extraneous dimensions. Second, we develop new scales and rubrics in a quantitative range between 0 and 5+, with 0 representing absence of demand, values 1–4 representing increasing demand levels of the capability and 5+ representing 5 or above. For instance, the famous Sally–Anne false-belief task assesses understanding of an individual's false belief about the properties of an object if those properties change while they are not looking (Sally will look for her marble in the basket where she left it, even though Anne moved it to the box when Sally was away). This may be level 4 for dimension MS (Mind Modelling and Social Cognition) but level 0 for dimension QLq (Quantitative Reasoning). Similarly, the question "if all A are B, some B are C, no C are D, and all D are E, what can be inferred about the relationship between A and E?" may be level 4 for QLl (Logical Reasoning) but level 0 for MS (Mind Modelling and Social Cognition).
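These per-dimension annotations amount to a sparse demand vector per task instance. A minimal sketch of this data structure, using the dimension codes from the text (the helper function and its zero defaults are our own illustrative choices, not part of the DeLeAn rubrics):

```python
# A demand profile: one level (0-5, with 5 standing in for "5+") per graded
# rubric dimension; unannotated dimensions default to 0 (no demand).
DIMENSIONS = ["AS", "CEc", "CEe", "CL", "MCr", "MCt", "MCu", "MS",
              "QLl", "QLq", "SNs", "KNa", "KNc", "KNf", "KNn", "KNs",
              "AT", "VO"]

def profile(**levels):
    """Expand a sparse annotation into a full 18-dimensional demand vector."""
    assert set(levels) <= set(DIMENSIONS), "unknown dimension code"
    return {d: levels.get(d, 0) for d in DIMENSIONS}

sally_anne = profile(MS=4)  # false-belief task: high mind modelling, no maths
syllogism = profile(QLl=4)  # chained syllogism: high logical reasoning
print(sally_anne["MS"], sally_anne["QLq"], syllogism["QLl"])  # -> 4 0 4
```

The two example profiles mirror the Sally–Anne and syllogism annotations discussed above: high on exactly one dimension and zero elsewhere.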
Extended Data Table 5 shows the set of dimensions we have included in the first version of the DeLeAn rubric set (DeLeAn v1.0). We adapt seven broad capabilities from Tolan et al.55, applicable to LLMs (for example, ‘auditory processing’ was discarded), and refine a subset of them hierarchically with subdimensions, making them a group of 11 ‘proper’ cognitive capabilities that we call ‘elemental’; by ‘elemental’, we mean that these capabilities are not derived from others, as opposed to the knowledge dimensions, which are more acquisitive. These ‘elemental’ subdimensions were included after several rounds of discussions about whether some of the original seven broad subdimensions could be carved into finer, but still general, subdimensions that are conceptually distinct. Beyond the capabilities, we also include new dimensions accounting for domain ‘knowledge’, separated into five subdimensions (KNn, KNs, KNa, KNf, KNc) covering large branches of human knowledge, and three ‘extraneous’ ones, AT (Atypicality), VO (Volume) and UG (Unguessability), to account for elements that make the task more challenging independently of elemental capabilities or knowledge demands. In particular, Atypicality deals with contamination56,57 and other familiarization effects leading to capability overestimation because similar data were seen during training. An AI system may simply succeed because it has memorized the instance. This dimension can be used to explain and predict performance, by identifying AT as a confounder with the other demands. The second extraneous dimension, Volume, represents the use of ‘collages’ to make instances more difficult. For instance, if we put ten simple additions in an exercise and we score whether all of them are correct, then we have increased the difficulty greatly, but the quantitative reasoning demand is the same.
We call this phenomenon amalgamation and it is a recurrent trick to make instances more difficult, either in benchmarks of increasing hardness46,58,59 or in adversarial testing60. There is a correlation between the size of the questions (and the answers) and the difficulty that can be achieved with them46 (Figs. 3 and 4). In the end, amalgamation produces an underestimation of the capabilities, because the subjects fail at tasks that incorporate many simple things. The chances of error accumulate, even if the cognitive load is not necessarily increased61,62. Finally, Unguessability captures the very common funnelling effect of making a question more amenable to scoring but, at the same time, reducing its difficulty. The obvious case is the use of multiple-choice questions, which have become predominant in most AI benchmarks, despite their issues63. Reducing or increasing the number of options has been a common practice to change the ‘difficulty’ of a task without modifying its cognitive demands35. In general, these three extraneous dimensions will account for an important proportion of the predictability in LLM success and including them helps clarify these confounding effects. Although we have 19 dimensions in total, only the first 18 correspond to proper capability demands (11 elemental, five knowledge and two extraneous) that may be met by the subject or not, with Unguessability being a special extraneous dimension reflecting the funnelling in the item design (for example, multiple-choice questions). Because of that, it is the only dimension expressed between 0 (the correct answer is trivially determined by the question) and 100 (unguessable, that is, a good open-ended question). Each of the other 18 demand rubrics includes a general description of the construct to be annotated, followed by a description of each of the levels, from 0 to 5+, with three ‘anchor’ instances each.
By following Supplementary Information Section 2, we can better understand the trade-offs in the construction of the rubrics. It is important to highlight that the catalogue is not definitive and is meant to be extended in the future using the same criteria of dimensions being general and conceptually distinct. We use the term ‘catalogue’ instead of ‘taxonomy’ to better emphasize its non-definitive nature. This is also why we call the rubrics and battery DeLeAn v1.0 and ADeLe v1.0, respectively, with the vision of incorporating new capabilities and propensities in the future. This will also include considering safety, fairness and values64,65 and not only performance (correctness) as the variable to predict.

Ratio scales
We deliberately design the demand scales as ‘ratio’ scales66, with an absolute zero (no demand) and differences that are comparable across the scale. In the social sciences, a common interest lies in understanding differences, as no human has zero capabilities, and an ‘interval’ scale makes sense, with negative capabilities (as in IRT) or percentiles of a normal distribution (as in IQ scores). We argue that for AI, we should aim for the top level in Stevens’s typology of measurement67: the ratio scales. Ratio scales have all of the properties of the previous scales: intervals and differences are meaningful but so too are ratios. Given the flexibility with which we can regulate compute and time use in AI, it makes more sense to set an absolute zero (no compute) on the demands and build the scales in such a way that ratios are meaningful. We wish to say that instance xi at level 6 doubles the demand of an instance xj at level 3. Taking into account that we fit logistic functions, this can be understood in terms of the log odds of being correct halving when moving 2x in the scale and doubling when moving x/2 in the scale68. For this first version of the scales, we decided to choose levels (0, 5) of the full range (0, ∞) for practical reasons.
With a single rubric, it is hard for humans and LLMs to refine beyond five ordinal values—this is why Likert scales are so popular. Note that the rubrics only show cases in an ordinal scale between 0 and 5 and the annotations are discrete, never generating non-integer values. This is convenient for avoiding the need for binning for the curves and the demand histograms, but the values become fully continuous when estimating the abilities. In any case, it is usual to consider originally ordinal scales as interval or ratio scales when the number of levels is 5 or more69. Indeed, the magnitudes between 0 and 5 should not be interpreted as a mere rank. The way the scale increases depends on what the demand represents, but the pace of increase, the actual scale, is chosen in such a way that all scales are commensurate. For instance, for knowledge dimensions (applied sciences, customary everyday knowledge, formal sciences, natural sciences and social sciences and humanities) we thought of levels corresponding roughly to elementary, middle, high, undergraduate and graduate education. By looking at statistical data on educational attainment rates (for example, Organisation for Economic Co-operation and Development (OECD) data70) and the specialization of domains as the educational level increases, we noticed that the questions of level l were usually sufficiently advanced to have roughly one person in 10^(l−1) solving it correctly. Then we extend this criterion as a rule of thumb for all scales, although future work could perform a proper calibration and check that the base of each dimension corresponds with the correct proportions. By using the same base, we achieve ratio scale consistency and commensurate scales across dimensions. In general, an item is at level l if l is the highest number such that, in at least 95% of samples of n = 10^l individuals, there is at least one correct response.
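This level rule can be made concrete in a short sketch (ours, not the paper's code). It assumes each individual independently solves the item with probability p; note the paper says "highest number", but under this independence assumption the condition, once met, holds for all larger samples, so the defining level is the smallest qualifying exponent:

```python
def attainment_level(p, max_level=5, threshold=0.95):
    """Smallest l such that a sample of 10**l people contains at least one
    correct response with probability >= threshold, assuming each person
    independently solves the item with probability p."""
    for l in range(max_level + 1):
        n = 10 ** l
        if 1 - (1 - p) ** n >= threshold:
            return l
    return max_level  # '5+' in the rubric notation

# One person in 10**(l-1) solves a level-l item: p = 1e-3 gives level 4.
```

This reproduces the one-in-10^(l−1) rule of thumb: for p = 10^(−3), samples of 10^4 people contain a solver with probability about 1 − e^(−10) ≈ 0.99995 ≥ 0.95, whereas samples of 10^3 only reach about 0.63.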
The levels we have defined are 0 (None), 1 (Very low), 2 (Low), 3 (Intermediate), 4 (High) and 5+ (Very high), with n going from 1 to 100,000. We could have calibrated some dimensions using procedurally generated examples. For instance, in reasoning, we could have increased the components of reasoning processes71 to see whether the levels increase accordingly, but each of these ‘scales’ would have been incommensurate with each other and not sufficiently general. The 18 rubrics were crafted following the above criteria, using several iterations while testing with human and AI annotators. The final rubrics can be found in Supplementary Information Section 2. Once the rubrics were settled, we conducted the experiments, annotating tens of thousands of instances using an LLM, scalably and rapidly. Five annotation examples are illustrated in Extended Data Fig. 3.

Dissecting the demand-ability space
Annotating instances using these general scales allows us to compare what makes them easy or hard and provides the same lens of analysis independently of where the instance comes from: human test, AI benchmark or new item design. We can discard or combine instances to build a specific test profile. Although this is not new in psychology or AI72, the scales can be applied to any task, test or collection of benchmarks; DeLeAn v1.0 is instantiated to consider only textual modality for now and to be extended in the future. By using the same scales in a standardized way, the comparison of the vast space of tests and benchmarks becomes possible for the first time. For instance, in this paper, we applied DeLeAn to 16,108 instances from 63 tasks from 20 benchmarks, curated from the 2024 proceedings of six AI conferences and other venues, while ensuring both data quality and diversity (details in the section ‘Benchmark battery: instance selection and curation’).
This is unprecedented, as all of these tasks are now represented within the same 19-dimensional space of 18 general cognitive demands (plus unguessability). After the annotation, these 16,108 instances constitute the ADeLe battery (Supplementary Table 28). We can observe the distribution of the demand levels for each dimension, the demand profile, represented as a polar histogram. Exploring this for each benchmark in ADeLe (Fig. 2) helps answer the question of whether each benchmark actually measures what its developers claimed to measure, as we explored in the main text. Once instances are annotated, we can do more insightful analyses than just calculating one average for a whole dataset. When we run an LLM on an annotated benchmark such as the ADeLe battery, we can analyse each dimension separately using a subject characteristic curve73 to show the performance of an AI system as a function of demand levels, offering a comprehensive and robust delineation of the model’s ability on that dimension. The curve can be summarized using the area under the curve, referred to as the ability score, as described in the section ‘Subject characteristic curves’. With this procedure on the characteristic curves, we can derive ability profiles as 18-dimensional vectors containing the estimated abilities. The usual way of representing a score profile with many dimensions is a radial plot. This is common in the behavioural sciences and more recently in AI as well. However, if we look at these plots in AI papers (for example, refs. 47,48), we see that what they represent in each dimension is the average accuracy of a selection of instances that belong to a particular domain or dataset, not an actual ability. The plots based on performance scores will change as the difficulty of the selected instances varies, whereas an ability profile is invariant to these changes.
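The demand profile mentioned above is simply a per-dimension histogram of annotated levels. A minimal sketch (dimension names and instances are hypothetical, not from ADeLe):

```python
from collections import Counter

def demand_profile(annotations):
    """Tabulate, for each demand dimension, how many instances sit at each
    level (0-5): the per-dimension histograms behind the polar plots.
    `annotations`: one dict per instance, mapping dimension name -> level."""
    profile = {}
    for instance in annotations:
        for dim, level in instance.items():
            profile.setdefault(dim, Counter())[level] += 1
    return profile

# Three hypothetical instances annotated on two dimensions:
batch = [{"QLl": 4, "MS": 0}, {"QLl": 3, "MS": 0}, {"QLl": 4, "MS": 2}]
prof = demand_profile(batch)  # prof["QLl"][4] == 2, prof["MS"][0] == 2
```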
Overall, our notion of ability using the general scales is very different from the common yet inaccurate use of the term in AI as a synonym of performance. This includes the use of the term ‘capability’ in the area of safety evaluations: even if informally the concept may be associated with levels74, these levels were never defined or scaled. By comparing the ability profile of an AI system with the demand profile of a task instance or a benchmark, we can explain the observed performance. Moreover, using the differences between abilities and demands, we can use interpretable algebraic models to anticipate performance for new instances (Supplementary Information Section 1.7.7). Notably, there is potential for other options as well. For example, the 18 values that are annotated for each single instance on the scale 0 to 5+ and unguessability constitute a 19-dimensional vector x, which can be used as predictor variables for a probabilistic classification model, an assessor, outputting the (estimated) performance of an AI system on that instance. Each assessor can be trained specifically for each LLM, without relying on the features of the LLM. As shown in the main text, we can compare this with many other powerful ways of predicting performance, such as assessors with embeddings and fine-tuned LLMs (there are more details on how we build distinct assessors in the section ‘Assessors and metrics’). Notably, despite the much smaller computation cost (apart from annotating the battery, which only needs to be done once), the predictive power is substantially better for the demands-based assessor than the best baseline, especially out of distribution, and evidently much better than average accuracy, which is only well-calibrated in distribution. This is because our general scales provide predictive features over a wide variety of tasks while limiting overfitting on features that become spurious when switching tasks and benchmarks.
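A demand-based assessor of this kind can be sketched with scikit-learn (the library the paper reports using) on synthetic data; the 19 features and the toy labelling rule below are stand-ins, not the ADeLe annotations themselves:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 18 demand levels (0-5) plus unguessability (0-100)
# per instance, and a binary success label for one subject LLM.
X = np.column_stack([rng.integers(0, 6, size=(500, 18)),
                     rng.integers(0, 101, size=(500, 1))])
# Toy labelling rule: success when the mean demand is low.
y = (X[:, :18].mean(axis=1) <= 2.5).astype(int)

# One assessor per subject; min_samples_split would be tuned in {2, 50, 200}.
assessor = RandomForestClassifier(min_samples_split=50, random_state=0)
assessor.fit(X, y)
p_success = assessor.predict_proba(X[:5])[:, 1]  # estimated success probabilities
```

The key point is that the assessor only sees the 19-dimensional demand vector of an instance, never the instance text or any feature of the subject LLM.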
Finally, just as ability profiles are non-populational, the assessors we derive for each system are inferred exclusively from the results of that system, rather than from population-level parameters such as those used in scaling laws for aggregate performance prediction75.

LLM annotators and inter-rater analysis
With the rubric set in hand, we annotate any new instance along each dimension using an LLM to replace human annotations, to scalably and rapidly annotate thousands of items. Although there may be some discordances between LLM and human scores, scalability is critical for widespread deployment of the new evaluation methodology. This can be seen as a trade-off but also as an opportunity to have stable and fully reproducible annotations using LLMs, which can be improved as LLMs get better or are more aligned with human interpretation. In fact, the three instance anchors per level were very instrumental for the LLMs to produce good ratings (in a few-shot inference fashion) but also for human understanding. In our case, we performed the annotations with GPT-4o, with which we found a high agreement rate. The use of comprehensive rubrics in natural language that can be applied automatically is a substantial advancement in making the explanatory power of the scale a reality, especially if humans could interact with the LLM to explain their annotations. Specifically, we prompt GPT-4o (‘gpt-4o-0513’ checkpoint)76 to annotate task demand levels (on a discrete scale from 0 to 5) instance by instance for all individual rubrics (see DeLeAn Rubric Set v1.0 in Supplementary Information Section 2). We use the Azure AI application programming interface (API) with chain-of-thought prompting (Supplementary Table 23) at temperature set to 0 with a maximum output token length of 1,000, to ensure that answers can be long enough for nearly all instances while substantially reducing the cost. The stopping condition and the rest of the parameters are left by default.
To assess the agreement rate between humans and GPT-4o, for each demand, we randomly sampled 50 instances while ensuring each level had at least a sample size of 3 to avoid minority levels getting neglected in our inter-rater analysis. This led to 900 instances to be annotated, which were distributed to five humans (authors of this paper, corresponding to Y.H., Y.M.-D., L.Z., Q.Z. and S.Z.), such that each instance was annotated by exactly three humans. The annotation process consisted of two steps. First, each annotator independently assigned a difficulty level (using the 0 to 5+ scale) to each instance using the rubrics. Next, the annotators met for a Delphi77 consensus meeting. During this meeting, instances for which the minimum and maximum ratings of the three annotators differed by two or more points were discussed in detail until a consensus was reached. For cases with differences of less than two points, a simple majority vote determined the final annotation. To check the inter-rater agreement rates, we use the rWG index78,79 with the default rectangular null distribution; a score greater than 0.7 is generally considered a good agreement rate. The result is shown in Supplementary Table 22, in which we observe satisfactory rWG scores (average = 0.86) between the Delphi consensus and GPT-4o, consistently greater than 0.80, except for one dimension with a score of 0.75. However, the rWG scores between humans before the Delphi consensus meeting were slightly lower for certain dimensions.
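For a single item rated by several judges, the rWG index with a rectangular (uniform) null distribution compares the observed rating variance with the variance of uniform responding, (A² − 1)/12 for A response options. A minimal sketch of one common formulation (using the n − 1 sample variance and clipping negative values to 0, both conventions we assume rather than details confirmed by the paper):

```python
import statistics

def rwg(ratings, n_options=6):
    """rWG agreement index for one item, with a rectangular (uniform) null
    over n_options response categories (6 for the 0-5 rubric levels).
    Assumes the n-1 sample variance; negative values are clipped to 0."""
    null_var = (n_options ** 2 - 1) / 12      # 35/12 for six levels
    obs_var = statistics.variance(ratings)    # sample variance of judges
    return max(0.0, 1 - obs_var / null_var)

# Perfect agreement: rwg([3, 3, 3]) == 1.0
# Near agreement:    rwg([2, 3, 2]) is about 0.89, above the 0.7 threshold
```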
These initial disagreements are because of several reasons, identified during our Delphi consensus meetings: occasional misinterpretations of certain words or terminologies, mainly for those humans whose primary language for daily use is not English; knowledge gaps in annotating certain particularly challenging task instances beyond the expertise of annotators; cultural variations affecting annotations, especially within some knowledge dimensions; and several inconsistent ratings for which annotators could not explain their own numerical assignments in hindsight, possibly caused by tiredness in annotating a large number of instances; the reported time for annotating 50 instances on only one single rubric usually ranges between 30 and 60 min. The Delphi method proved useful to mitigate the individual biases and inconsistencies from human annotations caused by the miscellaneous reasons listed above, among others. In Supplementary Information Section 1.8, we also explore two alternative LLM annotators. One is DeepSeek-V3, which is similarly powerful but open-weight: keeping all other things equal, it exhibits a similarly high agreement rate with the Delphi consensus (an average rWG of 0.83; slightly worse than that for GPT-4o of 0.86) and it unlocks similarly high predictive power (see the section ‘Predictive power analysis: anticipating performance with assessors’). The other LLM is LLaMA-3.1-8B-Instruct, which is open-source but much smaller. We find that it achieves a reasonably good agreement rate with the Delphi consensus (an average rWG of 0.74; noticeably worse than that for GPT-4o of 0.86) and it exhibits moderately worse predictive power (again, see the section ‘Predictive power analysis: anticipating performance with assessors’). This is to be expected, as older and smaller models are relatively less powerful in terms of obtaining reliable annotations.
Looking to the future, despite good agreement between humans and GPT-4o as annotator, higher agreements may be possible as the capabilities of LLMs progress, including their potential for explaining their annotations to humans.

Benchmark battery: instance selection and curation
We constructed our benchmark battery by reviewing papers published in the 2024 proceedings from top-tier machine learning conferences (ICML, NeurIPS, ICLR) and natural language processing venues (ACL, EMNLP, NAACL). In our search, we first identified papers with ‘bench’ in the title and then supplemented the collection with further benchmark sets found at other reputable venues. Before including any benchmark (or subset thereof), we applied a rigorous quality check to ensure that the source meets the following selection criteria:
• The benchmark set must be sufficiently difficult to avoid an overabundance of trivial instances. A benchmark is discarded if state-of-the-art LLMs such as GPT-4 achieve more than 75% overall accuracy.
• The expected outputs must be amenable to automatic verification by LLM-based graders. Tasks requiring lengthy passages or those with several valid answers are excluded to maintain grading reliability.
• Benchmarks must not contain AI-generated content, when explicitly noted in the source paper.
• Tasks must be formulated as either open-ended or multiple-choice questions with at least four options to minimize the effect of stochastic ‘guessing’.
• Licensing requirements for the selected benchmarks shall be compatible and allow for free redistribution.
• The collection of benchmark(s) introduced by a paper must be publicly available at the time of our curation effort (that is, as of 26 December 2024).
• The task must have an objective ground truth that can be used to unambiguously categorize performance as either success or failure.
• The quality of ground-truth labelling must be near-perfect, if reported.
For those benchmarks that do not report any quality scores of their ground truth, we apply further quality filters, described both at the end of this subsection and in the section ‘Subject LLMs and grading’. This eventually resulted in a total of 20 benchmarks from nine papers, comprising 63 tasks for our analysis (Supplementary Table 28). For efficiency reasons, we randomly sampled up to 500 instances per task to strike a balance between data diversity and size. This led to an original battery of 21,996 instances. Last, we prompted GPT-4o to annotate three quality indicators: (1) the accuracy of ground-truth labels; (2) the objectivity; and (3) the unambiguity, for all instances, graded with a Likert scale from 1 to 5 (Supplementary Tables 24, 25 and 26). We inspected the annotations of 50 randomly sampled instances with a score of 1 for each quality indicator, in which a human judge (a researcher with a background in computer science) reviewed these annotations and labelled them as ‘agree’, ‘disagree’ and ‘uncertain’. For the accuracy of ground-truth labels, the agreement, disagreement and uncertainty rates were 32%, 6% and 62%, respectively. For objectivity, the agreement, disagreement and uncertainty rates were 68%, 10% and 22%, respectively. For unambiguity, the agreement, disagreement and uncertainty rates were 70%, 22% and 8%, respectively. Given this observation, we removed those instances with a score of 1 in any of the three aforementioned indicators, which accounts for 16% of instances in the initial battery, reducing the battery at this stage to 18,462 instances. Also, we discarded 0.9% of instances in which the LLM annotator did not offer an annotation (for example, flagged by OpenAI’s moderation filters) or did not yield demand annotations in an expected and easily processable format, resulting in 18,291 instances remaining.
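The score-of-1 removal rule above amounts to a simple predicate over the three quality indicators; a minimal sketch (field names are ours, not the paper's schema):

```python
def keep_instance(quality):
    """Drop an instance if any of the three GPT-4o quality indicators
    (ground-truth accuracy, objectivity, unambiguity; Likert 1-5) equals 1."""
    return all(score > 1 for score in quality.values())

battery = [
    {"accuracy": 5, "objectivity": 4, "unambiguity": 5},
    {"accuracy": 1, "objectivity": 4, "unambiguity": 5},  # removed
    {"accuracy": 3, "objectivity": 3, "unambiguity": 2},
]
cleaned = [inst for inst in battery if keep_instance(inst)]  # keeps 2 of 3
```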
This is a satisfactory result, as we removed many problematic instances at the cost of eliminating a small proportion of seemingly good ones. This cleaning is critical to reduce noise when deriving the ability profiles of models and evaluating the predictive power of assessors.

Subject LLMs and grading
The pool of analysed subjects includes 15 LLMs in total (Extended Data Table 1), six proprietary models from OpenAI, five open-weight models from Meta and four open-weight models from DeepSeek:
• GPT/o1: we use six models from the GPT and o1 families (OpenAI)80,81. The four GPT models, Babbage-002, Davinci-002, GPT-3.5-Turbo (built as ‘gpt-35-turbo-0613’) and GPT-4o (built as ‘gpt-4o-0513’), are the original instruction-tuned models in the GPT family, in which the last two are also shaped up by fine-tuning with human feedback and further include a moderation post-filtering mechanism82. By contrast, OpenAI o1-mini (built as ‘o1-mini-2024-09-12’) and OpenAI o1 (built as ‘o1-2024-12-17’ with the reasoning effort parameter set to ‘low’) belong to a family of ‘reasoning’ models, designed to take extra time to generate and refine a chain-of-thought before producing a final answer. All of these models were accessed through the public API offered by Azure AI Foundry.
• LLaMA: we use five different scales of the latest LLaMA series (LLaMA-3 family83): 1B, 3B, 11B, 90B and 405B, all of which have been instruction-tuned. Note that we refer to them consistently with the suffix ‘-Instruct’ as in the original names of the 1B, 3B and 405B variants. This also applies to the 11B and 90B variants, although they are originally named with the suffix ‘-Vision’ instead of ‘-Instruct’, as these are multimodal. To avoid any possible confusion, we replace the suffix ‘-Vision’ with ‘-Instruct’, as we focus on evaluating text modality in this work. All of the inferences were run through the Hugging Face API.
• DeepSeek: we locally run the four different scales (1.5B, 7B, 14B and 32B) of the DeepSeek-R1-Distilled-Qwen suite13, a set of ‘reasoning’ models (based on the Qwen-2.5 model family84) that distilled knowledge from a much more powerful LLM (DeepSeek-R1).
For inference, all subject models were queried with the temperature parameter set to 0 and no system prompt, with the exceptions of OpenAI’s o1 models, which can only be queried with temperature equal to 1, and the DeepSeek-R1-Distilled-Qwen models, which were queried with a temperature of 0.6 and a top-p of 0.95 as recommended by the original paper13. Similarly, we use chain-of-thought prompting for all models except for the ‘reasoning’ models (OpenAI’s o1 models and DeepSeek-R1-Distilled-Qwen models), which were already shaped up to perform chain-of-thought by default by their developers. In terms of maximum output token length, we use 2,000 tokens for all models, except for OpenAI’s o1 models and the DeepSeek-R1-Distilled models, which use 16,384 tokens instead. We used the default values for the stopping condition and the rest of the parameters. Most grading of instances in present AI evaluation practice is performed with LLMs as a judge85, because manual grading for a large number of instances and models would be infeasible. We follow that practice but we do not want to consider instances that are wrongly graded, because that would portray a misleading account of the explanatory and predictive power of the methodology we present in this paper. We then prefer to discard those instances for which the LLMs (as a judge) are not robust. This means that we exclude some instances and this may introduce some selection bias in ADeLe. We believe that instances that are hard to grade or verify are not necessarily easier or harder to solve. In either extreme, they would increase predictability but not separability metrics such as AUROC. Consequently, we perform the following procedure.
We automatically grade the responses of these models on a discrete scale between 1 (surely incorrect) and 5 (surely correct) using two LLMs, GPT-4o and Claude 3.5 Sonnet (‘claude-3.5-sonnet-1022’ checkpoint), prompted with temperature set to 0 while the rest follows the default configurations. The prompt contains the input, the response of the subject and the ground truth (see Supplementary Table 27 for a sample prompt template). To spot instances that are ‘hard to verify’ (for example, owing to inherent subjectivity or erroneous ground truth), which can introduce noise into the analysis, we remove approximately 12% of instances for which the two LLM graders did not agree, that is, did not simultaneously output either correctness scores ≥4 (both graders think the answer is a success with some confidence) or correctness scores ≤2 (both graders think the answer is a failure with some confidence) when verifying GPT-4o as a subject; this forms the final ADeLe battery v1.0, with 16,108 instances. We finally labelled input–output pairs graded with a mean score less than 3 as failure pairs and success otherwise (scores of 3 were filtered in the previous step anyway). We randomly sampled 100 instances from all of the gradings and manually found that 98% of input–output pairs are correctly verified.

Assessors and metrics
An assessor is an external metamodel designed to predict the performance of a subject system (for example, an LLM) on individual task instances by taking features of those individual task instances as input19,21,22,39. These features can range from the raw representation (full text or image) to metafeatures representing cognitive demands and linguistic characteristics, as well as more structured representations such as average (word) embeddings of each task instance.
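The grader-agreement filter and mean-score labelling rule used earlier to build ADeLe v1.0 reduce to two small predicates; a minimal sketch of our reading of those thresholds (function names are ours):

```python
def graders_agree(g1, g2):
    """Keep an instance only if both LLM graders (1-5 scale) are fairly
    confident in the same direction: both >= 4 (success) or both <= 2 (failure)."""
    return (g1 >= 4 and g2 >= 4) or (g1 <= 2 and g2 <= 2)

def label_success(g1, g2):
    """A mean grade below 3 is labelled a failure; otherwise a success."""
    return (g1 + g2) / 2 >= 3

# graders_agree(5, 4) -> True (kept, success)
# graders_agree(4, 2) -> False (instance removed as 'hard to verify')
```

Note that any pair with a mean of exactly 3 (for example, 4 and 2) already fails the agreement filter, consistent with the remark that such scores were filtered in the previous step.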
When performance is defined as a binary success score (correct versus incorrect), an assessor can be built by using any standard binary classifier, including statistical models (for example, RF) and fine-tuned language models (for example, fine-tuned LLaMA-3.1-8B). Such models are trained to anticipate the success probabilities of a given subject on task instances without executing that subject and can be either tailored to predict the performance of a single AI system or designed to generalize across systems. In this work, we train and compare three types of assessor:
• Demand-based: this assessor is a RF86 classifier that takes the vector of 18 demands and the special UG (Unguessability) dimension as input to predict a subject LLM’s performance. The in-distribution data are used to optimally select the minimum number of samples required to split an internal node, chosen as 2, 50 or 200.
• Embeddings-based: in this model, each item instance is represented by the average of its GloVe word embeddings87, fed to train a RF classifier. As with the demand-based assessor, we tuned the minimum-samples-per-split hyperparameter of the RF (choosing from 2, 50 and 200) using the in-distribution data.
• Fine-tuned LLaMA: this is a fine-tuned LLaMA-3.1-8B (ref. 83) with a linear classification head. This model is trained end-to-end using the original input text for each task instance. We use the in-distribution data to select the optimal learning rate between 1 × 10−4 and 2 × 10−5. To improve training efficiency, we used the NF4 quantization scheme and bfloat16 for computation, along with low-rank adaptation (LoRA) for efficient training. Training was performed with a batch size of 16 for three epochs and a weight decay of 0.01.
For implementation, the RF models were trained using the scikit-learn library88, whereas the fine-tuned LLaMA-3.1-8B was trained on the Transformers library89 using the PyTorch backend running on Python 3.11.
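The embeddings-based baseline above can be sketched in a few lines; here random vectors stand in for pretrained GloVe embeddings, and the texts and labels are toy stand-ins, so this only illustrates the averaging-plus-RF pipeline, not the paper's trained baseline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Random 50-d vectors stand in for pretrained GloVe word embeddings.
vocab = {w: rng.normal(size=50) for w in "what is the sum of two and red sky".split()}

def embed(text):
    """Average the word vectors of the tokens found in the vocabulary."""
    vecs = [vocab[t] for t in text.lower().split() if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Toy instances and success labels for one subject LLM.
texts = ["what is the sum of two and two"] * 4 + ["the red sky"] * 4
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X = np.stack([embed(t) for t in texts])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
```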
All unspecified hyperparameters were left at their default values. In terms of computational cost, the demand-based assessor was extremely efficient. On an M3 Pro CPU, the data of each subject were processed by means of tenfold cross-validation in about 4 s. By contrast, the embedding-based assessor took about 40 times as long owing to the higher computational overhead of processing dense vector representations. The fine-tuned LLaMA assessor was by far the most expensive, taking around 300 GPU hours on a single V100 GPU to converge (that is, around six orders of magnitude longer than the demand-based approach). To quantify the predictive quality of these assessors, we used AUROC and the ECE with ten equal-width bins, as these two metrics capture two key aspects of predictive power (discrimination and calibration) and each of them is commensurate when comparing the predictive power of distinct assessor–subject pairs. We compute the statistical significance between the demand-based assessor and the strongest baseline. We apply the Wilcoxon signed-rank test based on the win–loss outcomes using paired comparisons of each fold between two assessors (across ten folds with ten repetitions each based on distinct seeds). Although the use of demand annotations substantially outperforms the other baseline approaches as seen in Extended Data Tables 2, 3 and 4, two key factors explain why the discrimination power declines in out-of-distribution settings. First, because our analysis includes only 63 tasks from 20 benchmarks—many of which (for example, ChemLLMBench) have non-overlapping demand distributions—the training data do not fully capture the multidimensional demand space. We suggest that the predictive power of the demand-based assessor for any arbitrary new tasks or new benchmarks can be boosted to the level of in-distribution by ensuring that the demand distribution of the training data efficiently covers the multivariate demand space.
Second, there is a paucity of extremely difficult instances to challenge the high-performance models (for example, OpenAI o1-mini, OpenAI o1, DeepSeek-R1-Distilled-Qwen-32B). As shown in Fig. 3, even at level 5 (for which instance coverage is low), the best models maintain success probabilities well above zero and the estimated abilities can go beyond 5, just by extrapolation. In Supplementary Information Section 1.6, we further discuss these factors and potential improvements on instance selection and automated grading.

Subject characteristic curves
Extended Data Fig. 4 shows a subject characteristic curve for the results of Llama-3.1-405B-Instruct on 16,108 instances of the ADeLe battery, sorted and binned by the levels on the dimension KNn (Knowledge of Natural Sciences). As further elaborated in Supplementary Information Section 1.2, for each bin b for that dimension, we exclude all points for which the level of any other dimension is greater. In other words, we want the represented dimension to dominate on the instances we are showing (in this case, only 3,785 out of 16,108). On this plot, we can then fit a logistic function and look for the x-axis value at which the probability that the subject succeeds is 0.5. In Extended Data Fig. 4, this leads to an estimated ability of 4.3. Ability can then be interpreted as the level of demand at which the probability that the subject succeeds is 0.5, assuming all other demand levels are lower, which is in accordance with psychometric tradition (ref. 90, p. 249) and will be followed for the rest of the paper. Note that an ability of 4.3 does not necessarily mean that the subject solves all task instances of level 4.3 or less, but that it has a 50% chance of succeeding at level 4.3, a higher chance at level 3, a much higher chance at level 2 and so on, and an evidently lower chance at level 5 and above, in a sigmoidal way, as we see in the figure.
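The curve-fitting step can be sketched as follows, with simulated binary outcomes generated from a true ability of 4.3 (the steepness parameter and the data are illustrative assumptions, not the paper's fitting code):

```python
# Sketch of a subject characteristic curve: fit a logistic to success-vs-level
# data for one dominating dimension and read the ability off as the level at
# which the predicted probability of success is 0.5. Data are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b):
    # Decreasing in x for a > 0; b is the level where the curve crosses 0.5
    return 1.0 / (1.0 + np.exp(a * (x - b)))

rng = np.random.default_rng(1)
levels = rng.integers(1, 6, size=2000).astype(float)     # demand levels 1..5
true_ability = 4.3
y = (rng.uniform(size=2000) < logistic(levels, 1.5, true_ability)).astype(float)

(a_hat, b_hat), _ = curve_fit(logistic, levels, y, p0=[1.0, 3.0])
print(f"estimated ability = {b_hat:.1f}")                # level where p = 0.5
```

With enough instances per level, `b_hat` recovers the level at which success probability is 0.5, which is exactly the ability reading described above.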
The exact estimation of the ability (usually equivalent to the area under the curve) is further explained in Supplementary Information Section 1.2. The advantages of these curves and this manner of interpreting ability are reinforced by the fact that the scale on the x-axis is absolute rather than relative: it is robust to changes in the demand distribution of the data. For instance, with the 3,786 instances in Extended Data Fig. 4, we get an average accuracy of 62%. However, if we took the n = 699 instances of level 5 and repeated them 500 times in the dataset, the average accuracy of the LLM would decrease substantially (below 40%), as we would be adding more difficult examples. This is what adversarial testing does60, especially when benchmarks saturate. By contrast, the average accuracy for the instances at bin 5 would remain the same and the characteristic curve would not be affected at all. The ability would not change, remaining at level 4.3. This case neatly represents the difference between performance, which is a measure of a subject and a task distribution jointly (hence changing from 62% to 40% when the task distribution changes), and ability, which is an inferred property of a subject that is invariant to the task distribution. Although all of this is strongly inspired by IRT, and the linear logistic test model in particular91, it is important to clarify that, unlike these and other latent-factor approaches, including those in AI16,17,75, we only use the information of a single LLM for the estimation of its abilities. With the demand-based scales and the ability-estimation method introduced in this paper, the demands and abilities for tasks and AI systems take values that are completely independent of other tasks and AI systems, now or in the future. We have used the term 'non-populational' to refer to an indicator or measurement that does not depend on the rest of the population, only on the individual.
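This invariance can be illustrated with a small simulation. The per-level success rates and the split of instances across levels below are hypothetical (only the level-5 count of 699, and a total of 3,786, follow the text):

```python
# Sketch of the performance-vs-ability distinction: oversampling the hardest
# bin changes average accuracy, but per-bin success (and hence the fitted
# curve and the ability estimate) is untouched. All rates are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
levels = np.repeat([1, 2, 3, 4, 5], [800, 900, 887, 500, 699])  # sums to 3,786
p_success = {1: 0.97, 2: 0.90, 3: 0.75, 4: 0.45, 5: 0.15}       # per-level rates
correct = np.array([rng.uniform() < p_success[l] for l in levels])

print(f"original accuracy: {correct.mean():.2f}")

# Adversarially re-weight: repeat the level-5 instances many times over
hard = levels == 5
levels2 = np.concatenate([levels, np.repeat(5, hard.sum() * 20)])
correct2 = np.concatenate([correct, np.tile(correct[hard], 20)])

print(f"re-weighted accuracy: {correct2.mean():.2f}")           # drops sharply
# Per-bin accuracy at level 5 is unchanged, so the curve and ability are too:
print(correct[hard].mean(), correct2[levels2 == 5].mean())      # identical
```

The aggregate accuracy collapses under the re-weighting, but every point of the characteristic curve, and therefore the inferred ability, is exactly the same.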
For the first time, there is a non-populational measurement paradigm for evaluating the cognitive and intellectual capabilities of general-purpose systems. This is in contrast to common non-inferential techniques, such as benchmark aggregates, which are affected by the distribution of difficulty in the benchmark. Similarly, standard inferential techniques such as IRT, principal component analysis and factor analysis are also populational. They usually work well with human populations, because samples are sufficiently stable over time, but they lead to different results as soon as the AI-system 'population' is modified, whenever a new set of LLMs is added to the inferential pool. For instance, the factors that were discovered for LLMs in ref. 16 differ from those found in ref. 17, even though the two studies collected representative samples of LLMs, used the same factor analysis methodology and took place only a few months apart. This volatility does not happen with our approach: our abilities are not relative to a population of subjects and the scale is absolute. Even if the evaluation battery were extended with instances of levels 7 or 8 to account for more powerful future AI systems, the logistic curve for the old systems would probably have low values on these instances, thus not affecting the original estimates. This forward-looking extensibility and backward compatibility are crucial for measurement. In sum, there is an open opportunity for the new scales, battery and procedure presented in this paper to be the genesis of a standardization initiative for the robust measurement of present and future AI capabilities.

Pipeline and guidelines for applications and extensions
There is a consensus within the AI community that there is a need for a new science of AI evaluation92,93. However, there is also resistance to moving beyond the present benchmarking paradigm8.
Although some have proposed harnessing the behavioural sciences, such as psychology and psychometrics, for AI evaluation, this is generally understood to mean populational approaches, such as factor analysis, principal component analysis or IRT16–18, whose findings may soon lose value owing to the fast-evolving set of AI systems. Our paper demonstrates that a possible answer for a scientific approach to AI evaluation comes from behavioural inference at the instance level. These inferences are made from features that are not derived from a population of subjects. This approach was not previously possible for human evaluation because it requires tens of thousands of instance-level results for each subject, yet this scale is possible for AI evaluation. Furthermore, the annotation of this number of items with a wide range of dimensions has only now been unlocked by the ability to automate good-quality annotation with LLMs. Nevertheless, to move beyond the present paradigm (based on benchmark aggregates or the use of latent factors), the methodology must be made accessible, modular and customizable. Extended Data Fig. 5 illustrates a pipeline for our methodology, with two processes that can be followed independently. The 'System Process' (top) can be applied to any new AI system we want to explain or make predictions about and consists of running the model on the ADeLe battery, plotting characteristic curves (see, for example, Fig. 3 and Extended Data Fig. 4) and summarizing the profile of abilities with a radial plot, as in Fig. 4. The 'Task Process' (bottom) ensures that the methodology can be extended and kept up to date by using the DeLeAn rubrics to automatically obtain a demand profile for new task instances or benchmarks. This is especially useful to mitigate challenges such as data contamination and benchmark saturation while still keeping everything in the same measurement space.
This demand profile can be compared with the capability profile of any AI system that has previously gone through the 'System Process' to identify specific areas of strength and weakness relative to the task demands and intuitively predict performance. Moreover, we can also train powerful assessors that automatically decide whether it is sensible to use the AI system in a given situation. In Table 1, we enumerated a series of applications. Here we expand on how they are implemented using the pipelines (system or task process) in Extended Data Fig. 5.
• Resolution of apparently inconsistent results (system and task processes): dual profiling of tasks and systems enables us to reconcile seemingly contradictory evaluation outcomes94,95. If two benchmarks in the same domain produce different rankings or success rates for a model, the discrepancy can be explained by differences in their demand profiles. For example, several tasks labelled 'mathematical reasoning' can require disparate levels of reasoning versus knowledge demands, resulting in inconsistent outcomes that our method can explain. We illustrate this in Supplementary Information Section 1.12.
• Better benchmarks with construct validity by design (task process): designing benchmarks using demand-level rubrics ensures that each task covers the intended range of abilities without extraneous factors, thereby improving construct validity10. By selecting instances that span all relevant demand levels, new benchmarks can be aligned with their target constructs by design. In practice, this means that a benchmark will be sensitive (including items of all difficulty levels relevant to the intended skills) and specific (excluding demands relating to unintended skills), resulting in more meaningful evaluation results.
• Benchmark interoperability and instance reuse in new batteries (task process): items from different benchmarks can be easily integrated into new evaluation batteries (akin to equating procedures in psychometrics96) by placing tasks on the same general demand scales. This interoperability allows us to reuse instances across benchmarks, covering each other's blind spots and ensuring broader coverage of the capability space. In other words, complementary benchmarks can be merged or linked through their demand annotations to create composite tests that fill gaps (for example, by adding missing high-level reasoning items from one source to another).
• Meaningful scaling laws (system process only, if reusing ADeLe): using the absolute demand/ability scales provides a clearer picture of how model performance scales with size or training. Traditional scaling analyses based on aggregate accuracy often saturate or yield ambiguous trends, and indeed there is evidence that naive 'scaling laws' can be misleading, break down under certain conditions or do not scale universally97. By contrast, evaluating models on our general scales reveals genuine diminishing returns and emergent phenomena, owing to some specific capabilities.
• Measurements robust to changing populations (system process only, if reusing ADeLe): usually, benchmarks are replaced whenever a relevant part of the population of AI systems to which they are applicable achieves accuracy close to the maximum (termed 'saturation'98). At the same time, populational methods that infer difficulty levels18,98,99 or (the number of) latent factors describing capabilities16,17 depend on the considered population of models; thus, the extracted factors or difficulties may lose relevance when the population evolves. With our measurement scales, we can define indicators of progress spanning years or even decades.
• Capability catalogue accommodating AI progress (system and task processes): upcoming AI systems may be better described by further dimensions that are not included in the present set, such as having access to affordances that unlock dimensions in the visual domain. The evolution of the catalogue can be used as a mirror of the trends of the discipline and a way of making use of standardized rubrics for new capabilities that appear as AI advances, and can be used for regulation purposes100.
• Capability profiles bringing explanatory power (system process): summarizing the performance of an AI system as an ability profile (a vector of scores for each demand dimension) provides insights that go beyond a single aggregate score and offers actionable guidance for selection and deployment. These profiles highlight the specific strengths and weaknesses of a model, thus adding an interpretable layer to performance evaluation (for example, they can reveal whether a model's strong knowledge base comes at the expense of its logical reasoning ability or how training strategies such as chain-of-thought prompting can boost certain capabilities more than others). Recent work evaluating LLMs shows the value of such multidimensional evaluation and analysis101,102.
• Model diagnosis and task counterfactuals (system and task processes): the demand–ability framework enables fine-grained diagnosis of model failures and 'what-if' analyses. When a model fails a particular challenge, we can identify the demand dimensions that were high for that item, thus pinpointing the capability shortfall. Recent work decomposes counterfactual reasoning into sub-skills and demonstrates that present LLMs struggle with such tasks103; prompting techniques are typically used to probe the diagnostic capabilities of LLMs104.
With profiles, we can adjust task demands or system abilities in a controlled manner, conducting counterfactual experiments to explain how LLMs would behave under modified conditions, with lower or higher demands and abilities.
• Routing instances to the best system (task process, for LLMs already profiled): a new task instance can be annotated 'on the fly' and its demands compared with the capability profiles of AI systems to 'route' the task instance to the most appropriate LLM38,105. Routers can make use of existing system-specific assessors21,22,39. Moreover, routers can also combine performance with considerations such as cost, speed or uptime, whose importance depends on the considered application38,105. Given the high out-of-distribution performance of the assessors trained on the ADeLe annotations (see the 'Predictive power analysis: anticipating performance with assessors' section), it is conceivable that routers using these annotations will perform similarly well in such scenarios.
• Monitoring LLMs and rejecting queries (task process, if LLM already profiled): demand profiles allow for proactive safety monitoring and query rejection106 when appropriate. If an incoming query is estimated to require capabilities beyond the reliable scope of a given model, the system can either refrain from answering or delegate to a human operator. Previous studies have shown that a smaller assessor model can be trained to predict the performance of a larger model on individual instances, enabling a 'reject before you run' mechanism39. This type of anticipatory rejection or deferral contributes to reliability by avoiding situations in which the model is pushed beyond its capabilities40.
• Guiding red teaming (task process, annotation only, if LLM already profiled and assessor already built): red-teaming efforts107 can be informed by highlighting where an AI system is most vulnerable108.
For example, if the profile of a model indicates lower ability in metacognition or abstract reasoning, the red team can create prompts that heavily tax these abilities. Also, by inverting inputs and outputs in the assessor, we can probe areas in which the model is weak, ensuring that potential failure modes are covered more thoroughly. This uncovers critical vulnerabilities before malicious actors do109 and provides concrete feedback for model improvement, as any weakness discovered is immediately contextualized by the demand that elicits it. Other applications related to policy, such as safety auditing or regulatory review, require a comparison of LLM and task profiles, with the two processes involved. All of these applications can use and extend the collaborative platform at https://kinds-of-intelligence-cfi.github.io/ADELE. Future extensions will mostly be led by upcoming applications and the evolution of AI. Clearly, more capabilities will be added to the catalogue (for specific domains or to cover multimodal or agentic systems), more levels may be needed for some of the capabilities as AI becomes more powerful, and more benchmarks will be annotated, extending or complementing ADeLe for different purposes. This will lead to an evolution of the catalogue and, if necessary, revision of the rubrics and their taxonomic relations, provided there is transparency about backward compatibility. This should be the seed of a collective consensus and standardization effort on measurement scales for AI, as has happened in other scientific disciplines.

Inclusion and ethics
We used LLMs that are trained on very different sources of data and may have important ethical consequences, such as failing in ways users cannot understand or anticipate. This has been the main motivation for this research. The domains we use in our experiments and the examples included in the manuscript do not generate any specific ethical issue.
We only use examples and prompts in the English language. The rubrics are also only in English but could be adapted to other languages. We did not conduct any human study directly, other than a subset of the authors applying the rubrics. More details about the costs of this research (compute), safety implications and other ethical issues can be found in Supplementary Information Section 1.15.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The associated data, code and instance-level results are available in an independent public platform: https://kinds-of-intelligence-cfi.github.io/ADELE. In compliance with the recommendations of ref. 9 about the reporting of evaluation results in AI, we include the results at the instance level. Source data are provided with this paper.

Code availability
The associated data, code and instance-level results are available in an independent public platform: https://kinds-of-intelligence-cfi.github.io/ADELE. In compliance with the recommendations of ref. 9 about the reporting of evaluation results in AI, we include the results at the instance level.

51. McGrew, K. S. in Contemporary Intellectual Assessment: Theories, Tests, and Issues 2nd edn (eds Flanagan, D. P. & Harrison, P. L.) 136–181 (Guilford Press, 2005).
52. Rust, J., Kosinski, M. & Stillwell, D. Modern Psychometrics: The Science of Psychological Assessment 4th edn (Routledge, 2021).
53. Hernández-Orallo, J. The Measure of All Minds: Evaluating Natural and Artificial Intelligence (Cambridge Univ. Press, 2017).
54. Hernández-Orallo, J. & Vold, K. AI extenders: the ethical and societal implications of humans cognitively extended by AI. In Proc. 2019 AAAI/ACM Conference on AI, Ethics, and Society 507–513 (ACM, 2019).
55. Tolan, S. et al. Measuring the occupational impact of AI: tasks, cognitive abilities and AI benchmarks. J. Artif. Intell. Res. 71, 191–236 (2021).
56. Balloccu, S., Schmidtová, P., Lango, M. & Dušek, O. Leak, cheat, repeat: data contamination and evaluation malpractices in closed-source LLMs. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) 67–93 (Association for Computational Linguistics, 2024).
57. Jiang, M. et al. Investigating data contamination for pre-training language models. Preprint at https://arxiv.org/abs/2401.06059 (2024).
58. Suzgun, M. et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023 (eds Rogers, A. et al.) 13003–13051 (Association for Computational Linguistics, 2023).
59. Kazemi, M. S. et al. BIG-Bench Extra Hard. In Proc. 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Che, W. et al.) 26473–26501 (Association for Computational Linguistics, 2025).
60. Kiela, D. et al. Dynabench: rethinking benchmarking in NLP. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 4110–4124 (Association for Computational Linguistics, 2021).
61. Sweller, J. in Psychology of Learning and Motivation Vol. 55, 37–76 (Elsevier, 2011).
62. Kalyuga, S. Cognitive load theory: how many types of load does it really need? Educ. Psychol. Rev. 23, 1–19 (2011).
63. Balepur, N., Rudinger, R. & Boyd-Graber, J. L. Which of these best describes multiple choice evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the above. In Proc. 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Che, W. et al.) 3394–3418 (Association for Computational Linguistics, 2025).
64. Zeng, Y., Kairong, L., Dong, F. & Zheng, P. Quantifying risk propensities of large language models: ethical focus and bias detection through role-play.
Preprint at https://arxiv.org/abs/2411.08884 (2024).
65. Yao, J. et al. Value compass benchmarks: a platform for fundamental and validated evaluation of LLMs values. Preprint at https://arxiv.org/abs/2501.07071 (2025).
66. Hand, D. J. Measurement: A Very Short Introduction (Oxford Univ. Press, 2016).
67. Stevens, S. S. On the theory of scales of measurement. Science 103, 677–680 (1946).
68. Freund, R. Rasch and Rationality: Scale Typologies as Applied to Item Response Theory. PhD thesis, Univ. California, Berkeley (2019).
69. Rhemtulla, M., Brosseau-Liard, P. É. & Savalei, V. When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychol. Methods 17, 354–373 (2012).
70. OECD. Education at a glance 2024: OECD indicators. https://doi.org/10.1787/c00cad36-en (2024).
71. Mirzadeh, S. I. et al. GSM-Symbolic: understanding the limitations of mathematical reasoning in large language models. In Proc. Thirteenth International Conference on Learning Representations (ICLR, 2024).
72. Zhang, J. et al. Task Me Anything. In Proc. Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS, 2024).
73. Lumsden, J. Person reliability. Appl. Psychol. Meas. 1, 477–482 (1977).
74. Phuong, M. et al. Evaluating frontier models for dangerous capabilities. Preprint at https://arxiv.org/abs/2403.13793 (2024).
75. Ruan, Y., Maddison, C. J. & Hashimoto, T. Observational scaling laws and the predictability of language model performance. In Proc. 38th International Conference on Neural Information Processing Systems 15841–15892 (ACM, 2024).
76. Hurst, A. et al. GPT-4o system card. Preprint at https://arxiv.org/abs/2410.21276 (2024).
77. Linstone, H. A. et al. The Delphi Method Vol. 1975 (Addison-Wesley, 1975).
78. James, L. R., Demaree, R. G. & Wolf, G. Estimating within-group interrater reliability with and without response bias. J. Appl. Psychol. 69, 85 (1984).
79. LeBreton, J. M. & Senter, J. L. Answers to 20 questions about interrater reliability and interrater agreement. Organ. Res. Methods 11, 815–852 (2008).
80. Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. OpenAI https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
81. Jaech, A. et al. OpenAI o1 system card. Preprint at https://arxiv.org/abs/2412.16720 (2024).
82. Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
83. Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).
84. Yang, A. et al. Qwen2.5 technical report. Preprint at https://arxiv.org/abs/2412.15115 (2025).
85. Li, D. et al. From generation to judgment: opportunities and challenges of LLM-as-a-judge. In Proc. 2025 Conference on Empirical Methods in Natural Language Processing 2757–2791 (Association for Computational Linguistics, 2025).
86. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
87. Pennington, J., Socher, R. & Manning, C. D.
GloVe: global vectors for word representation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, 2014).
88. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
89. Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 38–45 (Association for Computational Linguistics, 2020).
90. Thurstone, L. L. Ability, motivation, and speed. Psychometrika 2, 249–254 (1937).
91. Fischer, G. H. The linear logistic test model as an instrument in educational research. Acta Psychol. 37, 359–374 (1973).
92. Weidinger, L. et al. Toward an evaluation science for generative AI systems. Preprint at https://arxiv.org/abs/2503.05336 (2025).
93. Yuan, J., Zhang, J., Wen, A. & Hu, X. The science of evaluating foundation models. Preprint at https://arxiv.org/abs/2502.09670 (2025).
94. Zhang, G. & Hardt, M. Inherent trade-offs between diversity and stability in multi-task benchmarks. In Proc. 41st International Conference on Machine Learning 235, 58984–59002 (PMLR, 2024).
95. Zhang, G., Dominguez-Olmedo, R. & Hardt, M. Train-before-test harmonizes language model rankings. Preprint at https://arxiv.org/abs/2507.05195 (2025).
96. Leôncio, W., Wiberg, M. & Battauz, M. Evaluating equating transformations in IRT observed-score and kernel equating methods. Appl. Psychol. Meas. 47, 123–140 (2023).
97. Diaz, F. & Madaio, M. Scaling laws do not scale. In Proc. AAAI/ACM Conference on AI, Ethics, and Society Vol. 7, 341–357 (ACM, 2024).
98. Vania, C. et al. Comparing test sets with item response theory. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (eds Zong, C. et al.) 1141–1158 (Association for Computational Linguistics, 2021).
99. Lalor, J. P., Rodriguez, P., Sedoc, J. & Hernandez-Orallo, J. Item response theory for natural language processing. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts 9–13 (Association for Computational Linguistics, 2024).
100. Burden, J. et al. A framework for general-purpose AI model categorisation. Policy development KJ-01-25-459-EN-N. https://publications.jrc.ec.europa.eu/repository/handle/JRC143256 (2025).
101. Zhou, Y. et al. Evaluating LLMs across multi-cognitive levels: from medical knowledge mastery to scenario-based problem solving. In Proc. 42nd International Conference on Machine Learning 267, 78984–79003 (PMLR, 2025).
102. Qu, Y. et al. Integration of cognitive tasks into artificial general intelligence test for large models. iScience 27, 109550 (2024).
103. Yang, S., Yang, Q., Tang, L., Blackburn, J. & Xi, Z. On the eligibility of LLMs for counterfactual reasoning: a decompositional study. Preprint at https://arxiv.org/abs/2505.11839 (2025).
104. Gaebe, K. & van der Woerd, B. Evaluation of large language models as a diagnostic tool for medical learners and clinicians using advanced prompting techniques. PLoS One 20, e0325803 (2025).
105. Hu, Q. J. et al. Routerbench: a benchmark for multi-LLM routing system. Preprint at https://arxiv.org/abs/2403.12031 (2024).
106. Hendrickx, K., Perini, L., Van der Plas, D., Meert, W. & Davis, J. Machine learning with a reject option: a survey. Mach. Learn. 113, 3073–3110 (2024).
107. Yu, J., Lin, X., Yu, Z. & Xing, X. GPTFUZZER: red teaming large language models with auto-generated jailbreak prompts. Preprint at https://arxiv.org/abs/2309.10253 (2023).
108. Mozes, M., He, X., Kleinberg, B. & Griffin, L. D. Use of LLMs for illicit purposes: threats, prevention measures, and vulnerabilities. Preprint at https://arxiv.org/abs/2308.12833 (2023).
109. Freenor, M. et al.
Prompt optimization and evaluation for LLM automated red teaming. Preprint at https://arxiv.org/abs/2507.22133 (2025).

Acknowledgements
We thank the POLARIS Lab and the Computational Cognitive Science Lab at Princeton University, the European AI Office, the UK AISI, the Future of Life Institute, the OECD Future of Skills team, S. Chang, M. Zilka, J. Lian, R. Burnell, W. Schellaert, Á. Gómez, M. Tešić, C. Li and P. Romero for their valuable help and feedback at certain stages of the project. We thank OpenAI for granting us research access to several LLMs to conduct the experiments in this paper, and DeepSeek and Meta for giving us access to the weights of their models. We acknowledge support from the following institutions: the Microsoft Accelerate Foundation Models Research (AFMR) grant programme, a Long-Term Future Scholarship financed by Coefficient Giving (formerly Open Philanthropy) and the Spanish Government's Knowledge Generation Projects (PID2023-150271NB-C21). This work was also supported by CIPROM/2022/6 (FASSLOW), IDIFEDER/2021/05 (CLUSTERIA) and CIACIF/2023/276, financed by Generalitat Valenciana, the EC H2020-EU grant agreement no. 952215 (TAILOR), Spanish grants PID2021-122830OB-C42 (SFERA), PID2023-150271NB-C21 and PID2024-162030OB-100 (ROBIN) financed by MCIN/AEI/10.13039/501100011033 and 'ERDF A way of making Europe', the Cátedra ENIA-UPV in Sustainable AI Development, TSI-100930-2023-9, INCIBE's Chair financed by the EU's NextGenerationEU, EUR2024-153548 (PREDAIT) 'Towards Predictable AI' from 'Spanish Europe Excelencia' 2024, the Spanish National Research Council (CSIC) Special Intramural Projects programme and the Cambridge Trust. M.C. declares support from Google.org through the Silicon Valley Community Foundation by means of a grant to Fundación General CSIC. The research of J.H.-O.
is supported by OpenAI's grant to the 'AI Progress through the Lens of Predictable AI Ecosystems' programme, which is based at the Leverhulme Centre for the Future of Intelligence at the University of Cambridge.

Author contributions
L.Z. and J.H.-O. conceived and led the project. All authors contributed to the collection of benchmarks, the development of rubrics and the prompts, as well as the choice of model families and experimental methodology. L.Z., Y.M.-D., S.Z., Q.Z., P.S.-G. and Y.H. ran the core experiments. L.Z. and F.M.-P. prepared the result analysis and plotting. L.Z., L.P., F.M.-P. and J.H.-O. drafted the manuscript. All authors (L.Z., L.P., F.M.-P., K.M.C., Y.M.-D., S.Z., Q.Z., Y.H., L.S., J.E.P., Z.L., P.S.-G., K.J.-C., P.A.M.C., J.Z., J.B., B.M., D.S., M.C., J.W., P.H., S.T.W., P.C.K., L.C., X.X. and J.H.-O.) edited and revised the manuscript. L.Z., X.X. and J.H.-O. supervised the project.

Competing interests
We received support and free tokens from some of the providers of the LLMs evaluated in this paper or some of their direct competitors, namely OpenAI, Microsoft Research and Google. OpenAI, Microsoft Corporation and Google Inc. had no role in the ideas and research questions, study design, data collection and analysis, decision to publish or preparation of the manuscript.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41586-026-10303-2.
Correspondence and requests for materials should be addressed to Lexin Zhou, Xing Xie or José Hernández-Orallo.
Peer review information Nature thanks the anonymous reviewers for their contribution to the peer review of this work.
Reprints and permissions information is available at http://www.nature.com/reprints.
Extended Data Fig. 1 | Correlations of the demand level using all of the items in the ADeLe battery for all pairs of the 18 demands and the special dimension UG (Unguessability). It also includes the success (that is, correctness at the instance level) of all of the subject LLMs considered in the experiments.

Extended Data Fig. 2 | Distribution of level frequencies for the 18 demands using all of the 16,108 instances in the ADeLe battery v1.0. Dimensions such as CEe (Verbal Expression), MS (Mind Modelling and Social Cognition) and SNs (Spatial Reasoning and Navigation - Spatial) have a low proportion of high-level items, but this is in accordance with the focus of LLM evaluation on factual questions with no navigation or full social interaction. Future versions of the battery for agents or multimodal scenarios can increase the number and breadth of the dimensions.

Extended Data Fig. 3 | Level annotations of five items (from benchmarks OmniMath, TimeQA, MedCalcBench, MMLU-Pro and TruthQuest) using the DeLeAn rubric set by GPT-4o. The demands listed in the table (from left to right) follow the same order as in Extended Data Table 5 (from top to bottom): 11 elemental, five knowledge and three extraneous.

Extended Data Fig. 4 | The characteristic curve of Llama-3.1-405B-Instruct for dimension KNn (Knowledge of Natural Sciences) on the ADeLe battery.
The x-axis shows the demand levels from 0 to 5 for KNn and the y-axis shows the average performance at that level (probability of success). As usual, level 0 has no data points (level 0 never dominates) and, in this case, there is no point for level 1 either. The curve is a logistic fit with output range (0, 1).

Extended Data Fig. 5 | Pipelines to explain and predict performance for new systems and benchmarks. Top, 'System Process', the steps for each new AI system: (1) run the new system on the ADeLe battery; (2) plot characteristic curves for all dimensions and extract the ability profile for the system; and, optionally, (3) train a simple assessor using the annotated levels as inputs and the score as output. Bottom, 'Task Process', the steps for each new task or benchmark: (A) apply the DeLeAn rubrics to the new tasks using a standard LLM; (B) get demand histograms and demand profiles that explain what demands the tasks require; and, optionally, (C) predict performance on the new tasks for any system that has built an assessor after the 'System Process'. Assessors based on the demand profile have markedly higher predictive power in out-of-distribution settings than the baseline assessors, anticipating validity in new situations.

Extended Data Table 1 | Characteristics of the 15 language models evaluated in this paper. SFT, supervised fine-tuning; RLHF, reinforcement learning from human feedback; CoT, chain-of-thought.

Extended Data Table 2 | In-distribution predictability results of 15 LLMs for the ADeLe battery using tenfold cross-validation, averaged across ten seeds. The first two columns show the names of the subject LLMs and their overall accuracy on the ADeLe battery. The remaining three pairs of columns show the AUROC and ECE of three different assessors (RF using demands, RF using average GloVe embeddings and fine-tuning Llama-3.1-8B).
For a single LLM subject, the training time is 4 s and 160 s for the demand-based and embeddings-based assessors, respectively, on an M3 Pro CPU, whereas fine-tuning the Llama assessor takes 300 h on a single V100 GPU. The weighted average is only indicative, for easy comparison; it uses the normalized LLM accuracy as a weight in the mean, giving more relevance to more powerful models, which are more representative now and in the near future. The asterisks indicate a statistically significant difference (α = 0.05) between the demand-based assessor and the strongest baseline (fine-tuned Llama), using the Wilcoxon signed-rank test. The RF assessor's s.d. across the ten seeds ranges between 0.0004 and 0.001 for AUROC and between 0.0006 and 0.002 for ECE among subject LLMs. Given these low s.d. values, we do not show confidence intervals (which are very narrow) for the sake of clarity.

Extended Data Table 3 | Task out-of-distribution predictability results of 15 LLMs for the ADeLe battery using tenfold cross-validation, averaged across ten seeds (all else equal to Extended Data Table 2, except that the RF assessor's s.d. across the ten seeds ranges between 0.001 and 0.007 for AUROC and between 0.001 and 0.005 for ECE among subject LLMs).

Extended Data Table 4 | Benchmark out-of-distribution predictability results of 15 LLMs for the ADeLe battery using tenfold cross-validation, averaged across ten seeds (all else equal to Extended Data Table 2, except that the RF assessor's s.d. across the ten seeds now ranges between 0.004 and 0.02 for AUROC and between 0.003 and 0.03 for ECE among subject LLMs).
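The demand-based RF assessor scored with AUROC and ECE in Extended Data Tables 2-4 can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the synthetic data generator, the 18-column feature layout, the logistic link used to simulate a subject and all numeric settings are assumptions made for the example.

```python
# Sketch of a demand-based assessor: a random forest maps the 18 annotated
# demand levels of an instance to a probability of success, evaluated with
# AUROC and ECE. All data here are synthetic stand-ins for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for rubric annotations: 18 demand levels (0-5) per instance.
n_instances, n_demands = 2000, 18
X = rng.integers(0, 6, size=(n_instances, n_demands)).astype(float)

# Hypothetical "subject LLM": success probability drops as total demand rises.
p_success = 1.0 / (1.0 + np.exp(0.15 * (X.sum(axis=1) - 45.0)))
y = (rng.random(n_instances) < p_success).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Train the assessor and predict the probability of success on held-out items.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
probs = rf.predict_proba(X_te)[:, 1]

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: frequency-weighted |accuracy - mean confidence| over probability bins."""
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

auroc = roc_auc_score(y_te, probs)
ece = expected_calibration_error(y_te, probs)
print(f"AUROC={auroc:.3f}  ECE={ece:.3f}")
```

In this scheme, the same trained assessor can then be applied to any new task whose instances have been annotated with the DeLeAn rubrics, which is what makes instance-level prediction possible out of distribution.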
Extended Data Table 5 | Dimensions and subdimensions in the DeLeAn rubric set v1.0. The first 18 (grouped, in sequential order, into 11 'elemental', five 'knowledge' and two 'extraneous') are demand scales in the range (0, 5), whereas UG (Unguessability) is not a demand: it is another extraneous dimension representing 1 minus the probability of success by random guessing or a naive method. For instance, a multiple-choice question with four options would have a UG value of 75%. Full rubrics are in Supplementary Information Section 2.
Table 1 | Diagnosis of the challenges of present AI evaluation paradigms, associated new findings revealed by the methodological solutions contributed in this paper and the potential applications of the new methodology (expanded in Methods section 'Pipeline…').

Table 2 | Sensitivity and specificity analysis of a subset of 20 benchmarks in ADeLe.
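As a minimal illustration of the characteristic curves of Extended Data Fig. 4 (step (2) of the 'System Process' in Extended Data Fig. 5), per-level success rates can be fitted with a logistic curve. The data points below are invented, and summarizing ability as the level at which success crosses 50% is an assumption for this sketch, not necessarily the paper's exact extraction method.

```python
# Sketch of fitting a characteristic curve: average success per demand level,
# fitted with a logistic function whose output lies in (0, 1).
import numpy as np
from scipy.optimize import curve_fit

def logistic(level, a, b):
    """Probability of success as a function of demand level; a is the
    steepness and b the level at which the curve crosses 0.5."""
    return 1.0 / (1.0 + np.exp(a * (level - b)))

# Hypothetical average performance at demand levels 2-5 (levels 0-1 have no
# points, as in the KNn example of Extended Data Fig. 4).
levels = np.array([2.0, 3.0, 4.0, 5.0])
success = np.array([0.95, 0.80, 0.45, 0.15])

(a, b), _ = curve_fit(logistic, levels, success, p0=[1.0, 3.5])

# b gives a one-number summary per dimension: the demand level at which the
# subject's probability of success drops to 50% (an assumed ability proxy).
print(f"steepness={a:.2f}, ability (50% crossing)={b:.2f}")
```

Repeating this fit for each of the 18 dimensions yields a full ability profile for a subject, which can then be read against the demand profile of any new task on the same general scales.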