Geometric Deep Learning for Molecular
Modelling and Design

Chaitanya Krishna Joshi

Clare Hall

July, 2025

This thesis is submitted for the degree of Doctor of Philosophy


Declaration

This thesis is the result of my own work and includes nothing which is the outcome of work done
in collaboration except as declared in the preface and specified in the text. It is not substantially
the same as any work that has already been submitted, or is being concurrently submitted, for
any degree, diploma or other qualification at the University of Cambridge or any other University
or similar institution except as declared in the preface and specified in the text. It does not exceed
the prescribed word limit for the relevant Degree Committee.

Chaitanya Krishna Joshi
6 July, 2025


Abstract

Molecules are the foundations of biological life and physical materials. Computational
modelling of molecular behaviour remains a grand challenge in science as molecules span
a spectrum of complexity: from periodic crystals to non-periodic biomolecules; from small
drug-like molecules with dozens of atoms to massive proteins with thousands; and from data-rich
domains like protein structures to data-scarce contexts like nucleic acids. Despite this diversity,
all molecular systems share fundamental building blocks: atoms and their interactions in three-
dimensional space governed by physical laws. This thesis develops Geometric Deep Learning
models that leverage these shared principles to advance molecular modelling.

The first part establishes unified foundations for molecular representation learning and gener-
ative modelling. I first introduce the Geometric Weisfeiler-Leman Test (GWL), a mathematical
framework that characterizes the expressive power of neural networks respecting physical sym-
metries in 3D space. GWL provides a unified theory for roto-translationally invariant and
equivariant Graph Neural Networks, and offers mechanistic insights into how different architec-
tures distinguish 3D molecular structures. Building on these insights, I introduce the All-atom
Diffusion Transformer (ADiT), a unified generative architecture that models both periodic
crystals and non-periodic molecules. ADiT demonstrates that joint training across diverse
structural datasets enables scaling capabilities analogous to large language models, achieving
state-of-the-art performance across molecular generation benchmarks.

The second part introduces gRNAde, a novel generative RNA inverse design toolkit. gRNAde
is a structure-conditioned RNA language model that addresses the unique challenges of RNA
molecules, including limited data and inherent flexibility, by leveraging the Geometric Deep
Learning principles established in the first part. Validated through wet lab experiments, gRNAde
demonstrates superior performance over existing physics-based methods and achieves human
expert-level accuracy in pseudoknotted RNA design while remaining fully automated and
scalable. Most notably, gRNAde successfully generates functional RNA enzymes that are
evolutionarily distant from known sequences, opening new avenues for designing RNA structures
with programmable biological functions.


Acknowledgements

First and foremost, I am deeply grateful to my supervisor, Pietro Liò. Pietro created an envi-
ronment of extraordinary freedom, kindness, and collaborative spirit within our research group.
His generosity with his time and his willingness to serve as both a personal and professional
mentor at critical moments throughout this PhD have shaped me in ways I am only beginning to
appreciate. I could not have asked for a more supportive guide and a dear friend. Pietro, and the
broader Cambridge environment, gave me the space and encouragement to explore what kind
of scientist I want to be: to try my hand at theory, at engineering and scaling experiments, and
ultimately at applications validated in the wet lab.

The Computer Laboratory and the Artificial Intelligence Group have been a wonderful intel-
lectual home. Clare Hall, my college, provided a warm and welcoming community throughout.
I am especially thankful to the many friends and colleagues who shared this journey together:
Simon Mathis, Charlie Harris, Alex Norcliffe, Iulia Dutta, Charlotte Magister, Julia Komorowska,
Miruna Cretu, Vladimir Radenkovic, Petar Veličković, Ramon Viñas, Arian Jamasb, Rishabh
Anand, Kieran Didi, Alex Abrudan, Cătălina Cangea, Paul Scherer, Dobrik Georgiev, Cris Bod-
nar, Andrew Blake, Srijit Seal, Adham El-Shazly, and many others. Thank you for the endless
discussions, the shared conference trips, and for keeping me inspired and sane throughout.

This thesis owes a great deal to the scientists and mentors I had the privilege of collaborating
with along the way. Early in my PhD, Taco Cohen and Michael Bronstein helped me develop
the theoretical foundations upon which all the subsequent work in this thesis was built. I was
fortunate to spend two summer internships at the frontier of AI and molecular science: Andreas
Loukas, Jan Ludwiczak, Pan Kessel, Kyunghyun Cho, and the Prescient Design team welcomed
me to Basel, and Zachary Ulissi, Anuroop Sriram, Xiang Fu, Larry Zitnick, and the FAIR
Chemistry team at Meta hosted me in San Francisco. Both experiences were formative: they
showed me firsthand how AI can impact medicine and materials science, and the exceptional
resources and guidance accelerated my growth as a scientist. I am grateful to Rhiju Das for his
infectious conviction that “research doesn’t count unless it involves blind experimental tests,”
which motivated me to go beyond computational benchmarks and into the wet lab. I also thank
Gábor Csányi for stimulating discussions over the course of my PhD and for examining this
thesis with such rigour and thoroughness.


I owe a special debt of gratitude to Philipp Holliger, Edoardo Gianni, Samantha Kwok,
and the entire Holliger Lab for welcoming me into the MRC Laboratory of Molecular Biology
towards the end of my PhD. Being embedded in the LMB and learning to speak the language
of experimentalists transformed how I think about my research. The experience of holding a
pipette for the first time and seeing AI-designed RNA molecules come to life in the lab will stay
with me forever.

I am also thankful to the Learning on Graphs (LoG) conference community. Being part of
founding a new conference from the ground up, with the collective energy and support of so
many in our field, has been one of the most rewarding experiences of my PhD.

I am grateful to A*STAR, Singapore, for their generous support through the National Science
Scholarship, which made this PhD possible. I am especially thankful to Phebe Lim, Regina Chen,
Chay Wah Tay, and the entire Graduate Academy team for their responsiveness. I also thank
Yue Wan, Mile Šikić, Roger Foo, Chuan Sheng Foo, Cheston Tan, and others for hosting me
during my trips home to Singapore and for helping me stay connected to the Singapore research
community while being away.

Above all, I thank my dear family. My parents, Veena and Vivek Joshi, have been a constant
source of support and encouragement. Together with my grandparents, aunts, and uncles, they
have taken a genuine interest in understanding my curiosities since a young age and raised me to
never stop asking questions; that instinct is at the heart of everything in this thesis. My sister,
Kuhu Joshi, is my inspiration. Watching her chart her own path from economics to poetry, and
seeing her thrive in her element, has reminded me that the most rewarding frontiers are often
the most uncharted. Finally, my deepest gratitude goes to my wife, my closest confidant, and
my dearest friend, Genevieve Lam. Her curiosity and enthusiasm for life have been my greatest
source of strength, and our relationship is the foundation upon which this thesis, and our life
together, has been built.


Contents

1 Introduction
1.1 Research Questions
1.2 Thesis Outline
1.3 List of Publications

2 Preliminaries: Deep Learning for Molecular Structure Modelling
2.1 Primer on Molecular Systems
2.2 Molecular Systems as 3D Geometric Graphs
2.3 Representation Learning of Molecular Structure
2.4 Generative Modelling of Molecular Systems

I Molecular Representation Learning and Generative Modelling

3 Expressive Power of Molecular Structure Representations
3.1 Limitations of the Weisfeiler-Leman Test
3.2 The Geometric Weisfeiler-Leman Framework
3.3 Understanding the Geometric GNN Design Space
3.4 Synthetic Experiments on Expressivity
3.5 Experiments on Protein Representation Learning
3.6 Related Work
3.7 Summary

4 Unified Generative Modelling of Molecules and Materials
4.1 All-atom Diffusion Transformers
4.2 Experimental Setup
4.3 Results
4.4 Related Work
4.5 Summary

II RNA Molecule Design

5 gRNAde: Geometric Deep Learning for 3D RNA inverse design
5.1 The gRNAde Model
5.2 Experimental Setup
5.3 Results
5.4 Related Work
5.5 Summary

6 Inverse Design of RNA Structure and Function with gRNAde
6.1 An RNA Inverse Design Pipeline with gRNAde
6.2 Expert-level Design of RNA Pseudoknotted Structures
6.3 Inverse Design of Functional Polymerase Ribozymes
6.4 Summary

7 Conclusion
7.1 Summary of contributions
7.2 Discussion
7.3 Future Directions

References

A Appendix: Expressive Power of Molecular Structure Representations (Chapter 3)
A.1 Geometric GNN Design Space Proofs
A.2 Proofs for Equivalence between GWL and Geometric GNNs (Section 3.2.2)

B Appendix: Unified Generative Modelling of Molecules and Materials (Chapter 4)
B.1 Evaluation Metrics
B.2 Additional Results
B.3 Ablation Study

C Appendix: gRNAde: Geometric Deep Learning for 3D RNA inverse design (Chapter 5)
C.1 Ablation Study
C.2 Additional Results
C.3 RNASolo data statistics

D Appendix: Inverse Design of RNA Structure and Function with gRNAde (Chapter 6)


Chapter 1

Introduction

Molecular systems are the fundamental building blocks of our world. They form the basis of
biological life, serve as the foundation for medicines that treat disease, and constitute physical
materials around us. At their core, molecules are collections of atoms interacting in three-
dimensional space, governed by the fundamental laws of physics and chemistry. The last decade
has witnessed remarkable progress in computational modelling of molecular systems using
deep learning. Deep learning offers a data-driven paradigm for predictive understanding of the
functional properties of molecular systems, towards enabling the discovery of novel molecules
with desired behaviours [Sanchez-Lengeling and Aspuru-Guzik, 2018, Stokes et al., 2020].

The most notable breakthroughs that have inspired this thesis are those recognized by the
Nobel Prize in Chemistry 2024: highly accurate protein structure prediction [Jumper et al., 2021]
and de novo design of proteins with bespoke functionality [Dauparas et al., 2022, Watson et al.,
2023]. These foundational techniques are now being extended to biomolecular interactions
involving proteins, nucleic acids, and other molecules [Abramson et al., 2024].

Concurrently, machine learning-based interatomic potentials are transforming molecular
dynamics simulation and property prediction [Behler and Parrinello, 2007]. Deep learning has
enabled broadly generalizable representations of atomic interactions across organic molecules
and inorganic materials, closely matching the accuracy of quantum mechanical simulations at a
fraction of the computational cost [Batatia et al., 2023, Wood et al., 2025].

Central to these breakthroughs are deep learning architectures that fall under the umbrella
of Geometric Deep Learning [Bronstein et al., 2021]: an approach to neural network design
that incorporates fundamental physical principles and symmetries into model architectures. For
molecular modelling, this translates to architectural components and inductive biases specifi-
cally tailored to addressing three broad challenges: variations in sizes, local versus long-range
interactions, and 3D geometric symmetries.

Firstly, atomic systems exhibit remarkable diversity in size and complexity. They range
from small drug-like molecules containing tens of atoms to biomolecular complexes with
tens of thousands of atoms. When atoms or ions are packed together to form materials, they
can extend infinitely through space as periodically repeated crystal structures. Deep learning
architectures for molecular modelling must accommodate this variability in system size and
periodicity without requiring fixed input dimensions. Set-based architectures such as Graph
Neural Networks (GNNs) [Battaglia et al., 2018] and Transformers [Vaswani et al., 2017a],
which process variable-sized collections of atoms, satisfy this requirement.
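
To make the idea of a set-based architecture concrete, the following is a minimal sketch in Python of a sum-pooled readout over per-atom embeddings. The function names and the fixed sine/cosine embedding are hypothetical stand-ins for learned components and are not taken from this thesis. The sketch only illustrates that such a readout is defined for any number of atoms and does not depend on the order in which atoms are listed.

```python
import math


def embed_atom(atomic_number):
    """Hypothetical fixed per-atom embedding (a stand-in for a learned one)."""
    return [math.sin(atomic_number), math.cos(atomic_number)]


def pooled_representation(atomic_numbers):
    """Sum per-atom embeddings: defined for any number of atoms, in any order."""
    embeddings = [embed_atom(z) for z in atomic_numbers]
    # zip(*embeddings) groups the k-th entry of every embedding together.
    return [sum(column) for column in zip(*embeddings)]


water = [8, 1, 1]          # O, H, H
methane = [6, 1, 1, 1, 1]  # C, H, H, H, H

water_rep = pooled_representation(water)
water_rep_reordered = pooled_representation([1, 8, 1])
methane_rep = pooled_representation(methane)
```

Both systems map to a vector of the same length despite having different numbers of atoms, and reordering the atoms of water changes the result only up to floating-point rounding.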

Additionally, the physical behaviour of molecules is governed by an interplay of both local
and long-range atomic interactions. Many fundamental properties emerge from short-range
interactions such as covalent bonds or hydrogen bonds. This locality principle is algorithmically
aligned with message passing-based GNN architectures [Xu et al., 2020], which iteratively
aggregate information from local atomic environments to build up representations of molecular
structure [Gilmer et al., 2017]. However, molecules also exhibit long-range interactions, such as
van der Waals forces or Coulomb repulsion, that cannot be captured by purely local models. For
instance, proteins fold into stable 3D structures through interactions between residues distant
in sequence space, so generating or predicting protein structures requires maintaining global
consistency to ensure physical validity [Jumper et al., 2021]. Such problems are well aligned with
the self-attention mechanism in Transformers, which allows for direct communication between
all pairs of atoms, regardless of whether they are locally connected. Further, Transformers can
be seen as message-passing on complete graphs, thereby unifying models for local (GNNs) and
global (Transformers) interactions through a common geometric lens [Joshi, 2025].
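
The view of Transformers as message passing on complete graphs can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses raw dot-product attention without the learned query, key, and value projections of a real Transformer, and the helper names (attention_message_passing, chain_graph, complete_graph) are hypothetical. The same update rule behaves as a local model when each atom aggregates from a restricted neighbourhood, and as global self-attention when every atom aggregates from all atoms.

```python
import math


def attention_message_passing(features, neighbours):
    """One round of attention-weighted message passing.

    features:   list of feature vectors (one list of floats per atom)
    neighbours: neighbours[i] lists the atom indices that atom i aggregates from
    """
    dim = len(features[0])
    updated = []
    for i, h_i in enumerate(features):
        # Scaled dot-product score between atom i and each neighbour j.
        scores = [
            sum(a * b for a, b in zip(h_i, features[j])) / math.sqrt(dim)
            for j in neighbours[i]
        ]
        # Softmax over the neighbourhood, shifted by the max for stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # The aggregated message is the attention-weighted sum of neighbour features.
        message = [
            sum(w / total * features[j][k] for w, j in zip(weights, neighbours[i]))
            for k in range(dim)
        ]
        updated.append(message)
    return updated


def chain_graph(n):
    """Local connectivity: each atom sees itself and its neighbours along a chain."""
    return [[j for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)]


def complete_graph(n):
    """Global connectivity: every atom sees every atom, as in self-attention."""
    return [list(range(n)) for _ in range(n)]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
local_update = attention_message_passing(features, chain_graph(4))
global_update = attention_message_passing(features, complete_graph(4))
```

With the chain graph, the update of the first atom cannot be affected by the last atom after a single round, whereas with the complete graph every atom influences every other atom directly.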

Finally, molecules exist in 3D Euclidean space and possess fundamental geometric symme-
tries. Functional properties of molecules are symmetric under rigid geometric transformations of
their structures, such as global rotations, translations, and reflections [Musil et al., 2021]. Modern
geometric deep learning architectures incorporate these symmetries through two complementary
approaches. Explicit symmetry ensures that learned representations transform covariantly (equiv-
ariantly) with 3D transformations of structures, guaranteeing that internal features respect the
same geometric principles as the physical quantities they represent [Thomas et al., 2018]. This
approach is often data-efficient [Batzner et al., 2022] and produces representations with clear
physical interpretations [Fu et al., 2025]. Alternatively, implicit symmetry does not hard-code
geometric constraints into architectures and learns approximate symmetries from data. While
this approach enables more flexible and expressive models, it requires larger training datasets
and greater computational resources for effective learning [Wang et al., 2024].
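
The distinction between invariant and equivariant quantities can be checked numerically. The sketch below, in Python, uses toy coordinates and hypothetical helper names, and covers only a rotation about the z-axis and a translation (reflections are omitted for brevity). Pairwise distances serve as an example of an invariant feature, and centroid-relative positions as an example of a feature that rotates together with the structure.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def translate(points, shift):
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in points]


def pairwise_distances(points):
    """Invariant feature: unchanged by rotations and translations."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


def centred_positions(points):
    """Equivariant feature: displacement of each atom from the centroid."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - centroid[k] for k in range(3)) for p in points]


def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for x, y in zip(a, b))


atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]  # toy coordinates
angle = 0.7
moved = translate(rotate_z(atoms, angle), (1.0, -2.0, 3.0))

# Invariance: the feature of the transformed structure equals the original feature.
invariant_ok = close(pairwise_distances(atoms), pairwise_distances(moved))

# Equivariance: transforming the input and then computing the feature equals
# computing the feature and then rotating it (the translation cancels on centring).
equivariant_ok = all(
    close(a, b)
    for a, b in zip(centred_positions(moved), rotate_z(centred_positions(atoms), angle))
)
```

An architecture with explicit symmetry guarantees properties of this kind by construction for its internal features, whereas a model with implicit symmetry would have to learn them approximately from data.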

Together, these developments promise a new era of foundation models for molecular mod-
elling [Bommasani et al., 2021], providing an accurate and scalable toolkit for molecular
discovery. At the same time, there remain fundamental open questions about the theoretical
limits of these architectures, their generalizability across the diversity of molecular systems, and
their application to challenging problems at the forefront of biochemistry and materials science.
This thesis represents my explorations into this frontier, spanning theoretical and methodological
foundations of molecular modelling, as well as real-world applications in molecular design.


1.1 Research Questions

This thesis is about developing new deep learning techniques for modelling and designing
molecular systems.

The story begins with representation learning, which is the foundation for both predictive and
generative modelling. While geometric deep learning architectures have been very successful
for learning molecular representations, as summarised in the previous section, we still lack a
formal and unified understanding of how different architectural properties affect the class of
functions that a model can express, also known as its expressive power [Raghu et al., 2017].
This limited understanding of why models succeed or fail constrains our ability to design
new architectures in a principled way, bringing us to our first research question:

Q1: What is a unified theoretical framework for characterizing the expressive power
of 3D molecular representations learnt by geometric deep learning models?

During the course of my research, highly expressive generative models trained on extremely
large-scale datasets led to a paradigm shift across AI [Bommasani et al., 2021]. A key factor
enabling these foundation models was a unification of diverse but interconnected data sources
for pre-training (e.g. all the text on the internet [Achiam et al., 2023]), which enabled models to
learn general-purpose representations and transfer knowledge across related domains such as
mathematics and programming.

Similarly, we know that the physical principles that govern atomic interactions are shared
across diverse molecular systems, ranging from organic molecules to inorganic crystals. However,
current generative models of 3D molecular structures are highly domain specific and not broadly
applicable. Towards developing generative foundation models for molecular structures, we arrive
at our second research question:

Q2: What is the architecture of a unified molecular generative model that benefits
from transfer learning across atomic interactions?

Having established unified foundations for representation learning and generative modelling,
I then explored real-world applications in the inverse design of Ribonucleic Acids (RNA).

I was drawn to RNA due to its increasingly central role in modern molecular biology and
biotechnology.¹ RNA molecules are nature’s computers, capable of both information processing and
catalysis [Cech, 2024]. Yet, RNA structure modelling and design remain extremely challenging
due to a paucity of data and the inherently dynamic nature of RNA molecules. Our final research
question tackles new frontiers in RNA design:

Q3: Can we develop a generative inverse design toolkit for RNA structure and
function? And what new experimental capabilities will this enable in wet labs?

¹ As the webcomic XKCD put it, "Life is a seething mass of RNA that sometimes use DNA to take notes. What do
the proteins do? Errands for RNA." (XKCD #3056, https://xkcd.com/3056/)


[Figure 1.1 panels: Representation Learning (Geometric Weisfeiler-Leman test characterises expressivity of 3D Graph Neural Networks); Generative Modelling (All-atom Diffusion Transformer unifies generation of periodic crystals and non-periodic molecular structures); Inverse Design (gRNAde is a Generative AI toolkit for designing 3D RNA structure and function).]

Figure 1.1: Overview of thesis contributions. In Chapter 3, I address RQ1 by proposing the
Geometric Weisfeiler-Leman test, a theoretical framework for understanding the expressivity of
3D molecular representations of Geometric GNNs. In Chapter 4, I introduce All-atom Diffusion
Transformer, the first unified generative model for both periodic crystals and non-periodic
molecules to benefit from transfer learning, which addresses RQ2. Finally, I address RQ3 in
Chapter 5 by developing gRNAde, a novel generative inverse design toolkit for RNA molecules,
which we validate through wet lab experiments in Chapter 6.

1.2 Thesis Outline

This section outlines the chapters of this thesis and summarises how each contributes to
answering the research questions above. The main contributions are summarised in Figure 1.1.

Chapter 2: Preliminaries: Deep Learning for Molecular Structure Modelling

I present a self-contained introduction to deep learning fundamentals for molecular
structure modelling. I begin with a concise overview of different types of molec-
ular systems and associated mathematical concepts, such as 3D geometric graphs,
symmetry groups, and equivariance. I then survey deep learning architectures for
representation learning and generative modelling of 3D molecular structure, intro-
ducing techniques such as Geometric Graph Neural Networks, Transformers, and
Diffusion models.


Part I: Molecular Representation Learning and Generative Modelling

Chapter 3: Expressive Power of Molecular Structure Representations

I introduce the Geometric Weisfeiler-Leman (GWL) test, a generalisation of the
classic Weisfeiler-Leman algorithm for discriminating geometric graphs while re-
specting underlying 3D symmetries. The GWL framework unifies various classes
of Geometric GNN architectures for molecules, and provides a theoretical char-
acterization of their expressive power. Through GWL, I derive new mechanistic
insights into molecular representation learning, including advantages of equivariant
models over invariant ones, and how higher-order representations enable maximally
expressive architectures. To complement this theoretical framework, I present a suite
of synthetic experiments and a real-world protein function prediction benchmark.

This chapter addresses RQ1 by establishing a unified theoretical framework for
characterizing the expressive power of 3D molecular representations learnt by
Geometric GNNs.
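
For readers unfamiliar with the classic Weisfeiler-Leman algorithm that the GWL test generalises, the following Python sketch implements standard 1-WL colour refinement on plain graphs. It is not the GWL test itself, which is developed in Chapter 3 and is not reproduced here. The function name and the toy graphs are illustrative assumptions. The example shows a well-known case in which the classic test, which uses only connectivity, assigns identical colour histograms to two non-isomorphic graphs.

```python
from collections import Counter


def wl_histograms(graphs, rounds=3):
    """Classic 1-WL colour refinement, run on several graphs at once.

    Each graph is a dict mapping a node to its list of neighbours. A shared
    palette is used in every round so colours are comparable across graphs.
    """
    colourings = [{v: 0 for v in g} for g in graphs]  # uniform initial colours
    for _ in range(rounds):
        palette = {}
        refined = []
        for g, colours in zip(graphs, colourings):
            new_colours = {}
            for v in g:
                # Signature: own colour plus the sorted multiset of neighbour colours.
                signature = (colours[v], tuple(sorted(colours[u] for u in g[v])))
                new_colours[v] = palette.setdefault(signature, len(palette))
            refined.append(new_colours)
        colourings = refined
    return [Counter(c.values()) for c in colourings]


six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

cycle_hist, triangles_hist, path_hist = wl_histograms([six_cycle, two_triangles, path])

# Different histograms prove two graphs are non-isomorphic; equal histograms
# are inconclusive, as for the six-cycle and the two disjoint triangles.
indistinguishable = cycle_hist == triangles_hist
distinguishable = cycle_hist != path_hist
```

Because every node in the six-cycle and in the two triangles has exactly two neighbours, refinement never separates them, while the path graph is separated in the first round by its two end nodes.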

Chapter 4: Unified Generative Modelling of Molecules and Materials

I propose the All-atom Diffusion Transformer (ADiT), a unified generative modelling
architecture capable of jointly learning from both periodic crystals and non-periodic
molecules. ADiT is a latent diffusion model that embeds 3D molecular structures
into a shared latent space, and subsequently learns to sample new latents followed
by mapping them to valid structures. ADiT achieves state-of-the-art performance for
generative modelling across both molecules and materials, outperforming specialized
system-specific methods while benefiting from transfer learning across domains. I
further show that ADiT is significantly more scalable than previous approaches and
that scaling ADiT’s model parameters predictably improves performance, towards
the goal of a unified foundation model for molecular design.

This chapter addresses RQ2 by developing a unified generative architecture that
enables transfer learning across diverse atomic systems.

Part II: RNA Molecule Design

Chapter 5: gRNAde: Geometric Deep Learning for 3D RNA inverse design

I introduce gRNAde, a novel toolkit for 3D RNA inverse design leveraging geomet-
ric deep learning to address the unique challenges of RNA modelling, including
limited data and conformational flexibility. gRNAde is a structure-conditioned RNA
language model that uses a multi-state Geometric GNN to generate sequences condi-
tioned on one or more 3D backbone structures. I present computational benchmarks
demonstrating gRNAde’s improved performance, speed and capabilities compared
to state-of-the-art physics-based tools for RNA design.

This chapter addresses the first part of Q3 by developing the first generative inverse
design toolkit for RNA structures.

Chapter 6: Inverse Design of RNA Structure and Function with gRNAde

I present an RNA inverse design pipeline that integrates gRNAde with computational
screening and wet lab validation. I demonstrate that the gRNAde pipeline matches
human expert performance in designing diverse pseudoknotted RNA structures
while being fully automated. Further, I show how gRNAde enables the design of
RNA enzymes (ribozymes) that are significantly distant in mutational space from
known functional sequences, opening new avenues for designing RNA structures
with bespoke biological functions.

This chapter completes Q3 by experimentally validating the gRNAde toolkit’s
capabilities at designing RNA structure and function in real wet lab settings.

Chapter 7: Conclusion

The final chapter reviews the contributions proposed in this thesis, reflects on
unifying themes across the chapters, and discusses future research directions.

1.3 List of Publications

Here, I provide the list of publications that I have co-authored during my PhD, together with a
brief description of my contributions to each publication.

1.3.1 Thesis Publications and Contributions

Chapter 2 is written from scratch, with some content abridged from Duval et al. [2023a].

A Hitchhiker’s Guide to Geometric GNNs for 3D Atomic Systems.
A. Duval∗, S. V. Mathis∗, C. K. Joshi∗, V. Schmidt∗, S. Miret, F. D. Malliaros, T.
Cohen, P. Liò, Y. Bengio, and M. Bronstein. (∗equal first authors)
Preprint, 2023.

I conceived the survey jointly with Alexandre Duval, Simon V. Mathis, and Victor Schmidt,
created a majority of the figures, and contributed extensively to all aspects of the paper.


Chapter 3 is primarily based on Joshi et al. [2023].

On the Expressive Power of Geometric Graph Neural Networks.
C. K. Joshi∗, C. Bodnar∗, S. V. Mathis, T. Cohen, and P. Liò. (∗equal first authors)
International Conference on Machine Learning (ICML), 2023.

Also oral presentation at NeurIPS 2022 Symmetry & Geometry Workshop.

I had the key idea of theoretically characterising the expressive power of Geometric GNNs, with
inputs from Taco Cohen. Cris Bodnar and I jointly conceived the Geometric Weisfeiler-Leman
test, and the theoretical results included in this thesis were derived by me. I conducted all the
synthetic experiments, with inputs from Simon V. Mathis. I wrote the majority of the paper with
inputs from all other authors.

The chapter also includes experimental results on Geometric GNNs for protein function
prediction from Jamasb et al. [2024], which were implemented by myself and Arian Jamasb.

Evaluating Representation Learning on the Protein Structure Universe.
A. R. Jamasb∗, A. Morehead∗, C. K. Joshi∗, Z. Zhang∗, K. Didi, S. V. Mathis, C.
Harris, J. Tang, J. Cheng, P. Liò, and T. L. Blundell. (∗equal contribution)
International Conference on Learning Representations (ICLR), 2024.

Chapter 4 is based on Joshi et al. [2025a].

All-atom Diffusion Transformers: Unified generative modelling of molecules
and materials.
C. K. Joshi, X. Fu, Y.-L. Liao, V. Gharakhanyan, B. K. Miller, A. Sriram, and Z. W.
Ulissi.
International Conference on Machine Learning (ICML), 2025.

Also oral presentation at ICLR 2025 AI for Accelerated Materials Design Workshop.

I had the key idea of using latent diffusion models for unified generative modelling of molecules
and materials, with inputs from Xiang Fu. I developed the research, conducted the experiments
and wrote the paper with inputs from all the other authors.

Chapter 5 is based on Joshi et al. [2025c] and Joshi and Liò [2024].

gRNAde: Geometric Deep Learning for 3D RNA inverse design.
C. K. Joshi, A. R. Jamasb, R. Viñas, C. Harris, S. V. Mathis, A. Morehead, R.
Anand, and P. Liò.
International Conference on Learning Representations (ICLR), 2025. Spotlight
presentation.
Also an invited book chapter in RNA Design: Methods and Protocols, pp. 121-135,
Springer, Methods in Molecular Biology (MIMB, volume 2847), 2024.


I had the key idea of 3D structure-based and multi-state RNA inverse design, with inputs from
Ramon Viñas. I developed the research, conducted the experiments and wrote the paper with
inputs from all the other authors.

Chapter 6 is based on Joshi et al. [2025b].

Generative inverse design of RNA structure and function with gRNAde.
C. K. Joshi∗, E. Gianni∗, S. L. Y. Kwok∗, S. V. Mathis, P. Liò, and P. Holliger.
(∗equal contribution)
Preprint, 2025.

I conceived the RNA design pipeline, with inputs from all the other authors. I generated
computational designs, with inputs from Simon V. Mathis. The wet lab experimental validation
took place in the laboratories of Dr. Philipp Holliger (MRC Laboratory of Molecular Biology,
Cambridge) and Prof. Rhiju Das (Department of Biochemistry, Stanford University). I wrote the
majority of the paper with inputs from all other authors.

1.3.2 Other Publications

I have also contributed to the following publications, which are not included in this thesis and are
listed in reverse chronological order:

Multi-state Protein Design with DynamicMPNN.
A. Abrudan∗, S. Pujalte Ojeda∗, C. K. Joshi, M. Greenig, F. Engelberger, A.
Khmelinskaia, J. Meiler, M. Vendruscolo, T. P. J. Knowles. (∗equal contribution)
International Conference on Learning Representations (ICLR), 2026.

Also presented at ICML 2025 Workshop on Generative AI and Biology.

Multi-scale Protein Structure Modelling with Geometric Graph U-Nets.
C. Liu∗, V. Li∗, L. Leong, V. Radenkovic, P. Liò, and C. K. Joshi. (∗equal contribution)
Machine Learning in Structural Biology (MLSB), 2025.

Artificial Intelligence for Science in Quantum, Atomistic, and Continuum
Systems.
X. Zhang∗, L. Wang∗, J. Helwig∗, Y. Luo∗, C. Fu∗, Y. Xie∗, M. Liu, Y. Lin, Z. Xu,
K. Yan, K. Adams, M. Weiler, X. Li, T. Fu, Y. Wang, A. Strasser, H. Yu, Y. Xie,
X. Fu, S. Xu, Y. Liu, Y. Du, A. Saxton, H. Ling, H. Lawrence, H. Stärk, S. Gui,
C. Edwards, N. Gao, A. Ladera, T. Wu, E. F. Hofgard, A. M. Tehrani, R. Wang, A.
Daigavane, M. Bohde, J. Kurtin, Q. Huang, T. Phung, M. Xu, C. K. Joshi, S. V.
Mathis, K. Azizzadenesheli, A. Fang, A. Aspuru-Guzik, E. Bekkers, M. Bronstein,
M. Zitnik, A. Anandkumar, S. Ermon, P. Liò, R. Yu, S. Günnemann, J. Leskovec, H.
Ji, J. Sun, R. Barzilay, T. Jaakkola, C. W. Coley, X. Qian, X. Qian, T. Smidt, S. Ji.
(∗equal contribution)
Foundations and Trends in Machine Learning, 2025.

LeMat-GenBench: Bridging the Gap between Crystal Generation and Materials
Discovery.
S. Betala, S. P. Gleason, A. Ramlaoui, A. Xu, G. Channing, D. Levy, C. Fourrier, N.
Kazeev, C. K. Joshi, S.-O. Kaba, F. Therrien, A. Hernandez-Garcia, R. Mercado, N.
M. Krishnan, A. Duval
NeurIPS 2025 Workshop on AI for Accelerated Materials Design, 2025.

Machine Learning for Toxicity Prediction Using Chemical Structures: Pillars
for Success in the Real World.
S. Seal, M. Mahale, M. García-Ortegón, C. K. Joshi, L. Hosseini-Gerami, A. Beatson,
M. Greenig, M. Shekhar, A. Patra, C. Weis, A. Mehrjou, A. Badré, B. Paisley, R.
Lowe, S. Singh, F. Shah, B. Johannesson, D. Williams, D. Rouquie, D.-A. Clevert, P.
Schwab, N. Richmond, C. A. Nicolaou, R. J. Gonzalez, R. Naven, C. Schramm, L.
R. Vidler, K. Mansouri, W. P. Walters, D. D. Wilk, O. Spjuth, A. E. Carpenter, and
A. Bender.
ACS Chemical Research in Toxicology, 2025.

Towards Mechanistic Interpretability of Graph Transformers via Attention
Graphs.
B. El∗, D. Choudhury∗, P. Liò, and C. K. Joshi. (∗equal contribution)
ICLR 2025 Workshop on XAI4Science, 2025.

Understanding Biology in the Age of Artificial Intelligence.
E. Lawrence, A. El-Shazly, S. Seal, C. K. Joshi, P. Liò, S. Singh, A. Bender, P.
Sormanni, M. Greenig.
Preprint, 2024.

RNA-FrameFlow: Flow Matching for de novo 3D RNA backbone design.
R. Anand∗, C. K. Joshi∗, A. Morehead, A. R. Jamasb, C. Harris, S. Mathis, K. Didi,
B. Hooi, and P. Liò. (∗equal contribution)
Machine Learning for Computational Biology (MLCB), 2024. Oral presentation.
Also oral presentation at ICML 2024 AI4Science Workshop.



PoseCheck: Generative Models for 3D Structure-based Drug Design Produce
Unrealistic Poses.
C. Harris, K. Didi, A. R. Jamasb, C. K. Joshi, S. V. Mathis, P. Liò, and T. Blundell.
NeurIPS Workshop on Machine Learning for Structural Biology, 2023.

Group Invariant Global Pooling.
K. Bujel∗, Y. Gideoni∗, C. K. Joshi, and P. Liò. (∗equal contribution)
ICML Workshop on Topology, Algebra, & Geometry, 2023.

Hypergraph Factorisation for Multi-tissue Gene Expression Imputation.
R. Viñas, C. K. Joshi, D. Georgiev, B. Dumitrascu, E. R. Gamazon, and P. Liò.
Nature Machine Intelligence, 2023. Cover article.



Chapter 2

Preliminaries: Deep Learning for
Molecular Structure Modelling

This chapter offers an overview of the deep learning fundamentals essential for molecular
structure modelling. We assume familiarity with basic machine learning concepts and common
neural network architectural elements such as Multi-Layer Perceptrons, normalisation layers,
and activation functions; see Goodfellow et al. [2016] for an introduction. We also assume basic
knowledge of concepts from physics and chemistry, such as atoms, molecules, and chemical
bonds. We will establish a common mathematical notation for representing molecules and
introduce key concepts that are crucial for understanding the subsequent chapters, such as
geometric graphs and physical symmetries. We will also survey the most important background
methods, including Graph Neural Networks, Transformers, and Diffusion generative models
with an emphasis on their application to molecular systems.

2.1 Primer on Molecular Systems

Let us begin with a brief primer on the different types of molecular systems that we will encounter
in this thesis.

2.1.1 Small Organic Molecules

Small molecules are compounds, most of them organic, typically characterized by their low molecular
weight (generally under 1000 Daltons) and relatively simple structures of up to a few dozen atoms.
These molecules are ubiquitous in daily life, from water and oxygen to glucose and caffeine,
and play crucial roles in biological processes as signaling molecules, metabolic intermediates,
therapeutic drugs, and building blocks for larger biomolecules.

Consider caffeine (C8H10N4O2), a common stimulant found in coffee and tea. As illustrated
in Figure 2.1, this molecule can be represented in multiple ways: as a SMILES string (a linear
character encoding) [Weininger, 1988], as a 2D chemical graph with atoms as nodes and bonds as
edges, or as a 3D structure showing spatial arrangements. While SMILES strings and 2D graphs
capture connectivity and chemical information, they fail to represent the complete molecular
story. The 3D geometric conformations, dynamics, and spatial interactions between atoms
ultimately drive molecular functionality and properties, making 3D representations essential for
understanding molecular behavior.

Figure 2.1: Representations of caffeine, my favorite small organic molecule. (a) SMILES
string: CN1C=NC2=C1C(=O)N(C(=O)N2C)C. (b) 2D chemical graph, with atoms as nodes and chemical
bonds as edges determined by valence rules. (c) 3D atomic structure, illustrating the spatial
arrangement of atoms. (d) Molecular surface, representing the molecule’s outer boundary.

Computationally, a small molecule with $N$ atoms is represented by atom types
$A = \{a_i\}_{i=1}^{N} \in \mathbb{Z}^{1 \times N}$ and 3D coordinates
$X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{3 \times N}$, where each $x_i \in \mathbb{R}^3$ specifies
the position of atom $i$ in space, typically measured in Angstroms ($10^{-10}$ meters).
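As a concrete illustration of this representation, the sketch below stores a water molecule as an array of atom types and a $3 \times N$ coordinate array. The geometry is built from approximate textbook values (an O-H bond of about 0.96 Angstroms and an H-O-H angle of about 104.5 degrees), and the helper name `distance` is hypothetical.

```python
import numpy as np

# Atom types as atomic numbers (O, H, H), shape (1, N).
A = np.array([[8, 1, 1]])

# Approximate water geometry (assumed values): bond length in Angstroms, angle in radians.
bond, angle = 0.96, np.deg2rad(104.5)

# Coordinates of shape (3, N): column i is the position x_i of atom i.
X = np.array([
    [0.0, bond, bond * np.cos(angle)],  # x coordinates
    [0.0, 0.0, bond * np.sin(angle)],   # y coordinates
    [0.0, 0.0, 0.0],                    # z coordinates
])

def distance(X, i, j):
    """Euclidean distance between atoms i and j, in Angstroms."""
    return float(np.linalg.norm(X[:, i] - X[:, j]))
```

Storing atoms as columns follows the $\mathbb{R}^{3 \times N}$ convention used in this subsection; many libraries use the transposed $N \times 3$ layout instead.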

Having established the fundamental representation of small molecules, we now explore how
these atomic and molecular building blocks assemble into more complex structures, such as
crystalline materials and biological macromolecules.

2.1.2 Crystalline Materials

Crystalline materials represent a fundamentally different class of atomic systems from small
organic molecules. These solid-state materials feature a highly ordered, three-dimensional
arrangement of atoms, ions, or molecules that extends infinitely in space through periodic
repetition [Ashcroft and Mermin, 1976]. Consider table salt (NaCl), which forms a face-centered
cubic crystal structure, as shown in Figure 2.2. The sodium (Na+) and chloride (Cl−) ions arrange
themselves in an alternating three-dimensional pattern, with each Na+ ion surrounded by six
Cl− ions, and vice versa. The perfectly periodic nature of crystals is a defining characteristic that
distinguishes them from other forms of matter such as liquids and amorphous solids, which lack
long-range order.

Figure 2.2: Crystal structure of halite (NaCl) salt. The crystal structure is characterized by a
repeating unit cell containing four Na+ and four Cl− ions in a face-centered arrangement. Source:
OpenGeology.org (CC BY-NC-SA 3.0).

The infinite, repeating nature of crystals requires a different computational representation
than finite molecules. Rather than listing all atoms, which would be impossible for an infinite
structure, we define a unit cell: the smallest repeating volume that contains complete structural
and symmetry information of the crystal. Within this unit cell, atomic positions are specified
using fractional coordinates $F = \{f_i\}_{i=1}^{N} \in [0, 1)^{3 \times N}$, where each
$f_i \in [0, 1)^3$ represents the position of atom $i$ relative to the unit cell boundaries. The
unit cell’s shape and size are defined by a lattice matrix $L \in \mathbb{R}^{3 \times 3}$, whose
columns are the three basis vectors spanning the crystal lattice. Absolute 3D coordinates can be
recovered via $X = LF$.
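The relation $X = LF$ can be illustrated with the conventional cubic cell of NaCl. This is a minimal sketch: the lattice constant of roughly 5.64 Angstroms is an approximate literature value, and the function names `frac_to_cart` and `cart_to_frac` are hypothetical.

```python
import numpy as np

a = 5.64           # approximate cubic lattice constant of NaCl in Angstroms (assumed)
L = a * np.eye(3)  # lattice matrix: columns are the three lattice basis vectors

# Fractional coordinates of shape (3, N): four Na+ ions followed by four Cl- ions.
F = np.array([
    [0.0, 0.5, 0.5, 0.0, 0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.5, 0.5],
])

def frac_to_cart(L, F):
    """Absolute coordinates X = L F, shape (3, N)."""
    return L @ F

def cart_to_frac(L, X):
    """Invert the map and wrap the result back into [0, 1)."""
    return np.linalg.solve(L, X) % 1.0

X = frac_to_cart(L, F)
```

In this cell the nearest Na-Cl distance, for example between columns 0 and 4 of `X`, is half the lattice constant. For a non-cubic crystal only `L` changes; the same two functions apply.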

2.1.3 Macromolecules: proteins, nucleic acids, and complexes

Macromolecules, such as proteins, nucleic acids (DNA and RNA), and their complexes, are the
fundamental building blocks of life, playing critical roles in biological processes ranging from
catalysis and structural support to genetic information storage and transfer [Alberts et al., 2022].
These molecules are characterized by their large size and complex, hierarchical structures, which
arise from the specific arrangement of smaller subunits (amino acids for proteins, nucleotides
for nucleic acids) into intricate three-dimensional conformations of thousands of atoms. This
hierarchical organization, from primary sequence to complex three-dimensional architecture,
embodies the central principle of structural biology: sequence determines structure, and structure
dictates function [Greslehner, 2018]. This principle is fundamental to understanding biomolecular
behaviour and structure-based design [Huang et al., 2016, Alford et al., 2017].

Protein structure, illustrated in Figure 2.3a, is typically described at four levels of organization:
primary (the linear sequence of amino acids), secondary (local folding patterns such as α-helices
and β-sheets), tertiary (the overall three-dimensional shape of a single chain), and quaternary
(the assembly of multiple chains into a functional complex).

Nucleic acids, DNA and RNA, are the primary carriers of genetic information and also play
crucial roles in various cellular functions, including the expression of proteins [Cech, 2024].
The structural principles of nucleic acids [Neidle and Sanderson, 2021] are a special focus of
this thesis, particularly in the context of the work on structure-based RNA design presented in
Part II. Nucleic acid structure follows the same hierarchical organization as proteins, as
illustrated in Figure 2.3b. The primary structure is the linear sequence of nucleotides, each
comprising a nitrogenous base (Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in DNA,
with Uracil (U) replacing Thymine in RNA), a 5-carbon sugar (deoxyribose in DNA, ribose in RNA),
and phosphate groups. These nucleotides link via phosphodiester bonds, forming a sugar-phosphate
backbone with 5’ to 3’ directionality (e.g., GACU for RNA). The secondary structure arises from
base interactions, primarily hydrogen bonding. DNA typically forms a double helix [Watson and
Crick, 1953, Franklin and Gosling, 1953], with two complementary strands stabilized by base pairs
(A-T, G-C) and base stacking (Figure 2.3b). RNA, often single-stranded, folds upon itself,
allowing complementary regions to base-pair (A-U, G-C, as shown in Figure 2.3c) and form
non-canonical pairs, leading to diverse motifs like hairpins and pseudoknots. The tertiary
structure refers to the complex 3D arrangement stabilized by metal ions (e.g., Mg2+, K+). This is
critical for RNA’s diverse functions, including catalysis (as ribozymes) and molecular recognition.

Figure 2.3: Hierarchical structures of biomolecules. (a) Proteins. (b) Nucleic acids. (c)
Nucleobase pairing in RNA. Sources:
(a) Protein structure (CC BY-SA 4.0), https://en.wikipedia.org/wiki/File:Protein_structure_(full)-en.svg
(b) Nucleic acid structure (CC BY-SA 4.0), https://en.wikipedia.org/wiki/File:DNA_RNA_structure_(full).png
(c) RNA base pairing (CC BY-SA 4.0), https://commons.wikimedia.org/wiki/File:Hachimoji_RNA_BP.svg

Source for Figure 2.2: https://opengeology.org/Mineralogy/13-crystal-structures

The hierarchical organization of macromolecules often extends beyond the folding of indi-
vidual chains. Full functionality is generally achieved by assembling into larger quaternary
structures. These assemblies can involve multiple folded protein subunits (as seen in hemoglobin
[Perutz, 1960]), several nucleic acid strands, or, very commonly, combinations of proteins and
nucleic acids. Prominent examples of such vital protein-nucleic acid complexes include ribo-
somes, the cellular machinery for protein synthesis [Ramakrishnan, 2002], and chromatin, the
DNA-protein complex forming chromosomes [Rowley and Corces, 2018]. The specific arrange-
ment and interactions within these macromolecular assemblies are vital for their biological roles,
enabling sophisticated cellular processes and regulatory networks. Consequently, understanding
these higher-order structures is a key focus of molecular biology and modelling.

2.2 Molecular Systems as 3D Geometric Graphs

Having reviewed the different types of molecular systems, we now turn to how these complex
structures can be represented mathematically as geometric graphs in 3D Euclidean space.

2.2.1 Graphs

Graphs are used to model complex and interconnected systems in the real world, ranging from
knowledge graphs to social networks and molecular structures. Formally, an attributed graph
$\mathcal{G} = (A, S)$ is a set $\mathcal{V}$ of $n$ nodes connected by edges, as shown in
Figure 2.4a. $A$ denotes an $n \times n$ adjacency matrix where each entry $a_{ij} \in \{0, 1\}$
indicates the presence or absence of an edge connecting nodes $i$ and $j$. Additionally, we can
define $\mathcal{N}_i$ as the set of neighbors of node $i$, which are the nodes connected to $i$
by an edge, i.e. $\mathcal{N}_i = \{j \in \mathcal{V} \mid a_{ij} = 1\}$. The matrix of scalar
features $S \in \mathbb{R}^{n \times f}$ stores attributes $s_i \in \mathbb{R}^f$ associated with
each node $i$. For example, in molecular graphs, each node is an atom and edges represent
interactions among atoms.

Typically, the nodes in a graph have no canonical or fixed ordering and can be shuffled
arbitrarily, resulting in an equivalent shuffling of the rows and columns of the adjacency matrix
A. Thus, accounting for permutation symmetry is a critical consideration when designing
machine learning models for graphs [Bronstein et al., 2021]. One can also consider more
complex definitions of a graph, including multi-relational graphs or higher-order topological
variants such as hypergraphs [Battiston et al., 2020], but a basic attributed graph suffices for our
discussions on molecular systems.
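The definitions above, together with the effect of reordering nodes, can be sketched in a few lines of NumPy. The toy path graph, the feature values, and the helper names `neighbors` and `permute` are illustrative assumptions rather than part of the formalism.

```python
import numpy as np

# A 4-node path graph 0-1-2-3 with f = 2 scalar features per node.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
S = np.arange(8, dtype=float).reshape(4, 2)

def neighbors(A, i):
    """Neighbor set N_i = {j | a_ij = 1}."""
    return set(np.flatnonzero(A[i]).tolist())

def permute(A, S, perm):
    """Relabel nodes so that new node k is old node perm[k]."""
    P = np.eye(len(perm), dtype=A.dtype)[perm]  # permutation matrix
    return P @ A @ P.T, P @ S                   # shuffle rows and columns of A, rows of S

perm = [2, 0, 3, 1]
A_p, S_p = permute(A, S, perm)
```

The permuted pair `(A_p, S_p)` describes the same graph as `(A, S)`: only the node labels differ, which is why models on graphs should not depend on the labelling.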

Figure 2.4: Graphs and geometric graphs. (a) An attributed graph: graphs model complex systems
via a set of nodes which are related by edges. (b) A geometric graph: geometric graphs embedded
in Euclidean space model systems containing both geometry and relational structure.

2.2.2 Geometric graphs

As we have seen, molecular systems exhibit both relational structure and geometry: functional
molecules arise from atoms interacting with one another, and the specific spatial arrangement of
atoms in 3D space determines these interactions. Such systems can be modeled via geometric
graphs embedded in Euclidean space [Duval et al., 2023a]. For example, molecules can be
represented as a set of nodes which contain information about each atom and its 3D spatial
coordinates as well as other geometric quantities such as velocity or acceleration.

As illustrated in Figure 2.4b, a geometric graph $\mathcal{G} = (A, S, \vec{V}, \vec{X})$ is an
attributed graph that is also decorated with geometric attributes: 3D node coordinates
$\vec{X} \in \mathbb{R}^{n \times d}$ and, optionally, vector features
$\vec{V} \in \mathbb{R}^{n \times d}$ (e.g. velocity, acceleration), with $d = 3$ (see footnote 1).

For molecules, the conventional procedure for constructing the geometric graph
$\mathcal{G} = (A, S, \vec{V}, \vec{X})$ is via the underlying point cloud
$(S, \vec{V}, \vec{X})$ using a predetermined radial cutoff $r_{\text{cut}}$. Thus, the adjacency
matrix is defined as $a_{ij} = 1$ if $\|\vec{x}_i - \vec{x}_j\|_2 \leq r_{\text{cut}}$, or 0
otherwise, for all $a_{ij} \in A$. Other common choices for graph construction include long-range
connections between nodes that are not within the cutoff radius, or complete graphs, where all
nodes are connected to each other. See Figure 2.5 for illustrations of the different types of
geometric graphs.
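A minimal sketch of the radial cutoff construction, assuming coordinates are stored row-wise as an $n \times 3$ array. The function names are hypothetical, and dropping self-loops by default is a choice made here for illustration: read literally, the cutoff rule also connects every node to itself, since each node is at distance zero from itself.

```python
import numpy as np

def radius_graph(X, r_cut, self_loops=False):
    """Adjacency a_ij = 1 if ||x_i - x_j||_2 <= r_cut, for coordinates X of shape (n, 3)."""
    diff = X[:, None, :] - X[None, :, :]   # (n, n, 3) pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)   # (n, n) pairwise distances
    A = (dist <= r_cut).astype(int)
    if not self_loops:
        np.fill_diagonal(A, 0)             # drop the trivial i = j edges
    return A

def complete_graph(n):
    """Every node is connected to every other node."""
    return np.ones((n, n), dtype=int) - np.eye(n, dtype=int)

# Toy point cloud (hypothetical coordinates in Angstroms).
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.4, 0.0, 0.0], [2.4, 1.2, 0.0]])
A = radius_graph(X, r_cut=1.5)
```

The dense pairwise computation costs memory quadratic in the number of atoms; for large systems, neighbour lists based on spatial partitioning are the usual alternative.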

Figure 2.5: From point clouds to geometric graphs. A 3D point cloud is transformed into a
geometric graph by drawing edges between atoms within a radial cutoff, possibly including
long-range connections, or by simply connecting all atoms. Panels: 3D point cloud, smoothed cutoff
graph, long-range connections, complete graph.

1 Without loss of generality, our formalism uses a single vector feature per node, but we could
have had multiple channels for each node.

Periodic boundary conditions While molecules simply consist of a set of 3D points in space,
easily representable using a finite graph, crystals are modelled to be infinite periodic structures
whose repeating pattern is called a unit cell. To account for the infinite periodicity of the crystal,
we employ periodic boundary conditions (PBC). The unit cell is defined by a lattice matrix
$\vec{L} \in \mathbb{R}^{3 \times 3}$, where the columns represent the three lattice basis vectors
of the unit cell. Due to the periodic tiling of the unit cell, an atom $i$ may interact with an
image of atom $j$ in a neighbouring cell. This is formalized by defining an integer-valued shift
vector $\vec{u}_{ij} \in \mathbb{Z}^3$, which allows the effective distance to be calculated as:

$$d_{ij} = \left\| (\vec{x}_i - \vec{x}_j) + \vec{L}\, \vec{u}_{ij} \right\|_2$$

Here, the shift vector $\vec{u}_{ij}$ is determined dynamically based on the atomic positions
$\vec{X}$, typically using a radial cutoff and the minimum image convention (selecting the image
of atom $j$ that is closest to atom $i$). This ensures that the graph accurately captures
interactions across cell boundaries, which is in turn critical for accurately simulating the
dynamics of atoms that may ‘drift’ across unit cell boundaries over time.
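The minimum image convention can be sketched as follows, using the same sign convention as the distance formula above. The function name `minimum_image` and the 10 Angstrom cubic cell are assumptions for illustration. The rounding step finds the nearest image exactly for orthogonal cells; for strongly skewed cells it is only a heuristic, and a search over neighbouring images would be needed.

```python
import numpy as np

def minimum_image(x_i, x_j, L):
    """Return (u_ij, d_ij) with d_ij = ||(x_i - x_j) + L u_ij||_2.

    L holds the lattice basis vectors as columns.
    """
    delta = x_i - x_j
    frac = np.linalg.solve(L, delta)   # displacement expressed in lattice units
    u = -np.round(frac).astype(int)    # integer shift vector in Z^3
    d = np.linalg.norm(delta + L @ u)
    return u, float(d)

L = 10.0 * np.eye(3)                   # hypothetical cubic cell, 10 Angstroms per side
x_i = np.array([0.5, 5.0, 5.0])
x_j = np.array([9.5, 5.0, 5.0])
u_ij, d_ij = minimum_image(x_i, x_j, L)
```

Here the two atoms sit on opposite faces of the cell: their direct separation is 9 Angstroms, while the periodic image of atom $j$ is only 1 Angstrom away from atom $i$.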

2.2.3 Physical symmetries

A key characteristic of geometric graphs is that their coordinates and scalar/vector attributes
transform in mathematically precise ways under physical symmetries such as rotations and trans-
lations in 3D space (Figure 2.6). Understanding and modelling these symmetries is fundamental
to building neural networks that maintain physical meaning and produce consistent predictions
regardless of molecular orientation in space [Musil et al., 2021].

Figure 2.6: Geometric attributes transform under 3D symmetries. The group of rotations
and reflections O(3) acts on the vector features v and coordinates x. The translation group T(3)
acts on the coordinates x. Scalar features remain invariant to transformations.

Consider the fundamental physical principle illustrated in Figure 2.7: the potential energy
of a molecule remains unchanged (invariant) under rotations or translations in 3D space, while
atomic coordinates transform consistently (equivariantly) with these same transformations. This
reflects a fundamental principle: the laws of physics are independent of our choice of coordinate
system. Geometric neural networks must explicitly account for these symmetries to preserve the
physical meaning of their predictions.

Figure 2.7: Symmetries of 3D molecular systems. The ordering of atoms/nodes in the system is
arbitrary. Additionally, global rotations or translations of the system in 3D Euclidean space will
lead to an equivalent transformation of 3D coordinates and other geometric attributes. Global
properties of the system such as the potential energy are invariant to both permutation and
physical symmetries. Geometric GNNs explicitly account for both permutation symmetry and
physical transformation behaviours when modelling 3D molecules, while standard GNNs solely
account for permutations.
[Figure panels: a 3D atomic system described by atom types, 3D coordinates (x, y, z per atom), and
potential energy. Permutation: rows of the atom types and coordinates are permuted; the potential
energy is invariant. 3D rotation: atom types and potential energy are invariant; the coordinate
columns are rotated to (x', y', z').]

Group theory foundations Group theory [Zee, 2016] provides the mathematical framework
for formalizing these symmetries. A group $(G, \star)$ consists of a set of elements $G$ with a
binary operation $\star : G \times G \to G$ satisfying three axioms:

1. Associativity: $(g_1 \star g_2) \star g_3 = g_1 \star (g_2 \star g_3)$ for all group elements $g_1, g_2, g_3 \in G$.

2. Identity: There exists $e \in G$ such that $e \star g = g \star e = g$ for all $g \in G$.

3. Inverse: For each $g \in G$, there exists $h \in G$ such that $g \star h = h \star g = e$.

Symmetry groups for molecular systems The key symmetry groups relevant to molecular
systems and their actions on geometric graphs $\mathcal{G} = (A, S, \vec{V}, \vec{X})$ are:

• Permutation symmetry $S_n$: A permutation $\sigma$ acts via the permutation matrix $P_\sigma$ as:

$$P_\sigma \mathcal{G} := \left(P_\sigma A P_\sigma^\top,\ P_\sigma S,\ P_\sigma \vec{V},\ P_\sigma \vec{X}\right),$$

where $P_\sigma \in \mathbb{R}^{n \times n}$ has exactly one 1 in every row and column, and 0 elsewhere.

• Rotational symmetry $SO(d)$, or rotations and reflections, $O(d)$: We use $G$ generically to
denote $SO(d)$ or $O(d)$. An orthogonal transformation $Q_g \in G$ acts as:

$$Q_g \mathcal{G} := \left(A,\ S,\ \vec{V} Q_g,\ \vec{X} Q_g\right),$$

where $Q_g \in \mathbb{R}^{d \times d}$ s.t. $Q_g^\top = Q_g^{-1}$ and $\det(Q_g) = 1$ for
$G = SO(d)$ (or, for $G = O(d)$, $\det(Q_g) = \pm 1$).

• Translational symmetry $T(d)$: A translation vector $\vec{t} \in \mathbb{R}^d$, added to the
coordinates of every node, acts as:

$$\vec{t} + \mathcal{G} := \left(A,\ S,\ \vec{V},\ \vec{X} + \vec{t}\right).$$

Note that scalar features $S$ remain unchanged under all transformations, vector features
$\vec{V}$ transform under rotations but not translations, while coordinates $\vec{X}$ transform
under both rotations and translations. Without loss of generality, we consider a single vector
feature per node; this framework generalizes to multiple vector features and higher-order tensors.
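The three group actions above can be written down directly for a geometric graph stored as a tuple of arrays. This is a minimal sketch: the tuple layout `(A, S, V, X)`, the function names, and the QR-based sampling of orthogonal matrices are assumptions made for illustration.

```python
import numpy as np

def permute(graph, perm):
    """Action of a permutation: new node k is old node perm[k]."""
    A, S, V, X = graph
    P = np.eye(len(perm))[perm]            # permutation matrix P_sigma
    return P @ A @ P.T, P @ S, P @ V, P @ X

def rotate(graph, Q):
    """Action of an orthogonal matrix Q: only geometric attributes change."""
    A, S, V, X = graph
    return A, S, V @ Q, X @ Q

def translate(graph, t):
    """Action of a translation t: only the coordinates shift."""
    A, S, V, X = graph
    return A, S, V, X + t

def random_orthogonal(rng, d=3, proper=True):
    """Sample Q with Q^T Q = I via QR; flip one column to force det(Q) = +1."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    if proper and np.linalg.det(Q) < 0:
        Q[:, 0] *= -1
    return Q

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 3))
A = (np.linalg.norm(X[:, None] - X[None], axis=-1) <= 1.5).astype(float)
graph = (A, rng.normal(size=(n, 4)), rng.normal(size=(n, 3)), X)
Q = random_orthogonal(rng)
```

Applying `rotate` followed by `translate` leaves all pairwise distances between nodes unchanged, which is the property that invariant models in later sections rely on.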

2.3 Representation Learning of Molecular Structure

This section provides an overview of representation learning for molecular structures, focusing
on how to build Graph Neural Networks for 3D geometric graphs [Duval et al., 2023a]. We
will survey the main families of Geometric GNNs: invariant, equivariant, and unconstrained
models. We will also discuss their applications for molecular property prediction and simulation
of molecular dynamics.

2.3.1 Graph Neural Networks

Graph Neural Networks (GNNs) are a class of deep learning architectures designed to operate on
graph-structured data. GNNs leverage graph topology to propagate and aggregate information
between connected nodes. While initial GNN architectures were proposed in the late 1990s and
2000s [Goller and Kuchler, 1996, Gori et al., 2005, Scarselli et al., 2008], modern variants have
emerged as the architecture of choice for representation learning on graph data across domains
ranging from molecular modelling [Stokes et al., 2020, Batzner et al., 2022] to recommendation
systems [Ying et al., 2018] and transportation networks [Derrow-Pinion et al., 2021].

GNNs are based on the principle of message passing, where each node iteratively updates
its representations by aggregating from its local neighbors [Battaglia et al., 2018]. This process
is inherently permutation-equivariant: reordering the nodes reorders the learned node
representations in the same way, so that no prediction depends on an arbitrary node ordering. By
stacking multiple message passing layers, GNNs can propagate information beyond immediate
neighbors and capture complex multi-hop relationships in the graph structure (Figure 2.8).

Figure 2.8: Graph Neural Networks. (a) Message passing: GNNs build latent representations of
graph data through message passing operations, where each node performs learnable feature
aggregation from its local neighbourhood. (b) GNN computation tree: stacking L message passing
layers enables GNNs to send and aggregate information from L-hop subgraphs around each node.

Message Passing Framework Formally, node features $s_i$ for each node $i \in \mathcal{V}$ are
updated from layer/iteration $t$ to $t+1$ through a three-step process:

1. Message construction: For each node $i$ and its neighbors $j \in \mathcal{N}_i$, construct a
message $m_{ij}^{(t)}$ that captures the relationship between the representations of nodes $i$ and $j$.

$$m_{ij}^{(t)} = \psi\left(s_i^{(t)}, s_j^{(t)}\right), \quad \forall j \in \mathcal{N}_i, \tag{2.1}$$

where $\psi : \mathbb{R}^{2 \times d} \to \mathbb{R}^d$ is an MLP that learns to construct the
message based on the representations of nodes $i$ and $j$ (here $d$ denotes the dimension of the
node features).

2. Aggregation: Combine all messages from the neighbors of node $i$ to produce a single
aggregated message $m_i^{(t)}$.

$$m_i^{(t)} = \bigoplus_{j \in \mathcal{N}_i} m_{ij}^{(t)}, \tag{2.2}$$

where $\bigoplus$ is a permutation-invariant operator (e.g. sum, mean, max) that aggregates
messages from all neighbors $j \in \mathcal{N}_i$. Thus, a change in the order of neighbors does
not affect the aggregated message, preserving permutation symmetry.

3. Update: Update the representations of node $i$ using the aggregated message $m_i^{(t)}$ and its
previous representations $s_i^{(t)}$.

$$s_i^{(t+1)} = \phi\left(s_i^{(t)}, m_i^{(t)}\right), \tag{2.3}$$

where $\phi : \mathbb{R}^{2 \times d} \to \mathbb{R}^d$ is another MLP.



Alternatively, this framework can be expressed more abstractly in terms of multisets as:

$$m_i^{(t)} := \mathrm{AGG}\left(\{\{(s_i^{(t)}, s_j^{(t)}) \mid j \in \mathcal{N}_i\}\}\right), \tag{2.4}$$

$$s_i^{(t+1)} := \mathrm{UPD}\left(s_i^{(t)}, m_i^{(t)}\right), \tag{2.5}$$

where $\{\{\cdot\}\}$ denotes a multiset, AGG is a permutation-invariant aggregation function, and
UPD is an MLP. The final node features $\{s_i^{(t=T)}\}$ at iteration $T$ can be mapped to
graph-level predictions via a permutation-invariant readout function.

This general formulation encompasses well known architectures including Graph Convolu-
tional Networks [Kipf and Welling, 2017], Graph Isomorphism Networks [Xu et al., 2019], and
Message Passing Neural Networks (MPNNs) [Gilmer et al., 2017].
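The three steps of Equations 2.1 to 2.3 can be sketched as a single NumPy function. This is an illustrative sketch rather than any of the cited architectures: the MLPs have random, untrained weights, sum aggregation is one possible choice for the aggregation operator, and the feature dimension `d = 8` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # node feature dimension (hypothetical choice)

def mlp(in_dim, out_dim):
    """A tiny two-layer MLP with random, untrained weights."""
    W1 = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    W2 = rng.normal(size=(out_dim, out_dim)) / np.sqrt(out_dim)
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

psi = mlp(2 * d, d)  # message function, Eq. 2.1
phi = mlp(2 * d, d)  # update function, Eq. 2.3

def message_passing_layer(A, S):
    """One round of Eqs. 2.1-2.3 with sum aggregation. A: (n, n), S: (n, d)."""
    n = S.shape[0]
    S_i = np.repeat(S[:, None, :], n, axis=1)      # (n, n, d): features of receiver i
    S_j = np.repeat(S[None, :, :], n, axis=0)      # (n, n, d): features of sender j
    M = psi(np.concatenate([S_i, S_j], axis=-1))   # messages m_ij for all pairs
    m = (A[:, :, None] * M).sum(axis=1)            # keep and sum only j in N_i
    return phi(np.concatenate([S, m], axis=-1))    # updated features s_i

def readout(S):
    """Permutation-invariant graph-level readout."""
    return S.sum(axis=0)
```

Computing messages for all node pairs and masking with `A` keeps the sketch short; practical implementations gather features along an edge list instead, which scales with the number of edges rather than the number of node pairs.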

Graph Attention Networks A particularly interesting class of GNNs employs attention
mechanisms to weight the importance of different neighbors during aggregation [Veličković
et al., 2018]. In Graph Attention Networks (GATs), the message from neighbor $j$ to node $i$ is
computed using an attention mechanism [Bahdanau et al., 2015]. For example, we can consider
an attention mechanism based on the dot product between the representations of nodes $i$ and $j$,
followed by a softmax normalization over all neighbors $j' \in \mathcal{N}_i$:

\begin{align}
\psi\left(s_i^{(t)}, s_j^{(t)}\right) &= \mathrm{Attention}\left(W_Q^{(t)} s_i^{(t)},\ \{W_K^{(t)} s_j^{(t)},\ \forall j \in \mathcal{N}_i\},\ \{W_V^{(t)} s_j^{(t)},\ \forall j \in \mathcal{N}_i\}\right), \tag{2.6} \\
&= \frac{\exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_j^{(t)}\right)}{\sum_{j' \in \mathcal{N}_i} \exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_{j'}^{(t)}\right)} \cdot W_V^{(t)} s_j^{(t)}, \tag{2.7}
\end{align}

where $W_Q^{(t)}, W_K^{(t)}, W_V^{(t)} \in \mathbb{R}^{d \times d}$ are learnable linear
transformations denoting the Query, Key and Value for the attention computation, respectively.
Multi-head attention enhances the expressivity of this operation by computing attention in
parallel across multiple representation subspaces [Vaswani et al., 2017a].
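The dot-product attention of Equation 2.7, summed over the neighbours of each node, can be sketched with a masked softmax. The function names are hypothetical, a single attention head is used, and the sketch assumes that every node has at least one neighbour, since the softmax over an empty neighbourhood is undefined.

```python
import numpy as np

def softmax_masked(scores, mask):
    """Row-wise softmax restricted to the entries where mask is True."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                             # masked entries become 0
    return weights / weights.sum(axis=1, keepdims=True)

def attention_layer(A, S, W_Q, W_K, W_V):
    """Return sum_j psi(s_i, s_j) from Eq. 2.7 and the attention weights.

    A: (n, n) adjacency matrix, S: (n, d) node features, W_*: (d, d) weights.
    """
    Q, K, V = S @ W_Q.T, S @ W_K.T, S @ W_V.T            # row i holds W s_i
    alpha = softmax_masked(Q @ K.T, A.astype(bool))      # softmax over j in N_i
    return alpha @ V, alpha
```

For a node with a single neighbour the attention weight is exactly one, so the aggregated message reduces to the value vector of that neighbour.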

Connection to Transformers The Transformer model [Vaswani et al., 2017a] has emerged as
the deep learning architecture of choice across language [Achiam et al., 2023], vision
[Dosovitskiy et al., 2021], and audio [Radford et al., 2023] due to its expressivity and scalability.

Transformers and GNNs share deep mathematical connections [Joshi, 2025]. Transformers
can be viewed as GATs operating on complete graphs, where self-attention models relationships
between all input nodes. The Transformer update rule can be directly instantiated in the message
passing framework as follows:

\begin{align}
\psi\left(s_i^{(t)}, s_j^{(t)}\right) &= \mathrm{Attention}\left(W_Q^{(t)} s_i^{(t)},\ \{W_K^{(t)} s_j^{(t)},\ \forall j \in \mathcal{V}\},\ \{W_V^{(t)} s_j^{(t)},\ \forall j \in \mathcal{V}\}\right), \tag{2.8} \\
&= \frac{\exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_j^{(t)}\right)}{\sum_{j' \in \mathcal{V}} \exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_{j'}^{(t)}\right)} \cdot W_V^{(t)} s_j^{(t)}, \tag{2.9}
\end{align}

Here, $\psi\left(s_i^{(t)}, s_j^{(t)}\right)$ computes the message from node $j$ to node $i$, with
the relative importance of each node computed via attention. Next, the weighted messages from all
nodes in the graph (the set $\mathcal{V}$) are aggregated via a summation, and the features of
node $i$ are updated using an MLP $\phi$:

$$s_i^{(t+1)} = \phi\left(s_i^{(t)}, \sum_{j \in \mathcal{V}} \psi\left(s_i^{(t)}, s_j^{(t)}\right)\right), \tag{2.10}$$

This ability to attend to and gather information from all nodes in the set $\mathcal{V}$ (i.e.,
global attention over a complete graph) allows Transformers to capture both local and global
context in the data via multi-head attention, without being constrained by the pathologies of
pre-defined sparse graph structure, such as oversquashing with increased depth [Di Giovanni et al.,
2023]. This can be especially useful for molecular tasks where we do not have an a priori graph
structure, as we will discuss subsequently.

Conversely, the GAT message passing Equation 2.7 is equivalent to Equation 2.9 with attention
restricted to local neighbourhoods, where the graph structure is used to implement sparse or
masked attention [Dong et al., 2024]. This connection has inspired the development of Graph
Transformers [Dwivedi and Bresson, 2020, Rampášek et al., 2022] that aim to combine both local
message passing and global attention. These architectures aim to overcome the expressivity
limitations of message passing GNNs while preserving the inductive bias of graph structure.

Figure 2.9: Invariant and equivariant functions. (a) G-invariant function: the output remains
unchanged under transformations of the input. (b) G-equivariant function: transformations of
the input must result in the output transforming equivalently.
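The correspondence between Equations 2.7 and 2.9 can be checked numerically: masked attention with an all-ones mask (a complete graph that includes self-connections) reproduces standard self-attention, while a sparse mask gives a different, neighbourhood-restricted result. This is a sketch under stated assumptions: the function names, the ring-shaped sparse graph, and the random weights are all hypothetical, and the dot products are left unscaled to match the equations in this section.

```python
import numpy as np

def masked_attention(S, W_Q, W_K, W_V, mask):
    """Softmax attention where node i only attends to j with mask[i, j] = True."""
    scores = (S @ W_Q.T) @ (S @ W_K.T).T
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ (S @ W_V.T)

def full_attention(S, W_Q, W_K, W_V):
    """Single-head self-attention over all nodes in V (Eq. 2.9, summed over j)."""
    scores = (S @ W_Q.T) @ (S @ W_K.T).T
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ (S @ W_V.T)

rng = np.random.default_rng(0)
n, d = 6, 4
S = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

complete = np.ones((n, n), dtype=bool)  # complete graph, self-connections included
eye = np.eye(n, dtype=bool)
ring = np.roll(eye, 1, axis=1) | np.roll(eye, -1, axis=1)  # each node sees two neighbours

out_complete = masked_attention(S, W_Q, W_K, W_V, complete)
out_ring = masked_attention(S, W_Q, W_K, W_V, ring)
out_full = full_attention(S, W_Q, W_K, W_V)
```

Viewing the adjacency matrix as an attention mask is what lets Graph Transformer designs interpolate between the two regimes, for instance by combining a sparse mask with a global one.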



2.3.2 Geometric Graph Neural Networks

Functions on geometric graphs Before describing GNNs specialised for geometric graphs,
we first define two classes of functions that are used to construct geometric neural network layers.
We denote the action of a group $G$ on a space $X$ by $g \cdot x$. If $G$ acts on spaces $X$ and
$Y$, we say:

• A function $f : X \to Y$ is G-invariant if $f(g \cdot x) = f(x)$, i.e. the output remains
unchanged under transformations of the input, as shown in Figure 2.9a.

• A function $f : X \to Y$ is G-equivariant if $f(g \cdot x) = g \cdot f(x)$, i.e. a transformation
of the input must result in the output transforming equivalently, as shown in Figure 2.9b.

In Chapter 3, we will also consider G-orbit injective functions. The G-orbit of $x \in X$ is
$\mathcal{O}_G(x) = \{g \cdot x \mid g \in G\} \subseteq X$. When $x$ and $x'$ are part of the
same orbit, we write $x \simeq x'$. We say a function $f : X \to Y$ is G-orbit injective if we
have $f(x_1) = f(x_2)$ if and only if $x_1 \simeq x_2$ for any $x_1, x_2 \in X$. Necessarily, such
a function is G-invariant, since $f(g \cdot x) = f(x)$.
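These definitions can be checked numerically for $G = O(3)$ acting on point clouds by $X \mapsto XQ$. The three example functions below are hypothetical and chosen only to illustrate the definitions: one is invariant, one is equivariant, and one is neither.

```python
import numpy as np

def sum_of_distances(X):
    """G-invariant: depends on X only through pairwise distances."""
    return np.linalg.norm(X[:, None] - X[None], axis=-1).sum()

def centered_coords(X):
    """G-equivariant: rotating the input rotates the output in the same way."""
    return X - X.mean(axis=0, keepdims=True)

def sum_of_x_coords(X):
    """Neither invariant nor equivariant: depends on the choice of axes."""
    return X[:, 0].sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                   # a random point cloud
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix in O(3)

# X and X @ Q lie in the same G-orbit, so an invariant function agrees on both.
invariant_gap = abs(sum_of_distances(X @ Q) - sum_of_distances(X))
equivariant_gap = np.abs(centered_coords(X @ Q) - centered_coords(X) @ Q).max()
```

Both gaps vanish up to floating-point error. Note that `sum_of_distances` is invariant but not orbit injective in general, since distinct orbits can share the same sum of distances.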

Figure 2.10: Timeline of Geometric GNN architectures, adapted from Duval et al. [2023a]. The
timeline spans 2018 to 2023 and groups architectures into four families: invariant GNNs,
Cartesian equivariant GNNs, spherical equivariant GNNs, and unconstrained GNNs. Architectures
shown: SchNet, CGCNN, MEGNet, DimeNet, GemNet, SphereNet, Inv. Point Attention, GearNet,
ComENet, GVP-GNN, PaiNN, E(n)-GNN, Eq.Transformer, ClofNet, SO3krates, Tensor Field Network,
Cormorant, SE(3)-Transformer, NequIP, SEGNN, MACE, Allegro, Equiformer, eSCN, ForceNet,
Spherical Channel Network, FAENet, PET, and TensorNet. Application domains: biomolecules,
materials, and small molecules. Tasks: property prediction, dynamics simulation, generative
modelling, and structure prediction.

Geometric GNNs Standard Graph Neural Networks (GNNs), while powerful for general graph
data, are ill-suited for geometric graphs and molecular structures. Directly applying GNNs to
geometric graphs can lead to models that do not respect the physical symmetries inherent to
molecules (Figure 2.7). This can result in predictions that are physically inconsistent [Musil et al.,
2021]. Moreover, learning these symmetries implicitly from data alone, without appropriate
inductive biases, is usually data-inefficient.

To address this, Geometric GNNs, which are GNNs specialized for geometric graphs, extend
the message passing paradigm to incorporate physical symmetries as inductive biases (implicit) or
strict constraints (explicit). In addition to maintaining permutation equivariance for node features,
geometric GNNs ensure that operations involving geometric attributes (like atomic coordinates
or vector features) respect physical symmetries. This means that the learned representations
and intermediate geometric features transform covariantly with respect to the group of rotations
(SO(d)) or rotations and reflections (O(d)). We use G as a generic symbol for these groups.

In recent years, we have seen a wide range of Geometric GNN architectures and their
applications in molecular modelling [Duval et al., 2023a]. To navigate this landscape, the
following sections will survey these models by categorizing them into three main families, as
summarized in Figure 2.10:

• Invariant GNNs, which construct and propagate features that are invariant to G, such as
distances, angles, and torsion angles.

• Equivariant GNNs, where intermediate representations and propagated messages are
themselves geometric quantities. The representations can be expressed as Cartesian
vectors or in spherical harmonic basis.

• Unconstrained networks, which do not explicitly enforce physical symmetries in their
architecture but may learn them implicitly from data, e.g. through data augmentation.

2.3.3 Rotation-invariant GNNs

Invariant GNNs are designed to learn atomic representations that are inherently invariant to
3D Euclidean transformations of the system (the translation group T (3) and the group of
rotations G= SO(3) or rotations and reflections G= O(3)). Translation invariance is typically
achieved by: (1) centering input point clouds (e.g., by subtracting the center of mass from atomic
coordinates) and (2) operating on relative displacement vectors x⃗ij = x⃗i− x⃗j instead of absolute
coordinates.

To enforce G-invariance, these models avoid directly processing geometric quantities that
depend on the frame of reference (like raw coordinate vectors). Instead, they operate on scalarized

geometric invariants: quantities that are inherently invariant to rotations and reflections. Common
examples include pairwise distances (∥x⃗ij∥), triplet-wise angles (derived from dot products like
x⃗ij · x⃗ik), and quadruplet-wise torsion angles. By constructing messages and updating features
using only these invariant scalars, the entire network, from intermediate representations to final
predictions, is guaranteed to be G-invariant.
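The scalarization idea above can be sketched in a few lines of plain Python. This is only an illustrative example, not taken from any of the cited models: the helper names (`scalarize`, `rotate_z`) and the three-atom configuration are assumptions made for the sketch. It extracts a distance and an angle from three atoms and checks that both are unchanged after a rotation and a translation.

```python
import math

def sub(a, b):
    return [a[k] - b[k] for k in range(3)]

def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))

def norm(a):
    return math.sqrt(dot(a, a))

def rotate_z(v, theta):
    """Rotate a 3-vector about the z-axis by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]

def scalarize(x_i, x_j, x_k):
    """Return the distance |x_ij| and the angle between x_ij and x_ik."""
    x_ij, x_ik = sub(x_i, x_j), sub(x_i, x_k)
    cos_angle = dot(x_ij, x_ik) / (norm(x_ij) * norm(x_ik))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against round-off
    return norm(x_ij), math.acos(cos_angle)

# Hypothetical three-atom configuration (illustrative values only).
atoms = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.5]]

# Apply a rotation followed by a translation to every atom.
shift = [3.0, -1.0, 2.0]
moved = [[c + t for c, t in zip(rotate_z(x, 0.7), shift)] for x in atoms]

invariants_before = scalarize(*atoms)
invariants_after = scalarize(*moved)
```

Both tuples agree up to floating-point error, which is the property that lets a network built only on such scalars inherit G-invariance.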

Message Passing with Invariant Features G-invariant GNN layers follow the general message
passing framework (Equations 2.4-2.5) with specific differences in how geometric information
is incorporated. The key idea is to use geometric invariants for constructing features. Scalar
node features si are updated from layer t to t+ 1. The aggregation step, AGG, now incorporates
invariant geometric information derived from relative positions x⃗ij (and potentially initial vector
features v⃗i, v⃗j). The update function UPD then combines these aggregated messages with the
previous node features:

$$
m_i^{(t)} := \mathrm{AGG}_{\mathrm{inv}}\Big( \{\{ \big( s_i^{(t)}, s_j^{(t)}, \mathrm{scalarize}(\vec{x}_{ij}, \vec{v}_i^{(t)}, \vec{v}_j^{(t)}) \big) \mid j \in \mathcal{N}_i \}\} \Big), \tag{2.11}
$$

$$
s_i^{(t+1)} := \mathrm{UPD}\Big( s_i^{(t)}, m_i^{(t)} \Big). \tag{2.12}
$$

Here, scalarize(·) represents the process of extracting geometric invariants (e.g., distances,
angles) from the inputs. The aggregated message $m_i^{(t)}$ is thus purely an invariant scalar.

(a) SchNet (b) DimeNet

Figure 2.11: Invariant GNN message passing. G-invariant layers extract and propagate local
scalar geometric quantities such as distances (SchNet) and bond angles (DimeNet), which are
guaranteed to be invariant to Euclidean transformations.

Examples Pioneering examples are SchNet [Schütt et al., 2018] and CGCNN [Xie and Gross-
man, 2018], where messages are modulated by functions of interatomic distances ∥x⃗ij∥. As
illustrated in Figure 2.11a, SchNet’s update rule can be seen as:

$$
s_i^{(t+1)} := s_i^{(t)} + \sum_{j \in \mathcal{N}_i} f_1\Big( s_j^{(t)}, \|\vec{x}_{ij}\| \Big) \quad \text{(SchNet)} \tag{2.13}
$$

where f1 is an MLP that processes the neighbor’s scalar features $s_j^{(t)}$ and the invariant distance.
DimeNet [Gasteiger et al., 2020] (Figure 2.11b) extends this by incorporating angular
information, effectively using messages that depend on triplets of atoms. It computes messages
based on distances and angles (derived from dot products like x⃗ij · x⃗ik):

$$
s_i^{(t+1)} := \sum_{j \in \mathcal{N}_i} f_1\Big( s_i^{(t)}, s_j^{(t)}, \sum_{k \in \mathcal{N}_i \setminus \{j\}} f_2\big( s_j^{(t)}, s_k^{(t)}, \|\vec{x}_{ij}\|, \vec{x}_{ij} \cdot \vec{x}_{ik} \big) \Big) \quad \text{(DimeNet)} \tag{2.14}
$$

In both cases, the updated scalar features $s_i^{(t+1)}$ maintain invariance to G transformations, as
they are constructed solely from geometric invariants.
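As a minimal sketch of the SchNet-style update in equation 2.13, the snippet below replaces the learned MLP f1 with a fixed hand-written function and uses a radial cutoff to define neighbourhoods. Both choices, as well as all names and numerical values, are assumptions made for illustration rather than details of the original model. The layer is evaluated on a configuration and on a reflected and translated copy of it.

```python
import math

def f1(s_j, d):
    """Hypothetical stand-in for the learned filter: any function of invariants works."""
    return math.tanh(0.5 * s_j) * math.exp(-d * d)

def schnet_like_layer(s, pos, cutoff=3.0):
    """s_i <- s_i + sum_{j in N_i} f1(s_j, |x_ij|), with N_i given by a radial cutoff."""
    updated = []
    for i in range(len(s)):
        message = 0.0
        for j in range(len(s)):
            d = math.dist(pos[i], pos[j])
            if j != i and d < cutoff:
                message += f1(s[j], d)
        updated.append(s[i] + message)
    return updated

def reflect_and_shift(x):
    """An improper O(3) transformation (reflection across the plane x = y) plus a translation."""
    return [x[1] + 1.0, x[0] - 2.0, x[2] + 0.5]

# Illustrative scalar features and coordinates for four atoms.
s = [0.2, -1.0, 0.7, 1.5]
pos = [[0.0, 0.0, 0.0], [1.0, 0.2, 0.0], [0.3, 1.1, -0.4], [-0.8, 0.5, 0.9]]

out = schnet_like_layer(s, pos)
out_transformed = schnet_like_layer(s, [reflect_and_shift(x) for x in pos])
```

Because the layer only ever sees interatomic distances, `out` and `out_transformed` coincide up to round-off.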
Other notable invariant GNNs include GemNet [Gasteiger et al., 2021] and SphereNet
[Liu et al., 2022], which incorporate additional geometric features like torsion angles among
quadruplets of atoms. Another class of invariant GNNs is based on canonical frames of
reference, which define a local or global frame to scalarise geometric quantities into invariant
features used for message passing. A notable example of this approach is the Invariant Point
Attention layer from AlphaFold2 [Jumper et al., 2021].
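To illustrate scalarization via a local frame of reference, the following sketch builds an orthonormal frame from three points by Gram-Schmidt orthogonalization and expresses a query point in that frame. The construction, the function names and the coordinates are assumptions chosen for illustration and are not meant to reproduce any specific published layer.

```python
import math

def sub(a, b):
    return [a[k] - b[k] for k in range(3)]

def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))

def scale(a, c):
    return [c * a[k] for k in range(3)]

def unit(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def local_frame(origin, p1, p2):
    """Right-handed orthonormal frame from three non-collinear points (Gram-Schmidt)."""
    e1 = unit(sub(p1, origin))
    u2 = sub(p2, origin)
    e2 = unit(sub(u2, scale(e1, dot(e1, u2))))
    e3 = cross(e1, e2)  # right-handed, so invariance holds for rotations, not reflections
    return [e1, e2, e3]

def to_local(x, origin, frame):
    """Coordinates of x relative to origin, expressed in the local frame."""
    d = sub(x, origin)
    return [dot(e, d) for e in frame]

def rigid_motion(x):
    """Rotation about z by 0.8 rad followed by a translation."""
    c, s = math.cos(0.8), math.sin(0.8)
    return [c * x[0] - s * x[1] + 2.0, s * x[0] + c * x[1] - 1.0, x[2] + 0.3]

# Hypothetical triplet defining the frame, plus one query point.
a, b, c3 = [0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0]
query = [0.5, -0.7, 1.2]

local = to_local(query, b, local_frame(b, c3, a))
a2, b2, c2, q2 = (rigid_motion(p) for p in (a, b, c3, query))
local_moved = to_local(q2, b2, local_frame(b2, c2, a2))
```

The local coordinates are identical before and after the rigid motion, so they can be fed to an ordinary network as invariant features.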

Continuity and smoothness While rotation invariance ensures that atomic representations
and predictions do not change under rotation, applications such as molecular dynamics impose
an additional, stricter constraint on Geometric GNNs: the learned Potential Energy Surface
(PES) must be smooth. Since atomic forces are derived as the negative gradient of the energy
($\vec{F}_i = -\nabla_{\vec{x}_i} E$), the energy function must be at least twice differentiable ($C^2$ continuous) with
respect to atomic positions to ensure stable, continuous forces and Hessians during simulation
[Musil et al., 2021]. Standard deep learning operations often violate this requirement, motivating
special architectural choices in Geometric GNNs:

• Basis Functions: To provide smooth, learnable representations of geometry, models
project geometric scalars onto sets of continuous basis functions rather than operating on
raw values. Interatomic distances are expanded using Radial Basis Functions (RBFs) such
as Gaussians [Schütt et al., 2018] or Bessel functions [Gasteiger et al., 2020]. Similarly,
angular and torsional features are embedded using basis sets like Fourier series [Gasteiger
et al., 2021].

• Smooth Envelopes (Cutoffs): Discontinuities inevitably arise when atoms enter or leave
the local cutoff radius r_cut defined during graph construction. To prevent jumps in energy
(and infinite forces) at this boundary, interactions are modulated by a smooth envelope
function which forces the messages and their derivatives to zero as r → r_cut.

• Smooth Activations and Aggregation: Non-smooth non-linearities like ReLU (which
is continuous but has a discontinuous derivative) or aggregation functions like Max-Pooling
introduce discontinuities in the force field. Consequently, Geometric GNNs predominantly employ
smooth activation functions such as Swish/SiLU [Ramachandran et al., 2017], and rely on
summation for aggregation to preserve differentiability throughout the network.
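The three ingredients above can be sketched together as follows. The specific choices here are assumptions made for illustration: a Gaussian radial basis with evenly spaced centres, a polynomial envelope whose value and first two derivatives vanish at the cutoff, and the SiLU activation. The function names and default hyperparameters are likewise hypothetical.

```python
import math

def polynomial_envelope(d, r_cut, p=6):
    """Envelope u(d) with u(0) = 1 and u, u', u'' all vanishing at d = r_cut."""
    if d >= r_cut:
        return 0.0
    x = d / r_cut
    return (1.0
            - 0.5 * (p + 1) * (p + 2) * x ** p
            + p * (p + 2) * x ** (p + 1)
            - 0.5 * p * (p + 1) * x ** (p + 2))

def gaussian_rbf(d, r_cut, n_rbf=8, gamma=10.0):
    """Expand a distance onto Gaussians with centres evenly spaced in [0, r_cut]."""
    centers = [r_cut * k / (n_rbf - 1) for k in range(n_rbf)]
    return [math.exp(-gamma * (d - mu) ** 2) for mu in centers]

def silu(x):
    """Smooth activation x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def edge_features(d, r_cut=5.0):
    """Radial basis expansion modulated by the smooth envelope."""
    env = polynomial_envelope(d, r_cut)
    return [env * phi for phi in gaussian_rbf(d, r_cut)]

features_inside = edge_features(2.0)
features_at_cutoff = edge_features(5.0)
```

The derivative of this envelope is proportional to x^(p-1) (1 - x)^2, which is why both the first and second derivatives go to zero at the cutoff and an atom crossing the boundary does not produce a jump in energy or force.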

2.3.4 From Invariant to Equivariant GNNs

Invariant GNNs, as discussed previously, achieve G-invariance by operating exclusively on
pre-defined local scalar invariants like distances and angles. While this ensures invariance and
can be computationally efficient, it is inherently restrictive: the model is limited to the expressive
power of these fixed, local geometric descriptors.



[Figure 2.12 panels: Invariant features; Equivariant features]

Figure 2.12: The Picasso Problem and the need for equivariance. Identifying a face requires
understanding not just the presence of individual features like eyes, nose, and mouth (invariant
information), but crucially their relative spatial arrangement (equivariant information) [Hinton,
2021]. Similarly, for molecular systems, predicting invariant properties often necessitates
understanding how different structural motifs are geometrically oriented and interact with one
another—an inherently equivariant sub-task. Equivariant representations enable the network to
dynamically learn complex invariants that extend beyond fixed local neighborhoods.

Consider the Picasso Problem illustrated in Figure 2.12. Recognizing a face requires not just
identifying the presence of eyes, a nose, and a mouth (invariant features), but crucially under-
standing their relative spatial arrangement (equivariant information). Similarly, for molecular
systems, predicting an overall invariant property (like the potential energy) often necessitates
understanding how different sub-structures or motifs are oriented and interact geometrically. This
involves solving equivariant sub-tasks. If a model only processes pre-computed local invariants,
it may struggle to capture these crucial relative geometric relationships that define more global
structural characteristics.

This motivates the development of equivariant GNNs. Instead of discarding directional
information by scalarization, these models propagate and transform geometric quantities (like
vectors or higher-order tensors) in a way that ensures their hidden features at each layer remain
equivariant to the symmetry transformations of the input. If the input molecular structure
is rotated, the intermediate vector or tensor features within an equivariant GNN will rotate
correspondingly. This diligent accounting of geometric information allows the network to learn
how to combine these equivariant features. Consequently, equivariant GNNs can construct more
complex and task-relevant invariants dynamically during message passing. The complexity and
range of these learned invariants (e.g., involving atoms further apart or more intricate geometric
relationships) can increase with the number of message passing layers, allowing the model to
capture information beyond fixed local neighborhoods. Furthermore, this equivariant processing
is essential for tasks that require predicting equivariant quantities themselves, such as atomic
forces in molecular dynamics.



Having established the intuition for equivariant representations, we will now introduce two
families of equivariant GNNs that use different bases for representing geometric information:
Cartesian vectors and spherical tensors.

2.3.5 Rotation-equivariant GNNs using Cartesian Vectors

Cartesian equivariant GNNs represent a class of models that operate directly with geometric
quantities in Cartesian coordinates. To maintain physical consistency, they restrict operations
on these geometric features to those that preserve equivariance under rotations, reflections, and
translations. These models typically assign two fundamental types of features to each node:
scalar features invariant under G and vector features that transform covariantly under G.

Equivariant operations To construct equivariant message passing layers which update the
scalar and vector features at each node, only specific operations are permissible:

• Scalar × Scalar → Scalar

• Scalar × Vector → Vector

• Vector · Vector (dot product) → Scalar

• Norm of Vector (∥v⃗∥) → Scalar

• Vector + Vector → Vector (if both are of the same type and representation)

Element-wise non-linear activation functions are generally applied only to scalar quantities, as
applying them directly to vector components can break G-equivariance. The cross-product of
two vectors can also be used, but it is important to note that this operation does not yield a
vector in the same sense as the input vectors; instead, it yields a pseudo-vector which transforms
differently under reflections.2
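The rules above can be checked numerically. The following sketch uses arbitrary example vectors, a rotation about the z-axis, and the inversion through the origin; all names and values are illustrative assumptions. It tests which operations commute with the transformation.

```python
import math

def matvec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def relu_vec(v):
    """Element-wise ReLU applied directly to vector components."""
    return [max(0.0, c) for c in v]

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for x, y in zip(a, b))

theta = 0.9
c, s = math.cos(theta), math.sin(theta)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]       # rotation about z
P = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]  # inversion

u, v = [1.0, -2.0, 0.5], [0.3, 0.4, -1.0]

# Dot product: the same scalar before and after rotating both inputs.
dot_invariant = abs(dot(matvec(R, u), matvec(R, v)) - dot(u, v)) < 1e-9
# Scalar x vector: scaling commutes with rotation.
scaling_equivariant = close(matvec(R, [2.5 * x for x in u]),
                            [2.5 * x for x in matvec(R, u)])
# Element-wise ReLU on components does not commute with rotation.
relu_equivariant = close(relu_vec(matvec(R, u)), matvec(R, relu_vec(u)))
# Cross product: rotates like a vector, but is unchanged under inversion.
cross_rotates = close(cross(matvec(R, u), matvec(R, v)), matvec(R, cross(u, v)))
cross_even_under_inversion = close(cross(matvec(P, u), matvec(P, v)), cross(u, v))
```

For these inputs `relu_equivariant` evaluates to `False` while the other flags are `True`, matching the statements about non-linearities and pseudo-vectors.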

Message passing with scalars and vectors These fundamental equivariant operations form
the toolkit for constructing message passing layers in Equivariant GNNs. The network updates
both scalar features si and vector features v⃗i at each node i by aggregating information from its
neighbors j ∈ Ni while preserving equivariance. The general message passing update can be
formulated as:

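$$
m_i^{(t)}, \vec{m}_i^{(t)} := \mathrm{AGG}\Big( \{\{ \big( s_i^{(t)}, s_j^{(t)}, \vec{v}_i^{(t)}, \vec{v}_j^{(t)}, \vec{x}_{ij} \big) \mid j \in \mathcal{N}_i \}\} \Big) \quad \text{(Aggregate)} \tag{2.15}
$$

$$
s_i^{(t+1)}, \vec{v}_i^{(t+1)} := \mathrm{UPD}\Big( \big( s_i^{(t)}, \vec{v}_i^{(t)} \big), \big( m_i^{(t)}, \vec{m}_i^{(t)} \big) \Big) \quad \text{(Update)} \tag{2.16}
$$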

2 The cross-product of two vectors a⃗ × b⃗ is a pseudo-vector. Under inversion (reflection through the origin),
a⃗ → −a⃗ and b⃗ → −b⃗, so (−a⃗) × (−b⃗) = a⃗ × b⃗. A vector, however, would invert: v⃗ → −v⃗. This distinction is
important when considering O(3) rather than only SO(3) equivariance.



(a) PaiNN (b) TFN

Figure 2.13: Equivariant GNN message passing. G-equivariant layers such as PaiNN and
TFN propagate geometric quantities such as vectors, relative positions, or tensors.

Here, both AGG and UPD are composed of the permissible equivariant operations.
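For example, PaiNN [Schütt et al., 2021] (Figure 2.13a) implements interaction layers where
the aggregated messages $m_i^{(t)}$ (scalar) and $\vec{m}_i^{(t)}$ (vector) are computed as:

$$
m_i^{(t)} := s_i^{(t)} + \sum_{j \in \mathcal{N}_i} f_1\Big( s_j^{(t)}, \|\vec{x}_{ij}\| \Big) \tag{2.17}
$$

$$
\vec{m}_i^{(t)} := \vec{v}_i^{(t)} + \sum_{j \in \mathcal{N}_i} f_2\Big( s_j^{(t)}, \|\vec{x}_{ij}\| \Big) \odot \vec{v}_j^{(t)} + \sum_{j \in \mathcal{N}_i} f_3\Big( s_j^{(t)}, \|\vec{x}_{ij}\| \Big) \odot \vec{x}_{ij} \tag{2.18}
$$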

The learnable filter functions f1, f2, f3 (typically MLPs) take scalar inputs (neighbor’s scalar
features $s_j^{(t)}$ and the invariant distance ∥x⃗ij∥) and produce scalar outputs. The output of f1 is
aggregated directly as a scalar message, while the outputs of f2 and f3 scale vectors via
multiplication ⊙, which is consistent with the allowed equivariant operations. The subsequent
update step in PaiNN, yielding $s_i^{(t+1)}$ and $\vec{v}_i^{(t+1)}$, often employs a gated non-linearity
[Weiler et al., 2018] for the vector features:

$$
s_i^{(t+1)} := m_i^{(t)} + f_4\Big( m_i^{(t)}, \|\vec{m}_i^{(t)}\| \Big), \qquad \vec{v}_i^{(t+1)} := \vec{m}_i^{(t)} + f_5\Big( m_i^{(t)}, \|\vec{m}_i^{(t)}\| \Big) \odot \vec{m}_i^{(t)}. \tag{2.19}
$$

Here, f4, f5 are learnable functions. The vector update scales $\vec{m}_i^{(t)}$ by a scalar derived from
$m_i^{(t)}$ and the norm $\|\vec{m}_i^{(t)}\|$, preserving equivariance. Other notable Equivariant GNNs include
E(n)-GNN [Satorras et al., 2021] and GVP-GNN [Jing et al., 2020], which employ similar
principles of equivariant operations on Cartesian scalars and vectors.
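A minimal sketch of a scalar-vector layer in the form of equations 2.17 to 2.19 is given below. It is not the PaiNN implementation: the filters f1 to f5 are replaced by a single fixed hand-written function `f`, every other atom is treated as a neighbour, and all feature values are made up for illustration. The point is only to check that the scalar outputs are invariant and the vector outputs rotate with the input.

```python
import math

def add(a, b):
    return [a[k] + b[k] for k in range(3)]

def sub(a, b):
    return [a[k] - b[k] for k in range(3)]

def scale(a, c):
    return [c * a[k] for k in range(3)]

def norm(a):
    return math.sqrt(sum(c * c for c in a))

def rotate(v, theta=1.1):
    """Rotate a 3-vector about the x-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [v[0], c * v[1] - s * v[2], s * v[1] + c * v[2]]

def f(k, a, b):
    """Hypothetical stand-ins for the learned filters f1..f5 (scalars in, scalar out)."""
    return math.tanh(0.3 * k + a) * math.exp(-b)

def painn_like_layer(s, v, pos):
    """One aggregate-and-update step using only permissible equivariant operations."""
    s_new, v_new = [], []
    for i in range(len(s)):
        m_s, m_v = s[i], list(v[i])
        for j in range(len(s)):
            if j == i:
                continue
            x_ij = sub(pos[i], pos[j])
            d = norm(x_ij)
            m_s += f(1, s[j], d)                        # scalar message (eq. 2.17)
            m_v = add(m_v, scale(v[j], f(2, s[j], d)))  # scalar x neighbour vector (eq. 2.18)
            m_v = add(m_v, scale(x_ij, f(3, s[j], d)))  # scalar x relative position (eq. 2.18)
        gate_in = norm(m_v)
        s_new.append(m_s + f(4, m_s, gate_in))                 # eq. 2.19, scalar part
        v_new.append(add(m_v, scale(m_v, f(5, m_s, gate_in)))) # eq. 2.19, gated vector part
    return s_new, v_new

s = [0.1, -0.4, 0.9]
v = [[0.2, 0.0, -0.1], [0.0, 0.5, 0.3], [-0.3, 0.1, 0.0]]
pos = [[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [-0.5, 0.9, 0.4]]

s_out, v_out = painn_like_layer(s, v, pos)
s_rot, v_rot = painn_like_layer(s, [rotate(x) for x in v], [rotate(x) for x in pos])
```

Rotating the inputs leaves `s_out` unchanged and rotates `v_out` by the same rotation, without the layer ever being told which rotation was applied.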

Higher-order Cartesian tensors The frameworks discussed above restrict node features
to scalars (rank-0) and vectors (rank-1), making them computationally efficient. However,
geometric information can also be encoded in higher-order Cartesian tensors, such as 3 × 3
matrices (rank-2) or higher-dimensional arrays. These quantities are naturally constructed via the
outer product of lower-order features. For instance, given two equivariant vectors $\vec{u}, \vec{v} \in \mathbb{R}^3$,
their outer product $\vec{u} \otimes \vec{v} \in \mathbb{R}^{3 \times 3}$ is a rank-2 tensor that transforms as
$\vec{u} \otimes \vec{v} \to (R\vec{u}) \otimes (R\vec{v}) = R(\vec{u} \otimes \vec{v})R^\top$ under rotation R. Models such as TensorNet [Simeon and Fabritiis, 2023] leverage
this principle, allowing nodes to update rank-2 tensor features through mixing operations that
respect these transformation rules. While expressive, explicitly maintaining full Cartesian tensors
of rank k becomes computationally expensive, as memory scales with $3^k$. Furthermore, these
representations are mathematically reducible, meaning they contain a mixture of lower-order
symmetric subspaces (e.g., a 3× 3 matrix contains a scalar trace, a vector antisymmetric part,
and a symmetric traceless component). See Duval et al. [2023a] for a detailed derivation.
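The reducibility of a rank-2 Cartesian tensor can be made concrete with a short sketch. The example vectors and helper names are illustrative assumptions. The outer product of two vectors is split into its trace, antisymmetric and symmetric traceless parts, and the first two are compared with the dot and cross products.

```python
def outer(u, v):
    return [[u[i] * v[j] for j in range(3)] for i in range(3)]

def decompose(M):
    """Split a 3x3 tensor into trace (1), antisymmetric (3) and symmetric traceless (5) parts."""
    trace = M[0][0] + M[1][1] + M[2][2]
    anti = [[0.5 * (M[i][j] - M[j][i]) for j in range(3)] for i in range(3)]
    sym0 = [[0.5 * (M[i][j] + M[j][i]) - (trace / 3.0 if i == j else 0.0)
             for j in range(3)] for i in range(3)]
    return trace, anti, sym0

u, v = [1.0, -2.0, 0.5], [0.3, 0.4, -1.0]
M = outer(u, v)
trace, anti, sym0 = decompose(M)

dot_uv = sum(a * b for a, b in zip(u, v))
cross_uv = [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]
# The three independent antisymmetric entries carry the cross product.
vector_part = [2 * anti[1][2], 2 * anti[2][0], 2 * anti[0][1]]

# The three parts add back up to the original tensor.
rebuilt = [[(trace / 3.0 if i == j else 0.0) + anti[i][j] + sym0[i][j]
            for j in range(3)] for i in range(3)]

# Number of stored components for a full Cartesian tensor of rank k.
n_components = {k: 3 ** k for k in range(1, 5)}
```

The trace reproduces the dot product and the antisymmetric part reproduces the cross product, so storing the full 3 × 3 array duplicates information that lower-order features already carry.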

2.3.6 Rotation-equivariant GNNs using Spherical Tensors

As noted above, while high-order interactions can be represented using Cartesian tensors, doing
so is redundant because these tensors are reducible. For example, a rank-2 Cartesian tensor
(9 components) decomposes into three independent subspaces that transform differently under
rotation: a scalar (1 component), a vector (3 components), and a symmetric traceless tensor (5
components). Rotation-equivariant GNNs using Spherical Tensors aim to work directly with
these irreducible representations (irreps) of the rotation group SO(3), avoiding the redundancy
of the Cartesian basis. Instead of raw coordinate arrays, geometric features are represented as
spherical tensors $\tilde{h}_{i,l} \in \mathbb{R}^{(2l+1) \times f}$ of degree l (with f feature channels), where l = 0
corresponds to scalar features si, l = 1 to vector features v⃗i, l = 2 to the symmetric traceless
tensors mentioned previously, and higher degrees capture more complex angular information.
Models in this family, such as Tensor Field Networks (TFN) [Thomas et al., 2018], Cormorant [Anderson
et al., 2019], SEGNN [Brandstetter et al., 2022], and MACE [Batatia et al., 2022b], typically
represent node features as collections of such spherical tensors for degrees
l = 0, 1, . . . , Lmax. The key equivariant operations involve
Spherical Harmonics Y (x̂ij) to encode directional information from relative positions, learnable
radial basis functions f(∥x⃗ij∥) for distances, and Clebsch-Gordan coefficients for combining
these into equivariant tensor products.

Message passing with spherical tensors In these models, node features (collections of spherical
tensors $\tilde{h}_i^{(t)}$) are updated by aggregating messages from neighbors. The core of message
construction is an equivariant tensor product. The aggregated message for node i typically
involves summing contributions from each neighbor j. Each contribution is a tensor product
of the neighbor’s features $\tilde{h}_j^{(t)}$ and the spherical harmonic representation $Y(\hat{x}_{ij})$ of the relative
direction $\hat{x}_{ij} = \vec{x}_{ij} / \|\vec{x}_{ij}\|$. This tensor product is weighted by learnable functions of the
interatomic distance ∥x⃗ij∥. A simplified form for the update introduced in Thomas et al. [2018],
including a residual connection, is:

$$
\tilde{h}_i^{(t+1)} := \tilde{h}_i^{(t)} + \sum_{j \in \mathcal{N}_i} Y(\hat{x}_{ij}) \otimes_w \tilde{h}_j^{(t)}, \tag{2.20}
$$

where ⊗w denotes a learnable tensor product. The weights w are typically outputs of a neural
network (e.g., an MLP) applied to the radial distance, w = MLP(∥x⃗ij∥), producing different
weights for different interaction paths in the tensor product.

More explicitly, to obtain the component m3 of the order-l3 part of the message $\tilde{m}_{ij}$ from
neighbor j (before summation and update), the tensor product in equation 2.20 can be expanded
using Clebsch-Gordan coefficients $C^{l_3 m_3}_{l_1 m_1, l_2 m_2}$:

$$
(\tilde{m}_{ij})_{l_3 m_3} := \sum_{l_1 m_1, l_2 m_2} C^{l_3 m_3}_{l_1 m_1, l_2 m_2} \, f_{l_1 l_2 l_3}(\|\vec{x}_{ij}\|) \, Y_{l_1}^{m_1}(\hat{x}_{ij}) \, \tilde{h}^{(t)}_{j, l_2 m_2}. \tag{2.21}
$$

The learnable radial function $f_{l_1 l_2 l_3}(\cdot)$ depends on the specific orders l1, l2, l3 involved in the
interaction path. The Clebsch-Gordan coefficients ensure that the resulting message components,
and thus the updated features $\tilde{h}_i^{(t+1)}$, transform equivariantly under SO(3).

The tensor product in equation 2.21 can be seen as a generalization of the Cartesian vector
operations used in the scalar-vector equivariant GNNs introduced previously. Notably, when restricting
the tensor product to only scalars (i.e., Lmax = 0), we obtain updates similar in form to
SchNet (equation 2.13). Similarly, when using only scalars and vectors (i.e., Lmax = 1), the
operations resemble those in Cartesian equivariant GNNs like PaiNN (Equations 2.17 and 2.18).
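To make the correspondence with Cartesian operations concrete, the interaction paths of equation 2.21 with l1, l2 ≤ 1 can be written out explicitly. The following is a sketch rather than a derivation: it assumes real spherical harmonics, for which the l = 0 harmonic is a constant and the l = 1 harmonic is proportional to the unit vector x̂ij, and it omits normalization constants and basis-ordering conventions, which differ between implementations. Here s_j and v⃗_j denote the l = 0 and l = 1 parts of the neighbour features.

```latex
% Interaction paths (l_1, l_2, l_3) of equation 2.21 for l_1, l_2 \le 1,
% up to normalization constants and basis conventions.
\begin{align*}
(0, 0, 0):&\quad m_{ij} \;\propto\; f_{000}(\|\vec{x}_{ij}\|)\, s_j
  && \text{scalar} \times \text{scalar} \\
(1, 0, 1):&\quad \vec{m}_{ij} \;\propto\; f_{101}(\|\vec{x}_{ij}\|)\, s_j\, \hat{x}_{ij}
  && \text{scalar} \times \text{direction} \\
(0, 1, 1):&\quad \vec{m}_{ij} \;\propto\; f_{011}(\|\vec{x}_{ij}\|)\, \vec{v}_j
  && \text{scalar} \times \text{vector} \\
(1, 1, 0):&\quad m_{ij} \;\propto\; f_{110}(\|\vec{x}_{ij}\|)\, \hat{x}_{ij} \cdot \vec{v}_j
  && \text{dot product} \\
(1, 1, 1):&\quad \vec{m}_{ij} \;\propto\; f_{111}(\|\vec{x}_{ij}\|)\, \hat{x}_{ij} \times \vec{v}_j
  && \text{cross product (pseudo-vector)}
\end{align*}
```

The (1, 0, 1) and (0, 1, 1) paths have the same structure as the two vector-valued sums in equation 2.18, while the remaining (1, 1, 2) path yields the symmetric traceless part of the outer product of x̂ij and v⃗_j.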

For a comprehensive overview of this class of models, we refer to dedicated surveys such as
Batatia et al. [2022a] and Geiger and Smidt [2022].

Architectural improvements The original TFN architecture [Thomas et al., 2018] has been
extended and improved in several ways. SE(3)-Transformers [Fuchs et al., 2020] introduced
equivariant self-attention for aggregation. SEGNN [Brandstetter et al., 2022] developed equivari-
ant non-linear convolutions using steerable MLPs within a message passing framework, offering
a recipe for equivariant MPNNs with spherical tensor features. Equiformer [Liao and Smidt,
2023] combined these two ideas by interleaving equivariant self-attention with non-linear up-
dates in local Transformer-style blocks. Notably, the self-attention weights in these models are
invariant, derived from scalarized geometric information, and re-weight neighborhood features
during equivariant message passing.

MACE [Batatia et al., 2022b] incorporates many-body interaction terms3 by factorizing
higher-order terms into products of two-body representations, drawing from the Atomic Cluster
Expansion (ACE) formalism [Drautz, 2019]. This "density-trick"4 exchanges summation and
multiplication (e.g., $(a + b)^2$ efficiently yields $a^2$, $ab$, $ba$, $b^2$ terms with coupled coefficients) to
reduce operations. MACE sums one-body features (like spherical harmonic embeddings of
neighbors) and then takes tensor products of these aggregates, efficiently generating high-order
terms. Allegro [Musaelian et al., 2022] implements the ACE framework with a single message
passing layer and an extended local cutoff, enabling efficient GPU parallelization for simulating
large systems.

3Many-body effects refer to the collective behaviour of a large number of interacting constituents. They are
needed for an accurate description of both the structure and dynamics of large chemical systems.

4This idea was originally used in Bartók et al. [2010, 2013], and is referred to as density-trick by Drautz [2019].
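The exchange of summation and multiplication behind the density trick can be illustrated with plain scalars. This is a deliberately simplified sketch: in the models discussed above the product is an equivariant tensor product of aggregated features rather than a product of numbers, and the function names and values below are assumptions for illustration.

```python
def pairwise_sum(a):
    """Sum over all ordered neighbour pairs (j, k): O(N^2) products."""
    return sum(a_j * a_k for a_j in a for a_k in a)

def density_trick_sum(a):
    """Aggregate once over neighbours, then multiply: O(N) work, same value."""
    total = sum(a)
    return total * total

# Illustrative per-neighbour features.
a = [0.5, -1.25, 2.0, 0.75]

naive = pairwise_sum(a)
factored = density_trick_sum(a)
```

Both functions return the same number, but the second never enumerates neighbour pairs explicitly, which is what makes higher body-order terms affordable.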

eSCN [Passaro and Zitnick, 2023] addresses the computational cost of high-rank tensor
products by reducing SO(3) equivariant convolutions to equivalent SO(2) convolutions. This
is achieved by aligning the node embeddings’ primary axis with the edge vector, simplifying
the rotational symmetry to 2D. Despite requiring extra Wigner D-matrix rotations for alignment,
this sparsifies Clebsch-Gordan coefficients, speeding up computations for l > 1. EquiformerV2
[Liao et al., 2024a] leverages the eSCN technique to scale Equiformer to hundreds of millions of
parameters for the first time.

2.3.7 Unconstrained GNNs

The geometric GNN families discussed previously enforce physical symmetries via architectural
constraints. An alternative class of unconstrained GNNs does not strictly enforce symmetries, but
instead incorporates them more flexibly as inductive biases. This enables the direct use of relative
positions x⃗ij and other geometric quantities in MLPs used for message passing:

$$
s_i^{(t+1)} = \phi\Big( s_i^{(t)}, \sum_{j \in \mathcal{N}_i} \psi\big( s_i^{(t)}, s_j^{(t)}, \vec{x}_{ij} \big) \Big). \tag{2.22}
$$

This strategy trades the guarantee of strict equivariance for potentially greater model
expressiveness and computational efficiency, as it does not require explicit equivariant operations or
scalarization.

A straightforward but effective approach achieves approximate symmetry through data aug-
mentation. For example, ForceNet [Hu et al., 2021] implicitly learns symmetries by training on
multiple random rotations of each geometric graph, similar to how Vision Transformers [Doso-
vitskiy et al., 2021] learn approximate equivariance from augmented training data. Additionally,
soft constraints can be introduced via regularization terms in the loss function to encourage
symmetry preservation [Elhag et al., 2024].
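Rotation augmentation of the kind described above can be sketched as follows. The sampling scheme (a normalized four-dimensional Gaussian vector interpreted as a unit quaternion, which gives uniformly distributed rotations) and the function names are assumptions chosen for this example, not details taken from the cited works.

```python
import math
import random

def random_rotation(rng):
    """Uniform random rotation matrix from a normalized 4D Gaussian (unit quaternion)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def is_rotation(R, tol=1e-9):
    """Check that R R^T = I and det R = +1."""
    gram_ok = all(
        abs(sum(R[i][k] * R[j][k] for k in range(3)) - (1.0 if i == j else 0.0)) < tol
        for i in range(3) for j in range(3))
    det = (R[0][0] * (R[1][1] * R[2][2] - R[1][2] * R[2][1])
           - R[0][1] * (R[1][0] * R[2][2] - R[1][2] * R[2][0])
           + R[0][2] * (R[1][0] * R[2][1] - R[1][1] * R[2][0]))
    return gram_ok and abs(det - 1.0) < tol

def augment(pos, rng):
    """Return a randomly rotated copy of the coordinates; targets such as energy stay unchanged."""
    R = random_rotation(rng)
    return [[sum(R[i][k] * p[k] for k in range(3)) for i in range(3)] for p in pos]

rng = random.Random(0)
pos = [[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [-0.5, 0.9, 0.4]]
rotated = augment(pos, rng)
```

During training, a fresh rotation would be drawn each time a structure is visited, so that the network sees the same invariant target for many orientations of the same geometry.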

Alternatively, canonization-based approaches address equivariance at the data representa-
tion stage by transforming input into a canonical frame before applying standard GNNs. FAENet
[Duval et al., 2023b] uses PCA to project data into canonical space, then uses the relative
positions x⃗ij in message passing. Rather than relying on hand-designed canonization methods
like PCA, Kaba et al. [2023] proposed learning the canonization transform using a shallow
equivariant network. Other approaches focus on local canonization, defining distinct coordinate
frames at each atom and projecting tensor information onto these local frames. For instance,
Pozdnyakov and Ceriotti [2023] introduced an unconstrained Geometric Transformer that em-
ploys an Equivariant Coordinate System Ensemble, averaging predictions from a non-equivariant
network over multiple such local coordinate systems.



While unconstrained GNNs offer computational advantages, the lack of exact symmetry
guarantees means that they may not always produce physically consistent predictions. This
can lead to inaccuracies in tasks requiring strict adherence to physical laws, such as molecular
simulation [Fu et al., 2023, Bigi et al., 2025]. These accuracy-scalability trade-offs are explored
further in Chapter 4 in the context of generative modelling.

2.3.8 Applications of Geometric GNNs

Geometric GNNs have been successfully applied across materials science, chemistry, and
biology, where molecular systems are naturally represented as geometric graphs and predictions
are correlated with physical symmetries. Two primary application domains have emerged as
particularly prominent and impactful: molecular dynamics simulation and property prediction.

Dynamics simulation Understanding the dynamic behavior of molecular systems is funda-
mental to predicting their properties and functions. Almost a century ago, Dirac postulated that
the fundamental mathematical principles describing interactions within materials and molecules
at the atomic scale, based on quantum mechanics, were largely understood [Dirac, 1929]. While
quantum mechanics can, in principle, be used to simulate all kinds of matter, the inherent mathe-
matical complexity makes exact calculations intractable for most practically relevant systems,
necessitating the development of approximate methods. Density Functional Theory (DFT) [Kohn
et al., 1996] became a cornerstone for such approximations, but its cubic scaling with system
size limits simulations to hundreds of atoms. To bridge the gap between quantum accuracy
and large-scale simulation, Machine Learning-based Interatomic Potentials (MLIPs) have been
proposed over the past decade [Behler and Parrinello, 2007, Bartók et al., 2010, 2013, 2017,
Unke et al., 2021]. These models, trained on QM or DFT data, can approximate quantum
mechanical calculations with high accuracy, often being able to generalize beyond their training
data to larger systems.

Geometric GNNs, especially those based on spherical tensor representations, have emerged
as a leading model class for developing MLIPs [Batzner et al., 2022, Batatia et al., 2022b, Wood
et al., 2025]. These models generally predict the potential energy of an atomic configuration,
from which interatomic forces are obtained as the negative gradient of the energy with respect
to atomic positions, so that the resulting force field is conservative. These forces are then used
to simulate the dynamics of the system. Crucially, this application relies on the $C^2$
continuity guarantees of the Geometric GNN representations discussed in Section 2.3.3, ensuring
that the derived forces are continuous and energy-conserving for stable simulation [Fu et al.,
2025].
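The relationship between a predicted energy and the forces used for simulation can be sketched with a toy example. Everything here is an assumption made for illustration: the energy is a hand-written smooth pair potential standing in for a trained model, and the gradient is taken by central finite differences, whereas in practice it would typically be obtained by automatic differentiation.

```python
import math

def energy(pos):
    """Toy smooth pairwise energy (Morse-like), standing in for a learned potential."""
    e = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d = math.dist(pos[i], pos[j])
            e += (1.0 - math.exp(-(d - 1.0))) ** 2
    return e

def forces(pos, h=1e-5):
    """F_i = -dE/dx_i, estimated with central finite differences."""
    result = []
    for i in range(len(pos)):
        f_i = []
        for k in range(3):
            plus = [list(p) for p in pos]
            minus = [list(p) for p in pos]
            plus[i][k] += h
            minus[i][k] -= h
            f_i.append(-(energy(plus) - energy(minus)) / (2.0 * h))
        result.append(f_i)
    return result

pos = [[0.0, 0.0, 0.0], [1.3, 0.1, -0.2], [-0.4, 1.1, 0.5]]
F = forces(pos)
net_force = [sum(f[k] for f in F) for k in range(3)]

# A stretched dimer: the force on atom 0 should point towards atom 1.
dimer_forces = forces([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
```

Because the energy depends only on interatomic distances, it is translation invariant and the forces sum to zero, which is one of the physical consistency properties that deriving forces from an energy provides automatically.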

Property prediction Beyond simulation, Geometric GNNs are also employed for predicting
functional and experimental properties that may not be directly derivable from first-principles

43


quantum mechanical calculations. This is also known as Quantitative structure-activity relation-
ships [Todeschini and Consonni, 2009], and involves training GNNs to predict diverse functional
properties of small molecules [Stokes et al., 2020], proteins [Gligorijević et al., 2021], crystals
[Xie and Grossman, 2018], and electrocatalysis systems [Lan et al., 2023, Wander et al., 2025].

The typical workflow involves training GNNs on datasets comprising experimentally mea-
sured or computationally derived functional properties, learning to map from the geometric graph
representation of the systems to these target properties. In the context of designing new molecules
and materials, these models can guide the design process through two primary approaches: (1)
High-throughput screening: Evaluating large databases of known or synthesizable molecules
and materials to identify promising candidates based on GNN-predicted properties [Buterez
et al., 2023]; and (2) Generative inverse design: Training generative models (discussed in the
following section) to design novel molecules or materials tailored to desired property profiles
[Sanchez-Lengeling and Aspuru-Guzik, 2018].

2.4 Generative Modelling of Molecular Systems

Having explored various classes of Geometric GNN architectures for learning molecular rep-
resentations, we now turn to the complementary problem of generative modelling. While
representation learning focuses on predictive understanding of molecular systems, generative
models aim to create new molecules with desired characteristics [Du et al., 2024, Winnifrith
et al., 2024]. In the following sections, we will discuss the most relevant generative models
used in this thesis, which can be broadly categorized into autoregressive models, variational
autoencoders, and diffusion models.

2.4.1 Autoregressive (Language) Models

Autoregressive models are a class of generative models that learn to predict the next element
in a sequence given the previous elements [Graves, 2013]. They have achieved remarkable
success in natural language processing, where they model sequences of words or characters
[Sutskever et al., 2014, Bahdanau et al., 2015, Vaswani et al., 2017a], and have been adapted to
molecular systems by generating sequences of categorical tokens such as SMILES strings for
small molecules [Segler et al., 2018], amino acid sequences for proteins [Madani et al., 2023], or
nucleotide sequences for RNA [Shulgina et al., 2024].

Given a sequence X = (x1, x2, . . . , xT ) of T tokens, where each token xt belongs to a
predefined vocabulary V , an autoregressive model defines the joint probability distribution by
factorizing it as:

$$
P(X) = P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}; \theta), \tag{2.23}
$$

where $P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$ represents the probability of generating the t-th token xt conditioned
on all preceding tokens x<t, parameterized by model parameters θ.

Modern autoregressive models typically employ Transformer architectures [Vaswani et al.,
2017a] to model these conditional probabilities. At each step t, the model processes the sequence
of previously generated tokens (x1, . . . , xt−1) and outputs a categorical probability distribution
over the vocabulary V for the next token xt. Training proceeds by maximizing the log-likelihood
(equation 2.23) of observed sequences from the training dataset using a cross entropy loss.

The training loss for a single token at position t in an autoregressive model is the cross-
entropy loss between the predicted probability distribution over the vocabulary and the true next
token:

$$
\mathcal{L}_t = -\sum_{v=1}^{|\mathcal{V}|} \mathbb{1}[x_t = v] \, \log P(x_t = v \mid x_1, \ldots, x_{t-1}; \theta), \tag{2.24}
$$

where |V| is the vocabulary size, 1[xt = v] is an indicator function that equals 1 if the true token
at position t is vocabulary item v and 0 otherwise, and P (xt = v|x1, . . . , xt−1; θ) is the model’s
predicted probability for vocabulary item v at position t. The total training loss for a sequence is
the average or sum of cross-entropy losses across all token positions.
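The factorization in equation 2.23 and the per-token loss in equation 2.24 can be sketched with a toy model. The function `next_token_probs` below is a hypothetical stand-in for a trained network: it produces an arbitrary but valid distribution from the prefix. The vocabulary is likewise made up for illustration.

```python
import math

VOCAB = ["C", "O", "N", "<eos>"]

def next_token_probs(prefix):
    """Hypothetical stand-in for a trained model: returns P(x_t | x_<t) over VOCAB."""
    last = VOCAB.index(prefix[-1]) if prefix else 0
    logits = [math.sin(1.0 + v + 2.0 * last + 0.5 * len(prefix))
              for v in range(len(VOCAB))]
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

def token_loss(prefix, target):
    """Cross-entropy of eq. 2.24: the indicator selects the log-probability of the true token."""
    probs = next_token_probs(prefix)
    return -sum((1.0 if VOCAB[v] == target else 0.0) * math.log(probs[v])
                for v in range(len(VOCAB)))

def sequence_log_prob(seq):
    """log P(X) = sum_t log P(x_t | x_<t), i.e. the logarithm of eq. 2.23."""
    return -sum(token_loss(seq[:t], seq[t]) for t in range(len(seq)))

example = ["C", "O", "<eos>"]
log_p = sequence_log_prob(example)
total_loss = sum(token_loss(example[:t], example[t]) for t in range(len(example)))
```

Here `total_loss` equals `-log_p`, which makes explicit that minimizing the summed cross-entropy is the same as maximizing the log-likelihood of the sequence.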

A crucial component of Transformers for autoregressive modelling is masked or causal
attention, which prevents the model from accessing future tokens during training. This masking
ensures that predictions for token xt depend strictly on x<t, maintaining the autoregressive
property. Additionally, it is conventional to use teacher forcing [Williams and Zipser, 1989]
to stabilize training, where the true previous tokens are fed into the model at each position t,
rather than sampling from the model’s own predictions. During inference, tokens are sampled
sequentially from the model’s output distribution and fed back as input for subsequent predictions.
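A causal attention mask and the sequential sampling loop used at inference can be sketched as follows. The model here is a hypothetical placeholder that returns a uniform distribution, and the vocabulary, function names and maximum length are assumptions for illustration only.

```python
import random

VOCAB = ["C", "O", "N", "<eos>"]

def causal_mask(T):
    """mask[t][u] is True when position t may attend to position u, i.e. u <= t."""
    return [[u <= t for u in range(T)] for t in range(T)]

def uniform_model(prefix):
    """Hypothetical stand-in for a trained network: uniform next-token distribution."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def sample_sequence(model, rng, max_len=12):
    """Sample tokens one at a time, feeding each sampled token back as input."""
    seq = []
    while len(seq) < max_len:
        token = rng.choices(VOCAB, weights=model(seq), k=1)[0]
        seq.append(token)
        if token == "<eos>":
            break
    return seq

mask = causal_mask(4)
generated = sample_sequence(uniform_model, random.Random(0))
```

During training with teacher forcing, the mask allows all positions of a ground-truth sequence to be processed in parallel, whereas at inference the loop must run one token at a time because each input depends on the previous sample.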

2.4.2 Variational Autoencoders

Autoencoders are a type of architecture that learn to encode input data into a (usually) lower-
dimensional latent representation and then decode it back to reconstruct the original data.
Autoencoders were first introduced for dimensionality reduction and unsupervised learning, with
various regularization techniques preventing the latent representation from copying the input
perfectly, as a result of which the model learns a useful representation of the data [Hinton and
Zemel, 1993, Vincent et al., 2008].

Variational autoencoders (VAEs) [Kingma and Welling, 2014, Rezende et al., 2014] are
autoencoders based on variational Bayesian methods, connecting the encoder and decoder via a
probabilistic latent space that corresponds to the parameters of a variational distribution. This
probabilistic perspective of mapping data points to distributions in latent space allows VAEs
to be utilized as generative models. VAEs have been widely used for generative modelling of
continuous natural data such as images and audio [Van Den Oord et al., 2017], and were some of
the first generative models for molecular design [Gómez-Bombarelli et al., 2018].

Formally, a VAE consists of two main components: an encoder and a decoder. The encoder
qϕ(z|x) is a neural network that maps the input data x to the parameters of a probability
distribution over a latent space Z . This distribution is typically a multivariate Gaussian
N (z;µ, diag(σ2)). The encoder network outputs the mean µ and, for numerical stability, the
logarithm of the standard deviations, logσ.5 The decoder pθ(x|z) is another neural network
that maps a sample z ∼ qϕ(z|x) from the latent distribution qϕ(z|x) back to the data space,
aiming to reconstruct the original input x.

The marginal likelihood of the data is given by P(x) = ∫ pθ(x|z) p(z) dz. This integral
does not have an analytic solution or efficient estimator; see Kingma and Welling [2019] for
a full derivation. Rather than directly maximizing the intractable likelihood P(x), VAEs are
trained by maximizing a lower bound on log P(x) that rewards accurate reconstruction while
keeping the latent distribution close to the prior distribution p(z). This bound, termed the
Evidence Lower Bound (ELBO), is formulated as:

LELBO(θ, ϕ) = Ez∼qϕ(z|x)[log pθ(x|z)]−DKL (qϕ(z|x) || p(z)) , (2.25)

where the first term is the reconstruction term measuring how well the decoder can reconstruct
the input x using the latent representation z, and the second term is the Kullback-Leibler
(KL) divergence between the learned latent distribution qϕ(z|x) and the prior distribution p(z)
(typically a standard Gaussian). Such a KL term with a Gaussian prior acts as a regularizer,
encouraging the latent space to be well-structured and enabling smooth interpolation between
data points.

To optimize the ELBO with respect to the encoder parameters ϕ, we need to backpropagate
gradients through the sampling step z ∼ qϕ(z|x). However, sampling is stochastic and non-
differentiable. The reparameterization trick solves this by expressing the random sample z as a
deterministic function of the encoder parameters and an auxiliary noise variable, thus creating
a differentiable path for gradients. Specifically, given the encoder outputs µ and logσ for the
latent distribution qϕ(z|x) = N (z;µ, diag(σ2)), a sample z can then be drawn as:

z = µ+ σ ⊙ ϵ, where ϵ ∼ N (0, I) (2.26)

This formulation allows gradients to backpropagate through µ and logσ. Once trained, new
samples can be generated by sampling new latents from the prior distribution p(z) and mapping
to data space through the decoder pθ(x|z).
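
The reparameterization step in equation 2.26 and the KL term of equation 2.25 can be sketched in a few lines of NumPy. This is a minimal illustration rather than a full VAE: the encoder outputs `mu` and `log_sigma` are hypothetical hard-coded arrays, and the closed-form KL expression for a diagonal Gaussian against a standard Gaussian prior is the standard result, which the text above does not spell out.

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), as in equation 2.26."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    sigma_sq = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma_sq + mu**2 - 1.0 - 2.0 * log_sigma, axis=-1)

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for a batch of two inputs and a 2D latent space.
mu = np.array([[0.5, -1.0], [0.0, 0.0]])
log_sigma = np.array([[0.0, -0.5], [0.0, 0.0]])

z = reparameterize(mu, log_sigma, rng)    # latent samples fed to the decoder
kl = kl_to_standard_normal(mu, log_sigma)  # regulariser in the ELBO
```

The second row of `mu` and `log_sigma` already matches the prior, so its KL term vanishes; the first row is penalised for deviating from it. In a real model the ELBO would subtract `kl` from the decoder log-likelihood of `x` given `z`.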

For molecular applications, VAEs are generally trained to learn continuous latent represen-
tations where molecules with similar properties cluster together. This structured latent space
enables various design tasks: generating novel structures, interpolating between known molecules
to explore chemical space, and performing property optimization [Gómez-Bombarelli et al.,
2018, Griffiths and Hernández-Lobato, 2020].

⁵ The actual standard deviations are σ = exp(log σ), which determines the variance σ².

2.4.3 Diffusion and Flow Matching Models

Diffusion and flow matching models are a class of generative models that generate data via an
iterative denoising process [Sohl-Dickstein et al., 2015, Lipman et al., 2023]. Rather than directly
learning to generate complex data distributions, these models learn to reverse a noise corruption
process, progressively refining random noise into highly realistic samples. Diffusion and flow
matching models are the state-of-the-art generative modelling approach for continuous domains
such as images, audio, and video [Esser et al., 2024, Betker et al., 2023, Brooks et al., 2024].
The denoising formulation has also proven remarkably effective when adapted to molecular
systems, powering many recent breakthroughs in molecular design [Hoogeboom et al., 2022,
Watson et al., 2023, Zeni et al., 2025].

Diffusion Models Denoising Diffusion Probabilistic Models (DDPMs) [Ho et al., 2020] define
a forward process or noising process that gradually adds noise to data x0 ∼ pdata(x0) over T
discrete time steps. This results in a sequence of increasingly noisy samples x1, . . . ,xT , where
xT approaches a simple prior distribution pprior(xT ) as T increases.

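Noise is added to the data according to a schedule defined by variances $\{\beta_t \in (0, 1)\}_{t=1}^{T}$. Let
$\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The forward process transitions are defined as:

\[ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ \beta_t I\big), \tag{2.27} \]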

where each step adds Gaussian noise with variance $\beta_t$ while scaling the signal by $\sqrt{\alpha_t}$ to control
the signal-to-noise ratio. A common choice for the noise schedule is to use a linear or cosine
schedule, where βt increases linearly or according to a cosine function from a small value at
t = 1 to a larger value at t = T .

Because the forward process is a Markov chain with Gaussian transitions, the distribution
of xt given the original data x0 can be expressed in closed form:

\[ q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big). \tag{2.28} \]

This closed-form expression is a key advantage of DDPMs, as it allows us to sample the noisy
data xt directly from the original data x0 as follows (without computing the intermediate steps
x1, . . . , xt−1):

\[ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I). \tag{2.29} \]

The noise schedule is designed such that as t increases, the signal component $\sqrt{\bar{\alpha}_t}\, x_0$ diminishes
while the noise component $\sqrt{1 - \bar{\alpha}_t}\, \epsilon$ grows. As t → T, we have ᾱt → 0, meaning the distribution
of xT converges to the standard Gaussian N(0, I). This ensures that the forward process
transforms any data distribution into pure noise, providing a well-defined starting point for the
reverse generative process.
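
The closed-form forward process of equation 2.29 can be sketched as follows. This is an illustrative NumPy example under stated assumptions: a linear schedule with T = 1000 and endpoints 1e-4 and 0.02 (values commonly used with DDPMs, chosen here for illustration), and a toy data batch `x0` of ones. The function names are hypothetical.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Variances beta_1, ..., beta_T increasing linearly (assumed endpoints)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) via equation 2.29; t is a 1-indexed time step."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

T = 1000
betas = linear_beta_schedule(T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # alpha_bar_t = prod_{s <= t} alpha_s

rng = np.random.default_rng(0)
x0 = np.ones((4, 3))                # toy "data" batch
x_early, _ = q_sample(x0, 1, alpha_bar, rng)   # almost unchanged data
x_late, _ = q_sample(x0, T, alpha_bar, rng)    # almost pure Gaussian noise
```

With this schedule `alpha_bar` decreases monotonically from just below 1 to a value close to 0, which is the behaviour the paragraph above relies on.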

The reverse process or denoising process aims to learn how to reverse the forward noising
process, transforming noise xT back into data x0. This reverse process is defined as a Markov
chain with learned parameters θ:

pθ(xt−1|xt) = N (xt−1;µθ(xt, t),Σθ(xt, t)), (2.30)

where µθ(xt, t) and Σθ(xt, t) are the predicted mean and covariance of the distribution at each
time step t. DDPMs aim to learn a denoiser neural network that can effectively denoise samples
from the prior distribution pprior(xT ) back to the data distribution pdata(x0).

The reverse process is trained to approximate the true reverse transition q(xt−1|xt) of the
forward process, which is intractable. To achieve this, a neural network ϵθ(xt, t) is trained
to predict the noise ϵ that was added to x0 to obtain xt. The training objective is typically
formulated as a mean squared error loss between the true noise and the predicted noise:

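\[ \mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t \sim U(1, T),\ x_0 \sim p_{\text{data}},\ \epsilon \sim \mathcal{N}(0, I)} \big[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \big]. \tag{2.31} \]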

This loss function encourages the model to learn how to denoise samples at each time step t,
effectively learning the reverse process.
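
A Monte Carlo estimate of the loss in equation 2.31 for one batch might look as follows. This is a sketch under assumptions: `eps_model` stands in for the denoiser network ϵθ and is here a placeholder that always predicts zero noise, and the linear schedule endpoints are illustrative.

```python
import numpy as np

def ddpm_loss(eps_model, x0, alpha_bar, rng):
    """Estimate equation 2.31 on a batch x0 of shape (B, D)."""
    B = x0.shape[0]
    T = alpha_bar.shape[0]
    t = rng.integers(1, T + 1, size=B)               # t ~ U{1, ..., T}
    eps = rng.standard_normal(x0.shape)              # eps ~ N(0, I)
    a = alpha_bar[t - 1][:, None]
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps    # equation 2.29
    residual = eps - eps_model(xt, t)
    return np.mean(np.sum(residual**2, axis=-1))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.standard_normal((2000, 3))                          # toy data batch

# Placeholder "network" that ignores its input and predicts zero noise.
loss_zero = ddpm_loss(lambda xt, t: np.zeros_like(xt), x0, alpha_bar, rng)
```

For the zero predictor the loss is simply the expected squared norm of the noise, roughly the data dimension (3 here); a trained denoiser should achieve a lower value.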

The sampling process from a trained DDPM involves starting from a sample xT ∼ pprior(xT)
(e.g., a standard Gaussian) and iteratively applying the learned denoiser to obtain samples xt−1
from xt until reaching x0. Sampling can be expressed as:

\[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \Big) + \sigma_t z, \quad \text{where } z \sim \mathcal{N}(0, I), \tag{2.32} \]

where $\sigma_t^2 = \beta_t$ for stochastic sampling or $\sigma_t = 0$ for deterministic sampling. See Ho et al. [2020]
for a full derivation of the sampling process.

Denoising Diffusion Implicit Models (DDIMs) [Song et al., 2021] extend DDPMs by introducing
a non-Markovian reverse process where the noise schedule is fixed. This allows for
deterministic sampling without the need for noise at each step, enabling faster sampling while
maintaining high-quality samples. Score-based Generative Models [Song and Ermon, 2019]
are another class of diffusion models closely related to DDPMs and DDIMs, where the reverse
process is defined in terms of score functions (gradients of the log probability density) rather
than explicit noise predictions.
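
The DDPM sampling update of equation 2.32 can be sketched as a loop from t = T down to t = 1. The example below is illustrative only: `eps_model` is a hypothetical stand-in for the trained denoiser, and we use a toy setting where all data equal a single point `c`, for which the ideal noise prediction is known in closed form. Omitting the noise at the final step t = 1 follows the sampling algorithm of Ho et al. [2020] and is an assumption relative to the text above.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng, stochastic=True):
    """Iterate equation 2.32 from x_T ~ N(0, I) down to x_0."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = betas.shape[0]
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):                       # t = T, ..., 1
        a_t, ab_t, b_t = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        eps_hat = eps_model(x, t)
        mean = (x - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(a_t)
        if stochastic and t > 1:
            x = mean + np.sqrt(b_t) * rng.standard_normal(shape)  # sigma_t^2 = beta_t
        else:
            x = mean                                              # sigma_t = 0
    return x

T = 100
betas = np.linspace(1e-4, 0.02, T)                  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
c = np.array([1.0, -2.0, 0.5])                      # toy data: a single point

def oracle_eps(x, t):
    """Ideal noise prediction when every data sample equals c (equation 2.29 inverted)."""
    ab = alpha_bar[t - 1]
    return (x - np.sqrt(ab) * c) / np.sqrt(1.0 - ab)

samples = ddpm_sample(oracle_eps, (5, 3), betas, np.random.default_rng(0))
```

In this toy setting every sample ends at `c`, since the last update removes exactly the noise the oracle identifies. With a learned denoiser the samples would instead follow the learned approximation of the data distribution.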


Flow Matching Flow Matching [Lipman et al., 2023, Liu et al., 2023] offers an alternative,
continuous-time perspective on generative modelling. It frames generation as learning a vector
field that transports samples along a continuous trajectory from a simple prior distribution pprior
to the target data distribution pdata.

The most common formulation defines a path between a sample x0 ∼ pprior (e.g., a standard
Gaussian) and a sample x1 ∼ pdata. For instance, a linear interpolation path is defined as:

xt = (1− t) x0 + t x1, (2.33)

where t ∈ [0, 1] is a continuous time variable⁶ that controls the interpolation between x0 and x1.
The conditional vector field ut along the path from x0 to x1 at time t is defined as:

\[ u_t(x_t) = \frac{x_1 - x_t}{1 - t} = x_1 - x_0. \tag{2.34} \]

A neural network vθ(x, t) is trained to predict this conditional vector field from xt at time
step t, effectively learning how to transport samples from the prior distribution to the data
distribution. The training objective for the neural network vθ(xt, t) is typically formulated as a
mean squared error loss between the predicted vector field and the target vector field:

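\[ \mathcal{L}_{\text{FM}} = \mathbb{E}_{t \sim U(0, 1),\ x_0 \sim p_{\text{prior}},\ x_1 \sim p_{\text{data}}} \big[ \| u_t(x_t) - v_\theta(x_t, t) \|^2 \big]. \tag{2.35} \]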

This loss encourages the model to learn the vector field that guides the samples along the path
defined by equation 2.33 toward the target distribution pdata.
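
The flow matching objective of equations 2.33 to 2.35 can be estimated on a batch as sketched below. The setup is an assumption made for illustration: `v_model` stands in for the network vθ, the prior is a standard Gaussian, and the toy dataset places every sample at a single point `c`, so that the exact vector field (c − x)/(1 − t) is available for comparison.

```python
import numpy as np

def flow_matching_loss(v_model, x1, rng):
    """Estimate equation 2.35 on a data batch x1 of shape (B, D)."""
    B = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)         # x0 ~ p_prior = N(0, I)
    t = rng.uniform(0.0, 1.0, size=(B, 1))     # t ~ U(0, 1)
    xt = (1.0 - t) * x0 + t * x1               # equation 2.33
    target = x1 - x0                           # equation 2.34
    residual = target - v_model(xt, t)
    return np.mean(np.sum(residual**2, axis=-1))

c = np.array([1.0, -2.0])
x1 = np.tile(c, (256, 1))                      # toy dataset: every sample equals c

exact_field = lambda xt, t: (c - xt) / (1.0 - t)
zero_field = lambda xt, t: np.zeros_like(xt)

loss_exact = flow_matching_loss(exact_field, x1, np.random.default_rng(0))
loss_zero = flow_matching_loss(zero_field, x1, np.random.default_rng(0))
```

The exact field attains a loss of numerically zero on this toy problem, whereas the zero field does not. For realistic data the conditional targets for a given xt disagree across pairs (x0, x1), so the minimum of the loss is positive and the network learns their average.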

The generation process involves solving an ordinary differential equation (ODE) starting
from an initial sample x0 ∼ pprior and evolving it over time according to the learned vector field
as $\frac{dx}{dt} = v_\theta(x, t)$, which leads to samples from the target distribution pdata as t approaches 1. At
each step, we perform Euler integration (or more sophisticated ODE solvers) to transform the
noisy sample toward the data distribution:

xt+∆t = xt +∆t vθ(xt, t), ∆t > 0. (2.36)
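
The Euler update of equation 2.36 amounts to a short loop. The sketch below assumes a uniform step size ∆t = 1/num_steps and uses a hand-written vector field for a toy target concentrated at a point `c`; in practice `v_model` would be the trained network vθ.

```python
import numpy as np

def euler_sample(v_model, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with equation 2.36."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # current time, always strictly below 1
        x = x + dt * v_model(x, t)
    return x

c = np.array([1.0, -2.0])
point_mass_field = lambda x, t: (c - x) / (1.0 - t)   # exact field for data = {c}

x0 = np.random.default_rng(0).standard_normal((4, 2))  # samples from the prior
x1_hat = euler_sample(point_mass_field, x0, num_steps=10)
```

For this particular field the final Euler step lands on `c` regardless of the step count. A learned field generally needs enough steps, or a higher-order solver, to keep the discretisation error small.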

Flow Matching models can be viewed as a continuous-time analogue of diffusion models,
where the linear interpolation path in equation 2.33 is a form of the forward process in equa-
tion 2.29 with a particular noise schedule and a Gaussian prior. The reverse process in flow
matching is learned as a vector field that guides the flow from the prior to the data distribution,
similar to how diffusion models learn a denoising process. See Albergo et al. [2023], Gao et al.
[2024] for further discussions on the equivalence between diffusion models and flow matching.

⁶ Note that t is distinct from the discrete time steps t ∈ {1, . . . , T} used in diffusion models.


Conditioning and Guidance Diffusion and flow matching models can be conditioned on
additional information to guide the generation process toward specific properties or structures
[Dieleman, 2022]. For example, classifier-based guidance [Dhariwal and Nichol, 2021] uses a
pre-trained classifier to steer the generation process by adjusting the predicted noise or vector
field based on the classifier’s output. Classifier-free guidance [Ho and Salimans, 2022] allows
for more flexible conditioning by directly providing additional information (e.g., class labels,
text prompts) to the denoiser or vector field predictor without requiring a separate classifier.
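
At sampling time, classifier-free guidance is usually applied by combining a conditional and an unconditional prediction from the same denoiser. The sketch below uses the combination rule of Ho and Salimans [2022], which is not spelled out in the text above; the guidance weight `w` and the two prediction arrays are hypothetical placeholders.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Guided prediction (1 + w) * eps_cond - w * eps_uncond; w = 0 disables guidance."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Placeholder predictions for one noisy sample with three coordinates.
eps_cond = np.array([0.2, -0.1, 0.4])     # denoiser output given the condition
eps_uncond = np.array([0.1, 0.0, 0.1])    # denoiser output with the condition dropped

eps_guided = cfg_combine(eps_cond, eps_uncond, w=2.0)
```

Larger values of `w` push the prediction further along the direction in which the condition changes the denoiser output, trading sample diversity for adherence to the condition.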

Conditional generation has proven especially successful for latent diffusion models [Vahdat
et al., 2021], which perform diffusion in the latent space of a pre-trained autoencoder rather than
in high-dimensional raw input space. This approach achieves computational efficiency while
maintaining generation quality by operating on semantically meaningful, lower-dimensional
representations, followed by reconstruction to the original data space [Dieleman, 2025]. Latent
diffusion also enables flexible conditioning strategies across diverse modalities, including class
labels, text descriptions, or any other data type that can be encoded into a compatible latent
representation [Rombach et al., 2022].

Overall, conditional generation capabilities make diffusion models particularly powerful
for molecular design, where the goal is to discover novel molecules that satisfy predetermined
requirements rather than simply modelling existing molecular distributions. By conditioning
the denoiser on desired properties (e.g., binding affinity [Gruver et al., 2023], bulk modulus,
magnetic density [Zeni et al., 2025]) or structural constraints (e.g., scaffolding around a known
binding site [Watson et al., 2023], completing partial molecular fragments [Schneuing et al.,
2024]), we can guide generation toward molecules with specific functional characteristics.


Part I

Molecular Representation Learning and
Generative Modelling


Chapter 3

Expressive Power of Molecular Structure
Representations

As we saw in Chapter 2, molecular systems can be represented as 3D geometric graphs with
node attributes that transform along with Euclidean transformations of the system. We then
introduced Geometric Graph Neural Networks (GNNs) that are designed to learn representations
of these graphs, categorised by the geometric inductive biases they implement: (1) Invariant
GNNs which only propagate invariant scalar features such as distances and angles [Schütt et al.,
2018, Gasteiger et al., 2020]; (2) Equivariant GNNs which propagate equivariant geometric
features such as vectors [Satorras et al., 2021] or spherical tensors [Thomas et al., 2018]; and
(3) Unconstrained GNNs which do not enforce any equivariance or invariance on the features.
These architectures have powered applications ranging from protein structure prediction [Jumper
et al., 2021] and design [Dauparas et al., 2022] to molecular simulation [Batzner et al., 2022]
and catalysis [Wander et al., 2025].

However, there is no unified theoretical framework to understand and characterise the repre-
sentation capacity or expressive power [Raghu et al., 2017] of different classes of architectures.
The theoretical limits and practical implications of different design choices, such as equivariance
vs. invariance, number of layers, and body order of scalarisation, are not well understood. To
address this, this chapter establishes a theoretical foundation for Geometric GNNs which will
guide their application in subsequent chapters (Part II).

We introduce the Geometric Weisfeiler-Leman (GWL) test, a generalisation of the classic
Weisfeiler-Leman algorithm for discriminating geometric graphs while respecting underlying 3D
symmetries: permutations, rotations, reflections, and translations. We use the GWL framework
to characterise the expressive power of invariant and equivariant GNNs in terms of their ability
to distinguish geometric graphs. GWL provides mechanistic insights into the advantages of
equivariant models over invariant ones, and how higher-order representations enable maximally
expressive architectures. Overall, we formalize key design choices which influence Geometric
GNN expressivity through the lens of GWL, summarised in Figure 3.1.


[Figure 3.1 graphic. Axis labels: Body Order of Layer; Tensor Order of Features (Scalars, Cartesian, Spherical); Depth. Other labels: (Distances); (Distances, Angles); Many-body. Models shown: SchNet; DimeNet, GemNet-T; SphereNet; E(n)-GNN; GVP-GNN, PaiNN; TFN, SEGNN, SE(3)-Transformer; MACE - Multi Atomic Cluster Expansion.]

Figure 3.1: Axes of Geometric GNN expressivity: (1) Body order: increasing scalarisation
body order builds expressive local descriptors; (2) Tensor order: higher order tensors determine
the relative orientation of neighbourhoods; and (3) Depth: deep equivariant layers propagate
geometric information beyond local neighbourhoods.

To complement GWL’s theoretical framework, this chapter also presents: (1) a suite of
synthetic experiments, the Geometric GNN Dojo, that highlight practical challenges for building
expressive Geometric GNNs; and (2) a real-world benchmark for protein function prediction
that fairly compares state-of-the-art Geometric GNNs to sequence-based protein language
models. Open source code is available: github.com/chaitjo/geometric-gnn-dojo
and github.com/a-r-j/ProteinWorkshop, respectively.

3.1 Limitations of the Weisfeiler-Leman Test

Graph Isomorphism The graph isomorphism problem asks whether two graphs are the same,
but drawn differently [Read and Corneil, 1977]. Two attributed graphs G, H are isomorphic
(denoted G ≃ H) if there exists an edge-preserving bijection b : V(G) → V(H) such that
$s_i^{(\mathcal{G})} = s_{b(i)}^{(\mathcal{H})}$, as illustrated in Figure 3.2.

Figure 3.2: Graph isomorphism. Two attributed graphs G and H are isomorphic if there exists
a bijection b between their nodes that preserves the edge structure and node features.


Figure 3.3: Weisfeiler-Leman Test for non-geometric graphs. WL iteratively refines node
colours based on neighbourhood patterns. Here, WL fails to distinguish the non-isomorphic
molecular graphs Decalin and Bicyclopentyl, converging to identical colour histograms despite
their structural differences. This illustrates a well-known limitation of WL with implications for
molecular representation learning.

Weisfeiler-Leman Test The Weisfeiler-Leman Test (WL) is an algorithm for testing whether
two (attributed) graphs are isomorphic [Weisfeiler and Leman, 1968]. At iteration zero the
algorithm assigns a colour $c_i^{(0)} \in C$ from a countable space of colours C to each node i. Nodes
are coloured the same if their features are the same, otherwise, they are coloured differently. In
subsequent iterations t, WL iteratively updates the node colouring by producing a new $c_i^{(t)} \in C$:

\[ c_i^{(t)} := \text{HASH}\Big( c_i^{(t-1)},\ \{\!\{ c_j^{(t-1)} \mid j \in \mathcal{N}_i \}\!\} \Big), \tag{3.1} \]

where HASH is an injective map (i.e. a perfect hash map) that assigns a unique colour to each
input and {{·}} denotes a multiset – a set that allows for repeated elements. The test terminates
when the partition of the nodes induced by the colours becomes stable. Given two graphs G
and H, if there exists some iteration t for which $\{\!\{c_i^{(t)} \mid i \in V(G)\}\!\} \neq \{\!\{c_i^{(t)} \mid i \in V(H)\}\!\}$, then
the graphs are not isomorphic. Otherwise, the WL test is inconclusive, and we say it cannot
distinguish the two graphs when the number of colours in iterations t and (t− 1) is the same.
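
The colour refinement of equation 3.1 can be sketched in Python as follows. This is an illustrative implementation, not reference code: the injective HASH is emulated by a dictionary shared across the graphs being compared, and the heavy-atom skeletons of Decalin and Bicyclopentyl are hand-written edge lists with a hypothetical node numbering.

```python
from collections import Counter

def adjacency(num_nodes, edges):
    """Build an undirected adjacency dict from an edge list."""
    adj = {i: [] for i in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def wl_histograms(graphs, features, num_iters):
    """Run WL jointly on several graphs; return one colour histogram per graph."""
    palette = {}  # shared injective map: signature -> integer colour

    def hash_(signature):
        return palette.setdefault(signature, len(palette))

    colours = [{i: hash_(("init", f[i])) for i in g} for g, f in zip(graphs, features)]
    for _ in range(num_iters):
        # Equation 3.1: own colour plus the sorted multiset of neighbour colours.
        colours = [
            {i: hash_((c[i], tuple(sorted(c[j] for j in g[i])))) for i in g}
            for g, c in zip(graphs, colours)
        ]
    return [Counter(c.values()) for c in colours]

# Two fused six-membered rings sharing the bond 0-5.
decalin = adjacency(10, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
                         (5, 6), (6, 7), (7, 8), (8, 9), (9, 0)])
# Two five-membered rings joined by the bond 0-5.
bicyclopentyl = adjacency(10, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0),
                               (5, 6), (6, 7), (7, 8), (8, 9), (9, 5), (0, 5)])
carbon = {i: "C" for i in range(10)}

hist_a, hist_b = wl_histograms([decalin, bicyclopentyl], [carbon, carbon], num_iters=3)
same_histograms = hist_a == hist_b
```

Here `same_histograms` evaluates to `True`: both graphs have two degree-3 nodes, four nodes adjacent to them, and four remaining nodes, so WL cannot tell them apart, matching the failure case in Figure 3.3.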

Theoretical Limits of WL and GNN WL has several well known failure cases, such as not
being able to distinguish any two regular graphs with the same number of nodes and degree, or
failing to tell apart two equilateral triangles from a regular hexagon. At the same time, WL is
considered powerful enough for most practical graph classification scenarios [Morris et al., 2021].
Results from Babai et al. [1980] can be used to show that the probability of WL identifying a
graph drawn randomly from the class of all n-node graphs goes to 1 as n tends to infinity.

The graph isomorphism problem and WL have become powerful tools for characterising the
theoretical limits of GNNs [Jegelka, 2022]. It was shown by Xu et al. [2019], Morris et al. [2019]
that message passing GNNs are at most as powerful as WL at distinguishing non-isomorphic
graphs, i.e. the expressive power of GNNs is upper-bounded by WL. GNNs can have the same
expressive power as WL if their aggregate, update, and readout are injective functions over
multisets. The WL framework has since become a major driver of progress in designing more
expressive GNNs [Chen et al., 2019, Maron et al., 2019, Dwivedi et al., 2023, Bodnar et al.,
2021b,a]. Notably, GNNs can exceed the expressive power of WL when nodes have unique
identifiers, such as random node features or positional encodings, that distinguish otherwise
equivalent nodes [Loukas, 2020, Sato et al., 2021].

Towards Geometric Graph Isomorphism Clearly, WL does not directly apply to geometric
graphs as they exhibit a stronger notion of geometric isomorphism that accounts for spatial
symmetries. Unlike standard graphs where node features are fixed, geometric attributes such as
3D coordinates transform under rotations, reflections, and translations of the geometric graph.
Simply treating these geometric attributes as static node features would violate the fundamental
symmetries that define molecular systems, making theoretical results associated with WL and
GNNs inapplicable to geometric graphs and Geometric GNNs.

In the following sections, we introduce the Geometric Weisfeiler-Leman (GWL) test that
generalises WL to geometric graphs while respecting underlying 3D symmetries. We use GWL
to characterise the expressive power of invariant and equivariant GNNs in terms of their ability
to distinguish geometric graphs. Unconstrained GNNs, which do not enforce any geometric
symmetries, are an exception to this limitation. These models treat 3D coordinates as static node
features and can thus distinguish any geometric graphs where WL can distinguish the underlying
attributed graphs. While the trade-offs between explicitly enforcing symmetries versus learning
them from data will be discussed in Chapter 4, our analysis in this chapter focuses on invariant
and equivariant GNNs that respect 3D symmetries by design.

3.2 The Geometric Weisfeiler-Leman Framework

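Geometric graph isomorphism Two geometric graphs G and H are geometrically isomorphic
if there exists an attributed graph isomorphism b such that the geometric attributes are equivalent,
up to global group actions $Q_{\mathfrak{g}} \in \mathfrak{G}$ and $\vec{t} \in T(d)$:

\[ \Big( s_i^{(\mathcal{G})},\ \vec{v}_i^{(\mathcal{G})},\ \vec{x}_i^{(\mathcal{G})} \Big) = \Big( s_{b(i)}^{(\mathcal{H})},\ Q_{\mathfrak{g}} \vec{v}_{b(i)}^{(\mathcal{H})},\ Q_{\mathfrak{g}} \big( \vec{x}_{b(i)}^{(\mathcal{H})} + \vec{t} \big) \Big) \quad \text{for all } i \in \mathcal{V}(\mathcal{G}). \tag{3.2} \]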

Note that if two geometric graphs are geometrically isomorphic, they are also isomorphic as
attributed graphs. However, the converse is not true.

Geometric graph isomorphism and distinguishing (sub-)graph geometries has important
practical implications for molecular representation learning. For example, an ideal architecture should
map distinct local structural environments around atoms to distinct embeddings in representation
space [Bartók et al., 2013, Pozdnyakov et al., 2020].

Assumptions Analogous to the WL test, we assume for the rest of this section that all geometric
graphs we want to distinguish are finite in size and come from a countable dataset. In other
words, the geometric and scalar features the nodes are equipped with come from countable
subsets C ⊂ Rd and C ′ ⊂ R, respectively. As a result, when we require functions to be injective,
we require them to be injective over the countable set of G-orbits that are obtained by acting
with G on the dataset.

Intuition For an intuition of how to generalise WL to geometric graphs, we note that WL
uses a local, node-centric, procedure to update the colour of each node i using the colours
of its 1-hop neighbourhood Ni. In the geometric setting, Ni is an attributed point cloud
around the central node i. As a result, each neighbourhood carries two types of information: (1)
neighbourhood type (invariant to G) and (2) neighbourhood geometric orientation (equivariant to
G). This leads to constraints on the generalisation of the neighbourhood aggregation procedure
of Equation 3.1. From an axiomatic point of view, our generalisation of the WL aggregation
procedure must meet two properties:

Property 1: Orbit injectivity of colours If two neighbourhoods are the same up to an action
of G (e.g. rotation), then the colours of the corresponding central nodes should be the same.
Thus, the colouring must be G-orbit injective – which also makes it G-invariant – over the
countable set of all orbits of neighbourhoods in our dataset.

Property 2: Preservation of local geometry A key property of WL is that the aggregation
is injective. A G-invariant colouring procedure that purely satisfies Property 1 is not sufficient
because, by definition, it loses spatial properties of each neighbourhood such as the relative
pose or orientation [Hinton et al., 2011]. Thus, we must additionally update auxiliary geometric
information variables in a way that is G-equivariant and injective.

3.2.1 The Geometric Weisfeiler-Leman Test (GWL)

These intuitions motivate the following definition of the GWL test. At initialisation, we assign to
each node i ∈ V a scalar node colour ci ∈ C ′ and an auxiliary object gi containing the geometric
information associated to it:

\[ c_i^{(0)} := \text{HASH}(s_i), \qquad g_i^{(0)} := \big( c_i^{(0)},\ \vec{v}_i \big), \tag{3.3} \]

where HASH denotes an injective map over the scalar attributes si of node i, e.g. the HASH of
standard WL. To define the inductive step, assume we have the colours of the nodes and the
associated geometric objects at iteration t−1. Then, we can aggregate the geometric information
around node i into a new object as follows:

\[ g_i^{(t)} := \Big( \big(c_i^{(t-1)},\ g_i^{(t-1)}\big),\ \{\!\{ \big(c_j^{(t-1)},\ g_j^{(t-1)},\ \vec{x}_{ij}\big) \mid j \in \mathcal{N}_i \}\!\} \Big). \tag{3.4} \]

Figure 3.4: Geometric Weisfeiler-Leman Test. GWL distinguishes non-isomorphic geometric
graphs G1 and G2 by injectively assigning colours to distinct neighbourhood patterns, up to global
symmetries. Each iteration expands the neighbourhood from which geometric information can
be gathered (shaded for node i). Example inspired by Schütt et al. [2021], G = O(d).

Here {{·}} denotes a multiset – a set in which elements may occur more than once. Importantly,
the group G can act on the geometric objects above inductively by acting on the geometric
information inside it. This amounts to rotating (or reflecting) the entire t-hop neighbourhood
contained inside:

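\[ \mathfrak{g} \cdot g_i^{(0)} := \big( c_i^{(0)},\ Q_{\mathfrak{g}} \vec{v}_i \big), \tag{3.5} \]
\[ \mathfrak{g} \cdot g_i^{(t)} := \Big( \big(c_i^{(t-1)},\ \mathfrak{g} \cdot g_i^{(t-1)}\big),\ \{\!\{ \big(c_j^{(t-1)},\ \mathfrak{g} \cdot g_j^{(t-1)},\ Q_{\mathfrak{g}} \vec{x}_{ij}\big) \mid j \in \mathcal{N}_i \}\!\} \Big). \]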

Clearly, the aggregation building gi for any time-step t is injective and G-equivariant. Finally,
we can compute the node colours at iteration t for all i ∈ V by aggregating the geometric
information in the neighbourhood around node i:

\[ c_i^{(t)} := \text{I-HASH}^{(t)}\big( g_i^{(t)} \big), \tag{3.6} \]

by using a G-orbit injective and G-invariant function that we denote by I-HASH. That is, for any
geometric objects $g, g'$, I-HASH($g$) = I-HASH($g'$) if and only if there exists $\mathfrak{g} \in \mathfrak{G}$ such that
$g = \mathfrak{g} \cdot g'$. Note that I-HASH is an idealised G-orbit injective function, similar to the HASH
function used in WL, which is not necessarily continuous.


Figure 3.5: Invariant GWL Test. IGWL cannot distinguish G1 and G2 as they are 1-hop
identical: The G-orbit of the 1-hop neighbourhood around each node is the same, and IGWL
cannot propagate geometric orientation information beyond 1-hop.

Overview of GWL With each iteration, $g_i^{(t)}$ aggregates geometric information in progressively
larger t-hop subgraph neighbourhoods $\mathcal{N}_i^{(t)}$ around the node i. The node colours summarise the
structure of these t-hops via the G-invariant aggregation performed by I-HASH. The procedure
terminates when the partitions of the nodes induced by the colours do not change from the
previous iteration. Finally, given two geometric graphs G and H, if there exists some iteration
t for which $\{\!\{c_i^{(t)} \mid i \in V(G)\}\!\} \neq \{\!\{c_i^{(t)} \mid i \in V(H)\}\!\}$, then GWL deems the two graphs as
geometrically non-isomorphic. Otherwise, GWL cannot distinguish them.

Invariant GWL Since we are interested in understanding the role of G-equivariance, we also
consider a more restrictive Invariant GWL (IGWL) that only updates node colours using the
G-orbit injective I-HASH function and does not propagate geometric information:

\[ c_i^{(t)} := \text{I-HASH}\Big( \big(c_i^{(t-1)},\ \vec{v}_i\big),\ \{\!\{ \big(c_j^{(t-1)},\ \vec{v}_j,\ \vec{x}_{ij}\big) \mid j \in \mathcal{N}_i \}\!\} \Big). \tag{3.7} \]

Figure 3.6: Geometric Computation Trees for GWL and IGWL. Unlike GWL, geometric
orientation information cannot flow from the leaves to the root in IGWL, restricting its expressive
power. IGWL cannot distinguish G1 and G2 as all 1-hop neighbourhoods are computationally
identical.

IGWL with k-body scalars In order to further analyse the construction of the node colouring
function I-HASH, we consider IGWL(k) based on the maximum number of nodes involved in
the computation of G-invariant scalars (also known as the ‘body order’ [Batatia et al., 2022b]):

\[ c_i^{(t)} := \text{I-HASH}_{(k)}\Big( \big(c_i^{(t-1)},\ \vec{v}_i\big),\ \{\!\{ \big(c_j^{(t-1)},\ \vec{v}_j,\ \vec{x}_{ij}\big) \mid j \in \mathcal{N}_i \}\!\} \Big), \tag{3.8} \]

and I-HASH(k+1) is defined as:

\[ \text{HASH}\Big( \{\!\{ \text{I-HASH}\Big( \big(c_i^{(t-1)},\ \vec{v}_i\big),\ \{\!\{ \big(c_{j_1}^{(t-1)},\ \vec{v}_{j_1},\ \vec{x}_{ij_1}\big), \ldots, \big(c_{j_k}^{(t-1)},\ \vec{v}_{j_k},\ \vec{x}_{ij_k}\big) \}\!\} \Big) \mid \mathbf{j} \in (\mathcal{N}_i)^k \}\!\} \Big), \tag{3.9} \]

where $\mathbf{j} = [j_1, \ldots, j_k]$ are all possible k-tuples formed of elements of $\mathcal{N}_i$. Therefore, IGWL(k) is
now constrained to extract information only from all the possible k-sized tuples of nodes (includ-
ing the central node) in a neighbourhood. For instance, I-HASH(2) can identify neighbourhoods
only up to pairwise distances among the central node and any of its neighbours (i.e. a 2-body
scalar), while I-HASH(3) up to distances and angles formed by any two edges (i.e. a 3-body
scalar). Notably, distances and angles alone are incomplete descriptors of local geometry as
there exist several counterexamples that require higher body-order invariants to be distinguished
[Bartók et al., 2013, Pozdnyakov et al., 2020]. Therefore, I-HASH(k) with lower scalarisation
body order k makes the colouring weaker.
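
The difference between 2-body and 3-body scalars can be illustrated on a single neighbourhood. The sketch below is a simplification in the spirit of I-HASH(2) and I-HASH(3), not an implementation of them: it ignores node colours and vector features, represents a neighbourhood by the relative positions x⃗ij of its neighbours, and uses rounding plus sorting as a stand-in for the idealised hash. All function names are hypothetical.

```python
import numpy as np
from itertools import product

def two_body_signature(rel_pos, decimals=6):
    """Sorted multiset of distances |x_ij| to each neighbour (2-body scalars)."""
    dists = np.linalg.norm(rel_pos, axis=1)
    return sorted(np.round(dists, decimals).tolist())

def three_body_signature(rel_pos, decimals=6):
    """Sorted multiset over neighbour pairs (j1, j2) of (|x_ij1|, |x_ij2|, x_ij1 . x_ij2)."""
    dists = np.round(np.linalg.norm(rel_pos, axis=1), decimals).tolist()
    gram = np.round(rel_pos @ rel_pos.T, decimals).tolist()
    n = len(rel_pos)
    return sorted((dists[a], dists[b], gram[a][b]) for a, b in product(range(n), repeat=2))

# Two neighbours at unit distance: at a right angle, and on a straight line.
right_angle = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
straight = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])

# A rotated copy of the first neighbourhood (rotation about the z-axis).
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta), np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
right_angle_rotated = right_angle @ rot.T

same_2body = two_body_signature(right_angle) == two_body_signature(straight)
same_3body = three_body_signature(right_angle) == three_body_signature(straight)
invariant_3body = three_body_signature(right_angle) == three_body_signature(right_angle_rotated)
```

Distances alone cannot separate the two neighbourhoods (`same_2body` is `True`), while adding the inner products between edge pairs, which encode angles, does (`same_3body` is `False`). Both signatures are unchanged by the rotation, as required of G-invariant scalars.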

3.2.2 Characterising the Expressivity of Geometric GNNs

Following Xu et al. [2019], Morris et al. [2019], we upper bound the expressive power of
Geometric GNNs by the GWL test. We show that any G-equivariant GNN can be at most as
powerful as GWL in distinguishing non-isomorphic geometric graphs. Proofs are available in
Appendix A.2.

Theorem 1. Any pair of geometric graphs distinguishable by a G-equivariant GNN is also
distinguishable by GWL.

With sufficient iterations, the output of G-equivariant GNNs can be equivalent to GWL if
certain conditions are met regarding the aggregate, update and readout functions.

Proposition 2. G-equivariant GNNs have the same expressive power as GWL if the following
conditions hold: (1) The aggregation AGG is an injective, G-equivariant multiset function. (2)
The scalar part of the update UPDs is a G-orbit injective, G-invariant multiset function. (3)
The vector part of the update UPDv is an injective, G-equivariant multiset function. (4) The
graph-level readout f is an injective multiset function.


Similar statements can be made for G-invariant GNNs and IGWL. Thus, we can directly
transfer theoretical results between GWL/IGWL, which are abstract mathematical tools, and the
class of Geometric GNNs upper bounded by the respective tests. This connection has several
interesting practical implications, discussed subsequently.

3.3 Understanding the Geometric GNN Design Space

Overview We demonstrate the utility of the GWL framework for understanding how Geometric
GNN design choices [Duval et al., 2023a] influence expressivity: (1) Depth or number of layers;
(2) Invariant vs. equivariant message passing; and (3) Body order of scalarisation. In doing so,
we formalise theoretical limitations of current architectures and provide practical implications
for designing maximally powerful models. Proofs are available in Appendix A.1.

3.3.1 Role of Depth: Propagating Geometric Information

Let us consider the simplified setting of two geometric graphs G1 = (A1,S1, V⃗1, X⃗1) and
G2 = (A2,S2, V⃗2, X⃗2) such that the underlying attributed graphs (A1,S1) and (A2,S2) are
isomorphic. This case frequently occurs in chemistry, where molecules occur in different
conformations, but with the same graph topology given by the covalent bonding structure.

Each iteration of GWL aggregates geometric information $g_i^{(k)}$ from progressively larger
neighbourhoods $\mathcal{N}_i^{(k)}$ around the node i, and distinguishes (sub-)graphs via comparing G-orbit
injective colouring of $g_i^{(k)}$. We say G1 and G2 are k-hop distinct if for all graph isomorphisms
b, there is some node i ∈ V1, b(i) ∈ V2 such that the corresponding k-hop subgraphs $\mathcal{N}_i^{(k)}$
and $\mathcal{N}_{b(i)}^{(k)}$ are distinct. Otherwise, we say G1 and G2 are k-hop identical if all corresponding
k-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$ are the same up to group actions. We can now formalise what
geometric graphs can and cannot be distinguished by GWL and maximally powerful Geometric
GNNs, in terms of the number of iterations.

Proposition 3. GWL can distinguish any k-hop distinct geometric graphs G1 and G2 where the
underlying attributed graphs are isomorphic, and k iterations are sufficient.

Proposition 4. Up to k iterations, GWL cannot distinguish any k-hop identical geometric graphs
G1 and G2 where the underlying attributed graphs are isomorphic.

Additionally, we can state the following results about the more constrained IGWL.

Proposition 5. IGWL can distinguish any 1-hop distinct geometric graphs G1 and G2 where the
underlying attributed graphs are isomorphic, and 1 iteration is sufficient.

Proposition 6. Any number of iterations of IGWL cannot distinguish any 1-hop identical
geometric graphs G1 and G2 where the underlying attributed graphs are isomorphic.



Examples illustrating Propositions 3 and 6 are shown in Figures 3.4 and 3.5, respectively.
We can now consider the more general case where the underlying attributed graphs for
G1 = (A1,S1, V⃗1, X⃗1) and G2 = (A2,S2, V⃗2, X⃗2) are non-isomorphic and constructed from point
clouds using radial cutoffs, as is conventional for biochemistry and materials science applications.

Proposition 7. Assuming geometric graphs are constructed from point clouds using radial
cutoffs, GWL can distinguish any geometric graphs G1 and G2 where the underlying attributed
graphs are non-isomorphic. At most kMax iterations are sufficient, where kMax is the maximum
graph diameter among G1 and G2.

Proposition 7 shows that GWL can distinguish any non-isomorphic geometric graphs for
the practical case where the graph structure is computed using radial cutoffs. However, for the
general case where the topology of the geometric graph is independent of the coordinates, GWL
may not be theoretically complete as there may exist pathological edge cases where the test fails.
For example, when all coordinates and vector features are set equal to zero, GWL coincides with
the standard 1-WL. In this edge case, GWL has the same expressive power as 1-WL and inherits
all well-known failure cases of 1-WL.
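To make the setting of Proposition 7 concrete, the sketch below builds a graph from a point cloud with a radial cutoff and computes its diameter, which plays the role of kMax. It is an illustrative, pure-Python sketch under our own assumptions (hypothetical function names, a toy point cloud, and a connected graph), not the construction used in the thesis experiments.

```python
import math
from collections import deque


def radial_cutoff_graph(points, cutoff):
    """Connect every pair of points whose Euclidean distance is at most `cutoff`."""
    adjacency = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cutoff:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency


def graph_diameter(adjacency):
    """Longest shortest-path length, computed by BFS from every node.

    Assumes the graph is connected; otherwise only reachable pairs are counted.
    """
    diameter = 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        diameter = max(diameter, max(distance.values()))
    return diameter


# Hypothetical toy point cloud: four collinear points spaced one unit apart.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
adjacency = radial_cutoff_graph(points, cutoff=1.5)
k_max = graph_diameter(adjacency)
```

With a cutoff of 1.5 only consecutive points are connected, so the graph is a path and the diameter equals the number of edges along it. Increasing the cutoff shrinks the diameter, and with it the number of iterations that Proposition 7 requires.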

3.3.2 Limitations of Invariant Message Passing: Failure to Capture Global Geometry

Propositions 3 and 6 enable us to compare the expressive powers of GWL and IGWL.

Theorem 8. GWL is strictly more powerful than IGWL.

This statement formalises the advantage of G-equivariant intermediate layers for geometric
graphs, as prescribed in the Geometric Deep Learning blueprint [Bronstein et al., 2021], in
addition to echoing similar intuitions in the computer vision community. As remarked by Hinton
et al. [2011], translation invariant models do not understand the relationship between the various
parts of an image (termed the “Picasso problem”). Similarly, our results point to IGWL and
G-invariant GNNs failing to understand how the 1-hop neighbourhoods in a graph are oriented
w.r.t. each other.

As a result, even the most powerful G-invariant GNNs are restricted in their ability to
compute global and non-local geometric properties.

Proposition 9. IGWL and G-invariant GNNs cannot decide several geometric graph properties:
(1) perimeter, surface area, and volume of the bounding box/sphere enclosing the geometric
graph; (2) distance from the centroid or centre of mass; and (3) dihedral angles.

Finally, we identify a setting where this distinction between the two approaches disappears.

Proposition 10. IGWL has the same expressive power as GWL for fully connected geometric
graphs.



Practical Implications Proposition 9 highlights critical theoretical limitations of G-invariant
GNNs. This suggests that G-equivariant GNNs should be preferred when working with large
geometric graphs such as macromolecules with thousands of nodes, where message passing is
restricted to local radial neighbourhoods around each node. Stacking multiple G-equivariant
layers enables the computation of compositional geometric features.

Two straightforward approaches to overcoming the limitations of G-invariant GNNs may be:
(1) pre-computing non-local geometric properties as input features, e.g. models such as GemNet
[Gasteiger et al., 2021] and ComENet [Wang et al., 2022] use two-hop dihedral angles; and (2)
working with fully connected geometric graphs, as Proposition 10 suggests that G-equivariant
and G-invariant GNNs can be made equally powerful when performing all-to-all message
passing¹. This is supported by the empirical success of recent G-invariant Graph Transformers
[Joshi, 2025, Shi et al., 2022] for small molecules with tens of nodes, where working with full
graphs is tractable.
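As an illustration of approach (1), a dihedral angle is defined over four points i, j, k, l connected in a chain, so it spans two hops from either central node and cannot be read off a single 1-hop neighbourhood. The sketch below uses the standard atan2 formulation; the function name and the sign convention are our own assumptions rather than the exact featurisation used by GemNet or ComENet.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_angle(p0, p1, p2, p3):
    """Signed dihedral angle (radians) around the p1-p2 axis.

    The sign convention is an assumption; libraries differ on it.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.atan2(y, x)


# Hypothetical four-atom chains sharing the same central bond along the z-axis.
cis = dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
gauche = dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

The three chains have identical bond lengths and bond angles, and differ only in the dihedral angle. This is the kind of non-local quantity that must be supplied as an input feature if the message passing itself is restricted to 1-hop invariants.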

3.3.3 Role of Scalarisation Body Order: Identifying Neighbourhood G-orbits

At each iteration of GWL and IGWL, the I-HASH function assigns a G-invariant colouring to
distinct geometric neighbourhood patterns. I-HASH is an idealised G-orbit injective function
which is not necessarily continuous. In Geometric GNNs, this corresponds to scalarising local
geometric information when updating the scalar features. We can analyse the construction of
the I-HASH function and the scalarisation step in Geometric GNNs via the k-body variations
IGWL(k), described in Section 3.2. In doing so, we will make connections between IGWL and
WL for non-geometric graphs.

Firstly, we formalise the relationship between the injectivity of I-HASH(k) and the maximum
cardinality of local neighbourhoods in a given dataset.

Proposition 11. I-HASH(m) is G-orbit injective for m = max({|Ni| | i ∈ V}), the maximum
cardinality of all local neighbourhoods Ni in a given dataset.

¹Subsequent theoretical results have confirmed that for fully connected graphs, higher-order invariant GNNs
(specifically those equivalent to 2-WL) are geometrically complete [Delle Rose et al., 2023]. However, this requires
global all-to-all communication, reinforcing the advantage of equivariant models for scalable, sparse message
passing.

Practical Implications While building provably injective I-HASH(k) functions may require
intractably high k, the hierarchy of IGWL(k) tests enables us to study the expressive power of
practical G-invariant aggregators used in current Geometric GNN layers, e.g. SchNet [Schütt
et al., 2018], E-GNN [Satorras et al., 2021], and TFN [Thomas et al., 2018] use distances, while
DimeNet [Gasteiger et al., 2020] uses distances and angles. Notably, MACE [Batatia et al.,
2022b] constructs a complete basis of scalars up to arbitrary body order k via Atomic Cluster
Expansion [Dusson et al., 2019], which can be G-orbit injective if the conditions in Proposition
11 are met. We can state the following about the IGWL(k) hierarchy and the corresponding
GNNs.

Proposition 12. IGWL(k) is at least as powerful as IGWL(k−1). For k ≤ 5, IGWL(k) is strictly
more powerful than IGWL(k−1).

Finally, we show that IGWL(2) is equivalent to WL when all the pairwise distances are the
same. A similar observation was recently made by Pozdnyakov and Ceriotti [2022].

Proposition 13. Let G1 = (A1,S1, X⃗1) and G2 = (A2,S2, X⃗2) be two geometric graphs with
the property that all edges have equal length. Then, IGWL(2) distinguishes the two graphs if and
only if WL can distinguish the attributed graphs (A1,S1) and (A2,S2).

This equivalence points to limitations of distance-based G-invariant models like SchNet
[Schütt et al., 2018]. These models suffer from all well-known failure cases of WL, e.g. they
cannot distinguish two equilateral triangles from the regular hexagon [Gasteiger et al., 2020].
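The failure case above can be reproduced with a few lines of WL colour refinement. The sketch below is illustrative only: it assumes uniform initial node colours and uses nested tuples as colours (rather than a hash) so that colours are directly comparable across graphs. The function and variable names are hypothetical.

```python
from collections import Counter


def wl_colour_histogram(adjacency, num_iterations=3):
    """Run 1-WL colour refinement and return the multiset of final node colours."""
    colours = {node: 0 for node in adjacency}  # uniform initial colours
    for _ in range(num_iterations):
        # New colour = (own colour, sorted multiset of neighbour colours).
        colours = {
            node: (colours[node], tuple(sorted(colours[nb] for nb in adjacency[node])))
            for node in adjacency
        }
    return Counter(colours.values())


# Two disjoint triangles versus a hexagon: both are 2-regular graphs on 6 nodes.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
# A 6-node path, for contrast: its end nodes have a different degree.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

same = wl_colour_histogram(two_triangles) == wl_colour_histogram(hexagon)
different = wl_colour_histogram(hexagon) != wl_colour_histogram(path)
```

Every node in both the two triangles and the hexagon sees two identically coloured neighbours at every iteration, so the colour histograms never diverge. When all edges have equal length, distances add no further information, which is the situation described by Proposition 13.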

3.4 Synthetic Experiments on Expressivity

Overview We perform three simple synthetic experiments to supplement our theoretical results
and highlight the practical challenges in building maximally powerful Geometric GNNs, such as
oversmoothing and oversquashing with increased depth, as well as the need for higher order tensors
in G-equivariant GNNs.

Setup and Hyperparameters We experiment with the following models: (1) SchNet [Schütt
et al., 2018] and DimeNet [Gasteiger et al., 2020] as representative G-invariant GNNs; (2)
E-GNN [Satorras et al., 2021] and GVP-GNN [Jing et al., 2020] as representative G-equivariant
GNNs which use cartesian vectors; and (3) TFN [Thomas et al., 2018] and MACE [Batatia
et al., 2022b] to study higher order G-equivariant GNNs using spherical tensors. For SchNet
and DimeNet, we use the implementation from PyTorch Geometric [Fey and Lenssen, 2019a].
For E-GNN, GVP-GNN, and MACE, we adapt implementations from the respective authors.
Our TFN implementation is based on e3nn [Geiger and Smidt, 2022], and we also re-implement
MACE by incorporating the EquivariantProductBasisBlock from its authors into our
TFN layer. We set scalar feature channels to 128 for SchNet, DimeNet, and E-GNN. We set
scalar/vector/tensor feature channels to 64 for GVP-GNN, TFN, MACE. TFN and MACE use
order L = 2 tensors by default. MACE uses local body order 4 by default. We train all models
for 100 epochs using the Adam optimiser with an initial learning rate of 1e-4, which we reduce
by a factor of 0.9 with a patience of 25 epochs when the performance plateaus. All results are
averaged across 10 random seeds.



(k = 4-chains)                                      Number of layers
        GNN Layer             $\lfloor k/2 \rfloor$   $\lfloor k/2 \rfloor + 1 = 3$   $\lfloor k/2 \rfloor + 2$   $\lfloor k/2 \rfloor + 3$   $\lfloor k/2 \rfloor + 4$
Equiv.  GWL                   50%           100%           100%           100%           100%
        E-GNN                 50.0 ± 0.0    50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0     100.0 ± 0.0
        GVP-GNN               50.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0
        TFN                   50.0 ± 0.0    50.0 ± 0.0     50.0 ± 0.0     80.0 ± 24.5    85.0 ± 22.9
        MACE                  50.0 ± 0.0    90.0 ± 20.0    90.0 ± 20.0    95.0 ± 15.0    95.0 ± 15.0
Inv.    IGWL                  50%           50%            50%            50%            50%
        SchNet                50.0 ± 0.0    50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0
        DimeNet               50.0 ± 0.0    50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0
        SphereNet             50.0 ± 0.0    50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0
        SchNet (full graph)   100.0 ± 0.0   100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0
        SchNet (global feat)  100.0 ± 0.0   100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0

Table 3.1: k-chain geometric graphs. k-chains are ($\lfloor k/2 \rfloor + 1$)-hop distinguishable and
($\lfloor k/2 \rfloor + 1$) GWL iterations are theoretically sufficient to distinguish them. We train
Geometric GNNs with an increasing number of layers to distinguish k = 4-chains. G-equivariant GNNs may
require more iterations than prescribed by GWL, pointing to preliminary evidence of oversquashing
when geometric information is propagated across multiple layers using fixed dimensional
feature spaces. IGWL and G-invariant GNNs are unable to distinguish k-chains for any k ≥ 2
and G = O(3). G-invariant GNNs with precomputed non-local features (volume of bounding
box) or message passing on fully connected graphs can trivially solve the task. Anomalous
results are marked in red and expected results in green.

Tasks We design three synthetic experiments to highlight practical challenges in building
expressive Geometric GNNs, summarised below and described in detail subsequently.

• Distinguishing k-chains, which test a model’s ability to propagate geometric information
non-locally and demonstrate geometric oversquashing with increased depth.

• Rotationally symmetric structures, which test a layer’s ability to identify neighbourhood
orientation and highlight the utility of higher order tensors in G-equivariant GNNs.

• Counterexamples from Pozdnyakov et al. [2020], which test a layer’s ability to create
distinguishing fingerprints for local neighbourhoods and highlight the need for higher body
order of scalarisation.

3.4.1 Depth, Non-local Geometric Properties, and Oversquashing

GWL assumes perfect propagation of G-equivariant geometric information at each iteration,
which implies that the test can be run for any number of iterations without loss of information. In
Geometric GNNs, G-equivariant information is propagated via summing features from multiple
layers in fixed dimensional spaces, which may lead to distortion or loss of information from
distant nodes.

                                  Rotational symmetry
           GNN Layer            2 fold         3 fold         5 fold         10 fold
Cart.      E-GNN (L=1)          50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0
           GVP-GNN (L=1)        50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0
Spherical  TFN/MACE (L=1)       50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0
           TFN/MACE (L=2)       100.0 ± 0.0    50.0 ± 0.0     50.0 ± 0.0     50.0 ± 0.0
           TFN/MACE (L=3)       100.0 ± 0.0    100.0 ± 0.0    50.0 ± 0.0     50.0 ± 0.0
           TFN/MACE (L=5)       100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    50.0 ± 0.0
           TFN/MACE (L=10)      100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0

Table 3.2: Rotationally symmetric structures. We train single layer G-equivariant GNNs to
distinguish two distinct rotated versions of each L-fold symmetric structure. We find that layers
using order L tensors are unable to identify the orientation of structures with rotation
symmetry higher than L-fold. This issue is particularly prevalent for layers using Cartesian
vectors (tensor order 1).

Experiment To study the practical implications of depth in propagating geometric information
beyond local neighbourhoods, we consider k-chain geometric graphs which generalise the
examples from Schütt et al. [2021]. Each pair of k-chains consists of k + 2 nodes with k nodes
arranged in a line and differentiated by the orientation of the 2 end points. Thus, k-chain graphs
are ($\lfloor k/2 \rfloor + 1$)-hop distinguishable, and ($\lfloor k/2 \rfloor + 1$) GWL iterations are
theoretically sufficient to distinguish them. In Table 3.1, we train G-equivariant and G-invariant
GNNs with an increasing number of layers to distinguish k-chains.
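The k-chain construction can be sketched as follows. This is an illustrative reconstruction under our own assumptions (unit bond lengths, end points tilted by 60 degrees, hypothetical function names) and not the exact dataset generator used for Table 3.1. It checks that the two chains share identical 1-hop invariants (distances and angles around every node) while a non-local quantity, the end-to-end distance, differs.

```python
import math


def k_chain_pair(k, angle_deg=60.0):
    """Two (k + 2)-node chains that differ only in the orientation of one end point."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    backbone = [(float(i), 0.0, 0.0) for i in range(k)]  # k collinear nodes
    left = (-c, s, 0.0)
    right_same = (k - 1 + c, s, 0.0)  # same side as the left end point
    right_flipped = (k - 1 + c, -s, 0.0)  # opposite side
    edges = [(i, i + 1) for i in range(k + 1)]
    return [left] + backbone + [right_same], [left] + backbone + [right_flipped], edges


def one_hop_invariants(points, edges):
    """Sorted list of per-node (distances, angle cosines) over 1-hop neighbourhoods."""
    adjacency = {i: [] for i in range(len(points))}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    fingerprints = []
    for i, neighbours in adjacency.items():
        vecs = [tuple(b - a for a, b in zip(points[i], points[j])) for j in neighbours]
        dists = sorted(round(math.hypot(*v), 6) for v in vecs)
        cosines = sorted(
            round(sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v)), 6)
            for idx, u in enumerate(vecs)
            for v in vecs[idx + 1:]
        )
        fingerprints.append((tuple(dists), tuple(cosines)))
    return sorted(fingerprints)


chain_a, chain_b, edges = k_chain_pair(k=4)
locally_identical = one_hop_invariants(chain_a, edges) == one_hop_invariants(chain_b, edges)
end_to_end_a = math.dist(chain_a[0], chain_a[-1])
end_to_end_b = math.dist(chain_b[0], chain_b[-1])
```

Because each node has at most two neighbours, distances and the single angle fully describe its 1-hop neighbourhood up to rotations and reflections. The pair therefore looks identical to any purely local invariant fingerprint, while the end-to-end distances differ, so separating the chains requires propagating orientation information along the backbone.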

Results Despite the supposed simplicity of the task, we find that popular G-equivariant GNNs
such as E-GNN [Satorras et al., 2021] and TFN [Thomas et al., 2018] may require more iterations
than prescribed by GWL. Notably, for chains larger than k = 4, all G-equivariant GNNs tended
to require more than ($\lfloor k/2 \rfloor + 1$) iterations to solve the task. Additionally, IGWL and
G-invariant GNNs are unable to distinguish k-chains.

Table 3.1 points to preliminary evidence of the oversquashing phenomenon [Alon and Yahav,
2021, Topping et al., 2022] for equivariant features in Geometric GNNs. The issue is most evident
for E-GNN, which uses a single vector feature to aggregate and propagate geometric information.
This may have implications in modelling macromolecules where long-range interactions often
play important roles.

3.4.2 Higher Order Tensors and Rotationally Symmetric Structures

In addition to perfect propagation, GWL is also able to injectively aggregate G-equivariant
information by making use of an auxiliary nested geometric object gi. On the other hand,
G-equivariant GNNs aggregate geometric information via summing neighbourhood features
represented by Cartesian vectors (tensor order 1) or higher order spherical tensors. This choice of
basis often comes with trade-offs between computational tractability and empirical performance.



Experiment To demonstrate the utility of higher order tensors in G-equivariant GNNs, we
study how rotational symmetries interact with tensor order. In Table 3.2, we evaluate current
G-equivariant layers on their ability to distinguish the orientation of structures with rotational
symmetry. An L-fold symmetric structure does not change when rotated by an angle $2\pi/L$ around a
point (in 2D) or axis (3D). We consider two distinct rotated versions of each symmetric structure
and train single layer G-equivariant GNNs to classify the two orientations using the updated
equivariant features.

Results We find that layers using order L spherical tensors are unable to identify the orientation
of structures with rotation symmetry higher than L-fold, i.e. the two distinct rotated versions of
the input have the same equivariant features. We attribute this observation to spherical harmonics,
which serve as an orthonormal basis for spherical tensor features and exhibit rotational symmetry
themselves.

Similar to the Fourier expansion for signals, the spherical harmonic expansion is employed
for converting Cartesian vectors to spherical signals in G-equivariant GNNs. The tensor order of
the spherical harmonic bases determines the rate of oscillation of the approximated function on
the sphere. We believe that this oscillation rate is closely linked to the rotational fold of a set of
symmetric vectors.

In the Fourier expansion, it is not feasible to accurately approximate a high-frequency
function solely using low-frequency sinusoidal waves. Similarly, when truncating the spherical
harmonic expansion to an order lower than the fold of the rotational symmetry, the rotationally
symmetric vectors act as a higher frequency function. Consequently, the lower frequency bases
cannot preserve the orientation of these vectors.
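The Fourier analogy can be checked numerically in two dimensions. The sketch below is an illustration under our own assumptions: it uses circular harmonics $e^{i m \theta}$ as a planar stand-in for spherical harmonics of order m, and sums them over the directions of an L-fold symmetric set of unit vectors, mimicking sum aggregation over neighbours. The function names are hypothetical.

```python
import cmath
import math


def circular_harmonic_feature(angles, order):
    """Sum of e^{i * order * theta} over a set of planar unit vectors."""
    return sum(cmath.exp(1j * order * theta) for theta in angles)


def l_fold_structure(fold, offset=0.0):
    """Angles of `fold` unit vectors evenly spaced on the circle, rotated by `offset`."""
    return [offset + 2 * math.pi * j / fold for j in range(fold)]


fold = 5
structure = l_fold_structure(fold)
rotated = l_fold_structure(fold, offset=math.pi / fold)  # a distinct orientation

# For each harmonic order, record whether the two orientations get different features.
distinguishable = {
    order: abs(
        circular_harmonic_feature(structure, order)
        - circular_harmonic_feature(rotated, order)
    )
    > 1e-9
    for order in range(1, fold + 1)
}
```

For the 5-fold structure the summed features vanish (up to floating point error) for orders 1 to 4, so both orientations map to the same feature, and only order 5 separates them. This matches the pattern in Table 3.2, where an order L tensor is needed to resolve an L-fold symmetric structure.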

Layers such as E-GNN [Satorras et al., 2021] and GVP-GNN [Jing et al., 2020] using
Cartesian vectors are popular as higher order tensors can be computationally intractable for
many applications. However, E-GNN and GVP-GNN are particularly poor at discriminating the
orientation of rotationally symmetric structures. This may have implications for the modelling
of periodic materials which naturally exhibit such symmetries [Levine and Steinhardt, 1984].

3.4.3 Body Order of Scalarisation and Neighbourhood Fingerprints

GWL uses a node colouring function I-HASH for distinguishing G-orbits of neighbourhoods, i.e.
a neighbourhood fingerprint. In Geometric GNNs, this corresponds to a scalarisation step where
local geometric information from subsets of neighbours is aggregated to compute G-invariant
scalars (termed the body order).
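A minimal sketch of what body order means in practice: 2-body scalars involve the central node and one neighbour (distances), while 3-body scalars involve the central node and two neighbours (angles). The two toy neighbourhoods below are hypothetical and chosen only to show that distances alone can coincide while angles differ; they are not the counterexamples of Pozdnyakov et al. [2020].

```python
import math


def two_body_scalars(centre, neighbours):
    """Sorted distances from the centre to each neighbour."""
    return sorted(round(math.dist(centre, p), 6) for p in neighbours)


def three_body_scalars(centre, neighbours):
    """Sorted angles (degrees) between every pair of neighbours, seen from the centre."""
    vecs = [tuple(b - a for a, b in zip(centre, p)) for p in neighbours]
    angles = []
    for idx, u in enumerate(vecs):
        for v in vecs[idx + 1:]:
            cosine = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
            cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
            angles.append(round(math.degrees(math.acos(cosine)), 6))
    return sorted(angles)


centre = (0.0, 0.0, 0.0)
bent = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # neighbours at a right angle
straight = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]  # neighbours on opposite sides

same_distances = two_body_scalars(centre, bent) == two_body_scalars(centre, straight)
bent_angles = three_body_scalars(centre, bent)
straight_angles = three_body_scalars(centre, straight)
```

A layer that only scalarises distances cannot tell these two neighbourhoods apart, whereas a layer that also uses angles can. Higher body orders extend the same idea to scalars computed from larger subsets of neighbours.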

Experiment To demonstrate the practical implications of scalarisation body order, we evaluate
current Geometric GNN layers on their ability to discriminate counterexamples from Pozdnyakov
et al. [2020]. Each counterexample consists of a pair of local neighbourhoods that are
indistinguishable when comparing their set of k-body scalars, i.e. I-HASH(k) and Geometric GNN
layers with body order k cannot distinguish the neighbourhoods. The 3-body counterexample
corresponds to Fig.1(b) in Pozdnyakov et al. [2020], 4-body chiral to Fig.2(e), and 4-body
non-chiral to Fig.2(f); the 2-body counterexample is based on the two local neighbourhoods in
our running example.

                            Counterexample from Pozdnyakov et al. [2020]
        GNN Layer             2-body         3-body (Fig.1(b))   4-body (Fig.2(f))
Inv.    SchNet (2-body)       50.0 ± 0.0     50.0 ± 0.0          50.0 ± 0.0
        DimeNet (3-body)      100.0 ± 0.0    50.0 ± 0.0          50.0 ± 0.0
        SphereNet (4-body)    100.0 ± 0.0    100.0 ± 0.0         50.0 ± 0.0
Equiv.  E-GNN (2-body)        50.0 ± 0.0     50.0 ± 0.0          50.0 ± 0.0
        GVP-GNN (3-body)      100.0 ± 0.0    50.0 ± 0.0          50.0 ± 0.0
        TFN (2-body)          50.0 ± 0.0     50.0 ± 0.0          50.0 ± 0.0
        MACE (3-body)         100.0 ± 0.0    50.0 ± 0.0          50.0 ± 0.0
        MACE (4-body)         100.0 ± 0.0    100.0 ± 0.0         50.0 ± 0.0
        MACE (5-body)         100.0 ± 0.0    100.0 ± 0.0         100.0 ± 0.0

Table 3.3: Counterexamples from Pozdnyakov et al. [2020]. This task evaluates single layer
Geometric GNNs at distinguishing counterexample structures that are indistinguishable using
k-body scalarisation. Geometric GNN layers with body order k cannot distinguish the
corresponding counterexample. The 3-body counterexample is from Fig.1(b) [Pozdnyakov
et al., 2020], 4-body is from Fig.2(f) [Pozdnyakov et al., 2020], and 2-body is based on the two
local neighbourhoods in our running example.

Results In Table 3.3, we train single layer Geometric GNNs to distinguish the counterexamples
using updated scalar features. Unsurprisingly, we find that most layers computing 2- or 3-body
scalarisations fail the task. Notably, training higher body order MACE layers to distinguish the
chiral and non-chiral 4-body counterexamples should be possible in theory, but proved
challenging in practice. This highlights the difficulty of designing as well as optimising continuous,
high body order neighbourhood fingerprints.

3.5 Experiments on Protein Representation Learning

Overview Having introduced a unified theoretical framework for characterising the expressive
power of Geometric GNNs, we now turn to evaluating their practical performance on protein
property prediction tasks. We leverage the ProteinWorkshop benchmark suite [Jamasb et al.,
2024] to systematically compare several classes of Geometric GNNs across protein function and
structure annotation datasets. Our evaluation encompasses general-purpose G-invariant [Schütt
et al., 2018] and G-equivariant architectures [Satorras et al., 2021, Thomas et al., 2018, Batatia
et al., 2022b], alongside bespoke models specifically designed for protein structures [Morehead
and Cheng, 2024, Zhang et al., 2023b]. Through rigorous benchmarking, these experiments aim
to bridge the gap between theoretical expressivity and practical performance in real-world tasks.

Figure 3.7: Overview of ProteinWorkshop, a comprehensive benchmarking suite for evaluating
protein structure representation learning. The framework encompasses: (1) Large-scale ML-ready
datasets containing experimental structures from the PDB as well as predicted structures
from AlphaFoldDB and ESM Atlas; (2) Unified implementations of diverse Geometric GNN
architectures and flexible featurisation schemes; (3) Real-world prediction tasks including both
node-level and graph-level tasks; and (4) Rigorous evaluation protocols with standardised splits,
metrics, and pretraining procedures to enable fair comparison and reproducible progress tracking
in protein representation learning.

Broader motivation behind ProteinWorkshop Recent advances in protein structure prediction
have led to the availability of large-scale structural data [Jumper et al., 2021, Baek et al.,
2021]. However, the mere availability of structural data does not guarantee progress in our
understanding of the relationship between protein sequence, structure, and function [Varadi
et al., 2021]. To address this gap, we need computational methods that can learn meaningful
representations of protein structures for functional annotation.

Geometric GNNs have emerged as the architecture of choice for learning structural
representations of biomolecules [Schütt et al., 2018, Gasteiger et al., 2020, Jing et al., 2020, Schütt
et al., 2021, Morehead et al., 2022, Zhang et al., 2023b]. Previous works have primarily focused
on learning effective global (i.e. graph-level) representations of protein structure, typically
evaluating methods on function or fold classification tasks [Gligorijević et al., 2021, Zhang et al.,
2023b]. In contrast, there has been comparatively little investigation into the ability of different
methods to learn informative local (node-level) representations. Such local representations are
crucial for a variety of annotation tasks, including binding or interaction site prediction [Gainza
et al., 2020], and for providing conditioning signals in structure-conditioned molecule design
methods [Schneuing et al., 2024, Corso et al., 2023]. Understanding the structure-function
relationship at this granular level can also drive progress in protein design by revealing structural
motifs that underlie desirable properties, enabling their incorporation into new designs. To
enable systematic evaluation of both global and local representational capabilities, we developed
ProteinWorkshop [Jamasb et al., 2024], a robust, standardised benchmark for evaluating protein
representation learning methods. An overview of the benchmark is shown in Figure 3.7.

Table 3.4: Protein property prediction datasets. We benchmark Geometric GNNs on a variety
of protein function prediction and structural-annotation tasks, including both node-level and
graph-level tasks.

             Task                       Dataset Origin              Structures     # Train   # Validation   # Test          Metric
Graph-level  Fold Prediction            Hou et al. [2017]           Experimental   12.3 K    0.7 K          1.3/0.7/1.3 K   Accuracy
             Gene Ontology Prediction   Gligorijević et al. [2021]  Experimental   27.5 K    3.1 K          3.0 K           Fmax
             Reaction Class Prediction  Hermosilla et al. [2020]    Experimental   29.2 K    2.6 K          5.6 K           Accuracy
             Antibody Dev. Prediction   Huang et al. [2021]         Experimental   1.7 K     0.24 K         0.48 K          AUPRC
Node         Inverse Folding            Ingraham et al. [2019a]     Experimental   3.9 M     105 K          180 K           Perplexity
             PPI Site Prediction        Gainza et al. [2020]        Experimental   478 K     53 K           117 K           AUPRC

3.5.1 Experimental Setup

Benchmark datasets We benchmark Geometric GNNs on a diverse collection of protein
property prediction tasks that span both the node level and the graph level. Our evaluation
framework, summarised in Table 3.4, encompasses datasets curated from the literature and
existing benchmarks to systematically assess both global and local representational capabilities
of different architectures². For node-level tasks, we evaluate models on their ability to learn
informative residue-level representations through inverse folding [Ingraham et al., 2019a], and
protein-protein interaction site prediction [Gainza et al., 2020]. For graph-level tasks, we assess
global structural representation learning through fold prediction [Hou et al., 2017], gene ontology
prediction [Gligorijević et al., 2021], reaction class prediction [Hermosilla et al., 2020], and
antibody development prediction [Huang et al., 2021]. See Jamasb et al. [2024] for detailed
descriptions of the datasets and tasks.

²To retain focus on protein representation learning, we deliberately exclude commonly-used tasks based on
protein-small molecule interactions as it is hard to disentangle the effect of the small molecule representation and
the potential for bias [Boyles et al., 2019].

Geometric GNN models We provide a unified implementation of several rotation invariant
and equivariant Geometric GNNs, spanning the range of message passing body order and tensor
order. We benchmark 4 general purpose models: SchNet [Schütt et al., 2018], EGNN [Satorras
et al., 2021], TFN [Thomas et al., 2018], MACE [Batatia et al., 2022b]; and 2 protein-specific
architectures: GCPNet [Morehead and Cheng, 2024], GearNet [Zhang et al., 2023b]. As input
to the models, we convert protein structures to geometric graphs with Cα atoms as nodes, with
additional features including the residue type, positional encoding, virtual torsion and bond
angles, as well as backbone torsion angles along the protein chain, as summarised in Table 3.5.
Edges are constructed using a k-nearest neighbours graph based on the Cα atom positions,
with k = 16.

Table 3.5: Structural featurisation schemes. Residue type is a one-hot encoding of the amino
acid type for each node; positional encoding is a 16-dimensional sinusoidal encoding [Vaswani
et al., 2017b]; and $\phi, \psi, \omega \in \mathbb{R}^6$ and $\chi_{1-4} \in \mathbb{R}^8$ are backbone
dihedral angles and sidechain torsion angles, respectively, embedded on the unit circle. Similarly,
$\kappa, \alpha \in \mathbb{R}^4$ are virtual torsion and bond angles defined over Cα atoms.

Granularity   Cα Features                                  Backbone    Sidechain
Cα            Residue Type
Cα            Residue Type, Positional Encoding
Cα            Residue Type, Positional Encoding, κ, α
Cα            Residue Type, Positional Encoding, κ, α      ϕ, ψ, ω
Cα            Residue Type, Positional Encoding, κ, α      ϕ, ψ, ω     χ1, χ2, χ3, χ4
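The k-nearest neighbours edge construction over Cα positions can be sketched as below. This is a brute-force, pure-Python illustration under our own assumptions (hypothetical function name, directed edges from each residue to its neighbours, ties broken by index); it is not the ProteinWorkshop implementation, which may use a different edge direction or an optimised neighbour search.

```python
import math


def knn_edges(coords, k=16):
    """Directed edges (i, j) where j is one of the k nearest points to i."""
    edges = []
    for i, ci in enumerate(coords):
        # Sort all other points by distance to point i (ties broken by index).
        others = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        edges.extend((i, j) for _, j in others[:k])
    return edges


# Hypothetical toy "backbone": five Cα positions along a line, 3.8 Å apart.
ca_coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
edges = knn_edges(ca_coords, k=2)
```

Unlike a radial cutoff, a k-nearest neighbours graph gives every node the same out-degree regardless of local density, which keeps memory use predictable for large proteins. The brute-force search above scales quadratically with the number of residues and is meant only to show the construction.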
Additionally, we benchmark against ESM-2 [Lin et al., 2023], a state-of-the-art protein
language model that learns representations from sequence alone without explicit geometric
inductive biases. ESM-2 employs a standard Transformer architecture and is pre-trained on
large-scale protein sequence data, enabling it to implicitly capture structural and functional
patterns. We use the 650 million parameter version as a frozen feature extractor to generate
per-residue embeddings, which we optionally augment with the geometric features described
above. A simple MLP is then trained on these combined features for each downstream task. This
comparison allows us to quantify the benefit of explicit structural modelling in Geometric GNNs
versus the implicit structural information learned by sequence-based ESM-2 through large-scale
pre-training.

Hyperparameters We use consistent experimental settings across all models, with six layers
and 512 hidden channels as our default configuration. For tensor-based equivariant GNNs,
we reduced the number of layers and hidden channels to fit 80GB of GPU memory on one
NVIDIA A100 GPU. All models are trained using the Adam optimiser with a batch size of 32.
Learning rates are selected via grid search over $\{10^{-5}, 10^{-4}, 3 \times 10^{-4}, 10^{-3}\}$, and we employ
early stopping with a patience of 10 epochs to prevent overfitting. Training continues until
convergence or a maximum of 24 hours on a single A100 GPU. We train each model with all
featurisation schemes described in Table 3.5 and report the best performance across all schemes.
To ensure statistical reliability, results are averaged over three random seeds.

3.5.2 Results

Table 3.6: Geometric GNN performance on protein property prediction tasks. We report the
best performance across all featurisation schemes for each model, averaged over three random
seeds. The best and second best performing models for each task are highlighted in bold and
underline, respectively.

        Model      Gene Ontology   Antibody Dev.   Fold            Reaction        PPI Site        Inverse Folding
                   Fmax (↑)        AUPRC (↑)       Accuracy (↑)    Accuracy (↑)    AUPRC (↑)       Perplexity (↓)
Inv.    SchNet     0.429 ± 0.00    0.896 ± 0.00    31.98 ± 0.01    73.83 ± 0.02    0.955 ± 0.00    9.97 ± 0.09
        GearNet    0.453 ± 0.00    0.837 ± 0.01    34.63 ± 0.01    80.03 ± 0.01    0.962 ± 0.00    11.23 ± 0.09
Equiv.  E(n) GNN   0.455 ± 0.01    0.927 ± 0.01    41.48 ± 0.02    82.70 ± 0.00    0.965 ± 0.00    8.89 ± 0.04
        GCPNet     0.442 ± 0.01    0.881 ± 0.02    38.86 ± 0.02    77.71 ± 0.01    0.968 ± 0.00    7.56 ± 0.11
        TFN        0.452 ± 0.00    0.923 ± 0.01    36.65 ± 0.01    81.22 ± 0.01    0.967 ± 0.00    8.73 ± 0.02
        MACE       0.411 ± 0.01    0.918 ± 0.00    35.68 ± 0.03    76.34 ± 0.01    0.965 ± 0.00    8.94 ± 0.03
LM      ESM-2      0.545 ± 0.00    0.885 ± 0.00    34.59 ± 0.00    82.11 ± 0.00    0.956 ± 0.00    -

Several clear patterns emerge from the results in Table 3.6:

Equivariant models outperform invariant models On every task, the best equivariant GNN
outperforms the best invariant GNN, although the margin on Gene Ontology prediction is small
(0.455 for E(n) GNN versus 0.453 for GearNet). This is consistent with the theoretical advantage
of equivariant message passing established by the GWL framework. E(n) GNN achieves the best
performance on three tasks (Antibody Development, Fold Classification, and Reaction Class
Prediction).

TFN and MACE, which are equivariant GNNs utilising higher-order tensor representations,
also show strong performance across multiple tasks. TFN ranks among the top three models on
five of the six tasks (all except Gene Ontology prediction), empirically supporting the role of
higher-order equivariant features in capturing complex geometric relationships.

Protein-specific architectures GCPNet, designed specifically for protein structures, excels
at node-level tasks requiring fine-grained structural understanding (PPI Site Prediction and
Inverse Folding). Similarly, GearNet, a protein-specific invariant model, tends to outperform the
general-purpose invariant SchNet. This suggests that incorporating domain-specific structural
priors into architecture design can be beneficial, particularly for tasks demanding local geometric
precision.

Language models excel at functional tasks Despite lacking explicit structural inductive
biases, ESM-2 achieves remarkable performance on Gene Ontology prediction and competitive
results on Reaction Class Prediction. This suggests that sequence information alone carries
substantial predictive power for certain functional annotation tasks. However, for structural
tasks like fold prediction, the ability to propagate orientation information appears crucial, with
equivariant GNNs showing substantial advantages.

3.6 Related Work

Completeness of molecular representations A central question in geometric deep learning is
whether invariant scalar features (such as distances) are sufficient to completely distinguish any
two non-isomorphic geometric structures, or if equivariant vector features are strictly necessary.
The molecular simulations community has extensively studied the completeness of atom-centred
interatomic potentials, focusing on the ability to distinguish 1-hop local neighbourhoods (point
clouds) around atoms by constructing spanning sets for continuous, G-equivariant multiset
functions [Shapeev, 2016, Drautz, 2019, Dusson et al., 2019, Pozdnyakov et al., 2020]. GWL
generalises and extends this analysis to generic geometric graph isomorphism problems beyond
local atom-centred neighbourhoods.

Subsequent to the publication of the GWL framework [Joshi et al., 2023], several works
established tight bounds for the completeness of invariant representations on fully connected
graphs. Delle Rose et al. [2023] proved that the standard $(d-1)$-dimensional Weisfeiler-Leman
test is complete for distinguishing generic point clouds in $\mathbb{R}^d$ given the full distance matrix. For
3D molecular structures, this implies that 2-WL (which considers pairs of nodes) is theoretically
sufficient for completeness, provided the graph is fully connected. Similarly, Hordan et al. [2024]
established that GNNs simulating 2-WL are universal approximators for continuous functions
on point clouds. Concurrently, Li et al. [2023c] demonstrated that while full distance matrices
contain sufficient information, standard message passing on distances (equivalent to 1-WL) is
incomplete, echoing the limitations of IGWL discussed in this chapter.

These findings collectively suggest that while invariant representations can be complete on
fully connected geometric graphs, they require higher-order message passing (beyond 1-WL)
to achieve this. In contrast, the GWL framework highlights that for sparse graphs, which are
computationally necessary for large and periodic atomic systems, equivariance provides a strictly
superior inductive bias by propagating orientation information that is otherwise lost in local
invariant updates.

Universality of Geometric GNNs Recent theoretical work [Dym and Maron, 2020, Villar
et al., 2021, Gasteiger et al., 2021, Jing et al., 2020] has shown that architectures such as TFN,
GemNet and GVP-GNN can be universal approximators of continuous, G-equivariant or G-
invariant multiset functions over point clouds, i.e. fully connected graphs. In contrast, the GWL
framework studies the expressive power of Geometric GNNs operating on sparse graphs from
the perspective of discriminating geometric graphs.

In our full paper [Joshi et al., 2023], we included additional proofs that GWL’s discrimination-
based perspective is equivalent to universal approximation. However, the discrimination lens
offers more granular and practically useful insights than universality alone. While universality is
a binary property—a model is either universal or not—discrimination enables a more nuanced
analysis of expressivity by characterizing the specific classes of geometric graphs that can
and cannot be distinguished. This finer-grained perspective allows us to identify concrete
counterexamples and failure modes, making the theoretical framework directly applicable to
practical model design.

3.7 Summary

In this chapter, we studied the expressive power of Geometric GNNs from the perspective
of discriminating non-isomorphic geometric graphs. We proposed a geometric version of the
Weisfeiler-Leman graph isomorphism test, termed GWL, which is a theoretical upper bound on
the expressive power of Geometric GNNs. The GWL framework addresses a key research gap as
standard GNNs and the associated theoretical tools are inapplicable for geometric graphs and 3D
molecular structure representation learning.

Through the lens of GWL, we formalised how key design choices influence Geometric GNN
expressivity. Notably, invariant GNNs cannot distinguish graphs where one-hop neighbourhoods
are the same and fail to compute non-local geometric properties such as volume, centroid, etc.
Equivariant GNNs distinguish a larger class of graphs as stacking equivariant layers propagates
geometric information beyond local neighbourhoods.

Our synthetic experiments validate theoretical insights from GWL, highlighting three key
challenges in Geometric GNN design: (1) geometric oversquashing, where equivariant models
require more iterations than theoretically prescribed for propagating geometric information; (2)
need for higher order tensors, as layers using order-L spherical tensors cannot distinguish very
simple structures with symmetry higher than L-fold; and (3) limits of scalarisation body order, as
we can construct counterexamples where layers with body order k cannot distinguish structures
requiring (k + 1)-body invariants. These experiments reveal practical limitations of current
architectures for molecular representation learning.

Additionally, through a benchmark on protein function prediction, we explored how our
theoretical insights translate to performance in real-world tasks. We found that equivariant GNNs
consistently outperformed their invariant counterparts across all tasks, confirming the practical
utility of higher-order equivariant features suggested by our theory. Overall, Geometric GNNs
showed improvements over sequence-based protein language models, demonstrating the value of
geometric inductive biases in learning representations of molecular structure.

Future work GWL provides an abstraction to study the theoretical limits of Geometric GNNs.
However, translating these theoretical insights directly into practical models remains challenging,
as GWL assumes perfect neighbourhood aggregation and colouring functions that satisfy the
conditions of Proposition 2. Although we do not propose provably powerful models, the understanding
gained from the GWL framework guides the development of maximally powerful Geometric
GNNs for real-world problems such as those tackled in Part II. Moreover, our discrimination-based
perspective can be a starting point for further investigating the optimisation behaviour and
generalisation capacity of these architectures.

Additionally, GWL does not characterise all classes of Geometric GNNs. Non-local ar-
chitectures that aggregate geometric information beyond immediate neighbourhoods, such as
GemNet-Q [Gasteiger et al., 2021], are not directly covered by our framework. Similarly, architectures
based on canonical reference frames [Du et al., 2022, Wang et al., 2022] are outside the
scope of GWL. These methods use local or global frames of reference to transform equivariant
quantities into invariant features, offering an alternative modelling paradigm when canonical
reference frames can be easily defined (e.g. protein backbone structures [Jumper et al., 2021]).

An emerging class of architectures also not covered by GWL are unconstrained networks
that learn roto-translational equivariance implicitly from data, such as standard Transformers
without geometric inductive biases [Wang et al., 2024, Abramson et al., 2024, Joshi et al., 2025a].
These models treat 3D coordinates as static node features and can theoretically distinguish any
geometric graphs where the underlying attributed graphs are distinguishable by WL. Understand-
ing when to prefer explicit inductive biases versus implicit learning of symmetries remains an
important open question, both in theory and practice. The next chapter, Chapter 4, will explore
this trade-off in more detail in the context of generative modelling.


Chapter 4

Unified Generative Modelling of Molecules
and Materials

In Chapter 3, we studied representation learning of molecular structures, with a focus on
predictive problems. We will now turn to the complementary problem of generative modelling,
which is the foundation for inverse design of molecules with bespoke functionality. The current
state-of-the-art uses diffusion or flow matching models for tasks such as structure prediction and
conditional generation for biomolecules [Watson et al., 2023, Ingraham et al., 2023, Abramson
et al., 2024] and materials [Jiao et al., 2023, Zeni et al., 2025], as well as for structure-based drug
design [Schneuing et al., 2024].

Molecular systems are atoms interacting in 3D space: they share common underlying physical
principles that determine their 3D structure and properties. However, we currently do not have
a unified formulation of diffusion models across different types of systems such as periodic
crystals and non-periodic small molecules or biomolecules. This contrasts with predictive
models, such as interatomic potentials for molecular simulation [Bartók et al., 2017], which have
seen architectural unification through Geometric GNNs and benefited from transfer learning to
achieve broad generalization across domains [Shoghi et al., 2024, Wood et al., 2025].

Most molecular diffusion models are highly specific to each type of system, and involve
multi-modal generative processes on complex product manifolds of categorical and continuous
data types. For example, de novo generation of small molecules is modelled as two independent
diffusion processes for the atom types (categorical) and 3D coordinates (continuous) of a set of
atoms [Hoogeboom et al., 2022]. The denoiser model learns how atom types and 3D coordinates
jointly evolve in order to sample new molecules but passes through unrealistic intermediate
states during the denoising trajectory. Diffusion models for biomolecules treat groups of atoms
as rigid bodies and add a third manifold (rotations) into the joint diffusion process [Campbell
et al., 2024]. For crystals/materials, the diffusion process needs to additionally handle periodicity
and operates on a joint manifold of atom types, fractional coordinates, lattice lengths, and lattice
angles that together define the repeating unit cell [Miller et al., 2024].

[Figure 4.1 schematic. Stage 1 (autoencoder for reconstruction): the encoder maps the input
representation, consisting of atom types (B, N, 1), 3D coordinates (B, N, 3), fractional
coordinates (B, N, 3), cell lengths (B, 1, 3) and cell angles (B, 1, 3), to a latent
representation (B, N, d), from which the decoder reconstructs an output representation of the
same form. Stage 2 (latent diffusion generative model): a Diffusion Transformer, conditioned on
a periodic/non-periodic class label, denoises random Gaussian noise (B, N, d) into a sampled
latent representation (B, N, d).]
Figure 4.1: Generative modelling of molecules and materials with All-atom Diffusion
Transformers. ADiT performs generative modelling of 3D molecular systems in two stages: (1)
An autoencoder learns a shared latent space by reconstructing all-atom representations of both
molecules (non-periodic) and crystals (periodic); and (2) A Diffusion Transformer samples new
latents from the shared distribution using classifier-free guidance, which are decoded to valid
molecules or crystals using the VAE. Our unified latent diffusion framework enables transfer
learning and avoids the complexity of multiple diffusion processes on categorical-continuous
product manifolds used by equivariant diffusion models.

This chapter introduces the All-atom Diffusion Transformer (ADiT), a unified latent
diffusion model for jointly generating both periodic materials and non-periodic molecules using
the same model. As illustrated in Figure 4.1, ADiT is a latent diffusion model based on two key
ideas: (1) An autoencoder maps unified, all-atom representations of molecules and crystals
to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent
embeddings that the autoencoder can decode to sample new molecules or crystals. ADiTs achieve
state-of-the-art generative performance on both molecules and crystals while being significantly
more scalable than specialized equivariant diffusion models. Additionally, we demonstrate
that joint training and transfer learning between periodic and non-periodic domains improves
performance, representing a step towards broadly generalizable foundation models for generative
chemistry. Open source code is available: github.com/facebookresearch/all-atom-diffusion-transformer.

4.1 All-atom Diffusion Transformers

Overview We use latent diffusion [Rombach et al., 2022, Vahdat et al., 2021] to unify generative
modelling across periodic and non-periodic molecular systems. Our approach consists of two
stages: (1) An autoencoder learns a shared latent space by jointly reconstructing all-atom
representations of both molecules and materials; and (2) A Diffusion Transformer [Peebles
and Xie, 2023] generates new samples from this latent space using classifier-free guidance
[Ho and Salimans, 2022], which can be decoded into valid molecules or crystals.


Compared to existing equivariant diffusion models, our latent diffusion formulation shifts the
complexity of handling categorical and continuous attributes into the autoencoder. This enables
a very simple and highly scalable generative process in a shared latent space of periodic and
non-periodic molecular systems.

4.1.1 Stage 1: Autoencoder for reconstruction

Unified representation of 3D molecular systems Both periodic and non-periodic molecular
systems can be represented as sets of atoms in 3D space, as we saw in Chapter 2. The key
difference is that crystals require an additional periodic unit cell, while molecules have unbounded
coordinates. A crystal or molecule with N atoms is represented as a multi-modal object:

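$$A = \{a_i\}_{i=1}^{N} \in \mathbb{Z}^{1 \times N} \ \text{(atom types)}, \qquad X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{3 \times N} \ \text{(3D coordinates)},$$

$$F = \{f_i\}_{i=1}^{N} \in [0, 1)^{3 \times N} \ \text{(fractional coordinates)}, \qquad L = \{l_1, l_2, l_3\} \in \mathbb{R}^{3 \times 3} \ \text{(unit cell/lattice)}.$$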

The 3D coordinates X are in nanometers, and the fractional coordinates F are in the range
[0, 1). The lattice matrix L represents a parallelepiped defining the shape of the repeating unit
cell, and fractional coordinates are computed as the inverse of the unit cell matrix multiplied by
the 3D coordinates: $F = L^{-1} X$. We use Niggli reduction to uniquely determine the unit cell
parameters for crystals [Grosse-Kunstleve et al., 2004]. For non-periodic molecules, we set the
unit cell parameters and fractional coordinates to null values ϕ.
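
As a concrete illustration of the relation $F = L^{-1} X$, the following NumPy sketch converts between Cartesian and fractional coordinates. It assumes (the text does not state this) that the lattice vectors l1, l2, l3 are stored as the columns of L and that atoms are the columns of X, matching the 3×N shapes above. The function names are hypothetical and are not taken from the released code.

```python
import numpy as np

def cartesian_to_fractional(X, L):
    """X: (3, N) Cartesian coords; L: (3, 3) lattice, vectors as columns (assumed)."""
    F = np.linalg.solve(L, X)   # solves L F = X, i.e. F = L^{-1} X
    return F % 1.0              # wrap into the unit cell [0, 1)

def fractional_to_cartesian(F, L):
    """Inverse map: X = L F."""
    return L @ F

# Toy example: a non-orthogonal cell containing two atoms.
L = np.array([[4.0, 1.0, 0.0],
              [0.0, 3.0, 0.5],
              [0.0, 0.0, 5.0]])
F_true = np.array([[0.25, 0.90],
                   [0.50, 0.10],
                   [0.75, 0.30]])
X = fractional_to_cartesian(F_true, L)
F = cartesian_to_fractional(X, L)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical-stability choice of this sketch, not something prescribed by the text.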

VAE architecture We use a Variational Autoencoder (VAE) [Kingma and Welling, 2014] to
learn a shared latent representation of molecules and materials using a reconstruction objective.
Given an input 3D molecular system $(A, X, F, L)$, an encoder $\mathcal{E}$ maps each atom’s attributes to
a latent representation $Z$:

$$Z = \mathcal{E}(A, X, F), \tag{4.1}$$

where $Z = \{z_i\}_{i=1}^{N} \in \mathbb{R}^{d \times N}$ encodes information about the categorical atom type and continuous
coordinates (unit cell parameters are encoded implicitly in the fractional coordinates). The
decoder $\mathcal{D}$ reconstructs the input molecular system from the latent embedding:

$$A', X', F', L' = \mathcal{D}(Z). \tag{4.2}$$

We describe the pseudocode for the VAE encoder and decoder operations in Algorithms 1 and 2,
respectively. For the architecture of the encoder E and decoder D, we use a standard Transformer
[Vaswani et al., 2017a] and learn symmetries via data augmentation. In Appendix B.3,
we also ablate roto-translation equivariant VAEs based on Equiformer-V2 [Liao et al., 2024b],
a state-of-the-art Geometric GNN.


Algorithm 1: Pseudocode for VAE encoder E

Input: 3D molecular system ({a_i}, {x_i}, {f_i}, {l_1, l_2, l_3})
Output: Latent representations {z_i}

# Project inputs to d_model
1. h_i = Embedding(a_i)                            # h_i ∈ R^{d_model}
2. h_i = h_i + Linear(Swish(Linear(x_i)))
3. h_i = h_i + Linear(Swish(Linear(f_i)))

# Apply encoder network
4. {h_i} = TransformerEncoder({h_i})

# Down-project to mean µ_Z and std σ_Z
5. µ_{z_i} = Linear(h_i)                           # µ_{z_i} ∈ R^d
6. log σ_{z_i} = Linear(h_i)                       # σ_{z_i} ∈ R^d

# Sample latents Z
7. z_i = µ_{z_i} + σ_{z_i} ⊙ ϵ,  ϵ ∼ N(0, 1)^d     # z_i ∈ R^d
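
Steps 5 to 7 of Algorithm 1 are an instance of the reparameterisation trick. The NumPy sketch below illustrates them with random matrices standing in for the two Linear layers; `W_mu` and `W_logsigma` are hypothetical names, the sizes are toy values, and averaging the KL term over atoms and channels is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, d = 5, 16, 4               # toy sizes (the text uses d_model = 512, d = 8)

h = rng.normal(size=(N, d_model))      # stand-in for the TransformerEncoder outputs
W_mu = 0.1 * rng.normal(size=(d_model, d))        # hypothetical Linear weights
W_logsigma = 0.1 * rng.normal(size=(d_model, d))  # hypothetical Linear weights

mu = h @ W_mu                          # step 5: per-atom mean
log_sigma = h @ W_logsigma             # step 6: per-atom log standard deviation
eps = rng.normal(size=(N, d))          # eps ~ N(0, I)
z = mu + np.exp(log_sigma) * eps       # step 7: reparameterised sample

# Closed-form KL( N(mu, sigma^2) || N(0, 1) ) per channel, averaged (assumed reduction).
kl = 0.5 * (mu**2 + np.exp(2.0 * log_sigma) - 1.0 - 2.0 * log_sigma).mean()
```

Predicting log σ rather than σ keeps the standard deviation positive without an explicit constraint, which is why step 7 exponentiates the output of step 6.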

Algorithm 2: Pseudocode for VAE decoder D

Input: Latent representations {z_i}
Output: 3D molecular system ({a'_i}, {x'_i}, {f'_i}, {l'_1, l'_2, l'_3})

# Up-project latents to d_model
1. h_i = Linear(z_i)                               # h_i ∈ R^{d_model}

# Apply decoder network
2. {h_i} = TransformerEncoder({h_i})

# Predict outputs
3. a'_i = argmax(Linear(h_i))                      # a'_i ∈ Z
4. x'_i = Linear(h_i)                              # x'_i ∈ R^3
5. f'_i = Linear(h_i)                              # f'_i ∈ R^3
6. {l'_1, l'_2, l'_3} = Linear((1/N) Σ_{i=1}^{N} h_i)    # l' ∈ R^3

Reconstruction loss We compute the loss for the predicted atom types $A'$ via cross-entropy:

$$\mathcal{L}_A = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CrossEnt}(a_i, a'_i). \tag{4.3}$$

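For the predicted 3D coordinates $X'$, we use the mean squared error (MSE) reconstruction loss
after zero-centering both sets of coordinates:

$$\tilde{x}_i = x_i - \frac{1}{N} \sum_{j=1}^{N} x_j, \qquad \tilde{x}'_i = x'_i - \frac{1}{N} \sum_{j=1}^{N} x'_j, \qquad \mathcal{L}_X = \frac{1}{3N} \sum_{i=1}^{N} \lVert \tilde{x}_i - \tilde{x}'_i \rVert^2. \tag{4.4}$$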

We compute the reconstruction loss for the predicted fractional coordinates $F'$ using MSE as
well:

$$\mathcal{L}_F = \frac{1}{3N} \sum_{i=1}^{N} \lVert f_i - f'_i \rVert^2. \tag{4.5}$$
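
The two coordinate losses (Equations 4.4 and 4.5) can be sketched in NumPy as follows. For convenience the sketch stores atoms as rows, i.e. arrays of shape (N, 3), and the function names are hypothetical. The example checks that the zero-centred loss ignores a global translation of the prediction.

```python
import numpy as np

def coord_mse_zero_centred(X, X_pred):
    """Eq. 4.4: MSE after subtracting each structure's centroid. Shapes: (N, 3)."""
    X_c = X - X.mean(axis=0, keepdims=True)
    Xp_c = X_pred - X_pred.mean(axis=0, keepdims=True)
    return np.sum((X_c - Xp_c) ** 2) / (3 * X.shape[0])

def frac_mse(F, F_pred):
    """Eq. 4.5: plain MSE on fractional coordinates. Shapes: (N, 3)."""
    return np.sum((F - F_pred) ** 2) / (3 * F.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
shift = np.array([1.0, -2.0, 0.5])

loss_shifted = coord_mse_zero_centred(X, X + shift)   # prediction differs only by a translation
loss_noisy = coord_mse_zero_centred(X, X + 0.1 * rng.normal(size=X.shape))
```

Note that Equation 4.5 as written does not account for periodic wrap-around of fractional coordinates, and neither does this sketch.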

For the predicted lattice vectors $L'$, we first convert to rotation-invariant lattice parameters: three
side lengths of the unit cell $L_l = \{a, b, c\} \in \mathbb{R}^{1 \times 3}$, and three internal angles between them
$L_a = \{\alpha, \beta, \gamma\} \in [60^\circ, 120^\circ]^{1 \times 3}$, as described in Miller et al. [2024]. We then compute the MSE
reconstruction loss between the predicted and ground truth lattice parameters:

$$\mathcal{L}_{L_l} = \frac{1}{3} \left( (a - a')^2 + (b - b')^2 + (c - c')^2 \right), \tag{4.6}$$

$$\mathcal{L}_{L_a} = \frac{1}{3} \left( (\alpha - \alpha')^2 + (\beta - \beta')^2 + (\gamma - \gamma')^2 \right). \tag{4.7}$$
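
To make the conversion from lattice vectors to rotation-invariant parameters concrete, the sketch below computes lengths and angles and evaluates Equations 4.6 and 4.7. It assumes lattice vectors are stored as rows of L and uses the standard crystallographic convention (α between b and c, β between a and c, γ between a and b). Lengths are divided by the cube root of the number of atoms and angles are kept in radians, as this section describes. Function names are hypothetical.

```python
import numpy as np

def lattice_to_params(L):
    """L: (3, 3) with lattice vectors a, b, c as rows (assumed). Returns lengths, angles (rad)."""
    a, b, c = L
    lengths = np.linalg.norm(L, axis=1)

    def angle(u, v):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    angles = np.array([angle(b, c), angle(a, c), angle(a, b)])  # alpha, beta, gamma
    return lengths, angles

def lattice_losses(L, L_pred, num_atoms):
    lengths, angles = lattice_to_params(L)
    lengths_p, angles_p = lattice_to_params(L_pred)
    scale = num_atoms ** (1.0 / 3.0)                                 # cube-root normalisation
    loss_len = np.mean((lengths / scale - lengths_p / scale) ** 2)   # Eq. 4.6
    loss_ang = np.mean((angles - angles_p) ** 2)                     # Eq. 4.7
    return loss_len, loss_ang

L = np.array([[4.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 6.0]])
theta = 0.3                                       # rotate the cell about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
L_rot = L @ R.T                                   # rotates each row vector by R
lengths, angles = lattice_to_params(L)
loss_len, loss_ang = lattice_losses(L, L_rot, num_atoms=8)
```

Because lengths and angles are unchanged by a global rotation, both losses vanish for the rotated cell, which is the property that motivates comparing lattice parameters rather than raw lattice vectors.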

Note that in $\mathcal{L}_{L_l}$, we normalize the predicted and groundtruth lengths by the cube root of the
number of atoms to account for the scaling of the unit cell with the number of atoms, following
Xie et al. [2022]. All angles are converted from degrees to radians for numerical stability.

The autoencoder is trained with a weighted reconstruction loss to balance the relative
magnitudes of the various losses. Depending on whether a training sample is periodic or
non-periodic, we use different reconstruction loss weights:

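$$\mathcal{L}_{\mathrm{rec}} = \lambda_A \mathcal{L}_A + \lambda_X \mathcal{L}_X + \lambda_F \mathcal{L}_F + \lambda_{L_l} \mathcal{L}_{L_l} + \lambda_{L_a} \mathcal{L}_{L_a}, \quad \text{where} \tag{4.8}$$

                λ_A     λ_X     λ_F     λ_{L_l}   λ_{L_a}
Periodic        1.0     0.0     10.0    1.0       10.0
Non-periodic    1.0     10.0    0.0     0.0       0.0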

Thus, the overall loss for periodic crystals trains the model to reconstruct the atom types, frac-
tional coordinates and lattice parameters while ignoring the predicted 3D coordinates. Similarly,
the overall loss for non-periodic molecules trains the model to reconstruct the atom types and 3D
coordinates while ignoring the predicted fractional coordinates and lattice parameters.
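
The domain-dependent weighting of Equation 4.8 amounts to a lookup of loss weights by sample type. A minimal sketch in plain Python follows; the dictionary keys, the function name and the per-term loss values are illustrative assumptions, while the weights are those tabulated above.

```python
# Loss weights from the table accompanying Eq. 4.8.
LOSS_WEIGHTS = {
    "periodic":     {"A": 1.0, "X": 0.0,  "F": 10.0, "Ll": 1.0, "La": 10.0},
    "non_periodic": {"A": 1.0, "X": 10.0, "F": 0.0,  "Ll": 0.0, "La": 0.0},
}

def reconstruction_loss(losses, is_periodic):
    """Weighted sum of the per-term losses for one training sample."""
    weights = LOSS_WEIGHTS["periodic" if is_periodic else "non_periodic"]
    return sum(weights[name] * losses[name] for name in weights)

# Hypothetical per-term loss values for a single sample.
losses = {"A": 0.5, "X": 0.2, "F": 0.1, "Ll": 0.3, "La": 0.05}
loss_crystal = reconstruction_loss(losses, is_periodic=True)
loss_molecule = reconstruction_loss(losses, is_periodic=False)
```

Setting a weight to zero means the corresponding decoder head still produces an output for that sample, but receives no gradient from it.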

Regularization We use three regularization techniques to learn robust, informative latent
representations: (1) A bottleneck architecture with latent dimension d significantly smaller than
the encoder/decoder hidden dimension $d_{\mathrm{model}}$ (e.g., $d = 8$ vs $d_{\mathrm{model}} = 512$). (2) A per-channel
KL divergence penalty $\lambda_{\mathrm{KL}} \cdot D_{\mathrm{KL}}(\mathcal{N}(Z; \mu_Z, \sigma_Z) \,\|\, \mathcal{N}(0, 1)^d)$ added to equation 4.8, following
Rombach et al. [2022]. (3) Denoising training with 10% of atoms having their types masked
and coordinates perturbed by $\mathcal{N}(0, 0.1)$ Gaussian noise. For non-equivariant encoders/decoders,
we learn symmetries via data augmentation during training. Translation invariance in non-
periodic systems is handled by working with zero-centred coordinates. For translation invariance
in periodic systems, we add a random translation vector to the Cartesian coordinates and re-
compute the fractional coordinates using the updated Cartesian coordinates. Rotation symmetry
is learnt via applying a random rotation to the Cartesian coordinates and unit cell (fractional
coordinates are invariant to global rotations by definition).
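
The augmentation for periodic systems described above can be sketched as follows. The sketch assumes lattice vectors are stored as columns of L and atoms as columns of X, draws the rotation via a QR decomposition, and uses an arbitrary translation range; all of these are illustrative choices rather than details given in the text.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation matrix from the QR decomposition of a Gaussian matrix."""
    Q, R_tri = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(R_tri))     # fix the sign ambiguity of each column
    if np.linalg.det(Q) < 0:            # ensure a proper rotation (det = +1)
        Q[:, 0] = -Q[:, 0]
    return Q

def augment_crystal(X, L, rng):
    """X: (3, N) Cartesian coords; L: (3, 3) lattice, vectors as columns (assumed)."""
    t = rng.uniform(-1.0, 1.0, size=(3, 1))   # random translation (range is arbitrary)
    X_t = X + t
    F_t = np.linalg.solve(L, X_t) % 1.0       # recompute fractional coords after translating
    R = random_rotation(rng)
    X_aug, L_aug = R @ X_t, R @ L             # rotate coordinates and unit cell together
    return X_aug, L_aug, F_t

rng = np.random.default_rng(0)
L = np.array([[4.0, 1.0, 0.0], [0.0, 3.0, 0.5], [0.0, 0.0, 5.0]])
X = L @ rng.uniform(size=(3, 4))
X_aug, L_aug, F_t = augment_crystal(X, L, rng)
F_after_rotation = np.linalg.solve(L_aug, X_aug) % 1.0
```

Since $(RL)^{-1}(RX) = L^{-1}X$, the fractional coordinates recomputed from the rotated system agree with those computed before rotation, which is the invariance noted in the parenthetical above.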

Decoding latents to molecular systems During inference or sampling from the DiT, the
desired output type (periodic/non-periodic) determines how we process the decoder outputs. The
VAE decoder D generates four attributes for each system: (1) atom types, (2) 3D coordinates, (3)
fractional coordinates, and (4) lattice parameters. For non-periodic molecules, we only utilize
the atom types and 3D coordinates, constructing the molecule via RDKit. For periodic crystals,
we combine the atom types, fractional coordinates, and lattice parameters to build the crystal
structure using PyMatGen. This split decoding strategy allows a single unified model to share
information between both domains while still respecting their distinct geometric constraints,
enabling effective transfer learning between periodic and non-periodic systems.


4.1.2 Stage 2: Latent diffusion generative model

Diffusion formulation We use Gaussian diffusion or flow matching as our generative frame-
work, which iteratively denoises latent samples from a base distribution into samples from a
target distribution [Sohl-Dickstein et al., 2015, Song and Ermon, 2019, Lipman et al., 2023]. Our
formulation uses linear interpolation between a standard normal base distribution and the target
distribution of VAE encoder latent representations of 3D molecular systems (we describe it in
terms of flow matching, though both formulations are equivalent; see Gao et al. [2024]). Thus,
the diffusion model is trained after training the first stage VAE.

Our model learns to generate a set of $N$ latent representations $Z = \{z_i\}_{i=1}^{N}$, where each
latent $z \in \mathbb{R}^d$ encodes information about one atom’s type, coordinates and unit cell, which can
be decoded to a valid molecular system using the VAE decoder $\mathcal{D}$. During training, given an
input molecular system $(A, X, F, L)$, we first encode it to a latent representation $Z$ using the
VAE encoder $\mathcal{E}$. We denote $Z$ as $Z^{(1)}$, a ‘clean’ training sample at time $t = 1$. We then sample
a random initial latent $Z^{(0)}$ at time $t = 0$ from a $d$-dimensional standard normal distribution
$\mathcal{N}(0, 1)^d$, and perform zero-centering by subtracting the per-channel mean of $Z^{(0)}$. We then use
linear interpolation to construct a ‘noisy’ interpolated sample $Z^{(t)}$ at a randomly sampled time
step $t \sim U(0, 1)$:

$$Z^{(t)} = (1 - t)\, Z^{(0)} + t\, Z^{(1)}. \tag{4.9}$$

Thus, we can define a groundtruth conditional vector field $u_t(Z^{(t)} \mid Z^{(1)})$ along the path from the
noisy latents $Z^{(t)}$ at time step $t$ to the clean latents $Z^{(1)}$ as:

$$u_t(Z^{(t)} \mid Z^{(1)}) = \frac{Z^{(1)} - Z^{(t)}}{1 - t}. \tag{4.10}$$
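
Equations 4.9 and 4.10 can be checked numerically with a few lines of NumPy. In the sketch below, random arrays stand in for the VAE latents, atoms are stored as rows, and the per-channel mean is taken over atoms; these are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 8                                  # toy number of atoms and latent dimension

Z1 = rng.normal(size=(N, d))                 # stand-in for the 'clean' latents Z^(1)
Z0 = rng.normal(size=(N, d))                 # base sample Z^(0) ~ N(0, I)
Z0 = Z0 - Z0.mean(axis=0, keepdims=True)     # zero-centre: subtract the per-channel mean

t = 0.3
Zt = (1.0 - t) * Z0 + t * Z1                 # Eq. 4.9: linear interpolation
u_t = (Z1 - Zt) / (1.0 - t)                  # Eq. 4.10: conditional vector field
```

Substituting Equation 4.9 into Equation 4.10 gives $u_t = Z^{(1)} - Z^{(0)}$, so the target vector field is constant along each straight-line path and does not depend on $t$.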

Samples from the base distribution can be transformed to samples from the target distribution by
integrating the vector field $u_t(Z^{(t)} \mid Z^{(1)})$ over time $t$. The goal of conditional flow matching is
to train a denoiser network $\mathcal{F}$ to match this conditional vector field $u_t$. To do so, the denoiser
takes as input the intermediate noisy latents $Z^{(t)}$ at time step $t$ and an additional class label $c$
(described subsequently) to predict the final clean latents $Z'^{(1)}$:

$$Z'^{(1)} = \mathcal{F}(Z^{(t)}, t, c). \tag{4.11}$$

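The denoiser is trained by minimizing an MSE loss between the resulting predicted conditional
vector field and the groundtruth conditional vector field:

$$\begin{aligned}
\mathcal{L}_{\mathrm{fm}} &= \frac{1}{N} \sum_{i=1}^{N} \left\lVert \frac{z_i^{(1)} - z_i^{(t)}}{1 - t} - \frac{z_i'^{(1)} - z_i^{(t)}}{1 - t} \right\rVert^2 \\
&= \frac{1}{(1 - t)^2} \, \frac{1}{N} \sum_{i=1}^{N} \left\lVert z_i^{(1)} - z_i'^{(1)} \right\rVert^2 .
\end{aligned} \tag{4.12}$$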

In practice, we set a minimum value for time step tmin = 0.01 and maximum value tmax = 0.9 to
prevent numerical instability, following Yim et al. [2023a].
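
The equivalence of the two forms of Equation 4.12 can be verified with a short NumPy sketch. A perturbed copy of the clean latents stands in for the denoiser output, and clipping the sampled time step to [t_min, t_max] is one possible reading of the sentence above, so it should be treated as an assumption. Function names are hypothetical.

```python
import numpy as np

def fm_loss_vector_field(Z1, Z1_pred, Zt, t):
    """First line of Eq. 4.12: MSE between target and predicted vector fields."""
    u_true = (Z1 - Zt) / (1.0 - t)
    u_pred = (Z1_pred - Zt) / (1.0 - t)
    return np.mean(np.sum((u_true - u_pred) ** 2, axis=-1))

def fm_loss_clean_latents(Z1, Z1_pred, t):
    """Second line of Eq. 4.12: reweighted MSE on the clean-latent prediction."""
    return np.mean(np.sum((Z1 - Z1_pred) ** 2, axis=-1)) / (1.0 - t) ** 2

rng = np.random.default_rng(0)
Z1 = rng.normal(size=(6, 8))
Z0 = rng.normal(size=(6, 8))
Z1_pred = Z1 + 0.1 * rng.normal(size=Z1.shape)   # stand-in for the denoiser output
t = float(np.clip(rng.uniform(), 0.01, 0.9))     # keep t within [t_min, t_max] (assumed clipping)
Zt = (1.0 - t) * Z0 + t * Z1

loss_a = fm_loss_vector_field(Z1, Z1_pred, Zt, t)
loss_b = fm_loss_clean_latents(Z1, Z1_pred, t)
```

The $1/(1-t)^2$ factor grows without bound as $t \to 1$, which is consistent with the stated need to cap the time step for numerical stability.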

Algorithm 3: Pseudocode for DiT sampling

Input: Class label c, num. integration steps T, cfg. scale γ
Output: Generated sample (A, X, F, L)

# Sample initial noisy latents Z^(0) at t = 0
1. Z^(0) = {z_i^(0) ∼ N(0, 1)^d}
2. ∆t = 1/T                                        # Step size

# Denoising loop
3. for t in linspace(t_min, t_max, T):
4.     Z'_cond = F(Z^(t), t, c)                    # Conditional prediction
5.     Z'_uncond = F(Z^(t), t, ϕ)                  # Unconditional prediction
       # Conditioning via classifier-free guidance
6.     Z' = (1 − γ) · Z'_uncond + γ · Z'_cond
       # Euler integration step
7.     Z^(t+∆t) = Z^(t) + ∆t · (Z' − Z^(t)) / (1 − t)

# Decode latents to 3D molecular system (Algorithm 2)
8. A, X, F, L = D(Z^(1))

Denoiser architecture As the denoiser network F , we use a class-conditional Diffusion
Transformer (DiT) [Peebles and Xie, 2023]. The DiT largely follows a standard Transformer
architecture with the conditioning information incorporated via adaptive layer norm with zero-
initialization, which replaces all layer norm operations. For class conditioning, we use a binary
embedding to denote whether the system being generated is periodic (crystal) or non-periodic
(molecule). This conditioning allows the model to learn domain-specific features while sharing
most parameters. During training, we apply class label dropout with 10% probability to enable
classifier-free guidance during inference. We also incorporate self-conditioning [Yim et al.,
2023b] where the denoiser’s prediction from the previous timestep is concatenated to the current
input with 50% dropout probability during training. While we currently only condition on the
periodic/non-periodic class label, the DiT architecture can incorporate additional conditioning
signals like target properties or geometric constraints to enable controlled generation. This
represents a promising direction for future work in inverse design applications.
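
Class label dropout for classifier-free guidance can be sketched as below. The integer encoding of the labels, the index used for the null label and the function name are all hypothetical; only the 10% dropout probability comes from the text.

```python
import numpy as np

NULL_LABEL = 2   # hypothetical index for the null class; 0 = non-periodic, 1 = periodic

def drop_class_labels(labels, rng, p_drop=0.1):
    """Replace each class label by the null label with probability p_drop."""
    drop = rng.uniform(size=labels.shape) < p_drop
    return np.where(drop, NULL_LABEL, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)     # toy batch of periodic/non-periodic labels
dropped = drop_class_labels(labels, rng)
frac_null = float(np.mean(dropped == NULL_LABEL))
```

Training on a mixture of labelled and null-labelled samples is what allows the same network to produce both the conditional and unconditional predictions needed at sampling time.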


Data augmentation The DiT denoiser is trained with data augmentation to learn roto-translational
and periodic symmetries in the VAE’s latent space. During training, the coordinates of each input
system are randomly rotated and translated, and then converted to latents via the frozen VAE
encoder E before being input to the DiT.

Sampling with classifier-free guidance To generate new molecular systems from the trained
diffusion model, we use classifier-free guidance [Ho and Salimans, 2022] to steer the sampling
process. At each denoising step, we compute both a conditional prediction based on the
periodic/non-periodic class label c and an unconditional prediction with null class label ϕ. The
final prediction is a weighted combination of these using guidance scale γ, allowing control
over how strongly the generation follows the class conditioning. The full sampling procedure is
outlined in Algorithm 3. Starting from Gaussian noise Z(0), we iteratively denoise using the DiT
model F for T steps. At each step, we perform Euler integration of the vector field to gradually
transform the noisy latents towards the target distribution. While we currently use simple Euler
integration for efficiency, adaptive ODE solvers could potentially improve performance [Ma
et al., 2024]. Finally, we decode the denoised latents Z(1) to a valid 3D molecular system using
the VAE decoder D.
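
The sampling procedure can be sketched in NumPy as follows. This is a simplified illustration rather than the released implementation: it uses a uniform time grid on [0, 1) instead of the linspace(t_min, t_max, T) grid of Algorithm 3, represents the null label by `None`, and replaces the trained DiT with a toy denoiser that always predicts a fixed target per class.

```python
import numpy as np

def sample_latents(denoiser, class_label, shape, num_steps=100, cfg_scale=2.0, seed=0):
    """Euler integration with classifier-free guidance on a uniform time grid (assumed)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=shape)                      # Z^(0) ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                                  # t in [0, 1), so 1 - t >= dt > 0
        Z_cond = denoiser(Z, t, class_label)        # conditional clean-latent prediction
        Z_uncond = denoiser(Z, t, None)             # unconditional prediction (null label)
        Z_pred = (1.0 - cfg_scale) * Z_uncond + cfg_scale * Z_cond
        Z = Z + dt * (Z_pred - Z) / (1.0 - t)       # Euler step along the predicted field
    return Z

# Toy 'denoiser' with a fixed target per class; it is not a trained model.
targets = {"periodic": np.full((4, 8), 1.0), "non_periodic": np.full((4, 8), -1.0)}

def toy_denoiser(Z, t, c):
    if c is None:
        return np.zeros_like(Z)
    return targets[c]

Z_final = sample_latents(toy_denoiser, "periodic", shape=(4, 8), cfg_scale=1.0)
```

With a guidance scale of 1 the combined prediction reduces to the conditional one, and for this toy denoiser the final Euler step lands on the class target. Larger scales extrapolate away from the unconditional prediction.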

4.2 Experimental Setup

Datasets For our main experiments, we train models on periodic crystals from MP20 and
non-periodic molecules from QM9, representing two distinct domains of molecular systems.
MP20 [Xie et al., 2022] contains 45,231 metastable crystal structures from the Materials Project
[Jain et al., 2013], each with up to 20 atoms in its unit cell and spanning 89 different element
types. QM9 [Wu et al., 2018] consists of 130,000 stable small organic molecules containing up
to nine heavy atoms (C, N, O, F) along with hydrogens. We split the data following prior work
[Xie et al., 2022, Hoogeboom et al., 2022] to ensure fair comparisons. We also include results
on the GEOM-DRUGS dataset of 430,000 large organic molecules with up to 180 atoms [Axelrod
and Gomez-Bombarelli, 2022].

Training and hyperparameters We sequentially train the first-stage VAE and then the second-stage
DiT using the AdamW optimizer with a constant learning rate of 1e-4, no weight decay, and
a batch size of 256. We use an exponential moving average (EMA) of DiT weights over training with
a decay of 0.9999. Both models are trained to convergence for at most 5000 epochs (up to 3 days)
on 8 V100 GPUs.

For the first-stage VAE, we use a standard Transformer as both encoder E and decoder D
with hidden dimension d_model = 512, 8 attention heads, and 8 layers (51M parameters). The
latent dimension is set to d = 8 with KL regularization weight λ_KL = 1e-5 and 10% denoising
perturbation during training. For the second-stage DiT denoiser, we report results primarily
using the DiT-B configuration: hidden dimension d_model = 768, 12 attention heads, 12 layers, and
130M parameters. We also evaluate smaller DiT-S (32M) and larger DiT-L (450M) variants.

Two key inference-time hyperparameters are the number of ODE integration steps T and
the classifier-free guidance scale γ. We find T = 500 or 1000 with γ = 1.0 or 2.0 consistently
works well for both molecules and crystals. Additional ablation studies comparing joint vs.
dataset-specific training, architecture variants, regularization techniques, and inference settings
are presented in Appendix B.3.

Evaluation metrics We evaluate the ability of ADiTs to sample valid and realistic molecules
and crystals. Following prior work [Xie et al., 2022, Hoogeboom et al., 2022], we sample 10,000
crystals and molecules each and compute validity, stability, uniqueness and novelty rates using
density functional theory (DFT) for crystals as well as validity, uniqueness and PoseBusters
sanity checks [Buttenschoen et al., 2024] for molecules. Detailed descriptions of all evaluation
metrics are provided in Appendix B.1.

Baselines We compare ADiT trained jointly on both QM9 and MP20 to molecule-only and
crystal-only ADiT variants, as well as state-of-the-art baselines for both datasets. For crystal
generation on MP20, we compare to: (1) four equivariant diffusion and flow matching-based
models operating on multi-modal product manifolds: CDVAE [Xie et al., 2022], DiffCSP [Jiao
et al., 2023], FlowMM [Miller et al., 2024], and a variant of MatterGen [Zeni et al., 2025] trained
on MP20 only; (2) UniMat [Yang et al., 2024], a non-equivariant diffusion model which learns
symmetries from data; (3) FlowLLM [Sriram et al., 2024], a two-stage framework which first
finetunes the autoregressive Llama 2 language model on crystal structures [Touvron et al., 2023,
Gruver et al., 2024], and then trains FlowMM with samples from the language model as the base
distribution and MP20 as the target distribution.

For molecule generation on QM9, we compare to: (1) Equivariant Diffusion [Hoogeboom
et al., 2022], a roto-translationally equivariant diffusion model operating on a multi-modal
product manifold; (2) GeoLDM [Xu et al., 2023], an alternative latent diffusion model using
Equivariant Diffusion in the latent space of a roto-translationally equivariant autoencoder; (3)
Symphony [Daigavane et al., 2024], an equivariant and autoregressive generative model that
iteratively builds a molecule atom-by-atom.

4.3 Results

State-of-the-art crystal and molecule generation Results for crystal generation in Table 4.1
show that ADiTs generate high-quality crystals compared to baseline diffusion models, achieving
improved performance across validity, stability, uniqueness, and novelty metrics for 10,000
sampled crystals, with significant gains for compositional validity due to a single diffusion
process in the VAE latent space rather than joint continuous and categorical diffusion.

Table 4.1: Crystal generation results on MP20. We report validity, stability, uniqueness, and
novelty rates for 10,000 sampled crystals. ADiT shows improved performance over diffusion
baselines across all metrics. We see significant gains for compositional validity due to a single
diffusion process in the latent space, as opposed to joint continuous and categorical diffusion
for baselines. Joint training with both molecular and crystal data improves crystal generation
performance over MP20-only models. (Stable: DFT E_hull < 0.0 eV/atom, metastable: DFT
E_hull < 0.1 eV/atom, ∗ denotes results from MatterGen-MP for 1024 sampled crystals, † denotes
results we replicated using the same DFT setup as ADiT.)

                         Validity rate (%) ↑                Metastable   Stable       M.S.U.N.     S.U.N.
Model                    Structure  Composition  Overall    rate (%) ↑   rate (%) ↑   rate (%) ↑   rate (%) ↑

MP20-only
  CDVAE                  100.00     86.70        -          -            1.6          -            -
  DiffCSP                100.00     83.25        -          -            5.0          -            3.3
  UniMat                 97.2       89.4         -          -            -            -            -
  FlowMM                 96.85      83.19        80.30      30.6†        4.6†         22.5†        2.8†
  FlowLLM                99.94      90.84        90.81      66.9†        13.9†        26.3†        4.7†
  MatterGen-MP           -          -            -          78∗          13∗          21∗          -
  MP20-only ADiT         99.58      90.46        90.13      81.6         14.1         25.91        4.7

Jointly trained ADiT     99.74      92.14        91.92      81.0         15.4         28.2         5.3

Table 4.2: Molecule generation results on QM9. We report (a) validity and uniqueness rates,
as well as (b) % pass rates on 7 sanity checks from PoseBusters for 10,000 sampled molecules.
ADiTs match or improve performance w.r.t. baselines, and sample physically realistic structures.
Joint training with both molecular and crystal data improves molecular generation performance
over QM9-only models. (∗ denotes models which explicitly generate hydrogen atoms.)

(a) Validity results

Model                      Validity (%) ↑   Unique (%) ↑

QM9-only
  Equivariant Diffusion    97.50            96.71
  Equivariant Diffusion∗   91.90            98.69
  GeoLDM∗                  93.80            98.82
  Symphony∗                83.50            97.98
  QM9-only ADiT            96.02            97.76
  QM9-only ADiT∗           92.19            97.90

Jointly trained ADiT       97.43            96.92
Jointly trained ADiT∗      94.45            97.82

(b) PoseBusters results

Test (% pass) ↑      Symphony   Eq. Diff.   ADiT
Atoms connected      99.92      99.88       99.70
Bond angles          99.56      99.98       99.85
Bond lengths         98.72      100.00      99.41
Ring flat            100.00     100.00      100.00
Double bond flat     99.07      98.58       99.98
Internal energy      95.65      94.88       95.86
No steric clash      98.16      99.79       99.79

For molecule generation, ADiTs achieve state-of-the-art performance on validity and uniqueness
metrics across 10,000 sampled molecules, as shown in Table 4.2(a). PoseBusters
sanity check metrics in Table 4.2(b) further confirm that ADiTs generate physically realistic
molecular structures, matching or exceeding baseline models across measures like double bond
flatness, reasonable internal energy and lack of steric clashes.

Joint training improves performance. Table 4.1 and Table 4.2 also show that jointly trained
ADiTs (trained on both QM9 and MP20 together) exceed the performance of the MP20-only or
QM9-only ADiTs for materials or molecules, respectively. Joint training improves validity and
stability rates for both crystals and molecules, demonstrating effective transfer learning between
periodic and non-periodic domains. These results validate that ADiTs can effectively model
diverse types of molecular systems within a single architecture.

[Figure 4.2 plots. Left column: training loss, crystal validity rate (%) and molecule validity
rate (%) vs. epoch (0 to 2,000) for ADiT-S (32M), ADiT-B (130M) and ADiT-L (450M). Right
column: the same quantities at epoch 2,000 vs. log(number of parameters in M), with Pearson /
Spearman correlations of -1.00 / -1.00 (training loss), 0.91 / 1.00 (crystal validity rate) and
0.94 / 1.00 (molecule validity rate).]

Figure 4.2: Scaling up ADiT improves performance. We show the effect of increasing the
number of ADiT denoiser parameters on the training loss and generation validity rates. Left:
training loss and validity rates vs. epochs. Right: Correlation plots for training loss and validity
rates at epoch 2,000 vs. ADiT parameters (in Millions).

Scaling up ADiT denoiser improves performance. In Figure 4.2, we see that generative
modelling performance predictably improves as we scale the DiT denoiser from parameter counts
of 32M (DiT-S) to 130M (DiT-B) all the way to 450M (DiT-L), even with our current modest
dataset size of ∼130K total samples. The diffusion training loss and validity rates consistently
improve with larger model sizes, showing a clear benefit from scale. Strong correlations between
model size and performance metrics suggest further gains are possible from scaling both model
size and data: Alexandria (2M inorganic crystals), ZINC (250M molecules), and the Protein
Data Bank (200K biomolecular complexes) present promising opportunities for dataset scaling.

[Figure 4.3 plots: time to sample 10K crystals (mins) and time to sample 10K molecules (mins)
vs. number of integration steps (10 to 1,000), each with an inset for 10 to 100 steps.
(a) Crystals (MP20): ADiT-S (32M), ADiT-B (130M), ADiT-L (450M), FlowMM (12M).
(b) Molecules (QM9): ADiT-S (32M), ADiT-B (130M), ADiT-L (450M), GeoLDM (5M).]

Figure 4.3: ADiTs are significantly faster than equivariant diffusion models. We plot the
number of integration steps for ADiTs and equivariant diffusion models vs. time to generate
10,000 samples on a single V100 GPU. ADiTs scale significantly better with the number of
integration steps compared to equivariant diffusion.

Speedup compared to equivariant diffusion. ADiTs achieve significant inference speedups
compared to equivariant diffusion under the same hardware conditions, as shown in Figure 4.3.
When generating 10,000 samples on a V100 GPU, ADiTs based on standard Transformers scale
better with the number of integration steps than FlowMM [Miller et al., 2024] for crystals
and GeoLDM [Xu et al., 2023] for molecules, both of which use computationally intensive
equivariant networks as denoisers. It is significantly more practical to scale up Transformers
than equivariant networks, as seen by the faster inference speed of ADiT-B compared to 100×
smaller equivariant baselines.

PCA visualization of shared latent space. In Figure 4.4, we plot the first two principal
components of 100 random samples each from the MP20 and QM9 validation sets, as well as
100 generated crystals and 100 generated molecules sampled from ADiT. We observe that the
joint latent space shows distinct clusters between molecules and crystals, with tighter clustering
for molecules and more spread for crystals, reflecting the greater diversity of elements and local
geometric environments in periodic crystal structures.
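
As a rough illustration of the projection behind Figure 4.4, the sketch below computes the first two principal components of a set of per-atom latent vectors with NumPy. The latent arrays are random stand-ins (not ADiT latents), and the latent dimension of 8, the spreads of the two clouds and the function name `pca_2d` are assumptions made purely for the example.

```python
import numpy as np


def pca_2d(latents: np.ndarray) -> np.ndarray:
    """Project (num_atoms, latent_dim) vectors onto their first two principal components."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
# Hypothetical stand-ins for per-atom VAE latents of molecules and crystals.
molecule_latents = rng.normal(0.0, 0.5, size=(300, 8))
crystal_latents = rng.normal(0.0, 2.0, size=(300, 8))

coords = pca_2d(np.concatenate([molecule_latents, crystal_latents], axis=0))
is_crystal = np.arange(coords.shape[0]) >= molecule_latents.shape[0]  # colour label per point
```

Each row of `coords` can then be scattered and coloured by `is_crystal` (or by atom type, as in Figure 4.5).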

Next, we plot the same PCA but only keeping atoms of carbon, nitrogen, oxygen, and fluorine
in Figure 4.5. These atoms appear in both QM9 molecules and MP20 crystals, allowing us
to analyze how their representations compare across periodic and non-periodic systems. The
visualization reveals clear patterns: principal component 1 primarily distinguishes between
molecules (clustered between -2 and 2) and crystals, while principal component 2 correlates
with atom type. Most notably, oxygen atoms show similar latent representations whether they
appear in molecules or crystals, suggesting ADiT’s latent space captures fundamental chemical
properties that transfer across both domains. This shared representation of oxygen, a key element
in both datasets, may help explain ADiT’s successful joint learning and transfer between periodic
and non-periodic systems.

Table 4.3: Molecule generation results on GEOM-DRUGS. Left: Validity, uniqueness and
% pass rates on PoseBusters for 10,000 sampled molecules (∗ PoseBusters results taken from
Buttenschoen et al. [2025]). ADiT with minimal molecular inductive biases matches or exceeds
state-of-the-art equivariant diffusion baselines, which explicitly predict atomic bonds. Right:
We plot the number of integration steps for ADiTs and SemlaFlow vs. time to generate 10,000
molecules on a single A100 GPU. ADiTs scale favorably with the number of integration steps
compared to SemlaFlow, a highly optimized equivariant diffusion model.

Metric (% pass) ↑    EQGAT-diff∗  SemlaFlow∗  ADiT

Validity             94.6         93.9        95.3
Uniqueness           100.0        100.0       100.0
Atoms connected      84.4         92.3        93.0
Bond angles          86.9         94.8        92.3
Bond lengths         87.0         94.6        92.5
Ring flat            87.0         94.9        95.4
Double bond flat     87.0         94.2        95.3
Internal energy      86.8         94.8        91.3
No steric clash      82.9         92.0        91.8
PoseBusters valid    59.7         87.5        85.3

[Table 4.3, right plot: time to sample 10K molecules (mins) vs. number of integration steps
(10 to 1,000), with an inset for 10 to 100 steps, for ADiT-S (32M), ADiT-B (150M),
ADiT-L (450M) and SemlaFlow (46M).]

Extension to larger GEOM-DRUGS molecules. To demonstrate the scalability of the ADiT
architecture to larger systems, we experiment with the GEOM-DRUGS dataset of 430,000
molecules of up to 180 atoms. Our setup follows Vignac et al. [2023] and we compare to
state-of-the-art equivariant diffusion [Le et al., 2024] and flow matching [Irwin et al., 2025]
baselines. In Table 4.3, we see that ADiT is on par with or better than equivariant models across
validity and PoseBusters metrics [Buttenschoen et al., 2025]. This is a notable result because
ADiT is based on the standard Transformer architecture with minimal molecular inductive biases
and, unlike equivariant baselines, does not explicitly predict atomic bonds.

Additional results and ablations. For additional results, including the distributions of DFT
formation energy, composition, and spacegroups for crystals generated by ADiT compared
to baselines, see Appendix B.2. An ablation study of the ADiT architecture is available in
Appendix B.3.

4.4 Related Work

Generative models for molecules and materials. Diffusion models have emerged as the
state-of-the-art for generative modelling of molecular systems, with applications to small molecules,
crystals, and biomolecules. For small molecules, Equivariant Diffusion [Hoogeboom et al.,
2022] pioneered roto-translationally equivariant diffusion on the multi-modal product manifold
of atom types and 3D positions, while GeoLDM [Xu et al., 2023] introduced latent diffusion in
the space of an equivariant autoencoder. Schneuing et al. [2024] extended equivariant diffusion
to generate molecules conditioned on binding protein partners for structure-based drug design,
while Corso et al. [2023] explored similar architectures for protein-small molecule docking. For
crystal generation, state-of-the-art approaches use equivariant diffusion on product manifolds of
atom types, 3D/fractional coordinates, and lattice parameters. Notable examples include CDVAE
[Xie et al., 2022], DiffCSP [Jiao et al., 2023], and FlowMM [Miller et al., 2024]. MatterGen
[Zeni et al., 2025] demonstrated conditional diffusion for inverse design based on target material
properties and symmetry space groups. Language models have also been used for generating
molecules and crystals as textual representations [Flam-Shepherd and Aspuru-Guzik, 2023,
Gruver et al., 2024].

[Figure 4.4 scatter plot: principal component 1 vs. principal component 2 (both axes from
-6 to 6). Legend: System (Crystal, Molecule); Source (Dataset, Generated).]

Figure 4.4: PCA plot of latent embeddings from ADiT’s VAE for 100 data points from the
MP20 and QM9 datasets, as well as 100 ADiT-generated crystals/molecules each. Each point
represents an atom, coloured by the system type and sized by whether it comes from real data or
generated latents. Latent embeddings from molecules form tighter clusters, while crystals
show greater spread due to their higher elemental and structural diversity.

[Figure 4.5 scatter plot: principal component 1 vs. principal component 2 (both axes from
-6 to 6). Legend: Atom type (C, N, O, F); Source (Dataset, Generated); System (Crystal,
Molecule).]

Figure 4.5: PCA plot of latent embeddings for carbon, nitrogen, oxygen, and fluorine atoms
from ADiT’s VAE for 100 data points from the MP20 and QM9 datasets, as well as 100
ADiT-generated crystals/molecules each. Each point represents an atom, coloured by atom type and
sized by whether it comes from real data or generated latents. Principal component 1 visually
correlates with whether a system is a molecule (within the range -2 to 2) or a crystal. Principal
component 2 visually correlates with the atom type. We see distinct clusters for different
atom types, especially oxygen atoms, suggesting that ADiT’s latent space captures shared
chemical properties across periodic and non-periodic systems.

Our work stands out as the first to develop unified generative models capable of sampling
both periodic crystals and non-periodic molecular systems jointly. The closest work to ADiT in
terms of diffusion formulation is AlphaFold3 [Abramson et al., 2024], which applies standard
Transformers and Gaussian diffusion to generate all-atom biomolecular complexes. However, their
formulation is specific to structure prediction for biomolecules and only diffuses 3D atomic
coordinates in Cartesian space. In contrast, our latent diffusion formulation is sufficiently general
to work with both periodic and non-periodic systems, generating atom types, coordinates, as
well as unit cell parameters unconditionally or with classifier-free guidance. Our emphasis on
joint representations of molecules and crystals also aligns with recent work on general-purpose
models for molecular simulation and property prediction [Shoghi et al., 2024, Batatia et al., 2023,
Wood et al., 2025]. Similarly, our unified latent diffusion framework can be scaled up with larger
and more diverse chemical datasets towards foundation models for generative chemistry.

Latent diffusion models. Latent diffusion models [Vahdat et al., 2021, Rombach et al., 2022]
propose to do diffusion in the latent space of an autoencoder instead of the raw input space
of high-dimensional continuous signals such as pixels, and have been extremely successful
for generating images, audio, and videos [Esser et al., 2024, Betker et al., 2023, Brooks et al.,
2024]. Latent diffusion is a more computationally efficient alternative to standard diffusion as
the autoencoder’s latent space captures semantically meaningful features of the data, allowing
for more efficient diffusion in a lower-dimensional space followed by reconstruction to the
original data space. The original formulation was further improved by Diffusion Transformers
(DiTs) [Peebles and Xie, 2023], which demonstrated that standard Transformers provide a highly
scalable architecture for the denoiser network. Latent diffusion models can easily incorporate
conditioning on additional information like class labels, text prompts, or infilling masks through
classifier-based [Dhariwal and Nichol, 2021] and classifier-free guidance [Ho and Salimans,
2022] as well as finetuning [Zhang et al., 2023a, Dai et al., 2023].
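
Classifier-free guidance is commonly implemented by extrapolating from an unconditional denoiser prediction towards a conditional one. The sketch below shows one widely used parameterisation of that combination step. The function name `cfg_combine`, the `guidance_scale` argument and the toy arrays are illustrative assumptions, not taken from any specific codebase.

```python
import numpy as np


def cfg_combine(pred_cond: np.ndarray, pred_uncond: np.ndarray, guidance_scale: float) -> np.ndarray:
    """Blend conditional and unconditional denoiser outputs.

    guidance_scale = 0 recovers the unconditional prediction, 1 the conditional
    one, and values above 1 extrapolate further towards the condition.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)


# Toy predictions for 4 latent tokens with 8 channels each (hypothetical shapes).
rng = np.random.default_rng(0)
pred_cond = rng.normal(size=(4, 8))
pred_uncond = rng.normal(size=(4, 8))
guided = cfg_combine(pred_cond, pred_uncond, guidance_scale=2.0)
```

In practice the two predictions come from the same denoiser, evaluated once with the conditioning input and once with it dropped.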

Our work is the first to leverage latent diffusion for jointly generating the complex multi-
modal product of categorical and continuous data types that constitute 3D molecular systems.
This allows us to shift the complexity of handling atom types, coordinates, and unit cell param-
eters into an autoencoder while performing the generative process in latent space with DiTs,
which is simpler and more scalable than alternative multi-modal equivariant diffusion models.

Equivariance and generative modelling. Geometric Graph Neural Networks [Duval et al.,
2023a], particularly roto-translationally equivariant networks, have been used as denoisers in
diffusion and flow matching approaches for generative modelling of molecular structures. E(3)-
Equivariant Graph ConvNets [Satorras et al., 2021] are widely used as denoisers for molecule
[Hoogeboom et al., 2022, Xu et al., 2023, Schneuing et al., 2024] and crystal generation [Jiao
et al., 2023, Miller et al., 2024]. More expressive architectures, like higher-order tensor networks
[Liao et al., 2024b] and Invariant Point Attention [Jumper et al., 2021], have been applied to
protein structure generation [Watson et al., 2023, Yim et al., 2023b] and protein-ligand docking
[Corso et al., 2023].

However, equivariant networks are computationally expensive and harder to scale than
standard Transformers in terms of data and model size. This is especially relevant for diffusion
models, where denoisers are iteratively run hundreds of times during inference [Song et al., 2021],
and typically process inputs as fully connected graphs to capture global structure [Joshi, 2025].
Recent work has challenged the necessity of 3D inductive biases and equivariance for generative
structure prediction tasks, showing that standard Transformers can achieve strong performance
on biomolecular complexes [Abramson et al., 2024] and small molecule conformations [Wang
et al., 2024, O Pinheiro et al., 2023]. Non-equivariant models have also shown promising results
for protein structure generation [Chu et al., 2024, Martinkus et al., 2024, Lu et al., 2025]. In
the same vein, our work leverages the simplicity and scalability of standard Transformers for
generative modelling across both periodic and non-periodic domains, demonstrating that explicit
equivariance and molecular inductive biases are not a strict requirement for generating valid and
realistic 3D structures at scale.

4.5 Summary

In this chapter, we posed the following question: How can we build unified diffusion models that
can generate both periodic materials and non-periodic molecular systems? Our solution, the
All-atom Diffusion Transformer (ADiT), is a latent diffusion model based on two key ideas:

1. All-atom unified latent representations: We treat both periodic and non-periodic molecular
systems as sets of atoms in 3D space and develop a unified representation with categorical
and continuous attributes per atom. A Variational Autoencoder (VAE) [Kingma and
Welling, 2014] embeds molecules and crystals into a shared latent space by training for
all-atom reconstruction.

2. Latent diffusion using Transformers: We perform generative modelling in the latent space
of the VAE encoder using a Diffusion Transformer (DiT) [Rombach et al., 2022, Peebles
and Xie, 2023]. During inference, classifier-free guidance [Ho and Salimans, 2022] enables
sampling new latents that can be reconstructed to valid molecules or crystals using the
VAE decoder.

ADiTs can be trained jointly on both periodic and non-periodic 3D molecular structures,
demonstrating broad generalizability. Training a single unified model on the QM9 molecular
and MP20 materials datasets leads to state-of-the-art performance in both domains, exceeding
specialized equivariant diffusion models on physics-based validations. DFT calculations reveal
that ADiTs generate stable, unique, and novel crystals at a 5-6% S.U.N. rate, a 25% improvement
upon the 4-5% rates of previous methods. Joint training yields higher validity rates than QM9-
only or MP20-only ADiT variants, demonstrating successful transfer learning between periodic
and non-periodic domains. ADiTs also match or exceed state-of-the-art equivariant models on
the GEOM-DRUGS dataset of molecules with hundreds of atoms.

ADiTs are a highly scalable architecture, achieving significant speedups in both training
and inference compared to equivariant diffusion models. By using standard Transformers with
minimal inductive biases for both the autoencoder and diffusion model, ADiTs can generate
10,000 samples in under 20 minutes on a single V100 GPU – an order of magnitude faster than
baselines which take up to 2.5 hours on the same hardware. The practical efficiency of the DiT
denoiser compared to equivariant networks allows us to scale ADiT to half a billion parameters
while keeping data scale fixed. Our scaling analysis demonstrates that generative modelling
performance improves predictably with model size, suggesting further gains are possible through
continued scaling.

Altogether, our work is the first to develop unified generative models for both periodic
and non-periodic molecular systems, with state-of-the-art performance on both molecules and
crystals, while being conceptually simpler and computationally more efficient than previous
domain-specific approaches. ADiTs represent a step towards broadly generalizable foundation
models for generative chemistry.

Future work. Several limitations point to promising future directions. First, we currently use
relatively small datasets and systems for training, which may limit model generalization. Scaling
to larger and more diverse datasets such as Alexandria and the Cambridge Structural Database
for crystals, ZINC for small molecules, and the Protein Data Bank for biomolecular complexes
could significantly improve performance and enable learning of broadly applicable chemical
principles. While we demonstrate success on small molecules and crystals of up to hundreds
of atoms, we have not yet fully validated our approach on larger systems such as metal-organic
frameworks or biomolecules containing thousands of atoms. Adapting ADiT to larger scales,
while maintaining its unified representation across periodic and non-periodic systems, could
enable powerful transfer learning capabilities – especially valuable for low-data domains.

Relatedly, our current models only perform unconditional generation – extending to guided
sampling or conditional generation based on experimental properties [Zeni et al., 2025], motif
scaffolding [Watson et al., 2023], or molecular infilling [Schneuing et al., 2024] would enable
practical inverse design applications in drug discovery, materials science, and beyond.

In terms of architecture, it is often said that curating the latent space is the most important
factor for good generative performance [Dieleman, 2025]. The current first stage autoencoder
used in ADiT employs a straightforward approach: it uses a simple reconstruction objective
based on regression-style losses for atom types, coordinates, and lattice parameters. Unlike
autoencoders commonly used for images, audio and videos, it does not incorporate perceptual or
adversarial losses [Esser et al., 2021, Rombach et al., 2022], which are considered crucial for
capturing high-frequency details and ensuring realism in reconstructions. Their absence in ADiT
suggests the latent space might not encode extremely fine-grained structural information, such as
subtle conformational variations or precise atomic positioning.

Additionally, the current autoencoder learns an all-atom latent representation where each
atom is assigned an independent latent vector, without explicit spatial compression or hierarchical
grouping of semantically related atoms. Relatedly, when generating new structures with the
diffusion model, we need to provide the total number of atoms in advance, which may limit
usability when atom counts are unknown. Future work could explore latent space designs that
perform spatial downsampling during encoding, followed by upsampling during decoding and
generation [Jaegle et al., 2021]. This would decouple the number of latents from the number of
atoms, allowing more effective allocation of latent capacity when scaling to larger systems, as
well as enabling the generation of structures with variable atom counts.


Part II

RNA Molecule Design

Chapter 5

gRNAde: Geometric Deep Learning for 3D
RNA inverse design

RNA holds a unique position in biology due to its ability to encode genetic information via
its sequence as well as catalyze reactions through complex 3D structural folding [Cech, 2024].
This dual functionality enables RNA to perform sophisticated computations within cells, from
regulating gene expression to driving essential metabolic processes. Recent years have seen a
surge of interest in RNA-based therapeutics, which target diseases at the genetic level and offer
an alternative to traditional small molecule or protein drugs that treat symptoms [Damase et al.,
2021]. Notable examples of RNAs at the forefront of biology include mRNA vaccines [Metkar
et al., 2024] and CRISPR-based genomic medicine [Doudna and Charpentier, 2014].

Despite their promise, the rational design of RNA molecules is a significant challenge as
the sequence-structure-function relationship is not as well established as it is for proteins. The
availability of extensive protein structure data in the Protein Data Bank (PDB) coupled with
advances in deep learning have revolutionized protein structure prediction [Jumper et al., 2021]
and design [Dauparas et al., 2022, Watson et al., 2023], achievements recognized by the Nobel
Prize in Chemistry 2024. Computational RNA design with deep learning, however, remains
comparatively underexplored, largely due to a scarcity of 3D structural data [Schneider et al.,
2023]. The PDB contains approximately 7,000 RNA 3D structures, versus over 200,000 for
proteins, with most RNA structures originating from a few well-studied families like tRNAs
or ribosomal RNAs. This data limitation has meant that most RNA design tools either focus
on secondary structure, neglecting 3D geometry [Churkin et al., 2018], or rely on non-learned
algorithms with hand-crafted heuristics for aligning 3D RNA fragments [Han et al., 2017,
Yesselman et al., 2019], which can be restrictive.

Beyond data scarcity, a further key technical challenge in RNA design is that RNAs are
generally more dynamic than proteins. A single RNA sequence can adopt multiple distinct
conformational states to perform and regulate complex biological functions [Ganser et al., 2019,
Hoetzel and Suess, 2022]. Current computational RNA design often frames the task as an inverse
problem: designing sequences for a single target secondary structure, thereby typically neglecting
3D geometry and conformational diversity. Yet, engineering novel biological functions effectively
necessitates considering both the 3D structure and the dynamic conformational landscape of
RNA [Vicens and Kieft, 2022, Ken et al., 2023].

[Figure 5.1 diagram: RNA conformational ensemble → extract backbones → set of backbone
geometric graphs → multi-state Graph Neural Network encoder → sequence decoder → RNA
sequence (GAGCGU...), labelled as fixed backbone re-design. Equivariant to: 3D
roto-translations, node order, conformation order.]

Figure 5.1: 3D RNA inverse design with gRNAde. gRNAde is a generative model for RNA
sequence design conditioned on backbone 3D structure(s). gRNAde processes one or more
RNA backbone graphs (a conformational ensemble) via a multi-state GNN encoder which is
equivariant to 3D roto-translation of coordinates as well as conformational state order, followed
by conformational state order-invariant pooling and autoregressive sequence decoding.

This chapter introduces gRNAde, a geometric RNA design model that addresses the chal-
lenge of designing RNA sequences that fold into target 3D structures while accounting for
conformational dynamics (Figure 5.1). gRNAde is a structure-conditioned RNA language model
that leverages a novel multi-state GNN to generate sequences conditioned on one or more 3D
backbone structures. Our computational evaluations demonstrate that gRNAde significantly
outperforms existing physics-based RNA design methods while being orders of magnitude faster.
Furthermore, gRNAde introduces novel capabilities for RNA design, including zero-shot ranking
of functional mutants and multi-state design for structurally flexible RNAs. Open source code is
available: github.com/chaitjo/geometric-rna-design.

5.1 The gRNAde Model

Figure 5.1 illustrates the RNA inverse design problem: the task of designing new RNA sequences
conditioned on one or more 3D backbone structures. Given the 3D coordinates of a backbone
structure, gRNAde generates sequences that are likely to fold into those target shapes. The
underlying assumption behind inverse design is that structure determines function: by designing
sequences that fold into specific structures, we can create molecules with desired biological
activities [Huang et al., 2016].


[Figure 5.2 diagram. Left: RNA backbone atoms (P, O5', C5', C4', C3', C2', C1', O4', O3') with
the ribose sugar and base, and the 3-bead representation (P, C4', N1/N9). Right: coarse-grained
features for each node (nucleotide) along the backbone chain (5' to 3') and its 3D
neighbourhood: 3x distances, 3x angles, 3x torsions.]

Figure 5.2: RNA backbone structures featurized as 3D graphs. Each RNA nucleotide is
a node in the graph, consisting of 3 coarse-grained beads for the coordinates for P, C4’, N1
(pyrimidines) or N9 (purines) which are used to compute initial geometric features and edges to
nearest neighbours in 3D space. Backbone chain adapted from Ingraham et al. [2019b].

gRNAde employs a structure-conditioned, autoregressive language model architecture with
geometric GNN encoder and decoder layers [Jing et al., 2020, Dauparas et al., 2022]. The
key innovation is a multi-state GNN encoder that processes conformational ensembles of 3D
backbones, followed by permutation-invariant pooling across conformational states and autore-
gressive sequence decoding. This multi-state design capability distinguishes gRNAde from
existing single-structure inverse folding approaches, enabling the design of sequences that
are compatible with multiple conformational states simultaneously. The architecture naturally
handles both single-state and multi-state design within the same framework.

5.1.1 RNA Conformational Ensembles as Geometric Multi-graphs

Featurization. The input to gRNAde is an RNA to be re-designed. For instance, this could be a
set of PDB files with 3D backbone structures for the given RNA (a conformational ensemble)
and the corresponding sequence of n nucleotides. As shown in Figure 5.2, gRNAde builds a
geometric graph representation for each input structure:

1. We start with a 3-bead coarse-grained representation of the RNA backbone, retaining the
coordinates for P, C4’, N1 (pyrimidine) or N9 (purine) for each nucleotide [Dawson et al.,
2016]. These ‘pseudotorsional’ features describe RNA backbones completely in most
cases while reducing the size of the torsional space from 7 angles down to 3 to prevent
overfitting [Wadley et al., 2007].

2. Each nucleotide $i$ is assigned a node in the geometric graph with the 3D coordinate
$\vec{x}_i \in \mathbb{R}^3$ corresponding to the centroid of the 3 bead atoms. Random Gaussian noise
with standard deviation 0.1Å is added to coordinates during training to prevent overfitting
on crystallisation artifacts, following Dauparas et al. [2022]. Nodes are initialized with
geometric features analogous to the featurization used in protein design [Ingraham et al.,
2019b, Jing et al., 2020]: (a) forward and reverse unit vectors along the backbone from the
5’ end to the 3’ end ($\vec{x}_{i+1} - \vec{x}_i$ and $\vec{x}_i - \vec{x}_{i-1}$); and (b) unit vectors, distances, angles,
and torsions from each C4’ to the corresponding P and N1/N9.

3. Each node is connected to edges of three types (Figure 5.3): (a) its 32 nearest neighbours
in 3D space based on Euclidean distance $\lVert \vec{x}_i - \vec{x}_j \rVert_2$; (b) its 32 nearest neighbours along
the RNA backbone based on sequence distance $|j - i|$; and (c) all nodes involved in base
pairs and pseudoknots from the secondary structure corresponding to the 3D backbone.
Edge features for an edge from node $j$ to $i$ are initialized as: (a) the unit vector from the
source to destination node, $\vec{x}_j - \vec{x}_i$; (b) the distance in 3D space, $\lVert \vec{x}_j - \vec{x}_i \rVert_2$, encoded
by 32 radial basis functions; (c) the distance along the backbone, $j - i$, encoded by 32
sinusoidal positional encodings; and (d) the type of edge (1D, 2D, or 3D) encoded as a
one-hot vector.

[Figure 5.3 panels: 1D edges (k=32 nearest neighbours along sequence); 2D edges (base pairing
and pseudoknots); 3D edges (k=32 nearest neighbours in 3D).]

Figure 5.3: Types of edges in gRNAde’s input graphs. gRNAde constructs geometric graphs
with three edge types: 1D edges connecting sequential nucleotides, 2D edges from secondary
structure annotations, and 3D edges between nucleotides that are close in 3D space.
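
The featurization steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions rather than the gRNAde implementation: it covers only node centroids, 1D and 3D nearest-neighbour selection, edge unit vectors and radial basis encodings, and omits 2D (base-pairing) edges, angles, torsions and sinusoidal encodings. The function names, the choice to apply noise to the bead coordinates, and the RBF cutoff `d_max` of 20Å are hypothetical.

```python
import numpy as np


def knn_indices(dist: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest other nodes for every row of a distance matrix."""
    d = dist.astype(float)  # astype returns a copy, so the input is left untouched
    np.fill_diagonal(d, np.inf)  # a node is never its own neighbour
    k = min(k, d.shape[0] - 1)
    return np.argsort(d, axis=1)[:, :k]


def rbf_encode(d: np.ndarray, num_rbf: int = 32, d_max: float = 20.0) -> np.ndarray:
    """Expand distances into Gaussian radial basis functions (d_max is an assumed cutoff)."""
    centers = np.linspace(0.0, d_max, num_rbf)
    width = d_max / num_rbf
    return np.exp(-(((d[..., None] - centers) / width) ** 2))


def featurize_backbone(beads: np.ndarray, k: int = 32, noise_std: float = 0.0, seed: int = 0) -> dict:
    """beads: (n, 3, 3) array of P, C4', N1/N9 coordinates for n nucleotides."""
    rng = np.random.default_rng(seed)
    beads = beads + noise_std * rng.normal(size=beads.shape)  # training-time noise (assumed on beads)
    x = beads.mean(axis=1)  # node coordinate = centroid of the 3 beads, shape (n, 3)
    idx = np.arange(x.shape[0])

    dist_3d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    dist_1d = np.abs(idx[:, None] - idx[None, :])
    nbrs_3d = knn_indices(dist_3d, k)  # neighbours in 3D space
    nbrs_1d = knn_indices(dist_1d, k)  # neighbours along the sequence

    # Features for the 3D edges j -> i: unit vector x_j - x_i and RBF-encoded length.
    dst = np.repeat(idx, nbrs_3d.shape[1])
    src = nbrs_3d.reshape(-1)
    vec = x[src] - x[dst]
    length = np.linalg.norm(vec, axis=-1)
    unit = vec / np.maximum(length, 1e-8)[:, None]
    return {"x": x, "nbrs_3d": nbrs_3d, "nbrs_1d": nbrs_1d,
            "edge_unit": unit, "edge_rbf": rbf_encode(length)}


# Toy usage: 10 synthetic nucleotides spaced along a diagonal, with k=3 for readability.
toy_beads = np.random.default_rng(1).normal(size=(10, 3, 3)) + 3.0 * np.arange(10)[:, None, None]
feats = featurize_backbone(toy_beads, k=3, noise_std=0.1)
```

With `k=32` and `noise_std=0.1`, the call mirrors the hyperparameters quoted in the list, although the remaining details of the real pipeline may differ.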

Multi-graph representation. Given a set of $k$ structures in the input conformational ensemble,
each RNA backbone is featurized as a separate geometric graph $\mathcal{G}^{(k)} = (\mathbf{A}^{(k)}, \mathbf{S}^{(k)}, \vec{\mathbf{V}}^{(k)})$ with
the scalar features $\mathbf{S}^{(k)} \in \mathbb{R}^{n \times f}$, vector features $\vec{\mathbf{V}}^{(k)} \in \mathbb{R}^{n \times f' \times 3}$, and $\mathbf{A}^{(k)}$, an $n \times n$ adjacency
matrix. For clear presentation and without loss of generality, we omit edge features and use $f$, $f'$
to denote scalar/vector feature channels.

The input to gRNAde is thus a set of geometric graphs $\{\mathcal{G}^{(1)}, \dots, \mathcal{G}^{(k)}\}$ which is merged into
what we term a ‘multi-graph’ representation of the conformational ensemble, $\mathcal{M} = (\mathbf{A}, \mathbf{S}, \vec{\mathbf{V}})$,
by stacking the set of scalar features $\{\mathbf{S}^{(1)}, \dots, \mathbf{S}^{(k)}\}$ into one tensor $\mathbf{S} \in \mathbb{R}^{n \times k \times f}$ along a new
axis for the set size $k$. Similarly, the set of vector features $\{\vec{\mathbf{V}}^{(1)}, \dots, \vec{\mathbf{V}}^{(k)}\}$ is stacked into one
tensor $\vec{\mathbf{V}} \in \mathbb{R}^{n \times k \times f' \times 3}$. Lastly, the set of adjacency matrices $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(k)}\}$ are merged via a
union $\cup$ into one single joint adjacency matrix $\mathbf{A}$.
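
A minimal NumPy sketch of this merging step is given below, assuming dense boolean adjacency matrices and per-conformer feature arrays. The function name `build_multigraph` and the toy sizes are hypothetical and chosen only to make the tensor shapes concrete.

```python
import numpy as np


def build_multigraph(scalars, vectors, adjacencies):
    """Merge k per-conformer graphs into a multi-graph (A, S, V).

    scalars:     list of k arrays of shape (n, f)
    vectors:     list of k arrays of shape (n, f', 3)
    adjacencies: list of k boolean arrays of shape (n, n)
    """
    S = np.stack(scalars, axis=1)  # (n, k, f): new axis for the set size k
    V = np.stack(vectors, axis=1)  # (n, k, f', 3)
    # Union of edges: an edge is kept if it appears in any conformer.
    A = np.logical_or.reduce(np.asarray(adjacencies, dtype=bool), axis=0)
    return A, S, V


# Toy ensemble with n=4 nucleotides, k=2 conformers, f=5 scalar and f'=2 vector channels.
n, k, f, f_vec = 4, 2, 5, 2
rng = np.random.default_rng(0)
scalars = [rng.normal(size=(n, f)) for _ in range(k)]
vectors = [rng.normal(size=(n, f_vec, 3)) for _ in range(k)]
adj_1 = np.zeros((n, n), dtype=bool)
adj_1[0, 1] = True
adj_2 = np.zeros((n, n), dtype=bool)
adj_2[2, 3] = True

A, S, V = build_multigraph(scalars, vectors, [adj_1, adj_2])
```

Because the conformers share the same nucleotides, stacking along a new axis keeps node indices aligned across states, and the joint adjacency contains every edge present in at least one state.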

[Figure 5.4 diagram: Backbone 1, ..., Backbone k → GVP-GNN encoder layers (x L) → node
embeddings → Deep Set pooling → autoregressive decoder (x L), which also takes the partial
sequence (GAGCG_) → per-node logits over A, G, C, U → sampling.]

Figure 5.4: gRNAde model architecture. One or more RNA backbone structures are encoded
via SE(3)-equivariant GNN layers to build latent representations of each nucleotide’s local 3D
environment per state. Representations from multiple states are pooled via permutation invariant
Deep Sets and fed to an autoregressive decoder to predict probabilities over four bases (A, G,
C, U). During training, the model minimizes cross-entropy loss between predicted and true
sequence identities.

5.1.2 Multi-state GNN for Encoding Conformational Ensembles

The gRNAde model, illustrated in Figure 5.4, processes one or more RNA backbone graphs via
a multi-state GNN encoder which is equivariant to 3D roto-translation of coordinates as well as
to the ordering of conformational states, followed by conformation order-invariant pooling and
sequence decoding. We describe each component in the following sections.

Multi-state GNN encoder When representing conformational ensembles as a multi-graph,
each node feature tensor contains three axes: (#nodes, #conformations, feature channels).
We perform message passing on the multi-graph adjacency to independently process each
conformational state, while maintaining permutation equivariance of the updated feature tensors
along both the first (#nodes) and second (#conformations) axes. This works by operating on only
the feature channels axis and generalising the PyTorch Geometric [Fey and Lenssen, 2019b]
message passing class to account for the extra conformations axis; see Figure 5.5 for details.

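We use multiple O(3)-equivariant GVP-GNN [Jing et al., 2020] layers to update scalar
features $s_i \in \mathbb{R}^{k \times f}$ and vector features $\vec{v}_i \in \mathbb{R}^{k \times f' \times 3}$ for each node $i$:

$$m_i, \vec{m}_i := \sum_{j \in \mathcal{N}_i} \mathrm{MSG}\big( (s_i, \vec{v}_i), (s_j, \vec{v}_j), e_{ij} \big), \tag{5.1}$$

$$s'_i, \vec{v}'_i := \mathrm{UPD}\big( (s_i, \vec{v}_i), (m_i, \vec{m}_i) \big), \tag{5.2}$$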

where MSG,UPD are Geometric Vector Perceptrons, a generalization of MLPs to take tuples of
scalar and vector features as input and apply O(3)-equivariant non-linear updates. The overall
GNN encoder is SO(3)-equivariant due to the use of reflection-sensitive input features (dihedral
angles) combined with O(3)-equivariant GVP-GNN layers.
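
To make the role of the extra conformations axis concrete, the following is a simplified NumPy sketch of one message passing step on the multi-graph. It is a sketch under stated assumptions: it uses scalar features only, a message that depends only on the sender node, and plain linear maps with a tanh non-linearity in place of Geometric Vector Perceptrons. The function `multi_state_layer` and the weight names are hypothetical. The point is only that operating on the channel axis with shared weights yields permutation equivariance along both the node and conformation axes.

```python
import numpy as np

def multi_state_layer(A, S, W_msg, W_upd):
    """One simplified message passing step on a multi-graph.

    A: (n, n) adjacency, S: (n, k, f) scalar features,
    W_msg: (f, f) and W_upd: (2f, f), shared by all nodes and conformations.
    """
    msgs = np.tanh(S @ W_msg)                                  # acts on the channel axis only
    agg = np.einsum("ij,jkf->ikf", A.astype(S.dtype), msgs)    # sum over neighbours j, per conformation
    return np.tanh(np.concatenate([S, agg], axis=-1) @ W_upd)  # updated features, (n, k, f)

rng = np.random.default_rng(0)
n, k, f = 6, 3, 4
A = rng.random((n, n)) < 0.4
S = rng.normal(size=(n, k, f))
W_msg, W_upd = rng.normal(size=(f, f)), rng.normal(size=(2 * f, f))
out = multi_state_layer(A, S, W_msg, W_upd)

# Equivariance checks: permuting the inputs permutes the outputs in the same way.
conf_perm = rng.permutation(k)   # reorder conformations (axis 2 in the text, index 1 here)
node_perm = rng.permutation(n)   # relabel nodes
out_conf = multi_state_layer(A, S[:, conf_perm], W_msg, W_upd)
out_node = multi_state_layer(A[node_perm][:, node_perm], S[node_perm], W_msg, W_upd)
print(np.allclose(out_conf, out[:, conf_perm]), np.allclose(out_node, out[node_perm]))
```

Because no operation mixes entries along the conformation axis, each state is processed independently and the layer accepts any number of conformations without modification.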

Our multi-state GNN encoder is easy to implement in any message passing framework and
can be used as a plug-and-play extension for any geometric GNN pipeline to incorporate the
multi-state inductive bias. It serves as an elegant alternative to batching all the conformations,
which we found required major alterations to message passing and pooling.

[Figure 5.5 diagram: a set of RNA conformations (nodes numbered 1 to 6) is converted into a multi-graph tensor; an equivariance diagram relates "Permute nodes", "Permute conformers" and "Rotate 3D features" to "Update with multi-GNN layer".]

Figure 5.5: Multi-graph tensor representation of conformational ensembles, and the associated
symmetry groups acting on each axis. We process a set of k RNA backbone conformations
with n nodes each into a tensor representation. Each multi-state GNN layer updates the tensor
while being equivariant to the underlying symmetries. Here, we show a tensor of 3D vector-type
features with shape $n \times k \times 3$. As depicted in the equivariance diagram, the updated tensor
must be equivariant to permutation $S_n$ of n nodes for axis 1, permutation $S_k$ of k conformational
states for axis 2, and rotation SO(3)/O(3) of the 3D features for axis 3.

Conformation order-invariant pooling The final encoder representations in gRNAde account
for multi-state information while being invariant to the permutation of the states. To achieve this,
we perform a Deep Set pooling [Zaheer et al., 2017] over the conformations axis after the final
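encoder layer to reduce $S \in \mathbb{R}^{n \times k \times f}$ and $\vec{V} \in \mathbb{R}^{n \times k \times f' \times 3}$ to $S' \in \mathbb{R}^{n \times f}$ and $\vec{V}' \in \mathbb{R}^{n \times f' \times 3}$:

$$S', \vec{V}' := \frac{1}{k} \sum_{i=1}^{k} \big( S[:, i], \vec{V}[:, i] \big). \tag{5.3}$$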

A simple sum or average pooling does not introduce any new learnable parameters to the pipeline
and is flexible to handle a variable number of conformations, enabling both single-state and
multi-state design with the same model.
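
The pooling in Equation 5.3 amounts to a mean over the conformation axis. A minimal NumPy sketch follows; the function name `pool_conformations` and the toy shapes are illustrative assumptions. It also checks the two properties claimed above: invariance to the order of states and support for a single-state input.

```python
import numpy as np

def pool_conformations(S, V):
    """Average over the conformation axis: (n, k, f) -> (n, f) and (n, k, f', 3) -> (n, f', 3)."""
    return S.mean(axis=1), V.mean(axis=1)

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3, 4))       # n = 5 nodes, k = 3 states, f = 4 scalar channels
V = rng.normal(size=(5, 3, 2, 3))    # f' = 2 vector channels
S_pooled, V_pooled = pool_conformations(S, V)

# Reordering the states leaves the pooled features unchanged.
perm = rng.permutation(3)
S_perm, V_perm = pool_conformations(S[:, perm], V[:, perm])

# The same function handles a single state (k = 1).
S_single, V_single = pool_conformations(S[:, :1], V[:, :1])
print(S_pooled.shape, V_pooled.shape, np.allclose(S_pooled, S_perm))
```

Since averaging is linear, rotating all vector features and then pooling gives the same result as pooling and then rotating, so the pooled vectors remain equivariant.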

[Figure 5.6 diagram: a backbone graph is passed to gRNAde to obtain sampled sequences; these are compared with the true sequence (sequence recovery) and forward folded (2D: EternaFold, 3D: RhoFold) into predicted structures, which are compared with the true structure (structural self-consistency scores; 2D: MCC, 3D: RMSD, TM, GDT).]

Figure 5.6: In-silico evaluation metrics for gRNAde designed sequences. We consider (1)
sequence recovery, the percentage of native nucleotides recovered in designed samples, (2)
self-consistency scores, which are measured by ‘forward folding’ designed sequences using a
structure predictor and measuring how well 2D and 3D structure are recovered (we use EternaFold
and RhoFold for 2D/3D structure prediction, respectively). We also report (3) perplexity, the
model’s estimate of the likelihood of a sequence given a backbone.

Sequence decoding and loss function We feed the final encoder representations after pooling,
$S', \vec{V}'$, to autoregressive GVP-GNN decoder layers to predict the probability of the four possible
base identities (A, G, C, U) for each node/nucleotide. Decoding proceeds according to the RNA
sequence order from the 5’ end to 3’ end. gRNAde is trained in a self-supervised manner by
minimising a cross-entropy loss (with label smoothing value of 0.05) between the predicted
probability distribution and the ground truth identity for each base. During training, we use
teacher forcing [Williams and Zipser, 1989] where the true identity of the base is fed as input to
the decoder at each step, encouraging the model to stay close to the ground-truth sequence.
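
The training objective can be sketched in NumPy as follows. This is a sketch, not the thesis code: it assumes the common label smoothing convention in which the target distribution is (1 - ε) times the one-hot label plus ε spread uniformly over the four bases, and the function name `smoothed_cross_entropy` is hypothetical.

```python
import numpy as np

def smoothed_cross_entropy(logits, targets, smoothing=0.05):
    """Mean cross-entropy between per-nucleotide logits and label-smoothed targets.

    logits: (n, 4) unnormalised scores over (A, G, C, U); targets: (n,) integer base indices.
    """
    n, c = logits.shape
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    target_dist = np.full((n, c), smoothing / c)                   # uniform share of the smoothing mass
    target_dist[np.arange(n), targets] += 1.0 - smoothing          # remaining mass on the true base
    return float(-(target_dist * log_probs).sum(axis=1).mean())

logits = np.array([[4.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0]])
targets = np.array([0, 1])
loss_smoothed = smoothed_cross_entropy(logits, targets, smoothing=0.05)
loss_plain = smoothed_cross_entropy(logits, targets, smoothing=0.0)
print(loss_smoothed, loss_plain)
```

With teacher forcing, the logits at each position would be computed from the true bases at earlier positions, so the loss for all positions can be evaluated in one pass.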

Sampling When using gRNAde for inference and designing new sequences, we iteratively
sample the base identity for a given nucleotide from the predicted conditional probability
distribution, given the partially designed sequence up until that nucleotide/decoding step. We
can modulate the smoothness or sharpness of the probability distribution by using a temperature
parameter. At lower temperatures, for instance ≤1.0, we expect higher native sequence recovery
and lower diversity in gRNAde’s designs. At higher temperatures, the model produces more
diverse designs by sampling from a smoothed probability distribution.

gRNAde can also use unordered decoding [Dauparas et al., 2022] with minimal impact on
performance, as well as masking or logit biasing during sampling, depending on the design
scenario at hand. This enables gRNAde to perform partial re-design of RNA sequences, retaining
specified nucleotide identities while designing the rest of the sequence. In Chapter 6, we
demonstrate this capability for designing functional ribozymes with gRNAde.
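
The sampling procedure, including temperature scaling and partial re-design with fixed positions, can be sketched as below. The callable `logits_fn(partial_sequence, position)` is a hypothetical stand-in for the trained decoder, and `sample_sequence` and `toy_logits` are illustrative names; none of these are part of the gRNAde codebase.

```python
import numpy as np

BASES = "AGCU"

def sample_sequence(logits_fn, length, temperature=0.1, fixed=None, rng=None):
    """Autoregressively sample a sequence from 5' to 3'.

    logits_fn(partial_sequence, position) -> 4 logits over BASES (hypothetical decoder interface).
    fixed: optional dict {position: base} of nucleotide identities to retain.
    """
    if rng is None:
        rng = np.random.default_rng()
    fixed = fixed or {}
    seq = []
    for pos in range(length):
        if pos in fixed:                       # masking: keep the specified identity
            seq.append(fixed[pos])
            continue
        logits = np.asarray(logits_fn("".join(seq), pos), dtype=float) / temperature
        probs = np.exp(logits - logits.max())  # softmax; lower temperature gives a sharper distribution
        probs /= probs.sum()
        seq.append(BASES[int(rng.choice(len(BASES), p=probs))])
    return "".join(seq)

def toy_logits(partial, pos):
    return [0.0, 5.0, 0.0, 0.0]                # always favours G; stands in for the decoder

design = sample_sequence(toy_logits, length=5, temperature=0.1,
                         fixed={0: "A", 3: "U"}, rng=np.random.default_rng(0))
print(design)
```

Logit biasing could be added in the same loop by adding a per-base offset to `logits` before the softmax.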

5.1.3 Evaluation Metrics for Designed Sequences

Inverse folding models can generate large numbers of designed sequences for a given backbone
structure, making in-silico evaluation metrics essential for prioritizing which sequences to pursue
in wet lab experiments.

We primarily use Native sequence recovery to compare gRNAde to existing methods on
RNA backbones with known native sequences. Sequence recovery is defined as the average
percentage of nucleotides in designed sequences that match the ground truth sequence. While
recovery is the most widely used metric for biomolecule inverse design [Dauparas et al., 2022], it
can be misleading for RNAs where alternative nucleotide pairings may form identical structures.

Thus, we also use self-consistency scores to measure how well the designed sequences are
predicted to recover the target 2D and 3D structure (Figure 5.6):

• Secondary structure self-consistency score, where we ‘forward fold’ the sampled sequences
using a secondary structure prediction tool (we used EternaFold [Wayment-Steele
et al., 2022a]) and measure the average Matthews Correlation Coefficient (MCC) to the
ground truth secondary structure, represented as a binary adjacency matrix. MCC values
range between -1 and +1, where +1 represents a perfect match, 0 an average random
prediction and -1 an inverse prediction. This measures how well the designs recover base
pairing patterns.

• Tertiary structure self-consistency scores, where we ‘forward fold’ the sampled sequences
using a 3D structure prediction tool (we used RhoFold [Shen et al., 2022]) and compute
the average RMSD, TM-score and GDT_TS to the ground truth C4’ coordinates to measure
how well the designs recover global structural similarity and 3D conformations.
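
Sequence recovery and the secondary structure self-consistency score can be sketched in plain Python. The helper names are hypothetical, and the dot-bracket parser below handles only '(' and ')' pairs, so pseudoknot brackets would need extra handling. In practice the predicted structure would come from a tool such as EternaFold; here both structures are given directly as strings.

```python
import math

def sequence_recovery(designed, native):
    """Fraction of positions where the designed base matches the native base."""
    assert len(designed) == len(native)
    return sum(d == t for d, t in zip(designed, native)) / len(native)

def base_pairs(dot_bracket):
    """Set of (i, j) base pairs from a dot-bracket string; only '(' and ')' are handled."""
    stack, pairs = [], set()
    for j, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            pairs.add((stack.pop(), j))
    return pairs

def structure_mcc(pred_db, true_db):
    """MCC between two binary base-pair adjacency matrices (upper triangle, i < j)."""
    n = len(true_db)
    pred, true = base_pairs(pred_db), base_pairs(true_db)
    tp = len(pred & true)                      # pairs present in both
    fp = len(pred - true)                      # predicted but not in the ground truth
    fn = len(true - pred)                      # missed ground truth pairs
    tn = n * (n - 1) // 2 - tp - fp - fn       # remaining unpaired (i, j) entries
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

print(sequence_recovery("GAGC", "GAUC"))
print(structure_mcc("(....)", "((..))"))
```

Averaging these per-sequence values over all sampled designs for a backbone gives the reported scores.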

Lastly, we can also consider Perplexity, a measure of the average number of bases that the
model is selecting from when designing each nucleotide. Formally, perplexity is the average
exponential of the negative log-likelihood of the sampled sequences. A ‘perfect’ model which
regurgitates the ground truth¹ would have a perplexity of 1, while a perplexity of 4 means that the
model is making random predictions (the model outputs a uniform probability over 4 possible
bases). Perplexity does not require a ground truth structure to calculate, and can also be used for
ranking sequences as it is the model’s estimate of the compatibility of a sequence with the input
backbone structure.
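
As a worked example of the two reference values above, the sketch below computes per-sequence perplexity as the exponential of the mean negative log-likelihood over nucleotides. This per-sequence averaging convention is an assumption made for illustration, and the function name `perplexity` is hypothetical.

```python
import math

def perplexity(base_probs):
    """exp of the mean negative log-likelihood.

    base_probs: model probabilities assigned to the base chosen at each position.
    """
    nll = [-math.log(p) for p in base_probs]
    return math.exp(sum(nll) / len(nll))

uniform = perplexity([0.25] * 10)     # model spreads probability evenly over 4 bases
confident = perplexity([1.0] * 10)    # model assigns probability 1 to every chosen base
print(uniform, confident)

# Ranking candidate sequences: lower perplexity means higher estimated compatibility.
candidates = {"seq_a": [0.9, 0.8, 0.7], "seq_b": [0.4, 0.3, 0.5]}
ranking = sorted(candidates, key=lambda name: perplexity(candidates[name]))
print(ranking)
```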

Limitations While self-consistency metrics such as ‘designability’ (e.g., scRMSD ≤ 2Å) and
perplexity have been shown to correlate with experimental success in protein design [Watson
et al., 2023], precise designability thresholds remain to be established for RNA. As a starting
point, pairs of structures with TM-score ≥ 0.45 or GDT_TS ≥ 0.5 are known to correspond
to roughly the same fold [Zhang et al., 2022]. A major limitation for in-silico evaluation
of 3D RNA design compared to proteins is the relatively poor performance of current RNA
structure prediction tools. We will address these evaluation challenges for real-world RNA
design campaigns in Chapter 6, where we present a design pipeline validated through wet lab
experiments.

¹ Note that such a model would be practically useless for real design tasks.

5.2 Experimental Setup

3D RNA structure dataset We create a machine learning-ready dataset for RNA inverse design
using RNASolo [Adamczyk et al., 2022], a novel repository of RNA 3D structures extracted
from solo RNAs, protein-RNA complexes, and DNA-RNA hybrids in the PDB. We used all
currently known RNA structures at resolution ≤4.0Å resulting in 4,223 unique RNA sequences
for which a total of 12,011 structures are available (RNASolo date cutoff: 31 October 2023). As
inverse folding is a per-node/per-nucleotide level task, our training data contains over 2.8 million
unique nucleotides. Further dataset statistics are available in Appendix Figure C.2, illustrating
the diversity of our dataset in terms of sequence length, number of structures per sequence, as
well as structural variations among conformations per sequence.

Structural clustering In order to ensure that we evaluate gRNAde’s generalization ability to
novel RNAs, we cluster the 4,223 unique RNAs into groups based on structural similarity. We
use US-align [Zhang et al., 2022] with a similarity threshold of TM-score >0.45 for clustering,
and ensure that we train, validate and test gRNAde on structurally dissimilar clusters (see next
paragraph). We also provide utilities for clustering based on sequence homology using CD-HIT
[Fu et al., 2012], which leads to splits containing biologically dissimilar clusters of RNAs.

Splits to evaluate generalization After clustering, we split the RNAs into training (∼4000
samples), validation and test sets (100 samples each) to evaluate two different design scenarios:

1. Single-state split. This split is used to fairly evaluate gRNAde for single-state design on
a set of RNA structures of interest from the PDB identified by Das et al. [2010], which
mainly includes riboswitches, aptamers, and ribozymes. We identify the structural clusters
belonging to the RNAs identified in Das et al. [2010] and add all the RNAs in these clusters
to the test set (100 samples). The remaining clusters are randomly added to the training
and validation splits.

2. Multi-state split. This split is used to test gRNAde’s ability to design RNA with multiple
distinct conformational states. We order the structural clusters based on median intra-sequence
RMSD among available structures within the cluster². The top 100 samples from
clusters with the highest median intra-sequence RMSD are added to the test set. The next
100 samples are added to the validation set and all remaining samples are used for training.

Validation and test samples come from clusters with at most 5 unique sequences, in order to
ensure diversity. Any samples that were not assigned clusters are directly appended to the training
set. We also directly add very large RNAs (> 1000 nts) to the training set, as it is unlikely that
we want to design very large RNAs. We exclude very short RNA strands (< 10 nts).
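
The multi-state split can be sketched as follows. This is a hedged illustration rather than the actual data pipeline: the function name `multi_state_split`, the dictionary-based inputs and the whole-cluster assignment (which can slightly overshoot the target set sizes) are assumptions, and the per-sequence RMSD values are taken as precomputed.

```python
from statistics import median

def multi_state_split(clusters, seq_rmsd, n_test=100, n_val=100, max_cluster_size=5):
    """Assign whole structural clusters to test, validation and training sets.

    clusters: dict cluster_id -> list of sequence ids
    seq_rmsd: dict sequence id -> intra-sequence RMSD among its available structures
    """
    # Most flexible clusters first, ranked by median intra-sequence RMSD.
    ranked = sorted(clusters, key=lambda c: median(seq_rmsd[s] for s in clusters[c]), reverse=True)
    train, val, test = [], [], []
    for c in ranked:
        members = clusters[c]
        if len(members) > max_cluster_size:   # large clusters are kept for training
            train.extend(members)
        elif len(test) < n_test:
            test.extend(members)
        elif len(val) < n_val:
            val.extend(members)
        else:
            train.extend(members)
    return train, val, test

clusters = {"a": ["s1", "s2"], "b": ["s3"], "c": ["s4"], "d": ["s5", "s6", "s7"]}
seq_rmsd = {"s1": 5.0, "s2": 7.0, "s3": 1.0, "s4": 3.0, "s5": 0.5, "s6": 0.5, "s7": 0.5}
train, val, test = multi_state_split(clusters, seq_rmsd, n_test=2, n_val=1, max_cluster_size=2)
print(train, val, test)
```

Assigning clusters as units keeps structurally similar RNAs on the same side of the split, which is what makes the test set a measure of generalization.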

² For each RNA sequence, we compute the pairwise C4’ RMSD among all available structures. We then compute
the median RMSD across all sequences within each structural cluster.

[Figure 5.7 plots. (a) "gRNAde outperforms Rosetta": bar chart of native sequence recovery for ViennaRNA (2D only): 0.269, FARNA: 0.321, RDesign: 0.430, Rosetta: 0.450, gRNAde: 0.568. (b) "Perplexity correlates with recovery": scatter plot of gRNAde seq. recovery against Rosetta seq. recovery (both axes 0.00 to 1.00), with points shaded by gRNAde perplexity (colour bar 1.1 to 1.6).]

Figure 5.7: gRNAde compared to Rosetta for single-state design. (a) We benchmark native
sequence recovery of gRNAde, RDesign, Rosetta, FARNA and ViennaRNA on 14 RNA structures
of interest identified by Das et al. [2010]. gRNAde obtains higher native sequence recovery
rates (56% on average) compared to Rosetta (45%) and all other methods. (b) Sequence
recovery per sample for Rosetta and gRNAde, shaded by gRNAde’s perplexity for each sample.
gRNAde’s perplexity is correlated with native sequence recovery for designed sequences (Pearson
correlation: -0.76, Spearman correlation: -0.67). Full results on the single-state test set are available
in Appendix C.1 and per-RNA results in Appendix Table C.2.

Evaluation metrics For a given data split, we evaluate models on the held-out test set by
designing 16 sequences (sampled at temperature 0.1) for each test data point and computing
averages for each of the metrics described in Section 5.1.3: native sequence recovery, structural
self-consistency scores and perplexity. We employ early stopping by reporting test set perfor-
mance for the model checkpoint for the epoch with the best validation set recovery. Standard
deviations are reported across 3 consistent random seeds for all models.

Hyperparameters All models use 4 encoder and 4 decoder GVP-GNN layers, with 128
scalar/16 vector node features, 64 scalar/4 vector edge features, and dropout probability 0.5,
resulting in 2,147,944 trainable parameters. All models are trained for a maximum of 50 epochs
using the Adam optimiser with an initial learning rate of 0.0001, which is reduced by a factor of
0.9 when validation performance plateaus with patience of 5 epochs. Ablation studies of key
modelling decisions are available in Appendix Table C.1.
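
For reference, the settings above can be collected into a small configuration object, together with a plain-Python sketch of the plateau-based learning rate rule. The class and field names are hypothetical, and the exact plateau semantics (the rate is reduced once the validation metric has failed to improve for more than `patience` consecutive epochs) are an assumption modelled on common deep learning schedulers.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Values taken from the text; field names are illustrative.
    encoder_layers: int = 4
    decoder_layers: int = 4
    node_scalar_dim: int = 128
    node_vector_dim: int = 16
    edge_scalar_dim: int = 64
    edge_vector_dim: int = 4
    dropout: float = 0.5
    max_epochs: int = 50
    lr: float = 1e-4
    lr_factor: float = 0.9
    lr_patience: int = 5
    label_smoothing: float = 0.05

class PlateauSchedule:
    """Multiply the learning rate by `factor` when the validation metric stops improving."""

    def __init__(self, lr, factor, patience):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        if val_metric > self.best:             # higher is better, e.g. validation recovery
            self.best, self.bad_epochs = val_metric, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

cfg = TrainConfig()
schedule = PlateauSchedule(cfg.lr, cfg.lr_factor, cfg.lr_patience)
history = [schedule.step(m) for m in [0.50, 0.49, 0.49, 0.48, 0.49, 0.50, 0.49]]
print(history[-1])
```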

5.3 Results

5.3.1 Single-state RNA Design Benchmark

We set out to compare gRNAde to Rosetta, a state-of-the-art physics-based toolkit for biomolecular
modelling and design [Leman et al., 2020]. We reproduced the benchmark setup from Das
et al. [2010] for Rosetta’s fixed backbone RNA sequence design workflow on 14 RNA structures
of interest from the PDB, which mainly includes riboswitches, aptamers, and ribozymes (full
listing in Table C.2). We trained gRNAde on the single-state split detailed in Section 5.2, explicitly
excluding the 14 RNAs as well as any structurally similar RNAs in order to ensure that we
fairly evaluate gRNAde’s generalization abilities vs. Rosetta.

gRNAde improves sequence recovery over Rosetta In Figure 5.7, we compare gRNAde’s
native sequence recovery for single-state design with numbers taken from Das et al. [2010] for
Rosetta, FARNA (a predecessor of Rosetta), ViennaRNA (the most popular 2D inverse folding
method), and RDesign [Tan et al., 2023] (a concurrent GNN-based RNA inverse folding model).
gRNAde has higher recovery of 56% on average compared to 45% for Rosetta, 32% for FARNA,
27% for ViennaRNA, and 43% for RDesign. See Appendix Table C.2 for per-RNA results and
Appendix C.1 for full results on the single-state test set of 100 RNAs.

gRNAde is significantly faster than Rosetta In addition to superior sequence recovery,
gRNAde is significantly faster than Rosetta for high-throughput design pipelines. Training
gRNAde from scratch takes roughly 2–6 hours on a single A100 GPU, depending on the exact
hyperparameters. Once trained, gRNAde can design hundreds of sequences for backbones with
hundreds of nucleotides in ∼10 seconds on CPU and ∼1 second with GPU acceleration. On the
other hand, Rosetta takes on the order of hours to produce a single design because it performs expensive
Monte Carlo optimisation until convergence on CPU.³ Deep learning methods like gRNAde are
arguably easier to use since no expert customization is required and setup is easier compared
to Rosetta (the latest builds do not include RNA recipes), making RNA design more broadly
accessible.

gRNAde’s perplexity correlates with sequence recovery In Figure 5.7b, we plot native
sequence recovery per sample for Rosetta vs. gRNAde, shaded by gRNAde’s average perplexity
for each sample. Perplexity is an indicator of the model’s confidence in its own prediction (lower
perplexity implies higher confidence) and appears to be correlated with native sequence recovery.
In the subsequent Section 5.3.3, we further demonstrate the utility of gRNAde’s perplexity for
zero-shot ranking of RNA fitness landscapes.

5.3.2 Multi-state RNA Design Benchmark

Structured RNAs often adopt multiple distinct conformational states to perform biological
functions [Ken et al., 2023]. For instance, riboswitches adopt at least two distinct functional
conformations: a ligand bound (holo) and unbound (apo) state, which helps them regulate and
control gene expression [Stagno et al., 2017]. If we were to attempt single-state inverse design for
such RNAs, each backbone structure may lead to a different set of sampled sequences. It is not
obvious how to select the input backbone as well as designed sequence when using single-state
models for multi-state design. gRNAde’s multi-state GNN, described in Section 5.1.2, directly
‘bakes in’ the multi-state nature of RNA into the architecture and designs sequences explicitly
conditioned on multiple states.

³ Rosetta documentation (https://www.rosettacommons.org/docs/latest/application_documentation/rna/rna-design) states that “runs on RNA backbones longer than ∼ten nucleotides take many minutes or
hours”. We have not run Rosetta ourselves as recent builds do not include RNA recipes.

[Figure 5.8 plots. (a) "Per-sample sequence recovery": native sequence recovery for RDesign with 1 state: 0.385, and gRNAde with 1 state: 0.481, 2 states: 0.507, 3 states: 0.531, 4 states: 0.511, 5 states: 0.510. (b) "Per-nucleotide recovery vs. structural flexibility": native sequence recovery against nucleotide paired probability (left, 0.0 to 1.0) and nucleotide RMSD in Å (right, 0 to 10) for gRNAde models with 1, 3 and 5 states.]

Figure 5.8: Multi-state design benchmark. (a) Multi-state gRNAde shows a consistent 3-5%
improvement over the single-state variant in terms of sequence recovery on the multi-state test
set of 100 RNAs, with the best performance obtained using 3 states. (b) When plotting sequence
recovery per-nucleotide, multi-state gRNAde improves over a single-state model for structurally
flexible regions of RNAs, as characterised by nucleotides that tend to undergo changes in base
pairing (left) and nucleotides with higher average RMSD across multiple states (right). Marginal
histograms in blue show the distribution of values. We plot performance for one consistent
random seed across all models; collated results and ablations are available in Appendix C.1.

In order to evaluate gRNAde’s multi-state design capabilities, we trained equivalent single-
state and multi-state gRNAde models on the multi-state split detailed in Section 5.2, where the
validation and test sets contain progressively more structurally flexible RNAs as measured by
median RMSD among multiple available states for an RNA.

Multi-state gRNAde consistently boosts sequence recovery In Figure 5.8a, we compared a
single-state variant of gRNAde with otherwise equivalent multi-state models (with up to 5 states)
in terms of native sequence recovery. Multi-state variants show a consistent 3-5% improvement,
with the best performance obtained using 3 states. This trend holds to a lesser extent on the
single-state benchmark where the multi-state model is being used with only one state as input.
This suggests that seeing multiple states during training can be useful for teaching gRNAde
about RNA conformational flexibility and improve performance even for single-state design
tasks. As a caveat, it is worth noting that multi-state models consume more GPU memory than
an equivalent single-state model during mini-batch training (approximate peak GPU usage for
max. number of states = 1: 12GB, 3: 28GB, 5: 50GB on a single A100 with at most 3000 total
nodes in a mini-batch).

[Figure 5.9 plot. "Max Fitness by Sample Size and Condition (n=74,943; simulations=10,000)": expected 'max' fold change over WT (left axis, 0x to 18x; right axis, fitness 0.00 to 2.89) against the number of selected sequences for assaying (1, 10, 50, 100, 200, 449, 1500, 5000, 10493) for four conditions: random, n_mut==1, n_mut<=2 and gRNAde. Reference lines mark the wildtype and the 1st (fit.: 2.88), 2nd (2.40), 5th (2.23), 10th (1.96), 20th (1.61) and 50th (1.35) best variants.]

Figure 5.9: Retrospective study of gRNAde for ranking ribozyme mutant fitness. Using the
backbone structure and mutational fitness landscape data from an RNA polymerase ribozyme
[McRae et al., 2024], we retrospectively analyse how well we can rank variants at multiple design
budgets using random selection vs. gRNAde’s perplexity for mutant sequences conditioned on
the backbone structure (catalytic subunit 5TU). Note that gRNAde is used zero-shot here, i.e. it
was not fine-tuned on any assay data. For stochastic strategies, bars indicate median values, and
error bars indicate the interquartile range estimated from 10,000 simulations per strategy and
design budget. At low throughput design budgets of up to ∼500 sequences, selecting mutants
using gRNAde outperforms random baselines in terms of the expected maximum improvement
in fitness over the wild type. In particular, gRNAde performs better than single site saturation
mutagenesis, even when all single mutants are explored (total of 449 single mutants, 10,493
double mutants for the catalytic subunit 5TU in McRae et al. [2024]). See Appendix Figure C.1
for results on scaffolding subunit t1.

Improved recovery in structurally flexible regions In Figure 5.8b, we evaluated gRNAde’s
multi-state sequence recovery at a fine-grained, per-nucleotide level to understand the source
of performance gains. Multi-state GNNs improve sequence recovery over the single-state
variant on structurally flexible nucleotides, as characterised by undergoing changes in base
pairing/secondary structure and higher average RMSD between 3D coordinates across states.

5.3.3 Zero-shot Ranking of RNA Fitness Landscape

Lastly, we explored the use of gRNAde as a zero-shot ranker of mutants in RNA engineering
campaigns. Given the backbone structure of a wild type RNA of interest as well as a candidate
set of mutant sequences, we can compute gRNAde’s perplexity for each sequence conditioned on
the backbone structure. Perplexity is inversely related to the likelihood of a sequence
conditioned on a structure, as described in Section 5.1.3. We can then rank sequences based
on how ‘compatible’ they are with the backbone structure in order to select a subset to be
experimentally validated in wet labs.

Retrospective analysis on ribozyme fitness landscape A recent study by McRae et al. [2024]
determined a cryo-EM structure of a dimeric RNA polymerase ribozyme at 5Å resolution⁴, along
with fitness landscapes of ∼75K mutants for the catalytic subunit 5TU and ∼48K mutants for the
scaffolding subunit t1. We design a retrospective study using this data of (sequence, fitness value)
pairs where we simulate an RNA engineering campaign with the aim of improving catalytic
subunit fitness over the wild type 5TU sequence.

We consider various design budgets ranging from hundreds to thousands of sequences selected
for experimental validation, and compare 4 unsupervised approaches for ranking/selecting
variants: (1) random choice from all ∼75,000 sequences; (2) random choice from all 449 single
mutant sequences; (3) random choice from all single and double mutant sequences (as sequences
with higher mutation order tend to be less fit); and (4) negative gRNAde perplexity (lower
perplexity is better). For each design budget and ranking approach, we compute the expected
maximum change in fitness over the wild type that could be achieved by screening as many
variants as allowed in the given design budget. We run 10,000 simulations to compute confidence
intervals for the 3 random baselines.
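
The comparison between ranked and random selection can be sketched in plain Python. This is an illustrative sketch with made-up toy values, not the analysis code: the function names are hypothetical, and the interquartile range is approximated by indexing into the sorted simulation results.

```python
import random
from statistics import median

def ranked_max_fitness(perplexity, fitness, budget):
    """Best fitness among the `budget` sequences with the lowest perplexity."""
    order = sorted(range(len(fitness)), key=lambda i: perplexity[i])
    return max(fitness[i] for i in order[:budget])

def random_max_fitness(fitness, budget, n_sim=10_000, seed=0):
    """Median and approximate interquartile range of the best fitness under random selection."""
    rng = random.Random(seed)
    maxima = sorted(max(rng.sample(fitness, budget)) for _ in range(n_sim))
    q1, q3 = maxima[n_sim // 4], maxima[(3 * n_sim) // 4]
    return median(maxima), (q1, q3)

# Toy landscape (illustrative values only).
fitness = [0.2, 1.5, 0.9, 2.8, 0.1, 1.1]
perplexity = [1.9, 1.3, 1.6, 1.2, 2.0, 1.4]
print(ranked_max_fitness(perplexity, fitness, budget=2))
print(random_max_fitness(fitness, budget=2, n_sim=1000))
```

The single and double mutant baselines would use the same random procedure restricted to the corresponding subset of sequences.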

gRNAde outperforms random baselines in low design budget scenarios Figure 5.9 illus-
trates the results of our retrospective study. At low design budgets of up to hundreds of sequences,
which are relevant in the case of a low throughput fitness screening assay, gRNAde outperforms
all random baselines in terms of the maximum change in fitness over the wild type. The top 10
mutants as ranked by gRNAde contain a sequence with 4-fold improved fitness, while the top
200 contain a sequence with a 5-fold improvement.⁵ Note that gRNAde is used zero-shot here, i.e. it was not
fine-tuned on any assay data.

Perspective Overall, it is promising that gRNAde’s perplexity correlates with experimental
fitness measurements out-of-the-box (zero-shot) and can be a useful ranker of mutant fitness in
our retrospective study. In realistic design scenarios, improvements could likely be obtained by
fine-tuning gRNAde on a low amount of experimental fitness data.

This retrospective study acts as a sanity check before committing to wet lab validation
of gRNAde designs in Chapter 6. We see random mutagenesis and directed evolution-based
approaches as complementary to inverse design approaches like gRNAde [Breaker and Joyce,
1994]. Random mutagenesis can be thought of as local exploration around a wild type sequence,
optimising fitness within an ‘island’ of activity. Structure-based design approaches are akin to
global jumps in sequence space, with the potential to find new islands further away from the wild
type [Huang et al., 2016].

⁴ This RNA was not present in gRNAde’s training data, which contains structures at ≤4.0Å resolution.
⁵ As a caveat, the fitness assays from McRae et al. [2024] used for creating the landscape have inherent noise and
cannot easily differentiate between mutants of similar activity.

5.4 Related Work

RNA inverse folding Most tools for RNA inverse folding focus on secondary structure without
considering 3D geometry [Churkin et al., 2018, Runge et al., 2019] and approach the problem
from the lens of energy optimisation [Ward et al., 2023]. Rosetta fixed backbone re-design [Das
et al., 2010] is the only energy optimisation-based approach that accounts for 3D structure. Deep
neural networks such as gRNAde can incorporate 3D structural constraints and are orders of
magnitude faster than optimisation-based approaches; this is particularly attractive for high-throughput
design pipelines, as solving the inverse folding optimisation problem is NP-hard
[Bonnet et al., 2020].

RNA structure design Inverse folding models for protein design have often been coupled
with backbone generation models which design structural backbones conditioned on various
design constraints [Watson et al., 2023, Ingraham et al., 2023]. Current approaches for RNA
backbone design use classical (non-learnt) algorithms for aligning 3D RNA motifs [Han et al.,
2017, Yesselman et al., 2019], which are small modular pieces of RNA that are believed to fold
independently. Such algorithms may be restricted by the use of hand-crafted heuristics, and we
have explored the first data-driven generative models for RNA backbone design in follow-up
work [Anand et al., 2024].

RNA structure prediction There have been several recent efforts to adapt protein folding
architectures such as AlphaFold2 [Jumper et al., 2021] and RoseTTAFold [Baek et al., 2021] for
RNA structure prediction [Li et al., 2023b, Wang et al., 2023, Baek et al., 2024]. A previous
generation of models used GNNs as ranking functions together with Rosetta energy optimisation
[Watkins et al., 2020, Townshend et al., 2021]. None of these architectures aim at capturing
conformational flexibility of RNAs, unlike gRNAde which represents RNAs as multi-state
conformational ensembles. Neither can structure prediction tools be used directly for RNA
design tasks as they are not generative models.

RNA language models Self-supervised language models have been developed for predictive
and generative tasks on RNA sequences, including general-purpose models [Chen et al., 2022,
Penic et al., 2024, Zhao et al., 2024] as well as models developed for specific RNA families
[Li et al., 2023a, Sumi et al., 2024, Shulgina et al., 2024]. RNA sequence data repositories are
orders of magnitude larger than those for RNA structure (e.g. RiNALMo is trained on 36 million
sequences). However, standard language models can only implicitly capture RNA structure and
dynamics through sequence co-occurrence statistics, which can pose a challenge for designing
structured RNAs. RibonanzaNet [He et al., 2024] represents a recent effort in developing
structure-informed RNA language models by supervised training on experimental readouts from
chemical mapping, although RibonanzaNet cannot be directly used for RNA design, either.

5.5 Summary

In this chapter, we introduced gRNAde, a novel geometric deep learning model for RNA sequence
design conditioned on one or more 3D backbone structures. gRNAde represents a significant
advance over Rosetta [Leman et al., 2020], the state-of-the-art physics-based tool for 3D RNA
inverse design. On a benchmark of fixed backbone design for 14 biologically relevant RNA
structures from the PDB identified by Das et al. [2010], gRNAde obtains higher native sequence
recovery rates (56% on average) compared to Rosetta (45% on average). Additionally, gRNAde
is significantly faster, sampling 100+ designs in 1 second for an RNA of 60 nucleotides on an
A100 GPU (<10 seconds on CPU) compared to the reported hours for Rosetta on CPU.

gRNAde enables new capabilities which were previously not possible with Rosetta, including
multi-state design for structurally flexible RNAs. Multi-state gRNAde improves sequence
recovery by up to 5% over an equivalent single-state model on a benchmark of structurally flexible
RNAs, especially for surface nucleotides which undergo positional or secondary structural
changes. gRNAde’s GNN is also the first geometric deep learning architecture for explicit
multi-state biomolecule representation learning. The model is generic and can be repurposed for
other learning tasks on conformational ensembles, including multi-state protein design.

We further show that gRNAde can be used for zero-shot ranking of mutants in RNA engi-
neering campaigns. In a retrospective analysis of mutational fitness landscape data for an RNA
polymerase ribozyme [McRae et al., 2024], we show how gRNAde’s perplexity, which is inversely
related to the model’s likelihood of a sequence given a backbone structure, can be used to rank mutants based on fitness in
an unsupervised manner. We find that gRNAde outperforms random mutagenesis for improving
fitness over the wild type in low throughput scenarios.

Overall, this chapter has focused on computational evaluations of gRNAde. In the next
chapter, we will transition from retrospective in-silico benchmarks to real-world applications,
using gRNAde for practical RNA design problems with wet lab experimental validation. We will
also discuss gRNAde’s limitations and avenues for future work at the end of the next chapter.

Chapter 6

Inverse Design of RNA Structure and
Function with gRNAde

In Chapter 5, we introduced gRNAde, a geometric deep learning model for RNA inverse design
that generates sequences conditioned on target 3D structures. This chapter presents wet lab
validation of gRNAde’s capabilities through biochemical and functional experiments. We
focus on two RNA design problems with broad biological relevance: (1) Designing complex
pseudoknotted RNA structures, which are important 3D functional elements across biology but
have historically been difficult to design using existing computational methods; and (2) Going
beyond static structure and designing functional RNA enzymes (ribozymes), such as RNA
polymerases that catalyze RNA-templated RNA replication [Johnston et al., 2001].

6.1 An RNA Inverse Design Pipeline with gRNAde

Our RNA inverse design pipeline integrates gRNAde for sequence generation with RibonanzaNet
[He et al., 2024], an RNA language model for sequence-to-structural property prediction, to
identify promising designed RNA sequences. This pipeline has been calibrated through multiple
RNA design campaigns with experimentalists at Stanford University and the MRC Laboratory
of Molecular Biology. The screening metrics are selected to correlate with experimental success,
enabling high-throughput computational identification of designs that are most likely to fold into
target structures and perform desired functions.

Input The gRNAde pipeline translates a multi-modal structural ‘prompt’ into novel sequences
predicted to adopt a desired fold. This design specification, analogous to a textual prompt for
a large language model, is highly flexible; it can consist of a target pseudoknotted secondary
structure, 3D backbone coordinates, and partial sequence constraints that must be preserved
(Figure 6.1 (A)).

[Figure 6.1 graphic. Panel A: inputs (partial sequence, e.g. from a fitness landscape; target pseudoknotted secondary structure; target 3D backbone structure). Panel B: gRNAde, a structure-conditioned RNA language model (GNN structure encoder, LM decoder), producing designed sequences. Panel C: RibonanzaNet, an RNA structure foundation model. Panel D: structural metrics (MCC between predicted and target secondary structure; MAE between predicted and target SHAPE reactivity by position; OpenKnot score), with the top N designs sent to wet lab validation.]

Figure 6.1: The gRNAde pipeline for RNA inverse design.
The automated workflow integrates deep learning-based sequence generation with computational
screening to identify optimal candidates for experimental validation.

A. The pipeline takes multi-modal design constraints as input, optionally including a target pseu-
doknotted secondary structure, a target 3D backbone structure, and partial sequence constraints
such as those derived from fitness landscapes.

B. gRNAde, a structure-conditioned RNA language model, uses these constraints to generate a
large and diverse library of candidate sequences, typically on the order of one million. These
candidates are then passed to the computational filtering stage.

C. In the filtering stage, each designed sequence is evaluated by RibonanzaNet, an RNA structure
foundation model. RibonanzaNet predicts the secondary structure and a per-nucleotide SHAPE
chemical reactivity profile for each candidate.

D. RibonanzaNet predictions for each design are scored against the target secondary structure
and SHAPE profile using metrics such as the Matthews Correlation Coefficient (MCC), Mean
Absolute Error (MAE), and the OpenKnot Score. The top-ranked designs are then selected for
wet-lab synthesis and validation.

The design pipeline proceeds through three sequential stages:

Step 1: Sequence generation gRNAde generates a large number of candidate sequences
(typically 1 million) conditioned on the input structure and specified constraints (Figure 6.1 (B)).
During generation, we vary the sampling temperature and random seed to control diversity, with
lower temperatures around 0.1 producing sequences closer to the native sequence and higher
temperatures up to 1.0 yielding more diverse candidates.

Step 2: Structural profile prediction Each generated sequence is evaluated using Ribonan-
zaNet to predict its chemical reactivity profile and secondary structure (Figure 6.1 (C)), providing
computational proxies for experimental folding behavior. We describe the rationale for using
RibonanzaNet in the following paragraphs.

Step 3: Design scoring and selection Candidates are scored and filtered using the predicted
structural profiles as follows (Figure 6.1 (D)):

1. Secondary Structure Score: For natural RNA targets, we compute Matthews Correlation
Coefficient (MCC) between predicted and target secondary structures, retaining only
sequences exceeding a high correlation threshold (e.g., MCC > 0.9). This ensures high
likelihood of target pseudoknotted structure formation. We omit this criterion for synthetic
targets, as RibonanzaNet’s secondary structure predictor was finetuned on natural RNAs.

2. Chemical Reactivity Score: We quantify the Mean Absolute Error (MAE) between each se-
quence’s predicted reactivity profile and a target profile. The target profile is obtained either
from experimental measurements or by applying RibonanzaNet to the native sequence.

3. OpenKnot Score: This metric measures the likelihood of pseudoknotted structure formation
based on predicted chemical reactivity patterns, used for pseudoknot design tasks in
Section 6.2.

4. Final Selection: We rank sequences by the primary metric—typically chemical reactivity
score or OpenKnot score—and select the top N unique designs after removing duplicates.
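
The scoring and selection logic of Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: secondary structures are represented as sets of 0-indexed base pairs (i, j), MCC is computed over all possible position pairs, the chemical reactivity score is used as the primary metric, and the candidate schema (`seq`, `pairs`, `reactivity`) and all function names are hypothetical.

```python
import math

def mcc(pred_pairs, target_pairs, length):
    """Matthews Correlation Coefficient over all possible position pairs (assumed convention)."""
    pred, target = set(pred_pairs), set(target_pairs)
    tp = len(pred & target)   # pairs present in both structures
    fp = len(pred - target)   # predicted pairs absent from the target
    fn = len(target - pred)   # target pairs that were not predicted
    tn = length * (length - 1) // 2 - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def mae(pred, target):
    """Mean Absolute Error between two per-nucleotide reactivity profiles."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def select_top_n(candidates, target_pairs, target_reactivity, n, mcc_threshold=0.9):
    """Deduplicate, filter by MCC, rank by reactivity MAE (lower is better), keep top n."""
    seen, kept = set(), []
    for cand in candidates:
        if cand["seq"] in seen:
            continue  # remove duplicate sequences
        seen.add(cand["seq"])
        if mcc(cand["pairs"], target_pairs, len(cand["seq"])) > mcc_threshold:
            kept.append((mae(cand["reactivity"], target_reactivity), cand["seq"]))
    kept.sort()
    return [seq for _, seq in kept[:n]]
```

For synthetic targets, where the MCC criterion is omitted, the same sketch applies with the MCC filter skipped.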

Due to the computational efficiency of gRNAde and RibonanzaNet, we can screen a large
number of designs in parallel. We generate and score up to 1 million sequences for each design
campaign, which takes under 12 hours on a single NVIDIA A100 GPU. After
removing duplicates, this process typically yields hundreds of thousands of unique sequences,
depending on the target structure size and constraint stringency.
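
To illustrate how the sampling temperature in Step 1 trades off fidelity against diversity, the following minimal sketch samples a sequence from temperature-scaled per-position scores. It is an illustration only: gRNAde's actual decoder is not reproduced here, and the `logits` input (one row of four scores per position, ordered A, C, G, U) and the function name `sample_sequence` are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGU"

def sample_sequence(logits, temperature, rng):
    """Sample one nucleotide per position from softmax(logits / temperature)."""
    seq = []
    for row in logits:
        scaled = [x / temperature for x in row]
        top = max(scaled)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scaled]
        seq.append(rng.choices(NUCLEOTIDES, weights=weights, k=1)[0])
    return "".join(seq)

# Low temperature concentrates probability on the highest-scoring nucleotide,
# while temperature 1.0 samples from the unscaled distribution.
example_logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]
conservative = sample_sequence(example_logits, 0.1, random.Random(0))
diverse = sample_sequence(example_logits, 1.0, random.Random(0))
```

Varying the random seed at a fixed temperature yields different samples from the same distribution, which is how a large candidate library can be accumulated.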

Rationale for RibonanzaNet and chemical reactivity We selected RibonanzaNet as our
primary evaluation tool based on several key advantages. RibonanzaNet is an RNA structure
language model that was pre-trained on approximately 2 million RNA sequences paired with
experimental chemical reactivity profiles from high-throughput assays, with the objective of
predicting these profiles from sequence [He et al., 2024]. This diverse training dataset
encompasses both natural and synthetic RNAs, making it well-suited for evaluating designed
sequences.

Chemical probing assays measure per-nucleotide reactivities to small molecule modifiers,
providing information about both base pairing and tertiary interactions [Strobel et al., 2018,
Cao et al., 2024]. We utilize the 2A3 chemical modifier (2-Aminopyridine-3-carboxylic acid
imidazolide) [Marinus et al., 2021], which exhibits minimal nucleotide bias compared to other
chemical probes such as DMS, making it particularly suitable for evaluating designed sequences.
2A3 reactivity is high for unpaired and accessible nucleotides but substantially reduced for base-
paired nucleotides or those involved in tertiary interactions such as pseudoknots, providing a
robust signal for structural assessment. RibonanzaNet’s ability to predict these chemical reactivity
profiles enables quantitative evaluation of how well a designed sequence is likely to fold into the
target structure, as reactivity patterns directly reflect the underlying 3D conformation.

Furthermore, RibonanzaNet was fine-tuned on pseudoknotted secondary structures and
achieves state-of-the-art performance on secondary structure prediction benchmarks, giving us
two complementary metrics for evaluating designed sequences: predicted chemical reactivity
profiles and secondary structure predictions.

We evaluated alternative approaches including 3D structure prediction tools such as Al-
phaFold 3 [Abramson et al., 2024] and RNA-specific variants [Li et al., 2023b, Wang et al.,
2023]. However, these methods performed poorly on both native and designed sequences for our
applications, particularly for synthetic RNAs where multiple sequence alignments are unavail-
able, as accuracy is known to be substantially reduced for deep learning models without MSAs
[Das et al., 2023, Kretsch et al., 2025].

6.2 Expert-level Design of RNA Pseudoknotted Structures

6.2.1 The Pseudoknot Design Problem

RNA pseudoknots RNA pseudoknots are complex three-dimensional structural motifs formed
when single-stranded regions base pair with complementary sequences, creating interwoven
stem-loop structures. These sophisticated elements play crucial roles across biology: modulating
gene regulation through ribosomal frameshifting, enabling viral replication in SARS-CoV-2 and
other RNA viruses, and functioning as catalytic ribozymes [Staple and Butcher, 2005]. Despite
their biological significance, fundamental questions remain unresolved. The folding pathways
for pseudoknot formation, their structural dynamics, and thermodynamic properties are not
fully understood [Vicens and Kieft, 2022]. Current computational methods struggle with their
topological complexity, limiting structure prediction from sequence [Rivas and Eddy, 1999]. As
a result, the rational design of pseudoknots with specified properties remains an open challenge
with significant implications for engineering synthetic ribozymes and functional RNAs.

Eterna OpenKnot Benchmark Eterna is an online platform and video game for computational
RNA design that hosts a global community of researchers and citizen-scientists. The platform
regularly releases new RNA design challenges where participants submit sequences designed to
satisfy specific structural and functional requirements [Lee et al., 2014, Wayment-Steele et al.,
2022a,b]. Submitted designs undergo experimental validation at Prof. Rhiju Das’s lab at Stanford
University through high-throughput chemical probing assays.

In this section, we present gRNAde’s performance on the OpenKnot Benchmark, a series of
RNA design challenges hosted on Eterna that specifically target pseudoknotted structures. The
goal is to build a diverse library of experimentally validated pseudoknotted RNAs to advance
fundamental understanding of RNA folding and function. While the Eterna community of expert
designers has contributed numerous designs to these challenges over the years, the manual design
process is slow and hard to scale, limiting the exploration of pseudoknotted sequence space.

To address these limitations, we deployed gRNAde as a fully automated computational
design tool on Eterna, enabling direct performance comparison against both human experts and
other automated RNA design tools.

6.2.2 Setup

OpenKnot Round 7a and 7b The OpenKnot Benchmark consists of multiple rounds, each
focusing on new pseudoknotted RNA structure targets for which participants submit designed
sequences. We entered gRNAde into OpenKnot Round 7a and 7b, which featured RNAs up to
100 nucleotides or 240 nucleotides in length, respectively. For each round, there were a total
of 20 target structures or puzzles, including 10 structures from natural RNAs and 10 synthetic
pseudoknots. Natural targets span diverse structured RNAs, including riboswitches,
ribozymes, ribosomal RNAs, and viral frameshift elements, while synthetic targets include novel
pseudoknots that try to push the limits of theoretically possible pseudoknotted structures. For the
natural targets, we usually have a reliable 3D structure provided by the Eterna organizers, either
from the Protein Data Bank or from high-quality structure modelling tools. For the synthetic
targets, usually only secondary structure information is provided, along with less reliable 3D
models, as the ideal sequence for these targets is unknown.

Design budget and constraints For each puzzle, we submitted 40 gRNAde designs in total via
two approaches: (1) 20 sequences generated with only secondary structure constraints; and (2) 20
sequences generated with both secondary structure and 3D backbone constraints. No sequence
constraints were included, allowing gRNAde to design the entire sequence from scratch.

To provide a direct automated baseline for comparison, the organizers also independently
submitted 20 designs using Rosetta’s RNA inverse design protocol [12], the current state-of-the-
art physics-based method for 3D RNA design. As an additional sanity check, the organizers
evaluated 10 replicates of the wildtype (native) RNA sequence for each puzzle. While wildtype
sequences are expected to achieve high scores, particularly for natural RNAs, they may not
represent the optimal sequence for structure formation, which is precisely what the design
challenge aims to discover. The competition also included other contemporaneous AI-based
methods (MPNN and RFdiffusion).

Evaluation using OpenKnot Score To evaluate each submitted design, the organizers measure
an experimental OpenKnot Score (ranging from 0 to 100) that quantifies the likelihood of a
sequence forming the target pseudoknotted structure. A score above 90 indicates high confidence
that the sequence will fold into the desired structure, while scores below 90 may still represent
successful designs but with less certainty based on the chemical reactivity data.

The OpenKnot score is computed as the average of two complementary metrics that assess
different aspects of pseudoknot formation:

• The Eterna Classic Score evaluates chemical reactivity consistency across all probed
positions. Positions predicted to be base-paired but showing high chemical reactivity
(>0.5) are penalized, as are positions predicted to be unpaired but showing low reactivity
(<0.125). This metric captures how well the overall secondary structure prediction matches
the experimental chemical probing data.

• The Crossed Pair Quality Score applies the same evaluation criteria but focuses specifically
on nucleotides involved in pseudoknotted base pairs—those that cross other base pairs in
the secondary structure. This targeted assessment is crucial for pseudoknot evaluation,
as these crossing interactions define the three-dimensional topology that distinguishes
pseudoknots from simpler secondary structures.
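
The structure of the OpenKnot score described above can be sketched as follows. The reactivity thresholds (0.5 and 0.125) and the averaging of the two component scores come from the description above; everything else is an assumption made for illustration, in particular that each component is the percentage of evaluated positions that are not penalized, and that structures are given as 0-indexed base pairs (i, j) with i < j. The organizers' exact scoring formula may differ, and all function names are hypothetical.

```python
def crossing_positions(pairs):
    """Positions belonging to base pairs that cross at least one other pair."""
    crossed = set()
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l or k < i < l < j:
                crossed.update((i, j, k, l))
    return crossed

def consistency_score(reactivity, paired, positions):
    """Percentage of the given positions whose reactivity agrees with the target structure."""
    positions = list(positions)
    if not positions:
        return 0.0  # assumption: no positions to evaluate yields a score of zero
    consistent = 0
    for pos in positions:
        r = reactivity[pos]
        # Paired positions are penalized if highly reactive; unpaired if weakly reactive.
        penalized = (r > 0.5) if pos in paired else (r < 0.125)
        consistent += not penalized
    return 100.0 * consistent / len(positions)

def openknot_score_sketch(reactivity, pairs):
    """Average of the all-position score and the crossed-pair score (0 to 100)."""
    paired = {pos for pair in pairs for pos in pair}
    classic = consistency_score(reactivity, paired, range(len(reactivity)))
    crossed = consistency_score(reactivity, paired, crossing_positions(pairs))
    return 0.5 * (classic + crossed)
```

Because the crossed-pair component is restricted to pseudoknot-forming nucleotides, a single highly reactive position inside a crossed pair lowers the final score more than the same deviation at a non-crossed position would.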

It is important to note that the OpenKnot score is based on 1D chemical reactivity data, and it
is possible for designs to achieve high scores by satisfying the metric without perfectly forming
the target 3D structure; further validation, such as compensatory rescue experiments, is planned
to confirm the 3D accuracy of top OpenKnot Benchmark designs.

6.2.3 Results

gRNAde achieves expert-level performance on short targets In Figure 6.2, we present the
OpenKnot scores for designs from gRNAde, Rosetta, expert human designers, and wildtype
sequences across all 20 target structures of up to 100 nucleotides from OpenKnot Round 7a.
gRNAde achieved a 100% success rate on natural targets and 90% success rate on synthetic
targets, representing a substantial improvement over Rosetta’s 40% and 70% success rates on
natural and synthetic targets, respectively.
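
The success rates quoted here are puzzle-level statistics: as defined in the caption of Figure 6.2, a puzzle counts as solved if at least one submitted design scores above 90. A minimal sketch of this calculation is given below; the function name and the dictionary layout (puzzle identifier mapped to a list of OpenKnot scores) are hypothetical, and the scores in the example are made up purely to exercise the function.

```python
def success_rate(scores_by_puzzle, threshold=90.0):
    """Percentage of puzzles with at least one design scoring above the threshold."""
    solved = sum(
        1 for scores in scores_by_puzzle.values()
        if any(score > threshold for score in scores)
    )
    return 100.0 * solved / len(scores_by_puzzle)

# Made-up scores for four puzzles: two have a design above 90, two do not.
example_scores = {
    "puzzle_1": [85.0, 92.0],
    "puzzle_2": [70.0, 90.0],  # exactly 90 does not exceed the threshold
    "puzzle_3": [95.0],
    "puzzle_4": [40.0, 60.0],
}
example_rate = success_rate(example_scores)
```

Under this definition a method is rewarded for finding at least one good design per target, so it is complemented by the per-design score distributions shown alongside it.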

Remarkably, gRNAde matches the performance of expert human designers, who achieved
identical success rates of 100% on natural targets and 90% on synthetic targets. This is a
significant result as gRNAde is a fully automated pipeline capable of generating designs at scale,
whereas human designers typically require substantially more time to create individual designs.

gRNAde designs improve over native sequences Notably, gRNAde designs consistently
outperform wildtype sequences across most targets, with native sequences achieving lower
success rates of 80% on natural targets and 40% on synthetic targets compared to gRNAde’s
100% and 90% respectively. This demonstrates that gRNAde can design idealized sequences
that are better suited for forming target pseudoknotted structures than their naturally occurring
counterparts. While natural sequences evolved under multiple selective pressures—including
functional constraints, evolutionary history, and cellular context—gRNAde focuses exclusively
on structural optimization for the specified target.

Visualizations of chemical reactivity profiles Figure 6.2 (E) and (F) illustrate specific exam-
ples where gRNAde designs achieve superior OpenKnot scores compared to wildtype sequences
for both natural and synthetic targets from OpenKnot Round 7a. The chemical reactivity profiles
reveal a clear distinction: gRNAde-designed sequences successfully form the target pseudoknot-
ted structure, as evidenced by the characteristic reactivity patterns consistent with the intended
base pairing and tertiary interactions. In contrast, the corresponding wildtype sequences ex-
hibit reactivity profiles indicating alternative structural conformations, suggesting they fold into
non-target structures rather than the desired pseudoknot.

[Figure 6.2 graphic. Panels A-D: OpenKnot score distributions and success rates for WT, Rosetta, and gRNAde, among other methods. Panel E: P03: ZMP Riboswitch, WT (score = 90.0) vs. best gRNAde design (score = 94.3). Panel F: P11: PN.v282, WT (score = 80.5) vs. best gRNAde design (score = 97.5).]


Figure 6.2. gRNAde achieves expert-level accuracy in the Eterna OpenKnot Benchmark
for RNA pseudoknot design.

Performance of wildtype sequences, Rosetta, MPNN, RFdiffusion, gRNAde, and expert human
designers in the Eterna OpenKnot Round 7a challenge, which targeted pseudoknotted RNAs of
up to 100 nucleotides.

A, B. Results for 10 natural RNA targets. C, D. Results for 10 synthetic RNA targets. Left-sided
panels A and C show the distribution of OpenKnot scores for individual designs across all
puzzles (success threshold > 90, red dashed line). Right-sided panels B and D show the overall
success rate, defined as the percentage of puzzles for which at least one design scored above 90.
gRNAde achieves success rates of 100% (natural) and 90% (synthetic), matching expert human
performance and substantially outperforming physics-based Rosetta (40% and 70% success rates,
respectively).

E, F. Molecular validation of design success through chemical probing. Nucleotides are overlaid
on the target secondary structures and colored by reactivity, with darker reds indicating higher
reactivity and greater accessibility for unpaired positions. Conversely, nucleotides part of base
pairs and pseudoknots are expected to have lower reactivity.

E. For the natural ZMP Riboswitch target, the best gRNAde design (right, score = 94.3) shows a
chemical reactivity profile largely consistent with the target fold, whereas the wildtype sequence
(left, score = 90.0) shows anomalous reactivity in the loop region around positions 50-55,
suggesting misfolding.

F. For the synthetic PN.v282 target, the gRNAde design (right, score = 97.5) again shows a
superior reactivity pattern compared to the wildtype sequence (left, score = 80.5), which exhibits
high reactivity in various paired regions, indicating disrupted base pairing.


Figure 6.3. gRNAde maintains competitive performance on long RNA pseudoknots in the
Eterna OpenKnot Benchmark.

Performance of wildtype sequences, Rosetta, MPNN, RFdiffusion, gRNAde, and expert human
designers for the Eterna OpenKnot Round 7b challenge, which targeted pseudoknotted RNAs of
up to 240 nucleotides.

A. Distribution of OpenKnot scores for individual designs across 10 natural target puzzles
(success threshold > 90, red dashed line).

B. The success rate on natural targets, defined as the percentage of puzzles for which at least one
design scored above 90.

C. Distribution of OpenKnot scores for individual designs across 10 synthetic target puzzles
(success threshold > 90, red dashed line).

D. The success rate on synthetic targets, defined as the percentage of puzzles for which at least
one design scored above 90.

gRNAde achieves success rates of 67% on natural targets and 70% on synthetic targets. Rosetta
(physics-based) could not be evaluated due to scalability issues. Native sequences achieved very
low success rates in both categories, which demonstrates gRNAde’s ability to design idealized
sequences for complex pseudoknotted RNA structures.

Competitive performance on long targets OpenKnot Round 7b evaluates design performance
on significantly larger targets of up to 240 nucleotides, representing a substantial increase
in structural complexity compared to the short targets in Round 7a. Figure 6.3 presents the
OpenKnot scores for designs from gRNAde, expert human designers, and wildtype sequences
across all 20 target structures.

gRNAde achieved success rates of 67% on natural targets and 70% on synthetic targets.
While these results represent a decrease from the success rates observed on short targets in Round
7a, expert human designers also experienced reduced performance at this scale. Rosetta could
not be evaluated on these larger targets due to computational scalability limitations. Importantly,
gRNAde designs substantially outperformed wildtype sequences, which achieved success rates
of only 0% on natural targets and 10% on synthetic targets.

Despite lower success rates, gRNAde designs consistently achieved high median OpenKnot
scores, with most designs scoring above 80 even when falling short of the 90-point success
threshold. This suggests that gRNAde captures many of the structural requirements for these
complex targets, with room for optimization in achieving the highest confidence scores. While
gRNAde’s performance on large targets represents an area for future improvement to close the gap
with expert human designers, the fully automated nature of the approach enables high-throughput
exploration of design space that would be impractical for manual design.

6.3 Inverse Design of Functional Polymerase Ribozymes

6.3.1 The Self-replicating Ribozyme Problem

Ribozymes and the RNA World Having demonstrated gRNAde’s ability to design complex
pseudoknotted structures, we now turn to functional RNA design. Our focus will be on RNA
enzymes or ribozymes. Ribozymes perform critical structural and catalytic roles in modern cells,
including tRNA processing (RNaseP), RNA splicing (spliceosome, self-splicing introns), and
translation (ribosome) [Cech, 2024]. Furthermore, in vitro evolution has led to the discovery
of novel ribozyme activities not observed in nature [Wilson and Szostak, 1999], including
polymerase ribozymes (PRs) capable of synthesizing complementary RNA strands [Wochner
et al., 2011, Attwater et al., 2013, Tjhung et al., 2020]. Among these, polymerase ribozymes that
can replicate themselves hold particular scientific significance. Their capacity for RNA-catalyzed,
RNA-templated synthesis offers a pathway to RNA self-replication, a process central to the RNA
World hypothesis which postulates RNA as a cornerstone of early life [Woese, 1967, Orgel, 1968,
Gilbert, 1986].

Triplet-based RNA polymerase ribozymes A promising candidate for RNA self-replication
is the triplet-based RNA polymerase ribozyme (TPR) [Attwater et al., 2018], which was evolved
to use trinucleotide triphosphates (triplets) as substrates. McRae et al. [2024] recently determined
the cryo-EM structure of the TPR at 5-Å resolution and connected structure to function via a
comprehensive fitness landscape analysis.

As shown in Figure 6.4 (A), the TPR is a heterodimeric RNA composed of a catalytic subunit
(5TU) and a noncatalytic auxiliary subunit (t1), which together form a left-hand-like structure
with thumb and fingers positioned at a 70° angle. The two subunits are held together by two
kissing-loop (KL) interactions that are essential for polymerase function, as evidenced by the
dramatic fitness reduction observed when these regions are mutated. The fitness landscape,
combined with structural data, reveals that these KL interactions preorganize the TPR for optimal
function. This mechanistic understanding of the structure-function relationship in TPR makes it
an ideal candidate for rational design approaches.

In Chapter 5, we established how gRNAde can be used to rank mutants of TPR and identify
functional mutations that improve its catalytic activity. While that retrospective study validated
gRNAde’s ability to assess known variants, the mutants analyzed were obtained through directed
evolution, which typically generates sequences that are relatively close in mutational space to
the native 5TU sequence. Here, we address a more fundamental question: how diverse can
sequences be while still performing RNA-templated RNA polymerisation?

The scientific goal is not merely to maintain and, if possible, improve enzyme function, but
to test the limits of its functional sequence diversity, i.e. the maximal edit distance of the 5TU
“quasispecies” in RNA sequence [Lambert et al., 2025, Kun et al., 2005, Ekland et al., 1995].
This has direct implications for the plausibility of RNA-based self-synthesis and the emergence
of early life. We will use gRNAde to perform large-scale generative “jumps” in sequence space,
aiming to discover functional variants at mutational distances beyond those accessible through
adaptive walks using conventional directed evolution or simple rational design.

6.3.2 Setup

Input constraints Our goal is to design mutants of the 5TU catalytic subunit at varying
mutational distances from the wildtype sequence. We provide gRNAde with the 3D backbone
coordinates of the 5TU-t1 heterodimer and the corresponding pseudoknotted secondary structure
as structural input. For sequence constraints, we fix the t1 sequence since it serves as the
non-catalytic scaffolding subunit that positions 5TU for optimal function. To generate 5TU
designs at specified mutational distances from the native sequence, we define position-specific
design probabilities derived from the fitness landscape data of McRae et al. [2024]. The design
probability assignment procedure combines two complementary metrics (Figure 6.4 (B) and
Figure D.1):

1. Maximum single-mutant fitness: We bin the fitness values of the best single mutant at
each position into four categories: below -4.0 (score: 0), -4.0 to -2.0 (score: 1), -2.0 to 0.0
(score: 2), and above 0.0 (score: 3).

2. Combinability score: We assess how well mutations at each position can be combined
with other mutations to create improved higher-order variants, following Gantz et al.
[2024] (see footnote 1). This metric is binned into four categories: below 0 (score: 0), 0 to 50
(score: 1), 50 to 100 (score: 2), and above 100 (score: 3).

The final design probability for each position is calculated as:

\[
P_{\text{design}} = \frac{\text{Binned fitness score} + \text{Binned combinability score}}{6} \tag{6.1}
\]

yielding values between 0.0 and 1.0, where higher probabilities indicate positions more amenable
to mutation while maintaining function.

To preserve essential catalytic elements, we manually set the design probability to zero for
functionally critical positions: the catalytic site (positions 41-43), template binding nucleotides
(positions 22-24), and triple helix-forming adenosines (positions 25-30). These regions require
precise nucleotide identities to perform RNA copying and are therefore held fixed during design.
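
The design probability assignment can be sketched as follows. The bin edges, the division by 6, and the fixed position ranges are taken from the description above; the treatment of values that fall exactly on a bin edge (assigned to the higher bin here), the use of 1-based position numbering, and all names are assumptions made for illustration.

```python
import bisect

FITNESS_EDGES = [-4.0, -2.0, 0.0]         # maximum single-mutant fitness bins
COMBINABILITY_EDGES = [0.0, 50.0, 100.0]  # combinability score bins

# Functionally critical positions held fixed (1-based numbering assumed).
FIXED_POSITIONS = (
    set(range(41, 44))    # catalytic site, positions 41-43
    | set(range(22, 25))  # template binding nucleotides, positions 22-24
    | set(range(25, 31))  # triple helix-forming adenosines, positions 25-30
)

def binned_score(value, edges):
    """Map a value to a score in {0, 1, 2, 3} according to three bin edges."""
    return bisect.bisect_right(edges, value)

def design_probability(position, max_fitness, combinability):
    """Equation 6.1, with the probability forced to zero at fixed positions."""
    if position in FIXED_POSITIONS:
        return 0.0
    total = binned_score(max_fitness, FITNESS_EDGES) + binned_score(combinability, COMBINABILITY_EDGES)
    return total / 6
```

For example, a position whose best single mutant has fitness -3.0 (score 1) and whose combinability score is 75 (score 2) receives a design probability of 3/6 = 0.5.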

Footnote 1: The combinability score quantifies how effectively mutations at a given position can be combined with mutations
elsewhere to yield functional variants with positive, non-negatively epistatic fitness effects. It is computed as the
sum of the fitness values of all non-negatively epistatic higher-order mutants involving a position, weighted by the
mutation order.

Design budget and baselines To systematically evaluate gRNAde’s performance and ablate
the contribution of each pipeline component, we designed an experiment with the following
specifications (Figure 6.4 (C)):

Mutational distance range: We generated designs spanning mutational distances from 15
to 40 mutations relative to the native 5TU sequence (152 nucleotides long), corresponding to
roughly 10-25% sequence divergence. This represents a significant extension beyond the coverage of the
fitness landscape data from McRae et al. [2024], which had almost negligible sampling beyond 6
mutations. Exploring this extended mutational range—where fitness landscape data is sparse
or absent—presents a particularly challenging test case for designing functional sequences and
assessing the limits of structure-based design approaches.

gRNAde design generation: We generated 1 million candidate sequences by varying gR-
NAde’s sampling temperatures, random seeds, and re-sampling sequence constraints from the
design probabilities described above. After deduplication and computational filtering through
our RibonanzaNet pipeline, we selected the top 1,000 designs distributed as 40 designs per
mutational distance.

Rational design baselines: To isolate the contributions of gRNAde’s sequence generation
versus our computational filtering pipeline, we implemented two rational design heuristics as
baselines. A straightforward rational design approach commonly used in RNA design is to
randomly assign nucleotides at each position while respecting base pairing constraints: for paired
positions, sample nucleotides from valid base pairs (A-U, G-C, G-U), and for unpaired positions,
sample from all nucleotides (A, U, G, C). This strategy is simple but cannot account for 3D
structural information during design.

Using the same input constraints as gRNAde, we generated two sets of rational designs:

1. Rational design only: 1 million designs generated using position-specific nucleotide
sampling from fitness landscape probabilities, with 20 designs selected randomly per
mutational distance (500 total).

2. Rational design + RibonanzaNet filtering: The same rational generation approach, but
applying our computational filtering pipeline to select the top 20 designs per mutational
distance (500 total).

This experimental design enables direct assessment of gRNAde’s performance relative to
rational approaches while quantifying the individual contributions of our generative model and
computational filtering components.
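
The generic rational design heuristic described above, together with the notion of mutational distance used to organise the designs, can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: base pairs are given as 0-indexed tuples (i, j), mutational distance is taken to be the Hamming distance between equal-length sequences, and the position-specific design probabilities and fixed positions used in the actual baselines are omitted for brevity. All names are hypothetical.

```python
import random

VALID_PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")]

def random_rational_design(length, pairs, rng):
    """Random sequence that respects base pairing: unpaired positions are sampled
    from all four nucleotides, paired positions from valid base pairs."""
    seq = [rng.choice("ACGU") for _ in range(length)]
    for i, j in pairs:
        seq[i], seq[j] = rng.choice(VALID_PAIRS)
    return "".join(seq)

def mutational_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

# Toy example: an 8-nucleotide hairpin with three base pairs.
toy_pairs = [(0, 7), (1, 6), (2, 5)]
toy_design = random_rational_design(8, toy_pairs, random.Random(0))
```

For the 152-nucleotide 5TU sequence, 15 mutations correspond to 15/152, or about 10% of positions, and 40 mutations to 40/152, or about 26%.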

Experimental validation We validated our designs using a sequencing-based high-throughput
activity assay, similar to McRae et al. [2024], which measures the ability of variants to perform
templated synthesis of an arbitrary target RNA sequence. The setup consists of two phases:

1. Pre-selection library preparation: All designed sequences along with the wildtype 5TU
are synthesized and pooled to create an input library representing the starting population.

2. Activity selection: The library is subjected to conditions that allow only functional ri-
bozymes capable of copying to amplify, creating a post-selection library enriched for
active sequences.

For each sequence, we compute fitness as the log2 enrichment relative to wildtype:

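\[
\text{Fitness} = \log_2\left(\frac{\text{FA}_{\text{post-selection}}}{\text{FA}_{\text{pre-selection}}}\right) - \log_2\left(\frac{\text{FA}_{\text{WT, post-selection}}}{\text{FA}_{\text{WT, pre-selection}}}\right) \tag{6.2}
\]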

where fractional abundance (FA) is defined as the number of sequencing reads for a given
sequence divided by the total reads in the corresponding library. Positive fitness values indicate
sequences that replicate more efficiently than wildtype, while negative values indicate reduced
copying activity.

To ensure robust fitness estimates, we applied stringent filtering criteria: only sequences of the
expected length (152 nucleotides) are considered for analysis, and a sequence must have at least
5 counts in the pre-selection library and at least 1 count in the post-selection library to be
classified as “active”. We define two categories of activity based on a fitness value threshold:

• Active: fitness ≥ −1.86 (indicating some level of RNA polymerase activity, calibrated
using a low-throughput gel assay).

• Inactive: fitness < −1.86 (indicating reduced but non-zero self-replication) or zero reads
in post-selection libraries which cannot be assigned a fitness value.
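
The fitness calculation of Equation 6.2 and the classification rules above can be sketched as follows. The threshold of -1.86, the minimum read counts, and the expected length of 152 come from the text; labelling sequences that fail the length or pre-selection count filters as "excluded" is an assumption, as are all function and argument names.

```python
import math

ACTIVE_THRESHOLD = -1.86
EXPECTED_LENGTH = 152
MIN_PRE_COUNTS = 5

def fitness(pre, post, wt_pre, wt_post, pre_total, post_total):
    """log2 enrichment of a sequence relative to wildtype (Equation 6.2)."""
    enrichment = (post / post_total) / (pre / pre_total)
    wt_enrichment = (wt_post / post_total) / (wt_pre / pre_total)
    return math.log2(enrichment) - math.log2(wt_enrichment)

def classify(seq, pre, post, wt_pre, wt_post, pre_total, post_total):
    """Return 'active', 'inactive', or 'excluded' for one designed sequence."""
    if len(seq) != EXPECTED_LENGTH or pre < MIN_PRE_COUNTS:
        return "excluded"  # assumption: unreliable sequences are dropped from analysis
    if post < 1:
        return "inactive"  # zero post-selection reads: no fitness value can be assigned
    value = fitness(pre, post, wt_pre, wt_post, pre_total, post_total)
    return "active" if value >= ACTIVE_THRESHOLD else "inactive"
```

Because fractional abundances are normalised by library size and then referenced to wildtype, a sequence that doubles its share of reads while wildtype stays constant has a fitness of 1, and wildtype itself has a fitness of 0 by construction.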

This classification allows us to assess both the overall success rate of designs in retaining
self-replication function and the degree of activity relative to wildtype. This kind of grouping
is also useful because high-throughput sequencing based assays can be noisy, so the exact
fitness values can be hard to interpret, but we can still classify sequences as active or inactive by
calibrating them with a low-throughput gel assay (Figure 6.4 (F) and (G)).
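The filtering and classification rules above can be sketched as a single function. This is an illustration under stated assumptions: the name `classify` and the "excluded" label for sequences that fail the length or pre-selection count filters are hypothetical, since the text does not say how such sequences are reported.

```python
ACTIVITY_THRESHOLD = -1.86  # calibrated against the low-throughput gel assay
EXPECTED_LENGTH = 152
MIN_PRE_COUNTS = 5

def classify(sequence, pre_counts, post_counts, fitness_value=None):
    """Return 'excluded', 'inactive', or 'active'.

    fitness_value must be provided whenever post_counts >= 1.
    """
    if len(sequence) != EXPECTED_LENGTH or pre_counts < MIN_PRE_COUNTS:
        return "excluded"  # assumption: filtered out before analysis
    if post_counts < 1:
        return "inactive"  # zero post-selection reads: no fitness can be assigned
    return "active" if fitness_value >= ACTIVITY_THRESHOLD else "inactive"
```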

6.3.3 Results

gRNAde outperforms rational design Figure 6.4 demonstrates that the gRNAde pipeline
substantially outperforms rational design approaches both with and without computational
filtering. We quantify the individual contributions of gRNAde’s generative model and our
RibonanzaNet filtering pipeline by comparing success rates across the three design methods.

As shown in Figure 6.4 (D), for designs with up to 20 mutations from wildtype, rational design
with random selection achieves only 3% success rate (1 active design out of 100), highlighting
the difficulty of designing functional ribozymes without 3D structural guidance. Applying
computational filtering to rational designs improves success rate to 15%, demonstrating the value
of our filtering pipeline. Notably, gRNAde with the same computational filtering achieves 31.5%
success rate, a 2-fold improvement over filtered rational designs and 10-fold improvement over
unfiltered rational designs.
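The fold improvements quoted above are rounded ratios of the reported success rates. A short check, using only the percentages stated in the text (the variable names are illustrative):

```python
success_rate = {
    "gRNAde": 31.5,
    "rational design with filtering": 15.0,
    "rational design only": 3.0,
}

fold_vs_filtered = success_rate["gRNAde"] / success_rate["rational design with filtering"]
fold_vs_unfiltered = success_rate["gRNAde"] / success_rate["rational design only"]
filtering_gain = (
    success_rate["rational design with filtering"] / success_rate["rational design only"]
)
```

The ratios are 2.1 and 10.5 respectively, and filtering alone accounts for a 5-fold gain over unfiltered rational design.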

Furthermore, gRNAde discovered not only more functional variants but also variants with
higher catalytic activity on model templates. The fitness distribution of active gRNAde designs
was generally superior to that of the rational design baselines (Figure 6.4 (E)). Three active
gRNAde variants each for mutational distance ranges 15-19, 20-24, and 25-29 were further
validated via a low-throughput gel assay, showing a high Pearson correlation coefficient of 0.85
with the high-throughput fitness (Figure 6.4 (F) and (G)). Notably, variants 122 and 143 differed from
the wildtype by 18 and 19 mutations, yet exhibited a 1.6-fold and 1.1-fold enrichment in the
high-throughput assay, respectively. gRNAde retained activity even with up to 28 mutations
(Figure D.2). These results showcase the combination of sequence novelty and improved
functionality achieved with gRNAde.

gRNAde leverages 3D structural understanding in design To understand the basis for
gRNAde’s superior performance across diverse RNA design tasks, we analyzed the mutational
patterns in active ribozyme designs from the gRNAde pipeline compared to rational design
with filtering. Rational design tended to mainly mutate canonical base-paired positions while
conserving unpaired loops, a strategy that mainly preserves secondary structure but fails to
account for essential tertiary interactions (Figure 6.5 (B)). gRNAde, in contrast, generated active
designs with a more balanced mutational profile (Figure 6.5 (A)), frequently altering nucleotides
in unpaired regions and loops. In particular, it altered nucleotides in four unpaired regions: J1/3 (single-stranded
template-binding interface positioned by kissing loops), the loop region of P5 (structural scaffold
of the catalytic core), the loop region of P9 (structural scaffold of the extension domain), and the
loop region of P10 (makes critical substrate contacts [16] and shows dynamic movement towards
the active site) (Figure 6.5 (C-E)). gRNAde can successfully mutate these four regions with diverse
functions, demonstrating that by training on a diverse corpus of 3D structures, it has learned
sophisticated, non-local structure-function relationships that go far beyond simple base-pairing
rules, allowing it to successfully navigate a complex functional landscape.
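The per-position comparison of mutational patterns can be sketched as follows, assuming all designs have the same length as the wildtype so that positions align without gaps. The helper names and toy sequences are hypothetical and are not real ribozyme designs.

```python
def mutation_frequency(wildtype, designs):
    """Fraction of designs that differ from the wildtype at each position."""
    if not designs:
        return [0.0] * len(wildtype)
    counts = [0] * len(wildtype)
    for seq in designs:
        if len(seq) != len(wildtype):
            raise ValueError("designs must match the wildtype length")
        for i, (wt_base, base) in enumerate(zip(wildtype, seq)):
            if wt_base != base:
                counts[i] += 1
    return [c / len(designs) for c in counts]

def frequency_difference(wildtype, designs_a, designs_b):
    """Positive values mark positions mutated more often in designs_a."""
    freq_a = mutation_frequency(wildtype, designs_a)
    freq_b = mutation_frequency(wildtype, designs_b)
    return [a - b for a, b in zip(freq_a, freq_b)]

wt = "GGAUCC"
set_a = ["GGCUCC", "GGCACC"]  # toy sequences mutating the middle of the sequence
set_b = ["CGAUCG", "GGAUCC"]  # toy sequences mutating the ends
diff = frequency_difference(wt, set_a, set_b)
```

Positive entries in `diff` correspond to positions preferred by the first set of designs and negative entries to positions preferred by the second.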

To further contextualize gRNAde’s design strategy against human experts, we revisited the
OpenKnot Benchmark to analyze the sequence recovery of successful designs from the wildtype
or starting sequence (Figure 6.5 (F)). While gRNAde matched the structural accuracy of human
experts, its designs were significantly more distant in sequence space. The median sequence
recovery for gRNAde designs was 32%, substantially lower than the 72% observed for human
experts. This divergence demonstrates that unlike human designers, who exhibit a strong bias
toward smaller more conservative edits close to the native sequence, gRNAde can successfully
perform generative jumps in sequence space for diverse RNA targets.
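Sequence recovery, as used here, is the fraction of positions at which a design keeps the native nucleotide. A minimal sketch with toy sequences (the function name and example data are illustrative assumptions):

```python
from statistics import median

def sequence_recovery(native, design):
    """Fraction of positions where the design keeps the native nucleotide."""
    if len(native) != len(design):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(native, design))
    return matches / len(native)

native = "GGAUCCAU"
designs = ["GGAUCCAU", "GGAUGGAU", "CCUAGGAU"]  # toy examples only
recoveries = [sequence_recovery(native, d) for d in designs]
median_recovery = median(recoveries)
```

A lower median recovery indicates designs that lie further from the native sequence.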

[Figure 6.4 graphic, panels A-G. Labels recoverable from the figure:
Panels A, B: 5TU subunit (designed) and t1 subunit (fixed); structural elements P1, P3-P10, J1/3, KL1, KL2; colour scale: design probability.
Panel C: inputs are the t1+5TU 3D structure and sequence constraints from the fitness landscape. gRNAde design pipeline: 1,000 designs (top 40 per edit distance). Rational design with filtering: 500 designs (top 20 per edit distance). Rational design only: 500 designs (random 20 per edit distance). Designs proceed to high-throughput fitness screening, followed by low-throughput gel validation used to calibrate the activity threshold.
Gel lanes (primer and +CGU extension products):
Variant:        122  143   33  299  203  319  473  516  549  5TU
Edit distance:   19   18   15   22   20   22   27   27   29    0]

Figure 6.4. Generative design and functional validation of RNA polymerase ribozymes.

The gRNAde pipeline was used to design functional variants of the triplet-based RNA polymerase
ribozyme (TPR), substantially outperforming rational design baselines.

A. Cryo-EM structure of the TPR heterodimer (PDB: 8T2P), showing the catalytic 5TU subunit
(colored), which was the target for generative design, and the auxiliary t1 subunit (grey), which
was held constant. Position-specific design probabilities, derived from experimental fitness
landscape data [33], are mapped onto the 5TU structure. Lighter yellows indicate regions with a
high probability of being re-designed by gRNAde, while critical functional sites were constrained
to the wildtype sequence (indicated in darker reds).

B. Position-specific design probabilities mapped onto the 5TU secondary structure.

C. Workflow for design and validation of 5TU variants. The 3D backbone structure, along with
constraints sampled from the fitness landscape data, were input to the full gRNAde pipeline
(Figure 1) as well as two baselines: rational design with the same computational filtering as
gRNAde, and rational design without filtering. A library of 2,000 total designs was screened
via a high-throughput functional assay. The native 5TU and 9 gRNAde designs were further
validated using a low-throughput gel, which was then used to calibrate the activity threshold for
the high-throughput data.

D. Success rate of generating active designs (fitness ≥ -1.86, corresponding to variant 319)
binned by mutational distance from the wildtype sequence. At 15-19 mutations, the gRNAde
pipeline achieves a 31.5% success rate, substantially outperforming filtered rational design
(15.0%) and unfiltered rational design (3.0%).

E. Fitness distributions for all functional designs across mutational distances. The fitness of
gRNAde designs is consistently higher than that of the baseline methods, with many variants
exceeding the activity threshold (fitness ≥ -1.86).

F. Low-throughput primer extension gel assay, using the top 9 gRNAde-designed variants by
fitness in the high-throughput assay. Variant identity and edit distance from wildtype 5TU is
labelled below the gel. The gel confirms the activity of all gRNAde variants, with variants 122,
143, 33, and 203 showing activity comparable or better than the native 5TU ribozyme.

G. Correlation between high-throughput functional assay and low-throughput gel for the native
5TU sequence and 9 gRNAde-designed variants. The fitness scores are highly correlated with
the average per-junction ligation efficiency from the gel (Pearson r = 0.85), and the fitness of
the least active variant 319 is used as an activity threshold for the high-throughput assay, as it
demonstrates some ligation activity on the gel.

[Figure 6.5 graphic, panels A-F. Labels recoverable from the figure:
Panels A-E: 5TU structural elements P1, P3-P10, J1/3, KL1, KL2; tertiary structure shown in two views related by a 90° rotation. Colour scale: mutational probability difference, from "preferred by rational design" to "preferred by gRNAde"; grey denotes positions not mutated in any active designs.
Panel F title: Native Sequence Recovery of Successful OpenKnot Designs (OpenKnot Score > 90).]

Figure 6.5. Mechanistic analysis of gRNAde design strategies.

The analysis compares gRNAde’s design strategy against rational design and human experts,
demonstrating its capacity to learn non-local, 3D-informed structure-function relationships and
achieve highly sequence-divergent yet structurally accurate designs.

A-C. Per-position mutation probability (“hotspots”) for active RNA polymerase ribozyme designs
generated by gRNAde (A) and rational design with filtering (B), and the difference between
them (C).

D, E. The difference in mutation probability is mapped onto the secondary structure (D) and
tertiary structure (E) of the catalytic subunit 5TU that was the target of design. This reveals
distinct design strategies: Rational design preferentially mutates canonical base-paired positions
(blue) in active designs, whereas gRNAde identifies novel hotspots in structurally complex,
unpaired regions (red), particularly near the template-binding site (J1/3) as well as ends of helices
P5, P9, and P10.

F. Median native sequence recovery of successful designs from gRNAde and expert human
designers in OpenKnot Round 7, presented by position type (pseudoknotted, paired, unpaired,
and all positions) for natural (left) and synthetic (right) RNA targets. Across all position types,
gRNAde designs exhibit significantly lower median native sequence recovery (32%) compared to
human experts (72%). This demonstrates gRNAde’s capacity to achieve large generative jumps
in sequence space while matching expert accuracy at forming the target structure.

[Figure 6.6 graphic: three scatter plots of fitness (log enrichment) against chemical reactivity score, with points coloured by edit distance (legend bins: 19, 20-24, 25-29, 30-34, 35).
gRNAde pipeline (n = 745): r = -0.654, ρ = -0.655, R² = -2.949, P < 0.001.
Rational design w/ filtering (n = 370): r = -0.595, ρ = -0.621, R² = -4.750, P < 0.001.
Only rational design (n = 359): r = -0.441, ρ = -0.420, R² = -14.145, P < 0.001.]

Figure 6.6: Computational metrics show moderate correlation with experimental fitness.
Scatter plots showing the correlation between RibonanzaNet chemical reactivity scores and
experimental fitness values for designs across different mutational distance ranges.

Computational filters show moderate correlation with experiments Lastly, we assessed the
correlation between our computational metrics and experimental fitness values to evaluate the
reliability of our filtering pipeline (Figure 6.6). We find an average correlation of -0.563 between
RibonanzaNet chemical reactivity score and experimental fitness across all designs, indicating
moderate predictive power for identifying functional ribozymes. The moderate correlation
reflects the limitations of using predicted chemical reactivity and secondary structure as proxies
for complex 3D structure and catalytic function. Improved predictive models of RNA structure
and function will be necessary to tackle more ambitious RNA design tasks.
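The reported average of -0.563 is consistent with the unweighted mean of the three per-method Pearson coefficients shown in Figure 6.6. Treating it as such is an assumption on my part; the short check below uses only the values printed in the figure.

```python
from statistics import mean

pearson_r = {
    "gRNAde pipeline": -0.654,
    "rational design w/ filtering": -0.595,
    "only rational design": -0.441,
}

average_r = mean(pearson_r.values())
```

Note that this unweighted mean does not account for the different number of designs per method (745, 370 and 359).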

6.4 Summary

Precise control in designing RNA structure and function could transform programmable biology,
enabling new applications such as mRNA therapeutics that respond to personalized cellular
conditions [Felletti et al., 2016, Mustafina et al., 2019] and sophisticated biosensors for multi-
input detection [Choe et al., 2024]. However, progress toward these ambitious goals has been
limited by the difficulty of accurately designing sequences that fold into complex 3D structures
such as pseudoknots, which are essential for RNA functionality.

This chapter has experimentally validated gRNAde, a geometric deep learning pipeline for
RNA inverse design. First, in a blinded, community-wide competition on the Eterna platform,
gRNAde successfully designed complex RNA pseudoknot structures with an accuracy matching
that of human experts, establishing a new state-of-the-art for structural RNA design. Second,
the pipeline was used to generatively explore the functional landscape of a complex RNA
polymerase ribozyme, discovering highly active ribozymes at large sequence distances from
any known functional variant. This dual success on fundamentally different problems—one
focused on folding and structural accuracy, the other on function, embodying both structure and
dynamics—validates the power and generality of the approach.

These findings have broader implications for our understanding of both natural and engineered
biological systems. The OpenKnot results, where gRNAde’s designs proved more stable for
targeted structural goals than their native counterparts, suggest that data-driven optimization
can uncover solutions that are more idealized than those found in biology, where evolution
operates under a multitude of competing constraints. Similarly, the discovery of a diverse
and functional ribozyme quasispecies, with active variants differing by nearly 20% of their
sequence, demonstrates that the functional sequence space for complex RNA is likely larger than
anticipated, comprising (presumably) structurally similar variants at mutational distances that
would be challenging to access by directed evolution. Generative models like gRNAde provide
a powerful new tool to explore this vast, uncharted territory, providing an alternative to local
exploration by directed evolution or the limitations of human-centric rational design.

Beyond its direct applications, this work highlights the potential for a virtuous cycle in
computational biology, where the success of gRNAde in generating vast libraries of high-
quality designs has enabled the creation of new, large-scale datasets. For example, a follow-
on collaboration used gRNAde to generate 68 million plausible sequences for 1.6 million
pseudoknotted structures, a dataset orders of magnitude larger than previously possible. This
dataset then trained RibonanzaNet 2 with significantly improved accuracy, creating a powerful
feedback loop where generative models produce data to train better structure prediction models,
which can then be incorporated back into the design pipeline as more accurate filters, further
accelerating progress.

Future work We have focussed on experimental validation of single-state design in this chap-
ter. In principle, gRNAde enables inverse design of RNA sequences conditioned on multiple
conformational states. However, two key methodological directions would make multi-state
design more practically useful for real-world usage: (1) Incorporating partner molecules such as
small molecule ligands or proteins that induce conformational changes during design, enabling
applications like RNA aptamer and riboswitch design with specific unbound and bound states
[Mandal and Breaker, 2004, Mohsen et al., 2023]; and (2) Allowing specification of conforma-
tional propensities to modulate or finetune functionality [Ken et al., 2023], enabling biasing
toward desired functional states and negative design against unwanted conformations.

Additionally, RNA modelling and design tools remain trained on relatively limited datasets
compared to protein design, which can prevent broad generalization to novel targets. The limits
of current RNA 3D structure prediction tools are well-documented [Das et al., 2023, Kretsch
et al., 2025], particularly without multiple sequence alignments, which is typically the case
for designed sequences with no evolutionary history. Unlike computational filtering pipelines
for protein design, where AlphaFold provides near-experimental accuracy, we did not find it
beneficial to filter using RNA 3D structure predictors. Instead, we used RibonanzaNet [He et al.,
2024] to predict chemical reactivity profiles of designed RNAs and found modest correlations
with experimental functional measurements. While our initial pipeline combining gRNAde
designs with RibonanzaNet filtering shows promise for a highly complex RNA polymerase
ribozyme, tackling more challenging tasks like RNA interactions and multi-state design will
likely require robust 3D structure prediction capabilities.

We are optimistic that advances in RNA structure determination through computationally-
assisted cryo-EM [Kappel et al., 2020, Bonilla and Kieft, 2022] will expand available structural
data, thereby improving training of geometric deep learning models and enabling new break-
throughs in RNA design.

Chapter 7

Conclusion

This thesis introduces new Geometric Deep Learning techniques for molecular modelling and
design. In Part I, I developed unified theory and architectures for representation learning and
generative modelling of 3D molecular structures. In Part II, I introduced a novel toolkit for
inverse design of RNA molecules, a challenging and underexplored domain in molecular design.
These contributions share a common geometric foundation: representing molecular systems as
3D geometric graphs with inherent physical symmetries and transformation behaviors, which are
incorporated explicitly or implicitly into the modelling.

Overall, I aimed to integrate principled approaches to representation learning and generative
modelling into practical, wet lab validated frameworks for real-world molecular design.

7.1 Summary of contributions

Part I: Molecular Representation Learning and Generative Modelling

Chapter 3 presents the Geometric Weisfeiler-Leman (GWL) test, which extends the classic
Weisfeiler-Leman graph isomorphism algorithm to geometric graphs while preserving 3D sym-
metries. This framework unifies so far disparate classes of Geometric GNN architectures for
molecular representation learning. GWL provides mechanistic insights into the expressive power
of these architectures, highlighting advantages of equivariant models over invariant ones and the
role of higher-order representations in discriminating 3D structures. I complement the theoretical
framework with synthetic experiments and a benchmark on protein function prediction.
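For reference, the classic (non-geometric) Weisfeiler-Leman colour refinement that GWL builds on can be sketched as below. This is the standard 1-WL procedure on plain graphs, not the GWL test from Chapter 3, which additionally accounts for 3D geometric information. The function name and graph encoding are illustrative assumptions.

```python
def wl_histograms(graphs, iterations=3):
    """Run 1-WL colour refinement jointly on several graphs.

    Each graph is a dict {node: [neighbours]}. Colours are relabelled with a
    palette shared across graphs so that the resulting histograms are comparable.
    Returns one sorted list of final colours per graph.
    """
    colours = [{v: 0 for v in g} for g in graphs]  # uniform initial colouring
    for _ in range(iterations):
        signatures = [
            {v: (c[v], tuple(sorted(c[u] for u in g[v]))) for v in g}
            for g, c in zip(graphs, colours)
        ]
        all_signatures = sorted({s for d in signatures for s in d.values()})
        palette = {s: i for i, s in enumerate(all_signatures)}
        colours = [{v: palette[s] for v, s in d.items()} for d in signatures]
    return [sorted(c.values()) for c in colours]

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

same = wl_histograms([hexagon, two_triangles])
different = wl_histograms([path, triangle])
```

The hexagon and the pair of triangles receive identical colour histograms, a well-known failure case of 1-WL, whereas the path and the triangle are told apart.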

Chapter 4 proposes the All-atom Diffusion Transformer (ADiT), the first unified generative
model for both periodic crystals and non-periodic molecular systems. ADiT embeds 3D molecu-
lar structures into a shared latent space, where it learns to sample new latents and then decodes
them to valid structures. ADiT’s latent diffusion approach enables transfer learning from diverse
chemical spaces, achieving state-of-the-art performance on molecular and crystal generation
benchmarks. Built on the standard Transformer, ADiT shows predictable scaling behaviors up
to half a billion parameters, positioning it as a promising foundation model architecture for
molecular generation.

Part II: RNA Molecule Design

Chapter 5 introduces gRNAde, a novel generative RNA inverse design toolkit. gRNAde is a
structure-conditioned RNA language model that uses a multi-state Geometric GNN to generate
sequences conditioned on one or more 3D backbone structures, explicitly accounting for the
conformational diversity of RNA molecules. gRNAde significantly improves both performance
and speed over state-of-the-art physics-based methods in computational benchmarks. gRNAde
also enables new capabilities such as multi-state design and zero-shot ranking in RNA engineering
campaigns.

Chapter 6 presents wet lab experimental validation of gRNAde for real-world RNA design
problems. gRNAde successfully designs diverse pseudoknotted RNA structures with significantly
higher success rates than physics-based methods, matching the performance of expert human
designers while being fully automated and scalable. Most significantly, gRNAde enables the
design of functional RNA enzymes (ribozymes) and systematically explores sequence diversity
that retains biological function—capabilities that substantially exceed current rational design
approaches. Together, these results establish gRNAde as a powerful tool for designing RNA
structures with specific biological functions, opening new frontiers in RNA engineering.

7.2 Discussion

A central theme of this thesis is the interplay between the physical symmetries that govern
molecular systems and whether to implement these symmetries as inductive biases in deep
learning architectures [Bronstein et al., 2021]. I would like to conclude with a reflection on the
engineering and computational aspects of molecular modelling, particularly the notion of the
hardware lottery [Hooker, 2021]: the marriage of architectures and hardware that determines
which research ideas rise to prominence, and its connection to the bitter lesson in AI research
[Sutton, 2019].

These discussions focus on Transformers and GNNs, as well as roto-translation equivariance
versus learning symmetries at scale. These are the two main architectural paradigms that I have
explored through this thesis.

Transformers are winning the hardware lottery Transformers are GNNs which implement
a fully-connected message passing scheme via dense matrix multiplications [Joshi, 2025]. In
contrast, GNNs typically implement sparse message passing over locally connected structures
via scatter-gather operations, which are significantly slower on modern GPUs for size ranges
of typical molecular structures. Additionally, state-of-the-art equivariant GNNs for molecular
systems rely on higher-order tensor representations to achieve maximum expressivity while
preserving symmetries, as discussed in Chapter 3. This results in a significant increase in memory
usage and computational complexity, making equivariant networks orders of magnitude slower
to train and scale up than standard Transformers on current hardware.
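The relationship between the two aggregation schemes can be illustrated with a toy example in pure Python. On a fully connected graph, a dense weighted sum (one matrix multiplication) and a sparse gather/scatter-add over an explicit edge list compute the same result. The function names are hypothetical, and the sketch says nothing about speed: the performance gap discussed above arises from how GPUs execute these two access patterns.

```python
def dense_aggregate(weights, features):
    """Fully connected aggregation: out = weights @ features."""
    n, d = len(features), len(features[0])
    return [
        [sum(weights[i][j] * features[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]

def sparse_aggregate(edges, edge_weights, features):
    """Sparse aggregation: gather source features per edge, scatter-add into targets."""
    n, d = len(features), len(features[0])
    out = [[0.0] * d for _ in range(n)]
    for (src, dst), w in zip(edges, edge_weights):
        for k in range(d):
            out[dst][k] += w * features[src][k]  # scatter-add into the target node
    return out

features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
weights = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]  # e.g. attention rows

# Fully connected edge list (including self-loops), as (source, target) pairs.
edges = [(j, i) for i in range(3) for j in range(3)]
edge_weights = [weights[dst][src] for (src, dst) in edges]

dense = dense_aggregate(weights, features)
sparse = sparse_aggregate(edges, edge_weights, features)
```

Dropping edges from `edges` recovers local message passing, which is where the sparse formulation becomes necessary.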

The evolution from AlphaFold 2 [Jumper et al., 2021] to AlphaFold 3 [Abramson et al.,
2024] exemplifies a paradigm shift in recent years. The AlphaFold 3 architecture is simpler
than that of AlphaFold 2, which explicitly incorporated roto-translation equivariance
when predicting 3D coordinates of protein structures. Instead, AlphaFold 3 uses a largely
standard Transformer architecture and data augmentation when learning to predict 3D coordinates.
This approach is easier to scale than previous approaches and generalizes naturally to all-atom
biomolecular complexes. AlphaFold 3 is a very effective demonstration of geometric
symmetries learnt at scale using a sufficiently expressive model.
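Learning symmetries through data augmentation can be sketched generically: each training example is centred and randomly rotated before being shown to an unconstrained network. The code below is a minimal illustration of that idea in pure Python, not the actual AlphaFold 3 procedure; the function names are hypothetical.

```python
import math
import random

def random_rotation(rng):
    """Rotation matrix built from a random unit quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def augment(coords, rng):
    """Centre the coordinates at the origin, then rotate every atom identically."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    centred = [[p[k] - centroid[k] for k in range(3)] for p in coords]
    rot = random_rotation(rng)
    return [[sum(rot[r][k] * p[k] for k in range(3)) for r in range(3)] for p in centred]

rng = random.Random(0)
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.5], [1.0, 1.0, 1.0]]
rotated = augment(coords, rng)
```

The augmented structure has the same interatomic distances as the input, so the network sees many orientations of the same underlying geometry.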

In the near term, the hardware lottery will likely favour Transformers, which
are likely to be the architecture of choice for molecular foundation models trained on large
datasets and scaled to billions of parameters. Training equivariant networks at such scales would
be prohibitively expensive at present.

A problem-centric approach to architectures It would be naive to conclude that equivariant
networks are inferior to unconstrained architectures. The choice of inductive biases depends
fundamentally on the problem at hand. When data is limited or strict symmetry guarantees
are essential, such as in molecular simulation and property prediction, explicitly enforcing
symmetries provides greater data efficiency and generalization.

For instance, equivariant GNNs with higher-order tensors are the current state-of-the-art in
interatomic potentials for molecular simulation [Batatia et al., 2023, Wood et al., 2025]. For most
practical applications in molecular simulations, models must learn physically meaningful and
smooth energy landscapes [Bigi et al., 2025, Fu et al., 2025]. Here, equivariant representations
that transform predictably under roto-translation provide essential inductive biases for capturing
the underlying physical phenomenon governing the dynamics [Musil et al., 2021].

In contrast, when large-scale training data is available and exact symmetry guarantees are
not crucial, implicit or learned symmetry constraints can have an advantage. Diffusion-based
generative models, as demonstrated in Chapter 4, exemplify this scenario. In diffusion models, a
denoiser network learns the underlying data distribution by observing molecular structures under
varying noise levels and iteratively reconstructing valid configurations. What matters most is that
the denoiser produces valid molecular structures given noisy inputs. If the denoiser produces
different outputs from rotated versions of the same noisy input, this may not be problematic as
long as both outputs represent physically plausible structures.

An important insight I have developed while training diffusion models is that learning
from each sample in the data distribution under different noise levels is crucial for optimal
generative modelling. This boils down to performing as many epochs of training as possible with
a sufficiently expressive denoiser, as approximate roto-translation equivariance often emerges in
unconstrained networks when trained at scale. Since the noisy intermediate training steps do not
represent physically meaningful structures, the inductive bias of explicit equivariance becomes
less critical. This phenomenon helps explain the strong performance of recent Transformer-based
diffusion models for molecular generation [Wang et al., 2024, Abramson et al., 2024, Joshi et al.,
2025a]. The hardware lottery enables Transformers to be trained for many more iterations than
equivariant networks within the same computational budget, leading to improved performance.

Overall, roto-translational equivariance is a powerful inductive bias and strong guarantee of
physical correctness. At the same time, equivariance can also be viewed as a hard constraint
that ultimately limits model expressivity. A similar argument can be made regarding locality in
GNNs versus global attention in Transformers [Joshi, 2025].

Through this thesis, I have ultimately arrived at a pragmatic perspective: architectures are
tools for solving problems, and the choice of architecture should be driven by the problem at
hand, the available data, and the computational resources.

7.3 Future Directions

The work conducted in this thesis has prepared me to work on exciting new frontiers in biomolec-
ular modelling and design. I highlight two interconnected directions that I believe will be crucial
for advancing the field, united by a fundamental insight: nature performs computation through
transitions between molecular states, often coupled to chemical reactions and intermolecular
interactions [Al-Hashimi, 2023].

To tackle the most interesting scientific problems in molecular biology, we need new tech-
niques for representing dynamic biological processes. This will necessitate closer collaborations
between AI researchers and experimental biologists, with the aim of jointly designing both the
dataset generation processes and the machine learning models.

7.3.1 Representing and Designing Conformational Dynamics

Multi-state conformational changes and dynamics are fundamental to the function of almost
every biologically relevant molecule, from antibodies and membrane receptors to biocatalysis in
proteins and RNA [Henzler-Wildman and Kern, 2007, Ganser et al., 2019]. An ideal computa-
tional representation of molecular systems must therefore account for both geometric structures
and temporal dynamics [Carugo and Djinović-Carugo, 2023, Lane, 2023]. However, existing
approaches, including the ones presented in this thesis, generally focus on static representations.
The next frontiers in molecular modelling lie in representing multi-state ensembles and
transition dynamics of conformational changes.

I am optimistic about two possible approaches towards addressing this challenge. The first is the
integration of machine learning interatomic potentials (MLIPs) for molecular dynamics with
property prediction and generative models. Recent ‘universal’ MLIPs have demonstrated remarkable
accuracy in approximating quantum mechanics calculations for simulating biomolecular systems
[Kovács et al., 2025, Wood et al., 2025]. An interesting question is whether representations
learned by universal MLIPs can be predictive of dynamical and functional properties of de novo
designed molecules beyond natural systems. If true, MLIPs could enable new capabilities in
molecular design, including dynamics-informed generation via conditioning, or accelerating the
screening of generated designs with desired dynamical properties.

Second, integrating experimental data that explicitly captures conformational flexibility
and dynamics can further advance molecular representations. For example, cryo-EM density
maps from structure determination methods [Jamali et al., 2024], or high-throughput structural
assays such as cross-linking mass spectrometry for proteins [O’Reilly and Rappsilber, 2018] and
chemical probing for RNA [Strobel et al., 2018, Cao et al., 2024] can provide complementary
information to static structures from databases like the PDB.

Ultimately, we must move beyond training models solely on static structures towards
understanding the dynamic behaviour of biomolecules [Wayment-Steele et al., 2025] and rational
design of functional multi-state systems [Praetorius et al., 2023].

7.3.2 Black-box Data for Lab-in-the-loop Design

Structure-driven molecular design is emerging as a powerful paradigm in biochemistry [Watson
et al., 2023, Schneuing et al., 2024] and materials science [Zeni et al., 2025]. Notably, the Nobel
Prize in Chemistry 2024 recognized computational protein design and structure prediction. The
research in Chapter 5 on RNA structure design was inspired by this Nobel Prize winning work. I
have been fortunate to interact and collaborate with leading RNA biologists to experimentally
validate our RNA design models.

These conversations have made it clear that structure-based design is not a universally
applicable paradigm for molecular design in RNA biology and beyond. It works well only when
there is an established structural basis for function and when high-quality structural data is
available. Many of the most important biological problems may not fit this mould.

Thus, I believe the next frontier in molecular design will extend beyond current structure-
based approaches or augment them with complementary data sources. There is growing excite-
ment about ‘black-box’ experimental datasets from high-throughput assays to connect sequence
with function, specifically created for training machine learning models [Porebski et al., 2024,
Bronstein and Naef, 2024]. When combined with a lab-in-the-loop setup [Frey et al., 2025], we
can enable iterative testing and improvement of molecular design models in the real world.

In fact, these ideas hold particular promise for RNA, where next-generation sequencing
can measure structural and functional properties at unprecedented scale and relatively low cost
[Strobel et al., 2018, He et al., 2024].

In the future, I am excited to jointly design both data generation and model development
processes from the ground up, together with experimentalists and AI researchers. Ultimately,
I strongly believe that close collaboration and antedisciplinary science [Eddy, 2005] will be
essential for asking the most interesting scientific questions and unlocking the secrets of life.

References

J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger,
L. Willmore, A. J. Ballard, J. Bambrick, et al. Accurate structure prediction of biomolecular
interactions with alphafold 3. Nature, 2024. (Cited on page 11, 75, 77, 91, 93, 118, 141, 142)

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida,
J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint
arXiv:2303.08774, 2023. (Cited on page 13, 31)

B. Adamczyk, M. Antczak, and M. Szachniuk. Rnasolo: a repository of cleaned pdb-derived rna
3d structures. Bioinformatics, 2022. (Cited on page 107)

H. M. Al-Hashimi. Turing, von neumann, and the computational architecture of biological
machines. Proceedings of the National Academy of Sciences, 2023. (Cited on page 142)

M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying
framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023. (Cited on page
49)

B. Alberts, R. Heald, A. Johnson, D. Morgan, M. Raff, K. Roberts, and P. Walter. Molecular
biology of the cell: seventh international student edition with registration card. WW Norton
& Company, 2022. (Cited on page 23)

R. F. Alford, A. Leaver-Fay, J. R. Jeliazkov, M. J. O’Meara, F. P. DiMaio, H. Park, M. V.
Shapovalov, P. D. Renfrew, V. K. Mulligan, et al. The rosetta all-atom energy function for
macromolecular modelling and design. Journal of chemical theory and computation, 2017.
(Cited on page 23)

U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications.
In ICLR, 2021. (Cited on page 66)

R. Anand, C. K. Joshi, A. Morehead, A. R. Jamasb, C. Harris, S. Mathis, K. Didi, B. Hooi, and
P. Liò. Rna-frameflow: Flow matching for de novo 3d rna backbone design. In Machine
Learning for Computational Biology (MLCB), 2024. (Cited on page 113)

B. Anderson, T. S. Hy, and R. Kondor. Cormorant: Covariant molecular neural networks.
NeurIPS, 2019. (Cited on page 40)

N. Ashcroft and N. D. Mermin. Solid State Physics. Saunders College Publishing, 1976. (Cited
on page 22)

J. Attwater, A. Wochner, and P. Holliger. In-ice evolution of rna polymerase ribozyme activity.
Nature chemistry, 2013. (Cited on page 126)

J. Attwater, A. Raguram, A. S. Morgunov, E. Gianni, and P. Holliger. Ribozyme-catalysed rna
synthesis using triplet building blocks. Elife, 2018. (Cited on page 126)

S. Axelrod and R. Gomez-Bombarelli. Geom, energy-annotated molecular conformations for
property prediction and molecular generation. Scientific Data, 2022. (Cited on page 84)

L. Babai, P. Erdos, and S. M. Selkow. Random graph isomorphism. SIAM Journal on Computing,
1980. (Cited on page 55)

M. Baek, F. DiMaio, I. Anishchenko, J. Dauparas, et al. Accurate prediction of protein structures
and interactions using a three-track neural network. Science, 2021. (Cited on page 69, 113)

M. Baek, R. McHugh, I. Anishchenko, H. Jiang, D. Baker, and F. DiMaio. Accurate prediction
of protein–nucleic acid complexes using rosettafoldna. Nature Methods, 2024. (Cited on page
113)

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align
and translate. In ICLR, 2015. (Cited on page 31, 44)

A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi. Gaussian approximation potentials: The
accuracy of quantum mechanics, without the electrons. Physical review letters, 2010. (Cited
on page 41, 43)

A. P. Bartók, R. Kondor, and G. Csányi. On representing chemical environments. Physical
Review B, 2013. (Cited on page 41, 43, 57, 60)

A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti.
Machine learning unifies the modelling of materials and molecules. Science advances, 2017.
(Cited on page 43, 77)

I. Batatia, S. Batzner, D. P. Kovács, A. Musaelian, G. N. Simm, R. Drautz, C. Ortner, B. Kozinsky,
and G. Csányi. The design space of e (3)-equivariant atom-centered interatomic potentials.
arXiv preprint, 2022a. (Cited on page 41)

I. Batatia, D. P. Kovács, G. N. Simm, C. Ortner, and G. Csányi. Mace: Higher order equivariant
message passing neural networks for fast and accurate force fields. In NeurIPS, 2022b. (Cited
on page 40, 41, 43, 59, 63, 64, 68, 70)

I. Batatia, P. Benner, Y. Chiang, A. M. Elena, D. P. Kovács, J. Riebesell, X. R. Advincula,
M. Asta, M. Avaylon, W. J. Baldwin, et al. A foundation model for atomistic materials
chemistry. arXiv preprint arXiv:2401.00096, 2023. (Cited on page 11, 92, 141)

P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski,
A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep
learning, and graph networks. arXiv preprint, 2018. (Cited on page 12, 29)

F. Battiston, G. Cencetti, I. Iacopini, V. Latora, M. Lucas, A. Patania, J.-G. Young, and G. Petri.
Networks beyond pairwise interactions: Structure and dynamics. Physics reports, 2020. (Cited
on page 25)

S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt,
and B. Kozinsky. E (3)-equivariant graph neural networks for data-efficient and accurate
interatomic potentials. Nature communications, 2022. (Cited on page 12, 29, 43, 53)

J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional
potential-energy surfaces. Physical review letters, 2007. (Cited on page 11, 43)

J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al.
Improving image generation with better captions, 2023. URL
https://cdn.openai.com/papers/dall-e-3.pdf. (Cited on page 47, 92)

F. Bigi, M. Langer, and M. Ceriotti. The dark side of the forces: assessing non-conservative force
models for atomistic machine learning. In International Conference on Machine Learning
(ICML), 2025. (Cited on page 43, 141)

C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler
and lehman go cellular: Cw networks. NeurIPS, 2021a. (Cited on page 56)

C. Bodnar, F. Frasca, Y. Wang, N. Otter, G. F. Montufar, P. Lio, and M. Bronstein. Weisfeiler
and lehman go topological: Message passing simplicial networks. In ICML, 2021b. (Cited on
page 56)

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein,
J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. S.
Chatterji, A. S. Chen, K. A. Creel, J. Davis, D. Demszky, C. Donahue, M. Doumbouya,
E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. E.
Gillespie, K. Goel, N. D. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson,
J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. F. Icard, S. Jain, D. Jurafsky, P. Kalluri,
S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. S. Krass, R. Krishna,
R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li,
T. Ma, A. Malik, C. D. Manning, S. P. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair,
A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. F. Nyarko,
G. Ogut, L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan,
R. Reich, H. Ren, F. Rong, Y. H. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa,
K. Santhanam, A. Shih, K. P. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr,
R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. A.
Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, and P. Liang. On the
opportunities and risks of foundation models. ArXiv, 2021. (Cited on page 12, 13)

S. L. Bonilla and J. S. Kieft. The promise of cryo-em to explore rna structural dynamics. Journal
of Molecular Biology, 2022. (Cited on page 137)

E. Bonnet, P. Rzazewski, and F. Sikora. Designing rna secondary structures is hard. Journal of
Computational Biology, 2020. (Cited on page 113)

F. Boyles, C. M. Deane, and G. M. Morris. Learning from the ligand: using ligand-based features
to improve binding affinity prediction. Bioinformatics, 2019. (Cited on page 70)

J. Brandstetter, R. Hesselink, E. van der Pol, E. J. Bekkers, and M. Welling. Geometric and
physical quantities improve e(3) equivariant message passing. In ICLR, 2022. (Cited on page
40, 41)

R. R. Breaker and G. F. Joyce. Inventing and improving ribozyme function: rational design
versus iterative selection methods. Trends in biotechnology, 1994. (Cited on page 112)

M. Bronstein and L. Naef. The road to biology 2.0 will pass through black-box data. Towards
Data Science, 2024. (Cited on page 143)

M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids, groups,
graphs, geodesics, and gauges. arXiv preprint, 2021. (Cited on page 11, 25, 62, 140)

T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman,
E. Luhman, et al. Video generation models as world simulators. 2024. (Cited on page 47, 92)

D. Buterez, J. P. Janet, S. J. Kiddle, and P. Liò. Mf-pcba: Multifidelity high-throughput screening
benchmarks for drug discovery and machine learning. Journal of Chemical Information and
Modeling, 2023. (Cited on page 44)

M. Buttenschoen, G. M. Morris, and C. M. Deane. Posebusters: Ai-based docking methods fail
to generate physically valid poses or generalise to novel sequences. Chemical Science, 2024.
(Cited on page 85, 182)

M. Buttenschoen, Y. Ziv, G. M. Morris, and C. Deane. An evaluation of unconditional 3d
molecular generation methods. In ICLR Workshop on Generative and Experimental Perspectives for
Biomolecular Design, 2025. (Cited on page 89)

A. Campbell, J. Yim, R. Barzilay, T. Rainforth, and T. Jaakkola. Generative flows on discrete
state-spaces: Enabling multimodal flows with applications to protein co-design. In Forty-first
International Conference on Machine Learning, 2024. (Cited on page 77)

X. Cao, Y. Zhang, Y. Ding, and Y. Wan. Identification of rna structures and their roles in rna
functions. Nature Reviews Molecular Cell Biology, 2024. (Cited on page 118, 143)

O. Carugo and K. Djinović-Carugo. Structural biology: A golden era. PLoS Biology, 2023.
(Cited on page 142)

T. R. Cech. The Catalyst: RNA and the Quest to Unlock Life’s Deepest Secrets. WW Norton &
Company, 2024. (Cited on page 13, 23, 99, 126)

J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen, et al.
Interpretable rna foundation model from unannotated data for highly accurate rna structure
and function predictions. arXiv preprint, 2022. (Cited on page 113)

Z. Chen, S. Villar, L. Chen, and J. Bruna. On the equivalence between graph isomorphism testing
and function approximation with gnns. NeurIPS, 2019. (Cited on page 56)

C. Choe, J. O. Andreasson, F. Melaine, W. Kladwang, M. J. Wu, F. Portela, R. Wellington-Oguri,
J. J. Nicol, H. K. Wayment-Steele, M. Gotrik, et al. Compact rna sensors for increasingly
complex functions of multiple inputs. bioRxiv, 2024. (Cited on page 135)

A. E. Chu, J. Kim, L. Cheng, G. El Nesr, M. Xu, R. W. Shuai, and P.-S. Huang. An all-atom
protein generative model. Proceedings of the National Academy of Sciences, 2024. (Cited on
page 93)

A. Churkin, M. D. Retwitzer, V. Reinharz, Y. Ponty, J. Waldispühl, and D. Barash. Design of
rnas: comparing programs for inverse rna folding. Briefings in bioinformatics, 2018. (Cited
on page 99, 113)

G. Corso, B. Jing, R. Barzilay, T. Jaakkola, et al. Diffdock: Diffusion steps, twists, and turns for
molecular docking. In International Conference on Learning Representations, 2023. (Cited
on page 69, 90, 92)

X. Dai, J. Hou, C.-Y. Ma, S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang,
A. Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a
haystack. arXiv preprint arXiv:2309.15807, 2023. (Cited on page 92)

A. Daigavane, S. E. Kim, M. Geiger, and T. Smidt. Symphony: Symmetry-equivariant
point-centered spherical harmonics for 3d molecule generation. In The Twelfth International
Conference on Learning Representations, 2024. (Cited on page 85, 182)

T. R. Damase, R. Sukhovershin, C. Boada, F. Taraballi, R. I. Pettigrew, and J. P. Cooke. The
limitless future of rna therapeutics. Frontiers in bioengineering and biotechnology, 2021.
(Cited on page 99)

R. Das, J. Karanicolas, and D. Baker. Atomic accuracy in predicting and designing noncanonical
rna structure. Nature methods, 2010. (Cited on page 107, 108, 109, 113, 114, 192)

R. Das, R. C. Kretsch, A. J. Simpkin, T. Mulvaney, P. Pham, R. Rangan, F. Bu, R. M. Keegan,
M. Topf, D. J. Rigden, et al. Assessment of three-dimensional rna structure prediction in
casp15. Proteins: Structure, Function, and Bioinformatics, 2023. (Cited on page 118, 136)

J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. Wicky,
A. Courbet, R. J. de Haas, N. Bethel, et al. Robust deep learning based protein sequence
design using proteinmpnn. Science, 2022. (Cited on page 11, 53, 99, 101, 102, 105, 106)

D. W. Davies, K. T. Butler, A. J. Jackson, J. M. Skelton, K. Morita, and A. Walsh. Smact:
Semiconducting materials by analogy and chemical theory. Journal of Open Source Software,
2019. (Cited on page 181)

W. K. Dawson, M. Maciejczyk, E. J. Jankowska, and J. M. Bujnicki. Coarse-grained modelling
of rna 3d structure. Methods, 2016. (Cited on page 101)

V. Delle Rose, A. Kozachinskiy, C. Rojas, M. Petrache, and P. Barceló. Three iterations of (d-
1)-wl test distinguish non isometric clouds of d-dimensional points. NeurIPS, 2023. (Cited on
page 63, 73)

B. Deng, P. Zhong, K. Jun, J. Riebesell, K. Han, C. J. Bartel, and G. Ceder. Chgnet as a pretrained
universal neural network potential for charge-informed atomistic modelling. Nature Machine
Intelligence, 2023. (Cited on page 181)

A. Derrow-Pinion, J. She, D. Wong, O. Lange, T. Hester, L. Perez, M. Nunkesser, S. Lee, X. Guo,
B. Wiltshire, et al. Eta prediction with graph neural networks in google maps. In Proceedings
of the 30th ACM International Conference on Information & Knowledge Management, 2021.
(Cited on page 29)

P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural
information processing systems, 2021. (Cited on page 50, 92)

F. Di Giovanni, L. Giusti, F. Barbero, G. Luise, P. Lio, and M. M. Bronstein. On over-squashing in
message passing neural networks: The impact of width, depth, and topology. In International
Conference on Machine Learning. PMLR, 2023. (Cited on page 32)

S. Dieleman. Guidance: a cheat code for diffusion models, 2022. URL
https://benanne.github.io/2022/05/26/guidance.html. (Cited on page 50)

S. Dieleman. Generative modelling in latent space, 2025. URL
https://sander.ai/2025/04/15/latents.html. (Cited on page 50, 94)

P. A. M. Dirac. Quantum mechanics of many-electron systems. Proceedings of the Royal Society
of London. Series A, Containing Papers of a Mathematical and Physical Character, 1929.
(Cited on page 43)

J. Dong, B. Feng, D. Guessous, Y. Liang, and H. He. Flex attention: A programming model for
generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2024. (Cited on
page 32)

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for
image recognition at scale. In 9th International Conference on Learning Representations,
ICLR, 2021. (Cited on page 31, 42)

J. A. Doudna and E. Charpentier. The new frontier of genome engineering with crispr-cas9.
Science, 2014. (Cited on page 99)

R. Drautz. Atomic cluster expansion for accurate and transferable interatomic potentials. Physical
Review B, 2019. (Cited on page 41, 73)

W. Du, H. Zhang, Y. Du, Q. Meng, W. Chen, N. Zheng, B. Shao, and T.-Y. Liu. Se (3) equivariant
graph neural networks with complete local frames. In ICML, 2022. (Cited on page 75)

Y. Du, A. R. Jamasb, J. Guo, T. Fu, C. Harris, Y. Wang, C. Duan, P. Liò, P. Schwaller, and T. L.
Blundell. Machine learning-aided generative molecular design. Nature Machine Intelligence,
2024. (Cited on page 44)

G. Dusson, M. Bachmayr, G. Csanyi, R. Drautz, S. Etter, C. van der Oord, and C. Ortner. Atomic
cluster expansion: Completeness, efficiency and stability. arXiv preprint, 2019. (Cited on
page 64, 73)

A. Duval, S. V. Mathis, C. K. Joshi, V. Schmidt, S. Miret, F. D. Malliaros, T. Cohen, P. Liò,
Y. Bengio, and M. Bronstein. A hitchhiker’s guide to geometric gnns for 3d atomic systems.
arXiv preprint, 2023a. (Cited on page 16, 26, 29, 33, 34, 40, 61, 92)

A. Duval, V. Schmidt, A. Hernández-García, S. Miret, F. D. Malliaros, Y. Bengio, and D. Rolnick.
Faenet: Frame averaging equivariant GNN for materials modeling. In International Conference
on Machine Learning, ICML, 2023b. (Cited on page 42)

V. P. Dwivedi and X. Bresson. A generalization of transformer networks to graphs. arXiv
preprint arXiv:2012.09699, 2020. (Cited on page 32)

V. P. Dwivedi, C. K. Joshi, A. T. Luu, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking
graph neural networks. JMLR, 2023. (Cited on page 56)

N. Dym and H. Maron. On the universality of rotation equivariant point cloud networks. In
ICLR, 2020. (Cited on page 73)

S. R. Eddy. “antedisciplinary” science. PLoS computational biology, 2005. (Cited on page 144)

E. H. Ekland, J. W. Szostak, and D. P. Bartel. Structurally complex and highly active rna ligases
derived from random rna sequences. Science, 1995. (Cited on page 126)

A. A. Elhag, T. K. Rusch, F. Di Giovanni, and M. Bronstein. Relaxed equivariance via multitask
learning. arXiv preprint arXiv:2410.17878, 2024. (Cited on page 42)

P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis.
In Computer Vision and Pattern Recognition (CVPR), 2021. (Cited on page 94)

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer,
F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In
International Conference on Machine Learning, 2024. (Cited on page 47, 92)

M. Felletti, J. Stifel, L. A. Wurmthaler, S. Geiger, and J. S. Hartig. Twister ribozymes as highly
versatile expression platforms for artificial riboswitches. Nature communications, 2016. (Cited
on page 135)

M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR
Workshop, 2019a. (Cited on page 64)

M. Fey and J. E. Lenssen. Fast graph representation learning with pytorch geometric. ICLR 2019
Representation Learning on Graphs and Manifolds Workshop, 2019b. (Cited on page 103)

D. Flam-Shepherd and A. Aspuru-Guzik. Language models can generate molecules, materials,
and protein binding sites directly in three dimensions as xyz, cif, and pdb files. arXiv preprint
arXiv:2305.05708, 2023. (Cited on page 91)

R. E. Franklin and R. G. Gosling. Molecular structure of nucleic acids: Molecular configuration
in sodium thymonucleate. Nature, 1953. (Cited on page 24)

N. C. Frey, I. Hötzel, S. D. Stanton, R. Kelly, R. G. Alberstein, E. Makowski, K. Martinkus,
D. Berenberg, J. Bevers III, T. Bryson, et al. Lab-in-the-loop therapeutic antibody design with
deep learning. bioRxiv, 2025. (Cited on page 143)

L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li. Cd-hit: accelerated for clustering the next-generation
sequencing data. Bioinformatics, 2012. (Cited on page 107)

X. Fu, Z. Wu, W. Wang, T. Xie, S. Keten, R. Gomez-Bombarelli, and T. S. Jaakkola. Forces
are not enough: Benchmark and critical evaluation for machine learning force fields with
molecular simulations. Transactions on Machine Learning Research, 2023. (Cited on page
43)

X. Fu, B. M. Wood, L. Barroso-Luque, D. S. Levine, M. Gao, M. Dzamba, and C. L. Zitnick.
Learning smooth and expressive interatomic potentials for physical property prediction. In
International Conference on Machine Learning, 2025. (Cited on page 12, 43, 141)

F. Fuchs, D. Worrall, V. Fischer, and M. Welling. Se (3)-transformers: 3d roto-translation
equivariant attention networks. NeurIPS, 2020. (Cited on page 41)

P. Gainza, F. Sverrisson, F. Monti, E. Rodola, D. Boscaini, M. Bronstein, and B. Correia.
Deciphering interaction fingerprints from protein molecular surfaces using geometric deep
learning. Nature Methods, 17(2), 2020. (Cited on page 69, 70)

L. R. Ganser, M. L. Kelly, D. Herschlag, and H. M. Al-Hashimi. The roles of structural dynamics
in the cellular functions of rnas. Nature reviews Molecular cell biology, 2019. (Cited on page
99, 142)

M. Gantz, S. V. Mathis, F. E. Nintzel, P. J. Zurek, T. Knaus, E. Patel, D. Boros, F.-M. Weberling,
M. R. Kenneth, O. J. Klein, et al. Microdroplet screening rapidly profiles a biocatalyst to
enable its ai-assisted engineering. bioRxiv, 2024. (Cited on page 127)

R. Gao, E. Hoogeboom, J. Heek, V. D. Bortoli, K. P. Murphy, and T. Salimans. Diffusion meets
flow matching: Two sides of the same coin, 2024. URL
https://diffusionflow.github.io/. (Cited on page 49, 82)

V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural
networks. In ICML, 2020. (Cited on page 173, 174)

J. Gasteiger, J. Groß, and S. Günnemann. Directional message passing for molecular graphs. In
ICLR, 2020. (Cited on page 35, 36, 53, 63, 64, 69)

J. Gasteiger, F. Becker, and S. Günnemann. Gemnet: Universal directional graph neural networks
for molecules. In NeurIPS, 2021. (Cited on page 35, 36, 63, 73, 74)

M. Geiger and T. Smidt. e3nn: Euclidean neural networks. arXiv preprint, 2022. (Cited on page
41, 64)

W. Gilbert. Origin of life: The rna world. Nature, 319(6055):618–618, 1986. (Cited on page
126)

J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for
quantum chemistry. In ICML, 2017. (Cited on page 12, 31)

V. Gligorijević, P. D. Renfrew, T. Kosciolek, J. K. Leman, D. Berenberg, T. Vatanen, C. Chandler,
B. C. Taylor, I. M. Fisk, H. Vlamakis, et al. Structure-based protein function prediction using
graph convolutional networks. Nature Communications, 2021. (Cited on page 69, 70)

V. Gligorijević, P. D. Renfrew, T. Kosciolek, J. K. Leman, D. Berenberg, T. Vatanen, C. Chandler,
B. C. Taylor, I. M. Fisk, H. Vlamakis, et al. Structure-based protein function prediction using
graph convolutional networks. Nature communications, 12(1), 2021. (Cited on page 44)

C. Goller and A. Kuchler. Learning task-dependent distributed representations by
backpropagation through structure. In Proceedings of International Conference on Neural Networks
(ICNN’96). IEEE, 1996. (Cited on page 29)

R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling,
D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik.
Automatic chemical design using a data-driven continuous representation of molecules. ACS
central science, 2018. (Cited on page 46)

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org. (Cited on page 21)

M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In
Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. IEEE,
2005. (Cited on page 29)

A. Graves. Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850, 2013. (Cited on page 44)

G. P. Greslehner. What do molecular biologists mean when they say ‘structure determines
function’? 2018. (Cited on page 23)

R.-R. Griffiths and J. M. Hernández-Lobato. Constrained bayesian optimization for automatic
chemical design using variational autoencoders. Chemical science, 2020. (Cited on page 47)

R. W. Grosse-Kunstleve, N. K. Sauter, and P. D. Adams. Numerically stable algorithms for
the computation of reduced unit cells. Acta Crystallographica Section A: Foundations of
Crystallography, 2004. (Cited on page 79)

N. Gruver, S. Stanton, N. Frey, T. G. Rudner, I. Hotzel, J. Lafrance-Vanasse, A. Rajpal, K. Cho,
and A. G. Wilson. Protein design with guided discrete diffusion. Advances in neural
information processing systems, 2023. (Cited on page 50)

N. Gruver, A. Sriram, A. Madotto, A. G. Wilson, C. L. Zitnick, and Z. W. Ulissi. Fine-tuned
language models generate stable inorganic materials as text. In The Twelfth International
Conference on Learning Representations, 2024. (Cited on page 85, 91)

D. Han, X. Qi, C. Myhrvold, B. Wang, M. Dai, S. Jiang, M. Bates, Y. Liu, B. An, F. Zhang, et al.
Single-stranded dna and rna origami. Science, 2017. (Cited on page 99, 113)

C. Harris, K. Didi, A. R. Jamasb, C. K. Joshi, S. V. Mathis, P. Lio, and T. Blundell. Posecheck:
Generative models for 3d structure-based drug design produce unrealistic poses. NeurIPS
Workshop on Machine Learning for Structural Biology, 2023. (Cited on page 182)

S. He, R. Huang, J. Townley, R. C. Kretsch, T. G. Karagianes, D. B. Cox, H. Blair, D. Penzar,
V. Vyaltsev, E. Aristova, et al. Ribonanza: deep learning of rna structure through dual
crowdsourcing. bioRxiv, 2024. (Cited on page 113, 115, 117, 136, 144)

K. Henzler-Wildman and D. Kern. Dynamic personalities of proteins. Nature, 2007. (Cited on
page 142)

P. Hermosilla, M. Schäfer, M. Lang, G. Fackelmann, P. P. Vázquez, B. Kozlíková, M. Krone,
T. Ritschel, and T. Ropinski. Intrinsic-extrinsic convolution and pooling for learning on 3d
protein structures. arXiv preprint arXiv:2007.06252, 2020. (Cited on page 70)

G. Hinton. How to represent part-whole hierarchies in a neural network. arXiv preprint
arXiv:2102.12627, 2021. (Cited on page 37)

G. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
(Cited on page 57, 62)

G. E. Hinton and R. Zemel. Autoencoders, minimum description length and helmholtz free
energy. Advances in neural information processing systems, 1993. (Cited on page 45)

J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
2022. (Cited on page 50, 78, 84, 92, 93)

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural
information processing systems, 2020. (Cited on page 47, 48)

J. Hoetzel and B. Suess. Structural changes in aptamers are essential for synthetic riboswitch
engineering. Journal of Molecular Biology, 2022. (Cited on page 99)

E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling. Equivariant diffusion for molecule
generation in 3d. In International conference on machine learning. PMLR, 2022. (Cited on
page 47, 77, 84, 85, 90, 92, 182)

S. Hooker. The hardware lottery. Communications of the ACM, 2021. (Cited on page 140)

S. Hordan, T. Amir, and N. Dym. Weisfeiler leman for euclidean equivariant machine learning.
arXiv preprint arXiv:2402.02484, 2024. (Cited on page 73)

J. Hou, B. Adhikari, and J. Cheng. DeepSF: deep convolutional neural network for mapping
protein sequences to folds. Bioinformatics, 2017. (Cited on page 70)

W. Hu, M. Shuaibi, A. Das, S. Goyal, A. Sriram, J. Leskovec, D. Parikh, and C. L. Zitnick.
Forcenet: A graph neural network for large-scale quantum calculations. Preprint
arXiv:2103.01436, 2021. (Cited on page 42)

K. Huang, T. Fu, W. Gao, Y. Zhao, Y. Roohani, J. Leskovec, C. Coley, C. Xiao, J. Sun, and
M. Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery
and development. In Proceedings of the Neural Information Processing Systems Track on
Datasets and Benchmarks, volume 1, 2021. (Cited on page 70)

P.-S. Huang, S. E. Boyken, and D. Baker. The coming of age of de novo protein design. Nature,
2016. (Cited on page 23, 100, 112)

J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola. Generative models for graph-based protein
design. In NeurIPS, 2019a. (Cited on page 70)

J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola. Generative models for graph-based protein
design. NeurIPS, 2019b. (Cited on page 101, 102)

J. B. Ingraham, M. Baranov, Z. Costello, K. W. Barber, W. Wang, A. Ismail, V. Frappier,
D. M. Lord, C. Ng-Thow-Hing, E. R. Van Vlack, et al. Illuminating protein space with a
programmable generative model. Nature, 2023. (Cited on page 77, 113)

R. Irwin, A. Tibo, J. P. Janet, and S. Olsson. Semlaflow–efficient 3d molecular generation
with latent attention and equivariant flow matching. In The 28th International Conference on
Artificial Intelligence and Statistics, 2025. (Cited on page 89)

A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran,
A. Brock, E. Shelhamer, et al. Perceiver io: A general architecture for structured inputs &
outputs. arXiv preprint arXiv:2107.14795, 2021. (Cited on page 95)

A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter,
D. Skinner, G. Ceder, et al. Commentary: The materials project: A materials genome approach
to accelerating materials innovation. APL materials, 2013. (Cited on page 84)

K. Jamali, L. Käll, R. Zhang, A. Brown, D. Kimanius, and S. H. Scheres. Automated model
building and protein identification in cryo-em maps. Nature, 2024. (Cited on page 143)

A. R. Jamasb, A. Morehead, C. K. Joshi, Z. Zuobai, K. Didi, S. V. Mathis, C. Harris, J. Tang,
J. Cheng, P. Liò, et al. Evaluating representation learning on the protein structure universe. In
ICLR, 2024. (Cited on page 17, 68, 70)

S. Jegelka. Theory of graph neural networks: Representation and learning. arXiv preprint
arXiv:2204.07697, 2022. (Cited on page 55)

R. Jiao, W. Huang, P. Lin, J. Han, P. Chen, Y. Lu, and Y. Liu. Crystal structure prediction by
joint equivariant diffusion. In Thirty-seventh Conference on Neural Information Processing
Systems, 2023. (Cited on page 77, 85, 90, 92)

B. Jing, S. Eismann, P. Suriana, R. J. L. Townshend, and R. Dror. Learning from protein structure
with geometric vector perceptrons. In ICLR, 2020. (Cited on page 39, 64, 67, 69, 73, 101,
102, 103)

W. K. Johnston, P. J. Unrau, M. S. Lawrence, M. E. Glasner, and D. P. Bartel. Rna-catalyzed rna
polymerization: accurate and general rna-templated primer extension. Science, 2001. (Cited
on page 115)

C. K. Joshi. Transformers are graph neural networks. The Gradient, and arXiv preprint
arXiv:2506.22084, 2025. (Cited on page 12, 31, 63, 93, 140, 142)

C. K. Joshi and P. Liò. grnade: A geometric deep learning pipeline for 3d rna inverse design.
In A. Churkin and D. Barash, editors, RNA Design: Methods and Protocols, pages 121–135.
Springer, Methods in Molecular Biology (MIMB, volume 2847), 2024. (Cited on page 17)

C. K. Joshi, C. Bodnar, S. V. Mathis, T. Cohen, and P. Lio. On the expressive power of geometric
graph neural networks. In International conference on machine learning, 2023. (Cited on
page 17, 73, 185, 189)

C. K. Joshi, X. Fu, Y.-L. Liao, V. Gharakhanyan, B. K. Miller, A. Sriram, and Z. W. Ulissi.
All-atom diffusion transformers: Unified generative modelling of molecules and materials. In
International Conference on Machine Learning (ICML), 2025a. (Cited on page 17, 75, 142)

C. K. Joshi, E. Gianni, S. L. Kwok, S. V. Mathis, P. Liò, and P. Holliger. Generative inverse
design of rna structure and function with grnade. bioRxiv, 2025b. (Cited on
page 18)

C. K. Joshi, A. R. Jamasb, R. Viñas, C. Harris, S. Mathis, A. Morehead, R. Anand, and P. Liò.
gRNAde: Geometric deep learning for 3D RNA inverse design. In International Conference on
Learning Representations (ICLR), 2025c. (Cited on page 17)

J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, et al. Highly accurate
protein structure prediction with AlphaFold. Nature, 2021. (Cited on page 11, 12, 36, 53, 69,
75, 92, 99, 113, 141)

S.-O. Kaba, A. K. Mondal, Y. Zhang, Y. Bengio, and S. Ravanbakhsh. Equivariance with learned
canonicalization functions. In International Conference on Machine Learning. PMLR, 2023.
(Cited on page 42)

K. Kappel, K. Zhang, Z. Su, A. M. Watkins, W. Kladwang, S. Li, G. Pintilie, V. V. Topkar,
R. Rangan, I. N. Zheludev, et al. Accelerated cryo-EM-guided determination of three-dimensional
RNA-only structures. Nature methods, 2020. (Cited on page 137)

M. L. Ken, R. Roy, A. Geng, L. R. Ganser, A. Manghrani, B. R. Cullen, U. Schulze-Gahmen,
D. Herschlag, and H. M. Al-Hashimi. RNA conformational propensities determine cellular
activity. Nature, 2023. (Cited on page 100, 109, 136)

D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on
Learning Representations, 2014. (Cited on page 45, 79, 93)

D. P. Kingma and M. Welling. An introduction to variational autoencoders. Foundations and
Trends in Machine Learning, 2019. (Cited on page 46)

T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks.
In ICLR, 2017. (Cited on page 31)

W. Kohn, A. D. Becke, and R. G. Parr. Density functional theory of electronic structure. The
Journal of Physical Chemistry, 1996. (Cited on page 43)

D. P. Kovács, J. H. Moore, N. J. Browning, I. Batatia, J. T. Horton, Y. Pu, V. Kapil, W. C. Witt,
I.-B. Magdău, D. J. Cole, and G. Csányi. Mace-off: Short-range transferable machine learning
force fields for organic molecules. Journal of the American Chemical Society, 2025. (Cited
on page 143)

R. C. Kretsch, A. M. Hummer, S. He, R. Yuan, J. Zhang, T. Karagianes, Q. Cong,
A. Kryshtafovych, and R. Das. Assessment of nucleic acid structure prediction in CASP16.
bioRxiv, 2025. (Cited on page 118, 136)

A. Kun, M. Santos, and E. Szathmáry. Real ribozymes suggest a relaxed error threshold. Nature
genetics, 2005. (Cited on page 126)

C. N. Lambert, V. Opuu, F. Calvanese, P. Pavlinova, F. Zamponi, E. J. Hayden, M. Weigt,
M. Smerlak, and P. Nghe. Exploring the space of self-reproducing ribozymes using generative
models. Nature communications, 2025. (Cited on page 126)

J. Lan, A. Palizhati, M. Shuaibi, B. M. Wood, B. Wander, A. Das, M. Uyttendaele, C. L. Zitnick,
and Z. W. Ulissi. Adsorbml: a leap in efficiency for adsorption energy calculations using
generalizable machine learning potentials. npj Computational Materials, 9(1):172, 2023.
(Cited on page 44)

T. J. Lane. Protein structure prediction has reached the single-structure frontier. Nature Methods,
2023. (Cited on page 142)

T. Le, J. Cremer, F. Noe, D.-A. Clevert, and K. T. Schütt. Navigating the design space of
equivariant diffusion-based generative models for de novo 3D molecule generation. In The
Twelfth International Conference on Learning Representations, 2024. (Cited on page 89)

J. Lee, W. Kladwang, M. Lee, D. Cantu, M. Azizyan, H. Kim, A. Limpaecher, S. Gaikwad,
S. Yoon, A. Treuille, et al. RNA design rules from a massive open laboratory. Proceedings of
the National Academy of Sciences, 2014. (Cited on page 119)

J. K. Leman, B. D. Weitzner, S. M. Lewis, J. Adolf-Bryfogle, N. Alam, R. F. Alford, M. Apra-
hamian, D. Baker, K. A. Barlow, P. Barth, et al. Macromolecular modelling and design in
rosetta: recent methods and frameworks. Nature methods, 2020. (Cited on page 108, 114)

D. Levine and P. J. Steinhardt. Quasicrystals: a new class of ordered structures. Physical review
letters, 1984. (Cited on page 67)

S. Li, S. Moayedpour, R. Li, M. Bailey, S. Riahi, L. Kogler-Anele, M. Miladi, J. Miner, D. Zheng,
J. Wang, et al. Codonbert: Large language models for mrna design and optimization. bioRxiv,
2023a. (Cited on page 113)

Y. Li, C. Zhang, C. Feng, R. Pearce, P. Lydia Freddolino, and Y. Zhang. Integrating end-to-
end learning with deep geometrical potentials for ab initio RNA structure prediction. Nature
Communications, 2023b. (Cited on page 113, 118)

Z. Li, X. Wang, Y. Huang, and M. Zhang. Is distance matrix enough for geometric deep learning?
NeurIPS, 2023c. (Cited on page 73)

Y. Liao and T. E. Smidt. Equiformer: Equivariant graph attention transformer for 3d atomistic
graphs. In ICLR, 2023. (Cited on page 41)

Y.-L. Liao, B. M. Wood, A. Das, and T. Smidt. Equiformerv2: Improved equivariant transformer
for scaling to higher-degree representations. In ICLR, 2024a. (Cited on page 42)

Y.-L. Liao, B. M. Wood, A. Das, and T. Smidt. Equiformerv2: Improved equivariant trans-
former for scaling to higher-degree representations. In International Conference on Learning
Representations, 2024b. (Cited on page 79, 92, 185)

Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli,
et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.
Science, 2023. (Cited on page 71)

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative
modeling. In International Conference on Learning Representations, 2023. (Cited on page 47,
49, 82)

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with
rectified flow. In The Eleventh International Conference on Learning Representations, 2023.
URL https://openreview.net/forum?id=XVjTT1nw5z. (Cited on page 49)

Y. Liu, L. Wang, M. Liu, Y. Lin, X. Zhang, B. Oztekin, and S. Ji. Spherical message passing for
3d molecular graphs. In ICLR, 2022. (Cited on page 35)

A. Loukas. What graph neural networks cannot learn: depth vs width. In International
Conference on Learning Representations, 2020. (Cited on page 56)

A. X. Lu, W. Yan, S. A. Robinson, S. Kelow, K. K. Yang, V. Gligorijevic, K. Cho, R. Bonneau,
P. Abbeel, and N. C. Frey. All-atom protein generation with latent diffusion. In ICLR 2025
Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2025.
(Cited on page 93)

N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie. Sit: Exploring
flow and diffusion-based generative models with scalable interpolant transformers. In ECCV,
2024. (Cited on page 84)

A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos Jr,
C. Xiong, Z. Z. Sun, R. Socher, et al. Large language models generate functional protein
sequences across diverse families. Nature biotechnology, 2023. (Cited on page 44)

M. Mandal and R. R. Breaker. Gene regulation by riboswitches. Nature reviews Molecular cell
biology, 2004. (Cited on page 136)

T. Marinus, A. B. Fessler, C. A. Ogle, and D. Incarnato. A novel shape reagent enables the
analysis of rna structure in living cells with unprecedented accuracy. Nucleic acids research,
2021. (Cited on page 118)

H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks.
NeurIPS, 2019. (Cited on page 56)

K. Martinkus, J. Ludwiczak, W.-C. Liang, J. Lafrance-Vanasse, I. Hotzel, A. Rajpal, et al.
AbDiffuser: full-atom generation of in-vitro functioning antibodies. Advances in Neural
Information Processing Systems, 2024. (Cited on page 93)

E. K. McRae, C. J. Wan, E. L. Kristoffersen, K. Hansen, E. Gianni, I. Gallego, J. F. Curran,
J. Attwater, P. Holliger, and E. S. Andersen. Cryo-em structure and functional landscape of an
rna polymerase ribozyme. Proceedings of the National Academy of Sciences, 2024. (Cited on
page 111, 112, 114, 126, 127, 128, 192)

M. Metkar, C. S. Pepin, and M. J. Moore. Tailor made: the art of therapeutic mrna design.
Nature Reviews Drug Discovery, 2024. (Cited on page 99)

B. K. Miller, R. T. Chen, A. Sriram, and B. M. Wood. Flowmm: Generating materials with
riemannian flow matching. In Forty-first International Conference on Machine Learning,
2024. (Cited on page 77, 80, 85, 88, 90, 92, 181)

M. G. Mohsen, M. K. Midy, A. Balaji, and R. R. Breaker. Exploiting natural riboswitches for
aptamer engineering and validation. Nucleic Acids Research, 2023. (Cited on page 136)

A. Morehead and J. Cheng. Geometry-complete perceptron networks for 3d molecular graphs.
Bioinformatics, 2024. (Cited on page 69, 70)

A. Morehead, C. Chen, and J. Cheng. Geometric transformers for protein interface contact
prediction. In International Conference on Learning Representations, 2022. (Cited on page
69)

C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler
and leman go neural: Higher-order graph neural networks. In AAAI, 2019. (Cited on page 55,
60, 176)

C. Morris, Y. Lipman, H. Maron, B. Rieck, N. M. Kriege, M. Grohe, M. Fey, and K. Borgwardt.
Weisfeiler and leman go machine learning: The story so far. arXiv preprint, 2021. (Cited on
page 55)

A. Musaelian, S. L. Batzner, A. Johansson, L. Sun, C. J. Owen, M. Kornbluth, and B. Kozinsky.
Learning local equivariant representations for large-scale atomistic dynamics. Nature
Communications, 2022. (Cited on page 41)

F. Musil, A. Grisafi, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti. Physics-inspired
structural representations for molecules and materials. ACS Chemical Reviews, 2021. (Cited
on page 12, 27, 33, 36, 141)

K. Mustafina, K. Fukunaga, and Y. Yokobayashi. Design of mammalian on-riboswitches based
on tandemly fused aptamer and ribozyme. ACS Synthetic Biology, 2019. (Cited on page 135)

S. Neidle and M. Sanderson. Principles of nucleic acid structure. Academic Press, 2021. (Cited
on page 23)

P. O. O Pinheiro, J. Rackers, J. Kleinhenz, M. Maser, O. Mahmood, A. Watkins, S. Ra, V. Sresht,
and S. Saremi. 3D molecule generation by denoising voxel grids. Advances in Neural
Information Processing Systems, 36:69077–69097, 2023. (Cited on page 93)

S. P. Ong, W. D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V. L. Chevrier,
K. A. Persson, and G. Ceder. Python materials genomics (pymatgen): A robust, open-source
python library for materials analysis. Computational Materials Science, 2013. (Cited on page
181)

F. J. O’Reilly and J. Rappsilber. Cross-linking mass spectrometry: methods and applications
in structural, molecular and systems biology. Nature structural & molecular biology, 2018.
(Cited on page 143)

L. Orgel. Evolution of the genetic apparatus. Journal of Molecular Biology, 1968. (Cited on
page 126)

S. Passaro and C. L. Zitnick. Reducing so (3) convolutions to so (2) for efficient equivariant
gnns. arXiv preprint arXiv:2302.03655, 2023. (Cited on page 42)

W. Peebles and S. Xie. Scalable diffusion models with transformers. In International Conference
on Computer Vision, 2023. (Cited on page 78, 83, 92, 93)

R. J. Penic, T. Vlasic, R. G. Huber, Y. Wan, and M. Sikic. Rinalmo: General-purpose rna
language models can generalize well on structure prediction tasks. arXiv preprint, 2024.
(Cited on page 113)

M. F. Perutz. Structure of haemoglobin. Brookhaven Symposia in Biology, 1960. (Cited on page
25)

B. T. Porebski, M. Balmforth, G. Browne, A. Riley, K. Jamali, M. J. Fürst, M. Velic, A. Buchanan,
R. Minter, T. Vaughan, et al. Rapid discovery of high-affinity antibodies via massively parallel
sequencing, ribosome display and affinity screening. Nature biomedical engineering, 2024.
(Cited on page 143)

S. N. Pozdnyakov and M. Ceriotti. Incompleteness of graph convolutional neural networks for
points clouds in three dimensions. arXiv preprint, 2022. (Cited on page 64)

S. N. Pozdnyakov and M. Ceriotti. Smooth, exact rotational symmetrization for deep learning on
point clouds. arXiv preprint arXiv:2305.19302, 2023. (Cited on page 42)

S. N. Pozdnyakov, M. J. Willatt, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti. Incomplete-
ness of atomic structure representations. Physical Review Letters, 2020. (Cited on page 57,
60, 65, 67, 68, 73, 175, 176)

F. Praetorius, P. J. Leung, M. H. Tessmer, A. Broerman, C. Demakis, A. F. Dishman, A. Pillai,
A. Idris, D. Juergens, J. Dauparas, et al. Design of stimulus-responsive two-state hinge proteins.
Science, 2023. (Cited on page 143)

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever. Robust speech
recognition via large-scale weak supervision. In Proceedings of the 40th International
Conference on Machine Learning, volume 202, 2023. (Cited on page 31)

M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of
deep neural networks. In International conference on machine learning, 2017. (Cited on page
13, 53)

P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint
arXiv:1710.05941, 2017. (Cited on page 36)

V. Ramakrishnan. Ribosome structure and the mechanism of translation. Cell, 2002. (Cited on
page 25)

L. Rampášek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, and D. Beaini. Recipe for a general,
powerful, scalable graph transformer. Advances in Neural Information Processing Systems,
2022. (Cited on page 32)

R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of graph theory, 1977.
(Cited on page 54)

D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate
inference in deep generative models. In International conference on machine learning, 2014.
(Cited on page 45)

J. Riebesell, R. E. Goodall, A. Jain, P. Benner, K. A. Persson, and A. A. Lee. Matbench
discovery–an evaluation framework for machine learning crystal stability prediction. arXiv
preprint arXiv:2308.14920, 2023. (Cited on page 181)

E. Rivas and S. R. Eddy. A dynamic programming algorithm for rna structure prediction
including pseudoknots. Journal of molecular biology, 1999. (Cited on page 118)

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis
with latent diffusion models. In CVPR, 2022. (Cited on page 50, 78, 81, 92, 93, 94)

M. J. Rowley and V. G. Corces. Organizational principles of 3D genome architecture. Nature
Reviews Genetics, 2018. (Cited on page 25)

F. Runge, D. Stoll, S. Falkner, and F. Hutter. Learning to design RNA. In ICLR, 2019. (Cited on
page 113)

B. Sanchez-Lengeling and A. Aspuru-Guzik. Inverse molecular design using machine learning:
Generative models for matter engineering. Science, 2018. (Cited on page 11, 44)

R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In
SIAM International Conference on Data Mining (SDM), 2021. (Cited on page 56)

V. G. Satorras, E. Hoogeboom, and M. Welling. E (n) equivariant graph neural networks. In
ICML, 2021. (Cited on page 39, 53, 63, 64, 66, 67, 68, 70, 92)

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural
network model. IEEE transactions on neural networks, 2008. (Cited on page 29)

B. Schneider, B. A. Sweeney, A. Bateman, J. Cerny, T. Zok, and M. Szachniuk. When will rna
get its alphafold moment? Nucleic Acids Research, 2023. (Cited on page 99)

A. Schneuing, C. Harris, Y. Du, K. Didi, A. Jamasb, I. Igashov, et al. Structure-based drug
design with equivariant diffusion models. Nature Computational Science, 2024. (Cited on
page 50, 69, 77, 90, 92, 94, 143)

K. Schütt, O. Unke, and M. Gastegger. Equivariant message passing for the prediction of
tensorial properties and molecular spectra. In ICML, 2021. (Cited on page 39, 58, 66, 69, 175)

K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller. Schnet–a deep
learning architecture for molecules and materials. The Journal of Chemical Physics, 2018.
(Cited on page 35, 36, 53, 63, 64, 68, 69, 70)

M. H. Segler, T. Kogej, C. Tyrchan, and M. P. Waller. Generating focused molecule libraries for
drug discovery with recurrent neural networks. ACS central science, 2018. (Cited on page 44)

A. V. Shapeev. Moment tensor potentials: A class of systematically improvable interatomic
potentials. Multiscale Modeling & Simulation, 2016. (Cited on page 73)

T. Shen, Z. Hu, Z. Peng, J. Chen, P. Xiong, L. Hong, L. Zheng, Y. Wang, I. King, S. Wang, et al.
E2efold-3d: End-to-end deep learning method for accurate de novo rna 3d structure prediction.
arXiv preprint, 2022. (Cited on page 106)

Y. Shi, S. Zheng, G. Ke, Y. Shen, J. You, J. He, S. Luo, C. Liu, D. He, and T.-Y. Liu. Bench-
marking graphormer on large-scale molecular modeling datasets. arXiv preprint, 2022. (Cited
on page 63)

N. Shoghi, A. Kolluru, J. R. Kitchin, Z. W. Ulissi, C. L. Zitnick, and B. M. Wood. From
molecules to materials: Pre-training large generalizable models for atomic property prediction.
In The Twelfth International Conference on Learning Representations, 2024. (Cited on page
77, 92)

Y. Shulgina, M. I. Trinidad, C. J. Langeberg, H. Nisonoff, S. Chithrananda, P. Skopintsev, A. J.
Nissley, J. Patel, R. S. Boger, H. Shi, et al. Rna language models predict mutations that
improve rna function. Nature Communications, 2024. (Cited on page 44, 113)

G. Simeon and G. D. Fabritiis. Tensornet: Cartesian tensor representations for efficient learning
of molecular potentials. In NeurIPS, 2023. (Cited on page 40)

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning
using nonequilibrium thermodynamics. In International conference on machine learning,
2015. (Cited on page 47, 82)

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International
Conference on Learning Representations, 2021. (Cited on page 48, 93)

Y. Song and S. Ermon. Generative modelling by estimating gradients of the data distribution.
Advances in neural information processing systems, 2019. (Cited on page 48, 82)

A. Sriram, B. K. Miller, R. T. Q. Chen, and B. M. Wood. Flowllm: Flow matching for material
generation with large language models as base distributions. In NeurIPS, 2024. (Cited on
page 85, 181)

J. Stagno, Y. Liu, Y. Bhandari, C. Conrad, S. Panja, M. Swain, L. Fan, G. Nelson, C. Li,
D. Wendel, et al. Structures of riboswitch rna reaction states by mix-and-inject xfel serial
crystallography. Nature, 2017. (Cited on page 109)

D. W. Staple and S. E. Butcher. Pseudoknots: RNA structures with diverse functions. PLoS
biology, 2005. (Cited on page 118)

J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair,
S. French, L. A. Carfrae, Z. Bloom-Ackermann, et al. A deep learning approach to antibiotic
discovery. Cell, 2020. (Cited on page 11, 29, 44)

E. J. Strobel, A. M. Yu, and J. B. Lucks. High-throughput determination of RNA structures. Nature
Reviews Genetics, 2018. (Cited on page 118, 143, 144)

S. Sumi, M. Hamada, and H. Saito. Deep generative design of RNA family sequences. Nature
Methods, 2024. (Cited on page 113)

I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In
Advances in neural information processing systems, 2014. (Cited on page 44)

R. Sutton. The bitter lesson. Incomplete Ideas (blog), 2019. (Cited on page 140)

C. Tan, Y. Zhang, Z. Gao, H. Cao, and S. Z. Li. Hierarchical data-efficient representation learning
for tertiary structure-based rna design. arXiv preprint, 2023. (Cited on page 109)

N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks:
Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint, 2018.
(Cited on page 12, 40, 41, 53, 63, 64, 66, 68, 70)

K. F. Tjhung, M. N. Shokhirev, D. P. Horning, and G. F. Joyce. An rna polymerase ribozyme
that synthesizes its own ancestor. Proceedings of the National Academy of Sciences, 2020.
(Cited on page 126)

R. Todeschini and V. Consonni. Molecular descriptors for chemoinformatics: volume I:
alphabetical listing/volume II: appendices, references. John Wiley & Sons, 2009. (Cited on
page 44)

J. Topping, F. D. Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding
over-squashing and bottlenecks on graphs via curvature. In ICLR, 2022. (Cited on page 66)

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv
preprint arXiv:2307.09288, 2023. (Cited on page 85)

R. J. Townshend, S. Eismann, A. M. Watkins, R. Rangan, M. Karelina, R. Das, and R. O. Dror.
Geometric deep learning of rna structure. Science, 2021. (Cited on page 113)

O. T. Unke, S. Chmiela, H. E. Sauceda, M. Gastegger, I. Poltavsky, K. T. Schutt, A. Tkatchenko,
and K.-R. Muller. Machine learning force fields. Chemical Reviews, 2021. (Cited on page 43)

A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modelling in latent space. Advances
in neural information processing systems, 2021. (Cited on page 50, 78, 92)

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural
information processing systems, 2017. (Cited on page 45)

M. Varadi, S. Anyango, M. Deshpande, S. Nair, C. Natassia, G. Yordanova, D. Yuan, O. Stroe,
G. Wood, A. Laydon, et al. AlphaFold protein structure database: massively expanding the
structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids
Research, 2021. (Cited on page 69)

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and
I. Polosukhin. Attention is all you need. Advances in neural information processing systems,
2017a. (Cited on page 12, 31, 44, 45, 79)

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and
I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing
Systems, volume 30, 2017b. (Cited on page 71)

P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention
Networks. ICLR, 2018. (Cited on page 31)

Q. Vicens and J. S. Kieft. Thoughts on how to think (and talk) about RNA structure. Proceedings
of the National Academy of Sciences, 2022. (Cited on page 100, 118, 190)

C. Vignac, N. Osman, L. Toni, and P. Frossard. Midi: Mixed graph and 3d denoising diffusion
for molecule generation. In ECML PKDD, 2023. (Cited on page 89)

S. Villar, D. W. Hogg, K. Storey-Fisher, W. Yao, and B. Blum-Smith. Scalars are universal:
Equivariant machine learning, structured like classical physics. NeurIPS, 2021. (Cited on
page 73)

P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust
features with denoising autoencoders. In Proceedings of the 25th international conference on
Machine learning, 2008. (Cited on page 45)

L. M. Wadley, K. S. Keating, C. M. Duarte, and A. M. Pyle. Evaluating and learning from rna
pseudotorsional space: quantitative validation of a reduced representation for rna structure.
Journal of molecular biology, 2007. (Cited on page 101)

B. Wander, M. Shuaibi, J. R. Kitchin, Z. W. Ulissi, and C. L. Zitnick. Cattsunami: Accelerating
transition state energy calculations with pretrained graph neural networks. ACS Catalysis, 15
(7):5283–5294, 2025. (Cited on page 44, 53)

L. Wang, Y. Liu, Y. Lin, H. Liu, and S. Ji. Comenet: Towards complete and efficient message
passing for 3d molecular graphs. 2022. (Cited on page 63, 75)

W. Wang, C. Feng, R. Han, Z. Wang, L. Ye, Z. Du, H. Wei, F. Zhang, Z. Peng, and J. Yang.
trRosettaRNA: automated prediction of RNA 3D structure with transformer network. Nature
Communications, 2023. (Cited on page 113, 118)

Y. Wang, A. A. Elhag, N. Jaitly, J. M. Susskind, and M. A. Bautista. Swallowing the bitter pill:
Simplified scalable conformer generation. In International conference on machine learning,
2024. (Cited on page 12, 75, 93, 142)

M. Ward, E. Courtney, and E. Rivas. Fitness functions for RNA structure design. Nucleic Acids
Research, 2023. (Cited on page 113)

A. M. Watkins, R. Rangan, and R. Das. Farfar2: improved de novo rosetta prediction of complex
global rna folds. Structure, 2020. (Cited on page 113)

J. D. Watson and F. H. Crick. Molecular structure of nucleic acids: a structure for deoxyribose
nucleic acid. Nature, 1953. (Cited on page 24)

J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A. J.
Borst, R. J. Ragotte, L. F. Milles, et al. De novo design of protein structure and function with
rfdiffusion. Nature, 2023. (Cited on page 11, 47, 50, 77, 92, 94, 99, 106, 113, 143)

H. K. Wayment-Steele, W. Kladwang, A. I. Strom, J. Lee, A. Treuille, A. Becka, E. Participants,
and R. Das. Rna secondary structure packages evaluated and improved by high-throughput
experiments. Nature methods, 2022a. (Cited on page 106, 119)

H. K. Wayment-Steele, W. Kladwang, A. M. Watkins, D. S. Kim, B. Tunguz, W. Reade,
M. Demkin, J. Romano, R. Wellington-Oguri, J. J. Nicol, et al. Deep learning models for
predicting rna degradation via dual crowdsourcing. Nature Machine Intelligence, 2022b.
(Cited on page 119)

H. K. Wayment-Steele, G. El Nesr, R. Hettiarachchi, H. Kariyawasam, S. Ovchinnikov, and
D. Kern. Learning millisecond protein dynamics from what is missing in nmr spectra. bioRxiv,
pages 2025–03, 2025. (Cited on page 143)

M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. S. Cohen. 3d steerable cnns: Learning
rotationally equivariant features in volumetric data. NeurIPS, 2018. (Cited on page 39)

D. Weininger. Smiles, a chemical language and information system. 1. introduction to method-
ology and encoding rules. Journal of Chemical Information and Computer Sciences, 1988.
(Cited on page 22)

B. Weisfeiler and A. Leman. The reduction of a graph to canonical form and the algebra which
appears therein. NTI, Series, 1968. (Cited on page 55)

R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural
networks. Neural computation, 1989. (Cited on page 45, 105)

D. S. Wilson and J. W. Szostak. In vitro selection of functional nucleic acids. Annual review of
biochemistry, 1999. (Cited on page 126)

A. Winnifrith, C. Outeiral, and B. Hie. Generative artificial intelligence for de novo protein
design. Current Opinion in Structural Biology, 2024. (Cited on page 44)

A. Wochner, J. Attwater, A. Coulson, and P. Holliger. Ribozyme-catalyzed transcription of an
active ribozyme. Science, 2011. (Cited on page 126)

C. Woese. The Genetic Code: the Molecular basis for Genetic Expression. New York: Harper &
Row, 1967. (Cited on page 126)

B. M. Wood, M. Dzamba, X. Fu, M. Gao, M. Shuaibi, L. Barroso-Luque, K. Abdelmaqsoud,
V. Gharakhanyan, J. R. Kitchin, D. S. Levine, K. Michel, A. Sriram, T. Cohen, A. Das,
A. Rizvi, S. J. Sahoo, Z. W. Ulissi, and C. L. Zitnick. UMA: A family of universal models for
atoms. 2025. (Cited on page 11, 43, 77, 92, 141, 143)

Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and
V. Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 2018.
(Cited on page 84)

T. Xie and J. C. Grossman. Crystal graph convolutional neural networks for an accurate and
interpretable prediction of material properties. Phys. Rev. Lett., 2018. (Cited on page 35, 44)

T. Xie, X. Fu, O.-E. Ganea, R. Barzilay, and T. S. Jaakkola. Crystal diffusion variational
autoencoder for periodic material generation. In International Conference on Learning
Representations, 2022. (Cited on page 81, 84, 85, 90, 181)

K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In ICLR,
2019. (Cited on page 31, 55, 60, 176)

K. Xu, J. Li, M. Zhang, S. S. Du, K. ichi Kawarabayashi, and S. Jegelka. What can neural
networks reason about? In International Conference on Learning Representations, 2020.
(Cited on page 12)

M. Xu, A. S. Powers, R. O. Dror, S. Ermon, and J. Leskovec. Geometric latent diffusion models
for 3d molecule generation. In International Conference on Machine Learning, 2023. (Cited
on page 85, 88, 90, 92)

S. Yang, K. Cho, A. Merchant, P. Abbeel, D. Schuurmans, I. Mordatch, and E. D. Cubuk.
Scalable diffusion for materials generation. In The Twelfth International Conference on
Learning Representations, 2024. (Cited on page 85)

J. D. Yesselman, D. Eiler, E. D. Carlson, M. R. Gotrik, A. E. d’Aquino, A. N. Ooms, W. Klad-
wang, P. D. Carlson, X. Shi, D. A. Costantino, et al. Computational design of three-dimensional
rna structure and function. Nature nanotechnology, 2019. (Cited on page 99, 113)

J. Yim, A. Campbell, A. Y. Foong, M. Gastegger, J. Jiménez-Luna, S. Lewis, V. G. Satorras,
B. S. Veeling, R. Barzilay, T. Jaakkola, et al. Fast protein backbone generation with se (3) flow
matching. arXiv preprint arXiv:2310.05297, 2023a. (Cited on page 83)

J. Yim, B. L. Trippe, V. De Bortoli, E. Mathieu, A. Doucet, R. Barzilay, and T. Jaakkola. Se (3)
diffusion model with application to protein backbone generation. 2023b. (Cited on page 83,
92)

R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional
neural networks for web-scale recommender systems. In Proceedings of the 24th ACM
SIGKDD international conference on knowledge discovery & data mining, 2018. (Cited on
page 29)

M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep
sets. NeurIPS, 2017. (Cited on page 104)

A. Zee. Group theory in a nutshell for physicists. 2016. (Cited on page 28)

C. Zeni, R. Pinsler, D. Zügner, A. Fowler, M. Horton, X. Fu, et al. Mattergen: a generative model
for inorganic materials design. Nature, 2025. (Cited on page 47, 50, 77, 85, 90, 94, 143)

C. Zhang, M. Shine, A. M. Pyle, and Y. Zhang. Us-align: universal structure alignments of
proteins, nucleic acids, and macromolecular complexes. Nature methods, 2022. (Cited on
page 106, 107)

L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion
models. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
2023a. (Cited on page 92)

Z. Zhang, M. Xu, A. R. Jamasb, V. Chenthamarakshan, A. Lozano, P. Das, and J. Tang. Protein
representation learning by geometric structure pretraining. In The Eleventh International
Conference on Learning Representations, 2023b. (Cited on page 69, 70)

Y. Zhao, K. Oono, H. Takizawa, and M. Kotera. Generrna: A generative pre-trained language
model for de novo rna design. PLoS One, 2024. (Cited on page 113)

Appendix A

Appendix: Expressive Power of Molecular
Structure Representations (Chapter 3)

A.1 Geometric GNN Design Space Proofs

A.1.1 Role of Depth (Section 3.3.1)

The following results are a consequence of the construction of GWL as well as the definitions of
k-hop distinct and k-hop identical geometric graphs. Note that k-hop distinct geometric graphs
are also (k + 1)-hop distinct. Similarly, k-hop identical geometric graphs are also (k − 1)-hop
identical, but not necessarily (k + 1)-hop distinct.
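
Both remarks rest on the nesting of hop neighbourhoods: every node within k hops of a node is also within k + 1 hops of it. The sketch below illustrates only this graph-side nesting, not the geometric test itself. The function name `k_hop_nodes` and the path graph are hypothetical choices for illustration and do not come from the thesis.

```python
from collections import deque


def k_hop_nodes(adjacency, source, k):
    """Return the set of nodes within k hops of `source` (including `source`)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Hypothetical example: the path graph 0 - 1 - 2 - 3 - 4, centred on node 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
hops = [k_hop_nodes(path, 2, k) for k in range(4)]

# Each k-hop node set is contained in the (k + 1)-hop node set.
for smaller, larger in zip(hops, hops[1:]):
    assert smaller <= larger
```

Because the k-hop subgraph sits inside the (k + 1)-hop subgraph, a difference that is already visible at k hops remains visible at k + 1 hops, while agreement at k hops says nothing about what lies one hop further out.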

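Given two distinct neighbourhoods $\mathcal{N}_1$ and $\mathcal{N}_2$, the $\mathfrak{G}$-orbits of the corresponding
geometric multisets $g_1$ and $g_2$ are mutually exclusive, i.e.
$\mathcal{O}_{\mathfrak{G}}(g_1) \cap \mathcal{O}_{\mathfrak{G}}(g_2) \equiv \emptyset$. By the properties of
I-HASH, this implies $c_1 \neq c_2$. Conversely, if $\mathcal{N}_1$ and $\mathcal{N}_2$ were identical up to
group actions, their $\mathfrak{G}$-orbits would overlap, i.e. $g_1 = \mathfrak{g}\, g_2$ for some
$\mathfrak{g} \in \mathfrak{G}$ and $\mathcal{O}_{\mathfrak{G}}(g_1) = \mathcal{O}_{\mathfrak{G}}(g_2) \Rightarrow c_1 = c_2$.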
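
The orbit argument can be made concrete with a toy stand-in for I-HASH. The sketch below is an assumption-laden illustration, not the thesis's construction: it takes the group to be O(3), represents a neighbourhood as an unordered collection of relative position vectors without node features, and uses the Gram matrix (all pairwise inner products) canonicalised over orderings as the orbit key. The names `orbit_key` and `rotate_z` are hypothetical. Rounding means the key is only injective up to the chosen tolerance, and the brute-force search over orderings is only practical for very small neighbourhoods.

```python
import itertools
import math


def orbit_key(vectors, decimals=6):
    """Toy O(3)-invariant, order-independent key for a small set of 3D vectors."""
    n = len(vectors)
    # Gram matrix of inner products; unchanged by rotations and reflections.
    gram = [
        [round(sum(a * b for a, b in zip(vectors[i], vectors[j])), decimals) + 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Canonicalise over all orderings so the key ignores how neighbours are listed.
    return min(
        tuple(gram[p][q] for p in perm for q in perm)
        for perm in itertools.permutations(range(n))
    )


def rotate_z(vector, theta):
    """Rotate a 3D vector about the z-axis by angle theta (radians)."""
    x, y, z = vector
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)


# Hypothetical neighbourhoods given as relative position vectors.
n1 = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
# Same neighbourhood, rotated and listed in a different order.
n1_moved = [rotate_z(v, 0.7) for v in (n1[2], n1[0], n1[1])]
# A genuinely different neighbourhood (one neighbour is further away).
n2 = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 4.0)]

assert orbit_key(n1) == orbit_key(n1_moved)  # same orbit, same colour
assert orbit_key(n1) != orbit_key(n2)        # different orbits, different colours
```

Since the Gram matrix determines a set of vectors up to an orthogonal transformation, two neighbourhoods receive the same key in this toy setting exactly when they lie in the same O(3)-orbit (up to rounding), which mirrors the two directions of the argument above. Note that this choice of group treats mirror images as identical.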

Proposition 3. GWL can distinguish any $k$-hop distinct geometric graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ where the
underlying attributed graphs are isomorphic, and $k$ iterations are sufficient.

Proof of Proposition 3. The $k$-th iteration of GWL identifies the $\mathfrak{G}$-orbit of the $k$-hop subgraph
$\mathcal{N}_i^{(k)}$ at each node $i$ via the geometric multiset $g_i^{(k)}$. $\mathcal{G}_1$ and $\mathcal{G}_2$ being
$k$-hop distinct implies that there exists some bijection $b$ and some node $i \in \mathcal{V}_1$,
$b(i) \in \mathcal{V}_2$ such that the corresponding $k$-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$
are distinct. Thus, the $\mathfrak{G}$-orbits of the corresponding geometric multisets $g_i^{(k)}$ and
$g_{b(i)}^{(k)}$ are mutually exclusive, i.e.
$\mathcal{O}_{\mathfrak{G}}(g_i^{(k)}) \cap \mathcal{O}_{\mathfrak{G}}(g_{b(i)}^{(k)}) \equiv \emptyset \Rightarrow c_i^{(k)} \neq c_{b(i)}^{(k)}$.
Thus, $k$ iterations of GWL are sufficient to distinguish $\mathcal{G}_1$ and $\mathcal{G}_2$.

Proposition 4. Up to k iterations, GWL cannot distinguish any k-hop identical geometric graphs
G1 and G2 where the underlying attributed graphs are isomorphic.

Proof of Proposition 4. The k-th iteration of GWL identifies the G-orbit of the k-hop subgraph
$\mathcal{N}_i^{(k)}$ at each node i via the geometric multiset $g_i^{(k)}$. G1 and G2 being k-hop identical implies
that for all bijections b and all nodes i ∈ V1, b(i) ∈ V2, the corresponding k-hop subgraphs
$\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$ are identical up to group actions. Thus, the G-orbits of the corresponding
geometric multisets $g_i^{(k)}$ and $g_{b(i)}^{(k)}$ overlap, i.e.
$\mathcal{O}_G(g_i^{(k)}) = \mathcal{O}_G(g_{b(i)}^{(k)}) \Rightarrow c_i^{(k)} = c_{b(i)}^{(k)}$.
Thus, up to k iterations of GWL cannot distinguish G1 and G2.

Proposition 5. IGWL can distinguish any 1-hop distinct geometric graphs G1 and G2 where the
underlying attributed graphs are isomorphic, and 1 iteration is sufficient.

Proof of Proposition 5. Each iteration of IGWL identifies the G-orbit of the 1-hop local
neighbourhood $\mathcal{N}_i^{(k=1)}$ at each node i. G1 and G2 being 1-hop distinct implies that there exists some
bijection b and some node i ∈ V1, b(i) ∈ V2 such that the corresponding 1-hop local
neighbourhoods $\mathcal{N}_i^{(1)}$ and $\mathcal{N}_{b(i)}^{(1)}$ are distinct. Thus, the G-orbits of the corresponding geometric
multisets $g_i^{(1)}$ and $g_{b(i)}^{(1)}$ are mutually exclusive, i.e.
$\mathcal{O}_G(g_i^{(1)}) \cap \mathcal{O}_G(g_{b(i)}^{(1)}) \equiv \emptyset \Rightarrow c_i^{(1)} \neq c_{b(i)}^{(1)}$.
Thus, 1 iteration of IGWL is sufficient to distinguish G1 and G2.

Proposition 6. Any number of iterations of IGWL cannot distinguish any 1-hop identical
geometric graphs G1 and G2 where the underlying attributed graphs are isomorphic.

Proof of Proposition 6. Each iteration of IGWL identifies the G-orbit of the 1-hop local
neighbourhood $\mathcal{N}_i^{(k=1)}$ at each node i, but cannot identify G-orbits beyond 1-hop by the construction
of IGWL as no geometric information is propagated. G1 and G2 being 1-hop identical implies
that for all bijections b and all nodes i ∈ V1, b(i) ∈ V2, the corresponding 1-hop local
neighbourhoods $\mathcal{N}_i^{(1)}$ and $\mathcal{N}_{b(i)}^{(1)}$ are identical up to group actions. Thus, the G-orbits of the
corresponding geometric multisets $g_i^{(1)}$ and $g_{b(i)}^{(1)}$ overlap, i.e.
$\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_{b(i)}^{(1)}) \Rightarrow c_i^{(k)} = c_{b(i)}^{(k)}$ for any iteration k.
Thus, any number of IGWL iterations cannot distinguish G1 and G2.

Proposition 7. Assuming geometric graphs are constructed from point clouds using radial
cutoffs, GWL can distinguish any geometric graphs G1 and G2 where the underlying attributed
graphs are non-isomorphic. At most kMax iterations are sufficient, where kMax is the maximum
graph diameter among G1 and G2.

Proof of Proposition 7. We assume that a geometric graph G = (A,S, V⃗ , X⃗) is constructed
from a point cloud (S, V⃗ , X⃗) using a predetermined radial cutoff r. Thus, the adjacency matrix
is defined as aij = 1 if ∥x⃗i − x⃗j∥2 ≤ r, or 0 otherwise, for all aij ∈ A. Such construction
procedures are conventional for geometric graphs in molecular modelling.
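As a minimal sketch of this construction, the snippet below builds the adjacency from a point cloud with a radial cutoff r, then computes k-hop node sets and the graph diameter by breadth-first search. The helper names (`radial_cutoff_adjacency`, `hop_distances`, `k_hop_nodes`, `diameter`) are hypothetical and not taken from the thesis; self-loops are excluded and the graph is assumed to be connected when computing the diameter.

```python
import math
from collections import deque


def radial_cutoff_adjacency(positions, r):
    """Return {i: set of j} with an edge whenever ||x_i - x_j||_2 <= r.

    Assumption for this sketch: self-loops are not added.
    """
    n = len(positions)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) <= r:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def hop_distances(adj, source):
    """Breadth-first search: number of hops from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def k_hop_nodes(adj, i, k):
    """Nodes within k hops of node i (including i itself)."""
    return {j for j, d in hop_distances(adj, i).items() if d <= k}


def diameter(adj):
    """Largest hop distance between any two nodes (assumes a connected graph)."""
    return max(max(hop_distances(adj, i).values()) for i in adj)


# Four collinear points with unit spacing and cutoff r = 1 give a path graph.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
adj = radial_cutoff_adjacency(points, r=1.0)
k_max = diameter(adj)
full_set = k_hop_nodes(adj, 0, k_max)
```

When k equals the graph diameter, the k-hop node set of every node is the full node set, which is the property the argument relies on.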

Given geometric graphs G1 and G2 where the underlying attributed graphs are non-isomorphic,
identify kMax, the maximum of the graph diameters of G1 and G2, and choose any arbitrary nodes
i ∈ V1, j ∈ V2. We can define the kMax-hop subgraphs $\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ at i and j, respectively.
Thus, $\mathcal{N}_i^{(k_{\text{Max}})} = V_1$ for all i ∈ V1, and $\mathcal{N}_j^{(k_{\text{Max}})} = V_2$ for all j ∈ V2. Due to the assumed
construction procedure of geometric graphs, $\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ must be distinct. Otherwise, if
$\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ were identical up to group actions, the sets $(S_1, \vec{V}_1, \vec{X}_1)$ and $(S_2, \vec{V}_2, \vec{X}_2)$
would have yielded isomorphic graphs.



Figure A.1: Two geometric graphs for which IGWL and G-invariant GNNs cannot distinguish
their perimeter, surface area, volume of the bounding box/sphere, distance from the centroid,
and dihedral angles. The centroid is denoted by a red point and distances from it are denoted by
dotted red lines. The bounding box enclosing the geometric graph is denoted by the dotted green
lines.

The kMax-th iteration of GWL identifies the G-orbit of the kMax-hop subgraph $\mathcal{N}_i^{(k_{\text{Max}})}$ at each
node i via the geometric multiset $g_i^{(k_{\text{Max}})}$. As $\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ are distinct for any arbitrary
nodes i ∈ V1, j ∈ V2, the G-orbits of the corresponding geometric multisets $g_i^{(k_{\text{Max}})}$ and $g_j^{(k_{\text{Max}})}$
are mutually exclusive, i.e.
$\mathcal{O}_G(g_i^{(k_{\text{Max}})}) \cap \mathcal{O}_G(g_j^{(k_{\text{Max}})}) \equiv \emptyset \Rightarrow c_i^{(k_{\text{Max}})} \neq c_j^{(k_{\text{Max}})}$.
Thus, kMax iterations of GWL are sufficient to distinguish G1 and G2.

A.1.2 Limitations of Invariant Message Passing (Section 3.3.2)

Theorem 8. GWL is strictly more powerful than IGWL.

Proof of Theorem 8. Firstly, we can show that the GWL class contains IGWL if GWL can learn
the identity when updating gi for all i ∈ V, i.e. $g_i^{(t)} = g_i^{(t-1)} = g_i^{(0)} \equiv (s_i, \vec{v}_i)$. Thus, GWL is at
least as powerful as IGWL, which does not update gi.

Secondly, to show that GWL is strictly more powerful than IGWL, it suffices to show that
there exists a pair of geometric graphs that can be distinguished by GWL but not by IGWL. We
may consider any k-hop distinct geometric graphs for k > 1, where the underlying attributed
graphs are isomorphic. Proposition 3 states that GWL can distinguish any such graphs, while
Proposition 6 states that IGWL cannot distinguish them. An example is the pair of graphs in
Figures 3.4 and 3.5.

Proposition 9. IGWL and G-invariant GNNs cannot decide several geometric graph properties:
(1) perimeter, surface area, and volume of the bounding box/sphere enclosing the geometric
graph; (2) distance from the centroid or centre of mass; and (3) dihedral angles.

Proof of Proposition 9. Following Garg et al. [2020], we say that a class of models decides a
geometric graph property if there exists a model belonging to this class such that for any two
geometric graphs that differ in the property, the model is able to distinguish the two geometric
graphs.

In Figure A.1, we provide an example of two geometric graphs that demonstrate the proposi-
tion. G1 and G2 differ in the following geometric graph properties:



• Perimeter, surface area, and volume of the bounding box enclosing the geometric graph¹:
(32 units, 40 units², 16 units³) vs. (28 units, 24 units², 8 units³).

• Multiset of distances from the centroid or centre of mass: {0.00, 1.00, 1.00, 2.45, 2.45} vs.
{0.40, 1.08, 1.08, 2.32, 2.32}.

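• Dihedral angles: $\angle(ljkm) = \frac{(\vec{x}_{jk} \times \vec{x}_{lj}) \cdot (\vec{x}_{jk} \times \vec{x}_{mk})}{|\vec{x}_{jk} \times \vec{x}_{lj}| \, |\vec{x}_{jk} \times \vec{x}_{mk}|}$
are clearly different for the two graphs.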

However, according to Proposition 6 and Theorem 14, neither IGWL nor G-invariant GNNs can
distinguish these two geometric graphs, and therefore they cannot decide any of these properties.
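To make the dihedral expression in the list above concrete, here is a small sketch that evaluates it for four points. It assumes the convention that a relative position is x_ab = x_a − x_b, which is an assumption of this sketch; the function name `dihedral_expression` and the toy coordinates are illustrative only. As written, the expression is a normalised dot product of two plane normals, so it returns the cosine of the dihedral angle.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(p - q for p, q in zip(a, b))


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def dihedral_expression(x_l, x_j, x_k, x_m):
    """Evaluate the dihedral expression for the node sequence l, j, k, m.

    Assumed convention: x_ab = x_a - x_b.
    """
    x_jk = sub(x_j, x_k)
    x_lj = sub(x_l, x_j)
    x_mk = sub(x_m, x_k)
    n1 = cross(x_jk, x_lj)  # normal of the plane through l, j, k
    n2 = cross(x_jk, x_mk)  # normal of the plane through j, k, m
    return dot(n1, n2) / (norm(n1) * norm(n2))


# Toy configurations sharing the central bond j-k along the x axis.
j, k, l = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
cis = dihedral_expression(l, j, k, (1.0, 1.0, 0.0))
trans = dihedral_expression(l, j, k, (1.0, -1.0, 0.0))
perpendicular = dihedral_expression(l, j, k, (1.0, 0.0, 1.0))
```

Because the value depends on the relative orientation of two neighbouring local frames, it cannot be recovered from each 1-hop neighbourhood taken in isolation.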

We can also show this via a geometric version of computation trees [Garg et al., 2020], for any
number of IGWL or G-invariant GNN iterations, as illustrated in Figure 3.6. A computation tree
$\mathcal{T}_i^{(t)}$ represents the maximum information contained in GWL/IGWL colours or GNN features
for node i at iteration t by an ‘unrolling’ of the message passing procedure. GWL, IGWL,
and the corresponding classes of GNNs can be intuitively understood as colouring geometric
computation trees.

Geometric computation trees are constructed recursively: $\mathcal{T}_i^{(0)} = (s_i, \vec{v}_i)$ for all i ∈ V. For
t > 0, we start with a root node $(s_i, \vec{v}_i)$ and add a child subtree $\mathcal{T}_j^{(t-1)}$ for all $j \in \mathcal{N}_i$ along with
the relative position $\vec{x}_{ij}$ along the edge. To obtain the root node’s embedding or colour, both
scalar and geometric information is propagated from the leaves up to the root. Thus, if two nodes
have identical geometric computation trees, they will be mapped to the same node embedding or
colour.
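The recursive construction can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the thesis: the function names are hypothetical, the relative position is taken as x_ij = x_i − x_j, leaves are stored as (features, ()) so that every tree has the same shape, and trees are built in a fixed coordinate frame, so no attempt is made to compare them up to group actions.

```python
def computation_tree(i, t, feats, adj, pos):
    """Unroll t rounds of message passing at node i into a nested tuple.

    Returns (root_features, children), where each child is a pair
    (relative_position_x_ij, subtree_of_j). Children are sorted so that the
    result does not depend on the order in which neighbours are listed.
    """
    if t == 0:
        return (feats[i], ())
    children = []
    for j in adj[i]:
        # Assumed convention for this sketch: x_ij = x_i - x_j.
        x_ij = tuple(a - b for a, b in zip(pos[i], pos[j]))
        children.append((x_ij, computation_tree(j, t - 1, feats, adj, pos)))
    return (feats[i], tuple(sorted(children)))


def tree_size(tree):
    """Number of nodes in a computation tree."""
    return 1 + sum(tree_size(subtree) for _, subtree in tree[1])


# Toy path graph 0 - 1 - 2 with (scalar, vector) features at every node.
adj = {0: [1], 1: [0, 2], 2: [1]}
pos = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (2.0, 0.0, 0.0)}
feats = {i: ("C", (0.0, 0.0, 0.0)) for i in adj}

tree_at_centre = computation_tree(1, 2, feats, adj, pos)
tree_at_end = computation_tree(0, 2, feats, adj, pos)
```

Two nodes whose nested tuples are equal receive the same information under this unrolling; in this toy example the centre and end nodes already differ in tree size.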

Critically, geometric orientation information cannot flow from one level to another in the
computation trees for IGWL and G-invariant GNNs, as they only update scalar information. In
the recursive construction procedure, we must insert a connector node $(s_j, \vec{v}_j)$ before adding
the child subtree $\mathcal{T}_j^{(t-1)}$ for all $j \in \mathcal{N}_i$ and prevent geometric information propagation between
them.

Following the construction procedure for the geometric graphs in Figure A.1, we observe
that the IGWL computation trees of any pair of isomorphic nodes are identical, as all 1-hop
neighbourhoods are computationally identical. Therefore, the set of node colours or node scalar
features will also be identical, which implies that G1 and G2 cannot be distinguished.

¹ The same result applies for the bounding sphere, not shown in the figure.

Proposition 10. IGWL has the same expressive power as GWL for fully connected geometric
graphs.

Proof of Proposition 10. We will prove by contradiction. Assume that there exists a pair of fully
connected geometric graphs G1 and G2 which GWL can distinguish, but IGWL cannot.

If the underlying attributed graphs of G1 and G2 are isomorphic, by Proposition 3 and
Proposition 6, G1 and G2 are 1-hop identical but k-hop distinct for some k > 1. For all bijections
b and all nodes i ∈ V1, b(i) ∈ V2, the local neighbourhoods $\mathcal{N}_i^{(1)}$ and $\mathcal{N}_{b(i)}^{(1)}$ are identical up
to group actions, and $\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_{b(i)}^{(1)}) \Rightarrow c_i^{(1)} = c_{b(i)}^{(1)}$. Additionally, there exists some
bijection b and some nodes i ∈ V1, b(i) ∈ V2 such that the k-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$
are distinct, and $\mathcal{O}_G(g_i^{(k)}) \cap \mathcal{O}_G(g_{b(i)}^{(k)}) \equiv \emptyset \Rightarrow c_i^{(k)} \neq c_{b(i)}^{(k)}$. However, as G1 and G2 are fully
connected, for any k, $\mathcal{N}_i^{(1)} = \mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(1)} = \mathcal{N}_{b(i)}^{(k)}$ are identical up to group actions. Thus,
$\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_i^{(k)}) = \mathcal{O}_G(g_{b(i)}^{(1)}) = \mathcal{O}_G(g_{b(i)}^{(k)}) \Rightarrow c_i^{(1)} = c_i^{(k)} = c_{b(i)}^{(1)} = c_{b(i)}^{(k)}$.
This is a contradiction.

If G1 and G2 are non-isomorphic and fully connected, for any arbitrary i ∈ V1, j ∈ V2
and any k-hop neighbourhood, we know that $\mathcal{N}_i^{(1)} = \mathcal{N}_i^{(k)}$ and $\mathcal{N}_j^{(1)} = \mathcal{N}_j^{(k)}$. Thus, a single
iteration of GWL and IGWL identify the same G-orbits and assign the same node colours, i.e.
$\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_i^{(k)}) \Rightarrow c_i^{(1)} = c_i^{(k)}$ and $\mathcal{O}_G(g_j^{(1)}) = \mathcal{O}_G(g_j^{(k)}) \Rightarrow c_j^{(1)} = c_j^{(k)}$.
This is a contradiction.

A.1.3 Role of Scalarisation Body Order (Section 3.3.3)

Proposition 11. I-HASH(m) is G-orbit injective for m = max({|Ni| | i ∈ V}), the maximum
cardinality of all local neighbourhoods Ni in a given dataset.

Proof of Proposition 11. As m is the maximum cardinality of all local neighbourhoods Ni under
consideration, any distinct neighbourhoods N1 and N2 must have distinct multisets of m-body
scalars. As I-HASH(m) computes scalars involving up to m nodes, it will be able to distinguish
any such N1 and N2. Thus, I-HASH(m) is G-orbit injective.

Proposition 12. IGWL(k) is at least as powerful as IGWL(k−1). For k ≤ 5, IGWL(k) is strictly
more powerful than IGWL(k−1).

Proof of Proposition 12. By construction, I-HASH(k) computes G-invariant scalars from all
possible tuples of up to k nodes formed by the elements of a neighbourhood and the central
node. Thus, the I-HASH(k) class contains I-HASH(k−1), and I-HASH(k) is at least as powerful as
I-HASH(k−1). Thus, the corresponding test IGWL(k) is at least as powerful as IGWL(k−1).

Secondly, to show that IGWL(k) is strictly more powerful than IGWL(k−1) for k ≤ 5, it
suffices to show that there exists a pair of geometric neighbourhoods that can be distinguished by
IGWL(k) but not by IGWL(k−1):

• For k = 3 and G = O(3) or SO(3), for the local neighbourhood from Figure 1 in Schütt
et al. [2021], two configurations with different angles between the neighbouring nodes can
be distinguished by IGWL(3) but not by IGWL(2).

• For k = 4 and G = O(3) or SO(3), the pair of local neighbourhoods from Figure 1 in
Pozdnyakov et al. [2020] can be distinguished by IGWL(4) but not by IGWL(3).

• For k = 5 and G = O(3), the pair of local neighbourhoods from Figure 2(e) in Pozdnyakov
et al. [2020] can be distinguished by IGWL(5) but not by IGWL(4).

• For k = 5 and G = SO(3), the pair of local neighbourhoods from Figure 2(f) in Pozdnyakov
et al. [2020] can be distinguished by IGWL(5) but not by IGWL(4).
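The k = 3 case can be illustrated with a toy neighbourhood. The sketch below assumes that 2-body scalars are distances from the central node to each neighbour and that 3-body scalars are cosines of the angles between pairs of neighbours as seen from the central node. The function names and the two configurations are illustrative and are not the exact examples from the cited figures.

```python
import math
from itertools import combinations


def two_body_scalars(neighbours):
    """Sorted multiset of distances from the central node to each neighbour."""
    return sorted(round(math.hypot(*x), 6) for x in neighbours)


def three_body_scalars(neighbours):
    """Sorted multiset of angle cosines between every pair of neighbours."""
    cosines = []
    for a, b in combinations(neighbours, 2):
        dot = sum(p * q for p, q in zip(a, b))
        cosines.append(round(dot / (math.hypot(*a) * math.hypot(*b)), 6))
    return sorted(cosines)


# Relative positions of two neighbours around a central node at the origin.
right_angle = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
wide_angle = [
    (1.0, 0.0, 0.0),
    (math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3), 0.0),
]

same_distances = two_body_scalars(right_angle) == two_body_scalars(wide_angle)
same_angles = three_body_scalars(right_angle) == three_body_scalars(wide_angle)
```

Both neighbourhoods have the same multiset of distances, so distance-only scalars cannot separate them, while the angle cosines differ.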

Proposition 13. Let $G_1 = (A_1, S_1, \vec{X}_1)$ and $G_2 = (A_2, S_2, \vec{X}_2)$ be two geometric graphs with
the property that all edges have equal length. Then, IGWL(2) distinguishes the two graphs if and
only if WL can distinguish the attributed graphs $(A_1, S_1)$ and $(A_2, S_2)$.

Proof of Proposition 13. Let c and k be the colours produced by IGWL(2) and WL, respectively,
and let i and j be two nodes belonging to any two graphs as in the statement of the result. We
prove the statement inductively.

Clearly, $c_i^{(0)} = k_i^{(0)}$ for all nodes i, and $c_i^{(0)} = c_j^{(0)}$ if and only if $k_i^{(0)} = k_j^{(0)}$. Now, assume that
the statement holds for iteration t. That is, $c_i^{(t)} = c_j^{(t)}$ if and only if $k_i^{(t)} = k_j^{(t)}$ holds for all nodes
i and j. Note that $c_i^{(t+1)} = c_j^{(t+1)}$ if and only if $c_i^{(t)} = c_j^{(t)}$ and
$\{\{(c_p^{(t)}, \|\vec{x}_{ip}\|) \mid p \in \mathcal{N}_i\}\} = \{\{(c_p^{(t)}, \|\vec{x}_{jp}\|) \mid p \in \mathcal{N}_j\}\}$,
since the norm of the relative vectors is the only injective invariant that IGWL(2) can
compute (up to a scaling). Since all the norms are equal, by the induction hypothesis, this is
equivalent to $k_i^{(t)} = k_j^{(t)}$ and $\{\{k_p^{(t)} \mid p \in \mathcal{N}_i\}\} = \{\{k_p^{(t)} \mid p \in \mathcal{N}_j\}\}$. Therefore, this is equivalent
to $k_i^{(t+1)} = k_j^{(t+1)}$.
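The statement of Proposition 13 can be checked on a toy pair of graphs with the following sketch. It is a simplified stand-in rather than the tests defined in Chapter 3: `refine` performs plain colour refinement with an optional per-edge invariant, all nodes start with the same colour, and edge lengths are rounded before comparison. All names and coordinates are assumptions made for illustration. A 6-cycle and two disjoint triangles are a standard pair that WL cannot tell apart.

```python
import math
from collections import Counter


def refine(graphs, edge_invariant, rounds=3):
    """Jointly refine node colours of several graphs.

    graphs: list of adjacency dicts {node: [neighbours]}.
    edge_invariant(g, i, p): scalar attached to neighbour p of node i in graph g.
    Returns one colour histogram per graph.
    """
    colours = [{i: 0 for i in adj} for adj in graphs]
    for _ in range(rounds):
        signatures = []
        for g, adj in enumerate(graphs):
            sig = {}
            for i, nbrs in adj.items():
                multiset = sorted((colours[g][p], edge_invariant(g, i, p)) for p in nbrs)
                sig[i] = (colours[g][i], tuple(multiset))
            signatures.append(sig)
        palette = {}  # injective relabelling shared by all graphs
        for sig in signatures:
            for s in sig.values():
                palette.setdefault(s, len(palette))
        colours = [{i: palette[s] for i, s in sig.items()} for sig in signatures]
    return [Counter(c.values()) for c in colours]


def distinguishable(graphs, edge_invariant):
    first, second = refine(graphs, edge_invariant)
    return first != second


def length_invariant(positions):
    """Edge-length invariant, rounded to absorb floating-point noise."""
    return lambda g, i, p: round(math.dist(positions[g][i], positions[g][p]), 6)


hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
pair = [hexagon, triangles]

h = math.sqrt(3) / 2
hex_pos = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]
tri_pos = [(0.0, 0.0), (1.0, 0.0), (0.5, h), (5.0, 0.0), (6.0, 0.0), (5.5, h)]
stretched_hex_pos = [(2 * x, y) for x, y in hex_pos]  # unequal edge lengths

wl_result = distinguishable(pair, lambda g, i, p: 0.0)
equal_length_result = distinguishable(pair, length_invariant([hex_pos, tri_pos]))
stretched_result = distinguishable(pair, length_invariant([stretched_hex_pos, tri_pos]))
```

With all edges of unit length, the distance-based refinement agrees with WL and fails on this pair; once the hexagon is stretched so that edge lengths differ, the distance-based refinement separates the graphs while WL still cannot.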

A.2 Proofs for Equivalence between GWL and Geometric
GNNs (Section 3.2.2)

Our proofs adapt the techniques used in Xu et al. [2019] and Morris et al. [2019] for connecting WL
with GNNs. Note that we omit the relative position vectors $\vec{x}_{ij}$ in GWL and geometric
GNN updates for brevity, as relative position vectors can be merged into the vector features.

Theorem 1. Any pair of geometric graphs distinguishable by a G-equivariant GNN is also
distinguishable by GWL.

Proof of Theorem 1. Consider two geometric graphs G and H. The theorem implies that if the
GNN graph-level readout outputs f(G) ≠ f(H), then the GWL test will always determine G
and H to be non-isomorphic, i.e. G ≠ H.

We will prove by contradiction. Suppose after T iterations, a GNN graph-level readout
outputs f(G) ≠ f(H), but the GWL test cannot decide G and H are non-isomorphic, i.e. G and
H always have the same collection of node colours for iterations 0 to T. Thus, for iteration t and
t + 1 for any t = 0 . . . T − 1, G and H have the same collection of node colours $\{c_i^{(t)}\}$ as well as
the same collection of neighbourhood geometric multisets
$\left\{ (c_i^{(t)}, g_i^{(t)}), \{\{(c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right\}$
up to group actions. Otherwise, the GWL test would have produced different node colours at
iteration t + 1 for G and H as different geometric multisets get unique new colours.



We will show that on the same graph for nodes i and k, if $(c_i^{(t)}, g_i^{(t)}) = (c_k^{(t)}, g \cdot g_k^{(t)})$, we
always have GNN features $(s_i^{(t)}, \vec{v}_i^{(t)}) = (s_k^{(t)}, Q_g \vec{v}_k^{(t)})$ for any iteration t. This holds for t = 0
because GWL and the GNN start with the same initialisation. Suppose this holds for iteration t.
At iteration t + 1, if for any i and k, $(c_i^{(t+1)}, g_i^{(t+1)}) = (c_k^{(t+1)}, g \cdot g_k^{(t+1)})$, then:

$$\left\{ (c_i^{(t)}, g_i^{(t)}), \{\{(c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right\} = \left\{ (c_k^{(t)}, g \cdot g_k^{(t)}), \{\{(c_j^{(t)}, g \cdot g_j^{(t)}) \mid j \in \mathcal{N}_k\}\} \right\} \tag{A.1}$$

By our assumption on iteration t,

$$\left\{ (s_i^{(t)}, \vec{v}_i^{(t)}), \{\{(s_j^{(t)}, \vec{v}_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right\} = \left\{ (s_k^{(t)}, Q_g \vec{v}_k^{(t)}), \{\{(s_j^{(t)}, Q_g \vec{v}_j^{(t)}) \mid j \in \mathcal{N}_k\}\} \right\} \tag{A.2}$$

As the same aggregate and update operations are applied at each node within the GNN, the same
inputs, i.e. neighbourhood features, are mapped to the same output. Thus,
$(s_i^{(t+1)}, \vec{v}_i^{(t+1)}) = (s_k^{(t+1)}, Q_g \vec{v}_k^{(t+1)})$. By induction, if $(c_i^{(t)}, g_i^{(t)}) = (c_k^{(t)}, g \cdot g_k^{(t)})$, we always have GNN node
features $(s_i^{(t)}, \vec{v}_i^{(t)}) = (s_k^{(t)}, Q_g \vec{v}_k^{(t)})$ for any iteration t. This creates valid mappings $\phi_s, \phi_v$ such
that $s_i^{(t)} = \phi_s(c_i^{(t)})$ and $\vec{v}_i^{(t)} = \phi_v(c_i^{(t)}, g_i^{(t)})$ for any i ∈ V.

Thus, if G and H have the same collection of node colours and geometric multisets, then G
and H also have the same collection of GNN neighbourhood features

$$\left\{ (s_i^{(t)}, \vec{v}_i^{(t)}), \{\{(s_j^{(t)}, \vec{v}_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right\} = \left\{ (\phi_s(c_i^{(t)}), \phi_v(c_i^{(t)}, g_i^{(t)})), \{\{(\phi_s(c_j^{(t)}), \phi_v(c_j^{(t)}, g_j^{(t)})) \mid j \in \mathcal{N}_i\}\} \right\}$$

Thus, the GNN will output the same collection of node scalar features $\{s_i^{(T)}\}$ for G and H and the
permutation-invariant graph-level readout will output f(G) = f(H). This is a contradiction.

Similarly, G-invariant GNNs can be at most as powerful as IGWL.

Theorem 14. Any pair of geometric graphs distinguishable by a G-invariant GNN is also
distinguishable by IGWL.

Proof. The proof follows similarly to the proof for Theorem 1.

Proposition 2. G-equivariant GNNs have the same expressive power as GWL if the following
conditions hold: (1) The aggregation AGG is an injective, G-equivariant multiset function. (2)
The scalar part of the update UPDs is a G-orbit injective, G-invariant multiset function. (3)
The vector part of the update UPDv is an injective, G-equivariant multiset function. (4) The
graph-level readout f is an injective multiset function.

Proof of Proposition 2. Consider a GNN where the conditions hold. We will show that, with a
sufficient number of iterations t, the output of this GNN is equivalent to GWL, i.e. $s^{(t)} \equiv c^{(t)}$.

Let G and H be any geometric graphs which the GWL test decides as non-isomorphic at
iteration T. Because the graph-level readout function is injective, i.e. it maps distinct multisets of
node scalar features into unique embeddings, it suffices to show that the GNN’s neighbourhood
aggregation process, with sufficient iterations, embeds G and H into different multisets of node
features.

For this proof, we replace G-orbit injective functions with injective functions over the
equivalence class generated by the actions of G. Thus, all elements belonging to the same
G-orbit will first be mapped to the same representative of the equivalence class, denoted by the
square brackets [. . . ], followed by an injective map. The result is G-orbit injective.

Let us assume the GNN updates node scalar and vector features as:

$$s_i^{(t)} = \text{UPD}_s\left(\left[ (s_i^{(t-1)}, \vec{v}_i^{(t-1)}), \; \text{AGG}\left(\{\{(s_i^{(t-1)}, s_j^{(t-1)}, \vec{v}_i^{(t-1)}, \vec{v}_j^{(t-1)}) \mid j \in \mathcal{N}_i\}\}\right) \right]\right) \tag{A.3}$$

$$\vec{v}_i^{(t)} = \text{UPD}_v\left( (s_i^{(t-1)}, \vec{v}_i^{(t-1)}), \; \text{AGG}\left(\{\{(s_i^{(t-1)}, s_j^{(t-1)}, \vec{v}_i^{(t-1)}, \vec{v}_j^{(t-1)}) \mid j \in \mathcal{N}_i\}\}\right) \right) \tag{A.4}$$

with the aggregation function AGG being G-equivariant and injective, the scalar update function
UPDs being G-invariant and injective, and the vector update function UPDv being G-equivariant
and injective.

The GWL test updates the node colour $c_i^{(t)}$ and geometric multiset $g_i^{(t)}$ as:

$$c_i^{(t)} = h_s\left(\left[ (c_i^{(t-1)}, g_i^{(t-1)}), \; \{\{(c_j^{(t-1)}, g_j^{(t-1)}) \mid j \in \mathcal{N}_i\}\} \right]\right), \tag{A.5}$$

$$g_i^{(t)} = h_v\left( (c_i^{(t-1)}, g_i^{(t-1)}), \; \{\{(c_j^{(t-1)}, g_j^{(t-1)}) \mid j \in \mathcal{N}_i\}\} \right), \tag{A.6}$$

where $h_s$ is a G-invariant and injective map, and $h_v$ is a G-equivariant and injective operation
(e.g. in equation 3.4, expanding the geometric multiset by copying).

We will show by induction that at any iteration t, there always exist injective functions $\varphi_s$
and $\varphi_v$ such that $s_i^{(t)} = \varphi_s(c_i^{(t)})$ and $\vec{v}_i^{(t)} = \varphi_v(c_i^{(t)}, g_i^{(t)})$. This holds for t = 0 because the
initial node features are the same for GWL and GNN, $c_i^{(0)} \equiv s_i^{(0)}$ and $g_i^{(0)} \equiv (s_i^{(0)}, \vec{v}_i^{(0)})$ for all
i ∈ V(G), V(H). Suppose this holds for iteration t. At iteration t + 1, substituting $s_i^{(t)}$ with
$\varphi_s(c_i^{(t)})$, and $\vec{v}_i^{(t)}$ with $\varphi_v(c_i^{(t)}, g_i^{(t)})$ gives us

$$s_i^{(t+1)} = \text{UPD}_s\left(\left[ (\varphi_s(c_i^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)})), \; \text{AGG}\left(\{\{(\varphi_s(c_i^{(t)}), \varphi_s(c_j^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)}), \varphi_v(c_j^{(t)}, g_j^{(t)})) \mid j \in \mathcal{N}_i\}\}\right) \right]\right)$$

$$\vec{v}_i^{(t+1)} = \text{UPD}_v\left( (\varphi_s(c_i^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)})), \; \text{AGG}\left(\{\{(\varphi_s(c_i^{(t)}), \varphi_s(c_j^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)}), \varphi_v(c_j^{(t)}, g_j^{(t)})) \mid j \in \mathcal{N}_i\}\}\right) \right)$$

The composition of multiple injective functions is injective. Therefore, there exist some injective
functions $g_s$ and $g_v$ such that:

$$s_i^{(t+1)} = g_s\left(\left[ (c_i^{(t)}, g_i^{(t)}), \; \{\{(c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right]\right), \tag{A.7}$$

$$\vec{v}_i^{(t+1)} = g_v\left( (c_i^{(t)}, g_i^{(t)}), \; \{\{(c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right), \tag{A.8}$$



We can then consider:

$$s_i^{(t+1)} = g_s \circ h_s^{-1} \, h_s\left(\left[ (c_i^{(t)}, g_i^{(t)}), \; \{\{(c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right]\right), \tag{A.9}$$

$$\vec{v}_i^{(t+1)} = g_v \circ h_v^{-1} \, h_v\left( (c_i^{(t)}, g_i^{(t)}), \; \{\{(c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i\}\} \right), \tag{A.10}$$

Then, we can denote $\varphi_s = g_s \circ h_s^{-1}$ and $\varphi_v = g_v \circ h_v^{-1}$ as injective functions because the
composition of injective functions is injective. Hence, for any iteration t + 1, there exist injective
functions $\varphi_s$ and $\varphi_v$ such that $s_i^{(t+1)} = \varphi_s\left(c_i^{(t+1)}\right)$ and $\vec{v}_i^{(t+1)} = \varphi_v\left(c_i^{(t+1)}, g_i^{(t+1)}\right)$.

At the T-th iteration, the GWL test decides that G and H are non-isomorphic, which means
the multisets of node colours $\{c_i^{(T)}\}$ are different for G and H. The GNN’s node scalar features
$\{s_i^{(T)}\} = \{\varphi_s(c_i^{(T)})\}$ must also be different for G and H because of the injectivity of $\varphi_s$.

A weaker set of conditions is sufficient for a G-invariant GNN to be at least as expressive as
IGWL.

Proposition 15. G-invariant GNNs have the same expressive power as IGWL if the following
conditions hold: (1) The aggregation ψ and update ϕ are G-orbit injective, G-invariant multiset
functions. (2) The graph-level readout f is an injective multiset function.

Proof. The proof follows similarly to the proof for Proposition 2.



Appendix B

Appendix: Unified Generative Modelling of
Molecules and Materials (Chapter 4)

B.1 Evaluation Metrics

Crystal generation metrics We follow the evaluation protocol established by Xie et al. [2022],
Miller et al. [2024], where we sample 10,000 crystals and compute validity, stability, uniqueness,
and novelty rates, defined as follows:

• Structural validity: % of crystals with all pairwise distances >= 0.5 and volume >= 0.1.
• Compositional validity: % of crystal compositions with charge neutrality and electronegativity
balance according to SMACT [Davies et al., 2019].
• Overall validity: % of crystals which are both structurally and compositionally valid.
• Stability: % of crystals with DFT energy above hull < 0.0 eV/atom and number of unique
elements >= 2. (We also report metastability as DFT energy above hull < 0.1 eV/atom and
number of unique elements >= 2.)
• Stable & unique: % of stable crystals which are unique, as defined by an all-to-all comparison
using Structure Matcher from PyMatGen [Ong et al., 2013].
• Stable, unique & novel: % of stable, unique crystals which are novel, as defined by an all-to-all
comparison to all crystals in MP-20 using Structure Matcher.

To compute the stability, uniqueness, and novelty rates, we follow Miller et al. [2024] and Sriram
et al. [2024]: we first pre-relax the sampled crystals using a fast ML potential, CHGNet [Deng
et al., 2023], and then perform DFT relaxation. We then determine the DFT energy above hull
for the relaxed structures against the Matbench Discovery convex hull [Riebesell et al., 2023].
Note that there is a lower bound on the number of completed DFT calculations due to memory
or timeout errors.
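The way these rates nest can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the reported numbers: each sample is reduced to a hypothetical hashable `key` that stands in for structure matching, duplicates among stable crystals are counted once, and every rate is normalised by the total number of samples, which is how the nested columns of Table B.1 read. The function name and the toy records are made up for the example.

```python
def nested_rates(samples, training_keys, e_hull_threshold=0.0):
    """Compute stable, stable-and-unique, and stable-unique-novel rates in percent.

    samples: list of dicts with keys
        'e_hull'     - DFT energy above hull in eV/atom,
        'n_elements' - number of unique elements,
        'key'        - hashable stand-in for a structure-matching result (assumption).
    training_keys: set of keys for the training crystals.
    """
    total = len(samples)
    stable = [
        s for s in samples
        if s["e_hull"] < e_hull_threshold and s["n_elements"] >= 2
    ]
    unique_keys = []
    seen = set()
    for s in stable:
        if s["key"] not in seen:  # keep one representative per duplicate group
            seen.add(s["key"])
            unique_keys.append(s["key"])
    novel_keys = [k for k in unique_keys if k not in training_keys]

    def pct(count):
        return 100.0 * count / total

    return {
        "S": pct(len(stable)),
        "S.U.": pct(len(unique_keys)),
        "S.U.N.": pct(len(novel_keys)),
    }


# Toy records, invented purely to exercise the function.
toy_samples = [
    {"e_hull": -0.05, "n_elements": 2, "key": "a"},
    {"e_hull": -0.02, "n_elements": 3, "key": "a"},  # duplicate of the first
    {"e_hull": -0.01, "n_elements": 2, "key": "b"},  # present in the training set
    {"e_hull": 0.05, "n_elements": 2, "key": "c"},   # metastable only
    {"e_hull": -0.10, "n_elements": 1, "key": "d"},  # single element, excluded
]
toy_training = {"b"}

stable_rates = nested_rates(toy_samples, toy_training, e_hull_threshold=0.0)
metastable_rates = nested_rates(toy_samples, toy_training, e_hull_threshold=0.1)
```

Passing a threshold of 0.1 eV/atom gives the corresponding metastability variants of the same three rates.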



Molecule generation metrics We follow the evaluation protocol established by Hoogeboom
et al. [2022] and Daigavane et al. [2024], where we sample 10,000 molecules and compute validity
and uniqueness rates as well as success rates for 7 sanity checks from PoseBusters [Buttenschoen
et al., 2024], as follows:

• Validity: % of molecules with canonical SMILES string found by RDKit.
• Uniqueness: % of unique SMILES among valid ones.
• All-atoms connected: % of molecules where there exists a path along bonds between all atoms.
• Reasonable bond angles/lengths: % of molecules where all angles/lengths are within 0.75 of
the lower and 1.25 of the upper bounds determined by distance geometry.
• Aromatic rings flatness: % of molecules where all atoms in aromatic rings with 5 or 6
members are within 0.25 Å of the closest shared plane.
• Double bond flatness: % of molecules where all atoms of aliphatic carbon-carbon double
bonds and their four neighbours are within 0.25 Å of the closest shared plane.
• Reasonable internal energy: % of molecules where the calculated energy is no more than 100
times the average energy of an ensemble of 50 conformations generated for the input molecule.
• No internal steric clash: % of molecules where the interatomic distance between pairs of
non-covalently bound atoms is above 0.8 of the distance geometry lower bound.

The validity and uniqueness metrics focus on whether the chemical composition of generated
molecules can be processed by RDKit, while the PoseBusters sanity checks evaluate the physical
realism of the generated 3D structures across multiple criteria, from geometric constraints like
bond lengths to energetic considerations [Harris et al., 2023].
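Three of the simpler checks listed above can be sketched in a few lines. This is not the PoseBusters implementation: the function names are hypothetical, the molecule is given as an atom count and a list of bonded index pairs, and the distance-geometry bounds are passed in as plain numbers instead of being computed.

```python
from collections import deque


def all_atoms_connected(n_atoms, bonds):
    """True if every atom can be reached from atom 0 along bonds.

    An empty molecule is treated as not connected (a choice made for this sketch).
    """
    if n_atoms == 0:
        return False
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n_atoms


def within_bounds(value, lower, upper, lower_factor=0.75, upper_factor=1.25):
    """Bond length or angle lies within 0.75x the lower and 1.25x the upper bound."""
    return lower_factor * lower <= value <= upper_factor * upper


def no_steric_clash(distance, lower_bound, factor=0.8):
    """Non-bonded pair is further apart than 0.8x the lower bound."""
    return distance > factor * lower_bound


# Toy inputs: a three-atom chain, and the same atoms with one bond missing.
chain_ok = all_atoms_connected(3, [(0, 1), (1, 2)])
fragmented = all_atoms_connected(3, [(0, 1)])
length_ok = within_bounds(1.0, lower=1.2, upper=1.5)
length_too_short = within_bounds(0.8, lower=1.2, upper=1.5)
clash_free = no_steric_clash(2.0, lower_bound=2.4)
clashing = no_steric_clash(1.5, lower_bound=2.4)
```

A molecule passes a given check only if the condition holds for all of its relevant atoms, bonds, or atom pairs, so in practice these predicates would be combined with `all(...)` over the molecule.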

B.2 Additional Results

Histograms from DFT validation In Figure B.1, we show histograms of DFT energy above
hull, formation energy, and number of unique elements per crystal for 10,000 generated crystals
from ADiT, FlowMM, and FlowLLM compared to the MP20 training distribution. ADiT
generates more thermodynamically stable crystals than prior models, as shown by the larger
proportion of samples with DFT energy above hull below 0.0 eV/atom. The distribution of DFT
formation energies and number of unique elements per crystal from ADiT samples more closely
matches the MP20 training data compared to FlowMM and FlowLLM baselines, suggesting
that ADiT better captures the underlying physical and chemical constraints of stable crystal
structures. Note that we ran DFT calculations for all model samples under identical hardware
and settings to ensure fair comparison.

Histogram of spacegroups In Figure B.2, we show the distribution of spacegroups for 10,000
generated crystals from ADiT, FlowMM, FlowLLM and the MP20 distribution. Diffusion-based

[Figure B.1 panels: (a) DFT energy above hull, count vs. Ehull (eV/atom), for ADiT, FlowLLM, and FlowMM, with thresholds marked at Ehull < 0.0 (stable) and Ehull < 0.08 (MP20); (b) DFT formation energy, count vs. formation energy (eV/atom), for ADiT, FlowLLM, FlowMM, and the MP20 test set; (c) Number of elements, count vs. number of unique elements per crystal, for ADiT, FlowLLM, FlowMM, and the MP20 test set.]

Figure B.1: Histograms from DFT validation of 10,000 generated crystals. ADiT is more
likely to generate stable crystals with DFT energy above hull <0.0 eV/atom compared to prior
models. Samples from ADiT most closely follow the distributions for DFT formation energy
and number of unique elements per crystal from MP20.

models (ADiT and FlowMM) tend to oversample crystals with P1 spacegroup, which represents
the lowest symmetry group, likely due to their local, step-wise denoising process. In contrast,
FlowLLM, an autoregressive language model, tends to oversample spacegroups like Fm-3m,
Pm-3m, and I4/mmm compared to the training data. While it would be straightforward to control
the distribution of spacegroups generated by ADiT through classifier-free guidance conditioning,
we leave this for future work since our current focus is on unconditional generation of diverse
molecular systems.

[Figure B.2 plot: count vs. international spacegroup number (0 to 230) for ADiT, FlowLLM, FlowMM, and the MP20 test set; labelled spacegroups include P1, Cm, P2_1/c, Pnma, P4/nmm, I4/mmm, P6_3/mmc, Fm-3m, and Pm-3m.]

Figure B.2: Histogram of spacegroups for 10,000 generated crystals. Diffusion-based ADiT
and FlowMM tend to oversample crystals with P1 spacegroup compared to the MP20 training
distribution. FlowLLM, an autoregressive language model, tends to oversample crystals with
Fm-3m, Pm-3m, and I4/mmm spacegroups.

SUN rate and scaling ADiT In Table B.1, we observe that the combined stability, uniqueness,
and novelty (S.U.N.) rate for crystal generation decreases as we scale up the DiT denoiser from
DiT-S (32M) to DiT-L (450M). While stability and uniqueness rates increase with model size,
the S.U.N. rate decreases due to the larger model’s greater capacity to memorize the small
MP20 training dataset of 27K crystals. This suggests that larger models may be more prone to
generating duplicate or near-duplicate samples, which we plan to address by training on larger



Table B.1: Impact of scaling on stability, uniqueness, and novelty rates for 10,000 generated
crystals. We find that the stability rate as well as the stability & uniqueness rate increase as we
increase the number of model parameters for ADiT from 32M to 450M. However, larger ADiT models
have greater capacity to memorise the small MP20 training dataset of 27K crystals, resulting in
a decrease in the combined stability, uniqueness, & novelty rate. ADiT-S trained on MP20-only
achieves a S.U.N. rate of 6.5%, representing a significant improvement over previously published
state-of-the-art models which attained S.U.N. rates up to 4.7%.

                             Stability (Ehull < 0.0)                  Metastability (Ehull < 0.1)
Model                        S (%) ↑   S.U. (%) ↑   S.U.N. (%) ↑     M.S. (%) ↑   M.S.U. (%) ↑   M.S.U.N. (%) ↑

MP20-only ADiT-S (32M)       12.8      11.8         6.5              71.1         64.9           38.1
MP20-only ADiT-B (130M)      14.1      12.5         4.7              81.6         67.3           25.9

Joint ADiT-S (32M)           12.6      11.4         6.0              71.9         64.7           37.7
Joint ADiT-B (130M)          15.4      13.4         5.3              81.0         70.2           28.2
Joint ADiT-L (450M)          15.5      13.5         5.0              82.5         70.9           27.9

and more diverse datasets in future work. For crystals, the Alexandria dataset of inorganic crystals
and the Crystallography Open Database of organic crystals present promising opportunities for
scaling up. Notably, ADiT-S trained on MP20-only achieves a S.U.N. rate of 6.5%, representing a
significant improvement over previously published results from FlowMM (2.8%) and FlowLLM
(4.7%). This demonstrates that even our smallest model variant substantially advances the
state-of-the-art for crystal generation.

[Figure B.3 panels: (a) Validity rate (%) vs. number of samples generated (ADiT), for crystals and molecules. Validity rates are consistent across seeds. (b) Stable, unique, novel rate (%) vs. number of crystals sampled, for ADiT-L, ADiT-B, ADiT-S, FlowLLM, and FlowMM. S.U.N. rates converge after 5,000 samples.]

Figure B.3: Consistency of validity and S.U.N. rates as we increase number of samples.
We plot the validity and S.U.N. rates vs. number of sampled crystals or molecules. Error bars
indicate 95% confidence interval across three different random seeds.

Sensitivity of validity rate to number of samples and random seed In Figure B.3a, we plot
the validity rates for crystal and molecule generation as we increase the number of samples from
100 to 10,000 for 3 different random seeds. We observe that the validity rates generally converge
and are stable across random seeds after sampling over 5,000 crystals or molecules.
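For readers who want to reproduce error bars of this kind, one common recipe is a Student-t interval over the per-seed rates, sketched below. The text does not state how the intervals in Figure B.3 were computed, so the t-interval is an assumption of this sketch (plotting libraries often use a bootstrap instead). The function name and the example rates are placeholders, not results from the thesis.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 2 degrees of freedom (3 seeds).
T_CRIT_95_DF2 = 4.303


def mean_and_ci95(values_per_seed):
    """Return (mean, half-width of the 95% t-interval) for exactly three seeds."""
    if len(values_per_seed) != 3:
        raise ValueError("this sketch hard-codes the critical value for three seeds")
    mean = statistics.mean(values_per_seed)
    standard_error = statistics.stdev(values_per_seed) / math.sqrt(len(values_per_seed))
    return mean, T_CRIT_95_DF2 * standard_error


# Placeholder validity rates (%) for three seeds, invented for illustration.
example_rates = [94.0, 95.0, 96.0]
example_mean, example_half_width = mean_and_ci95(example_rates)
```

With only three seeds the critical value is large, so intervals of this kind are wide unless the per-seed rates agree closely.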



Sensitivity of S.U.N. rate to number of samples In Figure B.3b, we plot the S.U.N. (stability,
uniqueness, and novelty) rates for crystal generation as we increase the number of samples from
100 to 10,000 across 3 different random seeds. The S.U.N. rates converge after approximately
5,000 samples for diffusion-based methods like ADiT and FlowMM. In contrast, autoregressive
models like FlowLLM show higher variance in S.U.N. rates, likely due to more frequent
generation of duplicate crystals during low-temperature sampling.

B.3 Ablation Study

Table B.3 and Table B.2 present ablation studies as well as aggregated benchmarks for various
configurations of ADiT’s latent diffusion model and autoencoder, respectively. Key takeaways are
highlighted below. Note that, unless otherwise stated, results in the main paper are reported for
jointly trained ADiT-B which uses DiT-B denoiser, standard Transformer encoder and decoder,
latent dimension d = 8, and KL regularization weight λKL = 1e−5.

Joint vs. dataset-specific training Joint training of the autoencoder to embed both molecules
and crystals into a shared latent space achieves similar or better reconstruction performance
compared to dataset-specific training, as shown in Table B.2 (rows 3, 6, 10). The benefits of
joint training are most evident in generative modelling performance – samples from the joint
model have higher validity rates for both crystals and molecules compared to dataset-specific
models, demonstrating effective transfer learning between periodic and non-periodic molecular
systems (Table B.3, rows 12, 16, 20). These results provide strong evidence that ADiTs can
successfully unify the modelling of both periodic and non-periodic systems within a single
architecture, without compromising performance on either domain.

Denoiser architecture The DiT denoiser is a standard Transformer whose key hyperparameters
are the hidden dimension dmodel, the number of attention heads, and the number of layers.
Scaling up the DiT denoiser from DiT-S (32M parameters, dmodel = 384, 6 heads, 12 layers)
to DiT-B (150M, dmodel = 768, 12 heads, 12 layers) and DiT-L (450M, dmodel = 1024, 24
heads, 24 layers) consistently improves generative performance, as shown in Table B.3 (rows
12, 16, 20). The scaling analysis of training loss and validity rates in Figure 4.2 shows
strong correlations between model size and performance metrics. In Figure B.3b, S.U.N. rates
are also higher for larger models than for smaller ones, further confirming the benefits of
scaling up the DiT denoiser.

Autoencoder architecture For the architecture of the autoencoder’s encoder and decoder,
we explored both roto-translation equivariant as well as non-equivariant VAEs. For the equivariant
VAE variant, the encoder is Equiformer-V2 [Liao et al., 2024b] and the decoder is an equivariant
feedforward network adapted from output heads in the Equiformer-V2 codebase. We selected
Equiformer-V2 as it is theoretically expressive [Joshi et al., 2023] and has state-of-the-art
performance across diverse 3D molecular tasks. As input to the Equiformer-V2 encoder, we
use spherical harmonic embeddings of displacement vectors as edge features and exclude the
3D coordinates in Algorithm 1, line 2, from the initial features {hi} as a result. The initial
features {hi} are used as the L = 0 scalar component of the initial spherical tensor features of
Equiformer-V2. The rest of the pseudocode in Algorithms 1 and 2 remains the same.

As shown in Table B.2 (rows 1-4 and 5-8), the choice of autoencoder architecture has a
noticeable impact on reconstruction performance. Standard Transformers generally outperform
Equiformer-V2 for both crystals and molecules, achieving higher match rates (% of test set
samples where the reconstructed structure matches the groundtruth, as determined by PyMatGen’s
StructureMatcher/MoleculeMatcher). More importantly, the latent space learned by standard
Transformers proved more suitable for the latent diffusion process than Equiformer-V2’s
equivariant latent space, leading to substantially better generative performance in terms of
validity rates, particularly for crystals (Table B.3, rows 1-4 and 5-8).

Autoencoder regularization As shown in Table B.2 (rows 9-12), increasing the latent
dimension and reducing the KL regularization weight generally improved autoencoder recon-
struction performance by lowering RMSD values which measure the average distance between
the reconstructed and groundtruth structures. These improvements in reconstruction quality
translated to better generative performance, with higher validity rates for both crystals and
molecules at larger latent dimensions and lower KL weights (see Table B.3, rows 9-12).
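The role of the KL weight can be illustrated with the usual VAE objective, reconstruction loss plus a weighted KL term to a standard normal prior. This is a minimal sketch under that assumption; the function names are hypothetical and the reconstruction loss is passed in as a plain number rather than computed from structures.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def vae_loss(recon_loss, mu, log_var, kl_weight=1e-5):
    """Reconstruction loss plus the KL penalty scaled by kl_weight (lambda_KL)."""
    return recon_loss + kl_weight * kl_to_standard_normal(mu, log_var)

# One latent vector with d = 8, the default latent dimension quoted in this appendix.
mu = [0.5] * 8
log_var = [0.0] * 8
kl = kl_to_standard_normal(mu, log_var)
loss = vae_loss(recon_loss=0.02, mu=mu, log_var=log_var)
print(kl, loss)
```

A smaller `kl_weight` lets the encoder deviate further from the prior in exchange for lower reconstruction error, which is the trade-off the ablation varies.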

Sampling hyperparameters. Classifier-free guidance scale and number of integra-
tion steps are important hyperparameters for inference-time tuning. In Figure B.4, we show
a grid search over guidance scales γ ∈ {1.0, 2.0, 3.0, 4.0, 6.0} and integration steps T ∈
{10, 50, 100, 250, 500, 1000}, finding that different combinations may be optimal for crystals vs.
molecule generation. For each entry in Table B.3, we have reported results for T and γ which
obtain the highest validity rates. T = 500 or 1000 with γ = 1.0 or 2.0 tends to work well across
both molecules and crystals.
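The grid search over γ and T can be sketched as follows. Two things here are assumptions rather than the thesis implementation: the guidance convention (γ = 1.0 returns the conditional prediction unchanged) and the `toy_validity` function, which is a hypothetical stand-in for sampling with a given (γ, T) and measuring the validity rate.

```python
from itertools import product

def guided_prediction(cond, uncond, gamma):
    """Classifier-free guidance: move from the unconditional output towards the
    conditional one (assumed convention: gamma = 1.0 returns cond unchanged)."""
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]

def grid_search(evaluate, gammas, step_counts):
    """Return (best_validity, gamma, T) over all combinations."""
    return max((evaluate(g, t), g, t) for g, t in product(gammas, step_counts))

GAMMAS = [1.0, 2.0, 3.0, 4.0, 6.0]
STEPS = [10, 50, 100, 250, 500, 1000]

def toy_validity(gamma, steps):
    # Hypothetical stand-in for "sample with (gamma, T), then measure validity".
    return 90.0 - abs(gamma - 2.0) + min(steps, 500) / 100.0

best = grid_search(toy_validity, GAMMAS, STEPS)
print(best)
```

Each grid cell requires a full sampling run, so in practice the cost grows with both the number of guidance scales and the largest T considered.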

(a) Crystals – MP20 (b) Molecules – QM9

Figure B.4: Tuning inference hyperparameters for best performance. Best generative
modelling results for crystals and molecules are achieved with different classifier-free guidance
scales γ and number of integration steps T . T = 500 or 1000 with γ = 1.0 or 2.0 tends to work
well across both molecules and crystals.

Table B.2: Autoencoder ablation study. We report match rate (computed with StructureMatcher
or MoleculeMatcher from PyMatGen) and RMSD between the reconstructed and groundtruth
structures for MP20 crystals and QM9 molecules.

Row  Train Set  Encoder        Latent  KL       MP20 Match Rate (%) ↑  MP20 RMSD (Å) ↓  QM9 Match Rate (%) ↑  QM9 RMSD (Å) ↓

1    MP20       Transformer    4       0.0001   85.50                  0.0598           -                     -
2    MP20       Equiformer-V2  4       0.0001   81.70                  0.1652           -                     -
3    MP20       Transformer    8       0.0001   84.50                  0.0502           -                     -
4    MP20       Equiformer-V2  8       0.0001   88.90                  0.0296           -                     -

5    QM9        Transformer    4       0.0001   -                      -                97.20                 0.0747
6    QM9        Equiformer-V2  4       0.0001   -                      -                96.20                 0.0765
7    QM9        Transformer    8       0.0001   -                      -                96.50                 0.0823
8    QM9        Equiformer-V2  8       0.0001   -                      -                96.20                 0.0746

9    Joint      Transformer    4       0.0001   88.30                  0.0471           96.60                 0.0785
10   Joint      Transformer    4       0.00001  88.50                  0.0468           98.50                 0.0524
11   Joint      Transformer    8       0.0001   88.60                  0.0269           96.60                 0.0760
12   Joint      Transformer    8       0.00001  88.60                  0.0239           97.00                 0.0399

Table B.3: Latent diffusion model ablation study. We report validity rates for 10,000 generated
crystals or molecules.

Row  Train Set  Denoiser  Encoder        Latent  KL       MP20 Structure Valid (%) ↑  MP20 Composition Valid (%) ↑  MP20 Overall Valid (%) ↑  QM9 Validity (%) ↑  QM9 Validity* (%) ↑

1    MP20       DiT-S     Transformer    4       0.0001   98.90                       89.19                         88.19                     -                   -
2    MP20       DiT-S     Equiformer-V2  4       0.0001   91.74                       81.03                         74.43                     -                   -
3    MP20       DiT-S     Transformer    8       0.0001   99.58                       90.46                         90.13                     -                   -
4    MP20       DiT-S     Equiformer-V2  8       0.0001   99.26                       86.09                         85.50                     -                   -

5    QM9        DiT-S     Transformer    4       0.0001   -                           -                             -                         95.94               92.19
6    QM9        DiT-S     Equiformer-V2  4       0.0001   -                           -                             -                         95.36               91.37
7    QM9        DiT-S     Transformer    8       0.0001   -                           -                             -                         96.02               91.58
8    QM9        DiT-S     Equiformer-V2  8       0.0001   -                           -                             -                         96.24               91.47

9    Joint      DiT-S     Transformer    4       0.0001   98.21                       91.05                         89.38                     96.90               93.47
10   Joint      DiT-S     Transformer    4       0.00001  98.74                       90.74                         89.60                     96.40               91.85
11   Joint      DiT-S     Transformer    8       0.0001   99.66                       91.07                         90.76                     96.85               93.33
12   Joint      DiT-S     Transformer    8       0.00001  99.67                       91.25                         90.93                     96.36               92.06

13   Joint      DiT-B     Transformer    4       0.0001   99.00                       91.23                         90.29                     97.33               94.45
14   Joint      DiT-B     Transformer    4       0.00001  99.51                       90.73                         90.29                     97.04               94.06
15   Joint      DiT-B     Transformer    8       0.0001   99.67                       91.60                         91.32                     95.30               89.85
16   Joint      DiT-B     Transformer    8       0.00001  99.74                       92.14                         91.92                     97.43               93.99

17   Joint      DiT-L     Transformer    4       0.0001   99.31                       90.92                         90.29                     97.80               94.67
18   Joint      DiT-L     Transformer    4       0.00001  99.43                       90.84                         90.31                     96.71               92.78
19   Joint      DiT-L     Transformer    8       0.0001   99.75                       92.17                         91.92                     96.11               91.45
20   Joint      DiT-L     Transformer    8       0.00001  99.66                       91.42                         91.14                     97.79               95.01

Appendix C

Appendix: gRNAde: Geometric Deep
Learning for 3D RNA inverse design
(Chapter 5)

C.1 Ablation Study

Table C.1 presents an ablation study as well as aggregated benchmarks for various configurations
of gRNAde. Key takeaways are highlighted below. Note that all results in the main paper are
reported for models trained on the maximum length of 5000 nucleotides using autoregressive
decoding and rotation-equivariant GNN layers, as this led to the lowest perplexity values.
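For reference, perplexity can be read as the exponential of the average negative log-probability that the model assigns to the true nucleotides. The sketch below assumes that standard definition; the function name and the example probabilities are hypothetical.

```python
import math

def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the true nucleotides."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)

# A model that guesses uniformly over A, C, G, U versus a fairly confident model.
uniform = perplexity([0.25] * 10)
confident = perplexity([0.8, 0.9, 0.7, 0.85])
print(uniform, confident)
```

With a four-letter alphabet, a uniform guess gives a perplexity of about 4 and a perfect model gives 1, which puts the values reported in Table C.1 into context.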

Split. Single- and multi-state splits are described in Section 5.2; the multi-state split is
relatively harder than the single-state split based on overall reduced performance for all baselines
and models. The multi-state split evaluates a particularly challenging o.o.d. scenario as the RNAs
in the test set have significantly higher structural flexibility compared to those in the training set.

Max. #states We evaluate the impact of increasing the maximum number of states as input
to gRNAde. Multi-state models improve native sequence recovery as well as structural self-
consistency scores over an equivalent single state variant. Notably, on the more challenging
multi-state split, the improvement in sequence recovery was observed to be as high as 5-6%
for the best multi-state models. This trend holds even for the single-state benchmark where the
multi-state model is being used with only one state as input. This suggests that seeing multiple
states during training can be useful for teaching gRNAde about RNA conformational flexibility
and improve performance even for single-state design tasks.

GNN and pooling architecture We ablated whether the internal representations of the
GVP-GNN are rotation invariant or equivariant. Equivariant GNNs are theoretically more
expressive [Joshi et al., 2023] and we find them more capable at fitting the training distribution
(as shown by lower perplexity) which in turn results in improved metrics compared to invariant
GNNs.

Table C.1: Ablation study and aggregated benchmark results for gRNAde. We report metrics
averaged over 100 test set samples and standard deviations across 3 consistent random seeds.
The percentages reported in brackets for the 3D self-consistency scores are the percentage
of designed samples within the ‘designability’ threshold values (scRMSD≤2Å, scTM≥0.45,
scGDT≥0.5).

Self-consistency metrics: 2D (scMCC) is computed with EternaFold; 3D (scRMSD, scTM-score, scGDT_TS) with RhoFold.

Max. #states  Model     GNN         Max. train length  Perplexity (↓)  Native seq. recovery (↑)  2D scMCC (↑)  3D scRMSD (↓)       3D scTM-score (↑)   3D scGDT_TS (↑)

Single-state split

1             AR        Equiv       500                1.77±0.07       0.438±0.01                0.624±0.07    13.01±1.18 (0.5%)   0.21±0.0 (14.3%)    0.22±0.0 (12.7%)
1             AR        Equiv       1000               1.73±0.08       0.453±0.01                0.648±0.01    13.10±0.58 (1.0%)   0.20±0.0 (10.8%)    0.21±0.0 (10.6%)
1             AR        Equiv       2500               1.41±0.01       0.513±0.01                0.633±0.03    11.76±0.91 (1.4%)   0.27±0.0 (28.8%)    0.27±0.0 (28.0%)
1             AR        Equiv       5000               1.29±0.02       0.538±0.03                0.612±0.02    11.50±0.64 (1.9%)   0.28±0.0 (32.1%)    0.28±0.0 (26.2%)
1             AR, rand  Equiv       5000               1.59±0.16       0.531±0.04                0.621±0.04    11.87±1.06 (1.9%)   0.26±0.0 (28.1%)    0.26±0.0 (24.1%)

1             AR        Inv         5000               1.32±0.04       0.531±0.01                0.585±0.03    11.70±0.56 (1.3%)   0.26±0.0 (24.8%)    0.25±0.0 (20.1%)

1             NAR       Inv         5000               1.54±0.04       0.571±0.00                0.430±0.02    14.26±0.51 (1.3%)   0.19±0.0 (15.9%)    0.18±0.0 (12.7%)
1             NAR       Equiv       5000               1.46±0.06       0.584±0.00                0.473±0.02    13.04±0.88 (1.3%)   0.23±0.0 (24.0%)    0.22±0.0 (17.9%)

3             AR        Equiv, DS   5000               1.23±0.05       0.539±0.01                0.620±0.01    11.47±1.05 (2.5%)   0.28±0.0 (31.4%)    0.28±0.0 (27.2%)
5             AR        Equiv, DS   5000               1.25±0.01       0.539±0.02                0.596±0.03    11.90±1.00 (2.9%)   0.27±0.0 (31.6%)    0.26±0.0 (26.4%)

Groundtruth sequence prediction baseline               -               1.000±0.00                0.686±0.00    5.23±0.07 (27.9%)   0.56±0.0 (68.7%)    0.55±0.0 (68.7%)
Random sequence prediction baseline                    -               0.251±0.00                0.012±0.00    24.40±0.34 (0.0%)   0.04±0.0 (0.0%)     0.02±0.0 (0.0%)
ViennaRNA 2D-only baseline                             -               0.259±0.00                0.611±0.00    20.34±0.10 (0.0%)   0.07±0.0 (0.6%)     0.07±0.0 (1.1%)

Multi-state split

1             AR        Equiv       5000               1.51±0.01       0.481±0.00                0.573±0.04    21.83±0.53 (0.0%)   0.12±0.0 (2.6%)     0.15±0.0 (5.5%)

3             AR        Equiv, DS   500                1.87±0.04       0.444±0.01                0.587±0.02    22.09±0.13 (0.0%)   0.12±0.0 (2.3%)     0.14±0.0 (5.7%)
3             AR        Equiv, DS   1000               1.76±0.04       0.455±0.03                0.504±0.04    22.92±1.43 (0.0%)   0.11±0.0 (2.3%)     0.14±0.0 (5.8%)
3             AR        Equiv, DS   2500               1.54±0.07       0.500±0.01                0.543±0.01    22.00±0.26 (0.0%)   0.11±0.0 (2.9%)     0.14±0.0 (3.7%)
3             AR        Equiv, DS   5000               1.44±0.04       0.531±0.00                0.573±0.03    22.19±0.28 (0.0%)   0.12±0.0 (4.2%)     0.15±0.0 (7.5%)
3             AR        Equiv, DSS  5000               1.37±0.04       0.540±0.03                0.574±0.03    22.20±0.43 (0.0%)   0.12±0.0 (4.0%)     0.15±0.0 (7.5%)

5             AR        Equiv, DS   5000               1.37±0.03       0.510±0.00                0.514±0.00    21.80±0.08 (0.0%)   0.12±0.0 (2.9%)     0.14±0.0 (6.2%)

1             NAR       Equiv       5000               1.81±0.03       0.489±0.00                0.372±0.03    24.18±0.63 (0.0%)   0.09±0.0 (2.2%)     0.12±0.0 (4.7%)
3             NAR       Equiv, DS   5000               1.65±0.13       0.506±0.01                0.346±0.02    24.06±0.43 (0.0%)   0.08±0.0 (2.0%)     0.11±0.0 (2.9%)
3             NAR       Equiv, DSS  5000               1.60±0.10       0.520±0.02                0.352±0.03    24.18±0.55 (0.0%)   0.09±0.0 (2.2%)     0.12±0.0 (4.7%)
5             NAR       Equiv, DS   5000               1.59±0.21       0.517±0.01                0.339±0.01    24.16±0.75 (0.0%)   0.08±0.0 (2.2%)     0.10±0.0 (4.5%)

Groundtruth sequence prediction baseline               -               1.000±0.00                0.525±0.00    17.52±0.32 (3.9%)   0.25±0.0 (24.2%)    0.29±0.0 (31.4%)
Random sequence prediction baseline                    -               0.249±0.00                0.013±0.00    31.00±0.20 (0.0%)   0.03±0.0 (0.0%)     0.02±0.0 (0.0%)
ViennaRNA 2D-only baseline                             -               0.258±0.00                0.470±0.00    29.10±0.00 (0.0%)   0.05±0.0 (0.0%)     0.05±0.0 (0.0%)
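The bracketed percentages in Table C.1 can be read as threshold counts over the designed samples. The sketch below applies the thresholds from the table caption (scRMSD≤2Å, scTM≥0.45, scGDT≥0.5) to a hypothetical list of per-design scores; the dictionary layout and the example values are assumptions made for illustration.

```python
def designable_percentages(samples):
    """samples: list of dicts with keys 'scRMSD', 'scTM', 'scGDT' (one per design)."""
    n = len(samples)
    return {
        "scRMSD": 100.0 * sum(s["scRMSD"] <= 2.0 for s in samples) / n,  # at most 2 Å
        "scTM": 100.0 * sum(s["scTM"] >= 0.45 for s in samples) / n,
        "scGDT": 100.0 * sum(s["scGDT"] >= 0.5 for s in samples) / n,
    }

# Four hypothetical designs with their 3D self-consistency scores.
designs = [
    {"scRMSD": 1.5, "scTM": 0.50, "scGDT": 0.55},
    {"scRMSD": 8.0, "scTM": 0.46, "scGDT": 0.52},
    {"scRMSD": 12.0, "scTM": 0.30, "scGDT": 0.31},
    {"scRMSD": 15.0, "scTM": 0.20, "scGDT": 0.22},
]
percentages = designable_percentages(designs)
print(percentages)
```

Each metric is thresholded independently, which is why a row in the table can show a very low scRMSD percentage alongside much higher scTM and scGDT percentages.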

Model and decoder ‘AR’ implies autoregressive decoding (described in Section 5.1.2, uses
4 encoder and 4 decoder layers), while ‘NAR’ implies non-autoregressive, one-shot decoding using
an MLP (uses 8 encoder layers). Across both evaluation splits, AR models show significantly
higher self-consistency scores than NAR, even though NAR leads to higher sequence recovery for
the single-state split. AR is more expressive and can condition predictions at each decoding step
on past predictions, while one-shot NAR samples from independent probability distributions
for each nucleotide. Thus, AR is a better inductive bias for predicting base pairing and base
stacking interactions that are drivers of RNA structure [Vicens and Kieft, 2022]. For instance,
G-C and A-U pairs can often be swapped for one another, but non-autoregressive decoding does
not capture such paired constraints.
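The paired-constraint argument can be illustrated with a two-position toy example. This is not the gRNAde decoder: the conditional rule in `sample_ar` is hard-coded as Watson-Crick complementarity purely for illustration, and both samplers are given uniform marginals so that only the dependence between positions differs.

```python
import random

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def sample_nar(rng):
    """One-shot decoding: each position is drawn independently."""
    return rng.choice("ACGU"), rng.choice("ACGU")

def sample_ar(rng):
    """Autoregressive decoding: the second draw is conditioned on the first."""
    first = rng.choice("ACGU")
    return first, COMPLEMENT[first]

def paired_fraction(sampler, n=2000, seed=0):
    """Fraction of sampled position pairs that form a complementary base pair."""
    rng = random.Random(seed)
    pairs = [sampler(rng) for _ in range(n)]
    return sum(COMPLEMENT[a] == b for a, b in pairs) / n

ar_fraction = paired_fraction(sample_ar)
nar_fraction = paired_fraction(sample_nar)
print(ar_fraction, nar_fraction)
```

Both samplers are equally likely to place any nucleotide at either position, yet only the conditioned sampler always produces a valid pair, while independent sampling does so roughly a quarter of the time.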

We also present results for the impact of training gRNAde with random decoding
order. This can be very useful in practice for partial or conditional design scenarios, and leads to
a minor reduction in sequence recovery and 3D self-consistency (in line with what was observed
for ProteinMPNN).

Max. train RNA length Limiting the maximum length of RNAs used for training can be
seen as ablating the use of ribosomal RNA families (which are thousands of nucleotides long
and form complexes with specialised ribosomal proteins). We find that training only on short
RNAs, below the thousands of nucleotides typical of ribosomal RNAs, leads to worse sequence
recovery and 3D self-consistency scores, even though it improves 2D self-consistency across
both evaluation splits. This suggests that tertiary interactions learnt from ribosomal RNAs
can generalise to other RNA families to some extent (large ribosomal RNAs were excluded from
test sets).

Non-learnt baselines. We report the performance of two non-learnt baselines to contextualise
gRNAde’s performance: for each test sample, simply predicting the groundtruth sequence back
and predicting a random sequence. Structural self-consistency scores for the groundtruth
baseline provide a rough upper bound on the maximum score that any gRNAde design can
theoretically obtain given the current state of the 2D/3D structure predictors being used. gRNAde
always performs better than the random baseline and often reaches 2D self-consistency scores
close to the upper bound. Both 2D and 3D self-consistency scores are inherently limited by the
performance of the structure prediction methods used.
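The recovery values of the two non-learnt baselines follow directly from the definition of native sequence recovery. The sketch below assumes recovery is the fraction of positions matching the native sequence; the native sequence used here is itself randomly generated and purely hypothetical.

```python
import random

def sequence_recovery(designed, native):
    """Fraction of positions where the designed nucleotide equals the native one."""
    assert len(designed) == len(native)
    return sum(d == n for d, n in zip(designed, native)) / len(native)

rng = random.Random(0)
native = "".join(rng.choice("ACGU") for _ in range(4000))         # hypothetical native sequence
random_design = "".join(rng.choice("ACGU") for _ in range(4000))  # random baseline

groundtruth_recovery = sequence_recovery(native, native)
random_recovery = sequence_recovery(random_design, native)
print(groundtruth_recovery, random_recovery)
```

Predicting the groundtruth gives a recovery of exactly 1, and a uniformly random sequence lands close to 0.25, consistent with the baseline rows of Table C.1.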

2D inverse folding baseline. We additionally report results for ViennaRNA’s 2D-only
inverse folding method to further demonstrate the utility of 3D inverse folding. ViennaRNA has
improved 2D self-consistency scores over gRNAde but fails to capture tertiary interactions in
its designs, as evidenced by poor recovery and 3D self-consistency scores similar to the random
baseline. We observed the same trend for other 2D-only inverse folding methods such as
NuPack’s design tool. This result should not be surprising, as 2D tools are meant for design
scenarios that only involve base pairing and do not take any 3D information into account.

Choice of structure predictors. As previously noted, self-consistency metrics are highly
dependent on the performance of the structure prediction method used. We chose EternaFold as
it is simple to use as well as validated for designed and synthetic RNAs, unlike most other 2D
structure prediction tools. Replacing EternaFold with RNAFold led to nearly unchanged results
and did not modify the relative rankings of the models:

• AR, 1 state, Equiv. GNN, EternaFold scMCC: 0.612±0.02, RNAFold scMCC: 0.614±0.03.

• NAR, 1 state, Equiv. GNN, EternaFold scMCC: 0.473±0.02, RNAFold scMCC: 0.477±0.04.
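The scMCC values above compare base pairs between a reference 2D structure and the structure predicted for a designed sequence. As a rough sketch, assuming MCC is computed over all candidate position pairs of pseudoknot-free dot-bracket strings (the actual evaluation code may count pairs differently), it could look like this:

```python
import math

def base_pairs(dot_bracket):
    """Set of (i, j) index pairs from a pseudoknot-free dot-bracket string."""
    stack, pairs = [], set()
    for j, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(j)
        elif char == ")":
            pairs.add((stack.pop(), j))
    return pairs

def mcc(reference, predicted):
    """Matthews correlation coefficient over all candidate pairs (i < j)."""
    n = len(reference)
    ref, pred = base_pairs(reference), base_pairs(predicted)
    tp = len(ref & pred)
    fp = len(pred - ref)
    fn = len(ref - pred)
    tn = n * (n - 1) // 2 - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

identical = mcc("((..))", "((..))")
partial = mcc("((..))", "(....)")
print(identical, partial)
```

MCC ranges from -1 to 1, so slightly negative self-consistency values are possible when a predicted structure shares no base pairs with the reference.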

Lastly, we would like to note the challenge of evaluating multi-state design: structural
self-consistency metrics are not ideal for evaluating RNAs which do not have one fixed structure
or which undergo changes to their structure. It would be ideal (but extremely slow and expensive)
to run MD simulations to validate multi-state design models.

C.2 Additional Results

Table C.2: Full results for Figure 5.7 comparing gRNAde to Rosetta, FARNA, ViennaRNA and
RDesign for single-state design on 14 RNA structures of interest identified by Das et al. [2010].
Rosetta and FARNA recovery values are taken from Das et al. [2010], Supplementary Table 2.

PDB ID  Description                                 ViennaRNA Recovery  FARNA Recovery  RDesign Recovery  Rosetta Recovery  gRNAde Recovery  gRNAde Perplexity  gRNAde 2D self-cons.

1CSL    RRE high affinity site                      0.25                0.20            0.4455            0.44              0.5719           1.2812             0.8644
1ET4    Vitamin B12 binding RNA aptamer             0.25                0.34            0.3929            0.44              0.6250           1.3457             -0.0135
1F27    Biotin-binding RNA pseudoknot               0.30                0.36            0.3013            0.37              0.3437           1.6203             0.4523
1L2X    Viral RNA pseudoknot                        0.24                0.45            0.3727            0.48              0.4721           1.3181             0.5692
1LNT    RNA internal loop of SRP                    0.33                0.27            0.5556            0.53              0.5843           1.4337             0.1379
1Q9A    Sarcin/ricin domain from E. coli 23S rRNA   0.27                0.40            0.4417            0.41              0.5044           1.3411             0.0597
4FE5    Guanine riboswitch aptamer                  0.29                0.28            0.4112            0.36              0.5300           1.3824             0.9116
1X9C    All-RNA hairpin ribozyme                    0.26                0.31            0.3967            0.50              0.5000           1.3905             0.6630
1XPE    HIV-1 B RNA dimerization initiation site    0.27                0.24            0.3834            0.40              0.7037           1.2177             0.7768
2GCS    Pre-cleavage state of glmS ribozyme         0.25                0.26            0.4518            0.44              0.5078           1.3053             0.4062
2GDI    Thiamine pyrophosphate-specific riboswitch  0.25                0.38            0.3523            0.48              0.6500           1.2363             -0.0251
2OEU    Junctionless hairpin ribozyme               0.23                0.30            0.5000            0.37              0.9519           1.0913             0.7768
2R8S    Tetrahymena ribozyme P4-P6 domain           0.27                0.36            0.5641            0.53              0.5689           1.1881             0.7281
354D    Loop E from E. coli 5S rRNA                 0.28                0.35            0.4458            0.55              0.4410           1.4938             0.0430

Overall recovery:                                   0.27                0.32            0.4296            0.45              0.5682

[Figure C.1: expected ‘max’ fold change over WT (one axis, 0x to 30x) and the corresponding
fitness (second axis, 0.00 to 3.40) vs. number of selected sequences for assaying (1 to 17,027),
for the conditions random, n_mut==1, n_mut<=2, and gRNAde. Plot title: Max Fitness by Sample
Size and Condition (n=47,504; simulations=10,000). Reference lines mark the 1st (fit.: 3.41),
3rd (fit.: 3.16), 10th (fit.: 2.67), 50th (fit.: 2.27), and 200th (fit.: 1.94) best variants
and the wildtype.]

Figure C.1: Retrospective study of gRNAde for ranking ribozyme mutant fitness (t1 subunit).
Using the backbone structure and mutational fitness landscape data from an RNA polymerase
ribozyme [McRae et al., 2024], we retrospectively analyse how well we can rank variants at
multiple design budgets using random selection vs. gRNAde’s perplexity for mutant sequences
conditioned on the backbone structure (scaffolding subunit t1). gRNAde performs better than
single site saturation mutagenesis, even when all single mutants are explored (total of 403 single
mutants, 17,027 double mutants for the scaffolding subunit t1 in McRae et al. [2024]). See
Section 5.3.3 for results on catalytic subunit 5TU and further discussions.
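The retrospective analysis can be sketched as "pick the k lowest-perplexity variants and report the best fitness among them", compared against a Monte Carlo estimate for random selection. Everything below is illustrative: the landscape is synthetic, the function names are hypothetical, and the reading of fitness as the natural log of fold change over wild type is an assumption inferred from the paired axis ticks of the figure (for example, 3x alongside 1.10).

```python
import math
import random

def max_fitness_top_k(variants, k):
    """variants: list of (perplexity, fitness). Select the k lowest-perplexity
    variants and return the best fitness among them."""
    chosen = sorted(variants, key=lambda v: v[0])[:k]
    return max(f for _, f in chosen)

def expected_max_fitness_random(variants, k, simulations=1000, seed=0):
    """Monte Carlo estimate of the best fitness among k randomly chosen variants."""
    rng = random.Random(seed)
    fitness = [f for _, f in variants]
    return sum(max(rng.sample(fitness, k)) for _ in range(simulations)) / simulations

def fold_change(fitness):
    # Assumption: fitness is ln(fold change over wild type), so exp(1.10) is about 3x.
    return math.exp(fitness)

# Synthetic landscape in which lower perplexity loosely tracks higher fitness.
rng = random.Random(1)
variants = [(p, 3.0 - p + rng.gauss(0.0, 0.3)) for p in (1.0 + 0.01 * i for i in range(500))]
ranked = max_fitness_top_k(variants, k=10)
baseline = expected_max_fitness_random(variants, k=10)
print(ranked, baseline, fold_change(ranked))
```

Repeating this for several values of k traces out one curve per selection strategy, which is the shape of comparison shown in the figure.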

C.3 RNASolo data statistics

[Figure C.2, panel (a): histogram of sequence lengths (frequency vs. sequence length), with an
inset for lengths 0 to 200. Distribution: 684.9 ± 1072.8, Max: 4455, Min: 11.]

(a) Sequence length. The dataset is long-tailed in terms of RNA sequence length, with many
short sequences including aptamers, riboswitches, ribozymes, and tRNAs (fewer than 200
nucleotides). The dataset also includes several longer ribosomal RNAs (thousands of
nucleotides).

[Figure C.2, panel (b): histogram of no. of structures per unique sequence (frequency vs.
number of structures per sequence), with an inset for sequences with >1 structure.
Distribution: 2.84 ± 9.39, Max: 267, Min: 1.]

(b) Number of structures per sequence. The dataset covers a wide range of RNA conformation
ensembles, with on average 3 structures per sequence. There are multiple structures available
for 1,547 sequences. The remaining 2,676 sequences have one corresponding structure.

[Figure C.2, panel (c): histogram of avg. pairwise RMSD per sequence (frequency vs. avg.
pairwise RMSD among structures per sequence, in Å), with an inset for 0 to 5 Å. Distribution:
1.33Å ± 1.89, Max: 18.35Å, Min: 0.00Å.]

(c) Average pairwise RMSD per sequence. For 1,547 sequences with multiple structures, there
is significant structural diversity among conformations. On average, the pairwise C4’ RMSD
among the set of structures for a sequence is greater than 1Å.

[Figure C.2, panel (d): joint plot of avg. pairwise RMSD among structures (Å) vs. sequence
length (log scale).]

(d) Bivariate distribution for sequence length vs. avg. RMSD. The joint plot illustrates how
structural diversity (measured by avg. pairwise RMSD) varies across sequence lengths. We
notice similar structural variations regardless of sequence length.

Figure C.2: RNASolo data statistics. We plot histograms to visualise the diversity of RNAs
available in terms of (a) sequence length, (b) number of structures available per sequence, as
well as (c) structural variation among conformations for those RNA that have multiple structures.
The bivariate distribution plot (d) for sequence length vs. average pairwise RMSD illustrates
structural diversity regardless of sequence lengths.
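The per-sequence statistic in panels (c) and (d) is an average over all pairs of available structures. The sketch below shows only that averaging step and makes a simplifying assumption: the coordinates are taken as already superposed, whereas a real computation on C4’ atoms would first align each pair of structures. The function names and coordinates are hypothetical.

```python
import math
from itertools import combinations

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points, taken as given."""
    sq = sum((a - b) ** 2 for p, q in zip(coords_a, coords_b) for a, b in zip(p, q))
    return math.sqrt(sq / len(coords_a))

def avg_pairwise_rmsd(conformations):
    """Mean RMSD over all unordered pairs of conformations of one sequence."""
    pairs = list(combinations(conformations, 2))
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical 2-atom conformations (coordinates in Å), assumed already superposed.
confs = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],
]
value = avg_pairwise_rmsd(confs)
print(value)
```

The number of pairs grows quadratically with the number of structures, which matters for the few sequences that have hundreds of structures.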

Appendix D

Appendix: Inverse Design of RNA
Structure and Function with gRNAde
(Chapter 6)

[Figure D.1, panels A to C, each plotted against sequence position (0 to 140): (A) Max Single
Mutant Fitness; (B) Higher-order Mutant Combinability (combinability score, 0 to 150); (C) Final
Constraints on Probability of Designing at Position (design probability, 0.0 to 1.0). Colour
scales: fitness (-2.0 or below to 2.0), combinability score (0 to 150), design probability
(0.00 to 1.00).]

Figure D.1: Design probabilities for 5TU derived from fitness landscape data. (A) Maximum
single-mutant fitness at each position, showing tolerance to point mutations. (B) Combinability
scores quantifying how well mutations at each position can be combined with other mutations
to create functional variants. (C) Final design probabilities computed by combining fitness and
combinability. Critical functional regions (catalytic site, template binding nucleotides, and triple
helix-forming adenosines) are constrained to zero probability to preserve essential catalytic
activity. During design, these probabilities are used to sample which positions can be mutated,
enabling generation of variants at a range of mutational distances.
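The caption does not give the formula used to combine fitness and combinability, so the following is only a sketch of the general idea. The combination rule (a product of min-max-normalised scores), the function names, and all example values are assumptions made for illustration; only the zeroing of constrained positions and the per-position sampling follow the description above.

```python
import random

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def design_probabilities(max_fitness, combinability, constrained):
    """Per-position probability of being open to mutation.

    The product of min-max-normalised scores is an assumed combination rule,
    not the one used in the thesis. Constrained positions are forced to 0.
    """
    probs = [f * c for f, c in zip(min_max(max_fitness), min_max(combinability))]
    return [0.0 if i in constrained else p for i, p in enumerate(probs)]

def sample_designable_positions(probs, rng):
    """Independently keep each position with its design probability."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

max_fitness = [-2.0, 0.5, 1.0, 2.0, 1.5]         # hypothetical per-position scores
combinability = [0.0, 60.0, 150.0, 150.0, 90.0]  # hypothetical per-position scores
probs = design_probabilities(max_fitness, combinability, constrained={3})
positions = sample_designable_positions(probs, random.Random(0))
print(probs, positions)
```

Scaling all probabilities up or down before sampling would be one simple way to target larger or smaller mutational distances.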

Figure D.2: gRNAde variants show activity at large mutational distance. Low-throughput
gel analysis of primer extension reactions on a 6 GAA repeat template using the wild-type 5TU
ribozyme and top 13 gRNAde-designed variants from the 6x6 AUA high-throughput screen.
Variant identity and edit distance from native 5TU are labeled. The gel-confirmed activity of
variant 549, which carries 28 mutations, is a key finding, proving that gRNAde can generate
functional ribozymes at large mutational distances beyond those typically accessible by rational
design or directed evolution. Variants 122 and 123 also show high activity, comparable to the
native 5TU ribozyme.
