Geometric Deep Learning for Molecular Modelling and Design

Chaitanya Krishna Joshi
Clare Hall
July, 2025

This thesis is submitted for the degree of Doctor of Philosophy

Declaration

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the preface and specified in the text. It is not substantially the same as any work that has already been submitted, or is being concurrently submitted, for any degree, diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the preface and specified in the text. It does not exceed the prescribed word limit for the relevant Degree Committee.

Chaitanya Krishna Joshi
6 July, 2025

Abstract

Geometric Deep Learning for Molecular Modelling and Design
Chaitanya Krishna Joshi

Molecules are the foundations of biological life and physical materials. Computational modelling of molecular behaviour remains a grand challenge in science as molecules span a spectrum of complexity: from periodic crystals to non-periodic biomolecules; from small drug-like molecules with dozens of atoms to massive proteins with thousands; and from data-rich domains like protein structures to data-scarce contexts like nucleic acids. Despite this diversity, all molecular systems share fundamental building blocks: atoms and their interactions in three-dimensional space governed by physical laws. This thesis develops Geometric Deep Learning models that leverage these shared principles to advance molecular modelling.

The first part establishes unified foundations for molecular representation learning and generative modelling. I first introduce the Geometric Weisfeiler-Leman Test (GWL), a mathematical framework that characterizes the expressive power of neural networks respecting physical symmetries in 3D space.
GWL provides a unified theory for roto-translationally invariant and equivariant Graph Neural Networks, and offers mechanistic insights into how different architectures distinguish 3D molecular structures. Building on these insights, I introduce the All-atom Diffusion Transformer (ADiT), a unified generative architecture that models both periodic crystals and non-periodic molecules. ADiT demonstrates that joint training across diverse structural datasets enables scaling capabilities analogous to large language models, achieving state-of-the-art performance across molecular generation benchmarks.

The second part introduces gRNAde, a novel generative RNA inverse design toolkit. gRNAde is a structure-conditioned RNA language model that addresses the unique challenges of RNA molecules, including limited data and inherent flexibility, by leveraging the Geometric Deep Learning principles established in the first part. Validated through wet lab experiments, gRNAde demonstrates superior performance over existing physics-based methods and achieves human expert-level accuracy in pseudoknotted RNA design while remaining fully automated and scalable. Most notably, gRNAde successfully generates functional RNA enzymes that are evolutionarily distant from known sequences, opening new avenues for designing RNA structures with programmable biological functions.

Acknowledgements

First and foremost, I am deeply grateful to my supervisor, Pietro Liò. Pietro created an environment of extraordinary freedom, kindness, and collaborative spirit within our research group. His generosity with his time and his willingness to serve as both a personal and professional mentor at critical moments throughout this PhD have shaped me in ways I am only beginning to appreciate. I could not have asked for a more supportive guide and a dear friend.
Pietro, and the broader Cambridge environment, gave me the space and encouragement to explore what kind of scientist I want to be: to try my hand at theory, at engineering and scaling experiments, and ultimately at applications validated in the wet lab.

The Computer Laboratory and the Artificial Intelligence Group have been a wonderful intellectual home. Clare Hall, my college, provided a warm and welcoming community throughout. I am especially thankful to the many friends and colleagues who shared this journey together: Simon Mathis, Charlie Harris, Alex Norcliffe, Iulia Dutta, Charlotte Magister, Julia Komorowska, Miruna Cretu, Vladimir Radenkovic, Petar Veličković, Ramon Viñas, Arian Jamasb, Rishabh Anand, Kieran Didi, Alex Abrudan, Cătălina Cangea, Paul Scherer, Dobrik Georgiev, Cris Bodnar, Andrew Blake, Srijit Seal, Adham El-Shazly, and many others. Thank you for the endless discussions, the shared conference trips, and for keeping me inspired and sane throughout.

This thesis owes a great deal to the scientists and mentors I had the privilege of collaborating with along the way. Early in my PhD, Taco Cohen and Michael Bronstein helped me develop the theoretical foundations upon which all the subsequent work in this thesis was built. I was fortunate to spend two summer internships at the frontier of AI and molecular science: Andreas Loukas, Jan Ludwiczak, Pan Kessel, Kyunghyun Cho, and the Prescient Design team welcomed me to Basel, and Zachary Ulissi, Anuroop Sriram, Xiang Fu, Larry Zitnick, and the FAIR Chemistry team at Meta hosted me in San Francisco. Both experiences were formative: they showed me firsthand how AI can impact medicine and materials science, and the exceptional resources and guidance accelerated my growth as a scientist. I am grateful to Rhiju Das for his infectious conviction that “research doesn’t count unless it involves blind experimental tests,” which motivated me to go beyond computational benchmarks and into the wet lab.
I also thank Gábor Csányi for stimulating discussions over the course of my PhD and for examining this thesis with such rigour and thoroughness. I owe a special debt of gratitude to Philipp Holliger, Edoardo Gianni, Samantha Kwok, and the entire Holliger Lab for welcoming me into the MRC Laboratory of Molecular Biology towards the end of my PhD. Being embedded in the LMB and learning to speak the language of experimentalists transformed how I think about my research. The experience of holding a pipette for the first time and seeing AI-designed RNA molecules come to life in the lab will stay with me forever. I am also thankful to the Learning on Graphs (LoG) conference community. Being part of founding a new conference from the ground up, with the collective energy and support of so many in our field, has been one of the most rewarding experiences of my PhD. I am grateful to A*STAR, Singapore, for their generous support through the National Science Scholarship, which made this PhD possible. I am especially thankful to Phebe Lim, Regina Chen, Chay Wah Tay, and the entire Graduate Academy team for their responsiveness. I also thank Yue Wan, Mile Šikić, Roger Foo, Chuan Sheng Foo, Cheston Tan, and others for hosting me during my trips home to Singapore and for helping me stay connected to the Singapore research community while being away. Above all, I thank my dear family. My parents, Veena and Vivek Joshi, have been a constant source of support and encouragement. Together with my grandparents, aunts, and uncles, they have taken a genuine interest in understanding my curiosities since a young age and raised me to never stop asking questions; that instinct is at the heart of everything in this thesis. My sister, Kuhu Joshi, is my inspiration. Watching her chart her own path from economics to poetry, and seeing her thrive in her element, has reminded me that the most rewarding frontiers are often the most uncharted. 
Finally, my deepest gratitude goes to my wife, my closest confidant, and my dearest friend, Genevieve Lam. Her curiosity and enthusiasm for life have been my greatest source of strength, and our relationship is the foundation upon which this thesis, and our life together, has been built.

Contents

1 Introduction
    1.1 Research Questions
    1.2 Thesis Outline
    1.3 List of Publications
2 Preliminaries: Deep Learning for Molecular Structure Modelling
    2.1 Primer on Molecular Systems
    2.2 Molecular Systems as 3D Geometric Graphs
    2.3 Representation Learning of Molecular Structure
    2.4 Generative Modelling of Molecular Systems
I Molecular Representation Learning and Generative Modelling
3 Expressive Power of Molecular Structure Representations
    3.1 Limitations of the Weisfeiler-Leman Test
    3.2 The Geometric Weisfeiler-Leman Framework
    3.3 Understanding the Geometric GNN Design Space
    3.4 Synthetic Experiments on Expressivity
    3.5 Experiments on Protein Representation Learning
    3.6 Related Work
    3.7 Summary
4 Unified Generative Modelling of Molecules and Materials
    4.1 All-atom Diffusion Transformers
    4.2 Experimental Setup
    4.3 Results
    4.4 Related Work
    4.5 Summary
II RNA Molecule Design
5 gRNAde: Geometric Deep Learning for 3D RNA inverse design
    5.1 The gRNAde Model
    5.2 Experimental Setup
    5.3 Results
    5.4 Related Work
    5.5 Summary
6 Inverse Design of RNA Structure and Function with gRNAde
    6.1 An RNA Inverse Design Pipeline with gRNAde
    6.2 Expert-level Design of RNA Pseudoknotted Structures
    6.3 Inverse Design of Functional Polymerase Ribozymes
    6.4 Summary
7 Conclusion
    7.1 Summary of contributions
    7.2 Discussion
    7.3 Future Directions
References
A Appendix: Expressive Power of Molecular Structure Representations (Chapter 3)
    A.1 Geometric GNN Design Space Proofs
    A.2 Proofs for Equivalence between GWL and Geometric GNNs (Section 3.2.2)
B Appendix: Unified Generative Modelling of Molecules and Materials (Chapter 4)
    B.1 Evaluation Metrics
    B.2 Additional Results
    B.3 Ablation Study
C Appendix: gRNAde: Geometric Deep Learning for 3D RNA inverse design (Chapter 5)
    C.1 Ablation Study
    C.2 Additional Results
    C.3 RNASolo data statistics
D Appendix: Inverse Design of RNA Structure and Function with gRNAde (Chapter 6)

Chapter 1
Introduction

Molecular systems are the fundamental building blocks of our world. They form the basis of biological life, serve as the foundation for medicines that treat disease, and constitute physical materials around us. At their core, molecules are collections of atoms interacting in three-dimensional space, governed by the fundamental laws of physics and chemistry.

The last decade has witnessed remarkable progress in computational modelling of molecular systems using deep learning. Deep learning offers a data-driven paradigm for predictive understanding of the functional properties of molecular systems, towards enabling the discovery of novel molecules with desired behaviours [Sanchez-Lengeling and Aspuru-Guzik, 2018, Stokes et al., 2020]. The most notable breakthroughs that have inspired this thesis are those recognized by the Nobel Prize in Chemistry 2024: highly accurate protein structure prediction [Jumper et al., 2021] and de novo design of proteins with bespoke functionality [Dauparas et al., 2022, Watson et al., 2023]. These foundational techniques are now being extended to biomolecular interactions involving proteins, nucleic acids, and other molecules [Abramson et al., 2024]. Concurrently, machine learning-based interatomic potentials are transforming molecular dynamics simulation and property prediction [Behler and Parrinello, 2007].
Deep learning has enabled broadly generalizable representations of atomic interactions across organic molecules and inorganic materials, closely matching the accuracy of quantum mechanical simulations at a fraction of the computational cost [Batatia et al., 2023, Wood et al., 2025].

Central to these breakthroughs are deep learning architectures that fall under the umbrella of Geometric Deep Learning [Bronstein et al., 2021]: an approach to neural network design that incorporates fundamental physical principles and symmetries into model architectures. For molecular modelling, this translates to architectural components and inductive biases specifically tailored to addressing three broad challenges: variations in sizes, local versus long-range interactions, and 3D geometric symmetries.

Firstly, atomic systems exhibit remarkable diversity in size and complexity. They range from small drug-like molecules containing tens of atoms to biomolecular complexes with tens of thousands of atoms. When atoms or ions are packed together to form materials, they can extend infinitely through space as periodically repeated crystal structures. Deep learning architectures for molecular modelling must accommodate this variability in system size and periodicity without requiring fixed input dimensions. Set-based architectures such as Graph Neural Networks (GNNs) [Battaglia et al., 2018] and Transformers [Vaswani et al., 2017a], which process variable-sized collections of atoms, satisfy this requirement.

Additionally, the physical behaviour of molecules is governed by an interplay of both local and long-range atomic interactions. Many fundamental properties emerge from short-range interactions such as covalent bonds or hydrogen bonds.
This locality principle is algorithmically aligned with message passing-based GNN architectures [Xu et al., 2020], which iteratively aggregate information from local atomic environments to build up representations of molecular structure [Gilmer et al., 2017]. However, molecules also exhibit long-range interactions, such as van der Waals forces or Coulomb repulsion, that cannot be captured by purely local models. For instance, proteins fold into stable 3D structures through interactions between residues that are distant in sequence space, so generating or predicting protein structures requires maintaining global consistency to ensure physical validity [Jumper et al., 2021]. Such problems are well aligned with the self-attention mechanism in Transformers, which allows for direct communication between all pairs of atoms, regardless of whether they are locally connected. Further, Transformers can be seen as message passing on complete graphs, thereby unifying models for local (GNNs) and global (Transformers) interactions through a common geometric lens [Joshi, 2025].

Finally, molecules exist in 3D Euclidean space and possess fundamental geometric symmetries. Functional properties of molecules are symmetric under rigid geometric transformations of their structures, such as global rotations, translations, and reflections [Musil et al., 2021]. Modern geometric deep learning architectures incorporate these symmetries through two complementary approaches.

Explicit symmetry ensures that learned representations transform covariantly (equivariantly) with 3D transformations of structures, guaranteeing that internal features respect the same geometric principles as the physical quantities they represent [Thomas et al., 2018]. This approach is often data-efficient [Batzner et al., 2022] and produces representations with clear physical interpretations [Fu et al., 2025].
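To make the distinction between invariant and equivariant quantities concrete, the following minimal sketch checks both properties numerically on a toy point cloud. It is an illustration only and is not taken from the thesis: the coordinates, the rotation angle, the shift, and all helper names (`rotate_z`, `pairwise_distances`, `centroid_offsets`) are hypothetical choices, and only rotations about the z-axis are tested rather than the full rotation group.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Translate a 3D point by the vector `shift`."""
    return tuple(p + t for p, t in zip(point, shift))

def pairwise_distances(coords):
    """Invariant feature: all interatomic distances, in a fixed pair order."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

def centroid_offsets(coords):
    """Equivariant feature: the vector from the centroid to each atom."""
    n = len(coords)
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return [tuple(c[k] - centroid[k] for k in range(3)) for c in coords]

def close(a, b, tol=1e-9):
    """Element-wise comparison of two equal-length sequences of floats."""
    return all(abs(x - y) < tol for x, y in zip(a, b))

# Toy coordinates in arbitrary units (assumed for illustration only).
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.5, 0.0), (0.3, 0.2, 0.9)]
angle, shift = 0.7, (2.0, -1.0, 0.5)
moved = [translate(rotate_z(p, angle), shift) for p in coords]

# Invariance: distances are unchanged by the rigid transformation.
invariant_ok = close(pairwise_distances(coords), pairwise_distances(moved))

# Equivariance: centroid offsets of the moved structure equal the rotated
# offsets of the original structure (the translation cancels out).
equivariant_ok = all(
    close(rotate_z(v, angle), w)
    for v, w in zip(centroid_offsets(coords), centroid_offsets(moved))
)
print(invariant_ok, equivariant_ok)
```

Scalar features such as interatomic distances stay fixed under the transformation, whereas vector features rotate together with the structure. An architecture with explicit symmetry guarantees this behaviour by construction for its internal features.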
Alternatively, implicit symmetry does not hard-code geometric constraints into architectures and learns approximate symmetries from data. While this approach enables more flexible and expressive models, it requires larger training datasets and greater computational resources for effective learning [Wang et al., 2024].

Together, these developments promise a new era of foundation models for molecular modelling [Bommasani et al., 2021], providing an accurate and scalable toolkit for molecular discovery. At the same time, there remain fundamental open questions about the theoretical limits of these architectures, their generalizability across the diversity of molecular systems, and their application to challenging problems at the forefront of biochemistry and materials science. This thesis represents my explorations into this frontier, spanning theoretical and methodological foundations of molecular modelling, as well as real-world applications in molecular design.

1.1 Research Questions

This thesis is about developing new deep learning techniques for modelling and designing molecular systems.

The story begins with representation learning, which is the foundation for both predictive and generative modelling. While geometric deep learning architectures have been very successful for learning molecular representations, as summarised in the previous section, a formal and unified understanding of how different architectural properties affect the class of functions that a model can express, also known as the expressive power [Raghu et al., 2017], is still lacking. Without knowing why models succeed or fail, we are limited in our ability to design new architectures in a principled way, bringing us to our first research question:

RQ1: What is a unified theoretical framework for characterizing the expressive power of 3D molecular representations learnt by geometric deep learning models?
During the course of my research, highly expressive generative models trained on extremely large-scale datasets led to a paradigm shift across AI [Bommasani et al., 2021]. A key factor enabling these foundation models was a unification of diverse but interconnected data sources for pre-training (e.g. all the text on the internet [Achiam et al., 2023]), which enabled models to learn general-purpose representations and transfer knowledge across related domains such as mathematics and programming. Similarly, we know that the physical principles that govern atomic interactions are shared across diverse molecular systems, ranging from organic molecules to inorganic crystals. However, current generative models of 3D molecular structures are highly domain-specific and not broadly applicable. Towards developing generative foundation models for molecular structures, we arrive at our second research question:

RQ2: What is the architecture of a unified molecular generative model that benefits from transfer learning across atomic interactions?

Having established unified foundations for representation learning and generative modelling, I then explored real-world applications in the inverse design of Ribonucleic Acids (RNA). I was drawn to RNA due to its increasingly central role in modern molecular biology and biotechnology.1 RNA molecules are nature’s computers, capable of both information processing and catalysis [Cech, 2024]. Yet, RNA structure modelling and design remains extremely challenging due to a paucity of data and the inherently dynamic nature of RNA molecules. Our final research question tackles new frontiers in RNA design:

RQ3: Can we develop a generative inverse design toolkit for RNA structure and function? And what new experimental capabilities will this enable in wet labs?

1 As the webcomic XKCD put it, "Life is a seething mass of RNA that sometimes use DNA to take notes. What do the proteins do? Errands for RNA."
(XKCD #3056, https://xkcd.com/3056/)

[Figure 1.1 panels: Representation Learning (Geometric Weisfeiler-Leman test characterises expressivity of 3D Graph Neural Networks); Generative Modelling (All-atom Diffusion Transformer unifies generation of periodic crystals and non-periodic molecular structures); Inverse Design (gRNAde is a Generative AI toolkit for designing 3D RNA structure and function).]

Figure 1.1: Overview of thesis contributions. In Chapter 3, I address RQ1 by proposing the Geometric Weisfeiler-Leman test, a theoretical framework for understanding the expressivity of 3D molecular representations of Geometric GNNs. In Chapter 4, I introduce All-atom Diffusion Transformer, the first unified generative model for both periodic crystals and non-periodic molecules to benefit from transfer learning, which addresses RQ2. Finally, I address RQ3 in Chapter 5 by developing gRNAde, a novel generative inverse design toolkit for RNA molecules, which we validate through wet lab experiments in Chapter 6.

1.2 Thesis Outline

This section provides an overview of the chapters in this thesis and summarises the contributions made towards answering the research questions outlined above. The rest of this thesis is structured as follows, and the main contributions are summarised in Figure 1.1.

Chapter 2: Preliminaries: Deep Learning for Molecular Structure Modelling

I present a self-contained introduction to deep learning fundamentals for molecular structure modelling.
I begin with a concise overview of different types of molecular systems and associated mathematical concepts, such as 3D geometric graphs, symmetry groups, and equivariance. I then survey deep learning architectures for representation learning and generative modelling of 3D molecular structure, introducing techniques such as Geometric Graph Neural Networks, Transformers, and Diffusion models.

Part I: Molecular Representation Learning and Generative Modelling

Chapter 3: Expressive Power of Molecular Structure Representations

I introduce the Geometric Weisfeiler-Leman (GWL) test, a generalisation of the classic Weisfeiler-Leman algorithm for discriminating geometric graphs while respecting underlying 3D symmetries. The GWL framework unifies various classes of Geometric GNN architectures for molecules, and provides a theoretical characterization of their expressive power. Through GWL, I derive new mechanistic insights into molecular representation learning, including advantages of equivariant models over invariant ones, and how higher-order representations enable maximally expressive architectures. To complement this theoretical framework, I present a suite of synthetic experiments and a real-world protein function prediction benchmark. This chapter addresses RQ1 by establishing a unified theoretical framework for characterizing the expressive power of 3D molecular representations learnt by Geometric GNNs.

Chapter 4: Unified Generative Modelling of Molecules and Materials

I propose the All-atom Diffusion Transformer (ADiT), a unified generative modelling architecture capable of jointly learning from both periodic crystals and non-periodic molecules. ADiT is a latent diffusion model that embeds 3D molecular structures into a shared latent space, and subsequently learns to sample new latents followed by mapping them to valid structures.
ADiT achieves state-of-the-art performance for generative modelling across both molecules and materials, outperforming specialized system-specific methods while benefiting from transfer learning across domains. I further show that ADiT is significantly more scalable than previous approaches and that scaling ADiT’s model parameters predictably improves performance, towards the goal of a unified foundation model for molecular design. This chapter addresses RQ2 by developing a unified generative architecture that enables transfer learning across diverse atomic systems.

Part II: RNA Molecule Design

Chapter 5: gRNAde: Geometric Deep Learning for 3D RNA inverse design

I introduce gRNAde, a novel toolkit for 3D RNA inverse design leveraging geometric deep learning to address the unique challenges of RNA modelling, including limited data and conformational flexibility. gRNAde is a structure-conditioned RNA language model that uses a multi-state Geometric GNN to generate sequences conditioned on one or more 3D backbone structures. I present computational benchmarks demonstrating gRNAde’s improved performance, speed and capabilities compared to state-of-the-art physics-based tools for RNA design. This chapter addresses the first part of RQ3 by developing the first generative inverse design toolkit for RNA structures.

Chapter 6: Inverse Design of RNA Structure and Function with gRNAde

I present an RNA inverse design pipeline that integrates gRNAde with computational screening and wet lab validation. I demonstrate that the gRNAde pipeline matches human expert performance in designing diverse pseudoknotted RNA structures while being fully automated. Further, I show how gRNAde enables the design of RNA enzymes (ribozymes) that are significantly distant in mutational space from known functional sequences, opening new avenues for designing RNA structures with bespoke biological functions.
This chapter completes RQ3 by experimentally validating the gRNAde toolkit’s capabilities at designing RNA structure and function in real wet lab settings.

Chapter 7: Conclusion

The final chapter reviews the contributions proposed in this thesis, reflects on unifying themes across the chapters, and discusses future research directions.

1.3 List of Publications

Here, I provide the list of publications that I have co-authored during my PhD, together with a brief description of my contributions to each publication.

1.3.1 Thesis Publications and Contributions

Chapter 2 is written from scratch, with some content abridged from Duval et al. [2023a].

A Hitchhiker’s Guide to Geometric GNNs for 3D Atomic Systems.
A. Duval∗, S. V. Mathis∗, C. K. Joshi∗, V. Schmidt∗, S. Miret, F. D. Malliaros, T. Cohen, P. Liò, Y. Bengio, and M. Bronstein. (∗equal first authors)
Preprint, 2023.

I conceived the survey jointly with Alexandre Duval, Simon V. Mathis, and Victor Schmidt, created a majority of the figures, and contributed extensively to all aspects of the paper.

Chapter 3 is primarily based on Joshi et al. [2023].

On the Expressive Power of Geometric Graph Neural Networks.
C. K. Joshi∗, C. Bodnar∗, S. V. Mathis, T. Cohen, and P. Liò. (∗equal first authors)
International Conference on Machine Learning (ICML), 2023. Also oral presentation at NeurIPS 2022 Symmetry & Geometry Workshop.

I had the key idea of theoretically characterising the expressive power of Geometric GNNs, with inputs from Taco Cohen. Cris Bodnar and I jointly conceived the Geometric Weisfeiler-Leman test, and the theoretical results included in this thesis were derived by me. I conducted all the synthetic experiments, with inputs from Simon V. Mathis. I wrote the majority of the paper with inputs from all other authors.

The chapter also includes experimental results on Geometric GNNs for protein function prediction from Jamasb et al. [2024], which were implemented by myself and Arian Jamasb.
Evaluating Representation Learning on the Protein Structure Universe.
A. R. Jamasb∗, A. Morehead∗, C. K. Joshi∗, Z. Zhang∗, K. Didi, S. V. Mathis, C. Harris, J. Tang, J. Cheng, P. Liò, and T. L. Blundell. (∗equal contribution)
International Conference on Learning Representations (ICLR), 2024.

Chapter 4 is based on Joshi et al. [2025a].

All-atom Diffusion Transformers: Unified generative modelling of molecules and materials.
C. K. Joshi, X. Fu, Y.-L. Liao, V. Gharakhanyan, B. K. Miller, A. Sriram, and Z. W. Ulissi.
International Conference on Machine Learning (ICML), 2025. Also oral presentation at ICLR 2025 AI for Accelerated Materials Design Workshop.

I had the key idea of using latent diffusion models for unified generative modelling of molecules and materials, with inputs from Xiang Fu. I developed the research, conducted the experiments and wrote the paper with inputs from all the other authors.

Chapter 5 is based on Joshi et al. [2025c] and Joshi and Liò [2024].

gRNAde: Geometric Deep Learning for 3D RNA inverse design.
C. K. Joshi, A. R. Jamasb, R. Viñas, C. Harris, S. V. Mathis, A. Morehead, R. Anand, and P. Liò.
International Conference on Learning Representations (ICLR), 2025. Spotlight presentation. Also an invited book chapter in RNA Design: Methods and Protocols, pp. 121-135, Springer, Methods in Molecular Biology (MIMB, volume 2847), 2024.

I had the key idea of 3D structure-based and multi-state RNA inverse design, with inputs from Ramon Viñas. I developed the research, conducted the experiments and wrote the paper with inputs from all the other authors.

Chapter 6 is based on Joshi et al. [2025b].

Generative inverse design of RNA structure and function with gRNAde.
C. K. Joshi∗, E. Gianni∗, S. L. Y. Kwok∗, S. V. Mathis, P. Liò, and P. Holliger. (∗equal contribution)
Preprint, 2025.

I conceived the RNA design pipeline, with inputs from all the other authors. I generated computational designs, with inputs from Simon V. Mathis.
The wet lab experimental validation took place in the laboratories of Dr. Philipp Holliger (MRC Laboratory of Molecular Biology, Cambridge) and Prof. Rhiju Das (Department of Biochemistry, Stanford University). I wrote the majority of the paper with inputs from all other authors.

1.3.2 Other Publications

I have also contributed to the following publications, which are not included in this thesis and are listed in reverse chronological order:

Multi-state Protein Design with DynamicMPNN.
A. Abrudan∗, S. Pujalte Ojeda∗, C. K. Joshi, M. Greenig, F. Engelberger, A. Khmelinskaia, J. Meiler, M. Vendruscolo, T. P. J. Knowles. (∗equal contribution)
International Conference on Learning Representations (ICLR), 2026. Also presented at ICML 2025 Workshop on Generative AI and Biology.

Multi-scale Protein Structure Modelling with Geometric Graph U-Nets.
C. Liu∗, V. Li∗, L. Leong, V. Radenkovic, P. Liò, C. K. Joshi. (∗equal contribution)
Machine Learning in Structural Biology (MLSB), 2025.

Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems.
X. Zhang∗, L. Wang∗, J. Helwig∗, Y. Luo∗, C. Fu∗, Y. Xie∗, M. Liu, Y. Lin, Z. Xu, K. Yan, K. Adams, M. Weiler, X. Li, T. Fu, Y. Wang, A. Strasser, H. Yu, Y. Xie, X. Fu, S. Xu, Y. Liu, Y. Du, A. Saxton, H. Ling, H. Lawrence, H. Stärk, S. Gui, C. Edwards, N. Gao, A. Ladera, T. Wu, E. F. Hofgard, A. M. Tehrani, R. Wang, A. Daigavane, M. Bohde, J. Kurtin, Q. Huang, T. Phung, M. Xu, C. K. Joshi, S. V. Mathis, K. Azizzadenesheli, A. Fang, A. Aspuru-Guzik, E. Bekkers, M. Bronstein, M. Zitnik, A. Anandkumar, S. Ermon, P. Liò, R. Yu, S. Günnemann, J. Leskovec, H. Ji, J. Sun, R. Barzilay, T. Jaakkola, C. W. Coley, X. Qian, X. Qian, T. Smidt, S. Ji. (∗equal contribution)
Foundations and Trends in Machine Learning, 2025.

LeMat-GenBench: Bridging the Gap between Crystal Generation and Materials Discovery.
S. Betala, S. P. Gleason, A. Ramlaoui, A. Xu, G. Channing, D. Levy, C. Fourrier, N. Kazeev, C. K. Joshi, S.-O. Kaba, F.
Therrien, A. Hernandez-Garcia, R. Mercado, N. M. Krishnan, A. Duval.
NeurIPS 2025 Workshop on AI for Accelerated Materials Design, 2025.

Machine Learning for Toxicity Prediction Using Chemical Structures: Pillars for Success in the Real World.
S. Seal, M. Mahale, M. García-Ortegón, C. K. Joshi, L. Hosseini-Gerami, A. Beatson, M. Greenig, M. Shekhar, A. Patra, C. Weis, A. Mehrjou, A. Badré, B. Paisley, R. Lowe, S. Singh, F. Shah, B. Johannesson, D. Williams, D. Rouquie, D.-A. Clevert, P. Schwab, N. Richmond, C. A. Nicolaou, R. J. Gonzalez, R. Naven, C. Schramm, L. R. Vidler, K. Mansouri, W. P. Walters, D. D. Wilk, O. Spjuth, A. E. Carpenter, and A. Bender.
ACS Chemical Research in Toxicology, 2025.

Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs.
B. El∗, D. Choudhury∗, P. Liò, and C. K. Joshi. (∗equal contribution)
ICLR 2025 Workshop on XAI4Science, 2025.

Understanding Biology in the Age of Artificial Intelligence.
E. Lawrence, A. El-Shazly, S. Seal, C. K. Joshi, P. Liò, S. Singh, A. Bender, P. Sormanni, M. Greenig.
Preprint, 2024.

RNA-FrameFlow: Flow Matching for de novo 3D RNA backbone design.
R. Anand∗, C. K. Joshi∗, A. Morehead, A. R. Jamasb, C. Harris, S. Mathis, K. Didi, B. Hooi, and P. Liò. (∗equal contribution)
Machine Learning for Computational Biology (MLCB), 2024. Oral presentation. Also oral presentation at ICML 2024 AI4Science Workshop.

PoseCheck: Generative Models for 3D Structure-based Drug Design Produce Unrealistic Poses.
C. Harris, K. Didi, A. R. Jamasb, C. K. Joshi, S. V. Mathis, P. Liò, and T. Blundell.
NeurIPS Workshop on Machine Learning for Structural Biology, 2023.

Group Invariant Global Pooling.
K. Bujel∗, Y. Gideoni∗, C. K. Joshi, and P. Liò. (∗equal contribution)
ICML Workshop on Topology, Algebra, & Geometry, 2023.

Hypergraph Factorisation for Multi-tissue Gene Expression Imputation.
R. Viñas, C. K. Joshi, D. Georgiev, B. Dumitrascu, E. R. Gamazon, and P. Liò.
Nature Machine Intelligence, 2023.
Cover article.

Chapter 2
Preliminaries: Deep Learning for Molecular Structure Modelling

This chapter offers an overview of the deep learning fundamentals essential for molecular structure modelling. We assume familiarity with basic machine learning concepts and common neural network architectural elements such as Multi Layer Perceptrons, normalisation layers, and activation functions; see Goodfellow et al. [2016] for an introduction. We also assume basic knowledge of concepts from physics and chemistry, such as atoms, molecules, and chemical bonds.

We will establish a common mathematical notation for representing molecules and introduce key concepts that are crucial for understanding the subsequent chapters, such as geometric graphs and physical symmetries. We will also survey the most important background methods, including Graph Neural Networks, Transformers, and Diffusion generative models, with an emphasis on their application to molecular systems.

2.1 Primer on Molecular Systems

Let us begin with a brief primer on the different types of molecular systems that we will encounter in this thesis.

2.1.1 Small Organic Molecules

Small molecules are organic compounds typically characterized by their low molecular weight (generally under 1000 Daltons) and relatively simple structures comprising a few dozen atoms. These molecules are ubiquitous in daily life, from water and oxygen to glucose and caffeine, and play crucial roles in biological processes as signaling molecules, metabolic intermediates, therapeutic drugs, and building blocks for larger biomolecules.

Consider caffeine (C8H10N4O2), a common stimulant found in coffee and tea. As illustrated in Figure 2.1, this molecule can be represented in multiple ways: as a SMILES string (a linear character encoding) [Weininger, 1988], as a 2D chemical graph with atoms as nodes and bonds as edges, or as a 3D structure showing spatial arrangements. While SMILES strings and 2D graphs capture connectivity and chemical information, they do not capture the 3D geometry of the molecule. The 3D geometric conformations, dynamics, and spatial interactions between atoms ultimately drive molecular functionality and properties, making 3D representations essential for understanding molecular behavior.

[Figure 2.1 panels: (a) SMILES string, CN1C=NC2=C1C(=O)N(C(=O)N2C)C; (b) Chemical graph; (c) 3D structure; (d) Molecular surface]
Figure 2.1: Representations of caffeine, my favorite small organic molecule. (a) SMILES string. (b) 2D chemical graph, with atoms as nodes and chemical bonds as edges determined by valence rules. (c) 3D atomic structure, illustrating the spatial arrangement of atoms. (d) Molecular surface, representing the molecule’s outer boundary.

Computationally, a small molecule with $N$ atoms is represented by atom types $A = \{a_i\}_{i=1}^{N} \in \mathbb{Z}^{1 \times N}$ and 3D coordinates $X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{3 \times N}$, where each $x_i \in \mathbb{R}^3$ specifies the position of atom $i$ in space, typically measured in Angstroms ($10^{-10}$ meters).

Having established the fundamental representation of small molecules, we now explore how these atomic and molecular building blocks assemble into more complex structures, such as crystalline materials and biological macromolecules.

2.1.2 Crystalline Materials

Crystalline materials represent a fundamentally different class of atomic systems from small organic molecules. These solid-state materials feature a highly ordered, three-dimensional arrangement of atoms, ions, or molecules that extends infinitely in space through periodic repetition [Ashcroft and Mermin, 1976].

Consider table salt (NaCl), which forms a face-centered cubic (rock-salt) crystal structure, as shown in Figure 2.2. The sodium (Na+) and chloride (Cl−) ions arrange themselves in an alternating three-dimensional pattern, with each Na+ ion surrounded by six Cl− ions, and vice versa.
The perfectly periodic nature of crystals is a defining characteristic that distinguishes them from other forms of matter such as liquids and amorphous solids, which lack long-range order.

The infinite, repeating nature of crystals requires a different computational representation than finite molecules. Rather than listing all atoms, which would be impossible for an infinite structure, we define a unit cell: the smallest repeating volume that contains complete structural and symmetry information of the crystal. Within this unit cell, atomic positions are specified using fractional coordinates $F = \{f_i\}_{i=1}^{N} \in [0, 1)^{3 \times N}$, where each $f_i \in [0, 1)^3$ represents the position of atom $i$ relative to the unit cell boundaries. The unit cell’s shape and size are defined by a lattice matrix $L \in \mathbb{R}^{3 \times 3}$, whose columns are the three basis vectors spanning the crystal lattice. Absolute 3D coordinates can be recovered via $X = LF$.

Figure 2.2: Crystal structure of halite (NaCl) salt. The crystal structure is characterized by a repeating unit cell containing four Na+ and four Cl− ions in a face-centered arrangement. Source: OpenGeology.org (CC BY-NC-SA 3.0).

2.1.3 Macromolecules: proteins, nucleic acids, and complexes

Macromolecules, such as proteins, nucleic acids (DNA and RNA), and their complexes, are the fundamental building blocks of life, playing critical roles in biological processes ranging from catalysis and structural support to genetic information storage and transfer [Alberts et al., 2022]. These molecules are characterized by their large size and complex, hierarchical structures, which arise from the specific arrangement of smaller subunits (amino acids for proteins, nucleotides for nucleic acids) into intricate three-dimensional conformations of thousands of atoms.
This hierarchical organization, from primary sequence to complex three-dimensional architecture, embodies the central principle of structural biology: sequence determines structure, and structure dictates function [Greslehner, 2018]. This principle is fundamental to understanding biomolecular behaviour and structure-based design [Huang et al., 2016, Alford et al., 2017].

Protein structure, illustrated in Figure 2.3a, is typically described at four levels of organization: primary (the linear sequence of amino acids), secondary (local folding patterns such as α-helices and β-sheets), tertiary (the overall three-dimensional shape of a single chain), and quaternary (the assembly of multiple chains into a functional complex).

Nucleic acids, DNA and RNA, are the primary carriers of genetic information and also play crucial roles in various cellular functions, including the expression of proteins [Cech, 2024]. The structural principles of nucleic acids [Neidle and Sanderson, 2021] are a special focus of this thesis, particularly in the context of the work on structure-based RNA design presented in Part II.

Nucleic acid structure follows the same hierarchical organization as proteins, as illustrated in Figure 2.3b. The primary structure is the linear sequence of nucleotides, each comprising a nitrogenous base (Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in DNA, with Uracil (U) replacing Thymine in RNA), a 5-carbon sugar (deoxyribose in DNA, ribose in RNA), and phosphate groups. These nucleotides link via phosphodiester bonds, forming a sugar-phosphate backbone with 5’ to 3’ directionality (e.g., GACU for RNA).

[Figure 2.3 panels: (a) Proteins; (b) Nucleic acids; (c) Nucleobase pairing in RNA]
Figure 2.3: Hierarchical structures of biomolecules. Sources: (a) Protein structure (CC BY-SA 4.0). (b) Nucleic acid structure (CC BY-SA 4.0). (c) RNA base pairing (CC BY-SA 4.0).

Source link for Figure 2.2: https://opengeology.org/Mineralogy/13-crystal-structures
The secondary structure arises from base interactions, primarily hydrogen bonding. DNA typically forms a double helix [Watson and Crick, 1953, Franklin and Gosling, 1953], with two complementary strands stabilized by base pairs (A-T, G-C) and base stacking (Figure 2.3b). RNA, often single-stranded, folds upon itself, allowing complementary regions to base-pair (A-U, G-C, as shown in Figure 2.3c) and form non-canonical pairs, leading to diverse motifs like hairpins and pseudoknots. The tertiary structure refers to the complex 3D arrangement stabilized by metal ions (e.g., Mg2+, K+). This is critical for RNA’s diverse functions, including catalysis (as ribozymes) and molecular recognition.

Source links for Figure 2.3:
https://en.wikipedia.org/wiki/File:Protein_structure_(full)-en.svg
https://en.wikipedia.org/wiki/File:DNA_RNA_structure_(full).png
https://commons.wikimedia.org/wiki/File:Hachimoji_RNA_BP.svg

The hierarchical organization of macromolecules often extends beyond the folding of individual chains. Full functionality is generally achieved by assembling into larger quaternary structures. These assemblies can involve multiple folded protein subunits (as seen in hemoglobin [Perutz, 1960]), several nucleic acid strands, or, very commonly, combinations of proteins and nucleic acids. Prominent examples of such vital protein-nucleic acid complexes include ribosomes, the cellular machinery for protein synthesis [Ramakrishnan, 2002], and chromatin, the DNA-protein complex forming chromosomes [Rowley and Corces, 2018]. The specific arrangement and interactions within these macromolecular assemblies are vital for their biological roles, enabling sophisticated cellular processes and regulatory networks. Consequently, understanding these higher-order structures is a key focus of molecular biology and modelling.
2.2 Molecular Systems as 3D Geometric Graphs

Having reviewed the different types of molecular systems, we now turn to how these complex structures can be represented mathematically as geometric graphs in 3D Euclidean space.

2.2.1 Graphs

Graphs are used to model complex and interconnected systems in the real world, ranging from knowledge graphs to social networks and molecular structures. Formally, an attributed graph $\mathcal{G} = (A, S)$ is a set $\mathcal{V}$ of $n$ nodes connected by edges, as shown in Figure 2.4a. $A$ denotes an $n \times n$ adjacency matrix where each entry $a_{ij} \in \{0, 1\}$ indicates the presence or absence of an edge connecting nodes $i$ and $j$. Additionally, we define $\mathcal{N}_i$ as the set of neighbors of node $i$, which are the nodes connected to $i$ by an edge, i.e. $\mathcal{N}_i = \{j \in \mathcal{V} \mid a_{ij} = 1\}$. The matrix of scalar features $S \in \mathbb{R}^{n \times f}$ stores attributes $s_i \in \mathbb{R}^{f}$ associated with each node $i$. For example, in molecular graphs, each node is an atom and edges represent interactions among atoms.

Typically, the nodes in a graph have no canonical or fixed ordering and can be shuffled arbitrarily, resulting in an equivalent shuffling of the rows and columns of the adjacency matrix $A$. Thus, accounting for permutation symmetry is a critical consideration when designing machine learning models for graphs [Bronstein et al., 2021].

One can also consider more complex definitions of a graph, including multi-relational graphs or higher-order topological variants such as hypergraphs [Battiston et al., 2020], but a basic attributed graph suffices for our discussions on molecular systems.

[Figure 2.4 panels: (a) An attributed graph; (b) A geometric graph, drawn with x, y, z axes]
Figure 2.4: Graphs and geometric graphs. (a) Graphs model complex systems via a set of nodes which are related by edges. (b) Geometric graphs embedded in Euclidean space model systems containing both geometry and relational structure.
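The permutation symmetry discussed in Section 2.2.1 can be illustrated with a small numerical check. The following is a minimal sketch, assuming a hand-written toy graph with four nodes; the helper names `neighbours` and `permute_graph` and the particular reordering `perm` are illustrative choices, not notation from this thesis. It shows that relabelling the nodes shuffles the rows and columns of $A$ and the rows of $S$, that a sum over nodes is unaffected, and that summing neighbour features is shuffled in the same way as the input.

```python
import numpy as np

# Toy attributed graph: n = 4 nodes, f = 2 scalar features per node.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
S = np.arange(8, dtype=float).reshape(4, 2)


def neighbours(A, i):
    """Neighbour set N_i = {j | a_ij = 1} of node i."""
    return set(np.flatnonzero(A[i]).tolist())


def permute_graph(A, S, perm):
    """Relabel the nodes so that new node k is old node perm[k]."""
    P = np.eye(len(perm), dtype=int)[perm]  # permutation matrix
    return P @ A @ P.T, P @ S


perm = [2, 0, 3, 1]  # an arbitrary reordering (illustrative choice)
P = np.eye(len(perm), dtype=int)[perm]
A_p, S_p = permute_graph(A, S, perm)

# Neighbourhoods are preserved up to relabelling of the nodes.
same_neighbourhoods = all(
    {perm[l] for l in neighbours(A_p, k)} == neighbours(A, perm[k])
    for k in range(len(perm))
)

# A sum readout over all nodes is permutation-invariant ...
readout_invariant = np.allclose(S.sum(axis=0), S_p.sum(axis=0))

# ... while summing neighbour features (A @ S) is permutation-equivariant.
aggregation_equivariant = np.allclose(A_p @ S_p, P @ (A @ S))
```

The same two properties, invariance of graph-level quantities and equivariance of node-level quantities, are what a graph learning model is expected to satisfy by construction.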
2.2.2 Geometric graphs

As we have seen, molecular systems exhibit both relational structure and geometry: functional molecules arise from atoms interacting with one another, and the specific spatial arrangement of atoms in 3D space determines these interactions. Such systems can be modeled via geometric graphs embedded in Euclidean space [Duval et al., 2023a]. For example, molecules can be represented as a set of nodes which contain information about each atom and its 3D spatial coordinates as well as other geometric quantities such as velocity or acceleration.

As illustrated in Figure 2.4b, a geometric graph $\mathcal{G} = (A, S, \vec{V}, \vec{X})$ is an attributed graph that is also decorated with geometric attributes: 3D node coordinates $\vec{X} \in \mathbb{R}^{n \times d}$ and, optionally, vector features $\vec{V} \in \mathbb{R}^{n \times d}$ (e.g. velocity, acceleration), with $d = 3$.¹

For molecules, the conventional procedure for constructing the geometric graph $\mathcal{G} = (A, S, \vec{V}, \vec{X})$ is via the underlying point cloud $(S, \vec{V}, \vec{X})$ using a predetermined radial cutoff $r_{cut}$. Thus, the adjacency matrix is defined as $a_{ij} = 1$ if $\|\vec{x}_i - \vec{x}_j\|_2 \leq r_{cut}$, or $0$ otherwise, for all $a_{ij} \in A$. Other common choices for graph construction include long-range connections between nodes that are not within the cutoff radius, or complete graphs, where all nodes are connected to each other. See Figure 2.5 for illustrations of the different types of geometric graphs.

Periodic boundary conditions While molecules simply consist of a set of 3D points in space, easily representable using a finite graph, crystals are modelled as infinite periodic structures whose repeating pattern is called a unit cell. To account for the infinite periodicity of the crystal, we employ periodic boundary conditions (PBC). The unit cell is defined by a lattice matrix $\vec{L} \in \mathbb{R}^{3 \times 3}$, where the columns represent the three lattice basis vectors of the unit cell. Due to the periodic tiling of the unit cell, an atom $i$ may interact with an image of atom $j$ in a neighbouring cell.
This is formalized by defining an integer-valued shift vector $\vec{u}_{ij} \in \mathbb{Z}^3$, which allows the effective distance to be calculated as:

$$d_{ij} = \left\| (\vec{x}_i - \vec{x}_j) + \vec{L}\, \vec{u}_{ij} \right\|_2$$

Here, the shift vector $\vec{u}_{ij}$ is determined dynamically based on the atomic positions $\vec{X}$, typically using a radial cutoff and the minimum image convention (selecting the image of atom $j$ that is closest to atom $i$). This ensures that the graph accurately captures interactions across cell boundaries, which is in turn critical for accurately simulating the dynamics of atoms that may ‘drift’ across unit cell boundaries over time.

¹ Without loss of generality, our formalism uses a single vector feature per node, but we could have had multiple channels for each node.

[Figure 2.5 panels: 3D point cloud; Smoothed cutoff graph; Long-range connections; Complete graph]
Figure 2.5: From point clouds to geometric graphs. A 3D point cloud is transformed into a geometric graph by drawing edges between atoms within a radial cutoff, possibly including long-range connections, or simply connecting all atoms.

2.2.3 Physical symmetries

A key characteristic of geometric graphs is that their coordinates and scalar/vector attributes transform in mathematically precise ways under physical symmetries such as rotations and translations in 3D space (Figure 2.6). Understanding and modelling these symmetries is fundamental to building neural networks that maintain physical meaning and produce consistent predictions regardless of molecular orientation in space [Musil et al., 2021].

Figure 2.6: Geometric attributes transform under 3D symmetries. The group of rotations and reflections O(3) acts on the vector features $\vec{v}$ and coordinates $\vec{x}$. The translation group T(3) acts on the coordinates $\vec{x}$. Scalar features remain invariant to transformations.
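The transformation behaviour summarised in Figure 2.6 can be checked numerically. The following is a minimal sketch, assuming that coordinates and vector features are stored as rows (so that an orthogonal matrix acts by right-multiplication) and using randomly generated toy attributes; the helper names `random_orthogonal`, `act`, and `pairwise_distances` are illustrative and not part of this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
S = rng.normal(size=(n, 4))  # scalar features, one row per node
V = rng.normal(size=(n, d))  # vector features, one vector per node
X = rng.normal(size=(n, d))  # 3D coordinates


def random_orthogonal(rng, d=3):
    """Sample an orthogonal matrix (rotation or reflection) via QR."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q


def act(S, V, X, Q, t):
    """Apply an O(3) element Q and a translation t to the attributes."""
    # Scalars are untouched, vectors rotate, coordinates rotate and shift.
    return S, V @ Q, X @ Q + t


def pairwise_distances(X):
    """Matrix of Euclidean distances between all pairs of rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


Q = random_orthogonal(rng)
t = np.array([1.0, -2.0, 0.5])
S2, V2, X2 = act(S, V, X, Q, t)

scalars_unchanged = np.array_equal(S, S2)
vector_norms_preserved = np.allclose(
    np.linalg.norm(V, axis=1), np.linalg.norm(V2, axis=1)
)
distances_preserved = np.allclose(pairwise_distances(X), pairwise_distances(X2))
coordinates_changed = not np.allclose(X, X2)
```

The raw coordinates change under the transformation, yet lengths and pairwise distances do not, which is the property that invariant models later exploit.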
Consider the physical principle illustrated in Figure 2.7: the potential energy of a molecule remains unchanged (invariant) under rotations or translations in 3D space, while atomic coordinates transform consistently (equivariantly) with these same transformations. This reflects the fact that the laws of physics are independent of our choice of coordinate system. Geometric neural networks must explicitly account for these symmetries to preserve the physical meaning of their predictions.

[Figure 2.7 panels: a 3D atomic system with its atom types, 3D coordinates, and potential energy. Under a permutation, the rows of the atom types and coordinates are permuted and the energy is invariant. Under a 3D rotation, the coordinates are rotated while the atom types and the energy are invariant.]
Figure 2.7: Symmetries of 3D molecular systems. The ordering of atoms/nodes in the system is arbitrary. Additionally, global rotations or translations of the system in 3D Euclidean space will lead to an equivalent transformation of 3D coordinates and other geometric attributes. Global properties of the system such as the potential energy are invariant to both permutation and physical symmetries. Geometric GNNs explicitly account for both permutation symmetry and physical transformation behaviours when modelling 3D molecules, while standard GNNs solely account for permutations.

Group theory foundations Group theory [Zee, 2016] provides the mathematical framework for formalizing these symmetries. A group $(\mathfrak{G}, \star)$ consists of a set of elements $\mathfrak{G}$ with a binary operation $\star : \mathfrak{G} \times \mathfrak{G} \to \mathfrak{G}$ satisfying three axioms:

1. Associativity: $(g_1 \star g_2) \star g_3 = g_1 \star (g_2 \star g_3)$ for all group elements $g_1, g_2, g_3 \in \mathfrak{G}$.
2. Identity: There exists $e \in \mathfrak{G}$ such that $e \star g = g \star e = g$ for all $g \in \mathfrak{G}$.
3. Inverse: For each $g \in \mathfrak{G}$, there exists $h \in \mathfrak{G}$ such that $g \star h = h \star g = e$.
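As a concrete instance of these axioms, 3D rotation matrices under matrix multiplication form a group. The following is a minimal numerical sketch, assuming elementary rotations about the x and z axes with arbitrarily chosen angles; the helper names `rotation_z`, `rotation_x`, and `is_rotation` are illustrative. It also checks that the product of two rotations is again a rotation, which is implicit in the signature of $\star$, and that this particular group is not commutative.

```python
import numpy as np


def rotation_z(theta):
    """Rotation by angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_x(theta):
    """Rotation by angle theta (radians) about the x-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def is_rotation(Q):
    """Check that Q is orthogonal with determinant +1."""
    orthogonal = np.allclose(Q.T @ Q, np.eye(3))
    return bool(orthogonal and np.isclose(np.linalg.det(Q), 1.0))


# Three group elements with arbitrary (illustrative) angles.
g1, g2, g3 = rotation_z(0.3), rotation_x(1.1), rotation_z(-2.0)
e = np.eye(3)  # identity element

closure = is_rotation(g1 @ g2)
associativity = np.allclose((g1 @ g2) @ g3, g1 @ (g2 @ g3))
identity = np.allclose(e @ g1, g1) and np.allclose(g1 @ e, g1)
# For rotation matrices, the inverse is simply the transpose.
inverse = np.allclose(g1 @ g1.T, e) and np.allclose(g1.T @ g1, e)
# Rotations about different axes do not commute in general.
commutative = np.allclose(g1 @ g2, g2 @ g1)
```

Commutativity is not one of the axioms, and the last check shows that it fails here: the order in which 3D rotations are applied matters.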
Symmetry groups for molecular systems The key symmetry groups relevant to molecular systems and their actions on geometric graphs $\mathcal{G} = (A, S, \vec{V}, \vec{X})$ are:

• Permutation symmetry $S_n$: A permutation $\sigma$ acts via permutation matrix $P_\sigma$ as: $P_\sigma \mathcal{G} := (P_\sigma A P_\sigma^\top, P_\sigma S, P_\sigma \vec{V}, P_\sigma \vec{X})$, where $P_\sigma \in \mathbb{R}^{n \times n}$ has exactly one 1 in every row and column, and 0 elsewhere.

• Rotational symmetry $SO(d)$, or rotations and reflections, $O(d)$: We use $\mathfrak{G}$ generically to denote $SO(d)$ or $O(d)$. An orthogonal transformation $Q_g \in \mathfrak{G}$ acts as: $Q_g \mathcal{G} := (A, S, \vec{V} Q_g, \vec{X} Q_g)$, where $Q_g \in \mathbb{R}^{d \times d}$ s.t. $Q_g^\top = Q_g^{-1}$ and $\det(Q_g) = 1$ for $\mathfrak{G} = SO(d)$ (or, for $\mathfrak{G} = O(d)$, $\det(Q_g) = \pm 1$).

• Translational symmetry $T(d)$: A translation vector $\vec{t} \in \mathbb{R}^d$ acts as: $\vec{t} + \mathcal{G} := (A, S, \vec{V}, \vec{X} + \vec{t})$.

Note that scalar features $S$ remain unchanged under all transformations, vector features $\vec{V}$ transform under rotations but not translations, while coordinates $\vec{X}$ transform under both rotations and translations. Without loss of generality, we consider a single vector feature per node; this framework generalizes to multiple vector features and higher-order tensors.

2.3 Representation Learning of Molecular Structure

This section provides an overview of representation learning for molecular structures, focusing on how to build Graph Neural Networks for 3D geometric graphs [Duval et al., 2023a]. We will survey the main families of Geometric GNNs: invariant, equivariant, and unconstrained models. We will also discuss their applications for molecular property prediction and simulation of molecular dynamics.

2.3.1 Graph Neural Networks

Graph Neural Networks (GNNs) are a class of deep learning architectures designed to operate on graph-structured data. GNNs leverage graph topology to propagate and aggregate information between connected nodes.
While initial GNN architectures were proposed in the late 1990s and 2000s [Goller and Kuchler, 1996, Gori et al., 2005, Scarselli et al., 2008], modern variants have emerged as the architecture of choice for representation learning on graph data across domains ranging from molecular modelling [Stokes et al., 2020, Batzner et al., 2022] to recommendation systems [Ying et al., 2018] and transportation networks [Derrow-Pinion et al., 2021].

GNNs are based on the principle of message passing, where each node iteratively updates its representations by aggregating from its local neighbors [Battaglia et al., 2018]. This process is inherently permutation-equivariant, so the learned node representations do not depend on the arbitrary ordering of nodes: relabelling the nodes simply relabels their representations. By stacking multiple message passing layers, GNNs can propagate information beyond immediate neighbors and capture complex multi-hop relationships in the graph structure (Figure 2.8).

[Figure 2.8 panels: (a) Message passing; (b) GNN computation tree]
Figure 2.8: Graph Neural Networks. (a) GNNs build latent representations of graph data through message passing operations, where each node performs learnable feature aggregation from its local neighbourhood. (b) Stacking $L$ message passing layers enables GNNs to send and aggregate information from $L$-hop subgraphs around each node.

Message Passing Framework Formally, node features $s_i$ for each node $i \in \mathcal{V}$ are updated from layer/iteration $t$ to $t+1$ through a three-step process:

1. Message construction: For each node $i$ and its neighbors $j \in \mathcal{N}_i$, construct a message $m_{ij}^{(t)}$ that captures the relationship between the representations of nodes $i$ and $j$.

$$m_{ij}^{(t)} = \psi\left(s_i^{(t)}, s_j^{(t)}\right), \quad \forall j \in \mathcal{N}_i, \tag{2.1}$$

where $\psi : \mathbb{R}^{2 \times d} \to \mathbb{R}^{d}$ is an MLP that learns to construct the message based on the representations of nodes $i$ and $j$.

2. Aggregation: Combine all messages from the neighbors of node $i$ to produce a single aggregated message $m_i^{(t)}$.
$$m_i^{(t)} = \bigoplus_{j \in \mathcal{N}_i} m_{ij}^{(t)}, \tag{2.2}$$

where $\bigoplus$ is a permutation-invariant operator (e.g. sum, mean, max) that aggregates messages from all neighbors $j \in \mathcal{N}_i$. Thus, a change in the order of neighbors does not affect the aggregated message, preserving permutation symmetry.

3. Update: Update the representations of node $i$ using the aggregated message $m_i^{(t)}$ and its previous representations $s_i^{(t)}$.

$$s_i^{(t+1)} = \phi\left(s_i^{(t)}, m_i^{(t)}\right), \tag{2.3}$$

where $\phi : \mathbb{R}^{2 \times d} \to \mathbb{R}^{d}$ is another MLP.

Alternatively, this framework can be expressed more abstractly in terms of multisets as:

$$m_i^{(t)} := \mathrm{AGG}\left(\{\!\{ (s_i^{(t)}, s_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\}\right), \tag{2.4}$$

$$s_i^{(t+1)} := \mathrm{UPD}\left(s_i^{(t)}, m_i^{(t)}\right), \tag{2.5}$$

where $\{\!\{\cdot\}\!\}$ denotes a multiset, AGG is a permutation-invariant aggregation function, and UPD is an MLP. The final node features $\{s_i^{(t=T)}\}$ at iteration $T$ can be mapped to graph-level predictions via a permutation-invariant readout function.

This general formulation encompasses well known architectures including Graph Convolutional Networks [Kipf and Welling, 2017], Graph Isomorphism Networks [Xu et al., 2019], and Message Passing Neural Networks (MPNNs) [Gilmer et al., 2017].

Graph Attention Networks A particularly interesting class of GNNs employs attention mechanisms to weight the importance of different neighbors during aggregation [Veličković et al., 2018]. In Graph Attention Networks (GATs), the message from neighbor $j$ to node $i$ is computed using an attention mechanism [Bahdanau et al., 2015].
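As a concrete reference point for the generic three-step update of Equations 2.1 to 2.3, which attention-based layers specialise through their choice of message function, the following is a minimal NumPy sketch. It assumes sum aggregation, untrained two-layer MLPs with fixed random weights for $\psi$ and $\phi$, and a hand-written toy graph; the names `make_mlp` and `message_passing_layer` are illustrative and do not refer to any library or to the implementation used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # feature dimension of each node


def make_mlp(in_dim, out_dim, hidden=8):
    """Tiny two-layer MLP with fixed random weights (untrained, illustrative)."""
    W1 = rng.normal(scale=0.5, size=(in_dim, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, out_dim))
    return lambda z: np.tanh(z @ W1) @ W2


psi = make_mlp(2 * d, d)  # message function, acts on the pair (s_i, s_j)
phi = make_mlp(2 * d, d)  # update function, acts on the pair (s_i, m_i)


def message_passing_layer(A, S):
    """One round of message passing with sum aggregation."""
    S_next = np.zeros_like(S)
    for i in range(A.shape[0]):
        # 1. Message construction from each neighbour j of node i.
        messages = [psi(np.concatenate([S[i], S[j]])) for j in np.flatnonzero(A[i])]
        # 2. Aggregation with a permutation-invariant operator (here: sum).
        m_i = np.sum(messages, axis=0) if messages else np.zeros(S.shape[1])
        # 3. Update of the node representation.
        S_next[i] = phi(np.concatenate([S[i], m_i]))
    return S_next


A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
S = rng.normal(size=(4, d))
S_next = message_passing_layer(A, S)

# Relabelling the nodes relabels the outputs in the same way.
perm = [2, 0, 3, 1]
P = np.eye(4, dtype=int)[perm]
S_next_perm = message_passing_layer(P @ A @ P.T, P @ S)
is_equivariant = np.allclose(S_next_perm, P @ S_next)
```

Replacing the sum by a mean or maximum keeps the layer permutation-equivariant, whereas any aggregation that depends on the order of neighbours would break it.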
For example, we can consider an attention mechanism based on the dot product between the representations of nodes $i$ and $j$, followed by a softmax normalization over all neighbors $j' \in \mathcal{N}_i$:

$$\psi\left(s_i^{(t)}, s_j^{(t)}\right) = \mathrm{Attention}\left(W_Q^{(t)} s_i^{(t)}, \{W_K^{(t)} s_j^{(t)}, \forall j \in \mathcal{N}_i\}, \{W_V^{(t)} s_j^{(t)}, \forall j \in \mathcal{N}_i\}\right), \tag{2.6}$$

$$= \frac{\exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_j^{(t)}\right)}{\sum_{j' \in \mathcal{N}_i} \exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_{j'}^{(t)}\right)} \cdot W_V^{(t)} s_j^{(t)}, \tag{2.7}$$

where $W_Q^{(t)}, W_K^{(t)}, W_V^{(t)} \in \mathbb{R}^{d \times d}$ are learnable linear transformations denoting the Query, Key and Value for the attention computation, respectively. Multi-head attention enhances the expressivity of this operation by computing attention in parallel across multiple representation subspaces [Vaswani et al., 2017a].

[Figure 2.9 panels: (a) $\mathfrak{G}$-invariant function; (b) $\mathfrak{G}$-equivariant function]
Figure 2.9: Invariant and equivariant functions. The output of $\mathfrak{G}$-invariant functions remains unchanged under transformations of the input. For $\mathfrak{G}$-equivariant functions, transformations of the input must result in the output transforming equivalently.

Connection to Transformers The Transformer model [Vaswani et al., 2017a] has emerged as the deep learning architecture of choice across language [Achiam et al., 2023], vision [Dosovitskiy et al., 2021], and audio [Radford et al., 2023] due to its expressivity and scalability.

Transformers and GNNs share deep mathematical connections [Joshi, 2025]. Transformers can be viewed as GATs operating on complete graphs, where self-attention models relationships between all input nodes. The Transformer update rule can be directly instantiated in the message
passing framework as follows:

$$\psi\left(s_i^{(t)}, s_j^{(t)}\right) = \mathrm{Attention}\left(W_Q^{(t)} s_i^{(t)}, \{W_K^{(t)} s_j^{(t)}, \forall j \in \mathcal{V}\}, \{W_V^{(t)} s_j^{(t)}, \forall j \in \mathcal{V}\}\right), \tag{2.8}$$

$$= \frac{\exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_j^{(t)}\right)}{\sum_{j' \in \mathcal{V}} \exp\left(W_Q^{(t)} s_i^{(t)} \cdot W_K^{(t)} s_{j'}^{(t)}\right)} \cdot W_V^{(t)} s_j^{(t)}, \tag{2.9}$$

Here, $\psi\left(s_i^{(t)}, s_j^{(t)}\right)$ computes the message from node $j$ to node $i$, with the relative importance of each node computed via attention. Next, the weighted messages from all nodes in the graph (the set $\mathcal{V}$) are aggregated via a summation, and the features of node $i$ are updated using an MLP $\phi$:

$$s_i^{(t+1)} = \phi\left(s_i^{(t)}, \sum_{j \in \mathcal{V}} \psi\left(s_i^{(t)}, s_j^{(t)}\right)\right), \tag{2.10}$$

This ability to attend to and gather information from all nodes in the set $\mathcal{V}$ (i.e., global attention over a complete graph) allows Transformers to capture both local and global context in the data via multi-head attention, without being constrained by the pathologies of a pre-defined sparse graph structure, such as oversquashing with increased depth [Di Giovanni et al., 2023]. This can be especially useful for molecular tasks where we do not have an a priori graph structure, as we will discuss subsequently.

Conversely, the GAT message passing Equation 2.7 is equivalent to Equation 2.9 with attention restricted to local neighbourhoods, where the graph structure is used to implement sparse or masked attention [Dong et al., 2024]. This connection has inspired the development of Graph Transformers [Dwivedi and Bresson, 2020, Rampášek et al., 2022] that aim to combine both local message passing and global attention. These architectures are designed to overcome the expressivity limitations of message passing GNNs while preserving the inductive bias of graph structure.

2.3.2 Geometric Graph Neural Networks

Functions on geometric graphs Before describing GNNs specialised for geometric graphs, we first define two classes of functions that are used to construct geometric neural network layers. We denote the action of a group $\mathfrak{G}$ on a space $X$ by $g \cdot x$.
If $\mathfrak{G}$ acts on spaces $X$ and $Y$, we say:

• A function $f : X \to Y$ is $\mathfrak{G}$-invariant if $f(g \cdot x) = f(x)$, i.e. the output remains unchanged under transformations of the input, as shown in Figure 2.9a.

• A function $f : X \to Y$ is $\mathfrak{G}$-equivariant if $f(g \cdot x) = g \cdot f(x)$, i.e. a transformation of the input must result in the output transforming equivalently, as shown in Figure 2.9b.

In Chapter 3, we will also consider $\mathfrak{G}$-orbit injective functions. The $\mathfrak{G}$-orbit of $x \in X$ is $O_{\mathfrak{G}}(x) = \{g \cdot x \mid g \in \mathfrak{G}\} \subseteq X$. When $x$ and $x'$ are part of the same orbit, we write $x \simeq x'$. We say a function $f : X \to Y$ is $\mathfrak{G}$-orbit injective if we have $f(x_1) = f(x_2)$ if and only if $x_1 \simeq x_2$ for any $x_1, x_2 \in X$. Necessarily, such a function is $\mathfrak{G}$-invariant: since $g \cdot x \simeq x$, we have $f(g \cdot x) = f(x)$.

[Figure 2.10: a timeline from 2018 to 2023 of Geometric GNN architectures for biomolecules, materials, and small molecules, grouped into Invariant GNNs, Cartesian Equivariant GNNs, Spherical Equivariant GNNs, and Unconstrained GNNs, with applications in property prediction, dynamics simulation, generative modelling, and structure prediction. Architectures shown: SchNet, CGCNN, MEGNet, DimeNet, GemNet, SphereNet, Invariant Point Attention, GearNet, ComENet, GVP-GNN, PaiNN, E(n)-GNN, Equivariant Transformer, ClofNet, SO3krates, TensorNet, Tensor Field Network, Cormorant, SE(3)-Transformer, NequIP, SEGNN, MACE, Allegro, Equiformer, eSCN, ForceNet, Spherical Channel Network, FAENet, PET.]
Figure 2.10: Timeline of Geometric GNN architectures, adapted from Duval et al. [2023a].

Geometric GNNs Standard Graph Neural Networks (GNNs), while powerful for general graph data, are ill-suited for geometric graphs and molecular structures. Directly applying GNNs to geometric graphs can lead to models that do not respect the physical symmetries inherent to molecules (Figure 2.7). This can result in predictions that are physically inconsistent [Musil et al., 2021]. Moreover, learning these symmetries implicitly from data alone, without appropriate inductive biases, is usually data-inefficient.
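To make this failure mode concrete, the definitions of invariance and equivariance above can be tested numerically on a three-atom toy system. The following is a minimal sketch, assuming row-vector coordinates and a fixed 90-degree rotation about the z-axis; the three functions `f_invariant`, `f_equivariant`, and `f_raw` are illustrative stand-ins chosen for this example, not models from the literature.

```python
import numpy as np

# 90-degree rotation about the z-axis, an element of SO(3).
Q = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

# Toy coordinates of three atoms, one row per atom.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])


def f_invariant(X):
    """Sum of all pairwise distances: depends only on invariant scalars."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).sum()


def f_equivariant(X):
    """Displacement of each atom from the centroid: rotates with the input."""
    return X - X.mean(axis=0, keepdims=True)


def f_raw(X):
    """Sum of raw x-coordinates: depends on the choice of reference frame."""
    return X[:, 0].sum()


X_rot = X @ Q  # rotate the whole system

invariant_ok = np.isclose(f_invariant(X_rot), f_invariant(X))
equivariant_ok = np.allclose(f_equivariant(X_rot), f_equivariant(X) @ Q)
raw_changes = not np.isclose(f_raw(X_rot), f_raw(X))
```

A model that consumes raw coordinates behaves like `f_raw`: its output for the same molecule depends on how the molecule happens to be oriented.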
To address this, Geometric GNNs, which are GNNs specialized for geometric graphs, extend the message passing paradigm to incorporate physical symmetries as inductive biases (implicit) or strict constraints (explicit). In addition to maintaining permutation equivariance for node features, geometric GNNs ensure that operations involving geometric attributes (like atomic coordinates or vector features) respect physical symmetries. This means that the learned representations and intermediate geometric features transform covariantly with respect to the group of rotations ($SO(d)$) or rotations and reflections ($O(d)$). We use $\mathfrak{G}$ as a generic symbol for these groups.

In recent years, we have seen a wide range of Geometric GNN architectures and their applications in molecular modelling [Duval et al., 2023a]. To navigate this landscape, the following sections will survey these models by categorizing them into three main families, as summarized in Figure 2.10:

• Invariant GNNs, which construct and propagate features that are invariant to $\mathfrak{G}$, such as distances, angles, and torsion angles.

• Equivariant GNNs, where intermediate representations and propagated messages are themselves geometric quantities. The representations can be expressed as Cartesian vectors or in a spherical harmonic basis.

• Unconstrained networks, which do not explicitly enforce physical symmetries in their architecture but may learn them implicitly from data, e.g. through data augmentation.

2.3.3 Rotation-invariant GNNs

Invariant GNNs are designed to learn atomic representations that are inherently invariant to 3D Euclidean transformations of the system (the translation group $T(3)$ and the group of rotations $\mathfrak{G} = SO(3)$ or rotations and reflections $\mathfrak{G} = O(3)$).

Translation invariance is typically achieved by: (1) centering input point clouds (e.g., by subtracting the center of mass from atomic coordinates) and (2) operating on relative displacement vectors $\vec{x}_{ij} = \vec{x}_i - \vec{x}_j$ instead of absolute coordinates.
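Both routes to translation invariance can be verified in a few lines. The following is a minimal sketch, assuming a three-atom toy system with equal atomic masses (so that the center of mass coincides with the centroid); the helper names `center` and `relative_displacements` are illustrative. It also shows what translation handling alone does not solve: displacement vectors still rotate with the system, and only their lengths are unchanged.

```python
import numpy as np

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
t = np.array([10.0, -4.0, 2.5])  # an arbitrary translation


def center(X):
    """(1) Subtract the centroid (center of mass for equal masses)."""
    return X - X.mean(axis=0, keepdims=True)


def relative_displacements(X):
    """(2) Tensor whose entry [i, j] is the vector x_i - x_j."""
    return X[:, None, :] - X[None, :, :]


centering_invariant = np.allclose(center(X), center(X + t))
displacements_invariant = np.allclose(
    relative_displacements(X), relative_displacements(X + t)
)

# A rotation (90 degrees about z) still changes the displacement vectors ...
Q = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
rotated = relative_displacements(X @ Q)
vectors_rotate = np.allclose(rotated, relative_displacements(X) @ Q)
vectors_differ = not np.allclose(rotated, relative_displacements(X))

# ... while their lengths, the pairwise distances, are unaffected.
lengths_survive = np.allclose(
    np.linalg.norm(rotated, axis=-1),
    np.linalg.norm(relative_displacements(X), axis=-1),
)
```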
To enforce $\mathfrak{G}$-invariance, these models avoid directly processing geometric quantities that depend on the frame of reference (like raw coordinate vectors). Instead, they operate on scalarized geometric invariants: quantities that are inherently invariant to rotations and reflections. Common examples include pairwise distances ($\|\vec{x}_{ij}\|$), triplet-wise angles (derived from dot products like $\vec{x}_{ij} \cdot \vec{x}_{ik}$), and quadruplet-wise torsion angles. By constructing messages and updating features using only these invariant scalars, the entire network, from intermediate representations to final predictions, is guaranteed to be $\mathfrak{G}$-invariant.

Message Passing with Invariant Features $\mathfrak{G}$-invariant GNN layers follow the general message passing framework (Equations 2.4-2.5) with specific differences in how geometric information is incorporated. The key idea is to use geometric invariants for constructing features. Scalar node features $s_i$ are updated from layer $t$ to $t+1$. The aggregation step, AGG, now incorporates invariant geometric information derived from relative positions $\vec{x}_{ij}$ (and potentially initial vector features $\vec{v}_i, \vec{v}_j$). The update function UPD then combines these aggregated messages with the previous node features:

$$m_i^{(t)} := \mathrm{AGG}_{\mathrm{inv}}\left(\{\!\{ (s_i^{(t)}, s_j^{(t)}, \mathrm{scalarize}(\vec{x}_{ij}, \vec{v}_i^{(t)}, \vec{v}_j^{(t)})) \mid j \in \mathcal{N}_i \}\!\}\right), \tag{2.11}$$

$$s_i^{(t+1)} := \mathrm{UPD}\left(s_i^{(t)}, m_i^{(t)}\right). \tag{2.12}$$

Here, $\mathrm{scalarize}(\cdot)$ represents the process of extracting geometric invariants (e.g., distances, angles) from the inputs. The aggregated message $m_i^{(t)}$ is thus purely an invariant scalar quantity.

[Figure 2.11 panels: (a) SchNet; (b) DimeNet]
Figure 2.11: Invariant GNN message passing. $\mathfrak{G}$-invariant layers extract and propagate local scalar geometric quantities such as distances (SchNet) and bond angles (DimeNet), which are guaranteed to be invariant to Euclidean transformations.

Examples Pioneering examples are SchNet [Schütt et al., 2018] and CGCNN [Xie and Grossman, 2018], where messages are modulated by functions of interatomic distances $\|\vec{x}_{ij}\|$.
As illustrated in Figure 2.11a, SchNet's update rule can be seen as:

$$ s_i^{(t+1)} := s_i^{(t)} + \sum_{j \in \mathcal{N}_i} f_1\left( s_j^{(t)}, \|\vec{x}_{ij}\| \right) \quad \text{(SchNet)} \tag{2.13} $$

where $f_1$ is an MLP that processes the neighbor's scalar features $s_j^{(t)}$ and the invariant distance. DimeNet [Gasteiger et al., 2020] (Figure 2.11b) extends this by incorporating angular information, effectively using messages that depend on triplets of atoms. It computes messages based on distances and angles (derived from dot products like $\vec{x}_{ij} \cdot \vec{x}_{ik}$):

$$ s_i^{(t+1)} := \sum_{j \in \mathcal{N}_i} f_1\left( s_i^{(t)}, s_j^{(t)}, \sum_{k \in \mathcal{N}_i \setminus \{j\}} f_2\left( s_j^{(t)}, s_k^{(t)}, \|\vec{x}_{ij}\|, \vec{x}_{ij} \cdot \vec{x}_{ik} \right) \right) \quad \text{(DimeNet)} \tag{2.14} $$

In both cases, the updated scalar features $s_i^{(t+1)}$ maintain invariance to $G$ transformations, as they are constructed solely from geometric invariants.

Other notable invariant GNNs include GemNet [Gasteiger et al., 2021] and SphereNet [Liu et al., 2022], which incorporate additional geometric features like torsion angles among quadruplets of atoms. Another class of invariant GNNs is based on canonical frames of reference, which define a local or global frame to scalarise geometric quantities into invariant features used for message passing. A notable example of this approach is the Invariant Point Attention layer from AlphaFold2 [Jumper et al., 2021].

Continuity and smoothness. While rotation invariance ensures that atomic representations and predictions do not change under rotation, applications such as molecular dynamics impose an additional, stricter constraint on Geometric GNNs: the learned Potential Energy Surface (PES) must be smooth. Since atomic forces are derived as the negative gradient of the energy ($\vec{F}_i = -\nabla_{\vec{x}_i} E$), the energy function must be at least twice differentiable ($C^2$ continuous) with respect to atomic positions to ensure stable, continuous forces and Hessians during simulation [Musil et al., 2021].
Standard deep learning operations often violate this requirement, motivating special architectural choices in Geometric GNNs:
• Basis Functions: To provide smooth, learnable representations of geometry, models project geometric scalars onto sets of continuous basis functions rather than operating on raw values. Interatomic distances are expanded using Radial Basis Functions (RBFs) such as Gaussians [Schütt et al., 2018] or Bessel functions [Gasteiger et al., 2020]. Similarly, angular and torsional features are embedded using basis sets like Fourier series [Gasteiger et al., 2021].
• Smooth Envelopes (Cutoffs): Discontinuities inevitably arise when atoms enter or leave the local cutoff radius $r_{\mathrm{cut}}$ defined during graph construction. To prevent jumps in energy (and infinite forces) at this boundary, interactions are modulated by a smooth envelope function which forces the messages and their derivatives to zero as $r \to r_{\mathrm{cut}}$.
• Smooth Activations and Aggregation: Non-smooth non-linearities like ReLU (which is continuous but has a discontinuous derivative) or aggregation functions like Max-Pooling introduce discontinuities in the force field. Consequently, Geometric GNNs predominantly employ smooth activation functions such as Swish/SiLU [Ramachandran et al., 2017], and rely on summation for aggregation to preserve differentiability throughout the network.

2.3.4 From Invariant to Equivariant GNNs

Invariant GNNs, as discussed previously, achieve $G$-invariance by operating exclusively on pre-defined local scalar invariants like distances and angles. While this ensures invariance and can be computationally efficient, it is inherently restrictive: the model is limited to the expressive power of these fixed, local geometric descriptors.

Figure 2.12 (panels: invariant features, equivariant features): The Picasso Problem and the need for equivariance.
Identifying a face requires understanding not just the presence of individual features like eyes, nose, and mouth (invariant information), but crucially their relative spatial arrangement (equivariant information) [Hinton, 2021]. Similarly, for molecular systems, predicting invariant properties often necessitates understanding how different structural motifs are geometrically oriented and interact with one another, which is an inherently equivariant sub-task. Equivariant representations enable the network to dynamically learn complex invariants that extend beyond fixed local neighborhoods.

Consider the Picasso Problem illustrated in Figure 2.12. Recognizing a face requires not just identifying the presence of eyes, a nose, and a mouth (invariant features), but crucially understanding their relative spatial arrangement (equivariant information). Similarly, for molecular systems, predicting an overall invariant property (like the potential energy) often necessitates understanding how different sub-structures or motifs are oriented and interact geometrically. This involves solving equivariant sub-tasks. If a model only processes pre-computed local invariants, it may struggle to capture these crucial relative geometric relationships that define more global structural characteristics.

This motivates the development of equivariant GNNs. Instead of discarding directional information by scalarization, these models propagate and transform geometric quantities (like vectors or higher-order tensors) in a way that ensures their hidden features at each layer remain equivariant to the symmetry transformations of the input. If the input molecular structure is rotated, the intermediate vector or tensor features within an equivariant GNN will rotate correspondingly. This diligent accounting of geometric information allows the network to learn how to combine these equivariant features.
Consequently, equivariant GNNs can construct more complex and task-relevant invariants dynamically during message passing. The complexity and range of these learned invariants (e.g., involving atoms further apart or more intricate geometric relationships) can increase with the number of message passing layers, allowing the model to capture information beyond fixed local neighborhoods. Furthermore, this equivariant processing is essential for tasks that require predicting equivariant quantities themselves, such as atomic forces in molecular dynamics.

Having established the intuition for equivariant representations, we will now introduce two families of equivariant GNNs that use different bases for representing geometric information: Cartesian vectors and spherical tensors.

2.3.5 Rotation-equivariant GNNs using Cartesian Vectors

Cartesian equivariant GNNs represent a class of models that operate directly with geometric quantities in Cartesian coordinates. To maintain physical consistency, they restrict operations on these geometric features to those that preserve equivariance under rotations, reflections, and translations. These models typically assign two fundamental types of features to each node: scalar features invariant under $G$ and vector features that transform covariantly under $G$.

Equivariant operations. To construct equivariant message passing layers which update the scalar and vector features at each node, only specific operations are permissible:
• Scalar × Scalar → Scalar
• Scalar × Vector → Vector
• Vector · Vector (dot product) → Scalar
• Norm of Vector ($\|\vec{v}\|$) → Scalar
• Vector + Vector → Vector (if both are of the same type and representation)

Element-wise non-linear activation functions are generally applied only to scalar quantities, as applying them directly to vector components can break $G$-equivariance.
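The restriction on activations can be checked numerically. The sketch below, in plain Python with hypothetical helper names (`rotate_z`, `relu_componentwise`, `norm_gate`), compares "rotate then activate" with "activate then rotate" for a single vector. A component-wise ReLU gives different answers, whereas a gate built only from permissible operations (norm of a vector, then scalar × vector) commutes with the rotation. The sigmoid gate is an illustrative choice, not the non-linearity of any particular model.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3-vector about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def relu_componentwise(v):
    """ReLU applied to each Cartesian component (not an equivariant operation)."""
    return tuple(max(0.0, a) for a in v)


def norm_gate(v):
    """Scale v by a smooth function of its norm: norm -> scalar, scalar x vector -> vector."""
    n = math.sqrt(sum(a * a for a in v))
    gate = 1.0 / (1.0 + math.exp(-n))  # sigmoid of an invariant scalar
    return tuple(gate * a for a in v)


def max_abs_diff(a, b):
    """Largest component-wise deviation between two 3-vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


v = (1.0, -2.0, 0.5)
angle = 0.7

# Equivariance requires f(R v) == R f(v); measure the violation of each map.
relu_error = max_abs_diff(
    relu_componentwise(rotate_z(v, angle)), rotate_z(relu_componentwise(v), angle)
)
gate_error = max_abs_diff(norm_gate(rotate_z(v, angle)), rotate_z(norm_gate(v), angle))

print(f"component-wise ReLU violation: {relu_error:.3f}")
print(f"norm-gated violation:          {gate_error:.1e}")
```

The ReLU violation is of order one for this input, while the gated map agrees up to floating-point rounding.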
The cross-product of two vectors can also be used, but it is important to note that this operation does not yield a vector in the same sense as the input vectors; instead, it yields a pseudo-vector which transforms differently under reflections.^2

Message passing with scalars and vectors. These fundamental equivariant operations form the toolkit for constructing message passing layers in Equivariant GNNs. The network updates both scalar features $s_i$ and vector features $\vec{v}_i$ at each node $i$ by aggregating information from its neighbors $j \in \mathcal{N}_i$ while preserving equivariance. The general message passing update can be formulated as:

$$ m_i^{(t)}, \vec{m}_i^{(t)} := \mathrm{AGG}\left( \{\!\{ (s_i^{(t)}, s_j^{(t)}, \vec{v}_i^{(t)}, \vec{v}_j^{(t)}, \vec{x}_{ij}) \mid j \in \mathcal{N}_i \}\!\} \right) \quad \text{(Aggregate)} \tag{2.15} $$

$$ s_i^{(t+1)}, \vec{v}_i^{(t+1)} := \mathrm{UPD}\left( (s_i^{(t)}, \vec{v}_i^{(t)}), (m_i^{(t)}, \vec{m}_i^{(t)}) \right) \quad \text{(Update)} \tag{2.16} $$

^2 The cross-product of two vectors $\vec{a} \times \vec{b}$ is a pseudo-vector. Under inversion (reflection through the origin), $\vec{a} \to -\vec{a}$ and $\vec{b} \to -\vec{b}$, so $(-\vec{a}) \times (-\vec{b}) = \vec{a} \times \vec{b}$. A vector, however, would invert: $\vec{v} \to -\vec{v}$. This distinction is important when choosing between $O(3)$ and $SO(3)$ equivariance.

Figure 2.13: Equivariant GNN message passing. (a) PaiNN, (b) TFN. $G$-equivariant layers such as PaiNN and TFN propagate geometric quantities such as vectors, relative positions, or tensors.

Here, both AGG and UPD are composed of the permissible equivariant operations. For example, PaiNN [Schütt et al., 2021] (Figure 2.13a) implements interaction layers where the aggregated messages $m_i^{(t)}$ (scalar) and $\vec{m}_i^{(t)}$ (vector) are computed as:

$$ m_i^{(t)} := s_i^{(t)} + \sum_{j \in \mathcal{N}_i} f_1\left( s_j^{(t)}, \|\vec{x}_{ij}\| \right) \tag{2.17} $$

$$ \vec{m}_i^{(t)} := \vec{v}_i^{(t)} + \sum_{j \in \mathcal{N}_i} f_2\left( s_j^{(t)}, \|\vec{x}_{ij}\| \right) \odot \vec{v}_j^{(t)} + \sum_{j \in \mathcal{N}_i} f_3\left( s_j^{(t)}, \|\vec{x}_{ij}\| \right) \odot \vec{x}_{ij} \tag{2.18} $$

The learnable filter functions $f_1, f_2, f_3$ (typically MLPs) take scalar inputs (the neighbor's scalar features $s_j^{(t)}$ and the invariant distance $\|\vec{x}_{ij}\|$) and produce scalar outputs.
These outputs then scale other scalars (in $f_1$) or vectors (in $f_2, f_3$) via element-wise multiplication $\odot$, which is consistent with the allowed equivariant operations. The subsequent update step in PaiNN, yielding $s_i^{(t+1)}$ and $\vec{v}_i^{(t+1)}$, often employs a gated non-linearity [Weiler et al., 2018] for the vector features:

$$ s_i^{(t+1)} := m_i^{(t)} + f_4\left( m_i^{(t)}, \|\vec{m}_i^{(t)}\| \right), \qquad \vec{v}_i^{(t+1)} := \vec{m}_i^{(t)} + f_5\left( m_i^{(t)}, \|\vec{m}_i^{(t)}\| \right) \odot \vec{m}_i^{(t)}. \tag{2.19} $$

Here, $f_4, f_5$ are learnable functions. The vector update scales $\vec{m}_i^{(t)}$ by a scalar derived from $m_i^{(t)}$ and the norm $\|\vec{m}_i^{(t)}\|$, preserving equivariance. Other notable Equivariant GNNs include E(n)-GNN [Satorras et al., 2021] and GVP-GNN [Jing et al., 2020], which employ similar principles of equivariant operations on Cartesian scalars and vectors.

Higher-order Cartesian tensors. The frameworks discussed above restrict node features to scalars (rank-0) and vectors (rank-1), making them computationally efficient. However, geometric information can also be encoded in higher-order Cartesian tensors, such as $3 \times 3$ matrices (rank-2) or higher-dimensional arrays. These quantities are naturally constructed via the outer product of lower-order features. For instance, given two equivariant vectors $\vec{u}, \vec{v} \in \mathbb{R}^3$, their outer product $\vec{u} \otimes \vec{v} \in \mathbb{R}^{3 \times 3}$ is a rank-2 tensor that transforms as $\vec{u} \otimes \vec{v} \to (R\vec{u}) \otimes (R\vec{v}) = R(\vec{u} \otimes \vec{v})R^{\top}$ under rotation $R$. Models such as TensorNet [Simeon and Fabritiis, 2023] leverage this principle, allowing nodes to update rank-2 tensor features through mixing operations that respect these transformation rules.

While expressive, explicitly maintaining full Cartesian tensors of rank $k$ becomes computationally expensive, as memory scales with $3^k$. Furthermore, these representations are mathematically reducible, meaning they contain a mixture of lower-order symmetric subspaces (e.g., a $3 \times 3$ matrix contains a scalar trace, a vector-like antisymmetric part, and a symmetric traceless component). See Duval et al.
[2023a] for a detailed derivation.

2.3.6 Rotation-equivariant GNNs using Spherical Tensors

As noted above, while high-order interactions can be represented using Cartesian tensors, doing so is redundant because these tensors are reducible. For example, a rank-2 Cartesian tensor (9 components) decomposes into three independent subspaces that transform differently under rotation: a scalar (1 component), a vector (3 components), and a symmetric traceless tensor (5 components).

Rotation-equivariant GNNs using Spherical Tensors aim to work directly with these irreducible representations (irreps) of the rotation group $SO(3)$, avoiding the redundancy of the Cartesian basis. Instead of raw coordinate arrays, geometric features are represented as spherical tensors $\tilde{h}_{i,l} \in \mathbb{R}^{(2l+1) \times f}$ of degree $l$, where $l = 0$ corresponds to scalars, $l = 1$ to vectors, $l = 2$ to the symmetric traceless tensors mentioned previously, and so on.

Models in this family, such as Tensor Field Networks (TFN) [Thomas et al., 2018], Cormorant [Anderson et al., 2019], SEGNN [Brandstetter et al., 2022], and MACE [Batatia et al., 2022b], typically represent node features as collections of spherical tensors $\tilde{h}_{i,l} \in \mathbb{R}^{(2l+1) \times f}$ for different orders $l = 0, 1, \dots, L_{\max}$. Here, $l = 0$ corresponds to scalar features $s_i$, $l = 1$ to vector features $\vec{v}_i$, and higher orders capture more complex angular information. The key equivariant operations involve Spherical Harmonics $Y(\hat{x}_{ij})$ to encode directional information from relative positions, learnable radial basis functions $f(\|\vec{x}_{ij}\|)$ for distances, and Clebsch-Gordan coefficients for combining these into equivariant tensor products.

Message passing with spherical tensors. In these models, node features (collections of spherical tensors $\tilde{h}_i^{(t)}$) are updated by aggregating messages from neighbors. The core of message construction is an equivariant tensor product. The aggregated message for node $i$ typically involves summing contributions from each neighbor $j$.
Each contribution is a tensor product of the neighbor's features $\tilde{h}_j^{(t)}$ and the spherical harmonic representation $Y(\hat{x}_{ij})$ of the relative direction $\hat{x}_{ij} = \vec{x}_{ij}/\|\vec{x}_{ij}\|$. This tensor product is weighted by learnable functions of the interatomic distance $\|\vec{x}_{ij}\|$. A simplified form for the update introduced in Thomas et al. [2018], including a residual connection, is:

$$ \tilde{h}_i^{(t+1)} := \tilde{h}_i^{(t)} + \sum_{j \in \mathcal{N}_i} Y(\hat{x}_{ij}) \otimes_w \tilde{h}_j^{(t)}, \tag{2.20} $$

where $\otimes_w$ denotes a learnable tensor product. The weights $w$ are typically outputs of a neural network (e.g., an MLP) applied to the radial distance, $w = \mathrm{MLP}(\|\vec{x}_{ij}\|)$, producing different weights for different interaction paths in the tensor product. More explicitly, to obtain the component $m_3$ of the order-$l_3$ part of the message $\tilde{m}_{ij}$ from neighbor $j$ (before summation and update), the tensor product in equation 2.20 can be expanded using Clebsch-Gordan coefficients $C^{l_3 m_3}_{l_1 m_1, l_2 m_2}$:

$$ (\tilde{m}_{ij})_{l_3 m_3} := \sum_{l_1 m_1, l_2 m_2} C^{l_3 m_3}_{l_1 m_1, l_2 m_2} \, f_{l_1 l_2 l_3}(\|\vec{x}_{ij}\|) \, Y^{m_1}_{l_1}(\hat{x}_{ij}) \, \tilde{h}^{(t)}_{j, l_2 m_2}. \tag{2.21} $$

The learnable radial function $f_{l_1 l_2 l_3}(\cdot)$ depends on the specific orders $l_1, l_2, l_3$ involved in the interaction path. The Clebsch-Gordan coefficients ensure that the resulting message components, and thus the updated features $\tilde{h}_i^{(t+1)}$, transform equivariantly under $SO(3)$.

The tensor product in equation 2.21 can be seen as a generalization of the Cartesian vector operations used in the scalar-vector equivariant GNNs introduced previously. Notably, when restricting the tensor product to only scalars (up to $L_{\max} = 0$), we obtain updates of a form similar to equation 2.13 for SchNet. Similarly, when using only scalars and vectors (i.e., $L_{\max} = 1$), the operations resemble those in Cartesian equivariant GNNs like PaiNN (Equations 2.17 and 2.18). For a comprehensive overview of this class of models, we refer to dedicated surveys such as Batatia et al. [2022a] and Geiger and Smidt [2022].
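To make the link between the tensor product and familiar Cartesian operations concrete, the sketch below works through the simplest non-trivial case of two $l = 1$ inputs. In Cartesian form, the nine components of the outer product split into an $l = 0$ part (the trace, i.e. the dot product), an $l = 1$ part (the antisymmetric part, i.e. the cross product up to a factor) and an $l = 2$ part (the symmetric traceless remainder). This agrees with the Clebsch-Gordan paths of equation 2.21 only up to normalisation constants and a change of basis. The function names (`outer`, `decompose`) are illustrative, and no learnable radial weights are included.

```python
def outer(u, v):
    """Rank-2 Cartesian tensor T_ab = u_a * v_b (9 components)."""
    return [[u[a] * v[b] for b in range(3)] for a in range(3)]


def decompose(T):
    """Split a 3x3 tensor into isotropic (l=0), antisymmetric (l=1)
    and symmetric traceless (l=2) parts."""
    trace = sum(T[a][a] for a in range(3))
    iso = [[(trace / 3.0 if a == b else 0.0) for b in range(3)] for a in range(3)]
    anti = [[0.5 * (T[a][b] - T[b][a]) for b in range(3)] for a in range(3)]
    sym0 = [
        [0.5 * (T[a][b] + T[b][a]) - iso[a][b] for b in range(3)] for a in range(3)
    ]
    return iso, anti, sym0


u, v = (1.0, 2.0, 3.0), (-0.5, 0.4, 2.0)
T = outer(u, v)
iso, anti, sym0 = decompose(T)

# l = 0 path: the trace of the outer product is the dot product (1 component).
dot = sum(a * b for a, b in zip(u, v))

# l = 1 path: the antisymmetric part holds the cross product (3 components).
cross = (
    u[1] * v[2] - u[2] * v[1],
    u[2] * v[0] - u[0] * v[2],
    u[0] * v[1] - u[1] * v[0],
)

# l = 2 path: the symmetric traceless part has 5 independent components.
print("trace vs dot:", sum(T[a][a] for a in range(3)), dot)
print("antisymmetric entries vs cross / 2:", (anti[1][2], anti[2][0], anti[0][1]), cross)
print("trace of l=2 part:", sum(sym0[a][a] for a in range(3)))
```

The component counts 1 + 3 + 5 = 9 match the decomposition of a rank-2 Cartesian tensor described in Section 2.3.6.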
Architectural improvements. The original TFN architecture [Thomas et al., 2018] has been extended and improved in several ways. SE(3)-Transformers [Fuchs et al., 2020] introduced equivariant self-attention for aggregation. SEGNN [Brandstetter et al., 2022] developed equivariant non-linear convolutions using steerable MLPs within a message passing framework, offering a recipe for equivariant MPNNs with spherical tensor features. Equiformer [Liao and Smidt, 2023] combined these two ideas by interleaving equivariant self-attention with non-linear updates in local Transformer-style blocks. Notably, the self-attention weights in these models are invariant, derived from scalarized geometric information, and re-weight neighborhood features during equivariant message passing.

MACE [Batatia et al., 2022b] incorporates many-body interaction terms^3 by factorizing higher-order terms into products of two-body representations, drawing from the Atomic Cluster Expansion (ACE) formalism [Drautz, 2019]. This "density-trick"^4 exchanges summation and multiplication (e.g., $(a + b)^2$ efficiently yields $a^2, ab, ba, b^2$ terms with coupled coefficients) to reduce operations. MACE sums one-body features (like spherical harmonic embeddings of neighbors) and then takes tensor products of these aggregates, efficiently generating high-order terms. Allegro [Musaelian et al., 2022] implements the ACE framework with a single message passing layer and an extended local cutoff, enabling efficient GPU parallelization for simulating large systems.

^3 Many-body effects refer to the collective behaviour of a large number of interacting constituents. They are needed for an accurate description of both the structure and dynamics of large chemical systems.

^4 This idea was originally used in Bartók et al. [2010, 2013], and is referred to as density-trick by Drautz [2019].
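The exchange of summation and multiplication behind the density trick can be illustrated with scalars. The sketch below is a toy version under strong simplifying assumptions: each neighbour contributes a single scalar feature (a hypothetical radial value `exp(-r)`), whereas MACE works with tensor products of equivariant features. All function names are illustrative. The explicit sums over pairs and triplets of neighbours cost O(N^2) and O(N^3) terms, while aggregating once and then taking a power costs O(N).

```python
import math


def two_body_features(distances):
    """Toy scalar feature per neighbour, a stand-in for a radial basis value."""
    return [math.exp(-r) for r in distances]


def explicit_pairs(phis):
    """Sum over all ordered neighbour pairs (j, k): O(N^2) terms."""
    return sum(pj * pk for pj in phis for pk in phis)


def explicit_triplets(phis):
    """Sum over all ordered neighbour triplets (j, k, l): O(N^3) terms."""
    return sum(pj * pk * pl for pj in phis for pk in phis for pl in phis)


def density_trick(phis, order):
    """Aggregate over neighbours once (O(N)), then raise the aggregate to a power."""
    return sum(phis) ** order


neighbour_distances = [0.9, 1.1, 1.5, 2.0]  # hypothetical neighbour shell
phis = two_body_features(neighbour_distances)

pair_gap = abs(explicit_pairs(phis) - density_trick(phis, 2))
triplet_gap = abs(explicit_triplets(phis) - density_trick(phis, 3))

print(f"pair sum mismatch:    {pair_gap:.1e}")
print(f"triplet sum mismatch: {triplet_gap:.1e}")
```

Both mismatches are at the level of floating-point rounding. Note that the power of the aggregate also contains self-interaction terms such as $a^2$, as in the $(a + b)^2$ example above.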
eSCN [Passaro and Zitnick, 2023] addresses the computational cost of high-rank tensor products by reducing $SO(3)$ equivariant convolutions to equivalent $SO(2)$ convolutions. This is achieved by aligning the node embeddings' primary axis with the edge vector, simplifying the rotational symmetry to 2D. Despite requiring extra Wigner D-matrix rotations for alignment, this sparsifies Clebsch-Gordan coefficients, speeding up computations for $l > 1$. EquiformerV2 [Liao et al., 2024a] leverages the eSCN technique to scale Equiformer to hundreds of millions of parameters for the first time.

2.3.7 Unconstrained GNNs

The geometric GNN families discussed previously enforce physical symmetries via architectural constraints. An alternative class of unconstrained GNNs does not strictly enforce symmetries, but instead incorporates them more flexibly as inductive biases. This enables the direct use of relative positions $\vec{x}_{ij}$ and other geometric quantities in MLPs used for message passing:

$$ s_i^{(t+1)} = \phi\left( s_i^{(t)}, \sum_{j \in \mathcal{N}_i} \psi\left( s_i^{(t)}, s_j^{(t)}, \vec{x}_{ij} \right) \right). \tag{2.22} $$

This strategy trades the guarantee of strict equivariance for potentially greater model expressiveness and computational efficiency, as these models do not require explicit equivariant operations or scalarization.

A straightforward but effective approach achieves approximate symmetry through data augmentation. For example, ForceNet [Hu et al., 2021] implicitly learns symmetries by training on multiple random rotations of each geometric graph, similar to how Vision Transformers [Dosovitskiy et al., 2021] learn approximate equivariance from augmented training data. Additionally, soft constraints can be introduced via regularization terms in the loss function to encourage symmetry preservation [Elhag et al., 2024].

Alternatively, canonization-based approaches address equivariance at the data representation stage by transforming input into a canonical frame before applying standard GNNs.
FAENet [Duval et al., 2023b] uses PCA to project data into canonical space, then uses the relative positions $\vec{x}_{ij}$ in message passing. Rather than relying on hand-designed canonization methods like PCA, Kaba et al. [2023] proposed learning the canonization transform using a shallow equivariant network. Other approaches focus on local canonization, defining distinct coordinate frames at each atom and projecting tensor information onto these local frames. For instance, Pozdnyakov and Ceriotti [2023] introduced an unconstrained Geometric Transformer that employs an Equivariant Coordinate System Ensemble, averaging predictions from a non-equivariant network over multiple such local coordinate systems.

While unconstrained GNNs offer computational advantages, the lack of exact symmetry guarantees means that they may not always produce physically consistent predictions. This can lead to inaccuracies in tasks requiring strict adherence to physical laws, such as molecular simulation [Fu et al., 2023, Bigi et al., 2025]. These accuracy-scalability trade-offs are explored further in Chapter 4 in the context of generative modelling.

2.3.8 Applications of Geometric GNNs

Geometric GNNs have been successfully applied across materials science, chemistry, and biology, where molecular systems are naturally represented as geometric graphs and predictions are correlated with physical symmetries. Two primary application domains have emerged as particularly prominent and impactful: molecular dynamics simulation and property prediction.

Dynamics simulation. Understanding the dynamic behavior of molecular systems is fundamental to predicting their properties and functions. Almost a century ago, Dirac postulated that the fundamental mathematical principles describing interactions within materials and molecules at the atomic scale, based on quantum mechanics, were largely understood [Dirac, 1929].
While quantum mechanics can, in principle, be used to simulate all kinds of matter, the inherent mathematical complexity makes exact calculations intractable for most practically relevant systems, necessitating the development of approximate methods. Density Functional Theory (DFT) [Kohn et al., 1996] became a cornerstone for such approximations, but its cubic scaling with system size limits simulations to hundreds of atoms. To bridge the gap between quantum accuracy and large-scale simulation, Machine Learning-based Interatomic Potentials (MLIPs) have been proposed over the past decade [Behler and Parrinello, 2007, Bartók et al., 2010, 2013, 2017, Unke et al., 2021]. These models, trained on QM or DFT data, can approximate quantum mechanical calculations with high accuracy, often being able to generalize beyond their training data to larger systems.

Geometric GNNs, especially those based on spherical tensor representations, have emerged as a leading model class for developing MLIPs [Batzner et al., 2022, Batatia et al., 2022b, Wood et al., 2025]. These models generally predict the potential energy of an atomic configuration, from which interatomic forces can be computed based on the law of conservation of energy, using which system dynamics can be simulated. Crucially, this application relies on the $C^2$ continuity guarantees of the Geometric GNN representations discussed in Section 2.3.3, ensuring that the derived forces are continuous and energy-conserving for stable simulation [Fu et al., 2025].

Property prediction. Beyond simulation, Geometric GNNs are also employed for predicting functional and experimental properties that may not be directly derivable from first-principles quantum mechanical calculations.
This is also known as Quantitative structure-activity relationships [Todeschini and Consonni, 2009], and involves training GNNs to predict diverse functional properties of small molecules [Stokes et al., 2020], proteins [Gligorijević et al., 2021], crystals [Xie and Grossman, 2018], and electrocatalysis systems [Lan et al., 2023, Wander et al., 2025].

The typical workflow involves training GNNs on datasets comprising experimentally measured or computationally derived functional properties, learning to map from the geometric graph representation of the systems to these target properties. In the context of designing new molecules and materials, these models can guide the design process through two primary approaches: (1) High-throughput screening: Evaluating large databases of known or synthesizable molecules and materials to identify promising candidates based on GNN-predicted properties [Buterez et al., 2023]; and (2) Generative inverse design: Training generative models (discussed in the following section) to design novel molecules or materials tailored to desired property profiles [Sanchez-Lengeling and Aspuru-Guzik, 2018].

2.4 Generative Modelling of Molecular Systems

Having explored various classes of Geometric GNN architectures for learning molecular representations, we now turn to the complementary problem of generative modelling. While representation learning focuses on predictive understanding of molecular systems, generative models aim to create new molecules with desired characteristics [Du et al., 2024, Winnifrith et al., 2024]. In the following sections, we will discuss the most relevant generative models used in this thesis, which can be broadly categorized into autoregressive models, variational autoencoders, and diffusion models.

2.4.1 Autoregressive (Language) Models

Autoregressive models are a class of generative models that learn to predict the next element in a sequence given the previous elements [Graves, 2013].
They have achieved remarkable success in natural language processing, where they model sequences of words or characters [Sutskever et al., 2014, Bahdanau et al., 2015, Vaswani et al., 2017a], and have been adapted to molecular systems by generating sequences of categorical tokens such as SMILES strings for small molecules [Segler et al., 2018], amino acid sequences for proteins [Madani et al., 2023], or nucleotide sequences for RNA [Shulgina et al., 2024]. Given a sequence $X = (x_1, x_2, \dots, x_T)$ of $T$ tokens, where each token $x_t$ belongs to a predefined vocabulary $V$, an autoregressive model defines the joint probability distribution by factorizing it as:

$$ P(X) = P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}; \theta), \tag{2.23} $$

where $P(x_t \mid x_1, \dots, x_{t-1}; \theta)$ represents the probability of generating the $t$-th token $x_t$ conditioned on all preceding tokens x 0. (2.36)

Flow Matching models can be viewed as a continuous-time analogue of diffusion models, where the linear interpolation path in equation 2.33 is a form of the forward process in equation 2.29 with a particular noise schedule and a Gaussian prior. The reverse process in flow matching is learned as a vector field that guides the flow from the prior to the data distribution, similar to how diffusion models learn a denoising process. See Albergo et al. [2023], Gao et al. [2024] for further discussions on the equivalence between diffusion models and flow matching.

^6 Note that $t$ is distinct from the discrete time steps $t \in \{1, \dots, T\}$ used in diffusion models.

Conditioning and Guidance. Diffusion and flow matching models can be conditioned on additional information to guide the generation process toward specific properties or structures [Dieleman, 2022]. For example, classifier-based guidance [Dhariwal and Nichol, 2021] uses a pre-trained classifier to steer the generation process by adjusting the predicted noise or vector field based on the classifier's output.
Classifier-free guidance [Ho and Salimans, 2022] allows for more flexible conditioning by directly providing additional information (e.g., class labels, text prompts) to the denoiser or vector field predictor without requiring a separate classifier.

Conditional generation has proven especially successful for latent diffusion models [Vahdat et al., 2021], which perform diffusion in the latent space of a pre-trained autoencoder rather than in high-dimensional raw input space. This approach achieves computational efficiency while maintaining generation quality by operating on semantically meaningful, lower-dimensional representations, followed by reconstruction to the original data space [Dieleman, 2025]. Latent diffusion also enables flexible conditioning strategies across diverse modalities, including class labels, text descriptions, or any other data type that can be encoded into a compatible latent representation [Rombach et al., 2022].

Overall, conditional generation capabilities make diffusion models particularly powerful for molecular design, where the goal is to discover novel molecules that satisfy predetermined requirements rather than simply modelling existing molecular distributions. By conditioning the denoiser on desired properties (e.g., binding affinity [Gruver et al., 2023], bulk modulus, magnetic density [Zeni et al., 2025]) or structural constraints (e.g., scaffolding around a known binding site [Watson et al., 2023], completing partial molecular fragments [Schneuing et al., 2024]), we can guide generation toward molecules with specific functional characteristics.

Part I
Molecular Representation Learning and Generative Modelling

Chapter 3
Expressive Power of Molecular Structure Representations

As we saw in Chapter 2, molecular systems can be represented as 3D geometric graphs with node attributes that transform along with Euclidean transformations of the system.
We then introduced Geometric Graph Neural Networks (GNNs) that are designed to learn representations of these graphs, categorised by the geometric inductive biases they implement: (1) Invariant GNNs which only propagate invariant scalar features such as distances and angles [Schütt et al., 2018, Gasteiger et al., 2020]; (2) Equivariant GNNs which propagate equivariant geometric features such as vectors [Satorras et al., 2021] or spherical tensors [Thomas et al., 2018]; and (3) Unconstrained GNNs which do not enforce any equivariance or invariance on the features. These architectures have powered applications ranging from protein structure prediction [Jumper et al., 2021] and design [Dauparas et al., 2022] to molecular simulation [Batzner et al., 2022] and catalysis [Wander et al., 2025].

However, there is no unified theoretical framework to understand and characterise the representation capacity or expressive power [Raghu et al., 2017] of different classes of architectures. The theoretical limits and practical implications of different design choices, such as equivariance vs. invariance, number of layers, and body order of scalarisation, are not well understood. To address this, this chapter establishes a theoretical foundation for Geometric GNNs which will guide their application in subsequent chapters (Part II).

We introduce the Geometric Weisfeiler-Leman (GWL) test, a generalisation of the classic Weisfeiler-Leman algorithm for discriminating geometric graphs while respecting underlying 3D symmetries: permutations, rotations, reflections, and translations. We use the GWL framework to characterise the expressive power of invariant and equivariant GNNs in terms of their ability to distinguish geometric graphs. GWL provides mechanistic insights into the advantages of equivariant models over invariant ones, and how higher-order representations enable maximally expressive architectures.
Overall, we formalize key design choices which influence Geometric GNN expressivity through the lens of GWL, summarised in Figure 3.1.

Figure 3.1 (axes: Body Order of Layer; Tensor Order of Features, with levels Scalars, Cartesian, and Spherical; Depth. Models shown: SchNet, DimeNet, GemNet-T, SphereNet, E(n)-GNN, GVP-GNN, PaiNN, TFN, SEGNN, SE(3)-Transformer, MACE): Axes of Geometric GNN expressivity: (1) Body order: increasing scalarisation body order builds expressive local descriptors; (2) Tensor order: higher order tensors determine the relative orientation of neighbourhoods; and (3) Depth: deep equivariant layers propagate geometric information beyond local neighbourhoods.

To complement GWL's theoretical framework, this chapter also presents: (1) a suite of synthetic experiments (Geometric GNN Dojo) that highlight practical challenges for building expressive Geometric GNNs; and (2) a real-world benchmark for protein function prediction that fairly compares state-of-the-art Geometric GNNs to sequence-based protein language models. Open source code is available: github.com/chaitjo/geometric-gnn-dojo and github.com/a-r-j/ProteinWorkshop, respectively.

3.1 Limitations of the Weisfeiler-Leman Test

Graph Isomorphism. The graph isomorphism problem asks whether two graphs are the same, but drawn differently [Read and Corneil, 1977]. Two attributed graphs $\mathcal{G}, \mathcal{H}$ are isomorphic (denoted $\mathcal{G} \simeq \mathcal{H}$) if there exists an edge-preserving bijection $b : \mathcal{V}(\mathcal{G}) \to \mathcal{V}(\mathcal{H})$ such that $s_i^{(\mathcal{G})} = s_{b(i)}^{(\mathcal{H})}$, as illustrated in Figure 3.2.

Figure 3.2: Graph isomorphism. Two attributed graphs $\mathcal{G}$ and $\mathcal{H}$ are isomorphic if there exists a bijection $b$ between their nodes that preserves the edge structure and node features.

Figure 3.3: Weisfeiler-Leman Test for non-geometric graphs. WL iteratively refines node colours based on neighbourhood patterns.
Here, WL fails to distinguish the non-isomorphic molecular graphs Decalin and Bicyclopentyl, converging to identical colour histograms despite their structural differences. This illustrates a well-known limitation of WL with implications for molecular representation learning.

Weisfeiler-Leman Test. The Weisfeiler-Leman Test (WL) is an algorithm for testing whether two (attributed) graphs are isomorphic [Weisfeiler and Leman, 1968]. At iteration zero the algorithm assigns a colour $c_i^{(0)} \in C$ from a countable space of colours C to each node i. Nodes are coloured the same if their features are the same; otherwise, they are coloured differently. In subsequent iterations t, WL updates the node colouring by producing a new $c_i^{(t)} \in C$:

$c_i^{(t)} := \mathrm{HASH}\left( c_i^{(t-1)}, \{\!\{ c_j^{(t-1)} \mid j \in \mathcal{N}_i \}\!\} \right)$,   (3.1)

where HASH is an injective map (i.e. a perfect hash map) that assigns a unique colour to each input and $\{\!\{ \cdot \}\!\}$ denotes a multiset, a set that allows for repeated elements. The test terminates when the partition of the nodes induced by the colours becomes stable. Given two graphs G and H, if there exists some iteration t for which $\{\!\{ c_i^{(t)} \mid i \in V(G) \}\!\} \neq \{\!\{ c_i^{(t)} \mid i \in V(H) \}\!\}$, then the graphs are not isomorphic. Otherwise, once the colouring has stabilised (the number of colours in iterations t and t−1 is the same), the WL test is inconclusive and we say it cannot distinguish the two graphs.

Theoretical Limits of WL and GNN. WL has several well-known failure cases, such as not being able to distinguish any two regular graphs with the same number of nodes and degree, or failing to tell apart two equilateral triangles from a regular hexagon. At the same time, WL is considered powerful enough for most practical graph classification scenarios [Morris et al., 2021]. Results from Babai et al. [1980] can be used to show that the probability of WL identifying a graph drawn randomly from the class of all n-node graphs goes to 1 as n tends to infinity.
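The Decalin and Bicyclopentyl failure case can be reproduced in a few lines. The sketch below is a minimal illustration rather than the code accompanying this chapter. It assumes unlabelled carbon skeletons (every node starts with the same colour) and uses nested tuples as a stand-in for the injective HASH of Equation 3.1; the helper names `wl_histograms` and `ring` and the node numbering are hypothetical choices made for this example.

```python
from collections import Counter


def wl_histograms(edges, num_nodes, num_iters=3):
    """Run `num_iters` rounds of WL colour refinement; return one colour histogram per round."""
    neighbours = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    colours = [0] * num_nodes  # all nodes are carbons, so they share the initial colour
    histograms = []
    for _ in range(num_iters):
        # Stand-in for HASH: the pair (own colour, sorted multiset of neighbour colours).
        # Nested tuples are injective by construction and comparable across graphs.
        colours = [
            (colours[i], tuple(sorted(colours[j] for j in neighbours[i])))
            for i in range(num_nodes)
        ]
        histograms.append(Counter(colours))
    return histograms


def ring(nodes):
    """Edges of a simple cycle through `nodes`, in the given order."""
    return [(nodes[k], nodes[(k + 1) % len(nodes)]) for k in range(len(nodes))]


# Decalin skeleton: a ten-membered cycle plus the fusion bond 0-5 (two fused six-rings).
decalin = ring(list(range(10))) + [(0, 5)]
# Bicyclopentyl skeleton: two five-rings joined by the single bond 0-5.
bicyclopentyl = ring([0, 1, 2, 3, 4]) + ring([5, 6, 7, 8, 9]) + [(0, 5)]
# Control: cyclodecane, a plain ten-membered ring.
cyclodecane = ring(list(range(10)))

h_decalin = wl_histograms(decalin, 10)
h_bicyclopentyl = wl_histograms(bicyclopentyl, 10)
h_cyclodecane = wl_histograms(cyclodecane, 10)

# Same colour histogram in every round: WL is inconclusive for this pair.
print(all(a == b for a, b in zip(h_decalin, h_bicyclopentyl)))
# The control differs from Decalin already in the first round (different degrees).
print(h_decalin[0] != h_cyclodecane[0])
```

Both comparisons print `True`: the two bicyclic skeletons share their colour histograms in every round even though one contains six-membered rings and the other five-membered rings, while the control graph is separated immediately. Nested tuples grow quickly with the number of rounds, so this stand-in for HASH is only practical for small graphs and a few iterations.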
The graph isomorphism problem and WL have become a powerful tool for characterising the theoretical limits of GNNs [Jegelka, 2022]. Xu et al. [2019] and Morris et al. [2019] showed that message passing GNNs are at most as powerful as WL at distinguishing non-isomorphic graphs, i.e. the expressive power of GNNs is upper-bounded by WL. GNNs can have the same expressive power as WL if their aggregate, update, and readout are injective functions over multisets. The WL framework has since become a major driver of progress in designing more expressive GNNs [Chen et al., 2019, Maron et al., 2019, Dwivedi et al., 2023, Bodnar et al., 2021b,a]. Notably, GNNs can exceed the expressive power of WL when nodes have unique identifiers, such as random node features or positional encodings, that distinguish otherwise equivalent nodes [Loukas, 2020, Sato et al., 2021].

Towards Geometric Graph Isomorphism. WL does not directly apply to geometric graphs, as they exhibit a stronger notion of geometric isomorphism that accounts for spatial symmetries. Unlike standard graphs where node features are fixed, geometric attributes such as 3D coordinates transform under rotations, reflections, and translations of the geometric graph. Simply treating these geometric attributes as static node features would violate the fundamental symmetries that define molecular systems, making theoretical results associated with WL and GNNs inapplicable to geometric graphs and Geometric GNNs. In the following sections, we introduce the Geometric Weisfeiler-Leman (GWL) test that generalises WL to geometric graphs while respecting underlying 3D symmetries. We use GWL to characterise the expressive power of invariant and equivariant GNNs in terms of their ability to distinguish geometric graphs.

Unconstrained GNNs, which do not enforce any geometric symmetries, are an exception to this limitation.
These models treat 3D coordinates as static node features and can thus distinguish any geometric graphs where WL can distinguish the underlying attributed graphs. While the trade-offs between explicitly enforcing symmetries versus learning them from data will be discussed in Chapter 4, our analysis in this chapter focuses on invariant and equivariant GNNs that respect 3D symmetries by design.

3.2 The Geometric Weisfeiler-Leman Framework

Geometric graph isomorphism. Two geometric graphs G and H are geometrically isomorphic if there exists an attributed graph isomorphism b such that the geometric attributes are equivalent, up to global group actions $Q_{\mathfrak{g}}$ with $\mathfrak{g}$ in the group G, and $\vec{t} \in T(d)$:

$\left( s_i^{(G)}, \vec{v}_i^{(G)}, \vec{x}_i^{(G)} \right) = \left( s_{b(i)}^{(H)}, Q_{\mathfrak{g}} \vec{v}_{b(i)}^{(H)}, Q_{\mathfrak{g}} ( \vec{x}_{b(i)}^{(H)} + \vec{t} ) \right)$ for all $i \in V(G)$.   (3.2)

Note that if two geometric graphs are geometrically isomorphic, they are also isomorphic as attributed graphs. However, the converse is not true.

Geometric graph isomorphism and distinguishing (sub-)graph geometries have important practical implications for molecular representation learning. For example, an ideal architecture should map distinct local structural environments around atoms to distinct embeddings in representation space [Bartók et al., 2013, Pozdnyakov et al., 2020].

Assumptions. Analogous to the WL test, we assume for the rest of this section that all geometric graphs we want to distinguish are finite in size and come from a countable dataset. In other words, the geometric and scalar features the nodes are equipped with come from countable subsets $C \subset \mathbb{R}^d$ and $C' \subset \mathbb{R}$, respectively. As a result, when we require functions to be injective, we require them to be injective over the countable set of G-orbits that are obtained by acting with G on the dataset.

Intuition. For an intuition of how to generalise WL to geometric graphs, we note that WL uses a local, node-centric procedure to update the colour of each node i using the colours of its 1-hop neighbourhood $\mathcal{N}_i$.
In the geometric setting, $\mathcal{N}_i$ is an attributed point cloud around the central node i. As a result, each neighbourhood carries two types of information: (1) neighbourhood type (invariant to G) and (2) neighbourhood geometric orientation (equivariant to G). This leads to constraints on the generalisation of the neighbourhood aggregation procedure of Equation 3.1. From an axiomatic point of view, our generalisation of the WL aggregation procedure must meet two properties:

Property 1: Orbit injectivity of colours. If two neighbourhoods are the same up to an action of G (e.g. rotation), then the colours of the corresponding central nodes should be the same. Thus, the colouring must be G-orbit injective (which also makes it G-invariant) over the countable set of all orbits of neighbourhoods in our dataset.

Property 2: Preservation of local geometry. A key property of WL is that the aggregation is injective. A G-invariant colouring procedure that purely satisfies Property 1 is not sufficient because, by definition, it loses spatial properties of each neighbourhood such as the relative pose or orientation [Hinton et al., 2011]. Thus, we must additionally update auxiliary geometric information variables in a way that is G-equivariant and injective.

3.2.1 The Geometric Weisfeiler-Leman Test (GWL)

These intuitions motivate the following definition of the GWL test. At initialisation, we assign to each node $i \in V$ a scalar node colour $c_i \in C'$ and an auxiliary object $g_i$ containing the geometric information associated to it:

$c_i^{(0)} := \mathrm{HASH}(s_i), \quad g_i^{(0)} := \left( c_i^{(0)}, \vec{v}_i \right)$,   (3.3)

where HASH denotes an injective map over the scalar attributes $s_i$ of node i, e.g. the HASH of standard WL. To define the inductive step, assume we have the colours of the nodes and the associated geometric objects at iteration t−1. Then, we can aggregate the geometric information around node i into a new object as follows:

$g_i^{(t)} := \left( (c_i^{(t-1)}, g_i^{(t-1)}), \{\!\{ (c_j^{(t-1)}, g_j^{(t-1)}, \vec{x}_{ij}) \mid j \in \mathcal{N}_i \}\!\} \right)$.   (3.4)

Here $\{\!\{ \cdot \}\!\}$ denotes a multiset, a set in which elements may occur more than once. Importantly, a group element $\mathfrak{g}$ of G can act on the geometric objects above inductively by acting on the geometric information inside them. This amounts to rotating (or reflecting) the entire t-hop neighbourhood contained inside:

$\mathfrak{g} \cdot g_i^{(0)} := \left( c_i^{(0)}, Q_{\mathfrak{g}} \vec{v}_i \right)$,
$\mathfrak{g} \cdot g_i^{(t)} := \left( (c_i^{(t-1)}, \mathfrak{g} \cdot g_i^{(t-1)}), \{\!\{ (c_j^{(t-1)}, \mathfrak{g} \cdot g_j^{(t-1)}, Q_{\mathfrak{g}} \vec{x}_{ij}) \mid j \in \mathcal{N}_i \}\!\} \right)$.   (3.5)

Clearly, the aggregation building $g_i$ for any time-step t is injective and G-equivariant. Finally, we can compute the node colours at iteration t for all $i \in V$ by aggregating the geometric information in the neighbourhood around node i:

$c_i^{(t)} := \mathrm{I\text{-}HASH}^{(t)}\left( g_i^{(t)} \right)$,   (3.6)

by using a G-orbit injective and G-invariant function that we denote by I-HASH. That is, for any geometric objects $g, g'$, $\mathrm{I\text{-}HASH}(g) = \mathrm{I\text{-}HASH}(g')$ if and only if there exists a group element $\mathfrak{g}$ of G such that $g = \mathfrak{g} \cdot g'$. Note that I-HASH is an idealised G-orbit injective function, similar to the HASH function used in WL, which is not necessarily continuous.

Figure 3.4: Geometric Weisfeiler-Leman Test. GWL distinguishes non-isomorphic geometric graphs G1 and G2 by injectively assigning colours to distinct neighbourhood patterns, up to global symmetries. Each iteration expands the neighbourhood from which geometric information can be gathered (shaded for node i). Example inspired by Schütt et al. [2021], G = O(d).

Figure 3.5: Invariant GWL Test. IGWL cannot distinguish G1 and G2 as they are 1-hop identical: the G-orbit of the 1-hop neighbourhood around each node is the same, and IGWL cannot propagate geometric orientation information beyond 1-hop.

Overview of GWL. With each iteration, $g_i^{(t)}$ aggregates geometric information in progressively larger t-hop subgraph neighbourhoods $\mathcal{N}_i^{(t)}$ around the node i. The node colours summarise the structure of these t-hops via the G-invariant aggregation performed by I-HASH.
The procedure terminates when the partitions of the nodes induced by the colours do not change from the previous iteration. Finally, given two geometric graphs G and H, if there exists some iteration t for which $\{\!\{ c_i^{(t)} \mid i \in V(G) \}\!\} \neq \{\!\{ c_i^{(t)} \mid i \in V(H) \}\!\}$, then GWL deems the two graphs geometrically non-isomorphic. Otherwise, GWL cannot distinguish them.

Invariant GWL. Since we are interested in understanding the role of G-equivariance, we also consider a more restrictive Invariant GWL (IGWL) that only updates node colours using the G-orbit injective I-HASH function and does not propagate geometric information:

$c_i^{(t)} := \mathrm{I\text{-}HASH}\left( (c_i^{(t-1)}, \vec{v}_i), \{\!\{ (c_j^{(t-1)}, \vec{v}_j, \vec{x}_{ij}) \mid j \in \mathcal{N}_i \}\!\} \right)$.   (3.7)

Figure 3.6: Geometric Computation Trees for GWL and IGWL. Unlike GWL, geometric orientation information cannot flow from the leaves to the root in IGWL, restricting its expressive power. IGWL cannot distinguish G1 and G2 as all 1-hop neighbourhoods are computationally identical.

IGWL with k-body scalars. In order to further analyse the construction of the node colouring function I-HASH, we consider IGWL(k) based on the maximum number of nodes involved in the computation of G-invariant scalars (also known as the ‘body order’ [Batatia et al., 2022b]):

$c_i^{(t)} := \mathrm{I\text{-}HASH}_{(k)}\left( (c_i^{(t-1)}, \vec{v}_i), \{\!\{ (c_j^{(t-1)}, \vec{v}_j, \vec{x}_{ij}) \mid j \in \mathcal{N}_i \}\!\} \right)$,   (3.8)

and $\mathrm{I\text{-}HASH}_{(k+1)}$ is defined as:

$\mathrm{HASH}\left( \{\!\{ \mathrm{I\text{-}HASH}\left( (c_i^{(t-1)}, \vec{v}_i), \{\!\{ (c_{j_1}^{(t-1)}, \vec{v}_{j_1}, \vec{x}_{ij_1}), \ldots, (c_{j_k}^{(t-1)}, \vec{v}_{j_k}, \vec{x}_{ij_k}) \}\!\} \right) \mid \mathbf{j} \in (\mathcal{N}_i)^k \}\!\} \right)$,   (3.9)

where $\mathbf{j} = [j_1, \ldots, j_k]$ are all possible k-tuples formed of elements of $\mathcal{N}_i$. Therefore, IGWL(k) is now constrained to extract information only from all the possible k-sized tuples of nodes (including the central node) in a neighbourhood. For instance, I-HASH(2) can identify neighbourhoods only up to pairwise distances among the central node and any of its neighbours (i.e. a 2-body scalar), while I-HASH(3) up to distances and angles formed by any two edges (i.e.
a 3-body scalar). Notably, distances and angles alone are incomplete descriptors of local geometry as there exist several counterexamples that require higher body-order invariants to be distinguished [Bartók et al., 2013, Pozdnyakov et al., 2020]. Therefore, I-HASH(k) with lower scalarisation body order k makes the colouring weaker.

3.2.2 Characterising the Expressivity of Geometric GNNs

Following Xu et al. [2019] and Morris et al. [2019], we upper bound the expressive power of Geometric GNNs by the GWL test. We show that any G-equivariant GNN can be at most as powerful as GWL in distinguishing non-isomorphic geometric graphs. Proofs are available in Appendix A.2.

Theorem 1. Any pair of geometric graphs distinguishable by a G-equivariant GNN is also distinguishable by GWL.

With sufficient iterations, the output of G-equivariant GNNs can be equivalent to GWL if certain conditions are met regarding the aggregate, update and readout functions.

Proposition 2. G-equivariant GNNs have the same expressive power as GWL if the following conditions hold: (1) The aggregation AGG is an injective, G-equivariant multiset function. (2) The scalar part of the update $\mathrm{UPD}_s$ is a G-orbit injective, G-invariant multiset function. (3) The vector part of the update $\mathrm{UPD}_v$ is an injective, G-equivariant multiset function. (4) The graph-level readout f is an injective multiset function.

Similar statements can be made for G-invariant GNNs and IGWL. Thus, we can directly transfer theoretical results between GWL/IGWL, which are abstract mathematical tools, and the class of Geometric GNNs upper bounded by the respective tests. This connection has several interesting practical implications, discussed subsequently.

3.3 Understanding the Geometric GNN Design Space

Overview. We demonstrate the utility of the GWL framework for understanding how Geometric GNN design choices [Duval et al., 2023a] influence expressivity: (1) Depth or number of layers; (2) Invariant vs.
equivariant message passing; and (3) Body order of scalarisation. In doing so, we formalise theoretical limitations of current architectures and provide practical implications for designing maximally powerful models. Proofs are available in Appendix A.1.

3.3.1 Role of Depth: Propagating Geometric Information

Let us consider the simplified setting of two geometric graphs $G_1 = (A_1, S_1, \vec{V}_1, \vec{X}_1)$ and $G_2 = (A_2, S_2, \vec{V}_2, \vec{X}_2)$ such that the underlying attributed graphs $(A_1, S_1)$ and $(A_2, S_2)$ are isomorphic. This case frequently occurs in chemistry, where molecules occur in different conformations, but with the same graph topology given by the covalent bonding structure.

Each iteration of GWL aggregates geometric information $g_i^{(k)}$ from progressively larger neighbourhoods $\mathcal{N}_i^{(k)}$ around the node i, and distinguishes (sub-)graphs by comparing G-orbit injective colourings of $g_i^{(k)}$. We say $G_1$ and $G_2$ are k-hop distinct if for all graph isomorphisms b, there is some node $i \in V_1$, $b(i) \in V_2$ such that the corresponding k-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$ are distinct. Otherwise, we say $G_1$ and $G_2$ are k-hop identical if all corresponding k-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$ are the same up to group actions.

We can now formalise what geometric graphs can and cannot be distinguished by GWL and maximally powerful Geometric GNNs, in terms of the number of iterations.

Proposition 3. GWL can distinguish any k-hop distinct geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic, and k iterations are sufficient.

Proposition 4. Up to k iterations, GWL cannot distinguish any k-hop identical geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic.

Additionally, we can state the following results about the more constrained IGWL.

Proposition 5. IGWL can distinguish any 1-hop distinct geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic, and 1 iteration is sufficient.

Proposition 6.
Any number of iterations of IGWL cannot distinguish any 1-hop identical geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic.

An example illustrating Propositions 3 and 6 is shown in Figures 3.4 and 3.5, respectively.

We can now consider the more general case where the underlying attributed graphs for $G_1 = (A_1, S_1, \vec{V}_1, \vec{X}_1)$ and $G_2 = (A_2, S_2, \vec{V}_2, \vec{X}_2)$ are non-isomorphic and constructed from point clouds using radial cutoffs, as is conventional for biochemistry and materials science applications.

Proposition 7. Assuming geometric graphs are constructed from point clouds using radial cutoffs, GWL can distinguish any geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are non-isomorphic. At most $k_{\max}$ iterations are sufficient, where $k_{\max}$ is the maximum graph diameter among $G_1$ and $G_2$.

Proposition 7 shows that GWL can distinguish any non-isomorphic geometric graphs for the practical case where the graph structure is computed using radial cutoffs. However, for the general case where the topology of the geometric graph is independent of the coordinates, GWL may not be theoretically complete as there may exist pathological edge cases where the test fails. For example, when all coordinates and vector features are set equal to zero, GWL coincides with the standard 1-WL. In this edge case, GWL has the same expressive power as 1-WL and inherits all well-known failure cases of 1-WL.

3.3.2 Limitations of Invariant Message Passing: Failure to Capture Global Geometry

Propositions 3 and 6 enable us to compare the expressive powers of GWL and IGWL.

Theorem 8. GWL is strictly more powerful than IGWL.

This statement formalises the advantage of G-equivariant intermediate layers for geometric graphs, as prescribed in the Geometric Deep Learning blueprint [Bronstein et al., 2021], in addition to echoing similar intuitions in the computer vision community. As remarked by Hinton et al.
[2011], translation invariant models do not understand the relationship between the various parts of an image (termed the “Picasso problem”). Similarly, our results point to IGWL and G-invariant GNNs failing to understand how the 1-hop neighbourhoods in a graph are oriented w.r.t. each other. As a result, even the most powerful G-invariant GNNs are restricted in their ability to compute global and non-local geometric properties.

Proposition 9. IGWL and G-invariant GNNs cannot decide several geometric graph properties: (1) perimeter, surface area, and volume of the bounding box/sphere enclosing the geometric graph; (2) distance from the centroid or centre of mass; and (3) dihedral angles.

Finally, we identify a setting where this distinction between the two approaches disappears.

Proposition 10. IGWL has the same expressive power as GWL for fully connected geometric graphs.

Practical Implications. Proposition 9 highlights critical theoretical limitations of G-invariant GNNs. This suggests that G-equivariant GNNs should be preferred when working with large geometric graphs such as macromolecules with thousands of nodes, where message passing is restricted to local radial neighbourhoods around each node. Stacking multiple G-equivariant layers enables the computation of compositional geometric features.

Two straightforward approaches to overcoming the limitations of G-invariant GNNs are: (1) pre-computing non-local geometric properties as input features, e.g. models such as GemNet [Gasteiger et al., 2021] and ComENet [Wang et al., 2022] use two-hop dihedral angles; and (2) working with fully connected geometric graphs, as Proposition 10 suggests that G-equivariant and G-invariant GNNs can be made equally powerful when performing all-to-all message passing.¹ This is supported by the empirical success of recent G-invariant Graph Transformers [Joshi, 2025, Shi et al., 2022] for small molecules with tens of nodes, where working with full graphs is tractable.
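The contrast between sparse and fully connected invariant message passing can be illustrated on a pair of k-chains. The sketch below is an assumption-laden toy, not the Geometric GNN Dojo implementation: the planar coordinates, the helper names `k_chain`, `local_fingerprint` and `igwl_histograms`, and the rounding to six decimals are all choices made for this example. The fingerprint (neighbour colours, distances, and pairwise angles) stands in for I-HASH. It is a complete invariant for neighbourhoods with at most two neighbours, which covers every node of the sparse chain; on the fully connected graph it is only a lower bound on what IGWL could distinguish.

```python
import itertools
import math
from collections import Counter


def k_chain(k, same_side):
    """Planar k-chain: k collinear nodes plus two end points; returns (positions, edges)."""
    y_right = 1.0 if same_side else -1.0
    positions = [(-1.0, 1.0)] + [(float(x), 0.0) for x in range(k)] + [(float(k), y_right)]
    edges = [(i, i + 1) for i in range(k + 1)]  # a simple path through all k + 2 nodes
    return positions, edges


def local_fingerprint(i, positions, neighbours, colours):
    """Rotation/reflection-invariant summary of node i's 1-hop neighbourhood."""
    xi, yi = positions[i]
    rel = [(positions[j][0] - xi, positions[j][1] - yi) for j in neighbours[i]]
    dists = sorted((colours[j], round(math.hypot(*r), 6)) for j, r in zip(neighbours[i], rel))
    cosines = sorted(
        round((a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b)), 6)
        for a, b in itertools.combinations(rel, 2)
    )
    return (colours[i], tuple(dists), tuple(cosines))


def igwl_histograms(positions, edges, num_iters=3, fully_connected=False):
    """IGWL-style refinement: colours are updated from invariant 1-hop fingerprints only."""
    n = len(positions)
    if fully_connected:
        neighbours = [[j for j in range(n) if j != i] for i in range(n)]
    else:
        neighbours = [[] for _ in range(n)]
        for u, v in edges:
            neighbours[u].append(v)
            neighbours[v].append(u)
    colours = [0] * n
    histograms = []
    for _ in range(num_iters):
        colours = [local_fingerprint(i, positions, neighbours, colours) for i in range(n)]
        histograms.append(Counter(colours))
    return histograms


chain_a = k_chain(4, same_side=True)   # both end points bend to the same side
chain_b = k_chain(4, same_side=False)  # end points bend to opposite sides

sparse_match = igwl_histograms(*chain_a) == igwl_histograms(*chain_b)
full_match = igwl_histograms(*chain_a, num_iters=1, fully_connected=True) == \
    igwl_histograms(*chain_b, num_iters=1, fully_connected=True)
print(sparse_match, full_match)
```

On the sparse chain the histograms agree in every round (`sparse_match` is `True`), because every 1-hop neighbourhood has the same distances and angle in both graphs. With all-to-all neighbourhoods a single round already separates the pair (`full_match` is `False`), since the distance between the two end points differs.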
¹ Subsequent theoretical results have confirmed that for fully connected graphs, higher-order invariant GNNs (specifically those equivalent to 2-WL) are geometrically complete [Delle Rose et al., 2023]. However, this requires global all-to-all communication, reinforcing the advantage of equivariant models for scalable, sparse message passing.

3.3.3 Role of Scalarisation Body Order: Identifying Neighbourhood G-orbits

At each iteration of GWL and IGWL, the I-HASH function assigns a G-invariant colouring to distinct geometric neighbourhood patterns. I-HASH is an idealised G-orbit injective function which is not necessarily continuous. In Geometric GNNs, this corresponds to scalarising local geometric information when updating the scalar features. We can analyse the construction of the I-HASH function and the scalarisation step in Geometric GNNs via the k-body variations IGWL(k), described in Section 3.2. In doing so, we will make connections between IGWL and WL for non-geometric graphs.

Firstly, we formalise the relationship between the injectivity of I-HASH(k) and the maximum cardinality of local neighbourhoods in a given dataset.

Proposition 11. I-HASH(m) is G-orbit injective for $m = \max(\{ |\mathcal{N}_i| \mid i \in V \})$, the maximum cardinality of all local neighbourhoods $\mathcal{N}_i$ in a given dataset.

Practical Implications. While building provably injective I-HASH(k) functions may require intractably high k, the hierarchy of IGWL(k) tests enables us to study the expressive power of practical G-invariant aggregators used in current Geometric GNN layers, e.g. SchNet [Schütt et al., 2018], E-GNN [Satorras et al., 2021], and TFN [Thomas et al., 2018] use distances, while DimeNet [Gasteiger et al., 2020] uses distances and angles. Notably, MACE [Batatia et al., 2022b] constructs a complete basis of scalars up to arbitrary body order k via Atomic Cluster Expansion [Dusson et al., 2019], which can be G-orbit injective if the conditions in Proposition 11 are met.
We can state the following about the IGWL(k) hierarchy and the corresponding GNNs.

Proposition 12. IGWL(k) is at least as powerful as IGWL(k−1). For k ≤ 5, IGWL(k) is strictly more powerful than IGWL(k−1).

Finally, we show that IGWL(2) is equivalent to WL when all the pairwise distances are the same. A similar observation was recently made by Pozdnyakov and Ceriotti [2022].

Proposition 13. Let $G_1 = (A_1, S_1, \vec{X}_1)$ and $G_2 = (A_2, S_2, \vec{X}_2)$ be two geometric graphs with the property that all edges have equal length. Then, IGWL(2) distinguishes the two graphs if and only if WL can distinguish the attributed graphs $(A_1, S_1)$ and $(A_2, S_2)$.

This equivalence points to limitations of distance-based G-invariant models like SchNet [Schütt et al., 2018]. These models suffer from all well-known failure cases of WL, e.g. they cannot distinguish two equilateral triangles from the regular hexagon [Gasteiger et al., 2020].

3.4 Synthetic Experiments on Expressivity

Overview. We perform three simple synthetic experiments to supplement our theoretical results and highlight the practical challenges in building maximally powerful Geometric GNNs, such as oversmoothing and oversquashing with increased depth, as well as the need for higher order tensors in G-equivariant GNNs.

Setup and Hyperparameters. We experiment with the following models: (1) SchNet [Schütt et al., 2018] and DimeNet [Gasteiger et al., 2020] as representative G-invariant GNNs; (2) E-GNN [Satorras et al., 2021] and GVP-GNN [Jing et al., 2020] as representative G-equivariant GNNs which use Cartesian vectors; and (3) TFN [Thomas et al., 2018] and MACE [Batatia et al., 2022b] to study higher order G-equivariant GNNs using spherical tensors. For SchNet and DimeNet, we use the implementation from PyTorch Geometric [Fey and Lenssen, 2019a]. For E-GNN, GVP-GNN, and MACE, we adapt implementations from the respective authors.
Our TFN implementation is based on e3nn [Geiger and Smidt, 2022], and we also re-implement MACE by incorporating the EquivariantProductBasisBlock from its authors into our TFN layer. We set scalar feature channels to 128 for SchNet, DimeNet, and E-GNN. We set scalar/vector/tensor feature channels to 64 for GVP-GNN, TFN, and MACE. TFN and MACE use order L = 2 tensors by default. MACE uses local body order 4 by default. We train all models for 100 epochs using the Adam optimiser, with an initial learning rate of 1e-4, which we reduce by a factor of 0.9 with a patience of 25 epochs when performance plateaus. All results are averaged across 10 random seeds.

(k = 4-chains) Number of layers

| Type | GNN Layer | ⌊k/2⌋ | ⌊k/2⌋+1 = 3 | ⌊k/2⌋+2 | ⌊k/2⌋+3 | ⌊k/2⌋+4 |
| Equiv. | GWL | 50% | 100% | 100% | 100% | 100% |
| Equiv. | E-GNN | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 100.0 ± 0.0 |
| Equiv. | GVP-GNN | 50.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |
| Equiv. | TFN | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 80.0 ± 24.5 | 85.0 ± 22.9 |
| Equiv. | MACE | 50.0 ± 0.0 | 90.0 ± 20.0 | 90.0 ± 20.0 | 95.0 ± 15.0 | 95.0 ± 15.0 |
| Inv. | IGWL | 50% | 50% | 50% | 50% | 50% |
| Inv. | SchNet | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Inv. | DimeNet | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Inv. | SphereNet | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Inv. | SchNet (full graph) | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |
| Inv. | SchNet (global feat) | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |

Table 3.1: k-chain geometric graphs. k-chains are (⌊k/2⌋+1)-hop distinguishable and (⌊k/2⌋+1) GWL iterations are theoretically sufficient to distinguish them. We train Geometric GNNs with an increasing number of layers to distinguish k = 4-chains. G-equivariant GNNs may require more iterations than prescribed by GWL, pointing to preliminary evidence of oversquashing when geometric information is propagated across multiple layers using fixed dimensional feature spaces. IGWL and G-invariant GNNs are unable to distinguish k-chains for any k ≥ 2 and G = O(3).
G-invariant GNNs with precomputed non-local features (volume of bounding box) or message passing on fully connected graphs can trivially solve the task. Anomalous results are marked in red and expected results in green.

Tasks. We design three synthetic experiments to highlight practical challenges in building expressive Geometric GNNs, summarised below and described in detail subsequently.

• Distinguishing k-chains, which test a model’s ability to propagate geometric information non-locally and demonstrate geometric oversquashing with increased depth.
• Rotationally symmetric structures, which test a layer’s ability to identify neighbourhood orientation and highlight the utility of higher order tensors in G-equivariant GNNs.
• Counterexamples from Pozdnyakov et al. [2020], which test a layer’s ability to create distinguishing fingerprints for local neighbourhoods and highlight the need for higher body order of scalarisation.

3.4.1 Depth, Non-local Geometric Properties, and Oversquashing

GWL assumes perfect propagation of G-equivariant geometric information at each iteration, which implies that the test can be run for any number of iterations without loss of information. In Geometric GNNs, G-equivariant information is propagated via summing features from multiple layers in fixed dimensional spaces, which may lead to distortion or loss of information from distant nodes.

Experiment. To study the practical implications of depth in propagating geometric information beyond local neighbourhoods, we consider k-chain geometric graphs which generalise the examples from Schütt et al. [2021]. Each pair of k-chains consists of k + 2 nodes with k nodes arranged in a line and differentiated by the orientation of the 2 end points. Thus, k-chain graphs are (⌊k/2⌋+1)-hop distinguishable, and (⌊k/2⌋+1) GWL iterations are theoretically sufficient to distinguish them. In Table 3.1, we train G-equivariant and G-invariant GNNs with an increasing number of layers to distinguish k-chains.

Rotational symmetry

| Type | GNN Layer | 2 fold | 3 fold | 5 fold | 10 fold |
| Cart. | E-GNN (L=1) | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Cart. | GVP-GNN (L=1) | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Spherical | TFN/MACE (L=1) | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Spherical | TFN/MACE (L=2) | 100.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Spherical | TFN/MACE (L=3) | 100.0 ± 0.0 | 100.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Spherical | TFN/MACE (L=5) | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 50.0 ± 0.0 |
| Spherical | TFN/MACE (L=10) | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |

Table 3.2: Rotationally symmetric structures. We train single layer G-equivariant GNNs to distinguish two distinct rotated versions of each L-fold symmetric structure. We find that layers using order L tensors are unable to identify the orientation of structures with rotation symmetry higher than L-fold. This issue is particularly prevalent for layers using Cartesian vectors (tensor order 1).

Results. Despite the supposed simplicity of the task, we find that popular G-equivariant GNNs such as E-GNN [Satorras et al., 2021] and TFN [Thomas et al., 2018] may require more iterations than prescribed by GWL. Notably, for chains larger than k = 4, all G-equivariant GNNs tended to require more than (⌊k/2⌋+1) iterations to solve the task. Additionally, IGWL and G-invariant GNNs are unable to distinguish k-chains. Table 3.1 points to preliminary evidence of the oversquashing phenomenon [Alon and Yahav, 2021, Topping et al., 2022] for equivariant features in Geometric GNNs. The issue is most evident for E-GNN, which uses a single vector feature to aggregate and propagate geometric information.
This may have implications in modelling macromolecules where long-range interactions often play important roles.

3.4.2 Higher Order Tensors and Rotationally Symmetric Structures

In addition to perfect propagation, GWL is also able to injectively aggregate G-equivariant information by making use of an auxiliary nested geometric object $g_i$. On the other hand, G-equivariant GNNs aggregate geometric information via summing neighbourhood features represented by Cartesian vectors (tensor order 1) or higher order spherical tensors. This choice of basis often comes with trade-offs between computational tractability and empirical performance.

Experiment. To demonstrate the utility of higher order tensors in G-equivariant GNNs, we study how rotational symmetries interact with tensor order. In Table 3.2, we evaluate current G-equivariant layers on their ability to distinguish the orientation of structures with rotational symmetry. An L-fold symmetric structure does not change when rotated by an angle $2\pi / L$ around a point (in 2D) or axis (3D). We consider two distinct rotated versions of each symmetric structure and train single layer G-equivariant GNNs to classify the two orientations using the updated equivariant features.

Results. We find that layers using order L spherical tensors are unable to identify the orientation of structures with rotation symmetry higher than L-fold, i.e. two distinct rotated versions of the input have the same equivariant features. We attribute this observation to spherical harmonics, which serve as an orthonormal basis for spherical tensor features and exhibit rotational symmetry themselves. Similar to the Fourier expansion for signals, the spherical harmonic expansion is employed for converting Cartesian vectors to spherical signals in G-equivariant GNNs. The tensor order of the spherical harmonic bases determines the rate of oscillation of the approximated function on the sphere.
We believe that this oscillation rate is closely linked to the rotational fold of a set of symmetric vectors. In the Fourier expansion, it is not feasible to accurately approximate a high-frequency function solely using low-frequency sinusoidal waves. Similarly, when truncating the spherical harmonic expansion to an order lower than the fold of the rotational symmetry, the rotationally symmetric vectors act as a higher frequency function. Consequently, the lower frequency bases cannot preserve the orientation of these vectors.

Layers such as E-GNN [Satorras et al., 2021] and GVP-GNN [Jing et al., 2020] using Cartesian vectors are popular as higher order tensors can be computationally intractable for many applications. However, E-GNN and GVP-GNN are particularly poor at discriminating the orientation of rotationally symmetric structures. This may have implications for the modelling of periodic materials which naturally exhibit such symmetries [Levine and Steinhardt, 1984].

3.4.3 Body Order of Scalarisation and Neighbourhood Fingerprints

GWL uses a node colouring function I-HASH for distinguishing G-orbits of neighbourhoods, i.e. a neighbourhood fingerprint. In Geometric GNNs, this corresponds to a scalarisation step where local geometric information from subsets of neighbours is aggregated to compute G-invariant scalars; the size of these subsets is termed the body order.

Experiment To demonstrate the practical implications of scalarisation body order, we evaluate current Geometric GNN layers on their ability to discriminate counterexamples from Pozdnyakov et al. [2020]. Each counterexample consists of a pair of local neighbourhoods that are indistinguishable when comparing their set of k-body scalars, i.e. I-HASH(k) and Geometric GNN layers with body order k cannot distinguish the neighbourhoods. The 3-body counterexample corresponds to Fig.1(b) in Pozdnyakov et al. [2020], 4-body chiral to Fig.2(e), and 4-body non-chiral to Fig.2(f); the 2-body counterexample is based on the two local neighbourhoods in our running example.

Table 3.3: Counterexamples from Pozdnyakov et al. [2020]. This task evaluates single layer Geometric GNNs at distinguishing counterexample structures that are indistinguishable using k-body scalarisation. Geometric GNN layers with body order k cannot distinguish the corresponding counterexample. The 3-body counterexample is from Fig.1(b) [Pozdnyakov et al., 2020], 4-body is from Fig.2(f) [Pozdnyakov et al., 2020], and 2-body is based on the two local neighbourhoods in our running example.

| | GNN Layer (body order) | 2-body | 3-body (Fig.1(b)) | 4-body (Fig.2(f)) |
|---|---|---|---|---|
| Inv. | SchNet (2-body) | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Inv. | DimeNet (3-body) | 100.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Inv. | SphereNet (4-body) | 100.0 ± 0.0 | 100.0 ± 0.0 | 50.0 ± 0.0 |
| Equiv. | E-GNN (2-body) | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Equiv. | GVP-GNN (3-body) | 100.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Equiv. | TFN (2-body) | 50.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Equiv. | MACE (3-body) | 100.0 ± 0.0 | 50.0 ± 0.0 | 50.0 ± 0.0 |
| Equiv. | MACE (4-body) | 100.0 ± 0.0 | 100.0 ± 0.0 | 50.0 ± 0.0 |
| Equiv. | MACE (5-body) | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |

Results In Table 3.3, we train single layer Geometric GNNs to distinguish the counterexamples using updated scalar features. Unsurprisingly, we find that most layers computing 2- or 3-body scalarisations fail the task. Notably, training higher body order MACE layers to distinguish the chiral and non-chiral 4-body counterexamples should be possible in theory, but proved challenging in practice. This highlights the difficulty of designing as well as optimising continuous, high body order neighbourhood fingerprints.
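To make the notion of body order concrete, the following sketch compares 2-body scalars (distances to the central node) and 3-body scalars (angles between pairs of neighbours) for two toy neighbourhoods. The neighbourhoods `bent` and `straight` are hypothetical examples chosen for illustration, not the counterexamples of Pozdnyakov et al. [2020] or the running example, and the function names are assumptions.

```python
import numpy as np


def two_body_scalars(neighbours):
    """Sorted distances from the central node (at the origin) to each neighbour."""
    return np.sort(np.linalg.norm(neighbours, axis=1))


def three_body_scalars(neighbours):
    """Sorted angles (radians) between every pair of neighbours, seen from the centre."""
    unit = neighbours / np.linalg.norm(neighbours, axis=1, keepdims=True)
    cosines = np.clip(unit @ unit.T, -1.0, 1.0)
    rows, cols = np.triu_indices(len(neighbours), k=1)
    return np.sort(np.arccos(cosines[rows, cols]))


# Hypothetical toy neighbourhoods: two neighbours at unit distance from the centre.
bent = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])       # 90 degree angle
straight = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])  # 180 degree angle

same_2body = np.allclose(two_body_scalars(bent), two_body_scalars(straight))
same_3body = np.allclose(three_body_scalars(bent), three_body_scalars(straight))

if __name__ == "__main__":
    print("identical 2-body fingerprint:", same_2body)
    print("identical 3-body fingerprint:", same_3body)
```

The two neighbourhoods share the same multiset of distances, so a fingerprint built only from 2-body scalars cannot separate them, whereas the angle-based 3-body scalars differ.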
3.5 Experiments on Protein Representation Learning

Overview Having introduced a unified theoretical framework for characterising the expressive power of Geometric GNNs, we now turn to evaluating their practical performance on protein property prediction tasks. We leverage the ProteinWorkshop benchmark suite [Jamasb et al., 2024] to systematically compare several classes of Geometric GNNs across protein function and structure annotation datasets. Our evaluation encompasses general-purpose G-invariant [Schütt et al., 2018] and G-equivariant architectures [Satorras et al., 2021, Thomas et al., 2018, Batatia et al., 2022b], alongside bespoke models specifically designed for protein structures [Morehead and Cheng, 2024, Zhang et al., 2023b]. Through rigorous benchmarking, these experiments aim to bridge the gap between theoretical expressivity and practical performance in real-world tasks.

Figure 3.7: Overview of ProteinWorkshop, a comprehensive benchmarking suite for evaluating protein structure representation learning. The framework encompasses: (1) Large-scale ML-ready datasets containing experimental structures from the PDB as well as predicted structures from AlphaFoldDB and ESM Atlas; (2) Unified implementations of diverse Geometric GNN architectures and flexible featurisation schemes; (3) Real-world prediction tasks including both node-level and graph-level tasks; and (4) Rigorous evaluation protocols with standardised splits, metrics, and pretraining procedures to enable fair comparison and reproducible progress tracking in protein representation learning.

Broader motivation behind ProteinWorkshop Recent advances in protein structure prediction have led to the availability of large-scale structural data [Jumper et al., 2021, Baek et al., 2021]. However, the mere availability of structural data does not guarantee progress in our understanding of the relationship between protein sequence, structure, and function [Varadi et al., 2021].
To address this gap, we need computational methods that can learn meaningful representations of protein structures for functional annotation.

Geometric GNNs have emerged as the architecture of choice for learning structural representations of biomolecules [Schütt et al., 2018, Gasteiger et al., 2020, Jing et al., 2020, Schütt et al., 2021, Morehead et al., 2022, Zhang et al., 2023b]. Previous works have primarily focused on learning effective global (i.e. graph-level) representations of protein structure, typically evaluating methods on function or fold classification tasks [Gligorijević et al., 2021, Zhang et al., 2023b]. In contrast, there has been comparatively little investigation into the ability of different methods to learn informative local (node-level) representations. Such local representations are crucial for a variety of annotation tasks, including binding or interaction site prediction [Gainza et al., 2020], and for providing conditioning signals in structure-conditioned molecule design methods [Schneuing et al., 2024, Corso et al., 2023]. Understanding the structure-function relationship at this granular level can also drive progress in protein design by revealing structural motifs that underlie desirable properties, enabling their incorporation into new designs. To enable systematic evaluation of both global and local representational capabilities, we developed ProteinWorkshop [Jamasb et al., 2024], a robust, standardised benchmark for evaluating protein representation learning methods. An overview of the benchmark is shown in Figure 3.7.

Table 3.4: Protein property prediction datasets. We benchmark Geometric GNNs on a variety of protein function prediction and structural-annotation tasks, including both node-level and graph-level tasks.

| Level | Task | Dataset Origin | Structures | # Train | # Validation | # Test | Metric |
|---|---|---|---|---|---|---|---|
| Graph-level | Fold Prediction | Hou et al. [2017] | Experimental | 12.3 K | 0.7 K | 1.3/0.7/1.3 K | Accuracy |
| Graph-level | Gene Ontology Prediction | Gligorijević et al. [2021] | Experimental | 27.5 K | 3.1 K | 3.0 K | Fmax |
| Graph-level | Reaction Class Prediction | Hermosilla et al. [2020] | Experimental | 29.2 K | 2.6 K | 5.6 K | Accuracy |
| Graph-level | Antibody Dev. Prediction | Huang et al. [2021] | Experimental | 1.7 K | 0.24 K | 0.48 K | AUPRC |
| Node-level | Inverse Folding | Ingraham et al. [2019a] | Experimental | 3.9 M | 105 K | 180 K | Perplexity |
| Node-level | PPI Site Prediction | Gainza et al. [2020] | Experimental | 478 K | 53 K | 117 K | AUPRC |

3.5.1 Experimental Setup

Benchmark datasets We benchmark Geometric GNNs on a diverse collection of protein property prediction tasks that span both node-level and graph-level prediction. Our evaluation framework, summarised in Table 3.4, encompasses datasets curated from the literature and existing benchmarks to systematically assess both global and local representational capabilities of different architectures (footnote 2). For node-level tasks, we evaluate models on their ability to learn informative residue-level representations through inverse folding [Ingraham et al., 2019a], and protein-protein interaction site prediction [Gainza et al., 2020]. For graph-level tasks, we assess global structural representation learning through fold prediction [Hou et al., 2017], gene ontology prediction [Gligorijević et al., 2021], reaction class prediction [Hermosilla et al., 2020], and antibody development prediction [Huang et al., 2021]. See Jamasb et al. [2024] for detailed descriptions of the datasets and tasks.

Geometric GNN models We provide a unified implementation of several rotation invariant and equivariant Geometric GNNs, spanning the range of message passing body order and tensor order. We benchmark 4 general purpose models: SchNet [Schütt et al., 2018], EGNN [Satorras et al., 2021], TFN [Thomas et al., 2018], MACE [Batatia et al., 2022b]; and 2 protein-specific architectures: GCPNet [Morehead and Cheng, 2024], GearNet [Zhang et al., 2023b].
As input to the models, we convert protein structures to geometric graphs with Cα atoms as nodes with additional features including the residue type, positional encoding, virtual torsion and bond angles as well as backbone torsion angles along the protein chain, as summarised in Table 3.5. Edges are constructed using a k-nearest neighbours graph construction based on the Cα atom positions, with k = 16.

Footnote 2: To retain focus on protein representation learning, we deliberately exclude commonly-used tasks based on protein-small molecule interactions as it is hard to disentangle the effect of the small molecule representation and the potential for bias [Boyles et al., 2019].

Table 3.5: Structural featurisation schemes. Residue type is a one-hot encoding of the amino acid type for each node; positional encoding is a 16-dimensional sinusoidal encoding [Vaswani et al., 2017b]; and ϕ, ψ, ω ∈ ℝ^6 and χ1–4 ∈ ℝ^8 are backbone dihedral angles and sidechain torsion angles, respectively, embedded on the unit circle. Similarly, κ, α ∈ ℝ^4 are virtual torsion and bond angles defined over Cα atoms.

| Granularity | Cα Features | Backbone | Sidechain |
|---|---|---|---|
| Cα | Residue Type | | |
| Cα | Residue Type, Positional Encoding | | |
| Cα | Residue Type, Positional Encoding, κ, α | | |
| Cα | Residue Type, Positional Encoding, κ, α | ϕ, ψ, ω | |
| Cα | Residue Type, Positional Encoding, κ, α | ϕ, ψ, ω | χ1, χ2, χ3, χ4 |

Additionally, we benchmark against ESM-2 [Lin et al., 2023], a state-of-the-art protein language model that learns representations from sequence alone without explicit geometric inductive biases. ESM-2 employs a standard Transformer architecture and is pre-trained on large-scale protein sequence data, enabling it to implicitly capture structural and functional patterns. We use the 650 million parameter version as a frozen feature extractor to generate per-residue embeddings, which we optionally augment with the geometric features described above. A simple MLP is then trained on these combined features for each downstream task.
This comparison allows us to quantify the benefit of explicit structural modelling in Geometric GNNs versus the implicit structural information learned by sequence-based ESM-2 through large-scale pre-training.

Hyperparameters We use consistent experimental settings across all models, with six layers and 512 hidden channels as our default configuration. For tensor-based equivariant GNNs, we reduced the number of layers and hidden channels to fit 80GB of GPU memory on one NVIDIA A100 GPU. All models are trained using the Adam optimizer with a batch size of 32. Learning rates are selected via grid search over {10^-5, 10^-4, 3 × 10^-4, 10^-3}, and we employ early stopping with a patience of 10 epochs to prevent overfitting. Training continues until convergence or a maximum of 24 hours on a single A100 GPU. We train each model with all featurisation schemes described in Table 3.5 and report the best performance across all schemes. To ensure statistical reliability, results are averaged over three random seeds.

3.5.2 Results

Table 3.6: Geometric GNN performance on protein property prediction tasks. We report the best performance across all featurisation schemes for each model, averaged over three random seeds. The best and second best performing models for each task are highlighted in bold and underline, respectively.

| | Model | Gene Ontology, Fmax (↑) | Antibody Dev., AUPRC (↑) | Fold, Accuracy (↑) | Reaction, Accuracy (↑) | PPI Site, AUPRC (↑) | Inverse Folding, Perplexity (↓) |
|---|---|---|---|---|---|---|---|
| Inv. | SchNet | 0.429 ± 0.00 | 0.896 ± 0.00 | 31.98 ± 0.01 | 73.83 ± 0.02 | 0.955 ± 0.00 | 9.97 ± 0.09 |
| Inv. | GearNet | 0.453 ± 0.00 | 0.837 ± 0.01 | 34.63 ± 0.01 | 80.03 ± 0.01 | 0.962 ± 0.00 | 11.23 ± 0.09 |
| Equiv. | E(n) GNN | 0.455 ± 0.01 | 0.927 ± 0.01 | 41.48 ± 0.02 | 82.70 ± 0.00 | 0.965 ± 0.00 | 8.89 ± 0.04 |
| Equiv. | GCPNet | 0.442 ± 0.01 | 0.881 ± 0.02 | 38.86 ± 0.02 | 77.71 ± 0.01 | 0.968 ± 0.00 | 7.56 ± 0.11 |
| Equiv. | TFN | 0.452 ± 0.00 | 0.923 ± 0.01 | 36.65 ± 0.01 | 81.22 ± 0.01 | 0.967 ± 0.00 | 8.73 ± 0.02 |
| Equiv. | MACE | 0.411 ± 0.01 | 0.918 ± 0.00 | 35.68 ± 0.03 | 76.34 ± 0.01 | 0.965 ± 0.00 | 8.94 ± 0.03 |
| LM | ESM-2 | 0.545 ± 0.00 | 0.885 ± 0.00 | 34.59 ± 0.00 | 82.11 ± 0.00 | 0.956 ± 0.00 | - |

Several clear patterns emerge from the results in Table 3.6:

Equivariant models outperform invariant models Across all tasks, equivariant GNNs consistently demonstrate superior performance compared to their invariant counterparts. This empirically validates the theoretical advantage of equivariant message passing established by the GWL framework. E(n) GNN achieves the best performance on three tasks (Antibody Development, Fold Classification, and Reaction Class Prediction). TFN and MACE, which are equivariant GNNs utilizing higher-order tensor representations, also show strong performance across multiple tasks. TFN appears among the top three models on five of the six tasks, empirically supporting the role of higher-order equivariant features in capturing complex geometric relationships.

Protein-specific architectures GCPNet, designed specifically for protein structures, excels at node-level tasks requiring fine-grained structural understanding (PPI Site Prediction and Inverse Folding). Similarly, GearNet, a protein-specific invariant model, tends to outperform the general-purpose invariant SchNet. This suggests that incorporating domain-specific structural priors into architecture design can be beneficial, particularly for tasks demanding local geometric precision.

Language models excel at functional tasks Despite lacking explicit structural inductive biases, ESM-2 achieves remarkable performance on Gene Ontology prediction and competitive results on Reaction Class Prediction. This suggests that sequence information alone carries substantial predictive power for certain functional annotation tasks. However, for structural tasks like fold prediction, the ability to propagate orientation information appears crucial, with equivariant GNNs showing substantial advantages.
3.6 Related Work

Completeness of molecular representations A central question in geometric deep learning is whether invariant scalar features (such as distances) are sufficient to completely distinguish any two non-isomorphic geometric structures, or if equivariant vector features are strictly necessary. The molecular simulations community has extensively studied the completeness of atom-centred interatomic potentials, focusing on the ability to distinguish 1-hop local neighbourhoods (point clouds) around atoms by constructing spanning sets for continuous, G-equivariant multiset functions [Shapeev, 2016, Drautz, 2019, Dusson et al., 2019, Pozdnyakov et al., 2020]. GWL generalises and extends this analysis to generic geometric graph isomorphism problems beyond local atom-centred neighbourhoods.

Subsequent to the publication of the GWL framework [Joshi et al., 2023], several works established tight bounds for the completeness of invariant representations on fully connected graphs. Delle Rose et al. [2023] proved that the standard (d − 1)-dimensional Weisfeiler-Leman test is complete for distinguishing generic point clouds in ℝ^d given the full distance matrix. For 3D molecular structures, this implies that 2-WL (which considers pairs of nodes) is theoretically sufficient for completeness, provided the graph is fully connected. Similarly, Hordan et al. [2024] established that GNNs simulating 2-WL are universal approximators for continuous functions on point clouds. Concurrently, Li et al. [2023c] demonstrated that while full distance matrices contain sufficient information, standard message passing on distances (equivalent to 1-WL) is incomplete, echoing the limitations of IGWL discussed in this chapter. These findings collectively suggest that while invariant representations can be complete on fully connected geometric graphs, they require higher-order message passing (beyond 1-WL) to achieve this.
In contrast, the GWL framework highlights that for sparse graphs, which are computationally necessary for large and periodic atomic systems, equivariance provides a strictly superior inductive bias by propagating orientation information that is otherwise lost in local invariant updates.

Universality of Geometric GNNs Recent theoretical work [Dym and Maron, 2020, Villar et al., 2021, Gasteiger et al., 2021, Jing et al., 2020] has shown that architectures such as TFN, GemNet and GVP-GNN can be universal approximators of continuous, G-equivariant or G-invariant multiset functions over point clouds, i.e. fully connected graphs. In contrast, the GWL framework studies the expressive power of Geometric GNNs operating on sparse graphs from the perspective of discriminating geometric graphs.

In our full paper [Joshi et al., 2023], we included additional proofs that GWL's discrimination-based perspective is equivalent to universal approximation. However, the discrimination lens offers more granular and practically useful insights than universality alone. While universality is a binary property (a model is either universal or not), discrimination enables a more nuanced analysis of expressivity by characterizing the specific classes of geometric graphs that can and cannot be distinguished. This finer-grained perspective allows us to identify concrete counterexamples and failure modes, making the theoretical framework directly applicable to practical model design.

3.7 Summary

In this chapter, we studied the expressive power of Geometric GNNs from the perspective of discriminating non-isomorphic geometric graphs. We proposed a geometric version of the Weisfeiler-Leman graph isomorphism test, termed GWL, which is a theoretical upper bound on the expressive power of Geometric GNNs. The GWL framework addresses a key research gap as standard GNNs and the associated theoretical tools are inapplicable for geometric graphs and 3D molecular structure representation learning.
Through the lens of GWL, we formalised how key design choices influence Geometric GNN expressivity. Notably, invariant GNNs cannot distinguish graphs where one-hop neighbourhoods are the same and fail to compute non-local geometric properties such as volume and centroid. Equivariant GNNs distinguish a larger class of graphs as stacking equivariant layers propagates geometric information beyond local neighbourhoods.

Our synthetic experiments validate theoretical insights from GWL, highlighting three key challenges in Geometric GNN design: (1) geometric oversquashing, where equivariant models require more iterations than theoretically prescribed for propagating geometric information; (2) the need for higher order tensors, as layers using order-L spherical tensors cannot distinguish very simple structures with symmetry higher than L-fold; and (3) limits of scalarisation body order, as we can construct counterexamples where layers with body order k cannot distinguish structures requiring (k + 1)-body invariants. These experiments reveal practical limitations of current architectures for molecular representation learning.

Additionally, through a benchmark on protein function prediction, we explored how our theoretical insights translate to performance in real-world tasks. We found that equivariant GNNs consistently outperformed their invariant counterparts across all tasks, confirming the practical utility of higher-order equivariant features suggested by our theory. Overall, Geometric GNNs showed improvements over sequence-based protein language models, demonstrating the value of geometric inductive biases in learning representations of molecular structure.

Future work GWL provides an abstraction to study the theoretical limits of Geometric GNNs. However, translating these theoretical insights directly into practical models remains challenging, as GWL assumes perfect neighbourhood aggregation and colouring functions that satisfy the conditions of Proposition 2.
Although we do not propose provably powerful models, the understanding gained from the GWL framework guides the development of maximally powerful Geometric GNNs for real-world problems such as those tackled in Part II. Moreover, our discrimination-based perspective can be a starting point for further investigating the optimisation behaviour and generalisation capacity of these architectures.

Additionally, GWL does not characterise all classes of Geometric GNNs. Non-local architectures that aggregate geometric information beyond immediate neighbourhoods, such as GemNet-Q [Gasteiger et al., 2021], are not directly covered by our framework. Similarly, architectures based on canonical reference frames [Du et al., 2022, Wang et al., 2022] are outside the scope of GWL. These methods use local or global frames of reference to transform equivariant quantities into invariant features, offering an alternative modelling paradigm when canonical reference frames can be easily defined (e.g. protein backbone structures [Jumper et al., 2021]).

An emerging class of architectures also not covered by GWL are unconstrained networks that learn roto-translational equivariance implicitly from data, such as standard Transformers without geometric inductive biases [Wang et al., 2024, Abramson et al., 2024, Joshi et al., 2025a]. These models treat 3D coordinates as static node features and can theoretically distinguish any geometric graphs where the underlying attributed graphs are distinguishable by WL. Understanding when to prefer explicit inductive biases versus implicit learning of symmetries remains an important open question, both in theory and practice. The next chapter, Chapter 4, will explore this trade-off in more detail in the context of generative modelling.

Chapter 4
Unified Generative Modelling of Molecules and Materials

In Chapter 3, we studied representation learning of molecular structures, with a focus on predictive problems.
We will now turn to the complementary problem of generative modelling, which is the foundation for inverse design of molecules with bespoke functionality. The current state-of-the-art uses diffusion or flow matching models for tasks such as structure prediction and conditional generation for biomolecules [Watson et al., 2023, Ingraham et al., 2023, Abramson et al., 2024] and materials [Jiao et al., 2023, Zeni et al., 2025], as well as for structure-based drug design [Schneuing et al., 2024]. Molecular systems are atoms interacting in 3D space: they share common underlying physical principles that determine their 3D structure and properties. However, we currently do not have a unified formulation of diffusion models across different types of systems such as periodic crystals and non-periodic small molecules or biomolecules. This contrasts with predictive models, such as interatomic potentials for molecular simulation [Bartók et al., 2017], which have seen architectural unification through Geometric GNNs and benefited from transfer learning to achieve broad generalization across domains [Shoghi et al., 2024, Wood et al., 2025]. Most molecular diffusion models are highly specific to each type of system, and involve multi-modal generative processes on complex product manifolds of categorical and continuous data types. For example, de novo generation of small molecules is modelled as two independent diffusion processes for the atom types (categorical) and 3D coordinates (continuous) of a set of atoms [Hoogeboom et al., 2022]. The denoiser model learns how atom types and 3D coordinates jointly evolve in order to sample new molecules but passes through unrealistic intermediate states during the denoising trajectory. Diffusion models for biomolecules treat groups of atoms as rigid bodies and add a third manifold (rotations) into the joint diffusion process [Campbell et al., 2024]. 
For crystals/materials, the diffusion process needs to additionally handle periodicity and operates on a joint manifold of atom types, fractional coordinates, lattice lengths, and lattice angles that together define the repeating unit cell [Miller et al., 2024].

[Figure 4.1 diagram. Input and output representations: atom types (B, N, 1), 3D coordinates (B, N, 3), fractional coordinates (B, N, 3), cell lengths (B, 1, 3), cell angles (B, 1, 3). Stage 1: autoencoder for reconstruction, with an encoder and decoder around a latent representation (B, N, d). Stage 2: latent diffusion generative model, in which a Diffusion Transformer denoises random Gaussian noise (B, N, d) into a sampled latent representation (B, N, d), conditioned on a periodic/non-periodic class label.]

Figure 4.1: Generative modelling of molecules and materials with All-atom Diffusion Transformers. ADiT performs generative modelling of 3D molecular systems in two stages: (1) An autoencoder learns a shared latent space by reconstructing all-atom representations of both molecules (non-periodic) and crystals (periodic); and (2) A Diffusion Transformer samples new latents from the shared distribution using classifier-free guidance, which are decoded to valid molecules or crystals using the VAE. Our unified latent diffusion framework enables transfer learning and avoids the complexity of multiple diffusion processes on categorical-continuous product manifolds used by equivariant diffusion models.

This chapter introduces the All-atom Diffusion Transformer (ADiT), a unified latent diffusion model for jointly generating both periodic materials and non-periodic molecules using the same model.
As illustrated in Figure 4.1, ADiT is a latent diffusion model based on two key ideas: (1) An autoencoder maps unified, all-atom representations of molecules and crystals to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or crystals. ADiTs achieve state-of-the-art generative performance on both molecules and crystals while being significantly more scalable than specialized equivariant diffusion models. Additionally, we demonstrate that joint training and transfer learning between periodic and non-periodic domains improves performance, representing a step towards broadly generalizable foundation models for generative chemistry. Open source code is available: https://github.com/facebookresearch/all-atom-diffusion-transformer.

4.1 All-atom Diffusion Transformers

Overview We use latent diffusion [Rombach et al., 2022, Vahdat et al., 2021] to unify generative modelling across periodic and non-periodic molecular systems. Our approach consists of two stages: (1) An autoencoder learns a shared latent space by jointly reconstructing all-atom representations of both molecules and materials; and (2) A Diffusion Transformer [Peebles and Xie, 2023] generates new samples from this latent space using classifier-free guidance [Ho and Salimans, 2022], which can then be decoded into valid molecules or crystals.

Compared to existing equivariant diffusion models, our latent diffusion formulation shifts the complexity of handling categorical and continuous attributes into the autoencoder. This enables a very simple and highly scalable generative process in a shared latent space of periodic and non-periodic molecular systems.
4.1.1 Stage 1: Autoencoder for reconstruction

Unified representation of 3D molecular systems Both periodic and non-periodic molecular systems can be represented as sets of atoms in 3D space, as we saw in Chapter 2. The key difference is that crystals require an additional periodic unit cell, while molecules have unbounded coordinates. A crystal or molecule with N atoms is represented as a multi-modal object:

Atom types: $\mathbf{A} = \{a_i\}_{i=1}^{N} \in \mathbb{Z}^{1 \times N}$,
3D coords.: $\mathbf{X} = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{3 \times N}$,
Fractional coords.: $\mathbf{F} = \{f_i\}_{i=1}^{N} \in [0, 1)^{3 \times N}$,
Unit cell/lattice: $\mathbf{L} = \{l_1, l_2, l_3\} \in \mathbb{R}^{3 \times 3}$.

The 3D coordinates $\mathbf{X}$ are in nanometers, and the fractional coordinates $\mathbf{F}$ are in the range [0, 1). The lattice matrix $\mathbf{L}$ represents a parallelepiped defining the shape of the repeating unit cell, and fractional coordinates are computed as the inverse of the unit cell matrix multiplied by the 3D coordinates: $\mathbf{F} = \mathbf{L}^{-1}\mathbf{X}$. We use Niggli reduction to uniquely determine the unit cell parameters for crystals [Grosse-Kunstleve et al., 2004]. For non-periodic molecules, we set the unit cell parameters and fractional coordinates to null values ϕ.

VAE architecture We use a Variational Autoencoder (VAE) [Kingma and Welling, 2014] to learn a shared latent representation of molecules and materials using a reconstruction objective. Given an input 3D molecular system $(\mathbf{A}, \mathbf{X}, \mathbf{F}, \mathbf{L})$, an encoder $\mathcal{E}$ maps each atom's attributes to a latent representation $\mathbf{Z}$:

$$\mathbf{Z} = \mathcal{E}(\mathbf{A}, \mathbf{X}, \mathbf{F}), \tag{4.1}$$

where $\mathbf{Z} = \{z_i\}_{i=1}^{N} \in \mathbb{R}^{d \times N}$ encodes information about the categorical atom type and continuous coordinates (unit cell parameters are encoded implicitly in the fractional coordinates). The decoder $\mathcal{D}$ reconstructs the input molecular system from the latent embedding:

$$\mathbf{A}', \mathbf{X}', \mathbf{F}', \mathbf{L}' = \mathcal{D}(\mathbf{Z}). \tag{4.2}$$

We describe the pseudocode for VAE encoder and decoder operations in Algorithms 1 and 2, respectively. For the architecture of the encoder $\mathcal{E}$ and decoder $\mathcal{D}$, we used the standard Transformer [Vaswani et al., 2017a] and learn symmetries via data augmentation.
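The relation F = L⁻¹X between Cartesian and fractional coordinates can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ADiT code: following the shapes given above, the lattice vectors are taken to be the columns of a 3 × 3 matrix and coordinates are stored as a 3 × N array; the function names and the toy orthorhombic cell are hypothetical.

```python
import numpy as np


def cartesian_to_fractional(lattice, cart):
    """Compute F = L^{-1} X and wrap the result into [0, 1).

    Assumptions: `lattice` is 3x3 with lattice vectors as columns,
    `cart` is 3xN with one atom per column.
    """
    frac = np.linalg.solve(lattice, cart)  # solves L F = X without forming L^{-1}
    return frac - np.floor(frac)


def fractional_to_cartesian(lattice, frac):
    """Inverse map X = L F for atoms inside the unit cell."""
    return lattice @ frac


# Hypothetical orthorhombic cell with side lengths 2, 3 and 4.
lattice = np.diag([2.0, 3.0, 4.0])
# Two atoms; the second lies outside the cell along the first and third axes.
cart = np.array([[1.0, 2.5],
                 [1.5, 0.0],
                 [2.0, 5.0]])

frac = cartesian_to_fractional(lattice, cart)

if __name__ == "__main__":
    print(frac)
```

Wrapping into [0, 1) maps every atom to its periodic image inside the unit cell, which is why the second atom in this toy example ends up at a different Cartesian position after a round trip through `fractional_to_cartesian`.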
In Appendix B.3, we also ablated roto-translation equivariant VAEs based on Equiformer-V2 [Liao et al., 2024b], a state-of-the-art Geometric GNN.

Algorithm 1: Pseudocode for VAE encoder $\mathcal{E}$
Input: 3D molecular system $(\{a_i\}, \{x_i\}, \{f_i\}, \{l_1, l_2, l_3\})$
Output: Latent representations $\{z_i\}$
    # Project inputs to $d_{model}$
    1. $h_i = \mathrm{Embedding}(a_i)$, with $h_i \in \mathbb{R}^{d_{model}}$
    2. $h_i = h_i + \mathrm{Linear}(\mathrm{Swish}(\mathrm{Linear}(x_i)))$
    3. $h_i = h_i + \mathrm{Linear}(\mathrm{Swish}(\mathrm{Linear}(f_i)))$
    # Apply encoder network
    4. $\{h_i\} = \mathrm{TransformerEncoder}(\{h_i\})$
    # Down-project to mean $\mu_Z$ and std $\sigma_Z$
    5. $\mu_{z_i} = \mathrm{Linear}(h_i)$, with $\mu_{z_i} \in \mathbb{R}^{d}$
    6. $\log \sigma_{z_i} = \mathrm{Linear}(h_i)$, with $\sigma_{z_i} \in \mathbb{R}^{d}$
    # Sample latents $\mathbf{Z}$
    7. $z_i = \mu_{z_i} + \sigma_{z_i} \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, 1)^d$, with $z_i \in \mathbb{R}^{d}$

Algorithm 2: Pseudocode for VAE decoder $\mathcal{D}$
Input: Latent representations $\{z_i\}$
Output: 3D molecular system $(\{a'_i\}, \{x'_i\}, \{f'_i\}, \{l'_1, l'_2, l'_3\})$
    # Up-project latents to $d_{model}$
    1. $h_i = \mathrm{Linear}(z_i)$, with $h_i \in \mathbb{R}^{d_{model}}$
    # Apply decoder network
    2. $\{h_i\} = \mathrm{TransformerEncoder}(\{h_i\})$
    # Predict outputs
    3. $a'_i = \mathrm{argmax}(\mathrm{Linear}(h_i))$, with $a'_i \in \mathbb{Z}$
    4. $x'_i = \mathrm{Linear}(h_i)$, with $x'_i \in \mathbb{R}^{3}$
    5. $f'_i = \mathrm{Linear}(h_i)$, with $f'_i \in \mathbb{R}^{3}$
    6. $\{l'_1, l'_2, l'_3\} = \mathrm{Linear}\left(\frac{1}{N} \sum_{i=1}^{N} h_i\right)$, with $l' \in \mathbb{R}^{3}$

Reconstruction loss We compute the loss for the predicted atom types $\mathbf{A}'$ via cross-entropy:

$$\mathcal{L}_A = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CrossEnt}(a_i, a'_i). \tag{4.3}$$

For the predicted 3D coordinates $\mathbf{X}'$, we use the mean squared error (MSE) reconstruction loss after zero-centering both sets of coordinates:

$$\tilde{x}_i = x_i - \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \tilde{x}'_i = x'_i - \frac{1}{N} \sum_{i=1}^{N} x'_i, \qquad \mathcal{L}_X = \frac{1}{3N} \sum_{i=1}^{N} \|\tilde{x}_i - \tilde{x}'_i\|^2. \tag{4.4}$$

We compute the reconstruction loss for the predicted fractional coordinates $\mathbf{F}'$ using MSE as well:

$$\mathcal{L}_F = \frac{1}{3N} \sum_{i=1}^{N} \|f_i - f'_i\|^2. \tag{4.5}$$

For the predicted lattice vectors $\mathbf{L}'$, we first convert to rotation-invariant lattice parameters: three side lengths of the unit cell $L_l = \{a, b, c\} \in \mathbb{R}^{1 \times 3}$, and three internal angles between them $L_a = \{\alpha, \beta, \gamma\} \in [60°, 120°]^{1 \times 3}$, as described in Miller et al. [2024].
We then compute the MSE reconstruction loss between the predicted and ground truth lattice parameters:

$$\mathcal{L}_{L_l} = \frac{1}{3} \left( (a - a')^2 + (b - b')^2 + (c - c')^2 \right), \tag{4.6}$$

$$\mathcal{L}_{L_a} = \frac{1}{3} \left( (\alpha - \alpha')^2 + (\beta - \beta')^2 + (\gamma - \gamma')^2 \right). \tag{4.7}$$

Note that in $\mathcal{L}_{L_l}$, we normalize the predicted and ground truth lengths by the cube root of the number of atoms to account for the scaling of the unit cell with the number of atoms, following Xie et al. [2022]. All angles are converted from degrees to radians for numerical stability.

The autoencoder is trained with a weighted reconstruction loss to balance the relative magnitudes of the various losses. Depending on whether a training sample is periodic or non-periodic, we use different reconstruction loss weights:

$$\mathcal{L}_{rec} = \lambda_A \mathcal{L}_A + \lambda_X \mathcal{L}_X + \lambda_F \mathcal{L}_F + \lambda_{L_l} \mathcal{L}_{L_l} + \lambda_{L_a} \mathcal{L}_{L_a}, \tag{4.8}$$

where

| | $\lambda_A$ | $\lambda_X$ | $\lambda_F$ | $\lambda_{L_l}$ | $\lambda_{L_a}$ |
|---|---|---|---|---|---|
| Periodic | 1.0 | 0.0 | 10.0 | 1.0 | 10.0 |
| Non-periodic | 1.0 | 10.0 | 0.0 | 0.0 | 0.0 |

Thus, the overall loss for periodic crystals trains the model to reconstruct the atom types, fractional coordinates and lattice parameters while ignoring the predicted 3D coordinates. Similarly, the overall loss for non-periodic molecules trains the model to reconstruct the atom types and 3D coordinates while ignoring the predicted fractional coordinates and lattice parameters.

Regularization We use three regularization techniques to learn robust, informative latent representations: (1) A bottleneck architecture with latent dimension $d$ significantly smaller than the encoder/decoder hidden dimension $d_{model}$ (e.g., $d = 8$ vs $d_{model} = 512$). (2) A per-channel KL divergence penalty $\lambda_{KL} \cdot D_{KL}\left( \mathcal{N}(\mathbf{Z}; \mu_Z, \sigma_Z) \,\|\, \mathcal{N}(0, 1)^d \right)$ added to equation 4.8, following Rombach et al. [2022]. (3) Denoising training with 10% of atoms having their types masked and coordinates perturbed by $\mathcal{N}(0, 0.1)$ Gaussian noise.

For non-equivariant encoders/decoders, we learn symmetries via data augmentation during training. Translation invariance in non-periodic systems is handled by working with zero-centred coordinates.
For translation invariance in periodic systems, we add a random translation vector to the Cartesian coordinates and recompute the fractional coordinates using the updated Cartesian coordinates. Rotation symmetry is learnt via applying a random rotation to the Cartesian coordinates and unit cell (fractional coordinates are invariant to global rotations by definition).

Decoding latents to molecular systems During inference or sampling from the DiT, the desired output type (periodic/non-periodic) determines how we process the decoder outputs. The VAE decoder D generates four attributes for each system: (1) atom types, (2) 3D coordinates, (3) fractional coordinates, and (4) lattice parameters. For non-periodic molecules, we only utilize the atom types and 3D coordinates, constructing the molecule via RDKit. For periodic crystals, we combine the atom types, fractional coordinates, and lattice parameters to build the crystal structure using PyMatGen. This split decoding strategy allows a single unified model to share information between both domains while still respecting their distinct geometric constraints, enabling effective transfer learning between periodic and non-periodic systems.

4.1.2 Stage 2: Latent diffusion generative model

Diffusion formulation We use Gaussian diffusion or flow matching as our generative framework, which iteratively denoises latent samples from a base distribution into samples from a target distribution [Sohl-Dickstein et al., 2015, Song and Ermon, 2019, Lipman et al., 2023]. Our formulation uses linear interpolation between a standard normal base distribution and the target distribution of VAE encoder latent representations of 3D molecular systems (we describe it in terms of flow matching, though both formulations are equivalent; see Gao et al. [2024]). Thus, the diffusion model is trained after training the first stage VAE.
Our model learns to generate a set of $N$ latent representations $Z = \{z_i\}_{i=1}^{N}$, where each latent $z_i \in \mathbb{R}^d$ encodes information about one atom's type, coordinates and unit cell, which can be decoded to a valid molecular system using the VAE decoder D.

During training, given an input molecular system $(A, X, F, L)$, we first encode it to a latent representation $Z$ using the VAE encoder E. We denote $Z$ as $Z^{(1)}$, a 'clean' training sample at time $t = 1$. We then sample a random initial latent $Z^{(0)}$ at time $t = 0$ from a $d$-dimensional standard normal distribution $\mathcal{N}(0, 1)^d$, and perform zero-centering by subtracting the per-channel mean of $Z^{(0)}$. We then use linear interpolation to construct a 'noisy' interpolated sample $Z^{(t)}$ at a randomly sampled time step $t \sim U(0, 1)$:

$$ Z^{(t)} = (1 - t) \, Z^{(0)} + t \, Z^{(1)} . \tag{4.9} $$

Thus, we can define a ground truth conditional vector field $u_t(Z^{(t)} | Z^{(1)})$ along the path from the noisy latents $Z^{(t)}$ at time step $t$ to the clean latents $Z^{(1)}$ as:

$$ u_t(Z^{(t)} | Z^{(1)}) = \frac{Z^{(1)} - Z^{(t)}}{1 - t} . \tag{4.10} $$

Samples from the base distribution can be transformed to samples from the target distribution by integrating the vector field $u_t(Z^{(t)} | Z^{(1)})$ over time $t$. The goal of conditional flow matching is to train a denoiser network F to match this conditional vector field $u_t$. To do so, the denoiser takes as input the intermediate noisy latents $Z^{(t)}$ at time step $t$ and an additional class label $c$ (described subsequently) to predict the final clean latents $Z^{\prime(1)}$:

$$ Z^{\prime(1)} = F(Z^{(t)}, t, c) . \tag{4.11} $$

The denoiser is trained by minimizing an MSE loss between the resulting predicted conditional vector field and the ground truth conditional vector field:

$$ \mathcal{L}_{fm} = \frac{1}{N} \sum_{i=1}^{N} \left\| \frac{z_i^{(1)} - z_i^{(t)}}{1 - t} - \frac{z_i^{\prime(1)} - z_i^{(t)}}{1 - t} \right\|^2 = \frac{1}{(1 - t)^2} \, \frac{1}{N} \sum_{i=1}^{N} \left\| z_i^{(1)} - z_i^{\prime(1)} \right\|^2 . \tag{4.12} $$

In practice, we set a minimum value for time step $t_{min} = 0.01$ and maximum value $t_{max} = 0.9$ to prevent numerical instability, following Yim et al. [2023a].

Algorithm 3: Pseudocode for DiT sampling
Input: Class label c, num. integration steps T, cfg. scale γ
Output: Generated sample (A, X, F, L)
# Sample initial noisy latents Z^(0) at t = 0
1. Z^(0) = {z_i^(0) ∼ N(0, 1)^d}
2. Δt = 1/T                                  # Step size
# Denoising loop
3. for t in linspace(t_min, t_max, T):
4.     Z′_cond = F(Z^(t), t, c)              # Conditional prediction
5.     Z′_uncond = F(Z^(t), t, ϕ)            # Unconditional prediction
       # Conditioning via classifier-free guidance
6.     Z′ = (1 − γ) · Z′_uncond + γ · Z′_cond
       # Euler integration step
7.     Z^(t+Δt) = Z^(t) + Δt · (Z′ − Z^(t)) / (1 − t)
# Decode latents to 3D molecular system (Algorithm 2)
8. A, X, F, L = D(Z^(1))

Denoiser architecture As the denoiser network F, we use a class-conditional Diffusion Transformer (DiT) [Peebles and Xie, 2023]. The DiT largely follows a standard Transformer architecture with the conditioning information incorporated via adaptive layer norm with zero-initialization, which replaces all layer norm operations. For class conditioning, we use a binary embedding to denote whether the system being generated is periodic (crystal) or non-periodic (molecule). This conditioning allows the model to learn domain-specific features while sharing most parameters. During training, we apply class label dropout with 10% probability to enable classifier-free guidance during inference. We also incorporate self-conditioning [Yim et al., 2023b] where the denoiser's prediction from the previous timestep is concatenated to the current input with 50% dropout probability during training.

While we currently only condition on the periodic/non-periodic class label, the DiT architecture can incorporate additional conditioning signals like target properties or geometric constraints to enable controlled generation. This represents a promising direction for future work in inverse design applications.

Data augmentation The DiT denoiser is trained with data augmentation to learn roto-translational and periodic symmetries in the VAE's latent space.
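To make equations 4.9 to 4.12 and Algorithm 3 concrete, here is a minimal NumPy sketch of the interpolation, the simplified flow matching loss, and Euler sampling with classifier-free guidance. It is an illustration under stated assumptions rather than the thesis implementation: `denoiser` is any callable standing in for the trained DiT, `None` stands in for the null label ϕ, and the sampler uses a uniform time grid t_k = k/T instead of the clipped `linspace(t_min, t_max, T)` grid of Algorithm 3. All function names are hypothetical.

```python
import numpy as np

def interpolate(z0, z1, t):
    """Equation 4.9: point on the straight path from noise z0 to data z1."""
    return (1.0 - t) * z0 + t * z1

def fm_loss(z1, z1_pred, t):
    """Simplified form of equation 4.12 for latents of shape (N, d)."""
    per_atom = np.sum((z1 - z1_pred) ** 2, axis=-1)
    return per_atom.mean() / (1.0 - t) ** 2

def sample(denoiser, n_atoms, d, label, num_steps, guidance=1.0, rng=None):
    """Euler sampler with classifier-free guidance, in the spirit of Algorithm 3.
    Assumption: uniform time grid t_k = k / num_steps, so 1 - t never hits zero."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((n_atoms, d))      # Z^(0) ~ N(0, 1)^d
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z_cond = denoiser(z, t, label)         # conditional clean-latent prediction
        z_uncond = denoiser(z, t, None)        # unconditional prediction (null label)
        z_pred = (1.0 - guidance) * z_uncond + guidance * z_cond
        z = z + dt * (z_pred - z) / (1.0 - t)  # Euler step along the predicted field
    return z
```

Two sanity checks follow from the formulation. For an interpolated sample, the target field of equation 4.10 reduces to $Z^{(1)} - Z^{(0)}$, a constant along the path. And with an oracle denoiser that always returns the same clean latents, the sampler lands on those latents after the final step, since the last Euler update covers the remaining distance.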
During training, the coordinates of each input system are randomly rotated and translated, and then converted to latents via the frozen VAE encoder E before being input to the DiT.

Sampling with classifier-free guidance To generate new molecular systems from the trained diffusion model, we use classifier-free guidance [Ho and Salimans, 2022] to steer the sampling process. At each denoising step, we compute both a conditional prediction based on the periodic/non-periodic class label c and an unconditional prediction with null class label ϕ. The final prediction is a weighted combination of these using guidance scale γ, allowing control over how strongly the generation follows the class conditioning.

The full sampling procedure is outlined in Algorithm 3. Starting from Gaussian noise Z(0), we iteratively denoise using the DiT model F for T steps. At each step, we perform Euler integration of the vector field to gradually transform the noisy latents towards the target distribution. While we currently use simple Euler integration for efficiency, adaptive ODE solvers could potentially improve performance [Ma et al., 2024]. Finally, we decode the denoised latents Z(1) to a valid 3D molecular system using the VAE decoder D.

4.2 Experimental Setup

Datasets For our main experiments, we train models on periodic crystals from MP20 and non-periodic molecules from QM9, representing two distinct domains of molecular systems. MP20 [Xie et al., 2022] contains 45,231 metastable crystal structures from the Materials Project [Jain et al., 2013], each with up to 20 atoms in its unit cell and spanning 89 different element types. QM9 [Wu et al., 2018] consists of 130,000 stable small organic molecules containing up to nine heavy atoms (C, N, O, F) along with hydrogens. We split the data following prior work [Xie et al., 2022, Hoogeboom et al., 2022] to ensure fair comparisons.
We also include results on the GEOM-DRUGS dataset of 430,000 large organic molecules of up to 180 atoms [Axelrod and Gomez-Bombarelli, 2022].

Training and hyperparameters We sequentially train the first-stage VAE and then the second-stage DiT using the AdamW optimizer with a constant learning rate of 1e-4, no weight decay, and a batch size of 256. We use an exponential moving average (EMA) of DiT weights over training with a decay of 0.9999. Both models are trained to convergence for at most 5000 epochs, taking up to 3 days on 8 V100 GPUs.

For the first-stage VAE, we use a standard Transformer as both encoder E and decoder D with hidden dimension dmodel = 512, 8 attention heads, and 8 layers (51M parameters). The latent dimension is set to d = 8 with KL regularization weight λKL = 1e-5 and 10% denoising perturbation during training.

For the second-stage DiT denoiser, we report results primarily using DiT-B configurations: hidden dimension dmodel = 768, 12 attention heads, 12 layers, and 130M parameters. We also evaluate smaller DiT-S (32M) and larger DiT-L (450M) variants.

Two key inference-time hyperparameters are the number of ODE integration steps T and the classifier-free guidance scale γ. We find T = 500 or 1000 with γ = 1.0 or 2.0 consistently works well for both molecules and crystals. Additional ablation studies comparing joint vs. dataset-specific training, architecture variants, regularization techniques, and inference settings are presented in Appendix B.3.

Evaluation metrics We evaluate the ability of ADiTs to sample valid and realistic molecules and crystals. Following prior work [Xie et al., 2022, Hoogeboom et al., 2022], we sample 10,000 crystals and molecules each and compute validity, stability, uniqueness and novelty rates using density functional theory (DFT) for crystals as well as validity, uniqueness and PoseBusters sanity checks [Buttenschoen et al., 2024] for molecules. Detailed descriptions of all evaluation metrics are provided in Appendix B.1.
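The exponential moving average of DiT weights mentioned under training and hyperparameters follows the usual update rule, which can be sketched as below. This is an illustrative sketch only: representing the model as a dictionary of NumPy arrays and the function name `ema_update` are assumptions made for the example, not details from the thesis.

```python
import numpy as np

def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step over a dict of parameter arrays:
    ema <- decay * ema + (1 - decay) * current."""
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * value
        for name, value in weights.items()
    }

# Toy usage: the EMA copy drifts slowly towards the current weights.
ema = {"w": np.zeros(2)}
current = {"w": np.ones(2)}
for _ in range(3):
    ema = ema_update(ema, current, decay=0.9)
```

With a decay of 0.9999, each training step contributes only 0.01% of the current weights, so the averaged copy used for evaluation changes far more smoothly than the raw weights.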
Baselines We compare ADiT trained jointly on both QM9 and MP20 to molecule-only and crystal-only ADiT variants, as well as state-of-the-art baselines for both datasets. For crystal generation on MP20, we compare to: (1) four equivariant diffusion and flow matching-based models operating on multi-modal product manifolds: CDVAE [Xie et al., 2022], DiffCSP [Jiao et al., 2023], FlowMM [Miller et al., 2024], and a variant of MatterGen [Zeni et al., 2025] trained on MP20 only; (2) UniMat [Yang et al., 2024], a non-equivariant diffusion model which learns symmetries from data; (3) FlowLLM [Sriram et al., 2024], a two-stage framework which first finetunes the autoregressive Llama 2 language model on crystal structures [Touvron et al., 2023, Gruver et al., 2024], and then trains FlowMM with samples from the language model as the base distribution and MP20 as the target distribution. For molecule generation on QM9, we compare to: (1) Equivariant Diffusion [Hoogeboom et al., 2022], a roto-translationally equivariant diffusion model operating on a multi-modal product manifold; (2) GeoLDM [Xu et al., 2023], an alternative latent diffusion model using Equivariant Diffusion in the latent space of a roto-translationally equivariant autoencoder; (3) Symphony [Daigavane et al., 2024], an equivariant and autoregressive generative model that iteratively builds a molecule atom-by-atom.

4.3 Results

Table 4.1: Crystal generation results on MP20. We report validity, stability, uniqueness, and novelty rates for 10,000 sampled crystals. ADiT shows improved performance over diffusion baselines across all metrics. We see significant gains for compositional validity due to a single diffusion process in the latent space, as opposed to joint continuous and categorical diffusion for baselines. Joint training with both molecular and crystal data improves crystal generation performance over MP20-only models. (Stable: DFT Ehull < 0.0 eV/atom, metastable: DFT Ehull < 0.1 eV/atom, ∗ denotes results from MatterGen-MP for 1024 sampled crystals, † denotes results we replicated using the same DFT setup as ADiT.)

Columns: validity rate (%) ↑ split into structure, composition, and overall; metastable rate (%) ↑; stable rate (%) ↑; M.S.U.N. rate (%) ↑; S.U.N. rate (%) ↑.

Model                   Structure   Composition   Overall   Metastable   Stable   M.S.U.N.   S.U.N.
MP20-only models:
CDVAE                   100.00      86.70         -         -            1.6      -          -
DiffCSP                 100.00      83.25         -         -            5.0      -          3.3
UniMat                  97.2        89.4          -         -            -        -          -
FlowMM                  96.85       83.19         80.30     30.6†        4.6†     22.5†      2.8†
FlowLLM                 99.94       90.84         90.81     66.9†        13.9†    26.3†      4.7†
MatterGen-MP            -           -             -         78∗          13∗      21∗        -
MP20-only ADiT          99.58       90.46         90.13     81.6         14.1     25.91      4.7
Joint model:
Jointly trained ADiT    99.74       92.14         91.92     81.0         15.4     28.2       5.3

Table 4.2: Molecule generation results on QM9. We report (a) validity and uniqueness rates, as well as (b) % pass rates on 7 sanity checks from PoseBusters for 10,000 sampled molecules. ADiTs match or improve performance w.r.t. baselines, and sample physically realistic structures. Joint training with both molecular and crystal data improves molecular generation performance over QM9-only models. (∗ denotes models which explicitly generate hydrogen atoms.)

(a) Validity results

Model                     Validity (%) ↑   Unique (%) ↑
QM9-only models:
Equivariant Diffusion     97.50            96.71
Equivariant Diffusion∗    91.90            98.69
GeoLDM∗                   93.80            98.82
Symphony∗                 83.50            97.98
QM9-only ADiT             96.02            97.76
QM9-only ADiT∗            92.19            97.90
Joint models:
Jointly trained ADiT      97.43            96.92
Jointly trained ADiT∗     94.45            97.82

(b) PoseBusters results

Test (% pass) ↑       Symphony   Eq. Diff.   ADiT
Atoms connected       99.92      99.88       99.70
Bond angles           99.56      99.98       99.85
Bond lengths          98.72      100.00      99.41
Ring flat             100.00     100.00      100.00
Double bond flat      99.07      98.58       99.98
Internal energy       95.65      94.88       95.86
No steric clash       98.16      99.79       99.79

State-of-the-art crystals and molecule generation Results for crystal generation in Table 4.1 show that ADiTs generate high-quality crystals compared to baseline diffusion models, achieving improved performance across validity, stability, uniqueness, and novelty metrics for 10,000 sampled crystals, with significant gains for compositional validity due to a single diffusion process in the VAE latent space rather than joint continuous and categorical diffusion.

For molecule generation, ADiTs achieve state-of-the-art performance on validity and uniqueness metrics across 10,000 sampled molecules, as shown in Table 4.2(a), while PoseBusters sanity check metrics in Table 4.2(b) further confirm that ADiTs generate physically realistic molecular structures, matching or exceeding baseline models across measures like double bond flatness, reasonable internal energy and lack of steric clashes.

Joint training improves performance Table 4.1 and Table 4.2 also show that jointly trained ADiTs (trained on both QM9 and MP20 together) exceed the performance of the MP20-only or QM9-only ADiTs for materials or molecules, respectively. Joint training improves validity and stability rates for both crystals and molecules, demonstrating effective transfer learning between periodic and non-periodic domains. These results validate that ADiTs can effectively model diverse types of molecular systems within a single architecture.

[Figure 4.2 plots. Left column: train loss, crystal validity rate, and molecule validity rate vs. epoch (0 to 2000) for ADiT-S (32M), ADiT-B (130M), and ADiT-L (450M). Right column: the same three quantities at epoch 2000 vs. log(number of parameters in M), with reported correlations of Pearson -1.00 / Spearman -1.00 for train loss, Pearson 0.91 / Spearman 1.00 for crystal validity rate, and Pearson 0.94 / Spearman 1.00 for molecule validity rate.]

Figure 4.2: Scaling up ADiT improves performance. We show the effect of increasing the number of ADiT denoiser parameters on the training loss and generation validity rates. Left: training loss and validity rates vs. epochs. Right: Correlation plots for training loss and validity rates at epoch 2,000 vs. ADiT parameters (in Millions).

[Figure 4.3 plots. (a) Crystals – MP20: time to sample 10K crystals (mins) vs. number of integration steps (10 to 1000) for ADiT-S (32M), ADiT-B (130M), ADiT-L (450M), and FlowMM (12M). (b) Molecules – QM9: time to sample 10K molecules (mins) vs. number of integration steps (10 to 1000) for ADiT-S (32M), ADiT-B (130M), ADiT-L (450M), and GeoLDM (5M). Each panel has an inset for 10 to 100 steps.]

Figure 4.3: ADiTs are significantly faster than equivariant diffusion models. We plot the number of integration steps for ADiTs and equivariant diffusion models vs. time to generate 10,000 samples on a single V100 GPU. ADiTs scale significantly better with the number of integration steps compared to equivariant diffusion.

Scaling up ADiT denoiser improves performance In Figure 4.2, we see that generative modelling performance predictably improves as we scale the DiT denoiser from parameter counts of 32M (DiT-S) to 130M (DiT-B) all the way to 450M (DiT-L), even with our current modest dataset size of ∼130K total samples. The diffusion training loss and validity rates consistently improve with larger model sizes, showing a clear benefit from scale. Strong correlations between
model size and performance metrics suggest further gains are possible from scaling both model size and data; Alexandria (2M inorganic crystals), ZINC (250M molecules), and the Protein Data Bank (200K biomolecular complexes) present promising opportunities for dataset scaling.

Speedup compared to equivariant diffusion ADiTs achieve significant inference speedup compared to equivariant diffusion under the same hardware conditions, as shown in Figure 4.3. When generating 10,000 samples on a V100 GPU, ADiTs based on standard Transformers scale better with integration steps than FlowMM [Miller et al., 2024] for crystals and GeoLDM [Xu et al., 2023] for molecules, both of which use computationally intensive equivariant networks as denoisers. It is significantly more practical to scale up Transformers than equivariant networks, as seen by the faster inference speed of ADiT-B compared to 100× smaller equivariant baselines.

PCA visualization of shared latent space In Figure 4.4, we plot the first two PCA principal components of 100 random samples each from the MP20 and QM9 validation set, as well as 100 generated crystals and 100 generated molecules sampled from ADiT. We observe that the joint latent space shows distinct clusters between molecules and crystals, with tighter clustering for molecules and more spread for crystals, reflecting the greater diversity of elements and local geometric environments in periodic crystal structures.

Next, we plot the same PCA but only keeping atoms of carbon, nitrogen, oxygen, and fluorine in Figure 4.5. These atoms appear in both QM9 molecules and MP20 crystals, allowing us to analyze how their representations compare across periodic and non-periodic systems. The visualization reveals clear patterns: principal component 1 primarily distinguishes between molecules (clustered between -2 and 2) and crystals, while principal component 2 correlates with atom type.
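For readers who want to reproduce this kind of plot, a two-component PCA projection of per-atom latents can be sketched with a singular value decomposition. This is a generic illustration, not the plotting code used for Figures 4.4 and 4.5: the function name `pca_project` is hypothetical, and the latents are assumed to be stacked into an array of shape (number of atoms, d).

```python
import numpy as np

def pca_project(latents, n_components=2):
    """Project rows of `latents` (n_atoms, d) onto their top principal components.
    Returns the projected points and the explained variance ratio per component."""
    centred = latents - latents.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centred @ vt[:n_components].T, explained[:n_components]

# Toy usage with random 8-dimensional latents (d = 8 as in the VAE above).
rng = np.random.default_rng(0)
toy_latents = rng.standard_normal((200, 8)) * np.array([10, 3, 1, 1, 1, 1, 1, 1])
points, ratio = pca_project(toy_latents)
```

Fitting the projection once on the pooled latents of molecules and crystals, and then colouring points by system type or atom type, keeps both groups in the same coordinate frame so that their positions are directly comparable.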
Most notably, oxygen atoms show similar latent representations whether they appear in molecules or crystals, suggesting ADiT's latent space captures fundamental chemical properties that transfer across both domains. This shared representation of oxygen, a key element in both datasets, may help explain ADiT's successful joint learning and transfer between periodic and non-periodic systems.

Table 4.3: Molecule generation results on GEOM-DRUGS. Left: Validity, uniqueness and % pass rates on PoseBusters for 10,000 sampled molecules (∗ PoseBusters results taken from Buttenschoen et al. [2025]). ADiT with minimal molecular inductive biases matches or exceeds state-of-the-art equivariant diffusion baselines, which explicitly predict atomic bonds. Right: We plot the number of integration steps for ADiTs and SemlaFlow vs. time to generate 10,000 molecules on a single A100 GPU. ADiTs scale favorably with the number of integration steps compared to SemlaFlow, a highly optimized equivariant diffusion model.

Metric (% pass) ↑     EQGAT-diff∗   SemlaFlow∗   ADiT
Validity              94.6          93.9         95.3
Uniqueness            100.0         100.0        100.0
Atoms connected       84.4          92.3         93.0
Bond angles           86.9          94.8         92.3
Bond lengths          87.0          94.6         92.5
Ring flat             87.0          94.9         95.4
Double bond flat      87.0          94.2         95.3
Internal energy       86.8          94.8         91.3
No steric clash       82.9          92.0         91.8
PoseBusters valid     59.7          87.5         85.3

[Table 4.3, right panel: time to sample 10K molecules (mins) vs. number of integration steps (10 to 1000) for ADiT-S (32M), ADiT-B (150M), ADiT-L (450M), and SemlaFlow (46M), with an inset for 10 to 100 steps.]

Extension to larger GEOM-DRUGS molecules. To demonstrate the scalability of the ADiT architecture to larger systems, we experiment with the GEOM-DRUGS dataset of 430,000 molecules of up to 180 atoms. Our setup follows Vignac et al. [2023] and we compare to state-of-the-art equivariant diffusion [Le et al., 2024] and flow matching [Irwin et al., 2025] baselines.
In Table 4.3, we see that ADiT is on par with or better than equivariant models across validity and PoseBusters metrics [Buttenschoen et al., 2025]. This is a notable result because ADiT is based on the standard Transformer architecture with minimal molecular inductive biases and, unlike equivariant baselines, does not explicitly predict atomic bonds.

Additional results and ablations For additional results, including the distributions of DFT formation energy, composition, and spacegroups for crystals generated by ADiT compared to baselines, see Appendix B.2. An ablation study of the ADiT architecture is available in Appendix B.3.

[Figure 4.4 plot: scatter of principal component 1 vs. principal component 2 (both axes from -6 to 6). Legend: System (Crystal, Molecule); Source (Dataset, Generated).]

Figure 4.4: PCA plot of latent embeddings from ADiT's VAE for 100 data points from the MP20 and QM9 datasets, as well as 100 ADiT-generated crystals/molecules each. Each point represents an atom, coloured by the system type and sized by whether it comes from real data or generated latents. Latent embeddings from molecules form tighter clusters, while crystals show greater spread due to their higher elemental and structural diversity.

4.4 Related Work

Generative models for molecules and materials Diffusion models have emerged as the state-of-the-art for generative modelling of molecular systems, with applications to small molecules, crystals, and biomolecules. For small molecules, Equivariant Diffusion [Hoogeboom et al., 2022] pioneered roto-translationally equivariant diffusion on the multi-modal product manifold of atom types and 3D positions, while GeoLDM [Xu et al., 2023] introduced latent diffusion in the space of an equivariant autoencoder. Schneuing et al. [2024] extended equivariant diffusion to generate molecules conditioned on binding protein partners for structure-based drug design, while Corso et al. [2023] explored similar architectures for protein-small molecule docking.
For crystal generation, state-of-the-art approaches use equivariant diffusion on product manifolds of atom types, 3D/fractional coordinates, and lattice parameters. Notable examples include CDVAE [Xie et al., 2022], DiffCSP [Jiao et al., 2023], and FlowMM [Miller et al., 2024]. MatterGen [Zeni et al., 2025] demonstrated conditional diffusion for inverse design based on target material properties and symmetry space groups. Language models have also been used for generating molecules and crystals as textual representations [Flam-Shepherd and Aspuru-Guzik, 2023, Gruver et al., 2024].

[Figure 4.5 plot: scatter of principal component 1 vs. principal component 2 (both axes from -6 to 6). Legend: Atom type (C, N, O, F); Source (Dataset, Generated); System (Crystal, Molecule).]

Figure 4.5: PCA plot of latent embeddings for carbon, nitrogen, oxygen, and fluorine atoms from ADiT's VAE for 100 data points from the MP20 and QM9 datasets, as well as 100 ADiT-generated crystals/molecules each. Each point represents an atom, coloured by atom type and sized by whether it comes from real data or generated latents. Principal component 1 visually correlates with whether a system is a molecule (within the range -2 to 2) or a crystal. Principal component 2 visually correlates with the atom type. We see distinct clusters for different atom types, especially oxygen atoms, suggesting that ADiT's latent space captures shared chemical properties across periodic and non-periodic systems.

Our work stands out as the first to develop unified generative models capable of sampling both periodic crystals and non-periodic molecular systems jointly. The closest work to ADiT in terms of diffusion formulation is AlphaFold3 [Abramson et al., 2024], which applies standard Transformers and Gaussian diffusion to generate all-atom biomolecular complexes. However, their formulation is specific to structure prediction for biomolecules and only diffuses 3D atomic coordinates in Cartesian space.
In contrast, our latent diffusion formulation is sufficiently general to work with both periodic and non-periodic systems, generating atom types, coordinates, as well as unit cell parameters unconditionally or with classifier-free guidance. Our emphasis on joint representations of molecules and crystals also aligns with recent work on general-purpose models for molecular simulation and property prediction [Shoghi et al., 2024, Batatia et al., 2023, Wood et al., 2025]. Similarly, our unified latent diffusion framework can be scaled up with larger and more diverse chemical datasets towards foundation models for generative chemistry. Latent diffusion models Latent diffusion models [Vahdat et al., 2021, Rombach et al., 2022] propose to do diffusion in the latent space of an autoencoder instead of the raw input space of high-dimensional continuous signals such as pixels, and have been extremely successful for generating images, audio, and videos [Esser et al., 2024, Betker et al., 2023, Brooks et al., 2024]. Latent diffusion is a more computationally efficient alternative to standard diffusion as the autoencoder’s latent space captures semantically meaningful features of the data, allowing for more efficient diffusion in a lower-dimensional space followed by reconstruction to the original data space. The original formulation was further improved by Diffusion Transformers (DiTs) [Peebles and Xie, 2023], which demonstrated that standard Transformers provide a highly scalable architecture for the denoiser network. Latent diffusion models can easily incorporate conditioning on additional information like class labels, text prompts, or infilling masks through classifier-based [Dhariwal and Nichol, 2021] and classifier-free guidance [Ho and Salimans, 2022] as well as finetuning [Zhang et al., 2023a, Dai et al., 2023]. 
Our work is the first to leverage latent diffusion for jointly generating the complex multi-modal product of categorical and continuous data types that constitute 3D molecular systems. This allows us to shift the complexity of handling atom types, coordinates, and unit cell parameters into an autoencoder while performing the generative process in latent space with DiTs, which is simpler and more scalable than alternative multi-modal equivariant diffusion models.

Equivariance and generative modelling Geometric Graph Neural Networks [Duval et al., 2023a], particularly roto-translationally equivariant networks, have been used as denoisers in diffusion and flow matching approaches for generative modelling of molecular structures. E(3)-Equivariant Graph ConvNets [Satorras et al., 2021] are widely used as denoisers for molecule [Hoogeboom et al., 2022, Xu et al., 2023, Schneuing et al., 2024] and crystal generation [Jiao et al., 2023, Miller et al., 2024]. More expressive architectures, like higher-order tensor networks [Liao et al., 2024b] and Invariant Point Attention [Jumper et al., 2021], have been applied to protein structure generation [Watson et al., 2023, Yim et al., 2023b] and protein-ligand docking [Corso et al., 2023].

However, equivariant networks are computationally expensive and harder to scale than standard Transformers in terms of data and model size. This is especially relevant for diffusion models, where denoisers are iteratively run hundreds of times during inference [Song et al., 2021], and typically process inputs as fully connected graphs to capture global structure [Joshi, 2025]. Recent work has challenged the necessity of 3D inductive biases and equivariance for generative structure prediction tasks, showing that standard Transformers can achieve strong performance on biomolecular complexes [Abramson et al., 2024] and small molecule conformations [Wang et al., 2024, O Pinheiro et al., 2023].
Non-equivariant models have also shown promising results for protein structure generation [Chu et al., 2024, Martinkus et al., 2024, Lu et al., 2025]. In the same vein, our work leverages the simplicity and scalability of standard Transformers for generative modelling across both periodic and non-periodic domains, demonstrating that explicit equivariance and molecular inductive biases are not a strict requirement for generating valid and realistic 3D structures at scale.

4.5 Summary

In this chapter, we posed the following question: How can we build unified diffusion models that can generate both periodic materials and non-periodic molecular systems? Our solution, the All-atom Diffusion Transformer (ADiT), is a latent diffusion model based on two key ideas:

1. All-atom unified latent representations: We treat both periodic and non-periodic molecular systems as sets of atoms in 3D space and develop a unified representation with categorical and continuous attributes per atom. A Variational Autoencoder (VAE) [Kingma and Welling, 2014] embeds molecules and crystals into a shared latent space by training for all-atom reconstruction.

2. Latent diffusion using Transformers: We perform generative modelling in the latent space of the VAE encoder using a Diffusion Transformer (DiT) [Rombach et al., 2022, Peebles and Xie, 2023]. During inference, classifier-free guidance [Ho and Salimans, 2022] enables sampling new latents that can be reconstructed to valid molecules or crystals using the VAE decoder.

ADiTs can be trained jointly on both periodic and non-periodic 3D molecular structures, demonstrating broad generalizability. Training a single unified model on the QM9 molecular and MP20 materials datasets leads to state-of-the-art performance in both domains, exceeding specialized equivariant diffusion models on physics-based validations. DFT calculations reveal that ADiTs generate stable, unique, and novel crystals at a 5-6% S.U.N.
rate, a 25% improvement upon the 4-5% rates of previous methods. Joint training yields higher validity rates than QM9-only or MP20-only ADiT variants, demonstrating successful transfer learning between periodic and non-periodic domains. ADiTs also match or exceed state-of-the-art equivariant models on the GEOM-DRUGS dataset of molecules with hundreds of atoms.

ADiTs are a highly scalable architecture, achieving significant speedups in both training and inference compared to equivariant diffusion models. By using standard Transformers with minimal inductive biases for both the autoencoder and diffusion model, ADiTs can generate 10,000 samples in under 20 minutes on a single V100 GPU, an order of magnitude faster than baselines which take up to 2.5 hours on the same hardware. The practical efficiency of the DiT denoiser compared to equivariant networks allows us to scale ADiT to half a billion parameters while keeping data scale fixed. Our scaling analysis demonstrates that generative modelling performance improves predictably with model size, suggesting further gains are possible through continued scaling.

All together, our work is the first to develop unified generative models for both periodic and non-periodic molecular systems, with state-of-the-art performance on both molecules and crystals, while being conceptually simpler and computationally more efficient than previous domain-specific approaches. ADiTs represent a step towards broadly generalizable foundation models for generative chemistry.

Future work Several limitations point to promising future directions. First, we currently use relatively small datasets and systems for training, which may limit model generalization.
Scaling to larger and more diverse datasets such as Alexandria and the Cambridge Structural Database for crystals, ZINC for small molecules, and the Protein Data Bank for biomolecular complexes could significantly improve performance and enable learning of broadly applicable chemical principles. While we demonstrate success on small molecules and crystals of up to hundreds of atoms, we have not yet fully validated our approach on larger systems such as metal-organic frameworks or biomolecules containing thousands of atoms. Adapting ADiT to larger scales, while maintaining its unified representation across periodic and non-periodic systems, could enable powerful transfer learning capabilities – especially valuable for low-data domains. Relatedly, our current models only perform unconditional generation – extending to guided sampling or conditional generation based on experimental properties [Zeni et al., 2025], motif scaffolding [Watson et al., 2023], or molecular infilling [Schneuing et al., 2024] would enable practical inverse design applications in drug discovery, materials science, and beyond. In terms of architecture, it is often said that curating the latent space is the most important factor for good generative performance [Dieleman, 2025]. The current first stage autoencoder used in ADiT employs a straightforward approach: it uses a simple reconstruction objective based on regression-style losses for atom types, coordinates, and lattice parameters. Unlike autoencoders commonly used for images, audio and videos, it does not incorporate perceptual or adversarial losses [Esser et al., 2021, Rombach et al., 2022], which are considered crucial for capturing high-frequency details and ensuring realism in reconstructions. Their absence in ADiT suggests the latent space might not encode extremely fine-grained structural information, such as subtle conformational variations or precise atomic positioning. 
Additionally, the current autoencoder learns an all-atom latent representation where each atom is assigned an independent latent vector, without explicit spatial compression or hierarchical grouping of semantically related atoms. Relatedly, when generating new structures with the diffusion model, we need to provide the total number of atoms in advance, which may limit usability when atom counts are unknown. Future work could explore latent space designs that perform spatial downsampling during encoding, followed by upsampling during decoding and generation [Jaegle et al., 2021]. This would decouple the number of latents from the number of atoms, allowing more effective allocation of latent capacity when scaling to larger systems, as well as enabling the generation of structures with variable atom counts.

Part II
RNA Molecule Design

Chapter 5
gRNAde: Geometric Deep Learning for 3D RNA inverse design

RNA holds a unique position in biology due to its ability to encode genetic information via its sequence as well as catalyze reactions through complex 3D structural folding [Cech, 2024]. This dual functionality enables RNA to perform sophisticated computations within cells, from regulating gene expression to driving essential metabolic processes. Recent years have seen a surge of interest in RNA-based therapeutics, which target diseases at the genetic level and offer an alternative to traditional small molecule or protein drugs that treat symptoms [Damase et al., 2021]. Notable examples of RNAs at the forefront of biology include mRNA vaccines [Metkar et al., 2024] and CRISPR-based genomic medicine [Doudna and Charpentier, 2014].

Despite their promise, the rational design of RNA molecules is a significant challenge as the sequence-structure-function relationship is not as well established as it is for proteins.
The availability of extensive protein structure data in the Protein Data Bank (PDB) coupled with advances in deep learning have revolutionized protein structure prediction [Jumper et al., 2021] and design [Dauparas et al., 2022, Watson et al., 2023], achievements recognized by the Nobel Prize in Chemistry 2024. Computational RNA design with deep learning, however, remains comparatively underexplored, largely due to a scarcity of 3D structural data [Schneider et al., 2023]. The PDB contains approximately 7,000 RNA 3D structures, versus over 200,000 for proteins, with most RNA structures originating from a few well-studied families like tRNAs or ribosomal RNAs. This data limitation has meant that most RNA design tools either focus on secondary structure, neglecting 3D geometry [Churkin et al., 2018], or rely on non-learned algorithms with hand-crafted heuristics for aligning 3D RNA fragments [Han et al., 2017, Yesselman et al., 2019], which can be restrictive.

Beyond data scarcity, a further key technical challenge in RNA design is that RNAs are generally more dynamic than proteins. A single RNA sequence can adopt multiple distinct conformational states to perform and regulate complex biological functions [Ganser et al., 2019, Hoetzel and Suess, 2022]. Current computational RNA design often frames the task as an inverse problem: designing sequences for a single target secondary structure, thereby typically neglecting 3D geometry and conformational diversity. Yet, engineering novel biological functions effectively necessitates considering both the 3D structure and the dynamic conformational landscape of RNA [Vicens and Kieft, 2022, Ken et al., 2023].

[Figure 5.1 diagram labels: RNA Conformational Ensemble; Extract Backbones; Set of Backbone Geometric Graphs; Multi-state Graph Neural Network Encoder; Sequence Decoder; RNA Sequence (GAGCGU...); Fixed backbone re-design. Equivariant to: 3D roto-translations, node order, conformation order.]
Figure 5.1: 3D RNA inverse design with gRNAde. gRNAde is a generative model for RNA sequence design conditioned on backbone 3D structure(s). gRNAde processes one or more RNA backbone graphs (a conformational ensemble) via a multi-state GNN encoder which is equivariant to 3D roto-translation of coordinates as well as conformational state order, followed by conformational state order-invariant pooling and autoregressive sequence decoding.

This chapter introduces gRNAde, a geometric RNA design model that addresses the challenge of designing RNA sequences that fold into target 3D structures while accounting for conformational dynamics (Figure 5.1). gRNAde is a structure-conditioned RNA language model that leverages a novel multi-state GNN to generate sequences conditioned on one or more 3D backbone structures. Our computational evaluations demonstrate that gRNAde significantly outperforms existing physics-based RNA design methods while being orders of magnitude faster. Furthermore, gRNAde introduces novel capabilities for RNA design, including zero-shot ranking of functional mutants and multi-state design for structurally flexible RNAs. Open source code is available: github.com/chaitjo/geometric-rna-design.

5.1 The gRNAde Model

Figure 5.1 illustrates the RNA inverse design problem: the task of designing new RNA sequences conditioned on one or more 3D backbone structures. Given the 3D coordinates of one or more backbone structures, gRNAde generates sequences that are likely to fold into those target shapes. The underlying assumption behind inverse design is that structure determines function: by designing sequences that fold into specific structures, we can create molecules with desired biological activities [Huang et al., 2016].
[Figure 5.2 diagram labels: RNA backbone atoms (P, O5’, C5’, C4’, C3’, C2’, C1’, O4’, O3’); ribose sugar and base; 3-bead representation (P, C4’, N1/N9); coarse-grained features (3x distances, 3x angles, 3x torsions); node (nucleotide), backbone chain, 3D neighbourhood.]
Figure 5.2: RNA backbone structures featurized as 3D graphs. Each RNA nucleotide is a node in the graph, consisting of 3 coarse-grained beads at the coordinates of P, C4’, and N1 (pyrimidines) or N9 (purines), which are used to compute initial geometric features and edges to nearest neighbours in 3D space. Backbone chain adapted from Ingraham et al. [2019b].

gRNAde employs a structure-conditioned, autoregressive language model architecture with geometric GNN encoder and decoder layers [Jing et al., 2020, Dauparas et al., 2022]. The key innovation is a multi-state GNN encoder that processes conformational ensembles of 3D backbones, followed by permutation-invariant pooling across conformational states and autoregressive sequence decoding. This multi-state design capability distinguishes gRNAde from existing single-structure inverse folding approaches, enabling the design of sequences that are compatible with multiple conformational states simultaneously. The architecture naturally handles both single-state and multi-state design within the same framework.

5.1.1 RNA Conformational Ensembles as Geometric Multi-graphs

Featurization. The input to gRNAde is an RNA to be re-designed. For instance, this could be a set of PDB files with 3D backbone structures for the given RNA (a conformational ensemble) and the corresponding sequence of n nucleotides. As shown in Figure 5.2, gRNAde builds a geometric graph representation for each input structure:

1. We start with a 3-bead coarse-grained representation of the RNA backbone, retaining the coordinates for P, C4’, N1 (pyrimidine) or N9 (purine) for each nucleotide [Dawson et al., 2016].
These ‘pseudotorsional’ features describe RNA backbones completely in most cases while reducing the size of the torsional space from 7 angles down to 3 to prevent overfitting [Wadley et al., 2007].

2. Each nucleotide $i$ is assigned a node in the geometric graph with the 3D coordinate $\vec{x}_i \in \mathbb{R}^3$ corresponding to the centroid of the 3 bead atoms. Random Gaussian noise with standard deviation 0.1 Å is added to coordinates during training to prevent overfitting on crystallisation artifacts, following Dauparas et al. [2022]. Nodes are initialized with geometric features analogous to the featurization used in protein design [Ingraham et al., 2019b, Jing et al., 2020]: (a) forward and reverse unit vectors along the backbone from the 5’ end to the 3’ end ($\vec{x}_{i+1} - \vec{x}_i$ and $\vec{x}_i - \vec{x}_{i-1}$); and (b) unit vectors, distances, angles, and torsions from each C4’ to the corresponding P and N1/N9.

[Figure 5.3 panel labels: 1D edges, k=32 nearest neighbours along sequence; 2D edges, base pairing and pseudoknots; 3D edges, k=32 nearest neighbours in 3D.]
Figure 5.3: Types of edges in gRNAde’s input graphs. gRNAde constructs geometric graphs with three edge types: 1D edges connecting sequential nucleotides, 2D edges from secondary structure annotations, and 3D edges between nucleotides that are close in 3D space.

3. Each node is connected to edges of three types (Figure 5.3): (a) its 32 nearest neighbours in 3D space based on Euclidean distance $\|\vec{x}_i - \vec{x}_j\|_2$; (b) its 32 nearest neighbours along the RNA backbone based on sequence distance $|j - i|$; and (c) all nodes involved in base pairs and pseudoknots from the secondary structure corresponding to the 3D backbone.
Edge features for an edge from node $j$ to $i$ are initialized as: (a) the unit vector from the source to destination node, $\vec{x}_j - \vec{x}_i$; (b) the distance in 3D space, $\|\vec{x}_j - \vec{x}_i\|_2$, encoded by 32 radial basis functions; (c) the distance along the backbone, $j - i$, encoded by 32 sinusoidal positional encodings; and (d) the type of edge (1D, 2D, or 3D) encoded as a one-hot vector.

Multi-graph representation. Given a set of $k$ structures in the input conformational ensemble, each RNA backbone is featurized as a separate geometric graph $\mathcal{G}^{(k)} = (\mathbf{A}^{(k)}, \mathbf{S}^{(k)}, \vec{\mathbf{V}}^{(k)})$ with the scalar features $\mathbf{S}^{(k)} \in \mathbb{R}^{n \times f}$, vector features $\vec{\mathbf{V}}^{(k)} \in \mathbb{R}^{n \times f' \times 3}$, and $\mathbf{A}^{(k)}$, an $n \times n$ adjacency matrix. For clear presentation and without loss of generality, we omit edge features and use $f$, $f'$ to denote scalar/vector feature channels.

The input to gRNAde is thus a set of geometric graphs $\{\mathcal{G}^{(1)}, \dots, \mathcal{G}^{(k)}\}$ which is merged into what we term a ‘multi-graph’ representation of the conformational ensemble, $\mathcal{M} = (\mathbf{A}, \mathbf{S}, \vec{\mathbf{V}})$, by stacking the set of scalar features $\{\mathbf{S}^{(1)}, \dots, \mathbf{S}^{(k)}\}$ into one tensor $\mathbf{S} \in \mathbb{R}^{n \times k \times f}$ along a new axis for the set size $k$. Similarly, the set of vector features $\{\vec{\mathbf{V}}^{(1)}, \dots, \vec{\mathbf{V}}^{(k)}\}$ is stacked into one tensor $\vec{\mathbf{V}} \in \mathbb{R}^{n \times k \times f' \times 3}$. Lastly, the set of adjacency matrices $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(k)}\}$ are merged via a union $\cup$ into one single joint adjacency matrix $\mathbf{A}$.

[Figure 5.4 diagram labels: Backbone 1 ... Backbone k; GVP-GNN encoder layers (x L); Deep Set Pooling; Node Embeddings; Autoregressive decoder (x L); Partial sequence (GAGCG_); Per-node logits (A, G, C, U); Sampling.]
Figure 5.4: gRNAde model architecture. One or more RNA backbone structures are encoded via SE(3)-equivariant GNN layers to build latent representations of each nucleotide’s local 3D environment per state. Representations from multiple states are pooled via permutation invariant Deep Sets and fed to an autoregressive decoder to predict probabilities over four bases (A, G, C, U).
During training, the model minimizes cross-entropy loss between predicted and true sequence identities.

5.1.2 Multi-state GNN for Encoding Conformational Ensembles

The gRNAde model, illustrated in Figure 5.4, processes one or more RNA backbone graphs via a multi-state GNN encoder which is equivariant to 3D roto-translation of coordinates as well as to the ordering of conformational states, followed by conformation order-invariant pooling and sequence decoding. We describe each component in the following sections.

Multi-state GNN encoder. When representing conformational ensembles as a multi-graph, each node feature tensor contains three axes: (#nodes, #conformations, feature channels). We perform message passing on the multi-graph adjacency to independently process each conformational state, while maintaining permutation equivariance of the updated feature tensors along both the first (#nodes) and second (#conformations) axes. This works by operating on only the feature channels axis and generalising the PyTorch Geometric [Fey and Lenssen, 2019b] message passing class to account for the extra conformations axis; see Figure 5.5 for details. We use multiple O(3)-equivariant GVP-GNN [Jing et al., 2020] layers to update scalar features $\mathbf{s}_i \in \mathbb{R}^{k \times f}$ and vector features $\vec{\mathbf{v}}_i \in \mathbb{R}^{k \times f' \times 3}$ for each node $i$:

$$\mathbf{m}_i, \vec{\mathbf{m}}_i := \sum_{j \in \mathcal{N}_i} \mathrm{MSG}\big( (\mathbf{s}_i, \vec{\mathbf{v}}_i),\ (\mathbf{s}_j, \vec{\mathbf{v}}_j),\ \mathbf{e}_{ij} \big), \tag{5.1}$$

$$\mathbf{s}'_i, \vec{\mathbf{v}}'_i := \mathrm{UPD}\big( (\mathbf{s}_i, \vec{\mathbf{v}}_i),\ (\mathbf{m}_i, \vec{\mathbf{m}}_i) \big), \tag{5.2}$$

where MSG, UPD are Geometric Vector Perceptrons, a generalization of MLPs to take tuples of scalar and vector features as input and apply O(3)-equivariant non-linear updates. The overall GNN encoder is SO(3)-equivariant due to the use of reflection-sensitive input features (dihedral angles) combined with O(3)-equivariant GVP-GNN layers.
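The multi-graph construction and the structure of Equations 5.1 and 5.2 can be illustrated with a minimal NumPy sketch. This is not the gRNAde implementation: the functions `build_multigraph` and `multistate_layer` are hypothetical names, only scalar features are used, edge features are omitted, and the Geometric Vector Perceptrons are replaced by single linear maps with a tanh non-linearity. The point is only to show that applying shared weights along the feature channels axis processes each conformation independently and is equivariant to permutations of both the nodes and the conformations axes.

```python
import numpy as np

def build_multigraph(scalars, adjacencies):
    """Stack k per-conformation feature arrays and take the union of adjacencies.

    scalars: list of k arrays with shape (n, f)
    adjacencies: list of k boolean arrays with shape (n, n)
    """
    S = np.stack(scalars, axis=1)            # (n, k, f): new axis for the k states
    A = np.logical_or.reduce(adjacencies)    # (n, n): union of all edge sets
    return A, S

def multistate_layer(A, S, W_msg, W_upd):
    """One simplified message-passing layer over a multi-graph (assumed form).

    Weights act on the feature-channel axis only, so every conformation is
    processed independently with shared parameters.
    """
    messages = np.tanh(S @ W_msg)                                    # (n, k, f)
    # Sum messages from neighbours j of node i, separately per conformation.
    aggregated = np.einsum("ij,jkf->ikf", A.astype(S.dtype), messages)
    return np.tanh(np.concatenate([S, aggregated], axis=-1) @ W_upd)

rng = np.random.default_rng(0)
n, k, f = 6, 3, 4
scalars = [rng.normal(size=(n, f)) for _ in range(k)]        # toy features
adjacencies = [rng.random((n, n)) < 0.3 for _ in range(k)]   # toy edge sets
A, S = build_multigraph(scalars, adjacencies)
W_msg = rng.normal(size=(f, f))
W_upd = rng.normal(size=(2 * f, f))
out = multistate_layer(A, S, W_msg, W_upd)

# Permuting the conformations axis of the input permutes the output identically.
perm = np.array([2, 0, 1])
out_perm = multistate_layer(A, S[:, perm], W_msg, W_upd)
print(np.allclose(out_perm, out[:, perm]))
```

Because the joint adjacency is a union over states, it is unaffected by reordering the conformations, which is what makes the check at the end hold in this simplified setting.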
Our multi-state GNN encoder is easy to implement in any message passing framework and can be used as a plug-and-play extension for any geometric GNN pipeline to incorporate the multi-state inductive bias. It serves as an elegant alternative to batching all the conformations, which we found required major alterations to message passing and pooling.

[Figure 5.5 diagram labels: Set of RNA Conformations; Multi-graph tensor; Permute nodes; Permute conformers; Rotate 3D features; Update with multi-GNN layer.]
Figure 5.5: Multi-graph tensor representation of conformational ensembles, and the associated symmetry groups acting on each axis. We process a set of $k$ RNA backbone conformations with $n$ nodes each into a tensor representation. Each multi-state GNN layer updates the tensor while being equivariant to the underlying symmetries. Here, we show a tensor of 3D vector-type features with shape $n \times k \times 3$. As depicted in the equivariance diagram, the updated tensor must be equivariant to permutation $S_n$ of $n$ nodes for axis 1, permutation $S_k$ of $k$ conformational states for axis 2, and rotation SO(3)/O(3) of the 3D features for axis 3.

Conformation order-invariant pooling. The final encoder representations in gRNAde account for multi-state information while being invariant to the permutation of the states. To achieve this, we perform a Deep Set pooling [Zaheer et al., 2017] over the conformations axis after the final encoder layer to reduce $\mathbf{S} \in \mathbb{R}^{n \times k \times f}$ and $\vec{\mathbf{V}} \in \mathbb{R}^{n \times k \times f' \times 3}$ to $\mathbf{S}' \in \mathbb{R}^{n \times f}$ and $\vec{\mathbf{V}}' \in \mathbb{R}^{n \times f' \times 3}$:

$$\mathbf{S}', \vec{\mathbf{V}}' := \frac{1}{k} \sum_{i=1}^{k} \big( \mathbf{S}[:, i],\ \vec{\mathbf{V}}[:, i] \big). \tag{5.3}$$

A simple sum or average pooling does not introduce any new learnable parameters to the pipeline and is flexible to handle a variable number of conformations, enabling both single-state and multi-state design with the same model.
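As a minimal sketch of the average pooling in Equation 5.3, assuming the tensor shapes given above, the reduction is a mean over the conformations axis. The function name `pool_conformations` and the toy dimensions are hypothetical and chosen for illustration only.

```python
import numpy as np

def pool_conformations(S, V):
    """Average scalar and vector features over the conformations axis (axis 1)."""
    return S.mean(axis=1), V.mean(axis=1)

rng = np.random.default_rng(0)
n, k, f, f_vec = 5, 3, 8, 2
S = rng.normal(size=(n, k, f))           # scalar features per node and state
V = rng.normal(size=(n, k, f_vec, 3))    # vector features per node and state

S_pooled, V_pooled = pool_conformations(S, V)
print(S_pooled.shape, V_pooled.shape)

# Reordering the states leaves the pooled features unchanged.
perm = rng.permutation(k)
S_perm, V_perm = pool_conformations(S[:, perm], V[:, perm])
print(np.allclose(S_pooled, S_perm), np.allclose(V_pooled, V_perm))
```

The same function accepts any number of states, including a single one, in which case the pooled features equal the features of that state.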
Sequence decoding and loss function. We feed the final encoder representations after pooling, $\mathbf{S}'$, $\vec{\mathbf{V}}'$, to autoregressive GVP-GNN decoder layers to predict the probability of the four possible base identities (A, G, C, U) for each node/nucleotide. Decoding proceeds according to the RNA sequence order from the 5’ end to 3’ end. gRNAde is trained in a self-supervised manner by minimising a cross-entropy loss (with label smoothing value of 0.05) between the predicted probability distribution and the ground truth identity for each base. During training, we use teacher forcing [Williams and Zipser, 1989] where the true identity of the base is fed as input to the decoder at each step, encouraging the model to stay close to the ground-truth sequence.

[Figure 5.6 diagram labels: Backbone graph; gRNAde; Sampled sequences; True sequence; Sequence recovery; Predicted structures (2D: EternaFold, 3D: RhoFold); True structure; Structural self-consistency scores (2D: MCC, 3D: RMSD, TM, GDT).]
Figure 5.6: In-silico evaluation metrics for gRNAde designed sequences. We consider (1) sequence recovery, the percentage of native nucleotides recovered in designed samples, (2) self-consistency scores, which are measured by ‘forward folding’ designed sequences using a structure predictor and measuring how well 2D and 3D structure are recovered (we use EternaFold and RhoFold for 2D/3D structure prediction, respectively). We also report (3) perplexity, the model’s estimate of the likelihood of a sequence given a backbone.

Sampling. When using gRNAde for inference and designing new sequences, we iteratively sample the base identity for a given nucleotide from the predicted conditional probability distribution, given the partially designed sequence up until that nucleotide/decoding step. We can modulate the smoothness or sharpness of the probability distribution by using a temperature parameter. At lower temperatures, for instance ≤1.0, we expect higher native sequence recovery and lower diversity in gRNAde’s designs.
At higher temperatures, the model produces more diverse designs by sampling from a smoothed probability distribution. gRNAde can also use unordered decoding [Dauparas et al., 2022] with minimal impact on performance, as well as masking or logit biasing during sampling, depending on the design scenario at hand. This enables gRNAde to perform partial re-design of RNA sequences, retaining specified nucleotide identities while designing the rest of the sequence. In Chapter 6, we demonstrate this capability for designing functional ribozymes with gRNAde.

5.1.3 Evaluation Metrics for Designed Sequences

Inverse folding models can generate large numbers of designed sequences for a given backbone structure, making in-silico evaluation metrics essential for prioritizing which sequences to pursue in wet lab experiments.

We primarily use native sequence recovery to compare gRNAde to existing methods on RNA backbones with known native sequences. Sequence recovery is defined as the average percentage of nucleotides in designed sequences that match the ground truth sequence. While recovery is the most widely used metric for biomolecule inverse design [Dauparas et al., 2022], it can be misleading for RNAs where alternative nucleotide pairings may form identical structures.

Thus, we also use self-consistency scores to measure how well the designed sequences are predicted to recover the target 2D and 3D structure (Figure 5.6):

• Secondary structure self-consistency score, where we ‘forward fold’ the sampled sequences using a secondary structure prediction tool (we used EternaFold [Wayment-Steele et al., 2022a]) and measure the average Matthews Correlation Coefficient (MCC) to the groundtruth secondary structure, represented as a binary adjacency matrix. MCC values range between -1 and +1, where +1 represents a perfect match, 0 an average random prediction and -1 an inverse prediction. This measures how well the designs recover base pairing patterns.
• Tertiary structure self-consistency scores, where we ‘forward fold’ the sampled sequences using a 3D structure prediction tool (we used RhoFold [Shen et al., 2022]) and compute the average RMSD, TM-score and GDT_TS to the groundtruth C4’ coordinates to measure how well the designs recover global structural similarity and 3D conformations.

Lastly, we can also consider perplexity, a measure of the average number of bases that the model is selecting from when designing each nucleotide. Formally, perplexity is the average exponential of the negative log-likelihood of the sampled sequences. A ‘perfect’ model which regurgitates the groundtruth¹ would have perplexity of 1, while a perplexity of 4 means that the model is making random predictions (the model outputs a uniform probability over 4 possible bases). Perplexity does not require a ground truth structure to calculate, and can also be used for ranking sequences as it is the model’s estimate of the compatibility of a sequence with the input backbone structure.

Limitations. While self-consistency metrics such as ‘designability’ (e.g., scRMSD ≤ 2 Å) and perplexity have been shown to correlate with experimental success in protein design [Watson et al., 2023], precise designability thresholds remain to be established for RNA. As a starting point, pairs of structures with TM-score ≥ 0.45 or GDT_TS ≥ 0.5 are known to correspond to roughly the same fold [Zhang et al., 2022]. A major limitation for in-silico evaluation of 3D RNA design compared to proteins is the relatively poor performance of current RNA structure prediction tools. We will address these evaluation challenges for real-world RNA design campaigns in Chapter 6, where we present a design pipeline validated through wet lab experiments.

¹ Note that such a model would be practically useless for real design tasks.
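The sequence-level metrics of this section can be sketched in a few lines of plain Python. The function names are hypothetical and the sketch makes two assumptions that go beyond the text: perplexity is computed per sequence as the exponential of the mean per-nucleotide negative log-likelihood, and the MCC treats every unordered pair of positions (i, j) with i < j as one binary label, rather than operating on the full symmetric adjacency matrix. Structure prediction itself (EternaFold, RhoFold) is outside the scope of the sketch.

```python
import math

def sequence_recovery(designed, native):
    """Fraction of positions where the designed base matches the native base."""
    assert len(designed) == len(native)
    return sum(d == t for d, t in zip(designed, native)) / len(native)

def perplexity(probabilities):
    """Exponential of the mean negative log-likelihood over positions.

    probabilities: model probability assigned to the chosen base at each position.
    """
    nll = -sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(nll)

def pairs_mcc(predicted_pairs, true_pairs, n):
    """Matthews correlation coefficient between two sets of base pairs
    for a sequence of n nucleotides (assumed upper-triangle convention)."""
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    true = {tuple(sorted(p)) for p in true_pairs}
    total = n * (n - 1) // 2            # number of candidate pairs (i < j)
    tp = len(predicted & true)
    fp = len(predicted - true)
    fn = len(true - predicted)
    tn = total - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Toy inputs for illustration only.
print(sequence_recovery("GAGCGU", "GAGCAU"))
print(perplexity([0.25] * 10))   # uniform over 4 bases: approximately 4
print(pairs_mcc({(0, 7), (2, 5)}, {(0, 7), (1, 6)}, n=8))
```

A model that assigns probability 1 to every chosen base gives a perplexity of 1, and a uniform distribution over four bases gives approximately 4, matching the two reference points stated above.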
5.2 Experimental Setup

3D RNA structure dataset. We create a machine learning-ready dataset for RNA inverse design using RNASolo [Adamczyk et al., 2022], a novel repository of RNA 3D structures extracted from solo RNAs, protein-RNA complexes, and DNA-RNA hybrids in the PDB. We used all currently known RNA structures at resolution ≤ 4.0 Å resulting in 4,223 unique RNA sequences for which a total of 12,011 structures are available (RNASolo date cutoff: 31 October 2023). As inverse folding is a per-node/per-nucleotide level task, our training data contains over 2.8 million unique nucleotides. Further dataset statistics are available in Appendix Figure C.2, illustrating the diversity of our dataset in terms of sequence length, number of structures per sequence, as well as structural variations among conformations per sequence.

Structural clustering. In order to ensure that we evaluate gRNAde’s generalization ability to novel RNAs, we cluster the 4,223 unique RNAs into groups based on structural similarity. We use US-align [Zhang et al., 2022] with a similarity threshold of TM-score > 0.45 for clustering, and ensure that we train, validate and test gRNAde on structurally dissimilar clusters (see next paragraph). We also provide utilities for clustering based on sequence homology using CD-HIT [Fu et al., 2012], which leads to splits containing biologically dissimilar clusters of RNAs.

Splits to evaluate generalization. After clustering, we split the RNAs into training (∼4000 samples), validation and test sets (100 samples each) to evaluate two different design scenarios:

1. Single-state split. This split is used to fairly evaluate gRNAde for single-state design on a set of RNA structures of interest from the PDB identified by Das et al. [2010], which mainly includes riboswitches, aptamers, and ribozymes. We identify the structural clusters belonging to the RNAs identified in Das et al. [2010] and add all the RNAs in these clusters to the test set (100 samples).
The remaining clusters are randomly added to the training and validation splits.

2. Multi-state split. This split is used to test gRNAde’s ability to design RNA with multiple distinct conformational states. We order the structural clusters based on median intra-sequence RMSD among available structures within the cluster². The top 100 samples from clusters with the highest median intra-sequence RMSD are added to the test set. The next 100 samples are added to the validation set and all remaining samples are used for training.

Validation and test samples come from clusters with at most 5 unique sequences, in order to ensure diversity. Any samples that were not assigned clusters are directly appended to the training set. We also directly add very large RNAs (> 1000 nts) to the training set, as it is unlikely that we want to design very large RNAs. We exclude very short RNA strands (< 10 nts).

² For each RNA sequence, we compute the pairwise C4’ RMSD among all available structures. We then compute the median RMSD across all sequences within each structural cluster.

[Figure 5.7 panels: (a) gRNAde outperforms Rosetta; native sequence recovery: ViennaRNA (2D only) 0.269, FARNA 0.321, RDesign 0.430, Rosetta 0.450, gRNAde 0.568. (b) Perplexity correlates with recovery; x-axis: Rosetta seq. recovery, y-axis: gRNAde seq. recovery, colour: gRNAde perplexity (1.1 to 1.6).]
Figure 5.7: gRNAde compared to Rosetta for single-state design. (a) We benchmark native sequence recovery of gRNAde, RDesign, Rosetta, FARNA and ViennaRNA on 14 RNA structures of interest identified by Das et al. [2010]. gRNAde obtains higher native sequence recovery rates (56% on average) compared to Rosetta (45%) and all other methods. (b) Sequence recovery per sample for Rosetta and gRNAde, shaded by gRNAde’s perplexity for each sample.
gRNAde’s perplexity is correlated with native sequence recovery for designed sequences (Pearson correlation: -0.76, Spearman correlation: -0.67). Full results on the single-state test set are available in Appendix C.1 and per-RNA results in Appendix Table C.2.

Evaluation metrics. For a given data split, we evaluate models on the held-out test set by designing 16 sequences (sampled at temperature 0.1) for each test data point and computing averages for each of the metrics described in Section 5.1.3: native sequence recovery, structural self-consistency scores and perplexity. We employ early stopping by reporting test set performance for the model checkpoint for the epoch with the best validation set recovery. Standard deviations are reported across 3 consistent random seeds for all models.

Hyperparameters. All models use 4 encoder and 4 decoder GVP-GNN layers, with 128 scalar/16 vector node features, 64 scalar/4 vector edge features, and dropout probability 0.5, resulting in 2,147,944 trainable parameters. All models are trained for a maximum of 50 epochs using the Adam optimiser with an initial learning rate of 0.0001, which is reduced by a factor 0.9 when validation performance plateaus with patience of 5 epochs. Ablation studies of key modelling decisions are available in Appendix Table C.1.

5.3 Results

5.3.1 Single-state RNA Design Benchmark

We set out to compare gRNAde to Rosetta, a state-of-the-art physics-based toolkit for biomolecular modelling and design [Leman et al., 2020]. We reproduced the benchmark setup from Das et al. [2010] for Rosetta’s fixed backbone RNA sequence design workflow on 14 RNA structures of interest from the PDB, which mainly includes riboswitches, aptamers, and ribozymes (full listing in Table C.2). We trained gRNAde on the single-state split detailed in Section 5.2, explicitly excluding the 14 RNAs as well as any structurally similar RNAs in order to ensure that we fairly evaluate gRNAde’s generalization abilities vs. Rosetta.
gRNAde improves sequence recovery over Rosetta. In Figure 5.7, we compare gRNAde’s native sequence recovery for single-state design with numbers taken from Das et al. [2010] for Rosetta, FARNA (a predecessor of Rosetta), ViennaRNA (the most popular 2D inverse folding method), and RDesign [Tan et al., 2023] (a concurrent GNN-based RNA inverse folding model). gRNAde has higher recovery of 56% on average compared to 45% for Rosetta, 32% for FARNA, 27% for ViennaRNA, and 43% for RDesign. See Appendix Table C.2 for per-RNA results and Appendix C.1 for full results on the single-state test set of 100 RNAs.

gRNAde is significantly faster than Rosetta. In addition to superior sequence recovery, gRNAde is significantly faster than Rosetta for high-throughput design pipelines. Training gRNAde from scratch takes roughly 2–6 hours on a single A100 GPU, depending on the exact hyperparameters. Once trained, gRNAde can design hundreds of sequences for backbones with hundreds of nucleotides in ∼10 seconds on CPU and ∼1 second with GPU acceleration. On the other hand, Rosetta takes on the order of hours to produce a single design due to performing expensive Monte Carlo optimisation until convergence on CPU.³ Deep learning methods like gRNAde are arguably easier to use since no expert customization is required and setup is easier compared to Rosetta (the latest builds do not include RNA recipes), making RNA design more broadly accessible.

gRNAde’s perplexity correlates with sequence recovery. In Figure 5.7b, we plot native sequence recovery per sample for Rosetta vs. gRNAde, shaded by gRNAde’s average perplexity for each sample. Perplexity is an indicator of the model’s confidence in its own prediction (lower perplexity implies higher confidence) and appears to be correlated with native sequence recovery. In the subsequent Section 5.3.3, we further demonstrate the utility of gRNAde’s perplexity for zero-shot ranking of RNA fitness landscapes.
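For readers who wish to run the same kind of check on their own designs, the Pearson and Spearman correlations between per-sample perplexity and recovery can be computed as in the sketch below. The helper names are hypothetical, the Spearman implementation assumes no tied values, and the numbers are made-up toy values for illustration, not results from this chapter.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1D sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Pearson correlation of the ranks (assumes no tied values)."""
    def rank(a):
        return np.argsort(np.argsort(a))
    return pearson(rank(x), rank(y))

# Hypothetical per-sample values, for illustration only.
perplexities = [1.15, 1.22, 1.30, 1.41, 1.55]
recoveries = [0.71, 0.64, 0.66, 0.52, 0.43]

print(pearson(perplexities, recoveries))
print(spearman(perplexities, recoveries))
```

A negative coefficient indicates that samples with lower perplexity tend to have higher recovery, which is the direction of the relationship reported in Figure 5.7b.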
5.3.2 Multi-state RNA Design Benchmark

Structured RNAs often adopt multiple distinct conformational states to perform biological functions [Ken et al., 2023]. For instance, riboswitches adopt at least two distinct functional conformations: a ligand-bound (holo) and unbound (apo) state, which helps them regulate and control gene expression [Stagno et al., 2017]. If we were to attempt single-state inverse design for such RNAs, each backbone structure may lead to a different set of sampled sequences. It is not obvious how to select the input backbone as well as designed sequence when using single-state models for multi-state design. gRNAde’s multi-state GNN, described in Section 5.1.2, directly ‘bakes in’ the multi-state nature of RNA into the architecture and designs sequences explicitly conditioned on multiple states.

³ Rosetta documentation states that “runs on RNA backbones longer than ∼ten nucleotides take many minutes or hours”. We have not run Rosetta ourselves as recent builds do not include RNA recipes. https://www.rosettacommons.org/docs/latest/application_documentation/rna/rna-design

[Figure 5.8. Panel (a), per-sample sequence recovery: native sequence recovery for RDesign (1 state) and gRNAde with 1, 2, 3, 4 and 5 states; annotated values 0.385, 0.481, 0.507, 0.531, 0.511 and 0.510, respectively. Panel (b), per-nucleotide recovery vs. structural flexibility: native sequence recovery against nucleotide paired probability (left) and nucleotide RMSD in Å (right) for gRNAde models with 1, 3 and 5 states.]

Figure 5.8: Multi-state design benchmark. (a) Multi-state gRNAde shows a consistent 3-5% improvement over the single-state variant in terms of sequence recovery on the multi-state test set of 100 RNAs, with the best performance obtained using 3 states. (b) When plotting sequence recovery per-nucleotide, multi-state gRNAde improves over a single-state model for structurally flexible regions of RNAs, as characterised by nucleotides that tend to undergo changes in base pairing (left) and nucleotides with higher average RMSD across multiple states (right). Marginal histograms in blue show the distribution of values. We plot performance for one consistent random seed across all models; collated results and ablations are available in Appendix C.1.

In order to evaluate gRNAde’s multi-state design capabilities, we trained equivalent single-state and multi-state gRNAde models on the multi-state split detailed in Section 5.2, where the validation and test sets contain progressively more structurally flexible RNAs as measured by median RMSD among multiple available states for an RNA.

Multi-state gRNAde consistently boosts sequence recovery. In Figure 5.8a, we compared a single-state variant of gRNAde with otherwise equivalent multi-state models (with up to 5 states) in terms of native sequence recovery. Multi-state variants show a consistent 3-5% improvement, with the best performance obtained using 3 states. This trend holds to a lesser extent on the single-state benchmark, where the multi-state model is used with only one state as input. This suggests that seeing multiple states during training can be useful for teaching gRNAde about RNA conformational flexibility and can improve performance even for single-state design tasks. As a caveat, it is worth noting that multi-state models consume more GPU memory than an equivalent single-state model during mini-batch training (approximate peak GPU usage for max. number of states = 1: 12GB, 3: 28GB, 5: 50GB on a single A100 with at most 3000 total nodes in a mini-batch).
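The flexibility measures referred to in this section (median RMSD among the available states of an RNA, and per-nucleotide deviation across states) can be sketched as follows. This is only an illustration under a strong assumption: the conformational states are taken to be already superimposed, with one representative 3D coordinate per nucleotide. The function names are hypothetical, and the exact procedure used to build the split is the one described in Section 5.2.

```python
import math
from itertools import combinations
from statistics import median

Coords = list[tuple[float, float, float]]  # one (x, y, z) point per nucleotide


def rmsd(a: Coords, b: Coords) -> float:
    """Root-mean-square deviation between two pre-aligned conformations."""
    sq = [sum((p - q) ** 2 for p, q in zip(u, v)) for u, v in zip(a, b)]
    return math.sqrt(sum(sq) / len(sq))


def median_pairwise_rmsd(states: list[Coords]) -> float:
    """Median RMSD over all pairs of states: one scalar flexibility value per RNA."""
    return median(rmsd(a, b) for a, b in combinations(states, 2))


def per_nucleotide_deviation(states: list[Coords]) -> list[float]:
    """For each nucleotide, the RMS distance to its mean position across states."""
    n_states = len(states)
    out = []
    for i in range(len(states[0])):
        pts = [s[i] for s in states]
        mean = tuple(sum(c) / n_states for c in zip(*pts))
        sq = [sum((x - m) ** 2 for x, m in zip(p, mean)) for p in pts]
        out.append(math.sqrt(sum(sq) / n_states))
    return out


# Toy RNA with 2 nucleotides and 3 states: only the second nucleotide moves.
states = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 4.0, 0.0)],
]
print(median_pairwise_rmsd(states))
print(per_nucleotide_deviation(states))
```

In practice the states would first need to be superimposed (for example with the Kabsch algorithm) before such deviations are meaningful; that step is omitted here to keep the sketch dependency-free.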
[Figure 5.9. Plot title: Max Fitness by Sample Size and Condition (n=74,943; simulations=10,000). x-axis: selected sequences for assaying (1, 10, 50, 100, 200, 449, 1500, 5000, 10493); y-axis: expected ‘max’ fold change over WT; secondary axis: fitness. Conditions: random, n_mut==1, n_mut<=2, gRNAde. Annotated reference levels: 1st best (fit.: 2.88), 2nd best (fit.: 2.40), 5th best (fit.: 2.23), 10th best (fit.: 1.96), 20th best (fit.: 1.61), 50th best (fit.: 1.35), wildtype.]

Figure 5.9: Retrospective study of gRNAde for ranking ribozyme mutant fitness. Using the backbone structure and mutational fitness landscape data from an RNA polymerase ribozyme [McRae et al., 2024], we retrospectively analyse how well we can rank variants at multiple design budgets using random selection vs. gRNAde’s perplexity for mutant sequences conditioned on the backbone structure (catalytic subunit 5TU). Note that gRNAde is used zero-shot here, i.e. it was not fine-tuned on any assay data. For stochastic strategies, bars indicate median values, and error bars indicate the interquartile range estimated from 10,000 simulations per strategy and design budget. At low throughput design budgets of up to ∼500 sequences, selecting mutants using gRNAde outperforms random baselines in terms of the expected maximum improvement in fitness over the wild type. In particular, gRNAde performs better than single site saturation mutagenesis, even when all single mutants are explored (total of 449 single mutants, 10,493 double mutants for the catalytic subunit 5TU in McRae et al. [2024]). See Appendix Figure C.1 for results on scaffolding subunit t1.

Improved recovery in structurally flexible regions. In Figure 5.8b, we evaluated gRNAde’s multi-state sequence recovery at a fine-grained, per-nucleotide level to understand the source of performance gains.
Multi-state GNNs improve sequence recovery over the single-state variant on structurally flexible nucleotides, as characterised by changes in base pairing/secondary structure and higher average RMSD between 3D coordinates across states.

5.3.3 Zero-shot Ranking of RNA Fitness Landscape

Lastly, we explored the use of gRNAde as a zero-shot ranker of mutants in RNA engineering campaigns. Given the backbone structure of a wild type RNA of interest as well as a candidate set of mutant sequences, we can compute gRNAde’s perplexity for each sequence conditioned on the backbone structure, as a proxy for whether that sequence folds into the structure. Perplexity is inversely related to the likelihood of a sequence conditioned on a structure, as described in Section 5.1.3. We can then rank sequences based on how ‘compatible’ they are with the backbone structure in order to select a subset to be experimentally validated in wet labs.

Retrospective analysis on ribozyme fitness landscape. A recent study by McRae et al. [2024] determined a cryo-EM structure of a dimeric RNA polymerase ribozyme at 5Å resolution,⁴ along with fitness landscapes of ∼75K mutants for the catalytic subunit 5TU and ∼48K mutants for the scaffolding subunit t1. We design a retrospective study using this data of (sequence, fitness value) pairs where we simulate an RNA engineering campaign with the aim of improving catalytic subunit fitness over the wild type 5TU sequence. We consider various design budgets ranging from hundreds to thousands of sequences selected for experimental validation, and compare 4 unsupervised approaches for ranking/selecting variants: (1) random choice from all ∼75,000 sequences; (2) random choice from all 449 single mutant sequences; (3) random choice from all single and double mutant sequences (as sequences with higher mutation order tend to be less fit); and (4) negative gRNAde perplexity (lower perplexity is better).
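A retrospective comparison of this kind can be sketched in a few lines. The following is a minimal illustration rather than the analysis code used for Figure 5.9: the (perplexity, fitness) schema, the function names and the toy values are assumptions, and quartiles are approximated by indexing into the sorted simulation results.

```python
import random
from statistics import median


def best_of_top_k(variants, k):
    """Max fitness among the k variants with the lowest perplexity.

    `variants` is a list of (perplexity, fitness) pairs (hypothetical schema).
    """
    ranked = sorted(variants, key=lambda v: v[0])  # low perplexity first
    return max(fitness for _, fitness in ranked[:k])


def random_baseline(fitnesses, k, n_sim=10_000, seed=0):
    """Median and approximate interquartile range of the best fitness
    found when k variants are drawn at random, over n_sim simulations."""
    rng = random.Random(seed)
    k = min(k, len(fitnesses))
    maxima = sorted(max(rng.sample(fitnesses, k)) for _ in range(n_sim))
    iqr = (maxima[n_sim // 4], maxima[(3 * n_sim) // 4])
    return median(maxima), iqr


# Toy landscape of (perplexity, fitness) pairs; values are made up.
variants = [(1.2, 0.5), (1.1, 2.0), (3.0, 2.9), (1.5, 0.1), (2.0, 1.0)]
fitnesses = [f for _, f in variants]

print(best_of_top_k(variants, k=2))
print(random_baseline(fitnesses, k=2, n_sim=1000))
```

The restricted baselines (single mutants only, or single and double mutants) would correspond to passing a filtered list of fitness values to `random_baseline`.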
For each design budget and ranking approach, we compute the expected maximum change in fitness over the wild type that could be achieved by screening as many variants as allowed in the given design budget. We run 10,000 simulations to compute confidence intervals for the 3 random baselines.

gRNAde outperforms random baselines in low design budget scenarios. Figure 5.9 illustrates the results of our retrospective study. At low design budgets of up to hundreds of sequences, which are relevant in the case of a low-throughput fitness screening assay, gRNAde outperforms all random baselines in terms of the maximum change in fitness over the wild type. The top 10 mutants as ranked by gRNAde contain a sequence with 4-fold improved fitness, while the top 200 lead to a 5-fold improvement.⁵ Note that gRNAde is used zero-shot here, i.e. it was not fine-tuned on any assay data.

Perspective. Overall, it is promising that gRNAde’s perplexity correlates with experimental fitness measurements out-of-the-box (zero-shot) and can be a useful ranker of mutant fitness in our retrospective study. In realistic design scenarios, improvements could likely be obtained by fine-tuning gRNAde on a small amount of experimental fitness data. This retrospective study acts as a sanity check before committing to wet lab validation of gRNAde designs in Chapter 6.

We see random mutagenesis and directed evolution-based approaches as complementary to inverse design approaches like gRNAde [Breaker and Joyce, 1994]. Random mutagenesis can be thought of as local exploration around a wild type sequence, optimising fitness within an ‘island’ of activity. Structure-based design approaches are akin to global jumps in sequence space, with the potential to find new islands further away from the wild type [Huang et al., 2016].

⁴ This RNA was not present in gRNAde’s training data, which contains structures at ≤4.0Å resolution.

⁵ As a caveat, the fitness assays from McRae et al. [2024] used for creating the landscape have inherent noise and cannot easily differentiate between mutants of similar activity.

5.4 Related Work

RNA inverse folding. Most tools for RNA inverse folding focus on secondary structure without considering 3D geometry [Churkin et al., 2018, Runge et al., 2019] and approach the problem through the lens of energy optimisation [Ward et al., 2023]. Rosetta fixed backbone re-design [Das et al., 2010] is the only energy optimisation-based approach that accounts for 3D structure. Deep neural networks such as gRNAde can incorporate 3D structural constraints and are orders of magnitude faster than optimisation-based approaches; this is particularly attractive for high-throughput design pipelines as solving the inverse folding optimisation problem is NP-hard [Bonnet et al., 2020].

RNA structure design. Inverse folding models for protein design have often been coupled with backbone generation models which design structural backbones conditioned on various design constraints [Watson et al., 2023, Ingraham et al., 2023]. Current approaches for RNA backbone design use classical (non-learnt) algorithms for aligning 3D RNA motifs [Han et al., 2017, Yesselman et al., 2019], which are small modular pieces of RNA that are believed to fold independently. Such algorithms may be restricted by the use of hand-crafted heuristics, and we have explored the first data-driven generative models for RNA backbone design in follow-up work [Anand et al., 2024].

RNA structure prediction. There have been several recent efforts to adapt protein folding architectures such as AlphaFold2 [Jumper et al., 2021] and RosettaFold [Baek et al., 2021] for RNA structure prediction [Li et al., 2023b, Wang et al., 2023, Baek et al., 2024]. A previous generation of models used GNNs as ranking functions together with Rosetta energy optimisation [Watkins et al., 2020, Townshend et al., 2021].
None of these architectures aims to capture the conformational flexibility of RNAs, unlike gRNAde, which represents RNAs as multi-state conformational ensembles. Nor can structure prediction tools be used directly for RNA design tasks, as they are not generative models.

RNA language models. Self-supervised language models have been developed for predictive and generative tasks on RNA sequences, including general-purpose models [Chen et al., 2022, Penic et al., 2024, Zhao et al., 2024] as well as models developed for specific RNA families [Li et al., 2023a, Sumi et al., 2024, Shulgina et al., 2024]. RNA sequence data repositories are orders of magnitude larger than those for RNA structure (e.g. RiNALMo is trained on 36 million sequences). However, standard language models can only implicitly capture RNA structure and dynamics through sequence co-occurrence statistics, which can pose a challenge for designing structured RNAs. RibonanzaNet [He et al., 2024] represents a recent effort in developing structure-informed RNA language models by supervised training on experimental readouts from chemical mapping, although it cannot be directly used for RNA design either.

5.5 Summary

In this chapter, we introduced gRNAde, a novel geometric deep learning model for RNA sequence design conditioned on one or more 3D backbone structures. gRNAde represents a significant advance over Rosetta [Leman et al., 2020], the state-of-the-art physics-based tool for 3D RNA inverse design. On a benchmark of fixed backbone design for 14 biologically relevant RNA structures from the PDB identified by Das et al. [2010], gRNAde obtains higher native sequence recovery rates (56% on average) compared to Rosetta (45% on average). Additionally, gRNAde is significantly faster, sampling 100+ designs in 1 second for an RNA of 60 nucleotides on an A100 GPU (<10 seconds on CPU) compared to the reported hours for Rosetta on CPU.
gRNAde enables new capabilities which were previously not possible with Rosetta, including multi-state design for structurally flexible RNAs. Multi-state gRNAde improves sequence recovery by up to 5% over an equivalent single-state model on a benchmark of structurally flexible RNAs, especially for surface nucleotides which undergo positional or secondary structural changes. gRNAde’s GNN is also the first geometric deep learning architecture for explicit multi-state biomolecule representation learning. The model is generic and can be repurposed for other learning tasks on conformational ensembles, including multi-state protein design.

We further show that gRNAde can be used for zero-shot ranking of mutants in RNA engineering campaigns. In a retrospective analysis of mutational fitness landscape data for an RNA polymerase ribozyme [McRae et al., 2024], we show how gRNAde’s perplexity, which is inversely related to the likelihood of a sequence conditioned on a backbone structure, can be used to rank mutants based on fitness in an unsupervised manner. We find that gRNAde outperforms random mutagenesis for improving fitness over the wild type in low throughput scenarios.

Overall, this chapter has focused on computational evaluations of gRNAde. In the next chapter, we will transition from retrospective in-silico benchmarks to real-world applications, using gRNAde for practical RNA design problems with wet lab experimental validation. We will also discuss gRNAde’s limitations and avenues for future work at the end of the next chapter.

Chapter 6

Inverse Design of RNA Structure and Function with gRNAde

In Chapter 5, we introduced gRNAde, a geometric deep learning model for RNA inverse design that generates sequences conditioned on target 3D structures. This chapter presents wet lab validation of gRNAde’s capabilities through biochemical and functional experiments.
We focus on two RNA design problems with broad biological relevance: (1) Designing complex pseudoknotted RNA structures, which are important 3D functional elements across biology but have historically been difficult to design using existing computational methods; and (2) Going beyond static structure and designing functional RNA enzymes (ribozymes), such as RNA polymerases that catalyze RNA-templated RNA replication [Johnston et al., 2001].

6.1 An RNA Inverse Design Pipeline with gRNAde

Our RNA inverse design pipeline integrates gRNAde for sequence generation with RibonanzaNet [He et al., 2024], an RNA language model for sequence-to-structural property prediction, to identify promising designed RNA sequences. This pipeline has been calibrated through multiple RNA design campaigns with experimentalists at Stanford University and the MRC Laboratory of Molecular Biology. The screening metrics are selected to correlate with experimental success, enabling high-throughput computational identification of designs that are most likely to fold into target structures and perform desired functions.

Input. The gRNAde pipeline translates a multi-modal structural ‘prompt’ into novel sequences predicted to adopt a desired fold. This design specification, analogous to a textual prompt for a large language model, is highly flexible; it can consist of a target pseudoknotted secondary structure, 3D backbone coordinates, and partial sequence constraints that must be preserved (Figure 6.1 (A)).

[Figure 6.1. Schematic with four panels. A: inputs (partial sequence, e.g. from fitness landscape; target pseudoknotted sec. structure; target backbone 3D structure). B: gRNAde, structure-conditioned RNA language model (GNN structure encoder, LM decoder), producing designed sequences. C: RibonanzaNet, RNA structure foundation model, producing pred. sec. structure and pred. SHAPE reactivity per position. D: structural metrics (MCC, MAE, OpenKnot score) computed against the target sec. structure and target SHAPE, top N selection, wet lab validation.]

Figure 6.1: The gRNAde pipeline for RNA inverse design. The automated workflow integrates deep learning-based sequence generation with computational screening to identify optimal candidates for experimental validation.
A. The pipeline takes multi-modal design constraints as input, optionally including a target pseudoknotted secondary structure, a target 3D backbone structure, and partial sequence constraints such as those derived from fitness landscapes.
B. gRNAde, a structure-conditioned RNA language model, uses these constraints to generate a large and diverse library of candidate sequences, typically on the order of one million. These candidates are then passed to the computational filtering stage.
C. In the filtering stage, each designed sequence is evaluated by RibonanzaNet, an RNA structure foundation model. RibonanzaNet predicts the secondary structure and a per-nucleotide SHAPE chemical reactivity profile for each candidate.
D. RibonanzaNet predictions for each design are scored against the target secondary structure and SHAPE profile using metrics such as the Matthews Correlation Coefficient (MCC), Mean Absolute Error (MAE), and the OpenKnot Score. The top-ranked designs are then selected for wet-lab synthesis and validation.

The design pipeline proceeds through three sequential stages:

Step 1: Sequence generation. gRNAde generates a large number of candidate sequences (typically 1 million) conditioned on the input structure and specified constraints (Figure 6.1 (B)). During generation, we vary the sampling temperature and random seed to control diversity, with lower temperatures around 0.1 producing sequences closer to the native sequence and higher temperatures up to 1.0 yielding more diverse candidates.
Step 2: Structural profile prediction. Each generated sequence is evaluated using RibonanzaNet to predict its chemical reactivity profile and secondary structure (Figure 6.1 (C)), providing computational proxies for experimental folding behavior. We describe the rationale for using RibonanzaNet in the following paragraphs.

Step 3: Design scoring and selection. Candidates are scored and filtered using the predicted structural profiles as follows (Figure 6.1 (D)):

1. Secondary Structure Score: For natural RNA targets, we compute the Matthews Correlation Coefficient (MCC) between predicted and target secondary structures, retaining only sequences exceeding a high correlation threshold (e.g., MCC > 0.9). This ensures high likelihood of target pseudoknotted structure formation. We omit this criterion for synthetic targets, as RibonanzaNet’s secondary structure predictor was fine-tuned on natural RNAs.
2. Chemical Reactivity Score: We quantify the Mean Absolute Error (MAE) between each sequence’s predicted reactivity profile and a target profile. The target profile is obtained either from experimental measurements or by applying RibonanzaNet to the native sequence.
3. OpenKnot Score: This metric measures the likelihood of pseudoknotted structure formation based on predicted chemical reactivity patterns, used for pseudoknot design tasks in Section 6.2.
4. Final Selection: We rank sequences by the primary metric (typically chemical reactivity score or OpenKnot score) and select the top N unique designs after removing duplicates.

Due to the computational efficiency of gRNAde and RibonanzaNet, we can screen a large number of designs in parallel. We generate and score up to 1 million sequences for each design campaign, which takes under 12 hours on a single NVIDIA A100 GPU. After removing duplicates, this process typically yields hundreds of thousands of unique sequences, depending on the target structure size and constraint stringency.
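The scoring and selection logic of Step 3 can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline's actual code: secondary structures are represented as sets of base pairs (i, j) with i < j, MCC is computed over all possible position pairs (one common convention, which may differ from the one used in practice), the candidate dictionary schema is hypothetical, and MAE is used as the primary ranking metric.

```python
import math


def base_pair_mcc(pred_pairs: set, target_pairs: set, length: int) -> float:
    """Matthews Correlation Coefficient between two sets of base pairs."""
    tp = len(pred_pairs & target_pairs)
    fp = len(pred_pairs - target_pairs)
    fn = len(target_pairs - pred_pairs)
    tn = length * (length - 1) // 2 - tp - fp - fn  # all other position pairs
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def reactivity_mae(pred: list[float], target: list[float]) -> float:
    """Mean absolute error between predicted and target reactivity profiles."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def select_designs(candidates, target_pairs, target_reactivity, top_n, mcc_min=0.9):
    """Deduplicate, filter by MCC, then rank by reactivity MAE (lower is better).

    Each candidate is a dict with keys 'seq', 'pairs' and 'reactivity'
    (a hypothetical schema standing in for the structure predictor's outputs).
    """
    seen, kept = set(), []
    for c in candidates:
        if c["seq"] in seen:
            continue  # keep only the first copy of each sequence
        seen.add(c["seq"])
        mcc = base_pair_mcc(c["pairs"], target_pairs, len(c["seq"]))
        if mcc > mcc_min:
            kept.append((reactivity_mae(c["reactivity"], target_reactivity), c["seq"]))
    kept.sort()
    return [seq for _, seq in kept[:top_n]]


# Toy 8-nucleotide target with two base pairs; all values are made up.
target_pairs = {(0, 7), (1, 6)}
target_reactivity = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
candidates = [
    {"seq": "GGAAAACC", "pairs": {(0, 7), (1, 6)}, "reactivity": [0.1, 0.2, 0.9, 0.8, 0.9, 0.9, 0.1, 0.1]},
    {"seq": "GCAAAAGC", "pairs": {(0, 7)}, "reactivity": [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]},
    {"seq": "GGAAAACC", "pairs": {(0, 7), (1, 6)}, "reactivity": [0.1, 0.2, 0.9, 0.8, 0.9, 0.9, 0.1, 0.1]},
    {"seq": "CCAAAAGG", "pairs": {(0, 7), (1, 6)}, "reactivity": [0.4, 0.4, 0.6, 0.6, 0.6, 0.6, 0.4, 0.4]},
]
print(select_designs(candidates, target_pairs, target_reactivity, top_n=2))
```

For synthetic targets, where the MCC criterion is omitted, the same function could be called with `mcc_min` set below -1 so that no candidate is filtered out at that stage.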
Rationale for RibonanzaNet and chemical reactivity. We selected RibonanzaNet as our primary evaluation tool based on several key advantages. RibonanzaNet is an RNA structure language model that was pre-trained on approximately 2 million RNA sequences paired with their experimental chemical reactivity profiles from high-throughput assays, which it learns to predict [He et al., 2024]. This diverse training dataset encompasses both natural and synthetic RNAs, making it well-suited for evaluating designed sequences.

Chemical probing assays measure per-nucleotide reactivities to small molecule modifiers, providing information about both base pairing and tertiary interactions [Strobel et al., 2018, Cao et al., 2024]. We utilize the 2A3 chemical modifier (2-Aminopyridine-3-carboxylic acid imidazolide) [Marinus et al., 2021], which exhibits minimal nucleotide bias compared to other chemical probes such as DMS, making it particularly suitable for evaluating designed sequences. 2A3 reactivity is high for unpaired and accessible nucleotides but substantially reduced for base-paired nucleotides or those involved in tertiary interactions such as pseudoknots, providing a robust signal for structural assessment.

RibonanzaNet’s ability to predict these chemical reactivity profiles enables quantitative evaluation of how well a designed sequence is likely to fold into the target structure, as reactivity patterns directly reflect the underlying 3D conformation. Furthermore, RibonanzaNet was fine-tuned on pseudoknotted secondary structures and achieves state-of-the-art performance on secondary structure prediction benchmarks, giving us two complementary metrics for evaluating designed sequences: predicted chemical reactivity profiles and secondary structure predictions.

We evaluated alternative approaches including 3D structure prediction tools such as AlphaFold 3 [Abramson et al., 2024] and RNA-specific variants [Li et al., 2023b, Wang et al., 2023].
However, these methods performed poorly on both native and designed sequences for our applications, particularly for synthetic RNAs where multiple sequence alignments are unavailable, as accuracy is known to be substantially reduced for deep learning models without MSAs [Das et al., 2023, Kretsch et al., 2025].

6.2 Expert-level Design of RNA Pseudoknotted Structures

6.2.1 The Pseudoknot Design Problem

RNA pseudoknots. RNA pseudoknots are complex three-dimensional structural motifs formed when single-stranded regions base pair with complementary sequences, creating interwoven stem-loop structures. These sophisticated elements play crucial roles across biology: modulating gene regulation through ribosomal frameshifting, enabling viral replication in SARS-CoV-2 and other RNA viruses, and functioning as catalytic ribozymes [Staple and Butcher, 2005].

Despite their biological significance, fundamental questions remain unresolved. The folding pathways for pseudoknot formation, their structural dynamics, and thermodynamic properties are not fully understood [Vicens and Kieft, 2022]. Current computational methods struggle with their topological complexity, limiting structure prediction from sequence [Rivas and Eddy, 1999]. As a result, the rational design of pseudoknots with specified properties remains an open challenge with significant implications for engineering synthetic ribozymes and functional RNAs.

Eterna OpenKnot Benchmark. Eterna is an online platform and video game for computational RNA design that hosts a global community of researchers and citizen-scientists. The platform regularly releases new RNA design challenges where participants submit sequences designed to satisfy specific structural and functional requirements [Lee et al., 2014, Wayment-Steele et al., 2022a,b]. Submitted designs undergo experimental validation at Prof. Rhiju Das’s lab at Stanford University through high-throughput chemical probing assays.
In this section, we present gRNAde’s performance on the OpenKnot Benchmark, a series of RNA design challenges hosted on Eterna that specifically target pseudoknotted structures. The goal is to build a diverse library of experimentally validated pseudoknotted RNAs to advance fundamental understanding of RNA folding and function. While the Eterna community of expert designers has contributed numerous designs to these challenges over the years, the manual design process is slow and hard to scale, limiting the exploration of pseudoknotted sequence space. To address these limitations, we deployed gRNAde as a fully automated computational design tool on Eterna, enabling direct performance comparison against both human experts and other automated RNA design tools.

6.2.2 Setup

OpenKnot Round 7a and 7b. The OpenKnot Benchmark consists of multiple rounds, each focusing on new pseudoknotted RNA structure targets for which participants submit designed sequences. We entered gRNAde into OpenKnot Round 7a and 7b, which featured RNAs up to 100 nucleotides and 240 nucleotides in length, respectively. For each round, there were a total of 20 target structures or puzzles, including 10 structures from natural RNAs and 10 synthetic pseudoknots. Natural targets span diverse structured RNAs, including riboswitches, ribozymes, ribosomal RNAs, and viral frameshift elements, while synthetic targets include novel pseudoknots that try to push the limits of theoretically possible pseudoknotted structures. For the natural targets, we usually have a reliable 3D structure provided by the Eterna organizers, either from the Protein Data Bank or from high quality structure modelling tools. For the synthetic targets, usually only secondary structure information is provided, along with less reliable 3D models, as the ideal sequence for these targets is unknown.
Design budget and constraints. For each puzzle, we submitted 40 gRNAde designs in total via two approaches: (1) 20 sequences generated with only secondary structure constraints; and (2) 20 sequences generated with both secondary structure and 3D backbone constraints. No sequence constraints were included, allowing gRNAde to design the entire sequence from scratch.

To provide a direct automated baseline for comparison, the organizers also independently submitted 20 designs using Rosetta’s RNA inverse design protocol [12], the current state-of-the-art physics-based method for 3D RNA design. As an additional sanity check, the organizers evaluated 10 replicates of the wildtype (native) RNA sequence for each puzzle. While wildtype sequences are expected to achieve high scores, particularly for natural RNAs, they may not represent the optimal sequence for structure formation, which is precisely what the design challenge aims to discover. The competition also included other, contemporaneous AI-based methods (MPNN and RFdiffusion).

Evaluation using OpenKnot Score. To evaluate each submitted design, the organizers measure an experimental OpenKnot Score (ranging from 0 to 100) that quantifies the likelihood of a sequence forming the target pseudoknotted structure. A score above 90 indicates high confidence that the sequence will fold into the desired structure, while scores below 90 may still represent successful designs but with less certainty based on the chemical reactivity data. The OpenKnot score is computed as the average of two complementary metrics that assess different aspects of pseudoknot formation:

• The Eterna Classic Score evaluates chemical reactivity consistency across all probed positions. Positions predicted to be base-paired but showing high chemical reactivity (>0.5) are penalized, as are positions predicted to be unpaired but showing low reactivity (<0.125). This metric captures how well the overall secondary structure prediction matches the experimental chemical probing data.
• The Crossed Pair Quality Score applies the same evaluation criteria but focuses specifically on nucleotides involved in pseudoknotted base pairs, i.e. those that cross other base pairs in the secondary structure. This targeted assessment is crucial for pseudoknot evaluation, as these crossing interactions define the three-dimensional topology that distinguishes pseudoknots from simpler secondary structures.

It is important to note that the OpenKnot score is based on 1D chemical reactivity data, and it is possible for designs to achieve high scores by satisfying the metric without perfectly forming the target 3D structure; further validation, such as compensatory rescue experiments, is planned to confirm the 3D accuracy of top OpenKnot Benchmark designs.

6.2.3 Results

gRNAde achieves expert-level performance on short targets. In Figure 6.2, we present the OpenKnot scores for designs from gRNAde, Rosetta, expert human designers, and wildtype sequences across all 20 target structures of up to 100 nucleotides from OpenKnot Round 7a. gRNAde achieved a 100% success rate on natural targets and 90% success rate on synthetic targets, representing a substantial improvement over Rosetta’s 40% and 70% success rates on natural and synthetic targets, respectively.

Remarkably, gRNAde matches the performance of expert human designers, who achieved identical success rates of 100% on natural targets and 90% on synthetic targets. This is a
gRNAde designs improve over native sequences Notably, gRNAde designs consistently outperform wildtype sequences across most targets, with native sequences achieving lower success rates of 80% on natural targets and 40% on synthetic targets compared to gRNAde’s 100% and 90% respectively. This demonstrates that gRNAde can design idealized sequences that are better suited for forming target pseudoknotted structures than their naturally occurring counterparts. While natural sequences evolved under multiple selective pressures—including functional constraints, evolutionary history, and cellular context—gRNAde focuses exclusively on structural optimization for the specified target. Visualizations of chemical reactivity profiles Figure 6.2 (E) and (F) illustrate specific exam- ples where gRNAde designs achieve superior OpenKnot scores compared to wildtype sequences for both natural and synthetic targets from OpenKnot Round 7a. The chemical reactivity profiles reveal a clear distinction: gRNAde-designed sequences successfully form the target pseudoknot- ted structure, as evidenced by the characteristic reactivity patterns consistent with the intended base pairing and tertiary interactions. In contrast, the corresponding wildtype sequences ex- hibit reactivity profiles indicating alternative structural conformations, suggesting they fold into non-target structures rather than the desired pseudoknot. 121 WT Rosetta gRNAde P11: PN.v282 - WT vs. best gRNAde design A B DC E F P03: ZMP Riboswitch - WT vs. best gRNAde design Figure 2 | gRNAde achieves expert-level accuracy in the Eterna OpenKnot Benchmark for RNA pseudoknot design. WT (Score = 90.0) gRNAde (Score = 94.3) WT (Score = 80.5) gRNAde (Score = 97.5) 122 Figure 6.2. gRNAde achieves expert-level accuracy in the Eterna OpenKnot Benchmark for RNA pseudoknot design. 
Performance of wildtype sequences, Rosetta, MPNN, RFdiffusion, gRNAde, and expert human designers in the Eterna OpenKnot Round 3 challenge, which targeted pseudoknotted RNAs of up to 100 nucleotides. A, B. Results for 10 natural RNA targets. C, D. Results for 10 synthetic RNA targets. Left-sided panels A and C show the distribution of OpenKnot scores for individual designs across all puzzles (success threshold > 90, red dashed line). Right-sided panels B and D show the overall success rate, defined as the percentage of puzzles for which at least one design scored above 90. gRNAde achieves success rates of 100% (natural) and 90% (synthetic), matching expert human performance and substantially outperforming physics-based Rosetta (40% and 70% success rates, respectively). E, F. Molecular validation of design success through chemical probing. Nucleotides are overlaid on the target secondary structures and colored by reactivity, with darker reds indicating higher reactivity and greater accessibility for unpaired positions. Conversely, nucleotides part of base pairs and pseudoknots are expected to have lower reactivity. E. For the natural ZMP Riboswitch target, the best gRNAde design (right, score = 94.3) shows a chemical reactivity profile largely consistent with the target fold, whereas the wildtype sequence (left, score = 90.0) shows anomalous reactivity in the loop region around position 50-55, suggest- ing misfolding. F. For the synthetic PN.v282 target, the gRNAde design (right, score = 97.5) again shows a superior reactivity pattern compared to the wildtype sequence (left, score = 80.5), which exhibits high reactivity in various paired regions, indicating disrupted base pairing. 123 A B C D Supplementary Figure 1 | gRNAde maintains competitive performance on long RNA pseudoknots in the Eterna OpenKnot Benchmark. Supplementary Figures 124 Figure 6.3. gRNAde maintains competitive performance on long RNA pseudoknots in the Eterna OpenKnot Benchmark. 
Performance of wildtype sequences, Rosetta, MPNN, RFdiffusion, gRNAde, and expert human designers for the Eterna OpenKnot Round 4 challenge, which targeted pseudoknotted RNAs of up to 240 nucleotides.
A. Distribution of OpenKnot scores for individual designs across 10 natural target puzzles (success threshold > 90, red dashed line).
B. The success rate on natural targets, defined as the percentage of puzzles for which at least one design scored above 90.
C. Distribution of OpenKnot scores for individual designs across 10 synthetic target puzzles (success threshold > 90, red dashed line).
D. The success rate on synthetic targets, defined as the percentage of puzzles for which at least one design scored above 90.
gRNAde achieves success rates of 67% on natural targets and 70% on synthetic targets. Rosetta (physics-based) could not be evaluated due to scalability issues. Native sequences achieved very low success rates on both categories, which demonstrates gRNAde’s ability to design idealized sequences for complex pseudoknotted RNA structures.

Competitive performance on long targets. OpenKnot Round 7b evaluates design performance on significantly larger targets of up to 240 nucleotides, representing a substantial increase in structural complexity compared to the short targets in Round 7a. Figure 6.3 presents the OpenKnot scores for designs from gRNAde, expert human designers, and wildtype sequences across all 20 target structures.

gRNAde achieved success rates of 67% on natural targets and 70% on synthetic targets. While these results represent a decrease from the success rates observed on short targets in Round 7a, expert human designers also experienced reduced performance at this scale. Rosetta could not be evaluated on these larger targets due to computational scalability limitations. Importantly, gRNAde designs substantially outperformed wildtype sequences, which achieved success rates of only 0% on natural targets and 10% on synthetic targets.
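The score and success-rate definitions used throughout this section can be made concrete with a simplified sketch. It is a reading of the textual description in Section 6.2.2 only, and the official Eterna scoring may differ in its details: each sub-score is assumed to be the percentage of positions that are not penalized, base pairs are given as (i, j) tuples with i < j, and all names and toy values are hypothetical.

```python
def consistency_score(reactivity, paired, positions):
    """Percentage of `positions` whose reactivity agrees with the pairing state.

    Paired positions are penalized if reactivity > 0.5; unpaired positions
    are penalized if reactivity < 0.125 (thresholds quoted in the text).
    """
    positions = list(positions)
    if not positions:
        return 0.0
    ok = 0
    for i in positions:
        if paired[i]:
            ok += reactivity[i] <= 0.5
        else:
            ok += reactivity[i] >= 0.125
    return 100.0 * ok / len(positions)


def crossed_positions(pairs):
    """Nucleotides in base pairs that cross another pair (pseudoknotted pairs)."""
    crossed = set()
    for a, b in pairs:
        for c, d in pairs:
            if a < c < b < d:  # (a, b) and (c, d) interleave
                crossed.update((a, b, c, d))
    return crossed


def openknot_score(reactivity, pairs):
    """Average of the all-position score and the crossed-pair score."""
    n = len(reactivity)
    paired = [False] * n
    for i, j in pairs:
        paired[i] = paired[j] = True
    classic = consistency_score(reactivity, paired, range(n))
    crossed = consistency_score(reactivity, paired, sorted(crossed_positions(pairs)))
    return 0.5 * (classic + crossed)


def success_rate(scores_by_puzzle, threshold=90.0):
    """Percentage of puzzles with at least one design scoring above threshold."""
    hits = sum(any(s > threshold for s in scores) for scores in scores_by_puzzle.values())
    return 100.0 * hits / len(scores_by_puzzle)


# Toy 9-nucleotide pseudoknot: pairs (0, 5) and (3, 8) cross each other.
pairs = {(0, 5), (3, 8)}
reactivity = [0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.9, 0.05, 0.1]
print(openknot_score(reactivity, pairs))

# Two of four toy puzzles have a design scoring above 90.
toy_scores = {"P1": [95.0, 80.0], "P2": [85.0, 60.0], "P3": [91.0], "P4": [90.0]}
print(success_rate(toy_scores))
```

Note that a design scoring exactly 90 does not count as a success in this sketch, following the "scored above 90" wording of the figure captions.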
Despite lower success rates, gRNAde designs consistently achieved high median OpenKnot scores, with most designs scoring above 80 even when falling short of the 90-point success threshold. This suggests that gRNAde captures many of the structural requirements for these complex targets, with room for optimization in achieving the highest confidence scores. While gRNAde’s performance on large targets represents an area for future improvement to close the gap with expert human designers, the fully automated nature of the approach enables high-throughput exploration of design space that would be impractical for manual design.

6.3 Inverse Design of Functional Polymerase Ribozymes

6.3.1 The Self-replicating Ribozyme Problem

Ribozymes and the RNA World

Having demonstrated gRNAde’s ability to design complex pseudoknotted structures, we now turn to functional RNA design. Our focus will be on RNA enzymes, or ribozymes. Ribozymes perform critical structural and catalytic roles in modern cells, including tRNA processing (RNaseP), RNA splicing (spliceosome, self-splicing introns), and translation (ribosome) [Cech, 2024]. Furthermore, in vitro evolution has led to the discovery of novel ribozyme activities not observed in nature [Wilson and Szostak, 1999], including polymerase ribozymes (PRs) capable of synthesizing complementary RNA strands [Wochner et al., 2011, Attwater et al., 2013, Tjhung et al., 2020]. Among these, polymerase ribozymes that can replicate themselves hold particular scientific significance. Their capacity for RNA-catalyzed, RNA-templated synthesis offers a pathway to RNA self-replication, a process central to the RNA World hypothesis, which postulates RNA as a cornerstone of early life [Woese, 1967, Orgel, 1968, Gilbert, 1986].
Triplet-based RNA polymerase ribozymes

A promising candidate for RNA self-replication is the triplet-based RNA polymerase ribozyme (TPR) [Attwater et al., 2018], which was evolved to use trinucleotide triphosphates (triplets) as substrates. McRae et al. [2024] recently determined the cryo-EM structure of the TPR at 5-Å resolution and connected structure to function via a comprehensive fitness landscape analysis. As shown in Figure 6.4 (A), the TPR is a heterodimeric RNA composed of a catalytic subunit (5TU) and a noncatalytic auxiliary subunit (t1), which together form a left-hand-like structure with thumb and fingers positioned at a 70° angle. The two subunits are held together by two kissing-loop (KL) interactions that are essential for polymerase function, as evidenced by the dramatic fitness reduction observed when these regions are mutated. The fitness landscape, combined with structural data, reveals that these KL interactions preorganize the TPR for optimal function. This mechanistic understanding of the structure-function relationship in TPR makes it an ideal candidate for rational design approaches.

In Chapter 5, we established how gRNAde can be used to rank mutants of TPR and identify functional mutations that improve its catalytic activity. While that retrospective study validated gRNAde’s ability to assess known variants, the mutants analyzed were obtained through directed evolution, which typically generates sequences that are relatively close in mutational space to the native 5TU sequence. Here, we address a more fundamental question: how diverse can sequences be while still performing RNA-templated RNA polymerisation? The scientific goal was not merely to maintain and, if possible, improve enzyme function, but to test the limits of its functional sequence diversity, i.e. the maximal edit distance of the 5TU “quasispecies” in RNA sequence [Lambert et al., 2025, Kun et al., 2005, Ekland et al., 1995].
This has direct implications for the plausibility of RNA-based self-synthesis and the emergence of early life. We will use gRNAde to perform large-scale generative “jumps” in sequence space, aiming to discover functional variants at mutational distances beyond those accessible through adaptive walks using conventional directed evolution or simple rational design.

6.3.2 Setup

Input constraints

Our goal is to design mutants of the 5TU catalytic subunit at varying mutational distances from the wildtype sequence. We provide gRNAde with the 3D backbone coordinates of the 5TU-t1 heterodimer and the corresponding pseudoknotted secondary structure as structural input. For sequence constraints, we fix the t1 sequence since it serves as the non-catalytic scaffolding subunit that positions 5TU for optimal function. To generate 5TU designs at specified mutational distances from the native sequence, we define position-specific design probabilities derived from the fitness landscape data of McRae et al. [2024]. The design probability assignment procedure combines two complementary metrics (Figure 6.4 (B) and Figure D.1):

1. Maximum single-mutant fitness: We bin the fitness values of the best single mutant at each position into four categories: below -4.0 (score: 0), -4.0 to -2.0 (score: 1), -2.0 to 0.0 (score: 2), and above 0.0 (score: 3).

2. Combinability score: We assess how well mutations at each position can be combined with other mutations to create improved higher-order variants, following Gantz et al. [2024] (see footnote 1). This metric is binned into four categories: below 0 (score: 0), 0 to 50 (score: 1), 50 to 100 (score: 2), and above 100 (score: 3).

The final design probability for each position is calculated as:

$$P_{\text{design}} = \frac{\text{Binned fitness score} + \text{Binned combinability score}}{6} \tag{6.1}$$

yielding values between 0.0 and 1.0, where higher probabilities indicate positions more amenable to mutation while maintaining function.
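The binning and Equation 6.1 can be illustrated with a minimal sketch. It assumes that a value lying exactly on a bin edge falls into the upper bin, which the text does not specify. The function names and the example inputs are hypothetical and are not taken from the gRNAde codebase.

```python
from bisect import bisect_right

# Bin edges as stated in the text. How a value exactly on an edge is binned
# is not specified there; assigning it to the upper bin is an assumption.
FITNESS_EDGES = (-4.0, -2.0, 0.0)
COMBINABILITY_EDGES = (0.0, 50.0, 100.0)


def binned_score(value, edges):
    """Return an integer score in 0..3: the number of edges at or below value."""
    return bisect_right(edges, value)


def design_probability(max_single_mutant_fitness, combinability):
    """Equation 6.1: sum of the two binned scores divided by the maximum of 6."""
    fitness_score = binned_score(max_single_mutant_fitness, FITNESS_EDGES)
    combinability_score = binned_score(combinability, COMBINABILITY_EDGES)
    return (fitness_score + combinability_score) / 6


# Hypothetical per-position inputs (fitness, combinability), for illustration only.
examples = [(-5.0, -10.0), (-3.0, 25.0), (-1.0, 120.0), (0.5, 150.0)]
probabilities = [design_probability(f, c) for f, c in examples]
```

Dividing by 6 normalises the sum of two scores that each range from 0 to 3, which is why the result always lies between 0.0 and 1.0.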
To preserve essential catalytic elements, we manually set the design probability to zero for functionally critical positions: the catalytic site (positions 41-43), template binding nucleotides (positions 22-24), and triple helix-forming adenosines (positions 25-30). These regions require precise nucleotide identities to perform RNA copying and are therefore held fixed during design.

Footnote 1: The combinability score quantifies how effectively mutations at a given position can be combined with mutations elsewhere to yield functional variants with positive, non-negatively epistatic fitness effects. It is computed as the sum of the fitness values of all non-negatively epistatic higher-order mutants involving a position, weighted by the mutation order.

Design budget and baselines

To systematically evaluate gRNAde’s performance and ablate the contribution of each pipeline component, we designed an experiment with the following specifications (Figure 6.4 (C)):

Mutational distance range: We generated designs spanning mutational distances from 15 to 40 mutations relative to the native 5TU sequence (152 nucleotides long), i.e. with roughly 10-26% of positions differing from the wildtype. This represents a significant extension beyond the coverage of the fitness landscape data from McRae et al. [2024], which had almost negligible sampling beyond 6 mutations. Exploring this extended mutational range, where fitness landscape data is sparse or absent, presents a particularly challenging test case for designing functional sequences and assessing the limits of structure-based design approaches.

gRNAde design generation: We generated 1 million candidate sequences by varying gRNAde’s sampling temperatures, random seeds, and re-sampling sequence constraints from the design probabilities described above. After deduplication and computational filtering through our RibonanzaNet pipeline, we selected the top 1,000 designs distributed as 40 designs per mutational distance.
Rational design baselines: To isolate the contributions of gRNAde’s sequence generation versus our computational filtering pipeline, we implemented two rational design heuristics as baselines. A straightforward rational design approach commonly used in RNA design is to randomly assign nucleotides at each position while respecting base pairing constraints: for paired positions, sample nucleotides from valid base pairs (A-U, G-C, G-U), and for unpaired positions, sample from all nucleotides (A, U, G, C). This strategy is simple but cannot account for 3D structural information during design. Using the same input constraints as gRNAde, we generated two sets of rational designs:

1. Rational design only: 1 million designs generated using position-specific nucleotide sampling from fitness landscape probabilities, with 20 designs selected randomly per mutational distance (500 total).

2. Rational design + RibonanzaNet filtering: The same rational generation approach, but applying our computational filtering pipeline to select the top 20 designs per mutational distance (500 total).

This experimental design enables direct assessment of gRNAde’s performance relative to rational approaches while quantifying the individual contributions of our generative model and computational filtering components.

Experimental validation

We validated our designs using a sequencing-based high-throughput activity assay, similar to McRae et al. [2024], which measures the ability of variants to perform templated synthesis of an arbitrary target RNA sequence. The setup consists of two phases:

1. Pre-selection library preparation: All designed sequences along with the wildtype 5TU are synthesized and pooled to create an input library representing the starting population.

2. Activity selection: The library is subjected to conditions that allow only functional ribozymes capable of copying to amplify, creating a post-selection library enriched for active sequences.
For each sequence, we compute fitness as the log2 enrichment relative to wildtype:

$$\text{Fitness} = \log_2\left(\frac{\text{FA}_{\text{post-selection}}}{\text{FA}_{\text{pre-selection}}}\right) - \log_2\left(\frac{\text{FA}_{\text{WT, post-selection}}}{\text{FA}_{\text{WT, pre-selection}}}\right) \tag{6.2}$$

where fractional abundance (FA) is defined as the number of sequencing reads for a given sequence divided by the total reads in the corresponding library. Positive fitness values indicate sequences that replicate more efficiently than wildtype, while negative values indicate reduced copying activity.

To ensure robust fitness estimates, we applied stringent filtering criteria: only sequences of the expected length (152) are considered for analysis, and sequences must have at least 5 counts in pre-selection libraries and at least 1 count in post-selection libraries to be classified as "active". We define two categories of activity based on a fitness value threshold:

• Active: fitness ≥ −1.86 (indicating some level of RNA polymerase activity, calibrated using a low-throughput gel assay).

• Inactive: fitness < −1.86 (indicating reduced but non-zero self-replication) or zero reads in post-selection libraries, which cannot be assigned a fitness value.

This classification allows us to assess both the overall success rate of designs in retaining self-replication function and the degree of activity relative to wildtype. This kind of grouping is also useful because high-throughput sequencing-based assays can be noisy, so the exact fitness values can be hard to interpret, but we can still classify sequences as active or inactive by calibrating them with a low-throughput gel assay (Figure 6.4 (F) and (G)).

6.3.3 Results

gRNAde outperforms rational design

Figure 6.4 demonstrates that the gRNAde pipeline substantially outperforms rational design approaches both with and without computational filtering. We quantify the individual contributions of gRNAde’s generative model and our RibonanzaNet filtering pipeline by comparing success rates across the three design methods.
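As a reference for the fitness values and activity calls reported in these results, Equation 6.2 and the activity threshold from Section 6.3.2 can be sketched as follows. This is a minimal illustration rather than the analysis code used for the experiments: the function names and the read counts in the usage note are hypothetical, and only the threshold (−1.86) and the minimum of 5 pre-selection reads come from the text.

```python
import math

ACTIVITY_THRESHOLD = -1.86   # fitness threshold stated in the text
MIN_PRE_SELECTION_READS = 5  # minimum pre-selection count stated in the text


def has_enough_reads(pre_reads):
    """Read-count filter applied before a sequence is analysed."""
    return pre_reads >= MIN_PRE_SELECTION_READS


def log2_enrichment(pre_reads, post_reads, pre_total, post_total):
    """log2 of the post/pre ratio of fractional abundances for one sequence."""
    fa_pre = pre_reads / pre_total
    fa_post = post_reads / post_total
    return math.log2(fa_post / fa_pre)


def fitness(seq_counts, wt_counts, pre_total, post_total):
    """Equation 6.2. Counts are (pre_reads, post_reads) tuples.

    Returns None for zero post-selection reads, where no fitness can be assigned.
    """
    pre_reads, post_reads = seq_counts
    if post_reads == 0:
        return None
    return (log2_enrichment(pre_reads, post_reads, pre_total, post_total)
            - log2_enrichment(*wt_counts, pre_total, post_total))


def is_active(fitness_value):
    """Active if a fitness value exists and meets the calibrated threshold."""
    return fitness_value is not None and fitness_value >= ACTIVITY_THRESHOLD
```

For example, with hypothetical wildtype counts `(100, 200)` and library totals of 10,000 reads each, a variant with counts `(100, 400)` is enriched twice as strongly as wildtype and so has a fitness of about 1, while the wildtype itself has a fitness of 0 by construction.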
As shown in Figure 6.4 (D), for designs with up to 20 mutations from wildtype, rational design with random selection achieves only 3% success rate (1 active design out of 100), highlighting the difficulty of designing functional ribozymes without 3D structural guidance. Applying computational filtering to rational designs improves the success rate to 15%, demonstrating the value of our filtering pipeline. Notably, gRNAde with the same computational filtering achieves a 31.5% success rate, a roughly 2-fold improvement over filtered rational designs and a roughly 10-fold improvement over unfiltered rational designs.

Furthermore, gRNAde discovered not only more functional variants but also variants with higher catalytic activity on model templates. The fitness distribution of active gRNAde designs was generally superior to that of the rational design baselines (Figure 6.4 (E)). Three active gRNAde variants each for mutational distance ranges 15-19, 20-24, and 25-29 were further validated via a low-throughput gel assay, showing a high Pearson correlation coefficient of 0.85 with the high-throughput fitness (Figure 6.4 (F) and (G)). Notably, variants 122 and 143 differed from the wildtype by 18 and 19 mutations, yet exhibited a 1.6-fold and 1.1-fold enrichment in the high-throughput assay, respectively. gRNAde retained activity even with up to 28 mutations (Figure D.2). These results showcase the combination of sequence novelty and improved functionality achieved with gRNAde.

gRNAde leverages 3D structural understanding in design

To understand the basis for gRNAde’s superior performance across diverse RNA design tasks, we analyzed the mutational patterns in active ribozyme designs from the gRNAde pipeline compared to rational design with filtering. Rational design tended to mainly mutate canonical base-paired positions while conserving unpaired loops, a strategy that mainly preserves secondary structure but fails to account for essential tertiary interactions (Figure 6.5 (B)).
gRNAde, in contrast, generated active designs with a more balanced mutational profile (Figure 6.5 (A)), frequently altering nucleotides in unpaired regions and loops. In particular, it altered nucleotides in four unpaired regions: J1/3 (single-stranded template-binding interface positioned by kissing loops), the loop region of P5 (structural scaffold of the catalytic core), the loop region of P9 (structural scaffold of the extension domain), and the loop region of P10 (makes critical substrate contacts [16] and shows dynamic movement towards the active site) (Figure 6.5 (C-E)). gRNAde can successfully mutate these four regions with diverse functions. This suggests that, by training on a diverse corpus of 3D structures, it has learned non-local structure-function relationships that go beyond simple base-pairing rules, allowing it to navigate a complex functional landscape.

To further contextualize gRNAde’s design strategy against human experts, we revisited the OpenKnot Benchmark to analyze the sequence recovery of successful designs from the wildtype or starting sequence (Figure 6.5 (F)). While gRNAde matched the structural accuracy of human experts, its designs were significantly more distant in sequence space. The median sequence recovery for gRNAde designs was 32%, substantially lower than the 72% observed for human experts. This divergence demonstrates that unlike human designers, who exhibit a strong bias toward smaller, more conservative edits close to the native sequence, gRNAde can successfully perform generative jumps in sequence space for diverse RNA targets.
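The two sequence-space quantities used in this analysis, mutational distance and native sequence recovery, can be made concrete with a short sketch. It assumes designs have the same length as the native sequence, so that mutational distance reduces to a count of substituted positions, and it takes sequence recovery to be the fraction of compared positions that match the native nucleotide. The toy sequences are illustrative only and are not OpenKnot or 5TU data.

```python
from statistics import median


def mutational_distance(design, native):
    """Number of substituted positions between two equal-length sequences."""
    if len(design) != len(native):
        raise ValueError("sequences must have equal length")
    return sum(d != n for d, n in zip(design, native))


def sequence_recovery(design, native, positions=None):
    """Fraction of the chosen positions (0-indexed; default: all) matching the native."""
    if len(design) != len(native):
        raise ValueError("sequences must have equal length")
    idx = list(range(len(native))) if positions is None else list(positions)
    if not idx:
        raise ValueError("no positions to compare")
    return sum(design[i] == native[i] for i in idx) / len(idx)


# Toy sequences, for illustration only.
native = "GGGAAACCC"
designs = ["GGGAAACCC", "GCGAAAGCC", "CCCUUUGGG"]
recoveries = [sequence_recovery(d, native) for d in designs]
median_recovery = median(recoveries)
```

Passing a subset of indices through `positions` gives recovery restricted to one position type (for example, only paired or only unpaired positions), which is how a per-type breakdown such as the one in Figure 6.5 (F) could be computed.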
[Figure 6.4, panels A-G. Structure labels: t1 subunit (fixed), 5TU subunit (designed); color scale: design probability. Workflow (C): the t1+5TU 3D structure and sequence constraints from the fitness landscape feed the gRNAde design pipeline (1,000 designs, top 40 per edit distance), rational design with filtering (500 designs, top 20 per edit distance), and rational design only (500 designs, random 20 per edit distance), followed by high-throughput fitness screening, low-throughput gel validation, and calibration of the activity threshold.]

Figure 6.4. Generative design and functional validation of RNA polymerase ribozymes. The gRNAde pipeline was used to design functional variants of the triplet-based RNA polymerase ribozyme (TPR), substantially outperforming rational design baselines. A. Cryo-EM structure of the TPR heterodimer (PDB: 8T2P), showing the catalytic 5TU subunit (colored), which was the target for generative design, and the auxiliary t1 subunit (grey), which was held constant. Position-specific design probabilities, derived from experimental fitness landscape data [33], are mapped onto the 5TU structure. Lighter yellows indicate regions with a high probability of being re-designed by gRNAde, while critical functional sites were constrained to the wildtype sequence (indicated in darker reds). B. Position-specific design probabilities mapped onto the 5TU secondary structure. C. Workflow for design and validation of 5TU variants.
The 3D backbone structure, along with constraints sampled from the fitness landscape data, was input to the full gRNAde pipeline (Figure 1) as well as two baselines: rational design with the same computational filtering as gRNAde, and rational design without filtering. A library of 2,000 total designs was screened via a high-throughput functional assay. The native 5TU and 9 gRNAde designs were further validated using a low-throughput gel, which was then used to calibrate the activity threshold for the high-throughput data. D. Success rate of generating active designs (fitness ≥ -1.86, corresponding to variant 319) binned by mutational distance from the wildtype sequence. At 15-19 mutations, the gRNAde pipeline achieves a 31.5% success rate, substantially outperforming filtered rational design (15.0%) and unfiltered rational design (3.0%). E. Fitness distributions for all functional designs across mutational distances. The fitness of gRNAde designs is consistently higher than that of the baseline methods, with many variants exceeding the activity threshold (fitness ≥ -1.86). F. Low-throughput primer extension gel assay, using the top 9 gRNAde-designed variants by fitness in the high-throughput assay. Variant identity and edit distance from wildtype 5TU are labelled below the gel. The gel confirms the activity of all gRNAde variants, with variants 122, 143, 33, and 203 showing activity comparable to or better than that of the native 5TU ribozyme. G. Correlation between the high-throughput functional assay and the low-throughput gel for the native 5TU sequence and 9 gRNAde-designed variants. The fitness scores are highly correlated with the average per-junction ligation efficiency from the gel (Pearson r = 0.85), and the fitness of the least active variant 319 is used as an activity threshold for the high-throughput assay, as it demonstrates some ligation activity on the gel.
[Figure 6.5, panels A-F. Color scale for panels C-E: mutational probability difference, from "preferred by rational design" to "preferred by gRNAde"; grey denotes positions not mutated in any active designs. Panel F title: Native Sequence Recovery of Successful OpenKnot Designs (OpenKnot Score > 90).]

Figure 6.5. Mechanistic analysis of gRNAde design strategies. The analysis compares gRNAde’s design strategy against rational design and human experts, demonstrating its capacity to learn non-local, 3D-informed structure-function relationships and achieve highly sequence-divergent yet structurally accurate designs. A-C. Per-position mutation probability (“hotspots”) for active RNA polymerase ribozyme designs generated by gRNAde (A) and rational design with filtering (B), and the difference between them (C). D, E. The difference in mutation probability is mapped onto the secondary structure (D) and tertiary structure (E) of the catalytic subunit 5TU that was the target of design. This reveals distinct design strategies: rational design preferentially mutates canonical base-paired positions (blue) in active designs, whereas gRNAde identifies novel hotspots in structurally complex, unpaired regions (red), particularly near the template-binding site (J1/3) as well as the ends of helices P5, P9, and P10. F. Median native sequence recovery of successful designs from gRNAde and expert human designers in OpenKnot Round 7, presented by position type (pseudoknotted, paired, unpaired, and all positions) for natural (left) and synthetic (right) RNA targets.
Across all position types, gRNAde designs exhibit significantly lower median native sequence recovery (32%) compared to human experts (72%). This demonstrates gRNAde’s capacity to achieve large generative jumps in sequence space while matching expert accuracy at forming the target structure.

[Figure 6.6: three scatter plots of experimental fitness (log enrichment) against RibonanzaNet chemical reactivity score, with a legend for edit distance. Panels: gRNAde pipeline (n = 745, r = -0.654, P < 0.001); Rational design w/ filtering (n = 370, r = -0.595, P < 0.001); Only rational design (n = 359, r = -0.441, P < 0.001).]

Figure 6.6. Computational metrics show moderate correlation with experimental fitness. Scatter plots showing the correlation between RibonanzaNet chemical reactivity scores and experimental fitness values for designs across different mutational distance ranges.

Computational filters show moderate correlation with experiments

Lastly, we assessed the correlation between our computational metrics and experimental fitness values to evaluate the reliability of our filtering pipeline (Figure 6.6). We find an average correlation of -0.563 between RibonanzaNet chemical reactivity score and experimental fitness across all designs, indicating moderate predictive power for identifying functional ribozymes. The moderate correlation reflects the limitations of using predicted chemical reactivity and secondary structure as proxies for complex 3D structure and catalytic function. Improved predictive models of RNA structure and function will be necessary to tackle more ambitious RNA design tasks.
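For readers who want to reproduce this kind of check, the sketch below shows a plain Pearson correlation and how the average of -0.563 follows from the three per-method r values shown in Figure 6.6. Treating the average as an unweighted mean of the three panels is an assumption, although it does reproduce the reported figure. The function name is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Per-method Pearson r values as shown in the Figure 6.6 panels.
reported_r = {
    "gRNAde pipeline": -0.654,
    "Rational design w/ filtering": -0.595,
    "Only rational design": -0.441,
}
# Unweighted mean across the three methods (an assumption about the averaging).
mean_reported_r = sum(reported_r.values()) / len(reported_r)
```

A negative r here means that designs with lower predicted reactivity scores tend to have higher measured fitness, which is the direction a filter based on this score relies on.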
6.4 Summary

Precise control in designing RNA structure and function could transform programmable biology, enabling new applications such as mRNA therapeutics that respond to personalized cellular conditions [Felletti et al., 2016, Mustafina et al., 2019] and sophisticated biosensors for multi-input detection [Choe et al., 2024]. However, progress toward these ambitious goals has been limited by the difficulty of accurately designing sequences that fold into complex 3D structures such as pseudoknots, which are essential for RNA functionality.

This chapter has experimentally validated gRNAde, a geometric deep learning pipeline for RNA inverse design. First, in a blinded, community-wide competition on the Eterna platform, gRNAde successfully designed complex RNA pseudoknot structures with an accuracy matching that of human experts, establishing a new state-of-the-art for structural RNA design. Second, the pipeline was used to generatively explore the functional landscape of a complex RNA polymerase ribozyme, discovering highly active ribozymes at large sequence distances from any known functional variant. This dual success on fundamentally different problems (one focused on folding and structural accuracy, the other on function, embodying both structure and dynamics) validates the power and generality of the approach.

These findings have broader implications for our understanding of both natural and engineered biological systems. The OpenKnot results, where gRNAde’s designs proved more stable for targeted structural goals than their native counterparts, suggest that data-driven optimization can uncover solutions that are more idealized than those found in biology, where evolution operates under a multitude of competing constraints.
Similarly, the discovery of a diverse and functional ribozyme quasispecies, with active variants differing by nearly 20% of their sequence, demonstrates that the functional sequence space for complex RNA is likely larger than anticipated, comprising (presumably) structurally similar variants at mutational distances that would be challenging to access by directed evolution. Generative models like gRNAde provide a powerful new tool to explore this vast, uncharted territory, offering an alternative to local exploration by directed evolution and to the limitations of human-centric rational design.

Beyond its direct applications, this work highlights the potential for a virtuous cycle in computational biology, where the success of gRNAde in generating vast libraries of high-quality designs has enabled the creation of new, large-scale datasets. For example, a follow-on collaboration used gRNAde to generate 68 million plausible sequences for 1.6 million pseudoknotted structures, a dataset orders of magnitude larger than previously possible. This dataset was then used to train RibonanzaNet 2 with significantly improved accuracy, creating a powerful feedback loop where generative models produce data to train better structure prediction models, which can then be incorporated back into the design pipeline as more accurate filters, further accelerating progress.

Future work

We have focussed on experimental validation of single-state design in this chapter. In principle, gRNAde enables inverse design of RNA sequences conditioned on multiple conformational states.
However, two key methodological directions would make multi-state design more useful in practice: (1) incorporating partner molecules such as small molecule ligands or proteins that induce conformational changes during design, enabling applications like RNA aptamer and riboswitch design with specific unbound and bound states [Mandal and Breaker, 2004, Mohsen et al., 2023]; and (2) allowing specification of conformational propensities to modulate or finetune functionality [Ken et al., 2023], enabling biasing toward desired functional states and negative design against unwanted conformations.

Additionally, RNA modelling and design tools remain trained on relatively limited datasets compared to protein design, which can prevent broad generalization to novel targets. The limits of current RNA 3D structure prediction tools are well-documented [Das et al., 2023, Kretsch et al., 2025], particularly without multiple sequence alignments, which is typically the case for designed sequences with no evolutionary history. Unlike computational filtering pipelines for protein design, where AlphaFold provides near-experimental accuracy, we did not find it beneficial to filter using RNA 3D structure predictors. Instead, we used RibonanzaNet [He et al., 2024] to predict chemical reactivity profiles of designed RNAs and found modest correlations with experimental functional measurements. While our initial pipeline combining gRNAde designs with RibonanzaNet filtering shows promise for a highly complex RNA polymerase ribozyme, tackling more challenging tasks like RNA interactions and multi-state design will likely require robust 3D structure prediction capabilities.

We are optimistic that advances in RNA structure determination through computationally-assisted cryo-EM [Kappel et al., 2020, Bonilla and Kieft, 2022] will expand available structural data, thereby improving training of geometric deep learning models and enabling new breakthroughs in RNA design.
Chapter 7

Conclusion

This thesis introduces new Geometric Deep Learning techniques for molecular modelling and design. In Part I, I developed unified theory and architectures for representation learning and generative modelling of 3D molecular structures. In Part II, I introduced a novel toolkit for inverse design of RNA molecules, a challenging and underexplored domain in molecular design. These contributions share a common geometric foundation: representing molecular systems as 3D geometric graphs with inherent physical symmetries and transformation behaviors, which are incorporated explicitly or implicitly into the modelling. Overall, I aimed to integrate principled approaches to representation learning and generative modelling into practical, wet lab validated frameworks for real-world molecular design.

7.1 Summary of contributions

Part I: Molecular Representation Learning and Generative Modelling

Chapter 3 presents the Geometric Weisfeiler-Leman (GWL) test, which extends the classic Weisfeiler-Leman graph isomorphism algorithm to geometric graphs while preserving 3D symmetries. This framework unifies so far disparate classes of Geometric GNN architectures for molecular representation learning. GWL provides mechanistic insights into the expressive power of these architectures, highlighting advantages of equivariant models over invariant ones and the role of higher-order representations in discriminating 3D structures. I complement the theoretical framework with synthetic experiments and a benchmark on protein function prediction.

Chapter 4 proposes the All-atom Diffusion Transformer (ADiT), the first unified generative model for both periodic crystals and non-periodic molecular systems. ADiT embeds 3D molecular structures into a shared latent space, where it learns to sample new latents and then decodes them to valid structures.
ADiT’s latent diffusion approach enables transfer learning from diverse chemical spaces, achieving state-of-the-art performance on molecular and crystal generation benchmarks. Built on the standard Transformer, ADiT shows predictable scaling behaviors up to half a billion parameters, positioning it as a promising foundation model architecture for molecular generation.

Part II: RNA Molecule Design

Chapter 5 introduces gRNAde, a novel generative RNA inverse design toolkit. gRNAde is a structure-conditioned RNA language model that uses a multi-state Geometric GNN to generate sequences conditioned on one or more 3D backbone structures, explicitly accounting for the conformational diversity of RNA molecules. gRNAde significantly improves both performance and speed over state-of-the-art physics-based methods in computational benchmarks. gRNAde also enables new capabilities such as multi-state design and zero-shot ranking in RNA engineering campaigns.

Chapter 6 presents wet lab experimental validation of gRNAde for real-world RNA design problems. gRNAde successfully designs diverse pseudoknotted RNA structures with significantly higher success rates than physics-based methods, matching the performance of expert human designers while being fully automated and scalable. Most significantly, gRNAde enables the design of functional RNA enzymes (ribozymes) and systematically explores sequence diversity that retains biological function, capabilities that substantially exceed current rational design approaches. Together, these results establish gRNAde as a powerful tool for designing RNA structures with specific biological functions, opening new frontiers in RNA engineering.

7.2 Discussion

A central theme of this thesis is the interplay between the physical symmetries that govern molecular systems and whether to implement these symmetries as inductive biases in deep learning architectures [Bronstein et al., 2021].
I would like to conclude with a reflection on the engineering and computational aspects of molecular modelling, particularly the notion of the hardware lottery [Hooker, 2021]: the marriage of architectures and hardware that determines which research ideas rise to prominence, and its connection to the bitter lesson in AI research [Sutton, 2019]. These discussions focus on Transformers and GNNs, as well as roto-translation equivariance versus learning symmetries at scale. These are the two main architectural paradigms that I have explored through this thesis.

Transformers are winning the hardware lottery

Transformers are GNNs that implement a fully-connected message passing scheme via dense matrix multiplications [Joshi, 2025]. In contrast, GNNs typically implement sparse message passing over locally connected structures via scatter-gather operations, which are significantly slower on modern GPUs for size ranges of typical molecular structures. Additionally, state-of-the-art equivariant GNNs for molecular systems rely on higher-order tensor representations to achieve maximum expressivity while preserving symmetries, as discussed in Chapter 3. This results in a significant increase in memory usage and computational complexity, making equivariant networks orders of magnitude slower to train and scale up than standard Transformers on current hardware.

The evolution from AlphaFold 2 [Jumper et al., 2021] to AlphaFold 3 [Abramson et al., 2024] exemplifies a paradigm shift in recent years. The AlphaFold 3 architecture is simpler than that of AlphaFold 2, which explicitly incorporated roto-translation equivariance when predicting 3D coordinates of protein structures. Instead, AlphaFold 3 uses a largely standard Transformer architecture and data augmentation when learning to predict 3D coordinates. This approach is easier to scale and generalizes more naturally to all-atom biomolecular complexes than previous approaches.
AlphaFold 3 is a very effective demonstration of geometric symmetries learnt at scale using a sufficiently expressive model. In the near term, the hardware lottery will likely favour Transformers. Transformers are likely to be the architecture of choice for molecular foundation models trained on large datasets and scaled to billions of parameters. Training equivariant networks at such scales would be prohibitively expensive at present.

A problem-centric approach to architectures

It would be naive to conclude that equivariant networks are inferior to unconstrained architectures. The choice of inductive biases depends fundamentally on the problem at hand. When data is limited or strict symmetry guarantees are essential, such as in molecular simulation and property prediction, explicitly enforcing symmetries provides greater data efficiency and generalization. For instance, equivariant GNNs with higher-order tensors are the current state-of-the-art in interatomic potentials for molecular simulation [Batatia et al., 2023, Wood et al., 2025]. For most practical applications in molecular simulations, models must learn physically meaningful and smooth energy landscapes [Bigi et al., 2025, Fu et al., 2025]. Here, equivariant representations that transform predictably under roto-translation provide essential inductive biases for capturing the underlying physical phenomena governing the dynamics [Musil et al., 2021].

In contrast, when large-scale training data is available and exact symmetry guarantees are not crucial, implicit or learned symmetry constraints can have an advantage. Diffusion-based generative models, as demonstrated in Chapter 4, exemplify this scenario. In diffusion models, a denoiser network learns the underlying data distribution by observing molecular structures under varying noise levels and iteratively reconstructing valid configurations. What matters most is that the denoiser produces valid molecular structures given noisy inputs.
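One way to make the idea of learned rather than enforced symmetry concrete is to train on randomly rotated copies of each structure and to measure how far a denoiser is from exact rotation equivariance. The sketch below is an illustration under stated assumptions, not the training code of ADiT or AlphaFold 3: `random_rotation`, `augment`, `equivariance_error`, and the two toy denoisers are hypothetical names introduced here.

```python
import numpy as np

def random_rotation(rng):
    """Sample a 3D rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))     # remove the sign ambiguity of QR
    if np.linalg.det(q) < 0:        # force a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q

def augment(coords, rng):
    """Data augmentation: centre the structure, then apply a random rotation."""
    centred = coords - coords.mean(axis=0, keepdims=True)
    return centred @ random_rotation(rng).T

def equivariance_error(f, coords, rot):
    """Norm of f(x R^T) - f(x) R^T, which is zero for an exactly equivariant f."""
    return float(np.linalg.norm(f(coords @ rot.T) - f(coords) @ rot.T))

def equivariant_denoiser(x):
    """Toy equivariant map: shrink every atom towards the centroid."""
    centre = x.mean(axis=0, keepdims=True)
    return centre + 0.9 * (x - centre)

def make_unconstrained_denoiser(w):
    """Toy unconstrained map (assumed for illustration): a fixed nonlinear layer."""
    return lambda x: np.tanh(x @ w)

rng = np.random.default_rng(0)
coords = rng.normal(size=(8, 3))    # 8 hypothetical atoms
rot = random_rotation(rng)
augmented = augment(coords, rng)

err_equivariant = equivariance_error(equivariant_denoiser, coords, rot)
err_unconstrained = equivariance_error(
    make_unconstrained_denoiser(rng.normal(size=(3, 3))), coords, rot
)
```

In this toy setting the equivariant map has an error at the level of floating-point round-off, while the unconstrained map does not. For a trained unconstrained denoiser, tracking such an error over the course of training is one possible way to check whether approximate equivariance is emerging.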
If the denoiser produces different outputs for rotated versions of the same noisy input, this may not be problematic as long as both outputs represent physically plausible structures. An important insight I have developed while training diffusion models is that learning from each sample in the data distribution under different noise levels is crucial for optimal generative modelling. This boils down to performing as many epochs of training as possible with a sufficiently expressive denoiser, as approximate roto-translation equivariance often emerges in unconstrained networks when trained at scale. Since the noisy intermediate training steps do not represent physically meaningful structures, the inductive bias of explicit equivariance becomes less critical. This phenomenon helps explain the strong performance of recent Transformer-based diffusion models for molecular generation [Wang et al., 2024, Abramson et al., 2024, Joshi et al., 2025a]. The hardware lottery enables Transformers to be trained for many more iterations than equivariant networks within the same computational budget, leading to improved performance.

Overall, roto-translational equivariance is a powerful inductive bias and a strong guarantee of physical correctness. At the same time, equivariance can also be viewed as a hard constraint that ultimately limits model expressivity. A similar argument can be made regarding locality in GNNs versus global attention in Transformers [Joshi, 2025]. Through this thesis, I have ultimately arrived at a pragmatic perspective: architectures are tools for solving problems, and the choice of architecture should be driven by the problem at hand, the available data, and the computational resources.

7.3 Future Directions

The work conducted in this thesis has prepared me to work on exciting new frontiers in biomolecular modelling and design.
I highlight two interconnected directions that I believe will be crucial for advancing the field, united by a fundamental insight: nature performs computation through transitions between molecular states, often coupled to chemical reactions and intermolecular interactions [Al-Hashimi, 2023].

To tackle the most interesting scientific problems in molecular biology, we need new techniques for representing dynamic biological processes. This will necessitate closer collaborations between AI researchers and experimental biologists, with the aim of jointly designing both the dataset generation processes and the machine learning models.

7.3.1 Representing and Designing Conformational Dynamics

Multi-state conformational changes and dynamics are fundamental to the function of almost every biologically relevant molecule, from antibodies and membrane receptors to biocatalysis in proteins and RNA [Henzler-Wildman and Kern, 2007, Ganser et al., 2019]. An ideal computational representation of molecular systems must therefore account for both geometric structures and temporal dynamics [Carugo and Djinović-Carugo, 2023, Lane, 2023]. However, existing approaches, including the ones presented in this thesis, generally focus on static representations. The next frontiers in molecular modelling lie in representing multi-state ensembles and the transition dynamics of conformational changes.

I am optimistic about two possible approaches towards addressing this challenge. The first is the integration of machine learning interaction potentials (MLIPs) for molecular dynamics with property prediction and generative models. Recent ‘universal’ MLIPs have demonstrated remarkable accuracy in approximating quantum mechanics calculations for simulating biomolecular systems [Kovács et al., 2025, Wood et al., 2025].
An interesting question is whether representations learned by universal MLIPs can be predictive of dynamical and functional properties of de novo designed molecules beyond natural systems. If true, MLIPs could enable new capabilities in molecular design, including dynamics-informed generation via conditioning, or accelerating the screening of generated designs with desired dynamical properties.

Second, integrating experimental data that explicitly captures conformational flexibility and dynamics can further advance molecular representations. For example, cryo-EM density maps from structure determination methods [Jamali et al., 2024], or high-throughput structural assays such as cross-linking mass spectrometry for proteins [O’Reilly and Rappsilber, 2018] and chemical probing for RNA [Strobel et al., 2018, Cao et al., 2024] can provide complementary information to static structures from databases like the PDB.

Ultimately, we must move beyond training models solely on static structures towards understanding the dynamic behaviour of biomolecules [Wayment-Steele et al., 2025] and rational design of functional multi-state systems [Praetorius et al., 2023].

7.3.2 Black-box Data for Lab-in-the-loop Design

Structure-driven molecular design is emerging as a powerful paradigm in biochemistry [Watson et al., 2023, Schneuing et al., 2024] and materials science [Zeni et al., 2025]. Notably, the Nobel Prize in Chemistry 2024 recognized computational protein design and structure prediction. The research in Chapter 5 on RNA structure design was inspired by this Nobel Prize-winning work. I have been fortunate to interact and collaborate with leading RNA biologists to experimentally validate our RNA design models. These conversations have made it clear that structure-based design is not a universally applicable paradigm for molecular design in RNA biology and beyond.
It works well only when there is an established structural basis for function and when high-quality structural data is available. Many of the most important biological problems may not fit this mould.

Thus, I believe the next frontier in molecular design will extend beyond current structure-based approaches or augment them with complementary data sources. There is growing excitement about ‘black-box’ experimental datasets from high-throughput assays to connect sequence with function, specifically created for training machine learning models [Porebski et al., 2024, Bronstein and Naef, 2024]. When combined with a lab-in-the-loop setup [Frey et al., 2025], we can enable iterative testing and improvement of molecular design models in the real world. In fact, these ideas hold particular promise for RNA, where next-generation sequencing can measure structural and functional properties at unprecedented scale and relatively low cost [Strobel et al., 2018, He et al., 2024].

In the future, I am excited to jointly design both data generation and model development processes from the ground up, together with experimentalists and AI researchers. Ultimately, I strongly believe that close collaboration and antedisciplinary science [Eddy, 2005] will be essential for asking the most interesting scientific questions and unlocking the secrets of life.

References

J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature, 2024. (Cited on page 11, 75, 77, 91, 93, 118, 141, 142)
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. (Cited on page 13, 31)
B. Adamczyk, M. Antczak, and M. Szachniuk. Rnasolo: a repository of cleaned pdb-derived rna 3d structures.
Bioinformatics, 2022. (Cited on page 107)
H. M. Al-Hashimi. Turing, von neumann, and the computational architecture of biological machines. Proceedings of the National Academy of Sciences, 2023. (Cited on page 142)
M. S. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023. (Cited on page 49)
B. Alberts, R. Heald, A. Johnson, D. Morgan, M. Raff, K. Roberts, and P. Walter. Molecular biology of the cell: seventh international student edition with registration card. WW Norton & Company, 2022. (Cited on page 23)
R. F. Alford, A. Leaver-Fay, J. R. Jeliazkov, M. J. O’Meara, F. P. DiMaio, H. Park, M. V. Shapovalov, P. D. Renfrew, V. K. Mulligan, et al. The rosetta all-atom energy function for macromolecular modelling and design. Journal of chemical theory and computation, 2017. (Cited on page 23)
U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. In ICLR, 2021. (Cited on page 66)
R. Anand, C. K. Joshi, A. Morehead, A. R. Jamasb, C. Harris, S. Mathis, K. Didi, B. Hooi, and P. Liò. Rna-frameflow: Flow matching for de novo 3d rna backbone design. In Machine Learning for Computational Biology (MLCB), 2024. (Cited on page 113)
B. Anderson, T. S. Hy, and R. Kondor. Cormorant: Covariant molecular neural networks. NeurIPS, 2019. (Cited on page 40)
N. Ashcroft and N. D. Mermin. Solid State Physics. Saunders College Publishing, 1976. (Cited on page 22)
J. Attwater, A. Wochner, and P. Holliger. In-ice evolution of rna polymerase ribozyme activity. Nature chemistry, 2013. (Cited on page 126)
J. Attwater, A. Raguram, A. S. Morgunov, E. Gianni, and P. Holliger. Ribozyme-catalysed rna synthesis using triplet building blocks. Elife, 2018. (Cited on page 126)
S. Axelrod and R. Gomez-Bombarelli. Geom, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 2022. (Cited on page 84)
L.
Babai, P. Erdos, and S. M. Selkow. Random graph isomorphism. SIAM Journal on Computing, 1980. (Cited on page 55)
M. Baek, F. DiMaio, I. Anishchenko, J. Dauparas, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 2021. (Cited on page 69, 113)
M. Baek, R. McHugh, I. Anishchenko, H. Jiang, D. Baker, and F. DiMaio. Accurate prediction of protein–nucleic acid complexes using rosettafoldna. Nature Methods, 2024. (Cited on page 113)
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015. (Cited on page 31, 44)
A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Physical review letters, 2010. (Cited on page 41, 43)
A. P. Bartók, R. Kondor, and G. Csányi. On representing chemical environments. Physical Review B, 2013. (Cited on page 41, 43, 57, 60)
A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti. Machine learning unifies the modelling of materials and molecules. Science advances, 2017. (Cited on page 43, 77)
I. Batatia, S. Batzner, D. P. Kovács, A. Musaelian, G. N. Simm, R. Drautz, C. Ortner, B. Kozinsky, and G. Csányi. The design space of e(3)-equivariant atom-centered interatomic potentials. arXiv preprint, 2022a. (Cited on page 41)
I. Batatia, D. P. Kovács, G. N. Simm, C. Ortner, and G. Csányi. Mace: Higher order equivariant message passing neural networks for fast and accurate force fields. In NeurIPS, 2022b. (Cited on page 40, 41, 43, 59, 63, 64, 68, 70)
I. Batatia, P. Benner, Y. Chiang, A. M. Elena, D. P. Kovács, J. Riebesell, X. R. Advincula, M. Asta, M. Avaylon, W. J. Baldwin, et al. A foundation model for atomistic materials chemistry. arXiv preprint arXiv:2401.00096, 2023. (Cited on page 11, 92, 141)
P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A.
Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint, 2018. (Cited on page 12, 29)
F. Battiston, G. Cencetti, I. Iacopini, V. Latora, M. Lucas, A. Patania, J.-G. Young, and G. Petri. Networks beyond pairwise interactions: Structure and dynamics. Physics reports, 2020. (Cited on page 25)
S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature communications, 2022. (Cited on page 12, 29, 43, 53)
J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical review letters, 2007. (Cited on page 11, 43)
J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions, 2023. URL https://cdn.openai.com/papers/dall-e-3.pdf. (Cited on page 47, 92)
F. Bigi, M. Langer, and M. Ceriotti. The dark side of the forces: assessing non-conservative force models for atomistic machine learning. In International Conference on Machine Learning (ICML), 2025. (Cited on page 43, 141)
C. Bodnar, F. Frasca, N. Otter, Y. Wang, P. Lio, G. F. Montufar, and M. Bronstein. Weisfeiler and lehman go cellular: Cw networks. NeurIPS, 2021a. (Cited on page 56)
C. Bodnar, F. Frasca, Y. Wang, N. Otter, G. F. Montufar, P. Lio, and M. Bronstein. Weisfeiler and lehman go topological: Message passing simplicial networks. In ICML, 2021b. (Cited on page 56)
R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. S. Chatterji, A. S. Chen, K. A. Creel, J. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. E. Gillespie, K. Goel, N. D.
Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. F. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. S. Krass, R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. P. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. F. Nyarko, G. Ogut, L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren, F. Rong, Y. H. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih, K. P. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. A. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, and P. Liang. On the opportunities and risks of foundation models. ArXiv, 2021. (Cited on page 12, 13)
S. L. Bonilla and J. S. Kieft. The promise of cryo-em to explore rna structural dynamics. Journal of Molecular Biology, 2022. (Cited on page 137)
E. Bonnet, P. Rzazewski, and F. Sikora. Designing rna secondary structures is hard. Journal of Computational Biology, 2020. (Cited on page 113)
F. Boyles, C. M. Deane, and G. M. Morris. Learning from the ligand: using ligand-based features to improve binding affinity prediction. Bioinformatics, 2019. (Cited on page 70)
J. Brandstetter, R. Hesselink, E. van der Pol, E. J. Bekkers, and M. Welling. Geometric and physical quantities improve e(3) equivariant message passing. In ICLR, 2022. (Cited on page 40, 41)
R. R. Breaker and G. F. Joyce. Inventing and improving ribozyme function: rational design versus iterative selection methods. Trends in biotechnology, 1994. (Cited on page 112)
M. Bronstein and L. Naef.
The road to biology 2.0 will pass through black-box data. Towards Data Science, 2024. (Cited on page 143)
M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint, 2021. (Cited on page 11, 25, 62, 140)
T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, et al. Video generation models as world simulators. 2024. (Cited on page 47, 92)
D. Buterez, J. P. Janet, S. J. Kiddle, and P. Liò. Mf-pcba: Multifidelity high-throughput screening benchmarks for drug discovery and machine learning. Journal of Chemical Information and Modeling, 2023. (Cited on page 44)
M. Buttenschoen, G. M. Morris, and C. M. Deane. Posebusters: Ai-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chemical Science, 2024. (Cited on page 85, 182)
M. Buttenschoen, Y. Ziv, G. M. Morris, and C. Deane. An evaluation of unconditional 3d molecular generation methods. In ICLR Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2025. (Cited on page 89)
A. Campbell, J. Yim, R. Barzilay, T. Rainforth, and T. Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In Forty-first International Conference on Machine Learning, 2024. (Cited on page 77)
X. Cao, Y. Zhang, Y. Ding, and Y. Wan. Identification of rna structures and their roles in rna functions. Nature Reviews Molecular Cell Biology, 2024. (Cited on page 118, 143)
O. Carugo and K. Djinović-Carugo. Structural biology: A golden era. PLoS Biology, 2023. (Cited on page 142)
T. R. Cech. The Catalyst: RNA and the Quest to Unlock Life’s Deepest Secrets. WW Norton & Company, 2024. (Cited on page 13, 23, 99, 126)
J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen, et al.
Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions. arXiv preprint, 2022. (Cited on page 113)
Z. Chen, S. Villar, L. Chen, and J. Bruna. On the equivalence between graph isomorphism testing and function approximation with gnns. NeurIPS, 2019. (Cited on page 56)
C. Choe, J. O. Andreasson, F. Melaine, W. Kladwang, M. J. Wu, F. Portela, R. Wellington-Oguri, J. J. Nicol, H. K. Wayment-Steele, M. Gotrik, et al. Compact rna sensors for increasingly complex functions of multiple inputs. bioRxiv, 2024. (Cited on page 135)
A. E. Chu, J. Kim, L. Cheng, G. El Nesr, M. Xu, R. W. Shuai, and P.-S. Huang. An all-atom protein generative model. Proceedings of the National Academy of Sciences, 2024. (Cited on page 93)
A. Churkin, M. D. Retwitzer, V. Reinharz, Y. Ponty, J. Waldispühl, and D. Barash. Design of rnas: comparing programs for inverse rna folding. Briefings in bioinformatics, 2018. (Cited on page 99, 113)
G. Corso, B. Jing, R. Barzilay, T. Jaakkola, et al. Diffdock: Diffusion steps, twists, and turns for molecular docking. In International Conference on Learning Representations, 2023. (Cited on page 69, 90, 92)
X. Dai, J. Hou, C.-Y. Ma, S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023. (Cited on page 92)
A. Daigavane, S. E. Kim, M. Geiger, and T. Smidt. Symphony: Symmetry-equivariant point-centered spherical harmonics for 3d molecule generation. In The Twelfth International Conference on Learning Representations, 2024. (Cited on page 85, 182)
T. R. Damase, R. Sukhovershin, C. Boada, F. Taraballi, R. I. Pettigrew, and J. P. Cooke. The limitless future of rna therapeutics. Frontiers in bioengineering and biotechnology, 2021. (Cited on page 99)
R. Das, J. Karanicolas, and D. Baker. Atomic accuracy in predicting and designing noncanonical rna structure.
Nature methods, 2010. (Cited on page 107, 108, 109, 113, 114, 192)
R. Das, R. C. Kretsch, A. J. Simpkin, T. Mulvaney, P. Pham, R. Rangan, F. Bu, R. M. Keegan, M. Topf, D. J. Rigden, et al. Assessment of three-dimensional rna structure prediction in casp15. Proteins: Structure, Function, and Bioinformatics, 2023. (Cited on page 118, 136)
J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. Wicky, A. Courbet, R. J. de Haas, N. Bethel, et al. Robust deep learning based protein sequence design using proteinmpnn. Science, 2022. (Cited on page 11, 53, 99, 101, 102, 105, 106)
D. W. Davies, K. T. Butler, A. J. Jackson, J. M. Skelton, K. Morita, and A. Walsh. Smact: Semiconducting materials by analogy and chemical theory. Journal of Open Source Software, 2019. (Cited on page 181)
W. K. Dawson, M. Maciejczyk, E. J. Jankowska, and J. M. Bujnicki. Coarse-grained modelling of rna 3d structure. Methods, 2016. (Cited on page 101)
V. Delle Rose, A. Kozachinskiy, C. Rojas, M. Petrache, and P. Barceló. Three iterations of (d-1)-wl test distinguish non isometric clouds of d-dimensional points. NeurIPS, 2023. (Cited on page 63, 73)
B. Deng, P. Zhong, K. Jun, J. Riebesell, K. Han, C. J. Bartel, and G. Ceder. Chgnet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nature Machine Intelligence, 2023. (Cited on page 181)
A. Derrow-Pinion, J. She, D. Wong, O. Lange, T. Hester, L. Perez, M. Nunkesser, S. Lee, X. Guo, B. Wiltshire, et al. Eta prediction with graph neural networks in google maps. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021. (Cited on page 29)
P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 2021. (Cited on page 50, 92)
F. Di Giovanni, L. Giusti, F. Barbero, G. Luise, P. Lio, and M. M. Bronstein.
On over-squashing in message passing neural networks: The impact of width, depth, and topology. In International Conference on Machine Learning. PMLR, 2023. (Cited on page 32)
S. Dieleman. Guidance: a cheat code for diffusion models, 2022. URL https://benanne.github.io/2022/05/26/guidance.html. (Cited on page 50)
S. Dieleman. Generative modelling in latent space, 2025. URL https://sander.ai/2025/04/15/latents.html. (Cited on page 50, 94)
P. A. M. Dirac. Quantum mechanics of many-electron systems. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 1929. (Cited on page 43)
J. Dong, B. Feng, D. Guessous, Y. Liang, and H. He. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2024. (Cited on page 32)
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR, 2021. (Cited on page 31, 42)
J. A. Doudna and E. Charpentier. The new frontier of genome engineering with crispr-cas9. Science, 2014. (Cited on page 99)
R. Drautz. Atomic cluster expansion for accurate and transferable interatomic potentials. Physical Review B, 2019. (Cited on page 41, 73)
W. Du, H. Zhang, Y. Du, Q. Meng, W. Chen, N. Zheng, B. Shao, and T.-Y. Liu. Se(3) equivariant graph neural networks with complete local frames. In ICML, 2022. (Cited on page 75)
Y. Du, A. R. Jamasb, J. Guo, T. Fu, C. Harris, Y. Wang, C. Duan, P. Liò, P. Schwaller, and T. L. Blundell. Machine learning-aided generative molecular design. Nature Machine Intelligence, 2024. (Cited on page 44)
G. Dusson, M. Bachmayr, G. Csanyi, R. Drautz, S. Etter, C. van der Oord, and C. Ortner. Atomic cluster expansion: Completeness, efficiency and stability. arXiv preprint, 2019. (Cited on page 64, 73)
A.
Duval, S. V. Mathis, C. K. Joshi, V. Schmidt, S. Miret, F. D. Malliaros, T. Cohen, P. Liò, Y. Bengio, and M. Bronstein. A hitchhiker’s guide to geometric gnns for 3d atomic systems. arXiv preprint, 2023a. (Cited on page 16, 26, 29, 33, 34, 40, 61, 92)
A. Duval, V. Schmidt, A. Hernández-García, S. Miret, F. D. Malliaros, Y. Bengio, and D. Rolnick. Faenet: Frame averaging equivariant GNN for materials modeling. In International Conference on Machine Learning, ICML, 2023b. (Cited on page 42)
V. P. Dwivedi and X. Bresson. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020. (Cited on page 32)
V. P. Dwivedi, C. K. Joshi, A. T. Luu, T. Laurent, Y. Bengio, and X. Bresson. Benchmarking graph neural networks. JMLR, 2023. (Cited on page 56)
N. Dym and H. Maron. On the universality of rotation equivariant point cloud networks. In ICLR, 2020. (Cited on page 73)
S. R. Eddy. “antedisciplinary” science. PLoS computational biology, 2005. (Cited on page 144)
E. H. Ekland, J. W. Szostak, and D. P. Bartel. Structurally complex and highly active rna ligases derived from random rna sequences. Science, 1995. (Cited on page 126)
A. A. Elhag, T. K. Rusch, F. Di Giovanni, and M. Bronstein. Relaxed equivariance via multitask learning. arXiv preprint arXiv:2410.17878, 2024. (Cited on page 42)
P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Computer Vision and Pattern Recognition (CVPR), 2021. (Cited on page 94)
P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024. (Cited on page 47, 92)
M. Felletti, J. Stifel, L. A. Wurmthaler, S.
Geiger, and J. S. Hartig. Twister ribozymes as highly versatile expression platforms for artificial riboswitches. Nature communications, 2016. (Cited on page 135)
M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop, 2019a. (Cited on page 64)
M. Fey and J. E. Lenssen. Fast graph representation learning with pytorch geometric. ICLR 2019 Representation Learning on Graphs and Manifolds Workshop, 2019b. (Cited on page 103)
D. Flam-Shepherd and A. Aspuru-Guzik. Language models can generate molecules, materials, and protein binding sites directly in three dimensions as xyz, cif, and pdb files. arXiv preprint arXiv:2305.05708, 2023. (Cited on page 91)
R. E. Franklin and R. G. Gosling. Molecular structure of nucleic acids: Molecular configuration in sodium thymonucleate. Nature, 1953. (Cited on page 24)
N. C. Frey, I. Hötzel, S. D. Stanton, R. Kelly, R. G. Alberstein, E. Makowski, K. Martinkus, D. Berenberg, J. Bevers III, T. Bryson, et al. Lab-in-the-loop therapeutic antibody design with deep learning. bioRxiv, 2025. (Cited on page 143)
L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li. Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics, 2012. (Cited on page 107)
X. Fu, Z. Wu, W. Wang, T. Xie, S. Keten, R. Gomez-Bombarelli, and T. S. Jaakkola. Forces are not enough: Benchmark and critical evaluation for machine learning force fields with molecular simulations. Transactions on Machine Learning Research, 2023. (Cited on page 43)
X. Fu, B. M. Wood, L. Barroso-Luque, D. S. Levine, M. Gao, M. Dzamba, and C. L. Zitnick. Learning smooth and expressive interatomic potentials for physical property prediction. In International Conference on Machine Learning, 2025. (Cited on page 12, 43, 141)
F. Fuchs, D. Worrall, V. Fischer, and M. Welling. Se(3)-transformers: 3d roto-translation equivariant attention networks. NeurIPS, 2020. (Cited on page 41)
P. Gainza, F. Sverrisson, F. Monti, E. Rodola, D.
Boscaini, M. Bronstein, and B. Correia. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods, 17(2), 2020. (Cited on page 69, 70)
L. R. Ganser, M. L. Kelly, D. Herschlag, and H. M. Al-Hashimi. The roles of structural dynamics in the cellular functions of rnas. Nature reviews Molecular cell biology, 2019. (Cited on page 99, 142)
M. Gantz, S. V. Mathis, F. E. Nintzel, P. J. Zurek, T. Knaus, E. Patel, D. Boros, F.-M. Weberling, M. R. Kenneth, O. J. Klein, et al. Microdroplet screening rapidly profiles a biocatalyst to enable its ai-assisted engineering. bioRxiv, 2024. (Cited on page 127)
R. Gao, E. Hoogeboom, J. Heek, V. D. Bortoli, K. P. Murphy, and T. Salimans. Diffusion meets flow matching: Two sides of the same coin, 2024. URL https://diffusionflow.github.io/. (Cited on page 49, 82)
V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks. In ICML, 2020. (Cited on page 173, 174)
J. Gasteiger, J. Groß, and S. Günnemann. Directional message passing for molecular graphs. In ICLR, 2020. (Cited on page 35, 36, 53, 63, 64, 69)
J. Gasteiger, F. Becker, and S. Günnemann. Gemnet: Universal directional graph neural networks for molecules. In NeurIPS, 2021. (Cited on page 35, 36, 63, 73, 74)
M. Geiger and T. Smidt. e3nn: Euclidean neural networks. arXiv preprint, 2022. (Cited on page 41, 64)
W. Gilbert. Origin of life: The rna world. Nature, 319(6055):618–618, 1986. (Cited on page 126)
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017. (Cited on page 12, 31)
V. Gligorijević, P. D. Renfrew, T. Kosciolek, J. K. Leman, D. Berenberg, T. Vatanen, C. Chandler, B. C. Taylor, I. M. Fisk, H. Vlamakis, et al. Structure-based protein function prediction using graph convolutional networks. Nature Communications, 2021.
(Cited on page 69, 70)
V. Gligorijević, P. D. Renfrew, T. Kosciolek, J. K. Leman, D. Berenberg, T. Vatanen, C. Chandler, B. C. Taylor, I. M. Fisk, H. Vlamakis, et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1), 2021. (Cited on page 44)
C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of International Conference on Neural Networks (ICNN’96). IEEE, 1996. (Cited on page 29)
R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 2018. (Cited on page 46)
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. (Cited on page 21)
M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. IEEE, 2005. (Cited on page 29)
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. (Cited on page 44)
G. P. Greslehner. What do molecular biologists mean when they say ‘structure determines function’? 2018. (Cited on page 23)
R.-R. Griffiths and J. M. Hernández-Lobato. Constrained bayesian optimization for automatic chemical design using variational autoencoders. Chemical science, 2020. (Cited on page 47)
R. W. Grosse-Kunstleve, N. K. Sauter, and P. D. Adams. Numerically stable algorithms for the computation of reduced unit cells. Acta Crystallographica Section A: Foundations of Crystallography, 2004. (Cited on page 79)
N. Gruver, S. Stanton, N. Frey, T. G. Rudner, I. Hotzel, J. Lafrance-Vanasse, A. Rajpal, K. Cho, and A. G. Wilson.
Protein design with guided discrete diffusion. Advances in neural information processing systems, 2023. (Cited on page 50)
N. Gruver, A. Sriram, A. Madotto, A. G. Wilson, C. L. Zitnick, and Z. W. Ulissi. Fine-tuned language models generate stable inorganic materials as text. In The Twelfth International Conference on Learning Representations, 2024. (Cited on page 85, 91)
D. Han, X. Qi, C. Myhrvold, B. Wang, M. Dai, S. Jiang, M. Bates, Y. Liu, B. An, F. Zhang, et al. Single-stranded dna and rna origami. Science, 2017. (Cited on page 99, 113)
C. Harris, K. Didi, A. R. Jamasb, C. K. Joshi, S. V. Mathis, P. Lio, and T. Blundell. Posecheck: Generative models for 3d structure-based drug design produce unrealistic poses. NeurIPS Workshop on Machine Learning for Structural Biology, 2023. (Cited on page 182)
S. He, R. Huang, J. Townley, R. C. Kretsch, T. G. Karagianes, D. B. Cox, H. Blair, D. Penzar, V. Vyaltsev, E. Aristova, et al. Ribonanza: deep learning of rna structure through dual crowdsourcing. bioRxiv, 2024. (Cited on page 113, 115, 117, 136, 144)
K. Henzler-Wildman and D. Kern. Dynamic personalities of proteins. Nature, 2007. (Cited on page 142)
P. Hermosilla, M. Schäfer, M. Lang, G. Fackelmann, P. P. Vázquez, B. Kozlíková, M. Krone, T. Ritschel, and T. Ropinski. Intrinsic-extrinsic convolution and pooling for learning on 3d protein structures. arXiv preprint arXiv:2007.06252, 2020. (Cited on page 70)
G. Hinton. How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627, 2021. (Cited on page 37)
G. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011. (Cited on page 57, 62)
G. E. Hinton and R. Zemel. Autoencoders, minimum description length and helmholtz free energy. Advances in neural information processing systems, 1993. (Cited on page 45)
J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. (Cited on page 50, 78, 84, 92, 93)
J. Ho, A. Jain, and P.
Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 2020. (Cited on page 47, 48)
J. Hoetzel and B. Suess. Structural changes in aptamers are essential for synthetic riboswitch engineering. Journal of Molecular Biology, 2022. (Cited on page 99)
E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling. Equivariant diffusion for molecule generation in 3d. In International conference on machine learning. PMLR, 2022. (Cited on page 47, 77, 84, 85, 90, 92, 182)
S. Hooker. The hardware lottery. Communications of the ACM, 2021. (Cited on page 140)
S. Hordan, T. Amir, and N. Dym. Weisfeiler leman for euclidean equivariant machine learning. arXiv preprint arXiv:2402.02484, 2024. (Cited on page 73)
J. Hou, B. Adhikari, and J. Cheng. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 2017. (Cited on page 70)
W. Hu, M. Shuaibi, A. Das, S. Goyal, A. Sriram, J. Leskovec, D. Parikh, and C. L. Zitnick. Forcenet: A graph neural network for large-scale quantum calculations. Preprint arXiv:2103.01436, 2021. (Cited on page 42)
K. Huang, T. Fu, W. Gao, Y. Zhao, Y. Roohani, J. Leskovec, C. Coley, C. Xiao, J. Sun, and M. Zitnik. Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. (Cited on page 70)
P.-S. Huang, S. E. Boyken, and D. Baker. The coming of age of de novo protein design. Nature, 2016. (Cited on page 23, 100, 112)
J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola. Generative models for graph-based protein design. In NeurIPS, 2019a. (Cited on page 70)
J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola. Generative models for graph-based protein design. NeurIPS, 2019b. (Cited on page 101, 102)
J. B. Ingraham, M. Baranov, Z. Costello, K. W. Barber, W. Wang, A. Ismail, V. Frappier, D. M. Lord, C. Ng-Thow-Hing, E. R.
Van Vlack, et al. Illuminating protein space with a programmable generative model. Nature, 2023. (Cited on page 77, 113)
R. Irwin, A. Tibo, J. P. Janet, and S. Olsson. Semlaflow–efficient 3d molecular generation with latent attention and equivariant flow matching. In The 28th International Conference on Artificial Intelligence and Statistics, 2025. (Cited on page 89)
A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021. (Cited on page 95)
A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, et al. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL materials, 2013. (Cited on page 84)
K. Jamali, L. Käll, R. Zhang, A. Brown, D. Kimanius, and S. H. Scheres. Automated model building and protein identification in cryo-em maps. Nature, 2024. (Cited on page 143)
A. R. Jamasb, A. Morehead, C. K. Joshi, Z. Zuobai, K. Didi, S. V. Mathis, C. Harris, J. Tang, J. Cheng, P. Liò, et al. Evaluating representation learning on the protein structure universe. In ICLR, 2024. (Cited on page 17, 68, 70)
S. Jegelka. Theory of graph neural networks: Representation and learning. arXiv preprint arXiv:2204.07697, 2022. (Cited on page 55)
R. Jiao, W. Huang, P. Lin, J. Han, P. Chen, Y. Lu, and Y. Liu. Crystal structure prediction by joint equivariant diffusion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. (Cited on page 77, 85, 90, 92)
B. Jing, S. Eismann, P. Suriana, R. J. L. Townshend, and R. Dror. Learning from protein structure with geometric vector perceptrons. In ICLR, 2020. (Cited on page 39, 64, 67, 69, 73, 101, 102, 103)
W. K. Johnston, P. J. Unrau, M. S. Lawrence, M. E. Glasner, and D. P. Bartel.
Rna-catalyzed rna polymerization: accurate and general rna-templated primer extension. Science, 2001. (Cited on page 115)
C. K. Joshi. Transformers are graph neural networks. The Gradient, and arXiv preprint arXiv:2506.22084, 2025. (Cited on page 12, 31, 63, 93, 140, 142)
C. K. Joshi and P. Liò. grnade: A geometric deep learning pipeline for 3d rna inverse design. In A. Churkin and D. Barash, editors, RNA Design: Methods and Protocols, pages 121–135. Springer, Methods in Molecular Biology (MIMB, volume 2847), 2024. (Cited on page 17)
C. K. Joshi, C. Bodnar, S. V. Mathis, T. Cohen, and P. Lio. On the expressive power of geometric graph neural networks. In International conference on machine learning, 2023. (Cited on page 17, 73, 185, 189)
C. K. Joshi, X. Fu, Y.-L. Liao, V. Gharakhanyan, B. K. Miller, A. Sriram, and Z. W. Ulissi. All-atom diffusion transformers: Unified generative modelling of molecules and materials. In International Conference on Machine Learning (ICML), 2025a. (Cited on page 17, 75, 142)
C. K. Joshi, E. Gianni, S. L. Kwok, S. V. Mathis, P. Liò, and P. Holliger. Generative inverse design of rna structure and function with grnade. bioRxiv, pages 2025–11, 2025b. (Cited on page 18)
C. K. Joshi, A. R. Jamasb, R. Viñas, C. Harris, S. Mathis, A. Morehead, R. Anand, and P. Liò. grnade: Geometric deep learning for 3d rna inverse design. In International Conference on Learning Representations (ICLR), 2025c. (Cited on page 17)
J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, et al. Highly accurate protein structure prediction with alphafold. Nature, 2021. (Cited on page 11, 12, 36, 53, 69, 75, 92, 99, 113, 141)
S.-O. Kaba, A. K. Mondal, Y. Zhang, Y. Bengio, and S. Ravanbakhsh. Equivariance with learned canonicalization functions. In International Conference on Machine Learning. PMLR, 2023. (Cited on page 42)
K. Kappel, K. Zhang, Z. Su, A. M. Watkins, W. Kladwang, S. Li, G. Pintilie, V. V. Topkar, R. Rangan, I. N. Zheludev, et al.
Accelerated cryo-em-guided determination of three-dimensional rna-only structures. Nature methods, 2020. (Cited on page 137)
M. L. Ken, R. Roy, A. Geng, L. R. Ganser, A. Manghrani, B. R. Cullen, U. Schulze-Gahmen, D. Herschlag, and H. M. Al-Hashimi. Rna conformational propensities determine cellular activity. Nature, 2023. (Cited on page 100, 109, 136)
D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014. (Cited on page 45, 79, 93)
D. P. Kingma and M. Welling. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 2019. (Cited on page 46)
T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. (Cited on page 31)
W. Kohn, A. D. Becke, and R. G. Parr. Density functional theory of electronic structure. The Journal of Physical Chemistry, 1996. (Cited on page 43)
D. P. Kovács, J. H. Moore, N. J. Browning, I. Batatia, J. T. Horton, Y. Pu, V. Kapil, W. C. Witt, I.-B. Magdău, D. J. Cole, and G. Csányi. Mace-off: Short-range transferable machine learning force fields for organic molecules. Journal of the American Chemical Society, 2025. (Cited on page 143)
R. C. Kretsch, A. M. Hummer, S. He, R. Yuan, J. Zhang, T. Karagianes, Q. Cong, A. Kryshtafovych, and R. Das. Assessment of nucleic acid structure prediction in casp16. bioRxiv, 2025. (Cited on page 118, 136)
A. Kun, M. Santos, and E. Szathmáry. Real ribozymes suggest a relaxed error threshold. Nature genetics, 2005. (Cited on page 126)
C. N. Lambert, V. Opuu, F. Calvanese, P. Pavlinova, F. Zamponi, E. J. Hayden, M. Weigt, M. Smerlak, and P. Nghe. Exploring the space of self-reproducing ribozymes using generative models. Nature communications, 2025. (Cited on page 126)
J. Lan, A. Palizhati, M. Shuaibi, B. M. Wood, B. Wander, A. Das, M. Uyttendaele, C. L. Zitnick, and Z. W. Ulissi.
Adsorbml: a leap in efficiency for adsorption energy calculations using generalizable machine learning potentials. npj Computational Materials, 9(1):172, 2023. (Cited on page 44)
T. J. Lane. Protein structure prediction has reached the single-structure frontier. Nature Methods, 2023. (Cited on page 142)
T. Le, J. Cremer, F. Noe, D.-A. Clevert, and K. T. Schütt. Navigating the design space of equivariant diffusion-based generative models for de novo 3d molecule generation. In The Twelfth International Conference on Learning Representations, 2024. (Cited on page 89)
J. Lee, W. Kladwang, M. Lee, D. Cantu, M. Azizyan, H. Kim, A. Limpaecher, S. Gaikwad, S. Yoon, A. Treuille, et al. Rna design rules from a massive open laboratory. Proceedings of the National Academy of Sciences, 2014. (Cited on page 119)
J. K. Leman, B. D. Weitzner, S. M. Lewis, J. Adolf-Bryfogle, N. Alam, R. F. Alford, M. Aprahamian, D. Baker, K. A. Barlow, P. Barth, et al. Macromolecular modelling and design in rosetta: recent methods and frameworks. Nature methods, 2020. (Cited on page 108, 114)
D. Levine and P. J. Steinhardt. Quasicrystals: a new class of ordered structures. Physical review letters, 1984. (Cited on page 67)
S. Li, S. Moayedpour, R. Li, M. Bailey, S. Riahi, L. Kogler-Anele, M. Miladi, J. Miner, D. Zheng, J. Wang, et al. Codonbert: Large language models for mrna design and optimization. bioRxiv, 2023a. (Cited on page 113)
Y. Li, C. Zhang, C. Feng, R. Pearce, P. Lydia Freddolino, and Y. Zhang. Integrating end-to-end learning with deep geometrical potentials for ab initio rna structure prediction. Nature Communications, 2023b. (Cited on page 113, 118)
Z. Li, X. Wang, Y. Huang, and M. Zhang. Is distance matrix enough for geometric deep learning? NeurIPS, 2023c. (Cited on page 73)
Y. Liao and T. E. Smidt. Equiformer: Equivariant graph attention transformer for 3d atomistic graphs. In ICLR, 2023. (Cited on page 41)
Y.-L. Liao, B. M. Wood, A. Das, and T. Smidt.
Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations. In ICLR, 2024a. (Cited on page 42)
Y.-L. Liao, B. M. Wood, A. Das, and T. Smidt. Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations. In International Conference on Learning Representations, 2024b. (Cited on page 79, 92, 185)
Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023. (Cited on page 71)
Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023. (Cited on page 47, 49, 82)
X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z. (Cited on page 49)
Y. Liu, L. Wang, M. Liu, Y. Lin, X. Zhang, B. Oztekin, and S. Ji. Spherical message passing for 3d molecular graphs. In ICLR, 2022. (Cited on page 35)
A. Loukas. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020. (Cited on page 56)
A. X. Lu, W. Yan, S. A. Robinson, S. Kelow, K. K. Yang, V. Gligorijevic, K. Cho, R. Bonneau, P. Abbeel, and N. C. Frey. All-atom protein generation with latent diffusion. In ICLR 2025 Workshop on Generative and Experimental Perspectives for Biomolecular Design, 2025. (Cited on page 93)
N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024. (Cited on page 84)
A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos Jr, C. Xiong, Z. Z. Sun, R. Socher, et al.
Large language models generate functional protein sequences across diverse families. Nature biotechnology, 2023. (Cited on page 44)
M. Mandal and R. R. Breaker. Gene regulation by riboswitches. Nature reviews Molecular cell biology, 2004. (Cited on page 136)
T. Marinus, A. B. Fessler, C. A. Ogle, and D. Incarnato. A novel shape reagent enables the analysis of rna structure in living cells with unprecedented accuracy. Nucleic acids research, 2021. (Cited on page 118)
H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. NeurIPS, 2019. (Cited on page 56)
K. Martinkus, J. Ludwiczak, W.-C. Liang, J. Lafrance-Vanasse, I. Hotzel, A. Rajpal, et al. Abdiffuser: full-atom generation of in-vitro functioning antibodies. Advances in Neural Information Processing Systems, 2024. (Cited on page 93)
E. K. McRae, C. J. Wan, E. L. Kristoffersen, K. Hansen, E. Gianni, I. Gallego, J. F. Curran, J. Attwater, P. Holliger, and E. S. Andersen. Cryo-em structure and functional landscape of an rna polymerase ribozyme. Proceedings of the National Academy of Sciences, 2024. (Cited on page 111, 112, 114, 126, 127, 128, 192)
M. Metkar, C. S. Pepin, and M. J. Moore. Tailor made: the art of therapeutic mrna design. Nature Reviews Drug Discovery, 2024. (Cited on page 99)
B. K. Miller, R. T. Chen, A. Sriram, and B. M. Wood. Flowmm: Generating materials with riemannian flow matching. In Forty-first International Conference on Machine Learning, 2024. (Cited on page 77, 80, 85, 88, 90, 92, 181)
M. G. Mohsen, M. K. Midy, A. Balaji, and R. R. Breaker. Exploiting natural riboswitches for aptamer engineering and validation. Nucleic Acids Research, 2023. (Cited on page 136)
A. Morehead and J. Cheng. Geometry-complete perceptron networks for 3d molecular graphs. Bioinformatics, 2024. (Cited on page 69, 70)
A. Morehead, C. Chen, and J. Cheng. Geometric transformers for protein interface contact prediction.
In International Conference on Learning Representations, 2022. (Cited on page 69)
C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In AAAI, 2019. (Cited on page 55, 60, 176)
C. Morris, Y. Lipman, H. Maron, B. Rieck, N. M. Kriege, M. Grohe, M. Fey, and K. Borgwardt. Weisfeiler and leman go machine learning: The story so far. arXiv preprint, 2021. (Cited on page 55)
A. Musaelian, S. L. Batzner, A. Johansson, L. Sun, C. J. Owen, M. Kornbluth, and B. Kozinsky. Learning local equivariant representations for large-scale atomistic dynamics. Nature Communications, 2022. (Cited on page 41)
F. Musil, A. Grisafi, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti. Physics-inspired structural representations for molecules and materials. ACS Chemical Reviews, 2021. (Cited on page 12, 27, 33, 36, 141)
K. Mustafina, K. Fukunaga, and Y. Yokobayashi. Design of mammalian on-riboswitches based on tandemly fused aptamer and ribozyme. ACS Synthetic Biology, 2019. (Cited on page 135)
S. Neidle and M. Sanderson. Principles of nucleic acid structure. Academic Press, 2021. (Cited on page 23)
P. O. O Pinheiro, J. Rackers, J. Kleinhenz, M. Maser, O. Mahmood, A. Watkins, S. Ra, V. Sresht, and S. Saremi. 3d molecule generation by denoising voxel grids. Advances in Neural Information Processing Systems, 36:69077–69097, 2023. (Cited on page 93)
S. P. Ong, W. D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V. L. Chevrier, K. A. Persson, and G. Ceder. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science, 2013. (Cited on page 181)
F. J. O’Reilly and J. Rappsilber. Cross-linking mass spectrometry: methods and applications in structural, molecular and systems biology. Nature structural & molecular biology, 2018. (Cited on page 143)
L. Orgel. Evolution of the genetic apparatus.
Journal of Molecular Biology, 1968. (Cited on page 126)
S. Passaro and C. L. Zitnick. Reducing so (3) convolutions to so (2) for efficient equivariant gnns. arXiv preprint arXiv:2302.03655, 2023. (Cited on page 42)
W. Peebles and S. Xie. Scalable diffusion models with transformers. In International Conference on Computer Vision, 2023. (Cited on page 78, 83, 92, 93)
R. J. Penic, T. Vlasic, R. G. Huber, Y. Wan, and M. Sikic. Rinalmo: General-purpose rna language models can generalize well on structure prediction tasks. arXiv preprint, 2024. (Cited on page 113)
M. F. Perutz. Structure of haemoglobin. Brookhaven Symposia in Biology, 1960. (Cited on page 25)
B. T. Porebski, M. Balmforth, G. Browne, A. Riley, K. Jamali, M. J. Fürst, M. Velic, A. Buchanan, R. Minter, T. Vaughan, et al. Rapid discovery of high-affinity antibodies via massively parallel sequencing, ribosome display and affinity screening. Nature biomedical engineering, 2024. (Cited on page 143)
S. N. Pozdnyakov and M. Ceriotti. Incompleteness of graph convolutional neural networks for points clouds in three dimensions. arXiv preprint, 2022. (Cited on page 64)
S. N. Pozdnyakov and M. Ceriotti. Smooth, exact rotational symmetrization for deep learning on point clouds. arXiv preprint arXiv:2305.19302, 2023. (Cited on page 42)
S. N. Pozdnyakov, M. J. Willatt, A. P. Bartók, C. Ortner, G. Csányi, and M. Ceriotti. Incompleteness of atomic structure representations. Physical Review Letters, 2020. (Cited on page 57, 60, 65, 67, 68, 73, 175, 176)
F. Praetorius, P. J. Leung, M. H. Tessmer, A. Broerman, C. Demakis, A. F. Dishman, A. Pillai, A. Idris, D. Juergens, J. Dauparas, et al. Design of stimulus-responsive two-state hinge proteins. Science, 2023. (Cited on page 143)
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202, 2023.
(Cited on page 31)
M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. In International conference on machine learning, 2017. (Cited on page 13, 53)
P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. (Cited on page 36)
V. Ramakrishnan. Ribosome structure and the mechanism of translation. Cell, 2002. (Cited on page 25)
L. Rampášek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, and D. Beaini. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 2022. (Cited on page 32)
R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of graph theory, 1977. (Cited on page 54)
D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, 2014. (Cited on page 45)
J. Riebesell, R. E. Goodall, A. Jain, P. Benner, K. A. Persson, and A. A. Lee. Matbench discovery–an evaluation framework for machine learning crystal stability prediction. arXiv preprint arXiv:2308.14920, 2023. (Cited on page 181)
E. Rivas and S. R. Eddy. A dynamic programming algorithm for rna structure prediction including pseudoknots. Journal of molecular biology, 1999. (Cited on page 118)
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. (Cited on page 50, 78, 81, 92, 93, 94)
M. J. Rowley and V. G. Corces. Organizational principles of 3d genome architecture. Nature Reviews Genetics, 2018. (Cited on page 25)
F. Runge, D. Stoll, S. Falkner, and F. Hutter. Learning to design RNA. In ICLR, 2019. (Cited on page 113)
B. Sanchez-Lengeling and A. Aspuru-Guzik. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 2018. (Cited on page 11, 44)
R. Sato, M. Yamada, and H. Kashima.
Random features strengthen graph neural networks. In SIAM International Conference on Data Mining (SDM), 2021. (Cited on page 56)
V. G. Satorras, E. Hoogeboom, and M. Welling. E (n) equivariant graph neural networks. In ICML, 2021. (Cited on page 39, 53, 63, 64, 66, 67, 68, 70, 92)
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE transactions on neural networks, 2008. (Cited on page 29)
B. Schneider, B. A. Sweeney, A. Bateman, J. Cerny, T. Zok, and M. Szachniuk. When will rna get its alphafold moment? Nucleic Acids Research, 2023. (Cited on page 99)
A. Schneuing, C. Harris, Y. Du, K. Didi, A. Jamasb, I. Igashov, et al. Structure-based drug design with equivariant diffusion models. Nature Computational Science, 2024. (Cited on page 50, 69, 77, 90, 92, 94, 143)
K. Schütt, O. Unke, and M. Gastegger. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In ICML, 2021. (Cited on page 39, 58, 66, 69, 175)
K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller. Schnet–a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 2018. (Cited on page 35, 36, 53, 63, 64, 68, 69, 70)
M. H. Segler, T. Kogej, C. Tyrchan, and M. P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS central science, 2018. (Cited on page 44)
A. V. Shapeev. Moment tensor potentials: A class of systematically improvable interatomic potentials. Multiscale Modeling & Simulation, 2016. (Cited on page 73)
T. Shen, Z. Hu, Z. Peng, J. Chen, P. Xiong, L. Hong, L. Zheng, Y. Wang, I. King, S. Wang, et al. E2efold-3d: End-to-end deep learning method for accurate de novo rna 3d structure prediction. arXiv preprint, 2022. (Cited on page 106)
Y. Shi, S. Zheng, G. Ke, Y. Shen, J. You, J. He, S. Luo, C. Liu, D. He, and T.-Y. Liu. Benchmarking graphormer on large-scale molecular modeling datasets.
arXiv preprint, 2022. (Cited on page 63)
N. Shoghi, A. Kolluru, J. R. Kitchin, Z. W. Ulissi, C. L. Zitnick, and B. M. Wood. From molecules to materials: Pre-training large generalizable models for atomic property prediction. In The Twelfth International Conference on Learning Representations, 2024. (Cited on page 77, 92)
Y. Shulgina, M. I. Trinidad, C. J. Langeberg, H. Nisonoff, S. Chithrananda, P. Skopintsev, A. J. Nissley, J. Patel, R. S. Boger, H. Shi, et al. Rna language models predict mutations that improve rna function. Nature Communications, 2024. (Cited on page 44, 113)
G. Simeon and G. D. Fabritiis. Tensornet: Cartesian tensor representations for efficient learning of molecular potentials. In NeurIPS, 2023. (Cited on page 40)
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, 2015. (Cited on page 47, 82)
J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. (Cited on page 48, 93)
Y. Song and S. Ermon. Generative modelling by estimating gradients of the data distribution. Advances in neural information processing systems, 2019. (Cited on page 48, 82)
A. Sriram, B. K. Miller, R. T. Q. Chen, and B. M. Wood. Flowllm: Flow matching for material generation with large language models as base distributions. In NeurIPS, 2024. (Cited on page 85, 181)
J. Stagno, Y. Liu, Y. Bhandari, C. Conrad, S. Panja, M. Swain, L. Fan, G. Nelson, C. Li, D. Wendel, et al. Structures of riboswitch rna reaction states by mix-and-inject xfel serial crystallography. Nature, 2017. (Cited on page 109)
D. W. Staple and S. E. Butcher. Pseudoknots: Rna structures with diverse functions. PLoS biology, 2005. (Cited on page 118)
J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackermann, et al.
A deep learning approach to antibiotic discovery. Cell, 2020. (Cited on page 11, 29, 44)
E. J. Strobel, A. M. Yu, and J. B. Lucks. High-throughput determination of rna structures. Nature Reviews Genetics, 2018. (Cited on page 118, 143, 144)
S. Sumi, M. Hamada, and H. Saito. Deep generative design of rna family sequences. Nature Methods, 2024. (Cited on page 113)
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 2014. (Cited on page 44)
R. Sutton. The bitter lesson. Incomplete Ideas (blog), 2019. (Cited on page 140)
C. Tan, Y. Zhang, Z. Gao, H. Cao, and S. Z. Li. Hierarchical data-efficient representation learning for tertiary structure-based rna design. arXiv preprint, 2023. (Cited on page 109)
N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint, 2018. (Cited on page 12, 40, 41, 53, 63, 64, 66, 68, 70)
K. F. Tjhung, M. N. Shokhirev, D. P. Horning, and G. F. Joyce. An rna polymerase ribozyme that synthesizes its own ancestor. Proceedings of the National Academy of Sciences, 2020. (Cited on page 126)
R. Todeschini and V. Consonni. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references. John Wiley & Sons, 2009. (Cited on page 44)
J. Topping, F. D. Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In ICLR, 2022. (Cited on page 66)
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. (Cited on page 85)
R. J. Townshend, S. Eismann, A. M. Watkins, R. Rangan, M. Karelina, R. Das, and R. O. Dror. Geometric deep learning of rna structure.
Science, 2021. (Cited on page 113)
O. T. Unke, S. Chmiela, H. E. Sauceda, M. Gastegger, I. Poltavsky, K. T. Schutt, A. Tkatchenko, and K.-R. Muller. Machine learning force fields. Chemical Reviews, 2021. (Cited on page 43)
A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modelling in latent space. Advances in neural information processing systems, 2021. (Cited on page 50, 78, 92)
A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 2017. (Cited on page 45)
M. Varadi, S. Anyango, M. Deshpande, S. Nair, C. Natassia, G. Yordanova, D. Yuan, O. Stroe, G. Wood, A. Laydon, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 2021. (Cited on page 69)
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 2017a. (Cited on page 12, 31, 44, 45, 79)
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017b. (Cited on page 71)
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. ICLR, 2018. (Cited on page 31)
Q. Vicens and J. S. Kieft. Thoughts on how to think (and talk) about rna structure. Proceedings of the National Academy of Sciences, 2022. (Cited on page 100, 118, 190)
C. Vignac, N. Osman, L. Toni, and P. Frossard. Midi: Mixed graph and 3d denoising diffusion for molecule generation. In ECML PKDD, 2023. (Cited on page 89)
S. Villar, D. W. Hogg, K. Storey-Fisher, W. Yao, and B. Blum-Smith. Scalars are universal: Equivariant machine learning, structured like classical physics. NeurIPS, 2021. (Cited on page 73)
P. Vincent, H. Larochelle, Y. Bengio, and P.-A.
Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, 2008. (Cited on page 45)
L. M. Wadley, K. S. Keating, C. M. Duarte, and A. M. Pyle. Evaluating and learning from rna pseudotorsional space: quantitative validation of a reduced representation for rna structure. Journal of molecular biology, 2007. (Cited on page 101)
B. Wander, M. Shuaibi, J. R. Kitchin, Z. W. Ulissi, and C. L. Zitnick. Cattsunami: Accelerating transition state energy calculations with pretrained graph neural networks. ACS Catalysis, 15(7):5283–5294, 2025. (Cited on page 44, 53)
L. Wang, Y. Liu, Y. Lin, H. Liu, and S. Ji. Comenet: Towards complete and efficient message passing for 3d molecular graphs. 2022. (Cited on page 63, 75)
W. Wang, C. Feng, R. Han, Z. Wang, L. Ye, Z. Du, H. Wei, F. Zhang, Z. Peng, and J. Yang. trrosettarna: automated prediction of rna 3d structure with transformer network. Nature Communications, 2023. (Cited on page 113, 118)
Y. Wang, A. A. Elhag, N. Jaitly, J. M. Susskind, and M. A. Bautista. Swallowing the bitter pill: Simplified scalable conformer generation. In International conference on machine learning, 2024. (Cited on page 12, 75, 93, 142)
M. Ward, E. Courtney, and E. Rivas. Fitness functions for rna structure design. Nucleic Acids Research, 2023. (Cited on page 113)
A. M. Watkins, R. Rangan, and R. Das. Farfar2: improved de novo rosetta prediction of complex global rna folds. Structure, 2020. (Cited on page 113)
J. D. Watson and F. H. Crick. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature, 1953. (Cited on page 24)
J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A. J. Borst, R. J. Ragotte, L. F. Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, 2023. (Cited on page 11, 47, 50, 77, 92, 94, 99, 106, 113, 143)
H. K. Wayment-Steele, W. Kladwang, A.
I. Strom, J. Lee, A. Treuille, A. Becka, E. Participants, and R. Das. Rna secondary structure packages evaluated and improved by high-throughput experiments. Nature methods, 2022a. (Cited on page 106, 119)
H. K. Wayment-Steele, W. Kladwang, A. M. Watkins, D. S. Kim, B. Tunguz, W. Reade, M. Demkin, J. Romano, R. Wellington-Oguri, J. J. Nicol, et al. Deep learning models for predicting rna degradation via dual crowdsourcing. Nature Machine Intelligence, 2022b. (Cited on page 119)
H. K. Wayment-Steele, G. El Nesr, R. Hettiarachchi, H. Kariyawasam, S. Ovchinnikov, and D. Kern. Learning millisecond protein dynamics from what is missing in nmr spectra. bioRxiv, pages 2025–03, 2025. (Cited on page 143)
M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. S. Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data. NeurIPS, 2018. (Cited on page 39)
D. Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 1988. (Cited on page 22)
B. Weisfeiler and A. Leman. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series, 1968. (Cited on page 55)
R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1989. (Cited on page 45, 105)
D. S. Wilson and J. W. Szostak. In vitro selection of functional nucleic acids. Annual review of biochemistry, 1999. (Cited on page 126)
A. Winnifrith, C. Outeiral, and B. Hie. Generative artificial intelligence for de novo protein design. Current Opinion in Structural Biology, 2024. (Cited on page 44)
A. Wochner, J. Attwater, A. Coulson, and P. Holliger. Ribozyme-catalyzed transcription of an active ribozyme. Science, 2011. (Cited on page 126)
C. Woese. The Genetic Code: the Molecular basis for Genetic Expression. New York: Harper & Row, 1967. (Cited on page 126)
B. M. Wood, M. Dzamba, X. Fu, M.
Gao, M. Shuaibi, L. Barroso-Luque, K. Abdelmaqsoud, V. Gharakhanyan, J. R. Kitchin, D. S. Levine, K. Michel, A. Sriram, T. Cohen, A. Das, A. Rizvi, S. J. Sahoo, Z. W. Ulissi, and C. L. Zitnick. UMA: A family of universal models for atoms. 2025. (Cited on page 11, 43, 77, 92, 141, 143)
Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical science, 2018. (Cited on page 84)
T. Xie and J. C. Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett., 2018. (Cited on page 35, 44)
T. Xie, X. Fu, O.-E. Ganea, R. Barzilay, and T. S. Jaakkola. Crystal diffusion variational autoencoder for periodic material generation. In International Conference on Learning Representations, 2022. (Cited on page 81, 84, 85, 90, 181)
K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In ICLR, 2019. (Cited on page 31, 55, 60, 176)
K. Xu, J. Li, M. Zhang, S. S. Du, K.-i. Kawarabayashi, and S. Jegelka. What can neural networks reason about? In International Conference on Learning Representations, 2020. (Cited on page 12)
M. Xu, A. S. Powers, R. O. Dror, S. Ermon, and J. Leskovec. Geometric latent diffusion models for 3D molecule generation. In International Conference on Machine Learning, 2023. (Cited on page 85, 88, 90, 92)
S. Yang, K. Cho, A. Merchant, P. Abbeel, D. Schuurmans, I. Mordatch, and E. D. Cubuk. Scalable diffusion for materials generation. In The Twelfth International Conference on Learning Representations, 2024. (Cited on page 85)
J. D. Yesselman, D. Eiler, E. D. Carlson, M. R. Gotrik, A. E. d’Aquino, A. N. Ooms, W. Kladwang, P. D. Carlson, X. Shi, D. A. Costantino, et al. Computational design of three-dimensional RNA structure and function. Nature nanotechnology, 2019. (Cited on page 99, 113)
J. Yim, A. Campbell, A. Y. Foong, M. Gastegger, J.
Jiménez-Luna, S. Lewis, V. G. Satorras, B. S. Veeling, R. Barzilay, T. Jaakkola, et al. Fast protein backbone generation with SE(3) flow matching. arXiv preprint arXiv:2310.05297, 2023a. (Cited on page 83)
J. Yim, B. L. Trippe, V. De Bortoli, E. Mathieu, A. Doucet, R. Barzilay, and T. Jaakkola. SE(3) diffusion model with application to protein backbone generation. 2023b. (Cited on page 83, 92)
R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018. (Cited on page 29)
M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. NeurIPS, 2017. (Cited on page 104)
A. Zee. Group theory in a nutshell for physicists. 2016. (Cited on page 28)
C. Zeni, R. Pinsler, D. Zügner, A. Fowler, M. Horton, X. Fu, et al. MatterGen: a generative model for inorganic materials design. Nature, 2025. (Cited on page 47, 50, 77, 85, 90, 94, 143)
C. Zhang, M. Shine, A. M. Pyle, and Y. Zhang. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nature methods, 2022. (Cited on page 106, 107)
L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023a. (Cited on page 92)
Z. Zhang, M. Xu, A. R. Jamasb, V. Chenthamarakshan, A. Lozano, P. Das, and J. Tang. Protein representation learning by geometric structure pretraining. In The Eleventh International Conference on Learning Representations, 2023b. (Cited on page 69, 70)
Y. Zhao, K. Oono, H. Takizawa, and M. Kotera. GenerRNA: A generative pre-trained language model for de novo RNA design. PLoS One, 2024.
(Cited on page 113)

Appendix A
Appendix: Expressive Power of Molecular Structure Representations (Chapter 3)

A.1 Geometric GNN Design Space Proofs

A.1.1 Role of Depth (Section 3.3.1)

The following results are a consequence of the construction of GWL as well as the definitions of k-hop distinct and k-hop identical geometric graphs. Note that k-hop distinct geometric graphs are also (k + 1)-hop distinct. Similarly, k-hop identical geometric graphs are also (k − 1)-hop identical, but not necessarily (k + 1)-hop distinct.

Given two distinct neighbourhoods $\mathcal{N}_1$ and $\mathcal{N}_2$, the G-orbits of the corresponding geometric multisets $g_1$ and $g_2$ are mutually exclusive, i.e. $\mathcal{O}_G(g_1) \cap \mathcal{O}_G(g_2) \equiv \emptyset$. By the properties of I-HASH this implies $c_1 \neq c_2$. Conversely, if $\mathcal{N}_1$ and $\mathcal{N}_2$ were identical up to group actions, their G-orbits would overlap, i.e. $g_1 = \mathfrak{g} \cdot g_2$ for some group element $\mathfrak{g} \in G$ and $\mathcal{O}_G(g_1) = \mathcal{O}_G(g_2) \Rightarrow c_1 = c_2$.

Proposition 3. GWL can distinguish any k-hop distinct geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic, and k iterations are sufficient.

Proof of Proposition 3. The k-th iteration of GWL identifies the G-orbit of the k-hop subgraph $\mathcal{N}_i^{(k)}$ at each node i via the geometric multiset $g_i^{(k)}$. $G_1$ and $G_2$ being k-hop distinct implies that there exists some bijection b and some node $i \in \mathcal{V}_1$, $b(i) \in \mathcal{V}_2$ such that the corresponding k-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$ are distinct. Thus, the G-orbits of the corresponding geometric multisets $g_i^{(k)}$ and $g_{b(i)}^{(k)}$ are mutually exclusive, i.e. $\mathcal{O}_G(g_i^{(k)}) \cap \mathcal{O}_G(g_{b(i)}^{(k)}) \equiv \emptyset \Rightarrow c_i^{(k)} \neq c_{b(i)}^{(k)}$. Thus, k iterations of GWL are sufficient to distinguish $G_1$ and $G_2$.

Proposition 4. Up to k iterations, GWL cannot distinguish any k-hop identical geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic.

Proof of Proposition 4. The k-th iteration of GWL identifies the G-orbit of the k-hop subgraph $\mathcal{N}_i^{(k)}$ at each node i via the geometric multiset $g_i^{(k)}$.
$G_1$ and $G_2$ being k-hop identical implies that for all bijections b and all nodes $i \in \mathcal{V}_1$, $b(i) \in \mathcal{V}_2$, the corresponding k-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$ are identical up to group actions. Thus, the G-orbits of the corresponding geometric multisets $g_i^{(k)}$ and $g_{b(i)}^{(k)}$ overlap, i.e. $\mathcal{O}_G(g_i^{(k)}) = \mathcal{O}_G(g_{b(i)}^{(k)}) \Rightarrow c_i^{(k)} = c_{b(i)}^{(k)}$. Thus, up to k iterations, GWL cannot distinguish $G_1$ and $G_2$.

Proposition 5. IGWL can distinguish any 1-hop distinct geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic, and 1 iteration is sufficient.

Proof of Proposition 5. Each iteration of IGWL identifies the G-orbit of the 1-hop local neighbourhood $\mathcal{N}_i^{(k=1)}$ at each node i. $G_1$ and $G_2$ being 1-hop distinct implies that there exists some bijection b and some node $i \in \mathcal{V}_1$, $b(i) \in \mathcal{V}_2$ such that the corresponding 1-hop local neighbourhoods $\mathcal{N}_i^{(1)}$ and $\mathcal{N}_{b(i)}^{(1)}$ are distinct. Thus, the G-orbits of the corresponding geometric multisets $g_i^{(1)}$ and $g_{b(i)}^{(1)}$ are mutually exclusive, i.e. $\mathcal{O}_G(g_i^{(1)}) \cap \mathcal{O}_G(g_{b(i)}^{(1)}) \equiv \emptyset \Rightarrow c_i^{(1)} \neq c_{b(i)}^{(1)}$. Thus, 1 iteration of IGWL is sufficient to distinguish $G_1$ and $G_2$.

Proposition 6. Any number of iterations of IGWL cannot distinguish any 1-hop identical geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are isomorphic.

Proof of Proposition 6. Each iteration of IGWL identifies the G-orbit of the 1-hop local neighbourhood $\mathcal{N}_i^{(k=1)}$ at each node i, but cannot identify G-orbits beyond 1-hop by the construction of IGWL, as no geometric information is propagated. $G_1$ and $G_2$ being 1-hop identical implies that for all bijections b and all nodes $i \in \mathcal{V}_1$, $b(i) \in \mathcal{V}_2$, the corresponding 1-hop local neighbourhoods $\mathcal{N}_i^{(1)}$ and $\mathcal{N}_{b(i)}^{(1)}$ are identical up to group actions. Thus, the G-orbits of the corresponding geometric multisets $g_i^{(1)}$ and $g_{b(i)}^{(1)}$ overlap, i.e. $\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_{b(i)}^{(1)}) \Rightarrow c_i^{(k)} = c_{b(i)}^{(k)}$ for any iteration k. Thus, any number of IGWL iterations cannot distinguish $G_1$ and $G_2$.

Proposition 7.
Assuming geometric graphs are constructed from point clouds using radial cutoffs, GWL can distinguish any geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are non-isomorphic. At most $k_{\text{Max}}$ iterations are sufficient, where $k_{\text{Max}}$ is the maximum graph diameter among $G_1$ and $G_2$.

Proof of Proposition 7. We assume that a geometric graph $G = (A, S, \vec{V}, \vec{X})$ is constructed from a point cloud $(S, \vec{V}, \vec{X})$ using a predetermined radial cutoff r. Thus, the adjacency matrix is defined as $a_{ij} = 1$ if $\|\vec{x}_i - \vec{x}_j\|_2 \leq r$, or 0 otherwise, for all $a_{ij} \in A$. Such construction procedures are conventional for geometric graphs in molecular modelling.

Given geometric graphs $G_1$ and $G_2$ where the underlying attributed graphs are non-isomorphic, let $k_{\text{Max}}$ be the maximum of the graph diameters of $G_1$ and $G_2$, and choose any arbitrary nodes $i \in \mathcal{V}_1$, $j \in \mathcal{V}_2$. We can define the $k_{\text{Max}}$-hop subgraphs $\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ at i and j, respectively. Thus, $\mathcal{N}_i^{(k_{\text{Max}})} = \mathcal{V}_1$ for all $i \in \mathcal{V}_1$, and $\mathcal{N}_j^{(k_{\text{Max}})} = \mathcal{V}_2$ for all $j \in \mathcal{V}_2$. Due to the assumed construction procedure of geometric graphs, $\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ must be distinct. Otherwise, if $\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ were identical up to group actions, the sets $(S_1, \vec{V}_1, \vec{X}_1)$ and $(S_2, \vec{V}_2, \vec{X}_2)$ would have yielded isomorphic graphs.

Figure A.1: Two geometric graphs for which IGWL and G-invariant GNNs cannot distinguish their perimeter, surface area, volume of the bounding box/sphere, distance from the centroid, and dihedral angles. The centroid is denoted by a red point and distances from it are denoted by dotted red lines. The bounding box enclosing the geometric graph is denoted by the dotted green lines.

The $k_{\text{Max}}$-th iteration of GWL identifies the G-orbit of the $k_{\text{Max}}$-hop subgraph $\mathcal{N}_i^{(k_{\text{Max}})}$ at each node i via the geometric multiset $g_i^{(k_{\text{Max}})}$. As $\mathcal{N}_i^{(k_{\text{Max}})}$ and $\mathcal{N}_j^{(k_{\text{Max}})}$ are distinct for any arbitrary nodes $i \in \mathcal{V}_1$, $j \in \mathcal{V}_2$, the G-orbits of the corresponding geometric multisets $g_i^{(k_{\text{Max}})}$ and $g_j^{(k_{\text{Max}})}$ are mutually exclusive, i.e.
$\mathcal{O}_G(g_i^{(k_{\text{Max}})}) \cap \mathcal{O}_G(g_j^{(k_{\text{Max}})}) \equiv \emptyset \Rightarrow c_i^{(k_{\text{Max}})} \neq c_j^{(k_{\text{Max}})}$. Thus, $k_{\text{Max}}$ iterations of GWL are sufficient to distinguish $G_1$ and $G_2$.

A.1.2 Limitations of Invariant Message Passing (Section 3.3.2)

Theorem 8. GWL is strictly more powerful than IGWL.

Proof of Theorem 8. Firstly, we can show that the GWL class contains IGWL if GWL can learn the identity when updating $g_i$ for all $i \in \mathcal{V}$, i.e. $g_i^{(t)} = g_i^{(t-1)} = g_i^{(0)} \equiv (s_i, \vec{v}_i)$. Thus, GWL is at least as powerful as IGWL, which does not update $g_i$.

Secondly, to show that GWL is strictly more powerful than IGWL, it suffices to show that there exists a pair of geometric graphs that can be distinguished by GWL but not by IGWL. We may consider any k-hop distinct geometric graphs for k > 1, where the underlying attributed graphs are isomorphic. Proposition 3 states that GWL can distinguish any such graphs, while Proposition 6 states that IGWL cannot distinguish them. An example is the pair of graphs in Figures 3.4 and 3.5.

Proposition 9. IGWL and G-invariant GNNs cannot decide several geometric graph properties: (1) perimeter, surface area, and volume of the bounding box/sphere enclosing the geometric graph; (2) distance from the centroid or centre of mass; and (3) dihedral angles.

Proof of Proposition 9. Following Garg et al. [2020], we say that a class of models decides a geometric graph property if there exists a model belonging to this class such that for any two geometric graphs that differ in the property, the model is able to distinguish the two geometric graphs.

In Figure A.1, we provide an example of two geometric graphs that demonstrate the proposition. $G_1$ and $G_2$ differ in the following geometric graph properties:
• Perimeter, surface area, and volume of the bounding box enclosing the geometric graph¹: (32 units, 40 units², 16 units³) vs. (28 units, 24 units², 8 units³).
• Multiset of distances from the centroid or centre of mass: {0.00, 1.00, 1.00, 2.45, 2.45} vs. {0.40, 1.08, 1.08, 2.32, 2.32}.
• Dihedral angles: $\angle(ljkm) = \frac{(\vec{x}_{jk} \times \vec{x}_{lj}) \cdot (\vec{x}_{jk} \times \vec{x}_{mk})}{|\vec{x}_{jk} \times \vec{x}_{lj}| \, |\vec{x}_{jk} \times \vec{x}_{mk}|}$ are clearly different for the two graphs.

However, according to Proposition 6 and Theorem 14, neither IGWL nor G-invariant GNNs can distinguish these two geometric graphs, and therefore they cannot decide any of these properties.

We can also show this via a geometric version of computation trees [Garg et al., 2020], for any number of IGWL or G-invariant GNN iterations, as illustrated in Figure 3.6. A computation tree $T_i^{(t)}$ represents the maximum information contained in GWL/IGWL colours or GNN features for node i at iteration t by an ‘unrolling’ of the message passing procedure. GWL, IGWL, and the corresponding classes of GNNs can be intuitively understood as colouring geometric computation trees.

Geometric computation trees are constructed recursively: $T_i^{(0)} = (s_i, \vec{v}_i)$ for all $i \in \mathcal{V}$. For t > 0, we start with a root node $(s_i, \vec{v}_i)$ and add a child subtree $T_j^{(t-1)}$ for all $j \in \mathcal{N}_i$ along with the relative position $\vec{x}_{ij}$ along the edge. To obtain the root node’s embedding or colour, both scalar and geometric information is propagated from the leaves up to the root. Thus, if two nodes have identical geometric computation trees, they will be mapped to the same node embedding or colour.

Critically, geometric orientation information cannot flow from one level to another in the computation trees for IGWL and G-invariant GNNs, as they only update scalar information. In the recursive construction procedure, we must insert a connector node $(s_j, \vec{v}_j)$ before adding the child subtree $T_j^{(t-1)}$ for all $j \in \mathcal{N}_i$ and prevent geometric information propagation between them.

Following the construction procedure for the geometric graphs in Figure A.1, we observe that the IGWL computation trees of any pair of isomorphic nodes are identical, as all 1-hop neighbourhoods are computationally identical.
Therefore, the set of node colours or node scalar features will also be identical, which implies that $G_1$ and $G_2$ cannot be distinguished.

Proposition 10. IGWL has the same expressive power as GWL for fully connected geometric graphs.

Proof of Proposition 10. We will prove by contradiction. Assume that there exists a pair of fully connected geometric graphs $G_1$ and $G_2$ which GWL can distinguish, but IGWL cannot.

If the underlying attributed graphs of $G_1$ and $G_2$ are isomorphic, by Proposition 3 and Proposition 6, $G_1$ and $G_2$ are 1-hop identical but k-hop distinct for some k > 1. For all bijections b and all nodes $i \in \mathcal{V}_1$, $b(i) \in \mathcal{V}_2$, the local neighbourhoods $\mathcal{N}_i^{(1)}$ and $\mathcal{N}_{b(i)}^{(1)}$ are identical up to group actions, and $\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_{b(i)}^{(1)}) \Rightarrow c_i^{(1)} = c_{b(i)}^{(1)}$. Additionally, there exists some bijection b and some nodes $i \in \mathcal{V}_1$, $b(i) \in \mathcal{V}_2$ such that the k-hop subgraphs $\mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(k)}$ are distinct, and $\mathcal{O}_G(g_i^{(k)}) \cap \mathcal{O}_G(g_{b(i)}^{(k)}) \equiv \emptyset \Rightarrow c_i^{(k)} \neq c_{b(i)}^{(k)}$. However, as $G_1$ and $G_2$ are fully connected, for any k, $\mathcal{N}_i^{(1)} = \mathcal{N}_i^{(k)}$ and $\mathcal{N}_{b(i)}^{(1)} = \mathcal{N}_{b(i)}^{(k)}$ are identical up to group actions. Thus, $\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_i^{(k)}) = \mathcal{O}_G(g_{b(i)}^{(1)}) = \mathcal{O}_G(g_{b(i)}^{(k)}) \Rightarrow c_i^{(1)} = c_i^{(k)} = c_{b(i)}^{(1)} = c_{b(i)}^{(k)}$. This is a contradiction.

If $G_1$ and $G_2$ are non-isomorphic and fully connected, for any arbitrary $i \in \mathcal{V}_1$, $j \in \mathcal{V}_2$ and any k-hop neighbourhood, we know that $\mathcal{N}_i^{(1)} = \mathcal{N}_i^{(k)}$ and $\mathcal{N}_j^{(1)} = \mathcal{N}_j^{(k)}$. Thus, a single iteration of GWL and IGWL identify the same G-orbits and assign the same node colours, i.e. $\mathcal{O}_G(g_i^{(1)}) = \mathcal{O}_G(g_i^{(k)}) \Rightarrow c_i^{(1)} = c_i^{(k)}$ and $\mathcal{O}_G(g_j^{(1)}) = \mathcal{O}_G(g_j^{(k)}) \Rightarrow c_j^{(1)} = c_j^{(k)}$. This is a contradiction.

¹ The same result applies for the bounding sphere, not shown in the figure.

A.1.3 Role of Scalarisation Body Order (Section 3.3.3)

Proposition 11. I-HASH(m) is G-orbit injective for $m = \max(\{|\mathcal{N}_i| \mid i \in \mathcal{V}\})$, the maximum cardinality of all local neighbourhoods $\mathcal{N}_i$ in a given dataset.

Proof of Proposition 11.
As m is the maximum cardinality of all local neighbourhoods $\mathcal{N}_i$ under consideration, any distinct neighbourhoods $\mathcal{N}_1$ and $\mathcal{N}_2$ must have distinct multisets of m-body scalars. As I-HASH(m) computes scalars involving up to m nodes, it will be able to distinguish any such $\mathcal{N}_1$ and $\mathcal{N}_2$. Thus, I-HASH(m) is G-orbit injective.

Proposition 12. IGWL(k) is at least as powerful as IGWL(k−1). For k ≤ 5, IGWL(k) is strictly more powerful than IGWL(k−1).

Proof of Proposition 12. By construction, I-HASH(k) computes G-invariant scalars from all possible tuples of up to k nodes formed by the elements of a neighbourhood and the central node. Thus, the I-HASH(k) class contains I-HASH(k−1), and I-HASH(k) is at least as powerful as I-HASH(k−1). Thus, the corresponding test IGWL(k) is at least as powerful as IGWL(k−1).

Secondly, to show that IGWL(k) is strictly more powerful than IGWL(k−1) for k ≤ 5, it suffices to show that there exists a pair of geometric neighbourhoods that can be distinguished by IGWL(k) but not by IGWL(k−1):
• For k = 3 and G = O(3) or SO(3), for the local neighbourhood from Figure 1 in Schütt et al. [2021], two configurations with different angles between the neighbouring nodes can be distinguished by IGWL(3) but not by IGWL(2).
• For k = 4 and G = O(3) or SO(3), the pair of local neighbourhoods from Figure 1 in Pozdnyakov et al. [2020] can be distinguished by IGWL(4) but not by IGWL(3).
• For k = 5 and G = O(3), the pair of local neighbourhoods from Figure 2(e) in Pozdnyakov et al. [2020] can be distinguished by IGWL(5) but not by IGWL(4).
• For k = 5 and G = SO(3), the pair of local neighbourhoods from Figure 2(f) in Pozdnyakov et al. [2020] can be distinguished by IGWL(5) but not by IGWL(4).

Proposition 13. Let $G_1 = (A_1, S_1, \vec{X}_1)$ and $G_2 = (A_2, S_2, \vec{X}_2)$ be two geometric graphs with the property that all edges have equal length. Then, IGWL(2) distinguishes the two graphs if and only if WL can distinguish the attributed graphs $(A_1, S_1)$ and $(A_2, S_2)$.

Proof of Proposition 13.
Let c and k be the colours produced by IGWL(2) and WL, respectively, and let i and j be two nodes belonging to any two graphs as in the statement of the result. We prove the statement inductively. Clearly, $c_i^{(0)} = k_i^{(0)}$ for all nodes i, and $c_i^{(0)} = c_j^{(0)}$ if and only if $k_i^{(0)} = k_j^{(0)}$. Now, assume that the statement holds for iteration t. That is, $c_i^{(t)} = c_j^{(t)}$ if and only if $k_i^{(t)} = k_j^{(t)}$ holds for all i. Note that $c_i^{(t+1)} = c_j^{(t+1)}$ if and only if $c_i^{(t)} = c_j^{(t)}$ and $\{\!\{ (c_p^{(t)}, \|\vec{x}_{ip}\|) \mid p \in \mathcal{N}_i \}\!\} = \{\!\{ (c_p^{(t)}, \|\vec{x}_{jp}\|) \mid p \in \mathcal{N}_j \}\!\}$, since the norm of the relative vectors is the only injective invariant that IGWL(2) can compute (up to a scaling). Since all the norms are equal, by the induction hypothesis, this is equivalent to $k_i^{(t)} = k_j^{(t)}$ and $\{\!\{ k_p^{(t)} \mid p \in \mathcal{N}_i \}\!\} = \{\!\{ k_p^{(t)} \mid p \in \mathcal{N}_j \}\!\}$. Therefore, this is equivalent to $k_i^{(t+1)} = k_j^{(t+1)}$.

A.2 Proofs for Equivalence between GWL and Geometric GNNs (Section 3.2.2)

Our proofs adapt the techniques used in Xu et al. [2019], Morris et al. [2019] for connecting WL with GNNs. Note that we omit the relative position vectors $\vec{x}_{ij}$ in GWL and geometric GNN updates for brevity, as relative position vectors can be merged into the vector features.

Theorem 1. Any pair of geometric graphs distinguishable by a G-equivariant GNN is also distinguishable by GWL.

Proof of Theorem 1. Consider two geometric graphs G and H. The theorem implies that if the GNN graph-level readout outputs $f(G) \neq f(H)$, then the GWL test will always determine G and H to be non-isomorphic, i.e. $G \neq H$.

We will prove by contradiction. Suppose after T iterations, a GNN graph-level readout outputs $f(G) \neq f(H)$, but the GWL test cannot decide G and H are non-isomorphic, i.e. G and H always have the same collection of node colours for iterations 0 to T. Thus, for iterations t and t + 1 for any t = 0 . . .
T − 1, G and H have the same collection of node colours $\{c_i^{(t)}\}$ as well as the same collection of neighbourhood geometric multisets $\big\{ (c_i^{(t)}, g_i^{(t)}), \{\!\{ (c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \big\}$ up to group actions. Otherwise, the GWL test would have produced different node colours at iteration t + 1 for G and H, as different geometric multisets get unique new colours.

We will show that on the same graph for nodes i and k, if $(c_i^{(t)}, g_i^{(t)}) = (c_k^{(t)}, \mathfrak{g} \cdot g_k^{(t)})$ for a group element $\mathfrak{g}$, we always have GNN features $(s_i^{(t)}, \vec{v}_i^{(t)}) = (s_k^{(t)}, Q_{\mathfrak{g}} \vec{v}_k^{(t)})$ for any iteration t. This holds for t = 0 because GWL and the GNN start with the same initialisation. Suppose this holds for iteration t. At iteration t + 1, if for any i and k, $(c_i^{(t+1)}, g_i^{(t+1)}) = (c_k^{(t+1)}, \mathfrak{g} \cdot g_k^{(t+1)})$, then:

$$\big\{ (c_i^{(t)}, g_i^{(t)}), \{\!\{ (c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \big\} = \big\{ (c_k^{(t)}, \mathfrak{g} \cdot g_k^{(t)}), \{\!\{ (c_j^{(t)}, \mathfrak{g} \cdot g_j^{(t)}) \mid j \in \mathcal{N}_k \}\!\} \big\} \tag{A.1}$$

By our assumption on iteration t,

$$\big\{ (s_i^{(t)}, \vec{v}_i^{(t)}), \{\!\{ (s_j^{(t)}, \vec{v}_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \big\} = \big\{ (s_k^{(t)}, Q_{\mathfrak{g}} \vec{v}_k^{(t)}), \{\!\{ (s_j^{(t)}, Q_{\mathfrak{g}} \vec{v}_j^{(t)}) \mid j \in \mathcal{N}_k \}\!\} \big\} \tag{A.2}$$

As the same aggregate and update operations are applied at each node within the GNN, the same inputs, i.e. neighbourhood features, are mapped to the same output. Thus, $(s_i^{(t+1)}, \vec{v}_i^{(t+1)}) = (s_k^{(t+1)}, Q_{\mathfrak{g}} \vec{v}_k^{(t+1)})$. By induction, if $(c_i^{(t)}, g_i^{(t)}) = (c_k^{(t)}, \mathfrak{g} \cdot g_k^{(t)})$, we always have GNN node features $(s_i^{(t)}, \vec{v}_i^{(t)}) = (s_k^{(t)}, Q_{\mathfrak{g}} \vec{v}_k^{(t)})$ for any iteration t. This creates valid mappings $\phi_s, \phi_v$ such that $s_i^{(t)} = \phi_s(c_i^{(t)})$ and $\vec{v}_i^{(t)} = \phi_v(c_i^{(t)}, g_i^{(t)})$ for any $i \in \mathcal{V}$.
Thus, if G and H have the same collection of node colours and geometric multisets, then G and H also have the same collection of GNN neighbourhood features

$$\big\{ (s_i^{(t)}, \vec{v}_i^{(t)}), \{\!\{ (s_j^{(t)}, \vec{v}_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \big\} = \big\{ (\phi_s(c_i^{(t)}), \phi_v(c_i^{(t)}, g_i^{(t)})), \{\!\{ (\phi_s(c_j^{(t)}), \phi_v(c_j^{(t)}, g_j^{(t)})) \mid j \in \mathcal{N}_i \}\!\} \big\}$$

Thus, the GNN will output the same collection of node scalar features $\{s_i^{(T)}\}$ for G and H, and the permutation-invariant graph-level readout will output $f(G) = f(H)$. This is a contradiction.

Similarly, G-invariant GNNs can be at most as powerful as IGWL.

Theorem 14. Any pair of geometric graphs distinguishable by a G-invariant GNN is also distinguishable by IGWL.

Proof. The proof follows similarly to the proof for Theorem 1.

Proposition 2. G-equivariant GNNs have the same expressive power as GWL if the following conditions hold: (1) The aggregation AGG is an injective, G-equivariant multiset function. (2) The scalar part of the update $\text{UPD}_s$ is a G-orbit injective, G-invariant multiset function. (3) The vector part of the update $\text{UPD}_v$ is an injective, G-equivariant multiset function. (4) The graph-level readout f is an injective multiset function.

Proof of Proposition 2. Consider a GNN where the conditions hold. We will show that, with a sufficient number of iterations t, the output of this GNN is equivalent to GWL, i.e. $s^{(t)} \equiv c^{(t)}$.

Let G and H be any geometric graphs which the GWL test decides as non-isomorphic at iteration T. Because the graph-level readout function is injective, i.e. it maps distinct multisets of node scalar features into unique embeddings, it suffices to show that the GNN’s neighbourhood aggregation process, with sufficient iterations, embeds G and H into different multisets of node features.

For this proof, we replace G-orbit injective functions with injective functions over the equivalence class generated by the actions of G.
Thus, all elements belonging to the same G-orbit will first be mapped to the same representative of the equivalence class, denoted by the square brackets [. . . ], followed by an injective map. The result is G-orbit injective.

Let us assume the GNN updates node scalar and vector features as:

$$s_i^{(t)} = \text{UPD}_s \Big( \Big[ (s_i^{(t-1)}, \vec{v}_i^{(t-1)}), \; \text{AGG}\big( \{\!\{ (s_i^{(t-1)}, s_j^{(t-1)}, \vec{v}_i^{(t-1)}, \vec{v}_j^{(t-1)}) \mid j \in \mathcal{N}_i \}\!\} \big) \Big] \Big) \tag{A.3}$$

$$\vec{v}_i^{(t)} = \text{UPD}_v \Big( (s_i^{(t-1)}, \vec{v}_i^{(t-1)}), \; \text{AGG}\big( \{\!\{ (s_i^{(t-1)}, s_j^{(t-1)}, \vec{v}_i^{(t-1)}, \vec{v}_j^{(t-1)}) \mid j \in \mathcal{N}_i \}\!\} \big) \Big) \tag{A.4}$$

with the aggregation function AGG being G-equivariant and injective, the scalar update function $\text{UPD}_s$ being G-invariant and injective, and the vector update function $\text{UPD}_v$ being G-equivariant and injective.

The GWL test updates the node colour $c_i^{(t)}$ and geometric multiset $g_i^{(t)}$ as:

$$c_i^{(t)} = h_s \Big( \Big[ (c_i^{(t-1)}, g_i^{(t-1)}), \; \{\!\{ (c_j^{(t-1)}, g_j^{(t-1)}) \mid j \in \mathcal{N}_i \}\!\} \Big] \Big), \tag{A.5}$$

$$g_i^{(t)} = h_v \Big( (c_i^{(t-1)}, g_i^{(t-1)}), \; \{\!\{ (c_j^{(t-1)}, g_j^{(t-1)}) \mid j \in \mathcal{N}_i \}\!\} \Big), \tag{A.6}$$

where $h_s$ is a G-invariant and injective map, and $h_v$ is a G-equivariant and injective operation (e.g. in equation 3.4, expanding the geometric multiset by copying).

We will show by induction that at any iteration t, there always exist injective functions $\varphi_s$ and $\varphi_v$ such that $s_i^{(t)} = \varphi_s(c_i^{(t)})$ and $\vec{v}_i^{(t)} = \varphi_v(c_i^{(t)}, g_i^{(t)})$. This holds for t = 0 because the initial node features are the same for GWL and GNN, $c_i^{(0)} \equiv s_i^{(0)}$ and $g_i^{(0)} \equiv (s_i^{(0)}, \vec{v}_i^{(0)})$ for all $i \in \mathcal{V}(G), \mathcal{V}(H)$. Suppose this holds for iteration t.
At iteration t + 1, substituting $s_i^{(t)}$ with $\varphi_s(c_i^{(t)})$, and $\vec{v}_i^{(t)}$ with $\varphi_v(c_i^{(t)}, g_i^{(t)})$ gives us

$$s_i^{(t+1)} = \text{UPD}_s \Big( \Big[ (\varphi_s(c_i^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)})), \; \text{AGG}\big( \{\!\{ (\varphi_s(c_i^{(t)}), \varphi_s(c_j^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)}), \varphi_v(c_j^{(t)}, g_j^{(t)})) \mid j \in \mathcal{N}_i \}\!\} \big) \Big] \Big)$$

$$\vec{v}_i^{(t+1)} = \text{UPD}_v \Big( (\varphi_s(c_i^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)})), \; \text{AGG}\big( \{\!\{ (\varphi_s(c_i^{(t)}), \varphi_s(c_j^{(t)}), \varphi_v(c_i^{(t)}, g_i^{(t)}), \varphi_v(c_j^{(t)}, g_j^{(t)})) \mid j \in \mathcal{N}_i \}\!\} \big) \Big)$$

The composition of multiple injective functions is injective. Therefore, there exist some injective functions $g_s$ and $g_v$ such that:

$$s_i^{(t+1)} = g_s \Big( \Big[ (c_i^{(t)}, g_i^{(t)}), \; \{\!\{ (c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \Big] \Big), \tag{A.7}$$

$$\vec{v}_i^{(t+1)} = g_v \Big( (c_i^{(t)}, g_i^{(t)}), \; \{\!\{ (c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \Big), \tag{A.8}$$

We can then consider:

$$s_i^{(t+1)} = g_s \circ h_s^{-1} \, h_s \Big( \Big[ (c_i^{(t)}, g_i^{(t)}), \; \{\!\{ (c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \Big] \Big), \tag{A.9}$$

$$\vec{v}_i^{(t+1)} = g_v \circ h_v^{-1} \, h_v \Big( (c_i^{(t)}, g_i^{(t)}), \; \{\!\{ (c_j^{(t)}, g_j^{(t)}) \mid j \in \mathcal{N}_i \}\!\} \Big), \tag{A.10}$$

Then, we can denote $\varphi_s = g_s \circ h_s^{-1}$ and $\varphi_v = g_v \circ h_v^{-1}$ as injective functions because the composition of injective functions is injective. Hence, for any iteration t + 1, there exist injective functions $\varphi_s$ and $\varphi_v$ such that $s_i^{(t+1)} = \varphi_s(c_i^{(t+1)})$ and $\vec{v}_i^{(t+1)} = \varphi_v(c_i^{(t+1)}, g_i^{(t+1)})$.

At the T-th iteration, the GWL test decides that G and H are non-isomorphic, which means the multisets of node colours $\{c_i^{(T)}\}$ are different for G and H. The GNN’s node scalar features $\{s_i^{(T)}\} = \{\varphi_s(c_i^{(T)})\}$ must also be different for G and H because of the injectivity of $\varphi_s$.

A weaker set of conditions is sufficient for a G-invariant GNN to be at least as expressive as IGWL.

Proposition 15. G-invariant GNNs have the same expressive power as IGWL if the following conditions hold: (1) The aggregation ψ and update ϕ are G-orbit injective, G-invariant multiset functions. (2) The graph-level readout f is an injective multiset function.

Proof. The proof follows similarly to the proof for Proposition 2.
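As an informal illustration of the invariant tests analysed in this appendix, the following Python sketch runs a 2-body, IGWL(2)-style colour refinement on small point clouds. It is a sketch under stated assumptions and not the implementation used in this thesis: the function names are hypothetical, graphs are built with a radial cutoff as in the proof of Proposition 7 (self-loops are excluded here for simplicity), and the abstract I-HASH(2) is replaced by nested string representations of (neighbour colour, rounded distance) pairs.

```python
import math
from collections import Counter


def radial_cutoff_neighbours(positions, r):
    """Neighbour lists of a radial-cutoff graph: j is in N_i iff 0 < ||x_i - x_j|| <= r."""
    n = len(positions)
    return [
        [j for j in range(n) if j != i and math.dist(positions[i], positions[j]) <= r]
        for i in range(n)
    ]


def igwl2_colours(scalars, positions, r, iterations, decimals=6):
    """Multiset of node colours after `iterations` rounds of 2-body invariant refinement.

    Assumption: a colour is the string form of (own colour, sorted list of
    (neighbour colour, rounded distance)), standing in for an injective hash.
    """
    nbrs = radial_cutoff_neighbours(positions, r)
    colours = [repr(s) for s in scalars]
    for _ in range(iterations):
        colours = [
            repr((
                colours[i],
                sorted(
                    (colours[j], round(math.dist(positions[i], positions[j]), decimals))
                    for j in nbrs[i]
                ),
            ))
            for i in range(len(positions))
        ]
    return Counter(colours)


atoms = ["C", "C", "C"]
# Central node 0 with two neighbours at distance 1 and a 90 degree angle between them.
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# The same point cloud rotated by 90 degrees about the z-axis and translated.
moved = [(5.0, 5.0, 5.0), (5.0, 6.0, 5.0), (4.0, 5.0, 5.0)]
# Same distances to the central node, but a 180 degree angle between the neighbours.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
# Same angle as `bent`, but shorter edges.
shrunk = [(0.0, 0.0, 0.0), (0.9, 0.0, 0.0), (0.0, 0.9, 0.0)]

reference = igwl2_colours(atoms, bent, r=1.1, iterations=2)
print(reference == igwl2_colours(atoms, moved, r=1.1, iterations=2))
print(reference == igwl2_colours(atoms, straight, r=1.1, iterations=2))
print(reference == igwl2_colours(atoms, shrunk, r=1.1, iterations=2))
```

With a cutoff of 1.1, the two outer nodes are not connected in any of the four point clouds. The sketch assigns the same colours to `bent` and `moved` (invariance to rotations and translations) and also to `bent` and `straight`, which differ only in an angle, in line with the k = 3 case of Proposition 12. It separates `bent` from `shrunk`, since the edge lengths differ.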
Appendix B
Appendix: Unified Generative Modelling of Molecules and Materials (Chapter 4)

B.1 Evaluation Metrics

Crystal generation metrics. We follow the evaluation protocol established by Xie et al. [2022], Miller et al. [2024], where we sample 10,000 crystals and compute validity, stability, uniqueness, and novelty rates, defined as follows:
• Structural validity: % of crystals with all pairwise distances >= 0.5 and volume >= 0.1.
• Compositional validity: % of crystal compositions with charge neutrality and electronegativity balance according to SMACT [Davies et al., 2019].
• Overall validity: % of crystals which are both structurally and compositionally valid.
• Stability: % of crystals with DFT energy above hull <0.0 eV/atom and number of unique elements >= 2. (We also report metastability as DFT energy above hull <0.1 eV/atom and number of unique elements >= 2.)
• Stable & unique: % of stable crystals which are unique, as defined by an all-to-all comparison using Structure Matcher from PyMatGen [Ong et al., 2013].
• Stable, unique & novel: % of stable, unique crystals which are novel, as defined by an all-to-all comparison to all crystals in MP-20 using Structure Matcher.

To compute the stability, uniqueness, and novelty rates, we follow Miller et al. [2024], Sriram et al. [2024]: we first pre-relax the sampled crystals using a fast ML potential, CHGNet [Deng et al., 2023], and then perform DFT relaxation. We then determine the DFT energy above hull for the relaxed structures against the Matbench Discovery convex hull [Riebesell et al., 2023]. Note that there is a lower bound on the number of completed DFT calculations due to memory or timeout errors.

Molecule generation metrics. We follow the evaluation protocol established by Hoogeboom et al. [2022], Daigavane et al.
[2024], where we sample 10,000 molecules and compute validity and uniqueness rates as well as success rates for 7 sanity checks from PoseBusters [Buttenschoen et al., 2024], as follows:
• Validity: % of molecules with canonical SMILES string found by RDKit.
• Uniqueness: % of unique SMILES among valid ones.
• All-atoms connected: % of molecules where there exists a path along bonds between all atoms.
• Reasonable bond angles/lengths: % of molecules where all angles/lengths are within 0.75 of the lower and 1.25 of the upper bounds determined by distance geometry.
• Aromatic rings flatness: % of molecules where all atoms in aromatic rings with 5 or 6 members are within 0.25Å of the closest shared plane.
• Double bond flatness: % of molecules where all atoms of aliphatic carbon-carbon double bonds and their four neighbours are within 0.25Å of the closest shared plane.
• Reasonable internal energy: % of molecules where the calculated energy is no more than 100 times the average energy of an ensemble of 50 conformations generated for the input molecule.
• No internal steric clash: % of molecules where the interatomic distance between pairs of non-covalently bound atoms is above 0.8 of the distance geometry lower bound.

The validity and uniqueness metrics focus on whether the chemical composition of generated molecules can be processed by RDKit, while the PoseBusters sanity checks evaluate the physical realism of the generated 3D structures across multiple criteria, from geometric constraints like bond lengths to energetic considerations [Harris et al., 2023].

B.2 Additional Results

Histograms from DFT validation. In Figure B.1, we show histograms of DFT energy above hull, formation energy, and number of unique elements per crystal for 10,000 generated crystals from ADiT, FlowMM, and FlowLLM compared to the MP20 training distribution.
ADiT generates more thermodynamically stable crystals than prior models, as shown by the larger proportion of samples with DFT energy above hull below 0.0 eV/atom. The distribution of DFT formation energies and number of unique elements per crystal from ADiT samples more closely matches the MP20 training data compared to FlowMM and FlowLLM baselines, suggesting that ADiT better captures the underlying physical and chemical constraints of stable crystal structures. Note that we ran DFT calculations for all model samples under identical hardware and settings to ensure fair comparison.

[Figure B.1 panels: (a) DFT energy above hull, Ehull (eV/atom) vs. count, with thresholds marked "Stable: Ehull < 0.0" and "MP20: Ehull < 0.08"; (b) DFT formation energy (eV/atom) vs. count; (c) Number of unique elements per crystal vs. count. Series: ADiT, FlowLLM, FlowMM; panels (b) and (c) also show the MP20 test set.]

Figure B.1: Histograms from DFT validation of 10,000 generated crystals. ADiT is more likely to generate stable crystals with DFT energy above hull <0.0 eV/atom compared to prior models. Samples from ADiT most closely follow the distributions for DFT formation energy and number of unique elements per crystal from MP20.

Histogram of spacegroups. In Figure B.2, we show the distribution of spacegroups for 10,000 generated crystals from ADiT, FlowMM, FlowLLM and the MP20 distribution. Diffusion-based models (ADiT and FlowMM) tend to oversample crystals with P1 spacegroup, which represents the lowest symmetry group, likely due to their local, step-wise denoising process. In contrast, FlowLLM, an autoregressive language model, tends to oversample spacegroups like Fm-3m, Pm-3m, and I4/mmm compared to the training data.
While it would be straightforward to control the distribution of spacegroups generated by ADiT through classifier-free guidance conditioning, we leave this for future work since our current focus is on unconditional generation of diverse molecular systems.

[Figure B.2 plot: count vs. international spacegroup number (0 to 230), with labelled peaks at P1, Fm-3m, I4/mmm, Pm-3m, Cm, P2_1/c, Pnma, P4/nmm, and P6_3/mmc. Legend: ADiT, FlowLLM, FlowMM, MP20 test set.]
Figure B.2: Histogram of spacegroups for 10,000 generated crystals. Diffusion-based ADiT and FlowMM tend to oversample crystals with the P1 spacegroup compared to the MP20 training distribution. FlowLLM, an autoregressive language model, tends to oversample crystals with Fm-3m, Pm-3m, and I4/mmm spacegroups.

S.U.N. rate and scaling ADiT. In Table B.1, we observe that the combined stability, uniqueness, and novelty (S.U.N.) rate for crystal generation decreases as we scale up the DiT denoiser from DiT-S (32M) to DiT-L (450M). While stability and uniqueness rates increase with model size, the S.U.N. rate decreases due to the larger model’s greater capacity to memorize the small MP20 training dataset of 27K crystals. This suggests that larger models may be more prone to generating duplicate or near-duplicate samples, which we plan to address by training on larger and more diverse datasets in future work. For crystals, the Alexandria dataset of inorganic crystals and the Crystallography Open Database of organic crystals present promising opportunities for scaling up. Notably, ADiT-S trained on MP20-only achieves a S.U.N. rate of 6.5%, representing a significant improvement over previously published results from FlowMM (2.8%) and FlowLLM (4.7%). This demonstrates that even our smallest model variant substantially advances the state-of-the-art for crystal generation.

Table B.1: Impact of scaling on stability, uniqueness, and novelty rates for 10,000 generated crystals. We find that stability rate as well as stability & uniqueness rate increase as we increase the number of model parameters for ADiT from 32M to 450M. However, larger ADiT models have greater capacity to memorise the small MP20 training dataset of 27K crystals, resulting in a decrease in the combined stability, uniqueness, & novelty rate. ADiT-S trained on MP20-only achieves a S.U.N. rate of 6.5%, representing a significant improvement over previously published state-of-the-art models which attained S.U.N. rates up to 4.7%. Columns S, S.U., and S.U.N. use the stability criterion (Ehull <0.0); columns M.S., M.S.U., and M.S.U.N. use the metastability criterion (Ehull <0.1).

| Training data | Model | S (%) ↑ | S.U. (%) ↑ | S.U.N. (%) ↑ | M.S. (%) ↑ | M.S.U. (%) ↑ | M.S.U.N. (%) ↑ |
|---|---|---|---|---|---|---|---|
| MP20-only | ADiT-S (32M) | 12.8 | 11.8 | 6.5 | 71.1 | 64.9 | 38.1 |
| MP20-only | ADiT-B (130M) | 14.1 | 12.5 | 4.7 | 81.6 | 67.3 | 25.9 |
| Joint | ADiT-S (32M) | 12.6 | 11.4 | 6.0 | 71.9 | 64.7 | 37.7 |
| Joint | ADiT-B (130M) | 15.4 | 13.4 | 5.3 | 81.0 | 70.2 | 28.2 |
| Joint | ADiT-L (450M) | 15.5 | 13.5 | 5.0 | 82.5 | 70.9 | 27.9 |

[Figure B.3 panels: (a) validity rate (%) vs. number of samples generated (ADiT), for crystals and molecules: validity rates are consistent across seeds; (b) stable, unique, novel rate (%) vs. number of crystals sampled, for ADiT-L, ADiT-B, ADiT-S, FlowLLM, and FlowMM: S.U.N. rates converge after 5,000 samples.]
Figure B.3: Consistency of validity and S.U.N. rates as we increase number of samples. We plot the validity and S.U.N. rates vs. number of sampled crystals or molecules. Error bars indicate 95% confidence interval across three different random seeds.

Sensitivity of validity rate to number of samples and random seed. In Figure B.3a, we plot the validity rates for crystal and molecule generation as we increase the number of samples from 100 to 10,000 for 3 different random seeds. We observe that the validity rates generally converge and are stable across random seeds after sampling over 5,000 crystals or molecules.

Sensitivity of S.U.N. rate to number of samples. In Figure B.3b, we plot the S.U.N.
(stability, uniqueness, and novelty) rates for crystal generation as we increase the number of samples from 100 to 10,000 across 3 different random seeds. The S.U.N. rates converge after approximately 5,000 samples for diffusion-based methods like ADiT and FlowMM. In contrast, autoregressive models like FlowLLM show higher variance in S.U.N. rates, likely due to more frequent generation of duplicate crystals during low-temperature sampling.

B.3 Ablation Study

Table B.3 and Table B.2 present ablation studies as well as aggregated benchmarks for various configurations of ADiT’s latent diffusion model and autoencoder, respectively. Key takeaways are highlighted below. Note that, unless otherwise stated, results in the main paper are reported for jointly trained ADiT-B, which uses the DiT-B denoiser, standard Transformer encoder and decoder, latent dimension d = 8, and KL regularization weight λ_KL = 1e-5.

Joint vs. dataset-specific training. Joint training of the autoencoder to embed both molecules and crystals into a shared latent space achieves similar or better reconstruction performance compared to dataset-specific training, as shown in Table B.2 (rows 3, 6, 10). The benefits of joint training are most evident in generative modelling performance: samples from the joint model have higher validity rates for both crystals and molecules compared to dataset-specific models, demonstrating effective transfer learning between periodic and non-periodic molecular systems (Table B.3, rows 12, 16, 20). These results provide strong evidence that ADiTs can successfully unify the modelling of both periodic and non-periodic systems within a single architecture, without compromising performance on either domain.

Denoiser architecture. The DiT denoiser is a standard Transformer with key hyperparameters including the hidden dimension d_model, number of attention heads, and number of layers.
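To make the link between these hyperparameters and model size concrete, the sketch below gives a back-of-the-envelope weight count for a stack of Transformer blocks, using the hidden sizes and depths reported for DiT-S, DiT-B, and DiT-L in this section. It assumes a 4x MLP expansion and a DiT-style conditioning projection in every block; both are assumptions for illustration rather than details confirmed by the text, and the function name is hypothetical. The number of attention heads does not change the count.

```python
def approx_dit_params(d_model: int, n_layers: int, use_adaln: bool = True) -> int:
    """Rough weight count for Transformer blocks; biases, norms and embeddings are ignored."""
    attention = 4 * d_model * d_model  # query, key, value and output projections
    mlp = 8 * d_model * d_model        # two linear layers with an assumed 4x hidden expansion
    # Assumed DiT-style conditioning projection (d_model -> 6 * d_model) per block.
    adaln = 6 * d_model * d_model if use_adaln else 0
    return n_layers * (attention + mlp + adaln)


for name, d_model, n_layers in [("DiT-S", 384, 12), ("DiT-B", 768, 12), ("DiT-L", 1024, 24)]:
    millions = approx_dit_params(d_model, n_layers) / 1e6
    print(f"{name}: ~{millions:.1f}M weights")
```

Under these assumptions the estimates come out at roughly 32M, 127M, and 453M, which is in the same range as the reported model sizes; the exact totals depend on embedding layers and other components not modelled here.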
Scaling up the DiT denoiser from DiT-S (32M parameters, d_model = 384, 6 heads, 12 layers) to DiT-B (130M, d_model = 768, 12 heads, 12 layers) and DiT-L (450M, d_model = 1024, 24 heads, 24 layers) consistently improves generative performance, as shown in Table B.3 (rows 12, 16, 20). We have additionally performed scaling analysis for the training loss and validity rates in Figure 4.2, seeing strong correlations between model size and performance metrics. In Figure B.3b, we further see that S.U.N. rates for larger models are better than those for smaller models, further confirming the benefits of scaling up the DiT denoiser.

Autoencoder architecture. For the architecture of the autoencoder’s encoder and decoder, we explored both roto-translation equivariant and non-equivariant VAEs. For the equivariant VAE variant, the encoder is Equiformer-V2 [Liao et al., 2024b] and the decoder is an equivariant feedforward network adapted from output heads in the Equiformer-V2 codebase. We selected Equiformer-V2 as it is theoretically expressive [Joshi et al., 2023] and has state-of-the-art performance across diverse 3D molecular tasks. As input to the Equiformer-V2 encoder, we use spherical harmonic embeddings of displacement vectors as edge features and, as a result, exclude the 3D coordinates in Algorithm 1, line 2, from the initial features {h_i}. The initial features {h_i} are used as the L = 0 scalar component of the initial spherical tensor features of Equiformer-V2. The rest of the pseudocode in Algorithms 1 and 2 remains the same.

As shown in Table B.2 (rows 1-4 and 5-8), the choice of autoencoder architecture has a noticeable impact on reconstruction performance. Standard Transformers generally outperform Equiformer-V2 for both crystals and molecules, achieving higher match rates (% of test set samples where the reconstructed structure matches the groundtruth, as determined by PyMatGen’s StructureMatcher/MoleculeMatcher).
More importantly, the latent space learned by standard Transformers proved more suitable for the latent diffusion process compared to Equiformer-V2’s equivariant latent space, leading to substantially better generative performance in terms of validity rates, particularly for crystals (Table B.3, rows 1-4 and 5-8).

Autoencoder regularization. As shown in Table B.2 (rows 9-12), increasing the latent dimension and reducing the KL regularization weight generally improved autoencoder reconstruction performance by lowering RMSD values, which measure the average distance between the reconstructed and groundtruth structures. These improvements in reconstruction quality translated to better generative performance, with higher validity rates for both crystals and molecules at larger latent dimensions and lower KL weights (see Table B.3, rows 9-12).

Sampling hyperparameters. Classifier-free guidance scale and number of integration steps are important hyperparameters for inference-time tuning. In Figure B.4, we show a grid search over guidance scales γ ∈ {1.0, 2.0, 3.0, 4.0, 6.0} and integration steps T ∈ {10, 50, 100, 250, 500, 1000}, finding that different combinations may be optimal for crystals vs. molecule generation. For each entry in Table B.3, we have reported results for T and γ which obtain the highest validity rates. T = 500 or 1000 with γ = 1.0 or 2.0 tends to work well across both molecules and crystals.

[Figure B.4 panels: (a) Crystals – MP20; (b) Molecules – QM9.]
Figure B.4: Tuning inference hyperparameters for best performance. Best generative modelling results for crystals and molecules are achieved with different classifier-free guidance scales γ and number of integration steps T. T = 500 or 1000 with γ = 1.0 or 2.0 tends to work well across both molecules and crystals.

Table B.2: Autoencoder ablation study.
We report match rate (computed with StructureMatcher or MoleculeMatcher from PyMatGen) and RMSD between the reconstructed and groundtruth structures for MP20 crystals and QM9 molecules.

| Train set | Encoder | Latent dim. | KL weight | MP20 match rate (%) ↑ | MP20 RMSD (Å) ↓ | QM9 match rate (%) ↑ | QM9 RMSD (Å) ↓ |
|---|---|---|---|---|---|---|---|
| MP20 | Transformer | 4 | 0.0001 | 85.50 | 0.0598 | - | - |
| MP20 | Equiformer-V2 | 4 | 0.0001 | 81.70 | 0.1652 | - | - |
| MP20 | Transformer | 8 | 0.0001 | 84.50 | 0.0502 | - | - |
| MP20 | Equiformer-V2 | 8 | 0.0001 | 88.90 | 0.0296 | - | - |
| QM9 | Transformer | 4 | 0.0001 | - | - | 97.20 | 0.0747 |
| QM9 | Equiformer-V2 | 4 | 0.0001 | - | - | 96.20 | 0.0765 |
| QM9 | Transformer | 8 | 0.0001 | - | - | 96.50 | 0.0823 |
| QM9 | Equiformer-V2 | 8 | 0.0001 | - | - | 96.20 | 0.0746 |
| Joint | Transformer | 4 | 0.0001 | 88.30 | 0.0471 | 96.60 | 0.0785 |
| Joint | Transformer | 4 | 0.00001 | 88.50 | 0.0468 | 98.50 | 0.0524 |
| Joint | Transformer | 8 | 0.0001 | 88.60 | 0.0269 | 96.60 | 0.0760 |
| Joint | Transformer | 8 | 0.00001 | 88.60 | 0.0239 | 97.00 | 0.0399 |

Table B.3: Latent diffusion model ablation study. We report validity rates for 10,000 generated crystals or molecules.

| Train set | Diffusion denoiser | Encoder | Latent dim. | KL weight | MP20 structure valid (%) ↑ | MP20 composition valid (%) ↑ | MP20 overall valid (%) ↑ | QM9 validity (%) ↑ | QM9 validity* (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| MP20 | DiT-S | Transformer | 4 | 0.0001 | 98.90 | 89.19 | 88.19 | - | - |
| MP20 | DiT-S | Equiformer-V2 | 4 | 0.0001 | 91.74 | 81.03 | 74.43 | - | - |
| MP20 | DiT-S | Transformer | 8 | 0.0001 | 99.58 | 90.46 | 90.13 | - | - |
| MP20 | DiT-S | Equiformer-V2 | 8 | 0.0001 | 99.26 | 86.09 | 85.50 | - | - |
| QM9 | DiT-S | Transformer | 4 | 0.0001 | - | - | - | 95.94 | 92.19 |
| QM9 | DiT-S | Equiformer-V2 | 4 | 0.0001 | - | - | - | 95.36 | 91.37 |
| QM9 | DiT-S | Transformer | 8 | 0.0001 | - | - | - | 96.02 | 91.58 |
| QM9 | DiT-S | Equiformer-V2 | 8 | 0.0001 | - | - | - | 96.24 | 91.47 |
| Joint | DiT-S | Transformer | 4 | 0.0001 | 98.21 | 91.05 | 89.38 | 96.90 | 93.47 |
| Joint | DiT-S | Transformer | 4 | 0.00001 | 98.74 | 90.74 | 89.60 | 96.40 | 91.85 |
| Joint | DiT-S | Transformer | 8 | 0.0001 | 99.66 | 91.07 | 90.76 | 96.85 | 93.33 |
| Joint | DiT-S | Transformer | 8 | 0.00001 | 99.67 | 91.25 | 90.93 | 96.36 | 92.06 |
| Joint | DiT-B | Transformer | 4 | 0.0001 | 99.00 | 91.23 | 90.29 | 97.33 | 94.45 |
| Joint | DiT-B | Transformer | 4 | 0.00001 | 99.51 | 90.73 | 90.29 | 97.04 | 94.06 |
| Joint | DiT-B | Transformer | 8 | 0.0001 | 99.67 | 91.60 | 91.32 | 95.30 | 89.85 |
| Joint | DiT-B | Transformer | 8 | 0.00001 | 99.74 | 92.14 | 91.92 | 97.43 | 93.99 |
| Joint | DiT-L | Transformer | 4 | 0.0001 | 99.31 | 90.92 | 90.29 | 97.80 | 94.67 |
| Joint | DiT-L | Transformer | 4 | 0.00001 | 99.43 | 90.84 | 90.31 | 96.71 | 92.78 |
| Joint | DiT-L | Transformer | 8 | 0.0001 | 99.75 | 92.17 | 91.92 | 96.11 | 91.45 |
| Joint | DiT-L | Transformer | 8 | 0.00001 | 99.66 | 91.42 | 91.14 | 97.79 | 95.01 |

Appendix C
Appendix: gRNAde: Geometric Deep Learning for 3D RNA inverse design (Chapter 5)

C.1 Ablation Study

Table C.1 presents an ablation study as well as aggregated benchmark for various configurations of gRNAde. Key takeaways are highlighted below. Note that all results in the main paper are reported for models trained on the maximum length of 5000 nucleotides using autoregressive decoding and rotation-equivariant GNN layers, as this led to the lowest perplexity values.

Split.
Single- and multi-state splits are described in Section 5.2; the multi-state split is relatively harder than the single-state split based on overall reduced performance for all baselines and models. The multi-state split evaluates a particularly challenging o.o.d. scenario as the RNAs in the test set have significantly higher structural flexibility compared to those in the training set.

Max. #states. We evaluate the impact of increasing the maximum number of states as input to gRNAde. Multi-state models improve native sequence recovery as well as structural self-consistency scores over an equivalent single-state variant. Notably, on the more challenging multi-state split, the improvement in sequence recovery was observed to be as high as 5-6% for the best multi-state models. This trend holds even for the single-state benchmark where the multi-state model is being used with only one state as input. This suggests that seeing multiple states during training can be useful for teaching gRNAde about RNA conformational flexibility and can improve performance even for single-state design tasks.

GNN and pooling architecture. We ablated whether the internal representations of the GVP-GNN are rotation invariant or equivariant. Equivariant GNNs are theoretically more expressive [Joshi et al., 2023] and we find them more capable at fitting the training distribution (as shown by lower perplexity), which in turn results in improved metrics compared to invariant GNNs.

Table C.1: Ablation study and aggregated benchmark results for gRNAde. We report metrics averaged over 100 test set samples and standard deviations across 3 consistent random seeds. The percentages reported in brackets for the 3D self-consistency scores are the percentage of designed samples within the ‘designability’ threshold values (scRMSD≤2Å, scTM≥0.45, scGDT≥0.5). scMCC is the 2D self-consistency metric (EternaFold); scRMSD, scTM-score, and scGDT_TS are the 3D self-consistency metrics (RhoFold).

| Split | Max. #states | Model | GNN | Max. train length | Perplexity (↓) | Native seq. recovery (↑) | scMCC (↑) | scRMSD (↓) | scTM-score (↑) | scGDT_TS (↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| Single-state | 1 | AR | Equiv | 500 | 1.77±0.07 | 0.438±0.01 | 0.624±0.07 | 13.01±1.18 (0.5%) | 0.21±0.0 (14.3%) | 0.22±0.0 (12.7%) |
| Single-state | 1 | AR | Equiv | 1000 | 1.73±0.08 | 0.453±0.01 | 0.648±0.01 | 13.10±0.58 (1.0%) | 0.20±0.0 (10.8%) | 0.21±0.0 (10.6%) |
| Single-state | 1 | AR | Equiv | 2500 | 1.41±0.01 | 0.513±0.01 | 0.633±0.03 | 11.76±0.91 (1.4%) | 0.27±0.0 (28.8%) | 0.27±0.0 (28.0%) |
| Single-state | 1 | AR | Equiv | 5000 | 1.29±0.02 | 0.538±0.03 | 0.612±0.02 | 11.50±0.64 (1.9%) | 0.28±0.0 (32.1%) | 0.28±0.0 (26.2%) |
| Single-state | 1 | AR, rand | Equiv | 5000 | 1.59±0.16 | 0.531±0.04 | 0.621±0.04 | 11.87±1.06 (1.9%) | 0.26±0.0 (28.1%) | 0.26±0.0 (24.1%) |
| Single-state | 1 | AR | Inv | 5000 | 1.32±0.04 | 0.531±0.01 | 0.585±0.03 | 11.70±0.56 (1.3%) | 0.26±0.0 (24.8%) | 0.25±0.0 (20.1%) |
| Single-state | 1 | NAR | Inv | 5000 | 1.54±0.04 | 0.571±0.00 | 0.430±0.02 | 14.26±0.51 (1.3%) | 0.19±0.0 (15.9%) | 0.18±0.0 (12.7%) |
| Single-state | 1 | NAR | Equiv | 5000 | 1.46±0.06 | 0.584±0.00 | 0.473±0.02 | 13.04±0.88 (1.3%) | 0.23±0.0 (24.0%) | 0.22±0.0 (17.9%) |
| Single-state | 3 | AR | Equiv, DS | 5000 | 1.23±0.05 | 0.539±0.01 | 0.620±0.01 | 11.47±1.05 (2.5%) | 0.28±0.0 (31.4%) | 0.28±0.0 (27.2%) |
| Single-state | 5 | AR | Equiv, DS | 5000 | 1.25±0.01 | 0.539±0.02 | 0.596±0.03 | 11.90±1.00 (2.9%) | 0.27±0.0 (31.6%) | 0.26±0.0 (26.4%) |
| Single-state | Groundtruth sequence prediction baseline | | | | - | 1.000±0.00 | 0.686±0.00 | 5.23±0.07 (27.9%) | 0.56±0.0 (68.7%) | 0.55±0.0 (68.7%) |
| Single-state | Random sequence prediction baseline | | | | - | 0.251±0.00 | 0.012±0.00 | 24.40±0.34 (0.0%) | 0.04±0.0 (0.0%) | 0.02±0.0 (0.0%) |
| Single-state | ViennaRNA 2D-only baseline | | | | - | 0.259±0.00 | 0.611±0.00 | 20.34±0.10 (0.0%) | 0.07±0.0 (0.6%) | 0.07±0.0 (1.1%) |
| Multi-state | 1 | AR | Equiv | 5000 | 1.51±0.01 | 0.481±0.00 | 0.573±0.04 | 21.83±0.53 (0.0%) | 0.12±0.0 (2.6%) | 0.15±0.0 (5.5%) |
| Multi-state | 3 | AR | Equiv, DS | 500 | 1.87±0.04 | 0.444±0.01 | 0.587±0.02 | 22.09±0.13 (0.0%) | 0.12±0.0 (2.3%) | 0.14±0.0 (5.7%) |
| Multi-state | 3 | AR | Equiv, DS | 1000 | 1.76±0.04 | 0.455±0.03 | 0.504±0.04 | 22.92±1.43 (0.0%) | 0.11±0.0 (2.3%) | 0.14±0.0 (5.8%) |
| Multi-state | 3 | AR | Equiv, DS | 2500 | 1.54±0.07 | 0.500±0.01 | 0.543±0.01 | 22.00±0.26 (0.0%) | 0.11±0.0 (2.9%) | 0.14±0.0 (3.7%) |
| Multi-state | 3 | AR | Equiv, DS | 5000 | 1.44±0.04 | 0.531±0.00 | 0.573±0.03 | 22.19±0.28 (0.0%) | 0.12±0.0 (4.2%) | 0.15±0.0 (7.5%) |
| Multi-state | 3 | AR | Equiv, DSS | 5000 | 1.37±0.04 | 0.540±0.03 | 0.574±0.03 | 22.20±0.43 (0.0%) | 0.12±0.0 (4.0%) | 0.15±0.0 (7.5%) |
| Multi-state | 5 | AR | Equiv, DS | 5000 | 1.37±0.03 | 0.510±0.00 | 0.514±0.00 | 21.80±0.08 (0.0%) | 0.12±0.0 (2.9%) | 0.14±0.0 (6.2%) |
| Multi-state | 1 | NAR | Equiv | 5000 | 1.81±0.03 | 0.489±0.00 | 0.372±0.03 | 24.18±0.63 (0.0%) | 0.09±0.0 (2.2%) | 0.12±0.0 (4.7%) |
| Multi-state | 3 | NAR | Equiv, DS | 5000 | 1.65±0.13 | 0.506±0.01 | 0.346±0.02 | 24.06±0.43 (0.0%) | 0.08±0.0 (2.0%) | 0.11±0.0 (2.9%) |
| Multi-state | 3 | NAR | Equiv, DSS | 5000 | 1.60±0.10 | 0.520±0.02 | 0.352±0.03 | 24.18±0.55 (0.0%) | 0.09±0.0 (2.2%) | 0.12±0.0 (4.7%) |
| Multi-state | 5 | NAR | Equiv, DS | 5000 | 1.59±0.21 | 0.517±0.01 | 0.339±0.01 | 24.16±0.75 (0.0%) | 0.08±0.0 (2.2%) | 0.10±0.0 (4.5%) |
| Multi-state | Groundtruth sequence prediction baseline | | | | - | 1.000±0.00 | 0.525±0.00 | 17.52±0.32 (3.9%) | 0.25±0.0 (24.2%) | 0.29±0.0 (31.4%) |
| Multi-state | Random sequence prediction baseline | | | | - | 0.249±0.00 | 0.013±0.00 | 31.00±0.20 (0.0%) | 0.03±0.0 (0.0%) | 0.02±0.0 (0.0%) |
| Multi-state | ViennaRNA 2D-only baseline | | | | - | 0.258±0.00 | 0.470±0.00 | 29.10±0.00 (0.0%) | 0.05±0.0 (0.0%) | 0.05±0.0 (0.0%) |

Model and decoder. ‘AR’ implies autoregressive decoding (described in Section 5.1.2, uses 4 encoder and 4 decoder layers), while ‘NAR’ implies non-autoregressive, one-shot decoding using an MLP (uses 8 encoder layers). Across both evaluation splits, AR models show significantly higher self-consistency scores than NAR, even though NAR leads to higher sequence recovery for the single-state split. AR is more expressive and can condition predictions at each decoding step on past predictions, while one-shot NAR samples from independent probability distributions for each nucleotide. Thus, AR is a better inductive bias for predicting base pairing and base stacking interactions that are drivers of RNA structure [Vicens and Kieft, 2022]. For instance, G-C and A-U pairs can often be swapped for one another, but non-autoregressive decoding does not capture such paired constraints. We also present results for the impact of training gRNAde with random decoding order.
This can be practically very useful for partial or conditional design scenarios, and leads to a minor reduction in sequence recovery and 3D self-consistency (in line with what was observed for ProteinMPNN).

Max. train RNA length. Limiting the maximum length of RNAs used for training can be seen as ablating the use of ribosomal RNA families (which are thousands of nucleotides long and form complexes with specialised ribosomal proteins). We find that training on only short RNAs (fewer than 1000 nucleotides) leads to worse sequence recovery and 3D self-consistency scores, even though it improves 2D self-consistency across both evaluation splits. This suggests that tertiary interactions learnt from ribosomal RNAs can generalise to other RNA families to some extent (large ribosomal RNAs were excluded from test sets).

Non-learnt baselines. We report the performance of two non-learnt baselines to contextualise gRNAde’s performance: for each test sample, simply predicting the groundtruth sequence back and predicting a random sequence. Structural self-consistency scores for the Groundtruth baseline provide a rough upper bound on the maximum score that any gRNAde design can theoretically obtain given the current state of the 2D/3D structure predictors being used. gRNAde always performs better than the random baseline and often reaches 2D self-consistency scores close to the upper bound. Both 2D and 3D self-consistency scores are inherently limited by the performance of the structure prediction methods used.

2D inverse folding baseline. We additionally report results for ViennaRNA’s 2D-only inverse folding method to further demonstrate the utility of 3D inverse folding. ViennaRNA has improved 2D self-consistency scores over gRNAde but fails to capture tertiary interactions in its designs, as evidenced by poor recovery and 3D self-consistency scores similar to the random baseline.
We observed the same trend for other 2D-only inverse folding methods such as NuPack’s design tool. This result should not be surprising, as 2D tools are meant for design scenarios that only involve base pairing and do not take any 3D information into account.

Choice of structure predictors. As previously noted, self-consistency metrics are highly dependent on the performance of the structure prediction method used. We chose EternaFold as it is simple to use as well as validated for designed and synthetic RNAs, unlike most other 2D structure prediction tools. Replacing EternaFold with RNAFold led to largely unchanged results and did not modify the relative rankings of the models:
• AR, 1 state, Equiv. GNN, EternaFold scMCC: 0.612±0.02, RNAFold scMCC: 0.614±0.03.
• NAR, 1 state, Equiv. GNN, EternaFold scMCC: 0.473±0.02, RNAFold scMCC: 0.477±0.04.

Lastly, we would like to note the challenge of evaluating multi-state design: structural self-consistency metrics are not ideal for evaluating RNAs which do not have one fixed structure or which undergo changes to their structure. It would be ideal (but extremely slow and expensive) to run MD simulations to validate multi-state design models.

C.2 Additional Results

Table C.2: Full results for Figure 5.7 comparing gRNAde to Rosetta, FARNA, ViennaRNA and RDesign for single-state design on 14 RNA structures of interest identified by Das et al. [2010]. Rosetta and FARNA recovery values are taken from Das et al. [2010], Supplementary Table 2.

| PDB ID | Description | ViennaRNA recovery | FARNA recovery | RDesign recovery | Rosetta recovery | gRNAde (single-state) recovery | gRNAde perplexity | gRNAde 2D self-cons. |
|---|---|---|---|---|---|---|---|---|
| 1CSL | RRE high affinity site | 0.25 | 0.20 | 0.4455 | 0.44 | 0.5719 | 1.2812 | 0.8644 |
| 1ET4 | Vitamin B12 binding RNA aptamer | 0.25 | 0.34 | 0.3929 | 0.44 | 0.6250 | 1.3457 | -0.0135 |
| 1F27 | Biotin-binding RNA pseudoknot | 0.30 | 0.36 | 0.3013 | 0.37 | 0.3437 | 1.6203 | 0.4523 |
| 1L2X | Viral RNA pseudoknot | 0.24 | 0.45 | 0.3727 | 0.48 | 0.4721 | 1.3181 | 0.5692 |
| 1LNT | RNA internal loop of SRP | 0.33 | 0.27 | 0.5556 | 0.53 | 0.5843 | 1.4337 | 0.1379 |
| 1Q9A | Sarcin/ricin domain from E. coli 23S rRNA | 0.27 | 0.40 | 0.4417 | 0.41 | 0.5044 | 1.3411 | 0.0597 |
| 4FE5 | Guanine riboswitch aptamer | 0.29 | 0.28 | 0.4112 | 0.36 | 0.5300 | 1.3824 | 0.9116 |
| 1X9C | All-RNA hairpin ribozyme | 0.26 | 0.31 | 0.3967 | 0.50 | 0.5000 | 1.3905 | 0.6630 |
| 1XPE | HIV-1 B RNA dimerization initiation site | 0.27 | 0.24 | 0.3834 | 0.40 | 0.7037 | 1.2177 | 0.7768 |
| 2GCS | Pre-cleavage state of glmS ribozyme | 0.25 | 0.26 | 0.4518 | 0.44 | 0.5078 | 1.3053 | 0.4062 |
| 2GDI | Thiamine pyrophosphate-specific riboswitch | 0.25 | 0.38 | 0.3523 | 0.48 | 0.6500 | 1.2363 | -0.0251 |
| 2OEU | Junctionless hairpin ribozyme | 0.23 | 0.30 | 0.5000 | 0.37 | 0.9519 | 1.0913 | 0.7768 |
| 2R8S | Tetrahymena ribozyme P4-P6 domain | 0.27 | 0.36 | 0.5641 | 0.53 | 0.5689 | 1.1881 | 0.7281 |
| 354D | Loop E from E. coli 5S rRNA | 0.28 | 0.35 | 0.4458 | 0.55 | 0.4410 | 1.4938 | 0.0430 |
| Overall recovery | | 0.27 | 0.32 | 0.4296 | 0.45 | 0.5682 | | |

[Figure C.1 plot: "Max Fitness by Sample Size and Condition (n=47,504; simulations=10,000)". Expected 'max' fold change over WT (secondary axis: fitness) vs. number of selected sequences for assaying (1 to 17,027). Conditions: random, n_mut==1, n_mut<=2, gRNAde. Reference lines: 1st best (fit.: 3.41), 3rd best (fit.: 3.16), 10th best (fit.: 2.67), 50th best (fit.: 2.27), 200th best (fit.: 1.94), wildtype.]
Figure C.1: Retrospective study of gRNAde for ranking ribozyme mutant fitness (t1 subunit). Using the backbone structure and mutational fitness landscape data from an RNA polymerase ribozyme [McRae et al., 2024], we retrospectively analyse how well we can rank variants at multiple design budgets using random selection vs.
gRNAde’s perplexity for mutant sequences conditioned on the backbone structure (scaffolding subunit t1). gRNAde performs better than single site saturation mutagenesis, even when all single mutants are explored (total of 403 single mutants, 17,027 double mutants for the scaffolding subunit t1 in McRae et al. [2024]). See Section 5.3.3 for results on catalytic subunit 5TU and further discussions.

C.3 RNASolo data statistics

[Figure C.2 panels, described below: (a) histogram of sequence lengths; (b) histogram of number of structures per unique sequence; (c) histogram of average pairwise RMSD per sequence; (d) average pairwise RMSD among structures (Å) vs. sequence length (log scale).]

(a) Sequence length (distribution: 684.9 ± 1072.8, max: 4455, min: 11). The dataset is long-tailed in terms of RNA sequence length, with many short sequences including aptamers, riboswitches, ribozymes, and tRNAs (fewer than 200 nucleotides). The dataset also includes several longer ribosomal RNAs (thousands of nucleotides).

(b) Number of structures per sequence (distribution: 2.84 ± 9.39, max: 267, min: 1). The dataset covers a wide range of RNA conformation ensembles, with on average 3 structures per sequence. There are multiple structures available for 1,547 sequences. The remaining 2,676 sequences have one corresponding structure.

(c) Average pairwise RMSD per sequence (distribution: 1.33Å ± 1.89, max: 18.35Å, min: 0.00Å). For 1,547 sequences with multiple structures, there is significant structural diversity among conformations. On average, the pairwise C4’ RMSD among the set of structures for a sequence is greater than 1Å.

(d) Bivariate distribution for sequence length vs. avg. RMSD. The joint plot illustrates how structural diversity (measured by avg. pairwise RMSD) varies across sequence lengths. We notice similar structural variations regardless of sequence length.

Figure C.2: RNASolo data statistics. We plot histograms to visualise the diversity of RNAs available in terms of (a) sequence length, (b) number of structures available per sequence, as well as (c) structural variation among conformations for those RNAs that have multiple structures. The bivariate distribution plot (d) for sequence length vs. average pairwise RMSD illustrates structural diversity regardless of sequence lengths.

Appendix D
Appendix: Inverse Design of RNA Structure and Function with gRNAde (Chapter 6)

[Figure D.1 panels: (A) Max Single Mutant Fitness, max fitness vs. sequence position; (B) Higher-order Mutant Combinability, combinability score vs. sequence position; (C) Final Constraints on Probability of Designing at Position, design probability vs. sequence position. Colour scales: fitness (-2.0 or below to 2.0), combinability score (0 to 150), design probability (0.00 to 1.00).]
Figure D.1: Design probabilities for 5TU derived from fitness landscape data. (A) Maximum single-mutant fitness at each position, showing tolerance to point mutations. (B) Combinability scores quantifying how well mutations at each position can be combined with other mutations to create functional variants. (C) Final design probabilities computed by combining fitness and combinability. Critical functional regions (catalytic site, template binding nucleotides, and triple helix-forming adenosines) are constrained to zero probability to preserve essential catalytic activity.
During design, these probabilities are used to sample which positions can be mutated, enabling generation of variants at a range of mutational distances.

Figure D.2: gRNAde variants show activity at large mutational distance. Low-throughput gel analysis of primer extension reactions on a 6 GAA repeat template using the wild-type 5TU ribozyme and top 13 gRNAde-designed variants from the 6x6 AUA high-throughput screen. Variant identity and edit distance from native 5TU are labeled. The gel-confirmed activity of variant 549, which carries 28 mutations, is a key finding, proving that gRNAde can generate functional ribozymes at large mutational distances beyond those typically accessible by rational design or directed evolution. Variants 122 and 123 also show high activity, comparable to the native 5TU ribozyme.
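The position-sampling idea described in the caption of Figure D.1 could be sketched as follows. The thesis does not spell out how fitness and combinability are combined, so the min-max rescaling and the product rule used here are assumptions made purely for illustration, and all function names and toy values are hypothetical.

```python
import random


def design_probabilities(max_fitness, combinability, frozen):
    """Combine two per-position scores into a per-position mutation probability in [0, 1]."""
    def rescale(values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
        return [(v - lo) / span for v in values]

    # Assumed combination rule: product of the two rescaled scores.
    probs = [f * c for f, c in zip(rescale(max_fitness), rescale(combinability))]
    for i in frozen:
        probs[i] = 0.0  # critical functional positions are never redesigned
    return probs


def sample_mutable_positions(probs, rng):
    """Draw an independent Bernoulli decision for each position."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


# Toy scores for a 5-nucleotide example (hypothetical values).
probs = design_probabilities(
    max_fitness=[-2.0, 0.5, 1.0, -0.5, 2.0],
    combinability=[0, 60, 150, 30, 90],
    frozen={4},
)
print(probs)
print(sample_mutable_positions(probs, random.Random(0)))
```

Because each position is sampled independently, repeated draws yield variants at a range of mutational distances, while positions with zero probability are always left unchanged.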