Approximate Inference in Bayesian Neural Networks and Translation Equivariant Neural Processes

Yue Kwang Foong
Department of Engineering
University of Cambridge

This dissertation is submitted for the degree of Doctor of Philosophy

Trinity Hall
September 2022

Declaration

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. I further state that no substantial part of my thesis has already been submitted, or is being concurrently submitted, for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution except as declared in the Preface and specified in the text. This dissertation contains fewer than 65,000 words including appendices, bibliography, footnotes, tables and equations and has fewer than 150 figures.

Yue Kwang Foong
September 2022

Abstract

It has been a longstanding goal in machine learning to develop flexible prediction methods that ‘know what they don’t know’: when faced with an out-of-distribution input, these models should signal their uncertainty rather than be confidently wrong. This thesis is concerned with two such probabilistic machine learning models: Bayesian neural networks and neural processes. Bayesian neural networks are a classical model that has been the subject of research since the 1990s. They rely on Bayesian inference to represent uncertainty in the weights of a neural network. On the other hand, neural processes are a recently introduced model that relies on meta-learning rather than Bayesian inference to obtain uncertainty estimates. This thesis provides contributions to both of these research areas.

For Bayesian neural networks, we provide a theoretical and empirical study of the quality of common variational methods in approximating the Bayesian predictive distribution. We show that for single-hidden layer networks with ReLU activation functions, there are fundamental limitations concerning the representation of in-between uncertainty: increased uncertainty in between well-separated regions of low uncertainty. We show that this theoretical limitation does not apply to deeper networks. However, in practice, in-between uncertainty is a feature of the exact predictive distribution that is still often lost by approximate inference, even with deep networks.

In the second part of this thesis, we focus on neural processes. In contrast to Bayesian neural networks, neural processes do not rely on approximate inference. Instead, they use neural networks to directly parameterise the map from a dataset to the posterior predictive stochastic process conditioned on that dataset. In this thesis we introduce the convolutional neural process, a new kind of neural process architecture which incorporates translation equivariance into its predictions. We show that when this symmetry is an appropriate assumption, convolutional neural processes outperform their standard multilayer perceptron-based and attentive counterparts on a variety of regression benchmarks.

Acknowledgements

My thanks go first and foremost to my supervisor, Richard E. Turner. I could not have asked for a better supervisor. From day one of the PhD he has been supportive and insightful, giving me the freedom to pursue topics of my interest, while being available whenever I needed help. Rich took a chance by taking on a student who didn’t have prior research experience in machine learning, and for this I’ll always be grateful.

I also have the pleasure of thanking my supervisors during my time at two very enjoyable internships.
Sebastian Nowozin, who supervised me at Microsoft Research, was a pleasure to work with. I am particularly grateful for his mentorship during my first taste of industry. Working with Michalis Titsias, my supervisor at DeepMind, was also an enormous privilege. His knowledge of his field is unrivalled, and I was struck by his patience and helpfulness during meetings.

This thesis would not have been possible without my collaborators, from whom I have learned so much. It is a pleasure in particular to thank David R. Burt, Jonathan Gordon and Wessel P. Bruinsma, with all of whom I spent countless hours discussing ideas, debugging code and proving theorems. I’d also like to thank my co-authors Yingzhen Li, José Miguel Hernández-Lobato, Yann Dubois, James Requeima, Marcin Tomczak, Siddharth Swaroop, Tim Pearce, and everyone at the Computational and Biological Learning Lab. Ross Clarke gave me hours of invaluable help with computing when I was getting started, and Sebastian Ober has been an excellent office mate. It’s been a privilege to work with all of them.

I’d also like to thank my 4th year project supervisor Ramji Venkataramanan, without whom I would not have applied for the Trinity Hall Research Studentship, which, along with the George and Lilian Schiff Foundation, generously funded my PhD.

I would not have been able to finish this PhD without the kindness of my many dear friends at St Andrew the Great church. Throughout it all, I have relied on the mercy and grace of a generous God, from whom comes all knowledge.

Finally, this thesis is gratefully dedicated to my parents. Their love and support have made all of this possible.

Table of contents

List of figures
List of tables
1 Introduction
1.1 Overview of thesis and main contributions
1.2 List of publications
2 Bayesian neural networks
2.1 Standard neural network training
2.1.1 Multilayer perceptrons
2.1.2 Probabilistic modelling with MLPs
2.1.3 Maximum likelihood estimation
2.1.4 Maximum a posteriori estimation
2.2 Bayesian neural networks and uncertainty in deep learning
2.2.1 Epistemic and aleatoric uncertainty
2.2.2 Bayesian inference for neural networks
2.2.3 Specifying the prior
2.2.4 Applications of Bayesian neural network uncertainty
2.3 Approximate inference in Bayesian neural networks
2.3.1 Sampling methods
2.3.2 Approximating family methods
2.3.3 Choosing and evaluating approximating family methods
2.4 History of approximating families in Bayesian neural networks
2.5 Conclusion
3 The expressiveness of approximate inference in Bayesian neural networks
3.1 Criteria for successful approximation
3.2 Priors and references for the exact predictive
3.3 Single-hidden layer neural networks
3.3.1 Numerical verification of theorems
3.3.2 In-between uncertainty in other regions of input space
3.3.3 Intuition for results
3.3.4 Empirical tests of approximate inference in single-hidden layer BNNs
3.4 Deeper networks
3.4.1 Proof sketch of Theorem 4
3.4.2 Empirical tests of approximate inference in deep BNNs
3.4.3 Initialising a BNN with in-between uncertainty
3.5 Case study: active learning with BNNs
3.5.1 Experimental set-up and results
3.5.2 Discussion
3.6 Related work
3.6.1 Discussion of Farquhar et al. (2020)
3.6.2 Pathologies of the optimal mean-field posterior in wide BNNs
3.6.3 The cold posterior effect and prior selection
3.6.4 Properties of MC dropout posteriors
3.7 Conclusions
4 Neural processes
4.1 Introduction
4.1.1 Meta-learning
4.1.2 Stochastic process prediction
4.1.3 Stochastic process consistency
4.1.4 The prediction map
4.2 Neural process architectural framework
4.2.1 Kolmogorov consistency of CNPs and LNPs
4.2.2 MLP-conditional neural processes
4.2.3 MLP-latent neural processes
4.2.4 Attentive neural processes
4.3 Deep sets
4.4 Training neural processes
4.4.1 Log-likelihood
4.4.2 Neural process variational inference
4.4.3 Approximate log-likelihood
4.4.4 Approximate maximum-likelihood vs variational lower bound maximisation for training NPs
4.5 Summary and conclusions
5 Convolutional neural processes
5.1 Introduction
5.1.1 Translation equivariance and stationarity
5.2 Convolutional deep sets
5.2.1 Representing translation equivariant functions on sets
5.3 Convolutional conditional neural processes
5.3.1 ConvCNPs for off-the-grid data
5.3.2 ConvCNPs for on-the-grid data
5.4 ConvCNP experimental results
5.4.1 Synthetic 1D experiments
5.4.2 2D image completion experiments
5.4.3 Limitations of factorised predictive distributions
5.5 Convolutional latent neural processes
5.6 ConvLNP experimental results
5.6.1 1D regression
5.6.2 Image completion
5.6.3 Environmental data
5.7 Summary and conclusions
6 Conclusions and discussion
6.1 Summary of contributions
6.1.1 Approximate inference in Bayesian neural networks
6.1.2 Convolutional neural processes
6.2 BNNs and NPs compared
6.3 Continued work and future research directions
6.3.1 Approximate inference in Bayesian neural networks
6.3.2 Convolutional neural processes
References
Appendix A Proofs of results on single-hidden layer BNNs
A.1 General theorem statements
A.2 Statements of lemmas
A.3 Proofs of lemmas
A.3.1 Proof of lemma 3
A.3.2 Proof of lemma 4
A.3.3 Proof of lemma 5
A.3.4 Proof of lemma 6
A.4 Proofs of theorems
Appendix B Bayesian neural network experimental details
Appendix C Proofs of results on deep BNNs
C.1 Proof of Theorem 4
C.1.1 Proof of Theorem 4 for QFFG
C.1.2 Proof of Theorem 10 for MCDO
C.2 Counterexample when inputs are dropped out
Appendix D ConvCNP experimental details
D.1 Baseline neural process models
D.2 1-dimensional experiments
D.2.1 CNN architectures
D.2.2 Synthetic data
D.3 Image experimental details and additional results
D.3.1 Experimental details
D.3.2 ACNP and ConvCNP qualitative comparison
Appendix E Effect of number of samples used on evaluation of latent neural processes
Appendix F ConvLNP experimental details
F.1 Experimental details on 1D regression
F.2 Experimental details on image completion
F.2.1 Data details
F.2.2 Training details
F.2.3 General architecture details
F.2.4 Additional results on image completion
F.3 Experimental details on environmental data
F.3.1 Data details
F.3.2 Gaussian process baseline
F.3.3 ConvLNP architecture and training details
F.3.4 Prediction and sampling
F.3.5 Bayesian optimization
F.3.6 Additional figures for environmental data

List of figures

3.1 Regions with restricted in-between uncertainty with MFVI.
3.2 Regions with restricted in-between uncertainty with MC dropout.
3.3 Restrictiveness of predictive variance of shallow MFVI and MCDO BNNs.
3.4 Predictive distributions of BNNs on randomly generated data.
3.5 Contribution of a single neuron to the predictive variance.
3.6 BNN regression on a 2D synthetic dataset.
3.7 Expressiveness of deep BNN predictive variance function.
3.8 Schematic of construction used to prove expressiveness of deep BNNs.
3.9 Overconfidence of BNNs relative to GP.
3.10 Overconfidence of BNNs relative to GP with $\sigma_w = \sqrt{2}$.
3.11 Overconfidence of BNN relative to GP plotted over training data.
3.12 Deep BNN predictive distributions on randomly generated data.
3.13 Predictive distributions of BNNs initialised by matching the limiting GP.
3.14 Points chosen during active learning with shallow BNNs.
3.15 Points chosen during active learning with deeper BNNs.
3.16 Predictive uncertainties before and after active learning.
4.1 Graphical model of a conditional neural process.
4.2 Graphical model of a latent neural process.
5.1 Schematic illustration of translation equivariance.
5.2 Illustration of ConvCNP forward pass.
5.3 Predictive distributions of the ACNP and ConvCNP.
5.4 Qualitative evaluation of ConvCNP.
5.5 Samples from the zero-shot multi MNIST dataset.
5.6 Zero-shot generalisation of the ConvCNP.
5.7 ConvLNP encoder-decoder architecture.
5.8 Illustration of ConvLNP forward pass.
5.9 Algorithm for off-the-grid ConvLNP forward pass.
5.10 Algorithm for on-the-grid ConvLNP forward pass.
5.11 Predictive distributions of ConvLNPs and ANPs.
5.12 Predictive samples for MNIST and zero-shot multi MNIST.
5.13 Predictive samples of precipitation overlaid on Europe.
5.14 Results of Bayesian optimisation experiment.
D.1 Predictive distribution of ConvCNP, ACNP and CNP for EQ kernel.
D.2 Predictive distribution of ConvCNP, ACNP and CNP for Matérn kernel.
D.3 Predictive distribution of ConvCNP, ACNP and CNP for sawtooth function.
D.4 ACNP and ConvCNP predictions on MNIST, SVHN and CelebA.
D.5 ConvCNP and ACNP applied to Ellen’s Oscar selfie.
E.1 Log-likelihood bounds as a function of number of samples.
E.2 Effect of number of samples used during training.
F.1 Samples from ConvLNP trained on MNIST, ZSMM, SVHN and CelebA.
F.2 Image completion samples for ConvLNP and ANP.
F.3 Log-likelihood and image completion samples for ConvLNP and ANP.
F.4 Train and test regions in Europe.
F.5 Predictive density of ConvLNP and GP.
F.6 Samples from models trained on precipitation in Europe.
F.7 Samples from models trained on precipitation in Europe.
F.8 Samples from models trained on precipitation in Europe.
F.9 Samples from models trained on precipitation in Europe.
F.10 Samples from models trained on precipitation in Europe.

List of tables

3.1 Results of active learning experiment.
5.1 Log-likelihood from synthetic 1-dimensional experiments.
5.2 Log-likelihood from image experiments.
5.3 Log-likelihood from synthetic 1-dimensional experiments with latent variable models.
5.4 Log-likelihoods from image completion with latent variable models.
5.5 Log-likelihoods and RMSEs on ERA5-Land dataset.
D.1 CNN architecture for the image experiments.
F.1 Parameter counts of models in 1D regression.
F.2 Coordinates for boxes defining the train and test regions.

Chapter 1 Introduction

In this thesis, we consider two machine learning methods for performing regression with uncertainty estimates: Bayesian neural networks (BNNs; MacKay, 1992b; Neal, 1995) and neural processes (NPs; Garnelo et al., 2018a,b). Bayesian neural networks are a classical model, first proposed in the 1990s, that marries the principled mathematical framework of Bayesian inference with the flexibility of neural networks. However, ever since their inception, they have been plagued with issues surrounding approximate inference. The first part of this thesis studies, both theoretically and empirically, the consequences of approximate inference in Bayesian neural networks when using mean-field variational inference (Blundell et al., 2015; Graves, 2011) and Monte Carlo dropout (Gal and Ghahramani, 2016) as approximate inference techniques.

In the second half of the thesis, we turn to neural processes. Neural processes are a recently introduced machine learning model that uses deep learning to model predictive stochastic processes. In contrast to BNNs, NPs rely on meta-learning (Schmidhuber, 1987) to directly learn the appropriate amount of uncertainty in their predictions. Neural processes come in a variety of flavours. Our main contribution in the second half of this thesis will be to motivate and propose a new member of the neural process family: convolutional neural processes (ConvNPs). Unlike previous NP architectures, ConvNPs leverage the fact that if the data-generating stochastic process is stationary, then the corresponding map from observed datasets to predictive distributions is translation equivariant. ConvNPs use convolutional neural networks to bake this symmetry directly into the architecture.

1.1 Overview of thesis and main contributions

The research in this thesis appears in several publications written during the course of my PhD studies. The work on Bayesian neural networks was published in Foong et al. (2020b), and the work on convolutional neural processes was published in Foong et al. (2020a) and Gordon et al. (2020). We now provide an overview of the rest of the thesis, and highlight our contributions in each chapter.

Chapter 2 presents an introduction to Bayesian neural networks. This chapter does not include novel research, but instead provides background on the subject of approximate inference in BNNs. We begin in Section 2.1 by describing standard neural network training as maximum a posteriori inference. In Section 2.2 we then show how this naturally motivates a fully Bayesian treatment of the network weights. The remainder of the chapter describes various proposed methods for approximate inference, with particular attention to approximating family methods, which are a major focus of this thesis. Chapter 2 concludes with a brief history of approximating families in Bayesian neural networks.

In Chapter 3 we present our theoretical and empirical studies of approximate inference in Bayesian neural networks with mean-field variational inference and Monte Carlo dropout. Our central theoretical findings can be split into a negative result concerning single-hidden layer BNNs with ReLU activations, and a positive result concerning deeper ReLU BNNs. For single-hidden layer networks, we show that there are simple situations where no setting of the variational parameters can represent in-between uncertainty: increased uncertainty in between well-separated clusters of low uncertainty. In contrast, for deeper and sufficiently wide networks, we show that there exist variational parameters that can approximate any predictive mean and variance function.
However, we also show empirically that the parameters needed to approximate the exact predictive distribution well in function space are often not found when maximising the evidence lower bound (ELBO). The theorems and experiments in this chapter were developed with my co-author David R. Burt, with Yingzhen Li and Richard E. Turner supervising throughout. The material in this chapter is published in Foong et al. (2020b).

In Chapter 4 we turn to the second main topic of this thesis, neural processes. This chapter presents our perspective on neural processes as learning to approximate the prediction map: i.e., the map that takes an observed dataset to the exact predictive stochastic process conditioned on that dataset. We introduce various kinds of existing NPs and describe the meta-learning procedure used to train them. The exposition in this chapter is based on an online Jupyter-book tutorial on neural processes (Dubois et al., 2020), which I co-wrote with Yann Dubois and Jonathan Gordon.

In Chapter 5 we present our proposed convolutional neural process, a new addition to the neural process family of models. In Section 5.2 we present the theoretical foundations of convolutional neural processes in the form of a representation theorem that incorporates translation equivariance into the standard Deep Sets representation theorem (Zaheer et al., 2017). In Section 5.3 we introduce the convolutional conditional neural process, which parameterises a translation-equivariant map from datasets to predictive stochastic processes. However, since it only outputs a predictive mean and variance function, its predictions are necessarily factorised. We address this in Section 5.5 by introducing a latent variable, leading to the convolutional latent neural process. We show that both models outperform their standard MLP-based and attentive counterparts on a variety of regression tasks. The research in this chapter was conducted in collaboration with Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois and James Requeima, and was supervised by Richard E. Turner throughout. It is published in Gordon et al. (2020) and Foong et al. (2020a). The research in these publications also appears in the PhD theses of my collaborators Jonathan Gordon (Gordon, 2021) and Wessel P. Bruinsma (forthcoming), both submitted to the University of Cambridge.

1.2 List of publications

The following is a list of publications I co-authored throughout the course of the PhD, regardless of whether the research appears in this thesis.

Peer-reviewed conference proceedings

Jonathan Gordon, Wessel P. Bruinsma, Andrew Y. K. Foong, James Requeima, Yann Dubois, and Richard E. Turner (2020). ‘Convolutional Conditional Neural Processes’. In: International Conference on Learning Representations.

Andrew Y. K. Foong, David R. Burt, Yingzhen Li, and Richard E. Turner (2020). ‘On the Expressiveness of Approximate Inference in Bayesian Neural Networks’. In: Advances in Neural Information Processing Systems 33.

Andrew Y. K. Foong, Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois, James Requeima, and Richard E. Turner (2020). ‘Meta-Learning Stationary Stochastic Process Prediction with Convolutional Neural Processes’. In: Advances in Neural Information Processing Systems 33.

Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt, and Richard E. Turner (2021). ‘How Tight Can PAC-Bayes be in the Small Data Regime?’. In: Advances in Neural Information Processing Systems 34.

Marcin Tomczak, Siddharth Swaroop, Andrew Y. K. Foong, and Richard E. Turner (2021). ‘Collapsed Variational Bounds for Bayesian Neural Networks’. In: Advances in Neural Information Processing Systems 34.

Peer-reviewed workshop proceedings

Andrew Y. K. Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner (2019). ‘“In-Between” Uncertainty in Bayesian Neural Networks’. In: Uncertainty in Deep Learning Workshop, ICML 2019.

Andrew Y. K. Foong, David R. Burt, Yingzhen Li, and Richard E. Turner (2019). ‘Pathologies of Factorised Gaussian and MC Dropout Posteriors in Bayesian Neural Networks’. In: Bayesian Deep Learning Workshop, NeurIPS 2019.

Tim Pearce, Andrew Y. K. Foong, and Alexandra Brintrup (2020). ‘Structured Weight Priors for Convolutional Neural Networks’. In: Uncertainty in Deep Learning Workshop, ICML 2020.

Wessel P. Bruinsma, James Requeima, Andrew Y. K. Foong, Jonathan Gordon, and Richard E. Turner (2020). ‘The Gaussian Neural Process’. In: Proceedings of the 3rd Symposium on Advances in Approximate Bayesian Inference.

Andrew Gordon Wilson, Pavel Izmailov, Matthew D. Hoffman, Yarin Gal, Yingzhen Li, Melanie F. Pradier, Sharad Vikram, Andrew Y. K. Foong, Sanae Lotfi, and Sebastian Farquhar (2021). ‘Evaluating Approximate Inference in Bayesian Deep Learning’. In: NeurIPS 2021 Competitions and Demonstrations Track.

Unreviewed preprints and other

Yann Dubois, Jonathan Gordon, and Andrew Y. K. Foong (2020). ‘The Neural Process Family’. Jupyter book tutorial on neural processes. https://yanndubs.github.io/Neural-Process-Family.

Andrew Y. K. Foong, Wessel P. Bruinsma, and David R. Burt (2022). ‘A Note on the Chernoff Bound for Random Variables in the Unit Interval’. In: arXiv:2205.07880. Electronic print: https://arxiv.org/abs/2205.07880.

Chapter 2 Bayesian neural networks

In this chapter we introduce Bayesian neural networks. We begin in Section 2.1 by describing standard neural network training as maximum likelihood and maximum a posteriori estimation. In Section 2.2 we describe how the need to represent uncertainty in the parameters of the network leads us to fully Bayesian neural networks. In Section 2.3 we consider one of the main challenges with this approach: the intractability of exact Bayesian inference and the need to rely on approximate inference algorithms, which we categorise into sampling methods and approximating family methods.
Understanding the consequences of approximating family methods will be a key focus of this thesis. Finally, in Section 2.4 we give a brief history of various approximating families in Bayesian neural networks.

2.1 Standard neural network training

There are a variety of deep neural network architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks and transformers. In this chapter we will focus largely on MLPs, as they are the simplest network architecture and form a building block for many others. However, much of this discussion applies with appropriate modifications to other deep learning architectures as well.

2.1.1 Multilayer perceptrons

An MLP is a neural network formed by stacking a series of affine transformations interleaved with element-wise nonlinearities. Formally, an MLP is a parameterised function $f : \mathbb{R}^D \to \mathbb{R}^K$, defined as follows. Let $x \in \mathbb{R}^D$ be the input to the network. Let $(W^{(l)})_{l=0}^{L}$ and $(b^{(l)})_{l=0}^{L}$ be a collection of $L + 1$ weight matrices and bias vectors respectively, which together represent the learnable parameters of the MLP, collectively denoted as $\theta$. Then the output of the MLP, $f(x) \in \mathbb{R}^K$, is defined by:

\[ h^{(0)}(x) := x, \tag{2.1} \]
\[ h^{(l+1)}(x) := \phi\left( W^{(l)} h^{(l)}(x) + b^{(l)} \right) \quad \text{for } 0 \le l \le L - 1, \tag{2.2} \]
\[ f(x) := W^{(L)} h^{(L)}(x) + b^{(L)}. \tag{2.3} \]

Here $\phi$ is the nonlinearity or activation function, and is applied elementwise. Common choices for $\phi$ include the rectified linear unit (ReLU), $\phi(a) = \max(0, a)$, and the hyperbolic tangent function. $L$ is known as the number of hidden layers in the network, and the length of the vector $h^{(l)}(x)$ is referred to as the number of hidden units, or neurons, in the $l$th hidden layer. The term ‘deep learning’ refers to the use of neural networks with many hidden layers.

2.1.2 Probabilistic modelling with MLPs

MLPs may be viewed simply as flexible function approximators, without being given a probabilistic interpretation.
This is, for example, the view taken in frequentist statistical learning theory, where the goal of a learning algorithm is to choose a hypothesis from a large class of functions, and the performance of the hypothesis is measured by the expectation of some loss function for that hypothesis over the true data-generating distribution (Shalev-Shwartz and Ben-David, 2014). This loss function can be chosen to reflect desiderata about the task that the MLP is used for, and does not necessarily have to relate to any probabilistic model. Alternatively, it is possible to interpret an MLP as defining a probabilistic model which explicitly encodes uncertainty in its predictions using a probability distribution. This will be the key to the Bayesian approach to neural network training, and it is this viewpoint that we will focus on in this thesis. Consider supervised learning. For an input x with supervised label y, we interpret the MLP output f(x) as the parameters of a likelihood function for y. For example, in univariate regression, we seek to predict a value y ∈ R from an input vector x. One way to do this is to set the output dimensionality of the MLP to K = 1 and interpret 2.1 Standard neural network training 9 the output of the network1 f(x; θ) as the mean of a Gaussian distribution over y: p(y|x, θ) := N (y; f(x; θ), σ2), (2.4) where σ2 is some constant. This is known as homoscedastic regression. Alternatively, the likelihood variance could be an output of the network itself; in which case we could take K = 2 and define p(y|x, θ) := N (y; f1(x; θ), exp(f2(x; θ))) , (2.5) where the exponentiation guarantees that the noise variance is positive. Here f(x; θ) is a two-dimensional vector, with f1(x; θ) and f2(x; θ) being the first and second entry respectively. This allows the network to model different values of the observation noise in different regions of input space, and is known as heteroscedastic regression. As another example, consider C-way classification. 
Now the likelihood function must be defined using a categorical distribution over C discrete classes. We can parameterise this likelihood by setting the output dimensionality of the MLP such that $f(x) \in \mathbb{R}^C$, and using the softmax function:

$$\Pr(\text{class label} = c \mid x; \theta) := \frac{\exp(f_c(x; \theta))}{\sum_{c'=1}^{C} \exp(f_{c'}(x; \theta))}, \quad \text{for } 1 \le c \le C. \tag{2.6}$$

In this context the MLP outputs $f_c$ are known as logits. The softmax function transforms these logits so that they form a valid normalised probability distribution over the class labels.

As part of the probabilistic modelling interpretation, we interpret p(y|x; θ) as the probability or degree of belief that the model with parameters θ assigns to the event that the label takes the value y given the input x. Hence when setting up these models we make the implicit assumption that there exists some value of the parameters θ such that p(y|x; θ) accurately represents our beliefs about the relationship between x and y. For simple models such as linear models, this is unlikely to be the case for complex tasks. However, as MLPs are extremely flexible function approximators (and are universal given sufficient width (Hornik, 1991)), it stands to reason that for a large enough MLP, there will exist some setting of θ for which this is a good assumption. We next focus on how to learn θ from data.

¹ Here we make the dependence of the outputs on the parameters θ explicit.

2.1.3 Maximum likelihood estimation

In order to set the model parameters θ, standard neural network training proceeds by defining an objective function and attempting to find parameters that maximise that function. A common way of defining the objective function is to take the logarithm of the likelihood function defined by the network.
For example, consider a dataset $\mathcal{D} = ((x_n, y_n))_{n=1}^{N}$ on which we apply homoscedastic regression.² The log-likelihood function defined by the network is

\begin{align}
\log p((y_n)_{n=1}^{N} \mid (x_n)_{n=1}^{N}, \theta) &= \log \prod_{n=1}^{N} p(y_n|x_n, \theta) \tag{2.7} \\
&= \sum_{n=1}^{N} \log \mathcal{N}(y_n; f(x_n; \theta), \sigma^2) \tag{2.8} \\
&= -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2. \tag{2.9}
\end{align}

Equivalently, maximum likelihood estimation involves minimising the following loss:

$$\mathcal{L}(\theta) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2. \tag{2.10}$$

We see that for fixed $\sigma^2$, maximising the likelihood as a function of θ is equivalent, up to an additive constant and a positive scale factor, to minimising the squared error between the network outputs and the observed targets. A similar derivation can be performed to obtain an objective function for classification, which leads to the widely used cross-entropy loss.

Once the objective function is defined, a gradient-based optimisation algorithm such as stochastic gradient descent (SGD) or Adam (Kingma and Ba, 2014) can be used to optimise θ. The gradients can be computed efficiently using the backpropagation algorithm (Rumelhart et al., 1988), and are easily obtained using modern implementations of automatic differentiation (Abadi et al., 2016; Frostig et al., 2018; Paszke et al., 2017). This method for optimising θ is known as maximum likelihood estimation.

Unfortunately, neural networks trained via maximum likelihood can be prone to overfitting: the network obtains increasingly good performance on the training set, but begins to deteriorate in its predictive performance on unseen test data. Many methods have been proposed to prevent overfitting, including early stopping, where the optimisation of the weights is halted before the loss function reaches a minimum, and also limiting the complexity of the network, by reducing its depth or the number of hidden units (MacKay, 2003, Chapter 39.4).

² Here and throughout this thesis the notation $(a_n)_{n=1}^{N}$ refers to the sequence $(a_1, \ldots, a_N)$.
More recently, it has been argued that in the overparameterised regime where networks have the capacity to memorise the training set, overfitting can actually be alleviated by further increasing the capacity of the network (Nakkiran et al., 2021). In the next section we will focus on the classical technique of weight regularisation as a means of controlling overfitting, as it leads naturally to a probabilistic modelling viewpoint of the weights in neural network learning (MacKay, 2003, Chapter 41).

2.1.4 Maximum a posteriori estimation

We introduce a probabilistic modelling perspective on θ by first describing standard weight regularisation practice. The most common form of weight regularisation is $\ell_2$ regularisation (Goodfellow et al., 2016, Chapter 7), where the loss function in Equation (2.10) is modified by adding a term proportional to the squared $\ell_2$ norm of the parameters θ:

$$\mathcal{L}(\theta) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2 + \underbrace{\frac{\alpha}{2} \|\theta\|_2^2}_{\text{regulariser}}. \tag{2.11}$$

Here α is a non-negative hyperparameter that controls the strength of the regularisation. By comparing this loss function with Equation (2.10), we see that when α → 0 we recover the original maximum likelihood loss function, and when α → ∞ the optimisation algorithm will ignore the data and focus on minimising $\|\theta\|_2^2$.

Introducing $\ell_2$ regularisation biases the learning process towards networks that have smaller magnitudes for their parameters. Networks with larger weights represent functions that have greater complexity: they tend to be more ‘wiggly’ as a function of x (MacKay, 2003, Chapter 44). Hence $\ell_2$ regularisation encodes a preference in favour of simpler functions. Empirically, $\ell_2$ regularisation is effective at mitigating overfitting and improving generalisation performance.³ Though it can be viewed simply as an ad hoc procedure, we now describe how to give it a probabilistic modelling interpretation by framing learning θ as a Bayesian inference problem.
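To make the definitions so far concrete, the following is a minimal Python sketch of the MLP in Equations (2.1) to (2.3) and of the regularised loss in Equation (2.11) for homoscedastic regression with K = 1. The function names (`mlp`, `regularised_loss`) and the representation of each weight matrix as a list of rows are illustrative assumptions; a practical implementation would use an array library with automatic differentiation.

```python
import math


def relu(a):
    """Rectified linear unit, phi(a) = max(0, a)."""
    return max(0.0, a)


def affine(W, b, h):
    """Return W h + b, where W is a list of rows and b, h are lists."""
    return [sum(w * h_j for w, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]


def mlp(x, weights, biases, phi=relu):
    """Equations (2.1)-(2.3): L hidden layers, then a linear output layer."""
    h = list(x)                                    # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):    # layers l = 0, ..., L - 1
        h = [phi(a) for a in affine(W, b, h)]
    return affine(weights[-1], biases[-1], h)      # W^(L) h^(L)(x) + b^(L)


def regularised_loss(weights, biases, xs, ys, sigma2, alpha):
    """Equation (2.11); theta collects all weights and all biases."""
    n = len(xs)
    sq_err = sum((y - mlp(x, weights, biases)[0]) ** 2 for x, y in zip(xs, ys))
    sq_norm = sum(w ** 2 for W in weights for row in W for w in row)
    sq_norm += sum(b_i ** 2 for b in biases for b_i in b)
    nll = 0.5 * n * math.log(2 * math.pi * sigma2) + sq_err / (2 * sigma2)
    return nll + 0.5 * alpha * sq_norm


# Hypothetical example: D = 1, one hidden layer with two units, K = 1.
# With these parameters the network computes relu(x) + relu(-x) = |x|.
weights = [[[1.0], [-1.0]], [[1.0, 1.0]]]
biases = [[0.0, 0.0], [0.0]]
loss = regularised_loss(weights, biases, [[1.0], [-2.0]], [1.0, 2.0],
                        sigma2=1.0, alpha=2.0)
```

Setting `alpha=0.0` recovers the maximum likelihood loss of Equation (2.10), so the two objectives differ only by the regulariser term.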
³ Recent studies have shown that there are more complex factors at play in the success of $\ell_2$ regularisation beyond controlling the complexity of the network (Zhang et al., 2019). For example, Krizhevsky et al. (2012) found that $\ell_2$ regularisation can improve training accuracy in deep networks. We mainly consider $\ell_2$ regularisation as a classical means of motivating the introduction of Gaussian priors in Bayesian neural networks, but as we discuss in Section 2.2.3, it is far from clear that Gaussian priors are the best choice.

Bayesian inference is a statistical framework that represents uncertainty by means of probability distributions which are updated using Bayes’ rule. It has several features that make it a compelling framework for neural network learning. On the theoretical side, it can be motivated by various axiomatic constructions as a procedure for updating beliefs in a consistent way (Cox, 1946; Ramsey, 2016; Savage, 1972). Furthermore, it provides a way of interpreting ad hoc choices of the loss function and regulariser as modelling assumptions that can be more precisely critiqued (for example, using the marginal likelihood (MacKay, 2003, Chapter 3)). Finally, on a practical level, Bayesian inference has been applied with great success to a broad range of machine learning tasks (Ghahramani, 2015; Murphy, 2012). This is typically done by taking a standard non-Bayesian machine learning model and specifying a Bayesian prior over its learnable parameters, a process we now describe for neural networks.

In Bayesian inference, the prior distribution p(θ) describes what we believe about the parameters θ before the dataset D is observed. Ideally, this should be a distribution over θ that induces a distribution over functions f(x; θ) which encapsulates all of our experience and intuition about the problem at hand.
However, specifying such a prior precisely is a highly nontrivial task for neural networks, which we will discuss more in Section 2.2.3. Here we will simply consider the simplest, most convenient prior. Anticipating the objective in Equation (2.11), we set the prior to be a factorised Gaussian distribution: $p(\theta) = \mathcal{N}(\theta; 0, \alpha^{-1} I)$.

Our next task is to update our beliefs about θ in light of the observed dataset D. These updated beliefs are represented by the posterior distribution p(θ|D). We can compute this by applying Bayes’ rule:

\begin{align}
p(\theta|\mathcal{D}) &\propto p((y_n)_{n=1}^{N} \mid (x_n)_{n=1}^{N}, \theta)\, p(\theta) \tag{2.12} \\
&= \prod_{n=1}^{N} \mathcal{N}(y_n; f(x_n; \theta), \sigma^2)\, \mathcal{N}(\theta; 0, \alpha^{-1} I), \tag{2.13} \\
\log p(\theta|\mathcal{D}) &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2 - \frac{\alpha}{2} \|\theta\|_2^2 + \text{const}. \tag{2.14}
\end{align}

Maximising log p(θ|D) in Equation (2.14) is known as maximum a posteriori (MAP) estimation and corresponds to finding the setting of θ that has the highest density under the posterior p(θ|D). Up to an additive constant, Equation (2.14) is the negative of the regularised loss in Equation (2.11), so MAP estimation under this prior coincides with $\ell_2$-regularised training.

So far, the Bayesian interpretation of neural network training we have described falls within the bounds of standard deep learning practice. Although this interpretation may bring new insights into the choice and interpretation of α, the resulting MAP estimation algorithm is, by construction, identical to minimising squared error/cross-entropy with an added $\ell_2$ regularisation term. Combined with other training innovations such as weight normalisation, dropout and data augmentation, optimising this objective function is the workhorse of most modern neural network training, whether explicitly given a Bayesian interpretation or not. However, the Bayesian interpretation allows us to go much further than MAP estimation, since the maximiser of the posterior density has no fundamental status in Bayesian inference.⁴ In the next section, we will describe why it can be desirable to utilise the entire posterior distribution p(θ|D) when making predictions.
Implementing this will necessitate fundamental changes to our learning algorithms for neural networks.

⁴ In fact, the maximiser of the posterior density is not invariant to non-linear reparameterisations of θ (MacKay, 2003, Chapter 28), so the MAP estimate is a parameterisation-dependent notion.

2.2 Bayesian neural networks and uncertainty in deep learning

To motivate the use of full Bayesian inference with the entire posterior distribution rather than simply MAP estimation, we consider the problem of uncertainty quantification. This is the task of obtaining neural networks that understand the limits of their knowledge; or, in other words, that ‘know what they don’t know’ (Gal, 2016).

Empirically, it has been observed that neural networks regularly make overconfident predictions, especially on out-of-distribution data (Ovadia et al., 2019). For example, it has been shown that deep convolutional neural networks, when trained on the ImageNet dataset, give unpredictable and unreasonably confident answers when shown inputs that are unlike anything in the training set. In Shafaei et al. (2018), a picture of random Gaussian noise was classified as a ‘chainlink fence’ with 31% probability. Ideally, when presented with an out-of-distribution (OOD) input like this, an uncertainty-aware machine learning algorithm would make a high-entropy, unconfident prediction, with probability mass spread widely over many classes, rather than make an arbitrary prediction with high confidence.

Recently, Fort et al. (2021) demonstrated that vision transformers (Dosovitskiy et al., 2020) pretrained on very large datasets of image-text pairs via methods such as CLIP (Radford et al., 2021) and subsequently fine-tuned to a specific task can show much better performance for OOD detection. However, this approach relies on the pretraining dataset (which is commonly obtained by scraping websites on the Internet) to provide representations that are relevant for the fine-tuned task. Hence it is not
directly applicable to situations where the fine-tuned task is very specialised, which is the case for most medical and scientific applications. We next discuss more precisely what kind of uncertainty we would like our network to reflect in these situations.

2.2.1 Epistemic and aleatoric uncertainty

We now distinguish between two kinds of uncertainty: ‘aleatoric’ uncertainty and ‘epistemic’ uncertainty (Der Kiureghian and Ditlevsen, 2009; Kendall and Gal, 2017). Aleatoric uncertainty is uncertainty which is modelled as inherent to the observations and cannot be reduced by collecting greater amounts of data. Epistemic uncertainty is uncertainty due to the model parameters not being fully known, and can be reduced by collecting more data.

For example, assume for the sake of argument that the likelihood in Equation (2.4) fully represents our beliefs regarding the data-generating process for the label y given an input x. Furthermore, assume that the true values of the parameters θ are completely known. Then for any x, there is still uncertainty about the corresponding label y due to the non-zero noise variance $\sigma^2$. As the model parameters θ are already completely known, no amount of added data can reduce this uncertainty in y, which is inherent to the data-generating process. This is an example of aleatoric uncertainty.

On the other hand, consider the case where $\sigma^2 \to 0$, so that, given a value of θ, the label y corresponding to an input x is essentially deterministically set to f(x; θ). However, suppose that we do not know the true value of θ, but we are instead uncertain about its value. Then this would lead to another, independent source of uncertainty about the value of y. This is an example of epistemic uncertainty. What distinguishes this from aleatoric uncertainty is that, in principle, if additional data were to be collected, this could allow us to reduce our uncertainty about θ, which in turn would reduce our uncertainty about y.
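As a small numerical illustration of this distinction, consider a toy model that is an assumption made here for tractability rather than an MLP: f(x; θ) = θx with scalar θ, the homoscedastic likelihood of Equation (2.4), and a Gaussian prior on θ with variance 1/α. For this conjugate model the posterior over θ is Gaussian, and the variance of y at a test input splits into a noise term $\sigma^2$ (aleatoric) plus the variance of f(x; θ) under the posterior (epistemic). The function names below are hypothetical.

```python
def posterior_precision(xs, sigma2, alpha):
    """Posterior precision of theta for y = theta * x + noise, prior N(0, 1/alpha).

    In this linear-Gaussian model the precision depends only on the inputs.
    """
    return alpha + sum(x * x for x in xs) / sigma2


def predictive_variance_terms(x_star, xs, sigma2, alpha):
    """Return the (aleatoric, epistemic) parts of Var[y | x_star, D]."""
    aleatoric = sigma2                                    # inherent observation noise
    epistemic = x_star ** 2 / posterior_precision(xs, sigma2, alpha)
    return aleatoric, epistemic


# The epistemic term shrinks as more inputs are observed; the aleatoric term does not.
few = predictive_variance_terms(1.0, [1.0] * 10, sigma2=0.25, alpha=1.0)
many = predictive_variance_terms(1.0, [1.0] * 1000, sigma2=0.25, alpha=1.0)
```

Comparing `few` and `many` shows the behaviour described above: collecting more data reduces only the second component.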
2.2.2 Bayesian inference for neural networks

Given this distinction, we can now provide a Bayesian perspective on why neural networks may be overconfident, and what can be done to mitigate this. Deep neural networks are extremely flexible function approximators, often with millions of parameters. Furthermore, the input space that x resides in can be very high-dimensional, and the training set will only occupy a small region of that space. For example, in the case of CNNs trained on ImageNet, the input space is the space of all images of a certain size, but natural images will live only near a low-dimensional manifold in this input space. Intuitively, there will be many settings of the parameters that will give good predictions on the training set but may differ greatly in OOD regions. Since the data do not determine the parameters exactly, we are epistemically uncertain about which setting is the ‘correct’ one.

When we perform MAP estimation as in standard neural network training, this can be viewed as choosing a single setting of the parameters that fits the data (and prior) best, even though many other settings may also be plausible according to the posterior. From the Bayesian point of view, this procedure can be expected to lead to overconfidence because it fails to propagate our uncertainty about the parameters into our predictions. In other words, standard MAP estimation is able to capture aleatoric uncertainty but completely ignores epistemic uncertainty.

From a Bayesian perspective, the remedy for this is straightforward (at least in principle): we should compute the probability distribution over the label y given the input x using the rules of probability, by marginalising out the uncertain parameters θ:

\begin{align}
p(y|x, \mathcal{D}) &= \int p(y, \theta|x, \mathcal{D}) \, d\theta \tag{2.15} \\
&= \int p(y|\theta, x, \mathcal{D})\, p(\theta|x, \mathcal{D}) \, d\theta \tag{2.16} \\
&= \int p(y|x, \theta)\, p(\theta|\mathcal{D}) \, d\theta. \tag{2.17}
\end{align}

Here Equation (2.17) follows because according to our model, once θ is known, the distribution of y given x is completely specified, and because the test input x on its own carries no information about θ. The distribution p(y|x, D) is known as the Bayesian posterior predictive distribution, or simply the posterior predictive or just the predictive distribution. Equation (2.17) tells us that to compute the posterior predictive, we should average our predictions over the entire posterior distribution p(θ|D), instead of just plugging in the value of θ that maximises the posterior density, as in MAP estimation. In other words, we marginalise out the parameters when making predictions, thus propagating our epistemic uncertainty from θ to y. When neural network predictions are made in this way, we refer to the model as a Bayesian neural network (BNN).⁵

⁵ Although neural networks trained using MAP estimation involve placing a Bayesian prior on the parameters, we generally reserve the term ‘Bayesian neural network’ to refer to models where we (at least approximately) compute the full posterior distribution over θ and marginalise it out. In a loose sense MAP estimation may be viewed as a BNN where the posterior is approximated by a point mass at the maximum of p(θ|D).

2.2.3 Specifying the prior

One thing we have left out of the discussion so far is how to choose the prior over parameters, p(θ). Ideally this prior should encapsulate all our beliefs about the task before the data are observed. For example, consider the case of image classification with a deep neural network. In some sense, p(θ) should place higher probability on regions of parameter space that correspond to classifiers that we expect to be more plausible. Specifying such a p(θ) is extremely difficult, as we often do not have well-formed, consistent, quantifiable beliefs about such a large space.
Even if we did, transforming beliefs about output probabilities in function space into beliefs about the parameters in weight space is non-trivial and is an active area of research (Flam-Shepherd et al., 2017; Tran et al., 2020; Yang et al., 2019). As such, common practice is to specify computationally convenient priors, such as factorised Gaussian priors. The efficacy of such priors has been a subject of debate recently, with Wenzel et al. (2020) arguing that it leads to poor performance (and may be to blame for the so-called ‘cold posterior effect’), and Wilson (2020); Wilson and Izmailov (2020) arguing that the network architecture provides sufficient structure in the prior over functions, with p(θ) simply needing to be sufficiently vague to provide good results. Fortuin et al. (2021) train networks using SGD and compute summary statistics of the weights in order to motivate the choice of prior. They propose using spatially correlated priors for the weights in convolutional networks, and heavy-tailed priors for the weights in MLPs, demonstrating that this can lead to improved performance. Fortuin (2022) provides a review on recent work on specifying priors in Bayesian deep learning. Although the problem of prior selection is crucial for BNNs, in this thesis we will focus primarily on issues relating to approximate inference. One reason for this is that a proper understanding of inference is needed to reliably evaluate our priors. Often in Bayesian modelling, the efficacy of a prior is only made clear after it has been combined with data to form a posterior predictive. By inspecting this posterior, practitioners are able to critique pathologies and iterate towards better priors. This way of thinking is neatly summarised in the quote by the statistician I. J. Good: “Ye priors shall be known by their posteriors” (Good, 1983). 
If the inference process itself is poorly understood, it is difficult to disentangle whether the behaviour of the predictive distribution is more of a consequence of the choice of prior or choice of approximate inference algorithm. Hence a better understanding of approximate inference allows for a better understanding of BNN priors.

2.2.4 Applications of Bayesian neural network uncertainty

Before describing in Section 2.3 how to actually perform the computations required to obtain the posterior predictive from Equation (2.17), we briefly mention some of the uses of the epistemic uncertainty estimates that a BNN could provide. Such a model would have numerous applications in active learning (Gal et al., 2017b), reinforcement learning (Chua et al., 2018), Bayesian optimisation (Snoek et al., 2012) and high-risk decision making tasks.

For example, in active learning (Cohn et al., 1995; MacKay, 1992a) we are presented with a large, unlabelled dataset. We are allowed to query an oracle which will provide the label for any element in the dataset. However, each query is associated with a cost, and the aim is to obtain a labelled subset that will allow us to train the most accurate model possible whilst minimising the number of queries made. This problem is naturally tackled by separately identifying the aleatoric and epistemic contributions to uncertainty: we want to query datapoints with high epistemic uncertainty in their predictions (since we stand to gain the most information about the model parameters) but low aleatoric uncertainty (since inherently noisy labels are less informative).

In reinforcement learning and Bayesian optimisation (Deisenroth and Rasmussen, 2011; Gal et al., 2016; Snoek et al., 2012, 2015), a central problem is that of balancing exploration and exploitation.
The task of quantifying the value of exploration naturally involves epistemic uncertainty: an action is more worth exploring if we are uncertain about its result, but we can reduce that uncertainty by gathering observations. Conversely, an action that simply leads to an outcome with high aleatoric uncertainty is less worth exploring.

Lastly, high-risk decision making tasks, which commonly arise e.g. in medical applications, require good epistemic uncertainty quantification. For example, deep neural networks have been used to classify skin lesions as either cancerous or benign with human expert-level accuracy (Esteva et al., 2017). A wrongly made classification here can lead to disastrous consequences for a patient. Ideally a neural network, when presented with a skin lesion it had never seen the like of before, would be able to signal its uncertainty. This information could be used in a doctor’s decision making process to perform more thorough checks (Filos et al., 2019; Mobiny et al., 2019). However, as we have seen, standard neural networks that do not quantify epistemic uncertainty can confidently make arbitrary predictions when presented with out-of-distribution inputs.

2.3 Approximate inference in Bayesian neural networks

Having described the ideal of taking into account epistemic uncertainty with the Bayesian posterior predictive, we return to the task of how to actually perform this computation in practice. It is here that we run into a major hurdle: computing the posterior predictive involves evaluating intractable integrals.

To recap, the posterior predictive in Equation (2.17) involves averaging our predictions over the entire posterior distribution. The posterior distribution itself is calculated using Bayes’ theorem:

\begin{align}
p(\theta|\mathcal{D}) &= \frac{p(\mathcal{D}|\theta)\, p(\theta)}{p(\mathcal{D})} \tag{2.18} \\
&= \frac{p(\mathcal{D}|\theta)\, p(\theta)}{\int p(\mathcal{D}|\theta)\, p(\theta) \, d\theta}. \tag{2.19}
\end{align}

We see that obtaining the posterior predictive requires performing two integrals, one to calculate the normalising constant in Bayes’ theorem in Equation (2.19), and one to average our predictions over the posterior distribution in Equation (2.17). Since the likelihood function p(D|θ) is highly non-linear in θ for neural networks, these integrals are analytically intractable, so approximate inference methods are needed.

A great variety of approximate inference algorithms have been proposed for Bayesian neural networks. We will only give a brief overview here. In broad terms, we can divide approximate inference methods for BNNs into two categories. The first are sampling methods: those that aim to represent the posterior distribution by a collection of representative samples only. The second are what we refer to as approximating family methods: those that assume a particular parametric form for the approximate posterior. The work in this thesis will focus primarily on approximating family methods, although we also give a brief overview of sampling methods in the next section.

2.3.1 Sampling methods

Sampling methods for BNNs are based on the idea of Monte Carlo integration. This approach relies on the fact that the formula for the posterior predictive, Equation (2.17), can be written as an expectation:

$$\int p(y|x, \theta)\, p(\theta|\mathcal{D}) \, d\theta = \mathbb{E}_{p(\theta|\mathcal{D})} \left[ p(y|x, \theta) \right]. \tag{2.20}$$

This expectation can then be approximated by drawing samples from the posterior:

$$\mathbb{E}_{p(\theta|\mathcal{D})} \left[ p(y|x, \theta) \right] \approx \frac{1}{M} \sum_{m=1}^{M} p(y|x, \theta_m), \quad \theta_m \sim p(\theta|\mathcal{D}). \tag{2.21}$$

In simple Monte Carlo methods, each of the samples $\theta_m$ is independent and identically distributed. However, this is often difficult to achieve in practice. Nevertheless, even if the samples are dependent, Equation (2.21) is still an unbiased estimator of the posterior predictive provided each $\theta_m$ is marginally distributed according to p(θ|D), and it converges to the true value as long as the dependencies are not too strong.
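The estimator in Equation (2.21) can be sketched in a few lines of Python. The example below is illustrative only: it assumes a toy model f(x; θ) = θx with scalar θ and the Gaussian likelihood of Equation (2.4), and it assumes the posterior over θ is a known Gaussian so that i.i.d. samples are available and the exact predictive can be written down for comparison. For a BNN neither assumption holds, which is why the sampling schemes discussed next are needed. The function names are hypothetical.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of N(y; mean, var)."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def mc_predictive(y, x, theta_samples, sigma2):
    """Equation (2.21): average the likelihood p(y | x, theta_m) over the samples."""
    total = sum(normal_pdf(y, theta * x, sigma2) for theta in theta_samples)
    return total / len(theta_samples)


def exact_predictive(y, x, post_mean, post_var, sigma2):
    """Closed-form predictive density for the linear-Gaussian toy model."""
    return normal_pdf(y, post_mean * x, sigma2 + x ** 2 * post_var)


# Assumed Gaussian posterior N(0.5, 0.2^2) over theta; samples are drawn i.i.d.
rng = random.Random(0)
samples = [rng.gauss(0.5, 0.2) for _ in range(20000)]
approx = mc_predictive(0.7, 1.0, samples, sigma2=0.25)
exact = exact_predictive(0.7, 1.0, 0.5, 0.04, sigma2=0.25)
```

With this many samples `approx` and `exact` agree closely; the discrepancy shrinks at the usual Monte Carlo rate as the number of samples M grows.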
In order to use Equation (2.21), we need a method of drawing samples from the posterior distribution. Since θ is high-dimensional and the posterior is multimodal, this is a non-trivial task. The most common way of doing this is with Markov chain Monte Carlo (MCMC) methods. In an MCMC method, a Markov chain is constructed such that its stationary distribution is p(θ|D). Since an ergodic Markov chain has a unique stationary distribution to which it converges from any initial state, Bayesian predictions can be made by simulating the chain until it converges and using Equation (2.21).

One advantage of MCMC is its convergence guarantees: in the limit as the chain is simulated for an infinitely long time, the samples obtained will be exact draws from p(θ|D). However, depending on the problem, MCMC can often take an impractically long time to converge. This is complicated by the fact that it is often difficult to diagnose when convergence has occurred. Common diagnostics have been proposed such as the potential scale reduction factor $\hat{R}$ (Gelman and Rubin, 1992). However, such diagnostics are not foolproof, and later studies have shown that the $\hat{R}$ diagnostic can have serious flaws (Vehtari et al., 2021). In practice, diagnosing MCMC convergence confidently often requires many different checks and some amount of subjective judgement on the part of the practitioner.

Naïve MCMC methods, such as the Metropolis-Hastings algorithm (Metropolis et al., 1953), suffer from random-walk behaviour: the state of the Markov chain takes a random walk through parameter space, thus requiring an inordinately long time to explore large regions of the posterior distribution. More advanced MCMC techniques, such as Hamiltonian Monte Carlo (HMC) (Neal, 1995), make use of gradient information to avoid random-walk behaviour and thus explore the posterior more quickly.
HMC is often considered a gold standard in terms of BNN inference quality, although even with HMC, it is difficult to assess whether a particular Markov chain has converged in practice. Moreover, HMC has several hyperparameters, such as the number of leapfrog steps, the step size, and the mass matrix, that have to be tuned to obtain good performance. Another limitation is that HMC involves performing an accept-reject step, which requires the likelihood function for the entire dataset to be computed at the end of every Markov chain transition. This severely limits HMC’s scalability to the massive datasets that have become common in modern deep learning. Although full-scale HMC on large image recognition datasets has been performed for research purposes (Izmailov et al., 2021), the computational costs render such a procedure impractical for real-world applications.

Work has been done to address both of these issues, with the No U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014) automatically tuning the number of leapfrog steps of HMC, and methods such as stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014) introducing mini-batch methods to HMC. These stochastic gradient MCMC (SGMCMC) methods focus on performing discrete-step approximations to the simulation of a stochastic differential equation. Ma et al. (2015) provide a unifying framework for SGMCMC methods under this interpretation. Unlike standard HMC, SGMCMC methods usually omit any form of accept-reject step. Hence although they are more scalable, they do not enjoy the same theoretical guarantees as HMC. SGHMC has been applied to BNNs for Bayesian optimisation, with promising results (Springenberg et al., 2016).
However, scaling MCMC methods up to modern architectures like deep convolutional networks with massive datasets such as ImageNet, whilst still providing high quality inference, has proven challenging. Recent work in this area has begun to show results competitive with standard CNNs (Heek and Kalchbrenner, 2019; Zhang et al., 2020); however it seems that artificially increasing the sharpness of the likelihood function may be required for good performance in practice (the so-called ‘cold posterior effect’ (Wenzel et al., 2020)).

Finally, deep ensembles (Lakshminarayanan et al., 2017) have been interpreted as a kind of sampling method for BNNs. A deep ensemble is simply an ensemble of (non-Bayesian) neural networks trained by standard gradient descent, where the randomness in the ensemble comes from the random initialisation of the network parameters. Once the ensemble is trained, each trained network can be interpreted as a sample from the approximate posterior, and predictions can be made using Equation (2.21). Although deep ensembles were originally introduced as an alternative to Bayesian neural networks, Wilson (2020) and Wilson and Izmailov (2020) argue that they should be interpreted as a kind of approximation to the Bayesian posterior, since each member of the ensemble represents a mode of the posterior. In fact, they argue that deep ensembles can provide a better approximation to the posterior than other approximate Bayesian methods that only represent a single mode. Izmailov et al. (2021) compared deep ensembles to full-batch HMC on the CIFAR-10 dataset and found that the predictive distribution of deep ensembles resembled the HMC predictive as closely as SGLD, and better than variational inference (which we discuss in the next section).
Although the Bayesian interpretation of deep ensembles is debated, their efficacy as a simple and effective method for obtaining predictive uncertainty estimates has been demonstrated by several studies, often outperforming other Bayesian methods in uncertainty estimation benchmarks (Ashukha et al., 2019; Ovadia et al., 2019).

2.3.2 Approximating family methods

The other major class of approximate inference methods are approximating family methods. These methods will be the focus of the study presented in Chapter 3. We define an approximating family method as a method that assumes a pre-specified parametric form for the approximate posterior distribution. These methods approximate the true posterior p(θ|D) with an approximate posterior $q_\phi(\theta)$, where ϕ are the parameters of the distribution. We will refer to the set of all approximating distributions consistent with the pre-specified parametric form as the approximating family, $\mathcal{Q}$. Each approximating family method must define its family $\mathcal{Q}$, and also a method for choosing a member of that family, or equivalently, a method of choosing ϕ. Once this is done, predictions can be made by replacing the expectation under the posterior in Equation (2.17) by an expectation under the approximate posterior:

$$\int p(y|x, \theta)\, p(\theta|\mathcal{D}) \, d\theta = \mathbb{E}_{p(\theta|\mathcal{D})} \left[ p(y|x, \theta) \right] \approx \mathbb{E}_{q_\phi(\theta)} \left[ p(y|x, \theta) \right]. \tag{2.22}$$

Usually, the approximating family is chosen such that it is easy to obtain independent samples from $q_\phi(\theta)$. In this case, we can make predictions by simple Monte Carlo, without the need to resort to MCMC methods:

$$\mathbb{E}_{q_\phi(\theta)} \left[ p(y|x, \theta) \right] \approx \frac{1}{M} \sum_{m=1}^{M} p(y|x, \theta_m), \quad \theta_m \overset{\text{i.i.d.}}{\sim} q_\phi(\theta). \tag{2.23}$$

Since it is easy to reduce the variance of this estimator by drawing many samples from $q_\phi(\theta)$, the main challenge in approximating family methods is choosing $\mathcal{Q}$ and ϕ such that $q_\phi(\theta)$ approximates the true posterior well in some sense. We now present some common examples of approximating family methods.
Variational inference

Variational inference (VI) (Beal, 2003; Blei et al., 2017; Jordan et al., 1999) frames approximate inference as an optimisation problem. The parameters ϕ of the approximating distribution are chosen to minimise the KL-divergence between qϕ(θ) and the true posterior p(θ|D). In other words, we seek the parameters ϕ∗ such that

$$\phi^{*} = \operatorname*{arg\,min}_{\phi}\; \mathrm{KL}\left(q_\phi(\theta)\,\|\,p(\theta|D)\right). \tag{2.24}$$

The KL-divergence is non-negative, and equals zero if and only if qϕ(θ) = p(θ|D). In order to perform this minimisation, we need to write the KL-divergence in terms of computationally tractable quantities. If we define the quantity

$$\mathcal{L}_{\mathrm{ELBO}}(\phi) := \log p(D) - \mathrm{KL}\left(q_\phi(\theta)\,\|\,p(\theta|D)\right), \tag{2.25}$$

it can easily be shown that:

$$\begin{align}
\mathcal{L}_{\mathrm{ELBO}}(\phi) &= \mathbb{E}_{q_\phi(\theta)}\left[\log p(D|\theta)\right] - \mathrm{KL}\left(q_\phi(\theta)\,\|\,p(\theta)\right) \tag{2.26}\\
&= \sum_{n=1}^{N} \mathbb{E}_{q_\phi(\theta)}\left[\log p(y_n|x_n,\theta)\right] - \mathrm{KL}\left(q_\phi(\theta)\,\|\,p(\theta)\right) \tag{2.27}\\
&= \sum_{n=1}^{N} \mathbb{E}_{q_\phi(\theta)}\left[\log p(y_n|x_n,\theta)\right] - \mathbb{E}_{q_\phi(\theta)}\left[\log \frac{q_\phi(\theta)}{p(\theta)}\right]. \tag{2.28}
\end{align}$$

From Equation (2.25) we can see that $\mathcal{L}_{\mathrm{ELBO}}$ is a lower bound on the log model evidence log p(D), and that maximising $\mathcal{L}_{\mathrm{ELBO}}$ is equivalent to minimising KL(qϕ(θ)∥p(θ|D)). $\mathcal{L}_{\mathrm{ELBO}}$ is known as the evidence lower bound (ELBO) or the (negative) variational free energy. Moreover, unlike KL(qϕ(θ)∥p(θ|D)), it is often computationally tractable to obtain an unbiased estimate of $\mathcal{L}_{\mathrm{ELBO}}$. This can be done by forming simple Monte Carlo estimates of the expectations in Equation (2.28). In some cases, e.g. when qϕ(θ) and p(θ) are both Gaussian distributions, the KL-divergence between them can also be evaluated analytically. Furthermore, since the likelihood terms appear as a sum over datapoints in Equation (2.28), it is trivial to form an unbiased estimate of $\mathcal{L}_{\mathrm{ELBO}}$ using minibatches of data, allowing VI to scale to massive datasets. In order to optimise ϕ, gradient-based optimisers can be used with unbiased estimates of $\nabla_\phi \mathcal{L}_{\mathrm{ELBO}}$ obtained via the reparameterisation trick (Blundell et al., 2015; Kingma et al., 2015; Kingma and Welling, 2013).
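The relationship between the ELBO and the log evidence can be checked numerically on a model where everything is tractable. The sketch below is illustrative only and assumes a one-parameter linear-Gaussian model with a standard normal prior; the data, noise variance and function names are hypothetical. It estimates Equation (2.27) with reparameterised samples θ = m + sε and an analytic Gaussian KL term, and compares the result with the closed-form ELBO and with log p(D).

```python
import math
import random

# Toy data and model (assumptions of this sketch): y_n = theta * x_n + noise,
# prior theta ~ N(0, 1), Gaussian noise with known variance NOISE_VAR.
XS = [-1.0, -0.5, 0.3, 1.2]
YS = [-0.9, -0.2, 0.4, 1.0]
NOISE_VAR = 0.25


def log_lik(theta):
    """sum_n log p(y_n | x_n, theta)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * NOISE_VAR)
        - 0.5 * (y - theta * x) ** 2 / NOISE_VAR
        for x, y in zip(XS, YS)
    )


def kl_to_standard_normal(m, s):
    """KL(N(m, s^2) || N(0, 1)), available in closed form."""
    return 0.5 * (s * s + m * m - 1.0) - math.log(s)


def elbo_mc(m, s, num_samples, rng):
    """Equation (2.27) with a reparameterised Monte Carlo likelihood term."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        total += log_lik(m + s * eps)  # theta = m + s * eps
    return total / num_samples - kl_to_standard_normal(m, s)


def elbo_exact(m, s):
    """Closed-form ELBO, possible here because the model is linear-Gaussian."""
    expected_ll = sum(
        -0.5 * math.log(2.0 * math.pi * NOISE_VAR)
        - 0.5 * ((y - m * x) ** 2 + (s * x) ** 2) / NOISE_VAR
        for x, y in zip(XS, YS)
    )
    return expected_ll - kl_to_standard_normal(m, s)


def log_evidence():
    """log p(D) for y ~ N(0, NOISE_VAR * I + x x^T)."""
    n = len(XS)
    xx = sum(x * x for x in XS)
    xy = sum(x * y for x, y in zip(XS, YS))
    yy = sum(y * y for y in YS)
    log_det = n * math.log(NOISE_VAR) + math.log(1.0 + xx / NOISE_VAR)
    quad = (yy - xy * xy / (NOISE_VAR + xx)) / NOISE_VAR
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


# Exact posterior of this conjugate model, where the bound is tight.
_xx = sum(x * x for x in XS)
_xy = sum(x * y for x, y in zip(XS, YS))
POST_VAR = 1.0 / (1.0 + _xx / NOISE_VAR)
POST_MEAN = POST_VAR * _xy / NOISE_VAR
```

Evaluating `elbo_exact(POST_MEAN, math.sqrt(POST_VAR))` recovers `log_evidence()` up to rounding, while any other choice of (m, s) gives a strictly smaller value, in line with Equation (2.25).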
Variational inference is a wide-ranging approximate inference technique that encompasses a broad variety of approximating families (also known as variational families in this context). The most commonly used variational family is the set of all fully factorised Gaussian distributions over θ, which we denote as QFFG. VI with QFFG is usually referred to as mean-field variational inference (MFVI) (Blundell et al., 2015; Graves, 2011; Hinton and Van Camp, 1993). Other, more flexible families have also been proposed, ranging from the set of all multivariate Gaussian distributions over θ, denoted QFCG (Barber and Bishop, 1998),⁶ to families involving matrix-variate Gaussians (Louizos and Welling, 2016), normalising flows (Rezende and Mohamed, 2015) and even implicit distributions (Huszár, 2017; Mescheder et al., 2017; Ranganath et al., 2016a; Shi et al., 2018) for qϕ(θ), whose densities cannot be evaluated directly.

However, using more flexible families often substantially complicates the method. In Multiplicative Normalising Flows (MNF) (Louizos and Welling, 2017), latent variables are used to multiply the outputs of each neuron. The distribution over the latent variables is specified by a normalising flow. However, the training procedure then necessitates the use of a hierarchical ELBO, which is itself a lower bound to the ELBO (Ranganath et al., 2016b). In Kernel Implicit Variational Inference (KIVI) (Shi et al., 2018), an implicit distribution is used for the variational posterior, but this necessitates the use of kernel density estimators to obtain a (biased) estimate of the KL-divergence term in the ELBO. The computational overhead and the increased complexity introduced by using these more expressive variational families have prevented their widespread adoption.
The fully factorised Gaussian family remains the most widely used approximating family for its simplicity and scalability, and has been successfully applied up to the ImageNet scale, when combined with natural-gradient optimisation methods (Osawa et al., 2019).

⁶ Here the ‘FCG’ in QFCG stands for ‘full-covariance Gaussian’.

Monte Carlo dropout

Monte Carlo dropout (MCDO) is an approximate inference method for BNNs that works by training a neural network with dropout (Srivastava et al., 2014), where hidden units are stochastically dropped (set to zero) during training with probability p. Dropout was originally conceived as a regularisation technique designed to prevent neurons from co-adapting to one another. In standard dropout, once training is complete, predictions are made with all hidden units present, but with the weights downscaled by a factor. In contrast, in MC dropout, units are stochastically dropped during test time. Multiple forward passes are made through the network for each prediction, with each forward pass being performed with a random subset of units dropped. The final prediction is then the average of all of these forward passes.

MC dropout has been given a Bayesian interpretation in Gal and Ghahramani (2016) and Gal (2016) as a form of approximate variational inference. In this interpretation, the stochasticity is interpreted as occurring in parameter space, not the space of hidden features. Specifically, let $h^{(l)}$ be the hidden features in the lth layer before any units are dropped out, and let $\hat{h}^{(l)}$ be a sample of the hidden features after units have been dropped out with probability p. Then we can write:

$$\hat{h}^{(l)} = \epsilon^{(l)} \odot h^{(l)}, \tag{2.29}$$

where $\epsilon^{(l)}$ is a vector of the same length as $h^{(l)}$, with each element of $\epsilon^{(l)}$ drawn i.i.d., and taking the value 0 with probability p, and the value 1 with probability 1 − p. Here ⊙ denotes the element-wise product and the stochasticity is usually interpreted as being in hidden feature space.
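The test-time procedure described above (mask the hidden features as in Equation (2.29), run several stochastic forward passes, and average) can be sketched as follows. This is a minimal illustration for a single-hidden-layer ReLU MLP written with plain Python lists; the function names and the parameter layout are assumptions of the sketch, not an existing implementation.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def dropout_forward(x, W0, b0, W1, b1, p, rng):
    """One stochastic forward pass of a single-hidden-layer ReLU MLP.

    The hidden features h are masked as in Equation (2.29): each unit is set
    to zero with probability p and kept with probability 1 - p.
    """
    h = relu([a + b for a, b in zip(matvec(W0, x), b0)])
    eps = [0.0 if rng.random() < p else 1.0 for _ in h]
    h_hat = [e * a for e, a in zip(eps, h)]  # element-wise product eps * h
    return [a + b for a, b in zip(matvec(W1, h_hat), b1)]


def mc_dropout_predict(x, params, p, num_passes, rng):
    """Average of num_passes stochastic forward passes (dropout at test time)."""
    W0, b0, W1, b1 = params
    outs = [dropout_forward(x, W0, b0, W1, b1, p, rng) for _ in range(num_passes)]
    return [sum(col) / num_passes for col in zip(*outs)]
```

With p = 0 every pass is identical and the average reduces to the deterministic network output; with p = 1 all hidden units are dropped and only the output bias remains.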
However, in Bayesian inference we want to quantify our uncertainty about the parameters, not the hidden features. To do this, we note that hidden features in an MLP are always multiplied by a weight matrix. We can then write:

$$\begin{align}
W^{(l)}\hat{h}^{(l)} &= W^{(l)}\left(\epsilon^{(l)} \odot h^{(l)}\right) \tag{2.30}\\
&= W^{(l)}\,\mathrm{diag}(\epsilon^{(l)})\,h^{(l)} \tag{2.31}\\
&= \hat{W}^{(l)} h^{(l)}, \tag{2.32}
\end{align}$$

where diag(·) maps a vector to a diagonal matrix with that vector on the diagonal, and we have defined the random matrix $\hat{W}^{(l)} := W^{(l)}\,\mathrm{diag}(\epsilon^{(l)})$. Hence we can view the stochasticity as occurring in parameter space through these random weight matrices.

It has been argued by Gal (2016) that standard dropout training with ℓ2 regularisation approximates stochastic optimisation of the ELBO in variational inference. Under this interpretation, the variational family, QMCDO, is the set of distributions over weight matrices induced by the sampling procedure $\hat{W}^{(l)} := W^{(l)}\,\mathrm{diag}(\epsilon^{(l)})$ for 0 ≤ l ≤ L. Members of this family are referred to as Bernoulli variational distributions, or dropout variational distributions. Here the variational parameters ϕ are the pre-dropout weight matrices $(W^{(l)})_{l=0}^{L}$, the biases (which are deterministic) $(b^{(l)})_{l=0}^{L}$, and the dropout probability p. Applying dropout at test time can then be viewed as an application of Equation (2.23). In its original implementation, the dropout probability p was tuned by cross-validation on a held-out dataset. Concrete dropout (Gal et al., 2017a) is an extension of MC dropout that allows p to be learned automatically using the VI interpretation.

The MC dropout approximating family QMCDO is unusual in that it has support over only a finite number of settings of the parameters θ. In other words, it can be expressed as a finite mixture of Dirac-δ distributions. Hence, in the case of commonly used Gaussian priors p(θ), KL(qϕ(θ)∥p(θ)) is infinite.
Gal (2016) justifies the method by considering the delta functions to be Gaussians with small variances, or by considering a discrete prior instead of a Gaussian prior. This procedure is given a rigorous justification in Hron et al. (2018). Another interesting feature of the dropout variational distribution is that although hidden units are dropped out independently of each other, it is not a fully factorised distribution when viewed in weight space: the weights out of each hidden unit are dependent on each other.

Laplace approximation

The Laplace approximation (Denker and LeCun, 1991; MacKay, 1992b) works by finding a mode θMAP of the posterior via standard gradient-based optimisation, and then setting the approximate posterior to qϕ(θ) = N (θ; µ, Σ) with µ = θMAP. Σ is set such that the curvature of log p(θ|D) matches the curvature of the logarithm of the Gaussian approximation at θMAP, that is:

$$\Sigma = -\left[\nabla_\theta \nabla_\theta \log p(\theta|D)\,\Big|_{\theta=\theta_{\mathrm{MAP}}}\right]^{-1}. \tag{2.33}$$

In words, Σ is the negative inverse Hessian evaluated at the MAP solution. In practice, for regression networks it is common to use the Gauss-Newton matrix as an approximation to the negative Hessian of the log-likelihood. The Gauss-Newton matrix is guaranteed to be positive semi-definite, and can be evaluated using only first derivatives. Since the resulting approximation to the Hessian in Equation (2.33) is the negative of the bracketed matrix below, the two minus signs cancel:

$$\Sigma = \left[\frac{1}{\sigma^2}\sum_{n=1}^{N} g(x_n)\,g(x_n)^{\mathsf{T}} + \mathrm{diag}(p)\right]^{-1}. \tag{2.34}$$

Here $g(x_n) = \nabla_\theta f_\theta(x_n)\big|_{\theta=\theta_{\mathrm{MAP}}}$ and p is a vector whose ith element is $1/\sigma_i^2$, where $\sigma_i^2$ is the prior variance⁷ of $\theta_i$. For networks with other likelihoods such as classification networks, a generalised Gauss-Newton approximation to the Hessian can be used instead (Martens, 2020).

⁷ Here we have assumed a diagonal Gaussian prior for θ.

In this case, the approximating family is the set of multivariate Gaussian distributions over the parameters of the network, i.e., QFCG. However, other approximating families may also be considered for use with the Laplace approximation.
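To make the Gauss-Newton form of the Laplace covariance concrete, the sketch below computes it for a two-parameter toy model. It takes the precision matrix to be the positive-definite sum (1/σ²) Σₙ g(xₙ)g(xₙ)ᵀ + diag(p), so that the returned Σ is a valid covariance. The model f_θ(x) = θ₀ + θ₁x is an assumption made for illustration: it is linear in θ, so g(x) = (1, x) does not depend on θMAP and no optimisation is needed. The function name is hypothetical.

```python
def gauss_newton_covariance(xs, noise_var, prior_vars):
    """Sigma = [(1/noise_var) * sum_n g(x_n) g(x_n)^T + diag(1/prior_vars)]^{-1}.

    Toy model (assumption): f_theta(x) = theta_0 + theta_1 * x, so that
    g(x) = (1, x) regardless of theta_MAP.
    """
    # Accumulate the symmetric 2x2 precision matrix [[a, b], [b, d]],
    # starting from the diagonal prior precision diag(p).
    a = 1.0 / prior_vars[0]
    b = 0.0
    d = 1.0 / prior_vars[1]
    for x in xs:
        g = (1.0, x)
        a += g[0] * g[0] / noise_var
        b += g[0] * g[1] / noise_var
        d += g[1] * g[1] / noise_var
    det = a * d - b * b
    # Closed-form inverse of a symmetric 2x2 matrix.
    return [[d / det, -b / det], [-b / det, a / det]]
```

For a real network g(xₙ) would be obtained by automatic differentiation at θMAP, and the matrix inverse would be the dominant cost.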
Let the number of parameters in the network be $N_P$. The method as presented here requires the storage and inversion of an ($N_P \times N_P$) matrix. While this is still feasible for small networks such as those considered in MacKay (1992b), it is prohibitively expensive for the large neural networks considered in modern deep learning. As a more scalable alternative, we could take just the diagonal of the Hessian matrix and invert it to obtain a diagonal covariance, Σdiag. This is known as the diagonal Laplace approximation, and was first proposed by Denker and LeCun (1991). In this case the approximating family is just QFFG. Recently, the K-FAC (Kronecker-factored approximate curvature) Laplace approximation has been proposed for deep neural networks (Daxberger et al., 2021; Immer et al., 2021a,b; Ritter et al., 2018); it is more scalable than the full Laplace approximation while using a more flexible approximating family than the diagonal Laplace approximation. In the K-FAC Laplace approximation, Q is the set of Gaussian distributions that are factorised over layers in the network (leading to a block diagonal covariance), with the covariance matrices within each layer being Kronecker-factored.

Other approximating family methods

There is a wide range of other approximating family methods which we will not discuss in detail. These include methods that minimise a divergence other than the KL-divergence, such as expectation propagation (Hernández-Lobato and Adams, 2015; Minka, 2001), black-box alpha divergence minimisation (Hernández-Lobato et al., 2016) and Rényi divergence variational inference (Li and Turner, 2016).
In addition, there are methods such as stochastic weight averaging-Gaussian (SWAG) (Maddox et al., 2019), which relies on interpreting SGD iterates as performing approximate variational inference (Mandt et al., 2017), and functional variational Bayesian neural networks (Sun et al., 2019), which attempt to minimise the KL-divergence in function space instead of weight space. One thing all of these techniques have in common, when applied to BNNs, is that they primarily use the fully factorised Gaussian approximating family, QFFG, although other families can sometimes be used within the same framework.

2.3.3 Choosing and evaluating approximating family methods

Unlike MCMC methods, there are usually no theoretical guarantees that the approximations provided by approximating family methods will converge to the exact posterior. In fact, the parametric form assumed by the approximating family often has properties (such as unimodality, Gaussianity, or independence assumptions) that we know not to be true of the exact posterior, making convergence guarantees of the kind available with MCMC impossible to obtain.

There is, however, a very active line of research that provides frequentist concentration guarantees for variational approximations of the Bayesian posterior (Alquier and Ridgway, 2020; Chérief-Abdellatif, 2020; Pati et al., 2018; Zhang and Gao, 2020). These assume there is a single true setting of the parameters which is used to generate the data. It is then shown, subject to technical conditions, that as the amount of data increases, the variational posterior concentrates around the true value of the parameter. However, these results are not immediately relevant for BNN practitioners. This is because the main motivation for introducing BNNs is to represent epistemic uncertainty in the parameters.
By contrast, these frequentist consistency results only become relevant when there is enough data for the posterior to concentrate around the true setting of the parameters; in other words, when it is no longer necessary to represent uncertainty. They cannot be used to show that, for a given dataset, the variational posterior predictive will be similar to the exact Bayesian posterior predictive, which is what we are concerned with.

Given the task of approximating the exact Bayesian posterior for a finite dataset, we are then faced with the challenge that it is not clear which approximating family, or which approximating family method, will allow for the most accurate inference. If the approximating family method is fixed, in theory a larger approximating family is more flexible and hence should always allow for better performance. However, this is not always borne out in practice due to the added computational cost and optimisation difficulties introduced by large approximating families (Trippe and Turner, 2018).

Although there have been studies that benchmark the performance of various approximating family methods for BNNs (e.g., Mukhoti et al. (2018); Tomczak et al. (2018)), these most commonly evaluate the methods by metrics such as held-out accuracy or log-likelihood on a benchmark dataset, without any reference to the true posterior predictive. (One recent notable exception is Izmailov et al. (2021), which performs full-batch HMC for ResNets trained on the CIFAR-10 dataset as a reference, and compares the HMC posterior predictive with MFVI, among other methods.) While empirical studies comparing performance on benchmark datasets can give some indication as to the practical utility of a method on a specific task, they do not address the fundamental question of how well the approximate posterior predictive matches the true posterior predictive.
For example, a method could perform well on a specific dataset because a poorly chosen prior has been combined with inaccurate inference in such a way that the problems introduced fortuitously “cancel out” with each other. While this may still lead to a useful machine learning method, it is debatable to what extent the success of such a method can be attributed to Bayesian principles. At the very least, it is important to know if and when this is happening, so that we can know how to troubleshoot and improve our models. In Chapter 3 we will investigate two common approximating families, QFFG and QMCDO, both theoretically and empirically, to obtain new insights into how well these families can approximate the true posterior predictive.

2.4 History of approximating families in Bayesian neural networks

In this section we give a brief, incomplete history of the approximating families most commonly used for Bayesian neural networks. We focus on the theoretical and practical motivations given for their introduction.

Denker and LeCun (1991) appear to be the first to attempt to calculate a Bayesian posterior predictive for a neural network. They use the Laplace approximation with a diagonal approximation to the covariance matrix. Hence they select QFFG as their approximating family. This is the earliest use we found in the literature of QFFG for approximate BNN inference in feed-forward networks. No theoretical or practical justification is made for using a diagonal covariance matrix; the network architectures considered then were small, so presumably inverting the full covariance matrix would have been computationally feasible, but slower. Denker and LeCun (1991) also include a discussion of what would now be called the distinction between ‘epistemic’ and ‘aleatoric’ uncertainty, and the role of BNNs in expressing epistemic uncertainty.
Buntine and Weigend (1991) propose using the Laplace approximation with a full covariance matrix, thus selecting QFCG as their approximating family. They claim that the diagonal approximation to the covariance matrix will lead to very poor estimates, although they do not provide a theoretical explanation of the role of the off-diagonal terms. In the conclusion section, the paper raises the question of the quality of the Gaussian approximation.

In a seminal paper, MacKay (1992b) introduced the evidence framework for Bayesian neural networks, which relies on the full-covariance Laplace approximation to make predictions and perform model comparison. Thus QFCG is used as the approximating family. In commenting on the method of Denker and LeCun (1991), MacKay (1992b) argues that due to strong posterior correlations in the parameters, it is important to evaluate the off-diagonal terms of the Hessian when doing the Laplace approximation. It is interesting to note that in Figure 1 of MacKay (1992b), the posterior predictive is shown with an emphasis on the fact that the error bars get larger around the perimeter of the training data, and also in the gap between the training regions. We will discuss this ‘in-between uncertainty’ in more detail in Chapter 3. MacKay (1992b) explicitly links this qualitative behaviour in function space to dependencies in the approximate posterior in parameter space, and notes that this qualitative behaviour would not have been obtained if the diagonal Laplace approximation was used. However, he does not provide a detailed argument as to why this is so.

Variational inference for BNNs was introduced in Hinton and Van Camp (1993). They frame their work in terms of the Minimum Description Length (MDL) principle, not VI, but the objective used is identical to the ELBO.
They avoid having to perform Monte Carlo estimation of the gradients by using a single hidden layer network and tabulating values of the mean and variance of the output of a hidden unit. They use QFFG as their variational family. In commenting on the choice of QFFG, they note that it is not clear how much is lost by ignoring the off-diagonal terms in the covariance matrix, given that MacKay (1992b) showed significant covariances in the Laplace approximation. However, they argue that since VI explicitly manipulates the Gaussian distributions, the learning will try to force the noise in the weights to be independent.⁸

⁸ The phenomenon of VI learning variational parameters that are consistent with the factorisation assumptions made has indeed been observed in BNNs, though this behaviour may not always be desirable (Trippe and Turner, 2018).

Barber and Bishop (1998) extended the work of Hinton and Van Camp (1993) by replacing the MDL interpretation with the standard VI interpretation, and also by extending the variational family from QFFG to QFCG. They motivate the introduction of QFCG by noting that the posterior often has very strong correlations between the parameters.

Recent work has focused on scaling up VI to larger BNNs (Blundell et al., 2015; Graves, 2011; Osawa et al., 2019), and using Monte Carlo estimates for the gradients to allow deeper networks to be trained using automatic differentiation packages. Since the models considered in modern deep learning are far larger than those used when BNNs were in their infancy, the quadratic (in the number of parameters) computational and memory requirements of QFCG are no longer as acceptable, and QFFG is the most commonly used variational family. The need to scale to larger networks has led to QFFG being a widespread choice in many modern approximating family methods, not just VI.
To give an incomplete list, it has been used in PBP (Hernández-Lobato and Adams, 2015), variational Gaussian dropout (Kingma et al., 2015), stochastic expectation propagation (Li et al., 2015), black-box alpha divergence minimisation (Hernández-Lobato et al., 2016), Rényi divergence VI (Li and Turner, 2016), natural gradient VI (Khan et al., 2018) and functional variational BNNs (Sun et al., 2019).

The other approximating family that has found widespread use in modern Bayesian deep learning is QMCDO. Dropout as a stochastic regularisation technique was introduced in Srivastava et al. (2014). Later, the interpretation of MC dropout as a Bayesian approximation was introduced in Gal and Ghahramani (2016). Unlike QFFG or QFCG, QMCDO was not first introduced as a family intended to approximate a Bayesian posterior distribution. Nevertheless, MC dropout inference performs competitively on a variety of BNN benchmarks (Filos et al., 2019; Mukhoti et al., 2018).

2.5 Conclusion

In this chapter, we introduced Bayesian neural networks and motivated their need by discussing the inability of standard neural networks to represent epistemic uncertainty. We saw that exact inference in BNNs is intractable and has to be approximated. This led us to consider approximate inference, which could be divided into sampling methods and approximating family methods. We provided an overview of the most commonly used approximating family methods for BNNs, and found that the majority use the factorised Gaussian approximating family, QFFG.

BNNs hold the promise of combining principled uncertainty estimation with the flexibility of deep learning. However, if approximate inference fails to provide predictive distributions that resemble the exact predictive, the relationship between the principled Bayesian framework we use in theory and the models we deploy in practice becomes tenuous.
Unlike MCMC methods, most approximating family methods do not come with any theoretical guarantees as to the quality of their approximations. Hence understanding the effect of approximate inference on BNN predictive distributions is crucial. We turn to this subject in the next chapter.

Chapter 3

The expressiveness of approximate inference in Bayesian neural networks

In Chapter 2 we saw that while Bayesian neural networks hold the promise of being flexible, well-calibrated statistical models, inference requires approximations whose consequences are poorly understood. Hence it is unclear to what extent the successes (and failures) of approximate BNNs are attributable to the exact Bayesian predictive, rather than peculiarities of the approximation method. From a Bayesian modelling perspective, it is therefore crucial to ask: does the approximate predictive distribution retain the qualitative features of the exact predictive?

In this chapter we present a study of the approximation quality of common approximating family methods. In Section 3.3 we consider single-hidden layer BNNs, and show a fundamental limitation in function space of two of the most commonly used distributions defined in weight space: mean-field Gaussian and Monte Carlo dropout. We find there are simple cases where neither method can have substantially increased uncertainty in between well-separated regions of low uncertainty. We provide strong empirical evidence that exact inference does not have this pathology, hence it is due to the approximation and not the BNN model itself. In Section 3.4 we consider deeper networks. In contrast to the single-hidden layer case, we prove a universality result showing that there exist approximate posteriors in the above classes which provide flexible uncertainty estimates.
However, we find empirically that pathologies of a similar form as in the single-hidden layer case can persist when performing variational inference in deeper networks; i.e., these posteriors are not found in practice. Our results motivate careful consideration of the implications of approximate inference methods in BNNs.

The material in this chapter was previously published in ‘On the Expressiveness of Approximate Inference in Bayesian Neural Networks’ (Foong et al., 2020b). The research was conducted in collaboration with my co-first author David R. Burt, and was supervised by Yingzhen Li and Richard E. Turner throughout. I was involved closely with all aspects of the paper, including the theoretical results, the experiments and the writing of the paper.

3.1 Criteria for successful approximation

In Section 2.3.2, we saw that many approximate inference methods for BNNs work by defining a simple class of distributions over the model parameters (an approximating family), and then choosing a member of this family as an approximation to the posterior. Mean-field variational inference (MFVI) and Monte Carlo dropout (MCDO) are two of the most commonly used instances of this approach. For such a method to succeed, two criteria must be met:

Criterion 1: The approximating family must contain good approximations to the posterior.

Criterion 2: The method must then select a good approximate posterior within this family.

For nearly all tasks, the performance of a BNN only depends on the distribution over weights to the extent that it affects the distribution over predictions (i.e. in ‘function space’). Hence for our purposes, a ‘good’ approximate posterior is one that captures features of the exact posterior in function space that are relevant to the task at hand. However, approximating families are often defined in weight space for computational reasons.
Evaluating Criterion 1 therefore involves understanding how weight space approximations translate to function space, which is a non-trivial task for highly nonlinear models such as BNNs.

In this chapter we provide both theoretical and empirical analyses of the flexibility of the predictive mean and variance functions of approximate BNNs. Our main findings are:

1. For shallow (i.e., single-hidden layer) BNNs, there exist simple situations where no mean-field Gaussian or MC dropout distribution can faithfully represent the exact posterior predictive uncertainty (Criterion 1 is not satisfied). We prove in Section 3.3 that in these instances the predictive variance function of any fully-connected, single-hidden layer ReLU BNN using these families suffers a lack of ‘in-between uncertainty’: increased predictive uncertainty in between well-separated regions of low uncertainty. This is especially problematic for lower-dimensional data where we may expect some datapoints to be in between others. Examples include spatio-temporal data, or Bayesian optimisation for hyperparameter search, where we frequently wish to make predictions in unobserved regions in between observed regions. We verify that the exact posterior predictive does not suffer from this limitation; hence this pathology is attributable solely to the restrictiveness of the approximating family. Furthermore, since this problem is tied to the approximating family, any method that uses the mean-field Gaussian or MC dropout families will be similarly restricted.

2. In contrast, in Section 3.4 we prove a universal approximation result showing that the mean and variance functions of deep (more than 1 hidden layer) approximate BNNs using mean-field Gaussian or MCDO distributions can uniformly approximate any continuous function and any continuous non-negative function respectively.
However, it remains to be shown that appropriate predictive means and variances will be selected when choosing the approximate posterior from the approximating family. Since addressing this question requires assessing the behaviour of the particular approximating family method, and not simply the family itself, we choose to limit our study to variational inference, i.e., ELBO optimisation, as the approximating family method. To test the fidelity of the approximation, we focus on the low-dimensional, small data regime where comparisons to references for the exact posterior such as the limiting GP (Lee et al., 2018; Matthews et al., 2018; Neal, 1995) are easier to make. In Section 3.4.2 we provide empirical evidence that in spite of its theoretical flexibility (in terms of the expressiveness of the variational family), VI in deep BNNs can still lead to distributions that suffer from similar pathologies to the shallow case, i.e., Criterion 2 is not satisfied.

Finally, in Section 3.5, we provide an active learning case study on a real-world dataset showing how in-between uncertainty can be a crucial feature of the posterior predictive. In this case, we provide evidence that although the inductive biases of the BNN model with exact inference can bring considerable benefits, these are lost when MFVI or MCDO are used.

Code to reproduce our experiments can be found at https://github.com/cambridge-mlg/expressiveness-approx-bnns.

3.2 Priors and references for the exact predictive

In this chapter, our goal is to examine how closely approximate BNN predictive distributions resemble exact inference. To make this comparison, a choice of BNN prior must be made. As we noted in Section 2.2.3, common practice is to choose independent Gaussian priors. Furthermore, it is common to set these to be standard Normal N (0, 1) priors for all parameters, regardless of the size of the network.
However, such priors are known to lead to extremely large prior predictive variances in function space for wide or deep networks (Neal, 1995). For example, choosing a standard normal prior for a 4-hidden layer BNN with 50 neurons in each layer leads to a prior standard deviation of $\sim 10^3$ for the output of the network at x = 0. This is orders of magnitude too large to reflect our prior beliefs for normalised data. It is conceivable that one may combine an unreasonable prior such as this with poor approximate inference to obtain practically useful uncertainty estimates that bear little relation to the exact Bayesian predictive; we do not consider this case. Instead, we focus our study on the quality of approximate inference in models with more moderate prior variances in function space.

There is a body of literature on BNN priors (Lee et al., 2018; Matthews et al., 2018; Neal, 1995; Schoenholz et al., 2017) which shows how to select prior weight variances that lead to reasonable prior variances in function space, even as the width of the hidden layers tends to infinity. For a layer with $N_{\mathrm{in}}$ inputs, we choose independent $\mathcal{N}(0, \sigma_w^2/N_{\mathrm{in}})$ priors for the weights, with $\sigma_w^2$ a width-independent constant. As the width tends to infinity, both the prior and posterior of such a BNN converge to a well-defined Gaussian process (GP) (Hron et al., 2020; Matthews et al., 2018; Neal, 1995). This convergence does not occur if we omit the scaling by $1/N_{\mathrm{in}}$. We hence include this scaling when specifying our BNN priors.

It has been shown with extensive Markov chain Monte Carlo simulation that 3-hidden layer BNNs with just 50 units per layer already closely resemble their corresponding infinite-width GP counterparts (Matthews et al., 2018). In this chapter, we examine BNNs of up to 10 hidden layers. It is uncertain whether finite-width BNNs of such large depths will still resemble their infinite-width counterparts as closely.
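The effect of the 1/N_in scaling on the prior predictive variance at x = 0 can be worked out with a short recursion. The sketch below is a hedged illustration, assuming a ReLU MLP with independent zero-mean Gaussian priors and unit-variance biases; the function name and the choice σ_w² = 2 in the usage note are hypothetical. Because each pre-activation z is symmetric about zero, E[relu(z)²] = Var[z]/2, which gives an exact layer-by-layer update for the variance.

```python
import math


def prior_output_std_at_zero(num_hidden_layers, width, weight_var_fn, bias_var=1.0):
    """Exact prior standard deviation of a ReLU MLP's output at x = 0.

    weight_var_fn(n_in) returns the prior variance of each weight in a layer
    with n_in inputs. Weights and biases are independent and zero-mean, so each
    pre-activation z is symmetric about zero and E[relu(z)^2] = Var[z] / 2.
    """
    # First hidden layer at x = 0: only the bias contributes.
    var = bias_var
    # One update per remaining layer (hidden-to-hidden, then hidden-to-output).
    for _ in range(num_hidden_layers):
        # `width` inputs, each contributing weight_var * E[relu(z)^2].
        var = width * weight_var_fn(width) * var / 2.0 + bias_var
    return math.sqrt(var)


# Standard normal prior on every weight, 4 hidden layers of 50 units.
std_standard = prior_output_std_at_zero(4, 50, lambda n_in: 1.0)
# Width-scaled prior N(0, sigma_w^2 / n_in) with sigma_w^2 = 2 (an assumption).
std_scaled = prior_output_std_at_zero(4, 50, lambda n_in: 2.0 / n_in)
```

Under these assumptions the standard normal prior gives a standard deviation of about 638, consistent with the order of magnitude quoted above, whereas the width-scaled prior gives √5 ≈ 2.24 regardless of the width.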
However, the GP predictive may still act as a useful qualitative reference for what we expect of the exact predictive in the finite-width case. We hence use both exact inference in the corresponding infinite-width GP and also ‘gold-standard’ Hamiltonian Monte Carlo (HMC) (Hoffman and Gelman, 2014; Neal et al., 2011) as references for the exact posterior.

3.3 Single-hidden layer neural networks

In this section, we present our results stating that for single-hidden layer (1HL) ReLU BNNs, QFFG and QMCDO are not expressive enough to satisfy Criterion 1 in situations where in-between uncertainty is important. We identify limitations on the variance in function space, V[f(x)], implied by these families. We show empirically that the exact posterior does not have these restrictions, implying that approximate inference does not qualitatively resemble the posterior.

Theorem 1 (Factorised Gaussian). Consider any single-hidden layer fully-connected ReLU neural network $f : \mathbb{R}^D \to \mathbb{R}$. Let $x_d$ denote the $d$th element of the input vector x. Assume a fully factorised Gaussian distribution over the parameters, i.e., the QFFG approximating family. Consider any points $p, q, r \in \mathbb{R}^D$ such that $r \in \overrightarrow{pq}$ and either:
i. $\overrightarrow{pq}$ contains 0 and r is closer to 0 than both p and q,
ii. $\overrightarrow{pq}$ is orthogonal to and intersects the plane $x_d = 0$, and r is closer to the plane $x_d = 0$ than both p and q.
Then V[f(r)] ≤ V[f(p)] + V[f(q)].

Remark 1. In Theorem 7 in Appendix A we actually prove a stronger result than Theorem 1, which also applies to approximating families that have certain correlations. For example, the bound still holds when the weights coming out of a neuron in the hidden layer are correlated with each other.

In words, Theorem 1 states that there are line segments in input space (illustrated in Figure 3.1) such that the predictive variance on the line is bounded by the sum of the variance at the endpoints.
This restriction is problematic in situations where we would like the BNN to express higher epistemic uncertainty on a line segment joining regions with lower epistemic uncertainty. Analogous but weaker bounds on higher dimensional sets in input space enclosed by these lines can be obtained as a corollary. For instance, consider the case where the input domain is $\mathbb{R}^2$. Let p, q, r, s be the four corners of a rectangle containing the origin. For any point a in the rectangle, we can upper bound V[f(a)] by the sum of the variances at the points at the top and bottom edges of the rectangle with the same horizontal coordinate as a (Theorem 1, condition (ii)). These in turn can be upper bounded in terms of the variances at the corners of the rectangle, again by applying Theorem 1. Hence we have that for any point a in the rectangle, V[f(a)] ≤ V[f(p)] + V[f(q)] + V[f(r)] + V[f(s)].

[Figure: two panels over axes x1, x2, each marking the points p, r, q and q′]
Fig. 3.1 Illustration of the bounded regions implied by Theorem 1, showing the input domain of a 1HL mean-field Gaussian BNN, for the case $x \in \mathbb{R}^2$. Left: For any two points $p, q \in \mathbb{R}^2$ such that the line joining them crosses the origin, the output variance at any point r on the solid red portion of the line is upper bounded by V[f(p)] + V[f(q)], illustrating condition (i) of Theorem 1. Right: For any two points $p, q \in \mathbb{R}^2$ such that the line joining them is orthogonal to and intersects a plane $x_d = 0$, the output variance at any point r on the solid red portion of the line is upper bounded by V[f(p)] + V[f(q)], illustrating condition (ii) of Theorem 1. The bounded segments (in red) extend from $q = (q_1, q_2)$ to q′, where $q' = (-q_1, -q_2)$ (Left, condition (i)), or $q' = (q_1, -q_2)$ (Right, condition (ii)).
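The bound in Theorem 1 can be sanity-checked numerically in the simplest setting. The sketch below is not taken from the thesis code: it assumes a one-dimensional input, a hypothetical width of 20, arbitrary variational means and standard deviations, and a Monte Carlo estimate of the variance with common random numbers. With p = −1 and q = 1, every interior grid point r satisfies condition (i), so its estimated variance should not exceed the sum of the endpoint variances (up to Monte Carlo error).

```python
import numpy as np

def ffg_variance(xs, mean, std, n_samples=5000, seed=0):
    """Monte Carlo estimate of V[f(x)] on a 1D grid for a 1HL ReLU network
    whose parameters follow a fully factorised Gaussian.

    mean/std hold 'u' (input weights), 'c' (hidden biases), 'w' (output
    weights), each of shape (H,), and 'b' (scalar output bias).
    """
    rng = np.random.default_rng(seed)

    def draw(key):
        shape = (n_samples,) + np.shape(mean[key])
        return mean[key] + std[key] * rng.standard_normal(shape)

    u, c, w, b = draw("u"), draw("c"), draw("w"), draw("b")
    # Pre-activations with shape (n_samples, n_points, H).
    pre = u[:, None, :] * xs[None, :, None] + c[:, None, :]
    f = (w[:, None, :] * np.maximum(pre, 0.0)).sum(axis=-1) + b[:, None]
    return f.var(axis=0)

# Hypothetical variational parameters, chosen only for illustration.
H = 20
init_rng = np.random.default_rng(1)
mean = {"u": init_rng.standard_normal(H), "c": init_rng.standard_normal(H),
        "w": init_rng.standard_normal(H), "b": 0.0}
std = {"u": np.full(H, 0.3), "c": np.full(H, 0.3),
       "w": np.full(H, 0.3), "b": 0.3}

xs = np.linspace(-1.0, 1.0, 21)       # p = -1 and q = 1 are the endpoints
var_ffg = ffg_variance(xs, mean, std)
bound = var_ffg[0] + var_ffg[-1]      # V[f(p)] + V[f(q)]
interior_max = var_ffg[1:-1].max()    # largest variance strictly inside (p, q)
print(f"max interior variance {interior_max:.3f} <= bound {bound:.3f}")
```

A check like this only illustrates the theorem for one parameter setting; it says nothing about how tight the bound is in general.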
Similarly, for higher-dimensional input domains, the variance at any point inside an axis-aligned hyperrectangle containing the origin can be bounded by the sum of the variances on its vertices, and we can obtain tighter bounds on diagonals and faces of the hyperrectangle, by repeatedly applying Theorem 1. This again could be problematic if we required the BNN to express high epistemic uncertainty inside the hyperrectangle whilst having much lower epistemic uncertainty at its vertices/edges. However, we note that these bounds become exponentially weaker as the dimensionality of the bounded region increases, so the theorem is most informative when bounding the variance on lower dimensional regions such as lines, which we focus on for the remainder of this chapter. Theorem 1 applies to 1HL BNNs of any width using any approximating family method which uses QFFG, as listed in Section 2.3.2.

We also prove related results for MC dropout. Here the behaviour is different depending on whether dropout is applied to the inputs of the network:

Theorem 2 (MC dropout with inputs not dropped out). Consider the same network architecture as in Theorem 1. Assume an MC dropout distribution over the parameters, with inputs not dropped out, i.e., the first weight matrix is deterministic. Then V[f(x)] is convex in x.

Theorem 2 implies that the predictive variance on any line segment in input space is bounded by the maximum of the variance at its endpoints, a straightforward consequence of convexity. A weaker statement is true if we also apply dropout to the inputs:

Theorem 3 (MC dropout with inputs dropped out). Consider the same network architecture as in Theorem 1. Assume an MC dropout distribution over the parameters, with inputs dropped out, i.e., the first weight matrix has a dropout distribution. Then, for any finite set of points $S \subset \mathbb{R}^D$ such that 0 is in the convex hull of S,
$\mathbb{V}[f(0)] \le \max_{s \in S} \{\mathbb{V}[f(s)]\}.$
(3.1)

This is illustrated in Figure 3.2. Although weaker than Theorem 2, Theorem 3 still implies pathological behaviour whenever the origin should have higher epistemic uncertainty than points surrounding it.

Remark 2. As it is more common not to apply dropout to the inputs of a network (see Figure 4.5 in Gal (2016)), we will focus on that case in this chapter. Hence when we refer to MC dropout or QMCDO without any further qualification, we always mean dropout is applied to the hidden features but not to the input.

[Figure: schematic over axes x1, x2]
Fig. 3.2 Schematic illustration of the bound in Theorem 3, showing the input domain of a single-hidden layer MC dropout BNN, for the case $x \in \mathbb{R}^2$ with dropout applied to the inputs. The convex hull (in light blue) of the blue points contains the origin. Hence Theorem 3 implies the variance at the origin (red point) cannot exceed the variance at any of the blue points.

Remark 3. Although Theorems 1 to 3 are stated for networks with a single scalar output for brevity, for networks with multiple outputs, these theorems hold for each output separately. See Appendix A for more general statements of these results.

Full proofs of Theorems 1 to 3 are provided in Appendix A. Theorems 1 to 3 show that there are simple cases where 1HL approximate BNNs using QFFG and QMCDO cannot represent in-between uncertainty, i.e., increased uncertainty in between well separated regions of low uncertainty. As Theorems 1 to 3 depend only on the approximating family, this cannot be fixed by improving the optimiser, regulariser or prior.

3.3.1 Numerical verification of theorems

We next verify Theorems 1 and 2 numerically. Since we are concerned with whether there are any distributions that show in-between uncertainty, we do not maximise the ELBO in this experiment (we consider ELBO maximisation in Sections 3.3.4 and 3.4.2).
[Figure: two panels plotting V[f(x)] against x on [−1, 1]; left legend: Target, FFG, Bound; right legend: Target, MCDO]
Fig. 3.3 Results of directly minimising the squared error in function space between V[f(x)] (for a single-hidden layer NN) and a target variance function. Left: FFG distribution, Right: MCDO distribution. The bound implied by Theorem 1 for FFG distributions (red) applies on [−1, 1] with p = −1, q = 1. The MCDO variance function is convex, as implied by Theorem 2, and almost constant. The FFG and MCDO variance functions underestimate the target variance near the origin and overestimate it away from the origin due to the restrictiveness of the approximating family.

Instead, we train 1HL networks of width 50 with QFFG and QMCDO distributions to directly minimise the squared error between V[f(x)] and a pre-specified target variance function which displays in-between uncertainty. In detail, we generate a dataset consisting of two separated clusters of datapoints in one dimension. We then fit a Gaussian process to the dataset and compute the predictive mean and variance of the GP on a one-dimensional grid X consisting of 40 points. Let $\mu(X) \in \mathbb{R}^{40}$ denote the mean of the GP posterior predictive at these points and $\sigma^2(X) \in \mathbb{R}^{40}$ denote the variance. We define the loss function as

$$\mathcal{L}(\phi) = \lVert \mathbb{E}_{q_\phi}[f(X)] - \mu(X) \rVert_2^2 + \lVert \mathbb{V}_{q_\phi}[f(X)] - \sigma^2(X) \rVert_2^2. \tag{3.2}$$

This loss function encourages the predictive mean and variance of the BNN to directly match that of the GP, which displays in-between uncertainty. The expectation and variance of f(X) are Monte Carlo estimated using 128 samples. We use the ADAM optimiser and full-batch training with a learning rate of $1 \times 10^{-3}$ for 50,000 iterations. A dropout rate of 0.05 is used for MCDO. Weights and biases are initialised at the prior for MFVI. The results are shown in Figure 3.3.
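As a rough illustration of the objective in Equation (3.2), the sketch below evaluates the moment-matching loss from a matrix of sampled function values. It is a NumPy sketch under stated assumptions, not the thesis implementation: the name `moment_matching_loss` is hypothetical, the target mean and variance are stand-ins for the GP predictive, and actual training would need an automatic differentiation framework so that the loss can be minimised with respect to the variational parameters.

```python
import numpy as np

def moment_matching_loss(f_samples, target_mean, target_var):
    """Monte Carlo version of Eq. (3.2).

    f_samples: array of shape (S, N), S sampled functions evaluated on an
    N-point grid. target_mean and target_var: arrays of shape (N,).
    """
    mc_mean = f_samples.mean(axis=0)   # estimate of E_q[f(X)]
    mc_var = f_samples.var(axis=0)     # estimate of V_q[f(X)]
    return np.sum((mc_mean - target_mean) ** 2) + np.sum((mc_var - target_var) ** 2)

# Hypothetical stand-ins for the GP predictive on a 40-point grid: the
# variance peaks in the middle, mimicking in-between uncertainty.
grid = np.linspace(-1.0, 1.0, 40)
mu = np.sin(2.0 * grid)
s2 = 0.05 + np.exp(-grid ** 2 / 0.1)

# Two symmetric "samples" whose mean and variance match the target exactly.
matched = np.stack([mu + np.sqrt(s2), mu - np.sqrt(s2)])
loss_matched = moment_matching_loss(matched, mu, s2)

# Shifting every sample by 1 leaves the variance unchanged but moves the mean
# by 1 at each of the 40 grid points, so the loss becomes 40.
loss_shifted = moment_matching_loss(matched + 1.0, mu, s2)
```

In the experiment described above, `f_samples` would instead hold 128 network outputs drawn from the variational distribution at the 40 grid points.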
We see that even when trained to directly minimise this objective, 1HL BNNs cannot successfully mimic the GP’s in-between uncertainty, since that would violate Theorems 1 and 2. Although Theorems 1 and 2 apply only to 1HL BNNs, 1HL BNN regression tasks have been a common benchmark in the BNN literature (Gal and Ghahramani, 2016; Hernández-Lobato and Adams, 2015; Mukhoti et al., 2018; Sun et al., 2019; Tomczak et al., 2018), and have been used to assess different inference methods.

3.3.2 In-between uncertainty in other regions of input space

Although Theorem 2 implies a bound on any line in input space, Theorem 1 only bounds lines in input space meeting specific criteria. For BNNs with higher input dimensionality, these criteria are less likely to be satisfied by general lines in input space. Hence it is unclear whether the lack of in-between uncertainty occurs only on these special lines, or is a more general feature of the approximate posterior predictive. To answer this, we next show empirically that for a BNN with a 5-dimensional input space, lines joining random points in input space also tend to suffer from a lack of in-between uncertainty.

We generate two Gaussian clusters of input locations, with the centres of the clusters randomly chosen to lie on a sphere of radius $\sqrt{5}$ centred at the origin. We generate the output values corresponding to each input location by sampling from the wide-limit BNN GP. We then train MFVI and MCDO BNNs on the data, and compare the predictive distribution to that of the wide-limit GP. We choose $\sigma_w = \sqrt{2}$, $\sigma_b = 1$, networks of width 50 and a dropout probability of p = 0.05 for MCDO. We set the observation noise standard deviation to 0.01, which is the ground truth value used to generate the synthetic data. This is repeated for three random samplings of the dataset.
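The input locations for this experiment could be generated along the following lines. This is only a sketch: the cluster size, the cluster standard deviation, and the choice of placing λ = 0 at the midpoint between the two centres are assumptions made for illustration, as the text above does not specify them, and the outputs (which are sampled from the wide-limit GP) are not generated here.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
radius = np.sqrt(5.0)

def sample_on_sphere(rng, dim, radius):
    """Draw a point uniformly on the sphere of the given radius."""
    v = rng.standard_normal(dim)
    return radius * v / np.linalg.norm(v)

# Cluster centres on the sphere of radius sqrt(5), as described in the text.
c1 = sample_on_sphere(rng, D, radius)
c2 = sample_on_sphere(rng, D, radius)

# Hypothetical cluster size and spread (not specified in the text).
n_per_cluster, cluster_std = 50, 0.1
X = np.concatenate([c + cluster_std * rng.standard_normal((n_per_cluster, D))
                    for c in (c1, c2)])

# Line through the two centres, parameterised by lambda. Placing lambda = 0
# at the midpoint is an assumption.
mid = 0.5 * (c1 + c2)
half_dist = 0.5 * np.linalg.norm(c2 - c1)
direction = (c2 - c1) / (2.0 * half_dist)

def x_of_lambda(lam):
    """Point in input space at coordinate lambda along the line."""
    return mid + lam * direction

# Coordinates of the data projected onto the line, for plotting.
lam_data = (X - mid) @ direction
```

Predictive means and variances would then be evaluated at `x_of_lambda(lam)` for a grid of λ values and plotted together with `lam_data`.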
We then visualise the predictive uncertainty along the line segments in input space joining the centres of the two datapoint clusters. In Figure 3.4 we see that although exact inference with the wide-limit GP exhibits in-between uncertainty, this is lost by both MFVI and MCDO. For MCDO, this is expected as Theorem 2 implies that MCDO’s predictive variance will be convex along any line, including the lines plotted. In contrast, Theorem 1 only applies to certain lines in input space, and does not bound the variance on general lines in input space like the lines in Figure 3.4. However, we still see that MFVI and MCDO are often more confident in between the data clusters than at the data clusters, which intuitively is a poor reflection of epistemic uncertainty. Hence Figure 3.4 lends support to the idea that the pathology in Theorem 1 is symptomatic of a lack of in-between uncertainty on more general lines in input space than the conditions of the theorem statement imply.

[Figure: 3 × 3 grid of panels; columns GP, MFVI, MCDO; each panel plots f(x(λ)) against λ]
Fig. 3.4 Mean and 2 standard deviation bars of the predictive distribution on lines joining random clusters of data, for single-hidden layer BNNs. Each row represents the same randomly generated dataset. We also plot the projection of the 5-dimensional data onto this line segment, where the coordinate along the line segment is denoted λ. Note that the data appears very noisy in some of the plots, but this appearance is due to the projection onto a lower-dimensional space.

3.3.3 Intuition for results

We now provide intuition for the proofs of Theorems 1 to 3. Let $\theta_{\mathrm{in}}$ be the parameters in the first layer. By the law of total variance, $\mathbb{V}[f(x)] = \mathbb{E}[\mathbb{V}[f(x) \mid \theta_{\mathrm{in}}]] + \mathbb{V}[\mathbb{E}[f(x) \mid \theta_{\mathrm{in}}]]$. For QMCDO the second term is 0 as $\theta_{\mathrm{in}}$ is deterministic, since the input weights are not dropped out. Hence to prove Theorem 2 (MCDO without dropping out input weights), it suffices to show the first term is convex. We have:

\begin{align}
\mathbb{V}[f(x) \mid \theta_{\mathrm{in}}] &= \mathbb{V}\left[ \sum_{i=1}^{I} w_i \psi(a_i(x; \theta_{\mathrm{in}})) + b \,\middle|\, \theta_{\mathrm{in}} \right] \tag{3.3} \\
&= \sum_{i=1}^{I} \mathbb{V}[w_i] \, \psi(a_i(x; \theta_{\mathrm{in}}))^2 + \mathbb{V}[b], \tag{3.4}
\end{align}

where $\{w_i\}_{i=1}^{I}$ and $b$ are the output weights and bias, $\psi(a) = \max(0, a)$, and $a_i(x; \theta_{\mathrm{in}})$ is the activation of the $i$th neuron. Since $a_i(x; \theta_{\mathrm{in}})$ is affine in x, $\psi(a_i(x; \theta_{\mathrm{in}}))^2$ is a ‘half quadratic’ in x and therefore convex. This proves Theorem 2.

In order to prove Theorem 3 (MCDO with dropping out input weights), we now need to consider the effect of randomness in the input weights. However, we know that when x = 0, dropping out the input features has no effect, since the input will take the value 0 regardless of whether it is dropped out. Hence $\mathbb{V}[\mathbb{E}[f(0) \mid \theta_{\mathrm{in}}]] = 0$. Theorem 3 follows easily by combining this with the fact that $\mathbb{E}[\mathbb{V}[f(x) \mid \theta_{\mathrm{in}}]]$ is convex.

To arrive at Equation (3.4), we used the fact that for QMCDO, the output weights of each neuron are independent. Since this is also the case for QFFG, Equation (3.4) also applies to QFFG. If correlations between the weights were allowed in the posterior (such as with QFCG), this could introduce negative covariance terms, leading to non-convex behaviour. However, in a factorised posterior, this is not possible. Thus we see here a concrete instance of how weight space factorisation assumptions can lead to function space restrictions on the predictive uncertainty. To complete the proof of Theorem 1 for QFFG, we need to analyse $\mathbb{V}[\mathbb{E}[f(x) \mid \theta_{\mathrm{in}}]]$ when $\theta_{\mathrm{in}}$ follows a mean-field Gaussian distribution.
Because of the factorisation assumptions on the weights in the first layer, this term is a positive linear combination of the variances of each activation function. While these variance functions are not convex, they satisfy certain restrictive conditions that imply bounds on arbitrary positive linear combinations. Roughly speaking, they resemble quadratic functions that are truncated to zero at the point where the ReLU is saturated. One such typical variance function is shown in Figure 3.5. We provide a rigorous characterisation of the limitations in expressiveness of positive linear combinations of such functions, along with full proofs of Theorems 1 to 3, in Appendix A.

[Figure: plot of Var[ReLU(Wx + b)] against x]
Fig. 3.5 The contribution to $\mathbb{V}[\mathbb{E}[f(x) \mid \theta_{\mathrm{in}}]]$ made by a single neuron, for some choice of the Gaussian variational distribution over $\theta_{\mathrm{in}}$. Here the weight W and bias b are part of the input layer, i.e., $W, b \in \theta_{\mathrm{in}}$. The full value of $\mathbb{V}[\mathbb{E}[f(x) \mid \theta_{\mathrm{in}}]]$ is given by a positive linear combination of such terms, one for each neuron, due to the factorisation assumptions. Although this function is not convex, it resembles a quadratic function that has been truncated to zero. In Appendix A we show that arbitrary positive linear combinations of these functions necessarily suffer a lack of in-between uncertainty.

3.3.4 Empirical tests of approximate inference in single-hidden layer BNNs

It is not immediately apparent that Theorems 1 and 2 are problematic from the perspective of Bayesian inference. For example, even exact inference in a Bayesian linear regression model results in a convex predictive variance function. In that case, the lack of in-between uncertainty is not due to poor inference, but is instead due to the linear modelling assumption. If this assumption genuinely reflects our prior beliefs about the regression task, then the lack of in-between uncertainty is not problematic.
Here we provide strong evidence that, in contrast, the modelling assumptions of 1HL BNNs lead to exact posteriors that do show in-between uncertainty. Theorems 1 to 3 thus imply that it is approximate inference with QFFG or QMCDO that fails to reflect this intuitively desirable property of the exact predictive, violating Criterion 1.

Figure 3.6 compares the predictive distributions obtained from MFVI and MCDO (here we optimise the ELBO for MFVI and the standard MCDO objective, in contrast with Figure 3.3; see Appendix B for experimental details) with HMC and the limiting GP on a regression dataset consisting of two clusters of covariates. We use 1HL BNNs with 50 hidden units and ReLU activations. The HMC and limiting GP posteriors are almost indistinguishable, suggesting they both resemble the exact predictive. For these methods V[f(x)] is markedly larger near the origin than near the data. In contrast, MFVI and MCDO are as confident in between the data as they are near the data. This provides strong evidence that the lack of in-between uncertainty is not a feature of the BNN model or prior, but is caused by approximate inference.

[Figure: four panels, (a) Infinite-width GP, (b) HMC, (c) MFVI, (d) MCDO; each shows a colour plot of σ[f(x)] over (x1, x2) and, beneath it, f(x(λ)) against λ]
Fig. 3.6 Regression on a 2D synthetic dataset (red crosses). The colour plots show the standard deviation of the output, σ[f(x)], in 2D input space, in the square of side length 4 centred at the origin. The plots beneath show the mean with 2-standard deviation bars along the dashed white line (parameterised by λ, where λ = 0 at the origin and takes the values $-2\sqrt{2}$, $2\sqrt{2}$ at the corners of the square). MFVI and MCDO are overconfident for λ ∈ [−1, 1]. Theorems 1 and 2 explain this: given the predictive variance is near zero at the data clusters, there is no setting of the variational parameters that induces a predictive variance much greater than zero in the line segment between them.

3.4 Deeper networks

Theorems 1 to 3 pose an important question: is the structural limitation observed in the 1HL case fundamental to QFFG and QMCDO even in deeper networks, or can depth help these approximations satisfy Criterion 1? In Theorem 4, we provide universality results for the mean and variance functions of approximate BNNs with at least two hidden layers using QFFG and QMCDO (with the inputs not dropped out). As the predictive mean and variance often determine the performance of BNNs in regression applications, this provides theoretical evidence that, for many applications, approximate inference in deep BNNs satisfies Criterion 1:

Theorem 4 (Universality of deeper networks). Let m be any continuous function on a compact set $A \subset \mathbb{R}^D$, and let v be any continuous, non-negative function on A. For any $\epsilon > 0$, for both QFFG and QMCDO there exists a 2HL ReLU BNN such that $\sup_{x \in A} |\mathbb{E}[f(x)] - m(x)| < \epsilon$ and $\sup_{x \in A} |\mathbb{V}[f(x)] - v(x)| < \epsilon$ simultaneously.

Remark 4. If MC dropout is used with the inputs also dropped out, the analogous statement to Theorem 4 is false.
In Appendix C.2 we provide a counterexample that holds for arbitrarily deep networks and shows that if inputs are dropped out, V[f] cannot be made small at two points $x_1, x_2$ which have significantly different values of $\mathbb{E}[f(x_1)]$ and $\mathbb{E}[f(x_2)]$.

Figure 3.7 shows the result of directly minimising the squared error between the network output mean and variance and a given target mean and variance function, using the same method as with the 1HL network in Figure 3.3, but this time with two hidden layers. In contrast to Figure 3.3, the variances of both QFFG and QMCDO are able to fit the target very closely.

[Figure: two panels on x ∈ [−1, 1]; left plots E[f(x)], right plots V[f(x)]; legend: Target, FFG, MCDO]
Fig. 3.7 Results of minimising the squared error in function space between E[f(x)] and a target mean function (left), and between V[f(x)] and a target variance function (right), for a 2-hidden layer BNN with FFG and MCDO distributions. All three lines overlap, indicating a very close fit to the target.

While Theorem 4 gives some cause for optimism for approximating family methods with deep BNNs, it shows only that the mean and variance of pointwise marginal distributions of the output are universal (i.e., it does not tell us about higher moments of the predictive or covariances between different outputs). Additionally, and crucially, it does not say whether good distributions will actually be found by an optimiser when maximising an objective such as the ELBO, i.e., it does not address Criterion 2. Addressing Criterion 2 theoretically is more challenging since we must make a statement not only about the variational family, but about the optimum of the ELBO within the variational family. Such an analysis has indeed been conducted in recent work (Coker et al., 2022), which we will discuss in Section 3.6.2.
3.4.1 Proof sketch of Theorem 4

To prove Theorem 4 for QFFG, we provide a construction that relies on the universal approximation theorem for deterministic NNs (Leshno et al., 1993). We illustrate this construction schematically in Figure 3.8. Consider a 2HL NN whose second hidden layer has two neurons, with activations $a_1, a_2$. Let $w_1, w_2$ denote the weights connecting $a_1, a_2$ to the output, and b denote the output bias, such that the output $f(x) = w_1 \psi(a_1) + w_2 \psi(a_2) + b$. In this construction, $a_1$ will be used to control the mean, and $a_2$ the variance, of the BNN output.

By setting the variances of the parameters in the first two linear layers to be sufficiently small, we can consider $a_1$ and $a_2$ to be essentially deterministic functions of x. By the universal approximation theorem, $a_1$ and $a_2$ can approximate any continuous functions. Recall that we would like the mean function of the BNN to approximate m(x) and the variance function to approximate v(x). Choose $a_1 \approx m(x) - \min_{x' \in A} m(x')$ and $a_2 \approx \sqrt{v(x)}$.¹ Choose $\mathbb{E}[b] = \min_{x' \in A} m(x')$, $\mathbb{V}[b] \approx 0$; $\mathbb{E}[w_1] = 1$, $\mathbb{V}[w_1] \approx 0$; and $\mathbb{E}[w_2] = 0$, $\mathbb{V}[w_2] = 1$.

¹ Recall that here A denotes the input domain of the neural network, see Theorem 4.

By linearity of expectation, the factorisation assumptions, and $a_1, a_2 \ge 0$:

\begin{align*}
\mathbb{E}[f(x)] &= \mathbb{E}[w_1 \psi(a_1) + w_2 \psi(a_2) + b] \\
&= \mathbb{E}[w_1]\mathbb{E}[\psi(a_1)] + \mathbb{E}[w_2]\mathbb{E}[\psi(a_2)] + \mathbb{E}[b] \\
&\approx m(x) - \min_{x' \in A} m(x') + \min_{x' \in A} m(x') = m(x),
\end{align*}

as desired. By the law of total variance, the variance of the network output is

\begin{align*}
\mathbb{V}[f(x)] &= \mathbb{E}[\mathbb{V}[f(x) \mid a_1, a_2]] + \mathbb{V}[\mathbb{E}[f(x) \mid a_1, a_2]] \\
&\approx \mathbb{E}[\mathbb{V}[f(x) \mid a_1, a_2]] \\
&\approx \mathbb{E}[\psi(a_2)^2] + \mathbb{V}[b] \approx v(x),
\end{align*}

where we used that $w_1, b$ are essentially deterministic and $\mathbb{V}[\mathbb{E}[f(x) \mid a_1, a_2]] \approx 0$ since $a_1, a_2$ are essentially deterministic. Also, we have that $\psi(a_2) \approx a_2$ since $a_2 \approx \sqrt{v(x)} \ge 0$. The approximations come from the standard universal function approximation theorem, and the variances of weights not being set exactly to 0 so that we remain in QFFG.
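The output-layer part of this construction can be checked numerically with a small sketch. The sketch makes simplifying assumptions that go beyond the text: the activations a1 and a2 are taken to equal their target functions exactly (rather than being approximated by the first two layers), the target functions m and v are arbitrary illustrative choices, and the "essentially deterministic" parameters are given a small standard deviation of 1e-3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target mean and (non-negative) variance on a small grid.
xs = np.linspace(-1.0, 1.0, 9)
m = np.sin(3.0 * xs)
v = 0.1 + xs ** 2

def relu(a):
    return np.maximum(a, 0.0)

# Idealised second-layer activations, assumed exact here.
beta = m.min()          # min over the domain of m
a1 = m - beta           # non-negative, so relu(a1) = a1
a2 = np.sqrt(v)         # non-negative, so relu(a2) = a2

# Factorised Gaussian over the output-layer parameters, as in the sketch:
# E[w1] = 1, V[w1] ~ 0; E[w2] = 0, V[w2] = 1; E[b] = beta, V[b] ~ 0.
n_samples, tiny = 200_000, 1e-3
w1 = 1.0 + tiny * rng.standard_normal((n_samples, 1))
w2 = rng.standard_normal((n_samples, 1))
b = beta + tiny * rng.standard_normal((n_samples, 1))

f = w1 * relu(a1) + w2 * relu(a2) + b       # shape (n_samples, 9)
mean_err = np.abs(f.mean(axis=0) - m).max()
var_err = np.abs(f.var(axis=0) - v).max()
print(f"max |mean - m| = {mean_err:.4f}, max |var - v| = {var_err:.4f}")
```

Both errors should be small, limited only by Monte Carlo noise and the small non-zero variances, which is the content of the approximate equalities above.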
A mathematically rigorous proof, along with a proof for QMCDO with any dropout rate p ∈ (0, 1), is given in Appendix C.1. The main technical challenge in the proof for QFFG is to establish the validity of the argument presented above when the required weights are not deterministic (since strictly speaking deterministic weight distributions would not lie in QFFG), but are instead Gaussian-distributed with very small variances. The proof for QMCDO uses a somewhat similar construction, but is more involved as we cannot set individual weights to be essentially deterministic, due to the nature of the dropout distribution.

Fig. 3.8 Schematic illustration of the construction used to prove that there exist 2-hidden layer BNNs using QFFG which are able to approximate any predictive mean function m and variance function v. Here the blue weights are used to approximate the mean function and the red weights are used to approximate the variance function, and β is a shorthand for $\min_{x' \in A} m(x')$. Only six neurons are shown in the first hidden layer for illustrative purposes, but the universal approximation theorem may require many more depending on the desired approximation error ϵ.

3.4.2 Empirical tests of approximate inference in deep BNNs

We now consider empirically whether the distributions found by optimising the ELBO with these families resemble the exact predictive distribution (Criterion 2). To do this, we consider the dataset from Figure 3.6 and define the ‘overconfidence ratio’ at an input x as $\gamma(x) = (\mathbb{V}_{\mathrm{GP}}[f(x)] / \mathbb{V}_{q_\phi}[f(x)])^{1/2}$, where $\mathbb{V}_{\mathrm{GP}}$ is the predictive variance of exact inference in the infinite-width BNN, and $\mathbb{V}_{q_\phi}$ is the predictive variance of the approximate posterior. We then compute γ(x) at 300 points $\{x_n\}_{n=1}^{300}$ evenly spaced along the dashed white line joining the data clusters in Figure 3.6, i.e., from x = (−1.2, −1.2) to x = (1.2, 1.2).
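The overconfidence ratio is simple to compute once the two predictive variances are available on the evaluation points. In the sketch below, the function name `overconfidence_ratio` is hypothetical, the inclusion of both endpoints in the 300-point grid is an assumption, and the two variance arrays are made-up placeholders standing in for the GP and approximate BNN predictive variances.

```python
import numpy as np

def overconfidence_ratio(var_gp, var_q):
    """gamma(x) = sqrt(V_GP[f(x)] / V_q[f(x)]), computed elementwise."""
    return np.sqrt(np.asarray(var_gp) / np.asarray(var_q))

# 300 evenly spaced points from (-1.2, -1.2) to (1.2, 1.2).
t = np.linspace(0.0, 1.0, 300)[:, None]
start, end = np.array([-1.2, -1.2]), np.array([1.2, 1.2])
points = (1.0 - t) * start + t * end          # shape (300, 2)

# Placeholder variances: the reference is larger near the origin, while the
# approximation is uniformly small. Real values would come from the GP and
# from the trained BNN.
var_gp = 0.01 + np.exp(-np.sum(points ** 2, axis=1))
var_q = np.full(300, 0.01)

gamma = overconfidence_ratio(var_gp, var_q)   # > 1 means overconfident
```

Because γ is a ratio of standard deviations, plotting it on a logarithmic axis treats over- and underconfidence symmetrically.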
We then create boxplots of the values $\{\gamma(x_n)\}_{n=1}^{300}$ for varying BNN depths. If the BNNs are wide enough, accurate inference should lead to similar uncertainty estimates to the limiting GP, i.e., the boxplot should be tightly centred around 1 (dashed line). If instead γ ≫ 1, this means the approximate BNN is much more confident than the exact infinite-width GP reference, which suggests that it is more confident than exact inference in the finite BNN as well. The opposite is true if γ ≪ 1.

We consider ReLU BNNs with 1 to 10 hidden layers, and with 50 hidden units in each layer. We set the prior mean for all parameters to 0. The prior standard deviation for the bias parameters is chosen as $\sigma_b = 1$. Let $\sigma_w/\sqrt{H}$ be the prior standard deviation of each weight, where H is the number of inputs to the weight matrix. We consider two schemes for choosing $\sigma_w$:

1. In Figure 3.9 we choose $\sigma_w$ = 4, 3, 2.25, 2, 2, 1.9, 1.75, 1.75, 1.7, 1.65 for depths 1 to 10 respectively. These values were chosen to ensure the prior standard deviations (of both the infinite width GP and the finite width BNN) in function space at the points (1, 1) and (−1, −1) (the centres of the data clusters) were between 10 and 15, a value we judged to constitute a vague yet still reasonable prior in function space.
2. In Figure 3.10 we choose $\sigma_w = \sqrt{2}$ for all depths. The value of $\sqrt{2}$ is chosen as it has been shown to lead to BNN priors with roughly constant variance in function space as depth increases (Schoenholz et al., 2017).

Finally, all models use a fixed Gaussian likelihood with standard deviation 0.1, and the training procedure is the same as that detailed in Appendix B.

In Figure 3.9 we see that for the 1HL and 2HL BNNs, the GP and HMC agree closely, suggesting both resemble the exact predictive of the finite BNN.
In contrast, MFVI and MCDO are often an order of magnitude overconfident (γ(x) > 1) at some points (upper tail of the boxplot) and somewhat underconfident (γ(x) < 1) at other points (lower tail of the boxplot). Increased depth does not alleviate this behaviour. In Figure 3.10, we see that the agreement between HMC and the limiting GP is less close than it is in Figure 3.9. This could be due to HMC not mixing well for this prior, or the GP-BNN correspondence not being particularly good for networks of width 50 for this prior. However, there is still much closer agreement between HMC and the limiting GP than there is between MFVI or MCDO and the limiting GP.

[Figure: box plots of the overconfidence ratio (log scale) against the number of hidden layers, 1 to 10; legend: HMC, MFVI, MCDO]
Fig. 3.9 Box and whisker plots of the overconfidence ratios of HMC, MFVI and MCDO relative to exact inference in the corresponding infinite-width limit GP along the dashed white line on the dataset from Figure 3.6. The whiskers show the smallest and largest overconfidence ratios computed, and the box extends from the lower to upper quartile values of the overconfidence ratios, with a line at the median. HMC is only run for 1 and 2 hidden layers due to difficulty ensuring convergence in larger models. We see that MFVI and MCDO can be overconfident by up to an order of magnitude relative to the GP, for all depths.

[Figure: box plots of the overconfidence ratio (log scale) against the number of hidden layers, 1 to 10; legend: HMC, MFVI, MCDO]
Fig. 3.10 Same as in Figure 3.9, but now with $\sigma_w = \sqrt{2}$ for all depths.

We next investigate where the overconfidence and underconfidence of approximate inference relative to the limiting GP is occurring. In Figure 3.11, we find that overconfidence occurs in-between the data clusters, and underconfidence occurs at the data clusters.
Hence the uncertainty estimates of the approximate BNNs suffer from qualitatively similar issues to those seen in 1HL BNNs in Figure 3.3, even though here deeper BNNs are considered. In addition, similarly to Figure 3.4, in Figure 3.12 we plot the uncertainty on line segments in between random clusters of data in a 5-dimensional input space, but this time with deeper networks. We again see that compared to exact inference in the limiting GP, MFVI and MCDO both underestimate in-between uncertainty, or sometimes show as large uncertainty at the data as in between the data. Figure 3.12 hence shows that the lack of adequate in-between uncertainty is not specific to 1HL BNNs or to the dataset from Figure 3.6.

3.4.3 Initialising a BNN with in-between uncertainty

In light of the theoretical flexibility of the variational families QFFG and QMCDO in the deep case as shown in Theorem 4 and Figure 3.7, it is perhaps surprising that VI fails to capture important properties of the posterior predictive even with deep networks. In order to assess whether the variational objective (the ELBO) or optimisation failure is primarily responsible for the lack of in-between uncertainty when performing MFVI and MCDO, we investigate the effect of initialisation on the quality of the posterior obtained after variational inference. The idea is to find an initialisation of the variational parameters such that the approximate posterior predictive closely matches the infinite width GP (and hence shows good in-between uncertainty). If ELBO optimisation starting from this initialisation subsequently loses in-between uncertainty, this provides evidence that the ELBO objective for BNNs is to blame for the lack of in-between uncertainty in the deep case. In order to find this initialisation, we train the network by minimising the mean squared error between the mean and variance functions of the GP reference posterior and the approximate posterior (as in Equation (3.2) and Figure 3.7).
The reference posterior was obtained by fitting the limiting GP on the dataset (shown in crosses in Figure 3.13). The noise variance was fixed to the true noise variance that generated the data, and the data itself was sampled from the limiting GP prior, so that the model should be able to fit the data well with minimal model mismatch. Two-hidden layer MFVI and MCDO networks were used, with 50 hidden units in both layers.

[Figure 3.11 panels: (a) Mean Field VI (b) MC Dropout (c) Mean Field VI (σw = √2 prior) (d) MC Dropout (σw = √2 prior)]
Fig. 3.11 Plots of the overconfidence ratio γ on the dataset from Figure 3.6 against λ (where λ is defined as in Figure 3.6) for several depths of neural networks with σw = 4, 2, 1.7 for 1, 5 and 9 hidden layers respectively (top), and σw = √2 for all depths (bottom). Projections of the input locations of the datapoints onto the diagonal slice between the clusters are shown as black crosses (✕). We see that both MCDO and MFVI are overconfident (γ > 1) in between data, and underconfident (γ < 1) at the locations where we have observed data, relative to the GP reference.

[Figure 3.12 plots. Legend: GP, MFVI, MCDO; x-axis: λ; y-axis: f(x(λ)).]
Fig. 3.12 Same experimental set-up as in Figure 3.4, but now with 3-hidden layer BNNs. These deeper BNNs still fail to show adequate in-between uncertainty, and are overconfident in between the data clusters and underconfident at the data clusters relative to the infinite-width GP reference.

Unfortunately, it may be the case that the initialisation found by minimising the mean squared loss for 50,000 iterations leads to variational distributions with a very high KL to the posterior. Hence once ELBO optimisation begins, the distribution may need to move very far from its initialisation, and hence may lose the in-between uncertainty that it started with. In other words, there may exist variational distributions that lead to good in-between uncertainty and also a good ELBO, but these might be very far from the distributions we find when only optimising for good in-between uncertainty.

[Figure 3.13 plots. x-axis: x; y-axis: y. Panels: (a) Mean-field VI (legend: GP, MFVI) (b) MC dropout (legend: GP, MCDO)]
Fig. 3.13 Mean and error bars (± 2 standard deviations) for the GP and the BNN with each inference scheme, trained on the data shown by the red crosses. The inference algorithms were initialised by first minimising the squared error to the reference GP mean and variance, and then running the respective inference algorithm. Even when starting from an initialisation that closely matches the GP and hence shows good in-between uncertainty, in-between uncertainty is subsequently lost when the variational objective is optimised.

To account for this, we gradually interpolate between the squared-error loss and the variational objective, by taking convex combinations of the losses. This procedure gives us a better chance of finding an initialisation that both has good in-between uncertainty and a low KL divergence to the posterior. In detail, call the function space squared loss L1 and the standard variational objective L2. Then after the first 50,000 iterations of training with L1, we train for 10,000 iterations using 0.9 L1 + 0.1 L2, 10,000 iterations using 0.8 L1 + 0.2 L2 and so on until we are only training using L2. We then train for 100,000 iterations using just L2, to ensure the variational objective has converged.
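The interpolation schedule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, the function-space loss is assumed to weight the mean and variance errors equally (the exact form is given by Equation (3.2), which is not reproduced here), and the schedule is read as nine mixed stages (weights 0.9 down to 0.1 on L1) followed by training on L2 alone.

```python
import numpy as np


def function_space_loss(mean_q, var_q, mean_gp, var_gp):
    """Assumed form of L1: squared error between the approximate and the GP
    reference mean and variance functions, evaluated on a grid of inputs."""
    mean_q, var_q, mean_gp, var_gp = map(np.asarray, (mean_q, var_q, mean_gp, var_gp))
    return float(np.mean((mean_q - mean_gp) ** 2) + np.mean((var_q - var_gp) ** 2))


def l1_weight(iteration, warmup=50_000, stage_len=10_000, n_stages=9):
    """Weight placed on L1 at a 0-indexed iteration; L2 receives 1 - weight."""
    if iteration < warmup:
        return 1.0  # first 50,000 iterations: function-space loss only
    stage = (iteration - warmup) // stage_len  # 0 for the first mixed stage
    if stage >= n_stages:
        return 0.0  # final phase: variational objective only
    return round(1.0 - 0.1 * (stage + 1), 1)  # 0.9, 0.8, ..., 0.1


def combined_loss(iteration, l1_value, l2_value):
    """Convex combination of the two losses used at a given iteration."""
    w = l1_weight(iteration)
    return w * l1_value + (1.0 - w) * l2_value
```

Under this reading, the full run takes 50,000 + 9 × 10,000 + 100,000 = 240,000 iterations, with the L2-only phase starting at iteration 140,000.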
The results are shown in Figure 3.13. We see that even when using this initialisation which explicitly takes into account in-between uncertainty, the obtained posterior still lacks in-between uncertainty. This provides some evidence that this pathology may be due to the nature of the variational objective function itself, rather than the difficulty of optimising the ELBO. However, this does not constitute a definitive proof, since there may still be variational parameters that show good in-between uncertainty and are also an optimum of the ELBO, but these may be extremely difficult to find, even when using this specially designed initialisation. We leave a further investigation of this, and more broadly, of Criterion 2, to future work.

3.5 Case study: active learning with BNNs

We now consider the impact of the pathologies described in Sections 3.3 and 3.4 on active learning (Settles, 2009) on a real-world dataset, where the task is to use uncertainty information to intelligently select which points to label. Active learning with approximate BNNs has been considered in previous works, often showing improvements over random selection of datapoints (Gal et al., 2017b; Hernández-Lobato and Adams, 2015). However, in cases when active learning with BNNs fails, common metrics such as RMSE are insufficient to diagnose the causes. In particular, it is difficult to attribute the failure to the model or to poor approximate inference. In this section, we specifically analyse a dataset where we have observed active learning with approximate BNNs to fail: the Naval regression dataset (Coraddu et al., 2014), which has 1-dimensional output variables y, 14-dimensional input variables x, and consists of 11,934 datapoints. We find via PCA that this dataset has most of its variance along a single direction.
It hence may be especially problematic for methods that struggle with in-between uncertainty, as points are more likely to lie roughly in between others. This makes it a highly suitable dataset to test an approximate inference method’s ability to represent in-between uncertainty. The main questions we seek to address are:

1. Does a lack of in-between uncertainty lead to pathological behaviour on a real dataset in the 1HL case? We have already demonstrated empirically in Section 3.3.4 that 1HL BNNs struggle with in-between uncertainty for 2- and 5-dimensional datasets. However, in higher dimensional datasets such as Naval, it is not immediately apparent that Theorems 1 and 2 are problematic, since the convex hull of the datapoints may have relatively low volume in high dimensions. Unlike the synthetic experiments in the previous sections, the input locations in Naval are not specifically designed to make Theorems 1 and 2 relevant. In most cases in this experiment, there will be few datapoints that are exactly in between others. Hence these theorems may no longer be relevant to this real-world example. However, it may be the case that approximate inference will struggle with datapoints that are in some sense approximately in between each other.

2. Will deeper BNNs be able to express appropriate in-between uncertainty? Given the theoretical expressiveness of the approximating families proven in Theorem 4, it is possible that increased depth will alleviate any pathologies experienced with shallower models.

3. What is the effect of a lack of in-between uncertainty on downstream tasks? So far, we have only looked at in-between uncertainty as an end in itself. However, it is much more practically relevant to see what effect a lack of in-between uncertainty has on a downstream application such as active learning.

3.5.1 Experimental set-up and results

We compare MFVI, MCDO and the limiting GP on the active learning task. We do not run HMC as it would take too long to wait for convergence at each iteration of active learning. We normalise the dataset to have zero mean and unit standard deviation in each dimension. The experiment begins with an initial active set, which is a collection of labelled datapoints. The remainder of the datapoints in the dataset constitute the pool set, which is unlabelled: only the input location x is known, not the output y. In each iteration of active learning, the model chooses a datapoint from the pool set to label, after which it becomes a member of the active set. Then at the end of each iteration, the model is retrained on the active set. Five datapoints are chosen randomly as an initial active set, with the rest being the pool set. Following Hernández-Lobato and Adams (2015), the models choose the datapoint in the pool set which is assigned the highest predictive variance by the model to add to the active set. The goal is to obtain the best predictive performance on the remaining members of the pool set after a fixed number of iterations of active learning.

We train MFVI and MCDO with full batch training for 20,000 iterations of ADAM at each step of active learning. All BNNs are retrained from scratch after the acquisition of each point from the pool set. This process is repeated 50 times. As the dataset has low noise, we use a homoskedastic Gaussian noise model with a fixed standard deviation of 0.01 for all models. We used a learning rate of 1 × 10^-3 and 32 Monte Carlo samples from qϕ to estimate the objective function for both MFVI and MCDO. All networks had 50 neurons in each hidden layer. The prior for all BNNs and the GP was chosen to have σw = √2, σb = 1. The value σw = √2 was chosen so that the prior in function space has a stable variance as depth increases (Schoenholz et al., 2017).
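The acquisition loop described above can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: to stay self-contained it substitutes a small exact GP with an RBF kernel for the BNNs and the limiting GP (an assumption made purely for brevity), and the names rbf_kernel, gp_predict and active_learning are hypothetical. The noise standard deviation of 0.01, the five initial points and the maximum-variance selection rule follow the text.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Unit-variance RBF kernel (a stand-in for the limiting NN kernel)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_predict(x_train, y_train, x_test, noise_std=0.01):
    """Exact GP posterior mean and latent variance at x_test."""
    k_tt = rbf_kernel(x_train, x_train) + noise_std**2 * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    # Solve for K^{-1} y and K^{-1} k(train, test) in a single call.
    sol = np.linalg.solve(k_tt, np.column_stack([y_train, k_st.T]))
    mean = k_st @ sol[:, 0]
    var = 1.0 - np.sum(k_st * sol[:, 1:].T, axis=1)
    return mean, np.maximum(var, 0.0)


def active_learning(x, y, n_initial=5, n_steps=50, seed=0):
    """Greedy maximum-variance acquisition from a pool of unlabelled inputs."""
    rng = np.random.default_rng(seed)
    active = [int(i) for i in rng.choice(len(x), size=n_initial, replace=False)]
    chosen = set(active)
    pool = [i for i in range(len(x)) if i not in chosen]
    for _ in range(n_steps):
        # "Retrain" on the active set, then label the most uncertain pool point.
        _, var = gp_predict(x[active], y[active], x[pool])
        active.append(pool.pop(int(np.argmax(var))))
    # Evaluate on the points that were never labelled.
    mean, _ = gp_predict(x[active], y[active], x[pool])
    rmse = float(np.sqrt(np.mean((mean - y[pool]) ** 2)))
    return active, rmse
```

A random-selection baseline is obtained by replacing the argmax with a uniformly random index into the pool. For an approximate BNN, gp_predict would be replaced by a routine that retrains the network from scratch on the active set and estimates the predictive variance from samples of the approximate posterior.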
The dropout probability was set at p = 0.05 for all MCDO networks. The dropout ℓ2 regularisation was chosen to match the ‘KL condition’ as stated in Gal (2016, Section 3.2.3). The results were averaged over 20 random initialisations/random selections of the 5 initial points in the active set. For MFVI and MCDO, the predictive distribution at test time and the predictive variances used for active learning were estimated using 500 samples from the approximate posterior. The parameter initialisations are the same as those in Appendix B.

Table 3.1 Test RMSEs (± 1 standard error) after the 50th iteration of active learning, averaged over 20 random seeds. As the data is normalised to have zero mean and unit standard deviation, a method that predicts the value 0 on all datapoints will have an RMSE near 1.

               1 HL          2 HL          3 HL          4 HL
GP Active      0.04 ± 0.00   0.04 ± 0.00   0.04 ± 0.00   0.05 ± 0.00
GP Random      0.12 ± 0.01   0.13 ± 0.01   0.15 ± 0.01   0.16 ± 0.01
MFVI Active    0.94 ± 0.11   0.46 ± 0.04   0.35 ± 0.03   0.31 ± 0.02
MFVI Random    0.15 ± 0.01   0.23 ± 0.01   0.28 ± 0.01   0.32 ± 0.01
MCDO Active    0.69 ± 0.04   0.36 ± 0.02   0.38 ± 0.02   0.45 ± 0.02
MCDO Random    0.22 ± 0.01   0.35 ± 0.01   0.43 ± 0.01   0.47 ± 0.02

Table 3.1 shows the RMSE of each model on a held-out test set after this process, compared to a baseline where points are chosen randomly. Active learning significantly reduces the RMSE for the GP compared to random selection, often by more than a factor of three. However it increases the RMSE for 1HL MFVI and MCDO, and either increases it or does not significantly decrease it for deeper networks. The one exception is 3HL MCDO, where active performs about 10% better than random, which is still far less than the factor of three improvement obtained by exact inference in the infinite-width BNN. Note that, perhaps counterintuitively, the performance of all models degrades with increasing depth when choosing datapoints randomly.
This could be due to the small dataset size and possible simplicity of the regression problem, where a shallow network may have more suitable inductive biases than a deeper one. However, our goal in this experiment is not to find the best architecture/prior for the Naval dataset, but rather to assess the impact of approximate inference on active learning. The infinite width GP with active learning has almost the same performance for all depths, which is consistently much better than random selection of points. This is not the case for the approximate BNNs, which provides strong evidence that exact inference in the BNN model leads to good active learning performance, which is lost by approximate inference.

3.5.2 Discussion

In Figure 3.14 we visualise the dataset and the points chosen by 1HL BNNs using t-SNE (van der Maaten and Hinton, 2008). The covariates of Naval are clustered, with points in the same cluster roughly the same distance from the origin. Since the dataset is mean-centred, clusters of points closer to the origin are in a sense ‘in between’ or ‘surrounded by’ clusters of points that are further away from the origin. We see that although the 1HL GP chooses points from every cluster during active learning, 1HL MFVI fails to select any points from many of the clusters, including all the clusters closest to the origin. It ignores points in the ‘inside’ of the input space and oversamples points on the ‘outside’, leading to a selection strategy which is worse than random. This behaviour, although not directly implied by Theorem 1 (because the clusters may not lie on straight lines joining other clusters), is nonetheless consistent with it. Both the behaviour implied in Theorem 1 and the behaviour here can be seen as aspects of a general difficulty of the models in expressing in-between uncertainty.
Here it manifests in the fact that the uncertainty seems to be underestimated on clusters of points within a sphere bounded by the outermost data clusters.

We next consider deeper BNNs. Figure 3.15 shows the points chosen by 3HL BNNs. Again the GP chooses points from every cluster, and seems to focus on the ‘corners’ of each cluster. This is intuitively a good strategy, since, assuming the output value varies approximately linearly across each cluster, knowing the values at the corners of each cluster allows for the best estimate of the slope of the linear region. MFVI samples from more clusters than in the 1HL case, but still comparatively oversamples clusters further from the origin, and undersamples those near the origin. MCDO has a more spread out choice of points than the 1HL case, but still fails to obtain a significantly better RMSE than random. We see that compared to the GP, it still misses out on some clusters and does not follow the strategy of sampling the corners of clusters.

[Figure 3.14 panels: (a) Wide-limit GP (b) MFVI (c) MCDO (d) Random]
Fig. 3.14 Points chosen during active learning in the 1HL case. Colours denote distance from the origin in 14-dimensional input space, i.e., ∥x∥2. Grey crosses (✕) denote the five points randomly chosen as an initial training set. Red crosses (✕) denote the 50 points selected by active learning. Both MFVI and MCDO entirely miss some clusters which are nearer the origin, and oversample certain clusters which are far from the origin, as might be expected of methods that struggle to represent in-between uncertainty. In contrast, the limiting GP samples the ‘corners’ of each cluster, without missing any entirely. Note that t-SNE does not preserve relative positions, so that clusters near the origin may appear on the ‘outside’ of the t-SNE plot.

[Figure 3.15 panels: (a) Limiting GP (b) MFVI (c) MCDO (d) Random]
Fig. 3.15 Points chosen during active learning in the 3HL case. Colours denote distance from the origin in 14-dimensional input space, i.e., ∥x∥2. Grey crosses (✕) denote the five points randomly chosen as an initial training set. Red crosses (✕) denote the 50 points selected by active learning. Again, the GP samples the corners of each cluster, and MFVI oversamples clusters far from the origin. Note that the random selection of points shown here is the same as that shown in Figure 3.14.

Figure 3.16 shows the predictive uncertainty of 1HL models at the beginning and end of active learning respectively. Comparing the uncertainty before and after the 50 points have been collected during active learning, we see that all models significantly reduce their uncertainty around clusters that have been heavily sampled, except for MCDO. This causes MCDO to repeatedly sample near locations that have already been labelled, in contrast to the GP. Interestingly, it sometimes chooses from clusters near the origin in the 1HL case, even though its variance function is provably convex. This may be unexpected since convex functions are roughly ‘bowl-shaped’ and hence one might expect the centre of the input space to be a region of relatively lower predictive variance. The fact that 1HL MCDO nevertheless sometimes chooses datapoints near the origin could be because the minimum of the variance function for MCDO is not centred at the origin, or because the variance has the shape of an elongated valley. Note also that MFVI is most confident at clusters near the origin that have never been sampled, and least confident at clusters far from the origin that have already been heavily sampled. Again, this is not necessarily a direct consequence of Theorem 1, but appears to be a wider pathology to do with in-between uncertainty. In contrast, the GP seems to select the ‘corners’ of each cluster, which is intuitively efficient.

[Figure 3.16 panels: (a) Limiting GP (before) (b) MFVI (before) (c) MCDO (before) (d) Limiting GP (after) (e) MFVI (after) (f) MCDO (after)]
Fig. 3.16 Predictive uncertainties before (top row) and after (bottom row) active learning, for single-hidden layer BNNs. Note here that colours denote predictive uncertainties, rather than distance from the origin as in Figures 3.14 and 3.15. As the noise standard deviation was fixed to 0.01 for all models, changes in the predictive standard deviation reflect model uncertainty. Grey crosses (✕) denote the five points randomly chosen as an initial training set. Red crosses (✕) denote the 50 points selected by active learning. Note how, compared to the top row, the GP has reduced its uncertainty near points it has observed, and is most uncertain at the corners of clusters opposite those points. In contrast, for both MFVI and MCDO, the network is still uncertain around regions it has already collected points from, leading it to oversample those clusters and undersample others.

The success of the infinite-width GP provides strong evidence that this BNN model combined with exact inference has desirable inductive biases for this task; it is rather approximate inference that has caused active learning to fail. It may conceivably be the case that exact inference in finite BNNs behaves more like MCDO and MFVI than the infinite width GP. However, we believe this is unlikely since convergence to the limiting GP can occur for even moderately wide BNNs (Matthews et al., 2018). To rule out this possibility, a follow-up study with extensive HMC simulation for finite BNNs would be needed to corroborate these findings.

3.6 Related work

As we saw in Section 2.4, concerns have been raised about the suitability of QFFG since the earliest work on BNNs.
However, to our knowledge, Theorem 1 is the first theoretical result showing that QFFG, when applied to certain datasets, necessarily has a pathologically restrictive effect on BNN predictive uncertainties.

3.6.1 Discussion of Farquhar et al. (2020)

Concurrently with (and in response to) our work on the expressiveness of QFFG, Farquhar et al. (2020) argued that the mean-field approximation is not severely restrictive. As their paper directly addresses the research in this chapter, whilst coming to different overall conclusions and recommendations, we provide here a detailed discussion of their work. They make several claims, which we discuss one by one:

First, as mentioned above, their overarching claim is that the mean-field approximation for variational inference in BNNs is not severely restrictive. In order to assess this claim, it is crucial to be clear about the goal of approximate inference, and the task that it is applied to. In this chapter, we have assumed that the goal is to obtain predictive distributions that resemble the exact predictive. Hence we consider an approximating family to be restrictive if there are features of the exact predictive that are consistently missing from the approximate predictive (e.g., in-between uncertainty). However, if the goal of approximate inference is instead to obtain a method that performs reasonably well on some metric for a task (e.g., accuracy and expected calibration error on ImageNet), then there are tasks where MFVI indeed can be regarded as succeeding. An example of this is shown in Table 2 in Farquhar et al. (2020), which shows accuracies, negative log-likelihoods and expected calibration error for various Bayesian CNN architectures trained on ImageNet. In this chapter we show that for certain datasets, QFFG fails to capture essential features of the true predictive distribution, but this does not imply that the method cannot be practically useful on any dataset and task.
For example, there could be situations where in-between uncertainty is simply irrelevant for the task at hand (e.g., raw accuracy on ImageNet). In that sense, there are certain tasks/situations where QFFG is severely restrictive, and others where it is not necessarily so.

Second, Farquhar et al. (2020) state that MFVI in deep networks can, in theory, have similar predictive distributions as those induced by more expressive posteriors over shallow networks. On this point our analysis and results are in agreement. Theorem 4 shows that (at least marginally), a deep MFVI BNN can approximate any predictive mean and variance function. In comparison, Proposition 4 in Farquhar et al. (2020) states that our result can be extended to approximate the entire predictive density function, not just the first two moments (although still only marginally). The implication of both of our results is similar: for wide enough BNNs with more than two hidden layers, QFFG is flexible enough, in theory, to resemble the predictive distribution induced by any posterior, including non-mean-field posteriors. However, as acknowledged by Farquhar et al. (2020), this is not enough to show that good approximate posteriors will actually be found by VI (Criterion 2).

Third, Farquhar et al. (2020) claim that the performance of mean-field BNNs on downstream tasks is comparable to that of BNNs using more flexible, but still Gaussian, posteriors. In their Table 2 they show that the performance of SWAG (Maddox et al., 2019) with a low-rank Gaussian posterior is comparable to that of SWAG with a diagonal Gaussian posterior.2 They argue that this provides evidence that the importance of going beyond the mean-field approximation is greatly diminished in large-scale models, and hence research should focus on addressing other problems for MFVI at scale. However, it is not clear that this observation holds for all approximating family methods.
For instance, the K-FAC Laplace approximation shows significantly improved uncertainty estimation over the diagonal Laplace approximation in Ritter et al. (2018), and K-FAC is preferred over diagonal Gaussians in state-of-the-art applications of the Laplace approximation for BNNs (Daxberger et al., 2021; Immer et al., 2021b) (although note that Immer et al. (2021a) found that the diagonal Laplace approximation can lead to good performance when used for estimating the marginal likelihood). Furthermore, there could be other, non-Gaussian approximating families that lead to significant improvements over QFFG. We point out global inducing variational posteriors (Ober and Aitchison, 2021) as a recent example of an empirically successful BNN approximate posterior that is layer-wise conditionally Gaussian, but not jointly Gaussian. It has been shown to lead to much tighter ELBOs than MFVI, making it significantly more amenable to hyperparameter selection by optimising the ELBO (Bui, 2021). Finally, it could be the case that more expressive Gaussian posteriors do indeed lead to superior performance, but more improvements in, e.g., the optimisation procedure or objective function, are required to realise their potential. In this chapter, we refrain from making recommendations regarding which approximating family to use in practice for a particular downstream task. Our concern, rather, is to highlight a specific pathology that we observe in QFFG. However, we believe it is premature to abandon more flexible approximating families as a research direction in favour of focusing solely on scaling up MFVI.

Footnote 2: Although note that the low-rank Gaussian has a slightly better log-likelihood and expected calibration error than the diagonal Gaussian.

Next, Farquhar et al. (2020) argue that in deep BNNs, there exist modes of the posterior that are well approximated by mean-field distributions.
However, it is not immediately clear that this is relevant either for approximating the true posterior, or for good performance on downstream tasks. For example, it may be that these ‘mean-field modes’ exist and MFVI is biased towards finding them. Even if that is the case, QFFG may still be a severely restrictive family. Indeed, the modes that MFVI finds may be very unrepresentative of the full posterior, and it may be the case that a more flexible variational distribution could find other modes that lead to much better performance. Farquhar et al. (2020) argue that increased depth closes the performance gap between mean-field and full-covariance posteriors, which they illustrate in their Figure 2. Their experiment involves running HMC on a small network, and fitting a Gaussian distribution to the samples.3 The KL divergence between a full-covariance Gaussian fit to the samples and a mean-field Gaussian fit is then shown to decrease with depth. However, since the HMC chain they use is initialised from a mode found by MFVI, their experiment is biased to sample from modes that are well-approximated by QFFG to begin with. It does not tell us how much these ‘factorised modes’ are losing compared to other modes that are not well-approximated by QFFG, or, indeed, compared to the full, multimodal posterior, which is what we are finally concerned with approximating. Finally, Farquhar et al. (2020) claim that deeper networks trained with MFVI show improved in-between uncertainty compared to shallow networks. Their main evidence for this is Figure 5 in their paper, which compares in-between uncertainty on 1-dimensional regression for 1HL MFVI vs 3HL MFVI. 
However, although they show arguably better in-between uncertainty in the 3HL case compared to the 1HL case, the predictive variance of the 3HL BNN is still roughly the same at the data clusters as it is in between the data clusters; the BNN is still overconfident in-between the data and underconfident at the data. We show similar behaviour in Figure 3.12.

Footnote 3: They in fact fit a mixture of Gaussians (since the HMC samples reflect a multimodal posterior), and select the Gaussian with the highest Bayesian information criterion. It is not clear that fitting a Gaussian in this way leads to reasonable predictive distributions.

In summary, practitioners seeking to understand whether to use full-covariance Gaussian distributions rather than mean-field Gaussian distributions in their approximate BNNs for tasks such as image classification will find the results in Farquhar et al. (2020) instructive. However, their findings do not directly address the question of whether the mean-field Gaussian approximation suffices to provide a good approximation to the exact posterior predictive.

3.6.2 Pathologies of the optimal mean-field posterior in wide BNNs

The wide limit of BNNs has been a fruitful topic of theoretical investigation as wide BNN priors and (exact) posteriors both converge to Gaussian processes for the case of regression with Gaussian likelihoods (Hron et al., 2020; Matthews et al., 2018; Yang, 2019b). Very recently, Coker et al. (2022) used the wide limit to provide a theoretical characterisation of the optimal MFVI posterior (in the sense of maximising the ELBO) for wide, deep BNNs. They prove that as the width tends to infinity, the approximate posterior predictive of such an MFVI BNN tends to the prior predictive. In other words, the optimal infinite-width MFVI BNN provably completely ignores the data, which is pathological behaviour that is not reflected by the exact posterior.
Their result is a significant theoretical advance in our understanding of approximate inference in BNNs, and directly addresses Criterion 2, since it is a statement about the optimal posterior. In contrast, our Theorems 1 to 4 only address Criterion 1, since they made existence statements regarding the entire approximating family. That is, nothing in Theorems 1 to 4 relied on how the member of the approximating family was selected (e.g., via the ELBO, or Laplace approximation etc.). Rather, these theorems only made statements about whether there were any elements of the approximating family that met certain conditions. However, the main theorem of Coker et al. (2022) does have a limitation, in that it only applies to BNNs with odd activation functions, such as tanh. In particular, it does not apply to the ReLU BNNs that we investigate in this chapter. When non-odd activations are used, Coker et al. (2022) find that the approximate predictive no longer necessarily converges to the prior; however, it does not necessarily model the data well either.

Combined with the results in this chapter, we thus have the following (incomplete) theoretical picture of MFVI in BNNs: for 1HL ReLU BNNs, in-between uncertainty is provably lost. For deep, wide BNNs with odd activations, the posterior predictive converges to the prior predictive. Optimistically, one could hope that when neither of these theorems apply (e.g., when considering deep BNNs with non-odd activations, or which are not too wide), the MFVI predictive will closely resemble the exact predictive, and be able to represent properties such as in-between uncertainty. More conservatively, it appears that whenever definitive theoretical characterisations can be made about the MFVI posterior predictive, they imply major deviations from the exact predictive. Our inability to prove the existence of pathologies in other cases does not imply their absence.
Hence these results sound a note of caution for practitioners, and in general we should not expect the MFVI predictive to closely resemble the exact predictive, unless we have references for the exact predictive to corroborate this (e.g., extensive HMC simulation).

3.6.3 The cold posterior effect and prior selection

Beginning with Wenzel et al. (2020), there has been much work on the cold posterior effect: the observation that the performance of BNNs can be improved by artificially sharpening the Bayesian posterior distribution with a temperature parameter T < 1. In order to show that the cold posterior effect is a genuine feature of the model and not simply an artefact of an inaccurate inference procedure, Wenzel et al. (2020) performed a study of the quality of approximate inference in deep BNNs. They focused on stochastic gradient Markov Chain Monte Carlo (SGMCMC) (Chen et al., 2014; Welling and Teh, 2011; Zhang et al., 2020) in deep convolutional networks, and concluded that SGMCMC is accurate enough for inference, suggesting that the prior is at fault for the cold posterior effect. This has been further investigated in Fortuin et al. (2021) who found that the cold posterior effect can be alleviated in some cases by using heavy-tailed priors. Other possible causes of the cold posterior effect have been suggested, including data augmentation (Fortuin et al., 2021; Izmailov et al., 2021) and dataset curation (Aitchison, 2020). Finally, Noci et al. (2021) argue that the cold posterior effect may be a symptom with many causes, showing that dataset curation, data augmentation and poor prior specification can each, in isolation, lead to the cold posterior effect.

In contrast to these studies, we do not investigate the cold posterior effect or focus on designing BNN priors. Instead, we ask whether approximate inference resembles the exact posterior for a given prior.
We give examples of situations where commonly used independent Gaussian priors do encode useful inductive biases which are subsequently lost by approximate inference. Hence we show that even if the problem of BNN prior specification was completely solved (something the community may be far from achieving), the inaccuracies in MFVI and MCDO inference could still stop the good inductive biases of the prior from being translated to the posterior.

3.6.4 Properties of MC dropout posteriors

Prior to our work, Osband et al. (2018) also identified pathologies in MC dropout posteriors, although of a different nature. They note that the MCDO predictive distribution is invariant to duplicates of the data, and in the linear case predictive uncertainty does not decrease as dataset size increases, if the dropout rate and regulariser are fixed. However, for a fixed prior, the ‘KL condition’ (Gal, 2016, Section 3.2.3) requires the ℓ2 regularisation constant to decrease with increasing dataset size. In that case, the MCDO predictive will no longer be invariant to duplicates of the data. Theorem 2 shows that in the non-linear 1HL case, the predictive uncertainty in the MCDO posterior has restricted flexibility even for datasets without repeated entries. Furthermore, since it applies for any setting of the parameters, the restrictions on in-between uncertainty will persist regardless of how much (or how little) data is observed. In follow-up work, Manita et al. (2022) generalised our result on the universality of MC dropout networks (Theorem 4). Whilst our theorem only holds for networks with ReLU activations, they show the universal approximation property holds for the same class of activation functions that the original deterministic universal approximation theorem holds for (Leshno et al., 1993).
Furthermore, it is common in non-Bayesian uses of MC dropout to employ a deterministic mode of the network at test time, which works by multiplying the deterministic weights by 1 − p, where p is the dropout probability (Srivastava et al., 2014). Manita et al. (2022) show that it is possible to construct a dropout network that can approximate any function in both random and deterministic mode simultaneously. In contrast to our work, they focus on proving that the output of the network can approximate any function either with high probability or in expected $L^q$ norm, and do not consider the universal approximation properties of the predictive variance function.

3.7 Conclusions

Principled approximate Bayesian inference involves defining a reasonable model, then finding an approximate posterior that retains the properties of the exact posterior that are relevant for the task at hand. We have presented both theoretical and empirical results characterising the expressiveness of the approximate posterior in function space obtained by MFVI and MCDO. For shallow BNNs we prove a fundamental limitation of mean-field Gaussian and MC dropout distributions in representing in-between uncertainty. While using deeper networks significantly improves the expressive power of these approximating families in terms of fitting arbitrary mean and variance functions, in practice VI does not take full advantage of this flexibility and again fails to capture in-between uncertainty. Although this is of greatest relevance for lower-dimensional regression tasks, the fact that MFVI and MCDO often fail these simple sanity checks indicates that these methods might generally have predictive distributions which are qualitatively different from the exact predictive.
While BNNs have previously been shown to provide uncertainty estimates that are useful for a range of downstream tasks, it remains an open question as to what extent this is attributable to a resemblance between the approximate and exact predictive posteriors. To date, BNN approximate posteriors are poorly understood, especially when compared with the extensive work that has been done on understanding BNN priors (Lee et al., 2018; Matthews et al., 2018; Neal, 1995; Yang, 2019a). Together with the results of Coker et al. (2022), Theorems 1 to 4 serve as an important first step in theoretically characterising the behaviour of approximate inference in these models. Finally, Theorem 4 raises important questions about the flexibility of approximate inference in deep networks: Can the theorem be extended to covariances between the network outputs (i.e., statements about joint distributions in function space)? Why is Criterion 2 not satisfied when performing VI in weight space, even when Criterion 1 is satisfied? We hope our results motivate future work to better understand the interaction between approximating families and objective functions, as well as new approximate inference methods which can realise the full potential of BNNs.

Chapter 4 Neural processes

4.1 Introduction

In the first part of this thesis, we considered Bayesian neural networks as a promising machine learning model for making predictions under uncertainty. However, we saw that exact inference was intractable and that approximate inference often led to behaviour which was qualitatively different from that of the true predictive distribution. Now, we turn to the second focus of this thesis: neural processes (NPs) (Garnelo et al., 2018a,b; Kim et al., 2018). NPs are a recently proposed family of deep learning models. Like BNNs, NPs address a shortcoming of modern deep learning: it is not easily applicable in settings where the dataset is small and good uncertainty estimation is required.
As an example, consider a doctor using machine learning to predict the future time evolution of a patient’s biophysical data. The doctor has access to measurements of the patient’s data collected during their stay at the hospital. However, having just a single patient’s data is unlikely to provide all the information needed to make an accurate prediction. Ideally, the doctor would like to incorporate inductive biases into the model obtained from the medical histories of many patients. In this thesis, we consider an inductive bias as any modelling assumption which is baked into the model before training on the data in the task at hand (in this case, the biophysical data of the patient of interest). There are many kinds of inductive biases that can be incorporated into a neural network model, and they vary on a continuum from very general to very specific. For example, using a deep MLP architecture is a very general inductive bias, which enforces some degree of smoothness in the function, but is otherwise extremely flexible. Beyond this, architectures like convolutional neural networks bake in translation equivariance into the model, thus restricting the class of functions that can be represented. In standard Bayesian machine learning models, inductive biases are incorporated into the model by specifying a prior over functions. This is the case with BNNs, where the model architecture combined with the prior over the weights induces a distribution over predictive functions. In our example, ideally, this prior would include information about how biophysical data is likely to behave in general, which would then be combined with the specific observations made of the current patient. Having accurate and informative inductive biases tailored to the task at hand would allow the model to learn from far fewer examples compared to a model that only had very general inductive biases, e.g., about the smoothness of the function.
Unfortunately, specifying such inductive biases by hand usually requires expert knowledge, both of the application area and of the Bayesian model class. Instead, neural processes approach this problem with meta-learning, or learning to learn (Schmidhuber, 1987; Thrun and Pratt, 2012). Meta-learning frames the task of finding suitable inductive biases as part of a supervised learning problem, where each learning instance is an entire dataset (in this case, a patient’s entire biophysical data trajectory), rather than a single datapoint (in this case, a single time stamp in a patient’s biophysical data trajectory). Meta-learning removes the burden of prior selection from the machine learning practitioner, which for probabilistic methods like BNNs is notoriously difficult (Fortuin, 2022). Instead, the relevant inductive biases are learned directly from data. When such data is available, e.g., in this example where there are many related patient trajectories, it would be advantageous for the model to make full use of it directly, rather than only using it to inform a human expert’s modelling decisions. Having argued for the benefits of learning inductive biases directly from data, it is important to mention that in Chapter 5 we will see that even with meta-learning, there are benefits to incorporating high-level inductive biases, such as convolutional structure. However, compared to specifying a BNN prior, which assigns a probability density to every possible setting of the weights, this is a much more general inductive bias. In general, on the spectrum of specificity of inductive biases in methods, there is usually an optimum where some information is baked in to the model by human design, and some information is learned directly from data. With neural processes, we explore a model which is closer to the ‘data-driven’ end of this spectrum than BNNs. We have described the advantages of taking a data-driven approach to learning inductive biases. 
Another key feature of neural processes is that, like BNNs, they explicitly model uncertainty in their predictions. Continuing our example, suppose the doctor is planning to make a potentially life-changing treatment decision based on the network’s predictions. It is then crucial that the network knows when it should be uncertain, instead of being confidently wrong. As we saw in Chapters 2 and 3, BNNs approach this problem by placing a prior probability distribution on the weights of the network, which is then updated using Bayes’ theorem. In contrast, NPs take a more direct approach where, given an observed dataset, the neural network outputs are used to specify the parameters of a predictive stochastic process, i.e. a distribution over predictive functions. For conditional neural processes (CNPs) (Garnelo et al., 2018a), this approach does not require any intractable inference procedures, and for latent neural processes (LNPs) (Garnelo et al., 2018b), it may require inference only over a set of latent variables which is much smaller than the number of weights in the network. In summary, neural processes are a collection of models that work by meta-learning a distribution over predictive functions, i.e., a predictive stochastic process. Meta-learning allows NPs to incorporate data from many related tasks, and providing predictive stochastic processes instead of deterministic functions allows NPs to effectively represent uncertainty via the randomness in the function, similarly to BNNs. In this chapter, we will present an introduction to neural processes, covering several of the NP variants that have been introduced so far, and unpack both the terms ‘meta-learning’ and ‘stochastic process’ in more detail. In addition, in Section 4.4.3 we present a novel objective function for training latent neural processes, which we will evaluate against the standard latent neural process objective in Chapter 5.
The exposition in this chapter is based on a Jupyter-book tutorial on neural processes, ‘The Neural Process Family’ (Dubois et al., 2020), which I wrote together with Yann Dubois and Jonathan Gordon. I was involved in all aspects of the writing. The use of the approximate log-likelihood objective presented in Section 4.4.3 was first proposed in ‘Meta-Learning Stationary Stochastic Process Prediction with Convolutional Neural Processes’ (Foong et al., 2020a). The research on the new objective in that paper was conducted along with my co-authors Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois, and James Requeima. Richard E. Turner supervised the work throughout. I was involved in all aspects of writing in that paper, and with the proposal and evaluation of the newly proposed approximate maximum likelihood objective.

4.1.1 Meta-learning

In standard supervised learning, a neural network is trained to output a predictive function given an observed dataset. To make this more precise, we introduce some notation. Let $\mathcal{X} = \mathbb{R}^{d_x}$ be the space of inputs to the function, and let $\mathcal{Y} \subset \mathbb{R}^{d_y}$, with $\mathcal{Y}$ compact, be the space of outputs (though to ease notation, we often assume $\mathcal{Y} \subset \mathbb{R}$).1 Let $\mathcal{Z}_M = (\mathcal{X} \times \mathcal{Y})^M$ be the collection of $M$ input–output pairs, let $\mathcal{Z}_{\leq M} = \bigcup_{m=1}^{M} \mathcal{Z}_m$ be the collection of at most $M$ pairs, and let $\mathcal{Z} = \bigcup_{m=1}^{\infty} \mathcal{Z}_m$ be the collection of finitely many pairs. Then, a single datapoint is an element of $\mathcal{X} \times \mathcal{Y}$, a dataset D with $M$ datapoints is an element of $\mathcal{Z}_M$, and all finite-sized datasets are elements of $\mathcal{Z}$. Let $C_b(\mathcal{X}, \mathcal{Y})$ be the space of continuous, bounded functions $\mathcal{X} \to \mathcal{Y}$. In supervised learning, a neural network is trained on a single dataset $D \in \mathcal{Z}$ typically using a variant of stochastic gradient descent on some loss function. The trained network $f \in C_b(\mathcal{X}, \mathcal{Y})$ is then used as a predictor.2 For example, we may train a network using ADAM to minimise the MAP objective defined in Equation (2.14).
This allows us to associate any supervised learning dataset D with its corresponding trained network f (assuming the hyperparameters and random seed have been fixed). The supervised learning algorithm (defined by the choice of objective function, hyperparameters, etc.), which we denote as A, can thus be seen as a map $A : \mathcal{Z} \to C_b(\mathcal{X}, \mathcal{Y})$. At test time, a prediction at a target input $x \in \mathcal{X}$ can be made by feeding it into the predictor to obtain f(x). The key insight of supervised meta-learning is that we can apply supervised learning to learn the map A itself. In other words, we learn the supervised learning algorithm A using another, higher-level supervised learning algorithm: hence the name ‘meta-learning’. To achieve this, we parameterise a space of supervised learning algorithms, and optimise over that space. For training, we need a collection $\mathcal{M} := (D_n)_{n=1}^{N_{\text{tasks}}}$ of related datasets, where each $D_n \in \mathcal{Z}$ is itself a supervised learning dataset. We refer to $\mathcal{M}$ as a meta-trainset. The result of meta-training on $\mathcal{M}$ (i.e., optimising over the parameterised space of learning algorithms) is a supervised learning algorithm, i.e., a map $\mathcal{Z} \to C_b(\mathcal{X}, \mathcal{Y})$. However, instead of being defined by a loss function and an optimiser like standard supervised learning algorithms, the algorithm is specified by a parametric function (in our case a neural network) which is learned directly from data. Once the meta-learner has been trained, we can deploy the learned algorithm on new datasets that are not in the meta-trainset. We refer to this as making predictions at meta-test time. In this thesis we are concerned with the case where the learnable algorithm $A : \mathcal{Z} \to C_b(\mathcal{X}, \mathcal{Y})$ is entirely parameterised by a neural network, i.e. the adaptation to a new task is done with a single forward pass, without any gradient updates. This is in contrast to the popular model-agnostic meta-learning (MAML) algorithm, which uses gradient steps to update the parameters of the network during meta-test time (Finn et al., 2017).

1 In this thesis we focus on regression tasks. For classification, $\mathcal{Y} = \{1, \ldots, K\}$, where K is the number of classes.
2 The neural network output may not actually be bounded if $\mathcal{X}$ is not compact, but this is not important for our exposition.

Because meta-learning can share information across learning tasks, it is especially well-suited to situations where there are many similar tasks, and each task is a small dataset, as in, e.g., few-shot learning. The small data regime is precisely when we would expect uncertainty in our predictions to matter the most. To relate this back to our example, if we only record the patient’s data at a small number of timestamps, can we always give a confident answer as to how that data will evolve? What we need is to express our uncertainty, and this leads us naturally to consider stochastic processes.

4.1.2 Stochastic process prediction

We have seen that we can think of meta-learning as learning a map directly from observed datasets $D \in \mathcal{Z}$ to predictor functions $f \in C_b(\mathcal{X}, \mathcal{Y})$. However, there are many situations where a point estimate prediction is insufficient. Given a set of query inputs x, what we need is often not a single prediction f(x), but rather a distribution over predictions p(y|x;D), where y are the output values.3 As long as these predictive distributions are consistent with each other for different choices of x, this is equivalent to specifying a distribution over functions $\mathcal{X} \to \mathcal{Y}$. Such a distribution is known as a stochastic process. In detail, we define a stochastic process as a probability measure on the set of functions4 from $\mathcal{X} \to \mathcal{Y}$, i.e. $\mathcal{Y}^{\mathcal{X}}$, equipped with the product σ-algebra of the Borel σ-algebra over each index point (Tao, 2011), denoted Σ. The measurable sets of Σ are those which can be specified by the values of the function at a countable subset $I \subset \mathcal{X}$ of its input locations.
Since in practice we only ever observe data and make predictions at a finite number of points, this is sufficient for our purposes.5 We denote the set of all $\mathcal{Y}^{\mathcal{X}}$-valued stochastic processes as $\mathcal{P}(\mathcal{X}, \mathcal{Y})$. Instead of considering learning algorithms that give point estimates, i.e. those mapping $\mathcal{Z} \to C_b(\mathcal{X}, \mathcal{Y})$, we now consider algorithms that map $\mathcal{Z} \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$. Each predictor sampled from the resulting stochastic process represents a plausible interpolation of the data, and the diversity of the samples reflects the uncertainty in the predictions. Hence, a neural process can be viewed as using neural networks to meta-learn a map from datasets to predictive stochastic processes.

3 Here we use the notation p(y|x;D) instead of p(y|x,D) to emphasise that the distribution need not depend on D via exact Bayesian conditioning, but rather can depend on D in an arbitrary way, for example through a neural network.
4 Note that this is non-standard terminology since strictly speaking, a stochastic process is a random variable, i.e., a map from $\Omega \to \mathcal{Y}^{\mathcal{X}}$, where Ω is some abstract sample space. The resulting measure on $\mathcal{Y}^{\mathcal{X}}$ is then known as the law of a stochastic process. In this thesis we will colloquially use the phrase ‘stochastic process’ to refer to both a stochastic process and its law.
5 However note that this σ-algebra is not rich enough to answer questions concerning properties that depend on an uncountable number of index points, such as continuity of the functions sampled from the stochastic process.

This point of view can be clarified by comparing NPs to the most commonly used form of stochastic process prediction in machine learning: Gaussian process regression (Rasmussen and Williams, 2005). In GP regression, we begin by specifying a prior stochastic process, which is a GP, with $f_{\text{prior}} \sim \mathcal{GP}(0, k)$, where k is the kernel function. Let the observed data $D = (x_n, y_n)_{n=1}^{N}$, with $X := (x_n)_{n=1}^{N}$ and $y := (y_n)_{n=1}^{N}$.
Then, for a Gaussian observation likelihood with variance $\sigma^2$, we can perform exact Bayesian inference to obtain the posterior predictive stochastic process:

\begin{align}
f_{\text{post}} &\sim \mathcal{GP}(\mu_{\text{post}}, k_{\text{post}}) \tag{4.1} \\
\mu_{\text{post}} &= k(\cdot, X)\,(k(X, X) + \sigma^2 I)^{-1} y \tag{4.2} \\
k_{\text{post}} &= k(\cdot, \cdot) - k(\cdot, X)\,(k(X, X) + \sigma^2 I)^{-1} k(X, \cdot), \tag{4.3}
\end{align}

where $f_{\text{post}}$ is a sample from the posterior predictive stochastic process, $k(X, X) \in \mathbb{R}^{N \times N}$ is the kernel matrix at the training inputs, $k(X, \cdot)$ is a column-vector-valued function of the kernel values between the evaluation point and the training inputs, and $k(\cdot, X) = k(X, \cdot)^{T}$. Here we can view the process of conditioning the prior GP on the data D as a map that takes D as input and outputs a predictive stochastic process. In other words, GP regression is a map $\mathcal{Z} \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$ defined by (i) specifying a prior using the kernel k and (ii) performing Bayesian inference. NPs, on the other hand, define this map using a neural network directly, side-stepping this two-step procedure. In contrast to GPs, rather than having inductive biases put into the model via the choice of the prior, NPs learn these biases directly from the meta-trainset.

4.1.3 Stochastic process consistency

In the previous section, we considered specifying a stochastic process by specifying p(y|x;D) for all finite collections of target inputs x using a neural network. Each distribution for a given set of inputs x is referred to as a finite-dimensional distribution of the stochastic process. An important question to ask is, can we stitch together all of these finite-dimensional distributions to obtain a single consistent stochastic process? The Kolmogorov extension theorem (see e.g. Tao (2011, Section 2.4)) tells us that we can, as long as the marginals are consistent with each other under permutation and marginalisation. To illustrate these consistency conditions, we consider some artificial examples of finite-dimensional distributions that are not consistent.
Let $x_1, x_2$ be two input locations, with $y_1, y_2$ the corresponding (probabilistic) outputs.

1. Consider a collection of finite-dimensional distributions with $y_1 \sim \mathcal{N}(0, 1)$ and $[y_1, y_2]^{T} \sim \mathcal{N}([10, 0]^{T}, I)$. What is the mean of $y_1$?
2. Consider a collection with $[y_1, y_2]^{T} \sim \mathcal{N}([0, 0]^{T}, I)$ and $[y_2, y_1]^{T} \sim \mathcal{N}([1, 1]^{T}, I)$. What is the mean of $y_1$? What is the mean of $y_2$?

From these examples, it is clear that inconsistent marginals lead to self-contradictory predictions. In the first example, the marginals were not consistent under marginalisation: marginalising out $y_2$ from the distribution of $[y_1, y_2]^{T}$ did not yield the distribution of $y_1$. In the second case, the marginals were not consistent under permutation: the distributions differed depending on whether $y_1$ or $y_2$ was considered first. These inconsistencies can never occur when doing GP regression, since we begin by specifying a consistent stochastic process prior and compute an exact conditional probability for the predictive distribution. However, when using arbitrary neural networks to directly specify the finite-dimensional distributions of the predictive, some care must be taken so that our definition satisfies these consistency conditions. Later we will prove that these problems will never occur for NPs: given a fixed dataset D to condition on, the NP predictive distributions p(y|x;D) always define a consistent stochastic process. So far, we have only considered what happens when the conditioning dataset D is fixed and the target inputs x are varied. There is another kind of consistency that we might expect stochastic process predictions to satisfy: consistency among predictions with different context sets, with respect to the product rule of probability. To illustrate this, consider two input-output pairs, $(x_1, y_1)$ and $(x_2, y_2)$.
The product rule of probability tells us that any well-defined joint predictive density over $y_1, y_2$ must satisfy:

\begin{align}
p(y_1, y_2 \mid x_1, x_2) &= p(y_1 \mid x_1)\, p(y_2 \mid x_2, y_1, x_1) \tag{4.4} \\
&= p(y_2 \mid x_2)\, p(y_1 \mid x_1, y_2, x_2). \tag{4.5}
\end{align}

This is equivalent to requiring that the distribution over $y_1, y_2$ obtained by autoregressive sampling from the model should be independent of the order in which the sampling is performed. Unfortunately, this is not guaranteed to be the case for NPs, i.e., it is possible that, for a neural process:

\begin{align}
p(y_1 \mid x_1)\, p(y_2 \mid x_2; x_1, y_1) \neq p(y_2 \mid x_2)\, p(y_1 \mid x_1; x_2, y_2).^{6} \tag{4.6}
\end{align}

Ideally, this property would be a guaranteed consequence of the NP model definition, as is the case with GP regression. As it stands, NPs can yield good predictive performance even though they do not exactly obey this product-rule consistency, likely because the training procedure encourages NPs to respect this property approximately, if not exactly. From another point of view, it may be the case that NPs are easier to train than BNNs precisely because they do not guarantee consistency with the rules of probability theory. By not directly attempting to enforce this product-rule consistency, they sidestep the requirement for complicated approximate inference procedures.

4.1.4 The prediction map

We now discuss what mapping we would like NPs to learn ideally. We model the world as having a ground truth stochastic process $P \in \mathcal{P}(\mathcal{X}, \mathcal{Y})$, from which all our observed datasets are drawn. More precisely, let $x_c \in \mathcal{X}^{C}$ and $x_t \in \mathcal{X}^{T}$ with $C, T \in \mathbb{N}$ be two sets of input locations. We would like to define what it means to make predictions for the random function values $y_t \in \mathcal{Y}^{T}$ at $x_t$ conditioned on observations of the random function values $y_c \in \mathcal{Y}^{C}$ at $x_c$, given that the ground truth stochastic process $P$ is completely known. In reality this will not be the case, but it serves as a target that we would like NPs to approximate.
Let $p(\cdot \mid x_c)$ and $p(\cdot \mid x_t)$ denote the densities with respect to Lebesgue measure of the finite-dimensional distributions of $P$ when indexed at $x_c$ and $x_t$ respectively. In this thesis we will assume that these densities always exist. We then have:

\begin{align}
y_t &\sim p(y_t \mid x_t), \tag{4.7} \\
y_c &\sim p(y_c \mid x_c). \tag{4.8}
\end{align}

In accordance with the product rule of probability, we then define the finite-dimensional distribution of the predictive stochastic process at $x_t$ conditioned on $(x_c, y_c)$ as having the density7

\begin{align}
p(y_t \mid y_c, x_t, x_c) = \frac{p(y_t, y_c \mid x_t, x_c)}{p(y_c \mid x_c)}. \tag{4.9}
\end{align}

6 Recall that the semicolon after the conditioning bar denotes the fact that the probability distribution depends on the elements following it in some arbitrary way, in this case via a neural network.

It can easily be verified that for a fixed conditioning dataset $D_c := (x_c, y_c)$, the conditional marginal distributions defined by different choices of $x_t$ in Equation (4.9) are Kolmogorov-consistent in the sense described in Section 4.1.3. Hence, the Kolmogorov extension theorem implies there is a unique measure on $(\mathcal{Y}^{\mathcal{X}}, \Sigma)$ that has Equation (4.9) as its finite-dimensional distributions. We denote this measure by $P_{D_c}$. It is the predictive stochastic process obtained by conditioning $P$ on the observations in $D_c$. We now define $\pi_P : \mathcal{Z} \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$, $\pi_P : D_c \mapsto P_{D_c}$ as the prediction map, so called because it maps each observed dataset $D_c$ to the predictive stochastic process conditioned on $D_c$. The general prediction problem, and the objective of training neural processes, may then be viewed as learning to approximate the prediction map $\pi_P$. In Section 4.4 we will discuss training procedures and optimisation objectives designed to achieve this goal.

4.2 Neural process architectural framework

We now discuss the basic design pattern that underlies the architecture of many NPs.
This involves viewing NPs as an encoder-decoder model, where the encoder processes the conditioning dataset D, and the decoder combines the encoded representation with the query input x to form a prediction. The basic NP architectural framework can be motivated by the following design goals:

1. The dataset to be conditioned on, D, should be treated as a set. This differs from standard vector-valued neural network inputs in that: (i) datasets may have varying sizes; (ii) sets have no intrinsic ordering of their elements. This means that NPs should be invariant with respect to permutations of D. That is, p(y|x;D) = p(y|x; πD), where πD is any dataset formed by permuting the order of the datapoints in D.
2. The resulting predictive distributions p(y|x;D) should be consistent with each other for varying x to ensure that NPs give rise to consistent stochastic processes, as dictated by the Kolmogorov extension theorem.

7 In contrast to Equation (4.6), here we use a comma rather than a semicolon after the conditioning bar, since these are exact values computed with the rules of probability, rather than approximations given by a neural network.

We now describe the encoder of an NP. The encoder for most NPs can be written in the form

\begin{align}
R(D) = \sum_{(x, y) \in D} \phi(x, y), \tag{4.10}
\end{align}

where $\phi$ is a deep neural network, $\phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{E}$, where $\mathcal{E}$ is some representation space, and $R(D) \in \mathcal{E}$ is a fixed-dimensional representation of the dataset D. The summation operation defining R is key as it ensures permutation invariance due to the commutativity of summation. It also ensures that R ‘lives’ in the same space regardless of the number of datapoints in D. Hence all encoders of this form automatically satisfy the first design goal given above. Next, the NP has to combine this representation with the query input locations x to return a prediction. We can broadly categorise NPs into two sub-families based on how this is done.
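Before turning to the two sub-families, the two properties of the sum encoder in Equation (4.10) can be checked numerically. The following is a minimal sketch, assuming NumPy and a small, randomly initialised (untrained) two-layer ReLU network standing in for ϕ. The names `phi` and `encode`, and all dimensions, are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: scalar inputs and outputs, 16-dimensional representation.
D_X, D_Y, D_HIDDEN, D_R = 1, 1, 32, 16

# Randomly initialised two-layer MLP standing in for phi (untrained, illustrative).
W1 = rng.normal(size=(D_X + D_Y, D_HIDDEN))
b1 = rng.normal(size=D_HIDDEN)
W2 = rng.normal(size=(D_HIDDEN, D_R))
b2 = rng.normal(size=D_R)


def phi(xy):
    """Map each concatenated (x, y) row of `xy` to a vector of length D_R."""
    return np.maximum(xy @ W1 + b1, 0.0) @ W2 + b2


def encode(x, y):
    """Sum encoder: R(D) = sum of phi(x, y) over all datapoints in D."""
    xy = np.concatenate([x, y], axis=-1)  # shape (M, D_X + D_Y)
    return phi(xy).sum(axis=0)            # shape (D_R,), whatever M is


# A dataset with five datapoints, a permuted copy, and a smaller dataset.
x = rng.normal(size=(5, D_X))
y = rng.normal(size=(5, D_Y))
perm = rng.permutation(5)

r = encode(x, y)
r_perm = encode(x[perm], y[perm])
r_small = encode(x[:2], y[:2])

print(np.allclose(r, r_perm))   # permuting D leaves R(D) unchanged (up to rounding)
print(r.shape, r_small.shape)   # same representation space for |D| = 5 and |D| = 2
```

The invariance here comes entirely from the summation, so it holds for any choice of ϕ, trained or not.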
Conditional neural processes (CNPs) directly use the deterministic representation R to define a predictive distribution that is factorised conditioned on R. That is, given a set of query input locations $x \in \mathcal{X}^{N}$ with $x = (x_1, \ldots, x_N)$, the predictive distribution is given by:

\begin{align}
p(y \mid x; D) = \prod_{n=1}^{N} p(y_n \mid x_n, R(D)). \tag{4.11}
\end{align}

Here each factor $p(y_n \mid x_n, R(D))$ is a parameterised probability density (typically Gaussian), whose parameters are given by the decoder network. The decoder takes the query input location $x_n$ and conditioning dataset representation $R(D)$ and returns the parameters of the predictive distribution. The graphical model for a CNP is shown in Figure 4.1.

[Fig. 4.1: Graphical model of a conditional neural process. Grey circles denote observed variables.]

On the other hand, latent neural processes8 (LNPs) use the representation $R(D)$ to parameterise a distribution over a latent variable, $z \sim p(z \mid R(D))$. The predictive distribution is then factorised conditionally given z. That is,

\begin{align}
p(y \mid x; D) = \int \prod_{n=1}^{N} p(y_n \mid x_n, z)\, p(z \mid R(D))\, dz. \tag{4.12}
\end{align}

8 Note that in the original paper where LNPs are introduced (Garnelo et al., 2018b), they are simply known as neural processes. We believe this terminology can be confusing, hence we prefer to use the term ‘neural process’ as an umbrella term covering both LNPs and CNPs.

As with conditional neural processes, the factors $p(y_n \mid x_n, z)$ are specified by a decoder network. The graphical model for an LNP is shown in Figure 4.2. As we will see, LNPs offer more expressive predictive distributions than CNPs, which can induce correlations between different query locations, but at the cost of making the likelihood of the model intractable. CNPs are generally easier to train and have closed-form objective functions, but cannot be used to sample coherent functions from the predictive distribution, due to the factorisation assumption.
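The contrast between the factorised predictive of Equation (4.11) and the latent-variable predictive of Equation (4.12) can be illustrated with a toy example. The sketch below is not a trained neural process. It assumes a scalar Gaussian latent variable standing in for p(z|R(D)) and a hypothetical linear decoder with mean z·x_n and fixed noise, and compares samples from it with samples from a factorised Gaussian that has the same marginals. All names and numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([1.0, 2.0])                  # two query input locations
mu_z, sigma_z, sigma_y = 0.5, 1.0, 0.1    # assumed latent and noise parameters
num_samples = 200_000

# LNP-style sampling: draw z once per function sample, then decode every query
# point conditionally on that shared z, as in the integrand of Equation (4.12).
z = rng.normal(mu_z, sigma_z, size=(num_samples, 1))
y_lnp = z * x + sigma_y * rng.normal(size=(num_samples, 2))

# CNP-style sampling: identical Gaussian marginals at each query point, but
# sampled independently across query points, as in Equation (4.11).
marginal_mean = mu_z * x
marginal_std = np.sqrt((sigma_z * x) ** 2 + sigma_y ** 2)
y_cnp = marginal_mean + marginal_std * rng.normal(size=(num_samples, 2))

# Empirical correlation between the outputs at the two query locations.
corr_lnp = np.corrcoef(y_lnp.T)[0, 1]
corr_cnp = np.corrcoef(y_cnp.T)[0, 1]

print(f"LNP-style correlation: {corr_lnp:.3f}")
print(f"CNP-style correlation: {corr_cnp:.3f}")
```

In this toy setting the shared latent variable makes the two outputs strongly correlated, while the factorised predictive has matching marginals but essentially no correlation, which is why independent samples at each query location do not form coherent functions.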
In Sections 4.2.2 to 4.2.4 we will describe some concrete instantiations of the NP architectural framework. First, however, we provide a quick proof that both CNPs and LNPs satisfy the Kolmogorov consistency requirement given above.

4.2.1 Kolmogorov consistency of CNPs and LNPs

Here we show that both CNPs and LNPs meet the consistency requirements to specify a stochastic process according to the Kolmogorov extension theorem, given a fixed conditioning dataset D. We first consider CNPs. Recall that we require consistency under both marginalisation and permutation:

[Fig. 4.2: Graphical model of a latent neural process. Grey circles denote observed variables.]

Proposition 1. The finite-dimensional distributions of CNPs are consistent under marginalisation.

Proof. Consider two query inputs, $x_1, x_2 \in \mathcal{X}$. Then by marginalising out the second predicted output and using Equation (4.11), we get:

\begin{align}
\int p(y_1, y_2 \mid x_1, x_2; D)\, dy_2 &:= \int p(y_1 \mid x_1, R(D))\, p(y_2 \mid x_2, R(D))\, dy_2 \tag{4.13} \\
&= p(y_1 \mid x_1, R(D)) \int p(y_2 \mid x_2, R(D))\, dy_2 \tag{4.14} \\
&= p(y_1 \mid x_1, R(D)) \tag{4.15} \\
&:= p(y_1 \mid x_1; D), \tag{4.16}
\end{align}

which shows that the predictive distribution obtained by querying the CNP at $x_1$ is the same as that obtained by querying it at $x_1, x_2$ and then marginalising out the second target point. Of course, the same argument holds for collections of any size, and when marginalising out any subset of the variables.

Proposition 2. The finite-dimensional distributions of CNPs are consistent under permutation.

Proof. Let $(x_n)_{n=1}^{N}$ be the query inputs and π be any permutation of $\{1, \ldots, N\}$. Then, again using Equation (4.11), the predictive density is given by:

\begin{align}
p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) &:= \prod_{n=1}^{N} p(y_n \mid x_n, R(D)) \tag{4.17} \\
&= \prod_{n=1}^{N} p(y_{\pi(n)} \mid x_{\pi(n)}, R(D)) \tag{4.18} \\
&:= p(y_{\pi(1)}, \ldots, y_{\pi(N)} \mid x_{\pi(1)}, \ldots, x_{\pi(N)}; D), \tag{4.19}
\end{align}

since multiplication is commutative. It is clear from these derivations that these properties hold for any factorised predictive distributions.
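Propositions 1 and 2 can also be checked numerically for a factorised Gaussian predictive. The sketch below assumes NumPy and uses hypothetical decoder outputs (the arrays `means` and `variances`) for two query points. It approximates the integral in Equation (4.13) with a Riemann sum on a fine grid, and evaluates the joint density under a reordering of the query points as in Equations (4.17) to (4.19).

```python
import numpy as np


def gauss_pdf(y, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at y."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def joint_density(y, means, variances):
    """Factorised joint density: product over query points, as in Equation (4.11)."""
    return np.prod(gauss_pdf(y, means, variances), axis=-1)


# Hypothetical decoder outputs at two query locations (illustrative values).
means = np.array([0.3, -1.2])
variances = np.array([0.5, 2.0])

# Marginalisation: integrate the joint over y2 on a grid wide enough that the
# Gaussian tails outside it are negligible.
y1 = 0.7
y2_grid = np.linspace(means[1] - 12.0, means[1] + 12.0, 20001)
dy2 = y2_grid[1] - y2_grid[0]
pairs = np.stack([np.full_like(y2_grid, y1), y2_grid], axis=-1)
marginal_numeric = np.sum(joint_density(pairs, means, variances)) * dy2
marginal_direct = gauss_pdf(y1, means[0], variances[0])

# Permutation: reorder the query points and the outputs together.
y = np.array([0.7, -0.4])
perm = np.array([1, 0])
density_original = joint_density(y, means, variances)
density_permuted = joint_density(y[perm], means[perm], variances[perm])

print(np.isclose(marginal_numeric, marginal_direct))
print(np.isclose(density_original, density_permuted))
```

The check uses Gaussian factors only for convenience. As noted above, the same properties hold for any factorised predictive.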
The proof of Kolmogorov consistency for LNPs is similar to that given for CNPs:

Proposition 3. The finite-dimensional distributions of LNPs are consistent under marginalisation.

Proof. Consider two query inputs, $x_1, x_2$. Then by marginalising out the second predicted output and using Equation (4.12), we obtain:

\begin{align}
\int p(y_1, y_2 \mid x_1, x_2; D)\, dy_2 &:= \int\!\!\int p(y_1 \mid x_1, z)\, p(y_2 \mid x_2, z)\, p(z \mid R(D))\, dz\, dy_2 \tag{4.20} \\
&= \int p(y_1 \mid x_1, z)\, p(z \mid R(D)) \int p(y_2 \mid x_2, z)\, dy_2\, dz \tag{4.21} \\
&= \int p(y_1 \mid x_1, z)\, p(z \mid R(D))\, dz \tag{4.22} \\
&=: p(y_1 \mid x_1; D), \tag{4.23}
\end{align}

which shows that the predictive distribution obtained by querying an LNP at $x_1$ is the same as that obtained by querying it at $x_1, x_2$ and then marginalising out the second target point. Again, the same idea works with collections of any size, and when marginalising out any subset of the variables.

Proposition 4. The finite-dimensional distributions of LNPs are consistent under permutation.

Proof. Let $(x_n)_{n=1}^{N}$ be the query inputs and $\pi$ be any permutation of $\{1, \ldots, N\}$. Then the predictive density is:

\begin{align}
p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) &:= \int p(z \mid R(D)) \prod_{n=1}^{N} p(y_n \mid x_n, z)\, dz \tag{4.24} \\
&= \int p(z \mid R(D)) \prod_{n=1}^{N} p(y_{\pi(n)} \mid x_{\pi(n)}, z)\, dz \tag{4.25} \\
&=: p(y_{\pi(1)}, \ldots, y_{\pi(N)} \mid x_{\pi(1)}, \ldots, x_{\pi(N)}; D), \tag{4.26}
\end{align}

since multiplication is commutative.

We next describe some concrete instantiations of the NP architectural framework.

4.2.2 MLP-conditional neural processes

The simplest model in the NP architectural framework, and the first to be proposed, is the MLP-conditional neural process, usually just referred to as the conditional neural process (CNP) (Garnelo et al., 2018a).$^{9}$ Recall that specifying an instantiation of the NP architectural framework given at the beginning of Section 4.2 requires defining an encoder and decoder. For MLP-CNPs, the encoder is given by:

\[ R(D) = \sum_{(x, y) \in D} \phi(x, y), \tag{4.27} \]

where $\phi$ is a multilayer perceptron.
More precisely, given an observed datapoint with $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$, $x$ and $y$ are concatenated and fed into an MLP $\phi : \mathbb{R}^{d_x + d_y} \to \mathbb{R}^{d_R}$, where $d_R$ is the dimensionality of the representation. Following this, the per-datapoint representations $\phi(x, y)$ are summed together to form a representation of the entire dataset, $R(D) \in \mathbb{R}^{d_R}$.

Next, we specify the predictive distribution of the CNP. Following Equation (4.11), the CNP predictions are factorised over each query input, and we additionally use Gaussian distributions for each factor:

\begin{align}
p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) &= \prod_{n=1}^{N} p(y_n \mid x_n, R(D)) \tag{4.28} \\
&= \prod_{n=1}^{N} \mathcal{N}\big(y_n; \mu(x_n, R(D)), \sigma^2(x_n, R(D))\big). \tag{4.29}
\end{align}

Here $\mu(x_n, R(D))$ and $\sigma^2(x_n, R(D))$ are again defined by MLPs, with $\mu : \mathbb{R}^{d_x + d_R} \to \mathbb{R}^{d_y}$ and $\sigma^2 : \mathbb{R}^{d_x + d_R} \to \mathbb{R}^{d_y}$. Together, these networks form the decoder of the MLP-CNP. In practice it is common to use a single MLP that outputs both $\mu(x_n, R(D))$ and $\log \sigma^2(x_n, R(D))$, where the logarithm of the predictive variance is output to ensure that $\sigma^2(x_n, R(D)) > 0$.

MLP-CNPs are simple to define and were shown to successfully approximate the predictive distribution for non-Gaussian regression tasks and image inpainting (Garnelo et al., 2018a). However, as with all CNPs, they cannot model dependencies between query points in the predictive distribution.

$^{9}$ A note about terminology: in the original publication (Garnelo et al., 2018a), what we refer to as the MLP-CNP is simply known as the CNP. Instead, we usually use the term CNP to refer to the entire class of neural process models that make factorised predictions as in Equation (4.11). Similar comments apply to the latent variable version, the MLP-LNP (Garnelo et al., 2018b): we use the term LNP to refer to the entire class of neural process models that define their predictive distribution using Equation (4.12).
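The encoder of Equation (4.27) and the Gaussian decoder of Equations (4.28) and (4.29) can be illustrated with a minimal sketch. It assumes scalar inputs and outputs ($d_x = d_y = 1$), untrained random weights, and a single decoder MLP that outputs the mean and log-variance; the helper names (`make_mlp`, `encode`, `predict`) and layer sizes are hypothetical choices made for illustration, not those of any published implementation.

```python
import math
import random


def make_mlp(sizes, rng):
    """Build a list of (weights, biases) layers with random initial weights."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
             for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers


def mlp(layers, v):
    """Apply the MLP to the vector v, with ReLU on all but the last layer."""
    for i, (w, b) in enumerate(layers):
        v = [sum(wij * vj for wij, vj in zip(row, v)) + bi
             for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            v = [max(0.0, a) for a in v]
    return v


def encode(phi, context):
    """R(D): sum of phi(x, y) over the context set, as in Equation (4.27)."""
    r = [0.0] * len(phi[-1][1])
    for x, y in context:
        r = [ri + hi for ri, hi in zip(r, mlp(phi, [x, y]))]
    return r


def predict(decoder, r, x_query):
    """Return (mean, variance); the decoder outputs (mu, log sigma^2)."""
    mu, log_var = mlp(decoder, [x_query] + r)
    return mu, math.exp(log_var)  # exponentiating keeps the variance positive


def log_likelihood(decoder, r, targets):
    """Factorised Gaussian log-likelihood, Equations (4.28) and (4.29)."""
    total = 0.0
    for x, y in targets:
        mu, var = predict(decoder, r, x)
        total += -0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)
    return total


rng = random.Random(0)
D_R = 8  # dimensionality of the representation (arbitrary choice)
phi = make_mlp([2, 16, D_R], rng)
decoder = make_mlp([1 + D_R, 16, 2], rng)

context = [(-1.0, 0.3), (0.0, -0.2), (1.5, 0.9)]
r = encode(phi, context)
r_perm = encode(phi, context[::-1])  # same dataset, different ordering
```

Because the encoder only sums per-datapoint representations, `r` and `r_perm` agree up to floating-point rounding, which is the permutation invariance with respect to the context set that the architecture is designed to provide.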
Since the samples of every $y_n$ value will be independent, functions sampled from the posterior of a CNP will be extremely noisy: there is no way to separate 'aleatoric' from 'epistemic' uncertainty in the CNP predictive distribution, since all of the randomness is independent between different query locations. This inability to model dependencies renders CNPs unsuitable for downstream applications that require coherent samples, such as Thompson sampling (Thompson, 1933), where a function is sampled from the posterior and then greedily optimised in order to perform Bayesian optimisation.

Coherent samples are also required in order to estimate the probability of events that are defined over an extended region of the input space. For example, consider the task of predicting if the value of the function will exceed a certain threshold over the entirety of a range in the input. This kind of task occurs in heatwave prediction, where we are interested in the probability that the temperature exceeds a certain threshold over an extended period of time. CNPs will assign an unreasonably low probability to this event since within any non-zero interval there are uncountably many query points $x_n$, and the probability that the corresponding output values $y_n$ all exceed a certain threshold will vanish if all the $y_n$ are modelled as independent Gaussian distributions. Finally, the inability to model dependencies leads to poorer joint log-likelihoods, since CNPs will be forced to approximate the ground truth, dependent predictive distribution with a factorised distribution. The inability of CNPs to model dependencies was addressed in follow-up work with the introduction of the MLP-latent neural process, which we discuss next.

4.2.3 MLP-latent neural processes

The MLP-latent neural process (MLP-LNP), commonly known simply as the latent neural process (LNP) (Garnelo et al., 2018b), has a similar architecture to the MLP-CNP.
However, instead of directly passing the deterministic representation $R(D)$ to the decoder network, $R(D)$ is used to define the parameters of a Gaussian distribution over a latent variable $z \in \mathbb{R}^{d_z}$:

\[ p(z \mid R(D)) = \mathcal{N}\big(z; \mu_z(R(D)), \sigma_z^2(R(D))\big). \tag{4.30} \]

Here $\mu_z(R(D)) \in \mathbb{R}^{d_z}$ and $\sigma_z^2(R(D)) \in \mathbb{R}^{d_z}$ are output by an MLP that takes $R(D)$ as input. Next, the latent variable $z$ is used to define the predictive distribution, following Equation (4.12):

\[ p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) = \int \prod_{n=1}^{N} \mathcal{N}\big(y_n; \mu(x_n, z), \sigma^2(x_n, z)\big)\, p(z \mid R(D))\, dz. \tag{4.31} \]

The architecture of the decoder networks $\mu(x_n, z)$ and $\sigma^2(x_n, z)$ is the same as that of the MLP-CNP. Note that if the variance of the latent variable $\sigma_z^2(R(D)) \to 0$, then $z$ becomes a deterministic representation; hence the MLP-CNP is a special case of the MLP-LNP. Garnelo et al. (2018b) showed that the MLP-LNP was capable of producing coherent and diverse function samples that could be used for downstream tasks such as Thompson sampling.

4.2.4 Attentive neural processes

One shortcoming of MLP-based neural processes is that they have a tendency to underfit the data (Kim et al., 2018). For example, MLP-CNPs struggle to take advantage of the fact that if a query point is very close to a datapoint in $D$, the two should have similar values, whereas points that are far apart need not. One possible explanation for this is that all the query points $x_n$ share a single global representation $R(D)$ of the conditioning dataset $D$, i.e., $R(D)$ is independent of the location of the query input. This suggests that a priori, all points in the dataset $D$ are given the same 'importance', regardless of the location at which a prediction is being made.
Although, as we will see in Section 4.3, the form of the representation used by MLP-CNPs is universal in the space of permutation invariant set functions, it nevertheless may not be the most data-efficient representation since it does not bake in the importance of locality as an inductive bias. One solution to this is to use a query-location-dependent representation, $R(x_n, D)$. To achieve this, Kim et al. (2018) propose the attentive NP (ANP), which replaces the summation operation in MLP-NPs with aggregation using an attention mechanism (Bahdanau et al., 2015). The attention mechanism allows the ANP to learn to attend to specific datapoints in $D$ that are particularly relevant to the query location, giving them more weight than others when making a prediction.

To illustrate how attention can alleviate underfitting, consider the case where $D$ contains two observations with inputs $x_1, x_2$ that are very far apart. These observations are then mapped by $\phi$ to the local representations $\phi(x_1, y_1)$ and $\phi(x_2, y_2)$ respectively. Intuitively, when making predictions close to $x_1$, the decoder should focus on $\phi(x_1, y_1)$ and ignore $\phi(x_2, y_2)$, since $\phi(x_1, y_1)$ contains much more information about this region of input space. The attention mechanism allows us to define this intuition algorithmically, and incorporate it as a high-level inductive bias in the NP.

In detail, an attention mechanism works by processing a set of keys, queries and values. The queries attend to the keys via the computation of a similarity measure. This similarity measure is then used to compute attention weights, which are normalised to sum to one. The attention weights are used to compute a weighted combination of the values. The most common form of attention, dot-product attention, uses the dot product as a similarity measure and works as follows. Consider having $M$ key-value pairs arranged in matrices, with the key matrix $K \in \mathbb{R}^{M \times d_K}$, and the value matrix $V \in \mathbb{R}^{M \times d_V}$.
These key-value pairs are attended to by $N$ query vectors, $Q \in \mathbb{R}^{N \times d_K}$. The output of the dot product attention is then computed as:

\[ \mathrm{Attention}(K, Q, V) = \underbrace{\mathrm{softmax}\big(Q K^{\top} / \sqrt{d_K}\big)}_{W \in \mathbb{R}^{N \times M}}\, V \in \mathbb{R}^{N \times d_V}. \tag{4.32} \]

Here the softmax operation is applied row-wise to $Q K^{\top}$ over $M$ elements. $W$ is the matrix of attention weights, and we can see that the $n$th row of the attention output is given by a weighted sum of the $M$ rows of the value matrix $V$.

ANPs make use of the attention operation in Equation (4.32) as follows. The query matrix $Q$ is formed by applying a pointwise MLP to the $N$ input locations in the query set. The key and value matrices $K, V$ are formed by applying two separate pointwise MLPs to the $M$ datapoints in the conditioning set, one for producing keys and the other for producing values. The $n$th row of the output of the attention operation $\mathrm{Attention}(K, Q, V) \in \mathbb{R}^{N \times d_V}$ is then the query-location specific representation of $D$, i.e., $R(x_n, D)$. In contrast to the MLP-CNP, there are now $N$ distinct representations, one for each of the $N$ query locations.

Another way to view this is as defining a weighting function $w(\cdot, \cdot)$ that weights each datapoint in $D$ depending on the input location we want to predict at, $x_n$. The datapoints in $D$ determine the attention keys, and $x_n$ determines the attention query. The $x_n$-specific representation of $D$ is then given by

\[ R(x_n, D) = \sum_{(x, y) \in D} w(x, x_n)\, \phi(x, y), \tag{4.33} \]

with the attention weights normalised so that $\sum_{(x, y) \in D} w(x, x_n) = 1$. This is in contrast with Equation (4.27) where no weighting is performed. Kim et al. (2018) tested various kinds of attention mechanisms to define $w(\cdot, \cdot)$, including Laplace kernel attention, dot product attention, and multihead dot product attention (Vaswani et al., 2017). They generally find that multihead dot product attention works best.

So far we have only considered an attention mechanism between the query input location $x_n$ and the observed dataset, i.e., cross attention.
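The dot-product cross attention of Equation (4.32) can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: the keys, queries and values are supplied directly as small hand-written lists, whereas in an ANP they would be produced by pointwise MLPs; the function and variable names are hypothetical.

```python
import math


def softmax(row):
    """Softmax over one row, shifted by the maximum for numerical stability."""
    m = max(row)
    exps = [math.exp(a - m) for a in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(K, Q, V):
    """Dot-product attention: softmax(Q K^T / sqrt(d_K)) V, Equation (4.32).

    K has M rows of length d_K, Q has N rows of length d_K,
    V has M rows of length d_V. Returns (output, weights) with
    N x d_V and N x M entries respectively.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)  # one row of W, normalised to sum to one
        weights.append(w)
        outputs.append([sum(wm * v[j] for wm, v in zip(w, V))
                        for j in range(d_v)])
    return outputs, weights


# Three context points (M = 3) and two query locations (N = 2).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]

reps, weights = attention(keys, queries, values)
```

Each row of `reps` plays the role of the query-specific representation $R(x_n, D)$ in Equation (4.33). Reordering `keys` and `values` together leaves `reps` unchanged up to rounding, and if all keys are identical the weights become uniform, so each output row is simply the mean of the values.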
In addition, the ANP when originally proposed (Kim et al., 2018) used an attention mechanism between datapoints in $D$: self attention. In this case, the representation of a datapoint $(x, y) \in D$ is no longer given by $\phi(x, y)$, but is instead the result of applying self attention to $\phi(x, y)$. This can be implemented using Equation (4.32), but with the keys, queries and values all computed from the conditioning set only.

Note that neither the self attention applied to $D$, nor the cross attention between $x_n$ and $D$ impacts the invariance of the predictions with respect to permutations of $D$. If $D$ is permuted, the sequence of per-datapoint representations is permuted in the same way. When cross attention is applied to this sequence, its ordering is irrelevant (see Equation (4.33)), hence the predictions are unaffected.

Compared to cross attention, self attention between the datapoints in $D$ has a less clear interpretation as an inductive bias. In fact, we have found that only using cross attention without self attention in the ANP is generally not detrimental to performance, while being less computationally demanding.

Using this architecture, both CNP and LNP versions of the attentive neural process can be constructed, following Sections 4.2.2 and 4.2.3. However, Kim et al. (2018) propose a hybrid model that uses both a deterministic and stochastic path. Specifically, a deterministic representation of $D$ is constructed using Equation (4.33). Separately, a stochastic representation is obtained by applying self attention to the datapoints in $D$, and then taking the mean of the resulting outputs. This mean is then fed into an MLP which defines the parameters of $p(z \mid D)$. Finally, the predictive distribution is given by

\[ p(y \mid x; D) = \int \prod_{n=1}^{N} \mathcal{N}\big(y_n; \mu(x_n, R(x_n, D), z), \sigma^2(x_n, R(x_n, D), z)\big)\, p(z \mid D)\, dz. \tag{4.34} \]

Note that, in contrast to Sections 4.2.2 and 4.2.3, the decoder takes both the deterministic representation $R(x_n, D)$ and the latent variable $z$ as inputs.
If the MLPs $\mu$ and $\sigma^2$ learn to ignore the input $z$, then the hybrid model collapses to an attentive CNP (ACNP). This hybrid definition could also be easily applied to MLP-based neural processes.

Kim et al. (2018) show that the ANP significantly outperforms the MLP-LNP in various regression tasks. Hence, in Chapter 5 we use the ANP as a strong baseline with which to compare our proposed NP models. However, the attention operations increase the computational complexity of the ANP relative to MLP-NPs. MLP-NPs have a computational complexity of $O(N + T)$ for making predictions at $T$ query locations conditioned on a dataset of $N$ points. In contrast, the ANP has a computational complexity of $O(N^2 + NT)$, due to the self attention between the $N$ points in $D$, and the cross attention between each query location and $D$. If the self attention is dropped and only cross attention is used, then the computational complexity is reduced to $O(NT)$.

4.3 Deep sets

We have seen that various neural processes can be described using the architectural framework given in Section 4.2. A key component of this architecture is the representation of the dataset by the encoder given by a summation over datapoints in Equation (4.10). It is natural to ask, how flexible is this representation? This question was investigated by Zaheer et al. (2017) and Wagstaff et al. (2022) in the broader context of deep learning on sets. Their main result is the following representation theorem:

Theorem 5 (Wagstaff et al. (2022); Zaheer et al. (2017)). Let $[0, 1]^{\leq M}$ denote the set of subsets of $[0, 1]$ containing at most $M$ elements. Let $f : [0, 1]^{\leq M} \to \mathbb{R}$ be a permutation-invariant, continuous function. Then

\[ f(x) = \rho\left( \sum_{i=1}^{|x|} \phi(x_i) \right) \tag{4.35} \]

for some continuous functions $\rho : \mathbb{R}^M \to \mathbb{R}$ and $\phi : \mathbb{R} \to \mathbb{R}^M$. We refer to $\mathbb{R}^M$ as the embedding space. Here $|x|$ denotes the number of elements in $x$.$^{10}$

Equation (4.35) is known as a 'sum-decomposition' or 'deep sets encoding'.
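To make the sum-decomposition of Equation (4.35) concrete, here is a small hand-constructed instance. It is only an illustration for one particular permutation-invariant function, the population variance of a non-empty set of scalars, for which a three-dimensional embedding happens to suffice; the specific choices of `phi` and `rho` are ours and are not taken from the cited works.

```python
import statistics


def phi(x):
    """Per-element embedding: (1, x, x^2), so sums give count and moments."""
    return (1.0, x, x * x)


def rho(s):
    """Map the summed embedding to the population variance."""
    n, s1, s2 = s
    return s2 / n - (s1 / n) ** 2


def f(xs):
    """Sum-decomposition f(x) = rho(sum_i phi(x_i)) for a non-empty set xs."""
    total = (0.0, 0.0, 0.0)
    for x in xs:
        total = tuple(a + b for a, b in zip(total, phi(x)))
    return rho(total)


points = [0.1, 0.5, 0.9]
variance = f(points)
variance_reordered = f([0.9, 0.1, 0.5])  # same set, different ordering
```

Since the embeddings are combined only by summation, the result does not depend on the order in which the elements are presented (up to floating-point rounding), and `variance` agrees with `statistics.pvariance(points)`. In a neural process, `phi` and `rho` would instead be learned networks.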
Theorem 5 tells us that as long as $\rho$ and $\phi$ are universal function approximators (such as sufficiently wide MLPs), this sum-decomposition can be done without loss of generality in terms of the class of permutation-invariant maps that can be expressed.

Note that in Theorem 5, the dimensionality of the embedding space has to grow with the maximum size of the set, $M$. Wagstaff et al. (2022) show that this is a necessary condition: if the maximum size of the input set is greater than the dimensionality of the embedding space, then there exists a permutation-invariant, continuous function that cannot be expressed by Equation (4.35). Furthermore, they show a stronger result: there exist permutation-invariant continuous functions that cannot be approximated by functions of the form in Equation (4.35).

It is important to note the role played by continuity in Theorem 5. Instead of considering set elements in the domain $[0, 1]$ and demanding continuity, Zaheer et al. (2017) also considered set elements taken from $U$, where $U$ is any countable set. In that case, they show that it suffices for the dimensionality of the embedding space to be one, i.e. the embedding space is just $\mathbb{R}$. However, Wagstaff et al. (2019, 2022) showed that this is not a realistic case to consider, because in practice it leads to the specification of maps exhibiting a high degree of discontinuity, such that it would be impractical to represent using floating-point arithmetic.

As we saw in Sections 4.2.2 and 4.2.3, MLP-NPs make heavy use of the deep sets decomposition, and so do ANPs, since they reduce to MLP-NPs when the attention weights are all equal to 1. To highlight the similarities, we can express the mean function of the MLP-CNP as

\[ \mu(x_n, R(D)) = \mu\left( x_n, \sum_{(x, y) \in D} \phi(x, y) \right), \tag{4.36} \]

where we recall that $\mu : \mathbb{R}^{d_x + d_R} \to \mathbb{R}^{d_y}$ is an MLP. This is very similar to Equation (4.35), with $\mu$ playing the role of $\rho$, except that $\mu$ also takes in the query location $x_n$ as an input.
It is straightforward to leverage this relationship to formally show that CNPs can recover (in the limit of infinite width) any continuous map from datasets to continuous functions $\mathcal{Z} \to C_b(\mathcal{X}, \mathcal{Y})$ as their predictive mean and variance (Gordon, 2021, Theorem 2.4), which provides justification for the proposed architecture.

$^{10}$ Note that this theorem assumes that the individual set elements are members of $[0, 1]$. It is not immediately clear how to extend the proof to vector-valued set elements. Such an extension was proven by Wessel P. Bruinsma, Andrew Y. K. Foong and Jonathan Gordon, and is presented in Gordon (2021, Theorem 2.3), with the added condition that the dimensionality of the embedding space is now $2M$ instead of $M$. See also Yarotsky (2022) for a comparable statement.

4.4 Training neural processes

Having described the architecture, we now discuss how to train neural processes. As mentioned in Section 4.1.1, in order to meta-learn, we require a meta-dataset, i.e., a dataset of datasets. In the meta-learning literature, each dataset in the meta-dataset is referred to as a task. For NPs, this means having access to many independent samples of functions from the ground truth data-generating stochastic process. Each sampled function is then a task. For example, we may have a large collection of audio waveforms $(D_n)_{n=1}^{N_{\mathrm{tasks}}}$ from different speakers. Each of these waveforms may be regarded as an independent sample from the ground truth stochastic process representing the distribution of human speech. Each waveform is then a task which is itself a dataset $D_n = ((x_i, y_i))_{i=1}^{N}$, where each $(x_i, y_i)$ is a timestamp–audio amplitude pair. Or we might have a large collection of natural images: then each $D_n$ would be a single image consisting of pixel-location/pixel-value pairs. We would like to use this meta-dataset to learn how to make predictions at some new query locations upon observing some new conditioning data.
To do this, we use an episodic training procedure, common in meta-learning (Finn et al., 2017; Ravi and Larochelle, 2016; Vinyals et al., 2016). Each episode consists of the following steps:

1. Sample a task $D$ from the meta-trainset $(D_n)_{n=1}^{N_{\mathrm{tasks}}}$.
2. Randomly split the task into two subsets, $D = D_c \cup D_t$. $D_c = (x_c, y_c)$ is known as the context set and $D_t = (x_t, y_t)$ is known as the target set. Here $x_c$ denotes all the input locations in the context set, and $y_c$ denotes their corresponding output values, and similarly for $x_t, y_t$.
3. Pass $D_c$ through the neural process forward pass as the conditioning dataset to obtain the predictive distribution at the input locations in the target set, $p(y_t \mid x_t; D_c)$.
4. Compute the objective function $\mathcal{L}$, which measures the predictive performance of the NP on the target set. For models with tractable likelihood functions, this is usually $\mathcal{L} = \log p(y_t \mid x_t; D_c)$. However, for LNPs, we will have to compute an approximation or a lower bound of the log-likelihood objective, as will be discussed in Sections 4.4.2 and 4.4.3.
5. Compute the gradient $\nabla_\theta \mathcal{L}$ with respect to all learnable parameters $\theta$ in the NP for stochastic gradient optimisation.

The episodes are repeated until training converges. Intuitively, this procedure encourages the NP to produce predictions that fit an unseen target set, given access to only the context set. Once meta-training is complete, if the neural process generalises well, it will be able to do this for unseen context sets that are not in the meta-train set. Note that this setup is analogous to standard supervised learning. We now discuss various objective functions that can be used to train NPs.

4.4.1 Log-likelihood

The most basic objective function to optimise is the log-likelihood of the target set conditioned on the context set.
More precisely, given a meta-dataset $M := (D_n)_{n=1}^{N_{\mathrm{tasks}}}$, during each iteration of stochastic gradient descent training, we sample $(D_c, D_t)$ from $M$ and optimise

\[ \mathcal{L}_{\mathrm{ML}} = \log p(y_t \mid x_t; D_c) \tag{4.37} \]

with respect to the learnable parameters in the NP. In fact, typically we sample a batch of tasks from $M$ and perform mini-batch optimisation, so that at each iteration we take a gradient step that maximises the mean of Equation (4.37) over a batch of datasets $(D_c, D_t)$. This can be viewed as a simple Monte Carlo estimate of the objective $\mathbb{E}_{p(D_c, D_t)}[\log p(y_t \mid x_t; D_c)]$. In the case of CNPs, the distribution over the target outputs factorises, and we have:

\[ \mathcal{L}_{\mathrm{ML}} = \sum_{(x, y) \in D_t} \log p(y \mid x; D_c). \tag{4.38} \]

We now prove that, in the limit of infinite data and infinite model capacity, globally optimising the log-likelihood objective recovers the exact prediction map described in Section 4.1.4, subject to certain conditions on the data-generating process.

Proposition 5. Let $\Psi : \mathcal{Z} \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$ be any map from data sets to stochastic processes, and let $p_\Psi$ be the density of $\Psi(D_c)$ evaluated at $x_t$. Then $\Psi$ globally maximises $\mathbb{E}_{p(D_c, D_t)}[\mathcal{L}_{\mathrm{ML}}(\Psi)] = \mathbb{E}_{p(D_c, D_t)}[\log p_\Psi(y_t \mid x_t; D_c)]$ if and only if all the finite-dimensional distributions of $p_\Psi$ match those of $\pi_P$, the prediction map (as defined in Section 4.1.4), $p(D_c, x_t)$-almost everywhere. I.e., equality holds except on a set of measure zero with respect to $p(D_c, x_t)$.

Proof. We have:

\begin{align}
\mathbb{E}_{p(D_c, D_t)}\left[\log p_\Psi(y_t \mid x_t, D_c)\right] &= \mathbb{E}_{p(D_c, x_t)}\left[ \mathbb{E}_{p(y_t \mid x_t, D_c)}\left[\log p_\Psi(y_t \mid x_t, D_c)\right] \right] \tag{4.39} \\
&= -\mathbb{E}_{p(D_c, x_t)}\left[ \mathrm{KL}\big(p(y_t \mid x_t, D_c) \,\|\, p_\Psi(y_t \mid x_t, D_c)\big) \right] + \text{constant}, \tag{4.40}
\end{align}

where the additive constant is constant with respect to $\Psi$. First note that the KL-divergence is non-negative, and that the prediction map sends all the KL-divergences to zero, globally optimising $\mathcal{L}(\Psi)$. Furthermore, the KL-divergence is equal to zero if and only if the two distributions are equal, and this must hold for almost all $D_c, x_t$ with respect to $p(D_c, x_t)$.
For, if this were not the case, the KL-divergence would contribute a non-zero amount to the expectation in Equation (4.40). Hence the objective is globally optimised if and only if all the finite-dimensional distributions of $p_\Psi$ match the conditional distributions $p(D_c, x_t)$-almost everywhere.

Proposition 5 shows that the support of the data-generating distribution is of crucial importance, since equality with the prediction map $\pi_P$ only holds almost everywhere with respect to $p(D_c, x_t)$. This means that if, for example, the range of the inputs or the number of context points is limited during training, we cannot expect the model to be able to approximate the prediction map well outside of that range, just on the basis of maximum likelihood training. In order to make the support of $p(D_c, x_t)$ as large as possible, one could generate tasks $(D_c, D_t)$ as follows: first, sample some finite number of input locations $x_t, x_c$. Further, set $\Pr(|x_t| = n) > 0$ for all $n \in \mathbb{Z}_{\geq 0}$, where $|x_t|$ denotes the number of datapoints in $x_t$, and assume the same is true of $\Pr(|x_c| = n)$. Finally, arrange that for each $n > 0$, the distribution of $x$ given $|x| = n$ has a continuous density with support over all of $\mathbb{R}^{n \times d_{\mathrm{in}}}$. This could be achieved, for example, by setting the distribution of $x$ to be Gaussian. A distribution like this would ensure that equality $p(D_c, x_t)$-almost everywhere implies equality for all context sets and target inputs.

In practice we often limit the maximum size of the sampled data sets, and also their range in $\mathcal{X}$ space. Hence we can only expect the model to learn reasonable predictions within the ranges seen during train time. That is, if during train time we only observe datasets with at most $n$ datapoints and with input locations within some finite range, we have no reason a priori to expect the NP to be able to make sensible predictions if at meta-test time it encounters context sets that do not belong to these ranges.
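The full-support task distribution described above can be sketched as follows. This is a toy illustration under assumptions of our own: set sizes are drawn from a geometric distribution (so that every size $n \geq 0$ has non-zero probability), inputs are scalar and Gaussian, and the ground truth process is a hypothetical random sinusoid with observation noise. None of these specific choices come from the text beyond the two support conditions.

```python
import math
import random


def sample_size(rng, stop_prob=0.2):
    """Geometric number of points: Pr(n) > 0 for every integer n >= 0."""
    n = 0
    while rng.random() > stop_prob:
        n += 1
    return n


def sample_task(rng, input_scale=2.0, noise_std=0.1):
    """Return (context, target), each a list of (x, y) pairs.

    Inputs are Gaussian, so their density has support on the whole real line.
    The function being observed is a hypothetical random sinusoid.
    """
    amplitude = rng.gauss(0.0, 1.0)
    phase = rng.uniform(0.0, 2.0 * math.pi)

    def draw(n):
        xs = [rng.gauss(0.0, input_scale) for _ in range(n)]
        return [(x, amplitude * math.sin(x + phase) + rng.gauss(0.0, noise_std))
                for x in xs]

    # Context and target sizes are sampled independently of each other.
    return draw(sample_size(rng)), draw(sample_size(rng))


rng = random.Random(0)
tasks = [sample_task(rng) for _ in range(50)]
```

In practice the sizes and input ranges would usually be truncated, as discussed in the surrounding text, in which case the guarantee of Proposition 5 only covers the truncated support.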
In Chapter 5 we will present an example where incorporating suitable inductive biases, in particular, translation equivariance, in the architecture, rather than simply relying on the maximum-likelihood objective, allows the NP to generalise beyond the meta-training input range for $\mathcal{X}$.

In addition to these conditions regarding the input data distribution, there are a number of assumptions in Proposition 5 that caveat its applicability to real world settings. First, in reality we would only optimise a Monte Carlo expectation of $\mathbb{E}_{p(D_c, D_t)}[\log p_\Psi(y_t \mid x_t, D_c)]$ such as $\frac{1}{|M|} \sum_{(D_c, D_t) \in M} \log p_\Psi(y_t \mid x_t, D_c)$, with $(D_c, D_t) \sim p(D_c, D_t)$. Hence the NP would only be guaranteed to recover the prediction map as the size of the meta-trainset $|M| \to \infty$. Furthermore, the proof assumes that the prediction map can be expressed as an NP, which is only guaranteed in the infinite-width limit (see Section 4.3). Finally, the prediction map will only be recovered if global optimisation of the objective succeeds, which is rarely the case for stochastic gradient-based optimisers. Nevertheless, Proposition 5 motivates the use of $\mathcal{L}_{\mathrm{ML}}$ in cases where the meta-trainset is large, the neural networks in the NP have high capacity, and training is performed until convergence.

4.4.2 Neural process variational inference

The maximum likelihood objective of Equation (4.38) is the most commonly used training objective for CNPs. However, for LNPs, this objective cannot be used since it is intractable to compute due to the integral in Equation (4.12). Instead, when introducing the MLP-LNP, Garnelo et al. (2018b) proposed viewing LNPs as performing approximate Bayesian inference and learning in the following latent variable model:

\[ z \sim p(z); \qquad p(y_t \mid x_t, z) = \prod_{(x, y) \in D_t} \mathcal{N}\big(y; f(x; z), \sigma_y^2\big), \tag{4.41} \]

where $f$ is given by a neural network. To train the model, they use amortised VI (Kingma and Welling, 2013; Rezende et al., 2014).
This involves introducing a variational approximation network $q_\phi$ which maps datasets $D_c \in \mathcal{Z}$ to distributions over $z$, and maximising a lower bound (ELBO) on $\log p(y_t \mid x_t, D_c)$. We can use the LNP encoder architecture to parameterise $q_\phi$, since the LNP encoder specifies a map from datasets to distributions over $z$, just as required (see Figure 4.2). Here, note that $\log p(y_t \mid x_t, D_c)$ is defined by exact Bayesian inference for the model in Equation (4.41), hence the notation $\log p(y_t \mid x_t, D_c)$ instead of $\log p(y_t \mid x_t; D_c)$. That is,

\begin{align}
\log p(y_t \mid x_t, D_c) &= \log \int p(y_t \mid x_t, z)\, p(z \mid D_c)\, dz, \tag{4.42} \\
p(z \mid D_c) &= \frac{p(D_c \mid z)\, p(z)}{p(D_c)} \tag{4.43} \\
&= \frac{\prod_{(x, y) \in D_c} \mathcal{N}\big(y; f(x; z), \sigma_y^2\big)\, p(z)}{\int \prod_{(x, y) \in D_c} \mathcal{N}\big(y; f(x; z), \sigma_y^2\big)\, p(z)\, dz}. \tag{4.44}
\end{align}

Note that this is in contrast to $\log p(y_t \mid x_t; D_c)$ which is defined directly by the NP forward pass without reference to Bayes' theorem. Given a task $(D_c, D_t)$, the (conditional) ELBO for this model is:

\[ \mathbb{E}_{z \sim q_\phi(z \mid D_c \cup D_t)}\left[\log p(y_t \mid x_t, z)\right] - \mathrm{KL}\big(q_\phi(z \mid D_c \cup D_t) \,\|\, p(z \mid D_c)\big) \leq \log p(y_t \mid x_t, D_c). \tag{4.45} \]

As $p(z \mid D_c)$ is intractable to compute (since the normalising constant in Equation (4.44) involves an intractable integral), Garnelo et al. (2018b) instead propose the following objective:

\[ \mathcal{L}_{\mathrm{NPVI}} := \mathbb{E}_{z \sim q_\phi(z \mid D_c \cup D_t)}\left[\log p(y_t \mid x_t, z)\right] - \mathrm{KL}\big(q_\phi(z \mid D_c \cup D_t) \,\|\, q_\phi(z \mid D_c)\big), \tag{4.46} \]

where the intractable term $p(z \mid D_c)$ has been substituted with our variational approximation $q_\phi(z \mid D_c)$. We refer to maximising this objective as neural process variational inference (NPVI). Due to this substitution, $\mathcal{L}_{\mathrm{NPVI}}$ is no longer a valid ELBO for the original model (Equation (4.41)), i.e., it is no longer guaranteed to be a lower bound to the Bayesian conditional log-likelihood $\log p(y_t \mid x_t, D_c)$. Rather, if we define separate models for each context set $D_c$, and define the conditional prior for each model as $p(z \mid D_c) := q_\phi(z \mid D_c)$, then $\mathcal{L}_{\mathrm{NPVI}}$ may be thought of as performing VI for this collection of models.
However, there is no guarantee that these conditional priors are consistent in the sense that they correspond to conditional distributions of a single Bayesian model as in Equation (4.41). This is in contrast to sparse variational inference in Gaussian processes, where there is a single Bayesian prior and posterior which is targeted by the approximate posterior GP (Matthews et al., 2016; Titsias, 2009). Although the fact that $\mathcal{L}_{\mathrm{NPVI}}$ does not target a single consistent posterior distribution introduces conceptual difficulties, it was shown by Garnelo et al. (2018b) to be a useful objective for LNPs.

4.4.3 Approximate log-likelihood

As an alternative to the NPVI objective for LNPs, we propose optimising the following Monte Carlo estimate of $\mathcal{L}_{\mathrm{ML}}$, which is conservatively biased, consistent, and monotonically increasing (in expectation) in the number of samples, $L$ (Burda et al., 2015):

\[ \hat{\mathcal{L}}_{\mathrm{ML}} := \log\left[ \frac{1}{L} \sum_{l=1}^{L} \exp\left( \sum_{(x, y) \in D_t} \log p(y \mid x, z_l) \right) \right]; \qquad z_l \sim p(z \mid R(D_c)), \tag{4.47} \]

where $R(D_c)$ is the deterministic representation of the context set $D_c$, as in Equation (4.12). Again here we state the objective for a single dataset $D_c, D_t$; during actual training we would optimise a Monte Carlo estimate of $\mathbb{E}_{p(D_c, D_t)}[\hat{\mathcal{L}}_{\mathrm{ML}}]$. $\hat{\mathcal{L}}_{\mathrm{ML}}$ is an approximation to $\mathcal{L}_{\mathrm{ML}}$ in the sense that

\begin{align}
\hat{\mathcal{L}}_{\mathrm{ML}} &= \log\left[ \frac{1}{L} \sum_{l=1}^{L} \prod_{(x, y) \in D_t} p(y \mid x, z_l) \right] \tag{4.48} \\
&\approx \log \int \prod_{(x, y) \in D_t} p(y \mid x, z)\, p(z \mid R(D_c))\, dz, \tag{4.49}
\end{align}

where Equation (4.49) is the exact LNP log-likelihood. However, since the logarithm of an unbiased estimator is not an unbiased estimator of the logarithm, $\hat{\mathcal{L}}_{\mathrm{ML}}$ is a biased estimate of $\mathcal{L}_{\mathrm{ML}}$, which is only accurate in the limit $L \to \infty$. The bias in this estimator decreases as the variance of $z$ decreases, and in particular, if $z$ is deterministic then the estimator is exact. This means that optimisation may attempt to reduce the variance in order to reduce the bias in the estimator rather than actually increasing the likelihood.
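The estimator in Equation (4.47) is a log-mean-exp over per-sample log-likelihoods, which is usually evaluated with a max-shift for numerical stability. The sketch below illustrates this under assumptions made purely for the example: a scalar latent variable with $z_l \sim \mathcal{N}(\mu_z, \sigma_z^2)$ and a hypothetical toy decoder $p(y \mid x, z) = \mathcal{N}(y; zx, 0.1)$ standing in for the decoder network.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def gaussian_log_pdf(y, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def approx_log_likelihood(targets, mu_z, sigma_z, num_samples, rng):
    """Monte Carlo estimate in the form of Equation (4.47).

    Hypothetical toy decoder: p(y | x, z) = N(y; z * x, 0.1).
    """
    per_sample = []
    for _ in range(num_samples):
        z = rng.gauss(mu_z, sigma_z)  # z_l ~ p(z | R(D_c))
        # Inner sum over target points of log p(y | x, z_l).
        per_sample.append(
            sum(gaussian_log_pdf(y, z * x, 0.1) for x, y in targets))
    return log_mean_exp(per_sample)


rng = random.Random(0)
targets = [(0.5, 0.4), (1.0, 1.1), (-0.5, -0.6)]
estimate = approx_log_likelihood(targets, mu_z=1.0, sigma_z=0.5,
                                 num_samples=32, rng=rng)

# With a deterministic latent (sigma_z = 0) every sample is identical,
# so the estimator reduces to the exact log-likelihood at z = mu_z.
exact_at_mean = sum(gaussian_log_pdf(y, 1.0 * x, 0.1) for x, y in targets)
deterministic = approx_log_likelihood(targets, mu_z=1.0, sigma_z=0.0,
                                      num_samples=4, rng=rng)
```

The last two quantities coincide, matching the observation that the estimator is exact when $z$ is deterministic. By Jensen's inequality, `log_mean_exp` of a list is never smaller than the plain average of that list, which is one way to see why averaging per-sample log-likelihoods (as in a single-sample bound) is more conservative than the log-mean-exp form.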
Generally, if $L$ is too large then each minibatch may take up too much memory, but if $L$ is too small the bias in the estimate can become unacceptably large. In particular, in contrast to $\mathcal{L}_{\mathrm{NPVI}}$, single sample estimators with $L = 1$ are not useful, as they drive $z$ to be deterministic. Hence training with $\hat{\mathcal{L}}_{\mathrm{ML}}$ often requires more memory than training with $\mathcal{L}_{\mathrm{NPVI}}$.

4.4.4 Approximate maximum-likelihood vs variational lower bound maximisation for training NPs

In this section we argue that the VI interpretation may be unnecessary when focusing on predictive performance for NPs. First, we note that $\mathcal{L}_{\mathrm{NPVI}}$ is equal to $\mathcal{L}_{\mathrm{ML}}$ up to an additional KL term. To see this, let $D := D_t \cup D_c$, and let $Z = \int p(y_t \mid x_t, z)\, q_\phi(z \mid D_c)\, dz$. The NPVI objective is then:

\begin{align}
\mathcal{L}_{\mathrm{NPVI}} &:= \mathbb{E}_{q_\phi(z \mid D)}\left[\log p(y_t \mid x_t, z)\right] - \mathrm{KL}\big(q_\phi(z \mid D) \,\|\, q_\phi(z \mid D_c)\big) \tag{4.50} \\
&= \mathbb{E}_{q_\phi(z \mid D)}\left[\log p(y_t \mid x_t, z) + \log q_\phi(z \mid D_c) - \log q_\phi(z \mid D)\right] \tag{4.51} \\
&= \mathbb{E}_{q_\phi(z \mid D)}\left[\log Z + \log \frac{p(y_t \mid x_t, z)\, q_\phi(z \mid D_c)}{Z} - \log q_\phi(z \mid D)\right] \tag{4.52} \\
&= \log Z - \mathrm{KL}\left(q_\phi(z \mid D) \,\middle\|\, \frac{1}{Z}\, p(y_t \mid x_t, z)\, q_\phi(z \mid D_c)\right). \tag{4.53}
\end{align}

When training LNPs with maximum likelihood, $q_\phi$ no longer has an approximate inference interpretation, but is simply the encoder of the LNP. In that case, $\log Z = \log \int p(y_t \mid x_t, z)\, q_\phi(z \mid D_c)\, dz = \mathcal{L}_{\mathrm{ML}}$ is simply the (exact) log-likelihood, so:

\[ \mathcal{L}_{\mathrm{NPVI}} = \mathcal{L}_{\mathrm{ML}} - \mathrm{KL}\left(q_\phi(z \mid D) \,\middle\|\, \frac{1}{Z}\, p(y_t \mid x_t, z)\, q_\phi(z \mid D_c)\right). \tag{4.54} \]

Hence we see that $\mathcal{L}_{\mathrm{NPVI}}$ is equal to $\mathcal{L}_{\mathrm{ML}}$ up to an additional KL term. This KL term encourages consistency among the $q_\phi$ for varying conditioning datasets, in the sense that Bayes' theorem is respected if the target set is subsumed into the context set. To see this, note that it encourages $q_\phi(z \mid D_c \cup D_t)$ to be similar to $\frac{1}{Z}\, p(y_t \mid x_t, z)\, q_\phi(z \mid D_c)$. If $q_\phi$ was performing exact inference instead of approximate inference, this would be satisfied immediately, by the rules of probability. However, since $q_\phi$ is parameterised by a learned encoder, this consistency with respect to Bayesian updating of $z$ must be learned from data.
Thus $\mathcal{L}_{\mathrm{NPVI}}$ can be viewed as directly encouraging this consistency in the objective function. In the infinite capacity/data limit, $\mathcal{L}_{\mathrm{NPVI}}$ is globally maximised if the LNP recovers (i) the prediction map $\pi_P$ for $p(y_t|x_t, D_c)$ and (ii) exact Bayesian inference for $z$. (i) follows from Proposition 5, since $\pi_P$ globally optimises $\mathcal{L}_{\mathrm{ML}}$, and (ii) follows from the fact that exact inference for $z$ sends the KL term to zero since it respects Bayes’ rule.

However, in most applications, only the distribution over $y_t$ is of interest, and we are not directly concerned with our inference for the latent variable $z$. Given only finite capacity/data, it may be advantageous to not expend capacity in encouraging the distribution over $z$ to be consistent with Bayes’ theorem. Hence it could be beneficial to use $\mathcal{L}_{\mathrm{ML}}$ over $\mathcal{L}_{\mathrm{NPVI}}$, since $\mathcal{L}_{\mathrm{ML}}$ solely targets the predictive performance we care about.

Unfortunately, as discussed earlier, $\mathcal{L}_{\mathrm{ML}}$ is intractable for LNPs, and its finite-sample approximation $\hat{\mathcal{L}}_{\mathrm{ML}}$ introduces biases of its own into the training procedure. It is unclear a priori how detrimental these biases will be to performance. Both $\hat{\mathcal{L}}_{\mathrm{ML}}$ and $\mathcal{L}_{\mathrm{NPVI}}$ can be seen as lower bounds on the actual quantity we would like to optimise, the exact log-likelihood. Which objective is preferable in practice will depend on which introduces more harmful biases to the training procedure. In Chapter 5 we compare LNPs trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ and $\mathcal{L}_{\mathrm{NPVI}}$ and find that $\hat{\mathcal{L}}_{\mathrm{ML}}$ can significantly outperform $\mathcal{L}_{\mathrm{NPVI}}$.

4.5 Summary and conclusions

We have introduced neural processes, a family of deep learning models for meta-learning maps from observed datasets to predictive stochastic processes. NPs naturally lend themselves to tasks that require uncertainty estimation in the small-data regime, as long as a meta-dataset is available.
We introduced the encoder-decoder architectural framework used by most NPs, and motivated it with a discussion of stochastic process consistency and invariance with respect to permutations of the context set. Next, we saw that NPs could be divided into two broad sub-families, CNPs and LNPs, depending on whether a latent variable was used to induce dependencies in the predictive distributions. Within these subfamilies we presented instantiations of NPs based on vanilla MLPs and also attention mechanisms. Finally, we discussed the various objective functions that have been proposed for training NPs.

In Section 4.2.4 we saw how the introduction of a suitable inductive bias in the form of attentive neural processes successfully addressed the underfitting problems of MLP-based NPs. This naturally raises the question of what other inductive biases could be built into NP architectures, and what their benefits may be. In Chapter 5 we will present and evaluate a new member of the NP family, the convolutional neural process, which uses a convolutional neural network to build in translation equivariance as an inductive bias.

Chapter 5

Convolutional neural processes

In Chapter 4 we saw that neural processes could be viewed as learning maps from datasets directly to predictive stochastic processes. Although this framework is very general, specialising it to incorporate useful inductive biases can lead to dramatic improvements, as was the case with attentive neural processes (Kim et al., 2018). In this chapter, we consider symmetries, and in particular, stationarity as a powerful inductive bias. Stationary stochastic processes are a key component of many probabilistic models, such as those for off-the-grid spatio-temporal data. They enable the statistical symmetry of underlying physical phenomena to be leveraged, thereby aiding generalisation.
Prediction in such models can be viewed as a translation equivariant map from observed datasets to predictive stochastic processes (see Figure 5.1), emphasising the intimate relationship between stationarity and equivariance.

Building on this, we propose the convolutional conditional neural process (ConvCNP) and the convolutional latent neural process (ConvLNP). The ConvCNP, like other members of the CNP family, makes factorised predictions for each element of the target set. This means that we cannot sample coherent functions from the predictive distribution of the ConvCNP, since every target value will be independent of the other values, as discussed in Section 4.2.2. The ConvLNP, on the other hand, uses a latent variable (in this case, a latent function) to enable coherent samples to be drawn from the predictive distribution. This allows ConvLNPs to be deployed in settings which require coherent samples such as Thompson sampling. Crucially, both ConvCNPs and ConvLNPs use convolutional architectures to endow neural processes with translation equivariance as an inductive bias. Moreover, as discussed in Section 4.4.3, we propose a new maximum-likelihood objective to replace the standard ELBO objective in NPs, which conceptually simplifies the framework and empirically improves performance for ConvLNPs. We demonstrate the strong performance and generalisation capabilities of ConvCNPs and ConvLNPs on 1D regression, image completion, and various tasks with real-world spatio-temporal data.

The work in this chapter is based on two publications, ‘Convolutional Conditional Neural Processes’ (Gordon et al., 2020) and ‘Meta-learning Stationary Stochastic Process Prediction with Convolutional Neural Processes’ (Foong et al., 2020a). The research in Gordon et al. (2020) was conducted with Jonathan Gordon, Wessel P. Bruinsma, James Requeima, Yann Dubois and Richard E. Turner.
The research in these publications also appears in the PhD theses of my collaborators Jonathan Gordon (Gordon, 2021) and Wessel P. Bruinsma (forthcoming), both submitted to the University of Cambridge. I introduced the density channel into the ConvCNP model, verified and assisted with the proof of the main representation theorem, performed the initial experiments on simple time-series and the first on-the-grid experiments, and contributed to writing and editing the paper. The research in Foong et al. (2020a) was conducted with my co-first authors Wessel P. Bruinsma and Jonathan Gordon, along with Yann Dubois, James Requeima and Richard E. Turner. I was involved with conceptualising the model, proving theoretical results, planning and running the experiments on environmental data, and writing the paper.

5.1 Introduction

Incorporating appropriate inductive biases into machine learning models is key to achieving good generalisation performance. Consider, for example, the task of predicting rainfall at an unseen test location from rainfall measurements nearby. A powerful inductive bias for this task is stationarity: the assumption that the generative process governing rainfall is spatially homogeneous. Given only observations in a limited part of the space, stationarity allows the model to extrapolate to yet unobserved regions. Closely related to stationarity is translation equivariance. Translation equivariance formalises the intuitive idea that if an observed dataset is shifted in time or space, then the resulting predictions should be shifted by the same amount. This is illustrated schematically in Figure 5.1. When stationarity or translation equivariance is appropriate, e.g. in time-series (Roberts et al., 2013), images (LeCun et al., 1998), and spatio-temporal modelling (Cressie, 1990; Delhomme, 1978), incorporating them into our models yields significant benefits.
As such, NPs would ideally have translation equivariance built directly into the modelling assumptions as an inductive bias when appropriate. However, current NP models must learn this structure from the dataset instead, which is sample and parameter inefficient, and impacts the ability of the model to generalise.

Fig. 5.1 Schematic illustration of translation equivariance in stochastic process prediction. The top row shows a context set, and the corresponding predictive distribution obtained by passing the context set through a prediction map, e.g., a well-trained neural process. The bottom row shows the same context set, but with the input values shifted horizontally by an amount $\tau \in \mathbb{R}$. In a translation equivariant neural process, the resulting predictive distribution will be identical to that in the top row, except it is also shifted horizontally by $\tau$.

The goal of this chapter is to build translation equivariance into NPs. Famously, convolutional neural networks (CNNs) incorporate translation equivariant convolutional layers (Cohen and Welling, 2016; Fukushima and Miyake, 1982; LeCun et al., 1998). However, it is not straightforward to generalise NPs in an analogous way for the following reasons:

1. CNNs require data to live ‘on the grid’. For example, image pixels and audio recordings usually live on a regularly spaced grid. In the 1-dimensional input case, audio recordings sample a waveform at times $(\ldots, x_0 - \epsilon, x_0, x_0 + \epsilon, \ldots)$. An analogous sampling procedure for image data occurs in the 2-dimensional case, where the shifts $\epsilon$ are now two-dimensional. However, many domains we would like to apply NPs to have data that live ‘off the grid’. For example, some time series data may be observed irregularly at any time $t \in \mathbb{R}$, or observations of weather may occur at irregularly spaced stations at locations $x \in \mathbb{R}^2$. We must modify the standard CNN forward pass to be able to handle these situations as well.

2. NPs operate on partially observed context sets, in the sense that the function is not observed everywhere, but only at certain points. However, in CNNs the input image is usually free from missing values.

3. NPs rely on embedding sets into a finite-dimensional vector space for which the notion of equivariance with respect to input translations is not well defined. For example, consider the case where the inputs are two dimensional, and the context set is translated in input space by some amount $\tau \in \mathbb{R}^2$. Standard MLP-CNPs will have a representation of the context set given by some vector $R(D) \in \mathbb{R}^{d_R}$, where, e.g., $d_R = 256$. It is not clear how to represent a shift of $R(D)$ by $\tau$, i.e., it is not straightforward to define the action of the 2-dimensional translation group on vector spaces of arbitrary dimensionality.

In this chapter, we introduce the ConvCNP and ConvLNP, new members of the NP family that address these challenges and account for translation equivariance. Our key contributions can be summarised as follows:

1. We introduce the ConvCNP, a translation equivariant neural process that makes factorised predictions.

2. We introduce the ConvLNP, a translation equivariant neural process that uses a latent variable to induce dependencies in its predictive distribution.

3. We evaluate the new training objective for LNPs that was proposed in Section 4.4.3, which discards variational inference in favour of a biased Monte Carlo estimate of the maximum likelihood objective. We empirically show that this objective improves performance for ConvLNPs.

4. We evaluate both the ConvCNP and ConvLNP experimentally and demonstrate that they exhibit excellent performance on several synthetic and real-world benchmarks.

5.1.1 Translation equivariance and stationarity

As we saw in Section 4.1.4, NP learning can be seen as approximating the exact prediction map from datasets to predictive stochastic processes $\pi_P$.
The prediction map $\pi_P$ for stationary stochastic processes possesses two important symmetries. First, as described in Section 4.2, $\pi_P$ is invariant to permutations of $D_c$ (Zaheer et al., 2017). This is a symmetry respected by all NPs thanks to (variations of) the deep sets construction described in Section 4.2. Second, specifically for stationary stochastic processes, $\pi_P$ is translation equivariant: whenever an input to the map is translated, its output is translated by the same amount, as described in Figure 5.1. To state this precisely, we make the following definitions:

Definition 1 (Translating datasets and stochastic processes). We define the action of the translation operator $T_\tau$ on datasets and stochastic processes, where $\tau \in \mathcal{X}$ denotes the shift vector of the translation:¹

1. Translating datasets. Let $((x_n, y_n))_{n=1}^{N} = D \in \mathcal{Z}$. For the index set $x = (x_1, \ldots, x_N)$, translation by $\tau$ is defined as $T_\tau x = (x_1 + \tau, \ldots, x_N + \tau)$. Similarly, $T_\tau D := ((x_n + \tau, y_n))_{n=1}^{N}$.

2. Translating functions. For a function $f \in \mathcal{Y}^{\mathcal{X}}$, define $T_\tau f(x) := f(x - \tau)$ for all $x \in \mathcal{X}$. Let $F \subseteq \mathcal{Y}^{\mathcal{X}}$. Then we define the translation of this set of functions as $T_\tau F := \{T_\tau f : f \in F\}$.

3. Translating stochastic processes. For any stochastic process $P \in \mathcal{P}(\mathcal{X}, \mathcal{Y})$, we define the translation of the stochastic process $T_\tau P$ by setting the probability it assigns to a measurable set $F \in \Sigma$ as² $T_\tau P(F) := P(T_{-\tau}F)$.

Definition 2 (Stationary stochastic process). We say a stochastic process is (strictly) stationary if the densities of its finite marginals satisfy

\begin{equation}
p(y_t|x_t) = p(y_t|T_\tau x_t) \tag{5.1}
\end{equation}

for all $(x_t, y_t) \in \mathcal{Z}$ and $\tau \in \mathcal{X}$.

We are now ready to give a precise definition of a translation equivariant prediction map, as illustrated in Figure 5.1:

¹ To prevent notational clutter, the same symbol, $T_\tau$, will be used to denote translations of datasets, functions, sets of functions and stochastic processes.

² Recall from Section 4.1.2 that $\Sigma$ denotes the product $\sigma$-algebra on $\mathcal{Y}^{\mathcal{X}}$.
$P(T_{-\tau}F)$ is well-defined since $\Sigma$ is closed under translations. Equivalently, we could define $T_\tau P$ as the push-forward of $P$ under the translation map on functions, $T_\tau\colon \mathcal{Y}^{\mathcal{X}} \to \mathcal{Y}^{\mathcal{X}}$.

Definition 3 (Translation equivariant prediction maps). We say that $\Psi\colon \mathcal{Z} \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$ is translation equivariant if $\Psi(T_\tau D) = T_\tau\Psi(D)$ for any dataset $D \in \mathcal{Z}$ and shift $\tau \in \mathcal{X}$.

Having defined what we mean by stationarity and translation equivariance, the following simple statement highlights the intimate link between these concepts:

Proposition 6. Let $P$ be a stationary stochastic process. Then the prediction map $\pi_P$ is translation equivariant.³

Proof. Let $p(y_t|x_t, D_c)$ denote the finite dimensional density of $\pi_P(D_c)$ at index set $x_t$. To show that $\pi_P(T_\tau D_c) = T_\tau\pi_P(D_c)$ it suffices to show that $p(y_t|x_t, T_\tau D_c) = p(y_t|T_{-\tau}x_t, D_c)$. We have

\begin{align}
p(y_t|x_t, T_\tau D_c) &= \frac{p(y_t, y_c|x_t, T_\tau x_c)}{p(y_c|T_\tau x_c)} \tag{5.2}\\
&= \frac{p(y_t, y_c|T_{-\tau}x_t, x_c)}{p(y_c|x_c)} \tag{5.3}\\
&= p(y_t|T_{-\tau}x_t, D_c), \tag{5.4}
\end{align}

where we used the stationarity assumption in the second line.

³ We exclude conditioning on observations that have zero density, so that the prediction map is well defined.

Proposition 6 suggests that models for the prediction map should also be made translation equivariant and permutation invariant. As such models are a small subset of the space of all models, building in these properties can greatly improve data efficiency and generalisation for stationary stochastic process prediction. In the next section, we describe how this can be done for NPs by extending the deep sets theorem of Section 4.3 to incorporate translation equivariance.

5.2 Convolutional deep sets

We are interested in translation equivariance (Definition 3) with respect to translations on $\mathcal{X}$. The encoder for both MLP-based NPs and attentive NPs maps datasets $D$ to an embedding in a vector space $\mathbb{R}^{d_R}$, for which the notion of equivariance with respect to input translations in $\mathcal{X}$ is not well defined. For example, a function $f$ on $\mathcal{X}$ can be translated by $\tau \in \mathcal{X}$ to form $f(\cdot - \tau)$. However, for a vector $R \in \mathbb{R}^{d_R}$, which can be seen as a function $R\colon \{1, \ldots, d_R\} \to \mathbb{R}$, with $R(i) = R_i$, the translation $R(\cdot - \tau)$ does not make sense, since it is not clear how to add a translation $\tau \in \mathcal{X}$ to the discrete index of a finite-dimensional vector. Another way to say this is that there is no natural way for the translation group of $\mathcal{X}$ (where often $\mathcal{X} = \mathbb{R}$ or $\mathbb{R}^2$) to act on the space of finite-dimensional vector representations $\mathbb{R}^{d_R}$, when $d_R \neq 1, 2$.

To overcome this, we define the encoder of the convolutional neural process $E\colon \mathcal{Z} \to \mathcal{H}$ to map into a function space $\mathcal{H}$ containing functions on $\mathcal{X}$. Since functions in $\mathcal{H}$ live on $\mathcal{X}$, our notion of translation equivariance (Definition 3) now also makes sense for $E$. As we will see below, every translation equivariant function on sets has a representation in terms of a specific functional embedding.

Definition 4 (Functional mappings on sets and functional representations of sets). Call a map $E\colon \mathcal{Z} \to \mathcal{H}$ a functional mapping on sets if it maps from the space of datasets $\mathcal{Z}$ to an appropriate space of functions $\mathcal{H}$. We call $E(Z)$ the functional representation of the set $Z$. Furthermore, the functional representation $E$ is translation equivariant if $E(T_\tau D) = T_\tau E(D)$ for all $\tau \in \mathcal{X}$ and $D \in \mathcal{Z}$.

Considering functional representations of sets leads to our key result for convolutional NPs, which can be summarised as follows: For an appropriately chosen $\mathcal{Z}' \subset \mathcal{Z}$, a continuous function $\Phi\colon \mathcal{Z}' \to C_b(\mathcal{X}, \mathcal{Y})$ is both permutation invariant and translation equivariant if and only if it is of the form

\begin{equation}
\Phi(Z) = \rho(E(Z)), \qquad E(Z) = \sum_{(x,y)\in Z}\phi(y)\,\psi(\cdot - x) \in \mathcal{H}, \tag{5.5}
\end{equation}

for some continuous and translation equivariant $\rho\colon \mathcal{H} \to C_b(\mathcal{X}, \mathcal{Y})$, and appropriate $\phi$ and $\psi$. Note that here $\rho$ is a map between function spaces. Equation (5.5) defines the encoder used by our proposed models, the ConvCNP and ConvLNP. In Section 5.2.1, we describe this theoretical result in more detail.
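To make the functional embedding $E$ of Equation (5.5) concrete, the following minimal sketch builds $E(D)$ as a Python closure for one-dimensional inputs and checks the two symmetries numerically. It assumes an exponentiated quadratic $\psi$ and $\phi(y) = (1, y)$, anticipating the choices discussed in Section 5.3; the length scale, dataset values, and function names are hypothetical.

```python
import math


def eq_kernel(r, length_scale=0.5):
    """Exponentiated quadratic psi, evaluated at a displacement r (assumed form)."""
    return math.exp(-0.5 * (r / length_scale) ** 2)


def phi(y):
    """Assumed feature map phi(y) = (1, y), i.e. multiplicity K = 1."""
    return (1.0, y)


def encode(dataset):
    """Return the function E(D) = sum over (x, y) in D of phi(y) * psi(. - x)."""
    def representation(x_query):
        channels = [0.0, 0.0]
        for x, y in dataset:
            weight = eq_kernel(x_query - x)
            for k, feature in enumerate(phi(y)):
                channels[k] += feature * weight
        return tuple(channels)
    return representation


def translate(dataset, tau):
    """T_tau D: shift every input location by tau, leaving outputs unchanged."""
    return [(x + tau, y) for x, y in dataset]


dataset = [(-1.0, 0.5), (0.2, -1.3), (1.5, 2.0)]
tau = 0.7
queries = [-2.0, -0.3, 0.0, 0.9, 2.4]

original = encode(dataset)
shifted = encode(translate(dataset, tau))
permuted = encode(list(reversed(dataset)))

# Translation equivariance: E(T_tau D)(x) = E(D)(x - tau).
equivariance_error = max(
    abs(a - b)
    for q in queries
    for a, b in zip(shifted(q), original(q - tau))
)
# Permutation invariance: reordering the dataset leaves E(D) unchanged.
permutation_error = max(
    abs(a - b)
    for q in queries
    for a, b in zip(permuted(q), original(q))
)
```

Both errors are zero up to floating-point rounding. The embedding is a function of the query location rather than a fixed-length vector, which is what allows a translation of the inputs to act on it.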
The result provides an extension of the key result of Zaheer et al. (2017) to functional representations on sets, and shows that it can naturally be extended to handle varying-size sets. The practical implementation of ConvCNPs and ConvLNPs (that is, the design of $\rho$, $\phi$, and $\psi$) is informed by the results in Section 5.2.1, and is discussed for domains of interest in Section 5.3.

5.2.1 Representing translation equivariant functions on sets

In this section we discuss the theoretical foundations of the ConvCNP and ConvLNP encoder. We begin by stating a definition that is used in the main result.

Definition 5 (Multiplicity). A collection of datasets $\mathcal{Z}' \subseteq \mathcal{Z}$ is said to have multiplicity $K$ if, for every dataset $Z \in \mathcal{Z}'$, every input value $x$ occurs at most $K$ times.

For example, in the case of real-world data like time series and images, we often observe only one (possibly multi-dimensional) observation per input location, which corresponds to multiplicity one, since none of the input values are repeated within a single time series or image. We now state our key representation theorem.

Theorem 6. Consider an appropriate⁴ collection of datasets $\mathcal{Z}'_{\le M} \subseteq \mathcal{Z}_{\le M}$ with multiplicity $K$. Then a function $\Phi\colon \mathcal{Z}'_{\le M} \to C_b(\mathcal{X}, \mathcal{Y})$ is continuous⁵, permutation invariant, and translation equivariant if and only if it is of the form

\begin{equation}
\Phi(Z) = \rho(E(Z)), \qquad E((x_1, y_1), \ldots, (x_m, y_m)) = \sum_{i=1}^{m}\phi(y_i)\,\psi(\cdot - x_i) \tag{5.6}
\end{equation}

for some continuous and translation equivariant $\rho\colon \mathcal{H} \to C_b(\mathcal{X}, \mathcal{Y})$ and some continuous $\phi\colon \mathcal{Y} \to \mathbb{R}^{K+1}$ and $\psi\colon \mathcal{X} \to \mathbb{R}$, where $\mathcal{H}$ is an appropriate space of functions that includes the range of $E$. We call a function $\Phi$ of the above form a ConvDeepSet.

The proof of the ‘if’ direction is straightforward:

Proof of sufficiency. First, $\Phi$ is permutation invariant, because addition is commutative and associative.
Second, the translation equivariance of $\Phi$ follows from direct verification, using the fact that $\rho$ is also translation equivariant:

\begin{align}
\Phi(T_\tau Z) &= \rho\left(\sum_{i=1}^{M}\phi(y_i)\,\psi(\cdot - (x_i + \tau))\right) \tag{5.7}\\
&= \rho\left(\sum_{i=1}^{M}\phi(y_i)\,\psi((\cdot - \tau) - x_i)\right) \tag{5.8}\\
&= \rho\left(\sum_{i=1}^{M}\phi(y_i)\,\psi(\cdot - x_i)\right)(\cdot - \tau) \tag{5.9}\\
&= \Phi(Z)(\cdot - \tau) \tag{5.10}\\
&= T_\tau\Phi(Z). \notag
\end{align}

The proof of the ‘only if’ direction is much more technical, and requires topological considerations, primarily to make precise the notion of a continuous map from $\mathcal{Z}'_{\le M} \to C_b(\mathcal{X}, \mathcal{Y})$. This is complicated by the fact that $\mathcal{Z}'_{\le M}$ is a union of (subsets of) vector spaces with differing dimensionality, and the fact that $C_b(\mathcal{X}, \mathcal{Y})$ is an infinite-dimensional function space. The crux of the proof is to show that the proposed embedding $E$ is a homeomorphism (that is, a continuous map with a continuous inverse) between $\mathcal{Z}'_{\le M}$ and a space constructed from certain reproducing kernel Hilbert spaces that have $\psi$ as their reproducing kernel. Once this has been established, the rest of the proof is straightforward:

⁴ For every $m \in \{1, \ldots, M\}$, $\mathcal{Z}'_{\le M} \cap \mathcal{Z}_m$ must be closed and closed under permutations and translations.

⁵ For every $m \in \{1, \ldots, M\}$, the restriction $\Phi|_{\mathcal{Z}'_{\le M} \cap \mathcal{Z}_m}$ is continuous.

Proof sketch of necessity (incomplete, informal). The proof follows the strategy used by Zaheer et al. (2017) and Wagstaff et al. (2019). We choose $\psi$ to be the exponentiated quadratic (EQ) kernel,

\begin{equation}
\psi(x, x') = \sigma^2\exp\left(-\frac{1}{2\ell^2}\|x - x'\|^2\right). \tag{5.11}
\end{equation}

Let $D \in \mathcal{Z}'_{\le M}$ be a dataset. Assume $E$ is a homeomorphism. By invertibility of $E$, $D = E^{-1}(E(D))$. Therefore,

\begin{equation}
\Phi(D) = \Phi(E^{-1}(E(D))) = (\Phi \circ E^{-1})\left(\sum_{i=1}^{M}\phi(y_i)\,\psi(\cdot - x_i)\right). \tag{5.12}
\end{equation}

Let $\mathcal{H}$ denote an appropriate space of functions that includes the range of $E$. Define $\rho\colon \mathcal{H} \to C_b(\mathcal{X}, \mathcal{Y})$ by $\rho = \Phi \circ E^{-1}$. First, $\rho$ is continuous since $\Phi$ is continuous and $E^{-1}$ is continuous as $E$ is a homeomorphism. Second, $E^{-1}$ is translation equivariant, because $\psi$ is a stationary kernel. Also, $\Phi$ is translation equivariant by assumption.
Thus their composition $\rho$ is also translation equivariant. Hence any continuous, translation equivariant map $\Phi$ can be written in the form given in Equation (5.6).

The full proof of necessity is beyond the scope of this thesis, and is provided in Gordon et al. (2020, appendix A). Here we discuss several key points from the proof that have practical implications and provide insights for the design of convolutional NPs:

1. For the construction of $\rho$ and $E$, $\psi$ is set to be a flexible positive-definite kernel (Equation (5.11)) associated with a reproducing kernel Hilbert space (RKHS; Aronszajn (1950)), which results in desirable properties for $E$.

2. Using the work of Zaheer et al. (2017), we set $\phi(y) = (y^0, y^1, \ldots, y^K)$ to be the powers of $y$ up to order $K$, where $K$ is the multiplicity.

3. Theorem 6 requires $\rho$ to be a powerful function approximator of continuous, translation equivariant maps between functions.

In Section 5.3, we discuss how these theoretical results inform our implementation of the ConvCNP.

Theorem 6 extends the result of Zaheer et al. (2017) discussed in Section 4.3 by embedding the set into an infinite-dimensional space (the RKHS) instead of a finite-dimensional space. Beyond allowing the model to exhibit translation equivariance, the RKHS formalism allows us to naturally deal with finite sets of varying sizes, which turns out to be challenging with finite-dimensional embeddings. Furthermore, our formalism requires $\phi(y) = (y^0, y^1, y^2, \ldots, y^K)$ to expand up to order no more than the multiplicity of the sets $K$; if $K$ is bounded, then our results hold for sets up to any arbitrarily large finite size $M$, while fixing $\phi$ to be only $(K + 1)$-dimensional.

5.3 Convolutional conditional neural processes

In this section we discuss the architecture and implementation details for ConvCNPs, which produce factorised predictive distributions.
Similarly to other CNPs, ConvCNPs model the conditional distribution as

\begin{equation}
p(y|x, D) = \prod_{n=1}^{N}p(y_n|\Phi_\theta(D)(x_n)) = \prod_{n=1}^{N}\mathcal{N}(y_n; \mu_n, \sigma_n) \quad\text{with}\quad (\mu_n, \sigma_n) = \Phi_\theta(D)(x_n), \tag{5.13}
\end{equation}

where $D$ is the observed dataset and $\Phi$ is a ConvDeepSet (Theorem 6). Here we denote the learnable parameters of $\Phi$ as $\theta$. As with other CNPs, the ConvCNP has fully tractable predictive likelihoods. This allows us to use the simple maximum-likelihood objective to learn $\theta$, as described in Section 4.4.1. We now turn to the architectural details of the ConvCNP. The key considerations are the design of $\phi$, $\psi$, and $\rho$ for $\Phi$ (see Theorem 6).

Form of $\phi$. The applications considered in this thesis have a single (potentially multi-dimensional) output per input location, so the multiplicity of $\mathcal{Z}$ is one (i.e., $K = 1$). It then suffices to let $\phi$ be a power series of order one, which is equivalent to appending a constant to $y$ in all datasets, i.e. $\phi(y) = [1, y]^\top$. The first output $\phi_1$ thus provides the model with information regarding where data has been observed, which is necessary to distinguish, for example, between having no observed datapoint at $x$ and a datapoint at $x$ with $y = 0$. Denoting the functional representation as $h$, we can think of the first channel $h^{(0)}$ as a ‘density channel’: it gives information about how densely in space data has been observed at a particular location. We found it helpful to divide the remaining channels $h^{(1:)}$ by $h^{(0)}$ (Figures 5.2b and 5.2c, line 5), as this improved performance when there is large variation in the density of input locations. In the image processing literature, this is known as a normalised convolution (Knutsson and Westin, 1993). The normalisation operation can be reversed by $\rho$ and therefore does not restrict the expressivity of the model. Furthermore, this normalised signal channel can be viewed as an implementation of the Nadaraya-Watson estimator for the mean function (Nadaraya, 1964; Watson, 1964).

[Figure 5.2(a): schematic of the ConvCNP forward pass for off-the-grid data. Arrow 1 maps the context set $D_c = (x_n, y_n)_{n=1}^{N}$ to the functional representation, with density channel $h^{(0)} = \sum_n\psi(\cdot - x_n)$ and normalised signal channel $h^{(1)} = \sum_n y_n\psi(\cdot - x_n)/\sum_n\psi(\cdot - x_n)$. Arrow 2 evaluates the representation at the discretisation $(t_i)_{i=1}^{T}$. Arrow 3 applies the CNN and predicts $[\mu(x^*), \sigma(x^*)]^\top = \sum_{i=1}^{T}[f_\mu(t_i), e^{f_\sigma(t_i)}]^\top\psi_\rho(x^* - t_i)$, giving $p(y^*_3|x^*_3, D_c)$ at a target location $x^*_3$.]

Figure 5.2(b): pseudo-code for off-the-grid data.

    require: $\rho = (\mathrm{CNN}, \psi_\rho)$, $\psi$, and density $\gamma$
    require: context $(x_n, y_n)_{n=1}^{N}$, target $(x^*_m)_{m=1}^{M}$
     1  begin
     2    lower, upper $\leftarrow$ range$\left((x_n)_{n=1}^{N} \cup (x^*_m)_{m=1}^{M}\right)$
     3    $(t_i)_{i=1}^{T} \leftarrow$ uniform_grid(lower, upper; $\gamma$)
     4    $h_i \leftarrow \sum_{n=1}^{N}[1, y_n]^\top\psi(t_i - x_n)$
     5    $h_i^{(1)} \leftarrow h_i^{(1)}/h_i^{(0)}$
     6    $(f_\mu(t_i), f_\sigma(t_i))_{i=1}^{T} \leftarrow \mathrm{CNN}((t_i, h_i)_{i=1}^{T})$
     7    $\mu_m \leftarrow \sum_{i=1}^{T}f_\mu(t_i)\,\psi_\rho(x^*_m - t_i)$
     8    $\sigma_m \leftarrow \sum_{i=1}^{T}\mathrm{pos}(f_\sigma(t_i))\,\psi_\rho(x^*_m - t_i)$
     9    return $(\mu_m, \sigma_m)_{m=1}^{M}$
    10  end

Figure 5.2(c): pseudo-code for on-the-grid data.

    require: $\rho = \mathrm{CNN}$ and $E = \mathrm{conv}_\theta$
    require: image $I$, context mask $M_c$, and target mask $M_t$
     1  begin
     2    // We discretise at the pixel locations.
     3    $Z_c \leftarrow M_c \odot I$   // Extract context set.
     4    $h \leftarrow \mathrm{conv}_\theta([M_c, Z_c]^\top)$
     5    $h^{(1:C)} \leftarrow h^{(1:C)}/h^{(0)}$
     6    $f_t \leftarrow M_t \odot \mathrm{CNN}(h)$
     7    $\mu \leftarrow f_t^{(1:C)}$
     8    $\sigma \leftarrow \mathrm{pos}(f_t^{(C+1:2C)})$
     9    return $(\mu, \sigma)$
    10  end

Fig. 5.2 (a) Illustration of the ConvCNP forward pass in the off-the-grid case and pseudo-code for (b) off-the-grid and (c) on-the-grid data. The function $\mathrm{pos}\colon \mathbb{R} \to (0, \infty)$ is used to enforce positivity.

Having specified $\phi$, it remains to specify the form of $\psi$ and $\rho$.
Our choice for $\psi$ and $\rho$ will depend on whether the data lies on-the-grid or off-the-grid, as we detail in the next sections.

5.3.1 ConvCNPs for off-the-grid data

We first describe the form of $\psi$ and $\rho$ in the case where data lives off-the-grid. Our proof of Theorem 6 suggests that $\psi$ should be a stationary, non-negative, positive-definite kernel. The exponentiated-quadratic (EQ) kernel with a learnable length scale parameter is a natural choice. This kernel is multiplied by $\phi$ to form the functional representation $E(D)$ (Figure 5.2b, line 4; and Figure 5.2a, arrow 1).

Next, Theorem 6 suggests that $\rho$ should be a continuous, translation equivariant map between function spaces. Yarotsky (2022, Theorem 3.1) shows that any translation equivariant continuous function can be arbitrarily well approximated by a CNN. Furthermore, using a CNN for $\rho$ allows us to take advantage of all the considerable work put in by the research community on designing and optimising CNN architectures. However, CNNs operate on discrete (on-the-grid) input spaces and produce discrete outputs. Hence in order to approximate $\rho$ with a CNN, we discretise the input of $\rho$, apply the CNN, and finally transform the CNN output back to a continuous function $\mathcal{X} \to \mathcal{Y}$. To do this, for each context and test set, we space points $(t_i)_{i=1}^{n} \subset \mathcal{X}$ on a uniform grid (at a pre-specified density) over a hyper-cube that covers both the context and target inputs. We then evaluate $(E(D)(t_i))_{i=1}^{n}$ (Figure 5.2b, lines 2–3; Figure 5.2a, arrow 2). This discretised representation of $E(D)$ is then passed through a CNN (Figure 5.2b, line 6; Figure 5.2a, arrow 3).

To map the output of the CNN back to a continuous function $\mathcal{X} \to \mathcal{Y}$, we use the CNN outputs as weights for evenly-spaced basis functions (again employing the EQ kernel), which we denote by $\psi_\rho$ (Figure 5.2b, lines 7–8; Figure 5.2a, arrow 3).
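The off-the-grid procedure described above (Figure 5.2b) can be sketched end to end for one-dimensional inputs. This is a minimal illustration under stated assumptions, not the architecture used in the experiments: the trained CNN is replaced by a single fixed smoothing filter (`smooth`), the length scale and grid density are arbitrary, `pos` is taken to be a softplus, and a small constant is added to the density channel to guard against division by zero. All function names are hypothetical.

```python
import math


def eq_kernel(r, length_scale):
    """Exponentiated quadratic kernel at displacement r."""
    return math.exp(-0.5 * (r / length_scale) ** 2)


def softplus(v):
    """Assumed choice for pos: a numerically stable softplus, always positive."""
    return math.log1p(math.exp(-abs(v))) + max(v, 0.0)


def smooth(signal, taps=(0.25, 0.5, 0.25)):
    """Stand-in for the CNN: one fixed 1D convolution with zero padding."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(taps):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out


def convcnp_forward(context, targets, density=16.0, length_scale=0.3):
    """Follow Figure 5.2b: discretise, encode, normalise, convolve, decode."""
    xs = [x for x, _ in context]
    lower, upper = min(xs + targets), max(xs + targets)          # line 2
    num_points = int(math.floor((upper - lower) * density)) + 1
    grid = [lower + i / density for i in range(num_points)]      # line 3
    # Line 4: density channel h0 and signal channel h1 on the grid.
    h0 = [sum(eq_kernel(t - x, length_scale) for x, _ in context) for t in grid]
    h1 = [sum(y * eq_kernel(t - x, length_scale) for x, y in context) for t in grid]
    # Line 5: normalised convolution (the epsilon is an added safeguard).
    h1 = [s / (d + 1e-8) for s, d in zip(h1, h0)]
    # Line 6: the placeholder "CNN" maps the two channels to (f_mu, f_sigma).
    f_mu, f_sigma = smooth(h1), smooth(h0)
    # Lines 7-8: decode with EQ basis functions psi_rho centred on the grid.
    mu = [
        sum(f * eq_kernel(x_star - t, length_scale) for f, t in zip(f_mu, grid))
        for x_star in targets
    ]
    sigma = [
        sum(softplus(f) * eq_kernel(x_star - t, length_scale) for f, t in zip(f_sigma, grid))
        for x_star in targets
    ]
    return mu, sigma


context = [(-1.0, 0.3), (-0.2, -0.4), (0.5, 1.2), (1.1, 0.8)]
targets = [-0.6, 0.1, 0.8, 1.47]
mu, sigma = convcnp_forward(context, targets)

# Shifting context and targets together should leave the predictions unchanged.
tau = 0.35
mu_shifted, sigma_shifted = convcnp_forward(
    [(x + tau, y) for x, y in context], [t + tau for t in targets]
)
```

Because the grid in this sketch is anchored to the smallest input, shifting the context and target inputs together shifts the grid with them, so the outputs agree up to rounding. With a fixed grid or a different anchoring scheme, equivariance would hold only approximately, at length scales larger than the grid spacing.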
The resulting approximation to $\rho$ is not perfectly translation equivariant, but will be approximately so for length scales larger than the spacing of $(E(D)(t_i))_{i=1}^{n}$. The resulting continuous functions are then used to generate the (Gaussian) predictive mean and variance at any input. This, in turn, can be used to evaluate the log-likelihood.

5.3.2 ConvCNPs for on-the-grid data

We next discuss the ConvCNP architecture in the case where the data live on a regularly spaced grid. While the ConvCNP is readily applicable to many on-the-grid settings, here we focus on images (other on-the-grid data formats can be viewed as a kind of image, potentially with many channels). As such, the following description uses the image completion task as an example, which is often used to benchmark NPs (Garnelo et al., 2018a; Kim et al., 2018). Compared to the off-the-grid case, the implementation becomes simpler as we can naturally choose the discretisation $(t_i)_{i=1}^{n}$ to be the pixel locations.

Let $I \in \mathbb{R}^{H \times W \times C}$ be an image, where $H$, $W$, $C$ denote the height, width, and number of channels, respectively, and let $M_c$ be the context mask, which is defined such that $[M_c]_{i,j} = 1$ if pixel location $(i, j)$ is in the context set, and 0 otherwise. Let $\odot$ denote the element-wise or Hadamard product. To implement $\phi$, we select all context points by multiplying with the mask, $Z_c := M_c \odot I$, and prepend the context mask: $\phi = [M_c, Z_c]^\top$ (Figure 5.2c, line 4). Here the context mask provides information to the ConvCNP about where the data are observed.

Next, we apply a single convolution layer to the context mask to form the on-the-grid density channel: $h^{(0)} = \mathrm{conv}_\theta(M_c)$ (Figure 5.2c, line 4). To all other channels, we apply a normalised convolution: $h^{(1:C)} = \mathrm{conv}_\theta(y)/h^{(0)}$ (Figure 5.2c, line 5), where the division is element-wise.
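The on-the-grid encoder described above can be sketched for a single-channel image as follows. This is an illustrative example under stated assumptions: the positive $3 \times 3$ filter is fixed rather than learned, zero padding is used at the image boundary, and a small floor on the density channel avoids division by zero where no context pixel falls inside the filter. The function names are hypothetical.

```python
def conv2d_same(image, kernel):
    """2D cross-correlation with zero padding and output the same size as the input."""
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    off_h, off_w = k_h // 2, k_w // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for a in range(k_h):
                for b in range(k_w):
                    r, c = i + a - off_h, j + b - off_w
                    if 0 <= r < height and 0 <= c < width:
                        acc += kernel[a][b] * image[r][c]
            out[i][j] = acc
    return out


def encode_on_grid(image, mask, kernel, eps=1e-8):
    """Return the density channel conv(M_c) and the normalised signal
    conv(M_c * I) / conv(M_c), following Figure 5.2c, lines 3-5."""
    context = [[m * v for m, v in zip(m_row, i_row)] for m_row, i_row in zip(mask, image)]
    density = conv2d_same(mask, kernel)
    signal = conv2d_same(context, kernel)
    normalised = [
        [s / max(d, eps) for s, d in zip(s_row, d_row)]
        for s_row, d_row in zip(signal, density)
    ]
    return density, normalised


# Hypothetical positive, symmetric filter playing the role of psi.
kernel = [
    [0.05, 0.10, 0.05],
    [0.10, 0.40, 0.10],
    [0.05, 0.10, 0.05],
]
# A constant image observed at only two pixels.
image = [[0.7] * 4 for _ in range(4)]
mask = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
density, normalised = encode_on_grid(image, mask, kernel)
```

For this constant image, the normalised channel recovers the constant wherever the density channel is positive, regardless of how many context pixels lie under the filter. This is the behaviour that makes the normalised convolution robust to variation in context density. In the bottom row, where no context pixel is within reach of the filter, both channels are zero and the density channel is what tells the downstream CNN that nothing was observed there.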
The filter of the convolution is analogous to $\psi$, which means that $h$ is the functional representation, with the convolution performing the role of $E$ (the summation in Figure 5.2b, line 4). Although the theory suggests using a non-negative, positive-definite kernel, we did not find significant empirical differences between an EQ kernel and using a fully trainable kernel restricted to positive values to enforce non-negativity.

Lastly, we describe the on-the-grid version of $\rho(\cdot)$, which consists of two stages. First, we apply a CNN to $E(D)$ (Figure 5.2c, line 6). Second, we apply a shared, pointwise MLP that maps the output of the CNN at each pixel location in the target set to $\mathbb{R}^{2C}$, where we absorb the pointwise MLP into the CNN. The first $C$ outputs of the MLP are the means of the Gaussian predictive distribution and the second $C$ are the standard deviations, which are then passed through a positivity-enforcing function (Figure 5.2c, lines 7–8). To summarise, the on-the-grid algorithm is given by

\begin{equation}
(\mu, \mathrm{pos}^{-1}(\sigma)) = \underbrace{\mathrm{CNN}}_{\rho}\Big(\underbrace{\big[\,\underbrace{\mathrm{conv}(M_c)}_{\text{density channel}};\ \mathrm{conv}(M_c \odot I)/\underbrace{\mathrm{conv}}_{\text{multiplies by }\psi\text{ and sums}}(M_c)\,\big]^\top}_{E(\text{context set})}\Big), \tag{5.14}
\end{equation}

where $(\mu, \sigma)$ are the predicted image mean and standard deviation over the image locations, $\rho$ is implemented with the CNN, and $E$ is implemented with the mask $M_c$ and convolution conv. Here the semicolon denotes the stacking of different channels in the CNN input.

5.4 ConvCNP experimental results

We evaluate the performance of ConvCNPs in both on-the-grid and off-the-grid settings, focusing on two central questions:

1. Do translation equivariant models improve performance over non-translation equivariant models in appropriate domains?

2. Can translation equivariance enable ConvCNPs to generalise to settings outside of those encountered during training?
We use several off-the-grid data-sets which are irregularly sampled time series (X = R), comparing ConvCNPs against Gaussian processes (GPs; Rasmussen and Williams (2005)) and attentive CNPs (ACNP; which is identical to the ANP (Kim et al., 2018), but without the latent path in the encoder). We then evaluate on several on-the-grid image datasets (X = Z2). In all settings we demonstrate substantial improvements over the ACNP. For the CNN component of our model, we propose a small and large architecture for each experiment (in the experimental sections named ConvCNP and ConvCNPXL, respectively). We note that these architectures are different for off-the- grid and on-the-grid experiments, with full details regarding the architectures given in the appendices. 5.4.1 Synthetic 1D experiments We first consider synthetic regression problems. At each iteration, a function is sampled, followed by context and target sets. Beyond EQ-kernel GPs (as proposed in Garnelo et al. (2018a); Kim et al. (2018)), we consider more complex data arising from Matérn–5 2 and weakly-periodic kernels, as well as a challenging, non-Gaussian sawtooth process with random shift and frequency (see Figure 5.3 for an example). ConvCNP is compared to CNP (Garnelo et al., 2018a) and ACNP. Training and testing procedures are fixed across all models. Full details on models, data generation, and training procedures are provided in Appendix D.2. Table 5.1 reports the log-likelihood means and standard errors of the models over 1000 tasks. The context and target points for both training and testing lie within the interval [−2, 2], where training data was observed (marked ‘training data range’ in Figure 5.3). Table 5.1 demonstrates that, even when extrapolation beyond the training range is not required, the ConvCNP significantly outperforms other models in all cases, despite having fewer parameters. 5.4 ConvCNP experimental results 111 A C N P A C N P C o n v C N P C o n v C N P Fig. 
5.3 Example functions learned by the ACNP (top row), and ConvCNP (bottom row), when trained on a Matérn–5/2 kernel with length scale 0.25 (first and second column) and sawtooth function (third and fourth column). Columns one and three show the predictive distribution of the models when data is presented in the same range as training, with predictive distributions continuing beyond that range on either side. Columns two and four show the model predictive distribution when presented with data outside the training data range. Plots show means and two standard deviations.

Table 5.1 Log-likelihood and standard errors from synthetic 1-dimensional experiments.

Model     | Params | EQ           | Weak Periodic | Matérn–5/2   | Sawtooth
MLP-CNP   | 66818  | -0.86 ± 3e-3 | -1.23 ± 2e-3  | -0.95 ± 1e-3 | -0.16 ± 1e-5
ACNP      | 149250 | 0.72 ± 4e-3  | -1.20 ± 2e-3  | 0.10 ± 2e-3  | -0.16 ± 2e-3
ConvCNP   | 6537   | 0.70 ± 5e-3  | -0.92 ± 2e-3  | 0.32 ± 4e-3  | 1.43 ± 4e-3
ConvCNPXL | 50617  | 1.06 ± 4e-3  | -0.65 ± 2e-3  | 0.53 ± 4e-3  | 1.94 ± 1e-3

Table 5.2 Log-likelihood with standard errors from image experiments (6 runs).

Model     | Params | MNIST       | SVHN        | CelebA32    | CelebA64    | ZSMM
ACNP      | 410k   | 1.08 ± 0.04 | 3.94 ± 0.02 | 3.18 ± 0.02 | -           | -0.83 ± 0.08
ConvCNP   | 113k   | 1.21 ± 0.00 | 3.89 ± 0.01 | 3.22 ± 0.02 | 3.66 ± 0.01 | 1.18 ± 0.04
ConvCNPXL | 400k   | 1.27 ± 0.01 | 3.97 ± 0.02 | 3.39 ± 0.02 | 3.73 ± 0.01 | 0.86 ± 0.12

Figure 5.3 demonstrates that the ConvCNP generates excellent fits, even for challenging functions such as those sampled from the Matérn–5/2 GP and sawtooth process. Moreover, Figure 5.3 compares the performance of the ConvCNP and ACNP when data is observed outside the range where the models were trained: translation equivariance enables the ConvCNP to elegantly generalise to this setting, whereas the ACNP is unable to generate reasonable predictions.

5.4.2 2D image completion experiments

To test the ConvCNP beyond one-dimensional features, we evaluate our model on on-the-grid image completion tasks and compare it to the ACNP.
Image completion can be cast as a prediction of pixel intensities y∗_i (∈ R^3 for RGB, ∈ R for greyscale) given a target 2D pixel location x∗_i conditioned on an observed (context) set of pixel values D = ((x_n, y_n))_{n=1}^N. In the following experiments, the context set can vary but the target set contains all pixels from the image. Further experimental details are in Section D.3.1.

Standard image benchmarks. We first evaluate the model on four common benchmarks: MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), and 32 × 32 and 64 × 64 CelebA (Liu et al., 2018). Importantly, these datasets are biased towards images containing a single, well-centered object. As a result, perfect translation equivariance might hinder the performance of the model when the test data are similarly structured. However, if the receptive field of the CNN is larger than the input image, then the model can learn absolute-position specific features, due to the zero-padding (Islam et al., 2019). Hence we might expect larger models like the ConvCNPXL to perform better than the smaller ConvCNP in situations where absolute spatial position is important for the task. Table 5.2 shows that the ConvCNP significantly outperforms the ACNP when it has a large receptive field size, while being at least as good with a small receptive field size. Qualitative samples for various context sets can be seen in Figure 5.4.

Generalisation to multiple, non-centered objects. The datasets from the previous paragraphs were centered and contained single objects. Here we test whether ConvCNPs trained on such data can generalise to images containing multiple, non-centered objects. To test this, we introduce the zero-shot multi-MNIST (ZSMM) dataset. The training set contains all 60000 28 × 28 MNIST training digits centered on a black 56 × 56 background (Figure 5.5a).
For the test set, we randomly sample with replacement 10000 pairs of digits from the MNIST test set, place them on a black 56 × 56 background, and translate the digits in such a way that the digits can be arbitrarily close but cannot overlap (Figure 5.5b). Importantly, the scale of the digits and the image size are the same during training and testing.

The last column of Table 5.2 evaluates the models in the zero-shot multi-MNIST setting, where images contain multiple digits at test time. The ConvCNP significantly outperforms the ACNP on such tasks. Figure 5.6a shows a histogram of the image log-likelihoods for ConvCNP and ACNP, as well as qualitative results at different percentiles of the ConvCNP distribution. ConvCNP is able to extrapolate to this out-of-distribution test set, while ACNP appears to model the bias of the training data and predict a centered ‘mean’ digit independently of the context. Interestingly, ConvCNPXL does not perform as well on this task. In particular, we find that, as the receptive field becomes very large, performance on this task decreases. We hypothesize that this has to do with the behavior of the model at the edges of the image. CNNs with larger receptive fields (the receptive field being the region of input pixels that affects a particular output pixel) are able to model non-stationary behavior by looking at the distance from any pixel to the image boundary.

Although ZSMM is a contrived task, note that our field of view usually contains multiple independent objects, thereby requiring translation equivariance. As a more realistic example, we took a ConvCNP model trained on CelebA and tested it on a natural image of different shape which contains multiple people (Figure 5.6b). Even with 95% of the pixels removed, the ConvCNP was able to produce a qualitatively reasonable reconstruction.

Fig. 5.4 Qualitative evaluation of the ConvCNP (XL).
For each dataset, an image is randomly sampled; the first row shows the given context points while the second is the mean of the estimated conditional distribution. From left to right the first seven columns correspond to a context set with 3, 1%, 5%, 10%, 20%, 30%, 50%, 100% randomly sampled context points. In the last two columns, the context sets respectively contain all the pixels in the left and top half of the image. ConvCNPXL is shown for all datasets besides ZSMM, for which we show the fully translation equivariant ConvCNP.

Fig. 5.5 Samples from our generated zero-shot multi MNIST (ZSMM) dataset. (a) Train. (b) Test.

Fig. 5.6 Zero-shot generalisation to tasks that require translation equivariance. (a) Log-likelihood and qualitative results on ZSMM. The top row shows the log-likelihood distribution for both models. The images below correspond to the context points (top), ConvCNP target predictions (middle), and ACNP target predictions (bottom). Each column corresponds to a given percentile of the ConvCNP distribution. (b) Qualitative evaluation of a ConvCNPXL trained on the unscaled CelebA (218 × 178) and tested on Ellen’s Oscar unscaled (337 × 599) selfie (DeGeneres, 2014) with 5% of the pixels as context (top).

Computational efficiency. Beyond the performance and generalisation improvements, a key advantage of the ConvCNP is its computational efficiency. The memory and time complexity of a single self-attention layer grows quadratically with the number of inputs (the number of pixels for images) but only linearly for a convolutional layer. Empirically, with a batch size of 16 on 32 × 32 MNIST, ConvCNPXL requires 945 MB of VRAM, while ACNP requires 5839 MB. For the 56 × 56 ZSMM, ConvCNPXL increases its requirements to 1443 MB, while ACNP could not fit onto a 32 GB GPU.
Ultimately, ACNP had to be trained with a batch size of 6 (using 19139 MB) and we were not able to fit it on the GPU for CelebA64.

5.4.3 Limitations of factorised predictive distributions

We have introduced the ConvCNP and shown that it outperforms the ACNP in a variety of synthetic and real-world regression tasks. However, as described in Section 4.2, all CNPs, ConvCNPs included, are unable to produce predictive distributions that have dependencies between different target locations. More precisely, let P_N(X, Y) ⊂ P(X, Y) denote the set of noise GPs: Gaussian processes on X whose covariance is given by Cov(x, x′) = σ^2(x) δ[x − x′], where σ^2 ∈ C_b(X, Y) and δ is the Kronecker delta, with δ[0] = 1 and δ[ · ] = 0 otherwise. Then the ConvCNP is a map ConvCNP : Z → P_N(X, Y) with Equation (5.13) defining its finite-dimensional marginals.

Unfortunately, predictive stochastic processes in P_N(X, Y) possess two key limitations. First, it is impossible to obtain coherent function samples from the predictive distribution as each point of the function is generated independently. This severely limits the ability of ConvCNPs to be used in tasks such as Thompson sampling. Furthermore, when using ConvCNPs to estimate the probability that the value of the predicted function over the entirety of a given range will exceed a certain threshold, this probability may be drastically underestimated due to the factorisation assumption. One example is in heatwave or flood prediction, where we are interested in the probability that the temperature or amount of precipitation exceeds a threshold throughout some region of space or time, in order to predict droughts or floods (Markou et al., 2022). Another limitation of P_N(X, Y) is that Gaussian predictive distributions cannot model multi-modality, heavy-tailedness, or asymmetry.
Although this can be addressed by using the ConvCNP output to parameterise more flexible families of distributions, such as mixtures of Gaussians or normalising flows, in the next section we will show how introducing a latent variable can lift both the restrictions of factorisation and Gaussianity simultaneously.

5.5 Convolutional latent neural processes

We now present the convolutional latent neural process (ConvLNP), which addresses the weaknesses of ConvCNPs. The ConvLNP extends the ConvCNP by parameterising a map to predictive stochastic processes more expressive than P_N(X, Y), allowing for coherent sampling and non-Gaussian predictive distributions. It achieves this by passing the output of a ConvCNP through a non-linear, translation equivariant map between function spaces. Specifically, the ConvLNP uses an encoder-decoder architecture, where the encoder E : Z → P_N(X, Y) is a ConvCNP and the decoder d : Y^X → Y^X is translation equivariant (here Y^X denotes the set of all functions from X to Y).

Note that throughout this section, when describing the ConvLNP, we will use the terms ‘encoder’ and ‘decoder’ differently to their use in Section 4.2. There, the encoder described how elements of the context set are embedded and aggregated into a single representation. The decoder was then used to combine that representation with a target input location to form a prediction. Here, we refer to E, which is itself a complete ConvCNP, as the encoder for the ConvLNP. The decoder then simply refers to the second stage of the ConvLNP, d, which transforms samples from the ConvCNP encoder.[6]

Conditioned on the context set D_c, ConvLNP samples can be obtained by sampling a function z ∼ ConvCNP(D_c) and then computing f = d(z). This is illustrated in Figure 5.7.
Importantly, d takes functions to functions and does not necessarily act point-wise: letting f(x) depend on the value of z at multiple locations is crucial for inducing dependencies in the predictive distribution. This sampling procedure induces a map between stochastic processes, D : P_N(X, Y) → P(X, Y). Putting these together, and making explicit the parameter dependence in E and D, the ConvLNP is constructed as

\[
\mathrm{ConvLNP}_{\theta,\phi} = D_\theta \circ E_\phi, \qquad E_\phi = \mathrm{ConvCNP}_\phi, \qquad D_\theta = (d_\theta)_*, \tag{5.15}
\]

where (d_θ)_* is the pushforward[7] under d_θ. We now prove that the ConvLNP is indeed a translation equivariant map from datasets to stochastic processes, by proving that the decoder and encoder are separately translation equivariant.

[6] The choice of what constitutes the ‘encoder’ and ‘decoder’ here is somewhat arbitrary and is primarily a naming convention rather than a fundamental distinction.
[7] i.e., (d_θ)_*(E_ϕ) is the measure induced on R^X by sampling a function from E_ϕ and passing it through d_θ.

[Figure 5.7 panels: (1) context set D_c; (2) encoder E_ϕ: z ∼ ConvCNP(D_c); (3) decoder D_θ: f = d(z).]
Fig. 5.7 The ConvLNP encoder-decoder architecture. The encoder is a ConvCNP which takes the context set as input (left panel) and outputs a single sample of z (center panel). The decoder takes this as input and outputs a predictive sample (right panel, blue; two other samples shown in grey).

Lemma 1. Let d be a measurable, translation equivariant map from (Y^X, Σ) to (Y^X, Σ). Then the ConvLNP decoder D : P(X, Y) → P(X, Y), defined by D(P) = d_*(P), where d_*(P) is the pushforward measure of P under d, is translation equivariant.

Proof. Let F ∈ Σ ⊆ Y^X be a measurable set. Then:

\[
D(T_\tau P)(F) \overset{(a)}{=} T_\tau P(d^{-1}(F)) = P(T_{-\tau} d^{-1}(F)) \overset{(b)}{=} P(d^{-1}(T_{-\tau} F)) = D(P)(T_{-\tau} F) = T_\tau D(P)(F).
\]

Here (a) follows from the definition of the pushforward, and (b) follows because

\[
T_{-\tau} d^{-1}(F) = T_{-\tau}\{f : d(f) \in F\} = \{T_{-\tau} f : d(f) \in F\} = \{f : d(T_\tau f) \in F\} = \{f : T_\tau d(f) \in F\} = \{f : d(f) \in T_{-\tau} F\} = d^{-1}(T_{-\tau} F).
\]

Lemma 2.
The ConvLNP encoder E (which is defined to be a ConvCNP), is a translation equivariant map from datasets to stochastic processes.

Proof. Recall that the mean and variance µ(·, D), σ^2(·, D) (viewed as maps from Z → C_b(X, Y)) of the ConvCNP encoder E are both given by ConvDeepSets. Due to the translation equivariance of ConvDeepSets (Theorem 6), µ(·, T_τ D) = T_τ µ(·, D) for all D ∈ Z, τ ∈ X, and similarly for σ^2. Let F ∈ Σ. Then since the measure E(D) ∈ P_N(X) is defined entirely by its mean and variance function, E(T_τ D)(F) = E(D)(T_{−τ} F) = T_τ E(D)(F).

Noting that a composition of translation equivariant maps is itself translation equivariant, we obtain the following proposition:

Proposition 7. Define ConvLNP = D ◦ E. Then ConvLNP is a translation equivariant map from datasets to stochastic processes.

In practice, we cannot actually compute a full functional sample z from a noise GP (P_N) as described in Figure 5.7, since z comprises uncountably many independent random variables. Instead, we consider a discrete version of the model, which enables practical computation (at the expense of not having the theory in Proposition 7 apply exactly). Similarly to Section 5.3.1, we discretise the domain of z on a grid (x_i)_{i=1}^K, with z := (z(x_i))_{i=1}^K. As a consequence, the model can only be equivariant up to shifts on this discrete grid. With this discretisation, sampling z ∼ ConvCNP_ϕ(D_c) amounts to sampling a finite number of independent Gaussian random variables, and d_θ is implemented by passing z through a CNN, which plays the role of the translation equivariant map between (discretised) function spaces. The forward pass of a discretised, trained ConvLNP is illustrated in Figure 5.8. Note that CNNs are not always entirely translation equivariant due to the zero padding that occurs at each layer. In practice, we find that this does not hinder the model from extrapolating meaningfully.

Following Kim et al.
(2018), we define the model likelihood by adding heteroskedastic Gaussian observation noise σ_y^2(x, z) to the predictive function draws f = d_θ(z) ∈ Y^X. Given a context set D_c, the predictive distribution for the target outputs y_t given the target inputs x_t is then:

\[
p(y_t \mid x_t, D_c) = \mathbb{E}_{z \sim E_\phi(D_c)} \Big[ \prod_{(x, y) \in D_t} \mathcal{N}\big(y;\, d_\theta(z)(x),\, \sigma_y^2(x, z)\big) \Big]. \tag{5.16}
\]

Although the product in the expectation factorises, p(y_t | x_t, D_c) does not: z induces dependencies in the predictive, in contrast to Equation (5.13). We provide pseudocode for the ConvLNP forward pass in both the off-the-grid and on-the-grid case in Figures 5.9 and 5.10.

5.6 ConvLNP experimental results

We evaluate ConvLNPs on a range of regression tasks. Our main questions are:

Fig. 5.8 Forward pass of a ConvLNP. Steps (1)-(4) depict sampling from the encoder E_ϕ, which is a ConvCNP. This involves: (1) computing a functional representation of the context set, with separate ‘density’ and ‘data’ channels (described in detail in Section 5.3.1), (2) discretizing the representation, (3) passing the representation through a CNN, which outputs the parameters of independent Gaussian distributions spaced on a grid, and (4) sampling from these distributions. However, the samples at each grid point are independent of each other, hence in (5) the samples are passed through another CNN, the decoder, to induce dependencies, and then are smoothed out.

    require: d = (CNN, ψ_d), E_ϕ (off-the-grid ConvCNP), and number of samples L
    require: context (x_n, y_n)_{n=1}^N, target (x∗_m)_{m=1}^M
    begin
        µ_z, σ_z ← E_ϕ(D_c)
        for l = 1, . . . , L do
            z_l ∼ N(z; µ_z, σ_z^2)
            (f_µ(t_i), f_σ(t_i))_{i=1}^K ← CNN(z_l)
            µ_{m,l} ← ∑_{i=1}^K f_µ(t_i) ψ_d(x∗_m − t_i)
            σ_{m,l} ← pos(f_σ(t_i))
        end for
        return (µ, σ)
    end

Fig. 5.9 Forward pass through a ConvLNP (off-the-grid). The function pos : R → (0,∞) is used to enforce positivity.
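The sampling loop of Figure 5.9 and the expectation in Equation (5.16) can be sketched together in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not the architecture used in the experiments: the decoder is a single fixed 1D convolution standing in for the CNN, the targets lie on the grid (so the interpolation with ψ_d is skipped), and the observation noise `noise_std` is homoskedastic. All function names are hypothetical.

```python
import numpy as np


def sample_decoder_outputs(mu_z, sigma_z, kernel, num_samples, rng):
    """Draw z_l ~ N(mu_z, sigma_z^2) independently per grid point and push each
    draw through a toy decoder: one 1D 'same' convolution standing in for the CNN."""
    eps = rng.standard_normal((num_samples, mu_z.shape[0]))
    z = mu_z + sigma_z * eps
    return np.stack([np.convolve(z_l, kernel, mode="same") for z_l in z])


def log_likelihood_estimate(f_samples, target_idx, y_t, noise_std):
    """Monte Carlo estimate of log p(y_t | x_t, D_c) in Equation (5.16):
    log of the sample average of prod_t N(y_t; f_l(x_t), noise_std^2)."""
    resid = y_t[None, :] - f_samples[:, target_idx]
    log_probs = np.sum(
        -0.5 * np.log(2.0 * np.pi * noise_std**2) - 0.5 * (resid / noise_std) ** 2,
        axis=1,
    )
    m = log_probs.max()  # log-sum-exp trick for numerical stability
    return float(m + np.log(np.mean(np.exp(log_probs - m))))


rng = np.random.default_rng(0)
K = 32                                   # number of grid points
mu_z, sigma_z = np.zeros(K), np.ones(K)  # stand-in encoder outputs
kernel = np.array([0.25, 0.5, 0.25])     # stand-in decoder filter

f = sample_decoder_outputs(mu_z, sigma_z, kernel, num_samples=2000, rng=rng)

# z is independent across grid points, but neighbouring decoder outputs share
# inputs, so they are correlated (2/3 in theory for this particular filter).
neighbour_corr = np.corrcoef(f[:, 10], f[:, 11])[0, 1]

ll = log_likelihood_estimate(f, np.array([10, 11]), np.array([0.1, -0.2]),
                             noise_std=0.5)
```

Because the logarithm is taken of a sample average, Jensen's inequality implies that the expectation of this estimate is a lower bound on the true log-likelihood.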
    require: d = CNN, E_ϕ (on-the-grid ConvCNP), and number of samples L
    require: image I, context mask M_c, and target mask M_t
    begin
        µ_z, σ_z ← E_ϕ(I, M_c)
        for l = 1, . . . , L do
            z_l ∼ N(z; µ_z, σ_z^2)
            (f_µ(t_i), f_σ(t_i))_{i=1}^K ← CNN(z_l)
            µ ← f_t^{(1:C)}
            σ ← pos(f_t^{(C+1:2C)})
        end for
        return (µ, σ)
    end

Fig. 5.10 Forward pass through a ConvLNP (on-the-grid). The function pos : R → (0,∞) is used to enforce positivity.

1. Does the ConvLNP produce coherent, meaningful predictive samples?
2. Similarly to the ConvCNP, can it leverage translation equivariance to outperform baseline methods within and beyond the training range (generalisation)?
3. Unlike the ConvCNP, does it learn expressive non-Gaussian predictive distributions?
4. How does training the ConvLNP with the approximate maximum likelihood objective L̂_ML of Section 4.4.3 compare with training using the neural process variational inference objective L_NPVI of Section 4.4.2?

We use several approaches for evaluating latent neural processes. First, as in Garnelo et al. (2018b) and Kim et al. (2018), we provide qualitative visual comparisons of samples. These allow us to see if the models display meaningful structure, quantify uncertainty, and are able to generalise spatially. Second, LNPs lack closed-form likelihoods, so we evaluate lower bounds on their predictive log-likelihoods via importance sampling (Le et al., 2018). As these lower bounds can be quite loose (see Appendix E for an analysis of the looseness of the bounds as a function of number of samples used), they are primarily useful to show when LNPs outperform baselines with exact likelihoods, such as GPs and ConvCNPs. Finally, in Section 5.6.3 we consider Bayesian optimisation to evaluate the usefulness of ConvLNPs for downstream tasks. In Sections 5.6.1 and 5.6.2, we compare against the Attentive NP (ANP; Kim et al., 2018), which in prior work has been trained only with L_NPVI.
The ANP architectures used in this section are comparable to those in Kim et al. (2018), and have a parameter count comparable to or greater than the ConvLNP. Full details are provided in Appendix F. Code to reproduce the 1D regression experiments can be found at https://github.com/wesselb/neuralprocesses, and code to implement the image-completion experiments can be found at https://github.com/YannDubs/Neural-Process-Family.

5.6.1 1D regression

Similarly to Section 5.4.1, we train on an exponentiated quadratic kernel GP, a Matérn–5/2 GP, a weakly periodic GP, and a non-Gaussian sawtooth process with random shifts and frequency (see Appendix F.1 for details). Figure 5.11 shows predictive samples, where during training the models only observe data within the grey regions (training range). While samples from the ANP exhibit unnatural ‘kinks’ and do not resemble the underlying process, the ConvLNP produces smooth samples for Matérn–5/2 and samples exhibiting meaningful structure for the weakly periodic and sawtooth processes. The ConvLNP also generalises gracefully beyond the training range, whereas the ANP fails catastrophically. The ANP with L_NPVI collapses to deterministic samples, with the epistemic uncertainty explained using the heteroskedastic noise σ_y^2(x, z). This was also noted in Le et al. (2018). This behaviour is alleviated when training with L̂_ML, with much of the predictive uncertainty due to variations in the sampled functions.

Table 5.3 compares lower bounds on the log-likelihood for the ConvLNP with the ANP and MLP-LNP for both our proposed L̂_ML objective and the standard L_NPVI objective. We also show three exact log-likelihoods: the ground-truth GP (full), the ground-truth GP with diagonalised predictions (diag), and the ConvCNP.[8] The ConvCNP performs on par with the GP (diag), which is the optimal factorised predictive.
The ConvLNP lower bound is consistently higher than the GP (diag) and ConvCNP log-likelihoods, demonstrating that its non-factorised predictive distributions improve performance. Furthermore, the ConvLNP performs similarly inside and outside its training range, demonstrating that translation equivariance helps generalisation. This is in contrast to the ANP, which fails catastrophically outside its training range.

5.6.2 Image completion

We now evaluate ConvLNPs on image completion tasks, focusing on spatial generalisation. To test this, we consider zero-shot multi MNIST (ZSMM), where we train on single MNIST digits but test on two MNIST digits on a larger canvas. We randomly translate the digits during training, so the generative stochastic process is stationary. The black background of MNIST causes difficulty with heteroskedastic noise, as the models can obtain high likelihood by predicting the background with high confidence whilst ignoring the digits. Hence for MNIST and ZSMM we use homoskedastic noise σ_y^2(z). Figures 5.12a and 5.12b show that the ANP fails to generalise spatially, whereas this is naturally handled by the ConvLNP.

We also test the ConvLNP’s ability to learn non-Gaussian predictive distributions. Figure 5.12c shows that the ConvLNP can learn highly multimodal predictive distributions, enabling the generation of diverse yet coherent samples. A quantitative comparison of models using log-likelihood lower bounds is provided in Table 5.4, where the ConvLNP trained with L̂_ML consistently achieves the highest values. Appendix F.2 provides details regarding the data, architectures, and protocols used in our image experiments.
In Section F.2.4, we provide samples and further quantitative comparisons of models trained on SVHN (Netzer et al., 2011), MNIST (LeCun et al., 1989), and 32 × 32 CelebA (Liu et al., 2018) in a range of scenarios, along with full experimental details.

[8] Note that the log-likelihood values for the ConvCNP reported in Table 5.3 are not comparable with those given in Table 5.2 since the sampling procedures determining the size of the context and target sets differ for the two experiments, see Section D.2.2 and Appendix F.1.

Fig. 5.11 Predictions of ConvLNPs and ANPs trained with L̂_ML and L_NPVI, showing interpolation and extrapolation within (grey background) and outside (white background) the training range. Solid blue lines are samples, dashed blue lines are means, and the shaded blue area is µ ± 2σ. Purple dash–dot lines are the ground-truth GP mean and µ ± 2σ. The ConvLNP handles points outside the training range naturally, whereas this leads to catastrophic failure for the ANP. Note ANP with L_NPVI tends to collapse to deterministic samples, with all uncertainty explained with the heteroskedastic noise. In contrast, models trained with L̂_ML show diverse samples that account for much of the uncertainty.

Fig. 5.12 Panels: (a) ConvLNP, (b) ANP, (c) ConvLNP, (d) ANP. Left two plots: predictive samples on zero-shot multi MNIST. Right two plots: samples and marginal predictives on standard MNIST. We plot the density of the five marginals that maximize Sarle’s bimodality coefficient (Ellison, 1987). We use L̂_ML for training. Blue pixels are not in the context set.

Table 5.3 Log-likelihood for ConvCNP, ConvLNP, ANP, and MLP-LNP. Each of the latent variable models was trained on each data set with L̂_ML and L_NPVI, separately.

Model          | EQ             | Matérn–5/2     | Noisy Mixt.    | Weakly Per.    | Sawtooth

Interpolation inside training range
GP (full)      | 5.80 ± 0.02    | 1.22 ± 6.3e-3  | 1.00 ± 4.1e-3  | -0.06 ± 4.6e-3 | N/A
GP (diag)      | -0.59 ± 0.01   | -0.84 ± 9.0e-3 | -0.89 ± 0.01   | -1.17 ± 5.2e-3 | N/A
ConvCNP        | -0.70 ± 0.02   | -0.88 ± 0.01   | -0.92 ± 0.02   | -1.19 ± 7.0e-3 | 1.15 ± 0.04
ConvLNP L̂_ML   | -0.30 ± 0.02   | -0.58 ± 0.01   | -0.55 ± 0.01   | -1.02 ± 6.0e-3 | 2.30 ± 0.01
ANP L̂_ML       | -0.52 ± 0.01   | -0.73 ± 0.01   | -0.69 ± 0.01   | -1.14 ± 6.0e-3 | 0.09 ± 3.0e-3
MLP-LNP L̂_ML   | -0.84 ± 9.0e-3 | -0.96 ± 7.0e-3 | -0.93 ± 9.0e-3 | -1.23 ± 5.0e-3 | -0.02 ± 2.0e-3
ConvLNP L_NPVI | -0.50 ± 0.02   | -0.77 ± 0.01   | -0.48 ± 0.02   | -1.03 ± 8.0e-3 | 2.47 ± 8.0e-3
ANP L_NPVI     | -0.82 ± 0.01   | -0.96 ± 0.01   | -1.04 ± 0.01   | -1.37 ± 6.0e-3 | 0.20 ± 9.0e-3
MLP-LNP L_NPVI | -0.58 ± 9.0e-3 | -1.00 ± 9.0e-3 | -0.72 ± 0.01   | -1.22 ± 5.0e-3 | -0.16 ± 2.0e-3

Interpolation beyond training range
GP (full)      | 5.80 ± 0.02    | 1.22 ± 6.3e-3  | 1.00 ± 4.1e-3  | -0.06 ± 4.6e-3 | N/A
GP (diag)      | -0.59 ± 0.01   | -0.84 ± 9.0e-3 | -0.89 ± 0.01   | -1.17 ± 5.2e-3 | N/A
ConvCNP        | -0.69 ± 0.02   | -0.87 ± 0.01   | -0.94 ± 0.02   | -1.19 ± 7.0e-3 | 1.11 ± 0.04
ConvLNP L̂_ML   | -0.30 ± 0.02   | -0.58 ± 0.01   | -0.56 ± 0.01   | -1.03 ± 6.0e-3 | 2.29 ± 0.02
ANP L̂_ML       | -1.35 ± 6.0e-3 | -1.39 ± 7.0e-3 | -1.65 ± 5.0e-3 | -1.35 ± 4.0e-3 | -0.17 ± 1.0e-3
MLP-LNP L̂_ML   | -2.70 ± 3.0e-3 | -2.60 ± 3.0e-3 | -2.82 ± 3.0e-3 | -              | -0.03 ± 2.0e-3
ConvLNP L_NPVI | -0.48 ± 0.02   | -0.79 ± 0.01   | -0.48 ± 0.02   | -1.04 ± 8.0e-3 | 2.47 ± 8.0e-3
ANP L_NPVI     | -1.91 ± 0.03   | -1.48 ± 4.0e-3 | -1.85 ± 7.0e-3 | -1.66 ± 0.01   | -0.30 ± 4.0e-3
MLP-LNP L_NPVI | -13.7 ± 0.82   | -3.96 ± 0.04   | -3.80 ± 0.02   | -              | -4.98 ± 0.02

Extrapolation beyond training range
GP (full)      | 4.29 ± 6.2e-3  | 0.82 ± 4.3e-3  | 0.66 ± 2.2e-3  | -0.33 ± 3.4e-3 | N/A
GP (diag)      | -1.40 ± 5.0e-3 | -1.41 ± 4.8e-3 | -1.72 ± 6.2e-3 | -1.40 ± 4.0e-3 | N/A
ConvCNP        | -1.41 ± 6.0e-3 | -1.41 ± 7.0e-3 | -1.73 ± 8.0e-3 | -1.41 ± 6.0e-3 | 0.27 ± 0.02
ConvLNP L̂_ML   | -1.09 ± 5.0e-3 | -1.11 ± 5.0e-3 | -1.30 ± 4.0e-3 | -1.24 ± 4.0e-3 | 1.61 ± 0.02
ANP L̂_ML       | -1.29 ± 6.0e-3 | -1.29 ± 5.0e-3 | -1.55 ± 5.0e-3 | -1.34 ± 5.0e-3 | -0.25 ± 2.0e-3
MLP-LNP L̂_ML   | -2.23 ± 4.0e-3 | -2.08 ± 3.0e-3 | -2.50 ± 4.0e-3 | -1.39 ± 4.0e-3 | -0.06 ± 2.0e-3
ConvLNP L_NPVI | -1.21 ± 0.01   | -1.31 ± 0.01   | -1.19 ± 0.01   | -1.51 ± 8.0e-3 | 2.10 ± 7.0e-3
ANP L_NPVI     | -1.44 ± 6.0e-3 | -1.45 ± 6.0e-3 | -1.77 ± 7.0e-3 | -1.46 ± 6.0e-3 | -0.20 ± 2.0e-3
MLP-LNP L_NPVI | -5.85 ± 0.05   | -2.65 ± 3.0e-3 | -4.06 ± 0.04   | -1.49 ± 5.0e-3 | -1.99 ± 6.0e-3

Table 5.4 Test log-likelihood lower bounds for image completion (5 runs).

Model   | MNIST L̂_ML  | MNIST L_NPVI | CelebA32 L̂_ML | CelebA32 L_NPVI | SVHN L̂_ML   | SVHN L_NPVI | ZSMM L̂_ML    | ZSMM L_NPVI
ConvLNP | 2.11 ± 0.01 | 0.99 ± 0.42  | 6.92 ± 0.10   | -0.27 ± 0.00    | 9.89 ± 0.09 | 0.17 ± 0.00 | 4.58 ± 0.04  | 0.14 ± 0.00
ANP     | 1.66 ± 0.03 | 1.64 ± 0.03  | 5.98 ± 0.08   | 6.04 ± 0.10     | 9.18 ± 0.08 | 8.91 ± 0.06 | -10.8 ± 1.99 | -6.45 ± 0.99

Table 5.5 Joint predictive log-likelihoods (LL) and RMSEs on ERA5-Land, averaged over 1000 tasks.

Metric         | Model   | Central (train) | West (test) | East (test) | South (test)
LL             | ConvLNP | 4.47 ± 0.07     | 4.55 ± 0.08 | 5.07 ± 0.07 | 4.65 ± 0.08
LL             | GP      | 3.33 ± 0.06     | 3.65 ± 0.06 | 4.07 ± 0.06 | 3.34 ± 0.06
RMSE (×10^−2)  | ConvLNP | 5.72 ± 0.33     | 5.77 ± 0.37 | 3.23 ± 0.22 | 6.92 ± 0.39
RMSE (×10^−2)  | GP      | 6.26 ± 0.30     | 5.75 ± 0.29 | 3.10 ± 0.18 | 7.94 ± 0.44

5.6.3 Environmental data

We next consider a real-world dataset, ERA5-Land (Copernicus Climate Change Service, 2020), containing environmental measurements at a ∼9 km spacing across the globe. We consider predicting daily precipitation y at position x. Environmental data is not perfectly stationary, as there are changes in climate that reflect geographic position. Hence this task reflects the model’s ability to handle situations where the underlying process is only approximately stationary. In general, one approach to handle situations like these is to provide the model with input variables such that, conditioned on those variables, the underlying stochastic process is approximately stationary. For example, ground elevation (known as orography) is an important factor influencing climate.
Since the orography depends on absolute geographical position, the ground truth stochastic process governing precipitation cannot be strictly stationary. However, it may be the case that if we also translate the orography data along with the input positions, then stationarity is approximately restored. Hence we provide the ConvLNP with orography data, and also temperature values, as inputs along with precipitation.

We choose a large region of central Europe as our train set, and use regions east, west and south as held-out test sets. For such tasks, models must be able to make predictions at locations spanning a range different from the training set, inhibiting the deployment of NPs not equipped with translation equivariance. To sample a task at train time, we sample a random date between 1981 and 2020, then sample a sub-region within the train region, which is split into context and target sets. In this section, we train using L̂_ML. See Appendix F.3 for details.

Fig. 5.13 Predictive samples overlaid on central Europe. Panels: (a) ground truth data, (b) to (d) ConvLNP samples 1 to 3, (e) context set, (f) to (h) GP samples 1 to 3. Darker colours show higher precipitation. In (e), coloured pixels represent context points. GP samples often take negative values (lighter than ground truth data, see Section F.3.2 for a discussion), whereas the NP has learned to produce non-negative samples which capture the sparsity of precipitation. The model is trained on subregions roughly the size of the lengthscale of the precipitation process. More samples in Section F.3.6.

[Figure 5.14 panels: Central (train), West (test), East (test), South (test); y-axis: average regret; legend: GP UCB, GP TS, NP UCB, NP TS, Random.]
Fig.
5.14 Average regret plotted against number of points queried for Bayesian optimisation for the precipitation value on a given day in different regions of Europe, averaged over 5000 tasks.

Prediction. We first evaluate the ConvLNP’s predictive performance, comparing to a GP trained individually on each task as a baseline. In about 10% of tasks, the GP obtains an especially poor likelihood (< 0 nats); we remove these outliers from the evaluation. The results are shown in Table 5.5. The ConvLNP and GP have comparable RMSEs except on the south dataset, where the ConvLNP outperforms the GP. However, the ConvLNP consistently outperforms the GP in log-likelihood, which is expected because (i) the GP does not share information between tasks and hence is prone to overfitting on small context sets, resulting in overconfident predictions; and (ii) the ConvLNP can learn non-Gaussian predictive densities (illustrated in Section F.3.6). Figure 5.13 shows samples from the predictive process of a ConvLNP and GP, over the whole of the train region. This demonstrates spatial extrapolation, as the ConvLNP is trained only on random subregions.

Bayesian optimisation. We demonstrate the ConvLNP in a downstream task by considering a toy Bayesian optimisation problem, where the goal is to identify the location with heaviest rainfall on a given day. We also test the ConvLNP’s spatial generalization, by optimising over larger regions (for central, west, and south) than the model was trained on. We test both Thompson sampling (TS) (Thompson, 1933) and upper confidence bounds (UCB) (Auer, 2002) as methods for acquiring points. Note that TS requires coherent samples. The results are shown in Figure 5.14. On all datasets, ConvLNP TS and UCB significantly outperform the random baseline by the 50th iteration; the GP does not reliably outperform random. We hypothesise this is due to its overconfidence, in line with the results on prediction.
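To make the difference between the two acquisition rules concrete, the sketch below implements both over a finite set of candidate locations. It is illustrative only: the function names are hypothetical, the exploration weight `beta = 2.0` is an assumed value rather than one taken from the experiments, and the toy predictive is hand-written rather than produced by a ConvLNP or GP.

```python
import numpy as np


def thompson_choice(function_samples, rng):
    """Thompson sampling: draw one coherent function sample and query its argmax.
    function_samples has shape (num_samples, num_locations)."""
    draw = function_samples[rng.integers(function_samples.shape[0])]
    return int(np.argmax(draw))


def ucb_choice(mean, std, beta=2.0):
    """Upper confidence bound: query the argmax of mean + beta * std.
    beta = 2.0 is an illustrative value, not taken from the thesis."""
    return int(np.argmax(mean + beta * std))


# Toy predictive over three candidate locations.
mean = np.array([0.0, 1.0, 0.5])
std = np.array([0.1, 0.1, 1.0])
samples = np.array([[0.0, 3.0, 1.0]])      # a single (hypothetical) function sample

ucb_pick = ucb_choice(mean, std)           # scores are [0.2, 1.2, 2.5]
ts_pick = thompson_choice(samples, np.random.default_rng(0))
```

UCB needs only the marginal mean and standard deviation at each location, so it can be computed from a factorised predictive. Thompson sampling instead takes the argmax of a whole function draw, which is only meaningful when the values at different locations are sampled jointly.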
5.7 Summary and conclusions

In this chapter we presented the convolutional conditional neural process and the convolutional latent neural process. Both models take advantage of the ConvDeepSets representation theorem (Theorem 6) to flexibly parameterise a permutation invariant and translation equivariant map from observed datasets to predictive stochastic processes. In Section 5.4 we showed that the ConvCNP outperforms the attentive conditional neural process on both synthetic 1D regression and image tasks. However, like all CNPs, it makes factorised predictions for every point in the target set. In Section 5.5 we remedied this by introducing a latent variable to define the ConvLNP, which uses a ConvCNP to define a distribution over a latent function which is then passed through another CNN to introduce dependencies in the predictive distribution.

In Section 5.6 we showed that the ConvLNP is able to address the shortcomings of the ConvCNP, both by making non-factorised predictions (thereby allowing it to outperform the ConvCNP in terms of log-likelihood), and also by enabling non-Gaussian and sometimes multimodal marginal predictive distributions. Together, the ConvCNP and ConvLNP represent practical and highly performant models for stochastic process prediction whenever translation equivariance is an appropriate inductive bias.

Chapter 6 Conclusions and discussion

6.1 Summary of contributions

We now summarise our main contributions in this thesis. The first half of the thesis focused on understanding the consequences of approximate inference in Bayesian neural networks, and the second half focused on convolutional neural processes. We describe each of these contributions in turn.

6.1.1 Approximate inference in Bayesian neural networks

In Chapter 2 we introduced and motivated Bayesian neural networks and described the need for reliable approximate inference as a pressing research problem.
This led to the first major contribution of the thesis, which was a theoretical and empirical study of two of the most common approximate inference methods for BNNs: mean-field variational inference and Monte Carlo dropout. The results of these investigations were presented in Chapter 3.

On the theoretical side, our main contribution was to prove theorems showing that, for single-hidden layer ReLU BNNs with either MFVI or MCDO approximate posteriors, there are simple situations where no setting of the variational parameters can represent increased uncertainty in between regions of low uncertainty. This is in contrast to the exact posterior predictive, which shows increased in-between uncertainty when appropriate. We next considered the theoretical expressiveness of BNNs with more than one hidden layer. We proved that given sufficient width, they are able to represent any predictive mean and variance function. This provides a kind of stochastic analogue to the classical universal approximation theorem for neural networks.

Our universal approximation result for deep MFVI and MCDO networks naturally leads to the question of whether, when training networks using the ELBO objective, variational parameters will be found that lead to predictive distributions which resemble the true predictive. This is the main question addressed by the empirical studies we perform in Chapter 3. By studying toy examples and comparing them to reference predictive distributions such as HMC and the infinite-width GP, we show that even for deep BNNs, in-between uncertainty is not reliably represented even though in theory there exist variational parameters that can represent it. We finally conclude Chapter 3 with a case study showing how a lack of in-between uncertainty can be deleterious for active learning.
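As an illustration of the kind of predictive distribution discussed above, the following sketch estimates by Monte Carlo the mean and variance of the output of a single-hidden-layer ReLU network under a fully factorised Gaussian over its weights and biases. It is a sketch under stated assumptions rather than the code used in Chapter 3: the function name, the parameter layout and all numerical values are hypothetical, and the variational parameters are arbitrary placeholders rather than the result of optimising the ELBO.

```python
import numpy as np


def mfvi_predictive(x, params, num_samples=1000, seed=0):
    """Monte Carlo mean and variance of the network output.

    x: inputs of shape (n, d).
    params: dict mapping "W1" (d, h), "b1" (h,), "W2" (h,), "b2" () to
        (mean, std) pairs of a fully factorised Gaussian.
    """
    rng = np.random.default_rng(seed)

    def draw(name):
        mu, sigma = params[name]
        # Leading axis indexes samples; every weight is drawn independently.
        noise = rng.standard_normal((num_samples,) + np.shape(mu))
        return mu + sigma * noise

    W1, b1, W2, b2 = draw("W1"), draw("b1"), draw("W2"), draw("b2")
    # Hidden activations of shape (samples, n, h) with a ReLU nonlinearity.
    hidden = np.maximum(0.0, np.einsum("nd,sdh->snh", x, W1) + b1[:, None, :])
    # Network outputs of shape (samples, n).
    outputs = np.einsum("snh,sh->sn", hidden, W2) + b2[:, None]
    return outputs.mean(axis=0), outputs.var(axis=0)


# Placeholder variational parameters for a 1D input and 50 hidden units.
d, h = 1, 50
init = np.random.default_rng(1)
params = {
    "W1": (init.standard_normal((d, h)), 0.1 * np.ones((d, h))),
    "b1": (init.standard_normal(h), 0.1 * np.ones(h)),
    "W2": (init.standard_normal(h) / np.sqrt(h), 0.1 * np.ones(h)),
    "b2": (np.array(0.0), np.array(0.1)),
}
x_grid = np.linspace(-2.0, 2.0, 5)[:, None]
pred_mean, pred_var = mfvi_predictive(x_grid, params, num_samples=200)
```

The returned variance is that of the network output only and excludes any observation noise. In an active learning loop, one simple heuristic (an assumption here, not necessarily the rule used in the case study) is to query the candidate input with the largest predictive variance, which is why a predictive that cannot represent in-between uncertainty can mislead acquisition.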
6.1.2 Convolutional neural processes

In the second half of the thesis, we propose the convolutional neural process, a new member of the neural process family that incorporates translation equivariance into its predictions. We begin in Chapter 4 by providing an overview of various existing members of the neural process family. We view neural processes as performing stochastic process prediction via meta-learning. We describe the encoder-decoder architectural framework that underlies the design of many different NPs, and also document the training objectives used to train both conditional NPs and latent NPs. For latent NPs, we introduce a new approximate maximum-likelihood objective that sidesteps the complexities of variational inference in favour of directly forming a (biased) estimate of the likelihood.

Finally, in Chapter 5 we introduce our proposed model, the convolutional neural process. To motivate the model, we prove that stationary stochastic processes imply translation equivariant prediction maps, and extend the original deep sets representation theorem to also incorporate translation equivariance. This convolutional deep sets theorem then directly informs the implementation of our convolutional neural process. We present two versions of the model. The convolutional conditional neural process (ConvCNP) is simpler, and only outputs a predictive mean and variance function; hence it cannot model dependencies in the predictive distribution. We also introduce the convolutional latent neural process (ConvLNP). The ConvLNP uses a latent function to allow it to model dependencies and also provide non-Gaussian marginal predictions. For both models, we provide extensive experiments on both synthetic 1D regression tasks and also 2D image regression. We show that the ConvCNP and ConvLNP outperform the attentive CNP and attentive LNP, the previous best performing neural processes.
Furthermore, we show that the models can leverage translation equivariance to solve challenging tasks such as zero-shot multi-MNIST, where the model has to generalise from seeing only single, centered MNIST digits at train time, to seeing multiple non-centered MNIST digits at test time.

6.2 BNNs and NPs compared

Having described the two main focuses of this thesis, it is natural to compare and contrast BNNs and NPs. We now consider their similarities and differences from various angles.

Priors and meta-learning. An area where NPs differ significantly from BNNs is in prior selection. For BNNs, choosing the prior is a crucial part of specifying the model. Choices such as whether to use heavy-tailed or correlated priors can have a significant impact on downstream performance (Fortuin et al., 2021). In contrast, for NPs no prior needs to be chosen. Instead, the required inductive biases to succeed on a new task (apart from high-level inductive biases such as convolution and attention) are learned directly from previous tasks in the episodic meta-learning setting. This provides a more data-driven approach that relieves practitioners of the burden of prior design. The price that has to be paid for this is the need for a meta-dataset. Often this is not available: we may only have one dataset of interest and wish to make predictions based on it. In such cases BNNs can still be used, whereas NPs are no longer applicable. On the other hand, if a meta-dataset is available, it may be possible to use the meta-dataset to meta-learn a prior for the BNN, thus removing some of the burden of prior selection, as proposed in Rothfuss et al. (2020), although this is not yet common practice in the BNN literature.

Regression with uncertainty. In this thesis, BNNs and NPs were both applied to the task of performing regression with uncertainty estimates.
In this sense, both methods may be viewed as neural network-based alternatives to more classical uncertainty-aware regression approaches such as GP regression. Although BNNs and NPs can be applied to similar tasks, the way in which uncertainty is represented in each of these models is quite different. For BNNs, epistemic uncertainty is encoded in the weights of the network. This is then propagated to the predictive distribution by sampling many instances of the weights from the posterior. In contrast, in NPs, there is no uncertainty represented in the weights (although uncertainty may be represented in the latent variable for latent neural processes). Rather, the decoder of the NP directly outputs the parameters of a Gaussian distribution over the regression target value. The contrast between the two approaches is clear when we consider that for the conditional neural process, there is no direct way to separate the uncertainty in the predictive distribution between epistemic and aleatoric uncertainty: the CNP simply outputs a predictive variance which incorporates both kinds of uncertainty simultaneously.

Approximate inference. Approximate inference is used in both BNNs and NPs, but in very different ways. In BNNs, approximate inference is needed because we specify a prior over the neural network parameters, which is then paired with a complicated non-linear likelihood. To obtain predictions from the BNN, some approximation of integrals over the posterior distribution must be made. In contrast, for conditional neural processes, there is no approximate inference required: the model outputs the predictive mean and variance using a single deterministic forward pass. It is only for latent neural processes that approximate inference plays a role. When LNPs were first introduced (Garnelo et al., 2018b), the objective proposed was an ELBO which treated the latent variables as quantities to infer.
This led to the neural process variational inference objective described in Section 4.4.2. However, even this is not necessary to train a working LNP. In Section 4.4.3 we introduced the approximated maximum likelihood objective for LNPs that does away with the approximate inference interpretation for the latent variables entirely, instead viewing the latent variables simply as a device for introducing correlations in the NP predictive. Hence approximate inference is not crucial for training either CNPs or LNPs, in the way that it is for BNNs.

Practical recommendations. We conclude with some brief recommendations for practitioners who are deciding between using either a BNN or an NP for their problem. One of our main takeaways from Chapter 3 is that BNN approximate inference is an active research topic that is not yet well understood. Even in relatively simple situations, previously unknown pathologies can sometimes cripple performance. Combined with the difficult problem of prior selection, we recommend using MFVI or MCDO approximate inference in BNNs with caution. Although BNNs can in some cases produce better uncertainty estimates than vanilla deterministic neural networks, this may not necessarily be due to a principled application of Bayesian inference. Furthermore, approximate inference techniques like MFVI often significantly complicate the training procedure (although MCDO is an exception as it is relatively straightforward to apply). In summary, we recommend the use of BNNs when:

1. Epistemic uncertainty estimation is very important for the task at hand.
2. The model is applied to a large dataset (too large for exact GP regression to be applicable).
3. There is only a single dataset available (so that episodic meta-learning cannot be applied).
4. The model is being used with a non-Gaussian likelihood, so that exact GP regression cannot be applied.
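For reference, the approximate maximum likelihood objective for LNPs mentioned in the comparison of approximate inference above admits a compact sketch. We assume here that the estimator takes the log-mean-exp form: the joint likelihood of the target set is averaged over L latent samples and the logarithm is taken afterwards. The function names and the array layout are illustrative assumptions, and a Gaussian decoder is assumed.

```python
import numpy as np


def gaussian_log_prob(y, mean, std):
    """Elementwise log density of N(y; mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((y - mean) / std) ** 2


def approx_ml_objective(y_target, means, stds):
    """Estimate log[(1/L) * sum_l prod_n N(y_n; means[l, n], stds[l, n]^2)].

    y_target: target outputs of shape (n,).
    means, stds: decoder outputs of shape (L, n), one row per latent sample.
    """
    num_latent = means.shape[0]
    # Joint log-likelihood of the whole target set for each latent sample.
    log_joint = gaussian_log_prob(y_target[None, :], means, stds).sum(axis=1)
    # Numerically stable log-mean-exp over the latent samples.
    shift = log_joint.max()
    return shift + np.log(np.exp(log_joint - shift).sum()) - np.log(num_latent)


# Toy usage with made-up decoder outputs for L = 4 latent samples.
toy_rng = np.random.default_rng(0)
toy_y = np.array([0.3, -0.1, 0.8])
toy_means = toy_rng.standard_normal((4, 3))
toy_stds = np.full((4, 3), 0.5)
toy_objective = approx_ml_objective(toy_y, toy_means, toy_stds)
```

Because the logarithm is applied to a Monte Carlo average, the estimate of the log-likelihood is biased for finite L even though the average itself is unbiased for the likelihood. The product over target points sits inside the average, so the latent samples act only as a device for coupling the predictions across target points.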
NPs can often be much easier to deploy than BNNs due to their avoidance of complicated approximate inference techniques. However, their reliance on meta-learning means they can only be applied in specific situations. We recommend the use of NPs when:

1. Uncertainty estimation is important. If there is a need to separate epistemic and aleatoric uncertainty, a latent neural process can be used. Otherwise, both conditional and latent neural processes can be used.
2. The model is to be applied to many small datasets.
3. Prior design is difficult, so that, e.g., an appropriate kernel for applying simple GP regression to the task cannot be specified straightforwardly.
4. Furthermore, if stationarity (or approximate stationarity) is a feature of the underlying stochastic process, and the inputs are one or two-dimensional, we recommend the use of convolutional neural processes.

6.3 Continued work and future research directions

We now briefly discuss future research directions for both BNNs and NPs in light of the work in this thesis, along with follow-up work that has occurred since the publication of the work in this thesis.

6.3.1 Approximate inference in Bayesian neural networks

Our research in Chapter 3 suggests various next steps for BNN research. The first is the development of more flexible yet still scalable approximate posterior distributions. Ideally, these should be such that the assumptions of Theorems 1 and 2 are violated, so that there is no theoretical restriction on the posteriors representing in-between uncertainty, even in the single hidden layer case. One promising example of such an expressive posterior is the recently introduced global inducing variational posterior (Ober and Aitchison, 2021), which is fully correlated across all layers and non-Gaussian.
As mentioned earlier, our theoretical and empirical results for deep BNNs, taken together, suggest that at least in function space, existing posteriors such as mean-field Gaussian and MC dropout are already flexible enough to approximate the posterior predictive distribution well, but the right member of the variational family is not being selected when optimising the ELBO. We conjecture that this is due to the KL-optimal posterior in weight space being far from the optimal posterior in function space. This suggests that one fruitful avenue of research is to change the objective function to reflect function space approximations of the predictive, rather than changing the variational family. This approach has been tried in works such as Rudner et al. (2021) and Sun et al. (2019).

Finally, we believe that the most important practical future work to be undertaken is in the development of diagnostic methods and benchmarks for approximate inference in BNNs. Assessing inference quality in these complex models is a difficult research problem of its own. In this thesis we have focused on a single, easily identifiable property of the posterior predictive: in-between uncertainty. However, there may be many other qualitative features of the exact predictive distribution that are more or less faithfully represented by the approximate predictive. It would be of great use to practitioners to systematically document a range of these properties for a variety of approximate inference methods. Practitioners could then assess which approximation is most appropriate for them based on which of these properties is most relevant for the task at hand.

6.3.2 Convolutional neural processes

Since their introduction, convolutional neural processes have been improved and generalised in various directions.
One line of research considers extending ConvNPs beyond translation equivariance to consider more general group equivariances, e.g., the group of rotations of a sphere (Holderrieth et al., 2021; Kawano et al., 2020). In particular, Holderrieth et al. (2021) use group equivariant neural processes to model stochastic fields, which are stochastic processes that may be vector-valued. We note that their proposed SteerCNP uses a similar encoder to the ConvNP, involving embedding the context set into a function which is then discretised on a regular grid. Both Holderrieth et al. (2021) and Kawano et al. (2020) only consider conditional neural processes and hence can only provide factorised predictions. Addressing this by creating a latent variable version of the model is a natural direction for future work.

ConvNPs have also been developed in another direction which focuses on providing correlated predictions with exactly tractable likelihoods without the need for a latent variable. This leads to the Gaussian neural process (GNP) (Bruinsma et al., 2020; Markou et al., 2022). GNPs work by directly parameterising both the mean and covariance function of the predictive distribution with neural networks. In contrast to CNPs, which can be viewed as outputting predictive Gaussian processes that have kernel functions which lead to diagonal covariance matrices, GNPs can output predictive distributions with fully-correlated covariance matrices. This allows coherent samples to be drawn from GNP predictive distributions. Furthermore, since the predictive distribution is always Gaussian, the likelihood can be computed exactly. It is important to note that although the predictive distributions are GPs conditioned on a fixed context set, the GNP does not necessarily correspond to inference using any GP prior. In this sense it is a more flexible model than GP inference with any fixed kernel.
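The two properties highlighted above, an exactly computable likelihood and coherent samples, both follow from having a full predictive covariance at a finite set of target inputs. The following is a minimal sketch of both operations. It is not the GNP implementation: the helper names are hypothetical, and a squared-exponential covariance with added noise stands in for what would be the output of a neural network.

```python
import numpy as np


def gaussian_log_likelihood(y, mean, cov):
    """Exact log N(y; mean, cov) computed via a Cholesky factorisation."""
    n = y.shape[0]
    chol = np.linalg.cholesky(cov)
    # Solving chol @ alpha = y - mean makes alpha @ alpha the Mahalanobis term.
    alpha = np.linalg.solve(chol, y - mean)
    log_det = 2.0 * np.log(np.diag(chol)).sum()
    return -0.5 * (alpha @ alpha + log_det + n * np.log(2.0 * np.pi))


def sample_gaussian(mean, cov, rng):
    """Draw one coherent (jointly correlated) sample from N(mean, cov)."""
    chol = np.linalg.cholesky(cov)
    return mean + chol @ rng.standard_normal(mean.shape[0])


# Placeholder predictive: in a GNP these would be produced by neural networks.
x_target = np.linspace(0.0, 1.0, 20)
pred_mean = np.sin(2.0 * np.pi * x_target)
sq_dist = (x_target[:, None] - x_target[None, :]) ** 2
full_cov = np.exp(-0.5 * sq_dist / 0.3 ** 2) + 0.01 * np.eye(20)
# A CNP-style predictive keeps only the marginal variances.
diag_cov = np.diag(np.diag(full_cov))

gnp_rng = np.random.default_rng(0)
y_obs = sample_gaussian(pred_mean, full_cov, gnp_rng)
ll_full = gaussian_log_likelihood(y_obs, pred_mean, full_cov)
ll_diag = gaussian_log_likelihood(y_obs, pred_mean, diag_cov)
```

With a diagonal covariance the log-likelihood reduces to a sum of independent univariate terms, which is the factorised case; the off-diagonal entries of the full covariance are what make joint samples smooth rather than independent across target points.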
One disadvantage of the GNP compared to latent neural processes is that the GNP cannot model non-Gaussian predictive distributions. However, it appears that this disadvantage is outweighed by the tractable likelihood of the GNP enabling exact computation of the maximum likelihood objective. Markou et al. (2022) show that a convolutional GNP can outperform the convolutional latent neural process on a variety of tasks.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Abramowitz, M. and Stegun, I. A. (1965). Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation.
Aitchison, L. (2020). A statistical theory of cold posteriors in deep neural networks. In International Conference on Learning Representations.
Alquier, P. and Ridgway, J. (2020). Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 48(3):1475–1497.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404.
Ashukha, A., Lyzhov, A., Molchanov, D., and Vetrov, D. (2019). Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations.
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. NIPS 2016 Deep Learning Symposium.
Bahdanau, D., Cho, K. H., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
Barber, D. and Bishop, C. M. (1998).
Ensemble learning in Bayesian neural networks. Nato ASI Series F Computer and Systems Sciences, 168:215–238.
Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, University College London.
Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. (2018). Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research (JMLR).
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
Bruinsma, W., Requeima, J., Foong, A. Y., Gordon, J., and Turner, R. E. (2020). The Gaussian neural process. In Third Symposium on Advances in Approximate Bayesian Inference.
Bui, T. D. (2021). Biases in variational Bayesian neural networks. In Bayesian Deep Learning Workshop, 35th Conference on Neural Information Processing Systems.
Buntine, W. L. and Weigend, A. S. (1991). Bayesian back-propagation. Complex systems, 5(6):603–643.
Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. In International Conference on Learning Representations.
Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691.
Chérief-Abdellatif, B.-E. (2020). Convergence rates of variational inference in sparse deep learning. In International Conference on Machine Learning, pages 1831–1842. PMLR.
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258.
Chua, K., Calandra, R., McAllister, R., and Levine, S.
(2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765.
Cohen, T. and Welling, M. (2016). Group equivariant convolutional networks. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2990–2999, New York, New York, USA. PMLR.
Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1995). Active learning with statistical models. In Advances in Neural Information Processing Systems, pages 705–712.
Coker, B., Bruinsma, W. P., Burt, D. R., Pan, W., and Doshi-Velez, F. (2022). Wide mean-field Bayesian neural networks ignore the data. In International Conference on Artificial Intelligence and Statistics, pages 5276–5333. PMLR.
Copernicus Climate Change Service (2020). Copernicus Climate Change Service (C3S) (2019): C3S ERA5-Land reanalysis. (accessed: 15.05.2020).
Coraddu, A., Oneto, L., Ghio, A., Savio, S., Anguita, D., and Figari, M. (2014). Machine learning approaches for improving condition-based maintenance of naval propulsion plants. Journal of Engineering for the Maritime Environment.
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American journal of physics, 14(1):1–13.
Cressie, N. (1990). The origins of kriging. Mathematical geology, 22(3):239–252.
Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. (2021). Laplace redux-effortless Bayesian deep learning. Advances in Neural Information Processing Systems, 34.
DeGeneres, E. (2014). If only Bradley’s arm was longer. Best photo ever. Oscars pic.twitter.com/c9u5notgap.
Deisenroth, M. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472.
Delhomme, J. P. (1978).
Kriging in the hydrosciences. Advances in water resources, 1:251–266.
Denker, J. S. and LeCun, Y. (1991). Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems (NIPS).
Der Kiureghian, A. and Ditlevsen, O. (2009). Aleatory or epistemic? does it matter? Structural Safety, 31(2):105–112.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
Dubois, Y., Gordon, J., and Foong, A. Y. K. (2020). Neural process family. https://yanndubs.github.io/Neural-Process-Family.
Ellison, A. M. (1987). Effect of seed dimorphism on the density-dependent dynamics of experimental populations of atriplex triangularis (chenopodiaceae). American Journal of Botany, 74(8):1280–1288.
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115.
Farquhar, S., Smith, L., and Gal, Y. (2020). Liberty or depth: Deep Bayesian neural nets do not need complex weight posterior approximations. Advances in Neural Information Processing Systems, 33:4346–4357.
Filos, A., Farquhar, S., Gomez, A. N., Rudner, T. G. J., Kenton, Z., Smith, L., Alizadeh, M., de Kroon, A., and Gal, Y. (2019). Benchmarking Bayesian deep learning with diabetic retinopathy diagnosis. https://github.com/OATML/bdl-benchmarks.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia. PMLR.
Flam-Shepherd, D., Requeima, J., and Duvenaud, D. (2017). Mapping Gaussian process priors to Bayesian neural networks. In NIPS Bayesian deep learning workshop, volume 3.
Foong, A., Bruinsma, W., Gordon, J., Dubois, Y., Requeima, J., and Turner, R. (2020a). Meta-learning stationary stochastic process prediction with convolutional neural processes. Advances in Neural Information Processing Systems, 33:8284–8295.
Foong, A., Burt, D., Li, Y., and Turner, R. (2020b). On the expressiveness of approximate inference in Bayesian neural networks. Advances in Neural Information Processing Systems, 33:15897–15908.
Fort, S., Ren, J., and Lakshminarayanan, B. (2021). Exploring the limits of out-of-distribution detection. Advances in Neural Information Processing Systems, 34:7068–7081.
Fortuin, V. (2022). Priors in Bayesian deep learning: A review. International Statistical Review.
Fortuin, V., Garriga-Alonso, A., Ober, S. W., Wenzel, F., Ratsch, G., Turner, R. E., van der Wilk, M., and Aitchison, L. (2021). Bayesian neural network priors revisited. In International Conference on Learning Representations.
Frey, B. J. and Hinton, G. E. (1999). Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193–213.
Frostig, R., Johnson, M. J., and Leary, C. (2018). Compiling machine learning programs via high-level tracing. Systems for Machine Learning, pages 23–24.
Fukushima, K. and Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pages 267–285. Springer.
Gal, Y. (2016). Uncertainty in deep learning. PhD thesis, University of Cambridge.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning (ICML).
Gal, Y., Hron, J., and Kendall, A. (2017a). Concrete dropout.
In Advances in Neural Information Processing Systems, pages 3581–3590.
Gal, Y., Islam, R., and Ghahramani, Z. (2017b). Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning (ICML).
Gal, Y., McAllister, R., and Rasmussen, C. E. (2016). Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, volume 4.
Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D., and Eslami, S. A. (2018a). Conditional neural processes. In International Conference on Machine Learning, pages 1704–1713. PMLR.
Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and Teh, Y. W. (2018b). Neural processes. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical science, pages 457–472.
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452.
Good, I. J. (1983). Good thinking: The foundations of probability and its applications. U of Minnesota Press.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.
Gordon, J. (2021). Advances in Probabilistic Meta-Learning and the Neural Process Family. PhD thesis, University of Cambridge.
Gordon, J., Bruinsma, W. P., Foong, A. Y. K., Requeima, J., Dubois, Y., and Turner, R. E. (2020). Convolutional conditional neural processes.
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (NIPS) 24.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
Heek, J. and Kalchbrenner, N. (2019).
Bayesian inference for large scale image classification. arXiv preprint arXiv:1908.03491.
Hernández-Lobato, J. M. and Adams, R. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
Hernández-Lobato, J. M., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., and Turner, R. E. (2016). Black-box alpha divergence minimization. In Proceedings of The 33rd International Conference on Machine Learning (ICML).
Hinton, G. and Van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Citeseer.
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
Holderrieth, P., Hutchinson, M. J., and Teh, Y. W. (2021). Equivariant learning of stochastic fields: Gaussian processes and steerable conditional neural processes. In International Conference on Machine Learning, pages 4297–4307. PMLR.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257.
Hron, J., Bahri, Y., Novak, R., Pennington, J., and Sohl-Dickstein, J. (2020). Exact posterior distributions of wide Bayesian neural networks. In Uncertainty in deep learning Workshop, ICML.
Hron, J., Matthews, A. G. d. G., and Ghahramani, Z. (2018). Variational Bayesian dropout: pitfalls and fixes. In Proceedings of the 35th International Conference on Machine Learning (ICML).
Huszár, F. (2017). Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235.
Immer, A., Bauer, M., Fortuin, V., Rätsch, G., and Emtiyaz, K. M. (2021a). Scalable marginal likelihood estimation for model selection in deep learning. In International Conference on Machine Learning, pages 4563–4573. PMLR.
Immer, A., Korzepa, M., and Bauer, M. (2021b). Improving predictions of Bayesian neural nets via local linearization. In International Conference on Artificial Intelligence and Statistics, pages 703–711. PMLR.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. PMLR.
Islam, M. A., Jia, S., and Bruce, N. D. (2019). How much position information do convolutional neural networks encode? In International Conference on Learning Representations.
Izmailov, P., Vikram, S., Hoffman, M. D., and Wilson, A. G. G. (2021). What are Bayesian neural network posteriors really like? In International Conference on Machine Learning, pages 4629–4640. PMLR.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
Kawano, M., Kumagai, W., Sannai, A., Iwasawa, Y., and Matsuo, Y. (2020). Group equivariant conditional neural processes. In International Conference on Learning Representations.
Kendall, A. and Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584.
Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. (2018). Fast and scalable Bayesian deep learning by weight-perturbation in Adam. Proceedings of The 35th International Conference on Machine Learning (ICML).
Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. (2018). Attentive neural processes. In International Conference on Learning Representations.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Kingma, D. P., Salimans, T., and Welling, M. (2015).
Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. In International Conference on Learning Representations.
Knutsson, H. and Westin, C.-F. (1993). Normalized and differential convolution. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 515–523. IEEE.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30.
Le, T. A., Kim, H., Garnelo, M., Rosenbaum, D., Schwarz, J., and Teh, Y. W. (2018). Empirical evaluation of neural process objectives. In NeurIPS workshop on Bayesian Deep Learning.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deep neural networks as Gaussian processes. In International Conference on Learning Representations (ICLR).
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867.
Li, Y., Hernández-Lobato, J. M., and Turner, R. E. (2015). Stochastic expectation propagation. In Advances in Neural Information Processing Systems, pages 2323–2331.
Li, Y. and Turner, R. E. (2016).
Rényi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073–1081.
Liu, Z., Luo, P., Wang, X., and Tang, X. (2018). Large-scale celebfaces attributes (CelebA) dataset. Retrieved August, 15:2018.
Louizos, C. and Welling, M. (2016). Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716.
Louizos, C. and Welling, M. (2017). Multiplicative normalizing flows for variational Bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML).
Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.
MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge university press.
MacKay, D. J. C. (1992a). Information-based objective functions for active data selection. Neural computation, 4(4):590–604.
MacKay, D. J. C. (1992b). A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472.
Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019). A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32.
Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907.
Manita, O. A., Peletier, M. A., Portegies, J. W., Sanders, J., and Senen-Cerda, A. (2022). Universal approximation in dropout neural networks. Journal of Machine Learning Research, 23(19):1–46.
Markou, S., Requeima, J., Bruinsma, W. P., Vaughan, A., and Turner, R. E. (2022). Practical conditional neural processes via tractable dependent predictions.
Martens, J. (2020). New insights and perspectives on the natural gradient method. The Journal of Machine Learning Research, 21(1):5776–5851.
Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Artificial Intelligence and Statistics, pages 231–239. PMLR.
Matthews, A. G. d. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations.
Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research (JMLR), 18(40):1–6.
Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In International conference on machine learning, pages 2391–2400. PMLR.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092.
Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc.
Mobiny, A., Singh, A., and Nguyen, H. V. (2019). Risk-aware machine learning classifier for skin lesion diagnosis. Journal of Clinical Medicine, 8.
Mukhoti, J., Stenetorp, P., and Gal, Y. (2018). On the importance of strong baselines in Bayesian deep learning. In NeurIPS 2018 Bayesian Deep Learning Workshop.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1):141–142.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021). Deep double descent: Where bigger models and more data hurt.
Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003.
Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto.
Neal, R. M. et al. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11):2.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
Noci, L., Roth, K., Bachmann, G., Nowozin, S., and Hofmann, T. (2021). Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect. Advances in Neural Information Processing Systems, 34:12738–12748.
Ober, S. W. and Aitchison, L. (2021). Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes. In International Conference on Machine Learning, pages 8248–8259. PMLR.
Osawa, K., Swaroop, S., Khan, M. E. E., Jain, A., Eschenhagen, R., Turner, R. E., and Yokota, R. (2019). Practical deep learning with Bayesian principles. Advances in neural information processing systems, 32.
Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 8617–8629.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. (2019). Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32.
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. (2018). Image transformer. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4055–4064, Stockholmsmässan, Stockholm Sweden. PMLR.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS-W.
Pati, D., Bhattacharya, A., and Yang, Y. (2018). On statistical optimality of variational Bayes. In International Conference on Artificial Intelligence and Statistics, pages 1579–1588. PMLR.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
Ramsey, F. P. (2016). Truth and probability. In Readings in formal epistemology, pages 21–45. Springer.
Ranganath, R., Tran, D., Altosaar, J., and Blei, D. (2016a). Operator variational inference. Advances in Neural Information Processing Systems, 29.
Ranganath, R., Tran, D., and Blei, D. (2016b). Hierarchical variational models. In International Conference on Machine Learning, pages 324–333.
Rasmussen, C. E. and Williams, C. K. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
Ravi, S. and Larochelle, H. (2016). Optimization as a model for few-shot learning. In International Conference on Learning Representations.
Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR.
Ritter, H., Botev, A., and Barber, D. (2018). A scalable Laplace approximation for neural networks. In International Conference on Learning Representations (ICLR).
Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. (2013). Gaussian processes for time-series modelling.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer.
Rothfuss, J., Josifoski, M., and Krause, A. (2020). Meta-learning Bayesian neural network priors based on PAC-Bayesian theory. In NeurIPS 4th Workshop on Meta-Learning.
Rudner, T. G., Chen, Z., Teh, Y. W., and Gal, Y. (2021). Tractable function-space variational inference in Bayesian neural networks. In ICML Workshop on Uncertainty & Robustness in Deep Learning.
Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5(3):1.
Savage, L. J. (1972). The foundations of statistics. Courier Corporation.
Schmidhuber, J. (1987). Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep information propagation. In International Conference on Learning Representations (ICLR).
Settles, B. (2009). Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
Shafaei, A., Schmidt, M., and Little, J. J. (2018). Does your model know the digit 6 is not a cat? a less biased evaluation of "outlier" detectors. CoRR, abs/1809.04729.
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
Shi, J., Sun, S., and Zhu, J. (2018). Kernel implicit variational inference. In International Conference on Learning Representations (ICLR).
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms.
In Advances in Neural Information Processing Systems, pages 2951–2959.
Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R. (2015). Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180.
Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. (2016). Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134–4142.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural networks. In International Conference on Learning Representations (ICLR).
Swaroop, S., Nguyen, C. V., Bui, T. D., and Turner, R. E. (2018). Improving and understanding variational continual learning. In NIPS 2018 Continual Learning Workshop.
Tao, T. (2011). An introduction to measure theory, volume 126. American Mathematical Society Providence.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics, pages 567–574. PMLR.
Tomczak, M. B., Swaroop, S., and Turner, R. E. (2018). Neural network ensembles and variational inference revisited. In 1st Symposium on Advances in Approximate Bayesian Inference, pages 1–11.
Tran, B.-H., Milios, D., Rossi, S., and Filippone, M. (2020). Functional priors for Bayesian neural networks through Wasserstein distance minimization to Gaussian processes.
In Third Symposium on Advances in Approximate Bayesian Inference.
Trippe, B. and Turner, R. (2018). Overpruning in variational Bayesian neural networks. In NIPS 2017 Workshop on Advances in Approximate Bayesian Inference.
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9:2579–2605.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: an improved R̂ for assessing convergence of MCMC (with discussion). Bayesian analysis, 16(2):667–718.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. Advances in neural information processing systems, 29.
Wagstaff, E., Fuchs, F., Engelcke, M., Posner, I., and Osborne, M. A. (2019). On the limitations of representing functions on sets. In International Conference on Machine Learning, pages 6487–6494. PMLR.
Wagstaff, E., Fuchs, F. B., Engelcke, M., Osborne, M. A., and Posner, I. (2022). Universal approximation of functions on sets. Journal of Machine Learning Research, 23(151):1–56.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R., and Nowozin, S. (2020). How good is the Bayes posterior in deep neural networks really? In International Conference on Machine Learning, pages 10248–10259. PMLR.
Wilson, A. G. (2020). The case for Bayesian deep learning.
arXiv preprint arXiv:2001.10995.
Wilson, A. G. and Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697–4708.
Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. (2017). On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations.
Yang, G. (2019a). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.
Yang, G. (2019b). Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. Advances in Neural Information Processing Systems, 32.
Yang, W., Lorch, L., Graule, M. A., Srinivasan, S., Suresh, A., Yao, J., Pradier, M. F., and Doshi-Velez, F. (2019). Output-constrained Bayesian neural networks. arXiv preprint arXiv:1905.06287.
Yarotsky, D. (2022). Universal approximations of invariant maps by neural networks. Constructive Approximation, 55(1):407–474.
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep sets. Advances in neural information processing systems, 30.
Zhang, F. and Gao, C. (2020). Convergence rates of variational posterior distributions. The Annals of Statistics, 48(4):2180–2207.
Zhang, G., Wang, C., Xu, B., and Grosse, R. (2019). Three mechanisms of weight decay regularization. In International Conference on Learning Representations.
Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. (2020). Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations.

Appendix A Proofs of results on single-hidden layer BNNs

In Section 3.3 we stated simplified versions of bounds concerning the variance of single-hidden layer networks with certain approximating families.
Here in Appendix A.1 we provide more general statements of the theorems, followed in Appendix A.2 by statements of the lemmas that their proofs rely on. In Appendix A.3, we present proofs of each lemma. Finally, in Appendix A.4, we provide the proofs of the general statements of the theorems.

A.1 General theorem statements

The three main results we prove in this section are the following generalisations of Theorems 1 to 3, now stated as Theorems 7 to 9:

Theorem 7 (MFVI). Consider a single-hidden layer ReLU neural network mapping from $\mathbb{R}^D \to \mathbb{R}^K$ with $I \in \mathbb{N}$ hidden units. The corresponding mapping is given by $f_k(x) = \sum_{i=1}^{I} w_{k,i}\,\psi\left(\sum_{d=1}^{D} u_{i,d} x_d + v_i\right) + b_k$ for $1 \le k \le K$, where $\psi(a) = \max(0, a)$. Suppose we have a distribution over network parameters with density of the form:

\[ q(W, b, U, v) = \prod_{i=1}^{I} q_i(W_i \mid U, v)\, q(b \mid U, v) \prod_{i=1}^{I} \prod_{d=1}^{D} \mathcal{N}(u_{i,d}; \mu_{u_{i,d}}, \sigma^2_{u_{i,d}}) \prod_{i=1}^{I} \mathcal{N}(v_i; \mu_{v_i}, \sigma^2_{v_i}), \tag{A.1} \]

where $W_i = \{w_{k,i}\}_{k=1}^{K}$ are the weights out of neuron $i$ and $b = \{b_k\}_{k=1}^{K}$ are the output biases, and $q_i(W_i \mid U, v)$ and $q(b \mid U, v)$ are arbitrary probability densities with finite first two moments. Consider a line in $\mathbb{R}^D$ parameterised by $x(\lambda)_d = \gamma_d \lambda + c_d$ for $\lambda \in \mathbb{R}$ such that $\gamma_d c_d = 0$ for $1 \le d \le D$. Then for any $\lambda_1 \le 0 \le \lambda_2$, and any $\lambda_*$ such that $|\lambda_*| \le \min(|\lambda_1|, |\lambda_2|)$,

\[ \mathbb{V}[f_k(x(\lambda_*))] \le \mathbb{V}[f_k(x(\lambda_1))] + \mathbb{V}[f_k(x(\lambda_2))] \quad \text{for } 1 \le k \le K. \tag{A.2} \]

We provide the proof of Theorem 7 in Appendix A.4. We briefly describe how the statement of Theorem 1 in the main text can be deduced from this more general version. The fully factorised Gaussian family $\mathcal{Q}_{\mathrm{FFG}}$ is of the form in Equation (A.1). It remains to show that both conditions i) and ii) imply that $\gamma_d c_d = 0$.

Consider any line intersecting the origin (i.e. satisfying condition i)). Such a line can be written in the form $x(\lambda)_d = \gamma_d \lambda$ by choosing the origin to correspond to $\lambda = 0$. As $c_d = 0$ for all $d$, $\gamma_d c_d = 0$ for all $d$.
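As an illustration of the bound (A.2), the following is a minimal sketch that evaluates the output variance exactly for a fully factorised Gaussian network with a single output, and checks the inequality on a line through the origin. It is not code from the experiments in this thesis: the function names, the randomly drawn variational parameters and the chosen values of $\lambda_1$, $\lambda_2$, $\lambda_*$ are all assumptions made for illustration. The sketch relies on the standard closed-form first and second moments of a rectified Gaussian, and on the fact that under full factorisation the neurons are independent, so that their variances add.

```python
import math
import random


def normal_pdf(r):
    """Standard Gaussian density."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)


def normal_cdf(r):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))


def relu_moments(mean, std):
    """Return E[psi(a)] and E[psi(a)^2] for a ~ N(mean, std^2), psi = ReLU."""
    r = mean / std
    h = normal_pdf(r) + r * normal_cdf(r)
    return std * h, std * std * (normal_cdf(r) + r * h)


def output_variance(x, neurons, var_b):
    """Exact V[f(x)] for one output of a fully factorised Gaussian network."""
    total = var_b
    for n in neurons:
        mean = sum(m * xd for m, xd in zip(n["mu_u"], x)) + n["mu_v"]
        var = sum(s * xd * xd for s, xd in zip(n["var_u"], x)) + n["var_v"]
        m1, m2 = relu_moments(mean, math.sqrt(var))
        # w is independent of the activation, so
        # V[w psi] = E[w^2] E[psi^2] - (E[w] E[psi])^2.
        total += (n["mu_w"] ** 2 + n["var_w"]) * m2 - (n["mu_w"] * m1) ** 2
    return total


def make_random_network(num_inputs, num_hidden, rng):
    """Hypothetical variational parameters, drawn at random for illustration."""
    return [
        {
            "mu_u": [rng.gauss(0.0, 1.0) for _ in range(num_inputs)],
            "var_u": [rng.uniform(0.01, 1.0) for _ in range(num_inputs)],
            "mu_v": rng.gauss(0.0, 1.0),
            "var_v": rng.uniform(0.01, 1.0),
            "mu_w": rng.gauss(0.0, 1.0),
            "var_w": rng.uniform(0.01, 1.0),
        }
        for _ in range(num_hidden)
    ]


def check_bound(neurons, var_b, gamma, lam1, lam2, lam_star):
    """Check (A.2) on the line x(lambda) = gamma * lambda through the origin."""
    def var_at(lam):
        return output_variance([g * lam for g in gamma], neurons, var_b)
    return var_at(lam_star) <= var_at(lam1) + var_at(lam2)


rng = random.Random(0)
neurons = make_random_network(num_inputs=3, num_hidden=20, rng=rng)
gamma = [rng.gauss(0.0, 1.0) for _ in range(3)]
print(check_bound(neurons, 0.25, gamma, lam1=-2.0, lam2=1.5, lam_star=0.7))
```

Since $c_d = 0$ for every $d$ on such a line and $|\lambda_*| \le \min(|\lambda_1|, |\lambda_2|)$ in this example, Theorem 7 applies and the check is expected to succeed for any choice of the variational parameters.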
In Theorem 1, $p = x(\lambda_1)$ and $q = x(\lambda_2)$ are on opposite sides of the origin, hence the signs of $\lambda_1$ and $\lambda_2$ are opposite. Finally, the condition that $r = x(\lambda_*)$ is closer to the origin than both $p$ and $q$ is exactly that $|\lambda_*| \le \min(|\lambda_1|, |\lambda_2|)$.

In order to verify condition ii), note that any line orthogonal to a hyperplane $x_{d'} = 0$ can be parameterised as $x(\lambda)_d = \gamma_d \lambda + c_d$, where $\gamma_d = 0$ for $d \ne d'$ and $c_{d'} = 0$. Hence $\gamma_d c_d = 0$ for all $d$. The condition that the line segment $\overrightarrow{pq}$ intersects the plane, with $p = x(\lambda_1)$ and $q = x(\lambda_2)$, is exactly that the signs of $\lambda_1$ and $\lambda_2$ are opposite, and that $|\lambda_*| \le \min(|\lambda_1|, |\lambda_2|)$.

We now describe the results for BNNs using MC dropout, where, as noted in Section 3.3, the result differs depending on whether dropout is applied to the inputs:

Theorem 8 (MC dropout with inputs not dropped out). Consider a single-hidden layer ReLU neural network mapping from $\mathbb{R}^D \to \mathbb{R}^K$ with $I \in \mathbb{N}$ hidden units. The corresponding mapping is given by $f_k(x) = \sum_{i=1}^{I} w_{k,i}\,\psi\left(\sum_{d=1}^{D} u_{i,d} x_d + v_i\right) + b_k$ for $1 \le k \le K$, where $\psi(a) = \max(0, a)$. Assume $U, v$ are set deterministically and

\[ q(W, b) = q(b) \prod_{i=1}^{I} q_i(W_i), \]

where $W_i = \{w_{k,i}\}_{k=1}^{K}$ are the weights out of neuron $i$, $b = \{b_k\}_{k=1}^{K}$ are the output biases and $q(b)$ and $q_i(W_i)$ are arbitrary probability densities with finite first two moments. Then, $\mathbb{V}[f_k(x)]$ is convex in $x$ for $1 \le k \le K$.

We note that when performing MC dropout without dropping out the inputs, $U, v$ are set deterministically. Furthermore, the weights out of different neurons are dropped independently of each other. Hence Theorem 8 applies to this approximating family. We provide the proof of Theorem 8 in Appendix A.4.

Remark 5. Theorem 8 applies for any activation function $\psi$ such that $\psi^2$ is convex. This is the only property of $\psi$ which we will use in Lemma 3.

Theorem 9 (MC dropout with inputs dropped out). Consider a single-hidden layer ReLU neural network mapping from $\mathbb{R}^D \to \mathbb{R}^K$ with $I \in \mathbb{N}$ hidden units.
The corresponding mapping is given by $f_k(x) = \sum_{i=1}^{I} w_{k,i}\,\psi\left(\sum_{d=1}^{D} u_{i,d} x_d + v_i\right) + b_k$ for $1 \le k \le K$, where $\psi(a) = \max(0, a)$. Assume $v$ is set deterministically and

\[ q(W, b, U) = q(U)\, q(b \mid U) \prod_{i} q_i(W_i \mid U), \]

where $W_i = \{w_{k,i}\}_{k=1}^{K}$ are the weights out of neuron $i$, $b = \{b_k\}_{k=1}^{K}$ are the output biases and $q(U)$, $q(b \mid U)$ and $q_i(W_i \mid U)$ are arbitrary probability densities with finite first two moments. Then, for any finite set of points $S \subset \mathbb{R}^D$ such that $0$ is in the convex hull of $S$,

\[ \mathbb{V}[f_k(0)] \le \max_{s \in S} \{\mathbb{V}[f_k(s)]\} \quad \text{for } 1 \le k \le K. \tag{A.3} \]

We note that when applying MC dropout to the inputs and the hidden layer, $v$ is still deterministic since biases are not dropped out, and the weight distribution factorises as in Theorem 9. We provide the proof of Theorem 9 in Appendix A.4.

A.2 Statements of lemmas

In this section we state the lemmas required to prove Theorems 7 to 9:

Lemma 3. Assume a distribution for $W, b \mid U, v$ with density of the form $q(W, b \mid U, v) = q(b \mid U, v) \prod_{i} q_i(W_i \mid U, v)$. Then, $\mathbb{V}[f_k(x) \mid U, v]$ is a convex function of $x$.

The proof of Lemma 3 is in Section A.3.1.

Lemma 4. Consider the variance of a single neuron in the one dimensional case, with activation $a(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$, $\mu(x) = \mu_u x + \mu_v$ and $\sigma^2(x) = \sigma_u^2 x^2 + \sigma_v^2$. Let

\[ T_1 = \{f \ge 0 : \forall\, 0 \le b < a,\ f(a) \ge f(-a) \text{ and } f(b) \le f(a)\} \]

and

\[ T_2 = \{f \ge 0 : \forall\, a < b \le 0,\ f(a) \ge f(-a) \text{ and } f(b) \le f(a)\}. \]

If $\mu_u \ge 0$, then $\mathbb{V}[\psi(a(x))] \in T_1$. If $\mu_u \le 0$, then $\mathbb{V}[\psi(a(x))] \in T_2$.

The proof of Lemma 4 is in Section A.3.2.

Corollary 1 (Corollary of Lemma 4). Consider a line in $\mathbb{R}^D$ parameterised by $[x(\lambda)]_d = \gamma_d \lambda + c_d$ for $\lambda \in \mathbb{R}$ such that $\gamma_d c_d = 0$ for $1 \le d \le D$. Let $a(x) := \sum_{d=1}^{D} u_d x_d + v$ with $\{u_d\}_{d=1}^{D}$ and $v$ independent and Gaussian distributed. Then, $\mathbb{V}[\psi(a(x(\lambda)))] \in T_1 \cup T_2$ (as a function of $\lambda$).

Proof. The activation $a(x(\lambda))$ is a linear combination of Gaussian random variables, and is therefore Gaussian distributed. Moreover the mean is linear in $\lambda$.
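The variance of $a(x(\lambda))$ is given by:

\[
\begin{aligned}
\mathbb{V}[a(x(\lambda))] &= \sum_{d=1}^{D} \mathbb{V}[u_d](\gamma_d \lambda + c_d)^2 + \mathbb{V}[v] = \sum_{d=1}^{D} \sigma^2_{u_d}(\gamma_d \lambda + c_d)^2 + \sigma_v^2 \\
&= \lambda^2 \left( \sum_{d=1}^{D} \sigma^2_{u_d} \gamma_d^2 \right) + 2\lambda \left( \sum_{d=1}^{D} \sigma^2_{u_d} \gamma_d c_d \right) + \left( \sum_{d=1}^{D} \sigma^2_{u_d} c_d^2 + \sigma_v^2 \right) \\
&= \lambda^2 \left( \sum_{d=1}^{D} \sigma^2_{u_d} \gamma_d^2 \right) + \left( \sum_{d=1}^{D} \sigma^2_{u_d} c_d^2 + \sigma_v^2 \right).
\end{aligned}
\]

The cross term vanishes because $\gamma_d c_d = 0$ for every $d$. Defining $\sigma^2_{\tilde{u}} = \sum_{d=1}^{D} \sigma^2_{u_d} \gamma_d^2$ and $\sigma^2_{\tilde{v}} = \sum_{d=1}^{D} \sigma^2_{u_d} c_d^2 + \sigma_v^2$, the corollary follows from Lemma 4.

Lemma 5. Let $C$ be the set of convex functions from $\mathbb{R} \to [0, \infty)$. Fix any $a < 0 < b$ and $c$ such that $|c| \le \min(|a|, |b|)$. Then any function $f$ that can be written as a linear combination of functions in $T_1 \cup T_2 \cup C$ with non-negative weights satisfies $f(c) \le f(a) + f(b)$.

The proof of Lemma 5 can be found in Section A.3.3.

Lemma 6. Let $f : \mathbb{R}^D \to \mathbb{R}$ be a convex function and consider a finite set of points $S \subset \mathbb{R}^D$. Then for any point $r$ in the convex hull of $S$, $f(r) \le \max_{s \in S} \{f(s)\}$.

The proof of Lemma 6 is given in Section A.3.4.

A.3 Proofs of lemmas

In this section we prove the lemmas stated in Appendix A.2.

A.3.1 Proof of lemma 3

Proof. We assume a distribution for the network weights such that:

\[ q(W, b \mid U, v) = q(b \mid U, v) \prod_{i=1}^{I} q_i(W_i \mid U, v). \]

By this factorisation assumption, the outgoing weights from each neuron are conditionally independent. This means the conditional variance of the output under this distribution can be written

\[ \mathbb{V}[f_k(x) \mid U, v] = \sum_{i} \mathbb{V}[w_{k,i} \mid U, v]\, \psi(a_i)^2 + \mathbb{V}[b_k \mid U, v], \tag{A.4} \]

with $a_i := a_i(x) = \sum_{d=1}^{D} u_{i,d} x_d + v_i$. Since $\mathbb{V}[f_k(x) \mid U, v]$ is a linear combination of the $\psi(a_i)^2$ with non-negative weights (plus a constant), to prove convexity it suffices to show that each $\psi(a_i)^2$ is convex as a function of $x$. $\psi(a_i)^2$ is convex as a function of $a_i$, since it is $0$ for $a_i \le 0$ and $a_i^2$ for $a_i > 0$. To show that it is convex as a function of $x$, we write

\[
\begin{aligned}
\psi\left(a_i(t x_1 + (1 - t) x_2)\right)^2 &= \psi\left( \sum_{d} u_{i,d} \left( t [x_1]_d + (1 - t) [x_2]_d \right) + v_i \right)^2 \\
&= \psi\left( t \left( \sum_{d} u_{i,d} [x_1]_d + v_i \right) + (1 - t) \left( \sum_{d} u_{i,d} [x_2]_d + v_i \right) \right)^2 \\
&\le t\, \psi\left( \sum_{d} u_{i,d} [x_1]_d + v_i \right)^2 + (1 - t)\, \psi\left( \sum_{d} u_{i,d} [x_2]_d + v_i \right)^2 \\
&= t\, \psi(a_i(x_1))^2 + (1 - t)\, \psi(a_i(x_2))^2.
\end{aligned}
\]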
The inequality uses convexity of $\psi(a)^2$ as a function of $a$.

A.3.2 Proof of lemma 4

Throughout, we assume $\sigma_u$, $\sigma_v$ and $\mu_v$ are fixed and suppress dependence on these parameters. Let $v_{\mu_u}(x) := \mathbb{V}[\psi(a(x))]$ where the variance is taken with respect to a distribution with parameter $\mu_u$. Then, $v_{\mu_u}(x) = v_{-\mu_u}(-x)$ since $\mu(x)$ and $\sigma^2(x)$ are unchanged by the transformation $\mu_u, x \to -\mu_u, -x$. Suppose $v_{\mu_u} \in T_1$ for $\mu_u > 0$, then for $x \le 0$,

\[ v_{-\mu_u}(x) = v_{\mu_u}(-x) \ge v_{\mu_u}(x) = v_{-\mu_u}(-x), \]

and for $x < y \le 0$,

\[ v_{-\mu_u}(y) = v_{\mu_u}(-y) \le v_{\mu_u}(-x) = v_{-\mu_u}(x). \]

In words, if $v_{\mu_u} \in T_1$ then $v_{-\mu_u} \in T_2$. It therefore suffices to consider the case when $\mu_u \ge 0$.

We first show that if $x \ge 0$, $v_{\mu_u}(x) \ge v_{\mu_u}(-x)$. Henceforth, we assume $\mu_u \ge 0$ is fixed and suppress it notationally. From Frey and Hinton (1999),

\[ v(x) = \sigma(x)^2\, \alpha(r(x)). \tag{A.5} \]

Here $r(x) = \mu(x)/\sigma(x)$. We define $h(r) = N(r) + r\Phi(r)$, where $N$ is the standard Gaussian pdf and $\Phi$ is the standard Gaussian cdf. We define $\alpha(r) = \Phi(r) + r h(r) - h(r)^2$.

As $\sigma(x)^2 = \sigma(-x)^2$, it suffices to show $\alpha(r(x)) \ge \alpha(r(-x))$ for $x > 0$. To show this, we first show that $r(x) \ge r(-x)$ for $x > 0$, then show that $\alpha(r)$ is monotonically increasing.

\[ r(x) = \mu(x)/\sigma(x) = \mu(-x)/\sigma(-x) + 2\mu_u x/\sigma(-x) \ge \mu(-x)/\sigma(-x) = r(-x). \]

The inequality uses that both $\mu_u$ and $x$ are non-negative. It remains to show that $\alpha(r)$ is monotonically increasing. A straightforward calculation shows that

\[ \alpha'(r) = 2 h(r)(1 - \Phi(r)). \]

As $1 - \Phi(r) > 0$, we must show $h(r) \ge 0$. We have $\lim_{r \to -\infty} h(r) = 0$ and $h'(r) = \Phi(r) > 0$, implying $h(r) > 0$. We conclude $\alpha'(r) > 0$ for all $r$, showing that $v_{\mu_u}(x) \ge v_{\mu_u}(-x)$ for $x \ge 0$.

To complete the proof, we must show that $v(x)$ is monotonically increasing for $x \ge 0$. As $\sigma(x)^2$ is increasing as a function of $x$ and $\alpha(r)$ is increasing as a function of $r$, $v(x)$ is increasing as a function of $x$ whenever $r(x)$ is increasing as a function of $x$. As $r'(x) = \frac{\sigma_v^2 \mu_u - \sigma_u^2 \mu_v x}{\sigma(x)^3}$, this completes the proof if $\sigma_v^2 \mu_u - \sigma_u^2 \mu_v x \ge 0$.
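As a numerical sanity check of Equation (A.5) and of the expression for $\alpha'(r)$ used above, the following is a minimal sketch comparing the closed form $\sigma^2 \alpha(\mu/\sigma)$ against the variance of a rectified Gaussian computed by trapezoidal quadrature, and comparing $2h(r)(1 - \Phi(r))$ against a central finite difference of $\alpha$. The function names, grid size and test values are assumptions chosen for illustration and are not part of the proof.

```python
import math


def normal_pdf(r):
    """Standard Gaussian density N(r)."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)


def normal_cdf(r):
    """Standard Gaussian cumulative distribution function Phi(r)."""
    return 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))


def h(r):
    return normal_pdf(r) + r * normal_cdf(r)


def alpha(r):
    return normal_cdf(r) + r * h(r) - h(r) ** 2


def alpha_prime(r):
    """Closed-form derivative of alpha stated in the proof."""
    return 2.0 * h(r) * (1.0 - normal_cdf(r))


def relu_variance_closed_form(mu, sigma):
    """Right-hand side of (A.5): sigma^2 * alpha(mu / sigma)."""
    return sigma ** 2 * alpha(mu / sigma)


def relu_variance_quadrature(mu, sigma, num_points=20001, width=10.0):
    """V[max(0, a)] for a ~ N(mu, sigma^2), by the trapezoidal rule."""
    lo, hi = mu - width * sigma, mu + width * sigma
    step = (hi - lo) / (num_points - 1)
    m1 = m2 = 0.0
    for k in range(num_points):
        a = lo + k * step
        weight = step * (0.5 if k in (0, num_points - 1) else 1.0)
        density = normal_pdf((a - mu) / sigma) / sigma
        relu = max(0.0, a)
        m1 += weight * relu * density
        m2 += weight * relu * relu * density
    return m2 - m1 ** 2


for mu, sigma in [(0.3, 1.2), (-1.0, 0.5), (2.0, 0.7)]:
    print(mu, sigma,
          relu_variance_closed_form(mu, sigma),
          relu_variance_quadrature(mu, sigma))
```

The two printed variances should agree up to quadrature error for each pair $(\mu, \sigma)$.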
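In particular, we need only consider cases when $\mu_v > 0$. In this case, we write

\[ v(x) = \mu(x)^2\, \beta(r(x)), \tag{A.6} \]

where $\beta(r) = \alpha(r)/r^2$. Also in this region, we have the inequality

\[ r'(x)\sigma(x) = \frac{\sigma_v^2 \mu_u - \sigma_u^2 \mu_v x}{\sigma_u^2 x^2 + \sigma_v^2} \le \frac{\sigma_v^2 \mu_u}{\sigma_u^2 x^2 + \sigma_v^2} \le \frac{\sigma_v^2 \mu_u}{\sigma_v^2} = \mu_u, \]

which leads to $r'(x) \le \mu_u/\sigma(x)$. Differentiating Equation (A.6),

\[
\begin{aligned}
v'(x) &= 2\mu_u \mu(x) \beta(r(x)) + \mu(x)^2 \left( \frac{\sigma_v^2 \mu_u - \sigma_u^2 \mu_v x}{\sigma(x)^3} \right) \beta'(r(x)) \\
&\ge 2\mu_u \mu(x) \left( \beta(r(x)) + \tfrac{1}{2} r(x) \beta'(r(x)) \right).
\end{aligned}
\]

The inequality uses that $r(x) > 0$, so that by Lemma 7, $\beta'(r(x)) < 0$. It suffices to show that $\beta(r) + \tfrac{1}{2} r \beta'(r) > 0$ for $r > 0$.

\[ \beta(r) + \tfrac{1}{2} r \beta'(r) = \beta(r) + \tfrac{1}{2} r \frac{d}{dr}\left( \frac{\alpha(r)}{r^2} \right) = \frac{\alpha(r)}{r^2} + \tfrac{1}{2} r\, \frac{\alpha'(r) r^2 - 2 r \alpha(r)}{r^4} = \frac{\alpha'(r)}{2r} \ge 0. \]

We conclude that $v'(x) \ge 0$ for $x \ge 0$, implying that $v(x)$ is monotonically increasing in this region. This completes the proof that $v_{\mu_u}(x) \in T_1$ for $\mu_u > 0$.

Lemma 7. For $\beta$ defined as in the proof of Lemma 4 and for $r > 0$, $\beta'(r) < 0$.

Proof. For $r \ne 0$,

\[ \beta'(r) = \left( -2\Phi(r) + 2N(r)^2 + 2 r N(r)\Phi(r) \right) / r^3. \]

As $r > 0$,

\[ \beta'(r) \le 0 \iff I(r) := -\Phi(r) + N(r)^2 + r N(r) \Phi(r) \le 0. \]

Rearranging (Abramowitz and Stegun, 1965, 7.1.13) yields:

\[ 1 - \frac{2}{r + \sqrt{r^2 + 8/\pi}}\, N(r) \le \Phi(r) < 1 - \frac{2}{r + \sqrt{r^2 + 4}}\, N(r) \tag{A.7} \]

for $r \ge 0$. Then

\[
\begin{aligned}
I(r) &= -\Phi(r) + N(r)^2 + r N(r) \Phi(r) \\
&\le -\Phi(r) + N(r)^2 + r N(r) \left( 1 - \frac{2}{r + \sqrt{r^2 + 4}}\, N(r) \right) \\
&\le -1 + \frac{2}{r + \sqrt{r^2 + 8/\pi}}\, N(r) + N(r)^2 + r N(r) \left( 1 - \frac{2}{r + \sqrt{r^2 + 4}}\, N(r) \right) \\
&= -1 + \frac{2}{r + \sqrt{r^2 + 8/\pi}}\, N(r) + r N(r) + N(r)^2 \left( 1 - \frac{2r}{r + \sqrt{r^2 + 4}} \right).
\end{aligned}
\tag{A.8}
\]

We now make use of numerous crude bounds which hold for $r > 0$:
1. $N(r) \le 1/\sqrt{2\pi}$,
2. $\frac{2}{r + \sqrt{r^2 + 8/\pi}} \le \sqrt{\pi/2}$,
3. $r N(r) \le 1/\sqrt{2\pi e}$,
4. $\frac{2r}{r + \sqrt{r^2 + 4}} \ge 0$.

Plugging these into Equation (A.8),

\[ I(r) \le -1 + \frac{\sqrt{\pi/2}}{\sqrt{2\pi}} + \frac{1}{\sqrt{2\pi e}} + \frac{1}{2\pi} = -\frac{1}{2} + \frac{1}{\sqrt{2\pi e}} + \frac{1}{2\pi} \approx -0.099 < 0. \]

A.3.3 Proof of lemma 5

Proof. Recall that $T_1 = \{f \ge 0 : \forall\, 0 \le b < a,\ f(a) \ge f(-a) \text{ and } f(b) \le f(a)\}$ and $T_2 = \{f \ge 0 : \forall\, a < b \le 0,\ f(a) \ge f(-a) \text{ and } f(b) \le f(a)\}$. First, note that $T_1$, $T_2$ and the set of non-negative convex functions, $C$, are all closed under addition and positive scalar multiplication.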
We can therefore write $f$ as a sum of three functions, $f(x) = t_1(x) + t_2(x) + s(x)$ with $t_1 \in T_1$, $t_2 \in T_2$ and $s \in C$. We prove the case when $a \le c \le 0 \le -c \le b$. The case $a \le -c \le 0 \le c \le b$ follows a symmetric argument.

\[
\begin{aligned}
f(c) &= t_1(c) + t_2(c) + s(c) && \text{(def.)} \\
&\le t_1(c) + t_2(a) + s(c) && \text{(second condition for } T_2\text{)} \\
&\le t_1(-c) + t_2(a) + s(c) && \text{(first condition for } T_1\text{)} \\
&\le t_1(b) + t_2(a) + s(c) && \text{(second condition for } T_1\text{)} \\
&\le t_1(b) + t_2(a) + \max(s(a), s(b)) && \text{(} s \text{ convex)} \\
&\le t_1(b) + t_2(a) + s(a) + s(b) \\
&\le t_1(a) + t_1(b) + t_2(a) + t_2(b) + s(a) + s(b) && \text{(non-negativity)} \\
&= f(a) + f(b).
\end{aligned}
\]

A.3.4 Proof of lemma 6

Proof. Let $\{s_n\}_{n=1}^{N} = S_N \subset \mathbb{R}^D$. We proceed by induction. The lemma is true for $N = 2$ by the definition of convexity. Assume it is true for $N$. Let $\mathrm{Conv}(S_{N+1})$ denote the convex hull of $S_{N+1}$. Consider a point $r_{N+1} \in \mathrm{Conv}(S_{N+1})$. Then

\[ f(r_{N+1}) = f\left( \sum_{n=1}^{N+1} \alpha_n s_n \right) \tag{A.9} \]

with $\sum_{n=1}^{N+1} \alpha_n = 1$ and $\alpha_n \ge 0$ for $1 \le n \le N + 1$. We can write

\[ f(r_{N+1}) = f\left( \left( \sum_{n=1}^{N} \alpha_n \right) t_N + \alpha_{N+1} s_{N+1} \right) \le \max\{f(t_N), f(s_{N+1})\}, \tag{A.10} \]

where $t_N := \sum_{n=1}^{N} \alpha_n s_n \Big/ \sum_{n=1}^{N} \alpha_n$, and we have used the convexity of $f$. By the induction assumption, $f(t_N) \le \max_{s \in S_N} \{f(s)\}$, since $t_N \in \mathrm{Conv}(S_N)$. Combining this with Equation (A.10) completes the proof.

A.4 Proofs of theorems

Having collected the necessary preliminary lemmas we now prove Theorems 7 to 9.

Proof of Theorem 7. By the law of total variance,

\[ \mathbb{V}[f_k(x)] = \mathbb{E}[\mathbb{V}[f_k(x) \mid U, v]] + \mathbb{V}[\mathbb{E}[f_k(x) \mid U, v]]. \]

Using Lemma 3, $\mathbb{V}[f_k(x) \mid U, v]$ is convex as a function of $x$. As the expectation of a convex function is convex, the first term is a convex function of $x$. For the second term we have

\[
\begin{aligned}
\mathbb{E}[f_k(x) \mid U, v] &= \mathbb{E}\left[ \sum_{i=1}^{I} w_{k,i} \psi(a_i) + b_k \,\middle|\, U, v \right] \\
&= \sum_{i=1}^{I} \mu_{w_{k,i}} \psi(a_i) + \mu_{b_k},
\end{aligned}
\]

where $\mu_{w_{k,i}} := \mathbb{E}[w_{k,i}]$, $\mu_{b_k} := \mathbb{E}[b_k]$. In the second line we used linearity of expectation and that conditioned on $(U, v)$, the $a_i$ are deterministic.
Next,

\[ \mathbb{V}[\mathbb{E}[f_k(x) \mid U, v]] = \mathbb{V}\left[ \sum_{i=1}^{I} \mu_{w_{k,i}} \psi(a_i) + \mu_{b_k} \right] = \sum_{i=1}^{I} \mu^2_{w_{k,i}} \mathbb{V}[\psi(a_i)], \tag{A.11} \]

since the $a_i$ are independent of each other. Consider a line in $\mathbb{R}^D$ parameterised by $[x(\lambda)]_d = \gamma_d \lambda + c_d$ for $\lambda \in \mathbb{R}$ such that $\gamma_d c_d = 0$ for $1 \le d \le D$. By Corollary 1, $\mathbb{V}[\psi(a_i(x(\lambda)))] \in T_1 \cup T_2$ (as a function of $\lambda$). Since $\mathbb{V}[f_k(x) \mid U, v]$ is convex as a function of $x$, it is also convex as a function of $\lambda$. We have written $\mathbb{V}[f_k(x(\lambda))]$ in the form assumed in Lemma 5, completing the proof.

Proof of Theorem 8. The theorem follows immediately from Lemma 3 since $U$ and $v$ are deterministic.

Proof of Theorem 9. By the law of total variance,

\[ \mathbb{V}[f_k(x)] = \mathbb{E}[\mathbb{V}[f_k(x) \mid U]] + \mathbb{V}[\mathbb{E}[f_k(x) \mid U]]. \]

Using Lemma 3, $\mathbb{V}[f_k(x) \mid U]$ is convex as a function of $x$. As the expectation of a convex function is convex, the first term is a convex function of $x$. This implies

\[ \mathbb{E}[\mathbb{V}[f_k(0) \mid U]] \le \max_{s \in S} \{\mathbb{E}[\mathbb{V}[f_k(s) \mid U]]\}, \]

by Lemma 6. $\mathbb{V}[\mathbb{E}[f_k(x) \mid U]]$ is non-negative everywhere. As the output of the first layer is independent of the matrix $U$ at $x = 0$, $\mathbb{E}[f_k(0) \mid U]$ is deterministic. So $\mathbb{V}[\mathbb{E}[f_k(0) \mid U]] = 0$, completing the proof.

Appendix B Bayesian neural network experimental details

In this chapter we provide the experimental details of the BNN experiments reported in Chapter 3.

Data: The input locations of the data were generated by sampling 100 total points, 50 each from two distinct Gaussians. In Figure 3.6, one Gaussian was centred at $(-1, -1)$ and the other at $(1, 1)$; both had isotropic variance of 0.01. The output values were generated by sampling from the Gaussian process prior with the kernel resulting from the wide limit of the BNN at these input values.

Prior: A fully-connected ReLU network with 50 hidden units per layer is used. The prior mean for all parameters is chosen to be 0. The prior standard deviation for the bias parameters is chosen as $\sigma_b = 1$ for all experiments. Let $\sigma_w/\sqrt{H}$ be the prior standard deviation of each weight, where $H$ is the number of inputs to the weight matrix. We choose $\sigma_w = 4$.
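The prior described above can be sketched as follows. This is a minimal illustration in plain Python and not the code used for the experiments: the function names `sample_prior` and `forward`, the two-dimensional input and the choice of two hidden layers are assumptions made for the example. Each weight is drawn with standard deviation $\sigma_w/\sqrt{H}$, where $H$ is the number of inputs to its weight matrix, and each bias with standard deviation $\sigma_b$.

```python
import math
import random


def sample_prior(layer_sizes, sigma_w=4.0, sigma_b=1.0, seed=0):
    """Draw one set of network parameters from the zero-mean Gaussian prior.

    layer_sizes lists the widths from input to output, e.g. [2, 50, 50, 1].
    Returns a list of (weights, biases) pairs, one per layer, where
    weights[j][i] connects input i of the layer to output j.
    """
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weight_std = sigma_w / math.sqrt(fan_in)  # sigma_w / sqrt(H)
        weights = [[rng.gauss(0.0, weight_std) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [rng.gauss(0.0, sigma_b) for _ in range(fan_out)]
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Evaluate the ReLU network at input x (no nonlinearity on the output)."""
    hidden = list(x)
    for index, (weights, biases) in enumerate(layers):
        hidden = [sum(w * v for w, v in zip(row, hidden)) + b
                  for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            hidden = [max(0.0, v) for v in hidden]
    return hidden


# One prior function draw for a hypothetical 2HL network on 2-D inputs.
prior_draw = sample_prior([2, 50, 50, 1])
print(forward(prior_draw, [1.0, 1.0]))
```

Calling `sample_prior` with different seeds gives independent draws from the prior over functions.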
All models used a fixed Gaussian likelihood with standard deviation 0.1.

Fitting the GP: The Gaussian process was implemented using GPFlow (Matthews et al., 2017) with the infinite-width ReLU BNN kernel implemented following Lee et al. (2018). All hyperparameters were fixed and exact inference was performed using the Cholesky decomposition.

Fitting MFVI: We initialise the standard deviations of weights to be small and train for many epochs, following Swaroop et al. (2018); Tomczak et al. (2018) who found this led to good predictive performance. The weight means in each weight matrix were initialised by sampling from N(0, 1/√2nout), where nout is the number of outputs of the weight matrix. The weight standard deviations were all initialised to a very small value of $1 \times 10^{-5}$ (we tried a larger initialisation with weight standard deviations initialised to $1 \times 10^{-2.5}$ and found no significant difference). Bias means were initialised to zero, with the variances initialised to the same small value as the weight variance. 100,000 iterations of full batch training on the dataset were performed using ADAM with a learning rate of $1 \times 10^{-3}$. The ELBO was estimated using 32 Monte Carlo samples during training. The local reparameterisation trick was used (Kingma et al., 2015). The predictive distribution at test time was estimated using 500 samples from the approximate posterior.

Fitting MCDO: The weights and biases were initialised using the default PyTorch initialisation. The dropout rate was fixed at $p = 0.05$. The $\ell_2$ regularisation parameter was set following Gal (2016, Section 3.2.3) for the given prior, in such a way that the ‘KL condition’ is met, in the interpretation of dropout training as variational inference. 100,000 iterations of full batch training on the dataset were performed using ADAM with a learning rate of $1 \times 10^{-3}$. The dropout objective was estimated using 32 Monte Carlo samples during training.
The predictive distribution at test time was estimated using 500 samples from the approximate posterior.

Fitting HMC: For HMC on the 1HL BNN, 250,000 samples of HMC were taken using the NUTS implementation in Pyro (Bingham et al., 2018; Hoffman and Gelman, 2014) after 10,000 warmup steps. For the 2HL case, 1,000,000 samples of HMC were taken after 20,000 warmup steps. We set the maximum tree depth in NUTS to 5, and adapt the step size and mass matrix during warmup.

Appendix C

Proofs of results on deep BNNs

In Appendix C.1 we prove the universality of the mean and variance function for deep BNNs using $Q_{\mathrm{FFG}}$ and $Q_{\mathrm{MCDO}}$, where, as usual, the inputs are not dropped out for $Q_{\mathrm{MCDO}}$. Conversely, if the inputs are dropped out, we show in Appendix C.2 via a counterexample that the resulting BNN does not have a universal mean and variance function.

C.1 Proof of Theorem 4

We now restate and prove Theorem 4 from the main body:

Theorem 10. Let $A \subset \mathbb{R}^D$ be compact, and let $C(A)$ be the space of continuous functions on $A$ to $\mathbb{R}$. Similarly, let $C^+(A)$ be the space of continuous functions on $A$ to $\mathbb{R}_{\ge 0}$. Then for any $g \in C(A)$ and $h \in C^+(A)$, and any $\epsilon > 0$, for both the mean-field Gaussian and MC dropout families, there exists a 2-hidden layer ReLU NN such that

\[ \sup_{x \in A} \big| \mathbb{E}[f(x)] - g(x) \big| < \epsilon \quad \text{and} \quad \sup_{x \in A} \big| \mathbb{V}[f(x)] - h(x) \big| < \epsilon, \]

where $f(x)$ is the (stochastic) output of the network.

Our proof will make use of the standard universal approximation theorem for deterministic NNs as given in Leshno et al. (1993):

Theorem 11 (Universal approximation for deterministic NNs). Let $\psi(a) = \max(0, a)$. Then for every $g \in C(\mathbb{R}^D)$ and every compact set $A \subset \mathbb{R}^D$, for any $\epsilon > 0$ there exists a function $f \in S$ such that $\|g - f\|_\infty < \epsilon$, where

\[ S := \left\{ \sum_{i=1}^{I} w_i \psi\left( \sum_{d=1}^{D} u_{i,d} x_d + v_i \right) : I \in \mathbb{N},\; w_i, u_{i,d}, v_i \in \mathbb{R} \right\}. \]

We first prove a useful lemma.

Lemma 8. Let $\psi(a) = \max(0, a)$. Let $a$ be a random variable with finite first two moments. Then $\mathbb{V}[\psi(a)] \le \mathbb{V}[a]$.

Proof.
For all $x, y \in \mathbb{R}$, we have $|x - y|^2 \ge |\psi(x) - \psi(y)|^2$. Consider two i.i.d. copies of any random variable with finite first two moments, denoted $a_1$ and $a_2$. Then

\begin{align*}
\mathbb{V}[a_1] &= \mathbb{E}[a_1^2] - \mathbb{E}[a_1]^2 = \tfrac{1}{2} \mathbb{E}\big[ a_1^2 + a_2^2 - 2 a_1 a_2 \big] = \tfrac{1}{2} \mathbb{E}\big[ |a_1 - a_2|^2 \big] \\
&\ge \tfrac{1}{2} \mathbb{E}\big[ |\psi(a_1) - \psi(a_2)|^2 \big] = \mathbb{V}[\psi(a_1)].
\end{align*}

C.1.1 Proof of Theorem 4 for $Q_{\mathrm{FFG}}$

We prove Theorem 10 for the fully-factorised Gaussian approximating family. We begin by proving results about 1HL networks within this family. The overall goal of these results is Lemma 11, which informally says that for any set of mean parameters for the weights, we can find a setting of the standard deviations of the weights, such that the mean output of the network is close to the output of the deterministic network, with weights equal to the mean parameters. Our proof of this proceeds in 3 parts: First, in Lemma 9, we show that by making the standard deviation parameters sufficiently small, we can ensure that the variance of the output of the network is uniformly small on some compact set $A$. Next, in Lemma 10, we show that again by choosing the standard deviation sufficiently small, we can make most of the sample functions of the 1HL network close to the function that would be obtained by using the mean parameters. Finally, in the proof of Lemma 11, we use Chebyshev’s inequality and the triangle inequality to conclude that the mean of the network must also be close to the function defined by the mean parameters. These networks will be used to construct the desired 2HL network.

Notation. Consider a 1HL ReLU NN with input $x \in \mathbb{R}^D$ and output $f \in \mathbb{R}^K$. Let the network have $I$ hidden units and be parameterised by input weights $U \in \mathbb{R}^{I \times D}$, input biases $v \in \mathbb{R}^I$, output weights $W \in \mathbb{R}^{K \times I}$ and output biases $b \in \mathbb{R}^K$. Let $\theta = (U, v, W, b)$. Denote the $k$th output of the network by $f_{k,\theta}(x)$. Consider a factorised Gaussian distribution over the parameters $\theta$ in the network. Let the means of the Gaussians be denoted $\mu = (\mu_U, \mu_v, \mu_W, \mu_b)$, where e.g.
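$\mu_U$ is a matrix whose elements are the means of $U$. Each mean is always taken to be in $\mathbb{R}$. Similarly, let the standard deviations be denoted $\sigma = (\sigma_U, \sigma_v, \sigma_W, \sigma_b)$. Each standard deviation is always taken to be in $\mathbb{R}_{>0}$.

The following lemma states that we can make the output of a 1HL BNN have low variance by setting the standard deviation of the weights to be small.

Lemma 9. Let $A \subset \mathbb{R}^D$ be a compact set and $f_{k,\theta}(x)$ be the $k$th output of a 1HL ReLU NN with a mean-field Gaussian distribution mapping from $A \to \mathbb{R}$. Fix any $\mu$ and any $\epsilon > 0$. Let all the standard deviations in $\sigma$ be equal to a shared constant $\sigma > 0$. Then there exists $\sigma' > 0$ such that for all $\sigma < \sigma'$ and for all $x \in A$, $\mathbb{V}[\psi(f_{k,\theta}(x))] < \epsilon$ for all $1 \le k \le K$.

Proof. Define $a_i = \sum_{d=1}^{D} u_{i,d} x_d + v_i$, so that $f_{k,\theta}(x) = \sum_{i=1}^{I} w_{k,i} \psi(a_i) + b_k$. Then

\begin{align*}
\mathbb{V}[f_{k,\theta}(x)] &= \mathbb{V}\left[ \sum_{i=1}^{I} w_{k,i} \psi(a_i) \right] + \sigma^2 \\
&= \sum_{i=1}^{I} \sum_{j=1}^{I} \mathrm{Cov}\big( w_{k,i} \psi(a_i), w_{k,j} \psi(a_j) \big) + \sigma^2 \\
&\le \sum_{i=1}^{I} \sum_{j=1}^{I} \big| \mathrm{Cov}\big( w_{k,i} \psi(a_i), w_{k,j} \psi(a_j) \big) \big| + \sigma^2 \\
&\le \sum_{i=1}^{I} \sum_{j=1}^{I} \sqrt{ \mathbb{V}[w_{k,i} \psi(a_i)] \, \mathbb{V}[w_{k,j} \psi(a_j)] } + \sigma^2,
\end{align*}

where the final line follows from the Cauchy–Schwarz inequality. We now analyse each of the constituent terms. Since $w_{k,i}$ and $\psi(a_i)$ are independent,

\[ \mathbb{V}[w_{k,i} \psi(a_i)] = \mu_{w_{k,i}}^2 \mathbb{V}[\psi(a_i)] + \mathbb{E}[\psi(a_i)]^2 \sigma^2 + \sigma^2 \mathbb{V}[\psi(a_i)]. \]

As $A$ is compact, it is bounded, so there exists an $M$ such that $|x_d| \le M$ for all $1 \le d \le D$. Using Lemma 8, and the mean-field assumptions,

\[ \mathbb{V}[\psi(a_i)] \le \mathbb{V}[a_i] = \sigma^2 \left( \sum_{d=1}^{D} x_d^2 + 1 \right) \le \sigma^2 (D M^2 + 1). \]

Since $a_i$ is a linear combination of Gaussian random variables, we have that $a_i \sim \mathcal{N}(\mu_{a_i}, \sigma_{a_i}^2)$, where $\mu_{a_i} = \sum_{d=1}^{D} \mu_{u_{i,d}} x_d + \mu_{v_i}$ and $\sigma_{a_i}^2 = \sigma^2 \left( \sum_{d=1}^{D} x_d^2 + 1 \right)$. Therefore, we have that (Frey and Hinton, 1999):

\begin{align*}
\mathbb{E}[\psi(a_i)]^2 &= \left( \mu_{a_i} \Phi\left( \frac{\mu_{a_i}}{\sigma_{a_i}} \right) + \sigma_{a_i} \mathcal{N}\left( \frac{\mu_{a_i}}{\sigma_{a_i}} \right) \right)^2 \\
&\le \left( |\mu_{a_i}| \Phi\left( \frac{\mu_{a_i}}{\sigma_{a_i}} \right) + \sigma_{a_i} \mathcal{N}\left( \frac{\mu_{a_i}}{\sigma_{a_i}} \right) \right)^2 \\
&\le \left( |\mu_{a_i}| + \frac{\sigma_{a_i}}{\sqrt{2\pi}} \right)^2.
\end{align*}

We can then upper bound $\mathbb{V}[w_{k,i} \psi(a_i)]$ as follows:

\begin{align*}
\mathbb{V}[w_{k,i} \psi(a_i)] &\le \mu_{w_{k,i}}^2 \sigma^2 (D M^2 + 1) + \left( |\mu_{a_i}| + \frac{\sigma_{a_i}}{\sqrt{2\pi}} \right)^2 \sigma^2 + \sigma^4 (D M^2 + 1) \\
&\le \mu_{w_{k,i}}^2 \sigma^2 (D M^2 + 1) + \left( M \sum_{d=1}^{D} |\mu_{u_{i,d}}| + |\mu_{v_i}| + \frac{\sqrt{\sigma^2 (M^2 D + 1)}}{\sqrt{2\pi}} \right)^2 \sigma^2 + \sigma^4 (D M^2 + 1) \\
&:= v_{k,i}(\sigma).
\end{align*}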
The second inequality follows since $A$ is compact and we have $|\mu_{a_i}| \le M \sum_{d=1}^{D} |\mu_{u_{i,d}}| + |\mu_{v_i}|$. Note that the upper bound $v_{k,i}(\sigma)$ is continuous and monotonically increasing in $\sigma$, and $v_{k,i}(0) = 0$. We can then upper bound the variance of the output:

\[ \mathbb{V}[f_{k,\theta}(x)] \le \sum_{i=1}^{I} \sum_{j=1}^{I} \sqrt{ v_{k,i}(\sigma) v_{k,j}(\sigma) } + \sigma^2. \]

We then choose $\sigma'$ such that for all $1 \le k \le K$ and for all $1 \le i \le I$, $v_{k,i}(\sigma') < \frac{\epsilon}{2 I^2}$, and such that $\sigma'^2 < \frac{\epsilon}{2}$. Then

\[ \mathbb{V}[f_{k,\theta}(x)] \le I^2 \frac{\epsilon}{2 I^2} + \sigma'^2 < \epsilon \]

for $1 \le k \le K$. Finally, applying Lemma 8, we have $\mathbb{V}[\psi(f_{k,\theta}(x))] < \epsilon$ for $1 \le k \le K$.

The following lemma states that by setting the standard deviation of the weights to be sufficiently small, we can with high probability make the sampled BNN output close to the BNN output evaluated at the mean parameters.

Lemma 10. Let $A \subset \mathbb{R}^D$ be any compact set. Fix any $\mu$ and any $\epsilon, \delta > 0$. Let all the standard deviations in $\sigma$ be equal to a shared constant $\sigma > 0$. Then there exists $\sigma' > 0$ such that for all $\sigma < \sigma'$, and for any $x \in A$,

\[ \Pr\big( |\psi(f_{k,\mu}(x)) - \psi(f_{k,\theta}(x))| > \epsilon \big) < \delta \]

for all $1 \le k \le K$.

Proof. Let $\theta \in \mathbb{R}^P$. We first note that $\psi(f_{k,\theta}(x))$ is continuous as a function from $A \times \mathbb{R}^P \to \mathbb{R}$, under the metric topology induced by the Euclidean metric on $A \times \mathbb{R}^P$. Next, define a ball in parameter space $B_\gamma = \{\theta : \|\theta - \mu\|_2 < \gamma\}$. Consider the closed ball of unit radius around $\mu$, $\bar{B}_1$. Note that $\bar{B}_1$ is compact, and therefore $A \times \bar{B}_1$ is compact as a product of compact spaces. Since a continuous map from a compact metric space to another metric space is uniformly continuous, given $\epsilon > 0$, there exists a $0 < \tau < 1$ such that for all pairs $(x_1, \theta_1), (x_2, \theta_2) \in A \times \bar{B}_1$ such that $d((x_1, \theta_1), (x_2, \theta_2)) < \tau$, $|\psi(f_{k,\theta_1}(x_1)) - \psi(f_{k,\theta_2}(x_2))| < \epsilon$. Here $d(\cdot, \cdot)$ is the Euclidean metric on $A \times \mathbb{R}^P$. Since this is true for all $1 \le k \le K$, we can find a $0 < \tau < 1$ such that $|\psi(f_{k,\theta_1}(x_1)) - \psi(f_{k,\theta_2}(x_2))| < \epsilon$ holds for all $k$ simultaneously, by taking the minimum of the $\tau$ over $k$. Now choose $\sigma' > 0$ such that for all $\sigma < \sigma'$, $\Pr(\theta \in B_\tau) > 1 - \delta$.
This event implies $d((x, \theta), (x, \mu)) = \|\theta - \mu\|_2 < \tau$. Furthermore, $\theta \in \bar{B}_1$, since $\tau < 1$. Hence $|\psi(f_{k,\mu}(x)) - \psi(f_{k,\theta}(x))| < \epsilon$ holds for all $1 \le k \le K$.

The following lemma shows that for 1HL networks, we can make $\mathbb{E}[\psi(f_{k,\theta})]$ (the mean BNN output) close to $\psi(f_{k,\mu})$ (the BNN output evaluated at the mean parameter settings) by choosing the standard deviation of the weights to be sufficiently small.

Lemma 11. Let $A \subset \mathbb{R}^D$ be any compact set. Then, for any $\epsilon > 0$ and any $\mu$, there exists a $\sigma_1 > 0$ such that for any shared standard deviation $\sigma < \sigma_1$,

\[ \big\| \mathbb{E}[\psi(f_{k,\theta})] - \psi(f_{k,\mu}) \big\|_\infty < \epsilon \]

for all $1 \le k \le K$.

Proof. For all $x \in A$ and any $\theta^*$, by the triangle inequality

\[ \big| \mathbb{E}[\psi(f_{k,\theta}(x))] - \psi(f_{k,\mu}(x)) \big| \le \big| \mathbb{E}[\psi(f_{k,\theta}(x))] - \psi(f_{k,\theta^*}(x)) \big| + \big| \psi(f_{k,\mu}(x)) - \psi(f_{k,\theta^*}(x)) \big|. \]

Applying Lemma 10 with $\epsilon' = \epsilon/2$ and $\delta = 1/4$, we can find a $\sigma'$ such that for all $\sigma < \sigma'$, $|\psi(f_{k,\mu}(x)) - \psi(f_{k,\theta}(x))| \le \epsilon/2$ with probability at least $3/4$. By Lemma 9, we can find a $\sigma''$ such that for all $\sigma < \sigma''$, $\mathbb{V}[\psi(f_{k,\theta}(x))] < \frac{\epsilon^2}{16K}$. Choose $0 < \sigma < \min(\sigma', \sigma'')$. We can apply Chebyshev’s inequality to each random variable $\psi(f_{k,\theta}(x))$,

\[ \Pr\big[ |\psi(f_{k,\theta}(x)) - \mathbb{E}[\psi(f_{k,\theta}(x))]| > \epsilon/2 \big] < \frac{1}{4K}. \]

Applying the union bound, the probability that $|\psi(f_{k,\theta}(x)) - \mathbb{E}[\psi(f_{k,\theta}(x))]| \le \epsilon/2$ for all $k$ simultaneously is at least $3/4$. Therefore, for any $x$ we can find a $\theta^*$ such that $|\psi(f_{k,\theta^*}(x)) - \mathbb{E}[\psi(f_{k,\theta}(x))]| \le \epsilon/2$ and $|\psi(f_{k,\mu}(x)) - \psi(f_{k,\theta^*}(x))| \le \epsilon/2$ simultaneously: each event occurs with probability at least $3/4$, so their intersection has probability at least $1/2$ and is in particular non-empty. Therefore for all $x$ and all $k$

\[ \big| \mathbb{E}[\psi(f_{k,\theta}(x))] - \psi(f_{k,\mu}(x)) \big| \le \epsilon. \]

We can now complete the proof of Theorem 4 for $Q_{\mathrm{FFG}}$.

Proof of Theorem 10. Consider the case of a 2-hidden layer ReLU Bayesian neural network with 2 units in the second hidden layer. Denote the inputs to these units as $f_{1,\theta}(x)$ and $f_{2,\theta}(x)$ respectively, where $\theta$ are the parameters in the bottom two weight matrices and biases of the network.
The output of the network can then be written as

\[ f(x) = s_1 \psi(f_{1,\theta}(x)) + s_2 \psi(f_{2,\theta}(x)) + t, \tag{C.1} \]

where the $s_i$ are the weights in the final layer and $t$ is the bias. Taking expectations on both sides,

\[ \mathbb{E}[f(x)] = \mathbb{E}[s_1 \psi(f_{1,\theta}(x))] + \mathbb{E}[s_2 \psi(f_{2,\theta}(x))] + \mathbb{E}[t]. \]

Choose $\mu_{s_1} = 1$, $\mu_{s_2} = 0$, and note that $s_1$ is independent of $\theta$ by the mean field assumption. Then

\[ \mathbb{E}[f(x)] = \mathbb{E}[\psi(f_{1,\theta}(x))] + \mathbb{E}[t]. \tag{C.2} \]

Define $\mu_t = \min_{x' \in A} g(x')$ (as $A$ is compact and $g$ is continuous, this minimum is well-defined). Define $\tilde{g}(x) \ge 0$ to be $g(x) - \min_{x' \in A} g(x')$, so that $\tilde{g} + \mu_t = g$. By the universal approximation theorem (Theorem 11) we can find a setting of the mean parameters, $\mu$ in the first two layers (i.e. excluding the parameters of the distributions on $s_1$, $s_2$ and $t$) such that $\|f^{(1)}_\mu - \tilde{g}\|_\infty < \epsilon/2$ and $\|f^{(2)}_\mu - \sqrt{h}\|_\infty < \epsilon/2$. This can be done by splitting the neurons in the first hidden layer into two sets, where the first and second set are responsible for $f^{(1)}$, $f^{(2)}$ respectively, and the weights from each set to the output of the other set are zero. Since $\tilde{g}(x) \ge 0$, applying the ReLU can only make $f^{(1)}$ closer to $\tilde{g}$. Hence $\|\psi(f^{(1)}_\mu) - \tilde{g}\|_\infty < \epsilon/2$. By Lemma 11, we can find a $\sigma_1 > 0$ for this $\mu$ such that when the standard deviations in the first two layers are set to any shared constant $\sigma < \sigma_1$,

\[ \big\| \mathbb{E}[\psi(f_{1,\theta})] - \psi(f^{(1)}_\mu) \big\|_\infty < \epsilon/2. \]

By the triangle inequality, $\|\mathbb{E}[\psi(f_{1,\theta})] - \tilde{g}\|_\infty < \epsilon$. Combining with Equation (C.2), it follows that the expectation can approximate any continuous function $g$.

We now consider the variance of Equation (C.1).

\begin{align*}
\mathbb{V}[f(x)] &= \mathbb{V}[s_1 \psi(f_{1,\theta}(x)) + s_2 \psi(f_{2,\theta}(x))] + \mathbb{V}[t] \\
&= \mathbb{V}[s_1 \psi(f_{1,\theta}(x))] + \mathbb{V}[s_2 \psi(f_{2,\theta}(x))] + 2\,\mathrm{Cov}\big( s_1 \psi(f_{1,\theta}(x)), s_2 \psi(f_{2,\theta}(x)) \big) + \sigma_t^2.
\end{align*}

Choose $\sigma_t^2 = \epsilon$. We now consider $\mathbb{V}[s_1 \psi(f_{1,\theta}(x))]$. As $s_1$ is independent of $\theta$,

\[ \mathbb{V}[s_1 \psi(f_{1,\theta}(x))] = \mu_{s_1}^2 \mathbb{V}[\psi(f_{1,\theta}(x))] + \sigma_{s_1}^2 \mathbb{E}[\psi(f_{1,\theta}(x))]^2 + \mathbb{V}[\psi(f_{1,\theta}(x))] \sigma_{s_1}^2. \]

Recall $\mu_{s_1} = 1$ and choose $\sigma_{s_1}^2 = \min\big( 1, \epsilon \big/ \big( \max_{x \in A} \mathbb{E}[\psi(f_{1,\theta}(x))]^2 \big) \big)$, then

\[ \mathbb{V}[s_1 \psi(f_{1,\theta}(x))] \le 2 \mathbb{V}[\psi(f_{1,\theta}(x))] + \epsilon. \]
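By Lemma 9, we can find a $\sigma_2$ such that for any $\sigma < \sigma_2$, $\mathbb{V}[\psi(f_{1,\theta}(x))] \le \epsilon$. For any such $\sigma$, $\mathbb{V}[s_1 \psi(f_{1,\theta}(x))] \le 3\epsilon$. We now choose $\sigma_{s_2}^2 = 1$ and consider

\begin{align*}
\mathbb{V}[s_2 \psi(f_{2,\theta}(x))] &= \mu_{s_2}^2 \mathbb{V}[\psi(f_{2,\theta}(x))] + \sigma_{s_2}^2 \mathbb{E}[\psi(f_{2,\theta}(x))]^2 + \sigma_{s_2}^2 \mathbb{V}[\psi(f_{2,\theta}(x))] \\
&= \mathbb{E}[\psi(f_{2,\theta}(x))]^2 + \mathbb{V}[\psi(f_{2,\theta}(x))].
\end{align*}

By Lemma 9, we can find a $\sigma_3$ such that for any $\sigma < \sigma_3$, $\mathbb{V}[\psi(f_{2,\theta}(x))] < \epsilon$.

By the universal approximation theorem (Theorem 11) we can find a setting of the mean parameters, $\mu$ in the first two layers such that $\|f^{(2)}_\mu - \sqrt{h}\|_\infty < \epsilon/2$. Since $\sqrt{h(x)} \ge 0$, the ReLU can only make $f^{(2)}$ closer to $\sqrt{h}$, so $\|\psi(f^{(2)}_\mu) - \sqrt{h}\|_\infty < \epsilon/2$. By Lemma 11, we can find a setting of $\sigma$ for this $\mu$ such that

\[ \big\| \mathbb{E}[\psi(f_{2,\theta})] - \psi(f^{(2)}_\mu) \big\|_\infty < \epsilon/2. \]

By the triangle inequality, $\big\| \mathbb{E}[\psi(f_{2,\theta})] - \sqrt{h} \big\|_\infty < \epsilon$. This implies,

\begin{align*}
\big\| \mathbb{E}[\psi(f_{2,\theta})]^2 - h \big\|_\infty &= \Big\| \big( \mathbb{E}[\psi(f_{2,\theta})] - \sqrt{h} \big) \big( \mathbb{E}[\psi(f_{2,\theta})] + \sqrt{h} \big) \Big\|_\infty \\
&\le \epsilon \, \Big\| \mathbb{E}[\psi(f_{2,\theta})] + \sqrt{h} \Big\|_\infty \\
&\le \epsilon \big( 2 \|\sqrt{h}\|_\infty + \epsilon \big).
\end{align*}

We therefore have,

\begin{align*}
\| \mathbb{V}[f] - h \|_\infty &\le E(\epsilon) + 2\,\mathrm{Cov}\big( s_1 \psi(f_{1,\theta}(x)), s_2 \psi(f_{2,\theta}(x)) \big) \\
&\le E(\epsilon) + 2 \sqrt{ \mathbb{V}[s_1 \psi(f_{1,\theta}(x))] \, \mathbb{V}[s_2 \psi(f_{2,\theta}(x))] } \\
&\le E(\epsilon) + C \sqrt{\epsilon},
\end{align*}

where the second inequality follows from the Cauchy–Schwarz inequality, $E(\epsilon)$ is a function that tends to zero with $\epsilon$, and $C$ is a constant. The theorem follows by choosing $\sigma < \min\{\sigma_1, \sigma_2, \sigma_3\}$.

The construction in our proof used a 2HL BNN with only two neurons in the second hidden layer. The construction still works for wider hidden layers, by setting the unused neurons to have zero mean and sufficiently small variance. An analogous statement to Theorem 4 for networks with more than two hidden layers can be proved inductively: applying Theorem 4 for 2HL BNNs we can choose the variance to be uniformly small, thus satisfying the condition stated in Lemma 9. The proof of Lemma 10 applies equally for the output of 2HL BNNs. The rest of the proof then follows as stated.

C.1.2 Proof of Theorem 10 for MCDO

In order to prove the universality result for deep dropout, we first prove two lemmas about 1HL dropout networks.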
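The following lemma states that the mean of a 1HL dropout network is a universal function approximator, while its variance can simultaneously be made arbitrarily small.

Lemma 12. Consider any $\epsilon > 0$ and any continuous function, $m : A \to \mathbb{R}$, where $A \subset \mathbb{R}^D$ is compact. Then there exists a (random) ReLU neural network of the form

\[ f(x) = \sum_{i=1}^{I} w_i \gamma_i \psi\left( \sum_{d=1}^{D} u_{i,d} x_d + v_i \right) + b \]

with $\gamma_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(1 - p)$ such that $\|\mathbb{E}[f] - m\|_\infty < \epsilon$ and $\|\mathbb{V}[f]\|_\infty \le \epsilon$.

Proof. By the universal approximation theorem (Leshno et al., 1993), there exists a $J \in \mathbb{N}$ and 1HL network of the form,

\[ g(x) = \sum_{j=1}^{J} \tilde{w}_j \psi\left( \sum_{d=1}^{D} \tilde{u}_{j,d} x_d + v_j \right) + b, \]

such that $\|g - m\|_\infty \le \epsilon$. Define the dropout network,

\[ f^{(1)}(x) = \sum_{j=1}^{J} \frac{\tilde{w}_j}{1 - p} \gamma_j \psi\left( \sum_{d=1}^{D} \tilde{u}_{j,d} x_d + v_j \right) + b. \]

Then $\mathbb{E}[f^{(1)}] = g$, so that $\|\mathbb{E}[f^{(1)}] - m\|_\infty \le \epsilon$. Let $S = \|\mathbb{V}[f^{(1)}]\|_\infty < \infty$. Define $f = \frac{1}{L} \sum_{\ell=1}^{L} f^{(1,\ell)}$ where each $f^{(1,\ell)}$ is an independent realisation of $f^{(1)}$. Then $\mathbb{E}[f] = g$ and, since the variance of an average of $L$ independent copies is the variance of one copy divided by $L$,

\[ \mathbb{V}[f] = \frac{\mathbb{V}[f^{(1)}]}{L} \le \frac{S}{L}. \]

$f$ can be realised by a dropout network by combining $L$ copies of $f^{(1)}$ together with identical weights within each copy and 0 weights connecting the various copies. Choosing any integer $L \ge \max(1, S/\epsilon)$ completes the proof.

The following lemma states that the mean of the MCDO network can approximate any continuous positive function, after application of the ReLU non-linearity.

Lemma 13. Given a positive mean function $m$ with $0 < \delta \le \|m\|_\infty \le \Delta$ and a stochastic process $f$ such that $\|\mathbb{E}[f] - m\|_\infty \le \epsilon \le \delta$ and $\|\mathbb{V}[f]\|_\infty \le \epsilon$,

\[ \|\mathbb{E}[\psi(f)] - m\|_\infty \le \epsilon + \frac{\sqrt{\epsilon^2 + \epsilon (\Delta + \epsilon)^2}}{\delta - \epsilon} = O\big( \Delta \sqrt{\epsilon} / (\delta - \epsilon) \big) \]

and $\|\mathbb{V}[\psi(f)]\|_\infty \le \epsilon$. In the big-O notation, we assume $\Delta$ is bounded below by a constant and $\epsilon$, $\delta$ are bounded above by a constant.

Proof. The bound $\|\mathbb{V}[\psi(f)]\|_\infty \le \epsilon$ follows from Lemma 8. We consider the expectation of $\psi(f(x))$ for some arbitrary fixed $x$,

\begin{align*}
\big| \mathbb{E}[\psi(f(x))] - m(x) \big| &= \big| \mathbb{E}[f(x)] - m(x) - \mathbb{E}[\min(0, f(x))] \big| \\
&\le \big| \mathbb{E}[f(x)] - m(x) \big| + \big| \mathbb{E}[\min(0, f(x))] \big| \\
&\le \epsilon + \big| \mathbb{E}[\min(0, f(x))] \big|.
\end{align*}

We therefore bound $|\mathbb{E}[\min(0, f(x))]|$.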
\[ \big| \mathbb{E}[\min(0, f(x))] \big| = \big| \mathbb{E}\big[ f(x) \mathbb{1}\{f(x) < 0\} \big] \big| \le \sqrt{ \mathbb{E}[f(x)^2] \Pr(f(x) < 0) }. \]

The inequality uses Cauchy–Schwarz, the fact that the square of an indicator function is itself, and reinterprets the expectation of an indicator function as a probability. We bound the two terms on the RHS separately.

\begin{align*}
\mathbb{E}[f(x)^2] &= \mathbb{V}[f(x)] + \mathbb{E}[f(x)]^2 \\
&\le \epsilon + \mathbb{E}[f(x)]^2 \\
&\le \epsilon + (m(x) + \epsilon)^2 \\
&\le \epsilon + (\Delta + \epsilon)^2.
\end{align*}

We use Chebyshev’s inequality to bound the probability that $f(x) < 0$,

\begin{align*}
\Pr(f(x) < 0) &\le \Pr\big( |f(x) - \mathbb{E}[f(x)]| > m(x) - \epsilon \big) \\
&\le \frac{\mathbb{V}[f(x)]}{(m(x) - \epsilon)^2} \le \frac{\epsilon}{(m(x) - \epsilon)^2} \le \frac{\epsilon}{(\delta - \epsilon)^2}.
\end{align*}

Substituting both bounds gives $|\mathbb{E}[\min(0, f(x))]| \le \sqrt{\epsilon^2 + \epsilon(\Delta + \epsilon)^2} / (\delta - \epsilon)$, which yields the stated bound.

Having collected the necessary lemmas, we provide a construction that proves Theorem 10.

Proof of Theorem 10. Consider a 2HL dropout NN. Let the pre-activations in the first hidden layer be collectively denoted $a_1$, and the random dropout masks by $\boldsymbol{\epsilon}_1$. Let the second hidden layer have $I + 2$ hidden units. Let $\odot$ denote the elementwise product of two vectors of the same length. Define the pre-activations of two of the second hidden layer units by $a_v = w_v^\top (\boldsymbol{\epsilon}_1 \odot \psi(a_1))$, i.e. both these hidden units have identical weight vectors $w_v$ and dropout masks, and are hence the same random variable. Similarly, let the remaining $I$ second hidden layer pre-activations be defined by $a_m = w_m^\top (\boldsymbol{\epsilon}_1 \odot \psi(a_1))$, again all being the same random variable. Furthermore, let $(w_v)_i = 0$ whenever $(w_m)_i \ne 0$ and vice versa, so that the first hidden layer neurons that influence $a_v$ and those that influence $a_m$ form disjoint sets. Then the output of the 2HL network is:

\[ f = \epsilon_a w_{2,a} \psi(a_v) + \epsilon_b w_{2,b} \psi(a_v) + \sum_{i=1}^{I} \epsilon_i w_{2,i} \psi(a_m) + b_2, \]

where $\epsilon_a$, $\epsilon_b$, $\{\epsilon_i\}_{i=1}^{I}$ are the final layer dropout masks and $\{w_{2,i}\}_{i=1}^{I}$, $b_2$ are the final layer weights and bias. We now make the choices $w_{2,a} = 1$, $w_{2,b} = -1$, $w_{2,i} = \alpha$, where $\alpha I = 1/(1 - p)$. Then $\mathbb{E}[f] = \mathbb{E}[\psi(a_m)] + b_2$. Let $b_2 = \min_{x \in A} g - \delta$, where $\delta > 0$ and the min exists due to compactness of $A$. Define $g' = g - b_2$.
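Since $a_m$ is just the output of a single-hidden layer dropout network, for any $\gamma' > 0$ we can use Lemma 12 to choose $\|\mathbb{E}[a_m] - g'\|_\infty < \gamma'$ and $\|\mathbb{V}[a_m]\|_\infty < \gamma'$. Since $g'$ is bounded below by $\delta$ and bounded above by some $\Delta \in \mathbb{R}$ (by continuity of $g$ and compactness of $A$), we can then apply Lemma 13 to obtain $\|\mathbb{E}[\psi(a_m)] - g'\|_\infty = O\big(\Delta \sqrt{\gamma'} / (\delta - \gamma')\big)$ and $\|\mathbb{V}[\psi(a_m)]\|_\infty < \gamma'$. We can use this to bound the error in the mean of the 2HL network output:

\[ \|\mathbb{E}[f] - g\|_\infty = \|\mathbb{E}[\psi(a_m)] + b_2 - g\|_\infty = \|\mathbb{E}[\psi(a_m)] - g'\|_\infty = O\big( \Delta \sqrt{\gamma'} / (\delta - \gamma') \big). \]

We can choose $\gamma'$ to depend on $\delta$, $\Delta$ such that $\|\mathbb{E}[f] - g\|_\infty < \gamma$, proving the first part of the theorem. Next, calculating the variance,

\begin{align*}
\mathbb{V}[f] &= \mathbb{V}\left[ (\epsilon_a - \epsilon_b) \psi(a_v) + \alpha \psi(a_m) \sum_{i=1}^{I} \epsilon_i \right] \tag{C.3} \\
&= \mathbb{V}[(\epsilon_a - \epsilon_b) \psi(a_v)] + \alpha^2 \mathbb{V}\left[ \psi(a_m) \sum_{i=1}^{I} \epsilon_i \right]. \tag{C.4}
\end{align*}

Next we show that by taking $I$ sufficiently large, we can make the second term arbitrarily small. We have,

\begin{align*}
\mathbb{V}\left[ \psi(a_m) \sum_{i=1}^{I} \epsilon_i \right] &= \mathbb{V}[\psi(a_m)] \mathbb{V}\left[ \sum_{i=1}^{I} \epsilon_i \right] + \mathbb{V}[\psi(a_m)] \mathbb{E}\left[ \sum_{i=1}^{I} \epsilon_i \right]^2 + \mathbb{V}\left[ \sum_{i=1}^{I} \epsilon_i \right] \mathbb{E}[\psi(a_m)]^2 \\
&= \mathbb{V}[\psi(a_m)] I p (1 - p) + \mathbb{V}[\psi(a_m)] I^2 (1 - p)^2 + I p (1 - p) \mathbb{E}[\psi(a_m)]^2 \\
&\le \gamma' I p (1 - p) + \gamma' I^2 (1 - p)^2 + I p (1 - p) \mathbb{E}[\psi(a_m)]^2.
\end{align*}

The first two of these three terms can be made arbitrarily small by choosing $\gamma'$ sufficiently small. The third term, upon multiplying by $\alpha^2$, becomes

\[ \alpha^2 I p (1 - p) \mathbb{E}[\psi(a_m)]^2 = \frac{p}{I (1 - p)} \mathbb{E}[\psi(a_m)]^2, \]

which can also be made arbitrarily small by choosing $I \in \mathbb{N}$ sufficiently large.

We now show that the first term in Equation (C.4) can well approximate our target variance function $h$.

\begin{align*}
\mathbb{V}[(\epsilon_a - \epsilon_b) \psi(a_v)] &= \mathbb{V}[\epsilon_a - \epsilon_b] \mathbb{V}[\psi(a_v)] + \mathbb{V}[\epsilon_a - \epsilon_b] \mathbb{E}[\psi(a_v)]^2 + \mathbb{V}[\psi(a_v)] \mathbb{E}[\epsilon_a - \epsilon_b]^2 \tag{C.5} \\
&= 2 p (1 - p) \mathbb{V}[\psi(a_v)] + 2 p (1 - p) \mathbb{E}[\psi(a_v)]^2. \tag{C.6}
\end{align*}

Define $h' = \sqrt{\frac{h}{2p(1-p)}} + \delta'$, for some $\delta' > 0$. Again applying Lemma 12 (which we can do independently of the choice of $a_m$ since neurons influencing $a_v$ and $a_m$ are disjoint), for any $\gamma'' > 0$ we can choose $\|\mathbb{E}[a_v] - h'\|_\infty < \gamma''$ and $\|\mathbb{V}[a_v]\|_\infty < \gamma''$. The first term in Equation (C.6) can be made arbitrarily small by choosing $\gamma''$ small enough.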
We can again apply Lemma 13 so that $\|\mathbb{E}[\psi(a_v)] - h'\|_\infty = O\big(\Delta' \sqrt{\gamma''} / (\delta' - \gamma'')\big)$. We then bound the difference between the second term in Equation (C.6) and our target variance function:

\begin{align*}
&\big\| 2p(1-p) \mathbb{E}[\psi(a_v)]^2 - h \big\|_\infty \tag{C.7} \\
&\le \Big\| \sqrt{2p(1-p)} \, \mathbb{E}[\psi(a_v)] + \sqrt{h} \Big\|_\infty \Big\| \sqrt{2p(1-p)} \, \mathbb{E}[\psi(a_v)] - \sqrt{h} \Big\|_\infty \tag{C.8} \\
&\le \Big( \big\| 2\sqrt{h} \big\|_\infty + \Big\| \sqrt{2p(1-p)} \, \mathbb{E}[\psi(a_v)] - \sqrt{h} \Big\|_\infty \Big) \Big\| \sqrt{2p(1-p)} \, \mathbb{E}[\psi(a_v)] - \sqrt{h} \Big\|_\infty \tag{C.9}
\end{align*}

where Equation (C.8) follows from sub-multiplicativity of the infinity norm. Expanding the second term in Equation (C.9),

\begin{align*}
\Big\| \sqrt{2p(1-p)} \, \mathbb{E}[\psi(a_v)] - \sqrt{h} \Big\|_\infty &= \sqrt{2p(1-p)} \, \big\| \mathbb{E}[\psi(a_v)] - h' + \delta' \big\|_\infty \\
&= O\big( \delta' + \Delta' \sqrt{\gamma''} / (\delta' - \gamma'') \big).
\end{align*}

By first choosing $\delta'$ sufficiently small, and then choosing $\gamma''$ depending on $\delta'$, we can make this error term arbitrarily small. Since all the other contributions to $\mathbb{V}[f]$ were made arbitrarily small, this allows us to set $\|\mathbb{V}[f] - h\|_\infty < \gamma$, for any $\gamma > 0$, completing the proof.

In order to provide an analogous construction for MCDO BNNs with more than 2 hidden layers, we note that the above proof only requires a BNN output with a universal mean function and an arbitrarily small variance function in Lemma 12. Instead of a 1HL network, we can apply Theorem 4 to construct a 2 or more hidden layer network to provide these mean and variance functions. The rest of the proof then follows as in the 2HL case.

C.2 Counterexample when inputs are dropped out

In the case when the network has several hidden layers, dropout with inputs dropped defines a posterior with somewhat strange properties, as observed in Gal (2016, Section 4.2.1). In particular, in $D$ dimensions, a typical sample function from the approximate posterior will be constant as a function of roughly $pD$ of the input dimensions. However, which dimensions it is constant along depends on the particular sample. This behaviour is unlikely to be shared by the exact posterior.
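The behaviour just described is easy to reproduce numerically. The following is a minimal NumPy sketch, assuming a single-hidden-layer ReLU network with randomly drawn weights; the function name `dropout_net`, the sizes `D` and `I`, the dropout rate and the fixed input mask are hypothetical choices made purely for illustration.

```python
import numpy as np

def dropout_net(x, U, v, w, b, input_mask, hidden_mask):
    """One sample function of a 1HL ReLU network with dropout on inputs and hidden units."""
    a = (x * input_mask) @ U.T + v            # dropped input dimensions contribute nothing
    return (np.maximum(a, 0.0) * hidden_mask) @ w + b

rng = np.random.default_rng(0)                # seed chosen arbitrarily
D, I, p = 3, 20, 0.5                          # assumed sizes and dropout probability
U, v = rng.normal(size=(I, D)), rng.normal(size=I)
w, b = rng.normal(size=I), 0.1

# One particular realisation of the masks: input dimension 1 is dropped.
input_mask = np.array([1.0, 0.0, 1.0])
hidden_mask = rng.binomial(1, 1 - p, size=I).astype(float)

x = np.array([0.3, -1.0, 0.7])
x_shifted = x + np.array([0.0, 5.0, 0.0])     # move only along the dropped dimension

f_x = dropout_net(x, U, v, w, b, input_mask, hidden_mask)
f_shifted = dropout_net(x_shifted, U, v, w, b, input_mask, hidden_mask)
constant_along_dropped_dim = bool(np.isclose(f_x, f_shifted))
```

For this mask realisation the two outputs coincide, because the sample function cannot depend on a coordinate that has been zeroed out. A different mask draw would make the function constant along a different set of dimensions.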
We are able to exploit this type of behaviour to show that if inputs are dropped out, there are simple combinations of mean and variance functions that cannot be simultaneously approximated by the corresponding approximating family.

Proposition 8. Consider $f$ the (stochastic) output of an MC dropout network of arbitrary depth with inputs dropped out. For any $x, x' \in \mathbb{R}$ such that $\mathbb{V}[f(x)], \mathbb{V}[f(x')] < \epsilon^2$,

\[ \big| \mathbb{E}[f(x)] - \mathbb{E}[f(x')] \big| \le 2 \epsilon \sqrt{2/p}. \]

Proof. With probability $p$, the input is dropped out, so $\Pr(f(x) = f(x')) \ge p$. We apply Chebyshev’s inequality giving the bounds,

\[ \Pr\big( |f(x) - \mathbb{E}[f(x)]| \le r\epsilon \big) \ge 1 - 1/r^2 \quad \text{and} \quad \Pr\big( |f(x') - \mathbb{E}[f(x')]| \le r\epsilon \big) \ge 1 - 1/r^2, \]

for any $r > 0$. Choose $r = \sqrt{2/p} + \delta$ for any $\delta > 0$. Then $1/r^2 < p/2$, so the two Chebyshev events fail with total probability less than $p$, and there exists a realisation of the dropout network such that $|f(x) - \mathbb{E}[f(x)]| \le r\epsilon$, $|f(x') - \mathbb{E}[f(x')]| \le r\epsilon$ and $f(x) = f(x')$ simultaneously. Consequently,

\begin{align*}
\big| \mathbb{E}[f(x)] - \mathbb{E}[f(x')] \big| &= \big| \mathbb{E}[f(x)] - f(x) + f(x) - \mathbb{E}[f(x')] \big| \\
&= \big| \mathbb{E}[f(x)] - f(x) + f(x') - \mathbb{E}[f(x')] \big| \\
&\le \big| \mathbb{E}[f(x)] - f(x) \big| + \big| f(x') - \mathbb{E}[f(x')] \big| \\
&\le 2 r \epsilon = 2 \epsilon \sqrt{2/p} + 2 \epsilon \delta.
\end{align*}

Taking the limit as $\delta \to 0$ completes the proof.

In other words we can bound the difference in the mean output at two points in terms of the uncertainty at those points and the dropout probability. In $D > 1$ dimensions, we can get similarly tight bounds on lines parallel to a coordinate axis: for $x, x'$ on such a line $\Pr(f(x) = f(x')) \ge p$ still holds, since $f(x) = f(x')$ whenever the dimension on which $x$ and $x'$ differ is dropped out. Alternatively in $D$ dimensions for arbitrary $x, x' \in \mathbb{R}^D$, $\Pr(f(x) = f(x')) \ge p^D$. This comes from noting that with probability $p^D$ the output of the network is a constant function. However, we note this bound becomes exponentially weak as the input dimension increases.

Appendix D

ConvCNP experimental details

D.1 Baseline neural process models

In both our 1D and image experiments, our main comparison is to conditional neural process models. In particular, we compare to a vanilla MLP-CNP (1D only; Garnelo et al.
(2018a)) and an ACNP (Kim et al., 2018). Our architectures largely follow the details given in the relevant publications.

MLP-CNP baseline. Our baseline MLP-CNP follows the implementation provided by the authors.1 The encoder is a 3-layer MLP with 128 hidden units in each layer, and ReLU non-linearities. The encoder embeds every context point into a representation, and the representations are then averaged across each context set. Target inputs are then concatenated with the latent representations, and passed to the decoder. The decoder follows the same architecture, outputting mean and standard deviation channels for each input.

1 https://github.com/deepmind/neural-processes

Attentive CNP baseline. The ACNP we use corresponds to the deterministic path of the model described by Kim et al. (2018) for image experiments. Namely, an encoder first embeds each context point $c$ to a latent representation $(x^{(c)}, y^{(c)}) \mapsto r^{(c)}_{xy} \in \mathbb{R}^{128}$. For the image experiments, this is achieved using a 2-hidden layer MLP of hidden dimensions 128. For the 1D experiments, we use the same encoder as the MLP-CNP above. Every context point then goes through two stacked self-attention layers. Each self-attention layer is implemented with an 8-headed attention, a skip connection, and two layer normalizations (as described in Parmar et al. (2018), modulo the dropout layer). To predict values at each target point $t$, we embed $x^{(t)} \mapsto r^{(t)}_x$ and $x^{(c)} \mapsto r^{(c)}_x$ using the same single hidden layer MLP of dimension 128. A target representation $r^{(t)}_{xy}$ is then estimated by applying cross-attention (using an 8-headed attention described above) with keys $K := \{r^{(c)}_x\}_{c=1}^{C}$, values $V := \{r^{(c)}_{xy}\}_{c=1}^{C}$, and query $q := r^{(t)}_x$.
Given the estimated target representation $\hat{r}^{(t)}_{xy}$, the conditional predictive posterior is given by a Gaussian pdf with diagonal covariance parametrised by $(\mu^{(t)}, \sigma^{(t)}_{\mathrm{pre}}) = \mathrm{decoder}(r^{(t)}_{xy})$ where $\mu^{(t)}, \sigma^{(t)}_{\mathrm{pre}} \in \mathbb{R}^3$, and the decoder is a 4 hidden layer MLP with 64 hidden units per layer for the images, and the same decoder as the MLP-CNP for the 1D experiments. Following Le et al. (2018), we enforce a minimum standard deviation $\sigma^{(t)}_{\min} = [0.1; 0.1; 0.1]$ to avoid infinite log-likelihoods by using the following post-processed standard deviation:

\[ \sigma^{(t)}_{\mathrm{post}} = 0.1 \, \sigma^{(t)}_{\min} + (1 - 0.1) \log\big( 1 + \exp(\sigma^{(t)}_{\mathrm{pre}}) \big) \tag{D.1} \]

D.2 1-dimensional experiments

In this section, we give details regarding our experiments for the 1D data. In all experiments, the weights are optimised using Adam (Kingma and Ba, 2014) and weight decay of $10^{-5}$ is applied to all model parameters.

D.2.1 CNN architectures

We consider two models: ConvCNP (which utilises a smaller architecture), and ConvCNPXL (with a larger architecture). For all architectures, the input kernel $\psi$ was an EQ (exponentiated quadratic) kernel with a learnable length scale parameter, as detailed in Section 5.3, as was the kernel for the final output layer $\psi_\rho$. When dividing by the density channel, we add $\varepsilon = 10^{-8}$ to avoid numerical issues. The lengthscales for the EQ kernels are initialised to twice the spacing $1/\gamma^{1/d}$ between the discretisation points $(t_i)_{i=1}^{T}$, where $\gamma$ is the density of these points and $d$ is the dimensionality of the input space $\mathcal{X}$. The architectures for the ConvCNP and ConvCNPXL are described below.

ConvCNP. For the 1D experiments, we use a simple, 4-layer convolutional architecture, with ReLU nonlinearities. The kernel size of the convolutional layers was chosen to be 5, and all employed a stride of length 1 and zero padding of 2 units.
The number of channels per layer was set to [16, 32, 16, 2], where the final channels were then processed by the final, EQ-based layer of $\rho$ as mean and standard deviation channels. We employ a softplus nonlinearity on the standard deviation channel to enforce positivity. This model has 6,537 parameters.

ConvCNPXL. Our large architecture takes inspiration from UNet (Ronneberger et al., 2015). We employ a 12-layer architecture with skip connections. The number of channels is doubled every layer for the first 6 layers, and halved every layer for the final 6 layers. We use concatenation for the skip connections. The following describes which layers are concatenated, where $L_i \leftarrow [L_j, L_k]$ means that the input to layer $i$ is the concatenation of the activations of layers $j$ and $k$:

• $L_8 \leftarrow [L_5, L_7]$,
• $L_9 \leftarrow [L_4, L_8]$,
• $L_{10} \leftarrow [L_3, L_9]$,
• $L_{11} \leftarrow [L_2, L_{10}]$,
• $L_{12} \leftarrow [L_1, L_{11}]$.

As for the smaller architecture, we use ReLU nonlinearities, kernels of size 5, stride 1, and zero padding of two units on all layers.

D.2.2 Synthetic data

The kernels used for the Gaussian processes which generate the data are defined as follows:

• EQ: $k(x, x') = \exp\left( -\frac{1}{2} \left( \frac{x - x'}{0.25} \right)^2 \right)$,
• weakly periodic: $k(x, x') = \exp\left( -\frac{1}{2} (f_1(x) - f_1(x'))^2 - \frac{1}{2} (f_2(x) - f_2(x'))^2 \right) \cdot \exp\left( -\frac{1}{8} (x - x')^2 \right)$, with $f_1(x) = \cos(8\pi x)$ and $f_2(x) = \sin(8\pi x)$, and
• Matérn–5/2: $k(x, x') = \left( 1 + 4\sqrt{5} d + \frac{5}{3} d^2 \right) e^{-\sqrt{5} d}$ with $d = 4|x - x'|$.

Fig. D.1 Example functions learned by (top) the ConvCNP, (center) ACNP, and (bottom) CNP when trained on an EQ kernel (with length scale parameter 1). “True function” refers to the sample from the GP prior from which the context and target sets were sub-sampled. “Ground Truth GP” refers to the GP posterior distribution when using the exact kernel and performing posterior inference based on the context set. The left column shows the predictive posterior of the models when data is presented in the same range as training.
The centre column shows the model predicting outside the training data range when no data is observed there. The right-most column shows the model predictive posteriors when presented with data outside the training data range.

During the training procedure, the number of context points and target points for a training batch are each selected randomly from a uniform distribution over the integers between 3 and 50. These context and target points are then sampled at random from a function drawn from the process (a Gaussian process with one of the above kernels or the sawtooth process), where input locations are uniformly sampled from the interval $[-2, 2]$. All models in this experiment were trained for 200 epochs using 256 batches per epoch of batch size 16. We discretise $E(Z)$ by evaluating 64 points per unit in this setting. We use a learning rate of $3 \times 10^{-4}$ for all models, except for ConvCNPXL on the sawtooth data, where we use a learning rate of $1 \times 10^{-3}$ (this learning rate was too large for the other models).

Fig. D.2 Example functions learned by the (top) ConvCNP, (center) ACNP, and (bottom) CNP when trained on a Matérn-5/2 kernel (with length scale parameter 0.25). “True function” refers to the sample from the GP prior from which the context and target sets were sub-sampled. “Ground Truth GP” refers to the GP posterior distribution when using the exact kernel and performing posterior inference based on the context set. The left column shows the predictive posterior of the models when data is presented in the same range as training. The centre column shows the model predicting outside the training data range when no data is observed there. The right-most column shows the model predictive posteriors when presented with data outside the training data range.
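The GP task sampling described in this section can be sketched as follows. This is a minimal NumPy illustration under stated assumptions rather than the training code itself: the function names `eq_kernel` and `sample_gp_task`, the diagonal jitter of $10^{-6}$ added for numerical stability, and the choice to keep the context and target sets disjoint are all assumptions of the example.

```python
import numpy as np

def eq_kernel(x1, x2, lengthscale=0.25):
    """EQ kernel k(x, x') = exp(-0.5 * ((x - x') / lengthscale)^2) on 1-D inputs."""
    diff = (x1[:, None] - x2[None, :]) / lengthscale
    return np.exp(-0.5 * diff ** 2)

def sample_gp_task(rng, kernel=eq_kernel, min_points=3, max_points=50,
                   x_range=(-2.0, 2.0), jitter=1e-6):
    """Sample one regression task: context and target sets from a single GP draw."""
    # Set sizes are uniform over the integers min_points..max_points (inclusive).
    n_context = int(rng.integers(min_points, max_points + 1))
    n_target = int(rng.integers(min_points, max_points + 1))
    n = n_context + n_target
    # Input locations are uniform on the training interval.
    x = rng.uniform(x_range[0], x_range[1], size=n)
    # Draw the function values jointly via a Cholesky factor (jitter is an assumption).
    K = kernel(x, x) + jitter * np.eye(n)
    y = np.linalg.cholesky(K) @ rng.standard_normal(n)
    # Assumed split: the first n_context points form the context set.
    return (x[:n_context], y[:n_context]), (x[n_context:], y[n_context:])

rng = np.random.default_rng(0)  # seed chosen arbitrarily
(x_context, y_context), (x_target, y_target) = sample_gp_task(rng)
```

Drawing the context and target values from one joint sample matters: both sets must come from the same underlying function for the task to be a meaningful prediction problem.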
The random sawtooth samples are generated from the following function:

$$y_{\text{sawtooth}}(t) = \frac{A}{2} - \frac{A}{\pi} \sum_{k=1}^{\infty} (-1)^k \frac{\sin(2\pi k f t)}{k}, \tag{D.2}$$

where A is the amplitude, f is the frequency, and t is “time”. Throughout training, we fix the amplitude to be one. We truncate the series at an integer K. At every iteration, we sample a frequency uniformly in [3, 5], K in [10, 20], and a random shift in [−5, 5]. As the task is much harder, we sample context and target set sizes over [3, 100]. Here the MLP-CNP and ACNP employ learning rates of $10^{-3}$. All other hyperparameters remain unchanged.

We include additional figures showing the performance of ConvCNPs, ACNPs and MLP-CNPs on GP and sawtooth function regression tasks in Figures D.1 to D.3.

Fig. D.3 Example functions learned by the (top) ConvCNP, (center) ACNP, and (bottom) CNP when trained on a random sawtooth sample. The left column shows the predictive posterior of the models when data is presented in the same range as training. The centre column shows the model predicting outside the training data range when no data is observed there. The right-most column shows the model predictive posteriors when presented with data outside the training data range.

D.3 Image experimental details and additional results

D.3.1 Experimental details

Training details In all experiments, we sample the number of context points uniformly from $U(n_{\text{total}}/100, n_{\text{total}}/2)$, and the number of target points is set to $n_{\text{total}}$. The context and target points are sampled randomly from each of the 16 images per batch. The weights are optimised using Adam (Kingma and Ba, 2014) with learning rate $5 \times 10^{-4}$. We use a maximum of 100 epochs, with early stopping of 15 epochs patience. All pixel values are divided by 255 to rescale them to the [0, 1] range. In the following discussion, we assume that images are RGB, but very similar models can be used for greyscale images or other gridded inputs (e.g.
1D time series sampled at uniform intervals).

Proposed convolutional CNP. Unlike ACNP and off-the-grid ConvCNP, on-the-grid ConvCNP takes advantage of the gridded structure. Namely, the target and context points can be specified in terms of the image, a context mask $M_c$, and a target mask $M_t$ instead of sets of input–value pairs. Although this is an equivalent formulation, it is more natural and simpler to implement in standard deep learning libraries. In the following, we dissect the architecture and algorithmic steps succinctly summarised in Section 5.3. Note that all the convolutional layers are actually depthwise separable (Chollet, 2017); this enables a large kernel size (i.e. receptive field) while being parameter and computationally efficient.

1. Let $I$ denote the image. Select all context points $\text{signal} := M_c \odot I$ and append a density channel $\text{density} := M_c$, which intuitively says that “there is a point at this position”: $[\text{signal}, \text{density}]^\top$. Each pixel value will now have 4 channels: 3 RGB channels and 1 density channel $M_c$. Note that the mask will set the pixel value to 0 at a location where the density channel is 0, indicating there are no points at this position (a missing value).

2. Apply a convolution to the density channel $\text{density}' = \text{conv}_\theta(\text{density})$ and a normalised convolution to the signal $\text{signal}' := \text{conv}_\theta(\text{signal})/\text{density}'$. The normalised convolution makes sure that the output mostly depends on the scale of the signal rather than the number of observed points. The output has 128 channels. The kernel size of $\text{conv}_\theta$ depends on the image shape and model used (Table D.1). We also enforce element-wise positivity of the trainable filter by taking the absolute value of the kernel weights θ before applying the convolution. Note that in this setting, E(Z) is $[\text{signal}', \text{density}']^\top$.

3. Now we describe the on-the-grid version of ρ(·), which we decompose into two stages.
In the first stage, we apply a CNN to $[\text{signal}', \text{density}']^\top$. This CNN is composed of residual blocks (He et al., 2016), each consisting of 1 or 2 (Table D.1) convolutional layers with ReLU activations and no batch normalisation. The number of output channels in each layer is 128. The kernel size is the same across the whole network, but depends on the image shape and model used (Table D.1).

4. In the second stage of ρ(·), we apply a shared pointwise MLP : $\mathbb{R}^{128} \to \mathbb{R}^{2C}$ (we use the same architecture as used for the ACNP decoder) to the output of the first stage at each pixel location in the target set. Here C denotes the number of channels in the image. The first C outputs of the MLP are treated as the means of a Gaussian predictive distribution, and the last C outputs are treated as the standard deviations. These then pass through the positivity-enforcing function shown in Equation (D.1).

Table D.1 CNN architecture for the image experiments.

| Model | Input Shape | conv_θ Kernel Size | CNN Kernel Size | CNN Num. Res. Blocks | Conv. Layers per Block |
|---|---|---|---|---|---|
| ConvCNP | < 50 pixels | 9 | 5 | 4 | 1 |
| ConvCNP | > 50 pixels | 7 | 3 | 4 | 1 |
| ConvCNP XL | any | 9 | 11 | 6 | 2 |

D.3.2 ACNP and ConvCNP qualitative comparison

Figure D.4 shows the test log-likelihood distributions of an ACNP and ConvCNP model as well as some qualitative comparisons between the two. Although most mean predictions of both models look relatively similar for SVHN and CelebA32, the real advantage of the ConvCNP becomes apparent when testing the generalization capacity of both models. Figure D.5 shows the ConvCNP and ACNP trained on CelebA32 and tested on a downscaled version of Ellen’s famous Oscar selfie. We see that ConvCNP generalises better in this setting.²

² The reconstruction looks worse than Figure 5.6b despite the larger context set, because the test image has been downscaled and the models are trained on a low resolution CelebA32. These constraints come from ACNP’s large memory footprint.
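Steps 1 and 2 of the on-the-grid ConvCNP in Section D.3.1 (masking, density channel and normalised convolution) can be sketched as follows. This is a simplified NumPy illustration under several assumptions: it handles a single-channel image and a single filter rather than 128 depthwise-separable output channels, `conv2d_same` and `normalised_conv` are hypothetical helper names, and the small `eps` guarding the division is not part of the description above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_same(image, kernel):
    """Zero-padded, stride-1 2D cross-correlation with an odd-sized kernel."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    windows = sliding_window_view(padded, (kh, kw))  # shape (H, W, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)


def normalised_conv(image, context_mask, weights, eps=1e-8):
    """Return (signal', density') for one image channel and one filter."""
    kernel = np.abs(weights)        # element-wise positivity of the filter
    signal = context_mask * image   # missing pixels are set to 0
    density = context_mask          # 1 where a context point is observed
    density_out = conv2d_same(density, kernel)
    # Dividing by density' removes the dependence on how many points were seen.
    signal_out = conv2d_same(signal, kernel) / (density_out + eps)
    return signal_out, density_out


rng = np.random.default_rng(0)
image = rng.uniform(size=(16, 16))
mask = (rng.uniform(size=(16, 16)) < 0.3).astype(float)
signal_out, density_out = normalised_conv(image, mask, rng.standard_normal((9, 9)))
```

For a constant image, the normalised output equals that constant wherever at least one context point falls inside the filter support, regardless of how many points are observed; this is the invariance the normalisation is meant to provide.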
Fig. D.4 Log-likelihood and qualitative comparisons between ACNP and ConvCNP on four standard benchmarks: (a) MNIST, (b) SVHN, (c) CelebA 32 × 32, (d) CelebA 64 × 64. The top row shows the log-likelihood distribution for both models. The images below correspond to the context points (top), ConvCNP target predictions (middle), and ACNP target predictions (bottom). Each column corresponds to a given percentile of the ConvCNP distribution. ACNP could not be trained on CelebA64 due to its memory inefficiency.

Fig. D.5 Qualitative evaluation of a ConvCNP (center) and ACNP (right) trained on CelebA32 and tested on a downscaled version (146 × 259) of Ellen’s Oscar selfie (DeGeneres, 2014) with 20% of the pixels as context (left).

Appendix E Effect of number of samples used on evaluation of latent neural processes

As the exact log-likelihoods of latent neural process models are intractable, quantitative evaluation and comparison of models is challenging. Instead, we compare models by using an estimate of the log-likelihood. A natural candidate is $\hat{\mathcal{L}}_{\text{ML}}$. However, unless the number of samples $L$ used is large, $\hat{\mathcal{L}}_{\text{ML}}$ is conservative and tends to significantly underestimate the log-likelihood. One way to improve the estimate of $\hat{\mathcal{L}}_{\text{ML}}$ is through importance weighting (IW) (Le et al., 2018; Wu et al., 2017). Denoting $D = D_c \cup D_t$, the ConvLNP encoder $\mathrm{E}_\phi(D)$ can be used as a proposal distribution:

$$\hat{\mathcal{L}}_{\text{IW}}(\theta, \phi; \xi) := \log\left[\frac{1}{L}\sum_{l=1}^{L} \exp\left(\log w(z_l) + \sum_{(x,y)\in D_t} \log p_\theta(y \mid x, z_l)\right)\right], \quad z_l \sim \mathrm{E}_\phi(D), \tag{E.1}$$

where the importance weights are given by $\log w(z_l) := \log q_\phi(z_l \mid D_c) - \log q_\phi(z_l \mid D)$. Here $q_\phi(z \mid D)$ is the density of the encoder distribution. We find that training models with $\hat{\mathcal{L}}_{\text{ML}}$ results in encoders that are ill-suited as proposal distributions since the distribution over some of the latent variables can become deterministic, so we only use $\mathcal{L}_{\text{IW}}$ to evaluate models trained with $\mathcal{L}_{\text{NPVI}}$.
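Equation (E.1) can be evaluated stably with a log-mean-exp over the L samples. The sketch below is a generic NumPy illustration, not the evaluation code used in the thesis: the function names are hypothetical, and the usage example replaces the neural process by a toy conjugate Gaussian model (an assumption made purely so that the exact answer is known).

```python
import numpy as np


def normal_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = np.max(values)
    return m + np.log(np.mean(np.exp(values - m)))


def iw_estimate(log_q_context, log_q_full, log_lik):
    """Importance-weighted estimate in the form of Equation (E.1).

    log_q_context[l] = log q(z_l | D_c)
    log_q_full[l]    = log q(z_l | D), with z_l drawn from q(. | D)
    log_lik[l]       = sum over the target set of log p(y | x, z_l)
    """
    log_w = log_q_context - log_q_full
    return log_mean_exp(log_w + log_lik)


# Toy check: z ~ N(0, 1) plays the role of q(z | D_c), y | z ~ N(z, 0.5^2),
# and the proposal is the exact posterior of z given y.
y, noise_std = 1.0, 0.5
post_var = 1.0 / (1.0 + 1.0 / noise_std**2)
post_mean = post_var * y / noise_std**2
rng = np.random.default_rng(0)
z = post_mean + np.sqrt(post_var) * rng.standard_normal(8)
estimate = iw_estimate(
    normal_logpdf(z, 0.0, 1.0),
    normal_logpdf(z, post_mean, np.sqrt(post_var)),
    normal_logpdf(y, z, noise_std),
)
exact = normal_logpdf(y, 0.0, np.sqrt(1.0 + noise_std**2))
```

In this toy case the proposal equals the exact posterior, so every term inside the average equals the marginal likelihood and the estimate matches `exact` for any L. The closer the proposal is to the posterior, the tighter the estimate for a given number of samples.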
Effect of number of samples used during evaluation Figure E.1 demonstrates the effect of the number of samples $L$ used to estimate the evaluation objective for the ConvLNP and ANP trained with $\hat{\mathcal{L}}_{\text{ML}}$ and $\mathcal{L}_{\text{NPVI}}$. The models used to generate Figure E.1 are the same models used in Section 5.6.1, i.e. having heteroskedastic noise. Observe the general trend that the log-likelihood estimates tend to increase with $L$, as expected. The ANP trained with $\mathcal{L}_{\text{NPVI}}$ collapsed to a conditional ANP, meaning that the encoder became deterministic; in that case, $\hat{\mathcal{L}}_{\text{ML}}$ is exact, which means that larger $L$ and importance weighting will not increase the estimate. In contrast, the ANP trained with $\hat{\mathcal{L}}_{\text{ML}}$ did not collapse, and we see that there the estimate increases with $L$. For the ConvLNP trained with $\mathcal{L}_{\text{NPVI}}$, evaluating with $\mathcal{L}_{\text{IW}}$ yields a significant increase, showing that the bound estimated with $\mathcal{L}_{\text{NPVI}}$ is very loose. The models trained with $\hat{\mathcal{L}}_{\text{ML}}$ tend to be the best performing, although the ConvLNP trained with $\mathcal{L}_{\text{NPVI}}$ is best for the weakly periodic kernel and appears to still be increasing with $L$. In this thesis, all log-likelihood lower bounds for LNPs are computed with $\hat{\mathcal{L}}_{\text{ML}}$ if the model was trained using $\hat{\mathcal{L}}_{\text{ML}}$ and with $\mathcal{L}_{\text{IW}}$ if the model was trained using $\mathcal{L}_{\text{NPVI}}$.

Fig. E.1 Log-likelihood bounds achieved by various combinations of models and training objectives when evaluated with $\hat{\mathcal{L}}_{\text{ML}}$ and $\mathcal{L}_{\text{IW}}$ for various numbers of samples $L$ (horizontal axis: number of samples $L$ in the evaluation loss, from 1 to 16384). Panels: (a) Matérn–5/2; (b) weakly periodic kernel. Color indicates model. Solid lines correspond to models trained and evaluated with $\hat{\mathcal{L}}_{\text{ML}}$. Dashed lines correspond to models trained with $\mathcal{L}_{\text{NP}}$ and evaluated with $\mathcal{L}_{\text{IW}}$. Dotted lines correspond to models trained with $\mathcal{L}_{\text{NP}}$ and evaluated with $\hat{\mathcal{L}}_{\text{ML}}$. In this figure $\mathcal{L}_{\text{NP}}$ is used as an abbreviation for $\mathcal{L}_{\text{NPVI}}$.

Fig. E.2 Interpolation performance (within training range) for context set sizes uniformly sampled from {0, . . . , 50} of the ConvNP and ANP on Matérn–5/2 samples. The models are trained with $\hat{\mathcal{L}}_{\text{ML}}$ and $\mathcal{L}_{\text{NPVI}}$ for various numbers of samples $L$. Models trained with $\hat{\mathcal{L}}_{\text{ML}}$ are evaluated with $\hat{\mathcal{L}}_{\text{ML}}$, while models trained with $\mathcal{L}_{\text{NPVI}}$ are evaluated with $\hat{\mathcal{L}}_{\text{ML}}$. At evaluation, all bounds are estimated using 2,048 samples. In this figure $\mathcal{L}_{\text{NP}}$ is used as an abbreviation for $\mathcal{L}_{\text{NPVI}}$.

Effect of number of samples used during training Figure E.2 shows the effect of the number of samples $L$ in the training objectives on the performance of the ConvNP and ANP. Observe that the performance of $\hat{\mathcal{L}}_{\text{ML}}$ reliably increases with the number of samples $L$ and that $\hat{\mathcal{L}}_{\text{ML}}$ outperforms $\mathcal{L}_{\text{NPVI}}$. The performance for $\mathcal{L}_{\text{NPVI}}$ does not appear to increase with the number of samples $L$ and appears more noisy than $\hat{\mathcal{L}}_{\text{ML}}$. Note that the models used for Figure E.2 were trained with homoskedastic observation noise. This is achieved by pooling the output of the model corresponding to the predictive standard deviation, i.e., $f_\sigma$, over the time dimension.
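The tendency of the sampled log-likelihood estimate to underestimate the true value for small L, discussed in this appendix, can be reproduced in a toy setting. The sketch below is an illustration only: it assumes a one-dimensional conjugate Gaussian model (latent z with a standard normal prior and Gaussian observation noise) in place of a neural process, and the names `ml_estimate` and `averages` are hypothetical.

```python
import numpy as np


def normal_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2


def ml_estimate(y, noise_std, num_samples, rng):
    """log[(1/L) * sum_l p(y | z_l)] with z_l ~ N(0, 1) (toy prior)."""
    z = rng.standard_normal(num_samples)
    log_lik = normal_logpdf(y, z, noise_std)
    m = log_lik.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_lik - m)))


rng = np.random.default_rng(0)
y, noise_std = 1.0, 0.5
# Exact log marginal likelihood: y ~ N(0, 1 + noise_std^2).
exact = normal_logpdf(y, 0.0, np.sqrt(1.0 + noise_std**2))
# Average the estimator over repeated runs for several values of L.
averages = {
    L: np.mean([ml_estimate(y, noise_std, L, rng) for _ in range(200)])
    for L in (1, 16, 4096)
}
for L, value in averages.items():
    print(f"L = {L:5d}: average estimate {value:.3f} (exact {exact:.3f})")
```

By Jensen's inequality the expected estimate lies below the exact log-likelihood for every finite L and approaches it as L grows, which is the trend visible in Figure E.1 for models whose encoder has not collapsed.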
Appendix F ConvLNP experimental details

F.1 Experimental details on 1D regression

In the 1D regression experiments, we consider the following generative processes:

EQ: samples from a Gaussian process with the following exponentiated-quadratic kernel:
$$k(t, t') = \exp\left(-\tfrac{1}{8}(t - t')^2\right);$$

Matérn–5/2: samples from a Gaussian process with the following Matérn–5/2 kernel:
$$k(t, t') = \left(1 + 4\sqrt{5}d + \tfrac{5}{3}d^2\right)\exp\left(-\sqrt{5}d\right) \quad \text{with } d = 4|t - t'|;$$

noisy mixture: samples from a Gaussian process with the following noisy mixture kernel:
$$k(t, t') = \exp\left(-\tfrac{1}{8}(t - t')^2\right) + \exp\left(-\tfrac{1}{2}(t - t')^2\right) + 10^{-3}\delta[t - t'];$$

weakly periodic: samples from a Gaussian process with the following weakly-periodic kernel:
$$k(t, t') = \exp\left(-\tfrac{1}{2}(f_1(t) - f_1(t'))^2 - \tfrac{1}{2}(f_2(t) - f_2(t'))^2 - \tfrac{1}{8}(t - t')^2\right)$$
with $f_1(t) = \cos(8\pi t)$ and $f_2(t) = \sin(8\pi t)$; and

sawtooth: samples from the following sawtooth process:
$$f(t) = \frac{A}{2} - \frac{A}{\pi}\sum_{k=1}^{K} (-1)^k \frac{\sin(2\pi k f (t - s))}{k}$$
with $A = 1$, $f \sim U[3, 5]$, $s \sim U[-5, 5]$, and $K \in \{10, \ldots, 20\}$ chosen uniformly.

We compare the following models, where all activation functions are leaky ReLUs with leak 0.1:

ConvCNP: The first model is the ConvCNP. The architecture of the ConvCNP is equal to that of the encoder in the ConvLNP, described next.

ConvLNP: The second model is the ConvLNP as described in the main body. The functional embedding uses separate length scales for the data channel and density channel (Figure 5.8), which are initialized to twice the inter-point spacing of the discretization and learned during training. The discretization uniformly ranges over $[\min(x) - 1, \max(x) + 1]$ at density ρ = 64 points per unit, where $\min(x)$ is the minimum x value occurring in the union of the context and target sets in the current batch and $\max(x)$ is the corresponding maximum x value. The discretization is passed through a 10-layer (excluding an initial and final point-wise linear layer) CNN with 64 channels and depthwise-separable convolutions.
The width of the filters depends on the data set and is chosen such that the receptive field sizes are as follows: EQ: 2, Matérn–5/2: 2, noisy mixture: 4, weakly periodic: 4, sawtooth: 16. The discretized functional representation consists of 16 channels. The smoothing at the end of the encoder also has separate length scales for the mean and variance which are initialized similarly and learned. The encoder parametrizes the standard deviations by passing the output of the CNN through a softplus. The decoder has the same architecture as the encoder.

ANP: The third model is the Attentive NP with latent dimensionality d = 128 and 8-head dot-product attention (Vaswani et al., 2017). In the attentive deterministic encoder, the keys (t), queries (t), and values (concatenation of t and y) are transformed by a three-layer MLP of constant width d. The dot products are normalised by $\sqrt{d}$. The output of the attention mechanism is passed through a constant-width linear layer, which is then passed through two layers of layer normalization (Ba et al., 2016) to normalise the latent representation. In the first of these two layers, the transformed queries are first passed through a constant-width linear layer and added to the input. In the second of these two layers, the output of the first layer is first passed through a two-layer constant-width MLP and added to itself, making a residual layer. In the stochastic encoder, the inputs and outputs are concatenated and passed through a three-layer MLP of constant width d. The result is mean-pooled and passed through a two-layer constant-width MLP. The decoder consists of a three-layer MLP of constant width d.

NP: The fourth model is the original NP (Garnelo et al., 2018b). The architecture is similar to that of the ANP, where the architecture of the deterministic encoder is replaced by that of the stochastic encoder.
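The sawtooth process listed among the generative processes of Section F.1 can be sketched as follows. This is a minimal NumPy illustration of the truncated series with A = 1, a frequency drawn from U[3, 5], a shift from U[−5, 5], and K drawn uniformly from {10, . . . , 20}; the function name and the way randomness is passed in are assumptions.

```python
import numpy as np


def sample_sawtooth(t, rng, amplitude=1.0):
    """Evaluate one random truncated-Fourier sawtooth at the inputs t."""
    freq = rng.uniform(3.0, 5.0)      # frequency f ~ U[3, 5]
    shift = rng.uniform(-5.0, 5.0)    # shift s ~ U[-5, 5]
    num_terms = rng.integers(10, 21)  # K in {10, ..., 20}; upper bound exclusive
    k = np.arange(1, num_terms + 1)[:, None]
    # Terms (-1)^k * sin(2 pi k f (t - s)) / k for k = 1, ..., K.
    terms = (-1.0) ** k * np.sin(2 * np.pi * k * freq * (t[None, :] - shift)) / k
    return amplitude / 2 - amplitude / np.pi * terms.sum(axis=0)


rng = np.random.default_rng(0)
inputs = rng.uniform(-2.0, 2.0, size=50)
outputs = sample_sawtooth(inputs, rng)
```

Because the series is truncated at a finite K, each sample is a smooth approximation to a sawtooth with a small overshoot near the discontinuities rather than an exact piecewise-linear function.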
For all models, positivity of the observation noise is enforced with a softplus function. Parameter counts of the ConvCNP, ConvLNP, ANP, and MLP-LNP are listed in Table F.1. The models are trained with $\hat{\mathcal{L}}_{\text{ML}}$ (L = 20) and $\mathcal{L}_{\text{NPVI}}$ (L = 5). For $\mathcal{L}_{\text{NPVI}}$, the context set is appended to the target set when evaluating the objective. The models are optimised using ADAM with learning rate $5 \cdot 10^{-3}$ for 100 epochs. One epoch consists of $2^{14}$ tasks divided into batches of size 16. For training, the inputs of the context and target sets are sampled uniformly from [−2, 2]. The size of the context set is sampled uniformly from {0, . . . , 50} and the size of the target set is fixed to 50. To encourage the LNP-based models (but not the CNP-based models) to fit and not revert to their conditional variants, the observation noise standard deviation σ is held fixed to $10^{-2}$ for the first 20 epochs.

Table F.1 Parameter counts of models in 1D regression.

| Model | EQ | Matérn–5/2 | Noisy Mixt. | Weakly Per. | Sawtooth |
|---|---|---|---|---|---|
| ConvCNP | 42 822 | 42 822 | 51 014 | 51 014 | 100 166 |
| ConvLNP | 88 486 | 88 486 | 104 870 | 104 870 | 203 174 |
| ANP | 530 178 | 530 178 | 530 178 | 530 178 | 530 178 |
| MLP-LNP | 479 874 | 479 874 | 479 874 | 479 874 | 479 874 |

For evaluation, the size of the context set is sampled uniformly from {0, . . . , 10}, and the losses are evaluated with L = 5000 and batch size one. To test interpolation within the training range, the inputs of the context and target sets are, as in training, sampled uniformly from [−2, 2]. To test interpolation beyond the training range, the inputs of the context and target sets are sampled uniformly from [2, 6]. To test extrapolation beyond the training range, the inputs of the context sets are sampled uniformly from [−2, 2] and the inputs of the target sets are sampled uniformly from [−4, −2] ∪ [2, 4]. As described in Appendix E, models trained with $\mathcal{L}_{\text{NPVI}}$ are evaluated using importance weighting to obtain a better estimate of the evaluation loss.
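The three evaluation regimes of Section F.1 differ only in where the context and target inputs are drawn. A minimal sketch, assuming hypothetical regime names and a hypothetical helper `sample_eval_inputs`, could look like this (the target set size at evaluation is passed in as an argument rather than fixed here):

```python
import numpy as np


def sample_eval_inputs(regime, n_context, n_target, rng):
    """Sample context and target inputs for one evaluation task."""
    if regime == "interpolation":
        # Same range as training.
        x_context = rng.uniform(-2.0, 2.0, n_context)
        x_target = rng.uniform(-2.0, 2.0, n_target)
    elif regime == "interpolation_beyond":
        # Both sets shifted outside the training range.
        x_context = rng.uniform(2.0, 6.0, n_context)
        x_target = rng.uniform(2.0, 6.0, n_target)
    elif regime == "extrapolation":
        # Context inside the training range, targets in [-4, -2] U [2, 4].
        x_context = rng.uniform(-2.0, 2.0, n_context)
        magnitude = rng.uniform(2.0, 4.0, n_target)
        sign = rng.choice([-1.0, 1.0], size=n_target)
        x_target = sign * magnitude
    else:
        raise ValueError(f"unknown regime: {regime}")
    return x_context, x_target


rng = np.random.default_rng(0)
n_context = rng.integers(0, 11)  # context size uniform over {0, ..., 10}
x_context, x_target = sample_eval_inputs("extrapolation", n_context, 50, rng)
```

Sampling a magnitude and an independent sign gives a uniform draw from the union of the two target intervals, since both intervals have the same length.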
F.2 Experimental details on image completion

F.2.1 Data details

We use three standard datasets throughout our image experiments: SVHN (Netzer et al., 2011), MNIST (LeCun et al., 1989), and 32 × 32 CelebA (Netzer et al., 2011). The aforementioned standard datasets all contain only a single, well-centered object. To evaluate the translation equivariance and generalisation capabilities of our model we evaluate on the zero shot multi-MNIST (ZSMM) task described in Section 5.4.2. Namely, we generate a test set by randomly sampling with replacement 10000 pairs of digits from the MNIST test set, place them on a black 56 × 56 background, and translate the digits in such a way that the digits can be arbitrarily close but cannot overlap. However, we make one change from the dataset described in Section 5.4.2: the training set now consists of the standard MNIST digits (instead of a single digit placed in the center of a 56 × 56 canvas), augmented by up to 4 pixel shifts (Figure 5.5a). The model thus has to generalise both to a larger canvas size and to seeing multiple digits.

For all data sets, pixel values are divided by 255 to rescale them to the [0, 1] range. We evaluate on predefined test splits when available (MNIST, SVHN, ZSMM) and make our own test set for CelebA by randomly selecting 10% of the data. For each dataset we also set aside 10% of the training set as validation.

F.2.2 Training details

In all experiments, we sample the number of context pixels uniformly from $U(0, n_{\text{total}}/2)$, and the number of target points is set to $n_{\text{total}}$. The weights are optimised using Adam (Kingma and Ba, 2014) with learning rate $5 \times 10^{-4}$. We use a maximum of 100 epochs, with early stopping (based on log-likelihood on the validation set) with a patience of 10 epochs. Unless stated otherwise, we use L = 16 samples from the latent function during training, and L = 128 at test time.
We clip the $\ell_2$ norm of all gradients to 1, which was particularly important for ConvLNP. We use a batch size of 32 for all models besides ANP trained on ZSMM which used a batch size of 8 due to memory constraints.

F.2.3 General architecture details

For all models, we follow Le et al. (2018) and process the predicted standard deviation of the latent function $\sigma_z$ using a sigmoid and the standard deviation σ of the predictive distribution using lower-bounded softplus:

$$\sigma_z = 0.001 + (1 - 0.001)\frac{1}{1 + \exp(f_{\sigma,z})} \tag{F.1}$$

$$\sigma = 0.001 + (1 - 0.001)\ln(1 + \exp(f_\sigma)) \tag{F.2}$$

As the pixels are rescaled to [0, 1], we also process the mean of the posterior predictive (conditioned on a single sample) to be in [0, 1] using a logistic function

$$\mu = \frac{1}{1 + \exp(-f_\mu)} \tag{F.3}$$

In the following, we describe the architecture of ANP and ConvLNP. Unless stated otherwise, all vectors in the following paragraphs are in $\mathbb{R}^{128}$ and all MLPs have 128 hidden units.

ANP details We provide details for the ANP trained with $\hat{\mathcal{L}}_{\text{ML}}$. As the ANP cannot take advantage of the fact that images are on the grid, we preprocess each pixel so that $x \in [-1, 1]^2$. The only exception is the test set of ZSMM, where $x \in [-\tfrac{56}{32}, \tfrac{56}{32}]^2$ as the model is trained on 32 × 32 but evaluated on 56 × 56 images. Let superscript c index the context points from 1, . . . , C, and let superscript t index the target points from 1, . . . , T. Each context feature is first encoded $x^{(c)} \mapsto r_x^{(c)}$ by a single hidden layer MLP, while a second single hidden layer MLP encodes values $y^{(c)} \mapsto r_y^{(c)}$. We produce a representation $r_{xy}^{(c)}$ by summing both representations $r_x^{(c)} + r_y^{(c)}$ and passing them through two self-attention layers (Vaswani et al., 2017). Following Parmar et al. (2018), each self-attention layer is implemented as 8-headed attention, a skip connection, and two layer normalizations (Ba et al., 2016). To predict values at each target point t, we embed $x^{(t)} \mapsto r_x^{(t)}$ using the MLP used for $r_x^{(c)}$.
A deterministic target representation $r_{xy}^{(t)}$ is then computed by applying cross-attention (using the 8-headed attention described above) with keys $K := \{r_x^{(c)}\}_{c=1}^{C}$, values $V := \{r_{xy}^{(c)}\}_{c=1}^{C}$, and query $q := r_x^{(t)}$. For the latent path, we average over context representations $r_{xy}^{(c)}$, and pass the resulting representation through a single hidden layer MLP that outputs $(\mu_z, \sigma_z) \in \mathbb{R}^{256}$. $\sigma_z$ is made positive by post-processing it using Equation (F.1). We then sample (with reparameterisation (Kingma and Welling, 2013)) L latent representations $z_l \sim \mathcal{N}(z; \mu_z, \sigma_z^2)$.

We describe the remainder of the forward pass for a single $z_l$, though in practice multiple samples may be processed in parallel. The deterministic and latent representations of the context set are concatenated, and the resulting representation is passed through a linear layer $[r_{xy}^{(t)}; z_l] \to r_{xyz}^{(t)} \in \mathbb{R}^{128}$. Given the target and context-set representations, the predictive posterior is given by a Gaussian pdf with diagonal covariance parametrised by $(\mu^{(t)}, \sigma_{\text{pre}}^{(t)}) = \text{decoder}([r_x^{(t)}; r_{xyz}^{(t)}])$ where $\mu^{(t)}, \sigma_{\text{pre}}^{(t)} \in \mathbb{R}^3$ and decoder is a 4 hidden layer MLP. Finally, $\sigma_{\text{pre}}^{(t)}$ is processed by Equation (F.2) to give $\sigma^{(t)}$, and $\mu^{(t)}$ is processed by Equation (F.3). In the case of MNIST and ZSMM, $\sigma^{(t)}$ is also spatially mean pooled, which corresponds to using homoskedastic noise. This improves the qualitative performance by forcing ANP and ConvLNP to model the digit instead of focusing on predicting the black background with high confidence. Kim et al. (2018) did not suffer from that issue as they used a much larger lower bound for Equation (F.2).

ConvLNP details The core algorithm of on-the-grid ConvLNP is outlined in Figure 5.10 and Figure 5.2c. Here we discuss the parameterisations used for each step of the algorithm. All convolutional layers are depthwise separable (Chollet, 2017). $\text{conv}_\theta$ is a convolutional layer with kernel size of 11 (no bias). Following Gordon et al.
(2020), we enforce positivity on the weights in the first convolutional layer by only convolving their absolute value with the signal. The CNNs are ResNets (He et al., 2016) with 9 blocks, where each convolution has a kernel size of 3. Each residual block consists of two convolutional layers, pre-activation batch normalization layers (Ioffe and Szegedy, 2015), and ReLU activations. The output of the pre-latent CNN (CNN in Figure 5.2c) goes through a single hidden layer MLP that outputs $(\mu_z, \sigma_z) \in \mathbb{R}^{256}$. As with ANP, $f_{\sigma,z}$ is processed by Equation (F.1) and then used to sample (with reparameterisation (Kingma and Welling, 2013)) L latent functions $Z_l$. Importantly, we found that the coherence of samples improves if the model uses a global representation in addition to the pixel-dependent representation. We achieve this by mean-pooling half of the functional representation. Namely, we replace $z_l$ by the channel-wise concatenation of $z_l^{(1:64)}$ and $\text{mean}(z_l^{(65:128)})$, where the mean is taken over the spatial dimensions. This latent function then goes through the post-latent CNN (CNN in Figure 5.10), as well as a linear layer to output $(f_\mu, f_\sigma) \in \mathbb{R}^{256}$. As for the ANP, $f_\mu$ is processed by Equation (F.3) and $f_\sigma$ is re-scaled with Equation (F.2) and is spatially pooled in the case of MNIST and ZSMM to obtain homoskedastic noise.

F.2.4 Additional results on image completion

We provide additional qualitative samples and quantitative analyses for the ConvLNP and ANP.

Additional ConvLNP samples Figure F.1 provides further samples from a ConvLNP trained with $\hat{\mathcal{L}}_{\text{ML}}$. We observe that the ConvLNP produces reasonably diverse yet coherent samples when evaluated in a regime that resembles the training regime (in the first four sub-columns of MNIST, SVHN, and CelebA). However, Figure F.1 also demonstrates that the ConvLNP struggles with context sets that are significantly different from those seen during training.
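The global representation used in the ConvLNP details of Section F.2.3 (mean-pooling half of the channels of the latent function over the spatial dimensions) can be sketched as follows. This NumPy illustration assumes a channels-first layout and broadcasts the pooled channels back over the grid so that they can be concatenated with the pixel-dependent channels; the function name and the broadcasting step are assumptions, not details stated above.

```python
import numpy as np


def add_global_representation(z):
    """Replace the second half of the channels by their spatial mean.

    z has shape (channels, height, width) with an even number of channels,
    e.g. 128 channels split into two halves of 64.
    """
    half = z.shape[0] // 2
    local_part = z[:half]  # pixel-dependent representation
    pooled = z[half:].mean(axis=(1, 2), keepdims=True)  # one value per channel
    # Assumption: repeat the pooled values at every pixel location.
    global_part = np.broadcast_to(pooled, z[half:].shape)
    return np.concatenate([local_part, global_part], axis=0)


rng = np.random.default_rng(0)
latent_sample = rng.standard_normal((128, 32, 32))
mixed = add_global_representation(latent_sample)
```

The output has the same shape as the input, so the post-latent CNN can be applied unchanged; half of its input channels now carry information shared across the whole image.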
Further comparisons of ANP and ConvLNP We provide further qualitative comparisons of ConvLNPs, ANPs trained with $\hat{\mathcal{L}}_{\text{ML}}$, and ANPs trained with $\mathcal{L}_{\text{NPVI}}$. We omit ConvLNPs trained with $\mathcal{L}_{\text{NPVI}}$ as these are significantly outperformed by ConvLNPs trained with $\hat{\mathcal{L}}_{\text{ML}}$ (see e.g. Table 5.4). Figure F.2 shows that all models perform relatively well when context sets are drawn from a similar distribution as employed during training (first four sub-columns of MNIST, SVHN, and CelebA). Furthermore, we observe that samples from the ConvLNP prior tend to be closer to samples from the underlying data distribution (e.g. for CelebA).

Fig. F.1 Qualitative samples for the ConvLNP trained with $\hat{\mathcal{L}}_{\text{ML}}$ in Table 5.4. From top to bottom the four major rows correspond to the MNIST, ZSMM, SVHN, and CelebA32 datasets. For each dataset and each of the two major columns, a different image is randomly sampled; the first sub-row shows the given context points (missing pixels are in blue for MNIST and ZSMM but in black for SVHN and CelebA), while the next three sub-rows show the mean of the posterior predictive corresponding to different samples of the latent function. To show diverse samples we select three samples that maximize the average Euclidean distance between pixels of the samples. From left to right the first four sub-columns correspond to a context set with 0%, 1%, 3%, 10% randomly sampled context points. In the last two sub-columns, the context sets respectively contain all the pixels in the left and top half of the image.

Fig. F.2 Qualitative comparison of samples from (a) ConvLNP trained with $\hat{\mathcal{L}}_{\text{ML}}$; (b) ANP trained with $\hat{\mathcal{L}}_{\text{ML}}$; (c) ANP trained with $\mathcal{L}_{\text{NPVI}}$. For each model the layout is the same as in Figure F.1.

Table F.2 Coordinates for boxes defining the train and test regions.
| | Central (train) | Western (test) | Eastern (test) | Southern (test) |
|---|---|---|---|---|
| Latitudes | (52, 46) | (50, 46) | (52, 49) | (46, 42) |
| Longitudes | (08, 28) | (01, 08) | (28, 35) | (19, 26) |

The qualitative advantage of ConvLNP is most significant in settings that require translation equivariance for generalisation. Figure F.2 row 2 (ZSMM) clearly demonstrates that ConvLNP generalizes to larger canvas sizes and multiple digits, while ANP attempts to reconstruct a single digit regardless of the context set.

Finally, Figure F.3 provides the test log-likelihood distributions of ANP and ConvLNP as well as some qualitative comparisons between the two.

F.3 Experimental details on environmental data

F.3.1 Data details

ERA5-Land (Copernicus Climate Change Service, 2020) contains high resolution information on environmental variables at a 9 km spacing across the globe.¹ The data we use contains daily measurements of accumulated precipitation at 11pm and temperature at 11pm at every location, between 1981 and 2020, yielding a total of 14,304 temporal measurements across the spatial grid. In addition, we provide orography (elevation) values for each location. We normalize the data such that the precipitation values in the train set have zero mean and unit standard deviation. We consider the task of predicting daily precipitation y, with latitude and longitude as x. In addition, at each context and target location, we provide the model with access to side information in the form of orography (elevation) and temperature values. We also normalise the orography and temperature values to have zero mean and unit standard deviation.

We choose a large region of central Europe as our train set, and use regions East, West and South of the train set as held out test sets (see Figure F.4 and Table F.2). At train time, to sample a task, we first sample a random date between 1981 and 2020. We then sample a square subregion of grid values from within the train region (which has size 61 × 201).
We consider two models, one trained on 28 × 28 subregions, and another trained on 40 × 40 subregions. During training, each subregion is then split into context and target sets. Context points are randomly chosen with a keep rate $p_{\text{keep}}$ with $p_{\text{keep}} \sim U[0, 0.3]$. In this section, we train only on the $\hat{\mathcal{L}}_{\text{ML}}$ objective.

¹ URL: https://www.ecmwf.int/en/era5-land. Neither the European Commission nor ECMWF is responsible for any use that may be made of the Copernicus Information or data it contains.

Fig. F.3 Log-likelihood and qualitative samples comparing ConvLNP and ANP trained with $\hat{\mathcal{L}}_{\text{ML}}$ on (a) MNIST; (b) CelebA32; (c) Zero Shot Multi-MNIST (ZSMM); (d) SVHN. For each sub-figure, the top row shows the log-likelihood distribution for both models. The images below correspond to the context points (top), followed by three samples from ConvLNP (mean of the posterior predictive corresponding to different samples from the latent function), and three samples from ANP. Each column corresponds to a given percentile of the ConvLNP test log-likelihood (as shown by green arrows).

Fig. F.4 Training (blue) and test (red) regions in Europe, along with orography data from ERA5-Land.

F.3.2 Gaussian process baseline

We mean-centre the data for each task for the GP before training, and add the mean offset back for evaluation and sampling. We use an Automatic Relevance Determination (ARD) kernel, with separate factors for latitude/longitude, temperature and orography. In detail, let $x = (x_{\text{lat}}, x_{\text{lon}})$ denote position, let $\omega, t$ denote orography and temperature respectively, and let $r := (x, \omega, t)$. Then the kernel is given by

$$k(r, r') = \sigma_v^2\, k_l(x, x')\, k_\omega(\omega, \omega')\, k_t(t, t') + \sigma_n^2\, \delta(r, r').$$

Here each of $k_l$, $k_\omega$ and $k_t$ are Matérn–5/2 kernels with separate learnable lengthscales; $\delta(r, r') = 1$ if $r = r'$ and 0 otherwise; and $\sigma_v^2$, $\sigma_n^2$ are learnable signal and noise variances respectively.
We learn all hyperparameters by maximising the log-marginal likelihood using Scipy’s implementation of L-BFGS.

Transforming the data As the data is non-negative, we considered applying the transform $y \mapsto \log(\epsilon + y)$ for the GP to model. If $\epsilon = 0$, this would guarantee that the GP would only yield positive samples, which would be physically sensible as precipitation is non-negative. However, this cannot be done as precipitation often takes the value $y = 0$, which would lead to the transform being undefined. On the other hand, if $\epsilon > 0$, the GP samples after performing the inverse transform could still predict a precipitation value as low as $-\epsilon$, which is still unphysical. Further, a small value of $\epsilon$ leads to large distortion of the y values in transformed space. In the end, we run all experiments for the GP and NP without log-transforming the data; hence the models have to learn non-negativity.

F.3.3 ConvLNP architecture and training details

As the ERA5-Land dataset is regularly spaced, we use the on-the-grid version of the architecture, without the need for an RBF smoothing layer at the input. All experiments used a convolutional architecture with 3 residual blocks (He et al., 2016) for the encoder and 3 residual blocks for the decoder. Each residual block is defined with two layers of ReLU activations followed by convolutions, each with kernel size 5. The first convolution in each block is a standard convolution layer, whereas the second is depthwise separable (Chollet, 2017). All intermediate convolutional layers have 128 channels, and the latent function z has 16 channels. The networks were trained using Adam with a learning rate of $10^{-4}$. We estimated $\hat{\mathcal{L}}_{\text{ML}}$ using 16 to 32 samples at train time, with batches of 8 to 16 images.
We train the models for between 400 and 500 epochs, where each epoch is a single pass through every day in the training set; for each day, a random subregion of the full 61 × 201 central Europe region is cropped. We estimated the predictive density using 2500 samples of $z$ during test time.

F.3.4 Prediction and sampling

To create Table 5.5, at test time we sample 28 × 28 subregions from each of the train and test regions. This is done 1000 times. For the GP, we randomly restart optimisation 5 times per task and use the best hyperparameters found. In order to remove outliers where the GP has very poor likelihood, we set a log-likelihood threshold for the GP. If the GP has a log-likelihood of less than 0 nats on a particular task, then that task is removed from the evaluation. We find that to produce high quality samples, we need to train the model on subregions that are roughly as large as the lengthscale of the precipitation process. Hence we sample from the model trained on 40 × 40 subregions in Figure 5.13 in the main body. We show samples from the models trained on 28 × 28 subregions and on 40 × 40 subregions in Section F.3.6. We also compare to samples from GPs trained on each context set (no random restarts were used for sampling).

F.3.5 Bayesian optimization

We use the models described in Section F.3.3, trained on random 28 × 28 subregions of the train region, and compare to the GP baselines described in Section F.3.2. For the Bayesian optimization experiments in Figure 5.14 in the main body, we do not perform random restarts as this was too time-consuming. We carry out the Bayesian optimization (BayesOpt) experiments in each of the four regions: Central (train), West (test), East (test), and South (test). Each Bayesian optimization “episode” is defined by randomly sub-sampling a day (uniformly at random between 1981 and 2020), then sampling a sub-region from the tested region.
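The random cropping that underlies both the training epochs and the Bayesian optimization episodes, together with the training-time context selection from Section F.3.1, can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, crops are assumed to be drawn uniformly over all valid positions, and each grid point is assumed to be kept in the context set independently with probability $p_{\text{keep}}$.

```python
import random


def sample_subregion(region_height, region_width, size, rng):
    """Return the top-left corner of a random size x size crop.

    Assumption: the corner is uniform over all positions at which the
    crop lies fully inside the region.
    """
    top = rng.randrange(region_height - size + 1)
    left = rng.randrange(region_width - size + 1)
    return top, left


def sample_context_mask(size, rng, max_keep=0.3):
    """Draw p_keep ~ U[0, max_keep], then a boolean context mask.

    Assumption: each grid point is kept independently with
    probability p_keep.
    """
    p_keep = rng.uniform(0.0, max_keep)
    mask = [[rng.random() < p_keep for _ in range(size)] for _ in range(size)]
    return p_keep, mask


rng = random.Random(0)
top, left = sample_subregion(61, 201, 28, rng)  # 61 x 201 central Europe region
p_keep, mask = sample_context_mask(28, rng)
num_context = sum(sum(row) for row in mask)
print(top, left, round(p_keep, 3), num_context)
```

The context mask applies to training only; in the Bayesian optimization episodes the context set starts empty and grows by one queried point per iteration.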
To test the models’ spatial generalization capacity (where possible), we sub-sample episodes from each of the four regions with the following sizes:
• Central: 42 × 42
• West: 40 × 40
• East: 28 × 28
• South: 36 × 36

Episodes begin from empty context sets $D_c^{(0)} = \emptyset$, and models sequentially query locations for $t = 1, \ldots, 50$. Denoting by $(x^{(t)}, y^{(t)})$ the query location and queried value at iteration $t$, the context set is then updated as

$$D_c^{(t)} = D_c^{(t-1)} \cup \{(x^{(t)}, y^{(t)})\}.$$

Denoting by $\mathbf{y}$ the complete set of rainfall values in the sub-region, and by $\mathbf{y}_c^{(t)}$ the set of queried values in $D_c^{(t)}$, we can define the instantaneous regret as $r_t = \max(\mathbf{y}) - \max(\mathbf{y}_c^{(t)})$, and compute the average regret (plotted in Figure 5.14) at the $t$th iteration as

$$\bar{r}_t = \frac{1}{t} \sum_{i=1}^{t} r_i.$$

F.3.6 Additional figures for environmental data

Predictive density

Figure F.5 displays the predictive densities for precipitation at different locations, conditioned on a context set used for testing. The density of the ConvLNP is estimated using 2500 samples of $z$. To examine why the ConvLNP outperforms the GP in terms of log-likelihood, we plot cases where the ConvLNP likelihood is significantly better than the GP likelihood. We see that this is due to the GP occasionally making very overconfident predictions compared to the ConvLNP. We also see that the ConvLNP in a small proportion of cases exhibits very non-Gaussian, asymmetric predictive distributions.

Fig. F.5 Predictive density at two target points, (a) and (b), where the ConvLNP significantly outperforms the GP. The orange and blue circles show the likelihood for the ground truth target value under the GP and ConvLNP. Note that as the precipitation values are normalised to zero mean and unit standard deviation, $y_t = -0.53$ corresponds to no rain. In Figure F.5a, we see the ConvLNP sometimes produces predictions heavily centred on this value, showing it has learned the sparsity of precipitation values. In Figure F.5b we see the ConvLNP predictive distribution is sometimes asymmetric with a heavier positive tail, reflecting the non-negativity of precipitation.

Additional samples

In this section we show additional samples from the model trained on 28 × 28 images (Figures F.6 and F.7) and also on 40 × 40 images (Figures F.8 and F.9). Training on larger images reduces the occurrence of blocky artefacts. The model used for Figure 5.13 in the main body was trained on 40 × 40 images. Note that samples shown here are 61 × 201, i.e. the size of the entire central Europe train region.

(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3 (e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.6 Samples from the predictive processes overlaid on central Europe, for a model trained on random 28 × 28 subregions of the full 61 × 201 central Europe region. Note some blocky artefacts in the ConvNP samples due to training on small subregions. Here the GP has overfit to the orography data, with samples that resemble the orography rather than precipitation.

(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3 (e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.7 Samples from the predictive processes overlaid on central Europe, for a model trained on random 28 × 28 subregions of the full 61 × 201 central Europe region. Here the GP has learned a lengthscale that is too large.

(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3 (e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.8 Samples from the predictive processes overlaid on central Europe, for a model trained on random 40 × 40 subregions of the full 61 × 201 central Europe region.
Here the GP has overfit to the orography data, with samples that resemble the orography rather than precipitation.

(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3 (e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.9 Samples from the predictive processes overlaid on central Europe, for a model trained on random 40 × 40 subregions of the full 61 × 201 central Europe region. The GP has again overfit to the orography data.

(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3 (e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.10 Samples from the predictive processes overlaid on central Europe, for a model trained on random 40 × 40 subregions of the full 61 × 201 central Europe region.