Approximate Inference in Bayesian
Neural Networks and Translation
Equivariant Neural Processes
Yue Kwang Foong
Department of Engineering
University of Cambridge
This dissertation is submitted for the degree of
Doctor of Philosophy
Trinity Hall September 2022

Declaration
This thesis is the result of my own work and includes nothing which is the outcome of
work done in collaboration except as declared in the Preface and specified in the text.
I further state that no substantial part of my thesis has already been submitted, or, is
being concurrently submitted for any such degree, diploma or other qualification at
the University of Cambridge or any other University or similar institution except as
declared in the Preface and specified in the text. This dissertation contains fewer than
65,000 words including appendices, bibliography, footnotes, tables and equations and
has fewer than 150 figures.
Yue Kwang Foong
September 2022

Abstract
Approximate Inference in Bayesian Neural Networks and Translation
Equivariant Neural Processes
Yue Kwang Foong
It has been a longstanding goal in machine learning to develop flexible prediction
methods that ‘know what they don’t know’ — when faced with an out-of-distribution
input, these models should signal their uncertainty rather than be confidently wrong.
This thesis is concerned with two such probabilistic machine learning models: Bayesian
neural networks and neural processes. Bayesian neural networks are a classical model
that has been the subject of research since the 1990s. They rely on Bayesian inference
to represent uncertainty in the weights of a neural network. On the other hand, neural
processes are a recently introduced model that relies on meta-learning rather than
Bayesian inference to obtain uncertainty estimates.
This thesis provides contributions to both of these research areas. For Bayesian
neural networks, we provide a theoretical and empirical study of the quality of common
variational methods in approximating the Bayesian predictive distribution. We
show that for single-hidden layer networks with ReLU activation functions, there
are fundamental limitations concerning the representation of in-between uncertainty:
increased uncertainty in between well-separated regions of low uncertainty. We show
that this theoretical limitation does not apply to deeper networks. However, in practice,
in-between uncertainty is a feature of the exact predictive distribution that is still often
lost by approximate inference, even with deep networks.
In the second part of this thesis, we focus on neural processes. In contrast to
Bayesian neural networks, neural processes do not rely on approximate inference.
Instead, they use neural networks to directly parameterise the map from a dataset to
the posterior predictive stochastic process conditioned on that dataset. In this thesis
we introduce the convolutional neural process, a new kind of neural process architecture
which incorporates translation equivariance into its predictions. We show that when
this symmetry is an appropriate assumption, convolutional neural processes outperform
their standard multilayer perceptron-based and attentive counterparts on a variety of
regression benchmarks.

Acknowledgements
My thanks go first and foremost to my supervisor, Richard E. Turner. I could not
have asked for a better supervisor. From day one of the PhD he has been supportive
and insightful, giving me the freedom to pursue topics of my interest, while being
available whenever I needed help. Rich took a chance by taking on a student who
didn’t have prior research experience in machine learning, and for this I’ll always be
grateful.
I also have the pleasure of thanking my supervisors during my time at two very
enjoyable internships. Sebastian Nowozin, who supervised me at Microsoft Research,
was a pleasure to work with. I am particularly grateful for his mentorship during my
first taste of industry. Working with Michalis Titsias, my supervisor at DeepMind, was
also an enormous privilege. His knowledge of his field is unrivalled, and I was struck
by his patience and helpfulness during meetings.
This thesis would not have been possible without my collaborators, from whom I
have learned so much. It is a pleasure in particular to thank David R. Burt, Jonathan
Gordon and Wessel P. Bruinsma, all of whom I spent countless hours discussing ideas,
debugging code and proving theorems with. I’d also like to thank my co-authors
Yingzhen Li, José Miguel Hernández-Lobato, Yann Dubois, James Requeima, Marcin
Tomczak, Siddharth Swaroop, Tim Pearce, and everyone at the Computational and
Biological Learning Lab. Ross Clarke gave me hours of invaluable help with computing
when I was getting started, and Sebastian Ober has been an excellent office mate. It’s
been a privilege to work with all of them.
I’d also like to thank my 4th year project supervisor Ramji Venkataramanan,
without whom I would not have applied for the Trinity Hall Research Studentship,
which, along with the George and Lilian Schiff Foundation, generously funded my PhD.
I would not have been able to finish this PhD without the kindness of my many
dear friends at St Andrew the Great church. Throughout it all, I have relied on the
mercy and grace of a generous God, from whom comes all knowledge. Finally, this
thesis is gratefully dedicated to my parents. Their love and support have made all of
this possible.

Table of contents
List of figures
List of tables
1 Introduction
1.1 Overview of thesis and main contributions
1.2 List of publications
2 Bayesian neural networks
2.1 Standard neural network training
2.1.1 Multilayer perceptrons
2.1.2 Probabilistic modelling with MLPs
2.1.3 Maximum likelihood estimation
2.1.4 Maximum a posteriori estimation
2.2 Bayesian neural networks and uncertainty in deep learning
2.2.1 Epistemic and aleatoric uncertainty
2.2.2 Bayesian inference for neural networks
2.2.3 Specifying the prior
2.2.4 Applications of Bayesian neural network uncertainty
2.3 Approximate inference in Bayesian neural networks
2.3.1 Sampling methods
2.3.2 Approximating family methods
2.3.3 Choosing and evaluating approximating family methods
2.4 History of approximating families in Bayesian neural networks
2.5 Conclusion
3 The expressiveness of approximate inference in Bayesian neural networks
3.1 Criteria for successful approximation
3.2 Priors and references for the exact predictive
3.3 Single-hidden layer neural networks
3.3.1 Numerical verification of theorems
3.3.2 In-between uncertainty in other regions of input space
3.3.3 Intuition for results
3.3.4 Empirical tests of approximate inference in single-hidden layer BNNs
3.4 Deeper networks
3.4.1 Proof sketch of Theorem 4
3.4.2 Empirical tests of approximate inference in deep BNNs
3.4.3 Initialising a BNN with in-between uncertainty
3.5 Case study: active learning with BNNs
3.5.1 Experimental set-up and results
3.5.2 Discussion
3.6 Related work
3.6.1 Discussion of Farquhar et al. (2020)
3.6.2 Pathologies of the optimal mean-field posterior in wide BNNs
3.6.3 The cold posterior effect and prior selection
3.6.4 Properties of MC dropout posteriors
3.7 Conclusions
4 Neural processes
4.1 Introduction
4.1.1 Meta-learning
4.1.2 Stochastic process prediction
4.1.3 Stochastic process consistency
4.1.4 The prediction map
4.2 Neural process architectural framework
4.2.1 Kolmogorov consistency of CNPs and LNPs
4.2.2 MLP-conditional neural processes
4.2.3 MLP-latent neural processes
4.2.4 Attentive neural processes
4.3 Deep sets
4.4 Training neural processes
4.4.1 Log-likelihood
4.4.2 Neural process variational inference
4.4.3 Approximate log-likelihood
4.4.4 Approximate maximum-likelihood vs variational lower bound maximisation for training NPs
4.5 Summary and conclusions
5 Convolutional neural processes
5.1 Introduction
5.1.1 Translation equivariance and stationarity
5.2 Convolutional deep sets
5.2.1 Representing translation equivariant functions on sets
5.3 Convolutional conditional neural processes
5.3.1 ConvCNPs for off-the-grid data
5.3.2 ConvCNPs for on-the-grid data
5.4 ConvCNP experimental results
5.4.1 Synthetic 1D experiments
5.4.2 2D image completion experiments
5.4.3 Limitations of factorised predictive distributions
5.5 Convolutional latent neural processes
5.6 ConvLNP experimental results
5.6.1 1D regression
5.6.2 Image completion
5.6.3 Environmental data
5.7 Summary and conclusions
6 Conclusions and discussion
6.1 Summary of contributions
6.1.1 Approximate inference in Bayesian neural networks
6.1.2 Convolutional neural processes
6.2 BNNs and NPs compared
6.3 Continued work and future research directions
6.3.1 Approximate inference in Bayesian neural networks
6.3.2 Convolutional neural processes
References
Appendix A Proofs of results on single-hidden layer BNNs
A.1 General theorem statements
A.2 Statements of lemmas
A.3 Proofs of lemmas
A.3.1 Proof of lemma 3
A.3.2 Proof of lemma 4
A.3.3 Proof of lemma 5
A.3.4 Proof of lemma 6
A.4 Proofs of theorems
Appendix B Bayesian neural network experimental details
Appendix C Proofs of results on deep BNNs
C.1 Proof of Theorem 4
C.1.1 Proof of Theorem 4 for QFFG
C.1.2 Proof of Theorem 10 for MCDO
C.2 Counterexample when inputs are dropped out
Appendix D ConvCNP experimental details
D.1 Baseline neural process models
D.2 1-dimensional experiments
D.2.1 CNN architectures
D.2.2 Synthetic data
D.3 Image experimental details and additional results
D.3.1 Experimental details
D.3.2 ACNP and ConvCNP qualitative comparison
Appendix E Effect of number of samples used on evaluation of latent neural processes
Appendix F ConvLNP experimental details
F.1 Experimental details on 1D regression
F.2 Experimental details on image completion
F.2.1 Data details
F.2.2 Training details
F.2.3 General architecture details
F.2.4 Additional results on image completion
F.3 Experimental details on environmental data
F.3.1 Data details
F.3.2 Gaussian process baseline
F.3.3 ConvLNP architecture and training details
F.3.4 Prediction and sampling
F.3.5 Bayesian optimization
F.3.6 Additional figures for environmental data

List of figures
3.1 Regions with restricted in-between uncertainty with MFVI.
3.2 Regions with restricted in-between uncertainty with MC dropout.
3.3 Restrictiveness of predictive variance of shallow MFVI and MCDO BNNs.
3.4 Predictive distributions of BNNs on randomly generated data.
3.5 Contribution of a single neuron to the predictive variance.
3.6 BNN regression on a 2D synthetic dataset.
3.7 Expressiveness of deep BNN predictive variance function.
3.8 Schematic of construction used to prove expressiveness of deep BNNs.
3.9 Overconfidence of BNNs relative to GP.
3.10 Overconfidence of BNNs relative to GP with $\sigma_w = \sqrt{2}$.
3.11 Overconfidence of BNN relative to GP plotted over training data.
3.12 Deep BNN predictive distributions on randomly generated data.
3.13 Predictive distributions of BNNs initialised by matching the limiting GP.
3.14 Points chosen during active learning with shallow BNNs.
3.15 Points chosen during active learning with deeper BNNs.
3.16 Predictive uncertainties before and after active learning.
4.1 Graphical model of a conditional neural process.
4.2 Graphical model of a latent neural process.
5.1 Schematic illustration of translation equivariance.
5.2 Illustration of ConvCNP forward pass.
5.3 Predictive distributions of the ACNP and ConvCNP.
5.4 Qualitative evaluation of ConvCNP.
5.5 Samples from the zero-shot multi MNIST dataset.
5.6 Zero-shot generalisation of the ConvCNP.
5.7 ConvLNP encoder-decoder architecture.
5.8 Illustration of ConvLNP forward pass.
5.9 Algorithm for off-the-grid ConvLNP forward pass.
5.10 Algorithm for on-the-grid ConvLNP forward pass.
5.11 Predictive distributions of ConvLNPs and ANPs.
5.12 Predictive samples for MNIST and zero-shot multi MNIST.
5.13 Predictive samples of precipitation overlaid on Europe.
5.14 Results of Bayesian optimisation experiment.
D.1 Predictive distribution of ConvCNP, ACNP and CNP for EQ kernel.
D.2 Predictive distribution of ConvCNP, ACNP and CNP for Matérn kernel.
D.3 Predictive distribution of ConvCNP, ACNP and CNP for sawtooth function.
D.4 ACNP and ConvCNP predictions on MNIST, SVHN and CelebA.
D.5 ConvCNP and ACNP applied to Ellen’s Oscar selfie.
E.1 Log-likelihood bounds as a function of number of samples.
E.2 Effect of number of samples used during training.
F.1 Samples from ConvLNP trained on MNIST, ZSMM, SVHN and CelebA.
F.2 Image completion samples for ConvLNP and ANP.
F.3 Log-likelihood and image completion samples for ConvLNP and ANP.
F.4 Train and test regions in Europe.
F.5 Predictive density of ConvLNP and GP.
F.6 Samples from models trained on precipitation in Europe.
F.7 Samples from models trained on precipitation in Europe.
F.8 Samples from models trained on precipitation in Europe.
F.9 Samples from models trained on precipitation in Europe.
F.10 Samples from models trained on precipitation in Europe.

List of tables
3.1 Results of active learning experiment.
5.1 Log-likelihood from synthetic 1-dimensional experiments.
5.2 Log-likelihood from image experiments.
5.3 Log-likelihood from synthetic 1-dimensional experiments with latent variable models.
5.4 Log-likelihoods from image completion with latent variable models.
5.5 Log-likelihoods and RMSEs on ERA5-Land dataset.
D.1 CNN architecture for the image experiments.
F.1 Parameter counts of models in 1D regression.
F.2 Coordinates for boxes defining the train and test regions.

Chapter 1
Introduction
In this thesis, we consider two machine learning methods for performing regression with
uncertainty estimates: Bayesian neural networks (BNNs; MacKay, 1992b; Neal, 1995)
and neural processes (NPs; Garnelo et al., 2018a,b). Bayesian neural networks are a
classical model, first proposed in the 1990s, that marries the principled mathematical
framework of Bayesian inference with the flexibility of neural networks. However, ever
since their inception, they have been plagued with issues surrounding approximate
inference. The first part of this thesis studies, both theoretically and empirically,
the consequences of approximate inference in Bayesian neural networks when using
mean-field variational inference (Blundell et al., 2015; Graves, 2011) and Monte Carlo
dropout (Gal and Ghahramani, 2016) as approximate inference techniques.
In the second half of the thesis, we turn to neural processes. Neural processes
are a recently introduced machine learning model that uses deep learning to model
predictive stochastic processes. In contrast to BNNs, NPs rely on meta-learning
(Schmidhuber, 1987) to directly learn the appropriate amount of uncertainty in their
predictions. Neural processes come in a variety of flavours. Our main contribution
in the second half of this thesis will be to motivate and propose a new member of
the neural process family: convolutional neural processes (ConvNPs). Unlike previous
NP architectures, ConvNPs leverage the fact that if the data-generating stochastic
process is stationary, then the corresponding map from observed datasets to predictive
distributions is translation equivariant. ConvNPs use convolutional neural networks to
bake this symmetry directly into the architecture.
1.1 Overview of thesis and main contributions
The research in this thesis appears in several publications written during the course of
my PhD studies. The work on Bayesian neural networks was published in Foong et al.
(2020b), and the work on convolutional neural processes was published in Foong et al.
(2020a); Gordon et al. (2020). We now provide an overview of the rest of the thesis,
and highlight our contributions in each chapter.
Chapter 2 presents an introduction to Bayesian neural networks. This chapter
does not include novel research, but instead provides background on the subject of
approximate inference in BNNs. We begin in Section 2.1 by describing standard neural
network training as maximum a posteriori inference. In Section 2.2 we then show
how this naturally motivates a fully Bayesian treatment of the network weights. The
remainder of the chapter describes various proposed methods for approximate inference,
with a particular focus on approximating family methods, which will be a major focus
of this thesis. Chapter 2 concludes with a brief history of approximating families in
Bayesian neural networks.
In Chapter 3 we present our theoretical and empirical studies of approximate
inference in Bayesian neural networks with mean-field variational inference and Monte
Carlo dropout. Our central theoretical findings can be split into a negative result
concerning single-hidden layer BNNs with ReLU activations, and a positive result
concerning deeper ReLU BNNs. For single-hidden layer networks, we show that there
are simple situations where no setting of the variational parameters can represent
in-between uncertainty: increased uncertainty in between well-separated clusters of
low uncertainty. In contrast, for deeper and sufficiently wide networks, we show that
there exist variational parameters that can approximate any predictive mean and
variance function. However, we also show empirically that the appropriate parameters
to approximate the exact predictive distribution well in function space are often not
found when maximising the ELBO. The theorems and experiments in this chapter
were developed with my co-author David R. Burt, with Yingzhen Li and Richard
E. Turner supervising throughout. The material in this chapter is published in Foong
et al. (2020b).
In Chapter 4 we turn to the second main topic of this thesis, neural processes. This
chapter presents our perspective of neural processes as learning to approximate the
prediction map: i.e., the map that takes an observed dataset to the exact predictive
stochastic process conditioned on that dataset. We introduce various kinds of existing
NPs and describe the meta-learning procedure used to train them. The exposition in
this chapter is based on an online Jupyter-book tutorial on neural processes (Dubois
et al., 2020), which I co-wrote with Yann Dubois and Jonathan Gordon.
In Chapter 5 we present our proposed convolutional neural process, a new addition
to the neural process family of models. In Section 5.2 we present the theoretical
foundations of convolutional neural processes in the form of a representation theorem
that incorporates translation equivariance into the standard Deep Sets representation
theorem (Zaheer et al., 2017). In Section 5.3 we introduce the convolutional conditional
neural process, which parameterises a translation-equivariant map from datasets to
predictive stochastic processes. However, since it only outputs a predictive mean
and variance function, its predictions are necessarily factorised. We address this
in Section 5.5 by introducing a latent variable, leading to the convolutional latent
neural process. We show that both models outperform their standard MLP-based and
attentive counterparts on a variety of regression tasks. The research in this chapter
was conducted in collaboration with Wessel P. Bruinsma, Jonathan Gordon, Yann
Dubois and James Requeima, and was supervised by Richard E. Turner throughout.
It is published in Gordon et al. (2020) and Foong et al. (2020a). The research in these
publications also appears in the PhD theses of my collaborators Jonathan Gordon
(Gordon, 2021) and Wessel P. Bruinsma (forthcoming), both submitted to the University
of Cambridge.
1.2 List of publications
The following is a list of publications I co-authored throughout the course of the PhD,
regardless of whether the research appears in this thesis.
Peer-reviewed conference proceedings
Jonathan Gordon, Wessel P. Bruinsma, Andrew Y. K. Foong, James Requeima, Yann
Dubois, and Richard E. Turner (2020). ‘Convolutional Conditional Neural Processes’.
In: International Conference on Learning Representations.
Andrew Y. K. Foong, David R. Burt, Yingzhen Li, and Richard E. Turner (2020). ‘On
the Expressiveness of Approximate Inference in Bayesian Neural Networks’. In:
Advances in Neural Information Processing Systems 33.
Andrew Y. K. Foong, Wessel P. Bruinsma, Jonathan Gordon, Yann Dubois, James
Requeima, and Richard E. Turner (2020). ‘Meta-Learning Stationary Stochastic
Process Prediction with Convolutional Neural Processes’. In: Advances in Neural
Information Processing Systems 33.
Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt, and Richard E. Turner
(2021). ‘How Tight Can PAC-Bayes be in the Small Data Regime?’. In: Advances
in Neural Information Processing Systems 34.
Marcin Tomczak, Siddharth Swaroop, Andrew Y. K. Foong, and Richard E. Turner
(2021). ‘Collapsed Variational Bounds for Bayesian Neural Networks’. In:
Advances in Neural Information Processing Systems 34.
Peer-reviewed workshop proceedings
Andrew Y. K. Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard
E. Turner (2019). ‘“In-Between” Uncertainty in Bayesian Neural Networks’. In:
Uncertainty in Deep Learning Workshop, ICML 2019.
Andrew Y. K. Foong, David R. Burt, Yingzhen Li, and Richard E. Turner (2019).
‘Pathologies of Factorised Gaussian and MC Dropout Posteriors in Bayesian Neural
Networks’. In: Bayesian Deep Learning Workshop, NeurIPS 2019.
Tim Pearce, Andrew Y. K. Foong, and Alexandra Brintrup (2020). ‘Structured Weight
Priors for Convolutional Neural Networks’. In: Uncertainty in Deep Learning
Workshop, ICML 2020.
Wessel P. Bruinsma, James Requeima, Andrew Y. K. Foong, Jonathan Gordon, and
Richard E. Turner (2020). ‘The Gaussian Neural Process’. In: Proceedings of the
3rd Symposium on Advances in Approximate Bayesian Inference.
Andrew Gordon Wilson, Pavel Izmailov, Matthew D. Hoffman, Yarin Gal, Yingzhen
Li, Melanie F. Pradier, Sharad Vikram, Andrew Y. K. Foong, Sanae Lotfi, and
Sebastian Farquhar (2021). ‘Evaluating Approximate Inference in Bayesian Deep
Learning’. In: NeurIPS 2021 Competitions and Demonstrations Track.
Unreviewed preprints and other
Yann Dubois, Jonathan Gordon, and Andrew Y. K. Foong (2020). ‘The Neural Process
Family’. Jupyter book tutorial on neural processes.
https://yanndubs.github.io/Neural-Process-Family.
Andrew Y. K. Foong, Wessel P. Bruinsma, and David R. Burt (2022). ‘A Note on the
Chernoff Bound for Random Variables in the Unit Interval’. In: arXiv:2205.07880.
Electronic print: https://arxiv.org/abs/2205.07880.

Chapter 2
Bayesian neural networks
In this chapter we introduce Bayesian neural networks. We begin in Section 2.1 by
describing standard neural network training as maximum likelihood and maximum a
posteriori estimation. In Section 2.2 we describe how the need to represent uncertainty in
the parameters of the network leads us to fully Bayesian neural networks. In Section 2.3
we consider one of the main challenges with this approach: the intractability of exact
Bayesian inference and the need to rely on approximate inference algorithms, which we
categorise into sampling methods and approximating family methods. Understanding
the consequences of approximating family methods will be a key focus of this thesis.
Finally, in Section 2.4 we give a brief history of various approximating families in
Bayesian neural networks.
2.1 Standard neural network training
There are a variety of deep neural network architectures, including multilayer perceptrons
(MLPs), convolutional neural networks (CNNs), recurrent neural networks and
transformers. In this chapter we will focus largely on MLPs, as they are the simplest
network architecture and form a building block for many others. However, much of this
discussion applies with appropriate modifications to other deep learning architectures
as well.
2.1.1 Multilayer perceptrons
An MLP is a neural network formed by stacking a series of affine transformations
interleaved with element-wise nonlinearities. Formally, an MLP is a parameterised
function $f : \mathbb{R}^D \to \mathbb{R}^K$, defined as follows: Let $x \in \mathbb{R}^D$ be the input to the network.
Let $(W^{(l)})_{l=0}^{L}$ and $(b^{(l)})_{l=0}^{L}$ be a collection of $L + 1$ weight matrices and bias vectors
respectively, which together represent the learnable parameters of the MLP, collectively
denoted as $\theta$. Then the output of the MLP, $f(x) \in \mathbb{R}^K$ is defined by:
$$h^{(0)}(x) := x, \tag{2.1}$$
$$h^{(l+1)}(x) := \phi\left(W^{(l)} h^{(l)}(x) + b^{(l)}\right) \quad \text{for } 0 \le l \le L - 1, \tag{2.2}$$
$$f(x) := W^{(L)} h^{(L)}(x) + b^{(L)}. \tag{2.3}$$
Here ϕ is the nonlinearity or activation function, and is applied elementwise. Common
choices for ϕ include the rectified linear unit (ReLU), ϕ(a) = max(0, a), and the
hyperbolic tangent function. L is known as the number of hidden layers in the network,
and the length of the vector h(l)(x) is referred to as the number of hidden units, or
neurons, in the lth hidden layer. The term ‘deep learning’ refers to the use of neural
networks with many hidden layers.
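As a minimal sketch of Equations (2.1) to (2.3), the forward pass can be written in a few lines of plain Python. The function names (`relu`, `affine`, `mlp_forward`, `init_params`), the Gaussian initialisation and the layer sizes are illustrative assumptions rather than anything specified in the text; a practical implementation would normally use an array library.

```python
import random


def relu(a):
    """Rectified linear unit, phi(a) = max(0, a)."""
    return max(0.0, a)


def affine(W, b, h):
    """Compute W h + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]


def mlp_forward(x, weights, biases, phi=relu):
    """Evaluate f(x) following Equations (2.1)-(2.3).

    weights and biases each hold L + 1 entries, indexed l = 0, ..., L.
    """
    h = list(x)                                   # (2.1): h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):   # (2.2): l = 0, ..., L - 1
        h = [phi(a) for a in affine(W, b, h)]
    return affine(weights[-1], biases[-1], h)     # (2.3): final affine map


def init_params(sizes, seed=0):
    """Hypothetical helper: standard Gaussian weights and zero biases."""
    rng = random.Random(seed)
    weights = [[[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [[0.0] * n_out for n_out in sizes[1:]]
    return weights, biases


# D = 3 inputs, L = 2 hidden layers of 4 units each, K = 2 outputs.
weights, biases = init_params([3, 4, 4, 2])
output = mlp_forward([0.5, -1.0, 2.0], weights, biases)
```

Here `weights` contains L + 1 = 3 matrices, matching the indexing l = 0, ..., L used above, and `output` is a vector of length K = 2.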
2.1.2 Probabilistic modelling with MLPs
MLPs may be viewed simply as flexible function approximators, without being given a
probabilistic interpretation. This is, for example, the view taken in frequentist statistical
learning theory, where the goal of a learning algorithm is to choose a hypothesis from
a large class of functions, and the performance of the hypothesis is measured by the
expectation of some loss function for that hypothesis over the true data-generating
distribution (Shalev-Shwartz and Ben-David, 2014). This loss function can be chosen
to reflect desiderata about the task that the MLP is used for, and does not necessarily
have to relate to any probabilistic model.
Alternatively, it is possible to interpret an MLP as defining a probabilistic model
which explicitly encodes uncertainty in its predictions using a probability distribution.
This will be the key to the Bayesian approach to neural network training, and it is this
viewpoint that we will focus on in this thesis.
Consider supervised learning. For an input x with supervised label y, we interpret
the MLP output f(x) as the parameters of a likelihood function for y. For example, in
univariate regression, we seek to predict a value y ∈ R from an input vector x. One
way to do this is to set the output dimensionality of the MLP to K = 1 and interpret
the output of the network1 f(x; θ) as the mean of a Gaussian distribution over y:
\[
p(y \mid x, \theta) := \mathcal{N}(y; f(x; \theta), \sigma^2), \tag{2.4}
\]
where $\sigma^2$ is some constant. This is known as homoscedastic regression. Alternatively,
the likelihood variance could be an output of the network itself, in which case we could
take K = 2 and define
\[
p(y \mid x, \theta) := \mathcal{N}\left(y; f_1(x; \theta), \exp(f_2(x; \theta))\right), \tag{2.5}
\]
where the exponentiation guarantees that the noise variance is positive. Here f(x; θ) is
a two-dimensional vector, with f1(x; θ) and f2(x; θ) being the first and second entry
respectively. This allows the network to model different values of the observation noise
in different regions of input space, and is known as heteroscedastic regression.
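The two regression likelihoods can be sketched in NumPy as follows. This is an illustrative sketch only: the function names are assumptions, and the network outputs are passed in as plain numbers rather than computed by an MLP.

```python
import numpy as np

def gaussian_log_density(y, mean, var):
    # log N(y; mean, var) for scalar or array arguments.
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

def homoscedastic_log_lik(y, f_x, sigma2):
    # Equation (2.4): the network output f_x is the mean, sigma2 is a fixed constant.
    return gaussian_log_density(y, f_x, sigma2)

def heteroscedastic_log_lik(y, f_x):
    # Equation (2.5): f_x = (f_1, f_2), with mean f_1 and variance exp(f_2) > 0.
    return gaussian_log_density(y, f_x[0], np.exp(f_x[1]))
```

Because the second output enters only through the exponential, any real-valued network output yields a valid (positive) variance.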
As another example, consider C-way classification. Now the likelihood function
must be defined using a categorical distribution over C discrete classes. We can
parameterise this likelihood by setting the output dimensionality of the MLP such that
$f(x) \in \mathbb{R}^C$, and using the softmax function:
\[
\Pr(\text{class label} = c \mid x; \theta) := \frac{\exp(f_c(x; \theta))}{\sum_{c'=1}^{C} \exp(f_{c'}(x; \theta))}, \quad \text{for } 1 \le c \le C. \tag{2.6}
\]
In this context the MLP outputs fc are known as logits. The softmax function transforms
these logits so that they form a valid normalised probability distribution over the class
labels.
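A minimal NumPy sketch of Equation (2.6) is given below. Subtracting the maximum logit before exponentiating is a standard numerical safeguard that is assumed here; it is not part of the definition, and it leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(logits):
    # Shift by the maximum logit to avoid overflow in exp; the output is unchanged.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# Hypothetical logits for a 3-way classification problem.
probs = softmax(np.array([2.0, 1.0, 0.1]))
```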
As part of the probabilistic modelling interpretation, we interpret p(y|x; θ) as the
probability or degree of belief that the model with parameters θ assigns to the event
that the label takes the value y given the input x. Hence when setting up these models
we make the implicit assumption that there exists some value of the parameters θ
such that p(y|x; θ) accurately represents our beliefs about the relationship between
x and y. For simple models such as linear models, this is unlikely to be the case for
complex tasks. However, as MLPs are extremely flexible function approximators (and
are universal given sufficient width (Hornik, 1991)), it stands to reason that for a large
enough MLP, there will exist some setting of θ for which this is a good assumption.
We next focus on how to learn θ from data.
1Here we make the dependence of the outputs on the parameters θ explicit.
2.1.3 Maximum likelihood estimation
In order to set the model parameters θ, standard neural network training proceeds by
defining an objective function and attempting to find parameters that maximise that
function. A common way of defining the objective function is to take the logarithm
of the likelihood function defined by the network. For example, consider a dataset
$\mathcal{D} = ((x_n, y_n))_{n=1}^{N}$ on which we apply homoscedastic regression.2 The log-likelihood function
defined by the network is
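\begin{align}
\log p((y_n)_{n=1}^{N} \mid (x_n)_{n=1}^{N}, \theta) &= \log \prod_{n=1}^{N} p(y_n \mid x_n, \theta) \tag{2.7} \\
&= \sum_{n=1}^{N} \log \mathcal{N}(y_n; f(x_n; \theta), \sigma^2) \tag{2.8} \\
&= -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2. \tag{2.9}
\end{align}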
Equivalently, maximum likelihood estimation involves minimising the following loss:
\[
\mathcal{L}(\theta) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2. \tag{2.10}
\]
We see that for fixed σ2, up to a constant, maximising the likelihood as a function
of θ is equivalent to minimising the squared error between the network outputs and
the observed targets. A similar derivation can be performed to obtain an objective
function for classification, which leads to the widely used cross-entropy loss. Once the
objective function is defined, a gradient based optimisation algorithm such as stochastic
gradient descent (SGD) or ADAM (Kingma and Ba, 2014) can be used to optimise
θ. The gradients can be computed efficiently using the backpropagation algorithm
(Rumelhart et al., 1988), and are easily obtained using modern implementations of
automatic differentiation (Abadi et al., 2016; Frostig et al., 2018; Paszke et al., 2017).
This method for optimising θ is known as maximum likelihood estimation.
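The relationship between the loss in Equation (2.10) and the squared error can be checked numerically. The sketch below uses made-up targets and two made-up sets of network outputs; only the formula for the loss is taken from the text.

```python
import numpy as np

def nll_loss(y, f, sigma2):
    # Equation (2.10): negative Gaussian log-likelihood summed over the dataset.
    n = y.shape[0]
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum((y - f) ** 2) / (2.0 * sigma2)

def sum_squared_error(y, f):
    return np.sum((y - f) ** 2)

# Hypothetical targets and two candidate sets of network outputs.
y = np.array([0.1, -0.4, 1.3])
f_a = np.array([0.0, 0.0, 1.0])
f_b = np.array([0.5, -1.0, 2.0])
sigma2 = 0.25

# The log(2 pi sigma2) term cancels, so the gap in loss is the gap in squared
# error scaled by 1 / (2 sigma2), and both criteria rank the candidates identically.
loss_gap = nll_loss(y, f_a, sigma2) - nll_loss(y, f_b, sigma2)
sse_gap = sum_squared_error(y, f_a) - sum_squared_error(y, f_b)
```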
Unfortunately, neural networks trained via maximum likelihood can be prone to
overfitting — the network obtains increasingly good performance on the training set,
but begins to deteriorate in its predictive performance on unseen test data. Many
methods have been proposed to prevent overfitting, including early stopping, where
the optimisation of the weights is halted before the loss function reaches a minimum,
2Here and throughout this thesis the notation $(a_n)_{n=1}^{N}$ refers to the sequence $(a_1, \ldots, a_N)$.
and also limiting the complexity of the network, by reducing its depth or the number
of hidden units (MacKay, 2003, Chapter 39.4). More recently, it has been argued that
in the overparameterised regime where networks have the capacity to memorise the
training set, overfitting can actually be alleviated by further increasing the capacity of
the network (Nakkiran et al., 2021). In the next section we will focus on the classical
technique of weight regularisation as a means of controlling overfitting, as it leads
naturally to a probabilistic modelling viewpoint of the weights in neural network
learning (MacKay, 2003, Chapter 41).
2.1.4 Maximum a posteriori estimation
We introduce a probabilistic modelling perspective on θ by first describing standard
weight regularisation practice. The most common form of weight regularisation is ℓ2
regularisation (Goodfellow et al., 2016, Chapter 7), where the loss function in
Equation (2.10) is modified by adding a term proportional to the squared ℓ2 norm of
the parameters θ:
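\[
\mathcal{L}(\theta) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2 + \underbrace{\frac{\alpha}{2} \|\theta\|_2^2}_{\text{regulariser}}. \tag{2.11}
\]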
Here α is a non-negative hyperparameter that controls the strength of the regularisation.
By comparing this loss function with Equation (2.10), we see that when α → 0
we recover the original maximum likelihood loss function, and when α → ∞ the
optimisation algorithm will ignore the data and focus on minimising $\|\theta\|_2^2$.
Introducing ℓ2 regularisation biases the learning process towards networks that
have smaller magnitudes for their parameters. Networks with larger weights represent
functions that have greater complexity — they tend to be more ‘wiggly’ as a function of
x (MacKay, 2003, Chapter 44). Hence ℓ2 regularisation encodes a preference in favour
of simpler functions. Empirically, ℓ2 regularisation is effective at mitigating overfitting
and improving generalisation performance.3 Though it can be viewed simply as an ad
hoc procedure, we now describe how to give it a probabilistic modelling interpretation
by framing learning θ as a Bayesian inference problem.
3Recent studies have shown that there are more complex factors at play in the success of ℓ2
regularisation beyond controlling the complexity of the network (Zhang et al., 2019). For example,
Krizhevsky et al. (2012) found that ℓ2 regularisation can improve training accuracy in deep networks.
We mainly consider ℓ2 regularisation as a classical means of motivating the introduction of Gaussian
priors in Bayesian neural networks, but as we discuss in Section 2.2.3, it is far from clear that Gaussian
priors are the best choice.
Bayesian inference is a statistical framework that represents uncertainty by means
of probability distributions which are updated using Bayes’ rule. It has several features
that make it a compelling framework for neural network learning. On the theoretical
side, it can be motivated by various axiomatic constructions as a procedure for updating
beliefs in a consistent way (Cox, 1946; Ramsey, 2016; Savage, 1972). Furthermore, it
provides a way of interpreting ad hoc choices of the loss function and regulariser as
modelling assumptions that can be more precisely critiqued (for example, using the
marginal likelihood (MacKay, 2003, Chapter 3)). Finally, on a practical level, Bayesian
inference has been applied with great success to a broad range of machine learning
tasks (Ghahramani, 2015; Murphy, 2012). This is typically done by taking a standard
non-Bayesian machine learning model and specifying a Bayesian prior over its learnable
parameters, a process we now describe for neural networks.
In Bayesian inference, the prior distribution p(θ) describes what we believe about
the parameters θ before the dataset D is observed. Ideally, this should be a distribution
over θ that induces a distribution over functions f(x; θ) which encapsulates all of
our experience and intuition about the problem at hand. However, specifying such a
prior precisely is a highly nontrivial task for neural networks, which we will discuss
more in Section 2.2.3. Here we will simply consider the simplest, most convenient
prior. Anticipating the objective in Equation (2.11), we set the prior to be a factorised
Gaussian distribution: $p(\theta) = \mathcal{N}(\theta; 0, \alpha^{-1} I)$.
Our next task is to update our beliefs about θ in light of the observed dataset D.
These updated beliefs are represented by the posterior distribution p(θ|D). We can
compute this by applying Bayes’ rule:
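\begin{align}
p(\theta \mid \mathcal{D}) &\propto p((y_n)_{n=1}^{N} \mid (x_n)_{n=1}^{N}, \theta)\, p(\theta) \tag{2.12} \\
&= \prod_{n=1}^{N} \mathcal{N}(y_n; f(x_n; \theta), \sigma^2)\, \mathcal{N}(\theta; 0, \alpha^{-1} I) \tag{2.13} \\
\log p(\theta \mid \mathcal{D}) &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2 - \frac{\alpha}{2} \|\theta\|_2^2 + \text{const.} \tag{2.14}
\end{align}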
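Maximising log p(θ|D) in Equation (2.14) is known as maximum a posteriori (MAP)
estimation and corresponds to finding the setting of θ that has the highest density
under the posterior p(θ|D). Up to an additive constant that does not depend on θ, this is
the same as minimising the regularised loss in Equation (2.11).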
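The correspondence between the regularised loss and the log posterior can be verified numerically. In the sketch below a linear model stands in for the network f(x; θ) purely to keep the example short; that choice, the data and the hyperparameter values are all assumptions. The check is that the regularised loss plus the log joint density is the same number for every θ.

```python
import numpy as np

def f(x, theta):
    # Hypothetical stand-in for the network f(x; theta): a linear model.
    return x @ theta

def regularised_loss(theta, x, y, sigma2, alpha):
    # Equation (2.11): Gaussian negative log-likelihood plus the l2 regulariser.
    n = y.shape[0]
    nll = 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum((y - f(x, theta)) ** 2) / (2.0 * sigma2)
    return nll + 0.5 * alpha * np.sum(theta ** 2)

def log_joint(theta, x, y, sigma2, alpha):
    # log p(y | x, theta) + log p(theta) with p(theta) = N(0, I / alpha).
    # This equals log p(theta | D) up to the theta-independent term -log p(D).
    n, p = y.shape[0], theta.shape[0]
    log_lik = -0.5 * n * np.log(2.0 * np.pi * sigma2) - np.sum((y - f(x, theta)) ** 2) / (2.0 * sigma2)
    log_prior = 0.5 * p * np.log(alpha / (2.0 * np.pi)) - 0.5 * alpha * np.sum(theta ** 2)
    return log_lik + log_prior

rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=5)
sigma2, alpha = 0.5, 2.0

# The sum of the two quantities should not depend on theta.
offsets = [regularised_loss(t, x, y, sigma2, alpha) + log_joint(t, x, y, sigma2, alpha)
           for t in (np.zeros(2), np.array([1.0, -3.0]))]
```

Since the offset is constant in θ, the minimiser of the regularised loss and the maximiser of the posterior density coincide.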
So far, the Bayesian interpretation of neural network training we have described falls
within the bounds of standard deep learning practice — although this interpretation
may bring new insights into the choice and interpretation of α, the resulting MAP
estimation algorithm is, by construction, identical to minimising squared error/cross-
entropy with an added ℓ2 regularisation term. Combined with other training innovations
such as weight normalisation, dropout and data augmentation, optimising this objective
function is the workhorse of most modern neural network training, whether explicitly
given a Bayesian interpretation or not.
However, the Bayesian interpretation allows us to go much further than MAP
estimation, since the maximiser of the posterior density has no fundamental status in
Bayesian inference.4 In the next section, we will describe why it can be desirable to
utilise the entire posterior distribution p(θ|D) when making predictions. Implementing
this will necessitate fundamental changes to our learning algorithms for neural networks.
2.2 Bayesian neural networks and uncertainty in deep
learning
To motivate the use of full Bayesian inference with the entire posterior distribution
rather than simply MAP estimation, we consider the problem of uncertainty quantifica-
tion. This is the task of obtaining neural networks that understand the limits of their
knowledge; or, in other words, that ‘know what they don’t know’ (Gal, 2016).
Empirically, it has been observed that neural networks regularly make overconfident
predictions, especially on out-of-distribution data (Ovadia et al., 2019). For example, it
has been shown that deep convolutional neural networks, when trained on the ImageNet
dataset, give unpredictable and unreasonably confident answers when shown inputs
that are unlike anything in the training set. In Shafaei et al. (2018), a picture of random
Gaussian noise was classified as a ‘chainlink fence’ with 31% probability. Ideally, when
presented with an out-of-distribution (OOD) input like this, an uncertainty-aware
machine learning algorithm would make a high-entropy, unconfident prediction, with
probability mass spread widely over many classes, rather than make an arbitrary
prediction with high confidence.
Recently, Fort et al. (2021) demonstrated that vision transformers (Dosovitskiy
et al., 2020) pretrained on very large datasets of image-text pairs via methods such as
CLIP (Radford et al., 2021) and subsequently fine-tuned to a specific task can show
much better performance for OOD detection. However, this approach relies on the
pretraining dataset (which is commonly obtained by scraping websites on the Internet)
to provide representations that are relevant for the fine-tuned task. Hence it is not
4In fact, the maximiser of the posterior density is not invariant to non-linear reparameterisations
of θ (MacKay, 2003, Chapter 28), so the MAP estimate is a parameterisation-dependent notion.
directly applicable to situations where the fine-tuned task is very specialised, which is
the case for most medical and scientific applications. We next discuss more precisely
what kind of uncertainty we would like our network to reflect in these situations.
2.2.1 Epistemic and aleatoric uncertainty
We now distinguish between two kinds of uncertainty — ‘aleatoric’ uncertainty and
‘epistemic’ uncertainty (Der Kiureghian and Ditlevsen, 2009; Kendall and Gal, 2017).
Aleatoric uncertainty is uncertainty which is modelled as inherent to the observations
and cannot be reduced by collecting greater amounts of data. Epistemic uncertainty is
uncertainty due to the model parameters not being fully known, and can be reduced
by collecting more data.
For example, assume for the sake of argument that the likelihood in Equation (2.4)
fully represents our beliefs regarding the data-generating process for the label y given an
input x. Furthermore, assume that the true values of the parameters θ are completely
known. Then for any x, there is still uncertainty about the corresponding label y due
to the non-zero noise variance σ2. As the model parameters θ are already completely
known, no amount of added data can reduce this uncertainty in y, which is inherent to
the data-generating process. This is an example of aleatoric uncertainty.
On the other hand, consider the case where σ2 → 0, so that, given a value of θ,
the label y corresponding to an input x is essentially deterministically set to f(x; θ).
However, suppose that we do not know the true value of θ, but we are instead uncertain
about its value. Then this would lead to another, independent source of uncertainty
about the value of y. This is an example of epistemic uncertainty. What distinguishes
this from aleatoric uncertainty is that, in principle, if additional data were to be
collected, this could allow us to reduce our uncertainty about θ, which in turn would
reduce our uncertainty about y.
2.2.2 Bayesian inference for neural networks
Given this distinction, we can now provide a Bayesian perspective on why neural
networks may be overconfident, and what can be done to mitigate this. Deep neural
networks are extremely flexible function approximators, often with millions of parame-
ters. Furthermore, the input space that x resides in can be very high-dimensional, and
the training set will only occupy a small region of that space. For example, in the case
of CNNs trained on ImageNet, the input space is the space of all images of a certain
size, but natural images will live only near a low-dimensional manifold in this input
space. Intuitively, there will be many settings of the parameters that will give good
predictions on the training set but may differ greatly in OOD regions.
Since the data do not determine the parameters exactly, we are epistemically
uncertain about which setting is the ‘correct’ one. When we perform MAP estimation
as in standard neural network training, this can be viewed as choosing a single setting
of the parameters that fits the data (and prior) best, even though many other settings
may also be plausible according to the posterior. From the Bayesian point of view, this
procedure can be expected to lead to overconfidence because it fails to propagate our
uncertainty about the parameters into our predictions. In other words, standard MAP
estimation is able to capture aleatoric uncertainty but completely ignores epistemic
uncertainty.
From a Bayesian perspective, the remedy for this is straightforward (at least in
principle): we should compute the probability distribution over the label y given the
input x using the rules of probability, by marginalising out the uncertain parameters θ:
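\begin{align}
p(y \mid x, \mathcal{D}) &= \int p(y, \theta \mid x, \mathcal{D})\, \mathrm{d}\theta \tag{2.15} \\
&= \int p(y \mid \theta, x, \mathcal{D})\, p(\theta \mid x, \mathcal{D})\, \mathrm{d}\theta \tag{2.16} \\
&= \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta. \tag{2.17}
\end{align}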
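Here Equation (2.17) follows from two properties of our model. First, once θ is known, the
distribution of y given x is completely specified, so p(y|θ, x,D) = p(y|x, θ). Second, an
unlabelled test input x carries no information about θ, so p(θ|x,D) = p(θ|D). The distribution p(y|x,D) is known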
as the Bayesian posterior predictive distribution, or simply the posterior predictive or
just the predictive distribution. Equation (2.17) tells us that to compute the posterior
predictive, we should average our predictions over the entire posterior distribution
p(θ|D), instead of just plugging in the value of θ that maximises the posterior density,
as in MAP estimation. In other words, we marginalise out the parameters when making
predictions, thus propagating our epistemic uncertainty from θ to y. When neural
network predictions are made in this way, we refer to the model as a Bayesian neural
network (BNN).5
5Although neural networks trained using MAP estimation involve placing a Bayesian prior on the
parameters, we generally reserve the term ‘Bayesian neural network’ to refer to models where we (at
least approximately) compute the full posterior distribution over θ and marginalise it out. In a loose
sense MAP estimation may be viewed as a BNN where the posterior is approximated by a point mass
at the maximum of p(θ|D).
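To make the marginalisation in Equation (2.17) concrete, the sketch below uses a deliberately simple stand-in for a neural network: a scalar-weight model y = θx + noise with a Gaussian prior, for which the posterior and the posterior predictive are available in closed form. The model, the function names and the numbers are assumptions chosen for illustration. The plug-in prediction carries only the noise variance (aleatoric), whereas marginalising θ adds a term that reflects uncertainty about θ (epistemic).

```python
import numpy as np

def posterior_1d(x, y, sigma2, alpha):
    # Gaussian posterior over a scalar weight theta for the hypothetical model
    # y = theta * x + noise, with prior N(0, 1 / alpha) and noise variance sigma2.
    precision = alpha + np.sum(x ** 2) / sigma2
    mean = np.sum(x * y) / sigma2 / precision
    return mean, 1.0 / precision

def predictive_variances(x_star, post_var, sigma2):
    # The plug-in (MAP) prediction keeps only the noise variance. Marginalising
    # theta as in Equation (2.17) adds the variance of theta * x_star under the posterior.
    plug_in = sigma2
    marginalised = sigma2 + x_star ** 2 * post_var
    return plug_in, marginalised

# Hypothetical training data.
x_train, y_train = np.array([1.0, 2.0]), np.array([1.0, 2.0])
post_mean, post_var = posterior_1d(x_train, y_train, sigma2=1.0, alpha=1.0)
near = predictive_variances(0.0, post_var, sigma2=1.0)
far = predictive_variances(3.0, post_var, sigma2=1.0)
```

In this toy model the two predictions agree at x = 0, and the marginalised variance grows with the magnitude of the test input while the plug-in variance stays fixed.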
2.2.3 Specifying the prior
One thing we have left out of the discussion so far is how to choose the prior over
parameters, p(θ). Ideally this prior should encapsulate all our beliefs about the task
before the data are observed. For example, consider the case of image classification with
a deep neural network. In some sense, p(θ) should place higher probability on regions
of parameter space that correspond to classifiers that we expect to be more plausible.
Specifying such a p(θ) is extremely difficult — we often do not have well-formed,
consistent, quantifiable beliefs about such a large space. Even if we did, transforming
beliefs about output probabilities in function space into beliefs about the parameters
in weight space is non-trivial and is an active area of research (Flam-Shepherd et al.,
2017; Tran et al., 2020; Yang et al., 2019).
As such, common practice is to specify computationally convenient priors, such as
factorised Gaussian priors. The efficacy of such priors has been a subject of debate
recently, with Wenzel et al. (2020) arguing that it leads to poor performance (and may
be to blame for the so-called ‘cold posterior effect’), and Wilson (2020); Wilson and
Izmailov (2020) arguing that the network architecture provides sufficient structure in the
prior over functions, with p(θ) simply needing to be sufficiently vague to provide good
results. Fortuin et al. (2021) train networks using SGD and compute summary statistics
of the weights in order to motivate the choice of prior. They propose using spatially
correlated priors for the weights in convolutional networks, and heavy-tailed priors
for the weights in MLPs, demonstrating that this can lead to improved performance.
Fortuin (2022) provides a review on recent work on specifying priors in Bayesian deep
learning.
Although the problem of prior selection is crucial for BNNs, in this thesis we will
focus primarily on issues relating to approximate inference. One reason for this is
that a proper understanding of inference is needed to reliably evaluate our priors.
Often in Bayesian modelling, the efficacy of a prior is only made clear after it has
been combined with data to form a posterior predictive. By inspecting this posterior,
practitioners are able to critique pathologies and iterate towards better priors. This
way of thinking is neatly summarised in the quote by the statistician I. J. Good: “Ye
priors shall be known by their posteriors” (Good, 1983). If the inference process itself is
poorly understood, it is difficult to disentangle whether the behaviour of the predictive
distribution is more of a consequence of the choice of prior or choice of approximate
inference algorithm. Hence a better understanding of approximate inference allows for
a better understanding of BNN priors.
2.2.4 Applications of Bayesian neural network uncertainty
Before describing in Section 2.3 how to actually perform the computations required to
obtain the posterior predictive from Equation (2.17), we briefly mention some of the
uses of the epistemic uncertainty estimates that a BNN could provide.
Such a model would have numerous applications in active learning (Gal et al.,
2017b), reinforcement learning (Chua et al., 2018), Bayesian optimisation (Snoek et al.,
2012) and high-risk decision making tasks. For example, in active learning (Cohn et al.,
1995; MacKay, 1992a) we are presented with a large, unlabelled dataset. We are allowed
to query an oracle which will provide the label for any element in the dataset. However,
each query is associated with a cost, and the aim is to obtain a labelled subset that will
allow us to train the most accurate model possible whilst minimising the number of
queries made. This problem is naturally tackled by separately identifying the aleatoric
and epistemic contributions to uncertainty: we want to query datapoints with high
epistemic uncertainty in their predictions (since we stand to gain the most information
about the model parameters) but low aleatoric uncertainty (since inherently noisy
labels are less informative).
In reinforcement learning and Bayesian optimisation (Deisenroth and Rasmussen,
2011; Gal et al., 2016; Snoek et al., 2012, 2015), a central problem is that of balancing
exploration and exploitation. The task of quantifying the value of exploration naturally
involves epistemic uncertainty — an action is more worth exploring if we are uncertain
about its result, but we can reduce that uncertainty by gathering observations. Con-
versely, an action that simply leads to an outcome with high aleatoric uncertainty is
less worth exploring.
Lastly, high-risk decision making tasks, which commonly arise e.g. in medical
applications, require good epistemic uncertainty quantification. For example, deep
neural networks have been used to classify skin lesions as either cancerous or benign
with human expert-level accuracy (Esteva et al., 2017). A wrongly made classification
here can lead to disastrous consequences for a patient. Ideally a neural network, when
presented with a skin lesion it had never seen the like of before, would be able to signal
its uncertainty. This information could be used in a doctor’s decision making process
to perform more thorough checks (Filos et al., 2019; Mobiny et al., 2019). However, as
we have seen, standard neural networks that do not quantify epistemic uncertainty can
confidently make arbitrary predictions when presented with out-of-distribution inputs.
2.3 Approximate inference in Bayesian neural net-
works
Having described the ideal of taking into account epistemic uncertainty with the
Bayesian posterior predictive, we return to the task of how to actually perform this
computation in practice. It is here that we run into a major hurdle: computing the
posterior predictive involves evaluating intractable integrals.
To recap, the posterior predictive in Equation (2.17) involves averaging our predic-
tions over the entire posterior distribution. The posterior distribution itself is calculated
using Bayes’ theorem:
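\begin{align}
p(\theta \mid \mathcal{D}) &= \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \tag{2.18} \\
&= \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta}. \tag{2.19}
\end{align}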
We see that obtaining the posterior predictive requires performing two integrals, one
to calculate the normalising constant in Bayes’ theorem in Equation (2.19), and one to
average our predictions over the posterior distribution in Equation (2.17). Since the
likelihood function p(D|θ) is highly non-linear in θ for neural networks, these integrals
are analytically intractable — approximate inference methods are needed.
A great variety of approximate inference algorithms have been proposed for Bayesian
neural networks. We will only give a brief overview here. In broad terms, we can divide
approximate inference methods for BNNs into two categories. The first are sampling
methods — those that aim to represent the posterior distribution by a collection
of representative samples only. The second are what we refer to as approximating
family methods — those that assume a particular parametric form for the approximate
posterior. The work in this thesis will focus primarily on approximating family methods,
although we also give a brief overview of sampling methods in the next section.
2.3.1 Sampling methods
Sampling methods for BNNs are based on the idea of Monte Carlo integration. This
approach relies on the fact that the formula for the posterior predictive, Equation (2.17),
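can be written as an expectation:
\[
\int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta = \mathbb{E}_{p(\theta \mid \mathcal{D})}\left[ p(y \mid x, \theta) \right]. \tag{2.20}
\]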
This expectation can then be approximated by drawing samples from the posterior:
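\[
\mathbb{E}_{p(\theta \mid \mathcal{D})}\left[ p(y \mid x, \theta) \right] \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m), \qquad \theta_m \sim p(\theta \mid \mathcal{D}). \tag{2.21}
\]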
In simple Monte Carlo methods, each of the samples θm is independent and identically
distributed. However, this is often difficult to achieve in practice. Nevertheless, even
if the samples are dependent, Equation (2.21) is still an unbiased estimator of the
posterior predictive, which converges to the true value as long as the dependencies are
not too strong.
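The Monte Carlo estimator in Equation (2.21) can be sketched as follows. To allow a comparison against an exact answer, the sketch assumes a scalar-weight model y = θx + noise whose posterior over θ is Gaussian, so that exact posterior samples are easy to draw; for a real BNN neither assumption holds, and the posterior mean, variance and noise level below are made-up values.

```python
import numpy as np

def gaussian_density(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mc_predictive(y, x, theta_samples, sigma2):
    # Equation (2.21): average the likelihood p(y | x, theta_m) over the samples.
    return np.mean(gaussian_density(y, theta_samples * x, sigma2))

rng = np.random.default_rng(0)
post_mean, post_var, sigma2 = 0.8, 0.2, 0.5  # hypothetical values
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=20000)

x, y = 2.0, 1.0
estimate = mc_predictive(y, x, theta_samples, sigma2)
# Closed-form predictive density for this toy model, used only as a reference.
exact = gaussian_density(y, post_mean * x, sigma2 + x ** 2 * post_var)
```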
In order to use Equation (2.21), we need a method of drawing samples from the
posterior distribution. Since θ is high-dimensional and the posterior is multimodal, this
is a non-trivial task. The most common way of doing this is with Markov chain Monte
Carlo (MCMC) methods. In an MCMC method, a Markov chain is constructed such
that its stationary distribution is p(θ|D). Since an ergodic Markov chain has a unique
stationary distribution to which it converges from any initial state, Bayesian predictions
can be made by simulating the chain until it converges and using Equation (2.21).
One advantage of MCMC is its convergence guarantees — in the limit as the
chain is simulated for an infinitely long time, the samples obtained will be exact
draws from p(θ|D). However, depending on the problem, MCMC can often take an
impractically long time to converge. This is complicated by the fact that it is often
difficult to diagnose when convergence has occurred. Common diagnostics have been
proposed such as the potential scale reduction factor $\hat{R}$ (Gelman and Rubin, 1992).
However, such diagnostics are not foolproof, and later studies have shown that the
$\hat{R}$ diagnostic can have serious flaws (Vehtari et al., 2021). In practice, diagnosing
MCMC convergence confidently often requires many different checks and some amount
of subjective judgement on the part of the practitioner.
Naïve MCMC methods, such as the Metropolis-Hastings algorithm (Metropolis
et al., 1953) suffer from random-walk behaviour — the state of the Markov chain
takes a random walk through parameter space, thus requiring an inordinately long
time to explore large regions of the posterior distribution. More advanced MCMC
techniques, such as Hamiltonian Monte Carlo (HMC) (Neal, 1995) make use of gradient
information to avoid random walk behaviour and thus explore the posterior more
quickly. HMC is often considered a gold-standard in terms of BNN inference quality,
although even with HMC, it is difficult to assess whether a particular Markov chain
has converged in practice. Moreover, HMC has several hyperparameters, such as the
number of leapfrog steps, the step size, and the mass matrix, that have to be tuned
to obtain good performance. Another limitation is that HMC involves performing an
accept-reject step, which requires the likelihood function for the entire dataset to be
computed at the end of every Markov chain transition. This severely limits HMC’s
scalability to the massive datasets that have become common in modern deep learning.
Although full-scale HMC on large image recognition datasets has been performed
for research purposes (Izmailov et al., 2021), the computational costs render such a
procedure impractical for real-world applications.
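The ingredients of HMC mentioned above (leapfrog steps, step size and the accept-reject step) can be seen in the following minimal sketch. It assumes an identity mass matrix and targets a one-dimensional Gaussian rather than a BNN posterior; the function names and tuning values are illustrative assumptions, and the code is not an implementation from any particular library.

```python
import numpy as np

def leapfrog(theta, momentum, grad_log_p, step_size, n_steps):
    # Leapfrog integration of Hamiltonian dynamics with an identity mass matrix.
    momentum = momentum + 0.5 * step_size * grad_log_p(theta)
    for i in range(n_steps):
        theta = theta + step_size * momentum
        if i < n_steps - 1:
            momentum = momentum + step_size * grad_log_p(theta)
    momentum = momentum + 0.5 * step_size * grad_log_p(theta)
    return theta, momentum

def hmc(log_p, grad_log_p, theta0, n_samples, step_size, n_steps, rng):
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        momentum = rng.normal(size=theta.shape)
        proposal, new_momentum = leapfrog(theta, momentum, grad_log_p, step_size, n_steps)
        # Accept-reject step: log_p must be evaluated exactly at the proposal,
        # which for a BNN means a pass over the entire dataset.
        log_accept = (log_p(proposal) - 0.5 * np.sum(new_momentum ** 2)
                      - log_p(theta) + 0.5 * np.sum(momentum ** 2))
        if np.log(rng.uniform()) < log_accept:
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# Hypothetical target: a one-dimensional Gaussian with mean 1 and standard deviation 0.5.
log_p = lambda th: -0.5 * np.sum((th - 1.0) ** 2) / 0.25
grad_log_p = lambda th: -(th - 1.0) / 0.25
rng = np.random.default_rng(0)
samples = hmc(log_p, grad_log_p, np.array([0.0]), 2000, step_size=0.2, n_steps=10, rng=rng)
```

Each transition calls `grad_log_p` once per leapfrog step and `log_p` in the accept-reject test, which is where the full-dataset cost enters when the target is a BNN posterior.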
Work has been done to address both of these issues, with the No U-Turn Sampler
(NUTS) (Hoffman and Gelman, 2014) automatically tuning the number of leapfrog
steps of HMC, and methods such as stochastic gradient Langevin dynamics (SGLD)
(Welling and Teh, 2011) and stochastic gradient Hamiltonian Monte Carlo (SGHMC)
(Chen et al., 2014) introducing mini-batch methods to HMC. These stochastic gradient
MCMC (SGMCMC) methods focus on performing discrete-step approximations to the
simulation of a stochastic differential equation. Ma et al. (2015) provide a unifying
framework for SGMCMC methods under this interpretation.
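A single SGLD update, in the form given by Welling and Teh (2011), combines a minibatch estimate of the log posterior gradient with injected Gaussian noise whose variance equals the step size. The sketch below applies it to a toy problem with a scalar parameter (Gaussian observations with unknown mean) so that the exact posterior mean is available for comparison. The toy model, the constant step size and all numerical values are simplifying assumptions; the original algorithm uses a decreasing step size schedule.

```python
import numpy as np

def sgld_step(theta, batch, n_total, grad_log_prior, grad_log_lik, step_size, rng):
    # Minibatch estimate of the gradient of the log posterior: the sum over the
    # batch is rescaled by N / n so that it is unbiased for the full-data sum.
    scale = n_total / len(batch)
    grad = grad_log_prior(theta) + scale * np.sum(grad_log_lik(theta, batch))
    # Injected Gaussian noise with variance equal to the step size.
    noise = rng.normal(scale=np.sqrt(step_size))
    return theta + 0.5 * step_size * grad + noise

rng = np.random.default_rng(0)
alpha, n_total, batch_size, step_size = 1.0, 100, 10, 1e-3
data = rng.normal(loc=2.0, scale=1.0, size=n_total)  # hypothetical observations

grad_log_prior = lambda theta: -alpha * theta       # prior N(0, 1 / alpha)
grad_log_lik = lambda theta, batch: batch - theta   # d/dtheta log N(y; theta, 1)

theta, samples = 0.0, []
for t in range(5000):
    batch = rng.choice(data, size=batch_size, replace=False)
    theta = sgld_step(theta, batch, n_total, grad_log_prior, grad_log_lik, step_size, rng)
    if t >= 1000:  # discard an initial burn-in period
        samples.append(theta)

exact_posterior_mean = np.sum(data) / (alpha + n_total)
```

Only `batch_size` datapoints are touched per update, in contrast to the full-dataset evaluations needed by standard HMC.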
Unlike standard HMC, SGMCMC methods usually omit any form of accept-reject
step. Hence although they are more scalable, they do not enjoy the same theoretical
guarantees as HMC. SGHMC has been applied to BNNs for Bayesian optimisation, with
promising results (Springenberg et al., 2016). However, scaling MCMC methods up to
modern architectures like deep convolutional networks with massive datasets such as
ImageNet, whilst still providing high quality inference, has proven challenging. Recent
work in this area has begun to show results competitive with standard CNNs (Heek and
Kalchbrenner, 2019; Zhang et al., 2020); however it seems that artificially increasing the
sharpness of the likelihood function may be required for good performance in practice
(the so-called ‘cold posterior effect’ (Wenzel et al., 2020)).
Finally, deep ensembles (Lakshminarayanan et al., 2017) have been interpreted as
a kind of sampling method for BNNs. A deep ensemble is simply an ensemble of (non-
Bayesian) neural networks trained by standard gradient descent, where the randomness
in the ensemble comes from the random initialisation of the network parameters. Once
the ensemble is trained, each trained network can be interpreted as a sample from the
approximate posterior, and predictions can be made using Equation (2.21). Although
deep ensembles were originally introduced as an alternative to Bayesian neural networks,
Wilson (2020); Wilson and Izmailov (2020) argue that they should be interpreted as a
kind of approximation to the Bayesian posterior, since each member of the ensemble
represents a mode of the posterior. In fact, they argue that deep ensembles can provide
a better approximation to the posterior than other approximate Bayesian methods
that only represent a single mode. Izmailov et al. (2021) compared deep ensembles to
full-batch HMC on the CIFAR-10 dataset and found that the predictive distribution of
deep ensembles resembled the HMC predictive as closely as SGLD, and better than
variational inference (which we discuss in the next section). Although the Bayesian
interpretation of deep ensembles is debated, their efficacy as a simple and effective method
for obtaining predictive uncertainty estimates has been demonstrated by several studies,
often outperforming other Bayesian methods in uncertainty estimation benchmarks
(Ashukha et al., 2019; Ovadia et al., 2019).
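Prediction with a deep ensemble amounts to applying Equation (2.21) with one term per trained network. The sketch below assumes the ensemble members have already been trained and takes their logits for a single input as given; the logit values are made up to contrast members that agree with members that disagree.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def ensemble_predict(member_logits):
    # member_logits has shape (M, C): one row of logits per ensemble member.
    # Each member is treated as one sample theta_m in Equation (2.21).
    return np.mean(softmax(member_logits), axis=0)

def entropy(p):
    # Small constant guards against log(0).
    return -np.sum(p * np.log(p + 1e-12))

# Hypothetical logits from two members on two different inputs.
agree = ensemble_predict(np.array([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]))
disagree = ensemble_predict(np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]))
```

Averaging probabilities rather than logits means that disagreement between individually confident members shows up as a higher-entropy prediction.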
2.3.2 Approximating family methods
The other major class of approximate inference methods is approximating family
methods. These methods will be the focus of the study presented in Chapter 3. We define
an approximating family method as a method that assumes a pre-specified parametric
form for the approximate posterior distribution. These methods approximate the true
posterior p(θ|D) with an approximate posterior qϕ(θ), where ϕ are the parameters of
the distribution. We will refer to the set of all approximating distributions consistent
with the pre-specified parametric form as the approximating family, Q.
Each approximating family method must define its family Q, and also a method
for choosing a member of that family, or equivalently, a method of choosing ϕ. Once
this is done, predictions can be made by replacing the expectation under the posterior
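in Equation (2.17) by an expectation under the approximate posterior:
\[ \int p(y|x,\theta)\, p(\theta|\mathcal{D})\, \mathrm{d}\theta = \mathbb{E}_{p(\theta|\mathcal{D})}\left[p(y|x,\theta)\right] \approx \mathbb{E}_{q_\phi(\theta)}\left[p(y|x,\theta)\right]. \tag{2.22} \]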
Usually, the approximating family is chosen such that it is easy to obtain independent
samples from qϕ(θ). In this case, we can make predictions by simple Monte Carlo,
without the need to resort to MCMC methods:
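\[ \mathbb{E}_{q_\phi(\theta)}\left[p(y|x,\theta)\right] \approx \frac{1}{M} \sum_{m=1}^{M} p(y|x,\theta_m), \qquad \theta_m \overset{\text{i.i.d.}}{\sim} q_\phi(\theta). \tag{2.23} \]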
Since it is easy to reduce the variance of this estimator by drawing many samples from
qϕ(θ), the main challenge in approximating family methods is choosing Q and ϕ such
that qϕ(θ) approximates the true posterior well in some sense. We now present some
common examples of approximating family methods.
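The simple Monte Carlo estimator in Equation (2.23) can be sketched as follows. The example assumes a fully factorised Gaussian qϕ(θ) over the weights of a logistic regression model; the variational parameters `q_mean` and `q_std`, the test input `x` and the function names are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mc_predictive(x, q_mean, q_std, num_samples, rng):
    """Estimate E_q[p(y=1 | x, theta)] by averaging over i.i.d. samples theta_m ~ q."""
    eps = rng.standard_normal((num_samples, q_mean.size))
    thetas = q_mean + q_std * eps            # (M, D) independent samples from q
    probs = sigmoid(thetas @ x)              # p(y=1 | x, theta_m) for each sample
    standard_error = probs.std(ddof=1) / np.sqrt(num_samples)
    return probs.mean(), standard_error

rng = np.random.default_rng(0)
q_mean = np.array([1.0, -0.5])               # hypothetical variational means
q_std = np.array([0.3, 0.8])                 # hypothetical variational standard deviations
x = np.array([2.0, 1.0])

est_small, se_small = mc_predictive(x, q_mean, q_std, 100, rng)
est_large, se_large = mc_predictive(x, q_mean, q_std, 10_000, rng)
```

The standard error shrinks as M grows, which is the sense in which the variance of the estimator is easy to reduce.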
Variational inference
Variational inference (VI) (Beal, 2003; Blei et al., 2017; Jordan et al., 1999) frames
approximate inference as an optimisation problem. The parameters ϕ of the approxi-
mating distribution are chosen to minimise the KL-divergence between qϕ(θ) and the
true posterior p(θ|D). In other words, we seek the parameters ϕ∗ such that
\[ \phi^* = \operatorname*{arg\,min}_{\phi} \mathrm{KL}(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D})). \tag{2.24} \]
The KL-divergence is non-negative, and equals zero if and only if qϕ(θ) = p(θ|D). In
order to perform this minimisation, we need to write the KL-divergence in terms of
computationally tractable quantities. If we define the quantity
LELBO(ϕ) := log p(D)−KL(qϕ(θ)||p(θ|D)), (2.25)
it can easily be shown that:
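\begin{align}
\mathcal{L}_{\mathrm{ELBO}}(\phi) &= \mathbb{E}_{q_\phi(\theta)}\left[\log p(\mathcal{D}|\theta)\right] - \mathrm{KL}(q_\phi(\theta)\,\|\,p(\theta)) \tag{2.26} \\
&= \sum_{n=1}^{N} \mathbb{E}_{q_\phi(\theta)}\left[\log p(y_n|x_n,\theta)\right] - \mathrm{KL}(q_\phi(\theta)\,\|\,p(\theta)) \tag{2.27} \\
&= \sum_{n=1}^{N} \mathbb{E}_{q_\phi(\theta)}\left[\log p(y_n|x_n,\theta)\right] - \mathbb{E}_{q_\phi(\theta)}\left[\log \frac{q_\phi(\theta)}{p(\theta)}\right]. \tag{2.28}
\end{align}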
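For completeness, the step from Equation (2.25) to Equation (2.26) can be written out using Bayes' rule, p(θ|D) = p(D|θ)p(θ)/p(D). The following is the standard argument, sketched in the notation of this section:

```latex
\begin{align*}
\mathrm{KL}(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D}))
  &= \mathbb{E}_{q_\phi(\theta)}\left[\log \frac{q_\phi(\theta)}{p(\theta|\mathcal{D})}\right] \\
  &= \mathbb{E}_{q_\phi(\theta)}\left[\log \frac{q_\phi(\theta)\, p(\mathcal{D})}{p(\mathcal{D}|\theta)\, p(\theta)}\right] \\
  &= \log p(\mathcal{D}) - \mathbb{E}_{q_\phi(\theta)}\left[\log p(\mathcal{D}|\theta)\right]
     + \mathrm{KL}(q_\phi(\theta)\,\|\,p(\theta)).
\end{align*}
```

Substituting this expression into Equation (2.25) cancels the log p(D) terms and leaves Equation (2.26). Equation (2.27) then follows when the likelihood factorises over datapoints, so that log p(D|θ) is a sum of the per-datapoint terms log p(yn|xn, θ).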
From Equation (2.25) we can see that LELBO is a lower bound to the model evidence
log p(D), and that maximising LELBO is equivalent to minimising KL(qϕ(θ)||p(θ|D)).
LELBO is known as the evidence lower bound (ELBO) or the (negative) variational free
energy. Moreover, unlike KL(qϕ(θ)||p(θ|D)), it is often computationally tractable to
obtain an unbiased estimate of LELBO. This can be done by forming simple Monte
Carlo estimates of the expectations in Equation (2.28). In some cases, e.g. when qϕ(θ)
and p(θ) are both Gaussian distributions, the KL-divergence between them can also be
evaluated analytically. Furthermore, since the likelihood terms appear as a sum over
datapoints in Equation (2.28), it is trivial to form an unbiased estimate of LELBO using
minibatches of data, allowing VI to scale to massive datasets. In order to optimise ϕ,
gradient-based optimisers can be used with unbiased estimates of ∇ϕLELBO obtained
via the reparameterisation trick (Blundell et al., 2015; Kingma et al., 2015; Kingma
and Welling, 2013).
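The ingredients described above (a Monte Carlo estimate of the expected log-likelihood, an analytic Gaussian KL term, minibatching and reparameterised gradients) can be put together in a short sketch. To keep the gradients hand-computable, the example assumes a Bayesian linear regression model with known noise variance, a N(0, 1) prior and a fully factorised Gaussian qϕ(θ); the constants, step size, initialisation and function names are illustrative choices, not taken from the cited works.

```python
import numpy as np

NOISE_VAR = 0.5 ** 2   # assumed known observation noise variance
PRIOR_VAR = 1.0        # assumed N(0, PRIOR_VAR) prior on every weight

def kl_to_prior(mu, log_sigma):
    """Analytic KL(q || p) for q = N(mu, diag(sigma^2)) and p = N(0, PRIOR_VAR * I)."""
    var = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum((var + mu ** 2) / PRIOR_VAR - 1.0 - np.log(var / PRIOR_VAR))

def elbo_and_grads(mu, log_sigma, x_batch, y_batch, n_total, rng):
    """One-sample minibatch estimate of the ELBO and its reparameterised gradients."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)
    theta = mu + sigma * eps                          # reparameterisation trick
    resid = y_batch - x_batch @ theta
    scale = n_total / len(x_batch)                    # rescale minibatch sum to the full dataset
    log_lik = scale * np.sum(-0.5 * np.log(2 * np.pi * NOISE_VAR)
                             - 0.5 * resid ** 2 / NOISE_VAR)
    dtheta = scale * x_batch.T @ resid / NOISE_VAR    # grad of log-likelihood w.r.t. theta
    grad_mu = dtheta - mu / PRIOR_VAR
    grad_log_sigma = dtheta * eps * sigma - (sigma ** 2 / PRIOR_VAR - 1.0)
    return log_lik - kl_to_prior(mu, log_sigma), grad_mu, grad_log_sigma

rng = np.random.default_rng(0)
n, d = 100, 2
x = rng.normal(size=(n, d))
y = x @ np.array([1.0, -2.0]) + np.sqrt(NOISE_VAR) * rng.normal(size=n)

mu, log_sigma = np.zeros(d), np.full(d, -3.0)         # small initial variances (a common heuristic)
for step in range(2000):
    idx = rng.choice(n, size=20, replace=False)       # minibatch of 20 datapoints
    elbo, g_mu, g_ls = elbo_and_grads(mu, log_sigma, x[idx], y[idx], n, rng)
    mu += 5e-4 * g_mu                                 # stochastic gradient ascent on the ELBO
    log_sigma += 5e-4 * g_ls
```

For this conjugate model the exact posterior is Gaussian, so the learned mean can be compared against the closed-form posterior mean as a sanity check.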
Variational inference is a wide-ranging approximate inference technique that encom-
passes a broad variety of approximating families (also known as variational families
in this context). The most commonly used variational family is the set of all fully-
factorised Gaussian distributions over θ, which we denote as QFFG. VI with QFFG is
usually referred to as mean-field variational inference (MFVI) (Blundell et al., 2015;
Graves, 2011; Hinton and Van Camp, 1993). Other, more flexible families have also
been proposed, ranging from the set of all multivariate Gaussian distributions over
θ, denoted QFCG (Barber and Bishop, 1998)6, to families involving matrix-variate
Gaussians (Louizos and Welling, 2016), normalising flows (Rezende and Mohamed,
2015) and even implicit distributions (Huszár, 2017; Mescheder et al., 2017; Ranganath
et al., 2016a; Shi et al., 2018) for qϕ(θ), whose densities cannot be evaluated directly.
However, using more flexible families often substantially complicates the method. In
Multiplicative Normalising Flows (MNF) (Louizos and Welling, 2017), latent variables
are used to multiply the outputs of each neuron. The distribution over the latent
variables is specified by a normalising flow. However, the training procedure then
necessitates the use of a hierarchical ELBO, which is itself a lower bound to the
ELBO (Ranganath et al., 2016b). In Kernel Implicit Variational Inference (KIVI)
(Shi et al., 2018), an implicit distribution is used for the variational posterior, but
this necessitates the use of kernel density estimators to obtain a (biased) estimate of
the KL-divergence term in the ELBO. The computational overhead and the increased
complexity introduced by using these more expressive variational families has prevented
their widespread adoption. The fully factorised Gaussian family remains the most
widely used approximating family for its simplicity and scalability, and has been
successfully applied up to the ImageNet scale, when combined with natural-gradient
optimisation methods (Osawa et al., 2019).
Monte Carlo dropout
Monte Carlo dropout (MCDO) is an approximate inference method for BNNs that
works by training a neural network with dropout (Srivastava et al., 2014), where
hidden units are stochastically dropped (set to zero) during training with probability
p. Dropout was originally conceived as a regularisation technique designed to prevent
neurons from co-adapting to one another. In standard dropout, once training is
complete, predictions are made with all hidden units present, but with the weights
downscaled by a factor. In contrast, in MC dropout, units are stochastically dropped
during test time. Multiple forward passes are made through the network for each
prediction, with each forward pass being performed with a random subset of units
dropped. The final prediction is then the average of all of these forward passes.
6 Here the ‘FCG’ in QFCG stands for ‘full-covariance Gaussian’.
MC dropout has been given a Bayesian interpretation in Gal and Ghahramani (2016)
and Gal (2016) as a form of approximate variational inference. In this interpretation,
the stochasticity is interpreted as occurring in parameter space, not the space of hidden
features. Specifically, let h(l) be the hidden features in the lth layer before any units
are dropped out, and let ĥ(l) be a sample of the hidden features after units have been
dropped out with probability p. Then we can write:
ĥ(l) = ϵ(l) ⊙ h(l), (2.29)
where ϵ(l) is a vector of the same length as h(l), with each element of ϵ(l) drawn i.i.d., and
taking the value 0 with probability p, and the value 1 with probability 1 − p. Here
⊙ denotes the element-wise product and the stochasticity is usually interpreted as
being in hidden feature space. However, in Bayesian inference we want to quantify our
uncertainty about the parameters, not the hidden features. To do this, we note that
hidden features in an MLP are always multiplied by a weight matrix. We can then
write:
W (l)ĥ(l) = W (l)(ϵ(l) ⊙ h(l)) (2.30)
= W (l)diag(ϵ(l))h(l) (2.31)
= Ŵ (l)h(l) (2.32)
where diag(·) maps a vector to a diagonal matrix with that vector on the diagonal,
and we have defined the random matrix Ŵ (l) := W (l)diag(ϵ(l)). Hence we can view the
stochasticity as occurring in parameter space through these random weight matrices.
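The identity in Equations (2.30) to (2.32) is easy to check numerically. The sketch below uses arbitrary small shapes and random values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5
h = rng.normal(size=4)                          # hidden features h^(l)
w = rng.normal(size=(3, 4))                     # weight matrix W^(l) acting on h^(l)
eps = (rng.random(4) >= p_drop).astype(float)   # eps^(l): 0 w.p. p, 1 w.p. 1 - p

feature_space = w @ (eps * h)                   # W (eps * h), Equation (2.30)
w_hat = w @ np.diag(eps)                        # random weight matrix W diag(eps)
weight_space = w_hat @ h                        # W_hat h, Equation (2.32)

assert np.allclose(feature_space, weight_space)

# Dropping unit j zeroes the entire j-th column of W_hat at once.
dropped = eps == 0
assert np.all(w_hat[:, dropped] == 0)
```

The second check makes visible that all weights leaving a dropped unit are switched off together.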
It has been argued by Gal (2016) that standard dropout training with ℓ2 regu-
larisation approximates stochastic optimisation of the ELBO in variational inference.
Under this interpretation, the variational family, QMCDO, is the set of distributions
over weight matrices induced by the sampling procedure Ŵ (l) := W (l)diag(ϵ(l)) for
0 ≤ l ≤ L. Members of this family are referred to as Bernoulli variational distributions,
or dropout variational distributions. Here the variational parameters ϕ are the pre-dropout
weight matrices (W^{(l)})_{l=0}^{L}, the biases (which are deterministic) (b^{(l)})_{l=0}^{L}, and
the dropout probability p. Applying dropout at test time can then be viewed as an
application of Equation (2.23). In its original implementation, the dropout probability
p was tuned by cross-validation on a held out dataset. Concrete dropout (Gal et al.,
2017a) is an extension of MC dropout that allows p to be learned automatically using
the VI interpretation.
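Test-time dropout as an instance of Equation (2.23) can be sketched as below. The network here has random, untrained weights, and the layer sizes, dropout probability and function name are assumptions for this example. Following Equation (2.29), dropped units are simply zeroed, without the 1/(1 − p) rescaling that some software implementations apply.

```python
import numpy as np

def mc_dropout_predict(x, weights, biases, p_drop, num_passes, rng):
    """Average `num_passes` stochastic forward passes with hidden units dropped."""
    outputs = []
    for _ in range(num_passes):
        h = x
        for layer, (w, b) in enumerate(zip(weights, biases)):
            if layer > 0:                                   # drop hidden units, not the inputs
                mask = rng.random(h.shape[-1]) >= p_drop    # keep each unit w.p. 1 - p
                h = h * mask                                # one mask = one sample of the weights
            h = h @ w + b
            if layer < len(weights) - 1:
                h = np.maximum(h, 0.0)                      # ReLU on hidden layers
        outputs.append(h)
    outputs = np.stack(outputs)                             # shape (num_passes, N, 1)
    return outputs.mean(axis=0), outputs.var(axis=0)

rng = np.random.default_rng(0)
sizes = [1, 50, 50, 1]
weights = [rng.normal(scale=1 / np.sqrt(m), size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
x = np.linspace(-1, 1, 5)[:, None]

mean, var = mc_dropout_predict(x, weights, biases, p_drop=0.1, num_passes=200, rng=rng)
det_mean, det_var = mc_dropout_predict(x, weights, biases, p_drop=0.0, num_passes=3, rng=rng)
```

With p = 0 every pass is identical, so the variance across passes collapses to zero.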
The MC dropout approximating family QMCDO is unusual in that it has support
over only a finite number of settings of the parameters θ. In other words, it can be
expressed as a finite mixture of Dirac-δ distributions. Hence, in the case of commonly
used Gaussian priors p(θ), KL(qϕ(θ)||p(θ)) is infinite. Gal (2016) justifies the method by
considering the delta functions to be Gaussians with small variances, or by considering a
discrete prior instead of a Gaussian prior. This procedure is given a rigorous justification
in Hron et al. (2018). Another interesting feature of the dropout variational distribution
is that although hidden units are dropped out independently of each other, it is not
a fully factorised distribution when viewed in weight space: the weights out of each
hidden unit are dependent on each other.
Laplace approximation
The Laplace approximation (Denker and LeCun, 1991; MacKay, 1992b) works by
finding a mode θMAP of the posterior via standard gradient-based optimisation, and
then setting the approximate posterior to qϕ(θ) = N (θ; µ, Σ) with µ = θMAP.
The covariance Σ is set such that the curvature of log p(θ|D) matches the curvature of
the logarithm of the Gaussian approximation at θMAP, that is:
\[ \Sigma = -\left[ \nabla_\theta \nabla_\theta \log p(\theta|\mathcal{D}) \,\big|_{\theta=\theta_{\mathrm{MAP}}} \right]^{-1}. \tag{2.33} \]
In words, Σ is the negative inverse Hessian of the log posterior evaluated at the MAP
solution. In practice, for regression networks it is common to approximate the negative
Hessian of the log-likelihood with the Gauss-Newton matrix, which is guaranteed to be
positive semi-definite and can be evaluated using only first derivatives. Together with
the prior precision, this gives:
\[ \Sigma = \left[ \frac{1}{\sigma^2} \sum_{n=1}^{N} g(x_n)\, g(x_n)^{\mathsf{T}} + \mathrm{diag}(p) \right]^{-1}. \tag{2.34} \]
Here g(x_n) = ∇_θ f_θ(x_n) evaluated at θ = θMAP, σ^2 is the observation noise variance
of the Gaussian likelihood, and p is a vector whose ith element is 1/σ_i^2, where σ_i^2
is the prior variance7 of θ_i. For networks with other likelihoods such as classification
networks, a generalised Gauss-Newton approximation to the Hessian can be used
instead (Martens, 2020).
7 Here we have assumed a diagonal Gaussian prior for θ.
In this case, the approximating family is the set of multivariate Gaussian distribu-
tions over the parameters of the network, i.e., QFCG. However, other approximating
families may also be considered for use with the Laplace approximation. Let the
number of parameters in the network be NP . The method as presented here requires
the storage and inversion of an (NP ×NP ) matrix. While this is still feasible for small
networks such as those considered in MacKay (1992b), it is prohibitively expensive
for the large neural networks considered in modern deep learning. As a more scalable
alternative, we could take just the diagonal of the Hessian matrix and invert it to obtain
a diagonal covariance, Σdiag. This is known as the diagonal Laplace approximation,
and was first proposed by Denker and LeCun (1991). In this case the approximating
family is just QFFG. Recently, the K-FAC (Kronecker-factored approximate curvature)
Laplace approximation has been proposed for deep neural networks (Daxberger et al.,
2021; Immer et al., 2021a,b; Ritter et al., 2018). It is more scalable than the full
Laplace approximation, while using a more flexible approximating family than the
diagonal Laplace approximation. In the K-FAC Laplace approximation, Q is the set
of Gaussian distributions that are factorised over layers in the network (leading to
a block diagonal covariance), with the covariance matrices within each layer being
Kronecker-factored.
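The full and diagonal Laplace approximations described above can be sketched for a very small regression network. The example assumes a single-hidden-layer tanh network with an analytic Jacobian, a N(0, 1) prior on all parameters, a known noise variance, and plain gradient descent run for a fixed number of steps as a stand-in for finding θMAP; all names and constants are illustrative. The covariance is formed as the inverse of the Gauss-Newton matrix plus the prior precision.

```python
import numpy as np

NOISE_VAR, PRIOR_VAR, HIDDEN = 0.5 ** 2, 1.0, 5

def unpack(theta):
    return theta[:HIDDEN], theta[HIDDEN:2 * HIDDEN], theta[2 * HIDDEN:]

def f(theta, x):
    w, b, v = unpack(theta)
    return np.tanh(np.outer(x, w) + b) @ v             # network outputs, shape (N,)

def jacobian(theta, x):
    """Rows are g(x_n)^T = grad_theta f_theta(x_n), computed analytically."""
    w, b, v = unpack(theta)
    h = np.tanh(np.outer(x, w) + b)                    # hidden activations, shape (N, H)
    dpre = (1.0 - h ** 2) * v                          # df / d(pre-activation)
    return np.hstack([dpre * x[:, None], dpre, h])     # columns ordered as (w, b, v)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=30)
y = np.sin(2 * x) + np.sqrt(NOISE_VAR) * rng.normal(size=30)

# Approximate MAP estimate: gradient descent on the negative log posterior.
theta = 0.5 * rng.normal(size=3 * HIDDEN)
for _ in range(5000):
    grad = jacobian(theta, x).T @ (f(theta, x) - y) / NOISE_VAR + theta / PRIOR_VAR
    theta -= 1e-4 * grad
theta_map = theta

J = jacobian(theta_map, x)
precision = J.T @ J / NOISE_VAR + np.eye(theta_map.size) / PRIOR_VAR  # Gauss-Newton + prior precision
sigma_full = np.linalg.inv(precision)                  # full-covariance Laplace (Q_FCG)
sigma_diag = 1.0 / np.diag(precision)                  # diagonal Laplace (Q_FFG): invert diagonal only
```

Because the diagonal variant inverts only the diagonal of the precision matrix, its marginal variances are never larger than those of the full approximation, and it discards all correlations between parameters.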
Other approximating family methods
There is a wide range of other approximating family methods which we will not
discuss in detail. These include methods that minimise a divergence other than the
KL-divergence, such as expectation propagation (Hernández-Lobato and Adams, 2015;
Minka, 2001), black-box alpha divergence minimisation (Hernández-Lobato et al., 2016)
and Rényi divergence variational inference (Li and Turner, 2016). In addition, there
are methods such as stochastic weight averaging-Gaussian (SWAG) (Maddox et al.,
2019), which relies on interpreting SGD iterates as performing approximate variational
inference (Mandt et al., 2017), and functional variational Bayesian neural networks (Sun
et al., 2019), which attempt to minimise the KL-divergence in function space instead
of weight space. One thing all of these techniques have in common, when applied to
BNNs, is that they primarily use the fully factorised Gaussian approximating family,
QFFG, although other families can sometimes be used within the same framework.
2.3.3 Choosing and evaluating approximating family methods
Unlike MCMC methods, there are usually no theoretical guarantees that the approxi-
mations provided by approximating family methods will converge to the exact posterior.
In fact, the parametric form assumed by the approximating family often has properties
(such as unimodality, Gaussianity, or independence assumptions) that we know not to
be true of the exact posterior, making convergence guarantees of the kind available
with MCMC impossible to obtain.
There is, however, a very active line of research that provides frequentist concen-
tration guarantees for variational approximations of the Bayesian posterior (Alquier
and Ridgway, 2020; Chérief-Abdellatif, 2020; Pati et al., 2018; Zhang and Gao, 2020).
These assume there is a single true setting of the parameters which is used to generate
the data. It is then shown, subject to technical conditions, that as the amount of data
increases, the variational posterior concentrates around the true value of the parameter.
However, these results are not immediately relevant for BNN practitioners. This is
because the main motivation for introducing BNNs is to represent epistemic uncertainty
in the parameters. By contrast, these frequentist consistency results only become
relevant when there is enough data for the posterior to concentrate around the true
setting of the parameters — in other words, when it is no longer necessary to represent
uncertainty. They cannot be used to show that, for a given dataset, the variational
posterior predictive will be similar to the exact Bayesian posterior predictive, which is
what we are concerned with.
Given the task of approximating the exact Bayesian posterior for a finite dataset,
we are then faced with the challenge that it is not clear which approximating family, or
which approximating family method, will allow for the most accurate inference. If the
approximating family method is fixed, in theory a larger approximating family is more
flexible and hence should always allow for better performance. However, this is not
always borne out in practice due to the added computational cost and optimisation
difficulties introduced by large approximating families (Trippe and Turner, 2018).
Although there have been studies that benchmark the performance of various
approximating family methods for BNNs (e.g., Mukhoti et al. (2018); Tomczak et al.
(2018)), these most commonly evaluate the methods by metrics such as held-out
accuracy or log-likelihood on a benchmark dataset, without any reference to the true
posterior predictive. (One recent notable exception is Izmailov et al. (2021), which
performs full-batch HMC for ResNets trained on the CIFAR10 dataset as a reference,
and compares the HMC posterior predictive with MFVI, among other methods.)
While empirical studies comparing performance on benchmark datasets can give
some indication as to the practical utility of a method on a specific task, they do not
address the fundamental question of how well the approximate posterior predictive
matches the true posterior predictive. For example, a method could perform well on
a specific dataset because a poorly chosen prior has been combined with inaccurate
inference in such a way that the problems introduced fortuitously “cancel out” with each
other. While this may still lead to a useful machine learning method, it is debatable to
what extent the success of such a method can be attributed to Bayesian principles. At
the very least, it is important to know if and when this is happening, so that we can
know how to troubleshoot and improve our models.
In Chapter 3 we will investigate two common approximating families, QFFG and
QMCDO, both theoretically and empirically, to obtain new insights into how well these
families can approximate the true posterior predictive.
2.4 History of approximating families in Bayesian neural networks
In this section we give a brief, incomplete history of the approximating families most
commonly used for Bayesian neural networks. We focus on the theoretical and practical
motivations given for their introduction.
Denker and LeCun (1991) appear to be the first to attempt to calculate a Bayesian
posterior predictive for a neural network. They use the Laplace approximation with
a diagonal approximation to the covariance matrix. Hence they select QFFG as their
approximating family. This is the earliest use we found in the literature of QFFG
for approximate BNN inference in feed-forward networks. No theoretical or practical
justification is made for using a diagonal covariance matrix — the network architectures
considered then were small, so presumably inverting the full covariance matrix would
have been computationally feasible, but slower. Denker and LeCun (1991) also include
a discussion of what would now be called the distinction between ‘epistemic’ and
‘aleatoric’ uncertainty, and the role of BNNs in expressing epistemic uncertainty.
Buntine and Weigend (1991) propose using the Laplace approximation with a full
covariance matrix, thus selecting QFCG as their approximating family. They claim that
the diagonal approximation to the covariance matrix will lead to very poor estimates,
although they do not provide a theoretical explanation of the role of the off-diagonal
terms. In the conclusion section, the paper raises the question of the quality of the
Gaussian approximation.
In a seminal paper, MacKay (1992b) introduced the evidence framework for Bayesian
neural networks, which relies on the full-covariance Laplace approximation to make
predictions and perform model comparison. Thus QFCG is used as the approximating
family. In commenting on the method of Denker and LeCun (1991), MacKay (1992b)
argues that due to strong posterior correlations in the parameters, it is important to
evaluate the off-diagonal terms of the Hessian when doing the Laplace approximation.
It is interesting to note that in Figure 1 of MacKay (1992b), the posterior predictive is
shown with an emphasis on the fact that the error bars get larger around the perimeter
of the training data, and also in the gap between the training regions. We will discuss
this ‘in-between uncertainty’ in more detail in Chapter 3. MacKay (1992b) explicitly
links this qualitative behaviour in function space to dependencies in the approximate
posterior in parameter space, and notes that this qualitative behaviour would not have
been obtained if the diagonal Laplace approximation was used. However, he does not
provide a detailed argument as to why this is so.
Variational inference for BNNs was introduced in Hinton and Van Camp (1993).
They frame their work in terms of the Minimum Description Length (MDL) principle,
not VI, but the objective used is identical to the ELBO. They avoid having to perform
Monte Carlo estimation of the gradients by using a single hidden layer network and
tabulating values of the mean and variance of the output of a hidden unit. They use
QFFG as their variational family. In commenting on the choice of QFFG, they note that
it is not clear how much is lost by ignoring the off-diagonal terms in the covariance
matrix, given that MacKay (1992b) showed significant covariances in the Laplace
approximation. However, they argue that since VI explicitly manipulates the Gaussian
distributions, the learning will try to force the noise in the weights to be independent.8
Barber and Bishop (1998) extended the work of Hinton and Van Camp (1993) by
replacing the MDL interpretation with the standard VI interpretation, and also by
extending the variational family from QFFG to QFCG. They motivate the introduction
of QFCG by noting that the posterior often has very strong correlations between the
parameters.
Recent work has focused on scaling up VI to larger BNNs (Blundell et al., 2015;
Graves, 2011; Osawa et al., 2019), and using Monte Carlo estimates for the gradients
to allow deeper networks to be trained using automatic differentiation packages. Since
the models considered in modern deep learning are far larger than those used when
BNNs were in their infancy, the quadratic (in the number of parameters) computational
and memory requirements of QFCG are no longer as acceptable, and QFFG is the most
commonly used variational family.
8 The phenomenon of VI learning variational parameters that are consistent with the factorisation
assumptions made has indeed been observed in BNNs, though this behaviour may not always be
desirable (Trippe and Turner, 2018).
The need to scale to larger networks has led to QFFG being a widespread choice
in many modern approximating family methods, not just VI. To give an incomplete
list, it has been used in PBP (Hernández-Lobato and Adams, 2015), variational
Gaussian dropout (Kingma et al., 2015), stochastic expectation propagation (Li et al.,
2015), black-box alpha divergence minimisation (Hernández-Lobato et al., 2016), Rényi
divergence VI (Li and Turner, 2016), natural gradient VI (Khan et al., 2018) and
functional variational BNNs (Sun et al., 2019).
The other approximating family that has found widespread use in modern Bayesian
deep learning is QMCDO. Dropout as a stochastic regularisation technique was intro-
duced in Srivastava et al. (2014). Later, the interpretation of MC dropout as a Bayesian
approximation was introduced in Gal and Ghahramani (2016). Unlike QFFG or QFCG,
QMCDO was not first introduced as a family intended to approximate a Bayesian pos-
terior distribution. Nevertheless, MC dropout inference performs competitively on a
variety of BNN benchmarks (Filos et al., 2019; Mukhoti et al., 2018).
2.5 Conclusion
In this chapter, we introduced Bayesian neural networks and motivated their need by
discussing the inability of standard neural networks to represent epistemic uncertainty.
We saw that exact inference in BNNs is intractable and has to be approximated. This
led us to consider approximate inference, which could be divided into sampling methods
and approximating family methods. We provided an overview of the most commonly
used approximating family methods for BNNs, and found that the majority use the
factorised Gaussian approximating family, QFFG.
BNNs hold the promise of combining principled uncertainty estimation with the
flexibility of deep learning. However, if approximate inference fails to provide predictive
distributions that resemble the exact predictive, the relationship between the principled
Bayesian framework we use in theory and the models we deploy in practice becomes
tenuous. Unlike MCMC methods, most approximating family methods do not come
with any theoretical guarantees as to the quality of their approximations. Hence
understanding the effect of approximate inference on BNN predictive distributions is
crucial. We turn to this subject in the next chapter.
Chapter 3
The expressiveness of approximate
inference in Bayesian neural networks
In Chapter 2 we saw that while Bayesian neural networks hold the promise of being
flexible, well-calibrated statistical models, inference requires approximations whose
consequences are poorly understood. Hence it is unclear to what extent the successes
(and failures) of approximate BNNs are attributable to the exact Bayesian predictive,
rather than peculiarities of the approximation method. From a Bayesian modelling
perspective, it is therefore crucial to ask, does the approximate predictive distribution
retain the qualitative features of the exact predictive?
In this chapter we present a study of the approximation quality of common approx-
imating family methods. In Section 3.3 we consider single-hidden layer BNNs, and
show a fundamental limitation in function space of two of the most commonly used
distributions defined in weight space: mean-field Gaussian and Monte Carlo dropout.
We find there are simple cases where neither method can have substantially increased
uncertainty in between well-separated regions of low uncertainty. We provide strong
empirical evidence that exact inference does not have this pathology, hence it is due to
the approximation and not the BNN model itself.
In Section 3.4 we consider deeper networks. In contrast to the single-hidden layer
case, we prove a universality result: there exist approximate posteriors
in the above classes which provide flexible uncertainty estimates. However, we find
empirically that pathologies of a similar form as in the single-hidden layer case can
persist when performing variational inference in deeper networks — i.e., these posteriors
are not found in practice. Our results motivate careful consideration of the implications
of approximate inference methods in BNNs.
The material in this chapter was previously published in ‘On the Expressiveness
of Approximate Inference in Bayesian Neural Networks’ (Foong et al., 2020b). The
research was conducted in collaboration with my co-first author David R. Burt, and
was supervised by Yingzhen Li and Richard E. Turner throughout. I was involved
closely with all aspects of the paper, including the theoretical results, the experiments
and the writing of the paper.
3.1 Criteria for successful approximation
In Section 2.3.2, we saw that many approximate inference methods for BNNs work by
defining a simple class of distributions over the model parameters (an approximating
family), and then choosing a member of this family as an approximation to the posterior.
Mean-field variational inference (MFVI) and Monte Carlo dropout (MCDO) are two of
the most commonly used instances of this approach. For such a method to succeed,
two criteria must be met:
Criterion 1 The approximating family must contain good approximations to the
posterior.
Criterion 2 The method must then select a good approximate posterior within this
family.
For nearly all tasks, the performance of a BNN only depends on the distribution
over weights to the extent that it affects the distribution over predictions (i.e. in
‘function space’). Hence for our purposes, a ‘good’ approximate posterior is one that
captures features of the exact posterior in function space that are relevant to the
task at hand. However, approximating families are often defined in weight space for
computational reasons. Evaluating Criterion 1 therefore involves understanding how
weight space approximations translate to function space, which is a non-trivial task for
highly nonlinear models such as BNNs.
In this chapter we provide both theoretical and empirical analyses of the flexibility
of the predictive mean and variance functions of approximate BNNs. Our main findings
are:
1. For shallow (i.e., single-hidden layer) BNNs, there exist simple situations where
no mean-field Gaussian or MC dropout distribution can faithfully represent the
exact posterior predictive uncertainty (Criterion 1 is not satisfied). We prove
in Section 3.3 that in these instances the predictive variance function of any fully-connected,
single-hidden layer ReLU BNN using these families suffers a lack of ‘in-between
uncertainty’: increased predictive uncertainty in between well-separated
regions of low uncertainty. This is especially problematic for lower-dimensional
data where we may expect some datapoints to be in between others. Examples
include spatio-temporal data, or Bayesian optimisation for hyperparameter search,
where we frequently wish to make predictions in unobserved regions in between
observed regions. We verify that the exact posterior predictive does not suffer from
this limitation; hence this pathology is attributable solely to the restrictiveness
of the approximating family. Furthermore, since this problem is tied to the
approximating family, any method that uses the mean-field Gaussian or MC
dropout families will be similarly restricted.
2. In contrast, in Section 3.4 we prove a universal approximation result showing
that the mean and variance functions of deep (more than 1 hidden layer) approx-
imate BNNs using mean-field Gaussian or MCDO distributions can uniformly
approximate any continuous function and any continuous non-negative function
respectively. However, it remains to be shown that appropriate predictive means
and variances will be selected when choosing the approximate posterior from
the approximating family. Since addressing this question requires assessing the
behaviour of the particular approximating family method, and not simply the
family itself, we choose to limit our study to variational inference, i.e., ELBO
optimisation, as the approximating family method. To test the fidelity of the
approximation, we focus on the low-dimensional, small data regime where com-
parisons to references for the exact posterior such as the limiting GP (Lee et al.,
2018; Matthews et al., 2018; Neal, 1995) are easier to make. In Section 3.4.2 we
provide empirical evidence that in spite of its theoretical flexibility (in terms of
the expressiveness of the variational family), VI in deep BNNs can still lead to dis-
tributions that suffer from similar pathologies to the shallow case, i.e. Criterion
2 is not satisfied.
Finally, in Section 3.5, we provide an active learning case study on a real-world
dataset showing how in-between uncertainty can be a crucial feature of the posterior
predictive. In this case, we provide evidence that although the inductive biases of
the BNN model with exact inference can bring considerable benefits, these are lost
when MFVI or MCDO are used. Code to reproduce our experiments can be found at
https://github.com/cambridge-mlg/expressiveness-approx-bnns.
3.2 Priors and references for the exact predictive
In this chapter, our goal is to examine how closely approximate BNN predictive
distributions resemble exact inference. To make this comparison, a choice of BNN prior
must be made. As we noted in Section 2.2.3, common practice is to choose independent
Gaussian priors. Furthermore, it is common to set these to be standard Normal N (0, 1)
priors for all parameters, regardless of the size of the network. However, such priors
are known to lead to extremely large prior predictive variances in function space for
wide or deep networks (Neal, 1995).
For example, choosing a standard normal prior for a 4-hidden layer BNN with 50
neurons in each layer leads to a prior standard deviation of ∼10^3 for the output of the
network at x = 0. This is orders of magnitude too large to reflect our prior beliefs for
normalised data. It is conceivable that one may combine an unreasonable prior such as
this with poor approximate inference to obtain practically useful uncertainty estimates
that bear little relation to the exact Bayesian predictive — we do not consider this
case. Instead, we focus our study on the quality of approximate inference in models
with more moderate prior variances in function space.
There is a body of literature on BNN priors (Lee et al., 2018; Matthews et al., 2018;
Neal, 1995; Schoenholz et al., 2017) which shows how to select prior weight variances
that lead to reasonable prior variances in function space, even as the width of the
hidden layers tends to infinity. For a layer with N_in inputs, we choose independent
N(0, σw²/N_in) priors for the weights, with σw² a width-independent constant. As the
width tends to infinity, both the prior and posterior of such a BNN converge to a
well-defined Gaussian process (GP) (Hron et al., 2020; Matthews et al., 2018; Neal,
1995). This convergence does not occur if we omit the scaling by 1/N_in. We hence
include this scaling when specifying our BNN priors.
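The effect of the 1/N_in scaling on the prior predictive at x = 0 can be checked with a short calculation. The sketch below is illustrative only and is not taken from the accompanying repository: it assumes ReLU activations and unit-variance bias priors, the choice σw² = 2 for the scaled prior is ours, and the function names are hypothetical. It relies on the fact that pre-activations are symmetric about zero under a zero-mean prior, so that E[ψ(a)²] = E[a²]/2, and cross-checks the resulting recursion by Monte Carlo.

```python
import numpy as np


def prior_std_at_origin(n_hidden_layers, width, weight_var, bias_var=1.0):
    """Exact prior std of f(0) for a ReLU BNN with i.i.d. zero-mean Gaussian priors.

    weight_var is the prior variance of every weight, bias_var that of every bias.
    """
    # At x = 0 the first-layer pre-activation is just its bias.
    second_moment = bias_var
    for _ in range(n_hidden_layers):
        # Pre-activations are symmetric about zero, so E[relu(a)^2] = E[a^2] / 2.
        second_moment = width * weight_var * second_moment / 2.0 + bias_var
    return float(np.sqrt(second_moment))


def mc_prior_std_at_origin(n_hidden_layers, width, weight_var, bias_var=1.0,
                           n_samples=2000, seed=0):
    """Monte Carlo estimate of the same quantity, drawing a fresh network per sample."""
    rng = np.random.default_rng(seed)
    # First hidden layer at x = 0: pre-activations equal the biases.
    a = np.sqrt(bias_var) * rng.standard_normal((n_samples, width))
    for _ in range(n_hidden_layers - 1):
        W = np.sqrt(weight_var) * rng.standard_normal((n_samples, width, width))
        b = np.sqrt(bias_var) * rng.standard_normal((n_samples, width))
        a = np.einsum("nij,nj->ni", W, np.maximum(a, 0.0)) + b
    w_out = np.sqrt(weight_var) * rng.standard_normal((n_samples, width))
    b_out = np.sqrt(bias_var) * rng.standard_normal(n_samples)
    f0 = np.sum(w_out * np.maximum(a, 0.0), axis=1) + b_out
    return float(f0.std())


standard_normal_prior = prior_std_at_origin(4, 50, weight_var=1.0)
scaled_prior = prior_std_at_origin(4, 50, weight_var=2.0 / 50)  # sigma_w^2 = 2 (assumed)
mc_check = mc_prior_std_at_origin(4, 50, weight_var=1.0)
print(standard_normal_prior, scaled_prior, mc_check)
```

Under these assumptions the standard normal prior gives a standard deviation of roughly 640 at the origin, which is of the order 10³ quoted above, whereas the scaled prior gives √5 ≈ 2.2.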
It has been shown with extensive Markov chain Monte Carlo simulation that
3-hidden layer BNNs with just 50 units per layer already closely resemble their cor-
responding infinite-width GP counterparts (Matthews et al., 2018). In this chapter,
we examine BNNs of up to 10 hidden layers. It is uncertain whether finite-width
BNNs of such large depths will still resemble their infinite-width counterparts as closely.
However, the GP predictive may still act as a useful qualitative reference for what
we expect of the exact predictive in the finite-width case. We hence use both exact
inference in the corresponding infinite-width GP and also ‘gold-standard’ Hamiltonian
Monte Carlo (HMC) (Hoffman and Gelman, 2014; Neal et al., 2011) as references for
the exact posterior.
3.3 Single-hidden layer neural networks
In this section, we present our results stating that for single-hidden layer (1HL) ReLU
BNNs, QFFG and QMCDO are not expressive enough to satisfy Criterion 1 in situations
where in-between uncertainty is important. We identify limitations on the variance in
function space, V[f(x)], implied by these families. We show empirically that the exact
posterior does not have these restrictions, implying that approximate inference does
not qualitatively resemble the posterior.
Theorem 1 (Factorised Gaussian). Consider any single-hidden layer fully-connected
ReLU neural network f : R^D → R. Let x_d denote the dth element of the input vector x.
Assume a fully factorised Gaussian distribution over the parameters, i.e., the QFFG
approximating family. Consider any points p, q, r ∈ R^D such that r ∈ \overrightarrow{pq} and either:
i. \overrightarrow{pq} contains 0 and r is closer to 0 than both p and q,
ii. \overrightarrow{pq} is orthogonal to and intersects the plane x_d = 0, and r is closer to the plane
x_d = 0 than both p and q.
Then V[f(r)] ≤ V[f(p)] + V[f(q)].
Remark 1. In Theorem 7 in Appendix A we actually prove a stronger result than
Theorem 1, which also applies to approximating families that have certain correlations.
For example, the bound still holds when the weights coming out of a neuron in the
hidden layer are correlated with each other.
In words, Theorem 1 states that there are line segments in input space (illustrated
in Figure 3.1) such that the predictive variance on the line is bounded by the sum of
the variance at the endpoints. This restriction is problematic in situations where we
would like the BNN to express higher epistemic uncertainty on a line segment joining
regions with lower epistemic uncertainty.
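Condition (i) of Theorem 1 can be probed numerically, because the predictive variance of a 1HL ReLU network under a fully factorised Gaussian is available in closed form. The following is a minimal sketch under our own assumptions, not the code used for the experiments in this chapter: the parameter names and sampling ranges are hypothetical, and the moments of a rectified Gaussian are the standard closed-form expressions. It draws random variational parameters and random triples (p, q, r) on lines through the origin, and records the largest value of V[f(r)] − V[f(p)] − V[f(q)].

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)


def relu_moments(mu, s):
    """E[max(0, a)] and E[max(0, a)^2] for a ~ N(mu, s^2), elementwise."""
    z = mu / s
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return mu * cdf + s * pdf, (mu**2 + s**2) * cdf + mu * s * pdf


def ffg_variance(x, W_mean, W_var, b_mean, b_var, w_mean, w_var, b_out_var):
    """Exact V[f(x)] for a 1HL ReLU network with independent Gaussian parameters."""
    mu = W_mean @ x + b_mean           # mean of each pre-activation
    s = np.sqrt(W_var @ x**2 + b_var)  # std of each pre-activation
    m1, m2 = relu_moments(mu, s)
    # V[w * psi] = E[w^2] E[psi^2] - (E[w] E[psi])^2; neurons are independent.
    per_neuron = (w_mean**2 + w_var) * m2 - (w_mean * m1) ** 2
    return float(per_neuron.sum() + b_out_var)


def worst_gap_condition_i(n_trials=200, dim=3, width=50, seed=0):
    """Largest V[f(r)] - V[f(p)] - V[f(q)] over random draws satisfying condition (i)."""
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for _ in range(n_trials):
        params = dict(
            W_mean=rng.normal(size=(width, dim)),
            W_var=rng.uniform(0.01, 1.0, size=(width, dim)),
            b_mean=rng.normal(size=width),
            b_var=rng.uniform(0.01, 1.0, size=width),
            w_mean=rng.normal(size=width),
            w_var=rng.uniform(0.01, 1.0, size=width),
            b_out_var=rng.uniform(0.01, 1.0),
        )
        p = rng.normal(size=dim)
        c = rng.uniform(0.5, 2.0)
        q = -c * p                       # the segment from p to q contains the origin
        t = rng.uniform(-min(1.0, c), min(1.0, c))
        r = t * p                        # r is closer to 0 than both p and q
        gap = ffg_variance(r, **params) - ffg_variance(p, **params) - ffg_variance(q, **params)
        worst = max(worst, gap)
    return worst


worst_gap = worst_gap_condition_i()
print(worst_gap)
```

If Theorem 1 holds, the recorded gap should never be positive, whatever the variational parameters.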
Analogous but weaker bounds on higher dimensional sets in input space enclosed
by these lines can be obtained as a corollary. For instance, consider the case where
the input domain is R². Let p, q, r, s be the four corners of a rectangle containing
the origin. For any point a in the rectangle, we can upper bound V[f(a)] by the
sum of the variances at the points at the top and bottom edges of the rectangle
with the same horizontal coordinate as a (Theorem 1, condition (ii)). These in turn
can be upper bounded in terms of the variances at the corners of the rectangle,
again by applying Theorem 1. Hence we have that for any point a in the rectangle,
V[f(a)] ≤ V[f(p)] + V[f(q)] + V[f(r)] + V[f(s)].
[Figure 3.1: two panels (left and right), each showing the (x1, x2) input plane with the labelled points p, r, q and q′.]
Fig. 3.1 Illustration of the bounded regions implied by Theorem 1, showing the input
domain of a 1HL mean-field Gaussian BNN, for the case x ∈ R². Left: For any two
points p, q ∈ R² such that the line joining them crosses the origin, the output variance
at any point r on the solid red portion of the line is upper bounded by V[f(p)] + V[f(q)],
illustrating condition (i) of Theorem 1. Right: For any two points p, q ∈ R² such that
the line joining them is orthogonal to and intersects a plane x_d = 0, the output variance
at any point r on the solid red portion of the line is upper bounded by V[f(p)] + V[f(q)],
illustrating condition (ii) of Theorem 1. The bounded segments (in red) extend from
q = (q1, q2) to q′, where q′ = (−q1, −q2) (Left, condition (i)), or q′ = (q1, −q2) (Right,
condition (ii)).
Similarly, for higher-dimensional input domains, the variance at any point inside
an axis-aligned hyperrectangle containing the origin can be bounded by the sum of the
variances on its vertices, and we can obtain tighter bounds on diagonals and faces of the
hyperrectangle, by repeatedly applying Theorem 1. This again could be problematic if
we required the BNN to express high epistemic uncertainty inside the hyperrectangle
whilst having much lower epistemic uncertainty at its vertices/edges. However, we note
that these bounds become exponentially weaker as the dimensionality of the bounded
region increases, so the theorem is most informative when bounding the variance on
lower dimensional regions such as lines, which we focus on for the remainder of this
chapter.
Theorem 1 applies to 1HL BNNs of any width using any approximating family
method which uses QFFG, as listed in Section 2.3.2. We also prove related results for
MC dropout. Here the behaviour is different depending on whether dropout is applied
to the inputs of the network:
Theorem 2 (MC dropout with inputs not dropped out). Consider the same network
architecture as in Theorem 1. Assume an MC dropout distribution over the parameters,
with inputs not dropped out, i.e. the first weight matrix is deterministic. Then V[f(x)]
is convex in x.
Theorem 2 implies the predictive variance on any line segment in input space is
bounded by the maximum of the variance at its endpoints, as this is a straightforward
consequence of convexity. A weaker statement is true if we also apply dropout to the
inputs:
Theorem 3 (MC dropout with inputs dropped out). Consider the same network
architecture as in Theorem 1. Assume an MC dropout distribution over the parameters,
with inputs dropped out, i.e. the first weight matrix has a dropout distribution. Then,
for any finite set of points S ⊂ R^D such that 0 is in the convex hull of S,
\[ \mathbb{V}[f(0)] \le \max_{s \in S} \mathbb{V}[f(s)]. \tag{3.1} \]
This is illustrated in Figure 3.2.
Although weaker than Theorem 2, Theorem 3 still implies pathological behaviour
whenever the origin should have higher epistemic uncertainty than points surrounding
it.
[Figure 3.2: the (x1, x2) input plane, showing a set of blue points, their convex hull shaded in light blue, and the origin marked in red.]
Fig. 3.2 Schematic illustration of the bound in Theorem 3, showing the input domain
of a single-hidden layer MC dropout BNN, for the case x ∈ R² with dropout applied to
the inputs. The convex hull (in light blue) of the blue points contains the origin. Hence
Theorem 3 implies the variance at the origin (red point) cannot exceed the largest of
the variances at the blue points.
Remark 2. As it is more common not to apply dropout to the inputs of a network
(see Figure 4.5 in Gal (2016)), we will focus on that case in this chapter. Hence when
we refer to MC dropout or QMCDO without any further qualification, we always mean
dropout is applied to the hidden features but not to the input.
Remark 3. Although Theorems 1 to 3 are stated for networks with a single scalar
output for brevity, for networks with multiple outputs, these theorems hold for each
output separately. See Appendix A for more general statements of these results.
Full proofs of Theorems 1 to 3 are provided in Appendix A. Theorems 1 to 3
show that there are simple cases where 1HL approximate BNNs using QFFG and
QMCDO cannot represent in-between uncertainty : i.e., increased uncertainty in between
well separated regions of low uncertainty. As Theorems 1 to 3 depend only on the
approximating family, this cannot be fixed by improving the optimiser, regulariser or
prior.
3.3.1 Numerical verification of theorems
We next verify Theorems 1 and 2 numerically. Since we are concerned with whether
there are any distributions that show in-between uncertainty, we do not maximise the
ELBO in this experiment (we consider ELBO maximisation in Sections 3.3.4 and 3.4.2).
[Figure 3.3: two panels plotting V[f(x)] against x ∈ [−1, 1]. Left: Target, FFG and Bound. Right: Target and MCDO.]
Fig. 3.3 Results of directly minimising the squared error in function space between
V[f(x)] (for a single-hidden layer NN) and a target variance function. Left: FFG
distribution, Right: MCDO distribution. The bound implied by Theorem 1 for FFG
distributions (red) applies on [−1, 1] with p = −1, q = 1. The MCDO variance function
is convex, as implied by Theorem 2, and almost constant. The FFG and MCDO
variance functions underestimate the target variance near the origin and overestimate
it away from the origin due to the restrictiveness of the approximating family.
Instead, we train 1HL networks of width 50 with QFFG and QMCDO distributions to
directly minimise the squared error between V[f(x)] and a pre-specified target variance
function which displays in-between uncertainty.
In detail, we generate a dataset consisting of two separated clusters of datapoints
in one dimension. We then fit a Gaussian process to the dataset and compute the
predictive mean and variance of the GP on a one-dimensional grid X consisting of 40
points. Let µ(X) ∈ R⁴⁰ denote the mean of the GP posterior predictive at these points
and σ²(X) ∈ R⁴⁰ denote the variance. We define the loss function as
\[ L(\phi) = \lVert \mathbb{E}_{q_\phi}[f(X)] - \mu(X) \rVert_2^2 + \lVert \mathbb{V}_{q_\phi}[f(X)] - \sigma^2(X) \rVert_2^2. \tag{3.2} \]
This loss function encourages the predictive mean and variance of the BNN to directly
match that of the GP, which displays in-between uncertainty. The expectation and
variance of f(X) are Monte Carlo estimated using 128 samples. We use the Adam
optimiser and full-batch training with a learning rate of 1 × 10⁻³ for 50,000 iterations.
A dropout rate of 0.05 is used for MCDO. Weights and biases are initialised at the
prior for MFVI. The results are shown in Figure 3.3. We see that even when trained
to directly minimise this objective, 1HL BNNs cannot successfully mimic the GP’s
in-between uncertainty, since that would violate Theorems 1 and 2.
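The Monte Carlo estimate of the loss in Equation (3.2) can be sketched as follows. This is an illustration under our own assumptions rather than the training code itself: the function name is hypothetical, the biased (ddof = 0) variance estimator is an arbitrary choice, and in practice the samples would be produced by a differentiable sampler so that the loss can be minimised with a gradient-based optimiser.

```python
import numpy as np


def moment_matching_loss(f_samples, target_mean, target_var):
    """Monte Carlo version of Equation (3.2).

    f_samples has shape (n_samples, n_grid): each row is one sampled function
    evaluated on the grid X. target_mean and target_var have shape (n_grid,).
    """
    mc_mean = f_samples.mean(axis=0)
    mc_var = f_samples.var(axis=0)  # biased estimator (ddof=0); an assumption
    return float(np.sum((mc_mean - target_mean) ** 2)
                 + np.sum((mc_var - target_var) ** 2))


# Toy usage: two sampled functions on a two-point grid.
toy_samples = np.array([[0.0, 0.0], [2.0, 2.0]])
toy_loss = moment_matching_loss(toy_samples, np.array([1.0, 1.0]), np.array([0.0, 0.0]))
print(toy_loss)
```

In the toy usage the sample mean matches the target exactly, so the loss comes entirely from the variance term.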
Although Theorems 1 and 2 apply only to 1HL BNNs, 1HL BNN regression tasks
have been a common benchmark in the BNN literature (Gal and Ghahramani, 2016;
Hernández-Lobato and Adams, 2015; Mukhoti et al., 2018; Sun et al., 2019; Tomczak
et al., 2018), and have been used to assess different inference methods.
3.3.2 In-between uncertainty in other regions of input space
Although Theorem 2 implies a bound on any line in input space, Theorem 1 only
bounds lines in input space meeting specific criteria. For BNNs with higher input
dimensionality, these criteria are less likely to be satisfied by general lines in input
space. Hence it is unclear whether the lack of in-between uncertainty occurs only on
these special lines, or is a more general feature of the approximate posterior predictive.
To answer this, we next show empirically that for a BNN with a 5-dimensional
input space, lines joining random points in input space also tend to suffer from a lack
of in-between uncertainty. We generate two Gaussian clusters of input locations, with
the centres of the clusters randomly chosen to lie on a sphere of radius √5 centred at
the origin. We generate the output values corresponding to each input location by
sampling from the wide-limit BNN GP. We then train MFVI and MCDO BNNs on the
data, and compare the predictive distribution to that of the wide-limit GP. We choose
σw = √2, σb = 1, networks of width 50 and a dropout probability of p = 0.05 for
MCDO. We set the observation noise standard deviation to 0.01, which is the ground
truth value used to generate the synthetic data. This is repeated for three random
samplings of the dataset. We then visualise the predictive uncertainty along the line
segments in input space joining the centres of the two datapoint clusters.
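The input locations and the line along which the predictive is visualised can be generated along the following lines. This is only a sketch: the number of points per cluster, the cluster standard deviation and the parameterisation of the line by signed distance from the midpoint are our assumptions and are not specified in the text, and the function names are hypothetical.

```python
import numpy as np


def sample_two_clusters(dim=5, radius=np.sqrt(5.0), n_per_cluster=50,
                        cluster_std=0.1, seed=0):
    """Two Gaussian clusters whose centres lie on a sphere of the given radius."""
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((2, dim))
    centres = radius * directions / np.linalg.norm(directions, axis=1, keepdims=True)
    inputs = np.concatenate(
        [c + cluster_std * rng.standard_normal((n_per_cluster, dim)) for c in centres]
    )
    return centres, inputs


def line_through_centres(centres, lambdas):
    """Points x(lambda) on the line through both centres.

    lambda is the signed distance from the midpoint (an assumed convention).
    """
    midpoint = centres.mean(axis=0)
    offset = centres[1] - centres[0]
    unit = offset / np.linalg.norm(offset)
    return midpoint + np.outer(lambdas, unit)


centres, inputs = sample_two_clusters()
half_length = 0.5 * np.linalg.norm(centres[1] - centres[0])
endpoints = line_through_centres(centres, np.array([-half_length, half_length]))
```

With this convention the two cluster centres sit at λ = ± half the distance between them.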
In Figure 3.4 we see that although exact inference with the wide-limit GP exhibits
in-between uncertainty, this is lost by both MFVI and MCDO. For MCDO, this is
expected as Theorem 2 implies that MCDO’s predictive variance will be convex along
any line, including the lines plotted. In contrast, Theorem 1 only applies to certain
lines in input space, and does not bound the variance on general lines in input space
like the lines in Figure 3.4. However, we still see that MFVI and MCDO are often more
confident in between the data clusters than at the data clusters, which intuitively is a
poor reflection of epistemic uncertainty. Hence Figure 3.4 lends support to the idea
that the pathology in Theorem 1 is symptomatic of a lack of in-between uncertainty on
more general lines in input space than the conditions of the theorem statement imply.
[Figure 3.4: 3 × 3 grid of plots of f(x(λ)) against λ. Columns: GP, MFVI, MCDO. Each row corresponds to one randomly generated dataset.]
Fig. 3.4 Mean and 2 standard deviation bars of the predictive distribution on lines
joining random clusters of data, for single-hidden layer BNNs. Each row represents
the same randomly generated dataset. We also plot the projection of the 5-dimensional
data onto this line segment, where the coordinate along the line segment is denoted λ.
Note that the data appears very noisy in some of the plots, but this appearance is due
to the projection onto a lower-dimensional space.
3.3.3 Intuition for results
We now provide intuition for the proofs of Theorems 1 to 3. Let θin be the parameters in
the first layer. By the law of total variance, V[f(x)] = E[V[f(x)|θin]] + V[E[f(x)|θin]].
For QMCDO the second term is 0 as θin is deterministic, since the input weights are not
dropped out. Hence to prove Theorem 2 (MCDO without dropping out input weights),
it suffices to show the first term is convex. We have:
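\begin{align}
\mathbb{V}[f(x) \mid \theta_{\mathrm{in}}] &= \mathbb{V}\left[ \sum_{i=1}^{I} w_i \psi(a_i(x; \theta_{\mathrm{in}})) + b \,\middle|\, \theta_{\mathrm{in}} \right] \tag{3.3} \\
&= \sum_{i=1}^{I} \mathbb{V}[w_i] \, \psi(a_i(x; \theta_{\mathrm{in}}))^2 + \mathbb{V}[b], \tag{3.4}
\end{align}
where \{w_i\}_{i=1}^{I} and b are the output weights and bias, ψ(a) = max(0, a), and a_i(x; θin)
is the activation of the ith neuron. Since a_i(x; θin) is affine in x, ψ(a_i(x; θin))² is a ‘half
quadratic’ in x and therefore convex. As each V[w_i] is non-negative, the sum in Equation (3.4)
is a non-negative combination of convex functions and so is itself convex. This proves Theorem 2.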
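Equation (3.4) makes the convexity easy to check numerically. The sketch below is a hedged illustration, not the experimental code: it assumes Bernoulli dropout masks applied to the hidden units without rescaling, so that the effective output weight is mask_i × w_i with variance p(1 − p)w_i², and it assumes the output bias is deterministic. The function names are hypothetical. It tests midpoint convexity, V[f((x + y)/2)] ≤ (V[f(x)] + V[f(y)])/2, on random pairs of inputs.

```python
import numpy as np


def mcdo_variance(x, W_in, b_in, w_out, p_drop):
    """V[f(x)] from Equation (3.4) for dropout on hidden units only.

    Assumes mask_i ~ Bernoulli(1 - p_drop) multiplies w_out[i] with no rescaling,
    and a deterministic output bias (so V[b] = 0).
    """
    a = W_in @ x + b_in                          # deterministic pre-activations
    var_w = p_drop * (1.0 - p_drop) * w_out**2   # variance of mask_i * w_out[i]
    return float(np.sum(var_w * np.maximum(a, 0.0) ** 2))


def max_midpoint_violation(n_pairs=1000, dim=5, width=50, p_drop=0.05, seed=0):
    """Largest value of V(midpoint) - average of V at the endpoints over random pairs."""
    rng = np.random.default_rng(seed)
    W_in = rng.standard_normal((width, dim))
    b_in = rng.standard_normal(width)
    w_out = rng.standard_normal(width)
    worst = -np.inf
    for _ in range(n_pairs):
        x, y = 3.0 * rng.standard_normal((2, dim))
        mid = mcdo_variance(0.5 * (x + y), W_in, b_in, w_out, p_drop)
        avg = 0.5 * (mcdo_variance(x, W_in, b_in, w_out, p_drop)
                     + mcdo_variance(y, W_in, b_in, w_out, p_drop))
        worst = max(worst, mid - avg)
    return worst


violation = max_midpoint_violation()
print(violation)
```

For a convex variance function the recorded violation should never exceed zero, up to floating point error.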
In order to prove Theorem 3 (MCDO with dropping out input weights), we now
need to consider the effect of randomness in the input weights. However, we know that
when x = 0, dropping out the input features has no effect, since the input will take the
value 0 regardless of whether it is dropped out. Hence V[E [f(0)|θin]] = 0. Theorem 3
follows easily by combining this with the fact that E [V[f(x)|θin]] is convex.
To arrive at Equation (3.4), we used the fact that for QMCDO, the output weights of
each neuron are independent. Since this is also the case for QFFG, Equation (3.4) also
applies to QFFG. If correlations between the weights were allowed in the posterior (such
as with QFCG), this could introduce negative covariance terms, leading to non-convex
behaviour. However, in a factorised posterior, this is not possible. Thus we see here a
concrete instance of how weight space factorisation assumptions can lead to function
space restrictions on the predictive uncertainty.
To complete the proof of Theorem 1 for QFFG, we need to analyse V[E [f(x)|θin]]
when θin follows a mean-field Gaussian distribution. Because of the factorisation
assumptions on the weights in the first layer, this term is a positive linear combination
of the variances of each activation function. While these variance functions are not
convex, they satisfy certain restrictive conditions that imply bounds on arbitrary
positive linear combinations. Roughly speaking, they resemble quadratic functions
that are truncated to zero at the point where the ReLU is saturated. One such typical
variance function is shown in Figure 3.5. We provide a rigorous characterisation of the
limitations in expressiveness of positive linear combinations of such functions, along
with full proofs of Theorems 1 to 3, in Appendix A.
[Figure 3.5: plot of Var[ReLU(Wx + b)] against x.]
Fig. 3.5 The contribution to V[E [f(x)|θin]] made by a single neuron, for some choice
of the Gaussian variational distribution over θin. Here the weight W and bias b are
part of the input layer, i.e., W, b ∈ θin. The full value of V[E [f(x)|θin]] is given by a
positive linear combination of such terms, one for each neuron, due to the factorisation
assumptions. Although this function is not convex, it resembles a quadratic function
that has been truncated to zero. In Appendix A we show that arbitrary positive linear
combinations of these functions necessarily suffer a lack of in-between uncertainty.
3.3.4 Empirical tests of approximate inference in single-hidden
layer BNNs
It is not immediately apparent that Theorems 1 and 2 are problematic from the
perspective of Bayesian inference. For example, even exact inference in a Bayesian
linear regression model results in a convex predictive variance function. In that case,
the lack of in-between uncertainty is not due to poor inference, but is instead due to
the linear modelling assumption. If this assumption genuinely reflects our prior beliefs
about the regression task, then the lack of in-between uncertainty is not problematic.
Here we provide strong evidence that, in contrast, the modelling assumptions of
1HL BNNs lead to exact posteriors that do show in-between uncertainty. Theorems 1
to 3 thus imply that it is approximate inference with QFFG or QMCDO that fails to
reflect this intuitively desirable property of the exact predictive, violating Criterion
1.
Figure 3.6 compares the predictive distributions obtained from MFVI and MCDO
(here we optimise the ELBO for MFVI and the standard MCDO objective, in contrast
with Figure 3.3 — see Appendix B for experimental details) with HMC and the limiting
GP on a regression dataset consisting of two clusters of covariates. We use 1HL BNNs
with 50 hidden units and ReLU activations. The HMC and limiting GP posteriors are
almost indistinguishable, suggesting they both resemble the exact predictive. For these
methods V[f(x)] is markedly larger near the origin than near the data. In contrast,
MFVI and MCDO are as confident in between the data as they are near the data. This
provides strong evidence that the lack of in-between uncertainty is not a feature of the
BNN model or prior, but is caused by approximate inference.
[Figure 3.6: four panels, (a) Infinite-width GP, (b) HMC, (c) MFVI, (d) MCDO. Each panel shows a colour plot of σ[f(x)] over x1, x2 ∈ [−2, 2], with a plot of f(x(λ)) against λ beneath.]
Fig. 3.6 Regression on a 2D synthetic dataset (red crosses). The colour plots show
the standard deviation of the output, σ[f(x)], in 2D input space, in the square of side
length 4 centred at the origin. The plots beneath show the mean with 2-standard
deviation bars along the dashed white line (parameterised by λ, where λ = 0 at the
origin and takes the values −2√2, 2√2 at the corners of the square). MFVI and
MCDO are overconfident for λ ∈ [−1, 1]. Theorems 1 and 2 explain this: given
the predictive variance is near zero at the data clusters, there is no setting of the
variational parameters that induces a predictive variance much greater than zero in
the line segment between them.
3.4 Deeper networks
Theorems 1 to 3 pose an important question: is the structural limitation observed
in the 1HL case fundamental to QFFG and QMCDO even in deeper networks, or can
depth help these approximations satisfy Criterion 1? In Theorem 4, we provide
universality results for the mean and variance functions of approximate BNNs with
at least two hidden layers using QFFG and QMCDO (with the inputs not dropped out).
As the predictive mean and variance often determine the performance of BNNs in
regression applications, this provides theoretical evidence that, for many applications,
approximate inference in deep BNNs satisfies Criterion 1:
Theorem 4 (Universality of deeper networks). Let m be any continuous function
on a compact set A ⊂ R^D, and let v be any continuous, non-negative function on A.
For any ϵ > 0, for both QFFG and QMCDO there exists a 2HL ReLU BNN such that
sup_{x∈A} |E[f(x)] − m(x)| < ϵ and sup_{x∈A} |V[f(x)] − v(x)| < ϵ simultaneously.
Remark 4. If MC dropout is used with the inputs also dropped out, the analogous
statement to Theorem 4 is false. In Appendix C.2 we provide a counterexample that
holds for arbitrarily deep networks and shows that if inputs are dropped out, V[f ] cannot
be made small at two points x1, x2 which have significantly different values of E [f(x1)]
and E [f(x2)].
Figure 3.7 shows the result of directly minimising the squared error between the
network output mean and variance and a given target mean and variance function,
using the same method as with the 1HL network in Figure 3.3, but this time with two
hidden layers. In contrast to Figure 3.3, the variances of both QFFG and QMCDO are
able to fit the target very closely.
[Figure 3.7: two panels plotted against x ∈ [−1, 1]. Left: E[f(x)]. Right: V[f(x)]. Each panel shows Target, FFG and MCDO.]
Fig. 3.7 Results of minimising the squared error in function space between E[f(x)]
and a target mean function (left), and between V[f(x)] and a target variance function
(right), for a 2-hidden layer BNN with FFG and MCDO distributions. All three lines
overlap, indicating a very close fit to the target.
While Theorem 4 gives some cause for optimism for approximating family methods
with deep BNNs, it shows only that the mean and variance of pointwise marginal
distributions of the output are universal (i.e., it does not tell us about higher moments
of the predictive or covariances between different outputs). Additionally, and crucially,
it does not say whether good distributions will actually be found by an optimiser
when maximising an objective such as the ELBO, i.e., it does not address Criterion
2. Addressing Criterion 2 theoretically is more challenging since we must make a
statement not only about the variational family, but about the optimum of the ELBO
within the variational family. Such an analysis has indeed been conducted in recent
work (Coker et al., 2022), which we will discuss in Section 3.6.2.
3.4.1 Proof sketch of Theorem 4
To prove Theorem 4 for QFFG, we provide a construction that relies on the universal
approximation theorem for deterministic NNs (Leshno et al., 1993). We illustrate
this construction schematically in Figure 3.8. Consider a 2HL NN whose second
hidden layer has two neurons, with activations a1, a2. Let w1, w2 denote the weights
connecting a1, a2 to the output, and b denote the output bias, such that the output
f(x) = w1ψ(a1)+w2ψ(a2)+b. In this construction, a1 will be used to control the mean,
and a2 the variance, of the BNN output. By setting the variances of the parameters
in the first two linear layers to be sufficiently small, we can consider a1 and a2 to be
essentially deterministic functions of x. By the universal approximation theorem, a1
and a2 can approximate any continuous functions. Recall that we would like the mean
function of the BNN to approximate m(x) and the variance function to approximate
v(x). Choose a1 ≈ m(x) − min_{x′∈A} m(x′) and a2 ≈ √v(x).¹ Choose E[b] = min_{x′∈A} m(x′),
V[b] ≈ 0; E[w1] = 1, V[w1] ≈ 0; and E[w2] = 0, V[w2] = 1.
¹ Recall that here A denotes the input domain of the neural network; see Theorem 4.
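By linearity of expectation, the factorisation assumptions, and a1, a2 ≥ 0:
\begin{align*}
\mathbb{E}[f(x)] &= \mathbb{E}[w_1 \psi(a_1) + w_2 \psi(a_2) + b] \\
&= \mathbb{E}[w_1]\,\mathbb{E}[\psi(a_1)] + \mathbb{E}[w_2]\,\mathbb{E}[\psi(a_2)] + \mathbb{E}[b] \\
&\approx m(x) - \min_{x' \in A} m(x') + \min_{x' \in A} m(x') \\
&= m(x),
\end{align*}
as desired. By the law of total variance, the variance of the network output is
\begin{align*}
\mathbb{V}[f(x)] &= \mathbb{E}[\mathbb{V}[f(x) \mid a_1, a_2]] + \mathbb{V}[\mathbb{E}[f(x) \mid a_1, a_2]] \\
&\approx \mathbb{E}[\mathbb{V}[f(x) \mid a_1, a_2]] \\
&\approx \mathbb{E}[\psi(a_2)^2] + \mathbb{V}[b] \\
&\approx v(x),
\end{align*}
where we used that w1, b are essentially deterministic and V[E[f(x)|a1, a2]] ≈ 0 since
a1, a2 are essentially deterministic. Also, we have that ψ(a2) ≈ a2 since a2 ≈ √v(x) ≥ 0.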
The approximations come from the standard universal function approximation theorem,
and the variances of weights not being set exactly to 0 so that we remain in QFFG.
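The construction can be checked numerically in its idealised form, where a1 and a2 are treated as exactly deterministic functions of x. The sketch below makes that simplification explicit and is not part of the rigorous proof: the targets m and v, the grid standing in for A, the value of the small variance eps and the function name are all our own illustrative choices. Independence of w1, w2 and b is the factorisation assumption of QFFG.

```python
import numpy as np


def constructed_moments(m_vals, v_vals, eps=1e-6):
    """Mean and variance of f(x) = w1*relu(a1) + w2*relu(a2) + b under the construction.

    a1 and a2 are treated as deterministic; w1, w2, b are independent Gaussians with
    E[w1] = 1, V[w1] = eps, E[w2] = 0, V[w2] = 1, E[b] = beta, V[b] = eps.
    """
    beta = m_vals.min()    # stands in for the minimum of m over A (here, over the grid)
    a1 = m_vals - beta     # non-negative, so relu(a1) = a1
    a2 = np.sqrt(v_vals)   # non-negative, so relu(a2) = a2
    mean = 1.0 * a1 + 0.0 * a2 + beta
    var = eps * a1**2 + 1.0 * a2**2 + eps
    return mean, var


x = np.linspace(-1.0, 1.0, 101)
m_target = np.sin(3.0 * x)   # illustrative target mean
v_target = x**2              # illustrative target variance
mean, var = constructed_moments(m_target, v_target)
max_mean_error = float(np.max(np.abs(mean - m_target)))
max_var_error = float(np.max(np.abs(var - v_target)))
print(max_mean_error, max_var_error)
```

The variance error is at most eps × (max a1² + 1), so it can be made as small as desired while keeping every weight variance strictly positive.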
A mathematically rigorous proof, along with a proof for QMCDO with any dropout
rate p ∈ (0, 1), is given in Appendix C.1. The main technical challenge in the proof for
QFFG is to establish the validity of the argument presented above when the required
weights are not deterministic (since strictly speaking deterministic weight distributions
would not lie in QFFG), but are instead Gaussian-distributed with very small variances.
The proof for QMCDO uses a somewhat similar construction, but is more involved as
we cannot set individual weights to be essentially deterministic, due to the nature of
the dropout distribution.
Fig. 3.8 Schematic illustration of the construction used to prove that there exist 2-
hidden layer BNNs using QFFG which are able to approximate any predictive mean
function m and variance function v. Here the blue weights are used to approximate
the mean function and the red weights are used to approximate the variance function,
and β is a shorthand for min_{x′∈A} m(x′). Only six neurons are shown in the first hidden
layer for illustrative purposes, but the universal approximation theorem may require
many more depending on the desired approximation error ϵ.
3.4.2 Empirical tests of approximate inference in deep BNNs
We now consider empirically whether the distributions found by optimising the ELBO
with these families resemble the exact predictive distribution (Criterion 2). To do
this, we consider the dataset from Figure 3.6 and define the ‘overconfidence ratio’ at
an input x as γ(x) = (V_GP[f(x)] / V_qϕ[f(x)])^{1/2}, where V_GP is the predictive variance
of exact inference in the infinite-width BNN, and V_qϕ is the predictive variance of
the approximate posterior. We then compute γ(x) at 300 points \{x_n\}_{n=1}^{300} evenly
spaced along the dashed white line joining the data clusters in Figure 3.6, i.e., from
x = (−1.2, −1.2) to x = (1.2, 1.2). We then create boxplots of the values \{γ(x_n)\}_{n=1}^{300}
for varying BNN depths. If the BNNs are wide enough, accurate inference should lead
to similar uncertainty estimates to the limiting GP, i.e. the boxplot should be tightly
centred around 1 (dashed line). If instead γ ≫ 1, this means the approximate BNN is
much more confident than the exact infinite-width GP reference, which suggests that
it is more confident than exact inference in the finite BNN as well. The opposite is
true if γ ≪ 1.
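The overconfidence ratio and the summary statistics behind each boxplot can be computed as in the following sketch. The function names are hypothetical, and the variances passed in are assumed to come from whichever GP and approximate inference implementations are in use; only the evaluation grid is taken from the description above.

```python
import numpy as np


def overconfidence_ratio(var_reference, var_approx):
    """gamma(x) = sqrt(V_GP[f(x)] / V_q[f(x)]), elementwise over evaluation points."""
    return np.sqrt(np.asarray(var_reference, dtype=float)
                   / np.asarray(var_approx, dtype=float))


def boxplot_summary(gamma):
    """Smallest and largest ratio, quartiles and median of the overconfidence ratios."""
    q1, median, q3 = np.percentile(gamma, [25, 50, 75])
    return {"min": float(gamma.min()), "q1": float(q1), "median": float(median),
            "q3": float(q3), "max": float(gamma.max())}


# 300 evenly spaced points from (-1.2, -1.2) to (1.2, 1.2).
eval_points = np.linspace(-1.2, 1.2, 300)[:, None] * np.ones(2)

# Toy usage with made-up variances: the approximation is twice as confident
# at the first point and half as confident at the second.
toy_gamma = overconfidence_ratio([4.0, 1.0], [1.0, 4.0])
toy_summary = boxplot_summary(toy_gamma)
```

A value of γ above 1 corresponds to the approximation reporting a smaller standard deviation than the reference.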
We consider ReLU BNNs with 1 to 10 hidden layers, and with 50 hidden units in
each layer. We set the prior mean for all parameters to 0. The prior standard deviation
for the bias parameters is chosen as σb = 1. Let σw/√H be the prior standard deviation
of each weight, where H is the number of inputs to the weight matrix. We consider
two schemes for choosing σw:
1. In Figure 3.9 we choose σw = 4, 3, 2.25, 2, 2, 1.9, 1.75, 1.75, 1.7, 1.65 for depths 1-10
respectively. These values were chosen to ensure the prior standard deviations
(of both the infinite width GP and the finite width BNN) in function space at
the points (1, 1) and (−1,−1) (the centres of the data clusters) were between 10
and 15 — a value we judged to constitute a vague yet still reasonable prior in
function space.
2. In Figure 3.10 we choose σw = √2 for all depths. The value of √2 is chosen as it
has been shown to lead to BNN priors with roughly constant variance in function
space as depth increases (Schoenholz et al., 2017).
Finally, all models use a fixed Gaussian likelihood with standard deviation 0.1, and
the training procedure is the same as that detailed in Appendix B.
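For the infinite-width reference, the prior standard deviation in function space implied by a given σw can be computed with the standard variance recursion for ReLU networks. The sketch below is an assumption-laden illustration rather than the procedure used to select the values above: it assumes the first weight matrix is scaled by the input dimension H in the same way as later layers, that σb = 1 in every layer including the output, and the function name is hypothetical.

```python
import numpy as np


def nngp_prior_std(x, n_hidden_layers, sigma_w, sigma_b=1.0):
    """Prior std of f(x) in the infinite-width limit of a ReLU network.

    Weights have prior std sigma_w / sqrt(H), with H the number of inputs to the layer.
    """
    x = np.asarray(x, dtype=float)
    # Variance of the first-layer pre-activations (H = input dimension).
    k = sigma_b**2 + sigma_w**2 * np.dot(x, x) / x.size
    for _ in range(n_hidden_layers):
        # For zero-mean Gaussian a with variance k, E[relu(a)^2] = k / 2.
        k = sigma_b**2 + sigma_w**2 * k / 2.0
    return float(np.sqrt(k))


std_scheme_1_depth_1 = nngp_prior_std([1.0, 1.0], n_hidden_layers=1, sigma_w=4.0)
var_scheme_2 = [nngp_prior_std([1.0, 1.0], depth, np.sqrt(2.0)) ** 2
                for depth in range(1, 11)]
print(std_scheme_1_depth_1, var_scheme_2)
```

Under these assumptions, σw = 4 with one hidden layer gives a prior standard deviation of about 11.7 at (1, 1). With σw = √2 the prior variance at (1, 1) is 3 + L after L hidden layers, so the weight contribution is preserved from layer to layer and the growth with depth comes only from the bias terms.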
In Figure 3.9 we see that for the 1HL and 2HL BNNs, the GP and HMC agree
closely, suggesting both resemble the exact predictive of the finite BNN. In contrast,
MFVI and MCDO are often an order of magnitude overconfident (γ(x) > 1) at some
points (upper tail of the boxplot) and somewhat underconfident (γ(x) < 1) at other
points (lower tail of the boxplot). Increased depth does not alleviate this behaviour.
In Figure 3.10, we see that the agreement between HMC and the limiting GP is less
close than it is in Figure 3.9. This could be due to HMC not mixing well for this prior,
or the GP-BNN correspondence not being particularly good for networks of width 50
for this prior. However, there is still much closer agreement between HMC and the
limiting GP than there is between MFVI or MCDO and the limiting GP.
[Figure 3.9: box plots of the overconfidence ratio (log scale, 10⁻¹ to 10²) against the number of hidden layers (1 to 10), for HMC, MFVI and MCDO.]
Fig. 3.9 Box and whisker plots of the overconfidence ratios of HMC, MFVI and MCDO
relative to exact inference in the corresponding infinite-width limit GP along the dashed
white line on the dataset from Figure 3.6. The whiskers show the smallest and largest
overconfidence ratios computed, and the box extends from the lower to upper quartile
values of the overconfidence ratios, with a line at the median. HMC is only run for 1
and 2 hidden layers due to difficulty ensuring convergence in larger models. We see
that MFVI and MCDO can be overconfident by up to an order of magnitude relative
to the GP, for all depths.
[Figure 3.10: box plots of the overconfidence ratio (log scale, 10⁻¹ to 10²) against the number of hidden layers (1 to 10), for HMC, MFVI and MCDO.]
Fig. 3.10 Same as in Figure 3.9, but now with σw = √2 for all depths.
We next investigate where the overconfidence and underconfidence of approximate
inference relative to the limiting GP are occurring. In Figure 3.11, we find that
overconfidence occurs in-between the data clusters, and underconfidence occurs at the
data clusters. Hence the uncertainty estimates of the approximate BNNs suffer from
qualitatively similar issues to those seen in 1HL BNNs in Figure 3.3, even though here
deeper BNNs are considered.
In addition, similarly to Figure 3.4, in Figure 3.12 we plot the uncertainty on
line segments in between random clusters of data in a 5-dimensional input space, but
this time with deeper networks. We again see that compared to exact inference in
the limiting GP, MFVI and MCDO both underestimate in-between uncertainty — or
sometimes show as large uncertainty at the data as in between the data. Figure 3.12
hence shows that the lack of adequate in-between uncertainty is not specific to 1HL
BNNs or to the dataset from Figure 3.6.
3.4.3 Initialising a BNN with in-between uncertainty
In light of the theoretical flexibility of the variational families QFFG and QMCDO in the
deep case as shown in Theorem 4 and Figure 3.7, it is perhaps surprising that VI fails
to capture important properties of the posterior predictive even with deep networks.
In order to assess whether the variational objective (the ELBO) or optimisation failure
is primarily responsible for the lack of in-between uncertainty when performing MFVI
and MCDO, we investigate the effect of initialisation on the quality of the posterior
obtained after variational inference. The idea is to find an initialisation of the variational
parameters such that the approximate posterior predictive closely matches the infinite
width GP (and hence shows good in-between uncertainty). If ELBO optimisation
starting from this initialisation subsequently loses in-between uncertainty, this provides
evidence that the ELBO objective for BNNs is to blame for the lack of in-between
uncertainty in the deep case.
In order to find this initialisation, we train the network by minimising the mean
squared error between the mean and variance functions of the GP reference posterior
and the approximate posterior (as in Equation (3.2) and Figure 3.7). The reference
posterior was obtained by fitting the limiting GP on the dataset (shown in crosses in
Figure 3.13). The noise variance was fixed to the true noise variance that generated the
data, and the data itself was sampled from the limiting GP prior, so that the model
should be able to fit the data well with minimal model mismatch. Two-hidden layer
MFVI and MCDO networks were used, with 50 hidden units in both layers.
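The function-space loss described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the experiments: the exact form and weighting of the two terms is given by Equation (3.2), whereas here we simply add an unweighted mean squared error on the means to one on the variances, and we estimate the moments of the approximate posterior from Monte Carlo samples of the network function. All function names are hypothetical. In practice this loss would be written in an automatic differentiation framework so that it can be minimised with respect to the variational parameters.

```python
import numpy as np

def moments_from_samples(f_samples):
    """Monte Carlo mean and variance of f(x) from samples of shape (S, N)."""
    f_samples = np.asarray(f_samples, dtype=float)
    return f_samples.mean(axis=0), f_samples.var(axis=0)

def function_space_squared_loss(mean_q, var_q, mean_gp, var_gp):
    """MSE between predictive means plus MSE between predictive variances.

    Assumption: equal, unweighted contributions from the two terms.
    """
    mean_q, var_q = np.asarray(mean_q, float), np.asarray(var_q, float)
    mean_gp, var_gp = np.asarray(mean_gp, float), np.asarray(var_gp, float)
    mean_term = np.mean((mean_q - mean_gp) ** 2)
    var_term = np.mean((var_q - var_gp) ** 2)
    return float(mean_term + var_term)

# Toy reference moments at three input locations (made-up values).
mean_gp = np.array([0.0, 1.0, 0.0])
var_gp = np.array([0.1, 1.0, 0.1])

# Two toy function samples from a hypothetical approximate posterior.
f_samples = np.array([[0.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])
mean_q, var_q = moments_from_samples(f_samples)
loss = function_space_squared_loss(mean_q, var_q, mean_gp, var_gp)
```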
Unfortunately, it may be the case that the initialisation found by minimising the
mean squared loss for 50,000 iterations leads to variational distributions with a very high
KL to the posterior. Hence once ELBO optimisation begins, the distribution may need
[Figure 3.11 panels: (a) Mean Field VI; (b) MC Dropout; (c) Mean Field VI (σw = √2 prior); (d) MC Dropout (σw = √2 prior).]
Fig. 3.11 Plots of the overconfidence ratio γ on the dataset from Figure 3.6 against
λ (where λ is defined as in Figure 3.6) for several depths of neural networks with
σw = 4, 2, 1.7 for 1, 5 and 9 hidden layers respectively (top), and σw = √2 for all
depths (bottom). Projections of the input locations of the datapoints onto the diagonal
slice between the clusters are shown as black crosses (✕). We see that both MCDO
and MFVI are overconfident (γ > 1) in between data, and underconfident (γ < 1) at
the locations where we have observed data, relative to the GP reference.
[Figure 3.12: nine panels arranged in columns GP, MFVI and MCDO. x-axis: λ; y-axis: f(x(λ)).]
Fig. 3.12 Same experimental set-up as in Figure 3.4, but now with 3-hidden layer
BNNs. These deeper BNNs still fail to show adequate in-between uncertainty, and
are overconfident in between the data clusters and underconfident at the data clusters
relative to the infinite-width GP reference.
[Figure 3.13 panels: (a) Mean-field VI (legend: GP, MFVI); (b) MC dropout (legend: GP, MCDO). x-axis: x; y-axis: y.]
Fig. 3.13 Mean and error bars (± 2 standard deviations) for the GP and the BNN with
each inference scheme, trained on the data shown by the red crosses. The inference
algorithms were initialised by first minimising the squared error to the reference GP
mean and variance, and then running the respective inference algorithm. Even when
starting from an initialisation that closely matches the GP and hence shows good
in-between uncertainty, in-between uncertainty is subsequently lost when the variational
objective is optimised.
to move very far from its initialisation, and hence may lose the in-between uncertainty
that it started with. In other words, there may exist variational distributions that lead
to good in-between uncertainty and also a good ELBO, but these might be very far
from the distributions we find when only optimising for good in-between uncertainty.
To account for this, we gradually interpolate between the squared-error loss and the
variational objective, by taking convex combinations of the losses. This procedure
gives us a better chance of finding an initialisation that both has good in-between
uncertainty and a low KL divergence to the posterior. In detail, call the function
space squared loss L1 and the standard variational objective L2. Then after the first
50,000 iterations of training with L1, we train for 10,000 iterations using .9L1 + .1L2,
10,000 iterations using .8L1 + .2L2 and so on until we are only training using L2. We
then train for 100,000 iterations using just L2, to ensure the variational objective has
converged.
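The interpolation schedule just described can be written down as a small function. The sketch below is illustrative only: the names are hypothetical, iterations are counted from zero, and we assume nine mixed stages (weights 0.1 to 0.9 on L2) followed directly by the final 100,000 iterations on L2 alone, which is one reading of "and so on until we are only training using L2".

```python
def l2_weight(iteration, warmup=50_000, stage_len=10_000, n_mixed_stages=9):
    """Weight placed on the variational objective L2 at a 0-indexed iteration.

    The weight on the function-space squared loss L1 is one minus this value.
    """
    if iteration < warmup:
        return 0.0  # initial phase: train on L1 only
    stage = (iteration - warmup) // stage_len + 1
    # Stages 1..n_mixed_stages mix the two losses; afterwards L2 only.
    return min(stage, n_mixed_stages + 1) / (n_mixed_stages + 1)

def combined_loss(l1_value, l2_value, iteration):
    """Convex combination of the two losses under the schedule above."""
    w = l2_weight(iteration)
    return (1.0 - w) * l1_value + w * l2_value

# Total length under these assumptions: 50,000 + 9 * 10,000 + 100,000.
total_iterations = 50_000 + 9 * 10_000 + 100_000
```

Because the weight changes in small steps, the variational distribution is never asked to move abruptly from the function-space optimum to the ELBO optimum, which is the motivation given in the text for interpolating.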
The results are shown in Figure 3.13. We see that even when using this initialisation
which explicitly takes into account in-between uncertainty, the obtained posterior still
lacks in-between uncertainty. This provides some evidence that this pathology may be
due to the nature of the variational objective function itself, rather than the difficulty
of optimising the ELBO. However, this does not constitute a definitive proof, since
there may still be variational parameters that show good in-between uncertainty and
are also an optimum of the ELBO — but these may be extremely difficult to find, even
when using this specially designed initialisation. We leave a further investigation of
this, and more broadly, of Criterion 2, to future work.
3.5 Case study: active learning with BNNs
We now consider the impact of the pathologies described in Sections 3.3 and 3.4
on active learning (Settles, 2009) on a real-world dataset, where the task is to use
uncertainty information to intelligently select which points to label. Active learning with
approximate BNNs has been considered in previous works, often showing improvements
over random selection of datapoints (Gal et al., 2017b; Hernández-Lobato and Adams,
2015). However, in cases when active learning with BNNs fails, common metrics such
as RMSE are insufficient to diagnose the causes. In particular, it is difficult to attribute
the failure to the model or to poor approximate inference.
In this section, we specifically analyse a dataset where we have observed active
learning with approximate BNNs to fail — the Naval regression dataset (Coraddu et al.,
2014), which has 1-dimensional output variables y, 14-dimensional input variables x,
and consists of 11,934 datapoints. We find via PCA that this dataset has most of its
variance along a single direction. It hence may be especially problematic for methods
that struggle with in-between uncertainty, as points are more likely to lie roughly
in between others. This makes it a highly suitable dataset to test an approximate
inference method’s ability to represent in-between uncertainty.
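A check of this kind can be sketched with a few lines of NumPy. The example below is an assumption-laden illustration, not the analysis code used for Naval: it computes the fraction of variance explained by each principal component via a singular value decomposition of the centred data, and applies it to synthetic 14-dimensional data that we construct to have one dominant direction. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance along each principal component (descending)."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)
    # Singular values of the centred data give the component variances
    # up to a common factor, which cancels in the ratio.
    singular_values = np.linalg.svd(X_centred, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Synthetic stand-in data: one dominant direction plus small isotropic noise.
rng = np.random.default_rng(0)
direction = rng.normal(size=14)
direction /= np.linalg.norm(direction)
t = rng.normal(size=(500, 1))
X = t * direction + 0.1 * rng.normal(size=(500, 14))

ratios = explained_variance_ratio(X)
```

A first entry of `ratios` close to one would indicate that most of the variance lies along a single direction, as we report for Naval.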
The main questions we seek to address are:
1. Does a lack of in-between uncertainty lead to pathological behaviour on a real
dataset in the 1HL case? We have already demonstrated empirically in Sec-
tion 3.3.4 that 1HL BNNs struggle with in-between uncertainty for 2 and 5-
dimensional datasets. However, in higher dimensional datasets such as Naval, it
is not immediately apparent that Theorems 1 and 2 are problematic, since the
convex hull of the datapoints may have relatively low volume in high dimensions.
Unlike the synthetic experiments in the previous sections, the input locations
in Naval are not specifically designed to make Theorems 1 and 2 relevant. In
most cases in this experiment, there will be few datapoints that are exactly
in between others. Hence these theorems may no longer be relevant to this
real-world example. However, it may be the case that approximate inference will
struggle with datapoints that are in some sense approximately in between each
other.
2. Will deeper BNNs be able to express appropriate in-between uncertainty? Given
the theoretical expressiveness of the approximating families proven in Theorem 4,
it is possible that increased depth will alleviate any pathologies experienced with
shallower models.
3. What is the effect of a lack of in-between uncertainty on downstream tasks? So
far, we have only looked at in-between uncertainty as an end in itself. However,
it is much more practically relevant to see what effect a lack of in-between
uncertainty has on a downstream application such as active learning.
3.5.1 Experimental set-up and results
We compare MFVI, MCDO and the limiting GP on the active learning task. We do not
run HMC as it would take too long to wait for convergence at each iteration of active
learning. We normalise the dataset to have zero mean and unit standard deviation in
each dimension. The experiment begins with an initial active set, which is a collection
of labelled datapoints. The remainder of the datapoints in the dataset constitute the
pool set, which is unlabelled — only the input location x is known, not the output
y. In each iteration of active learning, the model chooses a datapoint from the pool
set to label, after which it becomes a member of the active set. Then, at the end of each
iteration, the model is retrained on the active set. Five datapoints are chosen randomly
as an initial active set, with the rest being the pool set. Following Hernández-Lobato
and Adams (2015), the models choose the datapoint in the pool set which is assigned
the highest predictive variance by the model to add to the active set. The goal is to
obtain the best predictive performance on the remaining members of the pool set after
a fixed number of iterations of active learning.
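The acquisition loop described above can be sketched as follows. This is a toy illustration under several assumptions: we use a small GP with a unit-variance RBF kernel as a stand-in model (not the limiting GP kernel of a BNN, and not MFVI or MCDO), synthetic two-dimensional data rather than Naval, and hypothetical function names. Only the structure of the loop (random initial active set, maximum predictive variance acquisition, retraining after each acquisition, evaluation on the remaining pool) follows the text.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_predict(X_tr, y_tr, X_te, noise_std=0.01):
    """Posterior mean and latent-function variance of the stand-in GP."""
    K = rbf_kernel(X_tr, X_tr) + noise_std**2 * np.eye(len(X_tr))
    K_s = rbf_kernel(X_tr, X_te)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance is 1 for this kernel
    return mean, np.maximum(var, 0.0)

def active_learning(X, y, n_init=5, n_iters=50, seed=0):
    """Greedy maximum-variance acquisition from a pool set."""
    rng = np.random.default_rng(seed)
    active = [int(i) for i in rng.choice(len(X), size=n_init, replace=False)]
    chosen = set(active)
    pool = [i for i in range(len(X)) if i not in chosen]
    for _ in range(n_iters):
        # "Retrain" on the active set and score every pool point.
        _, var = gp_predict(X[active], y[active], X[pool])
        pick = pool.pop(int(np.argmax(var)))
        active.append(pick)
    # Evaluate on the remaining members of the pool set.
    mean, _ = gp_predict(X[active], y[active], X[pool])
    rmse = float(np.sqrt(np.mean((mean - y[pool]) ** 2)))
    return active, pool, rmse

# Synthetic stand-in data (not the Naval dataset).
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
active, pool, rmse = active_learning(X, y, n_init=5, n_iters=50)
```

For a GP, "retraining" amounts to recomputing the posterior; for MFVI and MCDO the call to `gp_predict` would be replaced by re-optimising the variational objective from scratch and estimating the predictive variance from posterior samples.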
We train MFVI and MCDO with full batch training for 20,000 iterations of ADAM
at each step of active learning. All BNNs are retrained from scratch after the acquisition
of each point from the pool set. This process is repeated 50 times. As the dataset
has low noise, we use a homoskedastic Gaussian noise model with a fixed standard
deviation of 0.01 for all models. We used a learning rate of 1 × 10⁻³ and 32 Monte
Carlo samples from qϕ to estimate the objective function for both MFVI and MCDO.
All networks had 50 neurons in each hidden layer. The prior for all BNNs and the GP
was chosen to have σw = √2, σb = 1. σw = √2 was chosen so that the prior in function
space has a stable variance as depth increases (Schoenholz et al., 2017). The dropout
probability was set at p = 0.05 for all MCDO networks. The dropout ℓ2 regularisation
was chosen to match the ‘KL condition’ as stated in Gal (2016, Section 3.2.3). The
results were averaged over 20 random initialisations/random selections of the 5 initial
points in the active set. For MFVI and MCDO, the predictive distribution at test time
and the predictive variances used for active learning were estimated using 500 samples
from the approximate posterior. The parameter initialisations are the same as those in
Appendix B.
Table 3.1 Test RMSEs (± 1 standard error) after the 50th iteration of active learning,
averaged over 20 random seeds. As the data is normalised to have zero mean and unit
standard deviation, a method that predicts the value 0 on all datapoints will have an
RMSE near 1.
Method        1 HL         2 HL         3 HL         4 HL
GP Active     0.04 ± 0.00  0.04 ± 0.00  0.04 ± 0.00  0.05 ± 0.00
GP Random     0.12 ± 0.01  0.13 ± 0.01  0.15 ± 0.01  0.16 ± 0.01
MFVI Active   0.94 ± 0.11  0.46 ± 0.04  0.35 ± 0.03  0.31 ± 0.02
MFVI Random   0.15 ± 0.01  0.23 ± 0.01  0.28 ± 0.01  0.32 ± 0.01
MCDO Active   0.69 ± 0.04  0.36 ± 0.02  0.38 ± 0.02  0.45 ± 0.02
MCDO Random   0.22 ± 0.01  0.35 ± 0.01  0.43 ± 0.01  0.47 ± 0.02
Table 3.1 shows the RMSE of each model on a held-out test set after this process,
compared to a baseline where points are chosen randomly. Active learning significantly
reduces the RMSE for the GP compared to random selection, often by more than
a factor of three. However, it increases the RMSE for 1HL MFVI and MCDO, and
either increases it or does not significantly decrease it for deeper networks. The one
exception is 3HL MCDO, where active performs about 10% better than random, which
is still far less than the factor of three improvement obtained by exact inference in the
infinite-width BNN.
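The factors quoted above can be recovered from the mean RMSEs in Table 3.1 with a short calculation. The sketch below uses only the rounded table entries (standard errors are ignored), so the ratios are approximate; the variable names are our own.

```python
# Mean test RMSEs from Table 3.1, for 1, 2, 3 and 4 hidden layers.
rmse = {
    "GP":   {"active": [0.04, 0.04, 0.04, 0.05], "random": [0.12, 0.13, 0.15, 0.16]},
    "MFVI": {"active": [0.94, 0.46, 0.35, 0.31], "random": [0.15, 0.23, 0.28, 0.32]},
    "MCDO": {"active": [0.69, 0.36, 0.38, 0.45], "random": [0.22, 0.35, 0.43, 0.47]},
}

# Ratio random / active: values above 1 mean active learning helped.
improvement = {
    method: [r / a for a, r in zip(vals["active"], vals["random"])]
    for method, vals in rmse.items()
}

# Relative RMSE reduction for 3HL MCDO, the one exception noted in the text.
mcdo_3hl_reduction = 1.0 - rmse["MCDO"]["active"][2] / rmse["MCDO"]["random"][2]
```

Under this calculation the GP ratios lie between roughly 3 and 3.75, the 1HL MFVI and MCDO ratios are well below 1, and the 3HL MCDO reduction is a little over 10%.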
Note that, perhaps counterintuitively, the performance of all models degrades with
increasing depth when choosing datapoints randomly. This could be due to the small
dataset size and possible simplicity of the regression problem, where a shallow network
may have more suitable inductive biases than a deeper one. However, our goal in
this experiment is not to find the best architecture/prior for the Naval dataset, but
rather to assess the impact of approximate inference on active learning. The infinite
width GP with active learning has almost the same performance for all depths, which
is consistently much better than random selection of points. This is not the case for
the approximate BNNs, which provides strong evidence that exact inference in the
BNN model leads to good active learning performance, which is lost by approximate
inference.
3.5.2 Discussion
In Figure 3.14 we visualise the dataset and the points chosen by 1HL BNNs using
t-SNE (van der Maaten and Hinton, 2008). The covariates of Naval are clustered, with
points in the same cluster roughly the same distance from the origin. Since the dataset
is mean-centred, clusters of points closer to the origin are in a sense ‘in between’ or
‘surrounded by’ clusters of points that are further away from the origin. We see that
although the 1HL GP chooses points from every cluster during active learning, 1HL
MFVI fails to select any points from many of the clusters — including all the clusters
closest to the origin. It ignores points in the ‘inside’ of the input space and oversamples
points on the ‘outside’, leading to a selection strategy which is worse than random.
This behaviour, although not directly implied by Theorem 1 (because the clusters may
not lie on straight lines joining other clusters), is nonetheless consistent with it. Both
the behaviour implied in Theorem 1 and the behaviour here can be seen as aspects of a
general difficulty of the models in expressing in-between uncertainty. Here it manifests
in the fact that the uncertainty seems to be underestimated on clusters of points within
a sphere bounded by the outermost data clusters.
We next consider deeper BNNs. Figure 3.15 shows the points chosen by 3HL BNNs.
Again the GP chooses points from every cluster, and seems to focus on the ‘corners’ of
each cluster. This is intuitively a good strategy, since, assuming the output value varies
approximately linearly across each cluster, knowing the values at the corners of each
cluster allows for the best estimate of the slope of the linear region. MFVI samples
from more clusters than in the 1HL case, but still comparatively oversamples clusters
further from the origin, and undersamples those near the origin. MCDO has a more
spread out choice of points than the 1HL case, but still fails to obtain a significantly
better RMSE than random. We see that compared to the GP, it still misses out on
some clusters and does not follow the strategy of sampling the corners of clusters.
Figure 3.16 shows the predictive uncertainty of 1HL models at the beginning and
end of active learning respectively. Comparing the uncertainty before and after the 50
points have been collected during active learning, we see that all models significantly
reduce their uncertainty around clusters that have been heavily sampled, except for
MCDO. This causes MCDO to repeatedly sample near locations that have already
been labelled, in contrast to the GP. Interestingly, it sometimes chooses from clusters
near the origin in the 1HL case, even though its variance function is provably convex.
This may be unexpected since convex functions are roughly ‘bowl-shaped’ and hence
one might expect the centre of the input space to be a region of relatively
lower predictive variance. The fact that 1HL MCDO nevertheless sometimes chooses
[Figure 3.14 panels (t-SNE plots): (a) Wide-limit GP; (b) MFVI; (c) MCDO; (d) Random.]
Fig. 3.14 Points chosen during active learning in the 1HL case. Colours denote distance
from the origin in 14-dimensional input space, i.e., ∥x∥2. Grey crosses (✕) denote
the five points randomly chosen as an initial training set. Red crosses (✕) denote
the 50 points selected by active learning. Both MFVI and MCDO entirely miss some
clusters which are nearer the origin, and oversample certain clusters which are far from
the origin, as might be expected of methods that struggle to represent in-between
uncertainty. In contrast, the limiting GP samples the ‘corners’ of each cluster, without
missing any entirely. Note that t-SNE does not preserve relative positions, so that
clusters near the origin may appear on the ‘outside’ of the t-SNE plot.
[Figure 3.15 panels (t-SNE plots): (a) Limiting GP; (b) MFVI; (c) MCDO; (d) Random.]
Fig. 3.15 Points chosen during active learning in the 3HL case. Colours denote distance
from the origin in 14-dimensional input space, i.e., ∥x∥2. Grey crosses (✕) denote the
five points randomly chosen as an initial training set. Red crosses (✕) denote the 50
points selected by active learning. Again, the GP samples the corners of each cluster,
and MFVI oversamples clusters far from the origin. Note that the random selection of
points shown here is the same as that shown in Figure 3.14.
[Figure 3.16 panels (t-SNE plots): (a) Limiting GP (before); (b) MFVI (before); (c) MCDO (before); (d) Limiting GP (after); (e) MFVI (after); (f) MCDO (after).]
Fig. 3.16 Predictive uncertainties before (top row) and after (bottom row) active
learning, for single-hidden layer BNNs. Note here that colours denote predictive
uncertainties, rather than distance from the origin as in Figures 3.14 and 3.15. As the
noise standard deviation was fixed to 0.01 for all models, changes in the predictive
standard deviation reflect model uncertainty. Grey crosses (✕) denote the five points
randomly chosen as an initial training set. Red crosses (✕) denote the 50 points
selected by active learning. Note how, compared to the top row, the GP has reduced its
uncertainty near points it has observed, and is most uncertain at the corners of clusters
opposite those points. In contrast, for both MFVI and MCDO, the network is still
uncertain around regions it has already collected points from, leading it to oversample
those clusters and undersample others.
datapoints near the origin could be because the minimum of the variance function
for MCDO is not centred at the origin, or because the variance has the shape of an
elongated valley.
Note also that MFVI is most confident at clusters near the origin that have never
been sampled, and least confident at clusters far from the origin that have already
been heavily sampled. Again, this is not necessarily a direct consequence of Theorem 1,
but appears to be a wider pathology to do with in-between uncertainty.
In contrast, the GP seems to select the ‘corners’ of each cluster, which is intuitively
efficient. The success of the infinite-width GP provides strong evidence that this BNN
model combined with exact inference has desirable inductive biases for this task; it is
rather approximate inference that has caused active learning to fail. It may conceivably
be the case that exact inference in finite BNNs behaves more like MCDO and MFVI
than the infinite width GP. However, we believe this is unlikely since convergence to
the limiting GP can occur for even moderately wide BNNs (Matthews et al., 2018).
To rule out this possibility, a follow-up study with extensive HMC simulation for finite
BNNs would be needed to corroborate these findings.
3.6 Related work
As we saw in Section 2.4, concerns have been raised about the suitability of QFFG
since the earliest work on BNNs. However, to our knowledge, Theorem 1 is the first
theoretical result showing that QFFG, when applied to certain datasets, necessarily has
a pathologically restrictive effect on BNN predictive uncertainties.
3.6.1 Discussion of Farquhar et al. (2020)
Concurrently with (and in response to) our work on the expressiveness of QFFG,
Farquhar et al. (2020) argued that the mean-field approximation is not severely
restrictive. As their paper directly addresses the research in this chapter, whilst coming
to different overall conclusions and recommendations, we provide here a detailed
discussion of their work. They make several claims, which we discuss one by one:
First, as mentioned above, their overarching claim is that the mean-field approxi-
mation for variational inference in BNNs is not severely restrictive. In order to assess
this claim, it is crucial to be clear about the goal of approximate inference, and the
task that it is applied to. In this chapter, we have assumed that the goal is to obtain
predictive distributions that resemble the exact predictive. Hence we consider an
approximating family to be restrictive if there are features of the exact predictive that
are consistently missing from the approximate predictive (e.g., in-between uncertainty).
However, if the goal of approximate inference is instead to obtain a method that
performs reasonably well on some metric for a task (e.g., accuracy and expected cali-
bration error on ImageNet), then there are tasks where MFVI indeed can be regarded
as succeeding. An example of this is shown in Table 2 in Farquhar et al. (2020),
which shows accuracies, negative log-likelihoods and expected calibration error for
various Bayesian CNN architectures trained on ImageNet. In this chapter we show
that for certain datasets, QFFG fails to capture essential features of the true predictive
distribution, but this does not imply that the method cannot be practically useful
on any dataset and task. For example, there could be situations where in-between
uncertainty is simply irrelevant for the task at hand (e.g., raw accuracy on ImageNet).
In that sense, there are certain tasks/situations where QFFG is severely restrictive, and
others where it is not necessarily so.
Second, Farquhar et al. (2020) state that MFVI in deep networks can, in theory,
have similar predictive distributions as those induced by more expressive posteriors
over shallow networks. On this point our analysis and results are in agreement.
Theorem 4 shows that (at least marginally), a deep MFVI BNN can approximate any
predictive mean and variance function. In comparison, Proposition 4 in Farquhar et al.
(2020) states that our result can be extended to approximate the entire predictive
density function, not just the first two moments (although still only marginally).
The implication of both of our results is similar: for wide enough BNNs with more
than two hidden layers, QFFG is flexible enough, in theory, to resemble the predictive
distribution induced by any posterior, including non-mean-field posteriors. However,
as acknowledged by Farquhar et al. (2020), this is not enough to show that good
approximate posteriors will actually be found by VI (Criterion 2).
Third, Farquhar et al. (2020) claim that the performance of mean-field BNNs
on downstream tasks is comparable to that of BNNs using more flexible, but still
Gaussian, posteriors. In their Table 2 they show that the performance of SWAG
(Maddox et al., 2019) with a low-rank Gaussian posterior is comparable to that of
SWAG with a diagonal Gaussian posterior (see footnote 2). They argue that this provides evidence
that the importance of going beyond the mean-field approximation is greatly diminished
in large-scale models, and hence research should focus on addressing other problems for
MFVI at scale. However, it is not clear that this observation holds for all approximating
family methods. For instance, the K-FAC Laplace approximation shows significantly
improved uncertainty estimation over the diagonal Laplace approximation in Ritter
et al. (2018), and K-FAC is preferred over diagonal Gaussians in state-of-the-art
applications of the Laplace approximation for BNNs (Daxberger et al., 2021; Immer
et al., 2021b) (although note that Immer et al. (2021a) found that the diagonal
Laplace approximation can lead to good performance when used for estimating the
marginal likelihood). Furthermore, there could be other, non-Gaussian approximating
families that lead to significant improvements over QFFG. We point out global inducing
variational posteriors (Ober and Aitchison, 2021) as a recent example of an empirically
successful BNN approximate posterior that is layer-wise conditionally Gaussian, but
not jointly Gaussian. It has been shown to lead to much tighter ELBOs than MFVI,
[Footnote 2: Although note that the low-rank Gaussian has a slightly better log-likelihood and expected calibration error than the diagonal Gaussian.]
making it significantly more amenable to hyperparameter selection by optimising
the ELBO (Bui, 2021). Finally, it could be the case that more expressive Gaussian
posteriors do indeed lead to superior performance, but more improvements in, e.g., the
optimisation procedure or objective function, are required to realise their potential. In
this chapter, we refrain from making recommendations regarding which approximating
family to use in practice for a particular downstream task. Our concern, rather, is
to highlight a specific pathology that we observe in QFFG. However, we believe it is
premature to abandon more flexible approximating families as a research direction in
favour of focusing solely on scaling up MFVI.
Next, Farquhar et al. (2020) argue that in deep BNNs, there exist modes of the
posterior that are well approximated by mean-field distributions. However, it is not
immediately clear that this is relevant either for approximating the true posterior,
or for good performance on downstream tasks. For example, it may be that these
‘mean-field modes’ exist and MFVI is biased towards finding them. Even if that is the
case, QFFG may still be a severely restrictive family. Indeed, the modes that MFVI
finds may be very unrepresentative of the full posterior, and it may be the case that a
more flexible variational distribution could find other modes that lead to much better
performance. Farquhar et al. (2020) argue that increased depth closes the performance
gap between mean-field and full-covariance posteriors, which they illustrate in their
Figure 2. Their experiment involves running HMC on a small network, and fitting a
Gaussian distribution to the samples (see footnote 3). The KL divergence between a full-covariance
Gaussian fit to the samples and a mean-field Gaussian fit is then shown to decrease
with depth. However, since the HMC chain they use is initialised from a mode found
by MFVI, their experiment is biased to sample from modes that are well-approximated
by QFFG to begin with. It does not tell us how much these ‘factorised modes’ are
losing compared to other modes that are not well-approximated by QFFG, or, indeed,
compared to the full, multimodal posterior, which is what we are finally concerned
with approximating.
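To make the kind of comparison discussed above concrete, the following sketch computes a KL divergence between a full-covariance Gaussian and a mean-field (diagonal) Gaussian. It is not a reproduction of the procedure in Farquhar et al. (2020): we assume the direction KL(full ‖ diagonal), and we assume the diagonal Gaussian shares the mean and the marginal variances of the full one, in which case the divergence reduces to half the difference between the sum of log marginal variances and the log-determinant of the covariance. The function name and toy samples are hypothetical.

```python
import numpy as np

def kl_full_to_meanfield(cov):
    """KL( N(mu, cov) || N(mu, diag(cov)) ) for a positive definite cov.

    With matched means and marginal variances, the trace and dimension
    terms cancel, leaving 0.5 * (sum(log diag(cov)) - logdet(cov)).
    """
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    _, logdet = np.linalg.slogdet(cov)
    return float(0.5 * (np.sum(np.log(np.diag(cov))) - logdet))

# Analytic check: two unit-variance dimensions with correlation 0.8.
cov_true = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
kl_exact = kl_full_to_meanfield(cov_true)

# Same quantity estimated from toy samples standing in for posterior samples.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(2), cov_true, size=2000)
kl_from_samples = kl_full_to_meanfield(np.cov(samples, rowvar=False))
```

The divergence is zero exactly when the covariance is diagonal and grows as correlations between weights become stronger, so a small value only indicates that the particular set of samples being fitted has weak correlations.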
Finally, Farquhar et al. (2020) claim that deeper networks trained with MFVI
show improved in-between uncertainty compared to shallow networks. Their main
evidence for this is Figure 5 in their paper, which compares in-between uncertainty on
1-dimensional regression for 1HL MFVI vs 3HL MFVI. However, although they show
arguably better in-between uncertainty in the 3HL case compared to the 1HL case, the
predictive variance of the 3HL BNN is still roughly the same at the data clusters as it
[Footnote 3: They in fact fit a mixture of Gaussians (since the HMC samples reflect a multimodal posterior), and select the Gaussian with the highest Bayesian information criterion. It is not clear that fitting a Gaussian in this way leads to reasonable predictive distributions.]
is in between the data clusters; the BNN is still overconfident in between the data
and underconfident at the data. We show similar behaviour in Figure 3.12.
In summary, practitioners seeking to understand whether to use full-covariance
Gaussian distributions rather than mean-field Gaussian distributions in their approx-
imate BNNs for tasks such as image classification will find the results in Farquhar
et al. (2020) instructive. However, their findings do not directly address the ques-
tion of whether the mean-field Gaussian approximation suffices to provide a good
approximation to the exact posterior predictive.
3.6.2 Pathologies of the optimal mean-field posterior in wide
BNNs
The wide limit of BNNs has been a fruitful topic of theoretical investigation as wide
BNN priors and (exact) posteriors both converge to Gaussian processes for the case of
regression with Gaussian likelihoods (Hron et al., 2020; Matthews et al., 2018; Yang,
2019b). Very recently, Coker et al. (2022) used the wide limit to provide a theoretical
characterisation of the optimal MFVI posterior (in the sense of maximising the ELBO)
for wide, deep BNNs. They prove that as the width tends to infinity, the approximate
posterior predictive of such an MFVI BNN tends to the prior predictive. In other
words, the optimal infinite-width MFVI BNN provably completely ignores the data,
which is pathological behaviour that is not reflected by the exact posterior. Their result
is a significant theoretical advance in our understanding of approximate inference in
BNNs, and directly addresses Criterion 2, since it is a statement about the optimal
posterior. In contrast, our Theorems 1 to 4 only address Criterion 1, since they
made existence statements regarding the entire approximating family. That is, nothing in
Theorems 1 to 4 relied on how the member of the approximating family was selected
(e.g., via the ELBO, or Laplace approximation etc.). Rather, these theorems only made
statements about whether there were any elements of the approximating family that
met certain conditions.
However, the main theorem of Coker et al. (2022) does have a limitation, in that
it only applies to BNNs with odd activation functions, such as tanh. In particular, it
does not apply to the ReLU BNNs that we investigate in this chapter. When non-odd
activations are used, Coker et al. (2022) find that the approximate predictive no longer
necessarily converges to the prior; however, it does not necessarily model the data well
either.
Combined with the results in this chapter, we thus have the following (incomplete)
theoretical picture of MFVI in BNNs: for 1HL ReLU BNNs, in-between uncertainty
is provably lost. For deep, wide BNNs with odd activations, the posterior predictive
converges to the prior predictive. Optimistically, one could hope that when neither of
these theorems apply (e.g., when considering deep BNNs with non-odd activations, or
which are not too wide), the MFVI predictive will closely resemble the exact predictive,
and be able to represent properties such as in-between uncertainty. More conservatively,
it appears that whenever definitive theoretical characterisations can be made about
the MFVI posterior predictive, they imply major deviations from the exact predictive.
Our inability to prove the existence of pathologies in other cases does not imply their
absence. Hence these results sound a note of caution for practitioners, and in general
we should not expect the MFVI predictive to closely resemble the exact predictive,
unless we have references for the exact predictive to corroborate this (e.g., extensive
HMC simulation).
3.6.3 The cold posterior effect and prior selection
Beginning with Wenzel et al. (2020), there has been much work on the cold posterior
effect: the observation that the performance of BNNs can be improved by artificially
sharpening the Bayesian posterior distribution with a temperature parameter T < 1.
In order to show that the cold posterior effect is a genuine feature of the model and not
simply an artefact of an inaccurate inference procedure, Wenzel et al. (2020) performed
a study of the quality of approximate inference in deep BNNs. They focused on
stochastic gradient Markov Chain Monte Carlo (SGMCMC) (Chen et al., 2014; Welling
and Teh, 2011; Zhang et al., 2020) in deep convolutional networks, and concluded that
SGMCMC is accurate enough for inference, suggesting that the prior is at fault for the
cold posterior effect. This has been further investigated in Fortuin et al. (2021) who
found that the cold posterior effect can be alleviated in some cases by using heavy-tailed
priors. Other possible causes of the cold posterior effect have been suggested, including
data augmentation (Fortuin et al., 2021; Izmailov et al., 2021) and dataset curation
(Aitchison, 2020). Finally, Noci et al. (2021) argue that the cold posterior effect may
be a symptom with many causes, showing that dataset curation, data augmentation
and poor prior specification can each, in isolation, lead to the cold posterior effect.
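For reference, the tempering described above is usually written as follows. This is a sketch following the form used by Wenzel et al. (2020); the symbols $U$ (the posterior energy) and $\theta$ (the network weights) are introduced here for illustration and are not used elsewhere in this chapter.

```latex
p_T(\theta \mid \mathcal{D}) \propto \exp\!\left(-\frac{U(\theta)}{T}\right),
\qquad
U(\theta) = -\log p(\mathcal{D} \mid \theta) - \log p(\theta).
```

Setting $T = 1$ recovers the Bayesian posterior, while $T < 1$ concentrates the distribution around regions of low energy.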
In contrast to these studies, we do not investigate the cold posterior effect or focus
on designing BNN priors. Instead, we ask whether approximate inference resembles the
exact posterior for a given prior. We give examples of situations where commonly used
independent Gaussian priors do encode useful inductive biases which are subsequently
lost by approximate inference. Hence we show that even if the problem of BNN
prior specification was completely solved (something the community may be far from
achieving), the inaccuracies in MFVI and MCDO inference could still stop the good
inductive biases of the prior from being translated to the posterior.
3.6.4 Properties of MC dropout posteriors
Prior to our work, Osband et al. (2018) also identified pathologies in MC dropout
posteriors, although of a different nature. They note that the MCDO predictive
distribution is invariant to duplicates of the data, and in the linear case predictive
uncertainty does not decrease as dataset size increases, if the dropout rate and regulariser
are fixed. However, for a fixed prior, the ‘KL condition’ (Gal, 2016, Section 3.2.3)
requires the ℓ2 regularisation constant to decrease with increasing dataset size. In
that case, the MCDO predictive will no longer be invariant to duplicates of the data.
Theorem 2 shows that in the non-linear 1HL case, the predictive uncertainty in the
MCDO posterior has restricted flexibility even for datasets without repeated entries.
Furthermore, since it applies for any setting of the parameters, the restrictions on
in-between uncertainty will persist regardless of how much (or how little) data is
observed.
In follow-up work, Manita et al. (2022) generalised our result on the universality of
MC dropout networks (Theorem 4). Whilst our theorem only holds for networks with
ReLU activations, they show the universal approximation property holds for the same
class of activation functions that the original deterministic universal approximation
theorem holds for (Leshno et al., 1993). Furthermore, it is common in non-Bayesian
uses of MC dropout to employ a deterministic mode of the network at test time,
which works by multiplying the deterministic weights by 1− p, where p is the dropout
probability (Srivastava et al., 2014). Manita et al. (2022) show that it is possible to
construct a dropout network that can approximate any function in both random and
deterministic mode simultaneously. In contrast to our work, they focus on proving that
the output of the network can approximate any function either with high probability
or in expected Lq norm, and do not consider the universal approximation properties of
the predictive variance function.
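The two modes of a dropout network mentioned above can be sketched as follows. This is a minimal illustration, not the construction of Manita et al. (2022): the network sizes, the random weights and the choice to apply the mask to the hidden units only are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Hypothetical 1HL network: 1 input, H hidden units, 1 output.
H, p = 50, 0.1  # p is the dropout probability (assumed value)
W1, b1 = rng.normal(size=(1, H)), rng.normal(size=H)
W2, b2 = rng.normal(size=(H, 1)) / np.sqrt(H), 0.0

def stochastic_forward(x, rng):
    """Random mode: one forward pass with a fresh Bernoulli mask on the hidden units."""
    h = relu(x @ W1 + b1)
    mask = rng.random(H) >= p  # keep each unit with probability 1 - p
    return (h * mask) @ W2 + b2

def deterministic_forward(x):
    """Deterministic mode: no mask, outgoing weights multiplied by 1 - p."""
    h = relu(x @ W1 + b1)
    return h @ ((1.0 - p) * W2) + b2

x = np.linspace(-2.0, 2.0, 5)[:, None]
samples = np.stack([stochastic_forward(x, rng) for _ in range(20000)])
mc_mean, mc_var = samples.mean(axis=0), samples.var(axis=0)
```

In this single-hidden-layer case the mask is followed only by a linear layer, so the deterministic mode equals the exact mean of the random mode and `mc_mean` approaches it as the number of samples grows. With further non-linear layers after the mask, weight scaling is only an approximation to that mean. The deterministic mode carries no information about `mc_var`.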
3.7 Conclusions
Principled approximate Bayesian inference involves defining a reasonable model, then
finding an approximate posterior that retains the properties of the exact posterior that
are relevant for the task at hand. We have presented both theoretical and empirical
results characterising the expressiveness of the approximate posterior in function space
obtained by MFVI and MCDO. For shallow BNNs we prove a fundamental limitation
of mean-field Gaussian and MC dropout distributions in representing in-between
uncertainty. While using deeper networks significantly improves the expressive power of
these approximating families in terms of fitting arbitrary mean and variance functions,
in practice VI does not take full advantage of this flexibility and again fails to capture
in-between uncertainty. Although this is of greatest relevance for lower-dimensional
regression tasks, the fact that MFVI and MCDO often fail these simple sanity checks
indicates that these methods might generally have predictive distributions which are
qualitatively different from the exact predictive. While BNNs have previously been
shown to provide uncertainty estimates that are useful for a range of downstream tasks,
it remains an open question as to what extent this is attributable to a resemblance
between the approximate and exact predictive posteriors.
To date, BNN approximate posteriors are poorly understood, especially when
compared with the extensive work that has been done on understanding BNN priors
(Lee et al., 2018; Matthews et al., 2018; Neal, 1995; Yang, 2019a). Together with the
results of Coker et al. (2022), Theorems 1 to 4 serve as an important first step in
theoretically characterising the behaviour of approximate inference in these models.
Finally, Theorem 4 raises important questions about the flexibility of approximate
inference in deep networks: Can the theorem be extended to covariances between the
network outputs (i.e., statements about joint distributions in function space)? Why is
Criterion 2 not satisfied when performing VI in weight space, even when Criterion
1 is satisfied? We hope our results motivate future work to better understand the
interaction between approximating families and objective functions, as well as new
approximate inference methods which can realise the full potential of BNNs.
Chapter 4
Neural processes
4.1 Introduction
In the first part of this thesis, we considered Bayesian neural networks as a promising
machine learning model for making predictions under uncertainty. However, we saw
that exact inference was intractable, and that approximate inference often led to behaviour which was
qualitatively different from that of the true predictive distribution. Now, we turn to
the second focus of this thesis: neural processes (NPs) (Garnelo et al., 2018a,b; Kim
et al., 2018). NPs are a recently proposed family of deep learning models. Like BNNs,
NPs address a shortcoming of modern deep learning: it is not easily applicable in
settings where the dataset is small and good uncertainty estimation is required. As an
example, consider a doctor using machine learning to predict the future time evolution
of a patient’s biophysical data. The doctor has access to measurements of the patient’s
data collected during their stay at the hospital. However, having just a single patient’s
data is unlikely to provide all the information needed to make an accurate prediction.
Ideally, the doctor would like to incorporate inductive biases into the model obtained
from the medical histories of many patients.
In this thesis, we consider an inductive bias as any modelling assumption which is
baked into the model before training on the data in the task at hand (in this case, the
biophysical data of the patient of interest). There are many kinds of inductive biases
that can be incorporated into a neural network model, and they vary on a continuum
from very general to very specific. For example, using a deep MLP architecture is a
very general inductive bias, which enforces some degree of smoothness in the function,
but is otherwise extremely flexible. Beyond this, architectures like convolutional neural
networks bake in translation equivariance into the model, thus restricting the class
of functions that can be represented. In standard Bayesian machine learning models,
inductive biases are incorporated into the model by specifying a prior over functions.
This is the case with BNNs, where the model architecture combined with the prior over
the weights induces a distribution over predictive functions. In our example, ideally,
this prior would include information about how biophysical data is likely to behave
in general, which would then be combined with the specific observations made of the
current patient. Having accurate and informative inductive biases tailored to the task
at hand would allow the model to learn from far fewer examples compared to a model
that only had very general inductive biases, e.g., about the smoothness of the function.
Unfortunately, specifying such inductive biases by hand usually requires expert
knowledge, both of the application area and of the Bayesian model class. Instead, neural
processes approach this problem with meta-learning, or learning to learn (Schmidhuber,
1987; Thrun and Pratt, 2012). Meta-learning frames the task of finding suitable
inductive biases as part of a supervised learning problem, where each learning instance
is an entire dataset (in this case, a patient’s entire biophysical data trajectory), rather
than a single datapoint (in this case, a single time stamp in a patient’s biophysical data
trajectory). Meta-learning removes the burden of prior selection from the machine
learning practitioner, which for probabilistic methods like BNNs is notoriously difficult
(Fortuin, 2022). Instead, the relevant inductive biases are learned directly from data.
When such data is available, e.g., in this example where there are many related patient
trajectories, it would be advantageous for the model to make full use of it directly,
rather than only using it to inform a human expert’s modelling decisions.
Having argued for the benefits of learning inductive biases directly from data, it
is important to mention that in Chapter 5 we will see that even with meta-learning,
there are benefits to incorporating high-level inductive biases, such as convolutional
structure. However, compared to specifying a BNN prior, which assigns a probability
density to every possible setting of the weights, this is a much more general inductive
bias. In general, on the spectrum of specificity of inductive biases in methods, there
is usually an optimum where some information is baked in to the model by human
design, and some information is learned directly from data. With neural processes, we
explore a model which is closer to the ‘data-driven’ end of this spectrum than BNNs.
We have described the advantages of taking a data-driven approach to learning
inductive biases. Another key feature of neural processes is that, like BNNs, they
explicitly model uncertainty in their predictions. Continuing our example, suppose the
doctor is planning to make a potentially life-changing treatment decision based on the
network’s predictions. It is then crucial that the network knows when it should be
uncertain, instead of being confidently wrong. As we saw in Chapters 2 and 3, BNNs
approach this problem by placing a prior probability distribution on the weights of
the network, which is then updated using Bayes’ theorem. In contrast, NPs take a
more direct approach where, given an observed dataset, the neural network outputs
are used to specify the parameters of a predictive stochastic process, i.e. a distribution
over predictive functions. For conditional neural processes (CNPs) (Garnelo et al.,
2018a), this approach does not require any intractable inference procedures, and for
latent neural processes (LNPs) (Garnelo et al., 2018b), it may require inference only
over a set of latent variables which is much smaller than the number of weights in the
network.
In summary, neural processes are a collection of models that work by meta-learning a
distribution over predictive functions, i.e., a predictive stochastic process. Meta-learning
allows NPs to incorporate data from many related tasks, and providing predictive
stochastic processes instead of deterministic functions allows NPs to effectively represent
uncertainty via the randomness in the function, similarly to BNNs.
In this chapter, we will present an introduction to neural processes, covering several
of the NP variants that have been introduced so far, and unpack both the terms
‘meta-learning’ and ‘stochastic process’ in more detail. In addition, in Section 4.4.3 we
present a novel objective function for training latent neural processes, which we will
evaluate against the standard latent neural process objective in Chapter 5.
The exposition in this chapter is based on a Jupyter-book tutorial on neural
processes, ‘The Neural Process Family’ (Dubois et al., 2020), which I wrote together with
Yann Dubois and Jonathan Gordon. I was involved in all aspects of the writing. The use
of the approximate log-likelihood objective presented in Section 4.4.3 was first proposed
in ‘Meta-Learning Stationary Stochastic Process Prediction with Convolutional Neural
Processes’ (Foong et al., 2020a). The research on the new objective in that paper was
conducted along with my co-authors Wessel P. Bruinsma, Jonathan Gordon, Yann
Dubois, and James Requeima. Richard E. Turner supervised the work throughout.
I was involved in all aspects of writing in that paper, and with the proposal and
evaluation of the newly proposed approximate maximum likelihood objective.
4.1.1 Meta-learning
In standard supervised learning, a neural network is trained to output a predictive
function given an observed dataset. To make this more precise, we introduce some
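notation. Let $\mathcal{X} = \mathbb{R}^{d_x}$ be the space of inputs to the function, and let
$\mathcal{Y} \subset \mathbb{R}^{d_y}$, with $\mathcal{Y}$ compact, be the space of outputs (though to ease
notation, we often assume $\mathcal{Y} \subset \mathbb{R}$).1 Let $\mathcal{Z}_M = (\mathcal{X} \times \mathcal{Y})^M$ be the
collection of $M$ input–output pairs, let $\mathcal{Z}_{\le M} = \bigcup_{m=1}^{M} \mathcal{Z}_m$ be the collection of
at most $M$ pairs, and let $\mathcal{Z} = \bigcup_{m=1}^{\infty} \mathcal{Z}_m$ be the collection of finitely many
pairs. Then, a single datapoint is an element of $\mathcal{X} \times \mathcal{Y}$, a dataset $D$ with $M$
datapoints is an element of $\mathcal{Z}_M$, and all finite-sized datasets are elements of $\mathcal{Z}$.
Let $C_b(\mathcal{X}, \mathcal{Y})$ be the space of continuous, bounded functions $\mathcal{X} \to \mathcal{Y}$.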
In supervised learning, a neural network is trained on a single dataset D ∈ Z
typically using a variant of stochastic gradient descent on some loss function. The
trained network f ∈ Cb(X ,Y) is then used as a predictor.2 For example, we may train
a network using ADAM to minimise the MAP objective defined in Equation (2.14).
This allows us to associate any supervised learning dataset D with its corresponding
trained network f (assuming the hyperparameters and random seed have been fixed).
The supervised learning algorithm (defined by the choice of objective function, hyper-
parameters, etc.), which we denote as A, can thus be seen as a map A : Z → Cb(X ,Y).
At test time, a prediction at a target input x ∈ X can be made by feeding it into the
predictor to obtain f(x).
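To make the view of a learning algorithm as a map from datasets to predictors concrete, here is a minimal sketch. It swaps the neural network trained with ADAM for a linear model with a closed-form solution, so the feature map, the name `learning_algorithm` and the value of `weight_decay` are illustrative assumptions rather than anything used in this thesis.

```python
import numpy as np

def features(x):
    """Hypothetical fixed feature map: polynomial features up to degree 3."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=-1)

def learning_algorithm(dataset, weight_decay=1e-3):
    """The map A: takes a dataset D of any finite size and returns a predictor f."""
    x, y = dataset
    Phi = features(x)
    # Closed-form minimiser of squared error plus an l2 penalty on the weights.
    A_mat = Phi.T @ Phi + weight_decay * np.eye(Phi.shape[1])
    w = np.linalg.solve(A_mat, Phi.T @ np.asarray(y, dtype=float))

    def f(x_target):
        return features(x_target) @ w

    return f

# A small dataset generated from y = x^2.
D = (np.array([-1.0, 0.0, 1.0, 2.0]), np.array([1.0, 0.0, 1.0, 4.0]))
f = learning_algorithm(D)          # f = A(D)
prediction = f(np.array([0.5]))    # prediction at a target input
```

Everything that defines the algorithm (the objective, the features, the regulariser) is fixed by hand inside `learning_algorithm`; only the dataset varies.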
The key insight of supervised meta-learning is that we can apply supervised learning
to learn the map A itself. In other words, we learn the supervised learning algorithm
A using another, higher-level supervised learning algorithm: hence the name ‘meta-
learning’. To achieve this, we parameterise a space of supervised learning algorithms,
and optimise over that space. For training, we need a collection $\mathcal{M} := (D_n)_{n=1}^{N_{\text{tasks}}}$ of
related datasets, where each $D_n \in \mathcal{Z}$ is itself a supervised learning dataset. We refer
to M as a meta-trainset. The result of meta-training on M (i.e., optimising over the
parameterised space of learning algorithms) is a supervised learning algorithm, i.e.,
a map Z → Cb(X ,Y). However, instead of being defined by a loss function and an
optimiser like standard supervised learning algorithms, the algorithm is specified by a
parametric function (in our case a neural network) which is learned directly from data.
Once the meta-learner has been trained, we can deploy the learned algorithm on
new datasets that are not in the meta-trainset. We refer to this as making predictions
at meta-test time. In this thesis we are concerned with the case where the learnable
algorithm A : Z → Cb(X ,Y) is entirely parameterised by a neural network, i.e. the
adaptation to a new task is done with a single forward pass, without any gradient
updates. This is in contrast to the popular model-agnostic meta-learning (MAML)
algorithm, which uses gradient steps to update the parameters of the network during
meta-test time (Finn et al., 2017).
1 In this thesis we focus on regression tasks. For classification, Y = {1, ..., K}, where K is the
number of classes.
2 The neural network output may not actually be bounded if X is not compact, but this is not
important for our exposition.
Because meta-learning can share information across learning tasks, it is especially
well-suited to situations where there are many similar tasks, and each task is a small
dataset, as in, e.g., few-shot learning. The small data regime is precisely when we
would expect uncertainty in our predictions to matter the most. To relate this back to
our example, if we only record the patient’s data at a small number of timestamps, can
we always give a confident answer as to how that data will evolve? What we need is to
express our uncertainty, and this leads us naturally to consider stochastic processes.
4.1.2 Stochastic process prediction
We have seen that we can think of meta-learning as learning a map directly from
observed datasets D ∈ Z to predictor functions f ∈ Cb(X ,Y). However, there are
many situations where a point estimate prediction is insufficient. Given a set of query
inputs x, what we need is often not a single prediction f(x), but rather a distribution
over predictions p(y|x;D), where y are the output values.3 As long as these predictive
distributions are consistent with each other for different choices of x, this is equivalent
to specifying a distribution over functions X → Y . Such a distribution is known as a
stochastic process.
In detail, we define a stochastic process as a probability measure on the set of
functions4 from X → Y, i.e. $\mathcal{Y}^{\mathcal{X}}$, equipped with the product σ-algebra of the Borel
σ-algebra over each index point (Tao, 2011), denoted Σ. The measurable sets of Σ
are those which can be specified by the values of the function at a countable subset
I ⊂ X of its input locations. Since in practice we only ever observe data and make
predictions at a finite number of points, this is sufficient for our purposes.5 We
denote the set of all $\mathcal{Y}^{\mathcal{X}}$-valued stochastic processes as P(X ,Y). Instead of considering
learning algorithms that give point estimates, i.e. those mapping Z → Cb(X ,Y), we
now consider algorithms that map Z → P(X ,Y). Each predictor sampled from the
resulting stochastic process represents a plausible interpolation of the data, and the
diversity of the samples reflects the uncertainty in the predictions. Hence, a neural
process can be viewed as using neural networks to meta-learn a map from datasets to
predictive stochastic processes.
3 Here we use the notation p(y|x;D) instead of p(y|x,D) to emphasise that the distribution need
not depend on D via exact Bayesian conditioning, but rather can depend on D in an arbitrary way
(for example, through a neural network).
4 Note that this is non-standard terminology since strictly speaking, a stochastic process is a random
variable, i.e., a map from Ω → $\mathcal{Y}^{\mathcal{X}}$, where Ω is some abstract sample space. The resulting measure on
$\mathcal{Y}^{\mathcal{X}}$ is then known as the law of a stochastic process. In this thesis we will colloquially use the phrase
‘stochastic process’ to refer to both a stochastic process and its law.
5 However note that this σ-algebra is not rich enough to answer questions concerning properties
that depend on an uncountable number of index points, such as continuity of the functions sampled
from the stochastic process.
This point of view can be clarified by comparing NPs to the most commonly used
form of stochastic process prediction in machine learning: Gaussian process regression
(Rasmussen and Williams, 2005). In GP regression, we begin by specifying a prior
stochastic process, which is a GP, with $f_{\text{prior}} \sim \mathcal{GP}(0, k)$, where $k$ is the kernel function.
Let the observed data $D = (x_n, y_n)_{n=1}^N$, with $X := (x_n)_{n=1}^N$ and $\mathbf{y} := (y_n)_{n=1}^N$. Then, for
a Gaussian observation likelihood with variance $\sigma^2$, we can perform exact Bayesian
inference to obtain the posterior predictive stochastic process:
\begin{align}
f_{\text{post}} &\sim \mathcal{GP}(\mu_{\text{post}}, k_{\text{post}}) \tag{4.1}\\
\mu_{\text{post}} &= k(\cdot, X)\,(k(X, X) + \sigma^2 I)^{-1}\,\mathbf{y} \tag{4.2}\\
k_{\text{post}} &= k(\cdot, \cdot) - k(\cdot, X)\,(k(X, X) + \sigma^2 I)^{-1}\,k(X, \cdot), \tag{4.3}
\end{align}
where $f_{\text{post}}$ is a sample from the posterior predictive stochastic process, $k(X, X) \in
\mathbb{R}^{N \times N}$ is the kernel matrix at the training inputs, $k(X, \cdot)$ is a column-vector-valued
function of the kernel values between the evaluation point and the training inputs, and
$k(\cdot, X) = k(X, \cdot)^T$.
Here we can view the process of conditioning the prior GP on the data D as a map
that takes D as input and outputs a predictive stochastic process. In other words, GP
regression is a map Z → P(X ,Y) defined by (i) specifying a prior using the kernel k
and (ii) performing Bayesian inference. NPs, on the other hand, define this map using
a neural network directly, side-stepping this two-step procedure. In contrast to GPs,
rather than having inductive biases put into the model via the choice of the prior, NPs
learn these biases directly from the meta-trainset.
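The view of GP regression as a map from a dataset to a predictive process can be sketched in a few lines. This is an illustrative implementation of Equations (4.2) and (4.3) evaluated at a finite set of test inputs; the exponentiated-quadratic kernel, its lengthscale and the noise variance are assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Exponentiated-quadratic kernel for 1-D inputs (an assumed choice of k)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01):
    """Map a dataset D = (X, y) to the posterior mean and covariance at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train)                  # k(., X) at the test inputs
    mean = K_star @ np.linalg.solve(K, y_train)           # Equation (4.2)
    cov = rbf_kernel(x_test, x_test) - K_star @ np.linalg.solve(K, K_star.T)  # (4.3)
    return mean, cov

x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.array([0.0, 5.0])  # one input at a training location, one far away
mean, cov = gp_posterior(x_train, y_train, x_test)
```

The posterior variance `cov[0, 0]` at the training location is small, whereas `cov[1, 1]` far from the data stays close to the prior variance. Both steps, the kernel and the conditioning, are specified by hand here; an NP replaces them with a single learned network.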
4.1.3 Stochastic process consistency
In the previous section, we considered specifying a stochastic process by specifying
p(y|x;D) for all finite collections of target inputs x using a neural network. Each
distribution for a given set of inputs x is referred to as a finite-dimensional distribution
of the stochastic process. An important question to ask is, can we stitch together all of
these finite-dimensional distributions to obtain a single consistent stochastic process?
The Kolmogorov extension theorem (see e.g. Tao (2011, Section 2.4)) tells us that we
can, as long as the marginals are consistent with each other under permutation and
marginalisation.
To illustrate these consistency conditions, we consider some artificial examples
of finite-dimensional distributions that are not consistent. Let x1, x2 be two input
locations, with y1, y2 the corresponding (probabilistic) outputs.
1. Consider a collection of finite-dimensional distributions with $y_1 \sim \mathcal{N}(0, 1)$ and
$[y_1, y_2]^T \sim \mathcal{N}([10, 0]^T, I)$. What is the mean of $y_1$?
2. Consider a collection with $[y_1, y_2]^T \sim \mathcal{N}([0, 0]^T, I)$ and $[y_2, y_1]^T \sim \mathcal{N}([1, 1]^T, I)$.
What is the mean of $y_1$? What is the mean of $y_2$?
From these examples, it is clear that inconsistent marginals lead to self-contradictory
predictions. In the first example, the marginals were not consistent under marginalisation:
marginalising out $y_2$ from the distribution of $[y_1, y_2]^T$ did not yield the distribution
of $y_1$. In the second case, the marginals were not consistent under permutation: the
distributions differed depending on whether $y_1$ or $y_2$ was considered first.
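The two artificial examples can also be checked mechanically. The sketch below represents each Gaussian finite-dimensional distribution as a (mean, covariance) pair; the helper names `marginal` and `same_gaussian` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def marginal(mean, cov, keep):
    """Marginalise (or reorder) a Gaussian onto the output indices listed in `keep`."""
    keep = np.asarray(keep)
    return mean[keep], cov[np.ix_(keep, keep)]

def same_gaussian(a, b):
    """True if two (mean, covariance) pairs describe the same Gaussian."""
    return bool(np.allclose(a[0], b[0]) and np.allclose(a[1], b[1]))

# Example 1: p(y1) = N(0, 1) but p(y1, y2) = N([10, 0], I).
p_y1 = (np.array([0.0]), np.array([[1.0]]))
p_y1y2 = (np.array([10.0, 0.0]), np.eye(2))
consistent_under_marginalisation = same_gaussian(marginal(*p_y1y2, keep=[0]), p_y1)

# Example 2: p(y1, y2) = N([0, 0], I) but p(y2, y1) = N([1, 1], I).
p_y1y2_b = (np.zeros(2), np.eye(2))
p_y2y1_b = (np.ones(2), np.eye(2))
# Reordering (y2, y1) into (y1, y2) keeps both indices, swapped.
consistent_under_permutation = same_gaussian(marginal(*p_y2y1_b, keep=[1, 0]), p_y1y2_b)

# A consistent pair for comparison: N([0, 0], I) does marginalise to N(0, 1).
consistent_control = same_gaussian(marginal(*p_y1y2_b, keep=[0]), p_y1)
```

Both checks on the artificial examples fail, while the control pair passes.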
These inconsistencies can never occur when doing GP regression, since we begin
by specifying a consistent stochastic process prior and compute an exact conditional
probability for the predictive distribution. However, when using arbitrary neural
networks to directly specify the finite-dimensional distributions of the predictive, some
care must be taken so that our definition satisfies these consistency conditions. Later
we will prove that these problems will never occur for NPs — given a fixed dataset D
to condition on, the NP predictive distributions p(y|x;D) always define a consistent
stochastic process.
So far, we have only considered what happens when the conditioning dataset D
is fixed and the target inputs x are varied. There is another kind of consistency
that we might expect stochastic process predictions to satisfy: consistency among
predictions with different context sets, with respect to the product rule of probability.
To illustrate this, consider two input-output pairs, (x1, y1) and (x2, y2). The product
rule of probability tells us that any well-defined joint predictive density over y1, y2 must
satisfy:
\begin{align}
p(y_1, y_2 \mid x_1, x_2) &= p(y_1 \mid x_1)\, p(y_2 \mid x_2, y_1, x_1) \tag{4.4}\\
&= p(y_2 \mid x_2)\, p(y_1 \mid x_1, y_2, x_2). \tag{4.5}
\end{align}
This is equivalent to requiring that the distribution over y1, y2 obtained by autoregressive
sampling from the model should be independent of the order in which the sampling
is performed. Unfortunately, this is not guaranteed to be the case for NPs, i.e., it is
possible that, for a neural process:
\begin{equation}
p(y_1 \mid x_1)\, p(y_2 \mid x_2; x_1, y_1) \neq p(y_2 \mid x_2)\, p(y_1 \mid x_1; x_2, y_2).{}^{6} \tag{4.6}
\end{equation}
Ideally, this property would be a guaranteed consequence of the NP model definition, as
is the case with GP regression. As it stands, NPs can yield good predictive performance
even though they do not exactly obey this product-rule consistency, likely because
the training procedure encourages NPs to respect this property approximately, if not
exactly. From another point of view, it may be the case that NPs are easier to train
than BNNs precisely because they do not guarantee consistency with the rules of
probability theory. By not directly attempting to enforce this product-rule consistency,
they sidestep the requirement for complicated approximate inference procedures.
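A toy comparison illustrates how product-rule consistency can fail. In the sketch below, both predictors ignore the input locations for simplicity. `bayes_predictor` is the exact Bayesian predictive of a simple Gaussian model, while `toy_predictor` is a hypothetical hand-written rule standing in for an arbitrary learned map. Neither is an actual neural process.

```python
import math

def log_normal(y, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def toy_predictor(context_ys):
    """Hypothetical p(y | x; D): mean is the context average (0 if empty), variance 1."""
    mean = sum(context_ys) / len(context_ys) if context_ys else 0.0
    return mean, 1.0

def bayes_predictor(context_ys, prior_var=1.0, noise_var=1.0):
    """Exact predictive for y_n = m + noise, with prior m ~ N(0, prior_var)."""
    n = len(context_ys)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * sum(context_ys) / noise_var
    return post_mean, post_var + noise_var

def autoregressive_log_density(predictor, ys):
    """log p(ys[0]) + log p(ys[1] | ys[0]) + ..., in the order given."""
    total = 0.0
    for i, y in enumerate(ys):
        mean, var = predictor(ys[:i])
        total += log_normal(y, mean, var)
    return total

y1, y2 = 0.3, -1.2
toy_gap = (autoregressive_log_density(toy_predictor, [y1, y2])
           - autoregressive_log_density(toy_predictor, [y2, y1]))
bayes_gap = (autoregressive_log_density(bayes_predictor, [y1, y2])
             - autoregressive_log_density(bayes_predictor, [y2, y1]))
```

For the exact Bayesian predictor the two orderings give the same joint log density, so `bayes_gap` is zero up to floating-point error. For the toy rule the gap equals (y2² − y1²)/2, which is non-zero whenever the two observations differ in magnitude.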
4.1.4 The prediction map
We now discuss what mapping we would like NPs to learn ideally. We model the world
as having a ground truth stochastic process P ∈ P(X ,Y), from which all our observed
datasets are drawn. More precisely, let $x_c \in \mathcal{X}^C$ and $x_t \in \mathcal{X}^T$ with $C, T \in \mathbb{N}$ be two
sets of input locations. We would like to define what it means to make predictions for
the random function values $y_t \in \mathcal{Y}^T$ at $x_t$ conditioned on observations of the random
function values $y_c \in \mathcal{Y}^C$ at $x_c$, given that the ground truth stochastic process P is
completely known. In reality this will not be the case, but it serves as a target that we
would like NPs to approximate.
Let p(·|xc) and p(·|xt) denote the densities with respect to Lebesgue measure of the
finite-dimensional distributions of P when indexed at xc and xt respectively. In this
thesis we will assume that these densities always exist. We then have:
yt ∼ p(yt|xt), (4.7)
yc ∼ p(yc|xc). (4.8)
6 Recall that the semicolon after the conditioning bar denotes the fact that the probability
distribution depends on the elements following it in some arbitrary way, in this case via a neural
network.
In accordance with the product rule of probability, we then define the finite-dimensional
distribution of the predictive stochastic process at xt conditioned on (xc, yc) as having
the density7
\begin{equation}
p(y_t \mid y_c, x_t, x_c) = \frac{p(y_t, y_c \mid x_t, x_c)}{p(y_c \mid x_c)}. \tag{4.9}
\end{equation}
It can easily be verified that for a fixed conditioning dataset Dc := (xc, yc), the
conditional marginal distributions defined by different choices of xt in Equation (4.9) are
Kolmogorov-consistent in the sense described in Section 4.1.3. Hence, the Kolmogorov
extension theorem implies there is a unique measure on $(\mathcal{Y}^{\mathcal{X}}, \Sigma)$ that has Equation (4.9)
as its finite-dimensional distributions. We denote this measure by $P_{D_c}$. It is the
predictive stochastic process obtained by conditioning $P$ on the observations in $D_c$. We
now define $\pi_P : \mathcal{Z} \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$, $\pi_P : D_c \mapsto P_{D_c}$ as the prediction map, so called because
it maps each observed dataset Dc to the predictive stochastic process conditioned on
Dc. The general prediction problem, and the objective of training neural processes,
may then be viewed as learning to approximate the prediction map πP . In Section 4.4
we will discuss training procedures and optimisation objectives designed to achieve this
goal.
4.2 Neural process architectural framework
We now discuss the basic design pattern that underlies the architecture of many NPs.
This involves viewing NPs as an encoder-decoder model, where the encoder processes
the conditioning dataset D, and the decoder combines the encoded representation with
the query input x to form a prediction. The basic NP architectural framework can be
motivated by the following design goals:
1. The dataset to be conditioned on, D, should be treated as a set. This differs
from standard vector-valued neural network inputs in that: (i) datasets may
have varying sizes; (ii) sets have no intrinsic ordering of their elements. This
means that NPs should be invariant with respect to permutations of D. That is,
p(y|x;D) = p(y|x; πD), where πD is any dataset formed by permuting the order
of the datapoints in D.
2. The resulting predictive distributions p(y|x;D) should be consistent with each
other for varying x to ensure that NPs give rise to consistent stochastic processes,
as dictated by the Kolmogorov extension theorem.
7In contrast to Equation (4.6), here we use a comma rather than a semicolon after the conditioning
bar, since these are exact values computed with the rules of probability, rather than approximations
given by a neural network.
We now describe the encoder of an NP. The encoder for most NPs can be written
in the form
\begin{equation}
R(D) = \sum_{(x, y) \in D} \phi(x, y), \tag{4.10}
\end{equation}
where ϕ is a deep neural network, ϕ : X × Y → E , where E is some representation
space, and R(D) ∈ E is a fixed-dimensional representation of the dataset D. The
summation operation defining R is key as it ensures permutation invariance due to the
commutativity of summation. It also ensures that R ‘lives’ in the same space regardless
of the number of datapoints in D. Hence all encoders of this form automatically satisfy
the first design goal given above.
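A minimal sketch of a sum-pooling encoder of the form in Equation (4.10) is given below. The network ϕ here is a small MLP with random, untrained weights, and the layer sizes and the 8-dimensional representation space are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)  # hypothetical weights of phi
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)   # representation space E = R^8

def phi(xy):
    """Small MLP applied independently to each (x, y) pair."""
    return np.tanh(xy @ W1 + b1) @ W2 + b2

def encode(x, y):
    """R(D): the sum of phi(x, y) over the datapoints, as in Equation (4.10)."""
    pairs = np.stack([x, y], axis=-1)  # shape (M, 2)
    return phi(pairs).sum(axis=0)      # shape (8,), whatever M is

x = np.array([0.1, -0.4, 1.3, 2.0])
y = np.sin(x)
perm = rng.permutation(len(x))

R = encode(x, y)
R_permuted = encode(x[perm], y[perm])  # same dataset, different ordering
R_small = encode(x[:2], y[:2])         # a dataset with fewer points
```

`R` and `R_permuted` agree up to floating-point error, and `R_small` has the same shape as `R` even though it summarises fewer datapoints.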
Next, the NP has to combine this representation with the query input locations x
to return a prediction. We can broadly categorise NPs into two sub-families based on
how this is done.
Conditional neural processes (CNPs) directly use the deterministic representation R
to define a predictive distribution that is factorised conditioned on R. That is, given a
set of query input locations $x \in \mathcal{X}^N$ with $x = (x_1, \ldots, x_N)$, the predictive distribution
is given by:
\begin{equation}
p(y \mid x; D) = \prod_{n=1}^{N} p(y_n \mid x_n, R(D)). \tag{4.11}
\end{equation}
Here each factor p(yn|xn, R(D)) is a parameterised probability density (typically
Gaussian), whose parameters are given by the decoder network. The decoder takes the
query input location xn and conditioning dataset representation R(D) and returns the
parameters of the predictive distribution. The graphical model for a CNP is shown in
Figure 4.1.
On the other hand, latent neural processes8 (LNPs) use the representation R(D)
to parameterise a distribution over a latent variable, z ∼ p(z|R(D)). The predictive
distribution is then factorised conditionally given z. That is,
\[ p(y \mid x; D) = \int \prod_{n=1}^{N} p(y_n \mid x_n, z) \, p(z \mid R(D)) \, dz. \tag{4.12} \]
8Note that in the original paper where LNPs are introduced (Garnelo et al., 2018b), they are
simply known as neural processes. We believe this terminology can be confusing, hence we prefer to
use the term ‘neural process’ as an umbrella term covering both LNPs and CNPs.
Fig. 4.1 Graphical model of a conditional neural process. Grey circles denote observed
variables.
As with conditional neural processes, the factors p(yn|xn, z) are specified by a decoder
network. The graphical model for an LNP is shown in Figure 4.2.
As we will see, LNPs offer more expressive predictive distributions than CNPs: because
all query locations share the latent variable z, LNPs can induce correlations between
different query locations, but at the cost of making the likelihood of the model
intractable. CNPs are generally easier to train and have
closed-form objective functions, but cannot be used to sample coherent functions from
the predictive distribution, due to the factorisation assumption. In Sections 4.2.2 to 4.2.4
we will describe some concrete instantiations of the NP architectural framework. First,
however, we provide a quick proof that both CNPs and LNPs satisfy the Kolmogorov
consistency requirement given above.
4.2.1 Kolmogorov consistency of CNPs and LNPs
Here we show that both CNPs and LNPs meet the consistency requirements to specify
a stochastic process according to the Kolmogorov extension theorem, given a fixed
conditioning dataset D. We first consider CNPs. Recall that we require consistency
under both marginalisation and permutation:
Proposition 1. The finite-dimensional distributions of CNPs are consistent under
marginalisation.
Fig. 4.2 Graphical model of a latent neural process. Grey circles denote observed
variables.
Proof. Consider two query inputs, x1, x2 ∈ X . Then by marginalising out the second
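predicted output and using Equation (4.11), we get:
\begin{align}
\int p(y_1, y_2 \mid x_1, x_2; D) \, dy_2 &:= \int p(y_1 \mid x_1, R(D)) \, p(y_2 \mid x_2, R(D)) \, dy_2 \tag{4.13} \\
&= p(y_1 \mid x_1, R(D)) \int p(y_2 \mid x_2, R(D)) \, dy_2 \tag{4.14} \\
&= p(y_1 \mid x_1, R(D)) \tag{4.15} \\
&:= p(y_1 \mid x_1; D), \tag{4.16}
\end{align}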
which shows that the predictive distribution obtained by querying the CNP at x1 is the
same as that obtained by querying it at x1, x2 and then marginalising out the second
target point. Of course, the same argument holds for collections of any size, and when
marginalising out any subset of the variables.
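For completeness, the general case mentioned at the end of the proof can be written out explicitly. The index set S ⊆ {1, ..., N} of retained query points is notation introduced here for illustration only; it does not appear in the original derivation.

```latex
\begin{align}
\int p(y_{1:N} \mid x_{1:N}; D) \prod_{n \notin S} dy_n
  &= \int \prod_{n=1}^{N} p(y_n \mid x_n, R(D)) \prod_{n \notin S} dy_n \\
  &= \prod_{n \in S} p(y_n \mid x_n, R(D)) \prod_{n \notin S} \int p(y_n \mid x_n, R(D)) \, dy_n \\
  &= \prod_{n \in S} p(y_n \mid x_n, R(D)).
\end{align}
```

Each integral in the second line equals one because every factor is a normalised density in its own output variable, so only the factors indexed by S remain.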
Proposition 2. The finite-dimensional distributions of CNPs are consistent under
permutation.
Proof. Let (x_n)_{n=1}^N be the query inputs and π be any permutation of {1, ..., N}. Then,
again using Equation (4.11), the predictive density is given by:
\begin{align}
p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) &:= \prod_{n=1}^{N} p(y_n \mid x_n, R(D)) \tag{4.17} \\
&= \prod_{n=1}^{N} p(y_{\pi(n)} \mid x_{\pi(n)}, R(D)) \tag{4.18} \\
&:= p(y_{\pi(1)}, \ldots, y_{\pi(N)} \mid x_{\pi(1)}, \ldots, x_{\pi(N)}; D), \tag{4.19}
\end{align}
since multiplication is commutative.
It is clear from these derivations that these properties hold for any factorised
predictive distributions. The proof of Kolmogorov consistency for LNPs is similar to
that given for CNPs:
Proposition 3. The finite-dimensional distributions of LNPs are consistent under
marginalisation.
Proof. Consider two query inputs, x1, x2. Then by marginalising out the second
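predicted output and using Equation (4.12), we obtain:
\begin{align}
\int p(y_1, y_2 \mid x_1, x_2; D) \, dy_2 &:= \int \int p(y_1 \mid x_1, z) \, p(y_2 \mid x_2, z) \, p(z \mid R(D)) \, dz \, dy_2 \tag{4.20} \\
&= \int p(y_1 \mid x_1, z) \, p(z \mid R(D)) \int p(y_2 \mid x_2, z) \, dy_2 \, dz \tag{4.21} \\
&= \int p(y_1 \mid x_1, z) \, p(z \mid R(D)) \, dz \tag{4.22} \\
&:= p(y_1 \mid x_1; D), \tag{4.23}
\end{align}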
which shows that the predictive distribution obtained by querying an LNP at x1 is
the same as that obtained by querying it at x1, x2 and then marginalising out the
second target point. Again, the same idea works with collections of any size, and when
marginalising out any subset of the variables.
Proposition 4. The finite-dimensional distributions of LNPs are consistent under
permutation.
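Proof. Let (x_n)_{n=1}^N be the query inputs and π be any permutation of {1, ..., N}. Then
the predictive density is:
\begin{align}
p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) &:= \int p(z \mid R(D)) \prod_{n=1}^{N} p(y_n \mid x_n, z) \, dz \tag{4.24} \\
&= \int p(z \mid R(D)) \prod_{n=1}^{N} p(y_{\pi(n)} \mid x_{\pi(n)}, z) \, dz \tag{4.25} \\
&:= p(y_{\pi(1)}, \ldots, y_{\pi(N)} \mid x_{\pi(1)}, \ldots, x_{\pi(N)}; D), \tag{4.26}
\end{align}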
since multiplication is commutative.
We next describe some concrete instantiations of the NP architectural framework.
4.2.2 MLP-conditional neural processes
The simplest model in the NP architectural framework, and the first to be proposed, is
the MLP-conditional neural process — usually just referred to as the conditional neural
process (CNP) (Garnelo et al., 2018a).9 Recall that specifying an instantiation of the
NP architectural framework given at the beginning of Section 4.2 requires defining an
encoder and decoder. For MLP-CNPs, the encoder is given by:
\[ R(D) = \sum_{(x,y) \in D} \phi(x, y), \tag{4.27} \]
where ϕ is a multilayer perceptron. More precisely, given an observed datapoint with
x ∈ R^{d_x} and y ∈ R^{d_y}, x and y are concatenated and fed into an MLP ϕ : R^{d_x+d_y} → R^{d_R},
where d_R is the dimensionality of the representation. Following this, the per-datapoint
representations ϕ(x, y) are summed together to form a representation of the entire
dataset, R(D) ∈ R^{d_R}.
Next, we specify the predictive distribution of the CNP. Following Equation (4.11),
the CNP predictions are factorised over each query input, and we additionally use
9A note about terminology: in the original publication (Garnelo et al., 2018a), what we refer to as
the MLP-CNP is simply known as the CNP. Instead, we usually use the term CNP to refer to the
entire class of neural process models that make factorised predictions as in Equation (4.11). Similar
comments apply to the latent variable version, the MLP-LNP (Garnelo et al., 2018b): we use the
term LNP to refer to the entire class of neural process models that define their predictive distribution
using Equation (4.12).
Gaussian distributions for each factor:
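\begin{align}
p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) &= \prod_{n=1}^{N} p(y_n \mid x_n, R(D)) \tag{4.28} \\
&= \prod_{n=1}^{N} \mathcal{N}\left(y_n; \mu(x_n, R(D)), \sigma^2(x_n, R(D))\right). \tag{4.29}
\end{align}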
Here µ(xn, R(D)) and σ^2(xn, R(D)) are again defined by MLPs, with µ : R^{d_x+d_R} → R^{d_y},
and σ^2 : R^{d_x+d_R} → R^{d_y}. Together, these networks form the decoder of the MLP-CNP.
In practice it is common to use a single MLP that outputs both µ(xn, R(D)) and
log σ2(xn, R(D)), where the logarithm of the predictive variance is output to ensure
that σ2(xn, R(D)) > 0.
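The log-variance parameterisation and the factorised density of Equation (4.29) can be sketched as follows for scalar outputs. This is an illustration under stated assumptions rather than the thesis code: `toy_decoder` is a hypothetical closed-form stand-in for the decoder MLP, and the numbers in `rep` are made up.

```python
import math

def gaussian_log_density(y, mean, log_var):
    """log N(y; mean, exp(log_var)) for a scalar y."""
    var = math.exp(log_var)  # positive for any real-valued network output
    return -0.5 * (math.log(2 * math.pi) + log_var + (y - mean) ** 2 / var)

def toy_decoder(x, rep):
    """Hypothetical stand-in for the decoder MLP: returns (mean, log-variance)."""
    mean = rep[0] * x + rep[1]
    log_var = rep[2] - abs(x)  # negative here; exp() still gives a valid variance
    return mean, log_var

def cnp_log_likelihood(xs, ys, rep):
    """Factorised log-density: a sum over query points, following Equation (4.29)."""
    total = 0.0
    for x, y in zip(xs, ys):
        mean, log_var = toy_decoder(x, rep)
        total += gaussian_log_density(y, mean, log_var)
    return total

rep = [0.5, -0.1, -3.0]   # pretend this is R(D)
xs = [-1.0, 0.0, 2.0]
ys = [-0.6, -0.1, 0.9]
print(cnp_log_likelihood(xs, ys, rep))
```

Because the network outputs log σ² rather than σ², no constraint has to be imposed on its final layer.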
MLP-CNPs are simple to define and were shown to successfully approximate the
predictive distribution for non-Gaussian regression tasks and image inpainting (Garnelo
et al., 2018a). However, as with all CNPs, they cannot model dependencies between
query points in the predictive distribution. Since the samples of every yn value will
be independent, functions sampled from the posterior of a CNP will be extremely
noisy — there is no way to separate ‘aleatoric’ from ‘epistemic’ uncertainty in the CNP
predictive distribution, since all of the randomness is independent between different
query locations.
This inability to model dependencies renders CNPs unsuitable for downstream
applications that require coherent samples, such as Thompson sampling (Thompson,
1933), where a function is sampled from the posterior and then greedily optimised in
order to perform Bayesian optimisation. Coherent samples are also required in order
to estimate the probability of events that are defined over an extended region of the
input space. For example, consider the task of predicting if the value of the function
will exceed a certain threshold over the entirety of a range in the input. This kind of
task occurs in heatwave prediction, where we are interested in the probability that
the temperature exceeds a certain threshold over an extended period of time. CNPs
will assign an unreasonably low probability to this event since within any non-zero
interval, there are an uncountable number of query points xn, and the probability that
the corresponding output values yn all exceed a certain threshold will vanish if all
the yn are modelled as independent Gaussian distributions. Finally, the inability to
model dependencies leads to poorer joint log-likelihoods, since CNPs will be forced
to approximate the ground truth, dependent predictive distribution with a factorised
distribution. The inability of CNPs to model dependencies was addressed in follow-up
work with the introduction of the MLP-latent neural process, which we discuss next.
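The vanishing joint probability in the heatwave example is easy to reproduce numerically. The sketch below assumes, purely for illustration, that every query point has the same Gaussian marginal (mean 31, standard deviation 1) and asks for the probability that all points exceed a threshold of 30; these numbers are invented and do not come from any experiment.

```python
import math

def prob_exceeds(threshold, mean, std):
    """P(y > threshold) for y ~ N(mean, std^2), via the Gaussian tail function."""
    z = (threshold - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))

def joint_prob_independent(threshold, mean, std, num_points):
    """P(all y_n > threshold) when the y_n are independent with identical marginals."""
    return prob_exceeds(threshold, mean, std) ** num_points

# Each marginal alone exceeds the threshold with probability of roughly 0.84.
p_single = prob_exceeds(threshold=30.0, mean=31.0, std=1.0)
for n in [1, 10, 100, 1000]:
    print(n, joint_prob_independent(30.0, 31.0, 1.0, n))
```

Refining the grid of query points drives the joint probability towards zero even though every marginal is confident, which is the failure mode of a factorised predictive described above.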
4.2.3 MLP-latent neural processes
The MLP-latent neural process (MLP-LNP), commonly known simply as the latent
neural process (LNP) (Garnelo et al., 2018b), has a similar architecture to the MLP-
CNP. However, instead of directly passing the deterministic representation R(D) to
the decoder network, R(D) is used to define the parameters of a Gaussian distribution
over a latent variable z ∈ R^{d_z}:
\[ p(z \mid R(D)) = \mathcal{N}\left(z; \mu_z(R(D)), \sigma_z^2(R(D))\right). \tag{4.30} \]
Here µ_z(R(D)) ∈ R^{d_z} and σ_z^2(R(D)) ∈ R^{d_z} are output by an MLP that takes R(D) as
input. Next, the latent variable z is used to define the predictive distribution, following
Equation (4.12):
\[ p(y_1, \ldots, y_N \mid x_1, \ldots, x_N; D) = \int \prod_{n=1}^{N} \mathcal{N}\left(\mu(x_n, z), \sigma^2(x_n, z)\right) p(z \mid R(D)) \, dz. \tag{4.31} \]
The architecture of the decoder networks µ(xn, z) and σ2(xn, z) is the same as that of
the MLP-CNP. Note that if the variance of the latent variable σ_z^2(R(D)) → 0, then z
becomes a deterministic representation; hence the MLP-CNP is a special case of the
MLP-LNP. Garnelo et al. (2018b) showed that the MLP-LNP was capable of producing
coherent and diverse function samples that could be used for downstream tasks such
as Thompson sampling.
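The difference between factorised and latent-variable sampling can be illustrated with a toy model. In the sketch below, both samplers and the linear "decoder" z·x are hypothetical choices made for illustration; the point is only that sharing one draw of z across query locations induces correlation between outputs, whereas independent noise does not.

```python
import math
import random

def sample_factorised(xs, rng):
    """CNP-like sample: fresh, independent noise at every query location."""
    return [rng.gauss(0.0, 1.0) for _ in xs]

def sample_with_latent(xs, rng):
    """LNP-like sample: one latent z is drawn, then shared by all query locations."""
    z = rng.gauss(0.0, 1.0)                            # toy stand-in for z ~ p(z | R(D))
    return [z * x + rng.gauss(0.0, 0.1) for x in xs]   # toy decoder with mean z * x

def correlation(a, b):
    """Empirical Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)
xs = [1.0, 2.0]  # two query locations
fact = [sample_factorised(xs, rng) for _ in range(2000)]
lat = [sample_with_latent(xs, rng) for _ in range(2000)]

corr_factorised = correlation([s[0] for s in fact], [s[1] for s in fact])
corr_latent = correlation([s[0] for s in lat], [s[1] for s in lat])
print(corr_factorised, corr_latent)
```

The factorised sampler gives a correlation near zero between the two outputs, while the shared latent gives a correlation close to one, so each latent draw corresponds to a coherent function.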
4.2.4 Attentive neural processes
One shortcoming of MLP-based neural processes is that they have a tendency to underfit
the data (Kim et al., 2018). For example, MLP-CNPs struggle to take advantage of
the fact that if a query point is very close to a datapoint in D, they should both have
similar values, and conversely if the points are far apart. One possible explanation for
this is that all the query points xn share a single global representation R(D) of the
conditioning dataset D, i.e., R(D) is independent of the location of the query input.
This suggests that a priori, all points in the dataset D are given the same ‘importance’,
regardless of the location at which a prediction is being made. Although, as we will
see in Section 4.3, the form of the representation used by MLP-CNPs is universal in
the space of permutation invariant set functions, it nevertheless may not be the most
data-efficient representation since it does not bake in the importance of locality as an
inductive bias. One solution to this is to use a query-location-dependent representation,
R(xn, D).
To achieve this, Kim et al. (2018) propose the attentive NP (ANP), which replaces
the summation operation in MLP-NPs with aggregation using an attention mechanism
(Bahdanau et al., 2015). The attention mechanism allows the ANP to learn to attend
to specific datapoints in D that are particularly relevant to the query location, giving
them more weight than others when making a prediction. To illustrate how attention
can alleviate underfitting, consider the case where D contains two observations with
inputs x1, x2 that are very far apart. These observations are then mapped by ϕ to
the local representations ϕ(x1, y1) and ϕ(x2, y2) respectively. Intuitively, when making
predictions close to x1, the decoder should focus on ϕ(x1, y1) and ignore ϕ(x2, y2),
since ϕ(x1, y1) contains much more information about this region of input space. The
attention mechanism allows us to define this intuition algorithmically, and incorporate
it as a high-level inductive bias in the NP.
In detail, an attention mechanism works by processing a set of keys, queries and
values. The queries attend to the keys via the computation of a similarity measure.
This similarity measure is then used to compute attention weights, which are normalised
to sum to one. The attention weights are used to compute a weighted combination of
the values. The most common form of attention, dot-product attention, uses the dot
product as a similarity measure and works as follows. Consider having M key-value
pairs arranged in matrices, with the key matrix K ∈ R^{M×d_K}, and the value matrix
V ∈ R^{M×d_V}. These key-value pairs are attended to by N query vectors, Q ∈ R^{N×d_K}.
The output of the dot product attention is then computed as:
\[ \mathrm{Attention}(K, Q, V) = \underbrace{\mathrm{softmax}\left(QK^\top / \sqrt{d_K}\right)}_{W \in \mathbb{R}^{N \times M}} V \in \mathbb{R}^{N \times d_V}. \tag{4.32} \]
Here the softmax operation is applied row-wise to QK^⊤ over M elements. W is the
matrix of attention weights, and we can see that the nth row of the attention output
is given by a weighted sum of the M rows of the value matrix V .
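Equation (4.32) can be written out directly with nested lists. The following is a minimal, unoptimised sketch in plain Python; the function names and the small example matrices are assumptions made for illustration, and a practical implementation would use a tensor library instead.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of similarity scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(K, Q, V):
    """Returns (output, W); the argument order (K, Q, V) follows Equation (4.32)."""
    d_k = len(K[0])
    W = []
    for q in Q:  # one row of attention weights per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        W.append(softmax(scores))
    # Each output row is a weighted sum of the M rows of V.
    out = [[sum(w[m] * V[m][j] for m in range(len(V))) for j in range(len(V[0]))]
           for w in W]
    return out, W

K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # M = 3 keys, d_K = 2
V = [[1.0], [2.0], [6.0]]                  # M = 3 values, d_V = 1
Q = [[4.0, 0.0], [0.0, 0.0]]               # N = 2 queries
out, W = dot_product_attention(K, Q, V)
print(W)
print(out)
```

The second query is the zero vector, so all of its similarity scores are equal and its output is the plain average of the values; the first query puts most of its weight on the keys it aligns with.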
ANPs make use of the attention operation in Equation (4.32) as follows. The query
matrix Q is formed by applying a pointwise MLP to the N input locations in the query
set. The key and value matrices K,V are formed by applying two separate pointwise
MLPs to the M datapoints in the conditioning set, one for producing keys and the
other for producing values. The nth row of the output of the attention operation
Attention(K, Q, V) ∈ R^{N×d_V} is then the query-location specific representation of D,
i.e., R(xn, D). In contrast to the MLP-CNP, there are now N distinct representations,
one for each of the N query locations.
Another way to view this is as defining a weighting function w(·, ·) that weights
each datapoint in D depending on the input location we want to predict at, xn. The
datapoints in D determine the attention keys, and xn determines the attention query.
The xn-specific representation of D is then given by
\[ R(x_n, D) = \sum_{(x,y) \in D} w(x, x_n) \, \phi(x, y), \tag{4.33} \]
with the attention weights normalised so that \sum_{(x,y) \in D} w(x, x_n) = 1. This is in contrast
with Equation (4.27) where no weighting is performed. Kim et al. (2018) tested various
kinds of attention mechanisms to define w(·, ·), including Laplace kernel attention, dot
product attention, and multihead dot product attention (Vaswani et al., 2017). They
generally find that multihead dot product attention works best.
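As a concrete instance of Equation (4.33), the sketch below uses weights proportional to exp(−|x − xn| / ℓ), which is one simple way to realise a Laplace-kernel weighting for scalar inputs. The feature map `phi`, the length scale and the two-point dataset are hypothetical and chosen only to mirror the far-apart-observations example discussed earlier in this section.

```python
import math

def phi(x, y):
    """Toy per-datapoint representation (stand-in for the encoder MLP)."""
    return [y, x * y]

def laplace_weights(xs, x_query, length_scale=1.0):
    """Normalised weights w(x, x_query) proportional to exp(-|x - x_query| / length_scale)."""
    raw = [math.exp(-abs(x - x_query) / length_scale) for x in xs]
    total = sum(raw)
    return [r / total for r in raw]

def query_specific_representation(dataset, x_query, length_scale=1.0):
    """R(x_n, D): weighted sum of per-datapoint representations, as in Equation (4.33)."""
    weights = laplace_weights([x for x, _ in dataset], x_query, length_scale)
    rep = [0.0] * len(phi(*dataset[0]))
    for w, (x, y) in zip(weights, dataset):
        rep = [r + w * f for r, f in zip(rep, phi(x, y))]
    return rep

D = [(-5.0, 1.0), (5.0, -1.0)]   # two observations that are far apart
rep_near_first = query_specific_representation(D, x_query=-5.0)
rep_permuted = query_specific_representation(D[::-1], x_query=-5.0)
print(rep_near_first, phi(-5.0, 1.0))
```

Querying at the first input yields a representation that is almost exactly ϕ(x1, y1), and reversing the order of D leaves it unchanged up to rounding.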
So far we have only considered an attention mechanism between the query input
location xn and the observed dataset, i.e., cross attention. In addition, the ANP when
originally proposed (Kim et al., 2018) used an attention mechanism between datapoints
in D: self attention. In this case, the representation of a datapoint (x, y) ∈ D is no
longer given by ϕ(x, y), but is instead the result of applying self attention to ϕ(x, y).
This can be implemented using Equation (4.32), but with the keys, queries and values
all computed from the conditioning set only. Note that neither the self attention
applied to D, nor the cross attention between xn and D impacts the invariance of the
predictions with respect to permutations of D. If D is permuted, so will the sequence
of per-datapoint representations. When cross attention is applied to this sequence,
its ordering is irrelevant (see Equation (4.33)), hence the predictions are unaffected.
Compared to cross attention, self attention between the datapoints in D has a less
clear interpretation as an inductive bias. In fact, we have found that only using cross
attention without self attention in the ANP is generally not detrimental to performance,
while being less computationally demanding.
Using this architecture, both CNP and LNP versions of the attentive neural process
can be constructed, following Sections 4.2.2 and 4.2.3. However, Kim et al. (2018)
propose a hybrid model that uses both a deterministic and stochastic path. Specifically,
a deterministic representation of D is constructed using Equation (4.33). Separately, a
stochastic representation is obtained by applying self attention to the datapoints in D,
and then taking the mean of the resulting outputs. This mean is then fed into an MLP
which defines the parameters of p(z|D). Finally, the predictive distribution is given by
\[ p(y \mid x; D) = \int \prod_{n=1}^{N} \mathcal{N}\left(\mu(x_n, R(x_n, D), z), \sigma^2(x_n, R(x_n, D), z)\right) p(z \mid R) \, dz. \tag{4.34} \]
Note that, in contrast to Sections 4.2.2 and 4.2.3, the decoder takes both the deter-
ministic representation R(xn, D) and the latent variable z as inputs. If the MLPs µ
and σ2 learn to ignore the input z, then the hybrid model collapses to an attentive
CNP (ACNP). This hybrid definition could also be easily applied to MLP-based neural
processes.
Kim et al. (2018) show that the ANP significantly outperforms the MLP-LNP in
various regression tasks. Hence, in Chapter 5 we use the ANP as a strong baseline with
which to compare our proposed NP models. However, the attention operations increase
the computational complexity of the ANP relative to MLP-NPs. MLP-NPs have a
computational complexity of O(N + T ) for making predictions at T query locations
conditioned on a dataset of N points. In contrast, the ANP has a computational
complexity of O(N^2 + NT), due to the self attention between the N points in D, and
the cross attention between each query location and D. If the self attention is dropped
and only cross attention is used, then the computational complexity is reduced to
O(NT ).
4.3 Deep sets
We have seen that various neural processes can be described using the architectural
framework given in Section 4.2. A key component of this architecture is the repre-
sentation of the dataset by the encoder given by a summation over datapoints in
Equation (4.10). It is natural to ask, how flexible is this representation? This question
was investigated by Wagstaff et al. (2022); Zaheer et al. (2017) in the broader context
of deep learning on sets. Their main result is the following representation theorem:
Theorem 5 (Wagstaff et al. (2022); Zaheer et al. (2017)). Let [0, 1]^{≤M} denote the
set of subsets of [0, 1] containing at most M elements. Let f : [0, 1]^{≤M} → R be a
permutation-invariant, continuous function. Then
\[ f(x) = \rho\left( \sum_{i=1}^{|x|} \phi(x_i) \right) \tag{4.35} \]
for some continuous functions ρ : R^M → R and ϕ : R → R^M. We refer to R^M as the
embedding space. Here |x| denotes the number of elements in x.10
Equation (4.35) is known as a ‘sum-decomposition’ or ‘deep sets encoding’. Theo-
rem 5 tells us that as long as ρ and ϕ are universal function approximators (such as
sufficiently wide MLPs), this sum-decomposition can be done without loss of generality
in terms of the class of permutation-invariant maps that can be expressed.
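A hand-built example may help make the sum-decomposition concrete. The sketch below writes the variance of a set of numbers in the form of Equation (4.35), with ϕ(x) = (1, x, x²) and a ρ that post-processes the summed features. This particular ϕ and ρ are chosen by hand for illustration and are not learned; for this specific function a three-dimensional embedding happens to suffice, whereas the theorem concerns what is needed for arbitrary continuous f.

```python
import math

def phi(x):
    """Per-element embedding: (count contribution, value, squared value)."""
    return (1.0, x, x * x)

def rho(summed):
    """Maps the summed embedding to the (population) variance of the set."""
    count, total, total_sq = summed
    mean = total / count
    return total_sq / count - mean * mean

def f(xs):
    """f(x) = rho(sum_i phi(x_i)), the form in Equation (4.35)."""
    summed = (0.0, 0.0, 0.0)
    for x in xs:
        summed = tuple(a + b for a, b in zip(summed, phi(x)))
    return rho(summed)

print(f([0.1, 0.5, 0.9]), f([0.9, 0.1, 0.5]))
```

Both calls give the same value up to rounding because the elements only enter through a sum, which is the source of the permutation invariance.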
Note that in Theorem 5, the dimensionality of the embedding space has to grow with
the maximum size of the set, M . Wagstaff et al. (2022) show that this is a necessary
condition: if the maximum size of the input set is greater than the dimensionality of
the embedding space, then there exists a permutation-invariant, continuous function
that cannot be expressed by Equation (4.35). Furthermore, they show a stronger result:
there exist permutation-invariant continuous functions that cannot be approximated
by functions of the form in Equation (4.35).
It is important to note the role played by continuity in Theorem 5. Instead of
considering set elements in the domain [0, 1] and demanding continuity, Zaheer et al.
(2017) also considered set elements taken from U , where U is any countable set. In
that case, they show that it suffices for the dimensionality of the embedding space
to be one, i.e. the embedding space is just R. However, Wagstaff et al. (2019, 2022)
showed that this is not a realistic case to consider, because in practice it leads to the
specification of maps exhibiting a high degree of discontinuity, such that it would be
impractical to represent using floating-point arithmetic.
As we saw in Sections 4.2.2 and 4.2.3, MLP-NPs make heavy use of the deep sets
decomposition, and so do ANPs, since they reduce to MLP-NPs when the attention
weights are all equal to 1. To highlight the similarities, we can express the mean
function of the MLP-CNP as
\[ \mu(x_n, R(D)) = \mu\left( x_n, \sum_{(x,y) \in D} \phi(x, y) \right), \tag{4.36} \]
where recall that µ : R^{d_x+d_R} → R^{d_y} is an MLP. This is very similar to Equation (4.35),
with µ playing the role of ρ, except that µ also takes in the query location xn as
an input. It is straightforward to leverage this relationship to formally show that
10Note that this theorem assumes that the individual set elements are members of [0, 1]. It is not
immediately clear how to extend the proof to vector-valued set elements. Such an extension was
proven by Wessel P. Bruinsma, Andrew Y. K. Foong and Jonathan Gordon, and is presented in
Gordon (2021, Theorem 2.3), with the added condition that the dimensionality of the embedding
space is now 2M instead of M . See also Yarotsky (2022) for a comparable statement.
CNPs can recover (in the limit of infinite width) any continuous map from datasets to
continuous functions Z → Cb(X ,Y) as their predictive mean and variance (Gordon,
2021, Theorem 2.4), which provides justification for the proposed architecture.
4.4 Training neural processes
Having described the architecture, we now discuss how to train neural processes. As
mentioned in Section 4.1.1, in order to meta-learn, we require a meta-dataset, i.e., a
dataset of datasets. In the meta-learning literature, each dataset in the meta-dataset
is referred to as a task. For NPs, this means having access to many independent
samples of functions from the ground truth data-generating stochastic process. Each
sampled function is then a task. For example, we may have a large collection of audio
waveforms (D_n)_{n=1}^{N_tasks} from different speakers. Each of these waveforms may be regarded
as an independent sample from the ground truth stochastic process representing the
distribution of human speech. Each waveform is then a task which is itself a dataset
D_n = ((x_i, y_i))_{i=1}^{N}, where each (x_i, y_i) is a timestamp–audio amplitude pair. Or we
might have a large collection of natural images: then each Dn would be a single image
consisting of pixel-location/pixel-value pairs.
We would like to use this meta-dataset to learn how to make predictions at some
new query locations upon observing some new conditioning data. To do this, we use
an episodic training procedure, common in meta-learning (Finn et al., 2017; Ravi and
Larochelle, 2016; Vinyals et al., 2016). Each episode consists of the following steps:
1. Sample a task D from the meta-trainset (D_n)_{n=1}^{N_tasks}.
2. Randomly split the task into two subsets, D = Dc ∪Dt. Dc = (xc, yc) is known
as the context set and Dt = (xt, yt) is known as the target set. Here xc denotes all
the input locations in the context set, and yc denotes their corresponding output
values, and similarly for xt, yt.
3. Pass Dc through the neural process forward pass as the conditioning dataset
to obtain the predictive distribution at the input locations in the target set,
p(yt|xt;Dc).
4. Compute the objective function L, which measures the predictive performance of
the NP on the target set. For models with tractable likelihood functions, this
is usually L = log p(yt|xt;Dc). However, for LNPs, we will have to compute
an approximation or a lower bound of the log-likelihood objective, as will be
discussed in Sections 4.4.2 and 4.4.3.
5. Compute the gradient ∇θL with respect to all learnable parameters θ in the NP
for stochastic gradient optimisation.
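The data-handling part of the steps above (steps 1 and 2) can be sketched as follows. Everything here is an illustrative assumption: `make_task` generates hypothetical noisy sine-wave tasks, and the split sizes are drawn uniformly at random. Steps 3 to 5 depend on a concrete model and an automatic differentiation library, so they appear only as a comment.

```python
import math
import random

def make_task(rng, num_points=20):
    """Hypothetical toy task: noisy samples of a sine wave with a random phase."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(num_points)]
    return [(x, math.sin(x + phase) + rng.gauss(0.0, 0.05)) for x in xs]

def split_task(task, rng, min_context=1):
    """Step 2: random split of a task into a context set D_c and a target set D_t."""
    points = task[:]
    rng.shuffle(points)
    # Leave at least one point for the target set.
    num_context = rng.randint(min_context, len(points) - 1)
    return points[:num_context], points[num_context:]

rng = random.Random(0)
meta_trainset = [make_task(rng) for _ in range(100)]

for episode in range(3):
    task = rng.choice(meta_trainset)             # step 1: sample a task
    context, target = split_task(task, rng)      # step 2: context/target split
    # Steps 3-5 would go here: forward pass on `context`, objective on `target`,
    # then a gradient step on the model parameters.
    print(episode, len(context), len(target))
```

Varying the context size from episode to episode exposes the model to conditioning sets of many different sizes during training.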
The episodes are repeated until training converges. Intuitively, this procedure encour-
ages the NP to produce predictions that fit an unseen target set, given access to only
the context set. Once meta-training is complete, if the neural process generalises well,
it will be able to do this for unseen context sets that are not in the meta-train set.
Note that this setup is analogous to standard supervised learning. We now discuss
various objective functions that can be used to train NPs.
4.4.1 Log-likelihood
The most basic objective function to optimise is the log-likelihood of the target set
conditioned on the context set. More precisely, given a meta-dataset M := (D_n)_{n=1}^{N_tasks},
during each iteration of stochastic gradient descent training, we sample (Dc, Dt) from
M and optimise
LML = log p(yt|xt;Dc) (4.37)
with respect to the learnable parameters in the NP. In fact, typically we sample a
batch of tasks from M and perform mini-batch optimisation, so that at each iteration
we take a gradient step that maximises the mean of Equation (4.37) over a batch of
datasets (Dc, Dt). This can be viewed as a simple Monte Carlo estimate of the objective
Ep(Dc,Dt)[log p(yt|xt;Dc)]. In the case of CNPs, the distribution over the target outputs
factorises, and we have:
\[ \mathcal{L}_{\mathrm{ML}} = \sum_{(x,y) \in D_t} \log p(y \mid x; D_c). \tag{4.38} \]
We now prove that, in the limit of infinite data and infinite model capacity, globally
optimising the log-likelihood objective recovers the exact prediction map described in
Section 4.1.4, subject to certain conditions on the data-generating process.
Proposition 5. Let Ψ : Z → P(X ,Y) be any map from data sets to stochastic
processes, and let pΨ be the density of Ψ(Dc) evaluated at xt. Then Ψ globally maximises
Ep(Dc,Dt)[LML(Ψ)] = Ep(Dc,Dt)[log pΨ(yt|xt;Dc)] if and only if all the finite-dimensional
distributions of pΨ match those of πP , the prediction map (as defined in Section 4.1.4),
p(Dc, xt)-almost everywhere. I.e., equality holds except on a set of measure zero with
respect to p(Dc, xt).
Proof. We have:
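\begin{align}
\mathbb{E}_{p(D_c, D_t)}\left[\log p_\Psi(y_t \mid x_t, D_c)\right] &= \mathbb{E}_{p(D_c, x_t)}\left[ \mathbb{E}_{p(y_t \mid x_t, D_c)}\left[\log p_\Psi(y_t \mid x_t, D_c)\right] \right] \tag{4.39} \\
&= -\mathbb{E}_{p(D_c, x_t)}\left[ \mathrm{KL}\left( p(y_t \mid x_t, D_c) \,\|\, p_\Psi(y_t \mid x_t, D_c) \right) \right] + \text{constant}, \tag{4.40}
\end{align}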
where the additive constant is constant with respect to Ψ. First note that the KL-
divergence is non-negative, and that the prediction map sends all the KL-divergences
to zero, globally optimising L(Ψ). Furthermore, the KL-divergence is equal to zero if
and only if the two distributions are equal, and this must hold for almost all Dc, xt with
respect to p(Dc, xt). For, if this were not the case, the KL-divergence would contribute
a non-zero amount to the expectation in Equation (4.40). Hence the objective is
globally optimised if and only if all the finite-dimensional distributions of pΨ match
the conditional distributions p(Dc, xt)-almost everywhere.
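The step from Equation (4.39) to Equation (4.40) can be spelled out by adding and subtracting the log-density of the true conditional inside the inner expectation. The entropy symbol H is notation introduced here, and the derivation assumes this entropy is finite.

```latex
\begin{align}
\mathbb{E}_{p(y_t \mid x_t, D_c)}\left[\log p_\Psi(y_t \mid x_t, D_c)\right]
  &= \mathbb{E}_{p(y_t \mid x_t, D_c)}\left[\log \frac{p_\Psi(y_t \mid x_t, D_c)}{p(y_t \mid x_t, D_c)}\right]
   + \mathbb{E}_{p(y_t \mid x_t, D_c)}\left[\log p(y_t \mid x_t, D_c)\right] \\
  &= -\mathrm{KL}\left( p(y_t \mid x_t, D_c) \,\|\, p_\Psi(y_t \mid x_t, D_c) \right)
   - \mathrm{H}\left[ p(y_t \mid x_t, D_c) \right].
\end{align}
```

The entropy term does not depend on Ψ, so after taking the outer expectation over p(Dc, xt) it becomes the additive constant in Equation (4.40).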
Proposition 5 shows that the support of the data-generating distribution is of crucial
importance, since equality with the prediction map πP only holds almost everywhere
with respect to p(Dc, xt). This means that if, for example, the range of the inputs or
the number of context points is limited during training, we cannot expect the model
to be able to approximate the prediction map well outside of that range, just on the
basis of maximum likelihood training. In order to make the support of p(Dc, xt) as
large as possible, one could generate tasks (Dc, Dt) as follows: first, sample some
finite number of input locations xt, xc. Further, set Pr(|xt| = n) > 0 for all n ∈ Z_{≥0},
where |xt| denotes the number of datapoints in xt, and assume the same is true of
Pr(|xc| = n). Finally, arrange that for each n > 0, the distribution of x given |x| = n
has a continuous density with support over all of R^{n×d_in}. This could be achieved, for
example, by setting the distribution of x to be Gaussian. A distribution like this would
ensure that equality p(Dc, xt)-almost everywhere implies equality for all context sets
and target inputs.
In practice we often limit the maximum size of the sampled data sets, and also
their range in X space. Hence we can only expect the model to learn reasonable
predictions within the ranges seen during train time. That is, if during train time we
only observe datasets with at most n datapoints and with input locations within some
finite range, we have no reason a priori to expect the NP to be able to make sensible
predictions if at meta-test time it encounters context sets that do not belong to these
ranges. In Chapter 5 we will present an example where incorporating suitable inductive
biases, in particular, translation equivariance, in the architecture, rather than simply
relying on the maximum-likelihood objective, allows the NP to generalise beyond the
meta-training input range for X .
In addition to these conditions regarding the input data distribution, there are
a number of assumptions in Proposition 5 that caveat its applicability to real world
settings. First, in reality we would only optimise a Monte Carlo estimate of
E_{p(Dc,Dt)}[log pΨ(yt|xt, Dc)] such as $\frac{1}{|M|} \sum_{(D_c, D_t) \in M} \log p_\Psi(y_t \mid x_t, D_c)$, with (Dc, Dt) ∼
p(Dc, Dt). Hence the NP would only be guaranteed to recover the prediction map
as the size of the meta-trainset |M | → ∞. Furthermore, the proof assumes that the
prediction map can be expressed as an NP, which is only guaranteed in the infinite-
width limit (see Section 4.3). Finally, the prediction map will only be recovered if
global optimisation of the objective succeeds, which is rarely the case for stochastic
gradient-based optimisers. Nevertheless, Proposition 5 motivates the use of LML in
cases where the meta-trainset is large, the neural networks in the NP have high capacity,
and training is performed until convergence.
4.4.2 Neural process variational inference
The maximum likelihood objective of Equation (4.38) is the most commonly used
training objective for CNPs. However, for LNPs, this objective cannot be used since
it is intractable to compute due to the integral in Equation (4.12). Instead, when
introducing the MLP-LNP, Garnelo et al. (2018b) proposed viewing LNPs as performing
approximate Bayesian inference and learning in the following latent variable model:
\[ z \sim p(z); \quad p(y_t \mid x_t, z) = \prod_{(x,y) \in D_t} \mathcal{N}\left(y; f(x; z), \sigma_y^2\right), \tag{4.41} \]
where f is given by a neural network. To train the model, they use amortised
VI (Kingma and Welling, 2013; Rezende et al., 2014). This involves introducing a
variational approximation network qϕ which maps datasets Dc ∈ Z to distributions
over z, and maximising a lower bound (ELBO) on log p(yt|xt, Dc). We can use the
LNP encoder architecture to parameterise qϕ, since the LNP encoder specifies a map
from datasets to distributions over z, just as required (see Figure 4.2). Here, note that
log p(yt|xt, Dc) is defined by exact Bayesian inference for the model in Equation (4.41),
hence the notation log p(yt|xt, Dc) instead of log p(yt|xt;Dc). That is,
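\begin{align}
\log p(y_t \mid x_t, D_c) &= \log \int p(y_t \mid x_t, z) \, p(z \mid D_c) \, dz, \tag{4.42} \\
p(z \mid D_c) &= \frac{p(D_c \mid z) \, p(z)}{p(D_c)}, \tag{4.43} \\
&= \frac{\prod_{(x,y) \in D_c} \mathcal{N}\left(y; f(x; z), \sigma_y^2\right) p(z)}{\int \prod_{(x,y) \in D_c} \mathcal{N}\left(y; f(x; z), \sigma_y^2\right) p(z) \, dz}. \tag{4.44}
\end{align}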
Note that this is in contrast to log p(yt|xt;Dc) which is defined directly by the NP
forward pass without reference to Bayes’ theorem.
Given a task (Dc, Dt), the (conditional) ELBO for this model is:
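\[ \mathbb{E}_{z \sim q_\phi(z \mid D_c \cup D_t)}\left[\log p(y_t \mid x_t, z)\right] - \mathrm{KL}\left(q_\phi(z \mid D_c \cup D_t) \,\|\, p(z \mid D_c)\right) \leq \log p(y_t \mid x_t, D_c). \tag{4.45} \]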
As p(z|Dc) is intractable to compute (since the normalising constant in Equation (4.44)
involves an intractable integral), Garnelo et al. (2018b) instead propose the following
objective:
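\[ \mathcal{L}_{\mathrm{NPVI}} := \mathbb{E}_{z \sim q_\phi(z \mid D_c \cup D_t)}\left[\log p(y_t \mid x_t, z)\right] - \mathrm{KL}\left(q_\phi(z \mid D_c \cup D_t) \,\|\, q_\phi(z \mid D_c)\right), \tag{4.46} \]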
where the intractable term p(z|Dc) has been substituted with our variational approxi-
mation qϕ(z|Dc). We refer to maximising this objective as neural process variational
inference (NPVI). Due to this substitution, LNPVI is no longer a valid ELBO for the
original model (Equation (4.41)), i.e., it is no longer guaranteed to be a lower bound
to the Bayesian conditional log-likelihood log p(yt|xt, Dc). Rather, if we define separate
models for each context set Dc, and define the conditional prior for each model as
p(z|Dc) := qϕ(z|Dc), then LNPVI may be thought of as performing VI for this collection
of models. However, there is no guarantee that these conditional priors are consistent in
the sense that they correspond to conditional distributions of a single Bayesian model
as in Equation (4.41). This is in contrast to sparse variational inference in Gaussian
processes, where there is a single Bayesian prior and posterior which is targeted by
the approximate posterior GP (Matthews et al., 2016; Titsias, 2009). Although the
fact that LNPVI does not target a single consistent posterior distribution introduces
conceptual difficulties, it was shown by Garnelo et al. (2018b) to be a useful objective
for LNPs.
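As a minimal sketch of how the objective in Equation (4.46) could be evaluated, the snippet below assumes a diagonal-Gaussian qϕ, a Gaussian likelihood with fixed noise, and a simple Monte Carlo estimate of the expectation. The functions `toy_encoder` and `toy_decoder` are hypothetical stand-ins invented for illustration; they are not the LNP architecture.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), summed over latent dimensions."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2) - 0.5)

def gaussian_log_lik(y, mean, sigma_y):
    """Sum over target points of log N(y; mean, sigma_y^2)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma_y**2)
                  - 0.5 * ((y - mean) / sigma_y)**2)

def toy_encoder(x, y):
    """Hypothetical q_phi: permutation-invariant map from a dataset to (mean, std) of a 1-D z."""
    n = len(x)
    mu = np.array([np.sum(y) / (n + 1.0)])
    sigma = np.array([1.0 / np.sqrt(n + 1.0)])
    return mu, sigma

def toy_decoder(x, z):
    """Hypothetical decoder f(x; z): a constant function whose level is z."""
    return z[0] * np.ones_like(x)

def npvi_objective(xc, yc, xt, yt, sigma_y=0.5, num_samples=64, seed=0):
    rng = np.random.default_rng(seed)
    mu_c, s_c = toy_encoder(xc, yc)                      # q(z | Dc), used in place of p(z | Dc)
    mu_d, s_d = toy_encoder(np.concatenate([xc, xt]),
                            np.concatenate([yc, yt]))    # q(z | Dc ∪ Dt)
    # Reparameterised samples z ~ q(z | Dc ∪ Dt).
    zs = mu_d + s_d * rng.standard_normal((num_samples, mu_d.size))
    expected_ll = np.mean([gaussian_log_lik(yt, toy_decoder(xt, z), sigma_y) for z in zs])
    return expected_ll - gaussian_kl(mu_d, s_d, mu_c, s_c)

xc, yc = np.array([0.0, 1.0]), np.array([0.2, 0.4])
xt, yt = np.array([2.0, 3.0]), np.array([0.3, 0.5])
value = npvi_objective(xc, yc, xt, yt)
```

Note that the same encoder is called twice, once on Dc and once on Dc ∪ Dt, which is what makes the KL term a statement about the encoder's self-consistency rather than about a fixed Bayesian prior.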
4.4.3 Approximate log-likelihood
As an alternative to the NPVI objective for LNPs, we propose optimising the fol-
lowing Monte Carlo estimate of LML, which is conservatively biased, consistent, and
monotonically increasing (in expectation) in the number of samples, L (Burda et al.,
2015):
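\[
\hat{\mathcal{L}}_{\mathrm{ML}} := \log \left[ \frac{1}{L} \sum_{l=1}^{L} \exp \left( \sum_{(x,y) \in D_t} \log p(y | x, z_l) \right) \right]; \quad z_l \sim p(z | R(D_c)), \tag{4.47}
\]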
where R(Dc) is the deterministic representation of the context set Dc, as in Equa-
tion (4.12). Again here we state the objective for a single dataset Dc, Dt; during
actual training we would optimise a Monte Carlo estimate of Ep(Dc,Dt)[LˆML]. LˆML is
an approximation to LML in the sense that
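\begin{align}
\hat{\mathcal{L}}_{\mathrm{ML}} &= \log \left[ \frac{1}{L} \sum_{l=1}^{L} \prod_{(x,y) \in D_t} p(y | x, z_l) \right] \tag{4.48} \\
&\approx \log \int \prod_{(x,y) \in D_t} p(y | x, z)\, p(z | R(D_c))\, \mathrm{d}z, \tag{4.49}
\end{align}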
where Equation (4.49) is the exact LNP log-likelihood. However, since the logarithm of
an unbiased estimator is not an unbiased estimator of the logarithm, LˆML is a biased
estimate of LML, which is only accurate in the limit L→∞. The bias in this estimator
decreases as the variance of z decreases, and in particular, if z is deterministic then the
estimator is exact. This means that optimisation may attempt to reduce the variance in
order to reduce the bias in the estimator rather than actually increasing the likelihood.
Generally, if L is too large then each minibatch may take up too much memory, but if
L is too small the bias in the estimate can become unacceptably large. In particular,
in contrast to LNPVI, single sample estimators with L = 1 are not useful, as they drive
z to be deterministic. Hence training with LML often requires more memory than
training with LNPVI.
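The behaviour of the estimator in Equation (4.47) can be illustrated with a small numerical sketch. The toy model below is an assumption made purely for illustration: a scalar latent z ∼ N(0, s²) and a single target with likelihood N(y; z, σ²), for which the exact log-likelihood log N(y; 0, s² + σ²) is available in closed form. The log-mean-exp is computed in a numerically stable way by subtracting the maximum.

```python
import numpy as np

def log_mean_exp(a):
    """Stable evaluation of log((1/L) * sum_l exp(a_l))."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.mean(np.exp(a - m)))

def average_estimate(L, repeats=2000, y=1.0, s=1.0, sigma=0.5, seed=0):
    """Average of the L-sample estimator over independent repeats (toy model, see lead-in)."""
    rng = np.random.default_rng(seed)
    z = s * rng.standard_normal((repeats, L))                 # z_l ~ p(z)
    log_w = (-0.5 * np.log(2.0 * np.pi * sigma**2)
             - 0.5 * ((y - z) / sigma) ** 2)                  # log p(y | z_l)
    return np.mean([log_mean_exp(row) for row in log_w])

def exact_log_lik(y=1.0, s=1.0, sigma=0.5):
    """Closed-form log N(y; 0, s^2 + sigma^2) for the toy model."""
    var = s**2 + sigma**2
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * y**2 / var

avg_1 = average_estimate(L=1)
avg_10 = average_estimate(L=10)
exact = exact_log_lik()
```

In this toy setting the averaged estimate should sit below the exact value and move towards it as L grows, in line with the properties cited from Burda et al. (2015).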
4.4.4 Approximate maximum-likelihood vs variational lower
bound maximisation for training NPs
In this section we argue that the VI interpretation may be unnecessary when focusing
on predictive performance for NPs. First, we note that LNPVI is equal to LML up to an
additional KL term. To see this, let D := Dt ∪ Dc, and let Z = ∫ p(yt|xt, z) qϕ(z|Dc) dz.
The NPVI objective is then:
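\begin{align}
\mathcal{L}_{\mathrm{NPVI}} &:= \mathbb{E}_{q_\phi(z|D)}[\log p(y_t | x_t, z)] - \mathrm{KL}(q_\phi(z|D) \,\|\, q_\phi(z|D_c)) \tag{4.50} \\
&= \mathbb{E}_{q_\phi(z|D)}[\log p(y_t | x_t, z) + \log q_\phi(z|D_c) - \log q_\phi(z|D)] \tag{4.51} \\
&= \mathbb{E}_{q_\phi(z|D)}\left[ \log Z + \log \frac{p(y_t | x_t, z)\, q_\phi(z|D_c)}{Z} - \log q_\phi(z|D) \right] \tag{4.52} \\
&= \log Z - \mathrm{KL}\left( q_\phi(z|D) \,\middle\|\, \frac{1}{Z}\, p(y_t | x_t, z)\, q_\phi(z|D_c) \right). \tag{4.53}
\end{align}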
When training LNPs with maximum likelihood, qϕ no longer has an approximate
inference interpretation, but is simply the encoder of the LNP. In that case,
log Z = log ∫ p(yt|xt, z) qϕ(z|Dc) dz = LML is simply the (exact) log-likelihood, so:
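\[
\mathcal{L}_{\mathrm{NPVI}} = \mathcal{L}_{\mathrm{ML}} - \mathrm{KL}\left( q_\phi(z|D) \,\middle\|\, \frac{1}{Z}\, p(y_t | x_t, z)\, q_\phi(z|D_c) \right). \tag{4.54}
\]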
Hence we see that LNPVI is equal to LML up to an additional KL term. This KL term
encourages consistency among the qϕ for varying conditioning datasets, in the sense
that Bayes’ theorem is respected if the target set is subsumed into the context set. To
see this, note that it encourages qϕ(z|Dc ∪ Dt) to be similar to (1/Z) p(yt|xt, z) qϕ(z|Dc).
If qϕ was performing exact inference instead of approximate inference, this would be
satisfied immediately, by the rules of probability. However, since qϕ is parameterised
by a learned encoder, this consistency with respect to Bayesian updating of z must be
learned from data. Thus LNPVI can be viewed as directly encouraging this consistency
in the objective function.
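The identity in Equation (4.54) can be checked numerically. The sketch below assumes, purely for illustration, a discrete latent z taking five values, so that the integrals become finite sums; the two encoder distributions and the likelihood values are random placeholders rather than outputs of a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states = 5  # assumed size of the discrete latent space

def softmax(a):
    a = a - np.max(a)
    return np.exp(a) / np.sum(np.exp(a))

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))

q_c = softmax(rng.standard_normal(num_states))   # q(z | Dc)
q_d = softmax(rng.standard_normal(num_states))   # q(z | Dc ∪ Dt)
lik = np.exp(rng.standard_normal(num_states))    # p(y_t | x_t, z) for each z

Z = np.sum(lik * q_c)                 # normalising constant
L_ML = np.log(Z)                      # exact log-likelihood
bayes_update = lik * q_c / Z          # (1/Z) p(y_t | x_t, z) q(z | Dc)

L_NPVI = np.sum(q_d * np.log(lik)) - kl(q_d, q_c)   # Equation (4.50)
gap = kl(q_d, bayes_update)                         # KL term in Equation (4.54)

# If the encoder respected Bayes' rule exactly, the gap would vanish.
L_NPVI_consistent = np.sum(bayes_update * np.log(lik)) - kl(bayes_update, q_c)
```

Here `L_NPVI` and `L_ML - gap` agree to floating-point precision, and `L_NPVI_consistent` coincides with `L_ML`.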
In the infinite capacity/data limit, LNPVI is globally maximised if the LNP recovers
(i) the prediction map πP for p(yt|xt, Dc) and (ii) exact Bayesian inference for z. (i)
follows from Proposition 5, since πP globally optimises LML, and (ii) follows from the
fact that exact inference for z sends the KL term to zero since it respects Bayes’ rule.
However, in most applications, only the distribution over yt is of interest, and we are
not directly concerned with our inference for the latent variable z. Given only finite
capacity/data, it may be advantageous to not expend capacity in encouraging the
distribution over z to be consistent with Bayes’ theorem. Hence it could be beneficial
to use LML over LNPVI, since LML solely targets the predictive performance we care
about.
Unfortunately, as discussed earlier, LML is intractable for LNPs, and its finite-
sample approximation LˆML introduces biases of its own into the training procedure. It
is unclear a priori how detrimental these biases will be to performance. Both LˆML (in
expectation over the samples zl) and LNPVI can be seen as lower bounds on the actual
quantity we would like to optimise, the exact log-likelihood. Which objective is preferable in practice will depend on which
introduces more harmful biases to the training procedure. In Chapter 5 we compare
LNPs trained with LˆML and LNPVI and find that LˆML can significantly outperform
LNPVI.
4.5 Summary and conclusions
We have introduced neural processes, a family of deep learning models for meta-learning
maps from observed datasets to predictive stochastic processes. NPs naturally lend
themselves to tasks that require uncertainty estimation in the small-data regime, as
long as a meta-dataset is available. We introduced the encoder-decoder architectural
framework used by most NPs, and motivated it with a discussion of stochastic process
consistency and invariance with respect to permutations of the context set. Next, we saw
that NPs could be divided into two broad sub-families, CNPs and LNPs, depending on
whether a latent variable was used to induce dependencies in the predictive distributions.
Within these subfamilies we presented instantiations of NPs based on vanilla MLPs
and also attention mechanisms. Finally, we discussed the various objective functions
that have been proposed for training NPs.
In Section 4.2.4 we saw how the introduction of a suitable inductive bias in the
form of attentive neural processes successfully addressed the underfitting problems of
MLP-based NPs. This naturally raises the question of what other inductive biases
could be built into NP architectures, and what their benefits may be. In Chapter 5 we
will present and evaluate a new member of the NP family, the convolutional neural
process, which uses a convolutional neural network to build in translation equivariance
as an inductive bias.
Chapter 5
Convolutional neural processes
In Chapter 4 we saw that neural processes could be viewed as learning maps from
datasets directly to predictive stochastic processes. Although this framework is very
general, specialising it to incorporate useful inductive biases can lead to dramatic
improvements, as was the case with attentive neural processes (Kim et al., 2018). In this
chapter, we consider symmetries, and in particular, stationarity as a powerful inductive
bias. Stationary stochastic processes are a key component of many probabilistic models,
such as those for off-the-grid spatio-temporal data. They enable the statistical symmetry
of underlying physical phenomena to be leveraged, thereby aiding generalisation.
Prediction in such models can be viewed as a translation equivariant map from
observed datasets to predictive stochastic processes (see Figure 5.1), emphasising the
intimate relationship between stationarity and equivariance.
Building on this, we propose the convolutional conditional neural process (ConvCNP)
and the convolutional latent neural process (ConvLNP). The ConvCNP, like other
members of the CNP family, makes factorised predictions for each element of the
target set. This means that we cannot sample coherent functions from the predictive
distribution of the ConvCNP, since every target value will be independent of the other
values, as discussed in Section 4.2.2. The ConvLNP, on the other hand, uses a latent
variable (in this case, a latent function) to enable coherent samples to be drawn from
the predictive distribution. This allows ConvLNPs to be deployed in settings which
require coherent samples such as Thompson sampling. Crucially, both ConvCNPs and
ConvLNPs use convolutional architectures to endow neural processes with translation
equivariance as an inductive bias. Moreover, as discussed in Section 4.4.3, we propose
a new maximum-likelihood objective to replace the standard ELBO objective in NPs,
which conceptually simplifies the framework and empirically improves performance for
ConvLNPs. We demonstrate the strong performance and generalisation capabilities
of ConvCNPs and ConvLNPs on 1D regression, image completion, and various tasks
with real-world spatio-temporal data.
The work in this chapter is based on two publications, ‘Convolutional Conditional
Neural Processes’ (Gordon et al., 2020) and ‘Meta-learning Stationary Stochastic
Process Prediction with Convolutional Neural Processes’ (Foong et al., 2020a). The
research in Gordon et al. (2020) was conducted with Jonathan Gordon, Wessel P. Bruinsma,
James Requeima, Yann Dubois and Richard E. Turner. The research in these
publications also appears in the PhD theses of my collaborators Jonathan Gordon
(Gordon, 2021) and Wessel P. Bruinsma (forthcoming), both submitted to the University
of Cambridge. I introduced the density channel into the ConvCNP model,
verified and assisted with the proof of the main representation theorem, performed the
initial experiments on simple time-series and the first on-the-grid experiments, and
contributed to writing and editing the paper. The research in Foong et al. (2020a)
was conducted with my co-first authors Wessel P. Bruinsma and Jonathan Gordon,
along with Yann Dubois, James Requeima and Richard E. Turner. I was involved
with conceptualising the model, proving theoretical results, planning and running the
experiments on environmental data, and writing the paper.
5.1 Introduction
Incorporating appropriate inductive biases into machine learning models is key to
achieving good generalisation performance. Consider, for example, the task of predict-
ing rainfall at an unseen test location from rainfall measurements nearby. A powerful
inductive bias for this task is stationarity: the assumption that the generative process
governing rainfall is spatially homogeneous. Given only observations in a limited part
of the space, stationarity allows the model to extrapolate to yet unobserved regions.
Closely related to stationarity is translation equivariance. Translation equivariance
formalises the intuitive idea that if an observed dataset is shifted in time or space,
then the resulting predictions should be shifted by the same amount. This is illus-
trated schematically in Figure 5.1. When stationarity or translation equivariance is
appropriate, e.g. in time-series (Roberts et al., 2013), images (LeCun et al., 1998), and
spatio-temporal modelling (Cressie, 1990; Delhomme, 1978), incorporating them into
our models yields significant benefits. As such, NPs would ideally have translation
equivariance built directly into the modelling assumptions as an inductive bias when
appropriate. However, current NP models must learn this structure from the dataset
instead, which is sample and parameter inefficient, and impacts the ability of the model
to generalise.
Fig. 5.1 Schematic illustration of translation equivariance in stochastic process prediction.
The top row shows a context set, and the corresponding predictive distribution
obtained by passing the context set through a prediction map, e.g., a well-trained
neural process. The bottom row shows the same context set, but with the input values
shifted horizontally by an amount τ ∈ R. In a translation equivariant neural process,
the resulting predictive distribution will be identical to that in the top row, except it
is also shifted horizontally by τ.
The goal of this chapter is to build translation equivariance into NPs. Famously,
convolutional neural networks (CNNs) incorporate translation equivariant convolutional
layers (Cohen and Welling, 2016; Fukushima and Miyake, 1982; LeCun et al., 1998).
However, it is not straightforward to generalise NPs in an analogous way for the
following reasons:
1. CNNs require data to live ‘on the grid’. For example, image pixels and audio
recordings usually live on a regularly spaced grid. In the 1-dimensional input
case, audio recordings sample a waveform at times (…, x0 − ϵ, x0, x0 + ϵ, …). An
analogous sampling procedure for image data occurs in the 2-dimensional case,
where the shifts ϵ are now two-dimensional. However, many domains we would
like to apply NPs to have data that live ‘off the grid’. For example, some time
series data may be observed irregularly at any time t ∈ R, or observations of
weather may occur at irregularly spaced stations at locations x ∈ R2. We must
modify the standard CNN forward pass to be able to handle these situations as
well.
2. NPs operate on partially observed context sets, in the sense that the function
is not observed everywhere, but only at certain points. However, in CNNs the
input image is usually free from missing values.
3. NPs rely on embedding sets into a finite-dimensional vector space for which the
notion of equivariance with respect to input translations is not well defined. For
example, consider the case where the inputs are two dimensional, and the context
set is translated in input space by some amount τ ∈ R2. Standard MLP-CNPs
will have a representation of the context set given by some vector R(D) ∈ R^{dR},
where, e.g., dR = 256. It is not clear how to represent a shift of R(D) by τ , i.e.,
it is not straightforward to define the action of the 2-dimensional translation
group on vector spaces of arbitrary dimensionality.
In this chapter, we introduce the ConvCNP and ConvLNP, new members of the
NP family that address these challenges and account for translation equivariance. Our
key contributions can be summarised as follows:
1. We introduce the ConvCNP, a translation equivariant neural process that makes
factorised predictions.
2. We introduce the ConvLNP, a translation equivariant neural process that uses a
latent variable to induce dependencies in its predictive distribution.
3. We evaluate the new training objective for LNPs that was proposed in Sec-
tion 4.4.3, which discards variational inference in favour of a biased Monte Carlo
estimate of the maximum likelihood objective. We empirically show that this
objective improves performance for ConvLNPs.
4. We evaluate both the ConvCNP and ConvLNP experimentally and demonstrate
that they exhibit excellent performance on several synthetic and real-world
benchmarks.
5.1.1 Translation equivariance and stationarity
As we saw in Section 4.1.4, NP learning can be seen as approximating the exact
prediction map from datasets to predictive stochastic processes πP . The prediction
map πP for stationary stochastic processes possesses two important symmetries. First,
as described in Section 4.2, πP is invariant to permutations of Dc (Zaheer et al.,
2017). This is a symmetry respected by all NPs thanks to (variations of) the deep
sets construction described in Section 4.2. Second, specifically to stationary stochastic
processes, πP is translation equivariant: whenever an input to the map is translated,
its output is translated by the same amount, as described in Figure 5.1. To state this
precisely, we make the following definitions:
Definition 1 (Translating datasets and stochastic processes). We define the action of
the translation operator Tτ on datasets and stochastic processes, where τ ∈ X denotes
the shift vector of the translation:¹
1. Translating datasets. Let $D = ((x_n, y_n))_{n=1}^N \in \mathcal{Z}$. For the index set
$x = (x_1, \ldots, x_N)$, translation by τ is defined as $T_\tau x = (x_1 + \tau, \ldots, x_N + \tau)$. Similarly,
$T_\tau D := ((x_n + \tau, y_n))_{n=1}^N$.
2. Translating functions. For a function $f \in \mathcal{Y}^{\mathcal{X}}$, define $T_\tau f(x) := f(x - \tau)$ for
all $x \in \mathcal{X}$. Let $F \subseteq \mathcal{Y}^{\mathcal{X}}$. Then we define the translation of this set of functions
as $T_\tau F := \{T_\tau f : f \in F\}$.
3. Translating stochastic processes. For any stochastic process $P \in \mathcal{P}(\mathcal{X}, \mathcal{Y})$,
we define the translation of the stochastic process $T_\tau P$ by setting the probability
it assigns to a measurable set $F \in \Sigma$ as² $T_\tau P(F) := P(T_{-\tau} F)$.
Definition 2 (Stationary stochastic process). We say a stochastic process is (strictly)
stationary if the densities of its finite marginals satisfy
p(yt|xt) = p(yt|Tτxt) (5.1)
for all (xt, yt) ∈ Z and τ ∈ X .
We are now ready to give a precise definition of a translation equivariant prediction
map, as illustrated in Figure 5.1:
¹ To prevent notational clutter, the same symbol, Tτ, will be used to denote translations of datasets,
functions, sets of functions and stochastic processes.
² Recall from Section 4.1.2 that Σ denotes the product σ-algebra on $\mathcal{Y}^{\mathcal{X}}$. $P(T_{-\tau} F)$ is well-defined
since Σ is closed under translations. Equivalently, we could define $T_\tau P$ as the push-forward of P
under the translation map on functions, $T_\tau : \mathcal{Y}^{\mathcal{X}} \to \mathcal{Y}^{\mathcal{X}}$.
Definition 3 (Translation equivariant prediction maps). We say that Ψ: Z → P(X ,Y)
is translation equivariant if Ψ(TτD) = TτΨ(D) for any dataset D ∈ Z and shift τ ∈ X .
Having defined what we mean by stationarity and translation equivariance, the
following simple statement highlights the intimate link between these concepts:
Proposition 6. Let P be a stationary stochastic process. Then the prediction map πP
is translation equivariant.³
Proof. Let p(yt|xt, Dc) denote the finite dimensional density of πP (Dc) at index set
xt. To show that πP (TτDc) = TτπP (Dc) it suffices to show that p(yt|xt, TτDc) =
p(yt|T−τxt, Dc). We have
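\begin{align}
p(y_t | x_t, T_\tau D_c) &= \frac{p(y_t, y_c | x_t, T_\tau x_c)}{p(y_c | T_\tau x_c)} \tag{5.2} \\
&= \frac{p(y_t, y_c | T_{-\tau} x_t, x_c)}{p(y_c | x_c)} \tag{5.3} \\
&= p(y_t | T_{-\tau} x_t, D_c), \tag{5.4}
\end{align}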
where we used the stationarity assumption in the second line.
Proposition 6 suggests that models for the prediction map should also be made
translation equivariant and permutation invariant. As such models are a small subset of
the space of all models, building in these properties can greatly improve data efficiency
and generalisation for stationary stochastic process prediction. In the next section,
we describe how this can be done for NPs by extending the deep sets theorem of
Section 4.3 to incorporate translation equivariance.
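Proposition 6 can be illustrated numerically with a Gaussian process, whose prediction map is available in closed form. The sketch below assumes a zero-mean GP with a unit-variance EQ kernel and Gaussian observation noise (choices made only for this example) and checks the identity p(yt|xt, TτDc) = p(yt|T−τxt, Dc) used in the proof, at the level of predictive means and marginal variances.

```python
import numpy as np

def eq_kernel(a, b, lengthscale=0.7):
    """Stationary EQ kernel with unit variance; depends only on a - b."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def gp_predict(xt, xc, yc, noise=0.1):
    """Predictive mean and marginal variance of y_t given the context set."""
    Kcc = eq_kernel(xc, xc) + noise**2 * np.eye(len(xc))
    Ktc = eq_kernel(xt, xc)
    mean = Ktc @ np.linalg.solve(Kcc, yc)
    var = 1.0 + noise**2 - np.sum(Ktc * np.linalg.solve(Kcc, Ktc.T).T, axis=1)
    return mean, var

rng = np.random.default_rng(0)
xc = rng.uniform(-2.0, 2.0, size=6)
yc = np.sin(xc)
xt = np.linspace(-3.0, 3.0, 11)
tau = 1.3

# Left-hand side: condition on the translated context set, predict at x_t.
mean_shifted, var_shifted = gp_predict(xt, xc + tau, yc)
# Right-hand side: condition on the original context set, predict at x_t - tau.
mean_ref, var_ref = gp_predict(xt - tau, xc, yc)
```

The two sides agree to numerical precision because the kernel only sees differences of inputs, which are unchanged when context and target inputs are shifted together.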
5.2 Convolutional deep sets
We are interested in translation equivariance (Definition 3) with respect to translations
on X . The encoder for both MLP-based NPs and attentive NPs maps datasets D to
an embedding in a vector space R^{dR}, for which the notion of equivariance with respect
to input translations in X is not well defined. For example, a function f on X can
be translated by τ ∈ X to form f(· − τ). However, for a vector R ∈ R^{dR}, which can
be seen as a function R : {1, . . . , dR} → R, with R(i) = Ri, the translation R(· − τ)
does not make sense, since it is not clear how to add a translation τ ∈ X to the discrete
³ We exclude conditioning on observations that have zero density, so that the prediction map is
well defined.
index of a finite-dimensional vector. Another way to say this is that there is no natural
way for the translation group of X (where often X = R or R2) to act on the space of
finite-dimensional vector representations R^{dR}, when dR ≠ 1, 2.
To overcome this, we define the encoder of the convolutional neural process E : Z →
H to map into a function space H containing functions on X . Since functions in H live
on X , our notion of translation equivariance (Definition 3) now also makes sense for E.
As we will see below, every translation equivariant function on sets has a representation
in terms of a specific functional embedding.
Definition 4 (Functional mappings on sets and functional representations of sets). Call
a map E : Z → H a functional mapping on sets if it maps from the space of datasets
Z to an appropriate space of functions H. We call E(Z) the functional representation
of the set Z. Furthermore, the functional representation E is translation equivariant if
E(TτD) = TτE(D) for all τ ∈ X and D ∈ Z.
Considering functional representations of sets leads to our key result for convolu-
tional NPs, which can be summarised as follows: For an appropriately chosen Z ′ ⊂ Z,
a continuous function Φ: Z ′ → Cb(X ,Y) is both permutation invariant and translation
equivariant if and only if it is of the form
\[
\Phi(Z) = \rho(E(Z)), \qquad E(Z) = \sum_{(x,y) \in Z} \phi(y)\, \psi(\cdot - x) \in \mathcal{H}, \tag{5.5}
\]
for some continuous and translation equivariant ρ : H → Cb(X ,Y), and appropriate ϕ
and ψ. Note that here ρ is a map between function spaces.
Equation (5.5) defines the encoder used by our proposed models, the ConvCNP
and ConvLNP. In Section 5.2.1, we describe this theoretical result in more detail. The
result provides an extension of the key result of Zaheer et al. (2017) to functional
representations on sets, and shows that it can naturally be extended to handle varying-
size sets. The practical implementation of ConvCNPs and ConvLNPs — the design of
ρ, ϕ, and ψ — is informed by the results in Section 5.2.1, and is discussed for domains
of interest in Section 5.3.
5.2.1 Representing translation equivariant functions on sets
In this section we discuss the theoretical foundations of the ConvCNP and ConvLNP
encoder. We begin by stating a definition that is used in the main result.
Definition 5 (Multiplicity). A collection of datasets Z ′ ⊆ Z is said to have multiplicity
K if, for every dataset Z ∈ Z ′, every input value x occurs at most K times.
For example, in the case of real-world data like time series and images, we often
observe only one (possibly multi-dimensional) observation per input location, which
corresponds to multiplicity one, since none of the input values are repeated within a
single time series or image. We now state our key representation theorem.
Theorem 6. Consider an appropriate⁴ collection of datasets $\mathcal{Z}'_{\leq M} \subseteq \mathcal{Z}_{\leq M}$ with multiplicity
K. Then a function $\Phi : \mathcal{Z}'_{\leq M} \to C_b(\mathcal{X}, \mathcal{Y})$ is continuous⁵, permutation invariant,
and translation equivariant if and only if it is of the form
\[
\Phi(Z) = \rho(E(Z)), \qquad E((x_1, y_1), \ldots, (x_m, y_m)) = \sum_{i=1}^{m} \phi(y_i)\, \psi(\cdot - x_i) \tag{5.6}
\]
for some continuous and translation equivariant $\rho : \mathcal{H} \to C_b(\mathcal{X}, \mathcal{Y})$ and some continuous
$\phi : \mathcal{Y} \to \mathbb{R}^{K+1}$ and $\psi : \mathcal{X} \to \mathbb{R}$, where $\mathcal{H}$ is an appropriate space of functions that
includes the range of E. We call a function Φ of the above form a ConvDeepSet.
The proof of the ‘if’ direction is straightforward:
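Proof of sufficiency. First, Φ is permutation invariant, because addition is commutative
and associative. Second, translation equivariance of Φ follows by direct
verification, using the fact that ρ is translation equivariant:
\begin{align}
\Phi(T_\tau Z) &= \rho\left( \sum_{i=1}^{M} \phi(y_i)\, \psi(\cdot - (x_i + \tau)) \right) \tag{5.7} \\
&= \rho\left( \sum_{i=1}^{M} \phi(y_i)\, \psi((\cdot - \tau) - x_i) \right) \tag{5.8} \\
&= \rho\left( \sum_{i=1}^{M} \phi(y_i)\, \psi(\cdot - x_i) \right)(\cdot - \tau) \tag{5.9} \\
&= \Phi(Z)(\cdot - \tau) \tag{5.10} \\
&= T_\tau \Phi(Z).
\end{align}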
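The two properties used in the sufficiency argument can be checked directly for the embedding E of Equation (5.6). The sketch below is a minimal NumPy illustration under assumed choices: an EQ kernel for ψ with an arbitrary length scale, ϕ given by powers of y up to order K, and E(Z) evaluated on a finite set of query locations `t` rather than as a function.

```python
import numpy as np

def psi(d, lengthscale=0.5):
    """EQ kernel as a function of the difference d = t - x."""
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def phi(y, K=1):
    """Powers of y up to order K, shape (N, K + 1)."""
    return np.stack([y**k for k in range(K + 1)], axis=-1)

def encode(x, y, t, K=1):
    """E(Z)(t) = sum_i phi(y_i) psi(t - x_i), evaluated at each query point in t."""
    weights = psi(t[:, None] - x[None, :])      # (T, N)
    return weights @ phi(y, K)                  # (T, K + 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=5)
y = rng.standard_normal(5)
t = np.linspace(-2.0, 2.0, 17)
tau = 0.4

# Translation equivariance: E(T_tau Z)(t) = E(Z)(t - tau).
lhs = encode(x + tau, y, t)
rhs = encode(x, y, t - tau)

# Permutation invariance: reordering the dataset leaves E(Z) unchanged.
perm = rng.permutation(len(x))
permuted = encode(x[perm], y[perm], t)
original = encode(x, y, t)
```

Composing `encode` with any translation equivariant map ρ would preserve both properties, which is the content of Equations (5.7) to (5.10).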
The proof of the ‘only if’ direction is much more technical, and requires topological
considerations, primarily to make precise the notion of a continuous map from
$\mathcal{Z}'_{\leq M} \to C_b(\mathcal{X}, \mathcal{Y})$. This is complicated by the fact that $\mathcal{Z}'_{\leq M}$ is a union of
(subsets of) vector spaces with differing dimensionality, and the fact that $C_b(\mathcal{X}, \mathcal{Y})$ is an
⁴ For every m ∈ {1, . . . , M}, $\mathcal{Z}'_{\leq M} \cap \mathcal{Z}_m$ must be closed and closed under permutations and
translations.
⁵ For every m ∈ {1, . . . , M}, the restriction $\Phi|_{\mathcal{Z}'_{\leq M} \cap \mathcal{Z}_m}$ is continuous.
infinite-dimensional function space. The crux of the proof is to show that the proposed
embedding E is a homeomorphism (that is, a continuous map with a continuous inverse)
between $\mathcal{Z}'_{\leq M}$ and a space constructed from certain reproducing kernel Hilbert spaces
that have ψ as their reproducing kernel. Once this has been established, the rest of
the proof is straightforward:
Proof sketch of necessity (incomplete, informal). The proof follows the strategy used
by Zaheer et al. (2017) and Wagstaff et al. (2019). We choose ψ to be the exponentiated
quadratic (EQ) kernel,
\[
\psi(x, x') = \sigma^2 \exp\left( -\frac{1}{2\ell^2} \| x - x' \|^2 \right). \tag{5.11}
\]
Let $D \in \mathcal{Z}'_{\leq M}$ be a dataset. Assume E is a homeomorphism. By invertibility of E,
$D = E^{-1}(E(D))$. Therefore,
\[
\Phi(D) = \Phi(E^{-1}(E(D))) = (\Phi \circ E^{-1})\left( \sum_{i=1}^{M} \phi(y_i)\, \psi(\cdot - x_i) \right). \tag{5.12}
\]
Let $\mathcal{H}$ denote an appropriate space of functions that includes the range of E. Define
$\rho : \mathcal{H} \to C_b(\mathcal{X}, \mathcal{Y})$ by $\rho = \Phi \circ E^{-1}$. First, ρ is continuous since Φ is continuous and
$E^{-1}$ is continuous as E is a homeomorphism. Second, $E^{-1}$ is translation equivariant,
because ψ is a stationary kernel. Also, Φ is translation equivariant by assumption. Thus
their composition ρ is also translation equivariant. Hence any continuous, translation
equivariant map Φ can be written in the form given in Equation (5.6).
The full proof of necessity is beyond the scope of this thesis, and is provided in
Gordon et al. (2020, appendix A). Here we discuss several key points from the proof
that have practical implications and provide insights for the design of convolutional
NPs:
1. For the construction of ρ and E, ψ is set to be a flexible positive-definite kernel
(Equation (5.11)) associated with a reproducing kernel Hilbert space (RKHS;
Aronszajn (1950)), which results in desirable properties for E.
2. Using the work of Zaheer et al. (2017), we set $\phi(y) = (y^0, y^1, \ldots, y^K)$ to be the
powers of y up to order K, where K is the multiplicity.
3. Theorem 6 requires ρ to be a powerful function approximator of continuous,
translation equivariant maps between functions.
In Section 5.3, we discuss how these theoretical results inform our implementation of
the ConvCNP.
Theorem 6 extends the result of Zaheer et al. (2017) discussed in Section 4.3 by
embedding the set into an infinite-dimensional space—the RKHS—instead of a finite-
dimensional space. Beyond allowing the model to exhibit translation equivariance, the
RKHS formalism allows us to naturally deal with finite sets of varying sizes, which
turns out to be challenging with finite-dimensional embeddings. Furthermore, our
formalism requires $\phi(y) = (y^0, y^1, y^2, \ldots, y^K)$ to expand up to order no more than the
multiplicity of the sets K; if K is bounded, then our results hold for sets up to any
arbitrarily large finite size M , while fixing ϕ to be only (K + 1)-dimensional.
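To give some intuition for why powers of y up to order K are enough when at most K outputs share an input location, the sketch below sums ϕ over a small multiset of scalar outputs and then recovers that multiset from the resulting power sums via Newton's identities. This is only an illustration of the sum-of-powers idea for scalar y; it is not the construction used in the full proof, and the helper names are hypothetical.

```python
import numpy as np

def power_sums(ys, K):
    """Sum of phi(y) = (y^0, ..., y^K) over the multiset ys."""
    ys = np.asarray(ys, dtype=float)
    return np.array([np.sum(ys**k) for k in range(K + 1)])

def recover_multiset(p):
    """Invert power sums p[0..K] for a multiset with n = p[0] <= K elements."""
    n = int(round(p[0]))                       # zeroth power sum counts the elements
    e = [1.0]                                  # elementary symmetric polynomials e_0, ..., e_n
    for k in range(1, n + 1):
        # Newton's identity: k * e_k = sum_{i=1}^{k} (-1)^(i-1) e_{k-i} p_i
        e.append(sum((-1) ** (i - 1) * e[k - i] * p[i] for i in range(1, k + 1)) / k)
    # The values are the roots of prod_j (t - y_j) = sum_k (-1)^k e_k t^(n-k).
    coeffs = [(-1) ** k * e[k] for k in range(n + 1)]
    return np.sort(np.roots(coeffs).real)

ys = [0.5, -1.0, 2.0]
p = power_sums(ys, K=3)
recovered = recover_multiset(p)
```

Because the embedding of a multiset can be inverted in this way, no information about the outputs at a given location is lost by summing ϕ over them.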
5.3 Convolutional conditional neural processes
In this section we discuss the architecture and implementation details for ConvCNPs,
which produce factorised predictive distributions. Similarly to other CNPs, ConvCNPs
model the conditional distribution as
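\[
p(y | x, D) = \prod_{n=1}^{N} p(y_n | \Phi_\theta(D)(x_n)) = \prod_{n=1}^{N} \mathcal{N}(y_n; \mu_n, \sigma_n) \quad \text{with} \quad (\mu_n, \sigma_n) = \Phi_\theta(D)(x_n), \tag{5.13}
\]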
where D is the observed dataset and Φ is a ConvDeepSet (Theorem 6). Here we denote
the learnable parameters of Φ as θ. As with other CNPs, the ConvCNP has fully
tractable predictive likelihoods. This allows us to use the simple maximum-likelihood
objective to learn θ, as described in Section 4.4.1.
We now turn to the architectural details of the ConvCNP. The key considerations
are the design of ϕ, ψ, and ρ for Φ (see Theorem 6).
Form of ϕ. The applications considered in this thesis have a single (potentially
multi-dimensional) output per input location, so the multiplicity of Z is one (i.e.,
K = 1). It then suffices to let ϕ be a power series of order one, which is equivalent
to appending a constant to y in all datasets, i.e. ϕ(y) = [1, y]⊤. The first output ϕ1
thus provides the model with information regarding where data has been observed,
which is necessary to distinguish, for example, between having no observed datapoint
at x and a datapoint at x with y = 0. Denoting the functional representation as h,
we can think of the first channel h(0) as a ‘density channel’ — it gives information
about how densely in space data has been observed at a particular location. We found
it helpful to divide the remaining channels h(1:) by h(0) (Figures 5.2b and 5.2c, line
5), as this improved performance when there is large variation in the density of input
locations. In the image processing literature, this is known as a normalised convolution
(Knutsson and Westin, 1993). The normalisation operation can be reversed by ρ and
therefore does not restrict the expressivity of the model. Furthermore, this normalised
signal channel can be viewed as an implementation of the Nadaraya-Watson estimator
for the mean function (Nadaraya, 1964; Watson, 1964).

[Figure 5.2, panel (a): forward pass for off-the-grid data. The context set $D_c = (x_n, y_n)_{n=1}^N$ is mapped (arrow 1) to the functional representation, with density channel $h^{(0)} = \sum_n \psi(\cdot - x_n)$ and normalised signal channel $h^{(1)} = \sum_n y_n \psi(\cdot - x_n) / \sum_n \psi(\cdot - x_n)$; this is evaluated (arrow 2) at the discretisation $(t_i)_{i=1}^T$; finally (arrow 3) the CNN is applied and predictions are formed as $[\mu(x^*), \sigma(x^*)]^\top = \sum_{i=1}^T [f_\mu(t_i), e^{f_\sigma(t_i)}]^\top \psi_\rho(x^* - t_i)$, giving e.g. $p(y_3^* | x_3^*, D_c)$.]

[Figure 5.2, panel (b): pseudo-code for off-the-grid data.]
    require: ρ = (CNN, ψ_ρ), ψ, and density γ
    require: context (x_n, y_n)_{n=1}^N, target (x*_m)_{m=1}^M
     1  begin
     2    lower, upper ← range((x_n)_{n=1}^N ∪ (x*_m)_{m=1}^M)
     3    (t_i)_{i=1}^T ← uniform_grid(lower, upper; γ)
     4    h_i ← Σ_{n=1}^N [1, y_n]^⊤ ψ(t_i − x_n)
     5    h_i^(1) ← h_i^(1) / h_i^(0)
     6    (f_µ(t_i), f_σ(t_i))_{i=1}^T ← CNN((t_i, h_i)_{i=1}^T)
     7    µ_m ← Σ_{i=1}^T f_µ(t_i) ψ_ρ(x*_m − t_i)
     8    σ_m ← Σ_{i=1}^T pos(f_σ(t_i)) ψ_ρ(x*_m − t_i)
     9    return (µ_m, σ_m)_{m=1}^M
    10  end

[Figure 5.2, panel (c): pseudo-code for on-the-grid data.]
    require: ρ = CNN and E = conv_θ
    require: image I, context M_c, and target mask M_t
     1  begin
     2    // We discretize at the pixel locations.
     3    Z_c ← M_c ⊙ I   // Extract context set.
     4    h ← conv_θ([M_c, Z_c]^⊤)
     5    h^(1:C) ← h^(1:C) / h^(0)
     6    f_t ← M_t ⊙ CNN(h)
     7    µ ← f_t^(1:C)
     8    σ ← pos(f_t^(C+1:2C))
     9    return (µ, σ)
    10  end

Fig. 5.2 (a) Illustration of the ConvCNP forward pass in the off-the-grid case and
pseudo-code for (b) off-the-grid and (c) on-the-grid data. The function pos : R → (0,∞)
is used to enforce positivity.
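The role of the density channel and of the normalisation can be made concrete with a short sketch. It assumes an EQ kernel for ψ with an arbitrary length scale and a small evaluation grid, both chosen only for illustration. It shows that a datapoint with y = 0 leaves the signal channel at zero but still registers in the density channel, and that dividing the signal channel by the density channel gives a kernel-weighted average of the observed outputs, i.e. a Nadaraya-Watson estimate.

```python
import numpy as np

def functional_representation(xc, yc, t, lengthscale=0.3):
    """Density channel h0 and (unnormalised) signal channel h1 on the grid t."""
    w = np.exp(-0.5 * ((t[:, None] - xc[None, :]) / lengthscale) ** 2)  # psi(t_i - x_n)
    h0 = w.sum(axis=1)       # where, and how densely, data was observed
    h1 = w @ yc              # kernel-weighted sum of the outputs
    return h0, h1

t = np.linspace(-2.0, 2.0, 9)

# A single datapoint at x = 0 with y = 0: invisible in h1, visible in h0.
h0_zero, h1_zero = functional_representation(np.array([0.0]), np.array([0.0]), t)

# Two datapoints sharing the output value 2: the normalised channel is 2 everywhere.
h0, h1 = functional_representation(np.array([-1.0, 0.5]), np.array([2.0, 2.0]), t)
normalised = h1 / h0
```

Without the density channel, the first case would be indistinguishable from an empty context set, which is the ambiguity described in the text.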
Having specified ϕ, it remains to specify the form of ψ and ρ. Our choice for ψ and
ρ will depend on whether the data lies on-the-grid or off-the-grid, as we detail in the
next sections.
5.3.1 ConvCNPs for off-the-grid data
We first describe the form of ψ and ρ in the case where data lives off-the-grid. Our
proof of Theorem 6 suggests that ψ should be a stationary, non-negative, positive-
definite kernel. The exponentiated-quadratic (EQ) kernel with a learnable length scale
parameter is a natural choice. This kernel is multiplied by ϕ to form the functional
representation E(D) (Figure 5.2b, line 4; and Figure 5.2a, arrow 1).
Next, Theorem 6 suggests that ρ should be a continuous, translation equivariant map
between function spaces. Yarotsky (2022, Theorem 3.1) shows that any translation
equivariant continuous function can be arbitrarily well approximated by a CNN.
Furthermore, using a CNN for ρ allows us to take advantage of all the considerable
work put in by the research community on designing and optimising CNN architectures.
However, CNNs operate on discrete (on-the-grid) input spaces and produce discrete
outputs. Hence in order to approximate ρ with a CNN, we discretise the input of ρ,
apply the CNN, and finally transform the CNN output back to a continuous function
X → Y. To do this, for each context and test set, we space points $(t_i)_{i=1}^n \subset \mathcal{X}$ on
a uniform grid (at a pre-specified density) over a hyper-cube that covers both the
context and target inputs. We then evaluate $(E(D)(t_i))_{i=1}^n$ (Figure 5.2b, lines 2–3;
Figure 5.2a, arrow 2). This discretised representation of E(D) is then passed through
a CNN (Figure 5.2b, line 6; Figure 5.2a, arrow 3).
To map the output of the CNN back to a continuous function X → Y, we use
the CNN outputs as weights for evenly-spaced basis functions (again employing the
EQ kernel), which we denote by ψρ (Figure 5.2b, lines 7–8; Figure 5.2a, arrow 3).
The resulting approximation to ρ is not perfectly translation equivariant, but will
be approximately so for length scales larger than the spacing of $(E(D)(t_i))_{i=1}^n$. The
resulting continuous functions are then used to generate the (Gaussian) predictive mean
and variance at any input. This, in turn, can be used to evaluate the log-likelihood.
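The discretisation and smoothing steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not necessarily the implementation used in the experiments: the function names (`eq_kernel`, `encode_on_grid`, `decode_from_grid`), the fixed length scale, and the normalisation of the data channel by a density channel are choices made for this sketch, and the CNN that sits between the two steps is omitted.

```python
import numpy as np

def eq_kernel(diff, length_scale):
    """Exponentiated-quadratic kernel evaluated on pairwise differences."""
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def encode_on_grid(x_ctx, y_ctx, grid, length_scale=0.1):
    """Evaluate a two-channel functional representation of the context set on `grid`.

    Channel 0 is a density channel, sum_n psi(t - x_n).
    Channel 1 is a data channel, sum_n y_n psi(t - x_n), divided by the density
    channel (this normalisation is an assumption of the sketch).
    """
    weights = eq_kernel(grid[:, None] - x_ctx[None, :], length_scale)  # (n_grid, n_ctx)
    density = weights.sum(axis=1)
    data = (weights @ y_ctx) / (density + 1e-8)
    return np.stack([density, data], axis=0)                            # (2, n_grid)

def decode_from_grid(cnn_out, grid, x_tgt, length_scale=0.1):
    """Use per-grid-point outputs as weights for evenly spaced EQ basis functions."""
    basis = eq_kernel(x_tgt[:, None] - grid[None, :], length_scale)     # (n_tgt, n_grid)
    return basis @ cnn_out

# Toy usage: shifting the context inputs and the grid by the same amount
# leaves the discretised representation unchanged.
x_ctx = np.array([-1.0, 0.2, 0.7])
y_ctx = np.array([0.5, -0.3, 1.2])
grid = np.linspace(-2.0, 2.0, 81)
rep = encode_on_grid(x_ctx, y_ctx, grid)
tau = 0.5
rep_shifted = encode_on_grid(x_ctx + tau, y_ctx, grid + tau)
pred = decode_from_grid(rep[1], grid, np.array([0.0, 0.5]))
```

Because the kernel depends only on differences between inputs, translating the data and the grid together gives the same grid values, which is the discrete counterpart of translation equivariance.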
5.3.2 ConvCNPs for on-the-grid data
We next discuss the ConvCNP architecture in the case where the data live on a regularly
spaced grid. While the ConvCNP is readily applicable to many on-the-grid settings,
here we focus on images (other on-the-grid data formats can be viewed as a kind of
image, potentially with many channels). As such, the following description uses the
image completion task as an example, which is often used to benchmark NPs (Garnelo
et al., 2018a; Kim et al., 2018). Compared to the off-the-grid case, the implementation
becomes simpler as we can naturally choose the discretisation (ti)ni=1 to be the pixel
locations.
Let $I \in \mathbb{R}^{H \times W \times C}$ be an image, where H, W and C denote the height, width, and number
of channels, respectively, and let $M_c$ be the context mask, which is defined such
that $[M_c]_{i,j} = 1$ if pixel location (i, j) is in the context set, and 0 otherwise. Let ⊙
denote the element-wise (Hadamard) product. To implement ϕ, we select all context
points by multiplying with the mask, $Z_c := M_c \odot I$, and prepend the context mask:
$\phi = [M_c, Z_c]^\top$ (Figure 5.2c, line 4). Here the context mask provides information to the
ConvCNP about where the data are observed.
Next, we apply a single convolution layer to the context mask to form the on-the-grid
density channel: $h^{(0)} = \mathrm{conv}_\theta(M_c)$ (Figure 5.2c, line 4). To all other channels, we
apply a normalized convolution: $h^{(1:C)} = \mathrm{conv}_\theta(y)/h^{(0)}$ (Figure 5.2c, line 5), where
the division is element-wise. The filter of the convolution is analogous to ψ, which
means that h is the functional representation, with the convolution performing the
role of E (the summation in Figure 5.2b, line 4). Although the theory suggests using a
non-negative, positive-definite kernel, we did not find significant empirical differences
between an EQ kernel and using a fully trainable kernel restricted to positive values to
enforce non-negativity.
Lastly, we describe the on-the-grid version of ρ(·), which consists of two stages.
First, we apply a CNN to E(D) (Figure 5.2c, line 6). Second, we apply a shared,
pointwise MLP that maps the output of the CNN at each pixel location in the target
set to $\mathbb{R}^{2C}$, where we absorb the pointwise MLP into the CNN. The first C outputs of
the MLP are the means of the Gaussian predictive distribution and the second C are
the standard deviations, which are then passed through a positivity-enforcing function
(Figure 5.2c, lines 7–8). To summarise, the on-the-grid algorithm is given by
$$
(\mu, \mathrm{pos}^{-1}(\sigma)) = \underbrace{\mathrm{CNN}}_{\rho}\Big(\overbrace{\big[\underbrace{\mathrm{conv}(M_c)}_{\text{density channel}};\ \mathrm{conv}(M_c \odot I)/\underbrace{\mathrm{conv}}_{\text{multiplies by } \psi \text{ and sums}}(M_c)\big]^\top}^{E(\text{context set})}\Big), \tag{5.14}
$$
where (µ, σ) are the predicted image mean and standard deviation over the image
locations, ρ is implemented with the CNN, and E is implemented with the mask Mc
and convolution conv. Here the semicolon denotes the stacking of different channels
in the CNN input.
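The encoder part of Equation (5.14) can be sketched with plain NumPy. The following is an illustration only: `conv2d_same` and `on_grid_encoder` are hypothetical helper names, the filter is a fixed positive EQ-shaped kernel rather than a learned one, and the small constant `eps` guarding the division is an assumption of the sketch.

```python
import numpy as np

def conv2d_same(channel, kernel):
    """Zero-padded 'same' 2D cross-correlation of one channel with an odd-sized kernel."""
    kh, kw = kernel.shape
    padded = np.pad(channel, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(channel.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + channel.shape[0], j:j + channel.shape[1]]
    return out

def on_grid_encoder(image, mask, kernel, eps=1e-8):
    """Return [density channel; normalised data channels] with shape (1 + C, H, W)."""
    density = conv2d_same(mask, kernel)                       # conv(M_c)
    channels = [density]
    for c in range(image.shape[-1]):
        signal = conv2d_same(mask * image[..., c], kernel)    # conv(M_c * I), per channel
        channels.append(signal / (density + eps))             # element-wise division
    return np.stack(channels, axis=0)

# Toy usage: a constant image observed at roughly 30% of the pixels.
rng = np.random.default_rng(0)
image = np.full((8, 8, 3), 0.7)
mask = (rng.random((8, 8)) < 0.3).astype(float)
offsets = np.arange(-2, 3)
kernel = np.exp(-0.5 * (offsets[:, None] ** 2 + offsets[None, :] ** 2))
h = on_grid_encoder(image, mask, kernel)
```

For a constant image, the normalised channels recover the constant wherever at least one context pixel falls inside the filter window, which shows why the division by the density channel makes the representation insensitive to how densely the context is sampled.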
5.4 ConvCNP experimental results
We evaluate the performance of ConvCNPs in both on-the-grid and off-the-grid settings,
focusing on two central questions:
1. Do translation equivariant models improve performance over non-translation
equivariant models in appropriate domains?
2. Can translation equivariance enable ConvCNPs to generalise to settings outside
of those encountered during training?
We use several off-the-grid datasets consisting of irregularly sampled time series ($\mathcal{X} = \mathbb{R}$),
comparing ConvCNPs against Gaussian processes (GPs; Rasmussen and Williams
(2005)) and attentive CNPs (ACNP; which is identical to the ANP (Kim et al., 2018),
but without the latent path in the encoder). We then evaluate on several on-the-grid
image datasets ($\mathcal{X} = \mathbb{Z}^2$). In all settings we demonstrate substantial improvements
over the ACNP. For the CNN component of our model, we propose a small and large
architecture for each experiment (in the experimental sections named ConvCNP and
ConvCNPXL, respectively). We note that these architectures are different for off-the-
grid and on-the-grid experiments, with full details regarding the architectures given in
the appendices.
5.4.1 Synthetic 1D experiments
We first consider synthetic regression problems. At each iteration, a function is sampled,
followed by context and target sets. Beyond EQ-kernel GPs (as proposed in Garnelo
et al. (2018a); Kim et al. (2018)), we consider more complex data arising from Matérn–5/2
and weakly-periodic kernels, as well as a challenging, non-Gaussian sawtooth process
with random shift and frequency (see Figure 5.3 for an example). ConvCNP is compared
to CNP (Garnelo et al., 2018a) and ACNP. Training and testing procedures are fixed
across all models. Full details on models, data generation, and training procedures are
provided in Appendix D.2.
Table 5.1 reports the log-likelihood means and standard errors of the models over
1000 tasks. The context and target points for both training and testing lie within the
interval [−2, 2], where training data was observed (marked ‘training data range’ in
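Figure 5.3). Table 5.1 demonstrates that, even when extrapolation beyond the training
range is not required, the ConvCNPXL significantly outperforms the other models in all cases,
and the smaller ConvCNP does so on every dataset except EQ, where it is on par with the
ACNP, despite both models having fewer parameters.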
Fig. 5.3 Example functions learned by the ACNP (top row) and ConvCNP (bottom
row) when trained on a Matérn–5/2 kernel with length scale 0.25 (first and second
columns) and a sawtooth function (third and fourth columns). Columns one and three
show the predictive distribution of the models when data is presented in the same range
as training, with predictive distributions continuing beyond that range on either side.
Columns two and four show the model predictive distributions when presented with data
outside the training data range. Plots show means and two standard deviations.
Table 5.1 Log-likelihood and standard errors from synthetic 1-dimensional experiments.
Model       Params   EQ             Weak Periodic   Matérn–5/2     Sawtooth
MLP-CNP     66818    -0.86 ± 3e-3   -1.23 ± 2e-3    -0.95 ± 1e-3   -0.16 ± 1e-5
ACNP        149250    0.72 ± 4e-3   -1.20 ± 2e-3     0.10 ± 2e-3   -0.16 ± 2e-3
ConvCNP     6537      0.70 ± 5e-3   -0.92 ± 2e-3     0.32 ± 4e-3    1.43 ± 4e-3
ConvCNPXL   50617     1.06 ± 4e-3   -0.65 ± 2e-3     0.53 ± 4e-3    1.94 ± 1e-3
Table 5.2 Log-likelihood with standard errors from image experiments (6 runs).
Model       Params   MNIST         SVHN          CelebA32      CelebA64      ZSMM
ACNP        410k     1.08 ± 0.04   3.94 ± 0.02   3.18 ± 0.02   -             -0.83 ± 0.08
ConvCNP     113k     1.21 ± 0.00   3.89 ± 0.01   3.22 ± 0.02   3.66 ± 0.01    1.18 ± 0.04
ConvCNPXL   400k     1.27 ± 0.01   3.97 ± 0.02   3.39 ± 0.02   3.73 ± 0.01    0.86 ± 0.12
Figure 5.3 demonstrates that the ConvCNP generates excellent fits, even for
challenging functions such as those sampled from the Matérn–5/2 GP and sawtooth
process. Moreover, Figure 5.3 compares the performance of the ConvCNP and ACNP
when data is observed outside the range where the models were trained: translation
equivariance enables the ConvCNP to elegantly generalise to this setting, whereas the
ACNP is unable to generate reasonable predictions.
5.4.2 2D image completion experiments
To test the ConvCNP beyond one-dimensional features, we evaluate our model on
on-the-grid image completion tasks and compare it to the ACNP. Image completion
can be cast as a prediction of pixel intensities $y^*_i$ ($\in \mathbb{R}^3$ for RGB, $\in \mathbb{R}$ for greyscale)
given a target 2D pixel location $x^*_i$ conditioned on an observed (context) set of pixel
values $D = ((x_n, y_n))_{n=1}^N$. In the following experiments, the context set can vary but
the target set contains all pixels from the image. Further experimental details are in
Section D.3.1.
Standard image benchmarks We first evaluate the model on four common bench-
marks: MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), and 32 × 32 and
64×64 CelebA (Liu et al., 2018). Importantly, these datasets are biased towards images
containing a single, well-centered object. As a result, perfect translation equivariance
might hinder the performance of the model when the test data are similarly structured.
However, if the receptive field of the CNN is larger than the input image, then the
model can learn absolute-position specific features, due to the zero-padding (Islam
et al., 2019). Hence we might expect larger models like the ConvCNPXL to perform
better than the smaller ConvCNP in situations where absolute spatial position is
important for the task.
Table 5.2 shows that the ConvCNP significantly outperforms the ACNP when it
has a large receptive field size, while being at least as good with a small receptive field
size. Qualitative samples for various context sets can be seen in Figure 5.4.
Generalisation to multiple, non-centered objects The datasets from the pre-
vious paragraphs were centered and contained single objects. Here we test whether
ConvCNPs trained on such data can generalise to images containing multiple, non-
centered objects. To test this, we introduce the zero-shot multi-MNIST (ZSMM)
dataset. The training set contains all 60000 28× 28 MNIST training digits centered
on a black 56× 56 background (Figure 5.5a). For the test set, we randomly sample
with replacement 10000 pairs of digits from the MNIST test set, place them on a black
56 × 56 background, and translate the digits in such a way that the digits can be
arbitrarily close but cannot overlap (Figure 5.5b). Importantly, the scale of the digits
and the image size are the same during training and testing.
The last column of Table 5.2 evaluates the models in the zero-shot multi-MNIST
setting, where images contain multiple digits at test time. The ConvCNP significantly
outperforms the ACNP on such tasks. Figure 5.6a shows a histogram of the image
log-likelihoods for ConvCNP and ACNP, as well as qualitative results at different
percentiles of the ConvCNP distribution. ConvCNP is able to extrapolate to this
out-of-distribution test set, while ACNP appears to model the bias of the training
data and predict a centered ‘mean’ digit independently of the context. Interestingly,
ConvCNPXL does not perform as well on this task. In particular, we find that, as the
receptive field becomes very large, performance on this task decreases. We hypothesize
that this has to do with behavior of the model at the edges of the image. CNNs
with larger receptive fields—the region of input pixels that affect a particular output
pixel—are able to model non-stationary behavior by looking at the distance from any
pixel to the image boundary.
Although ZSMM is a contrived task, note that our field of view usually contains
multiple independent objects, thereby requiring translation equivariance. As a more
realistic example, we took a ConvCNP model trained on CelebA and tested it on a
natural image of different shape which contains multiple people (Figure 5.6b). Even
with 95% of the pixels removed, the ConvCNP was able to produce a qualitatively
reasonable reconstruction.
Fig. 5.4 Qualitative evaluation of the ConvCNP (XL). For each dataset, an image is
randomly sampled; the first row shows the given context points while the second is
the mean of the estimated conditional distribution. From left to right the first seven
columns correspond to a context set with 3, 1%, 5%, 10%, 20%, 30%, 50%, 100%
randomly sampled context points. In the last two columns, the context sets respectively
contain all the pixels in the left and top half of the image. ConvCNPXL is shown
for all datasets besides ZSMM, for which we show the fully translation equivariant
ConvCNP.
(a) Train (b) Test
Fig. 5.5 Samples from our generated zero-shot multi MNIST (ZSMM) dataset.
(a) Log-likelihood and qualitative results on ZSMM.
The top row shows the log-likelihood distribution for
both models. The images below correspond to the
context points (top), ConvCNP target predictions
(middle), and ACNP target predictions (bottom).
Each column corresponds to a given percentile of
the ConvCNP distribution.
(b) Qualitative evaluation of a ConvCNPXL
trained on the unscaled CelebA
(218×178) and tested on Ellen’s unscaled
(337×599) Oscar selfie (DeGeneres,
2014) with 5% of the pixels as context
(top).
Fig. 5.6 Zero-shot generalisation to tasks that require translation equivariance.
Computational efficiency Beyond the performance and generalisation improve-
ments, a key advantage of the ConvCNP is its computational efficiency. The memory
and time complexity of a single self-attention layer grows quadratically with the number
of inputs (the number of pixels for images) but only linearly for a convolutional layer.
Empirically, with a batch size of 16 on 32×32 MNIST, ConvCNPXL requires 945 MB of
VRAM, while ACNP requires 5839 MB. For the 56×56 ZSMM, ConvCNPXL increases
its requirements to 1443 MB, while ACNP could not fit onto a 32 GB GPU. Ultimately,
ACNP had to be trained with a batch size of 6 (using 19139 MB) and we were not
able to fit it on the GPU for CelebA64.
5.4.3 Limitations of factorised predictive distributions
We have introduced the ConvCNP and shown that it outperforms the ACNP in a variety
of synthetic and real-world regression tasks. However, as described in Section 4.2,
all CNPs, ConvCNPs included, are unable to produce predictive distributions that
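have dependencies between different target locations. More precisely, let
$\mathcal{P}_N(\mathcal{X}, \mathcal{Y}) \subset \mathcal{P}(\mathcal{X}, \mathcal{Y})$ denote the set of noise GPs: Gaussian processes on $\mathcal{X}$ whose covariance is
given by $\mathrm{Cov}(x, x') = \sigma^2(x)\,\delta[x - x']$, where $\sigma^2 \in C_b(\mathcal{X}, \mathcal{Y})$ and δ is the Kronecker delta,
with $\delta[0] = 1$ and $\delta[\,\cdot\,] = 0$ otherwise. Then the ConvCNP is a map
$\mathrm{ConvCNP} : \mathcal{Z} \to \mathcal{P}_N(\mathcal{X}, \mathcal{Y})$ with Equation (5.13) defining its finite-dimensional marginals. Unfortunately,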
predictive stochastic processes in PN(X ,Y) possess two key limitations. First, it is
impossible to obtain coherent function samples from the predictive distribution as
each point of the function is generated independently. This severely limits the ability
of ConvCNPs to be used in tasks such as Thompson sampling. Furthermore, when
using ConvCNPs to estimate the probability that the value of the predicted function
over the entirety of a given range will exceed a certain threshold, this probability may
be drastically underestimated due to the factorisation assumption. One example is
in heatwave or flood prediction, where we are interested in the probability that the
temperature or amount of precipitation exceeds a threshold throughout some region of
space or time, in order to predict droughts or floods (Markou et al., 2022).
Another limitation of PN(X ,Y) is that Gaussian predictive distributions cannot
model multi-modality, heavy-tailedness, or asymmetry. Although this can be addressed
by using the ConvCNP output to parameterise more flexible families of distributions,
such as mixtures of Gaussians or normalising flows, in the next section we will show
how introducing a latent variable can lift both the restrictions of factorisation and
Gaussianity simultaneously.
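The underestimation of joint exceedance probabilities described above can be illustrated with a small Monte Carlo experiment. The sketch below is not taken from the experiments in this chapter: the EQ covariance, its length scale, the threshold and the sample count are all illustrative assumptions. It compares a correlated Gaussian predictive with a factorised one that has exactly the same marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)

# Illustrative EQ covariance (length scale 1.0) with a small jitter for stability.
cov = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(x.size)
mean = np.zeros(x.size)
threshold = -0.5
num_samples = 20_000

# Correlated predictive: sample whole functions via the Cholesky factor.
chol = np.linalg.cholesky(cov)
joint = mean + rng.standard_normal((num_samples, x.size)) @ chol.T
p_joint = np.mean(np.all(joint > threshold, axis=1))

# Factorised predictive: identical marginals, but independent across locations.
std = np.sqrt(np.diag(cov))
fact = mean + rng.standard_normal((num_samples, x.size)) * std
p_fact = np.mean(np.all(fact > threshold, axis=1))

print(f"P(all f(x_i) > threshold): correlated {p_joint:.3f}, factorised {p_fact:.4f}")
```

Under the factorised predictive the event requires twenty independent exceedances, so its probability is the product of the marginal probabilities and is far smaller than under the correlated predictive, even though every marginal is identical.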
5.5 Convolutional latent neural processes
We now present the convolutional latent neural process (ConvLNP), which addresses
the weaknesses of ConvCNPs. The ConvLNP extends the ConvCNP by parameterising
a map to predictive stochastic processes more expressive than PN(X ,Y), allowing
for coherent sampling and non-Gaussian predictive distributions. It achieves this
by passing the output of a ConvCNP through a non-linear, translation equivariant
map between function spaces. Specifically, the ConvLNP uses an encoder-decoder
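architecture, where the encoder $E : \mathcal{Z} \to \mathcal{P}_N(\mathcal{X}, \mathcal{Y})$ is a ConvCNP and the decoder
$d : \mathcal{Y}^{\mathcal{X}} \to \mathcal{Y}^{\mathcal{X}}$ is translation equivariant (here $\mathcal{Y}^{\mathcal{X}}$ denotes the set of all functions from
$\mathcal{X}$ to $\mathcal{Y}$). Note that throughout this section, when describing the ConvLNP, we will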
use the terms ‘encoder’ and ‘decoder’ differently to their use in Section 4.2. There, the
encoder described how elements of the context set are embedded and aggregated into
a single representation. The decoder was then used to combine that representation
with a target input location to form a prediction. Here, we refer to E, which is itself a
complete ConvCNP, as the encoder for the ConvLNP. The decoder then simply refers
to the second stage of the ConvLNP, d, which transforms samples from the ConvCNP
encoder.⁶
Conditioned on the context set Dc, ConvLNP samples can be obtained by sampling
a function z ∼ ConvCNP(Dc) and then computing f = d(z). This is illustrated in
Figure 5.7. Importantly, d takes functions to functions and does not necessarily act
point-wise: letting f(x) depend on the value of z at multiple locations is crucial for
inducing dependencies in the predictive distribution. This sampling procedure induces
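a map between stochastic processes, $D : \mathcal{P}_N(\mathcal{X}, \mathcal{Y}) \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$. Putting these together,
and making explicit the parameter dependence in E and D, the ConvLNP is constructed
as
$$
\mathrm{ConvLNP}_{\theta,\phi} = D_\theta \circ E_\phi, \qquad E_\phi = \mathrm{ConvCNP}_\phi, \qquad D_\theta = (d_\theta)_*, \tag{5.15}
$$
where $(d_\theta)_*$ is the pushforward⁷ under $d_\theta$.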
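The role of a non-pointwise decoder in inducing dependencies can be seen in a toy example. In the sketch below, independent Gaussian samples on a grid stand in for draws from a noise GP; `pointwise_decoder` and `conv_decoder` are hypothetical stand-ins for d (a moving average followed by a tanh, chosen purely for illustration), and the grid size and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, grid_size = 5000, 64
z = rng.standard_normal((num_samples, grid_size))   # independent values at each grid location

def pointwise_decoder(z):
    """Acts on each location separately, so the outputs remain independent."""
    return np.tanh(z)

def conv_decoder(z, width=5):
    """Averages neighbouring locations before the non-linearity."""
    filt = np.ones(width) / width
    smoothed = np.apply_along_axis(lambda row: np.convolve(row, filt, mode="same"), 1, z)
    return np.tanh(smoothed)

def neighbour_correlation(f):
    """Empirical correlation between two adjacent interior grid locations."""
    return np.corrcoef(f[:, 30], f[:, 31])[0, 1]

corr_pointwise = neighbour_correlation(pointwise_decoder(z))
corr_conv = neighbour_correlation(conv_decoder(z))
print(f"pointwise: {corr_pointwise:.3f}, convolutional: {corr_conv:.3f}")
```

The pointwise map leaves neighbouring outputs essentially uncorrelated, whereas the convolutional map makes them strongly correlated because each output depends on several shared values of z.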
We now prove that the ConvLNP is indeed a translation equivariant map from
datasets to stochastic processes, by proving that the decoder and encoder are separately
translation equivariant.
⁶ The choice of what constitutes the ‘encoder’ and ‘decoder’ here is somewhat arbitrary and is
primarily a naming convention rather than a fundamental distinction.
⁷ That is, $(d_\theta)_*(E_\phi)$ is the measure induced on $\mathbb{R}^{\mathcal{X}}$ by sampling a function from $E_\phi$ and passing it
through $d_\theta$.
[Figure 5.7 panels: (1) Context set $D_c$; (2) Encoder $E_\phi$: $z \sim \mathrm{ConvCNP}(D_c)$; (3) Decoder $D_\theta$: $f = d(z)$]
Fig. 5.7 The ConvLNP encoder-decoder architecture. The encoder is a ConvCNP
which takes the context set as input (left panel) and outputs a single sample of z
(center panel). The decoder takes this as input and outputs a predictive sample (right
panel blue; two other samples shown in grey).
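Lemma 1. Let d be a measurable, translation equivariant map from $(\mathcal{Y}^{\mathcal{X}}, \Sigma)$ to $(\mathcal{Y}^{\mathcal{X}}, \Sigma)$.
Then the ConvLNP decoder $D : \mathcal{P}(\mathcal{X}, \mathcal{Y}) \to \mathcal{P}(\mathcal{X}, \mathcal{Y})$, defined by $D(P) = d_*(P)$, where
$d_*(P)$ is the pushforward measure of P under d, is translation equivariant.
Proof. Let $F \in \Sigma$ be a measurable subset of $\mathcal{Y}^{\mathcal{X}}$. Then:
$$
\begin{aligned}
D(T_\tau P)(F) &\overset{(a)}{=} T_\tau P(d^{-1}(F)) \\
&= P(T_{-\tau} d^{-1}(F)) \\
&\overset{(b)}{=} P(d^{-1}(T_{-\tau} F)) \\
&= D(P)(T_{-\tau} F) \\
&= T_\tau D(P)(F).
\end{aligned}
$$
Here (a) follows from the definition of the pushforward, and (b) follows because
$$
\begin{aligned}
T_{-\tau} d^{-1}(F) &= T_{-\tau}\{f : d(f) \in F\} \\
&= \{T_{-\tau} f : d(f) \in F\} \\
&= \{f : d(T_\tau f) \in F\} \\
&= \{f : T_\tau d(f) \in F\} \\
&= \{f : d(f) \in T_{-\tau} F\} \\
&= d^{-1}(T_{-\tau} F).
\end{aligned}
$$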
Lemma 2. The ConvLNP encoder E (which is defined to be a ConvCNP), is a
translation equivariant map from datasets to stochastic processes.
Proof. Recall that the mean and variance $\mu(\cdot, D)$, $\sigma^2(\cdot, D)$ (viewed as maps from
$\mathcal{Z} \to C_b(\mathcal{X}, \mathcal{Y})$) of the ConvCNP encoder E are both given by ConvDeepSets. Due
to the translation equivariance of ConvDeepSets (Theorem 6), $\mu(\cdot, T_\tau D) = T_\tau \mu(\cdot, D)$
for all $D \in \mathcal{Z}$, $\tau \in \mathcal{X}$, and similarly for $\sigma^2$. Let $F \in \Sigma$. Then since the measure
$E(D) \in \mathcal{P}_N(\mathcal{X}, \mathcal{Y})$ is defined entirely by its mean and variance functions,
$E(T_\tau D)(F) = E(D)(T_{-\tau} F) = T_\tau E(D)(F)$.
Noting that a composition of translation equivariant maps is itself translation
equivariant, we obtain the following proposition:
Proposition 7. Define ConvLNP = D◦E. Then ConvLNP is a translation equivariant
map from datasets to stochastic processes.
In practice, we cannot actually compute a full functional sample z from a noise
GP (PN) as described in Figure 5.7, since z comprises uncountably many independent
random variables. Instead, we consider a discrete version of the model, which enables
practical computation (at the expense of not having the theory in Proposition 7 apply
exactly). Similarly to Section 5.3.1, we discretise the domain of z on a grid $(x_i)_{i=1}^K$,
with $z := (z(x_i))_{i=1}^K$. As a consequence, the model can only be equivariant up to
shifts on this discrete grid. With this discretisation, sampling $z \sim \mathrm{ConvCNP}_\phi(D_c)$
amounts to sampling a finite number of independent Gaussian random variables,
and dθ is implemented by passing z through a CNN — which plays the role of the
translation equivariant map between (discretised) function spaces. The forward pass of
a discretised, trained ConvLNP is illustrated in Figure 5.8.
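Equivariance to shifts on the discrete grid, and the way zero padding disturbs it near the boundary, can be checked numerically. The sketch below uses circular shifts (`np.roll`) as a stand-in for translations on a finite grid and a fixed three-tap filter as a stand-in for a CNN layer; both are assumptions made for illustration.

```python
import numpy as np

def circular_conv(z, filt):
    """Circular convolution: commutes exactly with shifts by whole grid steps."""
    half = len(filt) // 2
    return sum(w * np.roll(z, k - half) for k, w in enumerate(filt))

def zero_padded_conv(z, filt):
    """Zero-padded 'same' convolution, as in a typical CNN layer."""
    return np.convolve(z, filt, mode="same")

rng = np.random.default_rng(0)
z = rng.standard_normal(32)
filt = np.array([0.25, 0.5, 0.25])
shift = 3

# Circular case: convolving a shifted input equals shifting the convolved output.
circ_ok = np.allclose(circular_conv(np.roll(z, shift), filt),
                      np.roll(circular_conv(z, filt), shift))

# Zero-padded case: the two orders of operation disagree near the boundary only.
a = zero_padded_conv(np.roll(z, shift), filt)
b = np.roll(zero_padded_conv(z, filt), shift)
pad_ok = np.allclose(a, b)
interior_ok = np.allclose(a[4:31], b[4:31])
print(circ_ok, pad_ok, interior_ok)
```

The check on the interior slice shows that the discrepancy introduced by zero padding is confined to a band at the edges whose width is set by the filter size and the shift.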
Note that CNNs are not always entirely translation equivariant due to the zero
padding that occurs at each layer. In practice, we find that this does not hinder the
model from extrapolating meaningfully. Following Kim et al. (2018), we define the
model likelihood by adding heteroskedastic Gaussian observation noise $\sigma_y^2(x, z)$ to the
predictive function draws $f = d_\theta(z) \in \mathcal{Y}^{\mathcal{X}}$. Given a context set $D_c$, the predictive
distribution for the target outputs $y_t$ given the target inputs $x_t$ is then:
$$
p(y_t \mid x_t, D_c) = \mathbb{E}_{z \sim E_\phi(D_c)}\left[ \prod_{(x, y) \in D_t} \mathcal{N}\big(y;\, d_\theta(z)(x),\, \sigma_y^2(x, z)\big) \right]. \tag{5.16}
$$
Although the product in the expectation factorises, p(yt|xt, Dc) does not: z induces
dependencies in the predictive, in contrast to Equation (5.13). We provide pseudocode
for the ConvLNP forward pass in both the off-the-grid and on-the-grid case in Figures 5.9
and 5.10.
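A simple Monte Carlo estimate of the logarithm of Equation (5.16) replaces the expectation with an average over L latent samples. The sketch below assumes the per-sample predictive means and standard deviations at the target inputs have already been computed; the function names are hypothetical, and this naive estimator is shown for illustration rather than as the training objective used in this thesis.

```python
import numpy as np

def gaussian_logpdf(y, mean, std):
    """Log-density of a univariate Gaussian, applied element-wise."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((y - mean) / std) ** 2

def mc_log_predictive(y_tgt, means, stds):
    """Estimate log[(1/L) sum_l prod_m N(y_m; means[l, m], stds[l, m]^2)].

    means, stds: arrays of shape (L, M), one row per latent sample z_l.
    """
    log_joint = gaussian_logpdf(y_tgt[None, :], means, stds).sum(axis=1)   # shape (L,)
    shift = log_joint.max()                                                # log-sum-exp trick
    return shift + np.log(np.mean(np.exp(log_joint - shift)))

# Toy usage: five latent samples that all give the same prediction.
y_tgt = np.array([0.1, -0.2, 0.4])
means = np.zeros((5, 3))
stds = np.ones((5, 3))
estimate = mc_log_predictive(y_tgt, means, stds)
```

The product over target points is taken inside each sample before averaging, so the estimate does not factorise across target points even though each term inside the expectation does.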
5.6 ConvLNP experimental results
We evaluate ConvLNPs on a range of regression tasks. Our main questions are:
Fig. 5.8 Forward pass of a ConvLNP. Steps (1)-(4) depict sampling from the encoder
Eϕ, which is a ConvCNP. This involves: (1) computing a functional representation
of the context set, with separate ‘density’ and ‘data’ channels (described in detail
in Section 5.3.1), (2) discretizing the representation, (3) passing the representation
through a CNN, which outputs the parameters of independent Gaussian distributions
spaced on a grid, and (4) sampling from these distributions. However, the samples at
each grid point are independent of each other, hence in (5) the samples are passed
through another CNN, the decoder, to induce dependencies, and then are smoothed
out.
require: $d = (\mathrm{CNN}, \psi_d)$, $E_\phi$ (off-the-grid ConvCNP), and number of samples L
require: context $(x_n, y_n)_{n=1}^N$, target $(x^*_m)_{m=1}^M$
begin
    $\mu_z, \sigma_z \leftarrow E_\phi(D_c)$
    for l = 1, ..., L do
        $z_l \sim \mathcal{N}(z; \mu_z, \sigma_z^2)$
        $(f_\mu(t_i), f_\sigma(t_i))_{i=1}^K \leftarrow \mathrm{CNN}(z_l)$
        $\mu_{m,l} \leftarrow \sum_{i=1}^K f_\mu(t_i)\, \psi_d(x^*_m - t_i)$
        $\sigma_{m,l} \leftarrow \mathrm{pos}(f_\sigma(t_i))$
    end for
    return $(\mu, \sigma)$
end
Fig. 5.9 Forward pass through a ConvLNP (off-the-grid). The function $\mathrm{pos} : \mathbb{R} \to (0, \infty)$
is used to enforce positivity.
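The sampling loop of Figure 5.9 can be sketched in NumPy for one-dimensional inputs. Several parts are assumptions made for illustration: the encoder outputs `mu_z` and `sigma_z` are taken as given, `toy_cnn` is a fixed filter standing in for a trained CNN, softplus is used for pos, and the standard deviation channel is smoothed with the same basis functions as the mean channel, which the figure does not spell out.

```python
import numpy as np

def softplus(a):
    """One possible choice for pos: R -> (0, inf)."""
    return np.logaddexp(0.0, a)

def convlnp_forward(mu_z, sigma_z, cnn, grid, x_tgt, num_samples, length_scale, rng):
    """Return per-sample predictive means and standard deviations, each of shape (L, M)."""
    # EQ basis functions psi_d(x*_m - t_i), shape (M, K).
    psi = np.exp(-0.5 * ((x_tgt[:, None] - grid[None, :]) / length_scale) ** 2)
    means, stds = [], []
    for _ in range(num_samples):
        z = mu_z + sigma_z * rng.standard_normal(mu_z.shape)   # z_l ~ N(mu_z, sigma_z^2)
        f_mu, f_sigma = cnn(z)                                  # each of shape (K,)
        means.append(psi @ f_mu)                                # sum_i f_mu(t_i) psi_d(x*_m - t_i)
        stds.append(softplus(psi @ f_sigma))                    # assumption: same smoothing for sigma
    return np.stack(means), np.stack(stds)

def toy_cnn(z):
    """Hypothetical stand-in for the decoder CNN: two fixed 1D filters."""
    filt = np.array([0.25, 0.5, 0.25])
    return np.convolve(z, filt, mode="same"), np.convolve(z, -filt, mode="same")

rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 2.0, 41)
mu_z, sigma_z = np.sin(grid), 0.1 * np.ones_like(grid)
x_tgt = np.array([-1.0, 0.0, 1.5])
means, stds = convlnp_forward(mu_z, sigma_z, toy_cnn, grid, x_tgt,
                              num_samples=4, length_scale=0.2, rng=rng)
```

Each row of `means` and `stds` corresponds to one latent sample, so the two arrays can be evaluated against target outputs by averaging the per-sample likelihoods.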
require: d = CNN, $E_\phi$ (on-the-grid ConvCNP), and number of samples L
require: image I, context mask $M_c$, and target mask $M_t$
begin
    $\mu_z, \sigma_z \leftarrow E_\phi(I, M_c)$
    for l = 1, ..., L do
        $z_l \sim \mathcal{N}(z; \mu_z, \sigma_z^2)$
        $(f_\mu(t_i), f_\sigma(t_i))_{i=1}^K \leftarrow \mathrm{CNN}(z_l)$
        $\mu \leftarrow f_t^{(1:C)}$
        $\sigma \leftarrow \mathrm{pos}\big(f_t^{(C+1:2C)}\big)$
    end for
    return $(\mu, \sigma)$
end
Fig. 5.10 Forward pass through a ConvLNP (on-the-grid). The function $\mathrm{pos} : \mathbb{R} \to (0, \infty)$
is used to enforce positivity.
1. Does the ConvLNP produce coherent, meaningful predictive samples?
2. Similarly to the ConvCNP, can it leverage translation equivariance to outperform
baseline methods within and beyond the training range (generalisation)?
3. Unlike the ConvCNP, does it learn expressive non-Gaussian predictive distribu-
tions?
4. How does training the ConvLNP with the approximate maximum likelihood
objective $\hat{\mathcal{L}}_{\mathrm{ML}}$ of Section 4.4.3 compare with training using the neural process
variational inference objective $\mathcal{L}_{\mathrm{NPVI}}$ of Section 4.4.2?
We use several approaches for evaluating latent neural processes. First, as in
Garnelo et al. (2018b) and Kim et al. (2018), we provide qualitative visual comparisons
of samples. These allow us to see if the models display meaningful structure, quan-
tify uncertainty, and are able to generalise spatially. Second, LNPs lack closed-form
likelihoods, so we evaluate lower bounds on their predictive log-likelihoods via im-
portance sampling (Le et al., 2018). As these lower bounds can be quite loose (see
Appendix E for an analysis of the looseness of the bounds as a function of number of
samples used), they are primarily useful to show when LNPs outperform baselines with
exact likelihoods, such as GPs and ConvCNPs. Finally, in Section 5.6.3 we consider
Bayesian optimisation to evaluate the usefulness of ConvLNPs for downstream tasks.
In Sections 5.6.1 and 5.6.2, we compare against the Attentive NP (ANP; Kim et al.,
2018), which in prior work has been trained only with $\mathcal{L}_{\mathrm{NPVI}}$. The ANP architectures
used in this section are comparable to those in Kim et al. (2018), and have a param-
eter count comparable to or greater than the ConvLNP. Full details are provided in
Appendix F. Code to reproduce the 1D regression experiments can be found at
https://github.com/wesselb/neuralprocesses, and code to implement the image-completion
experiments can be found at https://github.com/YannDubs/Neural-Process-Family.
5.6.1 1D regression
Similarly to Section 5.4.1, we train on an exponentiated quadratic kernel GP, a Matérn–5/2
GP, a weakly periodic GP, and a non-Gaussian sawtooth process with random shifts
and frequency (see Appendix F.1 for details). Figure 5.11 shows predictive samples,
where during training the models only observe data within the grey regions (training
range). While samples from the ANP exhibit unnatural ‘kinks’ and do not resemble the
underlying process, the ConvLNP produces smooth samples for Matérn–5/2 and samples
exhibiting meaningful structure for the weakly periodic and sawtooth processes. The
ConvLNP also generalises gracefully beyond the training range, whereas the ANP fails
catastrophically. The ANP with $\mathcal{L}_{\mathrm{NPVI}}$ collapses to deterministic samples, with the
epistemic uncertainty explained using the heteroskedastic noise $\sigma_y^2(x, z)$. This was also
noted in Le et al. (2018). This behaviour is alleviated when training with $\hat{\mathcal{L}}_{\mathrm{ML}}$, with
much of the predictive uncertainty due to variations in the sampled functions.
Table 5.3 compares lower bounds on the log-likelihood for the ConvLNP with the
ANP and MLP-LNP for both our proposed $\hat{\mathcal{L}}_{\mathrm{ML}}$ objective and the standard $\mathcal{L}_{\mathrm{NPVI}}$
objective. We also show three exact log-likelihoods: the ground-truth GP (full),
the ground-truth GP with diagonalised predictions (diag), and the ConvCNP.⁸ The
ConvCNP performs on par with the GP (diag), which is the optimal factorised predictive.
The ConvLNP lower bound is consistently higher than the GP (diag) and ConvCNP
log-likelihoods, demonstrating that its non-factorised predictive distributions improve
performance. Furthermore, the ConvLNP performs similarly inside and outside its
training range, demonstrating that translation equivariance helps generalisation. This
is in contrast to the ANP, which fails catastrophically outside its training range.
5.6.2 Image completion
We now evaluate ConvLNPs on image completion tasks, focusing on spatial generalisa-
tion. To test this, we consider zero-shot multi MNIST (ZSMM), where we train on
single MNIST digits but test on two MNIST digits on a larger canvas. We randomly
translate the digits during training, so the generative stochastic process is stationary.
The black background of MNIST causes difficulty with heteroskedastic noise, as the
models can obtain high likelihood by predicting the background with high confidence
whilst ignoring the digits. Hence for MNIST and ZSMM we use homoskedastic noise
$\sigma_y^2(z)$. Figures 5.12a and 5.12b show that the ANP fails to generalise spatially, whereas
this is naturally handled by the ConvLNP.
We also test the ConvLNP’s ability to learn non-Gaussian predictive distributions.
Figure 5.12c shows that the ConvLNP can learn highly multimodal predictive dis-
tributions, enabling the generation of diverse yet coherent samples. A quantitative
comparison of models using log-likelihood lower bounds is provided in Table 5.4, where
the ConvLNP trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ consistently achieves the highest values. Appendix F.2
provides details regarding the data, architectures, and protocols used in our image
experiments. In Section F.2.4, we provide samples and further quantitative comparisons
⁸ Note that the log-likelihood values for the ConvCNP reported in Table 5.3 are not comparable
with those given in Table 5.1 since the sampling procedures determining the size of the context and
target sets differ for the two experiments; see Section D.2.2 and Appendix F.1.
[Figure 5.11 panels: columns show the ConvLNP and the ANP; rows are labelled Matérn–5/2 (twice), weakly periodic and sawtooth, each with $\hat{\mathcal{L}}_{\mathrm{ML}}$ and $\mathcal{L}_{\mathrm{NPVI}}$ sub-rows.]
Fig. 5.11 Predictions of ConvLNPs and ANPs trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ and $\mathcal{L}_{\mathrm{NPVI}}$, showing
interpolation and extrapolation within (grey background) and outside (white background)
the training range. Solid blue lines are samples, dashed blue lines are means,
and the shaded blue area is µ ± 2σ. Purple dash–dot lines are the ground-truth GP
mean and µ ± 2σ. The ConvLNP handles points outside the training range naturally, whereas
this leads to catastrophic failure for the ANP. Note that the ANP with $\mathcal{L}_{\mathrm{NPVI}}$ tends to collapse
to deterministic samples, with all uncertainty explained by the heteroskedastic noise.
In contrast, models trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ show diverse samples that account for much of
the uncertainty.
(a) ConvLNP (b) ANP (c) ConvLNP (d) ANP
Fig. 5.12 Left two plots: predictive samples on zero-shot multi MNIST. Right two
plots: samples and marginal predictives on standard MNIST. We plot the density of
the five marginals that maximize Sarle’s bimodality coefficient (Ellison, 1987). We use
$\hat{\mathcal{L}}_{\mathrm{ML}}$ for training. Blue pixels are not in the context set.
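For reference, Sarle's bimodality coefficient of a set of samples can be computed as below. This sketch uses the population form, (skewness² + 1) / kurtosis with non-excess kurtosis, and omits finite-sample corrections; whether the figure was produced with exactly this variant is not stated here, so it should be read as an assumption. Values above 5/9 (the value for a uniform distribution) are commonly taken to suggest bimodality.

```python
import numpy as np

def bimodality_coefficient(samples):
    """Population form of Sarle's coefficient: (skewness^2 + 1) / kurtosis."""
    centred = samples - samples.mean()
    var = np.mean(centred ** 2)
    skew = np.mean(centred ** 3) / var ** 1.5
    kurt = np.mean(centred ** 4) / var ** 2     # non-excess kurtosis (3 for a Gaussian)
    return (skew ** 2 + 1.0) / kurt

rng = np.random.default_rng(0)
unimodal = rng.standard_normal(20_000)
signs = rng.choice([-1.0, 1.0], size=20_000)
bimodal = 2.0 * signs + 0.5 * rng.standard_normal(20_000)   # two well-separated modes

b_unimodal = bimodality_coefficient(unimodal)
b_bimodal = bimodality_coefficient(bimodal)
print(f"unimodal: {b_unimodal:.2f}, bimodal: {b_bimodal:.2f}")
```

Ranking the pixel marginals by this coefficient and keeping the largest values is one way to select marginals that are likely to be multimodal.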
Table 5.3 Log-likelihood for ConvCNP, ConvLNP, ANP, and MLP-LNP. Each of the
latent variable models was trained on each data set with $\hat{\mathcal{L}}_{\mathrm{ML}}$ (ML) and $\mathcal{L}_{\mathrm{NPVI}}$ (NPVI), separately.
Model     Obj.  EQ              Matérn–5/2      Noisy Mixt.     Weakly Per.     Sawtooth
Interpolation inside training range
GP (full)        5.80 ± 0.02     1.22 ± 6.3e-3   1.00 ± 4.1e-3  -0.06 ± 4.6e-3   N/A
GP (diag)       -0.59 ± 0.01    -0.84 ± 9.0e-3  -0.89 ± 0.01    -1.17 ± 5.2e-3   N/A
ConvCNP         -0.70 ± 0.02    -0.88 ± 0.01    -0.92 ± 0.02    -1.19 ± 7.0e-3   1.15 ± 0.04
ConvLNP   ML    -0.30 ± 0.02    -0.58 ± 0.01    -0.55 ± 0.01    -1.02 ± 6.0e-3   2.30 ± 0.01
ANP       ML    -0.52 ± 0.01    -0.73 ± 0.01    -0.69 ± 0.01    -1.14 ± 6.0e-3   0.09 ± 3.0e-3
MLP-LNP   ML    -0.84 ± 9.0e-3  -0.96 ± 7.0e-3  -0.93 ± 9.0e-3  -1.23 ± 5.0e-3  -0.02 ± 2.0e-3
ConvLNP   NPVI  -0.50 ± 0.02    -0.77 ± 0.01    -0.48 ± 0.02    -1.03 ± 8.0e-3   2.47 ± 8.0e-3
ANP       NPVI  -0.82 ± 0.01    -0.96 ± 0.01    -1.04 ± 0.01    -1.37 ± 6.0e-3   0.20 ± 9.0e-3
MLP-LNP   NPVI  -0.58 ± 9.0e-3  -1.00 ± 9.0e-3  -0.72 ± 0.01    -1.22 ± 5.0e-3  -0.16 ± 2.0e-3
Interpolation beyond training range
GP (full)        5.80 ± 0.02     1.22 ± 6.3e-3   1.00 ± 4.1e-3  -0.06 ± 4.6e-3   N/A
GP (diag)       -0.59 ± 0.01    -0.84 ± 9.0e-3  -0.89 ± 0.01    -1.17 ± 5.2e-3   N/A
ConvCNP         -0.69 ± 0.02    -0.87 ± 0.01    -0.94 ± 0.02    -1.19 ± 7.0e-3   1.11 ± 0.04
ConvLNP   ML    -0.30 ± 0.02    -0.58 ± 0.01    -0.56 ± 0.01    -1.03 ± 6.0e-3   2.29 ± 0.02
ANP       ML    -1.35 ± 6.0e-3  -1.39 ± 7.0e-3  -1.65 ± 5.0e-3  -1.35 ± 4.0e-3  -0.17 ± 1.0e-3
MLP-LNP   ML    -2.70 ± 3.0e-3  -2.60 ± 3.0e-3  -2.82 ± 3.0e-3   -              -0.03 ± 2.0e-3
ConvLNP   NPVI  -0.48 ± 0.02    -0.79 ± 0.01    -0.48 ± 0.02    -1.04 ± 8.0e-3   2.47 ± 8.0e-3
ANP       NPVI  -1.91 ± 0.03    -1.48 ± 4.0e-3  -1.85 ± 7.0e-3  -1.66 ± 0.01    -0.30 ± 4.0e-3
MLP-LNP   NPVI  -13.7 ± 0.82    -3.96 ± 0.04    -3.80 ± 0.02     -              -4.98 ± 0.02
Extrapolation beyond training range
GP (full)        4.29 ± 6.2e-3   0.82 ± 4.3e-3   0.66 ± 2.2e-3  -0.33 ± 3.4e-3   N/A
GP (diag)       -1.40 ± 5.0e-3  -1.41 ± 4.8e-3  -1.72 ± 6.2e-3  -1.40 ± 4.0e-3   N/A
ConvCNP         -1.41 ± 6.0e-3  -1.41 ± 7.0e-3  -1.73 ± 8.0e-3  -1.41 ± 6.0e-3   0.27 ± 0.02
ConvLNP   ML    -1.09 ± 5.0e-3  -1.11 ± 5.0e-3  -1.30 ± 4.0e-3  -1.24 ± 4.0e-3   1.61 ± 0.02
ANP       ML    -1.29 ± 6.0e-3  -1.29 ± 5.0e-3  -1.55 ± 5.0e-3  -1.34 ± 5.0e-3  -0.25 ± 2.0e-3
MLP-LNP   ML    -2.23 ± 4.0e-3  -2.08 ± 3.0e-3  -2.50 ± 4.0e-3  -1.39 ± 4.0e-3  -0.06 ± 2.0e-3
ConvLNP   NPVI  -1.21 ± 0.01    -1.31 ± 0.01    -1.19 ± 0.01    -1.51 ± 8.0e-3   2.10 ± 7.0e-3
ANP       NPVI  -1.44 ± 6.0e-3  -1.45 ± 6.0e-3  -1.77 ± 7.0e-3  -1.46 ± 6.0e-3  -0.20 ± 2.0e-3
MLP-LNP   NPVI  -5.85 ± 0.05    -2.65 ± 3.0e-3  -4.06 ± 0.04    -1.49 ± 5.0e-3  -1.99 ± 6.0e-3
Table 5.4 Test log-likelihood lower bounds for image completion (5 runs).
Model    Objective  MNIST        CelebA32      SVHN         ZSMM
ConvLNP  L̂_ML       2.11 ± 0.01  6.92 ± 0.10   9.89 ± 0.09  4.58 ± 0.04
ConvLNP  L_NPVI     0.99 ± 0.42  −0.27 ± 0.00  0.17 ± 0.00  0.14 ± 0.00
ANP      L̂_ML       1.66 ± 0.03  5.98 ± 0.08   9.18 ± 0.08  −10.8 ± 1.99
ANP      L_NPVI     1.64 ± 0.03  6.04 ± 0.10   8.91 ± 0.06  −6.45 ± 0.99
Table 5.5 Joint predictive log-likelihoods (LL) and RMSEs on ERA5-Land, averaged
over 1000 tasks.
Metric         Model    Central (train)  West (test)  East (test)  South (test)
LL             ConvLNP  4.47 ± 0.07      4.55 ± 0.08  5.07 ± 0.07  4.65 ± 0.08
LL             GP       3.33 ± 0.06      3.65 ± 0.06  4.07 ± 0.06  3.34 ± 0.06
RMSE (×10^−2)  ConvLNP  5.72 ± 0.33      5.77 ± 0.37  3.23 ± 0.22  6.92 ± 0.39
RMSE (×10^−2)  GP       6.26 ± 0.30      5.75 ± 0.29  3.10 ± 0.18  7.94 ± 0.44
of models trained on SVHN (Netzer et al., 2011), MNIST (LeCun et al., 1989), and
32×32 CelebA (Netzer et al., 2011) in a range of scenarios, along with full experimental
details.
5.6.3 Environmental data
We next consider a real-world dataset, ERA5-Land (Copernicus Climate Change
Service, 2020), containing environmental measurements at a ∼9 km spacing across
the globe. We consider predicting daily precipitation y at position x. Environmental
data is not perfectly stationary, as there are changes in climate that reflect geographic
position. Hence this task reflects the model’s ability to handle situations where the
underlying process is only approximately stationary. In general, one approach to handle
situations like these is to provide the model with input variables such that, conditioned
on those variables, the underlying stochastic process is approximately stationary. For
example, ground elevation (known as orography) is an important factor influencing
climate. Since the orography depends on absolute geographical position, the ground
truth stochastic process governing precipitation cannot be strictly stationary. However,
it may be the case that if we also translate the orography data along with the input
positions, then stationarity is approximately restored. Hence we provide the ConvLNP
with orography data, and also temperature values, as inputs along with precipitation.
We choose a large region of central Europe as our train set, and use regions east,
west and south as held-out test sets. For such tasks, models must be able to make
predictions at locations spanning a range different from the training set, inhibiting the
deployment of NPs not equipped with translation equivariance. To sample a task at
train time, we sample a random date between 1981 and 2020, then sample a sub-region
within the train region, which is split into context and target sets. In this section, we
train using LML. See Appendix F.3 for details.
(a) Ground truth data (b) ConvLNP sample 1 (c) ConvLNP sample 2 (d) ConvLNP sample 3
(e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. 5.13 Predictive samples overlaid on central Europe. Darker colours show higher
precipitation. In (e), coloured pixels represent context points. GP samples often take
negative values (lighter than ground truth data, see Section F.3.2 for a discussion),
whereas the NP has learned to produce non-negative samples which capture the sparsity
of precipitation. The model is trained on subregions roughly the size of the lengthscale
of the precipitation process. More samples in Section F.3.6.
[Fig. 5.14 plot: four panels, Central (train), West (test), East (test) and South (test). x-axis: number of points queried (0 to 50); y-axis: average regret. Curves: GP UCB, GP TS, NP UCB, NP TS, Random.]
Fig. 5.14 Average regret plotted against number of points queried for Bayesian op-
timisation for the precipitation value on a given day in different regions of Europe,
averaged over 5000 tasks.
Prediction We first evaluate the ConvLNP’s predictive performance, comparing
to a GP trained individually on each task as a baseline. In about 10% of tasks, the
GP obtains an especially poor likelihood (< 0 nats); we remove these outliers from
the evaluation. The results are shown in Table 5.5. The ConvLNP and GP have
comparable RMSEs except on the south dataset, where the ConvLNP outperforms the
GP. However, the ConvLNP consistently outperforms the GP in log-likelihood, which
is expected because (i) the GP does not share information between tasks and hence is
prone to overfitting on small context sets, resulting in overconfident predictions; and (ii)
the ConvLNP can learn non-Gaussian predictive densities (illustrated in Section F.3.6).
Figure 5.13 shows samples from the predictive process of a ConvLNP and GP, over the
whole of the train region. This demonstrates spatial extrapolation, as the ConvLNP is
trained only on random subregions.
Bayesian optimisation We demonstrate the ConvLNP in a downstream task by
considering a toy Bayesian optimisation problem, where the goal is to identify the
location with heaviest rainfall on a given day. We also test the ConvLNP’s spatial
generalization, by optimising over larger regions (for central, west, and south) than
the model was trained on. We test both Thompson sampling (TS) (Thompson, 1933)
and upper confidence bounds (UCB) (Auer, 2002) as methods for acquiring points.
Note that TS requires coherent samples. The results are shown in Figure 5.14. On all
datasets, ConvLNP TS and UCB significantly outperform the random baseline by the
50th iteration; the GP does not reliably outperform random. We hypothesise this is
due to its overconfidence, in line with the results on prediction.
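The two acquisition rules can be illustrated with a minimal sketch. It assumes the model supplies, on a fixed grid of candidate locations, either marginal means and standard deviations (for UCB) or one coherent joint sample (for TS). The function names, the exploration weight `beta` and the simple-regret definition are illustrative assumptions, not the settings used in the experiments above.

```python
import numpy as np

def ucb_choice(mean, std, queried, beta=2.0):
    """Index of the unqueried location with the largest mean + beta * std."""
    score = np.where(queried, -np.inf, mean + beta * std)
    return int(np.argmax(score))

def thompson_choice(joint_sample, queried):
    """Index of the unqueried location that maximises one joint sample."""
    score = np.where(queried, -np.inf, joint_sample)
    return int(np.argmax(score))

def simple_regret(truth, queried):
    """Gap between the true maximum and the best value queried so far."""
    return float(truth.max() - truth[queried].max())

# Toy numbers for four candidate locations (not taken from the experiments).
mean = np.array([0.1, 0.5, 0.3, 0.2])
std = np.array([0.05, 0.05, 0.4, 0.1])
joint_sample = np.array([0.0, 0.9, 0.1, 0.4])  # one draw of the whole function
truth = np.array([0.0, 0.2, 1.0, 0.4])
queried = np.array([False, True, False, False])

next_ucb = ucb_choice(mean, std, queried)
next_ts = thompson_choice(joint_sample, queried)
regret = simple_regret(truth, queried)
```

UCB needs only marginal statistics, whereas TS maximises a single draw of the whole function. This is why TS requires coherent samples: independent draws from factorised marginals would not form a plausible function to maximise.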
5.7 Summary and conclusions
In this chapter we presented the convolutional conditional neural process and the
convolutional latent neural process. Both models take advantage of the ConvDeepSets
representation theorem (Theorem 6) to flexibly parameterise a permutation invari-
ant and translation equivariant map from observed datasets to predictive stochastic
processes. In Section 5.4 we showed that the ConvCNP outperforms the attentive
conditional neural process on both synthetic 1D regression and image tasks. However,
like all CNPs, it makes factorised predictions for every point in the target set. In
Section 5.5 we remedied this by introducing a latent variable to define the ConvLNP,
which uses a ConvCNP to define a distribution over a latent function which is then
passed through another CNN to introduce dependencies in the predictive distribution.
In Section 5.6 we showed that the ConvLNP is able to address the shortcomings of
the ConvCNP, both by making non-factorised predictions (thereby allowing it to out-
perform the ConvCNP in terms of log-likelihood), and also by enabling non-Gaussian
and sometimes multimodal marginal predictive distributions. Together, the ConvCNP
and ConvLNP represent practical and highly performant models for stochastic process
prediction whenever translation equivariance is an appropriate inductive bias.

Chapter 6
Conclusions and discussion
6.1 Summary of contributions
We now summarise our main contributions in this thesis. The first half of the thesis
focused on understanding the consequences of approximate inference in Bayesian neural
networks, and the second half focused on convolutional neural processes. We describe
each of these contributions in turn.
6.1.1 Approximate inference in Bayesian neural networks
In Chapter 2 we introduced and motivated Bayesian neural networks and described
the need for reliable approximate inference as a pressing research problem. This led
to the first major contribution of the thesis, which was a theoretical and empirical
study of two of the most common approximate inference methods for BNNs: mean-field
variational inference and Monte Carlo dropout. The results of these investigations were
presented in Chapter 3.
On the theoretical side, our main contribution was to prove theorems showing
that, for single-hidden layer ReLU BNNs with either MFVI or MCDO approximate
posteriors, there are simple situations where no setting of the variational parameters
can represent increased uncertainty in between regions of low uncertainty. This is in
contrast to the exact posterior predictive, which shows increased in-between uncertainty
when appropriate. We next considered the theoretical expressiveness of BNNs with
more than one hidden layer. We proved that given sufficient width, they are able to
represent any predictive mean and variance function. This provides a kind of stochastic
analogue to the classical universal approximation theorem for neural networks.
Our universal approximation result for deep MFVI and MCDO networks naturally
leads to the question of whether, when training networks using the ELBO objective,
variational parameters will be found that lead to predictive distributions which resemble
the true predictive. This is the main question addressed by the empirical studies we
perform in Chapter 3. By studying toy examples and comparing them to reference
predictive distributions such as HMC and the infinite-width GP, we show that even for
deep BNNs, in-between uncertainty is not reliably represented even though in theory
there exist variational parameters that can represent it. We finally conclude Chapter 3
with a case study showing how a lack of in-between uncertainty can be deleterious for
active learning.
6.1.2 Convolutional neural processes
In the second half of the thesis, we propose the convolutional neural process, a new
member of the neural process family that incorporates translation equivariance into
its predictions. We begin in Chapter 4 by providing an overview of various existing
members of the neural process family. We view neural processes as performing stochastic
process prediction via meta-learning. We describe the encoder-decoder architectural
framework that underlies the design of many different NPs, and also document the
training objectives used to train both conditional NPs and latent NPs. For latent
NPs, we introduce a new approximate maximum-likelihood objective that sidesteps the
complexities of variational inference in favour of directly forming a (biased) estimate
of the likelihood.
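As a rough illustration of what such a likelihood estimate can look like, the sketch below assumes a Gaussian decoder and takes the logarithm of a Monte Carlo average over latent samples. The array shapes and function names are assumptions made for this example; the objective itself is defined in Section 4.4.3. Under these assumptions the estimate is biased because the logarithm is applied to a finite-sample average.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = np.max(a)
    return float(m + np.log(np.mean(np.exp(a - m))))

def gaussian_log_pdf(y, mean, std):
    """Elementwise log density of y under N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((y - mean) / std) ** 2

def ml_estimate(y_target, means, stds):
    """Log of the average, over latent samples, of the joint target likelihood.

    means, stds: arrays of shape (num_latent_samples, num_targets), standing in
    for decoder outputs, one row per latent sample.
    """
    per_sample_ll = gaussian_log_pdf(y_target, means, stds).sum(axis=1)
    return log_mean_exp(per_sample_ll)

# Random placeholders standing in for decoder outputs at three target points.
rng = np.random.default_rng(0)
y_target = np.array([0.2, -0.1, 0.4])
means = rng.normal(size=(8, 3))
stds = np.full((8, 3), 0.5)
estimate = ml_estimate(y_target, means, stds)
```

With a single latent sample the estimate reduces to an ordinary Gaussian log-likelihood; with more samples it lies between the smallest and largest per-sample log-likelihoods.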
Finally, in Chapter 5 we introduce our proposed model, the convolutional neural
process. To motivate the model, we prove that stationary stochastic processes imply
translation equivariant prediction maps, and extend the original deep sets representation
theorem to also incorporate translation equivariance. This convolutional deep sets
theorem then directly informs the implementation of our convolutional neural process.
We present two versions of the model. The convolutional conditional neural process
(ConvCNP) is simpler, and only outputs a predictive mean and variance function; hence
it cannot model dependencies in the predictive distribution. We also introduce the
convolutional latent neural process (ConvLNP). The ConvLNP uses a latent function
to allow it to model dependencies and also provide non-Gaussian marginal predictions.
For both models, we provide extensive experiments on both synthetic 1D regression
tasks and also 2D image regression. We show that the ConvCNP and ConvLNP
outperform the attentive CNP and attentive LNP, the previous best performing neural
processes. Furthermore, we show that the models can leverage translation equivariance
to solve challenging tasks such as zero-shot multi-MNIST, where the model has to
generalise from seeing only single, centered MNIST digits at train time, to seeing
multiple non-centered MNIST digits at test time.
6.2 BNNs and NPs compared
Having described the two main focuses of this thesis, it is natural to compare and
contrast BNNs and NPs. We now consider their similarities and differences from various
angles.
Priors and meta-learning An area where NPs differ significantly from BNNs is
in prior selection. For BNNs, choosing the prior is a crucial part of specifying the
model. Choices such as whether to use heavy-tailed or correlated priors can have a
significant impact on downstream performance (Fortuin et al., 2021). In contrast, for
NPs no prior needs to be chosen. Instead, the required inductive biases to succeed on
a new task (apart from high-level inductive biases such as convolution and attention)
are learned directly from previous tasks in the episodic meta-learning setting. This
provides a more data-driven approach that relieves practitioners of the burden of prior
design. The price that has to be paid for this is the need for a meta-dataset. Often
this is not available — we may only have one dataset of interest and wish to make
predictions based on it. In such cases BNNs can still be used, whereas NPs are not
applicable. On the other hand, if a meta-dataset is available, it may be possible to
use the meta-dataset to meta-learn a prior for the BNN, thus removing some of the
burden of prior selection, as proposed in Rothfuss et al. (2020), although this is not
yet common practice in the BNN literature.
Regression with uncertainty In this thesis, BNNs and NPs were both applied to
the task of performing regression with uncertainty estimates. In this sense, both meth-
ods may be viewed as neural network-based alternatives to more classical uncertainty-
aware regression approaches such as GP regression. Although BNNs and NPs can be
applied to similar tasks, the way in which uncertainty is represented in each of these
models is quite different. For BNNs, epistemic uncertainty is encoded in the weights
of the network. This is then propagated to the predictive distribution by sampling
many instances of the weights from the posterior. In contrast, in NPs, there is no
uncertainty represented in the weights (although uncertainty may be represented in
the latent variable for latent neural processes). Rather, the decoder of the NP directly
outputs the parameters of a Gaussian distribution over the regression target value. The
contrast between the two approaches is clear when we consider that for the conditional
neural process, there is no direct way to separate the uncertainty in the predictive
distribution between epistemic and aleatoric uncertainty — the CNP simply outputs a
predictive variance which incorporates both kinds of uncertainty simultaneously.
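The contrast can be made concrete with the standard law-of-total-variance split, sketched below for a BNN with a Gaussian likelihood. The sketch assumes each posterior weight sample yields a predictive mean and a noise variance at every test input; the function name and array layout are illustrative. No such split is available from a CNP, which returns only one variance per input.

```python
import numpy as np

def decompose_variance(means, noise_vars):
    """Split predictive variance using sampled weights.

    means, noise_vars: arrays of shape (num_weight_samples, num_inputs).
    Aleatoric part: average noise variance across weight samples.
    Epistemic part: variance of the predictive mean across weight samples.
    """
    aleatoric = noise_vars.mean(axis=0)
    epistemic = means.var(axis=0)
    return aleatoric, epistemic, aleatoric + epistemic

# Two weight samples, two test inputs (toy numbers). The samples disagree at
# the first input and agree at the second.
means = np.array([[0.0, 1.0], [2.0, 1.0]])
noise_vars = np.full((2, 2), 0.5)
aleatoric, epistemic, total = decompose_variance(means, noise_vars)
```

In this toy case the epistemic term is non-zero only at the first input, where the sampled networks disagree.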
Approximate inference Approximate inference is used in both BNNs and NPs, but
in very different ways. In BNNs, approximate inference is needed because we specify
a prior over the neural network parameters, which is then paired with a complicated
non-linear likelihood. To obtain predictions from the BNN, some approximation of
integrals over the posterior distribution must be made. In contrast, for conditional
neural processes, there is no approximate inference required — the model outputs
the predictive mean and variance using a single deterministic forward pass. It is
only for latent neural processes that approximate inference plays a role. When LNPs
were first introduced (Garnelo et al., 2018b), the objective proposed was an ELBO
which treated the latent variables as quantities to infer. This led to the neural process
variational inference objective described in Section 4.4.2. However, even this is not
necessary to train a working LNP. In Section 4.4.3 we introduced the approximate
maximum-likelihood objective for LNPs, which does away with the approximate inference
interpretation for the latent variables entirely, instead viewing the latent variables
simply as a device for introducing correlations in the NP predictive. Hence approximate
inference is not crucial for training either CNPs or LNPs, in the way that it is for
BNNs.
Practical recommendations We conclude with some brief recommendations for
practitioners who are deciding between using either a BNN or an NP for their problem.
One of our main takeaways from Chapter 3 is that BNN approximate inference is an
active research topic that is not yet well understood. Even in relatively simple situa-
tions, previously unknown pathologies can sometimes cripple performance. Combined
with the difficult problem of prior selection, we recommend using MFVI or MCDO
approximate inference in BNNs with caution. Although BNNs can in some cases pro-
duce better uncertainty estimates than vanilla deterministic neural networks, this may
not necessarily be due to a principled application of Bayesian inference. Furthermore,
approximate inference techniques like MFVI often significantly complicate the training
procedure (although MCDO is an exception as it is relatively straightforward to apply).
In summary, we recommend the use of BNNs when:
1. Epistemic uncertainty estimation is very important for the task at hand.
2. The model is applied to a large dataset (too large for exact GP regression to be
applicable).
3. There is only a single dataset available (so that episodic meta-learning cannot be
applied).
4. The model is being used with a non-Gaussian likelihood, so that exact GP
regression cannot be applied.
NPs can often be much easier to deploy than BNNs due to their avoidance of complicated
approximate inference techniques. However, their reliance on meta-learning means
they can only be applied in specific situations. We recommend the use of NPs when:
1. Uncertainty estimation is important. If there is a need to separate epistemic
and aleatoric uncertainty, a latent neural process can be used. Otherwise, both
conditional and latent neural processes can be used.
2. The model is to be applied to many small datasets.
3. Prior design is difficult, so that, e.g., an appropriate kernel for simple GP
regression on the task cannot be chosen straightforwardly.
4. Furthermore, if stationarity (or approximate stationarity) is a feature of the
underlying stochastic process, and the inputs are one or two-dimensional, we
recommend the use of convolutional neural processes.
6.3 Continued work and future research directions
We now briefly discuss future research directions for both BNNs and NPs in light of the
work in this thesis, along with follow-up work that has appeared since that work was
published.
6.3.1 Approximate inference in Bayesian neural networks
Our research in Chapter 3 suggests various next steps for BNN research. The first is
the development of more flexible yet still scalable approximate posterior distributions.
Ideally, these should be such that the assumptions of Theorems 1 and 2 are violated,
so that there is no theoretical restriction on the posteriors representing in-between
uncertainty, even in the single hidden layer case. One promising example of such
an expressive posterior is the recently introduced global inducing variational posterior
(Ober and Aitchison, 2021), which is fully correlated across all layers and non-Gaussian.
As mentioned earlier, our theoretical and empirical results for deep BNNs, taken
together, suggest that at least in function space, existing posteriors such as mean-field
Gaussian and MC dropout are already flexible enough to approximate the posterior
predictive distribution well — but the right member of the variational family is not
being selected when optimising the ELBO. We conjecture that this is due to the KL-
optimal posterior in weight space being far from the optimal posterior in function space.
This suggests that one fruitful avenue of research is to change the objective function
to reflect function space approximations of the predictive, rather than changing the
variational family. This approach has been tried in works such as Rudner et al. (2021)
and Sun et al. (2019).
Finally, we believe that the most important practical future work to be undertaken
is in the development of diagnostic methods and benchmarks for approximate inference
in BNNs. Assessing inference quality in these complex models is a difficult research
problem of its own. In this thesis we have focused on a single, easily identifiable
property of the posterior predictive: in-between uncertainty. However, there may be
many other qualitative features of the exact predictive distribution that are more or
less faithfully represented by the approximate predictive. It would be of great use to
practitioners to systematically document a range of these properties for a variety of
approximate inference methods. Practitioners could then assess which approximation
is most appropriate for them based on which of these properties is most relevant for
the task at hand.
6.3.2 Convolutional neural processes
Since their introduction, convolutional neural processes have been improved and
generalised in various directions. One line of research considers extending ConvNPs
beyond translation equivariance to consider more general group equivariances, e.g.,
the group of rotations of a sphere (Holderrieth et al., 2021; Kawano et al., 2020). In
particular, Holderrieth et al. (2021) use group equivariant neural processes to model
stochastic fields, which are stochastic processes that may be vector-valued. We
note that their proposed SteerCNP uses a similar encoder to the ConvNP, involving
embedding the context set into a function which is then discretised on a regular grid.
Both Holderrieth et al. (2021) and Kawano et al. (2020) consider only conditional neural
processes and hence can only provide factorised predictions. Addressing this by creating
a latent variable version of the model is a natural direction for future work.
ConvNPs have also been developed in another direction which focuses on providing
correlated predictions with exactly tractable likelihoods without the need for a latent
variable. This leads to the Gaussian neural process (GNP) (Bruinsma et al., 2020;
Markou et al., 2022). GNPs work by directly parameterising both the mean and
covariance function of the predictive distribution with neural networks. In contrast
to CNPs, which can be viewed as outputting predictive Gaussian processes that have
kernel functions which lead to diagonal covariance matrices, GNPs can output predictive
distributions with fully-correlated covariance matrices. This allows coherent samples
to be drawn from GNP predictive distributions. Furthermore, since the predictive
distribution is always Gaussian, the likelihood can be computed exactly. It is important
to note that although the predictive distributions are GPs conditioned on a fixed
context set, the GNP does not necessarily correspond to inference using any GP prior.
In this sense it is a more flexible model than GP inference with any fixed kernel.
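The effect of a fully correlated covariance on the likelihood can be seen in a small sketch. Here a fixed mean vector and covariance matrix stand in for the outputs a GNP's networks would produce at two target inputs; the function names and numbers are illustrative assumptions. The factorised likelihood uses only the diagonal of the covariance, as a CNP would.

```python
import numpy as np

def joint_gaussian_ll(y, mean, cov):
    """Exact log-likelihood of y under N(mean, cov), via a Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    whitened = np.linalg.solve(chol, y - mean)
    log_det = 2.0 * np.log(np.diag(chol)).sum()
    n = y.shape[0]
    return float(-0.5 * (whitened @ whitened + log_det + n * np.log(2.0 * np.pi)))

def factorised_gaussian_ll(y, mean, cov):
    """Log-likelihood that keeps only the marginal variances (diagonal of cov)."""
    var = np.diag(cov)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var))

# Two strongly correlated targets that are observed to move together.
mean = np.zeros(2)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
y = np.array([1.0, 1.0])

ll_joint = joint_gaussian_ll(y, mean, cov)
ll_diag = factorised_gaussian_ll(y, mean, cov)

# A coherent sample: both coordinates are drawn jointly from the full covariance.
sample = np.random.default_rng(0).multivariate_normal(mean, cov)
```

For these toy numbers the joint likelihood is higher than the factorised one, because the observed targets agree with the modelled correlation. Both quantities are computed in closed form, with no latent variable to integrate over.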
One disadvantage of the GNP compared to latent neural processes is that the GNP
cannot model non-Gaussian predictive distributions. However, it appears that this
disadvantage is outweighed by the tractable likelihood of the GNP enabling exact
computation of the maximum likelihood objective. Markou et al. (2022) show that a
convolutional GNP can outperform the convolutional latent neural process on a variety
of tasks.

References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghe-
mawat, S., Irving, G., Isard, M., et al. (2016). Tensorflow: A system for large-scale
machine learning. In 12th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 16), pages 265–283.
Abramowitz, M. and Stegun, I. A. (1965). Handbook of mathematical functions: with
formulas, graphs, and mathematical tables, volume 55. Courier Corporation.
Aitchison, L. (2020). A statistical theory of cold posteriors in deep neural networks. In
International Conference on Learning Representations.
Alquier, P. and Ridgway, J. (2020). Concentration of tempered posteriors and of their
variational approximations. The Annals of Statistics, 48(3):1475–1497.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American
mathematical society, 68(3):337–404.
Ashukha, A., Lyzhov, A., Molchanov, D., and Vetrov, D. (2019). Pitfalls of in-domain
uncertainty estimation and ensembling in deep learning. In International Conference
on Learning Representations.
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs.
Journal of Machine Learning Research, 3(Nov):397–422.
Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. NIPS 2016
Deep Learning Symposium.
Bahdanau, D., Cho, K. H., and Bengio, Y. (2015). Neural machine translation by
jointly learning to align and translate. In 3rd International Conference on Learning
Representations, ICLR 2015.
Barber, D. and Bishop, C. M. (1998). Ensemble learning in Bayesian neural networks.
Nato ASI Series F Computer and Systems Sciences, 168:215–238.
Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD
thesis, University College London.
Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T.,
Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. (2018). Pyro: Deep universal
probabilistic programming. Journal of Machine Learning Research (JMLR).
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review
for statisticians. Journal of the American Statistical Association, 112(518):859–877.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight
uncertainty in neural networks. In Proceedings of the 32nd International Conference
on Machine Learning (ICML).
Bruinsma, W., Requeima, J., Foong, A. Y., Gordon, J., and Turner, R. E. (2020). The
Gaussian neural process. In Third Symposium on Advances in Approximate Bayesian
Inference.
Bui, T. D. (2021). Biases in variational Bayesian neural networks. In Bayesian Deep
Learning Workshop, 35th Conference on Neural Information Processing Systems.
Buntine, W. L. and Weigend, A. S. (1991). Bayesian back-propagation. Complex
systems, 5(6):603–643.
Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoen-
coders. In International Conference on Learning Representations.
Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte
Carlo. In International Conference on Machine Learning, pages 1683–1691.
Chérief-Abdellatif, B.-E. (2020). Convergence rates of variational inference in sparse
deep learning. In International Conference on Machine Learning, pages 1831–1842.
PMLR.
Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1251–1258.
Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement
learning in a handful of trials using probabilistic dynamics models. In Advances in
Neural Information Processing Systems, pages 4754–4765.
Cohen, T. and Welling, M. (2016). Group equivariant convolutional networks. In
Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International
Conference on Machine Learning, volume 48 of Proceedings of Machine Learning
Research, pages 2990–2999, New York, New York, USA. PMLR.
Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1995). Active learning with statistical
models. In Advances in Neural Information Processing Systems, pages 705–712.
Coker, B., Bruinsma, W. P., Burt, D. R., Pan, W., and Doshi-Velez, F. (2022). Wide
mean-field Bayesian neural networks ignore the data. In International Conference
on Artificial Intelligence and Statistics, pages 5276–5333. PMLR.
Copernicus Climate Change Service (2020). Copernicus Climate Change Service (C3S)
(2019): C3S ERA5-Land reanalysis. (accessed: 15.05.2020).
Coraddu, A., Oneto, L., Ghio, A., Savio, S., Anguita, D., and Figari, M. (2014).
Machine learning approaches for improving condition-based maintenance of naval
propulsion plants. Journal of Engineering for the Maritime Environment.
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American
journal of physics, 14(1):1–13.
Cressie, N. (1990). The origins of kriging. Mathematical geology, 22(3):239–252.
Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig,
P. (2021). Laplace redux-effortless Bayesian deep learning. Advances in Neural
Information Processing Systems, 34.
DeGeneres, E. (2014). If only Bradley’s arm was longer. Best photo ever. Oscars
pic.twitter.com/c9u5notgap.
Deisenroth, M. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient
approach to policy search. In Proceedings of the 28th International Conference on
Machine Learning (ICML-11), pages 465–472.
Delhomme, J. P. (1978). Kriging in the hydrosciences. Advances in water resources,
1:251–266.
Denker, J. S. and LeCun, Y. (1991). Transforming neural-net output levels to probability
distributions. In Advances in Neural Information Processing Systems (NIPS).
Der Kiureghian, A. and Ditlevsen, O. (2009). Aleatory or epistemic? does it matter?
Structural Safety, 31(2):105–112.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is
worth 16x16 words: Transformers for image recognition at scale. In International
Conference on Learning Representations.
Dubois, Y., Gordon, J., and Foong, A. Y. K. (2020). Neural process family.
https://yanndubs.github.io/Neural-Process-Family.
Ellison, A. M. (1987). Effect of seed dimorphism on the density-dependent dynamics
of experimental populations of atriplex triangularis (chenopodiaceae). American
Journal of Botany, 74(8):1280–1288.
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun,
S. (2017). Dermatologist-level classification of skin cancer with deep neural networks.
Nature, 542(7639):115.
Farquhar, S., Smith, L., and Gal, Y. (2020). Liberty or depth: Deep Bayesian neural
nets do not need complex weight posterior approximations. Advances in Neural
Information Processing Systems, 33:4346–4357.
Filos, A., Farquhar, S., Gomez, A. N., Rudner, T. G. J., Kenton, Z., Smith, L., Alizadeh,
M., de Kroon, A., and Gal, Y. (2019). Benchmarking Bayesian deep learning with
diabetic retinopathy diagnosis. https://github.com/OATML/bdl-benchmarks.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast
adaptation of deep networks. In Precup, D. and Teh, Y. W., editors, Proceedings of
the 34th International Conference on Machine Learning, volume 70 of Proceedings
of Machine Learning Research, pages 1126–1135, International Convention Centre,
Sydney, Australia. PMLR.
Flam-Shepherd, D., Requeima, J., and Duvenaud, D. (2017). Mapping Gaussian process
priors to Bayesian neural networks. In NIPS Bayesian deep learning workshop,
volume 3.
Foong, A., Bruinsma, W., Gordon, J., Dubois, Y., Requeima, J., and Turner, R.
(2020a). Meta-learning stationary stochastic process prediction with convolutional
neural processes. Advances in Neural Information Processing Systems, 33:8284–8295.
Foong, A., Burt, D., Li, Y., and Turner, R. (2020b). On the expressiveness of
approximate inference in Bayesian neural networks. Advances in Neural Information
Processing Systems, 33:15897–15908.
Fort, S., Ren, J., and Lakshminarayanan, B. (2021). Exploring the limits of out-
of-distribution detection. Advances in Neural Information Processing Systems,
34:7068–7081.
Fortuin, V. (2022). Priors in Bayesian deep learning: A review. International Statistical
Review.
Fortuin, V., Garriga-Alonso, A., Ober, S. W., Wenzel, F., Ratsch, G., Turner, R. E.,
van der Wilk, M., and Aitchison, L. (2021). Bayesian neural network priors revisited.
In International Conference on Learning Representations.
Frey, B. J. and Hinton, G. E. (1999). Variational learning in nonlinear Gaussian belief
networks. Neural Computation, 11(1):193–213.
Frostig, R., Johnson, M. J., and Leary, C. (2018). Compiling machine learning programs
via high-level tracing. Systems for Machine Learning, pages 23–24.
Fukushima, K. and Miyake, S. (1982). Neocognitron: A self-organizing neural network
model for a mechanism of visual pattern recognition. In Competition and cooperation
in neural nets, pages 267–285. Springer.
Gal, Y. (2016). Uncertainty in deep learning. PhD thesis, University of Cambridge.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation:
Representing model uncertainty in deep learning. In Proceedings of The 33rd International
Conference on Machine Learning (ICML).
Gal, Y., Hron, J., and Kendall, A. (2017a). Concrete dropout. In Advances in Neural
Information Processing Systems, pages 3581–3590.
Gal, Y., Islam, R., and Ghahramani, Z. (2017b). Deep Bayesian active learning with
image data. In Proceedings of the 34th International Conference on Machine Learning
(ICML).
Gal, Y., McAllister, R., and Rasmussen, C. E. (2016). Improving PILCO with Bayesian
neural network dynamics models. In Data-Efficient Machine Learning workshop,
ICML, volume 4.
Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M.,
Teh, Y. W., Rezende, D., and Eslami, S. A. (2018a). Conditional neural processes.
In International Conference on Machine Learning, pages 1704–1713. PMLR.
Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S.,
and Teh, Y. W. (2018b). Neural processes. ICML 2018 workshop on Theoretical
Foundations and Applications of Deep Generative Models.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple
sequences. Statistical science, pages 457–472.
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence.
Nature, 521(7553):452.
Good, I. J. (1983). Good thinking: The foundations of probability and its applications.
U of Minnesota Press.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.
Gordon, J. (2021). Advances in Probabilistic Meta-Learning and the Neural Process
Family. PhD thesis, University of Cambridge.
Gordon, J., Bruinsma, W. P., Foong, A. Y. K., Requeima, J., Dubois, Y., and Turner,
R. E. (2020). Convolutional conditional neural processes.
Graves, A. (2011). Practical variational inference for neural networks. In Advances in
Neural Information Processing Systems (NIPS) 24.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Heek, J. and Kalchbrenner, N. (2019). Bayesian inference for large scale image
classification. arXiv preprint arXiv:1908.03491.
Hernández-Lobato, J. M. and Adams, R. (2015). Probabilistic backpropagation for
scalable learning of Bayesian neural networks. In Proceedings of the 32nd International
Conference on Machine Learning (ICML).
Hernández-Lobato, J. M., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., and
Turner, R. E. (2016). Black-box alpha divergence minimization. In Proceedings of
The 33rd International Conference on Machine Learning (ICML).
Hinton, G. and Van Camp, D. (1993). Keeping neural networks simple by minimizing
the description length of the weights. In Proc. of the 6th Ann. ACM Conf. on
Computational Learning Theory. Citeseer.
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: adaptively setting
path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research,
15(1):1593–1623.
Holderrieth, P., Hutchinson, M. J., and Teh, Y. W. (2021). Equivariant learning of
stochastic fields: Gaussian processes and steerable conditional neural processes. In
International Conference on Machine Learning, pages 4297–4307. PMLR.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks.
Neural networks, 4(2):251–257.
Hron, J., Bahri, Y., Novak, R., Pennington, J., and Sohl-Dickstein, J. (2020). Exact
posterior distributions of wide Bayesian neural networks. In Uncertainty in deep
learning Workshop, ICML.
Hron, J., Matthews, A. G. d. G., and Ghahramani, Z. (2018). Variational Bayesian
dropout: pitfalls and fixes. In Proceedings of the 35th International Conference on
Machine Learning (ICML).
Huszár, F. (2017). Variational inference using implicit distributions. arXiv preprint
arXiv:1702.08235.
Immer, A., Bauer, M., Fortuin, V., Rätsch, G., and Khan, M. E. (2021a). Scalable
marginal likelihood estimation for model selection in deep learning. In International
Conference on Machine Learning, pages 4563–4573. PMLR.
Immer, A., Korzepa, M., and Bauer, M. (2021b). Improving predictions of Bayesian
neural nets via local linearization. In International Conference on Artificial Intelligence
and Statistics, pages 703–711. PMLR.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In International conference on machine
learning, pages 448–456. PMLR.
Islam, M. A., Jia, S., and Bruce, N. D. (2019). How much position information do
convolutional neural networks encode? In International Conference on Learning
Representations.
Izmailov, P., Vikram, S., Hoffman, M. D., and Wilson, A. G. (2021). What are
Bayesian neural network posteriors really like? In International Conference on
Machine Learning, pages 4629–4640. PMLR.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction
to variational methods for graphical models. Machine Learning, 37(2):183–233.
Kawano, M., Kumagai, W., Sannai, A., Iwasawa, Y., and Matsuo, Y. (2020). Group
equivariant conditional neural processes. In International Conference on Learning
Representations.
Kendall, A. and Gal, Y. (2017). What uncertainties do we need in Bayesian deep
learning for computer vision? In Advances in Neural Information Processing Systems,
pages 5574–5584.
Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A.
(2018). Fast and scalable Bayesian deep learning by weight-perturbation in Adam.
Proceedings of The 35th International Conference on Machine Learning (ICML).
Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O.,
and Teh, Y. W. (2018). Attentive neural processes. In International Conference on
Learning Representations.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. In
International Conference on Learning Representations.
Kingma, D. P., Salimans, T., and Welling, M. (2015). Variational dropout and the local
reparameterization trick. In Advances in Neural Information Processing Systems,
pages 2575–2583.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. In
International Conference on Learning Representations.
Knutsson, H. and Westin, C.-F. (1993). Normalized and differential convolution. In
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages
515–523. IEEE.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with
deep convolutional neural networks. In Advances in Neural Information Processing
Systems, pages 1097–1105.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable
predictive uncertainty estimation using deep ensembles. Advances in neural information
processing systems, 30.
Le, T. A., Kim, H., Garnelo, M., Rosenbaum, D., Schwarz, J., and Teh, Y. W. (2018).
Empirical evaluation of neural process objectives. In NeurIPS workshop on Bayesian
Deep Learning.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and
Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition.
Neural computation, 1(4):541–551.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y.
(2018). Deep neural networks as Gaussian processes. In International Conference on
Learning Representations (ICLR).
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward
networks with a nonpolynomial activation function can approximate any function.
Neural Networks, 6(6):861–867.
Li, Y., Hernández-Lobato, J. M., and Turner, R. E. (2015). Stochastic expectation
propagation. In Advances in Neural Information Processing Systems, pages
2323–2331.
Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. In Advances
in Neural Information Processing Systems, pages 1073–1081.
Liu, Z., Luo, P., Wang, X., and Tang, X. (2018). Large-scale CelebFaces attributes
(CelebA) dataset. Retrieved August, 15:2018.
Louizos, C. and Welling, M. (2016). Structured and efficient variational deep learning
with matrix Gaussian posteriors. In International Conference on Machine Learning,
pages 1708–1716.
Louizos, C. and Welling, M. (2017). Multiplicative normalizing flows for variational
Bayesian neural networks. In Proceedings of the 34th International Conference on
Machine Learning (ICML).
Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient
MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.
MacKay, D. J. (2003). Information theory, inference and learning algorithms.
Cambridge university press.
MacKay, D. J. C. (1992a). Information-based objective functions for active data
selection. Neural computation, 4(4):590–604.
MacKay, D. J. C. (1992b). A practical Bayesian framework for backpropagation
networks. Neural computation, 4(3):448–472.
Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2019).
A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural
Information Processing Systems, 32.
Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent
as approximate Bayesian inference. The Journal of Machine Learning Research,
18(1):4873–4907.
Manita, O. A., Peletier, M. A., Portegies, J. W., Sanders, J., and Senen-Cerda, A.
(2022). Universal approximation in dropout neural networks. Journal of Machine
Learning Research, 23(19):1–46.
Markou, S., Requeima, J., Bruinsma, W. P., Vaughan, A., and Turner, R. E. (2022).
Practical conditional neural processes via tractable dependent predictions.
Martens, J. (2020). New insights and perspectives on the natural gradient method.
The Journal of Machine Learning Research, 21(1):5776–5851.
Matthews, A. G. d. G., Hensman, J., Turner, R., and Ghahramani, Z. (2016). On
sparse variational methods and the Kullback-Leibler divergence between stochastic
processes. In Artificial Intelligence and Statistics, pages 231–239. PMLR.
Matthews, A. G. d. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z.
(2018). Gaussian process behaviour in wide deep neural networks. In International
Conference on Learning Representations.
Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A.,
León-Villagrá, P., Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian
process library using TensorFlow. Journal of Machine Learning Research (JMLR),
18(40):1–6.
Mescheder, L., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes:
Unifying variational autoencoders and generative adversarial networks. In International
conference on machine learning, pages 2391–2400. PMLR.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E.
(1953). Equation of state calculations by fast computing machines. The journal of
chemical physics, 21(6):1087–1092.
Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In
Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence,
pages 362–369. Morgan Kaufmann Publishers Inc.
Mobiny, A., Singh, A., and Nguyen, H. V. (2019). Risk-aware machine learning classifier
for skin lesion diagnosis. Journal of Clinical Medicine, 8.
Mukhoti, J., Stenetorp, P., and Gal, Y. (2018). On the importance of strong baselines
in Bayesian deep learning. In NeurIPS 2018 Bayesian Deep Learning Workshop.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its
Applications, 9(1):141–142.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021).
Deep double descent: Where bigger models and more data hurt. Journal of Statistical
Mechanics: Theory and Experiment, 2021(12):124003.
Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of
Toronto.
Neal, R. M. et al. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov
chain Monte Carlo, 2(11):2.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading
digits in natural images with unsupervised feature learning. In NIPS Workshop on
Deep Learning and Unsupervised Feature Learning 2011.
Noci, L., Roth, K., Bachmann, G., Nowozin, S., and Hofmann, T. (2021). Disentangling
the roles of curation, data-augmentation and the prior in the cold posterior effect.
Advances in Neural Information Processing Systems, 34:12738–12748.
Ober, S. W. and Aitchison, L. (2021). Global inducing point variational posteriors for
Bayesian neural networks and deep Gaussian processes. In International Conference
on Machine Learning, pages 8248–8259. PMLR.
Osawa, K., Swaroop, S., Khan, M. E. E., Jain, A., Eschenhagen, R., Turner, R. E.,
and Yokota, R. (2019). Practical deep learning with Bayesian principles. Advances
in neural information processing systems, 32.
Osband, I., Aslanides, J., and Cassirer, A. (2018). Randomized prior functions for
deep reinforcement learning. In Advances in Neural Information Processing Systems
(NeurIPS), pages 8617–8629.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J.,
Lakshminarayanan, B., and Snoek, J. (2019). Can you trust your model’s uncertainty?
Evaluating predictive uncertainty under dataset shift. Advances in neural information
processing systems, 32.
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D.
(2018). Image transformer. In Dy, J. and Krause, A., editors, Proceedings of the 35th
International Conference on Machine Learning, volume 80 of Proceedings of Machine
Learning Research, pages 4055–4064, Stockholmsmässan, Stockholm Sweden. PMLR.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison,
A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In
NIPS-W.
Pati, D., Bhattacharya, A., and Yang, Y. (2018). On statistical optimality of variational
Bayes. In International Conference on Artificial Intelligence and Statistics, pages
1579–1588. PMLR.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry,
G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual
models from natural language supervision. In International Conference on Machine
Learning, pages 8748–8763. PMLR.
Ramsey, F. P. (2016). Truth and probability. In Readings in formal epistemology, pages
21–45. Springer.
Ranganath, R., Tran, D., Altosaar, J., and Blei, D. (2016a). Operator variational
inference. Advances in Neural Information Processing Systems, 29.
Ranganath, R., Tran, D., and Blei, D. (2016b). Hierarchical variational models. In
International Conference on Machine Learning, pages 324–333.
Rasmussen, C. E. and Williams, C. K. (2005). Gaussian Processes for Machine Learning
(Adaptive Computation and Machine Learning). The MIT Press.
Ravi, S. and Larochelle, H. (2016). Optimization as a model for few-shot learning. In
International Conference on Learning Representations.
Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In
International conference on machine learning, pages 1530–1538. PMLR.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation
and approximate inference in deep generative models. In International conference
on machine learning, pages 1278–1286. PMLR.
Ritter, H., Botev, A., and Barber, D. (2018). A scalable Laplace approximation for
neural networks. In International Conference on Learning Representations (ICLR).
Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. (2013).
Gaussian processes for time-series modelling. Philosophical Transactions of the Royal
Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks
for biomedical image segmentation. In International Conference on Medical image
computing and computer-assisted intervention, pages 234–241. Springer.
Rothfuss, J., Josifoski, M., and Krause, A. (2020). Meta-learning Bayesian neural
network priors based on PAC-Bayesian theory. In NeurIPS 4th Workshop on
Meta-Learning.
Rudner, T. G., Chen, Z., Teh, Y. W., and Gal, Y. (2021). Tractable function-space
variational inference in Bayesian neural networks. In ICML Workshop on Uncertainty
& Robustness in Deep Learning.
Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. (1988). Learning representations
by back-propagating errors. Cognitive modeling, 5(3):1.
Savage, L. J. (1972). The foundations of statistics. Courier Corporation.
Schmidhuber, J. (1987). Evolutionary principles in self-referential learning, or on
learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität
München.
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep
information propagation. In International Conference on Learning Representations
(ICLR).
Settles, B. (2009). Active learning literature survey. Technical report, University of
Wisconsin-Madison Department of Computer Sciences.
Shafaei, A., Schmidt, M., and Little, J. J. (2018). Does your model know the digit 6 is
not a cat? a less biased evaluation of "outlier" detectors. CoRR, abs/1809.04729.
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From
theory to algorithms. Cambridge university press.
Shi, J., Sun, S., and Zhu, J. (2018). Kernel implicit variational inference. In
International Conference on Learning Representations (ICLR).
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of
machine learning algorithms. In Advances in Neural Information Processing Systems,
pages 2951–2959.
Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M.,
Prabhat, M., and Adams, R. (2015). Scalable Bayesian optimization using deep
neural networks. In International Conference on Machine Learning, pages 2171–2180.
Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. (2016). Bayesian optimization
with robust Bayesian neural networks. In Advances in Neural Information Processing
Systems, pages 4134–4142.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The Journal of
Machine Learning Research, 15(1):1929–1958.
Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian
neural networks. In International Conference on Learning Representations (ICLR).
Swaroop, S., Nguyen, C. V., Bui, T. D., and Turner, R. E. (2018). Improving and
understanding variational continual learning. In NIPS 2018 Continual Learning
Workshop.
Tao, T. (2011). An introduction to measure theory, volume 126. American Mathematical
Society Providence.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds
another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
Thrun, S. and Pratt, L. (2012). Learning to learn. Springer Science & Business Media.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian
processes. In Artificial intelligence and statistics, pages 567–574. PMLR.
Tomczak, M. B., Swaroop, S., and Turner, R. E. (2018). Neural network ensembles
and variational inference revisited. In 1st Symposium on Advances in Approximate
Bayesian Inference, pages 1–11.
Tran, B.-H., Milios, D., Rossi, S., and Filippone, M. (2020). Functional priors for
Bayesian neural networks through Wasserstein distance minimization to Gaussian
processes. In Third Symposium on Advances in Approximate Bayesian Inference.
Trippe, B. and Turner, R. (2018). Overpruning in variational Bayesian neural networks.
In NIPS 2017 Workshop on Advances in Approximate Bayesian Inference.
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of
Machine Learning Research (JMLR), 9:2579–2605.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł.,
and Polosukhin, I. (2017). Attention is all you need. Advances in neural information
processing systems, 30.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2021).
Rank-normalization, folding, and localization: an improved R̂ for assessing convergence of
MCMC (with discussion). Bayesian analysis, 16(2):667–718.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks
for one shot learning. Advances in neural information processing systems, 29.
Wagstaff, E., Fuchs, F., Engelcke, M., Posner, I., and Osborne, M. A. (2019). On
the limitations of representing functions on sets. In International Conference on
Machine Learning, pages 6487–6494. PMLR.
Wagstaff, E., Fuchs, F. B., Engelcke, M., Osborne, M. A., and Posner, I. (2022).
Universal approximation of functions on sets. Journal of Machine Learning Research,
23(151):1–56.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of
Statistics, Series A, pages 359–372.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin
dynamics. In Proceedings of the 28th International Conference on Machine Learning
(ICML-11), pages 681–688.
Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J.,
Salimans, T., Jenatton, R., and Nowozin, S. (2020). How good is the Bayes posterior
in deep neural networks really? In International Conference on Machine Learning,
pages 10248–10259. PMLR.
Wilson, A. G. (2020). The case for Bayesian deep learning. arXiv preprint
arXiv:2001.10995.
Wilson, A. G. and Izmailov, P. (2020). Bayesian deep learning and a probabilistic
perspective of generalization. Advances in neural information processing systems,
33:4697–4708.
Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. (2017). On the quantitative
analysis of decoder-based generative models. In International Conference on Learning
Representations.
Yang, G. (2019a). Scaling limits of wide neural networks with weight sharing: Gaussian
process behavior, gradient independence, and neural tangent kernel derivation. arXiv
preprint arXiv:1902.04760.
Yang, G. (2019b). Wide feedforward or recurrent neural networks of any architecture
are Gaussian processes. Advances in Neural Information Processing Systems, 32.
Yang, W., Lorch, L., Graule, M. A., Srinivasan, S., Suresh, A., Yao, J., Pradier, M. F.,
and Doshi-Velez, F. (2019). Output-constrained Bayesian neural networks. arXiv
preprint arXiv:1905.06287.
Yarotsky, D. (2022). Universal approximations of invariant maps by neural networks.
Constructive Approximation, 55(1):407–474.
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola,
A. J. (2017). Deep sets. Advances in neural information processing systems, 30.
Zhang, F. and Gao, C. (2020). Convergence rates of variational posterior distributions.
The Annals of Statistics, 48(4):2180–2207.
Zhang, G., Wang, C., Xu, B., and Grosse, R. (2019). Three mechanisms of weight
decay regularization. In International Conference on Learning Representations.
Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. (2020). Cyclical stochastic
gradient MCMC for Bayesian deep learning. In International Conference on Learning
Representations.

Appendix A
Proofs of results on single-hidden
layer BNNs
In Section 3.3 we stated simplified versions of bounds concerning the variance of
single-hidden layer networks with certain approximating families. In Appendix A.1 we
provide more general statements of these theorems. In Appendix A.2 we state the
lemmas on which their proofs rely, and in Appendix A.3 we prove each lemma. Finally,
in Appendix A.4, we prove the general statements of the theorems.
A.1 General theorem statements
The three main results we prove in this section are the following generalisations of
Theorems 1 to 3, now stated as Theorems 7 to 9:
Theorem 7 (MFVI). Consider a single-hidden layer ReLU neural network mapping
from $\mathbb{R}^D \to \mathbb{R}^K$ with $I \in \mathbb{N}$ hidden units. The corresponding mapping is given by
\[
f_k(x) = \sum_{i=1}^{I} w_{k,i}\, \psi\Big( \sum_{d=1}^{D} u_{i,d} x_d + v_i \Big) + b_k \quad \text{for } 1 \le k \le K,
\]
where $\psi(a) = \max(0, a)$. Suppose we have a distribution over network parameters
with density of the form:
\[
q(W, b, U, v) = \prod_{i=1}^{I} q_i(W_i | U, v)\, q(b | U, v) \prod_{i=1}^{I} \prod_{d=1}^{D} \mathcal{N}(u_{i,d}; \mu_{u_{i,d}}, \sigma^2_{u_{i,d}}) \prod_{i=1}^{I} \mathcal{N}(v_i; \mu_{v_i}, \sigma^2_{v_i}),
\tag{A.1}
\]
where $W_i = \{w_{k,i}\}_{k=1}^{K}$ are the weights out of neuron $i$ and $b = \{b_k\}_{k=1}^{K}$ are the output
biases, and $q_i(W_i | U, v)$ and $q(b | U, v)$ are arbitrary probability densities with finite first
two moments. Consider a line in $\mathbb{R}^D$ parameterised by $x(\lambda)_d = \gamma_d \lambda + c_d$ for $\lambda \in \mathbb{R}$
such that $\gamma_d c_d = 0$ for $1 \le d \le D$. Then for any $\lambda_1 \le 0 \le \lambda_2$, and any $\lambda_*$ such that
$|\lambda_*| \le \min(|\lambda_1|, |\lambda_2|)$,
\[
\mathbb{V}[f_k(x(\lambda_*))] \le \mathbb{V}[f_k(x(\lambda_1))] + \mathbb{V}[f_k(x(\lambda_2))] \quad \text{for } 1 \le k \le K.
\tag{A.2}
\]
We provide the proof of Theorem 7 in Appendix A.4.
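As an informal sanity check of Equation (A.2), the following sketch evaluates the output variance of a fully factorised Gaussian network in closed form and compares the three terms of the bound along a line through the origin. It is not part of the proof. It assumes a single output ($K = 1$), uses the standard moments of a rectified Gaussian, and all function names and parameter values (`relu_moments`, `output_variance`, `bound_gap`, the sampled means and standard deviations) are hypothetical choices made for illustration.

```python
import math
import numpy as np

def std_pdf(r):
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)

def std_cdf(r):
    return 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))

def relu_moments(mu, sigma):
    """E[psi(a)] and E[psi(a)^2] for a ~ N(mu, sigma^2), sigma > 0."""
    r = mu / sigma
    h = std_pdf(r) + r * std_cdf(r)
    return sigma * h, sigma ** 2 * (std_cdf(r) + r * h)

def output_variance(x, params):
    """Exact V[f(x)] for one output under a fully factorised Gaussian."""
    mu_u, s_u, mu_v, s_v, mu_w, s_w, s_b = params
    var = s_b ** 2
    for i in range(len(mu_v)):
        mu_a = float(mu_u[i] @ x + mu_v[i])
        sd_a = math.sqrt(float((s_u[i] ** 2) @ (x ** 2) + s_v[i] ** 2))
        m1, m2 = relu_moments(mu_a, sd_a)
        # w_i is independent of a_i, and the terms are independent across i.
        var += (s_w[i] ** 2 + mu_w[i] ** 2) * m2 - (mu_w[i] * m1) ** 2
    return var

def bound_gap(seed, lam1=-1.5, lam2=2.0, lam_star=0.7, n_hidden=5, n_in=3):
    """V(lam1) + V(lam2) - V(lam_star) on the line x(lam) = gamma * lam."""
    rng = np.random.default_rng(seed)
    params = (
        rng.normal(size=(n_hidden, n_in)),            # mu_u
        rng.uniform(0.1, 1.0, size=(n_hidden, n_in)),  # sigma_u
        rng.normal(size=n_hidden),                     # mu_v
        rng.uniform(0.1, 1.0, size=n_hidden),          # sigma_v
        rng.normal(size=n_hidden),                     # mu_w
        rng.uniform(0.1, 1.0, size=n_hidden),          # sigma_w
        0.3,                                           # sigma_b
    )
    gamma = rng.normal(size=n_in)  # c = 0, so gamma_d * c_d = 0 for all d
    var = lambda lam: output_variance(gamma * lam, params)
    return var(lam1) + var(lam2) - var(lam_star)

if __name__ == "__main__":
    print(min(bound_gap(seed) for seed in range(20)))
```

A non-negative gap for every sampled network is what Theorem 7 predicts; the check only covers the randomly drawn parameters and is no substitute for the proof in Appendix A.4.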
We briefly describe how the statement of Theorem 1 in the main text can be
deduced from this more general version. The fully factorised Gaussian family $\mathcal{Q}_{\mathrm{FFG}}$
is of the form in Equation (A.1). It remains to show that both conditions i) and ii)
imply that $\gamma_d c_d = 0$. Consider any line intersecting the origin, i.e. satisfying
condition i). Such a line can be written in the form $x(\lambda)_d = \gamma_d \lambda$ by choosing the origin to
correspond to $\lambda = 0$. As $c_d = 0$ for all $d$, $\gamma_d c_d = 0$ for all $d$. In Theorem 1, $p = x(\lambda_1)$
and $q = x(\lambda_2)$ are on opposite sides of the origin, hence the signs of $\lambda_1$ and $\lambda_2$ are
opposite. Finally, the condition that $r = x(\lambda_*)$ is closer to the origin than both $p$ and $q$
is exactly that $|\lambda_*| \le \min(|\lambda_1|, |\lambda_2|)$. In order to verify condition ii), note that any line
orthogonal to a hyperplane $x_{d'} = 0$ can be parameterised as $x(\lambda)_d = \gamma_d \lambda + c_d$, where
$\gamma_d = 0$ for $d \neq d'$ and $c_{d'} = 0$. Hence $\gamma_d c_d = 0$ for all $d$. The condition that the line
segment $\overrightarrow{pq}$ intersects the plane, with $p = x(\lambda_1)$ and $q = x(\lambda_2)$, is exactly that the signs
of $\lambda_1$ and $\lambda_2$ are opposite, and that $|\lambda_*| \le \min(|\lambda_1|, |\lambda_2|)$.
We now describe the results for BNNs using MC dropout, where, as noted in
Section 3.3, the result differs depending on whether dropout is applied to the inputs:
Theorem 8 (MC dropout with inputs not dropped out). Consider a single-hidden
layer ReLU neural network mapping from $\mathbb{R}^D \to \mathbb{R}^K$ with $I \in \mathbb{N}$ hidden units. The
corresponding mapping is given by
\[
f_k(x) = \sum_{i=1}^{I} w_{k,i}\, \psi\Big( \sum_{d=1}^{D} u_{i,d} x_d + v_i \Big) + b_k \quad \text{for } 1 \le k \le K,
\]
where $\psi(a) = \max(0, a)$. Assume $U, v$ are set deterministically and
\[
q(W, b) = q(b) \prod_{i=1}^{I} q_i(W_i),
\]
where $W_i = \{w_{k,i}\}_{k=1}^{K}$ are the weights out of neuron $i$, $b = \{b_k\}_{k=1}^{K}$ are the output biases
and $q(b)$ and $q_i(W_i)$ are arbitrary probability densities with finite first two moments.
Then, $\mathbb{V}[f_k(x)]$ is convex in $x$ for $1 \le k \le K$.
We note that when performing MC dropout without dropping out the inputs, U, v
are set deterministically. Furthermore, the weights out of different neurons are dropped
independently of each other. Hence Theorem 8 applies to this approximating family.
We provide the proof of Theorem 8 in Appendix A.4.
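The convexity claimed in Theorem 8 can be probed numerically with a short sketch. It assumes Bernoulli dropout on the hidden units only, with each outgoing weight written as a fixed value times an independent keep mask (no rescaling by the keep probability) and an undropped output bias; under that assumption the output variance has a simple closed form. The names `dropout_variance` and `convexity_gap` and all parameter values are hypothetical.

```python
import numpy as np

def dropout_variance(x, U, v, w_bar, p_drop):
    """V[f(x)] when w_i = m_i * w_bar_i with m_i ~ Bernoulli(1 - p_drop).

    U and v are deterministic, so V[w_i] = w_bar_i^2 * p_drop * (1 - p_drop)
    and the hidden units contribute independently.
    """
    act = np.maximum(0.0, U @ x + v)
    return float(p_drop * (1.0 - p_drop) * np.sum((w_bar * act) ** 2))

def convexity_gap(seed, t=0.3, p_drop=0.1, n_hidden=6, n_in=2):
    """t V(x1) + (1 - t) V(x2) - V(t x1 + (1 - t) x2) for random x1, x2."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_hidden, n_in))
    v = rng.normal(size=n_hidden)
    w_bar = rng.normal(size=n_hidden)
    x1 = 3.0 * rng.normal(size=n_in)
    x2 = 3.0 * rng.normal(size=n_in)
    var = lambda x: dropout_variance(x, U, v, w_bar, p_drop)
    return t * var(x1) + (1.0 - t) * var(x2) - var(t * x1 + (1.0 - t) * x2)

if __name__ == "__main__":
    print(min(convexity_gap(seed) for seed in range(50)))
```

A non-negative gap on every random segment is consistent with convexity of the variance in $x$; it illustrates the statement for this particular dropout parameterisation rather than for the general densities $q(b)$ and $q_i(W_i)$ allowed by the theorem.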
Remark 5. Theorem 8 applies for any activation function $\psi$ such that $\psi^2$ is convex.
This is the only property of $\psi$ which we will use in Lemma 3.
Theorem 9 (MC dropout with inputs dropped out). Consider a single-hidden layer
ReLU neural network mapping from $\mathbb{R}^D \to \mathbb{R}^K$ with $I \in \mathbb{N}$ hidden units. The
corresponding mapping is given by
\[
f_k(x) = \sum_{i=1}^{I} w_{k,i}\, \psi\Big( \sum_{d=1}^{D} u_{i,d} x_d + v_i \Big) + b_k \quad \text{for } 1 \le k \le K,
\]
where $\psi(a) = \max(0, a)$. Assume $v$ is set deterministically and
\[
q(W, b, U) = q(U)\, q(b | U) \prod_{i} q_i(W_i | U),
\]
where $W_i = \{w_{k,i}\}_{k=1}^{K}$ are the weights out of neuron $i$, $b = \{b_k\}_{k=1}^{K}$ are the output biases
and $q(U)$, $q(b | U)$ and $q_i(W_i | U)$ are arbitrary probability densities with finite first two
moments. Then, for any finite set of points $S \subset \mathbb{R}^D$ such that $0$ is in the convex hull
of $S$,
\[
\mathbb{V}[f_k(0)] \le \max_{s \in S} \{ \mathbb{V}[f_k(s)] \} \quad \text{for } 1 \le k \le K.
\tag{A.3}
\]
We note that when applying MC dropout to the inputs and the hidden layer, v
is still deterministic since biases are not dropped out, and the weight distribution
factorises as in Theorem 9. We provide the proof of Theorem 9 in Appendix A.4.
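For a very small network, Equation (A.3) can be checked exactly by enumerating every dropout mask. The sketch below assumes independent Bernoulli masks on the inputs and on the hidden units (one input mask shared by all hidden units), no rescaling by the keep probability, and a deterministic output bias. The function names, the triangle `S` around the origin and the parameter values are hypothetical.

```python
import itertools
import numpy as np

def exact_dropout_variance(x, U, v, w_bar, b, p_drop):
    """Exact V[f(x)] by summing over all input and hidden-unit masks."""
    n_hidden, n_in = U.shape
    mean, second_moment = 0.0, 0.0
    for mask in itertools.product([0.0, 1.0], repeat=n_in + n_hidden):
        m_in = np.array(mask[:n_in])
        m_hid = np.array(mask[n_in:])
        kept = int(sum(mask))
        prob = (1.0 - p_drop) ** kept * p_drop ** (len(mask) - kept)
        hidden = np.maximum(0.0, U @ (m_in * x) + v)
        out = float((w_bar * m_hid) @ hidden + b)
        mean += prob * out
        second_moment += prob * out ** 2
    return second_moment - mean ** 2

# A triangle whose convex hull contains the origin (weights 1/3, 7/24, 9/24).
S = [np.array([2.0, 0.5]), np.array([-1.0, 2.0]), np.array([-1.0, -2.0])]

def hull_gap(seed, p_drop=0.2, n_hidden=4):
    """max_{s in S} V[f(s)] - V[f(0)] for a randomly drawn network."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_hidden, 2))
    v = rng.normal(size=n_hidden)
    w_bar = rng.normal(size=n_hidden)
    var = lambda x: exact_dropout_variance(x, U, v, w_bar, 0.1, p_drop)
    return max(var(s) for s in S) - var(np.zeros(2))

if __name__ == "__main__":
    print(min(hull_gap(seed) for seed in range(10)))
```

Enumeration costs $2^{D+I}$ forward passes, so this is only practical for toy sizes; it is meant as an illustration of the statement, not as evidence beyond the proof.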
A.2 Statements of lemmas
In this section we state the lemmas required to prove Theorems 7 to 9:
Lemma 3. Assume a distribution for $W, b | U, v$ with density of the form
\[
q(W, b | U, v) = q(b | U, v) \prod_{i} q_i(W_i | U, v).
\]
Then, $\mathbb{V}[f_k(x) | U, v]$ is a convex function of $x$.
The proof of Lemma 3 is in Section A.3.1.
Lemma 4. Consider the variance of a single neuron in the one dimensional case, with
activation $a(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$, $\mu(x) = \mu_u x + \mu_v$ and $\sigma^2(x) = \sigma_u^2 x^2 + \sigma_v^2$. Let
\[
T_1 = \{ f \ge 0 : \forall\, 0 \le b < a,\ f(a) \ge f(-a) \text{ and } f(b) \le f(a) \}
\]
and
\[
T_2 = \{ f \ge 0 : \forall\, a < b \le 0,\ f(a) \ge f(-a) \text{ and } f(b) \le f(a) \}.
\]
If $\mu_u \ge 0$, then $\mathbb{V}[\psi(a(x))] \in T_1$. If $\mu_u \le 0$, then $\mathbb{V}[\psi(a(x))] \in T_2$.
The proof of Lemma 4 is in Section A.3.2.
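Membership of the single-neuron variance in $T_1$ can be illustrated on a grid. The sketch below uses the closed-form variance of a rectified Gaussian that the proof in Section A.3.2 takes from Frey and Hinton (1999), and tests the two defining conditions of $T_1$ at finitely many points, which can only suggest, not establish, membership. The helper names (`alpha`, `relu_variance`, `in_T1`) and the parameter values are hypothetical.

```python
import math

def alpha(r):
    """alpha(r) = Phi(r) + r h(r) - h(r)^2 with h(r) = N(r) + r Phi(r)."""
    pdf = math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))
    h = pdf + r * cdf
    return cdf + r * h - h * h

def relu_variance(x, mu_u, mu_v, s_u, s_v):
    """V[psi(a(x))] for a(x) ~ N(mu_u x + mu_v, s_u^2 x^2 + s_v^2)."""
    sigma = math.sqrt(s_u ** 2 * x ** 2 + s_v ** 2)
    return sigma ** 2 * alpha((mu_u * x + mu_v) / sigma)

def in_T1(f, grid, tol=1e-12):
    """Check the T1 conditions on an increasing grid of points in [0, inf)."""
    vals = [f(a) for a in grid]
    mirrored = [f(-a) for a in grid]
    nonneg = all(y >= -tol for y in vals + mirrored)
    dominates_mirror = all(y >= z - tol for y, z in zip(vals, mirrored))
    nondecreasing = all(vals[j] <= vals[j + 1] + tol for j in range(len(vals) - 1))
    return nonneg and dominates_mirror and nondecreasing

if __name__ == "__main__":
    grid = [0.05 * j for j in range(101)]
    for mu_u, mu_v in [(1.0, 0.5), (0.5, -1.0), (0.2, 2.0), (0.0, 1.0)]:
        f = lambda x, mu_u=mu_u, mu_v=mu_v: relu_variance(x, mu_u, mu_v, 0.7, 0.4)
        print(mu_u, mu_v, in_T1(f, grid))
```

For $\mu_u \le 0$ the same check can be applied to $x \mapsto \mathbb{V}[\psi(a(-x))]$, which corresponds to membership in $T_2$.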
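Corollary 1 (Corollary of Lemma 4). Consider a line in $\mathbb{R}^D$ parameterised by $[x(\lambda)]_d =
\gamma_d \lambda + c_d$ for $\lambda \in \mathbb{R}$ such that $\gamma_d c_d = 0$ for $1 \le d \le D$. Let $a(x) := \sum_{d=1}^{D} u_d x_d + v$ with
$\{u_d\}_{d=1}^{D}$ and $v$ independent and Gaussian distributed. Then, $\mathbb{V}[\psi(a(x(\lambda)))] \in T_1 \cup T_2$
(as a function of $\lambda$).
Proof. The activation $a(x(\lambda))$ is a linear combination of Gaussian random variables,
and is therefore Gaussian distributed. Moreover the mean is linear in $\lambda$. The variance
of $a(x(\lambda))$ is given by:
\begin{align*}
\mathbb{V}[a(x(\lambda))] &= \sum_{d=1}^{D} \mathbb{V}[u_d] (\gamma_d \lambda + c_d)^2 + \mathbb{V}[v] \\
&= \sum_{d=1}^{D} \sigma_{u_d}^2 (\gamma_d \lambda + c_d)^2 + \sigma_v^2 \\
&= \lambda^2 \Big( \sum_{d=1}^{D} \sigma_{u_d}^2 \gamma_d^2 \Big) + 2\lambda \Big( \sum_{d=1}^{D} \sigma_{u_d}^2 \gamma_d c_d \Big) + \Big( \sum_{d=1}^{D} \sigma_{u_d}^2 c_d^2 + \sigma_v^2 \Big) \\
&= \lambda^2 \Big( \sum_{d=1}^{D} \sigma_{u_d}^2 \gamma_d^2 \Big) + \Big( \sum_{d=1}^{D} \sigma_{u_d}^2 c_d^2 + \sigma_v^2 \Big),
\end{align*}
where the final equality uses $\gamma_d c_d = 0$ for all $d$.
Defining $\sigma_{\tilde{u}}^2 = \sum_{d=1}^{D} \sigma_{u_d}^2 \gamma_d^2$ and $\sigma_{\tilde{v}}^2 = \sum_{d=1}^{D} \sigma_{u_d}^2 c_d^2 + \sigma_v^2$, the corollary follows from
Lemma 4.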
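The reduction used in this proof can be illustrated numerically: when $\gamma_d c_d = 0$ for every $d$, the activation variance along the line is a quadratic in $\lambda$ with no linear term. The sketch below compares the direct expression with the reduced one; the function names and the example values of $\gamma$, $c$, $\sigma_{u_d}$ and $\sigma_v$ are hypothetical.

```python
import numpy as np

def activation_variance(lam, gamma, c, s_u, s_v):
    """V[a(x(lam))] = sum_d s_u[d]^2 (gamma[d] lam + c[d])^2 + s_v^2."""
    x = gamma * lam + c
    return float(np.sum(s_u ** 2 * x ** 2) + s_v ** 2)

def reduced_variance(lam, gamma, c, s_u, s_v):
    """s_u_tilde^2 lam^2 + s_v_tilde^2, valid when gamma[d] c[d] = 0 for all d."""
    s_u_tilde_sq = np.sum(s_u ** 2 * gamma ** 2)
    s_v_tilde_sq = np.sum(s_u ** 2 * c ** 2) + s_v ** 2
    return float(s_u_tilde_sq * lam ** 2 + s_v_tilde_sq)

gamma = np.array([1.0, 0.0, -2.0, 0.0])
c = np.array([0.0, 3.0, 0.0, -1.0])      # gamma[d] * c[d] = 0 for every d
c_bad = np.array([1.0, 3.0, 0.0, -1.0])  # violates the condition at d = 0
s_u = np.array([0.5, 1.0, 0.2, 0.8])
s_v = 0.3

if __name__ == "__main__":
    for lam in (-2.0, 0.0, 1.0):
        print(lam,
              activation_variance(lam, gamma, c, s_u, s_v),
              reduced_variance(lam, gamma, c, s_u, s_v))
```

With `c_bad` the cross term $2\lambda \sum_d \sigma_{u_d}^2 \gamma_d c_d$ no longer vanishes, so the two expressions differ for $\lambda \neq 0$.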
Lemma 5. Let $C$ be the set of convex functions from $\mathbb{R} \to [0, \infty)$. Fix any $a < 0 < b$
and $c$ such that $|c| \le \min(|a|, |b|)$. Then any function $f$ that can be written as a
linear combination of functions in $T_1 \cup T_2 \cup C$ with non-negative weights satisfies
$f(c) \le f(a) + f(b)$.
The proof of Lemma 5 can be found in Section A.3.3.
Lemma 6. Let $f : \mathbb{R}^D \to \mathbb{R}$ be a convex function and consider a finite set of points
$S \subset \mathbb{R}^D$. Then for any point $r$ in the convex hull of $S$, $f(r) \le \max_{s \in S} \{ f(s) \}$.
The proof of Lemma 6 is given in Section A.3.4.
A.3 Proofs of lemmas
In this section we prove the lemmas stated in Appendix A.2.
A.3.1 Proof of lemma 3
Proof. We assume a distribution for the network weights such that:
\[
q(W, b | U, v) = q(b | U, v) \prod_{i=1}^{I} q_i(W_i | U, v).
\]
By this factorisation assumption, the outgoing weights from each neuron are
conditionally independent. This means the conditional variance of the output under this
distribution can be written
\[
\mathbb{V}[f_k(x) | U, v] = \sum_{i} \mathbb{V}[w_{k,i} | U, v]\, \psi(a_i)^2 + \mathbb{V}[b_k | U, v],
\tag{A.4}
\]
with $a_i := a_i(x) = \sum_{d=1}^{D} u_{i,d} x_d + v_i$.
Since $\mathbb{V}[f_k(x) | U, v]$ is a linear combination of the $\psi(a_i)^2$ with non-negative weights
(plus a constant), to prove convexity it suffices to show that each $\psi(a_i)^2$ is convex as a
function of $x$. The function $\psi(a_i)^2$ is convex as a function of $a_i$, since it is $0$ for $a_i \le 0$ and $a_i^2$ for
$a_i > 0$. To show that it is convex as a function of $x$, we write, for $t \in [0, 1]$,
\begin{align*}
\psi\big(a_i(t x_1 + (1 - t) x_2)\big)^2 &= \psi\Big( \sum_{d} u_{i,d} \big( t [x_1]_d + (1 - t) [x_2]_d \big) + v_i \Big)^2 \\
&= \psi\Big( t \Big( \sum_{d} u_{i,d} [x_1]_d + v_i \Big) + (1 - t) \Big( \sum_{d} u_{i,d} [x_2]_d + v_i \Big) \Big)^2 \\
&\le t\, \psi\Big( \sum_{d} u_{i,d} [x_1]_d + v_i \Big)^2 + (1 - t)\, \psi\Big( \sum_{d} u_{i,d} [x_2]_d + v_i \Big)^2 \\
&= t\, \psi(a_i(x_1))^2 + (1 - t)\, \psi(a_i(x_2))^2.
\end{align*}
The inequality uses convexity of $\psi(a)^2$ as a function of $a$.
A.3.2 Proof of lemma 4
Throughout, we assume σu, σv and µv are fixed and suppress dependence on these
parameters. Let vµu(x) := V[ψ(a(x))] where the variance is taken with respect to a
distribution with parameter µu. Then, vµu(x) = v−µu(−x) since µ(x) and σ2(x) are
unchanged by the transformation µu, x→ −µu,−x.
Suppose vµu ∈ T1 for µu > 0, then for x ≤ 0,
v−µu(x) = vµu(−x) ≥ vµu(x) = v−µu(−x),
and for x < y ≤ 0,
v−µu(y) = vµu(−y) ≤ vµu(−x) = v−µu(x).
In words, if vµu ∈ T1 then v−µu ∈ T2. It therefore suffices to consider the case when
µu ≥ 0.
We first show that if x ≥ 0, vµu(x) ≥ vµu(−x). Henceforth, we assume µu ≥ 0 is
fixed and suppress it notationally. From Frey and Hinton (1999),
\[
v(x) = \sigma(x)^2\alpha(r(x)). \tag{A.5}
\]
Here $r(x) = \mu(x)/\sigma(x)$. We define $h(r) = N(r) + r\Phi(r)$, where $N$ is the standard
Gaussian pdf and $\Phi$ is the standard Gaussian cdf. We define $\alpha(r) = \Phi(r) + rh(r) - h(r)^2$.
As σ(x)2 = σ(−x)2, it suffices to show α(r(x)) ≥ α(r(−x)) for x > 0. To show
this, we first show that r(x) ≥ r(−x) for x > 0, then show that α(r) is monotonically
increasing.
r(x) = µ(x)/σ(x) = µ(−x)/σ(−x) + 2µux/σ(−x) ≥ µ(−x)/σ(−x) = r(−x).
The inequality uses that both µu and x are non-negative. It remains to show that α(r)
is monotonically increasing. A straightforward calculation shows that,
α′(r) = 2h(r)(1− Φ(r)).
As 1−Φ(r) > 0, we must show h(r) ≥ 0. We have limr→−∞ h(r) = 0 and h′(r) = Φ(r) >
0, implying h(r) > 0. We conclude α′(r) > 0 for all r, showing that vµu(x) ≥ vµu(−x)
for x ≥ 0.
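The derivative identity α′(r) = 2h(r)(1 − Φ(r)) and the positivity of α′ can be sanity-checked numerically. The following is a minimal sketch using central finite differences; the helper names, grid and step size are illustrative assumptions.

```python
import math


def pdf(r):
    """Standard Gaussian pdf N(r)."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)


def cdf(r):
    """Standard Gaussian cdf Phi(r)."""
    return 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))


def h(r):
    return pdf(r) + r * cdf(r)


def alpha(r):
    return cdf(r) + r * h(r) - h(r) ** 2


def alpha_prime(r):
    """Closed form stated in the proof."""
    return 2.0 * h(r) * (1.0 - cdf(r))


grid = [-4.0 + 0.25 * i for i in range(33)]  # r in [-4, 4]
step = 1e-5
max_err = max(
    abs((alpha(r + step) - alpha(r - step)) / (2.0 * step) - alpha_prime(r))
    for r in grid
)
all_positive = all(alpha_prime(r) > 0.0 for r in grid)
```

A grid check is of course not a proof; it only guards against algebra slips in the closed form.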
To complete the proof, we must show that v(x) is monotonically increasing for
x ≥ 0. As σ(x)2 is increasing as a function of x and α(r) is increasing as a function of
r, v(x) is increasing as a function of x whenever r(x) is increasing as a function of x.
As $r'(x) = \frac{\sigma_v^2\mu_u - \sigma_u^2\mu_v x}{\sigma(x)^3}$, this completes the proof if $\sigma_v^2\mu_u - \sigma_u^2\mu_v x \ge 0$. In particular, we
need only consider cases when $\mu_v > 0$. In this case, we write,
\[
v(x) = \mu(x)^2\beta(r(x)) \tag{A.6}
\]
where $\beta(r) = \alpha(r)/r^2$. Also in this region, we have the inequality,
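\[
r'(x)\sigma(x) = \frac{\sigma_v^2\mu_u - \sigma_u^2\mu_v x}{\sigma_u^2x^2 + \sigma_v^2} \le \frac{\sigma_v^2\mu_u}{\sigma_u^2x^2 + \sigma_v^2} \le \frac{\sigma_v^2\mu_u}{\sigma_v^2} = \mu_u,
\]
which leads to $r'(x) \le \mu_u/\sigma(x)$.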
Differentiating Equation (A.6),
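\[
\begin{aligned}
v'(x) &= 2\mu_u\mu(x)\beta(r(x)) + \mu(x)^2\left(\frac{\sigma_v^2\mu_u - \sigma_u^2\mu_v x}{\sigma(x)^3}\right)\beta'(r(x)) \\
&\ge 2\mu_u\mu(x)\left(\beta(r(x)) + \frac{1}{2}r(x)\beta'(r(x))\right).
\end{aligned}
\]
The inequality uses that $r(x) > 0$, so that by Lemma 7, $\beta'(r(x)) < 0$. It suffices to
show that $\beta(r) + \frac{1}{2}r\beta'(r) > 0$ for $r > 0$.
\[
\begin{aligned}
\beta(r) + \frac{1}{2}r\beta'(r) &= \beta(r) + \frac{1}{2}r\frac{d}{dr}\left(\frac{\alpha(r)}{r^2}\right) \\
&= \frac{\alpha(r)}{r^2} + \frac{1}{2}r\,\frac{\alpha'(r)r^2 - 2r\alpha(r)}{r^4} \\
&= \frac{\alpha'(r)}{2r} \ge 0.
\end{aligned}
\]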
We conclude that v′(x) ≥ 0 for x ≥ 0, implying that v(x) is monotonically increasing
in this region. This completes the proof that vµu(x) ∈ T1 for µu > 0.
Lemma 7. For $\beta$ defined as in the proof of Lemma 4 and for $r > 0$, $\beta'(r) < 0$.
Proof. For $r \ne 0$, $\beta'(r) = \left(-2\Phi(r) + 2N(r)^2 + 2rN(r)\Phi(r)\right)/r^3$. As $r > 0$,
\[
\beta'(r) \le 0 \iff I(r) := -\Phi(r) + N(r)^2 + rN(r)\Phi(r) \le 0.
\]
Rearranging (Abramowitz and Stegun, 1965, 7.1.13) yields:
\[
1 - \frac{2}{r + \sqrt{r^2 + 8/\pi}}N(r) \le \Phi(r) < 1 - \frac{2}{r + \sqrt{r^2 + 4}}N(r) \tag{A.7}
\]
for $r \ge 0$. Hence
\[
\begin{aligned}
I(r) &= -\Phi(r) + N(r)^2 + rN(r)\Phi(r) \\
&\le -\Phi(r) + N(r)^2 + rN(r)\left(1 - \frac{2}{r + \sqrt{r^2 + 4}}N(r)\right) \\
&\le -1 + \frac{2}{r + \sqrt{r^2 + 8/\pi}}N(r) + N(r)^2 + rN(r)\left(1 - \frac{2}{r + \sqrt{r^2 + 4}}N(r)\right) \\
&= -1 + \frac{2}{r + \sqrt{r^2 + 8/\pi}}N(r) + rN(r) + N(r)^2\left(1 - \frac{2r}{r + \sqrt{r^2 + 4}}\right).
\end{aligned} \tag{A.8}
\]
We now make use of numerous crude bounds which hold for r > 0:
1. $N(r) \le 1/\sqrt{2\pi}$,
2. $\frac{2}{r + \sqrt{r^2 + 8/\pi}} \le \sqrt{\pi/2}$,
3. $rN(r) \le 1/\sqrt{2\pi e}$,
4. $\frac{2r}{r + \sqrt{r^2 + 4}} \ge 0$.
Plugging these into Equation (A.8),
\[
I(r) \le -1 + \frac{\sqrt{\pi/2}}{\sqrt{2\pi}} + \frac{1}{\sqrt{2\pi e}} + \frac{1}{2\pi} = -\frac{1}{2} + \frac{1}{\sqrt{2\pi e}} + \frac{1}{2\pi} \approx -0.099 < 0.
\]
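The numerical constant in the final bound, and the sign of I(r) itself, are easy to check. The sketch below evaluates the crude bound and scans I(r) on a grid of positive r; the grid and the function names are illustrative assumptions.

```python
import math


def pdf(r):
    """Standard Gaussian pdf N(r)."""
    return math.exp(-0.5 * r * r) / math.sqrt(2.0 * math.pi)


def cdf(r):
    """Standard Gaussian cdf Phi(r)."""
    return 0.5 * (1.0 + math.erf(r / math.sqrt(2.0)))


def lemma7_numerator(r):
    """I(r) = -Phi(r) + N(r)^2 + r N(r) Phi(r), whose sign is that of beta'(r) for r > 0."""
    return -cdf(r) + pdf(r) ** 2 + r * pdf(r) * cdf(r)


# Constant obtained by plugging the four crude bounds into Equation (A.8).
crude_bound = -0.5 + 1.0 / math.sqrt(2.0 * math.pi * math.e) + 1.0 / (2.0 * math.pi)

# Largest value of I(r) on a grid over (0, 10].
grid = [0.01 * i for i in range(1, 1001)]
worst = max(lemma7_numerator(r) for r in grid)
```

On this grid the largest value of I(r) sits well below the crude bound, which is consistent with the bounds being loose but sufficient.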
A.3.3 Proof of lemma 5
Proof. Recall that
T1 = {f ≥ 0 : ∀0 ≤ b < a, f(a) ≥ f(−a) and f(b) ≤ f(a)}
and
T2 = {f ≥ 0 : ∀a < b ≤ 0, f(a) ≥ f(−a) and f(b) ≤ f(a)}.
First, note that T1, T2 and the set of non-negative convex functions, C are all closed
under addition and positive scalar multiplication. We can therefore write f as a sum of
three functions, f(x) = t1(x) + t2(x) + s(x) with t1 ∈ T1, t2 ∈ T2 and s ∈ C. We prove
the case when a ≤ c ≤ 0 ≤ −c ≤ b. The case a ≤ −c ≤ 0 ≤ c ≤ b follows a symmetric
argument.
f(c) = t1(c) + t2(c) + s(c) (def.)
≤ t1(c) + t2(a) + s(c) (second condition for T2)
≤ t1(−c) + t2(a) + s(c) (first condition for T1)
≤ t1(b) + t2(a) + s(c) (second condition for T1)
≤ t1(b) + t2(a) + max(s(a), s(b)) (s convex)
≤ t1(b) + t2(a) + s(a) + s(b)
≤ t1(a) + t1(b) + t2(a) + t2(b) + s(a) + s(b) (non-negativity)
= f(a) + f(b).
A.3.4 Proof of lemma 6
Proof. Let $S_N = \{s_n\}_{n=1}^{N} \subset \mathbb{R}^D$. We proceed by induction. The lemma is true for
$N = 2$ by the definition of convexity. Assume it is true for $N$. Let $\mathrm{Conv}(S_{N+1})$ denote
the convex hull of $S_{N+1}$. Consider a point $r_{N+1} \in \mathrm{Conv}(S_{N+1})$. Then
\[
f(r_{N+1}) = f\left(\sum_{n=1}^{N+1} \alpha_n s_n\right) \tag{A.9}
\]
with $\sum_{n=1}^{N+1} \alpha_n = 1$ and $\alpha_n \ge 0$ for $1 \le n \le N + 1$. We can write
\[
f(r_{N+1}) = f\left(\left(\sum_{n=1}^{N} \alpha_n\right)t_N + \alpha_{N+1}s_{N+1}\right) \le \max\{f(t_N), f(s_{N+1})\} \tag{A.10}
\]
where $t_N := \sum_{n=1}^{N} \alpha_n s_n \big/ \sum_{n=1}^{N} \alpha_n$, and we have used the convexity of $f$. By the
induction assumption, $f(t_N) \le \max_{s \in S_N} f(s)$, since $t_N \in \mathrm{Conv}(S_N)$. Combining this with
Equation (A.10) completes the proof.
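As an illustration of Lemma 6, the sketch below draws random convex combinations of a small point set and checks that a convex function never exceeds its maximum over the set. The choice of f (squared Euclidean norm), the dimension and the sample sizes are arbitrary assumptions.

```python
import random


def f(point):
    """A convex function on R^D: the squared Euclidean norm."""
    return sum(v * v for v in point)


rng = random.Random(0)
dim = 3
points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(5)]
max_on_set = max(f(p) for p in points)

violations = 0
for _ in range(1000):
    # Random weights alpha_n >= 0 summing to one.
    raw = [rng.random() for _ in points]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Point r in the convex hull of the set.
    r = [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]
    if f(r) > max_on_set + 1e-12:
        violations += 1
```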
A.4 Proofs of theorems
Having collected the necessary preliminary lemmas we now prove Theorems 7 to 9.
Proof of Theorem 7. By the law of total variance,
V[fk(x)] = E[V[fk(x)|U, v]] + V[E[fk(x)|U, v]].
Using Lemma 3, V[fk(x)|U, v] is convex as a function of x. As the expectation of a
convex function is convex, the first term is a convex function of x. For the second term
we have
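\[
\begin{aligned}
\mathbb{E}[f_k(x) \mid U, v] &= \mathbb{E}\left[\sum_{i=1}^{I} w_{k,i}\psi(a_i) + b_k \,\middle|\, U, v\right] \\
&= \sum_{i=1}^{I} \mu_{w_{k,i}}\psi(a_i) + \mu_{b_k},
\end{aligned}
\]
where $\mu_{w_{k,i}} := \mathbb{E}[w_{k,i}]$, $\mu_{b_k} := \mathbb{E}[b_k]$. In the second line we used linearity of expectation
and that conditioned on $(U, v)$, the $a_i$ are deterministic. Next,
\[
\mathbb{V}[\mathbb{E}[f_k(x) \mid U, v]] = \mathbb{V}\left[\sum_{i=1}^{I} \mu_{w_{k,i}}\psi(a_i) + \mu_{b_k}\right] = \sum_{i=1}^{I} \mu_{w_{k,i}}^2\mathbb{V}[\psi(a_i)], \tag{A.11}
\]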
since the ai are independent of each other.
Consider a line in RD parameterised by [x(λ)]d = γdλ + cd for λ ∈ R such that
γdcd = 0 for 1 ≤ d ≤ D.
By Corollary 1, V[ψ(ai(x(λ)))] ∈ T1 ∪ T2 (as a function of λ). Since V[fk(x)|U, v]
is convex as a function of x, it is also convex as a function of λ. We have written
V[fk(x(λ))] in the form assumed in Lemma 5, completing the proof.
Proof of Theorem 8. The theorem follows immediately from Lemma 3 since U and v
are deterministic.
Proof of Theorem 9. By the law of total variance,
V[fk(x)] = E[V[fk(x)|U ]] + V[E[fk(x)|U ]].
Using Lemma 3, V[fk(x)|U ] is convex as a function of x. As the expectation of a convex
function is convex, the first term is a convex function of x. This implies
\[
\mathbb{E}[\mathbb{V}[f_k(0) \mid U]] \le \max_{s \in S}\left\{\mathbb{E}[\mathbb{V}[f_k(s) \mid U]]\right\},
\]
by Lemma 6. V[E[fk(x)|U ]] is non-negative everywhere. As the output of the first
layer is independent of the matrix U at x = 0, E[fk(0)|U ] is deterministic. So
V[E[fk(0)|U ]] = 0, completing the proof.
Appendix B
Bayesian neural network experimental
details
In this chapter we provide the experimental details of the BNN experiments reported
in Chapter 3.
Data: The input locations of the data were generated by sampling 100 total points,
50 each from two distinct Gaussians. In Figure 3.6, one Gaussian was centred at
(−1,−1) and the other at (1, 1); both had isotropic variance of 0.01. The output values
were generated by sampling from the Gaussian process prior with the kernel resulting
from the wide limit of the BNN at these input values.
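A minimal sketch of how input locations of this kind could be drawn is given below. It is not the code used for the experiments: the function name, seed and use of the standard library are assumptions, and only the cluster centres, counts and variance come from the description above.

```python
import random


def sample_inputs(rng, n_per_cluster=50, centres=((-1.0, -1.0), (1.0, 1.0)), variance=0.01):
    """Draw 2D input locations from isotropic Gaussians centred at `centres`."""
    std = variance ** 0.5  # isotropic variance 0.01 corresponds to std 0.1
    return [
        (rng.gauss(cx, std), rng.gauss(cy, std))
        for (cx, cy) in centres
        for _ in range(n_per_cluster)
    ]


inputs = sample_inputs(random.Random(0))
```

The output values would then be drawn from the GP prior evaluated at these locations, which requires the kernel and is not sketched here.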
Prior: A fully-connected ReLU network with 50 hidden units per layer is used. The
prior mean for all parameters is chosen to be 0. The prior standard deviation for the
bias parameters is chosen as $\sigma_b = 1$ for all experiments. Let $\sigma_w/\sqrt{H}$ be the prior
standard deviation of each weight, where $H$ is the number of inputs to the weight
matrix. We choose $\sigma_w = 4$. All models used a fixed Gaussian likelihood with standard
deviation 0.1.
Fitting the GP: The Gaussian process was implemented using GPFlow (Matthews
et al., 2017) with the infinite-width ReLU BNN kernel implemented following Lee et al.
(2018). All hyperparameters were fixed and exact inference was performed using the
Cholesky decomposition.
Fitting MFVI: We initialise the standard deviations of the weights to be small and
train for many epochs, following Swaroop et al. (2018) and Tomczak et al. (2018), who
found this led to good predictive performance. The weight means in each weight matrix
were initialised by sampling from $\mathcal{N}(0, 1/\sqrt{2n_{\mathrm{out}}})$, where $n_{\mathrm{out}}$ is the number of outputs
of the weight matrix. The weight standard deviations were all initialised to a very
small value of $1 \times 10^{-5}$ (we tried a larger initialisation with weight standard deviations
initialised to $1 \times 10^{-2.5}$ and found no significant difference). Bias means were initialised
to zero, with the variances initialised to the same small value as the weight variance.
100,000 iterations of full batch training on the dataset were performed using ADAM
with a learning rate of $1 \times 10^{-3}$. The ELBO was estimated using 32 Monte Carlo
samples during training. The local reparameterisation trick was used (Kingma et al.,
2015). The predictive distribution at test time was estimated using 500 samples from
the approximate posterior.
Fitting MCDO: The weights and biases were initialised using the default PyTorch
initialisation. The dropout rate was fixed at p = 0.05. The ℓ2 regularisation parameter
was set following Gal (2016, Section 3.2.3) for the given prior, in such a way that the
‘KL condition’ is met, in the interpretation of dropout training as variational inference.
100,000 iterations of full batch training on the dataset were performed using ADAM
with a learning rate of $1 \times 10^{-3}$. The dropout objective was estimated using 32 Monte
Carlo samples during training. The predictive distribution at test time was estimated
using 500 samples from the approximate posterior.
Fitting HMC: For HMC on the 1HL BNN, 250,000 samples of HMC were taken
using the NUTS implementation in Pyro (Bingham et al., 2018; Hoffman and Gelman,
2014) after 10,000 warmup steps. For the 2HL case, 1,000,000 samples of HMC were
taken after 20,000 warmup steps. We set the maximum tree depth in NUTS to 5, and
adapt the step size and mass matrix during warmup.
Appendix C
Proofs of results on deep BNNs
In Appendix C.1 we prove the universality of the mean and variance function for deep
BNNs using QFFG and QMCDO, where, as usual, the inputs are not dropped out for
QMCDO. Conversely, if the inputs are dropped out, we show in Appendix C.2 via a
counterexample that the resulting BNN does not have a universal mean and variance
function.
C.1 Proof of Theorem 4
We now restate and prove Theorem 4 from the main body:
Theorem 10. Let A ⊂ RD be compact, and let C(A) be the space of continuous
functions on A to R. Similarly, let C+(A) be the space of continuous functions on A to
R≥0. Then for any g ∈ C(A) and h ∈ C+(A), and any ϵ > 0, for both the mean-field
Gaussian and MC dropout families, there exists a 2-hidden layer ReLU NN such that
\[
\sup_{x \in A}\left|\mathbb{E}[f(x)] - g(x)\right| < \epsilon \quad\text{and}\quad \sup_{x \in A}\left|\mathbb{V}[f(x)] - h(x)\right| < \epsilon,
\]
where f(x) is the (stochastic) output of the network.
Our proof will make use of the standard universal approximation theorem for
deterministic NNs as given in Leshno et al. (1993):
Theorem 11 (Universal approximation for deterministic NNs). Let ψ(a) = max(0, a).
Then for every $g \in C(\mathbb{R}^D)$ and every compact set $A \subset \mathbb{R}^D$, for any $\epsilon > 0$ there exists a
function $f \in S$ such that $\|g - f\|_\infty < \epsilon$, where
\[
S := \left\{\sum_{i=1}^{I} w_i\psi\left(\sum_{d=1}^{D} u_{i,d}x_d + v_i\right) : I \in \mathbb{N},\ w_i, u_{i,d}, v_i \in \mathbb{R}\right\}.
\]
We first prove a useful lemma.
Lemma 8. Let ψ(a) = max(0, a). Let a be a random variable with finite first two
moments. Then V[ψ(a)] ≤ V[a].
Proof. For all x, y ∈ R, we have |x− y|2 ≥ |ψ(x)− ψ(y)|2. Consider two i.i.d. copies
of any random variable with finite first two moments, denoted a1 and a2. Then
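\[
\begin{aligned}
\mathbb{V}[a_1] &= \mathbb{E}\left[a_1^2\right] - \mathbb{E}[a_1]^2 \\
&= \frac{1}{2}\mathbb{E}\left[a_1^2 + a_2^2 - 2a_1a_2\right] \\
&= \frac{1}{2}\mathbb{E}\left[|a_1 - a_2|^2\right] \\
&\ge \frac{1}{2}\mathbb{E}\left[|\psi(a_1) - \psi(a_2)|^2\right] \\
&= \mathbb{V}[\psi(a_1)].
\end{aligned}
\]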
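Lemma 8 holds for any distribution with finite second moment, including the empirical distribution of a sample, so it can be illustrated directly. In the sketch below the Gaussian parameters, sample size and seed are arbitrary assumptions.

```python
import random
import statistics


def relu(a):
    return max(0.0, a)


rng = random.Random(0)
samples = [rng.gauss(0.3, 1.5) for _ in range(10000)]

# Population variances of the empirical distribution before and after the ReLU.
var_a = statistics.pvariance(samples)
var_relu = statistics.pvariance([relu(a) for a in samples])
```

Because the ReLU is 1-Lipschitz, it cannot increase pairwise distances between samples, which is exactly the step used in the proof.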
C.1.1 Proof of Theorem 4 for QFFG
We prove Theorem 10 for the fully-factorised Gaussian approximating family. We begin
by proving results about 1HL networks within this family. The overall goal of these
results is Lemma 11, which informally says that for any set of mean parameters for the
weights, we can find a setting of the standard deviations of the weights, such that the
mean output of the network is close to the output of the deterministic network, with
weights equal to the mean parameters. Our proof of this proceeds in 3 parts: First, in
Lemma 9, we show that by making the standard deviation parameters sufficiently
small, we can ensure that the variance of the output of the network is uniformly small
on some compact set A. Next, in Lemma 10, we show that again by choosing the
standard deviation sufficiently small, we can make most of the sample functions of
the 1HL network close to the function that would be obtained by using the mean
parameters. Finally, in the proof of Lemma 11, we use Chebyshev’s inequality and the
triangle inequality to conclude that the mean of the network must also be close to the
function defined by the mean parameters. These networks will be used to construct
the desired 2HL network.
Notation Consider a 1HL ReLU NN with input x ∈ RD and output f ∈ RK . Let
the network have I hidden units and be parameterised by input weights U ∈ RI×D,
input biases v ∈ RI , output weights W ∈ RK×I and output biases b ∈ RK . Let
θ = (U, v,W, b). Denote the kth output of the network by fk,θ(x). Consider a factorised
Gaussian distribution over the parameters θ in the network. Let the means of the
Gaussians be denoted µ = (µU , µv, µW , µb), where e.g. µU is a matrix whose elements
are the means of U . Each mean is always taken to be ∈ R. Similarly, let the standard
deviations be denoted σ = (σU , σv, σW , σb). Each standard deviation is always taken
to be ∈ R>0.
The following lemma states that we can make the output of a 1HL BNN have low
variance by setting the standard deviation of the weights to be small.
Lemma 9. Let A ⊂ RD be a compact set and fk,θ(x) be the kth output of a 1HL ReLU
NN with a mean-field Gaussian distribution mapping from A→ R. Fix any µ and any
ϵ > 0. Let all the standard deviations in σ be equal to a shared constant σ > 0. Then
there exists σ′ > 0 such that for all σ < σ′ and for all x ∈ A, V[ψ(fk,θ(x))] < ϵ for all
1 ≤ k ≤ K.
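Proof. Define $a_i = \sum_{d=1}^{D} u_{i,d}x_d + v_i$, so that $f_{k,\theta}(x) = \sum_{i=1}^{I} w_{k,i}\psi(a_i) + b_k$. Then
\[
\begin{aligned}
\mathbb{V}[f_{k,\theta}(x)] &= \mathbb{V}\left[\sum_{i=1}^{I} w_{k,i}\psi(a_i)\right] + \sigma^2 \\
&= \sum_{i=1}^{I}\sum_{j=1}^{I} \mathrm{Cov}\left(w_{k,i}\psi(a_i), w_{k,j}\psi(a_j)\right) + \sigma^2 \\
&\le \sum_{i=1}^{I}\sum_{j=1}^{I} \left|\mathrm{Cov}\left(w_{k,i}\psi(a_i), w_{k,j}\psi(a_j)\right)\right| + \sigma^2 \\
&\le \sum_{i=1}^{I}\sum_{j=1}^{I} \sqrt{\mathbb{V}[w_{k,i}\psi(a_i)]\,\mathbb{V}[w_{k,j}\psi(a_j)]} + \sigma^2,
\end{aligned}
\]
where the final line follows from the Cauchy–Schwarz inequality. We now analyse each
of the constituent terms. Since $w_{k,i}$ and $\psi(a_i)$ are independent,
\[
\mathbb{V}[w_{k,i}\psi(a_i)] = \mu_{w_{k,i}}^2\mathbb{V}[\psi(a_i)] + \mathbb{E}[\psi(a_i)]^2\sigma^2 + \sigma^2\mathbb{V}[\psi(a_i)].
\]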
As A is compact, it is bounded, so there exists an M such that |xd| ≤ M for all
1 ≤ d ≤ D. Using Lemma 8, and the mean-field assumptions,
\[
\mathbb{V}[\psi(a_i)] \le \mathbb{V}[a_i] = \sigma^2\left(\sum_{d=1}^{D} x_d^2 + 1\right) \le \sigma^2(DM^2 + 1).
\]
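Since $a_i$ is a linear combination of Gaussian random variables, we have that $a_i \sim
\mathcal{N}(\mu_{a_i}, \sigma_{a_i}^2)$, where $\mu_{a_i} = \sum_{d=1}^{D} \mu_{u_{i,d}}x_d + \mu_{v_i}$ and $\sigma_{a_i}^2 = \sigma^2\left(\sum_{d=1}^{D} x_d^2 + 1\right)$. Therefore,
we have that (Frey and Hinton, 1999):
\[
\begin{aligned}
\mathbb{E}[\psi(a_i)]^2 &= \left(\mu_{a_i}\Phi\left(\frac{\mu_{a_i}}{\sigma_{a_i}}\right) + \sigma_{a_i}N\left(\frac{\mu_{a_i}}{\sigma_{a_i}}\right)\right)^2 \\
&\le \left(|\mu_{a_i}|\Phi\left(\frac{\mu_{a_i}}{\sigma_{a_i}}\right) + \sigma_{a_i}N\left(\frac{\mu_{a_i}}{\sigma_{a_i}}\right)\right)^2 \\
&\le \left(|\mu_{a_i}| + \frac{\sigma_{a_i}}{\sqrt{2\pi}}\right)^2.
\end{aligned}
\]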
We can then upper bound V[wk,iψ(ai)] as follows:
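\[
\begin{aligned}
\mathbb{V}[w_{k,i}\psi(a_i)] &\le \mu_{w_{k,i}}^2\sigma^2(DM^2 + 1) + \left(|\mu_{a_i}| + \frac{\sigma_{a_i}}{\sqrt{2\pi}}\right)^2\sigma^2 + \sigma^4(DM^2 + 1) \\
&\le \mu_{w_{k,i}}^2\sigma^2(DM^2 + 1) + \left(M\sum_{d=1}^{D}|\mu_{u_{i,d}}| + |\mu_{v_i}| + \frac{\sqrt{\sigma^2(M^2D + 1)}}{\sqrt{2\pi}}\right)^2\sigma^2 + \sigma^4(DM^2 + 1) \\
&=: v_{k,i}(\sigma).
\end{aligned}
\]
The second inequality follows since $A$ is compact and we have $|\mu_{a_i}| \le M\sum_{d=1}^{D}|\mu_{u_{i,d}}| +
|\mu_{v_i}|$. Note that the upper bound $v_{k,i}(\sigma)$ is continuous and monotonically increasing in
$\sigma$, and $v_{k,i}(0) = 0$. We can then upper bound the variance of the output:
\[
\mathbb{V}[f_{k,\theta}(x)] \le \sum_{i=1}^{I}\sum_{j=1}^{I} \sqrt{v_{k,i}(\sigma)v_{k,j}(\sigma)} + \sigma^2.
\]
We then choose $\sigma'$ such that for all $1 \le k \le K$ and for all $1 \le i \le I$, $v_{k,i}(\sigma') < \frac{\epsilon}{2I^2}$,
and such that $\sigma'^2 < \frac{\epsilon}{2}$. Then
\[
\mathbb{V}[f_{k,\theta}(x)] \le I^2\frac{\epsilon}{2I^2} + \sigma'^2 < \epsilon
\]
for $1 \le k \le K$. Finally, applying Lemma 8, we have $\mathbb{V}[\psi(f_{k,\theta}(x))] < \epsilon$ for $1 \le k \le K$.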
The following lemma states that by setting the standard deviation of the weights
to be sufficiently small, we can with high probability make the sampled BNN output
close to the BNN output evaluated at the mean parameters.
Lemma 10. Let A ⊂ RD be any compact set. Fix any µ and any ϵ, δ > 0. Let all the
standard deviations in σ be equal to a shared constant σ > 0. Then there exists σ′ > 0
such that for all σ < σ′, and for any x ∈ A,
Pr (|ψ(fk,µ(x))− ψ(fk,θ(x))| > ϵ) < δ
for all 1 ≤ k ≤ K.
Proof. Let θ ∈ RP . We first note that ψ(fk,θ(x)) is continuous as a function from
A× RP → R, under the metric topology induced by the Euclidean metric on A× RP .
Next, define a ball in parameter space
Bγ = {θ : ∥θ − µ∥2 < γ}.
Consider the closed ball of unit radius around µ, B¯1. Note that B¯1 is compact, and
therefore A× B¯1 is compact as a product of compact spaces.
Since a continuous map from a compact metric space to another metric space
is uniformly continuous, given ϵ > 0, there exists a 0 < τ < 1 such that for all
pairs (x1, θ1), (x2, θ2) ∈ A × B¯1 such that d((x1, θ1), (x2, θ2)) < τ , |ψ(fk,θ1(x1)) −
ψ(fk,θ2(x2))| < ϵ. Here d(·, ·) is the Euclidean metric on A × RP . Since this is true
for all 1 ≤ k ≤ K, we can find a 0 < τ < 1 such that |ψ(fk,θ1(x1))− ψ(fk,θ2(x2))| < ϵ
holds for all k simultaneously, by taking the minimum of the τ over k.
Now choose σ′ > 0 such that for all σ < σ′, Pr(θ ∈ Bτ ) > 1 − δ. This event
implies d((x, θ), (x,µ)) = ∥θ − µ∥2 < τ . Furthermore, θ ∈ B¯1, since τ < 1. Hence
|ψ(fk,µ(x))− ψ(fk,θ(x))| < ϵ holds for all 1 ≤ k ≤ K.
The following lemma shows that for 1HL networks, we can make E [ψ(fk,θ)] (the
mean BNN output) close to ψ(fk,µ) (the BNN output evaluated at the mean parameter
settings) by choosing the standard deviation of the weights to be sufficiently small.
Lemma 11. Let A ⊂ RD be any compact set. Then, for any ϵ > 0 and any µ, there
exists a σ1 > 0 such that for any shared standard deviation σ < σ1,
∥E [ψ(fk,θ)]− ψ(fk,µ)∥∞ < ϵ
for all 1 ≤ k ≤ K.
Proof. For all x ∈ A and any θ∗, by the triangle inequality
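\[
\left|\mathbb{E}[\psi(f_{k,\theta}(x))] - \psi(f_{k,\mu}(x))\right| \le \left|\mathbb{E}[\psi(f_{k,\theta}(x))] - \psi(f_{k,\theta^*}(x))\right| + \left|\psi(f_{k,\mu}(x)) - \psi(f_{k,\theta^*}(x))\right|.
\]
Applying Lemma 10 with $\epsilon' = \epsilon/2$ and $\delta = 1/4$, we can find a $\sigma'$ such that for all $\sigma < \sigma'$,
$|\psi(f_{k,\mu}(x)) - \psi(f_{k,\theta}(x))| \le \epsilon/2$ with probability at least 3/4. By Lemma 9, we can
find a $\sigma''$ such that for all $\sigma < \sigma''$, $\mathbb{V}[\psi(f_{k,\theta}(x))] < \frac{\epsilon^2}{16K}$. Choose $0 < \sigma < \min(\sigma', \sigma'')$.
We can apply Chebyshev’s inequality to each random variable $\psi(f_{k,\theta}(x))$,
\[
\Pr\left[\left|\psi(f_{k,\theta}(x)) - \mathbb{E}[\psi(f_{k,\theta}(x))]\right| > \epsilon/2\right] < \frac{1}{4K}.
\]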
Applying the union bound, the probability that |ψ(fk,θ(x))− E [ψ(fk,θ(x))] | ≤ ϵ/2 for
all k simultaneously is at least 3/4. Therefore, for any x we can find a θ∗ such that
|ψ(fk,θ∗(x))−E [ψ(fk,θ(x))] | ≤ ϵ/2 and |ψ(fk,µ(x))− ψ(fk,θ∗(x))| ≤ ϵ/2 simultaneously
because both events occur with probability at least 1/2 and therefore have a non-empty
intersection. Therefore for all x and all k
|E [ψ(fk,θ(x))]− ψ(fk,µ(x))| ≤ ϵ.
We can now complete the proof of Theorem 4 for QFFG.
Proof of Theorem 10. Consider the case of a 2-hidden layer ReLU Bayesian neural
network with 2 units in the second hidden layer. Denote the inputs to these units as
f1,θ(x) and f2,θ(x) respectively, where θ are the parameters in the bottom two weight
matrices and biases of the network. The output of the network can then be written as
f(x) = s1ψ(f1,θ(x)) + s2ψ(f2,θ(x)) + t, (C.1)
where the si are the weights in the final layer and t is the bias. Taking expectations on
both sides,
E [f(x)] = E [s1ψ(f1,θ(x))] + E [s2ψ(f2,θ(x))] + E [t] .
Choose µs1 = 1, µs2 = 0, and note that s1 is independent of θ by the mean field
assumption. Then
E [f(x)] = E [ψ(f1,θ(x))] + E [t] . (C.2)
Define $\mu_t = \min_{x' \in A} g(x')$ (as $A$ is compact and $g$ is continuous, this minimum is well-
defined). Define $\tilde{g}(x) \ge 0$ to be $g(x) - \min_{x' \in A} g(x')$. By the universal approximation
theorem (Theorem 11) we can find a setting of the mean parameters, $\mu$, in the first
two layers (i.e. excluding the parameters of the distributions on $s_1$, $s_2$ and $t$) such that
\[
\|f^{(1)}_{\mu} - \tilde{g}\|_\infty < \epsilon/2 \quad\text{and}\quad \|f^{(2)}_{\mu} - \sqrt{h}\|_\infty < \epsilon/2.
\]
This can be done by splitting the neurons in the first hidden layer into two sets, where
the first and second set are responsible for $f^{(1)}$, $f^{(2)}$ respectively, and the weights from
each set to the output of the other set are zero. Since $\tilde{g}(x) \ge 0$, applying the ReLU
can only make $f^{(1)}$ closer to $\tilde{g}$. Hence $\|\psi(f^{(1)}_{\mu}) - \tilde{g}\|_\infty < \epsilon/2$.
By Lemma 11, we can find a $\sigma_1 > 0$ for this $\mu$ such that when the standard
deviations in the first two layers are set to any shared constant $\sigma < \sigma_1$,
\[
\left\|\mathbb{E}[\psi(f_{1,\theta})] - \psi(f^{(1)}_{\mu})\right\|_\infty < \epsilon/2.
\]
By the triangle inequality, ∥E [ψ(f1,θ)]− g˜∥∞ < ϵ. Combining with Equation (C.2), it
follows that the expectation can approximate any continuous function g.
We now consider the variance of Equation (C.1).
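\[
\begin{aligned}
\mathbb{V}[f(x)] &= \mathbb{V}[s_1\psi(f_{1,\theta}(x)) + s_2\psi(f_{2,\theta}(x))] + \mathbb{V}[t] \\
&= \mathbb{V}[s_1\psi(f_{1,\theta}(x))] + \mathbb{V}[s_2\psi(f_{2,\theta}(x))] + 2\,\mathrm{Cov}(s_1\psi(f_{1,\theta}(x)), s_2\psi(f_{2,\theta}(x))) + \sigma_t^2.
\end{aligned}
\]
Choose $\sigma_t^2 = \epsilon$. We now consider $\mathbb{V}[s_1\psi(f_{1,\theta}(x))]$. As $s_1$ is independent of $\theta$,
\[
\mathbb{V}[s_1\psi(f_{1,\theta}(x))] = \mu_{s_1}^2\mathbb{V}[\psi(f_{1,\theta}(x))] + \sigma_{s_1}^2\mathbb{E}[\psi(f_{1,\theta}(x))]^2 + \mathbb{V}[\psi(f_{1,\theta}(x))]\sigma_{s_1}^2.
\]
Recall $\mu_{s_1} = 1$ and choose $\sigma_{s_1}^2 = \min\left(1, \epsilon\big/\left(\max_{x \in A}\mathbb{E}[\psi(f_{1,\theta}(x))]^2\right)\right)$, then
\[
\mathbb{V}[s_1\psi(f_{1,\theta}(x))] \le 2\mathbb{V}[\psi(f_{1,\theta}(x))] + \epsilon.
\]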
By Lemma 9, we can find a σ2 such that for any σ < σ2, V[ψ(f1,θ(x))] ≤ ϵ. For any
such σ, V[s1ψ(f1,θ(x))] ≤ 3ϵ.
We now choose σ2s2 = 1 and consider
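\[
\begin{aligned}
\mathbb{V}[s_2\psi(f_{2,\theta}(x))] &= \mu_{s_2}^2\mathbb{V}[\psi(f_{2,\theta}(x))] + \sigma_{s_2}^2\mathbb{E}[\psi(f_{2,\theta}(x))]^2 + \sigma_{s_2}^2\mathbb{V}[\psi(f_{2,\theta}(x))] \\
&= \mathbb{E}[\psi(f_{2,\theta}(x))]^2 + \mathbb{V}[\psi(f_{2,\theta}(x))].
\end{aligned}
\]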
By Lemma 9, we can find a σ3 such that for any σ < σ3, V[ψ(f2,θ(x))] < ϵ.
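By the universal approximation theorem (Theorem 11) we can find a setting
of the mean parameters, $\mu$, in the first two layers such that $\|f^{(2)}_{\mu} - \sqrt{h}\|_\infty < \epsilon/2$. Since
$\sqrt{h(x)} \ge 0$, the ReLU can only make $f^{(2)}$ closer to $\sqrt{h}$, so $\|\psi(f^{(2)}_{\mu}) - \sqrt{h}\|_\infty < \epsilon/2$.
By Lemma 11, we can find a setting of $\sigma$ for this $\mu$ such that
\[
\left\|\mathbb{E}[\psi(f_{2,\theta})] - \psi(f^{(2)}_{\mu})\right\|_\infty < \epsilon/2.
\]
By the triangle inequality,
\[
\left\|\mathbb{E}[\psi(f_{2,\theta})] - \sqrt{h}\right\|_\infty < \epsilon.
\]
This implies,
\[
\begin{aligned}
\left\|\mathbb{E}[\psi(f_{2,\theta})]^2 - h\right\|_\infty &= \left\|\left(\mathbb{E}[\psi(f_{2,\theta})] - \sqrt{h}\right)\left(\mathbb{E}[\psi(f_{2,\theta})] + \sqrt{h}\right)\right\|_\infty \\
&\le \epsilon\left\|\mathbb{E}[\psi(f_{2,\theta})] + \sqrt{h}\right\|_\infty \\
&\le \epsilon\left(2\|\sqrt{h}\|_\infty + \epsilon\right).
\end{aligned}
\]
We therefore have,
\[
\begin{aligned}
\|\mathbb{V}[f] - h\|_\infty &\le E(\epsilon) + 2\sup_{x \in A}\left|\mathrm{Cov}(s_1\psi(f_{1,\theta}(x)), s_2\psi(f_{2,\theta}(x)))\right| \\
&\le E(\epsilon) + 2\sup_{x \in A}\sqrt{\mathbb{V}[s_1\psi(f_{1,\theta}(x))]\,\mathbb{V}[s_2\psi(f_{2,\theta}(x))]} \\
&\le E(\epsilon) + C\sqrt{\epsilon},
\end{aligned}
\]
where the second inequality is Cauchy–Schwarz, $E(\epsilon)$ is a function that tends to zero
with $\epsilon$, and $C$ is a constant. The theorem follows by choosing $\sigma < \min\{\sigma_1, \sigma_2, \sigma_3\}$.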
The construction in our proof used a 2HL BNN with only two neurons in the
second hidden layer. The construction still works for wider hidden layers, by setting
the unused neurons to have zero mean and sufficiently small variance.
An analogous statement to Theorem 4 for networks with more than two hidden
layers can be proved inductively: applying Theorem 4 for 2HL BNNs we can choose
the variance to be uniformly small, thus satisfying the condition stated in Lemma 9.
The proof of Lemma 10 applies equally for the output of 2HL BNNs. The rest of the
proof then follows as stated.
C.1.2 Proof of Theorem 10 for MCDO
In order to prove the universality result for deep dropout, we first prove two lem-
mas about 1HL dropout networks. The following lemma states that the mean of a
1HL dropout network is a universal function approximator, while its variance can
simultaneously be made arbitrarily small.
Lemma 12. Consider any ϵ > 0 and any continuous function, m : A → R, where
A ⊂ RD is compact. Then there exists a (random) ReLU neural network of the form
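\[
f(x) = \sum_{i=1}^{I} w_i\gamma_i\psi\left(\sum_{d=1}^{D} u_{i,d}x_d + v_i\right) + b
\]
with $\gamma_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(1 - p)$ such that $\|\mathbb{E}[f] - m\|_\infty < \epsilon$ and $\|\mathbb{V}[f]\|_\infty \le \epsilon$.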
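Proof. By the universal approximation theorem (Leshno et al., 1993), there exists a
$J \in \mathbb{N}$ and 1HL network of the form,
\[
g(x) = \sum_{j=1}^{J} \tilde{w}_j\psi\left(\sum_{d=1}^{D} \tilde{u}_{j,d}x_d + v_j\right) + b,
\]
such that $\|g - m\|_\infty \le \epsilon$. Define the dropout network,
\[
f^{(1)}(x) = \sum_{j=1}^{J} \frac{\tilde{w}_j}{1 - p}\gamma_j\psi\left(\sum_{d=1}^{D} \tilde{u}_{j,d}x_d + v_j\right) + b.
\]
Then $\mathbb{E}[f^{(1)}] = g$, so that $\|\mathbb{E}[f^{(1)}] - m\|_\infty \le \epsilon$. Let $S = \|\mathbb{V}[f^{(1)}]\|_\infty < \infty$.
Define $f = \frac{1}{L}\sum_{\ell=1}^{L} f^{(1,\ell)}$ where each $f^{(1,\ell)}$ is an independent realisation of $f^{(1)}$.
Then $\mathbb{E}[f] = g$ and, since the copies are independent, $\mathbb{V}[f] = \frac{\mathbb{V}[f^{(1)}]}{L} \le \frac{S}{L}$. $f$ can be realised by a dropout network by
combining $L$ copies of $f^{(1)}$ together with identical weights within each copy and 0
weights connecting the various copies. Choosing any integer $L \ge S/\epsilon$ completes the proof.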
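How averaging independent copies of a dropout network shrinks its variance can be checked exactly for a small example. At a fixed input, a 1HL dropout network is a sum of fixed coefficients times independent Bernoulli masks, so its variance can be computed by enumerating all masks. The coefficients, keep probability and number of copies below are hypothetical values chosen for illustration.

```python
from itertools import product


def exact_variance(coeffs, keep_prob):
    """Exact variance of sum_j coeffs[j] * gamma_j with gamma_j ~ Bern(keep_prob) i.i.d."""
    mean = 0.0
    second_moment = 0.0
    for mask in product((0, 1), repeat=len(coeffs)):
        prob = 1.0
        for g in mask:
            prob *= keep_prob if g else 1.0 - keep_prob
        value = sum(c * g for c, g in zip(coeffs, mask))
        mean += prob * value
        second_moment += prob * value * value
    return second_moment - mean * mean


keep_prob = 0.95                 # 1 - p
single = [0.8, -1.3, 0.5]        # hypothetical per-unit contributions at a fixed input
copies = 4                       # L
# L copies with separate masks, each contributing 1/L of its output.
averaged = [c / copies for c in single] * copies

var_single = exact_variance(single, keep_prob)
var_averaged = exact_variance(averaged, keep_prob)
```

Here `var_averaged` equals `var_single / copies` up to floating point error, i.e. the variance falls in proportion to the number of copies.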
The following lemma states that the mean of the MCDO network can approximate
any continuous positive function, after application of the ReLU non-linearity.
Lemma 13. Given a positive mean function m with 0 < δ ≤ ∥m∥∞ ≤ ∆ and a
stochastic process f such that ∥E [f ]−m∥∞ ≤ ϵ ≤ δ and ∥V[f ]∥∞ ≤ ϵ,
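\[
\|\mathbb{E}[\psi(f)] - m\|_\infty \le \epsilon + \frac{\sqrt{\epsilon^2 + \epsilon(\Delta + \epsilon)^2}}{\delta - \epsilon} = O\left(\Delta\sqrt{\epsilon}/(\delta - \epsilon)\right)
\]
and $\|\mathbb{V}[\psi(f)]\|_\infty \le \epsilon$. In the big-O notation, we assume $\Delta$ is bounded below by a
constant and $\epsilon$, $\delta$ are bounded above by a constant.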
Proof. The bound ∥V[ψ(f)]∥∞ ≤ ϵ follows from Lemma 8. We consider the expectation
of ψ(f(x)) for some arbitrary fixed x,
|E [ψ(f(x))]−m(x)| = |E [f(x)]−m(x)− E [min(0, f(x))]|
≤ |E [f(x)]−m(x)|+ |E [min(0, f(x))]|
≤ ϵ+ |E [min(0, f(x))]| .
We therefore bound |E [min(0, f(x))]|.
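\[
\left|\mathbb{E}[\min(0, f(x))]\right| = \left|\mathbb{E}\left[f(x)\mathbb{1}\{f(x) < 0\}\right]\right| \le \sqrt{\mathbb{E}[f(x)^2]\Pr(f(x) < 0)}.
\]
The inequality uses Cauchy–Schwarz, that the square of an indicator function is itself
and reinterprets the expectation of an indicator function as a probability. We bound
the two terms on the RHS separately.
\[
\mathbb{E}\left[f(x)^2\right] = \mathbb{V}[f(x)] + \mathbb{E}[f(x)]^2 \le \epsilon + \mathbb{E}[f(x)]^2 \le \epsilon + (m(x) + \epsilon)^2 \le \epsilon + (\Delta + \epsilon)^2.
\]
We use Chebyshev’s inequality to bound the probability $f(x) < 0$,
\[
\begin{aligned}
\Pr(f(x) < 0) &\le \Pr\left(|f(x) - \mathbb{E}[f(x)]| > m(x) - \epsilon\right) \\
&\le \frac{\mathbb{V}[f(x)]}{(m(x) - \epsilon)^2} \\
&\le \frac{\epsilon}{(m(x) - \epsilon)^2} \\
&\le \frac{\epsilon}{(\delta - \epsilon)^2}.
\end{aligned}
\]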
Having collected the necessary lemmas, we provide a construction that proves
Theorem 10.
Proof of Theorem 10. Consider a 2HL dropout NN. Let the pre-activations in the first
hidden layer be collectively denoted a1, and the random dropout masks by ϵ1. Let the
second hidden layer have I + 2 hidden units. Let ⊙ denote the elementwise product of
two vectors of the same length. Define the pre-activations of two of the second hidden
layer units by av = wTv (ϵ1 ⊙ ψ(a1)), i.e. both these hidden units have identical weight
vectors wv and dropout masks, and are hence the same random variable. Similarly, let
the remaining I second hidden layer pre-activations be defined by am = wTm(ϵ1⊙ψ(a1)),
again all being the same random variable. Furthermore, let (wv)i = 0 whenever
(wm)i ̸= 0 and vice versa, so that the first hidden layer neurons that influence av and
those that influence am form disjoint sets. Then the output of the 2HL network is:
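\[
f = \epsilon_a w_{2,a}\psi(a_v) + \epsilon_b w_{2,b}\psi(a_v) + \sum_{i=1}^{I} \epsilon_i w_{2,i}\psi(a_m) + b_2,
\]
where $\epsilon_a$, $\epsilon_b$, $\{\epsilon_i\}_{i=1}^{I}$ are the final layer dropout masks and $\{w_{2,i}\}_{i=1}^{I}$, $b_2$ are the final
layer weights and bias.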
We now make the choices w2,a = 1, w2,b = −1, w2,i = α, where αI = 1/(1−p). Then
E [f ] = E [ψ(am)] + b2. Let b2 = minx∈A g − δ, where δ > 0 and the min exists due to
compactness of A. Define g′ = g−b2. Since am is just the output of a single-hidden layer
dropout network, for any γ′ > 0 we can use Lemma 12 to choose ∥E [am]− g′∥∞ < γ′
and ∥V[am]∥∞ < γ′. Since g′ is bounded below by δ and bounded above by some
∆ ∈ R (by continuity of g and compactness of A), we can then apply Lemma 13 to
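obtain $\|\mathbb{E}[\psi(a_m)] - g'\|_\infty = O\left(\Delta\sqrt{\gamma'}/(\delta - \gamma')\right)$ and $\|\mathbb{V}[\psi(a_m)]\|_\infty < \gamma'$. We can use this to
bound the error in the mean of the 2HL network output:
\[
\|\mathbb{E}[f] - g\|_\infty = \|\mathbb{E}[\psi(a_m)] + b_2 - g\|_\infty = \|\mathbb{E}[\psi(a_m)] - g'\|_\infty = O\left(\Delta\sqrt{\gamma'}/(\delta - \gamma')\right).
\]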
We can choose γ′ to depend on δ,∆ such that ∥E [f ]− g∥∞ < γ, proving the first part
of the theorem. Next, calculating the variance,
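\[
\begin{aligned}
\mathbb{V}[f] &= \mathbb{V}\left[(\epsilon_a - \epsilon_b)\psi(a_v) + \alpha\psi(a_m)\sum_{i=1}^{I}\epsilon_i\right] && \text{(C.3)} \\
&= \mathbb{V}[(\epsilon_a - \epsilon_b)\psi(a_v)] + \alpha^2\mathbb{V}\left[\psi(a_m)\sum_{i=1}^{I}\epsilon_i\right]. && \text{(C.4)}
\end{aligned}
\]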
Next we show that by taking I sufficiently large, we can make the second term arbitrarily
small. We have,
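\[
\begin{aligned}
\mathbb{V}\left[\psi(a_m)\sum_{i=1}^{I}\epsilon_i\right] &= \mathbb{V}[\psi(a_m)]\mathbb{V}\left[\sum_{i=1}^{I}\epsilon_i\right] + \mathbb{V}[\psi(a_m)]\mathbb{E}\left[\sum_{i=1}^{I}\epsilon_i\right]^2 + \mathbb{V}\left[\sum_{i=1}^{I}\epsilon_i\right]\mathbb{E}[\psi(a_m)]^2 \\
&= \mathbb{V}[\psi(a_m)]Ip(1 - p) + \mathbb{V}[\psi(a_m)]I^2(1 - p)^2 + Ip(1 - p)\mathbb{E}[\psi(a_m)]^2 \\
&\le \gamma'Ip(1 - p) + \gamma'I^2(1 - p)^2 + Ip(1 - p)\mathbb{E}[\psi(a_m)]^2.
\end{aligned}
\]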
The first two of these three terms can be made arbitrarily small by choosing γ′
sufficiently small. The third term, upon multiplying by $\alpha^2$, becomes
\[
\alpha^2Ip(1 - p)\mathbb{E}[\psi(a_m)]^2 = \frac{p}{I(1 - p)}\mathbb{E}[\psi(a_m)]^2,
\]
which can also be made arbitrarily small by choosing I ∈ N sufficiently large. We now
show that the first term in Equation (C.4) can well approximate our target variance
function h.
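\[
\begin{aligned}
\mathbb{V}[(\epsilon_a - \epsilon_b)\psi(a_v)] &= \mathbb{V}[\epsilon_a - \epsilon_b]\mathbb{V}[\psi(a_v)] + \mathbb{V}[\epsilon_a - \epsilon_b]\mathbb{E}[\psi(a_v)]^2 + \mathbb{V}[\psi(a_v)]\mathbb{E}[\epsilon_a - \epsilon_b]^2 && \text{(C.5)} \\
&= 2p(1 - p)\mathbb{V}[\psi(a_v)] + 2p(1 - p)\mathbb{E}[\psi(a_v)]^2. && \text{(C.6)}
\end{aligned}
\]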
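The step from Equation (C.5) to Equation (C.6) relies on the mask difference having mean zero and variance 2p(1 − p). This can be verified by exact enumeration over the two masks. In the sketch below, a two-valued random variable stands in for ψ(a_v); its values and probabilities, and the dropout rate, are hypothetical.

```python
from itertools import product

drop = 0.05          # dropout probability p
keep = 1.0 - drop    # masks are Bern(1 - p)

# Hypothetical stand-in for psi(a_v): (value, probability) pairs, independent of the masks.
y_values = [(0.2, 0.5), (1.4, 0.5)]
mean_y = sum(v * q for v, q in y_values)
var_y = sum(v * v * q for v, q in y_values) - mean_y ** 2

# Exact first and second moments of (eps_a - eps_b) * y by enumeration.
mean_prod = 0.0
second_prod = 0.0
for eps_a, eps_b in product((0, 1), repeat=2):
    p_mask = (keep if eps_a else drop) * (keep if eps_b else drop)
    for v, q in y_values:
        z = (eps_a - eps_b) * v
        mean_prod += p_mask * q * z
        second_prod += p_mask * q * z * z

var_prod = second_prod - mean_prod ** 2
closed_form = 2.0 * drop * keep * (var_y + mean_y ** 2)
```

The two masked copies of the same unit cancel in the mean but not in the variance, which is what lets this pair of units control the output variance independently of the mean.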
Define
\[
h' = \sqrt{\frac{h}{2p(1 - p)}} + \delta',
\]
for some δ′ > 0. Again applying Lemma 12 (which we can do independently of the
choice of am since neurons influencing av and am are disjoint), for any γ′′ > 0 we can
choose ∥E [av]− h′∥∞ < γ′′ and ∥V[av]∥∞ < γ′′. The first term in Equation (C.6) can be
made arbitrarily small by choosing γ′′ small enough. We can again apply Lemma 13 so
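that $\|\mathbb{E}[\psi(a_v)] - h'\|_\infty = O\left(\Delta'\sqrt{\gamma''}/(\delta' - \gamma'')\right)$. We then bound the difference between
the second term in Equation (C.6) and our target variance function:
\[
\begin{aligned}
&\left\|2p(1 - p)\mathbb{E}[\psi(a_v)]^2 - h\right\|_\infty && \text{(C.7)} \\
&\quad\le \left\|\sqrt{2p(1 - p)}\,\mathbb{E}[\psi(a_v)] + \sqrt{h}\right\|_\infty\left\|\sqrt{2p(1 - p)}\,\mathbb{E}[\psi(a_v)] - \sqrt{h}\right\|_\infty && \text{(C.8)} \\
&\quad\le \left(\left\|2\sqrt{h}\right\|_\infty + \left\|\sqrt{2p(1 - p)}\,\mathbb{E}[\psi(a_v)] - \sqrt{h}\right\|_\infty\right)\left\|\sqrt{2p(1 - p)}\,\mathbb{E}[\psi(a_v)] - \sqrt{h}\right\|_\infty && \text{(C.9)}
\end{aligned}
\]
where Equation (C.8) follows from sub-multiplicativity of the infinity norm. Expanding
the second term in Equation (C.9),
\[
\begin{aligned}
\left\|\sqrt{2p(1 - p)}\,\mathbb{E}[\psi(a_v)] - \sqrt{h}\right\|_\infty &= \sqrt{2p(1 - p)}\left\|\mathbb{E}[\psi(a_v)] - h' + \delta'\right\|_\infty \\
&= O\left(\delta' + \Delta'\sqrt{\gamma''}/(\delta' - \gamma'')\right).
\end{aligned}
\]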
By first choosing δ′ sufficiently small, and then choosing γ′′ depending on δ′, we can
make this error term arbitrarily small. Since all the other contributions to V[f ] were
made arbitrarily small, this allows us to set $\|\mathbb{V}[f] - h\|_\infty < \gamma$, for any $\gamma > 0$, completing
the proof.
In order to provide an analogous construction for MCDO BNNs with more than 2
hidden layers, we note that the above proof only requires a BNN output with a universal
mean function and an arbitrarily small variance function in Lemma 12. Instead of a
1HL network, we can apply Theorem 4 to construct a 2 or more hidden layer network
to provide these mean and variance functions. The rest of the proof then follows as in
the 2HL case.
C.2 Counterexample when inputs are dropped out
In the case when the network has several hidden layers, dropout with inputs dropped
defines a posterior with somewhat strange properties, as observed in Gal (2016, Section
4.2.1). In particular, in D dimensions, a typical sample function from the approximate
posterior will be constant as a function of roughly pD of the input dimensions. However,
which dimensions it is constant along depends on the particular sample. This behaviour
is unlikely to be shared by the exact posterior. We are able to exploit this type of
behaviour to show that if inputs are dropped out, there are simple combinations of
mean and variance functions that cannot be simultaneously approximated by the
corresponding approximating family.
Proposition 8. Consider $f$ the (stochastic) output of an MC dropout network of
arbitrary depth with inputs dropped out. For any $x, x' \in \mathbb{R}$ such that $\mathbb{V}[f(x)], \mathbb{V}[f(x')] <
\epsilon^2$, we have $|\mathbb{E}[f(x)] - \mathbb{E}[f(x')]| \le 2\epsilon\sqrt{2/p}$.
Proof. With probability p, the input is dropped out, so Pr(f(x) = f(x′)) ≥ p. We
apply Chebyshev’s inequality giving the bounds,
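\[
\Pr\left(|f(x) - \mathbb{E}[f(x)]| \le r\epsilon\right) \ge 1 - 1/r^2 \quad\text{and}\quad \Pr\left(|f(x') - \mathbb{E}[f(x')]| \le r\epsilon\right) \ge 1 - 1/r^2
\]
for any $r > 0$. Choose $r = \sqrt{2/p} + \delta$ for any $\delta > 0$, then there exists a realisation of the
dropout network such that $|f(x) - \mathbb{E}[f(x)]| \le r\epsilon$, $|f(x') - \mathbb{E}[f(x')]| \le r\epsilon$ and $f(x) = f(x')$
simultaneously. Consequently,
\[
\begin{aligned}
|\mathbb{E}[f(x)] - \mathbb{E}[f(x')]| &= |\mathbb{E}[f(x)] - f(x) + f(x) - \mathbb{E}[f(x')]| \\
&= |\mathbb{E}[f(x)] - f(x) + f(x') - \mathbb{E}[f(x')]| \\
&\le |\mathbb{E}[f(x)] - f(x)| + |f(x') - \mathbb{E}[f(x')]| \\
&\le 2r\epsilon = 2\epsilon\sqrt{2/p} + 2\epsilon\delta.
\end{aligned}
\]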
Taking the limit as δ → 0 completes the proof.
In other words, we can bound the difference in the mean output at two points in
terms of the uncertainty at those points and the dropout probability.
In $D > 1$ dimensions, we can get similarly tight bounds on lines parallel to a
coordinate axis: for $x, x'$ on such a line, $\Pr(f(x) = f(x')) \ge p$ still holds, because if the
dimension on which $x$ and $x'$ differ is dropped out, then $f(x) = f(x')$.
Alternatively, in $D$ dimensions, for arbitrary $x, x' \in \mathbb{R}^D$, $\Pr(f(x) = f(x')) \ge p^D$.
This comes from noting that with probability $p^D$ all inputs are dropped out, so the output
of the network is a constant function. However, we note this bound becomes exponentially
weak as the input dimension increases.
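As a quick numerical illustration of Proposition 8, the sketch below evaluates the bound $2\epsilon\sqrt{2/q}$, where $q$ is a lower bound on $\Pr(f(x) = f(x'))$: $q = p$ for points on a line parallel to a coordinate axis, and $q = p^D$ for arbitrary points. The function name and the example values of $p$ and $\epsilon$ are illustrative assumptions, and rerunning the proof with $p$ replaced by $p^D$ is our reading of the remark above rather than a result stated separately in the text.

```python
import math


def mean_gap_bound(eps, q):
    """Upper bound 2 * eps * sqrt(2 / q) on |E[f(x)] - E[f(x')]|.

    eps: bound on the predictive standard deviation at both points.
    q:   lower bound on Pr(f(x) = f(x')), e.g. the dropout probability p.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return 2.0 * eps * math.sqrt(2.0 / q)


# Illustrative values (assumptions, not taken from the experiments).
p, eps = 0.5, 0.1

# Points on an axis-parallel line: equality probability at least p.
axis_bound = mean_gap_bound(eps, p)

# Arbitrary points in D dimensions: equality probability at least p ** D.
general_bounds = {D: mean_gap_bound(eps, p ** D) for D in (1, 2, 8, 32)}
```

The values in `general_bounds` grow like $p^{-D/2}$, which is one way to see why the bound for arbitrary points becomes uninformative in high dimensions.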
Appendix D
ConvCNP experimental details
D.1 Baseline neural process models
In both our 1D and image experiments, our main comparison is to conditional neural
process models. In particular, we compare to a vanilla MLP-CNP (1D only; Garnelo
et al. (2018a)) and an ACNP (Kim et al., 2018). Our architectures largely follow the
details given in the relevant publications.
MLP-CNP baseline. Our baseline MLP-CNP follows the implementation provided
by the authors.1 The encoder is a 3-layer MLP with 128 hidden units in each layer, and
ReLU non-linearities. The encoder embeds every context point into a representation,
and the representations are then averaged across each context set. Target inputs
are then concatenated with the latent representations, and passed to the decoder.
The decoder follows the same architecture, outputting mean and standard deviation
channels for each input.
Attentive CNP baseline. The ACNP we use corresponds to the deterministic path
of the model described by Kim et al. (2018) for image experiments. Namely, an encoder
first embeds each context point $c$ to a latent representation $(x^{(c)}, y^{(c)}) \mapsto r^{(c)}_{xy} \in \mathbb{R}^{128}$.
For the image experiments, this is achieved using a 2-hidden layer MLP of hidden
dimension 128. For the 1D experiments, we use the same encoder as the MLP-CNP
above. Every context point then goes through two stacked self-attention layers. Each
self-attention layer is implemented with an 8-headed attention, a skip connection, and
two layer normalizations (as described in Parmar et al. (2018), modulo the dropout
layer). To predict values at each target point $t$, we embed $x^{(t)} \mapsto r^{(t)}_x$ and $x^{(c)} \mapsto r^{(c)}_x$
using the same single hidden layer MLP of dimension 128. A target representation $r^{(t)}_{xy}$
is then estimated by applying cross-attention (using the 8-headed attention described
above) with keys $K := \{r^{(c)}_x\}_{c=1}^{C}$, values $V := \{r^{(c)}_{xy}\}_{c=1}^{C}$, and query $q := r^{(t)}_x$. Given the
estimated target representation $\hat{r}^{(t)}_{xy}$, the conditional predictive posterior is given by
a Gaussian pdf with diagonal covariance parametrised by $(\mu^{(t)}, \sigma^{(t)}_{\text{pre}}) = \mathrm{decoder}(r^{(t)}_{xy})$,
where $\mu^{(t)}, \sigma^{(t)}_{\text{pre}} \in \mathbb{R}^3$, and the decoder is a 4 hidden layer MLP with 64 hidden units per
layer for the images, and the same decoder as the MLP-CNP for the 1D experiments.
1 https://github.com/deepmind/neural-processes
Following Le et al. (2018), we enforce a minimum standard deviation $\sigma^{(t)}_{\min} =
[0.1; 0.1; 0.1]$ to avoid infinite log-likelihoods by using the following post-processed
standard deviation:
\[
\sigma^{(t)}_{\text{post}} = 0.1\,\sigma^{(t)}_{\min} + (1 - 0.1)\log\big(1 + \exp(\sigma^{(t)}_{\text{pre}})\big) \tag{D.1}
\]
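A minimal sketch of the post-processing in Equation (D.1), applied independently to each channel, is given below. The function names and the example inputs are assumptions made for illustration; the softplus is written in a numerically stable form that is algebraically equal to $\log(1 + \exp(\cdot))$.

```python
import math


def softplus(x):
    """Numerically stable evaluation of log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def postprocess_std(sigma_pre, sigma_min=0.1, weight=0.1):
    """Equation (D.1): weight * sigma_min + (1 - weight) * softplus(sigma_pre)."""
    return weight * sigma_min + (1.0 - weight) * softplus(sigma_pre)


# One raw decoder output per RGB channel (made-up values for illustration).
sigma_post = [postprocess_std(s) for s in (-50.0, 0.0, 3.0)]
```

With the constants as written in Equation (D.1), the output approaches $0.1 \times 0.1 = 0.01$ as $\sigma_{\text{pre}} \to -\infty$, so the standard deviation never reaches zero and the log-likelihood stays finite.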
D.2 1-dimensional experiments
In this section, we give details regarding our experiments for the 1D data. In all
experiments, the weights are optimised using Adam (Kingma and Ba, 2014) and weight
decay of $10^{-5}$ is applied to all model parameters.
D.2.1 CNN architectures
We consider two models: ConvCNP (which utilises a smaller architecture), and ConvCNPXL
(with a larger architecture). For all architectures, the input kernel $\psi$ was an EQ
(exponentiated quadratic) kernel with a learnable length scale parameter, as detailed
in Section 5.3, as was the kernel for the final output layer $\psi_\rho$. When dividing by the
density channel, we add $\varepsilon = 10^{-8}$ to avoid numerical issues. The length scales for the
EQ kernels are initialised to twice the spacing $1/\gamma^{1/d}$ between the discretisation points
$(t_i)_{i=1}^{T}$, where $\gamma$ is the density of these points and $d$ is the dimensionality of the input
space $\mathcal{X}$. The architectures for the ConvCNP and ConvCNPXL are described below.
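The length scale initialisation above amounts to a one-line computation, sketched here. The function name is hypothetical; the example uses the density of 64 points per unit quoted for the 1D experiments.

```python
def initial_lengthscale(density, dim=1):
    """Twice the spacing 1 / density ** (1 / dim) between discretisation points.

    density: number of discretisation points per unit volume (gamma in the text).
    dim:     dimensionality d of the input space.
    """
    spacing = 1.0 / density ** (1.0 / dim)
    return 2.0 * spacing


# 1D experiments: 64 points per unit gives a spacing of 1/64.
lengthscale_1d = initial_lengthscale(64, dim=1)
```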
ConvCNP. For the 1D experiments, we use a simple, 4-layer convolutional architecture
with ReLU nonlinearities. The kernel size of the convolutional layers was
chosen to be 5, and all employed a stride of length 1 and zero padding of 2 units.
The number of channels per layer was set to [16, 32, 16, 2], where the final channels
were then processed by the final, EQ-based layer of $\rho$ as mean and standard deviation
channels. We employ a softplus nonlinearity on the standard deviation channel to
enforce positivity. This model has 6,537 parameters.
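The choice of kernel size 5, stride 1 and zero padding 2 keeps the discretisation length unchanged through every layer. The sketch below checks this with the standard convolution output-length formula and also computes the receptive field of the 4-layer stack in grid points. The helper names and the example grid length are assumptions for illustration.

```python
def conv_output_length(n, kernel=5, stride=1, padding=2):
    """Output length of a 1D convolution applied to n grid points."""
    return (n + 2 * padding - kernel) // stride + 1


def receptive_field(num_layers, kernel=5, stride=1):
    """Receptive field, in grid points, of a stack of identical convolutions."""
    field, jump = 1, 1
    for _ in range(num_layers):
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Hypothetical grid of 256 points passed through the 4 layers.
length = 256
for _ in range(4):
    length = conv_output_length(length)

field = receptive_field(num_layers=4)
```

Each output location of the CNN therefore depends on 17 neighbouring discretisation points, before the EQ-based output layer is applied.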
ConvCNPXL. Our large architecture takes inspiration from UNet (Ronneberger
et al., 2015). We employ a 12-layer architecture with skip connections. The number of
channels is doubled every layer for the first 6 layers, and halved every layer for the
final 6 layers. We use concatenation for the skip connections. The following describes
which layers are concatenated, where Li ← [Lj, Lk] means that the input to layer i is
the concatenation of the activations of layers j and k:
• L8 ← [L5, L7],
• L9 ← [L4, L8],
• L10 ← [L3, L9],
• L11 ← [L2, L10],
• L12 ← [L1, L11].
As for the smaller architecture, we use ReLU nonlinearities, kernels of size 5, stride 1,
and zero padding of two units on all layers.
D.2.2 Synthetic data
The kernels used for the Gaussian processes which generate the data are defined as
follows:
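• EQ:
\[
k(x, x') = \exp\Big(-\tfrac{1}{2}\Big(\frac{x - x'}{0.25}\Big)^2\Big),
\]
• weakly periodic:
\[
k(x, x') = \exp\Big(-\tfrac{1}{2}(f_1(x) - f_1(x'))^2 - \tfrac{1}{2}(f_2(x) - f_2(x'))^2\Big) \cdot \exp\Big(-\tfrac{1}{8}(x - x')^2\Big),
\]
with $f_1(x) = \cos(8\pi x)$ and $f_2(x) = \sin(8\pi x)$, and
• Matérn–5/2:
\[
k(x, x') = \Big(1 + 4\sqrt{5}\,d + \tfrac{5}{3}d^2\Big)\, e^{-\sqrt{5}\,d}
\]
with $d = 4|x - x'|$.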
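The three kernels above can be written down directly as scalar functions, as in the following sketch. The function names and the `gram_matrix` helper are assumptions for illustration, and the coefficients (including the $4\sqrt{5}$ on the linear Matérn term) are copied verbatim from the formulas as printed.

```python
import math


def eq_kernel(x, y, lengthscale=0.25):
    """EQ kernel with the length scale 0.25 used in the formula above."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def weakly_periodic_kernel(x, y):
    """Periodic component (period 1/4) multiplied by a slowly decaying EQ term."""
    d1 = math.cos(8 * math.pi * x) - math.cos(8 * math.pi * y)
    d2 = math.sin(8 * math.pi * x) - math.sin(8 * math.pi * y)
    return math.exp(-0.5 * d1 ** 2 - 0.5 * d2 ** 2) * math.exp(-0.125 * (x - y) ** 2)


def matern52_kernel(x, y, linear_coeff=4 * math.sqrt(5)):
    """Matern-5/2 kernel; linear_coeff is the coefficient as printed in the text."""
    d = 4 * abs(x - y)
    return (1 + linear_coeff * d + 5.0 / 3.0 * d ** 2) * math.exp(-math.sqrt(5) * d)


def gram_matrix(kernel, xs):
    """Kernel matrix [k(a, b)] over all pairs of inputs in xs."""
    return [[kernel(a, b) for b in xs] for a in xs]


inputs = [-2.0, -0.5, 0.0, 0.3, 2.0]
K_eq = gram_matrix(eq_kernel, inputs)
```

Sampling functions from the corresponding Gaussian process would additionally require a Cholesky factorisation of such a Gram matrix, which is omitted here.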
Fig. D.1 Example functions learned by (top) the ConvCNP, (center) ACNP, and
(bottom) CNP when trained on an EQ kernel (with length scale parameter 1). “True
function” refers to the sample from the GP prior from which the context and target sets
were sub-sampled. “Ground Truth GP” refers to the GP posterior distribution when
using the exact kernel and performing posterior inference based on the context set.
The left column shows the predictive posterior of the models when data is presented in
the same range as training. The centre column shows the model predicting outside the
training data range when no data is observed there. The right-most column shows the
model predictive posteriors when presented with data outside the training data range.
During the training procedure, the number of context points and target points for a
training batch are each selected randomly from a uniform distribution over the integers
between 3 and 50. The context and target points are then drawn at random
from a function sampled from the process (a Gaussian process with one of the above
kernels or the sawtooth process), with input locations sampled uniformly from the
interval [−2, 2]. All models in this experiment were trained for 200 epochs using 256
batches per epoch of batch size 16. We discretise E(Z) by evaluating 64 points per unit
in this setting. We use a learning rate of 3e−4 for all models, except for ConvCNPXL
on the sawtooth data, where we use a learning rate of 1e−3 (this learning rate was too
large for the other models).
Fig. D.2 Example functions learned by the (top) ConvCNP, (center) ACNP, and
(bottom) CNP when trained on a Matérn-5/2 kernel (with length scale parameter
0.25). “True function” refers to the sample from the GP prior from which the context
and target sets were sub-sampled. “Ground Truth GP” refers to the GP posterior
distribution when using the exact kernel and performing posterior inference based on
the context set. The left column shows the predictive posterior of the models when data
is presented in the same range as training. The centre column shows the model predicting
outside the training data range when no data is observed there. The right-most column
shows the model predictive posteriors when presented with data outside the training
data range.
The random sawtooth samples are generated from the following function:
\[
y_{\text{sawtooth}}(t) = \frac{A}{2} - \frac{A}{\pi} \sum_{k=1}^{\infty} (-1)^k \frac{\sin(2\pi k f t)}{k}, \tag{D.2}
\]
where A is the amplitude, f is the frequency, and t is “time”. Throughout training, we
fix the amplitude to be one. We truncate the series at an integer K. At every iteration,
we sample a frequency uniformly in [3, 5], K in [10, 20], and a random shift in [−5, 5].
As the task is much harder, we sample context and target set sizes over [3, 100]. Here
the MLP-CNP and ACNP employ learning rates of $10^{-3}$. All other hyperparameters
remain unchanged.
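The truncated series and the per-iteration sampling of its parameters can be sketched as follows. Two points are assumptions: the shift is applied as $t - s$ (Equation (D.2) does not show it explicitly, but the sawtooth process in Appendix F.1 is written this way), and $K$ is drawn as an integer with both endpoints included.

```python
import math
import random


def sawtooth(t, freq, K, amplitude=1.0, shift=0.0):
    """Equation (D.2) truncated at K terms, evaluated at time t."""
    series = sum(
        (-1) ** k * math.sin(2 * math.pi * k * freq * (t - shift)) / k
        for k in range(1, K + 1)
    )
    return amplitude / 2 - amplitude / math.pi * series


def sample_sawtooth_params(rng):
    """Draw the frequency, truncation level and shift for one task."""
    return {
        "freq": rng.uniform(3, 5),
        "K": rng.randint(10, 20),  # assumption: integer, endpoints inclusive
        "shift": rng.uniform(-5, 5),
    }


rng = random.Random(0)
params = sample_sawtooth_params(rng)
ys = [sawtooth(t / 64, **params) for t in range(-128, 129)]
```

The last line evaluates one sampled function on a grid over $[-2, 2]$; in training, the inputs would instead be drawn uniformly at random from that interval.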
We include additional figures showing the performance of ConvCNPs, ACNPs and
MLP-CNPs on GP and sawtooth function regression tasks in Figures D.1 to D.3.
Fig. D.3 Example functions learned by the (top) ConvCNP, (center) ACNP, and
(bottom) CNP when trained on a random sawtooth sample. The left column shows the
predictive posterior of the models when data is presented in the same range as training.
The centre column shows the model predicting outside the training data range when no
data is observed there. The right-most column shows the model predictive posteriors
when presented with data outside the training data range.
D.3 Image experimental details and additional results
D.3.1 Experimental details
Training details. In all experiments, we sample the number of context points uniformly
from $U(n_{\text{total}}/100,\, n_{\text{total}}/2)$, and the number of target points is set to $n_{\text{total}}$. The context
and target points are sampled randomly from each of the 16 images per batch. The
weights are optimised using Adam (Kingma and Ba, 2014) with learning rate $5 \times 10^{-4}$.
We use a maximum of 100 epochs, with early stopping of 15 epochs patience. All pixel
values are divided by 255 to rescale them to the [0, 1] range. In the following discussion,
we assume that images are RGB, but very similar models can be used for greyscale
images or other gridded inputs (e.g. 1D time series sampled at uniform intervals).
Proposed convolutional CNP. Unlike ACNP and off-the-grid ConvCNP, on-the-grid
ConvCNP takes advantage of the gridded structure. Namely, the target and
context points can be specified in terms of the image, a context mask $M_c$, and a
target mask $M_t$ instead of sets of input–value pairs. Although this is an equivalent
formulation, it is more natural and simpler to implement in standard deep
learning libraries. In the following, we dissect the architecture and algorithmic steps
succinctly summarised in Section 5.3. Note that all the convolutional layers are actually
depthwise separable (Chollet, 2017); this enables a large kernel size (i.e. receptive field)
while being parameter and computationally efficient.
1. Let I denote the image. Select all context points signal := Mc ⊙ I and append
a density channel density := Mc, which intuitively says that “there is a point at
this position”: [signal, density]⊤. Each pixel value will now have 4 channels: 3
RGB channels and 1 density channel Mc. Note that the mask will set the pixel
value to 0 at a location where the density channel is 0, indicating there are no
points at this position (a missing value).
2. Apply a convolution to the density channel density′ = convθ(density) and a
normalised convolution to the signal signal′ := convθ(signal)/density′. The
normalised convolution makes sure that the output mostly depends on the scale
of the signal rather than the number of observed points. The output channel size
is 128 dimensional. The kernel size of convθ depends on the image shape and
model used (Table D.1). We also enforce element-wise positivity of the trainable
filter by taking the absolute value of the kernel weights θ before applying the
convolution. Note that in this setting, E(Z) is [signal′, density′]⊤.
3. Now we describe the on-the-grid version of ρ(·), which we decompose into two
stages. In the first stage, we apply a CNN to [signal′, density′]⊤. This CNN is
composed of residual blocks (He et al., 2016), each consisting of 1 or 2 (Table D.1)
convolutional layers with ReLU activations and no batch normalisation. The
number of output channels in each layer is 128. The kernel size is the same across
the whole network, but depends on the image shape and model used (Table D.1).
4. In the second stage of $\rho(\cdot)$, we apply a shared pointwise MLP $: \mathbb{R}^{128} \to \mathbb{R}^{2C}$ (we
use the same architecture as used for the ACNP decoder) to the output of the
first stage at each pixel location in the target set. Here C denotes the number of
channels in the image. The first C outputs of the MLP are treated as the means
of a Gaussian predictive distribution, and the last C outputs are treated as the
standard deviations. These then pass through the positivity-enforcing function
shown in Equation (D.1).
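Steps 1 and 2 above can be illustrated on a single-channel 1D grid (one of the "other gridded inputs" mentioned earlier). The sketch below is not the implementation used in the experiments: it uses plain Python lists, one shared filter rather than 128 depthwise-separable output channels, and a small constant `eps` in the division, which is borrowed from the 1D setup in Section D.2.1 and is an assumption here.

```python
def conv1d_same(values, kernel):
    """Zero-padded 1D cross-correlation that preserves length (odd kernel size)."""
    half = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:
                acc += w * values[idx]
        out.append(acc)
    return out


def normalised_conv(image, mask, raw_kernel, eps=1e-8):
    """Return (signal', density') for a gridded signal and a 0/1 context mask."""
    kernel = [abs(w) for w in raw_kernel]           # element-wise positivity
    signal = [m * v for m, v in zip(mask, image)]   # signal := M_c * I
    density = [float(m) for m in mask]              # density := M_c
    density_out = conv1d_same(density, kernel)
    signal_conv = conv1d_same(signal, kernel)
    signal_out = [s / (d + eps) for s, d in zip(signal_conv, density_out)]
    return signal_out, density_out


# A constant image observed at only four of eight grid locations.
image = [0.5] * 8
mask = [1, 0, 0, 1, 0, 1, 1, 0]
signal_out, density_out = normalised_conv(image, mask, [-0.2, 0.5, 1.0, 0.5, -0.2])
```

For this constant image, `signal_out` is approximately 0.5 at every location regardless of how many context points fall inside each window, which is the sense in which the normalised convolution depends on the scale of the signal rather than on the number of observed points.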
Table D.1 CNN architecture for the image experiments.

Model        Input Shape   conv_θ        CNN           CNN Num.      Conv. Layers
                           Kernel Size   Kernel Size   Res. Blocks   per Block
ConvCNP      < 50 pixels   9             5             4             1
ConvCNP      > 50 pixels   7             3             4             1
ConvCNP XL   any           9             11            6             2
D.3.2 ACNP and ConvCNP qualitative comparison
Figure D.4 shows the test log-likelihood distributions of an ACNP and ConvCNP
model as well as some qualitative comparisons between the two.
Although most mean predictions of both models look relatively similar for SVHN
and CelebA32, the real advantage of the ConvCNP becomes apparent when testing the
generalization capacity of both models. Figure D.5 shows the ConvCNP and ACNP
trained on CelebA32 and tested on a downscaled version of Ellen’s famous Oscar selfie.
We see that ConvCNP generalises better in this setting.2
2 The reconstruction looks worse than Figure 5.6b despite the larger context set because the
test image has been downscaled and the models are trained on low-resolution CelebA32. These
constraints come from ACNP's large memory footprint.
[Figure D.4 panels: (a) MNIST, (b) SVHN, (c) CelebA 32 × 32, (d) CelebA 64 × 64.]
Fig. D.4 Log-likelihood and qualitative comparisons between ACNP and ConvCNP
on four standard benchmarks. The top row shows the log-likelihood distribution for
both models. The images below correspond to the context points (top), ConvCNP
target predictions (middle), and ACNP target predictions (bottom). Each column
corresponds to a given percentile of the ConvCNP distribution. ACNP could not be
trained on CelebA64 due to its memory inefficiency.
Fig. D.5 Qualitative evaluation of a ConvCNP (center) and ACNP (right) trained
on CelebA32 and tested on a downscaled version (146 × 259) of Ellen’s Oscar selfie
(DeGeneres, 2014) with 20% of the pixels as context (left).

Appendix E
Effect of number of samples used on
evaluation of latent neural processes
As the exact log-likelihoods of latent neural process models are intractable, quantitative
evaluation and comparison of models is challenging. Instead, we compare models by
using an estimate of the log-likelihood. A natural candidate is $\hat{\mathcal{L}}_{\mathrm{ML}}$. However, unless
the number of samples $L$ used is large, $\hat{\mathcal{L}}_{\mathrm{ML}}$ is conservative and tends to significantly
underestimate the log-likelihood. One way to improve the estimate of $\hat{\mathcal{L}}_{\mathrm{ML}}$ is through
importance weighting (IW) (Le et al., 2018; Wu et al., 2017). Denoting $D = D_c \cup D_t$,
the ConvLNP encoder $E_\phi(D)$ can be used as a proposal distribution:
\[
\hat{\mathcal{L}}_{\mathrm{IW}}(\theta, \phi; \xi) := \log\Bigg[\frac{1}{L}\sum_{l=1}^{L} \exp\Bigg(\log w(z_l) + \sum_{(x,y)\in D_t} \log p_\theta(y \mid x, z_l)\Bigg)\Bigg], \quad z_l \sim E_\phi(D), \tag{E.1}
\]
where the importance weights are given by $\log w(z_l) := \log q_\phi(z_l \mid D_c) - \log q_\phi(z_l \mid D)$.
Here $q_\phi(z \mid D)$ is the density of the encoder distribution. We find that training models
with $\hat{\mathcal{L}}_{\mathrm{ML}}$ results in encoders that are ill-suited as proposal distributions since the
distribution over some of the latent variables can become deterministic, so we only use
$\hat{\mathcal{L}}_{\mathrm{IW}}$ to evaluate models trained with $\mathcal{L}_{\mathrm{NPVI}}$.
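Given per-sample log-weights and per-sample target log-likelihoods, Equation (E.1) is a log-mean-exp, which is best computed with the usual max-subtraction trick to avoid underflow. The sketch below assumes these quantities have already been computed by the model; the function names are hypothetical. It also assumes, following the definition in the main text, that $\hat{\mathcal{L}}_{\mathrm{ML}}$ is the same expression with all log-weights set to zero and samples drawn from the context-only encoder.

```python
import math


def log_mean_exp(values):
    """Stable computation of log((1 / L) * sum_l exp(values[l]))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iw_estimate(log_w, log_lik):
    """Equation (E.1).

    log_w[l]:   log q(z_l | D_c) - log q(z_l | D) for sample z_l ~ E(D).
    log_lik[l]: sum over target points of log p(y | x, z_l).
    """
    return log_mean_exp([w + ll for w, ll in zip(log_w, log_lik)])


def ml_estimate(log_lik):
    """Unweighted estimate, assuming samples z_l come from the context-only encoder."""
    return log_mean_exp(log_lik)


# Toy numbers: two samples with likelihoods 1 and 3 and unit weights.
toy = iw_estimate([0.0, 0.0], [math.log(1.0), math.log(3.0)])
```

The toy value equals $\log 2$, the log of the average likelihood; with realistic log-likelihoods in the hundreds or thousands in magnitude, the naive form without max-subtraction would underflow to $\log 0$.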
Effect of number of samples used during evaluation. Figure E.1 demonstrates
the effect of the number of samples $L$ used to estimate the evaluation objective for
the ConvLNP and ANP trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ and $\mathcal{L}_{\mathrm{NPVI}}$. The models used to generate
Figure E.1 are the same models used in Section 5.6.1, i.e. having heteroskedastic noise.
Observe the general trend that the log-likelihood estimates tend to increase with $L$,
as expected. The ANP trained with $\mathcal{L}_{\mathrm{NPVI}}$ collapsed to a conditional ANP, meaning
that the encoder became deterministic; in that case, $\hat{\mathcal{L}}_{\mathrm{ML}}$ is exact, which means that
larger $L$ and importance weighting will not increase the estimate. In contrast, the
ANP trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ did not collapse, and we see that there the estimate increases
with $L$. For the ConvLNP trained with $\mathcal{L}_{\mathrm{NPVI}}$, evaluating with $\hat{\mathcal{L}}_{\mathrm{IW}}$ yields a significant
increase, showing that the bound estimated with $\mathcal{L}_{\mathrm{NPVI}}$ is very loose. The models
trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ tend to be the best performing, although the ConvLNP trained with
$\mathcal{L}_{\mathrm{NPVI}}$ is best for the weakly periodic kernel and appears to still be increasing with $L$.
In this thesis, all log-likelihood lower bounds for LNPs are computed with $\hat{\mathcal{L}}_{\mathrm{ML}}$ if
the model was trained using $\hat{\mathcal{L}}_{\mathrm{ML}}$ and with $\hat{\mathcal{L}}_{\mathrm{IW}}$ if the model was trained using $\mathcal{L}_{\mathrm{NPVI}}$.

[Figure E.1 plot: panels (a) Matérn–5/2 and (b) weakly periodic kernel; horizontal axis: number of samples L in evaluation loss, from 1 to 16384; curves: ConvNP (LML), ConvNP (LNP + IW), ConvNP (LNP), ANP (LML), ANP (LNP + IW), ANP (LNP).]
Fig. E.1 Log-likelihood bounds achieved by various combinations of models and training
objectives when evaluated with $\hat{\mathcal{L}}_{\mathrm{ML}}$ and $\hat{\mathcal{L}}_{\mathrm{IW}}$ for various numbers of samples $L$. Color
indicates model. Solid lines correspond to models trained and evaluated with $\hat{\mathcal{L}}_{\mathrm{ML}}$.
Dashed lines correspond to models trained with $\mathcal{L}_{\mathrm{NP}}$ and evaluated with $\hat{\mathcal{L}}_{\mathrm{IW}}$. Dotted
lines correspond to models trained with $\mathcal{L}_{\mathrm{NP}}$ and evaluated with $\hat{\mathcal{L}}_{\mathrm{ML}}$. In this figure
$\mathcal{L}_{\mathrm{NP}}$ is used as an abbreviation for $\mathcal{L}_{\mathrm{NPVI}}$.

Fig. E.2 Interpolation performance (within training range) for context set sizes uniformly
sampled from {0, . . . , 50} of the ConvNP and ANP on Matérn–5/2 samples. The models
are trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ and $\mathcal{L}_{\mathrm{NPVI}}$ for various numbers of samples $L$. Models trained with
$\hat{\mathcal{L}}_{\mathrm{ML}}$ are evaluated with $\hat{\mathcal{L}}_{\mathrm{ML}}$, while models trained with $\mathcal{L}_{\mathrm{NPVI}}$ are evaluated with $\hat{\mathcal{L}}_{\mathrm{IW}}$.
At evaluation, all bounds are estimated using 2,048 samples. In this figure $\mathcal{L}_{\mathrm{NP}}$ is used
as an abbreviation for $\mathcal{L}_{\mathrm{NPVI}}$.
Effect of number of samples used during training. Figure E.2 shows the effect
of the number of samples $L$ in the training objectives on the performance of the
ConvNP and ANP. Observe that the performance of $\hat{\mathcal{L}}_{\mathrm{ML}}$ reliably increases with the
number of samples $L$ and that $\hat{\mathcal{L}}_{\mathrm{ML}}$ outperforms $\mathcal{L}_{\mathrm{NPVI}}$. The performance for $\mathcal{L}_{\mathrm{NPVI}}$
does not appear to increase with the number of samples $L$ and appears noisier
than $\hat{\mathcal{L}}_{\mathrm{ML}}$. Note that the models used for Figure E.2 were trained with homoskedastic
observation noise. This is achieved by pooling the output of the model corresponding
to the predictive standard deviation, i.e., $f_\sigma$, over the time dimension.

Appendix F
ConvLNP experimental details
F.1 Experimental details on 1D regression
In the 1D regression experiments, we consider the following generative processes:
EQ: samples from a Gaussian process with the following exponentiated-quadratic kernel:
\[
k(t, t') = \exp\Big(-\tfrac{1}{8}(t - t')^2\Big);
\]
Matérn–5/2: samples from a Gaussian process with the following Matérn–5/2 kernel:
\[
k(t, t') = \Big(1 + 4\sqrt{5}\,d + \tfrac{5}{3}d^2\Big)\exp\Big(-\sqrt{5}\,d\Big)
\]
with $d = 4|t - t'|$;
noisy mixture: samples from a Gaussian process with the following noisy mixture kernel:
\[
k(t, t') = \exp\Big(-\tfrac{1}{8}(t - t')^2\Big) + \exp\Big(-\tfrac{1}{2}(t - t')^2\Big) + 10^{-3}\,\delta[t - t'];
\]
weakly periodic: samples from a Gaussian process with the following weakly-periodic kernel:
\[
k(t, t') = \exp\Big(-\tfrac{1}{2}(f_1(t) - f_1(t'))^2 - \tfrac{1}{2}(f_2(t) - f_2(t'))^2 - \tfrac{1}{8}(t - t')^2\Big)
\]
with $f_1(t) = \cos(8\pi t)$ and $f_2(t) = \sin(8\pi t)$; and
sawtooth: samples from the following sawtooth process:
\[
f(t) = \frac{A}{2} - \frac{A}{\pi}\sum_{k=1}^{K} (-1)^k \frac{\sin(2\pi k f (t - s))}{k}
\]
with $A = 1$, $f \sim U[3, 5]$, $s \sim U[-5, 5]$, and $K \in \{10, \ldots, 20\}$ chosen
uniformly.
We compare the following models, where all activation functions are leaky ReLUs
with leak 0.1:
ConvCNP: The first model is the ConvCNP. The architecture of the ConvCNP
is equal to that of the encoder in the ConvLNP, described next.
ConvLNP: The second model is the ConvLNP as described in the main body.
The functional embedding uses separate length scales for the data
channel and density channel (Figure 5.8), which are initialized to
twice the inter-point spacing of the discretization and learned during
training. The discretization uniformly ranges over $[\min(x) - 1, \max(x) + 1]$
at density $\rho = 64$ points per unit, where $\min(x)$ is the
minimum $x$ value occurring in the union of the context and target
sets in the current batch and $\max(x)$ is the corresponding maximum $x$
value. The discretization is passed through a 10-layer (excluding an
initial and final point-wise linear layer) CNN with 64 channels and
depthwise-separable convolutions. The width of the filters depends
on the data set and is chosen such that the receptive field sizes are
as follows:
EQ: 2,
Matérn–5/2: 2,
noisy mixture: 4,
weakly periodic: 4,
sawtooth: 16.
The discretized functional representation consists of 16 channels. The
smoothing at the end of the encoder also has separate length scales
for the mean and variance, which are initialized similarly and learned.
The encoder parametrizes the standard deviations by passing the
output of the CNN through a softplus. The decoder has the same
architecture as the encoder.
ANP: The third model is the Attentive NP with latent dimensionality
$d = 128$ and 8-head dot-product attention (Vaswani et al., 2017).
In the attentive deterministic encoder, the keys ($t$), queries ($t$), and
values (concatenation of $t$ and $y$) are transformed by a three-layer
MLP of constant width $d$. The dot products are normalised by $\sqrt{d}$.
The output of the attention mechanism is passed through a constant-width
linear layer, which is then passed through two layers of layer
normalization (Ba et al., 2016) to normalise the latent representation.
In the first of these two layers, first the transformed queries are passed
through a constant-width linear layer and added to the input. In
the second of these two layers, the output of the first layer is first
passed through a two-layer constant-width MLP and added to itself,
making a residual layer. In the stochastic encoder, the inputs and
outputs are concatenated and passed through a three-layer MLP of
constant width $d$. The result is mean-pooled and passed through a
two-layer constant-width MLP. The decoder consists of a three-layer
MLP of constant width $d$.
NP: The fourth model is the original NP (Garnelo et al., 2018b). The
architecture is similar to that of the ANP, where the architecture
of the deterministic encoder is replaced by that of the stochastic
encoder.
For all models, positivity of the observation noise is enforced with a softplus function.
Parameter counts of the ConvCNP, ConvLNP, ANP, and MLP-LNP are listed in
Table F.1.
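The discretization used by the ConvLNP (and hence the ConvCNP) in the model descriptions above can be sketched as follows. How the number of grid points is rounded when the range is not a multiple of 1/64 is not specified in the text; rounding up, as done here, is an assumption, as are the function and argument names.

```python
import math


def discretisation_grid(x_context, x_target, margin=1.0, points_per_unit=64):
    """Uniform grid over [min(x) - margin, max(x) + margin].

    min and max are taken over the union of the context and target inputs.
    """
    xs = list(x_context) + list(x_target)
    lower, upper = min(xs) - margin, max(xs) + margin
    # Assumption: round the number of intervals up so the density is at least
    # points_per_unit, then space the points evenly between the endpoints.
    num_points = int(math.ceil((upper - lower) * points_per_unit)) + 1
    step = (upper - lower) / (num_points - 1)
    return [lower + i * step for i in range(num_points)]


# Inputs spanning the training range [-2, 2] give a grid over [-3, 3].
grid = discretisation_grid([-2.0, 0.5], [2.0])
```

Because the grid follows the data rather than a fixed interval, the same model can be evaluated on inputs outside the training range without any change to the architecture.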
The models are trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$ ($L = 20$) and $\mathcal{L}_{\mathrm{NPVI}}$ ($L = 5$). For $\mathcal{L}_{\mathrm{NPVI}}$, the
context set is appended to the target set when evaluating the objective. The models
are optimised using Adam with learning rate $5 \cdot 10^{-3}$ for 100 epochs. One epoch
consists of $2^{14}$ tasks divided into batches of size 16. For training, the inputs of the
context and target sets are sampled uniformly from [−2, 2]. The size of the context set
is sampled uniformly from {0, . . . , 50} and the size of the target set is fixed to 50. To
encourage the LNP-based models (but not the CNP-based models) to fit and not revert
to their conditional variants, the observation noise standard deviation $\sigma$ is held fixed
to $10^{-2}$ for the first 20 epochs.

Model      EQ        Matérn–5/2   Noisy Mixt.   Weakly Per.   Sawtooth
ConvCNP    42 822    42 822       51 014        51 014        100 166
ConvLNP    88 486    88 486       104 870       104 870       203 174
ANP        530 178   530 178      530 178       530 178       530 178
MLP-LNP    479 874   479 874      479 874       479 874       479 874
Table F.1 Parameter counts of models in 1D regression.
For evaluation, the size of the context set is sampled uniformly from {0, . . . , 10}, and
the losses are evaluated with L = 5000 and batch size one. To test interpolation within
the training range, the inputs of the context and target sets are, like training, sampled
uniformly from [−2, 2]. To test interpolation beyond the training range, the inputs of
the context and target sets are sampled uniformly from [2, 6]. To test extrapolation
beyond the training range, the inputs of the context sets are sampled uniformly from
[−2, 2] and the inputs of the target sets are sampled uniformly from [−4,−2]∪ [2, 4]. As
described in Appendix E, models trained with LNPVI are evaluated using importance
weighting to obtain a better estimate of the evaluation loss.
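The three evaluation regimes differ only in the intervals from which context and target inputs are drawn, which the sketch below makes explicit. The dictionary keys and function names are hypothetical, and the target set size of 50 at evaluation time is an assumption carried over from training, since the text only specifies the context set size for evaluation.

```python
import random

# Intervals for context and target inputs in each evaluation regime.
EVAL_REGIMES = {
    "interpolation_within": {"context": [(-2, 2)], "target": [(-2, 2)]},
    "interpolation_beyond": {"context": [(2, 6)], "target": [(2, 6)]},
    "extrapolation_beyond": {"context": [(-2, 2)], "target": [(-4, -2), (2, 4)]},
}


def sample_inputs(rng, n, intervals):
    """Draw n inputs uniformly from a union of equal-length intervals."""
    return [rng.uniform(*rng.choice(intervals)) for _ in range(n)]


def sample_eval_inputs(rng, regime, n_target=50, max_context=10):
    """Sample context and target input locations for one evaluation task."""
    spec = EVAL_REGIMES[regime]
    n_context = rng.randint(0, max_context)  # uniform over {0, ..., 10}
    x_context = sample_inputs(rng, n_context, spec["context"])
    x_target = sample_inputs(rng, n_target, spec["target"])
    return x_context, x_target


rng = random.Random(0)
x_context, x_target = sample_eval_inputs(rng, "extrapolation_beyond")
```

Picking one interval uniformly and then a point uniformly inside it is only uniform over the union because the two target intervals $[-4, -2]$ and $[2, 4]$ have equal length.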
F.2 Experimental details on image completion
F.2.1 Data details
We use three standard datasets throughout our image experiments: SVHN (Netzer
et al., 2011), MNIST (LeCun et al., 1989), and 32 × 32 CelebA (Netzer et al., 2011).
The aforementioned standard datasets all contain only a single, well-centered object.
To evaluate the translation equivariance and generalisation capabilities of our model
we evaluate on the zero shot multi-MNIST (ZSMM) task described in Section 5.4.2.
Namely, we generate a test set by randomly sampling with replacement 10000 pairs
of digits from the MNIST test set, place them on a black 56 × 56 background, and
translate the digits in such a way that the digits can be arbitrarily close but cannot
overlap. However, we make one change from the dataset described in Section 5.4.2: the
training set now consists of the standard MNIST digits (instead of a single digit placed
in the center of a 56 × 56 canvas), augmented by up to 4 pixel shifts (Figure 5.5a). The
model thus has to generalise both to a larger canvas size as well as to seeing multiple
digits.
For all data sets, pixel values are divided by 255 to rescale them to the [0, 1] range.
We evaluate on predefined test splits when available (MNIST, SVHN, ZSMM) and
make our own test set for CelebA by randomly selecting 10% of the data. For each
dataset we also set aside 10% of the training set as validation.
F.2.2 Training details
In all experiments, we sample the number of context pixels uniformly from $U(0, n_{\text{total}}/2)$,
and the number of target points is set to $n_{\text{total}}$. The weights are optimised using Adam
(Kingma and Ba, 2014) with learning rate $5 \times 10^{-4}$. We use a maximum of 100 epochs,
with early stopping (based on log likelihood on the validation set) with a patience of 10
epochs. Unless stated otherwise, we use $L = 16$ samples from the latent function
during training, and $L = 128$ at test time. We clip the $\ell_2$ norm of all gradients to
1, which was particularly important for ConvLNP. We use a batch size of 32 for all
models besides ANP trained on ZSMM which used a batch size of 8 due to memory
constraints.
F.2.3 General architecture details
For all models, we follow Le et al. (2018) and process the predicted standard deviation
of the latent function σz using a sigmoid and the standard deviation σ of the predictive
distribution using lower-bounded softplus:
\[
\sigma_z = 0.001 + (1 - 0.001)\,\frac{1}{1 + \exp(f_{\sigma,z})} \tag{F.1}
\]
\[
\sigma = 0.001 + (1 - 0.001)\ln(1 + \exp(f_\sigma)) \tag{F.2}
\]
As the pixels are rescaled to [0, 1], we also process the mean of the posterior predictive
(conditioned on a single sample) to be in [0, 1] using a logistic function
\[
\mu = \frac{1}{1 + \exp(-f_\mu)} \tag{F.3}
\]
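The three output transformations can be sketched as scalar functions, applied element-wise in practice. The function names are hypothetical. Equation (F.1) is implemented exactly as printed, i.e. with $\exp(f_{\sigma,z})$ in the denominator, which equals the logistic function evaluated at $-f_{\sigma,z}$; the logistic and softplus are written in overflow-safe forms.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic(x):
    """Numerically stable 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def latent_std(f_sigma_z, floor=0.001):
    """Equation (F.1) as printed: 1 / (1 + exp(f)) equals logistic(-f)."""
    return floor + (1.0 - floor) * logistic(-f_sigma_z)


def predictive_std(f_sigma, floor=0.001):
    """Equation (F.2): lower-bounded softplus."""
    return floor + (1.0 - floor) * softplus(f_sigma)


def predictive_mean(f_mu):
    """Equation (F.3): logistic function mapping to [0, 1]."""
    return logistic(f_mu)
```

With these definitions the latent standard deviation lies in $[0.001, 1]$, the predictive standard deviation is at least $0.001$ and unbounded above, and the predictive mean lies in $[0, 1]$, matching the rescaled pixel range.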
In the following, we describe the architecture of ANP and ConvLNP. Unless stated
otherwise, all vectors in the following paragraphs are in R128 and all MLPs have 128
hidden units.
ANP details. We provide details for the ANP trained with $\hat{\mathcal{L}}_{\mathrm{ML}}$. As the ANP cannot
take advantage of the fact that images are on the grid, we preprocess each pixel so that
$x \in [-1, 1]^2$. The only exception is the test set of ZSMM, where $x \in [-\tfrac{56}{32}, \tfrac{56}{32}]^2$
as the model is trained on 32 × 32 but evaluated on 56 × 56 images. Let superscript $c$
index the context points from $1, \ldots, C$, and let superscript $t$ index the target points
from $1, \ldots, T$. Each context feature is first encoded $x^{(c)} \mapsto r^{(c)}_x$ by a single hidden
layer MLP, while a second single hidden layer MLP encodes values $y^{(c)} \mapsto r^{(c)}_y$. We
produce a representation $r^{(c)}_{xy}$ by summing both representations $r^{(c)}_x + r^{(c)}_y$ and passing
them through two self-attention layers (Vaswani et al., 2017). Following Parmar
et al. (2018), each self-attention layer is implemented as 8-headed attention, a skip
connection, and two layer normalizations (Ba et al., 2016). To predict values at each
target point $t$, we embed $x^{(t)} \mapsto r^{(t)}_x$ using the MLP used for $r^{(c)}_x$. A deterministic
target representation $r^{(t)}_{xy}$ is then computed by applying cross-attention (using the
8-headed attention described above) with keys $K := \{r^{(c)}_x\}_{c=1}^{C}$, values $V := \{r^{(c)}_{xy}\}_{c=1}^{C}$,
and query $q := r^{(t)}_x$. For the latent path, we average over context representations $r^{(c)}_{xy}$,
and pass the resulting representation through a single hidden layer MLP that outputs
$(\mu_z, \sigma_z) \in \mathbb{R}^{256}$. $\sigma_z$ is made positive by post-processing it using Equation (F.1). We then
sample (with reparameterisation (Kingma and Welling, 2013)) $L$ latent representations
$z_l \sim \mathcal{N}(z; \mu_z, \sigma_z^2)$.
We describe the remainder of the forward pass for a single $z_l$, though in practice
multiple samples may be processed in parallel. The deterministic and latent
representations of the context set are concatenated, and the resulting representation is
passed through a linear layer $[r^{(t)}_{xy}; z_l] \mapsto r^{(t)}_{xyz} \in \mathbb{R}^{128}$. Given the target and context-set
representations, the predictive posterior is given by a Gaussian pdf with diagonal
covariance parametrised by $(\mu^{(t)}, \sigma^{(t)}_{\text{pre}}) = \mathrm{decoder}([r^{(t)}_x; r^{(t)}_{xyz}])$, where $\mu^{(t)}, \sigma^{(t)}_{\text{pre}} \in \mathbb{R}^3$ and
the decoder is a 4 hidden layer MLP. Finally, $\sigma^{(t)}_{\text{pre}}$ is processed by Equation (F.2)
and $\mu^{(t)}$ by Equation (F.3). In the case of MNIST and ZSMM, $\sigma^{(t)}$ is also spatially mean
pooled, which corresponds to using homoskedastic noise. This improves the qualitative
performance by forcing ANP and ConvLNP to model the digit instead of focusing on
predicting the black background with high confidence. Kim et al. (2018) did not suffer
from that issue as they used a much larger lower bound for Equation (F.2).
ConvLNP details The core algorithm of on-the-grid ConvLNP is outlined in Fig-
ure 5.10 and Figure 5.2c. Here we discuss the parameterisations used for each step of
the algorithm. All convolutional layers are depthwise separable (Chollet, 2017). convθ
is a convolutional layer with kernel size of 11 (no bias). Following Gordon et al. (2020),
we enforce positivity on the weights in the first convolutional layer by only convolving
their absolute value with the signal.
F.2 Experimental details on image completion 199
The CNNs are ResNets (He et al., 2016) with 9 blocks, where each convolution has a
kernel size of 3. Each residual block consists of two convolutional layers, pre-activation
batch normalization layers (Ioffe and Szegedy, 2015), and ReLU activations. The output
of the pre-latent CNN (CNN in Figure 5.2c) goes through a single hidden layer MLP
that outputs $(\mu_z, \sigma_z) \in \mathbb{R}^{256}$. As with ANP, $f_{\sigma,z}$ is processed by Equation (F.1) and
then used to sample (with reparameterisation (Kingma and Welling, 2013)) $L$ latent
functions $Z_l$. Importantly, we found that the coherence of samples improves if the model
uses a global representation in addition to the pixel-dependent representation. We
achieve this by mean-pooling half of the functional representation. Namely, we replace
$z_l$ by the channel-wise concatenation of $z_l^{(1:64)}$ and $\mathrm{mean}(z_l^{(65:128)})$, where the mean is
taken over the spatial dimensions. This latent function then goes through the
post-latent CNN (CNN in Figure 5.10), as well as a linear layer to output $(f_\mu, f_\sigma) \in \mathbb{R}^{256}$. As
for the ANP, $f_\mu$ is processed by Equation (F.3) and $f_\sigma$ is re-scaled with Equation (F.2)
and is spatially pooled in the case of MNIST and ZSMM to obtain homoskedastic
noise.
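The mixing of pixel-dependent and global channels described above can be sketched as follows. This is a minimal plain-Python illustration on a toy tensor stored as nested lists indexed `[channel][row][col]`; the helper name `add_global_channels` is hypothetical, and broadcasting the pooled value back over the grid (so that the result can still be concatenated channel-wise) is an assumption about how the concatenation is realised.

```python
def add_global_channels(z, n_local):
    """Keep the first n_local channels unchanged and replace every remaining
    channel by its spatial mean, broadcast over the grid."""
    out = []
    for c, channel in enumerate(z):
        if c < n_local:
            out.append([row[:] for row in channel])
        else:
            values = [v for row in channel for v in row]
            mean = sum(values) / len(values)
            out.append([[mean] * len(row) for row in channel])
    return out


# Toy example: 4 channels on a 2x3 grid, 2 local and 2 pooled channels.
# In the text the split is channels 1:64 (local) and 65:128 (pooled).
z = [[[float(c + i + j) for j in range(3)] for i in range(2)] for c in range(4)]
z_mixed = add_global_channels(z, n_local=2)
```

The pooled channels carry the same value at every pixel, which gives later layers access to information about the whole image.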
F.2.4 Additional results on image completion
We provide additional qualitative samples and quantitative analyses for the ConvLNP
and ANP.
Additional ConvLNP samples Figure F.1 provides further samples from a
ConvLNP trained with LˆML. We observe that the ConvLNP produces reasonably diverse
yet coherent samples when evaluated in a regime that resembles the training regime
(in the first four sub-columns of MNIST, SVHN, and CelebA). However, Figure F.1
also demonstrates that the ConvLNP struggles with context sets that are significantly
different from those seen during training.
Further comparisons of ANP and ConvLNP We provide further qualitative
comparisons of ConvLNPs, ANPs trained with LˆML, and ANPs trained with LNPVI.
We omit ConvLNPs trained with LNPVI as these are significantly outperformed by
ConvLNPs trained with LˆML (see e.g. Table 5.4).
Figure F.2 shows that all models perform relatively well when context sets are
drawn from a similar distribution as employed during training (first four sub-columns
of MNIST, SVHN, and CelebA). Furthermore, we observe that samples from the
ConvLNP prior tend to be closer to samples from the underlying data distribution (e.g.
for CelebA).
200 ConvLNP experimental details
Fig. F.1 Qualitative samples for the ConvLNP trained with LˆML in Table 5.4. From
top to bottom the four major rows correspond to MNIST, ZSMM, SVHN, CelebA32
datasets. For each dataset and each of the two major columns, a different image is
randomly sampled; the first sub-row shows the given context points (missing pixels are
in blue for MNIST and ZSMM but in black for SVHN and CelebA), while the next
three sub-rows show the mean of the posterior predictive corresponding to different
samples of the latent function. To show diverse samples we select three samples that
maximize the average Euclidean distance between pixels of the samples. From left
to right the first four sub-columns correspond to a context set with 0%, 1%, 3%,
10% randomly sampled context points. In the last two sub-columns, the context sets
respectively contain all the pixels in the left and top half of the image.
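One way to read the sample-selection step in the caption above is as a brute-force search for the triple of samples with the largest mean pairwise Euclidean distance. The sketch below illustrates that reading only; the function names and the toy "images" (flattened to short lists) are hypothetical, and the exact selection procedure used for the figure may differ.

```python
import itertools
import math


def euclidean(a, b):
    """Euclidean distance between two flattened images."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def most_diverse_subset(samples, k=3):
    """Indices of the k samples with the largest mean pairwise distance."""
    best, best_score = None, -1.0
    for idx in itertools.combinations(range(len(samples)), k):
        pairs = list(itertools.combinations(idx, 2))
        score = sum(euclidean(samples[i], samples[j]) for i, j in pairs) / len(pairs)
        if score > best_score:
            best, best_score = idx, score
    return best


# Toy example with five two-pixel "images".
samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [0.0, 10.0], [1.0, 0.0]]
chosen = most_diverse_subset(samples, k=3)
```

The exhaustive search is only practical for a small number of candidate samples, since the number of triples grows cubically.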
(a) ConvNP LˆML (b) ANP LˆML (c) ANP LNPVI
Fig. F.2 Qualitative comparison of samples from (a) ConvLNP trained with LˆML; (b) ANP trained
with LˆML; (c) ANP trained with LNPVI. For each model, the layout is the same as in
Figure F.1.
Table F.2 Coordinates for boxes defining the train and test regions.
            Central (train)  Western (test)  Eastern (test)  Southern (test)
Latitudes   (52, 46)         (50, 46)        (52, 49)        (46, 42)
Longitudes  (08, 28)         (01, 08)        (28, 35)        (19, 26)
The qualitative advantage of ConvLNP is most significant in settings that require
translation equivariance for generalisation. Figure F.2 row 2 (ZSMM) clearly
demonstrates that ConvLNP generalizes to larger canvas sizes and multiple digits, while ANP
attempts to reconstruct a single digit regardless of the context set. Finally, Figure F.3
provides the test log-likelihood distributions of ANP and ConvLNP as well as some
qualitative comparisons between the two.
F.3 Experimental details on environmental data
F.3.1 Data details
ERA5-Land (Copernicus Climate Change Service, 2020) contains high resolution
information on environmental variables at a 9 km spacing across the globe.1 The
data we use contains daily measurements of accumulated precipitation at 11pm and
temperature at 11pm at every location, between 1981 and 2020, yielding a total
of 14,304 temporal measurements across the spatial grid. In addition, we provide
orography (elevation) values for each location. We normalize the data such that the
precipitation values in the train set have zero mean and unit standard deviation.
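The normalisation step can be sketched as below: statistics are computed on the train set only and then reused for any other data. The values are made up for illustration, and the use of the population standard deviation is an assumption.

```python
import statistics


def fit_normaliser(train_values):
    """Mean and (population) standard deviation of the training values."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def normalise(values, mean, std):
    """Apply the train-set statistics to arbitrary values."""
    return [(v - mean) / std for v in values]


# Hypothetical daily precipitation values; many days are exactly dry.
train = [0.0, 0.0, 0.0, 1.2, 4.0, 0.0, 2.3, 0.0]
mean, std = fit_normaliser(train)
train_norm = normalise(train, mean, std)

# A dry day (y = 0) maps to the negative value -mean / std.
no_rain = normalise([0.0], mean, std)[0]
```

Because precipitation is non-negative with many exact zeros, "no rain" corresponds to a fixed negative value in normalised units.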
1 URL: https://www.ecmwf.int/en/era5-land. Neither the European Commission nor ECMWF is
responsible for any use that may be made of the Copernicus Information or data it contains.
(a) MNIST (b) CelebA32
(c) Zero Shot Multi-MNIST (d) SVHN
Fig. F.3 Log-likelihood and qualitative samples comparing ConvLNP and ANP trained
with LˆML on (a) MNIST; (b) CelebA; (c) ZSMM; (d) SVHN. For each sub-figure,
the top row shows the log-likelihood distribution for both models. The images below
correspond to the context points (top), followed by three samples from ConvLNP
(mean of the posterior predictive corresponding to different samples from the latent
function), and three samples from ANP. Each column corresponds to a given percentile
of the ConvLNP test log likelihood (as shown by green arrows).
Fig. F.4 Training (blue) and test (red) regions in Europe, along with orography data
from ERA5-Land.
We consider the task of predicting daily precipitation y, with latitude and longitude
as x. In addition, at each context and target location, we provide the model with
access to side information in the form of orography (elevation) and temperature values.
We also normalise the orography and temperature values to have zero mean and unit
standard deviation. We choose a large region of central Europe as our train set, and
use regions East, West and South of the train set as held out test sets (see Figure F.4
and Table F.2). At train time, to sample a task, we first sample a random date between
1981 and 2020. We then sample a square subregion of grid values from within the
train region (which has size 61 × 201). We consider two models, one trained on 28 × 28
subregions, and another trained on 40 × 40 subregions. During training, each subregion
is then split into context and target sets. Context points are randomly chosen with
a keep rate pkeep with pkeep ∼ U[0, 0.3]. In this section, we train only on the LˆML
objective.
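The task-sampling procedure can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: days are indexed from 0 to `n_days - 1`, the square subregion is placed uniformly at random inside the 61 × 201 train region, and each grid point is kept as a context point independently with probability pkeep. The function name and return format are hypothetical, and how the target set is formed from the remaining points is not specified here.

```python
import random


def sample_task(n_days, region_shape=(61, 201), size=28, max_keep=0.3, rng=random):
    """Sample a day, a square subregion and a context mask for one task."""
    day = rng.randrange(n_days)
    top = rng.randrange(region_shape[0] - size + 1)
    left = rng.randrange(region_shape[1] - size + 1)
    p_keep = rng.uniform(0.0, max_keep)  # p_keep ~ U[0, 0.3]
    context_mask = [[rng.random() < p_keep for _ in range(size)]
                    for _ in range(size)]
    return day, (top, left), p_keep, context_mask


# 14,304 is the number of temporal measurements quoted in Section F.3.1.
rng = random.Random(0)
day, (top, left), p_keep, mask = sample_task(n_days=14304, rng=rng)
```

Passing `size=40` gives the corresponding sketch for the model trained on 40 × 40 subregions.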
F.3.2 Gaussian process baseline
We mean-centre the data for each task for the GP before training, and add the mean
offset back for evaluation and sampling. We use an Automatic Relevance Determination
(ARD) kernel, with separate factors for latitude/longitude, temperature and orography.
In detail, let x = (xlat, xlon) denote position, let ω and t denote orography and
temperature respectively, and let r := (x, ω, t). Then the kernel is given by
$$k(r, r') = \sigma_v^2 \, k_l(x, x') \, k_\omega(\omega, \omega') \, k_t(t, t') + \sigma_n^2 \, \delta(r, r').$$
Here $k_l$, $k_\omega$ and $k_t$ are Matérn-5/2 kernels with separate learnable lengthscales;
$\delta(r, r') = 1$ if $r = r'$ and 0 otherwise; and $\sigma_v^2$, $\sigma_n^2$ are learnable signal and noise variances
respectively. We learn all hyperparameters by maximising the log-marginal likelihood
using SciPy's implementation of L-BFGS.
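For concreteness, the product kernel can be written out as in the sketch below. It is a plain-Python illustration rather than the code used in the experiments: the function names and parameter values are hypothetical, and a single lengthscale shared between latitude and longitude is assumed for the positional factor.

```python
import math


def matern52(a, b, lengthscale):
    """Matern-5/2 kernel between two points given as sequences of floats."""
    d = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    s = math.sqrt(5.0) * d / lengthscale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def kernel(r1, r2, ls_pos, ls_oro, ls_temp, signal_var, noise_var):
    """Product of three Matern-5/2 factors plus a noise term on identical inputs."""
    (x1, w1, t1), (x2, w2, t2) = r1, r2
    k = (signal_var
         * matern52(x1, x2, ls_pos)
         * matern52([w1], [w2], ls_oro)
         * matern52([t1], [t2], ls_temp))
    return k + (noise_var if r1 == r2 else 0.0)


# Inputs are r = (position, orography, temperature); values are made up.
r1 = ((50.0, 10.0), 0.3, -0.1)
r2 = ((49.0, 12.0), 0.5, 0.2)
params = dict(ls_pos=2.0, ls_oro=1.0, ls_temp=1.0, signal_var=1.5, noise_var=0.1)
k_same = kernel(r1, r1, **params)
k_cross = kernel(r1, r2, **params)
```

Because the factors are multiplied, two inputs are strongly correlated only if they are close in position, orography and temperature simultaneously.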
Transforming the data As the data is non-negative, we considered applying the
transform y ↦ log(ϵ + y) for the GP to model. If ϵ = 0, this would guarantee that
the GP would only yield positive samples, which would be physically sensible as
precipitation is non-negative. However, this cannot be done as precipitation often
takes the value y = 0, at which the transform is undefined. On the
other hand, if ϵ > 0, the GP samples after performing the inverse transform could still
predict a precipitation value as low as −ϵ, which is still unphysical. Further, a small
value of ϵ leads to large distortion of the y values in transformed space. In the end, we
run all experiments for the GP and NP without log-transforming the data; hence the
models have to learn non-negativity.
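Both drawbacks of the shifted log transform can be checked numerically. The snippet below is a small illustration with an arbitrarily chosen ϵ = 10⁻³ (an assumption, not a value used in the experiments): a very negative sample in transformed space maps back to a value just above −ϵ, and the gap between y = 0 and y = 1 becomes large in transformed space.

```python
import math


def forward(y, eps):
    """Shifted log transform y -> log(eps + y)."""
    return math.log(eps + y)


def inverse(g, eps):
    """Inverse transform g -> exp(g) - eps, bounded below by -eps."""
    return math.exp(g) - eps


eps = 1e-3

# A very negative sample in transformed space maps back to a negative value.
y_low = inverse(-20.0, eps)

# Distance in transformed space between "no rain" (y = 0) and y = 1.
gap = forward(1.0, eps) - forward(0.0, eps)
```

Shrinking ϵ tightens the lower bound −ϵ but makes `gap` grow without limit, which is the trade-off described above.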
F.3.3 ConvLNP architecture and training details
As the ERA5-Land dataset is regularly spaced, we use the on-the-grid version of
the architecture, without the need for an RBF smoothing layer at the input. All
experiments used a convolutional architecture with 3 residual blocks (He et al., 2016)
for the encoder and 3 residual blocks for the decoder. Each residual block is defined
with two layers of ReLU activations followed by convolutions, each with kernel size 5.
The first convolution in each block is a standard convolution layer, whereas the second
is depthwise separable (Chollet, 2017). All intermediate convolutional layers have 128
channels, and the latent function z has 16 channels. The networks were trained using
Adam with a learning rate of $10^{-4}$. We estimated LˆML using 16-32 samples at train
time, with batches of 8-16 images.
We train the models for between 400 and 500 epochs. Each epoch is defined
as a single pass through every day in the training set; for each day, a random
subregion of the full 61 × 201 central Europe region is cropped. We estimated the
predictive density using 2500 samples of z during test time.
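A Monte Carlo estimate of this kind averages the likelihood over latent samples, which is best done in log space for numerical stability. The sketch below shows the idea for a single scalar target, assuming (as an illustration) that the decoder returns a Gaussian mean and standard deviation for each sample of z; the function names are hypothetical. For several target points, the per-target log-densities under each sample would be summed before averaging over samples.

```python
import math


def gaussian_logpdf(y, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((y - mean) / std) ** 2


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def mc_log_predictive(y, means, stds):
    """log (1/L) sum_l N(y; mean_l, std_l^2), one (mean, std) pair per latent sample."""
    return log_mean_exp([gaussian_logpdf(y, m, s) for m, s in zip(means, stds)])


# Toy example with two latent samples (2500 are used at test time in the text).
estimate = mc_log_predictive(0.0, means=[-1.0, 1.0], stds=[1.0, 1.0])
```

Subtracting the maximum before exponentiating avoids underflow when individual log-likelihoods are very negative.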
F.3.4 Prediction and sampling
To create Table 5.5, at test time we sample 28×28 subregions from each of the train and
test regions. This is done 1000 times. For the GP, we randomly restart optimisation 5
times per task and use the best hyper-parameters found. In order to remove outliers
where the GP has very poor likelihood, we set a log-likelihood threshold for the GP. If
the GP has a log-likelihood of less than 0 nats on a particular task, then that task is
removed from the evaluation.
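The outlier-removal rule amounts to a simple filter on per-task log-likelihoods, sketched below. The numbers are made up for illustration, and the sketch assumes that a discarded task is dropped for both models so that they are compared on the same tasks.

```python
def filter_by_gp_threshold(gp_ll, convlnp_ll, threshold=0.0):
    """Drop every task on which the GP log-likelihood is below the threshold."""
    keep = [i for i, g in enumerate(gp_ll) if g >= threshold]
    return [gp_ll[i] for i in keep], [convlnp_ll[i] for i in keep]


# Hypothetical per-task log-likelihoods (nats).
gp_ll = [1.2, -35.0, 0.4, -0.1]
convlnp_ll = [1.5, 0.9, 0.7, 1.1]
gp_kept, convlnp_kept = filter_by_gp_threshold(gp_ll, convlnp_ll)
```

Note that this filter can only help the GP's reported average, since the removed tasks are exactly those on which it performs worst.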
We find that to produce high quality samples, we need to train the model on
subregions that are roughly as large as the lengthscale of the precipitation process.
Hence we sample from the model trained on 40× 40 subregions in Figure 5.13 in the
main body. We show samples from the model trained on both 28× 28 subregions and
40× 40 subregions in Section F.3.6. We also compare to samples from GPs trained on
each context set (no random restarts were used for sampling).
F.3.5 Bayesian optimization
We use the models described in Section F.3.3, trained on random 28× 28 subregions
of the train region, and compare to the GP baselines described in Section F.3.2. For
the Bayesian optimization experiments in Figure 5.14 in the main body, we do not
perform random restarts as this was too time-consuming. We carry out the Bayesian
optimization (BayesOpt) experiments in each of the four regions: Central (train), West
(test), East (test), and South (test). Each Bayesian optimization “episode” is defined
by randomly sub-sampling a day (uniformly at random between 1981 and 2020), then
sampling a sub-region from the tested region. To test the models’ spatial generalization
capacity (where possible), we sub-sample episodes from each of the four regions with
the following sizes:
• Central: 42 × 42
• West: 40 × 40
• East: 28 × 28
• South: 36 × 36
Episodes begin from empty context sets $D_c^{(0)} = \emptyset$, and models sequentially query locations
for $t = 1, \ldots, 50$. Denoting by $(x^{(t)}, y^{(t)})$ the query location and queried value at iteration
$t$, the context set is then updated as $D_c^{(t)} = D_c^{(t-1)} \cup \{(x^{(t)}, y^{(t)})\}$. Denoting by $y$ the
complete set of rainfall values in the sub-region, and by $y_c^{(t)}$ the set of values queried
up to iteration $t$, we can define the instantaneous regret as $r_t = \max(y) - \max(y_c^{(t)})$, and
compute the average regret (plotted in Figure 5.14) at the $t$th iteration as
$\bar{r}_t = \frac{1}{t} \sum_{i=1}^{t} r_i$.
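The regret computation can be sketched in a few lines. The example below is a plain-Python illustration with made-up rainfall values; the function name is hypothetical, and the acquisition rule that chooses the query locations is outside the scope of the sketch.

```python
def average_regrets(all_values, queried_values):
    """Instantaneous and running-average regret after each query."""
    best = max(all_values)
    running_max = float("-inf")
    total = 0.0
    regrets, averages = [], []
    for t, y in enumerate(queried_values, start=1):
        running_max = max(running_max, y)  # max over values queried so far
        r_t = best - running_max           # instantaneous regret
        total += r_t
        regrets.append(r_t)
        averages.append(total / t)         # average regret up to iteration t
    return regrets, averages


# Toy sub-region with five rainfall values and three queries.
all_values = [0.0, 1.0, 5.0, 2.0, 3.0]
queried = [1.0, 2.0, 5.0]
regrets, averages = average_regrets(all_values, queried)
```

The instantaneous regret is non-increasing in t, and it reaches zero once the location with maximum rainfall has been queried.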
F.3.6 Additional figures for environmental data
Predictive density Figure F.5 displays the predictive densities for precipitation
at different locations, conditioned on a context set used for testing. The density of
the ConvLNP is estimated using 2500 samples of z. To examine why the ConvLNP
outperforms the GP in terms of log-likelihood, we plot cases where the ConvLNP
likelihood is significantly better than the GP likelihood. We see that this is due to the
GP occasionally making very overconfident predictions compared to the ConvLNP. We
also see that the ConvLNP in a small proportion of cases exhibits very non-Gaussian,
asymmetric predictive distributions.
(a) (b)
Fig. F.5 Predictive density at two target points, where the ConvLNP significantly
outperforms the GP. The orange and blue circles show the likelihood for the ground
truth target value under the GP and ConvLNP. Note that as the precipitation values
are normalised to zero mean and unit standard deviation, yt = −0.53 corresponds to
no rain. In Figure F.5a, we see the ConvLNP sometimes produces predictions heavily
centered on this value, showing it has learned the sparsity of precipitation values. In
Figure F.5b we see the ConvLNP predictive distribution is sometimes asymmetric with
a heavier positive tail, reflecting the non-negativity of precipitation.
Additional samples In this section we show additional samples from the model
trained on 28 × 28 images (Figures F.6 and F.7) and also on 40 × 40 images
(Figures F.8 and F.9). Training on larger images reduces the occurrence of blocky artefacts.
The model used for Figure 5.13 in the main body was trained on 40 × 40 images. Note that
samples shown here are 61 × 201, i.e. the size of the entire central Europe train region.
(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3
(e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.6 Samples from the predictive processes overlaid on central Europe, for a model
trained on random 28× 28 subregions of the full 61× 201 central Europe region. Note
some blocky artefacts in the ConvNP samples due to training on small subregions. Here
the GP has overfit to the orography data, with samples that resemble the orography
rather than precipitation.
(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3
(e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.7 Samples from the predictive processes overlaid on central Europe, for a model
trained on random 28× 28 subregions of the full 61× 201 central Europe region. Here
the GP has learned a lengthscale that is too large.
(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3
(e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.8 Samples from the predictive processes overlaid on central Europe, for a model
trained on random 40× 40 subregions of the full 61× 201 central Europe region. Here
the GP has overfit to the orography data, with samples that resemble the orography
rather than precipitation.
(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3
(e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.9 Samples from the predictive processes overlaid on central Europe, for a model
trained on random 40× 40 subregions of the full 61× 201 central Europe region. The
GP has again overfit to the orography data.
(a) Ground truth data (b) ConvNP sample 1 (c) ConvNP sample 2 (d) ConvNP sample 3
(e) Context set (f) GP sample 1 (g) GP sample 2 (h) GP sample 3
Fig. F.10 Samples from the predictive processes overlaid on central Europe, for a model
trained on random 40× 40 subregions of the full 61× 201 central Europe region.