Analog deep neural networks (DNNs) provide a promising solution, especially for deployment on resource-limited platforms, for example in mobile settings. However, the practicability of analog DNNs has been limited by their instability, which stems from multiple factors including manufacturing variations and thermal noise. Here, we present a theoretically guaranteed noise injection approach that improves the robustness of analog DNNs without any hardware modification or sacrifice of accuracy: we prove that, within a certain range of parameter perturbations, the prediction results do not change. Experimental results demonstrate that our algorithmic framework outperforms state-of-the-art methods by a factor of 10 to 100 on tasks including image classification, object detection, and large-scale point cloud object detection in autonomous driving. Together, our results may serve as a way to ensure the robustness of analog deep neural network systems, especially for safety-critical applications.

Device imperfections are a major challenge that limits the potential of analogue neural networks. Nanyang Ye and colleagues propose a training-time noise injection approach, with theoretical guarantees, that improves their robustness without hardware modifications.

Recently, analog DNNs have emerged as a promising direction to further alleviate the speed and power-consumption limits of standard von Neumann computational architectures. For example, with the crossbar computing architecture, a common operation in DNNs, the in-place dot product, can be efficiently implemented by analog computation without transferring data from memory to computing devices, thereby breaking the memory wall that limits the efficiency of existing deep-learning accelerators. This is in contrast to standard von Neumann architectures, where data has to be moved to computation units before computation.
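As a concrete illustration of the crossbar dot product, the matrix-vector multiply falls out of Ohm's and Kirchhoff's laws: applying voltages to the rows of a conductance array yields column currents equal to dot products. The following is a minimal, idealized numerical sketch (it ignores all device non-idealities, which are exactly the subject of this work):

```python
import numpy as np

# An ideal crossbar computes y = G^T v in one step: applying voltages v to
# the rows, the current collected on each column is the dot product of v
# with that column's conductances (Ohm's law + Kirchhoff's current law).
def crossbar_dot(G, v):
    """G[i, j] is the conductance at row i, column j; v are row voltages."""
    return G.T @ v  # column currents

G = np.array([[1.0, 0.5],
              [2.0, 1.5]])   # conductances encoding a weight matrix
v = np.array([0.1, 0.2])     # input voltages
print(crossbar_dot(G, v))    # -> [0.5, 0.35]
```

In an analog accelerator the entire multiply happens in place, which is why parameter drift in the conductances directly distorts the computed result.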

However, a key challenge is that analog DNNs are not compatible with current deep-learning paradigms, which were designed primarily for deterministic circuits. Lacking the potential gap between high and low voltages that resists noise in digital circuits, analog DNNs are highly sensitive to thermal noise, electrical noise, process variations, and programming errors. As a result, the DNN parameters, represented by the conductance at each crossbar intersection, can be easily distorted, jeopardizing the utility of analog deep-learning systems, especially for life-critical applications^{1}.

Many efforts have been made to minimize the detrimental effect of noise by improving device stability from an engineering perspective, such as employing novel materials and optimizing circuit designs^{2–7}. These approaches can mitigate the issue to some extent, for example in a particular field or for single tasks. However, hardware modifications typically target a specific type of device, lack universality in manufacturing, and incur extra hardware cost. From an algorithmic perspective, previous work has shown that noise injection during training can lead to empirical improvements in the noise resilience of analog computing devices. For example, standard Gaussian noise is widely introduced into the training process of DNNs to improve robustness^{8–10}, but the improvement depends on time-consuming noise measurements for each individual device to be deployed. Moreover, prior studies focused on the methods themselves but lacked in-depth analysis, such as how to choose the strength of the injected noise and how the noise spectrum affects performance^{8–13}. The fundamental understanding therefore remains unclear, which limits the wide application of analog DNNs in real-world situations. Consequently, developing a theoretically guaranteed method to ensure the robustness of analog DNNs could be essential for their widespread utility and may lead to improvements in life-critical applications, such as autonomous driving.
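The Gaussian noise injection baseline referenced above can be sketched as a single training step. The helper name and the fixed `sigma` below are illustrative; in practice, as noted, `sigma` has to be calibrated against measurements of each deployed device:

```python
import torch

def train_step_with_weight_noise(model, loss_fn, x, y, opt, sigma=0.1):
    """One SGD step with additive Gaussian noise injected into the weights
    during the forward pass, a common prior approach to noise-resilient
    training. The noise is removed before the optimizer update, so the
    gradient is evaluated at the perturbed weights but applied to the
    clean ones."""
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            n = sigma * torch.randn_like(p)
            p.add_(n)
            noises.append(n)
    loss = loss_fn(model(x), y)   # forward pass through noisy weights
    opt.zero_grad()
    loss.backward()
    with torch.no_grad():         # restore the clean weights
        for p, n in zip(model.parameters(), noises):
            p.sub_(n)
    opt.step()
    return loss.item()
```

The open question this work addresses is precisely how strong such injected noise should be, and of what distribution, without per-device tuning.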

Herein, we present a thorough theoretical study and a theoretically guaranteed noise injection approach that allow us to train analog DNNs that are fault-tolerant, robust against noise, and generalizable to complex tasks. Inspired by previous neuroscience research demonstrating the benefits of noise in human neural systems^{14}, we propose a noise injection approach that leverages Bayesian optimization, "BayesFT", to optimize the characteristics of the injected noise and thereby enhance the robustness of analog neural networks. Compared with some typical state-of-the-art studies (Supplementary Table

A DNN can be regarded as a composition of many nonlinear functions. Given input data x, f_θ denotes the neural network with drift-inducing parameters θ.

p value ≤ 0.05, smaller

Though adding dropout layers can significantly improve the robustness of analog DNNs to parameter drift, misspecified dropout settings can also cause suboptimal performance. Therefore, it is crucial to automatically search for an optimal neural network architecture with proper noise injection settings that ensure robustness while avoiding such misspecification. To simplify the neural architecture search, instead of exploring all possible topologies of the DNN^{15–17}, we appended a noise injection layer after each DNN layer except the last softmax layer for output, and searched only for the dropout rate of each layer. We denote the specification of the additional noise injection layers as α.
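As a sketch of this search space (the layer sizes and helper name here are illustrative, not the paper's exact configuration), an MLP with a noise injection layer after every hidden layer might look like:

```python
import torch.nn as nn

def mlp_with_noise_layers(sizes, alpha):
    """MLP with a dropout (noise injection) layer after every hidden layer;
    the output layer produces logits, with softmax folded into the loss.
    alpha[i] is the dropout rate of the i-th injection layer, which is the
    quantity BayesFT searches over."""
    layers = []
    for i in range(len(sizes) - 2):
        layers += [nn.Linear(sizes[i], sizes[i + 1]),
                   nn.ReLU(),
                   nn.Dropout(p=alpha[i])]
    layers += [nn.Linear(sizes[-2], sizes[-1])]
    return nn.Sequential(*layers)

# one candidate in the search space: alpha = (0.3, 0.1)
net = mlp_with_noise_layers([784, 256, 128, 10], alpha=[0.3, 0.1])
```

The search thus reduces from a combinatorial space of topologies to a small continuous vector of per-layer dropout rates.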

We then examine the effectiveness of our method through image recognition experiments on several datasets: the Modified National Institute of Standards and Technology (MNIST, ref. ^{18}) dataset, the Canadian Institute for Advanced Research-10 (CIFAR-10, ref. ^{19}) dataset, the German Traffic Sign Recognition Benchmark (GTSRB, ref. ^{20}), the Penn-Fudan Database for Pedestrian Detection (PennFudanPed, ref. ^{21}), and the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago Vision Benchmark (KITTI, ref. ^{22}). For comparison, we implemented the following state-of-the-art baseline algorithms as references: empirical risk minimization (ERM), the baseline algorithm that only minimizes the empirical risk; ReRAM-variations (ReRAM-V, ref. ^{6}), which diagnoses the ReRAM circuits and iteratively re-tunes the parameters until convergence to improve robustness to parameter drift; adversarial weight perturbations (AWP, ref. ^{23}), which adversarially trains the neural network against parameter perturbations to improve robustness to parameter shifts; and the fault-tolerant neural network architecture (FTNA, ref. ^{1}), which replaces the last softmax layer of the original model with an error-correction coding scheme, as discussed before. We denote our method as BayesFT-DO, BayesFT-Ga, and BayesFT-La, corresponding to the different types of injected noise: Bernoulli, Gaussian, and Laplace, respectively.

Each algorithm was run 20 times on the MemSim simulation platform, and the mean (line) and standard deviation (shaded area) of accuracy under different resistance variations are demonstrated in the corresponding figures. The experiments were carried out on a three-layer multilayer perceptron (MLP) and LeNet5^{24} with drifting terms (resistance variation,

A similar experimental protocol was applied to the CIFAR-10 dataset with various neural network architectures, including the most commonly used ones in computer vision, such as AlexNet^{25}, VGG^{26}, ResNet^{27}, and PreAct-ResNet (PreAct, ref. ^{27}) with different numbers of layers. Compared with handwritten digits in the MNIST dataset, CIFAR-10 contains real-world objects that pose more difficulties for recognition. The results are shown in Fig.

Experimental results on

We finally conducted experiments on an autonomous driving task, point cloud object detection on the KITTI^{28} dataset, detecting cars, pedestrians, and cyclists on the road from data collected by a Velodyne lidar. A point cloud is a set of points carrying the location of each point in three-dimensional space^{29}, and it plays an essential role in many contemporary autonomous driving systems. We compared the performance of different training algorithms using the PointPillars network^{30}, a fast DNN model for object detection from point clouds. Similar to the above experiments, perturbations with

To conclude, in contrast to previous efforts that focus on improving the accuracy of analog DNNs from a device-manufacturing perspective, we provide a training-time optimization framework that improves the robustness of analog DNNs, achieving accurate recognition without any hardware modification. By systematically analyzing the influence of different factors on the performance of analog DNNs through a memristor simulator, we find that injecting noise efficiently improves robustness, and we provide theoretical proof of this effect. BayesFT was applied to optimize the setting and distribution of the injected noise, and its working principle was proved theoretically, making it the first theoretically guaranteed method for training robust analog DNNs. BayesFT generalizes to various DNN architectures, and its effectiveness was examined on different recognition tasks, including image classification (MNIST and CIFAR-10), traffic sign recognition, and 3D point cloud detection (KITTI). BayesFT makes analog DNNs insensitive to noise while inducing only a relatively low accuracy loss, even at high noise levels. We believe our findings could extend the practicability of analog DNNs to previously infeasible life-critical tasks, such as autonomous driving, by providing both good empirical performance and theoretical guarantees.

For simplicity and without loss of generality, we adopt a memristor perturbation model following the challenging setting used in refs. ^{6,31}. This model is fitted on real devices and accounts for multiple factors causing memristance drift, including thermal noise, programming errors, and manufacturing errors. Specifically, the drifting term is applied to each neural network parameter θ_i:

```python
import numpy as np
import torch

def accuracy_vs_drift(model, model_path, valid_dl):
    """Sweep the drift strength sigma and report the mean validation
    accuracy over `num` random perturbations per noise level.
    (Function name reconstructed; `add_noise_to_parameters` and
    `evaluate_accuracy` are helpers defined elsewhere.)"""
    sigma = np.linspace(0., 1.5, 31)
    accu = []
    num = 20
    evaluated = np.zeros(num)
    for std in sigma:
        for i in range(num):
            # restore the clean weights, then apply a fresh perturbation
            model.load_state_dict(torch.load(model_path))
            add_noise_to_parameters(0, std, model)
            evaluated[i] = evaluate_accuracy(model, valid_dl)['val_acc']
        accu.append(np.sum(evaluated) / num)
    return sigma, accu
```
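The snippet above relies on `add_noise_to_parameters`, whose definition is not shown. A plausible implementation, consistent with the multiplicative drift model used in this work (each parameter θ becomes θ exp(λ), λ ~ N(μ, σ²)), would be:

```python
import torch

def add_noise_to_parameters(mean, std, model):
    """Apply the multiplicative drift model in place: each parameter
    theta becomes theta * exp(lambda) with lambda ~ N(mean, std^2).
    A sketch consistent with the perturbation model; the paper's exact
    helper is not shown."""
    with torch.no_grad():
        for p in model.parameters():
            lam = torch.randn_like(p) * std + mean
            p.mul_(torch.exp(lam))
```

Because exp(λ) is always positive, the perturbation rescales each parameter without flipping its sign, matching the resistance-variation character of the drift.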

Implementation details of optimization methods. We first define our objective function by marginalizing the loss over the distribution of drifting neural network parameters θ:

$$u(\alpha, \theta) = -\mathbb{E}_{\tilde{\theta} \sim p(\tilde{\theta})}\left[\ell\left(f_{(\alpha, \tilde{\theta})}(x),\, y\right)\right]$$

where $\tilde{\theta} = \theta \exp(\lambda)$ with $\lambda \sim \mathcal{N}(0, \sigma^{2})$, and $\ell(f_{(\alpha, \tilde{\theta})}(x), y)$ is the loss of a neural network with noise setting $\alpha$ (e.g., dropout rates for Bernoulli noise) and parameters $\theta$, given input data $x$ and target $y$. This intractable expectation can be approximated by Monte Carlo sampling:

$$u(\alpha, \theta) \simeq -\frac{1}{T}\sum_{t=1}^{T} \ell\left(f_{(\alpha, \tilde{\theta}_{t})}(x),\, y\right)$$

where $T$ is the number of Monte Carlo samples and $\tilde{\theta}_{t}$ is the $t$-th sample drawn from the parameter distribution $p(\tilde{\theta})$. To maximize the objective function, we use an optimization scheme in which $\alpha$ and $\theta$ are optimized alternately. When optimizing $\alpha$, as discussed in the main text, no exact gradient information is available, so we use Bayesian optimization, which does not require gradients of the variables being optimized. Bayesian optimization uses a surrogate model constructed from previous trials to determine the next trial point, i.e., the point most likely to yield the optimal solution of the gradient-free optimization problem^{32}. For $\theta$, we employ stochastic gradient descent.
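The Monte Carlo estimate of $u(\alpha, \theta)$ can be sketched as follows. This is a simplified single-batch version; the drift model and negated-loss convention follow the equations above, while the function and variable names are illustrative:

```python
import copy
import torch

def mc_objective(model, loss_fn, x, y, std, T=10):
    """Monte Carlo estimate of u(alpha, theta): average the loss over T
    drifted copies theta_t = theta * exp(lambda), lambda ~ N(0, std^2),
    and negate (u is the negative expected loss). The noise setting
    alpha is implicit in the model's dropout layers."""
    total = 0.0
    for _ in range(T):
        drifted = copy.deepcopy(model)      # leave the clean weights intact
        with torch.no_grad():
            for p in drifted.parameters():
                p.mul_(torch.exp(torch.randn_like(p) * std))
        total += loss_fn(drifted(x), y).item()
    return -total / T
```

Larger $T$ reduces the variance of the estimate at a proportional cost in forward passes.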

For Bayesian optimization, we use a Gaussian process regression model as the surrogate. Suppose we already have $n$ trials with different settings of $\alpha$, denoted $\alpha_{1:n}$, their corresponding objective values $g(\alpha_{1:n})$, and the kernel matrix $\kappa(\alpha_{1:n}, \alpha_{1:n})$; more specifically:

$$\alpha_{1:n} = [\alpha_{1}, \cdots, \alpha_{n}]$$

$$g(\alpha_{1:n}) = [u(\alpha_{1}, \theta), \cdots, u(\alpha_{n}, \theta)]$$

$$\kappa(\alpha_{1:n}, \alpha_{1:n}) = \begin{bmatrix} \kappa(\alpha_{1}, \alpha_{1}) & \cdots & \kappa(\alpha_{1}, \alpha_{n}) \\ \vdots & \ddots & \vdots \\ \kappa(\alpha_{n}, \alpha_{1}) & \cdots & \kappa(\alpha_{n}, \alpha_{n}) \end{bmatrix}$$

Then, by the properties of the Gaussian process, the posterior distribution of $g(\alpha)$ after $n$ trials is Gaussian, with mean $\kappa(\alpha, \alpha_{1:n})\,\kappa(\alpha_{1:n}, \alpha_{1:n})^{-1}\, g(\alpha_{1:n})$ and variance $\kappa(\alpha, \alpha) - \kappa(\alpha, \alpha_{1:n})\,\kappa(\alpha_{1:n}, \alpha_{1:n})^{-1}\,\kappa(\alpha_{1:n}, \alpha)$ (the expressions updated in Algorithm 1), where $\kappa$ is the kernel function. In our experiments, we use the exponential kernel:

$$\kappa(\alpha_{1}, \alpha_{2}) = k_{0}\exp\left(-\|\alpha_{1} - \alpha_{2}\|^{2}\right)$$

where $\|\alpha_{1} - \alpha_{2}\|^{2} = \sum_{i=1}^{d} k_{i}\,(\alpha_{1,i} - \alpha_{2,i})^{2}$ and the $k_{i}$ are parameters of the kernel.

The next trial is then given by the point most likely to yield the optimal objective value, $\alpha^{*} = \arg\max_{\alpha}\, p(g(\alpha) \mid g(\alpha_{1:n}))$. Based on the above, we obtain the Bayesian optimization algorithm for fault-tolerant neural network training, shown in Algorithm 1.
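These Gaussian-process formulas can be sketched numerically. The grid, trial values, and names below are illustrative; the kernel and posterior expressions follow the equations above:

```python
import numpy as np

def exp_kernel(A, B, k0=1.0, weights=None):
    """Exponential kernel from the text: k0 * exp(-||a - b||^2), with
    per-dimension weights k_i in the squared distance."""
    w = np.ones(A.shape[1]) if weights is None else weights
    d2 = (((A[:, None, :] - B[None, :, :]) ** 2) * w).sum(-1)
    return k0 * np.exp(-d2)

def gp_posterior(cand, A_n, g_n, jitter=1e-8):
    """Posterior mean and variance of g at candidate points, given the
    n observed trials (A_n, g_n), per the Gaussian-process formulas."""
    K_inv = np.linalg.inv(exp_kernel(A_n, A_n) + jitter * np.eye(len(A_n)))
    K_s = exp_kernel(cand, A_n)
    mu = K_s @ K_inv @ g_n
    var = exp_kernel(cand, cand).diagonal() - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s)
    return mu, var

# toy 1-D example: three dropout-rate trials and their observed utilities
grid = np.linspace(0.0, 1.0, 101)[:, None]
A_n = np.array([[0.1], [0.5], [0.9]])
g_n = np.array([-0.9, -0.3, -0.7])  # u values (negative losses)
mu, var = gp_posterior(grid, A_n, g_n)
alpha_next = grid[np.argmax(mu)]    # candidate for the next trial
```

The posterior mean interpolates the observed trials exactly (up to the jitter), while the posterior variance grows away from them, which is what lets the acquisition rule trade off exploitation and exploration.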

Algorithm 1 Bayesian Optimization for Fault-Tolerant DNN (BayesFT)

Input: Dataset (x , y ), neural network parameters θ , dropout rates for each layer α , number of epochs for training neural networks E .

Output: Trained neural network θ and dropout rates for each layer α .

Initialization: initialize θ with Xavier random initialization^{33}, α with a uniform distribution on [0, 1], number of iterations t = 0:

repeat

for e = 1 to E do

Optimize neural network parameters θ (gradient ascent on u, i.e., gradient descent on the loss):

$$\theta_{t} \leftarrow \theta_{t-1} + \nabla_{\theta}\, u(\alpha_{t-1}, \theta_{t-1})$$

end for

Update the posterior distribution function for Bayesian optimization:

$$g(\alpha_{1:t-1}) = [u(\alpha_{1}, \theta_{t}), \cdots, u(\alpha_{t-1}, \theta_{t})]$$

$$u_{t-1}(\alpha) = \kappa(\alpha, \alpha_{1:t-1})\,\kappa(\alpha_{1:t-1}, \alpha_{1:t-1})^{-1}\, g(\alpha_{1:t-1})$$

$$\sigma_{t-1}^{2}(\alpha) = \kappa(\alpha, \alpha) - \kappa(\alpha, \alpha_{1:t-1})\,\kappa(\alpha_{1:t-1}, \alpha_{1:t-1})^{-1}\,\kappa(\alpha_{1:t-1}, \alpha)$$

Calculate the optimal α from the updated posterior of the surrogate model:

$$\alpha_{t} \leftarrow \arg\max_{\alpha}\, p(g(\alpha) \mid g(\alpha_{1:t-1}))$$

t ← t + 1

until convergence;
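Putting the pieces together, Algorithm 1 alternates SGD on θ with surrogate-model updates for α. The loop below is a minimal, self-contained sketch on a toy utility function: the SGD inner loop is elided, the kernel is fixed, and a simple upper-confidence-bound query stands in for the "most likely optimal" acquisition, so it illustrates the control flow rather than the paper's exact procedure.

```python
import numpy as np

def kernel(a, b):
    """1-D exponential kernel (k0 = 1, unit length scale)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

def next_alpha(grid, A_n, g_n):
    """Surrogate step of Algorithm 1: Gaussian-process posterior over the
    trials so far, queried with an upper-confidence-bound rule."""
    K = kernel(A_n, A_n) + 1e-8 * np.eye(len(A_n))
    K_inv = np.linalg.inv(K)
    K_s = kernel(grid, A_n)
    mu = K_s @ K_inv @ g_n
    var = 1.0 - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s)
    return grid[np.argmax(mu + np.sqrt(np.maximum(var, 0.0)))]

def u(alpha):  # toy stand-in for the marginalized utility u(alpha, theta)
    return -(alpha - 0.35) ** 2

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 201)
alphas, utils = [rng.uniform(0.0, 1.0)], []
for t in range(15):
    # (in BayesFT, E epochs of SGD on theta would run between trials)
    utils.append(u(alphas[-1]))
    alphas.append(next_alpha(grid, np.array(alphas), np.array(utils)))
```

Each pass through the loop evaluates the current dropout setting, refits the posterior, and proposes the next setting, exactly the repeat/until structure of Algorithm 1.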

Acknowledgements N.Y. acknowledges funding from the National Science Foundation of China (Grant No. 62106139). Q.G. is grateful for support from the Shanghai Artificial Intelligence Laboratory and the National Key R&D Program of China (Grant No. 2022ZD0160100). G.-Z.Y. acknowledges funding from the Science and Technology Commission of Shanghai Municipality (Grant No. 20DZ2220400).

Author contributions N.Y. conceived the robust training algorithm procedure and implemented the algorithm. L.C. conducted experiments on autonomous driving datasets and prepared figures and tables. Z.Z. provided the theoretical analysis. L.Y. and Z.F. conducted experiments on MNIST, CIFAR, and object detection datasets. N.Y., Q.G., and G.-Z.Y. conceived and supervised the project. All authors contributed to result analysis and manuscript writing.

Peer review Peer review information Communications Engineering thanks Christopher Bennett, Melika Payvand and Mostafa Rahimiazghadi for their contribution to the peer review of this work. Primary handling editors: Damien Querlioz, Miranda Vinay and Rosamund Daw.

Data availability The data supporting the results of this work are available from the corresponding author upon reasonable request.

Code availability Code for this article is available from the corresponding author upon reasonable request.

Competing interests The authors declare no competing interests.

References

1. Ye, N. et al. BayesFT: Bayesian optimization for fault tolerant neural network architecture. In 2021 58th ACM/IEEE Design Automation Conference 487–492 (2021).
2. Ambrogio, S. Equivalent-accuracy accelerated neural-network training using analogue memory.
3. Dalgaty, T. In situ learning using intrinsic memristor variability via Markov chain Monte Carlo sampling.
4. Sun, Y. A Ti/AlOx/TaOx/Pt analog synapse for memristive neural network.
5. Liu, C., Hu, M., Strachan, J. P. & Li, H. Rescuing memristor-based neuromorphic design with high defects. In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC) 1–6 (IEEE, 2017).
6. Chen, L. et al. Accelerator-friendly neural-network training: learning variations and defects in RRAM crossbar. In Design, Automation & Test in Europe Conference & Exhibition 19–24 (IEEE, 2017).
7. Stathopoulos, S. Multibit memory operation of metal-oxide bi-layer memristors.
8. Wan, W. A compute-in-memory chip based on resistive random-access memory.
9. Wan, W. et al. Edge AI without compromise: efficient, versatile and accurate neurocomputing in resistive random-access memory. Preprint at arXiv:2108.07879 (2021).
10. Bennett, C. H. et al. Device-aware inference operations in SONOS nonvolatile memory arrays. In 2020 IEEE International Reliability Physics Symposium 1–6 (IEEE, 2020).
11. Kraisnikovic, C., Stathopoulos, S., Prodromakis, T. & Legenstein, R. Fault pruning: robust training of neural networks with memristive weights. In 20th International Conference on Unconventional Computation and Natural Computation (2023).
12. Joksas, D. Nonideality-aware training for accurate and robust low-power memristive neural networks.
13. Huang, L. et al. A method for obtaining highly robust memristor based binarized convolutional neural network. In Proceedings of 2021 International Conference on Wireless Communications, Networking and Applications 813–822 (2022).
14. Faisal, A. A., Selen, L. P. & Wolpert, D. M. Noise in the nervous system.
15. Zoph, B. & Le, Q. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (2017).
16. Liu, H., Simonyan, K. & Yang, Y. DARTS: differentiable architecture search. In International Conference on Learning Representations (2019).
17. Elsken, T., Metzen, J. H. & Hutter, F. Neural architecture search: a survey.
18. Deng, L. The MNIST database of handwritten digit images for machine learning research.
19. Krizhevsky, A., Nair, V. & Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html.
20. Stallkamp, J., Schlipsing, M., Salmen, J. & Igel, C. The German Traffic Sign Recognition Benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks 1453–1460 (IEEE, 2011).
21. Wang, L. et al. Object detection combining recognition and segmentation. In Asian Conference on Computer Vision 189–199 (Springer, 2007).
22. Geiger, A., Lenz, P., Stiller, C. & Urtasun, R. Vision meets robotics: the KITTI dataset.
23. Wu, D., Xia, S.-T. & Wang, Y. Adversarial weight perturbation helps robust generalization.
24. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition.
25. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks.
26. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at arXiv:1409.1556 (2014).
27. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
28. Geiger, A., Lenz, P. & Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition 3354–3361 (IEEE, 2012).
29. Guo, Y. Deep learning for 3D point clouds: a survey.
30. Lang, A. H. et al. PointPillars: fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12697–12705 (IEEE, 2019).
31. Liu, T. et al. A fault-tolerant neural network architecture. In 2019 56th ACM/IEEE Design Automation Conference 1–6 (IEEE, 2019).
32. Brochu, E., Cora, V. M. & De Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Preprint at arXiv:1012.2599 (2010).
33. Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 249–256 (2010).


Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s44172-023-00074-3 .

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.