
Edited by: Housen Li, University of Göttingen, Germany

Reviewed by: Laura Antonelli, National Research Council (CNR), Italy; Markus Haltmeier, University of Innsbruck, Austria

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We propose a novel framework for the regularized inversion of deep neural networks. The framework is based on the authors' recent work on training feed-forward neural networks without the differentiation of activation functions. The framework lifts the parameter space into a higher-dimensional space by introducing auxiliary variables, and penalizes these variables with tailored Bregman distances. We propose a family of variational regularizations based on these Bregman distances, present theoretical results, and support their practical application with numerical examples. In particular, we present what is, to the best of our knowledge, the first convergence result for the regularized inversion of a single-layer perceptron that only assumes that the solution of the inverse problem lies in the range of the regularization operator, and we show that the regularized inverse provably converges to the true inverse if the measurement errors converge to zero.

Neural networks are computing systems that have revolutionized a wide range of research domains over the past decade and outperformed many traditional machine learning approaches.

The concept of inverting neural networks is certainly not new.

Several approaches for the inversion of neural networks have been proposed, especially in the context of generative modeling.

In this work, we propose a novel regularization framework based on lifting with tailored Bregman distances, and we prove that the proposed framework is a convergent variational regularization for the inverse problem of estimating the inputs of a single-layer perceptron, or for the inverse problem of estimating the hidden variables of a multi-layer perceptron sequentially. While there has been substantial work in previous years that focuses on utilizing neural networks as nonlinear operators within variational regularization methods, our work instead treats the (pre-trained) network itself as the forward operator to be inverted.

Our contributions are three-fold. (1) We propose a novel framework for the regularized inversion of multi-layer perceptrons (or, more generally, feed-forward neural networks) that is based on the lifted Bregman framework recently proposed by the authors. (2) We prove that, in the single-layer perceptron case, the proposed framework is a convergent variational regularization method, and we derive corresponding error estimates. (3) We demonstrate how to implement the proposed variational regularization computationally and support the theoretical findings with numerical examples.

The paper is structured as follows. In Section 2, we introduce the lifted Bregman formulation for the model-based inversion of feed-forward neural networks. In Section 3, we prove that, in the single-layer perceptron case, the proposed model is a convergent variational regularization method, and we provide general error estimates as well as error estimates for the concrete example of a perceptron with ReLU activation function. In Section 4, we discuss how to implement the proposed variational regularization computationally, for both the single-layer and the multi-layer perceptron setting, with a generalization of the primal-dual hybrid gradient method and with coordinate descent. In Section 5, we present numerical results that demonstrate empirically that the proposed approach is a model-based regularization, before we conclude with a brief outlook in Section 6.

Suppose we are given an L-layer feed-forward neural network Φ(·, Θ), composed of layers of the form σ_l(f_l(·, Θ_l)), where each σ_l is an activation function and each f_l(·, Θ_l) is an affine-linear transformation. For input data x ∈ ℝ^{n} and pre-trained parameters Θ ∈ ℝ^{m}, our goal is to solve the inverse problem

Φ(x, Θ) = y^δ, (2)

for the unknown input x ∈ ℝ^{n}. The problem (2) is usually ill-posed in the sense that a solution may not exist (especially if the measured data y^δ is corrupted by noise), may not be unique, and may not depend continuously on the data. Instead of solving (2) directly, we lift the problem by introducing auxiliary variables x_1, …, x_{L−1} for the hidden layers and consider the variational problem

min over x_0, x_1, …, x_{L−1} of Σ_{l=1}^{L−1} B_{Ψ_l}(x_l, f_l(x_{l−1}, Θ_l)) + B_{Ψ_L}(y^δ, f_L(x_{L−1}, Θ_L)) + α R(x_0), (3)

where x_0 denotes the unknown network input, y^δ is a perturbed version of the noise-free data y, and the functions B_{Ψ_l} are tailored loss functions associated with the activation functions σ_l, defined below. Here, R: ℝ^{n} → ℝ∪{∞} is a proper, convex, and lower semi-continuous function that enables us to incorporate a-priori information into the inversion process. The impact of this regularization is controlled by the parameter α > 0.

Please note that the functions B_{Ψ_l} are directly connected to the chosen activation functions σ_l. Following the lifted Bregman framework, they are of the form

B_{Ψ_l}(z, x) := (Ψ_l + (1/2)||·||²)(z) − ⟨z, x⟩ + (Ψ_l + (1/2)||·||²)^{*}(x),

where σ_l is the proximal map of Ψ_l, i.e.,

σ_l(x) = prox_{Ψ_l}(x) = argmin_z { (1/2)||z − x||² + Ψ_l(z) },

for all arguments x.

The advantage of using the functions B_{Ψ_l} over more conventional fidelity terms, such as the squared Euclidean norm of the difference between the network output and the measured output, is that the functions B_{Ψ_l} are continuously differentiable with respect to their second argument, along with several other useful properties.

Please note that the family of objective functions B_{Ψ_l} satisfies several other interesting properties; we refer the interested reader to our earlier work on lifted Bregman training for more details.
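To make these properties concrete, the following sketch implements such a loss for the ReLU activation. It assumes the Fenchel-Young-gap form B_Ψ(z, x) = (Ψ + (1/2)||·||²)(z) − ⟨z, x⟩ + (Ψ + (1/2)||·||²)^{*}(x), with Ψ the characteristic function of the non-negative orthant (so that prox_Ψ is the component-wise ReLU); the function names are ours, and the snippet is an illustration rather than the implementation used in the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bregman_loss_relu(z, x):
    """Assumed Bregman loss B_Psi(z, x) for Psi = indicator of {z >= 0}.

    Fenchel-Young-gap form with E = Psi + 0.5||.||^2:
        B(z, x) = E(z) - <z, x> + E*(x),  where E*(x) = 0.5||relu(x)||^2.
    It is non-negative and vanishes exactly when z = relu(x).
    """
    if np.any(z < 0):
        return np.inf  # Psi(z) = +inf outside the non-negative orthant
    return 0.5 * z @ z - z @ x + 0.5 * relu(x) @ relu(x)

def grad_second_arg(z, x):
    # B is smooth in its second argument: the gradient is prox_Psi(x) - z.
    return relu(x) - z

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
z = relu(rng.standard_normal(5))

assert bregman_loss_relu(z, x) >= 0.0                 # non-negativity
assert abs(bregman_loss_relu(relu(x), x)) < 1e-12     # zero iff z = relu(x)

# Finite-difference check of the closed-form gradient in the second argument
eps = 1e-6
fd = np.array([(bregman_loss_relu(z, x + eps * e) - bregman_loss_relu(z, x - eps * e)) / (2 * eps)
               for e in np.eye(5)])
assert np.allclose(fd, grad_second_arg(z, x), atol=1e-5)
```

The differentiability in the second argument is what allows gradient-based schemes to be applied without differentiating the (possibly non-smooth) activation function itself.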

For the remainder of this work, we assume that the parameterized functions f_l(·, Θ_l) are affine-linear in their first argument. A concrete example is the affine-linear transformation f_l(x, Θ_l) = W_l x + b_l, for a (weight) matrix W_l and a (bias) vector b_l, with Θ_l = (W_l, b_l).

In the next section, we show that (3) is a variational regularization method for the inversion of single-layer perceptrons.

In this section, we show that the proposed model (3) is a convergent variational regularization for the specific choice of a single-layer perceptron (L = 1), in which case (3) reduces to

min over x of B_Ψ(y^δ, f(x, Θ)) + α R(x), (6)

where dom(Ψ) is defined as dom(Ψ) := {y ∈ ℝ^{m} | Ψ(y) < ∞}, and we assume y^δ ∈ dom(Ψ).

For simplicity, we focus on the finite-dimensional setting with network inputs in ℝ^{n} and outputs in dom(Ψ). However, the following analysis also extends to more general Banach space settings under additional assumptions on the forward operator. We assume that the regularization functional is of the form R = G^{*} for a proper function G: ℝ^{n} → ℝ∪{∞}; note that this automatically implies convexity of R. We further assume that B_Ψ is proper, non-negative, convex in its second argument and continuous in its first argument for every y^δ ∈ dom(Ψ). Then, for every α > 0 and every y^δ ∈ dom(Ψ), the objective in (6) is proper, convex, and lower semi-continuous.

Last but not least, we assume a coercivity estimate of the form B_Ψ(y^δ, v) ≥ c_1 ||v|| − c_2 for constants c_1 > 0 and c_2 ≥ 0.

Theorem 1. Let the assumptions outlined in the previous paragraph be satisfied. Then, the following statements hold true:

1. For every α > 0 and every y^δ ∈ dom(Ψ), a minimizer of (6) exists.

2. The regularization operator that maps the data y^δ to the set of minimizers of (6) is stable with respect to perturbations of the data.

3. For every sequence y_n → y^δ with y_n ∈ dom(Ψ), the corresponding sequence of minimizers contains a subsequence that converges to a minimizer of (6) with data y^δ.

Proof. The results follow directly from classical results on the well-posedness of variational regularization methods.

Having established that (3) is a regularization operator, we now want to prove that it is also a convergent regularization operator, in the sense of an estimate of the form

D_R(x^α, x^†) ≤ C δ, (7)

for a suitable a-priori parameter choice α = α(δ), such that the regularized solutions converge to the true solution as the noise level δ converges to zero.

Here, the term D_R denotes the (generalized) Bregman distance (or divergence) with respect to R, i.e.,

D_R^{q}(u, v) := R(u) − R(v) − ⟨q, u − v⟩, for a subgradient q ∈ ∂R(v),

for two arguments u and v. The element x^α is a solution of (3) with data y^δ, for which we assume ||y − y^δ|| ≤ δ, and x^† is a solution of the noise-free perceptron problem σ(W x^† + b) = y. Since R = G^{*}, this further implies the existence of suitable subgradients of R.

In order to be able to derive error estimates of the form (7), we restrict ourselves to solutions x^† that are in the range of the regularization operator, i.e., there exists data y^† ∈ dom(Ψ) such that x^† is recovered exactly from y^†. Together with the optimality condition for x^†, this implies the existence of an element v^† that satisfies the source condition W^{⊤} v^† ∈ ∂R(x^†).

In the following, we verify that the symmetric Bregman distance with respect to R,

D_R^{symm}(u, v) := ⟨q_u − q_v, u − v⟩, for subgradients q_u ∈ ∂R(u) and q_v ∈ ∂R(v),

satisfies an estimate of the form (7) for solutions of (6).

Before we begin our analysis, we recall how the concept of the Bregman distance of a proper, convex, and lower semi-continuous function f: ℝ^{n} → ℝ∪{∞} generalizes to so-called Burbea-Rao divergences.

Definition 1 (Burbea-Rao divergence). Suppose f: ℝ^{n} → ℝ∪{∞} is a proper, convex and lower semi-continuous function. The corresponding Burbea-Rao divergence is defined as

D_f(u, v) := (f(u) + f(v))/2 − f((u + v)/2),

for all arguments u and v.
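As a quick sanity check on this definition (assuming the standard Jensen-difference form of the Burbea-Rao divergence), one can verify numerically that for f(t) = t² the divergence reduces to (u − v)²/4:

```python
import numpy as np

def burbea_rao(f, u, v):
    # D_f(u, v) = (f(u) + f(v)) / 2 - f((u + v) / 2), the Jensen difference
    return 0.5 * (f(u) + f(v)) - f(0.5 * (u + v))

f = lambda t: t ** 2
for u, v in [(0.0, 2.0), (-1.5, 3.0), (4.0, 4.0)]:
    # For the quadratic, the divergence has the closed form (u - v)^2 / 4
    assert np.isclose(burbea_rao(f, u, v), (u - v) ** 2 / 4.0)
```

Convexity of f guarantees non-negativity of the divergence, and symmetry in u and v is immediate from the definition.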

Another important concept that we need in order to establish error estimates is that of Fenchel conjugates.

Definition 2 (Fenchel conjugate). The Fenchel (or convex) conjugate f^{*}: ℝ^{n} → ℝ∪{−∞, ∞} of a function f: ℝ^{n} → ℝ∪{−∞, +∞} is defined as

f^{*}(w) := sup_x { ⟨w, x⟩ − f(x) }.
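As a small numerical illustration of this definition (our own, not part of the paper's analysis), a crude grid-based approximation of the supremum recovers the classical fact that f(x) = (1/2)x² is its own conjugate:

```python
import numpy as np

def conjugate_on_grid(f, ws, xs):
    # f*(w) = sup_x { w*x - f(x) }, approximated by a maximum over grid points xs
    return np.array([np.max(w * xs - f(xs)) for w in ws])

xs = np.linspace(-10.0, 10.0, 200001)   # fine grid over a large interval
f = lambda x: 0.5 * x ** 2
ws = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

# For f(x) = x^2/2 the conjugate is f*(w) = w^2/2 (f is self-conjugate)
assert np.allclose(conjugate_on_grid(f, ws, xs), 0.5 * ws ** 2, atol=1e-6)
```

The grid approximation is accurate here because the supremum is attained at x = w, well inside the chosen interval.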

The Fenchel conjugate that is of particular interest to us is the conjugate of the function x ↦ B_Ψ(y, x) for fixed y.

Lemma 1. The Fenchel conjugate of B_y(x) := B_Ψ(y, x) reads

B_y^{*}(w) = (Ψ + (1/2)||·||²)(y + w) − (Ψ + (1/2)||·||²)(y).

Proof. From the definition of the Fenchel conjugate we observe

B_y^{*}(w) = sup_x { ⟨w, x⟩ − B_Ψ(y, x) } = sup_x { ⟨w + y, x⟩ − (Ψ + (1/2)||·||²)^{*}(x) } − (Ψ + (1/2)||·||²)(y) = (Ψ + (1/2)||·||²)(y + w) − (Ψ + (1/2)||·||²)(y),

where the last identity follows from the biconjugation theorem for proper, convex, and lower semi-continuous functions, which concludes the proof.

Having defined the Burbea-Rao divergence and having established the Fenchel conjugate of B_Ψ(y, ·), we are now in a position to derive error estimates for the regularized perceptron inversion.

Theorem 2. Suppose we have data y^δ ∈ dom(Ψ) and noise-free data y that satisfy ||y − y^δ|| ≤ δ, and suppose the solution x^† of the perceptron problem σ(W x^† + b) = y satisfies the source condition described above. Then every solution x^α of (6) satisfies an error estimate that bounds the symmetric Bregman distance D_R^{symm}(x^α, x^†) in terms of the noise level δ, the regularization parameter α, and Burbea-Rao divergences associated with Ψ, for a constant that depends only on the source element.

Proof. Every solution x^α of (6) satisfies the optimality condition

W^{⊤}(σ(W x^α + b) − y^δ) + α r_α = 0,

for any subgradient r_α ∈ ∂R(x^α). Subtracting α r^† for a subgradient r^† ∈ ∂R(x^†) from both sides of the equation and taking a dual product with x^α − x^† yields an identity that features the symmetric Bregman distance α D_R^{symm}(x^α, x^†) on one side. We easily verify that the remaining dual products can be expressed in terms of the Bregman loss functions; hence, we can replace them accordingly. We know from the source condition that r^† = W^{⊤} v^† for a source element v^†; hence, we can estimate the corresponding dual product in terms of v^† and the data error. Next, we introduce a constant that collects the norm of the source element. Next, we make use of Lemma 1 to estimate the two arising Fenchel-type expressions from above by Burbea-Rao divergences. Adding both estimates together yields the claimed bound, which together with the error bound ||y − y^δ|| ≤ δ concludes the proof.

Remark 1. We want to emphasize that for continuous Ψ, the Burbea-Rao terms in the error estimate vanish when the noise level δ converges to zero, in which case the important question from an error-estimate point of view is if this term converges more quickly to zero than α, as we would need this to guarantee convergence of the overall estimate for a parameter choice α = α(δ) with α(δ) → 0.

Example 1 (ReLU perceptron). Let us consider a concrete example to demonstrate that (6) is a convergent regularization with respect to the symmetric Bregman distance of R. For the ReLU activation function, Ψ is the characteristic function of the non-negative orthant, i.e., Ψ(v) = 0 if v ≥ 0 (component-wise) and Ψ(v) = ∞ otherwise, so that σ(x) = prox_Ψ(x) = max(x, 0) applies component-wise. In this case, the Burbea-Rao terms in the error estimate of Theorem 2 can be computed explicitly and the estimate simplifies accordingly, where we have also divided by α on both sides of the inequality. If we choose the parameter α = α(δ) such that α(δ) → 0 and δ²/α(δ) → 0 as δ → 0, the right-hand side of the estimate converges to zero, as long as we can ensure y^δ ∈ dom(Ψ), i.e., that the perturbed data remains non-negative. Together with Theorem 1, this verifies that (6) is a convergent variational regularization method for the ReLU perceptron.
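Under the assumed Fenchel-Young-gap form of B_Ψ, one can also check numerically that in the ReLU case the Bregman loss dominates the conventional squared fidelity, i.e., B_Ψ(y, x) ≥ (1/2)||σ(x) − y||² for y ≥ 0; this is a small illustration of ours, not a result quoted from the paper:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def bregman_fidelity(y, x):
    # Assumed ReLU Bregman loss: B_Psi(y, x) = 0.5||y||^2 - <y, x> + 0.5||relu(x)||^2,
    # valid for data y >= 0 (i.e., y in dom(Psi))
    return 0.5 * y @ y - y @ x + 0.5 * relu(x) @ relu(x)

rng = np.random.default_rng(7)
for _ in range(100):
    x = rng.standard_normal(10)
    y = relu(rng.standard_normal(10))
    gap = bregman_fidelity(y, x) - 0.5 * np.sum((relu(x) - y) ** 2)
    # gap = <y, relu(x) - x> >= 0, since y >= 0 and relu(x) >= x component-wise
    assert gap >= -1e-12
```

The gap ⟨y, relu(x) − x⟩ is non-negative whenever the data lies in the non-negative orthant, which is exactly the domain condition imposed on y^δ.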

We want to briefly comment on the extension of the convergence analysis to the general, multi-layer case.

Remark 2. The presented convergence analysis easily extends to a sequential, layer-wise inversion approach. Suppose we have an L-layer network. We can first estimate the auxiliary variable x_{L−1} of the final layer by solving

min over x_{L−1} of B_{Ψ_L}(y^δ, f_L(x_{L−1}, Θ_L)) + α R(x_{L−1}),

which is also of the form of (6), but where the regularization now acts on x_{L−1}. Alternatively, one can also replace Ψ_{L−1} with another function that is tailored to the auxiliary variable x_{L−1} if good prior knowledge for x_{L−1} exists. Once we have estimated x_{L−1}, we can repeat this procedure layer by layer, each time treating the previous estimate as data, until we recover the input estimate x^α as a solution of (3) but with data y^δ.

The advantage of such a sequential approach is that every individual regularization problem is convex and the previously presented theorems and guarantees still apply. The disadvantage is that, for this approach to work in theory, we require error bounds of the form (7) for every auxiliary variable.

Please note that showing that the simultaneous approach (3) is a (convergent) variational regularization is beyond the scope of this work as it is harder and potentially requires additional assumptions for the following reason. The overall objective function in (3) is no longer guaranteed to be convex with respect to all variables simultaneously, which means that we cannot simply carry over the analysis of the single-layer to the multi-layer perceptron case.

Remark 3 (Infinite-dimensional setting). Please note that almost all theoretical results presented in this section also apply to neural networks that map functions between Banach spaces instead of finite-dimensional vectors. The only result that changes is Theorem 1, Item 3, where the statement in an infinite-dimensional setting only implies convergence in the weak-star topology.

This concludes the theoretical analysis of the perceptron inversion model. In the following section, we focus on how to implement (6) and its more general counterpart (3).

In this section, we describe how to computationally implement the proposed variational regularization for both the single-layer and the multi-layer perceptron setting. More specifically, we show that the proposed variational regularization can be efficiently solved via a generalized primal-dual hybrid gradient method and a coordinate descent approach.

To begin with, we first consider the example of inverting a (single-layer) perceptron. For R chosen as the total variation and σ = prox_Ψ, problem (6) can be reformulated as the saddle-point problem

min over x, max over z of B_Ψ(y^δ, W x + b) + ⟨A x, z⟩ − F^{*}(z),

where F^{*} denotes the convex conjugate of F = α||·||_{p, 1} and A is a discretization of the gradient operator. Saddle-point problems of this form can be solved with a generalization of the primal-dual hybrid gradient (PDHG) method, which reads

x^{k+1} = x^{k} − τ_x (W^{⊤}(σ(W x^{k} + b) − y^δ) + A^{⊤} z^{k}), (13a)
z^{k+1} = prox_{τ_z F^{*}}(z^{k} + τ_z A (2 x^{k+1} − x^{k})), (13b)

where we alternate between a descent step in the primal variable x and an ascent step in the dual variable z, and where the stepsize parameters τ_x and τ_z are chosen such that convergence is guaranteed, in accordance with the operator norm of A and the Lipschitz constant of the gradient of B_Ψ with respect to its second argument.
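To illustrate the descent/ascent pattern, here is a minimal PDHG sketch of ours for the simpler 1-D analog min_x (1/2)||x − f||² + α||Dx||_1 with a forward-difference operator D; this is a stand-in with a plain least-squares fidelity, not the perceptron scheme itself, and the stepsizes are chosen so that τ_x τ_z ||D||² < 1:

```python
import numpy as np

def pdhg_tv1d(f, alpha, tau_x=0.25, tau_z=0.9, iters=3000):
    """PDHG for min_x 0.5||x - f||^2 + alpha * ||Dx||_1 (1-D total variation).

    D is the forward-difference operator; ||D||^2 <= 4, so tau_x * tau_z * 4 < 1.
    """
    n = len(f)
    D = lambda x: np.diff(x)                                           # forward differences
    Dt = lambda z: np.concatenate(([-z[0]], z[:-1] - z[1:], [z[-1]]))  # adjoint D^T
    x = np.zeros(n)
    x_bar = x.copy()
    z = np.zeros(n - 1)
    for _ in range(iters):
        # Dual ascent step followed by projection onto the l_inf ball of radius alpha
        z = np.clip(z + tau_z * D(x_bar), -alpha, alpha)
        # Primal descent step: proximal map of 0.5||. - f||^2 with stepsize tau_x
        x_new = (x + tau_x * (f - Dt(z))) / (1.0 + tau_x)
        x_bar = 2.0 * x_new - x        # over-relaxation of the primal iterate
        x = x_new
    return x

rng = np.random.default_rng(1)
f = np.concatenate([np.zeros(20), np.ones(20)]) + 0.1 * rng.standard_normal(40)
alpha = 0.5
x = pdhg_tv1d(f, alpha)

obj = lambda u: 0.5 * np.sum((u - f) ** 2) + alpha * np.sum(np.abs(np.diff(u)))
# A (near-)minimizer must in particular beat the noisy data f itself
assert obj(x) <= obj(f)
```

The projection in the dual update is exactly the proximal map of the conjugate of the scaled ℓ_1 norm, mirroring the role of prox_{τ_z F^*} in the scheme above.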

In this work, we will focus on the discrete total variation ||∇x||_{p, 1}. For images x ∈ ℝ^{H×W}, we can define a finite forward difference discretization of the gradient operator ∇: ℝ^{H×W} → ℝ^{H×W×2} as

(∇x)_{i, j, 1} = x_{i+1, j} − x_{i, j} (and zero at the boundary i = H),
(∇x)_{i, j, 2} = x_{i, j+1} − x_{i, j} (and zero at the boundary j = W).

The discrete total variation is then defined as the ℓ_1 norm of the point-wise ℓ_p norm of the discretized gradient, i.e., ||∇x||_{p, 1} = Σ_{i, j} ||(∇x)_{i, j}||_p.

For our numerical results, we consider the isotropic total variation and consequently choose p = 2. With σ = prox_Ψ denoting the activation function, the PDHG approach (13) for solving the perceptron inversion problem (6) can then be summarized as scheme (14), in which A is the discrete gradient ∇, the term A^{⊤} z is replaced by −div z, and the dual update reduces to a point-wise projection onto the ℓ_2 ball of radius α.

Please note that we define the discrete approximation of the divergence, div, such that it satisfies div = −∇^{⊤}, i.e., it is the negative transpose of the discretized finite difference approximation of the gradient, in analogy to the continuous case. This is why the sign in (14a) is flipped in comparison to (13a). The proximal map with regards to the convex conjugate of the (scaled) isotropic total variation reduces to a point-wise projection onto the ℓ_2 ball of radius α.
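The relation div = −∇^⊤ can be verified numerically. The sketch below uses one common forward-difference discretization with zero values at the image boundary; the exact boundary handling is our assumption and may differ in details from the discretization used in the experiments:

```python
import numpy as np

def grad(x):
    # Forward differences, zero at the boundary: R^{H x W} -> R^{H x W x 2}
    g = np.zeros(x.shape + (2,))
    g[:-1, :, 0] = x[1:, :] - x[:-1, :]
    g[:, :-1, 1] = x[:, 1:] - x[:, :-1]
    return g

def div(p):
    # Discrete divergence chosen such that div = -grad^T (negative adjoint)
    d = np.zeros(p.shape[:2])
    d[:-1, :] += p[:-1, :, 0]
    d[1:, :] -= p[:-1, :, 0]
    d[:, :-1] += p[:, :-1, 1]
    d[:, 1:] -= p[:, :-1, 1]
    return d

def tv_iso(x):
    # Isotropic TV: l_1 norm of the point-wise l_2 norm of the gradient
    g = grad(x)
    return np.sum(np.sqrt(np.sum(g ** 2, axis=-1)))

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 7))
p = rng.standard_normal((8, 7, 2))
# Adjoint test: <grad x, p> = <x, -div p>
assert np.isclose(np.sum(grad(x) * p), -np.sum(x * div(p)))
# TV of a constant image vanishes
assert tv_iso(np.full((8, 7), 3.0)) == 0.0
```

The adjoint test makes the sign flip between (13a) and (14a) explicit: substituting A = ∇ turns −A^⊤ z into +div z.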

We now discuss the implementation of the inversion of multi-layer perceptrons with auxiliary variables x_1, …, x_{L−1}.

For the minimization of (3), we consider an alternating minimization approach, also known as coordinate descent, in which we alternatingly minimize the objective with respect to the input variable x_0 and each auxiliary variable x_l for l ∈ {1, …, L−1}, resulting in a sequence of convex sub-problems.

Note that one advantage of adopting this approach is that we exploit the fact that the overall objective function is convex in each individual variable when all other variables are kept fixed. In the following, we will discuss different strategies to computationally solve each sub-problem.
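The benefit of block-convexity can be illustrated on a toy block-convex objective (our own example, not the model (3)): exact alternating minimization never increases the objective value:

```python
import numpy as np

# Toy objective F(u, v) = 0.5||A u - v||^2 + 0.5||v - y||^2, convex in each block
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
F = lambda u, v: 0.5 * np.sum((A @ u - v) ** 2) + 0.5 * np.sum((v - y) ** 2)

u = np.zeros(4)
v = np.zeros(6)
vals = [F(u, v)]
for _ in range(20):
    # Exact minimization over u (a least-squares problem) with v fixed
    u = np.linalg.lstsq(A, v, rcond=None)[0]
    # Exact minimization over v with u fixed: v = (A u + y) / 2
    v = 0.5 * (A @ u + y)
    vals.append(F(u, v))

# Coordinate descent is monotonically non-increasing on block-convex objectives
assert all(vals[k + 1] <= vals[k] + 1e-12 for k in range(len(vals) - 1))
```

Each block update solves its sub-problem exactly, so the objective value can only decrease or stay the same, mirroring the descent property exploited in the alternating scheme above.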

When optimizing with respect to the input variable x_0, the structure of sub-problem (15a) is identical to the perceptron inversion problem that we have discussed in Section 4.1. Hence, we can approximate a minimizer with the PDHG scheme (14), where the auxiliary variable x_1 takes the role of the data y^δ, which yields the iteration (16).

For each auxiliary variable x_l with l ∈ {1, …, L−1}, the corresponding sub-problem can be solved with a proximal gradient step with stepsize parameter τ_{x_l}, which we can rewrite to obtain the update formula (17).

This concludes the discussion on the implementation of the regularized single-layer and multi-layer perceptron inversion. In the next section, we present some numerical results to demonstrate the effectiveness of the proposed approaches empirically.

In this section, we present numerical results for the perceptron inversion problem implemented with the PDHG algorithm as outlined in (14), and for the multi-layer perceptron inversion problem implemented with the coordinate descent approach as described in (16) and (17). In the following, we first demonstrate that we can invert a perceptron with random weights, bias terms, and ReLU activation function via (11) with total variation regularization and the algorithm described in (14). We then proceed to a more realistic example of inverting the code of a simple autoencoder with a perceptron encoder, before we extend the results to the total variation-based inversion of encodings from multi-layer convolutional autoencoders. All results have been computed in Python 3.7 with PyTorch on an Intel Xeon CPU E5-2630 v4.

We present results for two experiments: the first one is the perceptron inversion of the image of a circle from the noisy output of the perceptron, where we compare Landweber regularization and the total variation-based variational regularization (6). For the second experiment, we perform perceptron inversion for samples from the MNIST dataset.

We begin with the toy example of recovering the image of a circle from noisy measurements of a ReLU perceptron. To prepare the experiment, we generate a circle image x^† ∈ ℝ^{64 × 64}, as shown in the figure below, together with a random weight matrix W ∈ ℝ^{512 × 4, 096} and a random bias vector b ∈ ℝ^{512 × 1}. The weight matrix operates on the column-vector representation of x, where x ∈ ℝ^{4, 096 × 1}. The noise-free data y is generated via the forward operation of the model, i.e., y = σ(W x^† + b), and we obtain the noisy data y^δ by adding Gaussian noise with mean 0 and standard deviation 0.005. Note that we clip all negative values of y^δ to ensure y^δ ∈ dom(Ψ).

Groundtruth image x^† of a circle.
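The data generation for this toy experiment can be sketched as follows; the circle radius and the scaling of the random weights are our assumptions, as the text does not specify them:

```python
import numpy as np

rng = np.random.default_rng(42)

# Groundtruth: binary circle image x^dagger in R^{64 x 64} (radius is assumed)
H = W = 64
ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
x_true = ((ii - H / 2) ** 2 + (jj - W / 2) ** 2 <= 16 ** 2).astype(float)

# Random ReLU perceptron acting on the flattened (column-vector) image
Wmat = rng.standard_normal((512, 4096)) / np.sqrt(4096)  # scaling is assumed
b = rng.standard_normal(512)
forward = lambda x: np.maximum(Wmat @ x.reshape(-1) + b, 0.0)

# Noise-free data, then Gaussian noise (std 0.005) and clipping into dom(Psi)
y = forward(x_true)
y_delta = np.clip(y + 0.005 * rng.standard_normal(512), 0.0, None)

assert y_delta.shape == (512,) and np.all(y_delta >= 0.0)
```

The final clipping step mirrors the requirement y^δ ∈ dom(Ψ) for the ReLU case, i.e., non-negative data.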

A first attempt to solve this ill-posed perceptron inversion problem is via Landweber regularization, i.e., gradient-descent-type iterations applied to the non-regularized least-squares fidelity, combined with early stopping. Although the final Landweber iterate x^K produces a small residual ||σ(W x^K + b) − y^δ||, the recovered image does not resemble the image x^†. We will discuss shortly the reason for this visually poor inversion. In comparison, we compute a regularized inversion via the total variation regularization approach following (14), with a regularization parameter α of the order of 10^{−2}. Both x_0 and the dual variable are initialized with zeros, and the dual stepsize parameter is chosen as τ_z = 1/(8α). We stop the iterations once the change between consecutive iterates of x_0 and the dual variable drops below 10^{−5}, or when we reach the maximum number of iterations, which we set to 10, 000. As shown in the corresponding figures, the total variation-based inversion recovers the circle image far more faithfully.
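A Landweber-type iteration for a ReLU perceptron can be sketched as follows on a small random problem (our own illustration; stepsize, dimensions, and iteration count are assumptions):

```python
import numpy as np

def landweber_relu(Wm, b, y, x0, tau=0.05, iters=500):
    """Landweber-type iteration x <- x - tau * J^T (sigma(Wx + b) - y),
    where sigma = ReLU and J collects a subderivative of the forward map."""
    x = x0.copy()
    for _ in range(iters):
        a = Wm @ x + b
        r = np.maximum(a, 0.0) - y          # residual sigma(Wx + b) - y
        mask = (a > 0).astype(float)        # ReLU subgradient (0/1 pattern)
        x = x - tau * Wm.T @ (mask * r)
    return x

rng = np.random.default_rng(0)
Wm = rng.standard_normal((12, 8)) / np.sqrt(8)
b = rng.standard_normal(12)
x_star = rng.standard_normal(8)
y = np.maximum(Wm @ x_star + b, 0.0)

res = lambda x: np.linalg.norm(np.maximum(Wm @ x + b, 0.0) - y)
x0 = np.zeros(8)
xK = landweber_relu(Wm, b, y, x0)
assert res(xK) < res(x0)   # early-stopped iterates reduce the residual
```

As discussed in the text, a small residual alone does not guarantee a visually meaningful reconstruction: without regularization, the iteration gravitates toward small-norm solutions rather than solutions with the desired structure.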

Inverted image via Landweber regularization.

Inverted image via TV regularization.

To explain why the Landweber iteration performs worse compared to the total variation regularization for this specific example, we compare the ℓ_2 norms and the TV semi-norms of the two solutions and of the groundtruth image x^†. The ℓ_2 norm of the Landweber solution is the smallest, while the ℓ_2 norms of the TV-regularized solution and of the groundtruth image x^† measure 25.69 and 28.07, respectively. This is not surprising, as the Landweber iteration is known to converge to a minimal Euclidean norm solution if the noise level converges to zero. On the other hand, when we compare the TV semi-norm of each solution, the groundtruth image measures 128.0, while the Landweber solution has a considerably larger TV semi-norm than the TV-regularized solution, which explains why the latter is visually much closer to x^†.

In this second example, we perform perceptron inversion on the MNIST dataset, where the data y^δ stems from the encoding of a trained autoencoder.

To be more precise, we first train a two-layer fully connected autoencoder on the MNIST training dataset; the encoder then acts as the perceptron whose code we aim to invert.

We then invert the code via the PDHG scheme (14) with stepsize parameter τ_z = 1/(8α). We choose the regularization parameter α in the range [10^{−4}, 10^{−2}]; it is set to 5 × 10^{−3} for all sample images from the training set, and to α = 5 × 10^{−2} for all sample images from the validation set. These choices work well with regards to the visual quality of the inverted images.

In the following figure, we compare groundtruth input images and the corresponding autoencoder outputs with the regularized inversions of the encodings.

Groundtruth input images from the MNIST training dataset (Left) and validation dataset (Right), together with the corresponding autoencoder output images and the regularized inversions x^α of the encoding via (11).

In this section, we present numerical results for inverting multi-layer perceptrons. In particular, we consider feedforward neural networks with convolutional layers (CNNs), where in the network architecture two-dimensional convolution operations are used to represent the linear operations in the affine-linear functions f_l, and we aim to recover the network input from noisy encodings y^δ.

More specifically, we first train a six-layer convolutional autoencoder on the MNIST training dataset via a stochastic gradient method to minimize the mean squared error (MSE). The encoder then provides the encodings y^δ that serve as data for the inversion.

Following the implementation details outlined in Section 4.2, we iteratively compute the update steps (16) and (17) to recover the network input, with τ_z = 1/(8α). The initial values of x_0 and of the auxiliary variables are set to zero, and we stop the iterations once consecutive iterates differ by less than 10^{−5} in norm. For the coordinate descent algorithm, the stepsize parameters are set to sufficiently small values that guarantee descent in every sub-problem.

In the following figure, we visualize the inversion results, where the regularization parameter α is chosen in the range [10^{−4}, 10^{−2}] and set to 9 × 10^{−3} for both training sample images and validation sample images for best visual inversion quality.

Groundtruth input images from the MNIST training dataset (Left) and validation dataset (Right), together with the corresponding CNN autoencoder output images and the regularized inversions x^α of the encoding via (3).

In a final experiment, we investigate the robustness of the proposed inversion with respect to noise by comparing the inverted images with the decoded images for various levels of Gaussian noise, gradually decreasing the noise level δ^{2} from 6.80 down to 0.00.

Visualization of the comparison between inverted image and decoded image against various levels of noise. (Top) Decoded output image from the trained convolutional autoencoder. (Bottom) Inverted input image from the CNN with total variation regularization.

Please note that for each noise level the regularization factor α is manually selected in the range [10^{−4}, 10^{−2}] for the best PSNR value. As we can see, for the noise level with standard deviation 0.33, where δ^{2} is at 6.80, the decoder is only capable of producing a blurry, distorted output, while the inverted image shows the structure of the digit more clearly. When we decrease the noise level down to 0.00, the inverted image becomes sharper, while the decoded image remains less sharply defined.

Comparison of PSNR values of total variation-based reconstruction and decoder output per noise level. Each curve reports the change of PSNR value over gradually decreasing levels of Gaussian noise, with δ^{2} ranging from 0.00 to 6.80.

We have introduced a novel variational regularization framework based on a lifted Bregman formulation for the stable inversion of feed-forward neural networks (also known as multi-layer perceptrons). We have proven that the proposed framework is a convergent regularization for the single-layer perceptron case under the mild assumption that the inverse problem solution has to be in the range of the regularization operator. We have derived a general error estimate as well as a specific error estimate for the case that the activation function is the ReLU activation function. We have also addressed the extension of the theory to the multi-layer perceptron case, which can be carried out sequentially, albeit under unrealistic assumptions. We have discussed implementation strategies to solve the proposed scheme computationally, and presented numerical results for the regularized inversion of the image of a circle and piecewise constant images of hand-written digits from single- and multi-layer perceptron outputs with total variation regularization.

Despite all the positive achievements presented in this work, the proposed framework also has some limitations. The framework is currently restricted to feed-forward architectures with affine-linear transformations and proximal activation functions. While it is straightforward to extend the framework to other architectures such as ResNets, extensions to architectures that do not fit this structure are less obvious and remain future work.

An open question is how a convergence theory without restrictive, unrealistic assumptions can be established for the multi-layer case. One issue is the non-convexity of the proposed formulation. A remedy could be the use of different architectures that lead to lifted Bregman formulations that are jointly convex in all auxiliary variables.

And last but not least, one would also like to consider other forms of regularization, such as iterative regularization or data-driven regularizations.

Publicly available datasets were analyzed in this study. This data can be found at:

XW has programmed and contributed all numerical results as well as Sections 4 and 5. MB has contributed the introduction (Section 1) as well as the theoretical results (Section 3). XW and MB have contributed equally to Sections 2 and 6. Both authors contributed to the article and approved the submitted version.

The authors acknowledge support from the Cantab Capital Institute for the Mathematics of Information, the Cambridge Centre for Analysis (CCA), and the Alan Turing Institute (ATI).

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.