Unsupervised Learning of Probably Symmetric
Deformable 3D Objects From Images in the Wild

(Invited Paper)
Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi

Abstract—We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision.

The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to
disentangle these components without supervision, we use the fact that many object categories have, at least approximately, a
symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the
appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by
predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this
method can recover very accurately the 3D shape of human faces, cat faces and cars from single-view images, without any supervision
or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the
level of 2D image correspondences.

Index Terms—Unsupervised 3D reconstruction, single-image 3D reconstruction, intrinsic image decomposition


1 INTRODUCTION

The ability to understand and reconstruct the content of
images in 3D is of great importance in many computer
vision applications. Yet, when it comes to learning categories
of visual objects, for instance to detect and segment
them, most approaches model them as 2D patterns [1], with
no obvious understanding of their 3D structure. Thus, in
this paper we consider the problem of learning categories of
3D deformable objects. Furthermore, we do so under two
challenging conditions. The first condition is that no 2D or
3D ground truth information (such as keypoints, segmenta-
tion, depth maps, or prior knowledge of a 3D model) is
available. Learning without external supervision removes
the need for collecting image annotations, which is often a
major obstacle to deploying deep learning to new applica-
tions. The second condition is that learning can only use an
unconstrained collection of single-view images — in particular,
it does not use multiple views of the same object instance.
Learning from single-view images is useful because in
many applications we only have a source of independent
still images to work with (for example obtained from an
Internet search engine).

In more detail, we introduce a new learning algorithm
that takes as input a collection of single-view images of a
deformable object category and produces as output a deep
network that can estimate the 3D shape of any object
instance given a single image of it (Fig. 1). The algorithm is
based on an autoencoder that internally decomposes the
image into albedo, depth, illumination and viewpoint, with-
out direct supervision for any of these factors. In general,
decomposing images into these four factors is ill-posed. We
thus seek a minimal set of assumptions that makes the
problem solvable. To this end, we note that many object cat-
egories are symmetric (e.g. almost all animals and many
handcrafted objects). If an object is perfectly symmetric,
mirroring any image of it results in a second virtual view of
the object. Furthermore, if point correspondences between
the image and its mirrored version can be established, then
the 3D shape of the object can be recovered using any of a
number of standard multi-view 3D reconstruction
approaches [2], [3], [4], [5], [6]. Motivated by this, we seek to
leverage symmetry as a cue to constrain this decomposition
task.

While symmetry is a powerful cue, using it in practice is
far from trivial. First, even if symmetry provides a
pair of virtual views of an object, reconstruction still requires
establishing point correspondences between them, which
can be difficult to do in an unsupervised manner. For
instance, the appearance of symmetric points may still differ
substantially due to asymmetric illumination. Second, spe-
cific object instances are in practice never fully symmetric,
neither in shape nor appearance. Shape is non-symmetric
due to variations in pose or other details (e.g. the hair style
or expressions in a human face), and albedo can also be
non-symmetric (e.g. asymmetries in the texture of a cat’s fur).

We address these issues in two ways. First, we explicitly
account for the effect of illumination in the reconstruction
pipeline by decomposing the appearance into albedo and
shading. In this manner, the model learns to explain

The authors are with the Department of Engineering Science, University of
Oxford, OX1 2JD Oxford, U.K. E-mail: {szwu, chrisr, vedaldi}@robots.ox.ac.uk.

Manuscript received 20 Dec. 2020; accepted 16 Apr. 2021. Date of publication
29 Apr. 2021; date of current version 6 Mar. 2023.
(Corresponding author: Shangzhe Wu.)
Recommended for acceptance by Silvio Savarese and Ce Liu.
Digital Object Identifier no. 10.1109/TPAMI.2021.3076536

5268 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 4, APRIL 2023

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

https://orcid.org/0000-0003-1011-5963
https://orcid.org/0000-0003-3994-8045
https://orcid.org/0000-0003-1374-2858


asymmetries in the object appearance resulting from illumi-
nation, allowing it to better understand how pairs of sym-
metric views of the object correspond. Moreover, since
shading provides information on the surface normals and
thus the 3D shape, decomposing it allows the model to
explicitly use this information to constrain 3D shapes. Sec-
ond, we augment the model to reason about potential lack
of symmetry in the objects. To do this, the model predicts,
along with the factors listed above, a dense map explaining
the probability that a given pixel has a symmetric counter-
part in the image.

We combine these elements in an end-to-end learning
formulation, where all components, including the sym-
metry probability map, are learned from raw RGB
images only. As a further contribution, we show that,
rather than enforcing the symmetry by adding further
terms to the learning objective, we can instead do so
indirectly. The latter is obtained by randomly mirroring
the internal representation of the object, thus encourag-
ing the autoencoder to generate a symmetric view of the
object. The advantage of this approach is that it avoids
the need to introduce and thus tune additional terms in
the learning objective.

We test our method on several datasets, including
human faces, cat faces and synthetic cars. We provide a
thorough ablation study and extensive analyses using a syn-
thetic face dataset with the necessary 3D ground truth. On
real images, we achieve higher fidelity reconstruction
results compared to other methods [7], [8] that do not rely
on 2D or 3D ground truth information, nor prior knowledge
of a 3D model of the instance or class. In addition, our
method outperforms a recent state-of-the-art method [9]
that uses keypoint supervision for 3D reconstruction on real
faces, while our method uses no external supervision at all.
As a by-product, our method also learns intrinsic image
decomposition without any external supervision. Finally,
we demonstrate that our trained model generalizes to non-
natural images, such as paintings and cartoon drawings, as
well as video frames without any fine-tuning.

This article is an extension and archival version of our
previous work [10]. In this article, we expand the literature
review, provide additional technical details, and include
additional experiments and discussions that reveal the
important insights of the proposed algorithm, including
how it works, how it may fail, and how it compares to
prominent model-based methods on 3D reconstruction
benchmarks. The code and pretrained models are available
at https://github.com/elliottwu/unsup3d.

2 RELATED WORK

In order to assess our contribution in relation to the vast lit-
erature on image-based 3D reconstruction, it is important to
consider three aspects of each approach: which information
is used, which assumptions are made, and what the output
is. Below and in Table 1 we compare our contribution to
prior works based on these factors.

Fig. 1. Unsupervised learning of 3D deformable objects from in-the-wild images. Left: Training uses only single views of the object category with no
additional supervision at all (i.e. no ground-truth 3D information, multiple views, or any prior model of the object). Right: Once trained, our model
reconstructs the 3D pose, shape, albedo and illumination of a deformable object instance from a single image with excellent fidelity.

TABLE 1
Comparison With Selected Prior Work: Supervision, Goals, and Data

Paper   Supervision       Goals                  Data
[11]    3D scans          3DMM                   Face
[12]    3DV, I            Prior on 3DV           ShapeNet, Ikea
[13]    3DP               Prior on 3DP           ShapeNet
[14]    3DM               Prior on 3DM           Face

[15]    3DMM, 2DKP, I     Refine 3DMM fit to I   Face
[16]    3DMM, 2DKP, I     Fit 3DMM to I+2DKP     Face
[17]    3DMM              Fit 3DMM to 3D scans   Face
[18]    3DMM, 2DKP        Pred. 3DMM             Humans
[19]    3DMM, 2DS+KP      Pred. N, A, L          Face
[20]    3DMM, I           Pred. 3DM, VP, T, E    Face
[21]    3DMM, 2DKP, I     Fit 3DMM to I          Face

[22]    2DS               Prior on 3DV           ModelNet
[23]    2DS               Pred. 3DV              ShapeNet
[24]    I, 2DS, VP        Prior on 3DV           ShapeNet, PAS3D
[25]    I, 2DS+KP         Pred. 3DM, T, VP       Birds
[26]    I, 2DS            Pred. 3DM, T, L, VP    ShapeNet, Birds
[27]    I, 2DS            Pred. 3DV, VP          ShapeNet, others
[28]*   I, 2DS, 3DTM      Fit 3DTM to I          Animals
[29]*   I, 2DS, 3DTM      Pred. 3DM, T, VP       Birds, PAS3D
[30]*   I, 2DS†           Pred. 3DM, T, VP       Birds, PAS3D

[8]     I                 Prior on 3DM, T        Face
[31]    I                 Prior on 3DV, T        Face, others
[32]*   I                 Prior on 3DV, T        ShapeNet, others
[7]     I                 Pred. 3DM, VP, T‡      Face
[33]    I                 Pred. V, L, VP         ShapeNet
Ours    I                 Pred. D, L, A, VP      Face, others

I: image, 3DMM: 3D morphable model, 3DTM: 3D template model, 2DKP:
2D keypoints, 2DS: 2D silhouette, 3DP: 3D points, VP: viewpoint, E: expression,
3DM: 3D mesh, 3DV: 3D volume, D: depth, N: normals, A: albedo, T:
texture, L: light, PAS3D: PASCAL 3D+ [34]. †: in the form of part segmentation
maps. ‡: can also recover A and L in post-processing. *: appeared after our
original paper was published.



Our method uses single-view images of an object cate-
gory as training data, assumes that the objects belong to a
specific class (e.g. human faces) which is weakly symmetric,
and outputs a monocular predictor capable of decomposing
any image of the category into shape, albedo, illumination,
viewpoint and symmetry probability.

2.1 Structure From Motion

Traditional methods such as Structure from Motion (SfM) [35]
can reconstruct the 3D structure of individual rigid scenes
given as input multiple views of each scene and 2D keypoint
matches between the views. This can be extended in two ways.
First, monocular reconstruction methods can perform dense 3D
reconstruction from a single image without 2D keypoints [36],
[37], [38]. However, they require multiple views [38] or videos
of rigid scenes for training [36]. Second, Non-Rigid SfM
(NRSfM) approaches [39], [40] can learn to reconstruct deformable
objects by allowing 3D points to deform in a limited manner
between views, but require supervision in terms of
annotated 2D keypoints for both training and testing. Hence,
neither family of SfM approaches can learn to reconstruct
deformable objects from raw pixels of a single view.

2.2 Shape From X

Many other monocular cues have been used as alternatives
or supplements to SfM for recovering shape from images,
such as shading [41], [42], silhouettes [43], texture [44], symmetry
[2], [3] etc. In particular, our work is inspired by
shape from symmetry and shape from shading. Shape from symmetry
[2], [3], [4], [5] reconstructs symmetric objects from a
single image by using the mirrored image as a virtual sec-
ond view, provided that symmetric correspondences are
available. [5] also shows that it is possible to detect symme-
tries and correspondences using descriptors. Shape from
shading [41], [42] assumes a shading model such as Lamber-
tian reflectance, and reconstructs the surface by exploiting
the non-uniform illumination.

2.3 Category-Specific Reconstruction

Learning-based methods have recently been leveraged to
reconstruct objects from a single view, either in the form of
a raw image or 2D keypoints (see also Table 1). While this
task is ill-posed, it has been shown to be solvable by learn-
ing a suitable object prior from the training data [11], [12],
[13], [14]. A variety of supervisory signals have been pro-
posed to learn such priors. Besides using 3D ground truth
directly, authors have considered using videos [36], [45],
[46], [47], [48], stereo pairs [38], [49] and multi-view images
[50], [51], [52], [53], [54].

Other approaches have used single views with 2D key-
point annotations [9], [25], [55], [56] or object masks [23],
[25], [26], [29]. For objects such as human bodies and human
faces, some methods [16], [17], [18], [20], [21], [29], [57], [58],
[59], [60], [61] have learned to reconstruct from raw images,
but starting from the knowledge of a predefined shape
model, such as SMPL [62] or Basel [11], or shape templates.
These prior models are constructed using specialized hard-
ware and/or other forms of supervision, which are often
difficult to obtain for deformable objects in the wild, such as
animals, and also limited in details of the shape.

Only recently have authors attempted to learn the geom-
etry of object categories from raw, monocular views only.
Thewlis et al. [63], [64] use equivariance to learn dense
landmarks, which recovers the 2D geometry of the objects.
DAE [65] learns to predict a deformation field by
heavily constraining an autoencoder with a small bottleneck
embedding, and [7] lifts this to 3D; in post-processing,
they further decompose the reconstruction into albedo and
shading, obtaining an output similar to ours.

Adversarial learning has been proposed as a way of hallucinating
new views of an object. Some of these methods
start from 3D representations [12], [13], [14], [32], [66]. Kato
et al. [24] train a discriminator on raw images but use
viewpoint as additional supervision. HoloGAN [31] only uses
raw images but does not obtain an explicit 3D reconstruction.
Szabo et al. [8] use adversarial training to reconstruct
3D meshes of the object, but do not assess their results
quantitatively. Henzler et al. [27] also learn from raw
images, but only experiment with images that contain the
object on a white background, which is akin to supervision
with 2D silhouettes. In Section 4.4, we compare to [7], [8]
and demonstrate superior reconstruction results with much
higher fidelity.

Since our model generates images from an internal 3D
representation, one essential component is a differentiable
renderer. However, with a traditional rendering pipeline,
gradients across occlusions and boundaries are not defined.
Several soft relaxations have thus been proposed [67], [68],
[69]. Here, we use a PyTorch implementation (footnote 1) of [68].

3 METHOD

Our learning algorithm, illustrated in Fig. 2, takes as input a
collection of independent images of objects of a certain cate-
gory, such as human or cat faces. It then produces as output
a model F that, given any new image, recovers the object’s
3D shape, albedo, illumination and viewpoint.

As the algorithm has only raw images to learn from, the
learning objective is reconstructive: namely, the model is
trained so that the combination of the four factors gives

Fig. 2. Photo-geometric autoencoding. Our network F decomposes an
input image I into depth, albedo, viewpoint and lighting, together with a
pair of confidence maps. It is trained to reconstruct the input without
external supervision.

1. https://github.com/daniilidis-group/neural_renderer



back the input image. This results in an auto-encoding pipeline
where the factors have, due to the way they are combined
to generate an image, a specific photo-geometric
interpretation.

Due to the lack of 2D or 3D supervision and of a 3D prior
on the possible shapes of the objects, this reconstruction
problem is ill-posed. In order to address this issue, we use
the fact that many object categories are bilaterally symmetric,
which provides a strong geometric cue to remove the most
severe reconstruction ambiguities. In practice, the appear-
ance of specific object instances is never exactly symmetric
due to deformations of the 3D shape and asymmetric details
in the shape itself as well as in the illumination and albedo.
We take two measures to account for these asymmetries.
First, we explicitly model asymmetric illumination. Second,
our model also estimates, for each pixel in the input image,
a confidence score that explains the probability of the pixel
having a symmetric counterpart in the image (denoted as
conf. $s$ and $s'$ in Fig. 2).

The following sections describe how this is done, looking
first at the photo-geometric autoencoder (Section 3.1), then
at how symmetries are modelled (Section 3.2), followed by
details of the image formation (Section 3.3) and the optional
use of a perceptual loss (Section 3.4).

3.1 Photo-Geometric Autoencoding

An image $I$ is a function $V \to \mathbb{R}^3$ defined on a grid
$V = \{0, \ldots, W-1\} \times \{0, \ldots, H-1\}$, or, equivalently, a tensor in
$\mathbb{R}^{3 \times W \times H}$. We assume that the image is roughly centered on
an instance of the object of interest. The goal is to learn a
function $F$, implemented as a neural network, that maps
the image $I$ to four factors $(d, a, w, l)$ comprising a depth map
$d : V \to \mathbb{R}_+$, an albedo image $a : V \to \mathbb{R}^3$, a global light
direction $l \in \mathbb{S}^2$, and a viewpoint $w \in \mathbb{R}^6$ so that the image can be
reconstructed from them.

The image I is reconstructed from the four factors in two
steps, lighting L and reprojection P, as follows:

$$\hat{I} = P\big(L(a, d, l), d, w\big). \tag{1}$$

The lighting function L generates a version of the object
based on the depth map d, the light direction l and the
albedo $a$ as seen from a canonical viewpoint $w = 0$. The
viewpoint $w$ represents the transformation between the
canonical view and the viewpoint of the actual input image
$I$. Then, the reprojection function $P$ simulates the effect of a
viewpoint change and generates the image $\hat{I}$ given the
canonical depth $d$ and the shaded canonical image $L(a, d, l)$.
Learning uses a reconstruction loss which encourages $I \approx \hat{I}$
(Section 3.2).

3.1.1 Discussion

The effect of lighting could be incorporated in the albedo a
by interpreting the latter as a texture rather than as the
object’s albedo. However, there are two good reasons to
avoid this. First, the albedo a is often symmetric even if the
illumination causes the corresponding appearance to look
asymmetric. Separating them allows us to more effectively
incorporate the symmetry constraint described below. Sec-
ond, shading provides an additional cue on the underlying
3D shape [70], [71]. In particular, unlike the recent work

of [65] where a shading map is predicted independently
from shape, our model computes the shading based on the
predicted depth, mutually constraining each other.

3.2 Probably Symmetric Objects

Leveraging symmetry for 3D reconstruction requires identi-
fying symmetric object points in an image. Here we do so
implicitly, assuming that depth and albedo, which are recon-
structed in a canonical frame, are symmetric about a fixed
vertical plane. An important beneficial side effect of this
choice is that it helps the model discover a ‘canonical view’
for the object, which is important for reconstruction [40].

To do this, we consider the operator that flips a map
$a \in \mathbb{R}^{C \times W \times H}$ along the horizontal axis (footnote 2):
$[\operatorname{flip} a]_{c,u,v} = a_{c,W-1-u,v}$.
We then require $d \approx \operatorname{flip} d'$ and $a \approx \operatorname{flip} a'$. While these
constraints could be enforced by adding corresponding loss
terms to the learning objective, they would be difficult to
balance. Instead, we achieve the same effect indirectly, by
obtaining a second reconstruction $\hat{I}'$ from the flipped depth
and albedo

$$\hat{I}' = P\big(L(a', d', l), d', w\big), \quad a' = \operatorname{flip} a, \quad d' = \operatorname{flip} d. \tag{2}$$

Then, we consider two reconstruction losses encouraging
$I \approx \hat{I}$ and $I \approx \hat{I}'$. Since the two losses are commensurate,
they are easy to balance and train jointly. Most importantly,
this approach allows us to easily reason about symmetry
probabilistically, as explained next.
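
As a minimal illustration of the flip operator used above, the following NumPy sketch mirrors a C x W x H array along its second (u) axis. The array layout and the helper name `flip` are assumptions made for illustration only and are not taken from the released code.

```python
import numpy as np

def flip(x):
    """Mirror a C x W x H map along the horizontal (u) axis.

    Implements [flip x]_{c,u,v} = x_{c,W-1-u,v}.
    """
    return x[:, ::-1, :]

# Toy albedo with C=1, W=4, H=2 (layout assumed to be C x W x H as in the text).
a = np.arange(8, dtype=float).reshape(1, 4, 2)
a_flipped = flip(a)

# A map that equals its own flip is symmetric about the vertical mid-line.
symmetric = a + flip(a)
```

Applying `flip` twice returns the original map, so the operator is its own inverse.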

The source image $I$ and the reconstruction $\hat{I}$ are compared
via the loss

$$\mathcal{L}(\hat{I}, I, s) = -\frac{1}{|V|} \sum_{uv \in V} \ln \frac{1}{\sqrt{2}\, s_{uv}} \exp\left(-\frac{\sqrt{2}\, \ell_{1,uv}}{s_{uv}}\right), \tag{3}$$

where $\ell_{1,uv} = |\hat{I}_{uv} - I_{uv}|$ is the L1 distance between the intensity
of pixels at location $uv$, and $s \in \mathbb{R}_+^{W \times H}$ is a confidence
map, also estimated by the network $F$ from the image $I$,
which expresses the aleatoric uncertainty of the model. The
loss can be interpreted as the negative log-likelihood of a
factorized Laplacian distribution on the reconstruction
residuals. Optimizing likelihood causes the model to self-
calibrate, learning a meaningful confidence map [72].

Modelling uncertainty is generally useful, but in our case
is particularly important when we consider the “symmetric”
reconstruction $\hat{I}'$, for which we use the same loss $\mathcal{L}(\hat{I}', I, s')$.
Crucially, we use the network to estimate, also from the
same input image $I$, a second confidence map $s'$. This confidence
map allows the model to learn which portions of the
input image might not be symmetric. For instance, in some
cases hair on a human face is not symmetric, as shown
in Fig. 2, and $s'$ can assign a higher reconstruction uncertainty
to the hair region where the symmetry assumption is
not satisfied. Note that this depends on the specific instance
under consideration, and is learned by the model itself.

Overall, the learning objective is given by the combina-
tion of the two reconstruction errors

$$E(F; I) = \mathcal{L}(\hat{I}, I, s) + \lambda_f \mathcal{L}(\hat{I}', I, s'), \tag{4}$$

2. The choice of axis is arbitrary as long as it is fixed.



where $\lambda_f = 0.5$ is a weighting factor, $(d, a, w, l, s, s') = F(I)$ is
the output of the neural network, and $\hat{I}$ and $\hat{I}'$ are obtained
according to Eqs. (1) and (2).
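
The loss of Eq. (3) and the objective of Eq. (4) can be sketched in NumPy as follows. This is an illustrative sketch under simplifying assumptions: images are single-channel W x H arrays, and the function names `laplacian_nll` and `objective` are hypothetical rather than taken from the released code.

```python
import numpy as np

def laplacian_nll(recon, target, conf):
    """Eq. (3): mean negative log-likelihood of a factorized Laplacian.

    recon, target: W x H intensity maps; conf: W x H positive confidence map s.
    """
    l1 = np.abs(recon - target)
    # -ln[(1 / (sqrt(2) s)) * exp(-sqrt(2) l1 / s)] = ln(sqrt(2) s) + sqrt(2) l1 / s
    return float(np.mean(np.log(np.sqrt(2.0) * conf) + np.sqrt(2.0) * l1 / conf))

def objective(recon, recon_flip, target, conf, conf_flip, lam_f=0.5):
    """Eq. (4): direct loss plus the weighted loss of the flipped reconstruction."""
    return (laplacian_nll(recon, target, conf)
            + lam_f * laplacian_nll(recon_flip, target, conf_flip))

# Toy example: a residual of 1 at every pixel of the direct reconstruction.
target = np.zeros((4, 4))
recon = np.ones((4, 4))
loss_small_s = laplacian_nll(recon, target, np.full((4, 4), 0.5))
loss_opt_s = laplacian_nll(recon, target, np.full((4, 4), np.sqrt(2.0)))
loss_large_s = laplacian_nll(recon, target, np.full((4, 4), 5.0))
total = objective(recon, target, target, np.ones((4, 4)), np.ones((4, 4)))
```

For a fixed residual $\ell$, the per-pixel term $\ln(\sqrt{2}\, s) + \sqrt{2}\, \ell / s$ is minimized at $s = \sqrt{2}\, \ell$, which illustrates why optimizing the likelihood pushes the confidence map to track the actual reconstruction error.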

3.3 Image Formation Model

We now describe the functions P and L in Eq. (1) in more
detail. The image is formed by a camera looking at a 3D
object. If we denote by $P = (P_x, P_y, P_z) \in \mathbb{R}^3$ a 3D point
expressed in the reference frame of the camera, this is
mapped to pixel $p = (u, v, 1)$ by the following projection:

$$p \propto K P, \quad K = \begin{bmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{bmatrix}, \quad \begin{cases} c_u = \frac{W-1}{2}, \\ c_v = \frac{H-1}{2}, \\ f = \frac{W-1}{2 \tan \frac{\theta_{\mathrm{FOV}}}{2}}. \end{cases} \tag{5}$$

This model assumes a perspective camera with field of view
(FOV) $\theta_{\mathrm{FOV}}$. We assume a nominal distance of the object
from the camera of about 1 m. Given that the images are
cropped around a particular object, we assume a relatively
narrow FOV of $\theta_{\mathrm{FOV}} \approx 10^\circ$.

The depth map $d : V \to \mathbb{R}_+$ associates a depth value $d_{uv}$
to each pixel $(u, v) \in V$ in the canonical view. By inverting
the camera model (5), we find that this corresponds to the
3D point $P = d_{uv} \cdot K^{-1} p$.
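
The camera model of Eq. (5) and its inversion can be sketched as follows. This is a minimal NumPy illustration, assuming depth maps are stored as W x H arrays indexed by [u, v]; the helper names `intrinsics` and `backproject` are hypothetical.

```python
import numpy as np

def intrinsics(W, H, fov_deg=10.0):
    """Camera matrix K of Eq. (5) for a W x H image and a field of view in degrees."""
    f = (W - 1) / (2.0 * np.tan(np.deg2rad(fov_deg) / 2.0))
    c_u, c_v = (W - 1) / 2.0, (H - 1) / 2.0
    return np.array([[f, 0.0, c_u],
                     [0.0, f, c_v],
                     [0.0, 0.0, 1.0]])

def backproject(depth, K):
    """Map a W x H depth map (indexed [u, v]) to W x H x 3 points P = d_uv * K^-1 p."""
    W, H = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H), indexing="ij")
    p = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # homogeneous pixels
    rays = p @ np.linalg.inv(K).T                                  # K^-1 p for every pixel
    return depth[..., None] * rays

K = intrinsics(64, 64)
points = backproject(np.ones((64, 64)), K)

# Projecting the 3D points with K recovers the original pixel grid.
proj = points @ K.T
pixels = proj[..., :2] / proj[..., 2:3]
```

With this choice of focal length, a point on the edge of the field of view projects onto the last pixel column, u = W - 1.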

The viewpoint $w \in \mathbb{R}^6$ represents a Euclidean transformation
$(R, T) \in SE(3)$, where $w_{1:3}$ and $w_{4:6}$ are rotation
angles and translations along the $x$, $y$ and $z$ axes respectively.

The map $(R, T)$ transforms 3D points from the canonical
view to the actual view. Thus a pixel $(u, v)$ in the canonical
view is mapped to the pixel $(u', v')$ in the actual view by the
warping function $h_{d,w} : (u, v) \mapsto (u', v')$ given by

$$p' \propto K(d_{uv} \cdot R K^{-1} p + T), \tag{6}$$

where $p' = (u', v', 1)$.
Finally, the reprojection function $P$ takes as input the
depth $d$ and the viewpoint change $w$ and applies the resulting
warp to the canonical image $J$ to obtain the actual image
$\hat{I} = P(J, d, w)$ as $\hat{I}_{u'v'} = J_{uv}$, where $(u, v) = h_{d,w}^{-1}(u', v')$. Note
that this requires computing the inverse of the warp $h_{d,w}$,
which is detailed in Section 3.5.
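
The forward warp of Eq. (6) can be sketched for a single pixel as follows. The text does not specify the order in which the three rotation angles are composed, so the order used here (Rz @ Ry @ Rx) is an assumption, as are the helper names and the toy intrinsics.

```python
import numpy as np

def euler_to_rotation(rx, ry, rz):
    """Rotation matrix from angles about x, y, z (order Rz @ Ry @ Rx is an assumption)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def warp_pixel(u, v, d_uv, K, w):
    """Eq. (6): map canonical pixel (u, v) with depth d_uv to (u', v') in the actual view."""
    R = euler_to_rotation(*w[:3])
    T = np.asarray(w[3:], dtype=float)
    p = np.array([u, v, 1.0])
    q = K @ (d_uv * (R @ np.linalg.inv(K) @ p) + T)
    return q[0] / q[2], q[1] / q[2]

# Hypothetical intrinsics for a 5 x 5 image (focal length chosen for illustration).
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
same = warp_pixel(4, 2, 1.0, K, np.zeros(6))
pushed_back = warp_pixel(4, 2, 1.0, K, np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0]))
```

With a zero viewpoint the pixel stays in place, while translating the point away from the camera along z moves its projection toward the image center.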

The canonical image $J = L(a, d, l)$ is in turn generated as
a combination of albedo, normal map and light direction.
To do so, given the depth map $d$, we derive the normal map
$n : V \to \mathbb{S}^2$ by associating to each pixel $(u, v)$ a vector normal
to the underlying 3D surface. In order to find this vector, we
compute the vectors $t^u_{uv}$ and $t^v_{uv}$ tangent to the surface along
the $u$ and $v$ directions. For example, the first one is

$$t^u_{uv} = d_{u+1,v} \cdot K^{-1}(p + e_x) - d_{u-1,v} \cdot K^{-1}(p - e_x), \tag{7}$$

where $p$ is defined above and $e_x = (1, 0, 0)$. Then, the normal
is obtained by taking the vector product $n_{uv} \propto t^u_{uv} \times t^v_{uv}$.
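
The normal computation can be sketched as follows. This NumPy illustration evaluates Eq. (7) and its counterpart along v only at interior pixels, dropping the one-pixel border, which is a simplification; the function name and the toy intrinsics are hypothetical.

```python
import numpy as np

def normals_from_depth(depth, K):
    """Unit normals proportional to t^u x t^v for interior pixels of a W x H depth map."""
    W, H = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H), indexing="ij")
    p = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    pts = depth[..., None] * (p @ np.linalg.inv(K).T)   # 3D point for every pixel
    t_u = pts[2:, 1:-1] - pts[:-2, 1:-1]                 # Eq. (7): difference along u
    t_v = pts[1:-1, 2:] - pts[1:-1, :-2]                 # analogous difference along v
    n = np.cross(t_u, t_v)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# Hypothetical intrinsics for an 8 x 8 image.
K = np.array([[100.0, 0.0, 3.5], [0.0, 100.0, 3.5], [0.0, 0.0, 1.0]])
flat_normals = normals_from_depth(np.ones((8, 8)), K)

# A gently varying depth map still yields unit-length normals.
uu, vv = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
bumpy_normals = normals_from_depth(1.0 + 0.01 * np.sin(uu) * np.cos(vv), K)
```

For a fronto-parallel plane the two tangents lie along the x and y axes, so the cross product in the order given in the text yields the normal (0, 0, 1).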

The normal $n_{uv}$ is multiplied by the light direction $l$
to obtain a value for the directional illumination and the latter
is added to the ambient light. Finally, the result is multiplied
by the albedo to obtain the illuminated texture, as
follows:

$$J_{uv} = \big(k_s + k_d \max\{0, \langle l, n_{uv} \rangle\}\big) \cdot a_{uv}. \tag{8}$$

Here $k_s$ and $k_d$ are the scalar coefficients weighting the
ambient and diffuse terms, and are predicted by the model
with range between 0 and 1 via rescaling a tanh output.
The light direction $l = (l_x, l_y, 1)^T / (l_x^2 + l_y^2 + 1)^{0.5}$ is modeled
as a spherical sector by predicting $l_x$ and $l_y$ with tanh.
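
The shading model of Eq. (8) can be sketched as follows. The array layouts (albedo as 3 x W x H, normals as W x H x 3) and the function names are assumptions made for this illustration.

```python
import numpy as np

def light_direction(l_x, l_y):
    """Unit light direction l = (l_x, l_y, 1)^T / sqrt(l_x^2 + l_y^2 + 1)."""
    return np.array([l_x, l_y, 1.0]) / np.sqrt(l_x ** 2 + l_y ** 2 + 1.0)

def shade(albedo, normals, light, k_s, k_d):
    """Eq. (8): J_uv = (k_s + k_d * max(0, <l, n_uv>)) * a_uv.

    albedo: 3 x W x H, normals: W x H x 3 (unit vectors), light: unit 3-vector.
    """
    diffuse = np.maximum(0.0, normals @ light)       # W x H directional term
    return (k_s + k_d * diffuse)[None, :, :] * albedo

albedo = np.full((3, 2, 2), 0.5)
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0               # normals aligned with the light direction
normals[0, 0] = [0.0, 0.0, -1.0]    # one normal opposite to the light: ambient term only
J = shade(albedo, normals, light_direction(0.0, 0.0), k_s=0.3, k_d=0.7)
```

The clamp at zero means that a pixel whose normal points away from the light receives only the ambient contribution $k_s \cdot a_{uv}$.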

3.4 Perceptual Loss

The L1 loss function Eq. (3) is sensitive to small geometric
imperfections and tends to result in blurry reconstructions.
We add a perceptual loss term to mitigate this problem. The
$k$th layer of an off-the-shelf image encoder $e$ (VGG16 in our
case [73]) predicts a representation $e^{(k)}(I) \in \mathbb{R}^{C_k \times W_k \times H_k}$,
where $V_k = \{0, \ldots, W_k - 1\} \times \{0, \ldots, H_k - 1\}$ is the corresponding
spatial domain. Note that this feature encoder
does not have to be trained with supervised tasks. Self-
supervised encoders can be equally effective as shown
in Table 3.

Similar to Eq. (3), assuming a Gaussian distribution, the
perceptual loss is given by

$$\mathcal{L}_p^{(k)}(\hat{I}, I, s^{(k)}) = -\frac{1}{|V_k|} \sum_{uv \in V_k} \ln \frac{1}{\sqrt{2\pi (s^{(k)}_{uv})^2}} \exp\left(-\frac{(\ell^{(k)}_{uv})^2}{2 (s^{(k)}_{uv})^2}\right), \tag{9}$$

where $\ell^{(k)}_{uv} = |e^{(k)}_{uv}(\hat{I}) - e^{(k)}_{uv}(I)|$ for each pixel index $uv$ in the
$k$th layer. We also compute the loss for $\hat{I}'$ using $s^{(k)\prime}$. The maps $s^{(k)}$ and
$s^{(k)\prime}$ are additional confidence maps predicted by our
model. In practice, we found it is good enough for our purpose
to use the features from only one layer (relu3_3) of
VGG16. We therefore shorten the notation of perceptual
loss to $\mathcal{L}_p$. With this, the loss function $\mathcal{L}$ in Eq. (4) is
replaced by $\mathcal{L} + \lambda_p \mathcal{L}_p$ with $\lambda_p = 1$.
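
The Gaussian form of Eq. (9) can be sketched on precomputed feature maps as follows. The features here are random stand-ins rather than real VGG16 activations, and taking the per-pixel distance as the Euclidean norm over channels is an assumption, since the text does not spell out how the channel dimension is reduced.

```python
import numpy as np

def gaussian_nll(feat_recon, feat_target, conf):
    """Eq. (9) on precomputed features.

    feat_recon, feat_target: C x W x H feature maps; conf: W x H positive map.
    The per-pixel distance is the Euclidean norm over channels (an assumption).
    """
    dist = np.linalg.norm(feat_recon - feat_target, axis=0)
    # -ln[N(dist; 0, s^2)] = 0.5 * ln(2 pi s^2) + dist^2 / (2 s^2)
    return float(np.mean(0.5 * np.log(2.0 * np.pi * conf ** 2)
                         + dist ** 2 / (2.0 * conf ** 2)))

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))    # stand-in for encoder features, not real VGG16 output
conf = np.ones((4, 4))
loss_same = gaussian_nll(feat, feat, conf)
loss_diff = gaussian_nll(feat + 1.0, feat, conf)
```

With unit confidence, identical features give the constant $\frac{1}{2}\ln 2\pi$, and an offset of 1 in each of the 8 channels adds $8/2 = 4$ to the loss.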

3.5 Differentiable Rendering Layer

As noted in Section 3.3, the reprojection function P warps
the canonical image J to generate the actual image I. In
CNNs, image warping is usually regarded as a simple oper-
ation that can be implemented efficiently using a bilinear
resampling layer [74]. However, this is true only if we can
easily send pixels $(u', v')$ in the warped image $I$ back to pixels
$(u, v)$ in the source image $J$, a process also known as backward
warping. Unfortunately, in our case the function $h_{d,w}$
obtained by Eq. (6) sends pixels the opposite way.

Implementing a forward warping layer is surprisingly del-
icate. One way of approaching the problem is to regard this
task as a special case of rendering a textured mesh. The Neu-
ral Mesh Renderer (NMR) of [68] is a differentiable renderer
of this type. In our case, the mesh has one vertex per pixel
and each group of $2 \times 2$ adjacent pixels is tessellated by two
triangles. Empirically, we found the quality of the texture
gradients of NMR to be poor in this case, likely caused by
the noisy depth map $d$ and high-frequency content in the
texture image $J$.
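
The per-pixel mesh described above can be sketched as follows. The vertex indexing, the choice of diagonal and the winding order are assumptions for illustration; the text only states that each 2 x 2 block of pixels is covered by two triangles.

```python
import numpy as np

def grid_faces(W, H):
    """Triangle faces for a W x H vertex grid: two triangles per 2 x 2 pixel block.

    Vertex (u, v) has index u * H + v. The diagonal and winding order are assumptions.
    """
    idx = np.arange(W * H).reshape(W, H)
    v00, v10 = idx[:-1, :-1], idx[1:, :-1]
    v01, v11 = idx[:-1, 1:], idx[1:, 1:]
    upper = np.stack([v00, v01, v10], axis=-1).reshape(-1, 3)
    lower = np.stack([v10, v01, v11], axis=-1).reshape(-1, 3)
    return np.concatenate([upper, lower], axis=0)

faces = grid_faces(64, 64)
```

A 64 x 64 depth map therefore yields 4096 vertices and 2 x 63 x 63 = 7938 triangles.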

We solve the problem as follows. First, we use NMR to
warp only the depth map $d$, obtaining a version $\bar{d}$ of the
depth map as seen from the input viewpoint. This has two
advantages: first, backpropagation through NMR is faster;
second, the depth gradients are more stable than color gradients,
probably also due to the comparatively smooth
nature of the depth map $d$ compared to the texture image $J$.
Given the depth map $\bar{d}$, we then use the inverse of Eq. (6) to
find the warp field from the observed viewpoint to the
canonical viewpoint, and bilinearly resample the canonical
image $J$ to obtain the reconstruction (i.e. using backward
warping).
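
The backward-warping step can be sketched as follows, assuming the depth map in the actual view is already available (in the paper it is obtained with NMR, which is not reproduced here). The NumPy bilinear sampler, the function names and the toy intrinsics are illustrative assumptions.

```python
import numpy as np

def bilinear_sample(img, us, vs):
    """Sample a W x H image (indexed [u, v]) at float coordinates, clamping at the border."""
    W, H = img.shape
    us, vs = np.clip(us, 0, W - 1), np.clip(vs, 0, H - 1)
    u0, v0 = np.floor(us).astype(int), np.floor(vs).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    fu, fv = us - u0, vs - v0
    return ((1 - fu) * (1 - fv) * img[u0, v0] + fu * (1 - fv) * img[u1, v0]
            + (1 - fu) * fv * img[u0, v1] + fu * fv * img[u1, v1])

def backward_warp(J, depth_actual, K, R, T):
    """Resample canonical image J into the actual view, given depth in the actual view."""
    W, H = depth_actual.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H), indexing="ij")
    p_act = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    X_act = depth_actual[..., None] * (p_act @ np.linalg.inv(K).T)  # 3D points, actual frame
    X_can = (X_act - T) @ R                  # R^T (X - T) written in row-vector form
    q = X_can @ K.T                          # project into the canonical view
    return bilinear_sample(J, q[..., 0] / q[..., 2], q[..., 1] / q[..., 2])

K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])  # hypothetical intrinsics
J = np.arange(25, dtype=float).reshape(5, 5)
depth = np.ones((5, 5))
identity = backward_warp(J, depth, K, np.eye(3), np.zeros(3))
shifted = backward_warp(J, depth, K, np.eye(3), np.array([0.01, 0.0, 0.0]))  # one pixel along u
```

Because every output pixel looks up its own source location, the result has no holes, which is the practical advantage of backward over forward warping.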

4 EXPERIMENTS

In this section, we first describe the experimental setup and
implementation details, and then present the qualitative
results on three object categories, human faces, cat faces and
synthetic cars, followed by extensive ablation studies and
analyses. We also report comparisons with several state-of-
the-art methods both qualitatively and quantitatively. In the
end, we provide a discussion on the limitations of our method.

4.1 Experimental Setup

4.1.1 Datasets

We test our method on three human face datasets: Cel-
ebA [75], 3DFAW [76], [77], [78], [79] and BFM [11]. CelebA
is a large scale human face dataset, consisting of over 200k
images of real human faces in the wild annotated with
bounding boxes. 3DFAW contains 23k images with 66 3D
keypoint annotations, which we use to evaluate our 3D pre-
dictions in Section 4.4. We roughly crop the images around
the head region using MTCNN [80] and use the official
train/val/test splits. BFM (Basel Face Model) is a synthetic
face model, which we use to assess the quality of the 3D
reconstructions (since the in-the-wild datasets lack ground-
truth). We follow the protocol of [19] to generate a dataset,
sampling shapes, poses, textures, and illumination ran-
domly. We use images from SUN Database [81] as back-
ground and save ground truth depth maps for evaluation.

We also test our method on cat faces and synthetic cars.
We use two cat datasets [82], [83]. The first one has 10k cat
images with nine keypoint annotations, and the second one
is a collection of dog and cat images, containing 1.2k cat
images with bounding box annotations. We combine the
two datasets and crop the images around the cat heads. For
cars, we render 35k images of synthetic cars from Shape-
Net [84] with random viewpoints and illumination. We ran-
domly split the images by 8:1:1 into train, validation and
test sets.

4.1.2 Metrics

Since the scale of 3D reconstruction from projective cameras
is inherently ambiguous [35], we discount it in the evalua-
tion. Specifically, given the depth map d predicted by our
model in the canonical view, we warp it to a depth map $\bar{d}$ in
the actual view using the predicted viewpoint and compare
the latter to the ground-truth depth map $d^*$ using the scale-invariant
depth error (SIDE) [85]

$$E_{\mathrm{SIDE}}(\bar{d}, d^*) = \left( \frac{1}{WH} \sum_{uv} \Delta_{uv}^2 - \Big( \frac{1}{WH} \sum_{uv} \Delta_{uv} \Big)^2 \right)^{\frac{1}{2}}, \tag{10}$$

where $\Delta_{uv} = \log \bar{d}_{uv} - \log d^*_{uv}$. We compare only valid depth
pixels and erode the foreground mask by one pixel to discount
rendering artefacts at object boundaries. Additionally,
we report the mean angle deviation (MAD) between normals
computed from ground truth depth and from the predicted
depth, measuring how well the surface is captured.
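
The two metrics can be sketched as follows. This NumPy illustration evaluates them over all pixels and omits the foreground masking and erosion described above; the function names are hypothetical.

```python
import numpy as np

def side(depth_pred, depth_gt):
    """Eq. (10): scale-invariant depth error between two positive depth maps."""
    delta = np.log(depth_pred) - np.log(depth_gt)
    var = np.mean(delta ** 2) - np.mean(delta) ** 2
    return float(np.sqrt(max(var, 0.0)))    # guard against tiny negative round-off

def mad_degrees(normals_pred, normals_gt):
    """Mean angle (in degrees) between corresponding unit normals of shape ... x 3."""
    cos = np.clip(np.sum(normals_pred * normals_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))

rng = np.random.default_rng(0)
gt = rng.uniform(0.9, 1.1, size=(8, 8))
err_scaled = side(3.0 * gt, gt)    # a global scale change leaves SIDE at numerical zero
err_noisy = side(gt * np.exp(rng.normal(0.0, 0.05, size=gt.shape)), gt)
z = np.tile([0.0, 0.0, 1.0], (4, 1))
x = np.tile([1.0, 0.0, 0.0], (4, 1))
```

Since Eq. (10) is the standard deviation of the log-depth difference, multiplying the prediction by any constant shifts every $\Delta_{uv}$ equally and does not change the error.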

4.1.3 Implementation Details

The function $(d, a, w, l, s) = F(I)$ that predicts depth, albedo,
viewpoint, lighting, and confidence maps from the image I
is implemented using individual neural networks. The
depth and albedo are generated by encoder-decoder net-
works, while viewpoint and lighting are regressed using
simple encoder networks. The encoder-decoders do not use
skip connections because input and output images are not
spatially aligned (since the output is in the canonical view-
point). All four confidence maps are predicted using the
same network, at different decoding layers for the photo-
metric and perceptual losses since these are computed at
different resolutions. The final activation function is tanh

for depth, albedo, viewpoint and lighting and softplus

for the confidence maps. The depth prediction is centered
on the mean before tanh, as the global distance is estimated
as part of the viewpoint. We do not use any special initiali-
zation for all predictions, except that two border pixels of
the depth maps on both the left and the right are clamped at
a maximal depth to avoid boundary issues.
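
The depth post-processing described above (mean-centering, tanh, border clamping) can be sketched as follows. The depth range `d_min`, `d_max` and the linear rescaling of the tanh output are placeholder assumptions for illustration; only the centering, the tanh activation and the two-pixel border clamp are taken from the text.

```python
import numpy as np

def canonical_depth(raw, d_min=0.9, d_max=1.1, border=2):
    """Map raw decoder outputs of shape (H, W) to a canonical depth map."""
    centered = raw - raw.mean()              # center on the mean before tanh
    d = np.tanh(centered)                    # values in (-1, 1)
    # Assumed rescaling of (-1, 1) to the hypothetical range [d_min, d_max].
    d = d_min + 0.5 * (d + 1.0) * (d_max - d_min)
    d[:, :border] = d_max                    # clamp left border columns to max depth
    d[:, -border:] = d_max                   # clamp right border columns to max depth
    return d
```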

We train using Adam over batches of 64 input images,
resized to 64 × 64 pixels. The size of the output depth and
albedo is also 64 × 64. We train for approximately 50k iterations.
For visualization, depth maps are upsampled to 256. We
include more details in the supplementary material, which can
be found on the Computer Society Digital Library at
http://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3076536.

4.2 Qualitative Results

4.2.1 Reconstruction Results

In Fig. 3 we show reconstruction results of human faces
from CelebA and 3DFAW, cat faces from [82], [83] and synthetic
cars from ShapeNet. The 3D shapes are recovered
with high fidelity. The reconstructed 3D faces, for instance,
contain fine details of the nose, eyes and mouth even in the
presence of extreme facial expressions.

Fig. 3. Reconstruction of faces, cats and cars. Our unsupervised model
recovers accurate 3D shape from only a single input image.

4.2.2 Generalization to Paintings

To further test generalization, we applied our model trained
on the CelebA dataset to a number of paintings and cartoon
drawings of faces collected from [86] and the Internet. As
shown in Fig. 4, our method still works well even though it
has never seen such images during training. Note that,
since the model is trained using real face images,
the reconstructions tend to look like more “realistic” faces,
reflecting the prior learned during training.

4.2.3 Relighting

A by-product of our reconstruction framework is that it
learns to disentangle albedo and shading from a single
image, without any external supervision. This is possible
by leveraging the symmetry assumption on the albedo
as well as the categorical prior imposed by the training set.

Decomposing the albedo map enables realistic graphics
editing applications, such as re-rendering the object under
different lighting conditions, as illustrated in Fig. 5.
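
A minimal relighting sketch is given below, assuming a Lambertian shading model with an ambient term `k_s` and a diffuse term `k_d`. The function name, the default coefficient values and the array layout are assumptions made for illustration.

```python
import numpy as np

def relight(albedo, normals, light_dir, k_s=0.5, k_d=0.5):
    """Re-render an (H, W, 3) albedo map under a new directional light.

    normals: (H, W, 3) unit surface normals, e.g., derived from the depth map.
    light_dir: 3-vector pointing towards the light (normalized internally).
    k_s, k_d: ambient and diffuse weights (hypothetical default values).
    """
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    # Lambertian shading: ambient term plus clamped cosine between light and normal.
    shading = k_s + k_d * np.clip(normals @ l, 0.0, None)    # shape (H, W)
    return albedo * shading[..., None]
```

Sweeping `light_dir` over a set of directions while keeping the albedo and normals fixed gives a sequence of relit images of the same object.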

4.2.4 Inference on Video Frames

We can also apply our trained model to video sequences
frame-by-frame. To demonstrate this, we take speech clips
from VoxCeleb [87], and crop the faces using MTCNN [80] (see footnote 3).
We then feed the crops to our model to produce a 3D recon-
struction of the faces and render them from novel viewpoints,
shown in Fig. 6. Note that our model does not use videos for
training, yet it produces temporally consistent and accurate
reconstruction results by simply processing the frames
independently.

4.2.5 Symmetry and Asymmetry Detection

Since our model predicts a canonical view of the objects that is
symmetric about the vertical center-line of the image, we can
easily visualize the symmetry plane, which is otherwise
non-trivial to detect from in-the-wild images. In Fig. 7, we warp
the center-line of the canonical image to the predicted input
viewpoint. Our method can detect symmetry planes accurately
despite the presence of asymmetric texture and lighting
effects. We also overlay the predicted confidence map σ′ onto
the image, confirming that the model assigns low confidence
to asymmetric regions in a sample-specific way.

Fig. 4. Reconstruction of faces in paintings and cartoons. The model
trained on real faces in CelebA generalizes well to abstract faces in
paintings and cartoons.

Fig. 5. Re-lighting results. Our model disentangles albedo and shading
from a single input image, which allows us to relight the objects with
novel lighting conditions.

Fig. 6. Frame-by-frame reconstruction on video sequences. Even though
our model does not use videos for training, it produces temporally
consistent reconstructions on video sequences.

Fig. 7. Symmetry plane and asymmetry detection. (a): our model can
reconstruct the “intrinsic” symmetry plane of an in-the-wild object even
though the appearance is highly asymmetric. (b): asymmetries
(highlighted in red) are detected and visualized using confidence map σ′.

3. We use the implementation from
https://github.com/timesler/facenet-pytorch.

5274 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 4, APRIL 2023

4.3 Analyses and Discussions

4.3.1 Comparison With Baselines

Table 2 uses the BFM dataset to compare the depth reconstruction
quality obtained by our method, a fully-supervised baseline
and two other baselines. The supervised baseline is a
version of our model trained to regress the ground-truth depth
maps using an L1 loss. The trivial baseline predicts a constant
uniform depth map, which provides a lower bound on
performance. The third baseline is a constant depth map obtained by
averaging all ground-truth depth maps in the test set. Our
method clearly outperforms the two constant baselines (SIDE of
0.793 versus 2.723 and 1.990, in units of 10⁻²) and
approaches the results of supervised training (0.410).

4.3.2 Ablation

To understand the influence of the individual parts of the
model, we remove them one at a time and evaluate the per-
formance of the ablated model in Table 3 and Fig. 8.

In the table, row (1) shows the performance of the full
model (the same as in Table 2). Row (2) does not flip the
albedo. Thus, the albedo is not encouraged to be symmetric
in the canonical space, which fails to canonicalize the view-
point of the object and to use cues from symmetry to recover
shape. The performance is as low as the trivial baseline
in Table 2. Row (3) does not flip the depth, with a similar
effect to row (2). In addition, we had to add an L2 smoothness
loss on the depth maps during training. Otherwise,
without the symmetry constraint, the model tends to produce
noisy depth maps, which lead to heavy occlusion and break
the training.
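
One simple form that such an L2 smoothness penalty on a depth map could take is sketched below. The exact loss used in the ablation is not specified here, so the first-order finite-difference form and the function name are assumptions.

```python
import numpy as np

def depth_smoothness_l2(d):
    """Mean squared difference between neighbouring depth values of an (H, W) map."""
    du = d[:, 1:] - d[:, :-1]     # horizontal neighbours
    dv = d[1:, :] - d[:-1, :]     # vertical neighbours
    return np.mean(du ** 2) + np.mean(dv ** 2)
```

A constant depth map incurs zero penalty, while isolated spikes are penalized quadratically, which discourages the noisy depth maps mentioned above.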

Row (4) predicts a shading map instead of computing it
from depth and light direction. This also harms performance
significantly because shading cannot be used as a cue to
recover shape. Moreover, the training often collapses after a
few epochs as the model produces spikes in the depth maps
that also result in large occlusion. We therefore report the
results of the latest epoch prior to collapse.

Row (5) switches off the perceptual loss, which leads to
degraded image quality and hence degraded reconstruction
results. Row (6) replaces the ImageNet-pretrained image
encoder used in the perceptual loss with one trained
through a self-supervised task [88] (see footnote 4), which makes
almost no difference in performance.

Finally, row (7) switches off the confidence maps, using a
fixed and uniform value for the confidence. This reduces
losses (3) and (9) to the basic L1 and L2 losses, respectively.
The accuracy does not drop significantly, as faces in BFM
are highly symmetric (e.g., do not have hair), but its variance
increases (the reported spread of SIDE rises from ±0.140 to
±0.213). To better understand the effect of the confidence
maps, we specifically evaluate on partially asymmetric faces
using perturbations.
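
To make the reduction to a plain L1 loss concrete, the sketch below assumes that loss (3) takes the form of a Laplacian negative log-likelihood with a per-pixel confidence σ. That form, and the function name, are assumptions for this illustration. With a uniform σ the loss becomes an affine function of the mean absolute error, so minimizing it is equivalent to minimizing L1.

```python
import numpy as np

def confidence_l1_loss(pred, target, sigma):
    """Laplacian negative log-likelihood with per-pixel confidence sigma.

    Assumed form: mean over pixels of sqrt(2) * |pred - target| / sigma
    plus log(sqrt(2) * sigma).
    """
    l1 = np.abs(pred - target)
    return np.mean(np.sqrt(2.0) * l1 / sigma + np.log(np.sqrt(2.0) * sigma))

# With a fixed, uniform confidence c the loss is a * L1 + b for constants a, b.
rng = np.random.default_rng(0)
pred, target = rng.uniform(size=(8, 8)), rng.uniform(size=(8, 8))
c = 0.3
uniform = confidence_l1_loss(pred, target, np.full((8, 8), c))
plain_l1 = np.mean(np.abs(pred - target))
affine = np.sqrt(2.0) / c * plain_l1 + np.log(np.sqrt(2.0) * c)
print(np.isclose(uniform, affine))
```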

TABLE 2
Comparison With Baselines

No   Baseline               SIDE (×10⁻²) ↓    MAD (deg.) ↓
(1)  Supervised             0.410 ± 0.103     10.78 ± 1.01
(2)  Const. null depth      2.723 ± 0.371     43.34 ± 2.25
(3)  Average g.t. depth     1.990 ± 0.556     23.26 ± 2.85
(4)  Ours (unsupervised)    0.793 ± 0.140     16.51 ± 1.56

SIDE and MAD errors of our reconstructions on the BFM dataset compared
against a fully-supervised baseline and trivial baselines.

TABLE 3
Ablation Study

No   Method                     SIDE (×10⁻²) ↓    MAD (deg.) ↓
(1)  Ours full                  0.793 ± 0.140     16.51 ± 1.56
(2)  w/o albedo flip            2.916 ± 0.300     39.04 ± 1.80
(3)  w/o depth flip             1.139 ± 0.244     27.06 ± 2.33
(4)  w/o light                  2.406 ± 0.676     41.64 ± 8.48
(5)  w/o perc. loss             0.931 ± 0.269     17.90 ± 2.31
(6)  w/ self-sup. perc. loss    0.815 ± 0.145     15.88 ± 1.57
(7)  w/o confidence             0.829 ± 0.213     16.39 ± 2.12

Refer to Section 4.3.2 for details.

Fig. 8. Ablation study. Refer to Section 4.3.2 for details.

4. We use a RotNet [88] pretrained VGG16 model obtained from
https://github.com/facebookresearch/DeeperCluster.

4.3.3 Asymmetric Perturbation

In order to demonstrate that our uncertainty modelling
allows the model to handle asymmetry, we add asymmetric
perturbations to BFM. Specifically, we generate random
rectangular color patches with 20 to 50 percent of the image
size and blend them onto the images with α-values ranging
from 0.5 to 1, as shown in Fig. 9. We then train our model
with and without confidence on these perturbed images,
and report the results in Table 4. Without the confidence
maps, the model always predicts a symmetric albedo and
geometry reconstruction often fails. With our confidence
estimates, the model is able to reconstruct the asymmetric
faces correctly, with very little loss in accuracy compared to
the unperturbed case.
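
A perturbation of this kind could be generated as in the sketch below. It assumes that "20 to 50 percent of the image size" refers to the side lengths of the patch, that one patch is applied per image, and that colors are drawn uniformly at random. These choices, and the function name `perturb`, are illustrative assumptions.

```python
import numpy as np

def perturb(image, rng):
    """Blend one random colour rectangle onto an (H, W, 3) float image in [0, 1]."""
    h, w, _ = image.shape
    ph = int(rng.uniform(0.2, 0.5) * h)          # patch height: 20-50% of image height
    pw = int(rng.uniform(0.2, 0.5) * w)          # patch width: 20-50% of image width
    top = rng.integers(0, h - ph + 1)            # random placement inside the image
    left = rng.integers(0, w - pw + 1)
    alpha = rng.uniform(0.5, 1.0)                # blending weight in [0.5, 1]
    colour = rng.uniform(0.0, 1.0, size=3)       # random RGB colour
    out = image.copy()
    patch = out[top:top + ph, left:left + pw]
    out[top:top + ph, left:left + pw] = (1.0 - alpha) * patch + alpha * colour
    return out

rng = np.random.default_rng(0)
perturbed = perturb(np.zeros((64, 64, 3)), rng)
```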

4.3.4 Training Only on Frontal Faces

Our full training data consists of single-view images of
many instances, each captured from a different viewpoint,
which essentially compose a large “multi-view” image set,
although these are “multi-views” of different instances with
different texture and shape. Nonetheless, it would be inter-
esting to know how much this “multi-view” signal contrib-
utes to the learning, compared to other cues, such as
symmetry and shading.

In order to understand this, we generate another synthetic
face dataset consisting of only frontal faces with random tex-
ture and shape variations, and train a model on only frontal
faces. We compare the performance of this model to our full
model trained on the original dataset of images with various
viewpoints in Table 5 and Fig. 10. The model trained
on only frontal faces is still able to learn the 3D shape of frontal
faces, although it produces artifacts and has a lower
reconstruction accuracy than the full model (SIDE of 1.347
versus 0.818 on frontal test faces). This suggests that the
symmetry and shading constraints provide powerful
signals for learning shapes even without view variation
in the training set. However, this model fails to
generalize to input faces from other viewpoints.

4.3.5 Training With Fewer Images

As the symmetry assumption and shading seem to provide
strong signals for learning the shape, another interesting
question to ask is: does the model still need to be trained on a large
image collection? To answer this question, we train the
model on different numbers of training images, ranging
from a single image to the entire training set of 155k
images, and compare the results in Fig. 11. When training
with 1 image and with 100 images, we added an L2 smoothness
loss on the depth maps, as the training otherwise collapses
due to noisy depth maps.

As shown in Fig. 11, when trained on only 1 image, the
model still seems able to pick up some shading and symmetry
cues to recover the 3D shape. However, these cues alone
cannot provide enough constraints on this heavily ill-posed
2D-to-3D task. Therefore, although the image reconstruction
loss is low, the underlying 3D shape is poorly reconstructed.
The model only starts to learn reasonable 3D faces when
trained on 1000 or more images, which suggests that a suffi-
ciently large image collection is critical for the model to
learn a 3D shape prior of the object category.

4.3.6 Mixing Categories

In order to understand whether the model learns different
priors for different categories, we further conduct experi-
ments on cross-category inference as well as multi-category
training. Fig. 12 shows some examples. We first feed images
of human faces to a model trained on cat images,
and vice versa. Unsurprisingly, the models
trained on a single category learn shape priors specific to
that category, and tend to reconstruct shapes of
the training category even if the input images depict a
different category.

We further consider training the model on a mixture of
images from two object categories. The resulting model is still
capable of reconstructing both categories with quality
similar to that of the models trained individually on
each category. This observation shows promise for learning
a general model independent of object categories in
the future.

Fig. 9. Asymmetric perturbation. Top: examples of the perturbed dataset.
Bottom: reconstructions with and without confidence maps. Confidence
allows the model to correctly reconstruct the 3D shape with the asym-
metric texture.

TABLE 4
Asymmetric Perturbation

Setting                 SIDE (×10⁻²) ↓    MAD (deg.) ↓
No perturb, no conf.    0.829 ± 0.213     16.39 ± 2.12
No perturb, conf.       0.793 ± 0.140     16.51 ± 1.56
Perturb, no conf.       2.141 ± 0.842     26.61 ± 5.39
Perturb, conf.          0.878 ± 0.169     17.14 ± 1.90

We add asymmetric perturbations to BFM and show that confidence maps
allow the model to reject such noise, while the vanilla model without confidence
maps breaks.

TABLE 5
Training on Frontal Faces

Setting                        SIDE (×10⁻²) ↓    MAD (deg.) ↓
Train frontal, test frontal    1.347 ± 0.150     22.90 ± 1.13
Train frontal, test all        1.858 ± 0.429     30.80 ± 3.93
Train all, test frontal        0.818 ± 0.107     15.90 ± 1.22
Train all, test all            0.793 ± 0.140     16.51 ± 1.56

We compare the model trained on only frontal faces with our full model trained
on faces with all random viewpoints.

Fig. 10. Training only on frontal faces. The model trained on only frontal
faces is still able to learn 3D shape, despite producing artifacts (first
row), but it does not generalize to other views (second row).



4.4 Comparison With the State of the Art

As shown in Table 1, most reconstruction methods in the lit-
erature require either image annotations, prior 3D models
or both. When these assumptions are dropped, the task
becomes considerably harder, and there is little prior work
that is directly comparable. Of these, [33] only uses syn-
thetic, texture-less objects from ShapeNet, [8] reconstructs
in-the-wild faces but does not report any quantitative
results, and [7] reports quantitative results only on keypoint
regression, but not on the 3D reconstruction quality. We
were not able to obtain code or trained models from [7], [8]
for a direct quantitative comparison and thus compare
qualitatively.

4.4.1 Qualitative Comparison

In order to establish a side-by-side comparison, we cropped
the figures reported in the papers [7], [8] and compare our
results with theirs (Fig. 13). Our method produces higher
quality reconstructions than both methods, with fine details
of the facial expression. The difference is especially noticeable
in the recovery of 3D shape for [7], and the shape
generation in [8]. Note that [8] uses an unconditional GAN
that generates high resolution 3D faces from random noise,
and cannot recover 3D shapes from images. The input
images for [8] in Fig. 13 were generated by their GAN.

Fig. 11. Training with fewer images. We show a qualitative comparison of
the models trained with different numbers of images, which confirms the
necessity of training on a sufficiently large image collection.

Fig. 12. Mixing categories. When trained on one single category, the
model learns a prior specific to that particular category, whereas when
trained on two categories, it is able to reconstruct both categories well.

Fig. 13. Qualitative comparison to SOTA. Comparing to [7], [8], our
method recovers higher quality shapes.

4.4.2 3D Keypoint Depth Evaluation

Next, we compare to the DepthNet model of [9]. This
method predicts depth for selected facial keypoints, but
uses 2D keypoint annotations as input, a much easier
setup than the one we consider here. Still, we compare the
quality of the reconstruction of these sparse points obtained
by DepthNet and our method. We also compare to the baselines
MOFA [90] and AIGN [89] reported in [9]. For a fair
comparison, we use their public code, which computes the
depth correlation score (between 0 and 66) on the frontal
faces. We use the 2D keypoint locations to sample our predicted
depth and then evaluate the same metric. The set of
test images from 3DFAW and the preprocessing are identical
to [9]. Since 3DFAW is a small dataset with limited variation,
we also report results with CelebA pre-training.
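
For intuition, a score with this range can be obtained by summing, over the 66 keypoints, the Pearson correlation between predicted and ground-truth keypoint depths across the test images. The sketch below implements that reading. It is an assumption about the metric made for illustration, not a reproduction of the public DepthNet code, and the function name is hypothetical.

```python
import numpy as np

def depth_correlation(pred, gt):
    """Sum of per-keypoint Pearson correlations across images.

    pred, gt: arrays of shape (N, K) with depths of K keypoints in N images.
    A perfect prediction (up to a positive affine map per keypoint) scores K.
    """
    pc = pred - pred.mean(axis=0)                # center each keypoint over images
    gc = gt - gt.mean(axis=0)
    denom = np.linalg.norm(pc, axis=0) * np.linalg.norm(gc, axis=0)
    corr = (pc * gc).sum(axis=0) / denom         # one correlation per keypoint
    return corr.sum()
```

Under this reading the score is invariant to a per-keypoint scale and offset of the predicted depth, which is consistent with the ground truth attaining the maximum of 66 in Table 6.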

In Table 6 we report the results from their paper and the
slightly improved results we obtained from their publicly-
available implementation. The paper also evaluates a super-
vised model using a GAN discriminator trained with
ground-truth depth information. While our method does
not use any supervision, it still outperforms DepthNet and
reaches close-to-supervised performance.

4.4.3 3D Face Reconstruction Benchmarks

We also evaluate the reconstructed 3D meshes and compare
the performance with several recent 3DMM-based recon-
struction methods [21], [57], [58], [59], [60], [61] on two 3D
face reconstruction benchmarks [21], [91].

The first benchmark by Feng et al. [91] provides a test set,
which consists of 133 ground-truth 3D scans and 2,000 test
images, including 656 high-quality (HQ) images captured
in a controlled environment and 1,344 low-quality (LQ)
images extracted from videos. The second one, NoW bench-
mark [21], provides a test set of 1,702 images of 80 subjects
and a ground-truth 3D scan per subject. These images are
captured with a higher variety in facial expression, occlu-
sion, and lighting, compared to the Feng et al. benchmark.

However, it is important to highlight that these bench-
marks are designed specifically for evaluating 3DMM-based
face reconstruction methods, and inherently put model-free
approaches at a disadvantage. In both of these benchmark
sets, only 3D scans of neutral faces are available, which are
used as ground-truth for various input images that describe
different viewpoints and facial expressions and may contain
occlusion. This gives the 3DMM-based methods an advantage
over our method, since the output of these methods is
always constrained to a face model regardless of input variety,
whereas our method produces instance-specific reconstructions
with different expressions, which are not captured
in the ground-truth scans. Our main intention with this
evaluation is to establish a fair, quantitative evaluation
for future model-free methods, since qualitative comparisons
are often subjective and synthetic benchmarks are limited in
terms of generalization to real data.

For both datasets, we detect faces and crop the images
using MTCNN [80] and obtain 3D mesh reconstructions
from the depth maps predicted by our model trained on
CelebA. We then use the same evaluation protocol in both
benchmarks [21], [91], which aligns the predicted meshes
with the ground-truth meshes with a rigid transformation
based on 7 pre-defined keypoints and computes the
scan-to-mesh distances. We obtain these keypoints on our predicted
meshes by applying a facial keypoint detector [92] on the
reconstructed canonical images. The average keypoints are
used when the keypoint detector fails.
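
The keypoint-based rigid alignment step can be illustrated with the standard Kabsch (orthogonal Procrustes) solution, sketched below. This is a generic least-squares alignment offered as an assumption about what such a step involves; it is not the benchmarks' official code, and any scale handling in the official protocols is omitted.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t such that R @ src_i + t ~ dst_i.

    src, dst: (N, 3) arrays of corresponding 3D keypoints (e.g., N = 7).
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # avoid returning a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Once `R` and `t` are estimated from the keypoints, the whole predicted mesh can be transformed with `vertices @ R.T + t` before distances to the ground-truth scan are measured.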

We report the statistics of the distances and compare them
with other methods in Tables 7 and 8. Although our
model-free unsupervised method does not perform as well as the
model-based methods on these benchmarks, it is significantly
better than a flat shape baseline as shown in Table 7
(median error of 5.58 versus 12.47 on LQ images).
Since the NoW dataset provides attributes for the images, we
select a subset of the test set that contains 91 frontal neutral
faces, which better match the ground-truth scans, and
include the results in Table 8. The results on this subset
further reduce the gap towards model-based methods.

4.5 Limitations

While our unsupervised method is robust in many challeng-
ing scenarios (e.g., extreme facial expression, drawings), we
do observe limitations as shown in Fig. 14.

TABLE 6
3DFAW Keypoint Depth Evaluation

Method                                     Depth Corr. ↑
Ground truth                               66
AIGN [89] (supervised, from [9])           50.81
DepthNetGAN [9] (supervised, from [9])     58.68
MOFA [90] (model-based, from [9])          15.97
DepthNet [9] (from [9])                    26.32
DepthNet [9] (from GitHub)                 35.77
Ours                                       48.98
Ours (w/ CelebA pre-training)              54.65

Depth correlation between ground truth and prediction evaluated at 66 facial
keypoint locations.

TABLE 7
Performance on Feng et al. [91] Benchmark

                      Median ↓         Mean ↓           Std
Methods               LQ      HQ       LQ      HQ       LQ      HQ
Extreme3D [58]        2.40    2.37     3.49    3.58     6.15    6.75
3DMM-CNN [57]         1.88    1.85     2.32    2.29     1.89    1.88
PRNet [59]            1.79    1.60     2.38    2.06     2.19    1.79
RingNet [21]          1.63    1.58     2.08    2.02     1.79    1.69
3DDFA-V2 [60]         1.62    1.49     2.10    1.91     1.87    1.64
DECA [61]             1.48    1.44     1.91    1.89     1.68    1.66
Const. flat plane     12.47   12.47    14.11   14.07    10.21   10.17
Ours (model-free)     5.58    5.54     5.74    5.68     1.47    1.89

We compare our model-free unsupervised method with several recent 3DMM-based
methods.

TABLE 8
Performance on NoW [21] Benchmark

Methods                       Median ↓    Mean ↓    Std
3DMM-CNN [57]                 1.84        2.33      2.05
PRNet [59]                    1.50        1.98      1.88
RingNet [21]                  1.21        1.54      1.31
3DDFA-V2 [60]                 1.23        1.57      1.39
DECA [61]                     1.09        1.38      1.18
Ours (model-free)             2.64        3.29      2.86

3DMM-CNN [57] (frontal)       1.88        2.36      2.07
PRNet [59] (frontal)          1.38        1.79      1.67
RingNet [21] (frontal)        1.16        1.48      1.28
Ours (model-free, frontal)    2.25        2.80      2.44

We compare our model-free unsupervised method with several recent 3DMM-based
methods. The bottom half of the table reports the results on a subset of
frontal neutral faces, indicated by “frontal”.



First and foremost, our model relies on the assumption
that the object category has a weakly symmetric 3D shape as well
as weakly symmetric albedo. Extending the key insights of
this work (leveraging category priors, other forms
of symmetry, and shape from shading in a learning framework)
to general objects will require future work.

In this work, we represent shape using a depth map in
the canonical (symmetric) viewpoint, which cannot describe
the full 3D shape in 360 degrees. Thus, the reconstructed
shapes often lack details on the sides. This is particularly
evident for the cars, as illustrated in Fig. 14a. One would
need to consider using other 3D representations to capture
full 3D objects from 360 degrees.

Our model also tends to ignore occluders (Fig. 14b), since
the training set does not contain many examples with occlu-
sion. Disentangling dark textures and shading is often diffi-
cult. Therefore, the model fails to accurately reconstruct
sunglasses (Fig. 14c) and may produce bumpy surfaces
when the texture is noisy (Fig. 14d). During training, we
assume a simple Lambertian shading model, ignoring shad-
ows and specularity, which leads to inaccurate reconstruc-
tions under extreme lighting conditions (Fig. 14e) or highly
non-Lambertian surfaces. The reconstruction quality is also
lower for extreme poses (Fig. 14f), partly due to poor super-
visory signal from the reconstruction loss of side images.
This may be improved by imposing constraints from accu-
rate reconstructions of frontal poses.

5 CONCLUSION

We have presented a method that can learn a 3D model of a
deformable object category from an unconstrained collection
of single-view images of the object category. The model
is able to obtain high-fidelity monocular 3D reconstructions
of individual object instances. It is trained with a
reconstruction loss without any supervision, resembling an
autoencoder. We have shown that symmetry and illumination
are strong cues for shape and help the model to converge
to a meaningful reconstruction. Our model
outperforms a current state-of-the-art 3D reconstruction
method that uses 2D keypoint supervision. As for future
work, the model currently represents 3D shape from a
canonical viewpoint using a depth map, which is sufficient
for objects such as faces that have a roughly convex shape
and a natural canonical viewpoint. For more complex
objects, it may be possible to extend the model to use either
multiple canonical views or a different 3D representation,
such as a mesh or a voxel map.

ACKNOWLEDGMENTS

The authors would like to thank Soumyadip Sengupta for
sharing with us the code to generate synthetic face datasets
and Mihir Sahasrabudhe for sending us the reconstruction
results of Lifting AutoEncoders. The authors would also
like to thank the members of Visual Geometry Group for
insightful discussions. This work was supported in part by
Facebook Research and in part by the ERC Horizon 2020
Research and Innovation Programme under
Grant IDIU 638009.

REFERENCES

[1] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann,
and W. Brendel, “Imagenet-trained CNNs are biased towards tex-
ture; increasing shape bias improves accuracy and robustness,” in
Proc. Int. Conf. Learn. Representations, 2019. [Online]. Available:
https://openreview.net/forum?id=Bygh9j09KX

[2] D. P. Mukherjee, A. Zisserman, and J. M. Brady, “Shape from
symmetry: Detecting and exploiting symmetry in affine images,”
Philos. Trans. Roy. Soc. London, vol. 351, pp. 77–106, 1995.

[3] A. R. J. François, G. G. Medioni, and R. Waupotitsch, “Mirror
symmetry ⇒ 2-view stereo geometry,” Image Vis. Comput., vol. 21,
pp. 137–143, 2003.

[4] S. Thrun and B. Wegbreit, “Shape from symmetry,” in Proc. 10th
IEEE Int. Conf. Comput. Vis., 2005, pp. 1824–1831.

[5] S. N. Sinha, K. Ramnath, and R. Szeliski, “Detecting and recon-
structing 3D mirror symmetric objects,” in Proc. Eur. Conf. Comput.
Vis., 2012, pp. 586–600.

[6] Y. Gao and A. L. Yuille, “Exploiting symmetry and/or manhattan
properties for 3D object structure estimation from single and mul-
tiple images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
2017, pp. 7408–7417.

[7] M. Sahasrabudhe, Z. Shu, E. Bartrum, R. A. Guler, D. Samaras,
and I. Kokkinos, “Lifting autoencoders: Unsupervised learning of
a fully-disentangled 3D morphable model using deep non-rigid
structure from motion,” in Proc. Int. Conf. Comput. Vis. Workshops,
2019, pp. 4054–4064.

[8] A. Szabó, G. Meishvili, and P. Favaro, “Unsupervised generative
3D shape learning from natural images,” 2019, arXiv:1910.00287.

[9] J. R. A. Moniz, C. Beckham, S. Rajotte, S. Honari, and C. Pal,
“Unsupervised depth estimation, 3D face rotation and
replacement,” in Proc. 32nd Int. Conf. Neural Inf. Process. Syst.,
2018, pp. 9759–9769.

[10] S. Wu, C. Rupprecht, and A. Vedaldi, “Unsupervised learning of
probably symmetric deformable 3D objects from images in the
wild,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020,
pp. 1–10.

[11] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A
3D face model for pose and illumination invariant face recog-
nition,” in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill.,
2009, pp. 296–301.

[12] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum,
“Learning a probabilistic latent space of object shapes via 3D gen-
erative-adversarial modeling,” in Proc. Neural Inf. Process. Syst.,
2016, pp. 82–90.

[13] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas,
“Learning representations and generative models for 3D point
clouds,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 40–49.

[14] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black, “Generating 3D
faces using convolutional mesh autoencoders,” in Proc. Eur. Conf.
Comput. Vis., 2018, pp. 725–741.

[15] Z. Geng, C. Cao, and S. Tulyakov, “3D guided fine-grained face
manipulation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog-
nit., 2019, pp. 9813–9822.

[16] B. Gecer, S. Ploumpis, I. Kotsia, and S. Zafeiriou, “GANFIT: Gen-
erative adversarial network fitting for high fidelity 3D face
reconstruction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Rec-
ognit., 2019, pp. 1155–1164.

Fig. 14. Limitations. See Section 4.5 for details.



[17] T. Gerig et al., “Morphable face models - An open framework,” in
Proc. Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 75–82.

[18] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, “End-to-end
recovery of human shape and pose,” in Proc. IEEE/CVF Conf. Com-
put. Vis. Pattern Recognit., 2018, pp. 7122–7131.

[19] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs,
“SfSNet: Learning shape, reflectance and illuminance of faces in
the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
pp. 6296–6305.

[20] M. Wang, Z. Shu, S. Cheng, Y. Panagakis, D. Samaras, and S.
Zafeiriou, “An adversarial neuro-tensorial approach for learning
disentangled representations,” Int. J. Comput. Vis., vol. 127,
no. 6–7, pp. 743–762, 2019.

[21] S. Sanyal, T. Bolkart, H. Feng, and M. Black, “Learning to regress
3D face shape and expression from an image without 3D super-
vision,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
2019, pp. 7763–7772.

[22] M. Gadelha, S. Maji, and R. Wang, “3D shape induction from 2D
views of multiple objects,” in Proc. Int. Conf. 3D Vis., 2017, pp.
402–411.

[23] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective trans-
former nets: Learning single-view 3D object reconstruction with-
out 3D supervision,” in Proc. 30th Int. Conf. Neural Inf. Process.
Syst., 2016, pp. 1704–1712.

[24] H. Kato and T. Harada, “Learning view priors for single-view 3D
reconstruction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Rec-
ognit., 2019, pp. 9778–9787.

[25] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, “Learning cat-
egory-specific mesh reconstruction from image collections,” in
Proc. Eur. Conf. Comput. Vis., 2018, pp. 371–386.

[26] W. Chen et al., “Learning to predict 3D objects with an interpola-
tion-based differentiable renderer,” in Proc. Conf. Neural Inf. Pro-
cess. Syst., 2019, pp. 9605–9616. [Online]. Available: https://dblp.
org/rec/conf/nips/ChenLGSLJF19.html?view=bibtex

[27] P. Henzler, N. Mitra, and T. Ritschel, “Escaping plato’s cave using
adversarial training: 3D shape from unstructured 2D image
collections,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 9984–9993.

[28] N. Kulkarni, A. Gupta, D. F. Fouhey, and S. Tulsiani,
“Articulation-aware canonical surface mapping,” in Proc. IEEE/
CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 452–461.

[29] S. Goel, A. Kanazawa, and J. Malik, “Shape and viewpoints with-
out keypoints,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 88–104.

[30] X. Li et al., “Self-supervised single-view 3D reconstruction via
semantic consistency,” in Proc. Eur. Conf. Comput. Vis., 2020, pp.
677–693.

[31] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang,
“Hologan: Unsupervised learning of 3D representations from natural
images,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp.
7588–7597.

[32] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger, “GRAF:
Generative radiance fields for 3D-aware image synthesis,” in Proc. Conf.
Neural Inf. Process. Syst., 2020, pp. 20154–20166. [Online]. Available:
https://papers.nips.cc/paper/2020/hash/e92e1b476bb5262d793fd40931e0ed53-Abstract.html

[33] P. Henderson and V. Ferrari, “Learning single-image 3D recon-
struction by generative modelling of shape, pose and shading,”
Int. J. Comput. Vis., vol. 128, pp. 835–854, 2019.

[34] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond Pascal: A
benchmark for 3D object detection in the wild,” in Proc. IEEE Winter
Conf. Appl. Comput. Vis., 2014, pp. 75–82.

[35] O. Faugeras, Q.-T. Luong, and T. Papadopoulo, The Geometry of
Multiple Images. Cambridge, MA, USA: MIT Press, 2001.

[36] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised
learning of depth and ego-motion from video,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., 2017, pp. 6612–6619.

[37] B. Ummenhofer et al., “DeMoN: Depth and motion network for
learning monocular stereo,” in Proc. IEEE Conf. Comput. Vis. Pat-
tern Recognit., 2017, pp. 5622–5631.

[38] C. Godard, O. M. Aodha, and G. J. Brostow, “Unsupervised mon-
ocular depth estimation with left-right consistency,” in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., 2017, pp. 270–279.

[39] C. Bregler, A. Hertzmann, and H. Biermann, “Recovering non-
rigid 3D shape from image streams,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., 2000, pp. 690–696.

[40] D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi,
“C3DPO: Canonical 3D pose networks for non-rigid structure
from motion,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp.
7687–7696.

[41] B. K. P. Horn and M. J. Brooks, Shape from Shading. Cambridge,
MA, USA: MIT Press, 1989.

[42] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah, “Shape-from-shad-
ing: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no.
8, pp. 690–706, Aug. 1999.

[43] J. J. Koenderink, “What does the occluding contour tell us about
solid shape?,” Perception, vol. 13, no. 3, pp. 321–330, 1984.

[44] A. P. Witkin, “Recovering surface shape and orientation from
texture,” Artif. Intell., vol. 17, no. 1–3, pp. 17–45, 1981.

[45] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,”
in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 37–45.

[46] D. Novotny, D. Larlus, and A. Vedaldi, “Learning 3D object cate-
gories by looking around them,” in Proc. IEEE Int. Conf. Comput.
Vis., 2017, pp. 5228–5237.

[47] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, “Learning
depth from monocular videos using direct methods,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2022–2030.

[48] X. Li et al., “Online adaptation for consistent mesh reconstruction
in the wild,” in Proc. Neural Inf. Process. Syst., 2020,
pp. 15009–15019. [Online]. Available:
https://papers.nips.cc/paper/2020/hash/aba3b6fd5d186d28e06ff97135cade7f-Abstract.html

[49] Y. Luo et al., “Single view stereo matching,” in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 155–163.

[50] V. Sitzmann, M. Zollhöfer, and G. Wetzstein, “Scene
representation networks: Continuous 3D-structure-aware neural scene
representations,” in Proc. Neural Inf. Process. Syst., 2019,
pp. 1119–1130.

[51] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, “SynSin: End-to-
end view synthesis from a single image,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., 2020, pp. 7467–7477.

[52] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron,
R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural
radiance fields for view synthesis,” in Proc. Eur. Conf. Comput. Vis.,
2020, pp. 405–421. [Online]. Available:
https://link.springer.com/chapter/10.1007/978-3-030-58452-8_24

[53] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger,
“Differentiable volumetric rendering: Learning implicit 3D repre-
sentations without 3D supervision,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., 2020, pp. 3504–3515.

[54] L. Yariv et al., “Multiview neural surface reconstruction by disen-
tangling geometry and appearance,” in Proc. Neural Inf. Process.
Syst., 2020, pp. 2492–2502.

[55] S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi,
“Discovery of latent 3D keypoints via end-to-end geometric
reasoning,” in Proc. Neural Inf. Process. Syst., 2018, pp. 2063–2074.
[Online]. Available:
https://dblp.org/rec/conf/nips/SuwajanakornSTN18.html?view=bibtex

[56] C.-H. Chen et al., “Unsupervised 3D pose estimation with
geometric self-supervision,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit., 2019, pp. 5714–5724.

[57] A. T. Tran, T. Hassner, I. Masi, and G. Medioni, “Regressing
robust and discriminative 3D morphable models with a very deep
neural network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
2017, pp. 5163–5172.

[58] A. T. Tran, T. Hassner, I. Masi, E. Paz, Y. Nirkin, and G. Medioni,
“Extreme 3D face reconstruction: Seeing through occlusions,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3935–
3944.

[59] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou, “Joint 3D face
reconstruction and dense alignment with position map regression
network,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 557–574.

[60] J. Guo, X. Zhu, Y. Yang, F. Yang, Z. Lei, and S. Z. Li, “Towards
fast, accurate and stable 3D dense face alignment,” in Proc. Eur.
Conf. Comput. Vis., 2020, pp. 152–168.

[61] Y. Feng, H. Feng, M. J. Black, and T. Bolkart, “Learning an animat-
able detailed 3D face model from in-the-wild images,” 2020,
arXiv:2012.04012.

[62] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black,
“SMPL: A skinned multi-person linear model,” ACM Trans.
Graph., vol. 34, no. 6, 2015, Art. no. 248.

[63] J. Thewlis, H. Bilen, and A. Vedaldi, “Unsupervised learning of
object frames by dense equivariant image labelling,” in Proc. Neu-
ral Inf. Process. Syst., 2017, pp. 844–855. [Online]. Available:
https://dblp.org/rec/conf/nips/ThewlisBV17.html?view=bibtex

[64] J. Thewlis, H. Bilen, and A. Vedaldi, “Modelling and unsuper-
vised learning of symmetric deformable object categories,” in
Proc. Neural Inf. Process. Syst., 2018, pp. 8189–8200.

[65] Z. Shu, M. Sahasrabudhe, A. Guler, D. Samaras, N. Paragios, and
I. Kokkinos, “Deforming autoencoders: Unsupervised disentan-
gling of shape and appearance,” in Proc. Eur. Conf. Comput. Vis.,
2018, pp. 650–665.

[66] J.-Y. Zhu et al., “Visual object networks: Image generation with
disentangled 3D representations,” in Proc. Neural Inf. Process.
Syst., 2018, pp. 118–129. [Online]. Available:
https://dblp.org/rec/conf/nips/ZhuZZ00TF18.html?view=bibtex

[67] M. M. Loper and M. J. Black, “OpenDR: An approximate differen-
tiable renderer,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 154–169.

[68] H. Kato, Y. Ushiku, and T. Harada, “Neural 3D mesh renderer,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3907–
3916.

[69] S. Liu, T. Li, W. Chen, and H. Li, “Soft rasterizer: A differentiable
renderer for image-based 3D reasoning,” in Proc. IEEE/CVF Int.
Conf. Comput. Vis., 2019, pp. 7707–7716.

[70] B. Horn, “Obtaining shape from shading information,” in The
Psychology of Computer Vision. New York, NY, USA: McGraw-Hill,
1975.

[71] P. N. Belhumeur, D. J. Kriegman, and A. L. Yuille, “The bas-relief
ambiguity,” Int. J. Comput. Vis., vol. 35, pp. 33–44, 1999.

[72] A. Kendall and Y. Gal, “What uncertainties do we need in
Bayesian deep learning for computer vision?,” in Proc. Neural Inf.
Process. Syst., 2017, pp. 5574–5584. [Online]. Available:
https://dblp.org/rec/conf/nips/KendallG17.html?view=bibtex

[73] K. Simonyan and A. Zisserman, “Very deep convolutional
networks for large-scale image recognition,” in Proc. Int. Conf. Learn.
Representations, 2015. [Online]. Available:
https://dblp.org/rec/journals/corr/SimonyanZ14a.html?view=bibtex

[74] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu,
“Spatial transformer networks,” in Proc. Neural Inf. Process. Syst.,
2015, pp. 2017–2025.

[75] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attrib-
utes in the wild,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp.
3730–3738.

[76] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-
PIE,” Image Vis. Comput., vol. 28, no. 5, pp. 807–813, 2010.

[77] L. A. Jeni, J. F. Cohn, and T. Kanade, “Dense 3D face alignment
from 2D videos in real-time,” in Proc. Int. Conf. Autom. Face Gesture
Recognit., 2015, pp. 1–8.

[78] X. Zhang et al., “BP4D-Spontaneous: A high-resolution spontane-
ous 3D dynamic facial expression database,” Image Vis. Comput.,
vol. 32, no. 10, pp. 692–706, 2014.

[79] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale, “A high-resolu-
tion 3D dynamic facial expression database,” in Proc. Int. Conf.
Autom. Face Gesture Recognit., 2008, pp. 1–6.

[80] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and
alignment using multitask cascaded convolutional networks,”
IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.

[81] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN data-
base: Large-scale scene recognition from abbey to zoo,” in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3485–3492.

[82] W. Zhang, J. Sun, and X. Tang, “Cat head detection - How to effec-
tively exploit shape and texture features,” in Proc. Eur. Conf. Com-
put. Vis., 2008, pp. 802–816.

[83] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, “Cats
and dogs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012,
pp. 3498–3505.

[84] A. X. Chang et al., “ShapeNet: An information-rich 3D model
repository,” 2015, arXiv:1512.03012.

[85] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from
a single image using a multi-scale deep network,” in Proc. Int.
Conf. Neural Inf. Process. Syst., 2014, pp. 2366–2374.

[86] E. J. Crowley, O. M. Parkhi, and A. Zisserman, “Face painting:
Querying art with photos,” in Proc. Brit. Mach. Vis. Conf., 2015, pp.
1–13.

[87] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep
speaker recognition,” in Proc. INTERSPEECH, 2018, pp. 1086–
1090.

[88] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised
representation learning by predicting image rotations,” in Proc. Int. Conf.
Learn. Representations, 2018. [Online]. Available:
https://dblp.org/rec/conf/iclr/GidarisSK18.html?view=bibtex

[89] H.-Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki,
“Adversarial inverse graphics networks: Learning 2D-to-3D lift-
ing and image-to-image translation from unpaired supervision,”
in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4364–4372.

[90] A. Tewari et al., “MoFA: Model-based deep convolutional face
autoencoder for unsupervised monocular reconstruction,” in
Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3735–3744.

[91] Z.-H. Feng et al., “Evaluation of dense 3D reconstruction from 2D
face images in the wild,” in Proc. Int. Conf. Autom. Face Gesture Rec-
ognit., 2018, pp. 780–786.

[92] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M.
Pantic, “300 faces in-the-wild challenge: Database and results,”
Image Vis. Comput., vol. 47, pp. 3–18, 2016.

Shangzhe Wu received the bachelor’s degree in
computer science from the Hong Kong University
of Science and Technology, where he worked
with Chi-Keung Tang and Yu-Wing Tai on image
translation. He is currently working toward the
DPhil degree with the Visual Geometry Group,
University of Oxford, supervised by Andrea
Vedaldi. His research focuses on unsupervised
3D understanding. He was the recipient of the
Best Paper Award at CVPR 2020.

Christian Rupprecht received the PhD degree
from the Technical University of Munich, Ger-
many, advised by Nassir Navab and Gregory D.
Hager (JHU). He is currently a postdoctoral
researcher with the Visual Geometry Group, Uni-
versity of Oxford. For six months, he was with
Chris Pal, Mila Institute, Montreal, working on AI
safety. His research interests include self-super-
vised and minimally supervised learning for com-
puter vision.

Andrea Vedaldi is currently a professor of com-
puter vision and machine learning with the Uni-
versity of Oxford, where he has been co-leading
the Visual Geometry Group since 2012. He is also a
research scientist with Facebook AI Research,
London. He has authored or coauthored more
than 130 peer-reviewed publications in the top
machine vision and artificial intelligence confer-
ences and journals. His research interests
include unsupervised learning of representations
and geometry in computer vision. He was the
recipient of the Mark Everingham Prize for selfless contributions to
the computer vision community, the Open Source Software Award by
the ACM, and the Best Paper Award from the Conference on Computer
Vision and Pattern Recognition.

