1. Introduction
  2. Related work
  3. Model
    1. Architectures
    2. Training details
  4. Experiments
    1. Inverting AlexNet
    2. Variational autoencoder
  5. Conclusion

(NIPS 2016) Generating Images with Perceptual Similarity Metrics based on Deep Networks
Paper: http://lmb.informatik.uni-freiburg.de/Publications/2016/DB16c/inverting_GAN_nips2016_final.pdf
Supplement: http://lmb.informatik.uni-freiburg.de/Publications/2016/DB16c/inverting_GAN_nips2016_supp_final.pdf
Code: http://lmb.informatik.uni-freiburg.de/people/dosovits/code.html

Introduction

However, there has been little work on loss functions appropriate for the image generation task.

However, the exact locations of fine details are not important for the perceptual similarity of images, which is why per-pixel losses such as squared Euclidean distance tend to produce blurry, over-smoothed results.

Our main insight is that invariance to irrelevant transformations and sensitivity to local image statistics can be achieved by measuring distances in a suitable feature space.

Since feature representations are typically contractive, feature similarity does not automatically mean image similarity.

A combination of similarity in an appropriate feature space with adversarial training yields the best results.

We demonstrate this in two applications: inversion of the AlexNet convolutional network and a generative model based on a variational autoencoder.

Model

Suppose we are given a supervised image generation task and a training set of input-target pairs \(\{y_i, x_i\}\), consisting of high-level image representations \(y_i \in \mathbb{R}^I\) and images \(x_i \in \mathbb{R}^{W \times H \times C}\).

The aim is to learn the parameters \(\theta\) of a differentiable generator function \(G_\theta (\cdot): \mathbb{R}^I \to \mathbb{R}^{W \times H \times C}\) which optimally approximates the input-target dependency according to a loss function \(L(G_\theta (y), x)\).

We propose a new class of losses, which we call deep perceptual similarity metrics (DeePSiM).

These losses are weighted sums of three terms: feature loss \(L_{feat}\), adversarial loss \(L_{adv}\), and image space loss \(L_{img}\):

\[L = \lambda_{feat} L_{feat} + \lambda_{adv} L_{adv} + \lambda_{img} L_{img}\]

Loss in feature space. Given a differentiable comparator \(C: \mathbb{R}^{W \times H \times C} \to \mathbb{R}^F\), we define

\[L_{feat} = \sum_i \lVert C(G_\theta (y_i)) - C(x_i) \rVert_2^2\]
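As a concrete illustration, below is a minimal PyTorch-style sketch of \(L_{feat}\). The names `comparator`, `generated`, and `target` are placeholders; the comparator stands in for any fixed feature extractor \(C\):

```python
import torch.nn.functional as F

def feature_loss(comparator, generated, target):
    # L_feat = || C(G_theta(y)) - C(x) ||_2^2, summed over the batch.
    # The comparator C is kept fixed; gradients only need to flow back into the
    # generator through C(G_theta(y)), so the target features can be detached.
    return F.mse_loss(comparator(generated),
                      comparator(target).detach(),
                      reduction='sum')
```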

Optimizing just for similarity in a high-level feature space typically leads to high-frequency artifacts [21].

Therefore, a natural image prior is necessary to constrain the generated images to the manifold of natural images.

Adversarial loss. Instead of manually designing a prior, as in Mahendran and Vedaldi [21], we learn it with an approach similar to the Generative Adversarial Networks (GANs) of Goodfellow et al. [1]. A discriminator \(D_\phi\) is trained concurrently to distinguish generated images from real ones, while the generator is trained to fool it by minimizing

\[L_{adv} = - \sum_i \log D_\phi (G_\theta (y_i))\]

Loss in image space. Adversarial training is known to be unstable and sensitive to hyperparameter values. To stabilize it, we add a small squared error term in image space to the loss function:

\[L_{img} = \sum_i \lVert G_\theta (y_i) - x_i \rVert_2^2\]
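Putting the three terms together, a minimal sketch of the full DeePSiM objective could look as follows. It reuses `feature_loss` from above; the discriminator is assumed here to output the probability that its input is real, and the lambda values are left as arguments rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def deepsim_loss(generator, comparator, discriminator, y, x,
                 lambda_feat, lambda_adv, lambda_img):
    g = generator(y)                          # G_theta(y_i)

    # Feature-space term (see feature_loss above)
    l_feat = feature_loss(comparator, g, x)

    # Adversarial term for the generator: -sum_i log D_phi(G_theta(y_i))
    p_real = discriminator(g)                 # assumed to return P(input is real)
    l_adv = -torch.log(p_real + 1e-8).sum()

    # Small image-space term that stabilizes adversarial training
    l_img = F.mse_loss(g, x, reduction='sum')

    return lambda_feat * l_feat + lambda_adv * l_adv + lambda_img * l_img
```

The discriminator itself is trained in alternation with the usual GAN objective on real versus generated images (not shown here).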

Architectures

Generators

All our generators make use of up-convolutional ('deconvolutional') layers [8].

In all networks we use leaky ReLU nonlinearities.
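For concreteness, one up-convolutional block in this style could be written as below. The kernel size, stride, channel counts, and leaky-ReLU slope are assumptions for illustration, not the paper's exact settings:

```python
import torch.nn as nn

def upconv_block(in_channels, out_channels):
    # A stride-2 transposed convolution doubles the spatial resolution;
    # each layer is followed by a leaky ReLU, as in all the generators.
    return nn.Sequential(
        nn.ConvTranspose2d(in_channels, out_channels,
                           kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
    )

# Example: three blocks upsample a 256-channel 8x8 map to a 3-channel 64x64 output
# decoder = nn.Sequential(upconv_block(256, 128), upconv_block(128, 64), upconv_block(64, 3))
```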

Comparators

We experimented with three comparators:

  1. AlexNet [22] is a network with 5 convolutional and 2 fully connected layers trained on image classification. More precisely, in all experiments we used a variant of AlexNet called CaffeNet [23].

  2. The network of Wang and Gupta [24] has the same architecture as CaffeNet, but is trained without supervision. The network is trained to map frames of one video clip close to each other in the feature space and map frames from different videos far apart. We refer to this network as VideoNet.

  3. AlexNet with random weights.

We found that using CONV5 features for comparison leads to the best results in most cases, so we used these features unless specified otherwise.
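As an example of how such CONV5 features could be extracted, the sketch below uses torchvision's pretrained AlexNet as a stand-in for CaffeNet; the layer indices refer to torchvision's implementation:

```python
import torch
import torchvision.models as models

# Pretrained AlexNet used as a frozen comparator
# (newer torchvision versions use the `weights=` argument instead of `pretrained=True`)
alexnet = models.alexnet(pretrained=True).eval()
for p in alexnet.parameters():
    p.requires_grad_(False)

# In torchvision's AlexNet, features[:12] ends right after the conv5 ReLU
conv5 = alexnet.features[:12]

x = torch.randn(1, 3, 224, 224)   # dummy image batch
feats = conv5(x)                   # shape: (1, 256, 13, 13)
```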

Discriminator

In our setup the job of the discriminator is to analyze the local statistics of images. Therefore, after five convolutional layers (some of them strided) we perform global average pooling.

The result is processed by two fully connected layers, followed by a 2-way softmax.

We perform 50% dropout after the global average pooling layer and the first fully connected layer.
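A minimal sketch of a discriminator with this structure is given below; the channel counts, kernel sizes, and strides are illustrative assumptions, not the paper's exact values:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Five convolutional layers (some strided), global average pooling,
    two fully connected layers, and logits for a 2-way softmax."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 5, stride=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, stride=2), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 3, stride=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 256, 3, stride=2), nn.LeakyReLU(0.2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.drop = nn.Dropout(0.5)           # 50% dropout after pooling
        self.fc = nn.Sequential(
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Dropout(0.5),                  # 50% dropout after first FC layer
            nn.Linear(512, 2),                # logits for the 2-way softmax
        )

    def forward(self, x):
        h = self.pool(self.conv(x)).flatten(1)
        return self.fc(self.drop(h))          # apply softmax / cross-entropy outside
```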

Training details

Experiments

Inverting AlexNet

Reconstructing images from their AlexNet features is interesting for several reasons:

  1. It shows which information is preserved in the representation.

  2. Reconstruction from artificial networks can be seen as a test ground for reconstruction from real neural networks.

  3. In contrast with the standard scheme of "generative pre-training for a discriminative task", it shows that "discriminative pre-training for a generative task" can be fruitful.

  4. It indirectly shows that our loss can be useful for unsupervised learning with generative models.

Variational autoencoder

Conclusion