1. Introduction
  2. Related Work
  3. Review of Generative Adversarial Nets
  4. Pixel-Level Domain Transfer
    1. Converter Network
    2. Discriminator Networks
    3. Adversarial Training
  5. Evaluation
    1. LookBook Dataset
    2. Experiment Details
    3. Qualitative evaluation
    4. Quantitative evaluation by user study
    5. Quantitative evaluation by pixel-level (dis)similarity
  6. Conclusion

(ECCV 2016) Pixel-Level Domain Transfer
Paper: https://dgyoo.github.io/papers/eccv16.pdf
Supplement: https://dgyoo.github.io/papers/eccv16_supp.pdf
Code: https://github.com/fxia22/PixelDTGAN

Introduction

The focus of this paper is the problem of enabling a machine to transfer a visual input into different forms, and to visualize those forms by generating images at the pixel level.

image-conditioned image generation

In prior domain adaptation approaches, all of these adaptations take place in the feature space, i.e., the model parameters are adapted. However, our method directly produces target images.

We present a pixel-level domain converter composed of an encoder for semantic embedding of the source and a decoder to produce a target image.

To train the converter, we first place a separate network, named the domain discriminator, on top of the converter. The domain discriminator takes a pair of a source image and a target image, and is trained to make a binary decision on whether the input pair is associated or not. Second, we also employ the real/fake discriminator of [6], which is supervised by "real" or "fake" labels, to encourage the converter to produce realistic images.

source domain: a natural human image

target domain: a product image of the person's top

The authors built a large dataset named LookBook, containing 84k images in total, in which 75k human images are associated with 10k top product images.

Contributions:

  • Semantically transferring a source domain to a target domain at the pixel level.
  • A novel discriminator that makes it possible to learn the semantic relation between the two domains.
  • A large clothing dataset covering both domains (LookBook).

Review of Generative Adversarial Nets
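
For reference, the model reviewed here is the original GAN [6]: a generator \(G\) and a discriminator \(D\) play the minimax game

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right],
\]

where \(D\) is trained to tell real samples from generated ones, and \(G\) is trained to fool \(D\).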

Pixel-Level Domain Transfer

Converter Network

Fig. 2 of the paper illustrates the converter's encoder–decoder architecture.

The encoder is composed of five convolutional layers that abstract the source image into a 64-dimensional semantic code.

The decoder then constructs a relevant target image from this code through five decoding layers.
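
A minimal PyTorch sketch of such an encoder–decoder converter is given below, assuming 64×64 RGB inputs. The kernel sizes, channel widths, and normalization choices are illustrative guesses, not the paper's exact configuration.

```python
# Sketch of the converter: five stride-2 convolutions encode a 64x64 source
# into a 64-d code; five transposed convolutions decode it back to 64x64.
# Layer hyper-parameters are assumptions of this sketch.
import torch
import torch.nn as nn

class Converter(nn.Module):
    def __init__(self, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 64 -> 32
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 32 -> 16
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), # 16 -> 8
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, stride=2, padding=1), # 8 -> 4
            nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
            nn.Conv2d(512, code_dim, 4),                 # 4 -> 1: semantic code
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(code_dim, 512, 4),                  # 1 -> 4
            nn.BatchNorm2d(512), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),  # 4 -> 8
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 8 -> 16
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 16 -> 32
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),     # 32 -> 64
            nn.Tanh(),
        )

    def forward(self, source):
        code = self.encoder(source)  # (N, code_dim, 1, 1)
        return self.decoder(code)    # (N, 3, 64, 64)
```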

Discriminator Networks

Given the converter, a simple choice of loss function to train it is the mean squared error (MSE), \(\| \hat{I}_T - I_T \|^2_2\).

However, MSE is not suitable as pixel-level supervision for natural images: minimizing it always forces the converter to fit to one of the target images, and in practice it yields blurry results.

Instead, we place a discriminator network that plays the role of a loss function.

The domain discriminator takes a pair of source and target images as input, and produces a scalar probability of whether the pair is associated or not.

In summary, we employ both the real/fake discriminator and the domain discriminator for adversarial training.

The real/fake discriminator penalizes an unrealistic target, while the domain discriminator penalizes a target that is irrelevant to its source.
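
In the same spirit, here is a PyTorch sketch of the two discriminators. The shared trunk shape and the channel-wise concatenation of the (source, target) pair are assumptions of this sketch rather than the paper's exact architecture.

```python
# Sketch of the two discriminators: a real/fake discriminator over a single
# target image, and a domain discriminator over a (source, target) pair.
import torch
import torch.nn as nn

def disc_trunk(in_ch):
    # Downsampling trunk 64x64 -> 4x4, then a single sigmoid output.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1),
        nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 256, 4, stride=2, padding=1),
        nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
        nn.Conv2d(256, 512, 4, stride=2, padding=1),
        nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
        nn.Conv2d(512, 1, 4),  # 4x4 -> 1x1
        nn.Sigmoid(),
    )

class RealFakeDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = disc_trunk(3)           # sees only a target image

    def forward(self, target):
        return self.net(target).view(-1)   # P(target looks real)

class DomainDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = disc_trunk(6)           # sees the stacked pair

    def forward(self, source, target):
        pair = torch.cat([source, target], dim=1)
        return self.net(pair).view(-1)     # P(pair is associated)
```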

Adversarial Training

First, we train the discriminators.

After that, we freeze the updated discriminator parameters and optimize the converter parameters so as to increase the losses of both discriminators, i.e., to fool them.
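
A hypothetical single training iteration under this alternating scheme is sketched below. The pair labels for the domain discriminator (associated pair → 1, unassociated or generated pair → 0) follow the paper's description; the optimizer handling and the non-saturating form of the converter objective (minimizing cross-entropy against the "real"/"associated" labels rather than literally maximizing the discriminator losses) are implementation choices of this sketch.

```python
# One alternating training step: update both discriminators first, then
# update the converter to fool them. All names here are placeholders.
import torch
import torch.nn.functional as F

def train_step(converter, d_real, d_assoc, opt_c, opt_dr, opt_da,
               src, tgt, tgt_unassoc):
    bce = F.binary_cross_entropy
    ones = torch.ones(src.size(0))
    zeros = torch.zeros(src.size(0))
    fake = converter(src)

    # 1) Real/fake discriminator: real targets vs. generated targets.
    loss_dr = bce(d_real(tgt), ones) + bce(d_real(fake.detach()), zeros)
    opt_dr.zero_grad(); loss_dr.backward(); opt_dr.step()

    # 2) Domain discriminator: associated pairs vs. unassociated or
    #    generated pairs.
    loss_da = (bce(d_assoc(src, tgt), ones)
               + bce(d_assoc(src, tgt_unassoc), zeros)
               + bce(d_assoc(src, fake.detach()), zeros))
    opt_da.zero_grad(); loss_da.backward(); opt_da.step()

    # 3) Converter: the discriminator optimizers are not stepped here, so
    #    only the converter moves; push both outputs toward "real"/"associated".
    loss_c = bce(d_real(fake), ones) + bce(d_assoc(src, fake), ones)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
```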

Evaluation

LookBook Dataset

We make a dataset named LookBook that covers two fashion domains. Images of one domain contain fashion models, and those of the other domain contain top products with a clean background.

LookBook contains 84,748 images where 9,732 top product images are associated with 75,016 fashion model images.

Experiment Details

Before training, we rescale all images in LookBook so that the longer side is 64 pixels, keeping the aspect ratio, and fill the margins on both sides with 255s (white).
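
A minimal Pillow sketch of this preprocessing step; the `preprocess` helper and the centered placement on the white canvas are assumptions of this sketch, not taken from the paper's code.

```python
# Rescale so the longer side is 64 px, keep the aspect ratio, and pad the
# margins with white (255) to obtain a square 64x64 image.
from PIL import Image

def preprocess(path, size=64):
    img = Image.open(path).convert("RGB")
    scale = size / max(img.size)
    img = img.resize((max(1, round(img.width * scale)),
                      max(1, round(img.height * scale))))
    canvas = Image.new("RGB", (size, size), (255, 255, 255))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas
```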

We randomly select 5% of the images to define a validation set, and another 5% for a test set.

Qualitative evaluation

Quantitative evaluation by user study

Quantitative evaluation by pixel-level (dis)similarity

Conclusion