Generative Image Modeling using Style and Structure Adversarial Networks

Xiaolong Wang, Abhinav Gupta
Robotics Institute, Carnegie Mellon University

 

Presenter: Wei YANG, CUHK

Xiaolong Wang

Ph.D. Student @ RI, CMU

  • Object Detection/Segmentation
  • 3D Scene Understanding, Video Analysis.
  • Deep Learning, Unsupervised Learning.

Abhinav Gupta

Machine Learning Dept.

  • Learning representation
  • Language and Vision
  • Actions and objects

PAMI Young Researcher Award 2016

Unsupervised Learning of Visual Representation

Discriminative Framework

Generative Framework

Discriminative Framework

Use auxiliary tasks for which ground truth can be generated without manual labeling

Generative Framework

  • Non-parametric matching

  • Parametric modeling: RBMs, VAE, GAN ...

Generative Adversarial Networks

[Diagram: random noise z_i → Generator G → generated sample G(z_i); generated G(z_i) and real X_i → Discriminator D]

Generative Adversarial Networks

Real samples: \mathbf{X} = (X_1,\cdots, X_M)

Random noise sampled from a uniform distribution: \mathbf{Z} = (z_1,\cdots, z_M)

Loss for D network

L^D(\mathbf{X}, \mathbf{Z}) = \sum_{i=1}^{M/2} L(D(X_i), 1) + \sum_{i=M/2+1}^{M} L(D(G(z_i)), 0),

where L is the binary cross-entropy loss:

L(y^*, y) = -[y\log(y^*) + (1-y)\log(1-y^*)].
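A minimal sketch of how this discriminator loss could be computed, assuming a PyTorch setup in which `D` and `G` are hypothetical discriminator and generator modules (with `D` ending in a sigmoid), `x_real` is a batch of real samples, and `z` a batch of noise vectors:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x_real, z):
    """L^D: binary cross-entropy with label 1 for real samples, 0 for generated ones."""
    real_scores = D(x_real)            # D(X_i), probabilities in (0, 1)
    fake_scores = D(G(z).detach())     # D(G(z_i)); detach so G is not updated here
    # reduction="sum" mirrors the summation over the batch in the slide's formula
    loss_real = F.binary_cross_entropy(real_scores, torch.ones_like(real_scores), reduction="sum")
    loss_fake = F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores), reduction="sum")
    return loss_real + loss_fake
```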

Generative Adversarial Networks

Real samples: \mathbf{X} = (X_1,\cdots, X_M)

Random noise sampled from a uniform distribution: \mathbf{Z} = (z_1,\cdots, z_M)

Loss for G network

L^G(\mathbf{Z}) = \sum_{i=M/2+1}^{M} L(D(G(z_i)), 1).

The generator tries to fool the discriminator D into classifying generated samples as real.
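Correspondingly, a sketch of the generator loss under the same assumptions as above (hypothetical `D` and `G`, sigmoid output from `D`):

```python
import torch
import torch.nn.functional as F

def generator_loss(D, G, z):
    """L^G: the generator is trained so that D labels its samples as real (1)."""
    fake_scores = D(G(z))              # gradients flow back through G
    return F.binary_cross_entropy(fake_scores, torch.ones_like(fake_scores),
                                  reduction="sum")
```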

What does the GAN really learn?

  • Structure?
  • Texture?

Underlying Principles of image formation

Image = Structure + Style (texture)

SS-GAN

Style and Structure Adversarial Networks

  • Simplifies the generative process

  • More interpretable

  • Unsupervised learning of RGBD representation

  • Style-GAN as a rendering engine

SS-GAN

Structure-GAN + Style-GAN

Generated Examples

Structure-GAN

Style-GAN

Synthetic Scene Rendering

Style-GAN

Ground truth

Structure-GAN

Generates surface normals from a sampled \hat{z}

Generator

Input

100-d vector sampled from a uniform distribution

Output

72×72×3 surface normal map

Discriminator

Input

Surface normal maps (1/2 generated, 1/2 ground truth)

Output

Binary classification (generated or ground truth)

[Diagram: \hat{z} → Generator → G(\hat{z}); G(\hat{z}) and real X_i → Discriminator D(\cdot) → \pm 1]
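For illustration only, a DCGAN-style generator mapping a 100-d vector to a 72×72×3 map could look like the sketch below; the layer widths and kernel sizes are assumptions for this sketch, not the paper's exact Structure-GAN architecture:

```python
import torch
import torch.nn as nn

class StructureGenerator(nn.Module):
    """Illustrative generator: 100-d uniform noise -> 72x72x3 surface normal map."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.fc = nn.Linear(z_dim, 256 * 9 * 9)                    # project, reshape to 9x9
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 9 -> 18
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # 18 -> 36
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),     # 36 -> 72
            nn.Tanh(),                                              # normal components in [-1, 1]
        )

    def forward(self, z_hat):
        h = self.fc(z_hat).view(-1, 256, 9, 9)
        return self.deconv(h)

z_hat = torch.rand(8, 100) * 2 - 1                # uniform noise in [-1, 1]
normals = StructureGenerator()(z_hat)             # shape (8, 3, 72, 72)
```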

Style-GAN

Conditional GAN

Conditional GAN

Generator is conditioned on additional information

Input

100-d vector sampled from a uniform distribution

RGB images

Ground truth surface normal  

Output

128×128×3 scene image

[Diagram: surface normals C_i and noise \tilde{z} → Generator → G(C_i, \tilde{z}); the pairs (C_i, X_i) and (C_i, G(C_i, \tilde{z})) → Discriminator]
L_{cond}^D(\mathbf{X}, \mathbf{C}, \mathbf{\tilde{Z}}) = \sum_{i=1}^{M/2} L(D(C_i, X_i), 1) + \sum_{i=M/2+1}^{M} L(D(C_i, G(C_i, \tilde{z}_i)), 0),

L_{cond}^G(\mathbf{C}, \mathbf{\tilde{Z}}) = \sum_{i=M/2+1}^{M} L(D(C_i, G(C_i, \tilde{z}_i)), 1).
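A sketch of these conditional losses, assuming (hypothetically) a discriminator that takes the pair `(c, x)` and a generator that takes `(c, z_tilde)`, with sigmoid outputs from `D`:

```python
import torch
import torch.nn.functional as F

def cond_discriminator_loss(D, G, x_real, c, z_tilde):
    """L^D_cond: D sees the conditioning normals C_i together with the image."""
    real_scores = D(c, x_real)                     # D(C_i, X_i)
    fake_scores = D(c, G(c, z_tilde).detach())     # D(C_i, G(C_i, z~_i)), G frozen
    return (F.binary_cross_entropy(real_scores, torch.ones_like(real_scores), reduction="sum")
            + F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores), reduction="sum"))

def cond_generator_loss(D, G, c, z_tilde):
    """L^G_cond: fool D into labelling conditioned generations as real."""
    fake_scores = D(c, G(c, z_tilde))
    return F.binary_cross_entropy(fake_scores, torch.ones_like(fake_scores), reduction="sum")
```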

Style-GAN Generator

Style-GAN: full architecture

Drawback

The generated images are noisy

Edges are not well aligned with the surface normals

Style-GAN: full architecture

Multi-task learning with pixel-wise constraints

Assumption: if the generated image is realistic enough, it can be used to reconstruct the surface normal map

Style-GAN: full architecture

Full loss:

L_{multi}^G(\mathbf{C}, \mathbf{\tilde{Z}}) = L_{cond}^G(\mathbf{C}, \mathbf{\tilde{Z}}) + L^{FCN}(G(\mathbf{C}, \mathbf{\tilde{Z}}), \mathbf{C})

FCN loss:

L^{FCN}(\mathbf{X}, \mathbf{C}) = \sum_{i=1}^{M} \sum_{k=1}^{K\times K} softmax(F_k(X_i), C_{i,k})
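A sketch of the multi-task generator objective, assuming the FCN predicts a per-pixel class for quantized surface normals and that `c_labels` holds those class indices; the names and the quantization scheme are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def multitask_generator_loss(D, G, FCN, c, c_labels, z_tilde):
    """L^G_multi = L^G_cond + L^FCN."""
    fake = G(c, z_tilde)
    # adversarial term: fool the conditional discriminator
    fake_scores = D(c, fake)
    adv = F.binary_cross_entropy(fake_scores, torch.ones_like(fake_scores),
                                 reduction="sum")
    # pixel-wise term: the FCN must recover the conditioning normal classes
    logits = FCN(fake)                         # (M, num_classes, K, K)
    fcn = F.cross_entropy(logits, c_labels,    # c_labels: (M, K, K) class indices
                          reduction="sum")
    return adv + fcn
```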

Style-GAN: full architecture

EM-like training algorithm

  • Fix the generator G, optimize the discriminator D 
  • Fix the FCN and the discriminator D, optimize the generator G
  • Fix the generator G, fine-tune FCN using generated and real images.
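One round of this alternating schedule might look as follows, reusing the loss sketches from the previous slides; the data loader, the optimizers, and the 100-d noise dimension are placeholders/assumptions:

```python
import torch
import torch.nn.functional as F

for x_real, c, c_labels in loader:                       # hypothetical data loader
    z_tilde = torch.rand(x_real.size(0), 100) * 2 - 1    # uniform noise in [-1, 1]

    # 1) Fix G, optimize D
    d_opt.zero_grad()
    cond_discriminator_loss(D, G, x_real, c, z_tilde).backward()
    d_opt.step()

    # 2) Fix D and the FCN, optimize G with the multi-task loss
    #    (only G's parameters are in g_opt, so D and FCN stay fixed)
    g_opt.zero_grad()
    multitask_generator_loss(D, G, FCN, c, c_labels, z_tilde).backward()
    g_opt.step()

    # 3) Fix G, fine-tune the FCN on real and generated images
    fcn_opt.zero_grad()
    fcn_loss = (F.cross_entropy(FCN(x_real), c_labels)
                + F.cross_entropy(FCN(G(c, z_tilde).detach()), c_labels))
    fcn_loss.backward()
    fcn_opt.step()
```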

Joint Learning for SS-GAN

Experiments

Style-GAN Visualization

Style-GAN

Visualized results (ground truth from the NYUv2 test set)

Style-GAN

with/without pixel-wise constraints

Rendering on Synthetic Scene

Inputs are 3D model annotations.

SS-GAN Visualization

DCGAN: Radford et al. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

LAPGAN: Denton et al. "Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks." NIPS, 2015.

Walking the latent space

Interpret the model by manipulating the latent variables \hat{z}, \tilde{z}

"Growing" 3D cubic

Shutdown window

Nearest Neighbors Test

Quantitative Results

Scene Classification

Object Detection

Thank you.

https://github.com/xiaolonw/ss-gan
