Spectral Normalization for Generative Adversarial Networks

(SN-GAN)

 

ICLR 2018

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida

https://arxiv.org/abs/1802.05957

Outline

  • GAN
  • Wasserstein GAN (WGAN)

  • Spectral Normalization GAN (SN-GAN)

  • Experiments

  • Questions

GAN

$$\begin{aligned}\min_G\max_D\mathcal{L}&=\min_G\max_D\mathbb{E}_{x\sim p_{real}}[\log(D(x))]+\mathbb{E}_{x\sim p_G}[\log(1-D(x))]\\&=\min_G\left(2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)-2\log2\right)\end{aligned}$$

note that

$$JS(\mathbb{P}_r\Vert\mathbb{P}_g)=\frac{1}{2}KL\left(\mathbb{P}_r\,\Big\Vert\,\frac{\mathbb{P}_r+\mathbb{P}_g}{2}\right)+\frac{1}{2}KL\left(\mathbb{P}_g\,\Big\Vert\,\frac{\mathbb{P}_r+\mathbb{P}_g}{2}\right)$$
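The second equality follows by substituting the optimal discriminator for a fixed generator back into the objective:

$$D^*(x)=\frac{p_{real}(x)}{p_{real}(x)+p_G(x)},\qquad \mathcal{L}(G,D^*)=2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)-2\log2$$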

Unstable training

source: Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

GAN

$$\begin{aligned}\min_G\max_D\mathcal{L}&=\min_G\max_D\mathbb{E}_{x\sim p_r}[\log(D(x))]+\mathbb{E}_{x\sim p_g}[\log(1-D(x))]\\&=\min_G\left(2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)-2\log2\right)\end{aligned}$$

Why mode collapse occurs:

To avoid vanishing gradients for the generator, the generator loss is changed from

$$\min_G\mathcal{L}_G=\min_G\mathbb{E}_{x\sim p_G}[\log(1-D(x))]$$

to

$$\min_G\mathcal{L}_G=\min_G\mathbb{E}_{x\sim p_G}[-\log(D(x))]$$

then, at the optimal discriminator,

$$\mathcal{L}_G=KL(\mathbb{P}_g\Vert\mathbb{P}_r)-2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)$$

The reverse KL term penalizes generating samples outside the support of the real data far more than it penalizes dropping modes of the real data, so

$$\rightarrow\mathbb{P}_g\text{ tends to converge to a small region (mode collapse)}$$
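A minimal PyTorch sketch (my own illustration, not code from the slides) contrasting the two generator losses; `d_fake_logits` is assumed to be the discriminator's raw, pre-sigmoid output on generated samples:

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, non_saturating=True):
    ones = torch.ones_like(d_fake_logits)
    zeros = torch.zeros_like(d_fake_logits)
    if non_saturating:
        # min -log D(G(z)): keeps useful gradients even when D rejects the fakes,
        # but at the optimal D it behaves like KL(P_g || P_r), which favors mode collapse.
        return F.binary_cross_entropy_with_logits(d_fake_logits, ones)
    # min log(1 - D(G(z))): the original objective, whose gradient vanishes
    # once D confidently rejects the generated samples.
    return -F.binary_cross_entropy_with_logits(d_fake_logits, zeros)
```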

Wasserstein GAN

Wasserstein Distance:

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_{\theta})}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert]$$

source: https://vincentherrmann.github.io/blog/wasserstein/
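As a quick numeric illustration (not part of the original slides), SciPy's `wasserstein_distance` computes the 1-D case of this definition between two empirical samples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=10_000)   # samples from P_r
fake = rng.normal(loc=2.0, scale=1.0, size=10_000)   # samples from P_theta

# For two equal-variance Gaussians, W1 is roughly the distance between the means
# (~2.0 here). Unlike the JS divergence, it does not saturate when the two
# distributions barely overlap, which is what makes it a usable training signal.
print(wasserstein_distance(real, fake))
```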

Wasserstein GAN

Wasserstein Distance:

$$\text{If }G\text{ is continuous in }\theta\text{, then}$$

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_{\theta})}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert]$$

  • $$W(\mathbb{P}_r, \mathbb{P}_{\theta})\text{ is }\mathbf{continuous}\text{ everywhere.}$$
  • $$W(\mathbb{P}_r, \mathbb{P}_{\theta})\text{ is }\mathbf{differentiable}\text{ }\mathbf{almost}\text{ }\mathbf{everywhere.}$$

Dual representation:

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\sup_{\Vert f\Vert_{Lip}\le 1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_{\theta}}[f(x)]$$

$$\text{where the supremum is over all the 1-Lipschitz functions }f:X\rightarrow\mathbb{R}$$

Wasserstein GAN

Wasserstein distance:

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\sup_{\Vert f\Vert_{Lip}\le 1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_{\theta}}[f(x)]$$

$$\text{Use a neural network (the discriminator) to approximate }f$$

$$W(\mathbb{P}_r, \mathbb{P}_{g})=\max_D\mathbb{E}_{x\sim\mathbb{P}_r}[D(x)]-\mathbb{E}_{x\sim\mathbb{P}_{g}}[D(x)]$$

$$\text{The remaining work is to restrict the Lipschitz constant of }D$$

$$\Vert D\Vert_{Lip}\le1$$
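A rough PyTorch sketch of this objective (my own illustration; `D`, `real`, and `fake` are placeholders for a critic network and data batches):

```python
import torch

def critic_loss(D, real, fake):
    # Estimate -[E_{P_r}[D(x)] - E_{P_g}[D(x)]]; minimizing this maximizes the gap,
    # which approximates W(P_r, P_g) as long as ||D||_Lip <= 1 is enforced.
    return -(D(real).mean() - D(fake).mean())

def generator_loss(D, fake):
    # The generator minimizes -E_{P_g}[D(x)], i.e. it pushes the critic's
    # score on generated samples upward.
    return -D(fake).mean()
```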

Different Approaches

Directly restrict the Lipschitz constant

  • WGAN [3]: weight clipping of the discriminator parameters

$$\begin{aligned}\theta_D&\leftarrow \theta_D+\alpha\cdot \nabla_{\theta_D}\mathcal{L}_D\\\theta_D&\leftarrow \mathrm{clip}(\theta_D,-c,c)\end{aligned}$$

  • WGAN-GP [5]: gradient penalty (a code sketch of both constraints follows after this list)

$$\mathcal{L}_{GP}=\lambda(\Vert \nabla_{\hat{x}}D(\hat{x})\Vert_2-1)^2$$

  • SN-GAN [6]: spectral normalization of each discriminator weight matrix

$$\begin{aligned}&W^l_{SN}\leftarrow W^l/\sigma(W^l)\\&W^l\leftarrow W^l-\alpha\cdot \nabla_{W^l}\mathcal{L}(W^l_{SN},D)\\&\sigma(\cdot)\text{: spectral norm}\end{aligned}$$

Indirect approaches

  • BEGAN [4]: Maximize the lower bound of Wasserstein Loss
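
A minimal PyTorch sketch (my own illustration, not code from any of the papers) of the two direct constraints listed above; `D`, `real`, and `fake` are assumed to be a critic network and batches of real/generated samples:

```python
import torch

def clip_weights(D, c=0.01):
    # WGAN: clamp every discriminator parameter to [-c, c] after each update.
    for p in D.parameters():
        p.data.clamp_(-c, c)

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    # WGAN-GP: penalize the critic's gradient norm at points interpolated
    # between real and generated samples.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grad = torch.autograd.grad(
        outputs=d_hat, inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True)[0]
    grad_norm = grad.reshape(grad.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```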

SN GAN

  • Spectral Normalization:

$$\begin{aligned}&W^l_{SN}=W^l/\sigma(W^l)\\&\sigma(\cdot)\text{ denotes the spectral norm, i.e. the largest singular value}\end{aligned}$$

  • Apply spectral normalization to each layer of D(x); then:

$$\Vert D_{\theta}\Vert _{Lip}\le1$$

  • Proof sketch:

$$\begin{aligned}\text{Let }D(x)&=a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1 x)\cdots))))\\\Vert D\Vert_{Lip}&\le\Vert a_L\Vert_{Lip}\cdot\Vert W^L\Vert_{Lip}\cdot\Vert a_{L-1}\Vert_{Lip}\cdot\Vert W^{L-1}\Vert_{Lip}\cdots\Vert a_1\Vert_{Lip}\cdot\Vert W^1\Vert_{Lip}\end{aligned}$$

$$\text{For a linear map }\Vert W\Vert_{Lip}=\sigma(W)\text{, and ReLU / leaky ReLU activations are 1-Lipschitz, so}$$

$$\sigma(W^l_{SN})=\sigma(W^l/\sigma(W^l))=1\;\Rightarrow\;\Vert D\Vert_{Lip}\le1$$
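The paper estimates σ(W) cheaply with power iteration, reusing the singular-vector estimate across training steps. Below is a minimal sketch of that idea (my own illustration; `W` and `u` are placeholder tensors, not the paper's code):

```python
import torch
import torch.nn.functional as F

def spectrally_normalize(W, u, n_power_iterations=1):
    """Estimate sigma(W) by power iteration and return W / sigma(W)."""
    W_mat = W.reshape(W.size(0), -1)                # flatten conv kernels to 2-D
    with torch.no_grad():
        for _ in range(n_power_iterations):
            v = F.normalize(W_mat.t() @ u, dim=0)   # right singular vector estimate
            u = F.normalize(W_mat @ v, dim=0)       # left singular vector estimate
    sigma = torch.dot(u, W_mat @ v)                 # approx. largest singular value
    return W / sigma, u

# Example: normalize a randomly initialized linear layer's weight.
layer = torch.nn.Linear(128, 64)
u = F.normalize(torch.randn(64), dim=0)
W_sn, u = spectrally_normalize(layer.weight.data, u)
print(torch.linalg.matrix_norm(W_sn, ord=2))        # spectral norm, close to 1.0
```

PyTorch also ships `torch.nn.utils.spectral_norm`, which applies the same running power-iteration normalization to a module's weight.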


Experiments

$$\begin{aligned}\beta_1,\beta_2:&\text{ momentum parameters for Adam}\\\alpha:&\text{ learning rate}\\n_{dis}:&\text{ the number of discriminator updates}\\&\text{ per generator update}\end{aligned}$$

Experiments

| Model | Loss | Architecture | Does it work? |
| --- | --- | --- | --- |
| DCGAN (in paper) * | KL | TransConv | works |
| WGAN (in paper) | Wasserstein | TransConv | works |
| WGAN-GP * | Wasserstein | TransConv | works |
| WGAN-GP (in paper) * | Wasserstein | ResNet | works |
| SN-GAN * | Wasserstein | TransConv | works |
| SN-GAN (in paper) * | Hinge loss | ResNet | works |
| SN-GAN * | Wasserstein | ResNet | does NOT work |

Experiments

* means I have tested it myself.

source: github

SN-GAN (hinge loss + ResNet)

SN-GAN (Wasserstein distance + ResNet)

SN-GAN (Wasserstein distance + TransConv)


References

[3] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

[4] Berthelot, David, Thomas Schumm, and Luke Metz. "BEGAN: Boundary Equilibrium Generative Adversarial Networks." arXiv preprint arXiv:1703.10717 (2017).

[5] Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. "Improved Training of Wasserstein GANs." NIPS 2017.

[6] Miyato, Takeru, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. "Spectral Normalization for Generative Adversarial Networks." ICLR 2018.

Questions

  1. Why is the training time of SN-GAN less than that of WGAN-GP?
  2. If mode collapse occurs, what do the generated images look like?

Spectral Normalization for Generative Adversarial Networks

By w86763777
