Spectral Normalization for Generative Adversarial Networks

(SN-GAN)

 

ICLR 2018

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida

https://arxiv.org/abs/1802.05957

Outline

  • GAN
  • Wasserstein GAN (WGAN)

  • Spectral Normalization GAN (SN-GAN)

  • Experiments

  • Questions

GAN

$$\begin{aligned}\min_G\max_D\mathcal{L}&=\min_G\max_D\mathbb{E}_{x\sim p_{real}}[\log(D(x))]+\mathbb{E}_{x\sim p_G}[\log(1-D(x))]\\&=\min_G\left(2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)-2\log2\right)\end{aligned}$$

note that

$$JS(\mathbb{P}_r\Vert\mathbb{P}_g)=\frac{1}{2}KL\left(\mathbb{P}_r\,\Big\Vert\,\frac{\mathbb{P}_r+\mathbb{P}_g}{2}\right)+\frac{1}{2}KL\left(\mathbb{P}_g\,\Big\Vert\,\frac{\mathbb{P}_r+\mathbb{P}_g}{2}\right)$$
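The second equality follows by substituting the optimal discriminator for a fixed generator back into the objective:

$$D^*(x)=\frac{p_{real}(x)}{p_{real}(x)+p_G(x)},\qquad \mathcal{L}(G,D^*)=2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)-2\log2$$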

Unstable training

source: Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

GAN

$$\begin{aligned}\min_G\max_D\mathcal{L}&=\min_G\max_D\mathbb{E}_{x\sim p_r}[\log(D(x))]+\mathbb{E}_{x\sim p_g}[\log(1-D(x))]\\&=\min_G\left(2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)-2\log2\right)\end{aligned}$$

Why mode collapse occurs:

To avoid vanishing gradients for the generator, the generator loss is changed from

$$\min_G\mathcal{L}_G=\min_G\mathbb{E}_{x\sim p_G}[\log(1-D(x))]$$

to

$$\min_G\mathcal{L}_G=\min_G\mathbb{E}_{x\sim p_G}[-\log(D(x))]$$

then, at the optimal discriminator,

$$\mathcal{L}_G=KL(\mathbb{P}_g\Vert\mathbb{P}_r)-2\,JS(\mathbb{P}_r\Vert\mathbb{P}_g)$$

The reverse KL term penalizes generating samples outside the support of the real data far more than it penalizes dropping modes of the real data, so

$$\rightarrow\mathbb{P}_g\text{ tends to converge to a small region (mode collapse)}$$
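A minimal PyTorch sketch (my own illustration, not code from the slides) contrasting the two generator losses; `d_fake_logits` is assumed to be the discriminator's raw, pre-sigmoid output on generated samples:

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, non_saturating=True):
    ones = torch.ones_like(d_fake_logits)
    zeros = torch.zeros_like(d_fake_logits)
    if non_saturating:
        # min -log D(G(z)): keeps useful gradients even when D rejects the fakes,
        # but at the optimal D it behaves like KL(P_g || P_r), which favors mode collapse.
        return F.binary_cross_entropy_with_logits(d_fake_logits, ones)
    # min log(1 - D(G(z))): the original objective, whose gradient vanishes
    # once D confidently rejects the generated samples.
    return -F.binary_cross_entropy_with_logits(d_fake_logits, zeros)
```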

Wasserstein GAN

Wasserstein Distance:

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_{\theta})}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert]$$

source: https://vincentherrmann.github.io/blog/wasserstein/
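As a quick numeric illustration (not part of the original slides), SciPy's `wasserstein_distance` computes the 1-D case of this definition between two empirical samples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=10_000)   # samples from P_r
fake = rng.normal(loc=2.0, scale=1.0, size=10_000)   # samples from P_theta

# For two equal-variance Gaussians, W1 is roughly the distance between the means
# (~2.0 here). Unlike the JS divergence, it does not saturate when the two
# distributions barely overlap, which is what makes it a usable training signal.
print(wasserstein_distance(real, fake))
```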

Wasserstein GAN

Wasserstein Distance:

$$\text{If }G\text{ is continuous in }\theta\text{, then}$$

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_{\theta})}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert]$$

  • $$W(\mathbb{P}_r, \mathbb{P}_{\theta})\text{ is }\mathbf{continuous}\text{ everywhere.}$$
  • $$W(\mathbb{P}_r, \mathbb{P}_{\theta})\text{ is }\mathbf{differentiable}\text{ }\mathbf{almost}\text{ }\mathbf{everywhere.}$$

Dual representation:

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\sup_{\Vert f\Vert_{Lip}\le 1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_{\theta}}[f(x)]$$

$$\text{where the supremum is over all the 1-Lipschitz functions }f:X\rightarrow\mathbb{R}$$

Wasserstein GAN

Wasserstein distance:

$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\sup_{\Vert f\Vert_{Lip}\le 1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_{\theta}}[f(x)]$$

$$\text{Use a neural network (the discriminator) to approximate }f$$

$$W(\mathbb{P}_r, \mathbb{P}_{g})=\max_D\mathbb{E}_{x\sim\mathbb{P}_r}[D(x)]-\mathbb{E}_{x\sim\mathbb{P}_{g}}[D(x)]$$

$$\text{The remaining work is to restrict the Lipschitz constant of }D$$

$$\Vert D\Vert_{Lip}\le1$$
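A rough PyTorch sketch of this objective (my own illustration; `D`, `real`, and `fake` are placeholders for a critic network and data batches):

```python
import torch

def critic_loss(D, real, fake):
    # Estimate -[E_{P_r}[D(x)] - E_{P_g}[D(x)]]; minimizing this maximizes the gap,
    # which approximates W(P_r, P_g) as long as ||D||_Lip <= 1 is enforced.
    return -(D(real).mean() - D(fake).mean())

def generator_loss(D, fake):
    # The generator minimizes -E_{P_g}[D(x)], i.e. it pushes the critic's
    # score on generated samples upward.
    return -D(fake).mean()
```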

Different Approaches

Directly restrict the Lipschitz constant

  • WGAN [3]: weight clipping of the discriminator parameters

$$\begin{aligned}\theta_D&\leftarrow \theta_D+\alpha\cdot \nabla_{\theta_D}\mathcal{L}_D\\\theta_D&\leftarrow \mathrm{clip}(\theta_D,-c,c)\end{aligned}$$

  • WGAN-GP [5]: gradient penalty (a code sketch of both constraints follows after this list)

$$\mathcal{L}_{GP}=\lambda(\Vert \nabla_{\hat{x}}D(\hat{x})\Vert_2-1)^2$$

  • SN-GAN [6]: spectral normalization of each discriminator weight matrix

$$\begin{aligned}&W^l_{SN}\leftarrow W^l/\sigma(W^l)\\&W^l\leftarrow W^l-\alpha\cdot \nabla_{W^l}\mathcal{L}(W^l_{SN},D)\\&\sigma(\cdot)\text{: spectral norm}\end{aligned}$$

Indirect approaches

  • BEGAN [4]: Maximize the lower bound of Wasserstein Loss
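
A minimal PyTorch sketch (my own illustration, not code from any of the papers) of the two direct constraints listed above; `D`, `real`, and `fake` are assumed to be a critic network and batches of real/generated samples:

```python
import torch

def clip_weights(D, c=0.01):
    # WGAN: clamp every discriminator parameter to [-c, c] after each update.
    for p in D.parameters():
        p.data.clamp_(-c, c)

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    # WGAN-GP: penalize the critic's gradient norm at points interpolated
    # between real and generated samples.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grad = torch.autograd.grad(
        outputs=d_hat, inputs=x_hat,
        grad_outputs=torch.ones_like(d_hat),
        create_graph=True)[0]
    grad_norm = grad.reshape(grad.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```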

SN GAN

  • Spectral Normalization:

$$\begin{aligned}&W^l_{SN}=W^l/\sigma(W^l)\\&\sigma(\cdot)\text{ denotes the spectral norm, i.e. the largest singular value}\end{aligned}$$

  • Apply spectral normalization to each layer of D(x); then:

$$\Vert D_{\theta}\Vert _{Lip}\le1$$

  • Proof sketch:

$$\begin{aligned}\text{Let }D(x)&=a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1 x)\cdots))))\\\Vert D\Vert_{Lip}&\le\Vert a_L\Vert_{Lip}\cdot\Vert W^L\Vert_{Lip}\cdot\Vert a_{L-1}\Vert_{Lip}\cdot\Vert W^{L-1}\Vert_{Lip}\cdots\Vert a_1\Vert_{Lip}\cdot\Vert W^1\Vert_{Lip}\end{aligned}$$

$$\text{For a linear map }\Vert W\Vert_{Lip}=\sigma(W)\text{, and ReLU / leaky ReLU activations are 1-Lipschitz, so}$$

$$\sigma(W^l_{SN})=\sigma(W^l/\sigma(W^l))=1\;\Rightarrow\;\Vert D\Vert_{Lip}\le1$$
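The paper estimates σ(W) cheaply with power iteration, reusing the singular-vector estimate across training steps. Below is a minimal sketch of that idea (my own illustration; `W` and `u` are placeholder tensors, not the paper's code):

```python
import torch
import torch.nn.functional as F

def spectrally_normalize(W, u, n_power_iterations=1):
    """Estimate sigma(W) by power iteration and return W / sigma(W)."""
    W_mat = W.reshape(W.size(0), -1)                # flatten conv kernels to 2-D
    with torch.no_grad():
        for _ in range(n_power_iterations):
            v = F.normalize(W_mat.t() @ u, dim=0)   # right singular vector estimate
            u = F.normalize(W_mat @ v, dim=0)       # left singular vector estimate
    sigma = torch.dot(u, W_mat @ v)                 # approx. largest singular value
    return W / sigma, u

# Example: normalize a randomly initialized linear layer's weight.
layer = torch.nn.Linear(128, 64)
u = F.normalize(torch.randn(64), dim=0)
W_sn, u = spectrally_normalize(layer.weight.data, u)
print(torch.linalg.matrix_norm(W_sn, ord=2))        # spectral norm, close to 1.0
```

PyTorch also ships `torch.nn.utils.spectral_norm`, which applies the same running power-iteration normalization to a module's weight.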


Experiments

$$\begin{aligned}\beta_1,\beta_2:&\text{ momentum parameters for Adam}\\\alpha:&\text{ learning rate}\\n_{dis}:&\text{ the number of discriminator updates}\\&\text{ per generator update}\end{aligned}$$

Experiments

| Model | Loss | Architecture | Does it work? |
| --- | --- | --- | --- |
| DCGAN (in paper) * | KL | TransConv | works |
| WGAN (in paper) | Wasserstein | TransConv | works |
| WGAN-GP * | Wasserstein | TransConv | works |
| WGAN-GP (in paper) * | Wasserstein | ResNet | works |
| SN-GAN * | Wasserstein | TransConv | works |
| SN-GAN (in paper) * | Hinge loss | ResNet | works |
| SN-GAN * | Wasserstein | ResNet | does NOT work |

Experiments

* means I have tested it myself.

source: github

SN-GAN (hinge loss + ResNet)

SN-GAN (Wasserstein distance + ResNet)

SN-GAN (Wasserstein distance + TransConv)


References

[3] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).

[4] Berthelot, David, Thomas Schumm, and Luke Metz. "BEGAN: Boundary Equilibrium Generative Adversarial Networks." arXiv preprint arXiv:1703.10717 (2017).

[5] Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. "Improved Training of Wasserstein GANs." NIPS 2017.

[6] Miyato, Takeru, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. "Spectral Normalization for Generative Adversarial Networks." ICLR 2018.

Questions

  1. Why is the training time of SN-GAN less than that of WGAN-GP?
  2. If mode collapse occurs, what do the generated images look like?

Spectral Normalization for Generative Adversarial Networks

By w86763777
