Spectral Normalization for Generative Adversarial Networks (SN-GAN)
ICLR 2018
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida
Wasserstein GAN (WGAN)
Spectral Normalization GAN (SN-GAN)
Experiments
Questions
$$\begin{aligned}\min_G\max_D\mathcal{L}&=\min_G\max_D\mathbb{E}_{x\sim p_r}[log(D(x))]+\mathbb{E}_{x\sim p_g}[{log(1-D(x))}]\\&=\min_G 2JS(\mathbb{P}_r|\mathbb{P}_g)-2log2\end{aligned}$$
note that
$$JS(\mathbb{P}_r|\mathbb{P}_g)=\frac{1}{2}KL(\mathbb{P}_r|\frac{\mathbb{P}_r+\mathbb{P}_g}{2})+\frac{1}{2}KL(\mathbb{P}_g|\frac{\mathbb{P}_r+\mathbb{P}_g}{2})$$
source: Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).
$$\begin{aligned}\min_G\max_D\mathcal{L}&=\min_G\max_D\mathbb{E}_{x\sim p_r}[log(D(x))]+\mathbb{E}_{x\sim p_g}[{log(1-D(x))}]\\&=\min_G 2JS(\mathbb{P}_r|\mathbb{P}_g)-2log2\end{aligned}$$
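For completeness, the $2JS(\cdot)-2log2$ value comes from plugging the optimal discriminator into the objective (the standard step from the original GAN analysis):

$$D^*(x)=\frac{p_r(x)}{p_r(x)+p_g(x)},\qquad\mathcal{L}(G,D^*)=KL\Big(\mathbb{P}_r\Big|\frac{\mathbb{P}_r+\mathbb{P}_g}{2}\Big)+KL\Big(\mathbb{P}_g\Big|\frac{\mathbb{P}_r+\mathbb{P}_g}{2}\Big)-2log2=2JS(\mathbb{P}_r|\mathbb{P}_g)-2log2$$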
To avoid vanishing gradients for the generator, change
$$\min_G\mathcal{L}_G=\min_G\mathbb{E}_{x\sim p_G}[{log(1-D(x))}]$$
to
$$\min_G\mathcal{L}_G=\min_G\mathbb{E}_{x\sim p_G}[-log(D(x))]$$
then
$$\mathcal{L}_G=KL(\mathbb{P}_g|\mathbb{P}_r)-2JS(\mathbb{P}_r|\mathbb{P}_g)$$
$$\rightarrow\ \mathbb{P}_g\text{ tends to collapse onto a small region (mode collapse), since }KL(\mathbb{P}_g|\mathbb{P}_r)\text{ penalizes unrealistic samples far more than missing modes}$$
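A minimal PyTorch sketch of the two generator objectives above, the saturating $log(1-D)$ loss and the $-log D$ trick (function names are mine, not from the paper):

```python
import torch
import torch.nn.functional as F

def generator_loss_saturating(d_fake_logits):
    # E[log(1 - D(G(z)))]: gradients vanish once D confidently rejects fakes
    return torch.log(1.0 - torch.sigmoid(d_fake_logits) + 1e-8).mean()

def generator_loss_nonsaturating(d_fake_logits):
    # E[-log D(G(z))]: the "-log D" trick, keeps gradients alive early in training
    return F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
```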
Wasserstein Distance:
$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_{\theta})}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert]$$
source: https://vincentherrmann.github.io/blog/wasserstein/
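As a quick numeric sanity check of this definition, the 1-D Wasserstein distance between two empirical samples can be computed directly (the scipy usage below is an illustration, not part of the original slides):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two empirical 1-D distributions: samples from N(0, 1) and N(3, 1)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
y = rng.normal(3.0, 1.0, size=10_000)

# For two Gaussians with equal variance, W1 equals the distance between the means (= 3)
print(wasserstein_distance(x, y))  # ~3.0
```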
Wasserstein Distance:
$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\inf_{\gamma\in\Pi(\mathbb{P}_r, \mathbb{P}_{\theta})}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert]$$
$$\text{If }G\text{ is continuous in }\theta\text{, then }W(\mathbb{P}_r, \mathbb{P}_{\theta})\text{ is continuous in }\theta\text{ (unlike }JS(\mathbb{P}_r|\mathbb{P}_g)\text{)}$$
Dual representation:
$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\sup_{\Vert f\Vert_{Lip}\le 1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_{\theta}}[f(x)]$$
$$\text{where the supremum is over all 1-Lipschitz functions }f:X\rightarrow\mathbb{R}$$
Wasserstein distance:
$$W(\mathbb{P}_r, \mathbb{P}_{\theta})=\sup_{\Vert f\Vert_{Lip}\le 1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_{\theta}}[f(x)]$$
$$\text{Use a neural network (the discriminator / critic) to approximate }f$$
$$W(\mathbb{P}_r, \mathbb{P}_{g})=\max_D\mathbb{E}_{x\sim\mathbb{P}_r}[D(x)]-\mathbb{E}_{x\sim\mathbb{P}_{g}}[D(x)]$$
$$\text{What remains is to constrain the Lipschitz constant of }D\text{:}$$
$$\Vert D\Vert_{Lip}\le1$$
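A minimal PyTorch sketch of this max-difference objective, written as losses to minimize (the critic/generator names are my own illustration):

```python
import torch

def critic_loss(critic, real, fake):
    # Minimize the negative of E[D(real)] - E[D(fake)] (i.e. maximize the difference)
    return -(critic(real).mean() - critic(fake).mean())

def generator_loss(critic, fake):
    # Generator tries to raise the critic's score on generated samples
    return -critic(fake).mean()
```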
Gradient penalty (WGAN-GP), with $\hat{x}$ sampled along lines between real and generated samples:
$$\mathcal{L}_{GP}=\lambda\,\mathbb{E}_{\hat{x}}\big[(\Vert \nabla_{\hat{x}}D(\hat{x})\Vert_2-1)^2\big]$$
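A sketch of that penalty term in PyTorch (assumes a critic module and real/fake batches of the same shape; not code from the paper):

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # Interpolate between real and generated samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=x_hat, create_graph=True)
    # Penalize deviation of the per-sample gradient norm from 1
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```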
Weight clipping (original WGAN), for the discriminator:
$$\begin{aligned}\theta_D&\leftarrow \theta_D+\alpha\cdot \nabla_{\theta_D}\mathcal{L}_D\\\theta_D&\leftarrow clip(\theta_D,-c,c)\end{aligned}$$
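The clipping step as a short PyTorch sketch (c = 0.01 is the WGAN paper's default; the critic name is mine):

```python
import torch

@torch.no_grad()
def clip_weights(critic, c=0.01):
    # Crude way to keep the critic (roughly) Lipschitz: clamp every parameter to [-c, c]
    for p in critic.parameters():
        p.clamp_(-c, c)
```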
$$\begin{aligned}&W^l_{SN}\leftarrow W^l/\sigma(W^l)\\&W^l\leftarrow W^l-\alpha\cdot \nabla_{W^l}\mathcal{L}(W^l_{SN},D)\\&\sigma(\cdot)\text{: spectral norm}\end{aligned}$$
$$\begin{aligned}&W^l_{SN}=W^l/\sigma(W^l)\\&\sigma(\cdot)\text{ denotes the spectral norm, i.e. the largest singular value}\end{aligned}$$
$$\Vert D_{\theta}\Vert _{Lip}\le1$$
$$\begin{aligned}\text{Let }D(x)&=a_L(W^L(a_{L-1}(W^{L-1}(\cdots a_1(W^1 x)\cdots))))\\\Vert D\Vert_{Lip}&\le\Vert a_L\Vert_{Lip}\cdot\Vert W^L\Vert_{Lip}\cdot\Vert a_{L-1}\Vert_{Lip}\cdot\Vert W^{L-1}\Vert_{Lip}\cdots\Vert a_1\Vert_{Lip}\cdot\Vert W^1\Vert_{Lip}\\\Vert W^l\Vert_{Lip}&=\sigma(W^l),\qquad\sigma(W^l_{SN})=\sigma(W^l/\sigma(W^l))=1\end{aligned}$$
$$\text{With 1-Lipschitz activations (e.g. ReLU, leaky ReLU), spectral normalization gives }\Vert D\Vert_{Lip}\le1$$
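A minimal sketch of spectral normalization via one power-iteration step per forward pass, roughly what `torch.nn.utils.spectral_norm` does under the hood (a simplified illustration for a plain linear layer, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SNLinear(nn.Module):
    """Linear layer whose weight is divided by its (estimated) spectral norm."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Persistent estimate of the dominant left singular vector for power iteration
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x):
        w = self.weight
        with torch.no_grad():
            # One power-iteration step: refine u towards the top left singular vector of W
            v = F.normalize(w.t() @ self.u, dim=0)
            self.u = F.normalize(w @ v, dim=0)
        sigma = self.u @ w @ v          # approximate largest singular value sigma(W)
        return F.linear(x, w / sigma, self.bias)
```

In practice one would simply wrap existing layers with `torch.nn.utils.spectral_norm(nn.Linear(...))` rather than writing the layer by hand.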
$$\begin{aligned}\beta_1,\beta_2:&\text{ Adam hyperparameters}\\\alpha:&\text{ learning rate}\\n_{dis}:&\text{ number of discriminator updates per generator update}\end{aligned}$$
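Putting the pieces together, a schematic SN-GAN training loop with $n_{dis}$ critic steps per generator step; the toy data, models, and hyperparameter values below are only illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

# Toy 1-D data and models just to make the loop self-contained;
# in the paper these are TransConv / ResNet image models.
z_dim, x_dim, batch = 16, 2, 64
generator = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
discriminator = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(x_dim, 64)), nn.LeakyReLU(0.1),
    nn.utils.spectral_norm(nn.Linear(64, 1)))

# alpha, beta1, beta2, n_dis as in the slide above (example values)
alpha, beta1, beta2, n_dis = 2e-4, 0.0, 0.9, 5
opt_g = torch.optim.Adam(generator.parameters(), lr=alpha, betas=(beta1, beta2))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=alpha, betas=(beta1, beta2))

for step in range(1000):
    real = torch.randn(batch, x_dim) + 3.0          # stand-in "real" data

    # --- discriminator (critic) update, Wasserstein-style loss ---
    fake = generator(torch.randn(batch, z_dim)).detach()
    d_loss = -(discriminator(real).mean() - discriminator(fake).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- one generator update every n_dis critic updates ---
    if step % n_dis == 0:
        fake = generator(torch.randn(batch, z_dim))
        g_loss = -discriminator(fake).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```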
| Model Name | Loss | Architecture | Does it work? |
|---|---|---|---|
| DCGAN (in paper) * | KL | TransConv | Works |
| WGAN (in paper) | Wasserstein | TransConv | Works |
| WGAN-GP * | Wasserstein | TransConv | Works |
| WGAN-GP (in paper) * | Wasserstein | ResNet | Works |
| SN-GAN * | Wasserstein | TransConv | Works |
| SN-GAN (in paper) * | Hinge loss | ResNet | Works |
| SN-GAN * | Wasserstein | ResNet | DOES NOT WORK |
* means I tested this configuration myself.
source: github
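The hinge loss referenced in the table (the objective the SN-GAN paper uses for its ResNet models) looks like this in PyTorch, a minimal sketch:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real_scores, d_fake_scores):
    # E[max(0, 1 - D(x_real))] + E[max(0, 1 + D(G(z)))]
    return F.relu(1.0 - d_real_scores).mean() + F.relu(1.0 + d_fake_scores).mean()

def g_hinge_loss(d_fake_scores):
    # Generator simply maximizes the critic score on generated samples
    return -d_fake_scores.mean()
```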
[Figure: results for SN-GAN with Hinge loss + ResNet, Wasserstein distance + ResNet, and Wasserstein distance + TransConv]