presented by Artem Sobolev

## Generative Modeling

• Given $$N$$ samples from the true data-generating process $$p_\text{data}(x)$$ and a model $$p_\theta(x)$$
• We'd like to find $$\hat\theta_N$$ s.t. $$p_{\hat\theta_N}(x)$$ is as close to $$p_\text{data}(x)$$ as possible
• Typically we measure closeness by means of some divergence
  • KL divergence
    • MLE: VAEs, Flow-based models
  • JS divergence, f-divergences, Wasserstein distance
    • GANs

## Method of Moments

• Old and well-studied alternative to the Maximum Likelihood Estimation approach
• Moment function: $m(\theta) = \mathbb{E}_{p_\theta(x)} \Phi(x)$, where $$\Phi(x) \in \mathbb{R}^D$$ is some feature extractor
• Let $$\hat\theta_N$$ be s.t. $m(\hat\theta_N) = \frac{1}{N} \sum_{n=1}^N \Phi(x_n)$
• If $$p_\text{data}(x) = p_{\theta^\star}(x)$$ for some $$\theta^\star$$, then this estimator is consistent
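
As a warm-up (my own example, not from the talk), here is a minimal sketch of the classical method of moments for a univariate Gaussian with $$\Phi(x) = [x, x^2]^T$$, where $$m(\theta)$$ happens to be invertible in closed form:

```python
import numpy as np

# Method of moments for a Gaussian N(mu, sigma^2) with Phi(x) = [x, x^2]:
# m(theta) = E[Phi(x)] = [mu, mu^2 + sigma^2] is invertible in closed form.
rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5
x = rng.normal(mu_true, sigma_true, size=10_000)  # N samples from p_data

m1, m2 = x.mean(), (x ** 2).mean()  # empirical moments (1/N) sum_n Phi(x_n)
mu_hat = m1                         # invert m(theta)
sigma_hat = np.sqrt(m2 - m1 ** 2)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")  # ~2.0, ~1.5
```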

## Asymptotic Normality

• Theorem: if $$m(\theta)$$ is a one-to-one mapping, continuously differentiable at $$\theta^\star$$ with non-singular derivative $$G = \nabla_\theta \mathbb{E}_{p_\theta(x)} \Phi(x) \big|_{\theta=\theta^\star}$$, and $$\mathbb{E}_{p_\theta(x)} \|\Phi(x)\|^2 < \infty$$, then $$\hat\theta_N$$ exists with probability tending to 1 and satisfies $\sqrt{N}(\hat\theta_N - \theta^\star) \to \mathcal{N}\left(0, G^{-1} \Sigma G^{-T}\right), \quad\quad \Sigma = \text{Cov}_{p_\text{data}(x)} \; \Phi(x)$
• Implications: some $$\Phi$$ are better than others due to lower variance, and thus lower sample complexity (see the simulation after this list)
• Invertibility is too restrictive; it can be relaxed to identifiability: $$m(\theta) = \mathbb{E}_{p_\text{data}(x)} \Phi(x)$$ iff $$\theta = \theta^\star$$
• This is still hard to verify; instead, assume $$G$$ is full rank and that there are more moments than model parameters
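
A small simulation (mine, not from the slides) makes the variance claim concrete: two choices of $$\Phi$$ for estimating the rate $$\lambda$$ of an exponential distribution give estimators with different asymptotic variance, matching the sandwich formula above:

```python
import numpy as np

# Compare two moment estimators of the rate lambda of Exponential(lambda):
#   Phi(x) = x    ->  lambda_hat = 1 / mean(x),         sandwich variance = lambda^2
#   Phi(x) = x^2  ->  lambda_hat = sqrt(2 / mean(x^2)), sandwich variance = 1.25 lambda^2
rng = np.random.default_rng(0)
lam, N, reps = 2.0, 2_000, 5_000

est_mean, est_sq = [], []
for _ in range(reps):
    x = rng.exponential(1.0 / lam, size=N)
    est_mean.append(1.0 / x.mean())                # inverts m(lam) = 1/lam
    est_sq.append(np.sqrt(2.0 / (x ** 2).mean()))  # inverts m(lam) = 2/lam^2

# Variance of sqrt(N) (theta_hat - theta*); theory predicts 4.0 vs 5.0 here
print("Phi(x)=x:  ", N * np.var(est_mean))
print("Phi(x)=x^2:", N * np.var(est_sq))
```

With $$\lambda = 2$$ the sandwich formula gives $$\lambda^2 = 4$$ for $$\Phi(x) = x$$ and $$1.25\lambda^2 = 5$$ for $$\Phi(x) = x^2$$, so the first moment is the strictly better feature here.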

## Moment Networks

• It's hard to come up with more moments than there are parameters in $$\theta$$
• The authors propose to use a moment network $$f_\phi(x)$$ and let $\Phi(x) = [\nabla_\phi f_\phi(x), x, h_1(x), \dots, h_{L-1}(x)]^T$ where $$h_l(x)$$ are the activations of the $$l$$-th layer
• Since the moment function is no longer invertible, the generator $$g_\theta$$ is trained by minimizing $\mathcal{L}^G(\theta) = \frac{1}{2} \left\| \frac{1}{N} \sum_{n=1}^N \Phi(x_n) - \mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon)) \right\|_2^2$
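
A minimal sketch of this objective, with a toy 1-D linear generator and hand-crafted features standing in for the moment network (all names are hypothetical):

```python
import numpy as np

# L^G(theta): match empirical data moments against Monte Carlo moments of
# g_theta(eps). Toy stand-ins: a linear generator and hand-crafted features.
rng = np.random.default_rng(0)

def phi(x):
    return np.stack([x, x ** 2], axis=-1)  # feature extractor Phi(x) in R^2

def g(theta, eps):
    mu, log_sigma = theta                  # g_theta(eps) = mu + sigma * eps
    return mu + np.exp(log_sigma) * eps

def loss_G(theta, x_data, n_mc=4_096):
    eps = rng.standard_normal(n_mc)
    data_moments = phi(x_data).mean(axis=0)          # (1/N) sum_n Phi(x_n)
    model_moments = phi(g(theta, eps)).mean(axis=0)  # MC estimate of E Phi(g(eps))
    return 0.5 * np.sum((data_moments - model_moments) ** 2)

x_data = rng.normal(2.0, 1.5, size=10_000)
print(loss_G(np.array([0.0, 0.0]), x_data))          # far from data: large
print(loss_G(np.array([2.0, np.log(1.5)]), x_data))  # near data: ~0
```

Here the expectation over $$p(\varepsilon)$$ is replaced by a Monte Carlo average, which the "Nice Properties" slide below argues only inflates the asymptotic variance by a constant factor.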

## Learning a Moment Network

• One could use a randomly initialized moment network $$f_\phi$$
• That, however, would lead to a large asymptotic variance of $$\hat\theta_N$$
• Ideally our $$\phi$$ minimizes this asymptotic variance, which is approximately equivalent to maximizing $\| \mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon)) - \mathbb{E}_{p_\text{data}(x)} \Phi(x) \|^2$
• However, the authors claim that this maximization makes the moments correlated and breaks the consistency of $$\hat\theta_N$$
• Instead, the authors follow prior work and train $$f_\phi$$ as a binary classifier: $\mathcal{L}^M(\phi) = \mathbb{E}_{p_\text{data}(x)} \log D_\phi(x) + \mathbb{E}_{p(\varepsilon)} \log (1-D_\phi(g_\theta(\varepsilon))) + \lambda R(x)$ where $$D_\phi(x) = \sigma(f_\phi(x))$$, and $$R$$ regularizes $$\nabla_\phi f_\phi(x)$$
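
A toy sketch of this classifier objective (mine, not the paper's setup): $$f_\phi$$ is linear in fixed features so the logistic gradient can be written by hand, and the penalty $$R$$ is dropped because $$\nabla_\phi f_\phi(x)$$ does not depend on $$\phi$$ for a linear model:

```python
import numpy as np

# Train the moment network as a binary classifier between data and generator
# samples by gradient ascent on L^M (without the lambda * R term).
rng = np.random.default_rng(0)

def sigmoid(z):  # numerically stable logistic function
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def psi(x):      # fixed features; f_phi(x) = phi . psi(x)
    return np.stack([x, x ** 2, np.ones_like(x)], axis=-1)

x_data = rng.normal(2.0, 1.5, size=5_000)  # samples from p_data
x_gen = rng.normal(0.0, 1.0, size=5_000)   # samples from g_theta(eps)

phi_params = np.zeros(3)
for _ in range(200):
    d_real = sigmoid(psi(x_data) @ phi_params)
    d_fake = sigmoid(psi(x_gen) @ phi_params)
    # Ascend L^M = E_data log D(x) + E_gen log(1 - D(g(eps)))
    grad = ((1 - d_real)[:, None] * psi(x_data)).mean(0) \
         - (d_fake[:, None] * psi(x_gen)).mean(0)
    phi_params += 0.5 * grad

acc = 0.5 * ((sigmoid(psi(x_data) @ phi_params) > 0.5).mean()
           + (sigmoid(psi(x_gen) @ phi_params) < 0.5).mean())
print("classifier accuracy:", acc)
```

The gradients $$\nabla_\phi f_\phi(x)$$ of the trained classifier are then exactly what serve as learned moments inside $$\Phi$$ above.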

## Nice Properties

• Asymptotic normality still holds, assuming $$\mathcal{L}^G(\theta)$$ is asymptotically quadratic: $\sqrt{N} (\hat\theta_N - \theta^\star) \to \mathcal{N}(0, V_{SE})$
• Even if we can't compute $$\mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon))$$ analytically and have to estimate it with Monte Carlo, this only increases the asymptotic variance by a constant factor
• The method works for any $$\phi$$; we only retrain it occasionally to improve sample efficiency (see the training-loop sketch after this list)
• The regularizer that controls the asymptotic variance can be related to the gradient penalty in WGANs
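
Putting the pieces together, a hypothetical end-to-end sketch of the training loop as I read the slides: many generator steps on $$\mathcal{L}^G$$, with the moment network only retrained occasionally (here the features are simply kept fixed, and gradients are finite differences for brevity):

```python
import numpy as np

# Fit a toy 1-D generator by descending L^G(theta) with Monte Carlo moments.
rng = np.random.default_rng(0)

def num_grad(f, theta, h=1e-4):
    # Finite-difference gradient; fine for a 2-parameter toy model.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

phi_feats = lambda x: np.stack([x, x ** 2], axis=-1)  # stand-in for f_phi
x_data = rng.normal(2.0, 1.5, size=10_000)
data_m = phi_feats(x_data).mean(0)  # empirical data moments

def loss_G(theta, eps):
    x = theta[0] + np.exp(theta[1]) * eps  # g_theta(eps) = mu + sigma * eps
    return 0.5 * np.sum((data_m - phi_feats(x).mean(0)) ** 2)

theta = np.array([0.0, 0.0])
for step in range(5_000):
    eps = rng.standard_normal(1_024)
    theta -= 0.02 * num_grad(lambda t: loss_G(t, eps), theta)
    # (every K steps one would update f_phi with the classifier loss L^M)

print("mu, sigma =", theta[0], np.exp(theta[1]))  # should approach 2.0, 1.5
```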
