deck

Method of Moments

Old and well-studied alternative to the Maximum Likelihood Estimation approach
Moment function: $m(\theta) = \mathbb{E}_{p_\theta(x)} \Phi(x)$ where $\Phi(x) \in \mathbb{R}^D$ -- some feature extractor
Let $\hat\theta_N$ be s.t. $m(\hat\theta_N) = \frac{1}{N} \sum_{n=1}^N \Phi(x_n)$
If $p_\text{data}(x) = p_{\theta^\star}(x)$ for some $\theta^\star$, then this estimator is consistent
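As a minimal sketch of the estimator above, consider a Gaussian model with $\Phi(x) = [x, x^2]$. The model, the function name `fit_gaussian_by_moments`, and the parameter values are assumptions chosen for illustration; they are not from the source. Here $m(\mu, \sigma) = [\mu, \mu^2 + \sigma^2]$ can be inverted by hand, so $\hat\theta_N$ has a closed form.

```python
import random
import statistics

def fit_gaussian_by_moments(xs):
    """Method-of-moments fit of N(mu, sigma^2) using Phi(x) = [x, x^2].

    The moment function is m(mu, sigma) = [mu, mu^2 + sigma^2]; setting it
    equal to the empirical moments and solving gives the estimates below.
    """
    m1 = statistics.fmean(xs)                     # empirical mean of x
    m2 = statistics.fmean([x * x for x in xs])    # empirical mean of x^2
    mu_hat = m1
    # Clamp at zero to guard against tiny negative values from rounding.
    sigma_hat = max(m2 - m1 * m1, 0.0) ** 0.5
    return mu_hat, sigma_hat

# Illustrative ground truth (hypothetical values).
rng = random.Random(0)
true_mu, true_sigma = 2.0, 1.5
data = [rng.gauss(true_mu, true_sigma) for _ in range(100_000)]

mu_hat, sigma_hat = fit_gaussian_by_moments(data)
print(mu_hat, sigma_hat)
```

With a well-specified model, the printed estimates should approach the true parameters as $N$ grows, which is what consistency means here.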

Theorem: if $m(\theta)$ is a one-to-one mapping, and is continuously differentiable at $\theta^\star$ with non-singular derivative $G = \nabla_\theta \mathbb{E}_{p_\theta(x)} \Phi(x) \big|_{\theta = \theta^\star}$, then, assuming $\mathbb{E}_{p_\theta(x)} \|\Phi(x)\|^2 < \infty$, $\hat\theta_N$ exists with probability tending to 1, and satisfies $\sqrt{N}(\hat\theta_N - \theta^\star) \to \mathcal{N}\left(0, G^{-1} \Sigma G^{-T}\right), \quad\quad \Sigma = \text{Cov}_{p_\text{data}(x)} \; \Phi(x)$
Implications: some $\Phi$ are better than others due to lower variance, and thus lower sample complexity
Invertibility is too restrictive; it can be relaxed to identifiability: $m(\theta) = \mathbb{E}_{p_\text{data}(x)} \Phi(x)$ iff $\theta = \theta^\star$
This is still hard to verify, so instead assume that $G$ has full rank and that there are more moments than model parameters
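To make the claim that some $\Phi$ are better than others concrete, here is a worked example chosen for illustration (it is not from the source). Assume $p_\theta = \mathcal{N}(\theta, 1)$ with scalar $\theta$, take $G$ to be the derivative of $m$ at $\theta^\star$, and compare two one-dimensional feature maps, both of which give a one-to-one $m$:

$$
\begin{aligned}
\Phi_1(x) = x: &\quad m(\theta) = \theta, \quad G = 1, \quad \Sigma = 1, \quad G^{-1} \Sigma G^{-T} = 1 \\
\Phi_2(x) = x^3: &\quad m(\theta) = \theta^3 + 3\theta, \quad G = 3\left((\theta^\star)^2 + 1\right), \quad \Sigma = 9(\theta^\star)^4 + 36(\theta^\star)^2 + 15 \\
&\quad G^{-1} \Sigma G^{-T} = \frac{9(\theta^\star)^4 + 36(\theta^\star)^2 + 15}{9\left((\theta^\star)^2 + 1\right)^2} = 1 + \frac{18(\theta^\star)^2 + 6}{9\left((\theta^\star)^2 + 1\right)^2} > 1
\end{aligned}
$$

Both estimators are consistent, but matching the third moment always has the larger asymptotic variance in this toy model (for example $5/3$ versus $1$ at $\theta^\star = 0$), so it needs more samples for the same accuracy.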

It's hard to hand-craft more moments than the number of parameters in $\theta$
Authors propose to use a moment network $f_\phi(x)$ and let $\Phi(x) = [\nabla_\phi f_\phi(x), x, h_1(x), \dots, h_{L-1}(x)]^T$ where $h_l(x)$ are the activations of the $l$-th layer
Since the moment function is not invertible anymore, the generator is trained by minimizing $\mathcal{L}^G(\theta) = \frac{1}{2} \left\| \frac{1}{N} \sum_{n=1}^N \Phi(x_n) - \mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon)) \right\|_2^2$
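The feature construction and the generator loss above can be sketched on a toy problem. Everything concrete here is an assumption for illustration and not the authors' architecture: a one-hidden-unit moment network $f_\phi(x) = v \tanh(w x + b)$, a scalar generator $g_\theta(\varepsilon) = \theta_0 + \theta_1 \varepsilon$, and a Monte Carlo average in place of the expectation over $p(\varepsilon)$.

```python
import math
import random

def features(x, phi):
    """Phi(x) = [grad_phi f_phi(x), x, h_1(x)] for a toy one-hidden-unit net.

    f_phi(x) = v * tanh(w * x + b) with phi = (w, b, v); h_1 is the hidden
    activation. In this tiny net d f / d v equals h_1, so one feature is
    duplicated; a deeper network would not have this redundancy.
    """
    w, b, v = phi
    h = math.tanh(w * x + b)
    dh = 1.0 - h * h                   # derivative of tanh at w * x + b
    grad = [v * dh * x, v * dh, h]     # d f / d w, d f / d b, d f / d v
    return grad + [x, h]

def mean_features(xs, phi):
    """Average the feature vector over a list of scalar samples."""
    total = [0.0] * 5
    for x in xs:
        for i, value in enumerate(features(x, phi)):
            total[i] += value
    return [t / len(xs) for t in total]

def generator_loss(theta, data, noise, phi):
    """0.5 * || mean Phi(data) - mean Phi(g_theta(eps)) ||^2."""
    shift, scale = theta
    fake = [shift + scale * eps for eps in noise]   # g_theta(eps)
    real_m = mean_features(data, phi)
    fake_m = mean_features(fake, phi)
    return 0.5 * sum((a - b) ** 2 for a, b in zip(real_m, fake_m))

rng = random.Random(0)
data = [rng.gauss(1.0, 0.5) for _ in range(20_000)]    # stand-in for p_data
noise = [rng.gauss(0.0, 1.0) for _ in range(20_000)]   # samples from p(eps)
phi = (0.7, -0.2, 1.3)   # arbitrary fixed moment-network weights

loss_at_truth = generator_loss((1.0, 0.5), data, noise, phi)
loss_elsewhere = generator_loss((0.0, 1.0), data, noise, phi)
print(loss_at_truth, loss_elsewhere)
```

The loss is close to zero at the generator parameters that reproduce the data distribution and clearly larger elsewhere, which is the signal a gradient-based optimizer over $\theta$ would follow.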

One could use a randomly initialized moment network $f_\phi$
- That would have large asymptotic variance of $\hat\theta_N$
Ideally our $\phi$ minimizes this asymptotic variance, which is approximately equivalent to maximizing $\| \mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon)) - \mathbb{E}_{p_\text{data}(x)} \Phi(x) \|^2$
- However, the authors claim that this maximization makes the moments correlated, and breaks consistency of $\hat\theta_N$
Authors follow prior work and train the moment network as a binary classifier with objective $\mathcal{L}^M(\phi) = \mathbb{E}_{p_\text{data}(x)} \log D_\phi(x) + \mathbb{E}_{p(\varepsilon)} \log (1-D_\phi(g_\theta(\varepsilon))) + \lambda R(x)$ where $D_\phi(x) = \sigma(f_\phi(x))$, and $R$ regularizes $\nabla_\phi f_\phi(x)$

Asymptotic normality, assuming $\mathcal{L}^G(\theta)$ is asymptotically quadratic: $\sqrt{N} (\hat\theta_N - \theta^\star) \to \mathcal{N}(0, V_{SE})$
Even if we can't compute $\mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon))$ analytically and estimate it with Monte Carlo instead, this only increases the asymptotic variance by a constant factor
The method works for any $\phi$; we only train it occasionally to increase sample efficiency
The regularizer that controls the asymptotic variance can be related to the gradient penalty in WGANs
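The constant-factor claim for the Monte Carlo estimate can be motivated by a short variance calculation. This is a sketch under assumptions not stated in the notes: the model is well specified ($p_\text{data} = p_{\theta^\star}$), the expectation is replaced by an average over $KN$ independent draws $\varepsilon_j \sim p(\varepsilon)$ that are independent of the data, and $K$ is a ratio introduced here for illustration.

$$
\text{Cov}\left[ \frac{1}{N} \sum_{n=1}^N \Phi(x_n) - \frac{1}{KN} \sum_{j=1}^{KN} \Phi(g_{\theta^\star}(\varepsilon_j)) \right] = \frac{\Sigma}{N} + \frac{\Sigma}{KN} = \left(1 + \frac{1}{K}\right) \frac{\Sigma}{N}
$$

Since a sandwich form such as $G^{-1} \Sigma G^{-T}$ is linear in $\Sigma$, the asymptotic variance is inflated by the same factor $1 + 1/K$ under these assumptions, which shrinks toward 1 as more generator samples are used.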