An old and well-studied alternative to the Maximum Likelihood Estimation approach
Moment function: $m(\theta) = \mathbb{E}_{p_\theta(x)} \Phi(x)$, where $\Phi(x) \in \mathbb{R}^D$ is some feature extractor
Let $\hat\theta_N$ be such that $m(\hat\theta_N) = \frac{1}{N} \sum_{n=1}^N \Phi(x_n)$
If $p_{\text{data}}(x) = p_{\theta^\star}(x)$ for some $\theta^\star$, then this estimator is consistent
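For intuition, a minimal sketch of the estimator in a case where $m$ inverts in closed form (a 1D Gaussian with $\Phi(x) = [x, x^2]^T$; the model and data here are my own toy choices, not from the paper):

```python
import numpy as np

# Minimal sketch: method of moments for a 1D Gaussian.
# Phi(x) = [x, x^2], so m(theta) = [mu, mu^2 + sigma^2], and the moment
# equations m(theta_hat) = (1/N) sum_n Phi(x_n) invert in closed form.

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # "data" from p_data

m1, m2 = x.mean(), (x ** 2).mean()  # empirical moments (1/N) sum_n Phi(x_n)

mu_hat = m1
sigma_hat = np.sqrt(m2 - m1 ** 2)

print(mu_hat, sigma_hat)  # consistent: approaches (2.0, 1.5) as N grows
```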
Asymptotic Normality
Theorem: if $m(\theta)$ is a one-to-one mapping, continuously differentiable at $\theta^\star$ with non-singular derivative $G = \nabla_\theta \mathbb{E}_{p_\theta(x)} \Phi(x) \big|_{\theta = \theta^\star}$, and assuming $\mathbb{E}_{p_\theta(x)} \|\Phi(x)\|^2 < \infty$, then $\hat\theta_N$ exists with probability tending to 1 and satisfies $\sqrt{N} (\hat\theta_N - \theta^\star) \to \mathcal{N}\left(0, G^{-1} \Sigma G^{-T}\right)$, where $\Sigma = \mathrm{Cov}_{p_{\text{data}}(x)} \Phi(x)$
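To make the theorem concrete, a worked scalar example (mine, not from the notes): take the exponential model $p_\lambda(x) = \lambda e^{-\lambda x}$ with $\Phi(x) = x$. Then

$$m(\lambda) = \frac{1}{\lambda}, \qquad G = \frac{\partial m}{\partial \lambda} = -\frac{1}{\lambda^2}, \qquad \Sigma = \mathrm{Var}\, x = \frac{1}{\lambda^2},$$

so $G^{-1} \Sigma G^{-T} = \lambda^2$ and $\sqrt{N}(\hat\lambda_N - \lambda^\star) \to \mathcal{N}(0, {\lambda^\star}^2)$; here $\hat\lambda_N = 1/\bar{x}$ is just the inverse of the sample mean.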
Implications: some $\Phi$ are better than others, since they give lower asymptotic variance and thus lower sample complexity
Invertibility is too restrictive; it can be relaxed to identifiability: $m(\theta) = \mathbb{E}_{p_{\text{data}}(x)} \Phi(x)$ iff $\theta = \theta^\star$
Still hard to verify, so instead assume $G$ is full rank and that there are more moments than model parameters ($D \ge \dim \theta$); a numerical check is sketched below
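A hypothetical way to probe these assumptions numerically: estimate $G$ by Monte Carlo with common random numbers plus central finite differences, then inspect its rank. The model and features below are stand-ins of my own choosing:

```python
import numpy as np

# Estimate G = grad_theta E_{p_theta} Phi(x) by finite differences and
# verify it has full rank. Toy model: 1D Gaussian, theta = (mu, log_sigma),
# D = 3 moments, so D >= dim(theta).

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)  # shared noise makes mean_phi smooth in theta

def phi(x):
    return np.stack([x, x ** 2, np.abs(x)], axis=-1)

def mean_phi(theta):
    mu, log_sigma = theta
    x = mu + np.exp(log_sigma) * z  # reparameterized sample from p_theta
    return phi(x).mean(axis=0)      # MC estimate of E_{p_theta} Phi(x)

def estimate_G(theta, eps=1e-3):
    G = np.zeros((3, theta.size))   # shape: D x dim(theta)
    for j in range(theta.size):
        e = np.zeros(theta.size)
        e[j] = eps
        G[:, j] = (mean_phi(theta + e) - mean_phi(theta - e)) / (2 * eps)
    return G

G = estimate_G(np.array([0.5, 0.0]))
print(G.shape, np.linalg.matrix_rank(G))  # full rank: rank == dim(theta) == 2
```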
Moment Networks
It's hard to hand-design more moments than the number of parameters in $\theta$ when $\theta$ is a modern deep generator
Authors propose to use a moment network $f_\phi(x)$ and let $\Phi(x) = \left[\nabla_\phi f_\phi(x),\ x,\ h_1(x), \dots, h_{L-1}(x)\right]^T$, where $h_l(x)$ denotes the activations of the $l$-th layer of $f_\phi$
Since the moment function is no longer invertible, the generator is trained by minimizing $L_G(\theta) = \frac{1}{2} \left\| \frac{1}{N} \sum_{n=1}^N \Phi(x_n) - \mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon)) \right\|_2^2$
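A minimal PyTorch sketch of this generator objective, under my own assumptions about shapes and architectures (the paper's actual models are larger). One convenient fact: since $L_G$ only needs the batch average of $\Phi$, the $\nabla_\phi f_\phi$ components can be computed as the gradient of the batch-mean of $f_\phi$, because the mean of per-example gradients equals the gradient of the mean:

```python
import torch
import torch.nn as nn

# Sketch of the generator step. Phi(x) = [grad_phi f_phi(x), x, h_1(x), h_2(x)].

class MomentNet(nn.Module):
    def __init__(self, d=2, h=64):
        super().__init__()
        self.l1, self.l2, self.out = nn.Linear(d, h), nn.Linear(h, h), nn.Linear(h, 1)

    def forward(self, x):
        h1 = torch.tanh(self.l1(x))
        h2 = torch.tanh(self.l2(h1))
        return self.out(h2), (h1, h2)  # scalar head and hidden activations

f = MomentNet()
g = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))  # generator g_theta
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)

def mean_phi(batch, create_graph):
    """(1/N) sum_n Phi(batch_n), flattened into one vector."""
    out, (h1, h2) = f(batch)
    grads = torch.autograd.grad(out.mean(), list(f.parameters()),
                                create_graph=create_graph)
    parts = [gr.flatten() for gr in grads]            # grad_phi f components
    parts += [batch.mean(0), h1.mean(0), h2.mean(0)]  # raw x and activations
    return torch.cat(parts)

x_data = torch.randn(256, 2) + 3.0                      # stand-in data batch
target = mean_phi(x_data, create_graph=False).detach()  # data moments; in practice
                                                        # recompute on fresh batches
for step in range(100):
    eps = torch.randn(256, 8)                            # Monte Carlo estimate of
    model_moments = mean_phi(g(eps), create_graph=True)  # E_p(eps) Phi(g_theta(eps))
    loss_g = 0.5 * (target - model_moments).pow(2).sum()
    opt_g.zero_grad()
    loss_g.backward()  # second-order terms flow through grad_phi f via create_graph
    opt_g.step()
```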
Learning a Moment Network
One could use a randomly initialized moment network $f_\phi$
That would give $\hat\theta_N$ a large asymptotic variance
Ideally our $\phi$ minimizes this asymptotic variance, which is approximately equivalent to maximizing $\left\| \mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon)) - \mathbb{E}_{p_{\text{data}}(x)} \Phi(x) \right\|^2$
However, the authors claim that this maximization makes the moments correlated and breaks the consistency of $\hat\theta_N$
Authors follow prior work and introduce a binary classifier objective $L_M(\phi) = \mathbb{E}_{p_{\text{data}}(x)} \log D_\phi(x) + \mathbb{E}_{p(\varepsilon)} \log\left(1 - D_\phi(g_\theta(\varepsilon))\right) + \lambda R$, where $D_\phi(x) = \sigma(f_\phi(x))$ and $R$ regularizes $\nabla_\phi f_\phi(x)$
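A sketch of one moment-network update in the same toy setup; the sign conventions and the exact form of $R$ are my assumptions (here, the squared norm of the batch-averaged $\nabla_\phi f_\phi$ over the data):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One moment-network update: D_phi = sigmoid(f_phi) is trained as a GAN-style
# classifier, plus a penalty on grad_phi f_phi, which is what controls the
# asymptotic variance of the resulting estimator.

f = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
g = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))  # current g_theta
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4)
lam = 1.0  # lambda, the regularization strength

x_real = torch.randn(256, 2) + 3.0        # stand-in data batch
x_fake = g(torch.randn(256, 8)).detach()  # generator samples; g is frozen here

logit_real, logit_fake = f(x_real), f(x_fake)
# negative of E log D(x) + E log(1 - D(g(eps))), since we minimize;
# log(1 - sigmoid(z)) == logsigmoid(-z)
bce = -(F.logsigmoid(logit_real).mean() + F.logsigmoid(-logit_fake).mean())

# R: squared norm of the batch-averaged grad_phi f_phi (one simple choice;
# per-example variants are costlier)
grads = torch.autograd.grad(logit_real.mean(), list(f.parameters()),
                            create_graph=True)
penalty = sum(gr.pow(2).sum() for gr in grads)

loss_m = bce + lam * penalty
opt_f.zero_grad()
loss_m.backward()
opt_f.step()
```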
Nice Properties
Asymptotic normality, assuming $L_G(\theta)$ is asymptotically quadratic: $\sqrt{N} (\hat\theta_N - \theta^\star) \to \mathcal{N}(0, V_{SE})$
Even if we can't compute $\mathbb{E}_{p(\varepsilon)} \Phi(g_\theta(\varepsilon))$ analytically and have to estimate it with Monte Carlo, this only increases the asymptotic variance by a constant factor
The method works for any $\phi$; the moment network only needs to be retrained occasionally, to improve sample efficiency
The regularizer that controls the asymptotic variance can be related to the gradient penalty in WGANs