Artem Sobolev and Dmitry Vetrov
@art_sobolev http://artem.sobolev.name
Standard VAE [1] has a fully factorized Gaussian proposal
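That is, the proposal factorizes over the latent coordinates; in the usual notation (with $\mu(x)$, $\sigma(x)$ produced by the encoder and $d$ the latent dimension, notation mine):
$$q_\phi(z \mid x) = \prod_{i=1}^{d} \mathcal{N}\!\left(z_i \mid \mu_i(x),\, \sigma_i^2(x)\right)$$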
These two look suspiciously similar!
Given all this evidence, it's time to ask ourselves: is this all just a coincidence? Or is (1) indeed a multisample $\tau$-variational upper bound on the marginal log-density $\log q_\phi(z \mid x)$?
Theorem: for $K \in \mathbb{N}_0$ and any* $q_\phi(z, \psi \mid x)$ and $\tau_\eta(\psi \mid x, z)$, denote
$$U_K := \mathbb{E}_{q_\phi(\psi_0 \mid x, z)} \mathbb{E}_{\tau_\eta(\psi_{1:K} \mid x, z)} \log \frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(z, \psi_k \mid x)}{\tau_\eta(\psi_k \mid x, z)}$$
Then the following holds, the first statement being that $U_K \ge \log q_\phi(z \mid x)$.
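As a quick numerical sanity check, here is a minimal NumPy sketch on a hypothetical toy model (the distributional choices below are illustrative assumptions, not the setup from the slides, and conditioning on $x$ is dropped): $\psi \sim \mathcal{N}(0, 1)$, $z \mid \psi \sim \mathcal{N}(\psi, 1)$, so the marginal $q(z) = \mathcal{N}(0, 2)$ is known exactly, while $\tau(\psi \mid z)$ is a deliberately crude $\mathcal{N}(0, 1)$. The estimate of $U_K$ should stay above $\log q(z)$ and decrease towards it as $K$ grows.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical toy model (conditioning on x dropped):
#   psi ~ N(0, 1),  z | psi ~ N(psi, 1)   =>   marginal q(z) = N(0, 2),
#   posterior q(psi | z) = N(z / 2, 1 / 2), used to sample psi_0.
# Deliberately crude auxiliary proposal: tau(psi | z) = N(0, 1).

def upper_bound_estimate(z, K, n_rep=100_000):
    """Monte Carlo estimate of U_K = E log [ 1/(K+1) * sum_k q(z, psi_k) / tau(psi_k | z) ]."""
    psi0 = rng.normal(z / 2, np.sqrt(0.5), size=(n_rep, 1))          # psi_0 ~ q(psi | z)
    psi_rest = rng.normal(0.0, 1.0, size=(n_rep, K))                 # psi_{1:K} ~ tau(psi | z)
    psi = np.concatenate([psi0, psi_rest], axis=1)

    log_q_joint = norm.logpdf(psi, 0, 1) + norm.logpdf(z, psi, 1)    # log q(z, psi_k)
    log_tau = norm.logpdf(psi, 0, 1)                                 # log tau(psi_k | z)
    return np.mean(logsumexp(log_q_joint - log_tau, axis=1) - np.log(K + 1))

z = 1.3
print("log q(z) =", norm.logpdf(z, 0, np.sqrt(2.0)))                 # exact marginal log-density
for K in [0, 1, 5, 50]:
    print(f"K = {K:2d}:  U_K ≈ {upper_bound_estimate(z, K):.4f}")    # stays above log q(z), shrinks with K
```

Note that if we swapped the crude $\tau(\psi \mid z)$ for the true posterior $q(\psi \mid z)$, every ratio inside the sum would equal $q(z)$ and the bound would be tight for any $K$.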
Proof: we will only prove the first statement, by showing that the gap between the bound and the marginal log-density equals a certain KL divergence.
Consider the gap
$$\mathbb{E}_{q_\phi(\psi_0 \mid x, z)} \mathbb{E}_{\tau_\eta(\psi_{1:K} \mid x, z)} \log \frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(z, \psi_k \mid x)}{\tau_\eta(\psi_k \mid x, z)} - \log q_\phi(z \mid x)$$
Subtract the log: since $q_\phi(z, \psi_k \mid x) = q_\phi(\psi_k \mid x, z)\, q_\phi(z \mid x)$, moving $\log q_\phi(z \mid x)$ inside the logarithm gives
$$\mathbb{E}_{q_\phi(\psi_0 \mid x, z)} \mathbb{E}_{\tau_\eta(\psi_{1:K} \mid x, z)} \log \frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}$$
Multiply and divide by $q_\phi(\psi_0 \mid x, z)\, \tau_\eta(\psi_{1:K} \mid x, z)$ to get
$$\mathbb{E}_{q_\phi(\psi_0 \mid x, z)} \mathbb{E}_{\tau_\eta(\psi_{1:K} \mid x, z)} \log \left[ \frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)} \cdot \frac{q_\phi(\psi_0 \mid x, z)\, \tau_\eta(\psi_{1:K} \mid x, z)}{q_\phi(\psi_0 \mid x, z)\, \tau_\eta(\psi_{1:K} \mid x, z)} \right]$$
Which is exactly the KL divergence
$$D_{\mathrm{KL}}\left( q_\phi(\psi_0 \mid x, z)\, \tau_\eta(\psi_{1:K} \mid x, z) \,\middle\|\, \frac{q_\phi(\psi_0 \mid x, z)\, \tau_\eta(\psi_{1:K} \mid x, z)}{\frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}} \right)$$
For the KL to enjoy non-negativity and be 0 when distributions match, we need the second argument to be a valid probability density. Is it?
$$\omega_{q,\tau}(\psi_{0:K} \mid x, z) := \frac{q_\phi(\psi_0 \mid x, z)\, \tau_\eta(\psi_{1:K} \mid x, z)}{\frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}}$$
We'll show it by means of symmetry:
$$\int \frac{q_\phi(\psi_0 \mid x, z)\, \tau_\eta(\psi_{1:K} \mid x, z)}{\frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}} d\psi_{0:K} = \int \frac{\frac{q_\phi(\psi_0 \mid x, z)}{\tau_\eta(\psi_0 \mid x, z)}}{\frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}} \tau_\eta(\psi_{0:K} \mid x, z)\, d\psi_{0:K}$$
However, there is nothing special about the choice of the 0-th index: we could take any index $j$ and the expectation would not change. Let's average over all of them:
$$= \int \frac{\frac{q_\phi(\psi_j \mid x, z)}{\tau_\eta(\psi_j \mid x, z)}}{\frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}} \tau_\eta(\psi_{0:K} \mid x, z)\, d\psi_{0:K} = \frac{1}{K+1} \sum_{j=0}^K \int \frac{\frac{q_\phi(\psi_j \mid x, z)}{\tau_\eta(\psi_j \mid x, z)}}{\frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}} \tau_\eta(\psi_{0:K} \mid x, z)\, d\psi_{0:K}$$
$$= \int \frac{\frac{1}{K+1} \sum_{j=0}^K \frac{q_\phi(\psi_j \mid x, z)}{\tau_\eta(\psi_j \mid x, z)}}{\frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(\psi_k \mid x, z)}{\tau_\eta(\psi_k \mid x, z)}} \tau_\eta(\psi_{0:K} \mid x, z)\, d\psi_{0:K} = \int \tau_\eta(\psi_{0:K} \mid x, z)\, d\psi_{0:K} = 1$$
Q.E.D.
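To complement the symmetry argument, here is a small quadrature check (same hypothetical toy densities as in the sketch above, $K = 1$, scalar $\psi$) that $\omega_{q,\tau}$ indeed integrates to one:

```python
import numpy as np
from scipy.stats import norm

# Same hypothetical toy densities as above: q(psi | z) = N(z / 2, 1 / 2), tau(psi | z) = N(0, 1).
z = 1.3
q = lambda psi: norm.pdf(psi, z / 2, np.sqrt(0.5))
tau = lambda psi: norm.pdf(psi, 0.0, 1.0)

grid = np.linspace(-10, 10, 2001)
psi0, psi1 = np.meshgrid(grid, grid, indexing="ij")

# omega(psi_0, psi_1 | z) = q(psi_0 | z) tau(psi_1 | z)
#                           / [ (1/2) * (q(psi_0 | z)/tau(psi_0 | z) + q(psi_1 | z)/tau(psi_1 | z)) ]
omega = q(psi0) * tau(psi1) / (0.5 * (q(psi0) / tau(psi0) + q(psi1) / tau(psi1)))

# Integrate over psi_1, then over psi_0: the result should be very close to 1.
print(np.trapz(np.trapz(omega, grid, axis=1), grid))
```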
Let's see how our variational upper bound compares with prior work.
Corollary: for the case of two hierarchical distributions $q(z) = \int q(z, \psi)\, d\psi$ and $p(z) = \int p(z, \zeta)\, d\zeta$, we can give the following multisample variational bounds on the KL divergence:
$$D_{\mathrm{KL}}(q(z) \,\|\, p(z)) \le \mathbb{E}_{q(z, \psi_0)} \mathbb{E}_{\tau(\psi_{1:K} \mid z)} \mathbb{E}_{\nu(\zeta_{1:L} \mid z)} \log \frac{\frac{1}{K+1} \sum_{k=0}^K \frac{q(z, \psi_k)}{\tau(\psi_k \mid z)}}{\frac{1}{L} \sum_{l=1}^L \frac{p(z, \zeta_l)}{\nu(\zeta_l \mid z)}}$$
$$D_{\mathrm{KL}}(q(z) \,\|\, p(z)) \ge \mathbb{E}_{q(z)} \mathbb{E}_{\tau(\psi_{1:K} \mid z)} \mathbb{E}_{p(\zeta_0 \mid z)} \mathbb{E}_{\nu(\zeta_{1:L} \mid z)} \log \frac{\frac{1}{K} \sum_{k=1}^K \frac{q(z, \psi_k)}{\tau(\psi_k \mid z)}}{\frac{1}{L+1} \sum_{l=0}^L \frac{p(z, \zeta_l)}{\nu(\zeta_l \mid z)}}$$
where $\tau(\psi \mid z)$ and $\nu(\zeta \mid z)$ are variational approximations to $q(\psi \mid z)$ and $p(\zeta \mid z)$, respectively.
Note: the variational distributions in the lower and upper bounds optimize different divergences, so technically they should be two different distributions.
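The two bounds can be checked numerically. The sketch below uses a hypothetical pair of hierarchical Gaussians (my own toy choice, picked so the true KL is available in closed form) and deliberately crude variational approximations $\tau(\psi \mid z) = \nu(\zeta \mid z) = \mathcal{N}(0, 1)$; the Monte Carlo estimates of the two bounds should bracket the exact value $D_{\mathrm{KL}} = 0.25$.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(1)
n, K, L = 100_000, 10, 10

# Hypothetical toy hierarchical pair (chosen so the true KL has a closed form):
#   q: psi ~ N(0, 1),  z | psi ~ N(psi, 1)        =>  q(z) = N(0, 2)
#   p: zeta ~ N(0, 1), z | zeta ~ N(zeta + 1, 1)  =>  p(z) = N(1, 2)
# True KL(q || p) = (0 - 1)^2 / (2 * 2) = 0.25.
# Deliberately crude variational approximations: tau(psi | z) = nu(zeta | z) = N(0, 1).

def log_q_joint(z, psi):   # log q(z, psi)
    return norm.logpdf(psi, 0, 1) + norm.logpdf(z, psi, 1)

def log_p_joint(z, zeta):  # log p(z, zeta)
    return norm.logpdf(zeta, 0, 1) + norm.logpdf(z, zeta + 1, 1)

# --- Upper bound: (z, psi_0) ~ q(z, psi), psi_{1:K} ~ tau, zeta_{1:L} ~ nu ---------
psi0 = rng.normal(0, 1, size=(n, 1)); z = rng.normal(psi0, 1)
psi = np.concatenate([psi0, rng.normal(0, 1, size=(n, K))], axis=1)      # psi_{0:K}
zeta = rng.normal(0, 1, size=(n, L))                                     # zeta_{1:L}
log_wq = log_q_joint(z, psi) - norm.logpdf(psi, 0, 1)                    # q(z, psi_k) / tau(psi_k | z)
log_wp = log_p_joint(z, zeta) - norm.logpdf(zeta, 0, 1)                  # p(z, zeta_l) / nu(zeta_l | z)
upper = np.mean((logsumexp(log_wq, axis=1) - np.log(K + 1))
                - (logsumexp(log_wp, axis=1) - np.log(L)))

# --- Lower bound: z ~ q(z), psi_{1:K} ~ tau, zeta_0 ~ p(zeta | z), zeta_{1:L} ~ nu --
z = rng.normal(rng.normal(0, 1, size=(n, 1)), 1)
psi = rng.normal(0, 1, size=(n, K))                                      # psi_{1:K}
zeta0 = rng.normal((z - 1) / 2, np.sqrt(0.5))                            # p(zeta | z) = N((z - 1)/2, 1/2)
zeta = np.concatenate([zeta0, rng.normal(0, 1, size=(n, L))], axis=1)    # zeta_{0:L}
log_wq = log_q_joint(z, psi) - norm.logpdf(psi, 0, 1)
log_wp = log_p_joint(z, zeta) - norm.logpdf(zeta, 0, 1)
lower = np.mean((logsumexp(log_wq, axis=1) - np.log(K))
                - (logsumexp(log_wp, axis=1) - np.log(L + 1)))

print(f"lower ≈ {lower:.3f}  <=  true KL = 0.250  <=  upper ≈ {upper:.3f}")
```

Here the lower bound draws $\zeta_0$ from the exact posterior $p(\zeta \mid z)$, which is available in closed form only because the toy model is Gaussian; in practice this is the term that makes the lower bound harder to use.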
Method | MNIST | OMNIGLOT |
---|---|---|
AVB+AC | −83.7 ± 0.3 | — |
IWHVI | −83.9 ± 0.1 | −104.8 ± 0.1 |
SIVI | −84.4 ± 0.1 | −105.7 ± 0.1 |
HVM | −84.9 ± 0.1 | −105.8 ± 0.1 |
VAE+RealNVP | −84.8 ± 0.1 | −106.0 ± 0.1 |
VAE+IAF | −84.9 ± 0.1 | −107.0 ± 0.1 |
VAE | −85.0 ± 0.1 | −106.6 ± 0.1 |
Test log-likelihood on dynamically binarized MNIST and OMNIGLOT, reported with 2 std. intervals