## Too Boring; Didn't Watch

1. Good black-box lower bounds on the entropy are only possible when the entropy is small; otherwise they require $$\color{red} \exp(H)$$ samples
2. Good black-box lower bounds on the Mutual Information are only possible when the MI is small; otherwise they require $$\color{red} \exp(\text{MI})$$ samples
3. Good lower bounds are known when we know at least one marginal $$p(z)$$
4. Good upper bounds are known when we know both $$p(x|z)$$ and $$p(z)$$ densities
5. Alternative dependence measures might be interesting

## Mutual Information

• A measure of dependence between two variables $$I(X, Z)$$
• $$I(X, Z) := H(Z) - H(Z|X)$$
• $$I(X, Z) := \text{KL}(p(x, z) \mid\mid p(x) p(z) )$$
• Captures the amount of "information" one r.v. carries about the other
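
A standard closed-form example, reused as a toy model in the sketches later in these notes: for a jointly Gaussian pair $$(X, Z)$$ with correlation coefficient $$\rho$$,

$$I(X, Z) = -\tfrac{1}{2} \log(1 - \rho^2),$$

so the MI grows without bound as the pair becomes deterministic ($$\rho \to 1$$).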

## Applications

• A natural measure of posterior collapse in VAEs $$\text{MI}[p_\theta(x,z)] = \mathbb{E}_{p_\theta(x)} \text{KL}(p_\theta(z|x) \mid\mid p(z))$$
• Representation Learning [2,3,4] $$I(X, E_\theta(X)) \to \max_\theta$$
• Controlling dependence
• Demographic Parity in Fairness [17] $$I(Y; C | X) = 0$$
• Uncertainty Estimation
• Information Bottleneck [5]
• And more

## Mutual Information Estimation

### Lower bounds

| | Unknown $$p(z)$$ | Known $$p(z)$$ |
|---|---|---|
| Unknown $$p(x \mid z)$$ | Hard | Doable |
| Known $$p(x \mid z)$$ | Hard | Good |

### Upper bounds

| | Unknown $$p(z)$$ | Known $$p(z)$$ |
|---|---|---|
| Unknown $$p(x \mid z)$$ | Hard | Hard |
| Known $$p(x \mid z)$$ | Doable | Good |

## Black-box Lower Bounds [6]

• $$I(X, Z) := \text{KL}(p(x, z) \mid\mid p(x) p(z) )$$
• Can use lower bounds on KL
• Nguyen-Wainwright-Jordan (NWJ) [8] Lower Bound $$\text{KL}(p(x) \mid\mid q(x)) \ge \mathbb{E}_{p(x)} f(x) - \mathbb{E}_{q(x)} \exp(f(x)) + 1$$
• Donsker-Varadhan (DV) [7] Lower Bound $$\text{KL}(p(x) \mid\mid q(x)) \ge \mathbb{E}_{p(x)} f(x) - \log \mathbb{E}_{q(x)} \exp(f(x))$$
• InfoNCE [2] Lower Bound $$\text{MI}[p(x, z)] \ge \mathbb{E}_{p(x, z_0)} \mathbb{E}_{p(z_{1:K})} \log \frac{\exp(f(x, z_0))}{\tfrac{1}{K+1} \sum_{k=0}^K \exp(f(x, z_k)) }$$

All of these lower bounds have been shown to perform poorly when the true MI is large!

## Formal Limitations [Unpublished]

Theorem [1]: Let $$B$$ be any distribution-free high-confidence lower bound on $$\mathbb{H}[p(x)]$$ computed from a sample $$x_{1:N} \sim p(x)$$. More specifically, let $$B(x_{1:N}, \delta)$$ be any real-valued function of a sample and a confidence parameter $$\delta$$ such that, for any $$p(x)$$, with probability at least $$1 - \delta$$ over a draw of $$x_{1:N}$$ from $$p(x)$$, we have $$\mathbb{H}[p(x)] \ge B(x_{1:N}, \delta)$$. For any such bound, and for $$N \ge 50$$ and $$k \ge 2$$, with probability at least $$1 - \delta - 1.01/k$$ over the draw of $$x_{1:N}$$ we have $$B(x_{1:N}, \delta) \le \log(2k N^2).$$

Good black-box lower bounds on $$\mathbb{H}$$ require an exponential number of samples: to certify a bound of $$H$$ nats, the theorem forces $$N \gtrsim e^{H/2}$$ (up to the $$k$$-dependent factor).

$$I(X, Z) = {\color{red} H(X)} - H(X|Z) \overset{\text{discrete } X}{\le} {\color{red} H(X)}$$

## The NWJ Bound

• Recall the bound is: for any $$f(x)$$ $$\text{KL}(p(x) \mid\mid q(x)) \ge \mathbb{E}_{p(x)} f(x) - \mathbb{E}_{q(x)} \exp(f(x)) + 1$$
• The optimal critic $$f(x)$$ is $$f^*(x) = \ln \tfrac{p(x)}{q(x)}$$
• The Monte Carlo estimate for $$x_{1:N}^{(p)} \sim p(x), x_{1:M}^{(q)} \sim q(x)$$ and the optimal critic is $$\frac{1}{N} \sum_{n=1}^N \ln \tfrac{p(x_n^{(p)})}{q(x_n^{(p)})} - \frac{1}{M} \sum_{m=1}^M \left[ \tfrac{p(x_m^{(q)})}{q(x_m^{(q)})} - 1 \right]$$
• The second term has zero expectation, but contributes variance $$\text{Var}\left[ \frac{1}{M} \sum_{m=1}^M \left[ \tfrac{p(x_m^{(q)})}{q(x_m^{(q)})} - 1 \right] \right] = \frac{\chi^2(p(x) \mid\mid q(x))}{M} \ge \frac{\exp(\text{KL}(p(x) \mid\mid q(x))) - 1}{M}$$
• To drive the variance down, one needs an exponentially large (in the KL) number of samples $$M$$ (see the numerical sketch below)
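
A minimal numpy sketch of this blow-up, assuming the toy pair $$p = \mathcal{N}(\mu, 1)$$, $$q = \mathcal{N}(0, 1)$$ (so $$\text{KL} = \mu^2/2$$ and the optimal critic is available in closed form). For large KL even the empirical variance itself becomes unreliable at this sample size, which is exactly the point:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10_000  # samples from q used for the second NWJ term

for mu in [1.0, 2.0, 3.0, 4.0]:
    kl = mu**2 / 2                          # KL(N(mu,1) || N(0,1))
    x = rng.normal(size=M)                  # x ~ q = N(0, 1)
    ratio = np.exp(mu * x - mu**2 / 2)      # p(x)/q(x), the optimal-critic weight
    # per-sample variance of p/q equals chi^2(p||q); the averaged second term has variance chi^2 / M
    print(f"KL = {kl:4.1f}   empirical Var[p/q] = {ratio.var():12.1f}   "
          f"exp(KL) - 1 = {np.exp(kl) - 1:10.1f}")
```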

## The DV Bound

• Recall the bound is: for any $$f(x)$$ $$\text{KL}(p(x) \mid\mid q(x)) \ge \mathbb{E}_{p(x)} f(x) - \log \mathbb{E}_{q(x)} \exp(f(x))$$
• The optimal critic $$f(x)$$ is the same $$f^*(x) = \ln \tfrac{p(x)}{q(x)}$$
• The Monte Carlo estimate for $$x_{1:N}^{(p)} \sim p(x), x_{1:M}^{(q)} \sim q(x)$$ and the optimal critic is $$\frac{1}{N} \sum_{n=1}^N \ln \tfrac{p(x_n^{(p)})}{q(x_n^{(p)})} - \log \left[ \frac{1}{M} \sum_{m=1}^M \tfrac{p(x_m^{(q)})}{q(x_m^{(q)})} \right]$$
• A biased estimate. The bias is non-negative and can be shown [10] to satisfy $$-\mathbb{E}_{q(x_{1:M})} \log \left[ \frac{1}{M} \sum_{m=1}^M \tfrac{p(x_m)}{q(x_m)} \right] = O\left( \text{Var}\left[ \frac{1}{M} \sum_{m=1}^M \tfrac{p(x_m^{(q)})}{q(x_m^{(q)})} \right] \right)$$
• To reduce the bias, one would need an exponential number of samples $$M$$ (illustrated below)
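
A small sketch of the Jensen gap behind this bias, on the same toy Gaussian pair as above (an assumed example, not from the slides): the log of the Monte Carlo average would be $$0$$ in the absence of bias, yet it stays visibly negative unless $$M$$ is very large.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 3.0                                         # p = N(mu, 1), q = N(0, 1): KL = 4.5 nats

for M in [10, 100, 1_000, 10_000]:
    gaps = []
    for _ in range(500):
        x = rng.normal(size=M)                   # x ~ q
        ratio = np.exp(mu * x - mu**2 / 2)       # p(x)/q(x)
        gaps.append(np.log(ratio.mean()))        # log E[mean ratio] = 0, so this gap is (minus) the bias
    print(f"M = {M:6d}   mean log-average = {np.mean(gaps):+.3f}   (0 would mean no bias)")
```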

## The InfoNCE Bound

• Recall the bound is: for any $$f(x, z)$$ $$\text{MI}[p(x, z)] \ge \mathbb{E}_{p(x, z_0)} \mathbb{E}_{p(z_{1:K})} \log \frac{\exp(f(x, z_0))}{\tfrac{1}{K+1} \sum_{k=0}^K \exp(f(x, z_k)) }$$
• The optimal critic $$f^*(x, z) = \ln p(x|z)$$
• The bound can be shown to be upper-bounded by $$\log (K+1)$$: $$\mathbb{E}_{p(x, z_0)} \mathbb{E}_{p(z_{1:K})} \log \frac{\exp(f(x, z_0))}{\tfrac{1}{K+1} \sum_{k=0}^K \exp(f(x, z_k)) } \le \log (K+1)$$
• Once again, one has to use an exponential number of samples ($$K \gtrsim e^{\text{MI}}$$) to obtain a good estimate of the MI (see the sketch below)
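
A toy check of this cap, assuming the Gaussian pair from the beginning with $$\rho = 0.9999$$ (true MI $$\approx 4.26$$ nats) and the optimal critic $$f^*(x,z) = \log p(x|z)$$: the estimate tracks $$\log(K+1)$$ rather than the MI until $$K$$ is large enough.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, N = 0.9999, 2_000
true_mi = -0.5 * np.log(1 - rho**2)              # closed form for the Gaussian toy pair

def log_cond(x, z):                              # optimal critic f*(x, z) = log p(x|z)
    var = 1 - rho**2
    return -0.5 * (x - rho * z) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

z0 = rng.normal(size=N)                                         # z_0 ~ p(z)
x = rho * z0 + np.sqrt(1 - rho**2) * rng.normal(size=N)         # x ~ p(x|z_0)

for K in [1, 4, 16, 64]:
    z_neg = rng.normal(size=(N, K))                             # z_{1:K} ~ p(z)
    z_all = np.concatenate([z0[:, None], z_neg], axis=1)
    f = log_cond(x[:, None], z_all)
    log_avg = np.logaddexp.reduce(f, axis=1) - np.log(K + 1)    # log of the averaged exp(f)
    est = np.mean(f[:, 0] - log_avg)
    print(f"K = {K:3d}   InfoNCE = {est:.2f}   log(K+1) = {np.log(K + 1):.2f}   true MI = {true_mi:.2f}")
```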

## But why do they work?

• These methods have been reported to work empirically
• Despite negative results presented in this section

• "On Mutual Information Maximization for Representation Learning"
• Use MI to learn informative representations
• Tighter bounds do not translate into better representations
• Invertible representations keep improving
• Encoders initialized to be invertible become ill-conditioned
• Metric Learning seems to be a better interpretation

These findings do not agree with the MI-maximization interpretation

## Dissecting the Mutual Information

• Recall the equivalent definitions of the Mutual Information:
• $$\text{MI}[p(x, z)] := \text{KL}(p(x, z) \mid\mid p(x) p(z) )$$
• $$\text{MI}[p(x, z)] := \mathbb{E}_{p(x, z)} \log p(x, z) - \mathbb{E}_{p(x)} \log p(x) - \mathbb{E}_{p(z)} \log p(z)$$
• $$\text{MI}[p(x, z)] := \mathbb{E}_{p(x, z)} \log p(z | x) - \mathbb{E}_{p(z)} \log p(z)$$
• $$\text{MI}[p(x, z)] := \mathbb{E}_{p(x, z)} \log p(x | z) - \mathbb{E}_{p(x)} \log p(x)$$

Two ways to sidestep this limitation:

1. Avoid lower-bounding the intractable entropy
2. Assume the marginal $$p(z)$$ is known

(actually, these two are equivalent)

## Avoiding lower-bounds on entropies

• One of the equivalent definitions of the Mutual Information: $$\text{MI}[p(x, z)] := \mathbb{E}_{p(x, z)} \log p(z | x) - \mathbb{E}_{p(z)} \log p(z)$$
• Assume the second term is tractable
• Upper-bound the conditional entropy
• Upper bounds are easy (and efficient): $$0 \le \text{KL}(p(x) \mid\mid q(x)) = \mathbb{E}_{p(x)} \log \tfrac{p(x)}{q(x)} \Rightarrow -\mathbb{E}_{p(x)} \log p(x) \le -\mathbb{E}_{p(x)} \log q(x)$$

• This leads us to the Barber-Agakov [9] lower bound on the MI:  $$\text{MI}[p(x, z)] \ge \mathbb{E}_{p(x, z)} \log q(z | x) - \mathbb{E}_{p(z)} \log p(z)$$
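
A minimal sketch of the Barber-Agakov bound on the Gaussian toy pair (my choice of example, not from the slides), with a known marginal $$p(z) = \mathcal{N}(0,1)$$ and a deliberately misspecified variational decoder $$q(z|x) = \mathcal{N}(a x, s^2)$$: any such $$q$$ gives a valid lower bound, and the exact posterior makes it tight.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, N = 0.9, 100_000
true_mi = -0.5 * np.log(1 - rho**2)

z = rng.normal(size=N)                                      # z ~ p(z) = N(0, 1)
x = rho * z + np.sqrt(1 - rho**2) * rng.normal(size=N)      # x ~ p(x|z)

def ba_bound(a, s2):
    """Barber-Agakov estimate with q(z|x) = N(a*x, s2) and the known entropy of p(z)."""
    log_q = -0.5 * (z - a * x) ** 2 / s2 - 0.5 * np.log(2 * np.pi * s2)
    h_z = 0.5 * np.log(2 * np.pi * np.e)                    # H[p(z)] = -E_{p(z)} log p(z) for N(0, 1)
    return log_q.mean() + h_z

print("true MI            :", round(true_mi, 3))
print("BA, exact posterior:", round(ba_bound(rho, 1 - rho**2), 3))
print("BA, misspecified q :", round(ba_bound(0.7, 0.5), 3))
```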

## Assume the marginal is known

• Another equivalent definition of the Mutual Information: $$\text{MI}[p(x, z)] := \mathbb{E}_{p(x, z)} \log p(x | z) - \mathbb{E}_{p(x)} \log p(x)$$
• Use known $$p(z)$$ to lower-bound the second term
• Let $$\hat\rho(x,z)$$ be any unnormalized density and $$q(z|x)$$ be a normalized one
• IWHVI [12] upper bound + importance sampling: $$\mathbb{E}_{p(x)} \log p(x) \le \mathbb{E}_{p(x, z_0)} \mathbb{E}_{q(z_{1:K}|x)} \log \tfrac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho(x, z_k)}{q(z_k|x)} + \text{KL}(p(x,z) \mid\mid \hat\rho(x, z))$$
• Hence $$\boxed{ \mathbb{E}_{p(x, z)} \log \frac{p(x|z)}{p(x)} \ge \mathbb{E}_{p(x, z_0)} \mathbb{E}_{q(z_{1:K}|x)} \log \frac{\hat\rho(x, z_0)}{\tfrac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho(x, z_k)}{q(z_k|x)} } - \mathbb{E}_{p(z)} \log p(z) }$$

## Multisample Lower Bound

$$\text{MI}[p(x, z)] \ge \mathbb{E}_{p(x, z_0)} \mathbb{E}_{q(z_{1:K}|x)} \log \frac{\hat\rho(x, z_0)}{\tfrac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho(x, z_k)}{q(z_k|x)} } - \mathbb{E}_{p(z)} \log p(z)$$

• $$\hat\rho(x,z)$$ approximates the joint distribution $$p(x,z)$$
• $$q(z|x)$$ approximates the posterior distribution $$p(z|x)$$
• For $$K=0$$ we recover the Barber-Agakov lower bound
• Essentially, a fancy cross-entropy estimator for $$\log p(z|x)$$
• $$q(z|x)$$ is enhanced by Self-Normalized Importance Sampling
• $$q(z|x) = p(z)$$ and $$\hat\rho(x, z) = \exp(f(x, z)) p(z)$$ recover the InfoNCE lower bound
• Similar to SIVI [13]
• This explains InfoNCE's inefficiency: $$q(z|x) = p(z)$$ is a poor choice of proposal (a toy sketch follows below)
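
A toy sketch of the multisample bound on the same Gaussian pair (an assumed setup, not from the slides), taking $$\hat\rho(x,z)$$ to be the true joint and keeping the deliberately misspecified proposal $$q(z|x) = \mathcal{N}(0.7\,x, 0.5)$$: $$K=0$$ reproduces the Barber-Agakov value, and increasing $$K$$ repairs the poor proposal.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, N = 0.9, 20_000
true_mi = -0.5 * np.log(1 - rho**2)
h_z = 0.5 * np.log(2 * np.pi * np.e)                        # -E_{p(z)} log p(z) for N(0, 1)

z0 = rng.normal(size=N)
x = rho * z0 + np.sqrt(1 - rho**2) * rng.normal(size=N)

def log_normal(v, mean, var):
    return -0.5 * (v - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def log_rho(x, z):                                          # hat-rho = the true joint p(x|z) p(z)
    return log_normal(x, rho * z, 1 - rho**2) + log_normal(z, 0.0, 1.0)

a, s2 = 0.7, 0.5                                            # misspecified proposal q(z|x) = N(a*x, s2)
for K in [0, 1, 4, 16]:
    z_prop = a * x[:, None] + np.sqrt(s2) * rng.normal(size=(N, K))   # z_{1:K} ~ q(z|x)
    z_all = np.concatenate([z0[:, None], z_prop], axis=1)             # prepend z_0
    log_w = log_rho(x[:, None], z_all) - log_normal(z_all, a * x[:, None], s2)
    log_denom = np.logaddexp.reduce(log_w, axis=1) - np.log(K + 1)    # log of the SNIS denominator
    est = np.mean(log_rho(x, z0) - log_denom) + h_z
    print(f"K = {K:3d}   lower bound = {est:.3f}   (true MI = {true_mi:.3f})")
```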

## An Upper Bound on the Mutual Information

• Recall the equivalent definitions of the Mutual Information:
• $$I(X,Z) := H(X) - H(X|Z) = \mathbb{E}_{p(x, z)} \log p(x|z) - \mathbb{E}_{p(x)} \log p(x)$$
• Need a lower bound on the conditional entropy
• The theorem forbids any black-box efficient lower bounds
• Assume the conditional $$p(x|z)$$ and the prior $$p(z)$$ are known
• Can use the IWAE [11] lower bound on $$\log p(x)$$ to obtain the following upper bound: $$\text{MI}[p(x,z)] \le \mathbb{E}_{p(x, z_0)} \mathbb{E}_{q(z_{1:K}|x)} \log \frac{p(x|z_0)}{\tfrac{1}{K} \sum_{k=1}^K \tfrac{p(x,z_k)}{q(z_k|x)} }$$
• Where $$q(z|x)$$ is a normalized variational distribution
• Putting $$q(z|x) = p(z)$$ frees us from knowing the prior density $$p(z)$$: $$\text{MI}[p(x,z)] \le \mathbb{E}_{p(x, z_0)} \mathbb{E}_{p(z_{1:K})} \log \frac{p(x|z_0)}{\tfrac{1}{K} \sum_{k=1}^K p(x|z_k) }$$
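
A sketch of the $$q(z|x) = p(z)$$ variant on the Gaussian toy pair (again my assumed setup): only the conditional density $$p(x|z)$$ and samples from $$p(z)$$ are needed, and the bound tightens towards the true MI as $$K$$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, N = 0.9, 20_000
true_mi = -0.5 * np.log(1 - rho**2)

z0 = rng.normal(size=N)
x = rho * z0 + np.sqrt(1 - rho**2) * rng.normal(size=N)

def log_cond(x, z):                                   # log p(x|z) = log N(x; rho*z, 1 - rho^2)
    var = 1 - rho**2
    return -0.5 * (x - rho * z) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

for K in [1, 4, 16, 64]:
    z_neg = rng.normal(size=(N, K))                   # z_{1:K} ~ p(z), since q(z|x) = p(z)
    log_denom = np.logaddexp.reduce(log_cond(x[:, None], z_neg), axis=1) - np.log(K)
    est = np.mean(log_cond(x, z0) - log_denom)
    print(f"K = {K:3d}   upper bound = {est:.3f}   (true MI = {true_mi:.3f})")
```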

## A Sandwich Bound on the MI

$$\mathbb{E}_{p(x, z_0)} \mathbb{E}_{q(z_{1:K}|x)} \log \frac{p(x|z_0)}{\tfrac{1}{K+1} \sum_{k=0}^K \tfrac{p(x,z_k)}{q(z_k|x)} } \le \text{MI}[p(x,z)] \le \mathbb{E}_{p(x, z_0)} \mathbb{E}_{q(z_{1:K}|x)} \log \frac{p(x|z_0)}{\tfrac{1}{K} \sum_{k=1}^K \tfrac{p(x,z_k)}{q(z_k|x)} }$$

• A multisample variational sandwich bound
• Variational distribution $$q(z|x)$$ approximates the true posterior $$p(z|x)$$
• $$K$$ samples are used to improve the variational approximation
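
A quick sanity check on why the posterior is the right target: with the exact posterior $$q(z|x) = p(z|x)$$, every importance weight collapses to the marginal,

$$\frac{p(x, z_k)}{q(z_k|x)} = \frac{p(x, z_k)}{p(z_k|x)} = p(x),$$

so both sides of the sandwich equal $$\mathbb{E}_{p(x,z_0)} \log \frac{p(x|z_0)}{p(x)} = \text{MI}[p(x,z)]$$ for any $$K$$.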

## Alternative Measures

• Once again, the Mutual Information is: $$\text{MI}[p(x,z)] := \text{KL}(p(x,z) \mid\mid p(x)p(z) )$$
• Has information-theoretic interpretation
• Suffers from inability to efficiently lower-bound the entropy

• One might consider other divergences between $$p(x,z)$$ and $$p(x) p(z)$$:
• Reverse KL: $$\text{KL}(p(x)p(z) \mid\mid p(x,z))$$
• Any $$f$$-divergence: $$D_f(p(x,z) \mid\mid p(x)p(z))$$
• $$p$$-Wasserstein: $$W_p( p(x,z) \mid\mid p(x)p(z) )$$ [14]
• (Seem to) Lack information-theoretic interpretation

## Lautum Information

• While the Mutual Information is $$\text{MI}[p(x,z)] := \text{KL}(p(x,z) \mid\mid p(x)p(z) )$$
• The Lautum Information [15] is defined as $$\text{LI}[p(x,z)] := \text{KL}(p(x)p(z) \mid\mid p(x, z) )$$
• $$\text{LI}[p(x,z)] = \mathbb{E}_{p(x)p(z)} \log \frac{p(x)}{p(x|z)} = -\mathbb{E}_{p(x)p(z)} \log p(x|z) -\mathbb{H}[p(x)]$$
• Lacks information-theoretic interpretation
• The entropy term allows some bounds
• Good black-box cross-entropy upper bounds
• Good distribution-dependent lower bounds
• The cross-entropy term is hard to deal with

## Lower Bounds on the Lautum Information

• Recall the Lautum Information $$\text{LI}[p(x,z)] = \mathbb{E}_{p(x)} \log p(x) - \mathbb{E}_{p(x)p(z)} \log p(x|z)$$
• Simple cross-entropy lower bound: $$\text{LI}[p(x,z)] \ge \mathbb{E}_{p(x)} \log q(x) - \mathbb{E}_{p(x)p(z)} \log p(x|z)$$
• Requires expressive tractable $$q(x)$$
• Hierarchical multisample cross-entropy lower bound, $$q(x) := \int p(x|z) \rho(z) dz$$: $$\text{LI}[p(x,z)] \ge \mathbb{E}_{p(x)} \mathbb{E}_{q(z_{1:K}|x)} \log \frac{1}{K} \sum_{k=1}^K \frac{p(x|z_k) \rho(z_k)}{q(z_k|x)} - \mathbb{E}_{p(x)p(z)} \log p(x|z)$$
• $$\rho(z)$$ approximates the marginal $$p(z)$$
• $$q(z|x)$$ approximates the posterior $$p(z|x)$$
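
A small check of the simple cross-entropy lower bound on the Gaussian toy pair (an assumed example; the closed-form LI below comes from a direct Gaussian calculation, not from the slides): with $$q(x)$$ equal to the exact marginal $$\mathcal{N}(0,1)$$ the bound is tight.

```python
import numpy as np

rng = np.random.default_rng(6)
rho, N = 0.9, 200_000

# The Lautum Information averages over the product of marginals, so draw x and z independently
z = rng.normal(size=N)
x = rng.normal(size=N)                               # the marginal p(x) is N(0, 1) in this toy model

var = 1 - rho**2
log_cond = -0.5 * (x - rho * z) ** 2 / var - 0.5 * np.log(2 * np.pi * var)   # log p(x|z)
log_q = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)                                # q(x) = N(0, 1), exact here

li_bound = log_q.mean() - log_cond.mean()
li_closed_form = rho**2 / (1 - rho**2) + 0.5 * np.log(1 - rho**2)
print(f"cross-entropy lower bound = {li_bound:.3f}   closed-form LI = {li_closed_form:.3f}")
```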

## Upper Bounds on the Lautum Information

• The Lautum Information $$\text{LI}[p(x,z)] = \mathbb{E}_{p(x)} \log p(x) - \mathbb{E}_{p(x)p(z)} \log p(x|z)$$
• No efficient lower bounds on the entropy unless we know $$p(z)$$

• IWHVI bound gives $$\text{LI}[p(x,z)] \le \mathbb{E}_{p(x, z_0)} \mathbb{E}_{q(z_{1:K}|x)} \log \frac{1}{K+1} \sum_{k=0}^K \frac{p(x, z_k)}{q(z_k|x)} - \mathbb{E}_{p(x)p(z)} \log p(x|z)$$
• $$q(z|x)$$ approximates the posterior $$p(z|x)$$

## AIS-based Bounds

• All the bounds we've considered essentially followed from lower and upper bounds on the marginal log-likelihood given by Self-Normalized Importance Sampling (SNIS)
• Alternatively, one might consider Annealed Importance Sampling (AIS) [16]

• SNIS and AIS can be seen as non-parametric procedures that improve $$q(z|x)$$
• SNIS draws $$K$$ candidate samples and re-weights them by their importance weights, effectively keeping the best-looking ones
• Embarrassingly parallel
• Works with discrete random variables
• AIS takes a sample and runs several steps of HMC to move it towards $$p(z|x)$$
• Potentially more efficient in high-dimensional spaces
• Gradient-based MCMC only works in continuous spaces
• Sequential in nature

## Conclusion

• The MI is hard to estimate reliably in a black-box case
• Because the entropy is hard to estimate
• Black-box lower bounds on KL are bad for the same reason
• Knowing a marginal distribution $$p(z)$$ enables efficient lower bounds
• Knowing the full joint distribution $$p(x,z)$$ enables efficient upper bounds
• These bounds can be made tighter by using more computation
• We've developed SNIS-based bounds; AIS-based ones are also promising
• Alternative measures of dependency might not suffer from the entropy estimation problem

## References

[1]: David McAllester, Karl Stratos. Formal Limitations on the Measurement of Mutual Information.

[2]: Aaron van den Oord, Yazhe Li, Oriol Vinyals. Representation Learning with Contrastive Predictive Coding.

[3]: R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, Yoshua Bengio. Learning deep representations by mutual information estimation and maximization.

[4]: Philip Bachman, R Devon Hjelm, William Buchwalter. Learning Representations by Maximizing Mutual Information Across Views.

[5]: Naftali Tishby, Noga Zaslavsky. Deep Learning and the Information Bottleneck Principle.

[6]: Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, George Tucker. On Variational Bounds of Mutual Information.

[7]: Monroe D. Donsker, S. R. Srinivasa Varadhan. Asymptotic Evaluation of Certain Markov Process Expectations for Large Time.

[8]: XuanLong Nguyen, Martin J. Wainwright, Michael I. Jordan. Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization.

[9]: David Barber, Felix Agakov. The IM Algorithm: A Variational Approach to Information Maximization.

[10]: Justin Domke, Daniel Sheldon. Importance Weighting and Variational Inference.

[11]: Yuri Burda, Roger Grosse, Ruslan Salakhutdinov. Importance Weighted Autoencoders.

[12]: Artem Sobolev, Dmitry Vetrov. Importance Weighted Hierarchical Variational Inference.

[13]: Mingzhang Yin, Mingyuan Zhou. Semi-Implicit Variational Inference.

[14]: Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, Pierre Sermanet. Wasserstein Dependency Measure for Representation Learning.

[15]: Daniel P. Palomar, Sergio Verdu. Lautum Information.

[16]: Roger B. Grosse, Zoubin Ghahramani, Ryan P. Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo.