Don't just sample optimize
Peadar Coyle
Bayesian Neural Networks - Thomas Wiecki - PyMC3 Docs
Challenges in Bayesian Inference
- 1. Tradeoffs. How do we formalize statistical and computational tradeoffs for inference?
- 2. Software. How do we design efficient and flexible software for generative models?
Why do we need Variational Inference?
- Inferring hidden variables
- Unlike MCMC:
- Deterministic
- Easy to gauge convergence
- Requires dozens of iterations
- Doesn't require conjugacy
- Slightly hairier math
Background
Given
- Data set $\mathbf{x}$
- Generative model $p(\mathbf{x}, \mathbf{z})$ with latent variable $\mathbf{z} \in \mathbb{R}^{d}$
Goal
- Infer posterior $p(\mathbf{z} | \mathbf{x})$
That is the key problem in Bayesian inference
Let's look at the posterior
- We can write the conditional or posterior distribution as
$$p(\mathbf{z} | \mathbf{x}) = \dfrac{p(\mathbf{z},\mathbf{x})}{p(\mathbf{x})}$$
- The denominator is the marginal distribution of the observations (also called the evidence). It is calculated by marginalizing the latent variables out of the joint distribution:
$$p(\mathbf{x}) = \int p(\mathbf{z}, \mathbf{x}) \, d\mathbf{z}$$
- Often this integral is intractable
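To make the evidence integral concrete, here is a minimal sketch on a toy model that is not from the talk: a single latent z ~ N(0, 1) with observations x_i | z ~ N(z, 1). The model, the data values and the function names are all assumptions chosen so that the integral can be checked against a closed form. With one latent dimension a grid works; with d latent dimensions the same grid would need num^d points, which is why the integral is usually intractable.

```python
import numpy as np


def log_normal_pdf(value, mean, std):
    """Log density of a normal distribution, elementwise."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((value - mean) / std) ** 2


def log_joint(z, x):
    """log p(z, x) on a 1-D grid of z values (toy model, assumed for illustration)."""
    log_prior = log_normal_pdf(z, 0.0, 1.0)
    # Broadcast to shape (len(z), len(x)), then sum over the observations.
    log_lik = log_normal_pdf(x[None, :], z[:, None], 1.0).sum(axis=1)
    return log_prior + log_lik


def evidence_by_quadrature(x, lo=-10.0, hi=10.0, num=20001):
    """Approximate p(x) = integral of p(z, x) dz with the trapezoid rule."""
    z = np.linspace(lo, hi, num)
    joint = np.exp(log_joint(z, x))
    dz = z[1] - z[0]
    return (joint.sum() - 0.5 * (joint[0] + joint[-1])) * dz


def exact_evidence(x):
    """Closed form for this toy model: x ~ N(0, I + 11^T)."""
    n = x.size
    quad = np.sum(x**2) - np.sum(x) ** 2 / (n + 1)
    log_px = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1) - 0.5 * quad
    return np.exp(log_px)


x = np.array([0.5, 1.2, -0.3])  # made-up data
print(evidence_by_quadrature(x), exact_evidence(x))
```

The two printed numbers should agree closely, but only because this model is one-dimensional and conjugate.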
What do we approximate?
- We create a variational distribution over the latent variables: $q(z_{1:m} | \nu)$
- We want to find settings of $\nu$
- So that q is close to p
- When q == p (the exact posterior), this is plain Expectation Maximization
What does closeness mean?
- We measure the closeness of distributions using the Kullback-Leibler divergence
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q} \left[\log \dfrac{q(Z)}{p(Z|x)}\right]$$
- If q and p are both high, we're happy
- If KL = 0, then the distributions are equal
- If q is low we don't care. If q is high but p isn't, we pay a price
- http://bit.ly/2oROYAw
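The behaviour of the KL divergence is easy to check numerically on small discrete distributions. The sketch below uses made-up probability vectors (an assumption for illustration, not data from the talk): one q that ignores part of p's mass, and one q that puts its mass where p is low.

```python
import numpy as np


def kl_divergence(q, p):
    """KL(q || p) = E_q[log q - log p] for discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0  # outcomes with q = 0 contribute nothing to the expectation
    return float(np.sum(q[support] * np.log(q[support] / p[support])))


p = np.array([0.49, 0.49, 0.02])

# q is low where p is high (it ignores the second outcome): modest penalty.
q_misses_mass = np.array([0.96, 0.02, 0.02])

# q is high where p is low (third outcome): large penalty.
q_wrong_place = np.array([0.02, 0.02, 0.96])

print(kl_divergence(p, p))               # identical distributions
print(kl_divergence(q_misses_mass, p))
print(kl_divergence(q_wrong_place, p))
print(kl_divergence(p, q_misses_mass))   # KL is not symmetric
```

Because the expectation is taken under q, regions where q is small barely count. This is why a variational fit can settle on one mode of a multimodal posterior.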
We can do some math...
The negative of the ELBO (evidence lower bound) plus a constant is equal to the KL divergence:
$$\mathrm{KL}(q \,\|\, p) = -\left(\mathbb{E}_{q} [\log p(z, x)] - \mathbb{E}_{q}[\log q(z)]\right) + \log p(x)$$
- ELBO: the term in brackets
- Constant: $\log p(x)$, which does not depend on q
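One way to see where this identity comes from is to start from the definition of the KL divergence and substitute $p(z | x) = p(z, x) / p(x)$. The steps, written in the same notation as the slides:

```latex
\begin{align*}
\mathrm{KL}\left(q(z) \,\|\, p(z | x)\right)
  &= \mathbb{E}_{q}[\log q(z)] - \mathbb{E}_{q}[\log p(z | x)] \\
  &= \mathbb{E}_{q}[\log q(z)] - \mathbb{E}_{q}[\log p(z, x)] + \log p(x) \\
  &= -\underbrace{\left(\mathbb{E}_{q}[\log p(z, x)] - \mathbb{E}_{q}[\log q(z)]\right)}_{\text{ELBO}} + \log p(x)
\end{align*}
```

Since the KL divergence is never negative, it follows that $\log p(x) \geq \text{ELBO}$, which is where the name "evidence lower bound" comes from.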
Key points
- Minimizing KL divergence is the same as maximizing ELBO
- This allows us to change a sampling problem into an optimization problem
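As a minimal sketch of inference as optimization, take a toy model that is an assumption for illustration: z ~ N(0, 1), x_i | z ~ N(z, 1), with a Gaussian q(z) = N(m, s^2). For this model the ELBO and its gradients can be written by hand, so plain gradient ascent is enough. This is not how PyMC3 implements variational inference; it only shows the idea on a case where the exact posterior is known.

```python
import numpy as np


def elbo(m, s, x):
    """Closed-form ELBO for q(z) = N(m, s^2) under the toy model."""
    expected_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
    expected_log_lik = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2))
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return expected_log_prior + expected_log_lik + entropy


def fit_by_gradient_ascent(x, steps=500, lr=0.05):
    """Maximize the ELBO over m and rho = log(s) with hand-derived gradients."""
    n = x.size
    m, rho = 0.0, 0.0
    for _ in range(steps):
        s = np.exp(rho)
        grad_m = np.sum(x) - (n + 1) * m      # d ELBO / d m
        grad_rho = 1.0 - (n + 1) * s**2       # d ELBO / d rho, via the chain rule
        m += lr * grad_m
        rho += lr * grad_rho
    return m, np.exp(rho)


def exact_log_evidence(x):
    """log p(x) for the toy model, used only to check the result."""
    n = x.size
    quad = np.sum(x**2) - np.sum(x) ** 2 / (n + 1)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1) - 0.5 * quad


x = np.array([0.5, 1.2, -0.3])  # made-up data
m, s = fit_by_gradient_ascent(x)
print("variational mean and variance:", m, s**2)
print("exact posterior mean and variance:", x.sum() / (x.size + 1), 1 / (x.size + 1))
print("ELBO at optimum:", elbo(m, s, x), "log evidence:", exact_log_evidence(x))
```

Here the Gaussian family contains the exact posterior, so the optimized ELBO reaches log p(x) and the KL divergence goes to zero. In general q cannot match the posterior exactly and the ELBO stays strictly below the log evidence.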
What's new in PyMC3
- Release of the first stable version in early 2017
- Variational Inference
- Advanced Hamiltonian Monte Carlo samplers
- Easy optimization for finding the MAP point.
- Theano support for fast compilation
What else is new
- Gaussian process kernels
- New variants of Variational Inference (including Operator)
- Speed improvements
- API and documentation improvements
- Bayesian Methods for Hackers - in PyMC3 too
First gather data from some real-world phenomena. Then cycle through Box’s loop:
- Build a probabilistic model of the phenomena.
- Reason about the phenomena given model and data.
- Criticize the model, revise and repeat.