Don't just sample, optimize

Peadar Coyle

Bayesian Neural Networks - Thomas Wiecki - PyMC3 Docs

Challenges in Bayesian Inference

  • 1. Tradeoffs. How do we formalize statistical and computational tradeoffs for inference?
     
  • 2. Software. How do we design efficient and flexible software for generative models?

Why do we need Variational Inference?

  • Inferring hidden variables
  • Unlike MCMC, it is:
       - Deterministic
       - Easy to gauge convergence
       - Often needs only dozens of iterations (see the sketch after this list)
  • Doesn't require conjugacy
  • Slightly hairier math
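As a rough illustration of this contrast, here is a minimal sketch that fits the same toy model both ways (my own example, assuming the PyMC3 3.x API; keyword names such as sd differ in later releases):

import numpy as np
import pymc3 as pm

# Toy data: 100 draws from a Normal with unknown mean
np.random.seed(0)
data = np.random.normal(loc=1.0, scale=1.0, size=100)

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0.0, sd=10.0)
    pm.Normal('obs', mu=mu, sd=1.0, observed=data)

    # MCMC: stochastic, draws thousands of samples, convergence judged
    # with diagnostics such as R-hat across chains
    trace = pm.sample(2000, tune=1000)

    # Variational inference (ADVI): deterministic optimization of the ELBO,
    # convergence judged by watching the loss flatten out
    approx = pm.fit(n=20000, method='advi')
    vi_trace = approx.sample(1000)

print(pm.summary(trace))
print(pm.summary(vi_trace))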

Background

Given

  • A data set \mathbf{x}
  • A generative model p(\mathbf{x}, \mathbf{z}) with latent variables \mathbf{z} \in \mathbb{R}^{d}

Goal

  • Infer the posterior p(\mathbf{z} \mid \mathbf{x})

That is the key problem in Bayesian inference.

Let's look at the posterior

  • We can write the posterior (conditional) distribution as

    p(\mathbf{z} \mid \mathbf{x}) = \dfrac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}

  • The denominator is the marginal distribution of the observations (also called the evidence), obtained by integrating the latent variables out of the joint distribution

    p(\mathbf{x}) = \int_{\mathbf{z}} p(\mathbf{z}, \mathbf{x}) \, d\mathbf{z}

  • Often this integral is intractable (a numerical sketch follows below)
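To make "intractable" concrete, here is a small numerical sketch (my own illustration, not from the slides). For a single scalar latent variable the evidence can still be approximated by quadrature, but grid-style approaches need exponentially many evaluations as the dimension d of \mathbf{z} grows:

import numpy as np
from scipy import integrate, stats

# Observed data, assumed to come from x_i ~ Normal(z, 1) with prior z ~ Normal(0, 1)
x = np.array([0.5, 1.2, -0.3])

def joint(z):
    # p(x, z) = p(x | z) * p(z) for a single scalar latent variable z
    likelihood = np.prod(stats.norm.pdf(x, loc=z, scale=1.0))
    prior = stats.norm.pdf(z, loc=0.0, scale=1.0)
    return likelihood * prior

# Evidence p(x) = integral of the joint over z; feasible in one dimension ...
evidence, _ = integrate.quad(joint, -10, 10)
print("p(x) ~= %.6f" % evidence)

# ... but a grid with 100 points per dimension needs 100**d evaluations
# for a d-dimensional latent variable, which is hopeless for large d.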


What do we approximate?

  • We introduce a variational distribution q(z_{1:m} | \nu) over the latent variables
  • We want to find settings of the variational parameters \nu
  • So that q is close to the posterior p
  • When q == p this reduces to plain Expectation Maximization (a toy variational family is sketched after this list)
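As a concrete, purely illustrative example of such a family, here is a mean-field Gaussian q(z_{1:m} | \nu), where \nu stacks a mean and a log standard deviation for each latent variable (all names here are made up for the sketch):

import numpy as np

def make_mean_field_q(m, seed=0):
    """A mean-field Gaussian q(z_{1:m} | nu) with nu = (means, log_sds)."""
    rng = np.random.RandomState(seed)
    nu = {'means': np.zeros(m), 'log_sds': np.zeros(m)}

    def sample(n):
        # Each latent z_i is drawn independently: that is the mean-field assumption
        sds = np.exp(nu['log_sds'])
        return nu['means'] + sds * rng.randn(n, m)

    def log_q(z):
        sds = np.exp(nu['log_sds'])
        return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sds)
                      - 0.5 * ((z - nu['means']) / sds) ** 2, axis=-1)

    return nu, sample, log_q

nu, sample, log_q = make_mean_field_q(m=3)
z = sample(5)        # 5 draws of the 3 latent variables
print(log_q(z))      # log-density of q at those draws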

What does closeness mean?

  • We measure the closeness of the two distributions with the Kullback-Leibler (KL) divergence

    \mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q} \left[ \log \dfrac{q(Z)}{p(Z \mid x)} \right]

  • If KL = 0, the distributions are equal
  • Where q is high and p is high, we're happy
  • Where q is low, we don't care; where q is high but p is low, we pay a price (see the numerical sketch after this list)
  • http://bit.ly/2oROYAw
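A tiny numerical illustration of that intuition, using two discrete distributions (my own sketch; scipy's entropy with two arguments computes the KL divergence):

import numpy as np
from scipy.stats import entropy

p = np.array([0.50, 0.30, 0.15, 0.05])       # the "true" distribution
q_good = np.array([0.45, 0.30, 0.20, 0.05])  # roughly matches p
q_bad = np.array([0.05, 0.15, 0.30, 0.50])   # puts mass where p is low

# entropy(q, p) with two arguments is the KL divergence KL(q || p)
print(entropy(q_good, p))  # small: q is high mostly where p is high
print(entropy(q_bad, p))   # large: q is high where p is low, so we pay a price
print(entropy(p, p))       # exactly 0 when the distributions are equal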

We can do some math... 

The negative of the ELBO (evidence lower bound) plus a constant equals the KL divergence:

\mathrm{KL}(q \,\|\, p) = -(\mathbb{E}_{q} [\log p(\mathbf{z}, \mathbf{x})] - \mathbb{E}_{q}[\log q(\mathbf{z})]) + \log p(\mathbf{x})

The term in brackets is the ELBO; \log p(\mathbf{x}) is a constant with respect to q. The derivation is written out below.
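For reference, the standard algebra behind that identity, using p(\mathbf{z} \mid \mathbf{x}) = p(\mathbf{z}, \mathbf{x}) / p(\mathbf{x}):

\begin{aligned}
\mathrm{KL}(q \,\|\, p)
  &= \mathbb{E}_{q}\left[\log \dfrac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\
  &= \mathbb{E}_{q}[\log q(\mathbf{z})] - \mathbb{E}_{q}[\log p(\mathbf{z}, \mathbf{x})] + \log p(\mathbf{x}) \\
  &= -\left(\mathbb{E}_{q}[\log p(\mathbf{z}, \mathbf{x})] - \mathbb{E}_{q}[\log q(\mathbf{z})]\right) + \log p(\mathbf{x})
\end{aligned}

Since \log p(\mathbf{x}) does not depend on q, minimizing the KL divergence over q is the same as maximizing the ELBO.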

Key points

  • Minimizing the KL divergence is the same as maximizing the ELBO
  • This allows us to change a sampling problem into an optimization problem (a worked toy example follows)
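A worked toy example of that switch (my own sketch, not from the talk): for a conjugate Normal model the exact posterior is known, so we can check that maximizing a Monte Carlo estimate of the ELBO with a generic optimizer approximately recovers it.

import numpy as np
from scipy import optimize, stats

# Toy conjugate model: z ~ N(0, 1), x_i ~ N(z, 1)
np.random.seed(1)
x = np.random.normal(loc=0.8, scale=1.0, size=20)
n = len(x)

# Exact posterior (available here only because the model is conjugate)
post_mean = x.sum() / (n + 1.0)
post_sd = np.sqrt(1.0 / (n + 1.0))

# Fixed standard-normal draws: the reparameterization trick with common
# random numbers makes the Monte Carlo ELBO a deterministic function of nu
eps = np.random.randn(1000)

def neg_elbo(nu):
    mu, log_sd = nu
    sd = np.exp(log_sd)
    z = mu + sd * eps                                # samples from q(z | nu)
    log_prior = stats.norm.logpdf(z, 0.0, 1.0)
    log_lik = stats.norm.logpdf(x[:, None], z, 1.0).sum(axis=0)
    log_q = stats.norm.logpdf(z, mu, sd)
    elbo = np.mean(log_prior + log_lik - log_q)      # E_q[log p(x, z) - log q(z)]
    return -elbo

result = optimize.minimize(neg_elbo, x0=np.array([0.0, 0.0]), method='Nelder-Mead')
mu_hat, sd_hat = result.x[0], np.exp(result.x[1])
print("variational:", mu_hat, sd_hat)
print("exact:      ", post_mean, post_sd)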

What's new in PyMC3

  • Release of the first stable version in early 2017
  • Variational Inference
  • Advanced Hamiltonian Monte Carlo samplers
  • Easy optimization for finding the MAP point (sketched below)
  • Theano support for fast compilation
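A minimal sketch of the MAP and HMC pieces together (assuming the PyMC3 3.x API; details such as the sd keyword changed in later versions):

import numpy as np
import pymc3 as pm

np.random.seed(2)
data = np.random.normal(loc=2.0, scale=1.5, size=50)

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0.0, sd=10.0)
    sigma = pm.HalfNormal('sigma', sd=5.0)
    pm.Normal('obs', mu=mu, sd=sigma, observed=data)

    # Easy optimization for the MAP point
    map_estimate = pm.find_MAP()

    # Advanced Hamiltonian Monte Carlo: the NUTS sampler,
    # optionally started from the MAP estimate
    trace = pm.sample(1000, step=pm.NUTS(), start=map_estimate)

print(map_estimate)
print(pm.summary(trace))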

What else is new

  • Gaussian process kernels (a small kernel sketch follows this list)
  • New variants of Variational Inference (including Operator Variational Inference)
  • Speed improvements
  • API and documentation improvements
  • Bayesian Methods for Hackers, now available in PyMC3 too
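A small sketch of the GP kernels (hedged: the gp module's argument names changed between PyMC3 releases, so treat this as indicative rather than exact):

import numpy as np
import pymc3 as pm

# 10 one-dimensional input points
X = np.linspace(0, 1, 10)[:, None]

# Exponentiated quadratic (RBF) kernel with lengthscale 0.3
cov_func = pm.gp.cov.ExpQuad(1, 0.3)

# Evaluate the 10 x 10 covariance matrix (the kernel returns a Theano tensor)
K = cov_func(X).eval()
print(K.shape)       # (10, 10)
print(np.diag(K))    # ones on the diagonal for the ExpQuad kernel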

First, gather data from some real-world phenomenon. Then cycle through Box's loop (a PyMC3 sketch of one pass follows):

  1. Build a probabilistic model of the phenomenon.
  2. Reason about the phenomenon given the model and data.
  3. Criticize the model, revise and repeat.
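One pass through that loop in PyMC3 might look like the following sketch (assuming the 3.x API; pm.sample_ppc was renamed in later versions):

import numpy as np
import pymc3 as pm

# 0. Gather data from some real-world phenomenon (simulated here)
np.random.seed(3)
data = np.random.poisson(lam=4.0, size=200)

# 1. Build a probabilistic model of the phenomenon
with pm.Model() as model:
    rate = pm.Gamma('rate', alpha=1.0, beta=1.0)
    pm.Poisson('obs', mu=rate, observed=data)

    # 2. Reason about the phenomenon given the model and data
    trace = pm.sample(1000, tune=1000)

    # 3. Criticize the model: simulate replicated data and compare it
    #    with what we actually observed, then revise and repeat
    ppc = pm.sample_ppc(trace, samples=200)

print("observed mean:  ", data.mean())
print("replicated mean:", ppc['obs'].mean())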
