Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

Daniel Emaasit

Data Scientist

Haystax Technology

Bayesian Data Science DC Meetup

October 11, 2018

Materials

Download slides & code:

bit.ly/pymc-learn-dc

Application (1/3)

Optimizing complex models in autonomous vehicles.

Schneider et al., 2016. (Uber ATG).

Application (2/3)

Automating machine learning

TechCrunch, accessed 2018 (Google AutoML).

Application (3/3)

Supplying internet to remote areas

Wired magazine, accessed 2016. (Google Project Loon)

Intro to Bayesian Nonparametrics

Bayesian Nonparametrics (1/2)

Task: Function approximation.

3. pre-specifying $f(\textbf{X})$ may produce either overly complex or simple models

Most approaches to modeling focus on parametric models that impose restrictive assumptions, e.g:

2. it's difficult to know a priori the most appropriate number of parameters

Limitations of this approach:

y = f(\textbf{x}) + \varepsilon

y = f(\textbf{x}) + \varepsilon

prespecifying the functional form,

2. prespecifying the number of parameters, e.g $\text{coefficients }\beta, \text{variance }\sigma^2$

1. it's difficult to know a priori the most appropriate function

\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2

\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2

\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_1^2 + \beta_3x_2^2 + \beta_4sin(1+ x_2)

\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_1^2 + \beta_3x_2^2 + \beta_4sin(1+ x_2)

\pi = \mathbf{E}[Y \mid X] = \frac{e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}{1 + e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}

\pi = \mathbf{E}[Y \mid X] = \frac{e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}{1 + e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}

y_i \sim Bin(n_i, \pi_i)

y_i \sim Bin(n_i, \pi_i)

\pi = \mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2

\pi = \mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2

y_i \sim \mathbf{N}(\pi_i, \sigma^2)

y_i \sim \mathbf{N}(\pi_i, \sigma^2)

\text{suppose} \ y \ \text{is} \ \text{continuous}

\text{suppose} \ y \ \text{is} \ \text{continuous}

\text{suppose} \ y \ \text{is} \ \text{discrete} \{0, 1\}

\text{suppose} \ y \ \text{is} \ \text{discrete} \{0, 1\}

Bayesian Nonparametrics (2/2)

Gaussian Process

Methods that require weaker or less restrictive assumptions are preferred.

leads to estimating flexible* models e.g Gaussian Processes

*Ghahramani, 2016

Define a prior over the function

\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})

\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})

A distribution over functions in an infinite space is:

a Gaussian process (GP) (Rasmussen & Williams , 2006)

where

\begin{pmatrix} f(\textbf{x}_1) \\ f(\textbf{x}_2) \\ \vdots \\ f(\textbf{x}_N) \end{pmatrix} \sim \mathcal{N} \left \{ \begin{pmatrix} m_f(\textbf{x}_1) \\ m_f(\textbf{x}_2) \\ \vdots \\ m_f(\textbf{x}_N) \end{pmatrix} , \begin{pmatrix} k_f(\textbf{x}_1,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_1,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_1,\textbf{x}_N^{\prime}) \\ k_f(\textbf{x}_2,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_2,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_2,\textbf{x}_N^{\prime}) \\ \vdots & \vdots & \ddots & \vdots \\ k_f(\textbf{x}_N,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_N,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_N,\textbf{x}_N^{\prime}) \end{pmatrix} \right \}

\begin{pmatrix} f(\textbf{x}_1) \\ f(\textbf{x}_2) \\ \vdots \\ f(\textbf{x}_N) \end{pmatrix} \sim \mathcal{N} \left \{ \begin{pmatrix} m_f(\textbf{x}_1) \\ m_f(\textbf{x}_2) \\ \vdots \\ m_f(\textbf{x}_N) \end{pmatrix} , \begin{pmatrix} k_f(\textbf{x}_1,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_1,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_1,\textbf{x}_N^{\prime}) \\ k_f(\textbf{x}_2,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_2,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_2,\textbf{x}_N^{\prime}) \\ \vdots & \vdots & \ddots & \vdots \\ k_f(\textbf{x}_N,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_N,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_N,\textbf{x}_N^{\prime}) \end{pmatrix} \right \}

$\mathbf{m}_f$ = mean function

$\textbf{K}_{f}$ = covariance function (kernel)

f(\textbf{x})

f(\textbf{x})

\text{input }x

\text{input }x

See Appendix for mathematical details

Link

Probabilistic Programming

Probabilistic Programming (1/2)

Probabilisic Programming (PP) Languages:

Software packages that take a model and then automatically generate inference routines (even source code!) e.g Pyro, Stan, Infer.Net, PyMC3, TensorFlow Probability, etc.

Probabilistic Programming (2/2)

Steps in Probabilisic ML:

Build the model (Joint probability distribution of all the relevant variables)
Incorporate the observed data
Perform inference (to learn distributions of the latent variables)

Box's Loop (Image credit: http://dustintran.com/)

Gaussian Process in PyMC3

import pymc3 as pm

# Instantiate a model
with pm.Model() as latent_gp_model:
    
    # specify the priors
    length_scale = pm.Gamma("length_scale", alpha = 2, beta = 1)
    signal_variance = pm.HalfCauchy("signal_variance", beta = 5)
    noise_variance = pm.HalfCauchy("noise_variance", beta = 5)
    degrees_of_freedom = pm.Gamma("degrees_of_freedom", alpha = 2, beta = 0.1)
    
    # specify the kernel function
    cov = signal_variance**2 * pm.gp.cov.ExpQuad(1, length_scale)
        
    # specify the mean function
    mean_function = pm.gp.mean.Zero()
    
    # specify the gp
    gp = pm.gp.Latent(cov_func = cov)
    
    # specify the prior over the latent function
    f = gp.prior("f", X = X) 
    
    # specify the likelihood
    obs = pm.StudentT("obs", mu = f, lam = 1/signal_variance, nu = degrees_of_freedom, observed = y)


# Perform Inference
with latent_gp_model:
    posterior = pm.sample(draws = 100, njobs = 2)

# extend the model by adding the GP conditional distribution so as to predict at test data
with latent_gp_model:
    f_pred = gp.conditional("f_pred", X_new)

# sample from the GP conditional posterior
with latent_gp_model:
    posterior_pred = pm.sample_ppc(posterior, vars = [f_pred], samples = 200)

Build a model

Train a model

Prediction

Scikit-learn

scikit-learn.org

from sklearn.gaussian_process import GaussianProcessRegressor()

model = GaussianProcessRegressor()

model.fit(X_train, y_train)

model.predict(X_test, y_test)

model.score(X_test, y_test)

model.save('path/to/saved/model')

Few lines of code

Build + Train + predict + score + save + load

Pymc-learn

pymc-learn.org

Pymc-learn

pymc-learn.org

from pmlearn.gaussian_process import GaussianProcessRegressor()

# Instantiate a PyMC3 Gaussian process model
model = GaussianProcessRegressor()

# Fit using MCMC or Variational Inference
model.fit(X_train, y_train)

model.predict(X_test, y_test)

model.score(X_test, y_test)

model.save('path/to/saved/model')

Few lines of code

PyMC3 vs Pymc-learn

import pymc3 as pm

# Instantiate a model
with pm.Model() as latent_gp_model:
    
    # specify the priors
    length_scale = pm.Gamma("length_scale", alpha = 2, beta = 1)
    signal_variance = pm.HalfCauchy("signal_variance", beta = 5)
    noise_variance = pm.HalfCauchy("noise_variance", beta = 5)
    degrees_of_freedom = pm.Gamma("degrees_of_freedom", alpha = 2, beta = 0.1)
    
    # specify the kernel function
    cov = signal_variance**2 * pm.gp.cov.ExpQuad(1, length_scale)
        
    # specify the mean function
    mean_function = pm.gp.mean.Zero()
    
    # specify the gp
    gp = pm.gp.Latent(cov_func = cov)
    
    # specify the prior over the latent function
    f = gp.prior("f", X = X) 
    
    # specify the likelihood
    obs = pm.StudentT("obs", mu = f, lam = 1/signal_variance, nu = degrees_of_freedom, observed = y)


# Perform Inference
with latent_gp_model:
    posterior = pm.sample(draws = 100, njobs = 2)

# extend the model by adding the GP conditional distribution so as to predict at test data
with latent_gp_model:
    f_pred = gp.conditional("f_pred", X_new)

# sample from the GP conditional posterior
with latent_gp_model:
    posterior_pred = pm.sample_ppc(posterior, vars = [f_pred], samples = 200)

from pmlearn.gaussian_process import GaussianProcessRegressor()

# Instantiate a PyMC3 Gaussian process model
model = GaussianProcessRegressor()

# Fit using MCMC or Variational Inference
model.fit(X_train, y_train)

model.predict(X_test, y_test)

Few lines of code

Many lines of code

PyMC3

Pymc-learn

Demo

bit.ly/pymc-learn-dc

Resources to get started

PyMC3 documents

Winn, J., Bishop, C. M., Diethe, T. (2015). Model-Based Machine Learning. Microsoft Research Cambridge.

R. McElreath (2012) Statistical Rethinking: A Bayesian Course with Examples in R and Stan (& PyMC3 & brms too)

Probabilistic Programming and Bayesian Methods for Hackers: Fantastic book with many applied code examples.

Thank You!

Appendix

Start with an inflexible* model

Consider for each data input, $i$ , that:

$y_i$ = output variable,

$\mathbf{x}_i$ = covariates with dimension $D$ , e.g {income, employment, trip distance, etc}

$f(\mathbf{x}_i)$ = function that maps $\mathbf{x}_i$ to $y_i$ ,

$\varepsilon_i$ = noise term,

p(\mathbf{\theta} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{\theta},\textbf{X}) \, p(\mathbf{\theta})}{p(\textbf{y}\mid \textbf{X})}

p(\mathbf{\theta} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{\theta},\textbf{X}) \, p(\mathbf{\theta})}{p(\textbf{y}\mid \textbf{X})}

$p(\mathbf{\theta} \mid \textbf{y},\textbf{X})$ = posterior over the parameters, given observed data

$p(\textbf{y} \mid \mathbf{\theta},\textbf{X})$ = likelihood of output variable, given covariates & parameters

$p(\mathbf{\theta})$ = prior over the parameters

$p(\textbf{y} \mid \textbf{X})$ = data distribution to ensure normalization

y_i = f(\textbf{x}_i) + \varepsilon_i, \quad \text{assume} \, \, \varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)

y_i = f(\textbf{x}_i) + \varepsilon_i, \quad \text{assume} \, \, \varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)

Bayesian modeling involves:

where

$\mathbf{\theta}$ = parameters e.g. coefficients

*Ghahramani, 2016; Williams & Rasmussen, 2006

Extension to a flexible* model

Given that latent function $f$ , is unknown:

It ( $f$ ) is considered the parameter of interest.

p(\mathbf{f} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{f},\textbf{X}) \, p(\mathbf{f})}{p(\textbf{y}\mid \textbf{X})}

p(\mathbf{f} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{f},\textbf{X}) \, p(\mathbf{f})}{p(\textbf{y}\mid \textbf{X})}

$p(\mathbf{f} \mid \textbf{y},\textbf{X})$ = posterior over the function, given observed data

$p(\textbf{y} \mid \mathbf{f},\textbf{X})$ = likelihood of output variable, given the covariates & function

$p(\mathbf{f})$ = prior over the function

$p(\textbf{y} \mid \textbf{X})$ = data distribution to ensure normalization

Consider the prior over the function as:

as any possible function in an infinite space

*Ghahramani, 2016; Williams & Rasmussen, 2006

Define a prior over the function (1/2)

\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})

\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})

A distribution over functions in an infinite space is:

a Gaussian process (GP) (Rasmussen & Williams , 2006)

where

\begin{pmatrix} f(\textbf{x}_1) \\ f(\textbf{x}_2) \\ \vdots \\ f(\textbf{x}_N) \end{pmatrix} \sim \mathcal{N} \left \{ \begin{pmatrix} m_f(\textbf{x}_1) \\ m_f(\textbf{x}_2) \\ \vdots \\ m_f(\textbf{x}_N) \end{pmatrix} , \begin{pmatrix} k_f(\textbf{x}_1,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_1,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_1,\textbf{x}_N^{\prime}) \\ k_f(\textbf{x}_2,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_2,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_2,\textbf{x}_N^{\prime}) \\ \vdots & \vdots & \ddots & \vdots \\ k_f(\textbf{x}_N,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_N,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_N,\textbf{x}_N^{\prime}) \end{pmatrix} \right \}

$\mathbf{m}_f$ = mean function

$\textbf{K}_{f}$ = covariance function (kernel)

f(\textbf{x})

f(\textbf{x})

\text{input }x

\text{input }x

Define a prior over the function (2/2)

k_f(\textbf{x}_i, \textbf{x}_j^{\prime}\mid\theta) =\sigma^2_fexp\bigg(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_{i,d}-x_{j,d})^2}{l_d^2}\bigg)

k_f(\textbf{x}_i, \textbf{x}_j^{\prime}\mid\theta) =\sigma^2_fexp\bigg(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_{i,d}-x_{j,d})^2}{l_d^2}\bigg)

$\sigma^2_f$ = signal-variance parameter
$l_d$ = length-scale parameter ( $l_d$ is positive; controls smoothness)

The Squared Exponential Automatic Relevance Determination (SE-ARD) kernel (Neal, 2012).

Very smooth i.e continuous

We expect our functions to vary smoothly

Assumption:

Recent sampling methods considered efficient in high dimensions will be used:

Efficient sampling

Hamiltonian Monte Carlo (HMC) i.e. No-U-turn Sampler (NUTS) (Hoffman & Gelman, 2014)

Estimation will involve finding values for two sets hyperparameters:

$\sigma^2_f$ & $\sigma^2_g$ = signal-variance & noise variance

Automatic Differentiation Variational Inference (ADVI) (Kucukelbir et al, 2016)

$l_d^f$ & $l_d^g$ = length-scales in $f$ and $g$

Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

By Daniel Emaasit

Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

7 years ago
1,501

Daniel Emaasit

Data Scientist @HaystaxTech, Ph.D. Candidate @UNLV, Bayesian Machine Learning Researcher, Organizer of Data Science Meetups. User of #PyMC3.

Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

Bayesian Data Science DC Meetup

October 11, 2018

Materials

Application (1/3)

Application (2/3)

Application (3/3)

Intro to Bayesian Nonparametrics

Define a prior over the function

Probabilistic Programming

Probabilistic Programming (1/2)

Probabilistic Programming (2/2)

Gaussian Process in PyMC3

Scikit-learn

Pymc-learn

Pymc-learn

PyMC3 vs Pymc-learn

Demo

Resources to get started

Thank You!

Appendix

Start with an inflexible* model

Extension to a flexible* model

Define a prior over the function (1/2)

Define a prior over the function (2/2)

Efficient sampling

Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

More from Daniel Emaasit