## Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

Daniel Emaasit

Data Scientist

Haystax Technology

## Materials

Download slides & code:

## Application (1/3)

• Optimizing complex models in autonomous vehicles.

## Application (2/3)

• Automating machine learning

## Application (3/3)

• Supplying internet to remote areas

# Intro to Bayesian Nonparametrics

Bayesian Nonparametrics (1/2)

• Task: Function approximation.

3. pre-specifying $$f(\textbf{X})$$ may produce  either overly complex or simple models

• Most approaches to modeling focus on parametric models that impose restrictive assumptions, e.g:

2.   it's difficult to know a priori the most appropriate number of parameters

• Limitations of this approach:
y = f(\textbf{x}) + \varepsilon
$y = f(\textbf{x}) + \varepsilon$
1. prespecifying the functional form,

2. prespecifying the number of parameters, e.g $$\text{coefficients }\beta, \text{variance }\sigma^2$$

1. it's difficult to know a priori the most appropriate function

\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2
$\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2$
\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_1^2 + \beta_3x_2^2 + \beta_4sin(1+ x_2)
$\mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_1^2 + \beta_3x_2^2 + \beta_4sin(1+ x_2)$
\pi = \mathbf{E}[Y \mid X] = \frac{e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}{1 + e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}
$\pi = \mathbf{E}[Y \mid X] = \frac{e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}{1 + e^{(\beta_0 + \beta_1x_1 + \beta_2x_2)}}$
y_i \sim Bin(n_i, \pi_i)
$y_i \sim Bin(n_i, \pi_i)$
\pi = \mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2
$\pi = \mathbf{E}[Y \mid X] = \beta_0 + \beta_1x_1 + \beta_2x_2$
y_i \sim \mathbf{N}(\pi_i, \sigma^2)
$y_i \sim \mathbf{N}(\pi_i, \sigma^2)$
\text{suppose} \ y \ \text{is} \ \text{continuous}
$\text{suppose} \ y \ \text{is} \ \text{continuous}$
\text{suppose} \ y \ \text{is} \ \text{discrete} \{0, 1\}
$\text{suppose} \ y \ \text{is} \ \text{discrete} \{0, 1\}$

Or

Bayesian Nonparametrics (2/2)

Gaussian Process

• Methods that require weaker or less restrictive assumptions are preferred.
• leads to estimating flexible*  models e.g Gaussian Processes

*Ghahramani, 2016

20

15

10

## Define a prior over the function

\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})
$\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})$
• A distribution over functions in an infinite space is:
• a Gaussian process (GP) (Rasmussen & Williams , 2006)

where

\begin{pmatrix} f(\textbf{x}_1) \\ f(\textbf{x}_2) \\ \vdots \\ f(\textbf{x}_N) \end{pmatrix} \sim \mathcal{N} \left \{ \begin{pmatrix} m_f(\textbf{x}_1) \\ m_f(\textbf{x}_2) \\ \vdots \\ m_f(\textbf{x}_N) \end{pmatrix} , \begin{pmatrix} k_f(\textbf{x}_1,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_1,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_1,\textbf{x}_N^{\prime}) \\ k_f(\textbf{x}_2,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_2,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_2,\textbf{x}_N^{\prime}) \\ \vdots & \vdots & \ddots & \vdots \\ k_f(\textbf{x}_N,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_N,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_N,\textbf{x}_N^{\prime}) \end{pmatrix} \right \}
$\begin{pmatrix} f(\textbf{x}_1) \\ f(\textbf{x}_2) \\ \vdots \\ f(\textbf{x}_N) \end{pmatrix} \sim \mathcal{N} \left \{ \begin{pmatrix} m_f(\textbf{x}_1) \\ m_f(\textbf{x}_2) \\ \vdots \\ m_f(\textbf{x}_N) \end{pmatrix} , \begin{pmatrix} k_f(\textbf{x}_1,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_1,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_1,\textbf{x}_N^{\prime}) \\ k_f(\textbf{x}_2,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_2,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_2,\textbf{x}_N^{\prime}) \\ \vdots & \vdots & \ddots & \vdots \\ k_f(\textbf{x}_N,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_N,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_N,\textbf{x}_N^{\prime}) \end{pmatrix} \right \}$
• $$\mathbf{m}_f$$ = mean function
• $$\textbf{K}_{f}$$ = covariance function (kernel)
f(\textbf{x})
$f(\textbf{x})$
\text{input }x
$\text{input }x$

See Appendix for mathematical details

# Probabilistic Programming

## Probabilistic Programming (1/2)

• Probabilisic Programming (PP) Languages:
• Software packages that take a model and then automatically generate inference routines (even source code!) e.g Pyro, Stan, Infer.Net, PyMC3, TensorFlow Probability, etc.

## Probabilistic Programming (2/2)

• Steps in Probabilisic ML:
• Build the model (Joint probability distribution of all the relevant variables)
• Incorporate the observed data
• Perform inference (to learn distributions of the latent variables)

## Gaussian Process in PyMC3

import pymc3 as pm

# Instantiate a model
with pm.Model() as latent_gp_model:

# specify the priors
length_scale = pm.Gamma("length_scale", alpha = 2, beta = 1)
signal_variance = pm.HalfCauchy("signal_variance", beta = 5)
noise_variance = pm.HalfCauchy("noise_variance", beta = 5)
degrees_of_freedom = pm.Gamma("degrees_of_freedom", alpha = 2, beta = 0.1)

# specify the kernel function
cov = signal_variance**2 * pm.gp.cov.ExpQuad(1, length_scale)

# specify the mean function
mean_function = pm.gp.mean.Zero()

# specify the gp
gp = pm.gp.Latent(cov_func = cov)

# specify the prior over the latent function
f = gp.prior("f", X = X)

# specify the likelihood
obs = pm.StudentT("obs", mu = f, lam = 1/signal_variance, nu = degrees_of_freedom, observed = y)


# Perform Inference
with latent_gp_model:
posterior = pm.sample(draws = 100, njobs = 2)

# extend the model by adding the GP conditional distribution so as to predict at test data
with latent_gp_model:
f_pred = gp.conditional("f_pred", X_new)

# sample from the GP conditional posterior
with latent_gp_model:
posterior_pred = pm.sample_ppc(posterior, vars = [f_pred], samples = 200)

Build a model

Train a model

Prediction

## Scikit-learn

from sklearn.gaussian_process import GaussianProcessRegressor()

model = GaussianProcessRegressor()

model.fit(X_train, y_train)

model.predict(X_test, y_test)

model.score(X_test, y_test)

model.save('path/to/saved/model')

Few lines of code

• Build + Train + predict + score + save + load

## Pymc-learn

from pmlearn.gaussian_process import GaussianProcessRegressor()

# Instantiate a PyMC3 Gaussian process model
model = GaussianProcessRegressor()

# Fit using MCMC or Variational Inference
model.fit(X_train, y_train)

model.predict(X_test, y_test)

model.score(X_test, y_test)

model.save('path/to/saved/model')

Few lines of code

## PyMC3  vs  Pymc-learn

import pymc3 as pm

# Instantiate a model
with pm.Model() as latent_gp_model:

# specify the priors
length_scale = pm.Gamma("length_scale", alpha = 2, beta = 1)
signal_variance = pm.HalfCauchy("signal_variance", beta = 5)
noise_variance = pm.HalfCauchy("noise_variance", beta = 5)
degrees_of_freedom = pm.Gamma("degrees_of_freedom", alpha = 2, beta = 0.1)

# specify the kernel function
cov = signal_variance**2 * pm.gp.cov.ExpQuad(1, length_scale)

# specify the mean function
mean_function = pm.gp.mean.Zero()

# specify the gp
gp = pm.gp.Latent(cov_func = cov)

# specify the prior over the latent function
f = gp.prior("f", X = X)

# specify the likelihood
obs = pm.StudentT("obs", mu = f, lam = 1/signal_variance, nu = degrees_of_freedom, observed = y)

# Perform Inference
with latent_gp_model:
posterior = pm.sample(draws = 100, njobs = 2)

# extend the model by adding the GP conditional distribution so as to predict at test data
with latent_gp_model:
f_pred = gp.conditional("f_pred", X_new)

# sample from the GP conditional posterior
with latent_gp_model:
posterior_pred = pm.sample_ppc(posterior, vars = [f_pred], samples = 200)
from pmlearn.gaussian_process import GaussianProcessRegressor()

# Instantiate a PyMC3 Gaussian process model
model = GaussianProcessRegressor()

# Fit using MCMC or Variational Inference
model.fit(X_train, y_train)

model.predict(X_test, y_test)

Few lines of code

Many lines of code

PyMC3

Pymc-learn

# Appendix

## Start with an inflexible* model

• Consider for each data input, $$i$$, that:

$$y_i$$ = output variable,

$$\mathbf{x}_i$$ = covariates with dimension $$D$$, e.g {income, employment, trip distance, etc}

$$f(\mathbf{x}_i)$$ = function that maps $$\mathbf{x}_i$$ to $$y_i$$,

$$\varepsilon_i$$ = noise term,

p(\mathbf{\theta} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{\theta},\textbf{X}) \, p(\mathbf{\theta})}{p(\textbf{y}\mid \textbf{X})}
$p(\mathbf{\theta} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{\theta},\textbf{X}) \, p(\mathbf{\theta})}{p(\textbf{y}\mid \textbf{X})}$

$$p(\mathbf{\theta} \mid \textbf{y},\textbf{X})$$ = posterior over the parameters, given observed data

$$p(\textbf{y} \mid \mathbf{\theta},\textbf{X})$$ = likelihood of output variable, given covariates & parameters

$$p(\mathbf{\theta})$$ = prior over the parameters

$$p(\textbf{y} \mid \textbf{X})$$ = data distribution to ensure normalization

y_i = f(\textbf{x}_i) + \varepsilon_i, \quad \text{assume} \, \, \varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)
$y_i = f(\textbf{x}_i) + \varepsilon_i, \quad \text{assume} \, \, \varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)$
• Bayesian modeling involves:

where

• $$\mathbf{\theta}$$ = parameters e.g. coefficients

*Ghahramani, 2016; Williams & Rasmussen, 2006

## Extension to a flexible* model

• Given that latent function $$f$$, is unknown:
• It ($$f$$) is considered the parameter of interest.
p(\mathbf{f} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{f},\textbf{X}) \, p(\mathbf{f})}{p(\textbf{y}\mid \textbf{X})}
$p(\mathbf{f} \mid \textbf{y},\textbf{X}) = \frac{p(\textbf{y} \mid \mathbf{f},\textbf{X}) \, p(\mathbf{f})}{p(\textbf{y}\mid \textbf{X})}$

$$p(\mathbf{f} \mid \textbf{y},\textbf{X})$$ = posterior over the function, given observed data

$$p(\textbf{y} \mid \mathbf{f},\textbf{X})$$ = likelihood of output variable, given the covariates & function

$$p(\mathbf{f})$$ = prior over the function

$$p(\textbf{y} \mid \textbf{X})$$ = data distribution to ensure normalization

• Consider the prior over the function as:
• as any possible function in an infinite space

*Ghahramani, 2016; Williams & Rasmussen, 2006

## Define a prior over the function (1/2)

\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})
$\textbf{f} \sim \mathcal{GP}(\textbf{m}_f, \textbf{K}_{f})$
• A distribution over functions in an infinite space is:
• a Gaussian process (GP) (Rasmussen & Williams , 2006)

where

\begin{pmatrix} f(\textbf{x}_1) \\ f(\textbf{x}_2) \\ \vdots \\ f(\textbf{x}_N) \end{pmatrix} \sim \mathcal{N} \left \{ \begin{pmatrix} m_f(\textbf{x}_1) \\ m_f(\textbf{x}_2) \\ \vdots \\ m_f(\textbf{x}_N) \end{pmatrix} , \begin{pmatrix} k_f(\textbf{x}_1,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_1,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_1,\textbf{x}_N^{\prime}) \\ k_f(\textbf{x}_2,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_2,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_2,\textbf{x}_N^{\prime}) \\ \vdots & \vdots & \ddots & \vdots \\ k_f(\textbf{x}_N,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_N,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_N,\textbf{x}_N^{\prime}) \end{pmatrix} \right \}
$\begin{pmatrix} f(\textbf{x}_1) \\ f(\textbf{x}_2) \\ \vdots \\ f(\textbf{x}_N) \end{pmatrix} \sim \mathcal{N} \left \{ \begin{pmatrix} m_f(\textbf{x}_1) \\ m_f(\textbf{x}_2) \\ \vdots \\ m_f(\textbf{x}_N) \end{pmatrix} , \begin{pmatrix} k_f(\textbf{x}_1,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_1,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_1,\textbf{x}_N^{\prime}) \\ k_f(\textbf{x}_2,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_2,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_2,\textbf{x}_N^{\prime}) \\ \vdots & \vdots & \ddots & \vdots \\ k_f(\textbf{x}_N,\textbf{x}_1^{\prime}) & k_f(\textbf{x}_N,\textbf{x}_2^{\prime}) & \cdots & k_f(\textbf{x}_N,\textbf{x}_N^{\prime}) \end{pmatrix} \right \}$
• $$\mathbf{m}_f$$ = mean function
• $$\textbf{K}_{f}$$ = covariance function (kernel)
f(\textbf{x})
$f(\textbf{x})$
\text{input }x
$\text{input }x$

## Define a prior over the function (2/2)

k_f(\textbf{x}_i, \textbf{x}_j^{\prime}\mid\theta) =\sigma^2_fexp\bigg(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_{i,d}-x_{j,d})^2}{l_d^2}\bigg)
$k_f(\textbf{x}_i, \textbf{x}_j^{\prime}\mid\theta) =\sigma^2_fexp\bigg(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_{i,d}-x_{j,d})^2}{l_d^2}\bigg)$
• $$\sigma^2_f$$ = signal-variance parameter
• $$l_d$$ = length-scale parameter ($$l_d$$ is positive; controls smoothness)
• The Squared Exponential Automatic Relevance Determination (SE-ARD) kernel (Neal, 2012).

Very smooth i.e continuous

We expect our functions to vary smoothly

• Assumption:
• Recent sampling methods considered efficient in high dimensions will be used:

## Efficient sampling

• Hamiltonian Monte Carlo (HMC) i.e. No-U-turn Sampler (NUTS) (Hoffman & Gelman, 2014)
• Estimation will involve finding values for two sets hyperparameters:
• $$\sigma^2_f$$ & $$\sigma^2_g$$ = signal-variance & noise variance
• Automatic Differentiation Variational Inference (ADVI) (Kucukelbir et al, 2016)
• $$l_d^f$$ & $$l_d^g$$ = length-scales in $$f$$ and $$g$$

#### Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

By Daniel Emaasit

# Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

Custom PyMC3 nonparametric Bayesian models built on top of the scikit-learn API

• 124
Loading comments...