Decomposition of Uncertainty in Bayesian Deep Learning

Report by

Pavel Temirchev

Uncertainty Decomposition

Two types of uncertainty:

  • Aleatoric - caused by the stochasticity of the modelled process
     
  • Epistemic - caused by the lack of training data available

Authors' proposition:

Given a properly trained Bayesian Neural Network,

one can decompose its uncertainty into aleatoric and epistemic terms

(Depeweg et al., 2018) Decomposition of Uncertainty in Bayesian Deep Learning

Bayesian Neural Networks

p(y|x) = \int p(y|x, w) p(w) dw

probabilistic model

p(w | x, y) = \frac{p(y|x, w) p(w) }{p(y|x)}

posterior distribution over model parameters (not tractable)

We want to approximate the posterior with a tractable distribution

The common optimization routine (Variational Bayes) is the following:

KL\big(q(w) || p(w|x, y) \big) = \log p(y|x) - \mathcal{L}(q) \rightarrow \min_q

which is equivalent to maximizing the evidence lower bound:

\mathcal{L}(q)= \mathbb{E}_q \log p(y|x, w) - KL\big (q(w)||p(w)\big) \rightarrow \max_q

The authors propose to minimize a different metric, the \(\alpha\)-divergence, since it yields better results

(Hernandez-Lobato et al., 2016) Black-Box \(\alpha\)-Divergence Minimization

Bayesian Neural Networks

Commonly, the model is chosen to have the following form:

p(y|x, w) = \mathcal{N}\big(\mu(x, w), \sigma^2(x, w) \big)

The approximate posterior is chosen to be a fully-factorized Gaussian:

q(w) = \prod_i \prod_j \prod_l \mathcal{N} (\mu_{ijl}, \sigma^2_{ijl})

and the prior on the parameters has a similar form:

p(w) = \mathcal{N}(0, \lambda)

where \(\lambda\) is the prior variance, commonly chosen to be 1
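As a concrete illustration, the sketch below (plain Python rather than PyTorch; the softplus parameterization of the standard deviation and the helper names are assumptions, not the authors' code) shows how one factor of \(q(w)\) is sampled via the reparameterization trick, and the closed-form KL of that factor against the prior \(\mathcal{N}(0, \lambda)\):

```python
import math
import random

def sample_weight(mu, rho, rng=random):
    """Draw one weight from a factor N(mu, sigma^2) of q(w) by reparameterization:
    w = mu + sigma * eps with eps ~ N(0, 1), so gradients can flow through mu and rho."""
    sigma = math.log1p(math.exp(rho))  # softplus keeps sigma positive
    return mu + sigma * rng.gauss(0.0, 1.0)

def kl_factor(mu, sigma2, lam=1.0):
    """Closed-form KL( N(mu, sigma2) || N(0, lam) ) for one factor of q(w);
    summing over all weights gives the KL term of the ELBO."""
    return 0.5 * (sigma2 / lam + mu ** 2 / lam - 1.0 + math.log(lam / sigma2))
```

With \(\mu = 0\) and \(\sigma^2 = \lambda\) the KL term vanishes, as expected.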

Bayesian Neural Networks
with Latent Variables

Classical BNNs assume only additive Gaussian noise, which is restrictive

Idea: feed the noise into the network as an input,
treating it as a latent variable \(z\)

The model becomes:

p(y|x, z, w) = \mathcal{N}\big(\mu(x, z, w), I \big)

Approximate posterior:

q(w, z) = q(w)q(z)

Prior:

p(z) = \mathcal{N}(0, 1)

(Depeweg et al., 2017) Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks

The variance of the predictive model is fixed!

Uncertainty Decomposition in BNNs

Once we have trained a proper BNN, we are interested in decomposing its uncertainty into aleatoric and epistemic components.

Total Uncertainty:

\mathcal{TU}(x) = \mathcal{AU}(x) + \mathcal{EU}(x) = \mathcal{H}\big(\int p(y|x, w) q(w)dw \big)

Aleatoric Uncertainty:

\mathcal{AU}(x) = \mathbb{E}_{w \sim q}\mathcal{H}\big( p(y|x, w) \big)

Epistemic Uncertainty:

\mathcal{EU}(x) = \mathcal{TU}(x) - \mathcal{AU}(x)
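For Gaussian predictive distributions the decomposition can be checked in closed form. The sketch below uses a hypothetical toy posterior (not from the paper) in which each weight setting gives \(p(y|x,w) = \mathcal{N}(\mu_w, \sigma^2)\) with \(\mu_w \sim \mathcal{N}(0, \tau^2)\), so the marginal predictive is \(\mathcal{N}(0, \sigma^2 + \tau^2)\) and all three terms are analytic:

```python
import math

def gaussian_entropy(var):
    # Differential entropy of a univariate Gaussian with variance `var`
    return 0.5 * math.log(2 * math.pi * math.e * var)

# Assumed toy posterior: p(y|x,w) = N(mu_w, sigma2), with mu_w ~ N(0, tau2)
sigma2, tau2 = 0.5, 2.0

TU = gaussian_entropy(sigma2 + tau2)  # H of the marginal predictive N(0, sigma2 + tau2)
AU = gaussian_entropy(sigma2)         # E_w H(p(y|x,w)): identical for every w here
EU = TU - AU                          # reduces to 0.5 * log(1 + tau2 / sigma2)
```

As \(\tau^2 \to 0\) (no posterior spread), EU vanishes and all uncertainty is aleatoric.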

Nearest Neighbor Entropy Estimation

The predictive distribution of a BNN commonly has no closed form

So, we should estimate its entropy from samples.

Assume \( y = \{ y_i \}_{i=1}^{n} \) is a set of samples from the distribution of interest,

sorted so that

y_{i+1} \geq y_i

Then we can approximate the entropy with the Nearest Neighbor Estimator:

\mathcal{H}\big(p(y)\big) \approx \frac{1}{n - 1} \sum_{i=1}^{n-1} \log ( y_{i+1} - y_i) + \psi(n) - \psi(1)

where \(\psi(\cdot)\) is the digamma function

(Kraskov et al., 2003) Estimating Mutual Information
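A minimal implementation of this estimator, following the sign convention of Kraskov et al. (2003), Eq. (20), and using the identity \(\psi(n) - \psi(1) = \sum_{k=1}^{n-1} 1/k\) to avoid a special-function library:

```python
import math
import random

def nn_entropy(samples):
    """Nearest-neighbor (spacing) entropy estimate for 1-d samples:
    psi(n) - psi(1) + mean log spacing (Kraskov et al., 2003, Eq. 20)."""
    y = sorted(samples)
    n = len(y)
    # Average log-spacing between consecutive order statistics
    # (tiny floor guards against duplicate samples)
    spacing = sum(math.log(max(y[i + 1] - y[i], 1e-300)) for i in range(n - 1)) / (n - 1)
    # psi(n) - psi(1) equals the harmonic number H_{n-1}
    harmonic = sum(1.0 / k for k in range(1, n))
    return spacing + harmonic
```

On a large sample from a standard normal, the estimate lands close to the true entropy \(0.5 \log(2\pi e) \approx 1.419\).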

Aims of the Final Project

  • Implement both BNN and BNN+LV models using PyTorch
     
  • Reproduce the results of neural networks training on the
    1d problem with heteroscedastic noise
     
  • Reproduce the results of uncertainty decomposition presented
    in the paper
     
  • Analyse the behaviour of the epistemic uncertainty on
    out-of-domain data
     
  • Propose a technique for data-generation in the context of active learning 

Experimental Results

Dataset

y = 7\sin(x) + 3|\cos(x/2) | \epsilon
\epsilon \sim \mathcal{N}(0, 1)
x \sim \sum_{i=0}^{2} \pi_i \mathcal{N}(\mu_i, \sigma^2_i)

     i    |  0  |  1  |  2
  \pi_i   | 1/3 | 1/3 | 1/3
  \mu_i   | -4  |  0  |  4
 \sigma_i | 2/5 | 0.9 | 2/5
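The generative process above can be sketched directly in plain Python (`sample_dataset` is a hypothetical helper name, not from the project code):

```python
import math
import random

def sample_dataset(n, rng=random):
    """Draw n pairs (x, y): x from the 3-component Gaussian mixture,
    then y = 7 sin(x) + 3 |cos(x/2)| eps with eps ~ N(0, 1)."""
    pis = [1 / 3, 1 / 3, 1 / 3]
    mus = [-4.0, 0.0, 4.0]
    sigmas = [2 / 5, 0.9, 2 / 5]
    data = []
    for _ in range(n):
        i = rng.choices(range(3), weights=pis)[0]  # pick a mixture component
        x = rng.gauss(mus[i], sigmas[i])
        y = 7 * math.sin(x) + 3 * abs(math.cos(x / 2)) * rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data
```

The noise scale \(3|\cos(x/2)|\) depends on \(x\), which is exactly the heteroscedasticity the BNN+LV model is meant to capture.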

Experimental Results

Training of the neural networks

BNN

BNN + Latent Variable

The classical BNN produces satisfactory results, whereas the BNN+LV performs much worse

Both NNs were implemented in PyTorch from scratch using the Variational Bayes approach

Experimental Results

Uncertainty decomposition in BNN without (!) latent variables

Total and aleatoric uncertainty capture the region where the model's STD is the largest

The epistemic uncertainty is too noisy (and unstable from realization to realization)

Experimental Results

Uncertainty decomposition in BNN with Latent Variable

The distribution of uncertainties is not informative!

This is due to the poor model performance. A learnable additive noise term could probably help.

Stated results from the paper

(Depeweg et al., 2018) Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning

The results from the paper show that the maxima of the epistemic uncertainty match the least observed regions of \(\mathcal{X}\)

The results are quite hard to reproduce

Uncertainty outside of domain

For classical BNN without (!) latent variables

Total and epistemic uncertainty grow quickly outside of the domain.

The plot for the epistemic uncertainty shows that its in-domain variation is comparatively low.

Uncertainty-based data generation

For active learning purposes

A common technique in active learning is to maximize the Epistemic Uncertainty over \(x\)

However, as we have seen previously, the EU is not a good maximization objective:

  • it is unstable at capturing unobserved regions inside the domain
  • it grows rapidly outside of the domain
  • it is unbounded from above (so the optimization problem is ill-posed)

One possible way to overcome these problems is to treat the data-generation procedure

as a sampling-from-a-distribution task.

The samples should be close to the in-domain data

And the model should be uncertain about them (in the epistemic sense)

Uncertainty-based data generation

For active learning purposes

Assume we have a probability distribution of the in-domain data points:

g(x)

the generative distribution

We can use the fact that \(\mathcal{EU}(x) \geq 0\) to construct the following distribution:

p(x \;|\; \text{unobserved} = True) \propto \mathcal{EU}(x)

Then we can sample from \(g(x) \cdot p(x \;|\; \text{unobserved} = True)\) using the
Metropolis - Hastings algorithm

proposal distribution:

g(x)

acceptance ratio:

\alpha(x', x) = \min\Big( 1, \frac{\mathcal{EU}(x')}{\mathcal{EU}(x)} \Big)

(with an independent proposal drawn from \(g\), the \(g\)-density terms cancel against the \(g\) factor in the target)
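A minimal sketch of this sampler, with toy stand-ins for \(g\) and \(\mathcal{EU}\) (both are hypothetical placeholders, not the trained model). Since proposals are drawn independently from \(g\), the \(g\)-density terms cancel against the \(g\) factor in the target \(g(x) \cdot \mathcal{EU}(x)\), and the acceptance test only needs the EU ratio:

```python
import random

# Hypothetical stand-ins for illustration only:
def sample_g(rng):
    # draw a proposal from the in-domain distribution g = N(0, 2^2)
    return rng.gauss(0.0, 2.0)

def eu(x):
    # toy epistemic uncertainty, growing away from the data
    return 1.0 + x * x

def mh_sample(n_steps, rng=random):
    """Independent Metropolis-Hastings targeting p(x) ∝ g(x) * EU(x).
    Proposals come from g, so the acceptance ratio reduces to EU(x') / EU(x)."""
    x = sample_g(rng)
    chain = []
    for _ in range(n_steps):
        x_new = sample_g(rng)
        if rng.random() < min(1.0, eu(x_new) / eu(x)):
            x = x_new  # accept the proposal
        chain.append(x)
    return chain
```

The chain concentrates where the generative density and the epistemic uncertainty are jointly high, which is the stated goal: candidate points near the data that the model is still unsure about.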

Discussion

  • Two types of Bayesian Neural Networks were implemented from scratch using PyTorch.
  • The training procedure for BNNs proved unstable.
  • The \(\alpha\)-divergence minimization procedure should be tested instead of Variational Bayes.
  • In practice, there is no guarantee that all of the epistemic uncertainty will be attributed to the epistemic term rather than to the aleatoric one.
  • A better entropy estimation procedure is needed. A probable solution for the 1-d case is to use the K-Nearest Neighbour estimator from (link)
  • A method for uncertainty-based exploration was proposed. It is based on sampling from the generative model, with the epistemic uncertainty acting as a critic. Further investigation is required.
