Deep Learning: Introduction to Generative Modeling - Part 1

Srijith Rajamohan, Ph.D.

Generative Modeling

Discriminative Modeling has dominated most of Machine Learning or Deep Learning
Discriminative Modeling: E.g. for a classifier, learn a manifold that separates the data space
Generative Modeling: Learn the underlying distribution of data so you can generate new data
Generative Modeling is harder 'but closer to true Artificial Intelligence than simply Discriminative Learning'
Two of the popular techniques in Deep Learning
- Variational Autoencoders
- Generative Adversarial Networks


What are Variational Autoencoders?

Unsupervised Machine Learning algorithm
- Supervised and semi-supervised versions exist as well
- E.g. Conditional Variational Autoencoders are supervised
Most people like to think of it in terms of a regular autoencoder - Encoder and a Decoder
Mathematically, the motivation is to understand the underlying latent space of high-dimensional data
Think of it as dimensionality reduction


Autoencoders


Picture courtesy of Wikipedia

Uses of Variational Autoencoders?

Anomaly detection
Dimensionality reduction
Physics-Informed GANs
- Physics-based laws encoded into the GANs
Data augmentation
- Medical data where you have limited data
- Data where privacy concerns require synthetic data
- Challenging to obtain labeled data


The Math?

Variables

- X: the input data
- z: the latent representation
- P(X|z): conditional probability distribution of the input given the latent variable
- P(z) \sim N(0,1): probability of the latent space variable
- P(z|X): conditional probability of the latent space variable given the input


Motivation

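By Bayes theorem, the conditional probability of the latent variable given the input is

P(z|X) = P(X|z)P(z)/P(X)

Using Bayes theorem to compute P(z|X) directly is intractable since the computation of P(X) is usually not feasible: it requires integrating over every possible value of z

P(X) = \int P(X|z)P(z)dz

However, we can approximate P(z|X) using Variational Inference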
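As a rough illustration of the integral for P(X), the sketch below estimates it by sampling z from the prior and averaging P(x|z). The toy model (a one-dimensional z with x|z \sim N(z, noise_std^2)) and the names `normal_pdf` and `estimate_evidence` are assumptions made for this example. In this toy case the exact answer is known, so the estimate can be checked; with a neural-network decoder and a high-dimensional z, this kind of sampling needs far too many samples to be practical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a 1-D normal distribution N(mean, std^2) at x."""
    var = std * std
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def estimate_evidence(x, noise_std, num_samples, rng):
    """Monte Carlo estimate of P(x) = integral of P(x|z) P(z) dz.

    Toy assumptions (not from the slides): z ~ N(0, 1), x|z ~ N(z, noise_std^2).
    """
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)               # sample z from the prior P(z)
        total += normal_pdf(x, z, noise_std)  # evaluate P(x|z) at that sample
    return total / num_samples


rng = random.Random(0)
x_obs, noise_std = 1.0, 1.0
estimate = estimate_evidence(x_obs, noise_std, 50_000, rng)

# For this toy model the marginal is known exactly: x ~ N(0, 1 + noise_std^2).
exact = normal_pdf(x_obs, 0.0, math.sqrt(1.0 + noise_std ** 2))
print(f"Monte Carlo estimate: {estimate:.4f}, exact: {exact:.4f}")
```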

Variational Inference

We don't know what P(z|X) is but we want to infer it
We can't estimate it directly but we can use Variational Inference
To do this, we approximate P(z|X) with a simpler distribution Q(z|X) that is easier to evaluate

P(z|X) \approx Q(z|X)

In most cases, Q(z|X) is chosen to be a Normal distribution whose mean and standard deviation are the encoder output

Q(z|X) \sim N(\vec{\mu}(X), \vec{\sigma}(X))

We try to minimize the difference between these two distributions using the KL Divergence

D_{KL}[Q(z|X) || P(z|X)]

KL Divergence

D_{KL}[Q(z|X) || P(z|X)] = \int Q(z|X) log \dfrac{Q(z|X)}{P(z|X)} dz

\Longrightarrow \underset{z \sim Q(z|X)}{E}[ log(\dfrac{Q(z|X)}{P(z|X)})] = \underset{z \sim Q(z|X)}{E}[ logQ(z|X) - logP(z|X)]

We don't have P(z|X) so we use Bayes Theorem to replace it as

\underset{z \sim Q(z|X)}{E}[ logQ(z|X) - log \dfrac{P(X|z)P(z)}{P(X)}]

= \underset{z \sim Q(z|X)}{E}[logQ(z|X) - logP(X|z) - logP(z) + logP(X)]
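The expectation form of the KL Divergence can be checked numerically. The sketch below averages log Q(z) - log P(z) over samples drawn from Q and compares the result with the known closed form for two one-dimensional Normal distributions. Both distributions are picked by hand for illustration (the true P(z|X) is not available in practice), and the function names are assumptions made for this example.

```python
import math
import random


def log_normal_pdf(z, mean, std):
    """Log-density of N(mean, std^2) at z."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (z - mean) ** 2 / (2.0 * var)


def kl_monte_carlo(q_mean, q_std, p_mean, p_std, num_samples, rng):
    """Estimate D_KL[Q || P] as the average of log Q(z) - log P(z) over z ~ Q."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        total += log_normal_pdf(z, q_mean, q_std) - log_normal_pdf(z, p_mean, p_std)
    return total / num_samples


def kl_closed_form(q_mean, q_std, p_mean, p_std):
    """Closed-form D_KL[N(q_mean, q_std^2) || N(p_mean, p_std^2)]."""
    return (math.log(p_std / q_std)
            + (q_std ** 2 + (q_mean - p_mean) ** 2) / (2.0 * p_std ** 2)
            - 0.5)


rng = random.Random(0)
kl_mc = kl_monte_carlo(0.5, 0.8, 0.0, 1.0, 50_000, rng)
kl_exact = kl_closed_form(0.5, 0.8, 0.0, 1.0)
print(f"Monte Carlo: {kl_mc:.4f}, closed form: {kl_exact:.4f}")
```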

KL Divergence

logP(X) does not depend on z, so it can be taken out of the expectation. Rearranging the terms gives

logP(X) - D_{KL}[Q(z|X) || P(z|X)] =

\underset{z \sim Q(z|X)}{E}[-logQ(z|X) + logP(X|z) + logP(z)]

The -logQ(z|X) and logP(z) terms combine to give another KL Divergence, -D_{KL}[Q(z|X) || P(z)]

logP(X) - D_{KL}[Q(z|X) || P(z|X)] =

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

This is the objective function for the VAE


Evidence Lower Bound

The right hand side of the equation is called the Evidence Lower Bound

logP(X) - D_{KL}[Q(z|X) || P(z|X)] =

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

We want to minimize the KL Divergence between Q(z|X) and P(z|X).
Since logP(X) does not depend on Q, minimizing the KL Divergence term on the left is equivalent to maximizing the RHS of the equation above.
This is called a Lower Bound because the KL Divergence is always greater than or equal to zero and therefore the RHS becomes the 'lower bound' for the log of the evidence, logP(X)
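The relation logP(X) - D_{KL}[Q(z|X) || P(z|X)] = ELBO can be verified on a small model where every quantity has a closed form. The sketch below assumes a toy model that is not from the slides: z \sim N(0,1) and x|z \sim N(z, noise_var), for which the evidence is N(0, 1 + noise_var) and the true posterior is also Normal. All function names are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log-density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def kl_gauss(m1, v1, m2, v2):
    """Closed-form D_KL[N(m1, v1) || N(m2, v2)] for 1-D Normals (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def elbo(x, q_mean, q_var, noise_var):
    """E_{z~Q}[log P(x|z)] - D_KL[Q || P(z)] for the toy model."""
    expected_log_lik = (-0.5 * math.log(2.0 * math.pi * noise_var)
                        - ((x - q_mean) ** 2 + q_var) / (2.0 * noise_var))
    return expected_log_lik - kl_gauss(q_mean, q_var, 0.0, 1.0)


x, noise_var = 2.0, 1.0
log_evidence = log_normal_pdf(x, 0.0, 1.0 + noise_var)  # log P(x)
post_mean = x / (1.0 + noise_var)                       # mean of the true P(z|x)
post_var = noise_var / (1.0 + noise_var)                # variance of the true P(z|x)

results = []
for q_mean, q_var in [(0.0, 1.0), (0.5, 0.8), (post_mean, post_var)]:
    gap = kl_gauss(q_mean, q_var, post_mean, post_var)  # D_KL[Q || P(z|x)]
    bound = elbo(x, q_mean, q_var, noise_var)
    results.append((log_evidence, gap, bound))
    print(f"Q=N({q_mean}, {q_var}): ELBO={bound:.4f}, log P(x)={log_evidence:.4f}")
```

The bound is tight only for the last choice of Q, which equals the true posterior; every other Q gives an ELBO strictly below logP(x).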


Let's look at each term

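logP(X) - D_{KL}[Q(z|X) || P(z|X)] =

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

- logP(X): the (log) evidence
- D_{KL}[Q(z|X) || P(z|X)]: the term we want to minimize
- \underset{z \sim Q(z|X)}{E}[ logP(X|z)]: the reconstruction term; maximizing it minimizes the reconstruction error
- D_{KL}[Q(z|X) || P(z)]: KL Divergence between the approximate function and the distribution of the latent variable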

What we do in practice

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

To accomplish our goal of minimizing the original KL divergence, we maximize the term above
P(X|z) represents the output of the decoder
Q(z|X) represents the output of the encoder
In a Variational Autoencoder, both the encoder and decoder are neural networks

The first term of the expression above gives the decoder loss and the second term gives the encoder loss

Decoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

Maximizing the first term means maximizing the expected log-likelihood of X given an input z and the set of parameters that define the decoder
Rephrased, it means we want to optimize the weights of the decoder neural network to minimize the error between the estimated X and the true X
This is the reconstruction loss
Use an appropriate cost function for this, e.g. Mean Squared Error (MSE), which corresponds to assuming P(X|z) is a Gaussian with fixed variance
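The link between MSE and logP(X|z) can be made concrete. The sketch below assumes P(X|z) is a Gaussian centred on the decoder output with a fixed standard deviation (an assumption for this example, as are the function names and the sample vectors). Under that assumption the log-likelihood is a constant minus a scaled MSE, so maximizing one minimizes the other.

```python
import math


def mse(x, x_hat):
    """Mean squared error between the input x and the reconstruction x_hat."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def gaussian_log_likelihood(x, x_hat, std):
    """log P(X|z) when P(X|z) = N(x_hat, std^2 I) and x_hat is the decoder output."""
    var = std * std
    return sum(-0.5 * math.log(2.0 * math.pi * var) - (a - b) ** 2 / (2.0 * var)
               for a, b in zip(x, x_hat))


x = [0.2, 0.9, 0.4]      # true input
good = [0.25, 0.8, 0.5]  # a close reconstruction
poor = [0.9, 0.1, 0.0]   # a poor reconstruction
std = 1.0
n = len(x)

log_lik_good = gaussian_log_likelihood(x, good, std)
log_lik_poor = gaussian_log_likelihood(x, poor, std)

# Same quantity written as: constant - n * MSE / (2 * std^2)
constant = -0.5 * n * math.log(2.0 * math.pi * std * std)
from_mse = constant - n * mse(x, good) / (2.0 * std * std)
print(log_lik_good, from_mse, log_lik_poor)
```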


Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

Example architecture of a Variational Autoencoder.

Image taken from Jeremy Jordan's excellent blog on the topic https://www.jeremyjordan.me/variational-autoencoders/


Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

The second term of the expression above is the KL divergence loss
Q(z|X)
- Q(z|X) is the result of our encoder
- The encoder is also represented by a neural network
- Since Q(z|X) is assumed to be a Normal distribution, the encoder outputs a vector of means and a vector of standard deviations
P(z)
- P(z) is assumed to be a normal distribution N(0,1)
- Alternately, it can also be a mixture of Gaussians


Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

Choosing P(z) to be a standard normal is useful when we compute the KL divergence in the encoder loss, since the divergence between two Normal distributions has a simple closed form
This divergence term tries to minimize the difference between the prior distribution of the latent variable P(z) and the approximate conditional distribution Q(z|X)
So Q(z|X) moves closer to a unit normal
But we don't want Q(z|X) to collapse to the unit normal for every input, since then there is no separability of data in the latent space
In a way this acts as a form of regularization
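For a Normal Q(z|X) with independent dimensions and a standard normal P(z), the divergence term has the closed form 0.5 \sum (\sigma^2 + \mu^2 - 1 - log \sigma^2). A minimal sketch follows; the function name and the example vectors are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """D_KL[N(mu, diag(sigma^2)) || N(0, I)], summed over the latent dimensions."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


# Q(z|X) equal to the prior: the divergence is zero.
kl_same = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# Q(z|X) shifted and rescaled away from the prior: the divergence is positive.
kl_far = kl_to_standard_normal([1.0, -0.5], [0.5, 1.5])
print(kl_same, kl_far)
```

The term grows as the means move away from zero or the standard deviations move away from one, which is the regularizing pull towards the unit normal.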


Decoder Input

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

So what goes into the decoder?
A value (or vector) is sampled from the distribution formed by the mean and standard deviation output by the encoder
This is done so that the VAE can learn to recognize not only the data input to the VAE but also data similar to it
A single sample per input was shown to be sufficient for this

\vec{z} \sim N(\vec{\mu}(X), \vec{\sigma}(X))

The encoder outputs [ \vec{\mu}, \vec{\sigma} ]; the dimensionality of these vectors is a hyperparameter

Loss for VAE

Loss comes from both the reconstruction error and the divergence term
Without the reconstruction loss, the output will not look like the input
Without the divergence term, the approximate distribution Q(z|X) can learn a narrow distribution. Recall that the KL Divergence was defined as

\int Q(z|X) log \dfrac{Q(z|X)}{P(z|X)} dz

We want our distributions to be broad so that they can cover the solution space; otherwise the VAE would suffer from the same problem as a regular autoencoder, i.e. a discontinuous solution space


Reparameterization

There is one more problem as far as implementation goes
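We cannot backpropagate gradients through a random sampling node with autodiff
The reparameterization trick fixes this by moving the randomness into an auxiliary noise variable \vec{\epsilon}, so that \vec{z} becomes a deterministic function of \vec{\mu} and \vec{\sigma}

\vec{z} = \vec{\mu} + \vec{\sigma} \odot \vec{\epsilon}, \vec{\epsilon} \sim N(0,1)

Picture from Jeremy Jordan's blog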
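The reparameterized sampling step can be sketched in a few lines. Here `reparameterize` is a hypothetical helper and the values of mu and sigma are made up; in a real VAE they would come from the encoder. Because z is written as mu + sigma * eps, the derivative of z with respect to mu is 1 and with respect to sigma is eps, which is what lets gradients reach the encoder.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    Returns both z and eps: dz/dmu = 1 and dz/dsigma = eps.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps


rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 0.1]  # stand-ins for the encoder output

z, eps = reparameterize(mu, sigma, rng)

# Sanity check: many draws should have mean mu and standard deviation sigma.
draws = [reparameterize(mu, sigma, rng)[0] for _ in range(20_000)]
mean0 = sum(d[0] for d in draws) / len(draws)
std0 = math.sqrt(sum((d[0] - mean0) ** 2 for d in draws) / len(draws))
print(f"first dimension: mean={mean0:.3f}, std={std0:.3f}")
```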


To summarize

[Summary diagram: input \vec{X}^{n} -> encoder f(X; \theta) -> [\vec{\mu}^d, \vec{\sigma}^d] -> sampling of z -> decoder g(z;\lambda) -> reconstruction \vec{\tilde{X}}^{n}]
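The whole pipeline can be put together in one forward pass. The sketch below is only illustrative: the encoder f(X; \theta) and decoder g(z; \lambda) are stand-in linear maps with random weights rather than trained neural networks, the loss is an unweighted sum of MSE and the KL term, and all names (`linear`, `encoder`, `decoder`, `vae_forward`, `random_matrix`) are hypothetical.

```python
import math
import random


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def encoder(x, theta):
    """Stand-in for f(X; theta): returns (mu, sigma), each of latent size d."""
    mu = linear(theta["w_mu"], x)
    # Predict log(sigma) and exponentiate so that sigma stays positive.
    sigma = [math.exp(v) for v in linear(theta["w_log_sigma"], x)]
    return mu, sigma


def decoder(z, lam):
    """Stand-in for g(z; lambda): returns a reconstruction of size n."""
    return linear(lam["w"], z)


def vae_forward(x, theta, lam, rng):
    """Encode, sample with reparameterization, decode, and compute the loss."""
    mu, sigma = encoder(x, theta)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    x_tilde = decoder(z, lam)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_tilde)) / len(x)
    kl = 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                   for m, s in zip(mu, sigma))
    return x_tilde, reconstruction + kl


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d = 4, 2  # input size n and latent size d (d is a hyperparameter)
theta = {"w_mu": random_matrix(d, n, rng), "w_log_sigma": random_matrix(d, n, rng)}
lam = {"w": random_matrix(n, d, rng)}

x = [0.1, 0.7, 0.3, 0.9]
x_tilde, loss = vae_forward(x, theta, lam, rng)
print(len(x_tilde), loss)
```

Training would repeat this forward pass over the data and adjust \theta and \lambda to reduce the loss, which is left out here.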


Generation of Data

[Generation diagram: z \sim N(0,1) -> decoder g(z;\lambda) -> generated data \vec{\tilde{X}}^{n}]

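To generate data, we sample z from a unit normal since our assumption of P(z) is N(0,1)

Also, during training the divergence term made Q(z|X) similar to P(z), so the decoder has already seen latent values like the ones we sample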
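Generation only needs the decoder. In the sketch below the decoder g(z; \lambda) is a hypothetical single linear layer followed by tanh, and the weights in `lam` are made-up numbers standing in for trained parameters; the encoder is not used at all.

```python
import math
import random


def decoder(z, lam):
    """Stand-in for a trained g(z; lambda): one linear layer followed by tanh."""
    return [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
            for row, b in zip(lam["w"], lam["b"])]


def generate(num_samples, lam, rng):
    """Sample z ~ N(0, 1) in every latent dimension and decode each sample."""
    latent_dim = len(lam["w"][0])
    samples = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(decoder(z, lam))
    return samples


# Made-up parameters: latent size 2, output size 3.
lam = {"w": [[0.8, -0.3], [0.1, 0.5], [-0.6, 0.2]], "b": [0.0, 0.1, -0.1]}
rng = random.Random(0)
generated = generate(5, lam, rng)
for sample in generated:
    print([round(v, 3) for v in sample])
```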