Deep Learning: Introduction to Generative Modeling - Part 1

Srijith Rajamohan, Ph.D.

Generative Modeling

  • Discriminative Modeling has dominated most of Machine Learning and Deep Learning
  • Discriminative Modeling: e.g. for a classifier, learn a manifold (decision boundary) that separates classes in the data space
  • Generative Modeling: learn the underlying distribution of the data so you can generate new data
  • Generative Modeling is harder, but arguably closer to true Artificial Intelligence than Discriminative Learning alone
  • Two popular techniques in Deep Learning:
    • Variational Autoencoders
    • Generative Adversarial Networks


What are Variational Autoencoders?

  • Unsupervised Machine Learning algorithm
    • Supervised and semi-supervised versions exist as well
    • E.g. Conditional Variational Autoencoders are supervised
  • Most people like to think of it in terms of a regular autoencoder: an Encoder and a Decoder
  • Mathematically, the motivation is to understand the underlying latent space of high-dimensional data
  • Think of it as a form of dimensionality reduction


Autoencoders

[Figure: schematic of a regular autoencoder. Picture courtesy of Wikipedia]

Uses of Variational Autoencoders?

 

  • Anomaly detection
  • Dimensionality reduction
  • Physics-Informed GANs
    • Physics-based laws encoded into the GANs
  • Data augmentation
    • Medical data, where data is often limited
    • Domains where privacy concerns require synthetic data
    • Cases where labeled data is challenging to obtain


The Math?

Variables

  • X: the input data
  • z: the latent representation
  • P(X|z): conditional probability distribution of the input given the latent variable
  • P(z) \sim N(0,1): prior probability of the latent-space variable
  • P(z|X): conditional probability of the latent-space variable given the input


Motivation

Using Bayes' theorem,

P(z|X) = \dfrac{P(X|z)P(z)}{P(X)}

computing P(z|X) directly is intractable, since evaluating the evidence

P(X) = \int P(X|z)P(z) dz

is usually not feasible. However, we can approximate P(z|X) using Variational Inference.

Variational Inference

  • We don't know what P(z|X) is, but we want to infer it
  • We can't estimate it directly, but we can use Variational Inference
  • To do this, we approximate P(z|X) with a simpler function Q(z|X) that is easier to evaluate
P(z|X) \approx Q(z|X)

In most cases, Q(z|X) is a Normal distribution, and we try to minimize the difference between these two distributions using the KL Divergence

D_{KL}[Q(z|X) || P(z|X)]


Encoder output: Q(z|X) \sim N(\vec{\mu}(X), \vec{\sigma}(X))

KL Divergence

D_{KL}[Q(z|X) || P(z|X)] = \int Q(z|X) log \dfrac{Q(z|X)}{P(z|X)} dz
= \underset{z \sim Q(z|X)}{E}[ log \dfrac{Q(z|X)}{P(z|X)}] = \underset{z \sim Q(z|X)}{E}[ logQ(z|X) - logP(z|X)]

We do not have P(z|X), so we use Bayes' theorem to replace it:

\underset{z \sim Q(z|X)}{E}[ logQ(z|X) - log \dfrac{P(X|z)P(z)}{P(X)}]
= \underset{z \sim Q(z|X)}{E}[logQ(z|X) - logP(X|z) - logP(z) + logP(X)]


KL Divergence

Since logP(X) does not depend on z, it can be moved out of the expectation; negating and rearranging gives

logP(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[-logQ(z|X) + logP(X|z) + logP(z)]

The -logQ(z|X) and logP(z) terms combine to give another KL Divergence:

logP(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

This is the objective function for the VAE


Evidence Lower Bound

The right-hand side of the equation below is called the Evidence Lower Bound (ELBO).

logP(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
  • We want to minimize the KL Divergence between Q(z|X) and P(z|X).
  • Since logP(X) does not depend on Q, minimizing that KL term is equivalent to maximizing the LHS, and therefore the RHS, of the equation above.
  • This is called a lower bound because the KL Divergence is always greater than or equal to zero, so the RHS is a 'lower bound' on the evidence logP(X).
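Rearranging the equation above, and using the fact that the KL Divergence is non-negative, makes the bound explicit:

logP(X) = \underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)] + D_{KL}[Q(z|X) || P(z|X)]
\geq \underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]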


Let's look at each term

logP(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

  • logP(X): the (log) evidence
  • D_{KL}[Q(z|X) || P(z|X)]: the term we want to minimize
  • \underset{z \sim Q(z|X)}{E}[ logP(X|z)]: the reconstruction error
  • D_{KL}[Q(z|X) || P(z)]: the KL Divergence between the approximate function Q(z|X) and the distribution of the latent variable P(z)

What we do in practice

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
  • To accomplish our goal of minimizing the original KL divergence, we maximize the term above
  • P(X|z) represents the output of the decoder
  • Q(z|X) represents the output of the encoder
  • In a Variational Autoencoder, both the encoder and decoder are neural networks

 

The first term corresponds to the decoder (reconstruction) loss; the second term corresponds to the encoder (KL) loss.

Decoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
  • The first term then corresponds to maximizing the expected (log) likelihood of X given an input z, with respect to the set of parameters that define the decoder
  • Rephrased, it means we want to optimize the weights of the decoder neural network to minimize the error between the estimated X and the true X
  • This is the reconstruction loss
  • Use an appropriate cost function for this, e.g. MSE
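As a concrete illustration (a minimal sketch, assuming a PyTorch implementation with real-valued inputs), the reconstruction term can be estimated with a mean-squared-error loss between the decoder output and the original input:

import torch
import torch.nn.functional as F

def reconstruction_loss(x_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Reconstruction term of the ELBO, approximated by MSE.

    If the decoder models P(X|z) as a Gaussian with fixed variance,
    maximizing logP(X|z) reduces to minimizing squared error; for binary
    data, binary cross-entropy is a common alternative cost function.
    """
    return F.mse_loss(x_hat, x, reduction="sum")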

 


Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

Example architecture of a Variational Autoencoder.

Image taken from Jeremy Jordan's excellent blog on the topic https://www.jeremyjordan.me/variational-autoencoders/  


Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
  • The second term above is the KL divergence loss
  • Q(z|X)
    • Q(z|X) is the output of our encoder
    • The encoder is also represented by a neural network
    • Since Q(z|X) is assumed to be Normal, the encoder outputs a vector of means and a vector of standard deviations (see the sketch below)
  • P(z)
    • P(z) is assumed to be a normal distribution N(0,1)
    • Alternatively, it can also be a mixture of Gaussians
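A minimal sketch of such an encoder (hypothetical PyTorch code; the layer sizes are illustrative, and the network outputs log-variances rather than standard deviations, a common convention for numerical stability):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters of Q(z|X) = N(mu(x), sigma(x))."""

    def __init__(self, input_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 2):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)        # vector of means
        self.log_var = nn.Linear(hidden_dim, latent_dim)   # log sigma^2

    def forward(self, x: torch.Tensor):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)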

 

 

 


Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
  • Choosing P(z) to be a standard Normal is useful when we compute the KL divergence term for the encoder, since it simplifies the computation (see the closed form below)
  • This divergence term tries to minimize the difference between the true prior distribution of the latent variable P(z) and the approximate conditional distribution Q(z|X)
  • So Q(z|X) moves closer to a unit Normal
  • But we don't want it to collapse to the prior entirely, since then the latent codes would no longer separate the data
  • In a way, this acts as a form of regularization
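For a diagonal Gaussian Q(z|X) = N(\vec{\mu}, \vec{\sigma}^2) and P(z) = N(0,1), this KL term has a well-known closed form (stated here for reference; it is not derived on the slides):

D_{KL}[Q(z|X) || P(z)] = \dfrac{1}{2} \sum_{j} ( \mu_j^2 + \sigma_j^2 - log \sigma_j^2 - 1 )

which can be evaluated directly from the encoder outputs, with no sampling required.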

 

 


Decoder Input

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

 

  • So what goes into the decoder?
  • A value (or vector) is sampled from the distribution defined by the mean and standard deviation output by the encoder
  • This is done so that the VAE can learn to recognize not only the data input to the VAE but also data similar to it
  • A single sample per data point was shown to be sufficient for this

 

 

The encoder outputs [ \vec{\mu}, \vec{\sigma} ]; the dimensionality of these vectors is a hyperparameter. The decoder input is then sampled as

\vec{z} \sim N(\vec{\mu}(X), \vec{\sigma}(X))
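A naive way to draw this sample (a hypothetical PyTorch sketch) is to call the distribution's sample() method directly; however, gradients cannot flow back through such a stochastic node, which is exactly the problem the reparameterization trick later in these slides addresses:

import torch

# Illustrative encoder outputs for a 2-dimensional latent space
mu = torch.zeros(2, requires_grad=True)
sigma = torch.ones(2, requires_grad=True)

# Draws z ~ N(mu, sigma), but sample() does not track gradients,
# so mu and sigma receive no gradient signal from downstream losses
z = torch.distributions.Normal(mu, sigma).sample()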

Loss for VAE

 

  • Loss comes from both the reconstruction error and the divergence term
  • Without the reconstruction loss, the output will not look like the input
  • Without the divergence term, the approximate distribution Q(z|X) can learn a narrow distribution, because the KL Divergence is defined as

\int Q(z|X) log \dfrac{Q(z|X)}{P(z|X)} dz

We want our distributions to be broad enough to cover the solution space; otherwise the VAE would suffer from the same problem as a regular autoencoder, i.e. a discontinuous latent space
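Putting the two pieces together, a minimal loss sketch (assuming the PyTorch conventions of the earlier snippets, where the encoder outputs mu and log_var):

import torch
import torch.nn.functional as F

def vae_loss(x_hat: torch.Tensor, x: torch.Tensor,
             mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Negative ELBO: reconstruction error plus KL divergence to the prior N(0,1)."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL for a diagonal Gaussian posterior vs. a standard normal prior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl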


Reparameterization

 

  • There is one more problem as far as implementation goes
  • We cannot compute gradients through a stochastic (sampling) node with autodiff
  • Reparameterization addresses this by rewriting the sample as

 

\vec{z} = \vec{\mu} + \vec{\sigma} \cdot \vec{\epsilon}, \epsilon \sim N(0,1)

Picture from Jeremy Jordan's blog
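A minimal sketch of this trick in PyTorch (assuming, as in the earlier encoder sketch, that the encoder outputs log_var = log sigma^2):

import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0,1).

    The randomness lives entirely in eps, so gradients can flow back
    through mu and log_var to the encoder weights.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)      # eps ~ N(0,1), same shape as std
    return mu + std * eps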


To summarize

[Diagram: the encoder f(X; \theta) maps an input \vec{X}^{n} to the latent parameters [\vec{\mu}^d, \vec{\sigma}^d]; a latent vector is sampled from this distribution, and the decoder g(z; \lambda) produces the reconstruction \vec{\tilde{X}}^{n}.]
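As a compact end-to-end sketch (hypothetical PyTorch code combining the pieces above; the architecture and layer sizes are illustrative, not the ones used in the lecture):

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 2):
        super().__init__()
        # Encoder f(X; theta): produces the parameters of Q(z|X)
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)
        # Decoder g(z; lambda): maps a latent sample back to data space
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim))

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterized sample
        return self.dec(z), mu, log_var

Training then amounts to minimizing vae_loss(x_hat, x, mu, log_var) from the earlier sketch with a standard optimizer.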


Generation of Data

To generate new data, we sample z from a unit Normal, since our assumption for the prior P(z) is N(0,1), and pass the sample through the decoder g(z; \lambda) to obtain a new data point \vec{\tilde{X}}^{n}:

z \sim N(0,1)

This works because training also made Q(z|X) similar to P(z).
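A generation sketch under the same assumptions (it reuses the hypothetical VAE class from the summary above):

import torch

model = VAE()        # in practice, load trained weights; untrained here for illustration
model.eval()
with torch.no_grad():
    z = torch.randn(16, 2)    # 16 draws from the unit Normal prior (latent_dim = 2)
    x_new = model.dec(z)      # the decoder g(z; lambda) maps them to 16 new data points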