# Deep Learning: Introduction to Generative Modeling - Part 1

## Generative Modeling

• Discriminative Modeling has dominated most of Machine Learning and Deep Learning
• Discriminative Modeling: e.g. for a classifier, learn a decision boundary that separates the classes in the data space
• Generative Modeling: learn the underlying distribution of the data so that you can generate new data
• Generative Modeling is harder, but arguably closer to true Artificial Intelligence than Discriminative Learning alone
• Two of the popular techniques in Deep Learning:
  • Generative Adversarial Networks (GANs)
  • Variational Autoencoders (VAEs)


## What are Variational Autoencoders?

• Unsupervised Machine Learning algorithm
• Supervised and semi-supervised versions exist as well
• E.g. Conditional Variational Autoencoders are supervised
• Most people like to think of it in terms of a regular autoencoder: an Encoder and a Decoder
• Mathematically, the motivation is to understand the underlying latent space of high-dimensional data
• Think of it as a form of dimensionality reduction


## Autoencoders


[Figure: autoencoder architecture. Picture courtesy of Wikipedia]

## Uses of Variational Autoencoders?

• Anomaly detection
• Dimensionality reduction
• Physics-Informed GANs
  • Physics-based laws encoded into the GANs
• Data augmentation
  • Medical data where you have limited data
  • Data where privacy concerns require synthetic data
  • Data that is challenging to obtain labels for


## The Math?

• X: the input data
• z: the latent representation
• P(X|z): conditional probability distribution of the input given the latent variable
• P(z) \sim N(0,1): prior probability distribution of the latent space variable
• P(z|X): conditional probability of the latent space variable given the input
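Putting these together, the assumed generative process for the data can be written as

z \sim P(z) = N(0,1), \qquad X \sim P(X|z)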


## Motivation


By Bayes' theorem, the posterior over the latent variable is

P(z|X) = \dfrac{P(X|z)P(z)}{P(X)}

Computing P(z|X) this way is intractable because the evidence

P(X) = \int P(X|z)P(z)dz

is usually not feasible to compute. However, we can approximate P(z|X) using Variational Inference.

## Variational Inference

• We don't know what P(z|X) is, but we want to infer it
• We can't estimate it directly, but we can use Variational Inference
• To do this, we approximate P(z|X) with a function Q(z|X) that is easier to evaluate

P(z|X) \approx Q(z|X)

In most cases, Q(z|X) is a Normal distribution, and we try to minimize the difference between these two distributions using the KL Divergence

D_{KL}[Q(z|X) || P(z|X)]


Q(z|X) \sim N(\vec{\mu}(X), \vec{\sigma}(X))

where the mean and standard deviation vectors are the encoder outputs.

## KL Divergence

D_{KL}[Q(z|X) || P(z|X)] = \int Q(z|X) \log \dfrac{Q(z|X)}{P(z|X)} dz
= \underset{z \sim Q(z|X)}{E}[\log \dfrac{Q(z|X)}{P(z|X)}] = \underset{z \sim Q(z|X)}{E}[\log Q(z|X) - \log P(z|X)]

We don't have P(z|X), so we use Bayes' theorem to replace it:

\underset{z \sim Q(z|X)}{E}[\log Q(z|X) - \log \dfrac{P(X|z)P(z)}{P(X)}]
= \underset{z \sim Q(z|X)}{E}[\log Q(z|X) - \log P(X|z) - \log P(z) + \log P(X)]


## KL Divergence

Rearranging, and noting that \log P(X) does not depend on z and so can be pulled out of the expectation:

\log P(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[-\log Q(z|X) + \log P(X|z) + \log P(z)]

The Q(z|X) and P(z) terms combine to give another KL Divergence

\log P(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[\log P(X|z)] - D_{KL}[Q(z|X) || P(z)]

This is the objective function for the VAE


## Evidence Lower Bound

The right-hand side of the equation below is called the Evidence Lower Bound (ELBO)

\log P(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[\log P(X|z)] - D_{KL}[Q(z|X) || P(z)]

• We want to minimize the KL Divergence between Q(z|X) and P(z|X).
• Since the evidence \log P(X) is a fixed quantity, minimizing the KL Divergence term on the left is equivalent to maximizing the LHS, and hence the RHS, of the equation above.
• This is called a lower bound because the KL Divergence is always greater than or equal to zero, and therefore the RHS becomes the 'lower bound' for the log-evidence \log P(X)
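Making the bound explicit: since the KL term on the left is non-negative, dropping it gives

\log P(X) \geq \underset{z \sim Q(z|X)}{E}[\log P(X|z)] - D_{KL}[Q(z|X) || P(z)]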


## Let's look at each term

\log P(X) - D_{KL}[Q(z|X) || P(z|X)] =
\underset{z \sim Q(z|X)}{E}[\log P(X|z)] - D_{KL}[Q(z|X) || P(z)]

• \log P(X): the evidence (in log form)
• D_{KL}[Q(z|X) || P(z|X)]: the term we want to minimize
• \underset{z \sim Q(z|X)}{E}[\log P(X|z)]: the reconstruction error
• D_{KL}[Q(z|X) || P(z)]: the KL Divergence between the approximate function and the distribution of the latent variable


## What we do in practice

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
• To accomplish our goal of minimizing the original KL divergence, we maximize the term above
• P(X|z) represents the output of the decoder
• Q(z|X) represents the output of the encoder
• In a Variational Autoencoder, both the encoder and decoder are neural networks

The first term is the decoder (reconstruction) loss and the second term is the encoder (KL divergence) loss; a minimal sketch of the combined loss follows.
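As an illustration only (not from the original slides), here is a minimal PyTorch-style sketch of this objective. It assumes the encoder outputs a mean vector `mu` and a log-variance vector `logvar`, and the decoder outputs a reconstruction `recon_x` of the input `x` (all hypothetical names):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: -E[log P(X|z)], realized here as a mean-squared error
    # (a Gaussian likelihood assumption); binary cross-entropy is common for images.
    recon_loss = F.mse_loss(recon_x, x, reduction="sum")

    # KL divergence D_KL[Q(z|X) || P(z)] between the diagonal Gaussian Q(z|X)
    # and the standard normal prior P(z) = N(0, 1), in closed form.
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Minimizing this sum is equivalent to maximizing the ELBO.
    return recon_loss + kl_loss
```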


## Decoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
• The first term then becomes maximizing the expected log-likelihood of X given a latent input z, with respect to the parameters that define the decoder
• Rephrased, it means we want to optimize the weights of the decoder neural network to minimize the error between the reconstructed X and the true X
• This is the reconstruction loss
• Use an appropriate cost function for this, e.g. MSE


## Encoder

\underset{z \sim Q(z|X)}{E}[\log P(X|z)] - D_{KL}[Q(z|X) || P(z)]

Example architecture of a Variational Autoencoder.

Image taken from Jeremy Jordan's excellent blog on the topic https://www.jeremyjordan.me/variational-autoencoders/


## Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
• The second term above is the KL divergence loss
• Q(z|X)
  • Q(z|X) is the output of our encoder
  • The encoder is also represented by a neural network
  • Since Q(z|X) is assumed to be Normal, the encoder outputs a vector of means and a vector of standard deviations (a minimal sketch follows this list)
• P(z)
  • P(z) is assumed to be a standard Normal distribution N(0,1)
  • Alternatively, it can also be a mixture of Gaussians
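A possible sketch of such an encoder, assuming a flattened input of size `input_dim`, a single hidden layer of size `hidden_dim`, and a latent space of size `latent_dim` (all hypothetical names); predicting the log-variance instead of the standard deviation is a common convention rather than something prescribed by the slides:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, logvar) of Q(z|X)."""

    def __init__(self, input_dim: int, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        # Two heads: one for the mean vector, one for the log-variance.
        # Predicting log-variance keeps the implied standard deviation positive.
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x: torch.Tensor):
        h = torch.relu(self.hidden(x))
        return self.mu_head(h), self.logvar_head(h)
```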


## Encoder

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]
• Choosing P(z) to be a standard Normal is useful when computing the KL divergence term in the encoder loss, because the KL divergence between two Gaussians has a simple closed form (given after this list)
• This divergence term tries to minimize the difference between the true prior distribution of the latent variable, P(z), and the approximate conditional distribution Q(z|X)
• So Q(z|X) moves closer to a unit Normal
• But we don't want it to collapse to the prior entirely, since then the latent codes would no longer separate the data
• In a way, this acts as a form of regularization
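For reference (a standard result, not derived on the slides): when Q(z|X) = N(\vec{\mu}, \vec{\sigma}^2) with diagonal covariance and P(z) = N(0,1), the divergence term has the closed form

D_{KL}[Q(z|X) || P(z)] = \dfrac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)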


## Decoder Input

\underset{z \sim Q(z|X)}{E}[ logP(X|z)] - D_{KL}[Q(z|X) || P(z)]

• So what goes into the decoder?
• A value (or vector) is sampled from the distribution defined by the mean and standard deviation output by the encoder
• This is done so that the VAE can learn to reconstruct not only the exact data input to the VAE but also data similar to it
• A single sample per input was shown to be sufficient for this


The encoder output [ \vec{\mu}, \vec{\sigma} ] defines this distribution; the dimensionality of these vectors is a hyperparameter.


The decoder input is then sampled as \vec{z} \sim N(\vec{\mu}(X), \vec{\sigma}(X))

## Loss for VAE

• The loss comes from both the reconstruction error and the divergence term
• Without the reconstruction loss, the output will not look like the input
• Without the divergence term, the approximate distribution Q(z|X) can learn a narrow distribution, because the KL Divergence was defined as

\int Q(z|X) \log \dfrac{Q(z|X)}{P(z|X)} dz

We want our distributions to be broad so that they cover the latent space; otherwise the VAE would suffer from the same problem as a regular autoencoder, i.e. a discontinuous latent space.


## Reparameterization

• There is one more problem as far as implementation goes
• We cannot backpropagate gradients through a random sampling node with automatic differentiation
• The reparameterization trick addresses this by expressing the sample as

\vec{z} = \vec{\mu} + \vec{\sigma} \cdot \vec{\epsilon}, \quad \vec{\epsilon} \sim N(0,1)

[Figure: the reparameterization trick. Picture from Jeremy Jordan's blog]
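A minimal sketch of this trick in PyTorch, assuming (as in the earlier sketch) that the encoder outputs `mu` and a log-variance `logvar`:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # std = exp(0.5 * log(sigma^2)) = sigma
    std = torch.exp(0.5 * logvar)
    # epsilon is drawn from N(0, 1); the randomness lives outside the
    # computation graph, so gradients flow through mu and std.
    eps = torch.randn_like(std)
    return mu + std * eps
```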


## To summarize

The encoder f(X;\theta) maps an n-dimensional input \vec{X}^{n} to the latent parameters [\vec{\mu}^{d}, \vec{\sigma}^{d}]; sampling from this distribution gives \vec{z}, which the decoder g(z;\lambda) maps back to the reconstruction \vec{\tilde{X}}^{n}.
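Putting the pieces together, a minimal end-to-end sketch, assuming the hypothetical `Encoder` class, `reparameterize` function, and `vae_loss` from the earlier sketches are in scope:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """g(z; lambda): maps a latent sample z back to a reconstruction of the input."""

    def __init__(self, latent_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

class VAE(nn.Module):
    """f(X; theta) followed by sampling and g(z; lambda)."""

    def __init__(self, input_dim: int, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim, input_dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.encoder(x)      # parameters of Q(z|X)
        z = reparameterize(mu, logvar)    # sample z with gradients intact
        recon_x = self.decoder(z)         # reconstruction, i.e. P(X|z)
        return recon_x, mu, logvar        # pass these to vae_loss during training
```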


## Generation of Data

z \sim N(0,1)

\vec{\tilde{X}}^{n} = g(z;\lambda)

To generate data, we sample z from a unit Normal, since our assumption for P(z) is N(0,1), and pass it through the decoder g(z;\lambda).

This works because training also made Q(z|X) similar to P(z), so latent vectors drawn from the prior decode into plausible data.
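As a final illustrative sketch, assuming a trained instance `vae` of the hypothetical VAE class above and a latent dimensionality `latent_dim`:

```python
import torch

# `vae` and `latent_dim` are assumed to come from the earlier sketches.
vae.eval()
with torch.no_grad():
    z = torch.randn(16, latent_dim)   # 16 samples from the N(0,1) prior P(z)
    new_data = vae.decoder(z)         # decode the samples into new, synthetic data
```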