Deep Learning: Introduction to Generative Modeling - Part 1
Srijith Rajamohan, Ph.D.
Generative Modeling
- Discriminative Modeling has dominated most of Machine Learning and Deep Learning
- Discriminative Modeling: e.g. for a classifier, learn a decision boundary that separates the classes in the data space
- Generative Modeling: learn the underlying distribution of the data so that you can generate new data
- Generative Modeling is harder, 'but closer to true Artificial Intelligence than simply Discriminative Learning'
- Two of the most popular techniques in Deep Learning:
- Variational Auto-encoders
- Generative Adversarial Networks
What are Variational Autoencoders?
- Unsupervised Machine Learning algorithm
- Supervised and semi-supervised versions exist as well
- E.g. Conditional Variational Autoencoders are supervised
- Most people like to think of it in terms of a regular autoencoder, i.e. an encoder and a decoder
- Mathematically, the motivation is to understand the underlying latent space of high-dimensional data
- Think of it as dimensionality reduction
Autoencoders
Figure: architecture of a standard autoencoder (picture courtesy of Wikipedia)
Uses of Variational Autoencoders?
- Anomaly detection
- Dimensionality reduction
- Physics-Informed GANs
- Physics-based laws encoded into the GANs
- Data augmentation
- Medical applications where data is limited
- Data where privacy concerns require synthetic data
- Challenging to obtain labeled data
The Math?

Variables:
- X : the input data
- z : the latent representation
- P(X|z) : the conditional probability distribution of the input given the latent variable
- P(z) : the probability of the latent space variable
- P(z|X) : the conditional probability of the latent space variable given the input
Motivation
Using Bayes theorem to compute P(z|X) is intractable since the computation of P(X) is usually not feasible
However, we can approximate P(z|X) using Variational Inference
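Written out (the standard form of Bayes theorem, reproduced here for reference):

P(z|X) = P(X|z) P(z) / P(X),  where  P(X) = ∫ P(X|z) P(z) dz

The integral over the latent space in the denominator is what makes P(X), and hence the posterior P(z|X), intractable in general.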
Variational Inference
- We don't know what P(z|X) is but we want to infer it
- We can't estimate it directly but we can use Variational Inference
- To do this we approximate P(z|X) with a simpler function Q(z|X) that is easier to evaluate
In most cases, Q(z|X) is a Normal distribution and we try to minimize the difference between these two distributions using the KL Divergence
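For reference, the KL divergence between two distributions Q and P is defined (standard definition) as

D_KL[Q || P] = ∫ Q(z) log( Q(z) / P(z) ) dz = E_{z~Q}[log Q(z) - log P(z)]

Minimizing this quantity pulls Q(z|X) towards P(z|X).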
KL Divergence

The divergence between the encoder output Q(z|X) and the true posterior P(z|X) is

D_KL[Q(z|X) || P(z|X)] = E_{z~Q}[log Q(z|X) - log P(z|X)]

We don't have P(z|X), so we use Bayes Theorem to replace it, which gives

D_KL[Q(z|X) || P(z|X)] = E_{z~Q}[log Q(z|X) - log P(X|z) - log P(z)] + log P(X)
KL Divergence

Rearranging, the log Q(z|X) and log P(z) terms combine to give another KL Divergence:

log P(X) - D_KL[Q(z|X) || P(z|X)] = E_{z~Q}[log P(X|z)] - D_KL[Q(z|X) || P(z)]

This is the objective function for the VAE
Evidence Lower Bound
The right hand side of the equation is called the Evidence Lower Bound
- We want to minimize the KL Divergence of P(z|X) and Q(z|X).
- Since the evidence log P(X) does not depend on Q, minimizing the KL Divergence term on the left is equivalent to maximizing the LHS or the RHS of the equation above.
- This is called a Lower Bound because the KL Divergence is always greater than or equal to zero, and therefore the RHS becomes a 'lower bound' on the log evidence log P(X)
Let's look at each term

log P(X) - D_KL[Q(z|X) || P(z|X)] = E_{z~Q}[log P(X|z)] - D_KL[Q(z|X) || P(z)]

- log P(X) : the evidence
- D_KL[Q(z|X) || P(z|X)] : the term we want to minimize
- E_{z~Q}[log P(X|z)] : the reconstruction error
- D_KL[Q(z|X) || P(z)] : the KL Divergence between the approximate function and the prior distribution of the latent variable
What we do in practice
- To accomplish our goal of minimizing the original KL divergence, we maximize the RHS above, i.e. E_{z~Q}[log P(X|z)] - D_KL[Q(z|X) || P(z)]
- P(X|z) represents the output of the decoder
- Q(z|X) represents the output of the encoder
- In a Variational Autoencoder, both the encoder and decoder are neural networks
Here, the expectation term E_{z~Q}[log P(X|z)] is the decoder loss and D_KL[Q(z|X) || P(z)] is the encoder loss.
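To make this concrete, here is a minimal sketch of the two networks in PyTorch. This is an illustrative implementation rather than the one from the slides: the input size of 784 (e.g. flattened MNIST), the hidden width, and the latent dimension of 2 are all assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Q(z|X): maps the input to the mean and log-variance of the latent Gaussian."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)        # vector of means
        self.log_var = nn.Linear(hidden_dim, latent_dim)   # vector of log-variances

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """P(X|z): maps a latent sample back to the data space."""
    def __init__(self, latent_dim=2, hidden_dim=256, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim), nn.Sigmoid())  # outputs in [0, 1]

    def forward(self, z):
        return self.net(z)
```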
Decoder
- The first term then becomes maximizing the expected log-likelihood of X given an input z and the set of parameters that define the decoder
- Rephrased, it means we want to optimize the weights of the decoder neural network to minimize the error between the estimated X and the true X
- This is the reconstruction loss
- Use an appropriate cost function for this, e.g. MSE (a sketch follows below)
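A minimal sketch of this reconstruction term, assuming MSE and the illustrative Encoder/Decoder above (x is a batch of flattened inputs, x_hat the decoder output):

```python
import torch.nn.functional as F

def reconstruction_loss(x, x_hat):
    # Sum the squared error over features, average over the batch.
    return F.mse_loss(x_hat, x, reduction="sum") / x.shape[0]
```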
Encoder
Example architecture of a Variational Autoencoder.
Image taken from Jeremy Jordan's excellent blog on the topic https://www.jeremyjordan.me/variational-autoencoders/
Encoder
- The loss term seen above is the KL divergence loss
- Q(z|X)
- Q(z|X) is the result of our encoder
- The encoder is also represented by a neural network
- Since Q(z|X) is assumed to be a Normal distribution, the encoder outputs a vector of means and a vector of standard deviations
- P(z)
- P(z) is assumed to be a normal distribution N(0,1)
- Alternatively, it can also be a mixture of Gaussians
Encoder
- It is clear how P(z) being a standard normal is useful when we try to compute the KL divergence in the encoder, since the divergence then has a simple closed form (shown below)
- This divergence term tries to minimize the difference between the true prior distribution of the latent variable P(z) and the approximate conditional distribution Q(z|X)
- So Q(z|X) moves closer to a unit normal
- But we don't want it to collapse exactly to the unit normal, since then there would be no separability of the data
- In a way this acts as a form of regularization
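For reference, when Q(z|X) = N(μ, diag(σ²)) and P(z) = N(0, I), this divergence has the standard closed form

D_KL[N(μ, diag(σ²)) || N(0, I)] = (1/2) Σ_j (μ_j² + σ_j² - log σ_j² - 1)

which is what makes the standard normal prior convenient in practice.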
Decoder Input
- So what goes into the decoder?
- A value (or vector) is sampled from the distribution formed by the mean and standard deviation output by the encoder
- This is done so that the VAE can learn to recognize not only the data input to the VAE but also data similar to it
- Just a single sample was shown to be sufficient for this
The dimensionality of the latent vectors is a hyperparameter
Loss for VAE
- Loss comes from both the reconstruction error and the divergence term
- Without the reconstruction loss, the output will not look like the input
- Without the divergence term, defined as D_KL[Q(z|X) || P(z)], the approximate distribution Q(z|X) can collapse to a narrow distribution
We want our distributions to be broad so that they cover the solution space; otherwise the VAE would suffer from the same problem as a regular autoencoder, i.e. a discontinuous solution space
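Putting both pieces together, a minimal sketch of the total loss, assuming the illustrative encoder outputs mu and log_var and the decoder outputs x_hat:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction term: how well the decoder reproduces the input.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Divergence term: closed-form KL between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return (recon + kl) / x.shape[0]
```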
Reparameterization
- There is one more problem as far as implementation goes
- We cannot compute the gradients through a probabilistic node with autodiff
- Reparameterization is accomplished by writing the sample as

z = μ + σ ⊙ ε,  where  ε ~ N(0, I)

so that the randomness is isolated in ε and gradients can flow through μ and σ
(Picture from Jeremy Jordan's blog)
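A minimal sketch of the trick, assuming the encoder outputs mu and log_var as in the earlier sketches:

```python
import torch

def reparameterize(mu, log_var):
    sigma = torch.exp(0.5 * log_var)   # standard deviation
    eps = torch.randn_like(sigma)      # noise drawn from N(0, I)
    return mu + sigma * eps            # gradients flow through mu and sigma, not eps
```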
To summarize

The encoder maps the input X to a mean and standard deviation, a latent vector z is sampled from this distribution using the reparameterization trick, and the decoder reconstructs X from z.
Generation of Data
To generate data, we sample z from a unit normal, since our assumption for P(z) is N(0,1) and training has made Q(z|X) similar to P(z)
The sampled z is then passed through the decoder to produce new data
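As a sketch, again assuming the illustrative Decoder defined earlier (with trained weights loaded in practice):

```python
import torch

decoder = Decoder(latent_dim=2)   # illustrative; load trained weights in practice
z = torch.randn(16, 2)            # sample 16 latent vectors from the unit normal N(0, I)
with torch.no_grad():
    samples = decoder(z)          # decoded samples, shape (16, 784)
```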