Introduction to Generative Modeling
François Lanusse @EiffL
Do you know this person?
Probably not, this is a randomly generated person: thispersondoesntexist.com
What is generative modeling?
 The goal of generative modeling is to learn the distribution p(x) from which the training set is drawn
 Usually, this means building a parametric model p_θ(x) that tries to be close to the true distribution p(x)
Figure: samples from the true distribution vs. samples from the model
Why it isn't that easy
 The curse of dimensionality puts all points far apart in high dimension
 Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimension because of all the gaps
Figure: distribution of distances between pairs of points drawn from a Gaussian distribution
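This concentration of distances is easy to check numerically. A minimal sketch (sample sizes and dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distance_spread(dim, n=500):
    """Relative spread (std/mean) of pairwise distances between
    n standard-Gaussian samples in `dim` dimensions."""
    x = rng.standard_normal((n, dim))
    # Squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T
    iu = np.triu_indices(n, k=1)
    d = np.sqrt(np.clip(d2[iu], 0, None))
    return d.std() / d.mean()

print(pairwise_distance_spread(2))     # sizeable relative spread
print(pairwise_distance_spread(1000))  # distances concentrate: spread shrinks
```

In low dimension, pairwise distances vary a lot; in high dimension they all cluster around a common value, which is exactly what breaks distance-based density estimators like KDE.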
So how do we get to this?
Hint: Deep Learning is involved...
The Evolution of Deep Generative Models
 Deep Belief Network
(Hinton et al. 2006)
 Variational AutoEncoder
(Kingma & Welling 2014)
 Generative Adversarial Network
(Goodfellow et al. 2014)
 Wasserstein GAN
(Arjovsky et al. 2017)
A Visual Turing Test
Fake images from a PixelCNN
Real SDSS images
How are these models useful for physics?
They are data-driven models that can complement physical models
VAE model of galaxy morphology
 They can learn from real data
 They can learn from simulations
 They can be orders of magnitude faster than a proper simulation and significantly speed up parts of an analysis
Simulation of Dark Matter Maps
 They can be used alongside physical models to solve diverse problems
Figure: observations, model, model convolved with the PSF, and residuals; observed data vs. imagined solutions
Deep Generative Models (DGMs) are a vast domain of research
Grathwohl et al. 2019
 We will focus on a subset of Latent Variable Models today: GANs and VAEs
Latent Variable Models
 We model p_θ(x) using a mapping g_θ from a latent distribution to data space.
 To draw a sample from p_θ(x), follow this recipe:
 Draw a latent variable z from a known/fixed distribution of low dimension, e.g. a Gaussian: z ~ N(0, I)
 Transform this random variable to the data space using a deep neural network g_θ: x = g_θ(z)
 The goal of the game is to find the parameters θ so that g_θ(z) ends up looking realistic.
Problem: in the data, I only have access to the output x. How can I train g_θ if I never see the input z?
Why do we expect this to work? We are assuming that the data can actually be represented on a low-dimensional manifold parametrized by the latent space.
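The sampling recipe can be sketched in a few lines. Here an untrained random-weight MLP stands in for a trained generator g_θ; all shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim, data_dim = 8, 784  # e.g. 28x28 images, flattened

# Stand-in for a trained generator g_theta: a tiny two-layer MLP
# with random (untrained) weights, just to show the plumbing.
W1 = rng.standard_normal((latent_dim, 128)) * 0.1
W2 = rng.standard_normal((128, data_dim)) * 0.1

def g_theta(z):
    # ReLU hidden layer, tanh output to keep pixels in [-1, 1]
    return np.tanh(np.maximum(z @ W1, 0) @ W2)

# Step 1: draw latent variables from a fixed low-dimensional prior
z = rng.standard_normal((16, latent_dim))
# Step 2: map them to data space with the network
x = g_theta(z)
print(x.shape)  # (16, 784)
```

With random weights the samples are of course noise; training is what shapes g_θ so that these samples look like data.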
Part I: AutoEncoders
The idea of autoencoding:
Introduce a second network: the encoder
The encoder tries to guess the latent variable z that generated the image
Encoder
Decoder
The benefits of AutoEncoding
 Because the code is low-dimensional, it forces the model to compress the information as efficiently as possible into just a few numbers.
> A great way to do dimensionality reduction
AutoEncoded MNIST digits in 2d
 Because they cannot preserve all of the information, they discard "noise"; this means they can be used as denoisers
> Denoising AutoEncoders
 Because they only know how to reconstruct a specific type of data, they will fail on an example from a different dataset
> Anomaly detection
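As a toy illustration of the compression idea: for a purely linear autoencoder, the optimal encoder/decoder pair is given by PCA, which we can obtain directly from an SVD (the data here is synthetic, chosen to secretly live on a 2-D plane):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 10-D that actually live on a 2-D plane.
z_true = rng.standard_normal((200, 2))
x = z_true @ rng.standard_normal((2, 10))

# The best *linear* autoencoder with a 2-D bottleneck is PCA:
# encode with the top-2 right singular vectors, decode with their transpose.
U, s, Vt = np.linalg.svd(x, full_matrices=False)
encode = Vt[:2].T   # 10 -> 2
decode = Vt[:2]     # 2 -> 10

code = x @ encode        # each point compressed to 2 numbers
x_hat = code @ decode    # reconstruction from the code

print(np.mean((x - x_hat) ** 2))  # ~0: the 2-D code loses nothing here
```

A deep autoencoder plays the same game with nonlinear encode/decode maps, which lets it compress data living on curved manifolds, not just planes.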
Examples in Physics:
 Searching for New Physics with Deep Autoencoders, Farina et al. 2018
 Variational Autoencoders for New Physics Mining at the Large Hadron Collider, Cerri et al. 2019
Let's try it out!
 Guided tutorial on Colab at this link.
Main Takeaway
 AutoEncoders can work very well to compress data
 They can't be directly used as generative models because we don't know a priori how the latent space gets distributed
Part II: Variational AutoEncoders
What is the difference to a normal AutoEncoder?
 An AutoEncoder has only one constraint: encode and then decode as best you can
> We never ask it to make sure that the latent variables follow a particular distribution. If the latent space has any regularity... it's only by chance
 A Variational AutoEncoder tries to make sure that the latent variables follow a desired prior distribution p(z):
because if we know that the latent space of the autoencoder is Gaussian distributed, we can sample from it.
 To achieve this, a VAE is trained using the Evidence Lower Bound (ELBO):
Reconstruction Error
Code Regularization
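Written out, with encoder q_φ(z|x), decoder p_θ(x|z) and prior p(z), the ELBO is the standard lower bound on the log evidence:

```latex
\log p_\theta(x) \;\geq\;
\underbrace{\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]}_{\text{reconstruction error}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)}_{\text{code regularization}}
```

Maximizing the right-hand side trades off reconstructing x well against keeping the code close to the prior.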
Unpacking the ELBO
 The Likelihood term p_θ(x|z)
> Probability of the image x if the latent variable z is known. This needs to assume some knowledge of the statistics of the signal x
In this case, this is equivalent to the AE reconstruction loss if p_θ(x|z) is Gaussian, i.e. log p_θ(x|z) ∝ −‖x − g_θ(z)‖²
Unpacking the ELBO
 The Posterior q_φ(z|x)
> Tries to estimate the probability of the latent variable z if the image x is known. This is what the encoder models.
 The Kullback-Leibler Divergence D_KL(q_φ(z|x) ‖ p(z))
A measure of how different two distributions are: the Kullback-Leibler Divergence
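For the usual choice of a diagonal-Gaussian encoder q_φ(z|x) = N(μ, diag(σ²)) and a standard normal prior p(z) = N(0, I), this KL term has a simple closed form:

```latex
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
= \frac{1}{2} \sum_i \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right)
```

It vanishes exactly when μ = 0 and σ = 1, i.e. when the code matches the prior.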
Unpacking the ELBO
The ELBO is maximal when the output is close to the input x, and the code distribution is close to a Gaussian
Reconstruction Error
Code Regularization
How do we build a network that outputs distributions?
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# Build model: the last layer outputs the parameters (loc, scale)
# of an independent 1-D Normal distribution.
model = tf.keras.Sequential([
  tf.keras.layers.Dense(1 + 1),
  tfp.layers.IndependentNormal(1),
])

# Define the loss function: negative log likelihood of the data
# under the predicted distribution.
negloglik = lambda y, q: -q.log_prob(y)

# Do inference.
model.compile(optimizer='adam', loss=negloglik)
model.fit(x, y, epochs=500)

# Make predictions.
yhat = model(x_tst)
Let's try it out!
 Guided tutorial on Colab at this link.
Part III: Generative Adversarial Networks
What is a GAN?
 It is again a Latent Variable Model
 Contrary to a VAE, a GAN does not try to bound or estimate the likelihood p_θ(x); instead the parameters are estimated by Adversarial Training.
 The Discriminator is trained to classify between real and fake images
 The Generator is trained to generate images that the discriminator will think are real
Traditional GAN (Goodfellow 2014)
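Concretely, both objectives reduce to binary cross-entropy on the discriminator's outputs. A framework-free sketch, using the non-saturating generator loss that is standard in practice (the probabilities here are illustrative placeholders for discriminator outputs):

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy of probabilities p against a 0/1 label."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()

def discriminator_loss(d_real, d_fake):
    # The discriminator wants real samples -> 1 and fake samples -> 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # The generator wants the discriminator to label its samples as real.
    return bce(d_fake, 1.0)

# A confident, correct discriminator achieves a low loss...
print(discriminator_loss(np.array([0.9]), np.array([0.1])))
# ...which leaves the generator with a high loss, and vice versa.
print(generator_loss(np.array([0.1])))
```

In training, the two losses are minimized in alternation: one (or a few) discriminator steps, then one generator step, each updating only its own network's parameters.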
Spoiler Alert: GANs are difficult to train
 In this competition between generator and discriminator, you have to make sure they are of similar strength.
 Typically training is not convergent: the GAN doesn't settle on a solution but is constantly shifting. You may get better results in the middle of training than at the end!
 Beware of mode collapse!
BigGAN
VQ-VAE
 Vanishing gradients far from data distribution
Arjovsky et al. 2017
WGAN-GP: Your Go-To GAN
 The discriminator/critic computes a distance between two distributions (of real and fake images): a Wasserstein distance (hence the W).
> This requires certain constraints on the critic (that's where the GP comes in).
 The generator tries to minimize that distance
 Training of the WGAN remains efficient even when the two distributions are far apart.
What is this GP thing?
 For the derivation of the WGAN to work, it requires a Lipschitz bound on the critic.
The Gradient Penalty is a way to impose that condition on the critic
How far will this take us?
128x128 images, state of the art in 2017
WGANGP
1024x1024, state of the art circa end of 2019
This is extremely compute-intensive and extremely technical
Let's try it out!
 Guided tutorial on Colab at this link.
Slides for ANF Machine Learning