Lecture 2: Generative Models - Autoregressive
Shen Shen
April 2, 2025
2:30pm, Room 32-144
Slides adapted from Kaiming He
Modeling with Machine Learning for Computer Science






Discriminative vs. Generative Models
discriminative:
- "feature/sample" \(x\) to "label" \(y\)
- one "desired" output
generative:
- "label" \(y\) to "feature/sample" \(x\)
- many possible outputs




Discriminative vs. Generative Models
discriminative: \(p(y \mid x)\)
generative: \(p(x \mid y)\)
Generative models are about \(p(x \mid y)\)
What can be \(y\)?
- condition
- constraint
- labels
- attributes
...
What can be \(x\)?
- “data”
- samples
- observations
- measurements
...

Model the task as \(p(x \mid y)\)
\(y\) : condition/constraint (e.g., symmetry)
\(x\) : generated protein structures
Protein structure generation
Model the task as \(p(x \mid y)\)
\(y\) : text prompt
\(x\) : generated image/video
Text-to-image/video generation
Prompt: teddy bear teaching a course, with
"generative models" written on blackboard

Model the task as \(p(x \mid y)\)
\(y\) : prompt
\(x\) : response of a chatbot
Natural language conversation

Model the task as \(p(x \mid y)\)
\(y\) : prompt
\(x\) : generated 3D content
Text-to-3D structure generation

Figure credit: Tang, et al. LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. ECCV 2024
Model the task as \(p(x \mid y)\)
\(y\) : class label
\(x\) : generated image
Class-conditional image generation
Image generated by: Li, et al. Autoregressive Image Generation without Vector Quantization, 2024
red fox

Model the task as \(p(x \mid y)\)
\(y\) : an implicit condition ("images following the CIFAR-10 distribution")
\(x\) : generated CIFAR-10-like images
"Unconditional" image generation
Images generated by: Karras, et al. Elucidating the Design Space of Diffusion-Based Generative Models, NeurIPS 2022

Model the task as \(p(x \mid y)\)
\(y\) : an image as the “condition”
\(x\) : probability of classes conditioned on the image
Classification (a generative perspective)


Model the task as \(p(x \mid y)\)
\(y\) : an image as the “condition”
\(x\) : plausible descriptions conditioned on the image
Open-vocabulary recognition


Model the task as \(p(x \mid y)\)
\(y\) : an image as the “condition”
\(x\) : plausible descriptions conditioned on the image
Image Captioning


Model the task as \(p(x \mid y)\)
\(y\) : visual and other sensory observations
\(x\) : policies (probability of actions)
Policy Learning in Robotics


Chi, et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023
In a discriminative model, modeling \(p(y \mid x)\) typically means: for any given data \(x\), find the label \(y\) that maximizes \(p(y \mid x)\)
\[\operatorname{argmax}_y\, p(y \mid x)\]
We care about the absolute/relative value of \(p(y \mid x)\) for each \(y\) given \(x\)
What do we mean by modeling \(p(x \mid y)\)?
In a generative model, modeling \(p(x \mid y)\) typically means: for the given conditioning \(y\), synthesize/sample data \(x\) following \(p(x \mid y)\)
- We don't want to get only the mode \(\operatorname{argmax}_x\, p(x \mid y)\)
- We care less about the exact value of \(p(x \mid y)\)
- Generative models that can estimate \(p(x \mid y)\) are called "explicit generative models"
- Generative models that do not directly estimate \(p(x \mid y)\) are called "implicit generative models"
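To make the contrast concrete, here is a minimal sketch (the categorical distribution is made up, not from the lecture): a discriminative-style readout takes the argmax, while a generative-style readout samples, so repeated queries can return different outputs.

```python
# Toy contrast between the two readouts (the distribution here is made up).
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.6, 0.3])          # some p(. | conditioning), hypothetical

# Discriminative-style readout: the single most likely outcome.
print(p.argmax())                       # always 1

# Generative-style readout: sample following the distribution.
print([int(rng.choice(3, p=p)) for _ in range(5)])  # varies from call to call
```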
What do we mean by modeling \(p(x \mid y)\)?
Why is this hard?


Can discriminative models be generative?
By Bayes' rule:
\[
p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}
\]
- \(p(y \mid x)\): given by the discriminative model
- \(p(x)\): we still need to model the prior distribution of \(x\)
- \(p(y)\): constant for a given \(y\), assuming a known prior over the label \(y\)
(In the discriminative direction, \(p(y \mid x) = p(x \mid y)\, p(y) / p(x)\), the denominator \(p(x)\) is constant for a given \(x\).)
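A small numerical sketch of the point above, with made-up tables: even given a perfect discriminative model \(p(y \mid x)\), inverting it into \(p(x \mid y)\) still requires the prior \(p(x)\).

```python
# Bayes inversion on a discrete toy problem (all numbers are made up).
import numpy as np

# p(y | x): rows index x (2 samples), columns index y (2 labels).
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
p_x = np.array([0.7, 0.3])              # the prior over x we still must model

p_xy = p_y_given_x * p_x[:, None]       # joint p(x, y)
p_y = p_xy.sum(axis=0)                  # evidence p(y), constant for a given y
p_x_given_y = p_xy / p_y                # Bayes' rule: p(x | y)
print(p_x_given_y[:, 0])                # p(x | y=0), sums to 1
```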



Learning to represent probabilistic distributions
Draw \(z \sim \pi\) from a simple base distribution and map it through a learned function so that the outputs approximate the data distribution \(p_x\).
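As a tiny illustration (with an exponential target of my choosing, not from the slides): pushing \(z \sim \pi\) through even a fixed transform already yields samples from a different \(p_x\); generative models learn such a transform.

```python
# Sampling by transforming a simple base distribution (toy, known transform).
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(size=100_000)     # z ~ pi, here Uniform(0, 1)
x = -np.log(1.0 - z)              # inverse-CDF transform: x ~ Exponential(1)
print(x.mean(), x.var())          # both approximately 1
```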




- Modeling a high-dimensional distribution is always hard: \(p(x)=p\left(x_1, x_2, \ldots, x_{hwc}\right)\), e.g., an image with \(h \times w \times c\) entries
- Modeling a low-dimensional distribution is easy: a discriminative model!
- Can we convert the problem of modeling a high-dimensional distribution into modeling low-dimensional distributions? (See the chain-rule decomposition below.)
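One standard way to do this conversion, of which the lecture's autoregressive decomposition is an instance, is the chain rule of probability: write the joint as a product of one-dimensional conditionals,
\[
p(x_1, x_2, \ldots, x_d) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots = \prod_{i=1}^{d} p(x_i \mid x_1, \ldots, x_{i-1}).
\]
Each factor \(p(x_i \mid x_{<i})\) is a distribution over a single \(x_i\), i.e., a low-dimensional, discriminative-style prediction problem.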




[Figure: an image as a 3D array: height × width × depth (channels)]
Not all parts of distribution modeling are done by learning
e.g., the autoregressive model:
- the dependency graph is what engineers designed (not learned)
- the function mappings are learned (via various hypothesis class choices)
e.g., the diffusion model:
- the dependency graph is what engineers designed (not learned)
- the function mappings are learned (via various hypothesis class choices)
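A minimal sketch of this split, with a made-up vocabulary size, context length, and an untrained stand-in network (none of this is from the lecture): the left-to-right dependency graph is fixed by design, while the mapping from prefix to next-token distribution is the learned part.

```python
# Designed vs. learned parts of an autoregressive model (toy example):
# - designed: the left-to-right factorization p(x) = prod_i p(x_i | x_<i)
# - learned: the function mapping a prefix to a distribution over the next token
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8      # hypothetical token vocabulary size
CONTEXT = 4    # hypothetical fixed context length

# "Learned" mapping: a random linear layer stands in for a trained network.
W = rng.normal(size=(CONTEXT * VOCAB, VOCAB))

def next_token_probs(prefix):
    """Map the (padded, one-hot) prefix to p(x_i | x_<i) via softmax."""
    padded = ([0] * CONTEXT + list(prefix))[-CONTEXT:]
    feats = np.eye(VOCAB)[padded].ravel()          # one-hot encode the context
    logits = feats @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sample_sequence(length):
    """Autoregressive inference: sample one token at a time, left to right."""
    x = []
    for _ in range(length):
        p = next_token_probs(x)
        x.append(int(rng.choice(VOCAB, p=p)))
    return x

print(sample_sequence(10))
```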

Summary
- Joint distribution \(\Rightarrow\) product of conditionals
- Inductive bias:
  - shared architecture, shared weights
  - induced order
- Inference: autoregressive
How to represent \(x\), \(y\), and their dependence?
Financial Modeling (Quantitative Finance & Trading)
Autoregressive Models in Wireless Communications (Channel Modeling)
Structural Health Monitoring (Predictive Maintenance)
Energy Forecasting (Electrical Grid Load Prediction)
Network Traffic Modeling (Networking & Cybersecurity)
Deep Generative Models may involve:
- Formulation:
- formulate a problem as a probabilistic model
- decompose complex distributions into simple and tractable ones
- Representation: deep neural networks to represent data and their distributions
- Objective function: to measure how good the predicted distribution is
- Optimization: optimize the networks and/or the decomposition
- Inference:
- sampler: to produce new samples
- probability density estimator (optional)
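For the "objective function" bullet, a common concrete choice for autoregressive models is the negative log-likelihood under the product-of-conditionals decomposition; here is a minimal sketch with a made-up binary conditional model:

```python
# NLL(x) = -sum_i log p(x_i | x_<i) for a given conditional model (toy).
import numpy as np

def sequence_nll(seq, cond_probs):
    """Negative log-likelihood of a sequence under an autoregressive model."""
    nll = 0.0
    for i, token in enumerate(seq):
        p = cond_probs(seq[:i])        # model's distribution over the next token
        nll -= np.log(p[token])
    return nll

# Made-up binary conditional model: after a 1, predict 1 with probability 0.9.
def toy_cond_probs(prefix):
    p1 = 0.9 if (prefix and prefix[-1] == 1) else 0.5
    return np.array([1.0 - p1, p1])

print(sequence_nll([1, 1, 0, 1], toy_cond_probs))
```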
Thanks!
We'd love to hear your thoughts.