Intro to Machine Learning

Lecture 12: Unsupervised Learning

Shen Shen

May 3, 2024

(many slides adapted from Phillip Isola and Tamara Broderick)

Logistics

  • This is the last regular lecture; next week is the last regular week.
  • Friday, May 10
    • During lecture time, we'll discuss future topics (generative AI).
    • All assignments are due by the end of the day.
  • Tuesday, May 14
    • 20-day extensions can be applied through this day.
    • Last regular OHs (for checkoffs/homework, etc.); afterwards, only instructor OHs.
    • Final exam review: 6-8pm in 10-250.
  • The end-of-term subject evaluations are open. We'd love to hear your thoughts on 6.390: this provides valuable feedback for us and for students in future semesters!
  • Check out final exam logistics on introML homepage.

Outline

  • Recap: Supervised learning and reinforcement learning
  • Unsupervised learning
  • Clustering: \(k\)-means algorithm
    • Clustering vs. classification
    • Initialization matters
    • \(k\) matters
  • Auto-encoder
    • Compact representation
  • Unsupervised learning again -- representation learning and beyond
Supervised learning

\(f: x \rightarrow y\)

  • Explicit supervision via labels \(y\).
  • Both regression and classification try to predict accurate, exact labels.

Reinforcement learning

  • Implicit (evaluative), sequential, iterative supervision via rewards.

(Figure: the learner interacting with its environment.)


Unsupervised learning

 

  • No supervision, i.e., no labels or rewards.
  • Try to learn something "interesting" using only the features:
    • Clustering: learn "similarity" of the data points.
    • Autoencoder: learn compression/reconstruction/representation.
  • Useful paradigm on its own; often empowers downstream tasks.


Food distribution placement

  • \(x_1\): longitude, \(x_2\): latitude
  • Person \(i\) location \(x^{(i)}\)
  • Food truck \(j\) location \(\mu^{(j)}\)
  • Q: where should I have my \(k\) food trucks park?
  • Want to minimize the total "loss" over all the people we serve.
  • Loss if person \(i\) walks to truck \(j\): \(\left\|x^{(i)}-\mu^{(j)}\right\|_2^2\)
  • Index of the truck that person \(i\) is assigned to walk to: \(y^{(i)}\)

\(k\)-means objective

Loss over all people: \(\sum_{i=1}^n \sum_{j=1}^k \mathbf{1}\left\{y^{(i)}=j\right\}\left\|x^{(i)}-\mu^{(j)}\right\|_2^2\)
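To make the objective concrete, here is a small NumPy sketch (not from the course materials; the array names X, mu, and y are illustrative) that evaluates the loss above for a given assignment:

```python
import numpy as np

def kmeans_objective(X, mu, y):
    """Sum over people of the squared distance to their assigned truck.

    X:  (n, d) array of person locations
    mu: (k, d) array of truck locations
    y:  (n,) array of assigned truck indices
    """
    return float(np.sum((X - mu[y]) ** 2))
```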

Walking through the algorithm on the food-truck picture:

  • Line 1: the truck locations \(\mu\) start from a random initialization.
  • Lines 4-5 (assignment step): each person \(i\) gets assigned to the nearest food truck \(j\) (color-coded), \(y^{(i)}=\arg \min _j\left\|x^{(i)}-\mu^{(j)}\right\|^2\).
  • Lines 6-7 (update step): each food truck \(j\) gets moved to the "central" (mean) location of all people assigned to it, \(\mu^{(j)}=\frac{1}{N_j} \sum_{i=1}^n \mathbb{1}\left\{y^{(i)}=j\right\} x^{(i)}\), where \(N_j = \sum_{i=1}^n \mathbf{1}\left\{y^{(i)}=j\right\}\).
  • These (people-assignment, then truck-movement) updates continue; at some point the assignments and truck locations stop changing, and the algorithm terminates (lines 8-9).

The full pseudocode:
K-MEANS\((k, \tau, \left\{x^{(i)}\right\}_{i=1}^n)\)

1   \(\mu, y=\) random initialization
2   for \(t=1\) to \(\tau\)
3       \(y_{\text{old}}=y\)
4       for \(i=1\) to \(n\)
5           \(y^{(i)}=\arg \min _j\left\|x^{(i)}-\mu^{(j)}\right\|^2\)
6       for \(j=1\) to \(k\)
7           \(\mu^{(j)}=\frac{1}{N_j} \sum_{i=1}^n \mathbb{1}\left\{y^{(i)}=j\right\} x^{(i)}\)
8       if \(y==y_{\text{old}}\)
9           break
10  return \(\mu, y\)
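Below is a minimal NumPy sketch of this pseudocode, under the assumption that the data is an (n, d) array and that "random initialization" means picking \(k\) data points as the initial means (one common choice; the pseudocode leaves it unspecified):

```python
import numpy as np

def kmeans(X, k, tau=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].copy()   # line 1: random init
    y = np.zeros(n, dtype=int)
    for t in range(tau):                                   # line 2
        y_old = y.copy()                                   # line 3
        # lines 4-5: assign each point to its nearest mean
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, k)
        y = dists.argmin(axis=1)
        # lines 6-7: move each mean to the average of its assigned points
        for j in range(k):
            if np.any(y == j):                             # guard: skip empty clusters
                mu[j] = X[y == j].mean(axis=0)
        if np.array_equal(y, y_old):                       # lines 8-9: assignments stable
            break
    return mu, y                                           # line 10
```

(The empty-cluster guard is an implementation detail the pseudocode glosses over.)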


(Figure: a \(k\)-means clustering result.)

Compare to classification

  • Did we just do \(k\)-class classification?
  • Looks like we assigned label \(y^{(i)}\), which takes \(k\) different values, to each feature vector \(x^{(i)}\)
  • But we didn't use any labeled data
  • The "labels" here don't have meaning; I could permute them and get the same partition (see the sketch after this list).
  • Output is really a partition of the data.
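A tiny sketch (illustrative names only) of the point above: relabeling the clusters changes \(y\) but not the grouping it encodes.

```python
import numpy as np

y = np.array([0, 0, 1, 2, 2, 1])                 # one clustering of 6 points
y_permuted = np.array([2, 2, 1, 0, 0, 1])        # same clustering, labels 0 and 2 swapped

def as_partition(y):
    # the grouping itself: which data indices share a cluster
    return {frozenset(np.flatnonzero(y == j)) for j in np.unique(y)}

print(as_partition(y) == as_partition(y_permuted))   # True: identical partitions
```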

Compare to classification

  • So what did we do?
  • We clustered the data: we grouped the data by similarity
  • Why not just plot the data? We should! Whenever we can!
  • But plots alone aren't always enough: we may need precise groupings, and data can be big, high-dimensional, and high-volume.
  • An example of unsupervised learning: no labeled data, and we're finding patterns


Effect of initialization

  • A theorem says that, if run for enough outer iterations (line 2), the \(k\)-means algorithm converges to a local minimum of the \(k\)-means objective.
  • That local minimum could be bad!
  • The initialization can make a big difference.
  • One common option: random restarts, i.e., run \(k\)-means from several random initializations and keep the result with the lowest objective (see the sketch below).
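A hedged sketch of random restarts, reusing the hypothetical kmeans and kmeans_objective helpers sketched earlier: run the algorithm from several random initializations and keep the clustering with the lowest objective.

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(200, 2))    # toy stand-in for real data

best_obj, best_mu, best_y = np.inf, None, None
for seed in range(10):                                 # 10 random restarts
    mu, y = kmeans(X, k=3, tau=100, seed=seed)
    obj = kmeans_objective(X, mu, y)
    if obj < best_obj:                                 # keep the lowest-objective run
        best_obj, best_mu, best_y = obj, mu, y
```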

(Figures: running \(k\)-means from different random initializations can converge to different clusterings.)

Effect of \(k\)

(Figures: different choices of \(k\) produce different partitions of the same data.)
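To see why choosing \(k\) is a judgment call, a short sketch (again reusing the hypothetical helpers above): the achievable objective only shrinks as \(k\) grows, reaching zero when every point is its own cluster, so the objective alone cannot pick \(k\) for us.

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(200, 2))    # toy data, n = 200
for k in (1, 2, 3, 5, 10, 200):
    mu, y = kmeans(X, k, tau=100, seed=0)
    print(k, kmeans_objective(X, mu, y))               # objective typically decreases with k
```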

Auto-encoder

  • A network \(\mathcal{F}\) maps the input \(x\) to a reconstruction \(\tilde{x}=\mathcal{F}(x;\theta)\).
  • Training minimizes the reconstruction error: \(\min_{\theta} \|x - \tilde{x}\|^2\).
  • Squeezing \(x\) through a low-dimensional bottleneck forces \(\mathcal{F}\) to learn a compact representation of the data.
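A minimal PyTorch sketch of this objective (an illustration, not the course's implementation; the encoder/decoder architecture and sizes are assumptions): a small bottleneck of dimension m < d is what makes the learned representation compact.

```python
import torch
import torch.nn as nn

d, m = 784, 32                                    # input dim, bottleneck dim (assumed)
encoder = nn.Sequential(nn.Linear(d, m), nn.ReLU())
decoder = nn.Linear(m, d)
model = nn.Sequential(encoder, decoder)           # F(x; theta) = decoder(encoder(x))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.rand(256, d)                            # toy stand-in for real data

for step in range(200):
    x_tilde = model(X)                            # reconstruction of X
    loss = ((X - x_tilde) ** 2).mean()            # squared reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

z = encoder(X)                                    # compact representation, shape (256, m)
```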


Unsupervised learning again: representation learning and beyond

  • Self-supervision (masking): hide part of the input and train a model to predict it.
  • Density estimation.
  • Dall-E 2 (UnCLIP): [https://arxiv.org/pdf/2204.06125.pdf]

Thanks!

We'd appreciate your feedback on the lecture.

Summary

  • Unsupervised learning takes in a data set of \(x\)'s and tries to find some interesting or useful underlying structure.
  • Clustering is an important kind of unsupervised learning, in which we try to divide the \(x\)'s into a finite set of groups that are in some sense similar.
  • One of the trickiest things about clustering is picking the objective function. A widely used one is the \(k\)-means objective (Eq. 12.1 in the notes). It also requires a distance metric on the \(x\)'s.
  • We could try to optimize it using generic optimization methods (e.g., gradient descent), but there is a convenient special-purpose method for finding a local optimum: the \(k\)-means algorithm. The solution obtained by the \(k\)-means algorithm is sensitive to initialization.
  • Choosing the number of clusters \(k\) is an important problem. It is a kind of judgment call: we can drive the \(k\)-means objective to zero by putting each \(x\) in its own cluster, so the choice has more to do with the original, abstract goal of finding interesting or useful underlying structure in the data.
