Welcome to 6.390!
Team
~50 awesome LAs
Class meetings
assignments
Hours:
Lec: 1.5 hr
Rec + Lab: 3 hr
Notes + exercise: 2 hr
Homework: 6-7 hr
Exercises: released Thursday 9am, due the following Tuesday 9am.
Relatively easy questions based on that week’s lecture and notes reading.
Lecture: Thursday, 11am–12:30pm, 10-250. Recorded.
Overviews the technical content and ties together the high-level motivations, concepts, and stories.
Recitation: Friday, various sections. See introml for exact time.
Assumes you have read the notes and done the exercises; start on the homework.
Homework: released Friday 9am; due Wednesday (12 days later) at 11:59pm.
Harder questions: concepts, mechanics, implementations.
Lab: Tuesdays, various sections. Synchronous. See website for exact time and room. In-class empirical exploration of concepts; work with partner(s) on questions; check-off conversation with a staff member.
Detailed exam logistics will be posted 3 weeks before the exam date.
Things we expect you to know (we use these constantly, but don’t teach them explicitly):
Given:
Derive:
A model might:
traditionally
supervised learning
unsupervised learning
reinforcement learning
nowadays
reinforcement learning
supervised learning
unsupervised learning
RLHF (ChatGPT etc.)
Toddler demo, Russ Tedrake thesis, 2004
(Uses vanilla policy gradient (actor-critic))
Optimization + first-principles physics
DARPA Robotics Challenge, 2015
In 6.390:
supervised learning
unsupervised learning
reinforcement learning
Topics in order:
supervised
unsupervised
reinforcement
Model class:
Optimization:
Learning process:
Modeling choices:
Many other ways to dissect the field
[These lists are neither exhaustive nor exclusive.]
We first focus on an instance of supervised learning known as regression.
example: predicting a city's daily energy consumption
Features: Temperature (°C). Label: Energy used (GWh).

City | Temperature (°C) | Energy used (GWh)
---|---|---
Chicago | 25 | 51
New York | 28 | 57
Boston | 31 | 63
San Diego | 35 | 71
[Plot: energy used \(y\) vs. temperature \(x_1\); toy data, for illustration only]
Training data:
\(x^{(1)} =\begin{bmatrix} x_1^{(1)} \\[4pt] x_2^{(1)} \\[4pt] \vdots \\[4pt] x_d^{(1)} \end{bmatrix} \in \mathbb{R}^d\)
label
feature vector
\(y^{(1)} \in \mathbb{R}\)
\(\mathcal{D}_\text{train}:=\)
\(\left\{\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(n)}, y^{(n)}\right)\right\}\)
\(n = 4, d = 1\): [plot of energy used \(y\) vs. temperature \(x_1\)]
\(n = 4, d = 2\): [plot of energy used \(y\) vs. temperature \(x_1\) and population \(x_2\)]
e.g. for \(d = 2\): \((x^{(1)}, y^{(1)}) =\left(\begin{bmatrix} x_1^{(1)} \\[4pt] x_2^{(1)} \end{bmatrix}, y^{(1)}\right)\)
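For concreteness, here is a minimal sketch (purely illustrative, not part of the course materials) of how the \(d = 1\) training set from the toy table above could be stored in Python:

```python
import numpy as np

# d = 1 toy training set from the table above (for illustration only):
# feature x_1 = temperature (°C), label y = energy used (GWh)
x_train = np.array([[25.0], [28.0], [31.0], [35.0]])  # each row is a feature vector x^(i) in R^d
y_train = np.array([51.0, 57.0, 63.0, 71.0])          # each entry is a label y^(i) in R

n, d = x_train.shape                                  # n = 4 training points, d = 1 feature
D_train = list(zip(x_train, y_train))                 # {(x^(1), y^(1)), ..., (x^(n), y^(n))}
print(n, d, D_train[0])
```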
[Diagram: training data \(\mathcal{D}_\text{train} = \left\{\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(n)}, y^{(n)}\right)\right\}\), with each \(x^{(i)} \in \mathbb{R}^d\) and \(y^{(i)} \in \mathbb{R}\), fed into the Regression Algorithm 💻]
What do we want from the regression algorithm?
A good way to label new features, i.e. a good hypothesis.
Suppose our friend's algorithm proposes \(h(x)=10\)
hypothesis \(h(x)=10\); the loss \(\mathcal{L}\left(h\left(x^{(i)}\right), y^{(i)}\right) \) measures how far a prediction is from its label, e.g. the gap \(h\left(x^{(4)}\right) - y^{(4)} \)
[Plot: energy used \(y\) vs. temperature \(x\), with the training points and the horizontal line \(h(x)=10\)]
Training error: \(\mathcal{E}_{\text {train }}(h)=\frac{1}{n} \sum_{i=1}^n \mathcal{L}\left(h\left(x^{(i)} \right), y^{(i)}\right)\)
e.g. squared loss \(\mathcal{L}\left(h\left(x^{(i)}\right), y^{(i)}\right) = (h\left(x^{(i)}\right) - y^{(i)} )^2\)
with squared loss, the training error is the mean squared error (MSE)
Test error: \(\mathcal{E}_{\text {test }}(h)=\frac{1}{n^{\prime}} \sum_{i=n+1}^{n+n^{\prime}} \mathcal{L}\left(h\left(x^{(i)}\right), y^{(i)}\right)\)
computed on \(n'\) unseen data points, i.e. the test data
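To make these definitions concrete, here is a minimal sketch (my own illustration, reusing the toy table above) that scores the constant hypothesis \(h(x)=10\) with squared loss:

```python
import numpy as np

# Toy training data from the table above: temperature (°C) -> energy used (GWh)
x_train = np.array([25.0, 28.0, 31.0, 35.0])
y_train = np.array([51.0, 57.0, 63.0, 71.0])

def h(x):
    """The friend's proposed constant hypothesis: always predict 10."""
    return np.full_like(np.asarray(x, dtype=float), 10.0)

def squared_loss(guess, actual):
    return (guess - actual) ** 2

# Training error: average loss over the n training points (with squared loss, this is the MSE)
E_train = np.mean(squared_loss(h(x_train), y_train))
print(E_train)  # large, since h(x) = 10 is far below every label
# The test error is the same average, just taken over n' unseen (test) points instead.
```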
Hypothesis class \(\mathcal{H}\): the set of \(h\) we ask the algorithm to search over.
\(\{\)constant functions\(\}\) (less expressive) \(\subset\) \(\{\)linear functions\(\}_1\) (more expressive)
e.g. constant hypotheses \(h_1(x)=10\), \(h_2(x)=20\), \(h_3(x)=30\); a linear hypothesis \(h(x)=\theta x + \theta_0\)
[Plots: energy used \(y\) vs. temperature \(x\), showing these hypotheses]
1. Technically, affine functions; people tend to be flexible about this terminology in ML.
[Diagram: training data \(\mathcal{D}_\text{train} = \left\{\left(x^{(1)}, y^{(1)}\right), \dots, \left(x^{(n)}, y^{(n)}\right)\right\}\), with each \(x^{(i)} \in \mathbb{R}^d\) and \(y^{(i)} \in \mathbb{R}\), fed into the Regression Algorithm 💻, which outputs a hypothesis 🧠]
Quick summary
\(h\left(x ; \theta\right) = \left[\begin{array}{llll} \theta_1 & \theta_2 & \cdots & \theta_d\end{array}\right] \left[\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_d\end{array}\right] = \theta^T x\)
(\(\theta\): parameters)
Linear least squares regression
\(\mathcal{L}\left(h\left(x^{(i)}\right), y^{(i)}\right) =(\theta^T x^{(i)}- y^{(i)} )^2\)
[Plot: energy used \(y\) vs. temperature \(x_1\) and population \(x_2\), with a fitted plane; for now, ignoring the offset]
\(h(x; \theta) = \theta^T x\), where \(x\) is the vector of features
Features: Temperature, Population. Label: Energy Used.

City | Temperature | Population | Energy Used
---|---|---|---
Chicago | 90 | 7.2 | 45
New York | 20 | 9.5 | 32
Boston | 35 | 8.4 | 99
San Diego | 18 | 4.3 | 39
\(X =\begin{bmatrix}90 & 7.2 \\20 & 9.5\\35 & 8.4 \\18 & 4.3\end{bmatrix}\)
\(Y =\begin{bmatrix}45 \\32 \\99 \\39\end{bmatrix}\)
\(\theta =\begin{bmatrix}\theta_1 \\\theta_2\end{bmatrix}\)
Let
\(X = \begin{bmatrix}x_1^{(1)} & \dots & x_d^{(1)}\\\vdots & \ddots & \vdots\\x_1^{(n)} & \dots & x_d^{(n)}\end{bmatrix} \in \mathbb{R}^{n\times d}, \quad Y = \begin{bmatrix}y^{(1)}\\\vdots\\y^{(n)}\end{bmatrix} \in \mathbb{R}^{n\times 1}, \quad \theta = \begin{bmatrix}\theta_{1}\\\vdots\\\theta_{d}\end{bmatrix} \in \mathbb{R}^{d\times 1}\)
Then
\( J(\theta) =\frac{1}{n}({X} \theta-{Y})^{\top}({X} \theta-{Y}) \in \mathbb{R}^{1\times 1}\)
e.g. using the table and the matrices \(X\), \(Y\), \(\theta\) above, we want to show that the objective function \( J(\theta) =\frac{1}{n}({X} \theta-{Y})^{\top}({X} \theta-{Y})\) is exactly the training error (MSE) with squared loss:
deviations: \( {X} \theta - Y = \begin{bmatrix}90\theta_1 + 7.2\theta_2 - 45 \\20\theta_1 + 9.5\theta_2 - 32 \\35\theta_1 + 8.4\theta_2 - 99 \\18\theta_1 + 4.3\theta_2 - 39\end{bmatrix} = \begin{bmatrix}e_1 \\e_2\\e_3\\e_4\end{bmatrix}\)
summing the squared deviations: \(({X} \theta-{Y})^{\top}({X} \theta-{Y}) =\begin{bmatrix}e_1 & e_2 & e_3 & e_4\end{bmatrix}\begin{bmatrix}e_1 \\e_2\\e_3\\e_4\end{bmatrix} = e_1^2 + e_2^2 + e_3^2 + e_4^2\)
dividing by \(n = 4\) gives the mean of the squared deviations, i.e. the MSE.
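As a quick numerical check of this equivalence (an illustrative sketch, not course-provided code; the particular \(\theta\) below is an arbitrary guess):

```python
import numpy as np

# Example data from the table above: (temperature, population) -> energy used
X = np.array([[90, 7.2],
              [20, 9.5],
              [35, 8.4],
              [18, 4.3]])
Y = np.array([[45.0], [32.0], [99.0], [39.0]])
n = X.shape[0]

theta = np.array([[0.5], [2.0]])   # arbitrary parameter guess, for illustration only

# Matrix form: J(theta) = (1/n) (X theta - Y)^T (X theta - Y)
e = X @ theta - Y                  # column vector of deviations e_1, ..., e_4
J_matrix = (e.T @ e).item() / n

# Per-example form: mean of the squared deviations
J_mean = np.mean(e ** 2)

print(J_matrix, J_mean)            # the two values match
```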
Objective function (training error)
\( J(\theta) =\frac{1}{n}({X} \theta-{Y})^{\top}({X} \theta-{Y})\)
[1d case walk-through on board]
goal: find \(\theta\) to minimize \(J(\theta)\)
1. For \(f: \mathbb{R}^m \rightarrow \mathbb{R}\), its gradient \(\nabla f: \mathbb{R}^m \rightarrow \mathbb{R}^m\) is defined at the point \(p=\left(x_1, \ldots, x_m\right)\) as \(\nabla f(p)=\left[\begin{array}{c}\partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_m\end{array}\right]\) evaluated at \(p\).
2. Sometimes the gradient is undefined or ill-behaved, but today it is well-behaved.
3. The gradient can be symbolic or numerical, just like a derivative can be a function or a number.
example: \(f(x) = \cos(x)\); its symbolic gradient: \(\nabla f(x) = -\sin(x)\)
evaluating the symbolic gradient at a point gives a numerical gradient:
\(\frac{d}{dx} \cos(x) \bigg|_{x = -4} = -\sin(-4) \approx -0.7568, \qquad \frac{d}{dx} \cos(x) \bigg|_{x = 5} = -\sin(5) \approx 0.9589\)
4. The gradient points in the direction of the (steepest) increase in the function value.
5. The gradient at the function minimizer is necessarily zero.
\(\nabla_\theta J=\left[\begin{array}{c}\partial J / \partial \theta_1 \\ \vdots \\ \partial J / \partial \theta_d\end{array}\right] = \frac{2}{n}\left(X^T X \theta-X^T Y\right)\)
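A quick numerical check of this gradient formula (a sketch of my own, comparing against central finite differences on the example data; the step size is an arbitrary choice):

```python
import numpy as np

X = np.array([[90, 7.2], [20, 9.5], [35, 8.4], [18, 4.3]])
Y = np.array([[45.0], [32.0], [99.0], [39.0]])
n, d = X.shape

def J(theta):
    e = X @ theta - Y
    return (e.T @ e).item() / n

theta = np.array([[0.5], [2.0]])   # arbitrary point at which to check the gradient

# Analytic gradient from the formula above: (2/n) (X^T X theta - X^T Y)
grad_analytic = (2 / n) * (X.T @ X @ theta - X.T @ Y)

# Numerical gradient via central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros((d, 1))
for j in range(d):
    step = np.zeros((d, 1))
    step[j] = eps
    grad_numeric[j] = (J(theta + step) - J(theta - step)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4))  # True
```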
Setting \(\nabla_\theta J = 0\) and solving gives \(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\)
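Solving that system in numpy is a one-liner (an illustrative sketch on the example data; note that solving \(X^\top X\theta = X^\top Y\) with `np.linalg.solve` is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

X = np.array([[90, 7.2], [20, 9.5], [35, 8.4], [18, 4.3]])
Y = np.array([[45.0], [32.0], [99.0], [39.0]])
n = X.shape[0]

# theta* = (X^T X)^{-1} X^T Y, computed by solving (X^T X) theta = X^T Y
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Fact 5 above: the gradient at the minimizer is zero
grad_at_star = (2 / n) * (X.T @ X @ theta_star - X.T @ Y)
print(theta_star.ravel())
print(np.allclose(grad_at_star, 0))  # True (up to floating point)
```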
Beauty of \(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\)
1. "center" the data
How to deal with \(\theta_0\)?
when data is centered, the optimal offset is guaranteed to be 0
centering
City | Temperature | Population | Energy Used
---|---|---|---
Chicago | 90 | 7.2 | 45
New York | 20 | 9.5 | 32
Boston | 35 | 8.4 | 100
San Diego | 18 | 4.3 | 39

after centering (subtract each column's mean):

City | Temperature | Population | Energy Used
---|---|---|---
Chicago | 49.25 | -0.15 | -9.00
New York | -20.75 | 2.15 | -22.00
Boston | -5.75 | 1.05 | 46.00
San Diego | -22.75 | -3.05 | -15.00
all column-wise sums \(= 0\)
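A small sketch (illustrative only) that reproduces the centered table by subtracting each column's mean:

```python
import numpy as np

# Original columns: temperature, population, energy used (one row per city)
data = np.array([[90, 7.2,  45],
                 [20, 9.5,  32],
                 [35, 8.4, 100],
                 [18, 4.3,  39]], dtype=float)

# Centering: subtract each column's mean from that column
centered = data - data.mean(axis=0)

print(centered)              # matches the centered table above
print(centered.sum(axis=0))  # every column now sums to (numerically) zero
```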
2. Append a "fake" feature of \(1\)
\(h\left(x ; \theta, \theta_0\right)=\theta^T x+\theta_0\)
\( = \left[\begin{array}{lllll} \theta_1 & \theta_2 & \cdots & \theta_d\end{array}\right]\) \(\left[\begin{array}{l}x_1 \\ x_2 \\ \vdots \\ x_d\end{array}\right] + \theta_0\)
\( = \left[\begin{array}{lllll} \theta_1 & \theta_2 & \cdots & \theta_d & \theta_0\end{array}\right]\) \(\left[\begin{array}{c}x_1 \\ x_2 \\ \vdots \\ x_d \\ 1\end{array}\right] \)
\( = \theta_{\mathrm{aug}}^T x_{\mathrm{aug}}\)
Another way to handle offsets is to trick our model: treat the bias as just another feature, always equal to 1.
How to deal with \(\theta_0\)?
Previously: \( J(\theta) =\frac{1}{n}({X} \theta-{Y})^{\top}({X} \theta-{Y})\), with \(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\)
Now: with the appended feature of \(1\) (i.e. the augmented \(X\) and \(\theta\)), the same closed-form solution applies: \(\theta^*=\left({X}^{\top} {X}\right)^{-1} {X}^{\top} {Y}\)
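A minimal sketch of this second approach on the example data (illustrative only): append a column of 1s and reuse the same closed-form formula, so the last learned weight plays the role of \(\theta_0\):

```python
import numpy as np

X = np.array([[90, 7.2], [20, 9.5], [35, 8.4], [18, 4.3]])
Y = np.array([[45.0], [32.0], [99.0], [39.0]])

# Append a "fake" feature that is always 1
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Same formula as before, just with the augmented X
theta_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
theta, theta_0 = theta_aug[:-1], theta_aug[-1, 0]
print(theta.ravel(), theta_0)  # theta_0 is the learned offset
```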
Looking ahead: we'll discuss all of these next week.
We'd love to hear your thoughts.
Prompt engineered by:
Lyrics:
Melody and Vocal:
Video Production: