Sarah Dean
asst prof in CS at Cornell
Prof Sarah Dean
automated system
environment
action
measurement
training data \(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
features
predicted label
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
policy
observation
action
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
observation
prediction
sampled i.i.d. from \(\mathcal D\)
\(x\sim\mathcal D_{x}\)
Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)
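The i.i.d. goal above can be sketched numerically: draw fresh samples \((x,y)\sim\mathcal D\) and check that a candidate predictor's average loss is small. The joint distribution and predictor below are toy assumptions for illustration, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy joint distribution D (an assumption for illustration):
    # x ~ N(0, 1) and y = 2x + small Gaussian noise.
    x = rng.normal(size=n)
    y = 2 * x + 0.1 * rng.normal(size=n)
    return x, y

f = lambda x: 2 * x           # a candidate predictor f: X -> Y
x, y = sample(100_000)        # fresh i.i.d. samples from D
avg_sq_error = np.mean((y - f(x)) ** 2)  # average squared loss on new data
print(avg_sq_error)           # close to the noise variance, 0.01
```

Because this predictor matches the toy distribution's conditional mean, its average loss on new samples is just the irreducible noise level.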
model
\(f_t:\mathcal X\to\mathcal Y\)
observation
prediction
\(x_t\)
Goal: cumulatively over time, predictions \(\hat y_t = f_t(x_t)\) are close to true \(y_t\)
accumulate
\(\{(x_t, y_t)\}\)
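A minimal sketch of this online loop, assuming (my choice, not the slides') a linear model class \(f_t(x) = w_t x\) updated by online gradient descent on squared loss:

```python
import numpy as np

rng = np.random.default_rng(1)
w = 0.0            # parameter of the current model f_t(x) = w * x
lr = 0.05          # step size (an arbitrary toy choice)
cum_loss = 0.0

for t in range(2000):
    x_t = rng.normal()
    y_t = 3.0 * x_t                    # toy ground truth for the simulation
    y_hat = w * x_t                    # predict with the current model f_t
    cum_loss += (y_t - y_hat) ** 2     # suffer loss, then observe the true y_t
    w -= lr * 2 * (y_hat - y_t) * x_t  # fold (x_t, y_t) into the next model f_{t+1}

print(w)   # approaches 3.0, so per-round loss shrinks over time
```

The sequence of models \(f_1, f_2, \dots\) improves as data accumulates, which is exactly the cumulative-loss goal stated above.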
policy
\(\pi_t:\mathcal X\to\mathcal A\)
observation
action
\(x_t\)
Goal: cumulatively over time, actions \(\pi_t(x_t)\) achieve high reward
\(a_t\)
accumulate
\(\{(x_t, a_t, r_t)\}\)
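One simple strategy for this setting is \(\varepsilon\)-greedy exploration. The sketch below ignores the context \(x_t\) for simplicity (a multi-armed bandit rather than a contextual one), and the reward means and \(\varepsilon\) are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3                          # action set A = {0, 1, 2} (toy assumption)
true_means = [0.2, 0.5, 0.8]   # unknown expected rewards, used only to simulate
counts = np.zeros(K)
means = np.zeros(K)            # estimates built from the accumulated (a_t, r_t)
eps = 0.1
total_reward = 0.0

for t in range(5000):
    if rng.random() < eps:
        a = int(rng.integers(K))           # explore a random action
    else:
        a = int(np.argmax(means))          # exploit the current estimates
    r = true_means[a] + 0.1 * rng.normal() # observe reward r_t
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a] # running-average update
    total_reward += r

print(np.argmax(means))   # the policy concentrates on the best action
```

Exploration is what lets the accumulated data \(\{(a_t, r_t)\}\) cover all actions, so the exploitation step eventually achieves high reward.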
policy
\(\pi_t:\mathcal X^t\to\mathcal A\)
observation
action
\(x_t\)
Goal: select actions \(a_t\) to bring environment to high-reward state
\(a_t\)
accumulate
\(\{(x_t, a_t, r_t)\}\)
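The distinguishing feature here is that actions change the environment's state. A tiny sketch under my own toy assumptions (known one-dimensional dynamics, greedy one-step planning rather than learning):

```python
# Toy assumptions: the state x_t is an integer position, actions move
# left/stay/right, and reward is highest at a goal state.
goal = 7

def reward(x):
    return -abs(x - goal)   # high reward means being near the goal

x = 0
for t in range(10):
    # Greedy policy: pick the action whose next state has highest reward.
    a = max([-1, 0, 1], key=lambda a: reward(x + a))
    x = x + a               # the environment responds to the action

print(x, reward(x))   # the actions drive the state to the high-reward goal
```

Unlike the bandit setting, the reward of an action depends on the state reached by all previous actions, which is why the policy is allowed to depend on the whole history \(\mathcal X^t\).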
Participation expectation: actively ask questions and contribute to discussions
Ex - classification
\(\ell(y,\hat y)\) measures "loss" of predicting \(\hat y\) when it's actually \(y\)
Ex - regression
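The two standard examples can be written out concretely — the zero-one loss for classification and the squared loss for regression (function names are mine):

```python
def zero_one_loss(y, y_hat):
    # Classification: loss 1 if the predicted label is wrong, else 0.
    return 0 if y == y_hat else 1

def squared_loss(y, y_hat):
    # Regression: penalizes large deviations more heavily than small ones.
    return (y - y_hat) ** 2

print(zero_one_loss("sit", "stand"))  # 1: the labels disagree
print(squared_loss(2.0, 2.5))         # 0.25
```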
The risk of a predictor \(f\) over a distribution \(\mathcal D\) is the expected (average) loss
$$\mathcal R(f) = \mathbb E_{x,y\sim\mathcal D}[\ell(y, f(x))]$$
Claim: The predictor with the lowest possible risk is \(f^\star(x) = \arg\min_{\hat y\in\mathcal Y} \mathbb E_{y\sim\mathcal D_{y|x}}[\ell(y, \hat y)]\)
Proof: exercise. Hint: use tower property of expectation.
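One way the hinted argument can go (a sketch, not the official solution), writing \(f^\star(x) = \arg\min_{\hat y\in\mathcal Y} \mathbb E_{y\sim\mathcal D_{y|x}}[\ell(y, \hat y)]\) for the pointwise minimizer: for any predictor \(f\),

```latex
\mathcal R(f)
  = \mathbb E_{x,y\sim\mathcal D}\big[\ell(y, f(x))\big]
  = \mathbb E_{x}\Big[\, \mathbb E_{y\sim\mathcal D_{y|x}}\big[\ell(y, f(x))\big] \Big]
  \;\ge\; \mathbb E_{x}\Big[ \min_{\hat y\in\mathcal Y} \mathbb E_{y\sim\mathcal D_{y|x}}\big[\ell(y, \hat y)\big] \Big]
  = \mathcal R(f^\star)
```

where the middle equality is the tower property and the inequality holds because the minimum is taken pointwise in \(x\).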
Loss determines trade-offs between (potentially inevitable) errors
Ex - sit/stand classifier with \(x=\) position of face in frame
In many domains, decisions have moral and legal significance
Harms can occur at many levels
Fundamental Theorem of Supervised Learning:
Empirical risk minimization
$$\hat f = \arg\min_{f\in\mathcal F} \frac{1}{N} \sum_{i=1}^N \ell(y_i, f(x_i))$$
empirical risk \(\mathcal R_N(f)\)
1. Representation
2. Optimization
3. Generalization
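For squared loss over the class of linear predictors, empirical risk minimization has a closed form, which makes a compact sketch (the data-generating process and constants are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy training data {(x_i, y_i)} (assumed for illustration).
x = rng.normal(size=200)
y = 1.5 * x + 0.1 * rng.normal(size=200)

# Representation: the class F of linear predictors f_w(x) = w * x.
# Optimization: minimize the empirical risk R_N(f_w) over w; for squared
# loss this has the closed form w = <x, y> / <x, x>.
w = (x @ y) / (x @ x)

empirical_risk = np.mean((y - w * x) ** 2)
print(w, empirical_risk)   # w near 1.5, empirical risk near 0.01
```

Representation fixes the class \(\mathcal F\), optimization finds the minimizer within it, and generalization asks whether this empirical risk is close to the true risk \(\mathcal R(\hat f)\) on fresh samples.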
Next time: more on fairness & non-discrimination, then linear regression case study
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
performance depends on representation, optimization, and generalization
Ref: Ch 2-3 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org