Overview

ML in Feedback Sys #1

Prof Sarah Dean

Machine Learning

[Diagram: training data $\{(x_i, y_i)\}$; model $f:\mathcal X\to\mathcal Y$; features; predicted label]

Feedback Systems

[Diagram: automated system; environment; action; measurement]

ML in Feedback Systems

[Diagram: training data $\{(x_i, y_i)\}$; model $f:\mathcal X\to\mathcal Y$; policy; observation; action]

ML in Feedback Systems

[Diagram: training data $\{(x_i, y_i)\}$; model $f:\mathcal X\to\mathcal Y$; observation; prediction]

Supervised learning

data sampled i.i.d. from $\mathcal D$, with $x\sim\mathcal D_{x}$

Goal: for a new sample $(x,y)\sim \mathcal D$, the prediction $\hat y = f(x)$ is close to the true $y$

Online learning

model $f_t:\mathcal X\to\mathcal Y$
observation $x_t$
prediction $\hat y_t = f_t(x_t)$
accumulate $\{(x_t, y_t)\}$

Goal: cumulatively over time, predictions $\hat y_t = f_t(x_t)$ are close to true $y_t$
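
The online learning protocol can be sketched as a predict-then-update loop. This is a minimal illustration, not the lecture's algorithm: it assumes a scalar linear model $f_t(x) = w_t x$, squared loss, and online gradient descent as the update rule; the step size and the synthetic data stream are likewise assumptions made for the example.

```python
import random

def online_gradient_descent(stream, step_size=0.1):
    """Run the predict-then-update loop on a stream of (x_t, y_t) pairs.

    Assumes a scalar linear model f_t(x) = w_t * x and squared loss.
    Returns the final weight and the cumulative loss.
    """
    w = 0.0
    cumulative_loss = 0.0
    for x_t, y_t in stream:
        y_hat = w * x_t                        # predict before y_t is revealed
        cumulative_loss += (y_hat - y_t) ** 2  # loss suffered at time t
        grad = 2 * (y_hat - y_t) * x_t         # gradient of squared loss in w
        w -= step_size * grad                  # this update defines f_{t+1}
    return w, cumulative_loss

rng = random.Random(0)
# Hypothetical stream: y_t = 2 * x_t exactly, with x_t drawn from [-1, 1].
stream = [(x, 2.0 * x) for x in (rng.uniform(-1, 1) for _ in range(500))]
w_final, total_loss = online_gradient_descent(stream)
```

The quantity of interest is `total_loss`, the cumulative loss over the whole stream, rather than the loss of the final model alone.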

(Contextual) Bandits

policy $\pi_t:\mathcal X\to\mathcal A$
observation $x_t$
action $a_t$
accumulate $\{(x_t, a_t, r_t)\}$

Goal: cumulatively over time, actions $\pi_t(x_t)$ achieve high reward
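
One way to make the bandit loop concrete is an epsilon-greedy policy over a small, finite set of contexts and actions. The sketch below is an assumption-laden illustration: epsilon-greedy is a stand-in for whichever algorithm the course develops, and the function name `epsilon_greedy_bandit`, the reward function, and all parameter values are hypothetical.

```python
import random

def epsilon_greedy_bandit(reward_fn, contexts, actions,
                          rounds=2000, epsilon=0.1, seed=0):
    """Observe x_t, choose a_t, receive r_t, and accumulate (x_t, a_t, r_t)."""
    rng = random.Random(seed)
    totals = {(x, a): 0.0 for x in contexts for a in actions}
    counts = {(x, a): 0 for x in contexts for a in actions}
    history = []

    def mean_reward(x, a):
        # Untried (context, action) pairs default to an estimate of 0.
        return totals[(x, a)] / counts[(x, a)] if counts[(x, a)] else 0.0

    for _ in range(rounds):
        x_t = rng.choice(contexts)
        if rng.random() < epsilon:
            a_t = rng.choice(actions)                               # explore
        else:
            a_t = max(actions, key=lambda a: mean_reward(x_t, a))   # exploit
        r_t = reward_fn(x_t, a_t)
        totals[(x_t, a_t)] += r_t
        counts[(x_t, a_t)] += 1
        history.append((x_t, a_t, r_t))

    greedy = {x: max(actions, key=lambda a: mean_reward(x, a)) for x in contexts}
    return greedy, history

# Hypothetical environment: reward 1 when the action matches the context.
policy, history = epsilon_greedy_bandit(
    lambda x, a: 1.0 if a == x else 0.0, contexts=[0, 1], actions=[0, 1])
```

Unlike supervised or online learning, the reward is only observed for the action actually taken, which is why some exploration is needed.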

Online control/RL

policy $\pi_t:\mathcal X^t\to\mathcal A$
observation $x_t$
action $a_t$
accumulate $\{(x_t, a_t, r_t)\}$

Goal: select actions $a_t$ to bring environment to high-reward state
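
The difference from bandits is that actions change the environment's state, so the policy may depend on the whole observation history. As a rough sketch only, the code below assumes a scalar linear system $x_{t+1} = \alpha x_t + \beta a_t$ with reward $r_t = -x_t^2$; the dynamics, the reward, the parameter values, and the proportional feedback policy are all hypothetical choices for illustration.

```python
def run_closed_loop(policy, x0=5.0, steps=30, alpha=1.2, beta=1.0):
    """Simulate x_{t+1} = alpha * x_t + beta * a_t with reward r_t = -x_t**2."""
    history = []        # accumulated (x_t, a_t, r_t)
    observations = []   # x_0, ..., x_t, passed to the policy
    x = x0
    for _ in range(steps):
        observations.append(x)
        action = policy(observations)   # policy sees the full history
        reward = -x ** 2                # high reward means state near zero
        history.append((x, action, reward))
        x = alpha * x + beta * action   # the action moves the environment
    return history, x

def proportional_policy(observations, gain=1.0):
    """Feedback on the most recent observation only."""
    return -gain * observations[-1]

history, final_state = run_closed_loop(proportional_policy)
_, open_loop_state = run_closed_loop(lambda observations: 0.0)
```

With these assumed values the uncontrolled system grows (since $\alpha > 1$), while the feedback policy contracts the state by a factor of $\alpha - \beta = 0.2$ per step.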

Topics and Schedule

Unit 1: Learning to predict (Aug-Sept)
- Supervised learning & Fairness
- Online learning
- Dynamical systems & Stability
Unit 2: Learning to act (Oct-Nov)
- Multi-armed Bandits
- Control/RL & Robustness
- Model predictive control & Safety
Detailed Calendar

Prerequisites

Machine learning
Linear algebra, convex optimization, and probability
- Linear Algebra Review and Reference, Convex Optimization Overview, Review of Probability Theory
Lectures will focus on theoretical foundations
Practical concerns and applications are welcome in discussions and projects!

Assignments

10% participation
20% scribing
20% paper presentation
50% final project

Participation expectation: actively ask questions and contribute to discussions

in class (in person when possible)
and/or on Ed Discussions (exercises)

Scribing

high quality notes using the Tufte-handout template
summarize the lecture and expand upon it
draft due one week after the lecture; revision due one week after feedback
Sign up sheet

Paper presentations

group of 2-3 responsible for presenting and leading discussion
- single or multiple papers
assigned based on ranked choice, full list of papers here
should cover motivation, problem statement, prior work, main results, technical tools, and future work
first paper presentations 9/12 and 9/14
- HSNL18 Fairness Without Demographics in Repeated Loss Minimization
- PZMH20 Performative Prediction

Final Project

topic that connects class material to your research
groups of up to three
deliverables:
- Project proposal (1 page) due mid-October
- Midterm update (2 pages) due mid-November
- Project report (4-6 pages) due last day of class

Introductions

How would you design a classifier?

$(\text{[image]},\text{sitting})$

$(\text{[image]},\text{standing})$

$(\text{[image]},\text{?})$

How would you design a classifier?

$\hat y = \hat f(\text{[image]})$

$$\widehat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N \ell(y_i, f(x_i))$$
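
To see the minimization above in action, here is a minimal sketch for the sit/stand example. It assumes, hypothetically, that each image is summarized by one number (such as the height of the face in the frame), that $\mathcal F$ is the class of threshold rules, and that $\ell$ is the 0-1 loss; the data values are made up.

```python
def erm_threshold(xs, ys):
    """ERM over f_c(x) = 'standing' if x >= c else 'sitting', with 0-1 loss."""
    def predict(c, x):
        return "standing" if x >= c else "sitting"

    def total_loss(c):
        # Sum of 0-1 losses over the training set for threshold c.
        return sum(predict(c, x) != y for x, y in zip(xs, ys))

    # Thresholds at the data points, plus one above all of them, realize
    # every distinct labeling this class can produce on the training set.
    candidates = sorted(xs) + [max(xs) + 1.0]
    best_c = min(candidates, key=total_loss)
    return best_c, total_loss(best_c)

# Hypothetical training data: face height in frame and the true label.
xs = [0.2, 0.3, 0.35, 0.6, 0.7, 0.8]
ys = ["sitting", "sitting", "sitting", "standing", "standing", "standing"]
c_hat, train_errors = erm_threshold(xs, ys)
```

Because the class is finite on the training set, the minimization is done by exhaustive search; richer model classes need an optimization method instead.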

Loss functions

$\ell(y,\hat y)$ measures the "loss" of predicting $\hat y$ when the true label is $y$

Ex - classification

$\mathbb{1}\{y\neq\hat y\}$
$\max\{0, \hat y-y\}$

Ex - regression

$|\hat y-y|$
$(\hat y-y)^2$
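
As a small illustration, the 0-1, absolute, and squared losses can be written directly as functions. The function names and the sample values are hypothetical; the point is only that squared loss weights one large error much more heavily than absolute loss does.

```python
def zero_one_loss(y, y_hat):
    """Classification: 1 if the prediction is wrong, 0 otherwise."""
    return float(y != y_hat)

def absolute_loss(y, y_hat):
    """Regression: size of the error."""
    return abs(y_hat - y)

def squared_loss(y, y_hat):
    """Regression: squared size of the error."""
    return (y_hat - y) ** 2

# Hypothetical targets and predictions; the last prediction is far off.
ys = [1.0, 2.0, 3.0]
y_hats = [1.5, 2.0, 7.0]
mean_abs = sum(absolute_loss(y, p) for y, p in zip(ys, y_hats)) / len(ys)
mean_sq = sum(squared_loss(y, p) for y, p in zip(ys, y_hats)) / len(ys)
```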

Risk

The risk of a predictor $f$ over a distribution $\mathcal D$ is the expected (average) loss

$$\mathcal R(f) = \mathbb E_{x,y\sim\mathcal D}[\ell(y, f(x))]$$

Claim: The predictor with the lowest possible risk is

$\mathbb E[y| x]$ for squared loss
$\mathbb 1\{\mathbb E[y| x]\geq t\}$ for 0-1 loss, with $t$ depending on $\mathcal D$

Proof: exercise. Hint: use the tower property of expectation.
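
A sketch of how the hint plays out for squared loss, writing $\mu(x) = \mathbb E[y| x]$ (the symbol $\mu$ is introduced here only for convenience). By the tower property,

$$\mathcal R(f) = \mathbb E_{x}\big[\,\mathbb E[(y - f(x))^2 \mid x]\,\big]$$

Adding and subtracting $\mu(x)$ inside the inner expectation gives

$$\mathbb E[(y - f(x))^2 \mid x] = \mathbb E[(y - \mu(x))^2 \mid x] + (\mu(x) - f(x))^2$$

because the cross term $2(\mu(x) - f(x))\,\mathbb E[y - \mu(x) \mid x]$ is zero. Taking the expectation over $x$,

$$\mathcal R(f) = \mathcal R(\mu) + \mathbb E_{x}[(\mu(x) - f(x))^2] \geq \mathcal R(\mu)$$

so no predictor has lower risk than $\mu$. The 0-1 loss case is left as in the exercise.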

Prediction errors

Loss determines trade-offs between (potentially inevitable) errors

Ex - sit/stand classifier with $x=$ position of face in frame

$\ell($sitting$,$sitting$)=0$
$\ell($standing$,$standing$)=0$
$\ell($sitting$,$standing$)$
$\ell($standing$,$sitting$)$
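
How the two error costs shape the classifier can be sketched as follows. The cost values are hypothetical, since the slide leaves them open, and the sketch assumes access to $p = P(\text{standing} \mid x)$. Predicting standing has expected loss $(1-p)\,\ell(\text{sitting},\text{standing})$ and predicting sitting has expected loss $p\,\ell(\text{standing},\text{sitting})$, so comparing the two yields a threshold on $p$.

```python
def standing_threshold(cost_false_standing, cost_false_sitting):
    """Smallest P(standing | x) at which predicting 'standing' is no worse.

    cost_false_standing plays the role of l(sitting, standing):
        truth is sitting, prediction is standing.
    cost_false_sitting plays the role of l(standing, sitting):
        truth is standing, prediction is sitting.
    """
    return cost_false_standing / (cost_false_standing + cost_false_sitting)

def decide(p_standing, cost_false_standing, cost_false_sitting):
    """Pick the label with the lower expected loss."""
    t = standing_threshold(cost_false_standing, cost_false_sitting)
    return "standing" if p_standing >= t else "sitting"

# Equal costs give a threshold of 1/2; a costlier missed 'standing' lowers it.
symmetric_t = standing_threshold(1.0, 1.0)
asymmetric_t = standing_threshold(1.0, 3.0)
```

With the same $p = 0.3$, `decide(0.3, 1.0, 1.0)` and `decide(0.3, 1.0, 3.0)` disagree, which is the trade-off the loss encodes.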

Discrimination

In many domains, decisions have moral and legal significance

Harms can occur at many levels

Correctness: who is burdened by errors?
Stereotyping: which correlations are permissible?
Specification: who is left out?

Sample vs. population

Fundamental Theorem of Supervised Learning:

The risk is bounded by the empirical risk plus the generalization error:

$$ \mathcal R(f) \leq \mathcal R_N(f) + |\mathcal R(f) - \mathcal R_N(f)|$$
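
The gap between sample and population can be checked numerically when the distribution is known. The sketch below assumes a synthetic $\mathcal D$ chosen so the true risk is available in closed form: $x$ is uniform on $[0,1]$, $P(y=1 \mid x) = x$, and the fixed predictor is $f(x) = \mathbb 1\{x \geq 0.5\}$ under 0-1 loss. All of these choices are illustrative assumptions.

```python
import random

def predictor(x):
    """Fixed classifier f(x) = 1{x >= 0.5}."""
    return 1 if x >= 0.5 else 0

def empirical_risk(n, rng):
    """Average 0-1 loss of `predictor` on n i.i.d. samples from the assumed D."""
    errors = 0
    for _ in range(n):
        x = rng.random()                   # x ~ Uniform(0, 1)
        y = 1 if rng.random() < x else 0   # P(y = 1 | x) = x
        errors += predictor(x) != y
    return errors / n

# Exact risk for this D: the integral of min(x, 1 - x) over [0, 1].
true_risk = 0.25

rng = random.Random(0)
risk_small = empirical_risk(20, rng)       # noisy estimate
risk_large = empirical_risk(20000, rng)    # much closer to true_risk
gap_large = abs(true_risk - risk_large)    # generalization error term
```

Here the predictor is fixed before sampling; when $f$ is chosen using the same samples, the gap can be larger, which is why generalization is treated as its own question.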

Empirical risk minimization

$$\hat f = \arg\min_{f\in\mathcal F} \underbrace{\frac{1}{N} \sum_{i=1}^N \ell(y_i, f(x_i))}_{\mathcal R_N(f)}$$

1. Representation

2. Optimization

3. Generalization
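
The three ingredients can be traced through a toy regression problem. This is a sketch under stated assumptions: the representation is $\mathcal F = \{x \mapsto wx\}$, the optimization step uses the closed-form least-squares solution, and generalization is checked on fresh samples from a synthetic $\mathcal D$ ($x$ uniform on $[-1,1]$, $y = 3x$ plus Gaussian noise). None of these specifics come from the lecture.

```python
import random

def sample(n, rng):
    """Hypothetical D: x ~ Uniform(-1, 1), y = 3x + Gaussian noise (std 0.1)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 3.0 * x + rng.gauss(0, 0.1)))
    return data

def empirical_risk(w, data):
    """Average squared loss of f(x) = w * x on the given samples."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

def erm_linear(data):
    """Closed-form minimizer of the empirical squared loss over {x -> w * x}."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

rng = random.Random(0)
train = sample(200, rng)
fresh = sample(5000, rng)

w_hat = erm_linear(train)                   # optimization
train_risk = empirical_risk(w_hat, train)   # empirical risk R_N
fresh_risk = empirical_risk(w_hat, fresh)   # proxy for the population risk
```

Both risks should sit near the noise variance of 0.01, since the assumed model class contains the true relationship.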

Recap

[Diagram: $\mathcal D$; training data $\{(x_i, y_i)\}$; model $f:\mathcal X\to\mathcal Y$]

define loss
do ERM

performance depends on representation, optimization, and generalization

Next time: more on fairness & non-discrimination, then linear regression case study

Ref: Ch 2-3 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org
