## Overview

*ML in Feedback Sys #1*

Prof Sarah Dean

**Machine Learning**

**Feedback Systems**

automated system

environment

action

measurement

training data \(\{(x_i, y_i)\}\)

model

\(f:\mathcal X\to\mathcal Y\)

features

predicted label

## ML in Feedback Systems

training data

\(\{(x_i, y_i)\}\)

model

\(f:\mathcal X\to\mathcal Y\)

policy

observation

action

## ML in Feedback Systems

training data

\(\{(x_i, y_i)\}\)

model

\(f:\mathcal X\to\mathcal Y\)

observation

prediction

## Supervised learning

training data sampled i.i.d. from \(\mathcal D\)

\(x\sim\mathcal D_{x}\)

**Goal:** for a new sample \((x,y)\sim \mathcal D\), the prediction \(\hat y = f(x)\) is close to the true \(y\)

model

\(f_t:\mathcal X\to\mathcal Y\)

observation

prediction

## Online learning

\(x_t\)

**Goal:** cumulatively over time, predictions \(\hat y_t = f_t(x_t)\) are close to true \(y_t\)

accumulate

\(\{(x_t, y_t)\}\)
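The observe, predict, accumulate loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the data stream (\(y_t = 2x_t\) plus noise), the linear model \(f_t(x) = \theta_t x\), the step size, and the use of online gradient descent on the squared loss are all illustrative choices, not something the slides specify.

```python
import random

random.seed(2)

theta = 0.0          # parameter of the current model f_t(x) = theta * x
step_size = 0.1      # hypothetical learning rate
history = []         # accumulated {(x_t, y_t)}
losses = []          # per-round squared loss

for t in range(500):
    x_t = random.uniform(-1.0, 1.0)               # observe x_t
    y_hat_t = theta * x_t                         # predict with f_t
    y_t = 2.0 * x_t + random.gauss(0.0, 0.1)      # true y_t is revealed (assumed stream)
    losses.append((y_hat_t - y_t) ** 2)
    history.append((x_t, y_t))                    # accumulate data
    # Gradient step on the squared loss produces the next model f_{t+1}.
    theta -= step_size * 2.0 * (y_hat_t - y_t) * x_t

cumulative_loss = sum(losses)
print(theta, cumulative_loss)
```

The quantity of interest is `cumulative_loss`, the sum of losses over all rounds, rather than the loss of any single final model.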

policy

\(\pi_t:\mathcal X\to\mathcal A\)

observation

action

## (Contextual) Bandits

\(x_t\)

**Goal:** cumulatively over time, actions \(\pi_t(x_t)\) achieve high reward

\(a_t\)

accumulate

\(\{(x_t, a_t, r_t)\}\)
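A small simulation can make the bandit loop concrete. The sketch below uses epsilon-greedy exploration, which is one simple strategy chosen here for illustration and not necessarily the one covered in the course. The two contexts, two actions, reward probabilities, and exploration rate are all hypothetical.

```python
import random

random.seed(3)

contexts, actions = [0, 1], [0, 1]
# Hypothetical environment: probability of reward 1 for each (context, action).
reward_prob = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.9}

epsilon = 0.1                                  # exploration rate (assumed)
counts = {key: 0 for key in reward_prob}       # times each (x, a) was tried
means = {key: 0.0 for key in reward_prob}      # running average reward per (x, a)
history = []                                   # accumulated {(x_t, a_t, r_t)}

for t in range(5000):
    x_t = random.choice(contexts)                            # observe context
    if random.random() < epsilon:
        a_t = random.choice(actions)                         # explore
    else:
        a_t = max(actions, key=lambda a: means[(x_t, a)])    # exploit current estimates
    r_t = 1.0 if random.random() < reward_prob[(x_t, a_t)] else 0.0
    counts[(x_t, a_t)] += 1
    means[(x_t, a_t)] += (r_t - means[(x_t, a_t)]) / counts[(x_t, a_t)]
    history.append((x_t, a_t, r_t))

greedy_policy = {x: max(actions, key=lambda a: means[(x, a)]) for x in contexts}
average_reward = sum(r for _, _, r in history) / len(history)
print(greedy_policy, average_reward)
```

Unlike online learning, only the reward of the chosen action is observed, so the policy has to spend some rounds exploring actions it currently believes are worse.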

policy

\(\pi_t:\mathcal X^t\to\mathcal A\)

observation

action

## Online control/RL

\(x_t\)

**Goal:** select actions \(a_t\) to bring environment to high-reward state

\(a_t\)

accumulate

\(\{(x_t, a_t, r_t)\}\)

## Topics and Schedule

- Unit 1: Learning to predict (Aug-Sept)
  - Supervised learning & Fairness
  - Online learning
  - Dynamical systems & Stability
- Unit 2: Learning to act (Oct-Nov)
  - Multi-armed Bandits
  - Control/RL & Robustness
  - Model predictive control & Safety
- Detailed Calendar

## Prerequisites

- Machine learning
- Linear algebra, convex optimization, and probability
- Lectures will focus on *theoretical foundations*
- A focus on practical concerns and applications is welcome in discussions and projects!

## Assignments

- 10% participation
- 20% scribing
- 20% paper presentation
- 50% final project

Participation expectation: actively ask questions and contribute to discussions

- in class (in person when possible)
- and/or on Ed Discussions (exercises)

## Scribing

- high quality notes using the Tufte-handout template
- summarize the lecture and expand upon it
- draft due one week after lecture, revision due one week after feedback
- Sign up sheet

## Paper presentations

- group of 2-3 responsible for presenting and leading discussion
  - single or multiple papers
- assigned based on ranked choice, full list of papers here
- should cover motivation, problem statement, prior work, main results, technical tools, and future work
- first paper presentations 9/12 and 9/14
  - HSNL18 *Fairness Without Demographics in Repeated Loss Minimization*
  - PZMH20 *Performative Prediction*

## Final Project

- topic that connects class material to your research
- groups of up to three
- deliverables:
  - Project proposal (1 page) due mid-October
  - Midterm update (2 pages) due mid-November
  - Project report (4-6 pages) due last day of class

# Introductions

## How would you design a classifier?

Labeled example images:

- \((\text{image},\text{sitting})\)
- \((\text{image},\text{sitting})\)
- \((\text{image},\text{standing})\)
- \((\text{image},\text{standing})\)

New image to classify: \((\text{image},\text{?})\)

## How would you design a classifier?

\(\hat y = \hat f(\text{image})\)

$$\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N \ell(y_i, f(x_i))$$

## Loss functions

\(\ell(y,\hat y)\) measures "loss" of predicting \(\hat y\) when it's actually \(y\)

Ex - classification

- \(\mathbb{1}\{y\neq\hat y\}\)
- \(\max\{0, \hat y-y\}\)

Ex - regression

- \(|\hat y-y|\)
- \((\hat y-y)^2\)
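The four example losses can be written as small Python functions. This is a minimal sketch: the function names are illustrative, and the argument order follows the slide's convention \(\ell(y,\hat y)\).

```python
def zero_one_loss(y, y_hat):
    """1{y != y_hat}: 1 for a wrong label, 0 for a correct one."""
    return 1.0 if y != y_hat else 0.0


def one_sided_loss(y, y_hat):
    """max{0, y_hat - y}: charges only predictions that exceed the true value."""
    return max(0.0, y_hat - y)


def absolute_loss(y, y_hat):
    """|y_hat - y|: grows linearly with the size of the error."""
    return abs(y_hat - y)


def squared_loss(y, y_hat):
    """(y_hat - y)^2: penalizes large errors more heavily than small ones."""
    return (y_hat - y) ** 2


print(absolute_loss(2.0, -1.0), squared_loss(2.0, -1.0))
```

For the same error of size 3, the absolute loss is 3 while the squared loss is 9, which is why the choice of loss changes which mistakes a learned predictor works hardest to avoid.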

## Risk

The *risk* of a predictor \(f\) over a distribution \(\mathcal D\) is the *expected (average) loss*

$$\mathcal R(f) = \mathbb E_{x,y\sim\mathcal D}[\ell(y, f(x))]$$

**Claim:** The predictor with the lowest possible risk is

- \(\mathbb E[y| x]\) for squared loss
- \(\mathbb 1\{\mathbb E[y| x]\geq t\}\) for 0-1 loss, with \(t\) depending on \(\mathcal D\)

Proof: exercise. *Hint: use tower property of expectation.*
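The squared-loss part of the claim can be sanity-checked numerically (this is not a proof). The sketch below assumes a hypothetical distribution \(\mathcal D\) with \(x\) uniform on \(\{0,1,2\}\) and \(y = 2x\) plus standard Gaussian noise, so that \(\mathbb E[y|x] = 2x\), and compares Monte Carlo risk estimates for three predictors.

```python
import random

random.seed(0)


def sample(n):
    """Draw n pairs from the assumed D: x in {0, 1, 2}, y = 2x + N(0, 1) noise."""
    data = []
    for _ in range(n):
        x = random.choice([0, 1, 2])
        data.append((x, 2.0 * x + random.gauss(0.0, 1.0)))
    return data


def estimated_risk(f, data):
    """Monte Carlo estimate of R(f) = E[(f(x) - y)^2] from samples of D."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def conditional_mean(x):
    return 2.0 * x          # E[y | x] for this D


def shifted(x):
    return 2.0 * x + 0.5    # biased alternative


def constant(x):
    return 2.0              # ignores x and predicts E[y]


data = sample(20000)
risk_cm = estimated_risk(conditional_mean, data)
risk_shifted = estimated_risk(shifted, data)
risk_constant = estimated_risk(constant, data)
print(risk_cm, risk_shifted, risk_constant)
```

The conditional mean attains an estimated risk close to the noise variance, and both alternatives do worse on the same samples.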

## Prediction errors

Loss determines trade-offs between (potentially inevitable) errors

Ex - sit/stand classifier with \(x=\) position of face in frame

- \(\ell(\)sitting\(,\)sitting\()=0\)
- \(\ell(\)standing\(,\)standing\()=0\)
- \(\ell(\)sitting\(,\)standing\()\)
- \(\ell(\)standing\(,\)sitting\()\)
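The slide leaves the two error costs open. As a sketch of how they shape the decision, the code below takes hypothetical costs and an assumed value of \(P(\text{standing}\mid x)\), then picks the label with the smaller expected loss. Equivalently, it predicts standing when that probability is at least \(\ell(\text{sitting},\text{standing}) / (\ell(\text{sitting},\text{standing}) + \ell(\text{standing},\text{sitting}))\).

```python
def best_prediction(p_standing, cost_sit_as_stand, cost_stand_as_sit):
    """Pick the label with the smaller expected loss given P(standing | x).

    cost_sit_as_stand = l(sitting, standing): truly sitting, predicted standing.
    cost_stand_as_sit = l(standing, sitting): truly standing, predicted sitting.
    Correct predictions cost 0, as on the slide.
    """
    expected_if_standing = (1.0 - p_standing) * cost_sit_as_stand
    expected_if_sitting = p_standing * cost_stand_as_sit
    return "standing" if expected_if_standing <= expected_if_sitting else "sitting"


def threshold(cost_sit_as_stand, cost_stand_as_sit):
    """Predict standing when P(standing | x) is at least this value."""
    return cost_sit_as_stand / (cost_sit_as_stand + cost_stand_as_sit)


# Hypothetical costs: equal costs versus missing a standing person costing 3x more.
print(best_prediction(0.4, 1.0, 1.0), threshold(1.0, 1.0))
print(best_prediction(0.4, 1.0, 3.0), threshold(1.0, 3.0))
```

With equal costs the threshold is one half. Making one error more expensive moves the threshold so that the classifier commits more of the cheaper error.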

## Discrimination

In many domains, decisions have moral and legal significance

Harms can occur at many levels

- Correctness: who is burdened by errors?
- Stereotyping: which correlations are permissible?
- Specification: who is left out?

## Sample vs. population

**Fundamental Theorem of Supervised Learning:**

- The *risk* is bounded by the *empirical risk* plus the *generalization error*.

$$\mathcal R(f) \leq \mathcal R_N(f) + |\mathcal R(f) - \mathcal R_N(f)|$$

Empirical risk minimization

$$\hat f = \arg\min_{f\in\mathcal F} \underbrace{\frac{1}{N} \sum_{i=1}^N \ell(y_i, f(x_i))}_{\mathcal R_N(f)}$$

1. Representation

2. Optimization

3. Generalization
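The three ingredients can be seen in a toy ERM problem. The sketch below assumes a hypothetical distribution (\(x\) uniform on \([-1,1]\), \(y = 3x\) plus noise), the function class \(\mathcal F = \{x \mapsto \theta x\}\) as the representation, a closed-form least-squares solution as the optimization step, and a large fresh sample to approximate the population risk for the generalization step.

```python
import random

random.seed(1)


def sample(n):
    """Assumed D: x ~ Uniform(-1, 1), y = 3x + Gaussian noise with std 0.5."""
    data = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + random.gauss(0.0, 0.5)))
    return data


def average_loss(theta, data):
    """Average squared loss of f(x) = theta * x on the given samples."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def erm(data):
    """Minimizer of the empirical risk over F = {x -> theta * x} (closed form)."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


train = sample(20)                                       # N = 20 training points
theta_hat = erm(train)                                   # optimization
empirical_risk = average_loss(theta_hat, train)          # R_N(f_hat)
risk_estimate = average_loss(theta_hat, sample(50000))   # approximates R(f_hat)
gap = abs(risk_estimate - empirical_risk)                # generalization error
print(theta_hat, empirical_risk, risk_estimate, gap)
```

Rerunning with a larger training set is a quick way to see how the gap between `empirical_risk` and `risk_estimate` changes with \(N\).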

## Recap

**Next time:** more on fairness & non-discrimination, then linear regression case study

training data

\(\{(x_i, y_i)\}\)

model

\(f:\mathcal X\to\mathcal Y\)

- define loss
- do ERM

\(\mathcal D\)

*performance depends on representation, optimization, and generalization*

Ref: Ch 2-3 of Hardt & Recht, "Patterns, Predictions, and Actions" mlstory.org
