### Sarah Dean

asst prof in CS at Cornell

Prof. Sarah Dean

MW 2:45-4pm

110 Hollister Hall

0. Announcements & Recap

1. Real World RL

2. Specification and Risks

3. Does RL Work?

5789 Paper Review Assignment (weekly pace *suggested*)

HW 4 due 5/9 -- don't plan on extensions

Final exam Monday 5/16 at 7pm

Review session in lecture 5/9

Course evaluations open tomorrow

**Supervised Learning**

**Policy**

**Dataset of expert trajectories**

...

**\(\pi\)( ) = **

\((x=s, y=a^*)\)

imitation

inverse RL

**Goal:** understand/predict behaviors

- For \(k=0,\dots,K-1\):
  - \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
  - \(w_{k+1} = w_k + \eta (\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)])\)
- Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots, \pi^{K-1})\)
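The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: `max_ent_irl`, `solve_policy`, and `feature_expectation` are hypothetical names, the initialization \(w_0 = 0\) is an assumption (the slide does not state one), and the toy problem is a single state with horizon 1, where Soft-VI reduces to \(\pi(a) \propto \exp(w^\top \varphi(a))\).

```python
import math

def max_ent_irl(expert_mu, solve_policy, feature_expectation, eta=0.5, K=500):
    """Run K rounds of the weight update; return all policies and the final w."""
    w = [0.0] * len(expert_mu)            # assumed initialization w_0 = 0
    policies = []
    for _ in range(K):
        pi_k = solve_policy(w)            # pi^k = SoftVI(w_k^T phi)
        mu_k = feature_expectation(pi_k)  # E_{d^{pi^k}}[phi(s, a)]
        # w_{k+1} = w_k + eta * (expert features - current policy features)
        w = [wi + eta * (e - m) for wi, e, m in zip(w, expert_mu, mu_k)]
        policies.append(pi_k)
    return policies, w                    # Unif(policies) is the returned mixture

# Toy problem (assumption): one state, three actions, two features per action.
PHI = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def softmax_policy(w):
    """Horizon-1 Soft-VI: pi(a) proportional to exp(w . phi(a))."""
    scores = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in PHI]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_features(pi):
    """Feature expectation of a policy over the three actions."""
    return [sum(p * phi[j] for p, phi in zip(pi, PHI)) for j in range(2)]

expert_mu = expected_features([0.1, 0.2, 0.7])   # stands in for expert data
policies, w = max_ent_irl(expert_mu, softmax_policy, expected_features)
learned_mu = expected_features(policies[-1])
```

In this toy case the learned feature expectations approach the expert's, which is the fixed point of the update: the weights stop changing once the two expectations agree.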

**Soft-VI**

- Input: reward function \(r\). Initialize \(V_H^*(s) = 0\)
- For \(h=H-1,\dots, 0\):
  - \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V^*_{h+1}(s')]\)
  - \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
  - \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
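A minimal tabular sketch of the Soft-VI backward recursion, assuming a small finite MDP stored as nested lists. The function name `soft_value_iteration` and the layout `P[s][a][s']`, `r[s][a]` are illustrative choices, not notation from the slides.

```python
import math

def soft_value_iteration(P, r, H):
    """P[s][a] lists next-state probabilities, r[s][a] is the reward, H the horizon."""
    S, A = len(r), len(r[0])
    V = [0.0] * S                          # V_H^*(s) = 0
    policy = [None] * H
    for h in range(H - 1, -1, -1):
        # Q_h^*(s,a) = r(s,a) + E_{s'}[V_{h+1}^*(s')]
        Q = [[r[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        new_V, pi_h = [], []
        for s in range(S):
            top = max(Q[s])                # subtract the max for a stable log-sum-exp
            exps = [math.exp(q - top) for q in Q[s]]
            total = sum(exps)
            pi_h.append([e / total for e in exps])   # pi_h^*(a|s) prop. to exp(Q)
            new_V.append(top + math.log(total))      # V_h^*(s) = log sum_a exp(Q)
        policy[h] = pi_h
        V = new_V
    return policy, V

# Two-state, two-action toy MDP (made-up numbers for illustration).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
r = [[0.0, 0.0],
     [1.0, 0.0]]
policy, V0 = soft_value_iteration(P, r, H=1)
```

With \(H=1\) the value at each state is just the log-sum-exp of the rewards, so state 0 has value \(\log 2\) and a uniform policy.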

1. Real World RL

AlphaGo vs. Lee Sedol, 2016

RL for Amazon Ads, 2022

2. Specification and Risks

Markov decision process \(\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}\)

[Diagram labels: policy \(\pi\), action \(a_t\), state \(s_t\), reward \(r_t\), transitions \(P\), discount \(\gamma\)]

- action space and discount known
- states and reward signals observed
- transition probabilities unknown

actions & states determine **environment**

discount & reward determine **objective**

Heavy discounting (small \(\gamma\)) leads to a short-sighted agent

- \(0\) cost for \(a_0\)
- \(2\epsilon\) cost for \(a_1\)
- \(\epsilon\) reward in \(s_0\)
- \(1\) reward in \(s_1\)

\(V^{a_0}(s_0) = \frac{\epsilon}{1-\gamma}\) and \(V^{a_1}(s_0) = \frac{1}{1-\gamma} - \frac{2\epsilon}{\gamma}\)
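Taking the two value expressions above at face value, for \(0<\gamma<1\) and \(\epsilon>0\) the comparison \(V^{a_1}(s_0) > V^{a_0}(s_0)\) rearranges to \(\gamma > \frac{2\epsilon}{1+\epsilon}\). The short check below evaluates both expressions numerically; the function names are illustrative only.

```python
def value_a0(gamma, eps):
    """V^{a_0}(s_0) = eps / (1 - gamma), as stated on the slide."""
    return eps / (1 - gamma)

def value_a1(gamma, eps):
    """V^{a_1}(s_0) = 1 / (1 - gamma) - 2 * eps / gamma, as stated on the slide."""
    return 1 / (1 - gamma) - 2 * eps / gamma

def preferred_action(gamma, eps):
    """Return the label of the action with the larger value at s_0."""
    return "a1" if value_a1(gamma, eps) > value_a0(gamma, eps) else "a0"

def discount_threshold(eps):
    """Discount above which a_1 is preferred: gamma > 2 * eps / (1 + eps)."""
    return 2 * eps / (1 + eps)

eps = 0.1
myopic_choice = preferred_action(0.1, eps)       # gamma below the threshold
farsighted_choice = preferred_action(0.9, eps)   # gamma above the threshold
```

With \(\epsilon = 0.1\) the threshold is about \(0.18\), so an agent with \(\gamma = 0.1\) settles for the small reward in \(s_0\) while one with \(\gamma = 0.9\) pays the cost to reach \(s_1\).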

The promise of RL:

translate *specified objective* into *desired behavior*

The reality:

"While everyone seemed focused on how many views a video got, we thought the amount of time someone spent watching a video was a better way to understand whether a viewer really enjoyed it."

YouTube in 2014 vs. 2018

Facebook's "Meaningful Social Interaction" metric

"Misinformation, toxicity, and violent content are inordinately prevalent among reshares"

Inverse Reward Design (NeurIPS, 2017)

Idea: treat specified reward as imperfect proxy

Then attempt to learn true reward from other feedback

The interface through which the agent sees and impacts the world

Also delimits reasoning about the world

\(s_t\)

\(a_t\)

Evolving an oscillator on hardware (Bird & Layzell, 2002)

Result: a *"network of transistors sensing and utilising the radio waves emanating from nearby PCs"*

The first Tesla autopilot fatality in 2016

Safety systems failed to detect white truck against bright sky

*"vehicles [...] will no longer be equipped with radar. Instead, these will [...] rely on camera vision and neural net processing."* (Tesla, 2021)

Learning to influence other drivers

Excessive caution around other drivers

Excessive aggression

Example adapted from Anca Dragan

3. Does RL Work?

Three strikes against RL:

1. Model-based design and optimization works better

ex - Model Predictive Control at Boston Dynamics

data-driven optimization suffers from local minima, large sample complexity (Deep RL doesn't work yet, 2018)

2. Simulation essentially necessary, but huge sim2real gap

RL exploits bugs in simulator code (Nathan Lambert, 2021)

3. Questionable evaluation practices

State-of-the-art algorithms outperformed by simple baselines: Simple random search provides a competitive approach to reinforcement learning, 2017
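To give a sense of how simple such a baseline can be, here is a sketch of a basic random search update: perturb the parameters in random directions, compare the reward on both sides, and step along the reward-weighted directions. It is a simplified illustration in the spirit of that paper, not its full method, and the quadratic `toy_reward`, the step sizes, and all names are assumptions standing in for a policy rollout.

```python
import random

def basic_random_search(reward, theta, alpha=0.5, nu=0.1, n_dirs=8, iters=200, seed=0):
    """Derivative-free search using antithetic random perturbations."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _iteration in range(iters):
        step = [0.0] * dim
        for _direction in range(n_dirs):
            delta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            plus = [t + nu * d for t, d in zip(theta, delta)]
            minus = [t - nu * d for t, d in zip(theta, delta)]
            diff = reward(plus) - reward(minus)   # two "rollouts" per direction
            for j in range(dim):
                step[j] += diff * delta[j]
        # Average the reward-weighted directions and take a step.
        theta = [t + alpha / n_dirs * s for t, s in zip(theta, step)]
    return theta

TARGET = [1.0, -2.0, 0.5]   # hypothetical optimum of the toy objective

def toy_reward(theta):
    """Stand-in for an episode return: negative squared distance to TARGET."""
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))

theta_hat = basic_random_search(toy_reward, [0.0, 0.0, 0.0])
```

On this smooth toy objective the search lands close to `TARGET`; real control tasks add noisy returns and the per-task tuning the slides mention.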

This perspective ignores the instance-specific tuning that often goes into making RL algorithms work

*"Machine learning has become alchemy"* Ali Rahimi & Ben Recht, 2017

King Midas cursed by Dionysus

When Silicon Valley tries to imagine superintelligence, what it comes up with is no-holds-barred capitalism.

Ted Chiang, 2018.

I think many AV teams could handle a pogo stick user in pedestrian crosswalk. Having said that, bouncing on a pogo stick in the middle of a highway would be really dangerous. Rather than building AI to solve the pogo stick problem, we should partner with the government to ask people to be lawful and considerate. Safety isn’t just about the quality of the AI technology.

- Andrew Ng, 2018

ex - credit-score designed within supervised learning framework, but used to make lending decisions

\(\{x_i, y_i\}\)

\(x\)

\(\widehat y\)

\((x, y)\)

"When a measure becomes a target, it ceases to be a good measure"

Goodhart's law

Buzzfeed noticed the success of content that exploited racial divisions, fad/junky science, extremely disturbing news and gross images.

"Some political parties in Europe told Facebook the algorithm had made them shift their policy positions so they resonated more on the platform, according to the documents."

“Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable.”

Ursula Franklin, 1989

control feedback

data feedback

external feedback

*"...social, economic, and political context..."*

*"...neither foreseen nor foreseeable..."*

1. Real World RL

2. Specification and Risks

3. Does RL Work?

1. AlphaGo case study

2. Review for final

By Sarah Dean