Sp23 CS 4/5789: Lecture 1

CS 4/5789: Introduction to Reinforcement Learning

Lecture 1: Introduction

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Agenda

1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)

AlphaGo

Robotic Manipulation

Algorithmic Media Feeds

...

observation

action

reward

a policy maps observation to action

design policy to achieve high reward

RL is for Sequential Decision-Making

reaction

adaptation

Sequential Decision-Making

observation

action

reward

AlphaGo

Robotic Manipulation

Media Feeds

$\theta_t-\theta_*$


Logistics

Instructor: Prof. Sarah Dean
Head TAs: Rohan Banerjee and Runzhe Wu
TAs & Consultant: Yann Hicke, Jenna Fields, Ruizhe Wang, Brandon Man, Yijia Dai, Patrick Yuan, Yingbing Huang

Contact: Ed Discussion
Instructor Office Hours: Wednesday 4-4:50pm in 255 Olin, additional time TBD
TA Office Hours: TBD, see Ed Discussion

Waitlist and Enrollment

There should be plenty of space!

Course staff do not manage waitlist and enrollment.

CS enrollment policies:
https://www.cs.cornell.edu/courseinfo/enrollment

Exams

Prelim on March 15 in class
Final exam during finals period, time TBD

Homework

Homework assignments
- ~8 problem sets (math)
- ~4 programming assignments (coding)
Gradescope
- neatly written, ideally typeset with LaTeX
5789: Paper review assignments (after Unit 1)
Collaboration Policy: discussion is fine, but write your own solutions and code, and do not look at others' work or let others look at yours
Late Policy: penalties can be avoided by requesting extensions on Ed (private post)

Participation

Participation is 5% of the final grade, scored out of 20 points

Lecture participation = 1pt each
- Poll Everywhere: PollEv.com/sarahdean011
Helpful posts on Ed Discussion = 2pt each
- TA endorsement

Schedule

Unit 1: Fundamentals of Planning and Control (Jan & Feb)
- Imitation learning, Markov Decision Processes, Dynamic Programming, Value and Policy Iteration, Continuous Control, Linear Quadratic Regulation
Unit 2: Learning in MDPs (Mar)
- Estimation, Model-based RL, Approximate Dynamic Programming, Policy Optimization
Unit 3: Exploration (Apr & May)
- Multi-armed Bandits, Contextual Bandits
- State of the art examples

Prerequisites

Machine learning (e.g., CS 4780)

Background in probability, linear algebra, and programming.

Materials

Lecture Slides and Notes

Extra Resources (not required)
RL Theory Book: https://rltheorybook.github.io/
Classic RL Book: Sutton & Barto (http://www.incompleteideas.net/book/RLbook2020.pdf)


Types of Machine Learning

Unsupervised Learning
Supervised Learning
Reinforcement Learning

Unsupervised Learning

Examples: clustering, principal component analysis (PCA)

Goal:
- summarization
Dataset:
- information about many instances
- $\{x_1, x_2, \dots, x_N\}$
Evaluation:
- qualitative

"descriptive"

Supervised Learning

Examples: classification, regression

Goal:
- prediction
Dataset:
- each instance has features and label
- $\{(x_1, y_1), \dots, (x_N,y_N)\}$
Evaluation:
- accuracy, $y$ vs. $\hat y$

"predictive"

Reinforcement Learning

Goal:
- action or decision
Dataset:
- history of observations, actions, and rewards
- sequential $\{(o_t, a_t, r_t)\}_{t=1}^T$
Evaluation:
- cumulative reward

"prescriptive"

Types of Machine Learning

Unsupervised Learning
- summarize unstructured data $\{x_i\}_{i=1}^N$
Supervised Learning
- predict labels from features $\{(x_i, y_i)\}_{i=1}^N$
Reinforcement Learning
- choose actions that lead to high reward
- sequential data

Difficulties of Sequential Problems

Unlike other types of ML, in RL data may not be drawn "i.i.d." from some distribution

May start with no data
Actions have consequences
Solving a task may require a long sequence of correct actions

$a_t \to (o_t, r_t) \to a_{t+1} \to (o_{t+1}, r_{t+1}) \to \dots$

ML Specification

Specifying supervised learning problem
- feature space $\mathcal X$ and label space $\mathcal Y$
- distribution over feature and labels $(x,y)\sim \mathcal D$
  - often empirical, i.e. defined by a dataset
- pick a loss function to determine accuracy
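
As a rough illustration of this specification, the sketch below treats a tiny made-up dataset as the empirical distribution $\mathcal D$ and uses squared error as the loss. The dataset, the two predictors, and all function names are hypothetical and chosen only for the example.

```python
# Hypothetical toy dataset: each instance is (feature x, label y), both real numbers.
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def squared_loss(y_hat, y):
    """Loss function: squared distance between prediction and label."""
    return (y_hat - y) ** 2


def empirical_risk(predictor, data, loss):
    """Average loss over the dataset, i.e. the expected loss under the empirical distribution."""
    return sum(loss(predictor(x), y) for x, y in data) / len(data)


def good_predictor(x):
    # Matches the toy labels exactly (y = 2x + 1).
    return 2.0 * x + 1.0


def poor_predictor(x):
    # Ignores the features entirely.
    return 0.0


risk_good = empirical_risk(good_predictor, dataset, squared_loss)
risk_poor = empirical_risk(poor_predictor, dataset, squared_loss)
print(risk_good, risk_poor)
```

Here a lower risk means a more accurate predictor, which is the sense of "accuracy, $y$ vs. $\hat y$" used for supervised learning.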


General setting

agent to environment: action $a_t$

environment to agent: observation $o_t$, reward $r_t$, then next observation $o_{t+1}$

Agent observes environment
Agent takes action
Environment sends reward and changes
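
The three steps above can be sketched as a simple loop. This is only an illustration: `ToyEnvironment`, its drifting counter, and the hand-written `policy` are hypothetical stand-ins, not something defined in the lecture.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: a counter that drifts randomly; reward favors staying near zero."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.position = 0

    def observe(self):
        # Here the observation is simply the current position.
        return self.position

    def step(self, action):
        # The environment changes (action plus random drift) and sends back a reward.
        self.position += action + self.rng.choice([-1, 0, 1])
        return -abs(self.position)


def policy(observation):
    """Hypothetical policy mapping an observation to an action: push back toward zero."""
    if observation > 0:
        return -1
    if observation < 0:
        return 1
    return 0


def run(env, policy_fn, num_steps):
    history = []
    for _ in range(num_steps):
        o = env.observe()      # agent observes environment
        a = policy_fn(o)       # agent takes action
        r = env.step(a)        # environment sends reward and changes
        history.append((o, a, r))
    return history


history = run(ToyEnvironment(seed=0), policy, num_steps=10)
total_reward = sum(r for _, _, r in history)
print(history, total_reward)
```

The recorded `history` has the same shape as the sequential dataset $\{(o_t, a_t, r_t)\}$ described earlier, and `total_reward` is its (undiscounted) cumulative reward.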

Markov Decision Process (MDP)

action $a_t \sim \pi(s_t)$

state $s_t$

reward $r_t\sim r(s_t, a_t)$

next state $s_{t+1}\sim P(s_t, a_t)$

Agent observes state of environment
Agent takes action
- depending on state according to policy
Environment state updates (stochastically) according to transition function

state $s_t$: assumption on structure of observations and how they change

Markov Decision Process (MDP)

Key Markovian Assumption:

The state transition is independent of the past when conditioned on the current state and action
- $\mathbb P\{s_{t+1}=s\mid s_t, s_{t-1},\dots, s_0, a_t, \dots, a_0 \} = \mathbb P\{s_{t+1}=s\mid s_t, a_t \}$
Similarly for the reward signal
Therefore, we write state transition and reward distribution as
- $s_{t+1}\sim P(s_t, a_t),\quad r_t\sim r(s_t, a_t)$
Actions can be chosen based only on current state
- $a_t \sim \pi(s_t)$
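
One consequence of the Markovian assumption, sketched here for discrete states and actions: the probability of an entire trajectory factors into one-step terms. The notation $P(s'\mid s,a)$ and $\pi(a\mid s)$ for the probabilities that $P(s,a)$ and $\pi(s)$ assign to $s'$ and $a$, and the initial state distribution $\mu_0$, are assumptions introduced for this illustration.

$$\mathbb P\{s_0, a_0, s_1, a_1, \dots, s_T\} = \mu_0(s_0)\prod_{t=0}^{T-1} \pi(a_t\mid s_t)\, P(s_{t+1}\mid s_t, a_t)$$

Each factor depends only on the current state and action, never on earlier history.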

Markov Decision Process (MDP)

action $a_t \sim \pi(s_t)$

state $s_t$

reward $r_t\sim r(s_t, a_t)$

next state $s_{t+1}\sim P(s_t, a_t)$

Agent observes state of environment
Agent takes action depending on state according to policy
Environment returns reward and updates state according to reward/transition function
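
A minimal sketch of this loop for a small tabular MDP follows. The two states, two actions, transition probabilities, and rewards are all made up for illustration, and both the reward and the policy are taken to be deterministic for simplicity (the general definition allows both to be stochastic).

```python
import random

# Hypothetical MDP: P[(s, a)] is a distribution over next states, r[(s, a)] a reward.
P = {
    ("low", "wait"): {"low": 1.0},
    ("low", "charge"): {"low": 0.2, "high": 0.8},
    ("high", "wait"): {"high": 0.9, "low": 0.1},
    ("high", "charge"): {"high": 1.0},
}
r = {
    ("low", "wait"): 0.0,
    ("low", "charge"): -1.0,
    ("high", "wait"): 2.0,
    ("high", "charge"): 0.0,
}


def pi(s):
    """Hypothetical policy: the action depends only on the current state."""
    return "charge" if s == "low" else "wait"


def sample_next_state(rng, s, a):
    """Draw s' ~ P(s, a); only the current state and action are used."""
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    return rng.choices(next_states, weights=probs, k=1)[0]


def rollout(rng, s0, num_steps):
    s = s0
    trajectory = []
    for _ in range(num_steps):
        a = pi(s)                              # agent acts according to policy
        reward = r[(s, a)]                     # environment returns reward
        s_next = sample_next_state(rng, s, a)  # environment updates state
        trajectory.append((s, a, reward, s_next))
        s = s_next
    return trajectory


trajectory = rollout(random.Random(0), "low", num_steps=5)
print(trajectory)
```

Note that `sample_next_state` never looks at the trajectory so far, which is exactly the Markovian structure the MDP assumes.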

Example: robot manipulation

state: $s$
- finger configuration and object pose
action: $a$
- joint motor commands
transition: $s'\sim P(s,a)$
- physical equations of motion (gravity, contact forces, friction)
policy: $\pi(s)$
- maps configurations to motor commands
reward: $r(s,a)$
- negative distance to goal (etc.)

Infinite Horizon Discounted MDP

$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$

$\mathcal{S}$ space of possible states $s\in\mathcal S$
$\mathcal{A}$ space of possible actions $a\in \mathcal{A}$
$r$ stochastic map from state, action to scalar reward
$P$ stochastic map from current state and action to next state
$\gamma$ discount factor between $0$ and $1$

Goal: achieve high cumulative reward:

$$\sum_{t=0}^\infty \gamma^t r_t$$
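
To make the discounted sum concrete, here is a small sketch that evaluates it on a finite list of rewards. The function name and the example reward sequences are hypothetical; a finite list is a truncation of the infinite sum, which is a reasonable approximation when $\gamma < 1$ and rewards are bounded.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma^t * r_t for a finite reward sequence (t starts at 0)."""
    total = 0.0
    for t, r_t in enumerate(rewards):
        total += (gamma ** t) * r_t
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
short_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)

# A long run of constant reward 1 approaches the geometric-series limit 1 / (1 - gamma).
long_return = discounted_return([1.0] * 200, gamma=0.9)
print(short_return, long_return)
```

Smaller $\gamma$ weights near-term rewards more heavily; as $\gamma$ approaches $1$, rewards far in the future count almost as much as immediate ones.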

Infinite Horizon Discounted MDP

$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$

$\mathcal{S}$ space of possible states $s\in\mathcal S$
$\mathcal{A}$ space of possible actions $a\in \mathcal{A}$
$r$ stochastic map from state, action to scalar reward
$P$ stochastic map from current state and action to next state
$\gamma$ discount factor between $0$ and $1$

$\displaystyle \underset{\pi}{\text{maximize}}~~ \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$

s.t. $s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$
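
One way to get a feel for this objective is to estimate the expectation by simulation and compare two candidate policies. The sketch below is only illustrative: the two-state MDP, the fixed start state $s_0 = 0$, the finite horizon used to truncate the infinite sum, and all names are assumptions made for the example.

```python
import random

GAMMA = 0.9


def transition(rng, s, a):
    """Hypothetical P(s, a): 'move' from state 0 reaches state 1 with probability 0.9."""
    if s == 0 and a == "move":
        return 1 if rng.random() < 0.9 else 0
    if s == 1 and a == "move":
        return 0
    return s  # "stay" keeps the current state


def reward(s, a):
    """Hypothetical r(s, a): reward 1 whenever the agent is in state 1."""
    return 1.0 if s == 1 else 0.0


def always_stay(s):
    return "stay"


def go_to_one(s):
    return "move" if s == 0 else "stay"


def estimate_objective(policy, num_episodes=2000, horizon=100, seed=0):
    """Monte Carlo estimate of the expected discounted return, truncated at `horizon` steps."""
    rng = random.Random(seed)
    total = 0.0
    for _episode in range(num_episodes):
        s, discount, ret = 0, 1.0, 0.0  # assumed start state s_0 = 0
        for _t in range(horizon):
            a = policy(s)
            ret += discount * reward(s, a)
            s = transition(rng, s, a)
            discount *= GAMMA
        total += ret
    return total / num_episodes


value_stay = estimate_objective(always_stay)
value_go = estimate_objective(go_to_one)
print(value_stay, value_go)
```

Under these assumptions, the policy that moves toward the rewarding state should score noticeably higher, so it is the better solution to the maximization over $\pi$ among the two candidates.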

Recap

1. What is Reinforcement Learning (RL)?

2. Logistics and Syllabus

3. Types of Machine Learning (ML)

4. Markov Decision Processes (MDP)