CS 4/5789: Introduction to Reinforcement Learning
Lecture 1: Introduction
Prof. Sarah Dean
MW 2:45–4:00pm
255 Olin Hall
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)
Examples: AlphaGo, Robotic Manipulation, Algorithmic Media Feeds, ...

[Diagram: agent-environment loop of observation, action, and reward]

a policy maps observations to actions
design the policy to achieve high reward
RL is for Sequential Decision-Making
reaction
adaptation
Sequential Decision-Making
[Diagram: observation-action-reward loops for AlphaGo, Robotic Manipulation, and Media Feeds, with each component left as a question for the class; adaptation depicted as parameters \(\theta_t\) converging to \(\theta_*\)]
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)
Logistics
 Instructor: Prof. Sarah Dean
 Head TAs: Rohan Banerjee and Runzhe Wu
 TAs & Consultant: Yann Hicke, Jenna Fields, Ruizhe Wang, Brandon Man, Yijia Dai, Patrick Yuan, Yingbing Huang
 Contact: Ed Discussion
Instructor Office Hours: Wednesday 4–4:50pm in 255 Olin, additional time TBD
 TA Office Hours: TBD, see Ed Discussion
Waitlist and Enrollment
There should be plenty of space!
Course staff do not manage waitlist and enrollment.
CS enrollment policies:
https://www.cs.cornell.edu/courseinfo/enrollment
Exams
 Prelim on March 15 in class
 Final exam during finals period, time TBD
Homework
 Homework assignments
 ~8 problem sets (math)
 ~4 programming assignments (coding)
 Gradescope
 neatly written, ideally typeset with LaTeX
 5789: Paper review assignments (after Unit 1)
Collaboration Policy: discussion is fine, but write your own solutions and code, and do not look at others' work or let others look at yours
 Late Policy: penalties can be avoided by requesting extensions on Ed (private post)
Participation
Participation is 5% of the final grade, out of 20 points

Lecture participation = 1pt each
 Poll Everywhere: PollEv.com/sarahdean011

Helpful posts on Ed Discussion = 2pt each
as determined by TA endorsement
Schedule

Unit 1: Fundamentals of Planning and Control (Jan & Feb)
 Imitation learning, Markov Decision Processes, Dynamic Programming, Value and Policy Iteration, Continuous Control, Linear Quadratic Regulation

Unit 2: Learning in MDPs (Mar)
Estimation, Model-based RL, Approximate Dynamic Programming, Policy Optimization

Unit 3: Exploration (Apr & May)
Multi-armed Bandits, Contextual Bandits
 State of the art examples
Prerequisites
Machine learning (e.g., CS 4780)
Background in probability, linear algebra, and programming.
Materials
Lecture Slides and Notes
Extra Resources (not required)
RL Theory Book: https://rltheorybook.github.io/
Classic RL Book: Sutton & Barto (http://www.incompleteideas.net/book/RLbook2020.pdf)
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)
Types of Machine Learning
 Unsupervised Learning
 Supervised Learning
 Reinforcement Learning
Unsupervised Learning
Examples: clustering, principal component analysis (PCA)
 Goal:
 summarization
 Dataset:
 information about many instances
 \(\{x_1, x_2, \dots x_N\}\)
 Evaluation:
 qualitative
"descriptive"
Supervised Learning
Examples: classification, regression
 Goal:
 prediction
 Dataset:
 each instance has features and label
 \(\{(x_1, y_1), \dots (x_N,y_N)\}\)
 Evaluation:
 accuracy, \(y\) vs. \(\hat y\)
"predictive"
Reinforcement Learning
 Goal:
 action or decision
 Dataset:
 history of observations, actions, and rewards
 sequential \(\{(o_t, a_t, r_t)\}_{t=1}^T\)
 Evaluation:
 cumulative reward
"presciptive"
Types of Machine Learning
 Unsupervised Learning
 summarize unstructured data \(\{x_i\}_{i=1}^N\)
 Supervised Learning
 predict labels from features \(\{(x_i, y_i)\}_{i=1}^N\)
 Reinforcement Learning
 choose actions that lead to high reward
 sequential data
Difficulties of Sequential Problem
Unlike other types of ML, in RL the data may not be drawn i.i.d. from a fixed distribution:
May start with no data
Actions have consequences
Solving a task may require a long sequence of correct actions
[Diagram: interaction loop \(a_t \to o_t, r_t \to a_{t+1} \to o_{t+1}, r_{t+1} \to \dots\)]
ML Specification
Specifying a supervised learning problem:
feature space \(\mathcal X\) and label space \(\mathcal Y\)
distribution over features and labels \((x,y)\sim \mathcal D\)
often empirical, i.e., defined by a dataset
pick a loss function to determine accuracy
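A minimal sketch of such a specification in Python; the dataset, predictor, and choice of 0-1 loss below are all made up for illustration:

```python
import numpy as np

# Empirical distribution D: a dataset of (feature, label) pairs.
# Here X = R^2 and Y = {0, 1}; the data is hypothetical.
X = np.array([[0.0, 1.0], [1.0, 0.5], [0.2, 0.8], [0.9, 0.1]])
y = np.array([0, 1, 0, 1])

def predictor(x):
    # A hypothetical predictor: threshold on the first feature.
    return int(x[0] > 0.5)

def zero_one_loss(y_true, y_pred):
    # 0-1 loss: 1 if the prediction is wrong, 0 otherwise.
    return int(y_true != y_pred)

# Evaluation: average loss over the empirical distribution.
avg_loss = np.mean([zero_one_loss(yi, predictor(xi)) for xi, yi in zip(X, y)])
print(f"empirical 0-1 loss: {avg_loss}")
```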
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)
General setting
[Diagram: agent-environment loop with action \(a_t\), observation \(o_t\), reward \(r_t\), and next observation \(o_{t+1}\)]
 Agent observes environment
 Agent takes action
 Environment sends reward and changes
Markov Decision Process (MDP)
[Diagram: agent-environment loop with state \(s_t\), action \(a_t\sim \pi(s_t)\), reward \(r_t\sim r(s_t, a_t)\), and next state \(s_{t+1}\sim P(s_t, a_t)\)]
Agent observes the state of the environment
Agent takes an action depending on the state, according to a policy
Environment returns a reward and updates the state (stochastically) according to the reward and transition functions
This is an assumption on the structure of observations and how they change: the observation is the state \(s_t\).
Markov Decision Process (MDP)

Key Markovian Assumption:
The state transition is independent of the past when conditioned on the current state and action:
$$\mathbb P\{s_{t+1}=s\mid s_t, s_{t-1},\dots, s_0, a_t, \dots, a_0 \} = \mathbb P\{s_{t+1}=s\mid s_t, a_t \}$$
Similarly for the reward signal.
Therefore, we write the state transition and reward distributions as
$$s_{t+1}\sim P(s_t, a_t),\quad r_t\sim r(s_t, a_t)$$
Actions can be chosen based only on the current state:
$$a_t \sim \pi(s_t)$$
Example

state: \(s\)
 finger configuration and object pose

action: \(a\)
 joint motor commands

transition: \(s'\sim P(s,a)\)
 physical equations of motion (gravity, contact forces, friction)

policy: \(\pi(s)\)
 maps configurations to motor commands

reward: \(r(s,a)\)
e.g., negative distance to the goal
[Image: robot manipulation]
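To make the interaction loop concrete, here is a minimal sketch in Python; the two-state MDP, its transition probabilities, and the always-act-1 policy are invented for illustration, not taken from the lecture:

```python
import random

# A made-up tabular MDP: states {"A", "B"}, actions {0, 1}.
def P(s, a):
    # Transition function: action 1 moves toward "B" with high probability.
    p_B = 0.9 if a == 1 else 0.2
    return "B" if random.random() < p_B else "A"

def r(s, a):
    # Reward function: reward 1 in state "B", else 0.
    return 1.0 if s == "B" else 0.0

def pi(s):
    # A simple deterministic policy: always take action 1.
    return 1

# Agent-environment loop: observe state, act, receive reward, transition.
s = "A"
for t in range(5):
    a = pi(s)         # a_t ~ pi(s_t)
    reward = r(s, a)  # r_t ~ r(s_t, a_t)
    s = P(s, a)       # s_{t+1} ~ P(s_t, a_t)
    print(f"t={t}: a={a}, r={reward}, next state={s}")
```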
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
 \(\mathcal{S}\) space of possible states \(s\in\mathcal S\)
 \(\mathcal{A}\) space of possible actions \(a\in \mathcal{A}\)
 \(r\) stochastic map from state, action to scalar reward
 \(P\) stochastic map from current state and action to next state
 \(\gamma\) discount factor between \(0\) and \(1\)
Goal: achieve high cumulative reward:
$$\sum_{t=0}^\infty \gamma^t r_t$$
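A standard observation on why discounting is useful (assuming bounded rewards, e.g. \(r_t \in [0,1]\), which the slide does not state explicitly): the infinite sum is then guaranteed to be finite by the geometric series,
$$\sum_{t=0}^\infty \gamma^t r_t \;\le\; \sum_{t=0}^\infty \gamma^t \;=\; \frac{1}{1-\gamma}.$$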
Equivalently, the goal is to find a policy \(\pi\) solving
$$\max_\pi\; \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right] \quad \text{s.t.}\quad s_{t+1}\sim P(s_t, a_t), ~~ a_t\sim \pi(s_t)$$
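For a fixed policy, this objective can be estimated by Monte Carlo rollouts; the sketch below is one illustrative way to do so (the toy MDP is the same made-up example as above, and the horizon is truncated, which is justified since the discarded tail is at most \(\gamma^H/(1-\gamma)\) for rewards in \([0,1]\)):

```python
import random

def rollout_return(P, r, pi, s0, gamma, horizon):
    # Simulate one trajectory and accumulate its discounted return.
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        total += discount * r(s, a)
        discount *= gamma
        s = P(s, a)
    return total

def mc_objective(P, r, pi, s0, gamma, horizon=200, n_rollouts=1000):
    # Monte Carlo estimate of E[sum_t gamma^t r(s_t, a_t)] under policy pi.
    returns = [rollout_return(P, r, pi, s0, gamma, horizon) for _ in range(n_rollouts)]
    return sum(returns) / len(returns)

# Same made-up two-state MDP as before.
P = lambda s, a: "B" if random.random() < (0.9 if a == 1 else 0.2) else "A"
r = lambda s, a: 1.0 if s == "B" else 0.0
pi = lambda s: 1

print(mc_objective(P, r, pi, s0="A", gamma=0.9))
```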
Recap
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)