CS 4/5789: Introduction to Reinforcement Learning
Lecture 1: Introduction
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)


[Examples: AlphaGo, Robotic Manipulation, Algorithmic Media Feeds, ...]

RL is for Sequential Decision-Making
[Diagram: agent-environment loop exchanging observation, action, and reward; reaction and adaptation over time]
- a policy maps observations to actions
- design the policy to achieve high reward

Sequential Decision-Making
[Diagram: for each example (AlphaGo, Robotic Manipulation, Media Feeds), what are the observation, action, and reward?]
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)
Logistics
- Instructor: Prof. Sarah Dean
- Head TAs: Rohan Banerjee and Runzhe Wu
- TAs & Consultant: Yann Hicke, Jenna Fields, Ruizhe Wang, Brandon Man, Yijia Dai, Patrick Yuan, Yingbing Huang
- Contact: Ed Discussion
- Instructor Office Hours: Wednesday 4-4:50pm in 255 Olin, additional time TBD
- TA Office Hours: TBD, see Ed Discussion
Waitlist and Enrollment
There should be plenty of space!
Course staff do not manage the waitlist or enrollment.
CS enrollment policies:
https://www.cs.cornell.edu/courseinfo/enrollment
Exams
- Prelim on March 15 in class
- Final exam during finals period, time TBD
Homework
- Homework assignments
  - ~8 problem sets (math)
  - ~4 programming assignments (coding)
- Submission via Gradescope
  - neatly written, ideally typeset with LaTeX
- 5789: Paper review assignments (after Unit 1)
- Collaboration Policy: discussion is fine, but write your own solutions and code, and do not look at others' work or let others look at yours
- Late Policy: penalties can be avoided by requesting extensions on Ed (private post)
Participation
Participation is 5% of the final grade, out of 20 points.
- Lecture participation = 1pt each
  - Poll Everywhere: PollEv.com/sarahdean011
- Helpful posts on Ed Discussion = 2pt each
  - TA endorsement
Schedule
- Unit 1: Fundamentals of Planning and Control (Jan & Feb)
  - Imitation learning, Markov Decision Processes, Dynamic Programming, Value and Policy Iteration, Continuous Control, Linear Quadratic Regulation
- Unit 2: Learning in MDPs (Mar)
  - Estimation, Model-based RL, Approximate Dynamic Programming, Policy Optimization
- Unit 3: Exploration (Apr & May)
  - Multi-armed Bandits, Contextual Bandits
  - State-of-the-art examples
Prerequisites
Machine learning (e.g., CS 4780)
Background in probability, linear algebra, and programming.
Materials
Lecture Slides and Notes
Extra Resources (not required)
RL Theory Book: https://rltheorybook.github.io/
Classic RL Book: Sutton & Barto (http://www.incompleteideas.net/book/RLbook2020.pdf)
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)
Types of Machine Learning
- Unsupervised Learning
- Supervised Learning
- Reinforcement Learning
Unsupervised Learning
Examples: clustering, principal component analysis (PCA)
- Goal:
- summarization
- Dataset:
- information about many instances
- \(\{x_1, x_2, \dots x_N\}\)
- Evaluation:
- qualitative
"descriptive"
Supervised Learning
Examples: classification, regression
- Goal:
- prediction
- Dataset:
- each instance has features and label
- \(\{(x_1, y_1), \dots (x_N,y_N)\}\)
- Evaluation:
- accuracy, \(y\) vs. \(\hat y\)
"predictive"
Reinforcement Learning
- Goal:
- action or decision
- Dataset:
- history of observations, actions, and rewards
- sequential \(\{(o_t, a_t, r_t)\}_{t=1}^T\)
- Evaluation:
- cumulative reward
"presciptive"
Types of Machine Learning
- Unsupervised Learning
- summarize unstructured data \(\{x_i\}_{i=1}^N\)
- Supervised Learning
- predict labels from features \(\{(x_i, y_i)\}_{i=1}^N\)
- Reinforcement Learning
- choose actions that lead to high reward
- sequential data
Difficulties of Sequential Problems
Unlike other types of ML, in RL the data may not be drawn i.i.d. from a fixed distribution:
- May start with no data
- Actions have consequences
- Solving the task may require a long sequence of correct actions


[Diagram: interaction timeline \(a_t \to (o_t, r_t) \to a_{t+1} \to (o_{t+1}, r_{t+1}) \to \dots\)]
ML Specification
- Specifying a supervised learning problem:
  - feature space \(\mathcal X\) and label space \(\mathcal Y\)
  - distribution over features and labels \((x,y)\sim \mathcal D\)
    - often empirical, i.e. defined by a dataset
  - a loss function to determine accuracy
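For comparison with the RL setting that follows, here is a minimal sketch of such a supervised specification, assuming a toy dataset and a squared loss; the names `squared_loss` and `empirical_risk` are illustrative, not from the course.

```python
import numpy as np

# A toy empirical distribution D: uniform over the dataset (features X, labels y).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

def squared_loss(y_hat, y_true):
    """Loss function used to measure accuracy of a prediction."""
    return (y_hat - y_true) ** 2

def empirical_risk(predict):
    """Average loss of a predictor over the empirical distribution D."""
    return np.mean([squared_loss(predict(x), yi) for x, yi in zip(X, y)])

# Example predictor: the identity map on the single feature.
print(empirical_risk(lambda x: x[0]))
```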
Agenda
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)
General setting


[Diagram: agent takes action \(a_t\); environment returns observation \(o_t\), reward \(r_t\), then observation \(o_{t+1}\), ...]
- Agent observes environment
- Agent takes action
- Environment sends reward and changes
Markov Decision Process (MDP)


[Diagram: agent observes state \(s_t\) and takes action \(a_t \sim \pi(s_t)\); environment returns reward \(r_t\sim r(s_t, a_t)\) and next state \(s_{t+1}\sim P(s_t, a_t)\)]
- Agent observes the state of the environment
- Agent takes an action depending on the state, according to its policy
- Environment state updates (stochastically) according to the transition function
An MDP is an assumption on the structure of observations and how they change: the observation is the state \(s_t\) itself.
Markov Decision Process (MDP)
Key Markovian Assumption:
- The state transition is independent of the past when conditioned on the current state and action
  - \(\mathbb P\{s_{t+1}=s\mid s_t, s_{t-1},\dots, s_0, a_t, \dots, a_0 \} = \mathbb P\{s_{t+1}=s\mid s_t, a_t \}\)
  - Similarly for the reward signal
- Therefore, we write the state transition and reward distributions as
  - \(s_{t+1}\sim P(s_t, a_t),\quad r_t\sim r(s_t, a_t)\)
- Actions can be chosen based only on the current state
  - \(a_t \sim \pi(s_t)\)
Markov Decision Process (MDP)


[Diagram: agent observes state \(s_t\) and takes action \(a_t \sim \pi(s_t)\); environment returns reward \(r_t\sim r(s_t, a_t)\) and next state \(s_{t+1}\sim P(s_t, a_t)\)]
- Agent observes state of environment
- Agent takes action depending on state according to policy
- Environment returns reward and updates state according to reward/transition function
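This interaction can be written as a short loop. The sketch below assumes a hypothetical simulator object `env` with `reset`/`step` methods and a stochastic policy `sample_action`; it is meant only to illustrate the MDP interface, not any particular library.

```python
def rollout(env, sample_action, T):
    """Run one episode of T steps, recording (state, action, reward) tuples."""
    s = env.reset()                # initial state s_0
    history = []
    for t in range(T):
        a = sample_action(s)       # a_t ~ pi(s_t): action depends only on the current state
        s_next, r = env.step(a)    # r_t ~ r(s_t, a_t), s_{t+1} ~ P(s_t, a_t)
        history.append((s, a, r))
        s = s_next                 # Markov property: only s_t is carried forward
    return history
```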
Example: robot manipulation
- state \(s\): finger configuration and object pose
- action \(a\): joint motor commands
- transition \(s'\sim P(s,a)\): physical equations of motion (gravity, contact forces, friction)
- policy \(\pi(s)\): maps configurations to motor commands
- reward \(r(s,a)\): negative distance to goal (etc.)
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
- \(\mathcal{S}\) space of possible states \(s\in\mathcal S\)
- \(\mathcal{A}\) space of possible actions \(a\in \mathcal{A}\)
- \(r\) stochastic map from state, action to scalar reward
- \(P\) stochastic map from current state and action to next state
- \(\gamma\) discount factor between \(0\) and \(1\)
Goal: achieve high cumulative reward:
$$\sum_{t=0}^\infty \gamma^t r_t$$
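For a finite prefix of a trajectory, this discounted sum can be computed directly; a minimal sketch with an illustrative constant reward sequence and \(\gamma = 0.9\):

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: constant reward 1 for 100 steps with gamma = 0.9
print(discounted_return([1.0] * 100, 0.9))  # approaches 1 / (1 - gamma) = 10
```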
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
- \(\mathcal{S}\) space of possible states \(s\in\mathcal S\)
- \(\mathcal{A}\) space of possible actions \(a\in \mathcal{A}\)
- \(r\) stochastic map from state, action to scalar reward
- \(P\) stochastic map from current state and action to next state
- \(\gamma\) discount factor between \(0\) and \(1\)
maximize over \(\pi\): \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
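Because the objective is an expectation over the randomness in the policy, transitions, and rewards, it can be approximated by averaging discounted returns over repeated rollouts, truncated at a finite horizon. The sketch below combines the interaction loop and discounted sum from above, again assuming a hypothetical `env` with `reset`/`step` methods.

```python
def estimate_objective(env, sample_action, gamma, T=1000, num_episodes=100):
    """Monte Carlo estimate of E[ sum_t gamma^t r(s_t, a_t) ] under policy pi,
    truncating the infinite sum at horizon T."""
    total = 0.0
    for _ in range(num_episodes):
        s = env.reset()
        episode_return = 0.0
        for t in range(T):
            a = sample_action(s)        # a_t ~ pi(s_t)
            s, r = env.step(a)          # s_{t+1} ~ P(s_t, a_t), r_t ~ r(s_t, a_t)
            episode_return += (gamma ** t) * r
        total += episode_return
    return total / num_episodes
```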
Recap
1. What is Reinforcement Learning (RL)?
2. Logistics and Syllabus
3. Types of Machine Learning (ML)
4. Markov Decision Processes (MDP)