Lecture 5: Reinforcement Learning (Value-based Methods)
Shen Shen
April 14, 2025
2:30pm, Room 32-144
Modeling with Machine Learning for Computer Science

Toddler demo, Russ Tedrake thesis, 2004
(Uses vanilla policy gradient (actor-critic))









Reinforcement Learning with Human Feedback
Markov Decision Processes
- Research area initiated in the 1950s by Bellman, known under various names in various communities:
  - Stochastic optimal control (control theory)
  - Stochastic shortest path (operations research)
  - Sequential decision making under uncertainty (economics)
  - Reinforcement learning (artificial intelligence, machine learning)
- A rich variety of accessible and elegant theory, math, algorithms, and applications; but also considerable variation in notation.
- We will use the most RL-flavored notation.
Running example: Mario in a grid-world

- 9 possible states \(s\)
- 4 possible actions \(a\): {Up ↑, Down ↓, Left ←, Right →}
- (state, action) results in a transition \(\mathrm{T}\) into a next state:
  - Normally, we get to the "intended" state; e.g., in state 7, action "↑" gets to state 4.
  - If an action would take Mario out of the grid world, he stays put; e.g., in state 9, "→" gets back to state 9.
  - In state 6, action "↑" leads to two possibilities: a 20% chance of landing in state 2 and an 80% chance of landing in state 3.

- (state, action) pairs give out rewards:
- in state 3, any action gives reward 1
- in state 6, any action gives reward -10
- any other (state, action) pair gives reward 0

- Discount factor: a scalar that reduces the "worth" of rewards, depending on when we receive them.
- E.g., say this factor is 0.9 for our Mario game. Then the (3, \(\leftarrow\)) pair's reward of 1 is received at face value at the start of the game; at the 2nd time step, the same reward is discounted to 0.9; at the 3rd time step, it is further discounted to \((0.9)^2\), and so on (see the sketch below).
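As a quick arithmetic check, here is a minimal sketch (the variable names are illustrative, not from the slides) of how a stream of rewards gets discounted:

```python
gamma = 0.9              # discount factor for our Mario game
rewards = [1, 1, 1]      # a reward of 1 received at time steps t = 0, 1, 2

# a reward received at time t is only worth gamma**t of its face value
discounted_total = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_total)  # 1 + 0.9 + 0.81 = 2.71
```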


Markov Decision Processes - Definition and terminology (Mario in a grid-world, cont'd)
- \(\mathcal{S}\) : state space, contains all possible states \(s\).
- \(\mathcal{A}\) : action space, contains all possible actions \(a\).
- \(\mathrm{T}\left(s, a, s^{\prime}\right)\) : the probability of transitioning from state \(s\) to \(s^{\prime}\) when action \(a\) is taken.
  - E.g., \(\mathrm{T}\left(7, \uparrow, 4\right) = 1\), \(\mathrm{T}\left(9, \rightarrow, 9\right) = 1\), \(\mathrm{T}\left(6, \uparrow, 3\right) = 0.8\), \(\mathrm{T}\left(6, \uparrow, 2\right) = 0.2\).
- \(\mathrm{R}(s, a)\) : reward function, takes in a (state, action) pair and returns a reward.
  - E.g., \(\mathrm{R}\left(3, \uparrow \right) = 1\), \(\mathrm{R}\left(6, \rightarrow \right) = -10\).
- \(\gamma \in [0,1]\) : discount factor, a scalar.
- \(\pi{(s)}\) : policy, takes in a state and returns an action.

The goal of an MDP is to find a "good" policy.

Sidenote: For now,
- \(\mathrm{R}(s, a)\) is deterministic and bounded.
- \(\pi(s)\) is deterministic.
- \(\mathcal{S}\) and \(\mathcal{A}\) are small discrete sets, unless otherwise specified.
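To make the definitions concrete, here is one possible (hypothetical) Python encoding of this grid-world MDP; the row-by-row layout of the nine states and all helper names are assumptions for illustration, not part of the slides:

```python
# States 1..9 laid out as:  1 2 3
#                           4 5 6
#                           7 8 9
STATES = list(range(1, 10))
ACTIONS = ["up", "down", "left", "right"]
GAMMA = 0.9

def intended_next(s, a):
    """Deterministic grid move; if it would leave the grid, stay put."""
    row, col = divmod(s - 1, 3)
    dr, dc = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[a]
    r, c = row + dr, col + dc
    return 3 * r + c + 1 if 0 <= r < 3 and 0 <= c < 3 else s

def T(s, a, s_next):
    """Transition probability T(s, a, s')."""
    if (s, a) == (6, "up"):                      # the one special stochastic transition
        return {2: 0.2, 3: 0.8}.get(s_next, 0.0)
    return 1.0 if s_next == intended_next(s, a) else 0.0

def R(s, a):
    """Reward R(s, a): +1 for any action in state 3, -10 in state 6, 0 elsewhere."""
    return 1.0 if s == 3 else (-10.0 if s == 6 else 0.0)

# sanity checks against the numbers quoted in the slides
assert T(7, "up", 4) == 1 and T(9, "right", 9) == 1
assert T(6, "up", 3) == 0.8 and T(6, "up", 2) == 0.2
```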
[Figure: the agent-environment interaction loop over time: in state \(s\), the agent takes action \(a = \pi(s)\), receives reward \(r = \mathrm{R}(s, a)\), and moves to a next state according to \(\mathrm{T}\left(s, a, s^{\prime}\right)\).]
A trajectory (aka an experience, or a rollout) of horizon \(h\), unrolled over time starting from an initial state \(s_0\):
\(\quad \tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{h-1}, a_{h-1}, r_{h-1}\right)\)
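A trajectory can also be sampled in code; a minimal sketch, reusing STATES, ACTIONS, T, and R from the hypothetical grid-world encoding above:

```python
import random

def rollout(s0, policy, horizon, T, R, states):
    """Sample a trajectory tau = (s_0, a_0, r_0, ..., s_{h-1}, a_{h-1}, r_{h-1})."""
    tau, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        r = R(s, a)
        tau.append((s, a, r))
        # sample the next state s' according to T(s, a, .)
        s = random.choices(states, weights=[T(s, a, sp) for sp in states])[0]
    return tau

def always_up(s):
    return "up"    # an example policy: always go up

print(rollout(6, always_up, horizon=5, T=T, R=R, states=STATES))
```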




The whole trajectory depends on \(\pi\). Starting in a given \(s_0\), how "good" is it to follow a policy for \(h\) time steps?




One idea: add up the rewards collected along the trajectory. But consider the Mario game with its stochastic transition: suppose we start in state 6 and go up. We receive the reward of \((6,\uparrow)\) immediately, but the next state is random, so different rollouts collect different rewards; we need to average over them.




For a given policy \(\pi(s)\), the (state) value functions are
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
(there are \(h\) terms inside the sum). This expectation is w.r.t. the transition probabilities \(\mathrm{T}\left(s, a, s^{\prime}\right)\), as well as possibly noisy rewards; for now, let's assume rewards are deterministic.
- Value functions \(V^h_\pi(s)\): the expected sum of discounted rewards, starting in state \(s\) and following policy \(\pi\) for \(h\) steps.
- Horizon-0 values are defined as 0.
- Value is long-term; reward is short-term (one-time).
- Horizon \(h\) = 0: no steps left.
- Horizon \(h\) = 1: receive the rewards at face value.
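Since the value is an expectation over random trajectories, one rough way to approximate it is to average sampled returns; a minimal Monte Carlo sketch, reusing the hypothetical rollout, grid-world helpers, and always_up policy from the sketches above:

```python
def mc_value_estimate(s0, policy, horizon, num_rollouts=10000):
    """Estimate V^h_pi(s0) by averaging discounted returns over sampled trajectories."""
    total = 0.0
    for _ in range(num_rollouts):
        tau = rollout(s0, policy, horizon, T, R, STATES)
        total += sum(GAMMA**t * r for t, (_, _, r) in enumerate(tau))
    return total / num_rollouts

# starting in state 3 under the always-up policy the rollout is deterministic,
# so this should come out as exactly 1 + 0.9 * 1 = 1.9
print(mc_value_estimate(3, always_up, horizon=2))
```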
Example: evaluating the "always \(\uparrow\)" policy, i.e., \(\pi(s) = ``\uparrow"\) for all \(s\), with \(\gamma = 0.9\) (states/transitions and rewards as above).
- Horizon \(h\) = 2: each \(V^2_\pi(s)\) has 2 terms inside the sum.
- Horizon \(h\) = 3: each \(V^3_\pi(s)\) has 3 terms inside the sum.
[Figures: the grid-world states/transitions, rewards, and the resulting values under the always-\(\uparrow\) policy.]
Bellman Recursion
\(V^h_\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{h-1}_\pi\left(s^{\prime}\right)\)
- Left-hand side: the horizon-\(h\) value in state \(s\), the expected sum of discounted rewards, starting in state \(s\) and following policy \(\pi\) for \(h\) steps.
- \(\mathrm{R}(s, \pi(s))\): the immediate reward for taking the policy-prescribed action \(\pi(s)\) in state \(s\).
- \(V^{h-1}_\pi\left(s^{\prime}\right)\): the \((h-1)\)-horizon value at a next state \(s^{\prime}\), weighted by the probability \(\mathrm{T}\left(s, \pi(s), s^{\prime}\right)\) of getting to that next state \(s^{\prime}\), and discounted by \(\gamma\).
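The recursion translates directly into code; a minimal sketch of exact (no sampling) finite-horizon policy evaluation, reusing the hypothetical grid-world helpers and always_up policy from the sketches above:

```python
def evaluate_policy(policy, horizon):
    """Compute V^h_pi(s) for all s via the finite-horizon Bellman recursion."""
    V = {s: 0.0 for s in STATES}                     # horizon-0 values are defined as 0
    for _ in range(horizon):
        # V^h(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') V^{h-1}(s')
        V = {s: R(s, policy(s))
                + GAMMA * sum(T(s, policy(s), sp) * V[sp] for sp in STATES)
             for s in STATES}
    return V

print(evaluate_policy(always_up, horizon=2)[3])      # 1.9 for the always-up policy
```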
If the horizon \(h\) approaches infinity (typically \(\gamma < 1\) in the MDP definition), the finite-horizon Bellman recursions become the infinite-horizon Bellman Equations:
\(V^{\infty}_\pi(s) = \mathrm{R}(s, \pi(s)) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{\infty}_\pi\left(s^{\prime}\right), \forall s\)
This is a system of \(|\mathcal{S}|\) linear equations, one equation for each state.
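Because these are \(|\mathcal{S}|\) linear equations in the \(|\mathcal{S}|\) unknowns \(V^{\infty}_\pi(s)\), they can be solved directly; a sketch using numpy, again reusing the hypothetical grid-world helpers and always_up policy from above:

```python
import numpy as np

def evaluate_policy_inf(policy, gamma=GAMMA):
    """Solve V = R_pi + gamma * T_pi V, i.e. (I - gamma * T_pi) V = R_pi."""
    n = len(STATES)
    T_pi = np.array([[T(s, policy(s), sp) for sp in STATES] for s in STATES])
    R_pi = np.array([R(s, policy(s)) for s in STATES])
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return dict(zip(STATES, V))

print(evaluate_policy_inf(always_up))   # infinite-horizon values of the always-up policy
```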
Quick summary: given an MDP and a policy \(\pi\), policy evaluation computes the values \(V_\pi(s)\).
Optimal policy \(\pi^*\)

Definition of \(\pi^*\): for a given MDP and a fixed horizon \(h\) (possibly infinite), \(\mathrm{V}^h_{\pi^*}({s}) \geqslant \mathrm{V}^h_\pi({s})\) for all \(s \in \mathcal{S}\) and for all possible policies \(\pi\).

Should be convinced of:
- For a fixed MDP, the optimal values \(\mathrm{V}^h_{\pi^*}({s})\) must be unique.
- The optimal policy \(\pi^*\) might not be unique (think, e.g., of a symmetric world).
- For finite \(h\), the optimal policy depends on how many time steps are left.
- When \(h \rightarrow \infty\), time no longer matters; in other words, there exists a stationary optimal policy.

How to search for an optimal policy \(\pi^*\)?
- One possible idea: enumerate all possible policies, do policy evaluation on each, and take the max values \(\mathrm{V}^h_{\pi^*}({s})\), which then give us an optimal policy.
- Very, very tedious ... and it gives no insight.
- A better idea: take advantage of the recursive structure: if we act optimally from the next step onward, we only need to act optimally one more time, at the current step.

If we introduce the following quantity:
Optimal state-action value functions \(Q^h(s, a)\)

\(Q^h(s, a)\): the expected sum of discounted rewards for
- starting in state \(s\),
- taking action \(a\), for one step,
- acting optimally thereafter for the remaining \((h-1)\) steps.

\(V\) values vs. \(Q\) values
- \(V\) is defined over the state space; \(Q\) is defined over the (state, action) space.
- Any policy can be evaluated to get \(V\) values, whereas \(Q\), per definition, has the sense of "tail optimality" baked in.
- \(\mathrm{V}^h_{\pi^*}({s})\) can be derived from \(Q^h(s,a)\): \(V^{h}_{\pi^{*}}(s)=\max_{a}\left[\mathrm{Q}^{h}(s, a)\right]\), and vice versa.
- \(Q\) is easier to read "optimal actions" from.
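The last two bullets amount to a max and an argmax; a tiny sketch, assuming the \(Q\) values are stored in a dictionary keyed by (state, action) pairs (as in the later sketches):

```python
def v_from_q(Q, s, actions=ACTIONS):
    """V^h_{pi*}(s) = max_a Q^h(s, a)."""
    return max(Q[(s, a)] for a in actions)

def greedy_action(Q, s, actions=ACTIONS):
    """Read an optimal action off the Q values: argmax_a Q^h(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])
```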
Recursively finding \(Q^h(s, a)\)

Recall: \(\gamma = 0.9\); the states, the one special transition, and the rewards \(\mathrm{R}(s,a)\) are as above.

Let's consider \(\mathrm{Q}^2(3, \rightarrow)\):
- receive \(\mathrm{R}(3,\rightarrow)\)
- next state \(s'\) = 3; act optimally for the remaining one time step, receiving \(\max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right)\)
- \(\mathrm{Q}^2(3, \rightarrow) = \mathrm{R}(3,\rightarrow) + \gamma \max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right) = 1.9\)

Let's consider \(\mathrm{Q}^2(3, \uparrow)\):
- receive \(\mathrm{R}(3,\uparrow)\)
- next state \(s'\) = 3; act optimally for the remaining one time step, receiving \(\max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right)\)
- \(\mathrm{Q}^2(3, \uparrow) = \mathrm{R}(3,\uparrow) + \gamma \max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right) = 1.9\)

Let's consider \(\mathrm{Q}^2(3, \leftarrow)\):
- receive \(\mathrm{R}(3,\leftarrow)\)
- next state \(s'\) = 2; act optimally for the remaining one time step, receiving \(\max _{a^{\prime}} \mathrm{Q}^{1}\left(2, a^{\prime}\right)\)
- \(\mathrm{Q}^2(3, \leftarrow) = \mathrm{R}(3,\leftarrow) + \gamma \max _{a^{\prime}} \mathrm{Q}^{1}\left(2, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} \mathrm{Q}^{1}\left(2, a^{\prime}\right) = 1\)

Let's consider \(\mathrm{Q}^2(3, \downarrow)\):
- receive \(\mathrm{R}(3,\downarrow)\)
- next state \(s'\) = 6; act optimally for the remaining one time step, receiving \(\max _{a^{\prime}} \mathrm{Q}^{1}\left(6, a^{\prime}\right)\)
- \(\mathrm{Q}^2(3, \downarrow) = \mathrm{R}(3,\downarrow) + \gamma \max _{a^{\prime}} \mathrm{Q}^{1}\left(6, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} \mathrm{Q}^{1}\left(6, a^{\prime}\right) = -8\)

Let's consider \(\mathrm{Q}^2(6, \uparrow)\):
- receive \(\mathrm{R}(6,\uparrow)\)
- act optimally for one more time step, at the next state \(s^{\prime}\):
  - 20% chance, \(s'\) = 2, act optimally, receive \(\max _{a^{\prime}} \mathrm{Q}^{1}\left(2, a^{\prime}\right)\)
  - 80% chance, \(s'\) = 3, act optimally, receive \(\max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right)\)
- \(\mathrm{Q}^2(6, \uparrow) = \mathrm{R}(6,\uparrow) + \gamma[.2 \max _{a^{\prime}} \mathrm{Q}^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} \mathrm{Q}^{1}\left(3, a^{\prime}\right)] = -10 + .9 [.2 \times 0 + .8 \times 1] = -9.28\)

In general:
\(\mathrm{Q}^h(s, a) = \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}^{h-1}\left(s^{\prime}, a^{\prime}\right)\)

What's the optimal action in state 3, with horizon 2, given by \(\pi_2^*(3)\)? Either up or right (both attain \(\max_a \mathrm{Q}^2(3, a) = 1.9\)).
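The same recursion, in code, reproduces the numbers above; a sketch reusing the hypothetical grid-world helpers from earlier:

```python
def q_horizon(h):
    """Compute Q^h(s, a) via Q^h = R + gamma * sum_s' T(s, a, s') max_a' Q^{h-1}(s', a')."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}       # Q^0 is all zeros
    for _ in range(h):
        Q = {(s, a): R(s, a)
                     + GAMMA * sum(T(s, a, sp) * max(Q[(sp, ap)] for ap in ACTIONS)
                                   for sp in STATES)
             for s in STATES for a in ACTIONS}
    return Q

Q2 = q_horizon(2)
print(Q2[(3, "right")], Q2[(3, "down")], Q2[(6, "up")])      # ≈ 1.9, -8.0, -9.28
```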
Infinite-horizon Value Iteration

Given the recursion, as the horizon goes to infinity we can have an infinite-horizon equation for \(Q^{\infty}(s, a)\):
\(\mathrm{Q}^{\infty}(s, a) = \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}^{\infty}\left(s^{\prime}, a^{\prime}\right)\)
which we can solve iteratively:
- for \(s \in \mathcal{S}, a \in \mathcal{A}\) :
  - \(\mathrm{Q}_{\text {old }}(s, a)=0\)
- while True:
  - for \(s \in \mathcal{S}, a \in \mathcal{A}\) :
    - \(\mathrm{Q}_{\text {new }}(s, a) \leftarrow \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(s^{\prime}, a^{\prime}\right)\)
  - if \(\max _{s, a}\left|Q_{\text {old }}(s, a)-Q_{\text {new }}(s, a)\right|<\epsilon:\)
    - return \(\mathrm{Q}_{\text {new }}\)
  - \(\mathrm{Q}_{\text {old }} \leftarrow \mathrm{Q}_{\text {new }}\)

(If we run the update block \(h\) times and break, the returned values are exactly \(Q^h\).)
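A direct translation of the pseudocode, plus reading off a greedy policy at the end; a sketch reusing the hypothetical grid-world helpers from earlier:

```python
def value_iteration(eps=1e-6):
    """Iterate the Bellman optimality update until Q changes by less than eps."""
    Q_old = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    while True:
        Q_new = {(s, a): R(s, a)
                         + GAMMA * sum(T(s, a, sp) * max(Q_old[(sp, ap)] for ap in ACTIONS)
                                       for sp in STATES)
                 for s in STATES for a in ACTIONS}
        if max(abs(Q_old[k] - Q_new[k]) for k in Q_new) < eps:
            return Q_new
        Q_old = Q_new

Q_inf = value_iteration()
# act greedily w.r.t. the converged Q values to get an optimal (stationary) policy
policy = {s: max(ACTIONS, key=lambda a: Q_inf[(s, a)]) for s in STATES}
print(policy)
```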
Quick Summary
- A Markov decision process (MDP) is a nice mathematical framework for making sequential decisions. It is the foundation of reinforcement learning.
- An MDP is defined by a five-tuple \((\mathcal{S}, \mathcal{A}, \mathrm{T}, \mathrm{R}, \gamma)\), and the goal is to find an optimal policy that leads to high expected cumulative discounted rewards.
- To evaluate how good a given policy \(\pi\) is, we can calculate \(V_{\pi}(s)\) via
  - the summation-over-rewards definition, or
  - the Bellman recursion for finite horizon, and the Bellman equations for infinite horizon.
- To find an optimal policy, we can recursively find \(Q(s,a)\) via the value iteration algorithm, and then act greedily w.r.t. the \(Q\) values.