
Lecture 11: Markov Decision Processes
Shen Shen
November 15, 2024
Intro to Machine Learning

Outline
- Markov Decision Processes
  - Definition, terminologies, and policy
- Policy Evaluation
  - V-values: State Value Functions
  - Bellman recursions and Bellman equations
- Policy Optimization
  - Optimal policies π∗
  - Q-values: State-action Optimal Value Functions
  - Value iteration
Toddler demo, Russ Tedrake thesis, 2004
(Uses vanilla policy gradient (actor-critic))









Reinforcement Learning with Human Feedback
Outline
- Markov Decision Processes
  - Definition, terminologies, and policy
- Policy Evaluation
  - V-values: State Value Functions
  - Bellman recursions and Bellman equations
- Policy Optimization
  - Optimal policies π∗
  - Q-values: State-action Optimal Value Functions
  - Value iteration
Markov Decision Processes
- Research area initiated in the 1950s by Bellman, known under various names in various communities:
  - Stochastic optimal control (Control theory)
  - Stochastic shortest path (Operations Research)
  - Sequential decision making under uncertainty (Economics)
  - Reinforcement learning (Artificial Intelligence, Machine Learning)
- A rich variety of accessible and elegant theory, math, algorithms, and applications. But also, considerable variation in notation.
- We will use the most RL-flavored notation.
Running example: Mario in a grid-world
- 9 possible states
- 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}
- (state, action) results in a transition into a next state:
  - Normally, we get to the "intended" state;
    - e.g., in state 7, action "↑" gets to state 4.
  - If an action would take Mario out of the grid world, stay put;
    - e.g., in state 9, "→" gets back to state 9.
  - In state 6, action "↑" leads to two possibilities:
    - 20% chance to state 2
    - 80% chance to state 3

- (state, action) pairs give out rewards:
  - in state 3, any action gives reward 1
  - in state 6, any action gives reward -10
  - any other (state, action) pair gives reward 0

- discount factor: a scalar (here 0.9) that reduces the "worth" of rewards, depending on when we receive them.
  - e.g., for the (3, ←) pair, we receive a reward of 1 at the start of the game; at the 2nd time step, the reward is discounted to 0.9; at the 3rd time step, it is further discounted to (0.9)², and so on.
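As a tiny numeric illustration (mine, not from the slides), here is how that discounting plays out in code, using the γ = 0.9 from the example above:

```python
# A reward of 1 received at time step t is worth gamma**t, per the example above.
gamma = 0.9
worth = [gamma**t * 1.0 for t in range(5)]
print(worth)       # [1.0, 0.9, 0.81, 0.729, 0.6561] (up to float rounding)
print(sum(worth))  # if a reward of 1 kept arriving forever, the discounted sum would approach 1/(1 - 0.9) = 10
```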


Mario in a grid-world, cont'd
- T(7,↑,4) = 1
- T(9,→,9) = 1
- T(6,↑,3) = 0.8
- T(6,↑,2) = 0.2

Markov Decision Processes - Definition and terminologies
- S : state space, contains all possible states s.
- A : action space, contains all possible actions a.
- T(s,a,s′) : the probability of transitioning from state s to s′ when action a is taken.
- R(s,a) : reward, takes in a (state, action) pair and returns a reward.
- γ∈[0,1]: discount factor, a scalar.
- π(s) : policy, takes in a state and returns an action.
The goal of an MDP is to find a "good" policy.
Sidenote: In 6.390,
- R(s,a) is deterministic and bounded.
- π(s) is deterministic.
- S and A are small discrete sets, unless otherwise specified.
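To make the five-tuple concrete, here is a minimal Python/numpy sketch (mine, not from the slides) of the Mario grid-world, assuming the states are numbered 1 to 9 row by row from the top left (so they get indices 0 to 8 in code); all variable names are illustrative.

```python
import numpy as np

# States 1..9 laid out in a 3x3 grid (indexed 0..8 in code); actions: up, down, left, right.
ACTIONS = ["up", "down", "left", "right"]
N_S, N_A = 9, 4
GAMMA = 0.9

def intended_next(s, a):
    """Deterministic 'intended' move; stay put if it would leave the grid."""
    row, col = divmod(s, 3)
    if a == 0:   row = max(row - 1, 0)   # up
    elif a == 1: row = min(row + 1, 2)   # down
    elif a == 2: col = max(col - 1, 0)   # left
    else:        col = min(col + 1, 2)   # right
    return 3 * row + col

# Transition probabilities T[s, a, s'] and rewards R[s, a].
T = np.zeros((N_S, N_A, N_S))
R = np.zeros((N_S, N_A))
for s in range(N_S):
    for a in range(N_A):
        T[s, a, intended_next(s, a)] = 1.0

# The one special (stochastic) transition: in state 6, "up" goes to state 2 (20%) or state 3 (80%).
T[5, 0, :] = 0.0
T[5, 0, 1] = 0.2   # state 2 has index 1
T[5, 0, 2] = 0.8   # state 3 has index 2

# Rewards: any action in state 3 gives +1, any action in state 6 gives -10, everything else 0.
R[2, :] = 1.0
R[5, :] = -10.0

assert np.allclose(T.sum(axis=2), 1.0)  # each (s, a) row is a probability distribution
```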
Markov Decision Processes - Definition and terminologies
[Agent-environment loop diagram: at each time step, in state s the policy π(s) picks action a; the transition T(s,a,s′) determines the next state, and the reward R(s,a) is received.]
A trajectory (aka an experience, or a rollout) of horizon h, starting from an initial state s_0:
τ = (s_0, a_0, r_0, s_1, a_1, r_1, …, s_{h−1}, a_{h−1}, r_{h−1})
The whole trajectory depends on π.
Outline
- Markov Decision Processes
  - Definition, terminologies, and policy
- Policy Evaluation
  - V-values: State Value Functions
  - Bellman recursions and Bellman equations
- Policy Optimization
  - Optimal policies π∗
  - Q-values: State-action Optimal Value Functions
  - Value iteration
Starting in a given s_0, how "good" is it to follow a policy for h time steps?
[Agent-environment loop diagram: policy π(s), transition T(s,a,s′), reward R(s,a).]
- One idea: add up the rewards received along the trajectory.
- But, consider the Mario game, with its stochastic transition out of state 6 (and, e.g., the reward of (6,↑)): the trajectory, and hence the sum of rewards, is random.
- So we evaluate the expected sum of the h discounted rewards ("h terms inside"); in 6.390, this expectation is only w.r.t. the transition probabilities T(s,a,s′).
For a given policy π(s), the (state) value functions:
V^π_h(s) := E[ ∑_{t=0}^{h−1} γ^t R(s_t, π(s_t)) | s_0 = s, π ], ∀ s, h
- Value functions V^π_h(s): the expected sum of discounted rewards, starting in state s and following policy π for h steps.
- Horizon-0 values are defined as 0.
- Value is long-term; reward is short-term (one-time).

Evaluating the "always ↑" policy
Recall: V^π_h(s) := E[ ∑_{t=0}^{h−1} γ^t R(s_t, π(s_t)) | s_0 = s, π ] (expanded form: h terms inside)
- π(s) = "↑", ∀s
- all rewards are zero, except R(3,↑) = 1 and R(6,↑) = −10
- γ = 0.9
- Horizon h = 0: no steps left, so all values are 0.
- Horizon h = 1: receive the rewards at face value.

Evaluating the "always ↑" policy, cont'd
Recall: π(s) = "↑", ∀s; all rewards are zero, except R(3,↑) = 1 and R(6,↑) = −10; γ = 0.9.
- Horizon h = 2 (2 terms inside the sum): e.g., starting in state 6 and taking ↑, we receive R(6,↑) = −10, then land in state 2 (20% chance) or state 3 (80% chance) and take ↑ once more:
  V^π_2(6) = R(6,↑) + γ[0.2·R(2,↑) + 0.8·R(3,↑)] = −10 + 0.9·(0.2·0 + 0.8·1) = −9.28
- Horizon h = 3: one more layer in the tree of possible states (6 → {2, 3} under ↑, and ↑ keeps Mario in place from states 2 and 3):
  V^π_3(6) = R(6,↑) + γ[0.2·V^π_2(2) + 0.8·V^π_2(3)] = −10 + 0.9·(0.2·0 + 0.8·1.9) = −8.632
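These hand computations can be sanity-checked with a Monte Carlo rollout estimate of V^π_h(s): the definition above with the expectation replaced by an average over sampled trajectories. The sketch below is mine, not from the slides; it assumes transition and reward arrays T[s,a,s'] and R[s,a] like the grid-world sketch earlier, with the policy given as an array of action indices.

```python
import numpy as np

def rollout_value(T, R, gamma, pi, s0, horizon, n_rollouts=100_000, seed=0):
    """Monte Carlo estimate of V^pi_h(s0): average of sum_t gamma^t R(s_t, pi(s_t)) over sampled trajectories."""
    rng = np.random.default_rng(seed)
    n_states = T.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for t in range(horizon):
            a = pi[s]
            ret += (gamma ** t) * R[s, a]
            s = rng.choice(n_states, p=T[s, a])   # sample the next state from T(s, a, .)
        total += ret
    return total / n_rollouts

# Example (hypothetical), with the grid-world arrays from the earlier sketch and the "always up" policy:
# rollout_value(T, R, GAMMA, np.zeros(9, dtype=int), s0=5, horizon=2) should come out close to -9.28.
```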
Bellman Recursion
V^π_h(s) = R(s, π(s)) + γ ∑_{s′} T(s, π(s), s′) V^π_{h−1}(s′)
- Left-hand side: the horizon-h value in state s, i.e., the expected sum of discounted rewards, starting in state s and following policy π for h steps.
- R(s, π(s)): the immediate reward for taking the policy-prescribed action π(s) in state s.
- V^π_{h−1}(s′): the (h−1)-horizon value at a next state s′, weighted by the probability T(s, π(s), s′) of getting to that next state s′, and discounted by γ.

If the horizon h approaches infinity (typically γ < 1 in the MDP definition), the finite-horizon Bellman recursions become the infinite-horizon Bellman Equations:
V^π(s) = R(s, π(s)) + γ ∑_{s′} T(s, π(s), s′) V^π(s′)
- ∣S∣ many linear equations, one equation for each state.
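The recursion and the equations translate directly into code. Below is a sketch (mine, not from the slides) under the same assumptions as before: arrays T[s,a,s'] and R[s,a], and a deterministic policy pi given as an array of action indices. The finite-horizon version iterates the Bellman recursion; the infinite-horizon version solves the |S| linear Bellman equations.

```python
import numpy as np

def evaluate_policy_finite(T, R, gamma, pi, horizon):
    """V^pi_h(s) via the Bellman recursion: V_h(s) = R(s,pi(s)) + gamma * sum_s' T(s,pi(s),s') V_{h-1}(s')."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    V = np.zeros(n_states)              # horizon-0 values are defined as 0
    for _ in range(horizon):
        V = R[idx, pi] + gamma * T[idx, pi, :] @ V
    return V

def evaluate_policy_infinite(T, R, gamma, pi):
    """V^pi(s) by solving the |S| linear Bellman equations: (I - gamma * T_pi) V = R_pi."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    T_pi = T[idx, pi, :]                # T_pi[s, s'] = T(s, pi(s), s')
    R_pi = R[idx, pi]                   # R_pi[s]     = R(s, pi(s))
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

# Example (hypothetical), with the grid-world arrays from the earlier sketch:
# pi_up = np.zeros(9, dtype=int)                                  # action index 0 = "up"
# print(evaluate_policy_finite(T, R, GAMMA, pi_up, horizon=2))    # entry for state 6 (index 5) is -9.28
# print(evaluate_policy_infinite(T, R, GAMMA, pi_up))
```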
Quick summary
- MDP: the five-tuple (S, A, T, R, γ) and a policy π.
- Policy evaluation: for a given policy π(s), the (state) value functions
  V^π_h(s) := E[ ∑_{t=0}^{h−1} γ^t R(s_t, π(s_t)) | s_0 = s, π ], ∀ s, h
Outline
- Markov Decision Processes
  - Definition, terminologies, and policy
- Policy Evaluation
  - V-values: State Value Functions
  - Bellman recursions and Bellman equations
- Policy Optimization
  - Optimal policies π∗
  - Q-values: State-action Optimal Value Functions
  - Value iteration
Optimal policy π∗
Definition of π∗: for a given MDP and a fixed horizon h (possibly infinite), V^{π∗}_h(s) ⩾ V^π_h(s) for all s ∈ S and for all possible policies π.
- For a fixed MDP, the optimal values V^{π∗}_h(s) must be unique.
- The optimal policy π∗ might not be unique (think, e.g., of a symmetric world).
- In the finite-horizon case, the optimal policy depends on how many time steps are left.
- In the infinite-horizon case, the number of time steps left no longer matters; in other words, there exists a stationary optimal policy.
How to search for an optimal policy π∗?
- One possible idea: enumerate over all possible policies, do policy evaluation on each, and take the max values V^{π∗}_h(s), which then give us the optimal policy (see the sketch after this list).
- Very, very tedious... and it gives no insight.
- A better idea: take advantage of the recursive structure.
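To make the "tedious" point concrete: with 9 states and 4 actions there are already 4^9 = 262,144 deterministic stationary policies. A brute-force search would look something like the sketch below (mine, not from the slides); evaluate is any policy-evaluation routine, e.g. the hypothetical evaluate_policy_infinite sketched earlier.

```python
import itertools
import numpy as np

def brute_force_optimal(T, R, gamma, evaluate):
    """Enumerate all |A|^|S| deterministic policies, evaluate each, and keep the best one.

    An optimal policy maximizes the value at every state simultaneously, so it also
    maximizes the sum of values; comparing sums is therefore enough to find one.
    """
    n_states, n_actions = R.shape
    best_pi, best_V = None, None
    for pi in itertools.product(range(n_actions), repeat=n_states):
        V = evaluate(T, R, gamma, np.array(pi))
        if best_V is None or V.sum() > best_V.sum():
            best_pi, best_V = np.array(pi), V
    return best_pi, best_V

# With 9 states and 4 actions this already loops 4**9 = 262,144 times;
# value iteration (coming up) avoids the enumeration entirely.
```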
Optimal state-action value functions Q_h(s,a)
Q_h(s,a): the expected sum of discounted rewards for
- starting in state s,
- taking action a for one step, and
- acting optimally thereafter for the remaining (h−1) steps.

V values vs. Q values
- V is defined over the state space; Q is defined over the (state, action) space.
- Any policy can be evaluated to get V values; whereas Q, by definition, has a sense of "tail optimality" baked in.
- V^{π∗}_h(s) can be derived from Q_h(s,a) (namely V^{π∗}_h(s) = max_a Q_h(s,a)), and vice versa.
- Q is easier to read "optimal actions" from.

Recursively finding Q_h(s,a)
Recall: γ = 0.9; the grid of states with the one special transition (from state 6, "↑" goes to state 2 with probability 0.2 and to state 3 with probability 0.8); R(3,a) = 1 and R(6,a) = −10 for any action a, all other rewards are 0.

Let's consider Q_2(3,→):
- receive R(3,→)
- next state s′ = 3; act optimally for the remaining one time step, receiving max_{a′} Q_1(3,a′)
- Q_2(3,→) = R(3,→) + γ max_{a′} Q_1(3,a′) = 1 + 0.9·1 = 1.9

Let's consider Q_2(3,↑):
- receive R(3,↑)
- next state s′ = 3; act optimally for the remaining one time step, receiving max_{a′} Q_1(3,a′)
- Q_2(3,↑) = R(3,↑) + γ max_{a′} Q_1(3,a′) = 1 + 0.9·1 = 1.9

Let's consider Q_2(3,←):
- receive R(3,←)
- next state s′ = 2; act optimally for the remaining one time step, receiving max_{a′} Q_1(2,a′)
- Q_2(3,←) = R(3,←) + γ max_{a′} Q_1(2,a′) = 1 + 0.9·0 = 1

Let's consider Q_2(3,↓):
- receive R(3,↓)
- next state s′ = 6; act optimally for the remaining one time step, receiving max_{a′} Q_1(6,a′)
- Q_2(3,↓) = R(3,↓) + γ max_{a′} Q_1(6,a′) = 1 + 0.9·(−10) = −8

Let's consider Q_2(6,↑):
- receive R(6,↑)
- act optimally for one more time step, at the next state s′:
  - 20% chance, s′ = 2: act optimally, receive max_{a′} Q_1(2,a′)
  - 80% chance, s′ = 3: act optimally, receive max_{a′} Q_1(3,a′)
- Q_2(6,↑) = R(6,↑) + γ[0.2 max_{a′} Q_1(2,a′) + 0.8 max_{a′} Q_1(3,a′)] = −10 + 0.9·(0.2·0 + 0.8·1) = −9.28
In general:
Q_h(s,a) = R(s,a) + γ ∑_{s′} T(s,a,s′) max_{a′} Q_{h−1}(s′,a′)

What's the optimal action in state 3, with horizon 2, given by π∗_2(3)?
- Either up or right: both achieve the maximum, Q_2(3,↑) = Q_2(3,→) = 1.9.
- In general, π∗_h(s) = arg max_a Q_h(s,a).
Infinite-horizon Value Iteration
Given the recursion, if the horizon goes to infinity we can have an infinite-horizon equation: Q_∞(s,a) = R(s,a) + γ ∑_{s′} T(s,a,s′) max_{a′} Q_∞(s′,a′).
- for s ∈ S, a ∈ A:
  - Q_old(s,a) = 0
- while True:
  - for s ∈ S, a ∈ A:
    - Q_new(s,a) ← R(s,a) + γ ∑_{s′} T(s,a,s′) max_{a′} Q_old(s′,a′)
  - if max_{s,a} ∣Q_old(s,a) − Q_new(s,a)∣ < ε:
    - return Q_new
  - Q_old ← Q_new
(If we run the while-loop body h times and break, the returned values are exactly Q_h.)
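In Python, the pseudocode above might look like the sketch below (mine, not from the slides; it again assumes arrays T[s,a,s'] and R[s,a], and the names are illustrative). Acting greedily w.r.t. the returned Q values gives an optimal policy.

```python
import numpy as np

def value_iteration(T, R, gamma, eps=1e-6, max_iters=None):
    """Iterate Q_new(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q_old(s',a') until convergence.

    If max_iters=h is given and reached, the returned values are exactly the horizon-h Q_h.
    """
    Q_old = np.zeros_like(R)                      # Q_0(s,a) = 0
    it = 0
    while True:
        V_old = Q_old.max(axis=1)                 # max over a' of Q_old(s', a'), one value per next state
        Q_new = R + gamma * T @ V_old             # (S,A,S') contracted with (S',) -> (S,A)
        it += 1
        if np.max(np.abs(Q_old - Q_new)) < eps or (max_iters is not None and it >= max_iters):
            return Q_new
        Q_old = Q_new

def greedy_policy(Q):
    """Act greedily w.r.t. the Q values: pi*(s) = argmax_a Q(s,a)."""
    return Q.argmax(axis=1)

# Example (hypothetical), with the grid-world arrays T, R, GAMMA from the earlier sketch:
# Q2 = value_iteration(T, R, GAMMA, max_iters=2)   # Q2[2] is [1.9, -8, 1, 1.9] for (up, down, left, right)
# Q_inf = value_iteration(T, R, GAMMA)
# print(greedy_policy(Q_inf))
```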
Summary
- A Markov decision process (MDP) is a nice mathematical framework for making sequential decisions. It is the foundation of reinforcement learning.
- An MDP is defined by a five-tuple, and the goal is to find an optimal policy that leads to high expected cumulative discounted rewards.
- To evaluate how good a given policy π is, we can calculate V^π(s) via:
  - the summation-over-rewards definition, or
  - the Bellman recursion (finite horizon) / Bellman equations (infinite horizon).
- To find an optimal policy, we can recursively find Q(s,a) via the value iteration algorithm, and then act greedily w.r.t. the Q values.
Thanks!
We'd love to hear your thoughts.