Lecture 9: Reinforcement Learning
(DRAFT)
Shen Shen
April 11, 2025
Intro to Machine Learning
Outline
- Recap: Markov decision processes
- Reinforcement learning setup
- Model-based methods
- Model-free methods
- (tabular) Q-learning
- ϵ-greedy action selection
- exploration vs. exploitation
- (neural network) Q-learning
- Reinforcement learning setup again
- S : state space, contains all possible states s.
- A : action space, contains all possible actions a.
- T(s,a,s′) : the probability of transition from state s to s′ when action a is taken.
- R(s,a) : reward, takes in a (state, action) pair and returns a reward.
- γ∈[0,1]: discount factor, a scalar.
- π(s) : policy, takes in a state and returns an action.
The goal of an MDP is to find a "good" policy.
Sidenote: In 6.390,
- R(s,a) is deterministic and bounded.
- π(s) is deterministic.
- S and A are small discrete sets, unless otherwise specified.
Recap:
Markov Decision Processes - Definition and terminologies
For a given policy π(s), the (state) value functions
$V^h_\pi(s) := \mathbb{E}\Big[\textstyle\sum_{t=0}^{h-1} \gamma^t R(s_t, \pi(s_t)) \,\Big|\, s_0 = s, \pi\Big], \quad \forall s, h$
- Vπh(s): expected sum of discounted rewards, starting in state s, and following policy π, for h steps.
- horizon-0 values defined as 0.
- value is long-term, reward is short-term.
(These are the state value functions, also called the V values.)
Recap:
Bellman Recursion
$V^h_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{h-1}_\pi(s')$
- Left-hand side: the horizon-h value in state s, i.e., the expected sum of discounted rewards, starting in state s and following policy π for h steps.
- $R(s, \pi(s))$: the immediate reward for taking the policy-prescribed action π(s) in state s.
- $\gamma$: the future value is discounted by γ.
- $V^{h-1}_\pi(s')$: the (h−1)-horizon value at a next state s′, weighted by $T(s, \pi(s), s')$, the probability of getting to that next state s′.
Recap:
Finite-horizon Bellman recursions vs. infinite-horizon Bellman equations
For a given policy π(s), the (state) value functions
$V^h_\pi(s) := \mathbb{E}\Big[\textstyle\sum_{t=0}^{h-1} \gamma^t R(s_t, \pi(s_t)) \,\Big|\, s_0 = s, \pi\Big], \quad \forall s, h$
satisfy
- the finite-horizon Bellman recursions: $V^h_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{h-1}_\pi(s'), \ \forall s, h$
- the infinite-horizon Bellman equations: $V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V_\pi(s'), \ \forall s$
Computing these values for a given policy in a known MDP is called policy evaluation.
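For concreteness, here is a minimal Python sketch of finite-horizon policy evaluation via this recursion (my own illustration, not the course's code), assuming the small MDP is stored in plain dictionaries: pi[s] gives an action, T[(s, a, s_next)] a probability, and R[(s, a)] a reward.

```python
# A minimal sketch of finite-horizon policy evaluation by the Bellman recursion above.
# Container names (S, pi, T, R) are illustrative assumptions, not the course's API.
def policy_evaluation(S, pi, T, R, gamma, h):
    V = {s: 0.0 for s in S}                  # horizon-0 values are defined as 0
    for _ in range(h):
        # one Bellman backup: V^k(s) = R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') V^{k-1}(s')
        V = {s: R[(s, pi[s])] + gamma * sum(
                 T.get((s, pi[s], s2), 0.0) * V[s2] for s2 in S)
             for s in S}
    return V                                 # V[s] is V_pi^h(s)
```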
Recap:
Optimal policy π∗
Definition: for a given MDP and a fixed horizon h (possibly infinite), a policy π∗ is an optimal policy if $V^h_{\pi^*}(s) \geq V^h_\pi(s)$ for all s∈S and for all possible policies π.
Recap:
Qh(s,a): expected sum of discounted rewards,
- starting in state s,
- taking the action a for one step,
- acting optimally thereafter for the remaining (h−1) steps.
Recipe for constructing an optimal policy: $\pi^*_h(s) = \arg\max_a Q_h(s,a)$.
Recap:
Infinite-horizon Value Iteration
- for s∈S, a∈A:
  - Qold(s,a) = 0
- while True:
  - for s∈S, a∈A:
    - Qnew(s,a) ← R(s,a) + γ ∑s′ T(s,a,s′) maxa′ Qold(s′,a′)
  - if maxs,a |Qold(s,a) − Qnew(s,a)| < ϵ:
    - return Qnew
  - Qold ← Qnew
- If we run the inner block h times and then break, the returned values are exactly Qh.
- The Qnew returned above (approximately) satisfies the infinite-horizon Bellman equation.
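For concreteness, a minimal Python sketch of this procedure (not the course's reference code), assuming the MDP is stored in plain dictionaries: T[(s, a, s_next)] gives a probability and R[(s, a)] gives a reward; these container choices are my assumption.

```python
# A minimal sketch of infinite-horizon value iteration on a small, dictionary-based MDP.
def value_iteration(S, A, T, R, gamma, eps):
    Q_old = {(s, a): 0.0 for s in S for a in A}
    while True:
        # Bellman backup for every (s, a): immediate reward plus the discounted,
        # transition-weighted value of acting optimally from the next state.
        Q_new = {(s, a): R[(s, a)] + gamma * sum(
                     T.get((s, a, s2), 0.0) * max(Q_old[(s2, a2)] for a2 in A)
                     for s2 in S)
                 for s in S for a in A}
        # Stop once the largest change across all (s, a) is below eps.
        if max(abs(Q_old[sa] - Q_new[sa]) for sa in Q_new) < eps:
            return Q_new
        Q_old = Q_new
```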
Outline
- Recap: Markov decision processes
- Reinforcement learning setup
- Model-based methods
- Model-free methods
- (tabular) Q-learning
- ϵ-greedy action selection
- exploration vs. exploitation
- (neural network) Q-learning
- Reinforcement learning setup again
Running example: Mario in a grid-world
- 9 possible states
- 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}
- (state, action) results in a transition into a next state:
  - Normally, we get to the "intended" state; e.g., in state (7), action "↑" gets to state (4).
  - If an action would take Mario out of the grid world, stay put; e.g., in state (9), "→" gets back to state (9).
  - In state (6), action "↑" leads to two possibilities: 20% chance to (2), 80% chance to (3).
Recall
- (state, action) pairs give out rewards:
  - in state 3, any action gives reward 1
  - in state 6, any action gives reward -10
  - any other (state, action) pair gives reward 0
- discount factor: a scalar of 0.9 that reduces the "worth" of rewards, depending on when we receive them.
  - e.g., for the (3, ←) pair, we receive a reward of 1 at the start of the game; at the 2nd time step, the reward is discounted to 0.9; at the 3rd time step, it is further discounted to (0.9)², and so on.
Mario in a grid-world, cont'd: the reinforcement learning setup
- 9 possible states
- 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}
- transition probabilities are unknown
- rewards are unknown to Mario
- discount factor γ = 0.9
Now, recall the Markov decision process definition and terminology:
- S : state space, contains all possible states s.
- A : action space, contains all possible actions a.
- T(s,a,s′) : the probability of transitioning from state s to s′ when action a is taken.
- R(s,a) : reward, takes in a (state, action) pair and returns a reward.
- γ ∈ [0,1]: discount factor, a scalar.
- π(s) : policy, takes in a state and returns an action.
The goal of an MDP problem is to find a "good" policy.
In reinforcement learning, the goal is the same, but T(s,a,s′) and R(s,a) are unknown to the agent.
Reinforcement Learning
[Figure: the RL interaction loop over time; the agent, in state s, takes action a according to its policy π(s), and the environment, governed by the transition T(s,a,s′) and reward R(s,a), returns a reward r and the next state.]
Interacting this way for h steps produces a trajectory (aka an experience, or a rollout) of horizon h:
τ = (s0, a0, r0, s1, a1, r1, …, sh−1, ah−1, rh−1)
- s0 is the initial state.
- The whole trajectory depends on π; it also depends on T and R, but we do not know T and R explicitly.
Reinforcement learning is very general:
robotics
games
social sciences
chatbot (RLHF)
health care
...
Outline
- Recap: Markov decision processes
- Reinforcement learning setup
- Model-based methods
- Model-free methods
- (tabular) Q-learning
- ϵ-greedy action selection
- exploration vs. exploitation
- (neural network) Q-learning
- Reinforcement learning setup again
Model-Based Methods (for solving RL)
Keep playing the game to approximate the unknown rewards and transitions.
- Rewards are particularly easy: e.g., observe what reward r is received from taking the (6,↑) pair; that observation gives us R(6,↑).
- Transitions are a bit more involved, but still simple: e.g., play the game 1000 times, count the number of times we (start in state 6, take ↑ action, end in state 2); then, roughly, T(6,↑,2) = (that count)/1000.
Now, with R and T estimated, we're back in the MDP setting.
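For concreteness, a counting-based sketch of this estimation in Python (my own illustration, not the course's code), assuming we have logged experience as a list of (s, a, r, s_next) tuples:

```python
from collections import defaultdict

def estimate_model(transitions):
    """transitions: list of (s, a, r, s_next) tuples gathered by playing the game."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
    reward_sum = defaultdict(float)                  # running sum of rewards per (s, a)
    visits = defaultdict(int)                        # how many times (s, a) was executed
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    # R_hat(s, a): average observed reward (exact here, since rewards are deterministic).
    R_hat = {sa: reward_sum[sa] / n for sa, n in visits.items()}
    # T_hat(s, a, s'): fraction of the (s, a) executions that landed in s'.
    T_hat = {(s, a, s_next): c / visits[(s, a)]
             for (s, a), nexts in counts.items()
             for s_next, c in nexts.items()}
    return R_hat, T_hat                              # hand these to value iteration
```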
In Reinforcement Learning:
- Model typically means the MDP tuple ⟨S,A,T,R,γ⟩
- What the algorithm learns is not referred to as a hypothesis either; we simply call it the policy.
[A non-exhaustive, but useful taxonomy of algorithms in modern RL. Source]
Outline
- Recap: Markov decision processes
- Reinforcement learning setup
- Model-based methods
- Model-free methods
- (tabular) Q-learning
- ϵ-greedy action selection
- exploration vs. exploitation
- (neural network) Q-learning
- Reinforcement learning setup again
Is it possible to get a good policy without learning the transitions or rewards explicitly?
We kind of know a way already:
If we have access to the Q value functions, we can back out an optimal policy easily, without needing the transitions or rewards. (Recall this from the MDP lab; a sketch follows below.)
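For concreteness, here is that recipe as a tiny Python sketch, assuming the Q-values are stored in a dictionary keyed by (s, a); the function and argument names are illustrative, not the lab's code.

```python
def extract_policy(Q, S, A):
    # pi*(s) = argmax_a Q(s, a): no transitions or rewards needed.
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
```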
But... doesn't value iteration rely on the transitions and rewards explicitly?
Value Iteration
- for s∈S,a∈A :
- Qold (s,a)=0
- while True:
- for s∈S,a∈A :
- Qnew (s,a)←R(s,a)+γ∑s′T(s,a,s′)maxa′Qold (s′,a′)
- if maxs,a∣Qold (s,a)−Qnew (s,a)∣<ϵ:
- return Qnew
- Qold ←Qnew
- Indeed, value iteration relied on having full access to R and T.
- Without R and T, perhaps we could execute (s,a), observe r and s′, and use
  Qnew(s,a) ← r + γ maxa′ Qold(s′,a′)
  as an approximate (rough) update? The quantity r + γ maxa′ Qold(s′,a′) is called the target.
Game setup: [Figure: the grid-world states with unknown transitions, the unknown rewards, and tables of Qold(s,a) and Qnew(s,a).]
Try using this update:
- execute (3,↑), observe a reward r=1; with all Q values initialized to 0, the update sets Q(3,↑) to 1 + 0.9·0 = 1.
Try out:
- execute (6,↑)
- suppose we observe a reward r=−10 and the next state s′=3 (γ=0.9)
- to update the estimate of Q(6,↑):
  Q(6,↑) ← −10 + 0.9 maxa′ Qold(3,a′) = −10 + 0.9 = −9.1
- execute (6,↑) again
- suppose we observe a reward r=−10 and the next state s′=2
- to update the estimate of Q(6,↑):
  Q(6,↑) ← −10 + 0.9 maxa′ Qold(2,a′) = −10 + 0 = −10
- executing (6,↑) again and again, the same pattern repeats: whenever the observed next state is s′=3, the update gives −10 + 0.9 maxa′ Qold(3,a′) = −9.1; whenever it is s′=2, the update gives −10 + 0.9 maxa′ Qold(2,a′) = −10. The estimate of Q(6,↑) keeps flip-flopping between −9.1 and −10.
- Indeed, value iteration relied on having full access to R and T.
- Without R and T, we tried executing (s,a), observing r and s′, and using the target r + γ maxa′ Qold(s′,a′) directly as the new Q(s,a).
- But the target keeps "washing away" the old progress. 🥺
- Better: instead of overwriting with the target, blend the old belief with the target using a learning rate α:
  Qnew(s,a) ← (1−α) Qold(s,a) + α (r + γ maxa′ Qold(s′,a′))
  (the old belief is weighted by 1−α; the target is weighted by the learning rate α)
- Amazingly, this way has nice convergence properties. 😍
Better idea, tried out (γ=0.9; pick learning rate α=0.5):
- execute (6,↑)
- suppose we observe a reward r=−10 and the next state s′=3
- to update the estimate of Q(6,↑), with old belief Qold(6,↑) = −10:
  Q(6,↑) ← (1−0.5)·(−10) + 0.5·(−10 + 0.9 maxa′ Qold(3,a′))
  = −5 + 0.5·(−10 + 0.9) = −9.55
- execute (6,↑) again
- suppose we observe a reward r=−10 and the next state s′=2
- to update the estimate of Q(6,↑), with old belief −9.55:
  Q(6,↑) ← (1−0.5)·(−9.55) + 0.5·(−10 + 0.9 maxa′ Qold(2,a′))
  = 0.5·(−9.55) + 0.5·(−10 + 0) = −9.775
- Now each new observation nudges the estimate rather than washing away the old progress.
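As a quick numerical check of the two blended updates above, here is a throwaway Python snippet using the slide's numbers (α = 0.5, γ = 0.9, and the old beliefs shown):

```python
alpha, gamma = 0.5, 0.9

def blended_update(q_old_sa, r, max_q_next):
    # new belief = (1 - alpha) * old belief + alpha * (r + gamma * max_a' Qold(s', a'))
    return (1 - alpha) * q_old_sa + alpha * (r + gamma * max_q_next)

q1 = blended_update(-10.0, r=-10, max_q_next=1.0)   # observed s' = 3, max_a' Qold(3, a') = 1
q2 = blended_update(q1, r=-10, max_q_next=0.0)      # observed s' = 2, max_a' Qold(2, a') = 0
print(round(q1, 3), round(q2, 3))                   # -9.55 -9.775
```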
Value Iteration(S, A, T, R, γ, ϵ): "calculating"
- for s∈S, a∈A:
  - Qold(s,a) = 0
- while True:
  - for s∈S, a∈A:
    - Qnew(s,a) ← R(s,a) + γ ∑s′ T(s,a,s′) maxa′ Qold(s′,a′)
  - if maxs,a |Qold(s,a) − Qnew(s,a)| < ϵ:
    - return Qnew
  - Qold ← Qnew

Q-Learning(S, A, γ, α, s0, max-iter): "learning" (estimating)
1. i=0
2. for s∈S,a∈A:
3. Qold(s,a)=0
4. s←s0
5. while i<max-iter:
6. a←select_action(s,Qold(s,a))
7. r,s′←execute(a)
8. Qnew(s,a) ←(1−α)Qold(s,a)+α(r+γmaxa′Qold(s′,a′))
9. s ←s′
10. i ←(i+1)
11. Qold ←Qnew
12. return Qnew
"learning"
Q-Learning (S,A,γ,α,s0max-iter)
1. i=0
2. for s∈S,a∈A:
3. Qold(s,a)=0
4. s←s0
5. while i<max-iter:
6. a←select_action(s,Qold(s,a))
7. r,s′←execute(a)
8. Qnew(s,a) ←(1−α)Qold(s,a)+α(r+γmaxa′Qold(s′,a′))
9. s ←s′
10. i ←(i+1)
11. Qold ←Qnew
12. return Qnew
- Remarkably, this algorithm can converge to the true infinite-horizon Q-values1.
1 given we visit all s,a infinitely often, and satisfy a condition on the learning rate α.
- But the convergence can be extremely slow.
- During learning, especially in early stages, we'd like to explore, and observe diverse (s,a) consequences.
- ϵ-greedy action selection strategy:
  - with probability ϵ, choose an action a∈A uniformly at random
  - with probability 1−ϵ, choose argmaxa Qold(s,a), i.e., exploit the current estimate of the Q values
- ϵ controls the trade-off between exploration vs. exploitation.
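Putting the pseudocode and the ϵ-greedy strategy together, here is a minimal tabular Q-learning sketch in Python; the environment interface env_step(s, a) returning (r, s_next) is an assumed stand-in for however the game is actually executed, not the course's API.

```python
import random

def q_learning(S, A, gamma, alpha, s0, max_iter, env_step, epsilon=0.1):
    Q = {(s, a): 0.0 for s in S for a in A}
    s = s0
    for _ in range(max_iter):
        # epsilon-greedy: explore a random action with probability epsilon,
        # otherwise exploit the current Q estimate.
        if random.random() < epsilon:
            a = random.choice(A)
        else:
            a = max(A, key=lambda a_: Q[(s, a_)])
        r, s_next = env_step(s, a)                       # execute, observe r and s'
        target = r + gamma * max(Q[(s_next, a_)] for a_ in A)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next
    return Q
```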
"learning"
Q-Learning (S,A,γ,α,s0max-iter)
1. i=0
2. for s∈S,a∈A:
3. Qold(s,a)=0
4. s←s0
5. while i<max-iter:
6. a←select_action(s,Qold(s,a))
7. r,s′←execute(a)
8. Qnew(s,a) ←(1−α)Qold(s,a)+α(r+γmaxa′Qold(s′,a′))
9. s ←s′
10. i ←(i+1)
11. Qold ←Qnew
12. return Qnew
Outline
- Recap: Markov decision processes
- Reinforcement learning setup
- Model-based methods
- Model-free methods
- (tabular) Q-learning
- ϵ-greedy action selection
- exploration vs. exploitation
- (neural network) Q-learning
- Reinforcement learning setup again
- So far, Q-learning only really makes sense in the (small) tabular setting.
- What do we do if S and/or A are large, or even continuous?
- Notice that the key update line in the Q-learning algorithm,
  Qnew(s,a) ← (1−α) Qold(s,a) + α (r + γ maxa′ Qold(s′,a′)),
  is equivalently:
  Qnew(s,a) ← Qold(s,a) + α ([r + γ maxa′ Qold(s′,a′)] − Qold(s,a))
  i.e., new belief ← old belief + learning rate × (target − old belief).
- This reminds us of gradient descent: when minimizing (target − guessθ)², gradient descent does θnew ← θold + η (target − guessθ) (d guessθ / dθ).
- So we can generalize tabular Q-learning to continuous (or large) state/action spaces:
  1. parameterize Qθ(s,a)
  2. collect data (r, s′) to construct the target r + γ maxa′ Qθ(s′,a′)
  3. update θ via gradient-descent methods to minimize (Qθ(s,a) − target)²
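For concreteness, a minimal sketch of one such update written with PyTorch; the state dimension, action count, network shape, and learning rate are arbitrary illustrative assumptions, not part of the lecture's reference implementation.

```python
import torch
import torch.nn as nn

n_state_features, n_actions = 4, 3      # assumed sizes, purely for illustration
gamma, lr = 0.9, 1e-3

# Step 1: parameterize Q_theta(s, .) as a small network: state in, one Q-value per action out.
q_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)

def q_update(s, a, r, s_next):
    """One gradient step on (Q_theta(s, a) - target)^2 for a single observed transition."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # Step 2: construct the target r + gamma * max_a' Q_theta(s', a'),
    # holding it fixed so no gradient flows through it.
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max()
    # Step 3: gradient descent on the squared difference.
    pred = q_net(s)[a]
    loss = (pred - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```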
Outline
- Recap: Markov decision processes
- Reinforcement learning setup
- Model-based methods
- Model-free methods
- (tabular) Q-learning
- ϵ-greedy action selection
- exploration vs. exploitation
- (neural network) Q-learning
- Reinforcement learning setup again
- What if no direct supervision is available?
- Then we are in the strictly RL setting: interact, observe, get data, and use rewards as a "coy" supervision signal.
[Slide Credit: Yann LeCun]
Reinforcement learning has a lot of challenges:
- Data can be very expensive/tricky to get
- sim-to-real gap
- sparse rewards
- exploration-exploitation trade-off
- catastrophic forgetting
- Learning can be very inefficient
- temporal process, error can compound
- high variance
- Q-learning can be very unstable
...
Summary
- We saw, last week, how to find good policies in a known MDP: these are policies with high cumulative expected reward.
- In reinforcement learning, we assume we are interacting with an unknown MDP, but we still want to find a good policy. We will do so via estimating the Q value function.
- One problem is how to select actions to gain good reward while learning. This “exploration vs exploitation” problem is important.
- Q-learning, for discrete-state problems, will converge to the optimal value function (with enough exploration).
- “Deep Q learning” can be applied to continuous-state or large discrete-state problems by using a parameterized function to represent the Q-values.
Thanks!
We'd love to hear your thoughts.
6.390 IntroML (Spring25) - Lecture 9 Reinforcement Learning
By Shen Shen