Intro to Machine Learning


Lecture 11: Reinforcement Learning
Shen Shen
April 26, 2024
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
MDP Definition and Goal
- S : state space, contains all possible states s.
- A : action space, contains all possible actions a.
- T(s,a,s′) : the probability of transition from state s to s′ when action a is taken.
- R(s,a) : a function that takes in the (state, action) and returns a reward.
- γ∈[0,1]: discount factor, a scalar.
- π(s) : policy, takes in a state and returns an action.
Ultimate goal of an MDP: Find the "best" policy π.
(Figure: the agent-environment loop over time: at each step, the agent in state s picks action a via the policy π(s), the environment transitions according to T(s,a,s′), and the agent receives reward r given by R(s,a).)

a trajectory (aka an experience or rollout): $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$
how "good" is a trajectory?
Running example: Mario in a grid-world

- 9 possible states
- 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}
- almost all transitions are deterministic:
  - Normally, actions take Mario to the "intended" state. E.g., in state (7), action "↑" gets to state (4).
  - If an action would've taken us out of this world, Mario stays put. E.g., in state (9), action "→" gets back to state (9).
- except, in state (6), action "↑" leads to two possibilities:
  - 20% chance ends in (2)
  - 80% chance ends in (3)

example cont'd

- (state, action) pairs can get Mario rewards:
  - In state (3), any action gets reward +1
  - In state (6), any action gets reward -10
  - Any other (state, action) pair gets reward 0
- goal is to find a gameplay policy for Mario, to get the maximum expected sum of discounted rewards, with a discount factor γ=0.9
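To make the running example concrete, here is a minimal Python sketch of this grid world as an MDP (my own illustrative encoding, not from the lecture; it assumes states are numbered 1 to 9 left-to-right, top-to-bottom, so that "↑" from (7) lands in (4) as described above):

```python
STATES = list(range(1, 10))             # 3x3 grid, numbered 1..9 in row-major order
ACTIONS = ["up", "down", "left", "right"]
GAMMA = 0.9

def intended_next(s, a):
    """Deterministic 'intended' move; stay put if it would leave the grid."""
    row, col = (s - 1) // 3, (s - 1) % 3
    if a == "up":
        row = max(row - 1, 0)
    elif a == "down":
        row = min(row + 1, 2)
    elif a == "left":
        col = max(col - 1, 0)
    elif a == "right":
        col = min(col + 1, 2)
    return 3 * row + col + 1

def transition(s, a):
    """T(s, a, ·) as a dict {s': probability}; only (6, up) is stochastic."""
    if s == 6 and a == "up":
        return {2: 0.2, 3: 0.8}
    return {intended_next(s, a): 1.0}

def reward(s, a):
    """R(s, a): +1 for any action in state 3, -10 in state 6, 0 otherwise."""
    if s == 3:
        return 1.0
    if s == 6:
        return -10.0
    return 0.0
```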
Now, let's think about $V^3_\pi(6)$.

Recall:
- $\pi(s) = $ "↑", $\forall s$
- $R(3,\uparrow) = 1$
- $R(6,\uparrow) = -10$
- $\gamma = 0.9$
(Diagram: the horizon-3 rollout tree from state 6 under π. Taking "↑" from (6) lands in (2) with 20% chance or (3) with 80% chance; from (2), "↑" keeps Mario in (2); from (3), "↑" keeps Mario in (3).)
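Carrying out the calculation with the quantities recalled above (a worked version; from (2) and (3), "↑" keeps Mario in place):

$$\begin{aligned}
V^1_\pi(2) &= R(2,\uparrow) = 0, \qquad V^1_\pi(3) = R(3,\uparrow) = 1,\\
V^2_\pi(2) &= R(2,\uparrow) + \gamma V^1_\pi(2) = 0, \qquad V^2_\pi(3) = R(3,\uparrow) + \gamma V^1_\pi(3) = 1 + 0.9 = 1.9,\\
V^3_\pi(6) &= R(6,\uparrow) + \gamma\left[0.2\, V^2_\pi(2) + 0.8\, V^2_\pi(3)\right] = -10 + 0.9\,(0.2 \cdot 0 + 0.8 \cdot 1.9) = -8.632.
\end{aligned}$$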
MDP policy evaluation

Finite-horizon policy evaluation:
- For a given policy $\pi(s)$, the finite-horizon, horizon-$h$ (state) value functions are
$$V^h_\pi(s) := \mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t R(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi\right], \quad \forall s$$

Infinite-horizon policy evaluation:
- For any given policy $\pi(s)$, the infinite-horizon (state) value functions are
$$V_\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi\right], \quad \forall s$$
- $\gamma$ now generally needs to be $< 1$ for convergence.
- Bellman equation: $V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V_\pi(s'), \ \forall s$ ($|S|$ many linear equations).
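Here is a small Python sketch of finite-horizon policy evaluation, iterating the recursion $V^h_\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s,\pi(s),s')\, V^{h-1}_\pi(s')$ (the Bellman recursion, next); it reuses the hypothetical `STATES`, `transition`, `reward`, and `GAMMA` helpers sketched earlier:

```python
def policy_evaluation(h, policy, states, transition, reward, gamma):
    """Compute V^h_pi(s) for every state s by iterating the Bellman recursion."""
    V = {s: 0.0 for s in states}                     # V^0 = 0
    for _ in range(h):
        V = {s: reward(s, policy(s))
                + gamma * sum(p * V[s2]
                              for s2, p in transition(s, policy(s)).items())
             for s in states}
    return V

# e.g., the always-"up" policy from the running example:
# policy_evaluation(3, lambda s: "up", STATES, transition, reward, GAMMA)[6]  ->  -8.632
```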
Bellman recursion

$$V^h_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{h-1}_\pi(s'), \quad \forall s$$

example: recursively finding $Q_h(s,a)$

Recall:
- $\gamma = 0.9$
- the states, rewards R(s,a), and the one special (stochastic) transition are as above
- $Q_h(s,a)$ is the expected sum of discounted rewards for
  - starting in state $s$,
  - taking action $a$, for one step,
  - acting optimally there afterwards for the remaining $(h-1)$ steps

Let's consider $Q_2(6, \uparrow)$:
- receive $R(6,\uparrow)$
- act optimally for one more timestep, at the next state $s'$:
  - 20% chance, $s' = 2$: act optimally, receive $\max_{a'} Q_1(2, a')$
  - 80% chance, $s' = 3$: act optimally, receive $\max_{a'} Q_1(3, a')$

$$Q_2(6,\uparrow) = R(6,\uparrow) + \gamma\left[0.2 \max_{a'} Q_1(2,a') + 0.8 \max_{a'} Q_1(3,a')\right] = -10 + 0.9\,[0.2 \cdot 0 + 0.8 \cdot 1] = -9.28$$

In general:

$$Q_h(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{h-1}(s',a')$$

What's the optimal action in state 3, with horizon 2, given by $\pi^*_2(3)$? Either up or right.
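To see why, we can spell out the horizon-2 Q values at state 3 (a worked check using the rewards above and the stay-put boundary transitions; note $Q_1(s,a) = R(s,a)$):

$$\begin{aligned}
Q_2(3,\uparrow) &= 1 + 0.9 \max_{a'} Q_1(3,a') = 1 + 0.9 = 1.9, &\quad Q_2(3,\rightarrow) &= 1 + 0.9 \max_{a'} Q_1(3,a') = 1.9,\\
Q_2(3,\leftarrow) &= 1 + 0.9 \max_{a'} Q_1(2,a') = 1 + 0 = 1, &\quad Q_2(3,\downarrow) &= 1 + 0.9 \max_{a'} Q_1(6,a') = 1 - 9 = -8,
\end{aligned}$$

so $\pi^*_2(3)$ can be either $\uparrow$ or $\rightarrow$.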
Given the finite-horizon recursion
$$Q_h(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{h-1}(s',a'),$$
we should easily be convinced of the infinite-horizon equation
$$Q(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q(s',a').$$

Infinite-horizon Value Iteration:
- for $s \in S, a \in A$: $Q_{\text{old}}(s,a) = 0$
- while True:
  - for $s \in S, a \in A$:
    - $Q_{\text{new}}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{\text{old}}(s',a')$
  - if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
  - $Q_{\text{old}} \leftarrow Q_{\text{new}}$
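Below is a direct Python transcription of this pseudocode, as a sketch; it assumes the hypothetical `transition` and `reward` helpers from the grid-world sketch and a small tolerance `eps`:

```python
def value_iteration(states, actions, transition, reward, gamma, eps=1e-6):
    """Infinite-horizon value iteration: repeat the Bellman update until convergence."""
    Q_old = {(s, a): 0.0 for s in states for a in actions}
    while True:
        Q_new = {(s, a): reward(s, a)
                         + gamma * sum(p * max(Q_old[(s2, a2)] for a2 in actions)
                                       for s2, p in transition(s, a).items())
                 for s in states for a in actions}
        if max(abs(Q_old[sa] - Q_new[sa]) for sa in Q_old) < eps:
            return Q_new
        Q_old = Q_new
```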
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
Running example: Mario in a grid-world (the Reinforcement-Learning setup)

- 9 possible states
- 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}
- all transition probabilities are unknown.
- (state, action) pairs get Mario unknown rewards.
- goal is to find a gameplay policy for Mario, to get the maximum expected sum of discounted rewards, with a discount factor γ=0.9
MDP Definition and Goal
- S : state space, contains all possible states s.
- A : action space, contains all possible actions a.
- T(s,a,s′) : the probability of transition from state s to s′ when action a is taken.
- R(s,a) : a function that takes in the (state, action) and returns a reward.
- γ∈[0,1]: discount factor, a scalar.
- π(s) : policy, takes in a state and returns an action.
Ultimate goal of an MDP: Find the "best" policy π.
RL: same ultimate goal, but now T(s,a,s′) and R(s,a) are unknown; we can only learn about them by interacting with the environment.
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
Model-Based Methods

Keep playing the game to approximate the unknown rewards and transitions.

- Rewards are particularly easy: e.g., by observing what reward r is received from being in state 6 and taking the ↑ action, we learn R(6,↑).
- Transitions are a bit more involved but still simple: e.g., take the ↑ action in state 6 a thousand times, count the # of times we end up in state 2; then, roughly, T(6,↑,2) = (that count / 1000).
Now, with R and T estimated, we're back in the MDP setting (for solving RL).
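A minimal Python sketch of this counting-based estimation (hypothetical names; `experience` is assumed to be a list of observed (s, a, r, s′) tuples collected while playing):

```python
from collections import defaultdict

def estimate_model(experience):
    """Estimate R(s, a) by averaging observed rewards, and T(s, a, s') by
    normalized transition counts, from (s, a, r, s_next) tuples."""
    counts = defaultdict(int)          # times (s, a) was tried
    reward_sums = defaultdict(float)   # summed rewards for (s, a)
    next_counts = defaultdict(int)     # times (s, a) led to s'
    for s, a, r, s_next in experience:
        counts[(s, a)] += 1
        reward_sums[(s, a)] += r
        next_counts[(s, a, s_next)] += 1
    R_hat = {sa: reward_sums[sa] / n for sa, n in counts.items()}
    T_hat = {(s, a, s2): c / counts[(s, a)] for (s, a, s2), c in next_counts.items()}
    return R_hat, T_hat
```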
In Reinforcement Learning:
- "Model" typically means the MDP tuple ⟨S,A,T,R,γ⟩.
- The learning objective is not explicitly referred to as a hypothesis; we simply call it the policy.
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?

How do we learn a good policy without learning the transitions or rewards explicitly?
We kinda already know a way: Q functions!
So once we have "good" Q values, we can find the optimal policy easily.
(Recall from MDP lab)
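Concretely, once a Q-table is in hand (e.g., a dict keyed by (s, a), as in the earlier sketches), extracting the greedy policy is just an argmax; a tiny sketch:

```python
def greedy_policy(Q, actions):
    """pi(s) = argmax_a Q(s, a), read directly off a Q-table."""
    return lambda s: max(actions, key=lambda a: Q[(s, a)])
```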
But didn't we calculate this Q-table via value iteration using transition and rewards explicitly?
Indeed, recall that, in MDP:
Infinite-horizon Value Iteration:
- for $s \in S, a \in A$: $Q_{\text{old}}(s,a) = 0$
- while True:
  - for $s \in S, a \in A$:
    - $Q_{\text{new}}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{\text{old}}(s',a')$
  - if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
  - $Q_{\text{old}} \leftarrow Q_{\text{new}}$

- Value iteration (built on the finite-horizon recursion and the infinite-horizon equation) relied on having full access to R and T.
- Hmm... perhaps we could simulate (s,a), observe r and s′, and just use $r + \gamma \max_{a'} Q_{\text{old}}(s',a')$ as the proxy for the r.h.s. of that assignment?
- BUT, this is basically saying the realized s′ is the only possible next state; pretty rough! We'd also override any previously "learned" Q values.
- A better way is to smoothly combine our old belief with the new evidence, e.g.:
$$Q_{\text{new}}(s,a) \leftarrow (1-\alpha)\,\underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}} + \underbrace{\alpha}_{\text{learning rate}}\,\big(\underbrace{r + \gamma \max_{a'} Q_{\text{old}}(s',a')}_{\text{target}}\big)$$
VALUE-ITERATION $(S, A, T, R, \gamma, \epsilon)$ ("calculating"):
- for $s \in S, a \in A$: $Q_{\text{old}}(s,a) = 0$
- while True:
  - for $s \in S, a \in A$:
    - $Q_{\text{new}}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{\text{old}}(s',a')$
  - if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
  - $Q_{\text{old}} \leftarrow Q_{\text{new}}$

Q-LEARNING $(S, A, \gamma, s_0, \alpha, \epsilon)$ ("estimating"):
1.  for $s \in S, a \in A$:
2.      $Q_{\text{old}}(s,a) = 0$
3.  $s \leftarrow s_0$
4.  while True:
5.      $a \leftarrow$ select_action$(s, Q_{\text{old}})$
6.      $r, s' =$ execute$(a)$
7.      $Q_{\text{new}}(s,a) \leftarrow (1-\alpha)\,Q_{\text{old}}(s,a) + \alpha\,(r + \gamma \max_{a'} Q_{\text{old}}(s',a'))$
8.      $s \leftarrow s'$
9.      if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
10.     $Q_{\text{old}} \leftarrow Q_{\text{new}}$
- Remarkably, Q-learning can still converge (as long as S and A are finite, every state-action pair is visited infinitely many times, and α decays appropriately).
- The update on line 7,
$$Q_{\text{new}}(s,a) \leftarrow (1-\alpha)\,Q_{\text{old}}(s,a) + \alpha\,(r + \gamma \max_{a'} Q_{\text{old}}(s',a')),$$
is equivalently:
$$Q_{\text{new}}(s,a) \leftarrow \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}} + \underbrace{\alpha}_{\text{learning rate}}\big(\underbrace{[r + \gamma \max_{a'} Q_{\text{old}}(s',a')]}_{\text{target}} - \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}}\big)$$
which looks pretty similar to an SGD update.
- select_action on line 5 is a sub-routine:
- the ϵ-greedy action selection strategy:
  - with probability 1−ϵ, choose $\arg\max_a Q(s,a)$
  - with probability ϵ, choose an action $a \in A$ uniformly at random
- If our Q values are estimated quite accurately (nearly converged to the true Q values), then we should act greedily: $\arg\max_a Q_h(s,a)$, as we did in MDPs.
- During learning, especially in the early stages, we'd like to explore. Benefit: we get to observe more diverse (s,a) consequences.
- This is the exploration vs. exploitation trade-off.
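This sub-routine is only a few lines of Python; a sketch that plugs into the `select_action` argument of the Q-learning sketch above:

```python
import random

def epsilon_greedy(actions, epsilon=0.1):
    """Build select_action(s, Q): greedy w.p. 1 - epsilon, uniform random w.p. epsilon."""
    def select_action(s, Q):
        if random.random() < epsilon:
            return random.choice(actions)                  # explore
        return max(actions, key=lambda a: Q[(s, a)])       # exploit
    return select_action
```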
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
- Q-learning as described above is really only sensible in the tabular setting.
- What do we do if S and/or A are large (or continuous)?
- Recall the key update on line 7 of the Q-learning algorithm, equivalently written as:
$$Q_{\text{new}}(s,a) \leftarrow \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}} + \underbrace{\alpha}_{\text{learning rate}}\big(\underbrace{[r + \gamma \max_{a'} Q_{\text{old}}(s',a')]}_{\text{target}} - \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}}\big)$$
- This can be interpreted as minimizing
$$\big(Q(s,a) - (r + \gamma \max_{a'} Q(s',a'))\big)^2$$
via a gradient method!
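For large or continuous state spaces, we can parameterize Q with a neural network and take gradient steps on exactly this squared TD error. A minimal PyTorch sketch (my own illustrative names such as `QNetwork` and `q_learning_step`; it assumes state feature tensors, integer action indices, and an optimizer like `torch.optim.Adam(qnet.parameters())`):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, ·): map a state feature vector to one value per action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s):
        return self.net(s)

def q_learning_step(qnet, optimizer, s, a, r, s_next, gamma):
    """One gradient step on (Q(s,a) - (r + gamma * max_a' Q(s',a')))^2,
    holding the target fixed (no gradient flows through it)."""
    q_sa = qnet(s)[a]                           # current estimate of Q(s, a)
    with torch.no_grad():                       # treat the target as a constant
        target = r + gamma * qnet(s_next).max()
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```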
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?

Supervised learning: we learn from a dataset that comes with direct supervision (labels).

- What if no direct supervision is available?
- That is strictly the RL setting: interact, observe, and collect data; use rewards/values as a "coy" supervision signal.
Thanks!
We'd appreciate your feedback on the lecture.