Intro to Machine Learning


Lecture 11: Reinforcement Learning
Shen Shen
April 26, 2024
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
MDP Definition and Goal
- S : state space, contains all possible states s.
- A : action space, contains all possible actions a.
- T(s,a,s′) : the probability of transition from state s to s′ when action a is taken.
- R(s,a) : a function that takes in the (state, action) and returns a reward.
- γ∈[0,1]: discount factor, a scalar.
- π(s) : policy, takes in a state and returns an action.
Ultimate goal of an MDP: Find the "best" policy π.
(Figure: the agent-environment loop over time: at each step, the agent in state s picks action a via the policy π(s), the environment transitions according to T(s,a,s′), and the agent receives reward r given by R(s,a).)

a trajectory (aka an experience or rollout): $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$
how "good" is a trajectory?
Running example: Mario in a grid-world

- 9 possible states
- 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}
- almost all transitions are deterministic:
  - Normally, actions take Mario to the "intended" state. E.g., in state (7), action "↑" gets to state (4).
  - If an action would've taken us out of this world, Mario stays put. E.g., in state (9), action "→" gets back to state (9).
- except, in state (6), action "↑" leads to two possibilities:
  - 20% chance ends in (2)
  - 80% chance ends in (3)

example cont'd

- (state, action) pairs can get Mario rewards:
  - In state (3), any action gets reward +1
  - In state (6), any action gets reward -10
  - Any other (state, action) pair gets reward 0
- goal is to find a gameplay policy for Mario, to get the maximum expected sum of discounted rewards, with a discount factor γ=0.9
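To make the running example concrete, here is a minimal Python sketch of this grid world as an MDP (my own illustrative encoding, not from the lecture; it assumes states are numbered 1 to 9 left-to-right, top-to-bottom, so that "↑" from (7) lands in (4) as described above):

```python
STATES = list(range(1, 10))             # 3x3 grid, numbered 1..9 in row-major order
ACTIONS = ["up", "down", "left", "right"]
GAMMA = 0.9

def intended_next(s, a):
    """Deterministic 'intended' move; stay put if it would leave the grid."""
    row, col = (s - 1) // 3, (s - 1) % 3
    if a == "up":
        row = max(row - 1, 0)
    elif a == "down":
        row = min(row + 1, 2)
    elif a == "left":
        col = max(col - 1, 0)
    elif a == "right":
        col = min(col + 1, 2)
    return 3 * row + col + 1

def transition(s, a):
    """T(s, a, ·) as a dict {s': probability}; only (6, up) is stochastic."""
    if s == 6 and a == "up":
        return {2: 0.2, 3: 0.8}
    return {intended_next(s, a): 1.0}

def reward(s, a):
    """R(s, a): +1 for any action in state 3, -10 in state 6, 0 otherwise."""
    if s == 3:
        return 1.0
    if s == 6:
        return -10.0
    return 0.0
```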
Now, let's think about $V^3_\pi(6)$.

Recall:
- $\pi(s) = $ "↑", $\forall s$
- $R(3,\uparrow) = 1$
- $R(6,\uparrow) = -10$
- $\gamma = 0.9$
(Diagram: the horizon-3 rollout tree from state 6 under π. Taking "↑" from (6) lands in (2) with 20% chance or (3) with 80% chance; from (2), "↑" keeps Mario in (2); from (3), "↑" keeps Mario in (3).)
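Carrying out the calculation with the quantities recalled above (a worked version; from (2) and (3), "↑" keeps Mario in place):

$$\begin{aligned}
V^1_\pi(2) &= R(2,\uparrow) = 0, \qquad V^1_\pi(3) = R(3,\uparrow) = 1,\\
V^2_\pi(2) &= R(2,\uparrow) + \gamma V^1_\pi(2) = 0, \qquad V^2_\pi(3) = R(3,\uparrow) + \gamma V^1_\pi(3) = 1 + 0.9 = 1.9,\\
V^3_\pi(6) &= R(6,\uparrow) + \gamma\left[0.2\, V^2_\pi(2) + 0.8\, V^2_\pi(3)\right] = -10 + 0.9\,(0.2 \cdot 0 + 0.8 \cdot 1.9) = -8.632.
\end{aligned}$$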
MDP policy evaluation

Finite-horizon policy evaluation:
- For a given policy $\pi(s)$, the finite-horizon, horizon-$h$ (state) value functions are
$$V^h_\pi(s) := \mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t R(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi\right], \quad \forall s$$

Infinite-horizon policy evaluation:
- For any given policy $\pi(s)$, the infinite-horizon (state) value functions are
$$V_\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi\right], \quad \forall s$$
- $\gamma$ now generally needs to be $< 1$ for convergence.
- Bellman equation: $V_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V_\pi(s'), \ \forall s$ ($|S|$ many linear equations).
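Here is a small Python sketch of finite-horizon policy evaluation, iterating the recursion $V^h_\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s,\pi(s),s')\, V^{h-1}_\pi(s')$ (the Bellman recursion, next); it reuses the hypothetical `STATES`, `transition`, `reward`, and `GAMMA` helpers sketched earlier:

```python
def policy_evaluation(h, policy, states, transition, reward, gamma):
    """Compute V^h_pi(s) for every state s by iterating the Bellman recursion."""
    V = {s: 0.0 for s in states}                     # V^0 = 0
    for _ in range(h):
        V = {s: reward(s, policy(s))
                + gamma * sum(p * V[s2]
                              for s2, p in transition(s, policy(s)).items())
             for s in states}
    return V

# e.g., the always-"up" policy from the running example:
# policy_evaluation(3, lambda s: "up", STATES, transition, reward, GAMMA)[6]  ->  -8.632
```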
Bellman recursion

$$V^h_\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{h-1}_\pi(s'), \quad \forall s$$

example: recursively finding $Q_h(s,a)$

Recall:
- $\gamma = 0.9$
- the states, rewards R(s,a), and the one special (stochastic) transition are as above
- $Q_h(s,a)$ is the expected sum of discounted rewards for
  - starting in state $s$,
  - taking action $a$, for one step,
  - acting optimally there afterwards for the remaining $(h-1)$ steps

Let's consider $Q_2(6, \uparrow)$:
- receive $R(6,\uparrow)$
- act optimally for one more timestep, at the next state $s'$:
  - 20% chance, $s' = 2$: act optimally, receive $\max_{a'} Q_1(2, a')$
  - 80% chance, $s' = 3$: act optimally, receive $\max_{a'} Q_1(3, a')$

$$Q_2(6,\uparrow) = R(6,\uparrow) + \gamma\left[0.2 \max_{a'} Q_1(2,a') + 0.8 \max_{a'} Q_1(3,a')\right] = -10 + 0.9\,[0.2 \cdot 0 + 0.8 \cdot 1] = -9.28$$

In general:

$$Q_h(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{h-1}(s',a')$$

What's the optimal action in state 3, with horizon 2, given by $\pi^*_2(3)$? Either up or right.
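To see why, we can spell out the horizon-2 Q values at state 3 (a worked check using the rewards above and the stay-put boundary transitions; note $Q_1(s,a) = R(s,a)$):

$$\begin{aligned}
Q_2(3,\uparrow) &= 1 + 0.9 \max_{a'} Q_1(3,a') = 1 + 0.9 = 1.9, &\quad Q_2(3,\rightarrow) &= 1 + 0.9 \max_{a'} Q_1(3,a') = 1.9,\\
Q_2(3,\leftarrow) &= 1 + 0.9 \max_{a'} Q_1(2,a') = 1 + 0 = 1, &\quad Q_2(3,\downarrow) &= 1 + 0.9 \max_{a'} Q_1(6,a') = 1 - 9 = -8,
\end{aligned}$$

so $\pi^*_2(3)$ can be either $\uparrow$ or $\rightarrow$.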
Given the finite-horizon recursion
$$Q_h(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{h-1}(s',a'),$$
we should easily be convinced of the infinite-horizon equation
$$Q(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q(s',a').$$

Infinite-horizon Value Iteration:
- for $s \in S, a \in A$: $Q_{\text{old}}(s,a) = 0$
- while True:
  - for $s \in S, a \in A$:
    - $Q_{\text{new}}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{\text{old}}(s',a')$
  - if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
  - $Q_{\text{old}} \leftarrow Q_{\text{new}}$
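Below is a direct Python transcription of this pseudocode, as a sketch; it assumes the hypothetical `transition` and `reward` helpers from the grid-world sketch and a small tolerance `eps`:

```python
def value_iteration(states, actions, transition, reward, gamma, eps=1e-6):
    """Infinite-horizon value iteration: repeat the Bellman update until convergence."""
    Q_old = {(s, a): 0.0 for s in states for a in actions}
    while True:
        Q_new = {(s, a): reward(s, a)
                         + gamma * sum(p * max(Q_old[(s2, a2)] for a2 in actions)
                                       for s2, p in transition(s, a).items())
                 for s in states for a in actions}
        if max(abs(Q_old[sa] - Q_new[sa]) for sa in Q_old) < eps:
            return Q_new
        Q_old = Q_new
```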
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
Running example: Mario in a grid-world (the Reinforcement-Learning setup)

- 9 possible states
- 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}
- all transition probabilities are unknown.
- (state, action) pairs get Mario unknown rewards.
- goal is to find a gameplay policy for Mario, to get the maximum expected sum of discounted rewards, with a discount factor γ=0.9
MDP Definition and Goal
- S : state space, contains all possible states s.
- A : action space, contains all possible actions a.
- T(s,a,s′) : the probability of transition from state s to s′ when action a is taken.
- R(s,a) : a function that takes in the (state, action) and returns a reward.
- γ∈[0,1]: discount factor, a scalar.
- π(s) : policy, takes in a state and returns an action.
Ultimate goal of an MDP: Find the "best" policy π.
RL: same ultimate goal, but now T(s,a,s′) and R(s,a) are unknown; we can only learn about them by interacting with the environment.
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
Model-Based Methods

Keep playing the game to approximate the unknown rewards and transitions.

- Rewards are particularly easy: e.g., by observing what reward r is received from being in state 6 and taking the ↑ action, we learn R(6,↑).
- Transitions are a bit more involved but still simple: e.g., take the ↑ action in state 6 a thousand times, count the # of times we end up in state 2; then, roughly, T(6,↑,2) = (that count / 1000).
Now, with R and T estimated, we're back in the MDP setting (for solving RL).
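A minimal Python sketch of this counting-based estimation (hypothetical names; `experience` is assumed to be a list of observed (s, a, r, s′) tuples collected while playing):

```python
from collections import defaultdict

def estimate_model(experience):
    """Estimate R(s, a) by averaging observed rewards, and T(s, a, s') by
    normalized transition counts, from (s, a, r, s_next) tuples."""
    counts = defaultdict(int)          # times (s, a) was tried
    reward_sums = defaultdict(float)   # summed rewards for (s, a)
    next_counts = defaultdict(int)     # times (s, a) led to s'
    for s, a, r, s_next in experience:
        counts[(s, a)] += 1
        reward_sums[(s, a)] += r
        next_counts[(s, a, s_next)] += 1
    R_hat = {sa: reward_sums[sa] / n for sa, n in counts.items()}
    T_hat = {(s, a, s2): c / counts[(s, a)] for (s, a, s2), c in next_counts.items()}
    return R_hat, T_hat
```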
In Reinforcement Learning:
- "Model" typically means the MDP tuple ⟨S,A,T,R,γ⟩.
- The learning objective is not explicitly referred to as a hypothesis; we simply call it the policy.
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?

How do we learn a good policy without learning the transitions or rewards explicitly?
We kinda already know a way: Q functions!
So once we have "good" Q values, we can find the optimal policy easily.
(Recall from MDP lab)
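Concretely, once a Q-table is in hand (e.g., a dict keyed by (s, a), as in the earlier sketches), extracting the greedy policy is just an argmax; a tiny sketch:

```python
def greedy_policy(Q, actions):
    """pi(s) = argmax_a Q(s, a), read directly off a Q-table."""
    return lambda s: max(actions, key=lambda a: Q[(s, a)])
```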
But didn't we calculate this Q-table via value iteration using transition and rewards explicitly?
Indeed, recall that, in MDP:
Infinite-horizon Value Iteration:
- for $s \in S, a \in A$: $Q_{\text{old}}(s,a) = 0$
- while True:
  - for $s \in S, a \in A$:
    - $Q_{\text{new}}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{\text{old}}(s',a')$
  - if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
  - $Q_{\text{old}} \leftarrow Q_{\text{new}}$

- Value iteration (built on the finite-horizon recursion and the infinite-horizon equation) relied on having full access to R and T.
- Hmm... perhaps we could simulate (s,a), observe r and s′, and just use $r + \gamma \max_{a'} Q_{\text{old}}(s',a')$ as the proxy for the r.h.s. of that assignment?
- BUT, this is basically saying the realized s′ is the only possible next state; pretty rough! We'd also override any previously "learned" Q values.
- A better way is to smoothly combine our old belief with the new evidence, e.g.:
$$Q_{\text{new}}(s,a) \leftarrow (1-\alpha)\,\underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}} + \underbrace{\alpha}_{\text{learning rate}}\,\big(\underbrace{r + \gamma \max_{a'} Q_{\text{old}}(s',a')}_{\text{target}}\big)$$
VALUE-ITERATION $(S, A, T, R, \gamma, \epsilon)$ ("calculating"):
- for $s \in S, a \in A$: $Q_{\text{old}}(s,a) = 0$
- while True:
  - for $s \in S, a \in A$:
    - $Q_{\text{new}}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q_{\text{old}}(s',a')$
  - if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
  - $Q_{\text{old}} \leftarrow Q_{\text{new}}$

Q-LEARNING $(S, A, \gamma, s_0, \alpha, \epsilon)$ ("estimating"):
1.  for $s \in S, a \in A$:
2.      $Q_{\text{old}}(s,a) = 0$
3.  $s \leftarrow s_0$
4.  while True:
5.      $a \leftarrow$ select_action$(s, Q_{\text{old}})$
6.      $r, s' =$ execute$(a)$
7.      $Q_{\text{new}}(s,a) \leftarrow (1-\alpha)\,Q_{\text{old}}(s,a) + \alpha\,(r + \gamma \max_{a'} Q_{\text{old}}(s',a'))$
8.      $s \leftarrow s'$
9.      if $\max_{s,a} |Q_{\text{old}}(s,a) - Q_{\text{new}}(s,a)| < \epsilon$: return $Q_{\text{new}}$
10.     $Q_{\text{old}} \leftarrow Q_{\text{new}}$
- Remarkably, Q-learning can still converge (as long as S and A are finite, every state-action pair is visited infinitely many times, and α decays appropriately).
- The update on line 7,
$$Q_{\text{new}}(s,a) \leftarrow (1-\alpha)\,Q_{\text{old}}(s,a) + \alpha\,(r + \gamma \max_{a'} Q_{\text{old}}(s',a')),$$
is equivalently:
$$Q_{\text{new}}(s,a) \leftarrow \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}} + \underbrace{\alpha}_{\text{learning rate}}\big(\underbrace{[r + \gamma \max_{a'} Q_{\text{old}}(s',a')]}_{\text{target}} - \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}}\big)$$
which looks pretty similar to an SGD update.
- select_action on line 5 is a sub-routine:
- the ϵ-greedy action selection strategy:
  - with probability 1−ϵ, choose $\arg\max_a Q(s,a)$
  - with probability ϵ, choose an action $a \in A$ uniformly at random
- If our Q values are estimated quite accurately (nearly converged to the true Q values), then we should act greedily: $\arg\max_a Q_h(s,a)$, as we did in MDPs.
- During learning, especially in the early stages, we'd like to explore. Benefit: we get to observe more diverse (s,a) consequences.
- This is the exploration vs. exploitation trade-off.
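This sub-routine is only a few lines of Python; a sketch that plugs into the `select_action` argument of the Q-learning sketch above:

```python
import random

def epsilon_greedy(actions, epsilon=0.1):
    """Build select_action(s, Q): greedy w.p. 1 - epsilon, uniform random w.p. epsilon."""
    def select_action(s, Q):
        if random.random() < epsilon:
            return random.choice(actions)                  # explore
        return max(actions, key=lambda a: Q[(s, a)])       # exploit
    return select_action
```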
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?
- Q-learning as described above is really only sensible in the tabular setting.
- What do we do if S and/or A are large (or continuous)?
- Recall the key update on line 7 of the Q-learning algorithm, equivalently written as:
$$Q_{\text{new}}(s,a) \leftarrow \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}} + \underbrace{\alpha}_{\text{learning rate}}\big(\underbrace{[r + \gamma \max_{a'} Q_{\text{old}}(s',a')]}_{\text{target}} - \underbrace{Q_{\text{old}}(s,a)}_{\text{old belief}}\big)$$
- This can be interpreted as minimizing
$$\big(Q(s,a) - (r + \gamma \max_{a'} Q(s',a'))\big)^2$$
via a gradient method!
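For large or continuous state spaces, we can parameterize Q with a neural network and take gradient steps on exactly this squared TD error. A minimal PyTorch sketch (my own illustrative names such as `QNetwork` and `q_learning_step`; it assumes state feature tensors, integer action indices, and an optimizer like `torch.optim.Adam(qnet.parameters())`):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, ·): map a state feature vector to one value per action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s):
        return self.net(s)

def q_learning_step(qnet, optimizer, s, a, r, s_next, gamma):
    """One gradient step on (Q(s,a) - (r + gamma * max_a' Q(s',a')))^2,
    holding the target fixed (no gradient flows through it)."""
    q_sa = qnet(s)[a]                           # current estimate of Q(s, a)
    with torch.no_grad():                       # treat the target as a constant
        target = r + gamma * qnet(s_next).max()
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```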
Outline
- Recap: Markov Decision Processes
- Reinforcement Learning Setup
  - What's changed from MDP?
- Model-based methods
- Model-free methods
  - (tabular) Q-learning
    - ϵ-greedy action selection
    - exploration vs. exploitation
  - (neural network) Q-learning
- RL setup again
  - What's changed from supervised learning?

Supervised learning: we learn from a dataset that comes with direct supervision (labels).

- What if no direct supervision is available?
- That is strictly the RL setting: interact, observe, and collect data; use rewards/values as a "coy" supervision signal.
Thanks!
We'd appreciate your feedback on the lecture.