Shen Shen
April 19, 2024
Recap: Supervised Learning
Markov Decision Processes
Formal definition
Policy Evaluation
State-Value Functions: \(V\)-values
Finite horizon (recursion) and infinite horizon (equation)
Optimal Policy and Finding Optimal Policy
General tool: State-action Value Functions: \(Q\)-values
Value iteration
Toddler demo, Russ Tedrake thesis, 2004
(Uses vanilla policy gradient (actor-critic))
Reinforcement Learning with Human Feedback
Foundational tools and concepts for understanding RL.
Research area initiated in the 1950s (Bellman), known under various names (in various communities):
Stochastic optimal control (Control theory)
Stochastic shortest path (Operations research)
Sequential decision making under uncertainty (Economics)
Dynamic programming, control of dynamical systems (under uncertainty)
Reinforcement learning (Artificial Intelligence, Machine Learning)
A rich variety of (accessible & elegant) theory/math, algorithms, and applications/illustrations
As a result, there is quite a large variation in notation.
We will use the most RL-flavored notation.
Normally, actions take Mario to the “intended” state.
E.g., in state (7), action “↑” gets to state (4)
If an action would've taken us out of this world, stay put
E.g., in state (9), action “→” gets back to state (9)
except, in state (6), action “↑” leads to two possibilities:
20% chance ends in (2)
80% chance ends in (3)
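A minimal Python sketch of these dynamics (not from the slides), assuming the 3×3 layout implied above, with states 1, 2, 3 on the top row, 4, 5, 6 in the middle, and 7, 8, 9 on the bottom; the function name is just for illustration:

```python
# Sketch of the grid's transition model T(s, a, s'), under the assumed 3x3 layout.
def transition_probs(s, a):
    """Return a dict {s_next: probability} for state s and action a."""
    if s == 6 and a == "up":              # the one special, stochastic transition
        return {2: 0.2, 3: 0.8}
    row, col = divmod(s - 1, 3)
    if a == "up":    row -= 1
    if a == "down":  row += 1
    if a == "left":  col -= 1
    if a == "right": col += 1
    if not (0 <= row <= 2 and 0 <= col <= 2):
        return {s: 1.0}                   # would leave the world: stay put
    return {3 * row + col + 1: 1.0}

print(transition_probs(7, "up"))       # {4: 1.0}
print(transition_probs(9, "right"))    # {9: 1.0}
print(transition_probs(6, "up"))       # {2: 0.2, 3: 0.8}
```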
example cont'd
reward of being in state 3, taking action \(\uparrow\)
reward of being in state 3, taking action \(\downarrow\)
reward of being in state 6, taking action \(\downarrow\)
reward of being in state 6, taking action \(\rightarrow\)
actions: {Up ↑, Down ↓, Left ←, Right →}
Ultimate goal of an MDP: Find the "best" policy \(\pi\).
Sidenote:
State \(s\)
Action \(a\)
Reward \(r\)
Policy \(\pi(s)\)
Transition \(\mathrm{T}\left(s, a, s^{\prime}\right)\)
Reward function \(\mathrm{R}(s, a)\)
a trajectory (aka an experience or rollout) \(\quad \tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots\right)\)
how "good" is a trajectory?
For a given policy \(\pi(s),\) the finite-horizon, horizon-\(h\) (state) value functions are:
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
MDP
Policy evaluation
Recall:
example: evaluating the "always \(\uparrow\)" policy
\(\pi(s) = ``\uparrow",\ \forall s\)
\(\mathrm{R}(3, \uparrow) = 1\)
\(\mathrm{R}(6, \uparrow) = -10\)
\(\mathrm{R}(s, \uparrow) = 0\) for the other seven states
Suppose \(\gamma = 0.9\)
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)
\( h\) terms inside
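In particular, for \(h = 1\) there is a single term, so the horizon-1 values are just the immediate rewards under the policy: \(V^1_\pi(s) = \mathrm{R}(s, \uparrow)\), i.e., \(V^1_\pi(3) = 1\), \(V^1_\pi(6) = -10\), and \(V^1_\pi(s) = 0\) for the other seven states.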
With horizon \(h = 2,\) there are only \(2\) terms inside the sum: \(V^2_\pi(s)=\mathbb{E}\left[\mathrm{R}\left(s_0, \pi\left(s_0\right)\right)+\gamma \mathrm{R}\left(s_1, \pi\left(s_1\right)\right) \mid s_0=s, \pi\right], \forall s\)
[Diagram: computing \(V^2_\pi(6)\) — from state 6, action \(\uparrow\) leads to state 2 (20%) or state 3 (80%); in either case the policy then takes \(\uparrow\) again.]
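Reading off the branches above (and using the horizon-1 values \(V^1_\pi(2) = 0\) and \(V^1_\pi(3) = 1\)): \(V^2_\pi(6)=\mathrm{R}(6, \uparrow)+\gamma\left[0.2\, V^1_\pi(2)+0.8\, V^1_\pi(3)\right]=-10+0.9 \times 0.8=-9.28\).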
Now, let's think about \(V_\pi^3(6)\)
[Diagram: the same branching unrolled one more step — from state 6, action \(\uparrow\) reaches state 2 or state 3, and the policy keeps taking \(\uparrow\) from there.]
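Unrolling one more step, and using \(V^2_\pi(2) = 0\) and \(V^2_\pi(3) = 1 + 0.9 \times 1 = 1.9\) (since \(\uparrow\) from the top-row states 2 and 3 leaves Mario in place): \(V^3_\pi(6)=\mathrm{R}(6, \uparrow)+\gamma\left[0.2\, V^2_\pi(2)+0.8\, V^2_\pi(3)\right]=-10+0.9 \times 0.8 \times 1.9=-8.632\).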
Putting these pieces together gives the finite-horizon Bellman recursion:
\(V^h_\pi(s)=\mathrm{R}(s, \pi(s))+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{h-1}_\pi\left(s^{\prime}\right)\)
the left-hand side: expected sum of discounted rewards, for starting in state \(s\) and following policy \(\pi(s)\) for horizon \(h\)
\(\mathrm{R}(s, \pi(s))\): immediate reward, for being in state \(s\) and taking the action given by policy \(\pi(s)\)
\(V^{h-1}_\pi(s^{\prime})\): the \((h-1)\)-horizon values at a next state \(s^{\prime}\), discounted by \(\gamma\)
\(\mathrm{T}(s, \pi(s), s^{\prime})\): each next-state value is weighted by the probability of getting to that next state \(s^{\prime}\)
finite-horizon policy evaluation (Bellman recursion):
For a given policy \(\pi(s),\) the finite-horizon, horizon-\(h\) (state) value functions are
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)
infinite-horizon policy evaluation (Bellman equation):
For any given policy \(\pi(s),\) the infinite-horizon (state) value functions are
\(V_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)
(for the infinite-horizon case, \(\gamma\) now necessarily needs to be \(<1\) for convergence)
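A minimal Python sketch of both procedures for the "always \(\uparrow\)" policy. It re-encodes the grid so the block runs on its own; the 3×3 layout, the reward table, and the helper names (`T`, `R`, `evaluate_finite`, `evaluate_infinite`) are the same assumptions as in the earlier sketch, not the slides' code.

```python
GAMMA = 0.9

def T(s, a):                                  # {s_next: prob}; 3x3 grid, states 1..9
    if s == 6 and a == "up":
        return {2: 0.2, 3: 0.8}
    row, col = divmod(s - 1, 3)
    row += {"up": -1, "down": 1}.get(a, 0)
    col += {"left": -1, "right": 1}.get(a, 0)
    if 0 <= row <= 2 and 0 <= col <= 2:
        return {3 * row + col + 1: 1.0}
    return {s: 1.0}

def R(s, a):                                  # R(3, up)=1, R(6, up)=-10, else 0
    return {3: 1.0, 6: -10.0}.get(s, 0.0)

pi = {s: "up" for s in range(1, 10)}          # the "always up" policy

def evaluate_finite(pi, h):
    """Horizon-h values via the Bellman recursion, starting from V^0 = 0."""
    V = {s: 0.0 for s in pi}
    for _ in range(h):
        V = {s: R(s, pi[s]) + GAMMA * sum(p * V[sp] for sp, p in T(s, pi[s]).items())
             for s in pi}
    return V

def evaluate_infinite(pi, tol=1e-8):
    """Iterate the Bellman equation until the values stop changing."""
    V = {s: 0.0 for s in pi}
    while True:
        V_new = {s: R(s, pi[s]) + GAMMA * sum(p * V[sp] for sp, p in T(s, pi[s]).items())
                 for s in pi}
        if max(abs(V_new[s] - V[s]) for s in pi) < tol:
            return V_new
        V = V_new

print(evaluate_finite(pi, 2)[6])          # ≈ -9.28
print(evaluate_finite(pi, 3)[6])          # ≈ -8.632
print(round(evaluate_infinite(pi)[3], 3)) # ≈ 10.0, i.e., 1/(1 - 0.9)
```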
\(V\) values vs. \(Q\) values
\(Q^h(s, a)\) is the expected sum of discounted rewards for starting in state \(s\), taking action \(a\), and then acting optimally for the remaining \(h-1\) steps.
example: recursively finding \(Q^h(s, a)\)
Recall: \(\gamma = 0.9\); the states and the one special transition are as before (from state 6, action \(\uparrow\) goes to state 2 with 20% chance and to state 3 with 80% chance).
In this example the reward \(\mathrm{R}(s,a)\) depends only on the state (consistent with the calculations below): \(\mathrm{R}(3, a) = 1\) and \(\mathrm{R}(6, a) = -10\) for every action \(a\), and \(\mathrm{R}(s, a) = 0\) otherwise.
Let's consider \(Q^2(3, \rightarrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\)
\( = 1 + .9 \times 1 = 1.9\)
Let's consider \(Q^2(3, \uparrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\)
\( = 1 + .9 \times 1 = 1.9\)
Let's consider \(Q^2(3, \leftarrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)\)
\( = 1 + .9 \times 0 = 1\)
Let's consider \(Q^2(3, \downarrow)\)
\( = 1 + .9 \max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right)\)
\( = 1 + .9 \times (-10) = -8\)
Let's consider \(Q^2(6, \uparrow)\)
\(=\mathrm{R}(6,\uparrow) + \gamma[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)] \)
\(= -10 + .9 [.2 \times 0+ .8 \times 1] = -9.28\)
\(Q^2(6, \uparrow) =\mathrm{R}(6,\uparrow) + \gamma[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)] \)
in general: \(Q^h(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right)\)
what's the optimal action in state 3, with horizon 2, given by \(\pi_2^*(3)=?\)
either up or right (both achieve the maximal \(Q^2(3, a) = 1.9\))
in general: \(\pi_h^*(s)=\arg \max _a Q^h(s, a)\)
Given the finite-horizon recursion
\(Q^h(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right),\)
we should easily be convinced of the infinite-horizon equation
\(Q(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right).\)
Infinite-horizon Value Iteration
If, instead of relying on line 6 (the convergence criterion), we run the block of lines 4 and 5 for \(h\) iterations, then the returned values are exactly the horizon-\(h\) Q values.
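A minimal Python sketch of infinite-horizon value iteration (not the slides' pseudocode; the grid encoding and helper names repeat the assumptions from the earlier sketches). Passing a fixed `horizon` instead of relying on the tolerance check returns exactly the horizon-\(h\) Q values, which reproduces the numbers worked out above:

```python
GAMMA = 0.9
ACTIONS = ["up", "down", "left", "right"]

def T(s, a):                                  # {s_next: prob}; 3x3 grid, states 1..9
    if s == 6 and a == "up":
        return {2: 0.2, 3: 0.8}
    row, col = divmod(s - 1, 3)
    row += {"up": -1, "down": 1}.get(a, 0)
    col += {"left": -1, "right": 1}.get(a, 0)
    if 0 <= row <= 2 and 0 <= col <= 2:
        return {3 * row + col + 1: 1.0}
    return {s: 1.0}

def R(s, a):                                  # R(3, a)=1, R(6, a)=-10, else 0
    return {3: 1.0, 6: -10.0}.get(s, 0.0)

def value_iteration(horizon=None, tol=1e-6):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a').
    With a fixed horizon, do exactly that many sweeps (horizon-h Q values);
    otherwise stop when the update no longer changes the values."""
    Q_old = {(s, a): 0.0 for s in range(1, 10) for a in ACTIONS}
    sweep = 0
    while True:
        Q_new = {}
        for (s, a) in Q_old:
            Q_new[(s, a)] = R(s, a) + GAMMA * sum(
                p * max(Q_old[(sp, ap)] for ap in ACTIONS)
                for sp, p in T(s, a).items())
        sweep += 1
        if horizon is not None and sweep == horizon:
            return Q_new
        if horizon is None and max(abs(Q_new[k] - Q_old[k]) for k in Q_new) < tol:
            return Q_new
        Q_old = Q_new

Q2 = value_iteration(horizon=2)
print(Q2[(3, "right")], Q2[(3, "down")], Q2[(6, "up")])   # ≈ 1.9, -8.0, -9.28
```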
We'd appreciate your feedback on the lecture.
Let's try to find \(Q^1 (1, \uparrow)\)
The next state following \((1, \uparrow)\) is only state 1.
Normally, actions take Mario to the “intended” state.
E.g., in state (7), action “↑” gets to state (4)
If an action would've taken us out of this world, stay put
E.g., in state (1), action “↑” gets back to state (1)
except, in state (6), action “↑” leads to two possibilities:
20% chance ends in (2)
80% chance ends in (3)
Recall:
\(\pi(s) = ``\uparrow",\ \forall s\)
\(\mathrm{R}(3, \uparrow) = 1\)
\(\mathrm{R}(6, \uparrow) = -10\)
\(\gamma = 0.9\)
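Putting these facts together: a horizon-1 Q value is just the immediate reward, so \(Q^1(1, \uparrow) = \mathrm{R}(1, \uparrow) = 0\).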