Intro to Machine Learning

Lecture 10: Markov Decision Processes

Shen Shen

April 19, 2024

Outline

  • Recap: Supervised Learning

  • Markov Decision Processes

    • ​Mario example
    • Formal definition

    • Policy Evaluation

      • State-Value Functions: \(V\)-values

      • Finite horizon (recursion) and infinite horizon (equation)

    • Optimal Policy and Finding Optimal Policy

      • General tool: State-action Value Functions: \(Q\)-values

      • Value iteration

Toddler demo, Russ Tedrake thesis, 2004

(Uses vanilla policy gradient (actor-critic))

(The demo won't embed in PDF. But the direct link below works.)


Reinforcement Learning with Human Feedback

Markov Decision Processes

  • Foundational tools and concepts for understanding RL.

  • Research area initiated in the 1950s (Bellman), known under various names (in various communities):

    • Stochastic optimal control (Control theory)

    • Stochastic shortest path (Operations research)

    • Sequential decision making under uncertainty (Economics)

    • Dynamic programming, control of dynamical systems (under uncertainty)

    • Reinforcement learning (Artificial Intelligence, Machine Learning)

  • A rich variety of (accessible & elegant) theory/math, algorithms, and applications/illustrations

  • As a result, there is quite a large variation in notation.

  • We will use the most RL-flavored notation.

  • Almost all transitions are deterministic:
    • Normally, actions take Mario to the “intended” state.

      • E.g., in state (7), action “↑” gets to state (4)

    • If an action would've taken us out of this world, stay put

      • E.g., in state (9), action “→” gets back to state (9)

    • except, in state (6), action “↑” leads to two possibilities:

      • 20% chance ends in (2)

      • 80% chance ends in (3)

[Figure: the 3×3 grid world with states 1–9; from state 6, action “↑” leads to state 2 with 20% chance and state 3 with 80% chance.]

Running example: Mario in a grid-world 

  • 9 possible states
  • 4 possible actions: {Up ↑, Down ↓, Left ←, Right →}

example cont'd


  • Certain (state, action) pairs earn Mario rewards:
  • In state (3), any action gets reward +1
  • In state (6), any action gets reward -10
  • Any other (state, action) pair gets reward 0

actions: {Up ↑, Down ↓, Left ←, Right →}

  • The goal is to find a gameplay strategy for Mario, to
    • get maximum sum of rewards
    • get these rewards as soon as possible

Definition and Goal

  • \(\mathcal{S}\) : state space, contains all possible states \(s\).
  • \(\mathcal{A}\) : action space, contains all possible actions \(a\).
  • \(\mathrm{T}\left(s, a, s^{\prime}\right)\) : the probability of transition from state \(s\) to \(s^{\prime}\) when action \(a\) is taken.
  • \(\mathrm{R}(s, a)\) : a function that takes in the (state, action) and returns a reward.
  • \(\gamma \in [0,1]\): discount factor, a scalar.
  • \(\pi{(s)}\) : policy, takes in a state and returns an action.

Ultimate goal of an MDP: Find the "best" policy \(\pi\).

Sidenote:

  • In 6.390, \(\mathrm{R}(s, a)\) is deterministic and bounded.
  • In 6.390, \(\pi(s)\) is deterministic.
  • This week, \(\mathcal{S}\) and \(\mathcal{A}\) are discrete sets, i.e., they have finitely many elements (in fact, typically quite small).
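To make these pieces concrete, here is a minimal sketch (not from the lecture) of how the running Mario grid-world could be encoded as arrays; the 0-indexing, the action ordering, and the 3×3 layout [1 2 3; 4 5 6; 7 8 9] are my own choices, inferred from the transition examples earlier.

import numpy as np

n_states, n_actions = 9, 4
gamma = 0.9

# R[s, a]: deterministic reward for taking action a in state s
R = np.zeros((n_states, n_actions))
R[2, :] = 1.0     # state 3: any action gets +1
R[5, :] = -10.0   # state 6: any action gets -10

# T[s, a, s']: probability of landing in s' after taking action a in state s
T = np.zeros((n_states, n_actions, n_states))
moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right
for s in range(n_states):
    row, col = divmod(s, 3)
    for a, (dr, dc) in moves.items():
        nr, nc = row + dr, col + dc
        # stepping off the grid means staying put
        nxt = s if not (0 <= nr < 3 and 0 <= nc < 3) else nr * 3 + nc
        T[s, a, nxt] = 1.0

# the one special stochastic transition: state 6, action "up"
T[5, 0, :] = 0.0
T[5, 0, 1] = 0.2   # 20% chance of ending in state 2
T[5, 0, 2] = 0.8   # 80% chance of ending in state 3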

[Figure: the agent–environment loop over time: at each step, the agent in state \(s\) takes the action \(a = \pi(s)\), receives reward \(r = \mathrm{R}(s, a)\), and the environment transitions to a next state according to \(\mathrm{T}\left(s, a, s^{\prime}\right)\).]

a trajectory (aka an experience or rollout): \(\tau=\left(s_0, a_0, r_0, s_1, a_1, r_1, \ldots\right)\), where \(r_t = \mathrm{R}(s_t, a_t)\)

how "good" is a trajectory?

\mathrm{R}(s_3, a_3)
\mathrm{R}(s_0, a_0)
\mathrm{R}(s_1, a_1)
\mathrm{R}(s_2, a_2)
\mathrm{R}(s_4, a_4)
\mathrm{R}(s_5, a_5)
\mathrm{R}(s_6, a_6)
\mathrm{R}(s_7, a_7)
+
+
+
+
+
+
+
\dots
\mathrm{R}(s_0, a_0)
\gamma \mathrm{R}(s_1, a_1)
\gamma^3\mathrm{R}(s_3, a_3)
\gamma^2 \mathrm{R}(s_2, a_2)
\gamma^4\mathrm{R}(s_4, a_4)
\gamma^5\mathrm{R}(s_5, a_5)
\gamma^6\mathrm{R}(s_6, a_6)
\gamma^7\mathrm{R}(s_7, a_7)
\dots
+
+
+
+
+
+
+
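As a small illustration (my own, not from the slides), the discounted score of a finite reward sequence can be computed directly:

# A tiny helper that scores a reward sequence by its discounted sum,
# matching the second expression above.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. four steps of reward 1 (as when sitting in state 3 under "always up"):
print(discounted_return([1, 1, 1, 1]))  # 1 + 0.9 + 0.81 + 0.729 = 3.439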

  • Now, suppose the horizon \(h\) (how many time steps) and the initial state \(s_0\) are given.
  • Also, recall that the rewards \(\mathrm{R}(s,a)\) and the policy \(\pi(s)\) are deterministic.
  • There would still be randomness in a trajectory, due to the stochastic transitions.
  • That is, we cannot just evaluate the discounted sum of rewards along one trajectory; we take its expectation:

For a given policy \(\pi(s),\) the finite-horizon horizon-\(h\) (state) value functions are:
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)

Policy evaluation

[Diagram: a policy \(\pi(s)\) and an MDP go in; the values \(V_{\pi}(s)\) come out.]

  • expected sum of discounted rewards, for starting in state \(s,\) following policy \(\pi(s),\) for horizon \(h.\)
  • expectation w.r.t. stochastic transition.
  • horizon-0 values are all 0.
  • value is a long-term thing, reward is a one-time thing.
\(\mathbb{E}\left[\mathrm{R}(s_0, a_0) + \gamma \mathrm{R}(s_1, a_1) + \gamma^2 \mathrm{R}(s_2, a_2) + \cdots + \gamma^7 \mathrm{R}(s_7, a_7) + \cdots\right]\)

Recall:

example: evaluating the "always \(\uparrow\)" policy

\(\pi(s) = ``\uparrow",\  \forall s\)

\(\mathrm{R}(3, \uparrow) = 1\)

\(\mathrm{R}(6, \uparrow) = -10\)

\(\mathrm{R}(s, \uparrow) = 0\) for all other seven states

 

Suppose \(\gamma = 0.9\)

  • Horizon \(h\) = 0; nothing happens
  • Horizon \(h\) = 1: simply receiving the rewards
\(V_{\pi}^0(s) = 0\) for every state \(s\).

\(V_{\pi}^1(s) = 0\) for every state, except \(V_{\pi}^1(3) = 1\) and \(V_{\pi}^1(6) = -10\).

\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s, h\)

\(V^h_\pi(s) = \mathbb{E}\left[\mathrm{R}(s_0, a_0) + .9\, \mathrm{R}(s_1, a_1) + (.9)^2 \mathrm{R}(s_2, a_2) + \cdots\right]\), with \(h\) terms inside the expectation.


Recall:

\(\pi(s) =  ``\uparrow",\  \forall s\)

\(\mathrm{R}(3, \uparrow) = 1\)

\(\mathrm{R}(6, \uparrow) = -10\)

\(\gamma = 0.9\)

  • Horizon \(h\) = 2

\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right]\)

\(V_{\pi}^2(s) = \mathbb{E}\left[\mathrm{R}(s_0, a_0) + .9\, \mathrm{R}(s_1, a_1)\right]\), with 2 terms inside the expectation.


  • if \(s_0 = 1\), receive \(\mathrm{R}(1, \uparrow) + \gamma \mathrm{R}(1, \uparrow) = 0\)
  • if \(s_0 = 2\), receive \(\mathrm{R}(2, \uparrow) + \gamma \mathrm{R}(2, \uparrow) = 0\)
  • if \(s_0 = 3\), receive \(\mathrm{R}(3, \uparrow) + \gamma \mathrm{R}(3, \uparrow) = 1.9\)
  • if \(s_0 = 4\), receive \(\mathrm{R}(4, \uparrow) + \gamma \mathrm{R}(1, \uparrow) = 0\)
  • if \(s_0 = 5\), receive \(\mathrm{R}(5, \uparrow) + \gamma \mathrm{R}(2, \uparrow) = 0\)
  • if \(s_0 = 6\), receive \(\mathrm{R}(6, \uparrow) + \gamma [.2 \mathrm{R}(2, \uparrow) + .8 \mathrm{R}(3, \uparrow)] = -9.28\)
  • if \(s_0 = 7\), receive \(\mathrm{R}(7, \uparrow) + \gamma \mathrm{R}(4, \uparrow) = 0\)
  • if \(s_0 = 8\), receive \(\mathrm{R}(8, \uparrow) + \gamma \mathrm{R}(5, \uparrow) = 0\)
  • if \(s_0 = 9\), receive \(\mathrm{R}(9, \uparrow) + \gamma \mathrm{R}(6, \uparrow) = -9\)

[Figure: starting in state 6 and taking action \(\uparrow\), Mario receives \(\mathrm{R}(6, \uparrow)\) and lands in state 2 (20%) or state 3 (80%); one more \(\uparrow\) step then collects \(\gamma \mathrm{R}(2, \uparrow)\) or \(\gamma \mathrm{R}(3, \uparrow)\).]

Collecting the values so far:

  • \(V_{\pi}^0(s) = 0\) for all \(s\)
  • \(V_{\pi}^1(s)\): \(1\) at state 3, \(-10\) at state 6, \(0\) elsewhere
  • \(V_{\pi}^2(s)\): \(1.9\) at state 3, \(-9.28\) at state 6, \(-9\) at state 9, \(0\) elsewhere

Now, let's think about \(V_\pi^3(6)\)


Unrolling two steps from state 6 under the always-\(\uparrow\) policy:

\(V_\pi^3(6) = \mathrm{R}(6, \uparrow) + \gamma\left\{.2\left[\mathrm{R}(2, \uparrow) + \gamma \mathrm{R}(2, \uparrow)\right] + .8\left[\mathrm{R}(3, \uparrow) + \gamma \mathrm{R}(3, \uparrow)\right]\right\}\)

\(= \mathrm{R}(6, \uparrow) + \gamma\left[.2\, V_\pi^2(2) + .8\, V_\pi^2(3)\right]\)
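Plugging in the numbers of the running example (using the \(V_\pi^2\) values computed above) gives a quick arithmetic check:

\(V_\pi^3(6) = -10 + 0.9\,[\,0.2 \times 0 + 0.8 \times 1.9\,] = -10 + 0.9 \times 1.52 = -8.632\)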

Bellman Recursion

\(V_\pi^h(s)= \mathrm{R}(s, \pi(s))+ \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V_\pi^{h-1}\left(s^{\prime}\right)\)

  • Left-hand side: the expected sum of discounted rewards, for starting in state \(s\) and following policy \(\pi(s)\) for horizon \(h\).
  • \(\mathrm{R}(s, \pi(s))\): the immediate reward, for being in state \(s\) and taking the action given by the policy, \(\pi(s)\).
  • \(V_\pi^{h-1}(s^{\prime})\): the \((h-1)\)-horizon values at a next state \(s^{\prime}\), weighted by \(\mathrm{T}(s, \pi(s), s^{\prime})\), the probability of getting to that next state \(s^{\prime}\), and discounted by \(\gamma\).
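Here is a brief Python/numpy sketch of finite-horizon policy evaluation using this recursion; it assumes \(\mathrm{T}\) and \(\mathrm{R}\) are stored as arrays of shapes [S, A, S] and [S, A] (as in the grid-world sketch earlier), and the function name is my own:

import numpy as np

def evaluate_policy_finite(T, R, gamma, policy, horizon):
    """Finite-horizon policy evaluation via the Bellman recursion.

    T: [S, A, S] transition probabilities, R: [S, A] rewards,
    policy: length-S array of action indices.
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)            # horizon-0 values are all 0
    for _ in range(horizon):
        V_new = np.empty(n_states)
        for s in range(n_states):
            a = policy[s]
            # immediate reward + discounted expected (h-1)-horizon value
            V_new[s] = R[s, a] + gamma * T[s, a, :] @ V
        V = V_new
    return V

# e.g. the "always up" policy for horizon 2 (with the arrays from the earlier
# sketch) gives 1.9 at state 3, -9.28 at state 6, and -9 at state 9:
# evaluate_policy_finite(T, R, 0.9, np.zeros(9, dtype=int), 2)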

Finite-horizon policy evaluation

For a given policy \(\pi(s),\) the finite-horizon, horizon-\(h\) (state) value functions are:
\(V^h_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)

Bellman recursion:
\(V^{h}_\pi(s)= \mathrm{R}(s, \pi(s))+ \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V^{h-1}_\pi\left(s^{\prime}\right), \forall s\)

Infinite-horizon policy evaluation

For any given policy \(\pi(s),\) the infinite-horizon (state) value functions are
\(V_\pi(s):=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathrm{R}\left(s_t, \pi\left(s_t\right)\right) \mid s_0=s, \pi\right], \forall s\)

  • \(\gamma\) now necessarily needs to be \(<1\) for convergence, too.

Bellman equation:
\(V_\pi(s)= \mathrm{R}(s, \pi(s))+ \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, \pi(s), s^{\prime}\right) V_\pi\left(s^{\prime}\right), \forall s\)

  • These are \(|\mathcal{S}|\)-many linear equations, one per state, so the infinite-horizon values can be found by solving this linear system.
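A sketch of that linear-system route, under the same array conventions as the earlier sketches (again, the names are my own):

import numpy as np

def evaluate_policy_infinite(T, R, gamma, policy):
    """Infinite-horizon policy evaluation: solve (I - gamma * T_pi) V = R_pi."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    T_pi = T[idx, policy, :]    # [S, S]: transition matrix under the policy
    R_pi = R[idx, policy]       # [S]:    reward vector under the policy
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)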
Optimal policy \(\pi^*\)

  • Definition of \(\pi^*\): for any given horizon \(h\) (possibly infinite), \(\mathrm{V}^h_{\pi^*}({s}) \geqslant \mathrm{V}^h_\pi({s})\) for all \(s \in \mathcal{S}\) and for all possible policies \(\pi\).
  • For a fixed MDP, the optimal values \(\mathrm{V}^h_{\pi^*}({s})\) must be unique.
  • The optimal policy \(\pi^*\) might not be unique (think, e.g., of a symmetric world).
  • With a finite horizon, the optimal policy depends on the horizon.
  • With an infinite horizon, the horizon no longer matters: there exists a stationary optimal policy.

\(V\) values vs. \(Q\) values

  • \(V\) is defined over the state space; \(Q\) is defined over the (state, action) space.
  • Any policy can be evaluated to get \(V\) values; whereas \(Q\), per our definition, has a sense of "tail optimality" baked in.
  • \(\mathrm{V}^h_{\pi^*}({s})\) can be derived from \(Q^h(s,a)\), and vice versa.
  • \(Q\) is easier to read "optimal actions" off of.

(Optimal) state-action value functions \(Q^h(s, a)\)

\(Q^h(s, a)\) is the expected sum of discounted rewards for

  • starting in state \(s\),
  • taking action \(a\), for one step,
  • acting optimally thereafter for the remaining \((h-1)\) steps.
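To make the "derived from each other" bullet concrete, the two kinds of values are tied together by the following identities, which follow from the definitions above:

\(\mathrm{V}^h_{\pi^*}(s) = \max_a Q^h(s, a), \qquad Q^h(s, a) = \mathrm{R}(s, a) + \gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \mathrm{V}^{h-1}_{\pi^*}(s^{\prime})\)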

example: recursively finding \(Q^h(s, a)\)

Recall: \(\gamma = 0.9\); states and one special transition as before (from state 6, action \(\uparrow\) goes to state 2 with 20% chance and state 3 with 80% chance); rewards \(\mathrm{R}(s,a)\): \(+1\) for any action in state 3, \(-10\) for any action in state 6, \(0\) otherwise.

\(Q^h(s, a)\) is the expected sum of discounted rewards for

  • starting in state \(s\),
  • taking action \(a\), for one step,
  • acting optimally thereafter for the remaining \((h-1)\) steps.
\(Q^0(s, a) = 0\) for every \((s, a)\) pair.

\(Q^1(s, a)\): \(1\) for every action in state 3, \(-10\) for every action in state 6, \(0\) for every other \((s, a)\) pair, i.e., just the one-step rewards.
\(Q^1(s, a) = \mathrm{R}(s,a)\). Now let's compute \(Q^2(s, a)\).

Let's consider \(Q^2(3, \rightarrow)\)

  • receive \(\mathrm{R}(3,\rightarrow)\)
  • next state \(s'\) = 3; act optimally for the remaining one timestep
    • receive \(\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\)
  • \(Q^2(3, \rightarrow) = \mathrm{R}(3,\rightarrow)  + \gamma \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1.9\)

Let's consider \(Q^2(3, \uparrow)\)

  • receive \(\mathrm{R}(3,\uparrow)\)
  • next state \(s'\) = 3; act optimally for the remaining one timestep
    • receive \(\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\)
  • \(Q^2(3, \uparrow) = \mathrm{R}(3,\uparrow)  + \gamma \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right) = 1.9\)


Let's consider \(Q^2(3, \leftarrow)\)

  • receive \(\mathrm{R}(3,\leftarrow)\)
  • next state \(s'\) = 2; act optimally for the remaining one timestep
    • receive \(\max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)\)
  • \(Q^2(3, \leftarrow) = \mathrm{R}(3,\leftarrow)  + \gamma \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right) = 1\)

Let's consider \(Q^2(3, \downarrow)\)

  • receive \(\mathrm{R}(3,\downarrow)\)
  • next state \(s'\) = 6; act optimally for the remaining one timestep
    • receive \(\max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right)\)
  • \(Q^2(3, \downarrow) = \mathrm{R}(3,\downarrow)  + \gamma \max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right) = 1 + .9 \max _{a^{\prime}} Q^{1}\left(6, a^{\prime}\right) = -8\)
Let's consider \(Q^2(6, \uparrow)\)

  • receive \(\mathrm{R}(6,\uparrow)\)
  • act optimally for one more timestep, at the next state \(s^{\prime}\):
    • 20% chance, \(s'\) = 2: act optimally, receive \(\max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)\)
    • 80% chance, \(s'\) = 3: act optimally, receive \(\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)\)
  • \(Q^2(6, \uparrow) =\mathrm{R}(6,\uparrow)  + \gamma[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)] = -10 + .9 [.2 \times 0 + .8 \times 1] = -9.28\)
Collecting the \(Q^2\) entries computed so far: \(Q^2(3, \uparrow) = Q^2(3, \rightarrow) = 1.9\), \(Q^2(3, \leftarrow) = 1\), \(Q^2(3, \downarrow) = -8\), and \(Q^2(6, \uparrow) = -9.28\).

\(Q^2(6, \uparrow) =\mathrm{R}(6,\uparrow)  + \gamma[.2 \max _{a^{\prime}} Q^{1}\left(2, a^{\prime}\right)+ .8\max _{a^{\prime}} Q^{1}\left(3, a^{\prime}\right)] \)

In general:

\(Q^h (s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime} \right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right), \forall s,a\)

(For \(h = 1\), the sum term vanishes because all \(Q^0\) values are 0, and this reduces to \(Q^1(s, a) = \mathrm{R}(s, a)\).)

In general, \(\pi_h^*(s)=\arg \max _a Q^h(s, a), \forall s, h\)

What's the optimal action in state 3, with horizon 2, as given by \(\pi_2^*(3)\)? Either up or right.
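Concretely, reading off the horizon-2 entries computed above for state 3:

\(Q^2(3, \uparrow) = 1.9, \quad Q^2(3, \rightarrow) = 1.9, \quad Q^2(3, \leftarrow) = 1, \quad Q^2(3, \downarrow) = -8\)

so the \(\arg\max\) is attained by both \(\uparrow\) and \(\rightarrow\).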

Infinite-horizon Value Iteration

Given the finite-horizon recursion

\(Q^h (s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime} \right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right)\)

we should easily be convinced of the infinite-horizon equation

\(Q(s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime} \right) \max _{a^{\prime}} Q \left(s^{\prime}, a^{\prime}\right)\)

which value iteration solves iteratively:

  1. for \(s \in \mathcal{S}, a \in \mathcal{A}\) :
  2.       \(\mathrm{Q}_{\text {old }}(\mathrm{s}, \mathrm{a})=0\)
  3. while True:
  4.       for \(s \in \mathcal{S}, a \in \mathcal{A}\) :
  5.             \(\mathrm{Q}_{\text {new }}(s, a) \leftarrow \mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime}\right) \max _{a^{\prime}} \mathrm{Q}_{\text {old }}\left(s^{\prime}, a^{\prime}\right)\)
  6.       if \(\max _{s, a}\left|Q_{\text {old }}(s, a)-Q_{\text {new }}(s, a)\right|<\epsilon:\)
  7.             return \(\mathrm{Q}_{\text {new }}\)
  8.       \(\mathrm{Q}_{\text {old }} \leftarrow \mathrm{Q}_{\text {new }}\)

If, instead of relying on line 6 (the convergence criterion), we run the block of lines 4 and 5 exactly \(h\) times, then the returned values are exactly the horizon-\(h\) \(Q\) values.
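Here is a compact Python/numpy sketch of this pseudocode (assuming the same \(\mathrm{T}\), \(\mathrm{R}\) array conventions as in the earlier sketches; the function name and the default \(\epsilon\) are my own choices):

import numpy as np

def value_iteration(T, R, gamma, eps=1e-6):
    """Infinite-horizon value iteration, mirroring lines 1-8 of the pseudocode."""
    Q_old = np.zeros_like(R)                        # lines 1-2
    while True:                                     # line 3
        V_old = Q_old.max(axis=1)                   # max_a' Q_old(s', a') for each s'
        Q_new = R + gamma * T @ V_old               # lines 4-5, vectorized over (s, a)
        if np.max(np.abs(Q_old - Q_new)) < eps:     # line 6
            return Q_new                            # line 7
        Q_old = Q_new                               # line 8

# A greedy policy can then be read off as pi*(s) = argmax_a Q(s, a), e.g.:
# Q = value_iteration(T, R, 0.9); pi_star = Q.argmax(axis=1)

As noted above, capping the loop at exactly \(h\) iterations, instead of stopping at the line-6 convergence check, returns exactly the horizon-\(h\) \(Q\) values.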

Thanks!

We'd appreciate your feedback on the lecture.

Let's find \(Q^2 (1, \uparrow)\).

(\(Q^0(2,\uparrow)\), \(Q^0(6,\rightarrow)\), \(Q^0(8,\downarrow)\), and indeed all horizon-0 values \(Q^0(s, a)\), are 0.)

\(Q^h (s, a)=\mathrm{R}(s, a)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(s, a, s^{\prime} \right) \max _{a^{\prime}} Q^{h-1}\left(s^{\prime}, a^{\prime}\right)\)

The next state following \((1, \uparrow)\) is only state 1, so

\(Q^2 (1, \uparrow)=\mathrm{R}(1, \uparrow)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(1, \uparrow, s^{\prime} \right) \max _{a^{\prime}} Q^{1}\left(s^{\prime}, a^{\prime}\right) = \mathrm{R}(1, \uparrow)+\gamma \max _{a^{\prime}} Q^{1}\left(1, a^{\prime}\right) = 0\)
For comparison, \(V_\pi^2(6)\) for the always-\(\uparrow\) policy, via the Bellman recursion:

\(V_\pi^2(6)= \mathrm{R}(6, \uparrow)+\gamma \sum_{s^{\prime}} \mathrm{T}\left(6, \uparrow, s^{\prime}\right) V_\pi^{1}\left(s^{\prime}\right) = -10 + \gamma\,[0.2\, V_\pi^{1}(2) + 0.8\, V_\pi^{1}(3)] = -10 + .9\,[0.2 \times 0 + 0.8 \times 1] = -9.28\)

