Tabular RL:
Policy Iteration,
Value Iteration
 

speaker: Pavel Temirchev

Reminder: MDP formalism

[Diagram: the agent-environment loop. The agent sends an action to the environment (e.g., move a figure); the environment returns an observation (the new positions) and a reward (win: +1, lose: -1).]

Learn optimal actions to maximize the reward.

Tabular RL setting

 

\( a_t \) - action

\( s_t \) - observation

\( r_t \) - reward

\( R = \sum_{t=0}^\infty \gamma^t r_t \) - return

\( \pi(a_t | s_t) = P(\text{take } a_t |\text{ in state } s_t) \) - policy

\( p(s_{t+1}| s_t, a_t) \) - transition probabilities

Today they are known!

Moreover:

\( |\mathcal{S}| < \infty\)

\( |\mathcal{A}| < \infty\)
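To make this concrete, here is a minimal sketch of a tabular MDP stored as NumPy arrays. The 2-state, 2-action numbers are invented for illustration, and the names `P`, `R`, `gamma` are conventions chosen here; the later sketches reuse them.

```python
import numpy as np

# P[s, a, s'] = p(s' | s, a): transition probabilities (each P[s, a] sums to 1)
P = np.array([
    [[0.9, 0.1],    # state 0, action 0
     [0.2, 0.8]],   # state 0, action 1
    [[0.0, 1.0],    # state 1, action 0
     [0.5, 0.5]],   # state 1, action 1
])

# R[s, a] = r(s, a): expected immediate reward
R = np.array([
    [0.0, 1.0],
    [0.0, 2.0],
])

gamma = 0.9  # discount factor
```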

Policy Iteration algorithm (PI)

The idea is fairly simple:

 

  • Evaluate the performance of the policy
     
  • Improve the policy
     

and continue in the loop

Policy Evaluation

How to evaluate the policy?

State Value function for the policy:

V^\pi(s) = \mathbb{E}\Big[ \sum_{t=0}^\infty \gamma^t r_t \Big | s_0 = s \Big]

it can be rewritten recursively:

V^\pi(s) = \mathbb{E}_{a\sim \pi} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^\pi(s') \Big]

we can estimate it by Monte-Carlo

is it a good idea?

it does not exploit the known structure of the problem
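A minimal sketch of that Monte-Carlo estimator, assuming the toy `P`, `R`, `gamma` arrays above and a stochastic policy table `pi[s, a]` (the function name and defaults are illustrative): sample trajectories under the policy and average the truncated discounted returns.

```python
import numpy as np

def mc_policy_evaluation(P, R, pi, gamma, s0, n_rollouts=1000, horizon=200, seed=0):
    """Monte-Carlo estimate of V^pi(s0): average discounted return over sampled rollouts."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    returns = []
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):                 # truncate the infinite sum; gamma^horizon is tiny
            a = rng.choice(n_actions, p=pi[s])   # a ~ pi(. | s)
            ret += discount * R[s, a]
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])  # s' ~ p(. | s, a)
        returns.append(ret)
    return float(np.mean(returns))
```

The known model is used only as a simulator here, which is exactly the objection above: the transition probabilities themselves are never exploited.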

Policy Evaluation

as a system of linear equations

For a deterministic policy \(\pi(s)\), we can rewrite the equation as follows:

V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s'|s, \pi(s)) V^\pi(s')

Now it can be rewritten in vector form:

V = R + \gamma P V
V = (I - \gamma P)^{-1}R

easily solvable?

is this matrix invertible?

complexity \( \sim O(|\mathcal{S}|^3) \)
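A minimal sketch of the direct solution for a deterministic policy, given as an integer array `pi` with `pi[s]` the chosen action (the function name is illustrative). The matrix \( I - \gamma P \) is invertible whenever \( \gamma < 1 \), since the spectral radius of a stochastic matrix is at most 1.

```python
import numpy as np

def policy_evaluation_exact(P, R, pi, gamma):
    """Solve V = R_pi + gamma * P_pi @ V exactly for a deterministic policy pi[s]."""
    n = R.shape[0]
    idx = np.arange(n)
    R_pi = R[idx, pi]   # r(s, pi(s))
    P_pi = P[idx, pi]   # p(s' | s, pi(s)), an (n x n) stochastic matrix
    # V = (I - gamma * P_pi)^{-1} R_pi; solve() avoids forming the inverse explicitly
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```

For the toy MDP above, `policy_evaluation_exact(P, R, np.array([1, 1]), gamma)` evaluates the policy that always takes action 1.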

Policy Evaluation

as a dynamic programming problem

Let's use fixed-point iteration (the method of simple iterations):

V^{k+1}(s) = \mathbb{E}_{a\sim \pi} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^k(s') \Big]

We can also rewrite it in the operator form:

\mathcal{B} V(s) = \mathbb{E}_{a\sim \pi} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V(s') \Big]
V^{k+1} = \mathcal{B} V^k

it is called the Bellman operator for the policy

Policy Evaluation

the algorithm

  • initialise \(V^0\)
  • while not converged:
    • for all \(s\) in \(\mathcal{S}\):
      • \( V^{k+1}(s) = \mathbb{E}_{a\sim \pi} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^k(s') \Big] \)

will converge given an infinite amount of time

is this just trading one problem for another?

use \( \|V^{k+1} - V^k\|_\infty < \epsilon \) to stop
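A minimal sketch of this iterative evaluation for a stochastic policy table `pi[s, a]`, stopping on the criterion above (the function name and tolerance are illustrative):

```python
import numpy as np

def policy_evaluation_iterative(P, R, pi, gamma, eps=1e-8):
    """Apply the Bellman operator for the policy until ||V^{k+1} - V^k||_inf < eps."""
    V = np.zeros(R.shape[0])                      # initialise V^0
    while True:
        Q_all = R + gamma * P @ V                 # r(s, a) + gamma * sum_{s'} p(s'|s, a) V(s')
        V_new = np.einsum("sa,sa->s", pi, Q_all)  # expectation over a ~ pi(. | s)
        if np.max(np.abs(V_new - V)) < eps:       # sup-norm stopping criterion
            return V_new
        V = V_new
```

Each sweep costs \( O(|\mathcal{S}|^2 |\mathcal{A}|) \), and the number of sweeps is governed by \(\gamma\) and \(\epsilon\) rather than by the \( O(|\mathcal{S}|^3) \) cost of the direct solve.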

Policy Evaluation

will it converge eventually?

The Bellman operator is a \(\gamma\) contraction under the infinity norm.

The process will converge from any initialisation.

Will prove later

Policy Improvement

Now, we know the performance of the policy.

Can we improve it?

First of all, let us define an order for the policies:

\pi_i\succcurlyeq \pi_j

if

V^{\pi_i}(s) \ge V^{\pi_j}(s), \;\;\; \forall s \in \mathcal{S}

Policy Improvement

If we have some policy \(\pi^{k}\) and we know \(V^{\pi^k}\):

  • Compute the state-action Value function \(Q^{\pi^k}(s, a) \; \forall (s, a) \):

Q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^\pi(s')

  • Improve the policy:

\pi^{k+1}(s) = \arg\max_a Q^{\pi^k}(s, a)

The improved policy is at least as good:

\pi^{k+1} \succcurlyeq \pi^k
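A minimal sketch of this greedy improvement step, given a value function `V` for the current policy (the function name is illustrative):

```python
import numpy as np

def policy_improvement(P, R, V, gamma):
    """Greedy improvement: pi_new(s) = argmax_a Q(s, a), with Q built from the given V."""
    Q = R + gamma * P @ V          # Q[s, a] = r(s, a) + gamma * sum_{s'} p(s'|s, a) V(s')
    return np.argmax(Q, axis=1)    # deterministic improved policy
```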

Policy Improvement

proof

V^{\pi^{k}}(s) \le \max_a Q^{\pi^k}(s, a)
= \max_a \Big[ r(s,a) + \gamma \sum_{s'} p(s'|s, a) V^{\pi^k}(s') \Big]
= r(s,\pi^{k+1}(s)) + \gamma \sum_{s'} p(s'|s, \pi^{k+1}(s)) V^{\pi^k}(s')
\le r(s,\pi^{k+1}(s)) + \gamma \sum_{s'} p(s'|s, \pi^{k+1}(s)) \max_{a'} Q^{\pi^k}(s', a')
\le \dots \le V^{\pi^{k+1}}(s)

Expanding \(V^{\pi^k}(s')\) with the same bound over and over replaces every action choice by \(\pi^{k+1}\)'s, and in the limit the right-hand side becomes the discounted return of \(\pi^{k+1}\), i.e. \(V^{\pi^{k+1}}(s)\).

Policy Iteration

the algorithm

 

initialise policy \(\pi^0\)

Loop forever:

  • Policy evaluation:
    • initialise \(V^0\)
    • while not converged:
      • for all \(s\) in \(\mathcal{S}\):
        • \( V^{k+1}(s) = \mathbb{E}_{a\sim \pi^k} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^k(s') \Big] \)
  • Policy improvement:
    • Compute \(Q^{\pi^k}(s, a) \; \forall (s, a) \)
    • Improve the policy for all states:
      \( \pi^{k+1}(s) = \arg\max_a Q^{\pi^k}(s, a) \)
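Putting the two steps together: a minimal sketch of the full loop, here with the exact linear-solve evaluation (the iterative evaluation sketched earlier could be swapped in), stopping once the greedy policy stops changing.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=1000):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    idx = np.arange(n_states)
    pi = np.zeros(n_states, dtype=int)            # initialise policy pi^0
    for _ in range(max_iters):
        # policy evaluation: solve V = R_pi + gamma * P_pi @ V
        R_pi, P_pi = R[idx, pi], P[idx, pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # policy improvement: act greedily with respect to Q^{pi^k}
        Q = R + gamma * P @ V
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):            # policy stable -> stop
            return pi, V
        pi = pi_new
    return pi, V
```

With the toy arrays above, `policy_iteration(P, R, gamma)` returns an optimal deterministic policy and its value function.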

Value Iteration (VI)
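A standard sketch of value iteration: apply the Bellman optimality backup \( V^{k+1}(s) = \max_a \big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^k(s') \big] \) until the values stop changing, then act greedily with respect to the resulting \(Q\) (the function name and tolerance below are illustrative).

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Iterate the Bellman optimality backup until ||V^{k+1} - V^k||_inf < eps."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V                    # Q[s, a] from the current value estimate
        V_new = np.max(Q, axis=1)                # V_new(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < eps:
            return np.argmax(Q, axis=1), V_new   # greedy policy and its value estimate
        V = V_new
```

Unlike policy iteration, there is no inner evaluation loop: each sweep is one application of the optimality operator, which is also a \(\gamma\)-contraction.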

The proof for the contraction

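A standard sketch of the argument for the Bellman operator \(\mathcal{B}\) of a fixed policy (the reward terms cancel because they do not depend on \(V\)):

\| \mathcal{B} V_1 - \mathcal{B} V_2 \|_\infty = \max_s \Big| \mathbb{E}_{a \sim \pi} \Big[ \gamma \sum_{s'} p(s'|s, a) \big( V_1(s') - V_2(s') \big) \Big] \Big|
\le \gamma \max_s \mathbb{E}_{a \sim \pi} \sum_{s'} p(s'|s, a) \, \big| V_1(s') - V_2(s') \big|
\le \gamma \| V_1 - V_2 \|_\infty

Since the space of value functions with the infinity norm is complete, the Banach fixed-point theorem gives a unique fixed point \(V^\pi\), and the iterates \(V^{k+1} = \mathcal{B} V^k\) converge to it from any initialisation. The same steps, combined with \( |\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)| \), show that the Bellman optimality operator used in value iteration is also a \(\gamma\)-contraction.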

PI and VI (OZON)

By cydoroga
