Tabular RL:
Policy Iteration,
Value Iteration
speaker: Pavel Temirchev
Reminder: MDP formalism
[Diagram: the agent-environment loop. The AGENT sends an action to the ENVIRONMENT (e.g., move a figure); the environment returns an observation (new positions) and a reward (win: +1, lose: -1)]
Learn optimal actions to maximize the reward.
Tabular RL setting
\( a_t \) - action
\( s_t \) - observation
\( r_t \) - reward
\( R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \) - return (\( \gamma \in [0, 1) \) - discount factor)
\( \pi(a_t | s_t) = P(\text{take } a_t |\text{ in state } s_t) \) - policy
\( p(s_{t+1}| s_t, a_t) \) - transition probabilities
Today they are known!
Moreover:
\( |\mathcal{S}| < \infty\)
\( |\mathcal{A}| < \infty\)
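Since both sets are finite and the transition probabilities are known, the whole MDP fits into a few arrays. A minimal sketch of this tabular layout (the 2-state, 2-action MDP below is hypothetical, chosen only to illustrate the data shapes assumed in the later sketches):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only to illustrate the tabular layout.
n_states, n_actions = 2, 2

# P[s, a, s'] = p(s' | s, a): known transition probabilities (each P[s, a] sums to 1).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0 and 1
])

# R[s, a] = r(s, a): expected immediate reward.
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

gamma = 0.9  # discount factor

# A stochastic policy is just a table pi[s, a] = pi(a | s).
pi_uniform = np.full((n_states, n_actions), 1.0 / n_actions)
```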
Policy Iteration algorithm (PI)
The idea is fairly simple:
- Evaluate the performance of the policy
- Improve the policy
and continue in the loop
Policy Evaluation
How to evaluate the policy?
State Value function for the policy:
\( V^{\pi}(s) = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s \Big] \)
it can be rewritten recursively:
\( V^{\pi}(s) = \mathbb{E}_{a \sim \pi} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^{\pi}(s') \Big] \)
we can estimate it by Monte-Carlo
is it a good idea?
we do not consider the known structure of the problem
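For reference, a hedged sketch of what such a Monte-Carlo estimate could look like, reusing the toy arrays `P`, `R`, `pi_uniform`, `gamma` from above (the rollout count and horizon are arbitrary illustration values). Note that the known \( p(s'|s, a) \) is used only as a simulator, never solved against:

```python
import numpy as np

def mc_policy_evaluation(P, R, pi, gamma, start_state,
                         n_rollouts=1000, horizon=200, seed=0):
    """Estimate V^pi(start_state) by averaging discounted returns of truncated rollouts."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    returns = np.zeros(n_rollouts)
    for i in range(n_rollouts):
        s, discount = start_state, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=pi[s])        # sample a ~ pi(. | s)
            returns[i] += discount * R[s, a]          # accumulate discounted reward
            s = rng.choice(n_states, p=P[s, a])       # sample s' ~ p(. | s, a)
            discount *= gamma
    return returns.mean()

# e.g. mc_policy_evaluation(P, R, pi_uniform, gamma, start_state=0) with the toy MDP above
```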
Policy Evaluation
as a system of linear equations
We can rewrite the equation as follows:
\( V^{\pi}(s) = r^{\pi}(s) + \gamma \sum_{s'} P^{\pi}(s'|s) V^{\pi}(s') \), where \( r^{\pi}(s) = \mathbb{E}_{a \sim \pi}\, r(s, a) \) and \( P^{\pi}(s'|s) = \sum_a \pi(a|s)\, p(s'|s, a) \)
And now it is rewritable in a vector form:
\( V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi} \quad \Rightarrow \quad V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi} \)
is it easily solvable?
is this matrix invertible? (yes: \( P^{\pi} \) is a stochastic matrix, so the spectral radius of \( \gamma P^{\pi} \) is at most \( \gamma < 1 \))
complexity \( \sim O(|\mathcal{S}|^3) \)
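A minimal sketch of this direct solution, assuming the tabular arrays introduced earlier (`P[s, a, s']`, `R[s, a]`, a policy table `pi[s, a]`); the cubic cost is hidden inside the linear solve:

```python
import numpy as np

def policy_evaluation_exact(P, R, pi, gamma):
    """Solve V = r_pi + gamma * P_pi V directly; the O(|S|^3) cost sits in the solve."""
    n_states = P.shape[0]
    r_pi = np.einsum('sa,sa->s', pi, R)        # r_pi[s]    = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum('sa,sat->st', pi, P)      # P_pi[s,s'] = sum_a pi(a|s) p(s'|s, a)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# e.g. V = policy_evaluation_exact(P, R, pi_uniform, gamma) with the toy MDP above
```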
Policy Evaluation
as a dynamic programming problem
Let's use the method of simple iterations:
\( V^{k+1} = r^{\pi} + \gamma P^{\pi} V^{k} \)
We can also rewrite it in the operator form:
\( V^{k+1} = T^{\pi} V^{k} \)
it is called the Bellman operator for the policy
Policy Evaluation
the algorithm
- initialise \(V^0\)
- while not converged:
  - for all \(s\) in \(\mathcal{S}\):
    - \( V^{k+1}(s) = \mathbb{E}_{a\sim \pi} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^k(s') \Big] \)

will converge given an infinite amount of time
is this just trading one problem for another?
use \( \|V^{k+1} - V^k\|_{\infty} < \epsilon \) to stop
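A sketch of this loop with the \( \epsilon \)-stopping rule, again assuming the tabular arrays from above:

```python
import numpy as np

def policy_evaluation_iterative(P, R, pi, gamma, eps=1e-8):
    """Apply the Bellman operator T^pi until ||V^{k+1} - V^k||_inf < eps."""
    n_states = P.shape[0]
    r_pi = np.einsum('sa,sa->s', pi, R)        # policy-averaged rewards
    P_pi = np.einsum('sa,sat->st', pi, P)      # policy-averaged transitions
    V = np.zeros(n_states)                     # any initialisation works
    while True:
        V_next = r_pi + gamma * P_pi @ V       # one application of T^pi
        if np.max(np.abs(V_next - V)) < eps:   # infinity-norm stopping rule
            return V_next
        V = V_next
```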
Policy Evaluation
will it converge eventually?
The Bellman operator is a \(\gamma\) contraction under the infinity norm.
The process will converge from any initialisation.
We will prove this later.
Policy Improvement
Now, we know the performance of the policy.
Can we improve it?
First of all, let us define an order for the policies:
\( \pi' \geq \pi \) if \( V^{\pi'}(s) \geq V^{\pi}(s) \;\; \forall s \in \mathcal{S} \)
Policy Improvement
If we have some policy \(\pi^{k}\) and we know \(V^{\pi^k}\):
- Compute \(Q^{\pi^k}(s, a) \; \forall (s, a) \)
- Improve the policy:
  \( \pi^{k+1}(s) = \arg\max_a Q^{\pi^k}(s, a) \)

State-action Value function:
\( Q^{\pi}(s, a) = r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^{\pi}(s') \)
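A sketch of the greedy improvement step under the same assumed array layout; it returns a deterministic policy encoded as a one-hot table:

```python
import numpy as np

def policy_improvement(P, R, V, gamma):
    """Greedy improvement: pi'(s) = argmax_a Q(s, a), with Q computed from the given V."""
    Q = R + gamma * np.einsum('sat,t->sa', P, V)   # Q[s,a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
    n_states, n_actions = R.shape
    pi_new = np.zeros((n_states, n_actions))
    pi_new[np.arange(n_states), Q.argmax(axis=1)] = 1.0   # one-hot deterministic policy
    return pi_new, Q
```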
Policy Improvement
proof
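A standard sketch of the argument (the usual textbook proof; details may differ from the original derivation): acting greedily for one step and then following \(\pi^k\) cannot be worse than following \(\pi^k\), and unrolling this step by step gives the improvement:

\begin{align*}
V^{\pi^k}(s) &= \sum_a \pi^k(a|s)\, Q^{\pi^k}(s, a) \;\leq\; \max_a Q^{\pi^k}(s, a) \;=\; Q^{\pi^k}\big(s, \pi^{k+1}(s)\big) \\
&= r\big(s, \pi^{k+1}(s)\big) + \gamma \sum_{s'} p\big(s' \,|\, s, \pi^{k+1}(s)\big)\, V^{\pi^k}(s') \\
&\leq r\big(s, \pi^{k+1}(s)\big) + \gamma \sum_{s'} p\big(s' \,|\, s, \pi^{k+1}(s)\big)\, Q^{\pi^k}\big(s', \pi^{k+1}(s')\big) \\
&\leq \;\dots\; \leq\; V^{\pi^{k+1}}(s) \qquad \forall s \in \mathcal{S}.
\end{align*}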
Policy Iteration
the algorithm
- initialise policy \(\pi^0\)
- Loop forever:
  - Policy Evaluation for \(\pi^k\):
    - initialise \(V^0\)
    - while not converged:
      - for all \(s\) in \(\mathcal{S}\):
        - \( V^{k+1}(s) = \mathbb{E}_{a\sim \pi^k} \Big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^k(s') \Big] \)
  - Compute \(Q^{\pi^k}(s, a) \; \forall (s, a) \)
  - Improve the policy for all states:
    \( \pi^{k+1}(s) = \arg\max_a Q^{\pi^k}(s, a) \)
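Putting the two steps together, a compact sketch of the full loop under the same assumed array layout (exact evaluation via the linear solve; the iterative evaluator above would also work):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # pi^0: uniform policy
    while True:
        # Policy evaluation (exact linear solve)
        r_pi = np.einsum('sa,sa->s', pi, R)
        P_pi = np.einsum('sa,sat->st', pi, P)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Greedy policy improvement
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), Q.argmax(axis=1)] = 1.0
        if np.allclose(pi_new, pi):            # policy is stable => optimal
            return pi, V
        pi = pi_new
```

Since there are only finitely many deterministic policies and each improvement step is monotone, the loop terminates after finitely many sweeps.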
Value Iteration (VI)
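Value Iteration folds the improvement step into the evaluation update by applying the Bellman optimality operator directly: \( V^{k+1}(s) = \max_a \big[ r(s, a) + \gamma \sum_{s'} p(s'|s, a) V^k(s') \big] \). A minimal sketch under the same assumed tabular arrays:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Iterate V <- max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ] until convergence."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('sat,t->sa', P, V)   # one-step lookahead values
        V_next = Q.max(axis=1)                         # Bellman optimality backup
        if np.max(np.abs(V_next - V)) < eps:
            break
        V = V_next
    # Extract a greedy deterministic policy from the final Q
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi, V
```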
The proof for the contraction
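A standard version of the argument, reconstructed from the usual proof: for any \( V_1, V_2 \),

\begin{align*}
\| T^{\pi} V_1 - T^{\pi} V_2 \|_{\infty}
&= \gamma \, \| P^{\pi} (V_1 - V_2) \|_{\infty}
 = \gamma \max_s \Big| \sum_{s'} P^{\pi}(s'|s) \big( V_1(s') - V_2(s') \big) \Big| \\
&\leq \gamma \max_s \sum_{s'} P^{\pi}(s'|s) \, \big| V_1(s') - V_2(s') \big|
 \leq \gamma \, \| V_1 - V_2 \|_{\infty},
\end{align*}

since the rows of \( P^{\pi} \) are probability distributions. By the Banach fixed-point theorem, \( T^{\pi} \) has a unique fixed point \( V^{\pi} \), and the iteration converges to it from any initialisation. An analogous argument (using \( |\max_a x_a - \max_a y_a| \leq \max_a |x_a - y_a| \)) covers the Bellman optimality operator used in Value Iteration.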
PI and VI (OZON)
By cydoroga