speaker: Pavel Temirchev
Agent-environment interaction loop (figure): the agent sends an action to the environment (e.g. move a chess piece); the environment returns an observation (the new positions) and a reward (+1 for a win, -1 for a loss).
Learn optimal actions
to maximize the reward
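A minimal sketch of this interaction loop, assuming a hypothetical gym-style interface (`env`, `agent`, and their methods are illustrative assumptions, not part of the lecture material):

```python
# Hypothetical gym-style interaction loop; `env` and `agent` and their
# methods are illustrative assumptions, not part of the lecture material.
def run_episode(env, agent):
    total_reward = 0.0
    s = env.reset()                  # initial observation s_0
    done = False
    while not done:
        a = agent.act(s)             # a_t ~ pi(a_t | s_t)
        s, r, done = env.step(a)     # new observation s_{t+1}, reward r_t, termination flag
        total_reward += r
    return total_reward
```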
\( a_t \) - action
\( s_t \) - observation
\( r_t \) - reward
\( R_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'} \) - discounted return (\( \gamma \in [0, 1) \) - discount factor)
\( \pi(a_t | s_t) = P(\text{take } a_t |\text{ in state } s_t) \) - policy
\( p(s_{t+1}| s_t, a_t) \) - transition probabilities
Today they are known!
Moreover:
\( |\mathcal{S}| < \infty\)
\( |\mathcal{A}| < \infty\)
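Since the state and action sets are finite and the dynamics are known, the whole problem fits into a few arrays. A minimal tabular sketch; all names (`n_states`, `n_actions`, `P`, `R`, `gamma`, `policy`) are illustrative assumptions consistent with the notation above:

```python
import numpy as np

# Hypothetical tabular MDP; sizes and values are illustrative assumptions.
n_states, n_actions = 4, 2
gamma = 0.9                                   # discount factor

# Known dynamics: P[s, a, s_next] = p(s_next | s, a); each P[s, a] sums to 1.
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# Known expected rewards: R[s, a] = r(s, a).
R = np.random.randn(n_states, n_actions)

# A stochastic policy: policy[s, a] = pi(a | s); each row sums to 1.
policy = np.full((n_states, n_actions), 1.0 / n_actions)
```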
The idea is fairly simple: evaluate the current policy, improve it, and continue in the loop.
How to evaluate the policy?
State Value function for the policy:
\[ V^{\pi}(s) = \mathbb{E} \left[ \sum_{t \geq 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim p(\cdot \mid s_t, a_t) \right] \]
It can be rewritten recursively:
\[ V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V^{\pi}(s') \right] \]
We can estimate it by Monte-Carlo: sample trajectories under \(\pi\) and average the observed returns.
Is it a good idea?
No: it does not exploit the structure of the problem - the transition probabilities are known, yet we ignore them and rely on samples alone.
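For illustration, a rough Monte-Carlo estimate of \( V^{\pi}(s_0) \) under the hypothetical tabular arrays from the sketch above (truncating the episode length is an additional simplifying assumption):

```python
import numpy as np

# Rough Monte-Carlo estimate of V^pi(s0): roll out episodes from s0 under pi
# and average the discounted returns. P, R, policy, gamma are the hypothetical
# tabular arrays from the sketch above; the horizon is truncated for simplicity.
def mc_value(s0, P, R, policy, gamma, n_episodes=1000, horizon=100):
    rng = np.random.default_rng(0)
    n_states, n_actions = R.shape
    returns = []
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=policy[s])    # a_t ~ pi(. | s_t)
            g += discount * R[s, a]                   # accumulate gamma^t * r_t
            discount *= gamma
            s = rng.choice(n_states, p=P[s, a])       # s_{t+1} ~ p(. | s_t, a_t)
        returns.append(g)
    return np.mean(returns)
```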
as a system of linear equations
We can rewrite the equation as follows:
\[ V^{\pi}(s) = \sum_{a} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) V^{\pi}(s') \right] \]
And now it can be rewritten in vector form:
\[ V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}, \qquad r^{\pi}(s) = \sum_{a} \pi(a \mid s) r(s, a), \quad P^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s) p(s' \mid s, a) \]
Is it easily solvable? \( V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi} \)
Is this matrix invertible? Yes: \( P^{\pi} \) is a stochastic matrix, so the eigenvalues of \( \gamma P^{\pi} \) have magnitude at most \( \gamma < 1 \).
Complexity \( \sim O(|\mathcal{S}|^3) \)
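A minimal sketch of this exact solve, assuming the hypothetical tabular arrays (`P`, `R`, `policy`, `gamma`) introduced earlier:

```python
import numpy as np

# Exact policy evaluation: solve the linear system (I - gamma * P_pi) V = r_pi.
# P, R, policy, gamma are the hypothetical tabular arrays from the sketch above.
def evaluate_policy_exact(P, R, policy, gamma):
    n_states = R.shape[0]
    # Policy-averaged dynamics and rewards:
    #   P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a),   r_pi[s] = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum('sa,sat->st', policy, P)
    r_pi = np.einsum('sa,sa->s', policy, R)
    # V^pi = (I - gamma * P_pi)^{-1} r_pi, ~O(|S|^3) for a dense solve
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```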
as a dynamic programming problem
Let's use the method of simple iterations:
\[ V^{k+1}(s) = \sum_{a} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) V^{k}(s') \right] \]
We can also rewrite it in the operator form:
\[ V^{k+1} = \mathcal{T}^{\pi} V^{k} \]
where \( \mathcal{T}^{\pi} \) is called the Bellman operator for the policy.
The algorithm will converge, but only given an infinite amount of time.
Have we just traded one problem for another?
In practice, use \( \| V^{k+1} - V^{k} \|_\infty < \epsilon \) as a stopping criterion.
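A sketch of this iterative evaluation with the \(\epsilon\) stopping rule, again under the hypothetical tabular arrays from the earlier sketch:

```python
import numpy as np

# Iterative policy evaluation: repeatedly apply the Bellman operator T^pi until
# the sup-norm change drops below epsilon. Names follow the earlier hypothetical
# tabular sketch (P, R, policy, gamma).
def evaluate_policy_iterative(P, R, policy, gamma, eps=1e-6):
    P_pi = np.einsum('sa,sat->st', policy, P)   # P^pi
    r_pi = np.einsum('sa,sa->s', policy, R)     # r^pi
    V = np.zeros(R.shape[0])                    # any initialisation works
    while True:
        V_new = r_pi + gamma * P_pi @ V         # V^{k+1} = T^pi V^k
        if np.max(np.abs(V_new - V)) < eps:     # ||V^{k+1} - V^k||_inf < eps
            return V_new
        V = V_new
```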
Will it converge eventually?
The Bellman operator is a \(\gamma\)-contraction under the infinity norm:
\[ \| \mathcal{T}^{\pi} V_1 - \mathcal{T}^{\pi} V_2 \|_\infty \leq \gamma \| V_1 - V_2 \|_\infty \]
Hence the process will converge from any initialisation.
We will prove this later.
Now, we know the performance of the policy.
Can we improve it?
First of all, let us define an order on the policies:
\[ \pi_1 \geq \pi_2 \quad \text{if} \quad V^{\pi_1}(s) \geq V^{\pi_2}(s) \;\; \forall s \in \mathcal{S} \]
If we have some policy \(\pi^{k}\) and we know \(V^{\pi^k}\), we can improve it greedily:
\[ \pi^{k+1}(s) = \arg\max_{a} Q^{\pi^k}(s, a) \]
State-action Value function:
\[ Q^{\pi}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V^{\pi}(s') \]
Claim: \( \pi^{k+1} \geq \pi^{k} \)
proof
The algorithm (policy iteration):
initialise policy \(\pi^0\)
Loop forever:
- policy evaluation: compute \(V^{\pi^k}\)
- policy improvement: \(\pi^{k+1}(s) = \arg\max_{a} Q^{\pi^k}(s, a)\)
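Putting the pieces together, a sketch of this loop under the same hypothetical tabular setup; the slide loops forever, while the sketch below stops once the policy is stable (for a finite MDP this happens after finitely many iterations) and repeats the exact evaluation step inline:

```python
import numpy as np

# Policy iteration sketch: alternate exact evaluation and greedy improvement until
# the policy stops changing. P, R, gamma follow the hypothetical tabular sketch
# introduced earlier.
def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)   # pi^0
    while True:
        # Policy evaluation: V^pi = (I - gamma * P_pi)^{-1} r_pi
        P_pi = np.einsum('sa,sat->st', policy, P)
        r_pi = np.einsum('sa,sa->s', policy, R)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: Q^pi(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) V(s')
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        greedy = np.eye(n_actions)[np.argmax(Q, axis=1)]       # pi^{k+1} greedy w.r.t. Q
        if np.array_equal(greedy, policy):                     # policy is stable: done
            return greedy, V
        policy = greedy
```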