Pavel Temirchev
Reinforcement Learning is a Machine Learning technique aimed
at learning optimal behavior from interaction with an environment
[Diagram: the agent-environment interaction loop. The AGENT sends an action to the ENVIRONMENT (e.g. move a figure) and receives back an observation (the new positions) and a reward (win: +1, lose: -1).]
Learn optimal actions
to maximize the reward
\( a_t \) - action
\( s_t \) - observation
\( r_t \) - reward
\( R_T = \sum_{t=0}^{T} r_t \) - return
Reinforcement Learning: maximise the expected return \( \mathbb{E}\, R_T \) over policies
What is inside this expectation?
\( \pi(a_t | s_t) = P(\text{take } a_t |\text{ in state } s_t) \) - policy
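To make this explicit, the expectation can be expanded as a sum over trajectories generated jointly by the policy and the environment dynamics (a standard expansion; the transition probabilities \( p(s_{t+1} \mid s_t, a_t) \) are introduced here only for illustration and are not defined on the slide):
\[
\mathbb{E}\, R_T = \sum_{\tau} P(\tau)\, R_T(\tau),
\qquad
P(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
\]
where \( \tau = (s_0, a_0, s_1, a_1, \dots, s_T) \) is a trajectory and \( R_T(\tau) \) is the return collected along it. The policy \( \pi \) is the only part we control.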
For a general optimization problem
Forget about the structure inside \(\mathbb{E} R_T\): we just need to maximize some \(f(x)\). But \(f\) is a black box: we can only evaluate it at sampled points, not differentiate it.
For a general optimization problem
Idea: instead of an optimum, let's find a good proposal distribution
[Plot: \(f(x)\) together with the 'optimal' proposal distribution, which puts all of its mass at the maximum of \(f\)]
For a general optimization problem
Problem: we don't know the "optimal" distribution
Take some parametric distribution \(q^0\)
Sample \( x_i \sim q^0 \)
Pick the M samples with the largest values \( f(x_i) \) - the elites
Fit a new distribution \(q^1\) to the elites
Repeat! (a minimal sketch of this loop is given below)
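A minimal sketch of this loop in Python, assuming a 1-D Gaussian proposal and a black-box objective \(f\) (the function names and hyperparameters are illustrative, not taken from the slides):

```python
import numpy as np

def cem_maximize(f, n_samples=100, n_elites=10, n_iters=50):
    """Cross-entropy method with a 1-D Gaussian proposal q^k = N(mu, sigma^2)."""
    mu, sigma = 0.0, 5.0                                     # initial proposal q^0
    for _ in range(n_iters):
        xs = np.random.normal(mu, sigma, size=n_samples)     # sample x_i ~ q^k
        values = f(xs)                                       # evaluate the black box
        elites = xs[np.argsort(values)[-n_elites:]]          # M samples with largest f(x_i)
        mu, sigma = elites.mean(), elites.std() + 1e-8       # fit q^{k+1} on the elites (MLE)
    return mu

# Example: maximise a simple concave function; the result should approach 3.0
print(cem_maximize(lambda x: -(x - 3.0) ** 2))
```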
For a general optimization problem
For a Gaussian proposal, fit \(q^{k+1}\) to the elites using the Maximum Likelihood estimator:
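For reference, the ML estimates on the \(M\) elite samples are the usual empirical mean and covariance (standard Gaussian MLE, written out here since the formulas are not shown above):
\[
\mu^{k+1} = \frac{1}{M}\sum_{i=1}^{M} x_i,
\qquad
\Sigma^{k+1} = \frac{1}{M}\sum_{i=1}^{M} \left(x_i - \mu^{k+1}\right)\left(x_i - \mu^{k+1}\right)^{\top}.
\]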
For a general optimization problem
In general, in order to fit the new distribution to the elites, we minimise the KL divergence between the elite distribution and \(q\), which is equivalent to minimising the cross-entropy (hence the name of the method).
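Written out (a standard identity; \(p^{*}\) below denotes the distribution of the elites):
\[
\mathrm{KL}\left(p^{*} \,\|\, q\right) = \mathbb{E}_{p^{*}} \log p^{*} - \mathbb{E}_{p^{*}} \log q,
\]
and since the first term does not depend on \(q\),
\[
q^{k+1} = \arg\min_{q} \mathrm{KL}\left(p^{*} \,\|\, q\right)
        = \arg\max_{q} \mathbb{E}_{p^{*}} \log q
        \approx \arg\max_{q} \frac{1}{M} \sum_{i \in \text{elites}} \log q(x_i).
\]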
For discrete proposal distributions
What if \(q\) is from a discrete family of distributions?
\(q_k = P(x = z_k)\) - parameters of the distribution
\(z_k\) - possible values of the random variable
For discrete proposal distributions
Let's find \(q^{k+1}\) at each iteration by maximising \( \sum_i \log q(x_i) \) over the elite samples.
Naive unconstrained maximisation asks us to send every \(q_k\) to infinity.
For discrete proposal distributions
We should use the method of Lagrange multipliers to enforce the constraint \(\sum_k q_k = 1 \).
From \(\sum_k q_k = 1 \) it follows that \(q_k^{k+1}\) is simply the fraction of elite samples equal to \(z_k\).
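The derivation (a standard constrained maximum-likelihood calculation; \(n_k\) is the number of elite samples equal to \(z_k\) and \(M\) the total number of elites):
\[
\mathcal{L}(q, \lambda) = \sum_{k} n_k \log q_k + \lambda \Big( \sum_{k} q_k - 1 \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial q_k} = \frac{n_k}{q_k} + \lambda = 0
\;\Rightarrow\;
q_k = -\frac{n_k}{\lambda}.
\]
Substituting into \(\sum_k q_k = 1\) gives \(\lambda = -\sum_k n_k = -M\), hence
\[
q_k^{k+1} = \frac{n_k}{M}.
\]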
For a tabular RL problem
In RL, we want to maximise the expected return \(\mathbb{E} R_T\) by choosing good actions \(a_t\)
The proposal distribution over actions is our policy \( \pi(a_t | s_t) \), stored as a table over states and actions.
For a tabular RL problem
ALGORITHM:
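A sketch of the usual tabular cross-entropy method in Python (the Gym-style environment interface, the percentile threshold and the play_session helper are illustrative assumptions, not taken from the slide):

```python
import numpy as np

def play_session(env, policy, t_max=1000):
    """Roll out one episode; return visited states, taken actions and the total reward."""
    states, actions, total_reward = [], [], 0.0
    s, _ = env.reset()
    for _ in range(t_max):
        a = np.random.choice(policy.shape[1], p=policy[s])   # a_t ~ pi(a | s_t)
        next_s, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = next_s
        if terminated or truncated:
            break
    return states, actions, total_reward

def tabular_cem(env, n_states, n_actions, n_iters=100, n_sessions=200, percentile=70):
    """Tabular cross-entropy method: the policy is an (n_states x n_actions) table."""
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform pi^0(a|s)
    for _ in range(n_iters):
        # 1. Play n_sessions episodes with the current policy
        sessions = [play_session(env, policy) for _ in range(n_sessions)]
        returns = np.array([r for (_, _, r) in sessions])
        # 2. Elite sessions: those with returns above the chosen percentile
        threshold = np.percentile(returns, percentile)
        elite_pairs = [(s, a) for (ss, aa, r) in sessions if r >= threshold
                       for s, a in zip(ss, aa)]
        # 3. Fit the new policy on elite (state, action) pairs: pi(a|s) = n(s,a) / n(s)
        counts = np.zeros_like(policy)
        for s, a in elite_pairs:
            counts[s, a] += 1
        visited = counts.sum(axis=1) > 0
        policy[visited] = counts[visited] / counts[visited].sum(axis=1, keepdims=True)
    return policy
```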
For large state spaces
What if we cannot store the policy for every state, or if the actions are continuous?
Define a proposal distribution parametrised by a NN, e.g. \( \pi_\theta(a_t | s_t) \) with action probabilities given by a softmax over the network outputs.
Minimise the cross-entropy on the elite (state, action) pairs with a gradient-based algorithm (e.g. Adam):
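A sketch of this update step in PyTorch (the network architecture, the optimiser settings and the dummy data sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2   # illustrative sizes, e.g. a CartPole-like task

# Policy network: maps an observation to unnormalised action scores (logits);
# a softmax over the logits gives pi_theta(a | s).
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def cem_update(elite_states, elite_actions):
    """One gradient step: minimise the cross-entropy between pi_theta and the elite actions."""
    logits = policy_net(elite_states)                            # (batch, n_actions)
    loss = nn.functional.cross_entropy(logits, elite_actions)    # -1/M * sum_i log pi_theta(a_i | s_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy data (in practice these come from the elite sessions):
states = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
print(cem_update(states, actions))
```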