Cross Entropy Method
Pavel Temirchev
Reminder: MDP formalism
Reinforcement Learning is a Machine Learning technique aimed at learning optimal behavior from interaction with an environment
- Works with unknown and stochastic environments, which can be explored only via interaction
- Applicable to a broad range of problems
- Learning is performed to maximize some kind of reward
[Figure: the agent-environment interaction loop. The agent sends an action to the environment (move a figure); the environment returns an observation (new positions) and a reward (win: +1, lose: -1). The goal is to learn optimal actions that maximize the reward.]
Reminder: MDP formalism
\( a_t \) - action
\( s_t \) - observation
\( r_t \) - reward
\( R_T = \sum_{t=0}^{T} r_t \) - return
Reinforcement Learning: \( \max_\pi \; \mathbb{E}\, R_T \)
What is inside this expectation?
\( \pi(a_t | s_t) = P(\text{take } a_t |\text{ in state } s_t) \) - policy
Cross Entropy Method (CEM)
For a general optimization problem
Forget about the structure inside \(\mathbb{E} R_T\): we just need to maximize some \(f(x)\), but:
- it is not differentiable
Cross Entropy Method (CEM)
For a general optimization problem
Idea: instead of searching for the optimum directly, let's find a good proposal distribution
The 'optimal' proposal would put all of its probability mass at the optimum \( x^* = \arg\max_x f(x) \)
Cross Entropy Method (CEM)
For a general optimization problem
Problem: we don't know the "optimal" distribution
Take some parametric distribution \(q^0\)
Sample \( x_i \sim q^0\)
Pick the M samples with the largest \( f(x_i) \) - the elites
Fit a new distribution \( q^1 \) to the elites
Repeat!
Cross Entropy Method (CEM)
For a general optimization problem
For a Gaussian proposal, fit \( q^{k+1} = \mathcal{N}(\mu, \sigma^2) \) to the elites using the maximum-likelihood estimator: \( \mu = \frac{1}{M}\sum_{i=1}^{M} x_i \), \( \sigma^2 = \frac{1}{M}\sum_{i=1}^{M} (x_i - \mu)^2 \) (elite sample mean and variance)
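As a concrete illustration, here is a minimal NumPy sketch of this loop with a 1-D Gaussian proposal; all names, hyperparameters, and the toy objective below are our assumptions, not from the slides:

```python
import numpy as np

def cem_maximize(f, mu=0.0, sigma=5.0, n_samples=100, n_elites=10, n_iters=50):
    """Maximize a black-box f(x) with CEM using a Gaussian proposal q^k."""
    for _ in range(n_iters):
        x = np.random.normal(mu, sigma, size=n_samples)  # sample x_i ~ q^k
        elites = x[np.argsort(f(x))[-n_elites:]]         # M samples with largest f(x_i)
        # MLE fit of q^{k+1}: sample mean and std of the elites
        mu, sigma = elites.mean(), elites.std() + 1e-8
    return mu

# Usage on a non-differentiable objective with maximum at x = 2:
f = lambda x: -np.abs(x - 2.0)
print(cem_maximize(f))  # ≈ 2.0
```

Note that the small constant added to sigma keeps the proposal from collapsing to a point before the loop finishes.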
Cross Entropy Method (CEM)
For a general optimization problem
In general, in order to fit the new distribution to the elites, we minimize the KL divergence between the empirical elite distribution \( p^* \) and \( q \):
\( q^{k+1} = \arg\min_q \mathrm{KL}(p^* \,\|\, q) = \arg\min_q \Big( \underbrace{-\mathbb{E}_{p^*} \log q}_{\text{cross-entropy}} + \underbrace{\mathbb{E}_{p^*} \log p^*}_{\text{const. w.r.t. } q} \Big) \)
The second term does not depend on \( q \), so we effectively minimise the cross-entropy - hence the method's name
Cross Entropy Method (CEM)
For discrete proposal distributions
What if \( q \) belongs to a discrete family of distributions, \( q(x = z_k) = q_k \)?
\( q_k \) - parameters of the distribution
\( z_k \) - possible values of the random variable
Cross Entropy Method (CEM)
For discrete proposal distributions
Let's find \( q^{k+1} \) at each iteration by minimising the cross-entropy over the elites:
\( \min_q \; -\sum_k n_k \log q_k \), where \( n_k \) is the number of elites with value \( z_k \)
The objective decreases monotonically in every \( q_k \), so unconstrained minimisation asks us to send the \( q_k \) to infinity
Cross Entropy Method (CEM)
For discrete proposal distributions
We should use the method of Lagrange multipliers to ensure \( \sum_k q_k = 1 \):
\( \mathcal{L} = -\sum_k n_k \log q_k + \lambda \Big( \sum_k q_k - 1 \Big), \qquad \frac{\partial \mathcal{L}}{\partial q_k} = -\frac{n_k}{q_k} + \lambda = 0 \;\Rightarrow\; q_k = \frac{n_k}{\lambda} \)
From \( \sum_k q_k = 1 \Rightarrow \lambda = \sum_j n_j \), so \( q_k = \frac{n_k}{\sum_j n_j} \) - the empirical frequencies of the elites
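A quick numerical check of this result (the toy elite sample below is ours):

```python
import numpy as np

# Toy example: 7 elite samples of a discrete variable with 3 possible
# values z_k, stored as indices k. The Lagrange-multiplier solution says
# the new parameters are the empirical elite frequencies q_k = n_k / sum_j n_j.
elites = np.array([2, 0, 2, 1, 2, 2, 0])
n_k = np.bincount(elites, minlength=3)   # n_k = [2, 1, 4]
q_new = n_k / n_k.sum()
print(q_new)                             # [2/7, 1/7, 4/7] ≈ [0.286, 0.143, 0.571]
```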
Cross Entropy Method (CEM)
For a tabular RL problem
In RL, we want to maximise the return \( \mathbb{E}\, R_T \) by choosing good actions \( a_t \)
The proposal distribution over actions is our policy \( \pi(a_t \mid s_t) \)
Cross Entropy Method (CEM)
For a tabular RL problem
ALGORITHM (a code sketch follows below):
1. Initialize the policy \( \pi^0 \) with the uniform distribution
2. Play \( n \) games with \( \pi^k \)
3. Collect all triples \( (s_t, a_t, R_{\text{end}}) \)
4. Select the top M% of samples by \( R_{\text{end}} \) as elites; discard the non-elite points
5. Recompute \( \pi^{k+1} \) on the elites and go to 2.
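A minimal sketch of this algorithm, assuming a gym-style environment with discrete states and actions (reset() returns a state index, step(a) returns (state, reward, done, info)); the function and variable names are ours:

```python
import numpy as np

n_states, n_actions, percentile = 16, 4, 70

def play_session(env, policy, t_max=1000):
    """Play one game with stochastic policy pi(a|s); return states, actions, R_end."""
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        a = np.random.choice(n_actions, p=policy[s])
        s_next, r, done, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = s_next
        if done:
            break
    return states, actions, total_reward

def cem_update(policy, sessions):
    """Select elite sessions by return and refit pi(a|s) to elite (s, a) frequencies."""
    returns = [R for _, _, R in sessions]
    threshold = np.percentile(returns, percentile)
    counts = np.zeros_like(policy)
    for states, actions, R in sessions:
        if R >= threshold:  # elite session: keep all its (s, a) pairs
            for s, a in zip(states, actions):
                counts[s, a] += 1
    for s in range(n_states):
        if counts[s].sum() > 0:
            policy[s] = counts[s] / counts[s].sum()
        # states never visited in elite sessions keep their previous distribution
    return policy

policy = np.full((n_states, n_actions), 1.0 / n_actions)  # pi^0: uniform
# for k in range(100):                                    # repeat steps 2-5
#     sessions = [play_session(env, policy) for _ in range(250)]
#     policy = cem_update(policy, sessions)
```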
Approximate Cross Entropy Method
For large state spaces
What if we cannot store a policy for every state, or the actions are continuous?
Define a proposal distribution parametrised by a neural network with weights \( \theta \):
e.g. a network that takes \( s_t \) as input and outputs the action probabilities \( \pi_\theta(a_t \mid s_t) \)
Minimise the cross-entropy on the elites with a gradient-based algorithm (e.g. ADAM):
\( \theta^{k+1} = \arg\min_\theta \; -\sum_{\text{elite } (s_t, a_t)} \log \pi_\theta(a_t \mid s_t) \)
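A minimal PyTorch sketch of this step for discrete actions (the architecture, sizes, and names are our assumptions): the network outputs the logits of \( \pi_\theta(a \mid s) \), and each call performs one Adam step of cross-entropy minimization on elite \( (s_t, a_t) \) pairs.

```python
import numpy as np
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
policy_net = nn.Sequential(                 # outputs logits of pi_theta(a|s)
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)
opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()             # -mean log pi_theta(a_t|s_t)

def train_on_elites(elite_states, elite_actions):
    """One Adam step of cross-entropy minimization on elite (s_t, a_t) pairs."""
    states = torch.as_tensor(np.asarray(elite_states), dtype=torch.float32)
    actions = torch.as_tensor(elite_actions, dtype=torch.long)
    loss = loss_fn(policy_net(states), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Sampling sessions and selecting elites works exactly as in the tabular case; only the refitting step is replaced by gradient descent on the cross-entropy.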