Cross Entropy Method
Pavel Temirchev
Reminder: MDP formalism
Reinforcement Learning is a Machine Learning technique aimed at learning optimal behavior from interaction with an environment
- Works with unknown and stochastic environments, which can be explored only via interaction
- Applicable to a broad range of problems
- Learning is performed to maximize some kind of reward
[Figure: the agent-environment interaction loop. The agent sends an action to the environment (move a figure); the environment returns an observation (new positions) and a reward (win: +1, lose: -1). The goal is to learn optimal actions that maximize the reward.]
Reminder: MDP formalism
\( a_t \) - action
\( s_t \) - observation
\( r_t \) - reward
\( R_T = \sum_{t=0}^{T} r_t \) - return
Reinforcement Learning: \( \max_\pi \; \mathbb{E}\, R_T \)
What is inside this expectation?
\( \pi(a_t | s_t) = P(\text{take } a_t |\text{ in state } s_t) \) - policy
Cross Entropy Method (CEM)
For a general optimization problem
Forget about the structure inside \(\mathbb{E} R_T\): we just need to maximize some \(f(x)\), but:
- it is not differentiable
Cross Entropy Method (CEM)
For a general optimization problem
Idea: instead of searching for the optimum directly, let's find a good proposal distribution
The 'optimal' proposal would put all of its probability mass at the optimum \( x^* = \arg\max_x f(x) \)
Cross Entropy Method (CEM)
For a general optimization problem
Problem: we don't know the "optimal" distribution
Take some parametric distribution \(q^0\)
Sample \( x_i \sim q^0\)
Pick the M samples with the largest \( f(x_i) \) - the elites
Fit a new distribution \( q^1 \) to the elites
Repeat!
Cross Entropy Method (CEM)
For a general optimization problem
For a Gaussian proposal, fit \( q^{k+1} = \mathcal{N}(\mu, \sigma^2) \) to the elites using the maximum-likelihood estimator: \( \mu = \frac{1}{M}\sum_{i=1}^{M} x_i \), \( \sigma^2 = \frac{1}{M}\sum_{i=1}^{M} (x_i - \mu)^2 \) (elite sample mean and variance)
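As a concrete illustration, here is a minimal NumPy sketch of this loop with a 1-D Gaussian proposal; all names, hyperparameters, and the toy objective below are our assumptions, not from the slides:

```python
import numpy as np

def cem_maximize(f, mu=0.0, sigma=5.0, n_samples=100, n_elites=10, n_iters=50):
    """Maximize a black-box f(x) with CEM using a Gaussian proposal q^k."""
    for _ in range(n_iters):
        x = np.random.normal(mu, sigma, size=n_samples)  # sample x_i ~ q^k
        elites = x[np.argsort(f(x))[-n_elites:]]         # M samples with largest f(x_i)
        # MLE fit of q^{k+1}: sample mean and std of the elites
        mu, sigma = elites.mean(), elites.std() + 1e-8
    return mu

# Usage on a non-differentiable objective with maximum at x = 2:
f = lambda x: -np.abs(x - 2.0)
print(cem_maximize(f))  # ≈ 2.0
```

Note that the small constant added to sigma keeps the proposal from collapsing to a point before the loop finishes.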
Cross Entropy Method (CEM)
For a general optimization problem
In general, in order to fit the new distribution to the elites, we minimize the KL divergence between the empirical elite distribution \( p^* \) and \( q \):
\( q^{k+1} = \arg\min_q \mathrm{KL}(p^* \,\|\, q) = \arg\min_q \Big( \underbrace{-\mathbb{E}_{p^*} \log q}_{\text{cross-entropy}} + \underbrace{\mathbb{E}_{p^*} \log p^*}_{\text{const. w.r.t. } q} \Big) \)
The second term does not depend on \( q \), so we effectively minimise the cross-entropy - hence the method's name
Cross Entropy Method (CEM)
For discrete proposal distributions
What if \( q \) belongs to a discrete family of distributions, \( q(x = z_k) = q_k \)?
\( q_k \) - parameters of the distribution
\( z_k \) - possible values of the random variable
Cross Entropy Method (CEM)
For discrete proposal distributions
Let's find \( q^{k+1} \) at each iteration by minimising the cross-entropy over the elites:
\( \min_q \; -\sum_k n_k \log q_k \), where \( n_k \) is the number of elites with value \( z_k \)
The objective decreases monotonically in every \( q_k \), so unconstrained minimisation asks us to send the \( q_k \) to infinity
Cross Entropy Method (CEM)
For discrete proposal distributions
We should use the method of Lagrange multipliers to ensure \( \sum_k q_k = 1 \):
\( \mathcal{L} = -\sum_k n_k \log q_k + \lambda \Big( \sum_k q_k - 1 \Big), \qquad \frac{\partial \mathcal{L}}{\partial q_k} = -\frac{n_k}{q_k} + \lambda = 0 \;\Rightarrow\; q_k = \frac{n_k}{\lambda} \)
From \( \sum_k q_k = 1 \Rightarrow \lambda = \sum_j n_j \), so \( q_k = \frac{n_k}{\sum_j n_j} \) - the empirical frequencies of the elites
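A quick numerical check of this result (the toy elite sample below is ours):

```python
import numpy as np

# Toy example: 7 elite samples of a discrete variable with 3 possible
# values z_k, stored as indices k. The Lagrange-multiplier solution says
# the new parameters are the empirical elite frequencies q_k = n_k / sum_j n_j.
elites = np.array([2, 0, 2, 1, 2, 2, 0])
n_k = np.bincount(elites, minlength=3)   # n_k = [2, 1, 4]
q_new = n_k / n_k.sum()
print(q_new)                             # [2/7, 1/7, 4/7] ≈ [0.286, 0.143, 0.571]
```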
Cross Entropy Method (CEM)
For a tabular RL problem
In RL, we want to maximise the return \( \mathbb{E}\, R_T \) by choosing good actions \( a_t \)
The proposal distribution over actions is our policy \( \pi(a_t \mid s_t) \)
Cross Entropy Method (CEM)
For a tabular RL problem
ALGORITHM (a code sketch follows below):
1. Initialize the policy \( \pi^0 \) with the uniform distribution
2. Play \( n \) games with \( \pi^k \)
3. Collect all triples \( (s_t, a_t, R_{\text{end}}) \)
4. Select the top M% of samples by \( R_{\text{end}} \) as elites; discard the non-elite points
5. Recompute \( \pi^{k+1} \) on the elites and go to 2.
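A minimal sketch of this algorithm, assuming a gym-style environment with discrete states and actions (reset() returns a state index, step(a) returns (state, reward, done, info)); the function and variable names are ours:

```python
import numpy as np

n_states, n_actions, percentile = 16, 4, 70

def play_session(env, policy, t_max=1000):
    """Play one game with stochastic policy pi(a|s); return states, actions, R_end."""
    states, actions, total_reward = [], [], 0.0
    s = env.reset()
    for _ in range(t_max):
        a = np.random.choice(n_actions, p=policy[s])
        s_next, r, done, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = s_next
        if done:
            break
    return states, actions, total_reward

def cem_update(policy, sessions):
    """Select elite sessions by return and refit pi(a|s) to elite (s, a) frequencies."""
    returns = [R for _, _, R in sessions]
    threshold = np.percentile(returns, percentile)
    counts = np.zeros_like(policy)
    for states, actions, R in sessions:
        if R >= threshold:  # elite session: keep all its (s, a) pairs
            for s, a in zip(states, actions):
                counts[s, a] += 1
    for s in range(n_states):
        if counts[s].sum() > 0:
            policy[s] = counts[s] / counts[s].sum()
        # states never visited in elite sessions keep their previous distribution
    return policy

policy = np.full((n_states, n_actions), 1.0 / n_actions)  # pi^0: uniform
# for k in range(100):                                    # repeat steps 2-5
#     sessions = [play_session(env, policy) for _ in range(250)]
#     policy = cem_update(policy, sessions)
```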
Approximate Cross Entropy Method
For large state spaces
What if we cannot store a policy for every state, or the actions are continuous?
Define a proposal distribution parametrised by a neural network with weights \( \theta \):
e.g. a network that takes \( s_t \) as input and outputs the action probabilities \( \pi_\theta(a_t \mid s_t) \)
Minimise the cross-entropy on the elites with a gradient-based algorithm (e.g. ADAM):
\( \theta^{k+1} = \arg\min_\theta \; -\sum_{\text{elite } (s_t, a_t)} \log \pi_\theta(a_t \mid s_t) \)
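A minimal PyTorch sketch of this step for discrete actions (the architecture, sizes, and names are our assumptions): the network outputs the logits of \( \pi_\theta(a \mid s) \), and each call performs one Adam step of cross-entropy minimization on elite \( (s_t, a_t) \) pairs.

```python
import numpy as np
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
policy_net = nn.Sequential(                 # outputs logits of pi_theta(a|s)
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)
opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()             # -mean log pi_theta(a_t|s_t)

def train_on_elites(elite_states, elite_actions):
    """One Adam step of cross-entropy minimization on elite (s_t, a_t) pairs."""
    states = torch.as_tensor(np.asarray(elite_states), dtype=torch.float32)
    actions = torch.as_tensor(elite_actions, dtype=torch.long)
    loss = loss_fn(policy_net(states), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Sampling sessions and selecting elites works exactly as in the tabular case; only the refitting step is replaced by gradient descent on the cross-entropy.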