Pavel Temirchev
Reinforcement Learning is a Machine Learning technique aimed
at learning optimal behavior from interaction with an environment
[Diagram: the agent-environment interaction loop. The AGENT sends an action to the ENVIRONMENT (e.g. move a figure) and receives back an observation (the new positions) and a reward (win: +1, lose: -1).]
Learn optimal actions
to maximize the reward
\( a_t \) - action
\( s_t \) - observation
\( r_t \) - reward
\( R_T = \sum_{t=0}^{T} r_t \) - return
Reinforcement Learning: maximise the expected return \( \mathbb{E}\, R_T \) over policies
What is inside this expectation?
\( \pi(a_t | s_t) = P(\text{take } a_t |\text{ in state } s_t) \) - policy
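To make this explicit, the expectation can be expanded as a sum over trajectories generated jointly by the policy and the environment dynamics (a standard expansion; the transition probabilities \( p(s_{t+1} \mid s_t, a_t) \) are introduced here only for illustration and are not defined on the slide):
\[
\mathbb{E}\, R_T = \sum_{\tau} P(\tau)\, R_T(\tau),
\qquad
P(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
\]
where \( \tau = (s_0, a_0, s_1, a_1, \dots, s_T) \) is a trajectory and \( R_T(\tau) \) is the return collected along it. The policy \( \pi \) is the only part we control.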
For a general optimization problem
Forget about the structure inside \(\mathbb{E} R_T\): we just need to maximize some \(f(x)\). But \(f\) is a black box: we can only evaluate it at sampled points, not differentiate it.
For a general optimization problem
Idea: instead of an optimum, let's find a good proposal distribution
[Plot: \(f(x)\) together with the 'optimal' proposal distribution, which puts all of its mass at the maximum of \(f\)]
For a general optimization problem
Problem: we don't know the "optimal" distribution
Take some parametric distribution \(q^0\)
Sample \( x_i \sim q^0 \)
Pick the M samples with the largest values \( f(x_i) \) - the elites
Fit a new distribution \(q^1\) to the elites
Repeat! (a minimal sketch of this loop is given below)
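A minimal sketch of this loop in Python, assuming a 1-D Gaussian proposal and a black-box objective \(f\) (the function names and hyperparameters are illustrative, not taken from the slides):

```python
import numpy as np

def cem_maximize(f, n_samples=100, n_elites=10, n_iters=50):
    """Cross-entropy method with a 1-D Gaussian proposal q^k = N(mu, sigma^2)."""
    mu, sigma = 0.0, 5.0                                     # initial proposal q^0
    for _ in range(n_iters):
        xs = np.random.normal(mu, sigma, size=n_samples)     # sample x_i ~ q^k
        values = f(xs)                                       # evaluate the black box
        elites = xs[np.argsort(values)[-n_elites:]]          # M samples with largest f(x_i)
        mu, sigma = elites.mean(), elites.std() + 1e-8       # fit q^{k+1} on the elites (MLE)
    return mu

# Example: maximise a simple concave function; the result should approach 3.0
print(cem_maximize(lambda x: -(x - 3.0) ** 2))
```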
For a general optimization problem
For a Gaussian proposal, fit \(q^{k+1}\) to the elites using the Maximum Likelihood estimator:
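For reference, the ML estimates on the \(M\) elite samples are the usual empirical mean and covariance (standard Gaussian MLE, written out here since the formulas are not shown above):
\[
\mu^{k+1} = \frac{1}{M}\sum_{i=1}^{M} x_i,
\qquad
\Sigma^{k+1} = \frac{1}{M}\sum_{i=1}^{M} \left(x_i - \mu^{k+1}\right)\left(x_i - \mu^{k+1}\right)^{\top}.
\]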
For a general optimization problem
In general, in order to fit the new distribution to the elites, we minimise the KL divergence between the elite distribution and \(q\), which is equivalent to minimising the cross-entropy (hence the name of the method).
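Written out (a standard identity; \(p^{*}\) below denotes the distribution of the elites):
\[
\mathrm{KL}\left(p^{*} \,\|\, q\right) = \mathbb{E}_{p^{*}} \log p^{*} - \mathbb{E}_{p^{*}} \log q,
\]
and since the first term does not depend on \(q\),
\[
q^{k+1} = \arg\min_{q} \mathrm{KL}\left(p^{*} \,\|\, q\right)
        = \arg\max_{q} \mathbb{E}_{p^{*}} \log q
        \approx \arg\max_{q} \frac{1}{M} \sum_{i \in \text{elites}} \log q(x_i).
\]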
For discrete proposal distributions
What if \(q\) is from a discrete family of distributions?
\(q_k = P(x = z_k)\) - parameters of the distribution
\(z_k\) - possible values of the random variable
For discrete proposal distributions
Let's find \(q^{k+1}\) at each iteration by maximising \( \sum_i \log q(x_i) \) over the elite samples.
Naive unconstrained maximisation asks us to send every \(q_k\) to infinity.
For discrete proposal distributions
We should use the method of Lagrange multipliers to enforce the constraint \(\sum_k q_k = 1 \).
From \(\sum_k q_k = 1 \) it follows that \(q_k^{k+1}\) is simply the fraction of elite samples equal to \(z_k\).
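The derivation (a standard constrained maximum-likelihood calculation; \(n_k\) is the number of elite samples equal to \(z_k\) and \(M\) the total number of elites):
\[
\mathcal{L}(q, \lambda) = \sum_{k} n_k \log q_k + \lambda \Big( \sum_{k} q_k - 1 \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial q_k} = \frac{n_k}{q_k} + \lambda = 0
\;\Rightarrow\;
q_k = -\frac{n_k}{\lambda}.
\]
Substituting into \(\sum_k q_k = 1\) gives \(\lambda = -\sum_k n_k = -M\), hence
\[
q_k^{k+1} = \frac{n_k}{M}.
\]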
For a tabular RL problem
In RL, we want to maximise the expected return \(\mathbb{E} R_T\) by choosing good actions \(a_t\)
The proposal distribution over actions is our policy \( \pi(a_t | s_t) \), stored as a table over states and actions.
For a tabular RL problem
ALGORITHM:
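A sketch of the usual tabular cross-entropy method in Python (the Gym-style environment interface, the percentile threshold and the play_session helper are illustrative assumptions, not taken from the slide):

```python
import numpy as np

def play_session(env, policy, t_max=1000):
    """Roll out one episode; return visited states, taken actions and the total reward."""
    states, actions, total_reward = [], [], 0.0
    s, _ = env.reset()
    for _ in range(t_max):
        a = np.random.choice(policy.shape[1], p=policy[s])   # a_t ~ pi(a | s_t)
        next_s, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); total_reward += r
        s = next_s
        if terminated or truncated:
            break
    return states, actions, total_reward

def tabular_cem(env, n_states, n_actions, n_iters=100, n_sessions=200, percentile=70):
    """Tabular cross-entropy method: the policy is an (n_states x n_actions) table."""
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform pi^0(a|s)
    for _ in range(n_iters):
        # 1. Play n_sessions episodes with the current policy
        sessions = [play_session(env, policy) for _ in range(n_sessions)]
        returns = np.array([r for (_, _, r) in sessions])
        # 2. Elite sessions: those with returns above the chosen percentile
        threshold = np.percentile(returns, percentile)
        elite_pairs = [(s, a) for (ss, aa, r) in sessions if r >= threshold
                       for s, a in zip(ss, aa)]
        # 3. Fit the new policy on elite (state, action) pairs: pi(a|s) = n(s,a) / n(s)
        counts = np.zeros_like(policy)
        for s, a in elite_pairs:
            counts[s, a] += 1
        visited = counts.sum(axis=1) > 0
        policy[visited] = counts[visited] / counts[visited].sum(axis=1, keepdims=True)
    return policy
```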
For large state spaces
What if we cannot store the policy for every state, or if the actions are continuous?
Define a proposal distribution parametrised by a NN, e.g. \( \pi_\theta(a_t | s_t) \) with action probabilities given by a softmax over the network outputs.
Minimise the cross-entropy on the elite (state, action) pairs with a gradient-based algorithm (e.g. Adam):
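A sketch of this update step in PyTorch (the network architecture, the optimiser settings and the dummy data sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2   # illustrative sizes, e.g. a CartPole-like task

# Policy network: maps an observation to unnormalised action scores (logits);
# a softmax over the logits gives pi_theta(a | s).
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def cem_update(elite_states, elite_actions):
    """One gradient step: minimise the cross-entropy between pi_theta and the elite actions."""
    logits = policy_net(elite_states)                            # (batch, n_actions)
    loss = nn.functional.cross_entropy(logits, elite_actions)    # -1/M * sum_i log pi_theta(a_i | s_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy data (in practice these come from the elite sessions):
states = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
print(cem_update(states, actions))
```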