Curiosity-driven Exploration
by Self-Supervised
Prediction

Report by
Pavel Temirchev

 

Deep RL reading group

 

Reinforcement Learning preliminaries

[Diagram: the agent-environment loop: the AGENT takes an ACTION in the ENVIRONMENT and receives an OBSERVATION and a REWARD]

Reinforcement Learning preliminaries

Markov Decision Process:

\(a_t \in \mathcal{A}\) - the set of actions

\(s_t \in \mathcal{S}\) - the set of states of the environment

\(r_t = r(s_t, a_t) \in \mathbb{R}\) - the reward function

\(s_{t+1} \sim p(s_{t+1}|s_t, a_t)\) - the transition probability

\(a_t \sim \pi(a_t|s_t)\) - the policy

Reinforcement Learning preliminaries

Policy Gradient Methods:

We want to maximize the expected return:

R(\theta) = \mathbb{E}_{\pi(\theta)} \sum_t r_t \rightarrow \max_\theta

We parametrize the policy:

\pi(a_t|s_t) = \pi(a_t|s_t, \theta)

Maximization is done by gradient ascent:

\theta \leftarrow \theta + \alpha\nabla_\theta R
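As a rough illustration of the gradient-ascent update above, here is a minimal REINFORCE-style sketch in PyTorch (not the authors' code); `policy`, `optimizer` and the batch of `states`, `actions`, `returns` are assumed to come from a rollout collected with the current policy.

```python
import torch

# Minimal REINFORCE-style sketch of the update theta <- theta + alpha * grad R,
# written as gradient descent on the negative expected return.
# `policy` maps states to action logits (all names here are illustrative).
def policy_gradient_step(policy, optimizer, states, actions, returns):
    log_probs = torch.log_softmax(policy(states), dim=-1)        # log pi(a|s, theta)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()                            # -E[ log pi * sum_t r_t ]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```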

Reinforcement Learning preliminaries

Advantage Actor-Critic Method:

Gradients:

\nabla_\theta R = \mathbb{E}_{\pi(\theta)} \nabla_\theta \log \pi(a|s,\theta)A(a, s)

where \(A(a, s)\) is the advantage function

The A3C method was used in this work

For more details see: https://arxiv.org/pdf/1602.01783.pdf
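A minimal sketch of the advantage-weighted gradient estimator above (illustrative, not the paper's implementation); `policy_logits` and `values` stand for the outputs of the actor and critic heads, and `returns` for the empirical returns.

```python
import torch

# Advantage actor-critic gradient sketch:
#   grad R = E[ grad log pi(a|s, theta) * A(a, s) ],  with A(a, s) ~ R_t - V(s_t).
def a2c_losses(policy_logits, values, actions, returns):
    log_probs = torch.log_softmax(policy_logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantages = returns - values.detach()           # A(a, s)
    policy_loss = -(chosen * advantages).mean()      # actor term
    value_loss = (returns - values).pow(2).mean()    # critic regression term
    return policy_loss, value_loss
```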

Problem Description

Commonly used exploration strategies:

  • \(\epsilon\)-greedy strategy

Take the action \(a_t = \arg\max_a \pi(a | s_t)\) with probability \(1 - \epsilon\),

or take a random action with probability \(\epsilon\)

  • Boltzmann strategy

Take action \(a_t\) with probability \(\propto \exp(\pi(a_t|s_t) / T)\)
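A small sketch of both action-selection rules (names are illustrative; `logits` stands for the policy scores over the discrete actions):

```python
import torch

def epsilon_greedy(logits, epsilon=0.1):
    # with probability epsilon take a uniformly random action,
    # otherwise take the greedy (argmax) action
    if torch.rand(1).item() < epsilon:
        return torch.randint(logits.shape[-1], (1,)).item()
    return logits.argmax(dim=-1).item()

def boltzmann(logits, temperature=1.0):
    # sample an action with probability proportional to exp(pi(a|s) / T)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()
```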

Problem Description

These exploration strategies are not efficient if rewards are sparse.

It is almost impossible to perform a long, complex sequence of actions through random exploration.

Curiosity-driven Exploration

A good exploration strategy should encourage the agent to:

  • Explore 'novel' states
  • Perform actions that reduce the uncertainty about the environment

Curiosity-driven Exploration

If you don't praise yourself, nobody will (a Russian saying)

Curiosity-driven Exploration

\(r_t = r^i_t + r^e_t\) - the total reward

\(r^i_t\) - the intrinsic reward

\(r^e_t\) - the extrinsic reward (mostly, if not always, zero)

Curiosity-driven Exploration

The agent is composed of two modules:

- Generator of the intrinsic reward \(r_t^i\)

- Policy \(\pi(\theta)\)  that outputs actions

Prediction error as curiosity reward

\(r_t^i\) is based on how hard it is for the agent to predict the consequences of its own actions

We need a model of the environmental dynamics that predicts \(s_{t+1}\) given \(s_t\) and \(a_t\)

Prediction error as curiosity reward

Prediction in the raw state space \(\mathcal{S}\) is not efficient:

not all changes in the environment depend on the agent's actions or affect the agent

Prediction error as curiosity reward

Inverse Dynamics Model

Transforms the raw state representation \(s_t\) into \(\phi(s_t)\), which depends only on the parts of \(s_t\) that the agent can control or that affect the agent

Forward Dynamics Model

Tries to predict \(\phi(s_{t+1})\) given \(\phi(s_t)\) and \(a_t\)

Inverse dynamics model

[Diagram: \(s_t, s_{t+1} \rightarrow \phi(s_t), \phi(s_{t+1}) \rightarrow \hat{a}_t\)]

\hat{a}_t = g(s_{t+1}, s_t, \theta_I)

\min_{\theta_I} L_I(\hat{a}_t, a_t)

\(L_I\) - the cross-entropy loss
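A minimal sketch of the inverse-dynamics training objective (illustrative; `inverse_model` is an assumed module mapping the concatenated features to action logits, so \(L_I\) is the usual cross-entropy):

```python
import torch
import torch.nn.functional as F

def inverse_loss(inverse_model, phi_t, phi_tp1, actions):
    # a_hat_t = g(s_{t+1}, s_t, theta_I), computed here from the features
    logits = inverse_model(torch.cat([phi_t, phi_tp1], dim=1))
    # L_I(a_hat_t, a_t): cross entropy against the action actually taken
    return F.cross_entropy(logits, actions)
```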

Forward dynamics model

[Diagram: \(\phi(s_t), a_t \rightarrow \hat{\phi}(s_{t+1})\)]

\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t, \theta_F)

\min_{\theta_F} L_F(\hat{\phi}(s_{t+1}), \phi(s_{t+1}))

L_F(\hat{\phi}_{t+1}, \phi_{t+1}) = \frac{1}{2}||\hat{\phi}_{t+1} - \phi_{t+1}||^2_2
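A matching sketch of the forward-dynamics loss \(L_F\) (illustrative; `forward_model` and the one-hot action encoding are assumptions):

```python
import torch
import torch.nn.functional as F

def forward_loss(forward_model, phi_t, phi_tp1, actions, n_actions):
    a_onehot = F.one_hot(actions, n_actions).float()
    # phi_hat_{t+1} = f(phi(s_t), a_t, theta_F)
    phi_hat = forward_model(torch.cat([phi_t, a_onehot], dim=1))
    # L_F = 1/2 * || phi_hat_{t+1} - phi_{t+1} ||^2
    return 0.5 * (phi_hat - phi_tp1.detach()).pow(2).sum(dim=1).mean()
```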

Intrinsic Curiosity Module (ICM)

r^i_t = \frac{\eta}{2}||\hat{\phi}_{t+1} - \phi_{t+1}||^2_2

\(\eta > 0\) - the tradeoff between extrinsic and intrinsic reward

\(r_t = r_t^i + r^e_t\) - the total reward
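Given the forward-model prediction, the intrinsic reward can be computed as below (a sketch; the default value of \(\eta\) is an illustrative placeholder):

```python
import torch

def intrinsic_reward(phi_hat_tp1, phi_tp1, eta=0.01):
    # r_i_t = eta/2 * || phi_hat_{t+1} - phi_{t+1} ||^2   (per transition)
    return 0.5 * eta * (phi_hat_tp1 - phi_tp1).pow(2).sum(dim=1)

# total reward fed to the policy:  r_t = r_i_t + r_e_t
# r_total = intrinsic_reward(phi_hat, phi) + extrinsic_reward
```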

Optimization problem

L(\theta) = \big[-\lambda \mathbb{E}_{\pi_{\theta_{P}}} \sum_t r_t + \beta L_F + (1-\beta)L_I \big] \rightarrow \min_\theta

\(\theta = \{\theta_P, \theta_I, \theta_F\}\) - the policy, inverse-model and forward-model parameters

\(\lambda > 0\) - the tradeoff between the policy gradient loss and intrinsic reward learning

\(0 \le \beta \le 1\) - the tradeoff between forward and inverse dynamics model learning
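The joint objective can then be assembled roughly as follows (a sketch; the default values of \(\lambda\) and \(\beta\) are placeholders, and `policy_loss` stands for the negated policy-gradient surrogate for \(\mathbb{E}\sum_t r_t\)):

```python
def total_loss(policy_loss, forward_loss_value, inverse_loss_value,
               lambda_=0.1, beta=0.2):
    # L = -lambda * E[sum_t r_t] + beta * L_F + (1 - beta) * L_I
    return (lambda_ * policy_loss
            + beta * forward_loss_value
            + (1.0 - beta) * inverse_loss_value)
```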

Experimental Setup

Environments

Doom 3-D navigation task

  • DoomMyWayHome-v0 from OpenAI Gym
  • Actions: left, right, forward and no-action
  • Termination after reaching the goal or 2100 timesteps
  • Sparse reward: +1 at termination on reaching the goal, zero otherwise


Experimental Setup

Environments

Super Mario Bros

  • Trained on the first level; generalization tested on the three subsequent levels
  • 14 unique actions following:

https://github.com/ppaquette/gym-super-mario

Experimental Setup

Training Details

  • RGB images \(\rightarrow\) grey-scale 42x42 images
  • \(s_t\) \(\leftarrow\) stack of the 4 most recent frames (see the preprocessing sketch below)
  • A3C with 12 workers and the ADAM optimizer (optimizer parameters not shared across workers)
  • Action repeat of 4 during Doom training
  • Action repeat of 6 during Super Mario training
  • No action repeat at test time
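A rough preprocessing sketch matching the details above (illustrative; OpenCV is assumed only for the grey-scale conversion and resizing):

```python
import collections

import cv2
import numpy as np

def preprocess(rgb_frame):
    # RGB frame -> 42x42 grey-scale image in [0, 1]
    grey = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(grey, (42, 42)).astype(np.float32) / 255.0

class FrameStack:
    # s_t is the stack of the 4 most recent preprocessed frames
    def __init__(self, k=4):
        self.frames = collections.deque(maxlen=k)

    def reset(self, frame):
        for _ in range(self.frames.maxlen):
            self.frames.append(preprocess(frame))
        return np.stack(self.frames)          # shape [4, 42, 42]

    def step(self, frame):
        self.frames.append(preprocess(frame))
        return np.stack(self.frames)
```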

Experimental Setup

Training Details

Policy and Value network

\(s_t \rightarrow\) 4 Conv Layers:

  • 32 filters
  • 3x3 kernel size
  • stride = 2
  • padding = 1
  • ELU nonlinearity

Conv output \(\rightarrow\) LSTM:

  • 256 units

Then two separate fully-connected layers for \(\pi(a_t|s_t)\) and \(V(s_t)\)
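A sketch of this network in PyTorch (illustrative, using the layer sizes listed above):

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    # 4 conv layers (32 filters, 3x3, stride 2, padding 1, ELU),
    # an LSTM with 256 units, and separate heads for pi(a|s) and V(s).
    def __init__(self, n_actions):
        super().__init__()
        layers, in_ch = [], 4                     # input: 4 stacked 42x42 frames
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ELU()]
            in_ch = 32
        self.conv = nn.Sequential(*layers)        # output: 32 x 3 x 3 = 288 features
        self.lstm = nn.LSTMCell(288, 256)
        self.pi = nn.Linear(256, n_actions)       # policy logits
        self.v = nn.Linear(256, 1)                # state value

    def forward(self, s, hidden):
        x = self.conv(s).flatten(start_dim=1)     # [B, 288]
        h, c = self.lstm(x, hidden)
        return self.pi(h), self.v(h), (h, c)
```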

Experimental Setup

Training Details

ELU(x) = \begin{cases} x, & x \ge 0 \\ \alpha(\exp(x) - 1), & \text{otherwise} \end{cases}


Experimental Setup

Training Details

Intrinsic Curiosity Module (ICM)

\(s_t\) converted to \(\phi(s_t)\) using:

  • 4 Conv Layers
  • 32 filters each
  • 3x3 kernel size
  • stride = 2
  • padding = 1
  • ELU nonlinearity

\(\phi(s_t)\) dimensionality is 288
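A sketch of the feature encoder \(\phi(\cdot)\) with these layer sizes (illustrative):

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    # 4 conv layers, 32 filters each, 3x3 kernels, stride 2, padding 1, ELU;
    # on a 4 x 42 x 42 input this yields a 288-dimensional phi(s_t).
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 4
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ELU()]
            in_ch = 32
        self.conv = nn.Sequential(*layers)

    def forward(self, s):
        return self.conv(s).flatten(start_dim=1)   # [B, 288]
```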

Experimental Setup

Training Details

Intrinsic Curiosity Module (ICM)

Inverse Model

\(\phi_t\) and \(\phi_{t+1}\) are concatenated and passed through an FC layer of 256 units followed by an output softmax layer for \(\hat{a}_t\) prediction

Forward Model

\(a_t\) and \(\phi_{t}\) are concatenated and passed through an FC layer of 256 units followed by an FC layer of 288 units for \(\hat{\phi}_{t+1}\) prediction
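A sketch of the two heads with the sizes given above (illustrative; the hidden activation is an assumption, and the softmax over \(\hat{a}_t\) is left to the cross-entropy loss):

```python
import torch
import torch.nn as nn

class InverseHead(nn.Module):
    # [phi_t, phi_{t+1}] -> FC(256) -> action logits (softmax applied in the loss)
    def __init__(self, n_actions, feat_dim=288):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ELU(),
                                 nn.Linear(256, n_actions))

    def forward(self, phi_t, phi_tp1):
        return self.net(torch.cat([phi_t, phi_tp1], dim=1))

class ForwardHead(nn.Module):
    # [phi_t, one-hot a_t] -> FC(256) -> FC(288) = phi_hat_{t+1}
    def __init__(self, n_actions, feat_dim=288):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + n_actions, 256), nn.ELU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, phi_t, a_onehot):
        return self.net(torch.cat([phi_t, a_onehot], dim=1))
```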

Experiments

Doom: DENSE reward setting

Experiments

Doom: SPARSE reward setting

Experiments

Doom: VERY SPARSE reward setting

Experiments

Doom: Robustness to noise

[Figure: input with a noise sample]


Experiments

Comparison to TRPO-VIME

For the SPARSE Doom reward setting

Method           Mean (Median) score at convergence
TRPO             26.0% (0.0%)
A3C              0.0% (0.0%)
VIME + TRPO      46.1% (27.1%)
ICM + A3C        100% (100%)

Experiments

NO REWARD setting and GENERALIZATION

Experiments

Fine-tuning for unseen scenarios


Thanks for your attention!
