Report by Pavel Temirchev
Deep RL reading group
Reinforcement Learning preliminaries
The agent-environment interaction loop: the AGENT sends an ACTION to the ENVIRONMENT, which returns an OBSERVATION and a REWARD
Reinforcement Learning preliminaries
Markov Decision Process:
\(\mathcal{A}\) - the set of actions
\(\mathcal{S}\) - the set of states of the environment
\(r(s_t, a_t)\) - the reward function
\(\pi(a_t | s_t)\) - the policy
\(p(s_{t+1} | s_t, a_t)\) - the transition probability
Reinforcement Learning preliminaries
Policy Gradient Methods:
We want to maximize the expected return:
\( J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_t \gamma^t r_t \right] \)
We parametrize the policy: \( \pi_\theta(a_t | s_t) \)
Maximization is done by gradient ascent:
\( \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \, R_t \right], \qquad R_t = \sum_{k \ge t} \gamma^{k-t} r_k \)
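As an illustration of this update, a minimal REINFORCE-style sketch in PyTorch; `policy_net` (a hypothetical network mapping states to action logits) and the Monte-Carlo `returns` are assumed placeholders, not code from the paper.

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, returns):
    # log pi_theta(a_t | s_t) for the actions that were actually taken
    logits = policy_net(states)                          # shape [T, n_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Monte-Carlo estimate of -J(theta); minimizing it performs gradient ascent on J(theta)
    loss = -(log_pi * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```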
Reinforcement Learning preliminaries
Advantage Actor-Critic Method:
The return \(R_t\) is replaced with the advantage function \( A(a, s) = Q(s, a) - V(s) \), which reduces the variance of the gradient estimate
Gradients:
\( \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \, A(s_t, a_t) \right] \)
The A3C method was used in this work
For more see: https://arxiv.org/pdf/1602.01783.pdf
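For reference, a single synchronous advantage-actor-critic update could be sketched as below; A3C additionally runs many such workers asynchronously, and the loss coefficients here are common defaults, not values taken from the paper.

```python
import torch

def a2c_loss(logits, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    # Advantage estimate A(s_t, a_t) ~ R_t - V(s_t)
    advantages = returns - values.squeeze(-1)

    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Actor: policy-gradient term; detach() keeps the critic out of this gradient
    policy_loss = -(log_pi * advantages.detach()).mean()
    # Critic: squared error of the value estimate
    value_loss = advantages.pow(2).mean()
    # Entropy bonus encourages exploration (as in A3C)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```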
Problem Description
Commonly used exploration strategies:
\(\epsilon\)-greedy: take the action \(a_t = \arg\max_a \pi(a | s_t)\) with probability \( 1 - \epsilon \),
or take a random action with probability \(\epsilon\)
Boltzmann (softmax): take action \(a_t\) with probability \( \propto \exp(\pi(a_t|s_t) / T) \), where \(T\) is a temperature
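In code, these two standard strategies could look like the following numpy sketch, where `pi` stands for the policy's action probabilities in the current state.

```python
import numpy as np

def epsilon_greedy(pi, epsilon=0.1):
    """With probability 1 - epsilon take argmax_a pi(a|s), otherwise a uniformly random action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(pi))
    return int(np.argmax(pi))

def boltzmann(pi, temperature=1.0):
    """Sample a_t with probability proportional to exp(pi(a_t|s_t) / T)."""
    logits = pi / temperature
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(pi), p=probs))
```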
Problem Description
These exploration strategies are not efficient when rewards are sparse:
it is almost impossible to stumble upon a long, complex sequence of actions
by random exploration alone
Curiosity-driven Exploration
A good exploration strategy should encourage the agent to:
- explore novel states
- reduce its uncertainty about the consequences of its own actions
Curiosity-driven Exploration
"If you don't praise yourself, no one will" (Russian proverb): the agent has to reward itself
Curiosity-driven Exploration
\(r_t^i\) - the intrinsic reward, generated by the agent itself
\(r_t^e\) - the extrinsic reward from the environment (mostly, if not always, zero)
\(r_t = r_t^i + r_t^e\) - the total reward
Curiosity-driven Exploration
The agent is composed of two modules:
- Generator of the intrinsic reward \(r_t^i\)
- Policy \(\pi(\theta)\) that outputs actions
Prediction error as curiosity reward
\(r_t^i\) is based on how hard it is for the agent to predict the consequences of its own actions
We need a model of the environmental dynamics that predicts \(s_{t+1}\) given \(s_t\) and \(a_t\)
Prediction error as curiosity reward
Prediction in the raw state space \(\mathcal{S}\) is not efficient:
not all changes in the environment depend on the agent's actions
or affect the agent
Prediction error as curiosity reward
Inverse Dynamics Model
transforms the raw state representation \(s_t\) into features \(\phi(s_t)\) that depend only on the parts of \(s_t\) the agent can control or that affect the agent
Forward Dynamics Model
tries to predict \(\phi(s_{t+1})\) given \(\phi(s_t)\) and \(a_t\)
Inverse dynamics model: \( \hat{a}_t = g\left( \phi(s_t), \phi(s_{t+1}); \theta_I \right) \)
\( L_I\left( \hat{a}_t, a_t \right) \) - cross entropy between the predicted and the true action
Forward dynamics model: \( \hat{\phi}(s_{t+1}) = f\left( \phi(s_t), a_t; \theta_F \right) \)
\( L_F = \frac{1}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|_2^2 \)
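A minimal PyTorch sketch of these two losses; `encoder`, `inverse_model` and `forward_model` are hypothetical modules standing for \(\phi\), \(g(\cdot; \theta_I)\) and \(f(\cdot; \theta_F)\), and detaching the forward-model target is one common design choice rather than necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def dynamics_losses(encoder, inverse_model, forward_model, s_t, s_tp1, a_t):
    phi_t, phi_tp1 = encoder(s_t), encoder(s_tp1)        # phi(s_t), phi(s_{t+1})

    # Inverse model g(.; theta_I): predict a_t from both feature vectors; L_I is cross entropy
    a_logits = inverse_model(torch.cat([phi_t, phi_tp1], dim=1))
    L_I = F.cross_entropy(a_logits, a_t)

    # Forward model f(.; theta_F): predict phi(s_{t+1}) from phi(s_t) and a one-hot a_t
    a_onehot = F.one_hot(a_t, a_logits.shape[1]).float()
    phi_hat_tp1 = forward_model(torch.cat([phi_t, a_onehot], dim=1))
    forward_error = 0.5 * (phi_hat_tp1 - phi_tp1.detach()).pow(2).sum(dim=1)
    L_F = forward_error.mean()

    # forward_error is reused as the curiosity signal r_t^i introduced below
    return L_I, L_F, forward_error
```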
Intrinsic Curiosity Module (ICM)
\( r_t^i = \frac{\eta}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|_2^2 \), where \(\eta > 0\) is the tradeoff between extrinsic and intrinsic reward
\( r_t = r_t^e + r_t^i \) - the total reward
Optimization problem
\( \min_{\theta_P, \theta_I, \theta_F} \left[ -\lambda \, \mathbb{E}_{\pi(s_t; \theta_P)} \left[ \textstyle\sum_t r_t \right] + (1 - \beta) L_I + \beta L_F \right] \)
\(\lambda\) - the tradeoff between the policy gradient loss and learning the intrinsic reward
\(\beta\) - the tradeoff between the forward and the inverse dynamics model losses
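Combined into one scalar, the objective could be sketched as below; \(\beta = 0.2\) and \(\lambda = 0.1\) are the values reported in the paper, and `policy_loss` stands for an estimate of \(-\mathbb{E}[\sum_t r_t]\) (e.g. the actor-critic loss sketched earlier).

```python
def icm_objective(policy_loss, L_I, L_F, beta=0.2, lam=0.1):
    # min over theta_P, theta_I, theta_F of:
    #   -lambda * E[sum_t r_t] + (1 - beta) * L_I + beta * L_F
    return lam * policy_loss + (1.0 - beta) * L_I + beta * L_F
```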
Experimental Setup
Environments
Doom 3-D navigation task
Experimental Setup
Environments
Super Mario Bros
Experimental Setup
Training Details
Policy and Value
\(s_t \rightarrow\) 4 Conv layers: 32 filters each, 3×3 kernels, stride 2, ELU activations
Conv output \(\rightarrow\) LSTM: 256 units
Then two separate fully-connected layers for \(\pi(a_t|s_t)\) and \(V(s_t)\)
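A PyTorch sketch of this actor-critic network; the sizes (32 filters, 3×3 kernels, stride 2, ELU, a 256-unit LSTM) follow the details above, while the 42×42 input resolution (which makes the conv output 3×3×32 = 288) and the padding are assumptions taken from the paper's training setup.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, in_channels, n_actions):
        super().__init__()
        # 4 conv layers: 32 filters, 3x3 kernels, stride 2, ELU after each
        convs, c = [], in_channels
        for _ in range(4):
            convs += [nn.Conv2d(c, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            c = 32
        self.conv = nn.Sequential(*convs)
        self.lstm = nn.LSTMCell(32 * 3 * 3, 256)     # 42x42 input -> 3x3x32 = 288 features
        self.pi = nn.Linear(256, n_actions)          # policy head: logits of pi(a_t | s_t)
        self.v = nn.Linear(256, 1)                   # value head: V(s_t)

    def forward(self, s_t, hidden=None):
        x = self.conv(s_t).flatten(start_dim=1)
        h, c = self.lstm(x, hidden)
        return self.pi(h), self.v(h), (h, c)
```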
Experimental Setup
Training Details
Intrinsic Curiosity Module (ICM)
\(s_t\) is converted to \(\phi(s_t)\) using 4 Conv layers (32 filters each, 3×3 kernels, stride 2, ELU)
\(\phi(s_t)\) dimensionality is 288
Experimental Setup
Training Details
Intrinsic Curiosity Module (ICM)
Inverse Model
Forward Model
\(\phi_t\) and \(\phi_{t+1}\) are concatenated and passed through an FC layer of 256 units followed by an output softmax layer for \(\hat{a}_t\) prediction
\(a_t\) and \(\phi_{t}\) are concatenated and passed through an FC layer of 256 units followed by an FC layer of 288 units for \(\hat{\phi}_{t+1}\) prediction
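Putting the pieces together, the whole ICM could look like the sketch below; the feature encoder mirrors the policy's conv stack so that \(\phi(s_t)\) is 288-dimensional, and the ReLU inside the fully-connected heads is an assumption.

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    def __init__(self, in_channels, n_actions, feat_dim=288):
        super().__init__()
        # Feature encoder: 4 conv layers (32 filters, 3x3, stride 2, ELU) -> 288-d phi(s_t)
        convs, c = [], in_channels
        for _ in range(4):
            convs += [nn.Conv2d(c, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            c = 32
        self.encoder = nn.Sequential(*convs, nn.Flatten())

        # Inverse model: [phi_t, phi_{t+1}] -> FC(256) -> action logits (softmax gives a_hat_t)
        self.inverse_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

        # Forward model: [phi_t, one-hot a_t] -> FC(256) -> FC(288) = phi_hat_{t+1}
        self.forward_head = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.n_actions = n_actions

    def forward(self, s_t, s_tp1, a_t):
        phi_t, phi_tp1 = self.encoder(s_t), self.encoder(s_tp1)
        a_logits = self.inverse_head(torch.cat([phi_t, phi_tp1], dim=1))
        a_onehot = nn.functional.one_hot(a_t, self.n_actions).float()
        phi_hat_tp1 = self.forward_head(torch.cat([phi_t, a_onehot], dim=1))
        return phi_t, phi_tp1, phi_hat_tp1, a_logits
```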
Experiments
Doom: DENSE reward setting
Experiments
Doom: SPARSE reward setting
Experiments
Doom: VERY SPARSE reward setting
Experiments
Doom: Robustness to noise
Input with noise sample:
Experiments
Doom: Robustness to noise
Experiments
Comparison with TRPO-VIME
For the SPARSE Doom reward setting
| Method | Mean (Median) Score at convergence |
|---|---|
| TRPO | 26.0% (0.0%) |
| A3C | 0.0% (0.0%) |
| VIME + TRPO | 46.1% (27.1%) |
| ICM + A3C | 100% (100%) |
Experiments
NO REWARD setting and GENERALIZATION
Experiments
Fine-tuning for unseen scenarios
Links
Blog (video is here):
Article: https://arxiv.org/abs/1705.05363
Thanks for your
attention!