Curiosity-driven Exploration
by Self-Supervised
Prediction

Report by Pavel Temirchev

 

Deep RL

reading group

 

Reinforcement Learning preliminaries

Diagram: the agent-environment interaction loop (the AGENT sends an ACTION to the ENVIRONMENT and receives an OBSERVATION and a REWARD)

Reinforcement Learning preliminaries

Markov Decision Process:

  • \(s_t \in \mathcal{S}\) - the set of states of the environment
  • \(a_t \in \mathcal{A}\) - the set of actions
  • \(r_t = r(s_t, a_t) \in \mathbb{R}\) - the reward function
  • \(s_{t+1} \sim p(s_{t+1}|s_t, a_t)\) - the transition probability
  • \(a_{t} \sim \pi(a_{t}|s_t)\) - the policy

Reinforcement Learning preliminaries

Policy Gradient Methods:

We want to maximize the expected return:

R(\theta) = \mathbb{E}_{\pi(\theta)} \sum_t r_t \rightarrow \max_\theta

We parametrize the policy:

\pi(a_t|s_t) = \pi(a_t|s_t, \theta)

Maximization is done by gradient ascent:

\theta \leftarrow \theta + \alpha\nabla_\theta R
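A minimal sketch of one such gradient-ascent update in PyTorch (not the authors' code; the policy network, optimizer, and rollout tensors below are assumed placeholders):

```python
import torch

# Placeholders: `policy` maps a batch of states to action logits,
# `optimizer` is e.g. torch.optim.Adam over the policy parameters theta,
# `states`, `actions`, `returns` are tensors collected from one rollout.
def policy_gradient_step(policy, optimizer, states, actions, returns):
    log_probs = torch.log_softmax(policy(states), dim=-1)         # log pi(a|s, theta)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Ascent on R(theta) = E[sum_t r_t]  <=>  descent on -E[log pi(a|s) * return]
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```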

Reinforcement Learning preliminaries

Advantage Actor-Critic Method:

Gradients:

\nabla_\theta R = \mathbb{E}_{\pi(\theta)} \nabla_\theta \log \pi(a|s,\theta)A(a, s)

where \( A(a, s) \) is the advantage function

The A3C method was used in this work

For more details, see: https://arxiv.org/pdf/1602.01783.pdf
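A hedged sketch of this advantage-weighted gradient, using a one-step advantage estimate \(A(a,s) \approx r + \gamma V(s') - V(s)\) (A3C itself uses n-step returns and asynchronous workers; names below are illustrative):

```python
import torch
import torch.nn.functional as F

# `logits`, `values`, `next_values` come from the policy and value heads (placeholders).
def actor_critic_loss(logits, actions, rewards, values, next_values, gamma=0.99):
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = rewards + gamma * next_values - values            # A(a, s) estimate
    policy_loss = -(chosen * advantage.detach()).mean()           # E[grad log pi * A]
    value_loss = F.mse_loss(values, (rewards + gamma * next_values).detach())
    return policy_loss + 0.5 * value_loss
```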

Problem Description

Commonly used exploration strategies:

  • \(\epsilon\)-greedy strategy

Take the action \(a_t = \arg\max_a \pi(a | s_t)\) with probability \( 1 - \epsilon \),

or take a random action with probability \(\epsilon\)

  • Boltzmann strategy

Take an action \(a_t\) with probability proportional to \( \exp(\pi(a_t|s_t) / T) \)
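A small sketch of both strategies, assuming `probs` is the policy's probability vector \(\pi(\cdot|s_t)\) over discrete actions (function names are illustrative):

```python
import torch

def epsilon_greedy(probs, epsilon=0.1):
    # With probability 1 - epsilon take argmax_a pi(a|s_t), otherwise a uniformly random action
    if torch.rand(1).item() < epsilon:
        return torch.randint(len(probs), (1,)).item()
    return probs.argmax().item()

def boltzmann(probs, temperature=1.0):
    # Sample an action with probability proportional to exp(pi(a|s_t) / T)
    weights = torch.softmax(probs / temperature, dim=-1)
    return torch.multinomial(weights, 1).item()
```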

Problem Description

These exploration strategies are not efficient if rewards are sparse

It is almost impossible to perform a long, complex sequence of actions

by random exploration

Curiosity-driven Exploration

A good exploration strategy should

encourage the agent to:

  • Explore 'novel' states
  • Perform actions that reduce the uncertainty about the environment

Curiosity-driven Exploration

"If you don't praise yourself, nobody will" (Russian proverb)

Curiosity-driven Exploration

\(r_t = r^i_t + r^e_t\) - the total reward

\(r^i_t\) - the intrinsic reward

\(r^e_t\) - the extrinsic reward (mostly, if not always, zero)

Curiosity-driven Exploration

The agent is composed of two modules:

  • A generator of the intrinsic reward \(r_t^i\)

  • A policy \(\pi(\theta)\) that outputs actions

Prediction error as curiosity reward

\(r_t^i\) is based on how hard it is for the agent to predict the consequences of its own actions

We need a model of the environmental dynamics that predicts \(s_{t+1}\) given \(s_t\) and \(a_t\)

Prediction error as curiosity reward

Prediction in the raw state space \(\mathcal{S}\)

is not efficient:

not all changes in the environment depend on the agent's actions

or

affect the agent

Prediction error as curiosity reward

Inverse Dynamics Model

Transforms the raw state representation \(s_t\) into features \(\phi(s_t)\) that depend only on the parts of \(s_t\) that the agent can control

Forward Dynamics Model

Tries to predict \(\phi(s_{t+1})\) given \(\phi(s_t)\) and \(a_t\)

Inverse dynamics model

Diagram: \(s_t, s_{t+1} \rightarrow \phi(s_t), \phi(s_{t+1}) \rightarrow \hat{a}_t\)

\hat{a}_t = g(s_{t+1}, s_t, \theta_I)
\min_{\theta_I} L_I(\hat{a}_t, a_t)

\(L_I\) - the cross-entropy loss
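A minimal PyTorch sketch of the inverse model \(g\): a shared encoder \(\phi\) embeds both frames, the concatenated features predict the action, and \(L_I\) is the cross-entropy (layer sizes here are placeholders; the exact architecture is given in the training details later):

```python
import torch
import torch.nn as nn

class InverseModel(nn.Module):
    def __init__(self, encoder, feat_dim, n_actions):
        super().__init__()
        self.encoder = encoder                       # phi(.), shared with the forward model
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),               # logits for a_hat_t
        )

    def forward(self, s_t, s_next):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        logits = self.head(torch.cat([phi_t, phi_next], dim=-1))
        return logits, phi_t, phi_next

# L_I: loss_I = nn.functional.cross_entropy(logits, a_t)
```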

Forward dynamics model

Diagram: \(\phi(s_t), a_t \rightarrow \hat{\phi}(s_{t+1})\)

\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t, \theta_F)
\min_{\theta_F} L_F(\hat{\phi}(s_{t+1}), \phi(s_{t+1}))
L_F(\hat{\phi}_{t+1}, \phi_{t+1}) = \frac{1}{2}||\hat{\phi}_{t+1} - \phi_{t+1}||^2_2
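A matching sketch of the forward model \(f\), predicting \(\hat{\phi}(s_{t+1})\) from \(\phi(s_t)\) and a one-hot encoding of \(a_t\) (sizes again illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),                # predicted phi(s_{t+1})
        )

    def forward(self, phi_t, a_t):
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        return self.net(torch.cat([phi_t, a_onehot], dim=-1))

# L_F: loss_F = 0.5 * (phi_hat_next - phi_next.detach()).pow(2).sum(-1).mean()
```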

Intrinsic Curiosity Module (ICM)

r^i_t = \frac{\eta}{2}||\hat{\phi}_{t+1} - \phi_{t+1}||^2_2

\(\eta > 0\) - the tradeoff between extrinsic and intrinsic reward

\(r_t = r_t^i + r^e_t\) - the total reward
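In code, the intrinsic reward is just this scaled prediction error, computed without gradients (a sketch reusing the forward model above; the default \(\eta\) is an arbitrary example value):

```python
import torch

def intrinsic_reward(forward_model, phi_t, phi_next, a_t, eta=0.01):
    # r^i_t = eta/2 * ||phi_hat_{t+1} - phi_{t+1}||_2^2
    with torch.no_grad():
        phi_hat_next = forward_model(phi_t, a_t)
        return 0.5 * eta * (phi_hat_next - phi_next).pow(2).sum(dim=-1)

# Total reward passed to the policy: r_t = intrinsic_reward(...) + r_ext
```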

Optimization problem

L(\theta) = -\lambda \mathbb{E}_{\pi(\theta_P)} \big[ \sum_t r_t \big] + \beta L_F + (1-\beta)L_I \rightarrow \min_\theta

\(\theta = \{\theta_P, \theta_I, \theta_F\}\)

\(\lambda > 0\) - the tradeoff between the policy gradient loss and intrinsic reward learning

\(0 \le \beta \le 1\) - the tradeoff between forward and inverse dynamics model learning
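A sketch of how the three terms could be combined into one scalar loss (the \(\lambda\) and \(\beta\) defaults below are illustrative, not taken from the slides):

```python
# `policy_loss` is the actor-critic loss standing in for -E[sum_t r_t],
# `loss_F` and `loss_I` are the forward and inverse losses defined above.
def total_loss(policy_loss, loss_F, loss_I, lam=0.1, beta=0.2):
    # L(theta) = lambda * policy_loss + beta * L_F + (1 - beta) * L_I
    return lam * policy_loss + beta * loss_F + (1.0 - beta) * loss_I
```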

Experimental Setup

Environments

Doom 3-D navigation task

  • DoomMyWayHome-v0 from OpenAI Gym
  • Actions: left, right, forward and no-action
  • Termination after reaching the goal or 2100 timesteps
  • Sparse reward: +1 on reaching the goal and zero otherwise


Experimental Setup

Environments

Super Mario Bros

  • Training on the first level; generalization is tested on the three subsequent levels
  • 14 unique actions, following:

https://github.com/ppaquette/gym-super-mario

Experimental Setup

Training Details

  • RGB images \(\rightarrow\) grey-scale 42x42 images
  • \(s_t\) \(\leftarrow\) tensor of the 4 last frames
  • A3C with 12 workers and the ADAM optimizer (optimizer parameters not shared across workers)
  • Action repeat of 4 during Doom training
  • Action repeat of 6 during Super Mario training
  • No action repeat at test time
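A sketch of this preprocessing pipeline (grey-scale 42x42, a stack of the 4 last frames, action repeat); `cv2` and the old 4-tuple `gym` step API are assumed, and the helper names are made up:

```python
import collections
import numpy as np
import cv2  # assumed available for color conversion and resizing

def preprocess(rgb_frame):
    # RGB image -> grey-scale 42x42, scaled to [0, 1]
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (42, 42)).astype(np.float32) / 255.0

# s_t = tensor of the 4 last frames (assumed pre-filled at episode start)
frames = collections.deque(maxlen=4)

def step_with_action_repeat(env, action, repeat=4):
    # Repeat the chosen action (4 for Doom, 6 for Super Mario, 1 at test time)
    total_reward, done = 0.0, False
    for _ in range(repeat):
        obs, reward, done, _ = env.step(action)   # old gym 4-tuple API
        total_reward += reward
        if done:
            break
    frames.append(preprocess(obs))
    return np.stack(frames), total_reward, done   # stacked frames approximate s_t
```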

Experimental Setup

Training Details

Policy and Value network

\(s_t \rightarrow\) 4 Conv Layers:

  • 32 filters
  • 3x3 kernel size
  • stride = 2
  • padding = 1
  • ELU nonlinearity

Conv output \(\rightarrow\) LSTM:

  • 256 units

Then two separate fully-connected layers for \(\pi(a_t|s_t)\) and \(V(s_t)\)
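A PyTorch sketch of this policy/value network, reduced to a single LSTM step (an interpretation of the slide, not the authors' code):

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        layers, ch = [], in_channels
        for _ in range(4):          # 4 conv layers: 32 filters, 3x3 kernel, stride 2, padding 1, ELU
            layers += [nn.Conv2d(ch, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            ch = 32
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTMCell(32 * 3 * 3, 256)   # 42x42 input -> 3x3x32 = 288 conv features
        self.pi = nn.Linear(256, n_actions)        # logits for pi(a_t|s_t)
        self.v = nn.Linear(256, 1)                 # V(s_t)

    def forward(self, s_t, hx, cx):
        x = self.conv(s_t).flatten(start_dim=1)
        hx, cx = self.lstm(x, (hx, cx))
        return self.pi(hx), self.v(hx), (hx, cx)
```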

Experimental Setup

Training Details

\mathrm{ELU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha(\exp(x) - 1), & \text{otherwise} \end{cases}


Experimental Setup

Training Details

Intrinsic Curiosity Module (ICM)

\(s_t\) is converted to \(\phi(s_t)\) using:

  • 4 Conv Layers
  • 32 filters each
  • 3x3 kernel size
  • stride = 2
  • padding = 1
  • ELU nonlinearity

\(\phi(s_t)\) dimensionality is 288
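A sketch of this encoder; with 42x42 inputs the four stride-2 convolutions yield a 3x3x32 map, i.e. the 288-dimensional \(\phi(s_t)\) above (4 input channels from the frame stack are assumed):

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    # phi(s_t): 4 conv layers, 32 filters each, 3x3 kernel, stride 2, padding 1, ELU
    def __init__(self, in_channels=4):
        super().__init__()
        layers, ch = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(ch, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            ch = 32
        self.conv = nn.Sequential(*layers)

    def forward(self, s_t):
        return self.conv(s_t).flatten(start_dim=1)   # [batch, 288] for 42x42 inputs
```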

Experimental Setup

Training Details

Intrinsic Curiosity Module (ICM)

Inverse Model

Forward Model

\(\phi_t\) and \(\phi_{t+1}\) are concatenated and passed through an FC layer of 256 units followed by an output softmax layer for \(\hat{a}_t\) prediction

\(a_t\) and \(\phi_{t}\) are concatenated and passed through an FC layer of 256 units followed by an FC layer of 288 units for \(\hat{\phi}_{t+1}\) prediction

Experiments

Doom: DENSE reward setting

Experiments

Doom: SPARSE reward setting

Experiments

Doom: VERY SPARSE reward setting

Experiments

Doom: Robustness to noise

Input with a noise sample:

Experiments

Doom: Robustness to noise

Experiments

Comparison to TRPO-VIME

For the SPARSE Doom reward setting

Method          Mean (Median) score at convergence
TRPO            26.0% (0.0%)
A3C             0.0% (0.0%)
VIME + TRPO     46.1% (27.1%)
ICM + A3C       100% (100%)

Experiments

NO REWARD setting and GENERALIZATION

Experiments

Fine-tuning for unseen scenarios

Links

Thanks for your attention!
