Curiosity-driven Exploration
by Self-Supervised
Prediction
Report by
Pavel Temirchev
Deep RL reading group
Reinforcement Learning preliminaries
The agent-environment interaction loop: the agent takes an ACTION; the environment returns an OBSERVATION and a REWARD.
Reinforcement Learning preliminaries
Markov Decision Process:
- \(\mathcal{A}\) - the set of actions
- \(\mathcal{S}\) - the set of states of the environment
- \(r(s_t, a_t)\) - reward function
- \(\pi(a_t | s_t)\) - the policy
- \(P(s_{t+1} | s_t, a_t)\) - transition probability
Reinforcement Learning preliminaries
Policy Gradient Methods:
We want to maximize the expected return:
\( J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_t \gamma^t r_t \right] \)
We parametrize the policy: \( \pi_\theta(a_t | s_t) \)
Maximization is done by gradient ascent:
\( \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \)
Reinforcement Learning preliminaries
Advantage Actor-Critic Method:
Gradients:
\( \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \, A(s_t, a_t) \right] \)
where \( A(s_t, a_t) \) is the advantage function, estimated as
\( A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t) \)
The asynchronous version of this method, A3C, was used in this work (a loss sketch is shown below)
For more see: https://arxiv.org/pdf/1602.01783.pdf
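Below is a minimal PyTorch sketch of the advantage actor-critic loss (not the authors' code); the tensor shapes, the 0.5 value-loss weight, and the entropy bonus are illustrative assumptions:

```python
import torch.nn.functional as F

def a2c_loss(logits, values, actions, returns, entropy_coef=0.01):
    """One-step advantage actor-critic loss for a batch of transitions.

    logits  - unnormalized action scores from the policy head, shape (T, n_actions)
    values  - critic estimates V(s_t), shape (T,)
    actions - actions actually taken (int64), shape (T,)
    returns - bootstrapped returns r_t + gamma * V(s_{t+1}), shape (T,)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Advantage A(s_t, a_t) = R_t - V(s_t); detached so the critic is
    # trained only through the value-loss term below.
    advantages = (returns - values).detach()

    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen_log_probs * advantages).mean()   # gradient ascent on J(theta)
    value_loss = F.mse_loss(values, returns)                # critic regression target
    entropy = -(probs * log_probs).sum(dim=-1).mean()       # encourages exploration

    return policy_loss + 0.5 * value_loss - entropy_coef * entropy
```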
Problem Description
Commonly used exploration strategies:
- \(\epsilon\)-greedy strategy
Take the action \(a_t = \arg\max_a \pi(a | s_t)\) with probability \( 1 - \epsilon \),
or take a random action with probability \(\epsilon\)
- Boltzmann strategy
Take an action \(a_t\) with probability proportional to \( \exp(\pi(a_t|s_t) / T) \) (both strategies are sketched below)
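A minimal NumPy sketch of both strategies (illustrative only; the default \(\epsilon\) and temperature values are assumptions):

```python
import numpy as np

def epsilon_greedy(action_scores, epsilon=0.1):
    """Pick the argmax action with probability 1 - epsilon, otherwise a random one."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(action_scores)))
    return int(np.argmax(action_scores))

def boltzmann(action_scores, temperature=1.0):
    """Sample an action with probability proportional to exp(score / T)."""
    scores = np.asarray(action_scores, dtype=np.float64) / temperature
    scores -= scores.max()                           # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(np.random.choice(len(probs), p=probs))
```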
Problem Description
These exploration strategies are not efficient if rewards are sparse:
it is almost impossible to stumble upon a long, complex sequence of actions
by random exploration
Curiosity-driven Exploration
A good exploration strategy should encourage the agent to:
- Explore 'novel' states
- Perform actions that reduce the uncertainty about the environment
Curiosity-driven Exploration
"If you don't praise yourself, no one else will" - the agent has to generate its own reward
Curiosity-driven Exploration
- \(r_t^i\) - the intrinsic reward, generated by the agent itself
- \(r_t^e\) - the extrinsic reward from the environment (mostly, if not always, zero)
- \(r_t = r_t^i + r_t^e\) - the total reward that the policy is trained to maximize
Curiosity-driven Exploration
The agent is composed of two modules:
- a generator of the intrinsic reward \(r_t^i\)
- a policy \(\pi_\theta\) that outputs actions
Prediction error as curiosity reward
\(r_t^i\) is based on how hard it is for the agent to predict the consequences of its own actions
We need a model of the environment's dynamics that predicts \(s_{t+1}\) given \(s_t\) and \(a_t\)
Prediction error as curiosity reward
Prediction in the raw state space \(\mathcal{S}\) is not efficient:
not all changes in the environment depend on the agent's actions
or affect the agent
Prediction error as curiosity reward
Inverse Dynamics Model:
transforms the raw state representation \(s_t\) into features \(\phi(s_t)\) that depend only on those parts of \(s_t\) that can be controlled by, or affect, the agent (it is trained to predict \(a_t\) from \(\phi(s_t)\) and \(\phi(s_{t+1})\))
Forward Dynamics Model:
tries to predict \(\phi(s_{t+1})\) given \(\phi(s_t)\) and \(a_t\)
Inverse dynamics model loss - cross-entropy between predicted and actual actions:
\( L_I = \mathrm{CE}\big(\hat{a}_t, a_t\big), \quad \hat{a}_t = g\big(\phi(s_t), \phi(s_{t+1}); \theta_I\big) \)
Forward dynamics model loss - squared error in the feature space:
\( L_F = \tfrac{1}{2} \big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|_2^2, \quad \hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t; \theta_F\big) \)
The intrinsic reward is the scaled forward prediction error:
\( r_t^i = \tfrac{\eta}{2} \big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|_2^2 \)
Intrinsic Curiosity Module (ICM)
- \(\eta > 0\) - the tradeoff between extrinsic and intrinsic reward
- \(r_t = r_t^i + r_t^e\) - total reward
Optimization problem:
\( \min_{\theta_P, \theta_I, \theta_F} \Big[ -\lambda \, \mathbb{E}_{\pi(s_t; \theta_P)} \big[ \textstyle\sum_t r_t \big] + (1 - \beta) L_I + \beta L_F \Big] \)
- \(\lambda > 0\) - the tradeoff between the policy gradient loss and intrinsic reward learning
- \(\beta \in [0, 1]\) - the tradeoff between forward and inverse dynamics model learning
(a combined-loss sketch is shown below)
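A minimal sketch of how the three terms can be combined in code, assuming the per-batch losses are already computed; the default values of \(\lambda\) and \(\beta\) below are illustrative assumptions:

```python
def icm_total_loss(policy_loss, inverse_loss, forward_loss,
                   lambda_coef=0.1, beta=0.2):
    """Weighted sum of the three terms of the optimization problem above.

    lambda_coef - tradeoff between the policy gradient loss and intrinsic reward learning
    beta        - tradeoff between the forward and inverse dynamics model learning
    (the default values are illustrative, not taken from the slides)
    """
    return lambda_coef * policy_loss + (1.0 - beta) * inverse_loss + beta * forward_loss
```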
Experimental Setup
Environments
Doom 3-D navigation task
- DoomMyWayHome-v0 from OpenAI Gym
- Actions: left, right, forward and no-action
- Termination after reaching the goal or 2100 timesteps
- Sparse reward: +1 on reaching the goal and zero otherwise
Experimental Setup
Environments
Super Mario Bros
- Training on the first level, then testing generalization on three subsequent levels
- The action space is reparametrized into 14 unique actions
Experimental Setup
Training Details
- RGB images \(\rightarrow\) grey-scale 42x42 images
- \(s_t\) \(\leftarrow\) a tensor of the 4 last frames (see the preprocessing sketch after this list)
- A3C was used with 12 workers and the ADAM optimizer (optimizer statistics not shared across workers)
- Action repeat of 4 during Doom training
- Action repeat of 6 during Super Mario training
- No action repeat at test time
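A preprocessing sketch for the first two bullets, assuming OpenCV is available for colour conversion and resizing; the normalization and frame-stacking details are assumptions (action repeat would be applied in the environment step loop):

```python
import collections
import numpy as np
import cv2  # assumed available for colour conversion and resizing

class Preprocessor:
    """Converts RGB frames to 42x42 grey-scale images and stacks the 4 most recent ones."""

    def __init__(self, num_frames=4, size=(42, 42)):
        self.size = size
        self.frames = collections.deque(maxlen=num_frames)

    def reset(self, first_frame):
        small = self._to_grey(first_frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(small)
        return self.state()

    def step(self, frame):
        self.frames.append(self._to_grey(frame))
        return self.state()

    def state(self):
        # s_t is a tensor of the 4 last frames, shape (4, 42, 42)
        return np.stack(self.frames, axis=0)

    def _to_grey(self, frame):
        grey = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(grey, self.size).astype(np.float32) / 255.0
```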
Experimental Setup
Training Details
Policy and Value
\(s_t \rightarrow\) 4 Conv Layers:
- 32 filters
- 3x3 kernel size
- stride = 2
- padding = 1
- ELU nonlinearity
Conv output \(\rightarrow\) LSTM:
- 256 units
Then two separate fully-connected layers predict \(\pi(a_t|s_t)\) and \(V(s_t)\) (a network sketch is shown below)
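A PyTorch sketch of this policy/value network; it follows the layer sizes listed above, while the use of LSTMCell and the default number of actions are assumptions:

```python
import torch.nn as nn

class A3CPolicy(nn.Module):
    """Four 3x3 stride-2 conv layers with ELU -> LSTM(256) -> policy and value heads."""

    def __init__(self, in_channels=4, n_actions=4):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(channels, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            channels = 32
        self.encoder = nn.Sequential(*layers)   # a 42x42 input ends up as 32 x 3 x 3 = 288 features
        self.lstm = nn.LSTMCell(288, 256)
        self.pi = nn.Linear(256, n_actions)     # logits of pi(a_t | s_t)
        self.v = nn.Linear(256, 1)              # state value V(s_t)

    def forward(self, s_t, hidden=None):
        x = self.encoder(s_t).flatten(start_dim=1)
        h, c = self.lstm(x, hidden)
        return self.pi(h), self.v(h), (h, c)
```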
Experimental Setup
Training Details
Intrinsic Curiosity Module (ICM)
\(s_t\) converted to \(\phi(s_t)\) using:
- 4 Conv Layers
- 32 filters each
- 3x3 kernel size
- stride = 2
- padding = 1
- ELU nonlinearity
\(\phi(s_t)\) dimensionality is 288
Experimental Setup
Training Details
Intrinsic Curiosity Module (ICM)
Inverse Model:
\(\phi_t\) and \(\phi_{t+1}\) are concatenated and passed through an FC layer of 256 units followed by an output softmax layer for \(\hat{a}_t\) prediction
Forward Model:
\(a_t\) and \(\phi_t\) are concatenated and passed through an FC layer of 256 units followed by an FC layer of 288 units for \(\hat{\phi}_{t+1}\) prediction
(a sketch of the full module is shown below)
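A PyTorch sketch of the whole module with the sizes above; the nonlinearities inside the FC stacks, the \(\eta\) value, and the detaching of targets are assumptions not specified on the slides:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Feature encoder phi + inverse model (predicts a_t) + forward model (predicts phi_{t+1})."""

    def __init__(self, in_channels=4, n_actions=4, feat_dim=288, eta=0.01):
        super().__init__()
        convs, channels = [], in_channels
        for _ in range(4):
            convs += [nn.Conv2d(channels, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            channels = 32
        self.encoder = nn.Sequential(*convs)       # phi(s_t); 288-dim for 42x42 inputs
        self.inverse_model = nn.Sequential(        # (phi_t, phi_{t+1}) -> a_t logits
            nn.Linear(2 * feat_dim, 256), nn.ELU(), nn.Linear(256, n_actions))
        self.forward_model = nn.Sequential(        # (phi_t, a_t) -> predicted phi_{t+1}
            nn.Linear(feat_dim + n_actions, 256), nn.ELU(), nn.Linear(256, feat_dim))
        self.n_actions, self.eta = n_actions, eta

    def forward(self, s_t, s_next, a_t):
        phi_t = self.encoder(s_t).flatten(start_dim=1)
        phi_next = self.encoder(s_next).flatten(start_dim=1)
        a_onehot = F.one_hot(a_t, self.n_actions).float()

        # Inverse dynamics: cross-entropy loss shapes the feature space phi.
        logits = self.inverse_model(torch.cat([phi_t, phi_next], dim=1))
        inverse_loss = F.cross_entropy(logits, a_t)

        # Forward dynamics: squared error in feature space (target detached here by choice).
        phi_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=1))
        forward_loss = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=1).mean()

        # Intrinsic reward r_t^i = eta/2 * ||phi_pred - phi_next||^2, not backpropagated.
        r_int = self.eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=1).detach()
        return r_int, inverse_loss, forward_loss
```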
Experiments
Doom: DENSE reward setting
Experiments
Doom: SPARSE reward setting
Experiments
Doom: VERY SPARSE reward setting
Experiments
Doom: Robustness to noise
Input with a noise sample:
Experiments
Doom: Robustness to noise
Experiments
Comparison with VIME + TRPO
For the SPARSE Doom reward setting
| Method | Mean (Median) Score at convergence |
|---|---|
| TRPO | 26.0% (0.0%) |
| A3C | 0.0% (0.0%) |
| VIME + TRPO | 46.1% (27.1%) |
| ICM + A3C | 100% (100%) |
Experiments
NO REWARD setting and GENERALIZATION
Experiments
Fine-tuning for unseen scenarios
Links
Blog (video is here): https://pathak22.github.io/noreward-rl/
Article: https://arxiv.org/abs/1705.05363
Thanks for your attention!