Report by Pavel Temirchev
Deep RL reading group
Reinforcement Learning preliminaries
The agent-environment interaction loop: the AGENT sends an ACTION to the ENVIRONMENT, which returns an OBSERVATION and a REWARD
Reinforcement Learning preliminaries
Markov Decision Process:
\(\mathcal{A}\) - the set of actions
\(\mathcal{S}\) - the set of states of the environment
\(r(s_t, a_t)\) - the reward function
\(\pi(a_t | s_t)\) - the policy
\(p(s_{t+1} | s_t, a_t)\) - the transition probability
Reinforcement Learning preliminaries
Policy Gradient Methods:
We want to maximize the expected return:
\( J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_t \gamma^t r_t \right] \)
We parametrize the policy: \( \pi_\theta(a_t | s_t) \)
Maximization is done by gradient ascent:
\( \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \, R_t \right], \qquad R_t = \sum_{k \ge t} \gamma^{k-t} r_k \)
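As an illustration of this update, a minimal REINFORCE-style sketch in PyTorch; `policy_net` (a hypothetical network mapping states to action logits) and the Monte-Carlo `returns` are assumed placeholders, not code from the paper.

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, returns):
    # log pi_theta(a_t | s_t) for the actions that were actually taken
    logits = policy_net(states)                          # shape [T, n_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Monte-Carlo estimate of -J(theta); minimizing it performs gradient ascent on J(theta)
    loss = -(log_pi * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```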
Reinforcement Learning preliminaries
Advantage Actor-Critic Method:
The return \(R_t\) is replaced with the advantage function \( A(a, s) = Q(s, a) - V(s) \), which reduces the variance of the gradient estimate
Gradients:
\( \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \, A(s_t, a_t) \right] \)
The A3C method was used in this work
For more see: https://arxiv.org/pdf/1602.01783.pdf
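For reference, a single synchronous advantage-actor-critic update could be sketched as below; A3C additionally runs many such workers asynchronously, and the loss coefficients here are common defaults, not values taken from the paper.

```python
import torch

def a2c_loss(logits, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    # Advantage estimate A(s_t, a_t) ~ R_t - V(s_t)
    advantages = returns - values.squeeze(-1)

    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Actor: policy-gradient term; detach() keeps the critic out of this gradient
    policy_loss = -(log_pi * advantages.detach()).mean()
    # Critic: squared error of the value estimate
    value_loss = advantages.pow(2).mean()
    # Entropy bonus encourages exploration (as in A3C)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```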
Problem Description
Commonly used exploration strategies:
\(\epsilon\)-greedy: take the action \(a_t = \arg\max_a \pi(a | s_t)\) with probability \( 1 - \epsilon \),
or take a random action with probability \(\epsilon\)
Boltzmann (softmax): take action \(a_t\) with probability \( \propto \exp(\pi(a_t|s_t) / T) \), where \(T\) is a temperature
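In code, these two standard strategies could look like the following numpy sketch, where `pi` stands for the policy's action probabilities in the current state.

```python
import numpy as np

def epsilon_greedy(pi, epsilon=0.1):
    """With probability 1 - epsilon take argmax_a pi(a|s), otherwise a uniformly random action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(pi))
    return int(np.argmax(pi))

def boltzmann(pi, temperature=1.0):
    """Sample a_t with probability proportional to exp(pi(a_t|s_t) / T)."""
    logits = pi / temperature
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(pi), p=probs))
```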
Problem Description
These exploration strategies are not efficient when rewards are sparse:
it is almost impossible to stumble upon a long, complex sequence of actions
by random exploration alone
Curiosity-driven Exploration
A good exploration strategy should encourage the agent to:
- explore novel states
- reduce its uncertainty about the consequences of its own actions
Curiosity-driven Exploration
"If you don't praise yourself, no one will" (Russian proverb): the agent has to reward itself
Curiosity-driven Exploration
\(r_t^i\) - the intrinsic reward, generated by the agent itself
\(r_t^e\) - the extrinsic reward from the environment (mostly, if not always, zero)
\(r_t = r_t^i + r_t^e\) - the total reward
Curiosity-driven Exploration
The agent is composed of two modules:
- Generator of the intrinsic reward \(r_t^i\)
- Policy \(\pi(\theta)\) that outputs actions
Prediction error as curiosity reward
\(r_t^i\) is based on how hard it is for the agent to predict the consequences of its own actions
We need a model of the environmental dynamics that predicts \(s_{t+1}\) given \(s_t\) and \(a_t\)
Prediction error as curiosity reward
Prediction in the raw state space \(\mathcal{S}\) is not efficient:
not all changes in the environment depend on the agent's actions
or affect the agent
Prediction error as curiosity reward
Inverse Dynamics Model
transforms the raw state representation \(s_t\) into features \(\phi(s_t)\) that depend only on the parts of \(s_t\) the agent can control or that affect the agent
Forward Dynamics Model
tries to predict \(\phi(s_{t+1})\) given \(\phi(s_t)\) and \(a_t\)
Inverse dynamics model: \( \hat{a}_t = g\left( \phi(s_t), \phi(s_{t+1}); \theta_I \right) \)
\( L_I\left( \hat{a}_t, a_t \right) \) - cross entropy between the predicted and the true action
Forward dynamics model: \( \hat{\phi}(s_{t+1}) = f\left( \phi(s_t), a_t; \theta_F \right) \)
\( L_F = \frac{1}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|_2^2 \)
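A minimal PyTorch sketch of these two losses; `encoder`, `inverse_model` and `forward_model` are hypothetical modules standing for \(\phi\), \(g(\cdot; \theta_I)\) and \(f(\cdot; \theta_F)\), and detaching the forward-model target is one common design choice rather than necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def dynamics_losses(encoder, inverse_model, forward_model, s_t, s_tp1, a_t):
    phi_t, phi_tp1 = encoder(s_t), encoder(s_tp1)        # phi(s_t), phi(s_{t+1})

    # Inverse model g(.; theta_I): predict a_t from both feature vectors; L_I is cross entropy
    a_logits = inverse_model(torch.cat([phi_t, phi_tp1], dim=1))
    L_I = F.cross_entropy(a_logits, a_t)

    # Forward model f(.; theta_F): predict phi(s_{t+1}) from phi(s_t) and a one-hot a_t
    a_onehot = F.one_hot(a_t, a_logits.shape[1]).float()
    phi_hat_tp1 = forward_model(torch.cat([phi_t, a_onehot], dim=1))
    forward_error = 0.5 * (phi_hat_tp1 - phi_tp1.detach()).pow(2).sum(dim=1)
    L_F = forward_error.mean()

    # forward_error is reused as the curiosity signal r_t^i introduced below
    return L_I, L_F, forward_error
```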
Intrinsic Curiosity Module (ICM)
\( r_t^i = \frac{\eta}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|_2^2 \), where \(\eta > 0\) is the tradeoff between extrinsic and intrinsic reward
\( r_t = r_t^e + r_t^i \) - the total reward
Optimization problem
\( \min_{\theta_P, \theta_I, \theta_F} \left[ -\lambda \, \mathbb{E}_{\pi(s_t; \theta_P)} \left[ \textstyle\sum_t r_t \right] + (1 - \beta) L_I + \beta L_F \right] \)
\(\lambda\) - the tradeoff between the policy gradient loss and learning the intrinsic reward
\(\beta\) - the tradeoff between the forward and the inverse dynamics model losses
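Combined into one scalar, the objective could be sketched as below; \(\beta = 0.2\) and \(\lambda = 0.1\) are the values reported in the paper, and `policy_loss` stands for an estimate of \(-\mathbb{E}[\sum_t r_t]\) (e.g. the actor-critic loss sketched earlier).

```python
def icm_objective(policy_loss, L_I, L_F, beta=0.2, lam=0.1):
    # min over theta_P, theta_I, theta_F of:
    #   -lambda * E[sum_t r_t] + (1 - beta) * L_I + beta * L_F
    return lam * policy_loss + (1.0 - beta) * L_I + beta * L_F
```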
Experimental Setup
Environments
Doom 3-D navigation task
Experimental Setup
Environments
Super Mario Bros
Experimental Setup
Training Details
Policy and Value
\(s_t \rightarrow\) 4 Conv layers: 32 filters each, 3×3 kernels, stride 2, ELU activations
Conv output \(\rightarrow\) LSTM: 256 units
Then two separate fully-connected layers for \(\pi(a_t|s_t)\) and \(V(s_t)\)
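A PyTorch sketch of this actor-critic network; the sizes (32 filters, 3×3 kernels, stride 2, ELU, a 256-unit LSTM) follow the details above, while the 42×42 input resolution (which makes the conv output 3×3×32 = 288) and the padding are assumptions taken from the paper's training setup.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, in_channels, n_actions):
        super().__init__()
        # 4 conv layers: 32 filters, 3x3 kernels, stride 2, ELU after each
        convs, c = [], in_channels
        for _ in range(4):
            convs += [nn.Conv2d(c, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            c = 32
        self.conv = nn.Sequential(*convs)
        self.lstm = nn.LSTMCell(32 * 3 * 3, 256)     # 42x42 input -> 3x3x32 = 288 features
        self.pi = nn.Linear(256, n_actions)          # policy head: logits of pi(a_t | s_t)
        self.v = nn.Linear(256, 1)                   # value head: V(s_t)

    def forward(self, s_t, hidden=None):
        x = self.conv(s_t).flatten(start_dim=1)
        h, c = self.lstm(x, hidden)
        return self.pi(h), self.v(h), (h, c)
```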
Experimental Setup
Training Details
Intrinsic Curiosity Module (ICM)
\(s_t\) is converted to \(\phi(s_t)\) using 4 Conv layers (32 filters each, 3×3 kernels, stride 2, ELU)
\(\phi(s_t)\) dimensionality is 288
Experimental Setup
Training Details
Intrinsic Curiosity Module (ICM)
Inverse Model
Forward Model
\(\phi_t\) and \(\phi_{t+1}\) are concatenated and passed through an FC layer of 256 units followed by an output softmax layer for \(\hat{a}_t\) prediction
\(a_t\) and \(\phi_{t}\) are concatenated and passed through an FC layer of 256 units followed by an FC layer of 288 units for \(\hat{\phi}_{t+1}\) prediction
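Putting the pieces together, the whole ICM could look like the sketch below; the feature encoder mirrors the policy's conv stack so that \(\phi(s_t)\) is 288-dimensional, and the ReLU inside the fully-connected heads is an assumption.

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    def __init__(self, in_channels, n_actions, feat_dim=288):
        super().__init__()
        # Feature encoder: 4 conv layers (32 filters, 3x3, stride 2, ELU) -> 288-d phi(s_t)
        convs, c = [], in_channels
        for _ in range(4):
            convs += [nn.Conv2d(c, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            c = 32
        self.encoder = nn.Sequential(*convs, nn.Flatten())

        # Inverse model: [phi_t, phi_{t+1}] -> FC(256) -> action logits (softmax gives a_hat_t)
        self.inverse_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))

        # Forward model: [phi_t, one-hot a_t] -> FC(256) -> FC(288) = phi_hat_{t+1}
        self.forward_head = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.n_actions = n_actions

    def forward(self, s_t, s_tp1, a_t):
        phi_t, phi_tp1 = self.encoder(s_t), self.encoder(s_tp1)
        a_logits = self.inverse_head(torch.cat([phi_t, phi_tp1], dim=1))
        a_onehot = nn.functional.one_hot(a_t, self.n_actions).float()
        phi_hat_tp1 = self.forward_head(torch.cat([phi_t, a_onehot], dim=1))
        return phi_t, phi_tp1, phi_hat_tp1, a_logits
```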
Experiments
Doom: DENSE reward setting
Experiments
Doom: SPARSE reward setting
Experiments
Doom: VERY SPARSE reward setting
Experiments
Doom: Robustness to noise
Input with noise sample:
Experiments
Doom: Robustness to noise
Experiments
Comparison with TRPO-VIME
For the SPARSE Doom reward setting
| Method | Mean (Median) Score at convergence |
|---|---|
| TRPO | 26.0% (0.0%) |
| A3C | 0.0% (0.0%) |
| VIME + TRPO | 46.1% (27.1%) |
| ICM + A3C | 100% (100%) |
Experiments
NO REWARD setting and GENERALIZATION
Experiments
Fine-tuning for unseen scenarios
Links
Blog (video is here):
Article: https://arxiv.org/abs/1705.05363
Thanks for your
attention!