We want our policy to take good actions that maximize the reward (exploitation).
But to know which actions are good, we need to explore the state space (exploration).
Idea: Add a reward for exploring new states.
r_t = r^e_t + r^i_t,
where r^e_t is the reward from the environment (extrinsic)
and r^i_t is a reward for exploration (intrinsic).
For example, r^i(s) = 1 / sqrt(N(s)), where N(s) is the number of times we've seen state s.
Bellemare et al. - Unifying count-based exploration and intrinsic motivation (2016)
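A minimal sketch of such a tabular count-based bonus, assuming discrete, hashable states and a Gym-style reset()/step() loop (both assumptions of the sketch, not the slides); the scale beta and the 1/sqrt(N) shape are illustrative choices.

```python
from collections import Counter

def count_based_bonus(counts: Counter, state, beta: float = 0.1) -> float:
    """Intrinsic reward that decays with the visit count N(s)."""
    counts[state] += 1
    return beta / counts[state] ** 0.5

# Usage inside a generic rollout loop (env and policy are hypothetical and
# assumed to expose a Gym-style interface with hashable observations):
# counts = Counter()
# state = env.reset()
# for _ in range(1000):
#     action = policy(state)
#     next_state, extrinsic_reward, done, info = env.step(action)
#     reward = extrinsic_reward + count_based_bonus(counts, next_state)
#     state = next_state
```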
Problem: Large state space.
Solution: Use function approximation! But how?
Neural networks are usually used as maps: state -> value.
Now we need a (multi)set: something that can tell us how many times each state has been seen.
Prediction error of various models has previously been used as intrinsic reward:
Stadie et al. - Incentivizing exploration in reinforcement learning with deep predictive models (2015)
Pathak et al. - Curiosity-driven exploration by self-supervised prediction (2017)
Fox et al. - Dora the explorer: Directed outreaching reinforcement action-selection (2018)
Two networks with the same architecture: a fixed, randomly initialized target network f and a trained predictor network f̂.
The predictor f̂ is updated in each visited state to match the target's output.
The prediction error is used as intrinsic reward: r^i(s) = ||f̂(s) - f(s)||^2.
Burda et al. - Exploration by random network distillation (2018)
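A minimal PyTorch sketch of this setup, assuming vector observations; the MLP architecture, output size, and learning rate are illustrative, not the hyperparameters from the paper.

```python
import torch
import torch.nn as nn

def make_net(obs_dim: int, out_dim: int = 64) -> nn.Module:
    # Same architecture for both networks.
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

obs_dim = 16                          # illustrative observation size
target = make_net(obs_dim)            # random, never trained
predictor = make_net(obs_dim)         # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Prediction error of the predictor vs. the fixed random target."""
    error = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
    # Update the predictor on every visited state.
    optimizer.zero_grad()
    error.mean().backward()
    optimizer.step()
    return error.detach()  # high error = novel state = high intrinsic reward

# Example: a batch of 4 observations.
rewards = intrinsic_reward(torch.randn(4, obs_dim))
```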
Sources of prediction error:
1. Amount of training data - the predictor has seen few similar examples (epistemic uncertainty).
2. Stochasticity - the target is inherently random (aleatoric uncertainty).
3. Model misspecification - the model class cannot fit the target.
4. Learning dynamics - optimization fails to find the best predictor in the model class.
Forward dynamics measures 1, but also 2 and 3.
RND measures 1, but not 2 and 3: the target network is deterministic and comes from the same model class as the predictor.
4 is hard to avoid.
Goal: Measure the uncertainty of our neural network's predictions (for regression).
Instead of one model, learn a distribution over models.
During inference, report mean and variance of prediction over the model distribution.
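One common approximation of "a distribution over models" is an ensemble of networks trained from different random initializations; a minimal PyTorch sketch on a toy 1-D regression problem (the data, ensemble size, and architecture are illustrative):

```python
import torch
import torch.nn as nn

def make_model() -> nn.Module:
    return nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Toy 1-D regression data (illustrative).
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)

# "Distribution over models" approximated by independently trained members.
ensemble = [make_model() for _ in range(5)]
for model in ensemble:
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

# At inference, report mean and variance of predictions across the ensemble.
x_test = torch.linspace(-2, 2, 9).unsqueeze(1)
with torch.no_grad():
    preds = torch.stack([model(x_test) for model in ensemble])
mean, var = preds.mean(dim=0), preds.var(dim=0)  # variance grows far from the data
```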
Start from a prior distribution p(θ).
Update using examples from the dataset D = {(x_i, y_i)}
to get a posterior distribution p(θ | D),
i.e. find p(θ | D) ∝ p(D | θ) p(θ).
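For a linear-Gaussian model this posterior is available in closed form; a small NumPy sketch of Bayesian linear regression, with illustrative prior scale and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear data y = X @ w_true + noise (illustrative).
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.1                                   # observation noise std
y = X @ w_true + sigma * rng.normal(size=20)

# Prior p(theta) = N(0, tau^2 I); the posterior p(theta | D) is also Gaussian.
tau = 1.0
prior_precision = np.eye(3) / tau**2
posterior_precision = prior_precision + X.T @ X / sigma**2
posterior_cov = np.linalg.inv(posterior_precision)
posterior_mean = posterior_cov @ (X.T @ y / sigma**2)

# Predictive mean and variance at a new input x_new.
x_new = rng.normal(size=3)
pred_mean = x_new @ posterior_mean
pred_var = x_new @ posterior_cov @ x_new + sigma**2
```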
Empirical results with DQN suggest that this approach translates to neural networks.
Osband et al. - Randomized prior functions for deep reinforcement learning (2018)
Randomized prior functions: each ensemble member trains a network f_θ added to a fixed network p sampled from the prior, fitting the sum f_θ + p to the (noisy) data.
Let the data be the zero function without noise, i.e. y_i = 0 and ε_i = 0. Then the optimization of the residual network becomes
θ* = argmin_θ Σ_i ||f_θ(x_i) + p(x_i)||²,
which is equivalent to training the predictor in RND when the target is sampled from the prior (we only need to flip the sign of the target).
The minimized error, averaged over an ensemble with shared weights, approximates the prediction uncertainty and is our intrinsic reward.
Osband et al. - Randomized prior functions for deep reinforcement learning (2018)
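A minimal PyTorch sketch of this ensemble-of-residual-networks view, assuming vector observations; for simplicity the members here do not share weights, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

def make_net(obs_dim: int, out_dim: int = 32) -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

obs_dim, n_members = 16, 5  # illustrative sizes
priors = [make_net(obs_dim) for _ in range(n_members)]     # fixed samples from the prior
residuals = [make_net(obs_dim) for _ in range(n_members)]  # trainable residual networks f_theta
for prior in priors:
    for p in prior.parameters():
        p.requires_grad_(False)
params = [p for net in residuals for p in net.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Residual error ||f_theta(s) + p(s)||^2, averaged over the ensemble."""
    errors = torch.stack([
        (f(obs) + p(obs)).pow(2).mean(dim=-1)
        for f, p in zip(residuals, priors)
    ])
    # Minimize the residual error in every visited state.
    optimizer.zero_grad()
    errors.mean().backward()
    optimizer.step()
    return errors.mean(dim=0).detach()  # approximate prediction uncertainty

rewards = intrinsic_reward(torch.randn(4, obs_dim))
```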
Better uncertainty measure!
Consistently finds 22/24 rooms in Montezuma's Revenge.
Speaker: Piotr Kozakowski
References:
Burda et al. - Exploration by random network distillation (2018)
Bellemare et al. - Unifying count-based exploration and intrinsic motivation (2016)
Stadie et al. - Incentivizing exploration in reinforcement learning with deep predictive models (2015)
Pathak et al. - Curiosity-driven exploration by self-supervised prediction (2017)
Fox et al. - Dora the explorer: Directed outreaching reinforcement action-selection (2018)
Osband et al. - Randomized prior functions for deep reinforcement learning (2018)