Continuous Prioritized Sweeping with Gaussian Processes
GOAL
- Sample-efficiency!
- Learning with fewer interactions.
- Data is in tuple form: <s,a,s',r>
MODEL FREE
- Experience replay (randomly feed collected data to the learning algorithm)
- Prioritize experience replay based on past TD errors
MODEL BASED
- Use the model to generate new data / back up the value function
- Prioritize based on past TD errors AND the learned dynamics
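A minimal sketch of the experience replay idea above; all names are illustrative, not from the slides:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores <s, a, s', r> tuples and replays them uniformly at random."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Randomly feed collected data back to the learning algorithm.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```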
PRIORITIZED SWEEPING
Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]
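A hedged sketch of the tabular update above; variable names are assumptions:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning backup: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error  # its magnitude is later used as a priority
```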
PRIORITIZED SWEEPING
How to prioritize?
p_{s,a} = \left| \Delta_{s'} \right| \cdot T(s' | s, a)
(\Delta_{s'} is the TD error, T(s' | s, a) is the learned model)
- Update V(s'), which gives \Delta_{s'}
- Iterate all s,a pairs
- Compute all priorities
- Pick the highest and repeat
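A minimal tabular sketch of this loop, assuming a learned tabular model T and R (illustrative only, not the author's implementation):

```python
import heapq
import numpy as np

def prioritized_sweeping(Q, T, R, gamma=0.99, n_backups=100, theta=1e-3):
    """Tabular sketch: T[s, a, s'] = transition probs, R[s, a] = rewards, Q = action values."""
    n_states, n_actions, _ = T.shape
    queue = []  # max-priority queue via negated priorities

    def push_parents(s_next, delta):
        # p_{s,a} = |delta_{s'}| * T(s' | s, a) for every potential parent (s, a)
        for s in range(n_states):
            for a in range(n_actions):
                p = abs(delta) * T[s, a, s_next]
                if p > theta:
                    heapq.heappush(queue, (-p, s, a))

    # Seed the queue once from every state, then keep picking the highest priority.
    for s_next in range(n_states):
        push_parents(s_next, delta=1.0)

    for _ in range(n_backups):
        if not queue:
            break
        _, s, a = heapq.heappop(queue)
        v_old = Q[s].max()
        # Full model-based backup of Q(s, a).
        Q[s, a] = R[s, a] + gamma * T[s, a] @ Q.max(axis=1)
        push_parents(s, Q[s].max() - v_old)  # change in V(s) becomes the new delta
    return Q
```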
CONTINUOUS SETTING
PROBLEMS
- Learning a continuous T
- Iterating over all s,a pairs to update priorities
SOLUTION
Gaussian Process!
GAUSSIAN PROCESS
- An infinite-dimensional multivariate Gaussian
- Sample-efficient & Bayesian function approximator
GAUSSIAN PROCESS
- It's a DISTRIBUTION over functions!
- It can be learned easily and sample-efficiently (there are some gotchas)
- It keeps track of uncertainty!
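A minimal GP regression sketch illustrating these points, using scikit-learn (not necessarily what the author used):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A handful of noisy observations of an unknown function.
X = np.array([[0.0], [1.0], [2.5], [4.0]])
y = np.sin(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
gp.fit(X, y)  # sample-efficient: a few points already give a usable posterior

X_test = np.linspace(0, 5, 50).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
# 'mean' is the expected function, 'std' is the per-point uncertainty:
# a distribution over functions rather than a single fit.
```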
CONTINUOUS PS
This is what we need:
p_{s,a} = \left| \Delta_{s'} \right| \cdot T(s' | s, a)
(\Delta_{s'} is the TD error, T(s' | s, a) is the learned model)
CONTINUOUS PS
T(s' | s, a) with a GP
(plot: the GP's prediction of s' over the (s, a) plane)
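A hedged sketch of fitting T(s' | s, a) with a GP on collected <s, a, s'> tuples; scikit-learn is assumed and the data is illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Collected transitions: inputs are (s, a) pairs, targets are the observed next states s'.
SA = np.array([[0.0, -1.0], [0.5, 1.0], [1.0, -1.0], [1.5, 1.0]])  # columns: s, a
S_next = np.array([-0.1, 0.9, 0.8, 1.9])                           # observed s'

kernel = RBF(length_scale=[1.0, 1.0]) + WhiteKernel(noise_level=1e-2)
T_gp = GaussianProcessRegressor(kernel=kernel).fit(SA, S_next)

# For any query (s, a) the GP gives a Gaussian over s': a mean plus uncertainty,
# so stochastic transitions are covered, not just the deterministic mean.
mu, sigma = T_gp.predict(np.array([[0.7, 1.0]]), return_std=True)
```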
CONTINUOUS PS
The plot shows the deterministic mean, but the GP handles stochastic transitions!
(plot: GP mean of s' over the (s, a) plane)
CONTINUOUS PS
Iterating over parents is equivalent to slicing the GP at a given height
(matplotlib screws up the projection of overlapping surfaces, sorry!)
CONTINUOUS PS
Taking into account uncertainty, it'd look like this
CONTINUOUS PS
- Now we only need to go over all parents, and we can compute the priorities
- But wait... this is still a continuous function!
- Approximate it using sampling!
CONTINUOUS PS
- Sample arbitrary s,a
- Compute the PDF using the GP, and add the resulting priority to a queue
- Some ways to optimize this:
  - Latin Hypercube sampling
  - Gradient descent (depending on the GP kernel)
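A minimal sketch of this sampling loop, reusing the GP transition model from the earlier slide and Gaussian PDFs from SciPy; every name and bound here is an assumption about implementation details:

```python
import heapq
import numpy as np
from scipy.stats import norm

def sample_priorities(T_gp, s_prime, delta, n_samples=256, s_bounds=(0, 2), a_bounds=(-1, 1)):
    """Approximate p_{s,a} = |delta_{s'}| * T(s' | s, a) by sampling arbitrary (s, a) pairs."""
    queue = []
    for _ in range(n_samples):
        s = np.random.uniform(*s_bounds)
        a = np.random.uniform(*a_bounds)
        # The GP posterior at (s, a) is a Gaussian over s'; evaluate its PDF at the updated s'.
        mu, sigma = T_gp.predict(np.array([[s, a]]), return_std=True)
        density = norm.pdf(s_prime, loc=mu[0], scale=sigma[0])
        priority = abs(delta) * density
        heapq.heappush(queue, (-priority, s, a))  # max-priority queue
    return queue
```

The uniform draws above could be replaced by Latin Hypercube samples (e.g. scipy.stats.qmc.LatinHypercube) or refined with gradient steps on the GP posterior, as the slide suggests.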
CONTINUOUS PS
EXPERIMENTAL SETTING
- Test DQN as:
  - vanilla
  - with experience replay
  - with prioritized experience replay
  - using a GP to prioritize experience replay
  - using a GP to fully sample new experience
Multiagent Extension
- Already works for discrete settings (under domain knowledge assumptions)
- The main idea is that priorities are factored as a sum of smaller functions
- Each function depends only on some state features and agents
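A hedged sketch of the factorization being described (the notation is an assumption, not taken from the slides):

p_{s,a} = \sum_i p_i(s_i, a_i)

where each local term p_i depends only on a subset s_i of the state features and the actions a_i of the agents involved in that term.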
Multiagent Extension
- It is important to be able to represent priorities as functions
- Since we are already there, use mixtures of Gaussians!
Multiagent Extension
- Cluster samples using variational Expectation Maximization (weighted)
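A minimal sketch of the clustering step using scikit-learn's variational Gaussian mixture; its fit has no sample weights, so the priority weighting is approximated here by resampling (an assumption, not the author's implementation):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_priority_mixture(points, priorities, n_components=5, n_resample=2000):
    """points: sampled (s, a) pairs; priorities: their computed priorities, used as weights."""
    w = np.asarray(priorities, dtype=float)
    w = w / w.sum()
    # Approximate weighted EM by resampling points proportionally to their priority.
    idx = np.random.choice(len(points), size=n_resample, p=w)
    gmm = BayesianGaussianMixture(n_components=n_components)  # variational EM under the hood
    gmm.fit(np.asarray(points)[idx])
    return gmm  # a mixture of Gaussians representing the priority function
```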
Multiagent Extension
- Now we can represent priorities as actual functions, and can do fancy things with them!
- Enhanced buzzword level
- As long as everything works...
Questions?
Continuous PS with GPs
By svalorzen