Artyom Sorokin | 7 May
Basic Theoretical Results in Model-Free Reinforcement Learning are proved for Markov Decision Processes.
Markov property: \(P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, \ldots, s_t)\)
In other words: "The future is independent of the past given the present."
Graphical Model for POMDP:
POMDP is a 6-tuple \(\langle S, A, R, T, \Omega, O \rangle\): \(S\) is the set of states, \(A\) the set of actions, \(R(s,a)\) the reward function, \(T(s' \mid s, a)\) the transition function, \(\Omega\) the set of observations, and \(O(o \mid s', a)\) the observation function.
A proper belief state allows a POMDP to be formulated as an MDP over belief states (Åström, 1965).
Belief State update:
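The update itself is the standard Bayes-filter form (a textbook formula, not taken from the slide): after taking action \(a\) in belief \(b\) and receiving observation \(o\),
\[
b'(s') = \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)},
\]
where the denominator is a normalising constant.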
General Idea:
Belief Update → "Belief" MDP → Plan with Value Iteration → Policy
Problems:
(Ma et al, ICLR 2020)
Asynchronous Methods for Deep Reinforcement Learning (Mnih et al, 2016) | DeepMind, ICML, 3113 citations
Deep Recurrent Q-Learning for Partially Observable MDPs (Hausknecht et al, 2015) | AAAI, 582 citations
"To deal with partial observability, the temporal sequence of observations is processed by a deep long short-term memory (LSTM) system"
AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning (Vinyals et al, 2019) | DeepMind, Nature, 16 citations
"The LSTM composes 84% of the model’s total parameter count."
Dota 2 with Large Scale Deep Reinforcement Learning (Berner et al, 2019) | OpenAI, 17 citations
Recurrent Experience Replay in Distributed Reinforcement Learning (Kapturowski et al, 2019) | DeepMind, ICLR, 49 citations
R2D2 is a DRQN built on top of Ape-X (Horgan et al, 2018) with the addition of two heuristics for handling recurrent state in replay: storing the recurrent state with each sampled sequence and using a burn-in period.
Burn-in - 40 steps, full rollout - 80 steps
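A minimal PyTorch sketch of how such a burn-in can be implemented, assuming the 40 burn-in steps form the prefix of each replayed 80-step sequence; the network, tensor layout and names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

SEQ_LEN, BURN_IN = 80, 40  # hyperparameters from the slide

class RecurrentQNet(nn.Module):
    """Toy recurrent Q-network: LSTM over observations + linear Q-head."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, state):
        out, state = self.lstm(obs_seq, state)
        return self.head(out), state

def q_values_with_burn_in(net, obs_seq, stored_state):
    """obs_seq: [batch, SEQ_LEN, obs_dim]; stored_state: the recurrent
    state that was saved in the replay buffer with this sequence."""
    # Burn-in: unroll the LSTM to refresh the stale recurrent state,
    # without backpropagating through this prefix.
    with torch.no_grad():
        _, refreshed_state = net(obs_seq[:, :BURN_IN], stored_state)
    # Training portion: Q-values on which the TD loss is then computed.
    q_values, _ = net(obs_seq[:, BURN_IN:], refreshed_state)
    return q_values
```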
LSTMs are: good at tracking the order of observations; susceptible to noise in observations; bad at long-term dependencies.
RL environments often feature: observations whose order doesn't matter; high variability in observation sequences; long-term dependencies.
AMRL: Aggregated Memory For Reinforcement Learning (2020) | MS Research, ICLR
Add aggregators that ignore the order of observations. Aggregators also act as residual skip connections across time. Instead of true gradients, a straight-through estimator (Bengio et al., 2013) is used.
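A minimal PyTorch sketch of such an aggregator: a running max over time whose backward pass is the straight-through identity. The class and function names, and the assumption that aggregation is a cumulative max over the time dimension, are mine, not the paper's code.

```python
import torch

class StraightThroughMax(torch.autograd.Function):
    """Order-invariant running max over time with a straight-through
    backward pass: gradients are passed through as if the op were
    the identity (Bengio et al., 2013)."""

    @staticmethod
    def forward(ctx, inputs):
        # inputs: [batch, time, features] -> cumulative max along time
        return torch.cummax(inputs, dim=1).values

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: identity Jacobian.
        return grad_output

def max_aggregator(lstm_outputs):
    """Aggregates LSTM outputs across time while ignoring their order,
    acting like a residual skip connection through time."""
    return StraightThroughMax.apply(lstm_outputs)
```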
A3C/PPO + LSTM
Unsupervised Predictive Memory in a Goal-Directed Agent (2018) | DeepMind, 67 citations
a monstrous combination of a VAE and a Q-function estimator
uses a simplified DNC under the hood
no gradients flow between the policy and the MBP
trained with Policy Gradients and GAE
Unsupervised Predictive Memory in a Goal-Directed Agent (2018) | DeepMind, 67 citations
The prior module takes all the memory from the previous step and produces the parameters of a diagonal Gaussian distribution:
Another MLP \(f^{post}\) takes:
and generates a correction to the prior:
Finally, the latent state variable \(z_t\) is sampled from the posterior distribution.
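A minimal PyTorch sketch of this prior-correction-sample step. The assumptions: the prior MLP maps the previous state/memory readout to \((\mu, \log\sigma)\) of a diagonal Gaussian, the posterior MLP sees the encoded observation concatenated with the prior parameters and outputs an additive correction, and \(z_t\) is drawn with the reparameterisation trick. All names and sizes are illustrative, not MERLIN's code.

```python
import torch
import torch.nn as nn

class LatentStateModule(nn.Module):
    def __init__(self, state_dim, obs_dim, z_dim, hidden=200):
        super().__init__()
        # Prior: previous recurrent state + memory readout -> (mu, log_sigma)
        self.prior = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * z_dim))
        # Posterior: encoded observation + prior parameters -> additive correction
        self.posterior = nn.Sequential(
            nn.Linear(obs_dim + 2 * z_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * z_dim))

    def forward(self, prev_state, obs_embedding):
        prior_params = self.prior(prev_state)
        correction = self.posterior(torch.cat([obs_embedding, prior_params], dim=-1))
        post_params = prior_params + correction          # corrected (posterior) parameters
        mu, log_sigma = post_params.chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterised sample z_t
        return z, prior_params, post_params
```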
The Memory-Based Predictor (MBP) has a loss function based on the variational lower bound, with a reconstruction term and a KL term:
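In its generic form (MERLIN's actual loss reconstructs several targets, so this is only the shape of the bound, with \(q\) the corrected posterior and \(p\) the prior defined above):
\[
\mathcal{L}_t = \underbrace{\mathbb{E}_{z_t \sim q}\big[\log p(o_t, r_t \mid z_t)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\big(q(z_t \mid \cdot)\,\|\,p(z_t \mid \cdot)\big)}_{\text{KL}}
\]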
MERLIN is compared against two baselines: A3C-LSTM, A3C-DNC
Stabilizing Transformers For Reinforcement Learning (2019) | DeepMind
Catastrophic interference, i.e. knowledge in neural networks is non-local, so new updates can overwrite what was learned earlier.
The nature of gradient descent: stable learning requires small learning rates, so knowledge accumulates slowly.
Semantic memory makes better use of experiences (i.e. better generalization)
Episodic memory requires fewer experiences (i.e. more accurate when experience is limited)
"We will show that in general, just as model-free control is better than model-based control after substantial experience, episodic control is better than model-based control after only very limited experience."
A Tree MDP is just an MDP without cycles.
A) Tree MDP with branching factor = 2; B) Tree MDP with branching factor = 3; C) Tree MDP with branching factor = 4
Let's store all past experiences in \(|A|\) dictionaries \(Q_{a}^{EC}\):
\(s_t, a_t \) are keys and discounted future rewards \(R_t\) are values.
Dictionary update:
If the state space has a meaningful distance, then we can use k-nearest neighbours to estimate values for new \((s,a)\) pairs (both rules are written out below).
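Following the Model-Free Episodic Control paper, the two rules are:
\[
Q^{EC}(s_t, a_t) \leftarrow
\begin{cases}
R_t & \text{if } (s_t, a_t) \notin Q^{EC},\\
\max\!\big(Q^{EC}(s_t, a_t),\, R_t\big) & \text{otherwise,}
\end{cases}
\]
\[
\widehat{Q}^{EC}(s, a) = \frac{1}{k}\sum_{i=1}^{k} Q^{EC}\big(s^{(i)}, a\big) \quad \text{if } (s, a) \text{ is not yet stored,}
\]
where \(s^{(1)}, \ldots, s^{(k)}\) are the \(k\) stored states nearest to \(s\).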
Model-Free Episodic Control (2016) | DeepMind, 100 citations
Two possible feature compressors for \(s_t\): Random Network, VAE
Test environments:
Differences with Model-Free Episodic Control:
CNN instead of VAE/RandomNet
Differentiable Neural Dictionary (DND)
Replay Memory like in DQN, but small...
The CNN and the DND are trained with gradient descent
Neural Episodic Control (2017) | DeepMind, 115 citations
For each action \(a \in A \), NEC has a dictionary \(M_a = (K_a , V_a )\).
Keys and queries are generated by the CNN encoder
To estimate \(Q(s_t, a)\) we look up the p nearest neighbours of the query in \(M_a\) and take a kernel-weighted average of their values: \(Q(s_t, a) = \sum_i w_i v_i\), where \(w_i = k(h_t, h_i) / \sum_j k(h_t, h_j)\).
\(k\) is a kernel for the distance estimate. In the experiments: \(k(h, h_i) = \frac{1}{\|h - h_i\|_2^2 + \delta}\).
Once a key \(h_t\) is queried from a DND, that key and its corresponding output are appended to the DND. If \(h_t \notin M_a\), we append it together with an N-step Q-value estimate;
otherwise, we update the stored value with a tabular Q-learning rule (both cases are written out below).
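Spelled out (following the NEC paper, with \(\alpha\) acting as a dictionary learning rate and \(Q_i\) the value already stored for the matching key):
\[
Q^{(N)}(s_t, a) = \sum_{j=0}^{N-1} \gamma^{j} r_{t+j} + \gamma^{N} \max_{a'} Q(s_{t+N}, a'),
\qquad
Q_i \leftarrow Q_i + \alpha\big(Q^{(N)}(s_t, a) - Q_i\big).
\]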
Learning the DND and CNN encoder:
Sample mini-batches from a replay buffer that stores triplets \((s_t, a_t, R_t)\) and use \(R_t\) as the target.
Neural Episodic Control (2017) | DeepMind, 115 citations