Artyom Sorokin | 20 October
Memory is useful in many tasks, but we start from Reinforcement Learning...
Basic Theoretical Results in Model-Free Reinforcement Learning are proved for Markov Decision Processes.
Markov property:
In other words: "The future is independent of the past given the present."
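For a Markov Decision Process with states \(s_t\) and actions \(a_t\), this is usually written as:
\(P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)\)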
[Figure: a trajectory of observations and actions \(obs_{t=10}, act_{t=10}, \ldots, obs_{t=20}\); we need the information from \(obs_{t=10}\) at the moment \(t=20\).]
Memory
[Figure: a recurrent memory over \(obs_{t=10}, act_{t=10}, \ldots, obs_{t=20}, act_{t=20}\); hidden states \(h_9, h_{10}, \ldots, h_{19}\) carry information forward across timesteps, while gradients flow backward through the same chain.]
Long Short-Term Memory (LSTM)
Differentiable Neural Computer (DNC)
[Figure: a window-based memory over \(obs_{t=10}, act_{t=10}, \ldots, obs_{t=20}, act_{t=20}\); both information and gradients are confined to the memory window.]
Transformer is a window-based memory architecture
[Figure: a query \(q\) attends over the embeddings \(e_{10}, e_{12}, \ldots, e_{20}\) of the observations in the memory window, producing attention weights \(a_{10}, a_{12}, \ldots, a_{20}\).]
Attention weight: \(a_t = {e_{t}}^{\top} q \,/\, \sum_i {e_{i}}^{\top} q\)
Context vector: \(c_{20} = \sum_t a_t e_t\)
Self-Attention computation:
Each \(z_t\) contains relevant information about \(o_t\) collected over all steps in the memory window.
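As a rough illustration of this simplified attention, here is a minimal sketch in PyTorch. The layer sizes, the learned linear projections, and the softmax normalization (which real Transformers use instead of the plain sum above) are my assumptions for illustration, not the speaker's code:

```python
import torch
import torch.nn.functional as F

def window_self_attention(obs_window: torch.Tensor,
                          embed: torch.nn.Linear,
                          query_proj: torch.nn.Linear) -> torch.Tensor:
    """obs_window: (W, obs_dim) observations inside the memory window.
    Returns z: (W, d), where each z_t aggregates information from the whole window."""
    e = embed(obs_window)           # (W, d) embeddings e_t
    q = query_proj(obs_window)      # (W, d) one query per step
    scores = q @ e.T                # (W, W) dot products e_i^T q_t
    a = F.softmax(scores, dim=-1)   # attention weights (softmax instead of plain normalization)
    z = a @ e                       # (W, d) context vectors z_t = sum_i a_{t,i} e_i
    return z

# usage: 21 steps in the window, 16-dim observations, 32-dim embeddings
obs = torch.randn(21, 16)
embed = torch.nn.Linear(16, 32)
query_proj = torch.nn.Linear(16, 32)
z = window_self_attention(obs, embed, query_proj)   # z.shape == (21, 32)
```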
This is how a real Transformer looks:
Kind of...
A temporal dependency must fully fit into the TBPTT window or the attention span to be learned.
You need to store all intermediate computations to implement backpropagation over the attention span / TBPTT window.
Vanishing/Exploding Gradients
100-250 steps for AMRL
40-80 steps for R2D2
The best results that I know of: 500+ steps with MERLIN (Wayne et al., 2018)
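To make the TBPTT limitation above concrete, here is a minimal sketch (my own illustration, with an assumed toy LSTM model and loss) of truncating backpropagation every K steps by detaching the hidden state:

```python
import torch
import torch.nn as nn

K = 40                                   # TBPTT window: gradients never span more than K steps
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-3)

obs = torch.randn(1, 1000, 8)            # one long trajectory
target = torch.randn(1, 1000, 1)

h = None
for start in range(0, obs.size(1), K):
    chunk = obs[:, start:start + K]
    out, h = lstm(chunk, h)
    loss = ((head(out) - target[:, start:start + K]) ** 2).mean()
    opt.zero_grad()
    loss.backward()                      # backprop only through the last K steps
    opt.step()
    h = tuple(x.detach() for x in h)     # cut the graph: dependencies longer than K cannot be learned
```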
Information from the red timestep could help at the blue timestep.
The Markov property doesn't hold for these timesteps:
How important is it to remember the red timestep?
\(f_t\) can be a Q-value, \(s_{t+1}\), \(r_{t+1}\), etc.,
multiplied by \(P(o_{t-k} \mid o_t, a_t)\)
The best memory would maximize the following sum:
But training \(m_t\) to maximize Mutual Information (MI) at step \(t\) doesn't help with our problem:
what if the information from step \(t-k\) is already lost at step \(t\)?
It is better to optimize the following sum:
Train memory to maximize MI for all future steps:
\(O(T^2)\) in time!
It still requires processing the full sequence to update \(m_t\).
Observation: not all steps in the sum \(\sum^{T}_{i=t}\) are equally valuable.
Idea: instead of optimizing the whole second sum, it is cheaper to focus only on the terms with the highest Mutual Information, i.e. to optimize w.r.t. the moments where memory is the most important for the model's predictions.
Let's look at the second sum \(\sum^{T}_{i=t}\) more closely.
Goal: Find the moments where memory is the most important for the model's predictions!
\(I(f_{i} ; m_t \mid o_{i}, a_{i})\) specifies how much memory from step \(t\) can improve the prediction at step \(i\).
Problem: to find the steps where memory is useful, we first need to have a useful memory :(
Let's assume we have a perfect memory \(m^{*}_t\)! Then:
\(I(f_{i} ; m^{*}_t \mid o_{i}, a_{i}) = H(f_i \mid o_i, a_i) - H(f_i \mid m^{*}_t, o_i, a_i)\),
i.e. the \(f_i\) uncertainty without memory minus the \(f_i\) uncertainty with perfect memory: how important memory is for predicting \(f_i\).
Locality of Reference: if \(H(f_{i} \mid m^{*}_t, o_{i}, a_{i})\) doesn't fluctuate as much as \(H(f_{i} \mid o_{i}, a_{i})\), e.g. \(H(f_{i} \mid m^{*}_t, o_{i}, a_{i}) = c\) for any \(i\),
then \(I(f_{i} ; m^{*}_t \mid o_{i}, a_{i})\) is proportional to \(H(f_i \mid o_i, a_i)\).
So the moments that can benefit the most from memory are the moments with high uncertainty \(H(f_i \mid o_i, a_i)\).
Two main steps: find the set \(U_t\) of such high-uncertainty moments, then train the memory to improve predictions at those moments,
where \(U_t\) is a set of moments that can benefit the most from memory; \(|U| \ll T\).
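A minimal sketch of the selection step, under my assumption that a memory-free predictor's per-step loss (e.g. negative log-likelihood) is used as the estimate of \(H(f_i \mid o_i, a_i)\) and that the top-k steps are selected:

```python
import torch

def select_uncertain_steps(nll_per_step: torch.Tensor, k: int) -> torch.Tensor:
    """nll_per_step: (T,) prediction loss of a memory-free model at each step,
    used as a proxy for H(f_i | o_i, a_i). Returns indices of the k most
    uncertain steps, i.e. the set U with |U| << T."""
    k = min(k, nll_per_step.numel())
    return torch.topk(nll_per_step, k).indices

# usage: pick the 4 most uncertain steps out of a 1000-step episode
nll = torch.rand(1000)
U = select_uncertain_steps(nll, k=4)
# the memory m_t is then trained to improve predictions of f_i only at the steps in U
```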
Memory Pretraining Modules:
Learning is divided into two phases:
We use cumulative discounted future reward as prediction target: \(f_t = \sum_k \gamma^k r_{t+k}\)
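A small sketch of the prediction target from the line above, \(f_t = \sum_k \gamma^k r_{t+k}\), computed for every step of an episode (plain NumPy, my own illustration):

```python
import numpy as np

def discounted_returns(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """f_t = sum_k gamma^k * r_{t+k}, computed backwards in O(T)."""
    f = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        f[t] = running
    return f

# usage
print(discounted_returns(np.array([0.0, 0.0, 1.0]), gamma=0.9))  # [0.81, 0.9, 1.0]
```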
We compare MemUP (Memory via Uncertainty Prediction) with the following baselines:
IMPALA-ST
Noisy T-Maze Environment:
Env Details:
Noisy T-Maze-100
Noisy T-Maze-1000
Go-Explore improves exploration in environments with sparse rewards.
It addresses two failure modes of exploration: detachment and derailment.
[Figure: the Go-Explore loop: select a state from the archive (selection depends on #visits, #selections, room_id, level, etc.), load the state / replay the trajectory to it, explore from it with a random policy, and add any new state or better trajectory back to the archive.]
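A highly simplified sketch of that loop (the environment interface, the cell discretization, and the selection score are my assumptions, not the authors' code; a deterministic, resettable environment is assumed):

```python
def cell(state):
    # assumption: some discretization of the state, e.g. room_id and level in Montezuma
    return tuple(state) if hasattr(state, "__iter__") else (state,)

def go_explore(env, n_iterations=1000, explore_steps=100):
    # archive: cell -> best trajectory reaching it plus bookkeeping counters
    start = env.reset()
    archive = {cell(start): {"traj": [], "score": 0.0, "visits": 0, "selections": 0}}

    for _ in range(n_iterations):
        # 1) select a promising cell (here: rarely visited / rarely selected)
        c = min(archive, key=lambda c: archive[c]["visits"] + archive[c]["selections"])
        archive[c]["selections"] += 1

        # 2) return to the cell by replaying its trajectory
        env.reset()
        for action in archive[c]["traj"]:
            env.step(action)
        traj, score = list(archive[c]["traj"]), archive[c]["score"]

        # 3) explore from there with a random policy
        for _ in range(explore_steps):
            action = env.action_space.sample()
            state, reward, done, _ = env.step(action)
            traj.append(action)
            score += reward
            new_c = cell(state)
            entry = archive.get(new_c)
            # 4) add new cells, or keep the better trajectory for known cells
            if entry is None or score > entry["score"]:
                archive[new_c] = {"traj": list(traj), "score": score,
                                  "visits": 0, "selections": 0}
            archive[new_c]["visits"] += 1
            if done:
                break
    return archive
```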
Similarities:
Differences:
Go-Explore:
MemUP:
Recurrent neural networks:
Training RNN on Sequences:
[Figure: an LSTM unrolled over the sequence, one LSTM cell per timestep.]
Neural Turing Machines:
Training NTM on Sequences:
Array of N memory vectors
Read and Write with Soft Attention
[Figure: an NTM unrolled over the sequence, reading from and writing to the external memory at each timestep.]
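A minimal sketch of the "array of N memory vectors" with soft-attention read and write (content-based addressing only; the erase/add write parametrization follows the NTM paper, but the shapes and the fixed sharpness value are my assumptions):

```python
import torch
import torch.nn.functional as F

def address(memory, key, beta=10.0):
    """Content-based addressing: softmax over cosine similarity with the key.
    memory: (N, M), key: (M,) -> weights: (N,)"""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)
    return F.softmax(beta * sim, dim=0)

def read(memory, w):
    """Soft read: weighted sum of memory rows. -> (M,)"""
    return w @ memory

def write(memory, w, erase, add):
    """Soft write: each row is partially erased and then updated.
    erase, add: (M,) vectors produced by the controller."""
    memory = memory * (1 - w.unsqueeze(1) * erase.unsqueeze(0))
    return memory + w.unsqueeze(1) * add.unsqueeze(0)

# usage: N=8 slots of size M=4
mem = torch.zeros(8, 4)
key, erase, add = torch.randn(4), torch.sigmoid(torch.randn(4)), torch.randn(4)
w = address(mem, key)
mem = write(mem, w, erase, add)
r = read(mem, w)            # (4,) read vector fed back to the controller
```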
Recurrent Independent Mechanisms:
Imagine 4 LSTMs
Choose the active LSTMs with top-down attention
Update the active LSTMs
Copy the inactive LSTMs
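A rough sketch of one RIM step following the "4 LSTMs" description above. Top-down attention is reduced here to a single relevance score per mechanism, and the communication attention between mechanisms is omitted; these simplifications are mine, not the original architecture:

```python
import torch
import torch.nn as nn

class TinyRIM(nn.Module):
    def __init__(self, in_dim=8, hid=16, n_mechanisms=4, k_active=2):
        super().__init__()
        self.k = k_active
        self.cells = nn.ModuleList(nn.LSTMCell(in_dim, hid) for _ in range(n_mechanisms))
        self.queries = nn.Parameter(torch.randn(n_mechanisms, in_dim))  # per-mechanism query

    def forward(self, x, hc):
        # x: (in_dim,), hc: list of (h, c) pairs, one per mechanism
        scores = self.queries @ x                        # how relevant the input is to each mechanism
        active = torch.topk(scores, self.k).indices.tolist()
        new_hc = []
        for i, cell in enumerate(self.cells):
            h, c = hc[i]
            if i in active:                              # update only the active LSTMs
                h, c = cell(x.unsqueeze(0), (h, c))
            new_hc.append((h, c))                        # inactive LSTMs are copied unchanged
        return new_hc

# usage
rim = TinyRIM()
hc = [(torch.zeros(1, 16), torch.zeros(1, 16)) for _ in range(4)]
for x in torch.randn(10, 8):                             # a short sequence
    hc = rim(x, hc)
```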
Training RIM on Sequences:
[Figure: a RIM layer unrolled over the sequence, one step per timestep.]
Similarities:
Differences:
NTM and RIM:
MemUP:
We can use NTM, RIM, etc. as a memory module in MemUP
PlaNet builds a good model of the environment and then plans with it.
Two main PlaNet improvements:
1. A latent dynamics model with a deterministic part and a stochastic part (the Recurrent State-Space Model).
2. Training the model with a KL loss and a reconstruction loss.
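A very condensed sketch of a latent dynamics model in that spirit: a deterministic GRU path plus a stochastic Gaussian latent, trained with a KL term and a reconstruction term. The sizes and the single-step loss are my simplifications, not PlaNet's exact objective:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class TinyRSSM(nn.Module):
    def __init__(self, obs_dim=16, act_dim=4, det=32, stoch=8):
        super().__init__()
        self.gru = nn.GRUCell(stoch + act_dim, det)          # deterministic part h_t
        self.prior = nn.Linear(det, 2 * stoch)               # p(z_t | h_t)
        self.post = nn.Linear(det + obs_dim, 2 * stoch)      # q(z_t | h_t, o_t)
        self.decoder = nn.Linear(det + stoch, obs_dim)       # reconstruct o_t

    def forward(self, h, z, action, obs):
        h = self.gru(torch.cat([z, action], -1), h)          # deterministic state update
        pm, pls = self.prior(h).chunk(2, -1)                 # prior mean / log-std
        qm, qls = self.post(torch.cat([h, obs], -1)).chunk(2, -1)
        prior, post = D.Normal(pm, pls.exp()), D.Normal(qm, qls.exp())
        z = post.rsample()                                   # stochastic part z_t
        recon = self.decoder(torch.cat([h, z], -1))
        kl_loss = D.kl_divergence(post, prior).sum(-1).mean()
        recon_loss = ((recon - obs) ** 2).sum(-1).mean()
        return h, z, kl_loss + recon_loss

# usage: one step with batch size 2
model = TinyRSSM()
h, z = torch.zeros(2, 32), torch.zeros(2, 8)
h, z, loss = model(h, z, torch.zeros(2, 4), torch.randn(2, 16))
loss.backward()
```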
Similarities:
Differences:
PlaNet's Model:
MemUP:
Similarities:
Differences (more like parallels):
Big Bird:
MemUP: