Artyom Sorokin
18 December 2021
Memory is important in many tasks, but we start from Reinforcement Learning...
Basic Theoretical Results in Model-Free Reinforcement Learning are proved for Markov Decision Processes.
Markov property:
In other words: "The future is independent of the past given the present."
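Written out, the Markov property says that the next state (and reward) depends only on the current state and action:
\[ P(s_{t+1}, r_{t+1} \mid s_t, a_t) \;=\; P(s_{t+1}, r_{t+1} \mid s_1, a_1, \dots, s_t, a_t). \]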
[Diagram: a trajectory \(obs_{t=10}, act_{t=10}, \dots, obs_{t=12}, act_{t=12}, \dots, obs_{t=20}, act_{t=20}\); we need the information observed at \(t=10\) at this moment, \(t=20\), so some form of Memory has to carry it forward.]
[Diagram: a recurrent memory; hidden states \(h_9, h_{10}, \dots, h_{19}, \dots\) are passed forward through every \((obs_t, act_t)\) step, so information flows forward through the hidden state while gradients flow backward along the same chain.]
Long Short-Term Memory: LSTM
Differentiable Neural Computer: DNC
[Diagram: window-based memory; both information and gradients are confined to a fixed Memory Window over the most recent \((obs_t, act_t)\) steps.]
Transformer is a window-based memory architecture
[Diagram: observations \(obs_{t=10}, \dots, obs_{t=12}, \dots, obs_{t=20}\) are encoded into embeddings \(e_{10}, e_{12}, \dots, e_{20}\); a query \(q\) is compared with each embedding to produce attention weights \(a_{10}, a_{12}, \dots\)]
Attention weight: \(a_t = {e_{t}}^T q/\sum_i {e_{i}}^T q \)
Context vector: \(c_{20} = \sum_t a_t e_t\)
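A tiny numerical sketch of these two formulas (using the slide's simplified normalization rather than a softmax); the embeddings and query are made-up examples:

```python
import numpy as np

# Made-up embeddings e_t for three remembered observations and a query q.
embeddings = {10: np.array([1.0, 0.0]), 12: np.array([0.5, 0.5]), 20: np.array([0.0, 1.0])}
q = np.array([0.2, 0.8])

# Attention weight a_t = e_t^T q / sum_i e_i^T q  (slide's simplified normalization).
scores = {t: e @ q for t, e in embeddings.items()}
total = sum(scores.values())
a = {t: s / total for t, s in scores.items()}

# Context vector c_20 = sum_t a_t e_t.
c_20 = sum(a[t] * embeddings[t] for t in embeddings)
print(a, c_20)
```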
Self-Attention computation:
Each \(z_t\) contains relevant information about \(o_t\), collected over all steps in the Memory Window.
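A minimal sketch of the self-attention computation actually used in Transformers (scaled dot-product with a softmax) applied to the embeddings inside the memory window; the shapes and the identity Q/K/V projections are simplifications for illustration:

```python
import numpy as np

def self_attention(X):
    """X: (window, d) embeddings of the observations inside the memory window.
    Returns Z: (window, d); each z_t mixes information from all steps in the window."""
    d = X.shape[-1]
    # In a real Transformer Q, K, V come from learned projections of X;
    # here we use X directly to keep the sketch short.
    Q, K, V = X, X, X
    scores = Q @ K.T / np.sqrt(d)                              # (window, window)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the window
    return weights @ V                                         # context vectors z_t

Z = self_attention(np.random.randn(8, 16))  # e.g. a window of 8 steps, 16-dim embeddings
```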
This is how the real Transformer looks (kind of...):
[Diagram: RNNs vs Transformers; the agent (panda) observed the relevant information far in the past, but gradients reach back only over the Truncated BPTT window (RNNs) or the attention span (Transformers).]
A temporal dependency should fully fit into the TBPTT window or the Attention Span to be learned.
You need to process all intermediate steps to backpropagate over the Attention Span / TBPTT window.
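A minimal PyTorch-style sketch of truncated BPTT with a generic LSTM (toy data, illustrative sizes); the point is that the hidden state is detached between windows, so gradients never reach further back than `window` steps:

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

obs = torch.randn(1, 1000, 16)       # a long trajectory of observations (toy data)
targets = torch.randn(1, 1000, 1)
window = 40                          # TBPTT length
state = None

for start in range(0, obs.size(1), window):
    chunk = obs[:, start:start + window]
    out, state = rnn(chunk, state)
    loss = ((head(out) - targets[:, start:start + window]) ** 2).mean()
    opt.zero_grad()
    loss.backward()                  # gradients flow only within this window
    opt.step()
    # Detach: information still flows forward through `state`,
    # but gradients cannot reach earlier windows.
    state = tuple(s.detach() for s in state)
```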
Problem:
We need to backpropagate through all intermediate steps to find and learn a temporal dependency, as we can't detect the temporal dependency locally.
A temporal dependency (e.g. between steps \(t-1\) and \(t+k\)) must fit into the TBPTT window / Attention Span to be learned.
Long backpropagation paths also cause Vanishing/Exploding Gradients.
In practice this limits how far back memory can reach:
100-250 steps for AMRL
40-80 steps for R2D2
Best result that I know of: 500+ steps with MERLIN (Wayne et al., 2018)
Information from the red timestep could help at the blue timestep.
The Markov property does not hold between these timesteps.
How important is it to remember the red timestep?
\(f_t\) can be a Q-value, \(s_{t+1}\), \(r_{t+1}\), etc.
(multiplied by \(P(o_{t-k} \mid o_t, a_t)\))
The best memory would maximize the following sum:
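A plausible reconstruction of that sum (the original formula was a slide image): the per-step mutual information objective
\[ \max_{m}\; \sum_{t} I(f_t;\, m_t \mid o_t, a_t). \]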
But training \(m_t\) to maximize Mutual Information (MI) at step \(t\) doesn't help with our problem:
what if the information from step \(t-k\) is already lost at step \(t\)?
It is better to optimize the following sum:
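Again as a reconstruction of the slide's formula: the improved objective sums the MI of \(m_t\) with all future targets,
\[ \max_{m}\; \sum_{t} \sum_{k \ge t} I(f_k;\, m_t \mid o_k, a_k), \]
a double sum over \(T\) steps, hence \(O(T^2)\) in time.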
Train memory to maximize MI for all future steps:
\(O(T^2)\) in time!
Idea:
Instead of optimizing the whole second sum, it would be cheaper to optimize only w.r.t. the moments where memory is the most important for the model's predictions!
(This still requires processing the full sequence to update \(m_t\).)
Locality of Reference
Goal: Find the moments where memory is the most important for the model's predictions!
Problem: To find the steps where memory is useful, we first need to have a useful memory :(
Let's assume we have a perfect memory state \(m^{*}_t\) for each \(t\)! Then:
Local Metric: Conditional Mutual Information \(I(f_{t}; m_{t}^{*} \mid o_{t}, a_{t})\): it specifies how much memory can improve the prediction of \(f_t\) and detects the end of a temporal dependency.
Memory potential: how important memory can be for predicting \(f_t\).
\(H(f_{t} \mid o_{t}, a_{t})\): \(f_t\) uncertainty without memory; \(H(f_{t} \mid m_{t}^{*}, o_{t}, a_{t})\): \(f_t\) uncertainty with a perfect memory.
If \(H(f_{t} \mid m_{t}^{*}, o_{t}, a_{t})\) doesn't fluctuate as much as \(H(f_{t} \mid o_{t}, a_{t})\), e.g. \(H(f_{t} \mid m_{t}^{*}, o_{t}, a_{t}) = c\) for any \(t\),
then \(I(f_{t}; m_{t}^{*} \mid o_{t}, a_{t})\) is proportional to \(H(f_t \mid o_t, a_t)\):
we can find the ends of temporal dependencies by estimating \(H(f_t \mid o_t, a_t)\) alone.
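Spelling this out with the definition of conditional mutual information and the constant-\(c\) assumption above:
\[ I(f_t; m_t^{*} \mid o_t, a_t) = H(f_t \mid o_t, a_t) - H(f_t \mid m_t^{*}, o_t, a_t) = H(f_t \mid o_t, a_t) - c, \]
so ranking steps by \(H(f_t \mid o_t, a_t)\) ranks them by how much memory can help.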
Train memory to maximize MI at the moments that would benefit the most from using memory:
where \(U_t\) is a set of steps with the highest memory potential (uncertainty estimates); \(|U| \ll T\).
Given the lower bound from Barber & Agakov (2004), you can show that training with a Cross-Entropy loss is enough to maximize MI:
simply learn to predict \(f_k\) for the steps \(k \in U_t\).
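A hedged reconstruction of the objective and the bound (the exact formulas were slide images): the selective objective is roughly
\[ \max_{m}\; \sum_{t} \sum_{k \in U_t} I(f_k;\, m_t \mid o_k, a_k), \]
and the Barber-Agakov lower bound, for any predictive distribution \(q\), is
\[ I(f_k; m_t \mid o_k, a_k) \;\ge\; H(f_k \mid o_k, a_k) + \mathbb{E}\big[\log q(f_k \mid m_t, o_k, a_k)\big]. \]
Since the entropy term does not depend on the memory, maximizing the expected log-likelihood (i.e. minimizing the predictor's cross-entropy loss) maximizes this lower bound on MI.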
[Diagram: an example episode; the hint at the start stores information about the reward placement, and the agent (panda) must carry it until the reward.]
Learning Algorithm:
Learning is divided into two phases.
Memory Pretraining Modules: Uncertainty detector, Memory, Predictor net.
Two main steps:
[Diagram: the Uncertainty detector marks steps with a high uncertainty estimate, i.e. high \(H(f_i \mid o_i, a_i)\); the Memory and Predictor net learn to predict \(f\) at those steps.]
We use the cumulative discounted future reward as the prediction target: \(f_t = \sum_k \gamma^k r_{t+k}\)
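A minimal PyTorch-style sketch of one memory-pretraining update under the ideas above. The module shapes, the `top_u` size, and the separate uncertainty network are illustrative assumptions, not the exact MemUP architecture; with a unit-variance Gaussian predictor the cross-entropy/NLL reduces to a squared error up to constants:

```python
import torch
import torch.nn as nn

T, obs_dim, act_dim, hid = 200, 16, 4, 64
memory = nn.GRU(obs_dim + act_dim, hid, batch_first=True)   # memory module
predictor = nn.Linear(hid + obs_dim + act_dim, 1)           # predicts f_k from (m_t, o_k, a_k)
uncertainty = nn.Linear(obs_dim + act_dim, 1)                # stand-in estimate of H(f_k | o_k, a_k)
gamma, top_u = 0.99, 5

obs, act = torch.randn(1, T, obs_dim), torch.randn(1, T, act_dim)
rew = torch.randn(1, T)

# Prediction target: discounted future return f_t = sum_k gamma^k r_{t+k}.
f = torch.zeros(1, T)
running = torch.zeros(1)
for t in reversed(range(T)):
    running = rew[:, t] + gamma * running
    f[:, t] = running

x = torch.cat([obs, act], dim=-1)
m, _ = memory(x)                                   # m[:, t] ~ memory state m_t

# Select U_t: the future steps with the highest uncertainty estimates.
with torch.no_grad():
    u = uncertainty(x).squeeze(-1)                 # proxy for H(f_k | o_k, a_k)
t = 0                                              # train the memory state of step t (for illustration)
future = torch.arange(t, T)
U_t = future[u[0, t:].topk(top_u).indices]

# Train memory + predictor to predict f_k at the selected steps k in U_t.
pred = predictor(torch.cat([m[:, t].expand(len(U_t), -1), x[0, U_t]], dim=-1)).squeeze(-1)
loss = ((pred - f[0, U_t]) ** 2).mean()            # Gaussian NLL up to constants
loss.backward()
```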
We compare MemUP (Memory via Uncertainty Prediction) with the following baselines:
IMPALA-ST
Noisy T-Maze Environment:
Env Details: the agent observes a hint at the start that encodes the reward placement and must remember it over a long, noisy corridor until the T-junction.
Noisy T-Maze-100 (dependency length ~100 steps)
Noisy T-Maze-1000 (dependency length ~1000 steps)
Go-Explore improves exploration in environments with sparse rewards.
It targets two failure modes of exploration: Detachment and Derailment.
The loop:
1. Select a state from the archive (selection depends on #visits, #selections, room_id, level, etc).
2. Go to it: load from the state / replay the trajectory.
3. Explore from it with a random policy.
4. Add any new state or better trajectory to the archive.
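A toy sketch of that loop. The archive keys, the scoring, and the gym-style environment API are made up for illustration; the real Go-Explore uses domain-specific cell representations:

```python
import random

def cell(obs):
    # Hypothetical downsampling of an observation into a discrete cell id.
    return tuple(round(float(x), 1) for x in obs[:2])

def go_explore(env, n_iters=1000, horizon=50):
    # Archive: cell -> score, trajectory that reaches it, visit/selection counters.
    start = env.reset()
    archive = {cell(start): {"score": 0.0, "traj": [], "visits": 1, "selections": 0}}
    for _ in range(n_iters):
        # 1. Select a cell; prefer cells that were visited/selected rarely.
        key = min(archive, key=lambda k: archive[k]["visits"] + archive[k]["selections"])
        entry = archive[key]
        entry["selections"] += 1
        # 2. Go to it by replaying the stored trajectory (or restoring a saved state).
        obs = env.reset()
        score, traj = 0.0, []
        for a in entry["traj"]:
            obs, r, done, _ = env.step(a)
            score += r
            traj.append(a)
        # 3. Explore from it with a random policy.
        for _ in range(horizon):
            a = env.action_space.sample()
            obs, r, done, _ = env.step(a)
            score += r
            traj.append(a)
            # 4. Keep any new cell, or a better trajectory to a known cell.
            k = cell(obs)
            if k not in archive or score > archive[k]["score"]:
                archive[k] = {"score": score, "traj": list(traj),
                              "visits": archive.get(k, {}).get("visits", 0) + 1,
                              "selections": 0}
            if done:
                break
    return archive
```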
Go-Explore vs MemUP: similarities and differences.
Recurrent neural networks:
Training an RNN on sequences:
[Diagram: an LSTM cell unrolled over the sequence, one copy per timestep.]
Neural Turing Machines:
An array of N memory vectors; read and write with soft attention.
Training an NTM on sequences:
[Diagram: the NTM cell unrolled over the sequence, one copy per timestep.]
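A minimal sketch of the soft-attention read and write used by NTM-style memories (content-based addressing only; the real NTM adds location-based addressing, gating, and erase vectors):

```python
import torch
import torch.nn.functional as F

def address(memory, key):
    """Content-based addressing: softmax over cosine similarity with each of the N slots."""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)   # (N,)
    return F.softmax(sim, dim=-1)                                  # attention weights over slots

def read(memory, weights):
    return weights @ memory                                        # (D,) weighted sum of slots

def write(memory, weights, add_vec):
    # Soft write: every slot is updated in proportion to its attention weight.
    return memory + weights.unsqueeze(-1) * add_vec.unsqueeze(0)

N, D = 8, 16
memory = torch.randn(N, D) * 0.1
key, add_vec = torch.randn(D), torch.randn(D)
w = address(memory, key)
memory = write(memory, w, add_vec)
r = read(memory, w)
```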
Recurrent Independent Mechanisms:
Imagine 4 LSTMs.
Choose the active LSTMs with top-down attention.
Update the active LSTMs; copy the inactive LSTMs unchanged.
Training a RIM on sequences:
[Diagram: the RIM cell unrolled over the sequence, one copy per timestep.]
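A simplified sketch of one RIM step, assuming 4 independent GRU cells (stand-ins for the LSTMs) and a crude relevance score; the real RIM uses key-query attention over the input plus null tokens to pick the active set:

```python
import torch
import torch.nn as nn

n_rims, inp, hid, k_active = 4, 16, 32, 2
cells = nn.ModuleList(nn.GRUCell(inp, hid) for _ in range(n_rims))
queries = nn.Linear(hid, inp, bias=False)      # each mechanism queries the input

def rim_step(x, h):
    """x: (inp,) input at this timestep; h: (n_rims, hid) hidden states."""
    # Top-down attention: score how relevant the input is to each mechanism.
    scores = queries(h) @ x                    # (n_rims,)
    active = scores.topk(k_active).indices     # pick the k most relevant mechanisms
    new_h = h.clone()                          # inactive mechanisms are copied unchanged
    for i in active.tolist():
        new_h[i] = cells[i](x.unsqueeze(0), h[i].unsqueeze(0)).squeeze(0)
    return new_h

h = torch.zeros(n_rims, hid)
for x in torch.randn(10, inp):                 # a toy sequence of 10 steps
    h = rim_step(x, h)
```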
NTM and RIM vs MemUP: similarities and differences.
We can use NTM, RIM, etc. as the memory module in MemUP.
PlaNet builds a good model of the environment and then plans with it.
Two main PlaNet improvements:
1. The latent state has a deterministic part and a stochastic part.
2. The model is trained with a Reconstruction Loss and a KL-loss.
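Roughly, with a deterministic part \(h_t\) and a stochastic part \(z_t\) of the latent state, the two losses combine into a per-step variational objective of this general form (a sketch of the standard bound, not PlaNet's exact equations):
\[ \mathcal{L}_t = -\,\mathbb{E}\big[\log p(o_t \mid h_t, z_t)\big] \;+\; \mathrm{KL}\big(q(z_t \mid h_t, o_t)\,\|\,p(z_t \mid h_t)\big). \]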
PlaNet's model vs MemUP: similarities and differences.
Big Bird vs MemUP: similarities and differences (more like parallels).