Partially Observable MDP

Definition

Graphical Model for POMDP:

A POMDP is a 6-tuple $\langle S,A,R,T,\Omega,O \rangle$:

$S$ is a set of states
$A$ is a set of actions
$R: S \times A \to \mathbb{R}$ is a reward function
$T: S \times A \times S \to [0,1]$ is a transition function $T(s,a,s') = P(S_{t+1}=s'|S_t=s, A_t=a)$
$\Omega$ is a set of observations
$O$ is the observation model: one conditional probability distribution $P(o|s)$ over $\Omega$ for each state $s \in S$
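
To make the tuple concrete, here is a minimal sketch of a tabular POMDP in Python. The class name `TabularPOMDP`, the dictionary layout, and the two-state example with its probabilities are all hypothetical and chosen only for illustration; they are not part of the lecture.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularPOMDP:
    states: list        # S
    actions: list       # A
    observations: list  # Omega
    R: dict             # R[(s, a)] -> float
    T: dict             # T[(s, a)] -> {s_next: P(s_next | s, a)}
    O: dict             # O[s] -> {o: P(o | s)}

    def validate(self, tol=1e-9):
        """Check that every conditional distribution sums to 1."""
        for s in self.states:
            assert abs(sum(self.O[s].values()) - 1.0) < tol
            for a in self.actions:
                assert abs(sum(self.T[(s, a)].values()) - 1.0) < tol

    def step(self, s, a, rng):
        """Sample one transition; the agent only ever sees (reward, o)."""
        nxt = self.T[(s, a)]
        s_next = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        obs = self.O[s_next]
        o = rng.choices(list(obs), weights=list(obs.values()))[0]
        return self.R[(s, a)], s_next, o


# Hypothetical two-state example (all numbers are made up for illustration).
toy = TabularPOMDP(
    states=["left", "right"],
    actions=["stay", "switch"],
    observations=["L", "R"],
    R={(s, a): float(s == "right")
       for s in ["left", "right"] for a in ["stay", "switch"]},
    T={
        ("left", "stay"): {"left": 0.9, "right": 0.1},
        ("right", "stay"): {"left": 0.1, "right": 0.9},
        ("left", "switch"): {"left": 0.1, "right": 0.9},
        ("right", "switch"): {"left": 0.9, "right": 0.1},
    },
    O={"left": {"L": 0.8, "R": 0.2}, "right": {"L": 0.2, "R": 0.8}},
)
toy.validate()
reward, hidden_state, observation = toy.step("left", "switch", random.Random(0))
```

The hidden state is returned by `step` only so a simulator can keep track of it; an agent acting in the POMDP would receive just the reward and the observation.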

Exact Solution

A proper belief state allows a POMDP to be formulated as an MDP over belief states (Astrom, 1965).

General Idea:

Belief Update → "Belief" MDP → Plan with Value Iteration → Policy

Belief State update:

$$b_0(s) = P(S_0=s)$$

$$
\begin{aligned}
b_{t+1}(s) &= p(s|o_{t+1}, a_{t}, b_{t}) = \dfrac{p(s, o_{t+1}|a_t, b_t)}{p(o_{t+1}|a_t, b_t)} \\
&\propto p(o_{t+1}| s, a_t, b_t)\, p(s | a_t, b_t) \\
&= p(o_{t+1}| s) \sum_{s_i} p(s | a_t, s_i)\, b_t(s_i)
\end{aligned}
$$
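
The belief update can be sketched in a few lines of Python for the tabular case. This is only an illustration under stated assumptions: the function name `belief_update`, the dictionary layout of `T` and `O`, and the toy probabilities are hypothetical.

```python
def belief_update(b, a, o, T, O):
    """One exact belief update.

    b: {s: b_t(s)}, T[(s_i, a)] -> {s: p(s | a, s_i)}, O[s] -> {o: p(o | s)}.
    """
    unnormalised = {}
    for s in b:
        # Prediction step: sum_i p(s | a, s_i) * b_t(s_i)
        predicted = sum(T[(s_i, a)].get(s, 0.0) * b[s_i] for s_i in b)
        # Correction step: multiply by the observation likelihood p(o | s)
        unnormalised[s] = O[s].get(o, 0.0) * predicted
    z = sum(unnormalised.values())  # equals p(o | a, b_t)
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / z for s, v in unnormalised.items()}


# Hypothetical two-state model (numbers made up for illustration).
T = {
    ("left", "stay"): {"left": 0.9, "right": 0.1},
    ("right", "stay"): {"left": 0.1, "right": 0.9},
}
O = {"left": {"L": 0.8, "R": 0.2}, "right": {"L": 0.2, "R": 0.8}}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "stay", "R", T, O)
```

Starting from a uniform belief, observing `"R"` shifts the belief towards `"right"`. The double loop over states is what makes the exact update quadratic in $|S|$.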

Problems:

Need a model (transition and observation probabilities)
Exact belief update is tractable only for small/simple MDPs
Value Iteration over belief states is tractable only for small MDPs

Just add LSTM to everything

Off-Policy Learning (DRQN):

Add LSTM before last 1-2 layers
Sample sequences of steps from Experience Replay
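
Sampling sequences instead of single transitions could look roughly like the sketch below. The class `EpisodeReplay` and its methods are hypothetical names. Starting each sequence at a random offset within an episode is one common choice (the recurrent state is then typically reset at the start of each sampled sequence), but the details here are assumptions rather than the lecture's exact recipe.

```python
import random


class EpisodeReplay:
    """Replay buffer that stores whole episodes and samples contiguous sequences."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.episodes = []
        self.rng = rng or random.Random()

    def add_episode(self, transitions):
        self.episodes.append(list(transitions))
        if len(self.episodes) > self.capacity:
            self.episodes.pop(0)  # drop the oldest episode

    def sample(self, batch_size, seq_len):
        # Only episodes long enough to contain a full sequence are eligible.
        eligible = [ep for ep in self.episodes if len(ep) >= seq_len]
        if not eligible:
            raise ValueError("no episode is long enough for seq_len")
        batch = []
        for _ in range(batch_size):
            ep = self.rng.choice(eligible)
            start = self.rng.randrange(len(ep) - seq_len + 1)
            batch.append(ep[start:start + seq_len])
        return batch


# Toy usage: each transition is (t, obs, action, reward) with dummy values.
buffer = EpisodeReplay(capacity=100, rng=random.Random(0))
for length in (5, 8, 12):
    buffer.add_episode([(t, 0.0, 0, 0.0) for t in range(length)])
batch = buffer.sample(batch_size=4, seq_len=6)
```

Each element of `batch` is a list of `seq_len` consecutive transitions from one episode, which is what the LSTM needs in order to be unrolled through time during training.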

On-Policy Learning (A3C/PPO):

Add LSTM before last 1-2 layers
Keep LSTM hidden state $h_t$ between rollouts
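
The point about keeping $h_t$ between rollouts can be illustrated with a toy recurrence. In the sketch below, `toy_cell` is a hypothetical scalar stand-in for an LSTM step, and `collect_rollout` and `toy_stream` are likewise made-up names. The hidden state is carried across rollout boundaries and reset only when an episode ends.

```python
import math


def toy_cell(x, h, w_x=0.5, w_h=0.9):
    """Stand-in for one LSTM step: a scalar tanh recurrence (hypothetical)."""
    return math.tanh(w_x * x + w_h * h)


def collect_rollout(env_stream, h, n_steps):
    """Collect n_steps, starting from hidden state h carried over from before."""
    h_start = h  # stored so a training pass can replay the rollout from here
    steps = []
    for _ in range(n_steps):
        obs, done = next(env_stream)
        h = toy_cell(obs, h)
        steps.append((obs, h, done))
        if done:
            h = 0.0  # reset only at episode boundaries
    return h_start, steps, h


def toy_stream():
    """Dummy environment: yields (observation, done) pairs forever."""
    t = 0
    while True:
        t += 1
        yield (t % 3) / 3.0, t % 5 == 0


stream = toy_stream()
h = 0.0
for _ in range(2):
    # h returned by one rollout is the starting state of the next one.
    h_start, steps, h = collect_rollout(stream, h, n_steps=3)
```

With the state carried over, two 3-step rollouts produce the same hidden states as a single 6-step rollout. Resetting `h` at every rollout boundary would instead discard the memory of everything before the boundary.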

PPO+LSTM+RESOURCES

DRQN+LSTM+TRICKS

Asynchronous Methods for Deep Reinforcement Learning (Mnih et al, 2016) | DeepMind, ICML, 3113 citations

Deep Recurrent Q-Learning for Partially Observable MDPs (Hausknecht et al, 2015) | AAAI, 582 citations

MERLIN

Memory-Based Predictor

Prior Distribution

An MLP $f^{prior}$ takes all memory from the previous step and produces the parameters of a diagonal Gaussian distribution:

$$[ \mu_{t}^{prior}, \log \Sigma^{prior}_{t} ] = f^{prior}(h_{t-1}, m_{t-1})$$

Posterior Distribution

Another MLP $f^{post}$ takes:

$$n_t = [e_t, h_{t-1}, m_{t-1}, \mu_{t}^{prior}, \log \Sigma^{prior}_{t} ]$$

and generates a correction for the prior:

$$[\mu^{post}_{t}, \log \Sigma^{post}_{t}] = f^{post}(n_t) + [\mu^{prior}_{t}, \log \Sigma^{prior}_{t}]$$

At the end, the latent state variable $z_t$ is sampled from the posterior distribution.
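
The prior/posterior computation can be sketched in plain Python as follows. Several things here are assumptions made for illustration: both networks are replaced by single linear maps (`W_prior`, `W_post`) rather than real MLPs, $\log \Sigma$ is treated as a vector of log standard deviations, and all names and dimensions are hypothetical.

```python
import math
import random


def linear(weights, x):
    """Matrix-vector product; a stand-in for an MLP (assumption)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_weights(rng, n_out, n_in, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def sample_latent(e_t, h_prev, m_prev, W_prior, W_post, rng):
    # Prior parameters [mu_prior, log_sigma_prior] from previous memory only.
    prior = linear(W_prior, h_prev + m_prev)
    # Posterior input n_t also sees the current embedding e_t and the prior.
    n_t = e_t + h_prev + m_prev + prior
    # The posterior network outputs a correction that is added to the prior.
    post = [c + p for c, p in zip(linear(W_post, n_t), prior)]
    d = len(post) // 2
    mu, log_sigma = post[:d], post[d:]
    # Sample z_t from the diagonal Gaussian posterior.
    z = [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]
    return prior, post, z


rng = random.Random(0)
e_t, h_prev, m_prev = [0.1, -0.2], [0.3, 0.0], [0.5, -0.5]
latent_dim = 2
W_prior = make_weights(rng, 2 * latent_dim, len(h_prev) + len(m_prev))
W_post = make_weights(rng, 2 * latent_dim,
                      len(e_t) + len(h_prev) + len(m_prev) + 2 * latent_dim)
prior, post, z = sample_latent(e_t, h_prev, m_prev, W_prior, W_post, rng)
```

Because the posterior is parameterised as prior plus correction, a posterior network that outputs zeros reproduces the prior exactly, so the correction only has to encode what the new observation adds.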

Model-Free Episodic Control

Let's store all past experiences in $|A|$ dictionaries $Q_{a}^{EC}$, one per action.

$(s_t, a_t)$ pairs are keys and discounted future rewards $R_t$ are values.

Dictionary update:

If the state space has a meaningful distance, then we can use k-nearest neighbours to estimate values for new $(s,a)$ pairs:

Model-Free Episodic Control (2016) | DeepMind, 100 citations

Two possible feature compressors for $s_t$ : Random Network, VAE
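
A minimal sketch of the per-action dictionaries with a k-nearest-neighbour fallback is given below. The class `EpisodicQ` is a hypothetical name. Keeping the maximum return seen for a repeated key follows the update rule as described in the MFEC paper, and returning 0 for an empty dictionary is an assumption made only so the example runs. States are assumed to be already compressed into small feature tuples.

```python
import math


class EpisodicQ:
    """One dictionary per action, mapping state features to the best return seen."""

    def __init__(self, actions, k=2):
        self.k = k
        self.tables = {a: {} for a in actions}

    def update(self, s, a, ret):
        # Keep the highest discounted return ever observed for (s, a).
        table = self.tables[a]
        table[s] = max(table.get(s, -math.inf), ret)

    def estimate(self, s, a):
        table = self.tables[a]
        if s in table:
            return table[s]  # exact match: use the stored value
        if not table:
            return 0.0  # assumption: default value for an empty dictionary
        # Unseen state: average the values of the k nearest stored states.
        nearest = sorted(table, key=lambda key: math.dist(key, s))[: self.k]
        return sum(table[n] for n in nearest) / len(nearest)


q = EpisodicQ(actions=["left", "right"], k=2)
q.update((0.0, 0.0), "left", 1.0)
q.update((0.0, 0.0), "left", 0.5)  # lower return, stored value stays 1.0
q.update((1.0, 0.0), "left", 3.0)
estimate = q.estimate((0.4, 0.0), "left")  # unseen state, mean of 2 neighbours
```

The linear scan in `estimate` is fine for a toy example; a practical implementation would need an approximate nearest-neighbour index and a bound on dictionary size.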

Neural Episodic Control

DND Update

Once a key $h_t$ is queried from a DND (differentiable neural dictionary), that key and its corresponding output are written to the DND. If $h_t \notin M_a$, then we just append it with the N-step Q-value estimate:

otherwise, we update the stored value with the tabular Q-learning rule:

Learn DND and CNN-encoder:

Sample mini-batches from a replay buffer that stores triplets $(s_t, a_t, R_t)$ and use $R_t$ as the target.
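
The DND read and write operations can be sketched as follows. This is an illustration under assumptions, not the lecture's implementation: the class `DND` and its parameters are hypothetical names, the lookup weights use the inverse-distance kernel described in the NEC paper, all stored keys are used rather than only the nearest ones, and the N-step Q estimate is passed in as a plain number.

```python
class DND:
    """Toy differentiable neural dictionary for a single action."""

    def __init__(self, alpha=0.5, delta=1e-3):
        self.alpha = alpha  # learning rate of the tabular update
        self.delta = delta  # small constant that keeps the kernel finite
        self.keys = []
        self.values = []

    def lookup(self, h):
        """Kernel-weighted average of the stored values."""
        if not self.keys:
            raise LookupError("empty DND")
        weights = [
            1.0 / (sum((x - y) ** 2 for x, y in zip(h, key)) + self.delta)
            for key in self.keys
        ]
        total = sum(weights)
        return sum(w / total * v for w, v in zip(weights, self.values))

    def write(self, h, q_n):
        """Append a new key, or move an existing value towards q_n."""
        if h in self.keys:
            i = self.keys.index(h)
            self.values[i] += self.alpha * (q_n - self.values[i])
        else:
            self.keys.append(h)
            self.values.append(q_n)


dnd = DND(alpha=0.5)
dnd.write((0.0,), 1.0)  # new key: stored as is
dnd.write((0.0,), 2.0)  # existing key: value moves halfway towards 2.0
dnd.write((1.0,), 0.0)
q_value = dnd.lookup((0.5,))  # query halfway between the two keys
```

In the full method the keys come from the CNN encoder and the lookup is differentiated with respect to both keys and values; the sketch only shows the dictionary bookkeeping.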

Neural Episodic Control (2017) | DeepMind, 115 citations

Advanced Topics in RL (lecture 12): Memory in RL
