[Diagram: the POMDP agent loop. The Environment holds the true state (e.g. \(s = 7\)) and emits an observation (e.g. \(o = -0.21\)); the Belief Updater maintains \(b_t(s) = P\left(s_t = s \mid h_t \right)\) from the history \(h_t = (b_0, a_1, o_1, \ldots, a_{t-1}, o_{t-1})\); the Policy/Planner maps the belief to an action \(a\).]
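To make the Belief Updater concrete, here is a minimal exact Bayes filter for a discrete POMDP. The function name and the array layouts (`T[a, s, sp]` for transitions, `Z[a, sp, o]` for observation probabilities) are illustrative assumptions, not something fixed by these slides.

```python
import numpy as np

def update_belief(b, a, o, T, Z):
    """Exact Bayesian belief update for a discrete POMDP (sketch).

    b : (S,)      current belief over states
    T : (A, S, S) transition model, T[a, s, sp] = P(sp | s, a)   (assumed layout)
    Z : (A, S, O) observation model, Z[a, sp, o] = P(o | a, sp)  (assumed layout)
    Returns b'(sp) proportional to Z[a, sp, o] * sum_s T[a, s, sp] * b(s).
    """
    pred = T[a].T @ b            # predictive distribution over next states
    post = Z[a][:, o] * pred     # weight by the observation likelihood
    return post / post.sum()     # normalize (assumes P(o | b, a) > 0)
```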
BOARD
DESPOT, POMCP, SARSOP, POMCPOW, others
Online, Offline
Goal is to solve the full POMDP approximately
Can find useful approximate solutions to large problems IN REAL TIME
Focus on the smaller, reachable part of the belief space (see the sketch below)
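A toy illustration of why online planners only need the reachable part of belief space: depth-limited expectimax from the current belief constructs exactly the beliefs reachable within \(d\) steps and nothing else. This is a sketch under the same assumed array layouts as above; POMCP, DESPOT, and POMCPOW replace the exhaustive action/observation loops with sampling to get real-time performance.

```python
import numpy as np

def plan_value(b, d, T, Z, R, gamma):
    """Depth-d expectimax over the beliefs reachable from b (toy sketch).

    Only beliefs reachable from b within d steps are ever constructed,
    which is why online planners can ignore the rest of belief space.
    Exponential in d; sampling-based solvers tame this.
    R : (S, A) reward matrix (assumed layout).
    """
    if d == 0:
        return 0.0
    best = -np.inf
    for a in range(T.shape[0]):
        pred = T[a].T @ b                  # predictive state distribution
        p_o = Z[a].T @ pred                # P(o | b, a) for every observation
        v = b @ R[:, a]                    # immediate expected reward
        for o in range(Z.shape[2]):
            if p_o[o] > 1e-12:
                bp = Z[a][:, o] * pred     # unnormalized posterior belief
                v += gamma * p_o[o] * plan_value(bp / bp.sum(), d - 1, T, Z, R, gamma)
        best = max(best, v)
    return best
```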
POMDP:
$$\pi^* = \underset{\pi: \mathcal{B} \to \mathcal{A}}{\mathop{\text{argmax}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(b_t)) \right]$$
QMDP:
\[\pi_{Q_\text{MDP}}(b) = \underset{a\in\mathcal{A}}{\text{argmax}} \underset{s\sim b}{E}\left[Q_\text{MDP}(s,a)\right]\]
where \(Q_\text{MDP}\) are the optimal \(Q\) values for the fully observable MDP. Complexity: \(O(T |S|^2|A|)\).
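A minimal QMDP sketch under the same assumed layouts: value iteration on the fully observable MDP, then a greedy choice against the belief-averaged \(Q\) values, exactly as in the formula above. Because QMDP assumes all uncertainty vanishes after one step, the resulting policy never takes actions purely to gather information.

```python
import numpy as np

def qmdp_q_values(T, R, gamma, iters=500):
    """Value iteration for the fully observable MDP; returns Q of shape (S, A).
    T : (A, S, S) transitions, R : (S, A) rewards (assumed layouts)."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                              # V(sp) = max_a Q(sp, a)
        Q = R + gamma * np.einsum('ast,t->sa', T, V)   # Bellman backup
    return Q

def qmdp_action(b, Q):
    """pi_QMDP(b) = argmax_a E_{s~b}[ Q_MDP(s, a) ]."""
    return int(np.argmax(b @ Q))
```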
INDUSTRIAL GRADE
ACAS X
[Kochenderfer, 2011]
Same as assuming full observability on the next step
\[\pi_\text{FIB}(b) = \underset{a \in \mathcal{A}}{\text{argmax}}\, \alpha_a^T b\]
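For contrast, a sketch of the fast informed bound (FIB) backup that produces these alpha vectors: one vector per action, with the next observation's likelihood folded into the backup (unlike QMDP). Array layouts are assumed as above.

```python
import numpy as np

def fib_alphas(T, Z, R, gamma, iters=500):
    """Fast informed bound: one alpha vector per action, shape (A, S).

    Backup: alpha_a(s) = R(s, a)
            + gamma * sum_o max_a' sum_sp Z[a, sp, o] T[a, s, sp] alpha_a'(sp)
    """
    S, A = R.shape
    alpha = np.zeros((A, S))
    for _ in range(iters):
        new = np.empty_like(alpha)
        for a in range(A):
            # M[o, s, ap] = sum_sp Z[a, sp, o] * T[a, s, sp] * alpha[ap, sp]
            M = np.einsum('to,st,pt->osp', Z[a], T[a], alpha)
            new[a] = R[:, a] + gamma * M.max(axis=2).sum(axis=0)
        alpha = new
    return alpha

def fib_action(b, alpha):
    """pi_FIB(b) = argmax_a alpha_a^T b."""
    return int(np.argmax(alpha @ b))
```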
POMDP:
$$\pi^* = \underset{\pi: \mathcal{B} \to \mathcal{A}}{\mathop{\text{argmax}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(b_t)) \right]$$
Hindsight:
$$V_\text{hs}(b) = \underset{s_0 \sim b}{E}\left[\max_{a_t}\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$$
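A Monte Carlo sketch of the hindsight value: each scenario fixes the initial state and all transition randomness up front (determinization), then backward dynamic programming optimizes the action sequence with full knowledge of that future. Since actions are chosen after the randomness is revealed, this estimate upper-bounds the optimal value. The finite horizon and array layouts are assumptions for illustration.

```python
import numpy as np

def hindsight_value(b, T, R, gamma, horizon=30, n_scenarios=200, seed=0):
    """Monte Carlo estimate of V_hs(b) over a truncated horizon.

    Pre-drawing one uniform random number per step makes each scenario a
    deterministic problem, which backward DP solves with full knowledge of
    the future.  T : (A, S, S), R : (S, A) (assumed layouts)."""
    rng = np.random.default_rng(seed)
    A, S, _ = T.shape
    cdf = np.cumsum(T, axis=2)                     # (A, S, S) transition CDFs
    total = 0.0
    for _ in range(n_scenarios):
        s0 = rng.choice(S, p=b)                    # sample the initial state
        u = rng.random(horizon)                    # shared randomness per step
        V = np.zeros(S)
        for k in reversed(range(horizon)):
            # inverse-CDF sampling: successor of each (a, s) under u[k]
            nxt = np.minimum((cdf < u[k]).sum(axis=2), S - 1)   # (A, S)
            V = (R + gamma * V[nxt].T).max(axis=1)              # deterministic backup
        total += V[s0]
    return total / n_scenarios
```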
BOARD
[Diagram: solution-quality spectrum — QMDP (suboptimal) vs. discretized vs. full POMDP (state of the art), with "Ours" marked. [Ye, 2017] [Sunberg, 2018]]
COMPUTE
[Plot: expected cumulative reward of the full POMDP solution (POMCPOW) compared with a no-observations baseline.]