Markov Model
Markov Decision Process (MDP)
Partially Observable Markov Decision Process (POMDP)
$$\underset{\pi}{\mathop{\text{maximize}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \right]$$
\(Q^\pi(s, a)\) = expected discounted sum of future rewards after taking action \(a\) in state \(s\) and following \(\pi\) thereafter
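Equivalently (a standard identity, stated here for completeness), \(Q^\pi\) satisfies the Bellman recursion

$$Q^\pi(s, a) = R(s, a) + \gamma \, \underset{s'}{E}\left[Q^\pi(s', \pi(s'))\right]$$

where \(s'\) is the next state drawn from the transition model.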
[Diagram: the POMDP agent loop — the Environment emits an observation \(o\), the Belief Updater turns it into a belief \(b\), and the Planner/Policy maps \(b\) to an action \(a\) applied back to the Environment.]
[Example: belief over a driver's behavior type — Aggressive: 63%, Normal: 34%, Timid: 3% — alongside the physical state \(x, y, v\); action: Turn Left.]
C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of Markov decision processes,” Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987
Computational Complexity
POMDPs
(PSPACE-complete)
Try to solve something other than a POMDP
Converge to the optimal POMDP solution
POMDP:
$$\pi^* = \underset{\pi}{\mathop{\text{argmax}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(b_t)) \right]$$
QMDP:
\[\pi_{Q_\text{MDP}}(b) = \underset{a\in\mathcal{A}}{\text{argmax}} \, \underset{s\sim b}{E}\left[Q_\text{MDP}(s,a)\right]\]
where \(Q_\text{MDP}\) are the optimal \(Q\) values for the fully observable MDP.
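A minimal sketch of the QMDP approximation in plain Julia (no packages; the tabular transition tensor `T`, reward matrix `R`, and iteration count are illustrative assumptions):

```julia
# QMDP sketch: value-iterate the fully observable MDP, then act on the belief.
# T[s, a, s′] = transition probability, R[s, a] = reward, γ = discount factor.
function qmdp_q(T::Array{Float64,3}, R::Matrix{Float64}, γ::Float64; iters=100)
    nS, nA = size(R)
    Q = zeros(nS, nA)
    for _ in 1:iters                      # each sweep costs O(|S|²|A|)
        V = vec(maximum(Q, dims=2))       # V(s) = maxₐ Q(s, a)
        for s in 1:nS, a in 1:nA
            Q[s, a] = R[s, a] + γ * sum(T[s, a, s′] * V[s′] for s′ in 1:nS)
        end
    end
    return Q
end

# π_QMDP(b) = argmaxₐ E_{s∼b}[Q_MDP(s, a)]; b is a probability vector over states.
qmdp_action(Q, b) = argmax(vec(sum(b .* Q, dims=1)))
```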
INDUSTRIAL GRADE
ACAS X
[Kochenderfer, 2011]
[Figure from [Sunberg, 2018]: planner performance comparison — "Ours" versus suboptimal, state-of-the-art, and discretized baselines.]
| (S,A,O) | Offline | Online |
|---|---|---|
| (D,D,D) | All PBVI Variants, SARSOP | AEMS |
| (C,D,D) | | POMCP, DESPOT |
| (C,D,C) | MCVI | POMCPOW, DESPOT-alpha |
| (C,C,C) | Special Cases/RL Only | |

(D = discrete, C = continuous. All solvers can also handle the cells above.)
POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
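For example, a problem-definition sketch through this interface (assuming the QuickPOMDPs and POMDPTools companion packages; the classic tiger problem):

```julia
using POMDPs, QuickPOMDPs, POMDPTools

tiger = QuickPOMDP(
    states       = [:left, :right],                   # tiger location
    actions      = [:open_left, :open_right, :listen],
    observations = [:hear_left, :hear_right],
    discount     = 0.95,
    initialstate = Uniform([:left, :right]),
    transition   = (s, a) -> a == :listen ? Deterministic(s) :
                                            Uniform([:left, :right]),  # reset after opening
    observation  = (a, sp) -> a == :listen ?
        (sp == :left ? SparseCat([:hear_left, :hear_right], [0.85, 0.15]) :
                       SparseCat([:hear_left, :hear_right], [0.15, 0.85])) :
        Uniform([:hear_left, :hear_right]),
    reward       = (s, a) -> a == :listen ? -1.0 :
                   a == Symbol(:open_, s) ? -100.0 : 10.0
)

# Solve offline with one of the (D,D,D) solvers from the table and act on a belief:
using QMDP
policy = solve(QMDPSolver(), tiger)
b = initialize_belief(DiscreteUpdater(tiger), initialstate(tiger))
a = action(policy, b)
```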
Efficiency comes from judiciously choosing beliefs for backup
[Ross, 2008] [Silver, 2010]
*(Partially Observable Monte Carlo Planning)
Types of Uncertainty
ALEATORY
MODEL (Epistemic, Static)
STATE (Epistemic, Dynamic)
[Figure: state trajectory over timesteps under accurate observations.]
Goal: \(a=0\) at \(s=0\)
Optimal Policy
Localize
\(a=0\)
Approximations like QMDP act as if the state becomes fully observable on the next step, so they see no value in localizing.
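As a worked illustration (the symmetric two-point belief here is an assumed example, not from the slides): with

$$b = \tfrac{1}{2}\,\delta_{+d} + \tfrac{1}{2}\,\delta_{-d} \quad\Rightarrow\quad \underset{s\sim b}{E}[s] = 0,$$

a certainty-equivalent policy behaves as if it were already at \(s = 0\) and immediately chooses \(a = 0\), while the optimal POMDP policy first pays to localize.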
[Diagram: the same agent loop with a Gaussian belief — the Belief Updater passes \(b = \mathcal{N}(\hat{s}, \Sigma)\) to the Policy, which sends action \(a\) to the Environment.]
True state: \(s \in \mathbb{R}^n\), with dynamics \(s_{t+1} \sim \mathcal{N}(A s_t + B a_t, W)\)
Observation: \(o \sim \mathcal{N}(C s, V)\)
Reward: \(R(s, a) = - s^T Q s - a^T R a\)
Belief updater: Kalman filter
Policy: \(\pi(b) = K \hat{s}\)
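A minimal sketch of this belief updater in plain Julia (LinearAlgebra only; matrix names follow the slide). The Kalman filter keeps the belief Gaussian, and by the separation principle the certainty-equivalent controller \(a = K\hat{s}\) is optimal in this linear-Gaussian setting:

```julia
using LinearAlgebra

# Predict: push b = N(ŝ, Σ) through the dynamics s′ ~ N(As + Ba, W).
kf_predict(ŝ, Σ, A, B, W, a) = (A*ŝ + B*a, A*Σ*A' + W)

# Update: condition on an observation o ~ N(Cs, V).
function kf_update(ŝ, Σ, C, V, o)
    K = Σ*C' / (C*Σ*C' + V)          # Kalman gain
    return ŝ + K*(o - C*ŝ), (I - K*C)*Σ
end
```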
POMDP:
$$\pi^* = \underset{\pi: \mathcal{B} \to \mathcal{A}}{\mathop{\text{argmax}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(b_t)) \right]$$
QMDP:
\[\pi_{Q_\text{MDP}}(b) = \underset{a\in\mathcal{A}}{\text{argmax}} \, \underset{s\sim b}{E}\left[Q_\text{MDP}(s,a)\right]\]
where \(Q_\text{MDP}\) are the optimal \(Q\) values for the fully observable MDP, computable in \(O(T |S|^2|A|)\) time.