A Quick Overview of POMDP Solution Methods

Markov Model

  • \(\mathcal{S}\) - State space
  • \(T:\mathcal{S}\times\mathcal{S} \to \mathbb{R}\) - Transition probability distributions

Markov Decision Process (MDP)

  • \(\mathcal{S}\) - State space
  • \(T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Transition probability distribution
  • \(\mathcal{A}\) - Action space
  • \(R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}\) - Reward

Partially Observable Markov Decision Process (POMDP)

  • \(\mathcal{S}\) - State space
  • \(T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Transition probability distribution
  • \(\mathcal{A}\) - Action space
  • \(R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}\) - Reward
  • \(\mathcal{O}\) - Observation space
  • \(Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}\) - Observation probability distribution

Objective Function

$$\underset{\pi}{\mathop{\text{maximize}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(\cdot)) \right]$$

\(Q^\pi(s, a)\) = Expected sum of future rewards

  • Starting at \((s,a)\)
  • Following \(\pi\) thereafter
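Written out (the standard definition, shown here for the fully observable case where \(a_t = \pi(s_t)\)):

$$Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s,\; a_0 = a,\; a_t = \pi(s_t) \text{ for } t \ge 1\right]$$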

 

POMDP Sense-Plan-Act Loop

[Diagram: the environment emits an observation \(o\); the belief updater combines \(o\) with the previous belief and action to produce a new belief \(b\); the planner/policy maps \(b\) to an action \(a\), which is applied back to the environment. Example belief: physical state \((x, y, v)\) plus internal-state probabilities Aggressive: 63%, Normal: 34%, Timid: 3%; example action: Turn Left.]
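In the belief updater box, for a discrete POMDP the exact update is \(b'(s') \propto Z(o \mid a, s') \sum_s T(s' \mid s, a)\, b(s)\). A minimal pure-Julia sketch (the array index conventions below are assumptions for illustration):

```julia
# Exact (discrete) Bayes-filter belief update for a finite POMDP.
# T[s, a, sp] = P(sp | s, a),  Z[a, sp, o] = P(o | a, sp)   (assumed index conventions)
function update_belief(b::Vector{Float64}, a::Int, o::Int,
                       T::Array{Float64,3}, Z::Array{Float64,3})
    n = length(b)
    bp = zeros(n)
    for sp in 1:n
        pred = sum(T[s, a, sp] * b[s] for s in 1:n)  # prediction: sum_s T(sp|s,a) b(s)
        bp[sp] = Z[a, sp, o] * pred                  # correction: weight by P(o|a,sp)
    end
    return bp ./ sum(bp)                             # normalize
end
```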

Computational Complexity

POMDPs are PSPACE-complete [C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of Markov decision processes,” Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987].

  • Model Approximations: try to solve something other than a POMDP
  • Numerical Approximations: converge to the optimal POMDP solution

Model Approximations

  1. QMDP
  2. Fast Informed Bound
  3. Hindsight Optimization
  4. Ignore Observations
  5. Most Likely Observation

Numerical Approximations: converge to the optimal POMDP solution

QMDP

POMDP:

$$\pi^* = \underset{\pi: \mathcal{B} \to \mathcal{A}}{\mathop{\text{argmax}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(b_t)) \right]$$

QMDP:

$$\pi_{Q_\text{MDP}}(b) = \underset{a\in\mathcal{A}}{\mathop{\text{argmax}}} \, \underset{s\sim b}{E}\left[Q_\text{MDP}(s,a)\right]$$

where \(Q_\text{MDP}\) are the optimal \(Q\) values for the fully observable MDP.
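A pure-Julia sketch of the whole QMDP approximation under the same assumed tabular conventions (value iteration for \(Q_\text{MDP}\), then a belief-weighted argmax; the fixed iteration count is a simplification):

```julia
# Q_MDP via value iteration on the fully observable MDP, then the QMDP action choice.
# T[s, a, sp] = P(sp | s, a),  R[s, a] = reward   (assumed conventions)
function qmdp_values(T::Array{Float64,3}, R::Matrix{Float64}, γ::Float64; iters=1000)
    nS, nA = size(R)
    Q = zeros(nS, nA)
    for _ in 1:iters
        V = vec(maximum(Q, dims=2))                  # V(s) = max_a Q(s, a)
        for s in 1:nS, a in 1:nA
            Q[s, a] = R[s, a] + γ * sum(T[s, a, sp] * V[sp] for sp in 1:nS)
        end
    end
    return Q
end

# π_QMDP(b) = argmax_a E_{s~b}[Q_MDP(s, a)], i.e. argmax over the belief-weighted columns
qmdp_action(b::Vector{Float64}, Q::Matrix{Float64}) = argmax(Q' * b)
```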

QMDP

[Images: an “industrial grade” badge; the ACAS X collision avoidance system [Kochenderfer, 2011]; a results plot [Sunberg, 2018] with series labeled Ours, Suboptimal, State of the Art, and Discretized.]

Offline

  • Find a solution for all possible beliefs offline, then use a look-up table online
  • Typically use point-based value iteration with \(\alpha\)-vectors

Online

  • Find a solution for the current belief online
  • Typically use heuristic search or Monte Carlo Tree Search

Numerical POMDP Solvers

(S, A, O)   | Offline                    | Online
(D, D, D)   | All PBVI variants, SARSOP  | AEMS
(C, D, D)   |                            | POMCP, DESPOT
(C, D, C)   | MCVI                       | POMCPOW, DESPOT-alpha
(C, C, C)   | Special cases / RL only    | Special cases / RL only

(D = discrete, C = continuous. All solvers can also handle the cells above.)

POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
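For instance, solving a small POMDP with QMDP through this interface might look roughly like the following; the package and function names (QMDP.jl, POMDPModels.jl, POMDPTools.jl, `TigerPOMDP`, `QMDPSolver`, `DiscreteUpdater`) are my recollection of the POMDPs.jl ecosystem and may differ across versions:

```julia
using POMDPs, POMDPModels, POMDPTools, QMDP   # assumed packages from the POMDPs.jl ecosystem

pomdp  = TigerPOMDP()                  # classic toy POMDP from POMDPModels.jl
policy = solve(QMDPSolver(), pomdp)    # offline approximate solve

up = DiscreteUpdater(pomdp)                        # exact discrete Bayes filter
b  = initialize_belief(up, initialstate(pomdp))    # prior belief
a  = action(policy, b)                             # act on the current belief
# after receiving an observation o from the environment:
# b = update(up, b, a, o)
```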

PBVI/SARSOP

Efficiency comes from judiciously choosing beliefs for backup
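As a sketch of the core operation (pure Julia, with the same assumed tabular `T`, `Z`, `R` conventions as in the earlier sketches): a single point-based backup at a belief \(b\) produces one new \(\alpha\)-vector, and PBVI and SARSOP differ mainly in how they choose which beliefs to back up.

```julia
using LinearAlgebra

# One point-based backup at belief b given the current alpha-vector set Γ (assumed non-empty).
# Returns the new alpha vector that is optimal at b, and the action it recommends.
# T[s,a,sp] = P(sp|s,a), Z[a,sp,o] = P(o|a,sp), R[s,a] = reward (assumed index conventions)
function point_backup(b::Vector{Float64}, Γ::Vector{Vector{Float64}}, T, Z, R, γ)
    nS, nA = size(R)
    nO = size(Z, 3)
    best_val, best_α, best_a = -Inf, Float64[], 0
    for a in 1:nA
        α_a = [R[s, a] for s in 1:nS]                        # immediate reward term
        for o in 1:nO
            # back-propagate every α in Γ through (a, o), keep the one best at b
            cands = [[sum(T[s, a, sp] * Z[a, sp, o] * α[sp] for sp in 1:nS) for s in 1:nS]
                     for α in Γ]
            α_a .+= γ .* cands[argmax([dot(b, c) for c in cands])]
        end
        if dot(b, α_a) > best_val                            # keep the action whose α is best at b
            best_val, best_α, best_a = dot(b, α_a), α_a, a
        end
    end
    return best_α, best_a
end
```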

  • A POMDP is an MDP on the belief space, but exact belief updates are expensive
  • POMCP* uses simulations of histories instead of full belief updates
  • Each belief is implicitly represented by a collection of unweighted particles

[Ross, 2008] [Silver, 2010]

*(Partially Observable Monte Carlo Planning)
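To illustrate the unweighted-particle idea only (this is just the belief-representation piece, not the tree search), a rejection-style particle update with a hypothetical generative model `gen(s, a) -> (sp, o)` might look like this; rejection of this form only makes sense for discrete observations:

```julia
# Unweighted-particle belief update by rejection: simulate histories forward and
# keep only the next states whose simulated observation matches the one received.
# `gen(s, a)` is a hypothetical generative model returning (next_state, observation).
function particle_update(particles::Vector, a, o, gen; n=length(particles), max_tries=100_000)
    new_particles = similar(particles, 0)
    tries = 0
    while length(new_particles) < n && tries < max_tries
        s = rand(particles)                      # draw a state from the current belief
        sp, osim = gen(s, a)                     # simulate one step of history
        osim == o && push!(new_particles, sp)    # keep particles consistent with o
        tries += 1
    end
    return new_particles
end
```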

Types of Uncertainty

  • Aleatory
  • Model (epistemic, static)
  • State (epistemic, dynamic)

POMDP Example: Light-Dark

$$\begin{aligned} & \mathcal{S} = \mathbb{Z} \quad \quad \quad ~~ \mathcal{O} = \mathbb{R} \\ & s' = s+a \quad \quad o \sim \mathcal{N}(s, |s-10|) \\ & \mathcal{A} = \{-10, -1, 0, 1, 10\} \\ & R(s, a) = \begin{cases} 100 & \text{ if } a = 0, s = 0 \\ -100 & \text{ if } a = 0, s \neq 0 \\ -1 & \text{ otherwise} \end{cases} \end{aligned}$$

[Plot: state vs. timestep trajectories. Observations are accurate in one region of the state space; the goal is to take \(a = 0\) at \(s = 0\); the optimal policy first localizes in the accurate-observation region, then takes \(a = 0\) at \(s = 0\).]
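A generative-model sketch of these dynamics (pure Julia; `lightdark_step` is a hypothetical name, the observation noise standard deviation is taken as \(|s-10|\), and the sketch draws the observation from the post-transition state):

```julia
# One generative step of the Light-Dark problem defined above.
# Observations are accurate only near s = 10, the "light" region.
function lightdark_step(s::Int, a::Int)
    if a == 0                                    # a = 0 declares "I am at the goal"
        r = s == 0 ? 100.0 : -100.0
        return s, 0.0, r, true                   # (next state, observation, reward, done)
    end
    sp = s + a                                   # deterministic transition s' = s + a
    o  = sp + abs(sp - 10) * randn()             # o ~ N(s', |s' - 10|)
    return sp, o, -1.0, false
end
```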

[Plots: trajectories for the full POMDP solution vs. QMDP. QMDP acts as if the state becomes fully observable on the next step, so it does not take information-gathering (localization) actions.]

LQG Problem (a simple POMDP)

[Diagram: the sense-plan-act loop specialized to linear-quadratic-Gaussian control.]

  • True state: \(s \in \mathbb{R}^n\), with dynamics \(s_{t+1} \sim \mathcal{N}(A s_t + B a_t, W)\)
  • Observation: \(o \sim \mathcal{N}(C s, V)\)
  • Reward: \(R(s, a) = - s^T Q s - a^T R a\)
  • Belief updater: Kalman filter, \(b = \mathcal{N}(\hat{s}, \Sigma)\)
  • Policy: \(\pi(b) = K \hat{s}\)
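A sketch of one sense-plan-act iteration for this LQG problem (pure Julia; `lqg_step` is a hypothetical name, and the feedback gain \(K\) is assumed to have been computed offline, e.g. from the discrete-time Riccati equation):

```julia
using LinearAlgebra

# One Kalman-filter belief update followed by the certainty-equivalent control a = K ŝ.
# A, B, C, W, V are the model matrices above; (ŝ, Σ) is the Gaussian belief b = N(ŝ, Σ).
function lqg_step(ŝ, Σ, a_prev, o, A, B, C, W, V, K)
    # predict the belief through the dynamics s' ~ N(A s + B a, W)
    ŝp = A * ŝ + B * a_prev
    Σp = A * Σ * A' + W
    # correct with the observation o ~ N(C s, V)
    S  = C * Σp * C' + V                 # innovation covariance
    Kf = Σp * C' / S                     # Kalman gain
    ŝn = ŝp + Kf * (o - C * ŝp)
    Σn = (I - Kf * C) * Σp
    # the linear policy only needs the mean: π(b) = K ŝ
    return ŝn, Σn, K * ŝn
end
```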

QMDP

POMDP:

$$\pi^* = \underset{\pi: \mathcal{B} \to \mathcal{A}}{\mathop{\text{argmax}}} \, E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(b_t)) \right]$$

QMDP:

$$\pi_{Q_\text{MDP}}(b) = \underset{a\in\mathcal{A}}{\mathop{\text{argmax}}} \, \underset{s\sim b}{E}\left[Q_\text{MDP}(s,a)\right]$$

where \(Q_\text{MDP}\) are the optimal \(Q\) values for the fully observable MDP, computable with value iteration in \(O(T |S|^2|A|)\) time.

A Quick Overview of POMDP Solution Methods

By Zachary Sunberg
