Scalable online POMDP planning for safe and efficient autonomy

Zachary Sunberg

 

Europa

Europa Lander

Two Objectives for Autonomy

  • EFFICIENCY: Minimize resource use (especially time)
  • SAFETY: Minimize the risk of harm to oneself and others

Safety often opposes Efficiency

Types of Uncertainty

  • Aleatory → MDP
  • Static Epistemic → Uncertain MDP (RL)
  • Dynamic Epistemic → POMDP

Uncertainty in Space Exploration

Thrusters (Aleatory)

Gravity (Epistemic)

Rough Terrain (Aleatory and Epistemic)

Policies of Other Vehicles (Epistemic)

Markov Decision Process (MDP)

  • \(\mathcal{S}\) - State space
  • \(\mathcal{A}\) - Action space
  • \(T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Transition probability distribution
  • \(R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}\) - Reward

Partially Observable Markov Decision Process (POMDP)

  • \(\mathcal{S}\) - State space
  • \(\mathcal{A}\) - Action space
  • \(T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Transition probability distribution
  • \(R:\mathcal{S}\times \mathcal{A} \to \mathbb{R}\) - Reward
  • \(\mathcal{O}\) - Observation space
  • \(Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}\) - Observation probability distribution
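As a concrete illustration of these components (covering both the MDP tuple above and the added observation model), here is a minimal sketch of a hypothetical 1-D problem written as plain Julia generative functions. All names and dynamics are illustrative, not taken from the talk or from any package.

```julia
# Hypothetical 1-D toy POMDP written as plain generative-model functions.
using Random

const ACTIONS = [-1, 0, 1]                                  # 𝒜: step left, stay, step right

transition(s, a, rng)      = s + a + rand(rng, [-1, 0, 1])  # sample s′ ~ T(s′ | s, a)
reward(s, a)               = s == 0 ? 10.0 : -1.0           # R(s, a)
observation(s, a, sp, rng) = sp + randn(rng)                # sample o ~ Z(o | s, a, s′)

# One simulated step:
rng = MersenneTwister(1)
s  = 3
a  = rand(rng, ACTIONS)
sp = transition(s, a, rng)
o  = observation(s, a, sp, rng)
r  = reward(s, a)
```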

Pareto Optimization

(Figure: Pareto frontiers in the Safety–Efficiency plane for Model \(M_1\) with Algorithm \(A_1\) and Model \(M_2\) with Algorithm \(A_2\); pushing the frontier outward means better performance.)

$$\underset{\pi}{\mathop{\text{maximize}}} \, \sum_{t=1}^T r_t = \sum_{t=1}^T r_t^\text{E} + \lambda r_t^\text{S}$$

(Here \(r_t^\text{E}\) is the efficiency reward, \(r_t^\text{S}\) the safety reward, and \(\lambda\) the weight that trades them off.)
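One way to trace the frontier in the figure above is to sweep the weight \(\lambda\) and record the resulting metrics. The sketch below is purely hypothetical: `plan_and_evaluate` is a placeholder that plans with the combined reward and returns simulated (efficiency, safety) values; it is not part of any real package.

```julia
# Hypothetical sketch: sweep the safety weight λ to trace a Pareto frontier.
# `plan_and_evaluate(λ)` is a placeholder that plans with reward r_E + λ*r_S
# and returns the resulting (efficiency, safety) metrics from simulation.
function pareto_sweep(plan_and_evaluate; λs = [0.1, 1.0, 10.0, 100.0])
    return [(λ = λ, metrics = plan_and_evaluate(λ)) for λ in λs]
end
```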

POMDPs are Hard

  • Curse of Dimensionality
  • Curse of History
  • PSPACE-Complete

Online Tree Search


Estimate \(Q(s, a)\) based on children

$$Q(s,a) = E\left[\sum_t \gamma^t r_t | s_0 = s, a_0=a\right]$$
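To make the estimate concrete, here is a minimal Monte Carlo rollout sketch (not POMCPOW or any particular tree search): `step(s, a, rng)` is a hypothetical generative model returning `(s′, r)`, and the rollout policy is uniformly random.

```julia
# Minimal sketch: estimate Q(s, a) by Monte Carlo rollouts through a generative model.
using Random, Statistics

function rollout_q(step, s, a, actions; γ = 0.95, depth = 20, n = 100,
                   rng = MersenneTwister(0))
    returns = Float64[]
    for _ in 1:n
        total, disc = 0.0, 1.0
        sp, r = step(s, a, rng)                           # first step uses the queried action a
        total += disc * r
        for _ in 2:depth
            disc *= γ
            sp, r = step(sp, rand(rng, actions), rng)     # random rollout policy
            total += disc * r
        end
        push!(returns, total)
    end
    return mean(returns)                                  # Monte Carlo estimate of Q(s, a)
end
```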

Autonomous Driving

Tweet by Nitin Gupta

29 April 2018

https://twitter.com/nitguptaa/status/990683818825736192

Intelligent Driver Model (IDM)

$$\ddot{x}_\text{IDM} = a \left[ 1 - \left( \frac{\dot{x}}{\dot{x}_0} \right)^{\delta} - \left(\frac{g^*(\dot{x}, \Delta \dot{x})}{g}\right)^2 \right]$$

$$g^*(\dot{x}, \Delta \dot{x}) = g_0 + T \dot{x} + \frac{\dot{x}\,\Delta \dot{x}}{2 \sqrt{a b}}$$

[Treiber, et al., 2000] [Kesting, et al., 2007] [Kesting, et al., 2009]
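Transcribed directly into code, the model above looks like the following sketch; the default parameter values are typical choices from the IDM literature and are shown only for illustration.

```julia
# Intelligent Driver Model acceleration (direct transcription of the equations above).
function idm_acceleration(v, Δv, gap;
                          v0 = 33.3,   # desired speed [m/s]
                          T  = 1.5,    # desired time headway [s]
                          g0 = 2.0,    # minimum gap [m]
                          a  = 1.4,    # maximum acceleration [m/s²]
                          b  = 2.0,    # comfortable braking deceleration [m/s²]
                          δ  = 4.0)
    g_star = g0 + T * v + v * Δv / (2 * sqrt(a * b))   # desired gap g*(v, Δv)
    return a * (1 - (v / v0)^δ - (g_star / gap)^2)     # ẍ_IDM
end

# Example: following a slower car 20 m ahead while travelling 2 m/s faster.
idm_acceleration(30.0, 2.0, 20.0)
```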

Internal States

(Figure: simulation results comparing an MDP trained on normal drivers, an MDP trained on all drivers, an omniscient planner, and POMCPOW (ours).)

[Sunberg & Kochenderfer, ACC 2017, T-ITS Under Review]

POMDPs with Continuous Actions, Observations, and States

  • PO-UCT (POMCP)
  • DESPOT


POMDPs with Continuous...

  • PO-UCT (POMCP)
  • DESPOT
POMDP Example: Light-Dark

$$\begin{aligned}
& \mathcal{S} = \mathbb{Z} \qquad \mathcal{O} = \mathbb{R} \qquad \mathcal{A} = \{-10, -1, 0, 1, 10\} \\
& s' = s + a \qquad o \sim \mathcal{N}(s, |s - 10|) \\
& R(s, a) = \begin{cases} 100 & \text{if } a = 0, s = 0 \\ -100 & \text{if } a = 0, s \neq 0 \\ -1 & \text{otherwise} \end{cases}
\end{aligned}$$

Goal: take \(a = 0\) at \(s = 0\). Observations are accurate near \(s = 10\) (the "light" region), so the optimal policy first localizes there, then returns to \(s = 0\) and takes \(a = 0\).

(Figure: state vs. timestep trajectory illustrating this policy.)
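A minimal sketch of this problem as plain Julia generative functions (no particular package API implied), taking \(|s - 10|\) as the observation standard deviation:

```julia
# Light-Dark problem from the definition above, as generative-model functions.
using Random

const LD_ACTIONS = [-10, -1, 0, 1, 10]                        # 𝒜

ld_transition(s::Int, a::Int) = s + a                          # s′ = s + a (deterministic)
ld_observation(s::Int, rng)   = s + abs(s - 10) * randn(rng)   # o ~ 𝒩(s, |s − 10|)
ld_reward(s::Int, a::Int)     = a == 0 ? (s == 0 ? 100.0 : -100.0) : -1.0
```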

POMCP: -18.46

POMCP-DPW: -18.46

POMCP-DPW converges to QMDP

Proof Outline:

  1. The observation space is continuous → each sampled observation is unique w.p. 1.
  2. (1) → each belief node contains exactly one state particle, so each belief is merely an alias for that state.
  3. (2) → POMCP-DPW is equivalent to MCTS-DPW applied to the fully observable MDP, plus the root belief state.
  4. Solving this MDP is equivalent to finding the QMDP solution → POMCP-DPW converges to the QMDP policy.
Sunberg, Z. N. and Kochenderfer, M. J. "Online Algorithms for POMDPs with Continuous State, Action, and Observation Spaces", ICAPS (2018)

POMDP Solution

QMDP

\[\underset{\pi: \mathcal{B} \to \mathcal{A}}{\mathop{\text{maximize}}} \, V^\pi(b)\]

\[\underset{a \in \mathcal{A}}{\mathop{\text{maximize}}} \, \underset{s  \sim{} b}{E}\Big[Q_{MDP}(s, a)\Big]\]

Equivalent to assuming full observability starting on the next step
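A small sketch of the QMDP action rule above for a weighted particle belief; `Q_mdp(s, a)` stands in for a precomputed fully observable value function and is purely illustrative.

```julia
# QMDP action selection: maximize the expected fully observable value under the belief.
function qmdp_action(Q_mdp, actions, particles, weights)
    best_a, best_val = first(actions), -Inf
    for a in actions
        # E_{s ~ b}[Q_MDP(s, a)] for a weighted particle belief b
        val = sum(w * Q_mdp(s, a) for (s, w) in zip(particles, weights)) / sum(weights)
        if val > best_val
            best_a, best_val = a, val
        end
    end
    return best_a
end
```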

POMCP: -18.46

POMCP-DPW: -18.46

POMCPOW: 51.85

[Sunberg ICAPS 2018, Thesis]

Continuous Observation Analytical Results (POWSS)

Our simplified algorithm is near-optimal

[Lim, Tomlin, & Sunberg, IJCAI 2020]


POMDPs with Continuous...

  • PO-UCT (POMCP)
  • DESPOT
  • POMCPOW
  • DESPOT-α
  • LABECOP



POMDPs with Continuous...

  • PO-UCT (POMCP)
  • DESPOT
  • POMCPOW
  • DESPOT-α
  • LABECOP
  • GPS-ABT
  • VG-MCTS
  • BOMCP
  • VOMCPOW

GPS-ABT (GPS = Generalized Pattern Search)

Value Gradient MCTS

$$a' = a + \eta \nabla_a Q(s, a)$$
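The update above can be sketched as simple gradient ascent on a scalar action, with the gradient approximated by finite differences of a supplied \(Q\) estimate. This is illustrative only, not the paper's implementation.

```julia
# Sketch: locally refine a candidate action by ascending an estimate of ∇ₐQ(s, a).
function refine_action(Q, s, a; η = 0.1, steps = 10, ϵ = 1e-3)
    for _ in 1:steps
        grad = (Q(s, a + ϵ) - Q(s, a - ϵ)) / (2ϵ)   # finite-difference ∇ₐQ(s, a)
        a += η * grad                               # a′ = a + η ∇ₐQ(s, a)
    end
    return a
end
```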

BOMCP

[Mern, Sunberg, et al. AAAI 2021]

Voronoi Progressive Widening

[Lim, Tomlin, & Sunberg ICAPS 2021 (Submitted)]

Future: POMDPs with High-Dimensional Observations

Future: Responding to UAV Emergencies

Open Source Software

POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia
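For flavor, here is a rough, unverified sketch of how the Light-Dark example might be written against this interface. The keyword and constructor names follow my understanding of QuickPOMDPs.jl, POMDPTools.jl, Distributions.jl, and POMCPOW.jl and should be checked against the current documentation.

```julia
# Rough sketch of the Light-Dark problem with the POMDPs.jl ecosystem.
# NOTE: keyword names below are from memory; check the QuickPOMDPs.jl docs.
using POMDPs, QuickPOMDPs, POMDPTools, Distributions, POMCPOW

lightdark = QuickPOMDP(
    obstype      = Float64,
    actions      = [-10, -1, 0, 1, 10],
    discount     = 0.95,
    transition   = (s, a) -> Deterministic(s + a),
    # Observation conditions on the post-transition state; small floor keeps σ > 0.
    observation  = (s, a, sp) -> Normal(sp, abs(sp - 10) + 1e-6),
    reward       = (s, a) -> a == 0 ? (s == 0 ? 100.0 : -100.0) : -1.0,
    initialstate = DiscreteUniform(-20, 20),
)

solver  = POMCPOWSolver()          # online tree search with progressive widening
planner = solve(solver, lightdark)
# At run time, action(planner, b) selects an action for the current belief b.
```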

Previous C++ framework: APPL

"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."

Julia - Speed

Celeste Project

1.54 Petaflops

Thank You!
