Safety and Efficiency in Autonomous Vehicles through Online Learning

Zachary Sunberg

Postdoctoral Scholar

University of California

Autonomy is becoming possible and prevalent

How do we deploy with Confidence?

Waymo Image By Dllu - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64517567

Schoettle and Sivak, "A Preliminary Analysis of Real-World Crashes Involving Self-Driving Vehicles," UMTRI-2015-34 (2015)


Introduction

Lane Changing with Internal States

Solving Continuous POMDPs

POMDPs.jl

Future


Two Objectives for Autonomy

EFFICIENCY

SAFETY

Minimize resource use

(especially time)

Minimize the risk of harm to oneself and others

Hard Safety: Guaranteeing that there will be no harm

Soft Safety: Reducing the risk of harm

Two extremes:

"The two greatest risks in life are risking too much and risking too little" - Utah Avalanche Center Podcast

Objective Function

$$R(s_t, a_t) = R_\text{E}(s_t, a_t) + \lambda R_\text{S}(s_t, a_t)$$

(\(R_\text{E}\): efficiency, \(\lambda\): weight, \(R_\text{S}\): safety)

\[\text{maximize} \quad E \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \right]\]

Objective Function

[Pareto frontier plot: safety vs. efficiency, with curves for Model \(M_1\)/Algorithm \(A_1\) and Model \(M_2\)/Algorithm \(A_2\); better models and algorithms shift the frontier toward better performance in both objectives.]

$$R(s_t, a_t) = R_\text{E}(s_t, a_t) + \lambda R_\text{S}(s_t, a_t)$$

(\(R_\text{E}\): efficiency, \(\lambda\): weight, \(R_\text{S}\): safety)

Uncertainty Expression

OUTCOME

MODEL

STATE

Markov Model

  • \(\mathcal{S}\) - State space
  • \(T:\mathcal{S}\times\mathcal{S} \to \mathbb{R}\) - Transition probability distributions

Markov Decision Process (MDP)

  • \(\mathcal{S}\) - State space
  • \(T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Transition probability distribution
  • \(\mathcal{A}\) - Action space
  • \(R:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Reward

Partially Observable Markov Decision Process (POMDP)

  • \(\mathcal{S}\) - State space
  • \(T:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Transition probability distribution
  • \(\mathcal{A}\) - Action space
  • \(R:\mathcal{S}\times \mathcal{A} \times\mathcal{S} \to \mathbb{R}\) - Reward
  • \(\mathcal{O}\) - Observation space
  • \(Z:\mathcal{S} \times \mathcal{A}\times \mathcal{S} \times \mathcal{O} \to \mathbb{R}\) - Observation probability distribution
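For concreteness, here is a minimal sketch of how such a tuple can be written against the POMDPs.jl interface introduced later in the talk, using the classic tiger problem as a stand-in example. The problem itself is not from this talk, and the package and function names reflect my understanding of the interface rather than the talk's own code.

```julia
using POMDPs
using POMDPTools: Deterministic, SparseCat   # simple distribution types (assumed available here)

# S, A, O are all Symbols for this tiny example
struct TigerPOMDP <: POMDP{Symbol, Symbol, Symbol} end

POMDPs.states(::TigerPOMDP)       = [:tiger_left, :tiger_right]          # S
POMDPs.actions(::TigerPOMDP)      = [:open_left, :open_right, :listen]   # A
POMDPs.observations(::TigerPOMDP) = [:hear_left, :hear_right]            # O
POMDPs.discount(::TigerPOMDP)     = 0.95
POMDPs.initialstate(::TigerPOMDP) = SparseCat([:tiger_left, :tiger_right], [0.5, 0.5])

# T(s' | s, a): listening leaves the tiger in place; opening a door resets the problem
POMDPs.transition(::TigerPOMDP, s, a) =
    a == :listen ? Deterministic(s) : SparseCat([:tiger_left, :tiger_right], [0.5, 0.5])

# Z(o | a, s'): listening is informative but noisy
function POMDPs.observation(::TigerPOMDP, a, sp)
    if a == :listen
        correct = sp == :tiger_left ? :hear_left : :hear_right
        other   = correct == :hear_left ? :hear_right : :hear_left
        return SparseCat([correct, other], [0.85, 0.15])
    end
    return SparseCat([:hear_left, :hear_right], [0.5, 0.5])
end

# R(s, a)
function POMDPs.reward(::TigerPOMDP, s, a)
    a == :listen && return -1.0
    opened_tiger = (a == :open_left) == (s == :tiger_left)   # opened the door hiding the tiger?
    return opened_tiger ? -100.0 : 10.0
end
```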

Introduction

Lane Changing with Internal States

Solving Continuous POMDPs

POMDPs.jl

Future

Sadigh, Dorsa, et al. "Information gathering actions over human internal state." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016.

Schmerling, Edward, et al. "Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction." arXiv preprint arXiv:1710.09483 (2017).

Sadigh, Dorsa, et al. "Planning for Autonomous Cars that Leverage Effects on Human Actions." Robotics: Science and Systems. 2016.

Tweet by Nitin Gupta

29 April 2018

https://twitter.com/nitguptaa/status/990683818825736192

POMDP Formulation

\(s=\left(x, y, \dot{x}, \left\{(x_c,y_c,\dot{x}_c,l_c,\theta_c)\right\}_{c=1}^{n}\right)\)

\(o=\left\{(x_c,y_c,\dot{x}_c,l_c)\right\}_{c=1}^{n}\)

\(a = (\ddot{x}, \dot{y})\), \(\ddot{x} \in \{0, \pm 1 \text{ m/s}^2\}\), \(\dot{y} \in \{0, \pm 0.67 \text{ m/s}\}\)

\[R(s, a, s') = \text{in\_goal}(s') - \lambda \left(\text{any\_hard\_brakes}(s, s') + \text{any\_too\_slow}(s')\right)\]

State: ego physical state \((x, y, \dot{x})\); physical states of the other cars \((x_c, y_c, \dot{x}_c, l_c)\); internal states of the other cars \(\theta_c\)

Observation: physical states of the other cars only (internal states are hidden)

  • Actions filtered so they can never cause crashes
  • Braking action always available

Efficiency: reward for reaching the goal lane. Safety: penalties for hard braking and slow traffic.
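A minimal Julia reading of the reward above; the thresholds, state layout, and indicator helpers below are hypothetical stand-ins for the quantities named in the formula, not the actual simulation code.

```julia
# Hypothetical sketch of the lane-change reward R(s, a, s').
# States are assumed to be NamedTuples with fields ego.y and cars (each car has an xdot field).
const λ = 1.0            # safety weight
const GOAL_Y = 3.0       # lateral position of the target lane [m] (illustrative)
const HARD_BRAKE = -4.0  # deceleration threshold [m/s²] (illustrative)
const TOO_SLOW = 15.0    # speed threshold [m/s] (illustrative)
const DT = 0.75          # timestep [s] (illustrative)

in_goal(sp) = sp.ego.y ≥ GOAL_Y
any_hard_brakes(s, sp) = any((c2.xdot - c1.xdot) / DT ≤ HARD_BRAKE
                             for (c1, c2) in zip(s.cars, sp.cars))
any_too_slow(sp) = any(c.xdot < TOO_SLOW for c in sp.cars)

lane_change_reward(s, a, sp) = Float64(in_goal(sp)) -
    λ * (Float64(any_hard_brakes(s, sp)) + Float64(any_too_slow(sp)))
```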

Human Behavior Model: IDM and MOBIL

\[\ddot{x}_\text{IDM} = a \left[ 1 - \left( \frac{\dot{x}}{\dot{x}_0} \right)^{\delta} - \left(\frac{g^*(\dot{x}, \Delta \dot{x})}{g}\right)^2 \right]\]

\[g^*(\dot{x}, \Delta \dot{x}) = g_0 + T \dot{x} + \frac{\dot{x}\Delta \dot{x}}{2 \sqrt{a b}}\]

M. Treiber, et al., “Congested traffic states in empirical observations and microscopic simulations,” Physical Review E, vol. 62, no. 2 (2000).

A. Kesting, et al., “General lane-changing model MOBIL for car-following models,” Transportation Research Record, vol. 1999 (2007).

A. Kesting, et al., "Agents for Traffic Simulation." Multi-Agent Systems: Simulation and Applications. CRC Press (2009).
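For concreteness, a direct transcription of the IDM acceleration above into Julia; parameter values are typical illustrative defaults, not necessarily the ones used in these simulations.

```julia
# Intelligent Driver Model (IDM) acceleration, transcribed from the equations above.
# v = own speed ẋ, dv = approach rate Δẋ, g = actual gap to the car in front.
Base.@kwdef struct IDMParams
    a::Float64  = 1.4    # maximum acceleration [m/s²]
    b::Float64  = 2.0    # comfortable deceleration [m/s²]
    v0::Float64 = 33.3   # desired speed ẋ₀ [m/s]
    T::Float64  = 1.5    # desired time gap [s]
    g0::Float64 = 2.0    # minimum gap [m]
    δ::Float64  = 4.0    # acceleration exponent
end

# Desired gap g*(ẋ, Δẋ)
desired_gap(p::IDMParams, v, dv) = p.g0 + p.T*v + v*dv / (2*sqrt(p.a*p.b))

# IDM acceleration ẍ_IDM
idm_accel(p::IDMParams, v, dv, g) = p.a * (1 - (v/p.v0)^p.δ - (desired_gap(p, v, dv)/g)^2)
```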

Simulation results

[Pareto curves of safety vs. efficiency for each planning approach: assume all drivers normal, outcome uncertainty only, omniscient, Mean MPC, QMDP, POMCPOW]

Correlation Trend

Robustness

Next Steps: Testing against Learned Models


Introduction

Lane Changing with Internal States

Solving Continuous POMDPs

POMDPs.jl

Future

Information State

History: all previous actions and observations

\[h_t = (b_0, a_0, o_1, a_1, o_2, ..., a_{t-1}, o_t)\]

Belief: probability distribution over \(\mathcal{S}\) encoding everything learned about the state from the history

\[b_t(s) = P(s_t=s \mid h_t)\]

A POMDP is an MDP on the belief space
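For a discrete problem with explicit \(T\) and \(Z\), the belief can be maintained recursively with a Bayes filter; below is a minimal sketch (the array layout and names are my own, not code from the talk).

```julia
# Minimal discrete Bayesian belief update: compute b_{t+1} from (b_t, a, o).
# T[sp, s, a] = P(s' | s, a); Z[o, a, sp] = P(o | a, s'); states/actions/observations are integer indices.
function update_belief(b::Vector{Float64}, a::Int, o::Int,
                       T::Array{Float64,3}, Z::Array{Float64,3})
    nS = length(b)
    bp = zeros(nS)
    for sp in 1:nS
        # predict: sum over previous states, then correct with the observation likelihood
        bp[sp] = Z[o, a, sp] * sum(T[sp, s, a] * b[s] for s in 1:nS)
    end
    return bp ./ sum(bp)   # normalize (assumes the observation has nonzero probability)
end
```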

Belief Example: Laser Tag

Solving MDPs and POMDPs - The Value Function

$$\mathop{\text{maximize}} V_\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \bigm| s_0 = s \right]$$

$$V^*(s) = \max_{a \in \mathcal{A}} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$

\(\pi: \mathcal{S} \to \mathcal{A}\)

Involves all future time

Involves only \(t\) and \(t+1\)

\(a \in \mathcal{A}\)
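Where the state space is small and discrete, the Bellman equation above can be solved directly by value iteration. A minimal sketch with explicit transition and reward arrays (illustrative, not the talk's code):

```julia
# Minimal value iteration over the Bellman equation above.
# T[sp, s, a] = P(s' | s, a), R[s, a] = reward, γ = discount factor.
function value_iteration(T::Array{Float64,3}, R::Matrix{Float64}, γ::Float64;
                         tol=1e-6, maxiter=10_000)
    nS, nA = size(R)
    V = zeros(nS)
    for _ in 1:maxiter
        Vnew = [maximum(R[s, a] + γ * sum(T[sp, s, a] * V[sp] for sp in 1:nS)
                        for a in 1:nA) for s in 1:nS]
        maximum(abs.(Vnew .- V)) < tol && return Vnew
        V = Vnew
    end
    return V
end
```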

C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of Markov decision processes,” Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987

But POMDPs are still PSPACE-Complete

Online Decision Process Tree Approaches

[Tree diagram: alternating state nodes and action nodes; the value function is estimated at the leaf nodes]

Monte Carlo Tree Search

Image by Dicksonlaw583 (CC 4.0)

POMCP

  • Uses simulations of histories instead of full belief updates

 

  • Each belief is implicitly represented by a collection of unweighted particles

 

Silver, David, and Joel Veness. "Monte-Carlo planning in large POMDPs." Advances in neural information processing systems. 2010.

Ross, Stéphane, et al. "Online planning algorithms for POMDPs." Journal of Artificial Intelligence Research 32 (2008): 663-704.

Light-Dark Problem

\[\begin{aligned} & \mathcal{S} = \mathbb{Z} \quad \quad \quad ~~ \mathcal{O} = \mathbb{R} \\ & s' = s+a \quad \quad o \sim \mathcal{N}(s, s-10) \\ & \mathcal{A} = \{-10, -1, 0, 1, 10\} \\ & R(s, a) = \begin{cases} 100 & \text{ if } a = 0, s = 0 \\ -100 & \text{ if } a = 0, s \neq 0 \\ -1 & \text{ otherwise} \end{cases} \end{aligned}\]

[Trajectory plot: state vs. timestep; observations are accurate near \(s = 10\)]

Goal: \(a=0\) at \(s=0\)

Optimal Policy: localize in the accurate-observation region, then take \(a=0\) at \(s=0\)

QMDP

\[Q_{MDP}(b, a) = \sum_{s \in \mathcal{S}} Q_{MDP}(s,a) b(s) \geq Q^*(b,a)\]

Equivalent to assuming full observability on the next step

Will not take costly exploratory actions

$$Q_\pi(s,a) = R(s, a) + \gamma E\left[V_\pi (s')\right]$$
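In code, the QMDP approximation amounts to scoring each action by the belief-weighted, fully observable Q-value; a minimal sketch (names and layout are illustrative):

```julia
# QMDP action selection: score each action by the belief-weighted fully observable Q-value.
# Q[s, a] comes from solving the underlying MDP (e.g., with value iteration); b[s] = b(s).
function qmdp_action(b::Vector{Float64}, Q::Matrix{Float64})
    scores = [sum(b[s] * Q[s, a] for s in eachindex(b)) for a in 1:size(Q, 2)]
    return argmax(scores)
end
```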

Belief Trajectories

QMDP Predicted Trajectories

 

Solving continuous POMDPs - POMCP fails

Necessary conditions for consistency [1]:

  • An infinite number of child nodes must be visited
  • Each node must be visited an infinite number of times

POMCP-DPW: apply double progressive widening, limiting the number of children of each node to

\[k N^\alpha\]

[1] A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, "Continuous Upper Confidence Trees," Learning and Intelligent Optimization (LION), 2011.
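A sketch of that widening rule as it might appear inside a tree-search implementation (names are illustrative, not the actual solver code):

```julia
# Double progressive widening: a node with visit count N may only grow a new child
# while its number of children is below k·N^α.
struct DPWParams
    k::Float64
    α::Float64
end

# `sample_new_child` is any function that proposes a new action or observation child.
function maybe_widen!(children::Vector, N::Int, p::DPWParams, sample_new_child::Function)
    if length(children) ≤ p.k * N^p.α
        push!(children, sample_new_child())
    end
    return children
end
```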

POMCP-DPW converges to QMDP

Proof Outline:

  1. Observation space is continuous → observations unique w.p. 1.


  2. (1) → One state particle in each belief, so each belief is merely an alias for that state


  3. (2) → POMCP-DPW = MCTS-DPW applied to fully observable MDP + root belief state


  4. Solving this MDP is equivalent to finding the QMDP solution → POMCP-DPW converges to QMDP




Sunberg, Z. N. and Kochenderfer, M. J. "Online Algorithms for POMDPs with Continuous State, Action, and Observation Spaces", ICAPS (2018)

Necessary Conditions for Consistency:

  • An infinite number of child nodes must be visited
  • Each node must be visited an infinite number of times
  • An infinite number of particles must be added to each belief node

POMCPOW: use \(Z\) to insert weighted particles into each belief node (POMCP → POMCP-DPW → POMCPOW)
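As a sketch of the weighted-particle idea (not the actual POMCPOW implementation), each observation node can keep a weighted particle collection, with new particles weighted by the observation likelihood:

```julia
# Sketch of weighted particle insertion at an observation node.
# New particles are weighted by the observation likelihood Z(o | a, s').
mutable struct WeightedBelief{S}
    particles::Vector{S}
    weights::Vector{Float64}
end

function insert_particle!(b::WeightedBelief, sp, o, a, Z)
    push!(b.particles, sp)
    push!(b.weights, Z(o, a, sp))      # weight by observation likelihood
    return b
end

# Choosing a state for the next simulation step is then a weighted draw:
function sample_state(b::WeightedBelief)
    r = rand() * sum(b.weights)
    c = 0.0
    for (s, w) in zip(b.particles, b.weights)
        c += w
        c ≥ r && return s
    end
    return last(b.particles)
end
```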

Ye, Nan, et al. "DESPOT: Online POMDP planning with regularization." Journal of Artificial Intelligence Research 58 (2017): 231-266.

Next Step: Planning on Weighted Scenarios


Introduction

Lane Changing with Internal States

Solving Continuous POMDPs

POMDPs.jl

Future

POMDPs.jl - An interface for defining and solving MDPs and POMDPs in Julia

Challenges for POMDP Software

  1. POMDPs are computationally difficult.
  2. There is a huge variety of
    • Problems
      • Continuous/Discrete
      • Fully/Partially Observable
      • Generative/Explicit
      • Simple/Complex
    • Solvers
      • Online/Offline
      • Alpha Vector/Graph/Tree
      • Exact/Approximate
    • Domain-specific heuristics

Julia - Speed


Explicit interface: the distributions \(T\) and \(Z\) are specified directly.

Generative interface: the problem only provides a simulator mapping \((s, a)\) to \((s', o, r)\).
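A sketch of a generative-interface problem, using what I understand to be the POMDPs.jl `gen` signature; the problem itself is a hypothetical one-dimensional example, not one from the talk.

```julia
using POMDPs, Random

# Hypothetical problem: a noisy integrator observed through a noisy sensor.
struct NoisyIntegratorPOMDP <: POMDP{Float64, Float64, Float64} end

# Generative interface: a single simulator call produces (s', o, r) from (s, a).
function POMDPs.gen(m::NoisyIntegratorPOMDP, s, a, rng::AbstractRNG)
    sp = s + a + 0.1 * randn(rng)   # stochastic dynamics
    o  = sp + 0.5 * randn(rng)      # noisy observation of the next state
    r  = -abs(sp)                   # reward for staying near the origin
    return (sp=sp, o=o, r=r)
end

POMDPs.discount(::NoisyIntegratorPOMDP) = 0.95
```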

Previous C++ framework: APPL

"At the moment, the three packages are independent. Maybe one day they will be merged in a single coherent framework."


Introduction

Lane Changing with Internal States

Solving Continuous POMDPs

POMDPs.jl

Future

Future Research

Deploying autonomous agents with Confidence

Practical Safety Guarantees

Trusting Visual Sensors

Algorithms for Physical Problems

Testing on Physical Vehicles


Trusting Information from Visual Sensors

[Block diagram: environment → convolutional neural network → belief state → control system]

Architecture for Safety Assurance

Algorithms for the Physical World

Weaknesses of the algorithms I have researched:

1. Difficulty leveraging data-driven models on modern parallel hardware

2. Difficulty handling multidimensional continuous action spaces


CPU Image By Eric Gaba, Wikimedia Commons user Sting, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=68125990

Practical Safety Guarantees

  • Reachability-based guarantees are usually infeasible
  • Probabilistic guarantees involve low-probability distribution tails

Testing with Physical Vehicles

Texas A&M (HUSL):

TREX 600 RC Helicopter

Stanford (VAIL)

Colorado (RECUV)

Autorotation

"Expert System" Autorotation Controller

Example Controller: Flare

 

\[\alpha \equiv \frac{KE_{\text{available}} - KE_{\text{flare exit}}}{KE_{\text{flare entry}} - KE_{\text{flare exit}}}\]

\[TTLE = TTLE_{max} \times \alpha\]

\[\ddot{h}_{des} = -\frac{2}{TTI_F^2}h - \frac{2}{TTI_F}\dot{h}\]
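A literal transcription of those relations into Julia, with all quantities treated as given inputs; the symbol names are my own reading of the slide, not the controller's actual code.

```julia
# Flare-controller relations from the equations above, transcribed directly.
# KE_* are kinetic energies, ttle_max and TTI_F are controller parameters, h and hdot
# are altitude and sink rate; all are assumed to be supplied by the caller.
flare_fraction(KE_avail, KE_exit, KE_entry) = (KE_avail - KE_exit) / (KE_entry - KE_exit)   # α

ttle(ttle_max, α) = ttle_max * α                                                            # TTLE

desired_hddot(h, hdot, TTI_F) = -(2 / TTI_F^2) * h - (2 / TTI_F) * hdot                     # ḧ_des
```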

Bad because:

  • No optimality justification
  • Controller must be adjusted by hand to accommodate any changes
  • Difficult to communicate about
  • Parameters not intuitive

Autorotation Simulation Results

References


Acknowledgements

The content of my research reflects my opinions and conclusions, and is not necessarily endorsed by my funding organizations.


Thank You!

Convex Optimization

Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

Convexity <=> Exact Solution Tractable

all \(f\) convex

Energy Kites

  • Need safety guarantees
  • Every bit of efficiency is extra energy
  • Complex and partially observable dynamics
    • Aircraft dynamics
    • Tether state
    • Wind

Solving MDPs and POMDPs - Offline vs Online

OFFLINE: Value Iteration

ONLINE: Sequential Decision Trees

Solving MDPs and POMDPs - The Value Function

$$\mathop{\text{maximize}} V_\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \bigm| s_0 = s, a_t = \pi(s_t) \right]$$

$$V^*(s) = \max_{a \in \mathcal{A}} \left\{R(s, a) + \gamma E\Big[V^*\left(s_{t+1}\right) \mid s_t=s, a_t=a\Big]\right\}$$

\(\pi: \mathcal{S} \to \mathcal{A}\)

Involves all future time

Involves only \(t\) and \(t+1\)

\(a \in \mathcal{A}\)

Handle continuous state space via \(V(s) \approx \tilde{V}(s; \theta) \)

[Figure: encounter geometry with \(\phi = 0^\circ\), \(45^\circ\), and \(-45^\circ\)]

Sunberg, Zachary N., Mykel J. Kochenderfer, and Marco Pavone. "Optimized and trusted collision avoidance for unmanned aerial vehicles using approximate dynamic programming." Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016.

Value Function

Policy

A better way: MPC Value Iteration

\(R\) quasi-convex (e.g. \(x_t \in \) safe landing)

Feasible sets convex
