Randomized Smoothing of Optimal Control Problems
RLG Group Meeting 2021 Fall
Terry
Motivation.
- Understand benefits of considering stochasticity in optimal control problems.
- Rewrite RL methods in a more "natural" language for our problems.
- Better computation of gradients of stochastic objectives.
Objectives:
Previous Work:
- Focus on smoothing out the dynamics of the problem.
Good points:
- Intuitive understanding of what it means to average out dynamics.
- Very practical algorithm for doing planning-through-contact.
Leaves much to be desired...
- Dynamics proved to be mostly continuous due to infinitesimal time / inelastic impacts.
- Was not able to analyze the behavior of smoothing when it comes to discontinuities.
- Not able to analyze the resulting landscape of the long-horizon optimal control problem.
We want to understand the landscape of value functions. When we lift the continuity provided by time and reason about long-horizon effects, discontinuities and multi-modalities rear their ugly heads.
Problems of Interest Today:
Motivation.
Tomas's pendulum example moves between zeroth- and first-order gradients.
Differentiable RL? Analytical Policy Gradient? Why? Why not?
Analytical Policy Gradient.
Given access to the following gradients:
One can use the chain rule to obtain the analytic gradients, which can be computed efficiently using autodiff through the rollout:
Conceivably, this can be used to update the parameters of the policy.
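A rough sketch of what this chain rule looks like, in my own notation (the dynamics $f$, policy $\pi_\theta$, and cost $c$ below are reconstructions, not copied from the slides): with $x_{t+1} = f(x_t, u_t)$, $u_t = \pi_\theta(x_t)$, and $V(\theta) = \sum_t c(x_t, u_t)$,

$$
\frac{dV}{d\theta} = \sum_t \left( \frac{\partial c}{\partial x_t}\frac{dx_t}{d\theta} + \frac{\partial c}{\partial u_t}\frac{du_t}{d\theta} \right),
\qquad
\frac{du_t}{d\theta} = \frac{\partial \pi_\theta}{\partial x_t}\frac{dx_t}{d\theta} + \frac{\partial \pi_\theta}{\partial \theta},
\qquad
\frac{dx_{t+1}}{d\theta} = \frac{\partial f}{\partial x_t}\frac{dx_t}{d\theta} + \frac{\partial f}{\partial u_t}\frac{du_t}{d\theta},
$$

which is exactly the recursion that autodiff through the rollout evaluates.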
Policy Gradient Sampling Algorithm
Initialize some parameter estimate
While not converged:
Sample some initial states
Compute the analytical gradient of the value function from each sampled state.
Average the sampled gradients to obtain the gradient of the expected performance:
Update the parameters using gradient descent / Gauss-Newton.
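A minimal sketch of this loop in Python (the double-integrator dynamics, quadratic cost, step size, and the finite-difference stand-in for the analytic value-function gradient are all my own illustrative choices):

```python
import numpy as np

def rollout_cost(theta, x0, horizon=50, dt=0.05):
    """Roll out a linear policy u = -theta @ x on a double integrator and
    accumulate a quadratic cost (stand-in for the value function)."""
    A = np.array([[1.0, dt], [0.0, 1.0]])
    B = np.array([0.0, dt])
    x, cost = x0.astype(float).copy(), 0.0
    for _ in range(horizon):
        u = -theta @ x
        cost += x @ x + u * u
        x = A @ x + B * u
    return cost

def value_gradient(theta, x0, eps=1e-4):
    """Gradient of the value function w.r.t. the policy parameters.
    (Central differences here; in practice this would be autodiff
    through the rollout, as on the previous slide.)"""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        grad[i] = (rollout_cost(theta + d, x0) - rollout_cost(theta - d, x0)) / (2 * eps)
    return grad

theta = np.array([0.5, 0.5])                      # initial parameter estimate
for _ in range(100):                              # "while not converged"
    x0s = np.random.randn(16, 2)                  # sample some initial states
    grads = [value_gradient(theta, x0) for x0 in x0s]
    theta -= 1e-3 * np.mean(grads, axis=0)        # gradient-descent update
```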
Back to high-school physics: suppose we throw a ball (Ballistic motion) and want to maximize the distance thrown using gradient descent.
Quiz: what is the optimal angle for maximizing the distance thrown?
Escaping Flat Regions.
Back to high-school physics: suppose we throw a ball (Ballistic motion) and want to maximize the distance thrown using gradient descent.
Quiz: what is the optimal angle for maximizing the distance thrown?
45 degrees!
If we plot the objective as a function of the angle, it is a nice gradient-dominated function on which gradient descent converges to the optimum.
(Interestingly, this is non-convex. A typical example of one of Jack's PL-inequality functions.)
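For reference, the textbook range formula (assuming launch and landing at the same height, and no wall):

$$
R(\theta) = \frac{v^2}{g}\,\sin(2\theta), \qquad
R'(\theta) = \frac{2v^2}{g}\,\cos(2\theta) = 0 \;\;\Rightarrow\;\; \theta^\star = 45^\circ.
$$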
Escaping Flat Regions.
Back to high-school physics: suppose we throw a ball (Ballistic motion) and want to maximize the distance thrown using gradient descent.
Now let's add a wall to make things more interesting. (Assume an inelastic collision with the wall: once the ball hits, it falls straight down.)
Now we no longer have such nice structure.... gradient descent fails.
There is fundamentally no local information to improve on once we've hit the wall!
Even though the physical gradients are perfectly well defined, we can no longer numerically obtain the optimum of the function.
Escaping Flat Regions.
Back to high-school physics: suppose we throw a ball (Ballistic motion) and want to maximize the distance thrown using gradient descent.
To resolve this, consider a stochastic objective formed by adding noise to the decision variable:
Now we recover gradient dominance! We know which way to go once we've hit the wall.
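A toy version of this smoothed objective, just to make the construction concrete (the wall model, constants, and noise level are my own illustrative choices, not from the talk):

```python
import numpy as np

G, V = 9.81, 10.0          # gravity [m/s^2], launch speed [m/s]
WALL_X, WALL_H = 8.0, 1.0  # wall position and height [m]

def throw_distance(theta):
    """Horizontal distance of the throw. If the trajectory hits the wall,
    the ball falls straight down at the wall (inelastic): a flat region."""
    free_range = V**2 * np.sin(2.0 * theta) / G
    y_at_wall = WALL_X * np.tan(theta) - G * WALL_X**2 / (2.0 * V**2 * np.cos(theta)**2)
    misses_wall = (free_range < WALL_X) | (y_at_wall > WALL_H)
    return np.where(misses_wall, free_range, WALL_X)

def bundled_objective(theta, sigma=0.1, n=2000, seed=0):
    """Monte-Carlo estimate of the smoothed objective E_w[ f(theta + w) ]."""
    w = np.random.default_rng(seed).normal(0.0, sigma, n)
    return throw_distance(theta + w).mean()
```

Near the boundary of the flat region, some noisy samples clear the wall, so the smoothed estimate acquires a slope even where the original objective is flat.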
Escaping Flat Regions.
Excerpt from DiffTaichi, one of the major differentiable simulators.
Rather surprising that we can tackle this problem with a randomized solution, without:
- manual tuning of initial guesses
- tree / graph search.
(Or rather, it was surprising to me at the beginning that gradients can't tackle this problem, the problem seems pretty easy and a search direction seems to exist.)
Escaping Flat Regions.
Gradient-Driven Deaths
- Given an objective function, you follow the gradient to maximize the objective.
- The next gradient ascent step points you towards a bad value.
- You fall off a cliff and are unable to recover to where you were before.
Defining Gradient-Driven Deaths
- Not necessarily caused by discontinuity. Also fails on stiff functions with large step size.
- Discontinuities are special in the sense that there is a point at which GDDs happen for all step sizes.
- Once GDD occurs, it can be impossible to recover.
- GDD is insensitive to initial conditions. Wherever you seed, it will occur.
Example 1. Momentum Transfer
1. You need to choose an angle at which to shoot at the hinged door.
2. You get paid based on how much angular momentum you transfer to the door.
3. But if you miss, the hinged door will obviously not move.
Naively following gradients towards optimality makes you fall off the cliff.
(Said another way, there is tension between being optimal and being "safe")
Example 1. Shooting at hinged door.
Example 1. Momentum Transfer
Not a contrived example....
This example quickly figures out that it should maximize the moment arm of the pusher, only to end up in a no-contact scenario.
Example 2. Manipulation at Dynamic Limits
1. You need to move a cart from point A to point B.
2. You get paid for being as fast as possible, as well as precise (up to here, a standard minimum-time double-integrator problem).
3. But if you drop the ball, you get no points!
Example 2. Minimum-time Transport
Ever got too greedy and failed this task? (Waiter serving drinks?)
Example 2. Manipulation at Dynamic Limits
Here the parametrized policy is bang-bang (with two parameters: the bang acceleration and the switching time).
Example 3. Block Stacking
1. Your job is to choose the horizontal offsets at which to place the boxes.
2. You get paid based on how far to the right you stack the box and the height of the stack.
3. But if you become too greedy, the stack might topple!
Example 3. Block stacking
GDDs Recap.
Why do GDDs occur?
- Optimization problem has tension between optimality and "safety".
- Optimization problem is discontinuous / changes too quickly (high Lipschitz constant).
- Gradient only reasons about local behavior (limit as epsilon goes to zero).
How should we prevent GDDs?
- Add and use constraints! (climb as high as possible s.t. I don't fall off a cliff)
- RL doesn't have constraints....why doesn't RL as optimization have this problem?
Sometimes, constraints are difficult to define.
- Stability of boxes might be clear, but what about stones? humanoids?
- Should we constrain the ball to touch the box? Sometimes equivalent to fixing mode sequences.
- Valuable to "implicitly" have constraints from black-box data when the engineer cannot design one with domain knowledge.
Stochastic Objectives
How should we prevent GDDs?
Maybe if we're not so myopic, we'll be able to prevent GDDs before they occur.
If we consider a stochastic objective, how would the landscape change?
Randomized Smoothing of Value Functions
Original Problem w/ GDDs
Surrogate Stochastic Problem (Bundled Objective)
Example 1. Momentum Transfer
Some samples will hit the door, some samples will miss - the average gives information that we should be more conservative when we do gradient ascent.
Example 1. Shooting at hinged door.
Connection to robust optimization.
Intuitively satisfying answer: the maximum of the bundled objective is "pushed back" by how uncertain we are.
Example 2. Manipulation at Dynamic Limits
Some samples will drop the payload, some samples will not. Behaves better on average.
Connection to robust optimization.
Intuitively satisfying answer: the maximum of the bundled objective is "pushed back" by how uncertain we are.
Example 2. Minimum-time Transport
Example 3. Block Stacking
Similarly, the expected value behaves much better.
Example 3. Block stacking
GDDs can be eliminated by considering a stochastic formulation of the objective!
Emergence of Constraints from Stochasticity
Mathematical Characterization of this behavior:
If the function has a high-cost region, we can decompose the function into:
Expectation is a linear operator, so we can smooth them out individually and add them back in.
By taking an expected value, we get something that resembles the Lagrangian of the optimization problem we would have obtained if we had explicitly written down the constraint.
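Roughly, in my own notation (reconstructing the missing equation): write the objective as a smooth part plus a penalty on the high-cost region $S$,

$$
f(x) = g(x) + c\,\mathbf{1}_S(x), \qquad
\mathbb{E}_w\!\left[f(x+w)\right] = \mathbb{E}_w\!\left[g(x+w)\right] + c\,\Pr\!\left[x+w \in S\right],
$$

so the smoothed penalty term acts like a Lagrangian / penalty term on the constraint $x \notin S$, weighted by the probability of violating it under the noise.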
Emergence of Constraints from Stochasticity
Connection to Interior-Point Methods
Consider the canonical log-barrier function:
Smoothing creates a barrier function-like effect that encourages adherence to constraints.
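For reference, the canonical log-barrier reformulation of a constrained problem:

$$
\min_x \; f(x) \;\; \text{s.t.}\;\; g_i(x) \le 0
\quad\longrightarrow\quad
\min_x \; f(x) \;-\; \mu \sum_i \log\bigl(-g_i(x)\bigr),
$$

with the barrier weight $\mu$ driven towards zero; the probability-weighted penalty above plays a qualitatively similar role.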
Emergence of Constraints from Stochasticity
Madness? More interesting than it seems....
Emergence of Constraints from Stochasticity
Penalizing collision and sampling has the same effect as encoding non-penetration as explicit constraints and then having IPOPT take a barrier-function formulation of them.
(In fact, it might perform better when we start having strange geometries!)
Trajectory Optimization through Forest.
Figure from Ani's paper.
Discussions
Explicit Constraints vs. Soft Constraints from Sampling
- We never needed to actually specify what the constraints should be. This can be a big advantage sometimes.
- The reward / cost simply needs to be able to assign bad costs to bad events (but finite!)
- It is a "soft" constraint - it can be violated (feasibility is not a binary thing).
- But the violation can also work out to your advantage (e.g. can explore mode sequences)
- But this line of thinking can easily lead to games of reward shaping.
Randomized Smoothing on Value Functions
Original Problem
Surrogate Problem
Original Problem (Single-shooting formulation)
Surrogate Problem
expectation over entire trajectory of noises
Injecting noise in policy optimization
Injecting noise in trajectory optimization
Original Problem (Single-shooting formulation)
Surrogate Problem
expectation over entire trajectory of noises
Injecting noise in trajectory optimization
Note: Comparison with Previous Work
expectation over noise in a single timestep.
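A sketch of the two surrogates in my own notation (the exact expressions on the slides did not survive extraction): injecting noise in policy optimization smooths over the policy parameters, while in the single-shooting trajectory optimization the noise enters every control input along the trajectory,

$$
\min_\theta \; \mathbb{E}_{x_0,\,w}\bigl[V(x_0;\,\theta + w)\bigr],
\qquad
\min_{u_{0:T-1}} \; \mathbb{E}_{w_{0:T-1}}\Bigl[\sum_{t=0}^{T-1} c\bigl(x_t,\, u_t + w_t\bigr) + c_T(x_T)\Bigr],
\quad x_{t+1} = f(x_t,\, u_t + w_t),
$$

whereas the previous work smooths the dynamics a single timestep at a time, i.e. $x_{t+1} = \mathbb{E}_{w_t}\!\left[f(x_t,\, u_t + w_t)\right]$.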
Randomized Smoothing on Value Functions
Computing Gradients of Stochastic Objectives
Now let's talk about gradient computations.
Given that the objective is differentiable almost everywhere but has jump discontinuities, how should we compute the gradient of the stochastic objective?
Quantity to compute / estimate:
Recall: the usual trick of exchanging derivative and expectation (D.C.T.) and then Monte-Carlo sampling doesn't work!
Back to our good old example....
Computing Gradients of Stochastic Objectives
We'll review three methods of how to resolve this.
- Use the zero-order numerical gradient.
- Bootstrap the function to a smooth function approximator.
- Analytically add back the Gaussian to the derivative.
Method 1.
If we use finite differences instead of exact gradients, the interval has nonzero measure and will always be able to capture the discontinuity.
- Captures the behavior that we want nicely, but very difficult to analyze since we can't use Taylor expansions.
- Usually suffers from higher variance, which leads to worse convergence rates.
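A minimal sketch of one common zero-order estimator for the smoothed gradient (the specific antithetic form and constants are my choice, not necessarily what the slide used):

```python
import numpy as np

def zero_order_gradient(f, theta, sigma=0.1, n=1000, seed=0):
    """Monte-Carlo estimate of the gradient of E_w[ f(theta + w) ],
    w ~ N(0, sigma^2 I), using function values only (no gradient of f).
    The antithetic / two-sided form reduces variance, and the estimator is
    well defined even when f has jump discontinuities."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    w = rng.normal(0.0, sigma, size=(n, theta.size))
    f_plus = np.array([f(theta + wi) for wi in w])
    f_minus = np.array([f(theta - wi) for wi in w])
    return ((f_plus - f_minus)[:, None] * w).mean(axis=0) / (2.0 * sigma**2)
```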
Computing Gradients of Stochastic Objectives
Method 2. Bootstrapping to a smooth function approximator.
An effective solution, if you had lots of offline compute, is to simply find a differentiable representation of your discontinuous function by function fitting.
Computing Gradients of Stochastic Objectives
1. Evaluate the bundled objective at finitely many points:
2. Regress (e.g., with a neural network) to find a surrogate function that is differentiable everywhere:
3. Compute the derivative of the surrogate function.
Doesn't "feel" right to me, but the learning people love this trick!
Bootstrapping to a smooth function approximator.
Computing Gradients of Stochastic Objectives
From DPG paper:
- If the Q function is discontinuous, won't they have the same issue?
- Not if they bootstrap a neural network to approximate Q!
Conv. w/ Abhishek
Likely, many learning methods bypass the problem with discontinuities by spending offline time to store discontinuities into continuous functions.
Of course, comes with some associated disadvantages as well.
Carrots example.
Computing Gradients of Stochastic Objectives
Conv. with Yunzhu:
- Q: Why do we need neural simulators?
- A: Because we need gradients.
Conv. with Russ:
- Q: Why don't you like the word "differentiable physics"?
- A: Physics has always been differentiable. Drake already gives you autodiff gradients through the plant.
Resolution?
Perhaps gradients of the smooth representation of the plant can sometimes behave better than true plant gradients.
(Figures: a very noisy raw landscape vs. the same landscape bootstrapped to a NN.)
Analytically adding back the Gaussian.
There is actually a quite easy solution to the delta issue if we really wanted to sample gradients....
Computing Gradients of Stochastic Objectives
WLOG, let's represent a scalar function with a discontinuity at x = 0 as f(x) = g(x) + c H(x),
with g differentiable and H the Heaviside step function.
Then we can represent the gradient of the function as f'(x) = g'(x) + c δ(x).
Gradient sampling fails because we have zero probability of landing at the delta
(and we can't evaluate it if we do anyways....)
Analytically adding back the Gaussian.
So instead of sampling the entire gradient, we can analytically integrate out the delta, then sample the rest.
Computing Gradients of Stochastic Objectives
Taken from "Systematically Differentiating Parametric Discontinuities", SIGGRAPH 2021.
Analytically adding back the Gaussian.
So instead of sampling the entire gradient, we can analytically integrate out the delta, then sample the rest.
Computing Gradients of Stochastic Objectives
Decompose the integral into components without delta (top row), and the delta (bottom row).
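In the 1D Heaviside case above, this can be carried out explicitly (my reconstruction of the equation):

$$
\nabla_x\, \mathbb{E}_w\!\left[f(x+w)\right]
= \int \bigl(g'(x+w) + c\,\delta(x+w)\bigr)\, p(w)\, dw
= \underbrace{\mathbb{E}_w\!\left[g'(x+w)\right]}_{\text{Monte-Carlo sample}}
\;+\; \underbrace{c\, p(-x)}_{\text{analytic}},
$$

where $p$ is the Gaussian density of the noise: the smooth part is estimated by sampling $g'$, and the delta's contribution reduces to the Gaussian density evaluated at the discontinuity.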
Scaling to higher dimensions is challenging.
So how do we scale this to higher dimensions? Starts becoming real complicated....
Computing Gradients of Stochastic Objectives
Derivative of the Heaviside is the delta, but what is the derivative of a 2D indicator function?
Does such a thing even make sense? Infinite on the discontinuity surface but zero elsewhere?
Is it consistent with the 1D delta? Many questions.
Convolution with a 2D Gaussian gives a "ridge" of Gaussians along the discontinuity.
Computational approach might be to find the closest surface of discontinuity, then evaluate the Gaussian centered at the closest point.
Non-convexity
Smoothing can generally help get rid of high-frequency features in the problem.
- Makes intuitive sense, but it is a hell of a problem to isolate and analyze.
- Can't say much for arbitrary landscapes, even if we confine ourselves to smooth functions.
- About to give up on this route.... just leave it as a folk legend.
Example: Finite-Horizon LQR on Double Integrator (Policy search on PD gains)
(Landscape plots for horizons T = 10, 20, 40, 80.)
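A minimal sketch of how such landscapes can be generated (the discrete-time double integrator, quadratic cost, and gain ranges are my own illustrative choices):

```python
import numpy as np

def finite_horizon_cost(kp, kd, T=40, dt=0.1, x0=(1.0, 0.0)):
    """Accumulated quadratic cost of the PD policy u = -kp*q - kd*qdot on a
    discrete-time double integrator, over a finite horizon T."""
    A = np.array([[1.0, dt], [0.0, 1.0]])
    B = np.array([0.0, dt])
    x, cost = np.array(x0, dtype=float), 0.0
    for _ in range(T):
        u = -kp * x[0] - kd * x[1]
        cost += x @ x + 0.1 * u**2
        x = A @ x + B * u
    return cost

# Cost landscape over a grid of PD gains, for each horizon shown above.
kps, kds = np.linspace(0.0, 5.0, 60), np.linspace(0.0, 5.0, 60)
landscapes = {T: np.array([[finite_horizon_cost(kp, kd, T=T) for kp in kps] for kd in kds])
              for T in (10, 20, 40, 80)}
```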
Which type of "minima" does smoothing prefer? and why?
When Smoothing Hurts
In general, reasoning about the quality (suboptimality) of the different minima is NP-hard.
Instead, we can reason about the behavior of what happens to the boundary points of local minima.
One tool we might be able to use is the region of attraction for different local minima.
Idea is to carve up the domain of the decision variables and "color" them according to different local minima.
How will the boundary of two different regions move as we smooth it out?
Which type of "minima" does smoothing prefer? and why?
Direction of movement of the local maxima is determined by:
Simple 1D example: a local maximum must lie between two local minima, so we reason about what happens to that local maximum as we smooth out the function.
Intuitively, smoothing widens the region of attraction for local minima that have higher Lipschitz constant.
(Prefers to widen "sharp" peaks).
When Smoothing Hurts
When Smoothing Hurts
If there is a sharp suboptimal local minimum, the region of attraction of the global minimum shrinks.
(But it could be the other way around: smoothing helps if there is a sharp global optimum with a small ROA.)
Limitations. Mode Explorations
Tobia's example of push recovery
Smoothing doesn't help us find better minima here....
Apply negative impulse to stand up.
Apply positive impulse to bounce on the wall.
Limitations. Mode Explorations
Long Horizon Gradients
By Terry Suh