Intern Meeting

2023/06/30

Global Planning for Contact-rich Manipulation: Overview of Steps

Step 1. Build goal-conditioned input / output sets with contact dynamics

Step 2. Build GCS out of these input / output sets

Step 3. Experimental testing on simulation and hardware

I will share progress & attempts on this front for the meeting.

Recap of Problem

For some discrete-time system $x'=f(x,u)$, find some tuple $(\mathcal{S}_i,\mathcal{S}_f,\pi(\cdot\,;\cdot))$ such that the policy $\pi(x ; x_f)$ stabilizes all $x_i\in\mathcal{S}_i$ to any $x_f \in \mathcal{S}_f$.

It took a while to define what exactly the input/output portals refer to.

Attempt 1. Goal-conditioned Policies

Challenges & Questions:

1. How do we get these goal-conditioned (a.k.a. universal) policies?

2. How do we represent these sets?

Attempt 1. Stochastic Policy Optimization (RL)

In the universal policy formulation, we can try to solve a stochastic policy optimization problem:

\begin{aligned} \min_\theta \quad & \mathbb{E}_{x_g\sim\mathcal{S}_g, x_i\sim \mathcal{S}_i}\left[\|x_T - x_g\|^2_\mathbf{Q} + \sum_t \|u_t\|^2_\mathbf{R} \right] \\ \text{s.t.} \quad & x_{t+1} = f(x_t, u_t) \quad \forall t \\ & u_t = \pi_\theta(x_t; x_g) \\ & x_0 = x_i \end{aligned}
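To make this concrete, here is a minimal JAX sketch of this objective (not our actual implementation): it assumes a double-integrator stand-in for $f$, a linear goal-conditioned policy, and Gaussian sampling as a proxy for $\mathcal{S}_i$ and $\mathcal{S}_g$; names like `rollout_cost` are illustrative only.

```python
# Minimal sketch of the stochastic policy optimization objective above.
# Assumptions (not from the slides): double-integrator stand-in for f(x, u),
# linear goal-conditioned policy, Gaussian sampling as a proxy for S_i / S_g.
import jax
import jax.numpy as jnp

dt, T = 0.1, 20
Q, R = 10.0, 0.1                       # weights in ||x_T - x_g||_Q^2 and ||u_t||_R^2

def f(x, u):                           # stand-in dynamics: 2D double integrator
    return jnp.array([x[0] + dt * x[1], x[1] + dt * u[0]])

def policy(theta, x, x_g):             # goal-conditioned policy u = pi_theta(x; x_g)
    K, b = theta
    return K @ jnp.concatenate([x, x_g]) + b

def rollout_cost(theta, x_i, x_g):
    def step(x, _):
        u = policy(theta, x, x_g)
        return f(x, u), R * jnp.sum(u ** 2)
    x_T, input_costs = jax.lax.scan(step, x_i, None, length=T)
    return Q * jnp.sum((x_T - x_g) ** 2) + jnp.sum(input_costs)

def expected_cost(theta, key, n_samples=64):
    k1, k2 = jax.random.split(key)
    x_is = 0.5 * jax.random.normal(k1, (n_samples, 2))                          # x_i ~ S_i
    x_gs = jnp.array([1.0, 0.0]) + 0.2 * jax.random.normal(k2, (n_samples, 2))  # x_g ~ S_g
    return jnp.mean(jax.vmap(rollout_cost, (None, 0, 0))(theta, x_is, x_gs))

theta = (jnp.zeros((1, 4)), jnp.zeros(1))
grad = jax.grad(expected_cost)(theta, jax.random.PRNGKey(0))   # one stochastic gradient
```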

There is a chicken-and-egg problem between the size of the sets and the ease of policy optimization.

1. How do we set the initial and goal sets? If they are too large, policy optimization is not successful. Even if it is, there's no point, since the whole point of our approach is to decompose the problem into easier problems.

2. On the other hand, if they are too small, the resulting graph is too large.

Attempt 1. Stochastic Policy Optimization (RL)

Performance of Policy Optimization with a single goal

Policy with time-varying gains and biases: u_t = \mathbf{K}_t x_t + k_t

Neural network policy: u_t = \pi_\theta(x_t)

Not quite doing the job as well as I hoped.... Maybe because the initial set was not good?

Attempt 1. Stochastic Policy Optimization (RL)


Iterative Refinement of the Initial Set

1. Solve the policy optimization problem from samples of the current ellipsoid.

2. Perform weighted PCA on the initial set with weights $w(x^i_0) = \exp(-\alpha V(x^i_0))$.

3. Sample from the refined ellipsoid, repeat.
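A minimal sketch of steps 2-3, assuming the per-sample costs $V(x^i_0)$ from the policy-optimization rollouts are already in hand; `refine_ellipsoid` and `sample_from_ellipsoid` are hypothetical helper names, not code we actually have.

```python
# Sketch of the weighted-PCA refinement: down-weight initial states with high
# cost V(x_0^i), then refit the ellipsoid to the weighted samples.
import jax
import jax.numpy as jnp

def refine_ellipsoid(x0s, values, alpha):
    """x0s: (N, n) sampled initial states; values: (N,) costs V(x_0^i)."""
    w = jnp.exp(-alpha * values)                 # w(x_0^i) = exp(-alpha V(x_0^i))
    w = w / jnp.sum(w)
    mean = w @ x0s                               # weighted mean
    centered = x0s - mean
    cov = centered.T @ (w[:, None] * centered)   # weighted covariance
    eigvals, eigvecs = jnp.linalg.eigh(cov)      # principal axes of the refined ellipsoid
    return mean, eigvals, eigvecs

def sample_from_ellipsoid(key, mean, eigvals, eigvecs, n_samples):
    # Gaussian samples whose covariance matches the refined ellipsoid.
    z = jax.random.normal(key, (n_samples, mean.shape[0]))
    return mean + (z * jnp.sqrt(eigvals)) @ eigvecs.T
```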

Some factors led me to abandon this route.

1. Policy optimization is a bit costly to use as an inner routine in an alternation scheme.

2. At the same time, performance of PO was not that good when we care about reaching the goal exactly.

3. Many hyperparameters. How do we reliably set the temperature parameter?

4. At the time, Pang was looking into LQR solutions with more promising results! 

Re-usable Parts from this attempt. 

I have a pretty good single-shooting optimization algorithm for trajectory optimization!

\begin{aligned} \min_{\mathbf{K}_t,k_t} \quad & V(\theta) \coloneqq \|x_T - x_g\|^2_\mathbf{Q} + \sum_t \|u_t\|^2_\mathbf{R} \\ \text{s.t.} \quad & x_{t+1} = f(x_t, u_t) \quad \forall t \\ & u_t = \mathbf{K}_t x_t + k_t \\ & x_0 = x_i \end{aligned}

where $\theta = (\mathbf{K}_t, k_t)$.
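A compact sketch of what $V(\theta)$ looks like in code, again with a double-integrator stand-in for $f$ (not the contact dynamics); with a smooth $f$ the gradient falls straight out of autodiff, while the estimators described below avoid differentiating $f$.

```python
# Sketch of the single-shooting objective V(theta), theta = (K_t, k_t).
# f is a smooth stand-in; with smooth dynamics jax.grad gives the gradient
# directly, while the estimators described below avoid differentiating f.
import jax
import jax.numpy as jnp

dt, T = 0.1, 20
Q, R = 10.0, 0.1

def f(x, u):                                   # stand-in: 2D double integrator
    return jnp.array([x[0] + dt * x[1], x[1] + dt * u[0]])

def V(theta, x_i, x_g):
    Ks, ks = theta                             # Ks: (T, 1, 2), ks: (T, 1)
    def step(x, gains):
        K_t, k_t = gains
        u = K_t @ x + k_t                      # u_t = K_t x_t + k_t
        return f(x, u), R * jnp.sum(u ** 2)
    x_T, input_costs = jax.lax.scan(step, x_i, (Ks, ks))
    return Q * jnp.sum((x_T - x_g) ** 2) + jnp.sum(input_costs)

theta = (jnp.zeros((T, 1, 2)), jnp.zeros((T, 1)))
value, grad = jax.value_and_grad(V)(theta, jnp.zeros(2), jnp.array([1.0, 0.0]))
```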

Re-usable Parts from this attempt. 

The solver supports three modes of estimating the gradient $\nabla_{\theta} V(\theta)$:

1. Policy Gradient Trick: inject Gaussian noise $w_t \sim \mathcal{N}(0,\sigma^2\mathbf{I})$ at the output of the policy, which gives

\begin{aligned} \nabla_{\theta} \mathbb{E}_{w_t} [V(\theta)] = \frac{1}{\sigma^2} V(\theta) \sum^T_{t=0} \frac{\partial u_t}{\partial \theta}^\top w_t \end{aligned}

2. Autodiff computation graph with numerical gradients: do the gradient computation with autodiff, but approximate the dynamics gradients with finite differences.

3. Autodiff computation graph with randomized gradients: do the gradient computation with autodiff, but approximate the dynamics gradients with a least-squares estimate of the gradient.
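A sketch of mode 1, using the same stand-in linear time-varying policy as before: noise is injected at the policy output, the dynamics are never differentiated, and the estimate is the Monte Carlo average of $\frac{1}{\sigma^2} V(\theta) \sum_t (\partial u_t/\partial\theta)^\top w_t$ over noisy rollouts. `pg_gradient_estimate` is an illustrative name, not the solver's API.

```python
# Sketch of gradient-estimation mode 1 (policy gradient trick): inject w_t at
# the policy output and average (1/sigma^2) V(theta) sum_t (du_t/dtheta)^T w_t
# over noisy rollouts. The dynamics f are treated as a black box.
import jax
import jax.numpy as jnp

dt, T, sigma = 0.1, 20, 0.1
Q, R = 10.0, 0.1

def f(x, u):                                      # black-box stand-in dynamics
    return jnp.array([x[0] + dt * x[1], x[1] + dt * u[0]])

def policy(theta, t, x):                          # u_t = K_t x_t + k_t
    Ks, ks = theta
    return Ks[t] @ x + ks[t]

def pg_gradient_estimate(theta, x_i, x_g, key, n_rollouts=32):
    def single_rollout(key):
        ws = sigma * jax.random.normal(key, (T, 1))        # w_t ~ N(0, sigma^2 I)
        x, cost = x_i, 0.0
        score = jax.tree_util.tree_map(jnp.zeros_like, theta)
        for t in range(T):
            u_nom, vjp_fn = jax.vjp(lambda th: policy(th, t, x), theta)
            u = u_nom + ws[t]                              # noise injected at the output
            cost += R * jnp.sum(u ** 2)
            # accumulate (du_t/dtheta)^T w_t -- only the policy is differentiated
            score = jax.tree_util.tree_map(jnp.add, score, vjp_fn(ws[t])[0])
            x = f(x, u)
        cost += Q * jnp.sum((x - x_g) ** 2)
        return jax.tree_util.tree_map(lambda s: cost * s / sigma ** 2, score)
    grads = [single_rollout(k) for k in jax.random.split(key, n_rollouts)]
    return jax.tree_util.tree_map(lambda *g: sum(g) / n_rollouts, *grads)

theta = (jnp.zeros((T, 1, 2)), jnp.zeros((T, 1)))
grad = pg_gradient_estimate(theta, jnp.zeros(2), jnp.array([1.0, 0.0]), jax.random.PRNGKey(0))
```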

Trajopt Results

Trajopt is significantly "easier" than policyopt and works quite reliably for now.

1. Having closed-loop gains to optimize makes a large difference in performance.

2. First-order optimizers more performant than the policy gradient trick.

Randomized vs. Finite Differences.

Attempt 2. Forwards Reachability

Attempt 1 (Goal-conditioned Policies): For some discrete-time system $x'=f(x,u)$, find some tuple $(\mathcal{S}_i,\mathcal{S}_f,\pi(\cdot\,;\cdot))$ such that the policy $\pi(x ; x_f)$ stabilizes all $x_i\in\mathcal{S}_i$ to any $x_f \in \mathcal{S}_f$.

Lessons: Directly optimizing for such policies is difficult. What if we don't ask for a feedback-stabilizable set of goal points, but are willing to incorporate open-loop trajectories?

Attempt 2. Forwards Reachable Sets Robust to Disturbances on Initial Conditions

For some discrete-time system $x'=f(x,u)$, given some set of initial conditions $x_i\in\mathcal{S}_i$, find $\mathcal{S}_f$, defined as

\mathcal{S}_f = \{x_{T+1} \mid \exists u_0,\cdots,u_T \text{ with } \|u_t\|\leq \gamma \; \forall t, \text{ for any } x_0 \in \mathcal{S}_i\}

Attempt 2. Robust Forward Reachable Sets

Equivalently, this is the intersection over initial conditions of the time-$(T{+}1)$ reachable sets:

\mathcal{S}_f = \cap_{x_0 \in \mathcal{S}_i} x_{T+1}(x_0)

Attempt 2. Robust Forward Reachable Sets

Specialization to Ellipsoids on LTV systems

Suppose we solve trajopt for some nominal trajectory $(x_0,\cdots,x_{T+1})$, and describe the dynamics of disturbances with time-varying linearizations along this trajectory,

\delta x_{t+1} = \mathbf{A}_t \delta x_t + \mathbf{B}_t \delta u_t

The set of final conditions is characterized by a Finite Impulse Response (FIR):

\delta x_{T+1} = \mathbf{N}\delta x_0 + \mathbf{M}_0 \delta u_0 + \mathbf{M}_1 \delta u_1 + \cdots + \mathbf{M}_T \delta u_T

If $\delta x_0$ is zero, the reachable set under the constraint $\|u_t\|^2\leq \gamma$ is given by

\{x_{T+1}\} = \mathcal{E}^\gamma_0 \oplus \mathcal{E}^\gamma_1 \oplus \cdots \oplus \mathcal{E}^\gamma_T

where $\mathcal{E}^\gamma_t = \{x \mid x^\top \left[\mathbf{M}_t\mathbf{M}_t^\top\right]^{-1} x \leq \gamma \}$.

Note that this Minkowski sum results in a convex body with a unique minimum volume outer ellipsoid (MVOE), which can be computed with SDPs / Picard iterations. Denote this MVOE by $\mathcal{E}^\gamma_{u}$.
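A sketch of these quantities, assuming lists of linearizations $\mathbf{A}_t, \mathbf{B}_t$ along the nominal trajectory are already available. The helper names are illustrative, and the outer-ellipsoid routine uses the standard minimum-trace bound as a cheap surrogate for the exact MVOE (which, as noted above, would come from an SDP / Picard iteration).

```python
# Sketch of the FIR matrices and per-step reachable ellipsoids, assuming lists
# of linearizations A_t, B_t along the nominal trajectory. The outer ellipsoid
# uses the minimum-trace bound, a surrogate for the exact MVOE.
import jax.numpy as jnp

def fir_matrices(As, Bs):
    """delta x_{T+1} = N delta x_0 + sum_t M_t delta u_t."""
    suffix = jnp.eye(As[0].shape[0])       # running product A_T ... A_{t+1}
    Ms = [None] * len(Bs)
    for t in reversed(range(len(As))):
        Ms[t] = suffix @ Bs[t]             # M_t = A_T ... A_{t+1} B_t
        suffix = suffix @ As[t]
    return suffix, Ms                      # N = A_T ... A_0, and [M_0, ..., M_T]

def step_ellipsoid_shape(M_t, gamma):
    """Shape matrix Q_t with E_t = {M_t v : ||v||^2 <= gamma} = {x : x^T Q_t^{-1} x <= 1},
    assuming M_t has full row rank."""
    return gamma * (M_t @ M_t.T)

def trace_outer_ellipsoid(Qs):
    """Outer ellipsoid of the Minkowski sum of ellipsoids {x : x^T Q_i^{-1} x <= 1}:
    the minimum-trace member of the standard parametric family (not the true MVOE)."""
    roots = [jnp.sqrt(jnp.trace(Q)) for Q in Qs]
    return sum(roots) * sum(Q / r for Q, r in zip(Qs, roots))
```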

Attempt 2. Robust Forward Reachable Sets

What if we want to solve for nonzero disturbances on the initial condition?

Let $\mathbf{M}_u, \delta u$ denote parameters of the MVOE such that

\mathcal{E}^\gamma_{u} = \{x \mid x = \mathbf{M}_u \delta u, \|\delta u\|\leq \gamma\}

We can then rewrite the FIR as a simple equation:

\delta x_{T+1} = \mathbf{N}\delta x_0 + \mathbf{M}_u \delta u

In addition, suppose that the uncertainty set for the initial condition is contained within some ellipsoid with quadratic form

\mathcal{E}^\rho_{x} = \{\delta x \mid \delta x^\top \mathbf{S} \delta x\leq \rho\}

where the image of this set through the linear map $\mathbf{A}$ becomes

\mathcal{E}^\rho_{\mathbf{A} x} = \{\delta x \mid \delta x^\top \mathbf{A}^{-\top}\mathbf{S}\mathbf{A}^{-1} \delta x\leq \rho\}

What can we say about the set of $\delta x_{T+1}$ such that there exists a point $x\in \mathcal{E}^\gamma_{u}$ such that, no matter what point $y \in \mathcal{E}^\rho_{\mathbf{A}x}$ the adversary picks, we can achieve $\delta x_{T+1} = x + y$?
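For the image step above, a small sketch of pushing the initial-condition ellipsoid through the linear map (written $\mathbf{A}$ here), assuming that map is square and invertible:

```python
# Image of the ellipsoid {dx : dx^T S dx <= rho} under an invertible linear map A:
# A E = {dx : dx^T (A^{-T} S A^{-1}) dx <= rho}.
import jax.numpy as jnp

def image_quadratic_form(S, A):
    A_inv = jnp.linalg.inv(A)
    return A_inv.T @ S @ A_inv
```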

Second Route: The Bottleneck Hypothesis

Intuition: Depending on the geometry, the intersection of all these cones for the reachable sets is necessarily bottlenecked at some location along the nominal trajectory.

Note that the true picture is far more complicated: $x_0$ is a hyperellipsoid, and all time-slices of the cones are hyperellipsoids.

Second Route: The Bottleneck Hypothesis

But these are far more practical to compute if we fix a point.

x0: Backwards Reachable Set

Do TV-LQR along the red trajectory using its linearization, and determine the ROA with a line search.

x_{T+1}: Forward Reachable Set

Compute the MVOE along the blue trajectory; we already explained how to compute this given a fixed initial condition.

Bottleneck Hypothesis: Suppose we fix a point and compute these forward / backward sets. Is there some initial region, $\rho$, and $\gamma$, such that this is the solution to the robust reachable set problem?
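For reference, a minimal sketch of the TV-LQR piece (the backward Riccati recursion along given linearizations $\mathbf{A}_t, \mathbf{B}_t$); the cost matrices are placeholders and the ROA line search is not shown.

```python
# Minimal TV-LQR sketch along a nominal trajectory's linearizations (A_t, B_t).
# Cost matrices Q, R, Qf are placeholders; the ROA line search is not shown.
import jax.numpy as jnp

def tvlqr_gains(As, Bs, Q, R, Qf):
    """Backward Riccati recursion; returns gains K_t for u_t = -K_t dx_t."""
    P = Qf
    Ks = [None] * len(As)
    for t in reversed(range(len(As))):
        A, B = As[t], Bs[t]
        K = jnp.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
        Ks[t] = K
    return Ks
```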

Second Route: The Bottleneck Hypothesis

[Figure: Backwards Reachable vs. Forwards Reachable sets]