### russtedrake

Roboticist at MIT and TRI

**(Part 1)**

MIT 6.4210/2:

Robotic Manipulation

Fall 2022, Lecture 18

Follow **live** at https://slides.com/d/9H33j8E/live

(or later at https://slides.com/russtedrake/fall22-lec18)

Levine*, Finn*, Darrell, Abbeel, JMLR 2016

Andy Zeng's MIT CSL Seminar, April 4, 2022

Andy's slides.com presentation

*OpenAI* - Learning Dexterity

Recipe:

- Make the simulator
- Write cost function
- Deep policy gradient

"And then … BC methods started to get good. Really good. So good that our best manipulation system today mostly uses BC, with a sprinkle of Q learning on top to perform high-level action selection. Today, less than 20% of our research investments is on RL, and the research runway for BC-based methods feels more robust."

```
import gym
from gym import error, spaces, utils
from gym.utils import seeding


class FooEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self):
        ...

    def step(self, action):
        ...

    def reset(self):
        ...

    def render(self, mode='human'):
        ...

    def close(self):
        ...
```
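To make the interface above concrete, here is a minimal sketch of how `reset()` and `step()` are driven in a rollout. `ToyPointEnv`, its dynamics, and its reward are hypothetical and exist only for illustration; the class mirrors the classic gym method names and the 4-tuple `step()` return, but it deliberately does not import gym.

```python
import random


class ToyPointEnv:
    """Hypothetical 1-D point environment with gym-style methods."""

    def __init__(self, time_step=0.1, horizon=50):
        self.time_step = time_step
        self.horizon = horizon
        self.position = 0.0
        self.steps = 0

    def reset(self):
        # Start each episode from a fixed state and return the observation.
        self.position = 1.0
        self.steps = 0
        return self.position

    def step(self, action):
        # Integrate a velocity command, then report
        # (observation, reward, done, info).
        self.position += self.time_step * action
        self.steps += 1
        reward = -self.position**2
        done = self.steps >= self.horizon
        return self.position, reward, done, {}


def rollout(env, policy):
    """Run one episode and return the summed reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        observation, reward, done, _ = env.step(policy(observation))
        total_reward += reward
    return total_reward


random_return = rollout(ToyPointEnv(), lambda obs: random.uniform(-1.0, 1.0))
feedback_return = rollout(ToyPointEnv(), lambda obs: -2.0 * obs)
print(f"random policy: {random_return:.2f}, feedback: {feedback_return:.2f}")
```

A learning algorithm only ever touches the environment through this loop, which is why wrapping a simulator in the same methods is enough to reuse off-the-shelf RL code.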

http://gym.openai.com/

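```
from pydrake.all import DiagramBuilder, Simulator

builder = DiagramBuilder()
...  # add and connect systems
diagram = builder.Build()
simulator = Simulator(diagram)

# step(): advance the simulation, then read the observation and reward.
simulator.AdvanceTo(...)
observation = sensor_output_port.Eval(context)
reward = reward_output_port.Eval(context)

# reset(): start again from a default context.
context = diagram.CreateDefaultContext()

# render(): publish the current state to the visualizer.
meshcat.Publish(context)
```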

```
from typing import Callable, Union

import gym
from pydrake.all import (Context, InputPort, InputPortIndex, OutputPortIndex,
                         RandomGenerator, Simulator, System)


class DrakeGymEnv(gym.Env):
    """
    DrakeGymEnv provides a gym.Env interface for a Drake System (often a
    Diagram) using a Simulator.
    """

    def __init__(self,
                 simulator: Union[Simulator, Callable[[RandomGenerator],
                                                      Simulator]],
                 time_step: float,
                 action_space: gym.spaces.Space,
                 observation_space: gym.spaces.Space,
                 reward: Union[Callable[[System, Context], float],
                               OutputPortIndex, str],
                 action_port_id: Union[InputPort, InputPortIndex, str] = None,
                 observation_port_id: Union[OutputPortIndex, str] = None,
                 render_rgb_port_id: Union[OutputPortIndex, str] = None,
                 set_home: Callable[[Simulator, Context], None] = None,
                 hardware: bool = False):
        """
        Args:
            simulator: Either:
                * A drake.systems.analysis.Simulator, or
                * A function that produces a (randomized) Simulator.
            time_step: Each call to step() will advance the simulator by
                `time_step` seconds.
            reward: The reward can be specified in one of two
                ways: (1) by passing a callable with the signature
                `value = reward(system, context)` or (2) by passing a
                scalar (size-1) vector-valued output port of `simulator`'s
                system.
            action_port_id: The ID of an input port of `simulator`'s system
                compatible with the action_space. Each Env *must* have an
                action port; passing `None` defaults to using the *first*
                input port (inspired by
                `InputPortSelection.kUseFirstInputIfItExists`).
            action_space: Defines the `gym.spaces.Space` for the actions. If
                the action port is vector-valued, then passing `None` defaults
                to a gym.spaces.Box of the correct dimension with bounds at
                negative and positive infinity. Note: Stable Baselines 3
                strongly encourages normalizing the action_space to [-1, 1].
            observation_port_id: An output port of `simulator`'s system
                compatible with the observation_space. Each Env *must* have
                an observation port (it seems that gym doesn't support empty
                observation spaces / open-loop policies); passing `None`
                defaults to using the *first* output port (inspired by
                `OutputPortSelection.kUseFirstOutputIfItExists`).
            observation_space: Defines the `gym.spaces.Space` for the
                observations. If the observation port is vector-valued, then
                passing `None` defaults to a gym.spaces.Box of the correct
                dimension with bounds at negative and positive infinity.
            render_rgb_port_id: An optional output port of `simulator`'s
                system that returns an `ImageRgba8U`; often the `color_image`
                port of a Drake `RgbdSensor`. When not `None`, this enables
                the environment `render_mode` `rgb_array`.
            set_home: A function that sets the home state (plant, and/or env.)
                at reset(). The reset state can be specified in one of
                two ways:
                (1) setting a random context using a Drake random_generator
                    (e.g. joint.set_random_pose_distribution()),
                (2) passing a function set_home().
            hardware: If True, reset() does not set a random context from
                random_generator, but it still executes set_home() if given.

        Notes (using `env` as an instance of this class):
        - You may set simulator/integrator preferences by using
          `env.simulator` directly.
        - The `done` condition returned by `step()` is always False by
          default. Use `env.simulator.set_monitor()` to use Drake's monitor
          functionality for specifying termination conditions.
        - You may additionally wish to directly set `env.reward_range` and/or
          `env.spec`. See the docs for gym.Env for more details.
        """
```

`from manipulation.drake_gym import DrakeGymEnv`

*OpenAI* - Learning Dexterity

"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."

https://openai.com/blog/openai-baselines-ppo/

```
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log)
```

```
# stable_baselines3/common/policies.py#L435-L440
# Default network architecture, from stable-baselines
net_arch = [dict(pi=[64, 64], vf=[64, 64])]
```

Actions

Observations

`builder.ExportInput(inv_dynamics.get_desired_position(), "actions")`

Network

`builder.ExportOutput(plant.get_state_output_port(), "observations")`

approximately:

```
angle_from_vertical = (box_state[2] % np.pi) - np.pi / 2
cost = 2 * angle_from_vertical**2 # box angle
cost += 0.1 * box_state[5]**2 # box velocity
effort = actions - finger_state[:2]
cost += 0.1 * effort.dot(effort) # effort
# finger velocity
cost += 0.1 * finger_state[2:].dot(finger_state[2:])
# Add 10 to make rewards positive (to avoid rewarding simulator
# crashes).
output[0] = 10 - cost
```
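The snippet above can be wrapped into a standalone function for experimentation. This is a sketch under stated assumptions, not the course's implementation: the function name `box_flipup_reward` is hypothetical, and the state layouts (box state as planar pose followed by velocities, finger state as position followed by velocity) are guesses made only so the indices have a meaning.

```python
import math


def box_flipup_reward(box_state, finger_state, actions):
    """Reward sketch; the state layouts below are assumptions.

    box_state:    [x, z, theta, xdot, zdot, thetadot]
    finger_state: [x, z, xdot, zdot]
    actions:      desired finger position [x, z]
    """
    angle_from_vertical = (box_state[2] % math.pi) - math.pi / 2
    cost = 2 * angle_from_vertical**2                    # box angle
    cost += 0.1 * box_state[5]**2                        # box velocity
    effort = [a - q for a, q in zip(actions, finger_state[:2])]
    cost += 0.1 * sum(e * e for e in effort)             # effort
    cost += 0.1 * sum(v * v for v in finger_state[2:])   # finger velocity
    # Offset by 10 so that rewards stay positive, as in the snippet above.
    return 10 - cost


upright = box_flipup_reward([0, 0, math.pi / 2, 0, 0, 0],
                            [0.1, 0.2, 0, 0], [0.1, 0.2])
lying_flat = box_flipup_reward([0, 0, 0.0, 0, 0, 0],
                               [0.1, 0.2, 0, 0], [0.1, 0.2])
print(upright, lying_flat)
```

With everything at rest and the commanded finger position equal to the current one, only the angle term is active, so the upright box receives the maximum reward of 10.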

https://en.wikipedia.org/wiki/CMA-ES

(Image source: Tobin et al, 2017)

*Do Differentiable Simulators Give Better Policy Gradients?*

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022

Available at: https://arxiv.org/abs/2202.00817

Contact dynamics can lead to **discontinuous** landscapes, but mostly in the *corner cases*.

We have "real" discontinuities at the corner cases

- making contact w/ a different face
- transitions to/from contact and no contact

Soft/compliant contact can replace discontinuities with stiff approximations.

\[ \min_x f(x) \]

For gradient descent, discontinuities / non-smoothness can

- introduce local minima
- destroy convergence (e.g. \(l_1\)-minimization)

- A natural idea: can we smooth the objective?

- Probabilistic formulation, for small \(\Sigma\): \[ \min_x f(x) \approx \min_\mu E \left[ f(x) \right], x \sim \mathcal{N}(\mu, \Sigma) \]

- A low-pass filter in parameter space with a Gaussian kernel.

- Smooth local minima
- Alleviate flat regions
- Encode robustness
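As a small illustration of how Gaussian smoothing alleviates flat regions, the sketch below estimates the smoothed objective by sampling. The objective `f` (a clipped quadratic) and the sample counts are assumptions chosen for illustration; they do not come from the lecture.

```python
import random


def f(x):
    """Nonsmooth objective: quadratic near 0, perfectly flat for |x| >= 1."""
    return min(x * x, 1.0)


def smoothed_f(mu, sigma=1.0, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[f(x)] with x ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = sum(f(rng.gauss(mu, sigma)) for _ in range(num_samples))
    return total / num_samples


for mu in [0.0, 1.0, 2.0, 3.0]:
    print(f"mu={mu:.1f}  f={f(mu):.3f}  smoothed={smoothed_f(mu):.3f}")
```

At `mu = 2` and `mu = 3` the original objective is identical, so gradient descent has no signal there, while the smoothed objective still decreases toward the minimum.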

\begin{gathered}
\min_\theta f(\theta)
\end{gathered}

vs

\begin{gathered}
\min_\theta E_w\left[ f(\theta, w) \right] \\
w \sim N(0, \Sigma)
\end{gathered}

In reinforcement learning (RL) and "deep" model-predictive control, we add stochasticity via

- Stochastic policies
- Random initial conditions
- "Domain randomization"

then optimize a *stochastic optimal control* objective (e.g. maximize expected reward)

These can all *smooth* the optimization landscape.

The answer is subtle; the Heaviside example might shed some light.

\begin{gathered}
\min_\theta f(\theta)
\end{gathered}

vs

\begin{gathered}
\min_\theta E_w\left[ f(\theta, w) \right] \\
w \sim N(0, \Sigma)
\end{gathered}

Differentiable simulators give \(\frac{\partial f}{\partial \theta}\), but we want \(\frac{\partial}{\partial \theta} E_w[f(\theta, w)]\).

J. Burke, F. E. Curtis, A. Lewis, M. Overton, and L. Simoes, *Gradient Sampling Methods for Nonsmooth Optimization*, 02 2020, pp. 201–225.

- Approximate the smoothed objective via Monte Carlo: \[ E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K f(x_i), \quad x_i \sim \mathcal{N}(\mu, \Sigma) \]
- First-order gradient estimate \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu}, \quad w_i \sim \mathcal{N}(0, \Sigma) \]

- Zero-order gradient estimate (aka REINFORCE) \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \left[f(\mu + w_i) - f(\mu)\right] \Sigma^{-1} w_i, \quad w_i \sim \mathcal{N}(0, \Sigma) \]

- The two gradient estimates converge to the same quantity under sufficient regularity conditions.

- The convergence rate scales directly with the variance of the estimators; the zero-order estimator often has higher variance.
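The two estimators can be compared numerically on a smooth function. The sketch below is a scalar example under assumed values: `f(x) = x^2`, `mu = 1`, and `sigma = 0.5` are illustrative choices, for which the gradient of the smoothed objective `mu^2 + sigma^2` is exactly `2 * mu = 2`. The zero-order samples are scaled by `1 / sigma^2`, the scalar form of the inverse covariance, which is the factor needed for their mean to estimate that gradient.

```python
import random
import statistics


def f(x):
    return x * x


def df_dx(x):
    return 2.0 * x


def gradient_samples(mu, sigma, num_samples, seed=0):
    """Return per-sample first-order and zero-order gradient estimates."""
    rng = random.Random(seed)
    first_order, zero_order = [], []
    for _ in range(num_samples):
        w = rng.gauss(0.0, sigma)
        # First order: differentiate through the sampled point.
        first_order.append(df_dx(mu + w))
        # Zero order: function values only, with f(mu) as a baseline.
        zero_order.append((f(mu + w) - f(mu)) * w / sigma**2)
    return first_order, zero_order


first, zero = gradient_samples(mu=1.0, sigma=0.5, num_samples=20000)
first_mean, zero_mean = statistics.fmean(first), statistics.fmean(zero)
first_var, zero_var = statistics.variance(first), statistics.variance(zero)
print(f"first-order: mean={first_mean:.3f}, variance={first_var:.3f}")
print(f"zero-order:  mean={zero_mean:.3f}, variance={zero_var:.3f}")
```

Both sample means land near 2, but the zero-order samples are noticeably more spread out, which is the variance gap described above.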

*But the regularity conditions aren't met at contact discontinuities, leading to a biased first-order estimator.*

*Often, but not always.*

\(\frac{\partial f(x)}{\partial x} = 0\) almost everywhere!

\( \Rightarrow \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu} = 0 \)

First-order estimator is biased

\( \not\approx \frac{\partial}{\partial \mu} E_\mu [f(x)] \)

Zero-order estimator is (still) unbiased
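The Heaviside case is easy to check numerically. In the sketch below (scalar, with assumed values `mu = 0` and `sigma = 1`), the smoothed objective is the Gaussian CDF, so its true gradient at `mu = 0` is the standard normal density at zero, about 0.399. The helper names are hypothetical.

```python
import math
import random


def heaviside(x):
    return 1.0 if x >= 0.0 else 0.0


def heaviside_gradient(x):
    # The derivative is zero everywhere except at the jump itself,
    # which a continuous random sample hits with probability zero.
    return 0.0


def estimate_gradients(mu, sigma, num_samples, seed=0):
    """Return (first-order, zero-order) estimates of d/dmu E[f(mu + w)]."""
    rng = random.Random(seed)
    first_order, zero_order = 0.0, 0.0
    for _ in range(num_samples):
        w = rng.gauss(0.0, sigma)
        first_order += heaviside_gradient(mu + w)
        zero_order += (heaviside(mu + w) - heaviside(mu)) * w / sigma**2
    return first_order / num_samples, zero_order / num_samples


first_order, zero_order = estimate_gradients(mu=0.0, sigma=1.0,
                                             num_samples=20000)
true_gradient = 1.0 / math.sqrt(2.0 * math.pi)
print(f"true={true_gradient:.3f}  first-order={first_order:.3f}  "
      f"zero-order={zero_order:.3f}")
```

No amount of extra sampling rescues the first-order estimate here, since every sample contributes exactly zero.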

- Continuous yet stiff approximations look like strict discontinuities in the finite-sample regime.
- In the paper, we formalize "*empirical bias*" to capture this.

e.g. with stiff contact models (large gradient \(\Rightarrow\) high variance)
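A stiff but continuous step shows the same effect with finitely many samples. The sketch below replaces the Heaviside function with a logistic function of assumed stiffness 1000 (an illustrative stand-in for a stiff contact model, not a model from the paper) and forms many first-order estimates from small batches. For this stiffness the gradient of the smoothed objective at `mu = 0` with `sigma = 1` is close to 0.4.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def stiff_step_gradient(x, stiffness):
    """Derivative of sigmoid(stiffness * x) with respect to x."""
    s = sigmoid(stiffness * x)
    return stiffness * s * (1.0 - s)


def first_order_batches(mu, sigma, stiffness, batch_size, num_batches,
                        seed=0):
    """Return one first-order gradient estimate per batch of samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_batches):
        grads = [stiff_step_gradient(mu + rng.gauss(0.0, sigma), stiffness)
                 for _ in range(batch_size)]
        estimates.append(sum(grads) / batch_size)
    return estimates


estimates = first_order_batches(mu=0.0, sigma=1.0, stiffness=1000.0,
                                batch_size=10, num_batches=2000)
near_zero_fraction = sum(e < 1e-2 for e in estimates) / len(estimates)
largest_estimate = max(estimates)
print(f"batches with estimate < 0.01: {near_zero_fraction:.1%}")
print(f"largest single-batch estimate: {largest_estimate:.1f}")
```

Most batches report a gradient of essentially zero, and the rare batch that samples inside the narrow transition reports a value far above 0.4, so the estimator behaves as if biased until the sample count is very large.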

*Global Planning for Contact-Rich Manipulation via Local Smoothing of Quasi-dynamic Contact Models*

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake

Available at: https://arxiv.org/abs/2206.10787

Establish equivalence between randomized smoothing and a (deterministic/differentiable) force-at-a-distance contact model.

- Do we need the *over-parameterization* of deep policies?
- Is there a comparable story to interpolating solutions in high-dimensional policy space?

"By studying both ES and RL gradient estimators mathematically we can see that ES is an attractive choice especially when the number of time steps in an episode is long, where actions have long-lasting effects, or if no good value function estimates are available."

Trust-region method on a 'Branin' function. From Northwestern University Open Text Book on Process Optimization

Kolter, J. Zico, Zachary Jackowski, and Russ Tedrake. "Design, analysis, and learning control of a fully actuated micro wind turbine." *2012 American Control Conference (ACC)*. IEEE, 2012.

Schulman, John, et al. "Trust region policy optimization." *International conference on machine learning*. 2015.

```
import torch
from garage.envs import GymEnv
from garage.experiment.deterministic import set_seed
from garage.torch.algos import PPO
from garage.torch.policies import GaussianMLPPolicy
from garage.torch.value_functions import GaussianMLPValueFunction
from garage.trainer import Trainer


def ppo_pendulum(ctxt=None, seed=1):
    """Train PPO with InvertedDoublePendulum-v2 environment.

    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.
    """
    set_seed(seed)
    env = GymEnv('InvertedDoublePendulum-v2')
    trainer = Trainer(ctxt)
    policy = GaussianMLPPolicy(env.spec,
                               hidden_sizes=[64, 64],
                               hidden_nonlinearity=torch.tanh,
                               output_nonlinearity=None)
    value_function = GaussianMLPValueFunction(env_spec=env.spec,
                                              hidden_sizes=(32, 32),
                                              hidden_nonlinearity=torch.tanh,
                                              output_nonlinearity=None)
    algo = PPO(env_spec=env.spec,
               policy=policy,
               value_function=value_function,
               discount=0.99,
               center_adv=False)
    trainer.setup(algo, env)
    trainer.train(n_epochs=100, batch_size=10000)
```

A System for General In-Hand Object Re-Orientation

Tao Chen, Jie Xu, Pulkit Agrawal

*Conference on Robot Learning (CoRL)*, 2021 (Best Paper Award)

https://taochenshh.github.io/projects/in-hand-reorientation

“The sheer scope and variation across objects tested with this method, and the range of different policy architectures and approaches tested makes this paper extremely thorough in its analysis of this reorientation task.”

"We use PPO to optimize \(\pi\)."

Schulman, John, et al. "Proximal policy optimization algorithms." *arXiv preprint arXiv:1707.06347* (2017).

https://spinningup.openai.com/en/latest/algorithms/ppo.html

By russtedrake

MIT Robotic Manipulation Fall 2020 http://manipulation.csail.mit.edu
