Reinforcement Learning

(Part 1)

MIT 6.4210/2:

Robotic Manipulation

Fall 2022, Lecture 18

Follow live at https://slides.com/d/9H33j8E/live

(or later at https://slides.com/russtedrake/fall22-lec18)

Levine*, Finn*, Darrell, Abbeel, JMLR 2016

Last time: Visuomotor policies

Andy Zeng's MIT CSL Seminar, April 4, 2022

Andy's slides.com presentation

OpenAI - Learning Dexterity

Recipe:

Make the simulator
Write cost function
Deep policy gradient

"And then … BC methods started to get good. Really good. So good that our best manipulation system today mostly uses BC, with a sprinkle of Q learning on top to perform high-level action selection. Today, less than 20% of our research investments is on RL, and the research runway for BC-based methods feels more robust."

Ex: OpenAI Gym

import gym
from gym import error, spaces, utils
from gym.utils import seeding

class FooEnv(gym.Env):
  metadata = {'render.modes': ['human']}

  def __init__(self):
    ...
  def step(self, action):
    ...
  def reset(self):
    ...
  def render(self, mode='human'):
    ...
  def close(self):
    ...

http://gym.openai.com/
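The skeleton above can be filled in as follows. This is a minimal sketch, assuming the classic Gym convention in which `reset()` returns an observation and `step()` returns `(observation, reward, done, info)`. The class `PointMassEnv`, its dynamics, and its reward are hypothetical and chosen only for illustration; the class is plain Python and does not import `gym`, so it shows the shape of the interface rather than a registered environment.

```python
import random


class PointMassEnv:
    """Hypothetical 1D point mass, duck-typed to the classic Gym interface."""

    def __init__(self, time_step=0.1, max_steps=50, seed=0):
        self.time_step = time_step
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.state = [0.0, 0.0]  # [position, velocity]
        self.num_steps = 0

    def reset(self):
        # Random initial position, zero velocity.
        self.state = [self.rng.uniform(-1.0, 1.0), 0.0]
        self.num_steps = 0
        return list(self.state)

    def step(self, action):
        # Clip the force command to [-1, 1], then take one Euler step.
        force = max(-1.0, min(1.0, float(action)))
        position, velocity = self.state
        velocity += self.time_step * force
        position += self.time_step * velocity
        self.state = [position, velocity]
        self.num_steps += 1
        # Reward is the negative of a quadratic cost on state and effort.
        reward = -(position**2 + 0.1 * velocity**2 + 0.01 * force**2)
        done = self.num_steps >= self.max_steps
        return list(self.state), reward, done, {}


def rollout(env, policy):
    """Run one episode and return the total reward and episode length."""
    observation = env.reset()
    total_reward, done, steps = 0.0, False, 0
    while not done:
        observation, reward, done, _ = env.step(policy(observation))
        total_reward += reward
        steps += 1
    return total_reward, steps


def pd_policy(observation):
    # A hand-written PD controller standing in for a learned policy.
    return -2.0 * observation[0] - 1.0 * observation[1]


def zero_policy(observation):
    return 0.0


pd_return, pd_steps = rollout(PointMassEnv(seed=1), pd_policy)
zero_return, zero_steps = rollout(PointMassEnv(seed=1), zero_policy)
```

Both rollouts start from the same initial state because they share a seed, so the two returns can be compared directly.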

from pydrake.all import DiagramBuilder, Simulator


builder = DiagramBuilder()
....
diagram = builder.Build()
simulator = Simulator(diagram)


simulator.AdvanceTo(...)
observation = sensor_output_port.Eval(context)
reward = reward_output_port.Eval(context)


context = diagram.CreateDefaultContext()


meshcat.Publish(context)

DrakeGymEnv

class DrakeGymEnv(gym.Env):
    """
    DrakeGymEnv provides a gym.Env interface for a Drake System (often a
    Diagram) using a Simulator.
    """

    def __init__(self,
                 simulator: Union[Simulator, Callable[[RandomGenerator],
                                                      Simulator]],
                 time_step: float,
                 action_space: gym.spaces.space,
                 observation_space: gym.spaces.space,
                 reward: Union[Callable[[System, Context], float],
                               OutputPortIndex, str],
                 action_port_id: Union[InputPort, InputPortIndex, str] = None,
                 observation_port_id: Union[OutputPortIndex, str] = None,
                 render_rgb_port_id: Union[OutputPortIndex, str] = None,
                 set_home: Callable[[Simulator, Context], None] = None,
                 hardware: bool = False):
        """
        Args:
            simulator: Either:
                * A drake.systems.analysis.Simulator, or
                * A function that produces a (randomized) Simulator.
            time_step: Each call to step() will advance the simulator by
                `time_step` seconds.
            reward: The reward can be specified in one of two
                ways: (1) by passing a callable with the signature
                `value = reward(system, context)` or (2) by passing a scalar
                vector-valued output port of `simulator`'s system.
            action_port_id: The ID of an input port of `simulator`'s system
                compatible with the action_space.  Each Env *must* have an
                action port; passing `None` defaults to using the *first*
                input port (inspired by
                `InputPortSelection.kUseFirstInputIfItExists`).
            action_space: Defines the `gym.spaces.space` for the actions.  If
                the action port is vector-valued, then passing `None` defaults
                to a gym.spaces.Box of the correct dimension with bounds at
                negative and positive infinity.  Note: Stable Baselines 3
                strongly encourages normalizing the action_space to [-1, 1].
            observation_port_id: An output port of `simulator`'s system
                compatible with the observation_space. Each Env *must* have
                an observation port (it seems that gym doesn't support empty
                observation spaces / open-loop policies); passing `None`
                defaults to using the *first* output port (inspired by
                `OutputPortSelection.kUseFirstOutputIfItExists`).
            observation_space: Defines the gym.spaces.space for the
                observations.  If the observation port is vector-valued, then
                passing `None` defaults to a gym.spaces.Box of the correct
                dimension with bounds at negative and positive infinity.
            render_rgb_port_id: An optional output port of `simulator`'s system
                that returns an `ImageRgba8U`; often the `color_image` port
                of a Drake `RgbdSensor`.  When not `None`, this enables the
                environment `render_mode` `rgb_array`.
            set_home: A function that sets the home state (plant, and/or env.)
                at reset(). The reset state can be specified in one of
                the two ways:
                (1) setting random context using a Drake random_generator
                (e.g. joint.set_random_pose_distribution()),
                (2) passing a function set_home().
            hardware: If True, prevents setting a random context at
                reset() when using random_generator, but still executes
                set_home() if given.


        Notes (using `env` as an instance of this class):
        - You may set simulator/integrator preferences by using `env.simulator`
          directly.
        - The `done` condition returned by `step()` is always False by
          default.  Use `env.simulator.set_monitor()` to use Drake's monitor
          functionality for specifying termination conditions.
        - You may additionally wish to directly set `env.reward_range` and/or
          `env.spec`.  See the docs for gym.Env for more details.
        """

from manipulation.drake_gym import DrakeGymEnv

OpenAI - Learning Dexterity

"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."

https://openai.com/blog/openai-baselines-ppo/

from stable_baselines3 import PPO

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log)

stable_baselines3/common/policies.py#L435-L440

# Default network architecture, from stable-baselines
net_arch = [dict(pi=[64, 64], vf=[64, 64])]

Policy Architecture

Actions

Observations

builder.ExportOutput(inv_dynamics.get_desired_position(), "actions")

Network

builder.ExportOutput(plant.get_state_output_port(), "observations")

approximately:

Cost Function

angle_from_vertical = (box_state[2] % np.pi) - np.pi / 2
cost = 2 * angle_from_vertical**2  # box angle
cost += 0.1 * box_state[5]**2  # box velocity
effort = actions - finger_state[:2]
cost += 0.1 * effort.dot(effort)  # effort
# finger velocity
cost += 0.1 * finger_state[2:].dot(finger_state[2:])
# Add 10 to make rewards positive (to avoid rewarding simulator
# crashes).
output[0] = 10 - cost
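To make the individual terms concrete, the cost above can be wrapped in a function and evaluated at two sample states. This is a plain-Python sketch: the function name `box_flipup_reward` and the state layout (a six-element box state with the angle at index 2 and its rate at index 5, and a four-element finger state with two positions followed by two velocities) are assumptions inferred from the indexing in the snippet, not taken from the lecture code.

```python
import math


def box_flipup_reward(box_state, finger_state, actions):
    """Return 10 - cost for one time step (hypothetical wrapper)."""
    # The modulo treats theta and theta + pi as the same configuration;
    # the result is zero when theta = pi / 2.
    angle_from_vertical = (box_state[2] % math.pi) - math.pi / 2
    cost = 2 * angle_from_vertical**2  # box angle
    cost += 0.1 * box_state[5]**2  # box velocity
    # Effort: commanded finger positions minus current finger positions.
    effort = [a - q for a, q in zip(actions, finger_state[:2])]
    cost += 0.1 * sum(e * e for e in effort)
    # Finger velocity.
    cost += 0.1 * sum(v * v for v in finger_state[2:])
    return 10 - cost


# Box at theta = pi / 2 and at rest, fingers at rest and commanded to stay
# where they are: every term in the cost vanishes.
reward_at_goal = box_flipup_reward(
    box_state=[0.0, 0.0, math.pi / 2, 0.0, 0.0, 0.0],
    finger_state=[0.1, 0.2, 0.0, 0.0],
    actions=[0.1, 0.2])

# Box at theta = 0: only the angle term is active, contributing
# 2 * (pi / 2)**2 to the cost.
reward_off_goal = box_flipup_reward(
    box_state=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    finger_state=[0.1, 0.2, 0.0, 0.0],
    actions=[0.1, 0.2])
```

With these weights the largest possible angle penalty is about 4.9, so the offset of 10 keeps the reward positive as long as the velocity and effort terms stay moderate.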

CMA-ES

https://en.wikipedia.org/wiki/CMA-ES

"Domain Randomization"

(Image source: Tobin et al, 2017)

Do Differentiable Simulators Give Better Policy Gradients?

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022

Available at: https://arxiv.org/abs/2202.00817

Contact dynamics can lead to discontinuous landscapes, but mostly in the corner cases.

Continuity of solutions w.r.t. parameters

We have "real" discontinuities at the corner cases

making contact w/ a different face
transitions to/from contact and no contact

Soft/compliant contact can replace discontinuities with stiff approximations.

Beware "artificial" discontinuities

Non-smooth optimization

\[ \min_x f(x) \]

For gradient descent, discontinuities / non-smoothness can

introduce local minima
destroy convergence (e.g. \(l_1\)-minimization)

Smoothing discontinuous objectives

A natural idea: can we smooth the objective?

Probabilistic formulation, for small \(\Sigma\): \[ \min_x f(x) \approx \min_\mu E \left[ f(x) \right], x \sim \mathcal{N}(\mu, \Sigma) \]

A low-pass filter in parameter space with a Gaussian kernel.

Example: The Heaviside function

Smooth local minima
Alleviate flat regions
Encode robustness
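The Heaviside example can be checked numerically. The sketch below assumes a scalar parameter, so \(\Sigma = \sigma^2\), and the convention \(H(0) = 1\); the function names are illustrative. For this case the smoothed objective has a closed form, \(E[H(x)] = \Phi(\mu / \sigma)\), which the Monte Carlo estimate should approach.

```python
import math
import random


def heaviside(x):
    return 1.0 if x >= 0.0 else 0.0


def smoothed_monte_carlo(f, mu, sigma, num_samples, rng):
    """Estimate E[f(x)] for x ~ N(mu, sigma^2) by sampling."""
    total = sum(f(rng.gauss(mu, sigma)) for _ in range(num_samples))
    return total / num_samples


def smoothed_heaviside_exact(mu, sigma):
    """Closed form: E[H(x)] = P(x >= 0) = Phi(mu / sigma)."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


rng = random.Random(0)
sigma = 0.5
for mu in (-1.0, -0.25, 0.0, 0.25, 1.0):
    estimate = smoothed_monte_carlo(heaviside, mu, sigma, 20000, rng)
    exact = smoothed_heaviside_exact(mu, sigma)
    print(f"mu={mu:+.2f}  monte_carlo={estimate:.3f}  exact={exact:.3f}")
```

The smoothed function is a Gaussian CDF: it is differentiable everywhere and has a nonzero slope near the step, which is what gives gradient descent something to follow.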

Smoothing with stochasticity

\begin{gathered} \min_\theta f(\theta) \end{gathered}

\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}

Smoothing with stochasticity for Multibody Contact

Relationship to RL Policy Gradient / CMA / MPPI

In reinforcement learning (RL) and "deep" model-predictive control, we add stochasticity via

Stochastic policies
Random initial conditions
"Domain randomization"

then optimize a stochastic optimal control objective (e.g. maximize expected reward)

These can all smooth the optimization landscape.

Do Differentiable Simulators Give Better Policy Gradients?

The answer is subtle; the Heaviside example might shed some light.

\begin{gathered} \min_\theta f(\theta) \end{gathered}

\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}

Differentiable simulators give \(\frac{\partial f}{\partial \theta}\), but we want \(\frac{\partial}{\partial \theta} E_w[f(\theta, w)]\).

Randomized smoothing

J. Burke, F. E. Curtis, A. Lewis, M. Overton, and L. Simoes, Gradient Sampling Methods for Nonsmooth Optimization, 02 2020, pp. 201–225.

Approximate smoothed objective via Monte Carlo: \[ E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K f(x_i), \quad x_i \sim \mathcal{N}(\mu, \Sigma) \]
First-order gradient estimate \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu}, \quad w_i \sim \mathcal{N}(0, \Sigma) \]

Zero-order gradient estimate (aka REINFORCE) \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \left[f(\mu + w_i) - f(\mu)\right] \Sigma^{-1} w_i, \quad w_i \sim \mathcal{N}(0, \Sigma) \]
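Both estimators can be compared on a smooth test function. The sketch below is a scalar illustration under stated assumptions: the test function \(f(x) = x^2\) and all function names are hypothetical, and the zero-order estimate includes the \(\Sigma^{-1}\) factor (\(1/\sigma^2\) in 1D) that the Gaussian score-function identity requires. For this \(f\), \(E[f(\mu + w)] = \mu^2 + \sigma^2\), so the exact smoothed gradient is \(2\mu\).

```python
import random


def first_order_estimate(df, mu, sigma, num_samples, rng):
    """Average the exact gradient of f at the perturbed points mu + w_i."""
    total = 0.0
    for _ in range(num_samples):
        w = rng.gauss(0.0, sigma)
        total += df(mu + w)
    return total / num_samples


def zero_order_estimate(f, mu, sigma, num_samples, rng):
    """Use only function values, with f(mu) as a baseline.

    In 1D the Sigma^{-1} factor is 1 / sigma^2.
    """
    baseline = f(mu)
    total = 0.0
    for _ in range(num_samples):
        w = rng.gauss(0.0, sigma)
        total += (f(mu + w) - baseline) * w / sigma**2
    return total / num_samples


def f(x):
    return x * x


def df(x):
    return 2.0 * x


rng = random.Random(0)
mu, sigma, num_samples = 1.0, 0.5, 20000
first_order = first_order_estimate(df, mu, sigma, num_samples, rng)
zero_order = zero_order_estimate(f, mu, sigma, num_samples, rng)
exact = 2.0 * mu  # gradient of mu^2 + sigma^2 with respect to mu
print(f"first-order={first_order:.3f}  zero-order={zero_order:.3f}  "
      f"exact={exact:.3f}")
```

Subtracting the baseline \(f(\mu)\) does not change the expectation, because \(E[w_i] = 0\), but it typically reduces the variance of the zero-order estimate.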

Lessons from stochastic optimization

The two gradient estimates converge to the same quantity under sufficient regularity conditions.
Convergence rate scales directly with the variance of the estimators; the zero-order estimator often has higher variance.

But the regularity conditions aren't met in contact discontinuities, leading to a biased first-order estimator.

Often, but not always.

Example: The Heaviside function

\(\frac{\partial f(x)}{\partial x} = 0\) almost everywhere!

\( \Rightarrow \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu} = 0 \)

First-order estimator is biased

\( \not\approx \frac{\partial}{\partial \mu} E_\mu [f(x)] \)

Zero-order estimator is (still) unbiased
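The bias can be reproduced in a few lines. This is a scalar sketch assuming \(H(0) = 1\) and \(\Sigma = \sigma^2\); the function names are illustrative. The exact gradient of the smoothed objective \(\Phi(\mu/\sigma)\) is the Gaussian density \(\exp(-\mu^2 / 2\sigma^2) / (\sigma \sqrt{2\pi})\), which is strictly positive, while the first-order estimate is identically zero.

```python
import math
import random


def heaviside(x):
    return 1.0 if x >= 0.0 else 0.0


def heaviside_derivative(x):
    # Zero almost everywhere; the jump at x = 0 has probability zero
    # under a Gaussian sample, so it is never observed.
    return 0.0


def estimate_gradients(mu, sigma, num_samples, rng):
    """Return (first-order, zero-order) estimates from the same samples."""
    first_order, zero_order = 0.0, 0.0
    baseline = heaviside(mu)
    for _ in range(num_samples):
        w = rng.gauss(0.0, sigma)
        first_order += heaviside_derivative(mu + w)
        zero_order += (heaviside(mu + w) - baseline) * w / sigma**2
    return first_order / num_samples, zero_order / num_samples


def exact_smoothed_gradient(mu, sigma):
    """d/dmu of Phi(mu / sigma)."""
    return (math.exp(-0.5 * (mu / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi)))


rng = random.Random(0)
mu, sigma = 0.1, 0.5
first_order, zero_order = estimate_gradients(mu, sigma, 20000, rng)
exact = exact_smoothed_gradient(mu, sigma)
print(f"first-order={first_order:.3f}  zero-order={zero_order:.3f}  "
      f"exact={exact:.3f}")
```

Adding more samples does not help the first-order estimate here: every sample contributes exactly zero, so the error is bias rather than variance.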

What about smooth (but stiff) approximations?

Continuous yet stiff approximations look like strict discontinuities in the finite-sample regime.
In the paper, we formalize "empirical bias" to capture this.

First-order estimates can also have high variance

e.g. with stiff contact models (large gradient \(\Rightarrow\) high variance)
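A stiff but smooth step shows the same effect numerically. The sketch below is an illustration under assumptions: it uses a logistic function \(\sigma(kx)\) as a stand-in for a compliant contact model, with the hypothetical parameter `stiffness` playing the role of \(k\). It prints the mean and variance of the per-sample first-order gradients as the stiffness grows.

```python
import math
import random
import statistics


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_step_gradient(x, stiffness):
    """Derivative of the smooth step sigmoid(stiffness * x)."""
    s = sigmoid(stiffness * x)
    return stiffness * s * (1.0 - s)


def per_sample_gradients(mu, sigma, stiffness, num_samples, rng):
    """Exact gradients evaluated at the perturbed points mu + w_i."""
    return [sigmoid_step_gradient(mu + rng.gauss(0.0, sigma), stiffness)
            for _ in range(num_samples)]


rng = random.Random(0)
mu, sigma, num_samples = 0.0, 0.5, 20000
for stiffness in (1.0, 10.0, 100.0):
    grads = per_sample_gradients(mu, sigma, stiffness, num_samples, rng)
    print(f"stiffness={stiffness:6.1f}  mean={statistics.fmean(grads):.3f}  "
          f"variance={statistics.variance(grads):.3f}")
```

As the stiffness increases, the gradient concentrates in a narrow band around the step: most samples see a gradient near zero and a few see a very large one. With a small batch, the estimate is therefore often close to zero, which is the finite-sample behavior that makes a stiff approximation look like a true discontinuity.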

Is stochasticity essential?

Deterministic smoothing - force at a distance

Global Planning for Contact-Rich Manipulation via
Local Smoothing of Quasi-dynamic Contact Models

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake

Available at: https://arxiv.org/abs/2206.10787

Establish equivalence between randomized smoothing and a (deterministic/differentiable) force-at-a-distance contact model.

Keypoints for picking up a plate (in 2D)

Should we expect this to work?

Do we need the over-parameterization of deep policies?
- Is there a comparable story to interpolating solutions in high-dimensional policy space?

"By studying both ES and RL gradient estimators mathematically we can see that ES is an attractive choice especially when the number of time steps in an episode is long, where actions have long-lasting effects, or if no good value function estimates are available."

Trust-region method on a 'Branin' function. From Northwestern University Open Text Book on Process Optimization

Kolter, J. Zico, Zachary Jackowski, and Russ Tedrake. "Design, analysis, and learning control of a fully actuated micro wind turbine." 2012 American Control Conference (ACC). IEEE, 2012.

Schulman, John, et al. "Trust region policy optimization." International conference on machine learning. 2015.

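import torch

from garage.envs import GymEnv
from garage.experiment.deterministic import set_seed
from garage.torch.algos import PPO
from garage.torch.policies import GaussianMLPPolicy
from garage.torch.value_functions import GaussianMLPValueFunction
from garage.trainer import Trainer


def ppo_pendulum(ctxt=None, seed=1):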
    """Train PPO with InvertedDoublePendulum-v2 environment.

    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.

    """
    set_seed(seed)
    env = GymEnv('InvertedDoublePendulum-v2')

    trainer = Trainer(ctxt)

    policy = GaussianMLPPolicy(env.spec,
                               hidden_sizes=[64, 64],
                               hidden_nonlinearity=torch.tanh,
                               output_nonlinearity=None)

    value_function = GaussianMLPValueFunction(env_spec=env.spec,
                                              hidden_sizes=(32, 32),
                                              hidden_nonlinearity=torch.tanh,
                                              output_nonlinearity=None)

    algo = PPO(env_spec=env.spec,
               policy=policy,
               value_function=value_function,
               discount=0.99,
               center_adv=False)

    trainer.setup(algo, env)
    trainer.train(n_epochs=100, batch_size=10000)

A System for General In-Hand Object Re-Orientation
Tao Chen, Jie Xu, Pulkit Agrawal
Conference on Robot Learning (CoRL), 2021 (Best Paper Award)

https://taochenshh.github.io/projects/in-hand-reorientation

“The sheer scope and variation across objects tested with this method, and the range of different policy architectures and approaches tested makes this paper extremely thorough in its analysis of this reorientation task.”

"We use PPO to optimize \(\pi\)."

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

https://spinningup.openai.com/en/latest/algorithms/ppo.html

Lecture 18: Reinforcement Learning (part 1)

By russtedrake

MIT Robotic Manipulation Fall 2020 http://manipulation.csail.mit.edu
