Reinforcement Learning

(Part 1)

MIT 6.4210/2:

Robotic Manipulation

Fall 2022, Lecture 18

Levine*, Finn*, Darrell, Abbeel, JMLR 2016

Last time: Visuomotor policies

OpenAI - Learning Dexterity


  1. Make the simulator
  2. Write cost function
  3. Deep policy gradient

"And then … BC methods started to get good. Really good. So good that our best manipulation system today mostly uses BC, with a sprinkle of Q learning on top to perform high-level action selection. Today, less than 20% of our research investments is on RL, and the research runway for BC-based methods feels more robust."

Ex: OpenAI Gym

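import gym
from gym import error, spaces, utils
from gym.utils import seeding

class FooEnv(gym.Env):
  metadata = {'render.modes': ['human']}

  def __init__(self):
    ...  # define action_space and observation_space
  def step(self, action):
    ...  # return observation, reward, done, info
  def reset(self):
    ...  # return the initial observation
  def render(self, mode='human'):
    ...
  def close(self):
    ...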
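
To make the reset/step contract concrete, here is a minimal sketch of a toy environment that follows the same interface without depending on gym itself. The class name `PointEnv`, its dynamics, and its reward are hypothetical and chosen only for illustration; `step` returns the 4-tuple `(observation, reward, done, info)` used by the older gym API.

```python
import numpy as np


class PointEnv:
    """Toy 1D point: drive the position to zero. (Hypothetical example.)"""

    def __init__(self, time_step=0.1, max_steps=50):
        self.time_step = time_step
        self.max_steps = max_steps
        self.rng = np.random.default_rng(0)
        self.state = np.zeros(1)
        self.steps = 0

    def reset(self):
        # Sample a random initial position in [-1, 1].
        self.state = self.rng.uniform(-1.0, 1.0, size=1)
        self.steps = 0
        return self.state.copy()

    def step(self, action):
        # Actions are velocities, clipped to [-1, 1].
        velocity = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
        self.state = self.state + self.time_step * velocity
        self.steps += 1
        reward = -float(self.state[0] ** 2)
        done = self.steps >= self.max_steps
        return self.state.copy(), reward, done, {}


def rollout(env, policy):
    """Run one episode and return the total reward."""
    observation = env.reset()
    total, done = 0.0, False
    while not done:
        observation, reward, done, _ = env.step(policy(observation))
        total += reward
    return total


env = PointEnv()
total_reward = rollout(env, lambda obs: -obs)  # proportional controller
```

An RL library only ever talks to the environment through `reset` and `step`, which is why wrapping a simulator behind this interface is enough to plug it into an off-the-shelf algorithm.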

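from pydrake.all import DiagramBuilder, Simulator

builder = DiagramBuilder()
# ... add and connect systems here ...
diagram = builder.Build()
simulator = Simulator(diagram)

context = diagram.CreateDefaultContext()

# sensor_output_port and reward_output_port stand for output ports of the diagram.
observation = sensor_output_port.Eval(context)
reward = reward_output_port.Eval(context)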



class DrakeGymEnv(gym.Env):
    """
    DrakeGymEnv provides a gym.Env interface for a Drake System (often a
    Diagram) using a Simulator.
    """

    def __init__(self,
                 simulator: Union[Simulator, Callable[[RandomGenerator],
                                                      Simulator]],
                 time_step: float,
                 reward: Union[Callable[[System, Context], float],
                               OutputPortIndex, str],
                 action_port_id: Union[InputPort, InputPortIndex, str] = None,
                 observation_port_id: Union[OutputPortIndex, str] = None,
                 render_rgb_port_id: Union[OutputPortIndex, str] = None,
                 set_home: Callable[[Simulator, Context], None] = None,
                 hardware: bool = False):
        """
        Args:
            simulator: Either:
                * A Simulator, or
                * A function that produces a (randomized) Simulator.
            time_step: Each call to step() will advance the simulator by
                `time_step` seconds.
            reward: The reward can be specified in one of two
                ways: (1) by passing a callable with the signature
                `value = reward(system, context)` or (2) by passing a scalar
                vector-valued output port of `simulator`'s system.
            action_port_id: The ID of an input port of `simulator`'s system
                compatible with the action_space.  Each Env *must* have an
                action port; passing `None` defaults to using the *first*
                input port.
            action_space: Defines the gym space for the actions.  If
                the action port is vector-valued, then passing `None` defaults
                to a gym.spaces.Box of the correct dimension with bounds at
                negative and positive infinity.  Note: Stable Baselines 3
                strongly encourages normalizing the action_space to [-1, 1].
            observation_port_id: An output port of `simulator`'s system
                compatible with the observation_space. Each Env *must* have
                an observation port (it seems that gym doesn't support empty
                observation spaces / open-loop policies); passing `None`
                defaults to using the *first* output port.
            observation_space: Defines the gym space for the
                observations.  If the observation port is vector-valued, then
                passing `None` defaults to a gym.spaces.Box of the correct
                dimension with bounds at negative and positive infinity.
            render_rgb_port_id: An optional output port of `simulator`'s
                system that returns an `ImageRgba8U`; often the `color_image`
                port of a Drake `RgbdSensor`.  When not `None`, this enables
                the environment `render_mode` `rgb_array`.
            set_home: A function that sets the home state (plant, and/or env.)
                at reset(). The reset state can be specified in one of
                two ways:
                (1) setting a random context using a Drake random_generator
                (e.g. joint.set_random_pose_distribution()),
                (2) passing a function set_home().
            hardware: If True, prevents setting a random context at
                reset() when using random_generator, but still executes
                set_home() if given.

        Notes (using `env` as an instance of this class):
        - You may set simulator/integrator preferences by using `env.simulator`
        - The `done` condition returned by `step()` is always False by
          default.  Use `env.simulator.set_monitor()` to use Drake's monitor
          functionality for specifying termination conditions.
        - You may additionally wish to directly set `env.reward_range` and/or
          `env.spec`.  See the docs for gym.Env for more details.
        """

from manipulation.drake_gym import DrakeGymEnv

OpenAI - Learning Dexterity

"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."

model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log)


# Default network architecture, from stable-baselines
net_arch = [dict(pi=[64, 64], vf=[64, 64])]

Policy Architecture



builder.ExportOutput(inv_dynamics.get_desired_position(), "actions")


builder.ExportOutput(plant.get_state_output_port(), "observations")


Cost Function

angle_from_vertical = (box_state[2] % np.pi) - np.pi / 2
cost = 2 * angle_from_vertical**2  # box angle
cost += 0.1 * box_state[5]**2  # box velocity
effort = actions - finger_state[:2]
cost += 0.1 * effort.dot(effort)  # effort
# finger velocity
cost += 0.1 * finger_state[2:].dot(finger_state[2:])
# Add 10 to make rewards positive (to avoid rewarding simulator
# crashes).
output[0] = 10 - cost
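
The cost terms above can be collected into a single function for testing outside the simulator. This is a sketch under stated assumptions: the state layouts are inferred from the indexing on the slide (not confirmed by it), the effort term is assumed to be the squared norm of `effort`, and the function name `reward` is hypothetical.

```python
import numpy as np


def reward(box_state, finger_state, actions):
    """Return 10 - cost, following the terms on the slide.

    Assumed layouts (inferred from the indexing, not confirmed):
      box_state[2] is the box angle, box_state[5] is the penalized velocity
      finger_state = [q0, q1, v0, v1]
      actions      = desired finger positions, shape (2,)
    """
    box_state = np.asarray(box_state, dtype=float)
    finger_state = np.asarray(finger_state, dtype=float)
    actions = np.asarray(actions, dtype=float)

    angle_from_vertical = (box_state[2] % np.pi) - np.pi / 2
    cost = 2 * angle_from_vertical**2  # box angle
    cost += 0.1 * box_state[5]**2  # box velocity
    effort = actions - finger_state[:2]
    cost += 0.1 * effort.dot(effort)  # effort (assumed squared norm)
    cost += 0.1 * finger_state[2:].dot(finger_state[2:])  # finger velocity
    # Offset keeps rewards positive near the goal.
    return 10 - cost


upright = reward([0, 0, np.pi / 2, 0, 0, 0], [0.1, 0.2, 0, 0], [0.1, 0.2])
flat = reward([0, 0, 0.0, 0, 0, 0], [0.1, 0.2, 0, 0], [0.1, 0.2])
```

With the box at `np.pi / 2` and everything at rest, every cost term vanishes and the reward takes its maximum value of 10; any other box angle scores lower.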


"Domain Randomization"

(Image source: Tobin et al, 2017)

Do Differentiable Simulators Give Better Policy Gradients?

H. J. Terry Suh and Max Simchowitz and Kaiqing Zhang and Russ Tedrake

ICML 2022


Contact dynamics can lead to discontinuous landscapes, but mostly in the corner cases.

Continuity of solutions w.r.t parameters

We have "real" discontinuities at the corner cases





  • making contact w/ a different face
  • transitions to/from contact and no contact

Soft/compliant contact can replace discontinuities with stiff approximations.

Beware "artificial" discontinuities

Non-smooth optimization

\[ \min_x f(x) \]

For gradient descent, discontinuities / non-smoothness can

  • introduce local minima
  • destroy convergence (e.g. \(l_1\)-minimization)

Smoothing discontinuous objectives

  • A natural idea: can we smooth the objective?


  • Probabilistic formulation, for small \(\Sigma\): \[ \min_x f(x) \approx \min_\mu E \left[ f(x) \right], x \sim \mathcal{N}(\mu, \Sigma) \]


  • A low-pass filter in parameter space with a Gaussian kernel.

Example: The Heaviside function

  • Smooth local minima
  • Alleviate flat regions
  • Encode robustness
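
For the Heaviside example, Gaussian smoothing has a closed form: with \(w \sim \mathcal{N}(0, \sigma^2)\), \(E[H(\mu + w)] = \Phi(\mu / \sigma)\), the Gaussian CDF. The sketch below compares a Monte Carlo estimate against that closed form; the helper names and the choice \(\sigma = 0.5\) are illustrative assumptions.

```python
import math
import numpy as np


def heaviside(x):
    return (np.asarray(x) >= 0.0).astype(float)


def smoothed_heaviside_mc(mu, sigma, num_samples, rng):
    # Monte Carlo estimate of E[H(mu + w)], w ~ N(0, sigma^2).
    w = rng.normal(0.0, sigma, size=num_samples)
    return heaviside(mu + w).mean()


def smoothed_heaviside_exact(mu, sigma):
    # Gaussian CDF evaluated at mu / sigma.
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


rng = np.random.default_rng(0)
sigma = 0.5
mus = [-1.0, -0.25, 0.0, 0.25, 1.0]
estimates = [smoothed_heaviside_mc(mu, sigma, 100_000, rng) for mu in mus]
exact = [smoothed_heaviside_exact(mu, sigma) for mu in mus]
```

The smoothed objective is strictly increasing in \(\mu\), so the flat regions on either side of the step now carry gradient information.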

Smoothing with stochasticity

\begin{gathered} \min_\theta f(\theta) \end{gathered}
\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}


Smoothing with stochasticity for Multibody Contact

Relationship to RL Policy Gradient / CMA / MPPI

In reinforcement learning (RL) and "deep" model-predictive control, we add stochasticity via

  • Stochastic policies
  • Random initial conditions
  • "Domain randomization"

then optimize a stochastic optimal control objective (e.g. maximize expected reward)


These can all smooth the optimization landscape.

Do Differentiable Simulators Give Better Policy Gradients?

The answer is subtle; the Heaviside example might shed some light.

\begin{gathered} \min_\theta f(\theta) \end{gathered}
\begin{gathered} \min_\theta E_w\left[ f(\theta, w) \right] \\ w \sim N(0, \Sigma) \end{gathered}


Differentiable simulators give \(\frac{\partial f}{\partial \theta}\), but we want \(\frac{\partial}{\partial \theta} E_w[f(\theta, w)]\).

Randomized smoothing

J. Burke, F. E. Curtis, A. Lewis, M. Overton, and L. Simoes, Gradient Sampling Methods for Nonsmooth Optimization, 02 2020, pp. 201–225.

  • Approximate the smoothed objective via Monte Carlo: \[ E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K f(x_i), \quad x_i \sim \mathcal{N}(\mu, \Sigma) \]
  • First-order gradient estimate \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu}, \quad w_i \sim \mathcal{N}(0, \Sigma) \]


  • Zero-order gradient estimate (aka REINFORCE) \[ \frac{\partial}{\partial \mu} E_\mu \left[ f(x) \right] \approx \frac{1}{K} \sum_{i=1}^K \left[f(\mu + w_i) - f(\mu)\right] \Sigma^{-1} w_i, \quad w_i \sim \mathcal{N}(0, \Sigma) \]
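
As a sanity check on a smooth objective, the two estimators can be compared on \(f(x) = \|x\|^2\) with \(\Sigma = \sigma^2 I\), where the smoothed gradient is exactly \(2\mu\). This is a minimal sketch with illustrative function names; note that the zero-order estimate here includes the \(\Sigma^{-1}\) scaling (division by \(\sigma^2\)), which is needed for it to have the right magnitude.

```python
import numpy as np


def f(x):
    return float(np.dot(x, x))


def grad_f(x):
    return 2.0 * x


def first_order_estimate(mu, sigma, num_samples, rng):
    # Average the analytic gradient at perturbed points.
    w = rng.normal(0.0, sigma, size=(num_samples, mu.size))
    return np.mean([grad_f(mu + w_i) for w_i in w], axis=0)


def zero_order_estimate(mu, sigma, num_samples, rng):
    # Only function values are used; f(mu) acts as a baseline.
    w = rng.normal(0.0, sigma, size=(num_samples, mu.size))
    baseline = f(mu)
    terms = [(f(mu + w_i) - baseline) * w_i / sigma**2 for w_i in w]
    return np.mean(terms, axis=0)


rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = 0.1
first = first_order_estimate(mu, sigma, 20_000, rng)
zero = zero_order_estimate(mu, sigma, 20_000, rng)
```

Both estimates approach \(2\mu\), but for the same number of samples the zero-order estimate is noticeably noisier.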

Lessons from stochastic optimization

  1. The two gradient estimates converge to the same quantity under sufficient regularity conditions.

  2. Convergence rate scales directly with the variance of the estimators; the zero-order estimator often has higher variance.

But the regularity conditions aren't met in contact discontinuities, leading to a biased first-order estimator.

Often, but not always.

Example: The Heaviside function

\(\frac{\partial f(x)}{\partial x} = 0\) almost everywhere!

\( \Rightarrow \frac{1}{K} \sum_{i=1}^K \frac{\partial f(\mu + w_i)}{\partial \mu} = 0 \)

First-order estimator is biased

\( \not\approx  \frac{\partial}{\partial \mu} E_\mu [f(x)]  \)

Zero-order estimator is (still) unbiased
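
The Heaviside case can be checked numerically. In the sketch below (illustrative names, scalar \(\mu\), \(\Sigma = \sigma^2\)), the first-order estimate is identically zero, while the zero-order estimate, scaled by \(1/\sigma^2\), approaches the true smoothed gradient, which is the Gaussian density \(\mathcal{N}(\mu; 0, \sigma^2)\).

```python
import math
import numpy as np


def heaviside(x):
    return (np.asarray(x) >= 0.0).astype(float)


def heaviside_grad(x):
    # The pointwise derivative is zero wherever it is defined.
    return np.zeros_like(np.asarray(x, dtype=float))


def first_order(mu, sigma, w):
    return heaviside_grad(mu + w).mean()


def zero_order(mu, sigma, w):
    return ((heaviside(mu + w) - heaviside(mu)) * w).mean() / sigma**2


def true_smoothed_gradient(mu, sigma):
    # d/dmu of the Gaussian CDF Phi(mu / sigma).
    return math.exp(-0.5 * (mu / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


rng = np.random.default_rng(0)
mu, sigma = 0.1, 0.5
w = rng.normal(0.0, sigma, size=200_000)
fo = first_order(mu, sigma, w)
zo = zero_order(mu, sigma, w)
truth = true_smoothed_gradient(mu, sigma)
```

No number of samples rescues the first-order estimate here: every sampled gradient is zero, so the average is zero regardless of \(K\).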

What about smooth (but stiff) approximations?

  • Continuous yet stiff approximations look like strict discontinuities in the finite-sample regime.
  • In the paper, we formalize "empirical bias" to capture this.

First-order estimates can also have high variance

e.g. with stiff contact models (large gradient \(\Rightarrow\) high variance)
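
Both effects show up in a stiff sigmoid standing in for a compliant contact model. This is a sketch under assumptions: the stiffness of 1000, \(\sigma = 0.5\), and \(K = 10\) samples per estimate are illustrative choices, and the "true" gradient uses the large-stiffness approximation by the Gaussian density at \(\mu\).

```python
import numpy as np


def stiff_step(x, stiffness):
    # Numerically stable sigmoid(stiffness * x).
    return 0.5 * (1.0 + np.tanh(0.5 * stiffness * x))


def stiff_step_grad(x, stiffness):
    s = stiff_step(x, stiffness)
    return stiffness * s * (1.0 - s)


rng = np.random.default_rng(0)
mu, sigma, stiffness = 0.0, 0.5, 1000.0
num_trials, samples_per_trial = 20_000, 10
w = rng.normal(0.0, sigma, size=(num_trials, samples_per_trial))

# Per-sample terms of each estimator.
first_order_terms = stiff_step_grad(mu + w, stiffness)
zero_order_terms = (
    (stiff_step(mu + w, stiffness) - stiff_step(mu, stiffness)) * w / sigma**2
)

# Each row is one K = 10 estimate.
first_order_estimates = first_order_terms.mean(axis=1)
zero_order_estimates = zero_order_terms.mean(axis=1)

# Approximate smoothed gradient for large stiffness: Gaussian density at mu = 0.
true_gradient = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
median_first_order = np.median(first_order_estimates)
```

Most \(K = 10\) first-order estimates are essentially zero because no sample lands in the narrow transition region, while the rare sample that does contributes a very large gradient. The estimator is unbiased in expectation, yet in practice it behaves like the biased Heaviside case and has a much larger per-sample spread than the zero-order terms.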

Is stochasticity essential?

Deterministic smoothing - force at a distance

Global Planning for Contact-Rich Manipulation via
Local Smoothing of Quasi-dynamic Contact Models

Tao Pang, H. J. Terry Suh, Lujie Yang, and Russ Tedrake


Establish equivalence between randomized smoothing and a (deterministic/differentiable) force-at-a-distance contact model.

Keypoints for picking up a plate (in 2D)

Should we expect this to work?

  • Do we need the over-parameterization of deep policies?
    • Is there a comparable story to interpolating solutions in high-dimensional policy space?

"By studying both ES and RL gradient estimators mathematically we can see that ES is an attractive choice especially when the number of time steps in an episode is long, where actions have long-lasting effects, or if no good value function estimates are available."

Trust-region method on a 'Branin' function. From the Northwestern University Open Text Book on Process Optimization.

Kolter, J. Zico, Zachary Jackowski, and Russ Tedrake. "Design, analysis, and learning control of a fully actuated micro wind turbine." 2012 American Control Conference (ACC). IEEE, 2012.

Schulman, John, et al. "Trust region policy optimization." International conference on machine learning. 2015.

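from garage.envs import GymEnv
from garage.torch.algos import PPO
from garage.torch.policies import GaussianMLPPolicy
from garage.torch.value_functions import GaussianMLPValueFunction
from garage.trainer import Trainer


def ppo_pendulum(ctxt=None, seed=1):
    """Train PPO with InvertedDoublePendulum-v2 environment.

    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.
    """
    env = GymEnv('InvertedDoublePendulum-v2')

    trainer = Trainer(ctxt)

    policy = GaussianMLPPolicy(env.spec,
                               hidden_sizes=[64, 64])

    value_function = GaussianMLPValueFunction(env_spec=env.spec,
                                              hidden_sizes=(32, 32))

    # Further PPO arguments (e.g. the sampler) are omitted here.
    algo = PPO(env_spec=env.spec,
               policy=policy,
               value_function=value_function)

    trainer.setup(algo, env)
    trainer.train(n_epochs=100, batch_size=10000)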

A System for General In-Hand Object Re-Orientation
Tao Chen, Jie Xu, Pulkit Agrawal
Conference on Robot Learning (CoRL), 2021 (Best Paper Award)

“The sheer scope and variation across objects tested with this method, and the range of different policy architectures and approaches tested makes this paper extremely thorough in its analysis of this reorientation task.”

"We use PPO to optimize \(\pi\)."

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

Lecture 18: Reinforcement Learning (part 1)

By russtedrake

MIT Robotic Manipulation Fall 2020