Reinforcement Learning

(Part 3)

MIT 6.800/6.843:

Robotic Manipulation

Fall 2021, Lecture 20

Follow live at https://slides.com/d/Tb84VxM/live

(or later at https://slides.com/russtedrake/fall21-lec20)

OpenAI - Learning Dexterity

Recipe:

  1. Make the simulator
  2. Write cost function
  3. Policy gradient (minimal sketch below)
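As a hedged illustration of step 3, here is a minimal vanilla policy-gradient (REINFORCE) sketch; the environment, network sizes, and hyperparameters are placeholders and not the setup used by OpenAI, which trained PPO at large scale with domain randomization.

# Minimal REINFORCE sketch; environment and hyperparameters are illustrative.
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")  # stand-in environment with a discrete action space
policy = nn.Sequential(nn.Linear(env.observation_space.shape[0], 64), nn.Tanh(),
                       nn.Linear(64, env.action_space.n))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):
    obs, log_probs, rewards, done = env.reset(), [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # Discounted returns; the policy gradient is E[grad log pi(a|s) * return].
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + 0.99 * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The following slides turn to PPO, the policy-gradient variant actually used.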

OpenAI - Learning Dexterity

"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."

https://openai.com/blog/openai-baselines-ppo/

A System for General In-Hand Object Re-Orientation
Tao Chen, Jie Xu, Pulkit Agrawal
Conference on Robot Learning (CoRL), 2021 (Best Paper Award)

https://taochenshh.github.io/projects/in-hand-reorientation

“The sheer scope and variation across objects tested with this method, and the range of different policy architectures and approaches tested makes this paper extremely thorough in its analysis of this reorientation task.”

"We use PPO to optimize \(\pi\)."

import gym

from stable_baselines3 import PPO

gym.envs.register(id="BoxFlipUp-v0",
                  entry_point="manipulation.envs.box_flipup:BoxFlipUpEnv")
                  
model = PPO('MlpPolicy', "BoxFlipUp-v0")
model.learn(total_timesteps=100000)

...

# Now animate some rollouts.
env = gym.make("BoxFlipUp-v0", meshcat=meshcat)
obs = env.reset()
for i in range(500):
    # deterministic=True uses the mean of the policy's action distribution.
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()

cost = 2 * angle_from_vertical**2  # box angle
cost += 0.1 * box_state[5]**2      # box velocity
cost += 0.1 * effort.dot(effort)   # effort
cost += 0.1 * finger_state[2:].dot(finger_state[2:]) # finger velocity
reward = 10 - cost  # Add 10 to avoid rewarding simulator crashes.

Some details

  • Plant (BoxFlipUp-v0) uses stiffness control
  • Initial conditions use MultibodyPlant's (MBP's) random distribution
  • Multiprocessing using SubprocVecEnv (sketch below)
  • Could be more efficient (e.g. early termination)

  • Some funny bugs:
    • Cost vs Reward; Please Crash! (with reward = -cost, every step is penalized, so the policy learns to crash the simulator to end the episode early; the +10 offset below keeps staying alive worthwhile)
    • No real tuning here
cost = 2 * angle_from_vertical**2  # box angle
cost += 0.1 * box_state[5]**2      # box velocity
cost += 0.1 * effort.dot(effort)   # effort
cost += 0.1 * finger_state[2:].dot(finger_state[2:]) # finger velocity
reward = 10 - cost  # Add 10 to avoid rewarding simulator crashes.
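A hedged sketch of the SubprocVecEnv detail mentioned above; the worker count is arbitrary, and it assumes the BoxFlipUp-v0 registration from the earlier snippet has already run in this process.

# Parallel rollout collection with Stable-Baselines3's SubprocVecEnv.
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    # On Linux the default "fork" start method lets workers inherit the
    # BoxFlipUp-v0 registration from the parent process.
    return gym.make("BoxFlipUp-v0")

if __name__ == "__main__":  # guard required when worker processes are spawned
    vec_env = SubprocVecEnv([make_env for _ in range(8)])  # 8 parallel simulators
    model = PPO("MlpPolicy", vec_env)
    model.learn(total_timesteps=100000)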

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

https://spinningup.openai.com/en/latest/algorithms/ppo.html
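For reference, the clipped surrogate objective that PPO maximizes, with probability ratio \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)\) and advantage estimate \(\hat{A}_t\):

\[ L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right] \]

The clip term keeps each update from moving the policy too far from the one that collected the data, which is much of what makes PPO easy to tune.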


PPO Learned Critic: Box angle (x) vs box angular velocity (y)
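A sketch of how a plot like this could be produced: sweep a grid over the two state variables and query the learned critic. This reuses the `model` and `env` from the training snippet above, assumes the installed Stable-Baselines3 exposes policy.obs_to_tensor and policy.predict_values (newer releases do), and the observation indices are hypothetical.

# Sketch: visualize the learned PPO value function over (box angle, box angular velocity).
import numpy as np
import torch
import matplotlib.pyplot as plt

ANGLE_INDEX, VELOCITY_INDEX = 2, 5  # placeholders; check BoxFlipUpEnv's observation layout
angles = np.linspace(-np.pi, np.pi, 101)
velocities = np.linspace(-5.0, 5.0, 101)
values = np.zeros((len(velocities), len(angles)))

nominal_obs = env.reset()  # hold the remaining observation entries at a nominal value
for i, qdot in enumerate(velocities):
    for j, q in enumerate(angles):
        obs = np.array(nominal_obs, copy=True)
        obs[ANGLE_INDEX] = q
        obs[VELOCITY_INDEX] = qdot
        obs_tensor, _ = model.policy.obs_to_tensor(obs)
        with torch.no_grad():
            values[i, j] = model.policy.predict_values(obs_tensor).item()

plt.imshow(values, origin="lower", aspect="auto",
           extent=[angles[0], angles[-1], velocities[0], velocities[-1]])
plt.xlabel("box angle")
plt.ylabel("box angular velocity")
plt.colorbar(label="learned value")
plt.show()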


CMA-ES

https://en.wikipedia.org/wiki/CMA-ES
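A small usage sketch with Nikolaus Hansen's `cma` package (pip install cma); the quadratic objective is a stand-in for "roll out the policy with parameters x and return the negative episode reward".

# CMA-ES on a toy objective via the `cma` package's ask/tell interface.
import numpy as np
import cma

def negative_return(x):
    # Stand-in for a policy rollout; replace with -episode_reward(x) in practice.
    return float(np.sum((x - 1.0) ** 2))

x0 = np.zeros(10)   # initial mean of the search distribution
sigma0 = 0.5        # initial step size (standard deviation)
es = cma.CMAEvolutionStrategy(x0, sigma0)
while not es.stop():
    candidates = es.ask()  # sample a population from N(mean, sigma^2 * C)
    es.tell(candidates, [negative_return(x) for x in candidates])
print("best parameters found:", es.result.xbest)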

Keypoints for picking up a plate (in 2D)

"By studying both ES and RL gradient estimators mathematically we can see that ES is an attractive choice especially when the number of time steps in an episode is long, where actions have long-lasting effects, or if no good value function estimates are available."
