Reinforcement Learning

(Part 3)

MIT 6.800/6.843:

Robotic Manipulation

Fall 2021, Lecture 20

Follow live at https://slides.com/d/Tb84VxM/live

(or later at https://slides.com/russtedrake/fall21-lec20)

OpenAI - Learning Dexterity

Recipe:

  1. Make the simulator
  2. Write cost function
  3. Policy gradient (minimal sketch below)
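As a hedged illustration of step 3, here is a minimal vanilla policy-gradient (REINFORCE) sketch; the environment, network sizes, and hyperparameters are placeholders and not the setup used by OpenAI, which trained PPO at large scale with domain randomization.

# Minimal REINFORCE sketch; environment and hyperparameters are illustrative.
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")  # stand-in environment with a discrete action space
policy = nn.Sequential(nn.Linear(env.observation_space.shape[0], 64), nn.Tanh(),
                       nn.Linear(64, env.action_space.n))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(500):
    obs, log_probs, rewards, done = env.reset(), [], [], False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # Discounted returns; the policy gradient is E[grad log pi(a|s) * return].
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + 0.99 * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The following slides turn to PPO, the policy-gradient variant actually used.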

OpenAI - Learning Dexterity

"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."

https://openai.com/blog/openai-baselines-ppo/

A System for General In-Hand Object Re-Orientation
Tao Chen, Jie Xu, Pulkit Agrawal
Conference on Robot Learning (CoRL), 2021 (Best Paper Award)

https://taochenshh.github.io/projects/in-hand-reorientation

“The sheer scope and variation across objects tested with this method, and the range of different policy architectures and approaches tested makes this paper extremely thorough in its analysis of this reorientation task.”

"We use PPO to optimize \(\pi\)."

import gym

from stable_baselines3 import PPO

gym.envs.register(id="BoxFlipUp-v0",
                  entry_point="manipulation.envs.box_flipup:BoxFlipUpEnv")
                  
model = PPO('MlpPolicy', "BoxFlipUp-v0")
model.learn(total_timesteps=100000)

...

# Now animate some rollouts.
env = gym.make("BoxFlipUp-v0", meshcat=meshcat)
obs = env.reset()
for i in range(500):
    # deterministic=True uses the mean of the policy's action distribution.
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()

cost = 2 * angle_from_vertical**2  # box angle
cost += 0.1 * box_state[5]**2      # box velocity
cost += 0.1 * effort.dot(effort)   # effort
cost += 0.1 * finger_state[2:].dot(finger_state[2:]) # finger velocity
reward = 10 - cost  # Add 10 to avoid rewarding simulator crashes.

Some details

  • Plant (BoxFlipUp-v0) uses stiffness control
  • Initial conditions use MultibodyPlant's (MBP's) random distribution
  • Multiprocessing using SubprocVecEnv (sketch below)
  • Could be more efficient (e.g. early termination)

  • Some funny bugs:
    • Cost vs Reward; Please Crash! (with reward = -cost, every step is penalized, so the policy learns to crash the simulator to end the episode early; the +10 offset below keeps staying alive worthwhile)
    • No real tuning here
cost = 2 * angle_from_vertical**2  # box angle
cost += 0.1 * box_state[5]**2      # box velocity
cost += 0.1 * effort.dot(effort)   # effort
cost += 0.1 * finger_state[2:].dot(finger_state[2:]) # finger velocity
reward = 10 - cost  # Add 10 to avoid rewarding simulator crashes.
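A hedged sketch of the SubprocVecEnv detail mentioned above; the worker count is arbitrary, and it assumes the BoxFlipUp-v0 registration from the earlier snippet has already run in this process.

# Parallel rollout collection with Stable-Baselines3's SubprocVecEnv.
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    # On Linux the default "fork" start method lets workers inherit the
    # BoxFlipUp-v0 registration from the parent process.
    return gym.make("BoxFlipUp-v0")

if __name__ == "__main__":  # guard required when worker processes are spawned
    vec_env = SubprocVecEnv([make_env for _ in range(8)])  # 8 parallel simulators
    model = PPO("MlpPolicy", vec_env)
    model.learn(total_timesteps=100000)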

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

https://spinningup.openai.com/en/latest/algorithms/ppo.html
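For reference, the clipped surrogate objective that PPO maximizes, with probability ratio \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)\) and advantage estimate \(\hat{A}_t\):

\[ L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right] \]

The clip term keeps each update from moving the policy too far from the one that collected the data, which is much of what makes PPO easy to tune.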


PPO Learned Critic: Box angle (x) vs box angular velocity (y)
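A sketch of how a plot like this could be produced: sweep a grid over the two state variables and query the learned critic. This reuses the `model` and `env` from the training snippet above, assumes the installed Stable-Baselines3 exposes policy.obs_to_tensor and policy.predict_values (newer releases do), and the observation indices are hypothetical.

# Sketch: visualize the learned PPO value function over (box angle, box angular velocity).
import numpy as np
import torch
import matplotlib.pyplot as plt

ANGLE_INDEX, VELOCITY_INDEX = 2, 5  # placeholders; check BoxFlipUpEnv's observation layout
angles = np.linspace(-np.pi, np.pi, 101)
velocities = np.linspace(-5.0, 5.0, 101)
values = np.zeros((len(velocities), len(angles)))

nominal_obs = env.reset()  # hold the remaining observation entries at a nominal value
for i, qdot in enumerate(velocities):
    for j, q in enumerate(angles):
        obs = np.array(nominal_obs, copy=True)
        obs[ANGLE_INDEX] = q
        obs[VELOCITY_INDEX] = qdot
        obs_tensor, _ = model.policy.obs_to_tensor(obs)
        with torch.no_grad():
            values[i, j] = model.policy.predict_values(obs_tensor).item()

plt.imshow(values, origin="lower", aspect="auto",
           extent=[angles[0], angles[-1], velocities[0], velocities[-1]])
plt.xlabel("box angle")
plt.ylabel("box angular velocity")
plt.colorbar(label="learned value")
plt.show()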


CMA-ES

https://en.wikipedia.org/wiki/CMA-ES
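A small usage sketch with Nikolaus Hansen's `cma` package (pip install cma); the quadratic objective is a stand-in for "roll out the policy with parameters x and return the negative episode reward".

# CMA-ES on a toy objective via the `cma` package's ask/tell interface.
import numpy as np
import cma

def negative_return(x):
    # Stand-in for a policy rollout; replace with -episode_reward(x) in practice.
    return float(np.sum((x - 1.0) ** 2))

x0 = np.zeros(10)   # initial mean of the search distribution
sigma0 = 0.5        # initial step size (standard deviation)
es = cma.CMAEvolutionStrategy(x0, sigma0)
while not es.stop():
    candidates = es.ask()  # sample a population from N(mean, sigma^2 * C)
    es.tell(candidates, [negative_return(x) for x in candidates])
print("best parameters found:", es.result.xbest)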

Keypoints for picking up a plate (in 2D)

"By studying both ES and RL gradient estimators mathematically we can see that ES is an attractive choice especially when the number of time steps in an episode is long, where actions have long-lasting effects, or if no good value function estimates are available."
