Toward “Category-Level” Dexterous Manipulation

Russ Tedrake, Lucas Manuelli, Pete Florence, Wei Gao

 

russt@mit.edu

groups.csail.mit.edu/locomotion

Follow along at https://slides.com/russtedrake/rss2020-visual/live

kPAM: KeyPoint Affordances for Manipulation

Lucas Manuelli*, Wei Gao*, Peter R. Florence and Russ Tedrake. kPAM: KeyPoint Affordances for Category Level Manipulation. ISRR 2019


Problem Statement

Manipulate potentially unknown rigid objects from a category (e.g. mugs, shoes) into desired target configurations.

SE(3) pose is difficult to generalize across a category

So how do we even specify the task?

What's the cost function?

(Images of mugs on the rack?)

3D Keypoints provide rich, class-general semantics

Constraints & Cost on Keypoints

... and robust performance in practice
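As a concrete illustration (not from the talk; the keypoint names, rack-slot location, and numbers are made up), here is a minimal sketch of writing a manipulation target as costs and constraints directly on 3D keypoints, rather than on an SE(3) pose:

```python
import numpy as np

# Hypothetical detected 3D keypoints for a mug, in the world frame.
keypoints = {
    "bottom_center": np.array([0.52, 0.10, 0.02]),
    "top_center":    np.array([0.52, 0.10, 0.12]),
    "handle_center": np.array([0.58, 0.10, 0.07]),
}

def placement_cost(transform, kp):
    """Cost: the transformed bottom keypoint lands on an (assumed) rack slot."""
    target = np.array([0.0, 0.40, 0.0])
    return np.sum((transform(kp["bottom_center"]) - target) ** 2)

def upright_constraint(transform, kp, tol=0.05):
    """Constraint: the bottom->top axis stays near vertical after the motion."""
    axis = transform(kp["top_center"]) - transform(kp["bottom_center"])
    axis = axis / np.linalg.norm(axis)
    return axis @ np.array([0.0, 0.0, 1.0]) >= 1.0 - tol

# e.g., evaluate both under the identity transform:
identity = lambda p: p
print(placement_cost(identity, keypoints), upright_constraint(identity, keypoints))
```

Because the cost and constraints only reference keypoints, the same specification applies to every instance in the category.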

Inverse kinematics as an optimization

\min_q \| q - q_{desired} \|

subject to:

  • rich end-effector constraints
  • joint limits
  • collision avoidance
  • "gaze constraints"
  • "feet stay put"
  • balance (center of mass)
  • ...
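A minimal, self-contained sketch of the formulation above, using a planar two-link arm and scipy (an illustration, not the talk's solver; link lengths, targets, and limits are made up):

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

L1, L2 = 1.0, 1.0                     # link lengths (assumed)
q_desired = np.array([0.3, 0.3])      # nominal posture
p_target = np.array([1.2, 0.8])       # desired end-effector position

def fk(q):
    """Forward kinematics: joint angles -> end-effector (x, y)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

# min_q ||q - q_desired||^2, subject to an end-effector constraint and
# joint limits (stand-ins for the richer constraints on the slide).
ee_constraint = NonlinearConstraint(lambda q: fk(q) - p_target, 0.0, 0.0)
result = minimize(lambda q: np.sum((q - q_desired) ** 2),
                  x0=q_desired,
                  bounds=[(-np.pi, np.pi)] * 2,
                  constraints=[ee_constraint])
print("q* =", result.x, " end-effector:", fk(result.x))
```

Keypoint targets from the previous slide slot in naturally as additional end-effector constraints in this optimization.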

Keypoints are not a sufficient representation

Relative to Sonia's approach, this separates the semantics from the dense geometry

kPAM pipeline

No template model or object pose appears in this pipeline.

(Pipeline diagram: RGBD image w/ instance segmentation → 3D keypoint detection network → grasp planner → inverse kinematics planner; training keypoints labeled with a custom annotation tool.)

Keypoint network

Architecture based on Sun, Xiao, et al. "Integral human pose regression." ECCV, 2018
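The key idea in integral regression is a soft-argmax: take the spatial expectation over a predicted heatmap, so the keypoint coordinate stays differentiable. A small PyTorch sketch (shapes and names are illustrative, not the kPAM implementation):

```python
import torch

def integral_keypoints(heatmaps):
    """heatmaps: (B, K, H, W) raw network outputs -> (B, K, 2) xy coords."""
    B, K, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.view(B, K, -1), dim=-1).view(B, K, H, W)
    xs = torch.arange(W, dtype=probs.dtype)
    ys = torch.arange(H, dtype=probs.dtype)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # expectation over rows
    return torch.stack([x, y], dim=-1)
```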

Custom annotation tool

Sample of results

(shoes on rack)

# train objects: 10
# test objects: 20
# trials: 100
placed on shelf: 98%
heel error (cm): 1.09 ± 1.29
toe error (cm): 4.34 ± 3.05

+ shape completion network (kPAM-SC) to include collision-avoidance constraints

So far, keypoints are geometric and semantic (mug handle, front toe of shoe), but require human labels.

If we forgo semantics, can we self-supervise?


Dense Object Nets

Core technology: dense correspondences

(built on Schmidt, Newcombe, Fox, RA-L 2017)

Peter R. Florence*, Lucas Manuelli*, and Russ Tedrake. Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation. CoRL, 2018.

Dense Object Nets

dense 3D reconstruction

+ pixelwise contrastive loss

New loss function sharpens correspondences
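In spirit (simplified, with illustrative tensor names, not the exact loss from the paper), the pixelwise contrastive loss pulls descriptors at corresponding pixels together and pushes non-corresponding ones apart up to a margin:

```python
import torch

def pixelwise_contrastive_loss(d_a, d_b, d_a_neg, d_b_neg, margin=0.5):
    """d_a, d_b: (N, D) descriptors at matched pixels across two views;
    d_a_neg, d_b_neg: (M, D) descriptors at non-matching pixels."""
    # matches: pull corresponding descriptors together
    match_loss = ((d_a - d_b) ** 2).sum(dim=-1).mean()
    # non-matches: hinge loss pushes descriptors apart up to the margin
    dist = ((d_a_neg - d_b_neg) ** 2).sum(dim=-1).sqrt()
    nonmatch_loss = torch.clamp(margin - dist, min=0).pow(2).mean()
    return match_loss + nonmatch_loss
```

The dense 3D reconstruction supplies the pixel correspondences across views, so no human labeling is needed.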

Now 3D correspondences, trained with multi-view data

Dense descriptors as self-supervised keypoints
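Once trained, a descriptor acts as a keypoint via nearest-neighbor lookup in descriptor space. A sketch (shapes assumed):

```python
import torch

def locate_keypoint(ref_descriptor, descriptor_image):
    """ref_descriptor: (D,) descriptor from a reference pixel;
    descriptor_image: (D, H, W) dense descriptors for a new image.
    Returns the (row, col) of the best-matching pixel."""
    D, H, W = descriptor_image.shape
    diffs = descriptor_image - ref_descriptor.view(D, 1, 1)
    dist = diffs.pow(2).sum(dim=0)          # (H, W) squared distances
    idx = torch.argmin(dist.view(-1))
    return divmod(idx.item(), W)
```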

But without semantics, how do we specify the task?

Dense Object Nets in Visuomotor Imitation Learning

Peter R. Florence, Lucas Manuelli, and Russ Tedrake. Self-Supervised Correspondence in Visuomotor Policy Learning. RA-L, April 2020

Visuomotor policies

Levine*, Finn*, Darrell, Abbeel, JMLR 2016

Training the visual front-end (prior work)

  • Pose-based auxiliary loss: estimate object/hand pose (but hard for category-level)
  • Auto-encoding
  • End-to-end

Idea: use a small set of dense descriptors

Imitation learning setup

with demonstrations from hand-coded policies in sim and teleop on the real robot

Standard "behavior-cloning" object + data augmentation

Simulation experiments

"push box"

"flip box"

Policy is a small LSTM network (~100 LSTM units)
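A sketch of that policy structure (sizes are assumptions, not the paper's hyperparameters): descriptor-keypoint observations in, actions out.

```python
import torch
import torch.nn as nn

class KeypointLSTMPolicy(nn.Module):
    def __init__(self, n_keypoints=16, action_dim=6, hidden=100):
        super().__init__()
        # observation: 3D position of each tracked descriptor keypoint
        self.lstm = nn.LSTM(input_size=3 * n_keypoints,
                            hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, obs_seq, state=None):
        """obs_seq: (B, T, 3*n_keypoints) -> actions: (B, T, action_dim)."""
        h, state = self.lstm(obs_seq, state)
        return self.head(h), state
```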

Fewer demonstrations? Better generalization?

Dense Object Nets for Model-based Control

Late-breaking results:

Learn descriptor keypoint dynamics + trajectory MPC
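One way to read that recipe (an assumed structure, not the actual late-breaking implementation): roll out learned keypoint dynamics z' = f(z, u) inside a simple shooting-based MPC.

```python
import numpy as np

def mpc(f, z0, z_goal, horizon=10, n_samples=256, action_dim=2, rng=None):
    """Random-shooting MPC: return the first action of the best rollout."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_cost, best_u0 = np.inf, None
    for _ in range(n_samples):
        u_seq = rng.normal(size=(horizon, action_dim))
        z, cost = z0, 0.0
        for u in u_seq:
            z = f(z, u)                       # learned keypoint dynamics
            cost += np.sum((z - z_goal) ** 2)
        if cost < best_cost:
            best_cost, best_u0 = cost, u_seq[0]
    return best_u0

# Toy stand-in "learned" dynamics: the keypoint moves with the action.
u0 = mpc(lambda z, u: z + 0.1 * u, z0=np.zeros(2), z_goal=np.ones(2))
print(u0)
```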


For all the details, check out

Lucas' PhD defense next Monday.

 

July 20, 2020 02:00PM Eastern

(ping Lucas for zoom info, or watch youtube just after)

Summary

Wanted: State representations for "category-level" dexterous manipulation

  • Keypoints can represent category-level geometry
    • Combine with local sensing / dense reconstruction
    • Now trying dynamics.
  • For task specification
    • Annotated keypoints with semantics
    • Self-supervised descriptors with demonstrations
  • Keypoints aren't everything. They won't tell you when the eggs are done cooking!

 

This fall: my manipulation class at MIT will be online.

Toward "Category-Level" Dexterous Manipulation

By russtedrake

Toward "Category-Level" Dexterous Manipulation

A presentation at the RSS 2020 workshop on "Visual Learning and Reasoning for Robotic Manipulation" https://sites.google.com/view/rss20vlrrm
