Few Remarks on Benchmarks

Roberto Calandra

Facebook AI Research

Workshop in Benchmarking Robotics - 13 August 2019

Two Meta-questions about Benchmarks

What is a benchmark testing? (i.e., are these the right metrics?)
- Nature is not single-objective
- Solving a specific task vs designing general purpose systems

Is this benchmark representative of real-world problems?
- Are potential advances useful in the real-world?
- If using simulations, is the result indicative?

In robotics there is a tension between creating systems that just work and advancing scientific understanding
Do we care about being able to pick the same object over and over?
(e.g., industrial application)
Or do we care about a system that can adapt to different tasks (potentially unknown at training time)?
In System Identification (and kids), we do not know the task beforehand. Can we still learn something useful?

We should not decouple software from hardware!

How do we also evaluate the importance of the hardware?
(Not abstracting it away, but reasoning about it)

Benchmarks should consider both!

One way to do so is to evaluate approaches on multiple robots
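
As a rough illustration of evaluating on multiple robots, the sketch below summarizes each method by its mean, worst-case, and spread of scores across robots. The function name, the method and robot names, and all numbers are hypothetical placeholders, not results or tooling from the talk.

```python
from statistics import mean


def summarize_across_robots(scores):
    """Summarize per-method scores over several robots.

    scores: dict mapping method name -> dict mapping robot name -> score
            (e.g., a success rate in [0, 1]). All names are illustrative.
    """
    summary = {}
    for method, per_robot in scores.items():
        values = list(per_robot.values())
        summary[method] = {
            "mean": mean(values),                  # average over robots
            "worst": min(values),                  # weakest robot
            "spread": max(values) - min(values),   # hardware sensitivity
        }
    return summary


# Made-up numbers, for illustration only.
example_scores = {
    "method_a": {"robot_1": 0.9, "robot_2": 0.5},
    "method_b": {"robot_1": 0.75, "robot_2": 0.75},
}
summary = summarize_across_robots(example_scores)
```

A large spread suggests that a method's result depends strongly on the hardware, which is exactly the coupling a single-robot benchmark would hide.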

What are we really benchmarking? (Perception vs learning vs controller vs hardware)
Agree with Jan's point about "mine is better than yours"
Reproductions and negative results MUST be worth publishing (see Physics)
Many of the current learning benchmarks (e.g., OpenAI) are atrocious
(clearly not well designed as meaningful benchmarks)
Benchmarks should not be proprietary (e.g., MuJoCo)

Let a robot "play" in an environment for a long time (e.g., 3 months) without any goal
Now bring in humans, and the robot has to reproduce any skill that the humans demonstrate
The humans win if the robot cannot reproduce the shown skill
(Generative Adversarial Human)
How long does it take the humans to win?
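
One possible way to score this protocol, assuming "how long" is counted in demonstrations rather than wall-clock time, is the index of the first demonstrated skill the robot fails to reproduce. The function name and the boolean outcome encoding are assumptions made for illustration.

```python
def demonstrations_until_human_wins(outcomes):
    """Return the 1-based index of the first failed reproduction.

    outcomes: ordered iterable of booleans, True if the robot reproduced
              the demonstrated skill, False otherwise (assumed encoding).
    Returns None if the robot never failed, i.e., the humans did not win.
    """
    for attempt, reproduced in enumerate(outcomes, start=1):
        if not reproduced:
            return attempt
    return None
```

A larger value means the humans needed more demonstrations to find a skill the robot could not reproduce.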

Full Professor at TU Dresden. Head of the LASR Lab. Working in AI, Robotics and Touch Sensing.