Few Remarks on Benchmarks

Roberto Calandra

Facebook AI Research

Workshop in Benchmarking Robotics - 13 August 2019

Two Meta-questions about Benchmarks

  • What is a benchmark testing? (i.e., are these the right metrics?)
    • Nature is not single-objective
    • Solving a specific task vs designing general purpose systems


  • Is this benchmark representative of real-world problems?
    • Are potential advances useful in the real world?
    • If using simulations, is the result indicative of real-world performance?

Solving a specific task vs designing general purpose systems

  • In robotics there is a tension between creating systems that just work and advancing scientific understanding
  • Do we care about being able to pick the same object over and over?
    (e.g., industrial application)
  • Or do we care about a system that can adapt to different tasks (potentially unknown at training time)?
  • In System Identification (and kids), we do not know the task beforehand. Can we still learn something useful?

Benchmarks on Real Robots are Hard...

  • Not everyone has the same setting (robot, sensors, etc.)
  • Running real-world experiments is time-consuming and expensive
  • Can we just use simulation?


We should not decouple software from hardware!

How do we evaluate the importance of the hardware too?
(Not abstracting it away, but reasoning about it)

Formalizing the Hardware

Benchmarks should consider both!

One way to do so is to evaluate approaches on multiple robots
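
A minimal sketch of what reporting results across multiple robots could look like. The function name, the robot and approach labels, and all numbers are hypothetical and only illustrate the idea; the choice of mean and worst case as summary statistics is an assumption, not something prescribed in the talk.

```python
from statistics import mean


def summarize_across_robots(scores):
    """Summarize each approach over all robots it was evaluated on.

    scores: {approach_name: {robot_name: success_rate}} (hypothetical layout).
    Returns {approach_name: {"mean": ..., "worst": ...}}.
    """
    summary = {}
    for approach, per_robot in scores.items():
        values = list(per_robot.values())
        summary[approach] = {"mean": mean(values), "worst": min(values)}
    return summary


# Made-up success rates, for illustration only.
scores = {
    "approach_a": {"robot_1": 1.0, "robot_2": 0.5},
    "approach_b": {"robot_1": 0.75, "robot_2": 0.75},
}
summary = summarize_across_robots(scores)
```

In this made-up example both approaches have the same mean, but the worst-case column shows that one of them depends strongly on the hardware it runs on, which a single-robot benchmark would hide.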

Final Remarks

  • What are we really benchmarking? (Perception vs learning vs controller vs hardware)
  • Agree with Jan's point about "mine is better than yours"
  • Reproductions and negative results MUST be valued (see Physics)
  • Many of the current learning benchmarks (e.g., OpenAI) are atrocious
    (clearly not well designed as meaningful benchmarks)
  • Benchmarks should not be proprietary (e.g., MuJoCo)

Mimic Benchmark

  • Let a Robot "play" in an environment for a long time (e.g., 3 Months) without any goal
  • Now bring in humans, and the robot has to reproduce any skill that the humans demonstrate
  • The humans win if the robot cannot reproduce the shown skill
    (Generative Adversarial Human)
  • How long does it take the humans to win?
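
The scoring rule above could be sketched as follows. This assumes that "how long" is measured in number of demonstrated skills rather than wall-clock time; the function name, the `can_reproduce` callback, and the skill labels are hypothetical stand-ins for a real robot trial.

```python
def demonstrations_until_human_wins(skills, can_reproduce):
    """Count how many skills the robot reproduces before its first failure.

    skills: iterable of skills demonstrated by the humans, in order.
    can_reproduce: callable returning True if the robot reproduces a skill
        (hypothetical stand-in for running the robot and judging the result).
    Returns the number of successful reproductions before the humans win,
    or None if the robot never fails on the given skills.
    """
    for count, skill in enumerate(skills):
        if not can_reproduce(skill):
            return count
    return None


# Toy stand-in: the robot "knows" only the skills it happened to discover during play.
known_skills = {"push", "grasp"}
demonstrated = ["push", "grasp", "pour", "stack"]
result = demonstrations_until_human_wins(demonstrated, lambda s: s in known_skills)
```

A larger count means the humans needed more demonstrations to find a skill the robot could not reproduce.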
