Standard "behavior-cloning" objective + data augmentation
Simulation experiments
"push box"
"flip box"
Policy is a small LSTM network (~100 LSTM units)
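The behavior-cloning objective with data augmentation can be sketched as plain supervised regression onto expert actions. This is only an illustration under stated assumptions: a linear policy stands in for the LSTM to keep the example short, Gaussian observation noise is an assumed augmentation (the notes do not say which one was used), and all function names are hypothetical.

```python
import random


def augment(observations, actions, noise_std, copies, rng):
    """Return the original pairs plus noisy copies of each observation.

    Assumed augmentation: Gaussian noise on observations, expert labels unchanged.
    """
    aug_obs, aug_act = list(observations), list(actions)
    for obs, act in zip(observations, actions):
        for _ in range(copies):
            aug_obs.append([x + rng.gauss(0.0, noise_std) for x in obs])
            aug_act.append(act)
    return aug_obs, aug_act


def predict(weights, obs):
    """Linear stand-in policy: action = w . obs (the notes use an LSTM)."""
    return sum(w * x for w, x in zip(weights, obs))


def bc_loss(weights, observations, actions):
    """Behavior-cloning loss: mean squared error to the expert actions."""
    n = len(actions)
    return sum((predict(weights, o) - a) ** 2
               for o, a in zip(observations, actions)) / n


def train_bc(observations, actions, lr=0.1, steps=500):
    """Minimize the BC loss with full-batch gradient descent."""
    weights = [0.0] * len(observations[0])
    n = len(actions)
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for o, a in zip(observations, actions):
            err = predict(weights, o) - a
            for i, x in enumerate(o):
                grads[i] += 2.0 * err * x / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical expert: action = 2*x0 - 1*x1 on random 2-D observations.
    obs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
    act = [2.0 * o[0] - 1.0 * o[1] for o in obs]
    aug_obs, aug_act = augment(obs, act, noise_std=0.05, copies=2, rng=rng)
    weights = train_bc(aug_obs, aug_act)
    print("learned weights:", weights)
    print("BC loss:", bc_loss(weights, aug_obs, aug_act))
```

The point of the sketch is that nothing in the training loop involves rewards or environment rollouts: the policy is fit to demonstration data alone, and the augmentation only widens the set of observations around each demonstrated state.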
"And then … BC methods started to get good. Really good. So good that our best manipulation system today mostly uses BC, with a sprinkle of Q learning on top to perform high-level action selection. Today, less than 20% of our research investments is on RL, and the research runway for BC-based methods feels more robust."