Lecture 11:

Reinforcement Learning

on Multiple Tasks

Artyom Sorokin |22 Apr

Montezuma's Revenge, again

Let's look again at Montezuma's Revenge, everybody loves it!

It is pretty obvious what to do in this game...

for humans

but not for RL Agents

Montezuma's Revenge, again

  • We know what to do because we understand what these sprites mean!
  • Key: we know it opens doors!
  • Ladders: we know we can climb them!
  • Skull: we don’t know what it does, but we know it can’t be good!
  • Prior understanding of problem structure can help us solve complex tasks quickly!

Platformers without Priors

Can RL use the same prior knowledge?

  • If we’ve solved prior tasks, we might acquire useful knowledge for solving a new task
  • How is the knowledge stored?
    • Q-function: tells us which actions or states are good
    • Policy: tells us which actions are potentially useful
      • some actions are never useful!  
    • Models: what are the laws of physics that govern the world?
    • Features/hidden states: provide us with a good representation
      • Don’t underestimate this!

Representation bottleneck

Transferring knowledge between tasks in RL

Main Idea:

Use experience from one set of tasks for faster learning and/or better performance on new tasks!

In RL, task = MDP!

[Diagram: Source Task → Target Task]

Transferring knowledge between tasks in RL

Transfer Learning:

  • Learning: First train on Source Tasks, then learn Target Tasks faster/better!
  • Goal: Learn the Target Task faster or better

[Diagram: train on the Source Task, then train on the Target Task; evaluate on the Target Task.]

Transferring knowledge between tasks in RL

Multi-Task Learning:

  • Learning: Train on multiple Tasks
  • Goal: One Agent that can solve all these tasks

[Diagram: train on all tasks in the Source/Target Domains, then evaluate on those same tasks.]

Transferring knowledge between tasks in RL

Meta Learning:

  • Training: Learn to learn on multiple tasks
  • Goal: Agent that adapts quickly to new tasks

[Diagram: train on the Source Tasks, then sample new tasks and evaluate on them.]

Typically the agent doesn't know which task it is learning!

Try to learn as fast as possible:

  • 1-shot
  • 2-shot
  • few-shot

Transferring knowledge between tasks in RL

Lifelong/Continual Learning:

  • Training: Learn first task --> then second task --> then third --> ....
  • Goal: Perform well on all tasks! Learn new tasks faster!

[Diagram: train on task 1, then task 2, then task 3, and so on; evaluate on all of these tasks.]

Retraining on old tasks is cheating!

Transferring knowledge between tasks in RL

Main Idea:

Use experience from one set of tasks for faster learning and/or better performance on new tasks!

Transfer Learning: 

  • Learning: First train on Source Tasks, then learn Target Tasks faster/better!
  • Goal: Best performance on the Target Task

Multi-Task Learning:

  • Learning: Train on several Tasks simultaneously
  • Goal: One Agent that can solve all tasks

Meta-Learning:

  • Learning: Train on a set of tasks without knowing which task it is
  • Goal: Adapt quickly and achieve good performance on new tasks

Transfer in Supervised Learning

 

Pretraining + Finetuning:

The most popular transfer learning method in (supervised) deep learning!
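A minimal PyTorch sketch of the recipe (the network sizes, heads, and learning rates are illustrative assumptions, not from the slides): pretrain a backbone on the source task, then reuse its weights for the target task, typically with a fresh head and a smaller learning rate for the pretrained layers.

```python
import torch
import torch.nn as nn

# Hypothetical backbone and task heads; dimensions are placeholders.
backbone = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
source_head = nn.Linear(256, 10)   # head for the source task
target_head = nn.Linear(256, 4)    # fresh head for the target task

def pretrain_step(x, y, opt, loss_fn=nn.CrossEntropyLoss()):
    # Standard supervised step on the source task.
    opt.zero_grad()
    loss = loss_fn(source_head(backbone(x)), y)
    loss.backward()
    opt.step()
    return loss.item()

# Finetuning: keep the pretrained backbone, train a new head,
# optionally with a smaller learning rate for the backbone.
finetune_opt = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},    # gentle updates
    {"params": target_head.parameters(), "lr": 1e-3}  # new head learns fast
])

def finetune_step(x, y, loss_fn=nn.CrossEntropyLoss()):
    finetune_opt.zero_grad()
    loss = loss_fn(target_head(backbone(x)), y)
    loss.backward()
    finetune_opt.step()
    return loss.item()
```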

 

Finetuning: Problems in RL Setting

 

  • Domain shift: representations learned in the source domain might not work well in the target domain

 

  • Difference in the MDP: some things that are possible to do in the source domain are not possible to do in the target domain

 

  • Finetuning issues: The finetuning process may still need to explore, but the optimal policy found during pretraining may be deterministic!

Fighting Domain shift in CV

Invariance assumption: everything that is different between domains is irrelevant

[Diagram: domain-adversarial training. A shared encoder (the same network for both domains) maps inputs from the source domain ("train here") and the target domain ("do well here") to features \(z\). The features feed a task head trained with the Task Loss, and a domain classifier \(D_{\phi}(z)\) trained with a CE-Loss for domain classification. The gradients flowing from \(D_{\phi}(z)\) into the encoder are multiplied by \(-\lambda\), i.e. \(z\) is trained to maximize the CE-Loss.]
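A minimal PyTorch sketch of the gradient-reversal trick from the diagram above (class and variable names are my own, and \(\lambda\) is a hyperparameter): the forward pass is the identity, while the backward pass multiplies the incoming gradient by \(-\lambda\), so the encoder producing \(z\) is pushed to maximize the domain-classification CE loss while the domain classifier minimizes it.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, z, lam):
        ctx.lam = lam
        return z.view_as(z)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

def grad_reverse(z, lam=1.0):
    return GradReverse.apply(z, lam)

# Usage inside a training step (encoder, task_head, domain_clf are assumed nn.Modules):
#   z = encoder(x)
#   task_loss = task_loss_fn(task_head(z), y)                  # ordinary task loss
#   dom_loss  = ce_loss(domain_clf(grad_reverse(z, lam)), d)   # CE loss for domain classification
#   (task_loss + dom_loss).backward()
# The domain classifier minimizes dom_loss, while the reversed gradient
# pushes the encoder to make the domains indistinguishable.
```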

Domain Adaptation in RL

 

Invariance assumption: everything that is different between domains is irrelevant

Transfer when Dynamics Are Different

 

Why is invariance not enough when the dynamics don’t match?

Off-Dynamics RL: Results

 

Finetuning issues

 

  • RL tasks are generally much less diverse
    • Features are less general
    • Policies & value functions become overly specialized

 

  • Optimal policies in fully observed MDPs are deterministic
    • Loss of exploration at convergence
    • Low-entropy policies adapt very slowly to new settings

Pretraining with Maximum Entropy RL

 

Forward Transfer with Randomization

 

What if we can manipulate the source domain?

  • So far: source domain (e.g., empty room) and target domain (e.g., corridor) are fixed
  • What if we can design the source domain, and we have a difficult target domain?
    • Often the case for simulation-to-real-world transfer (see the sketch below)
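A hedged sketch of domain randomization with a gym-style simulator (the simulator interface, parameter names, and ranges are hypothetical): resample physical parameters at the start of every episode, so the policy must work across the whole parameter distribution and, hopefully, also in the real world.

```python
import numpy as np

class RandomizedPhysicsEnv:
    """Wraps a simulator (assumed interface) and resamples physics every episode."""
    def __init__(self, sim):
        self.sim = sim  # hypothetical simulator with settable physical parameters

    def reset(self):
        # Resample physical parameters from broad ranges (illustrative values).
        self.sim.set_parameters(
            mass=np.random.uniform(0.5, 2.0),
            friction=np.random.uniform(0.2, 1.2),
            motor_gain=np.random.uniform(0.8, 1.2),
            latency_steps=np.random.randint(0, 3),
        )
        return self.sim.reset()

    def step(self, action):
        return self.sim.step(action)
```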

Randomizing Physical Parameters

 

Preparing for the unknown: Parameter Identification 

 

Looks like Meta-Learning to me...

Another Example: CAD2RL

 

Transfer for different goals:

Assumption:

The dynamics \(p(s_{t+1}|s_t, a_t)\) are the same in both domains, but the reward function is different

 

Common examples:

  • Autonomous car learns how to drive to a few destinations, and then has to navigate to a new one
  • A kitchen robot learns to cook many different recipes, and then has to cook a new one in the same kitchen

Model Transfer

Model: very simple to transfer, since the model is already (in principle) independent of the reward

You can also transfer contextual policies, i.e. \(p(a|s, g_i)\)!
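A minimal sketch of a contextual (goal-conditioned) policy in PyTorch, assuming the goal can be encoded as a vector: the only change compared to an ordinary policy network is that the goal \(g\) is concatenated to the state before the forward pass.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): the goal is just an extra input to the network."""
    def __init__(self, state_dim, goal_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, goal):
        # Concatenate state and goal; output logits over actions.
        return self.net(torch.cat([state, goal], dim=-1))
```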

Adding Multi-Tasking for Better Learning

Sparse Reward setting: 

  • Reward only for reaching the goal state

Problem: RL learns nothing from failed attempts!

Adding Multi-Tasking for Better Learning

But humans can learn in a similar setting:

Adding Multi-Tasking for Better Learning

We can interpret all outcomes of the agent's actions as goals:

Adding virtual goals creates a multi-task setting and enriches the learning signal!

Hindsight Experience Replay

Main Idea: substitute achieved results as desired goals

 

HER main components:

  • Goal Conditioned Policies and Value Functions
  • Any off-policy RL algorithm: DDPG, DQN, etc.
  • A method for virtual target selection
  • A special replay buffer with goal substitution (see the relabeling sketch below)
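A minimal sketch of the goal-substitution step (the transition format and the "future" relabeling strategy are assumptions; HER also has "final" and "episode" variants): after an episode ends, each transition is stored once with its original goal and several more times with goals that were actually achieved later in the episode, with the reward recomputed for the substituted goal.

```python
import numpy as np

def her_relabel(episode, reward_fn, k_future=4):
    """episode: list of dicts with keys s, a, s_next, achieved_goal, goal.
    reward_fn(achieved_goal, goal) -> reward (e.g. 0 if reached, -1 otherwise)."""
    relabeled = []
    T = len(episode)
    for t, tr in enumerate(episode):
        # Keep the original transition with its original goal.
        relabeled.append({**tr, "r": reward_fn(tr["achieved_goal"], tr["goal"])})
        # "future" strategy: substitute goals achieved later in the same episode.
        future_idx = np.random.randint(t, T, size=k_future)
        for idx in future_idx:
            virtual_goal = episode[idx]["achieved_goal"]
            relabeled.append({**tr,
                              "goal": virtual_goal,
                              "r": reward_fn(tr["achieved_goal"], virtual_goal)})
    return relabeled  # push these into any off-policy replay buffer (DQN, DDPG, ...)
```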

 

Hindsight Experience Replay

Value Transfer: Successor Representations

Multi-Task Reinforcement Learning

 

Can we learn faster by learning multiple tasks?

Multi-task learning can:

  • Accelerate learning of all tasks that are learned together
  • Provide better pretraining for downstream tasks

Sounds familiar... Domain Randomization?

Can we solve multiple tasks at once?

 

Multi-task RL corresponds to single-task RL in a joint MDP:
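One common way to build such a joint MDP in code (a sketch assuming a gym-style, old 4-tuple step API): sample a task at the start of every episode and append a task indicator to the observation, then run any single-task RL algorithm on the wrapper.

```python
import numpy as np

class JointTaskEnv:
    """Turns a list of MDPs into one MDP: the task id becomes part of the state."""
    def __init__(self, envs):
        self.envs = envs
        self.task_id = 0

    def _augment(self, obs):
        # Append a one-hot task indicator to the observation.
        one_hot = np.zeros(len(self.envs), dtype=np.float32)
        one_hot[self.task_id] = 1.0
        return np.concatenate([obs, one_hot])

    def reset(self):
        # New episode = (possibly) new task, sampled from the task distribution.
        self.task_id = np.random.randint(len(self.envs))
        return self._augment(self.envs[self.task_id].reset())

    def step(self, action):
        obs, reward, done, info = self.envs[self.task_id].step(action)
        return self._augment(obs), reward, done, info
```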

Can we solve multiple tasks at once?

 

  • Gradient interference: becoming better on one task can make you worse on another

 

  • Winner-take-all problem: imagine one task starts getting good: the algorithm is likely to prioritize that task (to increase average expected reward) at the expense of others

 

  • In practice, this kind of multi-task learning is very challenging unless you have a lot of data and computation (see GATO)

Policy Distillation

 

This solution doesn't speed up learning as it doesn't transfer anything...

Idea: Learn with RL, transfer with SL
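A minimal sketch of the supervised transfer step (names and the temperature are assumptions): task-specific teachers are trained with RL, and a single student is trained to match the teachers' action distributions with a KL loss on states collected by the teachers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over action distributions."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Training loop sketch (student/teacher networks are assumed):
#   for states sampled from task i's data:
#       loss = distillation_loss(student(states, task_i), teacher_i(states))
#       loss.backward(); optimizer.step()
```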

Policy Transfer in MT:

Divide And Conquer

Divide and Conquer Reinforcement Learning Algorithm sketch:


Policy Distillation on Steroids: GATO

Policy Distillation on Steroids: GATO

Secrets to success:

  • Generate a lot of data
  • Use a huge Transformer model
  • Use a lot of computing resources


Policy Transfer in MT:

Reusing Policy Snippets

Representation Transfer in MT:

Progressive Networks

Finetuning allows us to transfer representations from task 1 to task 2

But what if you want to learn task 3 next, and then task 4...

This is actually Lifelong Learning!
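A two-column Progressive Networks sketch (layer sizes and names are illustrative): the column trained on task 1 is frozen, and the column for task 2 receives lateral connections from the frozen column's hidden activations, so old features can be reused without being overwritten.

```python
import torch
import torch.nn as nn

class ProgressiveTwoColumns(nn.Module):
    """Column 1 is frozen after task 1; column 2 gets lateral inputs from it."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        # Column for task 1 (frozen after training on task 1).
        self.c1_l1 = nn.Linear(in_dim, hidden)
        self.c1_l2 = nn.Linear(hidden, out_dim)
        # Column for task 2, with a lateral adapter from column 1.
        self.c2_l1 = nn.Linear(in_dim, hidden)
        self.lateral = nn.Linear(hidden, hidden)   # adapter from c1's hidden layer
        self.c2_l2 = nn.Linear(hidden, out_dim)

    def forward_task2(self, x):
        with torch.no_grad():                      # column 1 is frozen
            h1 = torch.relu(self.c1_l1(x))
        h2 = torch.relu(self.c2_l1(x) + self.lateral(h1))
        return self.c2_l2(h2)
```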

Progressive Networks: Results

Progressive Networks: Representation Transfer

Representation Transfer in MT: PathNet

PathNet key details:

  • Each layer has K independent modules
  • A pathway uses only 4 modules from each layer
    • Sum the outputs of all selected modules within a layer (see the sketch below)
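A sketch of a single PathNet layer following the description above (module type and K are placeholders): the layer holds K independent modules, a pathway selects a few of them (at most 4), and their outputs are summed.

```python
import torch
import torch.nn as nn

class PathNetLayer(nn.Module):
    """K independent modules; a pathway selects up to 4 of them and sums their outputs."""
    def __init__(self, in_dim, out_dim, K=10):
        super().__init__()
        self.modules_list = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(K))

    def forward(self, x, pathway):
        # 'pathway' is a list of module indices chosen for this layer.
        return torch.relu(sum(self.modules_list[i](x) for i in pathway))
```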

PathNet: Single Task Learning

Train PathNet on a single Task:

  1. Generate a population of pathways
  2. Train the population of pathways for T episodes with SGD
  3. Compare B pathways: overwrite the losers and mutate the winners (see the sketch below)
  4. Go to step 2
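A sketch of the tournament step from the procedure above (population handling, mutation rate, and the fitness function are assumptions): B pathways are compared, losers are overwritten by copies of the winner, and the copies are mutated.

```python
import copy
import numpy as np

def mutate(pathway, K, p=0.1):
    # With small probability, replace each chosen module index by a random one.
    return [[np.random.randint(K) if np.random.rand() < p else m for m in layer]
            for layer in pathway]

def tournament_step(population, fitness, K, B=2):
    """population: list of pathways; fitness(pathway) -> return after T episodes of SGD training."""
    idx = np.random.choice(len(population), size=B, replace=False)
    scores = [fitness(population[i]) for i in idx]
    winner = idx[int(np.argmax(scores))]
    for i in idx:
        if i != winner:  # overwrite losers with a mutated copy of the winner
            population[i] = mutate(copy.deepcopy(population[winner]), K)
    return population
```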

 

 


PathNet: Multi-Task and Lifelong Learning

Training on multiple Tasks:

  1. Train on Task 1
  2. Freeze the best performing pathway (Task 1)
  3. Train on Task 2
    • Frozen modules still pass gradients through (their weights are not updated)
  4. Repeat


Resources:

 

These slides are heavily inspired by:

 

Papers:

  •  

Thank you for your attention!
