Dmitry Nikulin | 30 April
Learn task A, then use that to learn task B faster.
based on http://arxiv.org/abs/1910.10897
Reach state g:
$$R_g(s, a, s') = \mathbb{I}[s' = g]$$
$$\gamma_g(s) = 0.99 \cdot \mathbb{I}[s \neq g]$$
i.e. g becomes a pseudo-terminal state.
(this is pretty much the only case where γ is non-constant)
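As code, a minimal sketch of these two definitions, assuming exact state matching (the function names are mine, not from the paper):

```python
import numpy as np

def pseudo_reward(s_next, g):
    """R_g(s, a, s') = I[s' = g]: reward 1 only when the goal state is reached."""
    return float(np.array_equal(s_next, g))

def pseudo_discount(s, g, gamma=0.99):
    """gamma_g(s) = gamma * I[s != g]: the discount drops to 0 at the goal,
    which is what makes g act as a pseudo-terminal state."""
    return 0.0 if np.array_equal(s, g) else gamma
```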
There is no standard benchmark for multitask/transfer RL.
Each paper rolls its own evaluation.
Discrete actions: train a single agent to play multiple Atari games (select those where your results are most convincing).
Optionally: a Quake 3 fork.
Yu et al. (Stanford & co), CoRL 2019, 263 GitHub stars
But:
Schaul et al. (DeepMind), ICML 2015, 319 citations
Andrychowicz et al. (OpenAI), NIPS 2017, 512 citations
Motivating example:
tl;dr: very large state space with very sparse rewards, but easy to solve once we figure it out.
🚲 Evaluation: custom robotic arm manipulation environment based on MuJoCo
Cabi et al. (DeepMind), CoRL 2017, 17 citations
tl;dr: train a bunch of DDPG instances in parallel on one stream of experience while executing one of the policies being trained.
(intentional = behavioral policy, unintentional = other policies)
θ — actor parameters
w — critic parameters (w′ — target network)
i — index of a policy that solves i-th task
j — timestep
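With these symbols, here is a rough sketch of one IU-style update step. The separate actor/critic instance per task, the per-task reward functions relabeling the shared batch, and all hyperparameters are assumptions of this sketch, not the paper's code:

```python
import torch
import torch.nn.functional as F

def iu_update(actors, critics, target_critics, actor_opts, critic_opts,
              replay_buffer, reward_fns, gamma=0.99):
    """One off-policy DDPG update for every task i, all on the same batch of
    experience collected by a single behavioral (intentional) policy."""
    s, a, s_next = replay_buffer.sample()       # shared stream of experience
    for i in range(len(actors)):                # i indexes tasks/policies
        # Each task relabels the shared transitions with its own reward.
        r_i = reward_fns[i](s, a, s_next)
        # Critic update: TD target uses the task's target network (w').
        with torch.no_grad():
            a_next = actors[i](s_next)
            q_target = r_i + gamma * target_critics[i](s_next, a_next)
        critic_loss = F.mse_loss(critics[i](s, a), q_target)
        critic_opts[i].zero_grad()
        critic_loss.backward()
        critic_opts[i].step()
        # Actor update: deterministic policy gradient through the task's critic.
        actor_loss = -critics[i](s, actors[i](s)).mean()
        actor_opts[i].zero_grad()
        actor_loss.backward()
        actor_opts[i].step()
```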
🚲 Evaluation:
Example tasks:
Results:
Jaderberg et al. (DeepMind), ICLR 2017, 508 citations
Train an A3C agent together with a bunch of auxiliary tasks sharing the same network (the auxiliary control tasks are optimized via Q-learning):
Auxiliary control tasks:
Auxiliary reward prediction:
Auxiliary value function replay:
another illustration: https://github.com/miyosuda/unreal
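As a concrete example of one of these heads, here is a minimal sketch of the auxiliary reward-prediction part. The shared encoder producing `frame_features`, the 3-frame context, and the three-way sign classification are assumptions of this sketch based on the paper's description:

```python
import torch.nn as nn
import torch.nn.functional as F

class RewardPredictionHead(nn.Module):
    """Auxiliary reward-prediction head: given features of a few consecutive
    replayed frames, classify the sign of the upcoming reward
    (zero / positive / negative)."""
    def __init__(self, feature_dim, context=3, num_classes=3):
        super().__init__()
        self.fc = nn.Linear(feature_dim * context, num_classes)

    def forward(self, frame_features):          # (batch, context, feature_dim)
        return self.fc(frame_features.flatten(start_dim=1))

def reward_prediction_loss(head, frame_features, reward_sign_labels):
    # Cross-entropy on replayed frame windows; gradients flow back into the
    # shared encoder that produced frame_features, shaping its representation.
    return F.cross_entropy(head(frame_features), reward_sign_labels)
```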
Wulfmeier et al. (BAIR), CoRL 2017, 27 citations
🚲 Evaluation: transfer between two different simulators.
Hausman et al. (DeepMind), ICLR 2018, 76 citations
In training, a weighted sum of the following terms is maximized:
- env. reward
- log-likelihood of the true z under the inference network
- policy entropy
- embedding entropy
(this is actually maximization of a lower bound on the return of an entropy-regularized policy, derived via variational inference)
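Taking the list above literally, the objective can be sketched as a weighted sum; the coefficient names alpha1–alpha3 are placeholders (the paper weights each term with its own coefficient):

```python
def embedding_rl_objective(env_return, log_q_true_z, policy_entropy, embedding_entropy,
                           alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Weighted sum of the four maximized terms listed above."""
    return (env_return
            + alpha1 * log_q_true_z        # reward trajectories the inference net can decode
            + alpha2 * policy_entropy      # entropy-regularized policy
            + alpha3 * embedding_entropy)  # keep the task-embedding distribution broad
```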
For $p_\phi$ and $\pi_\theta$, this is done by learning a function $Q^\pi_\varphi$ on the replay buffer with an off-policy correction:
and then maximizing:
For $q_\psi$ (the inference network), we maximize:
This is supervised learning, done offline, using the replay buffer.
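A minimal sketch of that supervised step, assuming the replay buffer stores trajectory segments together with the true z that generated them; the Gaussian inference network and its architecture are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """q_psi(z | trajectory segment): predicts a Gaussian over the latent z
    that was active when the segment was collected."""
    def __init__(self, segment_dim, z_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(segment_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def forward(self, segment):
        h = self.body(segment)
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())

def inference_update(q_psi, optimizer, segments, true_z):
    """Offline supervised step on replayed data: maximize log q_psi(true z | segment)."""
    optimizer.zero_grad()
    loss = -q_psi(segments).log_prob(true_z).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```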
The log-likelihood-of-true-z term in the agent's reward helps generate more diverse trajectories:
🚲 Evaluation: transfer learning for robotic manipulation.
* the text actually says "state-embedding network $z = f_\vartheta(x)$", but this seems to be a mistake
Rusu et al. (DeepMind), arXiv 2016, 592 citations
$$h_i^{(k)} = \mathrm{ReLU}\!\left(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j=1}^{k-1} \mathrm{MLP}_i^{(k:j)}\!\left(h_{i-1}^{(j)}\right)\right),$$ where
$i$ — layer index
$j, k$ — column indices
$$\mathrm{MLP}_i^{(k:j)}\!\left(h_{i-1}^{(j)}\right) = U_i^{(k:j)}\, \sigma\!\left(P_i^{(k:j)} \alpha_{i-1}^{(j)} h_{i-1}^{(j)}\right)$$
$W_i^{(k)}$, $U_i^{(k:j)}$ — linear layers
$P_i^{(k:j)}$ — projection matrix
$\alpha_{i-1}^{(j)}$ — learnable scalar
$\sigma$ — some nonlinearity
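A sketch of one such layer for column k, following the formula above. Using ReLU for $\sigma$, equal layer widths across columns, and the adapter width are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class ProgressiveLayer(nn.Module):
    """Layer i of column k: its own linear map W_i^(k) plus one adapter
    MLP_i^(k:j) per previously trained (frozen) column j < k."""
    def __init__(self, in_dim, out_dim, num_prev_columns, adapter_dim=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)                 # W_i^(k)
        self.alphas = nn.ParameterList(                     # learnable scalars alpha_{i-1}^(j)
            [nn.Parameter(torch.ones(1)) for _ in range(num_prev_columns)])
        self.P = nn.ModuleList(                             # projection matrices P_i^(k:j)
            [nn.Linear(in_dim, adapter_dim, bias=False) for _ in range(num_prev_columns)])
        self.U = nn.ModuleList(                             # linear layers U_i^(k:j)
            [nn.Linear(adapter_dim, out_dim, bias=False) for _ in range(num_prev_columns)])

    def forward(self, h_own, h_prev_columns):
        # h_own: h_{i-1}^(k) from this column; h_prev_columns: [h_{i-1}^(j) for j < k]
        out = self.W(h_own)
        for alpha, P, U, h_j in zip(self.alphas, self.P, self.U, h_prev_columns):
            out = out + U(torch.relu(P(alpha * h_j)))       # MLP_i^(k:j)(h_{i-1}^(j))
        return torch.relu(out)                              # h_i^(k)
```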
Evaluation baselines:
Fernando et al. (DeepMind), arXiv 2017, 263 citations
Architecture:
Training:
After training, the winning pathway's weights are frozen before training on another task begins. All other weights are reinitialized.
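A sketch of that freeze-and-reinitialize step, treating the network as a flat list of modules and the winning pathway as a set of module indices; both are simplifications of PathNet's layered module grid:

```python
import torch.nn as nn

def freeze_winning_pathway(modules, winning_indices):
    """Freeze the modules used by the winning pathway; reinitialize everything
    else before training on the next task."""
    for idx, module in enumerate(modules):
        if idx in winning_indices:
            for p in module.parameters():
                p.requires_grad = False        # frozen: reused as-is by future tasks
        else:
            for layer in module.modules():
                if hasattr(layer, "reset_parameters"):
                    layer.reset_parameters()   # reinitialized: free capacity for the next task
```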