

Dmitry Nikulin | 30 April

## Spinning Up Key Papers

Simultaneously learn to solve multiple tasks.

## But what are we talking about, exactly?

• Move robotic hand into a particular position
• Drive to a particular point
• Solve an Atari game

### Formal definition

• Learn a single task-conditioned policy $$\pi \left( a \mid s, \tau \right)$$, ...
• where $$\tau$$ is an encoding of a task ...
• from a distribution $$\tau \sim p( \mathcal{T} )$$ ...
• and the task is defined by its own discount function $$\gamma_{\tau}(s)$$ and reward function $$R_{\tau}(s, a, s')$$.
• Policy $$\pi$$ should maximize average expected return:
• $$\mathbb{E}_{\tau \sim p( \mathcal{T} )} \left[ \mathbb{E}_{s_0, \pi} \left[ \sum\limits_{t=0}^T \left( \prod\limits_{t'=0}^{t-1} \gamma_\tau \left( s_{t'} \right) \right) R_{\tau}(s_t, a_t, s_{t + 1}) \right] \right]$$.
(see next slide for why we vary $$\gamma$$)

### Special case: reaching goal state

Reach state $$g$$:

$$\begin{aligned} R_g(s, a, s') &= \mathbb{I} \left[ s' = g \right] \\ \gamma_g(s) &= 0.99 \cdot \mathbb{I} \left[ s \neq g \right] \end{aligned}$$

i.e. $$g$$ becomes a pseudo-terminal state.

(this is pretty much the only case where $$\gamma$$ is non-constant)
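
A minimal sketch of this special case in Python; the function names and the toy return computation are illustrative, not taken from any paper:

```python
import numpy as np

def reward_g(s, a, s_next, g):
    # R_g(s, a, s') = I[s' = g]
    return float(np.array_equal(s_next, g))

def gamma_g(s, g, gamma=0.99):
    # gamma_g(s) = 0.99 * I[s != g]: once the goal is reached, all later
    # rewards are multiplied by zero, so g acts as a pseudo-terminal state.
    return gamma * float(not np.array_equal(s, g))

def task_return(states, actions, g):
    """sum_t ( prod_{t' < t} gamma_g(s_t') ) * R_g(s_t, a_t, s_{t+1})."""
    total, running_discount = 0.0, 1.0
    for t, a in enumerate(actions):
        total += running_discount * reward_g(states[t], a, states[t + 1], g)
        running_discount *= gamma_g(states[t], g)
    return total
```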

# Benchmarks

## (or lack thereof)

### Benchmarks

There is no standard benchmark in multitask/transfer RL.

Each paper rolls its own evaluation.

### Benchmarks

Discrete actions: train a single agent to play multiple Atari games (select those where your results are most convincing).

Optionally: a Quake 3 fork.

### Benchmarks

Continuous actions: there have been multiple attempts; the most recent one is MetaWorld.

Yu et al. (Stanford & co), CoRL 2019, 263 GitHub stars

## finally, SpinUp papers

### Universal Value Function Approximators

• Introduce definitions that inspired a previous section;
• Train $$V(s, g; \theta)$$ and $$Q(s, a, g; \theta)$$ via off-policy Q-learning.

But:

• They train a bunch of different networks $$Q_g(s, a; \theta_g)$$ ("Horde architecture") and then distill them into one $$Q(s, a, g; \theta)$$
• They do distillation via rank-n matrix factorization
• 🚲 Evaluation on toy environments (custom gridworlds), and one experiment on MsPacman

Schaul et al. (DeepMind), ICML 2015, 319 citations
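
For illustration, a minimal goal-conditioned Q-network of the kind UVFA learns, in the simple concatenation form; the paper's two-stream factorized architecture and the Horde-distillation pipeline are not reproduced here:

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, a, g; theta): one network that takes the goal as an extra input."""

    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, g):
        # Concatenate state, action and goal; the same parameters generalize
        # across goals instead of keeping a separate Q_g per goal.
        return self.net(torch.cat([s, a, g], dim=-1))
```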

### Hindsight Experience Replay

Andrychowicz et al. (OpenAI), NIPS 2017, 512 citations

Motivating example:

• States: $$N$$-bit vector
• Actions: flip $$k$$-th bit
• Reward: $$R(s, a, s') = \mathbb{I}\left[ s' = g \right]$$
• Agent receives $$g$$ as input.

tl;dr: a very large state space with very sparse rewards, yet trivially easy once the structure is understood.
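
A minimal sketch of this bit-flipping environment (the number of bits and the API are illustrative):

```python
import numpy as np

class BitFlipEnv:
    """N-bit flipping: state and goal are random binary vectors, action k
    flips the k-th bit, reward is 1 only when the state equals the goal."""

    def __init__(self, n_bits=40, seed=0):
        self.n = n_bits
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = self.rng.integers(0, 2, self.n)
        self.goal = self.rng.integers(0, 2, self.n)
        return self.state.copy(), self.goal.copy()

    def step(self, k):
        self.state[k] ^= 1                                   # flip the k-th bit
        done = bool(np.array_equal(self.state, self.goal))
        return self.state.copy(), float(done), done          # R = I[s' = g]
```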

### Hindsight Experience Replay

• Policy $$\pi$$ takes goal $$g$$ as input
• Tuple $$(s_t, g, a_t, r_t, s_{t + 1})$$ goes in the replay buffer
• But we also put $$(s_t, s_T, a_t, r'_t, s_{t + 1})$$ there, where $$s_T$$ is the final state of the trajectory and $$r'_t$$ is the reward recomputed for this substituted goal
• Can also put $$(s_t, s_{t'}, a_t, r'_t, s_{t + 1})$$ for several random $$t' > t$$ (works better)
• Train any off-policy algorithm on that replay buffer
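
A sketch of the "future" relabeling strategy described above; note that the reward in a relabeled tuple is recomputed for the substituted goal rather than copied:

```python
import random

def her_relabel(episode, reward_fn, k_future=4):
    """'Future' strategy: relabel each transition with goals achieved later.

    episode: list of (s, g, a, r, s_next) tuples for one trajectory.
    reward_fn(s, a, s_next, g): recomputes the reward for a substituted goal.
    """
    out = []
    T = len(episode)
    for t, (s, g, a, r, s_next) in enumerate(episode):
        out.append((s, g, a, r, s_next))                        # original goal
        candidates = list(range(t, T))                          # indices t' >= t
        for t_prime in random.sample(candidates, min(k_future, len(candidates))):
            new_goal = episode[t_prime][4]                      # a state achieved later on
            new_r = reward_fn(s, a, s_next, new_goal)           # reward recomputed in hindsight
            out.append((s, new_goal, a, new_r, s_next))
    return out
```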

### Hindsight Experience Replay

🚲 Evaluation: custom robotic arm manipulation environment based on MuJoCo

### Intentional-Unintentional Agent

Cabi et al. (DeepMind), CoRL 2017, 17 citations

tl;dr: train a bunch of DDPG instances in parallel on one stream of experience while executing one of the policies being trained.

(intentional = behavioral policy, unintentional = other policies)
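
A rough architectural sketch of the idea: a shared torso with one actor head and one critic head per task, all trained from the same stream of transitions. This layout is an assumption for illustration, not the paper's exact networks:

```python
import torch
import torch.nn as nn

class IUAgent(nn.Module):
    """Shared torso, one actor head and one critic head per task."""

    def __init__(self, obs_dim, act_dim, n_tasks, hidden=256):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actors = nn.ModuleList(
            nn.Linear(hidden, act_dim) for _ in range(n_tasks))
        self.critics = nn.ModuleList(
            nn.Linear(hidden + act_dim, 1) for _ in range(n_tasks))

    def act(self, obs, intentional=0):
        # Only the intentional policy is executed in the environment.
        return torch.tanh(self.actors[intentional](self.torso(obs)))

    def q_values(self, obs, action):
        # Every critic (and, via DDPG-style updates, every actor) is trained
        # off-policy on the same stream of transitions.
        h = torch.cat([self.torso(obs), action], dim=-1)
        return [critic(h) for critic in self.critics]
```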

### Intentional-Unintentional Agent

Notation in the paper's DDPG-style actor and critic updates:

$$\theta$$ — actor parameters

$$w$$ — critic parameters ($$w'$$ — target network)

$$i$$ — index of a policy that solves $$i$$-th task

$$j$$ — timestep

### Intentional-Unintentional Agent

🚲 Evaluation:

• Custom environment based on MuJoCo
• Comparison only with plain DDPG
• $$\approx 10^7$$ training steps

Example tasks:

• Bring the red object near the blue one
• Put the green object in the corner

### Intentional-Unintentional Agent

Results:

• More tasks = better performance on each task, up to a limit
• Making the hardest task intentional (behavioral) works best

### UNREAL

• "Reinforcement Learning with Unsupervised Auxiliary Tasks"
• UNsupervised REinforcement and Auxiliary Learning
• Not really multitask: all auxiliary tasks are only used to improve performance on the main task

Jaderberg et al. (DeepMind), ICLR 2017, 508 citations

### UNREAL

Train an A3C agent with a bunch of synthetic goals optimized via Q-learning:

### UNREAL

• Pixel changes: which action will produce maximal change in pixel intensity in this region of the input?
• Network features: which action will maximally activate a specific hidden layer?
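
A sketch of how pixel-change auxiliary rewards can be computed per grid cell; UNREAL additionally trains a separate Q-head over these cells (not shown), and the cell size here is illustrative:

```python
import numpy as np

def pixel_change_targets(frame, next_frame, cell=4):
    """Per-cell auxiliary rewards: mean absolute change in pixel intensity
    within each cell of a grid laid over the observation."""
    diff = np.abs(next_frame.astype(np.float32) - frame.astype(np.float32))
    if diff.ndim == 3:                      # average over colour channels
        diff = diff.mean(axis=-1)
    h = (diff.shape[0] // cell) * cell
    w = (diff.shape[1] // cell) * cell
    diff = diff[:h, :w]
    return diff.reshape(h // cell, cell, w // cell, cell).mean(axis=(1, 3))
```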

### UNREAL

Auxiliary reward prediction:

• Given $$k$$ input frames, what will be the next reward?
• Classification: zero/positive/negative
• Train on replay buffer with a skewed distribution
($$\mathbb{P}(r \neq 0) = \frac 1 2$$)
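
A sketch of the skewed sampling; this hypothetical helper assumes the replay buffer has already been split into zero-reward and non-zero-reward examples:

```python
import random

def sample_reward_prediction_batch(zero_r, nonzero_r, batch_size=32):
    """Sample (frame_stack, reward_sign) examples so that P(r != 0) = 1/2,
    instead of the much smaller fraction found in the raw replay buffer."""
    batch = []
    for _ in range(batch_size):
        pool = nonzero_r if random.random() < 0.5 else zero_r
        batch.append(random.choice(pool))
    return batch
```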

### UNREAL

Auxiliary value function replay:

• just train A3C value function on replay buffer

### UNREAL

another illustration: https://github.com/miyosuda/unreal

# Transfer

### Mutual Alignment Transfer Learning

Wulfmeier et al. (BAIR), CoRL 2017, 27 citations

• Suppose we have a simulator that is really good at reproducing states (i.e. sensor readings), but not environment dynamics.
• We can use a discriminator to make the simulator policy $$\pi_{sim}$$ learn to produce the same trajectories as $$\pi_{real}$$ using different actions.
• $$\pi_{sim}$$ will be effectively a generator, and the whole thing will be a GAN.
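
A very rough GAN-style sketch of that idea; in the actual paper both the simulation and the real-robot policy receive discriminator-based auxiliary rewards, and all names below are illustrative:

```python
import torch
import torch.nn as nn

class TrajectoryDiscriminator(nn.Module):
    """Classifies short state sequences as coming from real or simulated rollouts."""

    def __init__(self, state_dim, horizon, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * horizon, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, states):                               # states: (batch, horizon, state_dim)
        return torch.sigmoid(self.net(states.flatten(1)))    # P("real")

def alignment_bonus(disc, sim_states):
    # Auxiliary reward for the simulator policy: make its state sequences
    # indistinguishable from real ones (the generator side of the GAN).
    return torch.log(disc(sim_states) + 1e-8)
```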

### Mutual Alignment Transfer Learning

🚲 Evaluation: transfer between two different simulators.

### Embedding Space for Robot Skills

Hausman et al. (DeepMind), ICLR 2018, 76 citations

• For each task $$\tau$$ (e.g., one-hot encoded), we can learn a distribution of embeddings $$p_\phi \left( z \mid \tau \right)$$, ...
• a task embedding-conditioned policy $$\pi_\theta \left( a \mid s_i, z \right)$$, ...
• and a distribution $$q_\psi \left( z \mid a, s_i^H \right)$$ that identifies task embeddings based on a trajectory segment $$s_i^H = \left[ s_{i - H}, \ldots, s_i \right]$$.

### Embedding Space for Robot Skills

In training, a weighted sum of the following terms is maximized:

• environment reward
• log-likelihood of the true $$z$$ under the inference network $$q_\psi$$
• policy entropy
• embedding entropy

(this is actually maximization of a lower bound on return of an entropy-regularized policy via variational inference)
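
A crude per-step sketch of how these four terms might be combined into the agent's learning signal; the weights and exact per-term form are assumptions, not the paper's equations:

```python
def augmented_objective(env_reward, log_q_z, log_pi_a, log_p_z,
                        alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Per-step learning signal: environment reward, plus log q_psi(z | s^H)
    for the true z, plus entropy bonuses for the policy and the embedding
    distribution (entropy enters as minus the log-probability)."""
    return (env_reward
            + alpha1 * log_q_z      # log-likelihood of the true z
            - alpha2 * log_pi_a     # policy entropy term: -log pi(a | s, z)
            - alpha3 * log_p_z)     # embedding entropy term: -log p_phi(z | tau)
```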

### Embedding Space for Robot Skills

For $$p_\phi$$ and $$\pi_\theta$$, this is done by learning a function $$Q_\varphi^\pi$$ on the replay buffer with an off-policy correction, and then maximizing the resulting objective with respect to the policy and embedding parameters.

### Embedding Space for Robot Skills

For $$q_\psi$$ (the inference network), we maximize the log-likelihood of the true $$z$$ given trajectory segments.

This is supervised learning, done offline, using the replay buffer.

### Embedding Space for Robot Skills

The term for the log-likelihood of the true $$z$$ in the agent's reward helps generate more diverse trajectories.

### Embedding Space for Robot Skills

🚲 Evaluation: transfer learning for robotic manipulation.

• Multiple target tasks, like moving a block that is attached to a spring over a wall to a goal position (while stretching the spring).
• Pretraining on related but different tasks, like bringing a block attached to a spring to a goal position (no wall).
• Then transfer to target task, freezing the policy network $$\pi_\theta$$ and only training* a new task-embedding network $$z = f_\vartheta(t)$$.

* text actually says "state-embedding network $$z = f_\vartheta(x)$$", but this seems to be a mistake

### Embedding Space for Robot Skills

• from scratch = no transfer
• task selection = no $$z$$, policy receives $$t$$ directly
• pretrain all = pre-training on all auxiliary tasks for all environments, not just the target one

### Progressive Neural Networks

Rusu et al. (DeepMind), arXiv 2016, 592 citations

• Train a network to do a task
• Freeze its weights
• Train a second network to do another task, using hidden activations of the first network as additional inputs to the second one

### Progressive Neural Networks

$$h_i^{(k)} = \operatorname{ReLU} \left( W_i^{(k)} h_{i-1}^{(k)} + \sum_{j = 1}^{k - 1} \operatorname{MLP}_i^{(k:j)} \left( h^{(j)}_{i - 1} \right) \right)$$, where

$$i$$ — layer index

$$j, k$$ — column indices

$$\operatorname{MLP}_i^{(k:j)} \left( h^{(j)}_{i - 1} \right) = U_i^{(k:j)} \sigma \left( P_i^{(k:j)} \alpha_{i - 1}^{(j)} h^{(j)}_{i - 1} \right)$$

$$W_i^{(k)}$$, $$U_i^{(k:j)}$$ — linear layers

$$P_i^{(k:j)}$$ — projection matrix

$$\alpha_{i - 1}^{(j)}$$ — learnable scalar

$$\sigma$$ — some nonlinearity
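
A sketch of one such layer following the formula above; $$\sigma$$ is taken to be a sigmoid here purely as an example:

```python
import torch
import torch.nn as nn

class ProgressiveLayer(nn.Module):
    """Layer i of column k: its own weights plus adapters from frozen columns j < k."""

    def __init__(self, in_dim, out_dim, n_prev_columns, adapter_dim=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)                    # W_i^(k)
        self.alpha = nn.ParameterList(                         # learnable scalars alpha_{i-1}^(j)
            nn.Parameter(torch.ones(1)) for _ in range(n_prev_columns))
        self.P = nn.ModuleList(                                # projections P_i^(k:j)
            nn.Linear(in_dim, adapter_dim) for _ in range(n_prev_columns))
        self.U = nn.ModuleList(                                # output maps U_i^(k:j)
            nn.Linear(adapter_dim, out_dim) for _ in range(n_prev_columns))

    def forward(self, h_own, h_prev):
        # h_own: h_{i-1}^(k); h_prev: list of h_{i-1}^(j) from earlier (frozen) columns.
        out = self.W(h_own)
        for a, P, U, h_j in zip(self.alpha, self.P, self.U, h_prev):
            out = out + U(torch.sigmoid(P(a * h_j)))           # MLP_i^(k:j)
        return torch.relu(out)
```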

### Progressive Neural Networks

Evaluation baselines:

1. Train from scratch
2. Finetune last layer
3. Finetune all layers
4. Random frozen first column

### PathNet

Fernando et al. (DeepMind), arXiv 2017, 263 citations

### PathNet

Architecture:

• $$L$$ layers, $$M$$ modules in each, each module is a small network
• Pathway = a set of "active" modules in each layer (at most $$N$$ in each layer, typically $$N = 3$$ or $$4$$)
• Different pathways can (and do) share modules
• Outputs of modules are summed between layers
• Final layer is not shared between tasks
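
A sketch of one PathNet layer with a pathway selecting which modules are active (names are illustrative):

```python
import torch
import torch.nn as nn

class PathNetLayer(nn.Module):
    """One layer: M candidate modules, of which a pathway activates at most N."""

    def __init__(self, dim, n_modules):
        super().__init__()
        self.candidates = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_modules))

    def forward(self, x, active):
        # 'active' is the set of module indices chosen by the current pathway;
        # outputs of active modules are summed before being passed on.
        return sum(self.candidates[i](x) for i in active)
```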

### PathNet

Training:

• Spawn a bunch of workers, train them for $$T$$ epochs
• Fitness = average reward during training
• Once $$T$$ epochs are complete, the worker writes its own fitness to a shared array and compares it to $$B$$ other fitness values
• If a larger fitness value is found, the worker's pathway is overwritten with the winner's pathway and then mutated

After training on a task, the winning pathway's weights are frozen before training on the next task starts. All other weights are reinitialized.
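
A sketch of the tournament step under the description above; the mutation here resamples module indices uniformly, whereas the paper perturbs them locally, so treat it as illustrative:

```python
import random

def tournament_step(my_fitness, my_path, fitnesses, paths,
                    B=2, p_mutate=0.1, n_modules=10):
    """Compare against B random workers; if one is fitter, copy its pathway
    and mutate it. Each entry of the copied pathway is resampled with
    probability p_mutate."""
    rivals = random.sample(range(len(fitnesses)), B)
    best = max(rivals, key=lambda i: fitnesses[i])
    if fitnesses[best] <= my_fitness:
        return my_path                                       # keep own pathway
    return [
        [m if random.random() > p_mutate else random.randrange(n_modules)
         for m in layer]
        for layer in paths[best]]                            # winner's pathway + mutation
```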