


Dmitry Nikulin | 30 April

Learn task A, *then* use that to learn task B faster.

- Task Embeddings
- sim2real: MATL
- Too Many GPUs: Progressive Networks, PathNet

- Move robotic hand into a particular position
- Drive to a particular point
- Solve an Atari game

based on http://arxiv.org/abs/1910.10897

- Learn a single task-conditioned policy \( \pi \left( a \mid s, \tau \right) \), ...
- where \( \tau \) is an encoding of a task ...
- from a distribution \( \tau \sim p( \mathcal{T} ) \) ...
- and the task is defined by its own discount function \( \gamma_{\tau}(s) \) and reward function \( R_{\tau}(s, a, s') \).
- Policy \( \pi \) should maximize average expected return:
- \( \mathbb{E}_{\tau \sim p( \mathcal{T} )} \left[ \mathbb{E}_{s_0, \pi} \left[ \sum\limits_{t=0}^T \left( \prod\limits_{t'=0}^{t-1} \gamma_\tau \left( s_{t'} \right) \right) R_{\tau}(s_t, a_t, s_{t + 1}) \right] \right] \).

(see next slide for why we vary \( \gamma \))

Reach state \( g \):

\[ \begin{aligned} R_g(s, a, s') &= \mathbb{I} \left[ s' = g \right] \\ \gamma_g(s) &= 0.99 \cdot \mathbb{I} \left[ s \neq g \right] \end{aligned} \]

i.e. \( g \) becomes a pseudo-terminal state.

(this is pretty much the only case where \( \gamma \) is non-constant)
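A minimal sketch of the goal-reaching definitions above, assuming states are plain hashable values and the base discount is 0.99 as on the slide. The function names are illustrative, not from any library. It shows how a state-dependent discount zeroes out everything collected after the goal is reached.

```python
def goal_reward(next_state, goal):
    """R_g(s, a, s') = I[s' = g]."""
    return 1.0 if next_state == goal else 0.0


def goal_discount(state, goal, base=0.99):
    """gamma_g(s) = base * I[s != g]: the discount drops to zero at the goal."""
    return base if state != goal else 0.0


def discounted_return(states, goal):
    """Sum over t of (prod_{t' < t} gamma_g(s_t')) * R_g(s_t, a_t, s_{t+1}).

    `states` is the visited sequence s_0, s_1, ...; actions are omitted
    because this particular reward depends only on the next state.
    """
    total, weight = 0.0, 1.0
    for state, next_state in zip(states, states[1:]):
        total += weight * goal_reward(next_state, goal)
        weight *= goal_discount(state, goal)
    return total


# Reaching the goal a second time adds nothing: g acts as a terminal state.
once = discounted_return([0, 2], goal=2)
twice = discounted_return([0, 2, 5, 2], goal=2)
```

Both `once` and `twice` evaluate to the same return, because the running product of discounts becomes zero as soon as the trajectory leaves \( g \).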

There is no standard benchmark in Multitask/Transfer.

Each paper rolls its own evaluation.

Discrete actions: train a single agent to play multiple Atari games (select those where your results are most convincing).

Optionally: a Quake 3 fork.

- Continuous actions: multiple attempts, most recent one is MetaWorld.

Yu et al. (Stanford & co), CoRL 2019, 263 GitHub stars

- Introduce definitions that inspired a previous section;
- Train \( V(s, g; \theta) \) and \( Q(s, a, g; \theta) \) via off-policy Q-learning.

But:

- They train a bunch of different networks \( Q_g(s, a; \theta_g) \) ("Horde architecture") and then distill them into one \( Q(s, a, g; \theta) \)
- They do distillation via rank-n matrix factorization
- 🚲 Evaluation on toy environments (custom gridworlds), and one experiment on MsPacman

Schaul et al. (DeepMind), ICML 2015, 319 citations

Andrychowicz et al. (OpenAI), NIPS 2017, 512 citations

Motivating example:

- States: \(N\)-bit vector
- Actions: flip \(k\)-th bit
- Reward: \( R(s, a, s') = \mathbb{I}\left[ s' = g \right] \)
- Agent receives \(g\) as input.

tl;dr: very large state space with very sparse rewards, but easy to solve once we figure it out.

- Policy \( \pi \) takes goal \( g \) as input
- Tuple \( (s_t, g, a_t, r_t, s_{t + 1}) \) goes in the replay buffer
- But we also put \( (s_t, s_T, a_t, r'_t, s_{t + 1}) \) there (\( s_T \) is the final state in a trajectory; the reward \( r'_t \) is recomputed for the substituted goal)
- Can also put \( (s_t, s_{t'}, a_t, r'_t, s_{t + 1}) \) for several random \( t' > t \) (works better)
- Train any off-policy algorithm on that replay buffer
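The relabeling steps above can be sketched on the bit-flipping example. This is an illustration under assumptions, not the paper's code: `relabel`, `flip_bit` and the `k_future` parameter are hypothetical names, states are tuples of bits, and the reward is recomputed for every substituted goal.

```python
import random


def reward(next_state, goal):
    """Sparse reward R(s, a, s') = I[s' = g]."""
    return 1.0 if next_state == goal else 0.0


def flip_bit(state, k):
    """Action k flips the k-th bit of the state."""
    bits = list(state)
    bits[k] ^= 1
    return tuple(bits)


def relabel(trajectory, goal, k_future=4, rng=random):
    """Turn a list of (s_t, a_t, s_{t+1}) into replay tuples (s, g, a, r, s').

    Each transition is stored with the original goal, with the final state
    of the trajectory as goal, and with `k_future` randomly chosen states
    visited later in the same trajectory as goals.
    """
    final_state = trajectory[-1][2]
    buffer = []
    for t, (s, a, s_next) in enumerate(trajectory):
        goals = [goal, final_state]
        for _ in range(k_future):
            t_future = rng.randrange(t, len(trajectory))
            goals.append(trajectory[t_future][2])  # a state visited after s_t
        for g in goals:
            buffer.append((s, g, a, reward(s_next, g), s_next))
    return buffer


# A short rollout that never reaches the original goal (1, 1, 1, 1).
state, trajectory = (0, 0, 0, 0), []
for k in (0, 1, 2):
    next_state = flip_bit(state, k)
    trajectory.append((state, k, next_state))
    state = next_state
replay = relabel(trajectory, goal=(1, 1, 1, 1), k_future=2, rng=random.Random(0))
```

Even though the rollout fails on its original goal, `replay` contains transitions with reward 1 for the substituted goals, which is what gives the off-policy learner a signal.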

🚲 Evaluation: custom robotic arm manipulation environment based on MuJoCo

Cabi et al. (DeepMind), CoRL 2017, 17 citations

tl;dr: train a bunch of DDPG instances in parallel on one stream of experience while executing one of the policies being trained.

(intentional = behavioral policy, unintentional = other policies)

\( \theta \) — actor parameters

\( w \) — critic parameters (\( w' \) — target network)

\( i \) — index of a policy that solves \(i\)-th task

\( j \) — timestep

🚲 Evaluation:

- Custom environment based on MuJoCo
- Comparison only with plain DDPG
- \( \approx 10^7 \) training steps
- Up to 43 tasks

Example tasks:

- Bring red object near blue one
- Put green object in the corner

Results:

- More tasks = better performance of each task, up to a limit
- Making the hardest task intentional (behavioral) works best

- "Reinforcement Learning with Unsupervised Auxiliary Tasks"
- UNsupervised REinforcement and Auxiliary Learning
- Not really multitask: all auxiliary tasks are only used to improve performance on the main task

Jaderberg et al. (DeepMind), ICLR 2017, 508 citations

Train an A3C agent with a bunch of synthetic goals optimized via Q-learning:

Auxiliary control tasks:

- Pixel changes: which action will produce maximal change in pixel intensity in this region of the input?
- Network features: which action will maximally activate a specific hidden layer?

Auxiliary reward prediction:

- Given \( k \) input frames, what will be the next reward?
- Classification: zero/positive/negative
- Train on replay buffer with a skewed distribution

(\( \mathbb{P}(r \neq 0) = \frac 1 2 \))
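A minimal sketch of the skewed sampling for reward prediction, assuming the replay buffer is a list of `(frames, reward)` pairs. `sample_skewed` and `reward_class` are hypothetical helper names, and the class indices are an arbitrary choice for illustration.

```python
import random


def reward_class(r):
    """Three-way target: 0 = zero, 1 = positive, 2 = negative reward."""
    if r == 0:
        return 0
    return 1 if r > 0 else 2


def sample_skewed(buffer, rng=random):
    """Draw one (frames, reward) pair so that P(r != 0) = 1/2.

    Falls back to uniform sampling if one of the two groups is empty.
    """
    rewarding = [item for item in buffer if item[1] != 0]
    zero = [item for item in buffer if item[1] == 0]
    if not rewarding or not zero:
        return rng.choice(buffer)
    pool = rewarding if rng.random() < 0.5 else zero
    return rng.choice(pool)
```

With a buffer where only 1% of the entries carry a reward, roughly half of the drawn samples would still be rewarding ones, which keeps the classifier from collapsing to "always zero".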

Auxiliary value function replay:

- just train A3C value function on replay buffer

another illustration: https://github.com/miyosuda/unreal

Wulfmeier et al. (BAIR), CoRL 2017, 27 citations

- Suppose we have a simulator that is really good at reproducing states (i.e. sensor readings), but not environment dynamics.
- We can use a discriminator to make the simulator policy \( \pi_{sim} \) learn to produce the same trajectories as \( \pi_{real} \) using different actions.
- \( \pi_{sim} \) will be effectively a generator, and the whole thing will be a GAN.

🚲 Evaluation: transfer between two different simulators.

Hausman et al. (DeepMind), ICLR 2018, 76 citations

- For each task \( \tau \) (e.g., one-hot encoded), we can learn a distribution of embeddings \( p_\phi \left( z \mid \tau \right) \), ...
- a task embedding-conditioned policy \( \pi_\theta \left( a \mid s_i, z \right) \), ...
- and a distribution \( q_\psi \left( z \mid a, s_i^H \right) \) that *identifies* task embeddings based on a trajectory segment \( s_i^H = \left[ s_{i - H}, \ldots, s_i \right] \).

In training, a weighted combination of the following terms is maximized:

- environment reward
- log-likelihood of the true \( z \)
- policy entropy
- embedding entropy
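As a hedged reconstruction of how these four terms might combine: the weights \( \alpha_1, \alpha_2, \alpha_3 \), the constant discount \( \gamma \) and the augmented reward \( \hat{r} \) are symbols assumed here rather than taken from the slides, so the exact form should be checked against the paper.

\[ \begin{aligned} \mathcal{L}(\theta, \phi, \psi) &= \mathbb{E}_{\tau,\, z \sim p_\phi,\, \pi_\theta} \left[ \sum_i \gamma^i \, \hat{r}(s_i, a_i, z, \tau) \right] + \alpha_1 \mathbb{E}_{\tau} \left[ \mathcal{H} \left[ p_\phi \left( z \mid \tau \right) \right] \right] \\ \hat{r}(s_i, a_i, z, \tau) &= R_\tau(s_i, a_i) + \alpha_2 \log q_\psi \left( z \mid a_i, s_i^H \right) + \alpha_3 \mathcal{H} \left[ \pi_\theta \left( a \mid s_i, z \right) \right] \end{aligned} \]

Read this way, the first three terms (environment reward, log-likelihood of the true \( z \), policy entropy) act as a per-step reward for the agent, while the embedding entropy is a separate regularizer on \( p_\phi \).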

(this is actually maximization of a lower bound on return of an entropy-regularized policy via variational inference)

For \( p_\phi \) and \( \pi_\theta \), this is done by learning a function \( Q_\varphi^\pi \) on the replay buffer with an off-policy correction:

and then maximizing:

For \( q_\psi \) (inference network), we maximize:

This is supervised learning, done offline, using the replay buffer.

The term for log-likelihood of true \( z \) in agent's reward helps generate more diverse trajectories:

🚲 Evaluation: transfer learning for robotic manipulation.

- Multiple target tasks, like moving a block that is attached to a spring over a wall to a goal position (while stretching the spring).
- Pretraining on related but different tasks, like bringing a block attached to a spring to a goal position (no wall).
- Then transfer to target task, freezing the policy network \( \pi_\theta \) and only training* a new task-embedding network \(z = f_\vartheta(t)\).

* text actually says "state-embedding network \( z = f_\vartheta(x) \)", but this seems to be a mistake

- from scratch = no transfer
- task selection = no \( z \), policy receives \( t \) directly
- pretrain all = pre-training on all auxiliary tasks for all environments, not just the target one

Rusu et al. (DeepMind), arXiv 2016, 592 citations

- Train a network to do a task
- Freeze its weights
- Train a second network to do another task, using hidden activations of the first network as input for the second one

\( h_i^{(k)} = \operatorname{ReLU} \left( W_i^{(k)} h_{i-1}^{(k)} + \sum_{j = 1}^{k - 1} \operatorname{MLP}_i^{(k:j)} \left( h^{(j)}_{i - 1} \right) \right) \), where

\(i\) — layer index

\(j, k\) — column indices

\( \operatorname{MLP}_i^{(k:j)} \left( h^{(j)}_{i - 1} \right) = U_i^{(k:j)} \sigma \left( P_i^{(k:j)} \alpha_{i - 1}^{(j)} h^{(j)}_{i - 1} \right) \)

\( W_i^{(k)} \), \( U_i^{(k:j)} \) — linear layers

\( P_i^{(k:j)} \) — projection matrix

\( \alpha_{i - 1}^{(j)} \) — learnable scalar

\( \sigma \) — some nonlinearity
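The layer equation above can be sketched in plain Python. This is an illustration under assumptions: vectors are lists of floats, matrices are lists of rows, \( \sigma \) is taken to be `tanh` (the slide only says "some nonlinearity"), biases are omitted, and all function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter(U, P, alpha, h_prev_j):
    """MLP_i^{(k:j)}(h) = U * sigma(P * (alpha * h)), with sigma = tanh here."""
    scaled = [alpha * x for x in h_prev_j]
    hidden = [math.tanh(x) for x in matvec(P, scaled)]
    return matvec(U, hidden)


def progressive_layer(W, h_prev_k, laterals):
    """Compute h_i^{(k)} for column k.

    `laterals` holds one (U, P, alpha, h_prev_j) tuple per frozen column
    j < k, where h_prev_j is that column's activation at layer i - 1.
    """
    pre_activation = matvec(W, h_prev_k)
    for U, P, alpha, h_prev_j in laterals:
        lateral = adapter(U, P, alpha, h_prev_j)
        pre_activation = [a + b for a, b in zip(pre_activation, lateral)]
    return [max(0.0, x) for x in pre_activation]  # ReLU


# Column 2 with one lateral connection from a frozen column 1.
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[2.0], [3.0]]   # maps the 1-dimensional projection back to 2 units
P = [[1.0, 0.0]]     # projects the 2-dimensional h^{(1)} down to 1 unit
h_alone = progressive_layer(W, [1.0, -2.0], laterals=[])
h_with_lateral = progressive_layer(W, [1.0, -2.0], [(U, P, 1.0, [100.0, 0.0])])
```

Only `W`, `U`, `P` and `alpha` belonging to the new column would be trained; the activations `h_prev_j` come from earlier columns whose weights stay frozen.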

Evaluation baselines:

- Train from scratch
- Finetune last layer
- Finetune all layers
- Random frozen first column

Fernando et al. (DeepMind), arXiv 2017, 263 citations

Architecture:

- \( L \) layers, \( M \) modules in each, each module is a small network
- Pathway = a set of "active" modules in each layer (at most \( N \) in each layer, typically \( N = 3 \) or \( 4 \))
- Different pathways can (and do) share modules
- Outputs of modules are summed between layers
- Final layer is not shared between tasks

Training:

- Spawn a bunch of workers, train them for \( T \) epochs
- Fitness = average reward during training
- Once \( T \) epochs are complete, worker writes its own fitness to a shared array and compares it to \( B \) other fitness values
- If there is a larger fitness value among them, the worker's pathway is overwritten with a mutated copy of the winner's pathway

After training, winning pathway's weights are frozen before starting to train for another task. Other weights are reinitialized.
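The selection step of the training loop can be sketched as follows. This is a simplified illustration, not the paper's implementation: a pathway is represented as a list of module indices per layer, the mutation operator (resampling an index uniformly) is an assumption made for brevity, and all names and parameters are hypothetical.

```python
import random


def random_pathway(num_layers, num_modules, max_active, rng):
    """Pick max_active module indices per layer (duplicates allowed,
    so at most max_active distinct modules are active in each layer)."""
    return [[rng.randrange(num_modules) for _ in range(max_active)]
            for _ in range(num_layers)]


def mutate(pathway, num_modules, mutation_rate, rng):
    """Return a copy in which each index is resampled with prob. mutation_rate."""
    return [[rng.randrange(num_modules) if rng.random() < mutation_rate else m
             for m in layer]
            for layer in pathway]


def tournament(population, fitness, worker, num_rivals,
               num_modules, mutation_rate, rng):
    """Compare one worker against num_rivals random others.

    If a rival has strictly larger fitness, the worker's pathway is
    replaced by a mutated copy of the best rival's pathway.
    Returns True if the worker was overwritten.
    """
    others = [i for i in range(len(population)) if i != worker]
    rivals = rng.sample(others, num_rivals)
    best = max(rivals, key=lambda i: fitness[i])
    if fitness[best] > fitness[worker]:
        population[worker] = mutate(population[best], num_modules,
                                    mutation_rate, rng)
        return True
    return False
```

In a full system each worker would call `tournament` after finishing its \( T \) epochs, with `fitness` playing the role of the shared array and `num_rivals` the role of \( B \).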

- All 57 Atari games with one network: https://arxiv.org/abs/1809.04474
- Policy distillation: https://t.me/adv_topics_in_rl_ru_2020/909