Dmitry Nikulin | 30 April
Learn task A, then use that to learn task B faster.
based on http://arxiv.org/abs/1910.10897
Reach state g:
$$R_g(s, a, s') = \mathbb{I}[s' = g]$$
$$\gamma_g(s) = 0.99 \cdot \mathbb{I}[s \neq g]$$
i.e. g becomes a pseudo-terminal state.
(this is pretty much the only case where γ is non-constant)
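As code, a minimal sketch of these two definitions, assuming exact state matching (the function names are mine, not from the paper):

```python
import numpy as np

def pseudo_reward(s_next, g):
    """R_g(s, a, s') = I[s' = g]: reward 1 only when the goal state is reached."""
    return float(np.array_equal(s_next, g))

def pseudo_discount(s, g, gamma=0.99):
    """gamma_g(s) = gamma * I[s != g]: the discount drops to 0 at the goal,
    which is what makes g act as a pseudo-terminal state."""
    return 0.0 if np.array_equal(s, g) else gamma
```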
There is no standard benchmark for multitask/transfer RL.
Each paper rolls its own evaluation.
Discrete actions: train a single agent to play multiple Atari games (select those where your results are most convincing).
Optionally: a Quake 3 fork.
Yu et al. (Stanford & co), CoRL 2019, 263 GitHub stars
But:
Schaul et al. (DeepMind), ICML 2015, 319 citations
Andrychowicz et al. (OpenAI), NIPS 2017, 512 citations
Motivating example:
tl;dr: very large state space with very sparse rewards, but easy to solve once we figure it out.
🚲 Evaluation: custom robotic arm manipulation environment based on MuJoCo
Cabi et al. (DeepMind), CoRL 2017, 17 citations
tl;dr: train a bunch of DDPG instances in parallel on one stream of experience while executing one of the policies being trained.
(intentional = behavioral policy, unintentional = other policies)
θ — actor parameters
w — critic parameters (w′ — target network)
i — index of a policy that solves i-th task
j — timestep
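With these symbols, here is a rough sketch of one IU-style update step. The separate actor/critic instance per task, the per-task reward functions relabeling the shared batch, and all hyperparameters are assumptions of this sketch, not the paper's code:

```python
import torch
import torch.nn.functional as F

def iu_update(actors, critics, target_critics, actor_opts, critic_opts,
              replay_buffer, reward_fns, gamma=0.99):
    """One off-policy DDPG update for every task i, all on the same batch of
    experience collected by a single behavioral (intentional) policy."""
    s, a, s_next = replay_buffer.sample()       # shared stream of experience
    for i in range(len(actors)):                # i indexes tasks/policies
        # Each task relabels the shared transitions with its own reward.
        r_i = reward_fns[i](s, a, s_next)
        # Critic update: TD target uses the task's target network (w').
        with torch.no_grad():
            a_next = actors[i](s_next)
            q_target = r_i + gamma * target_critics[i](s_next, a_next)
        critic_loss = F.mse_loss(critics[i](s, a), q_target)
        critic_opts[i].zero_grad()
        critic_loss.backward()
        critic_opts[i].step()
        # Actor update: deterministic policy gradient through the task's critic.
        actor_loss = -critics[i](s, actors[i](s)).mean()
        actor_opts[i].zero_grad()
        actor_loss.backward()
        actor_opts[i].step()
```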
🚲 Evaluation:
Example tasks:
Results:
Jaderberg et al. (DeepMind), ICLR 2017, 508 citations
Train an A3C agent together with a bunch of auxiliary tasks sharing the same network (the auxiliary control tasks are optimized via Q-learning):
Auxiliary control tasks:
Auxiliary reward prediction:
Auxiliary value function replay:
another illustration: https://github.com/miyosuda/unreal
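As a concrete example of one of these heads, here is a minimal sketch of the auxiliary reward-prediction part. The shared encoder producing `frame_features`, the 3-frame context, and the three-way sign classification are assumptions of this sketch based on the paper's description:

```python
import torch.nn as nn
import torch.nn.functional as F

class RewardPredictionHead(nn.Module):
    """Auxiliary reward-prediction head: given features of a few consecutive
    replayed frames, classify the sign of the upcoming reward
    (zero / positive / negative)."""
    def __init__(self, feature_dim, context=3, num_classes=3):
        super().__init__()
        self.fc = nn.Linear(feature_dim * context, num_classes)

    def forward(self, frame_features):          # (batch, context, feature_dim)
        return self.fc(frame_features.flatten(start_dim=1))

def reward_prediction_loss(head, frame_features, reward_sign_labels):
    # Cross-entropy on replayed frame windows; gradients flow back into the
    # shared encoder that produced frame_features, shaping its representation.
    return F.cross_entropy(head(frame_features), reward_sign_labels)
```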
Wulfmeier et al. (BAIR), CoRL 2017, 27 citations
🚲 Evaluation: transfer between two different simulators.
Hausman et al. (DeepMind), ICLR 2018, 76 citations
In training, a weighted sum of the following terms is maximized:
- env. reward
- log-likelihood of the true z under the inference network
- policy entropy
- embedding entropy
(this is actually maximization of a lower bound on the return of an entropy-regularized policy, derived via variational inference)
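Taking the list above literally, the objective can be sketched as a weighted sum; the coefficient names alpha1–alpha3 are placeholders (the paper weights each term with its own coefficient):

```python
def embedding_rl_objective(env_return, log_q_true_z, policy_entropy, embedding_entropy,
                           alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Weighted sum of the four maximized terms listed above."""
    return (env_return
            + alpha1 * log_q_true_z        # reward trajectories the inference net can decode
            + alpha2 * policy_entropy      # entropy-regularized policy
            + alpha3 * embedding_entropy)  # keep the task-embedding distribution broad
```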
For $p_\phi$ and $\pi_\theta$, this is done by learning a function $Q^\pi_\varphi$ on the replay buffer with an off-policy correction:
and then maximizing:
For $q_\psi$ (the inference network), we maximize:
This is supervised learning, done offline, using the replay buffer.
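A minimal sketch of that supervised step, assuming the replay buffer stores trajectory segments together with the true z that generated them; the Gaussian inference network and its architecture are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """q_psi(z | trajectory segment): predicts a Gaussian over the latent z
    that was active when the segment was collected."""
    def __init__(self, segment_dim, z_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(segment_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def forward(self, segment):
        h = self.body(segment)
        return torch.distributions.Normal(self.mean(h), self.log_std(h).exp())

def inference_update(q_psi, optimizer, segments, true_z):
    """Offline supervised step on replayed data: maximize log q_psi(true z | segment)."""
    optimizer.zero_grad()
    loss = -q_psi(segments).log_prob(true_z).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```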
The log-likelihood-of-true-z term in the agent's reward helps generate more diverse trajectories:
🚲 Evaluation: transfer learning for robotic manipulation.
* the text actually says "state-embedding network $z = f_\vartheta(x)$", but this seems to be a mistake
Rusu et al. (DeepMind), arXiv 2016, 592 citations
$$h_i^{(k)} = \mathrm{ReLU}\!\left(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j=1}^{k-1} \mathrm{MLP}_i^{(k:j)}\!\left(h_{i-1}^{(j)}\right)\right),$$ where
$i$ — layer index
$j, k$ — column indices
$$\mathrm{MLP}_i^{(k:j)}\!\left(h_{i-1}^{(j)}\right) = U_i^{(k:j)}\, \sigma\!\left(P_i^{(k:j)} \alpha_{i-1}^{(j)} h_{i-1}^{(j)}\right)$$
$W_i^{(k)}$, $U_i^{(k:j)}$ — linear layers
$P_i^{(k:j)}$ — projection matrix
$\alpha_{i-1}^{(j)}$ — learnable scalar
$\sigma$ — some nonlinearity
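A sketch of one such layer for column k, following the formula above. Using ReLU for $\sigma$, equal layer widths across columns, and the adapter width are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class ProgressiveLayer(nn.Module):
    """Layer i of column k: its own linear map W_i^(k) plus one adapter
    MLP_i^(k:j) per previously trained (frozen) column j < k."""
    def __init__(self, in_dim, out_dim, num_prev_columns, adapter_dim=32):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim)                 # W_i^(k)
        self.alphas = nn.ParameterList(                     # learnable scalars alpha_{i-1}^(j)
            [nn.Parameter(torch.ones(1)) for _ in range(num_prev_columns)])
        self.P = nn.ModuleList(                             # projection matrices P_i^(k:j)
            [nn.Linear(in_dim, adapter_dim, bias=False) for _ in range(num_prev_columns)])
        self.U = nn.ModuleList(                             # linear layers U_i^(k:j)
            [nn.Linear(adapter_dim, out_dim, bias=False) for _ in range(num_prev_columns)])

    def forward(self, h_own, h_prev_columns):
        # h_own: h_{i-1}^(k) from this column; h_prev_columns: [h_{i-1}^(j) for j < k]
        out = self.W(h_own)
        for alpha, P, U, h_j in zip(self.alphas, self.P, self.U, h_prev_columns):
            out = out + U(torch.relu(P(alpha * h_j)))       # MLP_i^(k:j)(h_{i-1}^(j))
        return torch.relu(out)                              # h_i^(k)
```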
Evaluation baselines:
Fernando et al. (DeepMind), arXiv 2017, 263 citations
Architecture:
Training:
After training, the winning pathway's weights are frozen before training on another task begins. All other weights are reinitialized.
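A sketch of that freeze-and-reinitialize step, treating the network as a flat list of modules and the winning pathway as a set of module indices; both are simplifications of PathNet's layered module grid:

```python
import torch.nn as nn

def freeze_winning_pathway(modules, winning_indices):
    """Freeze the modules used by the winning pathway; reinitialize everything
    else before training on the next task."""
    for idx, module in enumerate(modules):
        if idx in winning_indices:
            for p in module.parameters():
                p.requires_grad = False        # frozen: reused as-is by future tasks
        else:
            for layer in module.modules():
                if hasattr(layer, "reset_parameters"):
                    layer.reset_parameters()   # reinitialized: free capacity for the next task
```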