Affordances and alternatives

Martin Biehl

Background

  • Given:
    • Agent with complex action space
    • Complex environment
  • How do we train / prepare an easy-to-use interface for making the agent achieve arbitrary tasks?
  • Main problem:
    • sparse rewards require finding long sequences of actions

Background

  • Broad distinctions of approaches:
    • Meta-learning
    • Hierarchical RL
    • Auxiliary tasks
      • intrinsic motivations
    • Model based RL
    • Neuroevolution

Affordances

  • Learn a predictor (world model) \(p(s_{t+1}|a_t,s_t)\) from random samples \((s,a,s')\)
  • Fix an affordance space \(\Omega\), a target sensor space \(S_T \subseteq S\) with distance \(d\), and an affordance grid \(\Omega_G \subset \Omega\)
  • Learn a proposer (policy) \(\pi_\theta(a|s,\omega)\) where \(\omega \in \Omega\)
    • to each \(\omega_i \in \Omega_G\) associate an affordance (policy) \(\pi_\theta(a|s,\omega_i)\) and a resulting sensor value \(s_f(\omega_i)\)
    • learn \(\theta\) by maximizing the minimal pairwise distance between affordance outcomes \[J(\theta)=\min_{i \neq j} d(s_f(\omega_i),s_f(\omega_j))\] (a training-loop sketch follows this list)
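
A minimal training-loop sketch of the proposer objective, assuming a toy deterministic world model, a linear proposer, squared Euclidean distance as \(d\), a single start state, and a fixed horizon \(T\); none of these choices are specified on the slides.

```python
# Hedged sketch: spread the outcomes s_f(omega_i) of the grid affordances as far
# apart as possible in the target sensor space. World model, proposer, sizes and
# horizon are illustrative stand-ins.
import jax
import jax.numpy as jnp

S_DIM, A_DIM, OMEGA_DIM, T = 4, 2, 2, 10

def world_model(s, a):
    # stand-in for the learned predictor p(s'|s, a)
    return s + 0.1 * jnp.tanh(jnp.concatenate([a, a])[:S_DIM])

def proposer(theta, s, omega):
    # stand-in for pi_theta(a|s, omega): a linear map of (s, omega)
    return jnp.tanh(theta @ jnp.concatenate([s, omega]))

def outcome(theta, omega, s0):
    # roll the affordance pi_theta(.|., omega_i) through the model for T steps
    def step(s, _):
        return world_model(s, proposer(theta, s, omega)), None
    s_final, _ = jax.lax.scan(step, s0, None, length=T)
    return s_final                      # s_f(omega)

def objective(theta, omega_grid, s0):
    # J(theta) = min_{i != j} d(s_f(omega_i), s_f(omega_j)), squared distance as d
    outs = jax.vmap(lambda w: outcome(theta, w, s0))(omega_grid)
    sq_dists = jnp.sum((outs[:, None, :] - outs[None, :, :]) ** 2, axis=-1)
    sq_dists = sq_dists + 1e9 * jnp.eye(omega_grid.shape[0])   # mask i == j
    return jnp.min(sq_dists)

key = jax.random.PRNGKey(0)
theta = 0.1 * jax.random.normal(key, (A_DIM, S_DIM + OMEGA_DIM))
omega_grid = jnp.stack(jnp.meshgrid(jnp.linspace(-1, 1, 3),
                                    jnp.linspace(-1, 1, 3)), -1).reshape(-1, OMEGA_DIM)
s0 = jnp.zeros(S_DIM)
for _ in range(100):
    theta = theta + 1e-2 * jax.grad(objective)(theta, omega_grid, s0)  # ascend J
```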

Affordances

  • Can view different \(\omega_i\) as different goals, tasks, or skills
  • Learning the proposer then means generating, and learning to achieve, a set of goals/tasks/skills that cover the target sensor space
  • Can also view an affordance target space as a kind of controllable feature; perhaps building up one-dimensional, independent target spaces is possible, so that we would get independently controllable features in this way?

Affordances

  • Resulting interface:
    • Continuous, low-dimensional affordance space as a space of "macro-actions" that can achieve outcomes in the target sensor space (by interpolation of grid points)
  • Disadvantages:
    • need a distance measure on the target space
    • not clear what to do if we want to control a high-dimensional \(S_T\) with a low-dimensional \(\Omega\) (\(S^G\) will have to fold)
    • all affordances run for the same number of time steps (seems inefficient)

MAML

  • Model-Agnostic Meta-Learning (MAML)
  • General idea: pretrain a network on a set of tasks/datasets such that the resulting network adapts as quickly as possible when fine-tuned on any one of them \[\theta := \arg\min_{\bar{\theta}} \sum_i \mathcal{L}_i\big(\theta_i'(\bar{\theta})\big)\] where \[\theta_i'(\bar{\theta}) = \bar{\theta} - \alpha \nabla_{\bar{\theta}} \mathcal{L}_i(\bar{\theta})\] (a meta-gradient sketch follows this list)
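
A minimal meta-gradient sketch of this objective in JAX, assuming a toy quadratic per-task loss and hand-picked step sizes; these merely stand in for the actual task losses.

```python
# Hedged MAML sketch: one outer meta-gradient step over a set of toy tasks.
# The quadratic loss, dimensions and learning rates are illustrative assumptions.
import jax
import jax.numpy as jnp

ALPHA, BETA = 0.1, 0.01          # inner and outer learning rates

def task_loss(theta, task):
    # stand-in per-task loss L_i(theta): pull theta towards a task-specific target
    return jnp.sum((theta - task) ** 2)

def adapted_params(theta_bar, task):
    # theta_i'(theta_bar) = theta_bar - alpha * grad L_i(theta_bar)
    return theta_bar - ALPHA * jax.grad(task_loss)(theta_bar, task)

def meta_loss(theta_bar, tasks):
    # sum_i L_i(theta_i'(theta_bar)); differentiating this goes through the
    # inner gradient step (second-order terms handled automatically by JAX)
    losses = jax.vmap(lambda t: task_loss(adapted_params(theta_bar, t), t))(tasks)
    return jnp.sum(losses)

tasks = jnp.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # toy "tasks"
theta_bar = jnp.zeros(2)
for _ in range(100):
    theta_bar = theta_bar - BETA * jax.grad(meta_loss)(theta_bar, tasks)
```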

MAML

  • For reinforcement learning, pretrain a policy network \(\pi(a|s,\theta)\) (https://arxiv.org/pdf/1806.04640.pdf)
    • Generate goals/tasks/skills automatically
    • pretrain policy for those tasks

MAML

  • How to generate goals/tasks/skills (in the form of reward functions \(r_i(s)\)) automatically?
    • E.g. via "Diversity is All You Need": generate a set of skills \(\pi(a|s,z)\) parameterised by \(z\) such that \(I(Z:S)+H(A|S,Z)\) is maximized. For this, a discriminator \(D_\phi(z|s)\) that predicts the skill \(z\) from the state \(s\) is learned, and the reward for skill \(z\) is then \(\log D_\phi(z|s)\) (see the sketch after this list).
    • Via affordances? The reward for task \(i\) could be the (negative) distance to the outcome \(s_f(\omega_i)\)
    • Novelty search via evolution?
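
A sketch of the DIAYN-style skill reward, assuming a simple linear discriminator; the state dimension, number of skills and step size are illustrative.

```python
# Hedged sketch: a discriminator predicts the skill z from the state s, and the
# intrinsic reward for skill z is its log-probability log D_phi(z|s).
import jax
import jax.numpy as jnp

N_SKILLS, S_DIM = 8, 4

def discriminator_logits(phi, s):
    # stand-in D_phi(z|s): a linear classifier over skills
    return phi @ s                                   # shape (N_SKILLS,)

def skill_reward(phi, s, z):
    # r_z(s) = log D_phi(z|s)   (DIAYN additionally subtracts log p(z))
    return jax.nn.log_softmax(discriminator_logits(phi, s))[z]

def discriminator_loss(phi, states, skills):
    # train the discriminator to predict which skill produced each state
    losses = jax.vmap(lambda s, z: -skill_reward(phi, s, z))(states, skills)
    return jnp.mean(losses)

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
phi = 0.1 * jax.random.normal(k1, (N_SKILLS, S_DIM))
states = jax.random.normal(k2, (32, S_DIM))              # states visited by skills
skills = jax.random.randint(k3, (32,), 0, N_SKILLS)      # which skill was active
phi = phi - 1e-2 * jax.grad(discriminator_loss)(phi, states, skills)
r = skill_reward(phi, states[0], skills[0])              # intrinsic reward for the policy
```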

MAML

  • Resulting interface:
    • a pretrained policy network \(\pi_\theta(a|s)\) that speeds up adaptation to new tasks

MAESN and ProMP

  • MAESN and ProMP are improvements of MAML with better exploration

Modular meta learning

Hierarchical-DQN

  • Hierarchical Deep Q-Network for RL
  • Learn two \(Q\) functions
    • \(Q_2(g,s;\theta_2)\) that evaluates goals/skills/tasks \(g\) as macro-actions in state \(s\)
    • \(Q_1(a,s;g,\theta_1)\) that evaluates actions with respect to goal \(g\) in state \(s\)
  • The agent picks a goal according to \(Q_2(g,s;\theta_2)\), then picks actions according to \(Q_1(a,s;g,\theta_1)\) until the goal is reached (see the loop sketch after this list)
  • Goals/skills/tasks must be predefined or generated separately, e.g. via DIAYN, affordances, ...
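
A sketch of the resulting two-level action loop; `env`, `q1`, `q2`, `goal_reached` and the epsilon-greedy details are hypothetical placeholders for the learned components and the environment interface.

```python
# Hedged sketch of the two-level h-DQN loop: the meta-controller picks a goal g
# with Q2, the controller then acts with Q1(., .; g) until g is reached or the
# episode ends.
import numpy as np

def eps_greedy(q_values, eps=0.1):
    # argmax with probability 1 - eps, otherwise a random index
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def run_episode(env, q1, q2, goal_reached, max_steps=1000):
    # q2(s) -> Q-values over goals, q1(s, g) -> Q-values over primitive actions
    s, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        g = eps_greedy(q2(s))                      # meta-controller picks a goal
        while not done and t < max_steps and not goal_reached(s, g):
            a = eps_greedy(q1(s, g))               # controller acts towards goal g
            s, extrinsic_reward, done = env.step(a)
            t += 1
            # the controller is trained with an intrinsic reward for reaching g,
            # the meta-controller with the accumulated extrinsic reward
```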

Auxiliary tasks

  • RL with unsupervised auxiliary tasks
  • Add additional tasks that (parts of) the architecture have to solve in order to extract more structure from experience (combined loss sketched after this list)
  • Auxiliary tasks can be
    • control tasks (with own policy networks and Q-functions)
      • sensor value control
      • feature control (in deeper layers) as in independently controllable features
    • value prediction tasks
  • Also uses experience replay extensively
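
A sketch of how auxiliary losses are typically combined with the main RL loss through a shared encoder; all heads, weights and the data format are illustrative assumptions rather than the actual UNREAL architecture.

```python
# Hedged sketch: one shared encoder feeds the main RL head plus auxiliary heads
# for feature control and value/reward prediction; all losses are summed.
import jax
import jax.numpy as jnp

def encoder(params, obs):
    # shared representation used by all heads
    return jnp.tanh(params["enc"] @ obs)

def main_rl_loss(params, data):
    # stand-in for the base agent's actor-critic loss (value error only here)
    value = params["val"] @ encoder(params, data["obs"])
    return (value - data["return"]) ** 2

def feature_control_loss(params, data):
    # auxiliary control task: stand-in regression onto achieved feature changes
    q = params["aux_q"] @ encoder(params, data["obs"])
    return (q - data["feature_change"]) ** 2

def value_prediction_loss(params, data):
    # auxiliary value/reward prediction from the shared representation
    pred = params["aux_v"] @ encoder(params, data["obs"])
    return (pred - data["reward"]) ** 2

def total_loss(params, data, lam_c=0.1, lam_v=0.1):
    return (main_rl_loss(params, data)
            + lam_c * feature_control_loss(params, data)
            + lam_v * value_prediction_loss(params, data))

key = jax.random.PRNGKey(0)
params = {"enc": jax.random.normal(key, (16, 8)),
          "val": jax.random.normal(key, (16,)),
          "aux_q": jax.random.normal(key, (16,)),
          "aux_v": jax.random.normal(key, (16,))}
data = {"obs": jnp.ones(8), "return": 1.0, "feature_change": 0.5, "reward": 0.0}
grads = jax.grad(total_loss)(params, data)   # one shared gradient step updates all heads
```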

Auxiliary tasks

  • Hindsight experience replay (OpenAI)
    • Create a large set of arbitrary binary goals and learn a universal value function \(Q(a,s,g)\) with experience replay; after each episode, replay the experience relabelled with a subset of other goals as well (relabelling sketched after this list)
    • Can learn something from unsuccessful experiences
    • Can solve bit-flipping up to 40 bits instead of 13
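
A sketch of the hindsight relabelling step, assuming transitions of the form (state, action, next_state) and a simple equality test for goal achievement.

```python
# Hedged sketch: after an episode, store each transition both with its original
# goal and relabelled with "hindsight" goals taken from states reached later in
# the same episode (the "future" strategy).
import random

def reward(state, goal):
    # sparse binary reward: 1 if the goal is achieved in this state, else 0
    return 1.0 if state == goal else 0.0

def hindsight_relabel(episode, goal, replay_buffer, k=4):
    # episode: list of (state, action, next_state) tuples
    for t, (s, a, s_next) in enumerate(episode):
        # store the original, usually unsuccessful, transition
        replay_buffer.append((s, a, s_next, goal, reward(s_next, goal)))
        # additionally store the transition relabelled with up to k later outcomes
        future_states = [sn for (_, _, sn) in episode[t:]]
        for g in random.sample(future_states, min(k, len(future_states))):
            replay_buffer.append((s, a, s_next, g, reward(s_next, g)))
```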

Auxiliary tasks

Intrinsic motivations

  • Often implemented as auxiliary tasks
  • A prediction-based auxiliary task (Random Network Distillation) used to be state of the art on Montezuma's Revenge (see the sketch after this list)
  • Before that, the state of the art used a knowledge-seeking additional reward (count-based exploration)
  • The current state of the art (Go-Explore) exploits the ability to go back to promising situations and explore again from there; when it finds a solution, it reinforces it by imitation learning.
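
A sketch of the Random Network Distillation bonus, with toy two-layer networks standing in for the target and predictor networks; sizes and learning rate are illustrative.

```python
# Hedged RND sketch: the intrinsic reward is the error of a trained predictor at
# matching a fixed, randomly initialised target network on the current state.
import jax
import jax.numpy as jnp

S_DIM, F_DIM = 8, 16

def net(params, s):
    # small two-layer feature network used for both target and predictor
    return params["w2"] @ jnp.tanh(params["w1"] @ s)

def init(key):
    k1, k2 = jax.random.split(key)
    return {"w1": jax.random.normal(k1, (32, S_DIM)),
            "w2": jax.random.normal(k2, (F_DIM, 32))}

target_params = init(jax.random.PRNGKey(0))     # fixed, never trained
pred_params = init(jax.random.PRNGKey(1))       # trained to imitate the target

def intrinsic_reward(pred_params, s):
    # large prediction error = novel state = large exploration bonus
    return jnp.sum((net(pred_params, s) - net(target_params, s)) ** 2)

s = jax.random.normal(jax.random.PRNGKey(2), (S_DIM,))
r_int = intrinsic_reward(pred_params, s)
# the predictor is trained on visited states, so familiar states get a low bonus
pred_params = jax.tree_util.tree_map(
    lambda p, g: p - 1e-3 * g,
    pred_params, jax.grad(intrinsic_reward)(pred_params, s))
```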

Model based meta-RL

  • Model based approaches predict future states of the environment and not only future rewards (like Q-functions do)
  • The advantage is high data-efficiency; the main problems are model misspecification and computational cost

Model based meta-RL

  • Meta Reinforcement Learning with Latent Variable Gaussian Processes (Saemundsson et al.)
    • Introduce a latent (meta-)variable \(h\) that identifies the environment (e.g. mass and length of the pole in cart-pole)
    • model the environment dynamics \(p(x_{t+1}|a_t,x_t,h,\theta)\) with a Gaussian process
    • train the Gaussian process on datasets from different environments
    • infer \(h\) on test environments and use it to predict the future / select actions (see the sketch after this list)
  • Similar to FEP without intrinsic motivation
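
A sketch of the latent-variable idea with a simple parametric network standing in for the Gaussian process; the model, the dimensions, and the gradient-based inference of \(h\) are illustrative assumptions, not the method of the paper.

```python
# Hedged sketch: condition the dynamics model on a per-environment latent h and
# infer h for a new environment from a few observed transitions.
import jax
import jax.numpy as jnp

X_DIM, A_DIM, H_DIM = 3, 1, 2

def dynamics(theta, x, a, h):
    # stand-in for p(x_{t+1} | a_t, x_t, h, theta): predicted next-state mean
    inp = jnp.concatenate([x, a, h])
    return x + theta["w2"] @ jnp.tanh(theta["w1"] @ inp)

def prediction_loss(h, theta, transitions):
    # squared prediction error on transitions (x, a, x_next) from one environment
    xs, acts, xs_next = transitions
    preds = jax.vmap(lambda x, a: dynamics(theta, x, a, h))(xs, acts)
    return jnp.mean((preds - xs_next) ** 2)

def infer_latent(theta, transitions, steps=200, lr=1e-2):
    # infer h for a test environment by gradient descent on the prediction loss
    h = jnp.zeros(H_DIM)
    for _ in range(steps):
        h = h - lr * jax.grad(prediction_loss)(h, theta, transitions)
    return h

key = jax.random.PRNGKey(0)
theta = {"w1": 0.1 * jax.random.normal(key, (16, X_DIM + A_DIM + H_DIM)),
         "w2": 0.1 * jax.random.normal(key, (X_DIM, 16))}
transitions = (jnp.zeros((5, X_DIM)), jnp.zeros((5, A_DIM)), jnp.zeros((5, X_DIM)))
h_test = infer_latent(theta, transitions)   # latent identifying the test environment
```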

CF-GPS (Counterfactually-Guided Policy Search)

Neuroevolution

  • Learns an environment model on the latent space of an autoencoder
  • then learns (e.g. evolves a controller) to solve problems inside the model (see the sketch below)
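
A sketch of solving a task inside a learned latent model, with stand-in latent dynamics and reward heads and a simple population search in place of a proper evolution strategy such as CMA-ES.

```python
# Hedged sketch: controller parameters are evaluated only on imagined rollouts
# through the latent dynamics, scored by a learned reward head; a simple
# random-search "evolution" keeps the best candidate.
import jax
import jax.numpy as jnp

Z_DIM, A_DIM, T = 4, 1, 50

def latent_dynamics(z, a):
    # stand-in for the learned latent transition model
    return jnp.tanh(z + 0.1 * jnp.concatenate([a, a, a, a])[:Z_DIM])

def predicted_reward(z):
    # stand-in learned reward predictor on latent states
    return -jnp.sum(z ** 2)

def controller(w, z):
    return jnp.tanh(w @ z)                      # linear controller in latent space

def imagined_return(w, z0):
    def step(z, _):
        z_next = latent_dynamics(z, controller(w, z))
        return z_next, predicted_reward(z_next)
    _, rewards = jax.lax.scan(step, z0, None, length=T)
    return jnp.sum(rewards)

# population-based search over controller weights, entirely inside the model
key = jax.random.PRNGKey(0)
z0 = jnp.zeros(Z_DIM)
population = 0.5 * jax.random.normal(key, (64, A_DIM, Z_DIM))
returns = jax.vmap(lambda w: imagined_return(w, z0))(population)
best_w = population[jnp.argmax(returns)]
```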

Deep RL methods
