Affordances and alternatives

Martin Biehl

Background

Given:
- Agent with complex action space
- Complex environment
How to train / prepare an easy interface for making the agent achieve arbitrary tasks?
Main problem:
- sparse rewards require finding long sequences of actions

Background

Broad distinctions of approaches:
- Meta-learning
- Hierarchical RL
- Auxiliary tasks
  - intrinsic motivations
- Model based RL
- Neuroevolution

Affordances

Learn predictor (world model) \(p(s_{t+1}|a_t,s_t)\) from random samples \((s,a,s')\)
fix affordance space \(\Omega\), sensor target space \(S_T \subseteq S\) with distance \(d\), and affordance grid \(\Omega_G \subset \Omega\)
learn proposer (policy) \(\pi_\theta(a|s,\omega)\) where \(\omega \in \Omega\)
- to each \(\omega_i \in \Omega_G\) associate an affordance (policy) \(\pi(a|s,\omega_i)\) and resulting sensor value \(s_f(\omega_i)\)
- learn \(\theta\) by maximizing the objective \[J(\theta)=\min_{i \neq j} d(s_f(\omega_i),s_f(\omega_j))\]
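
A minimal sketch of evaluating this objective for a given set of outcomes, assuming \(d\) is the Euclidean distance, that pairs with \(i = j\) are excluded, and that larger values (more spread-out outcomes) are better. The function name `affordance_objective` and the example outcomes are hypothetical and only illustrate the formula.

```python
import numpy as np

def affordance_objective(final_sensor_values):
    """Smallest pairwise Euclidean distance between the outcomes s_f(omega_i)."""
    s = np.asarray(final_sensor_values, dtype=float)
    n = len(s)
    # Pairwise distance matrix d(s_f(omega_i), s_f(omega_j)).
    diffs = s[:, None, :] - s[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    # Exclude i == j, whose distance is trivially zero.
    dist[np.eye(n, dtype=bool)] = np.inf
    return dist.min()

# Hypothetical outcomes of three affordances in a 2D target space.
clustered = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
j_clustered = affordance_objective(clustered)
j_spread = affordance_objective(spread)
```

Under these assumptions the spread-out outcomes score higher than the clustered ones, which is the pressure that pushes the grid affordances apart in target space.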

Affordances

Can view different \(\omega_i\) as different goals, tasks, or skills
learning the proposer then means to generate and learn to achieve a set of goals/tasks/skills that pervade the target sensor space
Can also view an affordance target space as a kind of controllable feature. Maybe it is possible to build up one-dimensional, independent target spaces and obtain independently controllable features in this way?

Affordances

Resulting interface:
- Continuous, low dimensional affordance space as a space of "macro-actions" that can achieve outcomes in target sensor space (by interpolation of grid points)
Disadvantages:
- need distance in target space
- not clear what to do if we want to control high dimensional \(S_T\) with low dimensional \(\Omega\) (\(S^G\) will have to fold)
- all affordances have identical number of time steps (seems inefficient)

MAML

Model-Agnostic Meta-Learning
General idea: pretrain a network on a set of tasks/datasets such that the resulting network reaches a low loss after one gradient step on any one of them \[\theta:=\arg\min_{\bar{\theta}} \sum_i \mathcal{L}_i(\theta_i'(\bar{\theta}))\] where \[\theta_i'(\bar{\theta}) = \bar{\theta} - \alpha \nabla_{\bar{\theta}} \mathcal{L}_i(\bar{\theta})\]
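
The two formulas can be sketched on a toy problem. The example below assumes quadratic task losses \(\mathcal{L}_i(\theta) = \tfrac{1}{2}\|\theta - c_i\|^2\), for which both the inner gradient and the gradient through the inner step are available in closed form. The task centers `c_i`, the outer step size `beta`, and the function name are illustrative assumptions, not part of MAML itself.

```python
import numpy as np

def maml_quadratic(centers, alpha=0.1, beta=0.5, steps=200):
    """Meta-train theta_bar for toy losses L_i(theta) = 0.5 * ||theta - c_i||^2."""
    centers = np.asarray(centers, dtype=float)
    theta_bar = np.zeros(centers.shape[1])
    for _ in range(steps):
        meta_grad = np.zeros_like(theta_bar)
        for c in centers:
            # Inner step: theta_i' = theta_bar - alpha * grad L_i(theta_bar).
            theta_i = theta_bar - alpha * (theta_bar - c)
            # Outer gradient via the chain rule; here d theta_i' / d theta_bar = (1 - alpha) * I.
            meta_grad += (1.0 - alpha) * (theta_i - c)
        # Gradient descent on the average post-adaptation loss.
        theta_bar -= beta * meta_grad / len(centers)
    return theta_bar

# Three hypothetical tasks whose optima are the rows below.
centers = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
theta = maml_quadratic(centers)
```

For this symmetric toy case the meta-learned initialization ends up at the mean of the task optima. With neural networks the outer gradient requires differentiating through the inner update, which an automatic differentiation library would normally handle.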

MAML

For reinforcement learning, pretrain a policy network \(\pi(a|s,\theta)\) (https://arxiv.org/pdf/1806.04640.pdf)
- Generate goals/tasks/skills automatically
- pretrain policy for those tasks

MAML

How to generate goals/tasks/skills (in the form of reward functions \(r_i(s)\)) automatically?
- E.g. via "Diversity Is All You Need" (DIAYN), i.e. generate a set of skills \(\pi(a|s,z)\) parameterised by \(z\) such that \(I(Z:S)+H(A|S,Z)\) is maximized. For this, a discriminator \(D_\phi(z|s)\) that predicts the skill \(z\) from the state \(s\) is learned, and the reward for task \(z\) is then \(\log D_\phi(z|s)\).
- via affordances? Reward for task \(i\) could be distance to outcome \(s(\omega_i)\)
- Novelty search via evolution?
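
A minimal sketch of the discriminator-based reward from the first option, assuming a finite set of skills and a discriminator that outputs one unnormalized score per skill for the current state. The function name `skill_reward` and the example scores are hypothetical. The published DIAYN reward additionally subtracts \(\log p(z)\), which is omitted here to match the form given on this slide.

```python
import numpy as np

def skill_reward(logits, z):
    """Return log D(z|s) given the discriminator's unnormalized scores for state s."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the skills.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return log_probs[z]

# Hypothetical scores: the state looks typical for skill 0.
r_matching = skill_reward([2.0, 0.0, 0.0], z=0)
r_other = skill_reward([2.0, 0.0, 0.0], z=1)
r_uninformative = skill_reward([0.0, 0.0, 0.0], z=1)
```

A skill earns a higher reward in states from which the discriminator can identify it, so skills are pushed toward visiting distinguishable parts of the state space.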

MAML

Resulting interface:
- pretrained policy network \(\pi_\theta(a|s)\) that speeds up adaptation

MAESN and ProMP

MAESN and ProMP are improvements of MAML with better exploration

Modular meta learning

Hierarchical-DQN

Hierarchical Deep Q-Network for RL
Learn two \(Q\) functions
- \(Q_2(g,s;\theta_2)\) that evaluates goals/skills/tasks \(g\) as macro-actions in state \(s\)
- \(Q_1(a,s;g,\theta_1)\) that evaluates actions with respect to goal \(g\) in state \(s\)
Agent picks goal according to \(Q_2(g,s;\theta_2)\) then picks actions according to \(Q_1(a,s;g,\theta_1)\) until goal reached
Goals/skills/tasks must be predefined or generated separately e.g. DIAYN, affordances,...
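
The two-level control loop can be sketched with tabular stand-ins for the two networks. Everything below is an illustrative assumption: goals are taken to be states of a 5-state chain, the tables `q2` and `q1` are filled by hand instead of being learned, and the episode ends when the best-rated goal is the current state.

```python
import numpy as np

def run_hierarchical_episode(q2, q1, step, start, max_steps=50):
    """Greedy two-level control: pick a goal from q2, then act from q1 until it is reached."""
    state, goals_chosen, steps = start, [], 0
    while steps < max_steps:
        goal = int(np.argmax(q2[state]))  # meta-controller: Q_2(g, s)
        if goal == state:
            break  # toy convention: the best goal is already reached
        goals_chosen.append(goal)
        while state != goal and steps < max_steps:
            action = int(np.argmax(q1[goal, state]))  # controller: Q_1(a, s; g)
            state = step(state, action)
            steps += 1
    return state, goals_chosen

n = 5  # chain of states 0..4; action 0 moves left, action 1 moves right

def step(s, a):
    return min(max(s + (1 if a == 1 else -1), 0), n - 1)

# Hand-made controller values: move toward the goal.
q1 = np.zeros((n, n, 2))
for g in range(n):
    for s in range(n):
        q1[g, s, 1] = 1.0 if s < g else 0.0
        q1[g, s, 0] = 1.0 if s > g else 0.0

# Hand-made meta-controller values: first reach state 2, then state 4.
q2 = np.zeros((n, n))
q2[:2, 2] = 1.0
q2[2:, 4] = 1.0

final_state, goals_chosen = run_hierarchical_episode(q2, q1, step, start=0)
```

In the actual method both value functions are learned, the controller from an intrinsic reward for reaching the current goal and the meta-controller from the environment reward.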

Hierarchical-DQN

The Option-Critic Architecture
- Parameterize option policies and termination conditions
- Learn both via gradient descent using actor-critic like setup.
FeUdal Networks for Hierarchical Reinforcement Learning
- similar, but beats Option-Critic
A DeepMind approach exists.

Auxiliary tasks

RL with unsupervised auxiliary tasks
Add additional tasks that (parts of) the architecture have to solve in order to extract more structure from experience
Auxiliary tasks can be
- control tasks (with own policy networks and Q-functions)
  - sensor value control
  - feature control (in deeper layers) as in independently controllable features
- value prediction tasks
also use experience replay extensively

Auxiliary tasks

Hindsight experience replay (OpenAI)
- Create a large set of arbitrary binary goals and learn a universal value function \(Q(a,s,g)\) using experience replay for a subset of goals after each episode
- Can learn something from unsuccessful experiences
- Can solve bit-flipping up to 40 bits instead of 13
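
A minimal sketch of the relabeling step on a 3-bit version of the bit-flipping task. It assumes the substitute goal is the final state reached in the episode and a sparse reward of 0 when the goal is met and -1 otherwise. The helper names `flip` and `hindsight_relabel` are hypothetical, and learning \(Q(a,s,g)\) from the resulting replay buffer is not shown.

```python
import numpy as np

def flip(bits, i):
    """Bit-flipping environment step: toggle bit i."""
    out = bits.copy()
    out[i] ^= 1
    return out

def hindsight_relabel(episode, goal):
    """Return transitions for the original goal plus copies relabeled with the final state."""
    achieved = episode[-1][2]  # final next-state, used as the substitute goal
    replay = []
    for state, action, next_state in episode:
        for g in (goal, achieved):
            # Sparse binary reward: 0 if the goal is met, -1 otherwise.
            reward = 0.0 if np.array_equal(next_state, g) else -1.0
            replay.append((state, action, next_state, g, reward))
    return replay

# An unsuccessful episode: the goal is all ones, but only two bits get flipped.
start = np.zeros(3, dtype=int)
goal = np.ones(3, dtype=int)
s1 = flip(start, 0)
s2 = flip(s1, 1)
episode = [(start, 0, s1), (s1, 1, s2)]
replay = hindsight_relabel(episode, goal)
rewards = [r for *_, r in replay]
```

Although the episode never reaches the original goal, the relabeled copy of its last transition carries a non-negative reward, which is how unsuccessful experience still provides a learning signal.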

Auxiliary tasks

Hindsight experience replay

Auxiliary tasks

Model Learning for Look-ahead Exploration in Continuous Control

Intrinsic motivations

Often implemented as auxiliary tasks
A prediction-based auxiliary task (random network distillation) used to be state of the art in Montezuma's Revenge
Before that, it was a knowledge-seeking additional reward (count-based exploration)
Current state of the art exploits ability to go back to promising situations and explore again from there. When it finds a solution it reinforces it by imitation learning.

Model based meta-RL

Model based approaches predict future states of the environment and not only future rewards (like Q-functions do)
The advantage is high data efficiency; the problems are model misspecification and computational cost

Model based meta-RL

Meta Reinforcement Learning with Latent Variable Gaussian Processes (Saemundsson et al.)
- Introduce latent (meta-) variable \(h\) that identifies the environment (e.g. mass and length of pole in cart pole)
- model environment dynamics \(p(x_{t+1}|a_t,x_t,h,\theta)\) with a Gaussian process
- train the Gaussian process on datasets from different environments
- infer \(h\) on test environments and use it to predict future / select actions
Similar to FEP without intrinsic motivation
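
The idea of inferring \(h\) on a test environment can be illustrated with a much simpler stand-in. The sketch below replaces the Gaussian process with an assumed known linear model \(x_{t+1} = x_t + h\,a_t\) plus Gaussian noise and picks \(h\) by maximum likelihood over a small candidate set. The model, the candidate values, and the function name are assumptions for illustration only; the paper's own inference procedure is not reproduced here.

```python
import numpy as np

def infer_latent(xs, actions, candidates, noise_std=0.1):
    """Pick the candidate h with the highest Gaussian log-likelihood of the observed transitions."""
    xs = np.asarray(xs, dtype=float)
    actions = np.asarray(actions, dtype=float)
    log_liks = []
    for h in candidates:
        # Stand-in for the dynamics model p(x_{t+1} | a_t, x_t, h).
        predicted = xs[:-1] + h * actions
        residual = xs[1:] - predicted
        # Log-likelihood up to a constant that does not depend on h.
        log_liks.append(-0.5 * np.sum((residual / noise_std) ** 2))
    return candidates[int(np.argmax(log_liks))]

# Hypothetical test trajectory generated with h = 2.0 and no noise.
actions = [1.0, -1.0, 0.5]
xs = [0.0, 2.0, 0.0, 1.0]
h_hat = infer_latent(xs, actions, candidates=[0.5, 1.0, 2.0, 3.0])
```

Once \(h\) is estimated from a few transitions, the same shared model can be used for prediction and action selection in the new environment without retraining.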

Model based meta-RL

Model-Based Reinforcement Learning via Meta-Policy Optimization (Clavera et al.)
- Based on MAML, but the tasks share an identical reward and differ in dynamics
- learn an ensemble of deterministic DNN environment models on different subsets of data
- use usual MAML to train a policy that can adapt to each of them as fast as possible
- Nice exploration side effects

Model based meta-RL

Model-Based Reinforcement Learning via Meta-Policy Optimization (Clavera et al.)
- Better than (non-MAML) model free approaches

Model based meta-RL

Model-Based Reinforcement Learning via Meta-Policy Optimization (Clavera et al.)
- Better than SOTA model based approaches

CF-GPS

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search (Buesing et al., DeepMind)
- Model-based RL using structural causal models with interventional calculus
- Show that interventional calculus allows improved counterfactual policy evaluation