Martin Biehl
Motivation is whatever generates behavior for an agent (robot, living organism).
The term comes originally from psychology, e.g. Ryan and Deci (2000):
doing an activity for its inherent satisfaction rather than for some separable consequence,
for the fun or challenge entailed rather than because of external products, pressures, or rewards.
Examples (Oudeyer and Kaplan, 2008):
But one can always argue:
Developmental robotics:
Working definition compatible with Oudeyer and Kaplan (2008):
Motivation is intrinsic if it is embodiment independent.
This includes the approach by Schmidhuber (2010).
Embodiment independent means it should work (without changes) for any form of agent and produce "worthwhile" behavior.
This implies it should be semantic free, i.e. information theoretic.
Another important, but not defining, feature is known from evolution: open-endedness.
The motivation should not vanish until the capacities of the agent are exhausted.
Other applications of intrinsic motivations:
Sparse reward reinforcement learning:
AGI:
Advantages of intrinsic motivations:
Disadvantage:
Examples: the dark room problem.
Solution for the dark room problem:
The setting is similar to reinforcement learning for POMDPs, but we assume there is no extrinsic reward.
\(E\): Environment state
\(S\): Sensor state
\(A\): Action
\(M\): Agent memory state
\(\text{p}(e_0)\): initial distribution
\(\text{p}(s|e)\): sensor dynamics
\(\text{p}(m'|s,a,m)\): memory dynamics
\(\text{p}(a|m)\): action generation
\(\text{p}(e'|a',e)\): environment dynamics
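A minimal simulation sketch of this perception-action loop, with made-up toy distributions standing in for \(\text{p}(e_0)\), \(\text{p}(s|e)\), \(\text{p}(m'|s,a,m)\), \(\text{p}(a|m)\) and \(\text{p}(e'|a',e)\) (all numbers and the random policy are illustrative assumptions, not part of the formalism):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy kernels: two environment states, two sensor values, two actions,
# memory = full sensor-action history. All probabilities are made up.
def sample_e0():                      # p(e_0): initial environment state
    return rng.choice(2, p=[0.5, 0.5])

def sample_s(e):                      # p(s|e): sensor dynamics
    return rng.choice(2, p=[0.9, 0.1] if e == 0 else [0.2, 0.8])

def update_m(m, s, a):                # p(m'|s,a,m): deterministic history memory
    return m + [(s, a)]

def sample_a(m):                      # p(a|m): here just a uniformly random policy
    return rng.choice(2)

def sample_e_next(a_next, e):         # p(e'|a',e): environment dynamics
    flip = 0.8 if a_next == 1 else 0.1
    return 1 - e if rng.random() < flip else e

# Roll out the loop for a few time steps.
e, m, a = sample_e0(), [], 0          # arbitrary initial action a_0 = 0
for t in range(5):
    s = sample_s(e)                   # sense the current environment state
    m = update_m(m, s, a)             # store (sensor value, last action) in memory
    a = sample_a(m)                   # generate the next action from memory
    e = sample_e_next(a, e)           # environment reacts to that action
    print(t, e, s, a)
```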
Joint distribution until final time \(t=T\):
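A factorization consistent with the kernels above (the time indexing and the treatment of the initial memory \(m_0\) and action \(a_0\) are assumptions on my part):

$$
\text{p}(e_{0:T},s_{0:T},a_{1:T},m_{1:T}\mid a_0,m_0)=\text{p}(e_0)\,\text{p}(s_0|e_0)\prod_{t=1}^{T}\text{p}(e_t|a_t,e_{t-1})\,\text{p}(s_t|e_t)\,\text{p}(a_t|m_t)\,\text{p}(m_t|s_{t-1},a_{t-1},m_{t-1}).
$$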
Assumptions:
Only missing: the action generation \(\text{p}(a|m)\).
Remarks:
2. Generative model
Model split up into three parts:
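Plausibly (judging from the variables that appear in the complete posterior used below; this reading is an assumption), the three parts are a sensor model, an environment dynamics model, and a prior over parameters, e.g. \(\text{q}(\hat{s}|\hat{e},\theta)\), \(\text{q}(\hat{e}'|\hat{e},\hat{a},\theta)\) and \(\text{q}(\theta|\xi)\) with hyperparameters \(\xi\).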
3. Inference / prediction
As time passes in the perception-action loop, the agent accumulates its sensor-action history.
So at time \(t\) the agent can plug \(m_t=sa_{\prec t}\) into the model.
The model then predicts the consequences of assumed actions \(\hat{a}_{t:\hat{T}}\) for the relations between future sensor values, environment states and model parameters.
Call \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\) the complete posterior.
Note: the same model can be used to predict the consequences of closed-loop policies \(\hat{\Pi}\).
The complete posterior factorizes into a posterior factor and a predictive factor:
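A sketch of this factorization, consistent with the notation above (the exact conditioning is my reconstruction):

$$
\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta\mid\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)=\underbrace{\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{t:\hat{T}}\mid\hat{e}_{t-1},\theta,\hat{a}_{t:\hat{T}})}_{\text{predictive factor}}\;\underbrace{\text{q}(\hat{e}_{\prec t},\theta\mid sa_{\prec t},\xi)}_{\text{posterior factor}}.
$$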
4. Action selection
General reinforcement learning (RL) evaluates action sequences by their expected cumulative reward \(Q(\hat{a}_{t:\hat{T}},sa_{\prec t}):=\mathbb{E}[R|\hat{a}_{t:\hat{T}},sa_{\prec t}]\).
Standard RL: the reward \(r_\tau\) is a function of the sensor value alone, so \(r(\hat{s}\hat{a}_{t:\tau},sa_{\prec t})=r(s_\tau)=r_\tau\).
For the case of evaluating policies: \(Q(\pi,sa_{\prec t}):=\mathbb{E}[R|\pi,sa_{\prec t}]\).
From the best sequence, select / perform the first action.
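A sketch of this selection step (the explicit formulas are my reconstruction of the standard procedure):

$$
R=\sum_{\tau=t}^{\hat{T}} r_\tau,\qquad \hat{a}^{*}_{t:\hat{T}}\in\arg\max_{\hat{a}_{t:\hat{T}}}Q(\hat{a}_{t:\hat{T}},sa_{\prec t}),\qquad a_t:=\hat{a}^{*}_t.
$$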
1. Free energy minimization
Actions should lead to environment states expected to have precise sensor values (e.g. Friston, Parr et al., 2017).
Get \(\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})\) from the complete posterior.
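One common way to write this is via the expected free energy, which the agent minimizes over action sequences; the desired distribution \(\text{p}_{\text{des}}\) and this particular decomposition are assumptions here, not taken from the definitions above:

$$
G(\hat{a}_{t:\hat{T}})=\text{KL}\!\left[\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})\,\middle\|\,\text{p}_{\text{des}}(\hat{e}_{t:\hat{T}})\right]+\mathbb{E}_{\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})}\!\left[\text{H}\!\left[\text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}})\right]\right].
$$

The first term is often called risk, the second ambiguity; the ambiguity term captures the preference for environment states with precise sensor values.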
2. Predictive information maximization
Actions should lead to the most complex sensor stream (Ay et al., 2008).
Georg Martius, Ralf Der
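Predictive information is the mutual information between the past and the future of the sensor stream; a sketch in the present notation (the split point \(\tau\) and the conditioning are assumptions):

$$
\text{PI}(\hat{a}_{t:\hat{T}})=I\!\left[\hat{S}_{t:\tau}:\hat{S}_{\tau+1:\hat{T}}\,\middle|\,\hat{a}_{t:\hat{T}},sa_{\prec t},\xi\right].
$$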
3. Knowledge seeking
Actions should lead to sensor values that tell the most about the hidden (environment) variables \(\hat{E}_{0:\hat{T}}\) and the model parameters \(\Theta\).
Bellemare et al. (2016)
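A sketch of the corresponding action value as an expected information gain (this concrete form is an assumption consistent with the description above):

$$
\text{KS}(\hat{a}_{t:\hat{T}})=I\!\left[\hat{S}_{t:\hat{T}}:\hat{E}_{0:\hat{T}},\Theta\,\middle|\,\hat{a}_{t:\hat{T}},sa_{\prec t},\xi\right].
$$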
4. Empowerment maximization
Actions should lead to control over as many future experiences as possible (Klyubin et al., 2005).
Guckelsberger et al. (2016)
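Empowerment is usually defined as the channel capacity from a sequence of the agent's actions to later sensor values; a sketch in the present notation (the time ranges are assumptions):

$$
\mathfrak{E}(sa_{\prec t})=\max_{\text{p}(\hat{a}_{t:\tau})}I\!\left[\hat{A}_{t:\tau}:\hat{S}_{\tau+1:\hat{T}}\,\middle|\,sa_{\prec t}\right].
$$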
5a. Curiosity
Actions should lead to surprising sensor values.
5b. Curiosity
Actions should lead to surprising environment states.
Actions should lead to a surprising embedding of the sensor values.
5. Curiosity
Figure from Burda et al. (2018), with permission.
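A sketch of the surprise sought in variant 5a (this concrete expression is an assumption): the surprise of a future sensor value is its negative log probability under the agent's own prediction,

$$
\text{Surprise}(\hat{s}_\tau)=-\ln\text{q}(\hat{s}_\tau\mid\hat{a}_{t:\hat{T}},sa_{\prec t},\xi),
$$

and the other variants replace \(\hat{s}_\tau\) by environment states or by an embedding of the sensor values (cf. Burda et al., 2018).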
1. Variational complete posteriors
2. Variational inference / variational policy
3. Active inference
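As a general background sketch (the variational family \(\mathcal{R}\) and this formulation are assumptions): the complete posteriors are replaced by the closest members of a tractable family,

$$
\text{r}^{*}=\arg\min_{\text{r}\in\mathcal{R}}\text{KL}\!\left[\text{r}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta\mid\hat{a}_{t:\hat{T}})\,\middle\|\,\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta\mid\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\right].
$$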
References:
Aslanides, J., Leike, J., and Hutter, M. (2017). Universal Reinforcement Learning Algorithms: Survey and Experiments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1403–1410.
Ay, N., Bertschinger, N., Der, R., Güttler, F., and Olbrich, E. (2008). Predictive Information and Explorative Behavior of Autonomous Robots. The European Physical Journal B - Condensed Matter and Complex Systems, 63(3):329–339.
Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying Count-Based Exploration and Intrinsic Motivation. arXiv:1606.01868 [cs].
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. (2018). Large-Scale Study of Curiosity-Driven Learning. arXiv:1808.04355 [cs, stat].
Friston, K. J., Parr, T., and de Vries, B. (2017). The Graphical Brain: Belief Propagation and Active Inference. Network Neuroscience, 1(4):381–414.
Guckelsberger, C., Salge, C., and Colton, S. (2016). Intrinsically Motivated General Companion NPCs via Coupled Empowerment Maximisation. In 2016 IEEE Conference on Computational Intelligence and Games (CIG'16), pages 150–157.
Klyubin, A., Polani, D., and Nehaniv, C. (2005). Empowerment: A Universal Agent-Centric Measure of Control. In The 2005 IEEE Congress on Evolutionary Computation, volume 1, pages 128–135.
Orseau, L., Lattimore, T., and Hutter, M. (2013). Universal Knowledge-Seeking Agents for Stochastic Environments. In Jain, S., Munos, R., Stephan, F., and Zeugmann, T., editors, Algorithmic Learning Theory, number 8139 in Lecture Notes in Computer Science, pages 158–172. Springer Berlin Heidelberg.
Oudeyer, P.-Y. and Kaplan, F. (2008). How Can We Define Intrinsic Motivation? In Proceedings of the 8th International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems. Lund University Cognitive Studies, Lund: LUCS, Brighton.
Schmidhuber, J. (2010). Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247.
Storck, J., Hochreiter, S., and Schmidhuber, J. (1995). Reinforcement Driven Information Acquisition in Non-Deterministic Environments. In Proceedings of the International Conference on Artificial Neural Networks, volume 2, pages 159–164.
4. Recognition model / approximate prediction
Exact complete posterior: \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\).
Approximate complete posterior: also factorizes into a posterior and a predictive factor.
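A sketch of the approximation, assuming the recognition model (written \(\text{r}\) here, a name introduced for illustration) replaces the exact posterior factor while the predictive factor is kept:

$$
\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta\mid\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\approx\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{t:\hat{T}}\mid\hat{e}_{t-1},\theta,\hat{a}_{t:\hat{T}})\;\text{r}(\hat{e}_{\prec t},\theta\mid sa_{\prec t}).
$$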