Introduction to intrinsic motivations

Martin Biehl

Main goals

  • Introduce framework to formulate intrinsic motivations
  • Introduce existing intrinsic motivations
  • Compare and relate them
Overview

  1. Background / Introduction to intrinsic motivations
  2. Formal framework for intrinsic motivations
    1. Perception-action loop
    2. A generative model to represent predictions
    3. Action selection based on consequences of actions
    4. Intrinsic motivations for action selection
  3. Discussion / Comparison / Results

Originally from psychology e.g. (Ryan and Deci, 2000):

 

"activity for its inherent satisfaction rather than separable consequence"

"for the fun or challenge entailed rather than because of external products, pressures or reward"

 

Examples (Oudeyer, 2008):

  • infants grasping, throwing, biting new objects,
  • adults playing crosswords, painting, gardening, reading novels...

Background on intrinsic motivations

But one can always argue:

  • these things possibly increase the probability of survival in some way
  • they were "learned" by evolution
  • we cannot be sure they have no purpose

 

 

Background on intrinsic motivations

Working definition compatible with Oudeyer (2008):

Motivation is intrinsic if its formulation is:

  • embodiment independent,
  • semantics-free / information-theoretic,
  • "rewiring agnostic".

This includes the approach by Schmidhuber (2010):

Motivation is intrinsic if it

  • rewards improvement of some model quality measure.

Background on intrinsic motivations

Another important but not defining feature is

  • open endedness

 

The motivation should not vanish until the capacities of the agent are exhausted.

Background on intrinsic motivations

Applications of intrinsic motivations:

  • developmental robotics
  • human-level AI and artificial general intelligence (AGI)
  • sparse reward reinforcement learning problems

Background on intrinsic motivations

Developmental robotics:

  • study developmental processes of infants
    • motor skill acquisition
    • language acquisition
  • implement similar processes in robots

Background on intrinsic motivations

AGI:

  • implement predictive model that continually improves through experience
  • implement action selection / optimization that chooses according to the prediction
  • drive it by open ended intrinsic motivation

Background on intrinsic motivations

Sparse reward reinforcement learning:

  • Add additional term rewarding model improvement / curiosity / control

Background on intrinsic motivations

Advantages of intrinsic motivations:

  • scalability:
    • no need to design a reward function for each environment
    • kind and size of the environment do not change the reward function
    • agent complexity does not change the reward function

Disadvantages:

  • often harder to optimize
  • too general (a specific design may be faster if one is known)

Examples:

  • hunger is not an intrinsic motivation
    • what counts as food (can be converted into energy) depends on the organism/agent (plants "eat" CO2, robots "eat" electrical power, virtual agents don't eat)
    • eating more doesn't improve our model of the world

Background on intrinsic motivations

Examples:

  • maximizing energy is closer to an intrinsic motivation
    • most agents need energy but not all (e.g. virtual ones)
    • storing more and more energy doesn't improve the world model

Background on intrinsic motivations

Examples:

  • maximizing money is not an intrinsic motivation
    • it only exists in some societies
    • getting more doesn't improve our model

Background on intrinsic motivations

Examples:

  • minimizing prediction error of the model is an intrinsic motivation
    • as long as the agent remembers its predictions it can calculate the prediction error, no matter what the environment, sensors, or actuators are / mean.
    • reducing it improves the model (at least locally)

Background on intrinsic motivations

But: minimizing prediction error suffers from the "dark room problem".

Examples:

  • maximizing the decrease per time of the prediction error (prediction progress) is an intrinsic motivation
    • the agent only has to monitor its prediction error, which is possible for any sensors, motors, environment...
    • improves the model where most progress can be made (open ended)

Background on intrinsic motivations

Overview

  1. Background / Introduction to intrinsic motivations
  2. Formal framework for intrinsic motivations
    1. Perception-action loop
    2. Action generation
      1. Generative model to obtain predictions
      2. Action selection based on predictions
    3. Action selection
  3. Some intrinsic motivations:
    1. Free Energy Minimization
    2. Empowerment maximization
    3. Knowledge seeking
    4. Predictive information maximization

2. Formal framework for intrinsic motivations

  1. Perception-action loop

Similar to reinforcement learning for POMDPs :

  • partially observable environment
  • unknown environment transition dynamics

But we assume no extrinsic reward

2. Formal framework for intrinsic motivations

  1. Perception-action loop

\(E\) : Environment state

\(S\) : Sensor state

\(A\) : Action

\(M\) : Agent memory state

\(\text{p}(e_0)\) : initial environment distribution

\(\text{p}(s|e)\) : sensor dynamics

\(\text{p}(m'|s,a,m)\) : memory dynamics

\(\text{p}(a|m)\) : action generation

\(\text{p}(e'|a',e)\) : environment dynamics

2. Formal framework for intrinsic motivations

  1. Perception-action loop
Joint distribution up to the final time \(t=T\):

$$\text{p}(e_{0:T},s_{0:T},a_{1:T},m_{1:T}) = \left( \prod_{t=1}^T \text{p}(a_t|m_t)\, \text{p}(m_t|s_{t-1},a_{t-1},m_{t-1})\, \text{p}(s_t|e_t)\, \text{p}(e_t|a_t,e_{t-1}) \right) \text{p}(s_0|e_0)\, \text{p}(e_0)$$
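To make the loop concrete, here is a minimal simulation sketch for finite state sets; all sizes and conditional distributions below are illustrative placeholders, not taken from the slides:

```python
# Minimal sketch: sampling one trajectory from the perception-action loop.
import numpy as np

rng = np.random.default_rng(0)
n_e, n_s, n_a = 3, 2, 2                                      # |E|, |S|, |A|

p_e0 = np.array([0.5, 0.3, 0.2])                             # p(e_0)
p_s_given_e = rng.dirichlet(np.ones(n_s), size=n_e)          # p(s|e)
p_e_given_ae = rng.dirichlet(np.ones(n_e), size=(n_a, n_e))  # p(e'|a',e)

def act(memory):
    """p(a|m): placeholder action generation (uniformly random)."""
    return int(rng.integers(n_a))

def run(T=10):
    e = rng.choice(n_e, p=p_e0)                    # e_0 ~ p(e_0)
    s = rng.choice(n_s, p=p_s_given_e[e])          # s_0 ~ p(s_0|e_0)
    memory = [(None, s)]                           # perfect memory: m_t = sa_{<t}
    for t in range(1, T + 1):
        a = act(memory)                            # a_t ~ p(a_t|m_t)
        e = rng.choice(n_e, p=p_e_given_ae[a, e])  # e_t ~ p(e_t|a_t,e_{t-1})
        s = rng.choice(n_s, p=p_s_given_e[e])      # s_t ~ p(s_t|e_t)
        memory.append((a, s))                      # memory update stores (a_t, s_t)
    return memory

print(run())
```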

2. Formal framework for intrinsic motivations

  1. Perception-action loop

Assumptions:

  • constant environment and sensor dynamics: $$\text{p}(e_{t_1}|a_{t_1},e_{t_1-1})=\text{p}(e_{t_2}|a_{t_2},e_{t_2-1})$$ $$\text{p}(s_{t_1}|e_{t_1})=\text{p}(s_{t_2}|e_{t_2})$$
  • perfect agent memory: \(m_t := (s_{\prec t},a_{\prec t}) =: sa_{\prec t}\), so the memory dynamics \(\text{p}(m'|s,a,m)\) are deterministic

2. Formal framework for intrinsic motivations

  1. Perception-action loop

Only missing:

  • the action generation mechanism \(\text{p}(a|m)\)
    • takes the current sensor-action history \(sa_{\prec t}\) and generates a new action \(a_t\)
    • reuse of computational results from the previous history \(sa_{\prec t-1}\) is not reflected here but is possible

 

2. Formal framework for intrinsic motivations

2. Action generation

  • Intrinsic motivations quantify statistical relations between sensor values, actions, and beliefs.
  • Selecting actions according to such measures requires predicting them.
  • This is possible using a parameterized generative model.
  • It encodes beliefs and predictions as probability distributions over parameters and latent variables.
  • Many intrinsic motivations are then easy to express rigorously.
  • Naive computation is intractable.
  • Making it tractable is not discussed here.

2. Formal framework for intrinsic motivations

2. Generative model

Model split up into three parts:

  1. sensor dynamics model \(\text{q}(\hat{s}|\hat{e},\theta)\)
  2. environment dynamics model \(\text{q}(\hat{e}'|\hat{a},\hat{e},\theta)\)
  3. initial environment distribution \(\text{q}(\hat{e}|\theta)\)

2. Formal framework for intrinsic motivations

2. Generative model

  • write \(\Theta=(\Theta^1,\Theta^2,\Theta^3)\) for the parameters of the three parts
  • \(\xi=(\xi^1,\xi^2,\xi^3)\) are hyperparameters that encode the priors over these parameters
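Under these definitions, one consistent way to write the full generative model (a sketch of how the three parts and their priors combine, conditioned on actions; the exact factorization in the underlying paper may differ) is:

$$\text{q}(\hat{s}_{0:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{1:\hat{T}},\xi) = \left(\prod_{i=1}^3 \text{q}(\theta^i|\xi^i)\right) \text{q}(\hat{e}_0|\theta^3)\,\text{q}(\hat{s}_0|\hat{e}_0,\theta^1) \prod_{t=1}^{\hat{T}} \text{q}(\hat{e}_t|\hat{a}_t,\hat{e}_{t-1},\theta^2)\,\text{q}(\hat{s}_t|\hat{e}_t,\theta^1)$$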

2. Formal framework for intrinsic motivations

2. Generative model

Possible simplifications:

  • set and fix a parameter \(\theta^i\) to its actual value \(\theta^i_*\)
    • \(\text{q}(\theta^i|\xi^i) = \delta(\theta^i-\theta^i_{*})\)
  • this parameter is then no longer updated

2. Formal framework for intrinsic motivations

2. Generative model

  • In the standard POMDP case all parameters \(\Theta\) have specified values \(\Theta=\theta\)
  • for \(\hat{e}_t:=\hat{s}\hat{a}_{\prec t}\) this model becomes identical to the one used in general reinforcement learning

2. Formal framework for intrinsic motivations

2. Generative model

We fixed \(\Xi=\xi\), but the model could be more general:

  • vary the cardinality/dimensionality of the environment state
  • vary the graph structure internal to the environment

2. Formal framework for intrinsic motivations

3. Prediction

As time passes in the perception-action loop:

  • new sensor values \(S_t=s_t\) and actions \(A_t=a_t\) are generated
  • they get stored in the memory \(M_t=m_t=sa_{\prec t}\)

 

2. Formal framework for intrinsic motivations

3. Prediction

So at time \(t\) the agent can plug \(m_t=sa_{\prec t}\) into the model

  • this updates the probability distribution to a posterior:

$$\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\hat{a}_{t:\hat{T}},\theta|sa_{\prec t},\xi)=\frac{\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)}{\int\sum_{\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\hat{a}_{t:\hat{T}}}\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)\,\text{d}\theta}$$

2. Formal framework for intrinsic motivations

3. Prediction

This allows predicting the consequences of future actions \(\hat{a}_{t:\hat{T}}\):

$$\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)= \frac{\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)}{\int\sum_{\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}}}\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)\,\text{d}\theta}$$

2. Formal framework for intrinsic motivations

3. Prediction

This posterior predicts the consequences of \(\hat{a}_{t:\hat{T}}\) for the relations between:

  • parameters \(\Theta\)
  • latent variables \(\hat{E}_{0:\hat{T}}\)
  • future sensor values \(\hat{S}_{t:\hat{T}}\)

$$\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)$$

2. Formal framework for intrinsic motivations

3. Prediction

This allows the agent to choose actions that lead to semantics-free (information-theoretic) relations between these quantities.

2. Formal framework for intrinsic motivations

3. Prediction

We call \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\) the complete posterior.
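A brute-force sketch of how such a posterior can be computed by enumeration in a tiny discrete example (the world, its single unknown parameter \(\theta\), and all numbers are illustrative assumptions; only the marginal over \(\theta\) and future sensor values is computed):

```python
# Sketch: a marginal of the complete posterior, q(theta, s_future | a_future, history),
# for a toy world with 2 environment states, 2 sensor values, 2 actions.
import itertools
import numpy as np

thetas = np.linspace(0.05, 0.95, 10)              # grid for the unknown parameter
prior = np.full(len(thetas), 1.0 / len(thetas))   # q(theta|xi): uniform prior

P_SE = np.array([[0.9, 0.1],                      # q(s|e): noisy identity sensor
                 [0.1, 0.9]])

def transition(theta):
    """q(e'|a,e,theta): action 1 flips the state with probability theta."""
    T = np.empty((2, 2, 2))                       # indexed [a, e, e']
    for a, e in itertools.product(range(2), range(2)):
        flip = theta if a == 1 else 0.1
        T[a, e] = [1.0 - flip, flip] if e == 0 else [flip, 1.0 - flip]
    return T

def filter_forward(history, theta):
    """Unnormalized belief over e_t; its sum is q(s_history|a_history,theta)."""
    T, b = transition(theta), np.array([0.5, 0.5])        # q(e_0)
    for a, s in history:                                   # history: [(a_1,s_1),...]
        b = (b @ T[a]) * P_SE[:, s]                        # predict, then weight by q(s|e)
    return b

def complete_posterior(history, future_actions):
    """Return q(theta|history) and q(s_future|future_actions, history)."""
    evidence = np.array([filter_forward(history, th).sum() for th in thetas])
    post_theta = prior * evidence
    post_theta /= post_theta.sum()
    futures = list(itertools.product(range(2), repeat=len(future_actions)))
    pred = np.zeros(len(futures))
    for k, th in enumerate(thetas):
        T = transition(th)
        b0 = filter_forward(history, th)
        b0 = b0 / b0.sum()                                 # q(e_t|history, theta)
        for i, s_seq in enumerate(futures):
            b, p = b0, 1.0
            for a, s in zip(future_actions, s_seq):
                b = b @ T[a]                               # predicted q(e'|..., theta)
                p_s = float(b @ P_SE[:, s])                # q(s|..., theta)
                p *= p_s
                b = b * P_SE[:, s] / max(p_s, 1e-12)
            pred[i] += post_theta[k] * p                   # marginalize theta
    return post_theta, dict(zip(futures, pred))

history = [(1, 1), (1, 0), (1, 1)]                         # observed (a_t, s_t) pairs
print(complete_posterior(history, future_actions=[1, 1]))
```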

2. Formal framework for intrinsic motivations

3. Prediction

Note: the same model can be used to predict the consequences of closed-loop policies \(\hat{\Pi}\)

  • here \(\hat{\Pi}\) is a parameterization of a general policy $$\pi(\hat{a}_t|sa_{\prec t}) = \text{q}(\hat{a}_t|sa_{\prec t},\pi)$$

2. Formal framework for intrinsic motivations

3. Prediction

Note: the same model can be used to predict the consequences of closed-loop policies \(\hat{\Pi}\)

  • simpler policies are also possible, e.g. $$\pi(\hat{a}_t|s_{t-1}) = \text{q}(\hat{a}_t|s_{t-1},\pi)$$

2. Formal framework for intrinsic motivations

3. Prediction

Note: the same model can be used to predict the consequences of closed-loop policies \(\hat{\Pi}\)

  • a general policy maps any sequence \(sa_{\prec t}\) to a distribution over the next action \(\hat{a}_t\)

2. Formal framework for intrinsic motivations

4. Action selection

General reinforcement learning (RL) evaluates actions by expected cumulative reward \(Q(\hat{a}_{t:\hat{T}},sa_{\prec t}):=\mathbb{E}[R|\hat{a}_{t:\hat{T}},sa_{\prec t}]\):

$$Q(\hat{a}_{t:\hat{T}},sa_{\prec t}) := \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi) \sum_{\tau=t}^{\hat{T}} r(\hat{s}\hat{a}_{t:\tau}, sa_{\prec t})$$

Standard RL: the reward \(r_t\) is one of the sensor values, so $$r(\hat{s}\hat{a}_{t:\tau},sa_{\prec t})=r(s_\tau)=r_\tau$$

For the case of evaluating policies, \(Q(\pi,sa_{\prec t}):=\mathbb{E}[R|\pi,sa_{\prec t}]\):

$$Q(\pi,sa_{\prec t}) := \sum_{\hat{s}\hat{a}_{t:\hat{T}}} \text{q}(\hat{s}\hat{a}_{t:\hat{T}}|sa_{\prec t},\pi,\xi) \sum_{\tau=t}^{\hat{T}} r(\hat{s}\hat{a}_{t:\tau}, sa_{\prec t})$$

2. Formal framework for intrinsic motivations

4. Action selection

Find the best action sequence:

$$\hat{a}^*_{t:\hat{T}}(sa_{\prec t}):=\underset{\hat{a}_{t:\hat{T}}}{\text{argmax}}\; Q(\hat{a}_{t:\hat{T}},sa_{\prec t})$$

Select / perform its first action:

$$\text{p}(a_t|sa_{\prec t}):=\delta_{\hat{a}^*_t(sa_{\prec t})}(a_t)$$
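A minimal sketch of this open-loop selection by exhaustive search (only feasible for tiny action sets and short horizons; `predictive` and `reward` are assumed placeholders, e.g. the sensor-sequence distribution from the enumeration sketch above and any reward function):

```python
# Sketch: argmax over action sequences of the expected cumulative reward Q.
import itertools

def select_action(history, horizon, n_actions, predictive, reward):
    def Q(a_seq):
        # predictive(history, a_seq) -> {s_seq: probability}
        return sum(p * reward(s_seq, a_seq, history)
                   for s_seq, p in predictive(history, a_seq).items())
    best = max(itertools.product(range(n_actions), repeat=horizon), key=Q)
    return best[0]          # perform only the first action, then re-plan at t+1
```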

2. Formal framework for intrinsic motivations

4. Action selection

  • Intrinsic motivations are not restricted to consequences for future sensor values \(\hat{S}_{t:\hat{T}}\)
  • they focus on relations between sensor values \(\hat{S}_{t:\hat{T}}\), (latent) environment states \(\hat{E}_{0:\hat{T}}\), and parameters \(\Theta\)
  • these consequences are captured by the complete posterior \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\)
  • define intrinsic motivations as functions \(\mathfrak{M}\) of this posterior and a given sequence \(\hat{a}_{t:\hat{T}}\) of future actions:

$$Q(\hat{a}_{t:\hat{T}},sa_{\prec t},\xi):= \mathfrak{M}(\text{q}(.,.,.|.,sa_{\prec t},\xi),\hat{a}_{t:\hat{T}})$$

  • the only requirement is a conditional probability \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}})\); how the agent obtains it does not matter
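As a consistency check (our labeling, not from the slides), the expected-reward objective of standard RL is the special case in which \(\mathfrak{M}\) uses only the sensor marginal of the complete posterior:

$$\mathfrak{M}^{RL}(\text{q}(.,.,.|.,sa_{\prec t},\xi),\hat{a}_{t:\hat{T}}) = \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi) \sum_{\tau=t}^{\hat{T}} r(\hat{s}\hat{a}_{t:\tau},sa_{\prec t}) = Q(\hat{a}_{t:\hat{T}},sa_{\prec t})$$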

3. Some intrinsic motivations

1. Free energy minimization

Actions should lead to environment states that are expected to have precise (low-entropy) sensor values:

$$\begin{aligned} \mathfrak{M}(\text{q}(.,.,.|.,\xi),\hat{a}_{t:\hat{T}}) :&=-\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{E}_{t:\hat{T}},\hat{a}_{t:\hat{T}})\\ &= \sum_{\hat{e}_{t:\hat{T}}} \text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}}) \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}}) \log \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}}) \end{aligned}$$

Get \(\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})\) from the complete posterior:

$$\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})= \int \sum_{\hat{s}_{t:\hat{T}},\hat{e}_{\prec t}} \text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}})\, \text{d}\theta$$
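A numerical sketch of this quantity for a single time step, with \(\text{q}(\hat{e}|\hat{a})\) and \(\text{q}(\hat{s}|\hat{e})\) given as small arrays of illustrative numbers:

```python
# Sketch: -H(S|E, a) for one candidate action and a single time step.
import numpy as np

def neg_cond_entropy(q_e_given_a, q_s_given_e):
    """-H(S|E,a) = sum_e q(e|a) sum_s q(s|e) log q(s|e)."""
    q_s_given_e = np.clip(q_s_given_e, 1e-12, 1.0)
    return float(np.sum(q_e_given_a[:, None] * q_s_given_e * np.log(q_s_given_e)))

q_e_given_a = np.array([0.7, 0.3])              # q(e|a) for this action
q_s_given_e = np.array([[0.99, 0.01],           # a state with precise sensor values
                        [0.50, 0.50]])          # a state with noisy sensor values
print(neg_cond_entropy(q_e_given_a, q_s_given_e))   # closer to 0 = preferred
```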

3. Some intrinsic motivations

1. Free energy minimization

  • random noise sources are avoided
  • will get stuck in known "dark room traps"
    • we know $$\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{E}_{t:\hat{T}},\hat{a}_{t:\hat{T}})=0$$
    • such an optimal action sequence \(\hat{a}_{t:\hat{T}}\) exists e.g. if there is a "dark room" in the environment
    • even if it cannot be escaped once entered
    • can be patched by adding a KL divergence to a constructed, desired sensory distribution
      • this breaks the purpose of intrinsic motivations (not scalable)
  • free energy alone is therefore not suitable for AGI
3. Some intrinsic motivations

2. Predictive information maximization

Actions should lead to the most complex sensor stream:

  • Next \(k\) sensor values should have max mutual information with the subsequent \(k\).
  • Can get needed distributions from complete posterior.

$$\begin{aligned} \mathfrak{M}^{PI}(\text{q}(.,.,.|.),\hat{a}_{t:\hat{T}}) :&= \text{I}_{\text{q}}(\hat{S}_{t:t+k-1}:\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})\\ &=\sum_{\hat{s}_{t:t+2k-1}} \text{q}(\hat{s}_{t:t+2k-1}|\hat{a}_{t:\hat{T}}) \log \frac{\text{q}(\hat{s}_{t+k:t+2k-1}|\hat{s}_{t:t+k-1},\hat{a}_{t:\hat{T}})}{\text{q}(\hat{s}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})} \end{aligned}$$
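A small numerical sketch of the mutual information between the two sensor blocks, computed from a joint distribution \(\text{q}(\hat{s}_{t:t+k-1},\hat{s}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})\) given as an illustrative matrix:

```python
# Sketch: predictive information from a joint distribution over (past, future) blocks.
import numpy as np

def mutual_information(joint):
    """I(X:Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ), in nats."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

joint = np.array([[0.30, 0.05],     # rows: s_{t:t+k-1}, columns: s_{t+k:t+2k-1}
                  [0.05, 0.60]])
print(mutual_information(joint))    # large when the past block predicts the future block
```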

3. Some intrinsic motivations

2. Predictive information maximization

Georg Martius, Ralf Der

3. Some intrinsic motivations

2. Predictive information maximization

  • random noise sources are avoided as they produce no mutual information
  • will not get stuck in known "dark room traps"
    • since $$\text{H}_{\text{q}}(\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{I}_{\text{q}}(\hat{S}_{t:t+k-1}:\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})=0$$
  • possible long-term behavior:
    • an ergodic sensor process
    • it finds a subset of environment states that allows this ergodicity
3. Some intrinsic motivations

3. Knowledge seeking

Actions should lead to sensor values that tell the most about the model parameters \(\Theta\):

  • Also known as information gain maximization
  • Can get needed distributions from complete posterior.

$$\begin{aligned} \mathfrak{M}^{KSA}(\text{q}(.,.,.|.),\hat{a}_{t:\hat{T}}) :&= \text{I}_{\text{q}}(\hat{S}_{t:\hat{T}}:\Theta|\hat{a}_{t:\hat{T}})\\ &=\sum_{\hat{s}_{t:\hat{T}}} \int \text{q}(\hat{s}_{t:\hat{T}},\theta|\hat{a}_{t:\hat{T}}) \log \frac{\text{q}(\theta|\hat{s}_{t:\hat{T}},\hat{a}_{t:\hat{T}})}{\text{q}(\theta)}\, \text{d}\theta \end{aligned}$$
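A numerical sketch of the information gain for a parameter on a small grid; the prior and the sensor likelihoods \(\text{q}(\hat{s}|\theta,\hat{a})\) below are illustrative assumptions:

```python
# Sketch: information gain I(S:Theta|a) = H(Theta) - sum_s q(s) H(Theta|s).
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(np.clip(p, 1e-12, 1.0))))

def information_gain(prior, lik):
    """prior: q(theta), shape (K,);  lik: q(s|theta,a), shape (K, n_s)."""
    joint = prior[:, None] * lik                  # q(theta, s|a)
    p_s = joint.sum(axis=0)                       # q(s|a)
    gain = entropy(prior)
    for s in range(lik.shape[1]):
        gain -= p_s[s] * entropy(joint[:, s] / p_s[s])   # subtract q(s) H(theta|s,a)
    return gain

prior = np.full(5, 0.2)                           # uniform prior over 5 theta values
theta = np.linspace(0.1, 0.9, 5)
lik = np.stack([theta, 1.0 - theta], axis=1)      # Bernoulli sensor: q(s=0|theta)=theta
print(information_gain(prior, lik))               # expected reduction of H(Theta)
```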

3. Some intrinsic motivations

3. Knowledge seeking

  • avoids random noise sources once they are known
  • similar to prediction progress
  • can be rewritten as $$\text{H}_{\text{q}}(\Theta)-\text{H}_{\text{q}}(\Theta|\hat{S}_{t:\hat{T}},\hat{a}_{t:\hat{T}})$$
  • will not get stuck in known "dark room traps"
    • since $$\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{I}_{\text{q}}(\hat{S}_{t:\hat{T}}:\Theta|\hat{a}_{t:\hat{T}})=0$$
  • possible long-term behavior:
    • once the model is fully known it does nothing in particular / a random walk
3. Some intrinsic motivations

4. Empowerment maximization

Actions should lead to control over as many future experiences as possible:

  • Actions \(\hat{a}_{t:\hat{T}_a}\) are taken such that subsequent actions \(\hat{a}_{\hat{T}_a+1:\hat{T}}\) have control
  • Can get needed distributions from complete posterior.

$$\begin{aligned} \mathfrak{M}^{EM}(\text{q}(.,.,.|.),\hat{a}_{t:\hat{T}_a}) :&= \max_{\text{q}(\hat{a}_{\hat{T}_a+1:\hat{T}})} \; \text{I}_{\text{q}}(\hat{A}_{\hat{T}_a+1:\hat{T}}:\hat{S}_{\hat{T}}|\hat{a}_{t:\hat{T}_a}) \\ &=\max_{\text{q}(\hat{a}_{\hat{T}_a+1:\hat{T}})} \; \sum_{\hat{a}_{\hat{T}_a+1:\hat{T}},\hat{s}_{\hat{T}}} \text{q}(\hat{a}_{\hat{T}_a+1:\hat{T}})\, \text{q}(\hat{s}_{\hat{T}}|\hat{a}_{t:\hat{T}}) \log \frac{\text{q}(\hat{s}_{\hat{T}}|\hat{a}_{t:\hat{T}})}{\text{q}(\hat{s}_{\hat{T}}|\hat{a}_{t:\hat{T}_a})} \end{aligned}$$
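A numerical sketch: for a fixed \(\hat{a}_{t:\hat{T}_a}\) the inner maximization is the capacity of the channel from subsequent actions to the final sensor value, which can be computed with the standard Blahut-Arimoto iteration (the channel matrix below is an illustrative stand-in for what the complete posterior would provide):

```python
# Sketch: empowerment as channel capacity, max over q(a) of I(A:S), via Blahut-Arimoto.
import numpy as np

def empowerment(channel, iters=200):
    """channel[a, s] = q(s|a); returns max over q(a) of I(A:S) in nats."""
    n_a = channel.shape[0]
    q_a = np.full(n_a, 1.0 / n_a)                          # start from a uniform action distribution
    safe = np.clip(channel, 1e-12, 1.0)
    for _ in range(iters):
        p_s = q_a @ channel                                # current sensor marginal
        d = np.sum(channel * np.log(safe / p_s), axis=1)   # D_KL(q(s|a) || p_s)
        q_a = q_a * np.exp(d)                              # Blahut-Arimoto update
        q_a /= q_a.sum()
    p_s = q_a @ channel
    return float(np.sum(q_a[:, None] * channel * np.log(safe / p_s)))

channel = np.array([[0.9, 0.1, 0.0],    # rows: candidate future action sequences
                    [0.1, 0.8, 0.1],    # columns: final sensor values
                    [0.0, 0.1, 0.9]])
print(empowerment(channel))             # high when distinct actions yield distinct sensor values
```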

3. Some intrinsic motivations

4. Empowerment maximization

  • avoids random noise sources because they cannot be controlled
  • will not get stuck in known "dark room traps"
    • since $$\text{H}_{\text{q}}(\hat{S}_{\hat{T}}|\hat{a}_{t:\hat{T}_a})=0\Rightarrow\text{I}_{\text{q}}(\hat{A}_{\hat{T}_a+1:\hat{T}}:\hat{S}_{\hat{T}}|\hat{a}_{t:\hat{T}_a})=0$$
  • possible long-term behavior:
    • remains in (or maintains) the situation where it expects the most control over future experience
    • exploration behavior not fully understood
    • belief empowerment may solve this...

3. Some intrinsic motivations

1. Free energy minimization

Intuition:

$$\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)= \int \sum_{\hat{s}_{t:\hat{T}},\hat{e}_{\prec t}} \text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\, \text{d}\theta$$

$$\begin{aligned} \mathfrak{M}(\text{q}(.,.,.|.,\xi),\hat{a}_{t:\hat{T}}) :&= \sum_{\hat{e}_{t:\hat{T}}} \text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}}) \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}}) \log \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}})\\ &= - \sum_{\hat{e}_{t:\hat{T}}} \text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})\,\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{e}_{t:\hat{T}})\\ &=-\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{E}_{t:\hat{T}},\hat{a}_{t:\hat{T}}) \end{aligned}$$

References:

Aslanides, J., Leike, J., and Hutter, M. (2017). Universal Reinforcement Learning Algorithms: Survey and Experiments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1403–1410.


Ay, N., Bertschinger, N., Der, R., Güttler, F., and Olbrich, E. (2008). Predictive Information and Explorative Behavior of Autonomous Robots. The European Physical Journal B-Condensed Matter and Complex Systems, 63(3):329–339.


Friston, K. J., Parr, T., and de Vries, B. (2017). The Graphical Brain: Belief Propagation and Active Inference. Network Neuroscience, 1(4):381–414.


Klyubin, A., Polani, D., and Nehaniv, C. (2005). Empowerment: A Universal Agent-Centric Measure of Control. In The 2005 IEEE Congress on Evolutionary Computation, 2005, volume 1, pages 128–135.


Orseau, L., Lattimore, T., and Hutter, M. (2013). Universal Knowledge-Seeking Agents for Stochastic Environments. In Jain, S., Munos, R., Stephan, F., and Zeugmann, T., editors, Algorithmic Learning Theory, number 8139 in Lecture Notes in Computer Science, pages 158–172. Springer Berlin Heidelberg.


Storck, J., Hochreiter, S., and Schmidhuber, J. (1995). Reinforcement Driven Information Acquisition in Non-Deterministic Environments. In Proceedings of the International Conference on Artificial Neural Networks, volume 2, pages 159–164.

 

 
