Introduction to intrinsic motivations

Martin Biehl

Main goals

  • Introduce framework to formulate intrinsic motivations
  • Introduce existing intrinsic motivations
  • Compare and relate them
Overview

  1. Background / Introduction to intrinsic motivations
  2. Formal framework for intrinsic motivations
    1. Perception-action loop
    2. A generative model to represent predictions
    3. Action selection based on consequences of actions
    4. Intrinsic motivations for action selection
  3. Discussion / Comparison / Results

Originally from psychology e.g. (Ryan and Deci, 2000):

 

"activity for its inherent satisfaction rather than separable consequence"

"for the fun or challenge entailed rather than because of external products, pressures or reward"

 

Examples (Oudeyer, 2008):

  • infants grasping, throwing, biting new objects,
  • adults playing crosswords, painting, gardening, reading novels...

Background on intrinsic motivations

But one can always argue:

  • these things possibly increase the probability of survival in some way
  • they were "learned" by evolution
  • we cannot be sure they have no purpose

 

 

Background on intrinsic motivations

Working definition compatible with Oudeyer (2008):

Motivation is intrinsic if its formulation is:

  • embodiment independent,
  • semantics-free / information-theoretic,
  • "rewiring agnostic".

This includes the approach by Schmidhuber (2010):

Motivation is intrinsic if it

  • rewards improvement of some model quality measure.

Background on intrinsic motivations

Another important but not defining feature is

  • open endedness

 

The motivation should not vanish until the capacities of the agent are exhausted.

Background on intrinsic motivations

Applications of intrinsic motivations:

  • developmental robotics
  • human-level AI and artificial general intelligence (AGI)
  • sparse reward reinforcement learning problems

Background on intrinsic motivations

Developmental robotics:

  • study developmental processes of infants
    • motor skill acquisition
    • language acquisition
  • implement similar processes in robots

Background on intrinsic motivations

AGI:

  • implement predictive model that continually improves through experience
  • implement action selection / optimization that chooses according to the prediction
  • drive it by open ended intrinsic motivation

Background on intrinsic motivations

Sparse reward reinforcement learning:

  • Add additional term rewarding model improvement / curiosity / control

Background on intrinsic motivations

Advantages of intrinsic motivations:

  • scalability:
    • no need to design a reward function for each environment
    • kind and size of the environment do not change the reward function
    • agent complexity does not change the reward function

Disadvantages:

  • often harder to optimize
  • too general (a specific design may be faster if one is known)

Examples:

  • hunger is not an intrinsic motivation
    • what counts as food (can be converted into energy) depends on the organism/agent (plants "eat" CO2, robots "eat" electrical power, virtual agents don't eat)
    • eating more doesn't improve our model of the world

Background on intrinsic motivations

Examples:

  • maximizing energy is closer to an intrinsic motivation
    • most agents need energy but not all (e.g. virtual ones)
    • storing more and more energy doesn't improve the world model

Background on intrinsic motivations

Examples:

  • maximizing money is not an intrinsic motivation
    • it only exists in some societies
    • getting more doesn't improve our model

Background on intrinsic motivations

Examples:

  • minimizing prediction error of the model is an intrinsic motivation
    • as long as the agent remembers its predictions it can calculate the prediction error, no matter what the environment, sensors, or actuators are / mean.
    • reducing it improves the model (at least locally)

Background on intrinsic motivations

But: minimizing prediction error suffers from the "dark room problem".

Examples:

  • maximizing the decrease per time of the prediction error (prediction progress) is an intrinsic motivation
    • the agent only has to monitor its prediction error, which is possible for any sensors, motors, environment...
    • improves the model where most progress can be made (open ended)

Background on intrinsic motivations

Overview

  1. Background / Introduction to intrinsic motivations
  2. Formal framework for intrinsic motivations
    1. Perception-action loop
    2. Action generation
      1. Generative model to obtain predictions
      2. Action selection based on predictions
    3. Action selection
  3. Some intrinsic motivations:
    1. Free Energy Minimization
    2. Empowerment maximization
    3. Knowledge seeking
    4. Predictive information maximization

2. Formal framework for intrinsic motivations

  1. Perception-action loop

Similar to reinforcement learning for POMDPs :

  • partially observable environment
  • unknown environment transition dynamics

But we assume no extrinsic reward

2. Formal framework for intrinsic motivations

  1. Perception-action loop

\(E\) : Environment state

\(S\) : Sensor state

\(A\) : Action

\(M\) : Agent memory state

\(\text{p}(e_0)\) : initial environment distribution

\(\text{p}(s|e)\) : sensor dynamics

\(\text{p}(m'|s,a,m)\) : memory dynamics

\(\text{p}(a|m)\) : action generation

\(\text{p}(e'|a',e)\) : environment dynamics

2. Formal framework for intrinsic motivations

  1. Perception-action loop
Joint distribution up to the final time \(t=T\):

$$\text{p}(e_{0:T},s_{0:T},a_{1:T},m_{1:T}) = \left( \prod_{t=1}^T \text{p}(a_t|m_t)\, \text{p}(m_t|s_{t-1},a_{t-1},m_{t-1})\, \text{p}(s_t|e_t)\, \text{p}(e_t|a_t,e_{t-1}) \right) \text{p}(s_0|e_0)\, \text{p}(e_0)$$
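To make the loop concrete, here is a minimal simulation sketch for finite state sets; all sizes and conditional distributions below are illustrative placeholders, not taken from the slides:

```python
# Minimal sketch: sampling one trajectory from the perception-action loop.
import numpy as np

rng = np.random.default_rng(0)
n_e, n_s, n_a = 3, 2, 2                                      # |E|, |S|, |A|

p_e0 = np.array([0.5, 0.3, 0.2])                             # p(e_0)
p_s_given_e = rng.dirichlet(np.ones(n_s), size=n_e)          # p(s|e)
p_e_given_ae = rng.dirichlet(np.ones(n_e), size=(n_a, n_e))  # p(e'|a',e)

def act(memory):
    """p(a|m): placeholder action generation (uniformly random)."""
    return int(rng.integers(n_a))

def run(T=10):
    e = rng.choice(n_e, p=p_e0)                    # e_0 ~ p(e_0)
    s = rng.choice(n_s, p=p_s_given_e[e])          # s_0 ~ p(s_0|e_0)
    memory = [(None, s)]                           # perfect memory: m_t = sa_{<t}
    for t in range(1, T + 1):
        a = act(memory)                            # a_t ~ p(a_t|m_t)
        e = rng.choice(n_e, p=p_e_given_ae[a, e])  # e_t ~ p(e_t|a_t,e_{t-1})
        s = rng.choice(n_s, p=p_s_given_e[e])      # s_t ~ p(s_t|e_t)
        memory.append((a, s))                      # memory update stores (a_t, s_t)
    return memory

print(run())
```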

2. Formal framework for intrinsic motivations

  1. Perception-action loop

Assumptions:

  • constant environment and sensor dynamics: $$\text{p}(e_{t_1}|a_{t_1},e_{t_1-1})=\text{p}(e_{t_2}|a_{t_2},e_{t_2-1})$$ $$\text{p}(s_{t_1}|e_{t_1})=\text{p}(s_{t_2}|e_{t_2})$$
  • perfect agent memory: \(m_t := (s_{\prec t},a_{\prec t}) =: sa_{\prec t}\), so the memory dynamics \(\text{p}(m'|s,a,m)\) are deterministic

2. Formal framework for intrinsic motivations

  1. Perception-action loop

Only missing:

  • the action generation mechanism \(\text{p}(a|m)\)
    • takes the current sensor-action history \(sa_{\prec t}\) and generates a new action \(a_t\)
    • reuse of computational results from the previous history \(sa_{\prec t-1}\) is not reflected here but is possible

 

2. Formal framework for intrinsic motivations

2. Action generation

  • Intrinsic motivations quantify statistical relations between sensor values, actions, and beliefs.
  • Selecting actions according to such measures requires predicting them.
  • This is possible using a parameterized generative model.
  • It encodes beliefs and predictions as probability distributions over parameters and latent variables.
  • Many intrinsic motivations are then easy to express rigorously.
  • Naive computation is intractable.
  • Making it tractable is not discussed here.

2. Formal framework for intrinsic motivations

2. Generative model

Model split up into three parts:

  1. sensor dynamics model \(\text{q}(\hat{s}|\hat{e},\theta)\)
  2. environment dynamics model \(\text{q}(\hat{e}'|\hat{a},\hat{e},\theta)\)
  3. initial environment distribution \(\text{q}(\hat{e}|\theta)\)

2. Formal framework for intrinsic motivations

2. Generative model

  • write \(\Theta=(\Theta^1,\Theta^2,\Theta^3)\) for the parameters of the three parts
  • \(\xi=(\xi^1,\xi^2,\xi^3)\) are hyperparameters that encode the priors over these parameters
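Under these definitions, one consistent way to write the full generative model (a sketch of how the three parts and their priors combine, conditioned on actions; the exact factorization in the underlying paper may differ) is:

$$\text{q}(\hat{s}_{0:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{1:\hat{T}},\xi) = \left(\prod_{i=1}^3 \text{q}(\theta^i|\xi^i)\right) \text{q}(\hat{e}_0|\theta^3)\,\text{q}(\hat{s}_0|\hat{e}_0,\theta^1) \prod_{t=1}^{\hat{T}} \text{q}(\hat{e}_t|\hat{a}_t,\hat{e}_{t-1},\theta^2)\,\text{q}(\hat{s}_t|\hat{e}_t,\theta^1)$$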

2. Formal framework for intrinsic motivations

2. Generative model

Possible simplifications:

  • set and fix a parameter \(\theta^i\) to its actual value \(\theta^i_*\)
    • \(\text{q}(\theta^i|\xi^i) = \delta(\theta^i-\theta^i_{*})\)
  • this parameter is then no longer updated

2. Formal framework for intrinsic motivations

2. Generative model

  • In the standard POMDP case all parameters \(\Theta\) have specified values \(\Theta=\theta\)
  • for \(\hat{e}_t:=\hat{s}\hat{a}_{\prec t}\) this model becomes identical to the one used in general reinforcement learning

2. Formal framework for intrinsic motivations

2. Generative model

We fixed \(\Xi=\xi\), but the model could be more general:

  • vary the cardinality/dimensionality of the environment state
  • vary the graph structure internal to the environment

2. Formal framework for intrinsic motivations

3. Prediction

As time passes in the perception-action loop:

  • new sensor values \(S_t=s_t\) and actions \(A_t=a_t\) are generated
  • they get stored in the memory \(M_t=m_t=sa_{\prec t}\)

 

2. Formal framework for intrinsic motivations

3. Prediction

So at time \(t\) the agent can plug \(m_t=sa_{\prec t}\) into the model

  • this updates the probability distribution to a posterior:

$$\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\hat{a}_{t:\hat{T}},\theta|sa_{\prec t},\xi)=\frac{\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)}{\int\sum_{\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\hat{a}_{t:\hat{T}}}\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)\,\text{d}\theta}$$

2. Formal framework for intrinsic motivations

3. Prediction

This allows predicting the consequences of future actions \(\hat{a}_{t:\hat{T}}\):

$$\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)= \frac{\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)}{\int\sum_{\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}}}\text{q}(s_{\prec t},\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},a_{\prec t},\hat{a}_{t:\hat{T}},\theta,\xi)\,\text{d}\theta}$$

2. Formal framework for intrinsic motivations

3. Prediction

This posterior predicts the consequences of \(\hat{a}_{t:\hat{T}}\) for the relations between:

  • parameters \(\Theta\)
  • latent variables \(\hat{E}_{0:\hat{T}}\)
  • future sensor values \(\hat{S}_{t:\hat{T}}\)

$$\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)$$

2. Formal framework for intrinsic motivations

3. Prediction

This allows the agent to choose actions that lead to semantics-free (information-theoretic) relations between these quantities.

2. Formal framework for intrinsic motivations

3. Prediction

We call \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\) the complete posterior.
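A brute-force sketch of how such a posterior can be computed by enumeration in a tiny discrete example (the world, its single unknown parameter \(\theta\), and all numbers are illustrative assumptions; only the marginal over \(\theta\) and future sensor values is computed):

```python
# Sketch: a marginal of the complete posterior, q(theta, s_future | a_future, history),
# for a toy world with 2 environment states, 2 sensor values, 2 actions.
import itertools
import numpy as np

thetas = np.linspace(0.05, 0.95, 10)              # grid for the unknown parameter
prior = np.full(len(thetas), 1.0 / len(thetas))   # q(theta|xi): uniform prior

P_SE = np.array([[0.9, 0.1],                      # q(s|e): noisy identity sensor
                 [0.1, 0.9]])

def transition(theta):
    """q(e'|a,e,theta): action 1 flips the state with probability theta."""
    T = np.empty((2, 2, 2))                       # indexed [a, e, e']
    for a, e in itertools.product(range(2), range(2)):
        flip = theta if a == 1 else 0.1
        T[a, e] = [1.0 - flip, flip] if e == 0 else [flip, 1.0 - flip]
    return T

def filter_forward(history, theta):
    """Unnormalized belief over e_t; its sum is q(s_history|a_history,theta)."""
    T, b = transition(theta), np.array([0.5, 0.5])        # q(e_0)
    for a, s in history:                                   # history: [(a_1,s_1),...]
        b = (b @ T[a]) * P_SE[:, s]                        # predict, then weight by q(s|e)
    return b

def complete_posterior(history, future_actions):
    """Return q(theta|history) and q(s_future|future_actions, history)."""
    evidence = np.array([filter_forward(history, th).sum() for th in thetas])
    post_theta = prior * evidence
    post_theta /= post_theta.sum()
    futures = list(itertools.product(range(2), repeat=len(future_actions)))
    pred = np.zeros(len(futures))
    for k, th in enumerate(thetas):
        T = transition(th)
        b0 = filter_forward(history, th)
        b0 = b0 / b0.sum()                                 # q(e_t|history, theta)
        for i, s_seq in enumerate(futures):
            b, p = b0, 1.0
            for a, s in zip(future_actions, s_seq):
                b = b @ T[a]                               # predicted q(e'|..., theta)
                p_s = float(b @ P_SE[:, s])                # q(s|..., theta)
                p *= p_s
                b = b * P_SE[:, s] / max(p_s, 1e-12)
            pred[i] += post_theta[k] * p                   # marginalize theta
    return post_theta, dict(zip(futures, pred))

history = [(1, 1), (1, 0), (1, 1)]                         # observed (a_t, s_t) pairs
print(complete_posterior(history, future_actions=[1, 1]))
```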

2. Formal framework for intrinsic motivations

3. Prediction

Note: the same model can be used to predict the consequences of closed-loop policies \(\hat{\Pi}\)

  • here \(\hat{\Pi}\) is a parameterization of a general policy $$\pi(\hat{a}_t|sa_{\prec t}) = \text{q}(\hat{a}_t|sa_{\prec t},\pi)$$

2. Formal framework for intrinsic motivations

3. Prediction

Note: the same model can be used to predict the consequences of closed-loop policies \(\hat{\Pi}\)

  • simpler policies are also possible, e.g. $$\pi(\hat{a}_t|s_{t-1}) = \text{q}(\hat{a}_t|s_{t-1},\pi)$$

2. Formal framework for intrinsic motivations

3. Prediction

Note: the same model can be used to predict the consequences of closed-loop policies \(\hat{\Pi}\)

  • a general policy maps any sequence \(sa_{\prec t}\) to a distribution over the next action \(\hat{a}_t\)

2. Formal framework for intrinsic motivations

4. Action selection

General reinforcement learning (RL) evaluates actions by expected cumulative reward \(Q(\hat{a}_{t:\hat{T}},sa_{\prec t}):=\mathbb{E}[R|\hat{a}_{t:\hat{T}},sa_{\prec t}]\):

$$Q(\hat{a}_{t:\hat{T}},sa_{\prec t}) := \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi) \sum_{\tau=t}^{\hat{T}} r(\hat{s}\hat{a}_{t:\tau}, sa_{\prec t})$$

Standard RL: the reward \(r_t\) is one of the sensor values, so $$r(\hat{s}\hat{a}_{t:\tau},sa_{\prec t})=r(s_\tau)=r_\tau$$

For the case of evaluating policies, \(Q(\pi,sa_{\prec t}):=\mathbb{E}[R|\pi,sa_{\prec t}]\):

$$Q(\pi,sa_{\prec t}) := \sum_{\hat{s}\hat{a}_{t:\hat{T}}} \text{q}(\hat{s}\hat{a}_{t:\hat{T}}|sa_{\prec t},\pi,\xi) \sum_{\tau=t}^{\hat{T}} r(\hat{s}\hat{a}_{t:\tau}, sa_{\prec t})$$

2. Formal framework for intrinsic motivations

4. Action selection

Find the best action sequence:

$$\hat{a}^*_{t:\hat{T}}(sa_{\prec t}):=\underset{\hat{a}_{t:\hat{T}}}{\text{argmax}}\; Q(\hat{a}_{t:\hat{T}},sa_{\prec t})$$

Select / perform its first action:

$$\text{p}(a_t|sa_{\prec t}):=\delta_{\hat{a}^*_t(sa_{\prec t})}(a_t)$$
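A minimal sketch of this open-loop selection by exhaustive search (only feasible for tiny action sets and short horizons; `predictive` and `reward` are assumed placeholders, e.g. the sensor-sequence distribution from the enumeration sketch above and any reward function):

```python
# Sketch: argmax over action sequences of the expected cumulative reward Q.
import itertools

def select_action(history, horizon, n_actions, predictive, reward):
    def Q(a_seq):
        # predictive(history, a_seq) -> {s_seq: probability}
        return sum(p * reward(s_seq, a_seq, history)
                   for s_seq, p in predictive(history, a_seq).items())
    best = max(itertools.product(range(n_actions), repeat=horizon), key=Q)
    return best[0]          # perform only the first action, then re-plan at t+1
```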

2. Formal framework for intrinsic motivations

4. Action selection

  • Intrinsic motivations are not restricted to consequences for future sensor values \(\hat{S}_{t:\hat{T}}\)
  • they focus on relations between sensor values \(\hat{S}_{t:\hat{T}}\), (latent) environment states \(\hat{E}_{0:\hat{T}}\), and parameters \(\Theta\)
  • these consequences are captured by the complete posterior \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\)
  • define intrinsic motivations as functions \(\mathfrak{M}\) of this posterior and a given sequence \(\hat{a}_{t:\hat{T}}\) of future actions:

$$Q(\hat{a}_{t:\hat{T}},sa_{\prec t},\xi):= \mathfrak{M}(\text{q}(.,.,.|.,sa_{\prec t},\xi),\hat{a}_{t:\hat{T}})$$

  • the only requirement is a conditional probability \(\text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}})\); how the agent obtains it does not matter
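As a consistency check (our labeling, not from the slides), the expected-reward objective of standard RL is the special case in which \(\mathfrak{M}\) uses only the sensor marginal of the complete posterior:

$$\mathfrak{M}^{RL}(\text{q}(.,.,.|.,sa_{\prec t},\xi),\hat{a}_{t:\hat{T}}) = \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi) \sum_{\tau=t}^{\hat{T}} r(\hat{s}\hat{a}_{t:\tau},sa_{\prec t}) = Q(\hat{a}_{t:\hat{T}},sa_{\prec t})$$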

3. Some intrinsic motivations

1. Free energy minimization

Actions should lead to environment states that are expected to have precise (low-entropy) sensor values:

$$\begin{aligned} \mathfrak{M}(\text{q}(.,.,.|.,\xi),\hat{a}_{t:\hat{T}}) :&=-\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{E}_{t:\hat{T}},\hat{a}_{t:\hat{T}})\\ &= \sum_{\hat{e}_{t:\hat{T}}} \text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}}) \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}}) \log \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}}) \end{aligned}$$

Get \(\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})\) from the complete posterior:

$$\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})= \int \sum_{\hat{s}_{t:\hat{T}},\hat{e}_{\prec t}} \text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}})\, \text{d}\theta$$
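A numerical sketch of this quantity for a single time step, with \(\text{q}(\hat{e}|\hat{a})\) and \(\text{q}(\hat{s}|\hat{e})\) given as small arrays of illustrative numbers:

```python
# Sketch: -H(S|E, a) for one candidate action and a single time step.
import numpy as np

def neg_cond_entropy(q_e_given_a, q_s_given_e):
    """-H(S|E,a) = sum_e q(e|a) sum_s q(s|e) log q(s|e)."""
    q_s_given_e = np.clip(q_s_given_e, 1e-12, 1.0)
    return float(np.sum(q_e_given_a[:, None] * q_s_given_e * np.log(q_s_given_e)))

q_e_given_a = np.array([0.7, 0.3])              # q(e|a) for this action
q_s_given_e = np.array([[0.99, 0.01],           # a state with precise sensor values
                        [0.50, 0.50]])          # a state with noisy sensor values
print(neg_cond_entropy(q_e_given_a, q_s_given_e))   # closer to 0 = preferred
```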

3. Some intrinsic motivations

1. Free energy minimization

  • random noise sources are avoided
  • will get stuck in known "dark room traps"
    • we know $$\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{E}_{t:\hat{T}},\hat{a}_{t:\hat{T}})=0$$
    • such an optimal action sequence \(\hat{a}_{t:\hat{T}}\) exists e.g. if there is a "dark room" in the environment
    • even if it cannot be escaped once entered
    • can be patched by adding a KL divergence to a constructed, desired sensory distribution
      • this breaks the purpose of intrinsic motivations (not scalable)
  • free energy alone is therefore not suitable for AGI
3. Some intrinsic motivations

2. Predictive information maximization

Actions should lead to the most complex sensor stream:

  • Next \(k\) sensor values should have max mutual information with the subsequent \(k\).
  • Can get needed distributions from complete posterior.

$$\begin{aligned} \mathfrak{M}^{PI}(\text{q}(.,.,.|.),\hat{a}_{t:\hat{T}}) :&= \text{I}_{\text{q}}(\hat{S}_{t:t+k-1}:\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})\\ &=\sum_{\hat{s}_{t:t+2k-1}} \text{q}(\hat{s}_{t:t+2k-1}|\hat{a}_{t:\hat{T}}) \log \frac{\text{q}(\hat{s}_{t+k:t+2k-1}|\hat{s}_{t:t+k-1},\hat{a}_{t:\hat{T}})}{\text{q}(\hat{s}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})} \end{aligned}$$
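A small numerical sketch of the mutual information between the two sensor blocks, computed from a joint distribution \(\text{q}(\hat{s}_{t:t+k-1},\hat{s}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})\) given as an illustrative matrix:

```python
# Sketch: predictive information from a joint distribution over (past, future) blocks.
import numpy as np

def mutual_information(joint):
    """I(X:Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ), in nats."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

joint = np.array([[0.30, 0.05],     # rows: s_{t:t+k-1}, columns: s_{t+k:t+2k-1}
                  [0.05, 0.60]])
print(mutual_information(joint))    # large when the past block predicts the future block
```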

3. Some intrinsic motivations

2. Predictive information maximization

Georg Martius, Ralf Der

3. Some intrinsic motivations

2. Predictive information maximization

  • random noise sources are avoided as they produce no mutual information
  • will not get stuck in known "dark room traps"
    • since $$\text{H}_{\text{q}}(\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{I}_{\text{q}}(\hat{S}_{t:t+k-1}:\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})=0$$
  • possible long-term behavior:
    • an ergodic sensor process
    • it finds a subset of environment states that allows this ergodicity
3. Some intrinsic motivations

3. Knowledge seeking

Actions should lead to sensor values that tell the most about the model parameters \(\Theta\):

  • Also known as information gain maximization
  • Can get needed distributions from complete posterior.

$$\begin{aligned} \mathfrak{M}^{KSA}(\text{q}(.,.,.|.),\hat{a}_{t:\hat{T}}) :&= \text{I}_{\text{q}}(\hat{S}_{t:\hat{T}}:\Theta|\hat{a}_{t:\hat{T}})\\ &=\sum_{\hat{s}_{t:\hat{T}}} \int \text{q}(\hat{s}_{t:\hat{T}},\theta|\hat{a}_{t:\hat{T}}) \log \frac{\text{q}(\theta|\hat{s}_{t:\hat{T}},\hat{a}_{t:\hat{T}})}{\text{q}(\theta)}\, \text{d}\theta \end{aligned}$$
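A numerical sketch of the information gain for a parameter on a small grid; the prior and the sensor likelihoods \(\text{q}(\hat{s}|\theta,\hat{a})\) below are illustrative assumptions:

```python
# Sketch: information gain I(S:Theta|a) = H(Theta) - sum_s q(s) H(Theta|s).
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(np.clip(p, 1e-12, 1.0))))

def information_gain(prior, lik):
    """prior: q(theta), shape (K,);  lik: q(s|theta,a), shape (K, n_s)."""
    joint = prior[:, None] * lik                  # q(theta, s|a)
    p_s = joint.sum(axis=0)                       # q(s|a)
    gain = entropy(prior)
    for s in range(lik.shape[1]):
        gain -= p_s[s] * entropy(joint[:, s] / p_s[s])   # subtract q(s) H(theta|s,a)
    return gain

prior = np.full(5, 0.2)                           # uniform prior over 5 theta values
theta = np.linspace(0.1, 0.9, 5)
lik = np.stack([theta, 1.0 - theta], axis=1)      # Bernoulli sensor: q(s=0|theta)=theta
print(information_gain(prior, lik))               # expected reduction of H(Theta)
```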

3. Some intrinsic motivations

3. Knowledge seeking

  • avoids random noise sources once they are known
  • similar to prediction progress
  • can be rewritten as $$\text{H}_{\text{q}}(\Theta)-\text{H}_{\text{q}}(\Theta|\hat{S}_{t:\hat{T}},\hat{a}_{t:\hat{T}})$$
  • will not get stuck in known "dark room traps"
    • since $$\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{I}_{\text{q}}(\hat{S}_{t:\hat{T}}:\Theta|\hat{a}_{t:\hat{T}})=0$$
  • possible long-term behavior:
    • once the model is fully known it does nothing in particular / a random walk
3. Some intrinsic motivations

4. Empowerment maximization

Actions should lead to control over as many future experiences as possible:

  • Actions \(\hat{a}_{t:\hat{T}_a}\) are taken such that subsequent actions \(\hat{a}_{\hat{T}_a+1:\hat{T}}\) have control
  • Can get needed distributions from complete posterior.

$$\begin{aligned} \mathfrak{M}^{EM}(\text{q}(.,.,.|.),\hat{a}_{t:\hat{T}_a}) :&= \max_{\text{q}(\hat{a}_{\hat{T}_a+1:\hat{T}})} \; \text{I}_{\text{q}}(\hat{A}_{\hat{T}_a+1:\hat{T}}:\hat{S}_{\hat{T}}|\hat{a}_{t:\hat{T}_a}) \\ &=\max_{\text{q}(\hat{a}_{\hat{T}_a+1:\hat{T}})} \; \sum_{\hat{a}_{\hat{T}_a+1:\hat{T}},\hat{s}_{\hat{T}}} \text{q}(\hat{a}_{\hat{T}_a+1:\hat{T}})\, \text{q}(\hat{s}_{\hat{T}}|\hat{a}_{t:\hat{T}}) \log \frac{\text{q}(\hat{s}_{\hat{T}}|\hat{a}_{t:\hat{T}})}{\text{q}(\hat{s}_{\hat{T}}|\hat{a}_{t:\hat{T}_a})} \end{aligned}$$
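A numerical sketch: for a fixed \(\hat{a}_{t:\hat{T}_a}\) the inner maximization is the capacity of the channel from subsequent actions to the final sensor value, which can be computed with the standard Blahut-Arimoto iteration (the channel matrix below is an illustrative stand-in for what the complete posterior would provide):

```python
# Sketch: empowerment as channel capacity, max over q(a) of I(A:S), via Blahut-Arimoto.
import numpy as np

def empowerment(channel, iters=200):
    """channel[a, s] = q(s|a); returns max over q(a) of I(A:S) in nats."""
    n_a = channel.shape[0]
    q_a = np.full(n_a, 1.0 / n_a)                          # start from a uniform action distribution
    safe = np.clip(channel, 1e-12, 1.0)
    for _ in range(iters):
        p_s = q_a @ channel                                # current sensor marginal
        d = np.sum(channel * np.log(safe / p_s), axis=1)   # D_KL(q(s|a) || p_s)
        q_a = q_a * np.exp(d)                              # Blahut-Arimoto update
        q_a /= q_a.sum()
    p_s = q_a @ channel
    return float(np.sum(q_a[:, None] * channel * np.log(safe / p_s)))

channel = np.array([[0.9, 0.1, 0.0],    # rows: candidate future action sequences
                    [0.1, 0.8, 0.1],    # columns: final sensor values
                    [0.0, 0.1, 0.9]])
print(empowerment(channel))             # high when distinct actions yield distinct sensor values
```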

3. Some intrinsic motivations

4. Empowerment maximization

  • avoids random noise sources because they cannot be controlled
  • will not get stuck in known "dark room traps"
    • since $$\text{H}_{\text{q}}(\hat{S}_{\hat{T}}|\hat{a}_{t:\hat{T}_a})=0\Rightarrow\text{I}_{\text{q}}(\hat{A}_{\hat{T}_a+1:\hat{T}}:\hat{S}_{\hat{T}}|\hat{a}_{t:\hat{T}_a})=0$$
  • possible long-term behavior:
    • remains in (or maintains) the situation where it expects the most control over future experience
    • exploration behavior not fully understood
    • belief empowerment may solve this...

3. Some intrinsic motivations

1. Free energy minimization

Intuition:

$$\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)= \int \sum_{\hat{s}_{t:\hat{T}},\hat{e}_{\prec t}} \text{q}(\hat{s}_{t:\hat{T}},\hat{e}_{0:\hat{T}},\theta|\hat{a}_{t:\hat{T}},sa_{\prec t},\xi)\, \text{d}\theta$$

$$\begin{aligned} \mathfrak{M}(\text{q}(.,.,.|.,\xi),\hat{a}_{t:\hat{T}}) :&= \sum_{\hat{e}_{t:\hat{T}}} \text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}}) \sum_{\hat{s}_{t:\hat{T}}} \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}}) \log \text{q}(\hat{s}_{t:\hat{T}}|\hat{e}_{t:\hat{T}})\\ &= - \sum_{\hat{e}_{t:\hat{T}}} \text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})\,\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{e}_{t:\hat{T}})\\ &=-\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{E}_{t:\hat{T}},\hat{a}_{t:\hat{T}}) \end{aligned}$$

References:

Aslanides, J., Leike, J., and Hutter, M. (2017). Universal Reinforcement Learning Algorithms: Survey and Experiments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1403–1410.


Ay, N., Bertschinger, N., Der, R., Güttler, F., and Olbrich, E. (2008). Predictive Information and Explorative Behavior of Autonomous Robots. The European Physical Journal B-Condensed Matter and Complex Systems, 63(3):329–339.


Friston, K. J., Parr, T., and de Vries, B. (2017). The Graphical Brain: Belief Propagation and Active Inference. Network Neuroscience, 1(4):381–414.


Klyubin, A., Polani, D., and Nehaniv, C. (2005). Empowerment: A Universal Agent-Centric Measure of Control. In The 2005 IEEE Congress on Evolutionary Computation, 2005, volume 1, pages 128–135.


Orseau, L., Lattimore, T., and Hutter, M. (2013). Universal Knowledge-Seeking Agents for Stochastic Environments. In Jain, S., Munos, R., Stephan, F., and Zeugmann, T., editors, Algorithmic Learning Theory, number 8139 in Lecture Notes in Computer Science, pages 158–172. Springer Berlin Heidelberg.


Storck, J., Hochreiter, S., and Schmidhuber, J. (1995). Reinforcement Driven Information Acquisition in Non-Deterministic Environments. In Proceedings of the International Conference on Artificial Neural Networks, volume 2, pages 159–164.

 

 
