Introduction to intrinsic motivations

Martin Biehl

3. Some intrinsic motivations

1. Free energy minimization

Actions should lead to environment states expected to have precise sensor values.

\newcommand{\hT}{\hat{T}} \newcommand{\thT}{{t:\hT}} \newcommand{\hs}{\hat{s}} \newcommand{\pt}{{\prec t}} \newcommand{\pet}{{\preceq t}} \newcommand{\set}{{\succeq t}} \newcommand{\ha}{\hat{a}} \newcommand{\he}{\hat{e}} \newcommand{\q}{\text{q}} \newcommand{\diff}{\text{d}} \newcommand{\ptau}{{\prec \tau}} \newcommand{\petau}{{\preceq \tau}} \newcommand{\stau}{{\succ \tau}} \newcommand{\setau}{{\succeq \tau}} \newcommand{\argmax}{\text{argmax}} \newcommand{\mot}{\mathfrak{M}} \newcommand{\HS}{\text{H}} \newcommand{\hS}{\hat{S}} \newcommand{\hE}{\hat{E}} \q(\he_\thT|\ha_\thT)= \int \sum_{\hs_\thT,\he_\pt} \q(\hs_\thT,\he_{0:\hT},\theta|\ha_\thT) \diff \theta

\newcommand{\hT}{\hat{T}} \newcommand{\thT}{{t:\hT}} \newcommand{\hs}{\hat{s}} \newcommand{\pt}{{\prec t}} \newcommand{\pet}{{\preceq t}} \newcommand{\set}{{\succeq t}} \newcommand{\ha}{\hat{a}} \newcommand{\he}{\hat{e}} \newcommand{\q}{\text{q}} \newcommand{\diff}{\text{d}} \newcommand{\ptau}{{\prec \tau}} \newcommand{\petau}{{\preceq \tau}} \newcommand{\stau}{{\succ \tau}} \newcommand{\setau}{{\succeq \tau}} \newcommand{\argmax}{\text{argmax}} \newcommand{\mot}{\mathfrak{M}} \newcommand{\HS}{\text{H}} \newcommand{\hS}{\hat{S}} \newcommand{\hE}{\hat{E}} \begin{aligned} \mot(\q(.,.,.|.,\xi),\ha_\thT) :&=-\HS_{\q}(\hS_\thT|\hE_\thT,\ha_\thT)\\ &= \sum_{\he_\thT} \q(\he_\thT|\ha_\thT) \sum_{\hs_\thT} \q(\hs_\thT|\he_\thT) \log \q(\hs_\thT|\he_\thT)\\ \end{aligned}

Get $\text{q}(\hat{e}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})$ frome the complete posterior:

3. Some intrinsic motivations

1. Free energy minimization

random noise source are avoided
will get stuck in known "dark room traps"
- we know $\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{E}_{t:\hat{t}},\hat{a}_{t:\hat{T}})=0$
- such an optimal action sequence $\hat{a}_{t:\hat{T}}$ exists e.g. if there is a "dark room" in the environment
- even if it cannot be escaped once entered
- solved by adding KL divergence to constructed desired sensory experience
  - breaks purpose of intrinsic motivations (not scalable)
Free energy is not suitable for AGI

\newcommand{\hT}{{\hat{T}}} \newcommand{\thT}{{t:\hT}} \newcommand{\hTa}{{\hat{T}_a}} \newcommand{\thTa}{{t:\hTa}} \newcommand{\hs}{\hat{s}} \newcommand{\pt}{{\prec t}} \newcommand{\pet}{{\preceq t}} \newcommand{\set}{{\succeq t}} \newcommand{\ha}{\hat{a}} \newcommand{\he}{\hat{e}} \newcommand{\q}{\text{q}} \newcommand{\d}{\text{q}} \newcommand{\diff}{\text{d}} \newcommand{\ptau}{{\prec \tau}} \newcommand{\petau}{{\preceq \tau}} \newcommand{\stau}{{\succ \tau}} \newcommand{\setau}{{\succeq \tau}} \newcommand{\argmax}{\text{argmax}} \newcommand{\mot}{\mathfrak{M}} \newcommand{\HS}{\text{H}} \newcommand{\I}{\text{I}} \newcommand{\hS}{\hat{S}} \newcommand{\hE}{\hat{E}} \newcommand{\hA}{\hat{A}} \begin{aligned} \mot^{PI}(\d(.,.,.|.),\ha_\thT) :&= \I_{\q}(\hS_{t:t+k-1}:\hS_{t+k:t+2k-1}|\ha_\thT)\\ &=\sum_{\hs_{t:t+2k-1}} \q(\hs_{t:t+2k-1}|\ha_\thT) \log \frac{\q(\hs_{t+k:t+2k-1}|\hs_{t:t+k-1},\ha_\thT)}{\q(\hs_{t:t+k-1}|\ha_\thT)}\\ \end{aligned}

3. Some intrinsic motivations

2. Predictive information maximization

Actions should lead to the most complex sensor stream:

Next $k$ sensor values should have max mutual information with the subsequent $k$ .
Can get needed distributions from complete posterior.

3. Some intrinsic motivations

2. Predictive information maximization

random noise source are avoided as they produce no mutual information
will not get stuck in known "dark room traps"
- from $\text{H}_{\text{q}}(\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{I}_{\text{q}}(\hat{S}_{t:t+k-1},\hat{S}_{t+k:t+2k-1}|\hat{a}_{t:\hat{T}})=0$
possible long term behavior:
- ergodic sensor process
- finds a subset of environment states that allows this ergodicity

\newcommand{\hT}{{\hat{T}}} \newcommand{\thT}{{t:\hT}} \newcommand{\hTa}{{\hat{T}_a}} \newcommand{\thTa}{{t:\hTa}} \newcommand{\hs}{\hat{s}} \newcommand{\pt}{{\prec t}} \newcommand{\pet}{{\preceq t}} \newcommand{\set}{{\succeq t}} \newcommand{\ha}{\hat{a}} \newcommand{\he}{\hat{e}} \newcommand{\q}{\text{q}} \newcommand{\d}{\text{q}} \newcommand{\diff}{\text{d}} \newcommand{\ptau}{{\prec \tau}} \newcommand{\petau}{{\preceq \tau}} \newcommand{\stau}{{\succ \tau}} \newcommand{\setau}{{\succeq \tau}} \newcommand{\argmax}{\text{argmax}} \newcommand{\mot}{\mathfrak{M}} \newcommand{\HS}{\text{H}} \newcommand{\I}{\text{I}} \newcommand{\hS}{\hat{S}} \newcommand{\hE}{\hat{E}} \newcommand{\hA}{\hat{A}} \begin{aligned} \mot^{KSA}(\q(.,.,.|.),\ha_\thT) :&= \I_{\q}(\hS_\thT:\Theta|\ha_\thT)\\ &=\sum_{\hs_\thT} \int \q(\hs_\thT,\theta|\ha_\thT) \log \frac{\q(\theta|\hs_\thT,\ha_\thT)}{\q(\theta)} \diff \theta \end{aligned}

3. Some intrinsic motivations

3. Knowledge seeking

Actions should lead to sensor values that tell the most about model parameters $\Theta$ :

Also known as information gain maximization
Can get needed distributions from complete posterior.

3. Some intrinsic motivations

3. Knowledge seeking

avoids random noise sources once they are known
similar to prediction progress
can rewrite as $\text{H}_{\text{q}}(\Theta)-\text{H}_{\text{q}}(\Theta|\hat{S}_{t:\hat{T}},\hat{a}_{t:\hat{T}})$
will not get stuck in known "dark room traps"
- from $\text{H}_{\text{q}}(\hat{S}_{t:\hat{T}}|\hat{a}_{t:\hat{T}})=0\Rightarrow\text{I}_{\text{q}}(\hat{S}_{t:\hat{T}},\Theta|\hat{a}_{t:\hat{T}})=0$
possible long term behavior:
- when model is known does nothing / random walk

\newcommand{\hT}{{\hat{T}}} \newcommand{\thT}{{t:\hT}} \newcommand{\hTa}{{\hat{T}_a}} \newcommand{\thTa}{{t:\hTa}} \newcommand{\hs}{\hat{s}} \newcommand{\pt}{{\prec t}} \newcommand{\pet}{{\preceq t}} \newcommand{\set}{{\succeq t}} \newcommand{\ha}{\hat{a}} \newcommand{\he}{\hat{e}} \newcommand{\q}{\text{q}} \newcommand{\d}{\text{q}} \newcommand{\diff}{\text{d}} \newcommand{\ptau}{{\prec \tau}} \newcommand{\petau}{{\preceq \tau}} \newcommand{\stau}{{\succ \tau}} \newcommand{\setau}{{\succeq \tau}} \newcommand{\argmax}{\text{argmax}} \newcommand{\mot}{\mathfrak{M}} \newcommand{\HS}{\text{H}} \newcommand{\I}{\text{I}} \newcommand{\hS}{\hat{S}} \newcommand{\hE}{\hat{E}} \newcommand{\hA}{\hat{A}} \begin{aligned} \mot^{EM}(\d(.,.,.|.),\ha_\thTa) :&= \max_{\d(\ha_{\hTa+1:\hT})} \; \I_{\d}(\hA_{\hTa+1:\hT}:\hS_\hT|\ha_\thTa) \\ &=\max_{\d(\ha_{\hTa+1:\hT})} \; \sum_{\ha_{\hTa+1:\hT},\hs_\hT} \d(\ha_{\hTa+1:\hT}) \d(\hs_\hT|\ha_\thT) \log \frac{\d(\hs_\hT|\ha_\thT)}{\d(\hs_\hT|\ha_\thTa)}. \end{aligned}

3. Some intrinsic motivations

4. Empowerment maximization

Actions should lead to control over as many future experiences as possible:

Actions $\hat{a}_{t:\hat{T}_a}$ are taken such that subsequent actions $\hat{a}_{\hat{T}_a+1:\hat{T}}$ have control
Can get needed distributions from complete posterior.

3. Some intrinsic motivations

4. Empowerment maximization

avoids random noise sources because they cannot be controlled
will not get stuck in known "dark room traps"
- from $\text{H}_{\text{q}}(\hat{S}_{\hat{T}}|\hat{a}_{t:\hat{T}_a})=0\Rightarrow\text{I}_{\text{q}}(\hat{A}_{\hat{T}_a+1:\hat{T}}\hat{S}_{\hat{T}}:|\hat{a}_{t:\hat{T}_a})=0$
similar to energy and money maximization but more general
possible long term behavior:
- remains in (or maintains) the situation where it expects the most control over future experience
- exploration behavior not fully understood
- Belief empowerment may solve it...

Introduction to intrinsic motivations Martin Biehl

Introduction to intrinsic motivations

By slides_martin

Introduction to intrinsic motivations

Presentation at Araya on 17 August 2018

7 years ago
1,176

Introduction to intrinsic motivations

Introduction to intrinsic motivations

More from slides_martin