Active Inference in multi-armed bandits

Dimitrije Marković

Theoretical Neurobiology Meeting

30.11.2020

Active inference and semi-Markov decision processes

  • (Part I) Active inference in multi-armed bandits
    • Comparison with UCB and Thompson sampling.
    • Comparison of state-based and outcome-based preferences.

 

  • (Part II) Active inference and semi-Markov processes
    • Semi-Markov models and a representation of state duration.
    • Learning the hidden temporal structure of state transitions.
    • Application: Reversal learning task.

  • (Part III) Active inference and semi-Markov decision processes
    • Extending policies with action (policy) duration.
    • Deciding when actions should be taken and for how long.
    • Applications: Temporal attention, intertemporal choices.

Motivation

  • Multi-armed bandits generalise resource allocation problems
  • Often used in experimental cognitive neuroscience:
    • Decision making in dynamic environments
    • Spatio-temporal attentional processing
  • Wide range of industrial applications.
  • Good benchmark problem for comparing active-inference-based solutions with state-of-the-art alternatives.

 

Multi-armed Bandits

  • Stationary bandits
  • Dynamic bandits
  • Adversarial bandits
  • Risk-aware bandits
  • Contextual bandits
  • Non-Markovian bandits

Stationary (classical) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • Reward probabilities are fixed, with \(p_{max}=\frac{1}{2} + \epsilon\) for the best arm and \( p_{\neg max} = \frac{1}{2}\) for all other arms.

  • Task difficulty:

    • The best arm advantage \(\rightarrow\) \(\epsilon\)

    • The number of arms \(\rightarrow\) \(K\)

Non-stationary (Dynamic) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
  • Reward probabilities associated with each arm change over time:
    • Changes happen at the same time on all arms (e.g. switching bandits)
    • Changes happen independently on each arm (e.g. restless bandits)
  • The additional task difficulty:
    • The rate of change/change probability

Choice optimality as regret minimization

  • At trial \( t \) an agent chooses arm \(a_t\).
  • We define the regret as

$$ R(t) = p_{max}(t) - p_{a_t}(t) $$

  • The cumulative regret after \(T\) trials is obtained as 

$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$

  • Regret rate is defined as 

$$ r(T) = \frac{1}{T} \hat{R}(T)$$
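As a minimal illustration (a Python/NumPy sketch; the function name and array layout are my own, not from the slides), the three regret quantities above can be computed directly from a sequence of choices:

```python
import numpy as np

def regret_rate(p, choices):
    """Per-trial regret R(t), cumulative regret, and regret rate.

    p       : array of shape (T, K), reward probabilities p_k(t) on each trial
    choices : integer array of shape (T,), the chosen arm a_t on each trial
    """
    T = len(choices)
    p_max = p.max(axis=-1)                   # p_max(t)
    p_chosen = p[np.arange(T), choices]      # p_{a_t}(t)
    R = p_max - p_chosen                     # per-trial regret R(t)
    R_cum = np.cumsum(R)                     # cumulative regret \hat{R}(T)
    r = R_cum / np.arange(1, T + 1)          # regret rate r(T) = \hat{R}(T) / T
    return R, R_cum, r
```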

Bayesian Inference

  • Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
  • Beliefs about reward probabilities are represented as Beta distributions.
  • Bayesian belief updating is shared by all action selection algorithms.

Generative model

  • A \(K\)-armed bandit.
  • The \(k\)th arm is associated with a reward probability \(\theta^k_t\) at trial \(t\).
  • Choice outcomes are denoted by \(o_t\).

$$ p(o_t|\theta^k_t) = \left(\theta^k_t\right)^{o_{t}}\left( 1- \theta^k_t \right)^{1-o_{t}} $$

$$ p(\theta^k_t|\theta^{k}_{t-1}, j_t) = \left\{ \begin{array}{cc} \mathcal{Be}(\alpha_0, \beta_0) & \textrm{for } j_{t} = 1 \\ \delta(\theta^k_{t} - \theta^k_{t-1}) & \textrm{for } j_{t} = 0 \end{array} \right. $$

$$ p(j_{t}=j|j_{t-1}, \rho) = \left\{ \begin{array}{cc} \delta_{j, 0} & \textrm{for } j_{t-1} = 1 \\ \rho^{j}(1-\rho)^{1-j} & \textrm{for } j_{t-1} = 0 \end{array} \right. $$
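A sketch of this generative process in Python/NumPy (the function name and defaults are illustrative assumptions; for \(\rho = 0\) it reduces to the stationary case):

```python
import numpy as np

def simulate_dynamic_bandit(T, K, rho, alpha0=1.0, beta0=1.0, seed=0):
    """Simulate reward probabilities of a dynamic Bernoulli bandit.

    On a change event (j_t = 1) all arms' reward probabilities are redrawn
    from Be(alpha0, beta0); otherwise they stay fixed. A change at t-1
    forces j_t = 0, as in the transition prior above.
    """
    rng = np.random.default_rng(seed)
    theta = rng.beta(alpha0, beta0, size=K)   # initial reward probabilities
    j_prev = 0
    thetas = np.empty((T, K))
    for t in range(T):
        j = 0 if j_prev else int(rng.random() < rho)
        if j:
            theta = rng.beta(alpha0, beta0, size=K)
        thetas[t] = theta
        j_prev = j
    return thetas   # outcome for choice a_t: o_t ~ Bernoulli(thetas[t, a_t])
```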

Belief updating

Given a choice \(a_t\) on trial \(t\), with \(\vec{\theta}_t = (\theta_t^1, \ldots, \theta_t^K) \), belief updating corresponds to

$$ p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}_t, a_t)\, p(\vec{\theta}_t|j_t, o_{t-1:1})\, p(j_t|o_{t-1:1}) $$

$$ p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \approx \prod_{k=1}^K Q(\theta_k)\, Q(j_t), \qquad Q(\theta_k) = \mathcal{Be}(\alpha^k_t, \beta^k_t), \quad Q(j_t=1) = \omega_t $$

Belief updating

The resulting variational update equations, given choice \(a_t\) and outcome \(o_t\), are

$$ \omega_t = \frac{\frac{1}{2}\rho(1 - \omega_{t-1})}{ \frac{1}{2}\rho(1 - \omega_{t-1}) + (\mu_t^{a_t})^{o_t}(1 - \mu_t^{a_t})^{1-o_t}\left(1 - \rho(1-\omega_{t-1})\right)} $$

$$ \alpha^k_t = (1 - \omega_t) \alpha^k_{t-1} + \omega_t \alpha_0 + \delta_{a_t, k}\, o_t $$

$$ \beta^k_t = (1 - \omega_t) \beta^k_{t-1} + \omega_t \beta_0 + \delta_{a_t, k}\, (1 - o_t) $$

Stationary case \(\rightarrow \) \(\rho = 0\)

$$ \omega_t = 0, \qquad \alpha^k_t = \alpha^k_{t-1} + \delta_{a_t, k}\, o_t, \qquad \beta^k_t = \beta^k_{t-1} + \delta_{a_t, k}\, (1 - o_t) $$
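A sketch of one belief-update step implementing the equations above (Python/NumPy; I assume \(\mu_t^{a_t}\) denotes the predictive mean \(\alpha^{a_t}_{t-1} / (\alpha^{a_t}_{t-1} + \beta^{a_t}_{t-1})\) of the chosen arm, and the function name is mine):

```python
import numpy as np

def update_beliefs(alpha, beta, omega, a, o, rho, alpha0=1.0, beta0=1.0):
    """One variational belief update for the dynamic Bernoulli bandit.

    alpha, beta : arrays of shape (K,), Beta parameters of Q(theta_k)
    omega       : previous change probability Q(j_{t-1} = 1)
    a, o        : chosen arm (int) and binary outcome on the current trial
    """
    mu = alpha[a] / (alpha[a] + beta[a])   # predictive mean of chosen arm (assumed mu_t^{a_t})
    change = 0.5 * rho * (1.0 - omega)
    stay = mu**o * (1.0 - mu)**(1 - o) * (1.0 - rho * (1.0 - omega))
    omega_new = change / (change + stay)

    chosen = np.zeros_like(alpha)
    chosen[a] = 1.0                        # Kronecker delta over arms
    alpha_new = (1.0 - omega_new) * alpha + omega_new * alpha0 + chosen * o
    beta_new = (1.0 - omega_new) * beta + omega_new * beta0 + chosen * (1 - o)
    return alpha_new, beta_new, omega_new  # for rho = 0: omega_new = 0, standard Beta update
```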

Action Selection Algorithms

  • Optimistic Thompson sampling (O-TS)
  • Bayesian upper confidence bound (B-UCB)
  • Active inference

Thompson sampling

Raj, Vishnu, and Sheetal Kalyani. "Taming non-stationary bandits: A Bayesian approach." arXiv preprint arXiv:1707.09727 (2017).

Classical algorithm for Bayesian bandits

 

$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta^k_t|o_{t-1:1}, a_{t-1:1}\right)$$

 

Optimistic Thompson sampling (O-TS) is defined as

 

$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta^k_t \rangle) \right]$$
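A sketch of both rules in Python/NumPy (the function name is mine; `optimistic=True` gives O-TS by clipping each posterior sample at the arm's posterior mean):

```python
import numpy as np

def thompson_choice(alpha, beta, optimistic=True, rng=None):
    """Thompson sampling / optimistic Thompson sampling for Bernoulli bandits."""
    if rng is None:
        rng = np.random.default_rng()
    theta_star = rng.beta(alpha, beta)              # one posterior sample per arm
    if optimistic:
        # O-TS: never sample below the posterior mean of each arm
        theta_star = np.maximum(theta_star, alpha / (alpha + beta))
    return int(np.argmax(theta_star))
```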

Upper Confidence Bound (UCB)

  • One of the oldest bandit algorithms, defined as follows (see the code sketch after this list)

$$ a_t = \left\{ \begin{array}{cc} \arg\max_k \left(\mu^k_{t-1} + \sqrt{\frac{2 \ln t}{n^{k}_{t-1}}}\right) & \textrm{for } t>K \\ t & \textrm{otherwise} \end{array} \right. $$
  • \(\mu^{k}_t\) denotes the expected reward probability of the \(k\)th arm.
  • \(n^k_{t}\) denotes the number of times the \(k\)th arm has been selected.
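A sketch of the classic rule above (Python/NumPy; here `t` is the 1-based trial index and arms are indexed from 0, so the initialisation phase plays arm \(t-1\)):

```python
import numpy as np

def ucb_choice(t, mu, n, K):
    """Classic UCB: play every arm once, then maximise mean plus exploration bonus.

    t  : 1-based trial index
    mu : array of shape (K,), empirical estimates mu^k_{t-1}
    n  : array of shape (K,), selection counts n^k_{t-1}
    """
    if t <= K:
        return t - 1                                # initialisation phase: play arm t
    return int(np.argmax(mu + np.sqrt(2.0 * np.log(t) / n)))
```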

Bayesian Upper Confidence Bound (B-UCB)

$$ a_t = \arg\max_k \mathrm{CDF}^{-1}\left( 1 - \frac{1}{t};\, \bar{\alpha}_t^k, \bar{\beta}_t^k \right) $$

where \(\mathrm{CDF}^{-1}\) is the inverse regularised incomplete beta function, i.e. \(x = \mathrm{CDF}^{-1}\left(1 - \frac{1}{t};\, \bar{\alpha}_t^k, \bar{\beta}_t^k\right)\) solves

$$ \int_{0}^x \mathcal{Be} \left(\theta; \bar{\alpha}_t^k, \bar{\beta}_t^k \right) \textrm{d}\theta = 1 - \frac{1}{t} $$

with

$$ \bar{\alpha}_t^k = \left(1 - \rho(1-\omega_{t-1})\right) \alpha_{t-1}^k + \frac{1}{2}\rho(1-\omega_{t-1}), \qquad \bar{\beta}_t^k = \left(1 - \rho(1-\omega_{t-1})\right) \beta_{t-1}^k + \frac{1}{2}\rho(1-\omega_{t-1}) $$

Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems." Artificial Intelligence and Statistics, 2012.
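A sketch of B-UCB using `scipy.stats.beta.ppf`, which evaluates the inverse of the regularised incomplete beta function; the prior mixing of \(\bar{\alpha}, \bar{\beta}\) follows the expressions above, and for \(\rho = 0\) it reduces to the plain posterior parameters:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def b_ucb_choice(t, alpha, beta, omega, rho):
    """Bayesian UCB: choose the arm with the largest 1 - 1/t posterior quantile.

    Assumes t >= 2 so that the quantile level 1 - 1/t is positive.
    """
    stay = 1.0 - rho * (1.0 - omega)
    change = 0.5 * rho * (1.0 - omega)
    alpha_bar = stay * alpha + change
    beta_bar = stay * beta + change
    # quantile of the (prior-mixed) Beta posterior at level 1 - 1/t
    q = beta_dist.ppf(1.0 - 1.0 / t, alpha_bar, beta_bar)
    return int(np.argmax(q))
```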

Active inference

  • A behavioural policy corresponds to a single choice \(\rightarrow\) \(\pi = a_t, \quad a_t \in \{1,\ldots,K\}\).
  • Rolling behavioural policies \( \rightarrow \) independent of past choices.

Expected free energy:

$$ G(a_t) = D_{KL}\left(Q(\vec{\theta}|a_t)||P(\vec{\theta}|a_t)\right) + E_{Q(\vec{\theta}|a_t)}\left[H[o_t|\vec{\theta}, a_t] \right] $$

Expected surprisal:

$$ S(a_t) = D_{KL}\left(Q(o_t |a_t)||P(o_t)\right) + E_{Q(\vec{\theta}|a_t)}\left[H[o_t|\vec{\theta}, a_t] \right] $$

Parr, Thomas, and Karl J. Friston. "Generalised free energy and active inference." Biological Cybernetics 113.5-6 (2019): 495-513.

Active inference

  • \( G(a_t) \geq S(a_t) \), but in general \( \arg\min_a G(a) \neq \arg\min_a S(a)\).

Prior preferences over reward probabilities and outcomes are parameterised by \(\lambda\) and related by marginalisation:

$$ P(\vec{\theta}_t| a_t=k) \propto \left( \theta_t^k\right)^{e^{2\lambda}-1}, \qquad P(o_t) \propto e^{o_t \lambda} e^{-(1-o_t) \lambda}, \qquad P(o_t) = \int \mathrm{d} \vec{\theta}\, p(o_t|\vec{\theta}_t, a_t)\, P(\vec{\theta}_t|a_t) $$
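As a consistency check (my own short derivation), marginalising the parameter preference, \(P(\vec{\theta}_t|a_t = k) = \mathcal{Be}(\theta^k_t; e^{2\lambda}, 1)\), reproduces the outcome preference:

$$ P(o_t = 1) = \int \mathrm{d}\vec{\theta}\, \theta^k_t\, P(\vec{\theta}_t|a_t = k) = \frac{e^{2\lambda}}{e^{2\lambda} + 1} = \frac{e^{\lambda}}{e^{\lambda} + e^{-\lambda}} \propto e^{\lambda}, $$

which matches \(P(o_t) \propto e^{o_t \lambda} e^{-(1-o_t)\lambda}\) evaluated at \(o_t = 1\).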

Active inference - choice selection

Posterior over policies

$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)}$$

 

For rolling policies \( F(\pi) = \mathrm{const.} \), hence

$$Q(\pi)\propto e^{-\gamma G(\pi)}$$

Choice selection based on expected free energy (F-AI)

    $$ a_t \sim p(a_t) \propto e^{-\gamma G(a_t, \lambda)} $$

Choice selection based on expected surprisal (S-AI)

    $$ a_t \sim p(a_t) \propto e^{-\gamma S(a_t, \lambda)} $$

Approximate expected surprisal (A-AI)

Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis." arXiv preprint arXiv:2001.07203 (2020).

$$ S(a_t) \approx - \left(1-\rho(1-\omega_{t-1})\right)\cdot\left[\lambda (2\mu_{t-1}^{a_t} - 1) + \frac{1}{2\nu_{t-1}^{a_t}} \right] $$
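A sketch combining the approximation above with the softmax choice rule from the previous slide (Python/NumPy; I assume \(\mu^k = \alpha^k/(\alpha^k+\beta^k)\) and \(\nu^k = \alpha^k + \beta^k\), which is not stated explicitly on the slide):

```python
import numpy as np

def a_ai_choice(alpha, beta, omega, rho, lam=0.8, gamma=20.0, rng=None):
    """Approximate active inference (A-AI) choice selection."""
    if rng is None:
        rng = np.random.default_rng()
    mu = alpha / (alpha + beta)    # assumed posterior mean mu^k
    nu = alpha + beta              # assumed "precision" nu^k of the Beta posterior
    # approximate expected surprisal per arm
    S = -(1.0 - rho * (1.0 - omega)) * (lam * (2.0 * mu - 1.0) + 0.5 / nu)
    logits = -gamma * S
    p = np.exp(logits - logits.max())
    p /= p.sum()                   # softmax: p(a) proportional to exp(-gamma * S(a))
    return int(rng.choice(len(p), p=p))
```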

Next step: compare F-AI, S-AI, and A-AI in stationary bandits.

Comparison between AI variants

Figure: parametrised regret rate for the A-AI algorithm.

Comparison:

  • Optimistic Thompson sampling (O-TS)
  • Bayesian upper confidence bound (B-UCB)
  • Approximate active inference (A-AI)

Comparison in stationary bandits

A-AI parameters: \(\lambda = 0.8\), \(\gamma = 20\).

Figure: regret rate over time for \(\epsilon = 0.10\), \(\epsilon = 0.25\), and \(\epsilon = 0.40\).

Switching bandits

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\).

  • All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).

  • The optimal arm changes with probability \(\rho\) (see the simulation sketch after this list).

  • Task difficulty:
    • Advantage of the best arm \(\rightarrow\) \(\epsilon\)
    • Number of arms \(\rightarrow\) \(K\)
    • Change probability \(\rightarrow\) \(\rho\)
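A sketch of the switching environment (Python/NumPy; I assume that on a change event the new best arm is drawn uniformly from the remaining arms, which is one reading of "the optimal arm changes"):

```python
import numpy as np

def simulate_switching_bandit(T, K, eps, rho, seed=0):
    """Reward probabilities of a switching Bernoulli bandit.

    One arm pays 1/2 + eps, all others pay 1/2; with probability rho per
    trial the identity of the best arm moves to a different arm.
    """
    rng = np.random.default_rng(seed)
    best = int(rng.integers(K))
    probs = np.full((T, K), 0.5)
    for t in range(T):
        if rng.random() < rho:
            others = [k for k in range(K) if k != best]
            best = int(rng.choice(others))      # the optimal arm switches
        probs[t, best] = 0.5 + eps
    return probs    # outcome for choice a_t: o_t ~ Bernoulli(probs[t, a_t])
```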

Comparison in switching bandits

Figure: comparison for \(\epsilon = 0.25\); A-AI parameters: \(\lambda = 0.8\), \(\gamma = 20\).

Conclusion

  • When comparing active inference with the alternatives on stationary bandits, we get mixed results.
  • In non-stationary bandits, active-inference-based agents show lower performance if changes occur often enough.
  • A tentative TODO list:
    • Determine the optimality relation for AI algorithms, e.g. \(\gamma^* = f(\lambda, K, \epsilon, \rho)\).
    • Introduce learning of \(\lambda\).
    • Would adaptive \(\lambda\) lead to values that minimize regret rate?

Thanks to:

  • Hrvoje Stojić
  • Sarah Schwöbel
  • Stefan Kiebel
  • Thomas Parr
  • Karl Friston
  • Ryan Smith
  • Lancelot Da Costa
Project A9

https://slides.com/revealme/

https://github.com/dimarkov/aibandits