AN EMPIRICAL EVALUATION OF ACTIVE INFERENCE IN MULTI-ARMED BANDITS

Dimitrije Marković, Hrvoje Stojić, Sarah Schwöbel, and Stefan Kiebel

Active Inference Lab

22.06.2021

Max Planck UCL Centre for Computational Psychiatry and Ageing Research

Secondmind (https://www.secondmind.ai/)

Motivation

  • Multi-armed bandits generalise resource allocation problems.
  • Often used in experimental cognitive neuroscience:
    • Decision making in dynamic environments \(^1\).
    • Value-based decision making \(^2\).
    • Structure learning \(^3\).
  • Wide range of industrial applications\(^4\).

 

[1] Wilson, Robert C., and Yael Niv. "Inferring relevance in a changing world." Frontiers in human neuroscience 5 (2012): 189.

[2] Payzan-LeNestour, Elise, et al. "The neural representation of unexpected uncertainty during value-based decision making." Neuron 79.1 (2013): 191-201.

[3] Schulz, Eric, Nicholas T. Franklin, and Samuel J. Gershman. "Finding structure in multi-armed bandits." Cognitive psychology 119 (2020): 101261.

[4] Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. "Survey on Applications of Multi-Armed and Contextual Bandits." 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2020.

Multi-armed Bandits

  • Stationary bandits
  • Dynamic bandits
  • Adversarial bandits
  • Risk-aware bandits
  • Contextual bandits
  • non-Markovian bandits
  • ...

Outline

  • Stationary bandits
    • Exact inference
    • Action selection algorithms
    • Asymptotic efficiency
  • Switching bandits
    • Approximate inference
    • Comparison

Stationary (classical) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • Reward probabilities are fixed with \(p_{max}=\frac{1}{2} + \epsilon\) and \( p_{\neg max} = \frac{1}{2}\).

  • Task difficulty:

    • The advantage of the best arm \(\rightarrow\) \(\epsilon\)

    • The number of arms \(\rightarrow\) \(K\)

Bayesian Inference

  • Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
  • Beliefs about reward probabilities as Beta distributions.
  • Bayesian belief updating for all action selection algorithms.

Bayesian Inference

Generative model

p(o_t|\vec{\theta}, a_t) = \prod_{k=1}^K \left[ \left(\theta_k\right)^{o_{t}}\left( 1- \theta_k \right)^{1-o_{t}} \right]^{\delta_{k, a_t}}
  • A K-armed bandit.
  • The \(k\)th arm is associated with a reward probability \(\theta_k \in [0, 1]\).
  • Choice outcomes denoted with  \( o_t \in \{0, 1 \} \).
p\left(\vec{\theta}\right) = \prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_0, \beta_0)

Belief updating

Given some choice \(a_t\) on trial \(t\), belief updating corresponds to

p(\vec{\theta}|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}, a_t) p(\vec{\theta}|o_{t-1:1}, a_{t-1:1})

The Beta distribution is a conjugate prior:

  • Both the prior and the posterior belong to the same distribution family.
  • Inference is exact.

\prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}) \propto p(o_t|\vec{\theta}, a_t) \prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t-1, k}, \beta_{t-1, k})
\alpha_{t, k} = \alpha_{t-1, k} + \delta_{a_t, k} o_t
\beta_{t, k} = \beta_{t-1, k} + \delta_{a_t, k} \left( 1 - o_t \right)
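A minimal sketch of this conjugate update in Python; the function and array names are illustrative, not from the slides.

    import numpy as np

    def update_beliefs(alpha, beta_, a_t, o_t):
        # Exact conjugate update: only the chosen arm's Beta parameters change.
        alpha = alpha.copy()
        beta_ = beta_.copy()
        alpha[a_t] += o_t          # alpha_{t,k} = alpha_{t-1,k} + delta_{a_t,k} o_t
        beta_[a_t] += 1 - o_t      # beta_{t,k}  = beta_{t-1,k}  + delta_{a_t,k} (1 - o_t)
        return alpha, beta_

    # usage with a uniform prior Be(1, 1) over K = 4 arms
    K = 4
    alpha, beta_ = np.ones(K), np.ones(K)
    alpha, beta_ = update_beliefs(alpha, beta_, a_t=2, o_t=1)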

Belief updating

Given some choice \(a_t\) on trial \(t\), belief updating corresponds to

p(\vec{\theta}|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}, a_t) p(\vec{\theta}|o_{t-1:1}, a_{t-1:1})
\prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}) \propto p(o_t|\vec{\theta}, a_t) \prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t-1, k}, \beta_{t-1, k})
\alpha_{t, k} = \alpha_{t-1, k} + \delta_{a_t, k} o_t
\beta_{t, k} = \beta_{t-1, k} + \delta_{a_t, k} \left( 1 - o_t \right)

Equivalently, in terms of the counts \(\nu_{t, k} = \alpha_{t, k} + \beta_{t, k}\) and means \(\mu_{t, k} = \frac{\alpha_{t, k}}{\nu_{t, k}}\):

\nu_{t, k} = \nu_{t-1, k} + \delta_{a_t, k}
\mu_{t, k} = \mu_{t-1, k} + \frac{\delta_{a_t, k}}{\nu_{t, k}} \left( o_t - \mu_{t-1, k} \right)
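The same update in the count/mean parametrisation, again only as an illustrative sketch (nu and mu are numpy arrays of per-arm counts and means):

    def update_mean_count(nu, mu, a_t, o_t):
        # Equivalent update with counts nu = alpha + beta and means mu = alpha / nu.
        nu, mu = nu.copy(), mu.copy()
        nu[a_t] += 1
        mu[a_t] += (o_t - mu[a_t]) / nu[a_t]   # incremental update of the posterior mean
        return nu, mu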

Action Selection Algorithms

  • Optimistic Thompson sampling (O-TS)
  • Bayesian upper confidence bound (B-UCB)
  • Active inference
  • \(\epsilon\)-greedy
  • UCB
  • KL-UCB
  • Thompson sampling
  • etc.


Upper confidence bound (UCB)

  • One of the oldest algorithms

a_t = \left\{ \begin{array}{cc} \arg\max_k \left( m_{t, k} + \frac{\ln t}{n_{t,k}} + \sqrt{ \frac{ m_{t,k} \ln t}{ n_{t, k}}} \right) & \textrm{for } t>K \\ t & \textrm{otherwise} \end{array} \right.
  • \(m_{t, k} \) denotes the expected reward probability of the \(k\)th arm.
  • \(n_{t, k}\) denotes the number of times the \(k\)th arm was selected.

Chapelle, Olivier, and Lihong Li. "An empirical evaluation of Thompson sampling." Advances in neural information processing systems 24 (2011): 2249-2257.
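A sketch of this selection rule as written on the slide, assuming 0-based arm indices and that each arm is played once during the first \(K\) trials:

    import numpy as np

    def ucb_choice(t, m, n):
        # m: empirical reward probabilities; n: number of pulls per arm; t: 1-based trial index.
        K = len(m)
        if t <= K:
            return t - 1   # play each arm once during the first K trials (0-based index)
        bound = m + np.log(t) / n + np.sqrt(m * np.log(t) / n)
        return int(np.argmax(bound))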

Bayesian upper confidence bound (B-UCB)

a_t = \arg\max_k CDF^{-1}\left( 1 - \frac{1}{t (\ln n)^c}; \alpha_{t-1,k}, \beta_{t-1,k} \right)

The quantile \(CDF^{-1}\) is the inverse regularised incomplete beta function, defined by

\int_{0}^x \mathcal{Be} \left(\theta; \alpha_{t-1, k}, \beta_{t-1, k} \right) \textrm{d}\theta = 1 - \frac{1}{t (\ln n)^c}

Here \(n\) denotes the horizon (the total number of trials). Best results are obtained for \(c = 0\).

Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems." Artificial intelligence and statistics, 2012.
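A sketch of the B-UCB rule with \(c = 0\); scipy's beta.ppf provides the required inverse regularised incomplete beta function (function name and defaults are illustrative):

    import numpy as np
    from scipy.stats import beta as beta_dist

    def b_ucb_choice(t, alpha, beta_, c=0, horizon=None):
        # Quantile level 1 - 1/(t (ln n)^c); with c = 0 (as on the slide) this is 1 - 1/t.
        level = 1 - 1 / t if c == 0 else 1 - 1 / (t * np.log(horizon) ** c)
        # beta.ppf is the posterior quantile (inverse regularised incomplete beta function)
        return int(np.argmax(beta_dist.ppf(level, alpha, beta_)))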

Thompson sampling

Classical algorithm for Bayesian bandits:

$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta_k|o_{t-1:1}, a_{t-1:1}\right)$$

Optimistic Thompson sampling (O-TS) is defined as

$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_k \rangle) \right]$$

Lu, Xue, Niall Adams, and Nikolas Kantas. "On adaptive estimation for dynamic Bernoulli bandits." Foundations of Data Science 1.2 (2019): 197.
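Both sampling rules in a few lines; the fixed random seed and function names are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)

    def ts_choice(alpha, beta_):
        # Thompson sampling: draw theta*_k from each Beta posterior and maximise.
        return int(np.argmax(rng.beta(alpha, beta_)))

    def ots_choice(alpha, beta_):
        # Optimistic Thompson sampling: never use a sample below the posterior mean.
        theta_star = rng.beta(alpha, beta_)
        theta_mean = alpha / (alpha + beta_)
        return int(np.argmax(np.maximum(theta_star, theta_mean)))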

Active inference

expected free energy

G_t(a_t) = D_{KL}\left(Q(o_t |a_t)||P(o_t)\right) + E_{Q(\vec{\theta})}\left[H[o_t|\vec{\theta}, a_t] \right]
  • Rolling behavioural policies \( \rightarrow \) independent of past choices.
  • A behavioural policy corresponds to a single choice \(\rightarrow\) \(\pi = a_t, \quad a_t \in \{1,\ldots,K\}\).

Equivalently, written as negative extrinsic value minus epistemic value (information gain):

G_t(a_t) = - E_{Q(o_t|a_t)} \left[ \ln P(o_t) \right] - E_{Q(o_t|a_t)}\left[ D_{KL}\left(Q(\vec{\theta}|o_t, a_t)|| Q(\vec{\theta}) \right) \right]

Friston, Karl, et al. "Active inference: a process theory." Neural computation 29.1 (2017): 1-49.

Active inference - choice selection

Posterior over policies

$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)}$$

 

For rolling policies \( F(\pi) = const. \)

$$Q(\pi)\propto e^{-\gamma G(\pi)}$$

Choice selection

    $$ a_t \sim p(a_t) \propto e^{-\gamma G_t(a_t)} $$

Optimal choice \(\left( \gamma \rightarrow \infty \right)\)

    $$ a_t = \arg\min_a G_t(a) $$
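A small sketch of this sampling rule (the stabilising shift by the minimum is an implementation detail, not from the slides); letting \(\gamma \rightarrow \infty\) recovers the deterministic \(\arg\min\) rule.

    import numpy as np

    def softmax_choice(G, gamma, rng):
        # Sample a_t from p(a_t) proportional to exp(-gamma * G_t(a_t)).
        logits = -gamma * (G - G.min())   # shifting by the minimum only rescales p
        p = np.exp(logits)
        p /= p.sum()
        return int(rng.choice(len(G), p=p))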

Computing expected free energy

expected free energy

G_t(a_t) = D_{KL}\left(Q(o_t |a_t)||P(o_t)\right) + E_{Q(\vec{\theta})}\left[H[o_t|\vec{\theta}, a_t] \right]
Q\left(o_t|a_t\right) = \int d \vec{\theta} p\left( o_t|a_t, \vec{\theta} \right) Q\left(\vec{\theta}\right)
Q\left(\vec{\theta}\right) = p\left(\vec{\theta}|o_{1:t-1}, a_{1:t-1} \right)
P(o_t) = \frac{1}{Z(\lambda)} e^{o_t \lambda} e^{-(1-o_t) \lambda}
H[o_t|\vec{\theta}, a_t] = - \left[\theta_{a_t} \ln \theta_{a_t} + (1 - \theta_{a_t}) \ln (1-\theta_{a_t}) \right]

Computing expected free energy

Expected free energy (G-AI)

G_t(a) = - 2 \lambda \mu_{t-1, a}
+ \mu_{t-1, a} \ln \mu_{t-1, a} + (1-\mu_{t-1, a}) \ln ( 1- \mu_{t-1, a})
- \mu_{t-1, a} \psi(\alpha_{t-1,a}) - (1 - \mu_{t-1,a}) \psi(\beta_{t-1, a})
+ \psi(\nu_{t-1,a}) - \frac{1}{\nu_{t-1,a}} + const.

Computing expected free energy

Approximate expected free energy (A-AI)

\tilde{G}_t(a) = - 2 \lambda \mu_{t-1, a} - \frac{1}{2\nu_{t-1, a}}
\psi(x) \approx \ln x - \frac{1}{2 x}, \text{ for } x \gg 1

Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis." arXiv preprint arXiv:2001.07203 (2020).
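Both expressions transcribed into Python with scipy's digamma; the exact version returns \(G_t(a)\) only up to the additive constant, and the function names are illustrative.

    import numpy as np
    from scipy.special import digamma

    def efe_exact(alpha, beta_, lam):
        # Per-arm expected free energy G_t(a), as on the G-AI slide (up to the constant).
        nu = alpha + beta_
        mu = alpha / nu
        return (-2 * lam * mu
                + mu * np.log(mu) + (1 - mu) * np.log(1 - mu)
                - mu * digamma(alpha) - (1 - mu) * digamma(beta_)
                + digamma(nu) - 1 / nu)

    def efe_approx(alpha, beta_, lam):
        # A-AI approximation obtained with psi(x) ~ ln(x) - 1/(2x).
        nu = alpha + beta_
        mu = alpha / nu
        return -2 * lam * mu - 1 / (2 * nu)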

Choice optimality as regret minimization

  • At trial \( t \) an agent chooses arm \(a_t\).
  • We define the regret as

$$ R(t) = p_{max}(t) - p_{a_t}(t) $$

  • The cumulative regret after \(T\) trials is obtained as 

$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$

  • Regret rate is defined as 

$$ r(T) = \frac{1}{T} \hat{R}(T)$$
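These quantities are easily computed from simulation traces; a minimal sketch with illustrative array names.

    import numpy as np

    def regret_summary(p_max, p_chosen):
        # p_max[t]: best reward probability at trial t; p_chosen[t]: probability of the chosen arm.
        R = np.asarray(p_max) - np.asarray(p_chosen)     # per-trial regret R(t)
        cum_R = np.cumsum(R)                             # cumulative regret \hat{R}(T)
        rate = cum_R / np.arange(1, len(R) + 1)          # regret rate r(T)
        return cum_R, rate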

Asymptotic Efficiency

For good algorithms

\lim_{T\rightarrow\infty} r(T) = 0

Asymptotically efficient algorithms scale as (when \(T \rightarrow \infty \))

\bar{R}(T) \geq \underline{R}(T) = 2 \epsilon \frac{K-1}{\ln (1 + 4\epsilon^2)} \ln T + const. = \omega(K, \epsilon) \ln T + const.

Comparison:

  • Optimistic Thompson sampling (O-TS)

  • Bayesian upper confidence bound (B-UCB)

  • Exact active inference (G-AI)

  • Approximate active inference (A-AI)

Comparison between active inference algorithms

Comparison of all algorithms

End-point histogram

Short-term behaviour

What would make active inference asymptotically efficient?

Non-stationary (Dynamic) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
  • Reward probabilities associated with each arm change over time:
    • changes happen at the same time on all arms (e.g. switching bandits)
    • changes happen independently on each arm (e.g. restless bandits)
  • The additional task difficulty:
    • The rate of change/change probability

Switching bandits with fixed difficulty

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\).

  • All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).

  • The optimal arm changes with probability \(\rho\).

  • Task difficulty:
    • Advantage of the best arm \(\rightarrow\) \(\epsilon\)
    • Number of arms \(\rightarrow\) \(K\)
    • Change probability \(\rightarrow\) \(\rho\)
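A sketch of one step of this environment; the assumption that the new best arm is drawn uniformly among the remaining arms is mine, the slide only states that the optimal arm changes with probability \(\rho\).

    import numpy as np

    def step_switching_fixed(best_arm, K, rho, rng):
        # With probability rho the identity of the best arm changes; reward probabilities
        # stay at 1/2 + eps (best arm) and 1/2 (all other arms).
        if rng.random() < rho:
            candidates = [k for k in range(K) if k != best_arm]
            best_arm = int(rng.choice(candidates))
        return best_arm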

Switching bandits with Varying difficulty

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.

  • On each trial arms maintain their reward probabilities with probability \(1 - \rho\):

p_{t, k} = p_{t-1, k}

  • Or are resampled from a uniform distribution with probability \(\rho\):

p_{t, k} \sim \mathcal{Be}(1, 1)

  • Fixed task difficulty:
    • Number of arms \(\rightarrow\) \(K\)
    • Change probability \(\rightarrow\) \(\rho\)
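A sketch of one environment step, assuming (as for switching bandits above) that all arms are resampled together when a change occurs:

    import numpy as np

    def step_switching_varying(p, rho, rng):
        # With probability rho all reward probabilities are redrawn from Be(1, 1) = U(0, 1),
        # otherwise they remain unchanged.
        if rng.random() < rho:
            return rng.uniform(size=len(p))
        return p.copy()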

Restless bandits

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.

  • On each trial the reward probability of each arm is generated by the following process:

x_{t, k} = x_{t-1, k} + \sigma n_{t, k}
p_{t, k} = \frac{e^{x_{t, k}}}{1 + e^{x_{t, k}}}
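A sketch of one environment step, assuming the noise terms \(n_{t, k}\) are independent standard Gaussian draws (the slide leaves the noise distribution unspecified):

    import numpy as np

    def step_restless(x, sigma, rng):
        # Independent Gaussian random walk on the latent values, mapped through a logistic link.
        x = x + sigma * rng.standard_normal(len(x))
        p = np.exp(x) / (1 + np.exp(x))
        return x, p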

Bayesian Inference

  • Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
  • Beliefs about reward probabilities as Beta distributions.
  • Bayesian belief updating for all action selection algorithms.
  • Access to the underlying probability of change \(\rho\).

Bayesian Inference

Generative model

p(o_t|\vec{\theta}, a_t) = \prod_{k=1}^K \left[ \left(\theta_{t, k} \right)^{o_{t}}\left( 1- \theta_{t, k} \right)^{1-o_{t}} \right]^{\delta_{k, a_t}}
  • A K-armed bandit.
  • The \(k\)th arm is associated with a reward probability \(\theta_{t, k} \in [0, 1]\).
  • Choice outcomes denoted with  \( o_t \in \{0, 1\} \).
p(\theta_{t, k}|\theta_{t-1, k}, j_t) = \left\{ \begin{array}{cc} \mathcal{Be}(\theta_{t, k}; \alpha_0, \beta_0) & \textrm{for } j_{t} = 1 \\ \delta(\theta_{t, k} - \theta_{t-1, k}) & \textrm{for } j_{t} = 0 \end{array} \right.
p(j_{t}) = \rho^{j_t} (1 - \rho)^{1- j_t}

Belief updating

Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_{t, 1}, \ldots, \theta_{t, K}) \), belief updating corresponds to

p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}_t, a_t) p(\vec{\theta}_t|j_t, o_{t-1:1}) p(j_t)

The exact posterior is a mixture over the change variable, with change probability \(p(j_t=1|o_{t:1}, a_{t:1}) = \omega_t\):

p(\vec{\theta}_t|o_{t:1}, a_{t:1}) = \omega_t p(\vec{\theta}_t|j_t=1, o_{t:1}, a_{t:1}) + (1-\omega_t) p(\vec{\theta}_t|j_t=0, o_{t:1}, a_{t:1})

Mean-field approximation

p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \approx \prod_{k=1}^K Q(\theta_k) Q(j_t)
Q(\theta_k) = \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}), \quad Q(j_t=1) = \omega_t

Variational surprise minimisation learning (SMiLe)

Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_{t, 1}, \ldots, \theta_{t, K}) \), belief updating corresponds to

p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}_t, a_t) p(\vec{\theta}_t|j_t, o_{t-1:1}) p(j_t)
Q(\theta_k) = \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}) \propto p(o_t|\vec{\theta}_t, a_t) e^{\sum_{j_t \in \{0 ,1\}} Q(j_t) \ln p(\vec{\theta}_t|j_t, o_{t-1:1})}
Q(j_t) = p(j_t|o_{t:1}, a_{t:1}) \propto p(o_t|j_t, a_t, o_{1:t-1}) p(j_t)

Liakoni, Vasiliki, et al. "Learning in volatile environments with the Bayes factor surprise." Neural Computation 33.2 (2021): 269-340.

Variational surprise minimisation learning (SMiLe)

Q(j_t) = \omega_t^{j_t} (1 - \omega_t)^{1 - j_t}
\omega_t = \omega(S_{BF}^t, m), \quad \omega(S, m) = \frac{mS}{1 + mS}, \quad m = \frac{\rho}{1 - \rho}
S_{BF}^t = \frac{p (o_t|j_t=1, a_t, o_{1:t-1})}{p(o_t|j_t=0, a_t, o_{1:t-1})}

Q(\theta_k) = \mathcal{Be}(\theta_k; \alpha_{t,k}, \beta_{t, k})
\alpha_{t, k} = (1 - \omega_t)\alpha_{t-1, k} + \omega_t\alpha_0 + \delta_{a_t, k} \cdot o_t
\beta_{t, k} = (1 - \omega_t)\beta_{t-1, k} + \omega_t\beta_0 + \delta_{a_t, k} \cdot (1 - o_t)
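A sketch of the resulting update, assuming a uniform prior \(\mathcal{Be}(1, 1)\) by default; the function name and the spelled-out predictive probabilities are illustrative, not taken verbatim from the slides.

    import numpy as np

    def smile_update(alpha, beta_, a_t, o_t, rho, alpha0=1.0, beta0=1.0):
        # Predictive probability of the observed outcome under "no change" (j_t = 0)
        # and under "change" (j_t = 1, i.e. prediction from the prior Be(alpha0, beta0)).
        mu_old = alpha[a_t] / (alpha[a_t] + beta_[a_t])
        mu_0 = alpha0 / (alpha0 + beta0)
        p_no_change = mu_old if o_t == 1 else 1 - mu_old
        p_change = mu_0 if o_t == 1 else 1 - mu_0

        S_bf = p_change / p_no_change            # Bayes factor surprise
        m = rho / (1 - rho)
        omega = m * S_bf / (1 + m * S_bf)        # inferred change probability Q(j_t = 1)

        # Convex combination of previous beliefs and the prior, plus the usual count update.
        alpha = (1 - omega) * alpha + omega * alpha0
        beta_ = (1 - omega) * beta_ + omega * beta0
        alpha[a_t] += o_t
        beta_[a_t] += 1 - o_t
        return alpha, beta_, omega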

Recovering the stationary bandit

Stationary case \(\rightarrow\) \(\rho = 0\)

\omega_t = 0
\alpha_{t, k} = \alpha_{t-1, k} + \delta_{a_t, k} o_t
\beta_{t, k} = \beta_{t-1, k} + \delta_{a_t, k} (1 - o_t)

Action Selection Algorithms

Optimistic Thompson sampling (O-TS)

\( \theta^*_k \sim p\left(\theta_{t, k}|o_{t-1:1}, a_{t-1:1}\right) \)

\(a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_{t, k} \rangle) \right]\)

Bayesian upper confidence bound (B-UCB)

\(  a_t = \arg\max_k CDF^{-1}\left( 1 - \frac{1}{t}; \bar{\alpha}_{t,k}, \bar{\beta}_{t,k} \right) \)

\(\bar{\alpha}_{t, k} = (1-\rho) \alpha_{t-1, k} + \rho \)

\(\bar{\beta}_{t, k} = (1-\rho) \beta_{t-1, k} + \rho \)

Approximate expected free energy (A-AI)

\(  \tilde{G}_t(a) = - 2 \lambda \mu_{t-1, a} - \frac{1}{2\nu_{t-1, a}} \)

\(a_t = \arg\min_a \tilde{G}_t(a) \)
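A sketch of the B-UCB and A-AI rules for the switching case, mirroring the expressions above; function names are illustrative, and O-TS is unchanged apart from using the SMiLe-updated beliefs.

    import numpy as np
    from scipy.stats import beta as beta_dist

    def b_ucb_switching(t, alpha, beta_, rho):
        # Mix the posterior parameters with the prior according to the change probability,
        # then take the 1 - 1/t posterior quantile.
        alpha_bar = (1 - rho) * alpha + rho
        beta_bar = (1 - rho) * beta_ + rho
        return int(np.argmax(beta_dist.ppf(1 - 1 / t, alpha_bar, beta_bar)))

    def a_ai_choice(alpha, beta_, lam):
        # A-AI: minimise the approximate expected free energy.
        nu = alpha + beta_
        mu = alpha / nu
        return int(np.argmin(-2 * lam * mu - 1 / (2 * nu)))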

Comparison:

  • Optimistic Thompson sampling (O-TS)

  • Bayesian upper confidence bound (B-UCB)

  • Approximate active inference (A-AI)

Comparison in switching bandits - fixed difficulty

\(\epsilon=0.1\)

A-AI:

  • \(\lambda = 0.5 \)

Comparison in switching bandits - fixed difficulty

\(K=40\)

A-AI:

  • \(\lambda = 0.5 \)

Comparison in switching bandits - varying difficulty

Dotted line: \( \lambda = 0.25 \)

Comparison in switching bandits - varying difficulty

\( \lambda_{\text{G-AI}} = 0.25\)

\( \lambda_{\text{A-AI}} = 0.5\)

Conclusion

  • Active inference does not result in asymptotically efficient decision making. Additional work is required to establish theoretical bounds on regret and to derive improved algorithms.
  • In non-stationary bandits, agents based on active inference show improved performance, especially noticeable in more difficult settings.
  • A tentative TODO list:
    • Introduce learning of the \(\lambda\) parameter.
    • Establish theoretical lower/upper bounds on cumulative regret.
    • Improve algorithms for stationary case.
    • Test real-world applications.

Thanks to:

  • Hrvoje Stojić
  • Sarah Schwöbel
  • Stefan Kiebel
  • Thomas Parr
  • Karl Friston
  • Ryan Smith
  • Lancelot Da Costa
Project A9

https://slides.com/dimarkov/

 

https://github.com/dimarkov/aibandits
