AN EMPIRICAL EVALUATION OF ACTIVE INFERENCE IN MULTI-ARMED BANDITS

Dimitrije Marković, Hrvoje Stojić, Sarah Schwöbel, and Stefan Kiebel

Active Inference Lab

22.06.2021

Max Planck UCL Centre for Computational Psychiatry and Ageing Research

Secondmind (https://www.secondmind.ai/)

Motivation

  • Multi-armed bandits generalise resource allocation problems.
  • Often used in experimental cognitive neuroscience:
    • Decision making in dynamic environments \(^1\).
    • Value-based decision making \(^2\).
    • Structure learning \(^3\).
  • Wide range of industrial applications\(^4\).

 

[1] Wilson, Robert C., and Yael Niv. "Inferring relevance in a changing world." Frontiers in human neuroscience 5 (2012): 189.

[2] Payzan-LeNestour, Elise, et al. "The neural representation of unexpected uncertainty during value-based decision making." Neuron 79.1 (2013): 191-201.

[3] Schulz, Eric, Nicholas T. Franklin, and Samuel J. Gershman. "Finding structure in multi-armed bandits." Cognitive psychology 119 (2020): 101261.

[4] Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. "Survey on Applications of Multi-Armed and Contextual Bandits." 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2020.

Multi-armed Bandits

  • Stationary bandits
  • Dynamic bandits
  • Adversarial bandits
  • Risk-aware bandits
  • Contextual bandits
  • non-Markovian bandits
  • ...

Outline

  • Stationary bandits
    • Exact inference
    • Action selection algorithms
    • Asymptotic efficiency
  • Switching bandits
    • Approximate inference
    • Comparison

Stationary (classical) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • Reward probabilities are fixed with \(p_{max}=\frac{1}{2} + \epsilon\) and \( p_{\neg max} = \frac{1}{2}\).

  • Task difficulty:

    • The advantage of the best arm \(\rightarrow\) \(\epsilon\)

    • The number of arms \(\rightarrow\) \(K\)

Bayesian Inference

  • Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
  • Beliefs about reward probabilities as Beta distributions.
  • Bayesian belief updating for all action selection algorithms.

Bayesian Inference

Generative model

p(o_t|\vec{\theta}, a_t) = \prod_{k=1}^K \left[ \left(\theta_k\right)^{o_{t}}\left( 1- \theta_k \right)^{1-o_{t}} \right]^{\delta_{k, a_t}}
  • A K-armed bandit.
  • The \(k\)th arm is associated with a reward probability \(\theta_k \in [0, 1]\).
  • Choice outcomes denoted with  \( o_t \in \{0, 1 \} \).
p\left(\vec{\theta}\right) = \prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_0, \beta_0)

Belief updating

Given some choice \(a_t\) on trial \(t\), belief updating corresponds to

p(\vec{\theta}|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}, a_t) p(\vec{\theta}|o_{t-1:1}, a_{t-1:1})

The Beta distribution is a conjugate prior:

  • Both the prior and the posterior belong to the same distribution family.
  • Inference is exact.

\prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}) \propto p(o_t|\vec{\theta}, a_t) \prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t-1, k}, \beta_{t-1, k})
\alpha_{t, k} = \alpha_{t-1, k} + \delta_{a_t, k} o_t
\beta_{t, k} = \beta_{t-1, k} + \delta_{a_t, k} \left( 1 - o_t \right)
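A minimal sketch of this conjugate update in Python; the function and array names are illustrative, not from the slides.

    import numpy as np

    def update_beliefs(alpha, beta_, a_t, o_t):
        # Exact conjugate update: only the chosen arm's Beta parameters change.
        alpha = alpha.copy()
        beta_ = beta_.copy()
        alpha[a_t] += o_t          # alpha_{t,k} = alpha_{t-1,k} + delta_{a_t,k} o_t
        beta_[a_t] += 1 - o_t      # beta_{t,k}  = beta_{t-1,k}  + delta_{a_t,k} (1 - o_t)
        return alpha, beta_

    # usage with a uniform prior Be(1, 1) over K = 4 arms
    K = 4
    alpha, beta_ = np.ones(K), np.ones(K)
    alpha, beta_ = update_beliefs(alpha, beta_, a_t=2, o_t=1)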

Belief updating

Given some choice \(a_t\) on trial \(t\), belief updating corresponds to

p(\vec{\theta}|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}, a_t) p(\vec{\theta}|o_{t-1:1}, a_{t-1:1})
\prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}) \propto p(o_t|\vec{\theta}, a_t) \prod_{k=1}^K \mathcal{Be}(\theta_k; \alpha_{t-1, k}, \beta_{t-1, k})
\alpha_{t, k} = \alpha_{t-1, k} + \delta_{a_t, k} o_t
\beta_{t, k} = \beta_{t-1, k} + \delta_{a_t, k} \left( 1 - o_t \right)

Equivalently, in terms of the counts \(\nu_{t, k} = \alpha_{t, k} + \beta_{t, k}\) and means \(\mu_{t, k} = \frac{\alpha_{t, k}}{\nu_{t, k}}\):

\nu_{t, k} = \nu_{t-1, k} + \delta_{a_t, k}
\mu_{t, k} = \mu_{t-1, k} + \frac{\delta_{a_t, k}}{\nu_{t, k}} \left( o_t - \mu_{t-1, k} \right)
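The same update in the count/mean parametrisation, again only as an illustrative sketch (nu and mu are numpy arrays of per-arm counts and means):

    def update_mean_count(nu, mu, a_t, o_t):
        # Equivalent update with counts nu = alpha + beta and means mu = alpha / nu.
        nu, mu = nu.copy(), mu.copy()
        nu[a_t] += 1
        mu[a_t] += (o_t - mu[a_t]) / nu[a_t]   # incremental update of the posterior mean
        return nu, mu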

Action Selection Algorithms

  • Optimistic Thompson sampling (O-TS)
  • Bayesian upper confidence bound (B-UCB)
  • Active inference
  • \(\epsilon\)-greedy
  • UCB
  • KL-UCB
  • Thompson sampling
  • etc.


Upper confidence bound (UCB)

  • One of the oldest algorithms

a_t = \left\{ \begin{array}{cc} \arg\max_k \left( m_{t, k} + \frac{\ln t}{n_{t,k}} + \sqrt{ \frac{ m_{t,k} \ln t}{ n_{t, k}}} \right) & \textrm{for } t>K \\ t & \textrm{otherwise} \end{array} \right.
  • \(m_{t, k} \) denotes the expected reward probability of the \(k\)th arm.
  • \(n_{t, k}\) denotes the number of times the \(k\)th arm was selected.

Chapelle, Olivier, and Lihong Li. "An empirical evaluation of Thompson sampling." Advances in neural information processing systems 24 (2011): 2249-2257.
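A sketch of this selection rule as written on the slide, assuming 0-based arm indices and that each arm is played once during the first \(K\) trials:

    import numpy as np

    def ucb_choice(t, m, n):
        # m: empirical reward probabilities; n: number of pulls per arm; t: 1-based trial index.
        K = len(m)
        if t <= K:
            return t - 1   # play each arm once during the first K trials (0-based index)
        bound = m + np.log(t) / n + np.sqrt(m * np.log(t) / n)
        return int(np.argmax(bound))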

Bayesian upper confidence bound (B-UCB)

a_t = \arg\max_k CDF^{-1}\left( 1 - \frac{1}{t (\ln n)^c}; \alpha_{t-1,k}, \beta_{t-1,k} \right)

The quantile \(CDF^{-1}\) is the inverse regularised incomplete beta function, defined by

\int_{0}^x \mathcal{Be} \left(\theta; \alpha_{t-1, k}, \beta_{t-1, k} \right) \textrm{d}\theta = 1 - \frac{1}{t (\ln n)^c}

Here \(n\) denotes the horizon (the total number of trials). Best results are obtained for \(c = 0\).

Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems." Artificial intelligence and statistics, 2012.
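A sketch of the B-UCB rule with \(c = 0\); scipy's beta.ppf provides the required inverse regularised incomplete beta function (function name and defaults are illustrative):

    import numpy as np
    from scipy.stats import beta as beta_dist

    def b_ucb_choice(t, alpha, beta_, c=0, horizon=None):
        # Quantile level 1 - 1/(t (ln n)^c); with c = 0 (as on the slide) this is 1 - 1/t.
        level = 1 - 1 / t if c == 0 else 1 - 1 / (t * np.log(horizon) ** c)
        # beta.ppf is the posterior quantile (inverse regularised incomplete beta function)
        return int(np.argmax(beta_dist.ppf(level, alpha, beta_)))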

Thompson sampling

Classical algorithm for Bayesian bandits:

$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta_k|o_{t-1:1}, a_{t-1:1}\right)$$

Optimistic Thompson sampling (O-TS) is defined as

$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_k \rangle) \right]$$

Lu, Xue, Niall Adams, and Nikolas Kantas. "On adaptive estimation for dynamic Bernoulli bandits." Foundations of Data Science 1.2 (2019): 197.
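Both sampling rules in a few lines; the fixed random seed and function names are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)

    def ts_choice(alpha, beta_):
        # Thompson sampling: draw theta*_k from each Beta posterior and maximise.
        return int(np.argmax(rng.beta(alpha, beta_)))

    def ots_choice(alpha, beta_):
        # Optimistic Thompson sampling: never use a sample below the posterior mean.
        theta_star = rng.beta(alpha, beta_)
        theta_mean = alpha / (alpha + beta_)
        return int(np.argmax(np.maximum(theta_star, theta_mean)))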

Active inference

expected free energy

G_t(a_t) = D_{KL}\left(Q(o_t |a_t)||P(o_t)\right) + E_{Q(\vec{\theta})}\left[H[o_t|\vec{\theta}, a_t] \right]
  • Rolling behavioural policies \( \rightarrow \) independent of past choices.
  • A behavioural policy corresponds to a single choice \(\rightarrow\) \(\pi = a_t, \quad a_t \in \{1,\ldots,K\}\).

Equivalently, written as negative extrinsic value minus epistemic value (information gain):

G_t(a_t) = - E_{Q(o_t|a_t)} \left[ \ln P(o_t) \right] - E_{Q(o_t|a_t)}\left[ D_{KL}\left(Q(\vec{\theta}|o_t, a_t)|| Q(\vec{\theta}) \right) \right]

Friston, Karl, et al. "Active inference: a process theory." Neural computation 29.1 (2017): 1-49.

Active inference - choice selection

Posterior over policies

$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)}$$

 

For rolling policies \( F(\pi) = const. \)

$$Q(\pi)\propto e^{-\gamma G(\pi)}$$

Choice selection

    $$ a_t \sim p(a_t) \propto e^{-\gamma G_t(a_t)} $$

Optimal choice \(\left( \gamma \rightarrow \infty \right)\)

    $$ a_t = \arg\min_a G_t(a) $$
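A small sketch of this sampling rule (the stabilising shift by the minimum is an implementation detail, not from the slides); letting \(\gamma \rightarrow \infty\) recovers the deterministic \(\arg\min\) rule.

    import numpy as np

    def softmax_choice(G, gamma, rng):
        # Sample a_t from p(a_t) proportional to exp(-gamma * G_t(a_t)).
        logits = -gamma * (G - G.min())   # shifting by the minimum only rescales p
        p = np.exp(logits)
        p /= p.sum()
        return int(rng.choice(len(G), p=p))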

Computing expected free energy

expected free energy

G_t(a_t) = D_{KL}\left(Q(o_t |a_t)||P(o_t)\right) + E_{Q(\vec{\theta})}\left[H[o_t|\vec{\theta}, a_t] \right]
Q\left(o_t|a_t\right) = \int d \vec{\theta} p\left( o_t|a_t, \vec{\theta} \right) Q\left(\vec{\theta}\right)
Q\left(\vec{\theta}\right) = p\left(\vec{\theta}|o_{1:t-1}, a_{1:t-1} \right)
P(o_t) = \frac{1}{Z(\lambda)} e^{o_t \lambda} e^{-(1-o_t) \lambda}
H[o_t|\vec{\theta}, a_t] = - \left[\theta_{a_t} \ln \theta_{a_t} + (1 - \theta_{a_t}) \ln (1-\theta_{a_t}) \right]

Computing expected free energy

Expected free energy (G-AI)

G_t(a) = - 2 \lambda \mu_{t-1, a}
+ \mu_{t-1, a} \ln \mu_{t-1, a} + (1-\mu_{t-1, a}) \ln ( 1- \mu_{t-1, a})
- \mu_{t-1, a} \psi(\alpha_{t-1,a}) - (1 - \mu_{t-1,a}) \psi(\beta_{t-1, a})
+ \psi(\nu_{t-1,a}) - \frac{1}{\nu_{t-1,a}} + const.

Computing expected free energy

Approximate expected free energy (A-AI)

\tilde{G}_t(a) = - 2 \lambda \mu_{t-1, a} - \frac{1}{2\nu_{t-1, a}}
\psi(x) \approx \ln x - \frac{1}{2 x}, \text{ for } x \gg 1

Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis." arXiv preprint arXiv:2001.07203 (2020).
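Both expressions transcribed into Python with scipy's digamma; the exact version returns \(G_t(a)\) only up to the additive constant, and the function names are illustrative.

    import numpy as np
    from scipy.special import digamma

    def efe_exact(alpha, beta_, lam):
        # Per-arm expected free energy G_t(a), as on the G-AI slide (up to the constant).
        nu = alpha + beta_
        mu = alpha / nu
        return (-2 * lam * mu
                + mu * np.log(mu) + (1 - mu) * np.log(1 - mu)
                - mu * digamma(alpha) - (1 - mu) * digamma(beta_)
                + digamma(nu) - 1 / nu)

    def efe_approx(alpha, beta_, lam):
        # A-AI approximation obtained with psi(x) ~ ln(x) - 1/(2x).
        nu = alpha + beta_
        mu = alpha / nu
        return -2 * lam * mu - 1 / (2 * nu)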

Choice optimality as regret minimization

  • At trial \( t \) an agent chooses arm \(a_t\).
  • We define the regret as

$$ R(t) = p_{max}(t) - p_{a_t}(t) $$

  • The cumulative regret after \(T\) trials is obtained as 

$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$

  • Regret rate is defined as 

$$ r(T) = \frac{1}{T} \hat{R}(T)$$
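These quantities are easily computed from simulation traces; a minimal sketch with illustrative array names.

    import numpy as np

    def regret_summary(p_max, p_chosen):
        # p_max[t]: best reward probability at trial t; p_chosen[t]: probability of the chosen arm.
        R = np.asarray(p_max) - np.asarray(p_chosen)     # per-trial regret R(t)
        cum_R = np.cumsum(R)                             # cumulative regret \hat{R}(T)
        rate = cum_R / np.arange(1, len(R) + 1)          # regret rate r(T)
        return cum_R, rate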

Asymptotic Efficiency

For good algorithms

\lim_{T\rightarrow\infty} r(T) = 0

Asymptotically efficient algorithms scale as (when \(T \rightarrow \infty \))

\bar{R}(T) \geq \underline{R}(T) = 2 \epsilon \frac{K-1}{\ln (1 + 4\epsilon^2)} \ln T + const. = \omega(K, \epsilon) \ln T + const.

Comparison:

  • Optimistic Thompson sampling (O-TS)

  • Bayesian upper confidence bound (B-UCB)

  • Exact active inference (G-AI)

  • Approximate active inference (A-AI)

Comparison between active inference algorithms

Comparison of all algorithms

End-point histogram

Short-term behaviour

What would make active inference asymptotically efficient?

Non-stationary (Dynamic) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
  • Reward probabilities associated with each arm change over time:
    • changes happen at the same time on all arms (e.g. switching bandits)
    • changes happen independently on each arm (e.g. restless bandits)
  • The additional task difficulty:
    • The rate of change/change probability

Switching bandits with fixed difficulty

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\).

  • All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).

  • The optimal arm changes with probability \(\rho\).

  • Task difficulty:
    • Advantage of the best arm \(\rightarrow\) \(\epsilon\)
    • Number of arms \(\rightarrow\) \(K\)
    • Change probability \(\rightarrow\) \(\rho\)
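A sketch of one step of this environment; the assumption that the new best arm is drawn uniformly among the remaining arms is mine, the slide only states that the optimal arm changes with probability \(\rho\).

    import numpy as np

    def step_switching_fixed(best_arm, K, rho, rng):
        # With probability rho the identity of the best arm changes; reward probabilities
        # stay at 1/2 + eps (best arm) and 1/2 (all other arms).
        if rng.random() < rho:
            candidates = [k for k in range(K) if k != best_arm]
            best_arm = int(rng.choice(candidates))
        return best_arm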

Switching bandits with Varying difficulty

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.

  • On each trial arms maintain their reward probabilities with probability \(1 - \rho\):

p_{t, k} = p_{t-1, k}

  • Or are resampled from a uniform distribution with probability \(\rho\):

p_{t, k} \sim \mathcal{Be}(1, 1)

  • Fixed task difficulty:
    • Number of arms \(\rightarrow\) \(K\)
    • Change probability \(\rightarrow\) \(\rho\)
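A sketch of one environment step, assuming (as for switching bandits above) that all arms are resampled together when a change occurs:

    import numpy as np

    def step_switching_varying(p, rho, rng):
        # With probability rho all reward probabilities are redrawn from Be(1, 1) = U(0, 1),
        # otherwise they remain unchanged.
        if rng.random() < rho:
            return rng.uniform(size=len(p))
        return p.copy()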

Restless bandits

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.

  • On each trial the reward probability of each arm is generated by the following process:

x_{t, k} = x_{t-1, k} + \sigma n_{t, k}
p_{t, k} = \frac{e^{x_{t, k}}}{1 + e^{x_{t, k}}}
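A sketch of one environment step, assuming the noise terms \(n_{t, k}\) are independent standard Gaussian draws (the slide leaves the noise distribution unspecified):

    import numpy as np

    def step_restless(x, sigma, rng):
        # Independent Gaussian random walk on the latent values, mapped through a logistic link.
        x = x + sigma * rng.standard_normal(len(x))
        p = np.exp(x) / (1 + np.exp(x))
        return x, p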

Bayesian Inference

  • Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
  • Beliefs about reward probabilities as Beta distributions.
  • Bayesian belief updating for all action selection algorithms.
  • Access to the underlying probability of change \(\rho\).

Bayesian Inference

Generative model

p(o_t|\vec{\theta}, a_t) = \prod_{k=1}^K \left[ \left(\theta_{t, k} \right)^{o_{t}}\left( 1- \theta_{t, k} \right)^{1-o_{t}} \right]^{\delta_{k, a_t}}
  • A K-armed bandit.
  • The \(k\)th arm is associated with a reward probability \(\theta_{t, k} \in [0, 1]\).
  • Choice outcomes denoted with  \( o_t \in \{0, 1\} \).
p(\theta_{t, k}|\theta_{t-1, k}, j_t) = \left\{ \begin{array}{cc} \mathcal{Be}(\theta_{t, k}; \alpha_0, \beta_0) & \textrm{for } j_{t} = 1 \\ \delta(\theta_{t, k} - \theta_{t-1, k}) & \textrm{for } j_{t} = 0 \end{array} \right.
p(j_{t}) = \rho^{j_t} (1 - \rho)^{1- j_t}

Belief updating

Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_{t, 1}, \ldots, \theta_{t, K}) \), belief updating corresponds to

p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}_t, a_t) p(\vec{\theta}_t|j_t, o_{t-1:1}) p(j_t)

The exact posterior is a mixture over the change variable, with change probability \(p(j_t=1|o_{t:1}, a_{t:1}) = \omega_t\):

p(\vec{\theta}_t|o_{t:1}, a_{t:1}) = \omega_t p(\vec{\theta}_t|j_t=1, o_{t:1}, a_{t:1}) + (1-\omega_t) p(\vec{\theta}_t|j_t=0, o_{t:1}, a_{t:1})

Mean-field approximation

p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \approx \prod_{k=1}^K Q(\theta_k) Q(j_t)
Q(\theta_k) = \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}), \quad Q(j_t=1) = \omega_t

Variational surprise minimisation learning (SMiLe)

Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_{t, 1}, \ldots, \theta_{t, K}) \), belief updating corresponds to

p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}_t, a_t) p(\vec{\theta}_t|j_t, o_{t-1:1}) p(j_t)
Q(\theta_k) = \mathcal{Be}(\theta_k; \alpha_{t, k}, \beta_{t, k}) \propto p(o_t|\vec{\theta}_t, a_t) e^{\sum_{j_t \in \{0 ,1\}} Q(j_t) \ln p(\vec{\theta}_t|j_t, o_{t-1:1})}
Q(j_t) = p(j_t|o_{t:1}, a_{t:1}) \propto p(o_t|j_t, a_t, o_{1:t-1}) p(j_t)

Liakoni, Vasiliki, et al. "Learning in volatile environments with the Bayes factor surprise." Neural Computation 33.2 (2021): 269-340.

Variational surprise minimisation learning (SMiLe)

Q(j_t) = \omega_t^{j_t} (1 - \omega_t)^{1 - j_t}
\omega_t = \omega(S_{BF}^t, m), \quad \omega(S, m) = \frac{mS}{1 + mS}, \quad m = \frac{\rho}{1 - \rho}
S_{BF}^t = \frac{p (o_t|j_t=1, a_t, o_{1:t-1})}{p(o_t|j_t=0, a_t, o_{1:t-1})}

Q(\theta_k) = \mathcal{Be}(\theta_k; \alpha_{t,k}, \beta_{t, k})
\alpha_{t, k} = (1 - \omega_t)\alpha_{t-1, k} + \omega_t\alpha_0 + \delta_{a_t, k} \cdot o_t
\beta_{t, k} = (1 - \omega_t)\beta_{t-1, k} + \omega_t\beta_0 + \delta_{a_t, k} \cdot (1 - o_t)
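A sketch of the resulting update, assuming a uniform prior \(\mathcal{Be}(1, 1)\) by default; the function name and the spelled-out predictive probabilities are illustrative, not taken verbatim from the slides.

    import numpy as np

    def smile_update(alpha, beta_, a_t, o_t, rho, alpha0=1.0, beta0=1.0):
        # Predictive probability of the observed outcome under "no change" (j_t = 0)
        # and under "change" (j_t = 1, i.e. prediction from the prior Be(alpha0, beta0)).
        mu_old = alpha[a_t] / (alpha[a_t] + beta_[a_t])
        mu_0 = alpha0 / (alpha0 + beta0)
        p_no_change = mu_old if o_t == 1 else 1 - mu_old
        p_change = mu_0 if o_t == 1 else 1 - mu_0

        S_bf = p_change / p_no_change            # Bayes factor surprise
        m = rho / (1 - rho)
        omega = m * S_bf / (1 + m * S_bf)        # inferred change probability Q(j_t = 1)

        # Convex combination of previous beliefs and the prior, plus the usual count update.
        alpha = (1 - omega) * alpha + omega * alpha0
        beta_ = (1 - omega) * beta_ + omega * beta0
        alpha[a_t] += o_t
        beta_[a_t] += 1 - o_t
        return alpha, beta_, omega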

Recovering the stationary bandit

Stationary case \(\rightarrow\) \(\rho = 0\)

\omega_t = 0
\alpha_{t, k} = \alpha_{t-1, k} + \delta_{a_t, k} o_t
\beta_{t, k} = \beta_{t-1, k} + \delta_{a_t, k} (1 - o_t)

Action Selection Algorithms

Optimistic Thompson sampling (O-TS)

\( \theta^*_k \sim p\left(\theta_{t, k}|o_{t-1:1}, a_{t-1:1}\right) \)

\(a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_{t, k} \rangle) \right]\)

Bayesian upper confidence bound (B-UCB)

\(  a_t = \arg\max_k CDF^{-1}\left( 1 - \frac{1}{t}; \bar{\alpha}_{t,k}, \bar{\beta}_{t,k} \right) \)

\(\bar{\alpha}_{t, k} = (1-\rho) \alpha_{t-1, k} + \rho \)

\(\bar{\beta}_{t, k} = (1-\rho) \beta_{t-1, k} + \rho \)

Approximate expected free energy (A-AI)

\(  \tilde{G}_t(a) = - 2 \lambda \mu_{t-1, a} - \frac{1}{2\nu_{t-1, a}} \)

\(a_t = \arg\min_a \tilde{G}_t(a) \)
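A sketch of the B-UCB and A-AI rules for the switching case, mirroring the expressions above; function names are illustrative, and O-TS is unchanged apart from using the SMiLe-updated beliefs.

    import numpy as np
    from scipy.stats import beta as beta_dist

    def b_ucb_switching(t, alpha, beta_, rho):
        # Mix the posterior parameters with the prior according to the change probability,
        # then take the 1 - 1/t posterior quantile.
        alpha_bar = (1 - rho) * alpha + rho
        beta_bar = (1 - rho) * beta_ + rho
        return int(np.argmax(beta_dist.ppf(1 - 1 / t, alpha_bar, beta_bar)))

    def a_ai_choice(alpha, beta_, lam):
        # A-AI: minimise the approximate expected free energy.
        nu = alpha + beta_
        mu = alpha / nu
        return int(np.argmin(-2 * lam * mu - 1 / (2 * nu)))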

Comparison:

  • Optimistic Thompson sampling (O-TS)

  • Bayesian upper confidence bound (B-UCB)

  • Approximate active inference (A-AI)

Comparison in switching bandits - fixed difficulty

\(\epsilon=0.1\)

A-AI:

  • \(\lambda = 0.5 \)

Comparison in switching bandits - fixed difficulty

\(K=40\)

A-AI:

  • \(\lambda = 0.5 \)

Comparison in switching bandits - varying difficulty

Dotted line: \( \lambda = 0.25 \)

Comparison in switching bandits - varying difficulty

\( \lambda_{\text{G-AI}} = 0.25\)

\( \lambda_{\text{A-AI}} = 0.5\)

Conclusion

  • Active inference does not result in asymptotically efficient decision making. Additional work is required to establish theoretical bounds on regret and to derive improved algorithms.
  • In non-stationary bandits, agents based on active inference show improved performance, especially noticeable in more difficult settings.
  • A tentative TODO list:
    • Introduce learning of the \(\lambda\) parameter.
    • Establish theoretical lower/upper bounds on cumulative regret.
    • Improve algorithms for stationary case.
    • Test real-world applications.

Thanks to:

  • Hrvoje Stojić
  • Sarah Schwöbel
  • Stefan Kiebel
  • Thomas Parr
  • Karl Friston
  • Ryan Smith
  • Lancelot Da Costa
Project A9

https://slides.com/dimarkov/

 

https://github.com/dimarkov/aibandits
