Active Inference in multi-armed bandits

Dimitrije Marković

Theoretical Neurobiology Meeting

30.11.2020

Active inference and semi-Markov decision processes

  • (Part I) Active inference in multi-armed bandits
    • Comparison with UCB and Thompson sampling.
    • Comparison of state-based and outcome-based preferences.

 

  • (Part II) Active inference and semi-Markov processes
    • Semi-Markov models and a representation of state duration.
    • Learning the hidden temporal structure of state transitions.
    • Application: Reversal learning task.

  • (Part III) Active inference and semi-Markov decision processes
    • Extending policies with action (policy) duration.
    • Deciding when actions should be taken and for how long.
    • Applications: Temporal attention, intertemporal choices.

Motivation

  • Multi-armed bandits generalise resource allocation problems
  • Often used in experimental cognitive neuroscience:
    • Decision making in dynamic environments
    • Spatio-temporal attentional processing
  • Wide range of industrial applications.
  • Good benchmark problem for comparing active-inference-based solutions with state-of-the-art alternatives.

 

Multi-armed Bandits

  • Stationary bandits
  • Dynamic bandits
  • Adversarial bandits
  • Risk-aware bandits
  • Contextual bandits
  • Non-Markovian bandits

Stationary (classical) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • Reward probabilities are fixed, with \(p_{max}=\frac{1}{2} + \epsilon\) for the best arm and \( p_{\neg max} = \frac{1}{2}\) for all other arms.

  • Task difficulty:

    • The best arm advantage \(\rightarrow\) \(\epsilon\)

    • The number of arms \(\rightarrow\) \(K\)

Non-stationary (Dynamic) bandits

  • On any trial an agent makes a choice between \(K\) arms.

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
  • Reward probabilities associated with each arm change over time:
    • Changes happen at the same time on all arms (e.g. switching bandits)
    • Changes happen independently on each arm (e.g. restless bandits)
  • The additional task difficulty:
    • The rate of change/change probability

Choice optimality as regret minimization

  • At trial \( t \) an agent chooses arm \(a_t\).
  • We define the regret as

$$ R(t) = p_{max}(t) - p_{a_t}(t) $$

  • The cumulative regret after \(T\) trials is obtained as 

$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$

  • Regret rate is defined as 

$$ r(T) = \frac{1}{T} \hat{R}(T)$$
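As a minimal illustration (a Python/NumPy sketch; the function name and array layout are my own, not from the slides), the three regret quantities above can be computed directly from a sequence of choices:

```python
import numpy as np

def regret_rate(p, choices):
    """Per-trial regret R(t), cumulative regret, and regret rate.

    p       : array of shape (T, K), reward probabilities p_k(t) on each trial
    choices : integer array of shape (T,), the chosen arm a_t on each trial
    """
    T = len(choices)
    p_max = p.max(axis=-1)                   # p_max(t)
    p_chosen = p[np.arange(T), choices]      # p_{a_t}(t)
    R = p_max - p_chosen                     # per-trial regret R(t)
    R_cum = np.cumsum(R)                     # cumulative regret \hat{R}(T)
    r = R_cum / np.arange(1, T + 1)          # regret rate r(T) = \hat{R}(T) / T
    return R, R_cum, r
```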

Bayesian Inference

  • Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
  • Beliefs about reward probabilities are represented as Beta distributions.
  • Bayesian belief updating is shared by all action selection algorithms.

Generative model

  • A \(K\)-armed bandit.
  • The \(k\)th arm is associated with a reward probability \(\theta^k_t\) at trial \(t\).
  • Choice outcomes are denoted by \(o_t\).

$$ p(o_t|\theta^k_t) = \left(\theta^k_t\right)^{o_{t}}\left( 1- \theta^k_t \right)^{1-o_{t}} $$

$$ p(\theta^k_t|\theta^{k}_{t-1}, j_t) = \left\{ \begin{array}{cc} \mathcal{Be}(\alpha_0, \beta_0) & \textrm{for } j_{t} = 1 \\ \delta(\theta^k_{t} - \theta^k_{t-1}) & \textrm{for } j_{t} = 0 \end{array} \right. $$

$$ p(j_{t}=j|j_{t-1}, \rho) = \left\{ \begin{array}{cc} \delta_{j, 0} & \textrm{for } j_{t-1} = 1 \\ \rho^{j}(1-\rho)^{1-j} & \textrm{for } j_{t-1} = 0 \end{array} \right. $$
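A sketch of this generative process in Python/NumPy (the function name and defaults are illustrative assumptions; for \(\rho = 0\) it reduces to the stationary case):

```python
import numpy as np

def simulate_dynamic_bandit(T, K, rho, alpha0=1.0, beta0=1.0, seed=0):
    """Simulate reward probabilities of a dynamic Bernoulli bandit.

    On a change event (j_t = 1) all arms' reward probabilities are redrawn
    from Be(alpha0, beta0); otherwise they stay fixed. A change at t-1
    forces j_t = 0, as in the transition prior above.
    """
    rng = np.random.default_rng(seed)
    theta = rng.beta(alpha0, beta0, size=K)   # initial reward probabilities
    j_prev = 0
    thetas = np.empty((T, K))
    for t in range(T):
        j = 0 if j_prev else int(rng.random() < rho)
        if j:
            theta = rng.beta(alpha0, beta0, size=K)
        thetas[t] = theta
        j_prev = j
    return thetas   # outcome for choice a_t: o_t ~ Bernoulli(thetas[t, a_t])
```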

Belief updating

Given a choice \(a_t\) on trial \(t\), with \(\vec{\theta}_t = (\theta_t^1, \ldots, \theta_t^K) \), belief updating corresponds to

$$ p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \propto p(o_t|\vec{\theta}_t, a_t)\, p(\vec{\theta}_t|j_t, o_{t-1:1})\, p(j_t|o_{t-1:1}) $$

$$ p(\vec{\theta}_t, j_t|o_{t:1}, a_{t:1}) \approx \prod_{k=1}^K Q(\theta_k)\, Q(j_t), \qquad Q(\theta_k) = \mathcal{Be}(\alpha^k_t, \beta^k_t), \quad Q(j_t=1) = \omega_t $$

Belief updating

The resulting variational update equations, given choice \(a_t\) and outcome \(o_t\), are

$$ \omega_t = \frac{\frac{1}{2}\rho(1 - \omega_{t-1})}{ \frac{1}{2}\rho(1 - \omega_{t-1}) + (\mu_t^{a_t})^{o_t}(1 - \mu_t^{a_t})^{1-o_t}\left(1 - \rho(1-\omega_{t-1})\right)} $$

$$ \alpha^k_t = (1 - \omega_t) \alpha^k_{t-1} + \omega_t \alpha_0 + \delta_{a_t, k}\, o_t $$

$$ \beta^k_t = (1 - \omega_t) \beta^k_{t-1} + \omega_t \beta_0 + \delta_{a_t, k}\, (1 - o_t) $$

Stationary case \(\rightarrow \) \(\rho = 0\)

$$ \omega_t = 0, \qquad \alpha^k_t = \alpha^k_{t-1} + \delta_{a_t, k}\, o_t, \qquad \beta^k_t = \beta^k_{t-1} + \delta_{a_t, k}\, (1 - o_t) $$
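A sketch of one belief-update step implementing the equations above (Python/NumPy; I assume \(\mu_t^{a_t}\) denotes the predictive mean \(\alpha^{a_t}_{t-1} / (\alpha^{a_t}_{t-1} + \beta^{a_t}_{t-1})\) of the chosen arm, and the function name is mine):

```python
import numpy as np

def update_beliefs(alpha, beta, omega, a, o, rho, alpha0=1.0, beta0=1.0):
    """One variational belief update for the dynamic Bernoulli bandit.

    alpha, beta : arrays of shape (K,), Beta parameters of Q(theta_k)
    omega       : previous change probability Q(j_{t-1} = 1)
    a, o        : chosen arm (int) and binary outcome on the current trial
    """
    mu = alpha[a] / (alpha[a] + beta[a])   # predictive mean of chosen arm (assumed mu_t^{a_t})
    change = 0.5 * rho * (1.0 - omega)
    stay = mu**o * (1.0 - mu)**(1 - o) * (1.0 - rho * (1.0 - omega))
    omega_new = change / (change + stay)

    chosen = np.zeros_like(alpha)
    chosen[a] = 1.0                        # Kronecker delta over arms
    alpha_new = (1.0 - omega_new) * alpha + omega_new * alpha0 + chosen * o
    beta_new = (1.0 - omega_new) * beta + omega_new * beta0 + chosen * (1 - o)
    return alpha_new, beta_new, omega_new  # for rho = 0: omega_new = 0, standard Beta update
```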

Action Selection Algorithms

  • Optimistic Thompson sampling (O-TS)
  • Bayesian upper confidence bound (B-UCB)
  • Active inference

Thompson sampling

Raj, Vishnu, and Sheetal Kalyani. "Taming non-stationary bandits: A Bayesian approach." arXiv preprint arXiv:1707.09727 (2017).

Classical algorithm for Bayesian bandits

 

$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta^k_t|o_{t-1:1}, a_{t-1:1}\right)$$

 

Optimistic Thompson sampling (O-TS) is defined as

 

$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta^k_t \rangle) \right]$$
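A sketch of both rules in Python/NumPy (the function name is mine; `optimistic=True` gives O-TS by clipping each posterior sample at the arm's posterior mean):

```python
import numpy as np

def thompson_choice(alpha, beta, optimistic=True, rng=None):
    """Thompson sampling / optimistic Thompson sampling for Bernoulli bandits."""
    if rng is None:
        rng = np.random.default_rng()
    theta_star = rng.beta(alpha, beta)              # one posterior sample per arm
    if optimistic:
        # O-TS: never sample below the posterior mean of each arm
        theta_star = np.maximum(theta_star, alpha / (alpha + beta))
    return int(np.argmax(theta_star))
```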

Upper Confidence Bound (UCB)

  • One of the oldest bandit algorithms, defined as follows (see the code sketch after this list)

$$ a_t = \left\{ \begin{array}{cc} \arg\max_k \left(\mu^k_{t-1} + \sqrt{\frac{2 \ln t}{n^{k}_{t-1}}}\right) & \textrm{for } t>K \\ t & \textrm{otherwise} \end{array} \right. $$
  • \(\mu^{k}_t\) denotes the expected reward probability of the \(k\)th arm.
  • \(n^k_{t}\) denotes the number of times the \(k\)th arm has been selected.
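A sketch of the classic rule above (Python/NumPy; here `t` is the 1-based trial index and arms are indexed from 0, so the initialisation phase plays arm \(t-1\)):

```python
import numpy as np

def ucb_choice(t, mu, n, K):
    """Classic UCB: play every arm once, then maximise mean plus exploration bonus.

    t  : 1-based trial index
    mu : array of shape (K,), empirical estimates mu^k_{t-1}
    n  : array of shape (K,), selection counts n^k_{t-1}
    """
    if t <= K:
        return t - 1                                # initialisation phase: play arm t
    return int(np.argmax(mu + np.sqrt(2.0 * np.log(t) / n)))
```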

Bayesian Upper Confidence Bound (B-UCB)

$$ a_t = \arg\max_k \mathrm{CDF}^{-1}\left( 1 - \frac{1}{t};\, \bar{\alpha}_t^k, \bar{\beta}_t^k \right) $$

where \(\mathrm{CDF}^{-1}\) is the inverse regularised incomplete beta function, i.e. \(x = \mathrm{CDF}^{-1}\left(1 - \frac{1}{t};\, \bar{\alpha}_t^k, \bar{\beta}_t^k\right)\) solves

$$ \int_{0}^x \mathcal{Be} \left(\theta; \bar{\alpha}_t^k, \bar{\beta}_t^k \right) \textrm{d}\theta = 1 - \frac{1}{t} $$

with

$$ \bar{\alpha}_t^k = \left(1 - \rho(1-\omega_{t-1})\right) \alpha_{t-1}^k + \frac{1}{2}\rho(1-\omega_{t-1}), \qquad \bar{\beta}_t^k = \left(1 - \rho(1-\omega_{t-1})\right) \beta_{t-1}^k + \frac{1}{2}\rho(1-\omega_{t-1}) $$

Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems." Artificial Intelligence and Statistics, 2012.
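A sketch of B-UCB using `scipy.stats.beta.ppf`, which evaluates the inverse of the regularised incomplete beta function; the prior mixing of \(\bar{\alpha}, \bar{\beta}\) follows the expressions above, and for \(\rho = 0\) it reduces to the plain posterior parameters:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def b_ucb_choice(t, alpha, beta, omega, rho):
    """Bayesian UCB: choose the arm with the largest 1 - 1/t posterior quantile.

    Assumes t >= 2 so that the quantile level 1 - 1/t is positive.
    """
    stay = 1.0 - rho * (1.0 - omega)
    change = 0.5 * rho * (1.0 - omega)
    alpha_bar = stay * alpha + change
    beta_bar = stay * beta + change
    # quantile of the (prior-mixed) Beta posterior at level 1 - 1/t
    q = beta_dist.ppf(1.0 - 1.0 / t, alpha_bar, beta_bar)
    return int(np.argmax(q))
```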

Active inference

  • A behavioural policy corresponds to a single choice \(\rightarrow\) \(\pi = a_t, \quad a_t \in \{1,\ldots,K\}\).
  • Rolling behavioural policies \( \rightarrow \) independent of past choices.

Expected free energy:

$$ G(a_t) = D_{KL}\left(Q(\vec{\theta}|a_t)||P(\vec{\theta}|a_t)\right) + E_{Q(\vec{\theta}|a_t)}\left[H[o_t|\vec{\theta}, a_t] \right] $$

Expected surprisal:

$$ S(a_t) = D_{KL}\left(Q(o_t |a_t)||P(o_t)\right) + E_{Q(\vec{\theta}|a_t)}\left[H[o_t|\vec{\theta}, a_t] \right] $$

Parr, Thomas, and Karl J. Friston. "Generalised free energy and active inference." Biological Cybernetics 113.5-6 (2019): 495-513.

Active inference

  • \( G(a_t) \geq S(a_t) \), but in general \( \arg\min_a G(a) \neq \arg\min_a S(a)\).

Prior preferences over reward probabilities and outcomes are parameterised by \(\lambda\) and related by marginalisation:

$$ P(\vec{\theta}_t| a_t=k) \propto \left( \theta_t^k\right)^{e^{2\lambda}-1}, \qquad P(o_t) \propto e^{o_t \lambda} e^{-(1-o_t) \lambda}, \qquad P(o_t) = \int \mathrm{d} \vec{\theta}\, p(o_t|\vec{\theta}_t, a_t)\, P(\vec{\theta}_t|a_t) $$
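As a consistency check (my own short derivation), marginalising the parameter preference, \(P(\vec{\theta}_t|a_t = k) = \mathcal{Be}(\theta^k_t; e^{2\lambda}, 1)\), reproduces the outcome preference:

$$ P(o_t = 1) = \int \mathrm{d}\vec{\theta}\, \theta^k_t\, P(\vec{\theta}_t|a_t = k) = \frac{e^{2\lambda}}{e^{2\lambda} + 1} = \frac{e^{\lambda}}{e^{\lambda} + e^{-\lambda}} \propto e^{\lambda}, $$

which matches \(P(o_t) \propto e^{o_t \lambda} e^{-(1-o_t)\lambda}\) evaluated at \(o_t = 1\).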

Active inference - choice selection

Posterior over policies

$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)}$$

 

For rolling policies \( F(\pi) = \mathrm{const.} \), hence

$$Q(\pi)\propto e^{-\gamma G(\pi)}$$

Choice selection based on expected free energy (F-AI)

    $$ a_t \sim p(a_t) \propto e^{-\gamma G(a_t, \lambda)} $$

Choice selection based on expected surprisal (S-AI)

    $$ a_t \sim p(a_t) \propto e^{-\gamma S(a_t, \lambda)} $$

Approximate expected surprisal (A-AI)

Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis." arXiv preprint arXiv:2001.07203 (2020).

$$ S(a_t) \approx - \left(1-\rho(1-\omega_{t-1})\right)\cdot\left[\lambda (2\mu_{t-1}^{a_t} - 1) + \frac{1}{2\nu_{t-1}^{a_t}} \right] $$
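A sketch combining the approximation above with the softmax choice rule from the previous slide (Python/NumPy; I assume \(\mu^k = \alpha^k/(\alpha^k+\beta^k)\) and \(\nu^k = \alpha^k + \beta^k\), which is not stated explicitly on the slide):

```python
import numpy as np

def a_ai_choice(alpha, beta, omega, rho, lam=0.8, gamma=20.0, rng=None):
    """Approximate active inference (A-AI) choice selection."""
    if rng is None:
        rng = np.random.default_rng()
    mu = alpha / (alpha + beta)    # assumed posterior mean mu^k
    nu = alpha + beta              # assumed "precision" nu^k of the Beta posterior
    # approximate expected surprisal per arm
    S = -(1.0 - rho * (1.0 - omega)) * (lam * (2.0 * mu - 1.0) + 0.5 / nu)
    logits = -gamma * S
    p = np.exp(logits - logits.max())
    p /= p.sum()                   # softmax: p(a) proportional to exp(-gamma * S(a))
    return int(rng.choice(len(p), p=p))
```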

Next step: compare F-AI, S-AI, and A-AI in stationary bandits.

Comparison between AI variants

Figure: parametrised regret rate for the A-AI algorithm.

Comparison:

  • Optimistic Thompson sampling (O-TS)
  • Bayesian upper confidence bound (B-UCB)
  • Approximate active inference (A-AI)

Comparison in stationary bandits

A-AI parameters: \(\lambda = 0.8\), \(\gamma = 20\).

Figure: regret rate over time for \(\epsilon = 0.10\), \(\epsilon = 0.25\), and \(\epsilon = 0.40\).

Switching bandits

  • Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits

  • One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\).

  • All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).

  • The optimal arm changes with probability \(\rho\) (see the simulation sketch after this list).

  • Task difficulty:
    • Advantage of the best arm \(\rightarrow\) \(\epsilon\)
    • Number of arms \(\rightarrow\) \(K\)
    • Change probability \(\rightarrow\) \(\rho\)
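A sketch of the switching environment (Python/NumPy; I assume that on a change event the new best arm is drawn uniformly from the remaining arms, which is one reading of "the optimal arm changes"):

```python
import numpy as np

def simulate_switching_bandit(T, K, eps, rho, seed=0):
    """Reward probabilities of a switching Bernoulli bandit.

    One arm pays 1/2 + eps, all others pay 1/2; with probability rho per
    trial the identity of the best arm moves to a different arm.
    """
    rng = np.random.default_rng(seed)
    best = int(rng.integers(K))
    probs = np.full((T, K), 0.5)
    for t in range(T):
        if rng.random() < rho:
            others = [k for k in range(K) if k != best]
            best = int(rng.choice(others))      # the optimal arm switches
        probs[t, best] = 0.5 + eps
    return probs    # outcome for choice a_t: o_t ~ Bernoulli(probs[t, a_t])
```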

Comparison in switching bandits

Figure: comparison for \(\epsilon = 0.25\); A-AI parameters: \(\lambda = 0.8\), \(\gamma = 20\).

Conclusion

  • When comparing active inference with the alternatives on stationary bandits, we get mixed results.
  • In non-stationary bandits, active-inference-based agents show lower performance if changes occur often enough.
  • A tentative TODO list:
    • Determine the optimality relation for AI algorithms, e.g. \(\gamma^* = f(\lambda, K, \epsilon, \rho)\).
    • Introduce learning of \(\lambda\).
    • Would adaptive \(\lambda\) lead to values that minimize regret rate?

Thanks to:

  • Hrvoje Stojić
  • Sarah Schwöbel
  • Stefan Kiebel
  • Thomas Parr
  • Karl Friston
  • Ryan Smith
  • Lancelot Da Costa
Project A9

https://slides.com/revealme/

https://github.com/dimarkov/aibandits