AN EMPIRICAL EVALUATION OF ACTIVE INFERENCE IN MULTI-ARMED BANDITS
Dimitrije Marković, Hrvoje Stojić, Sarah Schwöbel, and Stefan Kiebel
Active Inference Lab
22.06.2021
Max Planck UCL Centre for Computational Psychiatry and Ageing Research
Secondmind (https://www.secondmind.ai/)
Motivation
- Multi-armed bandits generalise resource allocation problems.
- Often used in experimental cognitive neuroscience:
- Decision making in dynamic environments \(^1\).
- Value-based decision making \(^2\).
- Structure learning \(^3\).
- Wide range of industrial applications\(^4\).
[1] Wilson, Robert C., and Yael Niv. "Inferring relevance in a changing world." Frontiers in Human Neuroscience 5 (2012): 189.
[2] Payzan-LeNestour, Elise, et al. "The neural representation of unexpected uncertainty during value-based decision making." Neuron 79.1 (2013): 191-201.
[3] Schulz, Eric, Nicholas T. Franklin, and Samuel J. Gershman. "Finding structure in multi-armed bandits." Cognitive Psychology 119 (2020): 101261.
[4] Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. "Survey on Applications of Multi-Armed and Contextual Bandits." 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2020.
Multi-armed Bandits
- Stationary bandits
- Dynamic bandits
- Adversarial bandits
- Risk-aware bandits
- Contextual bandits
- Non-Markovian bandits
- ...
Outline
- Stationary bandits
- Exact inference
- Action selection algorithms
- Asymptotic efficiency
- Switching bandits
- Approximate inference
- Comparison
Stationary (classical) bandits
- On any trial an agent makes a choice between \(K\) arms.
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- Reward probabilities are fixed, with \(p_{max}=\frac{1}{2} + \epsilon\) for the best arm and \( p_{\neg max} = \frac{1}{2}\) for all others.
- Task difficulty:
  - The best arm's advantage \(\rightarrow\) \(\epsilon\)
  - The number of arms \(\rightarrow\) \(K\)
Bayesian Inference
- Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
- Beliefs about reward probabilities as Beta distributions.
- Bayesian belief updating for all action selection algorithms.
Bayesian Inference
Generative model
- A K-armed bandit.
- The \(k\)th arm is associated with a reward probability \(\theta_k \in [0, 1]\).
- Choice outcomes denoted with \( o_t \in \{0, 1 \} \).
Belief updating
Given some choice \(a_t\) on trial \(t\), belief updating corresponds to the conjugate update
$$ p\left(\theta_k|o_{t:1}, a_{t:1}\right) = \text{Beta}\left(\alpha_{t,k}, \beta_{t,k}\right), \qquad \alpha_{t, k} = \alpha_{t-1, k} + \delta_{a_t, k}\, o_t, \quad \beta_{t, k} = \beta_{t-1, k} + \delta_{a_t, k}\,(1 - o_t) $$
Beta distribution is a conjugate prior:
- Both prior and posterior belong to the same distribution family.
- Inference is exact.
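A minimal Python sketch of this conjugate update (the array names, the `update_beliefs` helper, and the flat Beta(1, 1) prior are illustrative choices, not taken from the accompanying repository):

```python
import numpy as np

def update_beliefs(alpha, beta, choice, outcome):
    """Conjugate Beta-Bernoulli update, applied only to the chosen arm."""
    alpha = alpha.copy()
    beta = beta.copy()
    alpha[choice] += outcome        # count of rewarded pulls
    beta[choice] += 1 - outcome     # count of unrewarded pulls
    return alpha, beta

# flat Beta(1, 1) priors over K = 5 arms
K = 5
alpha, beta = np.ones(K), np.ones(K)
alpha, beta = update_beliefs(alpha, beta, choice=2, outcome=1)
```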
Action Selection Algorithms
- Optimistic Thompson sampling (O-TS)
- Bayesian upper confidence bound (B-UCB)
- Active inference
- \(\epsilon\)-greedy
- UCB
- KL-UCB
- Thompson sampling
- etc.
Upper Confidence bound (UCB)
- One of the oldest algorithms.
- \(m_{t, k} \) denotes the expected reward probability of the \(k\)th arm.
- \(n_{t, k}\) denotes the number of times the \(k\)th arm was selected.
Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling."
Advances in neural information processing systems 24 (2011): 2249-2257.
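A minimal sketch of the classic UCB1 rule built from \(m_{t,k}\) and \(n_{t,k}\); the \(\sqrt{2 \ln t / n}\) exploration bonus is the textbook constant and may differ from the exact variant evaluated here:

```python
import numpy as np

def ucb_choice(m, n, t):
    """UCB1: exploit the expected reward m[k], explore rarely pulled arms.
    In practice each arm is pulled once first; max(n, 1) only avoids a
    division by zero in this sketch."""
    bonus = np.sqrt(2 * np.log(t) / np.maximum(n, 1))
    return int(np.argmax(m + bonus))
```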
Bayesian Upper Confidence bound (B-UCB)
Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems."
Artificial intelligence and statistics, 2012.
$$ a_t = \arg\max_k \mathrm{CDF}^{-1}\left( 1 - \frac{1}{t\,(\ln T)^c};\; \alpha_{t,k}, \beta_{t,k} \right) $$
- \(\mathrm{CDF}^{-1}\) is the quantile of the Beta posterior, i.e. the inverse regularised incomplete beta function, and \(T\) denotes the horizon.
- Best results for \( c=0 \).
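A sketch of this rule using SciPy's Beta quantile function; the function name and the explicit horizon argument `T` are illustrative, and with the default `c = 0` the quantile level reduces to \(1 - 1/t\):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def b_ucb_choice(alpha, beta, t, T, c=0):
    """Bayes-UCB: pick the arm with the largest posterior quantile.
    beta_dist.ppf is the inverse regularised incomplete beta function."""
    level = 1 - 1 / (t * np.log(T) ** c)   # c = 0 gives 1 - 1/t
    return int(np.argmax(beta_dist.ppf(level, alpha, beta)))
```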
Thompson sampling
Lu, Xue, Niall Adams, and Nikolas Kantas. "On adaptive estimation for dynamic Bernoulli bandits."
Foundations of Data Science 1.2 (2019): 197.
Classical algorithm for Bayesian bandits
$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta_k|o_{t-1:1}, a_{t-1:1}\right)$$
Optimistic Thompson sampling (O-TS) is defined as
$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_k \rangle) \right]$$
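A sketch of both sampling rules for Beta posteriors (function names are illustrative, not from the accompanying repository):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_choice(alpha, beta):
    """Standard Thompson sampling: draw one sample per arm from the Beta
    posterior and pick the arm with the largest draw."""
    return int(np.argmax(rng.beta(alpha, beta)))

def optimistic_thompson_choice(alpha, beta):
    """O-TS: clip each draw from below at the posterior mean, so sampling
    can only add optimism, never pessimism."""
    sample = rng.beta(alpha, beta)
    mean = alpha / (alpha + beta)
    return int(np.argmax(np.maximum(sample, mean)))
```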
Active inference
- Choices are selected by minimising the expected free energy \(G\).
- Rolling behavioural policies \( \rightarrow \) independent of past choices.
- A behavioural policy corresponds to a single choice \(\rightarrow\) \(\pi = a_t, \quad a_t \in \{1,\ldots,K\}\).
Friston, Karl, et al. "Active inference: a process theory."
Neural computation 29.1 (2017): 1-49.
Active inference - choice selection
Posterior over policies
$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)} $$
For rolling policies \( F(\pi) = const. \), hence
$$ Q(\pi) \propto e^{-\gamma G(\pi)} $$
Choice selection
$$ a_t \sim p(a_t) \propto e^{-\gamma G_t(a_t)} $$
Optimal choice \(\left( \gamma \rightarrow \infty \right)\)
$$ a_t = \argmin_a G_t(a) $$
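A sketch of this choice rule for a given vector of expected free energies; the default \(\gamma = 16\) is an arbitrary illustrative value, not a parameter reported here:

```python
import numpy as np

rng = np.random.default_rng(0)

def active_inference_choice(G, gamma=16.0):
    """Sample a_t from p(a) proportional to exp(-gamma * G(a)); as gamma grows
    this approaches the deterministic argmin of the expected free energy."""
    G = np.asarray(G)
    logits = -gamma * (G - G.min())           # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(G), p=p))
```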
Computing expected free energy
- Expected free energy (G-AI)
- Approximate expected free energy (A-AI)
Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis."
arXiv preprint arXiv:2001.07203 (2020).
Choice optimality as regret minimization
- At trial \( t \) an agent chooses arm \(a_t\).
- We define the regret as
$$ R(t) = p_{max}(t) - p_{a_t}(t) $$
- The cumulative regret after \(T\) trials is obtained as
$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$
- Regret rate is defined as
$$ r(T) = \frac{1}{T} \hat{R}(T)$$
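A sketch of how these three quantities can be computed from simulated reward probabilities and choices (the array shapes and the function name are assumptions of this sketch):

```python
import numpy as np

def regret_rate(p, choices):
    """Per-trial regret R(t) = p_max(t) - p_{a_t}(t), its cumulative sum
    R_hat(T), and the regret rate r(T) = R_hat(T) / T.
    p has shape (T, K): reward probabilities per trial and arm;
    choices has shape (T,): integer arm indices."""
    per_trial = p.max(axis=-1) - p[np.arange(len(choices)), choices]
    cumulative = per_trial.cumsum()
    rate = cumulative / np.arange(1, len(choices) + 1)
    return per_trial, cumulative, rate
```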
Asymptotic Efficiency
For good algorithms the cumulative regret grows sub-linearly, so the regret rate vanishes: \( r(T) \rightarrow 0 \).
Asymptotically efficient algorithms scale as \( \hat{R}(T) \sim \ln T \) when \(T \rightarrow \infty \), matching the Lai-Robbins lower bound.
Comparison:
- Optimistic Thompson sampling (O-TS)
- Bayesian upper confidence bound (B-UCB)
- Exact active inference (G-AI)
- Approximate active inference (A-AI)
Comparison between active inference algorithms
Comparison of all algorithms
End-point histogram
Short-term behavior
What would make active inference asymptotically efficient?
Non-stationary (Dynamic) bandits
- On any trial an agent makes a choice between \(K\) arms.
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- Reward probabilities associated with each arm change over time:
  - changes happen at the same time on all arms (e.g. switching bandits)
  - changes happen independently on each arm (e.g. restless bandits)
- The additional task difficulty:
  - The rate of change / change probability
Switching bandits with fixed difficulty
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\).
- All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).
- The optimal arm changes with probability \(\rho\).
- Task difficulty:
  - Advantage of the best arm \(\rightarrow\) \(\epsilon\)
  - Number of arms \(\rightarrow\) \(K\)
  - Change probability \(\rightarrow\) \(\rho\)
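A sketch of a generative process for this environment (the function name and parameter values are illustrative, not taken from the accompanying repository):

```python
import numpy as np

rng = np.random.default_rng(0)

def switching_bandit(T, K, eps, rho):
    """Reward probabilities for a switching bandit with fixed difficulty:
    one arm at 1/2 + eps, all others at 1/2, with the identity of the best
    arm re-drawn with probability rho on every trial."""
    p = np.full((T, K), 0.5)
    best = rng.integers(K)
    for t in range(T):
        if t > 0 and rng.random() < rho:
            best = rng.integers(K)    # the optimal arm switches
        p[t, best] = 0.5 + eps
    return p

p = switching_bandit(T=1000, K=10, eps=0.25, rho=0.01)
```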
Switching bandits with Varying difficulty
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- On each trial, arms maintain their reward probabilities with probability \( 1 - \rho \),
- or have them re-sampled from a uniform distribution with probability \( \rho \).
- Fixed task difficulty:
  - Number of arms \(\rightarrow\) \(K\)
  - Change probability \(\rightarrow\) \(\rho\)
Restless bandits
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- On each trial the reward probability of each arm is generated by an underlying stochastic process.
Bayesian Inference
- Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
- Beliefs about reward probabilities as Beta distributions.
- Bayesian belief updating for all action selection algorithms.
- Agents have access to the underlying probability of change.
Bayesian Inference
Generative model
- A K-armed bandit.
- The \(k\)th arm is associated with a reward probability \(\theta_{t, k} \in [0, 1]\).
- Choice outcomes denoted with \( o_t \).
Belief updating
Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_{t,1}, \ldots, \theta_{t,K}) \), belief updating proceeds under a mean-field approximation of the posterior over \(\vec{\theta}_t\).
Variational surprise minimisation learning (SMiLe)
Liakoni, Vasiliki, et al. "Learning in volatile environments with the Bayes factor surprise."
Neural Computation 33.2 (2021): 269-340.
Recovering the stationary bandit
Stationary case \(\rightarrow \) \(\rho = 0\)
Action Selection Algorithms
Optimistic Thompson sampling (O-TS)
\( \theta^*_k \sim p\left(\theta_{t, k}|o_{t-1:1}, a_{t-1:1}\right) \)
\(a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_{t, k} \rangle) \right]\)
Bayesian upper confidence bound (B-UCB)
\( a_t = \argmax_k CDF^{-1}\left( 1 - \frac{1}{t}; \bar{\alpha}_{t,k}, \bar{\beta}_{t,k} \right) \)
\(\bar{\alpha}_{t, k} = (1-\rho) \alpha_{t-1, k} + \rho \)
\(\bar{\beta}_{t, k} = (1-\rho) \beta_{t-1, k} + \rho \)
Approximate expected free energy (A-AI)
\( \tilde{G}_t(a) = - 2 \lambda \mu_{t-1, a} - \frac{1}{2\nu_{t-1, a}} \)
\(a_t = \argmin_a \tilde{G}_t(a) \)
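A sketch combining the discounted update of \(\bar{\alpha}_{t,k}, \bar{\beta}_{t,k}\) above with the A-AI rule. Reading \(\mu\) as the posterior mean and \(\nu\) as \(\bar{\alpha}+\bar{\beta}\) is an assumption of this sketch, not necessarily the exact definition used here:

```python
import numpy as np

def discount_beliefs(alpha, beta, rho):
    """Mix the posterior counts with the flat Beta(1, 1) prior at rate rho,
    as in the update for alpha-bar and beta-bar above."""
    return (1 - rho) * alpha + rho, (1 - rho) * beta + rho

def a_ai_choice(alpha, beta, lam=0.5):
    """Approximate expected free energy; mu is taken as the posterior mean
    and nu as the total pseudo-count alpha + beta (assumptions of this sketch)."""
    mu = alpha / (alpha + beta)
    nu = alpha + beta
    G = -2 * lam * mu - 1 / (2 * nu)
    return int(np.argmin(G))
```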
Comparison:
- Optimistic Thompson sampling (O-TS)
- Bayesian upper confidence bound (B-UCB)
- Approximate active inference (A-AI)
Comparison in switching bandits - fixed difficulty
- \(\epsilon=0.1\); A-AI: \(\lambda = 0.5 \)
- \(K=40\); A-AI: \(\lambda = 0.5 \)
Comparison in switching bandits - varying difficulty
- dotted line: \( \lambda=0.25 \)
- \( \lambda_{\text{G-AI}} = 0.25\), \( \lambda_{\text{A-AI}} = 0.5\)
conclusion
- Active inference does not result in asymptotically efficient decision making. Additional work is required to establish theoretical bounds on regret and derive improved algorithms.
- In non-stationary bandits, agents based on active inference show improved performance, which is especially noticeable in more difficult settings.
- A tentative TODO list:
- Introduce learning of the \(\lambda\) parameter.
- Establish theoretical lower/upper bounds on cumulative regret.
- Improve algorithms for stationary case.
- Test real-world applications.
Thanks to:
- Hrvoje Stojić
- Sarah Schwöbel
- Stefan Kiebel
- Thomas Parr
- Karl Friston
- Ryan Smith
- Lancelot Da Costa
https://slides.com/dimarkov/
https://github.com/dimarkov/aibandits