Dimitrije Marković, Hrvoje Stojić, Sarah Schwöbel, and Stefan Kiebel
Active Inference Lab
22.06.2021
Max Planck UCL Centre for Computational Psychiatry and Ageing Research
Secondmind (https://www.secondmind.ai/)
[1] Wilson, Robert C., and Yael Niv. "Inferring relevance in a changing world." Frontiers in human neuroscience 5 (2012): 189.
[2] Payzan-LeNestour, Elise, et al. "The neural representation of unexpected uncertainty during value-based decision making." Neuron 79.1 (2013): 191-201.
[3] Schulz, Eric, Nicholas T. Franklin, and Samuel J. Gershman. "Finding structure in multi-armed bandits." Cognitive psychology 119 (2020): 101261.
[4] Bouneffouf, Djallel, Irina Rish, and Charu Aggarwal. "Survey on Applications of Multi-Armed and Contextual Bandits." 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2020.
On any trial an agent makes a choice between \(K\) arms.
Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits
Reward probabilities are fixed with \(p_{max}=\frac{1}{2} + \epsilon\) and \( p_{\neg max} = \frac{1}{2}\).
Task difficulty:
The best arm advantage \(\rightarrow\) \(\epsilon\)
The number of arms \(\rightarrow\) \(K\)
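A minimal NumPy sketch of this stationary Bernoulli bandit; the function names (`make_bandit`, `pull`) are illustrative, not taken from the accompanying code:

```python
import numpy as np

def make_bandit(K, eps, rng):
    """Stationary Bernoulli bandit: one arm at 1/2 + eps, all others at 1/2."""
    probs = np.full(K, 0.5)
    probs[rng.integers(K)] = 0.5 + eps
    return probs

def pull(probs, arm, rng):
    """Binary outcome o_t for the chosen arm a_t."""
    return rng.binomial(1, probs[arm])

rng = np.random.default_rng(0)
probs = make_bandit(K=10, eps=0.25, rng=rng)   # harder for small eps and large K
o = pull(probs, arm=3, rng=rng)
```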
Given some choice \(a_t\) on trial \(t\), belief updating corresponds to
The Beta distribution is a conjugate prior:
Given some choice \(a_t\) on trial \(t\), belief updating corresponds to
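A sketch of the conjugate Beta-Bernoulli update, assuming per-arm parameters stored in arrays (the bookkeeping is illustrative):

```python
import numpy as np

def update_beliefs(alpha, beta, arm, outcome):
    """Beta-Bernoulli conjugate update: only the chosen arm's counts change."""
    alpha, beta = alpha.copy(), beta.copy()
    alpha[arm] += outcome          # reward count
    beta[arm] += 1 - outcome       # non-reward count
    return alpha, beta

K = 10
alpha, beta = np.ones(K), np.ones(K)                 # Beta(1, 1) priors
alpha, beta = update_beliefs(alpha, beta, arm=3, outcome=1)
```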
One of the oldest algorithms
Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling."
Advances in neural information processing systems 24 (2011): 2249-2257.
Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems."
Artificial intelligence and statistics, 2012.
Inverse regularised incomplete beta function
Best results for \( c=0\)
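The Beta quantile needed by B-UCB is exactly the inverse regularised incomplete beta function; a sketch using `scipy.special.betaincinv` with the \(c=0\) schedule \(1 - 1/t\) used later:

```python
import numpy as np
from scipy.special import betaincinv

def bucb_choice(alpha, beta, t):
    """Bayesian UCB with the c = 0 schedule: rank arms by the 1 - 1/t posterior
    quantile, computed with the inverse regularised incomplete beta function."""
    level = 1.0 - 1.0 / t
    return int(np.argmax(betaincinv(alpha, beta, level)))
```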
Lu, Xue, Niall Adams, and Nikolas Kantas. "On adaptive estimation for dynamic Bernoulli bandits."
Foundations of Data Science 1.2 (2019): 197.
Classical algorithm for Bayesian bandits
$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta_k|o_{t-1:1}, a_{t-1:1}\right)$$
Optimistic Thompson sampling (O-TS) defined as
$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_k \rangle) \right]$$
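A sketch of both selection rules, assuming Beta posteriors with per-arm parameters `alpha`, `beta`:

```python
import numpy as np

def thompson_choice(alpha, beta, rng):
    """Classical Thompson sampling: one posterior draw per arm, pick the largest."""
    return int(np.argmax(rng.beta(alpha, beta)))

def o_ts_choice(alpha, beta, rng):
    """Optimistic Thompson sampling: clip each draw from below at its posterior mean."""
    draws = rng.beta(alpha, beta)
    means = alpha / (alpha + beta)
    return int(np.argmax(np.maximum(draws, means)))
```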
expected free energy
Friston, Karl, et al. "Active inference: a process theory."
Neural computation 29.1 (2017): 1-49.
Posterior over policies
$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)}$$
For rolling policies, \( F(\pi) = \text{const.} \), so
$$Q(\pi)\propto e^{-\gamma G(\pi)}$$
Choice selection
$$ a_t \sim p(a_t) \propto e^{-\gamma G_t(a_t)} $$
Optimal choice \(\left( \gamma \rightarrow \infty \right)\)
$$ a_t = \arg\min_a G_t(a) $$
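A sketch of the resulting choice rule, sampling from the softmax of \(-\gamma G_t\); as \(\gamma\) grows this concentrates on the arg min:

```python
import numpy as np

def select_action(G, gamma, rng):
    """Sample a_t with p(a_t) proportional to exp(-gamma * G_t(a_t));
    as gamma -> infinity this concentrates on argmin_a G_t(a)."""
    logits = -gamma * (G - G.min())     # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(G), p=p))
```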
expected free energy
Expected free energy (G-AI)
Approximate expected free energy (A-AI)
Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis."
arXiv preprint arXiv:2001.07203 (2020).
$$ R(t) = p_{max}(t) - p_{a_t}(t) $$
$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$
$$ r(T) = \frac{1}{T} \hat{R}(T)$$
For good algorithms the average regret vanishes, \( r(T) \rightarrow 0 \) as \( T \rightarrow \infty \)
Asymptotically efficient algorithms scale as \( \hat{R}(T) \sim \ln T \) (when \(T \rightarrow \infty \))
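A sketch of how these regret quantities can be computed from the optimal and chosen reward probabilities per trial:

```python
import numpy as np

def regret_curves(p_max, p_chosen):
    """Per-trial regret R(t), cumulative regret R_hat(T), and average regret r(T)."""
    R = np.asarray(p_max) - np.asarray(p_chosen)   # R(t)
    R_hat = np.cumsum(R)                           # R_hat(T) for every T
    r = R_hat / np.arange(1, len(R) + 1)           # r(T) = R_hat(T) / T
    return R, R_hat, r
```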
On any trial an agent makes a choice between \(K\) arms.
Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits
One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\)
All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).
The optimal arm changes with probability \(\rho\).
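A sketch of this switching process, re-drawing the identity of the best arm with probability \(\rho\) (arm indexing is illustrative):

```python
import numpy as np

def switch_best_arm(probs, eps, rho, rng):
    """With probability rho the identity of the best arm is re-drawn uniformly;
    reward probabilities stay at 1/2 + eps (best arm) and 1/2 (all others)."""
    if rng.random() < rho:
        probs = np.full_like(probs, 0.5)
        probs[rng.integers(len(probs))] = 0.5 + eps
    return probs
```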
Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits
On each trial arms maintain their reward probabilities with probability \( 1 - \rho \)
Or are sampled anew from a uniform distribution with probability \( \rho \)
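A sketch of this variant; treating the switches as independent across arms is an assumption of the sketch, not stated above:

```python
import numpy as np

def resample_probs(probs, rho, rng):
    """Each arm keeps its reward probability with probability 1 - rho,
    otherwise it is re-drawn from a uniform distribution on [0, 1]."""
    switch = rng.random(len(probs)) < rho
    return np.where(switch, rng.uniform(size=len(probs)), probs)
```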
Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits
On each trial the reward probability is generated from the following process
Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_1, \ldots, \theta_K) \), belief updating corresponds to
Mean-field approximation
Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_1, \ldots, \theta_K) \), belief updating corresponds to
Liakoni, Vasiliki, et al. "Learning in volatile environments with the Bayes factor surprise."
Neural Computation 33.2 (2021): 269-340.
Stationary case \(\rightarrow \) \(\rho = 0\)
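A sketch of a dynamic belief update built from the \((1-\rho)\alpha + \rho\) mixing with the uniform prior shown for B-UCB below; the exact mean-field update is given in the references:

```python
import numpy as np

def dynamic_update(alpha, beta, arm, outcome, rho):
    """Forgetting step mixing with the uniform Beta(1, 1) prior,
    alpha_bar = (1 - rho) * alpha + rho (and likewise for beta),
    followed by the usual conjugate update for the chosen arm.
    With rho = 0 this reduces to the stationary update."""
    alpha = (1 - rho) * alpha + rho
    beta = (1 - rho) * beta + rho
    alpha[arm] += outcome
    beta[arm] += 1 - outcome
    return alpha, beta
```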
Optimistic Thompson sampling (O-TS)
\( \theta^*_k \sim p\left(\theta_{t, k}|o_{t-1:1}, a_{t-1:1}\right) \)
\(a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta_{t, k} \rangle) \right]\)
Bayesian upper confidence bound (B-UCB)
\( a_t = \arg\max_k \text{CDF}^{-1}\left( 1 - \frac{1}{t}; \bar{\alpha}_{t,k}, \bar{\beta}_{t,k} \right) \)
\(\bar{\alpha}_{t, k} = (1-\rho) \alpha_{t-1, k} + \rho \)
\(\bar{\beta}_{t, k} = (1-\rho) \beta_{t-1, k} + \rho \)
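A sketch of the B-UCB rule above, using `scipy.special.betaincinv` as the Beta quantile (\(\text{CDF}^{-1}\)):

```python
import numpy as np
from scipy.special import betaincinv

def bucb_dynamic_choice(alpha, beta, t, rho):
    """B-UCB for the dynamic bandit: mix the posterior parameters with the prior,
    then pick the arm with the largest 1 - 1/t posterior quantile."""
    alpha_bar = (1 - rho) * alpha + rho
    beta_bar = (1 - rho) * beta + rho
    return int(np.argmax(betaincinv(alpha_bar, beta_bar, 1.0 - 1.0 / t)))
```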
Approximate expected free energy (A-AI)
\( \tilde{G}_t(a) = - 2 \lambda \mu_{t-1, a} - \frac{1}{2\nu_{t-1, a}} \)
\(a_t = \arg\min_a \tilde{G}_t(a) \)
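A sketch of the A-AI rule, reading \(\mu_{t-1,a}\) as the posterior mean \(\alpha/(\alpha+\beta)\) and \(\nu_{t-1,a}\) as the pseudo-count \(\alpha+\beta\); that reading is an assumption of this sketch:

```python
import numpy as np

def a_ai_choice(alpha, beta, lam):
    """Approximate expected free energy per arm,
    G_tilde(a) = -2 * lam * mu_a - 1 / (2 * nu_a),
    with mu_a read as the posterior mean and nu_a as the pseudo-count alpha + beta
    (an assumption of this sketch); pick the arm with the smallest value."""
    mu = alpha / (alpha + beta)
    nu = alpha + beta
    return int(np.argmin(-2 * lam * mu - 1 / (2 * nu)))
```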
[Figure: simulation results for \(\epsilon=0.1\) and \(K=40\); A-AI shown as a dotted line with \(\lambda=0.25\); comparison uses \(\lambda_{\text{G-AI}} = 0.25\) and \(\lambda_{\text{A-AI}} = 0.5\).]
Thanks to:
https://slides.com/dimarkov/
https://github.com/dimarkov/aibandits