Active Inference in multi-armed bandits
Dimitrije Marković
Theoretical Neurobiology Meeting
30.11.2020
Active inference and semi-Markov decision processes
- (Part I) Active inference in multi-armed bandits
  - Comparison with UCB and Thompson sampling.
  - Comparison of state and outcome based preferences.
- (Part II) Active inference and semi-Markov processes
  - Semi-Markov models and a representation of state duration.
  - Learning the hidden temporal structure of state transitions.
  - Application: Reversal learning task.
- (Part III) Active inference and semi-Markov decision processes
  - Extending policies with action (policy) duration.
  - Deciding when actions should be taken and for how long.
  - Applications: Temporal attention, intertemporal choices.
Motivation
- Multi-armed bandits generalise resource allocation problems
- Often used in experimental cognitive neuroscience:
- Decision making in dynamic environments
- Spatio-temporal attentional processing
- Wide range of industrial applications.
- Good benchmark problem for comparing active inference based solutions with the state-of-the-art alternatives.
Multi-armed Bandits
- Stationary bandits
- Dynamic bandits
- Adversarial bandits
- Risk-aware bandits
- Contextual bandits
- Non-Markovian bandits
Stationary (classical) bandits
- On any trial an agent makes a choice between \(K\) arms.
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- Reward probabilities are fixed with \(p_{max}=\frac{1}{2} + \epsilon\) and \( p_{\neg max} = \frac{1}{2}\).
- Task difficulty:
  - The best arm's advantage \(\rightarrow\) \(\epsilon\)
  - The number of arms \(\rightarrow\) \(K\)
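The setup above is easy to simulate. A minimal Python sketch (the function names are illustrative, not taken from the linked aibandits repository):

```python
import numpy as np

def make_stationary_bandit(K, epsilon, rng):
    """Reward probabilities: one arm at 1/2 + epsilon, all others at 1/2."""
    probs = np.full(K, 0.5)
    best = rng.integers(K)
    probs[best] = 0.5 + epsilon
    return probs

def pull(probs, arm, rng):
    """Draw a Bernoulli choice outcome o_t for the selected arm."""
    return int(rng.binomial(1, probs[arm]))

rng = np.random.default_rng(0)
probs = make_stationary_bandit(K=4, epsilon=0.25, rng=rng)
outcome = pull(probs, arm=0, rng=rng)
```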
Non-stationary (dynamic) bandits
- On any trial an agent makes a choice between \(K\) arms.
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- Reward probabilities associated with each arm change over time:
  - changes happen at the same time on all arms (e.g. switching bandits)
  - changes happen independently on each arm (e.g. restless bandits)
- The additional task difficulty:
  - The rate of change/change probability
Choice optimality as regret minimization
- At trial \( t \) an agent chooses arm \(a_t\).
- We define the regret as
$$ R(t) = p_{max}(t) - p_{a_t}(t) $$
- The cumulative regret after \(T\) trials is obtained as
$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$
- Regret rate is defined as
$$ r(T) = \frac{1}{T} \hat{R}(T)$$
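Given the simulated reward probabilities and the sequence of choices, the three quantities above reduce to a few lines of NumPy (a sketch; the helper name is mine):

```python
import numpy as np

def regret_summary(probs_per_trial, chosen_arms):
    """probs_per_trial: (T, K) reward probabilities; chosen_arms: (T,) choices.
    Returns per-trial regret R(t), cumulative regret, and the regret rate r(T)."""
    probs = np.asarray(probs_per_trial, dtype=float)
    chosen = np.asarray(chosen_arms)
    T = len(chosen)
    per_trial = probs.max(axis=1) - probs[np.arange(T), chosen]   # R(t)
    cumulative = per_trial.cumsum()                               # cumulative regret
    return per_trial, cumulative, cumulative[-1] / T              # r(T)
```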
Bayesian Inference
- Bernoulli bandits \( \rightarrow \) Bernoulli likelihoods.
- Beliefs about reward probabilities as Beta distributions.
- Bayesian belief updating for all action selection algorithms.
Generative model
- A K-armed bandit.
- The \(k\)th arm is associated with a reward probability \(\theta^k_t\) at trial \(t\).
- Choice outcomes denoted with \( o_t \).
Belief updating
Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta^1_t, \ldots, \theta^K_t)\), belief updating corresponds to computing the posterior \(p\left(\vec{\theta}_t|o_{t:1}, a_{t:1}\right)\).
Stationary case \(\rightarrow \) \(\rho = 0\), where \(\rho\) denotes the change probability.
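The update equations themselves were shown graphically on the slide. As a sketch, the stationary case (\(\rho = 0\)) is the standard conjugate Beta-Bernoulli update of the chosen arm; the \(\rho > 0\) branch below uses a simple exponential forgetting towards the Beta(1, 1) prior, which is an assumption for illustration rather than the exact rule from the slide:

```python
import numpy as np

def update_beliefs(alpha, beta, arm, outcome, rho=0.0):
    """Beta(alpha_k, beta_k) beliefs about each arm's reward probability.
    rho = 0: exact conjugate update of the chosen arm.
    rho > 0: decay all parameters towards the Beta(1, 1) prior first
    (one simple way to track changing reward probabilities)."""
    alpha = (1.0 - rho) * alpha + rho * 1.0
    beta = (1.0 - rho) * beta + rho * 1.0
    alpha[arm] += outcome        # reward observed
    beta[arm] += 1 - outcome     # no reward observed
    return alpha, beta

alpha, beta = np.ones(4), np.ones(4)   # flat priors for K = 4 arms
alpha, beta = update_beliefs(alpha, beta, arm=0, outcome=1)
```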
Action Selection Algorithms
- Optimistic Thompson sampling (O-TS)
- Bayesian upper confidence bound (B-UCB)
- Active inference
Thompson sampling
Raj, Vishnu, and Sheetal Kalyani. "Taming non-stationary bandits: A Bayesian approach."
arXiv preprint arXiv:1707.09727 (2017).
Classical algorithm for Bayesian bandits
$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta^k_t|o_{t-1:1}, a_{t-1:1}\right)$$
Optimistic Thompson sampling (O-TS) is defined as
$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta^k_t \rangle) \right]$$
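A sketch of the two sampling rules on top of the Beta beliefs; the optimistic variant clips each sample at the posterior mean \(\langle \theta^k_t \rangle\):

```python
import numpy as np

def thompson_choice(alpha, beta, rng, optimistic=True):
    """Sample theta*_k from each Beta posterior and pick the argmax.
    With optimistic=True (O-TS), samples are clipped at the posterior mean."""
    samples = rng.beta(alpha, beta)
    if optimistic:
        samples = np.maximum(samples, alpha / (alpha + beta))
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
arm = thompson_choice(np.ones(4), np.ones(4), rng)
```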
Upper confidence bound (UCB)
- One of the oldest bandit algorithms (UCB1), defined as
$$ a_t = \arg\max_k \left[ \mu^k_t + \sqrt{\frac{2 \ln t}{n^k_t}} \right] $$
- \(\mu^{k}_t\) denotes the expected reward probability of the \(k\)th arm.
- \(n^k_{t}\) denotes the number of times the \(k\)th arm was selected.
Bayesian upper confidence bound (B-UCB)
Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems." Artificial Intelligence and Statistics, 2012.
- The upper confidence bound of each arm is a quantile of its Beta posterior, computed with the inverse regularised incomplete beta function.
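A sketch of the B-UCB choice rule; SciPy's `beta.ppf` is the inverse regularised incomplete beta function, and the quantile level \(1 - \frac{1}{t (\log T)^c}\) follows Kaufmann et al. (2012) — the exact level used on the slides may differ:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def bayes_ucb_choice(alpha, beta, t, T, c=0.0):
    """Pick the arm with the largest upper quantile of its Beta posterior.
    t: current trial (1-based), T: horizon, c: exponent from Kaufmann et al."""
    level = 1.0 - 1.0 / (t * np.log(T) ** c)
    bounds = beta_dist.ppf(level, alpha, beta)   # Beta posterior quantiles
    return int(np.argmax(bounds))

arm = bayes_ucb_choice(np.ones(4), np.ones(4), t=10, T=1000)
```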
Active inference
expected free energy
- rolling behavioural policies \( \rightarrow \) independent of past choices.
- a behavioural policy corresponds to a single choice \(\rightarrow\) \(\pi = a_t, \quad a_t \in \{1,\ldots,K\}\).
expected surprisal
\( G(a_t) \geq S(a_t) \)
\( \arg\min_a G(a) \neq \arg\min_a S(a)\)
$$ P(\vec{\theta}_t| a_t=k) \propto \left( \theta_t^k\right)^{\left( e^{2\lambda}-1 \right)}$$
\( P(o_t) \propto e^{o_t \lambda} e^{-(1-o_t) \lambda} \)
\(P(o_t) = \int d \vec{\theta} p(o_t|\vec{\theta}_t, a_t) P(\vec{\theta}_t|a_t) \)
Parr, Thomas, and Karl J. Friston. "Generalised free energy and active inference." Biological cybernetics 113.5-6 (2019): 495-513.
Active inference - choice selection
Posterior over policies
$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)}$$
For rolling policies \( F(\pi) = const. \)
$$Q(\pi)\propto e^{-\gamma G(\pi)}$$
Choice selection based on expected free energy (F-AI)
$$ a_t \sim p(a_t) \propto e^{-\gamma G(a_t, \lambda)} $$
Choice selection based on expected surprisal (S-AI)
$$ a_t \sim p(a_t) \propto e^{-\gamma S(a_t, \lambda)} $$
Approximate expected surprisal (A-AI)
Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis."
arXiv preprint arXiv:2001.07203 (2020).
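The exact expressions for \(G\) and \(S\) appeared as equations on the slides; the sketch below uses only one common approximation — negative expected log-preference (with \(P(o_t) \propto e^{\lambda(2 o_t - 1)}\)) minus the expected information gain of the chosen arm's Beta belief — followed by the softmax choice rule \(p(a_t) \propto e^{-\gamma G(a_t)}\). The function names and the closed-form information gain are my own, so treat this as an illustration of the A-AI idea rather than the exact A-AI expression:

```python
import numpy as np
from scipy.special import digamma

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def expected_free_energy(alpha, beta, lam):
    """Approximate G(a=k) = -(pragmatic value) - (epistemic value) for
    Beta(alpha_k, beta_k) beliefs and preferences P(o) ~ exp(lam*(2o - 1))."""
    mu = alpha / (alpha + beta)
    utility = lam * (2 * mu - 1)                 # E_Q[ln P(o)] up to a constant
    # expected outcome entropy E_Q[H[Bernoulli(theta_k)]] in closed form
    cond_entropy = -(mu * (digamma(alpha + 1) - digamma(alpha + beta + 1))
                     + (1 - mu) * (digamma(beta + 1) - digamma(alpha + beta + 1)))
    info_gain = bernoulli_entropy(mu) - cond_entropy   # mutual information I(o; theta_k)
    return -utility - info_gain

def active_inference_choice(alpha, beta, lam, gamma, rng):
    """Sample a_t from p(a_t) proportional to exp(-gamma * G(a_t, lam))."""
    logits = -gamma * expected_free_energy(alpha, beta, lam)
    p = np.exp(logits - logits.max())
    return int(rng.choice(len(p), p=p / p.sum()))

rng = np.random.default_rng(0)
arm = active_inference_choice(np.ones(4), np.ones(4), lam=0.8, gamma=20.0, rng=rng)
```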
Next step:
compare F-AI, S-AI, and A-AI in stationary bandits
Comparison between active inference variants (F-AI, S-AI, A-AI)
Parametrised regret rate for A-AI algorithm
Comparison:
- Optimistic Thompson sampling (O-TS)
- Bayesian upper confidence bound (B-UCB)
- Approximate active inference (A-AI)
Comparison in stationary bandits
[Figure: regret rate over time; A-AI with \(\lambda = 0.8\), \(\gamma = 20\); panels for \(\epsilon = 0.10\), \(\epsilon = 0.25\), and \(\epsilon = 0.40\).]
Switching bandits
- Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits.
- One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\).
- All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).
- The optimal arm changes with probability \(\rho\).
- Task difficulty:
  - Advantage of the best arm \(\rightarrow\) \(\epsilon\)
  - Number of arms \(\rightarrow\) \(K\)
  - Change probability \(\rightarrow\) \(\rho\)
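A minimal generator for this environment (whether the new best arm may coincide with the previous one is a detail the slides do not fix; here it is resampled uniformly over all arms):

```python
import numpy as np

def switching_bandit(K, epsilon, rho, T, rng):
    """(T, K) reward probabilities: all arms at 1/2, except the current
    best arm at 1/2 + epsilon; with probability rho the best arm is
    resampled (uniformly) at the start of a trial."""
    probs = np.full((T, K), 0.5)
    best = rng.integers(K)
    for t in range(T):
        if t > 0 and rng.random() < rho:
            best = rng.integers(K)
        probs[t, best] = 0.5 + epsilon
    return probs

rng = np.random.default_rng(0)
probs = switching_bandit(K=4, epsilon=0.25, rho=0.05, T=200, rng=rng)
```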
Comparison in switching bandits
[Figure: regret rate for \(\epsilon = 0.25\); A-AI with \(\lambda = 0.8\), \(\gamma = 20\).]
Conclusion
- Comparing active inference with O-TS and B-UCB on stationary bandits gives mixed results.
- In non-stationary bandits, active inference based agents show lower performance if changes occur often enough.
- A tentative TODO list:
- Determine the optimality relation for AI algorithms, e.g. \(\gamma^* = f(\lambda, K, \epsilon, \rho)\).
- Introduce learning of \(\lambda\).
- Would adaptive \(\lambda\) lead to values that minimize regret rate?
Thanks to:
- Hrvoje Stojić
- Sarah Schwöbel
- Stefan Kiebel
- Thomas Parr
- Karl Friston
- Ryan Smith
- Lancelot Da Costa
https://slides.com/revealme/
https://github.com/dimarkov/aibandits