Dimitrije Marković
Theoretical Neurobiology Meeting
30.11.2020
On any trial an agent makes a choice between K arms.
Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits
Reward probabilities are fixed with \(p_{max}=\frac{1}{2} + \epsilon\) and \( p_{\neg max} = \frac{1}{2}\).
Task difficulty:
The best arm advantage \(\rightarrow\) \(\epsilon\)
The number of arms \(\rightarrow\) \(K\)
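A minimal Python (NumPy) sketch of this task setup; the function names are illustrative and the uniform draw of the best arm is an assumption, not necessarily the scheme used in the aibandits repository.

```python
import numpy as np

def make_bandit(K, eps, rng):
    """Stationary Bernoulli bandit: one arm at 1/2 + eps, all others at 1/2."""
    probs = np.full(K, 0.5)
    best = int(rng.integers(K))    # best arm drawn uniformly (an assumption)
    probs[best] = 0.5 + eps
    return probs, best

def pull(probs, arm, rng):
    """Binary outcome o_t of the chosen arm."""
    return int(rng.binomial(1, probs[arm]))

rng = np.random.default_rng(0)
probs, best = make_bandit(K=10, eps=0.25, rng=rng)
outcome = pull(probs, arm=3, rng=rng)
```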
On any trial an agent makes a choice between K arms. Performance is measured by the per-trial regret, the cumulative regret, and the regret rate:
$$ R(t) = p_{max}(t) - p_{a_t}(t) $$
$$ \hat{R}(T) = \sum_{t=1}^T R(t) $$
$$ r(T) = \frac{1}{T} \hat{R}(T)$$
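As a quick illustration, the three quantities above can be computed from the recorded reward probabilities of the best and the chosen arm (an illustrative helper, not taken from the aibandits code):

```python
import numpy as np

def regret_curves(p_max, p_chosen):
    """p_max, p_chosen: arrays of length T holding p_max(t) and p_{a_t}(t)."""
    R = p_max - p_chosen                      # per-trial regret R(t)
    R_hat = np.cumsum(R)                      # cumulative regret R_hat(T)
    r = R_hat / np.arange(1, len(R) + 1)      # regret rate r(T)
    return R, R_hat, r
```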
Given some choice \(a_t\) on trial \(t\) and \(\vec{\theta}_t = (\theta_1, \ldots, \theta_K)\), belief updating corresponds to
Stationary case \(\rightarrow \) \(\rho = 0\)
Raj, Vishnu, and Sheetal Kalyani. "Taming non-stationary bandits: A Bayesian approach."
arXiv preprint arXiv:1707.09727 (2017).
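A minimal Python sketch of belief updating in the stationary case (\(\rho = 0\)), i.e. the standard conjugate Beta-Bernoulli update of the per-arm Beta parameters; the discounted non-stationary variant of Raj and Kalyani (2017) is not reproduced here.

```python
import numpy as np

def update_beliefs(alpha, beta, arm, outcome):
    """Stationary case (rho = 0): conjugate Beta-Bernoulli update
    of the beliefs over theta^k_t for the played arm."""
    alpha, beta = alpha.copy(), beta.copy()
    alpha[arm] += outcome          # count of observed rewards
    beta[arm] += 1 - outcome       # count of observed non-rewards
    return alpha, beta

# flat Beta(1, 1) beliefs over K = 10 arms
alpha, beta = np.ones(10), np.ones(10)
alpha, beta = update_beliefs(alpha, beta, arm=3, outcome=1)
```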
Thompson sampling (TS) \(\rightarrow\) a classical algorithm for Bayesian bandits, defined as
$$a_t = \arg\max_k \theta^*_k, \qquad \theta^*_k \sim p\left(\theta^k_t|o_{t-1:1}, a_{t-1:1}\right)$$
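A minimal sketch of this sampling rule for Beta-Bernoulli beliefs, assuming per-arm Beta(\(\alpha_k, \beta_k\)) posteriors:

```python
import numpy as np

def thompson_sampling(alpha, beta, rng):
    """Draw theta*_k from each arm's Beta posterior and choose the argmax."""
    theta_star = rng.beta(alpha, beta)
    return int(np.argmax(theta_star))
```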
Optimistic Thompson sampling (O-TS) is defined as
$$a_t = \arg\max_k \left[ \max(\theta^*_k, \langle \theta^k_t \rangle) \right]$$
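And the O-TS variant, with each posterior sample clipped from below at the posterior mean \(\langle \theta^k_t \rangle\) (again an illustrative sketch):

```python
import numpy as np

def optimistic_thompson_sampling(alpha, beta, rng):
    """O-TS: clip each posterior sample from below at the posterior mean."""
    theta_star = rng.beta(alpha, beta)
    posterior_mean = alpha / (alpha + beta)    # <theta^k_t>
    return int(np.argmax(np.maximum(theta_star, posterior_mean)))
```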
Upper confidence bounds \(\rightarrow\) one of the oldest bandit algorithms; the Bayesian variant (Bayes-UCB) selects the arm with the largest posterior quantile
Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems." Artificial intelligence and statistics, 2012.
Inverse regularised incomplete beta function \(\rightarrow\) the quantile function of the Beta posterior
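A hedged sketch of a Bayes-UCB style rule for Beta posteriors (Kaufmann et al., 2012). The quantile level \(1 - 1/t\) used below is one common choice and an assumption here, not necessarily the schedule used in the talk.

```python
import numpy as np
from scipy.special import betaincinv

def bayes_ucb(alpha, beta, t):
    """Choose the arm whose Beta posterior has the largest (1 - 1/t)-quantile."""
    level = 1.0 - 1.0 / max(t, 2)
    # betaincinv(a, b, q) is the inverse regularised incomplete beta function,
    # i.e. the q-quantile of a Beta(a, b) distribution
    bounds = betaincinv(alpha, beta, level)
    return int(np.argmax(bounds))
```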
expected free energy
expected surprisal
Parr, Thomas, and Karl J. Friston. "Generalised free energy and active inference." Biological cybernetics 113.5-6 (2019): 495-513.
Expected free energy vs. expected surprisal:
\( G(a_t) \geq S(a_t) \)
\( \arg\min_a G(a) \neq \arg\min_a S(a)\)
$$ P(\vec{\theta}_t| a_t=k) \propto \left( \theta_t^k\right)^{\left( e^{2\lambda}-1 \right)}$$
\( P(o_t) \propto e^{o_t \lambda} e^{-(1-o_t) \lambda} \)
\(P(o_t) = \int d \vec{\theta} p(o_t|\vec{\theta}_t, a_t) P(\vec{\theta}_t|a_t) \)
Parr, Thomas, and Karl J. Friston. "Generalised free energy and active inference." Biological cybernetics 113.5-6 (2019): 495-513.
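A small numerical check of the relation between the two parameterisations above: with \(P(\vec{\theta}_t|a_t=k) \propto (\theta_t^k)^{e^{2\lambda}-1}\), i.e. a Beta\((e^{2\lambda}, 1)\) factor on the chosen arm, the marginal \(P(o_t)\) obtained from the integral matches the softmax form \(P(o_t) \propto e^{o_t \lambda} e^{-(1-o_t)\lambda}\).

```python
import numpy as np

lam = 0.7
a, b = np.exp(2 * lam), 1.0    # Beta(e^{2 lambda}, 1) prior preference on theta^k

# marginal P(o_t = 1) = E[theta^k] under the prior preference
p1_marginal = a / (a + b)

# softmax preference: P(o_t = 1) and P(o_t = 0) proportional to exp(+/- lambda)
p1_softmax = np.exp(lam) / (np.exp(lam) + np.exp(-lam))

assert np.isclose(p1_marginal, p1_softmax)
```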
Posterior over policies
$$ Q(\pi) \propto e^{-\gamma G(\pi) - F(\pi)}$$
For rolling policies \( F(\pi) = \mathrm{const.} \), hence
$$Q(\pi)\propto e^{-\gamma G(\pi)}$$
Choice selection based on expected free energy (F-AI)
$$ a_t \sim p(a_t) \propto e^{-\gamma G(a_t, \lambda)} $$
Choice selection based on expected surprisal (S-AI)
$$ a_t \sim p(a_t) \propto e^{-\gamma S(a_t, \lambda)} $$
Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis."
arXiv preprint arXiv:2001.07203 (2020).
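A minimal sketch of the softmax choice rule shared by F-AI and S-AI, assuming the per-arm values \(G(a_t, \lambda)\) or \(S(a_t, \lambda)\) have already been computed (their computation is not reproduced here):

```python
import numpy as np

def select_action(values, gamma, rng):
    """Sample a_t with probability proportional to exp(-gamma * values[a]);
    `values` holds G(a, lambda) for F-AI or S(a, lambda) for S-AI."""
    logits = -gamma * np.asarray(values, dtype=float)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```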
Next step:
compare F-AI, S-AI, and A-AI in stationary bandits
A-AI:
[Figure: simulation results for \( \epsilon = 0.10, 0.25, 0.40 \)]
Choice outcomes are binary variables \(\rightarrow\) Bernoulli bandits
One arm is associated with the maximal reward probability \(p_{max}=\frac{1}{2} + \epsilon\)
All other arms are fixed to \( p_{\neg max} = \frac{1}{2}\).
The optimal arm changes with probability \(\rho\).
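A minimal sketch of these switching dynamics, assuming that with probability \(\rho\) the \(\frac{1}{2} + \epsilon\) advantage moves to an arm drawn uniformly at random (the uniform redraw is an assumption):

```python
import numpy as np

def step_best_arm(probs, best, rho, eps, rng):
    """With probability rho, move the 1/2 + eps advantage to a newly drawn arm."""
    if rng.random() < rho:
        probs[best] = 0.5
        best = int(rng.integers(len(probs)))   # uniform redraw -- an assumption
        probs[best] = 0.5 + eps
    return probs, best
```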
A-AI:
[Figure: simulation results in non-stationary bandits, \( \epsilon = 0.25 \)]
Thanks to:
https://slides.com/revealme/
https://github.com/dimarkov/aibandits