Dimitrije Marković
Theoretical Neurobiology Meeting
30.11.2020
On any trial an agent makes a choice between K arms.
Choice outcomes are binary variables → Bernoulli bandits
Reward probabilities are fixed with p_max = 1/2 + ϵ and p_¬max = 1/2.
Task difficulty:
The best arm's advantage → ϵ
The number of arms → K
R(t) = p_max(t) − p_{a_t}(t)
R̂(T) = ∑_{t=1}^{T} R(t)
r(T) = (1/T) R̂(T)
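As a sketch, the per-trial regret, cumulative regret, and regret rate defined above can be computed as follows (the arm probabilities and the fixed choice sequence are illustrative, not from the slides):

```python
import numpy as np

def regret(p_max, p_chosen):
    """Per-trial regret R(t) = p_max(t) - p_{a_t}(t)."""
    return np.asarray(p_max) - np.asarray(p_chosen)

# Illustrative example: K = 3 arms, eps = 0.25, agent always picks arm 0
eps = 0.25
probs = np.array([0.5, 0.5 + eps, 0.5])    # arm 1 is the best arm
choices = np.zeros(10, dtype=int)          # a_t = 0 on every trial
R = regret(probs.max(), probs[choices])    # per-trial regret R(t)
R_hat = R.cumsum()                         # cumulative regret R̂(T)
r = R_hat / np.arange(1, 11)               # regret rate r(T) = R̂(T)/T
```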
Given some choice a_t on trial t, and beliefs θ_t = (θ_1, …, θ_K), belief updating corresponds to
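The update itself is not spelled out on this slide; assuming the usual conjugate Beta-Bernoulli model, with an optional exponential-forgetting parameter ρ pulling the counts back toward the prior (in the spirit of the discounted Bayesian bandits of the reference below), a minimal sketch is:

```python
import numpy as np

def update_beliefs(alpha, beta, a, o, rho=0.0, alpha0=1.0, beta0=1.0):
    """Beta-Bernoulli belief update for chosen arm a given binary outcome o.

    With rho > 0 all counts decay toward the prior (alpha0, beta0) before
    the new observation is added (exponential forgetting for switching
    bandits); rho = 0 recovers the standard stationary conjugate update.
    """
    alpha = (1 - rho) * alpha + rho * alpha0
    beta = (1 - rho) * beta + rho * beta0
    alpha[a] += o        # success count of the chosen arm
    beta[a] += 1 - o     # failure count of the chosen arm
    return alpha, beta
```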
Stationary case → ρ=0
Raj, Vishnu, and Sheetal Kalyani. "Taming non-stationary bandits: A Bayesian approach."
arXiv preprint arXiv:1707.09727 (2017).
Classical algorithm for Bayesian bandits
a_t = argmax_k θ*_k,  θ*_k ∼ p(θ^k_t | o_{t−1:1}, a_{t−1:1})
Optimistic Thompson sampling (O-TS) defined as
a_t = argmax_k [ max(θ*_k, ⟨θ^k_t⟩) ]
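A minimal sketch of both rules under Beta posteriors (function names are illustrative): classical Thompson sampling takes the argmax over posterior samples, while O-TS clips each sample from below at the posterior mean ⟨θ^k_t⟩ = α_k / (α_k + β_k).

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson(alpha, beta):
    """Classical Thompson sampling: sample theta*_k from each arm's
    Beta posterior and pick the argmax."""
    theta = rng.beta(alpha, beta)
    return int(np.argmax(theta))

def optimistic_thompson(alpha, beta):
    """O-TS: replace each sample with max(theta*_k, posterior mean),
    so no arm is ever evaluated below its expected value."""
    theta = rng.beta(alpha, beta)
    mean = alpha / (alpha + beta)
    return int(np.argmax(np.maximum(theta, mean)))
```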
One of the oldest algorithms, the (Bayesian) upper confidence bound rule, defined as
Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. "On Bayesian upper confidence bounds for bandit problems." Artificial intelligence and statistics, 2012.
Inverse regularised incomplete beta function
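The inverse regularised incomplete beta function is the quantile function of the Beta distribution, available in SciPy as `betaincinv`. A sketch of a Bayes-UCB style rule, assuming (as in the reference above) a quantile level of 1 − 1/t:

```python
import numpy as np
from scipy.special import betaincinv

def bayes_ucb(alpha, beta, t):
    """Bayes-UCB: choose the arm whose Beta(alpha_k, beta_k) posterior has
    the largest (1 - 1/t)-quantile, computed with the inverse regularised
    incomplete beta function."""
    q = betaincinv(alpha, beta, 1.0 - 1.0 / t)
    return int(np.argmax(q))
```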
expected free energy
expected surprisal
Parr, Thomas, and Karl J. Friston. "Generalised free energy and active inference." Biological cybernetics 113.5-6 (2019): 495-513.
expected free energy
expected surprisal
G(a_t) ≥ S(a_t)
argmin_a G(a) = argmin_a S(a)
P(θ_t | a_t = k) ∝ (θ^k_t)^{e^{2λ} − 1}
P(o_t) ∝ e^{o_t λ} e^{−(1−o_t) λ}
P(o_t) = ∫ dθ p(o_t | θ_t, a_t) P(θ_t | a_t)
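For a Bernoulli likelihood and a Beta belief over the chosen arm's θ, the integral in the last line reduces to the mean of that Beta distribution. A minimal sketch, assuming a Beta(α_a, β_a) belief (the function name is illustrative):

```python
import numpy as np

def predicted_outcome(alpha, beta, a):
    """Predictive P(o_t = 1 | a_t = a) under a Beta(alpha_a, beta_a) belief:
    the integral over theta reduces to the mean alpha_a / (alpha_a + beta_a)."""
    return alpha[a] / (alpha[a] + beta[a])
```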
Posterior over policies
Q(π) ∝ e^{−γ G(π) − F(π)}
For rolling policies F(π)=const.
Q(π) ∝ e^{−γ G(π)}
Choice selection based on expected free energy (F-AI)
a_t ∼ p(a_t) ∝ e^{−γ G(a_t, λ)}
Choice selection based on expected surprisal (S-AI)
a_t ∼ p(a_t) ∝ e^{−γ S(a_t, λ)}
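Both selection rules are softmax draws over negative scaled costs; a minimal sketch that works for either the expected free energies G or the expected surprisals S of the arms (the γ value here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax_choice(costs, gamma=4.0):
    """Sample a_t ~ p(a_t) ∝ exp(-gamma * cost(a_t)), where costs can be
    per-arm expected free energies G (F-AI) or expected surprisals S (S-AI)."""
    logits = -gamma * np.asarray(costs)
    p = np.exp(logits - logits.max())   # shift for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p
```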
Da Costa, Lancelot, et al. "Active inference on discrete state-spaces: a synthesis."
arXiv preprint arXiv:2001.07203 (2020).
Next step:
compare F-AI, S-AI, and A-AI in stationary bandits
[Figure: A-AI simulation results in stationary bandits for ϵ = 0.10, 0.25, 0.40]
Choice outcomes are binary variables → Bernoulli bandits
One arm is associated with the maximal reward probability p_max = 1/2 + ϵ.
All other arms are fixed to p_¬max = 1/2.
The optimal arm changes with probability ρ.
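A minimal simulator of this switching Bernoulli bandit (function name and seeding are illustrative): on each trial the identity of the best arm is resampled with probability ρ.

```python
import numpy as np

def switching_bandit(K, eps, rho, T, seed=0):
    """Reward probabilities of a switching Bernoulli bandit over T trials:
    one arm has p = 1/2 + eps, the rest 1/2; with probability rho the
    identity of the optimal arm is resampled on each trial."""
    rng = np.random.default_rng(seed)
    best = rng.integers(K)
    probs = np.full((T, K), 0.5)
    for t in range(T):
        if t > 0 and rng.random() < rho:
            best = rng.integers(K)   # optimal arm switches
        probs[t, best] = 0.5 + eps
    return probs
```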
[Figure: A-AI simulation results in switching bandits, ϵ = 0.25]
Thanks to:
https://slides.com/revealme/
https://github.com/dimarkov/aibandits