move on a grid
At each step you smell whether your move was good
finite memory
Put the reward far away
\(\Rightarrow\) 2-armed bandit problem
Our agent will remember
bandit problem with finite memory
Our approach to finding an optimal strategy for the finite-memory 2-armed bandit problem
[Figure: diagram with node labels \(s\) and \(\sigma\)]
steady state \(p_{\epsilon\pi}(s)\), computed with the power method
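The power method mentioned here can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: it assumes the chain induced by a fixed environment \(\epsilon\) and policy \(\pi\) has already been assembled into a row-stochastic matrix `P`, and the function name `stationary_distribution` and the 3-state example chain are hypothetical.

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Power iteration for a row-stochastic matrix P, with P[i, j] = Pr[j | i].

    Returns a probability vector p with p @ P approximately equal to p.
    """
    n = P.shape[0]
    p = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(max_iter):
        p_next = p @ P             # propagate the distribution one step
        p_next /= p_next.sum()     # guard against round-off drift
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p                       # best estimate if tol was not reached

# Hypothetical 3-state chain, for illustration only.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
p = stationary_distribution(P)
```

For this example chain, detailed balance gives the steady state \((0.6, 0.3, 0.1)\), which the iteration should reproduce. Convergence slows down as the second-largest eigenvalue of `P` approaches 1, which is relevant when the reset probability \(r\) is very small.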
\(G = \langle\;\mathbb{E}_{\epsilon\pi}[r(s)]\;\rangle_\epsilon\)
gradient descent on \(\pi\)
problem of \(\infty\) time
\(\tilde P_\epsilon[\sigma' | \sigma, a]=(1-r) P_\epsilon[\sigma' | \sigma, a] + r \; p_0(\sigma')\)
reset probability : \(r\)
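The reset construction above is a convex mixture of the original kernel with a fixed restart distribution. A minimal sketch, assuming the kernel is stored as an array `P[sigma, a, sigma_next]` and `p0` is a probability vector over \(\sigma'\); the array layout, the function name `with_reset`, and the random example kernel are assumptions made for illustration.

```python
import numpy as np

def with_reset(P, p0, r):
    """Return (1 - r) * P + r * p0 for every (sigma, a) pair.

    P  : array of shape (n_sigma, n_actions, n_sigma), P[s, a, s2] = P[s2 | s, a]
    p0 : array of shape (n_sigma,), the distribution used on reset
    r  : reset probability in [0, 1]
    """
    # p0 broadcasts along the last axis, so every row is mixed with p0.
    return (1.0 - r) * P + r * p0

# Hypothetical kernel with 4 states and 2 actions, for illustration only.
rng = np.random.default_rng(0)
P = rng.random((4, 2, 4))
P /= P.sum(axis=-1, keepdims=True)   # normalize each row to a distribution
p0 = np.full(4, 0.25)                # uniform restart distribution
P_tilde = with_reset(P, p0, r=1e-2)
```

Each row of `P_tilde` is still a probability distribution, and every state with \(p_0(\sigma') > 0\) is reachable from anywhere with probability at least \(r\,p_0(\sigma')\), which is what makes the steady state well defined.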
\( \langle \;\; \rangle_\epsilon, r \Rightarrow \pi \)
\(\epsilon = (\frac12 \leftrightarrow \frac12 \pm \gamma)\) and \(r = 10^{-9}\)
new action on the right
small probabilities \(\varepsilon\)
\(\varepsilon \propto \sqrt{r} \)
\(\text{expl. time} \propto \sqrt{\text{env. time}} \)
minimize \( t + T/t \)
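The square-root scaling can be read off from this minimization. A short worked version, assuming \(t\) stands for the exploration time and \(T\) for the environment time (an interpretation of the symbols, which are not defined explicitly here):

\[
\frac{d}{dt}\left(t + \frac{T}{t}\right) = 1 - \frac{T}{t^{2}} = 0
\quad\Rightarrow\quad
t^{*} = \sqrt{T},
\qquad
t^{*} + \frac{T}{t^{*}} = 2\sqrt{T}.
\]

If in addition \(T \sim 1/r\) and \(t \sim 1/\varepsilon\) (again an assumption), this gives \(\varepsilon \propto \sqrt{r}\).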
symmetric under \(A \leftrightarrow B\)
\(\varepsilon = \sqrt{3r}\)
reset \(r = 0.01\)
\(\varepsilon \approx \sqrt{3r}\)
12 states out
\(\pi\) at reset = \(0.1\)
\(p(s)=0\) for 32 states
AAB+++ AAB++- AAB-++ AAB-+- ABA+++ ABA++- ABA+-+ ABA+-- ABA-++ ABA-+- ABB+++ ABB++- ABB+-+ ABB+-- ABB--+ ABB--- BAA+++ BAA++- BAA+-+ BAA+-- BAA--+ BAA--- BAB+++ BAB++- BAB+-+ BAB+-- BAB-++ BAB-+- BBA+++ BBA++- BBA-++ BBA-+-
Optimal solutions exploit the best arm
\(\Delta G/ G^* = 0.046 \rightarrow 0.045\)
reset \(r = 10^{-7}\)
\(\Delta G/ G^* = 0.045\)
gradient flow
\(\Delta G/G^* = 0.99 \rightarrow 0.005\)
reset \(r = 10^{-7}\)
\(G^* - G = 0.0005\)
reset \(r = 10^{-7}\)
\(G^* - G = 0.0006\)
U-shaped strategy
\(\Delta G/G^* = 0.00098\)
random init
\(\Delta G / G^* = 0.00305\)
Deterministic strategy
\(\Delta G/G^* = 0.044 \)