RL Food gathering 

move on a grid

Each step, you smell whether you made a good move

finite memory

Put the reward far away

\(\Rightarrow\) two-armed bandit problem

bandit problem

Our agent will remember:

  • the N past (action, reward) pairs
  • a shift register of N bits
  • a random-access memory of N bits

bandit problem

with finite memory
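A minimal simulation sketch (Python; all names hypothetical) of the shift-register variant: the agent's whole memory is \(N\) bits, so its strategy is just a table over \(2^N\) memory states.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_shift_register_agent(p_arms, n_bits=3, policy=None, t_max=100_000):
    """Two-armed Bernoulli bandit; the agent's only memory is an
    n_bits shift register (here storing the last n_bits reward bits;
    the talk's register may encode actions as well)."""
    n_states = 2 ** n_bits
    if policy is None:
        # policy[s] = probability of pulling arm 1 in memory state s
        policy = np.full(n_states, 0.5)
    state, total_reward = 0, 0
    for _ in range(t_max):
        arm = int(rng.random() < policy[state])        # pick an arm
        reward = int(rng.random() < p_arms[arm])       # Bernoulli payout
        total_reward += reward
        # shift in the newest reward bit, dropping the oldest
        state = ((state << 1) | reward) % n_states
    return total_reward / t_max

gamma = 0.1
print(run_shift_register_agent([0.5, 0.5 + gamma]))    # mean reward per step
```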

Our approach: finding an optimal strategy for the finite-memory two-armed bandit problem

environment  \(\epsilon\)

  1. microstate: \(\sigma \in \Omega_\epsilon\)
  2. actions: \(a\)
  3. transition: \(P_\epsilon[\sigma_{t+1} \mid \sigma_t, a_t]\)

agent  \(\pi\)

  1. state: \(s\)
  2. strategy: \(\pi(a,s)\)
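In code, this formalism is just a transition tensor plus a policy table; a sketch with assumed shapes (for simplicity the agent state is identified with the microstate \(\sigma\) here):

```python
import numpy as np

rng = np.random.default_rng(1)

n_states, n_actions = 8, 2    # assumed sizes: |Omega_eps| and number of arms

# P_env[a, s, s'] = P_eps[sigma_{t+1} = s' | sigma_t = s, a_t = a]
P_env = rng.random((n_actions, n_states, n_states))
P_env /= P_env.sum(axis=2, keepdims=True)      # normalize over s'

# pi[a, s] = pi(a, s): probability of action a in state s
pi = np.full((n_actions, n_states), 1.0 / n_actions)

def step(sigma):
    """One interaction step: a ~ pi(., sigma), then sigma' ~ P_eps[.|sigma, a]."""
    a = rng.choice(n_actions, p=pi[:, sigma])
    sigma_next = rng.choice(n_states, p=P_env[a, sigma])
    return a, sigma_next
```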

[Figure: agent state \(s\) interacting with environment microstates \(\sigma\)]

environment  \(\epsilon\)

  1. microstate: \(\sigma \in \Omega_\epsilon\)
  2. actions: \(a\)
  3. transition: \(P_\epsilon[\sigma_{t+1} \mid \sigma_t, a_t]\)

agent  \(\pi\)

  1. state: \(s\)
  2. strategy: \(\pi(a,s)\)
  • Markov property: \(P_\epsilon[s_{t+1} \mid s_t a_t,\; s_{t-1} a_{t-1},\; \dots,\; \sigma_0] = P_\epsilon[s_{t+1} \mid s_t a_t]\)
  • \(P_{\epsilon\pi}[s'|s] = \sum_a \pi(a,s)\, P_\epsilon[s'|s,a]\)
  • steady state \(p_{\epsilon\pi}(s)\): power method (sketch below)
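A sketch of both formulas above, assuming the shapes from the earlier snippet (pi: actions × states, P_env: actions × states × states):

```python
import numpy as np

def combined_transition(pi, P_env):
    """P_eps_pi[s'|s] = sum_a pi(a, s) P_eps[s'|s, a].
    pi: (A, S); P_env: (A, S, S); returns a row-stochastic (S, S) matrix."""
    return np.einsum('as,ast->st', pi, P_env)

def steady_state(P, tol=1e-12, max_iter=1_000_000):
    """Power method: iterate p <- p P until the distribution stops moving."""
    p = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        p_next = p @ P
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p
```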

  • \(G = \langle\;\mathbb{E}_{\epsilon\pi}[r(s)]\;\rangle_\epsilon\)

  • gradient ascent on \(\pi\) to maximize \(G\)
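A finite-difference sketch of this step for a single environment (the talk averages \(G\) over the ensemble \(\langle\cdot\rangle_\epsilon\), and its actual gradient computation may be analytic); `reward` is an assumed per-state reward table:

```python
import numpy as np

def gain(theta, P_env, reward, power_steps=5000):
    """G for one environment: steady-state average of reward[s].
    theta parametrizes pi through a softmax so it stays normalized."""
    pi = np.exp(theta)
    pi /= pi.sum(axis=0)                      # normalize over actions
    P = np.einsum('as,ast->st', pi, P_env)    # P_eps_pi[s'|s]
    p = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(power_steps):              # power method
        p = p @ P
    return p @ reward

def ascend(theta, P_env, reward, lr=1.0, h=1e-6, n_iter=200):
    """Finite-difference gradient ascent on G with respect to theta."""
    for _ in range(n_iter):
        g0 = gain(theta, P_env, reward)
        grad = np.zeros_like(theta)
        for idx in np.ndindex(*theta.shape):
            t = theta.copy()
            t[idx] += h
            grad[idx] = (gain(t, P_env, reward) - g0) / h
        theta = theta + lr * grad
    return theta
```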

problem of \(\infty\) time

finite time via a reset:

\(\tilde P_\epsilon[\sigma' | \sigma, a]=(1-r) P_\epsilon[\sigma' | \sigma, a] + r \; p_0(\sigma')\)

reset probability: \(r\)
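The reset is a one-line modification of the transition tensor; a sketch (`p0` is the reset distribution \(p_0\)):

```python
import numpy as np

def with_reset(P_env, p0, r):
    """tilde_P[s'|s, a] = (1 - r) P[s'|s, a] + r p0[s'].
    P_env: (A, S, S); p0: (S,) reset distribution; r: reset probability.
    Rows still sum to one, so the chain stays stochastic."""
    return (1.0 - r) * P_env + r * p0[None, None, :]
```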

given \(\langle\,\cdot\,\rangle_\epsilon\) and \(r\), solve for \(\pi\)

\(\epsilon = (\frac12 \leftrightarrow \frac12 \pm \gamma)\) (arm reward probabilities) and reset \(r = 10^{-9}\)

new action on the right

small probabilities \(\varepsilon\)

\(\varepsilon \propto \sqrt{r} \)

\(\text{expl. time} \propto \sqrt{\text{env. time}} \)


minimize \( t + T/t \)
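One way to read this trade-off (an interpretation, not stated on the slide): with total time \(T \sim 1/r\) before a reset, exploring for \(t\) steps costs roughly \(t\), while the penalty for misidentifying the best arm scales like \(T/t\). Optimizing,

\[
\frac{d}{dt}\left(t + \frac{T}{t}\right) = 1 - \frac{T}{t^2} = 0
\quad\Longrightarrow\quad
t^* = \sqrt{T},
\]

which recovers \(\text{expl. time} \propto \sqrt{\text{env. time}}\), i.e. \(\varepsilon \propto \sqrt{r}\).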

\(\epsilon = (\frac12 \leftrightarrow \frac12 \pm \gamma)\) (arm reward probabilities) and reset \(r = 10^{-9}\)

small reset

symmetric under \(A \leftrightarrow B\)

\(\varepsilon = \sqrt{3r}\)

reset \(r = 0.01\)

\(\varepsilon \approx \sqrt{3r}\)

12 states out

\(\pi\) at reset = \(0.1\)

\(p(s)=0\) for 32 states

 

AAB+++ AAB++- AAB-++ AAB-+- ABA+++ ABA++- ABA+-+ ABA+-- ABA-++ ABA-+- ABB+++ ABB++- ABB+-+ ABB+-- ABB--+ ABB--- BAA+++ BAA++- BAA+-+ BAA+-- BAA--+ BAA--- BAB+++ BAB++- BAB+-+ BAB+-- BAB-++ BAB-+- BBA+++ BBA++- BBA-++ BBA-+-

Thanks

Conclusions

Optimal solutions exploit the best arm

\(\Delta G/ G^* = 0.046 \rightarrow 0.045\)

reset \(r = 10^{-7}\)

\(\Delta G/ G^* = 0.045\)

gradient flow

\(\Delta G/G^* = 0.99 \rightarrow 0.005\)

reset \(r = 10^{-7}\)

\(G^* - G = 0.0005\)

reset \(r = 10^{-7}\)

\(G^* - G = 0.0006\)

U-shaped strategy

\(\Delta G/G^* = 0.00098\)

random initialization

\(\Delta G / G^* = 0.00305\)

Deterministic strategy

\(\Delta G/G^* = 0.044 \)