RL Food gathering
move on a grid
At each step, you smell whether you made a good move
finite memory
Put the reward far away
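Below is a minimal sketch of this setup (grid size, food position, and the "smell = got closer" signal are illustrative assumptions, not taken from the slides):

```python
# Hypothetical grid food-gathering environment: the agent moves on a grid and
# each step receives one bit of "smell" telling it whether the move was good.
import random

SIZE = 9
FOOD = (8, 8)                       # reward placed far from the start (assumed)

def dist(p):                        # Manhattan distance to the food
    return abs(p[0] - FOOD[0]) + abs(p[1] - FOOD[1])

pos = (0, 0)
moves = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
for _ in range(20):
    dx, dy = moves[random.choice('NSEW')]
    new = (min(max(pos[0] + dx, 0), SIZE - 1),
           min(max(pos[1] + dy, 0), SIZE - 1))
    smell = dist(new) < dist(pos)   # "did I do a good move?" (assumed definition)
    pos = new
```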
\(\Rightarrow\) two-armed bandit problem

Our agent will remember one of:
- the N past (actions + rewards)
- a shift register of N bits
- a random-access memory of N bits

\(\Rightarrow\) bandit problem with finite memory (a sketch of the shift-register option follows)
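A minimal sketch of the shift-register option (the bit encoding and names are assumptions for illustration): the agent's entire memory is N bits, one new bit shifted in per step, giving \(2^N\) internal states.

```python
# Hypothetical N-bit shift-register memory for the finite-memory bandit agent.
N = 3  # memory size in bits (assumed)

def push(state: int, bit: int, n_bits: int = N) -> int:
    """Shift in the newest bit and drop the oldest; state stays in [0, 2**n_bits)."""
    return ((state << 1) | (bit & 1)) & ((1 << n_bits) - 1)

s = 0b000                       # empty memory at reset
for rewarded in (1, 0, 1, 1):   # example reward bits, one per step
    s = push(s, rewarded)
print(f"{s:03b}")               # -> '011': the last N bits, newest rightmost
```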
Our approach to finding an optimal strategy for the finite-memory two-armed bandit problem:
environment \(\epsilon\)
- microstate : \(\sigma \in \Omega_\epsilon\)
- actions : \(a\)
- transition : \(P_\epsilon[\sigma_{t+1} | \sigma_t, a_t]\)
agent \(\pi\)
- state : \(s\)
- strategy : \(\pi(a,s)\)
(diagram: one agent state \(s\) coupled to many environment microstates \(\sigma\))
agent \(\pi\)
- \(P_\epsilon[s_{t+1} \mid s_t a_t,\; s_{t-1} a_{t-1},\; \dots,\; \sigma_0] = P_\epsilon[s_{t+1} \mid s_t a_t]\) (the agent state is itself Markovian)
- \(P_{\epsilon\pi}[s'|s] = \sum_a \pi(a,s) P_\epsilon[s'|s a]\)
- steady state: \(p_{\epsilon\pi}(s)\) via the power method (sketched below)
- \(G = \langle\;\mathbb{E}_{\epsilon\pi}[r(s)]\;\rangle_\epsilon\)
- gradient descent on \(\pi\)
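A minimal sketch of this pipeline for a single environment (sizes, arrays, and names are illustrative assumptions): build \(P_{\epsilon\pi}\) from \(\pi\) and \(P_\epsilon\), run the power method to get \(p_{\epsilon\pi}(s)\), and evaluate \(\mathbb{E}_{\epsilon\pi}[r(s)]\); averaging over environments then gives \(G\).

```python
# Hypothetical sketch: steady state p(s) by the power method, then the gain G.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 8, 2                  # assumed sizes

# Environment transition P_env[a, s, s2] = P_eps[s2 | s, a], rows normalized.
P_env = rng.random((n_actions, n_states, n_states))
P_env /= P_env.sum(axis=2, keepdims=True)

# Strategy pi[s, a] = pi(a, s): a distribution over actions in each state.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# Combined chain: P[s, s2] = sum_a pi(a, s) * P_eps[s2 | s, a]
P = np.einsum('sa,ast->st', pi, P_env)

# Power method: iterate p <- p P until p stops changing.
p = np.full(n_states, 1.0 / n_states)
for _ in range(100_000):
    p_next = p @ P
    if np.abs(p_next - p).max() < 1e-13:
        break
    p = p_next

reward = rng.random(n_states)               # assumed reward r(s) per state
print(p @ reward)                           # E[r(s)]; averaging over envs gives G
```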
problem of \(\infty\) time \(\Rightarrow\) make time finite via a reset:
\(\tilde P_\epsilon[\sigma' \mid \sigma, a] = (1-r)\, P_\epsilon[\sigma' \mid \sigma, a] + r\, p_0(\sigma')\)
reset probability: \(r\)
the environment average \(\langle\,\cdot\,\rangle_\epsilon\) and the reset rate \(r\) together determine the optimal \(\pi\) (see the sketch below)
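A one-step sketch of the reset-smoothed kernel (array shapes and the uniform \(p_0\) are assumptions):

```python
# Hypothetical sketch: mix the transition kernel with a reset distribution p0.
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_states = 2, 8                  # assumed sizes
P_env = rng.random((n_actions, n_states, n_states))
P_env /= P_env.sum(axis=2, keepdims=True)   # P_eps[s2 | s, a]
p0 = np.full(n_states, 1.0 / n_states)      # assumed uniform reset distribution

r = 1e-2                                    # reset probability
# tilde P[s2 | s, a] = (1 - r) * P[s2 | s, a] + r * p0(s2); p0 broadcasts over s2.
P_tilde = (1.0 - r) * P_env + r * p0
assert np.allclose(P_tilde.sum(axis=2), 1.0)  # still a valid kernel
```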
\(\epsilon = (\frac12 \leftrightarrow \frac12 \pm \gamma)\) (two arms, reward probabilities \(\frac12\) and \(\frac12 \pm \gamma\)) and \(r = 10^{-9}\)
the new action appears on the right, played with small probabilities \(\varepsilon\)
\(\varepsilon \propto \sqrt{r} \)
\(\text{expl. time} \propto \sqrt{\text{env. time}} \)
minimize \( t + T/t \)
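One way to unpack this line (reading \(t\) as the time spent exploring and \(T \sim 1/r\) as the time between resets; this pairing is an interpretation, not stated on the slide):

\[
\frac{d}{dt}\!\left(t + \frac{T}{t}\right) = 1 - \frac{T}{t^2} = 0
\;\Rightarrow\;
t^* = \sqrt{T},
\qquad
\frac{t^*}{T} = \frac{1}{\sqrt{T}} \propto \sqrt{r},
\]

which recovers both scalings above: exploration time \(\propto \sqrt{\text{env. time}}\) and \(\varepsilon \propto \sqrt{r}\).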
\(\epsilon = (\frac12 \leftrightarrow \frac12 \pm \gamma)\) and \(r = 10^{-9}\)
small reset
symmetric under \(A \leftrightarrow B\)
\(\varepsilon = \sqrt{3r}\)
reset \(r = 0.01\)
\(\varepsilon \approx \sqrt{3r}\)
12 states out
\(\pi\) at reset = \(0.1\)
\(p(s)=0\) for the following 32 states:
AAB+++ AAB++- AAB-++ AAB-+- ABA+++ ABA++- ABA+-+ ABA+-- ABA-++ ABA-+- ABB+++ ABB++- ABB+-+ ABB+-- ABB--+ ABB--- BAA+++ BAA++- BAA+-+ BAA+-- BAA--+ BAA--- BAB+++ BAB++- BAB+-+ BAB+-- BAB-++ BAB-+- BBA+++ BBA++- BBA-++ BBA-+-
Thanks
Conclusions
- Optimal solutions exploit the best arm: \(\Delta G/G^* = 0.046 \rightarrow 0.045\) (reset \(r = 10^{-7}\))
- gradient flow: \(\Delta G/G^* = 0.99 \rightarrow 0.005\) (reset \(r = 10^{-7}\))
- \(G^* - G = 0.0005\) and \(0.0006\) (reset \(r = 10^{-7}\))
- U-shaped strategy: \(\Delta G/G^* = 0.00098\)
- random initialization: \(\Delta G/G^* = 0.00305\)
- deterministic strategy: \(\Delta G/G^* = 0.044\)
RL food gathering
By Mario Geiger