Artyom Sorokin | 4 May
To get the first reward:
go down the first ladder -> jump onto the rope -> jump onto the second ladder -> go down the second ladder -> avoid the skull -> ...
Can we derive an optimal exploration strategy?
Yes (theoretically tractable): multi-armed bandits, contextual bandits, small finite MDPs -> can formalize exploration as POMDP identification
No (theoretically intractable): large or infinite MDPs -> optimal methods don't work ...but we can take inspiration from optimal methods
Bandit Example: assume \(r(a_i) \sim p_{\theta_i}(r_i)\) with unknown parameters \(\theta_i\) (e.g., \(p(r_i = 1) = \theta_i\), \(p(r_i = 0) = 1 - \theta_i\)).
This defines a POMDP with state \(s = [\theta_1, ..., \theta_n]\),
where the belief state is \(\hat{p}(\theta_1, ..., \theta_n)\)
How do we measure the goodness of an exploration algorithm?
Regret: the difference between the reward of the optimal policy and the reward we actually collect:
\(Reg(T) = T\,E[r(a^{*})] - \sum_{t=1}^{T} r(a_t)\)
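As a concrete toy setup for the bandit examples below, here is a minimal sketch of a Bernoulli bandit with unknown parameters \(\theta_i\) and a regret tracker (the class and method names are just for illustration):

```python
import numpy as np

class BernoulliBandit:
    """Bandit with unknown arm parameters theta_1..theta_n (Bernoulli success probabilities)."""

    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas, dtype=float)
        self.rng = np.random.default_rng(seed)
        self.collected = 0.0   # total reward actually collected
        self.steps = 0         # number of pulls T so far

    def pull(self, a):
        """Draw r(a) ~ p_theta_a(r): reward 1 with probability theta_a, else 0."""
        r = float(self.rng.random() < self.thetas[a])
        self.collected += r
        self.steps += 1
        return r

    def regret(self):
        """Reg(T) = T * E[r(a*)] - sum of rewards actually collected."""
        return self.steps * self.thetas.max() - self.collected
```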
keep track of the average reward \(\hat{\mu}_a\) for each action \(a\)
exploitation: pick \(a = \arg\max_a \hat{\mu}_a\)
optimistic estimate: \(a = \arg\max_a \hat{\mu}_a + C\sigma_a\),
where \(\sigma_a\) measures how uncertain we are about this action
Intuition: try each arm until you are sure it's not great
Example (UCB1): \(a = \arg\max_a \hat{\mu}_a + \sqrt{\frac{2\ln T}{N(a)}}\)
\(Reg(T)\) is \(O(\log T)\): as good as possible (no algorithm can do better than \(O(\log T)\) in general)
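A minimal sketch of the UCB rule above, with the standard \(\sqrt{2\ln t / N(a)}\) bonus; `pull` is assumed to be an environment call such as `BernoulliBandit.pull` from the sketch above:

```python
import numpy as np

def ucb_bandit(pull, n_actions, T, c=np.sqrt(2)):
    """UCB: pick a = argmax_a mu_hat[a] + c * sqrt(ln t / N[a])."""
    N = np.zeros(n_actions)         # visit counts N(a)
    mu = np.zeros(n_actions)        # running average rewards mu_hat(a)
    for t in range(1, T + 1):
        if t <= n_actions:          # try every arm once before using the bonus
            a = t - 1
        else:
            bonus = c * np.sqrt(np.log(t) / N)   # optimism: uncertainty bonus per arm
            a = int(np.argmax(mu + bonus))
        r = pull(a)
        N[a] += 1
        mu[a] += (r - mu[a]) / N[a]              # incremental mean update
    return mu, N
```

Usage: `env = BernoulliBandit([0.2, 0.5, 0.8]); ucb_bandit(env.pull, 3, 10_000); env.regret()`.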
Bandit Example (Thompson sampling): sample \(\theta_1, ..., \theta_n \sim \hat{p}(\theta_1, ..., \theta_n)\), pretend the sampled model is correct, take the optimal action, then update the belief.
Correction:
Thompson sampling is asymptotically optimal!
https://arxiv.org/abs/1209.3353
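A sketch of Thompson sampling for the Bernoulli bandit, assuming Beta posteriors over each arm's success probability (so the belief state \(\hat{p}(\theta_1, ..., \theta_n)\) is a product of Beta distributions):

```python
import numpy as np

def thompson_bernoulli(pull, n_actions, T, seed=0):
    """Thompson sampling: sample a model from the posterior, act greedily, update."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_actions)   # Beta posterior: 1 + number of observed successes
    beta = np.ones(n_actions)    # Beta posterior: 1 + number of observed failures
    for _ in range(T):
        theta = rng.beta(alpha, beta)   # sample theta_1..theta_n ~ p_hat(theta_1..theta_n)
        a = int(np.argmax(theta))       # act as if the sampled model were correct
        r = pull(a)                     # observe reward in {0, 1}
        alpha[a] += r                   # conjugate posterior update
        beta[a] += 1 - r
    return alpha, beta
```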
We want to determine some latent variable \(z\)
(e.g. optimal action, q-value, parameters of a model)
Which action do we take to determine \(z\) ?
let \( H(\hat{p}(z))\) be the current entropy of our \(z\) estimate
let \( H(\hat{p}(z)|y)\) be the entropy of our \(z\) estimate after observation \(y\)
Entropy measures lack of information.
Then Information Gain measures how much information about \(z\) we gain by observing \(y\):
\(IG(z; y) = E_y\big[H(\hat{p}(z)) - H(\hat{p}(z)\,|\,y)\big]\)
\(y=r_a\), \(z = \theta_a\) (parameters of a model)
\(g(a) = IG(\theta_a; r_a|a)\) - information gain for action \(a\)
\(\Delta(a) = E[r(a^*) - r(a)]\) - expected suboptimality of a
Policy: choose actions according to \(a = \arg\min_a \frac{\Delta(a)^2}{g(a)}\)
(don't take actions you are sure are suboptimal, and prefer actions that yield a lot of information)
UCB: \(a = \arg\max_a \hat{\mu}_a + \sqrt{\frac{2\ln T}{N(a)}}\) (optimism in the face of uncertainty)
Thompson sampling: sample a model from the posterior \(\hat{p}(\theta_1, ..., \theta_n)\) and act greedily with respect to it
Info Gain: \(a = \arg\min_a \frac{\Delta(a)^2}{g(a)}\) (trade off information gained about \(z\) against suboptimality)
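A rough sketch of the information-gain rule for the same Beta-Bernoulli setting. The closed-form Beta entropies and the Monte-Carlo estimate of \(E[r(a^{*})]\) are simplifying assumptions for illustration, not the exact procedure from any paper:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def info_gain_bernoulli(a, b):
    """g(a) = IG(theta_a; r_a | a): current entropy of the Beta(a, b) posterior
    minus its expected entropy after observing one more reward."""
    h_now = beta_dist(a, b).entropy()
    p_success = a / (a + b)                      # posterior predictive P(r_a = 1)
    h_next = (p_success * beta_dist(a + 1, b).entropy()
              + (1 - p_success) * beta_dist(a, b + 1).entropy())
    return h_now - h_next

def info_gain_action(alpha, beta, seed=0, n_samples=1000):
    """Pick a = argmin_a Delta(a)^2 / g(a)."""
    rng = np.random.default_rng(seed)
    mu = alpha / (alpha + beta)                            # posterior mean reward per arm
    theta = rng.beta(alpha, beta, size=(n_samples, len(alpha)))
    mu_star = theta.max(axis=1).mean()                     # E[r(a*)] via posterior samples
    delta = mu_star - mu                                   # expected suboptimality Delta(a)
    g = np.array([info_gain_bernoulli(a, b) for a, b in zip(alpha, beta)])
    return int(np.argmin(delta ** 2 / (g + 1e-12)))
```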
Can we use this idea with MDPs?
UCB: \(a = \arg\max_a \hat{\mu}_a + \sqrt{\frac{2\ln T}{N(a)}}\)
Yes! Add an exploration bonus (based on \(N(s,a)\) or \(N(s)\)) to the reward:
\(r^{+}(s,a) = r(s,a) + \mathbf{B}(N(s,a))\)
This should work as long as \(\mathbf{B}(N(s,a))\) decreases with \(N(s,a)\)
Use \(r^{+}(s,a)\) with any model-free algorithm!
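A minimal sketch of the count-based bonus in a tabular setting; \(\mathbf{B}(N) = \text{scale}/\sqrt{N}\) is one common decreasing choice, and the class name is illustrative:

```python
from collections import defaultdict
import numpy as np

class CountBonus:
    """Augment rewards: r_plus(s, a) = r(s, a) + B(N(s, a)), with B decreasing in N."""

    def __init__(self, scale=0.1):
        self.scale = scale
        self.counts = defaultdict(int)   # tabular counts N(s, a)

    def augment(self, s, a, r):
        self.counts[(s, a)] += 1
        bonus = self.scale / np.sqrt(self.counts[(s, a)])   # B(N) = scale / sqrt(N)
        return r + bonus   # feed r_plus into any model-free RL algorithm
```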
But there is one problem...
But wait... what is a count? In large or continuous state spaces we almost never visit exactly the same state twice, so the naive count \(N(s)\) is almost always zero or one.
Idea: fit a density model \(p_{\theta}(s)\)
\(p_{\theta}(s)\) might be high even for a new \(s\) if \(s\) is similar to previously seen states.
Can we use \(p_{\theta}(s)\) to get a "pseudo-count"?
The true probability of visiting \(s\) (if we could count) is: \(P(s) = \frac{N(s)}{n}\)
After visiting \(s\) one more time: \(P'(s) = \frac{N(s)+1}{n+1}\)
GOAL:
Train \(p_{\theta}(s)\) on all states seen so far and \(p_{\theta'}(s)\) on those states plus the new \(s\), and require them to obey the analogous equations with a pseudo-count \(\hat{N}(s)\) and pseudo-total \(\hat{n}\): \(p_{\theta}(s) = \frac{\hat{N}(s)}{\hat{n}}\), \(p_{\theta'}(s) = \frac{\hat{N}(s)+1}{\hat{n}+1}\)
Solve this system of two linear equations for the pseudo-count:
\(\hat{N}(s) = \hat{n}\,p_{\theta}(s)\), where \(\hat{n} = \frac{1 - p_{\theta'}(s)}{p_{\theta'}(s) - p_{\theta}(s)}\)
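A sketch of the pseudo-count computation, assuming we already have the two density values \(p_{\theta}(s)\) (model fitted before seeing \(s\)) and \(p_{\theta'}(s)\) (model fitted after also seeing \(s\)):

```python
import numpy as np

def pseudo_count(p_before, p_after, eps=1e-8):
    """Solve p = N_hat/n_hat and p' = (N_hat + 1)/(n_hat + 1) for N_hat(s).

    p_before = p_theta(s), p_after = p_theta'(s); the density should increase
    after the update, so p_after > p_before.
    """
    n_hat = (1.0 - p_after) / max(p_after - p_before, eps)   # pseudo dataset size n_hat
    return n_hat * p_before                                   # pseudo-count N_hat(s)

def bonus_from_densities(p_before, p_after, scale=0.05):
    """Exploration bonus B(N_hat) = scale / sqrt(N_hat), decreasing in the pseudo-count."""
    return scale / np.sqrt(pseudo_count(p_before, p_after) + 1e-8)
```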
Idea: compress states into k-bit code via \(\phi(s)\), then count \(N(\phi(s))\)
Shorter codes = more hash collisions;
hopefully, similar states get the same hash...
Improve the odds by learning the compression \(\phi(s)\) (e.g., with an autoencoder):
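A sketch of counting with a random-projection (SimHash-style) code \(\phi(s)\); a learned compression such as an autoencoder would replace the fixed random matrix here:

```python
import numpy as np
from collections import defaultdict

class HashCounter:
    """Count N(phi(s)), where phi(s) maps a continuous state to a k-bit code."""

    def __init__(self, state_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))   # fixed random projection
        self.counts = defaultdict(int)

    def phi(self, s):
        bits = (self.A @ np.asarray(s, dtype=float) > 0).astype(np.uint8)
        return bits.tobytes()                           # hashable k-bit code

    def visit(self, s):
        code = self.phi(s)
        self.counts[code] += 1
        return self.counts[code]                        # N(phi(s)), plug into the bonus above
```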
\(p_{\theta}(s)\) needs to be able to output densities, but doesn't necessarily need to produce great samples
Intuition: a state is novel if it is easy to distinguish from all previously seen states by a classifier
For each observed state \(s\), fit a classifier to classify that state against the past states \(D\), and use the classifier's error to obtain a density.
Let \(D_s(s)\) be the probability the classifier assigns to \(s\) not being in the past states \(D\).
For the optimal classifier, that is where we get: \(p_{\theta}(s) = \frac{1 - D_s(s)}{D_s(s)}\)
Fitting a new classifier for each new state is a bit too much,
so instead we train a single amortized network that also takes the exemplar \(s\) as input:
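A rough sketch of such an amortized exemplar classifier in PyTorch; the architecture and helper names are placeholder choices, not the exact EX2 implementation:

```python
import torch
import torch.nn as nn

class AmortizedExemplarClassifier(nn.Module):
    """D(x; s): probability that x is the exemplar s rather than a sample from past data D."""

    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, exemplar):
        return self.net(torch.cat([x, exemplar], dim=-1)).squeeze(-1)

def implicit_density(model, s):
    """p_theta(s) = (1 - D_s(s)) / D_s(s), evaluated with the amortized classifier."""
    with torch.no_grad():
        d = model(s, s).clamp(1e-6, 1 - 1e-6)   # D_s(s): classify the exemplar against itself
    return (1 - d) / d                           # feed into the count/pseudo-count bonus above
```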