[Figure: image of a robot. Optimal action: the reward was trained on these states; there will be problems here.]
Reminder:
Previously:
Now the environment's model is fully accessible: \(s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)\), or \(s_{t+1} = f(s_t, a_t)\) in deterministic environments.
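A fully accessible deterministic model is simply a transition function we can call offline; a toy sketch (the 1-D grid here is a hypothetical example, not from the slides):

# Hypothetical toy model: a 1-D grid with positions 0..10.
# The state is an integer position; actions move left (-1) or right (+1).
def f(s, a):
    return max(0, min(10, s + a))   # deterministic next state

s_next = f(3, +1)   # the planner may query the model freely, no real interaction needed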
Model-free RL (interaction): [diagram: agent ↔ environment loop].
Model-based RL (open-loop planning): [diagram: the agent computes an optimal plan once and sends it to the environment].
[Figure: the "Plan" and "Reality" trajectories diverge.]
Closed-loop planning (Model Predictive Control, MPC): [diagram: at every step the agent computes an optimal plan and the environment returns the new state].
Apply only the first action!
Discard all other actions!
Replan at the new state!
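A minimal MPC loop sketch under these assumptions; plan_open_loop, env and the horizon are hypothetical placeholders:

# Closed-loop planning (MPC): replan from the current state at every step.
def mpc_step(s, plan_open_loop, horizon=10):
    plan = plan_open_loop(s, horizon)   # optimal open-loop plan a_0, ..., a_{H-1}
    return plan[0]                      # apply only the first action, discard the rest

# Control loop: observe the new state, then replan from it.
# while not done:
#     a = mpc_step(s, plan_open_loop)
#     s, done = env.step(a)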
Continuous actions: optimize over the action sequence directly.
Discrete actions: search over a tree of action sequences.
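For continuous actions one common planner is the cross-entropy method over action sequences; a rough sketch, where trajectory_cost(s, actions) is a hypothetical helper that evaluates a sequence with the known model:

import numpy as np

def cem_plan(s, trajectory_cost, horizon, action_dim, n_iters=5, pop=500, n_elite=50):
    # Fit a Gaussian over action sequences to the lowest-cost candidates.
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        candidates = mu + sigma * np.random.randn(pop, horizon, action_dim)
        costs = np.array([trajectory_cost(s, a) for a in candidates])
        elites = candidates[np.argsort(costs)[:n_elite]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu   # planned action sequence; with MPC, execute only mu[0]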
Deterministic dynamics case
Reminder:
apply \(s' = f(s, a)\) to follow the tree
assume only terminal rewards!
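A sketch of exhaustive search under these assumptions (deterministic \(f\), terminal rewards only); f, actions, is_terminal and reward are hypothetical helpers:

# Exhaustive depth-first search over the tree; feasible only for small problems.
def best_value(s, f, actions, is_terminal, reward):
    if is_terminal(s):
        return reward(s)                              # only terminal rewards
    return max(best_value(f(s, a), f, actions, is_terminal, reward)
               for a in actions(s))                   # follow the tree with s' = f(s, a)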
Stochastic dynamics case
apply \(s' \sim p(s'|s, a)\) to follow the tree
assume only terminal rewards!
Now we need an infinite number of runs through the tree!
If the dynamics noise is small, forget about stochasticity and use the approach for deterministic dynamics.
The actions will be suboptimal.
But who cares...
Monte-Carlo Tree Search (MCTS)
also known as Pure Monte-Carlo Game Search
Simulate with some policy \(\pi_O\) and calculate reward-to-go
We need an infinite number of runs to converge!
Is it necessary to explore all the actions with the same frequency?
Maybe we'd better explore actions with a higher estimate of \(Q(s, a)\)?
At earlier stages, we should also add an exploration bonus for the least explored actions!
UCT
Basic Upper Confidence Bound for Bandits (UCB1): pick the arm maximizing \( \hat{Q}(a) + \sqrt{\frac{2 \ln t}{n(a)}} \), where \(t\) is the total number of plays and \(n(a)\) is the number of pulls of arm \(a\).
Theorem: under some assumptions, this rule achieves regret that grows only logarithmically in \(t\).
Upper Confidence Bound bonus for MCTS (UCT):
We should choose actions that maximize the following value:
\( \frac{\Sigma(f(s, a))}{n(f(s, a))} + c \sqrt{\frac{2 \ln n(s)}{n(f(s, a))}} \)
Simulate with some policy \(\pi_O\) and calculate reward-to-go
For each state we store a tuple:
\( \big(\Sigma, n(s) \big) \)
\(\Sigma\) - cumulative reward
\(n(s)\) - visit counter
Stages: forward, rollout, backpropagate (see the pseudo-code below)
and again, and again....
Python pseudo-code
def mcts(root, n_iter, c):
    # Each iteration: descend the tree, expand one leaf, roll out, backpropagate.
    for _ in range(n_iter):
        leaf = forward(root, c)
        reward_to_go = rollout(leaf)
        backpropagate(leaf, reward_to_go)
    # Final decision: pure exploitation, no exploration bonus.
    return best_action(root, c=0)

def forward(node, c):
    # Selection: descend while every action of the node has already been tried.
    while is_all_actions_visited(node) and not is_terminal(node):
        a = best_action(node, c)
        node = dynamics(node, a)
    if is_terminal(node):
        return node
    # Expansion: add one new child for an untried action
    # (an unvisited action gets the largest exploration bonus).
    a = best_action(node, c)
    child = dynamics(node, a)
    add_child(node, child)
    return child

def rollout(node):
    # Simulation: follow the rollout policy until a terminal state.
    while not is_terminal(node):
        a = rollout_policy(node)
        node = dynamics(node, a)
    return reward(node)   # only terminal rewards are assumed

def backpropagate(node, reward):
    # Update the statistics along the path back to the root (root included).
    node.n_visits += 1
    node.cumulative_reward += reward
    if is_root(node):
        return
    backpropagate(parent(node), reward)
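The helper best_action above is left abstract; a sketch of how it could compute the UCT value from the stored \((\Sigma, n)\) statistics, assuming each node keeps its expanded children in a node.children dict and legal_actions is a hypothetical helper:

import math

def best_action(node, c):
    # UCT: mean reward-to-go plus exploration bonus; unvisited actions get maximal priority.
    def uct(a):
        child = node.children.get(a)
        if child is None or child.n_visits == 0:
            return float("inf")
        q = child.cumulative_reward / child.n_visits
        return q + c * math.sqrt(2 * math.log(node.n_visits) / child.n_visits)
    return max(legal_actions(node), key=uct)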
[Diagram: agent ↔ environment. Before: the environment's dynamics is unknown! Now: the environment's dynamics is known, and the two sides play against each other, one maximizing the return, the other minimizing it.]
There are just a few differences compared to the MCTS v1.0 algorithm:
You may have noticed that MCTS looks something like this: it computes an improved policy \(\pi_O^{MCTS}\) and a value estimate \(V^{\pi_O}\) for the current state.
But then we just throw \(\pi_O^{MCTS}\) and \(V^{\pi_O}\) away and recompute them from scratch at the next state!
We can use two Neural Networks to simplify and improve computations:
During the forward stage:
Now, the output of MCTS\((s)\) is not the best action for \(s\), but rather a distribution:
\( \pi_\theta^{MCTS}(a|s) \propto n(f(s, a))^{1/\tau}\)
Other stages are not affected.
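A sketch of turning the root's visit counts into this distribution (again assuming a node.children dict as in the best_action sketch above):

import numpy as np

def mcts_policy(root, tau=1.0):
    actions = list(root.children)
    counts = np.array([root.children[a].n_visits for a in actions], dtype=float)
    probs = counts ** (1.0 / tau)   # pi^MCTS(a|s) proportional to n(f(s, a))^(1/tau)
    return actions, probs / probs.sum()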
Assume we have a rollout policy \(\pi_\theta\) and a corresponding value function \(V_\phi\).
But how do we improve the parameters \(\theta\)?
And how do we update \(\phi\) for the new policy?
Play several games with yourself:
Keep the tree through the game!
. . .
Store the triples: \( (s_t, \pi_\theta^{MCTS}(s_t), R), \;\; \forall t\)
Once in a while, sample batches \( (s, \pi_\theta^{MCTS}, R) \) from the buffer and minimize:
\( \mathcal{L} = \big(R - V_\phi(s)\big)^2 \;-\; \sum_a \pi_\theta^{MCTS}(a|s)\,\log \pi_\theta(a|s) \)
It is better to share the parameters of the two NNs.
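A minimal sketch of this loss in PyTorch, assuming a shared network that outputs a scalar value and policy logits (names are illustrative):

import torch
import torch.nn.functional as F

def policy_value_loss(value_pred, policy_logits, R, pi_mcts):
    # value_pred: (B,), policy_logits: (B, A), R: (B,), pi_mcts: (B, A)
    value_loss = F.mse_loss(value_pred, R)                                   # (R - V_phi(s))^2
    policy_loss = -(pi_mcts * F.log_softmax(policy_logits, dim=-1)).sum(-1).mean()
    return value_loss + policy_loss                                          # plus weight decay in practice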
\( f(s, a) \) - ????
If we assume the dynamics to be unknown but deterministic, then we can note the following:
we can learn \(f(s, a)\) directly from transitions of the real environment.
The easiest motivation ever!
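A sketch of learning such a deterministic model by regression on stored transitions, in PyTorch; the dimensions and the transition batches are hypothetical placeholders:

import torch
import torch.nn as nn

state_dim, action_dim = 8, 2                    # placeholder dimensions
transition_loader = []                          # fill with batches of tensors (s, a, s_next) from the real environment

# Deterministic dynamics model: s' ~ f_psi(s, a), trained by regression on real transitions.
f_psi = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                      nn.Linear(256, state_dim))
opt = torch.optim.Adam(f_psi.parameters(), lr=1e-3)

for s, a, s_next in transition_loader:          # batches of real transitions (s_t, a_t, s_{t+1})
    pred = f_psi(torch.cat([s, a], dim=-1))     # predicted next state f_psi(s, a)
    loss = ((pred - s_next) ** 2).mean()        # regression onto the observed next state
    opt.zero_grad(); loss.backward(); opt.step()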
Previously, in AlphaZero we had the true dynamics: \(s_{root}, a_0, \dots, a_L\) → dynamics → \(s_L\) → get value and policy: \(V_\theta (s_L), \pi_\theta (s_L) = f_\theta(s_L)\).
Now the dynamics is not available!
Thus, we will throw all the available variables into a larger NN: \(s_{root}, a_0, \dots, a_L\) → get value and policy of future states: \(V_\theta (s_L), \pi_\theta (s_L) = f_\theta(s_L)\).
[Diagram: an encoder maps the observation \(s_t\) to a latent state \(z_t\); the learned dynamics \(g_\theta(z_t, a_t)\) outputs an estimate of \(r(s, a)\) and the next latent state; the prediction network \(f_\theta(z_{t+1})\) outputs value and policy; then \(g_\theta(z_{t+1}, a_{t+1})\), \(f_\theta(z_{t+2})\), and so on.]
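A sketch of how these three networks could be unrolled along a chosen action sequence inside the search (h_theta, g_theta, f_theta are assumed callables; this illustrates the data flow, not the exact published implementation):

def unroll(h_theta, g_theta, f_theta, s_root, actions):
    z = h_theta(s_root)                 # encoder: real observation -> latent state z_t
    outputs = []
    for a in actions:                   # a_0, ..., a_L chosen by the tree search
        r, z = g_theta(z, a)            # learned dynamics: reward estimate and next latent state
        v, pi = f_theta(z)              # prediction head: value and policy of the future state
        outputs.append((r, v, pi))
    return outputs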
Play several games:
. . .
Store whole games: \( (..., s_t, a_t, \pi_t^{MCTS}, r_t, u_t, s_{t+1}, ...)\)
Randomly pick a state \(s_i\) from the buffer together with the following subsequence of length \(K\)
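A sketch of the K-step unrolled training objective on such a subsequence; value_loss, reward_loss and policy_loss are hypothetical per-step loss helpers matching the stored targets:

def unrolled_loss(h_theta, g_theta, f_theta, s_i, actions, target_v, target_r, target_pi, K):
    # Encode the real state once, then unroll the learned dynamics for K steps.
    z = h_theta(s_i)
    loss = 0.0
    for k in range(K):
        v, pi = f_theta(z)                        # predictions at step k
        r, z = g_theta(z, actions[k])             # apply the stored action a_{i+k}
        loss = loss + value_loss(v, target_v[k]) \
                    + reward_loss(r, target_r[k]) \
                    + policy_loss(pi, target_pi[k])
    return loss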
WOW
It works!