Model-Based RL:
(MCTS, AlphaZero, MuZero)
lecturer: Pavel Temirchev
Curricula
- What if the environment's model is given?
- Open-loop and closed-loop planning
- Tree Search planning, MCTS
- Policy Iteration guided by MCTS
- The case of unknown dynamics
What if the environment's model is given?
Reminder:
Previously:
- observed only samples from the environment
- not able to start from an arbitrary state
- were required to interact with the environment
Now the environment's model is fully accessible:
- can plan in our mind without interaction
- do not really need a policy! (in deterministic environments)
- we assume rewards are known too!
\(s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)\) or \(s_{t+1} = f(s_t, a_t)\)
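To make the notation concrete, here is a toy deterministic model of a tiny 1-D grid world (the environment itself is made up for illustration, it is not from the lecture):

def f(s, a):
    """Toy deterministic dynamics s_{t+1} = f(s_t, a_t): move left (-1) or right (+1) on a 5-cell line."""
    n_cells = 5
    return min(max(s + a, 0), n_cells - 1)

def r(s, a):
    """Known reward function: +1 for reaching the rightmost cell, 0 otherwise."""
    return 1.0 if f(s, a) == 4 else 0.0

# With f and r in hand we can "plan in our mind": evaluate whole action
# sequences without ever interacting with the real environment.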
Curricula
- What if the environment's model is given?
- Open-loop and closed-loop planning
- Tree Search planning, MCTS
- Policy Iteration guided by MCTS
- The case of unknown dynamics
Interaction vs. Planning
In deterministic environments
Model-free RL (interaction):
[diagram: the agent and the environment exchange actions and observations in a loop]
Interaction vs. Planning
In deterministic environments
Model-based RL (open-loop planning):
[diagram: the agent plans an optimal sequence of actions in its head and then executes it in the environment without further feedback]
Planning in stochastic environments
[diagram: the planned trajectory diverges from what actually happens in a stochastic environment]
Closed-loop planning (Model Predictive Control - MPC):
[diagram: the agent replans from every new state and sends only the first action of the current optimal plan to the environment]
Planning in stochastic environments
Apply only first action!
Discard all other actions!
REPLAN AT NEW STATE!!
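A minimal sketch of the MPC loop; `plan`, `env_step` and `is_terminal` are hypothetical helpers, not part of the lecture:

def mpc_control(s0, horizon, plan, env_step, is_terminal):
    # Closed-loop planning: replan from every new state, apply only the first action.
    s = s0
    total_reward = 0.0
    while not is_terminal(s):
        actions = plan(s, horizon)    # open-loop plan from the current state
        a = actions[0]                # apply only the first action
        s, reward = env_step(s, a)    # the real environment may disagree with the plan,
        total_reward += reward        # so the rest of the plan is discarded and we replan
    return total_reward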
How to plan?
Continuous actions (NEXT TIME):
- Linear Quadratic Regulator (LQR)
- iterative LQR (iLQR)
- Differential Dynamic Programming (DDP)
- ...
Discrete actions:
- Monte-Carlo Tree Search
- ...
What would be your suggestions?
Curricula
- What if the environment's model is given?
- Open-loop and closed-loop planning
- Tree Search planning, MCTS
- Policy Iteration guided by MCTS
- The case of unknown dynamics
Tree Search
Deterministic dynamics case
Reminder:
apply \(s' = f(s, a)\) to follow the tree
assume only terminal rewards!
Tree Search
Deterministic dynamics case
- Full search is exponentially hard!
- We are not required to track states: the sequence of actions contains all the required information
Tree Search
Stochastic dynamics case
apply \(s' \sim p(s'|s, a)\) to follow the tree
assume only terminal rewards!
Now we need an infinite number of runs through the tree!
Tree Search
Stochastic dynamics case
- The problem is even harder!
- We will need to track states since an action can lead us to a random outcome
- The model should be able to give us samples. Actual probabilities will not be used.
If the dynamics noise is small, forget about stochasticity and use the approach for deterministic dynamics.
The actions will be suboptimal.
But who cares...
Monte-Carlo Tree Search: v0.5
also known as Pure Monte-Carlo Game Search
Simulate with some policy \(\pi_O\) and calculate reward-to-go
We need an infinite number of runs to converge!
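A minimal sketch of this v0.5 scheme for a deterministic game with terminal rewards only; all helper names (`f`, `legal_actions`, `is_terminal`, `terminal_reward`, `pi_0`) are hypothetical:

def rollout_value(s, f, legal_actions, is_terminal, terminal_reward, pi_0):
    # Follow the rollout policy pi_0 until the game ends; only the terminal reward counts.
    while not is_terminal(s):
        s = f(s, pi_0(s))
    return terminal_reward(s)

def pure_mc_search(s, n_rollouts, f, legal_actions, is_terminal, terminal_reward, pi_0):
    # Estimate Q(s, a) by averaging rollout returns, trying every action equally often.
    best_a, best_q = None, float("-inf")
    for a in legal_actions(s):
        returns = [rollout_value(f(s, a), f, legal_actions, is_terminal,
                                 terminal_reward, pi_0)
                   for _ in range(n_rollouts)]
        q = sum(returns) / n_rollouts
        if q > best_q:
            best_a, best_q = a, q
    return best_a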
Monte-Carlo Tree Search: v0.5
also known as Pure Monte-Carlo Game Search
- Not as hard as full search, but still!
- Will give a plan that is better than following \(\pi_O\), but still suboptimal!
- The better the plan, the harder the problem!
Is it necessary to explore all the actions with the same frequency?
Maybe we'd better explore actions with a higher estimate of \(Q(s, a)\) ?
At earlier stages, we should also have some exploration bonus for the least explored actions!
Upper Confidence Bound for Trees
UCT
Basic Upper Confidence Bound for Bandits:
\( a_t = \arg\max_a \Big[ \hat{Q}(a) + c\sqrt{\tfrac{\log t}{n(a)}} \Big] \)
Th.: Given some assumptions, the expected regret of this rule grows only logarithmically in the number of pulls: \(\mathbb{E}[\text{Regret}_t] = O(\log t)\)
Upper Confidence Bound bonus for MCTS:
We should choose actions that maximize the following value:
\( W(s') = V(s') + c\sqrt{\frac{\log n^{parent}(s')}{n(s')}}, \quad s' = f(s, a) \)
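For comparison, a minimal sketch of the bandit rule (the exact constant in the bonus depends on the assumptions of the theorem):

import math

def ucb1_action(q_estimates, counts, t, c=1.0):
    # q_estimates[a] is the empirical mean reward of arm a,
    # counts[a] the number of times it was pulled, t the total number of pulls.
    best_a, best_score = None, float("-inf")
    for a, q in q_estimates.items():
        if counts[a] == 0:
            return a                    # every arm is tried at least once
        score = q + c * math.sqrt(math.log(t) / counts[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a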
Monte-Carlo Tree Search: v1.0
Simulate with some policy \(\pi_O\) and calculate reward-to-go
For each state we store a tuple:
\( \big(\Sigma(s), n(s) \big) \)
\(\Sigma(s)\) - cumulative reward
\(n(s)\) - visit counter
Stages:
- Forward
- Expand
- Backward
and again, and again....
Monte-Carlo Tree Search
Python pseudo-code
def mcts(root, n_iter, c):
    # Run n_iter simulations, then pick the greedy action (no exploration bonus).
    for n in range(n_iter):
        leaf = forward(root, c)
        reward_to_go = rollout(leaf)
        backpropagate(leaf, reward_to_go)
    return best_action(root, c=0)

def forward(node, c):
    # Descend the tree while every action of the current node is already expanded.
    while is_all_actions_visited(node):
        a = best_action(node, c)
        node = dynamics(node, a)
        if is_terminal(node):
            return node
    # Expand: add one not-yet-visited child and return it as the new leaf.
    a = best_action(node, c)
    child = dynamics(node, a)
    add_child(node, child)
    return child

def rollout(node):
    # Simulate with the rollout policy until termination;
    # only the terminal reward is used (terminal-reward assumption).
    while not is_terminal(node):
        a = rollout_policy(node)
        node = dynamics(node, a)
    return reward(node)

def backpropagate(node, reward):
    # Update the statistics along the path back to the root (root included),
    # so the parent visit counts used by the UCT bonus stay correct.
    node.n_visits += 1
    node.cumulative_reward += reward
    if is_root(node):
        return None
    return backpropagate(parent(node), reward)
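A possible sketch of the `best_action` helper used above; it assumes `node.children` maps already-expanded actions to child nodes and that a `legal_actions` helper exists (both are assumptions of this sketch, not part of the pseudo-code):

import math

def best_action(node, c):
    # Pick the action maximizing W(s') = V(s') + c * sqrt(log n_parent / n(s')),
    # with an infinite bonus for actions whose child has not been visited yet.
    def score(a):
        child = node.children.get(a)
        if child is None or child.n_visits == 0:
            return float("inf")
        v = child.cumulative_reward / child.n_visits
        return v + c * math.sqrt(math.log(node.n_visits) / child.n_visits)
    return max(legal_actions(node), key=score)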
Minimax MCTS
[diagram: single-agent RL, where the agent interacts with an environment whose dynamics is unknown]
[diagram: two-player game, where the 'environment' is the opposing agent, the dynamics is known, and one player maximizes the return while the other minimizes it]
Minimax MCTS
There are just a few differences compared to the MCTS v1.0 algorithm:
- Now you should track whose move it is:
\(o = +1\) if it is player 1's move, else \(o = -1\)
- During forward pass, best actions now should maximize:
\(W(s) = oV(s) + c\sqrt{\frac{\log n^{parent}(s)}{n(s)}}\)
- The best action, computed by MCTS is now:
\(a^* = \arg\max_a oQ(s, a) \)
- Other stages are not changed at all!
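As a sketch, the only code change is the sign of the value term in the selection score (o = +1 for the maximizing player, -1 for the minimizing one; the child statistics are the same as before):

import math

def minimax_score(child, parent_visits, o, c):
    # Flip the sign of the value term depending on whose move it is;
    # the exploration bonus is unchanged.
    if child.n_visits == 0:
        return float("inf")
    v = child.cumulative_reward / child.n_visits
    return o * v + c * math.sqrt(math.log(parent_visits) / child.n_visits)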
Curricula
- What if the environment's model is given?
- Open-loop and closed-loop planning
- Tree Search planning, MCTS
- Policy Iteration guided by MCTS
- The case of unknown dynamics
Policy Iteration guided by MCTS
You may have noted that MCTS looks something like this:
- Estimate value \(V^{\pi_O}\) for the rollout policy \(\pi_O\) using Monte-Carlo samples
- Compute its improvement as \(\pi_O^{MCTS}(s) \leftarrow MCTS(s, \pi_O)\)
But then we just throw \(\pi_O^{MCTS}\) and \(V^{\pi_O}\) away and recompute them again!
We can use two Neural Networks to simplify and improve computations:
- \(V_\phi\) that will capture state-values and will be used instead of rollout estimates
- \(\pi_\theta\) that will learn from MCTS improvements
MCTS algorithm from AlphaZero
During the forward stage:
- At a leaf state \(s_L\), we are not required to make a rollout: we already have its value:
\(V(s_L) = V_\phi(s_L)\)
or we can still do a rollout: \(V(s_L) = \lambda V_\phi(s_L) + (1-\lambda)\hat{V}(s_L)\)
- The exploration bonus is now changed - the policy guides exploration:
\(W(s') = V(s') + c\frac{\pi_\theta(a|s)}{1+ n(s')}\)
There are no infinite bonuses
Now, the output of MCTS\((s)\) is not the best action for \(s\), but rather a distribution:
\( \pi_\theta^{MCTS}(a|s) \propto n(f(s, a))^{1/\tau}\)
Other stages are not affected.
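A sketch of the changed selection rule, assuming each child node stores the prior \(\pi_\theta(a|s)\) of the action that created it (this storage layout is an assumption of the sketch):

def alphazero_score(child, c):
    # AlphaZero-style selection: the policy prior replaces the log-count bonus,
    # so unvisited children get a large but finite score.
    v = 0.0 if child.n_visits == 0 else child.cumulative_reward / child.n_visits
    return v + c * child.prior / (1 + child.n_visits)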
Assume we have a rollout policy \(\pi_\theta\) and a corresponding value function \(V_\phi\)
But how to improve parameters \(\theta\)?
And update \(\phi\) for the new policy?
Policy Iteration through a self-play
Play several games with yourself:
Keep the tree through the game!
Store the triples: \( (s_t, \pi_\theta^{MCTS}(s_t), R), \;\; \forall t\)
Once in a while, sample batches \( (s, \pi_\theta^{MCTS}, R) \) from the buffer and minimize:
\( \mathcal{L}(\theta, \phi) = \big(R - V_\phi(s)\big)^2 - \sum_a \pi_\theta^{MCTS}(a|s)\,\log \pi_\theta(a|s) + \text{regularization} \)
It is better to share the parameters of the two NNs.
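A minimal sketch of this loss in PyTorch; the shared network `net` and the tensor shapes are assumptions of the sketch:

import torch
import torch.nn.functional as F

def alphazero_loss(net, states, mcts_policies, returns):
    # net is a shared network returning (policy_logits, value);
    # states: [B, ...], mcts_policies: [B, n_actions], returns: [B].
    policy_logits, values = net(states)
    value_loss = F.mse_loss(values.squeeze(-1), returns)          # (R - V_phi(s))^2
    policy_loss = -(mcts_policies * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
    return value_loss + policy_loss   # regularization is usually added via weight decay in the optimizer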
AlphaZero results
Curricula
- What if the environment's model is given?
- Open-loop and closed-loop planning
- Tree Search planning, MCTS
- Policy Iteration guided by MCTS
- The case of unknown dynamics
What if the dynamics is unknown?
\( f(s, a) \) - ????
If we assume the dynamics to be unknown but deterministic,
then we can note the following:
- states are fully determined by the applied actions
- during the MCTS search the states themselves are not required, as long as \( V^{\pi_O}(s), \pi_O(s), r(s) \) are known
(rewards appear here for environments more general than board games)
- we only need to predict the future value, policy and rewards for a new state, given previous states and applied actions!
We can learn these predictions directly from transitions of the real environment
BUT NEXT TIME
What if the dynamics is unknown?
The easiest motivation ever!
Previously, in AlphaZero we had:
\( s_{root}, a_0, \dots, a_L \;\xrightarrow{\text{dynamics}}\; s_L \;\longrightarrow\; V_\theta(s_L), \pi_\theta(s_L) = f_\theta(s_L) \)
Now the dynamics is not available!
Thus, we will throw all the available variables into a larger NN that predicts the value and policy of future states directly:
\( s_{root}, a_0, \dots, a_L \;\longrightarrow\; V_\theta(s_L), \pi_\theta(s_L) \), without ever computing \(s_L\)
Architecture of the Neural Network
[diagram: an encoder maps the observed state \(s_t\) to a latent state \(z_t\); the dynamics network \(g_\theta(z_t, a_t)\) outputs the next latent state \(z_{t+1}\) together with an estimate of \(r(s_t, a_t)\); the prediction network \(f_\theta(z_{t+1})\) outputs the value and the policy; the chain \(g_\theta(z_{t+1}, a_{t+1})\), \(f_\theta(z_{t+2})\), ... is unrolled further in the same way]
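A minimal sketch of how the three networks from the diagram are unrolled; the interfaces of `h_theta`, `g_theta` and `f_theta` are assumptions of the sketch:

def unroll(h_theta, g_theta, f_theta, s_root, actions):
    # h_theta: encoder s_t -> z_t
    # g_theta: latent dynamics (z_k, a_k) -> (z_{k+1}, predicted reward)
    # f_theta: prediction z_k -> (policy_k, value_k)
    z = h_theta(s_root)
    predictions = [f_theta(z)]              # policy and value at the root
    for a in actions:
        z, r_hat = g_theta(z, a)            # step entirely in latent space
        predictions.append((f_theta(z), r_hat))
    return predictions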
MuZero: article
Play several games:
Store whole games: \( (..., s_t, a_t, \pi_t^{MCTS}, r_t, u_t, s_{t+1}, ...)\)
Randomly pick a state \(s_i\) from the buffer together with a subsequence of length \(K\)
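A sketch of how one training example could be assembled from a stored game; the index conventions are simplified compared to the article:

import random

def sample_subsequence(game, K):
    # A game is a list of tuples (s_t, a_t, pi_t_mcts, u_t) with observed rewards u_t;
    # value targets (e.g. n-step returns) are computed elsewhere.
    assert len(game) > K
    i = random.randrange(len(game) - K)
    s_i = game[i][0]                                   # the only state the model sees
    actions = [game[i + k][1] for k in range(K)]       # actions used to unroll g_theta
    policy_targets = [game[i + k][2] for k in range(K)]
    reward_targets = [game[i + k][3] for k in range(K)]
    return s_i, actions, policy_targets, reward_targets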
MuZero: results
WOW
It works!
MB-RL: MCTS, AlphaZero, MuZero
By cydoroga