


[Figure: an image of a robot. Annotations: "Optimal action", "the reward was trained here", "this is where problems will arise".]


The case of known environment dynamics
Planning with and without feedback
What if the environment's model is given?
Reminder:

Previously:
- observed only samples from the environment
- not able to start from an arbitrary state
Now the environment's model is fully accessible:
- can plan in our mind without interaction
- we assume rewards are known too!
$s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$ or $s_{t+1} = f(s_t, a_t)$
Interaction vs. Planning
In deterministic environments
Model-free RL (interaction):




[Diagram: the agent sends actions to the environment and receives states and rewards in return.]
Interaction vs. Planning
In deterministic environments
Model-based RL (open-loop planning):




[Diagram: the agent computes an optimal plan (a whole sequence of actions) in its head and then executes it in the environment without further feedback.]

Planning in stochastic environments
[Diagram: the planned trajectory vs. reality: in a stochastic environment the realized states drift away from the open-loop plan.]
Closed-loop planning (Model Predictive Control - MPC):




[Diagram: at every step the agent builds an optimal plan from the current state, sends an action to the environment, observes the new state, and replans.]

Planning in stochastic environments
Apply only the first action!
Discard all the other actions!
REPLAN AT THE NEW STATE!!
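A minimal sketch of this closed-loop (MPC) procedure is given below; the planner plan_open_loop, the environment interface env.step, and the horizon are hypothetical placeholders, not part of any specific library.

# Minimal closed-loop planning (MPC) sketch.
# Assumptions: plan_open_loop(model, state, horizon) returns a sequence of actions
# optimizing the model-based return, and env.step(a) returns (next_state, reward, done).

def mpc_episode(env, model, plan_open_loop, horizon=10):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        plan = plan_open_loop(model, state, horizon)  # open-loop plan from the current state
        action = plan[0]                              # apply only the first action
        state, reward, done = env.step(action)        # the rest of the plan is discarded
        total_reward += reward                        # replan at the new state on the next iteration
    return total_reward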
How to plan?
Continuous actions:
- Linear Quadratic Regulator (LQR)
- iterative LQR (iLQR)
- Differential Dynamic Programming (DDP)
- ....
Discrete actions:
- Monte-Carlo Tree Search
- ....
- ....



Planning as tree search
Tree Search
Deterministic dynamics case
Reminder:
apply $s' = f(s, a)$ to follow the tree
assume only terminal rewards!
Tree Search
Deterministic dynamics case
- Full search is exponentially hard!
- We are not required to track states: the sequence of actions contains all the required information
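As an illustration, exhaustive search over action sequences in the deterministic case can be sketched as below; f, actions, is_terminal, and terminal_reward are assumed helpers, not part of the slides.

# Exhaustive (depth-first) search over action sequences, deterministic dynamics.
# Hypothetical helpers: f(s, a) -> s', actions(s) -> iterable of actions,
# is_terminal(s) -> bool, terminal_reward(s) -> float.

def best_plan(s, f, actions, is_terminal, terminal_reward):
    if is_terminal(s):
        return [], terminal_reward(s)
    best_actions, best_value = None, float("-inf")
    for a in actions(s):  # branching_factor ** depth leaves: exponentially hard
        plan, value = best_plan(f(s, a), f, actions, is_terminal, terminal_reward)
        if value > best_value:
            best_actions, best_value = [a] + plan, value
    return best_actions, best_value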
Tree Search
Stochastic dynamics case
apply $s' \sim p(s' \mid s, a)$ to follow the tree
assume only terminal rewards!
Now we need an infinite amount of runs through the tree!
Tree Search
Stochastic dynamics case
- The problem is even harder!
If the dynamics noise is small, forget about stochasticity and use the approach for deterministic dynamics.
The actions will be suboptimal.
But who cares...
Monte-Carlo tree search
Monte-Carlo Tree Search (MCTS)
Monte-Carlo Tree Search: v0.5
also known as Pure Monte-Carlo Game Search




Simulate with some policy $\pi_0$ and calculate the reward-to-go
We need an infinite amount of runs to converge!
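A sketch of this pure Monte-Carlo game search idea, under the same assumptions as before (helpers f, actions, is_terminal, terminal_reward, and a fixed rollout policy pi_0; all names are hypothetical):

# Pure Monte-Carlo game search (MCTS v0.5) sketch.
# For every first action, run n_rollouts simulations with a fixed rollout policy pi_0
# and pick the action with the best average reward-to-go.

def pure_mc_search(s, f, actions, is_terminal, terminal_reward, pi_0, n_rollouts=100):
    def rollout(state):
        while not is_terminal(state):
            state = f(state, pi_0(state))
        return terminal_reward(state)

    def value(a):
        return sum(rollout(f(s, a)) for _ in range(n_rollouts)) / n_rollouts

    return max(actions(s), key=value)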
Monte-Carlo Tree Search: v0.5
also known as Pure Monte-Carlo Game Search
- Not as hard as full search, but still!
- Will give a plan that is better than following $\pi_0$, but still suboptimal!
- The better the plan, the harder the problem!
Is it necessary to explore all actions with the same frequency?
Maybe we'd better explore actions with a higher estimate of $Q(s,a)$?
At earlier stages, we should also have some exploration bonus for the least explored actions!
Upper Confidence Bound for Trees
UCT
Basic Upper Confidence Bound for bandits:
Th. (UCB1): under some assumptions, choosing $a_t = \arg\max_a \left[ \hat{Q}(a) + \sqrt{\tfrac{2 \log t}{n(a)}} \right]$ gives regret that grows only logarithmically with $t$.
Upper Confidence Bound bonus for MCTS (UCT):
We should choose actions that maximize the following value:
$W(s,a) = Q(s,a) + c\sqrt{\tfrac{\log n(s)}{n(a)}}$
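For the bandit case, a minimal UCB1 arm-selection sketch (per-arm reward sums and counts are kept in plain lists; purely illustrative):

import math

# UCB1 for a multi-armed bandit: play the arm maximizing
# empirical mean + sqrt(2 * log t / n(a)).

def ucb1_arm(reward_sums, counts, t):
    def ucb(a):
        if counts[a] == 0:
            return float("inf")              # try every arm at least once
        mean = reward_sums[a] / counts[a]    # empirical mean reward of arm a
        return mean + math.sqrt(2 * math.log(t) / counts[a])
    return max(range(len(counts)), key=ucb)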
Monte-Carlo Tree Search: v1.0

Simulate with some policy $\pi_0$ and calculate the reward-to-go

For each state we store a tuple:
$(\Sigma, n(s))$
$\Sigma$ - cumulative reward
$n(s)$ - counter of visits
Stages:
- Forward
- Expand
- Rollout
- Backward


and again, and again....
Monte-Carlo Tree Search
Python pseudo-code
def mcts(root, n_iter, c):
    # Run n_iter simulations from the root, then return the greedy action (c=0: no exploration bonus).
    for n in range(n_iter):
        leaf = forward(root, c)
        reward_to_go = rollout(leaf)
        backpropagate(leaf, reward_to_go)
    return best_action(root, c=0)

def forward(node, c):
    # Descend the tree while all actions of the current node have been tried,
    # then expand one new child (or stop at a terminal node).
    while is_all_actions_visited(node):
        a = best_action(node, c)
        node = dynamics(node, a)
        if is_terminal(node):
            return node
    a = best_action(node, c)
    child = dynamics(node, a)
    add_child(node, child)
    return child

def rollout(node):
    # Simulate to the end of the episode with the rollout policy and return the terminal reward.
    while not is_terminal(node):
        a = rollout_policy(node)
        node = dynamics(node, a)
    return reward(node)

def backpropagate(node, reward):
    # Update (cumulative reward, visit count) for every node on the path, including the root.
    node.n_visits += 1
    node.cumulative_reward += reward
    if is_root(node):
        return None
    return backpropagate(parent(node), reward)
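The pseudo-code above relies on a best_action helper. A possible UCT implementation is sketched below, under assumptions about the node structure (a children dict plus the n_visits and cumulative_reward fields) and a hypothetical legal_actions helper:

import math

# UCT selection sketch: maximize Q(s, a) + c * sqrt(log n(s) / n(a));
# actions without a child node yet get an infinite score, so they are expanded first.

def best_action(node, c):
    def score(a):
        child = node.children.get(a)
        if child is None or child.n_visits == 0:
            return float("inf")                        # untried action: explore it first
        q = child.cumulative_reward / child.n_visits   # Monte-Carlo estimate of Q(s, a)
        if c == 0:
            return q                                   # greedy choice at the root after the search
        return q + c * math.sqrt(math.log(node.n_visits) / child.n_visits)
    return max(legal_actions(node), key=score)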
Modeling the opponent in board games
Minimax MCTS


[Diagram: in the usual RL setting the agent interacts with an environment whose dynamics is unknown; in a board game the "environment" is the opposing player, the dynamics is fully known, and the two players are maximizing and minimizing the return, respectively.]
Minimax MCTS
There are just a few differences compared to the MCTS v1.0 algorithm:
- Now you should track whose move it is:
$o = +1$ if it is player 1's move, else $o = -1$
- During the forward pass, the best actions should now maximize (see the sketch after this list):
$W(s,a) = o \cdot V(s') + c\sqrt{\tfrac{\log n(s)}{n(a)}}$
- The best action computed by MCTS is now:
$a^* = \arg\max_a \, o \cdot Q(s,a)$
- Other stages are not changed at all!
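A sketch of the signed selection rule, reusing the hypothetical node structure and legal_actions helper from the earlier pseudo-code; o is +1 or -1 depending on whose move it is at the node:

import math

# Minimax MCTS selection sketch: the moving player at `node` maximizes o * V(s')
# plus the usual exploration bonus; o = +1 for player 1, o = -1 for player 2.

def best_action_minimax(node, c, o):
    def score(a):
        child = node.children.get(a)
        if child is None or child.n_visits == 0:
            return float("inf")
        v = child.cumulative_reward / child.n_visits   # value estimate V(s') from player 1's view
        return o * v + c * math.sqrt(math.log(node.n_visits) / child.n_visits)
    return max(legal_actions(node), key=score)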
Improving the policy with Monte-Carlo tree search
Policy Iteration guided by MCTS
You may have noticed that MCTS looks something like this:
- Estimate the value $V^{\pi_0}$ of the rollout policy $\pi_0$ using Monte-Carlo samples
- Compute its improvement as $\pi_0^{MCTS}(s) \leftarrow MCTS(s, \pi_0)$
But then we just throw $\pi_0^{MCTS}$ and $V^{\pi_0}$ away and recompute them again!
We can use two neural networks to simplify and improve the computations:
- $V_\phi$, which will capture state values and will be used instead of rollout estimates
- $\pi_\theta$, which will learn from the MCTS improvements
MCTS algorithm from AlphaZero
During the forward stage:
- At a leaf state $s_L$ we are not required to make a rollout - we already have its value:
$V(s_L) = V_\phi(s_L)$,
or we can still do a rollout: $V(s_L) = \lambda V_\phi(s_L) + (1 - \lambda)\hat{V}(s_L)$
- The exploration bonus is changed - the policy now guides exploration (see the sketch below):
$W(s') = V(s') + c\,\frac{\pi_\theta(a \mid s)}{1 + n(s')}$,
so there are no infinite bonuses anymore.
Now, the output of MCTS(s) is not the best action for $s$, but rather a distribution:
$\pi_\theta^{MCTS}(a \mid s) \propto n(f(s,a))^{1/\tau}$
Other stages are not affected.
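A sketch of these two modifications, using the simplified bonus from the slide (not the exact PUCT formula from the AlphaZero paper); node fields and the legal_actions helper are the same hypothetical ones as before:

# AlphaZero-style MCTS modifications (sketch, following the simplified slide formulas).

def best_action_alphazero(node, c, policy_probs):
    # Select the child maximizing W(s') = V(s') + c * pi_theta(a|s) / (1 + n(s')).
    def score(a):
        child = node.children.get(a)
        if child is None or child.n_visits == 0:
            v, n = 0.0, 0
        else:
            v, n = child.cumulative_reward / child.n_visits, child.n_visits
        return v + c * policy_probs[a] / (1 + n)       # no infinite bonuses needed
    return max(legal_actions(node), key=score)

def mcts_policy(node, tau):
    # Output of MCTS(s): a distribution proportional to visit counts, pi(a|s) ∝ n(f(s,a))^(1/tau).
    weights = {a: child.n_visits ** (1.0 / tau) for a, child in node.children.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}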
Assume we have a rollout policy $\pi_\theta$ and a corresponding value function $V_\phi$.
But how do we improve the parameters $\theta$?
And how do we update $\phi$ for the new policy?
Policy Iteration through self-play
Play several games with yourself:

Keep the tree through the game!

. . .

Store the triples $(s_t, \pi_\theta^{MCTS}(s_t), R)$ for all $t$, where $R$ is the final outcome of the game
Once in a while, sample batches $(s, \pi^{MCTS}, R)$ from the buffer and minimize:
$\mathcal{L}(\theta, \phi) = (R - V_\phi(s))^2 - \sum_a \pi^{MCTS}(a \mid s)\,\log \pi_\theta(a \mid s)$ (plus regularization)
It is better to share the parameters of the two NNs
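A possible PyTorch-style sketch of this loss; the shared network net (returning policy logits and a value) and the batch tensor names are assumptions, not the actual AlphaZero code:

import torch
import torch.nn.functional as F

# AlphaZero-style training step sketch.
# states: batch of states, pi_mcts: batch of MCTS visit-count distributions,
# returns: batch of final game outcomes R.

def alphazero_loss(net, states, pi_mcts, returns):
    logits, values = net(states)
    value_loss = F.mse_loss(values.squeeze(-1), returns)                        # (R - V_phi(s))^2
    policy_loss = -(pi_mcts * F.log_softmax(logits, dim=-1)).sum(-1).mean()     # cross-entropy to pi_MCTS
    return value_loss + policy_loss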
AlphaZero results
[Result figures omitted.]
The case of unknown dynamics
What if the dynamics is unknown?
f(s,a) - ????
If we assume the dynamics to be unknown but deterministic, then we can note the following:
- states are fully controlled by the applied actions
- during the MCTS search the states themselves are not required, as long as $V^{\pi_0}(s)$, $\pi_0(s)$ and $r(s)$ are known
(rewards appear here because we target environments more general than board games)
- and these quantities we can learn directly from transitions of the real environment
What if the dynamics is unknown?
The easiest motivation ever!
Previously, in AlphaZero we had:
$s_{root}, a_0, \ldots, a_L \;\xrightarrow{\text{dynamics}}\; s_L \;\xrightarrow{\text{value and policy}}\; V_\theta(s_L), \pi_\theta(s_L) = f_\theta(s_L)$
Now the dynamics is not available!
Thus, we will throw all the available variables into a larger NN that outputs the values and policies of future states directly:
$s_{root}, a_0, \ldots, a_L \;\longrightarrow\; V_\theta(s_L), \pi_\theta(s_L)$
Architecture of the Neural Network
[Diagram: an encoder maps the observed state to a latent state $z_t$; the dynamics network $g_\theta(z_t, a_t)$ produces the next latent state $z_{t+1}$ together with an estimate of $r(s,a)$; the prediction network $f_\theta(z_{t+1})$ outputs the value and policy; the same $g_\theta(z_{t+1}, a_{t+1})$, $f_\theta(z_{t+2})$, ... are applied recurrently for deeper lookahead.]
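A minimal sketch of how these three components could be unrolled; the module names encoder, g, f and their exact outputs follow the diagram above, everything else is an assumption:

# MuZero-style latent unroll (sketch). Assumed components:
#   encoder(state) -> latent z
#   g(z, a)        -> (reward_estimate, next_latent)   # learned dynamics
#   f(z)           -> (value, policy)                  # prediction head

def unroll(encoder, g, f, state, actions):
    z = encoder(state)                     # encode the real observed state once
    outputs = []
    for a in actions:                      # then plan purely in latent space
        r_hat, z = g(z, a)                 # predicted reward and next latent state
        value, policy = f(z)               # value and policy of the imagined future state
        outputs.append((r_hat, value, policy))
    return outputs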
MCTS in MuZero

Simulate with some policy $\pi_0$ and calculate the reward-to-go

For each state we store a tuple:
$(\Sigma, n(s))$
$\Sigma$ - cumulative reward
$n(s)$ - counter of visits
Stages:
- Forward
- Expand
- Rollout
- Backward


and again, and again....

MuZero: article
Play several games:


. . .

Store whole games: $(\ldots, s_t, a_t, \pi_t^{MCTS}, r_t, u_t, s_{t+1}, \ldots)$
Randomly pick a state $s_i$ from the buffer together with a subsequence of length $K$
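A sketch of how such a training sample could be drawn from the buffer of stored games; the per-step record layout follows the tuple above, everything else is an assumption:

import random

# Sample a training example for MuZero-style unrolling (sketch).
# `games` is a list of episodes, each a list of per-step records
# (s_t, a_t, pi_mcts_t, r_t, u_t, s_{t+1}) as stored above.

def sample_subsequence(games, K):
    game = random.choice(games)
    i = random.randrange(len(game))               # random start state s_i
    steps = game[i : i + K]                       # subsequence of (at most) length K
    return game[i][0], [step[1] for step in steps], steps   # root state, actions, unroll targets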
MuZero: results
[Result figures omitted.]
WOW, it works!
OpenEdu MB-RL: MCTS, AlphaZero, MuZero
By cydoroga