Introduction to Monte Carlo Tree Search
Dimitrije Marković
DySCO meeting 14.07.2021
Why am I talking about MCTS?
Markov Decision Process
a discrete-time stochastic control process
- state space \( S\)
- action space \( A \)
- state transition probabilities \( Pr(s_{t+1}| s_t, a_t) \)
- value function \( V_a(s, s') \) of transitioning from state \( s\) to state \( s'\) due to action \( a \)
Optimal behaviour maximises accumulated value (reward).
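A minimal sketch of such a process in Python (the states, actions, and numbers are purely illustrative, not from any specific library):

```python
import random

# Toy MDP: transition probabilities Pr(s'|s, a) and values V_a(s, s')
# stored as plain dictionaries (illustrative example only).
transitions = {                      # (state, action) -> [(next_state, probability), ...]
    ("s0", "a0"): [("s0", 0.3), ("s1", 0.7)],
    ("s0", "a1"): [("s1", 1.0)],
    ("s1", "a0"): [("s0", 0.5), ("s1", 0.5)],
    ("s1", "a1"): [("s1", 1.0)],
}
values = {                           # (state, action, next_state) -> value (reward)
    ("s0", "a0", "s1"): 1.0,
    ("s1", "a1", "s1"): 0.5,
}

def step(state, action):
    """Sample the next state and return it together with the transition value."""
    next_states, probs = zip(*transitions[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, values.get((state, action, next_state), 0.0)
```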
Decision tree
a tree-like model of decisions and their possible consequences
Decision tree
a tree-like model of decisions and their possible consequences
depth \( d = 0\)
depth \( d = 1\)
depth \( d = 2\)
Decision tree
a tree-like model of decisions and their possible consequences
Alternative representation in the case of stochastic state transitions
State nodes
Action nodes
tree search
In computer science, tree search refers to the process of visiting (evaluating) each node of a tree data structure exactly once.
depth-first search
breadth-first search
[figure: two example trees showing the order in which nodes are visited by each traversal]
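A minimal sketch of the two visit orders on a small hypothetical tree:

```python
from collections import deque

# A small tree as an adjacency list (hypothetical example).
tree = {1: [2, 5], 2: [3, 4], 5: [6, 7], 3: [], 4: [], 6: [], 7: []}

def depth_first(node):
    """Visit a node, then recurse into its children (depth-first order)."""
    order = [node]
    for child in tree[node]:
        order += depth_first(child)
    return order

def breadth_first(root):
    """Visit nodes level by level using a FIFO queue (breadth-first order)."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

print(depth_first(1))    # [1, 2, 3, 4, 5, 6, 7]
print(breadth_first(1))  # [1, 2, 5, 3, 4, 6, 7]
```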
monte carlo planning
Monte Carlo methods rely on repeated random sampling to obtain numerical results.
The use of Monte-Carlo simulations in computer games started with the pioneering work of:
- Brügmann (1993) -> Computer Go
- Abramson (1990) -> Othello
A position is evaluated by running many “playouts” (simulations): sequences of random moves generated alternately by the player and the adversary, starting from the current game state and continuing until a terminal configuration is reached.
[figure: from state \( s \), each available action \( a_1, a_2, \ldots, a_K \) is evaluated by playouts, giving value estimates \( V^i_{a_1}(s), V^i_{a_2}(s), \ldots, V^i_{a_K}(s) \)]
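A rough sketch of this evaluation scheme, assuming a user-supplied `simulate_playout(state, action)` that plays random moves to the end of the game and returns the outcome:

```python
# Monte Carlo position evaluation (no tree yet): every action from the
# current state is scored by the average outcome of random playouts.
def evaluate_actions(state, actions, simulate_playout, n_playouts=100):
    estimates = {}
    for a in actions:
        outcomes = [simulate_playout(state, a) for _ in range(n_playouts)]
        estimates[a] = sum(outcomes) / n_playouts   # Monte Carlo estimate of V_a(s)
    return estimates
```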
How to sample actions?
Bandit based MCP
Apply the upper confidence bound (UCB) algorithm to MCP
Kocsis, Levente, and Csaba Szepesvári. "Bandit based monte-carlo planning." European conference on machine learning. Springer, Berlin, Heidelberg, 2006.
When combined with tree search, the algorithm is called UCT.
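In UCT the selection rule is UCB1: each child is scored by its mean value plus an exploration bonus \( c \sqrt{\ln N / n} \), where \( N \) is the parent's visit count and \( n \) the child's. A sketch (the dictionary-based node representation is an assumption):

```python
import math

def ucb_select(children, parent_visits, c=math.sqrt(2)):
    """Pick the child maximising mean value + exploration bonus (UCB1)."""
    def score(child):
        if child["visits"] == 0:
            return float("inf")                      # try unvisited children first
        mean = child["value"] / child["visits"]
        return mean + c * math.sqrt(math.log(parent_visits) / child["visits"])
    return max(children, key=score)
```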
Monte Carlo Tree Search
The Crazy Stone program was the starting point of the MCTS method. In this program, Coulom introduced the following improvements to MCP:
- Instead of selecting the moves according to a uniform distribution, the probability distribution over possible moves is updated after each simulation so that more weight is assigned to moves that achieved better scores in previous runs.
- An incremental tree representation adding a leaf to the current tree representation at each play-out enables the construction of an asymmetric tree where the most promising branches are explored to a greater depth.
Coulom, Rémi. "Efficient selectivity and backup operators in Monte-Carlo tree search." International conference on computers and games. Springer, Berlin, Heidelberg, 2006.
Monte Carlo Tree Search
The algorithm consists of four steps:
- Selection: the tree is traversed down to a leaf node \( L \).
- Expansion: the leaf node is expanded by creating child nodes corresponding to all possible decisions.
- Simulation: the decision process is simulated from one of the newly created nodes, with random decisions, until a terminal state is reached.
- Backpropagation: the result of the simulation is stored and backpropagated up to the root.
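A compact sketch of one iteration combining the four steps; `legal_actions`, `transition`, and `rollout` are placeholders for the user's decision problem:

```python
import math
import random

class Node:
    """A search-tree node storing visit and value statistics."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def mcts_iteration(root, legal_actions, transition, rollout, c=math.sqrt(2)):
    # Selection: descend the tree via UCB1 until a leaf node is reached.
    node = root
    while node.children:
        node = max(node.children,
                   key=lambda ch: float("inf") if ch.visits == 0 else
                   ch.value / ch.visits + c * math.sqrt(math.log(node.visits) / ch.visits))
    # Expansion: create a child for every possible decision at the leaf.
    for a in legal_actions(node.state):
        node.children.append(Node(transition(node.state, a), parent=node))
    if node.children:
        node = random.choice(node.children)
    # Simulation: random playout from the new node until the end is reached.
    outcome = rollout(node.state)
    # Backpropagation: store the result along the path back to the root.
    while node is not None:
        node.visits += 1
        node.value += outcome
        node = node.parent
```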
Monte Carlo Tree Search
Selection
Expansion
Simulation
Backpropagation
[figure: example tree with win/visit counts at each node (e.g. 1/2, 1/1, 0/1) being updated through the four steps; playouts end in a win or a loss]
First iteration
Second iteration
Monte Carlo Tree Search
Selection
Expansion
Simulation
Backpropagation
[figure: the same example tree (counts 1/3, 1/1, 0/1) during selection, expansion, and simulation; after a winning playout is backpropagated, the root count becomes 2/4]
Third iteration
Tic tac toe
Bandit algorithms for tree search
"The analysis of the popular UCT (Upper Confidence Bounds applied to Trees) algorithm has been a theoretical failure: the algorithm may perform very poorly (much worse than a uniform search) on toy problems and does not possess nice finite-time performance guarantees."
Munos, Rémi. "From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning." (2014).
"Hierarchical bandit approach -- where the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level -- possesses the nice feature of starting the exploration by a quasi-uniform sampling of the space and then focusing progressively on the most promising area, at different scales, according to the evaluations observed so far, and eventually performing a local search around the global optima of the function."
Bayesian Bandit algorithms
Tesauro, Gerald, V. T. Rajan, and Richard Segal. "Bayesian inference in monte-carlo tree search." arXiv preprint arXiv:1203.3519 (2012).
Bai, Aijun, Feng Wu, and Xiaoping Chen. "Bayesian mixture modelling and inference based Thompson sampling in Monte-Carlo tree search." Proceedings of the Advances in Neural Information Processing Systems (NIPS) (2013): 1646-1654.
Bai, Aijun, Feng Wu, and Xiaoping Chen. "Posterior sampling for Monte Carlo planning under uncertainty." Applied Intelligence 48.12 (2018): 4998-5018.
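The common idea is to replace UCB with posterior (Thompson) sampling over action values. A generic sketch with Beta posteriors over win probabilities (not the exact variant used in the papers above):

```python
import random

def thompson_select(stats):
    """stats: action -> (wins, losses). Sample each Beta posterior, pick the best draw."""
    best_action, best_draw = None, -1.0
    for action, (wins, losses) in stats.items():
        draw = random.betavariate(wins + 1, losses + 1)   # Beta(1, 1) prior
        if draw > best_draw:
            best_action, best_draw = action, draw
    return best_action
```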
How long should one plan?
Best arm identification and optimal stopping
Kaufmann, Emilie, and Wouter Koolen. "Monte-Carlo tree search by best arm identification." arXiv preprint arXiv:1706.02986 (2017).
Dai, Zhongxiang, et al. "Bayesian optimization meets Bayesian optimal stopping." International Conference on Machine Learning. PMLR, 2019.
Partially Observable Markov Decision Process
a discrete-time stochastic control process
- state space \( S\)
- action space \( A \)
- observation space \(\Omega \)
- state transition probabilities \( Pr(s_{t+1}| s_t, a_t) \)
- observation likelihood \( Pr(o_t|s_t) \)
- beliefs over states \( b(s) \)
- value functional \( V_a[b, b'] \) of transitioning from belief state \( b \) to belief state \( b' \) due to action \( a \)
Optimal behaviour maximises accumulated value (reward).
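The belief is updated by Bayes' rule after each action-observation pair: \( b'(s') \propto Pr(o|s') \sum_s Pr(s'|s, a)\, b(s) \). A sketch for a discrete state space (the lookup-table format is an assumption):

```python
def update_belief(belief, action, obs, trans, likelihood):
    """Bayes' rule: b'(s') is proportional to Pr(o|s') * sum_s Pr(s'|s, a) * b(s).

    belief: state -> probability, trans: (state, action) -> {next_state: prob},
    likelihood: state -> {observation: prob}.
    """
    new_belief = {}
    for s_next in belief:
        predicted = sum(trans[(s, action)].get(s_next, 0.0) * belief[s] for s in belief)
        new_belief[s_next] = likelihood[s_next].get(obs, 0.0) * predicted
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()} if norm > 0 else dict(belief)
```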
Monte-Carlo Planning in POMDPs
Asmuth, John, and Michael L. Littman. "Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search." arXiv preprint arXiv:1202.3699 (2012).
Vien, Ngo Anh, et al. "Monte-Carlo tree search for Bayesian reinforcement learning." Applied intelligence 39.2 (2013): 345-353.
MDPs with unknown state transitions:
- Bayesian inference
- Belief MDPs
Monte-Carlo Planning in POMDPs
Silver, David, and Joel Veness. "Monte-Carlo planning in large POMDPs." Neural Information Processing Systems, 2010.
"Partially Observable Monte-Carlo Planning (POMCP) consists of a UCT search that selects actions at each time-step; and a particle filter that updates the agent’s belief state."
"We extend the UCT algorithm to partially observable environments by using a search tree of histories (beliefs) instead of states. The tree contains a node \(T(h_t) =〈N(h_t),V(h_t)〉\)for each represented history \(h_t = (o_1, \ldots, o_t)\)."
Interesting
Fischer, Johannes, and Ömer Sahin Tas. "Information particle filter tree: An online algorithm for POMDPs with belief-based rewards on continuous domains." International Conference on Machine Learning. PMLR, 2020.
Abstract
Planning in Partially Observable Markov Decision Processes (POMDPs) inherently gathers the information necessary to act optimally under uncertainties. The framework can be extended to model pure information gathering tasks by considering belief-based rewards. This allows us to use reward shaping to guide POMDP planning to informative beliefs by using a weighted combination of the original reward and the expected information gain as the objective. In this work we propose a novel online algorithm, Information Particle Filter Tree (IPFT), to solve problems with belief-dependent rewards on continuous domains. It simulates particle-based belief trajectories in a Monte Carlo Tree Search (MCTS) approach to construct a search tree in the belief space. The evaluation shows that the consideration of information gain greatly improves the performance in problems where information gathering is an essential part of the optimal policy.
MCTS and planning as inference
Lieck, Robert, and Marc Toussaint. "Active Tree Search." ICAPS Workshop on Planning, Search, and Optimization. 2017.
Maisto, Domenico, et al. "Active Tree Search in Large POMDPs." arXiv preprint arXiv:2103.13860 (2021).
Fountas, Zafeirios, et al. "Deep active inference agents using Monte-Carlo methods." arXiv preprint arXiv:2006.04176 (2020).
Literature - Human MC Planning
Hula, Andreas, P. Read Montague, and Peter Dayan. "Monte carlo planning method estimates planning horizons during interactive social exchange." PLoS computational biology 11.6 (2015): e1004254.
Krusche, Moritz JF, et al. "Adaptive planning in human search." bioRxiv (2018): 268938.
Keramati, Mehdi, et al. "Adaptive integration of habits into depth-limited planning defines a habitual-goal–directed spectrum." PNAS 113.45 (2016): 12868-12873.
To read
Parascandolo, Giambattista, et al. "Divide-and-conquer monte carlo tree search for goal-directed planning." arXiv preprint arXiv:2004.11410 (2020).
Divide-and-Conquer MCTS - for goal-directed reinforcement learning problems:
"Approximating the optimal plan by means of proposing intermediate sub-goals which hierarchically partition the initial tasks into simpler ones that are then solved independently and recursively."
Sunberg, Zachary N., and Mykel J. Kochenderfer. "Online algorithms for POMDPs with continuous state, action, and observation spaces." Twenty-Eighth International Conference on Automated Planning and Scheduling. 2018.
Conclusion
MCTS is a versatile algorithm with a wide range of applications -- MCTS is awesome!
Many ways to integrate it into research in cognitive neuroscience:
- Simulating decision noise.
- Estimating planning depth.
- Simulating neuronal responses.
- Simulating response times.
Uncharted possibilities with respect to meta-control and hierarchical planning.