Introduction to Monte Carlo Tree Search

Dimitrije Marković

DySCO meeting 14.07.2021

Why am I talking about MCTS?

Markov Decision Process

a discrete-time stochastic control process

  • state space \( S\)
  • action space \( A \)
  • state transition probabilities \( Pr(s_{t+1}| s_t, a_t) \)
  • value function \( V_a(s, s') \) of transitioning from state \( s\) to state \( s'\) due to action \( a \)

Optimal behaviour maximises accumulated value (reward).

\pi^* = \arg\max_{\pi} E_\pi \left[ \sum_{k=t}^{T} V_{a_k} \left(s_k, s_{k+1} \right) \right]
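To make these ingredients concrete, here is a minimal Python sketch (all names -- `n_states`, `P`, `V`, `rollout` -- are hypothetical, not from any standard library) of a small finite MDP in which the accumulated transition value of a policy is estimated from sampled trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Hypothetical transition model: P[a, s, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
# Hypothetical transition values: V[a, s, s'] = V_a(s, s')
V = rng.normal(size=(n_actions, n_states, n_states))

def rollout(policy, s, horizon=10):
    """Accumulate transition values V_a(s, s') along one sampled trajectory."""
    total = 0.0
    for _ in range(horizon):
        a = policy(s)
        s_next = rng.choice(n_states, p=P[a, s])
        total += V[a, s, s_next]
        s = s_next
    return total

# Estimate the expected accumulated value of a uniformly random policy.
random_policy = lambda s: rng.integers(n_actions)
returns = [rollout(random_policy, 0) for _ in range(1000)]
print("estimated expected return:", np.mean(returns))
```

An optimal policy is then one whose sampled trajectories maximise this expected accumulated value.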

Decision tree

a tree-like model of decisions and their possible consequences

(Figure: the decision tree is expanded level by level, with nodes at depths \( d = 0 \), \( d = 1 \), and \( d = 2 \).)

Decision tree

a tree-like model of decisions and their possible consequences

Alternative representation in the case of stochastic state transitions

(Figure: tree with alternating state nodes and action nodes.)

Tree search

In computer science, tree search refers to the process of visiting (evaluating) each node in a tree data structure exactly once.

depth-first search

breadth-first search

(Figure: the same example tree traversed with each strategy, nodes numbered 1-8 in the order they are visited.)
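As a minimal illustration of the two traversal orders (the tree and node labels below are made up, not those from the figure):

```python
from collections import deque

# Toy binary tree as an adjacency dict; node labels are arbitrary.
tree = {1: [2, 5], 2: [3, 4], 5: [6, 7], 3: [], 4: [], 6: [], 7: []}

def dfs(root):
    """Depth-first: follow one branch to the bottom before backtracking."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(tree[node]))  # push children so the leftmost is on top
    return order

def bfs(root):
    """Breadth-first: visit all nodes at one depth before going deeper."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

print("DFS order:", dfs(1))   # [1, 2, 3, 4, 5, 6, 7]
print("BFS order:", bfs(1))   # [1, 2, 5, 3, 4, 6, 7]
```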

Monte Carlo planning

Monte Carlo methods rely on repeated random sampling to obtain numerical results.

 

The use of Monte-Carlo simulations in computer games started with the pioneering work of:

  • Brügmann (1993) -> Computer Go
  • Abramson (1990) -> Othello

 

A position is evaluated by running many “playouts” (simulations): sequences of random moves generated alternately by the player and the adversary, starting from the current game state and continuing until a terminal configuration is reached.

a^* = \arg\max_a \frac{1}{S}\sum_{i=1}^S V^i_a(s)

(Figure: from the current state \( s \), each action \( a_1, a_2, \ldots, a_K \) is evaluated by simulated playouts yielding returns \( V^i_{a_1}(s), V^i_{a_2}(s), \ldots, V^i_{a_K}(s) \).)
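A minimal sketch of this flat Monte-Carlo evaluation, using a made-up toy game (a corridor where moving right wins); names such as `playout_value` and `flat_mc` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up toy game: a corridor of 7 cells; action 0 moves left, action 1 moves right.
# Reaching the rightmost cell wins (value 1), the leftmost cell loses (value 0).
N_CELLS, ACTIONS = 7, (0, 1)
step = lambda s, a: s - 1 if a == 0 else s + 1
terminal = lambda s: s in (0, N_CELLS - 1)

def playout_value(s):
    """Finish the game with uniformly random moves and return the final outcome."""
    while not terminal(s):
        s = step(s, rng.choice(ACTIONS))
    return 1.0 if s == N_CELLS - 1 else 0.0

def flat_mc(s, n_playouts=500):
    """Score each first move by the average outcome of random playouts (the V^i_a(s))."""
    means = {a: np.mean([playout_value(step(s, a)) for _ in range(n_playouts)])
             for a in ACTIONS}
    return max(means, key=means.get), means

best, estimates = flat_mc(s=3)
print(estimates)               # moving right should score higher on average
print("chosen action:", best)
```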

How to sample actions?


Bandit-based MCP

Apply the upper confidence bound (UCB) algorithm to MCP

Kocsis, Levente, and Csaba Szepesvári. "Bandit based monte-carlo planning." European conference on machine learning. Springer, Berlin, Heidelberg, 2006.

When combined with tree search the algorithm is called UCT.

a^* = \arg\max_{a} \left[ \bar{V}_a(s) + c \cdot \sqrt{\frac{\ln S}{n(a)}} \right]
\bar{V}_a(s) = \frac{1}{n(a)} \sum_{i=1}^{n(a)} V^i_a(s)
S = \sum_a n(a)
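A small sketch of this UCB selection rule (the `stats` bookkeeping and function names are hypothetical):

```python
import math

def ucb_select(stats, c=1.4):
    """Pick the action maximising mean value plus exploration bonus.

    `stats` maps each action to (total_value, visit_count); unvisited
    actions are tried first.
    """
    total_visits = sum(n for _, n in stats.values())
    def ucb(action):
        value, n = stats[action]
        if n == 0:
            return float("inf")          # force at least one visit per action
        return value / n + c * math.sqrt(math.log(total_visits) / n)
    return max(stats, key=ucb)

# Example: action "b" has a better empirical mean, but "c" is unvisited,
# so it is selected first.
stats = {"a": (1.0, 3), "b": (2.0, 3), "c": (0.0, 0)}
print(ucb_select(stats))  # -> "c"
```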

Monte Carlo Tree Search

The Crazy Stone program was the starting point of the MCTS method. In this program, Coulom introduced the following improvements to MCP:

  • Instead of selecting the moves according to a uniform distribution, the probability distribution over possible moves is updated after each simulation so that more weight is assigned to moves that achieved better scores in previous runs.
  • An incremental tree representation, which adds a leaf to the current tree at each play-out, enables the construction of an asymmetric tree in which the most promising branches are explored to a greater depth.

Coulom, Rémi. "Efficient selectivity and backup operators in Monte-Carlo tree search." International conference on computers and games. Springer, Berlin, Heidelberg, 2006.

Monte Carlo Tree Search

The algorithm consists of four steps (a minimal code sketch follows the list):

  1. Selection: the tree is traversed down to a leaf node \( L \).
  2. Expansion: the leaf node is expanded to create child nodes corresponding to all possible decisions.
  3. Simulation: the decision process is simulated from one of the newly created nodes with random decisions until the end is reached.
  4. Backpropagation: the result of the simulation is stored and backpropagated up to the root.
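Below is a minimal single-agent MCTS sketch of the four steps, reusing the toy corridor game from the flat Monte-Carlo example; all names are hypothetical and the code is illustrative rather than a production implementation:

```python
import math, random

# Toy corridor game (hypothetical example): 7 cells, action 0 = left, 1 = right;
# the right end wins (outcome 1), the left end loses (outcome 0).
N_CELLS, ACTIONS = 7, (0, 1)
step = lambda s, a: s - 1 if a == 0 else s + 1
terminal = lambda s: s in (0, N_CELLS - 1)
outcome = lambda s: 1.0 if s == N_CELLS - 1 else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> Node
        self.visits, self.value = 0, 0.0

def ucb(child, parent, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_state, n_iters=200):
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while not terminal(node.state) and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda ch: ucb(ch, node))
        # 2. Expansion: add one untried child (unless the node is terminal).
        if not terminal(node.state):
            a = next(a for a in ACTIONS if a not in node.children)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation: random playout from the new node to a terminal state.
        s = node.state
        while not terminal(s):
            s = step(s, random.choice(ACTIONS))
        result = outcome(s)
        # 4. Backpropagation: push the result up to the root.
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    # Recommend the most visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts(root_state=3))  # should prefer action 1 (move right)
```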

Monte Carlo Tree Search

(Figure: first and second iterations, each shown as four panels -- selection, expansion, simulation, backpropagation. Nodes are annotated with win/visit counts (e.g. 1/1, 1/2); the first simulated playout ends in a win, the second in a loss.)

Monte Carlo Tree Search

(Figure: third iteration. Starting from the tree with root 1/3 and children 1/1 and 0/1, a node is selected and expanded, the playout ends in a win, and backpropagation updates the counts to root 2/4 with children 2/2, 0/1 and a new leaf 1/1.)

Tic tac toe

(Figure: tic-tac-toe board positions with X and O moves.)

Bandit algorithms for tree search

"The analysis of the popular UCT (Upper Confidence Bounds applied to Trees) algorithm has been a theoretical failure: the algorithm may perform very poorly (much worse than a uniform search) on toy problems and does not possess nice finite-time performance guarantees."

Munos, Rémi. "From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning." (2014).

"Hierarchical bandit approach -- where the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level -- possesses the nice feature of starting the exploration by a quasi-uniform sampling of the space and then focusing progressively on the most promising area, at different scales, according to the evaluations observed so far, and eventually performing a local search around the global optima of the function."


Bayesian Bandit algorithms

Tesauro, Gerald, V. T. Rajan, and Richard Segal. "Bayesian inference in monte-carlo tree search." arXiv preprint arXiv:1203.3519 (2012).

 

Bai, Aijun, Feng Wu, and Xiaoping Chen. "Bayesian mixture modelling and inference based Thompson sampling in Monte-Carlo tree search." Proceedings of the Advances in Neural Information Processing Systems (NIPS) (2013): 1646-1654.

 

Bai, Aijun, Feng Wu, and Xiaoping Chen. "Posterior sampling for Monte Carlo planning under uncertainty." Applied Intelligence 48.12 (2018): 4998-5018.

How long should one plan?

Best arm identification and optimal stopping

Kaufmann, Emilie, and Wouter Koolen. "Monte-Carlo tree search by best arm identification." arXiv preprint arXiv:1706.02986 (2017).

 

Dai, Zhongxiang, et al. "Bayesian optimization meets Bayesian optimal stopping." International Conference on Machine Learning. PMLR, 2019.

Partially Observable Markov Decision Process

a discrete-time stochastic control process

  • state space \( S\)
  • action space \( A \)
  • observation space \(\Omega \)
  • state transition probabilities \( Pr(s_{t+1}| s_t, a_t) \)
  • observation likelihood \( Pr(o_t|s_t) \)
  • beliefs over states \( b(s) \)
  • value functional \( V_a[b, b'] \) of transitioning from belief state \( b \) to belief state \( b' \) due to action \( a \)

Optimal behaviour maximises accumulated value (reward).

\pi^* = \arg\max_{\pi} E_\pi \left[ \sum_{k=t}^{T} V_{a_k} \left[b_k, b_{k+1} \right] \right]
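For a discrete POMDP, the belief \( b(s) \) is updated by Bayesian filtering after each action and observation. A minimal sketch (the array names `T`, `O`, `b` are hypothetical):

```python
import numpy as np

# Hypothetical discrete POMDP quantities:
# T[a, s, s'] = Pr(s' | s, a),  O[s, o] = Pr(o | s),  b[s] = current belief.
def belief_update(b, a, o, T, O):
    """Bayesian filtering: predict with the transition model, correct with the likelihood."""
    predicted = T[a].T @ b              # Pr(s') = sum_s Pr(s'|s,a) b(s)
    posterior = O[:, o] * predicted     # multiply by the observation likelihood Pr(o|s')
    return posterior / posterior.sum()  # normalise

# Tiny two-state example.
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # one action
O = np.array([[0.8, 0.2], [0.3, 0.7]])     # two observations
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))
```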

Monte-Carlo Planning in POMDPs

Asmuth, John, and Michael L. Littman. "Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search." arXiv preprint arXiv:1202.3699 (2012).

 

 

 

 

 

Vien, Ngo Anh, et al. "Monte-Carlo tree search for Bayesian reinforcement learning." Applied intelligence 39.2 (2013): 345-353.

MDPs with unknown state transitions:

  • Bayesian inference
  • Belief MDPs

Monte-Carlo Planning in POMDPs

Silver, David, and Joel Veness. "Monte-Carlo planning in large POMDPs." Neural Information Processing Systems, 2010.

"Partially Observable Monte-Carlo Planning (POMCP) consists of a UCT search that selects actions at each time-step; and a particle filter that updates the agent’s belief state."

 

"We extend the UCT algorithm to partially observable environments by using a search tree of histories (beliefs) instead of states. The tree contains a node \( T(h_t) = \langle N(h_t), V(h_t) \rangle \) for each represented history \( h_t = (o_1, \ldots, o_t) \)."

Interesting

Fischer, Johannes, and Ömer Sahin Tas. "Information particle filter tree: An online algorithm for POMDPs with belief-based rewards on continuous domains." International Conference on Machine Learning. PMLR, 2020.

 

Abstract

Planning in Partially Observable Markov Decision Processes (POMDPs) inherently gathers the information necessary to act optimally under uncertainties. The framework can be extended to model pure information gathering tasks by considering belief-based rewards. This allows us to use reward shaping to guide POMDP planning to informative beliefs by using a weighted combination of the original reward and the expected information gain as the objective. In this work we propose a novel online algorithm, Information Particle Filter Tree (IPFT), to solve problems with belief-dependent rewards on continuous domains. It simulates particle-based belief trajectories in a Monte Carlo Tree Search (MCTS) approach to construct a search tree in the belief space. The evaluation shows that the consideration of information gain greatly improves the performance in problems where information gathering is an essential part of the optimal policy.
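A minimal sketch of the belief-based reward shaping idea described in the abstract: combine the original reward with the information gain, here measured as the reduction in belief entropy (the weight and the exact information measure are assumptions for illustration):

```python
import numpy as np

def entropy(b):
    """Shannon entropy of a discrete belief vector."""
    b = np.asarray(b)
    return -np.sum(b[b > 0] * np.log(b[b > 0]))

def shaped_reward(reward, belief_before, belief_after, weight=1.0):
    """Weighted combination of the original reward and the information gain."""
    info_gain = entropy(belief_before) - entropy(belief_after)
    return reward + weight * info_gain

print(shaped_reward(0.0, [0.5, 0.5], [0.9, 0.1]))  # positive: the belief got sharper
```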

MCTS and planning as inference

 

Lieck, Robert, and Marc Toussaint. "Active Tree Search." ICAPS Workshop on Planning, Search, and Optimization. 2017.

 

Maisto, Domenico, et al. "Active Tree Search in Large POMDPs." arXiv preprint arXiv:2103.13860 (2021).

 

Fountas, Zafeirios, et al. "Deep active inference agents using Monte-Carlo methods." arXiv preprint arXiv:2006.04176 (2020).

Literature - Human MC planning

Hula, Andreas, P. Read Montague, and Peter Dayan. "Monte carlo planning method estimates planning horizons during interactive social exchange." PLoS computational biology 11.6 (2015): e1004254.

 

Krusche, Moritz JF, et al. "Adaptive planning in human search." bioRxiv (2018): 268938.

 

Keramati, Mehdi, et al. "Adaptive integration of habits into depth-limited planning defines a habitual-goal–directed spectrum." PNAS 113.45 (2016): 12868-12873.

 

To read

Parascandolo, Giambattista, et al. "Divide-and-conquer monte carlo tree search for goal-directed planning." arXiv preprint arXiv:2004.11410 (2020).

Divide-and-Conquer MCTS for goal-directed reinforcement learning problems:

"Approximating the optimal plan by means of proposing intermediate sub-goals which hierarchically partition the initial tasks into simpler ones that are then solved independently and recursively."

 

Sunberg, Zachary N., and Mykel J. Kochenderfer. "Online algorithms for POMDPs with continuous state, action, and observation spaces." Twenty-Eighth International Conference on Automated Planning and Scheduling. 2018.

Conclusion

MCTS is a versatile algorithm with a wide range of applications -- MCTS is awesome!

Many ways to integrate it into research in cognitive neuroscience:

  • Simulating decision noise.
  • Estimating planning depth.
  • Simulating neuronal responses.
  • Simulating response times.

Uncharted possibilities with respect to meta-control and hierarchical planning.