Dimitrije Marković
DySCO meeting 14.07.2021
A Markov decision process (MDP): a discrete-time stochastic control process.
Optimal behaviour maximises accumulated value (reward).
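For concreteness (standard notation, not from the slide): with rewards \(r_t\) and a discount factor \(\gamma \in [0, 1)\), the accumulated value of a policy \(\pi\) starting from state \(s\) is
\[ V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s \right], \]
and optimal behaviour corresponds to \(\pi^{*} = \arg\max_{\pi} V^{\pi}(s)\).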
a tree-like model of decisions and their possible consequences
[Figure: decision tree unrolled to depths \( d = 0, 1, 2 \)]
a tree-like model of decisions and their possible consequences
Alternative representation in the case of stochastic state transitions
[Figure: tree with alternating layers of state nodes and action nodes]
In computer science, tree search refers to the process of visiting (evaluating) each node in a tree data structure exactly once.
[Figure: depth-first search vs breadth-first search, with nodes numbered in the order in which each strategy visits them]
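As a minimal illustration (not from the slides), the two traversal orders can be sketched as follows, using a hypothetical adjacency-list representation of the tree:

```python
from collections import deque

def dfs(tree, root):
    """Depth-first traversal: follow one branch to the bottom before backtracking."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # push children in reverse so the left-most child is visited first
        stack.extend(reversed(tree.get(node, [])))
    return order

def bfs(tree, root):
    """Breadth-first traversal: visit all nodes at one depth before moving deeper."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree.get(node, []))
    return order

# hypothetical 8-node tree (adjacency list)
tree = {1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [8]}
print(dfs(tree, 1))  # [1, 2, 4, 8, 5, 3, 6, 7]
print(bfs(tree, 1))  # [1, 2, 3, 4, 5, 6, 7, 8]
```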
Monte Carlo methods rely on repeated random sampling to obtain numerical results.
The use of Monte-Carlo simulations in video games started with the pioneering work of:
A position is evaluated by running many "playouts" (simulations): sequences of random moves generated alternately by the player and the adversary, starting from the current game state until a terminal configuration is reached.
From the current state \(s\), each available action \(a_1, a_2, \ldots, a_K\) is assigned a value estimate \( V^i_{a_1}(s), V^i_{a_2}(s), \ldots, V^i_{a_K}(s) \) based on the playouts simulated so far.
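A minimal sketch of this flat Monte-Carlo evaluation, assuming a hypothetical simulator interface (`step`, `legal_actions`, `is_terminal`, `outcome`; the names are illustrative, not from the references):

```python
import random

def random_playout(env, state):
    """Complete the decision process with uniformly random moves and return the outcome."""
    while not env.is_terminal(state):
        state = env.step(state, random.choice(env.legal_actions(state)))
    return env.outcome(state)            # e.g. +1 win, 0 draw, -1 loss

def evaluate_actions(env, state, n_playouts=100):
    """Estimate V_a(s) for every action a by averaging the returns of n random playouts."""
    values = {}
    for a in env.legal_actions(state):
        next_state = env.step(state, a)
        returns = [random_playout(env, next_state) for _ in range(n_playouts)]
        values[a] = sum(returns) / n_playouts
    return values                        # the action with the largest estimate is played
```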
Apply the upper confidence bound (UCB) algorithm to Monte-Carlo planning (MCP).
Kocsis, Levente, and Csaba Szepesvári. "Bandit based monte-carlo planning." European conference on machine learning. Springer, Berlin, Heidelberg, 2006.
When combined with tree search, the algorithm is called UCT (Upper Confidence Bounds applied to Trees).
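For reference, a minimal sketch of the UCB1 selection rule used at each node (the attributes `visits` and `value_sum` are illustrative, not from the paper):

```python
import math

def ucb1_select(children, parent_visits, c=math.sqrt(2)):
    """Select the child maximising mean return + c * sqrt(ln(N_parent) / N_child)."""
    def score(child):
        if child.visits == 0:
            return float("inf")          # unvisited actions are tried first
        mean = child.value_sum / child.visits
        return mean + c * math.sqrt(math.log(parent_visits) / child.visits)
    return max(children, key=score)
```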
The Crazy Stone program was the starting point of the MCTS method; in this program, Coulom introduced several improvements to plain Monte-Carlo planning:
Coulom, Rémi. "Efficient selectivity and backup operators in Monte-Carlo tree search." International conference on computers and games. Springer, Berlin, Heidelberg, 2006.
The algorithm consists of four steps:
Selection: starting from the root, child nodes are selected recursively (e.g., according to UCB) until a leaf node is reached.
Expansion: the leaf node is expanded to create child nodes corresponding to all possible decisions.
Simulation: the decision process is simulated from one of the newly created nodes with random decisions until the end is reached.
Backpropagation: the result of the simulation is stored and backpropagated up to the root.
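Putting the four steps together, a minimal MCTS sketch might look as follows; the environment interface (`step`, `legal_actions`, `is_terminal`, `outcome`) and node fields are illustrative assumptions, not code from the cited work:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

def ucb_score(parent, child, c):
    if child.visits == 0:
        return float("inf")               # always visit unexplored children first
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(env, root_state, n_iterations=1000, c=math.sqrt(2)):
    root = Node(root_state)
    for _ in range(n_iterations):
        # 1. Selection: follow UCB from the root until a leaf node is reached
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ucb_score(node, ch, c))
        # 2. Expansion: create child nodes for all possible decisions
        if not env.is_terminal(node.state):
            node.children = [Node(env.step(node.state, a), parent=node)
                             for a in env.legal_actions(node.state)]
            node = random.choice(node.children)
        # 3. Simulation: random decisions until the end of the decision process
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        outcome = env.outcome(state)      # e.g. 1 for a win, 0 for a loss
        # 4. Backpropagation: store the result along the path back to the root
        while node is not None:           # (for two-player games the sign of the
            node.visits += 1              #  outcome would alternate between levels)
            node.value_sum += outcome
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits)   # most visited decision
```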
[Figure: first and second iterations of MCTS on an example game tree; the selection, expansion, simulation, and backpropagation phases update the win/visit counts (e.g., 1/1, 1/2, 0/1) along the traversed path after each win or loss]
[Figure: third iteration of MCTS; a winning simulation updates the win/visit counts along the selected path (e.g., root 1/3 → 2/4, selected child 1/1 → 2/2), while unselected children remain at 0/1]
"The analysis of the popular UCT (Upper Confidence Bounds applied to Trees) algorithm has been a theoretical failure: the algorithm may perform very poorly (much worse than a uniform search) on toy problems and does not possess nice finite-time performance guarantees."
Munos, Rémi. "From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning." (2014).
"Hierarchical bandit approach -- where the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level -- possesses the nice feature of starting the exploration by a quasi-uniform sampling of the space and then focusing progressively on the most promising area, at different scales, according to the evaluations observed so far, and eventually performing a local search around the global optima of the function."
Tesauro, Gerald, V. T. Rajan, and Richard Segal. "Bayesian inference in monte-carlo tree search." arXiv preprint arXiv:1203.3519 (2012).
Bai, Aijun, Feng Wu, and Xiaoping Chen. "Bayesian mixture modelling and inference based Thompson sampling in Monte-Carlo tree search." Proceedings of the Advances in Neural Information Processing Systems (NIPS) (2013): 1646-1654.
Bai, Aijun, Feng Wu, and Xiaoping Chen. "Posterior sampling for Monte Carlo planning under uncertainty." Applied Intelligence 48.12 (2018): 4998-5018.
Kaufmann, Emilie, and Wouter Koolen. "Monte-Carlo tree search by best arm identification." arXiv preprint arXiv:1706.02986 (2017).
Dai, Zhongxiang, et al. "Bayesian optimization meets Bayesian optimal stopping." International Conference on Machine Learning. PMLR, 2019.
A Markov decision process (MDP): a discrete-time stochastic control process.
Optimal behaviour maximises accumulated value (reward).
MDPs with unknown state transitions:
Asmuth, John, and Michael L. Littman. "Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search." arXiv preprint arXiv:1202.3699 (2012).
Vien, Ngo Anh, et al. "Monte-Carlo tree search for Bayesian reinforcement learning." Applied Intelligence 39.2 (2013): 345-353.
Silver, David, and Joel Veness. "Monte-Carlo planning in large POMDPs." Neural Information Processing Systems, 2010.
"Partially Observable Monte-Carlo Planning (POMCP) consists of a UCT search that selects actions at each time-step; and a particle filter that updates the agent’s belief state."
"We extend the UCT algorithm to partially observable environments by using a search tree of histories (beliefs) instead of states. The tree contains a node \(T(h_t) =〈N(h_t),V(h_t)〉\)for each represented history \(h_t = (o_1, \ldots, o_t)\)."
Fischer, Johannes, and Ömer Sahin Tas. "Information particle filter tree: An online algorithm for POMDPs with belief-based rewards on continuous domains." International Conference on Machine Learning. PMLR, 2020.
Abstract
Planning in Partially Observable Markov Decision Processes (POMDPs) inherently gathers the information necessary to act optimally under uncertainties. The framework can be extended to model pure information gathering tasks by considering belief-based rewards. This allows us to use reward shaping to guide POMDP planning to informative beliefs by using a weighted combination of the original reward and the expected information gain as the objective. In this work we propose a novel online algorithm, Information Particle Filter Tree (IPFT), to solve problems with belief-dependent rewards on continuous domains. It simulates particle-based belief trajectories in a Monte Carlo Tree Search (MCTS) approach to construct a search tree in the belief space. The evaluation shows that the consideration of information gain greatly improves the performance in problems where information gathering is an essential part of the optimal policy.
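In symbols, the weighted objective described in the abstract can be written as (notation is ours, not the paper's):
\[ \rho(b, a) \;=\; \mathbb{E}_{s \sim b}\!\left[ r(s, a) \right] \;+\; \lambda \, \mathbb{E}\!\left[ \mathrm{IG}(b, a) \right], \qquad \lambda \geq 0, \]
where \(b\) is the current belief and \(\lambda\) trades off the original reward against the expected information gain.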
Lieck, Robert, and Marc Toussaint. "Active Tree Search." ICAPS Workshop on Planning, Search, and Optimization. 2017.
Maisto, Domenico, et al. "Active Tree Search in Large POMDPs." arXiv preprint arXiv:2103.13860 (2021).
Fountas, Zafeirios, et al. "Deep active inference agents using Monte-Carlo methods." arXiv preprint arXiv:2006.04176 (2020).
Hula, Andreas, P. Read Montague, and Peter Dayan. "Monte carlo planning method estimates planning horizons during interactive social exchange." PLoS computational biology 11.6 (2015): e1004254.
Krusche, Moritz JF, et al. "Adaptive planning in human search." bioRxiv (2018): 268938.
Keramati, Mehdi, et al. "Adaptive integration of habits into depth-limited planning defines a habitual-goal–directed spectrum." PNAS 113.45 (2016): 12868-12873.
Parascandolo, Giambattista, et al. "Divide-and-conquer monte carlo tree search for goal-directed planning." arXiv preprint arXiv:2004.11410 (2020).
Divide-and-Conquer MCTS for goal-directed reinforcement learning problems:
"Approximating the optimal plan by means of proposing intermediate sub-goals which hierarchically partition the initial tasks into simpler ones that are then solved independently and recursively."
Sunberg, Zachary N., and Mykel J. Kochenderfer. "Online algorithms for POMDPs with continuous state, action, and observation spaces." Twenty-Eighth International Conference on Automated Planning and Scheduling. 2018.
MCTS is a versatile algorithm with a wide range of applications -- MCTS is awesome!
Many ways to integrate it into research in cognitive neuroscience:
Uncharted possibilities with respect to meta-control and hierarchical planning.