Markov Decision Processes
Deterministic Search
Real World
Stylized Example: Grid World
state: location of the bot
+1 reward for the diamond, -1 reward for the fire
noisy movement: actions may not go as planned
Markov Decision Processes: Ingredients
States: \(S\)
Actions: \(A\)
Transitions: \(T(s^\prime \mid s,a)\)
Rewards: \(R(s,a,s^\prime)\)
Start state: \(s_0\)
Discount: \(\gamma\)
Policy: map of states to actions
Utility: sum of discounted rewards
Values: expected future utility from a state
Q-values: expected future utility from a state+action pair
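To make the ingredients concrete, here is a minimal Python sketch of one way to bundle them into a single structure; the class name `MDP` and the tiny two-state example are illustrative assumptions, not the grid world from these slides.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Bundle of the MDP ingredients listed above."""
    states: list      # S
    actions: list     # A
    T: dict           # transitions: (s, a) -> {s': probability}
    R: dict           # rewards: (s, a, s') -> reward
    start: object     # s0
    gamma: float      # discount

# A toy two-state example (numbers are made up for illustration).
toy = MDP(
    states=["cool", "done"],
    actions=["go", "exit"],
    T={("cool", "go"):   {"cool": 1.0},
       ("cool", "exit"): {"done": 1.0}},
    R={("cool", "go", "cool"):   1.0,
       ("cool", "exit", "done"): 10.0},
    start="cool",
    gamma=0.9,
)
```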
How do you combine rewards?
\(r_1, r_2, r_3, \ldots\)
We can define some function \(f\)
\(f(r_1, r_2, r_3, \ldots)\)
that aggregates these rewards.
e.g., \(\sum_{i}r_i\)
or, e.g., \(\sum_{i}r_i \gamma^i\)
(rewards received sooner are worth more than rewards received later)
Stationary preference assumption:
\(\begin{gathered}\left[a_1, a_2, \ldots\right]\succ\left[b_1, b_2, \ldots\right] \\ \text{iff} \\ {\left[r, a_1, a_2, \ldots\right] \succ\left[r, b_1, b_2, \ldots\right]}\end{gathered}\)
Under this assumption, the natural ways to combine rewards are the two above: the additive utility \(\sum_{i}r_i\) and the discounted utility \(\sum_{i}r_i \gamma^i\).
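As a quick sketch of the discounted aggregation (the reward sequence and \(\gamma\) below are arbitrary):

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma**i * r_i over a finite reward sequence."""
    return sum(r * gamma**i for i, r in enumerate(rewards))

# The same rewards are worth less the later they arrive.
print(discounted_return([1, 1, 1], gamma=1.0))   # 3.0   (plain sum)
print(discounted_return([1, 1, 1], gamma=0.5))   # 1.75  (later rewards shrink)
```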
Discounting
Exit action has a reward of 10
Exit action has a reward of 1
(Figure: a row of grid cells, each allowing Left (L) and Right (R) moves, with the 10-reward and 1-reward exits at opposite ends.)
Assume actions are deterministic
\(r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \)
What if \(\gamma = 1\)?
What if \(\gamma = 0.1\)?
What value of \(\gamma\) makes both L and R equally good when in the blue state?
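One way to explore this question numerically: compare the discounted value of heading to each exit and bisect for the \(\gamma\) at which they tie. The step counts below are hypothetical placeholders, not read off the figure; substitute the actual distances from the blue state.

```python
def exit_value(reward, steps, gamma):
    """Value of the exit reward when it arrives after `steps` discount factors."""
    return reward * gamma**steps

def indifference_gamma(r_far, steps_far, r_near, steps_near, tol=1e-9):
    """Bisect for the gamma at which the far and near exits look equally good.

    Assumes the far exit has the larger reward, so it wins for gamma near 1
    and loses for gamma near 0.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if exit_value(r_far, steps_far, mid) < exit_value(r_near, steps_near, mid):
            lo = mid      # far exit still losing: need more patience (larger gamma)
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical distances (NOT taken from the slide's figure):
print(indifference_gamma(r_far=10, steps_far=3, r_near=1, steps_near=1))
```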
Markov Decision Processes: Policies
In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of actions, from start to a goal
For MDPs, we want an optimal policy,
i.e., a map \(\pi: S \rightarrow A\).
Optimal policies are denoted \(\pi^\star\).
Markov Decision Processes: Example
When you decide to move N:
with 80% probability, you move N
with 10% probability, you move E
with 10% probability, you move W
When you decide to move S:
with 80% probability, you move S
with 10% probability, you move E
with 10% probability, you move W
When you decide to move W:
with 80% probability, you move W
with 10% probability, you move N
with 10% probability, you move S
When you decide to move E:
with 80% probability, you move E
with 10% probability, you move N
with 10% probability, you move S
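The 80/10/10 noise model above can be written directly as a transition function. A minimal sketch for states given as \((x, y)\) coordinates; handling of walls and grid boundaries is omitted:

```python
# Intended move with probability 0.8; each perpendicular direction with 0.1.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERPENDICULAR = {"N": ("E", "W"), "S": ("E", "W"),
                 "E": ("N", "S"), "W": ("N", "S")}

def transition(state, action):
    """Return {next_state: probability} for the noisy grid world."""
    x, y = state
    outcomes = {}
    slips = PERPENDICULAR[action]
    for direction, prob in [(action, 0.8), (slips[0], 0.1), (slips[1], 0.1)]:
        dx, dy = MOVES[direction]
        nxt = (x + dx, y + dy)
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return outcomes

print(transition((1, 1), "N"))   # {(1, 2): 0.8, (2, 1): 0.1, (0, 1): 0.1}
```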
Markov Decision Processes: Racing
A robot car wants to travel far, quickly.
Three states:
Cool, Warm, Overheated
Two actions:
Slow Down; Go Fast
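One possible encoding of the racing MDP as transition and reward tables; the probabilities and rewards below are illustrative assumptions, not numbers given in the slides. The tables are shaped to plug directly into the value-iteration sketch later in the section.

```python
# (state, action) -> {next_state: probability}; "overheated" is terminal.
RACING_T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}

# (state, action, next_state) -> reward (assumed: fast pays more, overheating is costly).
RACING_R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "overheated"): -10.0,
}
```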
Infinite Rewards?
What if the game lasts forever?
Do we get infinite rewards?
Discounting: with \(0 \leqslant \gamma < 1\), the utility stays finite:
\(U\left(\left[r_0, \ldots, r_{\infty}\right]\right)=\sum_{t=0}^{\infty} \gamma^t r_t \leq R_{\max } /(1-\gamma)\)
Finite horizon: stop after a fixed number of steps.
(Gives non-stationary policies; \(\pi\) will depend on time left.)
Absorbing state: guarantee that a terminal state will eventually be reached.
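Why the bound holds: each term is at most \(\gamma^t R_{\max}\), and the geometric series sums to \(1/(1-\gamma)\):
\[
\sum_{t=0}^{\infty} \gamma^t r_t \;\leq\; R_{\max} \sum_{t=0}^{\infty} \gamma^t \;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \leqslant \gamma < 1 .
\]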
Markov Decision Processes: The Optimization Problem
Find:
\(\pi^*=\arg \max _\pi V^\pi(s), \;\forall s \in \mathcal{S}\)
Given:
\(\left(\mathcal{S}, \mathcal{A}, \mathcal{P}_{s s^{\prime}}^a, \mathcal{R}_{s s^{\prime}}^a, \gamma\right)\)
Optimal Quantities
The value (utility) of a state \(s\): \(V^*(s)=\) expected utility starting in \(s\) and acting optimally
The value (utility) of a q-state \((s, a)\): \(Q^*(s, a)=\) expected utility starting out having taken action \(a\) from state \(s\) and (thereafter) acting optimally
The optimal policy: \(\pi^*(s)=\) optimal action from state \(s\)
Optimality Strategy
1. Take an optimal first action
2. Keep acting optimally thereafter
Bellman Equations
\(V^*(s)=\max _a Q^*(s, a)\)
\(Q^*(s, a)=\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)
\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)
(Backup diagram: state \(s\), q-state \((s,a)\), successor state \(s^\prime\).)
\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
which, in the limit, converges to the fixed point
\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)
Value Iteration: Example
(Figure: grid-world value estimates at iterations \(V_0\), \(V_1\), and \(V_2\).)
\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
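A minimal value-iteration sketch in Python. The table shapes (transitions as \((s,a) \mapsto \{s^\prime: p\}\), rewards keyed by \((s,a,s^\prime)\)) and the fixed iteration count are choices made here for illustration:

```python
def value_iteration(states, actions, T, R, gamma, iterations=100):
    """Repeatedly apply V_{k+1}(s) = max_a sum_{s'} T(s,a,s')[R(s,a,s') + gamma V_k(s')].

    T maps (s, a) -> {s': probability}; R maps (s, a, s') -> reward.
    States with no available actions keep value 0 (treated as terminal).
    """
    V = {s: 0.0 for s in states}                          # V_0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            q_values = [
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions if (s, a) in T
            ]
            V_new[s] = max(q_values) if q_values else 0.0
        V = V_new
    return V

def greedy_policy(states, actions, T, R, gamma, V):
    """Extract pi(s) = argmax_a Q(s, a) from a table of values."""
    policy = {}
    for s in states:
        candidates = {
            a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2, p in T[(s, a)].items())
            for a in actions if (s, a) in T
        }
        policy[s] = max(candidates, key=candidates.get) if candidates else None
    return policy
```

Stopping after a fixed number of sweeps is crude; a common refinement is to stop once the largest change between \(V_k\) and \(V_{k+1}\) drops below a tolerance.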
Value Iteration: Convergence Intuition
\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
What is the difference between \(V_k\) and \(V_{k+1}\)?
\(V_{k}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{k-1}\left(s^{\prime}\right)\right]\)
\(V_{0}(s) \leftarrow 0\)
\(V_{1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)\right]\)
\(V_{2}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma {\color{IndianRed}V_{1}\left(s^{\prime}\right)}\right]\)
\(V_k\) and \(V_{k+1}\) differ by at most \(\gamma^k \max |R|\),
so the values converge as \(k\) increases whenever \(0 \leqslant \gamma < 1\).
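The same argument is often packaged as a contraction bound (a standard fact about this update, stated here for reference):
\[
\max_s \left|V_{k+1}(s)-V^*(s)\right| \;\leq\; \gamma \,\max_s \left|V_k(s)-V^*(s)\right|,
\]
so the worst-case error shrinks by a factor of \(\gamma\) every iteration.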
Policy Evaluation
\(Q^\pi(s, a)=E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0=s, a_0=a, \pi\right]\)
Markov Decision Processes: The Q-Function & the Value Function
The \(\gamma\) accounts for discounting.
\(V^\pi(s)=Q^\pi(s, \pi(s))\)
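For a fixed policy \(\pi\), this expectation can be computed by iterating the Bellman backup with the max removed. A minimal sketch, using the same table shapes as the value-iteration sketch and a `policy` dict mapping each state to an action (both assumptions of this sketch):

```python
def policy_evaluation(states, policy, T, R, gamma, iterations=100):
    """Iteratively estimate V^pi(s) for a fixed policy pi (no max over actions)."""
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V_new = {}
        for s in states:
            a = policy.get(s)
            if a is None or (s, a) not in T:
                V_new[s] = 0.0                 # terminal state or no action available
                continue
            V_new[s] = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                           for s2, p in T[(s, a)].items())
        V = V_new
    return V

def q_from_v(s, a, T, R, gamma, V):
    """Q^pi(s, a): take action a once, then follow the policy that produced V."""
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in T[(s, a)].items())
```

In the limit, `policy_evaluation(...)[s]` agrees with `q_from_v(s, policy[s], ...)`, mirroring \(V^\pi(s)=Q^\pi(s, \pi(s))\).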
FAI · MDPs
By Neeldhara Misra