Markov Decision Processes
Deterministic Search
Real World
Stylized Example: Grid World
state: location of the bot
+1 reward for the diamond, -1 reward for the fire
noisy movement: actions may not go as planned
Markov Decision Processes: Ingredients
States: \(S\)
Policy: map of states to actions
Q-values: expected future utility from a state+action pair
Actions: \(A\)
Transitions: \(T(s^\prime | s,a)\)
Rewards: \(R(s,a,s^\prime)\)
Start state: \(s_0\)
Utility: sum of discounted rewards
Values: expected future utility from a state
Discount: \(\gamma\)
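A minimal sketch of these core ingredients as a Python container; the field names and type aliases are illustrative choices, not part of the formal definition.

```python
# A sketch of the MDP ingredients as a plain Python container.
# Field names and type aliases are illustrative, not from the slides.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]      # e.g. a grid cell in Grid World
Action = str                 # e.g. "N", "S", "E", "W"

@dataclass
class MDP:
    states: List[State]                                         # S
    actions: List[Action]                                       # A
    transition: Callable[[State, Action], Dict[State, float]]   # T(s' | s, a)
    reward: Callable[[State, Action, State], float]             # R(s, a, s')
    start: State                                                # s0
    gamma: float                                                # discount
```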
How do you combine rewards?
\(r_1, r_2, r_3, \ldots\)
We can define some function \(f\)
\(f(r_1, r_2, r_3, \ldots)\)
that aggregates these rewards.
e.g., \(\sum_{i}r_i\)
or, e.g., \(\sum_{i}r_i \gamma^i\)
(rewards sooner are better than later)
How do you combine rewards?
\(\begin{gathered}\left[a_1, a_2, \ldots\right]\succ\left[b_1, b_2, \ldots\right] \\ \text{iff} \\ {\left[r, a_1, a_2, \ldots\right] \succ\left[r, b_1, b_2, \ldots\right]}\end{gathered}\)
e.g., \(\sum_{i}r_i\)
or, e.g., \(\sum_{i}r_i \gamma^i\)
(rewards sooner are better than later)
(stationary preference assumption)
Discounting
Exit action has a reward of 10
Exit action has a reward of 1
[Figure: a one-dimensional grid world with an exit at either end; each interior cell offers the actions L and R]
Assume actions are deterministic
\(r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \)
What if \(\gamma = 1\)?
What if \(\gamma = 0.1\)?
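A quick numerical check of the discounted sum; the reward sequence below is made up for illustration.

```python
# Numerical check of r1 + gamma*r2 + gamma^2*r3 + ...
# The reward sequence is an illustrative assumption, not from the slides.
def discounted_return(rewards, gamma):
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [0, 0, 10]                      # the reward only arrives on step 3
print(discounted_return(rewards, 1.0))    # 10.0 -- no discounting
print(discounted_return(rewards, 0.1))    # ~0.1 -- a delayed reward shrinks fast
```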
Discounting
Exit action has a reward of 10
Exit action has a reward of 1
[Figure: the same one-dimensional grid world, with one interior cell highlighted in blue]
Assume actions are deterministic
\(r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \)
What value of \(\gamma\) makes both L and R equally good when in the blue state?
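One hedged way to pin this down (the exact distances depend on the figure, so treat them as an assumption): if exiting with reward 10 happens two steps later than exiting with reward 1, then L and R are equally good exactly when

\(10\,\gamma^{t+2} = 1 \cdot \gamma^{t} \;\Longrightarrow\; \gamma^2 = \tfrac{1}{10} \;\Longrightarrow\; \gamma = \tfrac{1}{\sqrt{10}} \approx 0.316.\)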
Markov Decision Processes: Policies
In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of actions, from the start to a goal.
For MDPs, we want an optimal policy,
or a map \(\pi: S \rightarrow A\).
Optimal policies are denoted \(\pi^\star\).
Markov Decision Processes: Example
When you decide to move N:
with 80% probability, you move N
with 10% probability, you move E
with 10% probability, you move W
Markov Decision Processes: Example
When you decide to move S:
with 80% probability, you move S
with 10% probability, you move E
with 10% probability, you move W
Markov Decision Processes: Example
When you decide to move W:
with 80% probability, you move W
with 10% probability, you move N
with 10% probability, you move S
Markov Decision Processes: Example
When you decide to move E:
with 80% probability, you move E
with 10% probability, you move N
with 10% probability, you move S
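Collecting the four cases above, a sketch of the noisy transition model; only the 80/10/10 split comes from the slides, while the grid geometry and wall handling are assumptions.

```python
# Noisy Grid World moves: 80% intended direction, 10% each perpendicular one.
# The grid geometry / wall handling below is an illustrative assumption.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERPENDICULAR = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def transition(state, action, in_grid=lambda cell: True):
    """Return {next_state: probability}, i.e. T(s' | s, a)."""
    probs = {}
    for direction, p in [(action, 0.8),
                         (PERPENDICULAR[action][0], 0.1),
                         (PERPENDICULAR[action][1], 0.1)]:
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if not in_grid(nxt):              # bumping a wall keeps you in place
            nxt = state
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```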
Markov Decision Processes: Racing
A robot car wants to travel far, quickly.
Three states:
Cool, Warm, Overheated
Two actions:
Slow Down; Go Fast
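A sketch of the racing MDP as a transition table; every probability and reward below is an illustrative placeholder rather than a value from the lecture.

```python
# Racing MDP sketch: states Cool / Warm / Overheated, actions Slow / Fast.
# Every probability and reward below is an illustrative placeholder,
# not a number taken from the slides. Overheated is terminal.
# Format: (state, action) -> [(probability, next_state, reward), ...]
RACING = {
    ("cool", "slow"): [(1.0, "cool", 1)],
    ("cool", "fast"): [(0.5, "cool", 2), (0.5, "warm", 2)],
    ("warm", "slow"): [(0.5, "cool", 1), (0.5, "warm", 1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
}
```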
Infinite Rewards?
What if the game lasts forever?
Do we get infinite rewards?
Discounting
Finite Horizon
(Gives non-stationary policies;
\(\pi\) will depend on time left.)
\(U\left(\left[r_0, \ldots, r_{\infty}\right]\right)=\sum_{t=0}^{\infty} \gamma^t r_t \leq R_{\max }/(1-\gamma)\)
Absorbing State
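The \(R_{\max }/(1-\gamma)\) bound under discounting is just the geometric series: if every reward satisfies \(|r_t| \le R_{\max}\), then

\(\sum_{t=0}^{\infty} \gamma^t r_t \;\le\; R_{\max} \sum_{t=0}^{\infty} \gamma^t \;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1.\)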
Markov Decision Processes
The Optimization Problem
Find:
\(\pi^*=\arg \max _\pi V^\pi(s), \quad \forall s \in \mathcal{S}\)
Given:
\(\left(\mathcal{S}, \mathcal{A}, \mathcal{P}_{s s^{\prime}}^a, \mathcal{R}_{s s^{\prime}}^a, \gamma\right)\)
Optimal Quantities
The value (utility) of a state \(s\): \(V^*(s)=\) expected utility starting in \(s\) and acting optimally
The value (utility) of a q-state \((s, a)\): \(Q^*(s, a)=\) expected utility starting out having taken action \(a\) from state \(s\) and (thereafter) acting optimally
The optimal policy: \(\pi^*(s)=\) optimal action from state \(s\)
Optimality Strategy
1. Take an optimal first action
2. Continue acting optimally from the resulting state
Bellman Equations
\(V^*(s)=\max _a Q^*(s, a)\)
\(Q^*(s, a)=\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)
\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)
\(s\)
\((s,a)\)
\(s^\prime\)
\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)
Value Iteration: Example
[Figure: grid-world value estimates \(V_0\), \(V_1\), \(V_2\) after successive updates]
\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
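A minimal value iteration sketch implementing the update above; the T/R interfaces and the stopping tolerance are assumptions.

```python
# Value iteration: apply the Bellman update until the values stop changing.
# Assumed interfaces: T(s, a) -> {s': prob}, R(s, a, s') -> float; every state
# is assumed to have at least one action (terminals can self-loop with reward 0).
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}                          # V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in T(s, a).items())
                for a in actions
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```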
Value Iteration: Convergence Intuition
\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
What is the difference between \(V_k\) and \(V_{k+1}\)?
\(V_{k}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{k-1}\left(s^{\prime}\right)\right]\)
\(V_{0}(s) \leftarrow 0\)
\(V_{1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)\right]\)  (the \(\gamma V_0(s^{\prime})\) term drops out because \(V_0(s^{\prime}) = 0\))
\(V_{2}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma {\color{IndianRed}V_{1}\left(s^{\prime}\right)}\right]\)
\(V_k\) and \(V_{k+1}\) differ by at most \(\gamma^k \max |R|\): the extra rewards that \(V_{k+1}\) sees lie \(k\) steps in the future, so they are discounted by \(\gamma^k\).
The values do converge as \(k\) increases if \(0 \leqslant \gamma < 1\).
Policy Evaluation
Policy Evaluation
\(V^\pi(s)=\) expected total discounted rewards starting in \(s\) and following \(\pi\)
\(V^\pi(s)=\sum_{s^{\prime}} T\left(s, \pi(s), s^{\prime}\right)\left[R\left(s, \pi(s), s^{\prime}\right)+\gamma V^\pi\left(s^{\prime}\right)\right]\)
\(V_0^\pi(s)=0\)
\(V_{k+1}^\pi(s) \leftarrow \sum_{s^{\prime}} T\left(s, \pi(s), s^{\prime}\right)\left[R\left(s, \pi(s), s^{\prime}\right)+\gamma V_k^\pi\left(s^{\prime}\right)\right]\)
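The same iteration as code, with the action fixed by the policy so there is no max over actions; the T/R interfaces are the same assumptions as before.

```python
# Iterative policy evaluation: the Bellman update with the action fixed by pi.
def policy_evaluation(states, pi, T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}                          # V_0^pi(s) = 0
    while True:
        delta, V_new = 0.0, {}
        for s in states:
            a = pi[s]
            V_new[s] = sum(p * (R(s, a, s2) + gamma * V[s2])
                           for s2, p in T(s, a).items())
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < tol:
            return V
```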
Policy Iteration
Start with an arbitrary policy.
\(V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s^{\prime}} T\left(s, \pi_i(s), s^{\prime}\right)\left[R\left(s, \pi_i(s), s^{\prime}\right)+\gamma V_k^{\pi_i}\left(s^{\prime}\right)\right]\)
Evaluation
Improvement
\(\pi_{i+1}(s)=\underset{{\color{IndianRed}a}}{{\color{IndianRed}\arg \max} } \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\pi_i}\left(s^{\prime}\right)\right]\)
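Putting evaluation and improvement together, a policy iteration sketch; it reuses the policy_evaluation sketch above, and the interfaces remain assumptions.

```python
# Policy iteration: evaluate the current policy, then improve it greedily.
def policy_iteration(states, actions, T, R, gamma):
    pi = {s: actions[0] for s in states}                  # arbitrary initial policy
    while True:
        V = policy_evaluation(states, pi, T, R, gamma)    # evaluation step
        stable = True
        for s in states:                                  # improvement step
            best = max(actions, key=lambda a: sum(
                p * (R(s, a, s2) + gamma * V[s2])
                for s2, p in T(s, a).items()))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                                        # no change => done
            return pi, V
```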
Policy Iteration v. Value Iteration
Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration:
Every iteration updates both the values and (implicitly) the policy.
We don't track the policy, but taking the max over actions implicitly recomputes it.
In policy iteration:
We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them).
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass).
The new policy will be better (or we're done).
Reinforcement Learning
\(s\)
\((s,a)\)
\(s^\prime\)
\((s^\prime,a^\prime)\)
\(s^{\prime\prime}\)
Input: \(S,A\) and a sequence
\(s_0,a_0,r_0; s_1,a_1,r_1; \cdots\)
In particular, \(T\) and \(R\) are unknown.
\(\begin{aligned} Q^{+}(s, a) & =r_t+\gamma~{\color{DodgerBlue}\max _{a^{\prime}}}~Q\left(s^{\prime}, a^{\prime}\right) \\ \delta & =Q^{+}(s, a)-Q(s, a) \\ Q(s, a) & =Q(s, a)+\alpha \delta\end{aligned}\)
Noisy Estimate of \(Q\):
\(\pi^\epsilon(s) \triangleq \begin{cases}\operatorname{argmax}_a Q^\pi\left(s, a\right), & \text { with probability } 1-\epsilon \\ \operatorname{UniformRandom}(\mathcal{A}), & \text { with probability } \epsilon\end{cases}\)
Policy Improvement
Q-Learning
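A tabular sketch of this update and the \(\epsilon\)-greedy policy; the learning rate \(\alpha\), the table representation, and the function names are assumptions.

```python
import random
from collections import defaultdict

# Tabular Q-learning: bootstrap with the max over next actions (off-policy).
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon):
    """The noisy (exploring) policy used while learning."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)    # Q(s, a) starts at 0 for every state-action pair
```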
\(s\)
\((s,a)\)
\(s^\prime\)
\((s^\prime,a^\prime)\)
\(s^{\prime\prime}\)
Input: \(S,A\) and a sequence
\(s_0,a_0,r_0; s_1,a_1,r_1; \cdots\)
In particular, \(T\) and \(R\) are unknown.
\(\begin{aligned} Q^{+}(s, a) & =r_t+\gamma Q\left(s^{\prime}, a^{\prime}\right) \\ \delta & =Q^{+}(s, a)-Q(s, a) \\ Q(s, a) & =Q(s, a)+\alpha \delta\end{aligned}\)
Noisy Estimate of \(Q\):
\(\pi^\epsilon(s) \triangleq \begin{cases}\operatorname{argmax}_a Q^\pi\left(s, a\right), & \text { with probability } 1-\epsilon \\ \operatorname{UniformRandom}(\mathcal{A}), & \text { with probability } \epsilon\end{cases}\)
Policy Improvement
SARSA
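The SARSA update in the same style; the only change from Q-learning is the bootstrap term.

```python
# SARSA: same update shape, but the bootstrap uses the action a' actually
# taken next by the (epsilon-greedy) behaviour policy -- on-policy learning.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```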
Reinforcement Learning
*The rest of the lecture in the video is highly recommended too!
\(Q^\pi(s, a)=E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0=s, a_0=a, \pi\right]\)
Markov Decision Processes
The Q-Function & the Value Function
The \(\gamma\) accounts for discounting.
\(V^\pi(s)=Q^\pi(s, \pi(s))\)
FAI · MDPs
By Neeldhara Misra