Markov Decision Processes

Deterministic Search

Real World

Stylized Example: Grid World

state: location of the bot

+1 reward for the diamond, -1 reward for the fire

noisy movement: actions may not go as planned

Markov Decision Processes: Ingredients

States: \(S\)

Actions: \(A\)

Transitions: \(T(s^\prime | s,a)\)

Rewards: \(R(s,a,s^\prime)\)

Start state: \(s_0\)

Discount: \(\gamma\)

Policy: map of states to actions

Utility: sum of discounted rewards

Values: expected future utility from a state

Q-values: expected future utility from a state+action pair
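A minimal sketch of how these ingredients could be bundled in code; the container and field names below are illustrative, not from the slides.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

class MDP(NamedTuple):
    """One possible container for the ingredients listed above."""
    states: List[str]                  # S
    actions: List[str]                 # A
    # transitions[(s, a)] -> list of (s_next, probability) pairs, i.e. T(s' | s, a)
    transitions: Dict[Tuple[str, str], List[Tuple[str, float]]]
    reward: Callable[[str, str, str], float]   # R(s, a, s')
    start_state: str                   # s0
    gamma: float                       # discount
```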

How do you combine rewards?

\(r_1, r_2, r_3, \ldots\)

We can define some function \(f\)

\(f(r_1, r_2, r_3, \ldots)\)

that aggregates these rewards.

e.g., \(\sum_{i}r_i\)

or, e.g., \(\sum_{i}r_i \gamma^i\)

(rewards sooner are better than later)

How do you combine rewards?

\(\begin{gathered}\left[a_1, a_2, \ldots\right]\succ\left[b_1, b_2, \ldots\right] \\ \text{iff} \\ {\left[r, a_1, a_2, \ldots\right] \succ\left[r, b_1, b_2, \ldots\right]}\end{gathered}\)

e.g., \(\sum_{i}r_i\)

or, e.g., \(\sum_{i}r_i \gamma^i\)

(rewards sooner are better than later)

(stationary preference assumption)
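A minimal sketch of the two aggregation choices above, with illustrative function names (plain sum versus discounted sum):

```python
def total_reward(rewards):
    """f(r_1, r_2, ...) = sum_i r_i"""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """r_1 + gamma*r_2 + gamma^2*r_3 + ...  (earlier rewards count more when gamma < 1)"""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# With gamma = 0.5, an early reward beats an equal but later one:
print(discounted_return([1, 0, 0], 0.5))  # 1.0
print(discounted_return([0, 0, 1], 0.5))  # 0.25
```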

Discounting

Exit action has a reward of 10

Exit action has a reward of 1

(A row of grid cells; from each intermediate cell the available actions are L and R.)

Assume actions are deterministic

\(r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \)

What if \(\gamma = 1\)?

What if \(\gamma = 0.1\)?
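A hedged worked comparison, assuming (as an illustration) that going L collects the 10 as the third reward and going R collects the 1 as the first reward:

With \(\gamma = 1\): L earns \(10\) and R earns \(1\), so L is better from every state.

With \(\gamma = 0.1\): L earns \(10 \cdot (0.1)^2 = 0.1\) and R earns \(1\), so the nearer exit wins.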

Discounting

Exit action has a reward of 10

Exit action has a reward of 1

(A row of grid cells; from each intermediate cell the available actions are L and R.)

Assume actions are deterministic

\(r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \)

What value of \(\gamma\) makes both L and R equally good when in the blue state?
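A hedged worked answer, assuming the blue state collects the 10 as its third reward via L and the 1 as its first reward via R: the two are equally good when \(10\gamma^{2} = 1\), i.e. \(\gamma = 1/\sqrt{10} \approx 0.316\).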

Markov Decision Processes: Policies

In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of actions, from start to a goal

For MDPs, we want an optimal policy,
or a map from \(S \rightarrow A\).

Optimal policies are denoted \(\pi^\star\).

Markov Decision Processes: Example

When you decide to move N:

with 80% probability, you move N

with 10% probability, you move E

with 10% probability, you move W

Markov Decision Processes: Example

When you decide to move S:

with 80% probability, you move S

with 10% probability, you move E

with 10% probability, you move W

Markov Decision Processes: Example

When you decide to move W:

with 80% probability, you move W

with 10% probability, you move N

with 10% probability, you move S

Markov Decision Processes: Example

When you decide to move E:

with 80% probability, you move E

with 10% probability, you move N

with 10% probability, you move S
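The four cases above can be written as one lookup table; a minimal sketch (the compass-letter encoding is an assumption about the grid figure):

```python
# For each intended move: 80% the intended direction,
# 10% each of the two perpendicular directions.
NOISE = {
    "N": [("N", 0.8), ("E", 0.1), ("W", 0.1)],
    "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
    "W": [("W", 0.8), ("N", 0.1), ("S", 0.1)],
    "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
}
```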


Markov Decision Processes: Racing

A robot car wants to travel far, quickly.

Three states:
Cool, Warm, Overheated

Two actions:
Slow Down; Go Fast

Infinite Rewards?

What if the game lasts forever?
Do we get infinite rewards?

Discounting

Finite Horizon

(Gives non-stationary policies;
\(\pi\) will depend on time left.)

\(U\left(\left[r_0, r_1, \ldots\right]\right)=\sum_{t=0}^{\infty} \gamma^t r_t \leq R_{\max }/(1-\gamma)\)

Absorbing State
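Why discounting keeps the sum finite (a standard geometric-series bound, assuming every reward satisfies \(|r_t| \le R_{\max}\) and \(0 \le \gamma < 1\)):

\(\sum_{t=0}^{\infty} \gamma^t r_t \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}\)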

Markov Decision Processes

The Optimization Problem

Find:
\(\pi^*=\arg\max_\pi V^\pi(s), \quad \forall s \in \mathcal{S}\)

Given:
\(\left(\mathcal{S}, \mathcal{A}, \mathcal{P}_{s s^{\prime}}^a, \mathcal{R}_{s s^{\prime}}^a, \gamma\right)\)

Optimal Quantities

The value (utility) of a state \(s\): \(V^*(s)=\) expected utility starting in \(s\) and acting optimally

The value (utility) of a q-state \((s, a)\):
\(Q^*(s, a)=\) expected utility starting out having taken action \(a\) from state \(s\) and (thereafter) acting optimally

The optimal policy: \(\pi^*(s)=\) optimal action from state \(s\)

Optimality Strategy

1. Take an optimal first action

2. Continue to act optimally thereafter

Bellman Equations

\(V^*(s)=\max _a Q^*(s, a)\)

\(Q^*(s, a)=\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)

\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)

(Tree of a state \(s\), its q-states \((s,a)\), and successor states \(s^\prime\).)

\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)

\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)


Value Iteration: Example

(Grids showing \(V_0\), \(V_1\), and \(V_2\).)

\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
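A minimal sketch of this update in code, assuming the model is given as a dictionary mapping (s, a) to a list of (s', probability, reward) triples; the representation and function names are assumptions, not from the slides.

```python
def value_iteration(states, actions, model, gamma, iterations=100):
    """Repeatedly apply
    V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V_k(s')].

    model[(s, a)] is assumed to list (s_next, prob, reward) triples;
    missing (s, a) entries contribute 0 (e.g. exit states).
    """
    V = {s: 0.0 for s in states}          # V_0(s) = 0
    for _ in range(iterations):
        V_next = {}
        for s in states:
            V_next[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in model.get((s, a), []))
                for a in actions
            )
        V = V_next                         # synchronous update: V_k -> V_{k+1}
    return V


def extract_policy(V, states, actions, model, gamma):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V(s')]."""
    return {
        s: max(actions,
               key=lambda a: sum(p * (r + gamma * V[s2])
                                 for s2, p, r in model.get((s, a), [])))
        for s in states
    }
```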

Value Iteration: Convergence Intuition

\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)

What is the difference between \(V_k\) and \(V_{k+1}\)?

\(V_{k}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{k-1}\left(s^{\prime}\right)\right]\)

\(V_{0}(s) \leftarrow 0\)

\(V_{1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{0}\left(s^{\prime}\right)\right] = \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right) R\left(s, a, s^{\prime}\right)\)

\(V_{2}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma\, {\color{IndianRed}V_{1}\left(s^{\prime}\right)}\right]\)

\(V_k\) and \(V_{k+1}\) differ by at most \(\gamma^k \max |R|\).

The values do converge as \(k\) increases if \(0 \leqslant \gamma < 1\).
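A sketch of why this gives convergence: taking \(\max_a\) and averaging over \(s^\prime\) are both non-expansive, so

\(\left\|V_{k+1}-V_k\right\|_{\infty} \le \gamma\left\|V_k-V_{k-1}\right\|_{\infty} \le \cdots \le \gamma^k\left\|V_1-V_0\right\|_{\infty} \le \gamma^k \max |R|,\)

and the iterates form a Cauchy sequence whenever \(0 \le \gamma < 1\).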

Policy Evaluation

\(Q^\pi(s, a)=E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0=s, a_0=a, \pi\right]\)
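A minimal sketch of evaluating a fixed policy \(\pi\) by iterating the standard one-step recursion for \(V^\pi\); the (s', probability, reward) model representation is the same assumption as in the value-iteration sketch above.

```python
def policy_evaluation(states, policy, model, gamma, iterations=100):
    """Estimate V^pi by sweeping
    V(s) <- sum_{s'} T(s, pi(s), s') * [R(s, pi(s), s') + gamma * V(s')].

    policy[s] gives pi(s); model[(s, a)] lists (s_next, prob, reward) triples.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {s: sum(p * (r + gamma * V[s2])
                    for s2, p, r in model.get((s, policy[s]), []))
             for s in states}
    return V


def q_value(V, model, gamma, s, a):
    """Q^pi(s, a) = sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V^pi(s')]."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in model.get((s, a), []))
```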

Markov Decision Processes

The Q-Function
& the Value Function

The \(\gamma\) accounts for discounting.

\(V^\pi(s)=Q^\pi(s, \pi(s))\)

FAI · MDPs

By Neeldhara Misra