Markov Decision Processes

Deterministic search vs. the real world

Stylized Example: Grid World

state: location of the bot

+1 reward for the diamond, -1 reward for the fire

noisy movement: actions may not go as planned

Markov Decision Processes: Ingredients

States: \(S\)

Actions: \(A\)

Transitions: \(T(s^\prime \mid s, a)\)

Rewards: \(R(s, a, s^\prime)\)

Start state: \(s_0\)

Discount: \(\gamma\)

Policy: a map from states to actions

Utility: sum of discounted rewards

Values: expected future utility from a state

Q-values: expected future utility from a state-action pair
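One minimal way to hold these ingredients in code (a sketch with field names of our choosing, not an API from the slides):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    """Container for the ingredients of a Markov decision process."""
    states: List[str]                                     # S
    actions: Callable[[str], List[str]]                   # A, as available actions per state
    transitions: Dict[Tuple[str, str], Dict[str, float]]  # T(s' | s, a) as {(s, a): {s': prob}}
    reward: Callable[[str, str, str], float]              # R(s, a, s')
    start_state: str                                      # s_0
    discount: float                                       # gamma
```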

How do you combine rewards?

\(r_1, r_2, r_3, \ldots\)

We can define some function \(f\)

\(f(r_1, r_2, r_3, \ldots)\)

that aggregates these rewards.

e.g., \(\sum_{i}r_i\)

or, e.g., \(\sum_{i}\gamma^i r_i\)

(rewards sooner are better than later)
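As a quick concrete check of the two aggregations above, a small Python sketch (the reward list and \(\gamma\) are made up purely for illustration):

```python
# Two ways to aggregate a reward sequence r_1, r_2, r_3, ...

def plain_sum(rewards):
    """f(r_1, r_2, ...) = sum_i r_i"""
    return sum(rewards)

def discounted_sum(rewards, gamma):
    """f(r_1, r_2, ...) = sum_i gamma^i r_i  (earlier rewards count more)"""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [1.0, 0.0, 10.0]                  # hypothetical reward sequence
print(plain_sum(rewards))                   # 11.0
print(discounted_sum(rewards, gamma=0.5))   # 1.0 + 0 + 0.25 * 10 = 3.5
```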

How do you combine rewards?

\(\begin{gathered}\left[a_1, a_2, \ldots\right]\succ\left[b_1, b_2, \ldots\right] \\ \text{iff} \\ {\left[r, a_1, a_2, \ldots\right] \succ\left[r, b_1, b_2, \ldots\right]}\end{gathered}\)

e.g., \(\sum_{i}r_i\)

or, e.g., \(\sum_{i}\gamma^i r_i\)

(rewards sooner are better than later)

(stationary preference assumption)
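Hedged aside (this is the standard result usually quoted alongside this assumption, not something derived on the slide): stationary preferences over reward sequences force the aggregation to be either the plain sum or the discounted sum,

\(f(r_1, r_2, \ldots) = \sum_i r_i \quad \text{or} \quad f(r_1, r_2, \ldots) = \sum_i \gamma^i r_i\)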

Discounting

Exit action has a reward of 10

Exit action has a reward of 1

(Figure: a row of grid states between the two exits, with L and R actions available in each state.)

Assume actions are deterministic

\(r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \)

What if \(\gamma = 1\)?

What if \(\gamma = 0.1\)?


What value of \(\gamma\) makes both L and R equally good when in the blue state?
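One hedged way to set this up (the step counts \(a\) and \(b\) are left symbolic because the exact layout lives in the figure): from the blue state, suppose the 10-reward exit is reached after \(a\) steps in one direction and the 1-reward exit after \(b\) steps in the other. With deterministic actions, the two choices are equally good when

\(10\,\gamma^{a} = 1\cdot\gamma^{b}\)

i.e., \(\gamma = 10^{-1/(a-b)}\) when \(a > b\).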

Markov Decision Processes: Policies

In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of actions, from start to a goal

For MDPs, we want an optimal policy,
i.e., a map \(\pi: S \rightarrow A\).

Optimal policies are denoted \(\pi^\star\).

Markov Decision Processes: Example


When you decide to move N:

with 80% probability, you move N

with 10% probability, you move E

with 10% probability, you move W


When you decide to move S:

with 80% probability, you move S

with 10% probability, you move E

with 10% probability, you move W


When you decide to move W:

with 80% probability, you move W

with 10% probability, you move N

with 10% probability, you move S


When you decide to move E:

with 80% probability, you move E

with 10% probability, you move N

with 10% probability, you move S
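The four cases above follow one pattern: 80% the intended direction, 10% each of the two perpendicular directions. A minimal Python sketch of that transition model (grid geometry, walls, and the state representation are left out; the helper names are hypothetical):

```python
import random

# Perpendicular directions for each intended move (assumed 4-connected grid).
PERPENDICULAR = {
    "N": ("E", "W"),
    "S": ("E", "W"),
    "E": ("N", "S"),
    "W": ("N", "S"),
}

def noisy_action(intended):
    """Direction actually taken: 80% intended, 10% each perpendicular."""
    left, right = PERPENDICULAR[intended]
    return random.choices([intended, left, right], weights=[0.8, 0.1, 0.1])[0]

def transition_distribution(intended):
    """T(s' | s, a) over directions, independent of the current state here."""
    left, right = PERPENDICULAR[intended]
    return {intended: 0.8, left: 0.1, right: 0.1}

print(transition_distribution("N"))  # {'N': 0.8, 'E': 0.1, 'W': 0.1}
```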


Markov Decision Processes: Racing

A robot car wants to travel far, quickly.

Three states:
Cool, Warm, Overheated

Two actions:
Slow Down; Go Fast

Infinite Rewards?

What if the game lasts forever?
Do we get infinite rewards?

Discounting: \(U\left(\left[r_0, r_1, \ldots\right]\right)=\sum_{t=0}^{\infty} \gamma^t r_t \le R_{\max }/(1-\gamma)\)

Finite horizon: gives non-stationary policies; \(\pi\) will depend on the time left.

Absorbing state: guarantee that a terminal state will eventually be reached.
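The bound on the discounted sum is just the geometric series, assuming every reward is at most \(R_{\max}\) and \(0 \le \gamma < 1\):

\(\sum_{t=0}^{\infty} \gamma^t r_t \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \dfrac{R_{\max}}{1-\gamma}\)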

Markov Decision Processes: The Optimization Problem

Find:
\(\pi^* = \underset{\pi}{\arg\max}\ V^\pi(s), \quad \forall s \in \mathcal{S}\)

Given:
\(\left(\mathcal{S}, \mathcal{A}, \mathcal{P}_{s s^{\prime}}^a, \mathcal{R}_{s s^{\prime}}^a, \gamma\right)\)

Optimal Quantities

The value (utility) of a state \(s\): \(V^*(s)=\) expected utility starting in \(s\) and acting optimally

The value (utility) of a q-state \((s, a)\):
\(Q^*(s, a)=\) expected utility starting out having taken action \(a\) from state \(s\) and (thereafter) acting optimally

The optimal policy: \(\pi^*(s)=\) optimal action from state \(s\)

Optimality Strategy

1. Take an optimal first action

2. Keep making optimal choices thereafter

Bellman Equations

\(V^*(s)=\max _a Q^*(s, a)\)

\(Q^*(s, a)=\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)

\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)

(Diagram: backup from state \(s\) through q-state \((s,a)\) to successor state \(s^\prime\).)

\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)

\(V^*(s)=\max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^*\left(s^{\prime}\right)\right]\)


Value Iteration: Example

(Figure: grids showing \(V_0\), \(V_1\), and \(V_2\).)

\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)
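A compact Python sketch of this update on a generic MDP (the tiny two-state MDP at the bottom is made up purely to exercise the code; its states, actions, and numbers are not from the slides):

```python
def value_iteration(states, actions, T, R, gamma, iterations=100):
    """Run V_{k+1}(s) <- max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V_k(s')].

    T[(s, a)] is a dict {s': probability}; R(s, a, s') is a reward function.
    """
    V = {s: 0.0 for s in states}  # V_0(s) = 0
    for _ in range(iterations):
        V = {
            s: max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())
                for a in actions(s)
            )
            for s in states
        }
    return V

# Hypothetical two-state MDP just to show the call signature.
states = ["A", "B"]
actions = lambda s: ["stay", "go"]
T = {
    ("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0},
}
R = lambda s, a, s2: 1.0 if s2 == "B" else 0.0
print(value_iteration(states, actions, T, R, gamma=0.9))
```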


Value Iteration: Convergence Intuition

\(V_{k+1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_k\left(s^{\prime}\right)\right]\)

What is the difference between \(V_k\) and \(V_{k+1}\)?

\(V_{k}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V_{k-1}\left(s^{\prime}\right)\right]\)

\(V_{0}(s) \leftarrow 0\)

\(V_{1}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)\right]\)

\(V_{2}(s) \leftarrow \max _a \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma {\color{IndianRed}V_{1}\left(s^{\prime}\right)}\right]\)

\(V_k\) and \(V_{k+1}\) differ by at most \(\gamma^k \max |R|\)

The values do converge as \(k\) increases if \(0 \leqslant \gamma < 1\).
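Written out, the claim behind this intuition:

\(\max_s \left|V_{k+1}(s)-V_k(s)\right| \le \gamma^k \max_{s,a,s^\prime}\left|R\left(s,a,s^\prime\right)\right|\)

which goes to 0 as \(k \to \infty\) whenever \(0 \le \gamma < 1\), so the sequence \(V_k\) converges (to the fixed point \(V^*\) of the Bellman equation).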

Policy Evaluation


\(V^\pi(s)=\) expected total discounted reward starting in \(s\) and following \(\pi\)

\(V^\pi(s)=\sum_{s^{\prime}} T\left(s, \pi(s), s^{\prime}\right)\left[R\left(s, \pi(s), s^{\prime}\right)+\gamma V^\pi\left(s^{\prime}\right)\right]\)

\(V_0^\pi(s)=0\)

\(V_{k+1}^\pi(s) \leftarrow \sum_{s^{\prime}} T\left(s, \pi(s), s^{\prime}\right)\left[R\left(s, \pi(s), s^{\prime}\right)+\gamma V_k^\pi\left(s^{\prime}\right)\right]\)
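A short Python sketch of this fixed-policy update (same generic MDP representation as the value-iteration sketch above; the helper names are ours, not from the slides):

```python
def policy_evaluation(states, policy, T, R, gamma, iterations=100):
    """Iterate V_{k+1}(s) <- sum_{s'} T(s, pi(s), s') [R(s, pi(s), s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}  # V_0^pi(s) = 0
    for _ in range(iterations):
        V = {
            s: sum(
                p * (R(s, policy[s], s2) + gamma * V[s2])
                for s2, p in T[(s, policy[s])].items()
            )
            for s in states
        }
    return V
```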

Policy Iteration

Start with an arbitrary policy.

Evaluation:

\(V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s^{\prime}} T\left(s, \pi_i(s), s^{\prime}\right)\left[R\left(s, \pi_i(s), s^{\prime}\right)+\gamma V_k^{\pi_i}\left(s^{\prime}\right)\right]\)

Improvement:

\(\pi_{i+1}(s)=\underset{{\color{IndianRed}a}}{{\color{IndianRed}\arg \max} } \sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma V^{\pi_i}\left(s^{\prime}\right)\right]\)
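Putting the two steps together, a sketch of the policy-iteration loop (it reuses the policy_evaluation helper sketched above; the stopping test simply checks that the policy stopped changing):

```python
def policy_iteration(states, actions, T, R, gamma):
    """Alternate evaluation of the current policy with greedy improvement until stable."""
    policy = {s: actions(s)[0] for s in states}  # start with an arbitrary policy
    while True:
        V = policy_evaluation(states, policy, T, R, gamma)   # evaluation step
        new_policy = {                                        # improvement step
            s: max(
                actions(s),
                key=lambda a: sum(
                    p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items()
                ),
            )
            for s in states
        }
        if new_policy == policy:  # no action changed: the policy is optimal
            return policy, V
        policy = new_policy
```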

Policy Iteration v. Value Iteration

Both value iteration and policy iteration compute the same thing (all optimal values).

In value iteration:

Every iteration updates both the values and (implicitly) the policy.

We don't track the policy, but taking the max over actions implicitly recomputes it.

In policy iteration:

We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them).

After the policy is evaluated, a new policy is chosen (slow like a value iteration pass).

The new policy will be better (or we're done).

Markov Decision Processes: The Q-Function & the Value Function

\(Q^\pi(s, a)=E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0=s, a_0=a, \pi\right]\)

The \(\gamma^{t-1}\) factor accounts for discounting.

\(V^\pi(s)=Q^\pi(s, \pi(s))\)
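For completeness, expanding \(Q^\pi\) one step gives the policy version of the Bellman equations above, with the policy's action substituted at the successor state:

\(Q^\pi(s, a)=\sum_{s^{\prime}} T\left(s, a, s^{\prime}\right)\left[R\left(s, a, s^{\prime}\right)+\gamma Q^\pi\left(s^{\prime}, \pi\left(s^{\prime}\right)\right)\right]\)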