CS 4/5789: Introduction to Reinforcement Learning

Lecture 3: Dynamic Programming

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Announcements

  • Questions about waitlist/enrollment?
  • Homework released this week
    • Problem Set 1 released, due Monday 2/5 at 11:59pm
    • Programming Assignment 1 released Wednesday
  • My office hours:
    • Mondays 4:10-5:10pm in Olin 255 (right after lecture)
    • Today's office hours are shortened: only until 4:30pm
  • CIS Partner Finding Social: January 31, 5-7 pm at Gates G01

Agenda

1. Recap

2. Value Function

3. Optimal Policy

4. Dynamic Programming

Recap: Finite Horizon MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\) defined by states, actions, reward, transition, horizon

(agent-environment loop: in state \(s_t\in\mathcal S\), the agent takes action \(a_t\in\mathcal A\), receives reward \(r_t= r(s_t, a_t)\), and the state transitions as \(s_{t+1}\sim P(s_t, a_t)\))

in today's lecture, \(r(s,a)\) is deterministic

Goal: maximize expected cumulative reward

$$\max_\pi ~\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1}  r(s_k, a_k) \right]$$

Recap: Finite Horizon MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\) defined by states, actions, reward, transition, horizon

probability of trajectory \(\tau=(s_0,a_0,...,s_{H-1},a_{H-1})\) under \(P\), policy \(\pi\), initial distribution \(\mu_0\)

Today's lecture: two big questions

  1. How to efficiently compute the expected reward of a given policy?
  2. How to efficiently find a policy that maximizes expected reward?

(interaction loop: \(a_t=\pi_t(s_t)\), \(r_t= r(s_t, a_t)\), \(s_{t}\sim P(s_{t-1}, a_{t-1})\))

Agenda

1. Recap

2. Value Function

3. Optimal Policy

4. Dynamic Programming

Value Function

The value of a state \(s\) under a policy \(\pi\) at time \(t\) is the expected cumulative reward-to-go:

$$V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1}  r(s_k, a_k) \mid s_t=s,s_{k+1}\sim P(s_k, a_k),a_k\sim \pi_k(s_k)\right]$$

(timeline: \(s_t, a_t, s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots, s_{H-1}, a_{H-1}\))

Example

(diagram: two-state MDP with states \(0\) and \(1\); from \(0\), stay remains at \(0\) w.p. \(1\) and move goes to \(1\) w.p. \(1\); from \(1\), stay remains w.p. \(p_1\) and returns to \(0\) w.p. \(1-p_1\), while move returns to \(0\) w.p. \(p_2\) and remains w.p. \(1-p_2\))

  • Recall the simple MDP example
    • Actions work as intended always in \(s=0\), but only w.p. \(p_1\) (stay) and \(p_2\) (move) in \(s=1\)
  • Suppose the reward is:
    • \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
  • Consider the policy
    • \(\pi(s)=\)stay for all \(s\)
  • Simulate reward sequences (see the sketch below)
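A minimal simulation sketch, assuming an illustrative value \(p_1=0.7\) and horizon \(H=10\) (these numbers and the helper names are not from the lecture):

```python
import numpy as np

# Simulate reward sequences in the two-state example under pi(s) = stay.
# p1, H, and the seed are illustrative choices, not values from the lecture.
rng = np.random.default_rng(0)
p1, H = 0.7, 10

def step_stay(s):
    """One transition under 'stay': state 0 is absorbing;
    state 1 remains at 1 w.p. p1 and falls to 0 w.p. 1 - p1."""
    if s == 0:
        return 0
    return 1 if rng.random() < p1 else 0

def rollout(s0):
    """Return the reward sequence r_0, ..., r_{H-1} starting from s0."""
    s, rewards = s0, []
    for _ in range(H):
        rewards.append(1 if s == 0 else 0)  # r(0,a) = 1, r(1,a) = 0
        s = step_stay(s)
    return rewards

print(rollout(1))  # e.g. [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```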

Example

(diagram: the two-state MDP under \(\pi(s)=\)stay; state \(0\) is absorbing, while state \(1\) remains at \(1\) w.p. \(p_1\) and moves to \(0\) w.p. \(1-p_1\))
  • If \(s_t=0\) then \(s_k=0\) for all \(k\geq t\)
  • PollEV: \(V_t^\pi(0) = \sum_{k=t}^{H-1}  r(0,\mathsf{stay})\)
    • \(=\sum_{k=t}^{H-1}  1 = H-t\)
  • If \(s_t=1\), \(V_t^\pi(1)\)...
    • consider the first time \(T>t\) such that \(s_k=1\) for \(t\leq k<T\) and \(s_k=0\) for \(k\geq T\)
    • after the transition to \(0\), the remaining reward is \(H-T\)
    • compute the expectation over \(T\) (worked out below)
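Carrying out this expectation (a sketch following the bullets above): under the stay policy, \(\Pr(T=t+j)=(1-p_1)\,p_1^{j-1}\) for \(j\geq 1\), and the remaining reward is \(H-T\) when \(T\leq H-1\) and \(0\) otherwise, so

$$V_t^\pi(1) = \sum_{j=1}^{H-1-t} (1-p_1)\,p_1^{j-1}\,(H-t-j)$$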


Example


  • If \(s_t=0\) then \(s_k=0\) for all \(k\geq t\)
  • PollEV: \(V_t^\pi(0) = \sum_{k=t}^{H-1}  r(0,\mathsf{stay})\)
    • \(=\sum_{k=t}^{H-1}  1 = H-t\)
  • If \(s_t=1\), \(V_t^\pi(1)\)...
    • consider the first time \(T>t\) such that \(s_k=1\) for \(t\leq k<T\) and \(s_k=0\) for \(k\geq T\)
    • after the transition to \(0\), the remaining reward is \(H-T\)
    • compute the expectation over \(T\)
  • What about the policy \(\pi(s)=\)move for all \(s\)?

(diagram: the two-state MDP under \(\pi(s)=\)move; from state \(0\), move goes to \(1\) w.p. \(1\); from state \(1\), move returns to \(0\) w.p. \(p_2\) and remains at \(1\) w.p. \(1-p_2\))

Bellman Consistency Equation

  • The value function \(V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k,a_k) \mid s_t=s \right]\)
  • Bellman Consistency Equation: \(\forall s\), $$V_t^{\pi}(s) = \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) +  \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]$$

    • Exercise: review the proof below

  • Enables policy evaluation (i.e. computing \(V_t^\pi\)) by backwards iteration

    1. Initialize \(V_H^\pi(s) =0\) for all \(s\in\mathcal S\)

    2. For \(t=H-1,H-2,...,0\): $$V_t^{\pi}(s)=\mathbb{E}_{a \sim \pi_t(s)} \left[ Q_t^\pi(s,a) \right] ~~\forall ~s\in\mathcal S$$ where \(Q_t^\pi(s,a) = r(s, a) +  \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')]\) is the state-action value (Q) function

  • Total complexity to compute all values is \(O(S^2AH)\) (a code sketch of this iteration follows below)
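A minimal sketch of this backwards policy evaluation for a tabular MDP; the function name, array names, and the \((S,A,S)\), \((S,A)\), \((H,S)\) layouts are assumptions for illustration, not from the lecture:

```python
import numpy as np

def policy_evaluation(P, r, pi, H):
    """Backwards policy evaluation for a tabular finite-horizon MDP.

    P  : (S, A, S) array, P[s, a] is the next-state distribution
    r  : (S, A) array of deterministic rewards
    pi : (H, S) integer array, pi[t, s] is the action at time t
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))           # V_H^pi = 0 for all states
    Q = np.zeros((H, S, A))
    for t in range(H - 1, -1, -1):     # t = H-1, H-2, ..., 0
        # Q_t^pi(s, a) = r(s, a) + E_{s' ~ P(s, a)}[V_{t+1}^pi(s')]
        Q[t] = r + P @ V[t + 1]
        # V_t^pi(s) = Q_t^pi(s, pi_t(s)) for a deterministic policy
        V[t] = Q[t][np.arange(S), pi[t]]
    return V, Q
```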

Proof

  • \(V_t^\pi(s) = \mathbb E\left[r(s_t,a_t) + \sum_{k=t+1}^{H-1} r(s_k,a_k) \mid s_t=s, \pi, P \right]\)
  • \(= \mathbb{E}[r(s_t,a_t)\mid s_t=s, \pi, P ] + \mathbb{E}[\sum_{k=t+1}^{H-1} r(s_{k},a_{k}) \mid s_t=s, \pi,P]\)
    (linearity of expectation)
  • First term: \(= \mathbb{E}[r(s,a)\mid a\sim\pi_t(s) ] \) (simplify dependence)
  • Second term:
    • \(=\mathbb{E}[\mathbb{E}[\sum_{k=t+1}^{H-1} r(s_{k},a_{k}) \mid s_{t+1}=s', \pi, P] \mid s_t=s, \pi, P]\)
      (tower property of conditional expectation)
    • \(=  \mathbb{E}[\mathbb{E}[\sum_{k=t+1}^{H-1} r(s_{k},a_{k}) \mid s_{t+1}=s', \pi, P] \mid a\sim \pi_t(s),s'\sim P(s,a)]\)
      (Markov property)
    • \(=  \mathbb{E}[\mathbb{E}[V^\pi_{t+1}(s')\mid s'\sim P(s,a)] \mid a\sim \pi_t(s)]\) (defn of \(V\) and tower)
  • Combine terms & use linearity of conditional expectation

Example

(diagram: the two-state MDP under \(\pi(s)=\)stay; state \(0\) is absorbing, while state \(1\) remains at \(1\) w.p. \(p_1\) and moves to \(0\) w.p. \(1-p_1\))

Recall \(r(0,a)=1\) and \(r(1,a)=0\)

  • \(\pi(s)=\)stay so:
    • \(V_t^{\pi}(s) = r(s, \pi(s)) +  \mathbb{E}_{s' \sim P( s, \pi(s))} [V_{t+1}^\pi(s')] \)
  • \(V^\pi_H = \begin{bmatrix}0\\ 0\end{bmatrix}\)
  • \(V^\pi_{H-1} = \begin{bmatrix}1\\ 0\end{bmatrix}\)
  • \(V^\pi_{H-2} = \begin{bmatrix}2\\ 1-p_1\end{bmatrix}\)
  • ... (a quick numeric check of these backups appears below)
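A self-contained numeric check of the first few backups, assuming an illustrative value \(p_1=0.3\) (any \(p_1\in[0,1]\) works the same way):

```python
# Bellman consistency backups for the stay policy in the two-state example.
# p1 = 0.3 is an illustrative choice, not a value from the lecture.
p1 = 0.3

def backup(V):
    """One step of V_t^pi(s) = r(s, stay) + E[V_{t+1}^pi(s')]."""
    return [1.0 + V[0],                          # state 0: r = 1, stays at 0
            0.0 + p1 * V[1] + (1 - p1) * V[0]]   # state 1: r = 0

V = [0.0, 0.0]   # V_H^pi
V = backup(V)    # V_{H-1}^pi = [1, 0]
V = backup(V)    # V_{H-2}^pi = [2, 1 - p1]
print(V)         # [2.0, 0.7]
```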

Agenda

1. Recap

2. Value Function

3. Optimal Policy

4. Dynamic Programming

Today's lecture: two big questions

  1. How to efficiently compute the expected reward of a given policy?
  2. How to efficiently find a policy that maximizes expected reward?

(interaction loop: \(a_t=\pi_t(s_t)\), \(r_t= r(s_t, a_t)\), \(s_{t}\sim P(s_{t-1}, a_{t-1})\))

Optimal Policy

  • Consider all possible policies \(\Pi\), including stochastic, history-dependent, time-dependent  (most general)
  • Define: An optimal policy \(\pi_\star\) is one where \(V_t^{\pi_\star}(s) \geq V_t^{\pi}(s)\) for all \(t\), \(s\in\mathcal S\), and policies \(\pi\in\Pi\)
    • i.e. the policy dominates other policies for all states
    • vector notation: \(V_t^{\pi_\star}(s) \geq V_t^{\pi}(s)~\forall~s\iff V_t^{\pi_\star} \geq V_t^{\pi}\)
  • Thus we can write \(V^\star(s) = V^{\pi_\star}(s)\)

Goal: maximize expected cumulative reward

$$\max_\pi ~\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1}  r(s_k, a_k) \right]$$

$$=\max_\pi \mathbb E_{s\sim\mu_0}\left[V_0^\pi(s)\right]$$

Optimal Policy

  • Consider all possible policies \(\Pi\), including stochastic, history-dependent, time-dependent  (most general)
  • Define: An optimal policy \(\pi_\star\) is one where \(V_t^{\pi_\star}(s) \geq V_t^{\pi}(s)\) for all \(t\), \(s\in\mathcal S\), and policies \(\pi\in\Pi\)
  • Thus we can write \(V^\star(s) = V^{\pi_\star}(s)\)
  • Notice that the optimal policy \(\pi^\star\) does not depend on the starting distribution \(\mu_0\)

Goal: maximize expected cumulative reward

$$\max_\pi \mathbb E_{s\sim\mu_0}\left[V_0^\pi(s)\right]$$

Optimal Policy

  • Consider all possible policies \(\Pi\), including stochastic, history-dependent, time-dependent  (most general)
  • Define: An optimal policy \(\pi_\star\) is one where \(V_t^{\pi_\star}(s) \geq V_t^{\pi}(s)\) for all \(t\), \(s\in\mathcal S\), and policies \(\pi\in\Pi\)
  • Corollary (of next slide): For any finite horizon MDP, there exists a deterministic, state-dependent, time-dependent policy which is optimal

Goal: maximize expected cumulative reward

$$\max_\pi \mathbb E_{s\sim\mu_0}\left[V_0^\pi(s)\right]$$

Bellman Optimality Equation

  • Bellman Optimality Equation (BOE): A value function
    \(V=(V_0,\dots,V_{H-1})\) satisfies the BOE if for all \(s\), $$V_t(s)=\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V_{t+1}(s')]$$
  • Theorem (Bellman Optimality):

    1. \(\pi\) is an optimal policy if and only if \(V^{\pi}\) satisfies the BOE
    2. The optimal policy is greedy with respect to the optimal value function $$\pi_t^\star(s) \in \arg\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$

The quantity \(r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]\) is denoted by the shorthand \(Q_t^\star(s,a)\).

Agenda

1. Recap

2. Value Function

3. Optimal Policy

4. Dynamic Programming

Dynamic Programming

  • Initialize \(V^\star_H = 0\)
  • For \(t=H-1, H-2, ..., 0\):
    • \(Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]\)
    • \(\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)\)
    • \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
  • The BOE leads to a backwards induction (sketched in code below)
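A minimal sketch of this backwards induction for a tabular MDP, using the same assumed array layout as the policy evaluation sketch above (\(P\) of shape \((S,A,S)\), \(r\) of shape \((S,A)\)); the names are illustrative:

```python
import numpy as np

def dynamic_programming(P, r, H):
    """Backwards induction (finite-horizon dynamic programming).

    Returns the greedy optimal policy pi[t, s] and values V[t, s].
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))              # V*_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = r + P @ V[t + 1]              # Q*_t(s,a) = r(s,a) + E[V*_{t+1}(s')]
        pi[t] = np.argmax(Q, axis=1)      # pi*_t(s) in argmax_a Q*_t(s,a)
        V[t] = np.max(Q, axis=1)          # V*_t(s) = Q*_t(s, pi*_t(s))
    return pi, V
```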

Example

  • Reward:  \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
  • \(Q^{\star}_{H-1}(s,a)=r(s,a)\) for all \(s,a\)
  • \(\pi^\star_{H-1}(s)=\)stay for all \( s\)
  • \(V^\star_{H-1}=\begin{bmatrix}1\\0\end{bmatrix}\), \(Q^\star_{H-2}=\begin{bmatrix}2&\frac{1}{2}\\p & -\frac{1}{2}+2p\end{bmatrix}\)
  • \(\pi^\star_{H-2}(s)=\)stay for all \(s\) (assuming \(p\leq\frac{1}{2}\), so that \(p\geq -\frac{1}{2}+2p\))
  • \(V^\star_{H-2}=\begin{bmatrix}2\\p\end{bmatrix}\), \(Q^\star_{H-3}=\begin{bmatrix}3&\frac{1}{2}+p\\(1-p)p+2p & -\frac{1}{2}+(1-2p)p+4p\end{bmatrix}\)
  • \(\pi^\star_{H-3}(0)=\)stay and \(\pi^\star_{H-3}(1)=\)switch if \(p\geq 1-\frac{1}{\sqrt{2}}\)
  • ... (a numeric check of these backups appears below)

(diagram: two-state MDP with states \(0\) and \(1\); from \(0\), stay remains at \(0\) w.p. \(1\) and switch goes to \(1\) w.p. \(1\); from \(1\), stay goes to \(0\) w.p. \(p\) and remains w.p. \(1-p\), while switch goes to \(0\) w.p. \(2p\) and remains w.p. \(1-2p\))
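A self-contained numeric check of these backups, assuming an illustrative value \(p=0.2\) (which is below \(1-\frac{1}{\sqrt{2}}\approx 0.29\), so stay remains greedy in state \(1\) at step \(H-3\)):

```python
import numpy as np

# Transitions and rewards of the two-state example; actions: 0 = stay, 1 = switch.
# p = 0.2 is an illustrative choice, not a value from the lecture.
p = 0.2
P = np.array([[[1.0, 0.0], [0.0, 1.0]],            # from state 0
              [[p, 1 - p], [2 * p, 1 - 2 * p]]])   # from state 1
r = np.array([[1.0, 0.5],                          # r(0, stay), r(0, switch)
              [0.0, -0.5]])                        # r(1, stay), r(1, switch)

V = np.zeros(2)                                    # V*_H
for step in ["H-1", "H-2", "H-3"]:
    Q = r + P @ V                                  # Q*_t = r + E[V*_{t+1}]
    V = Q.max(axis=1)
    print(step, Q.round(3), Q.argmax(axis=1))      # greedy action per state
# e.g. at step H-2 this prints Q = [[2, 0.5], [0.2, -0.1]] with greedy actions [0, 0]
```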

Recap

  • PSet 1 released
  • Office hours right after lecture

 

  • Value & Q Function
  • Optimal Policy
  • Dynamic Programming

 

  • Next lecture: infinite horizon MDPs
