CS 4/5789: Introduction to Reinforcement Learning

Lecture 4: Infinite Horizon Discounted MDPs

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Announcements

  • Questions about waitlist/enrollment?
  • Homework
    • Problem Set 1 released Monday, due 2/5
    • Programming Assignment 1 released tonight, due in 2 weeks
  • Office Hours posted on Ed
  • Materials (slides and *new* lecture notes) on Canvas

Agenda

 

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

Recap: Value & Bellman (finite H)

  • The value function \(V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k,a_k) \mid s_t=s \right]\)

  • Bellman Consistency Equation enables efficient policy evaluation

  • Optimal policies have optimal (highest) value \(V^\star_t(s)\) for all \(t,s\)

  • Bellman Optimality Equation (BOE) enables efficient policy optimization

  • The optimal policy is greedy w.r.t. the optimal value $$\pi_t^\star(s) = \arg\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$

Recap: State Distributions

  • The probability of state trajectory \(\tau=(s_0,s_1,\dots,s_t)\) from \(s_0\sim\mu_0\) and deterministic policy \(\pi\): $$ \mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\displaystyle\prod_{i=1}^{t} {P}(s_i \mid s_{i-1}, \pi_{i-1}(s_{i-1})) $$
  • State distribution vector & transition matrix $$ d_t = \begin{bmatrix} \mathbb{P}\{s_t=1\}\\ \vdots \\ \mathbb{P}\{s_t=S\}\end{bmatrix} ,\quad P_{\pi_t} = \begin{bmatrix}  P(1\mid 1,\pi_t(1)) & \cdots & P(S\mid 1,\pi_t(1)) \\ \vdots && \vdots \\ P(1\mid S,\pi_t(S)) & \cdots & P(S\mid S,\pi_t(S))\end{bmatrix}$$
  • State distribution evolves by transition matrix \(d_{t+1} = P_{\pi_t}^\top d_t\)
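
As a quick illustration of this recursion, here is a minimal numpy sketch; the two-state transition matrix below is made up for the example, not from the lecture.

```python
import numpy as np

# Hypothetical 2-state transition matrix under a fixed policy pi:
# row s holds P(. | s, pi(s)) -- numbers are made up for illustration.
P_pi = np.array([[0.9, 0.1],
                 [0.3, 0.7]])

d = np.array([1.0, 0.0])   # d_0: start in state 0
for t in range(5):
    d = P_pi.T @ d         # d_{t+1} = P_pi^T d_t
print(d)                   # state distribution after 5 steps
```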

Agenda

 

1. Recap

2. Infinite Horizon & Value Function

3. Bellman Equation

4. Policy Evaluation

Infinite Horizon Discounted MDP

  • \(\mathcal{S}, \mathcal{A}\) state and action space
  • \(r\) reward function and \(P\) transition function
  • \(\gamma\) discount factor between \(0\) and \(1\)

Goal: achieve high cumulative reward:

$$\sum_{t=0}^\infty \gamma^t r_t$$

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)

[Figure: agent-environment interaction loop. At each step the agent takes action \(a_t\in\mathcal A\) in state \(s_t\in\mathcal S\), receives reward \(r_t= r(s_t, a_t)\), and the next state is drawn as \(s_{t+1}\sim P(s_t, a_t)\).]

The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state

Value Function

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \;\middle|\; s_0=s,\ a_t\sim \pi(s_t),\ s_{t+1}\sim P(s_t, a_t)\right]$$

In this lecture, we take \(r(s,a)\) to be deterministic and \(\pi\) to be stationary and state-dependent.

Example

[Figure: two-state MDP with states \(0\) and \(1\). From state \(0\), stay remains at \(0\) w.p. \(1\) and move goes to \(1\) w.p. \(1\); from state \(1\), stay remains w.p. \(p_1\) and goes to \(0\) w.p. \(1-p_1\), while move remains w.p. \(1-p_2\) and goes to \(0\) w.p. \(p_2\).]

  • Recall simple MDP example
  • Suppose the reward is:
    • \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
  • Consider the policy
    • \(\pi(s)=\)stay for all \(s\)
  • Simulate reward sequences
  • PollEV: What is \(V^\pi(0)\)?
  • If \(s_0=0\) then \(s_t=0\) for all \(t\)
  • \(V^\pi(0) = \sum_{t=0}^\infty \gamma^t r(0,\pi(0)) = \sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\)
  • If \(s_0=1\) then at some time \(T\) state transitions and then \(s_t=0\) for all \(t\geq T\)
  • \(V^\pi(1) = \mathbb E_{T}\left[ \sum_{t=0}^{T-1} \gamma^t r(1,\pi(1)) + \sum_{t=T}^\infty \gamma^t r(0,\pi(0))\right]\)
    • \(\displaystyle =\sum_{k=1}^\infty \mathbb P\{T=k\} \sum_{t=k}^\infty \gamma^t \) (since \(s_0=1\), the transition time satisfies \(T\geq 1\) with \(\mathbb P\{T=k\}=p_1^{k-1}(1-p_1)\))
    • \(\displaystyle=\sum_{k=1}^\infty p_1^{k-1}(1-p_1) \frac{\gamma^k}{1-\gamma}=\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
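
A small simulation sketch of this example, with assumed values \(p_1=0.6\) and \(\gamma=0.9\) (any values would do); it compares Monte Carlo estimates against the closed-form expressions for \(V^\pi(0)\) and \(V^\pi(1)\) above.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, gamma = 0.6, 0.9     # assumed values for illustration
H = 200                  # truncation horizon; gamma**H is negligible

def rollout(s):
    """Discounted return of one trajectory under pi(s) = stay."""
    total = 0.0
    for t in range(H):
        total += gamma**t * (1.0 if s == 0 else 0.0)  # r(0,.)=1, r(1,.)=0
        if s == 1:
            s = 1 if rng.random() < p1 else 0         # stay in 1 w.p. p1
    return total

V0_mc = np.mean([rollout(0) for _ in range(5000)])
V1_mc = np.mean([rollout(1) for _ in range(5000)])
print(V0_mc, 1 / (1 - gamma))                                      # ~10
print(V1_mc, gamma * (1 - p1) / ((1 - gamma * p1) * (1 - gamma)))  # ~7.83
```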

Example

[Figure: Markov chain induced by \(\pi(s)=\)stay: state \(0\) self-loops, and state \(1\) remains w.p. \(p_1\) or transitions to \(0\) w.p. \(1-p_1\).]

PSet preview: what distribution determines the value function?

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, \pi(s_t)) \mid s_0=s, P, \pi\right]$$

  • Recall \(d_t\), the state distribution at time \(t\)
  • Discounted "steady-state" distribution $$ d_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_t $$

Steady-state?
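
A sketch of computing \(d_\gamma\) by truncating the sum, for the same hypothetical two-state chain under \(\pi(s)=\)stay with assumed \(p_1=0.6\), \(\gamma=0.9\).

```python
import numpy as np

p1, gamma = 0.6, 0.9                  # assumed values for illustration
P_pi = np.array([[1.0, 0.0],          # pi(s)=stay: state 0 self-loops
                 [1 - p1, p1]])       # state 1 -> 0 w.p. 1-p1, stays w.p. p1
d = np.array([0.0, 1.0])              # d_0: start in state 1
d_gamma = np.zeros(2)
for t in range(500):                  # truncate the infinite sum
    d_gamma += (1 - gamma) * gamma**t * d
    d = P_pi.T @ d                    # d_{t+1} = P_pi^T d_t
print(d_gamma, d_gamma.sum())         # a distribution: entries sum to ~1
```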

Agenda

 

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

Exercise: review proof (below)

Bellman Consistency Equation:

 \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)


The cumulative reward expression is almost recursive:

$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) $$

Bellman Consistency Equation

Proof

  • \(V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]\)
  • \(= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]\)
    (linearity of expectation)
  • \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]\)
    (simplifying conditional expectation, re-indexing sum)
  • \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]\) (tower property of conditional expectation)
  • \(= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]\)
    (definition of value function and linearity of expectation)

Example

[Figure: two-state MDP with states \(0\) and \(1\); from state \(1\), stay remains w.p. \(p_1\) and goes to \(0\) w.p. \(1-p_1\), while move remains w.p. \(1-p_2\) and goes to \(0\) w.p. \(p_2\).]

  • Reward \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
  • Recall \(V^\pi(0) = \frac{1}{1-\gamma}\)
  • What is \(V^\pi(1)\)?
    • \(V^\pi(1) = 0 + \gamma (p_1 V^\pi(1) + (1-p_1)V^\pi(0))\)
    • Solving: \(V^\pi(1) = \frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\), consistent with the direct computation

\(\pi(s)=\)stay

Agenda

 

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

Policy Evaluation

  • Consider a deterministic policy. The Bellman equation:
    •  \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim P( s, \pi(s))} [V^\pi(s')] \)
    •  \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S}  P(s'\mid s, \pi(s)) V^\pi(s') \)
  • As with the state distribution, consider \(V^\pi\) as an \(S=|\mathcal S|\)-dimensional vector, and similarly \(R^\pi\)
    •  \(V^{\pi}(s) = R^\pi(s) + \gamma \langle P_\pi(s) , V^\pi \rangle \)

[Figure: the matrix \(P_\pi\), whose row \(s\) is the distribution \(P(\cdot\mid s,\pi(s))\).]

Policy Evaluation

  • \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S}  P(s'\mid s, \pi(s)) V^\pi(s') \)
  • The matrix vector form of the Bellman Equation is $$V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi$$

[Figure: matrix-vector form of the Bellman equation: the vector \(V^\pi\), with entry \(V^\pi(s)\), equals the reward vector with entry \(r(s,\pi(s))\) plus \(\gamma\) times \(P_\pi V^\pi\), where \(P_\pi\) has entry \(P(s'\mid s,\pi(s))\) in row \(s\), column \(s'\).]

Exact Policy Evaluation

\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)

To exactly compute the value function, we just need to solve this \(S\times S\) system of linear equations:

  • \(V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}\)
  • (PSet 1 shows the inverse exists)

Matrix inversion is slow! \(\mathcal O(S^3)\)
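
A minimal sketch of exact policy evaluation, solving \((I-\gamma P_\pi)V^\pi = R^\pi\) with a linear solver rather than forming the inverse explicitly; the quantities below are from the running two-state example with assumed \(p_1=0.6\), \(\gamma=0.9\).

```python
import numpy as np

p1, gamma = 0.6, 0.9                   # assumed values for illustration
R_pi = np.array([1.0, 0.0])            # R^pi(s) = r(s, pi(s))
P_pi = np.array([[1.0, 0.0],
                 [1 - p1, p1]])        # P_pi under pi(s) = stay
S = len(R_pi)

# Solve (I - gamma P_pi) V = R_pi rather than forming the inverse.
V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
print(V)  # [1/(1-gamma), gamma*(1-p1)/((1-gamma*p1)*(1-gamma))]
```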

Approximate Policy Evaluation

To trade off exactness for computation time, we can use a fixed-point iteration algorithm:

  • Initialize \(V_0\)
  • For \(t=0,1,\dots, T\):
    • \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)

Complexity of each iteration?

  • \(\mathcal O(S^2)\) (matrix-vector multiplication)

Example

Recall \(r(0,a)=1\) and \(r(1,a)=0\)

  • What are \(R^\pi\) and \(P_\pi \)?
    • \(\displaystyle \begin{bmatrix} 1\\ 0\end{bmatrix} \quad\text{\&}\quad  \begin{bmatrix} 1 & 0 \\ 1-p_1 &p_1 \end{bmatrix}\)
  • Exercises:
    • Confirm previously computed \(V^\pi\) is a fixed point
    • Iterative PE with \(V_0 = \begin{bmatrix}1& 1\end{bmatrix}^\top \) (see the numerical sketch below)

[Figure: two-state MDP with states \(0\) and \(1\) and transition probabilities \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\) as before.]

\(\pi(s)=\)stay
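
A numerical sketch of the exercises above (assumed \(p_1=0.6\), \(\gamma=0.9\)): it runs the Approx PE iteration from \(V_0=[1,1]^\top\) and checks convergence to the exact fixed point.

```python
import numpy as np

p1, gamma = 0.6, 0.9                   # assumed values for illustration
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[1.0, 0.0],
                 [1 - p1, p1]])

V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

V = np.array([1.0, 1.0])               # V_0 from the exercise
for t in range(100):
    V = R_pi + gamma * P_pi @ V        # V_{t+1} = R^pi + gamma P_pi V_t
print(V)                               # close to V_exact
print(np.max(np.abs(V - V_exact)))     # error <= gamma^100 * ||V_0 - V^pi||_inf
```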

To show that Approx PE works, we first prove a contraction lemma.

Convergence of Approx PE

Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$

Proof

  • \(\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty\) by algorithm definition
  • \(= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty\) by Bellman eq
  • \(= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle| \) norm definition
  • \(=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|\) expectation definition
  • \(\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]\) basic inequality (PSet 1)
  • \(\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty\) basic inequality (PSet 1)

Convergence of Approx PE

Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$

so an \(\epsilon\)-correct solution requires

\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} \Big/ \log\frac{1}{\gamma}\)

Proof

  • The first statement follows by induction using the Lemma
  • For the second statement,
    • \(\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon\)
    • Taking \(\log\) of both sides,
    • \(T\log \gamma + \log  \|V_0-V^\pi\|_\infty \leq \log \epsilon \), then rearrange
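
For a rough sense of scale, with hypothetical values \(\gamma=0.9\), \(\epsilon=0.01\), and \(\|V_0-V^\pi\|_\infty=1\), the bound gives \(T\geq \log(100)/\log(1/0.9)\approx 43.7\), i.e. about \(44\) iterations. Since \(\log\frac{1}{\gamma}\approx 1-\gamma\) for \(\gamma\) near \(1\), the required \(T\) scales roughly like \(\frac{1}{1-\gamma}\log\frac{\|V_0-V^\pi\|_\infty}{\epsilon}\).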

Recap

  • PSet 1 due Monday
  • PA 1 released soon
  • Office Hours on Ed

 

  • Value Function
  • Bellman Equation
  • Policy Evaluation

 

  • Next lecture: Optimality
