## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 16

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

0. Announcements

1. Review

2. Questions

## Announcements

HW2 due Monday 3/28

5789 Paper Review Assignment (weekly pace suggested)

Today is the last day to drop

Prelim TOMORROW 3/22 at 7:30-9pm in Phillips 101

Closed-book, definition/equation sheet provided

Focus: mainly Unit 1 (known models), though many Unit 2 lectures revisit key concepts

Study Materials: Lecture Notes 1-15, HW0&1

## Prelim Exam

Outline:

1. MDP Definitions
2. Policies and Distributions
3. Value and Q function
4. Optimal Policies
5. Linear Optimal Control

## Review

Participation point: PollEV.com/sarahdean011

Infinite Horizon Discounted MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$

## 1. MDP Definitions

• $$\mathcal{S}$$ states, $$\mathcal{A}$$ actions
• $$r$$ map from state, action to scalar reward
• $$P$$ transition probability to next state given current state and action (Markov assumption)
• $$\gamma$$ discount factor

Finite Horizon MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}$$

• $$\mathcal{S},\mathcal{A},r,P$$ same
• $$H$$ horizon
• $$\mu_0$$ initial distribution

ex - Pac-Man as MDP
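One way to make the tuple concrete is to store a small tabular MDP as plain Python data. This is a minimal sketch: the class name `TabularMDP`, the `step` helper, and the two-state, two-action numbers are hypothetical and not taken from the lecture.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    """Infinite-horizon discounted MDP with finite S and A (illustrative)."""
    n_states: int
    n_actions: int
    r: list        # r[s][a]: scalar reward
    P: list        # P[s][a][s_next]: transition probability
    gamma: float   # discount factor

    def step(self, s, a, rng):
        """Sample s' ~ P(s, a) and return (s', r(s, a))."""
        s_next = rng.choices(range(self.n_states), weights=self.P[s][a])[0]
        return s_next, self.r[s][a]


# Hypothetical numbers, chosen only so that each P[s][a] sums to 1.
mdp = TabularMDP(
    n_states=2,
    n_actions=2,
    r=[[0.0, 1.0], [2.0, 0.0]],
    P=[[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]],
    gamma=0.9,
)
rng = random.Random(0)
s_next, reward = mdp.step(0, 1, rng)
```

The Markov assumption shows up in the signature of `step`: the next state depends only on the current `s` and `a`.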

## 1. MDP Definitions

Optimal Control Problem

• continuous states/actions $$\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}$$
• transitions $$P$$ described in terms of dynamics function and disturbance $$w\sim \mathcal D$$
$$s'= f(s, a, w)$$

ex - UAV as OCP

## 2. Policies and Distributions

• Policy $$\pi$$ chooses an action based on the current state so $$a_t=a$$ with probability $$\pi(a|s_t)$$
• Shorthand for deterministic policy: $$a_t=\pi(s_t)$$

examples:

Policy results in a trajectory $$\tau = (s_0, a_0, s_1, a_1, ... )$$

$$s_0 \to a_0 \to s_1 \to a_1 \to s_2 \to a_2 \to \dots$$

## 2. Policies and Distributions

• Probability of trajectory $$\tau =(s_0, a_0, s_1, ... s_t, a_t)$$ $$\mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
• Probability of $$(s, a)$$ at $$t$$ $$\mathbb{P}^\pi_t(s, a ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s, a_t = a)$$
• Discounted "steady-state" distribution $$d^\pi_{\mu_0}(s, a) = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t \mathbb{P}^\pi_t(s, a; \mu_0)$$
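The three distributions above can be computed directly for a small tabular MDP. The sketch below is illustrative only: the function names, the two-state example, and the truncation length `T` are assumptions, and the discounted distribution is approximated by a finite sum.

```python
from itertools import product


def trajectory_prob(tau, mu0, pi, P):
    """Probability of tau = [s_0, a_0, ..., s_t, a_t] under policy pi."""
    states, actions = tau[0::2], tau[1::2]
    prob = mu0[states[0]] * pi[states[0]][actions[0]]
    for i in range(1, len(states)):
        prob *= P[states[i - 1]][actions[i - 1]][states[i]]
        prob *= pi[states[i]][actions[i]]
    return prob


def next_state_dist(state_dist, pi, P):
    """Push a distribution over states one step forward under pi and P."""
    n_s, n_a = len(P), len(P[0])
    nxt = [0.0] * n_s
    for s, a, s2 in product(range(n_s), range(n_a), range(n_s)):
        nxt[s2] += state_dist[s] * pi[s][a] * P[s][a][s2]
    return nxt


def state_action_dist(mu0, pi, P, t):
    """P_t^pi(s, a; mu0), computed by a forward recursion in time."""
    state_dist = list(mu0)
    for _ in range(t):
        state_dist = next_state_dist(state_dist, pi, P)
    return [[state_dist[s] * pi[s][a] for a in range(len(P[0]))]
            for s in range(len(P))]


def discounted_dist(mu0, pi, P, gamma, T=500):
    """(1 - gamma) * sum_{t < T} gamma^t P_t(s, a): truncated d^pi_{mu0}."""
    n_s, n_a = len(P), len(P[0])
    d = [[0.0] * n_a for _ in range(n_s)]
    state_dist = list(mu0)
    for t in range(T):
        for s, a in product(range(n_s), range(n_a)):
            d[s][a] += (1 - gamma) * gamma**t * state_dist[s] * pi[s][a]
        state_dist = next_state_dist(state_dist, pi, P)
    return d


# Hypothetical two-state, two-action example.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.2, 0.8]]   # pi[s][a]
mu0 = [1.0, 0.0]

p_tau = trajectory_prob([0, 1, 1, 1], mu0, pi, P)   # (s0, a0, s1, a1)
P2 = state_action_dist(mu0, pi, P, t=2)
d = discounted_dist(mu0, pi, P, gamma=0.9)
```

The forward recursion avoids enumerating all trajectories, which is what the sum over $$s_{0:t-1}, a_{0:t-1}$$ would otherwise require.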

## 2. Policies and Distributions

Food for thought:

• How do these distributions change under two different policies $$\pi$$ and $$\pi'$$? (HW2)
• How to write the distributions $$\mathbb{P}^\pi_t$$ and $$d^\pi_{\mu_0}$$ over the state only?

## 3. Value and Q function

• Evaluate policy by cumulative reward
• $$V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s]$$
• $$Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s, a_0=a]$$
• For finite horizon, for $$t=0,...H-1$$,
• $$V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s]$$
• $$Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s, a_t=a]$$

examples:

## 3. Value and Q function

Recursive Bellman Expectation Equation:

• Discounted Infinite Horizon
•  $$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$
• $$Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]$$
• Finite Horizon,  for $$t=0,\dots H-1$$,
• $$V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]$$
• $$Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')]$$

Recall: Gardening MDP HW problem

## 3. Value and Q function

• Recursive computation: $$V^{\pi} = R^{\pi} + \gamma P^{\pi} V^\pi$$
• Exact Policy Evaluation: $$V^{\pi} = (I- \gamma P^{\pi} )^{-1}R^{\pi}$$
• Iterative Policy Evaluation: $$V^{\pi}_{t+1} = R^{\pi} + \gamma P^{\pi} V^\pi_t$$
• Backwards-Iterative computation in finite horizon:
• Initialize $$V^{\pi}_H = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$V^{\pi}_t = R^{\pi} +P^{\pi} V^\pi_{t+1}$$
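The policy-evaluation recursions above can be sketched in a few lines for a tabular MDP. Everything named here (`policy_model`, `backup`, the two-state numbers) is a hypothetical illustration. Exact evaluation would solve the linear system $$(I - \gamma P^\pi) V^\pi = R^\pi$$, for instance with `numpy.linalg.solve`; this sketch uses only the iterative and backward forms so that it runs on the standard library.

```python
def policy_model(pi, r, P):
    """Build R^pi (vector) and P^pi (matrix) from pi[s][a], r[s][a], P[s][a][s']."""
    n_s, n_a = len(P), len(P[0])
    R_pi = [sum(pi[s][a] * r[s][a] for a in range(n_a)) for s in range(n_s)]
    P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in range(n_a))
             for s2 in range(n_s)] for s in range(n_s)]
    return R_pi, P_pi


def backup(V, R_pi, P_pi, gamma=1.0):
    """One application of V <- R^pi + gamma * P^pi V."""
    n_s = len(V)
    return [R_pi[s] + gamma * sum(P_pi[s][s2] * V[s2] for s2 in range(n_s))
            for s in range(n_s)]


def iterative_policy_eval(R_pi, P_pi, gamma, iters=1000):
    """Iterative Policy Evaluation, started from V = 0."""
    V = [0.0] * len(R_pi)
    for _ in range(iters):
        V = backup(V, R_pi, P_pi, gamma)
    return V


def finite_horizon_eval(R_pi, P_pi, H):
    """Backward recursion from V_H = 0; returns values[t] = V_t for t = 0..H."""
    values = [None] * (H + 1)
    values[H] = [0.0] * len(R_pi)
    for t in reversed(range(H)):
        values[t] = backup(values[t + 1], R_pi, P_pi)   # no discounting
    return values


# Hypothetical two-state, two-action example.
r = [[0.0, 1.0], [2.0, 0.0]]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.2, 0.8]]
gamma, H = 0.9, 5

R_pi, P_pi = policy_model(pi, r, P)
V = iterative_policy_eval(R_pi, P_pi, gamma)
values = finite_horizon_eval(R_pi, P_pi, H)
```

After enough iterations, `V` should satisfy the recursive equation up to numerical error, which is a convenient sanity check.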

## 4. Optimal Policies

• An optimal policy $$\pi^*$$ is one where $$V^{\pi^*}(s) \geq V^{\pi}(s)$$ for all $$s$$ and policies $$\pi$$
• Equivalent condition: Bellman Optimality
• $$V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]$$
• $$Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]$$
• Optimal policy $$\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)$$

Recall: Gardening MDP HW problem (verifying optimality)

Food for thought: What does Bellman Optimality imply about advantage function $$A^{\pi^*}(s,a)$$?

## 4. Optimal Policies

• Finite horizon, for $$t=0,\dots H-1$$,
• $$V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]$$
• $$Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]$$
• Optimal policy $$\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)$$
• Can directly solve with Dynamic Programming
• Iterate backwards in time from $$V^*_{H}=0$$
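The backward dynamic-programming pass can be sketched as follows for a tabular MDP. The function name `finite_horizon_dp` and the two-state example are hypothetical, and ties in the argmax are broken by the lowest action index.

```python
def finite_horizon_dp(r, P, H):
    """Return V_0^* and a time-indexed greedy policy, starting from V_H^* = 0."""
    n_s, n_a = len(P), len(P[0])
    V = [0.0] * n_s                      # V_H^* = 0
    policy = [None] * H                  # policy[t][s] = pi_t^*(s)
    for t in reversed(range(H)):
        # Q_t^*(s, a) = r(s, a) + E_{s'}[V_{t+1}^*(s')]
        Q = [[r[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        policy[t] = [Q[s].index(max(Q[s])) for s in range(n_s)]
        V = [max(Q[s]) for s in range(n_s)]
    return V, policy


# Hypothetical two-state, two-action example.
r = [[0.0, 1.0], [2.0, 0.0]]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
V0, policy = finite_horizon_dp(r, P, H=2)
```

With `H=2`, the last step is purely greedy in the reward, and the first step adds the expected value of where each action leads.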

## 4. Optimal Policies

• Infinite horizon: algorithms for recursion in the Bellman Optimality equation
• Value Iteration
• Initialize $$Q^0$$. For $$t=0,1,\dots$$,
• $$Q^{t+1}(s,a) =r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{t}(s', a') \right]$$
• Policy Iteration
• Initialize $$\pi^0$$. For $$t=0,1,\dots$$,
• $$Q^{t}=$$ PolicyEval($$\pi^t$$)
• $$\pi^{t+1}(s) = \argmax_{a\in\mathcal A} Q^{t}(s,a)$$
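Both algorithms fit in a short sketch for a tabular MDP. Names and numbers are hypothetical. As a simplification, PolicyEval is approximated here by fixed-point iteration on $$Q^\pi$$ for a deterministic policy rather than an exact solve, and policy iteration stops when the greedy policy no longer changes.

```python
def value_iteration(r, P, gamma, iters=500):
    """Repeat the Bellman optimality backup on Q, starting from Q^0 = 0."""
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * max(Q[s2]) for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


def policy_eval_q(policy, r, P, gamma, iters=500):
    """Approximate Q^pi for a deterministic policy (policy[s] is an action)."""
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * Q[s2][policy[s2]]
                                    for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


def policy_iteration(r, P, gamma, max_rounds=100):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_s = len(P)
    policy = [0] * n_s                                  # pi^0
    Q = policy_eval_q(policy, r, P, gamma)
    for _ in range(max_rounds):
        greedy = [Q[s].index(max(Q[s])) for s in range(n_s)]
        if greedy == policy:
            break
        policy = greedy
        Q = policy_eval_q(policy, r, P, gamma)
    return policy, Q


# Hypothetical two-state, two-action example.
r = [[0.0, 1.0], [2.0, 0.0]]
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]
gamma = 0.9

Q_vi = value_iteration(r, P, gamma)
pi_vi = [row.index(max(row)) for row in Q_vi]
pi_pi, Q_pi = policy_iteration(r, P, gamma)
```

On this example the two methods should agree on both the greedy policy and the Q values, up to the error left by the truncated inner iterations.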

## 4. Optimal Policies

• Value Iteration
• Fixed point iteration (like Iterative Policy Evaluation) from Bellman Q Optimality
• Contraction in Q: $$\|Q^{t+1} - Q^*\|_\infty \leq \gamma \|Q^t - Q^*\|_\infty$$
• Policy Iteration
• Monotone Improvement: $$Q^{t+1}(s,a) \geq Q^{t}(s,a)$$
• Contraction in V: $$\|V^{t+1} - V^*\|_\infty \leq \gamma \|V^t - V^*\|_\infty$$

## 5. Linear Optimal Control

• Linear Dynamics: $$s_{t+1} = A s_t + Ba_t + w_t,\quad w_t\sim \mathcal N(0,\sigma^2 I)$$
• Unrolled dynamics $$s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k (Ba_{t-k-1} + w_{t-k-1})$$
• Stability of uncontrolled $$s_{t+1}=As_t$$:
• stable if $$\rho(A)< 1$$
• unstable if $$\rho(A) > 1$$
• marginally unstable if $$\rho(A) = 1$$
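The spectral-radius test can be checked numerically in the two-dimensional case. This sketch is restricted to 2x2 matrices so the eigenvalues come from the trace and determinant; the function names and example matrices are hypothetical.

```python
import cmath


def spectral_radius_2x2(A):
    """rho(A) for a 2x2 matrix, using the roots of the characteristic polynomial."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)        # may be complex
    return max(abs((tr + disc) / 2), abs((tr - disc) / 2))


def rollout(A, s0, T):
    """Iterate the uncontrolled, noise-free dynamics s_{t+1} = A s_t for T steps."""
    s = list(s0)
    for _ in range(T):
        s = [A[0][0] * s[0] + A[0][1] * s[1],
             A[1][0] * s[0] + A[1][1] * s[1]]
    return s


A_stable = [[0.5, 1.0], [0.0, 0.5]]      # rho = 0.5
A_unstable = [[1.1, 0.0], [0.0, 0.5]]    # rho = 1.1
A_marginal = [[1.0, 1.0], [0.0, 1.0]]    # rho = 1, state grows linearly

rho_stable = spectral_radius_2x2(A_stable)
rho_unstable = spectral_radius_2x2(A_unstable)
rho_marginal = spectral_radius_2x2(A_marginal)

s_stable = rollout(A_stable, [1.0, 1.0], T=100)
s_marginal = rollout(A_marginal, [0.0, 1.0], T=10)
```

The `A_marginal` example has $$\rho(A)=1$$ and its first coordinate grows by one each step, which shows why the boundary case does not guarantee bounded states.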

ex - UAV

Food for thought: What are dynamics, stability, value under linear policy $$a_t = K s_t$$?

## 5. Linear Optimal Control

Finite Horizon LQR: Application of Dynamic Programming

• Initialize $$V^{*}_H(s) = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$Q^{*}_t(s,a) = c(s,a) +\mathbb E_{s'\sim P(s,a)}[ V^*_{t+1}(s')]$$
• $$\pi_t^*(s) = \argmin_{a\in\mathcal A} Q^{*}_t(s,a)$$
• $$V^*_t(s) = Q^{*}_t(s,\pi_t^*(s))$$

Basis for approximation-based algorithms (local linearization and iLQR)
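For a concrete instance of the recursion, consider the scalar case. The sketch assumes a quadratic cost $$c(s,a) = q s^2 + r a^2$$ with $$r > 0$$ and scalar dynamics $$s' = \alpha s + \beta a + w$$; these symbols and the function name `scalar_lqr` are assumptions not specified above. Under them, $$V^*_t(s) = p_t s^2 + c_t$$ and the minimizing action is linear, $$\pi^*_t(s) = k_t s$$.

```python
def scalar_lqr(alpha, beta, q, r, sigma, H):
    """Backward DP for scalar finite-horizon LQR (assumes r > 0).

    Returns gains[t] = k_t with pi_t^*(s) = k_t * s, plus (p_0, c_0)
    such that V_0^*(s) = p_0 * s**2 + c_0.
    """
    p, c = 0.0, 0.0                      # V_H^*(s) = 0
    gains = [0.0] * H
    for t in reversed(range(H)):
        # argmin over a of  q s^2 + r a^2 + p (alpha s + beta a)^2 + p sigma^2 + c
        k = -(p * alpha * beta) / (r + p * beta * beta)
        c = c + p * sigma**2             # noise term uses p_{t+1}
        p = q + r * k * k + p * (alpha + beta * k) ** 2
        gains[t] = k
    return gains, p, c


# Hypothetical noise-free example with alpha = beta = q = r = 1.
gains, p0, c0 = scalar_lqr(alpha=1.0, beta=1.0, q=1.0, r=1.0, sigma=0.0, H=2)
```

The noise variance only shifts the constant $$c_t$$; it does not change the gains, so the optimal policy is the same with or without the disturbance.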

## Proof Strategies

1. Add and subtract: $$\|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\|$$
2. Contractions (induction) $$\|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
3. Additive induction $$\|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\|$$
4. Basic Inequalities (HW0) $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|]$$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)|$$ $$\mathbb E[f(x)] \leq \max f(x)$$
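As one worked instance of combining strategies 2 and 4, the contraction of Value Iteration can be derived as follows, where $$Q^{t+1}$$ is the Value Iteration update and $$Q^*$$ satisfies the Bellman Optimality equation. The rewards cancel, leaving only the expectation terms:

$$\begin{aligned} |Q^{t+1}(s,a) - Q^*(s,a)| &= \gamma \left| \mathbb{E}_{s' \sim P(s, a)}\left[\max_{a'} Q^{t}(s',a')\right] - \mathbb{E}_{s' \sim P(s, a)}\left[\max_{a'} Q^{*}(s',a')\right] \right| \\ &\leq \gamma\, \mathbb{E}_{s' \sim P(s, a)}\left[ \left| \max_{a'} Q^{t}(s',a') - \max_{a'} Q^{*}(s',a') \right| \right] \\ &\leq \gamma\, \mathbb{E}_{s' \sim P(s, a)}\left[ \max_{a'} \left| Q^{t}(s',a') - Q^{*}(s',a') \right| \right] \\ &\leq \gamma \max_{s',a'} \left| Q^{t}(s',a') - Q^{*}(s',a') \right| = \gamma \|Q^t - Q^*\|_\infty \end{aligned}$$

Taking the maximum over $$(s,a)$$ on the left gives $$\|Q^{t+1} - Q^*\|_\infty \leq \gamma \|Q^t - Q^*\|_\infty$$, and induction (strategy 2) then yields $$\|Q^{t} - Q^*\|_\infty \leq \gamma^t \|Q^0 - Q^*\|_\infty$$.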

## Test-taking Strategies

1. Move on if stuck!
2. Write explanations and show steps for partial credit
3. Multipart questions: can be done mostly independently
• ex: 1) show $$\|x_{t+1}\|\leq \gamma \|x_t\|$$, 2) give a bound on $$\|x_t\|$$ in terms of $$\|x_0\|$$
