Outline:

1. MDP Definitions
2. Policies and Distributions
3. Value and Q function
4. Optimal Policies
5. Linear Optimal Control
6. Learned Models, Values, Policies
7. Exploration
8. Learning from Experts

## Review

## 1. MDP Definitions

Infinite Horizon Discounted MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$

• $$\mathcal{S}$$ states, $$\mathcal{A}$$ actions
• $$r$$ map from state, action to scalar reward
• $$P$$ transition probability to next state given current state and action (Markov assumption)
• $$\gamma$$ discount factor

Finite Horizon MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}$$

• $$\mathcal{S},\mathcal{A},r,P$$ same
• $$H$$ horizon
• $$\mu_0$$ initial state distribution

ex - Pac-Man as MDP

## 1. MDP Definitions

Optimal Control Problem

• continuous states/actions $$\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}$$
• transitions are deterministic and described in terms of dynamics function
$$s'= f(s, a)$$

ex - UAV as OCP

## 2. Policies and Distributions

• Policy $$\pi$$ chooses an action based on the current state so $$a_t=a$$ with probability $$\pi(a|s_t)$$
• Shorthand for deterministic policy: $$a_t=\pi(s_t)$$

examples:

Policy results in a trajectory $$\tau = (s_0, a_0, s_1, a_1, ... )$$

## 2. Policies and Distributions

• Probability of trajectory $$\tau =(s_0, a_0, s_1, \dots, s_t, a_t)$$: $$\mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
• Probability of $$s$$ at $$t$$ (marginalize over everything before $$s_t$$): $$\mathbb{P}^\pi_t(s ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s)$$

## 2. Policies and Distributions

• Probability vector of $$s$$ at $$t$$: $$d_{\mu_0,t}^\pi(s) = \mathbb{P}^\pi_t(s ; \mu_0)$$ evolves as $$d_{\mu_0,t+1}^\pi=P_\pi^\top d_{\mu_0,t}^\pi$$ where $$P_\pi$$ at row $$s$$ and column $$s'$$ is $$\mathbb E_{a\sim \pi(s)}[P(s'\mid s,a)]$$
• Discounted "steady-state" distribution (PSet 2) $$d^\pi_{\mu_0} = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_{\mu_0,t}^\pi$$
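The two distribution formulas above can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state chain `P_pi` and initial distribution `mu0` (both made up for illustration). It propagates $$d_{\mu_0,t}^\pi$$ forward and compares the truncated discounted sum against the closed form $$(1-\gamma)(I-\gamma P_\pi^\top)^{-1}\mu_0$$, which follows from summing the geometric series.

```python
import numpy as np

def state_distributions(P_pi, mu0, T):
    """Return [d_0, ..., d_T] using d_{t+1} = P_pi^T d_t."""
    d = [np.asarray(mu0, dtype=float)]
    for _ in range(T):
        d.append(P_pi.T @ d[-1])
    return d

def discounted_distribution(P_pi, mu0, gamma):
    """Closed form of (1 - gamma) * sum_t gamma^t d_t (geometric series)."""
    n = P_pi.shape[0]
    return (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0)

# Hypothetical example: row s, column s' holds P(s' | s) under the policy.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])
gamma = 0.9

d_list = state_distributions(P_pi, mu0, T=500)
d_sum = (1 - gamma) * sum(gamma**t * d for t, d in enumerate(d_list))
d_closed = discounted_distribution(P_pi, mu0, gamma)
```

The truncated sum and the closed form should agree up to the tail $$\gamma^{T+1}$$, and the result is itself a probability vector.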

## 2. Policies and Distributions

Food for thought:

• How are these distributions different when
• Initial states are different
• Policies are different (Performance Difference Lemma)
• Transitions are different (Simulation Lemma)

## 3. Value and Q function

• Evaluate policy by cumulative reward
• $$V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]$$
• $$Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]$$
• For finite horizon, for $$t=0,...H-1$$,
• $$V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s,P,\pi]$$
• $$Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s, a_t=a,P,\pi]$$

## 3. Value and Q function

Recursive Bellman Expectation Equation:

• Discounted Infinite Horizon
•  $$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$
• $$Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]$$
• Finite Horizon,  for $$t=0,\dots H-1$$,
• $$V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]$$
• $$Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')]$$

Recall: Icy navigation (PSet 2, lecture example), Prelim question

## 3. Value and Q function

• Recursive computation: $$V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi$$
• Exact Policy Evaluation: $$V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}$$
• Iterative Policy Evaluation: $$V^{\pi}_{i+1} = R^{\pi} + \gamma P_{\pi} V^\pi_i$$
• Converges: the update is a $$\gamma$$-contraction with fixed point $$V^\pi$$
• Backwards-Iterative computation in finite horizon:
• Initialize $$V^{\pi}_H = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$V^{\pi}_t = R^{\pi} +P_{\pi} V^\pi_{t+1}$$
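The three evaluation schemes above can be sketched in a few lines for the tabular case. This is an illustrative sketch, not the course's reference code: `P_pi` and `R_pi` are a hypothetical two-state example, and the function names are made up here.

```python
import numpy as np

def exact_policy_eval(P_pi, R_pi, gamma):
    """Solve (I - gamma P_pi) V = R_pi."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def iterative_policy_eval(P_pi, R_pi, gamma, iters=1000):
    """Repeat V <- R_pi + gamma P_pi V starting from zero."""
    V = np.zeros(len(R_pi))
    for _ in range(iters):
        V = R_pi + gamma * P_pi @ V
    return V

def finite_horizon_policy_eval(P_pi, R_pi, H):
    """Backward recursion from V_H = 0; returns an array with rows V_0..V_H."""
    V = np.zeros((H + 1, len(R_pi)))
    for t in range(H - 1, -1, -1):
        V[t] = R_pi + P_pi @ V[t + 1]
    return V

# Hypothetical two-state example.
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
R_pi = np.array([1.0, 0.0])
gamma = 0.9

V_exact = exact_policy_eval(P_pi, R_pi, gamma)
V_iter = iterative_policy_eval(P_pi, R_pi, gamma)
V_fh = finite_horizon_policy_eval(P_pi, R_pi, H=3)
```

The exact solve costs one linear system in $$|\mathcal S|$$ unknowns, while the iterative version trades accuracy for cheap matrix-vector products.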

## 4. Optimal Policies

• An optimal policy $$\pi^*$$ is one where $$V^{\pi^*}(s) \geq V^{\pi}(s)$$ for all $$s$$ and policies $$\pi$$
• Equivalent condition: Bellman Optimality
• $$V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]$$
• $$Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]$$
• Optimal policy $$\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)$$

Recall: Verifying optimality in Icy Street example, Prelim

## 4. Optimal Policies

• $$0$$ cost for $$a_0$$
• $$2\epsilon$$ cost for $$a_1$$
• $$\epsilon$$ reward in $$s_0$$
• $$1$$ reward in $$s_1$$
• $$\gamma$$ discount

Food for thought: rigorous argument for optimal policy?

## 4. Optimal Policies

• Finite horizon: for $$t=0,\dots H-1$$,
• $$V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]$$
• $$Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]$$
• Optimal policy $$\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)$$
• Solve exactly with Dynamic Programming
• Iterate backwards in time from $$V^*_{H}=0$$
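A minimal sketch of the backward recursion, assuming a tabular MDP stored as arrays `P[s, a, s']` and `r[s, a]`. The two-state, two-action MDP below is a made-up toy (action 0 stays, action 1 switches state; reward 1 is earned in state 1), not the example from lecture.

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Return V (rows V_0..V_H) and a greedy time-varying policy pi (rows pi_0..pi_{H-1})."""
    S, A = r.shape
    V = np.zeros((H + 1, S))            # V_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = r + P @ V[t + 1]            # (S, A, S) @ (S,) -> (S, A)
        pi[t] = Q.argmax(axis=1)
        V[t] = Q.max(axis=1)
    return V, pi

# Hypothetical toy MDP with deterministic transitions.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, 0, s] = 1.0        # action 0: stay
    P[s, 1, 1 - s] = 1.0    # action 1: switch
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])

V, pi = finite_horizon_dp(P, r, H=3)
```

With $$H=3$$, starting in state 0 the best plan is to switch once and then stay, collecting reward 2; starting in state 1 it is to stay, collecting reward 3.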

## 4. Optimal Policies

• Infinite horizon: algorithms for recursion in the Bellman Optimality equation
• Value Iteration
• Initialize $$V^0$$. For $$i=0,1,\dots$$,
• $$V^{i+1}(s) =\max_{a\in\mathcal A} \left[ r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right] \right]$$
• Policy Iteration
• Initialize $$\pi^0$$. For $$i=0,1,\dots$$,
• $$V^{i}=$$ PolicyEval($$\pi^i$$)
• $$\pi^{i+1}(s) = \argmax_{a\in\mathcal A} \left[ r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right] \right]$$
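Both algorithms can be sketched for a tabular MDP as follows. The array layout (`P[s, a, s']`, `r[s, a]`), the helper names, and the toy MDP are assumptions made for illustration; policy evaluation here uses the exact linear solve.

```python
import numpy as np

def q_from_v(P, r, V, gamma):
    """Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]."""
    return r + gamma * P @ V

def value_iteration(P, r, gamma, iters=500):
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        V = q_from_v(P, r, V, gamma).max(axis=1)
    return V, q_from_v(P, r, V, gamma).argmax(axis=1)

def policy_iteration(P, r, gamma, max_iters=100):
    S = r.shape[0]
    pi = np.zeros(S, dtype=int)
    for _ in range(max_iters):
        P_pi = P[np.arange(S), pi]      # (S, S) transition matrix under pi
        R_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # PolicyEval
        new_pi = q_from_v(P, r, V, gamma).argmax(axis=1)      # greedy improvement
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return V, pi

# Hypothetical toy MDP: action 0 stays, action 1 switches; reward 1 in state 1.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, 1 - s] = 1.0
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])
gamma = 0.9

V_vi, pi_vi = value_iteration(P, r, gamma)
V_pi, pi_pi = policy_iteration(P, r, gamma)
```

On this toy problem both methods should agree: stay in state 1 (value $$1/(1-\gamma)=10$$) and switch out of state 0 (value $$\gamma\cdot 10 = 9$$).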

## 5. Linear Optimal Control

• Linear Dynamics: $$s_{t+1} = A s_t + Ba_t$$
• Unrolled dynamics (PSet 3) $$s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k Ba_{t-k-1}$$
• Stability of $$s_{t+1}=As_t$$:
• stable if $$\max_i |\lambda_i(A)|< 1$$
• unstable if $$\max_i |\lambda_i(A)| > 1$$
• marginally unstable if $$\max_i |\lambda_i(A)|= 1$$
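The unrolled formula and the eigenvalue test can be checked numerically. This is a small sketch with made-up matrices `A`, `B` and random actions; it compares a step-by-step rollout against the unrolled expression.

```python
import numpy as np

def spectral_radius(A):
    """max_i |lambda_i(A)|."""
    return np.max(np.abs(np.linalg.eigvals(A)))

def rollout(A, B, s0, actions):
    """Step s_{t+1} = A s_t + B a_t through all actions."""
    s = s0
    for a in actions:
        s = A @ s + B @ a
    return s

def unrolled(A, B, s0, actions):
    """s_t = A^t s_0 + sum_{k=0}^{t-1} A^k B a_{t-k-1}."""
    t = len(actions)
    s = np.linalg.matrix_power(A, t) @ s0
    for k in range(t):
        s = s + np.linalg.matrix_power(A, k) @ B @ actions[t - k - 1]
    return s

# Hypothetical system: upper-triangular A has eigenvalues 0.9 and 0.8.
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
s0 = np.array([1.0, -1.0])
actions = list(np.random.default_rng(0).normal(size=(5, 1)))

rho = spectral_radius(A)
s_step = rollout(A, B, s0, actions)
s_unrolled = unrolled(A, B, s0, actions)
```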

ex - UAV

Recall: PSet 4 and Prelim question about cumulative cost and stability

## 5. Linear Optimal Control

• Finite Horizon LQR: Application of Dynamic Programming: $$c(s,a) = s^\top Qs + a^\top Ra \implies \pi^\star_t(s) = K_t s,~~V^\star_t(s) = s^\top P_t s$$
• Basis for approximation-based algorithms (local linearization and iLQR)
• Food for thought:
• What is $$V^\pi$$ for a non-optimal linear policy?
• What are policy/value for infinite horizon discounted LQR? (recursive use of Bellman equations)
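For reference, the dynamic programming step for finite horizon LQR reduces to the standard Riccati recursion. The sketch below assumes cost minimization with $$V^\star_H = 0$$ (so $$P_H = 0$$), which matches the backward initialization used elsewhere in this review; the system matrices are a made-up double-integrator example. It also simulates the resulting policy to confirm that the accumulated cost equals $$s_0^\top P_0 s_0$$.

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, H):
    """Backward Riccati recursion; returns gains K_0..K_{H-1} and P_0..P_H."""
    P = np.zeros_like(Q, dtype=float)   # P_H = 0 (assumes no terminal cost)
    Ks, Ps = [], [P]
    for _ in range(H):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A + A.T @ P @ B @ K
        Ks.append(K)
        Ps.append(P)
    return Ks[::-1], Ps[::-1]

# Hypothetical double integrator.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
H = 20

Ks, Ps = finite_horizon_lqr(A, B, Q, R, H)

# Simulate a_t = K_t s_t and accumulate the cost.
s0 = np.array([1.0, 0.0])
s, cost = s0, 0.0
for t in range(H):
    a = Ks[t] @ s
    cost += s @ Q @ s + a @ R @ a
    s = A @ s + B @ a
```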

## 6. Learning from data

• What do we want to learn? $$\mathcal M = \{\mathcal S,\mathcal A,P,r,\gamma\}$$
• Unknown transitions $$P(s'|s,a)$$ or reward function $$r(s,a)$$
• Value/Q function
• of policy $$V^\pi(s)$$ or $$Q^\pi(s,a)$$
• optimal $$V^*(s)$$ or $$Q^*(s,a)$$
• Optimal Policy $$\pi^*(s)$$
• Given a dataset with features $$x_i$$ and labels $$y_i$$, model:
• Via counting: $$\widehat f(x) = \sum_{i=1}^N y_i \mathbf 1\{x=x_i\} / \sum_{i=1}^N\mathbf 1\{x=x_i\}$$
• Function approx: $$\widehat f = \arg\min_{f\in\mathcal F} \frac{1}{N} \sum_{i=1}^N (f(x_i)-y_i)^2$$
• e.g. closed-form linear regression solution

## 6. Learning Models

Model-Based RL

• Features are $$(s,a)$$ and label is $$s'$$
• Data collection
• Query model: uniform exploration
• Active exploration with reward bonus
• Tabular setting: $$\widehat P$$ via counting
• Simulation Lemma
• Translate error in $$\widehat P$$ vs $$P$$ into difference in performance $$\widehat V$$ vs $$V$$
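In the tabular setting, the counting estimator for $$\widehat P$$ can be sketched as below. The dataset of `(s, a, s')` triples is made up, and the uniform fallback for unvisited state-action pairs is an assumption of this sketch, not something the slides specify.

```python
import numpy as np

def estimate_transitions(data, S, A):
    """P_hat[s, a, s'] = count(s, a, s') / count(s, a)."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in data:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform guess (an assumption).
    return np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / S)

# Hypothetical dataset of (s, a, s') transitions.
data = [(0, 0, 0), (0, 0, 1), (0, 0, 1), (1, 1, 0)]
P_hat = estimate_transitions(data, S=2, A=2)
```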

## 6. Learning Value/Q

• Features are $$(s_i,a_i)$$
• $$(s_i,a_i) = (s_{h_1}, a_{h_1}) \sim d^\pi_{\mu_0}$$
• Labels constructed as:
• Rollout based (MC): $$y_i = \sum_{t=h_1}^{h_1+h_2} r_t$$
• Bellman Exp based (TD): $$y_t =r_t + \gamma \widehat Q(s_{t+1},a_{t+1})$$
• Bellman Opt based (TD): $$y_t =r_t + \gamma \max_a \widehat Q(s_{t+1},a)$$
• On vs. off policy with importance weighting (Recall PSet)
• $$\widehat Q =\arg\min \frac{1}{N}\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2$$
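The three label constructions above can be written out for a tabular $$\widehat Q$$. This is a sketch with made-up rewards and a made-up `Q_hat` table; the function names are hypothetical.

```python
import numpy as np

def mc_label(rewards, h1, h2):
    """Rollout (MC) label: sum of rewards from step h1 through h1 + h2."""
    return float(np.sum(rewards[h1:h1 + h2 + 1]))

def td_expectation_label(r, s_next, a_next, Q_hat, gamma):
    """Bellman expectation label: r + gamma * Q_hat(s', a')."""
    return r + gamma * Q_hat[s_next, a_next]

def td_optimality_label(r, s_next, Q_hat, gamma):
    """Bellman optimality label: r + gamma * max_a Q_hat(s', a)."""
    return r + gamma * np.max(Q_hat[s_next])

# Hypothetical rollout rewards and current Q estimate (rows: states, cols: actions).
rewards = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
Q_hat = np.array([[0.0, 1.0],
                  [2.0, 0.5]])

y_mc = mc_label(rewards, h1=1, h2=2)
y_exp = td_expectation_label(1.0, s_next=1, a_next=1, Q_hat=Q_hat, gamma=0.5)
y_opt = td_optimality_label(1.0, s_next=1, Q_hat=Q_hat, gamma=0.5)
```

The expectation label uses the action the policy actually took next, while the optimality label replaces it with the greedy action, which is what makes the latter usable off-policy.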

## 6. Learning Value/Q

Sampled time step: $$h_1=h$$ w.p. $$\propto \gamma^h$$

Sampled transition: $$s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1})$$

• Value-based RL
For $$i=0,1...$$:
1. Rollout $$\pi^i$$ and construct dataset
2. Learn $$\widehat Q^{i+1}$$ from data
3. Update $$\pi^{i+1}$$: greedy, incremental, or $$\epsilon$$-greedy
• Approximate vs. Conservative Policy Iteration
• Greedy improvement $$\rightarrow$$ oscillations, incremental is more stable
• Performance Difference Lemma

## 6. Policy Optimization

• $$J(\theta)=$$ expected cumulative reward under policy $$\pi_\theta$$
• Estimate $$\nabla_\theta J(\theta)$$ via rollouts $$\tau$$, observed reward $$R(\tau)$$
• Random Search: $$\theta \pm \delta v$$ , $$g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v$$
• REINFORCE: $$g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)$$
• Actor-Critic: $$s,a\sim d^{\pi_\theta}_{\mu_0}$$ ,
$$g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (Q^{\pi_\theta}(s,a)-b(s))$$

Food for thought: how to compute off-policy gradient estimate?
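As an illustration of the random search estimator, the sketch below replaces the rollout reward $$R(\tau_\pm)$$ with a made-up deterministic function `R` of the perturbed parameters (a concave quadratic with maximizer `theta_star`). That substitution is an assumption for the sake of a runnable example; in an actual RL problem `R` would come from executing the perturbed policy.

```python
import numpy as np

def random_search_gradient(R, theta, delta, rng):
    """g = (R(theta + delta v) - R(theta - delta v)) / (2 delta) * v for random v."""
    v = rng.normal(size=theta.shape)
    g = (R(theta + delta * v) - R(theta - delta * v)) / (2 * delta) * v
    return g, v

# Hypothetical stand-in for the rollout reward.
theta_star = np.array([1.0, -2.0])

def R(theta):
    return -np.sum((theta - theta_star) ** 2)

rng = np.random.default_rng(0)
theta0 = np.zeros(2)
g0, v0 = random_search_gradient(R, theta0, delta=0.1, rng=rng)

# Gradient ascent: theta_{i+1} = theta_i + alpha * g_i
theta = theta0.copy()
for _ in range(200):
    g, v = random_search_gradient(R, theta, delta=0.1, rng=rng)
    theta = theta + 0.05 * g
```

For a quadratic `R` the central difference is exact, so each estimate equals $$(\nabla R(\theta)^\top v)\,v$$, the true gradient projected onto the random direction.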

## 6. Policy Optimization

for $$i=0,1,...$$
1. collect rollouts using $$\theta_i$$
2. estimate gradient with $$g_i$$
3. $$\theta_{i+1} = \theta_i+ \alpha g_i$$
• Trust regions and Natural PG

$$\max_\theta ~J(\theta) \quad \text{s.t.} ~~d_{KL}(\theta, \theta_0)\leq \delta$$

Approximation around $$\theta_0$$ (first-order objective, second-order constraint):

$$\max_\theta ~\nabla J(\theta_0)^\top(\theta-\theta_0) \quad \text{s.t.} ~~(\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta$$

Natural PG update: $$\theta_{i+1} = \theta_i + \alpha F^{-1}_{i} g_i$$
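As a worked step, the quadratically constrained linear problem can be solved in closed form. This is a sketch assuming $$F_{\theta_0}$$ is positive definite and $$g=\nabla J(\theta_0)\neq 0$$; the symbols $$\Delta$$ and $$\lambda$$ are introduced here for the derivation. Writing $$\Delta = \theta-\theta_0$$ and setting the gradient of the Lagrangian to zero:

$$\nabla_\Delta\left[g^\top\Delta - \lambda\left(\Delta^\top F_{\theta_0}\Delta - \delta\right)\right] = 0 \implies \Delta = \frac{1}{2\lambda}F_{\theta_0}^{-1}g$$

The constraint is active at the optimum, which fixes the multiplier:

$$\Delta^\top F_{\theta_0}\Delta = \delta \implies \Delta = \sqrt{\frac{\delta}{g^\top F_{\theta_0}^{-1} g}}~F_{\theta_0}^{-1}g$$

This has the form of the natural PG update with step size $$\alpha = \sqrt{\delta / (g^\top F_{\theta_0}^{-1} g)}$$.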

## 7. Exploration

• Multi-Armed and Contextual Bandits: MDP with no transitions!
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
• Linear regret: pure random, pure greedy, $$\epsilon$$ greedy (PSet)
• Sublinear regret: Explore-then-commit, UCB, LinUCB
• Greedy $$\arg\max_a \widehat \mu_t^a$$   vs   UCB $$\arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}$$

Food for thought: performance/regret of softmax policy?
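A minimal UCB simulation for a multi-armed bandit with Bernoulli rewards, as a sketch. The arm means, the horizon, and the choice $$C = 2\log T$$ are assumptions made for this example; the bonus follows the $$\sqrt{C/N_t^a}$$ form above.

```python
import numpy as np

def ucb(means, T, C, seed=0):
    """Run UCB on Bernoulli arms; return pull counts and cumulative expected regret."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(T):
        if t < K:
            a = t                                   # pull each arm once first
        else:
            index = sums / counts + np.sqrt(C / counts)
            a = int(np.argmax(index))
        reward = float(rng.random() < means[a])     # Bernoulli reward
        counts[a] += 1
        sums[a] += reward
        regret += max(means) - means[a]
    return counts, regret

# Hypothetical two-armed problem.
means = [0.2, 0.8]
T = 2000
counts, regret = ucb(means, T, C=2 * np.log(T))
```

Uniformly random play would incur expected regret $$0.3\,T = 600$$ on this problem, so sublinear regret shows up as a much smaller number.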

## 8. Learning From Experts

Imitation Learning with BC

Food for thought: Expert in LQR setting? (Linear regression)
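For the LQR food for thought, behavior cloning a linear expert is ordinary least squares. The sketch below assumes a made-up expert gain `K_expert` and noise-free expert actions on randomly sampled states, so regression recovers the gain exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
K_expert = np.array([[-1.0, -0.5]])      # hypothetical expert: pi*(s) = K_expert s

states = rng.normal(size=(100, 2))       # x_i = s_i
actions = states @ K_expert.T            # y_i = pi*(s_i)

# Least squares: min_W ||states @ W - actions||^2, with W = K^T.
W, *_ = np.linalg.lstsq(states, actions, rcond=None)
K_hat = W.T
```

With noisy or nonlinear experts the fit would only be approximate, and the error would then compound along the learner's own trajectories.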

Dataset $$\mathcal D = (x_i, y_i)_{i=1}^M$$ $$\rightarrow$$ Supervised Learning $$\rightarrow$$ Policy $$\pi$$

## 8. Learning From Experts

Imitation Learning with DAgger

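Dataset $$\mathcal D = (x_i, y_i)_{i=1}^M$$ $$\rightarrow$$ Supervised Learning $$\rightarrow$$ Policy $$\pi$$ $$\rightarrow$$ Execute, visiting $$s_0, s_1, s_2, \dots$$ $$\rightarrow$$ Query Expert for $$\pi^*(s_0), \pi^*(s_1), \dots$$ $$\rightarrow$$ Aggregate $$(x_i = s_i, y_i = \pi^*(s_i))$$ into the dataset and repeat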
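The DAgger loop (execute, query expert, aggregate, refit) can be sketched on a linear system. Everything concrete here is an assumption for illustration: the dynamics `A`, `B`, the linear expert `K_expert`, and least squares as the supervised learner.

```python
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])               # hypothetical dynamics
B = np.array([[0.0],
              [0.1]])
K_expert = np.array([[-1.0, -1.5]])      # hypothetical expert: pi*(s) = K_expert s

def rollout_states(K, s0, T=10):
    """Execute the policy a = K s and return the visited states."""
    states, s = [], s0
    for _ in range(T):
        states.append(s)
        s = A @ s + B @ (K @ s)
    return np.array(states)

rng = np.random.default_rng(0)
K_learner = np.zeros((1, 2))             # initial policy
X, Y = [], []                            # aggregated dataset
for i in range(3):
    S = rollout_states(K_learner, rng.normal(size=2))   # Execute learner policy
    X.append(S)                                         # x_i = s_i
    Y.append(S @ K_expert.T)                            # Query expert: y_i = pi*(s_i)
    W, *_ = np.linalg.lstsq(np.vstack(X), np.vstack(Y), rcond=None)  # Supervised learning
    K_learner = W.T
```

The key difference from behavior cloning is that the states in the dataset come from the learner's own rollouts, while the labels still come from the expert.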

## 8. Learning From Experts

• Inverse RL: Principle of maximum entropy

• Max-Ent IRL with Soft-VI (entropy regularized)
• replace $$\max$$ with softmax
• Lagrangian formulation of constrained optimization
• $$x^* =\arg \min_x~~f(x)~~\text{s.t.}~~g(x)=0$$
• Equivalent: $$\displaystyle x^* =\arg \min_x \max_{w} ~~f(x)+w^\top g(x)$$
• Iterative algorithm or solve $$\nabla [f(x) + w^\top g(x)] = 0$$
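A small worked instance of the stationarity condition, chosen here purely for illustration: take $$f(x)=\tfrac12\|x\|^2$$ and a single linear constraint $$g(x)=a^\top x-b$$ with $$a\neq 0$$ (the vector $$a$$ and scalar $$b$$ are made-up symbols for this example).

$$\nabla_x\left[\tfrac12\|x\|^2 + w\,(a^\top x - b)\right] = x + w\,a = 0 \implies x = -w\,a$$

$$a^\top x = b \implies -w\,\|a\|^2 = b \implies w^* = -\frac{b}{\|a\|^2},\qquad x^* = \frac{b}{\|a\|^2}\,a$$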

maximize $$\mathsf{Ent}(\pi)$$ s.t. $$\pi$$ consistent with expert data

## Proof Strategies

1. Add and subtract: $$\|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\|$$
2. Contractions (induction) $$\|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
3. Additive induction $$\|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\|$$
4. Basic Inequalities (PSet 1): $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|]$$, $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)|$$, $$\mathbb E[f(x)] \leq \max f(x)$$

## Test-taking Strategies

1. Move on if stuck!
2. Write explanations and show steps for partial credit
3. Multipart questions: can be done mostly independently
• ex: 1) show $$\|x_{t+1}\|\leq \gamma \|x_t\|$$
2) give a bound on $$\|x_t\|$$ in terms of $$\|x_0\|$$

