Copy of DiffRL

Policy Search through Contact Dynamics

\begin{aligned} F(\alpha;{\color{red}x_i},{\color{green}p})\quad & \coloneqq \left[\sum^\infty_{t=0} \gamma^t r(x_{t},u_{t})\right] \\ \text{s.t.}\quad & x_{t+1}=f(x_t,u_t,{\color{green} p}),\;\\ & u_t = \pi(x_t,\alpha), \quad x_0 = {\color{red} x_i} \end{aligned}

Original problem formulation:

Find a deterministic policy that minimizes sum of rewards.

1. Flatness

2. Stiffness

Challenges of Gradient-based Optimization

\begin{aligned} \min_\alpha F(\alpha;{\color{red}x_i},{\color{green}p}) \end{aligned}

Policy Search through Contact Dynamics

\begin{aligned} F(\alpha;{\color{red} x_i}, {\color{green} p}, {\color{blue} w_t})\quad & \coloneqq \left[\sum^\infty_{t=0} \gamma^t r(x_{t},u_{t})\right] \\ \text{s.t.}\quad & x_{t+1}=f(x_t,u_t,{\color{green}p}),\;\\ & u_t = \pi(x_t,\alpha) + {\color{blue} w_t},\quad x_0={\color{red} x_i} \end{aligned}

Perhaps unknowingly, many RL formulations benefit from specifying a more relaxed objective with a stochastic formulation.

Randomization of initial conditions

Randomization of domain parameters

Randomization of stochastic policy

\begin{aligned} \min_\alpha {\color{red}\mathbb{E}_{x_i}}{\color{green}\mathbb{E}_{p}}{\color{blue}\mathbb{E}_{w_{0:\infty}}}\left[F(\alpha;{\color{red} x_i},{\color{green} p},{\color{blue} w_{0:\infty}})\right] \end{aligned}

Policy Search through Contact Dynamics

Policy Randomization

Domain Randomization

Initial Condition Randomization

Randomization potentially alleviates flatness and stiffness

Policy Search through Contact Dynamics

Policy Randomization

Domain Randomization

Initial Condition Randomization

Randomized smoothing also encodes robustness

Rate of Bias

Suboptimality

\begin{aligned} e_\rho \coloneqq V(\alpha^*_\rho) - V(\alpha^*) \end{aligned}

Suboptimality gap (Bias)

Suboptimality Stiffness

\begin{aligned} \Delta e_\rho \coloneqq \frac{V(\alpha^*_\rho) - V(\alpha^*)}{\rho} \end{aligned}

\begin{aligned} (\varepsilon,\delta) \coloneqq \inf_{\|\Delta p\|\leq \varepsilon} V(\alpha^*,\hat{p} + \Delta p)-V(\alpha^*,\hat{p}) \leq \delta \end{aligned}

Robustness

Suboptimality

\begin{aligned} \alpha^* & = \text{argmin}_\alpha V(\alpha)\\ \alpha^*_\rho & = \text{argmin}_\alpha \bar{V}_\rho(\alpha) \end{aligned}

Pitfalls of Randomized Smoothing

Challenge 1. Sample-Inefficient Exploration

Getting out of flatness requires lots of samples.

Need to have lot of samples to eventually reach a non-flat region.

Pitfalls of Randomized Smoothing

Challenge 2. Empirical Bias from Stiff Underlying Landscape

Example: Domain Randomization

Summing up gradients does not result in direction of true gradient!

\begin{aligned} \mathbb{E}_p V(\alpha, p) \end{aligned}

\begin{aligned} \frac{1}{N}\sum_i V(\alpha,p_i) \end{aligned}

What causes stiffness?

x_3(\alpha_1)

x_2(\alpha_1)

x_1(\alpha_1)

x_1(\alpha_2)

x_2(\alpha_2)

x_3(\alpha_2)

\lim_{\alpha_2\rightarrow\alpha_1}\frac{r_3(\alpha_2) - r_3(\alpha_1)}{x_3(\alpha_2) - x_3(\alpha_1)}\cdot \frac{x_3(\alpha_2) - x_3(\alpha_1)}{x_2(\alpha_2) - x_2(\alpha_1)}\cdot {\color{red} \frac{x_2(\alpha_2) - x_2(\alpha_1)}{x_1(\alpha_2) - x_1(\alpha_1)}}\cdot \frac{x_1(\alpha_2) - x_1(\alpha_1)}{x_0(\alpha_2) - x_0(\alpha_1)}\cdot\frac{x_0(\alpha_2) - x_0(\alpha_1)}{\alpha_2 - \alpha_1}

x_0(\alpha_1)

x_0(\alpha_2)

\frac{\partial r_3}{\partial \alpha} = \frac{\partial r_3}{\partial x_3}\frac{\partial x_3}{\partial x_2}{\color{red}\frac{\partial x_2}{\partial x_1}}\frac{\partial x_1}{\partial x_0}\frac{\partial x_0}{\partial \alpha}

Started off close, ended up far!

Rewards are tied to states and trajectories.

Stiff dynamics causes vastly different trajectories in state. If these are different in the direction of the reward, they directly translate to stiff value landscapes.

Dynamic smoothing alleviates stiffness / flatness