Report by Pavel Temirchev
Deep RL reading group
Model-Based vs. Model-Free
Model-based:
+ Low sample complexity
+ Gives the opportunity to model real-world environments
+ Similar to the human thinking process
- Requires careful tuning of the model
- Still no practical implementations for complex environments
- Consumes a lot of wall-clock time
Model-free:
+ Currently shows great results in complex environments
+ Ctrl-C / Ctrl-V training style, little tuning involved
- Requires a LOT of samples
- Usually cannot be applied to real-world environments
Vanilla Model-Based RL
1) Train a transition model on the available data
2) Use the transition model to train the policy
3) Collect new data from the real-world environment
4) Repeat
ASSUMPTION:
If the transition model is good enough,
then a policy that is optimal under it
will also be optimal in the real-world environment...
Vanilla Model Learning
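As a sketch of the objective this slide refers to (assuming the usual choice of a mean-squared one-step prediction error over the transitions stored in \( D \)):
\[
\min_\phi \; \frac{1}{|D|} \sum_{(s_t, a_t, s_{t+1}) \in D} \big\| s_{t+1} - \hat{f}_\phi(s_t, a_t) \big\|_2^2
\]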
Vanilla Policy Learning
Maximize the expected return of \( \pi_\theta \) under the learned model:
\[
\max_\theta \; \hat\eta(\theta; \phi) = \mathbb{E}\!\left[ \sum_{t=0}^{T} r(s_t, a_t) \right],
\]
where
\[
s_{t+1} = \hat{f}_\phi(s_t, a_t)
\]
and
\[
a_t \sim \pi_\theta(\cdot \mid s_t).
\]
Then we use a reparametrization trick to backpropagate through the sampled actions:
\[
a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).
\]
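A minimal PyTorch sketch of this BPTT update through the learned model; `policy`, `model`, and `reward_fn` are assumed interfaces for illustration, not the paper's code:

```python
import torch

def bptt_policy_update(policy, model, reward_fn, s0, horizon, optimizer):
    """One BPTT policy update: roll pi_theta through the learned model f_phi
    and backpropagate the (negative) return through time."""
    s = s0                                    # batch of start states, shape (B, state_dim)
    total_reward = 0.0
    for _ in range(horizon):
        mu, log_std = policy(s)               # Gaussian policy head
        eps = torch.randn_like(mu)
        a = mu + log_std.exp() * eps          # reparametrization trick: gradients flow into mu, log_std
        total_reward = total_reward + reward_fn(s, a).sum()
        s = model(s, a)                       # differentiable learned dynamics
    loss = -total_reward / s0.shape[0]
    optimizer.zero_grad()
    loss.backward()                           # backpropagation through time (BPTT)
    optimizer.step()
    return loss.item()
```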
Vanilla MB-RL Algorithm
1: Initialize a policy \( \pi_\theta \) and a model \( \hat{f}_\phi \)
2: Initialize an empty dataset \( D \)
3: repeat
4: Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)
5: Train the model \( \hat{f}_\phi \) using \( D \)
6: repeat
7: Collect fictitious samples from \( \hat{f}_\phi \) using \( \pi_\theta \)
8: Update the policy using BPTT on the fictitious samples
9: Estimate the performance \( \hat\eta(\theta, \phi) \)
10: until the performance stops improving
11: until the policy performs well in real environment \( f \).
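A short sketch of line 5 (training \( \hat{f}_\phi \) on \( D \)) as plain supervised regression; the tensor layout and hyperparameters are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def fit_model(model, dataset, epochs=5, lr=1e-3, batch_size=256):
    """Fit the dynamics model f_phi by one-step prediction (line 5):
    regress s_{t+1} onto (s_t, a_t) with a mean-squared-error loss."""
    states, actions, next_states = dataset        # tensors built from transitions in D
    loader = DataLoader(TensorDataset(states, actions, next_states),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, s_next in loader:
            loss = ((model(s, a) - s_next) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```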
Troubles with Vanilla MB-RL | Suggested Solution
---|---
1) Single model | Model ensemble
2) BPTT | TRPO
3) Stop criterion: the policy stops improving | Stop criterion: the policy stops improving on most of the models in the ensemble (used as validation)
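A sketch of the ensemble-based stop criterion in the right column, assuming we compare the estimated returns \( \hat\eta(\theta, \phi_i) \) before and after a policy update; the 70% threshold is illustrative:

```python
import numpy as np

def keep_training(returns_before, returns_after, threshold=0.7):
    """Continue the inner policy-optimization loop only while the policy
    still improves on a large enough fraction of the ensemble models
    (the models act as a validation set); the 0.7 threshold is illustrative."""
    improved = np.asarray(returns_after) > np.asarray(returns_before)
    return improved.mean() >= threshold
```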
TRPO: Theoretical Foundation
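For reference, the monotonic-improvement bound from Schulman et al. (2015) behind this slide's heading:
\[
\eta(\tilde\pi) \;\geq\; L_\pi(\tilde\pi) - C \, D_{KL}^{\max}(\pi, \tilde\pi),
\qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s,a} \big| A_\pi(s, a) \big|,
\]
where \( L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s) A_\pi(s, a) \) is the local surrogate approximation of the return.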
TRPO: Practical Algorithm
\[
\max_\theta \; \mathbb{E}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} \, A^{\pi_{\theta_\text{old}}}(s, a) \right]
\]
subject to
\[
\mathbb{E}\!\left[ D_{KL}\!\big( \pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \leq \delta.
\]
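A minimal PyTorch sketch of the two quantities above for a diagonal Gaussian policy (the `policy(states) -> (mu, log_std)` interface is an assumption, not the paper's code); a full TRPO step then maximizes the surrogate under the KL constraint with a conjugate-gradient / line-search update:

```python
import math
import torch

def gaussian_log_prob(mu, log_std, a):
    """Log-density of a diagonal Gaussian N(mu, exp(log_std)^2), summed over action dims."""
    var = (2 * log_std).exp()
    return (-0.5 * ((a - mu) ** 2 / var + 2 * log_std + math.log(2 * math.pi))).sum(-1)

def surrogate_and_kl(policy, old_mu, old_log_std, states, actions, advantages):
    """The two quantities in the practical TRPO problem above: the importance-weighted
    surrogate objective (to maximize) and the mean KL(pi_old || pi_theta) (kept <= delta)."""
    mu, log_std = policy(states)                  # current diagonal-Gaussian policy head
    ratio = (gaussian_log_prob(mu, log_std, actions)
             - gaussian_log_prob(old_mu, old_log_std, actions)).exp()
    surrogate = (ratio * advantages).mean()
    kl = (log_std - old_log_std
          + ((2 * old_log_std).exp() + (old_mu - mu) ** 2) / (2 * (2 * log_std).exp())
          - 0.5).sum(-1).mean()
    return surrogate, kl
```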
Suggested Algorithm
1: Initialize a policy \( \pi_\theta \) and all models \( \hat{f}_{\phi_1} \), ..., \( \hat{f}_{\phi_K} \)
2: Initialize an empty dataset \( D \)
3: repeat
4: Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)
5: Train all models \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( D \)
6: repeat
7: Collect fictitious samples from \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( \pi_\theta \)
8: Update the policy using TRPO on the fictitious samples
9: Estimate the performances \( \hat\eta(\theta, \phi_i) \) for \( i = 1 ... K \)
10: until the performances stop improving
11: until the policy performs well in real environment \( f \).
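A compact Python sketch of the loop above; the environment interface and the helper callables are assumed interfaces for illustration, and the 0.7 stop threshold is illustrative rather than taken from the slide:

```python
import random

def me_trpo(env, policy, models, collect_real, fit_model, trpo_update,
            estimate_return, outer_iters=100, inner_iters=50, threshold=0.7):
    """Sketch of the suggested algorithm: an ensemble of learned models generates
    fictitious samples, TRPO updates the policy on them, and the inner loop stops
    once the estimated returns on the models stop improving."""
    D = []
    for _ in range(outer_iters):                          # line 3: outer loop
        D += collect_real(env, policy)                    # line 4: real samples added to D
        for m in models:                                  # line 5: train every model on D
            fit_model(m, D)
        prev = [estimate_return(policy, m) for m in models]
        for _ in range(inner_iters):                      # lines 6-10: inner loop
            m = random.choice(models)                     # fictitious samples from the ensemble
            trpo_update(policy, m)                        # line 8: TRPO on fictitious samples
            cur = [estimate_return(policy, m_i) for m_i in models]   # line 9
            frac = sum(c > p for c, p in zip(cur, prev)) / len(models)
            if frac < threshold:                          # line 10: most models stopped improving
                break
            prev = cur
    return policy
```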
Plots: BPTT vs. TRPO
Plots: Ensemble vs. Single Model
Plots: ME-TRPO vs. everyone else
Implementation
Data collection:
Neural Networks:
Hardware:
Amazon EC2: 1 NVIDIA K80 GPU, 4 vCPUs, and 61 GB of memory
Implementation: Wall-Clock Time Consumption
Environment | Run time (thousands of seconds)
---|---
Swimmer | ~35.3
Snake | ~60.8
Hopper | ~183.6
Half Cheetah | ~103.7
Ant | ~395.2
Humanoid | ~362.1
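For scale, assuming the column is thousands of seconds: Humanoid's \( \sim 362.1 \times 1000 \,\text{s} \approx 100.6 \) hours of wall-clock time, while Swimmer's \( \sim 35.3 \times 1000 \,\text{s} \approx 9.8 \) hours.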
Fun Stuff
Discussion
Thanks for your
attention!