Model-Ensemble Trust Region Policy Optimization

Report by Pavel Temirchev

 

Deep RL reading group

 

Model-Based VS Model-Free

Model-Based:

+ Low sample complexity

+ Makes it feasible to tackle real-world environments

+ Similar to how humans think: planning with an internal model of the world

- Requires careful tuning of the model

- Still no practical implementations for complex environments

- Consumes a lot of wall-clock time

 

Model-Free:

+ Has shown great results in complex environments

+ "Ctrl-C / Ctrl-V" training style: little tuning involved

- Requires a LOT of samples

- Usually cannot be applied to real-world environments

Vanilla Model-Based RL


1) Train transition model on available data

2) Train the policy using the learned transition model

3) Collect new data using real-world environment

4) Repeat

 

ASSUMPTION:

If the transition model is good enough,

then a policy that is optimal under the model

will also be optimal in the real-world environment...

Vanilla Model Learning

\min_{\phi} \frac{1}{|\mathcal{D}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} ||s_{t+1} - \hat{f}_\phi(s_t, a_t)||^2_2
  • Adam
  • Normalization
  • Early stopping
  • Predicting $$ s_{t+1}-s_t $$ instead of $$ s_{t+1} $$
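A minimal PyTorch sketch of this model-learning step (the network sizes follow the implementation slide later; the dataset interface with train_batches / validation_error is an assumed placeholder, not the paper's code):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Illustrative dynamics model: predicts the state difference s_{t+1} - s_t
    (as suggested above) rather than the next state directly."""
    def __init__(self, state_dim, action_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        # \hat{f}_phi(s_t, a_t) = s_t + predicted delta
        return s + self.net(torch.cat([s, a], dim=-1))


def train_model(model, dataset, epochs=100, lr=1e-3):
    """Minimize the mean squared one-step prediction error over D
    with Adam and a simple early-stopping rule."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val = float("inf")
    for _ in range(epochs):
        for s, a, s_next in dataset.train_batches():      # assumed (normalized) minibatch iterator
            loss = ((model(s, a) - s_next) ** 2).sum(-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        val = dataset.validation_error(model)             # assumed held-out validation helper
        if val >= best_val:                               # early stopping: validation error no longer drops
            break
        best_val = val
    return model
```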

Vanilla Policy Learning

\max_{\theta} \hat{\eta}(\theta, \phi)

\hat{\eta}(\theta, \phi) = \mathbb{E}_{\hat\tau} \left[ \sum_{t=0}^T r(s_t, a_t) \right]

where

a_t \sim \pi_\theta(s_t)

and

s_{t+1} = \hat f_\phi (s_t, a_t)

a_t(s_t) = \mu_\theta(s_t) + \sigma_\theta(s_t) \cdot \zeta, \quad \zeta \sim \mathcal{N}(0, I)

Then we use the reparametrization trick:

\nabla_\theta \hat{\eta} = \mathbb{E}_{s_0 \sim \rho_0,\, \zeta \sim \mathcal{N}} \left[ \nabla_\theta \sum_{t=0}^T r(s_t, a_t) \right]
  • Adam
  • Gradient clipping
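A rough sketch of one reparametrized BPTT update through the learned model, assuming the DynamicsModel above, a policy that returns a mean and standard deviation, and a known differentiable reward function (all helper names are illustrative):

```python
import torch

def bptt_update(policy, model, reward_fn, s0_batch, horizon, opt, clip=10.0):
    """Roll the stochastic policy through the learned model and backpropagate
    the return through the whole fictitious trajectory (BPTT)."""
    s = s0_batch
    total_reward = 0.0
    for _ in range(horizon):
        mu, sigma = policy(s)                  # assumed Gaussian policy head
        zeta = torch.randn_like(mu)            # reparametrization trick: a_t = mu + sigma * zeta
        a = mu + sigma * zeta
        total_reward = total_reward + reward_fn(s, a).mean()
        s = model(s, a)                        # differentiable model step keeps the graph connected
    loss = -total_reward                       # gradient ascent on the estimated return
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), clip)   # gradient clipping
    opt.step()
```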

Vanilla MB-RL Algorithm

1: Initialize a policy \( \pi_\theta \) and a model \( \hat{f}_\phi \)

2: Initialize an empty dataset \( D \)

3: repeat

4:      Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)

5:      Train the model  \( \hat{f}_\phi \) using \( D \)

6:      repeat

7:            Collect fictitious samples from \( \hat{f}_\phi \) using \( \pi_\theta \)

8:            Update the policy using BPTT on the fictitious samples

9:            Estimate the performance \( \hat\eta(\theta, \phi) \)

10:     until the performance stops improving

11: until the policy performs well in real environment \( f \).
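A compact Python sketch of this loop, reusing train_model and bptt_update from the sketches above; collect_real_samples, sample_initial_states, and estimate_performance are hypothetical helpers, not the paper's code:

```python
def vanilla_mbrl(env, policy, model, dataset, reward_fn, policy_opt,
                 n_outer_iters=100, inner_iters=50, horizon=100):
    """Sketch of the vanilla MB-RL pseudocode above."""
    for _ in range(n_outer_iters):
        dataset.add(collect_real_samples(env, policy))          # step 4: real-world data
        train_model(model, dataset)                             # step 5: fit the dynamics model
        prev_perf = float("-inf")
        for _ in range(inner_iters):                            # step 6
            s0 = sample_initial_states(dataset)                 # step 7: start fictitious rollouts
            bptt_update(policy, model, reward_fn, s0, horizon, policy_opt)   # step 8
            perf = estimate_performance(policy, model, reward_fn)            # step 9
            if perf <= prev_perf:                               # step 10: performance stopped improving
                break
            prev_perf = perf
    return policy
```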

Troubles with Vanilla MB-RL

  • Model bias: the policy learns to exploit regions where the learned model is inaccurate
  • Getting stuck in bad local optima due to BPTT instability: inexact, vanishing, and exploding gradients

Suggested Solution

1) The single model

\hat f_\phi

is replaced by a model ensemble

\{\hat f_{\phi_i}\}_{i=1}^K

 

2) BPTT is replaced by TRPO

 

3) The stop criterion "the policy stops improving" is replaced by a validation-style criterion: the fraction of models under which the updated policy improves,

\frac{1}{K} \sum_{i=1}^{K} \mathbb{I}\left[\hat\eta(\theta^{new}; \phi_i) > \hat\eta(\theta^{old}; \phi_i)\right]

must stay above a threshold.
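A small sketch of this criterion, assuming a hypothetical estimate_performance(policy, model) helper that estimates \( \hat\eta \) by rolling the policy out in a given model; the 70% threshold value is illustrative:

```python
def keep_training(policy_new, policy_old, models, threshold=0.7):
    """Return True while the updated policy improves the estimated return
    under at least a `threshold` fraction of the ensemble models."""
    improved = [
        estimate_performance(policy_new, m) > estimate_performance(policy_old, m)
        for m in models
    ]
    return sum(improved) / len(models) > threshold
```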

TRPO: Theoretical Foundation
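For reference, the key result behind TRPO (Schulman et al., 2015) is the monotonic improvement bound: the true return is lower-bounded by a surrogate objective minus a penalty on the maximal KL divergence between the old and new policies,

\eta(\theta) \geq L_{\theta_{old}}(\theta) - C \max_s KL(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s)), \quad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}

where \( \epsilon = \max_{s,a} |A_{\pi_{\theta_{old}}}(s,a)| \) and

L_{\theta_{old}}(\theta) = \eta(\theta_{old}) + \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A_{\pi_{\theta_{old}}}(s, a) \right]

so maximizing the penalized surrogate guarantees improvement of the true return; the practical algorithm replaces the penalty with the hard KL constraint below.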

TRPO: Practical Algorithm

\max_\theta L_{\theta_{old}}(\theta)

subject to

\mathbb{E}_s KL(\pi_{\theta_{old}}(\cdot|s)||\pi_{\theta_{new}}(\cdot|s)) \leq \delta
  • Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
  • By averaging over samples, construct the estimated objective and constraint.
  • Approximately solve this constrained optimization problem to update the policy’s parameter vector \( \theta \).
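A minimal sketch of the estimation step in the bullets above: Monte Carlo estimates of the surrogate objective and the KL constraint from collected state-action pairs. The policy.distribution(...) interface and the Q-value estimates are assumptions for illustration, not the paper's API:

```python
import torch

def surrogate_and_kl(policy_new, policy_old, states, actions, q_values):
    """Estimate L_{theta_old}(theta) and the mean KL constraint from samples."""
    dist_new = policy_new.distribution(states)   # assumed: returns a torch.distributions object
    dist_old = policy_old.distribution(states)
    # Importance-sampling ratio pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
    surrogate = (ratio * q_values).mean()        # averaged over the collected samples
    kl = torch.distributions.kl_divergence(dist_old, dist_new).mean()
    return surrogate, kl
```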

Suggested Algorithm

1: Initialize a policy \( \pi_\theta \) and all models \( \hat{f}_{\phi_1} \), ..., \( \hat{f}_{\phi_K} \)

2: Initialize an empty dataset \( D \)

3: repeat

4:      Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)

5:      Train all models \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( D \)

6:      repeat

7:            Collect fictitious samples from \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( \pi_\theta \)

8:            Update the policy using TRPO on the fictitious samples

9:            Estimate the performances \( \hat\eta(\theta, \phi_i) \) for \( i = 1 ... K \)

10:     until the performances stop improving, i.e. the validation criterion above falls below its threshold

11: until the policy performs well in real environment \( f \).
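Putting the pieces together, a rough sketch of this loop. It reuses train_model and keep_training from the earlier sketches; collect_real_samples, collect_fictitious_samples, and trpo_update are placeholders. Each fictitious transition is predicted by a randomly chosen ensemble member, following the rollout scheme described in the paper:

```python
import copy
import random

def me_trpo(env, policy, models, dataset, n_outer_iters=100):
    """Sketch of the suggested (ME-TRPO) algorithm above."""
    for _ in range(n_outer_iters):
        dataset.add(collect_real_samples(env, policy))          # step 4: real-world data
        for model in models:                                    # step 5: fit every ensemble member
            train_model(model, dataset)
        while True:                                             # step 6
            policy_old = copy.deepcopy(policy)
            # Step 7: each simulated step queries a randomly chosen model
            step_fn = lambda s, a: random.choice(models)(s, a)
            rollouts = collect_fictitious_samples(step_fn, policy)
            trpo_update(policy, rollouts)                       # step 8: TRPO instead of BPTT
            # Steps 9-10: validation-style stop criterion over the ensemble
            if not keep_training(policy, policy_old, models):
                break
    return policy
```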

Plots: BPTT vs. TRPO


Plots: Ensemble vs. Single Model

Plots: ME-TRPO vs. everyone else


Implementation

Data collection:

  • 3000-6000 samples from a stochastic policy
  • Its standard deviation is sampled from \( \mathcal{U}[0, 3] \) and kept fixed throughout the episode
  • The policy's parameters are perturbed with Gaussian noise proportional to the difference in parameters between timesteps

Neural networks:

  • MODEL: 2 hidden layers, 1024 x 1024, ReLU
  • POLICY: 2-3 hidden layers, 32 x 32 or 100 x 50 x 25 (Humanoid), tanh nonlinearity

Hardware:

Amazon EC2 using 1 NVIDIA K80 GPU, 4 vCPUs, and 61 GB of memory
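A sketch of a Gaussian policy network with the sizes listed above (2 hidden layers of 32 tanh units); the state-dependent log-std head matches the \( a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \cdot \zeta \) parametrization from the earlier slide, and the distribution method provides the interface assumed in the TRPO sketch. This is an illustrative stand-in, not the paper's code:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Illustrative Gaussian MLP policy: 32 x 32 hidden layers with tanh nonlinearities."""
    def __init__(self, state_dim, action_dim, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, s):
        h = self.body(s)
        return self.mu_head(h), self.log_std_head(h).exp()   # mean and std of the action distribution

    def distribution(self, s):
        mu, std = self(s)
        # Independent Gaussian over action dimensions
        return torch.distributions.Independent(torch.distributions.Normal(mu, std), 1)
```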

Implementation: Real-Time Consumption

Environment      Run time (in thousands of seconds)
Swimmer          ~ 35.3
Snake            ~ 60.8
Hopper           ~ 183.6
Half-Cheetah     ~ 103.7
Ant              ~ 395.2
Humanoid         ~ 362.1

Fun Stuff

Discussion


  • Accepted to ICLR 2018
  • Review scores: 7, 6, 7
  • Not really novel
  • But the first approach that works well with complex NN dynamics models
  • Could be applied to real-world problems
  • Time-consuming
  • But sample-efficient
  • Why not use a BNN? Where is the wall-clock time comparison? Why is the overfitting plot not given for the new model?

Thanks for your attention!
