Model-Ensemble Trust Region Policy Optimization
Report by Pavel Temirchev
Deep RL reading group
Model-Based VS Model-Free
Model-Based:
+ Low sample complexity
+ Makes working with real-world environments feasible
+ Similar to the human thinking process
- Requires careful tuning of the model
- Still no practical implementations for complex environments
- High wall-clock (real-time) cost

Model-Free:
+ Has shown great results in complex environments
+ CTRL-C / CTRL-V training style: little tuning involved
- Requires a LOT of samples
- Usually cannot be applied to real-world environments
Vanilla Model-Based RL
1) Train a transition model on the available data
2) Train the policy against the learned transition model
3) Collect new data from the real-world environment
4) Repeat
ASSUMPTION:
If the transition model is good enough,
then a policy that is optimal under the model
will also be optimal in the real-world environment...
Vanilla Model Learning
- Adam
- Normalization
- Early stopping
- Predicting $$ s_{t+1}-s_t $$ instead of $$ s_{t+1} $$
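A minimal PyTorch sketch of this model-training recipe (a sketch, not the paper's code: the `DataLoader`s yielding `(s, a, s_next)` batches, the network width, and the hyperparameters are assumptions; inputs are assumed pre-normalized with dataset statistics):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts s_{t+1} - s_t and adds it back to s_t (as on the slide)."""
    def __init__(self, s_dim, a_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, s_dim),
        )

    def forward(self, s, a):
        # Inputs are assumed already normalized with statistics of the dataset D.
        return s + self.net(torch.cat([s, a], dim=-1))

def train_model(model, train_loader, val_loader, epochs=100, patience=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, as on the slide
    best_val, bad_epochs = float("inf"), 0
    for _ in range(epochs):
        for s, a, s_next in train_loader:
            loss = ((model(s, a) - s_next) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            val = sum(((model(s, a) - s_next) ** 2).mean().item()
                      for s, a, s_next in val_loader) / len(val_loader)
        if val < best_val:                # early stopping on held-out validation loss
            best_val, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
```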
Vanilla Policy Learning
Maximize the return under the learned model:
$$ \hat\eta(\theta, \phi) = \mathbb{E}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right] $$
where
$$ s_{t+1} = \hat{f}_\phi(s_t, a_t) $$
and
$$ a_t \sim \pi_\theta(\cdot \mid s_t) $$
Then we use the reparametrization trick to backpropagate through the sampled actions (BPTT).
- Adam
- Gradient clipping
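A hedged sketch of this BPTT policy update through a learned model (like the dynamics-model sketch above), with a Gaussian policy and the reparametrization trick; the horizon, the `reward_fn`, and the optimizer settings are assumptions:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=32):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(s_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, a_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(a_dim))

    def forward(self, s):
        mu = self.mean(s)
        # Reparametrization trick: a = mu(s) + sigma * eps, eps ~ N(0, I),
        # so gradients flow through the sampled action.
        return mu + self.log_std.exp() * torch.randn_like(mu)

def bptt_update(policy, model, reward_fn, s0, opt, horizon=50, clip=10.0):
    # Roll the policy through the differentiable learned model and
    # backpropagate the (negative) return; model parameters stay fixed
    # because only policy parameters are in `opt`.
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        ret = ret + reward_fn(s, a).mean()
        s = model(s, a)                   # fictitious, differentiable transition
    opt.zero_grad()
    (-ret).backward()                     # maximize return = minimize its negative
    nn.utils.clip_grad_norm_(policy.parameters(), clip)  # gradient clipping
    opt.step()                            # Adam step, as on the slide
```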
Vanilla MB-RL Algorithm
1: Initialize a policy \( \pi_\theta \) and a model \( \hat{f}_\phi \)
2: Initialize an empty dataset \( D \)
3: repeat
4: Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)
5: Train the model \( \hat{f}_\phi \) using \( D \)
6: repeat
7: Collect fictitious samples from \( \hat{f}_\phi \) using \( \pi_\theta \)
8: Update the policy using BPTT on the fictitious samples
9: Estimate the performance \( \hat\eta(\theta, \phi) \)
10: until the performance stops improving
11: until the policy performs well in real environment \( f \).
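Putting the pieces together, a compact Python-style sketch of this outer loop, reusing the `train_model` and `bptt_update` sketches above; `collect_real_samples`, `make_loaders`, `sample_initial_states`, and `evaluate_on_model` are hypothetical helpers introduced only for illustration:

```python
def vanilla_mbrl(env, policy, model, reward_fn, policy_opt, n_outer=100):
    D = []                                                    # real-world dataset
    for _ in range(n_outer):
        D += collect_real_samples(env, policy)                # step 4 (hypothetical helper)
        train_model(model, *make_loaders(D))                  # step 5, sketch above
        best = float("-inf")
        while True:                                           # steps 6-10
            s0 = sample_initial_states(D)                     # hypothetical helper
            bptt_update(policy, model, reward_fn, s0, policy_opt)   # steps 7-8
            perf = evaluate_on_model(policy, model, reward_fn)      # step 9: eta_hat(theta, phi)
            if perf <= best:
                break                                         # step 10: no more improvement
            best = perf
```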
Troubles with Vanilla MB-RL
- Model bias: the policy learns to exploit regions where the learned model is inaccurate, so performance under the model does not transfer to the real environment
- Getting stuck in bad local optima due to BPTT instability: inexact, vanishing, and exploding gradients
Suggested Solution
1) Single model → Model ensemble
2) BPTT → TRPO
3) Stop criterion: the policy stops improving on the single model → Stop criterion: the policy stops improving on a sufficient fraction of the ensemble models (validation on the ensemble)
TRPO: Theoretical Foundation
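For reference, the key result behind this slide is the monotonic-improvement bound from the TRPO paper (Schulman et al., 2015), in standard notation:

$$ \eta(\tilde{\pi}) \ge L_{\pi}(\tilde{\pi}) - C \, D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4 \epsilon \gamma}{(1 - \gamma)^2}, \quad \epsilon = \max_{s,a} |A_\pi(s, a)| $$

where

$$ L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) \, A_\pi(s, a) $$

Maximizing the right-hand side at each step guarantees monotonic improvement of the true return \( \eta \); in practice TRPO replaces the penalty with a hard KL constraint.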
TRPO: Practical Algorithm
$$ \max_\theta \; \mathbb{E}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, \hat{Q}_{\theta_{old}}(s, a) \right] $$
subject to
$$ \mathbb{E}\left[ D_{KL}\big( \pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta $$
- Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
- By averaging over samples, construct the estimated objective and constraint.
- Approximately solve this constrained optimization problem to update the policy’s parameter vector \( \theta \).
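A small sketch of how the estimated objective and constraint in these steps can be formed from samples (PyTorch; the `dist(states)` helper returning a torch action distribution and the Monte Carlo `q_estimates` are assumptions; the conjugate-gradient plus line-search step that actually solves the constrained problem is omitted):

```python
import torch

def surrogate_and_kl(policy, old_policy, states, actions, q_estimates):
    # Sample-based (averaged) estimates of the TRPO objective and KL constraint.
    dist = policy.dist(states)           # assumed helper: returns a torch.distributions object
    old_dist = old_policy.dist(states)
    ratio = (dist.log_prob(actions) - old_dist.log_prob(actions).detach()).exp()
    surrogate = (ratio * q_estimates).mean()                         # estimated objective
    kl = torch.distributions.kl_divergence(old_dist, dist).mean()    # estimated constraint <= delta
    return surrogate, kl
```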
Suggested Algorithm
1: Initialize a policy \( \pi_\theta \) and all models \( \hat{f}_{\phi_1} \), ..., \( \hat{f}_{\phi_K} \)
2: Initialize an empty dataset \( D \)
3: repeat
4: Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)
5: Train all models \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( D \)
6: repeat
7: Collect fictitious samples from \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( \pi_\theta \)
8: Update the policy using TRPO on the fictitious samples
9: Estimate the performances \( \hat\eta(\theta, \phi_i) \) for \( i = 1 ... K \)
10: until the policy stops improving on a sufficient fraction of the models
11: until the policy performs well in real environment \( f \).
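A minimal sketch of the ensemble-based stopping rule behind step 10: the policy is validated on every model and training continues only while it still improves on a sufficient fraction of them (the 0.7 default below is an assumed illustrative threshold):

```python
def keep_training(perfs, prev_perfs, threshold=0.7):
    """Return True while the policy still improves on >= `threshold` of the models.

    `perfs` and `prev_perfs` are the current and previous estimates of
    eta_hat(theta, phi_i) for i = 1..K; the 0.7 threshold is an assumption.
    """
    improved = sum(p > p_prev for p, p_prev in zip(perfs, prev_perfs))
    return improved / len(perfs) >= threshold
```

For example, with K = 4 models, `keep_training([1.2, 1.1, 0.9, 1.3], [1.0, 1.0, 1.0, 1.0])` returns `True`, since the policy improved on 3/4 of the models.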
Plots: BPTT vs. TRPO
Plots: Ensemble vs. Single Model
Plots: ME-TRPO vs. everyone else
Implementation
Data Collecting:
- 3000-6000 samples from a stochastic policy
- Its standard deviation is sampled from \( \mathcal{U}[0, 3] \) and kept fixed throughout the episode
- The policy's parameters are perturbed with Gaussian noise proportional to the difference in parameters between timesteps

Neural Networks:
- MODEL: 2 hidden layers: 1024 x 1024; ReLU
- POLICY: 2-3 hidden layers: 32 x 32 or 100 x 50 x 25 (Humanoid); tanh nonlinearity

Hardware:
- Amazon EC2 with 1 NVIDIA K80 GPU, 4 vCPUs, and 61 GB of memory
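For concreteness, a sketch of the listed architectures as PyTorch modules (layer sizes and activations follow the slide; everything else, e.g. the absence of output nonlinearities, is an assumption):

```python
import torch.nn as nn

def make_model_net(s_dim, a_dim):
    # Dynamics model: 2 hidden layers of 1024 units, ReLU.
    return nn.Sequential(
        nn.Linear(s_dim + a_dim, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, s_dim),
    )

def make_policy_net(s_dim, a_dim, humanoid=False):
    # Policy: 32 x 32 for most tasks, 100 x 50 x 25 for Humanoid, tanh nonlinearity.
    sizes = [100, 50, 25] if humanoid else [32, 32]
    layers, prev = [], s_dim
    for h in sizes:
        layers += [nn.Linear(prev, h), nn.Tanh()]
        prev = h
    layers.append(nn.Linear(prev, a_dim))
    return nn.Sequential(*layers)
```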
Implementation: Real-Time Consumption
| Environment | Run time (×1000 s) |
| --- | --- |
| Swimmer | ~35.3 |
| Snake | ~60.8 |
| Hopper | ~183.6 |
| Half-Cheetah | ~103.7 |
| Ant | ~395.2 |
| Humanoid | ~362.1 |
Fun Stuff
Discussion
- Accepted to ICLR 2018
- Review scores: 7, 6, 7
- Not really novel
- But the first approach that works well with complex NN models
- May be used for real-world problems
- Time-consuming
- But sample-efficient
- Open questions: Why not use a BNN? Where is the wall-clock time comparison? Why is the overfitting plot not given for the new model?
Thanks for your attention!