Report by Pavel Temirchev
Deep RL reading group
Model-Based vs. Model-Free
Model-based:
+ Low sample complexity
+ Gives the opportunity to model real-world environments
+ Similar to the human thinking process
- Requires careful tuning of the model
- Still no practical implementations for complex environments
- Consumes a lot of wall-clock time
Model-free:
+ Currently shows great results in complex environments
+ Ctrl-C / Ctrl-V training style, little tuning involved
- Requires a LOT of samples
- Usually cannot be applied to real-world environments
Vanilla Model-Based RL
1) Train a transition model on the available data
2) Use the transition model to train the policy
3) Collect new data from the real-world environment
4) Repeat
ASSUMPTION:
If the transition model is good enough,
then a policy that is optimal under it
will also be optimal in the real-world environment...
Vanilla Model Learning
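As a sketch of the objective this slide refers to (assuming the usual choice of a mean-squared one-step prediction error over the transitions stored in \( D \)):
\[
\min_\phi \; \frac{1}{|D|} \sum_{(s_t, a_t, s_{t+1}) \in D} \big\| s_{t+1} - \hat{f}_\phi(s_t, a_t) \big\|_2^2
\]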
Vanilla Policy Learning
Maximize the expected return of \( \pi_\theta \) under the learned model:
\[
\max_\theta \; \hat\eta(\theta; \phi) = \mathbb{E}\!\left[ \sum_{t=0}^{T} r(s_t, a_t) \right],
\]
where
\[
s_{t+1} = \hat{f}_\phi(s_t, a_t)
\]
and
\[
a_t \sim \pi_\theta(\cdot \mid s_t).
\]
Then we use a reparametrization trick to backpropagate through the sampled actions:
\[
a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) \odot \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).
\]
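A minimal PyTorch sketch of this BPTT update through the learned model; `policy`, `model`, and `reward_fn` are assumed interfaces for illustration, not the paper's code:

```python
import torch

def bptt_policy_update(policy, model, reward_fn, s0, horizon, optimizer):
    """One BPTT policy update: roll pi_theta through the learned model f_phi
    and backpropagate the (negative) return through time."""
    s = s0                                    # batch of start states, shape (B, state_dim)
    total_reward = 0.0
    for _ in range(horizon):
        mu, log_std = policy(s)               # Gaussian policy head
        eps = torch.randn_like(mu)
        a = mu + log_std.exp() * eps          # reparametrization trick: gradients flow into mu, log_std
        total_reward = total_reward + reward_fn(s, a).sum()
        s = model(s, a)                       # differentiable learned dynamics
    loss = -total_reward / s0.shape[0]
    optimizer.zero_grad()
    loss.backward()                           # backpropagation through time (BPTT)
    optimizer.step()
    return loss.item()
```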
Vanilla MB-RL Algorithm
1: Initialize a policy \( \pi_\theta \) and a model \( \hat{f}_\phi \)
2: Initialize an empty dataset \( D \)
3: repeat
4: Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)
5: Train the model \( \hat{f}_\phi \) using \( D \)
6: repeat
7: Collect fictitious samples from \( \hat{f}_\phi \) using \( \pi_\theta \)
8: Update the policy using BPTT on the fictitious samples
9: Estimate the performance \( \hat\eta(\theta, \phi) \)
10: until the performance stops improving
11: until the policy performs well in real environment \( f \).
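A short sketch of line 5 (training \( \hat{f}_\phi \) on \( D \)) as plain supervised regression; the tensor layout and hyperparameters are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def fit_model(model, dataset, epochs=5, lr=1e-3, batch_size=256):
    """Fit the dynamics model f_phi by one-step prediction (line 5):
    regress s_{t+1} onto (s_t, a_t) with a mean-squared-error loss."""
    states, actions, next_states = dataset        # tensors built from transitions in D
    loader = DataLoader(TensorDataset(states, actions, next_states),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, s_next in loader:
            loss = ((model(s, a) - s_next) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```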
Troubles with Vanilla MB-RL | Suggested Solution
---|---
1) Single model | Model ensemble
2) BPTT | TRPO
3) Stop criterion: the policy stops improving | Stop criterion: the policy stops improving on most of the models in the ensemble (used as validation)
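A sketch of the ensemble-based stop criterion in the right column, assuming we compare the estimated returns \( \hat\eta(\theta, \phi_i) \) before and after a policy update; the 70% threshold is illustrative:

```python
import numpy as np

def keep_training(returns_before, returns_after, threshold=0.7):
    """Continue the inner policy-optimization loop only while the policy
    still improves on a large enough fraction of the ensemble models
    (the models act as a validation set); the 0.7 threshold is illustrative."""
    improved = np.asarray(returns_after) > np.asarray(returns_before)
    return improved.mean() >= threshold
```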
TRPO: Theoretical Foundation
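For reference, the monotonic-improvement bound from Schulman et al. (2015) behind this slide's heading:
\[
\eta(\tilde\pi) \;\geq\; L_\pi(\tilde\pi) - C \, D_{KL}^{\max}(\pi, \tilde\pi),
\qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s,a} \big| A_\pi(s, a) \big|,
\]
where \( L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s) A_\pi(s, a) \) is the local surrogate approximation of the return.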
TRPO: Practical Algorithm
\[
\max_\theta \; \mathbb{E}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} \, A^{\pi_{\theta_\text{old}}}(s, a) \right]
\]
subject to
\[
\mathbb{E}\!\left[ D_{KL}\!\big( \pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \leq \delta.
\]
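A minimal PyTorch sketch of the two quantities above for a diagonal Gaussian policy (the `policy(states) -> (mu, log_std)` interface is an assumption, not the paper's code); a full TRPO step then maximizes the surrogate under the KL constraint with a conjugate-gradient / line-search update:

```python
import math
import torch

def gaussian_log_prob(mu, log_std, a):
    """Log-density of a diagonal Gaussian N(mu, exp(log_std)^2), summed over action dims."""
    var = (2 * log_std).exp()
    return (-0.5 * ((a - mu) ** 2 / var + 2 * log_std + math.log(2 * math.pi))).sum(-1)

def surrogate_and_kl(policy, old_mu, old_log_std, states, actions, advantages):
    """The two quantities in the practical TRPO problem above: the importance-weighted
    surrogate objective (to maximize) and the mean KL(pi_old || pi_theta) (kept <= delta)."""
    mu, log_std = policy(states)                  # current diagonal-Gaussian policy head
    ratio = (gaussian_log_prob(mu, log_std, actions)
             - gaussian_log_prob(old_mu, old_log_std, actions)).exp()
    surrogate = (ratio * advantages).mean()
    kl = (log_std - old_log_std
          + ((2 * old_log_std).exp() + (old_mu - mu) ** 2) / (2 * (2 * log_std).exp())
          - 0.5).sum(-1).mean()
    return surrogate, kl
```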
Suggested Algorithm
1: Initialize a policy \( \pi_\theta \) and all models \( \hat{f}_{\phi_1} \), ..., \( \hat{f}_{\phi_K} \)
2: Initialize an empty dataset \( D \)
3: repeat
4: Collect samples from the real env \( f \) using \( \pi_\theta \) and add them to \( D \)
5: Train all models \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( D \)
6: repeat
7: Collect fictitious samples from \( \{ \hat{f}_{\phi_i} \}_{i=1}^K \) using \( \pi_\theta \)
8: Update the policy using TRPO on the fictitious samples
9: Estimate the performances \( \hat\eta(\theta, \phi_i) \) for \( i = 1 ... K \)
10: until the performances stop improving
11: until the policy performs well in real environment \( f \).
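A compact Python sketch of the loop above; the environment interface and the helper callables are assumed interfaces for illustration, and the 0.7 stop threshold is illustrative rather than taken from the slide:

```python
import random

def me_trpo(env, policy, models, collect_real, fit_model, trpo_update,
            estimate_return, outer_iters=100, inner_iters=50, threshold=0.7):
    """Sketch of the suggested algorithm: an ensemble of learned models generates
    fictitious samples, TRPO updates the policy on them, and the inner loop stops
    once the estimated returns on the models stop improving."""
    D = []
    for _ in range(outer_iters):                          # line 3: outer loop
        D += collect_real(env, policy)                    # line 4: real samples added to D
        for m in models:                                  # line 5: train every model on D
            fit_model(m, D)
        prev = [estimate_return(policy, m) for m in models]
        for _ in range(inner_iters):                      # lines 6-10: inner loop
            m = random.choice(models)                     # fictitious samples from the ensemble
            trpo_update(policy, m)                        # line 8: TRPO on fictitious samples
            cur = [estimate_return(policy, m_i) for m_i in models]   # line 9
            frac = sum(c > p for c, p in zip(cur, prev)) / len(models)
            if frac < threshold:                          # line 10: most models stopped improving
                break
            prev = cur
    return policy
```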
Plots: BPTT vs. TRPO
Plots: Ensemble vs. Single Model
Plots: ME-TRPO vs. everyone else
Implementation
Data collection:
Neural Networks:
Hardware:
Amazon EC2: 1 NVIDIA K80 GPU, 4 vCPUs, and 61 GB of memory
Implementation: Wall-Clock Time Consumption
Environment | Run time (thousands of seconds)
---|---
Swimmer | ~35.3
Snake | ~60.8
Hopper | ~183.6
Half Cheetah | ~103.7
Ant | ~395.2
Humanoid | ~362.1
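For scale, assuming the column is thousands of seconds: Humanoid's \( \sim 362.1 \times 1000 \,\text{s} \approx 100.6 \) hours of wall-clock time, while Swimmer's \( \sim 35.3 \times 1000 \,\text{s} \approx 9.8 \) hours.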
Fun Stuff
Discussion
Thanks for your
attention!