speaker: Pavel Temirchev
[Figure: a choice between "go left" and "go right"; captions: "hey, we already discussed it", "check it yourself!", "use replay buffer"]
[Figure: multi-armed bandits №1, №2, №3, ..., №N; example: COVID-19]
| a = 1 | a = 2 | a = 3 | ... | a = N |
|---|---|---|---|---|
| REWARD | REWARD | REWARD | ... | REWARD |
Actions: a = 1, a = 2, a = 3, ..., a = N
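The table above describes the N-armed bandit setting: each action a = 1, ..., N returns a stochastic reward, and the agent must balance exploring unfamiliar arms against exploiting the best-looking one. Below is a minimal sketch of an epsilon-greedy bandit agent; the arm count, Gaussian reward model, and epsilon value are illustrative assumptions, not taken from the slides.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Minimal N-armed bandit agent with epsilon-greedy exploration."""

    def __init__(self, n_actions: int, epsilon: float = 0.1):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.counts = np.zeros(n_actions)    # how many times each arm was pulled
        self.q_values = np.zeros(n_actions)  # running mean reward per arm

    def act(self, rng: np.random.Generator) -> int:
        # With probability epsilon pick a random arm (explore),
        # otherwise pick the best-known arm (exploit).
        if rng.random() < self.epsilon:
            return int(rng.integers(self.n_actions))
        return int(np.argmax(self.q_values))

    def update(self, action: int, reward: float) -> None:
        # Incrementally update the mean reward estimate of the pulled arm.
        self.counts[action] += 1
        self.q_values[action] += (reward - self.q_values[action]) / self.counts[action]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_means = rng.normal(size=5)  # hidden mean reward of each arm (assumed Gaussian)
    agent = EpsilonGreedyBandit(n_actions=5, epsilon=0.1)
    for _ in range(1000):
        a = agent.act(rng)
        r = float(rng.normal(true_means[a]))  # noisy reward from the chosen arm
        agent.update(a, r)
    print("estimated Q-values:", np.round(agent.q_values, 2))
    print("true mean rewards: ", np.round(true_means, 2))
```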
Soft Q-learning:
https://arxiv.org/pdf/1702.08165.pdf
Soft Actor-Critic:
https://arxiv.org/pdf/1801.01290.pdf
Comprehensive review of probabilistic inference for RL:
https://arxiv.org/pdf/1805.00909.pdf
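Both Soft Q-learning and Soft Actor-Critic optimize the maximum-entropy RL objective, which adds the policy's entropy to the expected return so that exploration is rewarded explicitly. A sketch in standard notation (following the SAC paper), where the temperature α trades off reward against entropy:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s_t)} \big[ \log \pi(a \mid s_t) \big]
```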
Implementation in TensorFlow:
https://github.com/rail-berkeley/softlearning
Implementation in Catalyst.RL:
https://github.com/catalyst-team/catalyst/tree/master/examples/rl_gym
Hierarchical policies (further reading):