Artyom Sorokin | 18 Mar
Let's recall the Reinforcement Learning objective:

\( J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\big[ r(\tau) \big], \qquad r(\tau) = \sum_{t} r(s_t, a_t), \)

where \(\tau = (s_1, a_1, \ldots, s_T, a_T)\) is a trajectory generated by running the policy \(\pi_{\theta}\) in the environment.

GOAL: We want to find the gradient of the RL objective \(J(\theta)\) with respect to the policy parameters \(\theta\). To maximize the mean expected return, find \(\nabla_{\theta} J(\theta)\).
Log-derivative trick: \( \nabla_{\theta} p_{\theta}(\tau) = p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau) \).

To maximize the mean expected return, we take the gradient of \(J(\theta)\) w.r.t. \(\theta\) and apply this trick inside the expectation.

We can rewrite \(p_{\theta}(\tau)\) as a product of the initial state distribution, the policy terms \(\pi_{\theta}(a_t \mid s_t)\), and the transition dynamics. Only the policy terms depend on \(\theta\), so the dynamics drop out of \(\nabla_{\theta} \log p_{\theta}(\tau)\).
Policy Gradients:
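For reference, here is a sketch of the standard derivation (finite horizon \(T\); notation as in the objective above):

\[ \nabla_{\theta} J(\theta) = \nabla_{\theta} \int p_{\theta}(\tau)\, r(\tau)\, d\tau = \int p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\big[ \nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau) \big], \]

and since \( p_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \), the dynamics terms do not depend on \(\theta\) and vanish from the gradient:

\[ \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[ \Big( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \Big)\, r(\tau) \Big]. \]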
We don't know the true expectation here, but of course we can approximate it with sampling: estimate the policy gradient from trajectories sampled with the current policy, then update the policy parameters by gradient ascent.
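In symbols, with \(N\) sampled trajectories \(\tau_i\) (a standard sketch; the learning rate \(\alpha\) is notation introduced here):

\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \Big)\, r(\tau_i), \qquad \theta \leftarrow \theta + \alpha\, \nabla_{\theta} J(\theta). \]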
REINFORCE (Pseudocode):

To train REINFORCE we estimate this gradient again and again, and we can only use samples generated with the current \(\pi_{\theta}\)! This is on-policy learning.
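As a concrete reference for the update above, a minimal REINFORCE sketch in Python (it assumes a discrete-action, Gymnasium-style environment and a PyTorch policy network; the names `policy_net`, `optimizer`, `env` and the hyperparameters are illustrative, not from the slides):

```python
import torch

def reinforce_update(policy_net, optimizer, env, gamma=0.99, n_episodes=10):
    """One REINFORCE update: sample on-policy episodes, weight log-probs by the episode return."""
    losses = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # Discounted return of the whole trajectory, r(tau)
        episode_return = sum(gamma ** t * r for t, r in enumerate(rewards))
        # REINFORCE loss: -(sum_t log pi(a_t | s_t)) * r(tau)
        losses.append(-torch.stack(log_probs).sum() * episode_return)
    # Gradient ascent on J(theta) == gradient descent on the mean loss
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
```

Averaging the per-episode losses corresponds to the \(1/N\) factor in the estimator above.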
Let's recall the Reinforcement Learning objective \(J(\theta)\) once more, and that we can rewrite the probability of generating trajectory \(\tau\) as:

\( p_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t). \)
What if we use behaviour cloning to learn a policy? We take a cross-entropy loss for each transition in the dataset:
Example for one transition, over the possible actions \(a_t\) at state \(s_t\):

    Policy at \(s_t\):                 0.2   0.7   0.1
    Ground Truth at state \(s_t\):     1     0     0
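With a one-hot ground truth like this, the cross-entropy for a transition reduces to the negative log-likelihood of the expert's action. As a sketch (writing \(a_t^{*}\) for the expert's ground-truth action and \(\mathcal{D}\) for the dataset, notation introduced here):

\[ J_{BC}(\theta) = - \sum_{(s_t,\, a_t^{*}) \in \mathcal{D}} \log \pi_{\theta}(a_t^{*} \mid s_t). \]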
Gradients with behaviour cloning vs. Policy Gradients:

Behaviour cloning: the goal is to minimize \(J_{BC}(\theta)\), i.e. to maximize \(-J_{BC}(\theta)\). BC trains the policy to choose the same actions as the expert.

Policy Gradients: the goal is to maximize \(J(\theta)\). PG trains the policy to choose actions that lead to higher episodic returns!
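To make the contrast concrete, here are the two gradients side by side (a sketch in the notation above):

\[ \nabla_{\theta}\big({-J_{BC}(\theta)}\big) = \sum_{(s_t,\, a_t^{*}) \in \mathcal{D}} \nabla_{\theta} \log \pi_{\theta}(a_t^{*} \mid s_t), \qquad \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[ \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, r(\tau) \Big]. \]

BC pushes up the log-probability of the expert's action with weight 1, while PG pushes up the log-probability of the sampled action with a weight equal to the episodic return.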
Problem: high variance!

Recall tabular RL: the Monte-Carlo return has high variance!

Doesn't it look strange? Every action is weighted by the return of the whole trajectory, including rewards collected before the action was taken.

Causality principle: an action at step \(t\) cannot affect the reward at step \(t'\) when \(t' < t\), so each action should only be weighted by the rewards that come after it. Later actions become less relevant!
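Applying the causality principle to the sampled estimator gives the "reward-to-go" form (a standard sketch; rewards earned before step \(t\) are dropped from the weight of \(\nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t})\)):

\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \Big( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \Big). \]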
Final Version:

Vanilla PG updates the policy proportionally to \(r(\tau)\). With a baseline \(b\) it updates the policy proportionally to how much \(r(\tau)\) is better than average:

\( \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log p_{\theta}(\tau_i)\, \big( r(\tau_i) - b \big), \)

where \(b\) is, for example, the average sampled return \(\frac{1}{N} \sum_{i} r(\tau_i)\).

Subtracting a baseline is unbiased in expectation! (and often works better)
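A quick sketch of why the baseline term does not bias the gradient (for any \(b\) that does not depend on the actions):

\[ \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\big[ \nabla_{\theta} \log p_{\theta}(\tau)\, b \big] = b \int p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\, d\tau = b \int \nabla_{\theta} p_{\theta}(\tau)\, d\tau = b\, \nabla_{\theta} \int p_{\theta}(\tau)\, d\tau = b\, \nabla_{\theta} 1 = 0. \]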
Value-based algorithms (DQN, Q-learning, SARSA, etc.) use an \(\epsilon\)-greedy policy to encourage exploration!

Entropy regularization for the policy: in policy-based algorithms (PG, A3C, PPO, etc.) we can use a more flexible trick, adding \(-H(\pi_{\theta})\) to the loss function, where \(H\) is the entropy of the policy's action distribution.
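Concretely, a common form of the entropy-regularized loss (a sketch: \(\hat{w}_{i,t}\) stands for whatever return estimate is being used, e.g. reward-to-go minus a baseline, and \(\beta > 0\) is a small coefficient; both symbols are notation introduced here):

\[ L(\theta) = - \frac{1}{N} \sum_{i,t} \Big[ \log \pi_{\theta}(a_{i,t} \mid s_{i,t})\, \hat{w}_{i,t} + \beta\, H\big(\pi_{\theta}(\cdot \mid s_{i,t})\big) \Big]. \]

Keeping the entropy high discourages the policy from collapsing to a deterministic strategy too early.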
Final Version with the "causality improvement" and a baseline:

\( \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \Big( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b \Big) \)

Now recall value functions. What is the reward-to-go term in this estimate? It is a single point estimate of \(Q_{\pi_{\theta}}(s_{i,t}, a_{i,t})\).

Combining PG and Value Functions! The true \(Q_{\pi_{\theta}}(s_{i,t}, a_{i,t})\) has lower variance than a single point estimate and better accounts for causality here. And what about the baseline?
Combining PG and Value Functions!

Advantage Function: \( A_{\pi_{\theta}}(s_t, a_t) = Q_{\pi_{\theta}}(s_t, a_t) - V_{\pi_{\theta}}(s_t) \), i.e. how much choosing \(a_t\) is better than the policy's average behaviour in \(s_t\).

It would be easier to learn only one function, and we can do better: approximate \(Q_{\pi_{\theta}}(s_t, a_t)\) with a sample,

\( A_{\pi_{\theta}}(s_t, a_t) \approx r(s_t, a_t) + \gamma V_{\pi_{\theta}}(s_{t+1}) - V_{\pi_{\theta}}(s_t), \)

so it is enough to learn the \(V\)-function, which is easier to learn as it depends on fewer arguments!
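Putting the pieces together, a sketch of the resulting actor-critic policy-gradient estimator with the sampled advantage (\(\gamma\) is the discount factor):

\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \Big( r(s_{i,t}, a_{i,t}) + \gamma V_{\pi_{\theta}}(s_{i,t+1}) - V_{\pi_{\theta}}(s_{i,t}) \Big). \]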
Recall Policy Iteration: alternate between evaluating the current policy and improving it. Here the learned value function plays the evaluation role: \(\phi\) denotes the critic parameters, a different set of parameters from the policy's \(\theta\).

When computing the critic's bootstrapped target there is no actual Target Network (recall DQN); we just stop the gradients through the target.
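A minimal sketch of one actor-critic update on a batch of transitions (PyTorch; `actor`, `critic`, the tensor names, the entropy coefficient, and the 0/1 float `dones` tensor are illustrative assumptions, not taken from the slides):

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(actor, critic, states, actions, rewards, next_states, dones, gamma=0.99):
    """Critic target uses stopped gradients ("no target network, just stop the gradients")."""
    values = critic(states).squeeze(-1)                  # V_phi(s_t)
    with torch.no_grad():
        next_values = critic(next_states).squeeze(-1)    # V_phi(s_{t+1}), gradients stopped
        targets = rewards + gamma * (1.0 - dones) * next_values
    critic_loss = F.mse_loss(values, targets)            # fit V_phi(s_t) to r + gamma * V_phi(s_{t+1})
    advantages = (targets - values).detach()             # A(s_t, a_t) ~ r + gamma*V(s') - V(s)
    dist = torch.distributions.Categorical(logits=actor(states))
    # Policy-gradient loss with a small entropy bonus (see entropy regularization above)
    actor_loss = -(dist.log_prob(actions) * advantages).mean() - 0.01 * dist.entropy().mean()
    return actor_loss, critic_loss
```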
Problem: with on-policy policy gradients we can't use a Replay Memory, but we still need to decorrelate our samples! Answer: Parallel Computation!

Each worker's procedure: copy the current global parameters, collect a short rollout in its own copy of the environment, compute gradients on that rollout, and send them to the shared parameters. All workers run asynchronously (this is the "asynchronous" in A3C).

Pros: samples from many environments are decorrelated without a replay memory, and data collection is fast. Cons: asynchronous updates mean a worker's gradients are often computed with stale parameters.

Solution: run the workers synchronously, gather all their transitions, and make a single update. It's usually called A2C... again.
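To make the synchronous variant concrete, a minimal sketch of one A2C-style step over several parallel environments (it assumes a vectorized `envs` object whose `step` takes a batch of actions and returns batched tensors, `obs` as a float tensor of shape `(num_envs, obs_dim)`, and the hypothetical `actor_critic_losses` helper sketched above; all of these names are illustrative):

```python
import torch

def a2c_step(actor, critic, optimizer, envs, obs, n_steps=5, gamma=0.99):
    """One synchronous A2C update: every worker/environment advances n_steps in lockstep."""
    batch = {k: [] for k in ("obs", "actions", "rewards", "next_obs", "dones")}
    for _ in range(n_steps):
        with torch.no_grad():
            dist = torch.distributions.Categorical(logits=actor(obs))
            actions = dist.sample()
        next_obs, rewards, dones, _ = envs.step(actions)   # batched step over all parallel envs
        for key, value in zip(batch, (obs, actions, rewards, next_obs, dones)):
            batch[key].append(value)
        obs = next_obs
    # A single update built from every worker's transitions: no replay memory needed
    data = {k: torch.cat(v) for k, v in batch.items()}
    # actor_critic_losses: the helper sketched in the actor-critic section above
    actor_loss, critic_loss = actor_critic_losses(
        actor, critic, data["obs"], data["actions"], data["rewards"],
        data["next_obs"], data["dones"].float(), gamma)
    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()            # 0.5 is an illustrative critic weight
    optimizer.step()
    return obs
```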