b05902031 謝議霆
b05902008 王行健
1. Play an episode with the model fixed and collect the rewards
   - for each state, record (probability, reward)
2. Modify the reward: reward = reward * r^(t' - t) - baseline, where r = 0.99 is the discount factor and t' is the step at which the reward is received
3. ∇J = ∇log(probability) * reward
4. Update the model with ∇J
5. Repeat from step 1 (see the sketch after this list)
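A minimal PyTorch-style sketch of this loop, assuming a classic gym-style environment (`env.step` returning a 4-tuple), a `policy` network that outputs action probabilities, and a simple mean baseline; the names `env`, `policy`, and `optimizer` are illustrative, not the report's actual code:

```python
import torch

def train_one_episode(env, policy, optimizer, gamma=0.99):
    """One policy-gradient update: play an episode, discount the rewards, ascend log(prob) * reward."""
    log_probs, rewards = [], []
    state, done = env.reset(), False          # old gym-style API assumed
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))   # action probabilities
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))                       # log(probability)
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # step 2: discounted reward minus a (mean) baseline
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)
    returns = returns - returns.mean()

    # steps 3-4: grad J = grad log(probability) * reward, then update the model
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```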
[Figures: results after 200, 1200, and 2500 episodes of training]
How step 2 modifies the rewards for an episode of six states that ends with reward -1:

state | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
reward | 0 | 0 | 0 | 0 | 0 | -1 |
modified | -0.99^5 | -0.99^4 | -0.99^3 | -0.99^2 | -0.99 | -1 |
The same modification for an episode that ends with reward +1 (the sketch below reproduces both tables):

state | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
reward | 0 | 0 | 0 | 0 | 0 | 1 |
modified | 0.99^5 | 0.99^4 | 0.99^3 | 0.99^2 | 0.99 | 1 |
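A small plain-Python sketch of how the "modified" row can be computed from the raw rewards (the helper name `discounted_rewards` is just for illustration):

```python
def discounted_rewards(rewards, gamma=0.99):
    """Propagate each reward backwards with discount gamma (step 2, without the baseline)."""
    modified = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        modified[t] = running
    return modified

# Reproduces the two tables above:
print(discounted_rewards([0, 0, 0, 0, 0, -1]))  # [-0.99^5, ..., -0.99, -1]
print(discounted_rewards([0, 0, 0, 0, 0, 1]))   # [ 0.99^5, ...,  0.99,  1]
```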
The same episode with an additional normalization step (zero mean, unit variance) applied to the modified rewards:

state | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
reward | 0 | 0 | 0 | 0 | 0 | 1 |
modified | 0.99^5 | 0.99^4 | 0.99^3 | 0.99^2 | 0.99 | 1 |
normalize | -1.4540 | -0.8802 | -0.3006 | 0.2849 | 0.8763 | 1.4736 |
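A sketch of that normalization step, reproducing the "normalize" row above; `normalize` is an illustrative name:

```python
import statistics

def normalize(xs):
    """Zero-mean, unit-variance normalization of the discounted rewards of one episode."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)   # population std; the sample std gives slightly different values
    return [(x - mean) / std for x in xs]

print(normalize([0.99**5, 0.99**4, 0.99**3, 0.99**2, 0.99, 1.0]))
# ≈ [-1.4540, -0.8802, -0.3006, 0.2849, 0.8763, 1.4736]
```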
[Figures: results after 5000, 30000, and 60000 episodes of training]
1. Play an episode with the model fixed and collect the rewards
   - for each state, record the transition (state, action, state', reward)
   - store it in the replay buffer
2. Sample a batch of transitions from the replay buffer
3. Compute y = r + 0.99 * max(Q(s', a')) using the target network
4. Compute the gradient of the loss between y and Q(s, a) using the new network
5. Update the new network with the gradient
6. Repeat from step 1 (see the sketch after this list)
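A minimal PyTorch-style sketch of steps 2–5, assuming a `buffer.sample()` that returns batched tensors and networks `q_net` (the new/online network) and `q_target` (the target network); all names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, q_target, optimizer, buffer, batch_size=32, gamma=0.99):
    # step 2: sample a batch of (s, a, s', r, done); a is int64 indices, done is 0/1 floats
    s, a, s_next, r, done = buffer.sample(batch_size)

    # step 3: y = r + gamma * max_a' Q(s', a'), computed with the target network
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values

    # step 4: loss between y and Q(s, a) from the new (online) network
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)

    # step 5: update the new network with the gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```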
1. Play an episode with the model fixed and collect the rewards
   - for each state, record the transition (state, action, state', reward)
   - store it in the replay buffer
2. Sample a batch of transitions from the replay buffer
3. Compute y = r + 0.99 * Q'(s', argmax(Q(s', a'))) using the target network Q', where the current (new) network Q chooses the action a'
4. Compute the gradient of the loss between y and Q(s, a) using the new network
5. Update the new network with the gradient
6. Repeat from step 1 (see the sketch after this list)
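The only change from the previous list is the target in step 3: the current (new) network picks the action, the target network evaluates it. A minimal sketch under the same illustrative naming as above:

```python
import torch

def double_dqn_target(q_net, q_target, s_next, r, done, gamma=0.99):
    """Double DQN target: the online network chooses a', the target network scores it."""
    with torch.no_grad():
        a_next = q_net(s_next).argmax(dim=1)                                   # argmax_a' Q(s', a')
        q_next = q_target(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)    # Q'(s', a')
        return r + gamma * (1 - done) * q_next
```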