Many examples taken from David Silver's UCL course
(literally took screenshots this time!)
https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDMOYHWgPebj2MfCFzFObQ
Policy Evaluation: Estimate v_π
Policy Improvement: Generate π' ≥ π
Policy Evaluation: MC evaluation
Policy Improvement: ε-Greedy
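A minimal sketch of the ε-greedy improvement step, assuming the action values for one state are held in a plain list. The function name `epsilon_greedy` and its parameters are hypothetical and not from the course material: with probability ε a random action is explored, otherwise the action with the highest current estimate is taken.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for one state.

    q_values: list of estimated action values Q(s, a) (assumed layout)
    epsilon:  exploration probability in [0, 1]
    """
    # Explore: pick any action uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Exploit: pick the greedy action (ties go to the lowest index).
    return q_values.index(max(q_values))
```

With ε = 0 this reduces to plain greedy improvement; with ε = 1 every action is chosen uniformly at random.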
Goal: Reach G in the least time-steps
State: Location in the grid
Actions: U, D, L, R, Diag.
Rewards: -1 per timestep
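The problem setup above could be sketched as a single transition function. This is only an illustration under assumptions: the grid size, the goal position, reading "Diag." as four diagonal moves, and staying in place at the grid edge are all guesses, and the names `ACTIONS` and `step` are hypothetical.

```python
# Assumed action set: 4 straight moves plus 4 diagonals, as (row, col) offsets.
ACTIONS = {
    "U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1),
    "UL": (-1, -1), "UR": (-1, 1), "DL": (1, -1), "DR": (1, 1),
}

def step(state, action, goal, n_rows, n_cols):
    """Apply one action; return (next_state, reward, done).

    state, goal: (row, col) tuples. Moves off the grid are clipped
    to the edge (an assumption, not stated in the notes).
    """
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), n_rows - 1)
    col = min(max(state[1] + d_col, 0), n_cols - 1)
    next_state = (row, col)
    # Reward is -1 on every time-step, so shorter paths score higher.
    return next_state, -1, next_state == goal
```

Because every step costs -1, the return of an episode is minus its length, so maximizing return is the same as reaching G in the fewest time-steps.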