1992: TD-Gammon
2013: DQN
2016: AlphaGo
2019: Hide and Seek
2022: ChatGPT (RLHF)
2025: Reasoning (RLVR)
Agent
Environment
State
Action
Reward
Policy
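The six terms above fit together in a single interaction loop. The sketch below is a minimal illustration, not a reference implementation: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented here, and the `reset`/`step` interface is only an assumption loosely modelled on common RL toolkits.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..size-1, reward 1.0 at the right end."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        # Start every episode at the left end and return the initial state.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor.
        self.state = max(0, min(self.size - 1, self.state + action))
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    # A policy maps a state to an action; this one ignores the state entirely.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Agent-environment loop: observe state, pick action, receive reward and next state."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```

Calling `run_episode(CorridorEnv(), random_policy, random.Random(0))` returns a list of `(state, action, reward)` tuples, which is the kind of data the agent would later learn from.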
The agent generates its own training data by acting
Should I do what's worked before, or try something new?
The reward tells you how good an action was, not which action was right
The consequence of an action may only show up many steps later
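Two of the points above have standard textbook treatments that can be sketched briefly. Epsilon-greedy action selection is one simple answer to the explore-or-exploit question, and discounted returns are one common way to pass a late reward back to the earlier actions that led to it. The function names and the default `gamma` below are illustrative assumptions, not something taken from the slides.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """With probability epsilon try a random action, otherwise take the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # explore
    # exploit: index of the highest estimated value (first one on ties)
    return max(range(len(action_values)), key=action_values.__getitem__)


def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates everything that came after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([0.0, 0.0, 1.0], gamma=0.5)` gives `[0.25, 0.5, 1.0]`: the first two actions earned no immediate reward but still receive part of the credit for the final one.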
[Diagram: State -> Policy Network -> Environment -> Reward, annotated "Not Differentiable!"]
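One widely used way around a non-differentiable step in this pipeline is the score-function (REINFORCE) estimator, which only needs the gradient of the policy's log-probability: grad E[R] = E[R * grad log pi(a | s)]. The sketch below is an illustration under stated assumptions, not necessarily the method these slides go on to present. The two-action setup, `black_box_reward`, the learning rate, and the step count are all made up for the example, and the "policy network" is reduced to a bare vector of logits.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def black_box_reward(action):
    # Stand-in for the environment: we can query it but not differentiate through it.
    return 1.0 if action == 1 else 0.0


def reinforce_step(logits, rng, lr=0.5):
    probs = softmax(logits)
    # Sampling the action is the step that blocks ordinary backpropagation.
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = black_box_reward(action)
    # For a softmax policy, grad of log pi(action) w.r.t. the logits is onehot(action) - probs.
    grad_log_prob = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    # Gradient ascent on expected reward using the single-sample estimate.
    return [x + lr * reward * g for x, g in zip(logits, grad_log_prob)]


def train(steps=1000, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # start with a uniform policy over the two actions
    for _ in range(steps):
        logits = reinforce_step(logits, rng)
    return softmax(logits)
```

Running `train()` should shift most of the probability mass onto action 1, the only rewarded action, even though no gradient ever passes through `black_box_reward`.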
https://arxiv.org/pdf/2504.13837
https://arxiv.org/pdf/2510.14901