Reinforcement Learning:
Classical Foundations and the LLM Era
Flatiron Institute
Carol(ina) Cuesta-Lazaro



A brief timeline of RL milestones:
- 1992: TD-Gammon
- 2013: DQN
- 2016: AlphaGo
- 2019: Hide and Seek
- 2022: ChatGPT (RLHF)
- 2025: Reasoning (RLVR)

[Diagram: the agent-environment loop. At each step the agent's policy observes the state s_t and samples an action a_t \sim \pi_\theta(s_t); the environment returns a reward R_t and the next state s_{t+1}.]
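As a concrete illustration (my own sketch, not from the slides), here is that loop in code using the Gymnasium library, with a random policy standing in for \pi_\theta:

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)           # initial state s_0
total_reward = 0.0
for t in range(500):
    action = env.action_space.sample()  # a_t: random stand-in for pi_theta(s_t)
    obs, reward, terminated, truncated, info = env.step(action)  # R_t and s_{t+1}
    total_reward += reward
    if terminated or truncated:
        break
env.close()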
RL vs Supervised Learning
- In RL the data distribution is changing: the agent generates its own training data by acting.
- The reward signal is evaluative, not instructive: it tells you how good you did, not what was right.
- Outcomes are very stochastic.
- Trade-off between exploration and exploitation: should I do what's worked before, or try something new? (See the sketch after this list.)
- Delayed feedback: the consequence of an action may only show up many steps later.
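One classic answer to the exploration-exploitation question is \epsilon-greedy action selection. A minimal sketch (illustrative, not from the slides; Q is an assumed table of action-value estimates):

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit the highest current value estimate."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit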
The Learning Problem
Naively, we might try to backpropagate the reward through the whole trajectory:
\frac{\partial R}{\partial \theta} = \frac{\partial R}{\partial s_T} \cdot \frac{\partial s_T}{\partial a_{T-1}} \cdot \frac{\partial a_{T-1}}{\partial s_{T-1}} \cdots \frac{\partial a_0}{\partial \theta}
[Diagram: the state s_{t-1} enters the policy network, which samples a_t \sim \pi_\theta(s_{t-1}); the environment then produces the reward and the next state s_t. The environment step, and the action sampling, are not differentiable!]
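The standard workaround, which the REINFORCE algorithm below relies on, is the log-derivative trick (spelled out here for completeness): move the gradient onto the policy's log-probability, so that nothing needs to be differentiated through the environment.

\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[r(\tau)]
= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau
= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ r(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]

(The last step uses that the environment dynamics do not depend on \theta, so only the policy terms of \log \pi_\theta(\tau) survive.)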

Reinforcement Learning from Human Feedback (RLHF)
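In RLHF (standard formulation, not spelled out on this slide), a reward model r_\phi is first fit to human preference pairs, typically with the Bradley-Terry loss below; the policy is then fine-tuned against r_\phi, usually with a KL penalty to a reference model.

\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

where y_w is the preferred and y_l the rejected response to prompt x.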

Reinforcement Learning from Verifiable Rewards (RLVR)
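In RLVR the learned reward model is replaced by a programmatic correctness check. A minimal sketch (my own illustration; the "####" answer delimiter is an assumed convention, in the style of GSM8K):

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1 if the final answer matches the reference exactly, else 0."""
    answer = response.split("####")[-1].strip()  # assumed answer delimiter
    return 1.0 if answer == ground_truth.strip() else 0.0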


Are models learning something new via RL?
https://arxiv.org/pdf/2504.13837

Are high-likelihood responses more correct?
https://arxiv.org/pdf/2510.14901


Solving Long Horizon Reasoning Problems





Resources
The REINFORCE Algorithm
\begin{array}{l}
\text{Initialize } \theta \\
\textbf{repeat:} \\
\quad \text{Sample } \tau_1, \ldots, \tau_N \sim \pi_\theta \\
\quad \hat{g} \leftarrow \frac{1}{N} \sum_i r(\tau_i) \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \\
\quad \theta \leftarrow \theta + \alpha \hat{g}
\end{array}
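A minimal runnable sketch of this algorithm (my own illustration, using PyTorch and Gymnasium's CartPole; the network size and hyperparameters are arbitrary choices):

import gymnasium as gym
import torch
import torch.nn as nn

# Policy network pi_theta: maps a state to a categorical distribution over actions.
env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):              # repeat:
    log_probs, returns = [], []
    for _ in range(8):                    # sample tau_1, ..., tau_N ~ pi_theta
        obs, _ = env.reset()
        ep_log_probs, ep_reward = [], 0.0
        done = False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            ep_log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            ep_reward += reward
            done = terminated or truncated
        # r(tau_i) weights the episode's summed log-probs, giving
        # r(tau_i) * sum_t grad log pi_theta(a_t | s_t) on backward().
        log_probs.append(torch.stack(ep_log_probs).sum())
        returns.append(ep_reward)
    # Gradient *ascent* on g_hat, implemented as descent on its negation.
    loss = -(torch.as_tensor(returns) * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Because there is no baseline, this gradient estimate is high-variance; subtracting a baseline from r(\tau) is the standard next refinement.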