Where's The Reward?
A Review of Reinforcement Learning for
Instructional Sequencing
Shayan Doroudi, Vincent Aleven, Emma Brunskill
AIED 2020 Journal Track





Research Questions
- Over the past 50 years, how successful has reinforcement learning been in discovering useful adaptive instructional policies?
- Under what conditions is reinforcement learning most likely to be successful in advancing instructional sequencing?
Overview
- Reinforcement Learning: Towards a "Theory of Instruction"
- Historical Perspective
- Review of Empirical Studies
- Discussion: Where's the Reward?
- Planning for the Future
Reinforcement Learning (RL)
Markov Decision Process (MDP):
- Set of states \(S\)
- Set of actions \(A\)
- Transition matrix \(T\)
- Reward function \(R\)
- Horizon \(H\)
Policy \(\pi\): a mapping from states to actions
Optimal Policy: the policy that achieves the highest expected reward
MDP Planning: methods for deriving optimal policies given an MDP (a minimal sketch follows below)
Reinforcement Learning: methods for deriving high-reward policies when the transition matrix is unknown
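As a concrete (hypothetical) illustration of MDP planning, here is a minimal Python sketch of finite-horizon value iteration on a toy two-state tutoring MDP; the states, actions, and probabilities are made up for illustration, not drawn from any study in the review.

```python
import numpy as np

# Toy tutoring MDP (all numbers are hypothetical, for illustration only):
# states:  0 = skill not mastered, 1 = skill mastered
# actions: 0 = show a worked example, 1 = give a practice problem
n_states, n_actions, horizon = 2, 2, 10

# T[s, a, s'] = probability of transitioning from s to s' after action a
T = np.array([
    [[0.7, 0.3],    # not mastered, worked example
     [0.8, 0.2]],   # not mastered, practice problem
    [[0.1, 0.9],    # mastered, worked example (some forgetting)
     [0.0, 1.0]],   # mastered, practice problem
])

# R[s, a] = immediate reward, e.g., 1 per step spent in the mastered state
R = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
])

# Finite-horizon value iteration: V(s) = max_a [ R(s, a) + sum_s' T(s, a, s') V(s') ]
V = np.zeros(n_states)
for _ in range(horizon):
    Q = R + T @ V        # Q[s, a]: value of taking action a in state s
    V = Q.max(axis=1)

# The optimal policy maps each state to the action with the highest Q-value.
policy = Q.argmax(axis=1)
print("Optimal action in each state:", policy)
```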
Theory of Instruction
Atkinson's (1972) "Ingredients for a Theory of Instruction", with their MDP counterparts:
- A model of the learning process (\(S\), \(T\))
- Specification of admissible instructional actions (\(A\))
- Specification of instructional objectives (\(R\))
- A measurement scale that permits costs to be assigned to each of the instructional actions and payoffs to the achievement of instructional objectives (\(R\))
...taken in conjunction with methods for deriving optimal strategies (MDP Planning)
Overview
- Reinforcement Learning: Towards a "Theory of Instruction"
- Historical Perspective
- Review of Empirical Studies
- Discussion: Where's the Reward?
- Planning for the Future
First Wave: 1960s-70s
- Teaching machines were popular in late 50s-early 60s.
- Computers! → Computer-Assisted Instruction
- Dynamic Programming and Markov Decision Processes
- Mathematical Psychology: studying mathematical models of learning
Why 1960s?
“The mathematical techniques of optimization used in theories of instruction draw upon a wealth of results from other areas of science, especially from tools developed in mathematical economics and operations research over the past two decades, and it would be my prediction that we will see increasingly sophisticated theories of instruction in the near future.”
Suppes (1974)
The Place of Theory in Educational Research
AERA Presidential Address
RL's Status in Education Research
The Dark Ages
c. 1975 - 1999
Seemingly no work applying reinforcement learning to instructional sequencing during this time.
The Three Waves
| First Wave (1960s-70s) |
Second Wave (2000s-2010s) |
Third Wave (2010s) |
|
|---|---|---|---|
| Medium of Instruction | Teaching Machines / CAI | Intelligent Tutoring Systems | Massive Open Online Courses |
| Optimization Methods | Decision Processes | Reinforcement Learning | Deep RL |
| Models of Learning | Mathematical Psychology | Machine Learning AIED/EDM |
Deep Learning |
More data-driven
More data-generating
Overview
- Reinforcement Learning: Towards a "Theory of Instruction"
- Historical Perspective
- Review of Empirical Studies
- Discussion: Where's the Reward?
- Planning for the Future
Review of Empirical Studies
Searched for all empirical studies (as of December 2018) that compared one or more RL-induced policies (broadly conceived) to one or more baseline policies.
Found 41 such studies, grouped into five qualitatively different clusters.
Five Clusters of Studies
- Paired-Associate Learning Tasks (e.g., leer → "to read")
  - Includes all studies done in the 1960s-70s.
  - Treats each pair as independent.
  - Use psychological models that account for learning and forgetting (illustrative sketch after this list).
- Concept Learning Tasks
  - Use cognitive science models that describe how people learn concepts from examples.
- Sequencing Activity Types (e.g., worked example vs. problem solving)
  - Content is pre-determined; the decision is what kind of instruction to give for each piece of content.
  - Worked Example: Solve for \(x\): \(x^2 - 4 = 12\); \(x^2 = 12 + 4\); \(x^2 = 16\); \(x = \pm\sqrt{16} = \pm 4\).
  - Problem Solving: Solve for \(x\): \(x^2 - 4 = 12\).
- Sequencing Interdependent Content
  - Pieces of content are interrelated.
  - Instructional methods could also vary.
- Not Optimizing Learning
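The reviewed paired-associate studies optimize practice using various psychological models; as a purely illustrative sketch (not any specific model from the review), the following Python picks the next pair whose predicted recall is lowest under a made-up exponential forgetting model. The parameter values and update rule are placeholders.

```python
import math

# Hypothetical adaptive policy for paired-associate practice (illustration only):
# each pair's recall probability decays with time since last practice, and
# each practice trial strengthens the pair's memory.
pairs = {"leer": "to read", "comer": "to eat", "vivir": "to live"}
strength = {w: 1.0 for w in pairs}        # arbitrary initial memory strengths
last_practice = {w: 0.0 for w in pairs}   # time of last practice (seconds)

def predicted_recall(word, now, decay=0.01):
    """P(recall) under a simple exponential forgetting curve (placeholder model)."""
    elapsed = now - last_practice[word]
    return math.exp(-decay * elapsed / strength[word])

def next_pair(now):
    """Adaptive policy: practice the pair most at risk of being forgotten."""
    return min(pairs, key=lambda w: predicted_recall(w, now))

def record_practice(word, correct, now, gain=0.5):
    """Update the (made-up) model after a practice trial."""
    strength[word] += gain if correct else gain / 2
    last_practice[word] = now

# Simulate a few trials, 30 seconds apart.
t = 0.0
for _ in range(5):
    word = next_pair(t)
    record_practice(word, correct=True, now=t)
    t += 30.0
```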
Five Clusters of Studies

| | RL Policy Outperformed Baseline | Mixed Results / ATI | RL Policy Did Not Outperform Baseline |
|---|---|---|---|
| Paired-Associate Learning Tasks | 11 | 0 | 3 |
| Concept Learning Tasks | 4 | 2 | 1 |
| Sequencing Activity Types | 4 | 4 | 2 |
| Sequencing Interdependent Content | 0 | 2 | 6 |
| Not Optimizing Learning | 2 | 0 | 0 |
Overview
- Reinforcement Learning: Towards a "Theory of Instruction"
- Historical Perspective
- Review of Empirical Studies
- Discussion: Where's the Reward?
- Planning for the Future
The Role of Theory
[Figure: the clusters of studies arranged along a spectrum of theoretical basis, from more theory-driven (psychologically-inspired models, e.g., the spacing effect and the expertise reversal effect) to less theory-driven (data-driven models).]
When Has RL Been Successful?
Students' prior knowledge can affect how much instructional sequencing matters.
RL seems to perform better when the baseline policy is weaker!
Robust evaluations can help determine when RL will be successful.
Overview
- Reinforcement Learning: Towards a "Theory of Instruction"
- Historical Perspective
- Review of Empirical Studies
- Discussion: Where's the Reward?
- Planning for the Future
Planning for the Future
Data-driven models should be combined with insights from psychological theories.
Theory can inform:
- Choice of models
- Types of actions
- Space of policies
Planning for the Future
Machine intelligence should be combined with insights from human intelligence.
Human insights can come in the form of:
- Psychological theory (beyond the cognitive)
- Learner control
- Teacher control
"[T]he development of a theory of instruction cannot progress if one holds the view that a complete theory of learning is a prerequisite. Rather, advances in learning theory will affect the development of a theory of instruction, and conversely the development of a theory of instruction will influence research on learning."
Atkinson (1972)
Acknowledgements
The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grants R305A130215 and R305B150008 to Carnegie Mellon University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Dept. of Education.
Inclusion Criteria
We consider any papers published before December 2018 where:
- There is (implicitly) a model of the learning process, where different instructional actions probabilistically change the state of a student.
- There is an instructional policy that maps past observations from a student (e.g., responses to questions) to instructional actions.
- Data collected from students are used to learn either:
  - the model, or
  - an adaptive policy.
- If the model is learned, the instructional policy is designed to (approximately) optimize that model according to some reward function.
What's Not Included?
- Adaptive policies that use hand-made or heuristic decision rules (rather than data-driven/optimized decision rules)
- Experiments that do not control for everything other than sequence of instruction
- Experiments that use RL for other educational purposes, such as:
- generating data-driven hints (Stamper et al., 2013) or
- giving feedback (Rafferty et al., 2015)
The Role of Theory
Psychological theory can help determine both when sequencing might matter as well as how to most effectively leverage RL for instructional sequencing.
Case Study
- Fractions Tutor
- Two experiments testing RL-induced policies (both found no significant difference)
- Off-policy policy evaluation
Fractions Tutor



Experiment 1
- Used prior data to fit the G-SCOPE model (Hallak et al., 2015).
- Used the G-SCOPE model to derive two new Adaptive Policies.
- Wanted to compare the Adaptive Policies to a Baseline Policy (fixed, spiraling curriculum).
- Simulated both policies on the G-SCOPE model to predict posttest scores (out of 16 points).
Experiment 1: Policy Evaluation

| | Baseline | Adaptive Policy |
|---|---|---|
| Simulated Posttest | 5.9 ± 0.9 | 9.1 ± 0.8 |
| Actual Posttest | 5.5 ± 2.6 | 4.9 ± 2.6 |

Doroudi, Aleven, and Brunskill, L@S 2017
Single Model Simulation
- Used by Chi, VanLehn, Littman, and Jordan (2011) and Rowe, Mott, and Lester (2014) in educational settings.
- Rowe, Mott, and Lester (2014): New adaptive policy estimated to be much better than random policy.
- But in experiment, no significant difference found (Rowe and Lester, 2015).
Importance Sampling
- Estimator that gives unbiased and consistent estimates for a policy!
- Can have very high variance when the evaluated policy differs from the policy that collected the data (see the sketch below).
- Example: Worked example or problem-solving?
- 20 sequential decisions ⇒ need over \(2^{20}\) students
- 50 sequential decisions ⇒ need over \(2^{50}\) students!
- Importance sampling can prefer the worse of two policies more often than not (Doroudi et al., 2017b).
Doroudi, Thomas, and Brunskill, UAI 2017, Best Paper
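To make the variance issue concrete, here is a minimal sketch (with synthetic data, not data from any study) of the per-trajectory importance sampling estimator: each logged trajectory is reweighted by the ratio of the evaluated policy's action probabilities to the logging policy's.

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_sampling_estimate(trajectories, target_prob):
    """Per-trajectory importance sampling estimate of a target policy's value.

    trajectories: list of (actions, logging_probs, total_reward) tuples logged
                  under the behavior (logging) policy.
    target_prob:  function giving the probability the target policy assigns
                  to an action.
    """
    estimates = []
    for actions, logging_probs, total_reward in trajectories:
        # Weight = product over decisions of pi_target(a) / pi_logging(a).
        weight = np.prod([target_prob(a) / p for a, p in zip(actions, logging_probs)])
        estimates.append(weight * total_reward)
    return float(np.mean(estimates))

# Synthetic example: 20 binary decisions per student (worked example vs.
# problem solving), logged under a uniform-random policy.
n_students, n_decisions = 1000, 20
trajectories = []
for _ in range(n_students):
    actions = rng.integers(0, 2, size=n_decisions)
    logging_probs = np.full(n_decisions, 0.5)
    posttest = rng.normal(loc=10 * actions.mean(), scale=1.0)  # fake outcome
    trajectories.append((actions, logging_probs, posttest))

# Evaluate a deterministic policy that always gives worked examples (action 0).
# Only logged trajectories that match it exactly get nonzero weight -- roughly
# n_students / 2**20 of them -- which is why the estimate is so high-variance.
print(importance_sampling_estimate(trajectories, lambda a: 1.0 if a == 0 else 0.0))
```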
Robust Evaluation Matrix

| | Policy 1 | Policy 2 | Policy 3 |
|---|---|---|---|
| Student Model 1 | \(V_{SM_1,P_1}\) | \(V_{SM_1,P_2}\) | \(V_{SM_1,P_3}\) |
| Student Model 2 | \(V_{SM_2,P_1}\) | \(V_{SM_2,P_2}\) | \(V_{SM_2,P_3}\) |
| Student Model 3 | \(V_{SM_3,P_1}\) | \(V_{SM_3,P_2}\) | \(V_{SM_3,P_3}\) |
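A minimal sketch of filling in such a matrix: simulate each candidate policy under each candidate student model and record the estimated value. The `student_model` and `policy` objects here are hypothetical stand-ins, not actual implementations from the case study.

```python
import numpy as np

def simulate_posttest(student_model, policy, n_students=1000, n_steps=50):
    """Estimate a policy's mean posttest score by rolling it out on a student model.

    Hypothetical interfaces: the student model exposes reset/step/posttest,
    and the policy maps an observation to an instructional action.
    """
    scores = []
    for _ in range(n_students):
        obs = student_model.reset()
        for _ in range(n_steps):
            obs = student_model.step(policy.act(obs))
        scores.append(student_model.posttest())
    return float(np.mean(scores))

def robust_evaluation_matrix(student_models, policies):
    """Matrix V with V[i][j] = estimated value of policy j under student model i."""
    return np.array([[simulate_posttest(m, p) for p in policies]
                     for m in student_models])

# Usage sketch (class names are placeholders):
# V = robust_evaluation_matrix(
#     [GScopeModel(), BKTModel(), DKTModel()],
#     [BaselinePolicy(), AdaptivePolicy()],
# )
# A policy that looks good under every plausible model is a safer bet than one
# that only looks good under the model it was optimized for.
```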
Robust Evaluation Matrix

| | Baseline | Adaptive Policy | Awesome Policy |
|---|---|---|---|
| G-SCOPE Model | 5.9 ± 0.9 | 9.1 ± 0.8 | 16 |
| Bayesian Knowledge Tracing | 6.5 ± 0.8 | 7.0 ± 1.0 | 16 |
| Deep Knowledge Tracing | 9.9 ± 1.5 | 8.6 ± 2.1 | 16 |

Doroudi, Aleven, and Brunskill, L@S 2017
Experiment 2
- Used the Robust Evaluation Matrix to test new policies.
- Found a New Adaptive Policy that was very simple but robustly expected to do well:
  - sequence problems in increasing order of average time;
  - skip any problems where students have demonstrated mastery of all skills (according to BKT; see the sketch below).
- Ran an experiment testing the New Adaptive Policy.
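A minimal sketch of the mastery check behind the "skip mastered problems" rule, using the standard Bayesian Knowledge Tracing update; the parameter values, skill names, and mastery threshold are placeholders, not the tutor's actual settings.

```python
def bkt_update(p_know, correct, p_learn=0.1, p_slip=0.1, p_guess=0.2):
    """Standard Bayesian Knowledge Tracing update of P(skill is known)
    after observing one response (parameter values are placeholders)."""
    if correct:
        posterior = p_know * (1 - p_slip) / (
            p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        posterior = p_know * p_slip / (
            p_know * p_slip + (1 - p_know) * (1 - p_guess))
    # Allow for the chance the skill is learned on this practice opportunity.
    return posterior + (1 - posterior) * p_learn

def should_skip(problem_skills, p_know, threshold=0.95):
    """Skip a problem if every skill it exercises is already judged mastered."""
    return all(p_know[skill] >= threshold for skill in problem_skills)

# Usage sketch with made-up skills and estimates:
p_know = {"equivalent_fractions": 0.96, "fractions_on_number_line": 0.40}
p_know["fractions_on_number_line"] = bkt_update(
    p_know["fractions_on_number_line"], correct=True)
print(should_skip({"equivalent_fractions"}, p_know))                           # True
print(should_skip({"equivalent_fractions", "fractions_on_number_line"}, p_know))  # False
```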
Experiment 2

| | Baseline | New Adaptive Policy |
|---|---|---|
| Actual Posttest | 8.12 ± 2.9 | 7.97 ± 2.7 |
Experiment 2: Insights
- Even though we did a robust evaluation, two things were not considered adequately:
  - how long each problem takes per student;
  - student population mismatch.
- Robust evaluation can help us identify where our models are lacking and lead to building better models over time.
Overview
- Reinforcement Learning: Towards a "Theory of Instruction"
- Part 1: Historical Perspective
- Part 2: Systematic Review
- Discussion: Where's the Reward?
- Part 3: Case Study: Fractions Tutor and Policy Selection
- Planning for the Future
Planning for the Future
- Data-Driven + Theory-Driven Approach
- Reinforcement learning researchers should work with learning scientists and psychologists.
- Work on domains where we have or can develop decent cognitive models.
- Work in settings where the set of actions is restricted but still meaningful (e.g., worked examples vs. problem solving).
- Compare to good baselines based on the learning sciences (e.g., expertise reversal effect).
- Do thoughtful and extensive offline evaluations.
- Iterate and replicate! Develop theories of instruction that can help us see where the reward might be.
Is Data-Driven Sufficient?
- Might we see a revolution in data-driven instructional sequencing?
- More data
- More computational power
- Better RL algorithms
- Similar advances have recently revolutionized the fields of computer vision, natural language processing, and computational game-playing.
- Why not instruction?
- Learning is fundamentally different from images, language, and games.
- Baselines are much stronger for instructional sequencing.
So, where is the reward?
- In the coming years, will likely see both purely data-driven (deep learning) approaches as well as theory+data-driven approaches to instructional sequencing.
- Only time can tell where the reward lies, but our robust evaluation suggests combining theory and data.
- By reviewing the history and prior empirical literature, we can have a better sense of the terrain we are operating in.
So, where is the reward?
- Applying RL to instructional sequencing has been rewarding in other ways:
  - Advances have been made to the field of RL:
    - The Optimal Control of Partially Observable Markov Processes
    - Our work on importance sampling (Doroudi et al., 2017b)
  - Advances have been made to student modeling.
By continuing to try to optimize instruction, we will likely continue to expand the frontiers of the study of human and machine learning.