Shayan Doroudi
RL4ED Workshop @ EDM 2021
| | First Wave (1960s-70s) | Second Wave (2000s-2010s) | Third Wave (2010s) |
|---|---|---|---|
| Educational Technology | Teaching Machines / CAI | Intelligent Tutoring Systems | Massive Open Online Courses |
| Optimization Methods | Decision Processes | Reinforcement Learning | Deep RL |
| Models of Learning | Mathematical Psychology | Machine Learning (AIED/EDM) | Deep Learning |
Across the waves: more data-driven, more data-generating.
Over the past 50 years, how successful has RL been in discovering useful adaptive instructional policies?
What are the challenges that we must face if we are to use RL productively in education going forward?
Learning from the Past
Meeting the Challenges of the Future
Learning from the Past
Searched for all empirical studies (as of December 2018) that compared one or more RL-induced policies (broadly conceived) to one or more baseline policies.
Found 41 such studies, which we grouped into five qualitatively distinct clusters.
Doroudi, S., Aleven, V., & Brunskill, E. (2019). Where’s the reward? A review of reinforcement learning for instructional sequencing. International Journal of Artificial Intelligence in Education, 29(4), 568-620.
Paired-Associate Learning Tasks
Concept Learning Tasks
Sequencing Activity Types
Sequencing Interdependent Content
Not Optimizing Learning
Paired-Associate Learning Tasks
Example pair: leer → to read
Includes all studies done in the 1960s-70s.
Treats each pair as independent.
Use psychological models that account for learning and forgetting.
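As a concrete illustration (not a policy from any of the reviewed studies), a minimal sketch of how such a model can drive sequencing: each pair's recall probability decays with time since its last review, and the policy reviews whichever pair is most at risk of being forgotten. The decay rate, the strength update, and the extra word pairs are illustrative assumptions.

```python
import math
import random

# Toy exponential-forgetting model: recall probability for a pair decays with
# time since its last review; accumulated strength slows the decay.
class Pair:
    def __init__(self, prompt, answer, decay=0.1):
        self.prompt, self.answer = prompt, answer
        self.decay = decay        # assumed forgetting rate (per time step)
        self.strength = 0.0       # grows with each review
        self.last_review = 0      # time of last review

    def recall_prob(self, t):
        elapsed = t - self.last_review
        return math.exp(-self.decay * elapsed / (1.0 + self.strength))

def choose_next_pair(pairs, t):
    # Greedy policy: review the pair predicted to be most at risk of being forgotten.
    return min(pairs, key=lambda p: p.recall_prob(t))

if __name__ == "__main__":
    deck = [Pair("leer", "to read"), Pair("comer", "to eat"), Pair("vivir", "to live")]
    for t in range(1, 11):
        pair = choose_next_pair(deck, t)
        recalled = random.random() < pair.recall_prob(t)   # simulated student response
        pair.strength += 1.0 if recalled else 0.3
        pair.last_review = t
        print(t, pair.prompt, "recalled" if recalled else "forgotten")
```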
Concept Learning Tasks
Use cognitive science models that describe how people learn concepts from examples.
Sequencing Activity Types
Content is pre-determined.
The decision is what kind of instruction to give for each piece of content (e.g., worked example vs. problem solving).

Worked Example
Solve for \(x\): \(x^2 - 4 = 12\)
\(x^2 = 4 + 12\)
\(x^2 = 16\)
\(x = \sqrt{16} = \pm 4\)

Problem Solving
Solve for \(x\): \(x^2 - 4 = 12\)
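For illustration only (not a policy from any of the reviewed studies), a sketch of what that decision could look like: pick the activity type from a crude running mastery estimate, in the spirit of the expertise-reversal effect discussed later. The threshold and update rule are made-up assumptions.

```python
# Illustrative sketch: decide which activity type to give for fixed content,
# based on a simple running estimate of the student's mastery.
MASTERY_THRESHOLD = 0.6   # assumed cutoff, not from the reviewed studies

def choose_activity(mastery_estimate):
    # Novices tend to benefit more from worked examples; more expert learners
    # from problem solving (the expertise-reversal intuition).
    return "worked_example" if mastery_estimate < MASTERY_THRESHOLD else "problem_solving"

def update_mastery(mastery_estimate, correct, lr=0.2):
    # Exponential moving average toward the latest outcome (assumed update rule).
    return (1 - lr) * mastery_estimate + lr * (1.0 if correct else 0.0)

mastery = 0.3
for correct in [False, True, True, True, False, True]:
    print(choose_activity(mastery), round(mastery, 2))
    mastery = update_mastery(mastery, correct)
```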
Sequencing Interdependent Content
Pieces of content are interrelated.
Instructional methods could also vary.
| | RL Policy Outperformed Baseline | Mixed Results / ATI | RL Policy Did Not Outperform Baseline |
|---|---|---|---|
| Paired-Associate Learning Tasks | 11 | 0 | 3 |
| Concept Learning Tasks | 4 | 2 | 1 |
| Sequencing Activity Types | 4 | 4 | 2 |
| Sequencing Interdependent Content | 0 | 2 | 6 |
| Not Optimizing Learning | 2 | 0 | 0 |
[Diagram: a spectrum of theoretical basis, from more (Use Psychologically-Inspired Models, e.g., the spacing effect and the expertise-reversal effect) to less (Use Data-Driven Models), with the clusters (Paired-Associate Learning Tasks, Concept Learning Tasks, Sequencing Activity Types, Sequencing Interdependent Content) placed along it.]
Meeting the Challenges of the Future
“We argue that despite the power of big data, psychological theory provides essential constraints on models, and that despite the success of psychological theory in providing a qualitative understanding of phenomena, big data enables quantitative, individualized predictions of learning and performance”
Mozer, M. C., & Lindsey, R. V. (2017). Predicting and Improving memory retention: Psychological theory matters in the big data era. In Big data in cognitive science (pp. 34-64).
Choice of models
Types of actions
Space of policies
In our review, 71% of the studies that found a significant effect used random sequencing or other RL-induced policies as baselines.
Only 35% of the studies that found no significant effect used random sequencing or other RL-induced policies as baselines.
For paired-associate learning tasks, there are advanced heuristics used in commercial systems that we could compare to:
SuperMemo
Leitner system
Only Lindsey (2016) conducted a study comparing an RL-induced policy to SuperMemo.
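For concreteness, a minimal sketch of a Leitner-style baseline scheduler. The number of boxes, the review intervals, and the sample items are assumptions, not taken from any particular commercial system.

```python
# Minimal Leitner-system sketch: items move up a box when answered correctly and
# drop back to the first box on an error; higher boxes are reviewed less often.
NUM_BOXES = 4
REVIEW_EVERY = [1, 2, 4, 8]   # assumed review intervals (in sessions) per box

def leitner_session(boxes, session, answer_fn):
    # Gather everything due this session, then redistribute based on the answers.
    due = [(item, b) for b in range(NUM_BOXES) if session % REVIEW_EVERY[b] == 0
           for item in boxes[b]]
    for b in range(NUM_BOXES):
        if session % REVIEW_EVERY[b] == 0:
            boxes[b] = []
    for item, b in due:
        if answer_fn(item):                               # correct: promote one box
            boxes[min(b + 1, NUM_BOXES - 1)].append(item)
        else:                                             # incorrect: restart in box 0
            boxes[0].append(item)

boxes = [["leer", "comer", "vivir"], [], [], []]
for session in range(1, 9):
    leitner_session(boxes, session, answer_fn=lambda item: True)   # toy: always correct
    print(session, boxes)
```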
For sequencing activity types, we could compare to policies based on principles such as the expertise-reversal effect.
For sequencing interdependent content, we could compare to heuristics that sequence based on expert-curated knowledge maps.
Kalyuga, S., & Sweller, J. (2005). Rapid dynamic assessment of expertise to improve the efficiency of adaptive e-learning. Educational Technology Research and Development, 53(3), 83-93.
Just because an RL-based system is shown to be effective in an experiment does not necessarily mean it will be effective in the real world.
Does the policy work well when used in conjunction with other instructional practices?
Does the policy work well in educationally-relevant time scales?
Do teachers have enough control/flexibility?
Do students have enough control?
Just because an RL-based system is shown to be effective does not necessarily mean it will get used.
Atkinson (1972)
The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grants R305A130215 and R305B150008 to Carnegie Mellon University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Dept. of Education.
We consider any papers published before December 2018 where one or more RL-induced policies (broadly conceived) were empirically compared to one or more baseline policies.
“The mathematical techniques of optimization used in theories of instruction draw upon a wealth of results from other areas of science, especially from tools developed in mathematical economics and operations research over the past two decades, and it would be my prediction that we will see increasingly sophisticated theories of instruction in the near future.”
Suppes (1974)
The Place of Theory in Educational Research
AERA Presidential Address
Used prior data to fit G-SCOPE Model (Hallak et al., 2015).
Used G-SCOPE Model to derive two new Adaptive Policies.
Wanted to compare Adaptive Policies to a Baseline Policy (fixed, spiraling curriculum).
Simulated both policies on G-SCOPE Model to predict posttest scores (out of 16 points).
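The general recipe can be sketched as follows. This is not the G-SCOPE model itself (its details are not on these slides); a BKT-style stand-in student model with made-up parameters plays its role, and the two example policies are placeholders rather than the policies from the study.

```python
import random

# Toy stand-in student model (BKT-style, one skill per item, made-up parameters).
P_INIT, P_LEARN, P_GUESS, P_SLIP = 0.2, 0.15, 0.2, 0.1

def simulate_posttest(policy, n_items=16, n_steps=30, n_students=1000):
    """Estimate the mean posttest score (out of n_items) a policy would earn
    if the fitted student model were the truth."""
    total = 0.0
    for _ in range(n_students):
        known = [random.random() < P_INIT for _ in range(n_items)]
        for t in range(n_steps):
            i = policy(known, t)                   # which item/skill to practice next
            if not known[i] and random.random() < P_LEARN:
                known[i] = True                    # learning transition
        # Posttest: one question per item, with guess and slip noise.
        total += sum(
            (random.random() < 1 - P_SLIP) if known[i] else (random.random() < P_GUESS)
            for i in range(n_items)
        )
    return total / n_students

# Example policies: a fixed spiraling curriculum vs. an adaptive one that
# (cheating, just to keep the sketch short) peeks at the latent state.
spiral = lambda known, t: t % len(known)
adaptive = lambda known, t: next((i for i, k in enumerate(known) if not k), 0)

print(simulate_posttest(spiral), simulate_posttest(adaptive))
```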
| | Baseline | Adaptive Policy |
|---|---|---|
| Simulated Posttest | 5.9 ± 0.9 | 9.1 ± 0.8 |
| Actual Posttest | 5.5 ± 2.6 | 4.9 ± 2.6 |
Doroudi, Aleven, and Brunskill, L@S 2017
Doroudi, Thomas, and Brunskill, UAI 2017, Best Paper
| | Policy 1 | Policy 2 | Policy 3 |
|---|---|---|---|
| Student Model 1 | \(V_{SM_1,P_1}\) | \(V_{SM_1,P_2}\) | \(V_{SM_1,P_3}\) |
| Student Model 2 | \(V_{SM_2,P_1}\) | \(V_{SM_2,P_2}\) | \(V_{SM_2,P_3}\) |
| Student Model 3 | \(V_{SM_3,P_1}\) | \(V_{SM_3,P_2}\) | \(V_{SM_3,P_3}\) |
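One way to read this matrix (a sketch of the idea behind the robust evaluation, not the exact procedure of the UAI paper): estimate every policy's value under every plausible student model, and prefer a policy whose worst-case value across models is high rather than one that only shines under a single favored model. The numbers below are placeholders.

```python
# V[model][policy]: estimated value of each policy under each candidate student
# model, e.g., obtained by simulation as in the sketch above. Placeholder values.
V = {
    "SM1": {"P1": 7.2, "P2": 8.1, "P3": 6.0},
    "SM2": {"P1": 6.8, "P2": 5.9, "P3": 6.5},
    "SM3": {"P1": 7.0, "P2": 4.5, "P3": 6.9},
}
policies = ["P1", "P2", "P3"]

# Robust (maximin) choice: the policy with the best worst-case value across models.
robust = max(policies, key=lambda p: min(V[m][p] for m in V))

# For contrast, the policy that looks best under one favored model can be fragile.
best_under_SM1 = max(policies, key=lambda p: V["SM1"][p])

print("robust choice:", robust)               # P1: worst case 6.8
print("best under SM1 only:", best_under_SM1) # P2: great under SM1, poor under SM3
```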
Simulated posttest (out of 16) under different student models:

| | Baseline | Adaptive Policy | Awesome Policy |
|---|---|---|---|
| G-SCOPE Model | 5.9 ± 0.9 | 9.1 ± 0.8 | 16 |
| Bayesian Knowledge Tracing | 6.5 ± 0.8 | 7.0 ± 1.0 | 16 |
| Deep Knowledge Tracing | 9.9 ± 1.5 | 8.6 ± 2.1 | 16 |
Doroudi, Aleven, and Brunskill, L@S 2017
| | Baseline | New Adaptive Policy |
|---|---|---|
| Actual Posttest | 8.12 ± 2.9 | 7.97 ± 2.7 |
Even though we did robust evaluation, two things were not considered adequately:
How long each problem takes per student
Student population mismatch
Robust evaluation can help us identify where our models are lacking and lead to building better models over time.
Data-Driven + Theory-Driven Approach
Reinforcement learning researchers should work with learning scientists and psychologists.
Work on domains where we have or can develop decent cognitive models.
Work in settings where the set of actions is restricted but still meaningful (e.g., worked examples vs. problem solving).
Compare to good baselines based on the learning sciences (e.g., the expertise-reversal effect).
Do thoughtful and extensive offline evaluations.
Iterate and replicate! Develop theories of instruction that can help us see where the reward might be.
By continuing to try to optimize instruction, we will likely continue to expand the frontiers of the study of human and machine learning.