ECE381V: Deep Reinforcement Learning (Spring 2026)
[Figure: pursuit-evasion game between two players, P1 (Pursuit) and P2 (Evade)]
Pure strategies or mixed strategies
U is the utility for each joint-policy permutation.
At a high level
The Double Oracle algorithm solves extensive-form games, in general, without exhaustive search.
Oracle -> the costly step; approximate it using Empirical Game-Theoretic Analysis and deep RL.
Normal-form games are generally specified by a payoff matrix; players move simultaneously and each receives a payoff.
A Nash equilibrium is an equilibrium concept in which no player has an incentive to unilaterally deviate from their strategy.
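Since a Nash equilibrium is defined by the no-profitable-deviation property, it can be checked directly. A minimal pure-Python sketch (the matching-pennies matrix is an assumed example, not from the slides):

```python
# Verifying a mixed-strategy Nash equilibrium of a zero-sum normal-form game
# by checking that neither player gains from any pure-strategy deviation.

# Row player's payoff matrix for matching pennies (column player gets the negation).
A = [[1, -1],
     [-1, 1]]

def expected_payoff(A, x, y):
    """Row player's expected payoff under mixed strategies x (rows) and y (columns)."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(A)) for j in range(len(A[0])))

def is_nash(A, x, y, tol=1e-9):
    """True iff neither player can gain by deviating to any pure strategy."""
    v = expected_payoff(A, x, y)
    best_row = max(sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(A)))
    best_col = min(sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0])))
    # Row player maximizes, column player minimizes (zero-sum).
    return best_row <= v + tol and best_col >= v - tol

print(is_nash(A, [0.5, 0.5], [0.5, 0.5]))  # -> True (uniform mixing is the NE)
print(is_nash(A, [1.0, 0.0], [0.5, 0.5]))  # -> False (column player would deviate)
```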
Every finite extensive-form game has an equivalent normal form.
[Figure: grid positions 0-3; each player's action set is {Left, Right, Stay}]
Double Oracle Algorithm: worked example

Consider a pursuit-evasion setup between a Pursuer (P) and an Evader (E).
[Figure: the restricted game matrix (P\E) is grown over several iterations, starting from the single joint strategy (Stay, Stay) with payoff 0 and adding best responses such as Up, Right, Left, and Down, with payoffs in {-1, -0.5, 0, 0.5}]
Start with a restricted game containing a single strategy per player: (Stay, Stay), payoff 0.
NE of the restricted game: the Pursuer looks at the columns and picks the highest number; the Evader looks at the rows and picks the lowest number.
At each iteration, compute each player's best response to the current restricted-game equilibrium and add any new best-response strategies to the restricted game.
When there are no new best responses, the algorithm has converged.
Note: the example converges within a 4x3 submatrix, instead of requiring the full 5x5 game to be solved.
Worst case, the full strategy space might have to be enumerated.
However, the authors claim there is evidence that, for many games, support sizes shrink as a function of episode length and information structure.
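The loop above can be sketched on a zero-sum matrix game. The payoffs below are hypothetical (rock-paper-scissors, not the slides' pursuit-evasion numbers), and fictitious play stands in for an exact restricted-game Nash solver:

```python
import numpy as np

# Double oracle on a zero-sum matrix game: repeatedly solve the restricted
# game, then expand it with each player's best response to the restricted
# equilibrium, until no new best responses appear.

A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])    # row player maximizes

def fictitious_play(M, iters=2000):
    """Approximate NE of the zero-sum game M via fictitious-play averages."""
    n, m = M.shape
    rc, cc = np.zeros(n), np.zeros(m)
    rc[0] += 1.0
    cc[0] += 1.0
    for _ in range(iters):
        rc[np.argmax(M @ (cc / cc.sum()))] += 1.0  # row best response
        cc[np.argmin((rc / rc.sum()) @ M)] += 1.0  # column best response
    return rc / rc.sum(), cc / cc.sum()

def double_oracle(M, max_iters=20):
    R, C = [0], [0]                                # restricted strategy sets
    for _ in range(max_iters):
        x, y = fictitious_play(M[np.ix_(R, C)])    # solve restricted game
        xf, yf = np.zeros(M.shape[0]), np.zeros(M.shape[1])
        xf[R], yf[C] = x, y                        # lift to the full game
        br_row = int(np.argmax(M @ yf))            # best responses to the
        br_col = int(np.argmin(xf @ M))            # restricted equilibrium
        if br_row in R and br_col in C:            # no new best responses
            return sorted(R), sorted(C)            # -> converged
        if br_row not in R:
            R.append(br_row)
        if br_col not in C:
            C.append(br_col)
    return sorted(R), sorted(C)

print(double_oracle(A))  # RPS needs full support: ([0, 1, 2], [0, 1, 2])
```

For rock-paper-scissors the full strategy space is eventually enumerated (the worst case noted above); for the slides' pursuit-evasion example the loop would stop with a strict subset of strategies.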
Empirical Game-Theoretic Analysis
Simulate a much smaller game
Empirical payoff matrix
Discover new strategies & reason about them
Instead of an exact best response, use RL for an approximate best response.
Use simulations to obtain empirical payoffs.
[Figure: restricted game matrix (P\E), initialized with uniform-random policies; empirical payoffs 0.5 and -0.5]
Restricted game matrix (P\E), starting from uniform-random policies.
In partially observable multiagent environments, when the other players are held fixed the environment becomes Markovian, and computing a best response reduces to solving an MDP; thus any RL algorithm can be used.
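The reduction is easiest to see in the one-shot case: against a fixed opponent mixture, the best response is just an argmax over expected payoffs (in the sequential case, this argmax is replaced by an RL algorithm). The payoff numbers here are hypothetical:

```python
import numpy as np

# With the opponent held fixed, best-response computation is a single-agent
# problem. One-shot illustration with made-up pursuer payoffs.

A = np.array([[0.0, -0.5],
              [0.5, -1.0]])          # pursuer payoff for (pursuer action, evader action)
sigma_evader = np.array([0.8, 0.2])  # fixed evader mixture

expected = A @ sigma_evader          # expected payoff of each pursuer action
best_response = int(np.argmax(expected))
print(best_response)                 # -> 1 (expected payoffs are -0.1 and 0.2)
```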
Restricted game matrix:

P\E | E1  | E2
----|-----|----
P1  | U11 | U12
P2  | U21 | U22

The entries U11, U12, U21, U22 are obtained through simulation of the respective joint policies.
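The entries are not known in closed form; each is a Monte Carlo average over simulated episodes. A toy sketch, where `true_payoffs` and the Gaussian noise are a hypothetical stand-in for running episodes in the real environment:

```python
import random

# Estimate each restricted-game entry U[i][j] by simulating the joint
# policy (i, j) many times and averaging the noisy returns.

true_payoffs = {(0, 0): 0.0, (0, 1): -0.5,
                (1, 0): 0.5, (1, 1): -1.0}

def simulate(p, e, rng):
    """One noisy episode return for joint policy (p, e) (stub environment)."""
    return true_payoffs[(p, e)] + rng.gauss(0.0, 0.1)

def empirical_payoff_matrix(episodes=5000, seed=0):
    rng = random.Random(seed)
    return [[sum(simulate(i, j, rng) for _ in range(episodes)) / episodes
             for j in range(2)] for i in range(2)]

U = empirical_payoff_matrix()
print(U)  # each entry is close to the corresponding true payoff
```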
The meta-solver gives the mixture over the policies found through the oracle (RL).
Meta-solver choices: regret matching, Hedge, projected replicator dynamics.
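One of these meta-solvers, regret matching, is small enough to sketch (Hedge and projected replicator dynamics fill the same role). Given the empirical payoff matrix U of the restricted zero-sum game, self-play regret matching yields average strategies that approximate the NE mixture over oracle policies; U below is hypothetical:

```python
import numpy as np

# Regret matching as a meta-solver on a restricted zero-sum game: play
# proportionally to positive cumulative regret; the time-averaged
# strategies approach the equilibrium mixture.

def regret_matching(U, iters=20000):
    n, m = U.shape
    reg_r, reg_c = np.zeros(n), np.zeros(m)
    avg_r, avg_c = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        pos = np.maximum(reg_r, 0.0)
        x = pos / pos.sum() if pos.sum() > 0 else np.full(n, 1.0 / n)
        pos = np.maximum(reg_c, 0.0)
        y = pos / pos.sum() if pos.sum() > 0 else np.full(m, 1.0 / m)
        avg_r += x
        avg_c += y
        v = x @ U @ y          # current expected value (to the row player)
        reg_r += U @ y - v     # row player maximizes U
        reg_c += v - x @ U     # column player minimizes U
    return avg_r / iters, avg_c / iters

U = np.array([[2.0, -1.0],
              [-1.0, 1.0]])
x, y = regret_matching(U)
print(x, y)  # both approach the unique NE mixture, roughly [0.4, 0.6]
```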
[Figure: N players x K levels; start N*K processes in parallel, each initialized with a uniform-random policy]
[Figure: N players x K levels; N*K processes run in parallel from uniform-random initial policies, caching policies and payoffs to a central disk]
Since each process uses slightly outdated copies of the other policies, this is an approximation of PSRO.
[Figure: gridworld coordination environments; agents receive 21x20x3-pixel observations]
Average Proportional Loss
D-bar (mean of diagonals): the average score when agents play with the exact partners they trained with. Because they trained together, they have usually figured out how to coordinate well, resulting in a high score.
O-bar (mean of off-diagonals): the average score when an agent is forced to play with a "stranger": an agent trained in an identical environment, but in a separate training run (with a different random seed).
The average proportional loss is R = (D-bar - O-bar) / D-bar.
Even in a simple domain with almost full observability (small2), an independently learned policy could expect to lose 34.2% of its reward when playing with another independently learned policy, even though it was trained under identical circumstances.
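These quantities follow directly from the cross-play matrix. A small numpy sketch with made-up scores (entry (i, j) is the score when player 1 from run i is paired with player 2 from run j):

```python
import numpy as np

# Compute D-bar (mean of diagonals), O-bar (mean of off-diagonals), and the
# average proportional loss R = (D - O) / D from a JPC cross-play matrix.

jpc = np.array([[30.0, 18.0, 20.0],
                [19.0, 29.0, 17.0],
                [21.0, 16.0, 31.0]])

d_bar = np.mean(np.diag(jpc))                                # same-run partners
o_bar = (jpc.sum() - np.trace(jpc)) / (jpc.size - len(jpc))  # cross-run partners
r_loss = (d_bar - o_bar) / d_bar
print(d_bar, o_bar, round(r_loss, 3))  # -> 30.0 18.5 0.383
```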
No JPC problem here, because the tasks are largely independent?
Meta-strategies help bring R down.
R for just the level-10 policies (without mixing) is 0.147, 0.27, and 0.118 for small2-4 respectively.
R with meta-strategies is shown in the table above.
Diminishing returns on the number of levels
Level 3 helps decently: it reduces the overfitting penalty by 44%.
Level 5 helps a lot: it reduces the overfitting penalty by 56.1%.
Level 10 barely helps more than level 5: it reduces the penalty by 56.7%.
Results on Leduc Poker
Effect of the various meta-strategies and exploration parameters:
Metric used: performance was measured by the mean area-under-curve (MAUC) of the final 32 NashConv values (lower is better).
Optimal exploration rate: gamma = 0.4 proved most effective for minimizing NashConv.
Meta-solver performance ranking: 1. decoupled replicator dynamics (best), 2. decoupled regret matching, 3. Exp3.
Level scaling: higher training levels consistently reduce NashConv, but exhibit diminishing returns (improvements plateau at higher levels).
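NashConv, the metric behind these comparisons, measures the total gain available from unilateral best-response deviations (zero iff the profile is a Nash equilibrium). A sketch for the two-player zero-sum matrix case, on an assumed 2x2 game rather than Leduc itself:

```python
import numpy as np

# NashConv for a two-player zero-sum matrix game: the sum of each player's
# best-response deviation gain against the other's current mixture.

def nash_conv(A, x, y):
    value = x @ A @ y
    row_gain = np.max(A @ y) - value   # row player's best deviation gain
    col_gain = value - np.min(x @ A)   # column player's (minimizer's) gain
    return row_gain + col_gain

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])            # matching pennies
print(nash_conv(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # -> 0.0
print(nash_conv(A, np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # -> 1.0
```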
Comparison to Neural Fictitious Self-Play (NFSP)
DCH & PSRO converge faster than NFSP at the beginning of training.
This early speed advantage is likely due to their superior meta-strategies, compared to the uniform-random strategy used in NFSP's fictitious play.
NFSP wins in the long run: it eventually reaches lower exploitability than the others.
DCH and PSRO plateau: their convergence curves flatten earlier (DCH is especially hindered by asynchronous updates). NFSP successfully mixes strategies deep in the game tree (crucial for games like poker), whereas DCH and PSRO only mix at the very top level, over full, complete policies.
Evaluation and Results: The evaluation and results section lacks depth and could have been more comprehensive.
Methodology: The methodology mainly relies on Double Oracle combined with Reinforcement Learning (RL) and Empirical Game-Theoretic Analysis (EGTA), with limited methodological novelty.
Metric Limitation: The proposed metric is defined only for two-player games, limiting its general applicability.
Motivation vs Experiments: Although the paper claims to support both collaborative and competitive settings, no experiments demonstrate this capability.
Clarity and Related Work: The writing does not sufficiently explain related work or prerequisites, making the paper difficult to follow. In particular, the Double Oracle algorithm could have been explained in more detail.
Hyperparameter Sensitivity: The paper does not discuss sensitivity to hyperparameters or provide analysis on their impact.
Extends to Multiplayer games
Faster convergence; mixing at every infostate
N-player general-sum games; CE (correlated equilibrium) as the meta-solver