Identifying Latent Intentions via Inverse Reinforcement Learning in Repeated Public Good Games

Carina I Hausladen, Marcel H Schubert, Christoph Engel

MAX PLANCK INSTITUTE
FOR RESEARCH ON COLLECTIVE GOODS

  • Social Dilemma Games
    • Initial contributions start positive but gradually decline over time.

  • Classify behavior to understand contribution patterns

 

Figure (pie chart): distribution of contribution types in strategy-method data, using the classification of Fischbacher et al. (2001) with shares from the meta-analysis of Thöni et al. (2018): Conditional Cooperation 61.3%; the smaller slices, 19.2% and 10.4%, are Hump-Shaped and Freeriding.


Strategy Method Data

Game Play Data

Analysing Game Play Data

  • Finite Mixture Model
  • Bayesian Model
  • C-Lasso
  • Clustering

Theory Driven

Data Driven

Theory first: Use theory to find groups

Model first: Specify a model, then find groups

Data first: Let the data decide groups, then theorize

Behavior Beyond Theory


Bardsley (2006) | Tremble terms: 18.5%
Houser (2004) | Confused: 24%
Fallucchi (2021) | Others: 22% – 32%
Fallucchi (2019) | Various: 16.5%

Figure: the shares above (18.5%, 24%, 32%, 16.5%) correspond to behavior that remains random / unexplained (panel labels: initial tremble, simulated, actual).

Step 1: Clustering (uncover patterns)

Step 2: Inverse Reinforcement Learning (interpret patterns)

Can we build a model that explains both
the behavior types we know and the ones we don’t?

Dataset

  • Data from 10 published studies
  • 2,938 participants,
    50,390 observations
  • Game play data
  • Standard linear public good games
    • No behavioral interventions
  • Identical information treatment

 

 

 

Two-dimensional time series per player
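A rough sketch of how such data can be organized; the array shape and the meaning of the two channels are assumptions for illustration, not the paper's exact encoding:

```python
import numpy as np

# Hypothetical layout (illustrative assumption): one two-dimensional series per player,
# channel 0 = own contribution, channel 1 = information observed about the group.
n_players, n_rounds, n_channels = 2938, 20, 2
series = np.zeros((n_players, n_rounds, n_channels))

# Example: player 0 contributes 15 tokens in round 0 and observes a group mean of 12.3.
series[0, 0, 0] = 15.0
series[0, 0, 1] = 12.3
# Games in the pooled dataset have different lengths, so shorter games would need
# padding or masking before the series can be stacked like this.
```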





What are the most common patterns in the data?

Step 1: Clustering (uncover patterns)

Step 2: Inverse Reinforcement Learning (interpret patterns)

Clustering consists of three main steps

  1. Choose the number of clusters k
  2. Calculate distances between series
  3. Apply a clustering algorithm to group similar series

Calculating Distances

  • The same round number does not mean the same experience for every player.
  • Comparing fixed local time points is therefore misleading.
  • Better: compare the global shape of each series.

Figure: example contribution series (contribution, 0 – 1, over 20 rounds).

Local Similarity Measure: Euclidean Distance

Global Similarity Measure: Dynamic Time Warping (DTW)
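A minimal sketch of the difference between the two measures, using a hand-rolled one-dimensional DTW for illustration (the study itself uses a two-dimensional variant):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW between two 1-D series."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two players who both drop from high to low contributions, but at different rounds.
a = np.array([20, 20, 20, 20, 20,  0,  0,  0,  0,  0], dtype=float)
b = np.array([20, 20, 20, 20, 20, 20, 20,  0,  0,  0], dtype=float)

print("Euclidean:", np.linalg.norm(a - b))   # penalizes the time shift
print("DTW      :", dtw_distance(a, b))      # aligns the shifted drop, distance ~0
```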

Empirical Perspective

Clustering consists of three main steps

  1. Choose the number of clusters k
  2. Calculate distances between series
    • Euclidean (local)
    • DTW (global)
  3. Apply a clustering algorithm to group similar series
    1. k-means clustering
    2. Gaussian Mixture Models (GMM)
    3. Agglomerative clustering
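A minimal sketch of step 3 with scikit-learn, treating each equal-length series as a flat feature vector (i.e., the Euclidean, local view); the DTW variants instead require a precomputed pairwise distance matrix or a dedicated time-series library. Data and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(300, 20))   # 300 toy contribution series, 20 rounds each

k = 6
labels_kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
labels_gmm    = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
labels_agglo  = AgglomerativeClustering(n_clusters=k).fit_predict(X)
```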

Figure: heatmaps of cluster assignments (uids × round) for every combination of distance measure (Euclidean = local, DTW = global) and algorithm (k-means, GMM, agglomerative).

Contrasting Perspectives

  • DTW (global): a drop from high to low contributions at different times, within a global trend of sustained high then low contributions.
  • Euclidean (local): a shared local switching point, but clusters stay noisy because global patterns are ignored.

Figure: cluster solutions under DTW vs. Euclidean distance.

  • Results depend fundamentally on how similarity is defined.

  • We focus on generalizable patterns.




     

Our Clustering Setup

  1. Choose the number of clusters k
    • k = 6
  2. Calculate distances between series
    • DTW distance (global)
  3. Apply a clustering algorithm to group similar series
    • Spectral Clustering
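A minimal sketch of this setup, assuming a pairwise DTW distance matrix D is already available; the function name and the Gaussian-kernel bandwidth used to turn distances into affinities are illustrative choices, not the paper's calibration:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_dtw(D, n_clusters=6, bandwidth=None):
    """Spectral clustering on a precomputed pairwise DTW distance matrix D."""
    if bandwidth is None:
        bandwidth = np.median(D[D > 0])          # heuristic kernel width
    affinity = np.exp(-(D ** 2) / (2 * bandwidth ** 2))
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(affinity)

# D = pairwise DTW distances between all player series (e.g. via the DTW sketch above)
# labels = cluster_from_dtw(D, n_clusters=6)
```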

Interpreting the Clusters

Step 1: Clustering (uncover patterns)

Step 2: Inverse Reinforcement Learning (interpret patterns)

Learning in social dilemmas
  • evolutionary game-theoretic learning
  • best-response learning
  • reinforcement learning

The key challenge is to define a Reward Function.

Inverse reinforcement learning
  • recovers reward functions from data
  • has led to breakthroughs in robotics, autonomous driving, and modeling animal behavior

Hierarchical Inverse Q-Learning (Zhu 2024)

Figure: agent–environment loop (action, state).

  • Markov Decision Process: \( P(s' |s,a) \)
  • Behavior of a Q-learner:
    • maintains a Q-table
    • exploitation vs. exploration
    • Q-value update

\( Q(s,a) \leftarrow (1- \alpha)\, Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) \)

 

 

In the update, \( r \) is the reward, \( \gamma \max_{a'} Q(s', a') \) is the expected best possible outcome from the next state, and the learning rate \( \alpha \) blends this target with the current estimate \( Q(s,a) \).
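A minimal tabular sketch of this update and of the exploration vs. exploitation rule; state and action spaces are assumed already discretized, and all parameter values are illustrative:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Explore with probability epsilon, otherwise exploit the current Q-values."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[s]))                # exploit: best action so far
```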

 

Hierarchical Inverse Q-Learning

\( Q(s,a) \leftarrow (1- \alpha)\, Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) \)

 

 

In the inverse setting the reward \( r \) is unknown: estimate the reward function by maximizing the likelihood of the observed actions and states.
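A minimal sketch of the likelihood piece only (not Zhu's full HIQL procedure): observed actions are scored under a softmax policy on the Q-values implied by a candidate reward, and candidate rewards are compared by their summed log-likelihood. The inverse temperature beta is an assumed free parameter:

```python
import numpy as np

def action_log_likelihood(Q, states, actions, beta=1.0):
    """Log-likelihood of observed (state, action) pairs under a softmax policy on Q."""
    ll = 0.0
    for s, a in zip(states, actions):
        logits = beta * Q[s]
        logits -= logits.max()                       # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum())
        ll += log_probs[a]
    return ll

# Candidate reward functions would be compared by (re)running Q-learning under each
# reward and keeping the one with the highest log-likelihood of the observed play.
```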

Hierarchical Inverse Q-Learning

Figure: graphical model with discrete transitions. The latent rewards \( r_{t-1}, r_t \) switch according to \( \Lambda \); each reward drives the corresponding action \( a_{t-1}, a_t \); states \( s, s_{t+1} \) evolve through the transition kernel \( P \).
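A minimal sketch of what such a discrete transition over the latent reward (intention) could look like; the two-intention setting and the values in the transition matrix are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two latent intentions (e.g. "cooperate" vs. "free-ride"); Lambda[i, j] is the
# probability of switching from intention i to intention j between rounds.
Lambda = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

def sample_intentions(n_rounds, start=0):
    z = [start]
    for _ in range(n_rounds - 1):
        z.append(rng.choice(2, p=Lambda[z[-1]]))
    return np.array(z)

print(sample_intentions(20))   # e.g. a path that mostly cooperates with occasional switches
```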

Hierarchical Inverse Q-Learning
Infers, for every round, the latent reward given the observed history of states and actions: \( P(r_t \mid s_{0:t}, a_{0:t}) \).


Unconditional
Cooperators

Consistent
Cooperators

Freeriders

    • The intention to free-ride is not fixed.
    • Some variation exists, but the competing intention never exceeds the adoption threshold.

Threshold
Switchers

Volatile
Explorers

  • This cluster actively experiments with new strategies.
  • It switches deliberately between strategies.

Conclusion

Dataset with ~50,000 observations from public good games (PGG)

A global distance metric, such as two-dimensional Dynamic Time Warping (DTW), is best suited for partitioning data from social dilemma games.

Estimating intentions that transition in a discrete manner offers a unifying theory to explain all behavioral clusters — including the 'Other' cluster.

carinah@ethz.ch

slides.com/carinah


Appendix

Dataset

Various Game Lengths

Freeriders
  • The intention to free-ride may be less rigid than previously assumed.
  • Longer horizons
    • increase intention volatility and thus
    • create more opportunities for behavioral shifts.

Consistent Cooperators
  • Longer time horizons generally promote cooperation.
  • A small group of participants remains cooperative regardless of game duration.
    • For some, cooperation is a stable trait rather than a response to game length.

Unconditional
Cooperators

Consistent
Cooperators

How many reward functions should we estimate?

Additional reward function | \( \Delta \) Test LL | \( \Delta \) BIC
1 → 2 | 0.6 | 75.2
2 → 3 | 0.4 | 88.6
3 → 4 | 0.2 | 101.5
4 → 5 | 0.2 | 114.4
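For reference, assuming the standard definition of the criterion: \( \mathrm{BIC} = k \ln n - 2 \ln \hat{L} \), where \( k \) is the number of free parameters, \( n \) the number of observations, and \( \hat{L} \) the maximized likelihood; each additional reward function increases \( k \), so it must buy enough likelihood to justify the penalty.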

Choice of two intentions aligns with the fundamental RL principle of exploration vs. exploitation.

Comparing Partitioning Methods

Figure: partitions compared across methods. Theory driven: Finite Mixture Model, Bayesian Model, C-Lasso. Data driven: Manhattan distance + Hierarchical Clustering, DTW + Spectral Clustering.

DTW-based clustering leads to a clearer, less noisy partition.

 

Reward Function

Inverse RL recovers reward functions from data.

  • It has led to breakthroughs in robotics, autonomous driving, and modeling animal behavior.
  • Past models used smoothly time-varying reward functions (Alyahyay 2023).
  • New approaches model behavior with discrete reward functions:
    Hierarchical Inverse Q-Learning (HIQL) (Zhu 2024)



IC2S2
By Carina Ines Hausladen