# Sparse Cooperative QLearning

• Treating the whole team as a single-agent MDP is inefficient
• Independent learners do not necessarily converge

$Q^\star(s,a) = R(s,a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q^\star(s',a')$

QFunction Definition

QFunction Update

$Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]$
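
To make the update rule above concrete, here is a minimal tabular sketch. The function name `q_update`, the dictionary-based table, and the values of `alpha` and `gamma` are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # max_{a'} Q(s', a') over the (assumed finite) action set
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical usage: an all-zero table, one observed transition.
Q = defaultdict(float)
td = q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

In the multi-agent setting, `s` and `a` here would be the joint state and joint action, which is what makes the plain table impractical.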

Both are defined over the joint state and action space, which grows exponentially with the number of agents!

We don't always need to coordinate!

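$Q(s,a) \rightarrow v \in \mathbb{R}$
$Q_{i}(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}$, where $\tilde{s}$ and $\tilde{a}$ are the parts of the joint state and action that agent $i$ depends on
$Q(s,a) = \sum_{i} Q_i(\tilde{s}, \tilde{a})$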
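
The sum over local functions can be sketched as follows. The tuple layout `(state_idx, action_idx, table)` and the helper names `project` and `global_q` are assumptions made for illustration only; each local table sees just the slice of the joint state and action it was declared over.

```python
def project(joint, indices):
    """Keep only the entries of a joint tuple that one local function uses."""
    return tuple(joint[i] for i in indices)

def global_q(local_qs, s, a):
    """Q(s, a) as the sum of local Q_i over their partial views."""
    total = 0.0
    for state_idx, action_idx, table in local_qs:
        key = (project(s, state_idx), project(a, action_idx))
        total += table.get(key, 0.0)  # unseen entries default to 0
    return total

# Hypothetical example: two local functions over a 2-factor state, 2 agents.
local_qs = [
    ((0,), (0, 1), {((0,), (3, 1)): 2.0}),  # depends on s[0], a[0], a[1]
    ((1,), (1,),   {((5,), (1,)): 0.5}),    # depends on s[1], a[1] only
]
value = global_q(local_qs, s=(0, 5), a=(3, 1))
```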

Additional decomposition

$Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}$
$\rho(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}$
$\rho_0(s_0 = 0 \land a_1 = 3 \land a_2 = 1) = 7.5$

Example Rule

$n_0 = 2$
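
As a rough sketch of how the example rule could be stored and split among its agents, assume a rule holds a partial state, a partial action keyed by agent index, and a value. The `Rule` class and `local_q` helper are hypothetical names; $n_j$ is taken to be the number of agents named in the rule's action part.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    state: dict   # state-factor index -> required value
    action: dict  # agent index -> required action
    value: float

    @property
    def n_agents(self):
        # n_j: number of agents involved in the rule
        return len(self.action)

    def matches(self, s, a):
        return (all(s[k] == v for k, v in self.state.items())
                and all(a[k] == v for k, v in self.action.items()))

def local_q(rules, agent, s, a):
    """Q_i: sum of value / n_j over matching rules that involve the agent."""
    return sum(r.value / r.n_agents
               for r in rules
               if agent in r.action and r.matches(s, a))

# The example rule: s_0 = 0, a_1 = 3, a_2 = 1, value 7.5, so n_0 = 2.
rules = [Rule(state={0: 0}, action={1: 3, 2: 1}, value=7.5)]
s, a = (0,), (0, 3, 1)
q0, q1, q2 = (local_q(rules, i, s, a) for i in range(3))
```

Agents 1 and 2 each receive half of the rule's value, while agent 0 is not part of the rule and receives nothing from it.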
$a^\star = \arg\max_{a'}Q(s',a')$
$Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]$
$Q_i(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]$

Original Update

Per Agent

$\rho_k(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\sum_{i=1}^{n_k}\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]$

Per Rule

$Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}$

Remember to compute the update for every rule before applying any of them!
The updates depend on each other through the $Q_i$ values!
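
A minimal sketch of the per-rule update with deferred application, under several assumptions: rules are plain dicts, only rules matching the current $(s, a)$ are updated, and $a^\star$ is found by brute force over joint actions purely to keep the example short. All names here are hypothetical.

```python
from itertools import product

def matches(rule, s, a):
    return (all(s[k] == v for k, v in rule["state"].items())
            and all(a[k] == v for k, v in rule["action"].items()))

def local_q(rules, agent, s, a):
    # Q_i: each matching rule contributes value / n_j to its agents
    return sum(r["value"] / len(r["action"])
               for r in rules
               if agent in r["action"] and matches(r, s, a))

def greedy_joint_action(rules, s, action_sizes):
    # Brute-force argmax over joint actions (assumption: tiny problem)
    def total(a):
        return sum(r["value"] for r in rules if matches(r, s, a))
    return max(product(*(range(n) for n in action_sizes)), key=total)

def update_rules(rules, s, a, rewards, s_next, action_sizes,
                 alpha=0.5, gamma=0.9):
    a_star = greedy_joint_action(rules, s_next, action_sizes)
    # Per-agent TD errors, all computed from the OLD rule values
    td = [rewards[i]
          + gamma * local_q(rules, i, s_next, a_star)
          - local_q(rules, i, s, a)
          for i in range(len(action_sizes))]
    # Collect every delta first, then apply them together
    deltas = [alpha * sum(td[i] for i in r["action"])
              if matches(r, s, a) else 0.0
              for r in rules]
    for r, d in zip(rules, deltas):
        r["value"] += d

# Hypothetical two-agent example with two overlapping rules.
rules = [
    {"state": {0: 0}, "action": {0: 1, 1: 1}, "value": 4.0},
    {"state": {0: 0}, "action": {0: 1}, "value": 1.0},
]
update_rules(rules, s=(0,), a=(1, 1), rewards=[1.0, 1.0],
             s_next=(1,), action_sizes=[2, 2])
```

Both rules involve agent 0, so updating the first rule in place before computing the second rule's delta would change agent 0's $Q_i$ and give a different result.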

### Advantages

• Very sparse QFunction
• Rules are independent of factors they don't care about!

### Disadvantages

• Rules must be defined by hand for each problem, which takes a bit of micromanagement

