# Sparse Cooperative Q-Learning

• Modeling the whole system as a single-agent MDP is inefficient
• Independent learners do not necessarily converge
Q-Function Definition

$Q^\star(s,a) = R(s,a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q^\star(s',a')$

Q-Function Update

$Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]$
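The standard tabular update above can be sketched as follows; the state/action encoding and hyperparameter values are illustrative assumptions, not part of the original:

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (illustrative)
GAMMA = 0.9   # discount factor (illustrative)

Q = defaultdict(float)  # Q[(s, a)] -> value, defaults to 0.0

def update(s, a, r, s_next, actions):
    """Apply one temporal-difference update to Q[(s, a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```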

The naive joint approach needs the joint state and action space, which grows exponentially with the number of agents!

We don't always need to coordinate!

Each agent $i$ keeps a local $Q_i$ defined only over the subset $(\tilde{s},\tilde{a})$ of state and action variables it depends on; the global Q-function is their sum:

$Q(s,a) \rightarrow v \in \mathbb{R}$

$Q_{i}(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}$

$Q(s,a) = \sum_{i} Q_i(\tilde{s}, \tilde{a})$
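A minimal sketch of this decomposition; the assumption that each agent's $(\tilde{s},\tilde{a})$ is just its own state component and action is illustrative:

```python
# Each agent keeps a local Q_i over only the variables it cares about,
# and the global value is their sum. The projection below is an
# illustrative assumption.

def project(agent, state, joint_action):
    """Restrict the joint (s, a) to the factors agent i depends on.
    Here we assume agent i sees only its own state component and action."""
    return (state[agent], joint_action[agent])

def global_Q(Q_locals, state, joint_action):
    """Q(s, a) = sum_i Q_i(s~, a~)."""
    return sum(
        Qi.get(project(i, state, joint_action), 0.0)
        for i, Qi in enumerate(Q_locals)
    )
```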

Additional decomposition

Each local $Q_i$ is further decomposed into value rules: rule $j$ contributes its value $\rho_j$ when its condition holds, split equally among the $n_j$ agents it involves:

$Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}$

$\rho(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}$

Example Rule

$\rho_0(s_0 = 0 \land a_1 = 3 \land a_2 = 1) = 7.5$

$n_0 = 2$
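The example rule might be represented like this; the dictionary encoding is an assumption made for illustration:

```python
# A value rule: a condition plus a value, mirroring the example above.
rule0 = {
    "condition": {"s0": 0, "a1": 3, "a2": 1},  # when the rule applies
    "value": 7.5,                              # rho_0
    "agents": [1, 2],                          # agents involved, n_0 = 2
}

def applies(rule, assignment):
    """A rule fires when all of its condition variables match."""
    return all(assignment.get(k) == v for k, v in rule["condition"].items())

def agent_share(rule, agent):
    """Each involved agent gets rho_j / n_j toward its Q_i."""
    if agent not in rule["agents"]:
        return 0.0
    return rule["value"] / len(rule["agents"])
```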
$a^\star = \operatorname*{argmax}_{a'}Q(s',a')$

Original Update

$Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]$

Per Agent

$Q_i(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]$

Per Rule

$\rho_k(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\sum_{i=1}^{n_k}\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]$

where, as above,

$Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}$
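A sketch of one full update step under this scheme; the rule encoding, rewards, and hyperparameters are illustrative assumptions. Note that all deltas are computed before any rule value changes:

```python
ALPHA, GAMMA = 0.1, 0.9  # illustrative hyperparameters

def applies(rule, assignment):
    """A rule fires when all of its condition variables match."""
    return all(assignment.get(k) == v for k, v in rule["condition"].items())

def q_local(rules, agent, assignment):
    """Q_i(s~, a~): sum of rho_j / n_j over firing rules involving agent i."""
    return sum(
        r["value"] / len(r["agents"])
        for r in rules
        if agent in r["agents"] and applies(r, assignment)
    )

def step(rules, assignment, rewards, next_assignment):
    """One sparse cooperative Q-learning step.

    assignment: current (s, a) variable assignment; next_assignment
    includes the greedy joint action a*; rewards maps agent -> R_i.
    """
    # First pass: compute every delta from the *current* rule values.
    deltas = []
    for rule in rules:
        if not applies(rule, assignment):
            deltas.append(0.0)
            continue
        td = sum(
            rewards[i]
            + GAMMA * q_local(rules, i, next_assignment)
            - q_local(rules, i, assignment)
            for i in rule["agents"]
        )
        deltas.append(ALPHA * td)
    # Second pass: apply them all at once -- the updates depend on each other!
    for rule, delta in zip(rules, deltas):
        rule["value"] += delta
```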

Remember to compute all rule updates before applying any of them!
They depend on each other through the shared $Q_i$ values!

### Advantages

• The resulting Q-function representation is extremely sparse
• Rules are independent of the state and action factors they don't mention!

### Disadvantages

• Requires some micromanagement: the rules must be hand-crafted for each problem

By svalorzen
