Sparse Cooperative Q-Learning
- A single-agent MDP over the joint state and action space is inefficient
- Independent learners do not necessarily converge
Q^\star(s,a) = R(s,a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q^\star(s',a')
Q-Function Definition
Q-Function Update
Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]
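A minimal tabular sketch of this update rule; the dictionary-backed Q-table and the `alpha`, `gamma` values below are illustrative assumptions, not part of the slides:

```python
from collections import defaultdict

# Illustrative tabular Q-table and hyperparameters (names are assumptions).
q_table = defaultdict(float)   # maps (state, action) -> value
alpha, gamma = 0.1, 0.95

def q_update(s, a, r, s_next, actions):
    """One Q-learning step: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(q_table[(s_next, a_next)] for a_next in actions)
    td_error = r + gamma * best_next - q_table[(s, a)]
    q_table[(s, a)] += alpha * td_error
```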
Joint state and action space!
We don't always need to coordinate!
Q(s,a) \rightarrow v \in \mathbb{R}
Q_{i}(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}
Q(s,a) = \sum_{i} Q_i(\tilde{s}, \tilde{a})
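A small sketch of this decomposition, assuming each local Q_i is a table keyed only by the state and action variables in its scope; the `agent_q` tables, `scopes`, and variable names are made up for illustration:

```python
from collections import defaultdict

# Q(s, a) = sum_i Q_i(s~, a~): each local table is keyed only by the
# state/action variables that agent actually depends on (the ~ in the slides).
agent_q = [defaultdict(float), defaultdict(float)]   # one local table per agent
scopes = [("s0", "a0"), ("s0", "a1")]                # variables each Q_i observes

def project(assignment, scope):
    """Restrict the full joint (state, action) assignment to one agent's scope."""
    return tuple(assignment[v] for v in scope)

def global_q(assignment):
    """Global value as the sum of the local contributions Q_i(s~, a~)."""
    return sum(q[project(assignment, scope)]
               for q, scope in zip(agent_q, scopes))

# e.g. global_q({"s0": 0, "a0": 1, "a1": 3})
```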
Additional decomposition
Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}
\rho(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}
\rho_0(s_0 = 0 \land a_1 = 3 \land a_2 = 1) = 7.5
Example Rule
n_0 = 2
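One way this rule could be represented in code, continuing the slide's example; the `Rule` structure and the choice of agents 1 and 2 as the involved agents are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    condition: dict   # variable -> required value, e.g. {"s0": 0, "a1": 3, "a2": 1}
    value: float      # rho_j, the rule's current payoff
    agents: tuple     # agents the rule involves; n_j = len(agents)

    def applies(self, assignment):
        return all(assignment.get(v) == x for v, x in self.condition.items())

# The example rule from the slides: rho_0 fires when s0 = 0, a1 = 3, a2 = 1,
# is worth 7.5, and involves two agents (here assumed to be 1 and 2), so n_0 = 2.
rules = [Rule({"s0": 0, "a1": 3, "a2": 1}, 7.5, (1, 2))]

def local_q(i, assignment):
    """Q_i(s~, a~) = sum over applicable rules involving agent i of rho_j / n_j."""
    return sum(r.value / len(r.agents)
               for r in rules if i in r.agents and r.applies(assignment))

# Each involved agent gets 7.5 / 2 from rho_0 when the rule's condition holds:
print(local_q(1, {"s0": 0, "a1": 3, "a2": 1}))   # 3.75
```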
a^\star = \arg\max_{a'}Q(s',a')
Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]
Q_i(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]
Original Update
Per Agent
\rho_k(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\sum_{i=1}^{n_k}\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]
Per Rule
Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}
Remember to compute all the Q_i's and TD errors before updating any rule!
They depend on each other!
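A sketch of the per-rule update, reusing the `Rule` and `local_q` helpers from the rule sketch above; `rewards` (one entry per agent) and `s_star_a` (the next state s~' merged with the maximizing joint action a~*) are illustrative assumptions. All TD errors are frozen before any rule value changes, as the slide warns:

```python
def sparse_q_update(rules, s_a, s_star_a, rewards, alpha=0.1, gamma=0.95):
    """One sparse cooperative Q-learning step over all rules.

    s_a       -- dict with the current state and the joint action actually taken
    s_star_a  -- dict with the next state and the maximizing joint action a*
    rewards   -- dict mapping each agent i to its reward R_i(s, a)
    """
    involved = {i for r in rules for i in r.agents}
    # 1. Freeze every agent's TD error using the *current* rule values,
    #    since the Q_i's are themselves built from the rules.
    td = {i: rewards[i] + gamma * local_q(i, s_star_a) - local_q(i, s_a)
          for i in involved}
    # 2. Only then push the errors back into the rules that applied in (s, a).
    for rule in rules:
        if rule.applies(s_a):
            rule.value += alpha * sum(td[i] for i in rule.agents)
```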
Advantages
- Very, very sparse Q-function
- Rules are independent of the state and action factors they don't care about!
Disadvantages
- Defining the rules takes a bit of problem-specific micromanagement
Questions?