Sparse Cooperative Q-Learning

  • A single-agent MDP over the joint state and action space is inefficient
  • Independent Learners do not necessarily converge
Q-Function Definition

Q^\star(s,a) = R(s,a) + \gamma\sum_{s'}T(s, a, s')\max_{a'}Q^\star(s',a')

Q-Function Update

Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]
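As a concrete reference, here is a minimal tabular sketch of this update; the Q-table layout, the action set, and the step sizes are illustrative assumptions, not something from the slides.

```python
# Minimal tabular Q-learning update (a sketch; table layout and
# hyperparameters are illustrative assumptions).
from collections import defaultdict

ACTIONS = [0, 1, 2, 3]
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})  # Q[s][a] -> value

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```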

Joint state and action space!

We don't always need to coordinate!

Q(s,a) \rightarrow v \in \mathbb{R}

Q_{i}(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}

Q(s,a) = \sum_{i} Q_i(\tilde{s}, \tilde{a})
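A minimal sketch of this decomposition, assuming each local Q_i only reads the state variables and actions in its own scope; the agent ids, scopes, and table layout below are illustrative assumptions.

```python
# Q(s,a) = sum_i Q_i(s~, a~): each agent i keeps a local table indexed only
# by the state variables and actions in its scope (scopes are assumptions).
from collections import defaultdict

AGENTS = (1, 2)
Q_local = {i: defaultdict(float) for i in AGENTS}   # Q_i[(s~, a~)] -> value
state_scope = {1: (0,), 2: (0, 1)}                  # state variables Q_i reads
action_scope = {1: (1, 2), 2: (2,)}                 # agents whose actions Q_i reads

def local_key(i, joint_s, joint_a):
    """Project the joint (s, a) onto the variables relevant to agent i."""
    return (tuple(joint_s[k] for k in state_scope[i]),
            tuple(joint_a[k] for k in action_scope[i]))

def joint_q(joint_s, joint_a):
    """Q(s,a) as the sum of the per-agent local Q-values."""
    return sum(Q_local[i][local_key(i, joint_s, joint_a)] for i in AGENTS)
```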

Additional Decomposition

Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}

\rho(\tilde{s},\tilde{a}) \rightarrow v \in \mathbb{R}

Example Rule

\rho_0(s_0 = 0 \land a_1 = 3 \land a_2 = 1) = 7.5

n_0 = 2
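A sketch of how such a value rule could be stored and how Q_i is assembled from the rules that apply; only the rho_0 numbers come from the slide, while the Rule container and the condition encoding are illustrative assumptions.

```python
# A value rule: a condition over (joint state, joint action), a value rho_j,
# and the agents it involves (n_j = len(agents)).
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class Rule:
    applies: Callable[[Tuple, Dict[int, int]], bool]  # (joint_s, joint_a) -> bool
    value: float                                      # rho_j(s~, a~)
    agents: Tuple[int, ...]                           # involved agents, n_j = len(agents)

rules = [
    # rho_0(s_0 = 0 and a_1 = 3 and a_2 = 1) = 7.5, with n_0 = 2
    Rule(lambda s, a: s[0] == 0 and a[1] == 3 and a[2] == 1,
         value=7.5, agents=(1, 2)),
]

def q_i(i, joint_s, joint_a):
    """Q_i(s~, a~) = sum of rho_j / n_j over applicable rules involving agent i."""
    return sum(r.value / len(r.agents)
               for r in rules
               if i in r.agents and r.applies(joint_s, joint_a))

# e.g. q_i(1, (0,), {1: 3, 2: 1}) == 3.75; summing over agents 1 and 2 gives 7.5
```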
a^\star = \arg\max_{a'}Q(s',a')

Original Update

Q(s,a) \mathrel{+}= \alpha\left [ R(s,a) + \gamma\max_{a'}Q(s',a') - Q(s,a) \right ]

Per Agent

Q_i(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]

Per Rule

\rho_k(\tilde{s},\tilde{a}) \mathrel{+}= \alpha\sum_{i=1}^{n_k}\left [ R_i(s,a) + \gamma Q_i(\tilde{s}',\tilde{a}^\star) - Q_i(\tilde{s},\tilde{a}) \right ]
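Both the per-agent and per-rule updates evaluate Q_i at the greedy joint action a*. In a small problem it can be found by brute force over all joint actions, as in the sketch below (q_fn and action_sets are assumed inputs); practical implementations exploit the rule structure instead of enumerating.

```python
# Brute-force sketch of a* = argmax_{a'} Q(s', a'): enumerate every joint
# action and keep the best (q_fn and action_sets are illustrative assumptions).
from itertools import product

def greedy_joint_action(q_fn, s_next, action_sets):
    """Return the joint action (dict agent -> action) maximizing q_fn(s_next, a)."""
    agents = sorted(action_sets)
    best_a, best_q = None, float("-inf")
    for combo in product(*(action_sets[i] for i in agents)):
        a = dict(zip(agents, combo))
        q = q_fn(s_next, a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```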

Q_{i}(\tilde{s},\tilde{a}) = \sum_j \frac{\rho_j^i(\tilde{s},\tilde{a})}{n_j}

Remember to compute all rule updates before applying them!
They depend on each other!
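Putting it together, a sketch of one update step under the assumptions of the earlier Rule sketch: every matching rule receives alpha times the summed TD errors of its agents, all TD errors are computed from the old rule values first, and the increments are applied only afterwards, which is exactly the caveat above. The function name, argument layout, and a_star (e.g. from the greedy_joint_action sketch) are assumptions.

```python
# One sparse cooperative Q-learning step (a sketch; names are assumptions).
# rules and q_i follow the earlier Rule sketch; rewards maps agent -> R_i.
def sparse_q_step(rules, q_i, s, a, rewards, s_next, a_star, alpha=0.1, gamma=0.95):
    # Per-agent TD errors: R_i + gamma * Q_i(s', a*) - Q_i(s, a), old values only.
    td = {i: rewards[i] + gamma * q_i(i, s_next, a_star) - q_i(i, s, a)
          for i in rewards}
    # First collect the increment of every rule that matches (s, a)...
    increments = [(rule, alpha * sum(td[i] for i in rule.agents))
                  for rule in rules if rule.applies(s, a)]
    # ...then apply them, so no update sees another rule's new value.
    for rule, delta in increments:
        rule.value += delta
```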


Advantages

  • Very, very sparse Q-Function
  • Rules are independent of the state and action factors they don't involve!

Disadvantages

  • Defining the rules takes a bit of problem-specific micromanagement

Questions?
