Multi-agent deep reinforcement learning (MDRL)

Sergey Sviridov

N shades of MDRL

  • Centralized Training with Centralized Execution

 

  • Centralized Training with Decentralized Execution

 

  • Decentralized Training with Decentralized Execution
 

N shades of MDRL

  • Competitive

 

  • Cooperative

 

  • Mixed
 

N shades of MDRL

  • Analysis of emergent behavior

 

  • Learning communication

 

  • Learning cooperation

 

  • Agents modeling agents
 

Major Challenges

  • Non-stationarity

 

  • Curse of dimensionality (the joint action space grows exponentially with the number of agents)

 

  • Multi-agent credit assignment

 

  • Global exploration

 

  • Relative overgeneralization
 

Decentralized Training and Decentralized Execution (DTDE)

Just train DQN for each agent independently for cooperative or competitive behavior to emerge

IQL (independent Q-learning) with importance sampling and fingerprint conditioning

Use the training epoch number and exploration rate as a fingerprint to condition each agent's Q-function:

L(\theta) = \displaystyle\sum_{i=1}^b \frac{\pi^{t_r}_{-a}(\mathbf{u}_{-a}|s)}{\pi^{t_i}_{-a}(\mathbf{u}_{-a}|s)} \left[(y_{i}^{DQN} - Q(s, u; \theta))^2\right]
O^\prime (s) = \{O(s), \epsilon, e\}
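A minimal sketch of the idea above, assuming a PyTorch-style Q-network; the batch fields and function names (`q_net`, `pi_others_now`, etc.) are illustrative, not taken from the paper's code. The observation is augmented with the fingerprint \(\{\epsilon, e\}\), and replayed transitions are reweighted by how likely the other agents' recorded actions are under their current versus their data-collection policies:

```python
import torch

def fingerprint_obs(obs, epsilon, epoch):
    # O'(s) = {O(s), eps, e}: append exploration rate and training epoch to each observation
    fp = torch.tensor([epsilon, float(epoch)], dtype=obs.dtype)
    return torch.cat([obs, fp.expand(obs.shape[0], 2)], dim=-1)

def iql_loss(q_net, target_net, batch, gamma=0.99):
    # Importance weight: probability of the other agents' recorded joint action under
    # their *current* policies divided by its probability at collection time (pi^{t_r} / pi^{t_i})
    weights = batch["pi_others_now"] / batch["pi_others_then"]
    q = q_net(batch["obs"]).gather(1, batch["action"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = batch["reward"] + gamma * target_net(batch["next_obs"]).max(dim=1).values
    return (weights * (y - q) ** 2).mean()
```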

Just train PPO for competitive behavior to emerge

Tasks: Run to Goal, You Shall Not Pass, Sumo, Kick and Defend

Centralized Training with Decentralized Execution (CTDE)

Train actor-critic with centralized critic and counterfactual baseline in cooperative setting

(Diagram: information flow between the decentralized actors, the centralized critic, and the environment)

A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \displaystyle\sum_{u'^a}\pi^a(u'^a|\tau^a)\,Q(s, (u^{-a}, u'^a))
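A hedged sketch of the counterfactual advantage, assuming the centralized critic has already been evaluated for every candidate action \(u'^a\) of agent \(a\) with the other agents' joint action \(u^{-a}\) held fixed (tensor names are illustrative):

```python
import torch

def coma_advantage(q_values, pi_a, action_a):
    # q_values: [B, n_actions] -- Q(s, (u^{-a}, u'^a)) for every candidate action u'^a of agent a
    # pi_a:     [B, n_actions] -- agent a's policy pi^a(. | tau^a)
    # action_a: [B]            -- the action agent a actually took
    q_taken = q_values.gather(1, action_a.unsqueeze(1)).squeeze(1)  # Q(s, u)
    baseline = (pi_a * q_values).sum(dim=1)                         # counterfactual baseline
    return q_taken - baseline                                       # A^a(s, u)
```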

Train DQN with summed combined Q-function in cooperative setting

Fetch, Switch and Checkers environments

Q((h^1, h^2, ..., h^d), (a^1, a^2, ..., a^d)) \approx \displaystyle\sum_{i=1}^d \tilde{Q_i}(h^i, a^i)
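A minimal VDN-style sketch: the joint value is just the sum of the per-agent utilities for the chosen actions, so a standard DQN loss on \(Q_{tot}\) trains all agents end-to-end (assuming per-agent Q-networks already exist; names are illustrative):

```python
import torch

def vdn_q_tot(per_agent_q, actions):
    # per_agent_q: list of [B, n_actions] tensors, one per agent (Q_i(h^i, .))
    # actions:     list of [B] long tensors with each agent's chosen action
    chosen = [q.gather(1, a.unsqueeze(1)).squeeze(1) for q, a in zip(per_agent_q, actions)]
    return torch.stack(chosen, dim=0).sum(dim=0)  # Q_tot = sum_i Q_i(h^i, a^i)
```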

Train DQN with monotonic combined Q-function in cooperative setting

Agent networks, mixing network and a set of hypernetworks

\frac{\partial{Q_{tot}}}{\partial{Q_a}} \geq 0, \forall a \in A
Q_{tot}(s, \bold{\tau}, \bold{u}; \bold{\theta}, \phi) \coloneqq f_{\phi}(s, \{Q_a(\tau^a, u^a; \theta^a)\}_{a=1}^N)
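A condensed sketch of the monotonic mixing in PyTorch: hypernetworks map the global state to the mixing-network weights, and taking their absolute value enforces \(\partial Q_{tot} / \partial Q_a \geq 0\). Layer sizes and names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks: state -> mixing-network weights and biases
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs, state):
        # agent_qs: [B, n_agents] per-agent Q_a(tau^a, u^a); state: [B, state_dim]
        B = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, self.embed_dim)  # >= 0
        b1 = self.hyper_b1(state).view(B, 1, self.embed_dim)
        h = torch.relu(torch.bmm(agent_qs.view(B, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(B, self.embed_dim, 1)              # >= 0
        b2 = self.hyper_b2(state).view(B, 1, 1)
        return (torch.bmm(h, w2) + b2).view(B)  # Q_tot(s, u)
```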

Train DDPG with separate centralized critics and conditioning on the other agents' policies (learned from observations) in a mixed setting

\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x}, a \sim D} \left[ \nabla_{\theta_i} \mu_i(a_i|o_i) \, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)} \right]
L(\phi_i^j) = -\mathbb{E}_{o_j, a_j} \left[ \log \hat{\mu}_i^j(a_j|o_j) + \lambda H(\hat{\mu}_i^j) \right]
\hat{y} = r_i + \gamma Q_i^{\boldsymbol{\mu}'} \left( \mathbf{x}', \hat{\mu}_i'^1(o_1), \ldots, \mu_i'(o_i), \ldots, \hat{\mu}_i'^N(o_N) \right)
\nabla_{\theta_i^{(k)}} J_e(\mu_i) = \frac{1}{K} \mathbb{E}_{\mathbf{x}, a \sim D_i^{(k)}} \left[ \nabla_{\theta_i^{(k)}} \mu_i^{(k)}(a_i|o_i) \, \nabla_{a_i} Q_i^{\boldsymbol{\mu}_i}(\mathbf{x}, a_1, \ldots, a_N) \big|_{a_i = \mu_i^{(k)}(o_i)} \right]

\(L(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'} \left[ \left( Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N) - y \right)^2 \right]\),

where \(y = r_i + \gamma Q_i^{\boldsymbol{\mu}'}(\mathbf{x}', a_1', \ldots, a_N') \big|_{a_j' = \mu_j'(o_j)}\)
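A compressed MADDPG-style update sketch for one agent \(i\), assuming a centralized critic over \((\mathbf{x}, a_1, \ldots, a_N)\), a decentralized deterministic actor \(\mu_i(o_i)\), and target networks; all module and batch names are illustrative, and the opponent-policy approximation and policy ensembles are omitted:

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, target_actors, target_critics, batch, gamma=0.99):
    obs, acts, rew, next_obs = batch["obs"], batch["acts"], batch["rew"], batch["next_obs"]
    x, next_x = torch.cat(obs, dim=-1), torch.cat(next_obs, dim=-1)  # x = all observations

    # Critic target: y = r_i + gamma * Q_i^{mu'}(x', mu'_1(o'_1), ..., mu'_N(o'_N))
    with torch.no_grad():
        next_acts = [ta(o) for ta, o in zip(target_actors, next_obs)]
        y = rew[i] + gamma * target_critics[i](next_x, torch.cat(next_acts, dim=-1))
    q = critics[i](x, torch.cat(acts, dim=-1))
    critic_loss = F.mse_loss(q, y)

    # Actor: ascend Q_i w.r.t. agent i's own action, with a_i = mu_i(o_i)
    acts_pi = list(acts)
    acts_pi[i] = actors[i](obs[i])
    actor_loss = -critics[i](x, torch.cat(acts_pi, dim=-1)).mean()
    return critic_loss, actor_loss
```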

Train DDPG with separate centralized critics and conditioning on the other agents' policies (learned from observations) in a mixed setting

Cooperative Communication, Predator-Prey, Cooperative Navigation, Physical Deception

 

Comparison

QMIX + CEM for continuous action spaces with factorized Q-functions for mixed environments

Multi-Agent MuJoCo
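With continuous actions there is no exact max over a discrete action set, so the greedy action of each factored utility can be approximated with the cross-entropy method (CEM). A rough sketch, assuming a per-agent utility function `q_fn(obs_batch, actions)`; the name and the sampling hyperparameters are illustrative:

```python
import torch

def cem_argmax_q(q_fn, obs, act_dim, n_iter=5, pop=64, elite=8):
    # Approximate argmax_a Q(h, a) by iteratively refitting a Gaussian to the
    # top-scoring sampled actions. `obs` is a single observation vector.
    mean, std = torch.zeros(act_dim), torch.ones(act_dim)
    for _ in range(n_iter):
        samples = mean + std * torch.randn(pop, act_dim)
        scores = q_fn(obs.expand(pop, -1), samples)       # Q(h, a) per sampled action
        top = samples[scores.topk(elite).indices]
        mean, std = top.mean(dim=0), top.std(dim=0) + 1e-6
    return mean
```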

MADDPG with factorized Q-functions for mixed environments

Combine individual Q-functions:

$$g_{\phi}(s, Q_1(\tau^1, u^1, ..., u^N; \theta^1), ..., Q_N(\tau^N, u^1, ..., u^N; \theta^N)),$$ where \(g_{\phi}\) is the mixing network.
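A short sketch of this factorization, assuming per-agent utilities that see the joint action and a state-conditioned mixing network \(g_{\phi}\) (a monotonic mixer as in QMIX is one possible choice; all names are illustrative):

```python
import torch

def factored_critic(per_agent_q_nets, mixer, state, agent_obs, joint_action):
    # Each utility Q_i sees its own observation/history plus the joint action ...
    q_i = torch.stack([q(o, joint_action) for q, o in zip(per_agent_q_nets, agent_obs)], dim=-1)
    # ... and g_phi combines them into one centralized critic value.
    return mixer(q_i, state)  # g_phi(s, Q_1, ..., Q_N)
```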

Results


(a) Continuous Predator-Prey, (b) 2-Agent HalfCheetah, (c) 2-Agent Hopper, (d) 3-Agent Hopper

Other examples

Learning Communication

Reinforced Inter-Agent Learning (RIAL)

Differentiable Inter-Agent Learning (DIAL)

Simultaneously learn policy and communication in cooperative setting (Switch Riddle and MNIST Game)

Simultaneously learn policy and communication in cooperative setting

Traffic junction and Combat tasks

Account for the learning of other agents in the iterated prisoners' dilemma and rock-paper-scissors

Agents modeling agents

MADDPG + MiniMax + Multi-Agent Adversarial Learning


Build models of other agents from observations

ToMNet Architecture

\( \hat \pi\) - next-step action probabilities

\( \hat c \) - whether certain objects will be consumed

\( \hat{SR} \) - predicted successor representations
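A minimal sketch of the three prediction heads, assuming a shared embedding \(z\) of the observed agent's past trajectories and the current state has already been computed; the module and all sizes are illustrative:

```python
import torch.nn as nn

class ToMNetHeads(nn.Module):
    def __init__(self, embed_dim, n_actions, n_objects, sr_dim):
        super().__init__()
        self.pi_hat = nn.Linear(embed_dim, n_actions)  # next-step action probabilities
        self.c_hat = nn.Linear(embed_dim, n_objects)   # will each object be consumed?
        self.sr_hat = nn.Linear(embed_dim, sr_dim)     # predicted successor representation

    def forward(self, z):
        return (self.pi_hat(z).softmax(dim=-1),
                self.c_hat(z).sigmoid(),
                self.sr_hat(z))
```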
