Multi-agent deep reinforcement learning (MDRL)
Sergey Sviridov
N shades of MDRL
- Centralized Training with Centralized Execution
- Centralized Training with Decentralized Execution
- Decentralized Training with Decentralized Execution
N shades of MDRL
- Competitive
- Cooperative
- Mixed
N shades of MDRL
- Analysis of emergent behavior
- Learning communication
- Learning cooperation
- Agents modeling agents
Major Challenges
- Non-stationarity
- Curse of dimensionality (the joint action space grows exponentially with the number of agents)
- Multi-agent credit assignment
- Global exploration
- Relative overgeneralization
Decentralized Training with Decentralized Execution (DTDE)

Just train DQN for each agent independently for cooperative or competitive behavior to emerge
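
A minimal sketch of this fully decentralized setup, assuming hypothetical `DQNAgent` and `env` interfaces: each agent keeps its own replay buffer and Q-network and simply treats all other agents as part of the environment.

```python
# Independent DQN sketch. `DQNAgent`, `env`, `obs_dim`, `n_actions`,
# `n_agents` and `total_steps` are hypothetical placeholders.
agents = [DQNAgent(obs_dim, n_actions) for _ in range(n_agents)]

obs = env.reset()                                   # list of per-agent observations
for step in range(total_steps):
    actions = [ag.act(o) for ag, o in zip(agents, obs)]
    next_obs, rewards, done, info = env.step(actions)
    for i, ag in enumerate(agents):
        # each agent stores only its own transition and runs a standard
        # single-agent DQN update; the other agents make its world non-stationary
        ag.buffer.add(obs[i], actions[i], rewards[i], next_obs[i], done)
        ag.update()
    obs = env.reset() if done else next_obs
```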

Independent Q-learning with importance sampling and fingerprint conditioning
Use the epoch number and the exploration rate as a fingerprint to condition each agent's Q-function.
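
A small sketch of the fingerprint itself (function name and observation size are illustrative): the current epoch and exploration rate \(\epsilon\) are appended to the observation, so a replayed transition carries information about how old the other agents' policies were when it was generated.

```python
import numpy as np

def add_fingerprint(obs, epoch, epsilon):
    """Append the fingerprint (training epoch, exploration rate) to an agent's
    observation; the Q-network is then conditioned on the 'age' of the policies
    that produced a transition stored in the replay buffer."""
    return np.concatenate([obs, [float(epoch), float(epsilon)]])

# illustrative usage: the fingerprinted observation goes into the replay buffer,
# and the Q-network input size grows by two
obs_fp = add_fingerprint(np.zeros(8), epoch=42, epsilon=0.05)
```
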
Just train PPO for competitive behavior to emerge

Tasks: Run to Goal, You Shall Not Pass, Sumo, Kick and Defend
Centralized Training with Decentralized Execution (CTDE)
Train actor-critic with centralized critic and counterfactual baseline in cooperative setting


Information flow between the actors, the centralized critic and the environment
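
A minimal sketch of the counterfactual baseline, assuming the centralized critic can return the Q-values of all of one agent's actions with the other agents' actions held fixed: the baseline is the expectation of those Q-values under the agent's own policy, so the resulting advantage isolates that agent's contribution to the joint return.

```python
import torch

def counterfactual_advantage(q_values, pi, chosen_action):
    """q_values:      [n_actions], critic's Q(s, u^{-a}, .) over agent a's actions,
                      with the other agents' actions held fixed
       pi:            [n_actions], the agent's policy pi(. | tau^a)
       chosen_action: index of the action the agent actually took"""
    baseline = (pi * q_values).sum()            # E_{u^a ~ pi}[ Q(s, u^{-a}, u^a) ]
    return q_values[chosen_action] - baseline   # A^a(s, u)

# illustrative numbers
adv = counterfactual_advantage(torch.tensor([1.0, 0.5, -0.2]),
                               torch.tensor([0.6, 0.3, 0.1]),
                               chosen_action=0)
```
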
Train DQN with summed combined Q-function in cooperative setting

Fetch, Switch and Checkers environments
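
A minimal sketch of the summed joint Q-value: the team Q-value is the sum of the Q-values of the individually chosen actions, and the ordinary DQN loss on the shared team reward is backpropagated through this sum into every agent's network.

```python
import torch

def summed_q_tot(per_agent_q, actions):
    """per_agent_q: [n_agents, n_actions] individual Q_i(tau^i, .)
       actions:     [n_agents] chosen action indices u^i
       returns      Q_tot = sum_i Q_i(tau^i, u^i)"""
    chosen = per_agent_q.gather(1, actions.unsqueeze(1)).squeeze(1)
    return chosen.sum()

# illustrative usage
q_tot = summed_q_tot(torch.randn(3, 5), torch.tensor([0, 2, 4]))
```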

Train DQN with monotonic combined Q-function in cooperative setting
Agent networks, mixing network and a set of hypernetworks
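
A compact sketch of a QMIX-style mixer (layer sizes are illustrative): hypernetworks map the global state to the mixing weights, and taking their absolute value keeps \(Q_{tot}\) monotonic in every agent's Q-value, so each agent's greedy action with respect to its own \(Q_i\) is also greedy with respect to \(Q_{tot}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mixer(nn.Module):
    """State-conditioned monotonic mixing of per-agent Q-values (sketch)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):        # [batch, n_agents], [batch, state_dim]
        b, n = agent_qs.shape
        w1 = torch.abs(self.hyper_w1(state)).view(b, n, -1)   # non-negative weights
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, n), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b)           # Q_tot
```
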
Train DDPG with separate centralized critics conditioned on other agents' policies (which can be learned from observations) in mixed setting
\(L(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\big[(Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N) - y)^2\big]\),
where \(y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}(\mathbf{x}', a_1', \ldots, a_N')\big|_{a_j' = \boldsymbol{\mu}_j'(o_j)}\)
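
A sketch of this critic update for agent \(i\), assuming hypothetical `critic_i`, `target_critic_i`, target actors \(\mu'_j\) and a replay `batch` of tensors:

```python
import torch
import torch.nn.functional as F

def centralized_critic_loss(critic_i, target_critic_i, target_actors, batch, gamma=0.95):
    x, joint_a, r_i, x_next, obs_next = (batch["x"], batch["joint_actions"],
                                         batch["r_i"], batch["x_next"],
                                         batch["obs_next"])
    with torch.no_grad():
        # a'_j = mu'_j(o_j): every target actor acts on its own next observation
        next_joint_a = torch.cat([mu(o) for mu, o in zip(target_actors, obs_next)], dim=-1)
        y = r_i + gamma * target_critic_i(x_next, next_joint_a)
    q = critic_i(x, joint_a)                 # Q_i^mu(x, a_1, ..., a_N)
    return F.mse_loss(q, y)
```
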
Train DDPG with separate centralized critics conditioned on other agents' policies (which can be learned from observations) in mixed setting

Cooperative Communication, Predator-Prey, Cooperative Navigation, Physical Deception
Comparison

QMIX + CEM for continuous action spaces with factorized Q-functions for mixed environments

Multi-Agent MuJoCo
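
In continuous action spaces there is no closed-form greedy action, so the cross-entropy method (CEM) is used to approximately maximize each agent's utility; a minimal sketch (sampler settings and the `q_fn(obs, action)` interface are illustrative):

```python
import torch

def cem_argmax(q_fn, obs, act_dim, iters=5, pop=64, elites=8):
    """Approximate argmax_a Q(obs, a) by iteratively sampling actions from a
    Gaussian, scoring them with the Q-network, and refitting the Gaussian to
    the top-scoring candidates. `obs` has shape [1, obs_dim]."""
    mean, std = torch.zeros(act_dim), torch.ones(act_dim)
    for _ in range(iters):
        samples = (mean + std * torch.randn(pop, act_dim)).clamp(-1.0, 1.0)
        scores = q_fn(obs.expand(pop, -1), samples).squeeze(-1)
        elite = samples[scores.topk(elites).indices]
        mean, std = elite.mean(0), elite.std(0) + 1e-6
    return mean      # approximate greedy action
```
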
MADDPG with factorized Q-functions for mixed environments
Combine individual Q-functions:
$$g_{\phi}(s, Q_1(\tau^1, u^1, ..., u^N; \theta^1), ..., Q_N(\tau^N, u^1, ..., u^N; \theta^N)),$$ where \(g_{\phi}\) is the mixing network.
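
A short sketch of how the factored joint critic could be assembled, with the QMIX-style mixer above standing in for \(g_{\phi}\); `agent_critics`, `trajectories`, `u` (the joint action) and `state` are hypothetical tensors/modules:

```python
import torch

# per-agent critics evaluated on their own trajectory and the joint action u
per_agent_q = torch.stack(
    [Q_i(tau_i, u) for Q_i, tau_i in zip(agent_critics, trajectories)], dim=1
)                                   # [batch, n_agents]
q_tot = mixer(per_agent_q, state)   # g_phi(s, Q_1, ..., Q_N)
```
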
Results

(a) Continuous Predator-Prey, (b) 2-Agent HalfCheetah, (c) 2-Agent Hopper, (d) 3-Agent Hopper
Other examples

Learning Communication

Reinforced Inter-Agent Learning (RIAL)
Differentiable Inter-Agent Learning (DIAL)
Simultaneously learn policy and communication in cooperative setting (Switch Riddle and MNIST Game)
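
A sketch of the DIAL-style channel unit (noise scale illustrative): during centralized training the message stays continuous, so the gradient of one agent's loss can flow back into the sending agent's network; at decentralized execution the message is discretized to a hard bit.

```python
import torch

def channel_unit(message_logit, training, sigma=2.0):
    """Continuous, noisy message during training (differentiable);
    discretized message at execution time."""
    if training:
        return torch.sigmoid(message_logit + sigma * torch.randn_like(message_logit))
    return (message_logit > 0).float()
```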

Simultaneously learn policy and communication in cooperative setting

Traffic junction and Combat tasks
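
A minimal sketch of one mean-pooled communication step of the kind described here: each agent's hidden state is updated from its own state and the average of the other agents' hidden states, so the whole channel is differentiable and is learned jointly with the policy.

```python
import torch
import torch.nn as nn

class CommStep(nn.Module):
    """One communication step over a set of agents (sketch)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.self_lin = nn.Linear(hidden_dim, hidden_dim)
        self.comm_lin = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, h):                        # h: [n_agents, hidden_dim]
        n = h.shape[0]
        comm = (h.sum(dim=0, keepdim=True) - h) / max(n - 1, 1)   # mean over the others
        return torch.tanh(self.self_lin(h) + self.comm_lin(comm))
```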


Account for the learning of the other agent in the iterated prisoner's dilemma and rock-paper-scissors
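
A sketch of the opponent-aware gradient (`V1` and `V2` are hypothetical differentiable value functions of both agents' parameters, e.g. exact values in the iterated prisoner's dilemma): agent 1 differentiates its own value through the opponent's anticipated one-step naive-learning update instead of treating the opponent as static.

```python
import torch

def opponent_aware_grad(V1, V2, theta1, theta2, opp_lr=1.0):
    # opponent's one-step gradient update, kept in the computation graph
    grad2 = torch.autograd.grad(V2(theta1, theta2), theta2, create_graph=True)[0]
    theta2_ahead = theta2 + opp_lr * grad2
    # gradient for agent 1, evaluated after the opponent's anticipated update
    return torch.autograd.grad(V1(theta1, theta2_ahead), theta1)[0]
```
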
Agents modeling agents
MADDPG + MiniMax + Multi-Agent Adversarial Learning

MADDPG + MiniMax + Multi-Agent Adversarial Learning
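
A sketch of the worst-case target construction (the critic interface and step size are illustrative): the other agents' actions are perturbed by one gradient step in the direction that minimizes agent \(i\)'s Q-value, approximating the inner minimization of the minimax objective, and the learning targets are then evaluated at these adversarial actions.

```python
import torch

def worst_case_other_actions(q_i, x, actions, agent_idx, eps=0.1):
    """Perturb every other agent's action to (locally) minimize Q_i."""
    actions = [a.detach().clone().requires_grad_(j != agent_idx)
               for j, a in enumerate(actions)]
    q = q_i(x, torch.cat(actions, dim=-1))
    q.sum().backward()
    with torch.no_grad():
        for j, a in enumerate(actions):
            if j != agent_idx:
                a -= eps * a.grad.sign()       # adversarial step against agent i
    return [a.detach() for a in actions]
```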

Build models of other agents from observations

ToMNet Architecture
\( \hat \pi\) - next-step action probabilities
\( \hat c \) - whether certain objects will be consumed
\( \hat{SR} \) - predicted successor representations
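
A sketch of the three prediction heads (dimensions are illustrative), sitting on top of an embedding \(z\) that a ToMNet-style model would compute from the observed agent's past episodes and current trajectory:

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    def __init__(self, embed_dim, n_actions, n_objects, sr_dim):
        super().__init__()
        self.policy_head = nn.Linear(embed_dim, n_actions)    # \hat{pi}
        self.consume_head = nn.Linear(embed_dim, n_objects)   # \hat{c}
        self.sr_head = nn.Linear(embed_dim, sr_dim)           # \hat{SR}

    def forward(self, z):
        return (torch.softmax(self.policy_head(z), dim=-1),   # next-step action probs
                torch.sigmoid(self.consume_head(z)),          # P(object consumed)
                self.sr_head(z))                              # successor representation
```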