Multi-agent deep reinforcement learning (MDRL)
Sergey Sviridov
N shades of MDRL
- Centralized Training with Centralized Execution
- Centralized Training with Decentralized Execution
- Decentralized Training with Decentralized Execution
N shades of MDRL
- Competitive
- Cooperative
- Mixed
N shades of MDRL
- Analysis of emergent behavior
- Learning communication
- Learning cooperation
- Agents modeling agents
Major Challenges
- Non-stationarity
- Curse of dimensionality (action space grows exponentially in the number of agents)
- Multi-agent credit assignment
- Global exploration
- Relative overgeneralization
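A small sketch of the curse-of-dimensionality bullet above, assuming every agent has the same number of discrete actions (the function name is hypothetical): the joint action space then has |A|^N entries.

```python
from itertools import product


def joint_action_count(actions_per_agent: int, n_agents: int) -> int:
    """Size of the joint action space when each agent has the same action count."""
    return actions_per_agent ** n_agents


# Enumerate the joint actions explicitly for a tiny case (5 actions, 3 agents)
# to check the closed-form count against brute force.
joint_actions = list(product(range(5), repeat=3))
```

Going from 3 to 10 agents with 5 actions each already moves the joint space from 125 to several million entries, which is why a single Q-function over joint actions quickly becomes impractical.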
Decentralized Training and Decentralized Execution (DTDE)
Just train DQN for each agent independently for cooperative or competitive behavior to emerge
IQL with importance sampling and fingerprint conditioning
Use the epoch number and exploration rate as a fingerprint to condition each agent's Q-function.
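A minimal sketch of fingerprint conditioning, assuming observations are flat vectors and that the epoch is normalized by a known total (both the helper name and the normalization are assumptions for illustration): the fingerprint is simply appended to the Q-function input.

```python
def add_fingerprint(observation, epoch, epsilon, total_epochs):
    """Append (normalized epoch, exploration rate) to an observation vector.

    The augmented vector is what each agent's Q-function would take as input,
    so that replayed samples carry a hint about how old the other agents'
    policies were when the sample was collected.
    """
    return list(observation) + [epoch / total_epochs, epsilon]


# Hypothetical sample: a 2-dimensional observation at epoch 50 of 100.
augmented = add_fingerprint([0.1, 0.2], epoch=50, epsilon=0.3, total_epochs=100)
```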
Just train PPO for competitive behavior to emerge
Tasks: Reach goal, You Shall Not Pass, Sumo, Kick and Defend
Centralized Training with Decentralized Execution (CTDE)
Train actor-critic with centralized critic and counterfactual baseline in cooperative setting
Information flow, actor and critic
Environment
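A sketch of the counterfactual baseline, assuming discrete actions and a centralized critic that has already been evaluated for every alternative action of one agent while the other agents' actions stay fixed (names are hypothetical): the baseline is the policy-weighted average of those Q-values.

```python
def counterfactual_advantage(q_values, policy, taken_action):
    """Advantage of one agent's action against a counterfactual baseline.

    q_values[u] -- centralized critic value when this agent plays u and all
                   other agents keep their actual actions.
    policy[u]   -- this agent's probability of choosing u.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[taken_action] - baseline


# Hypothetical numbers: three available actions, the agent took action 1.
advantage = counterfactual_advantage([1.0, 3.0, 2.0], [0.2, 0.5, 0.3], 1)
```

Because the baseline marginalizes out only this agent's action, the advantage reflects the agent's own contribution, which is how the method addresses multi-agent credit assignment.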
Train DQN with summed combined Q-function in cooperative setting
Fetch, Switch and Checkers environments
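A sketch of the summed Q-function, assuming tabular per-agent Q-values for a single step (function names are hypothetical): because the joint value is a plain sum, each agent can pick its own greedy action and the resulting joint action is still greedy for the sum.

```python
def summed_q_tot(per_agent_q, joint_action):
    """Joint value as the sum of each agent's Q-value for its own action."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def greedy_joint_action(per_agent_q):
    """Decentralized greedy choice: every agent maximizes its own Q-values."""
    return tuple(max(range(len(q)), key=lambda a: q[a]) for q in per_agent_q)


# Hypothetical Q-values: agent 0 has 2 actions, agent 1 has 3 actions.
per_agent_q = [[0.1, 0.9], [0.4, 0.2, 0.3]]
best = greedy_joint_action(per_agent_q)
```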
Train DQN with monotonic combined Q-function in cooperative setting
Agent networks, mixing network and a set of hypernetworks
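A sketch of a monotonic mixing step, assuming a two-layer mixer with ELU activation whose weights would normally come from hypernetworks conditioned on the state; here they are passed in directly as plain lists (all names and shapes are assumptions). Monotonicity follows from taking the absolute value of the mixing weights.

```python
import math


def elu(x):
    """Exponential linear unit, a monotonically increasing activation."""
    return x if x > 0 else math.exp(x) - 1.0


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Mix per-agent Q-values into Q_tot with non-negative weights.

    w1: [n_agents][hidden], b1: [hidden], w2: [hidden], b2: scalar.
    abs() on the weights keeps dQ_tot/dQ_i >= 0; biases are unconstrained.
    """
    hidden = [
        elu(sum(abs(w1[i][j]) * agent_qs[i] for i in range(len(agent_qs))) + b1[j])
        for j in range(len(b1))
    ]
    return sum(abs(w2[j]) * hidden[j] for j in range(len(hidden))) + b2


# Hypothetical weights, including negative raw values that abs() makes positive.
w1, b1, w2, b2 = [[0.5, -1.0], [-0.3, 0.2]], [0.0, -0.1], [-0.7, 0.4], 0.1
low = monotonic_mix([0.0, 1.0], w1, b1, w2, b2)
high = monotonic_mix([0.5, 1.0], w1, b1, w2, b2)
```

Raising any single agent's Q-value can never lower the mixed value, so per-agent greedy actions remain consistent with the greedy joint action.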
Train DDPG with separate critics and policy conditioning (learning other agents' policies from observations) in mixed setting
\(L(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}^\prime} [(Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \dots, a_N) - y)^2]\),
where \(y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}^\prime}(\mathbf{x}^\prime, a_1^\prime, \dots, a_N^\prime)\big|_{a_j^\prime = \boldsymbol{\mu}_j^\prime(o_j)}\)
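A sketch of how the target y and the squared error in the loss above could be computed for one transition, assuming the target policies and the target critic are plain callables (all names are hypothetical, and terminal-state handling is left out):

```python
def critic_target(reward_i, gamma, next_state, next_observations,
                  target_policies, target_critic_i):
    """y = r_i + gamma * Q_i'(x', a_1', ..., a_N') with a_j' = mu_j'(o_j)."""
    # Every agent's next action comes from its own target policy and observation.
    next_actions = [mu(o) for mu, o in zip(target_policies, next_observations)]
    return reward_i + gamma * target_critic_i(next_state, next_actions)


def critic_loss(q_value, target):
    """Squared error between the centralized critic's output and the target."""
    return (q_value - target) ** 2


# Toy stand-ins for two target policies and agent i's target critic.
target_policies = [lambda o: 2.0 * o, lambda o: -o]
target_critic_i = lambda x, actions: x + sum(actions)
y = critic_target(1.0, 0.9, 0.5, [1.0, 3.0], target_policies, target_critic_i)
```

The critic sees the actions of all agents, while each policy only needs its own observation, which is what allows decentralized execution after centralized training.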
Train DDPG with separate critics and policy conditioning (learning other agents' policies from observations) in mixed setting
Cooperative communication, Predator-Prey, Cooperative Navigation, Physical Deception
Comparison
QMIX + CEM for continuous action spaces with factorized Q-functions for mixed environments
Multi-Agent MuJoCo
MADDPG with factorized Q-functions for mixed environments
Combine individual Q-functions:
$$g_{\phi}(s, Q_1(\tau^1, u^1, ..., u^N; \theta^1), ..., Q_N(\tau^N, u^1, ..., u^N; \theta^N)),$$ where \(g_{\phi}\) is the mixing network.
Results
(a) Continuous Predator-Prey, (b) 2-agent HalfCheetah, (c) 2-Agent Hopper, (d) 3-Agent Hopper
Other examples
Learning Communication
Reinforced Inter-Agent Learning (RIAL)
Differentiable Inter-Agent Learning (DIAL)
Simultaneously learn policy and communication in cooperative setting (Switch Riddle and MNIST Game)
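A sketch of the discretise/regularise step that DIAL uses on its communication channel, assuming a scalar message and a noise level of 2.0 chosen here for illustration (the function name and default are assumptions): during training the message stays real-valued so gradients can flow between agents, and at execution it is thresholded to a single bit.

```python
import math
import random


def discretise_regularise(message, sigma=2.0, training=True, rng=random):
    """Noisy sigmoid of the message in training, hard 0/1 bit at execution."""
    if training:
        noisy = message + rng.gauss(0.0, sigma)
        return 1.0 / (1.0 + math.exp(-noisy))
    return 1.0 if message > 0 else 0.0


# Seeded generator so the training-time sample is reproducible.
soft_message = discretise_regularise(0.3, training=True, rng=random.Random(0))
hard_message = discretise_regularise(0.3, training=False)
```

RIAL, by contrast, treats the message as just another discrete action learned by Q-learning, so no gradient passes through the channel.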
Simultaneously learn policy and communication in cooperative setting
Traffic junction and Combat tasks
Account for the learning of other agents in the iterated prisoner's dilemma and rock-paper-scissors
Agents modeling agents
MADDPG + MiniMax + Multi-Agent Adversarial Learning
Build a model of other agents from observations
ToMNet Architecture
\( \hat \pi\) - next-step action probabilities
\( \hat c \) - whether certain objects will be consumed
\( \hat{SR} \) - predicted successor representations