Sergey Sviridov
N shades of MDRL
Major Challenges
Just train DQN for each agent independently for cooperative or competitive behavior to emerge
IQN with importance sampling and fingerprint conditioning
Use the epoch number and exploration rate as a fingerprint to condition each agent's Q-function.
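A minimal sketch of the fingerprint idea (helper name and normalisation are assumptions): the training-iteration number and exploration rate are appended to each agent's observation and stored with every transition, so replayed samples record how old and how exploratory the other agents' policies were.

```python
import numpy as np

# Hypothetical helper: append (epoch, epsilon) as a fingerprint to an agent's observation.
def add_fingerprint(obs: np.ndarray, epoch: int, epsilon: float, max_epochs: int) -> np.ndarray:
    fingerprint = np.array([epoch / max_epochs, epsilon], dtype=np.float32)
    return np.concatenate([obs.astype(np.float32), fingerprint])

# Training-time usage (sketch): the Q-network sees the augmented observation, and the same
# fingerprint is stored with the transition in the replay buffer.
# q_values = q_net(add_fingerprint(obs_i, epoch, epsilon, max_epochs))
```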
Just train PPO for competitive behavior to emerge
Tasks: Reach goal, You Shall Not Pass, Sumo, Kick and Defend
Train actor-critic with centralized critic and counterfactual baseline in cooperative setting
Information flow, actor and critic
Environment
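The counterfactual baseline marginalises out a single agent's action while keeping the other agents' actions fixed. A minimal sketch, assuming the centralized critic already outputs a Q-value for every candidate action of agent a (tensor names are illustrative, not the paper's code):

```python
import torch

# q_a:  Q(s, (u^{-a}, u'^a)) for every candidate action u'^a of agent a, shape [batch, n_actions]
# pi_a: agent a's policy probabilities over its actions, shape [batch, n_actions]
# u_a:  the action agent a actually took (long tensor of indices), shape [batch]
def counterfactual_advantage(q_a: torch.Tensor, pi_a: torch.Tensor, u_a: torch.Tensor) -> torch.Tensor:
    baseline = (pi_a * q_a).sum(dim=-1)                       # E_{u'^a ~ pi_a} Q(s, (u^{-a}, u'^a))
    q_taken = q_a.gather(-1, u_a.unsqueeze(-1)).squeeze(-1)   # Q(s, u) for the taken joint action
    return q_taken - baseline                                  # counterfactual advantage A^a(s, u)
```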
Train DQN with summed combined Q-function in cooperative setting
Fetch, Switch and Checkers environments
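A minimal sketch of the value decomposition, assuming the per-agent Q-values of the chosen actions are already gathered into one tensor (names are illustrative):

```python
import torch

# per_agent_q_taken: shape [batch, n_agents], Q_i(tau^i, u^i) for the chosen actions
def vdn_q_tot(per_agent_q_taken: torch.Tensor) -> torch.Tensor:
    return per_agent_q_taken.sum(dim=-1)  # Q_tot = sum_i Q_i

# The TD loss is then computed on Q_tot exactly as in single-agent DQN, and the gradient
# flows back into every agent's individual network.
```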
Train DQN with monotonic combined Q-function in cooperative setting
Agent networks, mixing network and a set of hypernetworks
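A minimal sketch of a QMIX-style mixing network (layer sizes and names are assumptions): state-conditioned hypernetworks produce the mixer's weights, and taking their absolute value keeps dQ_tot/dQ_i non-negative, i.e. Q_tot is monotonic in every agent's Q-value.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)  # weights of mixer layer 1
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)             # bias of mixer layer 1
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)             # weights of mixer layer 2
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))      # state-dependent output bias

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # [batch, 1, embed]
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(-1)                     # Q_tot per batch element
```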
Train DDPG with separate centralized critics, conditioning on the other agents' policies (which can also be learned from observations), in mixed settings
\(L(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}^\prime} [(Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \ldots, a_N) - y)^2]\),
where \(y = r_i + \gamma Q_i^{\boldsymbol{\mu}^\prime}(\mathbf{x}^\prime, a_1^\prime, \ldots, a_N^\prime)\big|_{a_j^\prime = \boldsymbol{\mu}_j^\prime(o_j)}\)
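A minimal sketch of this centralized-critic update (module and variable names are assumptions, not the original implementation): the target actions come from the target policies evaluated on each agent's own next observation.

```python
import torch
import torch.nn.functional as F

# critic_i / target_critic_i: callables mapping (joint obs x, joint action a) -> Q_i, shape [batch, 1]
# target_policies: list of per-agent target policies mu_j'(o_j)
def maddpg_critic_loss(critic_i, target_critic_i, target_policies,
                       x, actions, r_i, x_next, obs_next, gamma=0.95):
    # x, x_next: joint observations; actions: concatenated joint action; r_i: agent i's reward [batch, 1]
    # obs_next: list of per-agent next observations o_j
    with torch.no_grad():
        next_actions = torch.cat([mu_j(o_j) for mu_j, o_j in zip(target_policies, obs_next)], dim=-1)
        y = r_i + gamma * target_critic_i(x_next, next_actions)   # target y from the equation above
    q = critic_i(x, actions)                                      # Q_i^mu(x, a_1, ..., a_N)
    return F.mse_loss(q, y)
```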
Cooperative Communication, Predator-Prey, Cooperative Navigation, Physical Deception
Comparison
QMIX + CEM for continuous action spaces with factorized Q-functions for mixed environments
Multi-Agent MuJoCo
MADDPG with factorized Q-functions for mixed environments
Combine individual Q-functions:
$$g_{\phi}(s, Q_1(\tau^1, u^1, ..., u^N; \theta^1), ..., Q_N(\tau^N, u^1, ..., u^N; \theta^N)),$$ where \(g_{\phi}\) is the mixing network.
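With continuous actions the greedy step \(\arg\max_u Q(\tau, u)\) has no closed form, which is where CEM comes in. A minimal sketch of cross-entropy-method action selection for one agent's Q-function (hyperparameters and names are assumptions):

```python
import torch

# q_fn: callable mapping a batch of candidate actions [n_samples, act_dim] -> Q-values [n_samples]
def cem_argmax(q_fn, act_dim, n_iters=3, n_samples=64, n_elites=6, device="cpu"):
    mean = torch.zeros(act_dim, device=device)
    std = torch.ones(act_dim, device=device)
    for _ in range(n_iters):
        # sample candidate actions (assuming actions live in [-1, 1])
        samples = (mean + std * torch.randn(n_samples, act_dim, device=device)).clamp(-1.0, 1.0)
        q_vals = q_fn(samples)                            # evaluate Q for each candidate
        elites = samples[q_vals.topk(n_elites).indices]   # keep the best candidates
        mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6  # refit the sampling distribution
    return mean                                           # approximate argmax_u Q(u)
```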
Results
(a) Continuous Predator-Prey, (b) 2-Agent HalfCheetah, (c) 2-Agent Hopper, (d) 3-Agent Hopper
Other examples
Reinforced Inter-Agent Learning (RIAL)
Differentiable Inter-Agent Learning (DIAL)
Simultaneously learn policy and communication in cooperative setting (Switch Riddle and MNIST Game)
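The key trick in DIAL is keeping the message differentiable during centralised training, so the receiving agent's gradient flows back into the sender. A minimal sketch of the discretise/regularise unit (DRU); the noise scale is an assumption:

```python
import torch

def dru(message: torch.Tensor, sigma: float = 2.0, training: bool = True) -> torch.Tensor:
    if training:
        # real-valued, noisy, differentiable message during centralised training
        return torch.sigmoid(message + sigma * torch.randn_like(message))
    # hard 1-bit message at execution time
    return (message > 0).float()
```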
Simultaneously learn policy and communication in cooperative setting
Traffic junction and Combat tasks
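A minimal sketch of one CommNet-style communication step (layer sizes and names are assumptions): each agent's hidden state is updated from its own state and the mean of the other agents' hidden states.

```python
import torch
import torch.nn as nn

class CommStep(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.f = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [n_agents, hidden_dim]
        n = h.shape[0]
        comm = (h.sum(dim=0, keepdim=True) - h) / max(n - 1, 1)   # mean of the *other* agents' states
        return torch.tanh(self.f(torch.cat([h, comm], dim=-1)))    # updated hidden state per agent
```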
Account for the learning of the other agent in the iterated prisoners' dilemma and rock-paper-scissors
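A minimal LOLA-style sketch (function names are assumptions; V1 and V2 are assumed to be differentiable value functions of both agents' policy parameters, e.g. exact returns in a matrix game): instead of following the naive gradient, agent 1 differentiates its value through a one-step look-ahead of the opponent's own gradient update.

```python
import torch

# theta1, theta2: leaf tensors with requires_grad=True
def lola_step(theta1, theta2, V1, V2, lr=0.1, opp_lr=0.1):
    # Anticipate the opponent's naive learning step (graph kept so it depends on theta1) ...
    d_theta2 = torch.autograd.grad(V2(theta1, theta2), theta2, create_graph=True)[0]
    theta2_lookahead = theta2 + opp_lr * d_theta2
    # ... and ascend agent 1's value evaluated *after* that anticipated step.
    grad1 = torch.autograd.grad(V1(theta1, theta2_lookahead), theta1)[0]
    return (theta1 + lr * grad1).detach().requires_grad_(True)
```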
MADDPG + MiniMax + Multi-Agent Adversarial Learning
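A minimal sketch of the minimax ingredient (all names are assumptions, and the action ordering in the concatenation is illustrative): before evaluating agent i's target critic, the other agents' actions are pushed one gradient step in the direction that lowers Q_i, approximating the inner minimisation over adversarial behaviour.

```python
import torch

# q_i: callable mapping (joint next obs, joint action) -> Q_i values
def worst_case_other_actions(q_i, x_next, a_i, a_others, alpha=0.01):
    a_others = a_others.clone().requires_grad_(True)
    q = q_i(x_next, torch.cat([a_i, a_others], dim=-1)).sum()
    grad = torch.autograd.grad(q, a_others)[0]
    return (a_others - alpha * grad).detach()   # one adversarial step against agent i
```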
Build models of other agents from observations
ToMNet Architecture
\( \hat \pi \) - next-step action probabilities
\( \hat c \) - whether certain objects will be consumed
\( \hat{SR} \) - predicted successor representations
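A minimal sketch of the three prediction heads (sizes and names are assumptions), mapping a combined character/mental-state/current-state embedding to the three outputs listed above:

```python
import torch
import torch.nn as nn

class ToMPredictionHeads(nn.Module):
    def __init__(self, embed_dim: int, n_actions: int, n_objects: int, grid_cells: int):
        super().__init__()
        self.policy_head = nn.Linear(embed_dim, n_actions)    # -> pi_hat
        self.consume_head = nn.Linear(embed_dim, n_objects)   # -> c_hat
        self.sr_head = nn.Linear(embed_dim, grid_cells)       # -> SR_hat

    def forward(self, z: torch.Tensor):
        pi_hat = torch.softmax(self.policy_head(z), dim=-1)   # next-step action probabilities
        c_hat = torch.sigmoid(self.consume_head(z))           # P(object will be consumed) per object
        sr_hat = torch.softmax(self.sr_head(z), dim=-1)       # predicted successor representation
        return pi_hat, c_hat, sr_hat
```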