Deep Generals

Bingliang Zhang, Zui Chen

Game Introduction

  • Large Action Space
  • Imperfect Information
  • Long-Horizon
  • Non-Trivial Valuation

Pipeline

  • Virtual Environment: based on OpenAI Gym
  • RL Frameworks: modified from tianshou
  • Model Architectures: specifically designed for the generals.io game
  • State/Feature Design
  • Reward Design
  • RL Algorithm Choice
  • Fully Convolutional
  • Attention-Based
  • Simulator & Interface
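
To make the Gym-based interface concrete, here is a minimal sketch of what the environment wrapper could look like. The class name, channel count, placeholder internals, and the 8-way move encoding (assumed: 4 directions × {full, half split}) are illustrative assumptions, not the project's actual code; it follows the classic Gym `reset`/`step` API.

```python
import numpy as np
import gym
from gym import spaces

class GeneralsEnv(gym.Env):
    """Hypothetical sketch of a generals.io environment with the classic Gym API."""

    def __init__(self, height=20, width=20, channels=16):
        super().__init__()
        self.h, self.w, self.c = height, width, channels
        # Pixel-wise observation: a (C, H, W) stack of feature planes.
        self.observation_space = spaces.Box(0.0, 1.0, (channels, height, width), np.float32)
        # One move per cell: assumed 8 move types = 4 directions x {full army, half split}.
        self.action_space = spaces.MultiDiscrete([8, height, width])
        self.turn = 0

    def reset(self):
        self.turn = 0
        return np.zeros((self.c, self.h, self.w), dtype=np.float32)  # placeholder initial map

    def step(self, action):
        move_type, row, col = action                                 # decode (type, row, col)
        self.turn += 1
        obs = np.zeros((self.c, self.h, self.w), dtype=np.float32)   # placeholder next state
        reward, done = 0.0, self.turn >= 500                         # placeholder reward / end
        return obs, reward, done, {"turn": self.turn}
```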

Virtual Environment

Pixel-wise feature representation:

  • Grid Landscape (one-hot): Unknown, Mountain, Empty land, City, Capital
  • Controlling Status (one-hot): Unknown, Neutral, Player 0, Player 1, Player 2, ...
  • Observation Status (one-hot): Fog, Observing, Observed
  • Armies (NORMALIZED float)
  • Common Knowledge (broadcast)
  • Historical States (multi-frame stacking)
  • Historical Actions?
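
A minimal sketch of how such planes could be assembled into one (C, H, W) frame. The normalization constants, channel counts, and the choice of the turn counter as the broadcast "common knowledge" scalar are assumptions for illustration only.

```python
import numpy as np

def encode_observation(landscape, owner, visibility, armies, turn,
                       n_landscape=5, n_players=2, max_army=500.0):
    """Encode one frame as a (C, H, W) float32 tensor.

    landscape, owner, visibility: (H, W) integer maps (values within their class counts);
    armies: (H, W) army counts; turn: scalar common knowledge.
    """
    h, w = landscape.shape
    planes = []
    # One-hot planes for the categorical maps.
    for value_map, n in ((landscape, n_landscape),
                         (owner, n_players + 2),      # unknown, neutral, player 0..N-1
                         (visibility, 3)):            # fog, observing, observed
        onehot = np.zeros((n, h, w), dtype=np.float32)
        onehot[value_map, np.arange(h)[:, None], np.arange(w)] = 1.0
        planes.append(onehot)
    # Normalized float plane for army counts.
    planes.append((armies / max_army).clip(0, 1)[None].astype(np.float32))
    # Common knowledge broadcast to a constant plane.
    planes.append(np.full((1, h, w), turn / 1000.0, dtype=np.float32))
    return np.concatenate(planes, axis=0)

def stack_frames(frames):
    # Historical states: stack the last k encoded frames along the channel axis.
    return np.concatenate(frames, axis=0)
```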

RL Frameworks

RL Algorithm Choice:

  • Value-Based: current solution (Prioritized DQN; see the sketch after this list).
  • Policy-Based: PPO, etc.
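
Since Prioritized DQN is the current choice, here is a minimal, library-agnostic sketch of proportional prioritized replay, the mechanism a prioritized DQN adds on top of vanilla DQN. The class, hyperparameters, and the simple O(n) sampling are simplifications for illustration, not the project's (tianshou-based) code.

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritized replay sketch (no sum-tree)."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current max priority so they are sampled at least once.
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition); self.priorities.append(p)

    def sample(self, batch_size):
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct for the non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update(self, idx, td_errors):
        # Priority proportional to |TD error| of the sampled transitions.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(float(e)) + self.eps
```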

Reward Design:

  • Large action space, sparse reward & imperfect information: methods such as Monte Carlo tree search cannot be applied directly, so a dense, manually designed reward is needed.
  • Long, variable horizon: prevent the agent from gaming the reward simply by playing longer; define the reward as the difference of a state score between consecutive steps, so the return telescopes to the net change in score (see the sketch after this list).
  • Non-trivial valuation; the state score combines:
    • Land/City control
    • RELATIVE army strength
    • Observation?
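
A minimal sketch of the score-differencing idea. The summary fields and weights below are hypothetical placeholders, not the values used in the project.

```python
from dataclasses import dataclass

@dataclass
class StateSummary:
    """Hypothetical per-player summary extracted from the environment state."""
    my_land: int
    my_cities: int
    my_army: int
    enemy_army: int
    tiles_observed: int

def state_score(s, w_land=0.02, w_city=0.3, w_army=0.5, w_obs=0.01):
    # Relative army strength: compare to the opponent instead of using raw counts.
    rel_army = s.my_army / max(s.my_army + s.enemy_army, 1)
    return (w_land * s.my_land + w_city * s.my_cities
            + w_army * rel_army + w_obs * s.tiles_observed)

def shaped_reward(prev, curr):
    # Differencing the score keeps the total return independent of episode length.
    return state_score(curr) - state_score(prev)
```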

Model Architecture

Spatially Structured Large Action Space of Varying Size

  • Observation (C,H,W) -> Action (8,H,W)
  • Spatial dimension & size should be preserved.

Long Horizon, Action Continuity

  • Focus on important tasks and important areas

Fully Convolutional Layers

Attention Window Layers
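
A minimal PyTorch sketch of the fully convolutional mapping from a (C, H, W) observation to (8, H, W) action values: every layer preserves the spatial size, so the same weights handle maps of different sizes. Channel counts and depth are placeholders, and the attention-window layers are not shown.

```python
import torch
import torch.nn as nn

class FCNQNet(nn.Module):
    """Fully convolutional Q-network sketch: (C, H, W) observation -> (8, H, W) action values."""

    def __init__(self, in_channels, hidden=64, n_move_types=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # One Q-value per (move type, cell); no flattening, so H and W stay free.
        self.head = nn.Conv2d(hidden, n_move_types, 1)

    def forward(self, obs):                 # obs: (B, C, H, W)
        return self.head(self.body(obs))    # -> (B, 8, H, W)

# Usage: a batch of 20x20 maps with, say, 16 feature planes.
net = FCNQNet(in_channels=16)
q = net(torch.zeros(4, 16, 20, 20))         # q.shape == (4, 8, 20, 20)
```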

Current Result

We train our model from scratch:

  • On Epoch #1, it has learned to play feasible, and often effective, moves most of the time; it is capable of expanding.
  • On Epoch #2, it has learned to attack and occupy cities.
  • On Epoch #3, it has learned to defend and gather armies.
  • On Epoch #4, it has learned basic continuity of moves.
  • On Epoch #5, it ... crashed.

Future Plans

  • Improve the reward design for better stability and perhaps more prior knowledge.
  • Try out different RL frameworks and different model architectures.
  • Use minimax search (god view) to generate experiences.
  • Evaluate against random, greedy, search-based, or even human players.
  • Scale up to larger maps (e.g. 32×32).

Deep Generals

Bingliang Zhang, Zui Chen

Thank You!
