Ubisoft in Google Cloud: game cluster autoscaling

Vladislav Shpilevoy

Plan

Cluster scaling

Scaling in Ubisoft

Google Cloud

New scaling algorithm

Testing

Future plans


Cluster scaling [1]

Add or drop instances depending on load

Two types

Bare metal

Cloud

Cluster scaling [2]

Metal

Buy or rent, deliver, install

  • Hours, days, weeks long
  • Excess reservation
  • Can be cheaper

Cloud

Send an HTTP request: POST create_instances HTTP/1.1

  • Seconds, minutes
  • Accurate
  • Can be more expensive

Cluster scaling [3]

[Chart: cluster load over time vs. provisioned capacity. Good scaling tracks the load closely; bad scaling leaves a wide gap of wasted money]

Scaling in Ubisoft

Bare metal cluster

  • Expensive powerful servers
  • Plan for load spikes

Now moving to the cloud for better scaling, and more

Load

Very volatile.

[Chart: daily player count per region (Europe West, USA East, Asia), swinging between roughly 200 000 and 600 000 players]

Instance groups

Google Autoscaler

For Managed Instance Groups. Automatic cluster resize based on metrics.

But an ungraceful shutdown loses the state:

  • Players are disconnected
  • Saves are lost

[Chart: servers and load during an ungraceful scale down]

New Autoscaler

From Ubisoft

[Diagram: the Autoscaler manages an Instance Group in a Google Cloud Region. Servers send it a greeting, their state, etc.; the Autoscaler answers with commands. Players connect to the servers. Scaling goes through the Google Cloud REST API: /create_instances, /delete_instances]
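A sketch of what those server-to-Autoscaler messages might look like; the talk does not show the wire format, so all the types here are hypothetical:

#include <string>
#include <variant>

// Hypothetical server <-> Autoscaler messages; names are illustrative.
struct Greeting { std::string server_name; std::string template_name; };
struct State    { int player_count; bool is_decommissioning; };
struct Command  { enum class Type { DECOMMISSION, RESTORE } type; };

// A connection delivers a stream of such messages.
using Message = std::variant<Greeting, State, Command>;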

Scale up

Based on a threshold.

[Chart: capacity, scale-up threshold, and players. Example: when the average player count crosses the threshold, servers are added]

Scale down

Based on a threshold.

[Chart: capacity, scale-down threshold, and players. When the average player count drops below the threshold, servers are decommissioned]
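A minimal sketch of the threshold check behind both slides above, assuming the metric is the average player count per server (the names are illustrative, not the actual autoscaler API):

enum class ScaleAction { None, Up, Down };

struct Thresholds {
    double scale_up_player_count;   // e.g. 90 out of 100 slots
    double scale_down_player_count; // e.g. 70 out of 100 slots
};

// Compare the average load per server against the two thresholds.
ScaleAction decide(double avg_players_per_server, const Thresholds& t)
{
    if (avg_players_per_server >= t.scale_up_player_count)
        return ScaleAction::Up;
    if (avg_players_per_server <= t.scale_down_player_count)
        return ScaleAction::Down;
    return ScaleAction::None;
}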

Safety Timeout

Two-phase shutdown:

ACTIVE
  • The Autoscaler sends a decommission request

DECOMMISSIONING
  • The server enters the DECOMMISSIONING state
  • No new players are accepted
  • Existing players eventually leave
  • Save games are uploaded

DECOMMISSIONED
  • The server enters the DECOMMISSIONED state
  • The Autoscaler kills the machine
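A sketch of the two-phase shutdown as a state machine; the state names follow the slides, the exact transition conditions are assumptions:

enum class ServerState { ACTIVE, DECOMMISSIONING, DECOMMISSIONED };

struct Server {
    ServerState state = ServerState::ACTIVE;
    int player_count = 0;
    bool saves_uploaded = false;

    // Phase one: the Autoscaler asks the server to drain.
    void on_decommission_request() {
        if (state == ServerState::ACTIVE)
            state = ServerState::DECOMMISSIONING;
    }
    // Phase two: once drained, the machine becomes safe to kill.
    void update() {
        if (state == ServerState::DECOMMISSIONING &&
            player_count == 0 && saves_uploaded)
            state = ServerState::DECOMMISSIONED;
    }
    bool can_be_killed() const {
        return state == ServerState::DECOMMISSIONED;
    }
};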

Extensions

Restoration

Ext: Restoration [1]

  • The load is below the scale-down threshold
  • Decommission a couple of servers
  • But now the load grows
  • Add new servers
  • The decommissioning servers waste money: they have free space, but can't be used

Need to reuse those machines.

Ext: Restoration [2]

Restore the old servers first, then add new ones. All the available space is used when scaling up.
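A sketch of restoration-first scale up, under the assumption that a decommissioning server can simply be flipped back to ACTIVE:

#include <vector>

struct GameServer {
    bool is_decommissioning = false;
    void restore() { is_decommissioning = false; } // back to ACTIVE
};

// Reuse draining machines first; return how many new instances
// still have to be created via the Cloud API.
int scale_up(std::vector<GameServer>& servers, int needed)
{
    for (GameServer& s : servers) {
        if (needed == 0)
            break;
        if (s.is_decommissioning) {
            s.restore();
            --needed;
        }
    }
    return needed;
}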

Extensions

Quorum

Ext: Quorum [1]

  • Need to scale up: add servers
  • But they are broken
  • Add more: still broken
  • The broken machines do no work but waste money; this can continue infinitely

Protect from infinite scale up.

Ext: Quorum [2]

Track the cluster size and the connected server count.

Quorum: the minimal % of connected nodes required to approve scaling.

\frac{connected}{cluster\_size} >= quorum: scaling is allowed
\frac{connected}{cluster\_size} < quorum: scaling is blocked
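The check itself is a one-liner; a sketch, assuming quorum is a fraction in [0, 1]:

// Scaling is approved only if enough of the cluster is connected.
bool is_scaling_approved(int connected, int cluster_size, double quorum)
{
    if (cluster_size == 0)
        return true; // nothing to compare against yet
    return (double)connected / cluster_size >= quorum;
}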

Extensions

Adaptive thresholds

Ext: Adaptive thresholds [1]

Server capacity: 100. Scale-up threshold: 90. That is a ~10% reservation.

A single constant threshold does not scale. Example:

10\% * 10\,000 = 1000 \\ 10\% * 100 = 10

The reservation cost is x100 bigger.

Assume 2 servers are filled per second and server boot time is 3 minutes. Then:

  • 10 servers give 5 seconds: not enough
  • 1000 servers give over 8 minutes: too much

[Diagram: 100 servers keep a 10-server reserve; 10 000 servers keep a 1000-server reserve]

Ext: Adaptive thresholds [2]

The reservation shrinks as the cluster grows: 10%, then 5%, then 2%.

"thresholds": [
  {
    "server_count": 100,
    "scale_up_player_count": 90,
    "scale_down_player_count": 70
  },
  {
    "server_count": 1000,
    "scale_up_player_count": 95,
    "scale_down_player_count": 80
  },
  {
    "server_count": 5000,
    "scale_up_player_count": 98,
    "scale_down_player_count": 90
  }
]

The reservation changes depending on the server count.

[Chart: capacity, threshold, and players with adaptive thresholds]
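A sketch of how the config above could be applied: take the last tier whose server_count does not exceed the current cluster size. The exact selection rule is an assumption; the tier struct mirrors the JSON:

#include <vector>

struct ThresholdTier {
    int server_count;
    double scale_up_player_count;
    double scale_down_player_count;
};

// Assumes the tiers are sorted by server_count, as in the config above.
ThresholdTier pick_tier(const std::vector<ThresholdTier>& tiers, int servers)
{
    ThresholdTier result = tiers.front();
    for (const ThresholdTier& t : tiers) {
        if (t.server_count <= servers)
            result = t;
        else
            break;
    }
    return result;
}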

Extensions

Prediction

Ext: Prediction [1]

Boot time is never zero: in reality, booting takes time.

[Chart: how scaling looks in theory vs. with the real boot delay]

Example:

  • 200 servers
  • 1 server filled per second
  • 10% reservation
  • boot time is 3 mins

The reserve ends in 20 secs. People wait for 2 min 40 secs.

The cluster can, and will, be overloaded.

Ext: Prediction [2]

Existing solutions:

  • Complex, up to neural networks
  • Days/weeks/months of training

[Charts: electricity load (in winter) and the load of one of the game clusters, both over the time of day; the daily curves look very similar]

Prediction is used in

  • Electricity
  • Water
  • Web services
  • Games

Solutions are very similar

Ext: Prediction [3]

Linear regression: y = a + b*x
Quadratic regression: y = a + b*x + c*x^2

[Charts: a train window followed by a prediction window, for both models]

"prediction": {
  "algorithm": "quadratic_regression",
  "train_interval_sec": 600,
  "sample_interval_sec": 10
}

Linear regression: ~120 lines of C++. Quadratic regression: ~240 lines.

Ext: Prediction [4]

Adapts in O(1) time.

(X, Y) = \{(x_0, y_0), ..., (x_n, y_n)\}

For y = a + b*x, find for each new point:

mean(X) = \dfrac{\sum_{i=1}^{N}x_i}{N}; \quad mean(Y) = \dfrac{\sum_{i=1}^{N}y_i}{N};

S_{xx} = \sum_{i=1}^{N}x_i^2 - N*mean(X)^2; \quad S_{xy} = \sum_{i=1}^{N}x_i*y_i - N*mean(X)*mean(Y);

b = \dfrac{S_{xy}}{S_{xx}}; \quad a = mean(Y) - b*mean(X);

Store and update the sums on each step; calculate a and b from them on each step.

All the same works for the quadratic regression.
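A minimal sketch of such an O(1) sliding-window linear regression. The real implementation is ~120 lines and, per the appendix, uses precise decimal arithmetic; plain doubles here for brevity:

#include <cstddef>
#include <deque>
#include <utility>

class LinearRegression {
public:
    explicit LinearRegression(size_t window) : window_(window) {}

    void add(double x, double y) {
        points_.emplace_back(x, y);
        sum_x_ += x; sum_y_ += y;
        sum_xx_ += x * x; sum_xy_ += x * y;
        if (points_.size() > window_) {
            // Drop the oldest point: O(1), no recalculation of the sums.
            auto [ox, oy] = points_.front();
            points_.pop_front();
            sum_x_ -= ox; sum_y_ -= oy;
            sum_xx_ -= ox * ox; sum_xy_ -= ox * oy;
        }
        double n = (double)points_.size();
        double mean_x = sum_x_ / n, mean_y = sum_y_ / n;
        double s_xx = sum_xx_ - n * mean_x * mean_x;
        double s_xy = sum_xy_ - n * mean_x * mean_y;
        b_ = s_xx != 0 ? s_xy / s_xx : 0;
        a_ = mean_y - b_ * mean_x;
    }
    double predict(double x) const { return a_ + b_ * x; }

private:
    size_t window_;
    std::deque<std::pair<double, double>> points_;
    double sum_x_ = 0, sum_y_ = 0, sum_xx_ = 0, sum_xy_ = 0;
    double a_ = 0, b_ = 0;
};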

Ext: Prediction [5]

[Chart: a spike from 100 000 to 160 000 players within 20 minutes. Without prediction the capacity lags and a queue forms; linear and quadratic regression raise the capacity in time]

Extensions

Rolling update

Ext: Rolling update [1]

Full cluster upgrade.

  • There is a cluster with a group template
  • The template is changed: it has a new version of the game
  • Google automatically assigns the new template to new instances
  • Google's autoscaler recreates old instances in batches: this is a rolling update

Ext: Rolling update [2]

New instances get the new template automatically. Only need to find and recreate the old ones.

Google Cloud REST API:

/get_info
{
  "template": <templ>
}

Each server reports:

{
  "name": <name>,
  "template": <templ>
}

  • The Autoscaler downloads the latest template from Google
  • The servers send their template to the Autoscaler
  • The Autoscaler finds outdated instances by their template
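A sketch of the outdated-instance detection; the struct fields mirror the payloads above, the function name is illustrative:

#include <string>
#include <vector>

struct ServerInfo {
    std::string name;
    std::string template_name; // reported by the server
};

// Compare each server's template against the latest one
// downloaded from the Google Cloud API.
std::vector<ServerInfo> find_outdated(const std::vector<ServerInfo>& servers,
                                      const std::string& latest_template)
{
    std::vector<ServerInfo> outdated;
    for (const ServerInfo& s : servers) {
        if (s.template_name != latest_template)
            outdated.push_back(s);
    }
    return outdated;
}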

Ext: Rolling update [3]

Part 1: during scale down, prefer to decommission outdated servers.

  • Need to scale down
  • Decommission the outdated servers first
  • Need to scale down again
  • Now servers with the new template can be decommissioned

Ext: Rolling update [4]

Part 2: during normal work, upgrade in batches.

No scaling is required, but there are outdated servers.

Introduce an upgrade quota: the max % of outdated machines to upgrade at a time.

upgrade_quota = 20%

  • Outdated servers are decommissioned up to the quota, but that brings imbalance
  • Add new machines; hence the quota is also the max over-reservation
  • Eventually the old ones are deleted
  • The process continues: the update is rolling

Ext: Rolling update [5]

Part 3: during scale up, try not to restore outdated servers.

upgrade_quota = 20%

  • A scale down is in progress
  • But now the load grows: need to scale up!
  • Restore some outdated servers, but keep upgrading
  • Create new servers to finish the scale up

Extensions

Canary deployment

Ext: Canary deployment [1]

Partial cluster upgrade.

  • There is a cluster with a group template
  • There is a new version of the game. Want to try it on a few nodes: this is a canary deployment
  • In Google's group settings, two templates can be specified, one with a canary size (e.g. 50%)
  • Google automatically assigns templates to new instances
  • Google's built-in autoscaler can upgrade existing instances to meet the target size

Ext: Canary deployment [2]

The same as the rolling update:

  • Download the canary settings from Google's API
  • Let Google assign templates automatically
  • Prefer to decommission excess canary instances, using the upgrade quota

"versions": [
  {
    "instanceTemplate": <templ_old>
  },
  {
    "instanceTemplate": <templ_new>,
    "targetSize": {
      "percent": <size>
    }
  }
]
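A sketch of the canary target count, assuming percent-based targets round up, as the example on the next slide suggests:

#include <cmath>

// target_percent comes from the "targetSize": {"percent": ...} setting.
int canary_target_count(int cluster_size, double target_percent)
{
    return (int)std::ceil(cluster_size * target_percent / 100.0);
}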

Ext: Canary deployment [3]

A canary template is installed into the group with a 50% target; upgrade_quota = 20%.

  • Start from a balanced cluster
  • Start the upgrade using the quota; the target size is 5 instances
  • Continue the upgrade
  • Eventually there are 50% canaries, rounded up to 5 instances
  • Now 3 more servers are needed: Google assigns templates automatically, and 6 of 12 are canaries (50%)
  • Now 6 servers must be decommissioned: the autoscaler decommissions 3 canaries and 3 defaults
  • The balance is respected: 3 of 6 are canaries

Extensions

Zone balance

Ext: Zone balance [1]

Instances can live in multiple zones. They must be balanced to protect from outages.

  • The cluster has multiple zones
  • When new instances are added, Google spreads them evenly automatically
  • Google's built-in autoscaler keeps the zones balanced when it scales down
  • If one zone goes down, it affects only cluster_size / zone_count instances

Ext: Zone balance [2]

The same as the rolling update:

  • Download the zone list
  • Let Google assign zones automatically
  • Prefer to decommission servers from bloated zones

"zones": [
  {
    "zone": <zone1>
  },
  {
    "zone": <zone2>
  },
  {
    "zone": <zone3>
  }
]
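A sketch of the zone preference, assuming a simple per-zone instance count: decommission from the fullest zone; restoration would symmetrically pick the emptiest one:

#include <map>
#include <string>

std::string pick_zone_for_decommission(
    const std::map<std::string, int>& instances_per_zone)
{
    std::string best;
    int max_count = -1;
    for (const auto& [zone, count] : instances_per_zone) {
        if (count > max_count) {
            max_count = count;
            best = zone;
        }
    }
    return best;
}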

Ext: Zone balance [3]

  • Need to scale up: create new instances; Google assigns them to zones automatically
  • Need to scale down: decommission servers evenly across the zones
  • Now need to scale up again: restore servers evenly across the zones

Decisions

Deal with all extensions

Server selection [1]

Decommission of 1 server. Calculate the server scores:

  • Is outdated: +100000
  • Is canary, and canary size > target: +10000
  • Isn't canary, and canary size <= target: +1000
  • Zone balance: +100 + zone size %
  • Is empty: +1

Decommission the server with the biggest score. The digit positions reveal the score's components.
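A sketch of that score calculation; the weights are exactly the digit positions listed above, the struct is illustrative:

struct ServerMeta {
    bool is_outdated;
    bool is_canary;
    bool is_empty;
    double zone_size_percent; // this zone's share of the cluster
};

int decommission_score(const ServerMeta& s,
                       double canary_size_percent,
                       double canary_target_percent)
{
    int score = 0;
    if (s.is_outdated)
        score += 100000;
    if (s.is_canary && canary_size_percent > canary_target_percent)
        score += 10000;
    if (!s.is_canary && canary_size_percent <= canary_target_percent)
        score += 1000;
    score += 100 + (int)s.zone_size_percent;
    if (s.is_empty)
        score += 1;
    return score;
}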

Server selection [2]

[Worked example: a 12-server cluster with outdated, new, and canary servers across 3 zones; the canary target is 25%. One server is decommissioned at a time]

  • Calculate the scores. The outdated servers have the biggest one: 100000 + 1000 + 125
  • Kill one of them. The scores are updated due to the new zone size percentages
  • Kill the next outdated server. The scores are updated again
  • Kill 2 more servers similarly. Now the scores are more interesting: there are too many canaries, since 25% of the cluster is 2 now
  • The next kill prefers a canary from the biggest zone
  • The scores are updated again, and so on ...

Server selection [3]

Restoration of 1 server. Calculate the server scores:

  • Is not outdated: +100000
  • Is canary, and canary size < target: +10000
  • Isn't canary, and canary size >= target: +1000
  • Zone balance: +200 - zone size %

Restore the server with the biggest score. The digit positions reveal the score's components.
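The restoration score mirrors the decommission one with inverted criteria; a sketch reusing ServerMeta from the previous snippet:

int restoration_score(const ServerMeta& s,
                      double canary_size_percent,
                      double canary_target_percent)
{
    int score = 0;
    if (!s.is_outdated)
        score += 100000;
    if (s.is_canary && canary_size_percent < canary_target_percent)
        score += 10000;
    if (!s.is_canary && canary_size_percent >= canary_target_percent)
        score += 1000;
    // Smaller zones get a bigger bonus: 200 - zone size %.
    score += 200 - (int)s.zone_size_percent;
    return score;
}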

Testing [1]

Autoscaler: 5800 lines of C++.

  • Basic algorithms: thresholds, prediction, decisions. Unit tests
  • Autoscaler library: talks to the Google API and the servers. Tests with the Google Cloud Emulator
  • Executable: does common work not related to scaling. Manual tests

Testing [2]

  • Autoscaler in Python: a copy of the autoscaler to play with the algorithms on real-world datasets
  • Loadtests: the full game backend in the Cloud + thousands of bots
  • Playtests: ~200 real people playing in the Cloud

Comparison

Google autoscaler vs Ubisoft autoscaler, feature by feature: quorum, restoration, rolling update, canary deployment, zone balance, prediction, graceful shutdown.

  • Google autoscaler: covers part of the list, but ungracefully; its prediction takes 3 days to train, but is smart
  • Ubisoft autoscaler: covers it gracefully; its prediction trains in 3 mins, but is 'special'

Future plans

Go live

Bare metal + cloud combined

[Diagram: metal machines and cloud machines serving one game together]

Conclusion

  • A perfect autoscaler does not exist
  • Load prediction is necessary, and it can be simple
  • In the cloud, good scaling saves money

More presentations

Jobs

Linear regression [1]

y = a + b*x, (x, y) = {(x_1, y_1), ..., (x_n, y_n)}

Need to find A and B. Using the "least squares" method, solve the equation system approximately:

y_1 = A + B*x_1 \\ ... \\ y_n = A + B*x_n

Solution according to "least squares":

B = S_{xy} / S_{xx} \\ A = mean_y - B * mean_x \\ \\ sum_{xx} = \sum_{i=1}^{N}x_i^2 \\ sum_{xy} = \sum_{i=1}^{N}x_i*y_i \\ mean_{var} = sum_{var} / N \\ \\ S_{xx} = sum_{xx} - N * mean_x * mean_x \\ S_{xy} = sum_{xy} - N * mean_x * mean_y

Linear regression [2]

These values need to be cached in the model:

sum_x, sum_y, sum_{xx}, sum_{xy}

From them, A and B can always be computed in a few operations by the previous formulas.
This is how to update them when a new point is added and the oldest is dropped:

sum_x = old\_sum_x + x_{n+1} - x_0 \\ sum_y = old\_sum_y + y_{n+1} - y_0 \\ sum_{xx} = old\_sum_{xx} + x_{n+1}^2 - x_0^2 \\ sum_{xy} = old\_sum_{xy} + x_{n+1}*y_{n+1} - x_0*y_0

The latest A and B can be cached to make predictions.

Quadratic regression [1]

y = a + b*x + c*x^2, (x, y) = {(x_1, y_1), ..., (x_n, y_n)}

Need to find A, B, and C. Using the "least squares" method, solve the equation system approximately:

y_1 = A + B*x_1 + C*x_1^2 \\ ... \\ y_n = A + B*x_n + C*x_n^2

Quadratic regression [2]

Solution according to "least squares":

B = \dfrac{S_{xy} * S_{xxxx} - S_{xxy} * S_{xxx}}{S_{xx} * S_{xxxx} - S_{xxx}^2} \\~\\ C = \dfrac{S_{xxy} * S_{xx} - S_{xy} * S_{xxx}}{S_{xx} * S_{xxxx} - S_{xxx}^2} \\~\\ A = mean_y - B * mean_x - C * mean_{xx}

sum_{xx} = \sum_{i=1}^{N}x_i^2, \quad sum_{xy} = \sum_{i=1}^{N}x_i*y_i, \quad etc. \\ mean_{var} = sum_{var} / N \\ \\ S_{xx} = sum_{xx} - N * mean_x * mean_x \\ S_{xy} = sum_{xy} - N * mean_x * mean_y \\ S_{xxx} = sum_{xxx} - N * mean_x * mean_{xx} \\ S_{xxy} = sum_{xxy} - N * mean_{xx} * mean_y \\ S_{xxxx} = sum_{xxxx} - N * mean_{xx}^2

Quadratic regression [3]

These values need to be cached in the model:

sum_x, sum_y, sum_{xx}, sum_{xy}, sum_{xxy}, sum_{xxx}, sum_{xxxx}

From them, A, B, and C can always be computed in a few operations by the previous formulas.
This is how to update them when a new point is added and the oldest is dropped:

sum_x = old\_sum_x + x_{n+1} - x_0 \\ sum_y = old\_sum_y + y_{n+1} - y_0 \\ sum_{xx} = old\_sum_{xx} + x_{n+1}^2 - x_0^2 \\ sum_{xy} = old\_sum_{xy} + x_{n+1}*y_{n+1} - x_0*y_0 \\ sum_{xxy} = old\_sum_{xxy} + x_{n+1}^2 * y_{n+1} - x_0^2 * y_0 \\ sum_{xxx} = old\_sum_{xxx} + x_{n+1}^3 - x_0^3 \\ sum_{xxxx} = old\_sum_{xxxx} + x_{n+1}^4 - x_0^4

The latest A, B, and C can be cached to make predictions.
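A sketch of that O(1) update in C++; doubles here for brevity, though the precision caveat from Quadratic regression [5] applies:

struct QuadSums {
    double x = 0, y = 0, xx = 0, xy = 0, xxy = 0, xxx = 0, xxxx = 0;

    // New point (xn, yn) arrives, the oldest point (xo, yo) leaves.
    void replace(double xn, double yn, double xo, double yo) {
        // Reuse the powers, as Quadratic regression [4] advises.
        double xn2 = xn * xn, xn3 = xn2 * xn, xn4 = xn3 * xn;
        double xo2 = xo * xo, xo3 = xo2 * xo, xo4 = xo3 * xo;
        x += xn - xo;
        y += yn - yo;
        xx += xn2 - xo2;
        xy += xn * yn - xo * yo;
        xxy += xn2 * yn - xo2 * yo;
        xxx += xn3 - xo3;
        xxxx += xn4 - xo4;
    }
};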

Quadratic regression [4]

The formulas are heavy. The temporary calculations must be reused as much as possible during the update. For example, to get this:

x_i^2, x_i^3, x_i^4

Do the following:

x2 = x * x \\ x3 = x2 * x \\ x4 = x3 * x

Instead of:

x2 = x * x \\ x3 = x * x * x \\ x4 = x * x * x * x

Quadratic regression [5]

The formulas involve the 6th power of X: it won't fit into 64-bit integers and will lose all precision in doubles. Use a decimal number library to make the calculations precise. For instance, decNumber works in C and C++.
