Vladislav Shpilevoy
Database C developer at Tarantool. Backend C++ developer at VirtualMinds.
Cluster scaling
Scaling in Ubisoft
Google Cloud
New scaling algorithm
Testing
Future plans
Add or drop instances depending on load
Two types
Bare metal
Cloud
Buy or rent, deliver, install
POST create_instances HTTP/1.1
Send an HTTP request
Cluster load
Capacity with good scaling
Capacity with bad scaling
Money waste
Bare metal cluster
Now moving to cloud for better scaling and more
Load
Very volatile
600 000 players
200 000 players
Region Europe West
Region USA East
Region Asia
Instance groups
Ungraceful shutdown loses the state
Servers
Load
For Managed Instance Groups
Automatic cluster resize based on metrics
Players are disconnected
Saves are lost
Instance Group
Google Cloud Region
Autoscaler
Players
Greeting
State
...
Commands
Google Cloud REST API
/create_instances
/delete_instances
Based on threshold
Capacity
Threshold
Players
Example
Average player count
Servers
Based on threshold
Capacity
Threshold
Players
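A minimal sketch of how such a threshold decision could look in C++. The structure, names, and the exact add/remove arithmetic are illustrative assumptions, not the real autoscaler code:

// Threshold-based decision on the average player count per server.
// Thresholds are per-server player counts (capacity e.g. 100,
// scale up at e.g. 90, scale down at e.g. 70) and must be > 0.
#include <cstdint>

struct ScalingDecision { int add = 0; int remove = 0; };

ScalingDecision decide(uint64_t total_players, uint64_t server_count,
                       uint64_t scale_up_threshold,
                       uint64_t scale_down_threshold)
{
    ScalingDecision d;
    if (server_count == 0)
        return d;
    uint64_t avg = total_players / server_count;
    if (avg >= scale_up_threshold) {
        // Add enough servers to push the average back under the threshold.
        d.add = (int)(total_players / scale_up_threshold - server_count + 1);
    } else if (avg <= scale_down_threshold) {
        // Drop servers while keeping the average under the scale-up threshold.
        uint64_t target = total_players / scale_up_threshold + 1;
        if (target < server_count)
            d.remove = (int)(server_count - target);
    }
    return d;
}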
Safety Timeout
Two-phase shutdown
ACTIVE
Autoscaler sends decommission request
DECOMMISSIONING
Server enters DECOMMISSIONING state
No new players
Players eventually leave
Save games are uploaded
Server enters DECOMMISSIONED state
DECOMMISSIONED
Autoscaler kills the machine
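A sketch of the two-phase shutdown as a state machine. The states follow the slides; the class and method names are hypothetical:

#include <cassert>

enum class ServerState { ACTIVE, DECOMMISSIONING, DECOMMISSIONED };

class Server {
public:
    // The autoscaler asks the server to stop taking new players.
    void start_decommission() {
        assert(state_ == ServerState::ACTIVE);
        state_ = ServerState::DECOMMISSIONING;
    }
    // Called when the last player has left and all saves are uploaded.
    void finish_decommission() {
        assert(state_ == ServerState::DECOMMISSIONING);
        state_ = ServerState::DECOMMISSIONED;
    }
    bool accepts_players() const { return state_ == ServerState::ACTIVE; }
    // Only DECOMMISSIONED machines are safe to delete via the Cloud API.
    bool can_be_killed() const { return state_ == ServerState::DECOMMISSIONED; }
private:
    ServerState state_ = ServerState::ACTIVE;
};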
Scale down
Scale up
Load is below scale down threshold
Decommission a couple of servers
But now load grows
Add new servers
Some servers waste money: they have free space, but it can't be used
Restore the old servers
Then add new ones. All the space is used when scaling up is needed
Need to scale up
Add servers
But they are broken
Add more
Still broken
Some machines don't work but still waste money. This can continue indefinitely
Cluster size
Connected count
Quorum – minimal % of connected nodes to approve scaling
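A small sketch of how the quorum rule might be checked; the names and the integer-percent comparison are assumptions:

// Scaling decisions are made only when enough servers are connected
// to the autoscaler.
bool can_scale(unsigned cluster_size, unsigned connected_count,
               unsigned quorum_percent)
{
    if (cluster_size == 0)
        return true;
    return connected_count * 100 >= cluster_size * quorum_percent;
}
// Example: with a 75% quorum, a 100-server cluster needs at least 75
// connected servers before the autoscaler may add or drop anything.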
Server capacity: 100
Scale up threshold: 90
~10% reservation
Example:
Reservation cost is 100x bigger
Assume 2 servers are filled per second,
Server boot time is 3 minutes. Then:
100 servers → 10 reserve
10 000 servers → 1000 reserve
Larger clusters can afford a smaller reservation: 10%, 5%, 2%
"thresholds": [ { "server_count": 100, "scale_up_player_count": 90, "scale_down_player_count": 70 }, { "server_count": 1000, "scale_up_player_count": 95, "scale_down_player_count": 80 }, { "server_count": 5000, "scale_up_player_count": 98, "scale_down_player_count": 90 }, ]
Reservation is changed depending on server count
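One possible way to apply such a config: pick the thresholds of the largest matching "server_count" entry. JSON parsing is omitted and all names are hypothetical:

#include <vector>
#include <cstdint>

struct ThresholdRule {
    uint64_t server_count;
    uint64_t scale_up_player_count;
    uint64_t scale_down_player_count;
};

// Rules must be non-empty and sorted by server_count in ascending order.
// The last rule whose server_count is <= the current size wins.
ThresholdRule select_rule(const std::vector<ThresholdRule> &rules,
                          uint64_t current_server_count)
{
    ThresholdRule result = rules.front();
    for (const ThresholdRule &r : rules) {
        if (r.server_count <= current_server_count)
            result = r;
        else
            break;
    }
    return result;
}
// With the config above: 500 servers -> thresholds 90/70,
// 2000 servers -> 95/80, 10 000 servers -> 98/90.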
Capacity
Threshold
Players
In reality the booting takes time
Booting
How scaling looks in theory
Example:
The reserve ends in 20 secs. People wait for 2 min 40 secs
The cluster can and will be overloaded
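The arithmetic behind this example, assuming a reserve of 40 servers together with the fill rate (2 servers per second) and boot time (3 minutes) given above:

\[ t_{\mathrm{reserve}} = \frac{40\ \text{servers}}{2\ \text{servers/s}} = 20\ \text{s}, \qquad t_{\mathrm{wait}} = t_{\mathrm{boot}} - t_{\mathrm{reserve}} = 180\ \text{s} - 20\ \text{s} = 160\ \text{s} = 2\ \text{min}\ 40\ \text{s} \]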
Charts: electricity load (in winter) and game cluster load (one of), both plotted over the time of day.
Prediction is used in other areas, such as electricity load forecasting. The solutions are very similar.
Timeline: Train → Prediction, Train → Prediction, ...
"prediction": { "algorithm": "quadratic_regression", "train_interval_sec": 600, "sample_interval_sec": 10 }
~120 C++ code lines
~240 C++ code lines
All the same for the quadratic regression
Find for each new point
Store and update on each step
Calculate on each step
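A sketch of the idea for the linear case, assuming a sliding window of samples and hypothetical names. The real code also covers quadratic regression and, as the appendix notes, should use a decimal library rather than double:

// Keep a sliding window of (time, player_count) samples and cache the
// least-squares sums, so adding a new point and dropping the oldest is O(1).
// Shown for the linear model y = A*x + B.
#include <deque>
#include <utility>

class LinearPredictor {
public:
    void add_sample(double x, double y) {
        points_.emplace_back(x, y);
        sx_ += x; sy_ += y; sxy_ += x * y; sxx_ += x * x;
        if (points_.size() > kWindowSize) {
            auto [ox, oy] = points_.front();
            points_.pop_front();
            // Update the cached sums instead of recomputing them.
            sx_ -= ox; sy_ -= oy; sxy_ -= ox * oy; sxx_ -= ox * ox;
        }
    }
    // Predict the player count at time x using the cached sums.
    double predict(double x) const {
        double n = (double)points_.size();
        double d = n * sxx_ - sx_ * sx_;
        if (n < 2 || d == 0)
            return points_.empty() ? 0 : points_.back().second;
        double a = (n * sxy_ - sx_ * sy_) / d;
        double b = (sy_ - a * sx_) / n;
        return a * x + b;
    }
private:
    static constexpr size_t kWindowSize = 60; // e.g. 600 s train / 10 s samples
    std::deque<std::pair<double, double>> points_;
    double sx_ = 0, sy_ = 0, sxy_ = 0, sxx_ = 0;
};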
Chart: player count grows from 100 000 to 160 000 within 20 minutes; queue vs capacity.
There is a cluster
Group template:
The template is changed: it has a new version of the game
Google's autoscaler recreates old instances in packs – this is a rolling update
Google automatically assigns it to new instances
New instances get the new template automatically; only the old ones need to be found and recreated
Google Cloud REST API
/get_info
{ "template": <templ> }
{ "name": <name>, "template": <templ>, }
Autoscaler downloads the latest template from Google
The servers send their template to Autoscaler
Autoscaler finds outdated instances by their template
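A sketch of the outdated-instance detection, assuming hypothetical types; the comparison itself is just equality of template names:

#include <string>
#include <vector>

struct ServerInfo {
    std::string name;
    std::string template_name; // reported by the server to the autoscaler
};

// The latest template is downloaded from the Google Cloud API; every server
// whose reported template differs from it is outdated.
std::vector<std::string>
find_outdated(const std::vector<ServerInfo> &servers,
              const std::string &latest_template)
{
    std::vector<std::string> outdated;
    for (const ServerInfo &s : servers) {
        if (s.template_name != latest_template)
            outdated.push_back(s.name);
    }
    return outdated;
}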
Part 1: during scale down prefer to decommission outdated servers
Scale down
Scale up
New template
Old template
Decommission
Need to scale down
Decommission outdated servers first
Need to scale down again
Now can decommission new servers
Part 2: during normal work upgrade in batches
Scale down
Scale up
New template
Old template
Decommission
No scaling is required, but there are outdated servers
Introduce an upgrade quota – the max % of outdated machines to upgrade at a time
upgrade_quota = 20%
Outdated servers are decommissioned according to the quota, but this creates an imbalance
Add new machines; hence the quota is the maximum over-reservation
Eventually the old ones are deleted
The process continues. The update is rolling
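A sketch of how the upgrade quota could bound the batch size; names and rounding are assumptions:

#include <algorithm>
#include <cstddef>

// At most upgrade_quota_percent of the cluster may be decommissioned for
// upgrading at a time, so the quota is also the maximum over-reservation
// while the replacements boot.
size_t upgrade_batch_size(size_t cluster_size, size_t outdated_count,
                          size_t already_upgrading,
                          unsigned upgrade_quota_percent)
{
    size_t quota = cluster_size * upgrade_quota_percent / 100;
    if (quota == 0 && outdated_count > 0)
        quota = 1; // always make progress
    if (already_upgrading >= quota)
        return 0;
    return std::min(outdated_count, quota - already_upgrading);
}
// Example: 10 servers with a 20% quota -> at most 2 outdated servers are
// decommissioned and replaced per round, as in the slides.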
Part 3: during scale up try not to restore outdated servers
Scale down
Scale up
New template
Old template
Decommission
Scale down is in progress
upgrade_quota = 20%
But now load grows – need to scale up!
Restore some outdated servers, but keep upgrading
Create new servers to finish scale up
Restored
There is a cluster
Group template:
There is a new version of the game. We want to try it on a few nodes – this is canary deployment
In Google's group settings, two templates can be specified, one of them with a canary size
50%
Google automatically assigns templates to new instances
Google's built-in autoscaler can upgrade existing instances to meet the target size
Download canary settings from Google's API
Let Google assign templates automatically
Prefer to decommission excess canary instances, use upgrade quota
"versions": [ { "instanceTemplate": <templ_old>, }, { "instanceTemplate": <templ_new>, "targetSize": { "percent": <size>, } } ]
+
-
Scale down
Scale up
Canary template
Default template
Decommission
Balanced cluster
Group template:
upgrade_quota = 20%
Now a canary template is installed
50%
Start an upgrade using the quota. Target size is 5 instances
Continue the upgrade
Eventually, there are 50% canaries, rounded up to 5 instances
Google assigns templates automatically. 6 of 12 are canaries – 50%.
Now need 3 more servers!
Now need to decommission 6 servers!
The autoscaler decommissions 3 canaries and 3 defaults
The balance is respected – 3 of 6 are canaries
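A sketch of the canary target computation with the round-up behaviour from the example; the function name and rounding formula are assumptions:

#include <cstddef>

// The canary target is the configured percent of the group size, rounded up.
size_t canary_target_count(size_t cluster_size, unsigned canary_percent)
{
    // Round up: e.g. 50% of 9 instances -> 5 canaries; 25% of 12 -> 3.
    return (cluster_size * canary_percent + 99) / 100;
}
// Canaries above this target are treated as excess: the autoscaler prefers
// to decommission them, and uses the upgrade quota for conversions.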
Instances can live in multiple zones. They must be balanced to protect against outages
When new instances are added, Google spreads them across the zones evenly and automatically
Google's built-in autoscaler keeps the zones balanced when it scales down
If one zone goes down, it affects only cluster_size / zone_count of the cluster
Download zone list
Let Google assign zones automatically
Prefer to decommission servers from bloated zones
"zones": [ { "zone": <zone1>, }, { "zone": <zone2>, }, { "zone": <zone3>, } ],
-
+
Need to scale up
Scale down
Scale up
Create new instances. They are assigned to zones by Google automatically
Need to scale down
Decommission servers evenly across the zones
Now need to scale up again!
Restore servers evenly across the zones
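A sketch of zone-aware decommissioning: pick a server from the most populated zone. Names are hypothetical; the next slides fold this preference into a single score:

#include <map>
#include <string>
#include <vector>

// Given the zone of every live server, return the most populated zone –
// the one to decommission from first.
std::string pick_bloated_zone(const std::vector<std::string> &server_zones)
{
    std::map<std::string, size_t> sizes;
    for (const std::string &z : server_zones)
        ++sizes[z];
    std::string best;
    size_t best_size = 0;
    for (const auto &[zone, size] : sizes) {
        if (size > best_size) {
            best = zone;
            best_size = size;
        }
    }
    return best;
}
// In the combined scoring shown next, this preference appears as the
// "100 + zone size %" component.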
Calculate server scores:
Is outdated: +100000
Is canary, canary size > target: +10000
Isn't canary, canary size <= target: +1000
Zone balance: +(100 + zone size %)
Is empty: +1
Decommission the server with the biggest score
Digit positions reveal score value components
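A sketch of this scoring in C++, encoding the components exactly as listed above; the struct fields and function names are hypothetical:

#include <cstddef>
#include <cstdint>

// Inputs for one server; zone_percent is the share of the whole cluster
// living in this server's zone, in percent.
struct ServerScoreInput {
    bool is_outdated;
    bool is_canary;
    bool is_empty;         // no players on it
    unsigned zone_percent;
};

uint64_t decommission_score(const ServerScoreInput &s,
                            size_t canary_count, size_t canary_target)
{
    uint64_t score = 0;
    if (s.is_outdated)
        score += 100000;
    if (s.is_canary && canary_count > canary_target)
        score += 10000;
    if (!s.is_canary && canary_count <= canary_target)
        score += 1000;
    score += 100 + s.zone_percent;
    if (s.is_empty)
        score += 1;
    return score;
}
// E.g. an outdated, non-canary, non-empty server in a zone holding 25% of
// the cluster (with canaries not above target) scores
// 100000 + 1000 + 125 = 101125, as in the example below.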
(Legend: outdated / new / canary servers)
Canary target – 25%
Need to decommission 1 server
Calculate scores: 101125, 101125, 1125, 1125, 1125, 125, 1150, 1150, 1150, 1150, 150, 150
The outdated servers have the biggest score: 100000 + 1000 + 125
Kill one of them
The scores are updated due to new zone size percentage
Scores: 101118, 1118, 1128, 1128, 1128, 128, 1155, 1155, 1155, 1155, 155, 155
Kill the next outdated server
The scores are updated again
Scores: 1110, 1130, 1130, 130, 1160, 1160, 1160, 1160, 160, 160
Kill 2 more servers similarly
Now the scores are more interesting – there are too many canaries: 25% is 2 servers now
Scores: 113, 138, 138, 10138, 150, 150, 10150, 10150
The next kill prefers a canary from the biggest zone
The scores are updated again and so on ...
Scores: 1115, 1143, 1143, 143, 1143, 1143, 143
Calculate server scores:
Restore the server with the biggest score
Is not outdated: +100000
Is canary, canary size < target: +10000
Isn't canary, canary size >= target: +1000
Zone balance: +(200 - zone size %)
Digit positions reveal score value components
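A sketch of the restore score, mirroring the decommission score and reusing the illustrative ServerScoreInput struct from the earlier sketch:

#include <cstddef>
#include <cstdint>

uint64_t restore_score(const ServerScoreInput &s,
                       size_t canary_count, size_t canary_target)
{
    uint64_t score = 0;
    if (!s.is_outdated)
        score += 100000;
    if (s.is_canary && canary_count < canary_target)
        score += 10000;
    if (!s.is_canary && canary_count >= canary_target)
        score += 1000;
    // Prefer zones holding a smaller share of the cluster.
    score += 200 - s.zone_percent;
    return score;
}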
Thresholds, prediction, decisions – unit tests
Talking to the Google API and the servers – tests with Google Cloud Emulator
Common work not related to scaling – manual tests
A copy of the autoscaler to play with algorithms on real-world datasets
Full game backend in the Cloud + thousands of bots
~200 real people playing in the Cloud
Quorum:
No, but ungraceful
Restoration:
Yes, but graceful
Rolling update:
Canary deployment:
Zone balance:
Prediction:
Graceful:
3 days train, but smart
3 mins train, but 'special'
Cloud machines
Metal machines
We need to find A and B. Using the least-squares method, solve the equation system approximately:
Solution according to "least squares":
These values need to be cached in the model:
From these, A and B can always be recomputed in a few operations using the previous formulas.
This is how to update them when a new point is added and the oldest is dropped
The latest A and B can be cached to make predictions
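For reference, the textbook least-squares solution for the linear model y = A·x + B over n points, together with the sums worth caching (the slides' own formulas are not reproduced here):

\[ A = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \bigl(\sum x_i\bigr)^2}, \qquad B = \frac{\sum y_i - A\sum x_i}{n} \]

The cached values are \(n,\ \sum x_i,\ \sum y_i,\ \sum x_i y_i,\ \sum x_i^2\); when a new point is added and the oldest is dropped, each sum is updated by adding the new term and subtracting the old one.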
We need to find A, B, and C. Using the least-squares method, solve the equation system approximately:
Solution according to "least squares":
These values need to be cached in the model:
From these, A, B, and C can always be recomputed in a few operations using the previous formulas.
This is how to update them when a new point is added and the oldest is dropped
The latest A, B, and C can be cached to make predictions
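For reference, the textbook normal equations for the quadratic model y = A·x² + B·x + C; solving this 3×3 system from the cached sums gives A, B, and C:

\[
\begin{aligned}
A\sum x_i^4 + B\sum x_i^3 + C\sum x_i^2 &= \sum x_i^2 y_i,\\
A\sum x_i^3 + B\sum x_i^2 + C\sum x_i &= \sum x_i y_i,\\
A\sum x_i^2 + B\sum x_i + C\,n &= \sum y_i.
\end{aligned}
\]

The cached values are \(n,\ \sum x_i,\ \sum x_i^2,\ \sum x_i^3,\ \sum x_i^4,\ \sum y_i,\ \sum x_i y_i,\ \sum x_i^2 y_i\).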
The formulas are heavy. The temporary calculations must be reused as much as possible during the update. For example, to get this:
Do the following:
Instead of:
The formulas involve the 6th power of X – it won't fit into 64-bit integers and will lose all precision in doubles. Use a decimal number library to make the calculations precise. For instance, decNumber works in C and C++.