Sidekiq Load Balancing
Agenda
Load Balancing
Our solution
- Our solution essentially works by replacing ActiveRecord::Base.connection with a proxy object that handles routing of queries.
- This proxy object in turn determines what host a query is sent to based on the methods called, removing the need for parsing SQL queries.
- We can load balance as many queries as possible, even queries that don't originate directly from our own code.
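The routing idea above can be illustrated with a small, self-contained sketch. FakeConnection, ConnectionProxy and the READ_METHODS list are hypothetical names chosen for this example; they are assumptions, not the actual GitLab classes.

```ruby
# Hypothetical sketch: names below are illustrative, not GitLab's classes.
class FakeConnection
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Stand-ins for connection methods; they only report which host ran the SQL.
  def select_all(sql)
    "#{name}: #{sql}"
  end

  def execute(sql)
    "#{name}: #{sql}"
  end
end

class ConnectionProxy
  # Methods assumed to be read-only; everything else goes to the primary.
  READ_METHODS = %i[select_all select_one select_rows select_value].freeze

  def initialize(primary:, replica:)
    @primary = primary
    @replica = replica
  end

  # Route by method name, so no SQL parsing is needed.
  def method_missing(name, *args, &block)
    target = READ_METHODS.include?(name) ? @replica : @primary
    return super unless target.respond_to?(name)

    target.public_send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @primary.respond_to?(name, include_private) || super
  end
end

proxy = ConnectionProxy.new(
  primary: FakeConnection.new('primary'),
  replica: FakeConnection.new('replica')
)

read_result = proxy.select_all('SELECT 1')
write_result = proxy.execute('UPDATE users SET name = NULL')
```

Because the decision depends only on the method called, any library that goes through the connection object is routed the same way as our own code.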
Sticky connections
- Sticky connections are supported by storing a pointer to the current PostgreSQL WAL (Write Ahead Log) position the moment a write is performed.
- This pointer is then stored in Redis for a short duration at the end of a request.
- Each user is given their own key so that the actions of one user won't lead to all other users being affected.
- In the next request, we get the pointer and compare this with all the secondaries.
- If all secondaries have a WAL pointer that has reached or passed our pointer, we know they are in sync and we can safely use a secondary for our read-only queries.
- If one or more secondaries are not yet in sync we will continue using the primary until they are in sync.
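A minimal sketch of the sticking flow described above, under stated assumptions: an in-memory Hash stands in for Redis (so key expiry is not modelled), WAL positions are plain Integers, and the Sticking class and its key format are hypothetical.

```ruby
# Hypothetical sketch: a Hash stands in for Redis and WAL positions are
# plain Integers instead of PostgreSQL's textual LSN format.
class Sticking
  def initialize(store = {})
    @store = store
  end

  # Called at the end of a request that performed a write.
  def stick(user_id, primary_wal_position)
    @store[key(user_id)] = primary_wal_position
  end

  # Called in the next request. Returns :replica only when every secondary
  # has replayed at least up to the stored position.
  def pick_host(user_id, secondary_positions)
    location = @store[key(user_id)]
    return :replica if location.nil?

    secondary_positions.all? { |pos| pos >= location } ? :replica : :primary
  end

  private

  # One key per user, so one user's writes do not affect other users.
  def key(user_id)
    "sticking:user:#{user_id}"
  end
end

sticking = Sticking.new
sticking.stick(1, 100)
```

With this setup, `sticking.pick_host(1, [99, 120])` keeps user 1 on the primary because one secondary is still behind, while a user with no stored pointer goes straight to a replica.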
Checking if a secondary has caught up:
# Returns true if this host has caught up to the given transaction
# write location.
#
# location - The transaction write location as reported by a primary.
def caught_up?(location)
  string = connection.quote(location)

  # In case the host is a primary pg_last_wal_replay_lsn/pg_last_xlog_replay_location() returns
  # NULL. The recovery check ensures we treat the host as up-to-date in
  # such a case.
  query = <<-SQL.squish
    SELECT NOT pg_is_in_recovery()
      OR pg_wal_lsn_diff(pg_last_wal_replay_lsn(), #{string}) >= 0
      AS result
  SQL

  row = query_and_release(query)

  ::Gitlab::Utils.to_boolean(row['result'])
rescue *CONNECTION_ERRORS
  false
end
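To make the `pg_wal_lsn_diff(...) >= 0` comparison concrete, here is a pure-Ruby sketch of the same arithmetic. PostgreSQL prints a WAL location as two hexadecimal numbers separated by a slash; the helper names lsn_to_i, lsn_diff and caught_up_locally? are hypothetical and exist only for this illustration (the real check runs inside the database).

```ruby
# Convert a textual LSN such as "16/B374D848" into an Integer byte position.
# The part before the slash is the high 32 bits, the part after is the low 32.
def lsn_to_i(lsn)
  high, low = lsn.split('/').map { |part| part.to_i(16) }
  (high << 32) + low
end

# Mirrors the meaning of pg_wal_lsn_diff(a, b): the byte distance a - b.
def lsn_diff(a, b)
  lsn_to_i(a) - lsn_to_i(b)
end

# A host has caught up when its replayed position is at or past the
# location reported by the primary.
def caught_up_locally?(replayed, required)
  lsn_diff(replayed, required) >= 0
end
```

A replica that has replayed exactly up to the required location counts as caught up, which is why the comparison is `>= 0` rather than `> 0`.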
Sidekiq
We can't reliably use the same sticking mechanism for Sidekiq: many jobs are not directly tied to a user, so we have no way of knowing whether a job should use the primary or not.
Data Consistency
- For jobs that do not require read-write access and up-to-date data, we can still benefit from load balancing.
- By annotating those workers, we can send their queries to replicas most of the time.
Data Consistency
To use read-only database replicas in Sidekiq, a job can set a data_consistency attribute, which can be one of:
data_consistency: :always
- The job is required to use the primary (the default).
data_consistency: :sticky
- The job uses a replica for as long as possible.
- It switches to the primary either on a write or when replication lag is long.
- Should be used for jobs that need to be executed as fast as possible.
data_consistency: :delayed
- The job always prefers a replica and switches to the primary only on a write.
- If there is a long replication lag, the job is delayed; it switches to the primary only if the replica is still not up to date on the next retry.
- Should be used for jobs whose execution we are fine to delay, such as expiring caches or executing hooks.
To set a job’s data consistency, we can use the data_consistency class method:
class DelayedWorker
  include ApplicationWorker

  data_consistency :delayed, feature_flag: :load_balancing_for_delayed_worker

  # ...
end
- The feature_flag property allows you to experimentally toggle a job’s data_consistency.
- When the feature flag is disabled, the job defaults to :always, which means that it will always use the primary.
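The fallback to :always can be sketched as follows. ConsistencyDsl, effective_data_consistency and the enabled_flags argument are hypothetical stand-ins invented for this example; they are not the ApplicationWorker API and only illustrate the behaviour described above.

```ruby
# Hypothetical stand-in for the worker DSL; not GitLab's ApplicationWorker.
module ConsistencyDsl
  VALID = %i[always sticky delayed].freeze

  def data_consistency(value, feature_flag: nil)
    raise ArgumentError, "unknown data_consistency: #{value}" unless VALID.include?(value)

    @data_consistency = value
    @consistency_feature_flag = feature_flag
  end

  # enabled_flags is an illustrative list of enabled feature flag names.
  def effective_data_consistency(enabled_flags)
    return :always if @data_consistency.nil?
    # A configured but disabled flag falls back to the default.
    return :always if @consistency_feature_flag && !enabled_flags.include?(@consistency_feature_flag)

    @data_consistency
  end
end

class ExampleDelayedWorker
  extend ConsistencyDsl

  data_consistency :delayed, feature_flag: :load_balancing_for_delayed_worker
end

class ExamplePlainWorker
  extend ConsistencyDsl
end
```

Gating the setting behind a flag lets a worker be moved to replicas gradually and rolled back without a code change.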
Sidekiq Sticky Connections
- If the Sidekiq job has data_consistency configured, then at the moment it is scheduled we keep the current primary PostgreSQL WAL (Write Ahead Log) position, provided a write was performed.
- If no write was performed, the replica used by the Sidekiq client may still differ from the replica used by the Sidekiq server, so in that case we keep the last write-ahead log location replayed during recovery.
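A sketch of this scheduling-time step, under stated assumptions: the Session struct, the annotate_job helper and the two payload keys are hypothetical names for illustration, and skipping :always jobs is an assumption based on those jobs always using the primary.

```ruby
# Hypothetical client-side step: attach a WAL location to the job payload.
# The struct fields and payload keys are illustrative assumptions.
Session = Struct.new(:performed_write, :primary_write_location, :replica_replay_location)

def annotate_job(job, session, data_consistency)
  # Assumption: jobs that always use the primary need no location.
  return job if data_consistency == :always

  if session.performed_write
    # A write happened: remember where the primary's WAL currently is.
    job.merge('database_write_location' => session.primary_write_location)
  else
    # No write: remember how far the client's replica had replayed.
    job.merge('database_replica_location' => session.replica_replay_location)
  end
end

job = { 'class' => 'ExampleWorker', 'args' => [1] }
after_write = annotate_job(job, Session.new(true, '0/3000060', '0/3000000'), :delayed)
after_read = annotate_job(job, Session.new(false, '0/3000060', '0/3000000'), :sticky)
```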
Sidekiq Sticky Connections
- When the job is executed by the Sidekiq server, the job contains the database write location, which points to a write-ahead log location.
- We compare this location with the current replica.
- If the current replica has a WAL pointer that has reached or passed the provided location, we know it is in sync and we can safely use the replica for our read-only queries.
- If the current replica is not caught up, and data_consistency is set to :delayed, we retry the Sidekiq job.
- If the replica is not caught up the second time, we fall back and use the primary instead.
- If data_consistency is set to :sticky, we immediately fall back and use the primary.
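The server-side decision above can be summarised as a small function. This is a sketch only: choose_database, its arguments and the :retry return value are hypothetical names, and retry_count is assumed to count retries already made because of replication lag.

```ruby
# Hypothetical sketch of the server-side decision.
# Returns :replica, :primary, or :retry (re-enqueue the job for later).
def choose_database(data_consistency, replica_caught_up, retry_count)
  return :primary if data_consistency == :always
  return :replica if replica_caught_up

  case data_consistency
  when :sticky
    # Speed matters: fall back to the primary immediately.
    :primary
  when :delayed
    # Delay once; if the replica still lags on the retry, use the primary.
    retry_count.zero? ? :retry : :primary
  else
    :primary
  end
end
```

For example, a :delayed job on a lagging replica yields :retry on its first run and :primary on the retry, while a :sticky job yields :primary right away.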
Metrics




Future work
Questions
Sidekiq load balancing
By Nikola Milojevic