Sidekiq Load Balancing

Agenda

Load Balancing

Sticking Connections

Sidekiq

Sidekiq Sticking Connections

Data Consistency

Metrics

Future work

Questions

Load Balancing

Our solution

We can load balance as many queries as possible, even queries that don't originate directly from our own code.

Our solution essentially works by replacing ActiveRecord::Base.connection with a proxy object that handles routing of queries.

This proxy object in turn determines which host a query is sent to based on the methods called, removing the need to parse SQL queries.
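
As a rough illustration of routing by method name rather than by SQL parsing, here is a minimal sketch. It is not the actual proxy: FakeConnection, ConnectionProxySketch, the READ_ONLY_METHODS list and the round-robin choice of replica are all assumptions made for the example.

```ruby
# Hypothetical sketch: every class and constant here is illustrative.
class FakeConnection
  attr_reader :name, :log

  def initialize(name)
    @name = name
    @log = []
  end

  # Stand-ins for real connection methods; they only record the call.
  def select_all(sql)
    @log << [:select_all, sql]
  end

  def execute(sql)
    @log << [:execute, sql]
  end
end

class ConnectionProxySketch
  # Assumption: a small allow-list of methods known to be read-only.
  READ_ONLY_METHODS = %i[select_all].freeze

  def initialize(primary:, replicas:)
    @primary = primary
    @replicas = replicas
    @next_replica = 0
  end

  # Read-only methods are sent to a replica.
  READ_ONLY_METHODS.each do |method_name|
    define_method(method_name) do |*args|
      pick_replica.public_send(method_name, *args)
    end
  end

  # Anything not known to be read-only is sent to the primary.
  def method_missing(method_name, *args)
    return super unless @primary.respond_to?(method_name)

    @primary.public_send(method_name, *args)
  end

  def respond_to_missing?(method_name, include_private = false)
    @primary.respond_to?(method_name, include_private) || super
  end

  private

  # Simple round-robin over the replicas.
  def pick_replica
    replica = @replicas[@next_replica % @replicas.size]
    @next_replica += 1
    replica
  end
end

primary = FakeConnection.new('primary')
replicas = [FakeConnection.new('replica-1'), FakeConnection.new('replica-2')]
proxy = ConnectionProxySketch.new(primary: primary, replicas: replicas)

proxy.select_all('SELECT 1') # handled by replica-1
proxy.select_all('SELECT 2') # handled by replica-2
proxy.execute('UPDATE 1')    # handled by the primary
```

Because the decision is made from the method that was called, the proxy never needs to inspect the SQL string itself.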

Sticky connections

Sticky connections are supported by storing a pointer to the current PostgreSQL WAL (Write Ahead Log) position the moment a write is performed.

This pointer is then stored in Redis for a short duration at the end of a request.

Each user is given their own key so that the actions of one user won't lead to all other users being affected.

In the next request, we fetch the pointer and compare it with the WAL position of every secondary.

If all secondaries have a WAL pointer at or beyond our pointer, we know they are in sync and we can safely use a secondary for our read-only queries.

If one or more secondaries are not yet in sync, we continue using the primary until they are.

        # Returns true if this host has caught up to the given transaction
        # write location.
        #
        # location - The transaction write location as reported by a primary.
        def caught_up?(location)
          string = connection.quote(location)

          # In case the host is a primary pg_last_wal_replay_lsn/pg_last_xlog_replay_location() returns
          # NULL. The recovery check ensures we treat the host as up-to-date in
          # such a case.
          query = <<-SQL.squish
            SELECT NOT pg_is_in_recovery()
              OR pg_wal_lsn_diff(pg_last_wal_replay_lsn(), #{string}) >= 0
              AS result
          SQL

          row = query_and_release(query)

          ::Gitlab::Utils.to_boolean(row['result'])
        rescue *CONNECTION_ERRORS
          false
        end

The caught_up? method above checks whether a secondary has caught up.
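
The per-user sticking flow described in this section can be sketched as follows. This is only an illustration under stated assumptions: a plain Hash stands in for Redis (so there is no key expiry), the key format is made up, and the WAL positions of the secondaries are passed in as strings instead of being queried from PostgreSQL.

```ruby
# Hypothetical sketch; the key format and all names are illustrative only.
module LsnSketch
  # Converts a PostgreSQL LSN such as "16/B374D848" into an integer so that
  # two locations can be compared numerically.
  def self.to_i(lsn)
    high, low = lsn.split('/')
    (high.to_i(16) << 32) + low.to_i(16)
  end
end

class StickingSketch
  # store is a stand-in for Redis; a real store would also expire keys.
  def initialize(store = {})
    @store = store
  end

  # Called after a write: remember the primary's WAL position for this user.
  def stick(user_id, primary_write_location)
    @store[key_for(user_id)] = primary_write_location
  end

  # Called on the next request: true if read-only queries may use a secondary.
  def secondaries_usable?(user_id, secondary_replay_locations)
    location = @store[key_for(user_id)]
    return true if location.nil? # this user has no recent write

    target = LsnSketch.to_i(location)
    caught_up = secondary_replay_locations.all? do |replayed|
      LsnSketch.to_i(replayed) >= target
    end

    # Once every secondary has caught up, the user no longer needs to stick.
    @store.delete(key_for(user_id)) if caught_up
    caught_up
  end

  private

  # One key per user, so one user's writes do not affect other users.
  def key_for(user_id)
    "sticking_sketch/user/#{user_id}"
  end
end

sticking = StickingSketch.new
sticking.stick(1, '0/3000060')

lagging = sticking.secondaries_usable?(1, ['0/3000060', '0/3000000'])
in_sync = sticking.secondaries_usable?(1, ['0/3000060', '0/3000100'])
other_user = sticking.secondaries_usable?(2, ['0/3000000'])
```

Here lagging is false because one secondary is still behind the stored position, while in_sync and other_user are both true.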

Sidekiq

We can't reliably use the same sticking mechanism in Sidekiq: many jobs are not directly tied to a user, so we have no way of knowing whether a job should use the primary or not.

Data Consistency

For jobs that do not need to write or to read fully up-to-date data, we can still benefit from load balancing.

By annotating those workers, we can now serve their queries from replicas most of the time.

Data Consistency

To take advantage of read-only database replicas in Sidekiq, jobs can have a data_consistency attribute set, which can be:

data_consistency: :always

The job is required to use the primary (the default).

data_consistency: :sticky

The job uses a replica for as long as possible.

It switches to the primary either on a write or when replication lag is too long.

It should be used on jobs that need to be executed as fast as possible.

data_consistency: :delayed

The job switches to the primary only on a write.

Otherwise it always uses a replica.

If replication lag is long, the job is delayed (retried); only if the replica is still not up to date on the next retry does the job switch to the primary.

It should be used on jobs where delaying execution is acceptable, such as expiring caches or executing hooks.

To set a job’s data consistency, we can use the data_consistency class method:

class DelayedWorker
  include ApplicationWorker

  data_consistency :delayed, feature_flag: :load_balancing_for_delayed_worker

  # ...
end

The feature_flag property allows you to experimentally toggle a job’s data_consistency.

When the feature flag is disabled, the job defaults to :always, which means that it will always use the primary.
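
To make the fallback behaviour concrete, here is a minimal sketch of how such a class method could resolve the effective setting. It is an assumption-laden illustration, not the ApplicationWorker code: FeatureFlagsSketch, DataConsistencySketch and effective_data_consistency are hypothetical names introduced for this example.

```ruby
# Hypothetical sketch; none of these modules are the real implementation.
module FeatureFlagsSketch
  @enabled = {}

  class << self
    def enable(name)
      @enabled[name] = true
    end

    def disable(name)
      @enabled.delete(name)
    end

    def enabled?(name)
      @enabled.key?(name)
    end
  end
end

module DataConsistencySketch
  VALID_LEVELS = %i[always sticky delayed].freeze

  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Declares the desired consistency level, optionally behind a flag.
    def data_consistency(level, feature_flag: nil)
      unless VALID_LEVELS.include?(level)
        raise ArgumentError, "invalid data_consistency: #{level.inspect}"
      end

      @data_consistency = level
      @data_consistency_feature_flag = feature_flag
    end

    # Resolves the level actually in effect: :always when nothing was
    # declared, or when the declared feature flag is disabled.
    def effective_data_consistency
      flag = @data_consistency_feature_flag
      return :always if flag && !FeatureFlagsSketch.enabled?(flag)

      @data_consistency || :always
    end
  end
end

class DelayedWorkerSketch
  include DataConsistencySketch

  data_consistency :delayed, feature_flag: :load_balancing_for_delayed_worker
end

class PlainWorkerSketch
  include DataConsistencySketch
end

before_flag = DelayedWorkerSketch.effective_data_consistency
FeatureFlagsSketch.enable(:load_balancing_for_delayed_worker)
after_flag = DelayedWorkerSketch.effective_data_consistency
undeclared = PlainWorkerSketch.effective_data_consistency
```

With the flag disabled the worker resolves to :always; once the flag is enabled it resolves to the declared :delayed.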

Sidekiq Sticking Connections

If the Sidekiq job has data_consistency configured, then at the moment it is scheduled we need to store the current primary PostgreSQL WAL (Write Ahead Log) position, if a write was performed.

If no write was performed, the replica used by the Sidekiq client can still differ from the replica used by the Sidekiq server, so in that case we store the last write-ahead log location replayed during recovery instead.
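
The scheduling step can be sketched as a client middleware that annotates the job payload. The call signature mirrors the shape of a Sidekiq client middleware, but everything else is an assumption for illustration: SessionSketch, the data_consistency_level class method, and the payload keys database_write_location and database_replica_location are hypothetical, and the WAL locations are supplied by hand.

```ruby
# Hypothetical sketch; names and payload keys are illustrative only.
SessionSketch = Struct.new(
  :performed_write,         # did the current request write to the primary?
  :primary_write_location,  # current WAL position of the primary
  :replica_replay_location, # last WAL location replayed by our replica
  keyword_init: true
)

class ClientLocationMiddlewareSketch
  def initialize(session)
    @session = session
  end

  # Same argument shape as a Sidekiq client middleware.
  def call(worker_class, job, _queue, _redis_pool)
    unless worker_class.data_consistency_level == :always
      if @session.performed_write
        job['database_write_location'] = @session.primary_write_location
      else
        job['database_replica_location'] = @session.replica_replay_location
      end
    end

    yield
  end
end

class StickyWorkerSketch
  def self.data_consistency_level
    :sticky
  end
end

class AlwaysWorkerSketch
  def self.data_consistency_level
    :always
  end
end

after_write = SessionSketch.new(
  performed_write: true,
  primary_write_location: '0/3000060',
  replica_replay_location: '0/3000000'
)
read_only = SessionSketch.new(
  performed_write: false,
  primary_write_location: '0/3000060',
  replica_replay_location: '0/3000000'
)

written_job = {}
ClientLocationMiddlewareSketch.new(after_write)
  .call(StickyWorkerSketch, written_job, 'default', nil) { :pushed }

read_job = {}
ClientLocationMiddlewareSketch.new(read_only)
  .call(StickyWorkerSketch, read_job, 'default', nil) { :pushed }

always_job = {}
ClientLocationMiddlewareSketch.new(after_write)
  .call(AlwaysWorkerSketch, always_job, 'default', nil) { :pushed }
```

In this sketch written_job carries the primary's write location, read_job carries the replica's replay location, and always_job is left untouched.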

Sidekiq Sticking Connections

When the job is executed by the Sidekiq server, the job payload contains the stored write-ahead log location.

We compare this location with the WAL position of the current replica.

If the current replica has a WAL pointer at or beyond the provided location, we know it is in sync and we can safely use the replica for our read-only queries.

If the current replica is not caught up, and data_consistency is set to :delayed, we will retry the Sidekiq job.

If the replica is not caught up the second time, we will fall back and use the primary instead.

If data_consistency is set to :sticky, we will immediately fall back and use the primary.
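
The server-side rules above can be summarised as a small decision function. This is a sketch only: ServerStrategySketch and its arguments are hypothetical, and whether the replica has caught up is passed in as a boolean instead of being checked against the database.

```ruby
# Hypothetical sketch of the decision rules; names are illustrative only.
module ServerStrategySketch
  # Returns :primary, :replica or :retry for a single job execution.
  #
  # data_consistency  - :always, :sticky or :delayed
  # replica_caught_up - true if the replica has replayed the job's location
  # retried           - true if the job was already retried once for lag
  def self.choose(data_consistency:, replica_caught_up:, retried: false)
    return :primary if data_consistency == :always
    return :replica if replica_caught_up

    case data_consistency
    when :sticky
      :primary # fall back immediately
    when :delayed
      retried ? :primary : :retry # retry once, then fall back
    else
      :primary # unknown values are treated like :always
    end
  end
end

sticky_lagging = ServerStrategySketch.choose(
  data_consistency: :sticky, replica_caught_up: false
)
delayed_first = ServerStrategySketch.choose(
  data_consistency: :delayed, replica_caught_up: false
)
delayed_second = ServerStrategySketch.choose(
  data_consistency: :delayed, replica_caught_up: false, retried: true
)
delayed_in_sync = ServerStrategySketch.choose(
  data_consistency: :delayed, replica_caught_up: true
)
```

The three outcomes correspond to the categories tracked in the metrics: replica, retried jobs, and primary.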

Metrics

db_replica_count

db_replica_count vs db_primary_count when BuildHooksWorker LB was rolled out

We can see how the number of operations per second on the primary started to drop, and the number of operations on the replicas started to increase.

Database chosen: Replica vs Retried jobs vs Primary