Sidekiq Load Balancing

Agenda

Load Balancing

Our solution

  • Our solution essentially works by replacing ActiveRecord::Base.connection with a proxy object that handles routing of queries.
  • This proxy object in turn determines what host a query is sent to based on the methods called, removing the need for parsing SQL queries.
  • We can load balance as many queries as possible, even queries that don't originate directly from our own code.
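The routing idea above can be illustrated with a minimal sketch. It is not the actual proxy implementation: FakeHost, ConnectionProxy and the READ_METHODS list are hypothetical names chosen for illustration, and the only assumption carried over from the slides is that the host is picked from the method being called rather than from the SQL text.

```ruby
# Hypothetical stand-in for a database host; it only records the calls it gets.
class FakeHost
  attr_reader :name, :calls

  def initialize(name)
    @name = name
    @calls = []
  end

  def select_all(sql)
    @calls << [:select_all, sql]
    []
  end

  def execute(sql)
    @calls << [:execute, sql]
    nil
  end
end

# Routes calls by method name instead of parsing SQL.
class ConnectionProxy
  # Methods assumed to be read-only for this sketch.
  READ_METHODS = %i[select_all].freeze

  def initialize(primary:, replica:)
    @primary = primary
    @replica = replica
  end

  # Known read-only methods are forwarded to the replica.
  READ_METHODS.each do |name|
    define_method(name) do |*args, &block|
      @replica.public_send(name, *args, &block)
    end
  end

  # Anything not known to be read-only is forwarded to the primary.
  def method_missing(name, *args, &block)
    if @primary.respond_to?(name)
      @primary.public_send(name, *args, &block)
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    @primary.respond_to?(name, include_private) || super
  end
end

primary = FakeHost.new('primary')
replica = FakeHost.new('replica')
proxy = ConnectionProxy.new(primary: primary, replica: replica)

proxy.select_all('SELECT 1')           # read: forwarded to the replica
proxy.execute('DELETE FROM sessions')  # write: forwarded to the primary
```

Defaulting unknown methods to the primary is the conservative choice in this sketch: a method that is wrongly treated as a write only costs some primary load, while a write sent to a replica would fail.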

Sticky connections

  • Sticky connections are supported by storing a pointer to the current PostgreSQL WAL (Write Ahead Log) position the moment a write is performed.

 

  • This pointer is then stored in Redis for a short duration at the end of a request.
  • Each user is given their own key so that the actions of one user won't lead to all other users being affected.
  • In the next request, we get the pointer and compare this with all the secondaries.

  

  • If all secondaries have a WAL pointer at or beyond our pointer, we know they are in sync and we can safely use a secondary for our read-only queries.
  • If one or more secondaries are not yet in sync, we will continue using the primary until they are in sync.
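The sticking flow described above could be sketched roughly as follows. This is an assumption-laden illustration rather than the real code: FakeStore stands in for Redis, and the key format, STICK_TTL value and helper names (stick, use_secondary?, lsn_to_i) are all hypothetical. The one PostgreSQL fact it relies on is that an LSN such as "16/B374D848" is a 64-bit position printed as two hexadecimal numbers.

```ruby
# In-memory stand-in for Redis with expiring keys (hypothetical; a real setup
# would use a Redis client instead).
class FakeStore
  def initialize
    @data = {}
  end

  def set(key, value, ttl:, now: Time.now)
    @data[key] = [value, now + ttl]
  end

  def get(key, now: Time.now)
    value, expires_at = @data[key]
    return nil if value.nil? || now >= expires_at

    value
  end
end

# Converts a PostgreSQL LSN such as "16/B374D848" into an integer.
def lsn_to_i(lsn)
  high, low = lsn.split('/').map { |part| part.to_i(16) }
  (high << 32) | low
end

STICK_TTL = 30 # seconds; illustrative value only

# Called at the end of a request that performed a write.
def stick(store, user_id, write_location)
  store.set("sticking:user:#{user_id}", write_location, ttl: STICK_TTL)
end

# Called at the start of the next request. secondary_locations holds the
# replayed WAL location reported by each secondary.
def use_secondary?(store, user_id, secondary_locations)
  location = store.get("sticking:user:#{user_id}")
  return true if location.nil? # no recent write by this user

  wanted = lsn_to_i(location)
  secondary_locations.all? { |replayed| lsn_to_i(replayed) >= wanted }
end

store = FakeStore.new
stick(store, 1, '0/3000060')

in_sync = use_secondary?(store, 1, ['0/3000060', '0/3000100'])
lagging = use_secondary?(store, 1, ['0/3000000', '0/3000100'])
other_user = use_secondary?(store, 2, ['0/3000000'])
```

Because each user has their own key, the lagging secondary in the example only keeps user 1 on the primary; user 2 has no stored pointer and can read from a secondary straight away.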
Checking if a secondary has caught up:

        # Returns true if this host has caught up to the given transaction
        # write location.
        #
        # location - The transaction write location as reported by a primary.
        def caught_up?(location)
          string = connection.quote(location)

          # In case the host is a primary, pg_last_wal_replay_lsn/pg_last_xlog_replay_location()
          # returns NULL. The recovery check ensures we treat the host as
          # up-to-date in such a case.
          query = <<-SQL.squish
            SELECT NOT pg_is_in_recovery()
              OR pg_wal_lsn_diff(pg_last_wal_replay_lsn(), #{string}) >= 0
              AS result
          SQL

          row = query_and_release(query)

          ::Gitlab::Utils.to_boolean(row['result'])
        rescue *CONNECTION_ERRORS
          false
        end

Sidekiq

We can't reliably use the same sticking mechanism: many jobs are not directly tied to a user, so we have no way of knowing whether a job should use the primary or not.

Data Consistency

  • For jobs that do not require read-write and up-to-date data, we can still benefit from load balancing.

 

  • By annotating those workers, we can now use replicas most of the time.

Data Consistency

To use read-only database replicas in Sidekiq, jobs can set a data_consistency attribute, which can be one of:

data_consistency: :always

  • The job is required to use the primary (the default).

data_consistency: :sticky

  • The job would use a replica for as long as possible.
  • It would switch to the primary either on a write or on long replication lag.
  • Should be used on jobs that need to be executed as quickly as possible.

data_consistency: :delayed

  • The job would always use a replica and switch to the primary only on a write.
  • If there's a long replication lag, the job is delayed and retried; only if the replica is still not up to date on the retry does it switch to the primary.
  • Should be used on jobs where delaying execution is acceptable because they are less time-critical, such as expiring caches or executing hooks.

To set a job’s data consistency, we can use the data_consistency class method:

 
class DelayedWorker
  include ApplicationWorker

  data_consistency :delayed, feature_flag: :load_balancing_for_delayed_worker

  # ...
end

  • The feature_flag property allows you to experimentally toggle a job's data_consistency.
  • When the feature flag is disabled, the job defaults to :always, which means that it will always use the primary.
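How the feature flag fallback might resolve can be sketched as below. This is only an illustration of the rule stated above, not the ApplicationWorker implementation: the DataConsistency module, effective_data_consistency, ExampleWorker and the flag name are hypothetical, and flag state is passed in as a plain hash instead of coming from a real feature flag system.

```ruby
# Hypothetical mixin that records the declared consistency and resolves the
# effective one based on feature flag state.
module DataConsistency
  VALID = %i[always sticky delayed].freeze

  def data_consistency(value, feature_flag: nil)
    raise ArgumentError, "unknown data_consistency: #{value}" unless VALID.include?(value)

    @data_consistency = value
    @data_consistency_flag = feature_flag
  end

  # flags is a plain hash of flag name => enabled (assumption for this sketch).
  def effective_data_consistency(flags)
    return :always if @data_consistency.nil?
    return :always if @data_consistency_flag && !flags[@data_consistency_flag]

    @data_consistency
  end
end

class ExampleWorker
  extend DataConsistency

  data_consistency :delayed, feature_flag: :load_balancing_for_example_worker
end

flag_off = ExampleWorker.effective_data_consistency({ load_balancing_for_example_worker: false })
flag_on = ExampleWorker.effective_data_consistency({ load_balancing_for_example_worker: true })
```

With the flag disabled the worker behaves as if it had never been annotated, so rolling back only requires turning the flag off.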

Sidekiq Sticky Connections

  • If the Sidekiq job has data_consistency configured, then at the moment it is scheduled we need to store the current primary PostgreSQL WAL (Write Ahead Log) position, if a write was performed.
  • If no write was performed, the replica used by the Sidekiq client can still differ from the replica used by the Sidekiq server, so we store the last write-ahead log location replayed during recovery on the client's replica.
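The scheduling step could look something like the sketch below. It is a guess at the shape of the logic, not the real client middleware: the Session struct, attach_wal_location and both job keys are hypothetical names, and treating :always as "nothing to store" is an assumption about what "data_consistency configured" means.

```ruby
# Hypothetical session state captured while the job is being scheduled.
Session = Struct.new(
  :performed_write,
  :primary_write_location,
  :replica_replay_location,
  keyword_init: true
)

# Returns a copy of the job payload with the WAL location the server should
# wait for. The key names are illustrative only.
def attach_wal_location(job, session, data_consistency:)
  # Assumption: jobs that always use the primary need no location.
  return job if data_consistency == :always

  if session.performed_write
    # A write happened: the replica must reach the primary's write location.
    job.merge('database_write_location' => session.primary_write_location)
  else
    # No write: the server's replica must be at least as fresh as the
    # replica the client was reading from.
    job.merge('database_replica_location' => session.replica_replay_location)
  end
end

job = { 'class' => 'ExampleWorker', 'args' => [42] }

after_write = attach_wal_location(
  job,
  Session.new(performed_write: true, primary_write_location: '0/3000060',
              replica_replay_location: '0/3000000'),
  data_consistency: :delayed
)

read_only = attach_wal_location(
  job,
  Session.new(performed_write: false, primary_write_location: nil,
              replica_replay_location: '0/3000000'),
  data_consistency: :sticky
)
```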

Sidekiq Sticky Connections

  • When the job is executed by the Sidekiq server, the job contains the database write location, which points to the write-ahead log location.
  • We compare this location with the current replica.
  • If the current replica has a WAL pointer at or beyond the provided location, we know it is in sync and we can safely use the replica for our read-only queries.
  • If the current replica has not caught up and data_consistency is set to :delayed, we retry the Sidekiq job.
  • If the replica has still not caught up the second time, we fall back to the primary.
  • If data_consistency is set to :sticky, we immediately fall back to the primary.
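The decision rules above can be condensed into a small function. This is a sketch of the described behaviour only, not the server middleware itself: choose_database, its parameters and the returned symbols are hypothetical, and whether the replica has caught up is passed in as a boolean rather than queried.

```ruby
# Decides what a job should do given its data_consistency, whether the
# current replica has caught up to the job's WAL location, and how many
# times the job has already been retried for replication lag.
# Returns :replica, :primary or :retry.
def choose_database(data_consistency:, replica_caught_up:, retry_count: 0)
  return :primary if data_consistency == :always
  return :replica if replica_caught_up

  case data_consistency
  when :sticky
    # Needs to run as fast as possible: do not wait for the replica.
    :primary
  when :delayed
    # Wait once for the replica, then give up and use the primary.
    retry_count.zero? ? :retry : :primary
  else
    raise ArgumentError, "unknown data_consistency: #{data_consistency}"
  end
end

first_attempt = choose_database(data_consistency: :delayed, replica_caught_up: false)
second_attempt = choose_database(data_consistency: :delayed, replica_caught_up: false, retry_count: 1)
sticky_lagging = choose_database(data_consistency: :sticky, replica_caught_up: false)
```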

Metrics

Future work

Questions
