By @carlossanchezp - 2016 - ConferenciaRoR

Bulk load data with
ActiveRecord and NoSQL

The only card I need is .......The Ace Of Spades

Carlos Sánchez Pérez

Pragmatic developer, Ruby/Rails choice at the moment, Learning Scala, Elixir/Phoenix & DDD. Currently working as Web Dev at @The_Cocktail.

 

Let's make it happen

Blog (only in spanish):

carlossanchezperez.wordpress.com

Twitter:

@carlossanchezp

 

Based on Real projects with Ruby On Rails

My experience

1.-Project called Searcher:

Bulk load 300k per week

  • File CSV data
  • Active Record
  • Medium Traffic Volume
  • Create
  • Update
  • Delete
  • MySQL - 300k 
  • Thinking Sphinx full-text search tool

2.-Project called E-commerce

Bulk load 300k per day

  • Connection with external services
  • Hight Traffic Volume
  • Create
  • Update
  • Logical deletion products
  • CouchDB - 1M products
  • Search tool ElasticSearch

Overview

1.-Project called Searcher:

 

  1. Architecture
  2. Models
  3. My First solution - Arrays
  4. Mistakes lead to good things
  5. Tips & Ideas
  6. With & without Active Record

2.-Project called E-commerce

 

  1. Architecture
  2. Real time
  3. Document design - NoSQL
  4. Superimpose processes
  5. Data load & synchronization

4.-Conclusions

RoadMap

3.-Logs & the power of information

1.-Project Searcher

Data load with Active Record

1.1-Architecture

ARCHITECTURE

1.2- Models

Main Models

1.3- My First Solution

Arrays & Performance

First Solution - Arrays

First Solution - Arrays

First Solution - Arrays

First Solution - Arrays

Solution method

It takes a LOT of time

1) Penalized performance with "not in" with a big array (a lot of elements)

 

2) "uniq" because we can have duplicated elements in the array

 

3) "map" because we only need ID's to destroy them

The reason why

AR suffers a lot with this solution 

We realised we had to go over all these things:

  • Bad performance
  • CSV file was read 3 times
  • Big arrays with ID's processed data & "not in" SQL
  • Several SQL "SELECT" where executed per CSV line

End process time was:

24 - 28 hours for 300,000 lines CSV file

1.4- Mistakes Lead To Good things

what happens if i change the way of doing things

Remember

Quick idea

  • Read file CVS
  • Professional ID's processed into Array
  • Delete method with "not processed" using "not in"
  • More than 24h to finish
  • A lot SQL for the same model per line read

1.4.1.Refactoring

Re-Thinking load process

 def process_file

      




        set_ids_centers       = Set.new get_ids_centers
        set_ids_professionals = Set.new get_ids_professionals

Set with ID's 


   # ID's with SET
   pop_ids(@set_ids_centers,center.id)
      
   pop_ids(@set_ids_professionals,professional.id)




    # Pops ID's to delete at the end
    def pop_ids(ids, id)
      ids.delete(id) if ids.include? id
    end



   

Check ID's from file into SET





# Unprocessed ID's 
def delete_unprocessed_ids(ids_centers,ids_professionals)
      
   Professional.destroy(ids_professionals.to_a) if ids_professionals
        

   Center.destroy(ids_centers.to_a) if ids_centers

end

Delete ID's from SET

It's much faster

 Less time consuming

We are getting better...

1.4.2- Tips & Ideas

 

1.4.2.1-Set vs Array

The reason why

Benchmark.bm do |x|
  x.report "Set" do
    a=Set.new
    1000.times{a.add(rand(100))}
  end
  x.report "Array" do
    a=[]
    1000.times{r=rand(100);a<<r unless a.include?(r)}
  end  
end




       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.001127)
Array  0.000000   0.000000   0.000000 (  0.000829)

Array vs Set

Benchmark.bm do |x|
  x.report "Set" do
    a=Set.new
    1000.times{a.add(rand(1000))}
  end
  x.report "Array" do
    a=[]
    1000.times{r=rand(1000);a<<r unless a.include?(r)}
  end
end




       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.003271)
Array  0.000000   0.000000   0.000000 (  0.003250)

Array vs Set

Benchmark.bm do |x|
  x.report "Set" do
    a=Set.new
    1000.times{a.add(rand(10000))}
  end
  x.report "Array" do
    a=[]
    1000.times{r=rand(10000);a<<r unless a.include?(r)}
  end    
end

       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.001240)
Array  0.010000   0.000000   0.010000 (  0.010081)
       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.001545)
Array  0.010000   0.010000   0.020000 (  0.008941)

Array vs Set

the best way that can i use both of them

1) With a small number of elements is much better Array

 

2) With a large number of elements is much better Set

1.4.2.2.-Pluck vs Map

users = User.all
=> [#<User id: 1, email: 'csanchez@example.com', active: true>, 
    #<User id: 2, email: 'cperez@example.com', active: false>]

users.map(&:email)
=> ['csanchez@example.com', 'cperez@example.com']

# always i use: User.all.map(&:email)

emails = User.select(:email)
=> [#<User email: 'csanchez@example.com'>, #<User email: 'cperez@example.com'>]

emails.map(&:email)
=> ['csanchez@example.com', 'cperez@example.com']




User.pluck(:email)
=> ['csanchez@example.com', 'cperez@example.com']

User.where(active:true).pluck(:email)

Pluck vs Map - examples

  ActiveRecord::Base.logger.level = 1
  n = 1000
  Benchmark.bm do |x|
    x.report('Country.all.map(&:name):      ') { n.times { Country.all.map(&:name) } }
    
    x.report('Coutry.pluck(:name):          ') { n.times { Country.pluck(:name) } }
  end



## Score
                                    user     system      total        real
Country.all.map(&:name):        3.830000   0.140000   3.970000 (  4.328655)

Coutry.pluck(:name):            1.550000   0.040000   1.590000 (  1.879490)

Benchmark Pluck vs map

    
      # How to load

      set_ids_centers = Set.new get_ids_centers
      set_ids_professionals = Set.new get_ids_professionals


      # Method only return id's
      def get_ids_centers
        Center.pluck(:id)
      end

      def get_ids_professionals
        Professional.pluck(:id)
      end

Load element into SET

1.4.2.3-We need validations

Do we need model validations all along the load process?

  create_table "professionals", force: true do |t|
    t.string   "email",                                         null: false
    t.string   "first_name",                                    null: false
    t.string   "last_name",                                     null: false
    t.string   "personal_web",              default: "http://"
    t.string   "telephone"
    t.boolean  "show_telephone",            default: true,      null: false
    t.boolean  "show_email",                default: true,      null: false
    t.text     "cv"
    t.integer  "update_check",              default: 0
    t.boolean  "delta",                     default: true,      null: false
    t.integer  "type_id",                   default: 0,         null: false
    t.string   "languages"

    t.string   "twitter"
    t.string   "numbercol",      limit: 30
    t.boolean  "active",                    default: true,      null: false

Validate - Data Base

class Professional < ActiveRecord::Base
  include NestedAttributeList, FriendlyId

  # Attributes
  friendly_id :full_name, use: :slugged

  # Validations
  validates :email, uniqueness: true, case_sensitive: false, allow_blank: true

  validate :first_name, present: true
  validate :last_name, present: true

  validate :type_id, present: true
 

Validate - Models

If i use these:

  • Cache Model
  • Model polymorphic

the answer was..... "not always"

Skip presence validations while loading.....

and delegate to DB validations

    def skip_validations
      # Skip presence validations while loading, and delegate to DB validations
      skip_presence_validation(Address, :country)
      skip_presence_validation(Address, :province)
      skip_presence_validation(Address, :city)
      skip_presence_validation(Skill,   :professional)
      skip_presence_validation(SpecialitySpecialist, :speciality)
      skip_presence_validation(SpecialitySpecialist, :specialist)
      skip_presence_validation(ProfessionalCenter, :professional)
      skip_presence_validation(ProfessionalCenter, :center)
      skip_presence_validation(InsuranceCompanyPartner, :insurance_company)
      skip_presence_validation(InsuranceCompanyPartner, :partner)
    end

Delegate to BBDD

    def skip_presence_validation(model_class, field)
      validators = model_class._validators[field]
      validators.reject! do |validator|
        validator.is_a?(ActiveRecord::Validations::PresenceValidator)
      end
      model_class._validators.delete(field) if validators.empty?

      empty_callbacks = []
      callbacks = model_class._validate_callbacks
      callbacks.each do |callback|
        if callback.name == :validate && callback.filter.is_a?(ActiveRecord::Validations::PresenceValidator) && callback.filter.attributes.include?(field)
          callback.filter.attributes.delete(field)
          empty_callbacks << callback if callback.filter.attributes.empty?
        end
      end
      empty_callbacks.each {|c| callbacks.delete(c) }
    end

Delegate to BBDD

class Address < ActiveRecord::Base

  # Relations
  belongs_to  :country, inverse_of: :addresses
  belongs_to  :city, inverse_of: :addresses
  belongs_to  :province, inverse_of: :addresses

  has_many    :centers, inverse_of: :address

  # Validations
  validates :country, :city, :province, presence: true
  validates :name, presence: true

  # Delegations
  delegate :name, :code, to: :country, prefix: true, allow_nil: true
  delegate :name, to: :city, prefix: true, allow_nil: true
  delegate :name, to: :province, prefix: true, allow_nil: true

end

Delegate to BBDD

class SpecialitySpecialist < ActiveRecord::Base

  # Relations
  belongs_to :speciality, inverse_of: :speciality_specialists
  belongs_to :specialist, polymorphic: true

  # Validations
  validates :speciality, presence: true
  validates :specialist, presence: true

end

Delegate to BBDD

Less time consuming

1.4.2.4-Cache models

4 select AR per line reading - file

   def cached_tables
      {
        cities:       City.all.index_by {
            |c| "#{c.province_id}-#{c.external_id}" },
        provinces:    Province.all.index_by(&:external_id),
        countries:    Country.all.index_by(&:external_id),
        specialities: Speciality.all.index_by(&:external_id),
        insurance_companies: InsuranceCompany.all.to_a,
      }
    end

Cached tables - Hash

    def cities
      @caches[:cities]
    end

    def provinces
      @caches[:provinces]
    end

    def countries
      @caches[:countries]
    end

    def specialities
      @caches[:specialities]
    end

    def insurance_companies
      @caches[:insurance_companies]
    end

Cached tables - Method

      
def find_or_create_city(province, country)
   city = cities["#{province.id}-#{row.city_attributes[:external_id]}"] 
                || City.new

   city.attributes = row.city_attributes.
                 merge(province: province, country: country)
   city.save! if city.changed?
   cities["#{city.province_id}-#{city.external_id}"] = city
   city
end

Method

Less time consuming

1.5-Result of this ideas and tips

Score

  • Time process more or less 4 hours
  • 400,000 AR select less 
  • Code was more readable

1.6-With & without ActiveRecord


TIMES = 10000

def do_inserts
    TIMES.times { User.create(:user_id => 1, :sku => 12, :delta => 1) }
end




Benchmark.measure { ActiveRecord::Base.transaction { do_inserts } }

Benchmark.measure {  do_inserts } 



CODE

CONN = ActiveRecord::Base.connection
TIMES = 10000

def raw_sql
    TIMES.times { CONN.execute "INSERT INTO `user` 
(`delta`, `updated_at`, `sku`, `user_id`) 
VALUES(1, '2015-11-21 20:21:13', 12, 1)" }
end


Benchmark.measure { ActiveRecord::Base.transaction { raw_sql } }
Benchmark.measure { raw_sql } 

CODE

CONN = ActiveRecord::Base.connection
TIMES = 10000


def mass_insert
    inserts = []
    TIMES.times do
        inserts.push "(1, '2015-11-21 20:21:13', 12, 1)"
    end
    sql = "INSERT INTO user (`delta`, `updated_at`, `sku`, 
                            `user_id`) 
           VALUES #{inserts.join(", ")}"
    CONN.execute sql
end

Benchmark.measure { mass_insert }

CODE




ActiveRecord without transaction:
 14.930000   0.640000  15.570000 ( 18.898352)
ActiveRecord with transaction:
 13.420000   0.310000  13.730000 ( 14.619136)
  1.29x faster than base



Raw SQL without transaction:
  0.920000   0.170000   1.090000 (  3.731032)
  5.07x faster than base
Raw SQL with transaction:
  0.870000   0.150000   1.020000 (  1.648834)
  11.46x faster than base



Only Mass Insert:
  0.000000   0.000000   0.000000 (  0.268634)
  70.35x faster than base

Score

Remember,  knowledge, benchmark and refactoring

2.Project E-Commerce

Real time load data

2.1-Architecture

Architecture

2.2.Real Time

Frontend

Provider 1

Provider 2

2.3.Document design - NoSQL

Documents

2.4.Process Time & Superimpose process

*/30 * * * * flock -n /tmp/cron.txt.lock sh -c 'cd /var/www/project/current && bundle exec rake load:parse' || sh -c 'echo MyProject already running; ps; ls /tmp/*.lock'

crontab S.O.

2.5-Data load & synchronization

Sync CouchDB & Elastic

3- LOGS & The power of information

  • How many items have been loaded?
  • How many errors have you had?
  • How many elements with a specific class have  been loaded?
  • How many items have been removed / changed?
  • How many items did not meet load conditions?
  • .....

Logs should answer questions like these:

feeds logs

logs with information

LOG data

  • Transaction ID
  • Time of Web Services
  • Differents Logs
  • We need to store more info at the beginning  of the project
#  warnings, errors o info

Rails.logger.warn "cuidado no dispone de...."

Rails.logger.error "Error en..."

Rails.logger.info "RESPONSE TOKEN: #{token_info["access_token"]}"


# way to use

Rails.logger.tagged "MYPROJECT" do
    Rails.logger.tagged "GET_OFFERS_BY_BLOCK" do
    end
end

Rails logger

4. Conclusion

Final summary

  • The more ActiveRecord you know, the better
  • Things to take into consideration: speed, amount of data, real time or not real time and the architecture components
  • Design of your documents and Models
  • Cache as much as possible
  • Re-think everything and refactor
  • The power of Logging 

and especially ... it's very very important

share your experiences

 That's all Folks!!

By @carlossanchezp - 2016 - ConferenciaRoR

Bulk load data with AR & NoSQL - ConferenciaRoR 2016

By Carlos Sánchez Pérez

Bulk load data with AR & NoSQL - ConferenciaRoR 2016

@carlossanchezp

  • 1,971