By @carlossanchezp - 2016 - ConferenciaRoR

Bulk load data with
ActiveRecord and NoSQL

The only card I need is .......The Ace Of Spades

Carlos Sánchez Pérez

Pragmatic developer, Ruby/Rails as my current choice, learning Scala, Elixir/Phoenix & DDD. Currently working as a Web Dev at @The_Cocktail.

 

Let's make it happen

Blog (only in spanish):

carlossanchezperez.wordpress.com

Twitter:

@carlossanchezp

 

Based on real projects with Ruby on Rails

My experience

1.-Project called Searcher:

Bulk load of 300k records per week

  • CSV data file
  • Active Record
  • Medium traffic volume
  • Create
  • Update
  • Delete
  • MySQL - 300k
  • Thinking Sphinx full-text search tool

2.-Project called E-commerce

Bulk load of 300k records per day

  • Connection with external services
  • High traffic volume
  • Create
  • Update
  • Logical deletion of products
  • CouchDB - 1M products
  • Elasticsearch search tool

Overview

1.-Project called Searcher:

 

  1. Architecture
  2. Models
  3. My First solution - Arrays
  4. Mistakes lead to good things
  5. Tips & Ideas
  6. With & without Active Record

2.-Project called E-commerce

 

  1. Architecture
  2. Real time
  3. Document design - NoSQL
  4. Overlapping processes
  5. Data load & synchronization

3.-Logs & the power of information

4.-Conclusions

RoadMap

1.-Project Searcher

Data load with Active Record

1.1-Architecture

ARCHITECTURE

1.2- Models

Main Models

1.3- My First Solution

Arrays & Performance

First Solution - Arrays

Solution method

It takes a LOT of time

1) Performance is penalized by "NOT IN" with a big array (lots of elements)

 

2) "uniq" because we can have duplicated elements in the array

 

3) "map" because we only need the IDs to destroy them

The reason why

AR suffers a lot with this solution 
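A minimal sketch of what this first Array-based pass looked like (the helper names and CSV columns are assumptions, not the original code):

require "csv"

def process_file_v1(path)
  processed = []

  CSV.foreach(path, headers: true) do |row|
    # Several SELECTs fired for every CSV line
    professional = Professional.find_by(external_id: row["external_id"]) ||
                   Professional.new(external_id: row["external_id"])
    professional.update!(first_name: row["first_name"], last_name: row["last_name"])
    processed << professional
  end

  # 2) uniq: duplicated elements   3) map: only the IDs are needed
  ids = processed.uniq.map(&:id)

  # 1) a "NOT IN" over a huge array penalizes performance
  Professional.where.not(id: ids).destroy_all
end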

We realised we had to go over all these things:

  • Bad performance
  • The CSV file was read 3 times
  • Big arrays of processed IDs & a "NOT IN" SQL clause
  • Several SQL "SELECT"s were executed per CSV line

End process time was:

24 - 28 hours for a 300,000-line CSV file

1.4- Mistakes Lead To Good things

What happens if I change the way of doing things?

Remember

Quick idea

  • Read the CSV file
  • Processed professional IDs go into an Array
  • Delete method for "not processed" records using "NOT IN"
  • More than 24h to finish
  • A lot of SQL for the same model per line read

1.4.1.Refactoring

Re-Thinking load process

 def process_file
        # Load all existing IDs into Sets before reading the CSV
        set_ids_centers       = Set.new get_ids_centers
        set_ids_professionals = Set.new get_ids_professionals

Sets with IDs


   # IDs with SET
   pop_ids(@set_ids_centers, center.id)

   pop_ids(@set_ids_professionals, professional.id)


   # Pops IDs to delete at the end
   def pop_ids(ids, id)
     ids.delete(id) if ids.include? id
   end



   

Check IDs from the file against the SET





# Unprocessed IDs
def delete_unprocessed_ids(ids_centers, ids_professionals)
  Professional.destroy(ids_professionals.to_a) if ids_professionals
  Center.destroy(ids_centers.to_a) if ids_centers
end

Delete the IDs left in the SET
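Putting the fragments together, the refactored pass looks roughly like this (process_row and the CSV reading details are assumptions, not the original code):

require "csv"
require "set"

def process_file(path)
  # One pluck per table instead of one SELECT per CSV line
  @set_ids_centers       = Set.new get_ids_centers
  @set_ids_professionals = Set.new get_ids_professionals

  CSV.foreach(path, headers: true) do |row|
    center, professional = process_row(row)   # creates or updates from the row
    pop_ids(@set_ids_centers, center.id)
    pop_ids(@set_ids_professionals, professional.id)
  end

  # Whatever is still in the Sets was not in the file: delete it at the end
  delete_unprocessed_ids(@set_ids_centers, @set_ids_professionals)
end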

It's much faster

 Less time consuming

We are getting better...

1.4.2- Tips & Ideas

 

1.4.2.1-Set vs Array

The reason why

Benchmark.bm do |x|
  x.report "Set" do
    a=Set.new
    1000.times{a.add(rand(100))}
  end
  x.report "Array" do
    a=[]
    1000.times{r=rand(100);a<<r unless a.include?(r)}
  end  
end




       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.001127)
Array  0.000000   0.000000   0.000000 (  0.000829)

Array vs Set

Benchmark.bm do |x|
  x.report "Set" do
    a=Set.new
    1000.times{a.add(rand(1000))}
  end
  x.report "Array" do
    a=[]
    1000.times{r=rand(1000);a<<r unless a.include?(r)}
  end
end




       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.003271)
Array  0.000000   0.000000   0.000000 (  0.003250)

Array vs Set

Benchmark.bm do |x|
  x.report "Set" do
    a=Set.new
    1000.times{a.add(rand(10000))}
  end
  x.report "Array" do
    a=[]
    1000.times{r=rand(10000);a<<r unless a.include?(r)}
  end    
end

       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.001240)
Array  0.010000   0.000000   0.010000 (  0.010081)
       user     system      total        real
Set    0.000000   0.000000   0.000000 (  0.001545)
Array  0.010000   0.010000   0.020000 (  0.008941)

Array vs Set

The best way to use both of them

1) With a small number of elements, an Array is faster

 

2) With a large number of elements, a Set is much faster

1.4.2.2.-Pluck vs Map

users = User.all
=> [#<User id: 1, email: 'csanchez@example.com', active: true>, 
    #<User id: 2, email: 'cperez@example.com', active: false>]

users.map(&:email)
=> ['csanchez@example.com', 'cperez@example.com']

# what I always used: User.all.map(&:email)

emails = User.select(:email)
=> [#<User email: 'csanchez@example.com'>, #<User email: 'cperez@example.com'>]

emails.map(&:email)
=> ['csanchez@example.com', 'cperez@example.com']




User.pluck(:email)
=> ['csanchez@example.com', 'cperez@example.com']

User.where(active: true).pluck(:email)

Pluck vs Map - examples

  ActiveRecord::Base.logger.level = 1
  n = 1000
  Benchmark.bm do |x|
    x.report('Country.all.map(&:name):      ') { n.times { Country.all.map(&:name) } }
    
    x.report('Country.pluck(:name):          ') { n.times { Country.pluck(:name) } }
  end



## Score
                                    user     system      total        real
Country.all.map(&:name):        3.830000   0.140000   3.970000 (  4.328655)

Country.pluck(:name):           1.550000   0.040000   1.590000 (  1.879490)

Benchmark Pluck vs map

    
      # How to load

      set_ids_centers = Set.new get_ids_centers
      set_ids_professionals = Set.new get_ids_professionals


      # Methods that only return IDs
      def get_ids_centers
        Center.pluck(:id)
      end

      def get_ids_professionals
        Professional.pluck(:id)
      end

Load elements into the SET

1.4.2.3-We need validations

Do we need model validations all along the load process?

  create_table "professionals", force: true do |t|
    t.string   "email",                                         null: false
    t.string   "first_name",                                    null: false
    t.string   "last_name",                                     null: false
    t.string   "personal_web",              default: "http://"
    t.string   "telephone"
    t.boolean  "show_telephone",            default: true,      null: false
    t.boolean  "show_email",                default: true,      null: false
    t.text     "cv"
    t.integer  "update_check",              default: 0
    t.boolean  "delta",                     default: true,      null: false
    t.integer  "type_id",                   default: 0,         null: false
    t.string   "languages"

    t.string   "twitter"
    t.string   "numbercol",      limit: 30
    t.boolean  "active",                    default: true,      null: false

Validate - Data Base

class Professional < ActiveRecord::Base
  include NestedAttributeList, FriendlyId

  # Attributes
  friendly_id :full_name, use: :slugged

  # Validations
  validates :email, uniqueness: { case_sensitive: false }, allow_blank: true

  validates :first_name, presence: true
  validates :last_name,  presence: true

  validates :type_id, presence: true
end

Validate - Models

What if I use these:

  • Cached models
  • Polymorphic models

The answer was..... "not always"

Skip presence validations while loading.....

and delegate to DB validations

    def skip_validations
      # Skip presence validations while loading, and delegate to DB validations
      skip_presence_validation(Address, :country)
      skip_presence_validation(Address, :province)
      skip_presence_validation(Address, :city)
      skip_presence_validation(Skill,   :professional)
      skip_presence_validation(SpecialitySpecialist, :speciality)
      skip_presence_validation(SpecialitySpecialist, :specialist)
      skip_presence_validation(ProfessionalCenter, :professional)
      skip_presence_validation(ProfessionalCenter, :center)
      skip_presence_validation(InsuranceCompanyPartner, :insurance_company)
      skip_presence_validation(InsuranceCompanyPartner, :partner)
    end

Delegate to the DB

    def skip_presence_validation(model_class, field)
      validators = model_class._validators[field]
      validators.reject! do |validator|
        validator.is_a?(ActiveRecord::Validations::PresenceValidator)
      end
      model_class._validators.delete(field) if validators.empty?

      empty_callbacks = []
      callbacks = model_class._validate_callbacks
      callbacks.each do |callback|
        if callback.name == :validate && callback.filter.is_a?(ActiveRecord::Validations::PresenceValidator) && callback.filter.attributes.include?(field)
          callback.filter.attributes.delete(field)
          empty_callbacks << callback if callback.filter.attributes.empty?
        end
      end
      empty_callbacks.each {|c| callbacks.delete(c) }
    end

Delegate to the DB

class Address < ActiveRecord::Base

  # Relations
  belongs_to  :country, inverse_of: :addresses
  belongs_to  :city, inverse_of: :addresses
  belongs_to  :province, inverse_of: :addresses

  has_many    :centers, inverse_of: :address

  # Validations
  validates :country, :city, :province, presence: true
  validates :name, presence: true

  # Delegations
  delegate :name, :code, to: :country, prefix: true, allow_nil: true
  delegate :name, to: :city, prefix: true, allow_nil: true
  delegate :name, to: :province, prefix: true, allow_nil: true

end

Delegate to the DB

class SpecialitySpecialist < ActiveRecord::Base

  # Relations
  belongs_to :speciality, inverse_of: :speciality_specialists
  belongs_to :specialist, polymorphic: true

  # Validations
  validates :speciality, presence: true
  validates :specialist, presence: true

end

Delegate to the DB

Less time consuming

1.4.2.4-Cache models

4 AR SELECTs per line read from the file

   def cached_tables
      {
        cities:       City.all.index_by {
            |c| "#{c.province_id}-#{c.external_id}" },
        provinces:    Province.all.index_by(&:external_id),
        countries:    Country.all.index_by(&:external_id),
        specialities: Speciality.all.index_by(&:external_id),
        insurance_companies: InsuranceCompany.all.to_a,
      }
    end

Cached tables - Hash

    def cities
      @caches[:cities]
    end

    def provinces
      @caches[:provinces]
    end

    def countries
      @caches[:countries]
    end

    def specialities
      @caches[:specialities]
    end

    def insurance_companies
      @caches[:insurance_companies]
    end

Cached tables - Method

      
def find_or_create_city(province, country)
   city = cities["#{province.id}-#{row.city_attributes[:external_id]}"] ||
          City.new

   city.attributes = row.city_attributes.
                 merge(province: province, country: country)
   city.save! if city.changed?
   cities["#{city.province_id}-#{city.external_id}"] = city
   city
end

Method
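A hypothetical per-row usage, just to show how the cached hashes replace the four SELECTs (row and the *_attributes helpers are assumptions):

# Resolve lookups from the in-memory caches instead of querying per line
country  = countries[row.country_attributes[:external_id]]
province = provinces[row.province_attributes[:external_id]]
city     = find_or_create_city(province, country)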

Less time consuming

1.5-Results of these ideas and tips

Score

  • Process time of roughly 4 hours
  • 400,000 fewer AR SELECTs
  • The code was more readable

1.6-With & without ActiveRecord


TIMES = 10000

def do_inserts
    TIMES.times { User.create(:user_id => 1, :sku => 12, :delta => 1) }
end




Benchmark.measure { ActiveRecord::Base.transaction { do_inserts } }

Benchmark.measure {  do_inserts } 



CODE

CONN = ActiveRecord::Base.connection
TIMES = 10000

def raw_sql
    TIMES.times { CONN.execute "INSERT INTO `user` 
(`delta`, `updated_at`, `sku`, `user_id`) 
VALUES(1, '2015-11-21 20:21:13', 12, 1)" }
end


Benchmark.measure { ActiveRecord::Base.transaction { raw_sql } }
Benchmark.measure { raw_sql } 

CODE

CONN = ActiveRecord::Base.connection
TIMES = 10000


def mass_insert
    inserts = []
    TIMES.times do
        inserts.push "(1, '2015-11-21 20:21:13', 12, 1)"
    end
    sql = "INSERT INTO user (`delta`, `updated_at`, `sku`, 
                            `user_id`) 
           VALUES #{inserts.join(", ")}"
    CONN.execute sql
end

Benchmark.measure { mass_insert }

CODE




ActiveRecord without transaction:
 14.930000   0.640000  15.570000 ( 18.898352)
ActiveRecord with transaction:
 13.420000   0.310000  13.730000 ( 14.619136)
  1.29x faster than base



Raw SQL without transaction:
  0.920000   0.170000   1.090000 (  3.731032)
  5.07x faster than base
Raw SQL with transaction:
  0.870000   0.150000   1.020000 (  1.648834)
  11.46x faster than base



Only Mass Insert:
  0.000000   0.000000   0.000000 (  0.268634)
  70.35x faster than base

Score

Remember: knowledge, benchmarks and refactoring

2.Project E-Commerce

Real time load data

2.1-Architecture

Architecture

2.2.Real Time

Frontend

Provider 1

Provider 2

2.3.Document design - NoSQL

Documents
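As an illustration only, a hypothetical shape for one CouchDB product document (the field names are assumptions, not the real schema):

# Hypothetical product document, stored as JSON in CouchDB
product_doc = {
  "_id"        => "product-000123",
  "type"       => "product",
  "sku"        => "SKU-000123",
  "name"       => "Wireless mouse",
  "price"      => 19.95,
  "provider"   => "provider_1",
  "active"     => true,              # logical deletion flag instead of removing the doc
  "updated_at" => "2016-03-01T10:00:00Z"
}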

2.4.Process time & overlapping processes

*/30 * * * * flock -n /tmp/cron.txt.lock sh -c 'cd /var/www/project/current && bundle exec rake load:parse' || sh -c 'echo MyProject already running; ps; ls /tmp/*.lock'

OS crontab
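The cron entry relies on flock for mutual exclusion, so two loads never overlap. A minimal sketch of the load:parse rake task it launches (the task body and FeedLoader are assumptions):

# lib/tasks/load.rake (hypothetical sketch)
namespace :load do
  desc "Parse the provider feeds and load them"
  task parse: :environment do
    Rails.logger.tagged "MYPROJECT" do
      # flock in the crontab already guarantees a single run at a time,
      # so the task itself can stay simple
      FeedLoader.new.run   # hypothetical loader class
    end
  end
end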

2.5-Data load & synchronization

Sync CouchDB & Elastic
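A minimal sketch of one way to keep Elasticsearch in sync with CouchDB, polling the _changes feed over plain HTTP (URLs, index name and document ids are assumptions):

require "net/http"
require "json"
require "uri"

COUCH = "http://localhost:5984/products"
ES    = "http://localhost:9200/products/product"

# Read the CouchDB changes feed since the last processed sequence number
def sync_changes(since = 0)
  uri  = URI("#{COUCH}/_changes?include_docs=true&since=#{since}")
  body = JSON.parse(Net::HTTP.get(uri))

  body["results"].each do |change|
    doc = change["doc"]
    next if doc.nil?

    # Index (or re-index) the document in Elasticsearch under the same id
    es_uri = URI("#{ES}/#{doc['_id']}")
    Net::HTTP.start(es_uri.host, es_uri.port) do |http|
      http.send_request("PUT", es_uri.request_uri, doc.to_json,
                        "Content-Type" => "application/json")
    end
  end

  body["last_seq"]   # store this and pass it in on the next run
end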

3- LOGS & The power of information

  • How many items have been loaded?
  • How many errors occurred?
  • How many elements of a specific class have been loaded?
  • How many items have been removed / changed?
  • How many items did not meet the load conditions?
  • .....

Logs should answer questions like these:
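A minimal sketch of a stats object that can answer those questions at the end of a run (the counter names are assumptions):

# Hypothetical per-run counters, dumped to the log when the load finishes
class LoadStats
  COUNTERS = [:loaded, :updated, :deleted, :rejected, :errors]

  def initialize
    @counts = Hash.new(0)
  end

  def bump(counter)
    @counts[counter] += 1
  end

  def report
    COUNTERS.each do |c|
      Rails.logger.info "LOAD SUMMARY #{c}: #{@counts[c]}"
    end
  end
end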

feeds logs

logs with information

LOG data

  • Transaction ID
  • Time spent in web service calls
  • Different logs
  • We needed to store more info from the beginning of the project
#  warnings, errors or info

Rails.logger.warn "careful, it does not have ...."

Rails.logger.error "Error in..."

Rails.logger.info "RESPONSE TOKEN: #{token_info["access_token"]}"


# how to use tagged logging

Rails.logger.tagged "MYPROJECT" do
    Rails.logger.tagged "GET_OFFERS_BY_BLOCK" do
    end
end

Rails logger
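Tying it together, a hypothetical example of tagging each run with a transaction ID and logging how long a web service call takes (the provider client and method names are assumptions):

require "securerandom"
require "benchmark"

transaction_id = SecureRandom.uuid

Rails.logger.tagged "MYPROJECT", transaction_id do
  elapsed = Benchmark.realtime do
    # hypothetical call to the provider's web service
    @offers = provider_client.get_offers_by_block(block_id)
  end
  Rails.logger.info "GET_OFFERS_BY_BLOCK took #{(elapsed * 1000).round}ms"
end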

4. Conclusion

Final summary

  • The more ActiveRecord you know, the better
  • Things to take into consideration: speed, amount of data, real time or not, and the architecture components
  • Design of your documents and models
  • Cache as much as possible
  • Re-think everything and refactor
  • The power of logging

and especially ... it's very, very important:

share your experiences

 That's all Folks!!

Bulk load data with AR & NoSQL - ConferenciaRoR 2016

By Carlos Sánchez Pérez

@carlossanchezp