Carlos Sánchez Pérez
By @carlossanchezp - 2015 - MadridRB
Pragmatic developer, startup addict, Linux, Ruby & Rails lover, Android and Social TV.
Currently working as Web Dev at @The_Cocktail.
Let's make it happen
Blog:
carlossanchezperez.wordpress.com
Twitter: @carlossanchezp
My experience
The load intensity differs between the two processes
ROADMAP
CSV file
Rake task run weekly
Back
Front
A CSV file arrives whose data must be processed
CSV file
Load process
Active Record
Rake task
MySQL
Back
Front
CSV file
LOG file
Load process
Active Record
Sphinx
Rake task
MySQL
Back
Front
Center
Professional
has_many through:
belongs_to
Address
belongs_to
City
Province
Country
Speciality
has_many through: P and C (Professional and Center)
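Roughly, the diagram maps to these Rails associations (a sketch; the join model ProfessionalCenter is taken from code later in the deck, the rest of the wiring is an assumption):

class Center < ActiveRecord::Base
  has_many :professional_centers
  has_many :professionals, through: :professional_centers
  belongs_to :address
end

class Professional < ActiveRecord::Base
  has_many :professional_centers
  has_many :centers, through: :professional_centers
  belongs_to :address
end

class Address < ActiveRecord::Base
  belongs_to :city
  belongs_to :province
  belongs_to :country
end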
Read the professionals to handle from the CSV
Put them into an Array
Process the information read from the CSV
Delete the ones that were not in the CSV
def process_profesional(profesional_id)
  if profesional_id != "-1"
    profesional = User.find_by_external_id(profesional_id)
    @processed_profesionals << profesional.id if profesional
  end
end
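Driving those steps could look roughly like this (a hedged sketch; the file variable, CSV options and column name are assumptions, not the original code):

require 'csv'

@processed_profesionals = []

CSV.foreach(@csv_file, headers: true, col_sep: ';') do |row|
  process_profesional(row['profesional_id'])  # assumed column name
end

delete_unprocessed_professionals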
When it finishes, look up in the database the records that were not processed
Delete the leftovers
def delete_unprocessed_professionals
  puts "Deleting unprocessed professionals"
  to_delete_professionals = Professional.find(:all, :select => :id,
    :conditions => "id not in (#{@processed_profesionals.uniq.join(',')})").map(&:id)
  @deleted_professionals = User.destroy_all(:id => to_delete_professionals)
  puts "#{@deleted_professionals.size} professionals deleted"
end
The "NOT IN" with a many-element array hurts performance, as do the "uniq" and "map" calls
If you need a collection where order doesn't matter and the elements must be unique...
then "Set" is your best ally
>> require 'set'
>> s = Set.new([1,2,3])
=> #<Set: {1, 2, 3}>
>> [1,2,3,3].to_set
=> #<Set: {1, 2, 3}>
>> s = Set.new
>> s << 1
>> s.add 2
>> s.delete(1)
=> #<Set: {2}>
>> s.include? 1
=> false
>> s.include? 2
=> true
>> [1,2,3] ^ [2,3,4]
=> NoMethodError: undefined method `^' for [1, 2, 3]:Array
>> s1 = [1,2,3].to_set; s2 = [2,3,4].to_set
>> s1 ^ s2
=> #<Set: {4, 1}>
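Why it pays off: a Set is hash-backed, so membership checks are O(1) instead of Array's linear scan. A quick illustrative micro-benchmark (timings will vary):

require 'benchmark'
require 'set'

array = (1..100_000).to_a
set   = array.to_set

Benchmark.bm(16) do |x|
  x.report('Array#include?') { 10_000.times { array.include?(100_000) } }  # linear scan
  x.report('Set#include?')   { 10_000.times { set.include?(100_000) } }    # hash lookup
end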
NOTE: performance improves by changing how the data is handled, combined with Set
def process_file
  total_rows = %x{wc -l #{@utf8_file}}.split.first.to_i - 1
  row_count = 1
  msg = "===Init Load Processing===#{Time.now}======\r"
  IMPORTER_CSV_LOGGER.debug(msg)
  msg = "Processing Total #{total_rows}\r"
  IMPORTER_CSV_LOGGER.debug(msg)
  @set_ids_centers = Set.new(get_ids_centers)
  @set_ids_professionals = Set.new(get_ids_professionals)

# Processing the IDs with Set (inside the per-row loop)
pop_ids(@set_ids_centers, center.id)
pop_ids(@set_ids_professionals, professional.id)
# Method that removes an ID from the pending-deletion set
def pop_ids(ids, id)
  # Set#delete is already a no-op for missing elements, so the guard is optional
  ids.delete(id) if ids.include? id
end
# Final step: destroy the IDs that were never processed
def delete_unprocessed_ids(ids_centers, ids_professionals)
  Professional.destroy(ids_professionals.to_a) if ids_professionals
  Center.destroy(ids_centers.to_a) if ids_centers
end
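Putting it together, the Set-based flow could be driven like this (a hedged sketch; process_row is a hypothetical per-row handler, not the original code):

require 'csv'
require 'set'

@set_ids_centers = Set.new(get_ids_centers)
@set_ids_professionals = Set.new(get_ids_professionals)

CSV.foreach(@utf8_file, headers: true) do |row|
  center, professional = process_row(row)          # hypothetical per-row handler
  pop_ids(@set_ids_centers, center.id)             # seen in the CSV: keep it
  pop_ids(@set_ids_professionals, professional.id)
end

# whatever is still in the sets was not in the CSV: delete it
delete_unprocessed_ids(@set_ids_centers, @set_ids_professionals)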
users = User.all
=> [#<User id: 1, email: 'csanchez@example.com', active: true>,
#<User id: 2, email: 'cperez@example.com', active: false>]
users.map(&:email)
=> ['csanchez@example.com', 'cperez@example.com']
# The usual form is: User.all.map(&:email)
emails = User.select(:email)
=> [#<User email: 'csanchez@example.com'>, #<User email: 'cperez@example.com'>]
emails.map(&:email)
=> ['csanchez@example.com', 'cperez@example.com']
User.pluck(:email)
=> ['csanchez@example.com', 'cperez@example.com']
User.where(active:true).pluck(:email)
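pluck also accepts several columns (Rails 4+), returning arrays instead of model instances:

User.pluck(:id, :email)
=> [[1, 'csanchez@example.com'], [2, 'cperez@example.com']]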
# Loading the elements into our Set
set_ids_centers = Set.new get_ids_centers
set_ids_professionals = Set.new get_ids_professionals

# Methods that return the IDs directly
def get_ids_centers
  Center.pluck(:id)
end

def get_ids_professionals
  Professional.pluck(:id)
end
ActiveRecord::Base.logger.level = 1
n = 1000
Benchmark.bm do |x|
  x.report('Country.all.map(&:name): ') { n.times { Country.all.map(&:name) } }
  x.report('Country.pluck(:name): ')    { n.times { Country.pluck(:name) } }
end

## Results
                              user     system      total        real
Country.all.map(&:name):  3.830000   0.140000   3.970000 (  4.328655)
Country.pluck(:name):     1.550000   0.040000   1.590000 (  1.879490)
def process
  ActiveRecord::Base.transaction do
    add_insurance_company(center)
    add_speciality(center, speciality)
    add_speciality(center, subspeciality)
    pop_ids(@set_ids_centers, center.id)
    row
  end
end

Excerpt from one of the information-processing blocks
Validations in MySQL
Validations in the Model
The question is: do we need both during the load?
Validations in MySQL
create_table "professionals", force: true do |t|
  t.string  "email",          null: false
  t.string  "first_name",     null: false
  t.string  "last_name",      null: false
  t.string  "personal_web",   default: "http://"
  t.string  "telephone"
  t.boolean "show_telephone", default: true,  null: false
  t.boolean "show_email",     default: true,  null: false
  t.text    "cv"
  t.integer "update_check",   default: 0
  t.boolean "delta",          default: true,  null: false
  t.integer "type_id",        default: 0,     null: false
  t.string  "languages"
  t.string  "twitter"
  t.string  "numbercol",      limit: 30
  t.boolean "active",         default: true,  null: false
end
Validations in the Model
class Professional < ActiveRecord::Base
  include NestedAttributeList, FriendlyId

  # Attributes
  friendly_id :full_name, use: :slugged

  # Validations
  validates :email, uniqueness: { case_sensitive: false }, allow_blank: true
  validates :first_name, presence: true
  validates :last_name,  presence: true
  validates :type_id,    presence: true
end
def skip_validations
  # Skip presence validations while loading, and delegate to DB validations
  skip_presence_validation(Address, :country)
  skip_presence_validation(Address, :province)
  skip_presence_validation(Address, :city)
  skip_presence_validation(Skill, :doctor)
  skip_presence_validation(SpecialitySpecialist, :speciality)
  skip_presence_validation(SpecialitySpecialist, :specialist)
  skip_presence_validation(ProfessionalCenter, :professional)
  skip_presence_validation(ProfessionalCenter, :center)
  skip_presence_validation(InsuranceCompanyPartner, :insurance_company)
  skip_presence_validation(InsuranceCompanyPartner, :partner)
end
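skip_presence_validation itself is not shown in the deck; one way to get the effect, as a minimal hedged sketch, is a class-level switch flipped from the loader so the NOT NULL constraints in MySQL do the checking instead (all names here are assumptions):

# Hypothetical sketch, not the talk's actual helper
class Address < ActiveRecord::Base
  belongs_to :country

  cattr_accessor :bulk_loading  # class-level toggle

  validates :country, presence: true, unless: -> { Address.bulk_loading }
end

def skip_presence_validation(klass, _attribute)
  klass.bulk_loading = true  # coarse: disables every toggled presence check on klass
end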
def cached_tables
  {
    cities: City.all.index_by { |c| "#{c.province_id}-#{c.external_id}" },
    provinces: Province.all.index_by(&:external_id),
    countries: Country.all.index_by(&:external_id),
    specialities: Speciality.all.index_by(&:external_id),
    insurance_companies: InsuranceCompany.all.to_a,
  }
end
index_by reference: http://api.rubyonrails.org/classes/Enumerable.html
people.index_by(&:login)
=> { "nextangle" => <Person ...>, "chade-" => <Person ...>, ...}
people.index_by { |person| "#{person.first_name} #{person.last_name}" }
=> { "Chade- Fowlersburg-e" => <Person ...>,
"David Heinemeier Hansson" => <Person ...>, ...}
It converts an Enumerable into a Hash
# Accessors over the cache (@caches holds the result of cached_tables)
def cities
  @caches[:cities]
end

def provinces
  @caches[:provinces]
end

def countries
  @caches[:countries]
end

def specialities
  @caches[:specialities]
end

def insurance_companies
  @caches[:insurance_companies]
end
def find_or_create_city(province, country)
  city = cities["#{province.id}-#{row.city_attributes[:external_id]}"] || City.new
  city.attributes = row.city_attributes.merge(province: province, country: country)
  city.save! if city.changed?
  cities["#{city.province_id}-#{city.external_id}"] = city
  city
end
We save one query, plus the save when it isn't needed.
Overall, applying this caching idea saves roughly 400,000 queries during the load process.
On top of that, add the time of the saves that are no longer necessary.
msg = "Initial Total BBDD Centers #{set_ids_centers.size}
Doctors #{set_ids_doctors.size}\r"
IMPORTER_CSV_LOGGER.debug(msg)
CONN = ActiveRecord::Base.connection
TIMES = 10000

def do_inserts
  TIMES.times { User.create(:user_id => 1, :sku => 12, :delta => 1) }
end

def raw_sql
  TIMES.times { CONN.execute "INSERT INTO `user` (`delta`, `updated_at`, `sku`, `user_id`)
    VALUES (1, '2015-11-21 20:21:13', 12, 1)" }
end

def mass_insert
  inserts = []
  TIMES.times do
    inserts.push "(1, '2015-11-21 20:21:13', 12, 1)"
  end
  sql = "INSERT INTO user (`delta`, `updated_at`, `sku`, `user_id`) VALUES #{inserts.join(", ")}"
  CONN.execute sql
end
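The "with transaction" variants in the results below are not shown in the slides; they would be something like this (a sketch reusing the same TIMES and CONN):

def do_inserts_with_transaction
  ActiveRecord::Base.transaction { do_inserts }  # one commit instead of 10,000
end

def raw_sql_with_transaction
  ActiveRecord::Base.transaction { raw_sql }
end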
ActiveRecord without transaction:
14.930000 0.640000 15.570000 ( 18.898352)
ActiveRecord with transaction:
13.420000 0.310000 13.730000 ( 14.619136)
1.29x faster than base
Raw SQL without transaction:
0.920000 0.170000 1.090000 ( 3.731032)
5.07x faster than base
Raw SQL with transaction:
0.870000 0.150000 1.020000 ( 1.648834)
11.46x faster than base
External service query
Initial data
Token
Token + number of blocks with data to process
Token + data
External service query
Initial data
JSONs
Token
Number of blocks with data to process
Data
Processing the information in blocks
1 Validate the information
2 Insert into CouchDB
3 Index in ES
LOG
Token
Processing the information in blocks
Each block holds a varying number of JSON documents (around 10,000) to handle, parsing the data and mapping it onto the format we manage internally.
Each block is then bulk-inserted into CouchDB, as sketched below.
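A minimal sketch of that bulk insert, assuming CouchDB's standard _bulk_docs endpoint (URL and database name are placeholders):

require 'net/http'
require 'json'

# Hypothetical: bulk-insert one parsed block of documents into CouchDB
def bulk_insert(docs, db_url: 'http://localhost:5984/offers')
  uri = URI("#{db_url}/_bulk_docs")
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  request.body = { docs: docs }.to_json
  Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
end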
# Every 30 minutes; flock -n skips the run (instead of queueing it) if the previous load still holds the lock
*/30 * * * * flock -n /tmp/cron.txt.lock sh -c 'cd /var/www/project/current && bundle exec rake load:parse' || sh -c 'echo MyProject already running; ps; ls /tmp/*.lock'
# Make the different calls according to severity: warning, error or info
Rails.logger.warn "careful, ... is not available"
Rails.logger.error "Error in..."
Rails.logger.info "RESPONSE TOKEN: #{token_info["access_token"]}"
# How to use them in our code
Rails.logger.tagged "MYPROJECT" do
  Rails.logger.tagged "GET_OFFERS_BY_BLOCK" do
    # log calls made here come out prefixed with [MYPROJECT] [GET_OFFERS_BY_BLOCK]
  end
end
I, [2015-09-22T11:31:09.937340 #45894] INFO -- : [MYPROJECT] [REQUEST_TOKEN]
E, [2015-09-22T11:31:40.806873 #45894] ERROR -- : [MYPROJECT]
No import... get_requestid_blocks returned a (500) response
D, [2015-09-22T11:31:10.322600 #45894] DEBUG -- :
Sharing experiences
By Carlos Sánchez Pérez
@carlossanchezp
Currently working at @navandu_ COO, Co-founder at @leemurapp & @wearenominis exCTO @beruby_es. ExDev at @The_Cocktail & @aspgems