Benjamin Roth
@apneadiving
🇫🇷
Case Study
how we made our core process
100 times faster
(at least)
💫 ⭐️ 🌟 ✨ ⚡️
What are we talking about?
Input
- advertising-related data in the database

Output
- data pushed to AdWords
- IDs of AdWords entities saved in the database
- errors saved in the database
| | Before | After |
|---|---|---|
| RAM used | 180 GB | 10 GB |
| RAM per thread | 3 GB | 500 MB |
| Average time | 10 minutes | 10 seconds |
| Synchronisation tracking | no info available | all timestamps in the database |
And YES, we kept Ruby ❤️
first refactoring
(before / after diagrams)
Optimization 1
Dealing with small chunks
The STATE
The advertising data was bundled in some huge object known as the state.
Every single problem was supposed to come from it.
The State?
Account
Campaigns (~100)
Ad Groups (~25k)
Ads (~100k)
Keywords (~100k)
(around 700 MB of JSON)
So what about the state?
- lots of database queries to generate the full object
- huge object to keep in memory
- on update, the previous state was loaded from the database as well to make a diff: 2 huge objects to keep in memory
🏋️♂️
Feedback 1
Account
Campaigns (~100)
Ad Groups (~25k)
Ads (~100k)
Keywords (~100k)
- No need to build the entire tree
- Deal with data layer by layer (see the sketch below)
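A minimal sketch of the layer-by-layer approach, assuming ActiveRecord models named after the tree above and hypothetical `sync_*` helpers:

```ruby
# Process each layer in small batches instead of materialising
# the whole Account -> Keywords tree in memory at once.
def sync_account(account)
  account.campaigns.find_each(batch_size: 100) { |c| sync_campaign(c) }

  AdGroup.where(account_id: account.id)
         .find_each(batch_size: 1_000) { |ag| sync_ad_group(ag) }

  Ad.where(account_id: account.id)
    .find_each(batch_size: 1_000) { |ad| sync_ad(ad) }

  Keyword.where(account_id: account.id)
         .find_each(batch_size: 1_000) { |kw| sync_keyword(kw) }
end
```

Each layer is streamed in batches, so memory stays bounded by the batch size rather than by the size of the whole tree.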
Optimization 2
A decent datamodel
State Storage
The whole state object was saved in the database, in a byte column:
compressed JSON.
🗜
🦛
Feedback 2
SQL is pretty well designed for storing data.
Do not store blobs. 🙄
We use a table per kind of entity:
campaigns, ads, keywords...
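For illustration, a per-entity table might look like this (a sketch, assuming a Rails migration; the exact columns are hypothetical):

```ruby
class CreateKeywords < ActiveRecord::Migration[6.0]
  def change
    create_table :keywords do |t|
      t.references :ad_group, null: false
      t.string :reference_id           # id of the entity on the AdWords side
      t.string :text, null: false
      t.string :match_type, null: false
      t.timestamps
    end
  end
end
```

One table per entity kind keeps rows small and queryable, instead of one opaque compressed blob.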
Optimization 3
Only store what you need
Keyword Example
First, we sync:

{
  "reference_id": "124",
  "text": "Eat my short",
  "match_type": "EXACT"
}

Later we have to sync:

{
  "reference_id": "124",
  "text": "Eat my short",
  "match_type": "EXACT"
}

No diff:
nothing to push to the API

Even later we have to sync:

{
  "reference_id": "124",
  "text": "Eat my belt",
  "match_type": "EXACT"
}

there is a diff:
something to push to the API
Keyword Example
{
  "reference_id": "124",
  "text": "Eat my short",
  "match_type": "EXACT"
}

{
  "reference_id": "124",
  "text": "Eat my belt",
  "match_type": "EXACT"
}
Does this mean we have to store all the properties of each object in the database?
🙅♂️
No: store an MD5 fingerprint instead.

MD5("Some String")
≠
MD5("Some other String")
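In Ruby this boils down to fingerprinting a canonical payload (a sketch; `Digest::MD5` is stdlib, the payload shape comes from the example above):

```ruby
require 'digest'
require 'json'

# Build a deterministic payload, keep only its MD5 fingerprint.
payload = { reference_id: '124', text: 'Eat my short', match_type: 'EXACT' }
fingerprint = Digest::MD5.hexdigest(payload.to_json)

# On the next sync, recompute the digest and compare it to the stored one:
# identical fingerprints mean no diff, so nothing to push to the API.
```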
Feedback 3
Store the minimum relevant data you need
🐣
Optimization 4
Synchronisation tracking
Synchronisation?
Triggering a synchronisation means telling the app to push a product's data to AdWords.
It was a matter of enqueuing a worker:
CreateStateWorker.perform_async(product_id)
Enqueuing blindly
- What if you want to ensure the same product is not enqueued twice?
- What if you want to prioritize some products over some others?
- What if you want to know if/when some product's sync was triggered?
- What if you want to know how long it took? on average?
😖
CreateStateWorker.perform_async(product_id)
Sidekiqing sideways
Instead of enqueuing directly:

CreateStateWorker.perform_async(product_id)

we create a record:
Synchronisation.create!(
status: 'pending',
product_id: product_id
)
⚙️
and have a cron handle pushing jobs to queues
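A minimal sketch of the underlying table, assuming a Rails migration on PostgreSQL (column names hypothetical); the partial unique index is what guarantees a product cannot be enqueued twice:

```ruby
class CreateSynchronisations < ActiveRecord::Migration[6.0]
  def change
    create_table :synchronisations do |t|
      t.references :product, null: false
      t.string :status, null: false, default: 'pending'
      t.datetime :started_at    # these timestamps are the sync stats
      t.datetime :finished_at
      t.timestamps
    end

    # At most one active synchronisation per product.
    add_index :synchronisations, :product_id,
              unique: true,
              where: "status IN ('pending', 'in_progress')",
              name: 'index_active_synchronisations_on_product_id'
  end
end
```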
Feedback 4
Whenever you are talking a lot about some concept (Synchronisation in our case),
it could be that there is an object crying for you to create it.
Optimization 5
Sidekiq workers
👷♂️
Workers
The full sync process was a cascade of workers:

CreateStateWorker.perform_async(synchronisation_id)

At the end of the worker, it triggered the next step:

CreateDiffWorker.perform_async(synchronisation_id)

Which in turn triggered:

PushDiffWorker.perform_async(synchronisation_id)

(there were actually a few more steps)
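One link of that chain might look like this (a sketch of a Sidekiq worker; the diff computation is elided):

```ruby
class CreateDiffWorker
  include Sidekiq::Worker

  def perform(synchronisation_id)
    synchronisation = Synchronisation.find(synchronisation_id)

    # ... compute the diff for this synchronisation ...

    # Last step of each worker: hand over to the next stage of the cascade.
    PushDiffWorker.perform_async(synchronisation_id)
  end
end
```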
Hardware concerns
- assign a different queue to each sync worker: more hardware required / 💸 to have machines idly waiting
- assign the same queue to all sync workers: less hardware required, but it skews all synchronisation stats:
(timeline diagram: with one shared queue, steps of different synchronisations interleave over time, e.g. Sync1 step 1, Sync2 step 1, Sync1 step 2, Sync2 step 2, so one synchronisation's end-to-end time is stretched by unrelated jobs)
Feedback 5
Hardware matters, idle hardware is a waste of money.
We did regroup under one queue (then only one worker).
That still doesn't help a synchronisation exit the pipeline as fast as possible.
🤔
Optimization 6
Optimizing the pipeline
The queue for SynchronisationWorker has:
2 processes × 2 threads: 4 jobs can run in parallel.

Say you push 10 jobs:
- 4 would be handled right away
- 6 would end up waiting for a free slot
Queues
- Enqueuing 4 jobs max is OK. More is irrelevant.
- Enqueuing 0 is the way to go if all the slots are still filled.
We had a cron pushing jobs; it was instructed to push at most:

MAX_CONCURRENT_PROCESSES - Synchronisation.in_progress.count
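A sketch of that cron job, assuming the Synchronisation model above plus hypothetical `pending` / `in_progress` scopes:

```ruby
class EnqueuePendingSynchronisations
  MAX_CONCURRENT_PROCESSES = 4 # 2 processes x 2 threads

  def call
    # Only fill the slots that are actually free right now.
    slots = MAX_CONCURRENT_PROCESSES - Synchronisation.in_progress.count
    return if slots <= 0

    # Prioritisation lives here: this scope decides which syncs go first.
    Synchronisation.pending.order(:created_at).limit(slots).each do |sync|
      sync.update!(status: 'in_progress', started_at: Time.current)
      SynchronisationWorker.perform_async(sync.id)
    end
  end
end
```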
Feedback 6
Enqueuing the right amount of jobs:
- lets Sidekiq focus on its worker-handling responsibilities
- allows you to prioritize the remaining jobs each time the cron is executed
🤸♀️
| Problem 🚱 | Solution ✅ |
|---|---|
| Ensure the same product is not enqueued twice | DB constraint on Synchronisation status |
| Ensure a synchronisation is dealt with as fast as possible once enqueued | The cron controls what is in the queue |
| Prioritize some products? | A scope in the cron responsible for pushing jobs to the queue |
| Synchronisation stats? | Carried by each Synchronisation object in the database |
| Hardware usage optimization? | A single queue we can adjust depending on the load |
| Problem 🚱 | Solution ✅ |
|---|---|
| Memory issues | No more blobs; MD5 comparison only |
| Speed concerns | Workers setup |
Prequels
Various rushed and vain attempts at improvement
aka
the pointless
micro-optimization
☠️ 🧟♂️
Strange love for bang methods
- map!
- merge!
- strip!
- ...
It could be useful, but let's face it, it's not a priority.
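For context, the supposed gain (a trivial illustration):

```ruby
names = ['  Ada ', ' Linus ']

names.map(&:strip)   # allocates a new array, `names` untouched
names.map!(&:strip)  # mutates `names` in place, saving one array allocation
```

Shaving single allocations like this rarely moves the needle compared to the structural fixes above.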
Chill...
🥶
So-called FP style
Hatred of objects
Obsession for functions on hashes instead
🧘♂️
Entity.full_name(
first_name: 'Mo',
last_name: 'Fo'
)
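The object-friendly counterpart, for contrast (a sketch):

```ruby
class Entity
  def initialize(first_name:, last_name:)
    @first_name = first_name
    @last_name = last_name
  end

  def full_name
    "#{@first_name} #{@last_name}"
  end
end

Entity.new(first_name: 'Mo', last_name: 'Fo').full_name
# => "Mo Fo"
```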
Callbacks
nasty by nature
waiting to bite you in the back...
🦖
Desperate moves
☯️
GC.start
Stop the Madness
obsession with micro-optimization
=
Shitty code
+
no real time to fix it
STEP BACKWARDS to see
the 🐘 in the room
That’s all!
🙃🙏🙇♂️