Benjamin Roth
@apneadiving
🇫🇷
💫 ⭐️ 🌟 ✨ ⚡️
| | Before | After |
|---|---|---|
| RAM used | 180 GB | 10 GB |
| RAM per thread | 3 GB | 500 MB |
| Average time | 10 minutes | 10 seconds |
| Synchronisation tracking | no info available | all timestamps in database |
First refactoring
Before
After
The advertising data was bundled into one huge object known as the state.
Every single problem was supposed to come from it.
Account
Campaigns (~100)
Ad Groups (~25k)
Ads (~100k)
Keywords (~100k)
(around 700 MB of JSON)
🏋️‍♂️
Account
Campaigns (~100)
Ad Groups (~25k)
Ads (~100k)
Keywords (~100k)
The whole state object was saved in the database, in a binary column:
compressed JSON.
🗜
🦛
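For illustration, a minimal sketch of that old pattern (the model and column names here are hypothetical, not the actual schema):

```ruby
require 'zlib'
require 'json'

# Old approach: serialize the entire state, compress it, and store the
# resulting bytes in one binary column.
state = {
  campaigns: [{ id: 1, name: 'Brand' }],
  keywords:  [{ reference_id: '124', text: 'Eat my short' }]
}

blob = Zlib::Deflate.deflate(JSON.generate(state))
# account.update!(state_blob: blob)  # hypothetical: one opaque payload per account

# Reading anything back means inflating and parsing the whole thing:
restored = JSON.parse(Zlib::Inflate.inflate(blob))
```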
SQL is pretty well designed for storing data.
Do not store blobs. 🙄
We use a table per kind of entity:
campaigns, ads, keywords...
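As a sketch, one of those tables could look like this as a Rails migration (table and column names are assumptions based on the keyword JSON shown below):

```ruby
# Hypothetical migration: keywords get their own table instead of living
# inside a compressed JSON blob.
class CreateKeywords < ActiveRecord::Migration[6.0]
  def change
    create_table :keywords do |t|
      t.references :ad_group, null: false
      t.string :reference_id, null: false
      t.string :text,         null: false
      t.string :match_type,   null: false
      t.timestamps
    end

    add_index :keywords, :reference_id, unique: true
  end
end
```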
First, we sync:

```json
{
  "reference_id": "124",
  "text": "Eat my short",
  "match_type": "EXACT"
}
```

Later we have to sync:

```json
{
  "reference_id": "124",
  "text": "Eat my short",
  "match_type": "EXACT"
}
```

No diff:
nothing to push to the API.

Even later we have to sync:

```json
{
  "reference_id": "124",
  "text": "Eat my belt",
  "match_type": "EXACT"
}
```

against what was pushed last time:

```json
{
  "reference_id": "124",
  "text": "Eat my short",
  "match_type": "EXACT"
}
```

There is a diff:
something to push to the API.
Does this mean we have to store all properties of each object in the database?
🙅‍♂️
"Some String" → MD5 → one digest
"Some other String" → MD5 → a different digest
Store the minimum relevant data you need
🐣
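A minimal sketch of that idea, assuming we keep one MD5 digest per entity (the helper below is illustrative):

```ruby
require 'digest'
require 'json'

# Hash the serialized attributes; sort the keys first so the same data
# always produces the same digest.
def digest_for(attributes)
  Digest::MD5.hexdigest(JSON.generate(attributes.sort.to_h))
end

stored  = digest_for(reference_id: '124', text: 'Eat my short', match_type: 'EXACT')
current = digest_for(reference_id: '124', text: 'Eat my belt',  match_type: 'EXACT')

puts stored == current ? 'no diff, nothing to push' : 'diff, push to the API'
```

Comparing two digests costs a few bytes per entity instead of keeping every property around.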
Triggering a synchronisation means telling the app to push a product's data to AdWords.

It used to be a matter of enqueuing a worker:

```ruby
CreateStateWorker.perform_async(product_id)
```
😖 Enqueue it twice by accident and the same product gets synchronised twice:

```ruby
CreateStateWorker.perform_async(product_id)
CreateStateWorker.perform_async(product_id)
```
Instead, we now record the intent in the database:

```ruby
Synchronisation.create!(
  status: 'pending',
  product_id: product_id
)
```
⚙️
...and let a cron handle pushing jobs to the queues.
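The "not enqueued twice" guarantee can be enforced by the database itself. A hedged sketch, assuming PostgreSQL partial indexes (the names here are made up):

```ruby
# Hypothetical migration: at most one active synchronisation per product,
# guaranteed by a partial unique index.
class AddUniquenessToSynchronisations < ActiveRecord::Migration[6.0]
  def change
    add_index :synchronisations, :product_id,
              unique: true,
              where: "status IN ('pending', 'in_progress')",
              name: 'index_synchronisations_on_active_product'
  end
end
```

With this in place, a second `Synchronisation.create!` for the same product raises instead of silently duplicating work.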
Whenever you are talking a lot about some concept (Synchronisation in our case),
it could be that there is an object crying out for you to create it.
👷‍♂️
The full sync process was a cascade of workers:

```ruby
CreateStateWorker.perform_async(synchronisation_id)
```

At the end of the worker, it triggered the next step:

```ruby
CreateDiffWorker.perform_async(synchronisation_id)
```

Which in turn triggered:

```ruby
PushDiffWorker.perform_async(synchronisation_id)
```

(there were actually a few more steps)
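A minimal Sidekiq-style sketch of that cascade (the worker bodies are placeholders, not the real implementation):

```ruby
require 'sidekiq'

class CreateStateWorker
  include Sidekiq::Worker

  def perform(synchronisation_id)
    # ... build the current state for this synchronisation ...
    CreateDiffWorker.perform_async(synchronisation_id)
  end
end

class CreateDiffWorker
  include Sidekiq::Worker

  def perform(synchronisation_id)
    # ... compute the diff against what was pushed last time ...
    PushDiffWorker.perform_async(synchronisation_id)
  end
end

class PushDiffWorker
  include Sidekiq::Worker

  def perform(synchronisation_id)
    # ... push the diff to the API ...
  end
end
```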
[Timeline diagram: jobs queued vs. processing at T0, T1, T2. The steps of Sync1 and Sync2 interleave, so each synchronisation waits between its own steps instead of exiting the pipeline quickly.]
Hardware matters: idle hardware is a waste of money.
We regrouped everything under one queue (and therefore a single worker).
Still, that alone doesn't make a synchronisation exit the pipeline as fast as possible.
🤔
The queue for SynchronisationWorker has:
2 processes with 2 threads each, so 4 jobs can run in parallel.
Say you push 10 jobs: 4 run immediately, the other 6 just sit in the queue.

So we had a cron pushing jobs; it was instructed to push at most:

```ruby
MAX_CONCURRENT_PROCESSES - Synchronisation.in_progress.count
```
Enqueuing the right number of jobs:
🤸‍♀️
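A minimal sketch of that cron task (the `pending` scope, the ordering, and the constant value are assumptions based on the slides):

```ruby
MAX_CONCURRENT_PROCESSES = 4 # 2 processes x 2 threads

# Runs periodically: fill the free slots, never more.
def enqueue_pending_synchronisations
  slots = MAX_CONCURRENT_PROCESSES - Synchronisation.in_progress.count
  return if slots <= 0

  # Prioritization lives here: change the scope, change the order.
  Synchronisation.pending.order(:created_at).limit(slots).each do |sync|
    sync.update!(status: 'in_progress')
    SynchronisationWorker.perform_async(sync.id)
  end
end
```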
| Problem 🚱 | Solution ✅ |
|---|---|
| Ensure the same product is not enqueued twice | DB constraint on Synchronisation status |
| Ensure a synchronisation is dealt with as fast as possible once enqueued | The cron controls what is in the queue |
| Prioritize some products? | A scope in the cron responsible for pushing jobs to the queue |
| Synchronisation stats? | Carried by each Synchronisation object in the database |
| Hardware usage optimization? | A single queue we can adjust depending on the load |
| Problem 🚱 | Solution ✅ |
|---|---|
| Memory issues | No more blobs; MD5 comparison only |
| Speed concerns | The workers setup |
☠️ 🧟‍♂️
It could be useful, but let's face it, it's not a priority.
Chill...
🥶
Hatred of objects.
Obsession with functions on hashes instead.
🧘‍♂️
```ruby
Entity.full_name(
  first_name: 'Mo',
  last_name: 'Fo'
)
```
nasty by nature,
waiting to come back and bite you...
🦖
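A sketch of the object-based alternative the slides are hinting at (this `Entity` class is illustrative):

```ruby
# Give the data a real home instead of passing hashes to module functions.
class Entity
  attr_reader :first_name, :last_name

  def initialize(first_name:, last_name:)
    @first_name = first_name
    @last_name = last_name
  end

  def full_name
    "#{first_name} #{last_name}"
  end
end

Entity.new(first_name: 'Mo', last_name: 'Fo').full_name # => "Mo Fo"
```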
☯️
```ruby
GC.start
```

An obsession with micro-optimization
=
shitty code
+
no real time to fix it
STEP BACKWARDS to see
the 🐘 in the room