How to make an app fast and resilient... while querying a petabyte of data
About me
Karim Pinchon
- Backend developer
- @kpn13
- https://blog.karimpinchon.com
- https://slides.com/kpn13
What are we going to talk about?
Agenda
- Context
- Data
- Code
- Conclusion
Context

Context

B2B

SAAS

Market measurement
Context
Industries
Ecom
Food
Ride
FMCG
BNPL
Context
Some clients
Context
50+ countries
Context
7M+ shoppers

Context
1.5B orders

Context

1PB of data
Data
How do we process the data?

Raw data
1 - Data acquisition
2 - Data structuring

3 - Data enrichment
4 - Flat tables



Data acquisition
Several sources

Parsers to extract data
Data structuring


Scripts to transform and structure data
Data enrichment


- fix bias
- add data
- ...

Flat tables


Split by industries and countries
Why this strategy?
Flat tables
- very different data structures
- data volume discrepancy
- limited number
- business compatible
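As an illustration of the split by industry and country, here is a minimal sketch of how a request could be routed to one flat table. The `flat_<industry>_<country>` naming scheme and the `flatTableName` function are assumptions made for this example; the slides do not give the real table names.

```php
<?php
// Hypothetical naming scheme: flat_<industry>_<country>.
// The industry list comes from the "Industries" slide; everything else is assumed.
function flatTableName(string $industry, string $country): string
{
    $allowedIndustries = ['ecom', 'food', 'ride', 'fmcg', 'bnpl'];
    $industry = strtolower($industry);
    $country = strtolower($country);

    // Validate both parts before they end up in a table identifier.
    if (!in_array($industry, $allowedIndustries, true)) {
        throw new InvalidArgumentException("Unknown industry: $industry");
    }
    if (!preg_match('/^[a-z]{2}$/', $country)) {
        throw new InvalidArgumentException("Expected a two-letter country code, got: $country");
    }

    return "flat_{$industry}_{$country}";
}

$table = flatTableName('Food', 'FR');
```

Because the number of tables is limited, a strict allow-list like this one stays short and keeps user input out of the SQL identifier.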
Code
What architecture?
What architecture?
boring
Technology

- Apache
- PHP
- MySQL
- Redis
- BigQuery
- SQS
Technology
- Monolith
- Vanilla PHP
- Vanilla JS
- VueJS

What can we do with that to process so much data?



The big three
Caching
Why is caching important?
Caching
Performance
Cost savings


Caching
Performance

About 15x faster
Caching
Cost savings
BigQuery is sooo expensive!
Solution?
No call no money

Important note
Always set SQL limits!
Caching
Resilience
What if BigQuery is down or slow?
No problem... Almost
Bonus
Caching
Ok but how?
By using a trendy technology?

Caching
Well-established technology and simplicity

+ handmade code
Caching
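<?php
// compute() is the fingerprint helper; its body is not shown in the slides.
$fingerprint = compute($method, $url, $parameters, ...);
$redis = new \Redis();
$redis->connect('127.0.0.1', 6379); // assumed host and port
$responseCached = $redis->get($fingerprint);
// phpredis returns false, not null, when the key does not exist
if ($responseCached !== false) {
    return $responseCached;
}
// else execute request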
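The slide calls a fingerprint helper without showing it. Here is a minimal sketch of what such a helper could look like, assuming the fingerprint only depends on the HTTP method, the URL and the parameters. The `computeFingerprint` name, the `cache:` prefix and the choice of SHA-256 are all assumptions, not the talk's actual code.

```php
<?php
// Hypothetical fingerprint helper: same request, same cache key.
function computeFingerprint(string $method, string $url, array $parameters): string
{
    // Sort top-level parameters by key so their order does not change the key.
    ksort($parameters);

    $payload = json_encode([strtoupper($method), $url, $parameters], JSON_THROW_ON_ERROR);

    // The prefix makes entries easy to spot in Redis; the hash keeps keys a fixed length.
    return 'cache:' . hash('sha256', $payload);
}

$a = computeFingerprint('get', '/orders', ['country' => 'FR', 'industry' => 'food']);
$b = computeFingerprint('GET', '/orders', ['industry' => 'food', 'country' => 'FR']);
$sameKey = ($a === $b); // both calls describe the same request
```

Normalising the input before hashing matters: without it, two identical requests could miss each other's cache entry and trigger two BigQuery calls.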
Caching
Benefits
- simple
- flexible
- fine-grained management
- it just works
Drawbacks
- maintenance
- less reliable?
- slower
- resources (RAM)
Caching
Important notes


Duration
Invalidation
Testing
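The three notes above (duration, invalidation, testing) can be sketched together. This is an in-memory stand-in for Redis, not the production code: the `TtlCache` class, its method names and the injected clock are assumptions chosen so that expiry can be tested without sleeping.

```php
<?php
// In-memory stand-in for Redis with a TTL per entry and an injectable clock.
final class TtlCache
{
    /** @var array<string, array{value: string, expiresAt: int}> */
    private array $entries = [];

    /** @var callable(): int returns the current time in seconds */
    private $clock;

    public function __construct(callable $clock)
    {
        $this->clock = $clock;
    }

    // Duration: every entry gets an explicit lifetime.
    public function set(string $key, string $value, int $ttlSeconds): void
    {
        $this->entries[$key] = [
            'value' => $value,
            'expiresAt' => ($this->clock)() + $ttlSeconds,
        ];
    }

    public function get(string $key): ?string
    {
        $entry = $this->entries[$key] ?? null;
        if ($entry === null || $entry['expiresAt'] <= ($this->clock)()) {
            unset($this->entries[$key]); // missing or expired
            return null;
        }
        return $entry['value'];
    }

    // Invalidation: drop an entry as soon as the underlying data changes.
    public function invalidate(string $key): void
    {
        unset($this->entries[$key]);
    }
}

// Testing: move the fake clock forward instead of waiting.
$now = 1000;
$cache = new TtlCache(function () use (&$now): int { return $now; });
$cache->set('report:food:fr', '{"orders":42}', 60);
$fresh = $cache->get('report:food:fr');
$now += 61;
$expired = $cache->get('report:food:fr');
```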

Asynchronous tasks
Asynchronous task
Why is async important?
Asynchronous task
Better UX
Resilience



Scalable
Asynchronous task
Message broker + Workers + PHP
Asynchronous task
<?php
use Aws\Sns\SnsClient;
// The region is a client option, not a publish() parameter.
$client = new SnsClient([
    'region' => $region,
    'version' => 'latest',
]);
$client->publish([
    'Message' => $body,
    'TopicArn' => $topicArn,
]);
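The slide shows the publishing side only. Here is a minimal sketch of the consuming side, with an in-memory array standing in for the broker so the example runs without AWS. The `runWorker` function, the message shape (`body`, `attempts`) and the retry limit are assumptions, not the talk's actual worker.

```php
<?php
// Hypothetical worker loop: take a message, run the handler, retry on failure.
function runWorker(array &$queue, callable $handler, int $maxAttempts = 3): array
{
    $failed = [];
    while (($message = array_shift($queue)) !== null) {
        try {
            $handler(json_decode($message['body'], true));
        } catch (Throwable $e) {
            $message['attempts']++;
            if ($message['attempts'] < $maxAttempts) {
                $queue[] = $message; // put it back for a later retry
            } else {
                $failed[] = $message; // give up; a real setup might use a dead-letter queue
            }
        }
    }
    return $failed;
}

$queue = [
    ['body' => '{"analysisId": 1}', 'attempts' => 0],
    ['body' => '{"analysisId": 2}', 'attempts' => 0],
];
$done = [];
$failed = runWorker($queue, function (array $job) use (&$done): void {
    if ($job['analysisId'] === 2) {
        throw new RuntimeException('simulated failure');
    }
    $done[] = $job['analysisId'];
});
```

A failing job does not block the others, which is where the resilience benefit comes from; the price is that results arrive later and failures have to be tracked separately.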
Asynchronous task
Benefits
- scalability
- resilience
- non-blocking
Drawbacks
- more complex
- delays
- inconsistencies
- testability
Short Polling

Short Polling
- The client sends an HTTP request to the server.
- The server processes the request and responds (in progress / done)
- The client waits for a set delay before sending a new request.
- This cycle repeats until the task is complete.
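The cycle above can be sketched as a small loop. In the application the client is JavaScript in the browser; this PHP version only illustrates the logic. The `pollUntilDone` function, the `in_progress` / `done` status strings and the attempt cap are assumptions: `$fetchStatus` stands in for the HTTP request.

```php
<?php
// Hypothetical short-polling loop: ask, wait, ask again until the task is done.
function pollUntilDone(callable $fetchStatus, int $delayMs, int $maxAttempts): bool
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($fetchStatus() === 'done') {
            return true;
        }
        if ($attempt < $maxAttempts) {
            usleep($delayMs * 1000); // fixed delay before the next request
        }
    }
    return false; // gave up: the task is still in progress
}

// Fake server that reports "done" on the third request.
$calls = 0;
$finished = pollUntilDone(function () use (&$calls): string {
    $calls++;
    return $calls >= 3 ? 'done' : 'in_progress';
}, 10, 5);
```

The attempt cap is a design choice added for this sketch, so that a task that never finishes cannot keep a client polling forever.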
Short polling
Different analysis
Different steps
- each analysis step is configurable using a JSON file
- a scenario can have 2 steps, 3 steps, 4 steps, etc.
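To make the JSON-driven steps concrete, here is a minimal sketch of how a server could turn a scenario file into the "in progress / done" answer that the client polls for. The JSON keys (`analysis`, `steps`), the step names and the `progressFor` function are assumptions; the slides do not show the real schema.

```php
<?php
// Hypothetical scenario file for a three-step analysis.
$json = '{"analysis": "market-share", "steps": ["fetch", "aggregate", "render"]}';

// Build the polling response from the scenario and the number of finished steps.
function progressFor(string $scenarioJson, int $completedSteps): array
{
    $scenario = json_decode($scenarioJson, true, 512, JSON_THROW_ON_ERROR);
    $total = count($scenario['steps']);
    $done = min($completedSteps, $total);

    return [
        'status' => $done === $total ? 'done' : 'in_progress',
        'nextStep' => $scenario['steps'][$done] ?? null, // null once everything is finished
        'progress' => "$done/$total",
    ];
}

$response = progressFor($json, 1);
```

Adding a fourth step to an analysis then only means editing the JSON file; the polling code stays the same.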
Short Polling
Benefits
- easy to implement
- resilience
Drawbacks
- server load
- latency
Use SSE or WebSockets?
Benefits
- saves resources
- lower latency
Drawback
- more complex
Not necessary for us!
Conclusion
Conclusion
- Build a smart read model
- Defer processing as much as possible
- Cache what you can
- Use only the technologies you really need
Thank you

How we made our application fast and resilient while running queries on terabytes of data!
By Karim PINCHON