Resilient, Cost Effective Data Analytics Architecture
OLAP vs OLTP
Motivation
Combined Compute
CAP
Scale for $$$
Separate Storage and Compute
Distributed Multi-tenant architecture
READ ONLY
Updates / Deletes / Edits
Costs
Storage
$25/TB/Month Per Client
Compute
$5 Per TB Scanned
Network
Negligible
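The rates above reduce to simple arithmetic. A rough sketch, assuming the listed prices ($25 per TB per month for storage, $5 per TB scanned), 1 TB = 1,024 GB, and network cost treated as zero; the function names are made up for illustration:

```python
STORAGE_USD_PER_TB_MONTH = 25.0  # storage rate listed above
SCAN_USD_PER_TB = 5.0            # compute rate listed above
GB_PER_TB = 1024                 # assumption: binary terabytes


def monthly_storage_cost(stored_gb: float) -> float:
    """Storage bill for one month, in USD."""
    return stored_gb / GB_PER_TB * STORAGE_USD_PER_TB_MONTH


def query_cost(scanned_gb: float) -> float:
    """Compute bill for a single query, in USD."""
    return scanned_gb / GB_PER_TB * SCAN_USD_PER_TB


if __name__ == "__main__":
    # 2 TB stored for a month, and one query that scans 102 GB.
    print(f"storage: ${monthly_storage_cost(2048):.2f}")
    print(f"query:   ${query_cost(102):.2f}")
```

Under these assumptions, compute cost depends only on the bytes a query scans, so anything that shrinks the scan (compression, reading fewer columns) lowers the bill directly.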
It's not perfect
We have to manage our own files
It's Actually Not That Bad
Data File Formats On Distributed File Systems
Why do file formats matter?
Compaction
Quicker Reads
What features do you need?
- Compact (saves space)
- Reduces scan time
- Allows quick scans
Columnar Storage
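A toy example can show why a columnar layout cuts the amount of data scanned. This is not how any real file format is implemented; it stores the same made-up records row-wise and column-wise (JSON is used purely as a stand-in encoding) and counts the bytes a query touching one column would have to read.

```python
import json

# Made-up sample records, for illustration only.
rows = [
    {"origin": "SLC", "dest": "JFK", "year": 2001, "carrier": "DL"},
    {"origin": "SLC", "dest": "LAX", "year": 2003, "carrier": "AA"},
    {"origin": "ATL", "dest": "SLC", "year": 1999, "carrier": "DL"},
]


def encode(value) -> bytes:
    """Stand-in serializer; real formats use compact binary encodings."""
    return json.dumps(value).encode("utf-8")


# Row layout: one blob per record, so reading any field reads the whole record.
row_store = [encode(r) for r in rows]

# Column layout: one blob per column, so a query reads only the columns it names.
column_store = {name: encode([r[name] for r in rows]) for name in rows[0]}


def bytes_scanned_row_store(columns) -> int:
    # Every record is read in full, whichever columns are requested.
    return sum(len(blob) for blob in row_store)


def bytes_scanned_column_store(columns) -> int:
    # Only the blobs for the requested columns are read.
    return sum(len(column_store[c]) for c in columns)


if __name__ == "__main__":
    wanted = ["origin"]
    print("row store bytes:   ", bytes_scanned_row_store(wanted))
    print("column store bytes:", bytes_scanned_column_store(wanted))
```

When pricing is per byte scanned, a query such as a count grouped by one column pays only for that column in the columnar layout.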
It's Just SQL Pointed At The File
SELECT origin, count(*) AS total_departures
FROM flights_avro_example
WHERE year >= '2000'
GROUP BY origin
ORDER BY total_departures DESC
LIMIT 10;
Performance
Query | Cost | Speed | Amount Scanned |
---|---|---|---|
SELECT cab_type, count(*) FROM trips_parquet GROUP BY cab_type; | $0.005 | 6 seconds | 600 MB |
SELECT passenger_count, avg(total_amount) FROM trips_parquet GROUP BY passenger_count; | $0.50 | 6 seconds | 102 GB |
SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_parquet GROUP BY passenger_count, year(pickup_datetime); | $0.50 | 6 seconds | 101 GB |
Alternative To Athena
Ultimate Modularity
Operational Overhead
That's It
File Formats
By Jowanza Joseph