Resilient, Cost Effective Data Analytics Architecture

OLAP vs OLTP

Motivation

Combined Compute

CAP

Scale for $$$

Separate Storage and Compute

Distributed Multi-tenant architecture

READ ONLY

Updates / Deletes / Edits

Costs

Storage

$25/TB/Month Per Client

Compute

$5 Per TB Scanned

Network

Negligible
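
As a rough illustration of how the storage and compute rates above combine, here is a minimal sketch. The two rates come from the slides; the function names, the decimal unit convention (1 TB = 1000 GB), and the sample workload are assumptions made for the example, and real billing may round or apply minimums differently.

```python
# Rates quoted on the slides.
STORAGE_USD_PER_TB_MONTH = 25.0   # storage, per TB per month
SCAN_USD_PER_TB = 5.0             # compute, per TB scanned

GB_PER_TB = 1000.0                # assumption: decimal units


def monthly_storage_cost(stored_tb: float) -> float:
    """Cost of keeping `stored_tb` terabytes for one month."""
    return stored_tb * STORAGE_USD_PER_TB_MONTH


def query_cost(scanned_gb: float) -> float:
    """Cost of one query that scans `scanned_gb` gigabytes."""
    return scanned_gb / GB_PER_TB * SCAN_USD_PER_TB


def monthly_cost(stored_tb: float, scanned_gb_per_query: float,
                 queries_per_month: int) -> float:
    """Storage plus compute for a month; network is treated as negligible."""
    return (monthly_storage_cost(stored_tb)
            + query_cost(scanned_gb_per_query) * queries_per_month)


# Hypothetical workload: 1 TB stored, 10 queries a month scanning 100 GB each.
print(f"${monthly_cost(1, 100, 10):.2f} per month")
```

Because compute is billed on data scanned rather than on cluster uptime, anything that shrinks the bytes a query reads lowers the bill directly.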

It's not perfect

We have to manage our own files

It's Actually Not That Bad

Data File Formats On Distributed File Systems

Why do file formats matter?

Compaction

Quicker Reads

What features do you need?

  • Compact (saves space)
  • Reduces scan time
  • Allows for quick scans

Columnar Storage
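
To show why a columnar layout reduces the amount scanned, here is a toy sketch using only the Python standard library. It is not how Parquet or any real format encodes data: the fixed-width layout, the sample records, and all names are assumptions chosen to keep the example small.

```python
import struct

# Hypothetical trip records: (cab_type_id, passenger_count, total_amount)
rows = [(i % 2, 1 + i % 4, 10.0 + i) for i in range(1000)]

ROW_FORMAT = "<iid"  # two 4-byte ints + one 8-byte float = 16 bytes per row

# Row-oriented layout: whole records stored one after another.
row_store = b"".join(struct.pack(ROW_FORMAT, *r) for r in rows)

# Column-oriented layout: each column stored contiguously on its own.
n = len(rows)
col_store = {
    "cab_type": struct.pack(f"<{n}i", *(r[0] for r in rows)),
    "passenger_count": struct.pack(f"<{n}i", *(r[1] for r in rows)),
    "total_amount": struct.pack(f"<{n}d", *(r[2] for r in rows)),
}


def count_by_cab_type_rows(buf):
    """Count rows per cab type; every byte of every record must be read."""
    counts = {}
    for cab, _passengers, _amount in struct.iter_unpack(ROW_FORMAT, buf):
        counts[cab] = counts.get(cab, 0) + 1
    return counts, len(buf)


def count_by_cab_type_cols(cols):
    """Same aggregation, but only the cab_type column is read."""
    col = cols["cab_type"]
    counts = {}
    for (cab,) in struct.iter_unpack("<i", col):
        counts[cab] = counts.get(cab, 0) + 1
    return counts, len(col)


row_counts, row_bytes = count_by_cab_type_rows(row_store)
col_counts, col_bytes = count_by_cab_type_cols(col_store)
print(row_counts == col_counts, row_bytes, col_bytes)
```

Both layouts give the same answer, but the columnar one touches only the bytes of the column the query names, which is what matters when compute is priced per TB scanned.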

It's Just SQL Pointed At The File

SELECT origin, count(*) AS total_departures
FROM flights_avro_example
WHERE year >= '2000'
GROUP BY origin
ORDER BY total_departures DESC
LIMIT 10;

Performance

Query 1: SELECT cab_type, count(*) FROM trips_parquet GROUP BY cab_type;
  Cost: $.005   Speed: 6 Seconds   Amount Scanned: 600MB

Query 2: SELECT passenger_count, avg(total_amount) FROM trips_parquet GROUP BY passenger_count;
  Cost: $.50   Speed: 6 Seconds   Amount Scanned: 102GB

Query 3: SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_parquet GROUP BY passenger_count, year(pickup_datetime);
  Cost: $.50   Speed: 6 Seconds   Amount Scanned: 101GB

Alternative To Athena

Ultimate Modularity

Operational Overhead

That's It

File Formats

By Jowanza Joseph
