Open Source AWS Athena

Jowanza Joseph

jowanza.com

@jowanza

Agenda

  • The Problem
  • AWS Athena
  • OpenStack Swift
  • File Formats
  • PrestoDB
  • Architecture
  • Failure
  • New Architecture
  • Demo

Most Nights

About Me

  • Spark/Scala
  • Too many opinions
  • Writing a book about Spark
  • Dad

Multi-Tenant Distributed Data Platform

How AWS Athena Works

Replicating

  1. Object store
  2. File serialization
  3. Distributed data processing
  4. Bonus: JDBC
  5. Bonus: On Demand

Step 1. Object Storage

Swift Architecture

Step 2. Presto

Presto Architecture

Presto Benefits

  • Can use S3 as a Hive store
  • Distributed
  • Well Supported

Step 3. File Serialization

Summary

All Together

Recap

It Sucks

  • OpenStack is not Cloud Native
  • Swift is a bad Hive store
  • Swift is not at parity with S3
  • Presto doesn't work well with Swift
  • Presto is not Cloud Native

The Struggles

Minio

Benefits

  • Cloud Native
  • Tiny
  • S3 Compliant
  • Solid Ecosystem

Minio Architecture

Updated Architecture

Create New Table

CREATE TABLE hive.web.request_logs (
  sku varchar,
  pageviews double,
  conversion_rate double
)
WITH (
  format = 'ORC',
  external_location = 's3://databucked/data.orc'
)
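
Once the table is registered, it can be queried like any other Presto table. A minimal sketch, assuming the hive.web schema already exists and the ORC data matches the three columns declared above; the aliases and the LIMIT are illustrative choices, not part of the original deck:

```sql
-- Top SKUs by total pageviews, read from the ORC data behind the external table.
-- Column names come from the CREATE TABLE statement; the aliases are hypothetical.
SELECT
  sku,
  sum(pageviews)       AS total_pageviews,
  avg(conversion_rate) AS avg_conversion_rate
FROM hive.web.request_logs
GROUP BY sku
ORDER BY total_pageviews DESC
LIMIT 10
```

The same statement should work from the Presto CLI or over JDBC, which is roughly the experience Athena offers on top of S3.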

Demo

Resources

Open Source AWS Athena

By Jowanza Joseph

Creating an open source alternative to Athena.
