Building a data warehouse using Spark SQL

Budapest Data Forum 2018

Gabor Ratky, CTO

About me

  • Hands-on CTO at Secret Sauce to this day
  • Software engineer at heart
  • Made enough bad decisions to know that everything is a trade-off
  • Code quality and maintainability above all
  • Not writing code when I don't have to
  • Not building distributed systems when I don't have to
  • Not a data warehouse guy, but ❤️ data

Simple is better than complex.
Complex is better than complicated.

The Zen of Python, by Tim Peters

About Secret Sauce

  • SV startup in Budapest
  • B2B2C apparel e-commerce company
  • Data-driven products to help merchandising
  • Services built on top of the data we collect
  • Cloud-based infrastructure (AWS)
  • Small, effective teams
  • Strong engineering culture
    • Code quality
    • Code reviews
    • Testability
  • Everybody needs data to do their jobs

Early days

Partner data

[Architecture diagram: partner data imported into MongoDB via $ mongoimport; MongoDB, S3, and PostgreSQL data loaded into Redshift]

Event analytics

[Architecture diagram: event streams flowing through Kafka into the same pipeline]

Data warehousing

[Architecture diagram: partner data, Kafka event streams, MongoDB, and PostgreSQL all landing in S3, processed with Databricks]

Why Databricks and Spark?

  • Storage and compute are separate
  • Managed clusters operated by Databricks
  • Fits into and runs as part of our existing infrastructure (AWS)
  • Right tool for the job
    • Data engineers use pyspark
    • Data analysts use SQL
    • Data scientists use Python, R, SQL, H2O, Pandas, scikit-learn, dist-keras
  • Shared metastore (databases and tables; see the sketch after this list)
  • Collaborative, interactive notebooks
  • Github integration and flow
  • Automated jobs and schedules
  • Programmatic API
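
A minimal sketch of what the shared metastore means in practice, assuming a Databricks/Spark session and a hypothetical analytics.events table; analysts query it with plain SQL while engineers use the DataFrame API, both against the same table definition:

    # Hypothetical table: analytics.events (event_time, ...)
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Data analysts: plain Spark SQL against the shared metastore
    daily = spark.sql("""
        SELECT date(event_time) AS day, count(*) AS events
        FROM analytics.events
        GROUP BY date(event_time)
    """)

    # Data engineers: the same table through the DataFrame API
    daily_df = (
        spark.table("analytics.events")
             .groupBy(F.to_date("event_time").alias("day"))
             .count()
    )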

Clusters

Workspace

Notebooks

Jobs

Analytics

Build vs buy

BUY

Cost

Cost (Redshift)

  • Persistent data warehouse
  • 4x ds2.xlarge nodes (8TB, 16 vCPU, 124GB RAM)
  • On-demand price: $0.85/hr/node
  • 1 month ~ 732 hours

$2,488
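
The figure is straightforward arithmetic from the numbers above; a back-of-the-envelope check rather than an exact AWS bill:

    # Redshift: 4 always-on nodes, billed around the clock
    nodes = 4
    price_per_hour = 0.85   # $/hr per ds2.xlarge node, on-demand
    hours = 732             # ~1 month
    print(nodes * price_per_hour * hours)  # 2488.8 -> ~$2,488/mo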

Cost (Databricks)

  • Ephemeral, interactive, multi-tenant cluster
  • 8TB storage (S3)
  • i3.xlarge driver node (4 vCPU, 30.5GB RAM)
  • 4x i3.xlarge worker nodes (16 vCPU, 122GB RAM)
  • Compute: $0.712/hr
    • $0.312/hr on-demand price
    • 4x $0.1/hr spot price
  • Databricks: $2/hr
    • $0.4/hr/node
  • Storage: $188/mo + change
  • 1 month ~ 22 workdays ~ 176 hours

$665
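
The same back-of-the-envelope check for the Databricks setup, using the figures above:

    # Databricks: ephemeral cluster, only up during working hours
    compute = 0.312 + 4 * 0.10   # on-demand driver + 4 spot workers, $/hr
    databricks_fee = 5 * 0.40    # $0.4/hr/node for 5 i3.xlarge nodes
    hours = 176                  # ~22 workdays
    storage = 188                # 8TB on S3, $/mo
    print((compute + databricks_fee) * hours + storage)  # ~665.3 -> ~$665/mo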

Utilization (Redshift)

[Utilization charts]

Utilization (Databricks)

[Utilization charts: ~34 DBU/day, ~4.5 DBU/hr; ~11.5 DBU/day]

Our experience so far

  • Started using Databricks in January 2018
  • Quick adoption across the whole company
  • Fast turnaround on data requests
  • Easy collaboration between technical and non-technical people
  • Databricks allows us to focus on data engineering, not data infrastructure
  • Github integration not perfect, but fits into our workflow
  • Partitioning and schema evolution need a lot of attention (see the sketch after this list)
  • Databricks is an implementation detail, pick your poison
  • Everything is a trade-off, make the right ones
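
A minimal sketch of the partitioning and schema-evolution pattern referenced above, assuming a Spark session, a DataFrame df with a day column, and a hypothetical S3 path:

    # Append partitioned Parquet to S3; each day lands in its own prefix
    (df.write
       .mode("append")
       .partitionBy("day")
       .parquet("s3://example-warehouse/events"))

    # mergeSchema reconciles columns added by later writes
    events = (spark.read
                   .option("mergeSchema", "true")
                   .parquet("s3://example-warehouse/events"))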

NIHS*

* not invented here syndrome

Thanks!

Questions?


@rgabo

gabor@secretsaucepartners.com

Building a data warehouse using Spark SQL

By Secret Sauce Partners, Inc.


Why on Earth would you want to replace your data warehouse with a bunch of files lying around “in the cloud” and expect that everyone from engineers to data analysts and scientists will tap more into that data to power their analyses and their research and, in general, do their work? Separating the storage of data from the computational work and giving everyone the right tool for their job within a single, coherent environment has made it easy for our engineers, analysts, data scientists, and even non-technical people to collaborate on and work with data. Working with data is hard. Far from the La La Land of Machine Learningstan and the United States of AI, there are many of us who still need to deal with messy data, ETL, backfilling, failed jobs, inefficient SQL queries, overloaded production databases, partitioning, and, with the advent of cloud computing, unterminated Spark clusters and infrastructure cost. This talk is about how we tackled some of those challenges while building out our new data warehouse using S3 and Spark. This talk is for the rest of us.
