The Snowflake Elastic Data Warehouse

Papers We Love San Diego

February 6th, 2020

Presented by Michael Jalkio

Today

Data Warehousing History
Enter Snowflake, Inc.
Innovations in the paper
Changes in the last few years

Data Warehousing History

*Credit to

The Data Warehousing Evolution

by Frank Bell

Relational Database Management System (RDBMS)

First proposed by EF Codd at IBM in 1970
Commercial systems available from 1979 (with SQL)
SQL Server (1989), MySQL (1995), PostgreSQL (1996)
Mainly designed for transaction processing, start to hit issues with scaling

OLTP vs OLAP Databases

OLTP = Transaction = Row-Based
OLAP = Analytics = Column-Based
OLAP is optimized for aggregations over large datasets
OLTP is optimized for INSERTs, UPDATEs, DELETEs, etc.

Specialized DW Appliances

Teradata, Netezza, Vertica, etc.
Massively parallel processing (MPP) architectures
Very expensive, only for large enterprises
No elasticity, Monday mornings are very hard

Google File System (2003)

MapReduce (2004)

Hadoop (2006)

Issues with the Hadoop Ecosystem

New model to gain expertise in
Didn't play nicely with existing tooling
Complemented the data warehouse

AWS (2006)

BigQuery (2010)

Redshift (2012)

Snowflake (2014/2015)

About Snowflake

Raised over $920M with a valuation of $3.9B
1,400+ customers
~1,000 employees
Available on all major public clouds

The Paper!

Introduction

The cloud is great
Data warehouses (DWs) don't fit the cloud model
Today's data can't be processed by many DWs
Big Data platforms don't fit the DW model
Can we build a DW for the cloud?

Key Features of Snowflake

Software-as-a-service (SaaS) experience
Relational w/ ANSI SQL and ACID transactions
Semi-Structured data support (JSON)
Elastic: storage and compute are independent
Highly available
Durable
Cost-efficient
Secure

The Secret Sauce

Random access of files & custom data format
Columnar storage lends itself to compression

The Secret Sauce

Provides availability & durability
Immutability provides some of Snowflake's best features
Storage is cheap

Virtual Warehouses

Easy to start/stop and scale up/down
Allows different teams/processes to be isolated
Claims of "better performance for same price"

Cloud Stuff

Fault resilience across AZs (except warehouses)
Designed to be seamlessly upgraded

JSON/XML Support

Allows Snowflake to be used for ELT
Basically encourages you to dump tons of data straight into your warehouse
Automated type inference for columnar storage
- Happens for each file, which supports schema changes over time and makes it more efficient
- Optimistic conversion of types

Snowflake vs Competitors

(at the time)

Redshift

No database maintenance and tuning
Faster scale up and down
Compute and storage are separated
JSON support

BigQuery

Actual ANSI SQL
Supports updates and deletes for data

Today...

Redshift Spectrum

Allows compute and storage to be separated
Files in S3 need to be managed manually and schema needs to be defined
Still requires a Redshift cluster (but it's not used)
Compute capacity is not something the user controls
Pay per data scanned, not servers

BigQuery

It's really, really good now
No tuning (not even t-shirt sizes)
Pay per-query or for dedicated resources

Questions?

Made with Slides.com