Data Engineering for Startups
Our journey and lessons learned
John McKim
VP of Product & Technology
A Cloud Guru
@johncmckim
https://acloud.guru
What is A Cloud Guru
Cloud Training for Organisations and Engineers
ACG in 2017
Small team and Growing Fast
Series A
Oh shit. We need Metrics.
Data Eng v1
No experience. No problem
Data Eng v1
DynamoDB, Redshift and Segment
Data Eng v1
Not fun.
Time to Level Up
We need help
Levelling up
Challenges
- Stakeholders - unsure what they want
- Missing many data sources
- Firebase, Salesforce, Zendesk, Braintree, Hubspot
- Modelling - non-existent
Data Eng v2
Starting with Firebase
Data Eng v2
Firebase Data Pipeline
Data Eng v2
Adopting Fivetran
Data Eng v2
Fivetran
Data Replication Pipeline
- Pre-built connectors
- Supports many SaaS services
- Supports many Warehouses
- "Guaranteed data delivery" ...
Data Eng v2
Using DBT for Modelling
Data Eng v2
Build Models on your Data
- SQL Models
- Reference other models
- Materialise models as tables or views
{{ config(materialized='table',
          sort='full_date',
          dist='full_date') }}

-- Daily satisfaction survey counts per Zendesk agent and group
select
    created_at::date as full_date,
    zendesk_agent_id,
    zendesk_group_id,
    -- 'offered' = survey sent but not yet answered
    count(case score when 'offered' then 1 else null end) as surveys_sent,
    -- any score other than 'offered' counts as a response
    count(case score when 'offered' then null else 1 end) as responses,
    count(case score when 'good' then 1 else null end) as good_ratings,
    count(case score when 'bad' then 1 else null end) as bad_ratings
from {{ ref('dim_zendesk_satisfaction_rating') }}
group by 1, 2, 3
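To make the counting logic in this model concrete, here is a minimal plain-Python sketch of the same aggregation. The sample rows are made up for illustration, and `summarise` is a hypothetical helper, not part of dbt or of the original pipeline; `created_at` is assumed to be already truncated to a date.

```python
from collections import defaultdict

# Hypothetical rows standing in for dim_zendesk_satisfaction_rating.
rows = [
    {"created_at": "2018-11-01", "zendesk_agent_id": 1, "zendesk_group_id": 10, "score": "offered"},
    {"created_at": "2018-11-01", "zendesk_agent_id": 1, "zendesk_group_id": 10, "score": "good"},
    {"created_at": "2018-11-01", "zendesk_agent_id": 1, "zendesk_group_id": 10, "score": "bad"},
    {"created_at": "2018-11-02", "zendesk_agent_id": 2, "zendesk_group_id": 10, "score": "offered"},
]


def summarise(rows):
    """Group by (date, agent, group) and count scores like the SQL model does."""
    totals = defaultdict(
        lambda: {"surveys_sent": 0, "responses": 0, "good_ratings": 0, "bad_ratings": 0}
    )
    for row in rows:
        key = (row["created_at"], row["zendesk_agent_id"], row["zendesk_group_id"])
        counts = totals[key]
        if row["score"] == "offered":
            counts["surveys_sent"] += 1   # survey sent, not yet answered
        else:
            counts["responses"] += 1      # anything else counts as a response
        if row["score"] == "good":
            counts["good_ratings"] += 1
        elif row["score"] == "bad":
            counts["bad_ratings"] += 1
    return dict(totals)


summary = summarise(rows)
for key, counts in sorted(summary.items()):
    print(key, counts)
```

Each key in `summary` corresponds to one output row of the model, which is why the SQL groups by the first three columns.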
Data Eng v2
Done?
Data Eng v2
Maybe not.
Fivetran Reliability
Incident Timeline
- Historical reliability issue detected
- First historical occurrence - May 2018*
- First reported - 31 Oct 2018
- Sad times...
- Infra fixed - 3rd Jan
- Data re-synced - 16th Jan
Fivetran Reliability
Incident metrics
Severity: Production Impacted
MTTD: > 5 months
MTTR: > 10 weeks
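The two figures can be checked against the incident timeline with a short sketch. Two things here are assumptions, not facts from the timeline: the first occurrence is only given as "May 2018", so the last day of May is used to get a lower bound on detection time, and the January dates are taken to be in 2019. Recovery is measured from the first report to the data re-sync.

```python
from datetime import date

# Assumption: latest possible day in May 2018, giving a lower bound on MTTD.
first_occurrence_latest = date(2018, 5, 31)
first_reported = date(2018, 10, 31)
# Assumption: the January dates on the timeline fall in 2019.
data_resynced = date(2019, 1, 16)

detection_days = (first_reported - first_occurrence_latest).days
recovery_days = (data_resynced - first_reported).days
recovery_weeks = recovery_days / 7

print(f"Time to detect: at least {detection_days} days")
print(f"Time to recover: {recovery_days} days ({recovery_weeks:.0f} weeks)")
```

Under these assumptions detection took at least 153 days (about five months) and recovery took 77 days (11 weeks), consistent with the figures above.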
Semantic Layer
Build Performance
- Increasing DBT build times & failures
- Builds > 4 hrs for 150 models
- Errors indicating deadlocks
Redshift Performance
Ever slowing queries
- Up to 90sec query planning
- 90 sec * 150 models = 3.75 hrs per run on query planning alone
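The arithmetic behind that estimate, as a small sketch. It assumes the worst case (every model hits the full 90 seconds of planning) and that models build one after another; the variable names are illustrative only.

```python
planning_seconds_per_model = 90   # worst-case query planning time per model
model_count = 150                 # number of models in the build

total_planning_seconds = planning_seconds_per_model * model_count
total_planning_hours = total_planning_seconds / 3600

print(f"{total_planning_seconds} s = {total_planning_hours} hrs of query planning per run")
```

Because 90 seconds is an upper bound, 3.75 hours is a ceiling, but it sits close to the observed build times of more than 4 hours.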
Redshift Performance
Sad.
Redshift Performance
Breakthrough
- Increased cluster size as a Hail Mary
- Performance improved drastically, then degraded again
- Tested another reboot & saw the same result
Orange = Query Planning
Redshift Performance
No answers. Only suspicions.
- Bad Redshift node
- Segment - COPY command
- ...
- Any ideas?
Data Eng - Future
New Replication Service & Data Lake
Lessons Learned
Changing role of Data Eng
- Cloud provides - managed data infrastructure & pre-built connectors
- Outsourcing ETL will have a big impact on data eng
- designing, managing and optimizing core data infrastructure
- building and maintaining custom ingestion pipelines
https://blog.fishtownanalytics.com/does-my-startup-data-team-need-a-data-engineer-b6f4d68d7da9
Lessons Learned
Outsourcing has challenges
- DON'T build in-house just because we had a bad experience
- BUT, Cloud service selection is important
- Look for SLAs in contracts
- Understand your risk
- Have a Business Continuity Plan
Thanks for Listening!
Questions?
johncmckim.me
twitter.com/@johncmckim
medium.com/@johncmckim
data-eng-meetup-feb-2019
By John McKim