Data Engineering for Startups

Our journey and lessons learned

John McKim

VP of Product & Technology

A Cloud Guru

@johncmckim

https://acloud.guru

What is A Cloud Guru

Cloud Training for Organisations and Engineers

ACG in 2017

Small team, growing fast

Series A

Oh shit. We need Metrics.

Data Eng v1

No experience. No problem

Data Eng v1

DynamoDB, Redshift and Segment

Data Eng v1

Not fun.

Time to Level Up

We need help

Levelling up

Challenges

  • Stakeholders - unsure what they want
  • Missing many data sources
    • Firebase, Salesforce, Zendesk, Braintree, Hubspot
  • Modelling - non-existent

Data Eng v2

Starting with Firebase

Data Eng v2

 

Firebase Data Pipeline

Data Eng v2

Adopting Fivetran

Data Eng v2

Fivetran

Data Replication Pipeline

 

  • Pre-built connectors
  • Supports many SaaS services
  • Supports many Warehouses

 

  • "Guaranteed data delivery" ...

Data Eng v2

Using DBT for Modelling

Data Eng v2

Build Models on your Data

  • SQL Models
  • Reference other models
  • Materialise models as tables or views
{{ config(materialized='table',
    sort = 'full_date',
    dist = 'full_date') }}
    
select
    created_at::date as full_date,
    zendesk_agent_id,
    zendesk_group_id,
    count(case score when 'offered' then 1 else null end) as surveys_sent,
    count(case score when 'offered' then null else 1 end) as responses,
    count(case score when 'good' then 1 else null end) as good_ratings,
    count(case score when 'bad' then 1 else null end) as bad_ratings
from {{ref('dim_zendesk_satisfaction_rating')}}
group by 1,2,3
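
A minimal sketch of what the model above computes, run against an in-memory SQLite database so it works without dbt or Redshift. The sample rows are hypothetical, the `{{ref(...)}}` call is replaced by a plain table name, and `created_at::date` is swapped for SQLite's `date()` function; the conditional counts are otherwise unchanged.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption: SQLite, not Redshift).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table dim_zendesk_satisfaction_rating (
        created_at text,
        zendesk_agent_id integer,
        zendesk_group_id integer,
        score text
    )
    """
)

# Hypothetical sample ratings, invented purely for illustration.
sample_rows = [
    ("2019-01-07 09:00:00", 1, 10, "offered"),
    ("2019-01-07 10:00:00", 1, 10, "good"),
    ("2019-01-07 11:00:00", 1, 10, "bad"),
    ("2019-01-07 12:00:00", 1, 10, "good"),
    ("2019-01-08 09:00:00", 2, 10, "offered"),
]
conn.executemany(
    "insert into dim_zendesk_satisfaction_rating values (?, ?, ?, ?)",
    sample_rows,
)

# count() ignores NULLs, so each "case ... end" acts as a per-column filter.
query = """
select
    date(created_at) as full_date,
    zendesk_agent_id,
    zendesk_group_id,
    count(case score when 'offered' then 1 else null end) as surveys_sent,
    count(case score when 'offered' then null else 1 end) as responses,
    count(case score when 'good' then 1 else null end) as good_ratings,
    count(case score when 'bad' then 1 else null end) as bad_ratings
from dim_zendesk_satisfaction_rating
group by 1, 2, 3
order by 1, 2, 3
"""
results = conn.execute(query).fetchall()

for row in results:
    print(row)
```

Each output row is one day, agent and group, with one count per rating type.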

Data Eng v2

Done?

Data Eng v2

Maybe not.

Fivetran Reliability

Incident Timeline

  • Historical reliability issue detected
  • First historical occurrence - May 2018*
  • First reported - 31 Oct 2018
  • Sad times...
  • Infra fixed - 3 Jan 2019
  • Data re-synced - 16 Jan 2019

Fivetran Reliability

Incident metrics

Severity: Production Impacted

MTTD: > 5 months

MTTR: > 10 weeks
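
A rough check of these figures against the incident timeline, as a sketch under stated assumptions. The slides give only "May 2018" for the first occurrence, so the code uses 31 May 2018 as the latest possible date, which gives a lower bound on detection time. The January dates are assumed to fall in 2019. Recovery is measured from the first report to the data re-sync.

```python
from datetime import date

# Assumption: latest possible day in "May 2018", giving a lower bound on detection time.
first_occurrence_latest = date(2018, 5, 31)
first_reported = date(2018, 10, 31)
# Assumption: the January dates fall in 2019, the year after the report.
infra_fixed = date(2019, 1, 3)
data_resynced = date(2019, 1, 16)

# Time to detect: first occurrence until the issue was reported.
detection_days = (first_reported - first_occurrence_latest).days

# Time to recover: report until the infrastructure fix, and until data was whole again.
infra_fix_days = (infra_fixed - first_reported).days
resync_days = (data_resynced - first_reported).days

print(f"Detection: at least {detection_days} days (about 5 months)")
print(f"Report to infra fix: {infra_fix_days} days")
print(f"Report to re-sync: {resync_days} days ({resync_days / 7:.0f} weeks)")
```

Under these assumptions detection took at least 153 days and recovery took 77 days (11 weeks), consistent with "> 5 months" and "> 10 weeks".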

 

Semantic Layer

Build Performance

  • Increasing DBT build times & failures
  • Builds > 4 hrs for 150 models
  • Errors indicating deadlocks

Redshift Performance

Ever-slowing queries

  • Up to 90 sec of query planning per query
  • 90 sec * 150 models = up to 3.75 hrs per run on query planning alone
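
The arithmetic behind that estimate, as a small sketch. It assumes one planned query per model, models built one after another, and every query hitting the 90 sec worst case, so the result is an upper bound rather than a measurement. The function name is hypothetical.

```python
def planning_overhead_hours(seconds_per_model: float, model_count: int) -> float:
    """Total query-planning time in hours, assuming models build sequentially."""
    return seconds_per_model * model_count / 3600


# Figures from the slide: up to 90 sec of planning, 150 models.
worst_case_hours = planning_overhead_hours(90, 150)
print(f"Worst-case planning overhead: {worst_case_hours} hrs")
```

Against a total build time of more than 4 hrs, this suggests query planning, not query execution, dominated the run.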

Redshift Performance

Sad.

Redshift Performance

Breakthrough

  • Increased cluster size as a Hail Mary
  • Performance improved drastically, then degraded again
  • Tested another reboot & saw the same result

Orange = Query Planning

Redshift Performance

No answers. Only suspicions.

  • Bad Redshift node
  • Segment - COPY command
  • ...
  • Any ideas?

Data Eng - Future

New Replication Service & Data Lake

Lessons Learned

Changing role of Data Eng

  • Cloud provides managed data infrastructure & pre-built connectors
  • Outsourcing ETL will have a big impact on data eng. The focus shifts to:
    • designing, managing and optimizing core data infrastructure
    • building and maintaining custom ingestion pipelines

https://blog.fishtownanalytics.com/does-my-startup-data-team-need-a-data-engineer-b6f4d68d7da9

Lessons Learned

Outsourcing has challenges

  • DON'T build in-house just because we had a bad experience
  • BUT cloud service selection is important
    • Look for SLAs in contracts
    • Understand your risk
    • Have a Business Continuity Plan

 

Thanks for Listening!

Questions?

johncmckim.me

twitter.com/@johncmckim

medium.com/@johncmckim

data-eng-meetup-feb-2019

By John McKim
