Data Pipelines
some DevOps ideas
how to assess Pipelines
tools to measure a Pipeline's (PL) status:
- PL Maturity Matrix
- Service Level Objectives (SLOs) and Indicators (SLIs)
Pipeline maturity matrix
measures five key characteristics (can be extended):
- Failure tolerance
- Scalability
- Monitoring and debugging
- Transparency and ease of implementation
- Unit and integration testing
Source: https://sre.google/workbook/data-processing/#pipeline-maturity-matrix
Pipelines SLOs and SLIs
examples of Service Level Objectives and Indicators:
- Data freshness: e.g. X% of data processed in Y minutes; the oldest data is no older than X minutes; the PL job has completed successfully within X minutes
- Data correctness: e.g. on a per-job basis, less than X% of input items contain data errors; over an X-minute moving window, less than Y% of input items contain data errors. Data is checked against reference data via tests in CI/CD, end-to-end PL tests in a pre-production env, and monitoring PLs in prod to observe metrics related to data correctness
- Data isolation/load balancing: e.g. high-priority jobs are processed within 10 mins; standard-priority jobs are processed by 9 AM the next business day
source: https://cloud.google.com/solutions/building-production-ready-data-pipelines-using-dataflow-planning#defining_and_measuring_slos
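The freshness SLI above ("the oldest data is no older than X minutes") can be sketched as a small check; this is a minimal illustration, and the 15-minute threshold plus the method name `freshness_slo_met?` are assumptions, not part of the source:

```ruby
# Sketch: check the "oldest data is no older than X minutes" freshness SLI.
# FRESHNESS_SLO_SECONDS and freshness_slo_met? are illustrative names.
FRESHNESS_SLO_SECONDS = 15 * 60 # assume X = 15 minutes

# oldest_item_timestamp: Time of the oldest item still waiting in the PL
def freshness_slo_met?(oldest_item_timestamp, now: Time.now)
  (now - oldest_item_timestamp) <= FRESHNESS_SLO_SECONDS
end
```

A monitoring job could evaluate this periodically and export the raw age (or the boolean) as a metric for alerting.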
Data characteristics
- Data is immutable (today will never produce data for yesterday), time-based, and independent of the user
Data partitioning: it looks logical to partition by datetime (by day?) and by source
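The per-day, per-source partition idea could look like the following sketch; the `source/dt=YYYY-MM-DD` key layout is an assumption (e.g. for S3-style prefixes), not a decision from the source:

```ruby
# Sketch: build a partition key from source and ingestion day.
# The "source/dt=YYYY-MM-DD" layout is an illustrative assumption.
require "date"

def partition_key(source, timestamp)
  "#{source}/dt=#{timestamp.to_date.iso8601}"
end

partition_key("api_x", Time.utc(2020, 5, 1, 12)) # => "api_x/dt=2020-05-01"
```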
An idea of a plan
Initial phases (can overlap):
- Status: Infrastructure as it is today
- Roadmap: define goals/plan
- Actions: start acting towards goals
Goals: improve on the PL Maturity Matrix and SLOs + SLIs, reduce costs (time to market, money spent running infra), CI/CD, automate operations, security everywhere
Long term vision
Reactive Services 4 key tenets:
- Responsive: the PLs deliver results in a timely manner
- Resilient: recover from errors (an external API fails, workers die, etc.)
- Elastic: grows with load, shrinks on idle times
- Message Driven: communication is asynchronous (Redis + Sidekiq is already there)
Achieve the vision
Divide and Conquer
- Parallelize
- Isolate: separate every data flow (PL), paths for Read and Write, minimize interactions between services
- Services only do 1 thing and do it well
- Async Messaging: (a perfect world = non-blocking API calls with callback URLs for receiving results)
- Share nothing Architecture
- Immutable infrastructure
close work with the Tech Team is a must!
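The Async Messaging bullet above (non-blocking calls with callback URLs) can be sketched as follows; `JOBS`, `submit_job`, and the in-memory store are illustrative assumptions standing in for a real queue/backing store such as Redis:

```ruby
# Sketch: non-blocking submission with a callback URL.
# The caller gets a job id immediately; a worker would later process the
# payload and POST the result to callback_url. All names are illustrative.
require "securerandom"

JOBS = {} # stand-in for a durable queue/backing store (e.g. Redis)

def submit_job(payload, callback_url:)
  job_id = SecureRandom.uuid
  JOBS[job_id] = { payload: payload, callback_url: callback_url, status: :queued }
  job_id # returned right away; the caller never blocks on processing
end
```

The caller polls nothing and waits for nothing: results arrive at the callback URL, which keeps services decoupled and message-driven.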
Phase 1: Status - Goals
Goals:
- hand over Operations
- list pain points on: infrastructure (single point of failures, metrics, observability), operations, CI/CD, processes, security (CI/CD, code, infra, ops)
- derive a backlog of existing issues, and prioritize them
Expected Outcome:
- full picture of infra + PLs + documentation
- starting work on: Current issues on Infra + CI/CD, ops
- current and future SLOs/SLIs, Maturity Matrix of PLs
Phase 1: Status - Actions
Actions:
- Operations: AWS, Jenkins, scripts
Infra:
- cost
- observability = logs, alerts, metrics (!) Ideas for tools: New Relic, Airbrake, Prometheus (far future)
- security
- incident processes
- CI/CD: speed, reliability, security, ease of use
Code:
- Rails: e.g. are queue configs in code? check Writes and Reads
- UI: e.g. API for searches (read)
- Data: find partition strategies (e.g. by day, by source)
Phase 1: Status - Questions
e.g. Questions:
- Redis: what happens if we lose a sidekiq message? is Redis Highly Available (HA)? is it Load Balanced (LB)?
- Mongo: is data partitioned? is it HA? LB? Backups?
- Code:
- Rails API dependency: what happens if 1 external API fails?
- UI: are read queries coupled to Mongo's APIs? (low prio)
Phase 2: Roadmap
Goals:
- refine SLOs and SLIs
- work on the Maturity Matrix improvements by priority
- Performance improvements:
- Infra: load balance queues + workers; new server creation: cost? time needed? impediments? Linux server performance, Linux image (idea for the future: containerize the app)
- CI/CD, processes and ops
- PL best practices and design, tech exchanges with the team
a Rails worker...
Given: a Rails Sidekiq worker that is fully configurable (via a Configuration System) at server creation, or reconfigured and restarted
Then: configure Workers to
- read from queue x, write to queue y => Event
- read from source n, write to sink m => Data
Question: where is the queue configuration today?
This achieves Isolation of jobs/PLs: Load Balance jobs (add more workers for a high-priority queue), monitor/observe separately, etc
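The Given/Then above might look like this minimal sketch, assuming a YAML-based Configuration System; `PIPELINE_CONFIG`, `EnrichmentWorker`, and the queue names are illustrative, not the actual setup:

```ruby
# Sketch: resolve a worker's read/write queues from configuration loaded
# at boot, instead of hardcoding them. All names are assumptions.
require "yaml"

PIPELINE_CONFIG = YAML.safe_load(<<~YAML)
  enrichment:
    read_queue: raw_events
    write_queue: enriched_events
YAML

class EnrichmentWorker
  # In a real Sidekiq worker this would become:
  #   include Sidekiq::Worker
  #   sidekiq_options queue: PIPELINE_CONFIG["enrichment"]["read_queue"]
  def self.read_queue
    PIPELINE_CONFIG.fetch("enrichment").fetch("read_queue")
  end

  def self.write_queue
    PIPELINE_CONFIG.fetch("enrichment").fetch("write_queue")
  end
end
```

With queues in config rather than in code, rebalancing a high-priority PL becomes a reconfigure + restart rather than a deploy.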
Some food for thought
thanks for reading :-)
Data Pipelines
By Joaquin Rivera Padron