Data Pipelines
some DevOps ideas
how to assess Pipelines
tools to measure a Pipeline (PL) status:
- PL Maturity Matrix
- Service Level Objectives (SLOs) and Indicators (SLIs)
Pipeline maturity matrix
measures five key characteristics (Can be extended):
- Failure tolerance
- Scalability
- Monitoring and debugging
- Transparency and ease of implementation
- Unit and integration testing
Source: https://sre.google/workbook/data-processing/#pipeline-maturity-matrix
Pipelines SLOs and SLIs
Examples of Service Level Objectives and Indicators:
- Data freshness: e.g. X% of data is processed within Y minutes; the oldest data is no older than X minutes; the PL job completes successfully within X minutes
- Data correctness: e.g. on a per-job basis, less than X% of input items contain data errors; over an X-minute moving window, less than Y% of input items contain data errors. Data is checked against reference data via tests in CI/CD, end-to-end PL tests in a pre-production environment, and monitoring of PLs in production to observe metrics related to data correctness
- Data isolation/load balancing: e.g. high-priority jobs are processed within 10 minutes; standard-priority jobs are processed by 9 AM the next business day
source: https://cloud.google.com/solutions/building-production-ready-data-pipelines-using-dataflow-planning#defining_and_measuring_slos
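A minimal sketch of how the freshness and correctness indicators above could be computed, using only the Ruby standard library. The `Item` shape, the method names and the thresholds are assumptions made for illustration; they are not taken from the existing code base.

```ruby
# Hypothetical record shape: when the item was produced, when the PL finished
# processing it, and whether it passed validation against reference data.
Item = Struct.new(:produced_at, :processed_at, :valid)

# Freshness SLI: percentage of items processed within max_seconds.
def freshness_sli(items, max_seconds)
  return 100.0 if items.empty?
  on_time = items.count { |i| (i.processed_at - i.produced_at) <= max_seconds }
  100.0 * on_time / items.size
end

# Correctness SLI: percentage of items that contain data errors.
def error_rate_sli(items)
  return 0.0 if items.empty?
  100.0 * items.count { |i| !i.valid } / items.size
end

# An SLO is met when the SLI is on the right side of its target.
def slo_met?(sli, target, higher_is_better: true)
  higher_is_better ? sli >= target : sli <= target
end

now = Time.at(1_700_000_000)
items = [
  Item.new(now - 600, now - 540, true),  # processed in 60 seconds
  Item.new(now - 600, now - 300, true),  # processed in 300 seconds
  Item.new(now - 600, now, false),       # processed in 600 seconds, has an error
  Item.new(now - 120, now - 60, true)    # processed in 60 seconds
]

fresh = freshness_sli(items, 300)  # share of items processed within 5 minutes
errors = error_rate_sli(items)     # share of items with data errors

puts "freshness: #{fresh}% (SLO 90% met? #{slo_met?(fresh, 90.0)})"
puts "errors: #{errors}% (SLO <5% met? #{slo_met?(errors, 5.0, higher_is_better: false)})"
```

In practice the same numbers would come from the monitoring system over a moving window rather than from an in-memory list.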
Data characteristics
- Data is immutable (today will never produce data for yesterday), time-based, and independent of the user
- Data partition: it seems logical to partition by datetime (by day?) and by source
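As an illustration of the partitioning idea, the sketch below derives a partition key from the source and the UTC day of each record. The key layout (`source=.../date=...`), the record fields and the source names are hypothetical; the actual strategy still has to be chosen.

```ruby
# Builds a partition key such as "source=crm/date=2024-03-15".
# The day is taken in UTC so that a record always lands in the same partition.
def partition_key(source, timestamp)
  day = timestamp.getutc.strftime('%Y-%m-%d')
  "source=#{source}/date=#{day}"
end

# Hypothetical records: only the source and the event time matter here.
records = [
  { source: 'crm', at: Time.utc(2024, 3, 15, 23, 59, 0), value: 1 },
  { source: 'crm', at: Time.utc(2024, 3, 16, 0, 1, 0),   value: 2 },
  { source: 'web', at: Time.utc(2024, 3, 15, 12, 0, 0),  value: 3 },
  { source: 'crm', at: Time.utc(2024, 3, 15, 8, 30, 0),  value: 4 }
]

# Group records by partition; each group can be processed independently.
partitions = records.group_by { |r| partition_key(r[:source], r[:at]) }

partitions.each do |key, rows|
  puts "#{key}: #{rows.size} record(s)"
end
```

Because the data is immutable and time-based, a closed day's partition never changes, which makes reprocessing and backfills straightforward.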
An idea of a plan
Initial phases (can overlap):
- Status: Infrastructure as it is today
- Roadmap: define goals/plan
- Actions: start acting towards goals
Goals: improve on the PL Maturity Matrix and on SLOs + SLIs, reduce costs (time to market, cost of running Infra), improve CI/CD, automate operations, security everywhere
Long term vision
The 4 key tenets of Reactive Services:
- Responsive: the PLs deliver results in a timely manner
- Resilient: recover from errors (an external API failed, workers died, etc)
- Elastic: grows with load, shrinks on idle times
- Message Driven: communication is asynchronous (Redis + Sidekiq is already there)
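To make the "Resilient" tenet concrete, here is a minimal sketch of retrying a failing step with exponential backoff. `TransientError`, `with_retries` and the delays are assumptions for illustration only. Sidekiq already retries failed jobs on its own; this stdlib-only version just shows the idea.

```ruby
# Hypothetical error class for failures worth retrying,
# e.g. an external API that is temporarily unavailable.
class TransientError < StandardError; end

# Runs the block, retrying on TransientError with exponential backoff.
# Re-raises the error once max_attempts is reached.
def with_retries(max_attempts: 3, base_delay: 0.01)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientError
    raise if attempt >= max_attempts
    sleep(base_delay * (2**(attempt - 1)))  # 1x, 2x, 4x, ... the base delay
    retry
  end
end

# Simulated external call that fails twice before succeeding.
result = with_retries(max_attempts: 4) do |attempt|
  raise TransientError, 'external API unavailable' if attempt < 3
  "ok after #{attempt} attempts"
end

puts result
```

Retries only help with transient failures; a step that is retried must also be safe to run more than once.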
Achieve the vision
Divide and Conquer
- Parallelize
- Isolate: separate every data flow (PL), paths for Read and Write, minimize interactions between services
- Services only do 1 thing and do it well
- Async Messaging: (a perfect world = non-blocking API calls with callback urls for receiving results)
- Share nothing Architecture
- Immutable infrastructure
Close collaboration with the Tech Team is a must!
1. Phase: Status - Goals
Goals:
- handover Operations
- list pain points in: infrastructure (single points of failure, metrics, observability), operations, CI/CD, processes, security (CI/CD, code, infra, ops)
- derive a backlog of existing issues, and prioritize them
Expected Outcome:
- full picture of infra + PLs + documentation
- work started on current issues in Infra, CI/CD and ops
- current and future SLOs/SLIs, Maturity Matrix of PLs
1. Phase: Status - Actions
Actions:
- Operations: AWS, Jenkins, scripts
- Infra:
  - cost
  - observability = logs, alerts, metrics (!). Ideas for tools: New Relic, Airbrake, Prometheus (far future)
  - security
  - incident processes
- CI/CD: speed, reliability, security, ease of use
- Code:
  - Rails: e.g. are queue configs in code? Check Writes and Reads
  - UI: e.g. API for searches (read)
- Data: find partition strategies (e.g. by day, by source)
1. Phase: Status - Questions
e.g. Questions:
- Redis: what happens if we lose a Sidekiq message? Is Redis Highly Available (HA)? Is it Load Balanced (LB)?
- Mongo: is data partitioned? Is it HA? LB? Backups?
- Code:
  - Rails API dependency: what happens if 1 external API fails?
  - UI: are read queries coupled to Mongo's APIs? (low prio)
2. Phase: Roadmap
Goals:
- refine SLOs and SLIs
- work on the Maturity Matrix improvements by priority
- Performance improvements:
  - Infra: load balance queues + workers; new server creation (cost? time needed? impediments?); Linux server performance; Linux image (idea for the future: containerize the app)
  - CI/CD, processes and ops
- PL best practices and design, tech exchanges with the team
a Rails worker...
Given: a Rails Sidekiq worker that is fully configurable (via a Configuration System) at server creation, or that can be reconfigured and restarted
Then: configure Workers to
- read from queue x, write to queue y => Event
- read from source n, write to sink m => Data
Question: where is the queue configuration now?
This achieves Isolation of jobs/PLs: Load Balance jobs (add more workers for a high-priority queue), monitor/observe separately, etc
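A rough sketch of what such a worker configuration could look like, in plain Ruby. The `WorkerConfig` fields, the worker names and the concurrency values are hypothetical; in a real setup the raw entries would come from the Configuration System, and nothing here reflects how the queues are configured today.

```ruby
# Hypothetical description of one worker: what it reads, what it writes,
# and how many threads it gets.
WorkerConfig = Struct.new(:name, :kind, :input, :output, :concurrency, keyword_init: true)

# Example entries as they might arrive from a configuration system.
RAW_CONFIG = [
  { name: 'ingest-high', kind: :event, input: 'queue_x',  output: 'queue_y', concurrency: 8 },
  { name: 'export-std',  kind: :data,  input: 'source_n', output: 'sink_m',  concurrency: 2 }
].freeze

# Turns raw entries into WorkerConfig objects and rejects unknown kinds early,
# so a bad configuration fails at server creation instead of at runtime.
def load_worker_configs(raw)
  raw.map do |entry|
    config = WorkerConfig.new(**entry)
    unless %i[event data].include?(config.kind)
      raise ArgumentError, "unknown worker kind: #{config.kind.inspect}"
    end
    config
  end
end

# One-line summary, useful for logs at worker start-up.
def describe(config)
  "#{config.name}: #{config.kind} worker, #{config.input} -> #{config.output}, " \
    "concurrency #{config.concurrency}"
end

load_worker_configs(RAW_CONFIG).each { |config| puts describe(config) }
```

Giving a high-priority queue its own workers with a higher concurrency is then a configuration change rather than a code change.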
Some food for thought
thanks for reading :-)
By Joaquin Rivera Padron