Data Pipelines

some DevOps ideas

How to assess pipelines

Tools to measure the status of a pipeline (PL):

  • PL Maturity Matrix
  • Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

Pipeline maturity matrix

Measures five key characteristics (can be extended):

  • Failure tolerance
  • Scalability
  • Monitoring and debugging
  • Transparency and ease of implementation
  • Unit and integration testing

 

 

Source: https://sre.google/workbook/data-processing/#pipeline-maturity-matrix
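
One way to make the matrix actionable is to record a score per characteristic for each PL and surface the weakest areas first. The sketch below is only an illustration: the 1 to 3 scale, the `assess` function, and the example scores are assumptions, not the levels defined in the source above.

```python
from statistics import mean

# The five characteristics named on the slide.
CHARACTERISTICS = (
    "failure_tolerance",
    "scalability",
    "monitoring_and_debugging",
    "transparency_and_ease_of_implementation",
    "unit_and_integration_testing",
)

# Hypothetical scale (an assumption): 1 = lowest maturity, 3 = highest.
MIN_LEVEL, MAX_LEVEL = 1, 3


def assess(scores: dict[str, int]) -> dict:
    """Summarize one pipeline's scores and list its weakest characteristics."""
    missing = set(CHARACTERISTICS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in CHARACTERISTICS:
        if not MIN_LEVEL <= scores[name] <= MAX_LEVEL:
            raise ValueError(f"{name} must be between {MIN_LEVEL} and {MAX_LEVEL}")
    lowest = min(scores[name] for name in CHARACTERISTICS)
    return {
        "average": mean(scores[name] for name in CHARACTERISTICS),
        "weakest": [n for n in CHARACTERISTICS if scores[n] == lowest],
    }


# Made-up scores for one pipeline, for illustration only.
example = {
    "failure_tolerance": 1,
    "scalability": 2,
    "monitoring_and_debugging": 1,
    "transparency_and_ease_of_implementation": 3,
    "unit_and_integration_testing": 2,
}
result = assess(example)
print(result["weakest"])
```

The "weakest" list gives a natural starting order for improvement work on that PL.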

Pipelines SLOs and SLIs

Examples of Service Level Objectives and Indicators:

  • Data freshness, e.g.:
        - X% of data is processed within Y minutes
        - the oldest data is no older than X minutes
        - the PL job completes successfully within X minutes
  • Data correctness, e.g.:
        - on a per-job basis: fewer than X% of input items contain data errors
        - over an X-minute moving window: fewer than Y% of input items contain data errors
        - data is checked against reference data via tests in CI/CD, end-to-end PL tests in a pre-production environment, and monitoring of production PLs for metrics related to data correctness
  • Data isolation/load balancing, e.g.:
        - high-priority jobs are processed within 10 minutes
        - standard-priority jobs are processed by 9 AM the next business day

Source: https://cloud.google.com/solutions/building-production-ready-data-pipelines-using-dataflow-planning#defining_and_measuring_slos
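
To make the freshness and correctness indicators concrete, here is a minimal sketch of how the two SLIs could be computed from per-item timestamps and error counts. The function names, the 10-minute threshold, and the sample data are assumptions made for illustration; a real PL would read these values from its own metrics.

```python
from datetime import datetime, timedelta, timezone


def freshness_sli(event_times, processed_times, threshold: timedelta) -> float:
    """Percentage of items processed within `threshold` of their event time."""
    if len(event_times) != len(processed_times):
        raise ValueError("both lists must have the same length")
    if not event_times:
        return 100.0  # nothing to process, so nothing is stale
    on_time = sum(
        1 for event, done in zip(event_times, processed_times)
        if done - event <= threshold
    )
    return 100.0 * on_time / len(event_times)


def error_rate_sli(total_items: int, items_with_errors: int) -> float:
    """Percentage of input items that contain data errors."""
    if total_items == 0:
        return 0.0
    return 100.0 * items_with_errors / total_items


# Made-up sample: four items created at noon, processed 2, 5, 9 and 30 minutes later.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [t0] * 4
processed = [t0 + timedelta(minutes=m) for m in (2, 5, 9, 30)]

freshness = freshness_sli(events, processed, threshold=timedelta(minutes=10))
error_rate = error_rate_sli(total_items=1000, items_with_errors=5)
print(freshness, error_rate)
```

An SLO is then a target on these numbers, for example "freshness is at least X%" or "error rate is below Y%".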

Data characteristics

  • Data is immutable (today will never produce data for yesterday), time-based, and independent of the user
  • Data partitioning: it seems logical to partition by datetime (by day?) and by source
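
A partition key that combines source and day could look like the sketch below. The key layout (`source=.../date=...`), the use of UTC days, and the `partition_key` name are assumptions for illustration, not an existing convention of the current system.

```python
from datetime import datetime, timedelta, timezone


def partition_key(source: str, event_time: datetime) -> str:
    """Build a key such as 'source=crm/date=2024-01-31' (UTC day)."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    # Normalize to UTC so the same instant always lands in the same partition.
    day = event_time.astimezone(timezone.utc).date().isoformat()
    return f"source={source}/date={day}"


# Two representations of the same instant map to the same partition.
utc_time = datetime(2024, 1, 31, 23, 30, tzinfo=timezone.utc)
local_time = datetime(2024, 2, 1, 1, 30, tzinfo=timezone(timedelta(hours=2)))

key_a = partition_key("crm", utc_time)
key_b = partition_key("crm", local_time)
print(key_a, key_b)
```

Because the data is immutable and time-based, a partition for a past day never changes once that day is over, which makes it easy to reprocess or archive one partition at a time.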

An idea of a plan

Initial phases (can overlap):

  • Status: Infrastructure as it is today
  • Roadmap: define goals/plan
  • Actions: start acting towards goals

 

Goals: improve the PL Maturity Matrix scores and the SLOs + SLIs, reduce costs (time to market, money spent on running infra), improve CI/CD, automate operations, security everywhere

Long term vision

The 4 key tenets of Reactive Services:

  • Responsive: the PLs deliver results in a timely manner
  • Resilient: recover from errors (an external API failed, workers died, etc.)
  • Elastic: grows with load, shrinks when idle
  • Message Driven: communication is asynchronous (Redis + Sidekiq is already there)
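
As an illustration of the Resilient tenet, the sketch below retries a failing call with exponential backoff, which is one common way to recover from a temporarily failing external API. The `call_with_retries` helper, its defaults, and the `flaky_api_call` stand-in are hypothetical; Sidekiq ships its own retry mechanism, which this sketch does not model.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying with exponential backoff when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an external API that fails twice and then succeeds.
calls = {"count": 0}


def flaky_api_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of really sleeping, to keep the example fast.
delays = []
outcome = call_with_retries(flaky_api_call, sleep=delays.append)
print(outcome, delays)
```

Catching every `Exception` keeps the sketch short; in practice only errors that are known to be transient should be retried.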

Achieve the vision

Divide and Conquer

  • Parallelize
  • Isolate: separate every data flow (PL), separate the paths for reads and writes, minimize interactions between services
  • Each service does one thing and does it well
  • Async messaging: in a perfect world, non-blocking API calls with callback URLs for receiving results
  • Shared-nothing architecture
  • Immutable infrastructure

 

Working closely with the Tech Team is a must!

1. Phase: Status - Goals

Goals: 

  • hand over Operations
  • list pain points in: infrastructure (single points of failure, metrics, observability), operations, CI/CD, processes, security (CI/CD, code, infra, ops)
  • derive a backlog of existing issues and prioritize them

 

Expected Outcome:

  • full picture of infra + PLs + documentation
  • work started on current issues in infra, CI/CD, and ops
  • current and future SLOs/SLIs and Maturity Matrix of the PLs

1. Phase: Status - Actions

Actions:

  • Operations: AWS, Jenkins, scripts
  • Infra:
        - cost
        - observability = logs, alerts, metrics (!). Tool ideas: New Relic, Airbrake, Prometheus (far future)
        - security
        - incident processes
  • CI/CD: speed, reliability, security, ease of use
  • Code:
        - Rails: e.g. are queue configs in code? Check writes and reads
        - UI: e.g. API for searches (reads)
  • Data: find partitioning strategies (e.g. by day, by source)

1. Phase: Status - Questions

Example questions:

  • Redis: what happens if we lose a Sidekiq message? Is Redis highly available (HA)? Is it load balanced (LB)?
  • Mongo: is the data partitioned? Is it HA? LB? Are there backups?
  • Code:
    - Rails API dependency: what happens if one external API fails?
    - UI: are read queries coupled to Mongo's APIs? (low prio)

2. Phase: Roadmap

Goals:

  • refine SLOs and SLIs
  • work on the Maturity Matrix improvements by priority
  • Performance improvements:
      - Infra: load balance queues + workers; new server creation (cost? time needed? impediments?); Linux server performance; Linux image (idea for the future: containerize the app)
      - CI/CD, processes and ops
  • PL best practices and design, tech exchanges with team

a Rails worker...

Given: a Rails Sidekiq worker that is fully configurable (via a configuration system) at server creation, or that can be reconfigured and restarted

Then: configure workers to
   - read from queue x, write to queue y => Event
   - read from source n, write to sink m => Data

Question: where is the queue configuration now?

This achieves isolation of jobs/PLs: load balance jobs (add more workers for a high-priority queue), monitor/observe them separately, etc.
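
The idea of a worker driven purely by its configuration can be sketched as follows. This is a language-neutral illustration in Python with in-memory queues, not Sidekiq code: `WorkerConfig`, `run_once`, and the queue names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WorkerConfig:
    """Hypothetical per-worker settings, e.g. injected at server creation."""
    name: str
    read_queue: str
    write_queue: str


def run_once(config: WorkerConfig, queues: dict, handler: Callable) -> bool:
    """Move one message from config.read_queue to config.write_queue.

    Returns False when the read queue is empty.
    """
    source = queues.setdefault(config.read_queue, deque())
    if not source:
        return False
    message = source.popleft()
    queues.setdefault(config.write_queue, deque()).append(handler(message))
    return True


# In-memory stand-ins for the real queues.
queues = {
    "high_priority": deque(["a", "b"]),
    "default": deque(["c"]),
}

# Load balancing by configuration only: two workers on the high-priority queue.
workers = [
    WorkerConfig("w1", read_queue="high_priority", write_queue="results"),
    WorkerConfig("w2", read_queue="high_priority", write_queue="results"),
    WorkerConfig("w3", read_queue="default", write_queue="results"),
]

for worker in workers:
    run_once(worker, queues, handler=str.upper)

print(list(queues["results"]))
```

Adding capacity to one PL then means starting another worker with the same configuration, without touching the worker code.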

Some food for thought

thanks for reading :-)

Data Pipelines

By Joaquin Rivera Padron
