Task Configuration at Scale

Andrew Halberstadt

CI Automation

:ahal

What is "Scale"?

  • ~15,000 unique tasks
  • ~410 pushes / weekday
  • ~560 tasks / push (or 230k / weekday)

What is "Unique"?

  • Ignore runtime info (timestamps, repo, user, etc.)
  • Otherwise every difference counts, e.g.:
    • linux64 opt mochitest chunks 1-5 => 5 unique tasks
    • pref set vs unset => 2 unique tasks
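
One way to make this definition of "unique" concrete is to fingerprint each task definition after dropping its runtime-only fields. This is a minimal sketch, not how the numbers above were actually computed; the key names in RUNTIME_KEYS and the sample task fields are assumptions made up for illustration.

```python
import hashlib
import json

# Hypothetical runtime-only keys; the talk does not give the real list.
RUNTIME_KEYS = {"created", "deadline", "repo", "user"}


def task_fingerprint(task: dict) -> str:
    """Hash a task definition while ignoring runtime-only keys."""
    stable = {k: v for k, v in task.items() if k not in RUNTIME_KEYS}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def count_unique(tasks) -> int:
    """Count tasks that differ in anything other than runtime info."""
    return len({task_fingerprint(t) for t in tasks})


# Five chunks of the same suite: every chunk is its own unique task.
chunks = [
    {"platform": "linux64", "build": "opt", "suite": "mochitest", "chunk": n}
    for n in range(1, 6)
]

# The same task pushed by two different users: only runtime info differs.
same_task_two_pushes = [
    {"platform": "linux64", "suite": "mochitest", "chunk": 1, "user": "alice"},
    {"platform": "linux64", "suite": "mochitest", "chunk": 1, "user": "bob"},
]

# The same task with a pref set vs unset: the difference counts.
pref_variants = [
    {"platform": "linux64", "suite": "mochitest", "chunk": 1, "prefs": {}},
    {"platform": "linux64", "suite": "mochitest", "chunk": 1, "prefs": {"a.b": True}},
]

print(count_unique(chunks), count_unique(same_task_two_pushes), count_unique(pref_variants))
```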

 

There are a lot of similarities between many of those 15k tasks.

WET vs DRY

  • Write Everything Twice vs Don't Repeat Yourself
    • Aka duplication vs consolidation
    • Can apply to configuration as well as code
       
  • Two ends of a scale
    • Let's examine both ends at their extremes

Write Everything Twice

  • Pros
    • Easy to understand
    • Can handle new requirements well
       
  • Cons
    • Difficult to maintain
    • A pain to make sweeping changes

Don't Repeat Yourself

  • Pros
    • Fewest LOC
    • Can easily make sweeping changes
       
  • Cons
    • Also hard to maintain
    • Modifications are code refactorings
    • Hard to handle unforeseen changes

 

Both extremes are silly; there needs to be a balance.

Not All Configuration is Equal

  • Some configuration changes frequently
    • call this dynamic configuration
    • # of chunks, platforms, suites
  • Some configuration rarely changes
    • call this static configuration
    • caching, scopes, worker related configs

 

Dynamic configuration should be WET.

Static configuration should be DRY.

Easy, problem solved!

Configuration Groups

  • Many ways to group tasks, e.g.:
    • all tasks => {release, product}
    • product => {build, test, lint}
    • test => {platform, suite, platform+suite}
    • platform+suite => {chunk}
  • Many more axes to group tasks across

 

Each layer has distinct but not disjoint sets of dynamic and static configuration.
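
The layering can be pictured as a chain of lookups, where the most specific group that sets a value wins. The sketch below uses Python's collections.ChainMap to show the idea only; it is not how taskgraph resolves configuration, and every key and value in it is a made-up example.

```python
from collections import ChainMap

# Hypothetical configuration at each grouping level, from broad to narrow.
all_tasks = {"max-run-time": 3600, "tier": 1}
test_group = {"worker-type": "t-linux", "max-run-time": 5400}
platform_suite = {"chunks": 5}
single_task = {"chunk": 3}

# ChainMap searches its mappings left to right, so the narrowest group
# that defines a key takes precedence over the broader ones.
resolved = ChainMap(single_task, platform_suite, test_group, all_tasks)

print(resolved["max-run-time"])  # set by the test group, overriding all_tasks
print(resolved["tier"])          # inherited from all_tasks
print(dict(resolved))            # flattened view of the effective config
```

Changing a value in all_tasks affects every task that does not override it, while changing single_task touches only one task.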

Challenge

Design a configuration system that:

  • Is easy to understand and maintain
  • Is easy to modify
    • individual tasks
    • all tasks in a specific group (low or high)
  • Can handle uncertainty and changing requirements
    • easy to extend without regressing existing tasks
  • Reduces unnecessary duplication

Our Solution: Taskgraph

  • Not to be confused with "taskcluster"
    • Confusingly lives under /taskcluster
    • /taskcluster/taskgraph => core module
    • /taskcluster/ci => initial task configuration files
  • Docs: https://firefox-source-docs.mozilla.org/taskcluster/taskcluster/index.html
  • Originally designed by Dustin Mitchell
  • Shared ownership between many teams
    • build, ci automation, releng, taskcluster, +more

Graph Generation

# see all available steps
$ ./mach taskgraph --help

# generate and display the full task graph (labels only)
$ ./mach taskgraph full

# generate and display the target task graph (entire JSON)
$ ./mach taskgraph target -J

# similarly..
$ ./mach taskgraph optimized
$ ./mach taskgraph morphed

Step 1: Load Task Configs

  • Get a big list of every task
    • Read all the .yml files under /taskcluster/ci
  • Concepts
    • kinds / kind dependencies
    • jobs / jobs-from / job-defaults
    • transforms
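
The loading step can be sketched roughly as follows. This is a simplified illustration and not taskgraph's actual loader: the kind is an in-memory dict rather than a parsed .yml file, the recursive merge is an assumption about how job-defaults combine with each job, and the field names inside the jobs are invented.

```python
import copy


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge two dicts; values in `overrides` win."""
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def load_kind(kind_name: str, kind_config: dict) -> list:
    """Turn one kind's configuration into a flat list of task stubs."""
    defaults = kind_config.get("job-defaults", {})
    tasks = []
    for name, job in kind_config.get("jobs", {}).items():
        task = merge(defaults, job)
        task["label"] = f"{kind_name}-{name}"
        tasks.append(task)
    return tasks


# Stand-in for the parsed contents of a kind's .yml file (hypothetical values).
KIND = {
    "job-defaults": {"worker": {"max-run-time": 3600}, "treeherder": {"tier": 1}},
    "jobs": {
        "mochitest": {"chunks": 5},
        "xpcshell": {"chunks": 2, "worker": {"max-run-time": 5400}},
    },
}

tasks = load_kind("test", KIND)
for task in tasks:
    print(task["label"], task["worker"]["max-run-time"])
```

The per-job entries stay short and WET, while anything shared by the whole file lives once in job-defaults.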

Step 2: Apply Transforms

  • Gradually transform each task into its final form
    • Many "stages" of transformation
    • Validation at every step of the way
    • End result is in the format taskcluster expects
  • Concepts
    • transform functions
    • stages
    • schemas
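
The three concepts fit together roughly like this: each stage pairs a schema with a transform function, and tasks are validated before they enter a stage. The sketch below is standalone and does not use taskgraph's real transform or schema classes; the type-based schema check, the field names, and the "run-suite" command are all hypothetical.

```python
def validate(task: dict, schema: dict) -> None:
    """Check that every key in the schema is present with the expected type."""
    label = task.get("label", "?")
    for key, expected_type in schema.items():
        if key not in task:
            raise ValueError(f"{label}: missing key {key!r}")
        if not isinstance(task[key], expected_type):
            raise TypeError(f"{label}: {key!r} should be {expected_type.__name__}")


def split_chunks(tasks):
    """Transform: expand one task with N chunks into N single-chunk tasks."""
    for task in tasks:
        total = task["chunks"]
        for n in range(1, total + 1):
            chunked = {k: v for k, v in task.items() if k != "chunks"}
            chunked["label"] = f"{task['label']}-{n}"
            chunked["this-chunk"] = n
            chunked["total-chunks"] = total
            yield chunked


def to_task_definition(tasks):
    """Transform: build the final, service-facing shape of each task."""
    for task in tasks:
        yield {
            "label": task["label"],
            "payload": {
                "command": [
                    "run-suite",  # hypothetical command name
                    task["suite"],
                    f"--this-chunk={task['this-chunk']}",
                    f"--total-chunks={task['total-chunks']}",
                ]
            },
        }


# Each stage is (schema for its input, transform function).
STAGES = [
    ({"label": str, "suite": str, "chunks": int}, split_chunks),
    ({"label": str, "suite": str, "this-chunk": int, "total-chunks": int},
     to_task_definition),
]


def run_transforms(tasks, stages=STAGES) -> list:
    """Validate tasks against each stage's schema, then apply its transform."""
    for schema, transform in stages:
        tasks = list(tasks)
        for task in tasks:
            validate(task, schema)
        tasks = transform(tasks)
    return list(tasks)


final = run_transforms([{"label": "test-mochitest", "suite": "mochitest", "chunks": 3}])
for task in final:
    print(task["label"], task["payload"]["command"])
```

Because static details such as the command line live in a transform, changing them once changes every task that passes through that stage.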

Step 3: There is no Step 3

  • Now we have the "full task graph"
    • ./mach taskgraph full
    • DAG of all tasks (2+ million JSON-formatted lines)
  • Filter down to the target tasks and apply optimizations
  • Apply morphs
  • Submit to taskcluster (the service) via REST API
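
The filtering step can be illustrated on a toy graph: pick the tasks that were asked for, then pull in everything they depend on. This is a sketch under the assumption that the target graph is the target set plus its transitive dependencies; the labels and the prefix-based filter are invented, and optimization and morphs are left out.

```python
def with_dependencies(graph: dict, roots: set) -> set:
    """Return `roots` plus every task reachable through dependency edges."""
    seen = set()
    stack = list(roots)
    while stack:
        label = stack.pop()
        if label in seen:
            continue
        seen.add(label)
        stack.extend(graph[label])
    return seen


# Toy "full task graph": label -> labels it depends on (hypothetical labels).
FULL_GRAPH = {
    "build-linux64": [],
    "build-win64": [],
    "test-linux64-mochitest-1": ["build-linux64"],
    "test-win64-mochitest-1": ["build-win64"],
    "lint-eslint": [],
}

# Hypothetical target filter: only linux64 tests were requested.
target_set = {label for label in FULL_GRAPH if label.startswith("test-linux64")}
target_graph = with_dependencies(FULL_GRAPH, target_set)

print(sorted(target_graph))
```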

Recap

  • Dynamic vs Static
    • Configs in the .yml are generally dynamic
    • Configs in transforms are generally static
  • Concrete configuration groups (low to high):
    • individual task (key in .yml)
    • jobs-from (job-defaults)
    • kinds (lowest level transform)
    • transforms (intermediate stages)
    • task.py (highest level transform)
      • modification here affects every single task

Success?

  • Taskgraph is not perfect
    • Difficult to grok at first
    • Difficult to figure out where to set a config
      • No hard and fast rules
      • Requires a lot of intuition to get right
      • A lot of inconsistencies between teams
    • Extremely flexible + very little gatekeeping
      • opens door for all sorts of weird applications
      • complexity keeps increasing over time

Success!

  • Overall taskgraph is a big success
    • Allows us to move fast
    • Handles all requirement changes we throw at it
    • Ability to balance WET vs DRY
      • even if implementation is not always perfect
    • Can change entire configuration groups with ease
    • Self-serve + in-tree
      • many tasks are developer created
      • extremely powerful

Questions?

