Task Configuration at Scale

Andrew Halberstadt

CI Automation

:ahal

What is "Scale"?

~15,000 unique tasks
~410 pushes / weekday
~560 tasks / push (or 230k / weekday)
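
The per-weekday figure follows from the two numbers above it. A quick check, using the slide's rounded values as inputs:

```python
# Rounded figures from the slide; the real numbers vary from day to day.
pushes_per_weekday = 410
tasks_per_push = 560

tasks_per_weekday = pushes_per_weekday * tasks_per_push
print(f"{tasks_per_weekday:,} tasks / weekday")  # 229,600, i.e. roughly 230k
```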


What is "Unique"?

Ignore runtime info (timestamps, repo, user, etc.)
Otherwise every difference counts, e.g.:
- linux64 opt mochitest chunks 1-5 => 5 unique tasks
- pref set vs unset => 2 unique tasks

There are a lot of similarities between many of those 15k tasks.
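
To see how quickly "every difference counts" multiplies, here is a small sketch that enumerates task labels across a few axes. The platform, suite, and pref names and the label format are made up for illustration; they are not the real Firefox task labels.

```python
from itertools import product

# Hypothetical axes; the values are illustrative only.
platforms = ["linux64/opt", "linux64/debug"]
suites = ["mochitest"]
pref_variants = ["", "-pref"]  # pref unset vs. pref set
chunks = range(1, 6)           # chunks 1-5

labels = [
    f"test-{platform}-{suite}{variant}-{chunk}"
    for platform, suite, variant, chunk in product(
        platforms, suites, pref_variants, chunks
    )
]

# 2 platforms * 1 suite * 2 pref variants * 5 chunks
print(len(labels))
# Every label is distinct, yet they differ in only one or two fields.
print(len(set(labels)))
```

Adding one more platform or one more variant multiplies the total again, which is how a modest number of axes ends up as thousands of unique tasks.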

WET vs DRY

Write Everything Twice vs Don't Repeat Yourself
- Aka duplication vs consolidation
- Can apply to configuration as well as code
Two ends of a scale
- Let's examine both ends at their extremes

Write Everything Twice

Pros
- Easy to understand
- Can handle new requirements well
Cons
- Difficult to maintain
- A pain to make sweeping changes

Don't Repeat Yourself

Pros
- Fewest LOC
- Can easily make sweeping changes
Cons
- Also hard to maintain
- Modifications are code refactorings
- Hard to handle unforeseen changes

Both extremes are silly; there needs to be a balance.

Not All Configuration is Equal

Some configuration changes frequently
- call this dynamic configuration
- # of chunks, platforms, suites
Some configuration rarely changes
- call this static configuration
- caching, scopes, worker related configs

Dynamic configuration should be WET.

Static configuration should be DRY.
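
One way to picture that split, as a rough sketch rather than a description of any real system: static settings live in a single shared place, while each task spells out its own dynamic settings. All keys, values, and the `resolve` helper are hypothetical.

```python
# Static configuration: rarely changes, so it is defined once (DRY).
STATIC_DEFAULTS = {
    "worker-type": "b-linux",          # hypothetical value
    "scopes": ["queue:get-artifact"],  # hypothetical value
    "cache": "checkouts",              # hypothetical value
}

# Dynamic configuration: changes often, so each task spells it out (WET).
TASKS = {
    "mochitest-linux64": {"platform": "linux64", "suite": "mochitest", "chunks": 5},
    "mochitest-win64": {"platform": "win64", "suite": "mochitest", "chunks": 8},
    "xpcshell-linux64": {"platform": "linux64", "suite": "xpcshell", "chunks": 2},
}


def resolve(name):
    """Combine the shared static settings with one task's dynamic settings."""
    return {**STATIC_DEFAULTS, **TASKS[name], "name": name}


task = resolve("mochitest-win64")
print(task["chunks"], task["worker-type"])
```

Changing the chunk count for one task touches one obvious line, while changing the cache for every task touches one shared entry.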

Easy, problem solved!

Configuration Groups

Many ways to group tasks, e.g.:
- all tasks => {release, product}
- product => {build, test, lint}
- test => {platform, suite, platform+suite}
- platform+suite => {chunk}
Many more axes to group tasks across

Each layer has distinct but not disjoint sets of dynamic and static configuration.
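
A sketch of why these groups matter: if every task carries the attributes that place it in each group, a change can be aimed at a narrow group or a broad one. The task names, attributes, settings, and the `select` helper are all hypothetical.

```python
# Hypothetical tasks, each tagged with the groups it belongs to.
tasks = [
    {"label": "build-linux64", "kind": "build", "platform": "linux64"},
    {"label": "test-linux64-mochitest-1", "kind": "test",
     "platform": "linux64", "suite": "mochitest", "chunk": 1},
    {"label": "test-linux64-mochitest-2", "kind": "test",
     "platform": "linux64", "suite": "mochitest", "chunk": 2},
    {"label": "test-win64-xpcshell-1", "kind": "test",
     "platform": "win64", "suite": "xpcshell", "chunk": 1},
    {"label": "lint-eslint", "kind": "lint"},
]


def select(tasks, **criteria):
    """Return the tasks whose attributes match every given criterion."""
    return [
        task for task in tasks
        if all(task.get(key) == value for key, value in criteria.items())
    ]


# Low-level group: a single platform+suite combination.
for task in select(tasks, platform="linux64", suite="mochitest"):
    task["max-run-time"] = 5400

# High-level group: every test task.
for task in select(tasks, kind="test"):
    task["tier"] = 1
```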

Challenge

Design a configuration system that:

Is easy to understand and maintain
Is easy to modify
- individual tasks
- all tasks in a specific group (low or high)
Can handle uncertainty and changing requirements
- easy to extend without regressing existing tasks
Reduces unnecessary duplication

Our Solution: Taskgraph

Not to be confused with "taskcluster"
- Confusingly lives under /taskcluster
- /taskcluster/taskgraph => core module
- /taskcluster/ci => initial task configuration files
Docs: https://firefox-source-docs.mozilla.org/taskcluster/taskcluster/index.html
Originally designed by Dustin Mitchell
Shared ownership between many teams
- build, ci automation, releng, taskcluster, +more

Graph Generation

https://firefox-source-docs.mozilla.org/taskcluster/taskcluster/taskgraph.html#graph-generation

# see all available steps
$ ./mach taskgraph --help

# generate and display the full task graph (labels only)
$ ./mach taskgraph full

# generate and display the target task graph (entire JSON)
$ ./mach taskgraph target -J

# similarly..
$ ./mach taskgraph optimized
$ ./mach taskgraph morphed

Step 1: Load Task Configs

Get a big list of every task
- Read all the .yml files under /taskcluster/ci
Concepts
- kinds / kind dependencies
- jobs / jobs-from / job-defaults
- transforms
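
How `job-defaults` might combine with individual `jobs`, sketched in plain Python. The dict below stands in for a parsed .yml file; the `jobs` and `job-defaults` keys come from the slide, but the job names, values, and the merge rules in the `merge` helper are assumptions, not Taskgraph's actual loader.

```python
import copy

# Stand-in for one parsed .yml file (contents are hypothetical).
kind_config = {
    "job-defaults": {
        "worker": {"max-run-time": 3600, "docker-image": "lint"},
        "treeherder": {"tier": 1},
    },
    "jobs": {
        "eslint": {"run": "eslint"},
        "flake8": {"run": "flake8", "worker": {"max-run-time": 1800}},
    },
}


def merge(defaults, overrides):
    """Recursively merge two dicts; values in `overrides` win."""
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def load_jobs(config):
    """Yield one dict per job, with the file's defaults applied."""
    defaults = config.get("job-defaults", {})
    for name, job in config["jobs"].items():
        yield {"name": name, **merge(defaults, job)}


jobs = list(load_jobs(kind_config))
for job in jobs:
    print(job["name"], job["worker"]["max-run-time"])
```

Each job only states what makes it different; everything else comes from the defaults shared by the file.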

Step 2: Apply Transforms

Gradually transform each task into its final form
- Many "stages" of transformation
- Validation at every step of the way
- End result is in a format taskcluster expects
Concepts
- transform functions
- stages
- schemas
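
The idea of staged transforms with validation between stages can be sketched with chained generator functions. This is plain Python for illustration, not Taskgraph's transform API; the stage names, the keys, and the simple `validate` check standing in for a schema are assumptions.

```python
def validate(job, required):
    """Stand-in for schema validation: check that required keys exist."""
    missing = required - set(job)
    if missing:
        raise ValueError(f"{job.get('name')}: missing keys {sorted(missing)}")
    return job


def split_chunks(jobs):
    """Stage 1: expand a job with N chunks into N separate jobs."""
    for job in jobs:
        job = validate(job, {"name", "chunks"})
        total = job["chunks"]
        for i in range(1, total + 1):
            chunked = {k: v for k, v in job.items() if k != "chunks"}
            chunked.update(name=f"{job['name']}-{i}", chunk=i, total_chunks=total)
            yield chunked


def add_static_config(jobs):
    """Stage 2: fill in settings that every job shares (values hypothetical)."""
    for job in jobs:
        job = validate(job, {"name", "chunk"})
        yield {**job, "worker-type": "t-linux", "max-run-time": 3600}


def run_transforms(jobs, transforms):
    """Feed the jobs through each transform in order."""
    for transform in transforms:
        jobs = transform(jobs)
    return list(jobs)


tasks = run_transforms(
    [{"name": "mochitest", "chunks": 3}],
    [split_chunks, add_static_config],
)
for task in tasks:
    print(task["name"], task["worker-type"])
```

The short input stays close to what a person would write by hand, and each stage adds the detail that would be tedious to repeat.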

Step 3: There is no Step 3

Now we have the "full task graph"
- ./mach taskgraph full
- DAG of all tasks (2+ million lines of formatted JSON)
Filter down to the target tasks and apply optimizations
Apply morphs
Submit to taskcluster (the service) via REST API
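
A minimal sketch of the filtering step, assuming the graph is represented as a mapping from task label to the labels it depends on. The labels and the `with_dependencies` helper are hypothetical, and optimizations and morphs are not shown.

```python
# Hypothetical full task graph: label -> labels it depends on.
full_graph = {
    "build-linux64": [],
    "build-win64": [],
    "test-linux64-mochitest-1": ["build-linux64"],
    "test-linux64-mochitest-2": ["build-linux64"],
    "test-win64-xpcshell-1": ["build-win64"],
    "lint-eslint": [],
}


def with_dependencies(graph, targets):
    """Return the targets plus everything they transitively depend on."""
    seen = set()
    stack = list(targets)
    while stack:
        label = stack.pop()
        if label not in seen:
            seen.add(label)
            stack.extend(graph[label])
    return seen


# Pick the tasks we actually want, then pull in what they need to run.
targets = {label for label in full_graph if "linux64-mochitest" in label}
target_graph = with_dependencies(full_graph, targets)
print(sorted(target_graph))
```

Only the two selected test tasks and the build they depend on remain; the rest of the full graph is left out of the submission.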

Recap

Dynamic vs Static
- Configs in the .yml are generally dynamic
- Configs in transforms are generally static
Concrete configuration groups (low to high):
- individual task (key in .yml)
- jobs-from (job-defaults)
- kinds (lowest level transform)
- transforms (intermediate stages)
- task.py (highest level transform)
  - modification here affects every single task

Success?

Taskgraph is not perfect
- Difficult to grok at first
- Difficult to figure out where to set a config
  - No hard and fast rules
  - Requires a lot of intuition to get right
  - A lot of inconsistencies between teams
- Extremely flexible + very little gatekeeping
  - opens the door for all sorts of weird applications
  - complexity keeps increasing over time

Success!

Overall taskgraph is a big success
- Allows us to move fast
- Handles all requirement changes we throw at it
- Ability to balance WET vs DRY
  - even if implementation is not always perfect
- Can change entire configuration groups with ease
- Self-serve + in-tree
  - many tasks are developer-created
  - extremely powerful

Questions?

https://slides.com/ahal/taskgraph

https://firefox-source-docs.mozilla.org/taskcluster/taskcluster

https://docs.taskcluster.net/docs
