Workflows & Pipelines

A Breath of Fresh Air with Apache Airflow

Florian Dambrine - Principal Engineer


> Agenda <

  • General concepts
    • What is Airflow?
    • Airflow Executors
    • Airflow 101
    • Airflow Operators
  • Airflow Deployments
    • Airflow as a service
    • MLE Airflow Infrastructure

 > What is Airflow?

  • Open source project from Airbnb
  • Airflow is a platform to programmatically
    author, schedule and monitor workflows
  • Alternative to AWS DataPipeline
  • ETL
  • Machine Learning Jobs
  • Cron replacement (offers HA / automatic catch up)

 > Airflow use cases

 > Airflow Example DAG

import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

# dag_id and schedule_interval are illustrative
dag = DAG(
    dag_id='example_dag',
    default_args=args,
    schedule_interval='@daily',
)

run_after = DummyOperator(
    task_id='run_after',
    dag=dag,
)

run_first = BashOperator(
    task_id='run_first',
    bash_command='echo "Hello World!"',
    dag=dag,
)

# Set the dependency: run_first executes before run_after
run_first >> run_after

if __name__ == "__main__":
    dag.cli()
(Directed Acyclic Graph)

 > Airflow Executors

How tasks are executed by Airflow:

  • SequentialExecutor: runs one task at a time on the Airflow instance (development purposes)

  • LocalExecutor: runs multiple tasks at a time on the Airflow instance (pre-forking model / vertical scaling)

  • KubernetesExecutor: kicks off Kubernetes pods to execute tasks and cleans them up automatically on job completion (dynamic horizontal scaling)

  • CeleryExecutor: delegates task execution to Celery workers; requires a message broker like Redis (common in production, horizontal scaling)
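The executor is selected in airflow.cfg (or overridden with the AIRFLOW__CORE__EXECUTOR environment variable). A Celery setup might look like this (the broker URL is illustrative):

```ini
[core]
# One of: SequentialExecutor, LocalExecutor, KubernetesExecutor, CeleryExecutor
executor = CeleryExecutor

[celery]
# CeleryExecutor needs a message broker such as Redis
broker_url = redis://redis:6379/0
```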

 > Airflow 101

# XComs (Cross-Communications)

Let tasks exchange messages. XComs are made of key, value, timestamp and task/dag info. They can be pushed or pulled
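Inside PythonOperator callables, a push/pull pair looks roughly like this. The task id 'extract' and key 'row_count' are made-up names, and `ti` is the TaskInstance that Airflow injects into the task context:

```python
# Sketch of XCom usage inside two PythonOperator callables.
# Airflow passes the task context (including 'ti') as keyword arguments
# when the operator is created with provide_context=True.

def extract(**context):
    """Push a value so downstream tasks can read it."""
    context['ti'].xcom_push(key='row_count', value=42)

def load(**context):
    """Pull the value pushed by the (hypothetical) 'extract' task."""
    count = context['ti'].xcom_pull(task_ids='extract', key='row_count')
    return count
```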

# Connections

The connection information to external systems is stored in the Airflow metadata database and managed in the UI
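Besides the UI, connections can also be injected as URI-formatted environment variables (the conn id and credentials below are made up):

```shell
# Airflow resolves env vars named AIRFLOW_CONN_<CONN_ID> as connections.
# Hypothetical Postgres connection with conn_id 'my_postgres':
export AIRFLOW_CONN_MY_POSTGRES="postgresql://user:pass@db.example.com:5432/analytics"
```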


# DAGs

In Airflow, a DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

# Plugins

Airflow offers a generic toolbox for working with data. Using Airflow plugins can be a way for companies to customize their Airflow installation to reflect their ecosystem.

  • Make common code logic available to all DAGs (shared library)
  • Write your own Operators
  • Extend Airflow and build on top of it (Auditing tool)
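Dropping a module like the following into the plugins/ folder is enough to register a plugin. The class and plugin names are made up, and the AirflowPlugin base class is stubbed here so the sketch stands alone without an Airflow install:

```python
# Stub standing in for airflow.plugins_manager.AirflowPlugin so the
# sketch runs without Airflow installed.
class AirflowPlugin:
    pass

class MleSharedPlugin(AirflowPlugin):
    # Name under which Airflow registers the plugin (made-up).
    name = "mle_shared"
    # Custom Operator classes listed here become importable by every DAG.
    operators = []
    # Helper functions exposed to Jinja templates.
    macros = []
```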

# Operators

While DAGs describe how to run a workflow, Operators determine what actually gets done. An operator describes a single task in a workflow.
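A minimal custom operator is just a BaseOperator subclass with an execute() method. In the sketch below the class and greeting are made up, and the real airflow.models.BaseOperator is replaced by a stub so the file runs standalone:

```python
# Stub standing in for airflow.models.BaseOperator so the sketch runs
# without Airflow installed.
class BaseOperator:
    def __init__(self, task_id=None, **kwargs):
        self.task_id = task_id

class HelloOperator(BaseOperator):
    """Hypothetical operator: the DAG decides when it runs,
    execute() is what it does."""

    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context=None):
        # Called by the executor for each task run; with the real
        # BaseOperator the return value is pushed to XCom automatically.
        return 'Hello, %s!' % self.name
```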

 > Airflow Operators

Focus on your business logic and leverage operators to glue things together...


  • SQSPublishOperator: send a message to SQS
  • DruidOperator: submit tasks to Druid
  • DockerOperator: execute a command inside a container
  • JiraOperator: perform actions on Jira
  • SlackAPIPostOperator: send a notification to Slack
  • DatabricksSubmitRunOperator: submit a Spark job to Databricks
  • S3FileTransformOperator: S3 > local > transform > S3
  • SparkSqlOperator: run Spark SQL queries
  • S3ToRedshiftTransfer: load files from S3 to Redshift
  • PostgresOperator / MySqlOperator: execute a SQL query

 > Airflow as a service

  • GCP Cloud Composer: Airflow as a service (on Kubernetes)
    •     IAM role integration
    •     No-Ops (almost...)
  • Astronomer (Airflow Committers)
    •     Deployment in Astronomer Cloud
    •     On-prem Kubernetes cluster
  • MLE Airflow 
    •     Perfect project to learn and adopt Kubernetes in MLE team
    •     ML projects are Kubernetes centric (Kubeflow / MLFlow)
    •     Built to be easily migrated to a hosted service if wanted
    •     Stateful components leveraging managed services (RDS)
    •     Fully integrated with IAM roles / scoped permissions

 > Airflow Infrastructure

 > Airflow DEMO

Thanks !

Ready to jump into Airflow now?
