It started at Airbnb in October 2014.
It became a Top-Level Apache Software Foundation project in January 2019.
```bash
export AIRFLOW_HOME=~/airflow
pip install apache-airflow
airflow initdb
airflow webserver -p 8080
```

(You may need to `pip install werkzeug==0.16.0` first.)
A DAG run is a physical instance of a DAG, containing task instances that run for a specific execution_date.
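As a sketch (the DAG and task names are illustrative), templated fields such as `{{ ds }}` expose that run's execution_date, so the same task prints a different date in every DAG run:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG('print_run_date', start_date=datetime(2016, 1, 1),
         schedule_interval='@daily') as dag:
    # {{ ds }} is rendered per DAG run as that run's execution_date
    # in YYYY-MM-DD form, so each run prints a different date.
    print_date = BashOperator(task_id='print_date',
                              bash_command='echo {{ ds }}')
```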
Directed - If multiple tasks exist, each must have at least one defined upstream (previous) or downstream (subsequent) task, though it may well have both.
Acyclic - No task can create data that goes on to reference itself, which could cause an infinite loop.
Graph - All tasks are laid out in a clear structure, with discrete processes occurring at set points and clear relationships to other tasks.
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def do_work():
    pass  # placeholder callable

with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    task_1 = PythonOperator(task_id='task_1', python_callable=do_work)
    task_2 = BashOperator(task_id='task_2', bash_command='echo done')
    task_1 >> task_2  # define dependencies: task_1 runs before task_2
```
```python
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

# Inside a `with DAG(...)` block:
with TaskGroup("group1") as group1:
    task1 = DummyOperator(task_id="task1")
    task2 = DummyOperator(task_id="task2")

task3 = DummyOperator(task_id="task3")

group1 >> task3  # task3 runs after every task in group1
```
| Airflow component | Role         |
|-------------------|--------------|
| Scheduler         | Master       |
| Worker            | Slave/Client |
| Web Server        | Web Server   |
Plugins:
- logging
- monitoring
- visualizing
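A minimal plugin skeleton, as a sketch (the class and plugin names are illustrative): subclassing `AirflowPlugin` is the hook for wiring in custom logging, monitoring, or visualization pieces.

```python
from airflow.plugins_manager import AirflowPlugin

class MonitoringPlugin(AirflowPlugin):
    # Name Airflow registers this plugin under.
    name = "monitoring_plugin"
    # Custom operators, hooks, macros, views, etc. would be listed here.
    operators = []
    hooks = []
    macros = []
```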
Focus on CODE: for a scheduled pipeline, the output is the same no matter when we run it.
Focus on DATA: for a scheduled pipeline, executing on different days gives different results, because each run is tied to its execution_date.
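A sketch of the data-focused style (the paths and names are illustrative): output is partitioned by execution_date, so rerunning a past date rebuilds exactly that day's data instead of whatever "today" happens to be.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def export_partition(ds, **kwargs):
    # ds is this run's execution_date (YYYY-MM-DD); writing to a
    # date-keyed path makes reruns and backfills reproducible.
    print('would write partition to /data/events/dt={}/'.format(ds))

with DAG('daily_export', start_date=datetime(2016, 1, 1),
         schedule_interval='@daily') as dag:
    export = PythonOperator(task_id='export_partition',
                            python_callable=export_partition,
                            provide_context=True)  # pass ds into the callable
```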
Instead of managing a flat list of shell commands, manage varied task groups (see the sketch after this list):
- Abstraction
- Modularity
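A sketch of that abstraction (the factory and table names are hypothetical): one helper stamps out the same extract-then-load pair for any table, rather than hand-maintaining one shell command per step.

```python
from airflow.operators.bash_operator import BashOperator

def build_table_tasks(dag, table):
    # Hypothetical factory: the same two-step pattern, generated per table.
    extract = BashOperator(task_id='extract_{}'.format(table),
                           bash_command='echo extracting {}'.format(table),
                           dag=dag)
    load = BashOperator(task_id='load_{}'.format(table),
                        bash_command='echo loading {}'.format(table),
                        dag=dag)
    extract >> load
    return extract, load

# Inside a DAG file:
# for table in ('users', 'orders'):
#     build_table_tasks(dag, table)
```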
How to use Airflow more efficiently (using R as an example)
Shared Python library for Airflow DAGs
Do you need to be a Python developer?
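As a sketch of what such a shared library could expose (the module and helper names are hypothetical): a wrapper that runs any R script through `Rscript`, so DAG authors only supply the script path instead of writing operator boilerplate.

```python
# dag_helpers.py -- a hypothetical shared library imported by many DAG files
from airflow.operators.bash_operator import BashOperator

def r_task(dag, task_id, script_path):
    # Wrap an R script in a BashOperator; callers need no Python beyond this.
    return BashOperator(task_id=task_id,
                        bash_command='Rscript {}'.format(script_path),
                        dag=dag)
```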
The focus on data makes pipelines hard to test locally.
One broken DAG file may break the whole Airflow deployment.
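One common mitigation, sketched here (assuming DAG files sit in the configured DAGs folder): a unit test that loads the DagBag locally and fails on any import error before a bad file ever reaches the shared deployment.

```python
from airflow.models import DagBag

def test_dags_import_cleanly():
    # Parsing every DAG file locally surfaces syntax and import errors
    # before they can break the shared Airflow instance.
    dag_bag = DagBag(include_examples=False)
    assert not dag_bag.import_errors
```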
Amazon Managed Workflows for Apache Airflow (MWAA) has been available since November 2020.