DataFusion: Embeddable Query Engine Written in Rust

Who Am I?

Boaz Berman, 29, Software Engineer for 8 years

DataFusion is an extensible query planning, optimization, and execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

Purpose of this talk

  • Background
  • DataFusion Ingenuity
  • How Query Engines Work 101
  • DataFusion's Implementation
  • Benchmarks
https://www.oreilly.com/library/view/database-internals/9781492040330/ch01.html

Database Internals

Online Analytics Processing

No transactions, just data processing

Why Another Solution?

  • Embeddability
  • Cross-platform
  • Speed
    • Start time
    • GC pauses
  • Single binary
  • Ease of development
  • Predictable cost
  • Infinite(ish) horizontal scale
  • Scale up & down on request
  • Cost-effectiveness is in data locality and resource efficiency (No YARN)

Data IS Everywhere

  • KB to PB
  • Pandas good for local machine, Spark good for servers.
  • Write once, run at whatever scale (Using Ballista)
  • Every engine rewrites the entire thing
  • Multiple origins and formats
    • Presto leading the way
    • Arrow Flight

Language-independent columnar memory format, organized for efficient analytic operations on modern hardware like CPUs and GPUs. Supports zero-copy reads for lightning-fast data access without serialization overhead.

  • Storage <-> Memory parsing time

  • SIMD (Same Instruction Multiple Data)

  • GPUs are everywhere

  • Columnar formats (e.g. Parquet, ORC)

How Query Engines Work

&

DataFusion's Implementation

SQL

DataFrame

LogicalPlan

Optimizer

PhysicalPlan

Optimizer

async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // SQL API
    ctx.register_csv("titanic", data_file).await?;
    let df = ctx.sql(
    	"SELECT passengerid, survived FROM titanic \
        	WHERE (age > 20 AND age < 40) OR 1 != 1\
        	ORDER BY age DESC, survived",
        ).await?;
    df.show().await?;
    
    // DataFrame API
    let df = ctx
        .read_csv(data_file, CsvReadOptions::default())
        .await?
        .select_columns(&["passengerid", "survived"])?
        .filter(
            col("age")
                .gt(lit(20))
                .and(col("age").lt(lit(40)))
                .or(lit(1).not_eq(lit(1))),
        )?
        .sort(vec![
            col("age").sort(false, true),
            col("survived").sort(true, false),
        ])?;
    df.show().await?;

    Ok(())
}
pub async fn sql(&self, sql: &str) -> Result<Arc<DataFrame>> {
	// ----------
	// ----------
	let plan = self.create_logical_plan(sql)?;
	// ----------
	// ----------
	match plan {
		...
		plan => Ok(Arc::new(DataFrame::new(self.state.clone(), &plan))),
	}
}

...

pub fn create_logical_plan(&self, sql: &str) -> Result<LogicalPlan> {
	// ----------
	// ----------
	let mut statements = DFParser::parse_sql(sql)?;
	// ----------
	// ----------
	...
	SqlToRel::new(&state).statement_to_plan(statements)
}

SQL Parser

https://github.com/sqlparser-rs/sqlparser-rs

Open source standalone library extracted from DataFusion, in use by other projects

Parses SQL string and creates an SQL AST

DFParser::parse_sql(sql)
pub(crate) async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    let data_file = ...

    ctx.register_csv("titanic", data_file, CsvReadOptions::new()).await?;

    let df = ctx
        .sql("SELECT * FROM titanic")
        .await?
        .select_columns(&["c1", "c12"])?
        .filter(
            col("c1")
                .gt(lit(0.1))
                .and(col("c1").lt(lit(0.9)))
                .or(lit(1).not_eq(lit(1))),
        )?
        .sort(vec![
            col("c12").sort(false, true),
            col("c1").sort(true, false),
        ])?;

    df.show().await?;

    Ok(())
}

DataFrame

 

Data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.

 

Originally “data frame”, emerged in the S programming language at Bell Labs. Released in 1990.

 

 R, the open-source version of S, had its first stable release in 2000. In 2009, pandas was released to bring R dataframe semantics to Python.

pub async fn read_csv(
	&self,
	table_path: impl AsRef<str>,
	options: CsvReadOptions<'_>,
) -> Result<Arc<DataFrame>> {
	...
    self.read_table(Arc::new(provider))
}

pub fn read_table(
	&self, 
	provider: Arc<dyn TableProvider>
) -> Result<Arc<DataFrame>> {
	Ok(Arc::new(DataFrame::new(
		self.state.clone(),
        // ----------
		&LogicalPlanBuilder::scan(
			UNNAMED_TABLE,
			provider_as_source(provider),
			None
		)?.build()?,
        // ----------
	)))
}
pub fn scan(
    table_name: impl Into<String>,
    table_source: Arc<dyn TableSource>,
    projection: Option<Vec<usize>>,
) -> Result<Self> {
    let schema = table_source.schema();
    let projected_schema = ...;
    
    Ok(Self::from(LogicalPlan::TableScan(TableScan {
        table_name,
        source: table_source,
        projected_schema,
        projection,
        filters: vec![],
        fetch: None,
    })))
}
pub enum LogicalPlan {
    Projection(Projection),
    Filter(Filter),
    Window(Window),
    Aggregate(Aggregate),
    Sort(Sort),
    Join(Join),
    CrossJoin(CrossJoin),
    Repartition(Repartition),
    Union(Union),
    TableScan(TableScan),
    EmptyRelation(EmptyRelation),
    Subquery(Subquery),
    SubqueryAlias(SubqueryAlias),
    Limit(Limit),
    CreateExternalTable(CreateExternalTable),
    CreateMemoryTable(CreateMemoryTable),
    CreateView(CreateView),
    CreateCatalogSchema(CreateCatalogSchema),
    CreateCatalog(CreateCatalog),
    DropTable(DropTable),
    Values(Values),
    Explain(Explain),
    Analyze(Analyze),
    Extension(Extension),
    Distinct(Distinct),
}

Logical Plan

Describes conceptually what operation needs to be performed.

Physical Plan

Describes practically what operation will be performed.

EXPLAIN SELECT * FROM titanic
+---------------+-----------------------------------------------------------------------------+
| plan_type     | plan                                                                        |
+---------------+-----------------------------------------------------------------------------+
| logical_plan  | TableScan: titanic                                                          |
| physical_plan | CsvExec: files=[Users/boazbe/IdeaProjects/RustMeetup/src/data/titanic.csv], | 
|               |	has_header=true, limit=None, projection=None                          |
|               |                                                                             |
+---------------+-----------------------------------------------------------------------------+

TableScan

Projection

Filter

Projection: #titanic.passengerid, #titanic.survived
  Sort: #titanic.age DESC NULLS FIRST, #titanic.survived ASC NULLS LAST
    Filter: #titanic.age > Int32(20) AND #titanic.age < Int32(40) OR Int32(1) != Int32(1)
      Projection: #titanic.passengerid, #titanic.survived, #titanic.age
        Projection: #titanic.passengerid, #titanic.survived, #titanic.pclass, 
            #titanic.name, #titanic.sex, #titanic.age, #titanic.sibsp, 
            #titanic.parch, #titanic.ticket, #titanic.fare, #titanic.cabin, 
            #titanic.embarked
          TableScan: titanic

Projection

Sort

Projection

OptimizerS

| initial logical_plan                     | Projection: #titanic.passengerid, #titanic.survived                                                                                                                                                                        |
|                                          |   Sort: #titanic.age DESC NULLS FIRST, #titanic.survived ASC NULLS LAST                                                                                                                                                    |
|                                          |     Filter: #titanic.age > Int32(20) AND #titanic.age < Int32(40) OR Int32(1) != Int32(1)                                                                                                                                  |
|                                          |       Projection: #titanic.passengerid, #titanic.survived, #titanic.age                                                                                                                                                    |
|                                          |         Projection: #titanic.passengerid, #titanic.survived, #titanic.pclass, #titanic.name, #titanic.sex, #titanic.age, #titanic.sibsp, #titanic.parch, #titanic.ticket, #titanic.fare, #titanic.cabin, #titanic.embarked |
|                                          |           TableScan: titanic                                                                                                                                                                                               |
| after simplify_expressions               | Projection: #titanic.passengerid, #titanic.survived                                                                                                                                                                        |
|                                          |   Sort: #titanic.age DESC NULLS FIRST, #titanic.survived ASC NULLS LAST                                                                                                                                                    |
|                                          |     Filter: #titanic.age > Int32(20) AND #titanic.age < Int32(40) AS titanic.age > Int32(20) AND titanic.age < Int32(40) OR Int32(1) != Int32(1)                                                                           |
|                                          |       Projection: #titanic.passengerid, #titanic.survived, #titanic.age                                                                                                                                                    |
|                                          |         Projection: #titanic.passengerid, #titanic.survived, #titanic.pclass, #titanic.name, #titanic.sex, #titanic.age, #titanic.sibsp, #titanic.parch, #titanic.ticket, #titanic.fare, #titanic.cabin, #titanic.embarked |
|                                          |           TableScan: titanic                                                                                                                                                                                               |
| after decorrelate_where_exists           | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after decorrelate_where_in               | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after decorrelate_scalar_subquery        | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after subquery_filter_to_join            | Projection: #titanic.passengerid, #titanic.survived                                                                                                                                                                        |
|                                          |   Sort: #titanic.age DESC NULLS FIRST, #titanic.survived ASC NULLS LAST                                                                                                                                                    |
|                                          |     Filter: #titanic.age > Int32(20) AND #titanic.age < Int32(40)                                                                                                                                                          |
|                                          |       Projection: #titanic.passengerid, #titanic.survived, #titanic.age                                                                                                                                                    |
|                                          |         Projection: #titanic.passengerid, #titanic.survived, #titanic.pclass, #titanic.name, #titanic.sex, #titanic.age, #titanic.sibsp, #titanic.parch, #titanic.ticket, #titanic.fare, #titanic.cabin, #titanic.embarked |
|                                          |           TableScan: titanic                                                                                                                                                                                               |
| after eliminate_filter                   | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after common_sub_expression_eliminate    | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after eliminate_limit                    | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after projection_push_down               | Projection: #titanic.passengerid, #titanic.survived                                                                                                                                                                        |
|                                          |   Sort: #titanic.age DESC NULLS FIRST, #titanic.survived ASC NULLS LAST                                                                                                                                                    |
|                                          |     Filter: #titanic.age > Int32(20) AND #titanic.age < Int32(40)                                                                                                                                                          |
|                                          |       TableScan: titanic projection=[passengerid, survived, age]                                                                                                                                                           |
| after rewrite_disjunctive_predicate      | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after reduce_outer_join                  | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after filter_push_down                   | Projection: #titanic.passengerid, #titanic.survived                                                                                                                                                                        |
|                                          |   Sort: #titanic.age DESC NULLS FIRST, #titanic.survived ASC NULLS LAST                                                                                                                                                    |
|                                          |     Filter: #titanic.age > Int32(20) AND #titanic.age < Int32(40)                                                                                                                                                          |
|                                          |       TableScan: titanic projection=[passengerid, survived, age], partial_filters=[#titanic.age > Int32(20), #titanic.age < Int32(40)]                                                                                     |
| after limit_push_down                    | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| after SingleDistinctAggregationToGroupBy | SAME TEXT AS ABOVE                                                                                                                                                                                                         |
| logical_plan                             | Projection: #titanic.passengerid, #titanic.survived                                                                                                                                                                        |
|                                          |   Sort: #titanic.age DESC NULLS FIRST, #titanic.survived ASC NULLS LAST                                                                                                                                                    |
|                                          |     Filter: #titanic.age > Int32(20) AND #titanic.age < Int32(40)                                                                                                                                                          |
|                                          |       TableScan: titanic projection=[passengerid, survived, age], partial_filters=[#titanic.age > Int32(20), #titanic.age < Int32(40)]                                                                                     |
impl SessionContext {
    pub fn new() -> Self {
        Self::with_config(...)
    }

    pub fn with_config(config: SessionConfig) -> Self {
        Self::with_config_rt(...)
    }

    pub fn with_config_rt(config: SessionConfig, runtime: Arc<RuntimeEnv>) -> Self {
        let state = SessionState::with_config_rt(...);
        ...
    }
}
impl SessionState {
    pub fn with_config_rt(config: SessionConfig, runtime: Arc<RuntimeEnv>) -> Self {
        ...
        let mut rules: Vec<Arc<dyn OptimizerRule + Sync + Send>> = vec![
            Arc::new(SimplifyExpressions::new()),
            Arc::new(DecorrelateWhereExists::new()),
            Arc::new(DecorrelateWhereIn::new()),
            Arc::new(DecorrelateScalarSubquery::new()),
            Arc::new(SubqueryFilterToJoin::new()),
            Arc::new(EliminateFilter::new()),
            Arc::new(CommonSubexprEliminate::new()),
            Arc::new(EliminateLimit::new()),
            Arc::new(ProjectionPushDown::new()),
            Arc::new(RewriteDisjunctivePredicate::new()),
        ];
        if config.config_options.get_bool(OPT_FILTER_NULL_JOIN_KEYS) {
            rules.push(Arc::new(FilterNullJoinKeys::default()));
        }
        rules.push(Arc::new(ReduceOuterJoin::new()));
        rules.push(Arc::new(FilterPushDown::new()));
        rules.push(Arc::new(LimitPushDown::new()));
        rules.push(Arc::new(SingleDistinctToGroupBy::new()));

        // Physical Plan Optimization Rules
        let mut physical_optimizers: Vec<Arc<dyn PhysicalOptimizerRule + Sync + Send>> = vec![
            Arc::new(AggregateStatistics::new()),
            Arc::new(HashBuildProbeOrder::new()),
        ];
        if config.config_options.get_bool(OPT_COALESCE_BATCHES) {
            physical_optimizers.push(Arc::new(CoalesceBatches::new(...)));
        }
        physical_optimizers.push(Arc::new(Repartition::new()));
        physical_optimizers.push(Arc::new(AddCoalescePartitionsExec::new()));
        ...
    }
}
let df = ctx
	.read_csv(data_file.to_str().unwrap(), CsvReadOptions::default())
	.await?
	.filter(lit(true).eq(lit(false)))?;
+---------------+----------------------------------+
| plan_type     | plan                             |
+---------------+----------------------------------+
| logical_plan  | EmptyRelation                    |
| physical_plan | EmptyExec: produce_one_row=false |
|               |                                  |
+---------------+----------------------------------+
+------------------------------------------+----------------------------------------------------------+
| plan_type                                | plan                                                     |
+------------------------------------------+----------------------------------------------------------+
| initial_logical_plan                     | Filter: Boolean(true) = Boolean(false)                   |
|                                          |   TableScan: ?table?                                     |
| after simplify_expressions               | Filter: Boolean(false) AS Boolean(true) = Boolean(false) |
|                                          |   TableScan: ?table?                                     |
| after decorrelate_where_exists           | SAME TEXT AS ABOVE                                       |
| after decorrelate_where_in               | SAME TEXT AS ABOVE                                       |
| after decorrelate_scalar_subquery        | SAME TEXT AS ABOVE                                       |
| after subquery_filter_to_join            | Filter: Boolean(false)                                   |
|                                          |   TableScan: ?table?                                     |
| after eliminate_filter                   | EmptyRelation                                            |
| after common_sub_expression_eliminate    | SAME TEXT AS ABOVE                                       |
| after eliminate_limit                    | SAME TEXT AS ABOVE                                       |
| after projection_push_down               | SAME TEXT AS ABOVE                                       |
| after rewrite_disjunctive_predicate      | SAME TEXT AS ABOVE                                       |
| after reduce_outer_join                  | SAME TEXT AS ABOVE                                       |
| after filter_push_down                   | SAME TEXT AS ABOVE                                       |
| after limit_push_down                    | SAME TEXT AS ABOVE                                       |
| after SingleDistinctAggregationToGroupBy | SAME TEXT AS ABOVE                                       |
| logical_plan                             | EmptyRelation                                            |
|                                          |                                                          |
|                                          |                                                          |
| initial_physical_plan                    | EmptyExec: produce_one_row=false                         |
|                                          |                                                          |
| after aggregate_statistics               | SAME TEXT AS ABOVE                                       |
| after hash_build_probe_order             | SAME TEXT AS ABOVE                                       |
| after coalesce_batches                   | SAME TEXT AS ABOVE                                       |
| after repartition                        | SAME TEXT AS ABOVE                                       |
| after add_merge_exec                     | SAME TEXT AS ABOVE                                       |
| physical_plan                            | EmptyExec: produce_one_row=false                         |
+------------------------------------------+----------------------------------------------------------+
impl OptimizerRule for EliminateFilter {
    fn optimize(
        &self,
        plan: &LogicalPlan,
        optimizer_config: &mut OptimizerConfig,
    ) -> Result<LogicalPlan> {
        match plan {
            LogicalPlan::Filter(Filter {
                predicate: Expr::Literal(ScalarValue::Boolean(Some(v))),
                input,
            }) => {
                if !*v {
                    Ok(LogicalPlan::EmptyRelation(EmptyRelation {
                        produce_one_row: false,
                        schema: input.schema().clone(),
                    }))
                } else {
                    ...
                }
            }
            _ => {
                // Apply the optimization to all inputs of the plan
                ...
            }
        }
    }

    fn name(&self) -> &str {
        "eliminate_filter"
    }
}

Physical Plan

// ---------
// Calling any terminal operation, will do roughly this
// ---------
df.show().await?;
// ---------


pub async fn show(&self) -> Result<()> {
    let results = self.collect().await?;
    Ok(pretty::print_batches(&results)?)
}



pub async fn collect(&self) -> Result<Vec<RecordBatch>> {
    let plan = self.create_physical_plan().await?;
    ...
    collect(plan, ...).await
}
impl PhysicalOptimizerRule for AddCoalescePartitionsExec {
    fn optimize(
        &self,
        plan: Arc<dyn crate::physical_plan::ExecutionPlan>,
        config: &crate::execution::context::SessionConfig,
    ) -> Result<Arc<dyn crate::physical_plan::ExecutionPlan>> {
        if plan.children().is_empty() {
            Ok(plan.clone())
        } else {
            let children = ...
            match plan.required_child_distribution() {
                Distribution::UnspecifiedDistribution => {
                    with_new_children_if_necessary(plan, children)
                }
                Distribution::HashPartitioned(_) => {
                    with_new_children_if_necessary(plan, children)
                }
                Distribution::SinglePartition => with_new_children_if_necessary(
                    plan,
                    ...
                ),
            }
        }
    }

    fn name(&self) -> &str {
        "add_merge_exec"
    }
}
ProjectionExec: expr=[passengerid@0 as passengerid, survived@1 as survived]
   SortExec: [age@2 DESC,survived@1 ASC NULLS LAST]
     CoalescePartitionsExec
       CoalesceBatchesExec: target_batch_size=4096
         FilterExec: age@2 > CAST(20 AS Float64) AND age@2 < CAST(40 AS Float64)
           RepartitionExec: partitioning=RoundRobinBatch(10)
             CsvExec: files=[Users/boazbe/IdeaProjects/RustMeetup/src/data/titanic.csv],
                 has_header=true, limit=None, projection=[passengerid, survived, age]

Thank You

Minimal

By Boaz Berman

Minimal

  • 64