DuckDB

A nontraditional database

How have I found DuckDB?

How have I found DuckDB?

select *
from read_parquet()
limit 100;

select count(*)
from read_parquet()
limit 100;

-- Inferred schema based on select
describe select *
from read_parquet();

-- Actual parquet file schema
select *
from parquet_schema();

select *
from parquet_metadata();

Demo

Usage from IDE

Demo

  1. Parquet examples
  2. CSV
  3. Json examples
  4. Creating views
  5. Creating tables

DuckDB Modes

1. In-memory

2. Persistent

jdbc:duckdb:

jdbc:duckdb:memory:

jdbc:duckdb:/path/to/file.duckdb

In-memory

create view my_view as
select *
from read_json_auto('...');

select *
from my_view; -- actually reads a file



create table my_table -- copies data to memory
select *
from read_json_auto('...');

select *
from my_table; -- reads data from memory

Persistent

create view my_view as
select *
from read_json_auto('...');

select *
from my_view; -- actually reads a file



create table my_table -- copies data to DB file
select *
from read_json_auto('...');

select *
from my_table; -- reads data from DB file

SQL Dialect

  • Arrays/List
  • Lambda functions
  • Pattern Matching + Regular Expressions
  • Struct (json/kv)
  • Text Functions
  • Text Similarity Functions
  • Window Functions
  • Pivot/Unpivot
  • Filter
  • Grouping sets

What about writing?

Inserts work as expected:

For in-memory mode insert copies data to in-memory table.

For persistent mode insert copies data to table in DB file.

Extensions

+----------------+------------------------------------------------------------------------+
|extension_name  |description                                                             |
+----------------+------------------------------------------------------------------------+
|arrow           |A zero-copy data integration between Apache Arrow and DuckDB            |
|autocomplete    |Adds support for autocomplete in the shell                              |
|aws             |Provides features that depend on the AWS SDK                            |
|azure           |Adds a filesystem abstraction for Azure blob storage to DuckDB          |
|delta           |Adds support for Delta Lake                                             |
|excel           |Adds support for Excel-like format strings                              |
|fts             |Adds support for Full-Text Search Indexes                               |
|httpfs          |Adds support for reading and writing files over a HTTP(S) connection    |
|iceberg         |Adds support for Apache Iceberg                                         |
|icu             |Adds support for time zones and collations using the ICU library        |
|inet            |Adds support for IP-related data types and functions                    |
|jemalloc        |Overwrites system allocator with JEMalloc                               |
|json            |Adds support for JSON operations                                        |
|motherduck      |Enables motherduck integration with the system                          |
|mysql_scanner   |Adds support for connecting to a MySQL database                         |
|parquet         |Adds support for reading and writing parquet files                      |
|postgres_scanner|Adds support for connecting to a Postgres database                      |
|spatial         |Geospatial extension that adds support for working with spatial data... |
|sqlite_scanner  |Adds support for reading and writing SQLite database files              |
|substrait       |Adds support for the Substrait integration                              |
|tpcds           |Adds TPC-DS data generation and query support                           |
|tpch            |Adds TPC-H data generation and query support                            |
|vss             |Adds indexing support to accelerate Vector Similarity Search            |
+----------------+------------------------------------------------------------------------+

Extensions

DuckDB as an ETL tool

  • File formats
    • Parquet
    • JSON
    • CSV
  • S3 support
  • MySQL + PostgreSQL support

DuckDB as an ETL tool

Source

Target

DuckDB

Local machine

EC2 instance

Lambda

DuckDB as an ETL tool

Lambda Demo

  • README
  • lambda_function.py
  • invoke.py
  • invoke_template.py
  • justfile
  • Airflow
    • redshift/import/from_rds.py
    • common/duckdb.py

DuckDB downsides

1. Single-node processing

2. Sharing

Kinda solvable with duckdb file + sql script

Summary

  • Amazing tool with a lot of use-cases
  • Reading/writing local files
  • Ad-hoc data transformations
  • Data analysis
  • ETL
  • Extremely fast
  • It just works

Thank you!

Made with Slides.com