DuckDB

A nontraditional database

How have I found DuckDB?

https://stackoverflow.com/a/74260936/1726295

select *
from read_parquet()
limit 100;

select count(*)
from read_parquet()
limit 100;

-- Inferred schema based on select
describe select *
from read_parquet();

-- Actual parquet file schema
select *
from parquet_schema();

select *
from parquet_metadata();

https://duckdb.org/docs/installation

Demo

Usage from IDE

https://duckdb.org/docs/guides/sql_editors/dbeaver.html

Demo

Parquet examples
CSV
Json examples
Creating views
Creating tables

DuckDB Modes

1. In-memory

2. Persistent

jdbc:duckdb:

jdbc:duckdb:memory:

jdbc:duckdb:/path/to/file.duckdb

In-memory

create view my_view as
select *
from read_json_auto('...');

select *
from my_view; -- actually reads a file



create table my_table -- copies data to memory
select *
from read_json_auto('...');

select *
from my_table; -- reads data from memory

Persistent

create view my_view as
select *
from read_json_auto('...');

select *
from my_view; -- actually reads a file



create table my_table -- copies data to DB file
select *
from read_json_auto('...');

select *
from my_table; -- reads data from DB file

SQL Dialect

https://duckdb.org/docs/sql/functions/overview

https://duckdb.org/docs/sql/statements/overview

https://duckdb.org/docs/sql/query_syntax/filter

Arrays/List
Lambda functions
Pattern Matching + Regular Expressions
Struct (json/kv)
Text Functions
Text Similarity Functions
Window Functions

Pivot/Unpivot

Filter
Grouping sets

What about writing?

Inserts work as expected:

For in-memory mode insert copies data to in-memory table.

For persistent mode insert copies data to table in DB file.

COPY statement

Extensions

+----------------+------------------------------------------------------------------------+
|extension_name  |description                                                             |
+----------------+------------------------------------------------------------------------+
|arrow           |A zero-copy data integration between Apache Arrow and DuckDB            |
|autocomplete    |Adds support for autocomplete in the shell                              |
|aws             |Provides features that depend on the AWS SDK                            |
|azure           |Adds a filesystem abstraction for Azure blob storage to DuckDB          |
|delta           |Adds support for Delta Lake                                             |
|excel           |Adds support for Excel-like format strings                              |
|fts             |Adds support for Full-Text Search Indexes                               |
|httpfs          |Adds support for reading and writing files over a HTTP(S) connection    |
|iceberg         |Adds support for Apache Iceberg                                         |
|icu             |Adds support for time zones and collations using the ICU library        |
|inet            |Adds support for IP-related data types and functions                    |
|jemalloc        |Overwrites system allocator with JEMalloc                               |
|json            |Adds support for JSON operations                                        |
|motherduck      |Enables motherduck integration with the system                          |
|mysql_scanner   |Adds support for connecting to a MySQL database                         |
|parquet         |Adds support for reading and writing parquet files                      |
|postgres_scanner|Adds support for connecting to a Postgres database                      |
|spatial         |Geospatial extension that adds support for working with spatial data... |
|sqlite_scanner  |Adds support for reading and writing SQLite database files              |
|substrait       |Adds support for the Substrait integration                              |
|tpcds           |Adds TPC-DS data generation and query support                           |
|tpch            |Adds TPC-H data generation and query support                            |
|vss             |Adds indexing support to accelerate Vector Similarity Search            |
+----------------+------------------------------------------------------------------------+

Extensions

S3 (httpfs + aws)
MySQL
PostgreSQL
Redshift
Google Sheets

DuckDB as an ETL tool

File formats
- Parquet
- JSON
- CSV

S3 support

MySQL + PostgreSQL support

DuckDB as an ETL tool

Source

Target

DuckDB

Local machine

EC2 instance

Lambda

DuckDB as an ETL tool

Lambda Demo

README
lambda_function.py
invoke.py
invoke_template.py
justfile
Airflow
- redshift/import/from_rds.py
- common/duckdb.py

https://bitbucket.org/parrotanalytics/duckdb_lambda

DuckDB downsides

1. Single-node processing

https://motherduck.com

2. Sharing

Kinda solvable with duckdb file + sql script

Summary

Amazing tool with a lot of use-cases

Reading/writing local files

Ad-hoc data transformations

Data analysis

Extremely fast

It just works

Thank you!

duckdb

By Yury Badalyants

duckdb

Internal Presentation for Parrot Analytics

Yury Badalyants

lmnet89

DuckDB

A nontraditional database

How have I found DuckDB?

How have I found DuckDB?

Demo

Usage from IDE

Demo

DuckDB Modes

In-memory

Persistent

SQL Dialect

What about writing?

Extensions

Extensions

DuckDB as an ETL tool

DuckDB as an ETL tool

DuckDB as an ETL tool

Lambda Demo

DuckDB downsides

Summary

Thank you!

duckdb

More from Yury Badalyants