DuckDB
An in-process analytical SQL database — "SQLite for analytics."
olap columnar analytical embedded open-source sql
What it is
DuckDB is an in-process analytical database. Like SQLite, it runs as a library linked into your application — no server, no port. Unlike SQLite, it is columnar and vectorized, designed for analytical SQL over large local datasets. The most common pitch is “SQLite for analytics.”
It was created at CWI Amsterdam (the database research lab that also produced MonetDB) by Hannes Mühleisen and Mark Raasveldt, with first release in 2019 and v1.0 in June 2024. DuckDB Labs, the steward, is based in Amsterdam.
Why people use it
- No server.
pip install duckdband run analytical SQL against your Pandas dataframes, Parquet files, CSVs, or remote S3 objects. - Reads everything natively. CSV, Parquet, JSON, Arrow, even direct Postgres or MySQL via extensions. No ingestion step for ad-hoc analysis.
- Standard SQL. PostgreSQL-compatible-ish dialect. Window functions, CTEs, full analytical SQL.
- Fast. Vectorized execution and columnar storage. For local analytical queries on data up to ~hundreds of GB, DuckDB is competitive with much heavier systems.
- Embeddable everywhere. Python, R, JavaScript (including in the browser via Wasm), Rust, Go, Java — first-class bindings.
- MIT licensed. Genuinely free, no commercial license clauses.
When to use DuckDB
- Data analysis / data science where you currently use Pandas + numpy and want analytical SQL.
- Local OLAP over Parquet, CSV, or Arrow files.
- ETL pipelines where you don’t want to stand up a server.
- Small-medium analytics replacing ClickHouse or Snowflake (think: anything that fits on one machine).
- Embedded analytical engines inside applications (BI tools, dashboards, notebook engines).
- Running SQL over remote S3 data lakes without spinning up a query engine.
When not to use DuckDB
- Multi-user concurrent writes. DuckDB is single-process, single-writer.
- Always-on services with thousands of QPS. Use a server-based system.
- Petabyte-scale. DuckDB is single-node. Beyond your machine’s memory and disk, you need ClickHouse, BigQuery, or Snowflake.
- OLTP. Columnar storage makes single-row inserts and updates relatively slow. Use Postgres.
Notable trade-offs
- Single-process. No client/server protocol natively. MotherDuck (the commercial offering) adds remote access, but vanilla DuckDB is in-process.
- Memory pressure. Large queries can OOM if intermediate state doesn’t fit. DuckDB has spilling-to-disk for many operators, but not all.
- Newer than ClickHouse. Fewer “battle-tested at scale” production deployments. The 1.0 release in 2024 stabilized the format and API; older databases may need migration.
- Schema-on-read is convenient but hides issues. Reading a CSV with auto-detection is great until your data has subtle type drift across files.
Ecosystem
- MotherDuck. Commercial managed DuckDB with a hybrid local/cloud query model. Founded in 2022 by Jordan Tigani (formerly Google BigQuery).
- DuckDB-Wasm. DuckDB compiled to WebAssembly. Runs in the browser, used in tools like Evidence, Observable, and Rill.
- pgduckdb. DuckDB as a Postgres extension — query Parquet files from Postgres using DuckDB’s engine.
- dbt, Dagster, Streamlit, Evidence. All have first-class DuckDB integrations.
- Extensions. DuckDB has its own extension system: spatial, full-text search, JSON, ICU, HTTP filesystem (read directly from S3 / GCS / Azure Blob), and more.