# Introduction to DuckDB
DuckDB is an in-process, analytical SQL database. You can think of it as SQLite for analytics (OLAP), not transactions (OLTP).
Instead of running as a server, DuckDB runs inside your application process - Python, R, Java, C++, and more - and queries data where it already lives.
## What makes DuckDB special
### Built for analytics
DuckDB is optimized for:
- Large scans
- Aggregations
- Joins
- Window functions
- Complex analytical queries
Think: dashboards, feature engineering, ad-hoc analysis, data science - not user logins or high-frequency writes.
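To give a flavour of what that looks like in practice, here is a minimal sketch (the `sales.parquet` file and its `region`/`price` columns are hypothetical) combining an aggregation with a window function:

```python
import duckdb

# Hypothetical file and columns: total revenue per region, ranked by revenue
duckdb.sql("""
    SELECT
        region,
        sum(price) AS revenue,
        rank() OVER (ORDER BY sum(price) DESC) AS revenue_rank
    FROM 'sales.parquet'
    GROUP BY region
    ORDER BY revenue_rank
""").show()
```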
### In-process (no server)
There’s:
- No daemon
- No port
- No client/server overhead
You just:
```python
import duckdb

duckdb.sql("SELECT * FROM 'data.parquet'")
```

### Columnar + vectorized execution
Under the hood, DuckDB:
- Stores data in columnar format
- Processes data in vectors (chunks) instead of row-by-row
This is why it’s fast. Very fast. Often competitive with Spark for single-machine workloads.
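One practical consequence is that results can be handed to other columnar tools cheaply. A small sketch (assuming pyarrow is installed) that materializes a query result as an Arrow table:

```python
import duckdb

# range(5) is a built-in table function producing rows 0..4;
# .arrow() returns the result as a pyarrow.Table (columnar, not row-by-row)
tbl = duckdb.sql("SELECT range AS n, range * 2 AS doubled FROM range(5)").arrow()
print(tbl)
```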
### Reads data directly (zero-copy vibes)
DuckDB can query:
- Parquet
- CSV
- JSON
- Arrow
- Pandas / Polars DataFrames
- S3 / HTTP / local files
Without importing them first.
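One caveat for the query below: depending on your DuckDB version, reading from s3:// or http(s):// paths may require the httpfs extension to be installed and loaded first, roughly like this:

```python
import duckdb

# httpfs adds support for http(s):// and s3:// paths;
# newer DuckDB versions can autoload it, older ones need this step explicitly
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
```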
```sql
SELECT avg(price)
FROM read_parquet('s3://bucket/sales/*.parquet');
```

## How it compares to other tools
| Tool | Best at | DuckDB difference |
|---|---|---|
| SQLite | App storage | DuckDB is analytical |
| Postgres | Transactions + analytics | DuckDB is faster for scans, simpler to embed |
| Pandas | In-memory analysis | DuckDB scales better + SQL |
| Spark | Distributed big data | DuckDB is single-node but simpler & often faster locally |
| BigQuery | Cloud analytics | DuckDB is local, offline, cheap (free) |
## Typical use cases
DuckDB shines when you:
- Do data analysis in Python/R
- Work with Parquet files
- Need SQL over DataFrames
- Want reproducible, local analytics
- Don’t want to run a database server
Examples:
- Feature engineering for ML
- Exploratory data analysis
- Replacing slow pandas groupbys (see the sketch below)
- Local analytics before pushing to a warehouse
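To illustrate the pandas-groupby replacement, a minimal sketch with a hypothetical DataFrame (DuckDB can see local DataFrame variables by name, as shown later in this page):

```python
import duckdb
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# Roughly equivalent to sales.groupby("region")["price"].mean().reset_index(),
# but expressed as SQL and returned as a pandas DataFrame
avg_by_region = duckdb.sql("""
    SELECT region, avg(price) AS avg_price
    FROM sales
    GROUP BY region
""").df()
print(avg_by_region)
```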
## Limitations
DuckDB is not:
- A transactional database
- A multi-user, concurrent server
- Meant for high-write workloads
Rule of thumb:
If you’d normally reach for Spark, BigQuery, or heavy pandas, reach for DuckDB. If you need users, writes, and locks, consider Postgres.
## Installing DuckDB
If you’re using Python without conda, install with pip:
```bash
pip install duckdb
```

If you’re using a conda environment, install with:

```bash
conda install python-duckdb -c conda-forge
```

## Using DuckDB
Using DuckDB is as simple as importing the library and running SQL queries. Here’s a quick example:
```python
import duckdb

duckdb.sql("SELECT 1").show()
```

```text
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘
```
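Beyond `.show()`, a result can also be pulled into ordinary Python objects; for example:

```python
import duckdb

rows = duckdb.sql("SELECT 1 AS one").fetchall()  # list of tuples: [(1,)]
df = duckdb.sql("SELECT 1 AS one").df()          # pandas DataFrame (requires pandas)
```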
## Reading data with DuckDB
DuckDB can read data from multiple sources, including Parquet files, CSVs, and even directly from Pandas DataFrames. You can query these data sources using SQL syntax, making it easy to perform complex analyses without needing to load everything into memory first.
From DuckDB Documentation:
```python
import duckdb

duckdb.read_csv("example.csv")                 # read a CSV file into a Relation
duckdb.read_parquet("example.parquet")         # read a Parquet file into a Relation
duckdb.read_json("example.json")               # read a JSON file into a Relation

duckdb.sql("SELECT * FROM 'example.csv'")      # directly query a CSV file
duckdb.sql("SELECT * FROM 'example.parquet'")  # directly query a Parquet file
duckdb.sql("SELECT * FROM 'example.json'")     # directly query a JSON file
```

DuckDB can also query Pandas DataFrames directly, without importing or copying them first. This allows you to leverage the power of SQL for data manipulation while still working with familiar DataFrame structures.
```python
import pandas as pd

pandas_df = pd.DataFrame({"name": ["John", "Jane"], "age": [30, 25]})
duckdb.sql("SELECT * FROM pandas_df")
```

```text
┌─────────┬───────┐
│  name   │  age  │
│ varchar │ int64 │
├─────────┼───────┤
│ John    │    30 │
│ Jane    │    25 │
└─────────┴───────┘
```

## Persistent Storage
You can create a connection to a persistent database using duckdb.connect(dbname), although DuckDB is often used for ad-hoc analysis without needing to manage a database file.
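For completeness, a minimal sketch of working with a persistent database file (the file name here is just an example):

```python
import duckdb

# Creates analysis.duckdb on disk if it doesn't exist; tables persist between runs
con = duckdb.connect("analysis.duckdb")
con.sql("CREATE OR REPLACE TABLE numbers AS SELECT range AS n FROM range(10)")
print(con.sql("SELECT count(*) FROM numbers").fetchone())  # (10,)
con.close()
```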
For our credit card transactions analysis, we’ll be using DuckDB to query Parquet files directly, so we won’t need to set up a persistent database.