# Introduction to DuckDB
DuckDB is an in-process, analytical SQL database. You can think of it as SQLite for analytics (OLAP), not transactions (OLTP).
Instead of running as a server, DuckDB runs inside your application process - Python, R, Java, C++, and more - and queries data where it already lives.
## What makes DuckDB special
### Built for analytics
DuckDB is optimized for:
- Large scans
- Aggregations
- Joins
- Window functions
- Complex analytical queries
Think: dashboards, feature engineering, ad-hoc analysis, data science - not user logins or high-frequency writes.
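To give a flavour of what that looks like in practice, here is a minimal sketch (the `sales.parquet` file and its `region`/`price` columns are hypothetical) combining an aggregation with a window function:

```python
import duckdb

# Hypothetical file and columns: total revenue per region, ranked by revenue
duckdb.sql("""
    SELECT
        region,
        sum(price) AS revenue,
        rank() OVER (ORDER BY sum(price) DESC) AS revenue_rank
    FROM 'sales.parquet'
    GROUP BY region
    ORDER BY revenue_rank
""").show()
```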
### In-process (no server)
There’s:
- No daemon
- No port
- No client/server overhead
You just:
```python
import duckdb

duckdb.sql("SELECT * FROM 'data.parquet'")
```

### Columnar + vectorized execution
Under the hood, DuckDB:
- Stores data in columnar format
- Processes data in vectors (chunks) instead of row-by-row
This is why it’s fast. Very fast. Often competitive with Spark for single-machine workloads.
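One practical consequence is that results can be handed to other columnar tools cheaply. A small sketch (assuming pyarrow is installed) that materializes a query result as an Arrow table:

```python
import duckdb

# range(5) is a built-in table function producing rows 0..4;
# .arrow() returns the result as a pyarrow.Table (columnar, not row-by-row)
tbl = duckdb.sql("SELECT range AS n, range * 2 AS doubled FROM range(5)").arrow()
print(tbl)
```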
### Reads data directly (zero-copy vibes)
DuckDB can query:
- Parquet
- CSV
- JSON
- Arrow
- Pandas / Polars DataFrames
- S3 / HTTP / local files
Without importing them first.
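One caveat for the query below: depending on your DuckDB version, reading from s3:// or http(s):// paths may require the httpfs extension to be installed and loaded first, roughly like this:

```python
import duckdb

# httpfs adds support for http(s):// and s3:// paths;
# newer DuckDB versions can autoload it, older ones need this step explicitly
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")
```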
```sql
SELECT avg(price)
FROM read_parquet('s3://bucket/sales/*.parquet');
```

## How it compares to other tools
| Tool | Best at | DuckDB difference |
|---|---|---|
| SQLite | App storage | DuckDB is analytical |
| Postgres | Transactions + analytics | DuckDB is faster for scans, simpler to embed |
| Pandas | In-memory analysis | DuckDB scales better + SQL |
| Spark | Distributed big data | DuckDB is single-node but simpler & often faster locally |
| BigQuery | Cloud analytics | DuckDB is local, offline, cheap (free) |
## Typical use cases
DuckDB shines when you:
- Do data analysis in Python/R
- Work with Parquet files
- Need SQL over DataFrames
- Want reproducible, local analytics
- Don’t want to run a database server
Examples:
- Feature engineering for ML
- Exploratory data analysis
- Replacing slow pandas groupbys (see the sketch below)
- Local analytics before pushing to a warehouse
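To illustrate the pandas-groupby replacement, a minimal sketch with a hypothetical DataFrame (DuckDB can see local DataFrame variables by name, as shown later in this page):

```python
import duckdb
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# Roughly equivalent to sales.groupby("region")["price"].mean().reset_index(),
# but expressed as SQL and returned as a pandas DataFrame
avg_by_region = duckdb.sql("""
    SELECT region, avg(price) AS avg_price
    FROM sales
    GROUP BY region
""").df()
print(avg_by_region)
```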
## Limitations
DuckDB is not:
- A transactional database
- A multi-user, concurrent server
- Meant for high-write workloads
Rule of thumb:
If you’d normally reach for Spark, BigQuery, or heavy pandas, reach for DuckDB. If you need users, writes, and locks, consider Postgres.
## Installing DuckDB
If you’re using Python without conda, install with pip:
```bash
pip install duckdb
```

If you’re using a conda environment, install with:

```bash
conda install python-duckdb -c conda-forge
```

## Using DuckDB
Using DuckDB is as simple as importing the library and running SQL queries. Here’s a quick example:
```python
import duckdb

duckdb.sql("SELECT 1").show()
```

```text
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘
```
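Beyond `.show()`, a result can also be pulled into ordinary Python objects; for example:

```python
import duckdb

rows = duckdb.sql("SELECT 1 AS one").fetchall()  # list of tuples: [(1,)]
df = duckdb.sql("SELECT 1 AS one").df()          # pandas DataFrame (requires pandas)
```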
## Reading data with DuckDB
DuckDB can read data from multiple sources, including Parquet files, CSVs, and even directly from Pandas DataFrames. You can query these data sources using SQL syntax, making it easy to perform complex analyses without needing to load everything into memory first.
From DuckDB Documentation:
```python
import duckdb

duckdb.read_csv("example.csv")                 # read a CSV file into a Relation
duckdb.read_parquet("example.parquet")         # read a Parquet file into a Relation
duckdb.read_json("example.json")               # read a JSON file into a Relation

duckdb.sql("SELECT * FROM 'example.csv'")      # directly query a CSV file
duckdb.sql("SELECT * FROM 'example.parquet'")  # directly query a Parquet file
duckdb.sql("SELECT * FROM 'example.json'")     # directly query a JSON file
```

DuckDB can also query Pandas DataFrames directly, without importing or copying them first. This allows you to leverage the power of SQL for data manipulation while still working with familiar DataFrame structures.
```python
import pandas as pd

pandas_df = pd.DataFrame({"name": ["John", "Jane"], "age": [30, 25]})
duckdb.sql("SELECT * FROM pandas_df")
```

```text
┌─────────┬───────┐
│  name   │  age  │
│ varchar │ int64 │
├─────────┼───────┤
│ John    │    30 │
│ Jane    │    25 │
└─────────┴───────┘
```

## Persistent Storage
You can create a connection to a persistent database using duckdb.connect(dbname), although DuckDB is often used for ad-hoc analysis without needing to manage a database file.
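For completeness, a minimal sketch of working with a persistent database file (the file name here is just an example):

```python
import duckdb

# Creates analysis.duckdb on disk if it doesn't exist; tables persist between runs
con = duckdb.connect("analysis.duckdb")
con.sql("CREATE OR REPLACE TABLE numbers AS SELECT range AS n FROM range(10)")
print(con.sql("SELECT count(*) FROM numbers").fetchone())  # (10,)
con.close()
```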
For our credit card transactions analysis, we’ll be using DuckDB to query Parquet files directly, so we won’t need to set up a persistent database.