Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Introduction to DuckDB

DuckDB is an in-process, analytical SQL database. You can think of it as SQLite for analytics (OLAP), not transactions (OLTP).

Instead of running as a server, DuckDB runs inside your application process - Python, R, Java, C++, etc. and queries data where it already lives.

What makes DuckDB special

Built for analytics

DuckDB is optimized for:

  • Large scans

  • Aggregations

  • Joins

  • Window functions

  • Complex analytical queries

Think: dashboards, feature engineering, ad-hoc analysis, data science---not user logins or high-frequency writes.

In-process (no server)

There’s:

  • No daemon

  • No port

  • No client/server overhead

You just:

import duckdb
duckdb.sql("SELECT * FROM 'data.parquet'")

Columnar + vectorized execution

Under the hood, DuckDB:

  • Stores data in columnar format

  • Processes data in vectors (chunks) instead of row-by-row

This is why it’s fast. Very fast. Often competitive with Spark for single-machine workloads.

Reads data directly (zero-copy vibes)

DuckDB can query:

  • Parquet

  • CSV

  • JSON

  • Arrow

  • Pandas / Polars DataFrames

  • S3 / HTTP / local files

Without importing them first.

SELECT avg(price)
FROM read_parquet('s3://bucket/sales/*.parquet');

How it compares to other tools

ToolBest atDuckDB difference
SQLiteApp storageDuckDB is analytical
PostgresTransactions + analyticsDuckDB is faster for scans, simpler to embed
PandasIn-memory analysisDuckDB scales better + SQL
SparkDistributed big dataDuckDB is single-node but simpler & often faster locally
BigQueryCloud analyticsDuckDB is local, offline, cheap (free)

Typical use cases

DuckDB shines when you:

  • Do data analysis in Python/R

  • Work with Parquet files

  • Need SQL over DataFrames

  • Want reproducible, local analytics

  • Don’t want to run a database server

Examples:

  • Feature engineering for ML

  • Exploratory data analysis

  • Replacing slow pandas groupbys

  • Local analytics before pushing to a warehouse

Limitations

DuckDB is not:

  • A transactional database

  • Multi-user concurrent server

  • Meant for high-write workloads

Rule of thumb:

If you’d normally reach for Spark, BigQuery, or heavy pandas, reach for DuckDB If you need users, writes, and locks, consider Postgres

Installing DuckDB

If you’re using Python without conda, install with pip:

pip install duckdb

If you’re using a conda environment, install with:

conda install python-duckdb -c conda-forge

Using DuckDB

Using DuckDB is as simple as importing the library and running SQL queries. Here’s a quick example:

import duckdb

duckdb.sql("SELECT 1").show()
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘

Reading data with DuckDB

DuckDB can read data from multiple sources, including Parquet files, CSVs, and even directly from Pandas DataFrames. You can query these data sources using SQL syntax, making it easy to perform complex analyses without needing to load everything into memory first.

From DuckDB Documentation:

import duckdb

duckdb.read_csv("example.csv")                # read a CSV file into a Relation
duckdb.read_parquet("example.parquet")        # read a Parquet file into a Relation
duckdb.read_json("example.json")              # read a JSON file into a Relation

duckdb.sql("SELECT * FROM 'example.csv'")     # directly query a CSV file
duckdb.sql("SELECT * FROM 'example.parquet'") # directly query a Parquet file
duckdb.sql("SELECT * FROM 'example.json'")    # directly query a JSON file

DuckDB can directly query Pandas DataFrames without needing to load them into memory first. This allows you to leverage the power of SQL for data manipulation while still working with familiar DataFrame structures.

import pandas as pd

pandas_df = pd.DataFrame({"name": ["John", "Jane"], "age": [30, 25]})
duckdb.sql("SELECT * FROM pandas_df")
┌─────────┬───────┐ │ name │ age │ │ varchar │ int64 │ ├─────────┼───────┤ │ John │ 30 │ │ Jane │ 25 │ └─────────┴───────┘

Persistent Storage

You can create a connection to a persistent database using duckdb.connect(dbname), although DuckDB is often used for ad-hoc analysis without needing to manage a database file.

For our credit card transactions analysis, we’ll be using DuckDB to query Parquet files directly, so we won’t need to set up a persistent database.