
Using DuckDB for Data Analysis

Overview

In this exercise, you will walk through the previous data preprocessing workflow using DuckDB.

The goal is to replicate the same steps we performed in polars, but this time using SQL queries in DuckDB. This will give you a chance to see how the same transformations can be expressed in SQL and how DuckDB handles data manipulation.

You will learn how to:

  • Load and inspect data

  • Identify and handle missing values

  • Clean and standardize columns

  • Perform basic transformations

  • Verify the final dataset

Import Libraries

import duckdb

Read the Input Parquet File

First, download the credit_card_transactions-ibm_v2.parquet file from this GitHub link and place it in the same folder as your Jupyter notebook.

The file is about 250 MB, which is about 9 times smaller than the original CSV file, making it much faster to read and process.

duckdb.connect()

The duckdb.connect() function creates a connection to a DuckDB database. If you don’t specify a database name, it creates an in-memory database that exists only for the duration of the connection. This is useful for ad-hoc analysis and temporary data manipulation without needing to manage a persistent database file.

con = duckdb.connect()

con.execute(
    """
CREATE OR REPLACE VIEW transactions AS
SELECT *
FROM read_parquet('credit_card_transactions-ibm_v2.parquet');
"""
)

# Peek (like df.head())
con.sql("SELECT * FROM transactions LIMIT 5").df()

Exploratory Data Analysis (EDA)

Describe Schema

To understand the structure of the dataset, we can use the DESCRIBE command to get information about the columns, their data types, and whether they contain null values.

con.sql("DESCRIBE transactions;").df()

If you’re familiar with the PRAGMA introspection command from SQLite or another engine, you can also use PRAGMA table_info('transactions'); to get similar information about the columns in the transactions table.

con.sql("PRAGMA table_info('transactions');").df()

Note that the Datetime column has the TIMESTAMP type, which means it contains both date and time information. The Amount column is DOUBLE, a floating-point number. The Merchant Name column is BIGINT: merchant names were encoded as integers rather than stored as real merchant names. The Errors? and Is Fraud? columns use the VARCHAR type, a variable-length string.

con.sql('SELECT DISTINCT("Is Fraud?") FROM transactions;').df()

Convert "Is Fraud?" column to a boolean

The "Is Fraud?" column currently contains string values (“Yes” and “No”). We can convert this column to a boolean type for easier analysis.

con.sql(
    """
CREATE OR REPLACE VIEW transactions_with_fraud_flag AS
SELECT
  * EXCLUDE ("Is Fraud?"),
  CASE
    WHEN "Is Fraud?" = 'Yes' THEN TRUE
    WHEN "Is Fraud?" = 'No'  THEN FALSE
    ELSE NULL
  END AS "Is Fraud?"
FROM transactions;
"""
)
con.sql(
    """
SELECT *
FROM transactions_with_fraud_flag
LIMIT 5;
"""
).df()

You can verify that the "Is Fraud?" column is now a boolean type using the DESCRIBE command again.

con.sql("DESCRIBE transactions_with_fraud_flag;").df()

Payment method frequency analysis

The query below provides a high-level summary of the payment methods used across your credit card dataset. Grouping by the "Use Chip" column and counting rows performs a frequency analysis (the SQL equivalent of the .value_counts() method in polars), identifying every unique entry: Chip, Online, and Swipe transactions.

con.sql(
    """
SELECT
  "Use Chip",
  COUNT(*) AS count
FROM transactions
GROUP BY "Use Chip"
ORDER BY count DESC;
        """
)
┌────────────────────┬──────────┐
│      Use Chip      │  count   │
│      varchar       │  int64   │
├────────────────────┼──────────┤
│ Swipe Transaction  │ 15386082 │
│ Chip Transaction   │  6287598 │
│ Online Transaction │  2713220 │
└────────────────────┴──────────┘

Looking at the numbers, you can see that Swipe Transactions are the dominant payment method in this dataset with over 15 million entries, followed by Chip Transactions at roughly 6.2 million. This breakdown is essential for understanding consumer behavior or detecting potential fraud patterns, as certain types of transactions (like “Online” vs. “Swipe”) carry different risk profiles.

Fraud distribution analysis

The query below counts the values of the "Is Fraud?" column to determine the distribution of legitimate versus fraudulent transactions within your dataset.

con.sql(
    """
SELECT
  "Is Fraud?",
  COUNT(*) AS count
FROM transactions
GROUP BY "Is Fraud?"
ORDER BY count DESC;
        """
)
┌───────────┬──────────┐
│ Is Fraud? │  count   │
│  varchar  │  int64   │
├───────────┼──────────┤
│ No        │ 24357143 │
│ Yes       │    29757 │
└───────────┴──────────┘

The data reveals that the overwhelming majority of transactions (over 24.3 million) are flagged as “No”, while only 29,757 are flagged as “Yes”. While the fraudulent cases represent a very small fraction of the total volume (roughly 0.12%), identifying this minority is the primary goal of most credit card analytics. Understanding this ratio is a critical first step before building a machine learning model, as it tells you that you’ll need specialized techniques, like oversampling or class-weighted loss functions, to account for the rarity of fraud.

Flag suspicious transactions based on amount

The code below performs feature engineering by creating three new boolean (True/False) indicators based on specific transaction characteristics.

The AS keyword is used to assign a name to each new column, making it easier to reference them in subsequent queries.

con.sql(
    """
SELECT
  *,
  "Amount" < 0 AS is_refund,
  "Amount" > 500 AS large_txn,
  "Use Chip" = 'Online Transaction' AS is_online
FROM transactions
LIMIT 10;
"""
).df()

These “flag” columns can be useful for both exploratory analysis and machine learning. Instead of writing complex filters repeatedly, you can now quickly segment your data: for example, you can check whether “large transactions” are more likely to be fraudulent, or calculate the total volume of “online” versus in-person sales. This step transforms raw data into behavioral features, providing the model or the analyst with clear, binary signals that highlight high-interest events within the transaction stream.

Filter rows based on conditions

You can also apply a filter to isolate a very specific subset of your data: online refunds. The WHERE clause in the SQL query allows you to specify multiple conditions that must be met for a row to be included in the result set. In this case, we are filtering for transactions where the "Amount" is less than 0 (indicating a refund) and the "Use Chip" column indicates an “Online Transaction”.

con.sql(
    """
SELECT *
FROM transactions
WHERE
  "Amount" < 0
  AND "Use Chip" = 'Online Transaction'
LIMIT 10;
"""
).df()

Per-user aggregation

The code below performs a user-level aggregation, collapsing millions of individual transaction records into a concise summary of spending behavior for each unique customer. By using the GROUP BY clause, you are instructing DuckDB to organize the data into buckets based on the individual user ID. The SELECT statement with aggregate functions then calculates three key metrics for each bucket:

  1. the total volume of money spent,

  2. the average cost per purchase, and

  3. the total number of transactions

user_stats = con.sql(
    """
SELECT
  "User",
  SUM("Amount") AS total_spent,
  AVG("Amount") AS average_amount,
  COUNT(*)      AS num_transactions
FROM transactions
GROUP BY "User";
"""
).df()

user_stats.head()

Rolling / Window Features

10 most recent transactions rolling sum

You can also compute rolling aggregates to capture recent behavior. The code below calculates a rolling sum of the "Amount" column over the last 10 transactions for each user.

con.sql(
    """
SELECT
  *,
  SUM("Amount") OVER (
    PARTITION BY "User"
    ORDER BY "Datetime"
    ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
  ) AS rolling_amount_sum_10
FROM transactions
ORDER BY "User", "Datetime"
LIMIT 100;
"""
).df()

Conclusion

You have replicated the data preprocessing pipeline in DuckDB, then performed feature engineering and aggregation to extract meaningful insights from the transactions dataset!