
Using DuckDB for Data Analysis

Overview

In this exercise, you will walk through the previous data preprocessing workflow using DuckDB.

The goal is to replicate the same steps we performed in polars, but this time using SQL queries in DuckDB. This will give you a chance to see how the same transformations can be expressed in SQL and how DuckDB handles data manipulation.

You will learn how to:

  • Load and inspect data

  • Identify and handle missing values

  • Clean and standardize columns

  • Perform basic transformations

  • Verify the final dataset

Import Libraries

import duckdb

Read the Input Parquet File

First, download the credit_card_transactions-ibm_v2.parquet file from this GitHub link and place it in the same folder as your Jupyter notebook.

The file is about 250 MB, which is about 9 times smaller than the original CSV file, making it much faster to read and process.

duckdb.connect()

The duckdb.connect() function creates a connection to a DuckDB database. If you don’t specify a database name, it creates an in-memory database that exists only for the duration of the connection. This is useful for ad-hoc analysis and temporary data manipulation without needing to manage a persistent database file.

con = duckdb.connect()

con.execute(
    """
CREATE OR REPLACE VIEW transactions AS
SELECT *
FROM read_parquet('credit_card_transactions-ibm_v2.parquet');
"""
)

# Peek (like df.head())
con.sql("SELECT * FROM transactions LIMIT 5").df()

Exploratory Data Analysis (EDA)

Describe Schema

To understand the structure of the dataset, we can use the DESCRIBE command to get information about the columns, their data types, and whether they contain null values.

con.sql("DESCRIBE transactions;").df()

If you’re familiar with the PRAGMA introspection command from SQLite or another engine, you can also use PRAGMA table_info('transactions'); to get similar information about the columns in the transactions table.

con.sql("PRAGMA table_info('transactions');").df()

Note that the Datetime column has the TIMESTAMP type, which means it contains both date and time information. The Amount column is DOUBLE, a floating-point number. The Merchant Name column is BIGINT: merchant names were encoded as integers rather than stored as real merchant names. The Errors? and Is Fraud? columns use the VARCHAR type, a variable-length string.

con.sql('SELECT DISTINCT("Is Fraud?") FROM transactions;').df()

Convert "Is Fraud?" column to a boolean

The "Is Fraud?" column currently contains string values (“Yes” and “No”). We can convert this column to a boolean type for easier analysis.

con.sql(
    """
CREATE OR REPLACE VIEW transactions_with_fraud_flag AS
SELECT
  * EXCLUDE ("Is Fraud?"),
  CASE
    WHEN "Is Fraud?" = 'Yes' THEN TRUE
    WHEN "Is Fraud?" = 'No'  THEN FALSE
    ELSE NULL
  END AS "Is Fraud?"
FROM transactions;
"""
)
con.sql(
    """
SELECT *
FROM transactions_with_fraud_flag
LIMIT 5;
"""
).df()

You can verify that the "Is Fraud?" column is now a boolean type using the DESCRIBE command again.

con.sql("DESCRIBE transactions_with_fraud_flag;").df()

Payment method frequency analysis

The query below provides a high-level summary of the payment methods used across your credit card dataset. Grouping by the "Use Chip" column and counting rows performs a frequency analysis (the SQL equivalent of the .value_counts() method in polars), identifying every unique entry: Chip, Online, and Swipe transactions.

con.sql(
    """
SELECT
  "Use Chip",
  COUNT(*) AS count
FROM transactions
GROUP BY "Use Chip"
ORDER BY count DESC;
        """
)
┌────────────────────┬──────────┐
│      Use Chip      │  count   │
│      varchar       │  int64   │
├────────────────────┼──────────┤
│ Swipe Transaction  │ 15386082 │
│ Chip Transaction   │  6287598 │
│ Online Transaction │  2713220 │
└────────────────────┴──────────┘

Looking at the numbers, you can see that Swipe Transactions are the dominant payment method in this dataset with over 15 million entries, followed by Chip Transactions at roughly 6.2 million. This breakdown is essential for understanding consumer behavior or detecting potential fraud patterns, as certain types of transactions (like “Online” vs. “Swipe”) carry different risk profiles.

Fraud distribution analysis

The query below counts the values of the "Is Fraud?" column to determine the distribution of legitimate versus fraudulent transactions within your dataset.

con.sql(
    """
SELECT
  "Is Fraud?",
  COUNT(*) AS count
FROM transactions
GROUP BY "Is Fraud?"
ORDER BY count DESC;
        """
)
┌───────────┬──────────┐
│ Is Fraud? │  count   │
│  varchar  │  int64   │
├───────────┼──────────┤
│ No        │ 24357143 │
│ Yes       │    29757 │
└───────────┴──────────┘

The data reveals that the overwhelming majority of transactions (over 24.3 million) are flagged as “No”, while only 29,757 are flagged as “Yes”. While the fraudulent cases represent a very small fraction of the total volume (roughly 0.12%), identifying this minority is the primary goal of most credit card analytics. Understanding this ratio is a critical first step before building a machine learning model, as it tells you that you’ll need specialized techniques, like oversampling or class-weighted loss functions, to account for the rarity of fraud.

Flag suspicious transactions based on amount

The code below performs feature engineering by creating three new boolean (True/False) indicators based on specific transaction characteristics.

The AS keyword is used to assign a name to each new column, making it easier to reference them in subsequent queries.

con.sql(
    """
SELECT
  *,
  "Amount" < 0 AS is_refund,
  "Amount" > 500 AS large_txn,
  "Use Chip" = 'Online Transaction' AS is_online
FROM transactions
LIMIT 10;
"""
).df()

These “flag” columns can be useful for both exploratory analysis and machine learning. Instead of writing complex filters repeatedly, you can now quickly segment your data: for example, you can check whether “large transactions” are more likely to be fraudulent, or calculate the total volume of “online” versus in-person sales. This step transforms raw data into behavioral features, providing the model or the analyst with clear, binary signals that highlight high-interest events within the transaction stream.

Filter rows based on conditions

You can also apply a filter to isolate a very specific subset of your data: online refunds. The WHERE clause in the SQL query allows you to specify multiple conditions that must be met for a row to be included in the result set. In this case, we are filtering for transactions where the "Amount" is less than 0 (indicating a refund) and the "Use Chip" column indicates an “Online Transaction”.

con.sql(
    """
SELECT *
FROM transactions
WHERE
  "Amount" < 0
  AND "Use Chip" = 'Online Transaction'
LIMIT 10;
"""
).df()

Per-user aggregation

The code below performs a user-level aggregation, collapsing millions of individual transaction records into a concise summary of spending behavior for each unique customer. By using the GROUP BY clause, you are instructing DuckDB to organize the data into buckets based on the individual user ID. The SELECT statement with aggregate functions then calculates three key metrics for each bucket:

  1. the total volume of money spent,

  2. the average cost per purchase, and

  3. the total number of transactions

user_stats = con.sql(
    """
SELECT
  "User",
  SUM("Amount") AS total_spent,
  AVG("Amount") AS average_amount,
  COUNT(*)      AS num_transactions
FROM transactions
GROUP BY "User";
"""
).df()

user_stats.head()

Rolling / Window Features

10 most recent transactions rolling sum

You can also compute rolling aggregates to capture recent behavior. The code below calculates a rolling sum of the "Amount" column over the last 10 transactions for each user.

con.sql(
    """
SELECT
  *,
  SUM("Amount") OVER (
    PARTITION BY "User"
    ORDER BY "Datetime"
    ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
  ) AS rolling_amount_sum_10
FROM transactions
ORDER BY "User", "Datetime"
LIMIT 100;
"""
).df()

Conclusion

You have replicated the data preprocessing pipeline in DuckDB, then performed feature engineering and aggregation to extract meaningful insights from the transactions dataset!