Using Polars for Data Preprocessing and Analysis
Overview¶
In this exercise, you will walk through a complete data preprocessing workflow using Python.
The goal is to take a raw dataset and transform it into a clean, structured form that is ready for analysis or modeling.
You will learn how to:
Load and inspect data
Identify and handle missing values
Clean and standardize columns
Perform basic transformations
Verify the final dataset
Each step is explained in detail so you can understand why it is done, not just how.
Import Libraries¶
We are using the Polars library for data manipulation due to its performance advantages, especially with larger datasets.
import polars as pl
import pandas as pd  # used below for the benchmark comparison
Read the Input CSV File¶
We will import the dataset using both Pandas and Polars to compare their performance. First, download the credit_card_transactions-ibm_v2.csv file from this Kaggle page and place it in the same folder as your Jupyter notebook.
polars.read_csv()¶
Let’s read the CSV file using Polars and measure the time taken.
%%time
df_polars = pl.read_csv("credit_card_transactions-ibm_v2.csv")
df_polars
CPU times: total: 8.84 s
Wall time: 1.62 s
Benchmarking with Pandas¶
You can compare the time taken by Polars with that of Pandas for reading the same CSV file.
%%time
df_pandas = pd.read_csv("credit_card_transactions-ibm_v2.csv")
The execution time will vary based on your system, but Polars is generally expected to be much faster for larger datasets. On my desktop with 32 GB of RAM and a 13th Gen Intel(R) Core(TM) i7-13700, Polars took approximately 1 second while Pandas took around 22 seconds. That’s more than a 20x speedup!
pandas time result:
CPU times: total: 3 s
Wall time: 22.6 s
polars time result:
CPU times: total: 1.02 s
Wall time: 941 ms
Preprocessing Steps¶
Convert "Amount" column to a numeric type¶
The code below performs data type conversion on the "Amount" column. It utilizes Polars’ expression API to remove “$” and “,” characters via replace_all(), subsequently casting the cleaned values to Float64 to overwrite the existing column with numeric data.
df_polars = df_polars.with_columns(
pl.col("Amount")
.str.replace_all(r"[$,]", "") # remove $ (and commas, just in case)
.cast(pl.Float64)
.alias("Amount")
)
df_polars.head(3)
The output confirms that the "Amount" column has been successfully converted to the f64 (Float64) data type.
Check the Schema¶
What is a Schema in Polars?¶
In Polars, the schema is a fixed mapping of column names to their specific data types (dtypes). Unlike some other data libraries, Polars is built on the Apache Arrow memory format, which requires every column to have a strictly defined type from the moment the DataFrame is initialized. This schema acts as a “contract” for your data; if you try to perform a string operation on a column that the schema defines as an integer, Polars will stop you immediately with a clear error. This strictness is a core reason why Polars is so fast and memory-efficient: it knows exactly how much space to allocate in memory and which CPU instructions to use before it even touches the data.
The fundamental difference between Polars and pandas lies in how they handle type inference and consistency:
Predictability vs. Flexibility: Pandas is “lazy” and flexible with types, often defaulting to the generic object dtype if it encounters mixed data (like a column with both numbers and strings). This can lead to “object” columns that swallow up memory and slow down your code. Polars, conversely, is “eager” about types; it does not allow mixed types in a single column.
The Schema Check: In pandas, you often don’t know a calculation will fail until the code has already been running for several minutes. Because Polars has a defined schema and a lazy execution engine, it can inspect your entire query plan and check the schema for type mismatches before it actually processes a single row of data.
Null Handling: While pandas historically used NaN (a float value) to represent missing numbers - which could accidentally change an integer column into a float column - Polars handles nulls using a separate bitmask. This ensures that your integers stay integers, keeping your schema stable throughout your pipeline.
df_polars.schema
Schema([('User', Int64),
('Card', Int64),
('Year', Int64),
('Month', Int64),
('Day', Int64),
('Time', String),
('Amount', Float64),
('Use Chip', String),
('Merchant Name', Int64),
('Merchant City', String),
('Merchant State', String),
('Zip', Float64),
('MCC', Int64),
('Errors?', String),
('Is Fraud?', String)])
The df_polars.schema output is a structured dictionary-like object that acts as the blueprint for your DataFrame. It maps every column name to its specific Polars data type (such as String, Float64, or Int64). Unlike a simple list of names, this schema is the source of truth for the Query Optimizer; it allows Polars to validate operations - like ensuring you aren’t trying to add a number to a date - without having to scan the actual data. When you look at this output, you are seeing the strict definitions that Polars uses to allocate memory and plan the most efficient way to execute your code.
Create a timestamp column from the date time components¶
The code below merges fragmented date and time components into a single, unified column named "Datetime" with a Datetime dtype. By using pl.datetime(), you are instructing Polars to reach into the individual “Year”, “Month”, and “Day” columns and combine them with hours and minutes extracted from the "Time" string. To get those specific time units, the code uses .str.slice() to “cut” the hour and minute portions out of the string and casts them to integers.
df_polars = df_polars.with_columns(
pl.datetime(
pl.col("Year"),
pl.col("Month"),
pl.col("Day"),
pl.col("Time").str.slice(0, 2).cast(pl.Int32),
pl.col("Time").str.slice(3, 2).cast(pl.Int32),
).alias("Datetime")
)
df_polars
The primary benefit of this transformation is that it shifts your dataset from a static table into a time-series format. Instead of treating “Year” and “Month” as independent categories, Polars now understands the linear flow of time between transactions. This allows you to perform advanced chronological operations - such as calculating the time elapsed between purchases, sorting transactions by occurrence, or resampling the data to see fraud trends by hour.
(Optional) Write to a Parquet File¶
Parquet is a columnar storage file format that is optimized for performance and efficiency. It allows for faster read and write operations compared to CSV, especially with larger datasets. By writing the cleaned DataFrame to a Parquet file, you can significantly reduce the time it takes to load the data in future analyses, as Parquet files are designed for efficient data retrieval and storage.
# Remove redundant date and time component columns
df_for_parquet = df_polars.drop(["Year", "Month", "Day", "Time"])
# Reorder columns to match the original CSV structure, with the new "Datetime" column in a logical position
first_columns = ["User", "Card", "Amount", "Datetime"]
df_for_parquet = df_for_parquet.select([*first_columns, pl.exclude(first_columns)])
df_for_parquet.head(2)
Write to a parquet file using Polars’ write_parquet() method. This will save the cleaned DataFrame in a more efficient format for future use.
df_for_parquet.write_parquet("credit_card_transactions-ibm_v2.parquet")
Payment method frequency analysis¶
The code below provides a high-level summary of the payment methods used across your credit card dataset. The .value_counts() method performs a frequency analysis on the "Use Chip" column, identifying every unique entry (Chip, Online, and Swipe transactions).
df_polars["Use Chip"].value_counts()
Looking at the numbers, you can see that Swipe Transactions are the dominant payment method in this dataset with over 15 million entries, followed by Chip Transactions at roughly 6.2 million. This breakdown is essential for understanding consumer behavior or detecting potential fraud patterns, as certain types of transactions (like “Online” vs. “Swipe”) carry different risk profiles.
Fraud distribution analysis¶
The code below shows the frequency of the "Is Fraud?" column to determine the distribution of legitimate versus fraudulent transactions within your dataset.
df_polars["Is Fraud?"].value_counts()
The data reveals that the overwhelming majority of transactions - over 24.3 million - are flagged as “No”, while only 29,757 are flagged as “Yes”. While the fraudulent cases represent a very small fraction of the total volume (roughly 0.12%), identifying this minority is the primary goal of most credit card analytics. Understanding this ratio is a critical first step before building a machine learning model, as it tells you that you’ll need specialized techniques, like oversampling or specific loss functions, to account for the rarity of fraud.
Flag suspicious transactions based on amount¶
The code below performs feature engineering by creating three new boolean (True/False) indicators based on specific transaction characteristics. By using with_columns() with a list of expressions, Polars efficiently evaluates these conditions in parallel, adding descriptive flags that make the dataset much easier to filter and analyze. Specifically, it identifies refunds (amounts less than zero), large transactions (amounts exceeding $500), and online activity (where the “Use Chip” status matches “Online Transaction”).
df_polars.with_columns(
[
(pl.col("Amount") < 0).alias("is_refund"),
(pl.col("Amount") > 500).alias("large_txn"),
(pl.col("Use Chip") == "Online Transaction").alias("is_online"),
]
)
These “flag” columns can be useful for both exploratory analysis and machine learning. Instead of writing complex filters repeatedly, you can now quickly segment your data - for example, to see whether “large transactions” are more likely to be fraudulent, or to calculate the total volume of “online” versus in-person sales. This step transforms raw data into behavioral features, providing the model or the analyst with clear, binary signals that highlight high-interest events within the transaction stream.
Filter rows based on conditions¶
You can also apply a filter to isolate a very specific subset of your data: online refunds. By passing multiple conditions into the .filter() method, Polars treats them as a logical AND operation, meaning a transaction will only remain in the resulting DataFrame if it is both a negative value (the "is_refund" condition) and was conducted as an “Online Transaction” (the "is_online" condition).
df_polars.filter(
    (pl.col("Amount") < 0),                        # the "is_refund" condition
    (pl.col("Use Chip") == "Online Transaction"),  # the "is_online" condition
)
Per-user aggregation¶
The code below performs a user-level aggregation, collapsing millions of individual transaction records into a concise summary of spending behavior for each unique customer. By using group_by("User"), you are instructing Polars to organize the data into buckets based on the individual user ID. The .agg() function then calculates three key metrics for each bucket:
the total volume of money spent,
the average cost per purchase, and
the total count of transactions (using pl.len()).
user_stats = df_polars.group_by("User").agg(
[
pl.col("Amount").sum().alias("total_spent"),
pl.col("Amount").mean().alias("average_amount"),
pl.len().alias("num_transactions"),
]
)
user_stats
Rolling / window features (SQL-like mental model):
10 most recent transactions rolling sum¶
You can also compute rolling aggregates to capture recent behavior. The code below calculates a rolling sum of the "Amount" column over the last 10 transactions for each user. By using .over("User"), you ensure that the rolling calculation is performed separately for each individual customer, maintaining the integrity of user-specific spending patterns. The .rolling_sum(window_size=10) function then computes the sum of the last 10 transaction amounts, providing insight into recent spending trends.
df_rolling = df_polars.sort(["User", "Datetime"]).with_columns(
pl.col("Amount")
.rolling_sum(window_size=10)
.over("User")
.alias("rolling_amount_sum_10")
)
df_rolling
The .over("User") part is the direct counterpart to SQL’s PARTITION BY User. It ensures the rolling calculation “resets” or stays contained within each specific user’s history, preventing a transaction from User A from being included in the sum for User B. In SQL terms, this operation is essentially:
SUM(Amount) OVER (
PARTITION BY User
    ORDER BY Datetime
ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
)
This technique can be used in real-time fraud detection. By tracking a “rolling sum,” you can detect “velocity attacks” where a card is used for many large purchases in a very short window. If this rolling sum spikes suddenly compared to the user’s historical average, it provides a much stronger signal for fraud than any single transaction could on its own.
Conclusion¶
You have completed a full data preprocessing pipeline, taking a raw dataset to a cleaned one, and then performed feature engineering and aggregation to extract meaningful insights.