# Preprocessing Kickstarter Data with Polars
Import Polars.

```python
import polars as pl
```

## Preprocess the DataFrame
### Read the CSV file
```python
df_pl = pl.read_csv(
    "http://raw.githubusercontent.com/bdi593/datasets/refs/heads/main/kickstarter-projects/kickstarter-sample-data.csv",
    null_values=["null"],
)
display(df_pl.head(2))
df_pl.shape  # (3173, 42)
```

### Select Relevant Columns
There are 42 columns in the dataset. Let’s begin by selecting only the columns that are relevant to our analysis.
```python
keep = [
    "name",
    "state",
    "backers_count",
    "usd_pledged",
    "goal",
    "percent_funded",
    "launched_at",
    "state_changed_at",
    "country",
    "currency",
    "staff_pick",
    "spotlight",
    "category",
    "video",
    "blurb",
]
df_pl = df_pl.select([c for c in keep if c in df_pl.columns])
display(df_pl.head(3))
df_pl.shape  # (3173, 15)
```

### Filter based on "state"
We can first check the frequency of the “state” column to see how many unique values it contains and how many times each value appears. This can help us understand the distribution of the data and identify any potential issues with the “state” column.
```python
df_pl["state"].value_counts()
```

We are mainly interested in the “successful” and “failed” states, as these represent the outcomes of the Kickstarter projects. If there are other states present, we will need to consider how to handle them in our analysis, such as whether to include them as separate categories or to filter them out.
Keep only the completed campaigns by filtering based on the “state” column.
```python
df_pl = df_pl.filter(pl.col("state").is_in(["successful", "failed"]))
```

### Check Other Boolean Columns
While this step is optional, it can be helpful to check the frequency of other boolean columns in the dataset to understand their distribution and how they might relate to the “state” column.
#### "staff_pick"
The "staff_pick" column indicates whether a project was selected as a staff pick by Kickstarter. This could be an interesting feature to analyze, as being a staff pick may have an impact on the success of a campaign. We can check the frequency of this column to see how many projects were staff picks and how many were not.
```python
df_pl["staff_pick"].value_counts()
```

#### "spotlight"
The "spotlight" column indicates whether a project was featured on the Kickstarter homepage. This could be an interesting feature to analyze, as being spotlighted may have an impact on the success of a campaign. We can check the frequency of this column to see how many projects were spotlighted and how many were not.
```python
df_pl["spotlight"].value_counts()
```

### Convert Epoch Timestamp to Datetime
Convert the "launched_at" and "state_changed_at" columns from epoch time to datetime format for easier analysis.
```python
df_pl = df_pl.with_columns(
    pl.from_epoch("launched_at", time_unit="s").alias("launched_at"),
    pl.from_epoch("state_changed_at", time_unit="s").alias("state_changed_at"),
)
```
```python
df_pl.select(["launched_at", "state_changed_at"]).head(3)
```

### Parse Video Information
The "video" column contains JSON strings with information about the project’s video, such as the video URL, width, height, and codecs. We can parse this JSON data to extract relevant information about the videos associated with each project.
Sample a few rows where the video information is not null.
```python
df_video_samples = (
    df_pl.filter(pl.col("video").is_not_null())
    .sample(n=5, seed=42)
    .select(["name", "video"])
)
df_video_samples
```

The JSON strings are truncated in the output. We can print the full JSON string for a sample row to see the complete structure of the video information.
```python
video_json = df_video_samples.row(0)[1]
print(video_json)
```

Output (pretty-printed for readability):

```json
{
  "id": 1260668,
  "status": "successful",
  "hls": "https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-hls_playlist.m3u8",
  "hls_type": "application/x-mpegURL",
  "high": "https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_high.mp4",
  "high_type": "video/mp4; codecs=\"avc1.64001E, mp4a.40.2\"",
  "base": "https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_base.mp4",
  "base_type": "video/mp4; codecs=\"avc1.42E01E, mp4a.40.2\"",
  "tracks": "[]",
  "width": 640,
  "height": 360,
  "frame": "https://d15chbti7ht62o.cloudfront.net/projects/4687488/video-1260668-h264_base.jpg?2023"
}
```
```python
df_pl = df_pl.with_columns(
    [
        # Extract the digits following the "width": key
        pl.col("video")
        .str.extract(r'"width":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_width"),
        # Extract the digits following the "height": key
        pl.col("video")
        .str.extract(r'"height":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_height"),
    ]
)
df_pl.filter(
    pl.col("video_width").is_not_null(), pl.col("video_height").is_not_null()
).select(["name", "video_width", "video_height"]).head(10)
df_pl.head(2)
```

### Parse Category Information
```python
import json

# Both expressions read the original "category" column: expressions in a
# single with_columns call all run against the input frame, so the second
# still sees the raw JSON even though the first overwrites "category".
df_pl = df_pl.with_columns(
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("name") if x else None, return_dtype=pl.Utf8
    )
    .alias("category"),
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("parent_name") if x else None,
        return_dtype=pl.Utf8,
    )
    .alias("category_parent"),
)
df_pl.head(3)
```

Rearrange the columns so that "category_parent" comes right after "category" for easier analysis. We can use the `select` method to specify the order of the columns in the DataFrame.
```python
cols = df_pl.columns
idx = cols.index("category")
df_pl = df_pl.select(
    cols[: idx + 1]
    + ["category_parent"]
    + [c for c in cols[idx + 1 :] if c != "category_parent"]
)
df_pl.head(3)
```

### Store as Parquet
(Optional) Store the cleaned DataFrame as a Parquet file for efficient storage.
```python
df_pl.write_parquet("kickstarter-sample-preprocessed.parquet")
```