
Preprocessing Kickstarter Data with Polars

Import polars.

import polars as pl

Preprocess the DataFrame

Read the CSV file

df_pl = pl.read_csv(
    "http://raw.githubusercontent.com/bdi593/datasets/refs/heads/main/kickstarter-projects/kickstarter-sample-data.csv",
    null_values=["null"],
)
display(df_pl.head(2))
df_pl.shape
(3173, 42)

Select Relevant Columns

There are 42 columns in the dataset. Let’s begin by only selecting the columns that are relevant to our analysis.

keep = [
    "name",
    "state",
    "backers_count",
    "usd_pledged",
    "goal",
    "percent_funded",
    "launched_at",
    "state_changed_at",
    "country",
    "currency",
    "staff_pick",
    "spotlight",
    "category",
    "video",
    "blurb",
]

df_pl = df_pl.select([c for c in keep if c in df_pl.columns])
display(df_pl.head(3))
df_pl.shape
(3173, 15)

Filter Based on "state"

We can first check the frequency of the “state” column to see how many unique values it contains and how many times each value appears. This can help us understand the distribution of the data and identify any potential issues with the “state” column.

df_pl["state"].value_counts()

We are mainly interested in the “successful” and “failed” states, as these represent the outcomes of the Kickstarter projects. If there are other states present, we will need to consider how to handle them in our analysis, such as whether to include them as separate categories or to filter them out.

Keep only the completed campaigns by filtering based on the “state” column.

df_pl = df_pl.filter(pl.col("state").is_in(["successful", "failed"]))

Check Other Boolean Columns

While this step is optional, it can be helpful to check the frequency of other boolean columns in the dataset to understand their distribution and how they might relate to the “state” column.

"staff_pick"

The "staff_pick" column indicates whether a project was selected as a staff pick by Kickstarter. This could be an interesting feature to analyze, as being a staff pick may have an impact on the success of a campaign. We can check the frequency of this column to see how many projects were staff picks and how many were not.

df_pl["staff_pick"].value_counts()

"spotlight"

The "spotlight" column indicates whether a project was featured on the Kickstarter homepage. This could be an interesting feature to analyze, as being spotlighted may have an impact on the success of a campaign. We can check the frequency of this column to see how many projects were spotlighted and how many were not.

df_pl["spotlight"].value_counts()

Convert Epoch Timestamp to Datetime

Convert the "launched_at" and "state_changed_at" columns from epoch time to datetime format for easier analysis.

df_pl = df_pl.with_columns(
    pl.from_epoch("launched_at", time_unit="s").alias("launched_at"),
    pl.from_epoch("state_changed_at", time_unit="s").alias("state_changed_at"),
)

df_pl.select(["launched_at", "state_changed_at"]).head(3)

Parse Video Information

The "video" column contains JSON strings with information about the project’s video, such as the video URL, width, height, and codecs. We can parse this JSON to extract relevant information about the videos associated with each project.

Sample a few rows where the video information is not null.

df_video_samples = (
    df_pl.filter(pl.col("video").is_not_null())
    .sample(n=5, seed=42)
    .select(["name", "video"])
)
df_video_samples

The JSON strings are truncated in the output. We can print the full JSON string for a sample row to see the complete structure of the video information.

video_json = df_video_samples.row(0)[1]
print(video_json)
{"id":1260668,"status":"successful","hls":"https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-hls_playlist.m3u8","hls_type":"application/x-mpegURL","high":"https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_high.mp4","high_type":"video/mp4; codecs=\"avc1.64001E, mp4a.40.2\"","base":"https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_base.mp4","base_type":"video/mp4; codecs=\"avc1.42E01E, mp4a.40.2\"","tracks":"[]","width":640,"height":360,"frame":"https://d15chbti7ht62o.cloudfront.net/projects/4687488/video-1260668-h264_base.jpg?2023"}
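Before running the extraction over the whole column with Polars, the regular expression can be checked against a sample string using Python’s built-in re module (the string below is abbreviated from the output above):

```python
import re

# Abbreviated sample of the JSON string printed above.
sample = '{"id":1260668,"width":640,"height":360}'

# The same patterns used in the Polars extraction below.
width = re.search(r'"width":\s*(\d+)', sample)
height = re.search(r'"height":\s*(\d+)', sample)
print(int(width.group(1)), int(height.group(1)))  # 640 360
```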
df_pl = df_pl.with_columns(
    [
        # Extract digits following the "width": key
        pl.col("video")
        .str.extract(r'"width":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_width"),
        # Extract digits following the "height": key
        pl.col("video")
        .str.extract(r'"height":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_height"),
    ]
)

df_pl.filter(
    pl.col("video_width").is_not_null(), pl.col("video_height").is_not_null()
).select(["name", "video_width", "video_height"]).head(10)
df_pl.head(2)

Parse Category Information

The "category" column also contains JSON strings. Parse them to extract the category name and its parent category.

import json

df_pl = df_pl.with_columns(
    # Both expressions are evaluated against the original frame, so the second
    # still receives the raw JSON string even though the first reuses the
    # "category" name.
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("name") if x else None, return_dtype=pl.Utf8
    )
    .alias("category"),
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("parent_name") if x else None, return_dtype=pl.Utf8
    )
    .alias("category_parent"),
)

df_pl.head(3)

Rearrange the columns so that "category_parent" comes right after "category" for easier analysis. We can use the select method to specify the order of the columns in the DataFrame.

cols = df_pl.columns
idx = cols.index("category")

df_pl = df_pl.select(
    cols[: idx + 1]
    + ["category_parent"]
    + [c for c in cols[idx + 1 :] if c != "category_parent"]
)

df_pl.head(3)

Store as Parquet

(Optional) Store the cleaned DataFrame as a Parquet file for efficient storage.

df_pl.write_parquet("kickstarter-sample-preprocessed.parquet")