# Preprocessing Kickstarter Data with Polars
Import Polars.

```python
import polars as pl
```

## Preprocess the DataFrame
### Read the CSV file
```python
df_pl = pl.read_csv(
    "http://raw.githubusercontent.com/bdi593/datasets/refs/heads/main/kickstarter-projects/kickstarter-sample-data.csv",
    null_values=["null"],
)
display(df_pl.head(2))
df_pl.shape  # (3173, 42)
```

### Select Relevant Columns
There are 42 columns in the dataset. Let’s begin by selecting only the columns that are relevant to our analysis.
```python
keep = [
    "name",
    "state",
    "backers_count",
    "usd_pledged",
    "goal",
    "percent_funded",
    "launched_at",
    "state_changed_at",
    "country",
    "currency",
    "staff_pick",
    "spotlight",
    "category",
    "video",
    "blurb",
]
df_pl = df_pl.select([c for c in keep if c in df_pl.columns])
display(df_pl.head(3))
df_pl.shape  # (3173, 15)
```

### Filter based on "state"
We can first check the frequency of the “state” column to see how many unique values it contains and how many times each value appears. This can help us understand the distribution of the data and identify any potential issues with the “state” column.
```python
df_pl["state"].value_counts()
```

We are mainly interested in the “successful” and “failed” states, as these represent the outcomes of the Kickstarter projects. If there are other states present, we will need to consider how to handle them in our analysis, such as whether to include them as separate categories or to filter them out.
Keep only the completed campaigns by filtering based on the “state” column.
```python
df_pl = df_pl.filter(pl.col("state").is_in(["successful", "failed"]))
```

### Check Other Boolean Columns
While this step is optional, it can be helpful to check the frequency of other boolean columns in the dataset to understand their distribution and how they might relate to the “state” column.
#### "staff_pick"
The "staff_pick" column indicates whether a project was selected as a staff pick by Kickstarter. This could be an interesting feature to analyze, as being a staff pick may have an impact on the success of a campaign. We can check the frequency of this column to see how many projects were staff picks and how many were not.
```python
df_pl["staff_pick"].value_counts()
```

#### "spotlight"
The "spotlight" column indicates whether a project was featured on the Kickstarter homepage. This could be an interesting feature to analyze, as being spotlighted may have an impact on the success of a campaign. We can check the frequency of this column to see how many projects were spotlighted and how many were not.
```python
df_pl["spotlight"].value_counts()
```

### Convert Epoch Timestamp to Datetime
Convert the "launched_at" and "state_changed_at" columns from epoch time to datetime format for easier analysis.
```python
df_pl = df_pl.with_columns(
    pl.from_epoch("launched_at", time_unit="s").alias("launched_at"),
    pl.from_epoch("state_changed_at", time_unit="s").alias("state_changed_at"),
)
```
```python
df_pl.select(["launched_at", "state_changed_at"]).head(3)
```

### Parse Video Information
The "video" column contains JSON strings with information about the project’s video, such as the video URL, width, height, and codecs. We can parse this JSON data to extract relevant information about the videos associated with each project.
Sample a few rows where the video information is not null.
```python
df_video_samples = (
    df_pl.filter(pl.col("video").is_not_null())
    .sample(n=5, seed=42)
    .select(["name", "video"])
)
df_video_samples
```

The JSON strings are truncated in the output. We can print the full JSON string for a sample row to see the complete structure of the video information.
```python
video_json = df_video_samples.row(0)[1]
print(video_json)
```

Output (pretty-printed for readability):

```json
{
  "id": 1260668,
  "status": "successful",
  "hls": "https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-hls_playlist.m3u8",
  "hls_type": "application/x-mpegURL",
  "high": "https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_high.mp4",
  "high_type": "video/mp4; codecs=\"avc1.64001E, mp4a.40.2\"",
  "base": "https://v2.kickstarter.com/1770873151-S0kszdLXafjMVmF4ZVvrjPK4tMfnFbRgX5mJxHQE74s%3D/projects/4687488/video-1260668-h264_base.mp4",
  "base_type": "video/mp4; codecs=\"avc1.42E01E, mp4a.40.2\"",
  "tracks": "[]",
  "width": 640,
  "height": 360,
  "frame": "https://d15chbti7ht62o.cloudfront.net/projects/4687488/video-1260668-h264_base.jpg?2023"
}
```
```python
df_pl = df_pl.with_columns(
    [
        # Extract the digits following the "width": key
        pl.col("video")
        .str.extract(r'"width":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_width"),
        # Extract the digits following the "height": key
        pl.col("video")
        .str.extract(r'"height":\s*(\d+)', 1)
        .cast(pl.Int64)
        .alias("video_height"),
    ]
)
df_pl.filter(
    pl.col("video_width").is_not_null(), pl.col("video_height").is_not_null()
).select(["name", "video_width", "video_height"]).head(10)
df_pl.head(2)
```

### Parse Category Information
```python
import json

# Both expressions read the original "category" column: expressions in a
# single with_columns call all run against the input frame, so the second
# still sees the raw JSON even though the first overwrites "category".
df_pl = df_pl.with_columns(
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("name") if x else None, return_dtype=pl.Utf8
    )
    .alias("category"),
    pl.col("category")
    .map_elements(
        lambda x: json.loads(x).get("parent_name") if x else None,
        return_dtype=pl.Utf8,
    )
    .alias("category_parent"),
)
df_pl.head(3)
```

Rearrange the columns so that "category_parent" comes right after "category" for easier analysis. We can use the `select` method to specify the order of the columns in the DataFrame.
```python
cols = df_pl.columns
idx = cols.index("category")
df_pl = df_pl.select(
    cols[: idx + 1]
    + ["category_parent"]
    + [c for c in cols[idx + 1 :] if c != "category_parent"]
)
df_pl.head(3)
```

### Store as Parquet
(Optional) Store the cleaned DataFrame as a Parquet file for efficient storage.
```python
df_pl.write_parquet("kickstarter-sample-preprocessed.parquet")
```