Data Job Postings Analysis
In this case study, we will analyze a dataset of job postings for data science and analytics roles. The dataset contains various columns such as job title, company, location, salary, job description, and required skills. However, some columns may have mostly missing values, which we will need to handle during preprocessing. For example, most companies may prefer not to disclose salary information, resulting in a column with many null values.
import polars as pl

Load Dataset
The dataset is uploaded to a GitHub repository for convenience. You can also access the original dataset on Hugging Face Datasets at lukebarousse/data_jobs. The original CSV file is quite large (~230 MB) due to the large number of rows and text fields, but it has been converted to Parquet format for this case study, which significantly reduces the file size to ~30 MB while preserving all data.
df = pl.read_parquet(
"https://github.com/bdi593/datasets/raw/refs/heads/main/data-jobs/data_jobs.parquet"
)
df

Check the Number of Rows
The number of rows and columns in the dataset can be checked using the shape attribute of the Polars DataFrame. The shape attribute returns a tuple containing the number of rows and columns in the DataFrame.
df.shape

(785741, 17)

Check Schema
Check the schema to understand the data types of each column.
df.schema

Schema([('job_title_short', String),
('job_title', String),
('job_location', String),
('job_via', String),
('job_schedule_type', String),
('job_work_from_home', Boolean),
('search_location', String),
('job_posted_date', String),
('job_no_degree_mention', Boolean),
('job_health_insurance', Boolean),
('job_country', String),
('salary_rate', String),
('salary_year_avg', Float64),
('salary_hour_avg', Float64),
('company_name', String),
('job_skills', String),
('job_type_skills', String)])

Preprocess Dataset
Check Missing Values
Check the number of missing values.
df.null_count()

The output is in a "wide" format, which makes it difficult to compare the number of missing values across columns. We can transpose the output to make it easier to read and analyze.
df.null_count().transpose(include_header=True)

We can go one step further and display only the columns with one or more missing values, focusing our attention on the columns that require preprocessing.
(
df.null_count()
.transpose(include_header=True, header_name="column", column_names=["null_count"])
.filter(pl.col("null_count") > 0)
)

While the output is helpful, Polars limits the displayed table to 10 rows by default, so some columns with missing values are cut off. To see all of them, we can temporarily raise the maximum number of displayed rows to a higher value, such as 50.
# Display up to 50 rows of the null count table
with pl.Config(tbl_rows=50):
display(
(
df.null_count()
.transpose(
include_header=True, header_name="column", column_names=["null_count"]
)
.filter(pl.col("null_count") > 0)
)
)

Drop Rows with Missing Job Title
There is one row with a null value in the job_title column. Filter the DataFrame to show only this row.
df.filter(df["job_title"].is_null())

The dataset has over 780,000 rows, but only one row has a null value in the job_title and company_name columns.
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).row(0)

("['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']",)
Parse "job_posted_date" Column
The "job_posted_date" column contains date information in string format.
df["job_posted_date"].head(5)

Check the data type of the "job_posted_date" column to confirm that it is currently stored as a string.
df.schema["job_posted_date"]

String

We can parse this column into a proper datetime format using the pl.col().str.strptime() or pl.col().str.to_datetime() method in Polars, which allows us to specify the date format and handle any parsing errors gracefully.
If you don't specify the date format, Polars will attempt to infer it. Since the date format in the "job_posted_date" column is consistent (e.g., "2023-06-16 13:44:15"), Polars should be able to parse it correctly without an explicit format.
However, if you want to ensure that the parsing is done correctly and to handle any potential variations in date formats, you can specify the format using the format parameter.
df = df.with_columns(pl.col("job_posted_date").str.to_datetime())
df.select(pl.col("job_posted_date")).head(5)

df.schema["job_posted_date"]

Datetime(time_unit='us', time_zone=None)

Monthly Number of Job Postings
The value_counts() method can be used to count the number of job postings for each month. By extracting the month from the "job_posted_date" column and applying value_counts(), we can see how many job postings were made in each month. Sorting the results by month will give us a clear view of the distribution of job postings over time.
df_monthly = (
df.select(
pl.col("job_posted_date")
.dt.month()
.alias("month")
.value_counts()
.alias("month_counts")
)
.unnest("month_counts")
.sort("month")
)
df_monthly

Parse "job_skills" Column
The "job_skills" column is stored as a string that looks like a list, but it is not actually a list data type.
We can check the data type of the "job_skills" column to confirm that it is currently stored as a string. The output should indicate that the data type of the "job_skills" column is String.
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)

df.schema["job_skills"]

String

To convert it to a list, we can use the str.replace_all() method to replace single quotes with double quotes, and then use the str.json_decode() method to parse the string as JSON. This gives us a proper list of skills for each job.
df = df.with_columns(
pl.when(pl.col("job_skills").is_not_null())
.then(
pl.col("job_skills")
.str.replace_all("'", '"')
.str.json_decode(dtype=pl.List(pl.Utf8))
)
.otherwise(None)
.alias("job_skills")
)
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)

Print the first non-null value in the "job_skills" column to see a non-truncated output.
df.select(pl.col("job_skills")).filter(pl.col("job_skills").is_not_null()).row(0)[0]

['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']

Verify that the "job_skills" column has been successfully parsed as a list by checking its data type. The output should now indicate that the column is a list.
df.schema["job_skills"]

List(String)

How many jobs require "Python" as a skill?
We can use the list.contains() method to filter the DataFrame for rows where the "job_skills" list contains the skill "python" (lowercase, as it is stored in the data). The code below filters the DataFrame to include only such rows.
df.filter(pl.col("job_skills").list.contains("python")).select(
pl.col("job_title"), pl.col("company_name"), pl.col("job_skills")
)

The height attribute can be used to count the number of rows that match this condition.
It is equivalent to using the len() function or the .shape[0] attribute to get the number of rows in the filtered DataFrame.
df.filter(pl.col("job_skills").list.contains("python")).height

380909

Parse "job_type_skills" Column
Similar to the "job_skills" column, the "job_type_skills" column is also stored as a string. However, the column contains dictionary-like string values as opposed to the list-like string values in the "job_skills" column.
Print the first five non-null values in the "job_type_skills" column to understand its structure and confirm that it is stored as a string.
df.select("job_type_skills").filter(pl.col("job_type_skills").is_not_null()).head(5)

To see a non-truncated view of the first row, retrieve the first non-null value in the "job_type_skills" column and print it.
df.filter(pl.col("job_type_skills").is_not_null()).get_column("job_type_skills").item(0)

"{'analyst_tools': ['power bi', 'tableau'], 'programming': ['r', 'python', 'sql', 'nosql']}"

Check the data type of the "job_type_skills" column to confirm that it is currently stored as a string.
df.schema["job_type_skills"]

String

The output shows that the "job_type_skills" column contains string representations of dictionaries, where each key is a skill category and each value is a list of skills in that category. For example, one of the values is (line breaks added for readability):
"{
'analyst_tools': ['power bi', 'tableau'],
'programming': ['r', 'python', 'sql', 'nosql']
}"

There are two keys in this dictionary: "analyst_tools" and "programming". The value for each key is a list of skills relevant to that category. For instance, the "programming" key maps to a list of languages and technologies such as "r", "python", "sql", and "nosql".
Approach 1: Use ast.literal_eval() to parse the string values into dictionaries
The ast.literal_eval() function from Python's ast module evaluates a string containing a Python literal (such as a dictionary) and returns the corresponding Python object. Unlike eval(), it only accepts literals, so it will not execute arbitrary code. This approach is flexible and can handle varying keys across rows, but it is relatively slow, because the function runs row by row in Python and the result is a generic Object column rather than a native Polars type.
Approach 2: Create a schema for the Struct in advance
This approach can only be used if you know all possible keys in advance. Because we don't know all possible keys in the "job_type_skills" column, we will skip this approach for now. However, if you have a limited and known set of keys, you can create a schema for the Struct and use conditional logic to handle missing keys when parsing the string values.
Approach 3: Find all unique keys across the dataset and create a schema
This will be the slowest approach, but it will allow you to create a struct type column instead of a generic object type column, which will enable you to work with the nested data more efficiently in Polars.
First, find all unique keys across the dataset in the "job_type_skills" column.
%%time
import ast
keys = (
df.filter(pl.col("job_type_skills").is_not_null())
.select(
pl.col("job_type_skills")
.map_elements(ast.literal_eval, return_dtype=pl.Object)
.map_elements(lambda d: list(d.keys()), return_dtype=pl.List(pl.Utf8))
.alias("keys")
)
.explode("keys")
.unique()
)
keys

CPU times: total: 22.1 s
Wall time: 22.2 s
Then, use the list of unique keys to create a schema for the Struct and parse the string values in the "job_type_skills" column accordingly.
key_list = keys.get_column("keys").to_list()
skills_struct_dtype = pl.Struct([pl.Field(k, pl.List(pl.Utf8)) for k in key_list])
df = df.with_columns(
pl.when(pl.col("job_type_skills").is_not_null())
.then(
pl.col("job_type_skills")
.str.replace_all("'", '"')
.str.json_decode(dtype=skills_struct_dtype)
)
.otherwise(None)
.alias("job_type_skills")
)

Confirm that the "job_type_skills" column has been successfully parsed as a Struct by checking the data type of the column. The output should show a Struct with fields corresponding to the unique keys found in the previous step.
df.schema["job_type_skills"]

Struct({'cloud': List(String), 'async': List(String), 'databases': List(String), 'other': List(String), 'os': List(String), 'webframeworks': List(String), 'libraries': List(String), 'programming': List(String), 'sync': List(String), 'analyst_tools': List(String)})

Print the first non-null value in the "job_type_skills" column to see the structured data format after parsing it as a Struct.
df.select(pl.col("job_type_skills")).filter(
pl.col("job_type_skills").is_not_null()
).row(0)[0]

{'cloud': None,
'async': None,
'databases': None,
'other': None,
'os': None,
'webframeworks': None,
'libraries': None,
'programming': ['r', 'python', 'sql', 'nosql'],
'sync': None,
 'analyst_tools': ['power bi', 'tableau']}

The parsed Struct keeps all 10 keys, even if some rows have missing values for certain keys. This lets you work with the nested data consistently, regardless of whether every key is present in every row. You can access individual fields of the Struct with struct.field() as needed for your analysis.
Below is an example of how to access the "programming" field of the Struct in the "job_type_skills" column:
df.select(pl.col("job_type_skills").struct.field("programming"))

How many jobs require "sql" as a programming skill in the "job_type_skills" column?
sql_skill_required = df.filter(
pl.col("job_type_skills").struct.field("programming").list.contains("sql")
).select(pl.col("job_type_skills").struct.field("programming").alias("programming"))
sql_skill_required

num_sql_required = sql_skill_required.height
print(
f"Number of jobs that require 'sql' as a programming skill: {num_sql_required} out of {df.height} total jobs ({num_sql_required / df.height:.1%})."
)

Number of jobs that require 'sql' as a programming skill: 384849 out of 785741 total jobs (49.0%).
Export Preprocessed Dataset
(Optional) We can save the preprocessed dataset in Parquet format so that we can easily load it in the future for further analysis or modeling without having to repeat the preprocessing steps. The write_parquet() method can be used to write the DataFrame to a Parquet file.
df.write_parquet("data_jobs_preprocessed.parquet")

Explore Dataset
Job Titles
How many job postings are for remote jobs?
df["job_work_from_home"].value_counts()

Display the output in percentages.
(
df["job_work_from_home"]
.value_counts()
.with_columns(
(pl.col("count") / pl.col("count").sum() * 100).round(2).alias("percentage")
)
)

Job Country
df["job_country"].value_counts().sort("count", descending=True).head(10)

Job Posting Medium
df["job_via"].value_counts().sort("count", descending=True).head(10)

Annual Salary
Many companies choose not to disclose salary information in their job postings, which results in a large number of null values in the "salary_year_avg" column.
df["salary_year_avg"].is_null().sum()

763738

Print the percentage of null values in the "salary_year_avg" column to understand the extent of missing salary information in the dataset.
null_pct = (df["salary_year_avg"].null_count() / df.height) * 100
print(f"{null_pct:.2f}% of the salary_year_avg values are null.")

97.20% of the salary_year_avg values are null.
Use describe() to get summary statistics for the "salary_year_avg" column.
df.select(pl.col("salary_year_avg")).describe()