
Data Job Postings Analysis

In this case study, we will analyze a dataset of job postings for data science and analytics roles. The dataset contains columns such as job title, company, location, salary, job description, and required skills. Some columns, however, are mostly missing and will need handling during preprocessing; for example, many companies prefer not to disclose salary information, leaving the salary column largely null.

import polars as pl

πŸ“₯ Load Dataset

The dataset is uploaded to a GitHub repository for convenience. You can also access the original dataset on Hugging Face Datasets at lukebarousse/data_jobs. The original CSV file is quite large (~230 MB) due to the large number of rows and text fields, but it has been converted to Parquet format for this case study, which significantly reduces the file size to ~30 MB while preserving all data.

df = pl.read_parquet(
    "https://github.com/bdi593/datasets/raw/refs/heads/main/data-jobs/data_jobs.parquet"
)
df

πŸ”’ Check the Number of Rows

The number of rows and columns in the dataset can be checked using the shape attribute of the Polars DataFrame, which returns a (number of rows, number of columns) tuple.

df.shape
(785741, 17)

πŸ—‚οΈ Check SchemaΒΆ

Check the schema to understand the data types of each column.

df.schema
Schema([('job_title_short', String), ('job_title', String), ('job_location', String), ('job_via', String), ('job_schedule_type', String), ('job_work_from_home', Boolean), ('search_location', String), ('job_posted_date', String), ('job_no_degree_mention', Boolean), ('job_health_insurance', Boolean), ('job_country', String), ('salary_rate', String), ('salary_year_avg', Float64), ('salary_hour_avg', Float64), ('company_name', String), ('job_skills', String), ('job_type_skills', String)])

🧹 Preprocess Dataset

❓ Check Missing Values

Check the number of missing values.

df.null_count()

The output is in a β€œwide” format, which makes it difficult to compare the number of missing values across columns. We can transpose the output to make it easier to read and analyze.

df.null_count().transpose(include_header=True)

We can go one step further and only display columns with one or more missing values to focus our attention on the columns that require preprocessing.

(
    df.null_count()
    .transpose(include_header=True, header_name="column", column_names=["null_count"])
    .filter(pl.col("null_count") > 0)
)

While the output is helpful, Polars displays at most 10 rows of a table by default, so some columns with missing values are cut off. To see all of them, we can temporarily raise the display limit to a higher value, such as 50.

# Display up to 50 rows of the null count table
with pl.Config(tbl_rows=50):
    display(
        (
            df.null_count()
            .transpose(
                include_header=True, header_name="column", column_names=["null_count"]
            )
            .filter(pl.col("null_count") > 0)
        )
    )

πŸ—‘οΈ Drop Rows with Missing Job TitleΒΆ

There is one row with a null value in the job_title column. Filter the DataFrame to show only this row.

df.filter(df["job_title"].is_null())

The dataset has over 780,000 rows, and only this one row has null values in the job_title and company_name columns, so it can safely be dropped.

df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).row(0)
("['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']",)

πŸ“… Parse "job_posted_date" Column

The "job_posted_date" column contains date information in string format.

df["job_posted_date"].head(5)

Check the data type of the "job_posted_date" column to confirm that it is currently stored as a string.

df.schema["job_posted_date"]
String

We can parse this column into a proper date format using the pl.col().str.strptime() or pl.col().str.to_datetime() method in Polars, which allows us to specify the date format and handle any parsing errors gracefully.

If you don’t specify the date format, Polars will attempt to infer the format. Since the date format in the "job_posted_date" column is consistent (e.g., β€œ2023-06-16 13:44:15”), Polars should be able to parse it correctly without explicitly providing the format.

However, if you want to ensure that the parsing is done correctly and to handle any potential variations in date formats, you can specify the format using the format parameter.

df = df.with_columns(pl.col("job_posted_date").str.to_datetime())

df.select(pl.col("job_posted_date")).head(5)
df.schema["job_posted_date"]
Datetime(time_unit='us', time_zone=None)

Monthly Number of Job Postings

The value_counts() method can be used to count the number of job postings for each month. By extracting the month from the "job_posted_date" column and applying value_counts(), we can see how many job postings were made in each month. Sorting the results by month will give us a clear view of the distribution of job postings over time.

df_monthly = (
    df.select(
        pl.col("job_posted_date")
        .dt.month()
        .alias("month")
        .value_counts()
        .alias("month_counts")
    )
    .unnest("month_counts")
    .sort("month")
)

df_monthly

🧰 Parse "job_skills" Column

The "job_skills" column is stored as a string that looks like a list, but it is not actually a list data type.

We can check the data type of the "job_skills" column to confirm that it is currently stored as a String.

df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)
df.schema["job_skills"]
String

To convert it to a list, we can use the str.replace_all() method to replace single quotes with double quotes, and then use the str.json_decode() method to parse the string as JSON (this assumes no skill name itself contains an apostrophe). This gives us a proper list of skills for each job.

df = df.with_columns(
    pl.when(pl.col("job_skills").is_not_null())
    .then(
        pl.col("job_skills")
        .str.replace_all("'", '"')
        .str.json_decode(dtype=pl.List(pl.Utf8))
    )
    .otherwise(None)
    .alias("job_skills")
)

df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)

Print the first non-null value in the "job_skills" column to see a non-truncated output.

df.select(pl.col("job_skills")).filter(pl.col("job_skills").is_not_null()).row(0)[0]
['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']

Verify that the "job_skills" column has been successfully parsed as a list by checking the data type of the column. The output should indicate that the data type of the "job_skills" column is now a list.

df.schema["job_skills"]
List(String)

How many jobs require β€œPython” as a skill?

We can use the list.contains() method to filter the DataFrame for rows where the "job_skills" list contains a given skill.

The code below keeps only rows whose "job_skills" list contains "python" (skills are stored in lowercase).

df.filter(pl.col("job_skills").list.contains("python")).select(
    pl.col("job_title"), pl.col("company_name"), pl.col("job_skills")
)

The height attribute can be used to count the number of rows that match this condition.

It is equivalent to calling the len() function on the DataFrame, or reading .shape[0], to get the number of rows in the filtered DataFrame.

df.filter(pl.col("job_skills").list.contains("python")).height
380909

πŸ› οΈ Parse "job_type_skills" columnΒΆ

Similar to the "job_skills" column, the "job_type_skills" column is also stored as a string. However, the column contains dictionary-like string values as opposed to the list-like string values in the "job_skills" column.

Print the first five non-null values in the "job_type_skills" column to understand its structure and confirm that it is stored as a string.

df.select("job_type_skills").filter(pl.col("job_type_skills").is_not_null()).head(5)

To see a non-truncated view of the first row, retrieve the first non-null value in the "job_type_skills" column and print it.

df.filter(pl.col("job_type_skills").is_not_null()).get_column("job_type_skills").item(0)
"{'analyst_tools': ['power bi', 'tableau'], 'programming': ['r', 'python', 'sql', 'nosql']}"

Check the data type of the "job_type_skills" column to confirm that it is currently stored as a string.

df.schema["job_type_skills"]
String

The output shows that the "job_type_skills" column contains string representations of dictionaries, where each dictionary has a job type as the key and a list of skills as the value. For example, one of the values is (added line breaks for readability):

"{
    'analyst_tools': ['power bi', 'tableau'],
    'programming': ['r', 'python', 'sql', 'nosql']
}"

There are two keys in the dictionary: "analyst_tools" and "programming". The value for each key is a list of skills relevant to that job type. For instance, the β€œprogramming” key has a list of programming languages and technologies such as β€œr”, β€œpython”, β€œsql”, and β€œnosql”.

Approach 1: Use ast.literal_eval() to parse the string values into dictionaries

The ast.literal_eval() function from the ast module in Python can be used to safely evaluate a string containing a Python literal (like a dictionary) and convert it into the corresponding Python data structure. This approach is flexible and can handle varying keys across rows. Unlike eval(), literal_eval() accepts only Python literals and raises an exception on anything else, so it cannot execute arbitrary code; its main drawback is that element-wise evaluation in Python is slow on a large dataset.
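A quick sketch of this approach, applied to the sample value shown earlier:

```python
import ast

raw = "{'analyst_tools': ['power bi', 'tableau'], 'programming': ['r', 'python', 'sql', 'nosql']}"

# literal_eval accepts only Python literals, so it cannot run arbitrary code
parsed = ast.literal_eval(raw)
```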

Approach 2: Create a schema for the Struct in advance

This approach can only be used if you know all possible keys in advance. Because we don’t know all possible keys in the "job_type_skills" column, we will skip this approach for now. However, if you have a limited and known set of keys, you can create a schema for the Struct and use conditional logic to handle missing keys when parsing the string values.

Approach 3: Find all unique keys across the dataset and create a schema

This will be the slowest approach, but it will allow you to create a struct type column instead of a generic object type column, which will enable you to work with the nested data more efficiently in Polars.

First, find all unique keys across the dataset in the "job_type_skills" column.

%%time

import ast

keys = (
    df.filter(pl.col("job_type_skills").is_not_null())
    .select(
        pl.col("job_type_skills")
        .map_elements(ast.literal_eval, return_dtype=pl.Object)
        .map_elements(lambda d: list(d.keys()), return_dtype=pl.List(pl.Utf8))
        .alias("keys")
    )
    .explode("keys")
    .unique()
)

keys
CPU times: total: 22.1 s
Wall time: 22.2 s

Then, use the list of unique keys to create a schema for the Struct and parse the string values in the "job_type_skills" column accordingly.

key_list = keys.get_column("keys").to_list()

skills_struct_dtype = pl.Struct([pl.Field(k, pl.List(pl.Utf8)) for k in key_list])

df = df.with_columns(
    pl.when(pl.col("job_type_skills").is_not_null())
    .then(
        pl.col("job_type_skills")
        .str.replace_all("'", '"')
        .str.json_decode(dtype=skills_struct_dtype)
    )
    .otherwise(None)
    .alias("job_type_skills")
)

Confirm that the "job_type_skills" column has been successfully parsed as a Struct by checking the data type of the column. The output should indicate that the data type of the "job_type_skills" column is now a Struct with fields corresponding to the unique keys found in the previous step.

df.schema["job_type_skills"]
Struct({'cloud': List(String), 'async': List(String), 'databases': List(String), 'other': List(String), 'os': List(String), 'webframeworks': List(String), 'libraries': List(String), 'programming': List(String), 'sync': List(String), 'analyst_tools': List(String)})

Print the first non-null value in the "job_type_skills" column to see the structured data format after parsing it as a Struct.

df.select(pl.col("job_type_skills")).filter(
    pl.col("job_type_skills").is_not_null()
).row(0)[0]
{'cloud': None, 'async': None, 'databases': None, 'other': None, 'os': None, 'webframeworks': None, 'libraries': None, 'programming': ['r', 'python', 'sql', 'nosql'], 'sync': None, 'analyst_tools': ['power bi', 'tableau']}

The parsed Struct keeps all 10 keys, even if some rows have missing values for certain keys. This allows you to work with the nested data in a consistent way, regardless of whether all keys are present in every row. You can access the fields of the Struct using dot notation or by selecting specific fields as needed for your analysis.

Below is an example of how to access the β€œprogramming” field of the Struct in the "job_type_skills" column:

df.select(pl.col("job_type_skills").struct.field("programming"))

How many jobs require β€œsql” as a programming skill in the "job_type_skills" column?

sql_skill_required = df.filter(
    pl.col("job_type_skills").struct.field("programming").list.contains("sql")
).select(pl.col("job_type_skills").struct.field("programming").alias("programming"))

sql_skill_required
num_sql_required = sql_skill_required.height
print(
    f"Number of jobs that require 'sql' as a programming skill: {num_sql_required} out of {df.height} total jobs ({num_sql_required / df.height:.1%})."
)
Number of jobs that require 'sql' as a programming skill: 384849 out of 785741 total jobs (49.0%).

πŸ“€ Export Preprocessed Dataset

(Optional) We can save the preprocessed dataset in Parquet format so that we can easily load it in the future for further analysis or modeling without having to repeat the preprocessing steps. The write_parquet() method can be used to write the DataFrame to a Parquet file.

df.write_parquet("data_jobs_preprocessed.parquet")

πŸ” Explore DatasetΒΆ

πŸ’Ό Job Titles

How many job postings are for remote jobs?

df["job_work_from_home"].value_counts()

Display the output in percentages.

(
    df["job_work_from_home"]
    .value_counts()
    .with_columns(
        (pl.col("count") / pl.col("count").sum() * 100).round(2).alias("percentage")
    )
)

Job Country

df["job_country"].value_counts().sort("count", descending=True).head(10)

Job Posting Medium

df["job_via"].value_counts().sort("count", descending=True).head(10)

Annual Salary

Many companies choose not to disclose salary information in their job postings, which results in a large number of null values in the "salary_year_avg" column.

df["salary_year_avg"].is_null().sum()
763738

Print the percentage of null values in the "salary_year_avg" column to understand the extent of missing salary information in the dataset.

null_pct = (df["salary_year_avg"].null_count() / df.height) * 100

print(f"{null_pct:.2f}% of the salary_year_avg values are null.")
97.20% of the salary_year_avg values are null.

Use describe() to get summary statistics for the "salary_year_avg" column.

df.select(pl.col("salary_year_avg")).describe()