Data Job Postings Analysis
In this case study, we will analyze a dataset of job postings for data science and analytics roles. The dataset contains various columns such as job title, company, location, salary, job description, and required skills. However, some columns may have mostly missing values, which we will need to handle during preprocessing. For example, most companies may prefer not to disclose salary information, resulting in a column with many null values.
import polars as pl

Load Dataset
The dataset is uploaded to a GitHub repository for convenience. You can also access the original dataset on Hugging Face Datasets at lukebarousse/data_jobs. The original CSV file is quite large (~230 MB) due to the large number of rows and text fields, but it has been converted to Parquet format for this case study, which significantly reduces the file size to ~30 MB while preserving all data.
df = pl.read_parquet(
"https://github.com/bdi593/datasets/raw/refs/heads/main/data-jobs/data_jobs.parquet"
)
df

Check the Number of Rows
The number of rows and columns in the dataset can be checked using the shape attribute of the Polars DataFrame. The shape attribute returns a tuple containing the number of rows and columns in the DataFrame.
df.shape

(785741, 17)

Check Schema
Check the schema to understand the data types of each column.
df.schema

Schema([('job_title_short', String),
('job_title', String),
('job_location', String),
('job_via', String),
('job_schedule_type', String),
('job_work_from_home', Boolean),
('search_location', String),
('job_posted_date', String),
('job_no_degree_mention', Boolean),
('job_health_insurance', Boolean),
('job_country', String),
('salary_rate', String),
('salary_year_avg', Float64),
('salary_hour_avg', Float64),
('company_name', String),
('job_skills', String),
('job_type_skills', String)])

Preprocess Dataset
Check Missing Values
Check the number of missing values.
df.null_count()

The output is in a "wide" format, which makes it difficult to compare the number of missing values across columns. We can transpose the output to make it easier to read and analyze.
df.null_count().transpose(include_header=True)

We can go one step further and display only the columns with one or more missing values, focusing our attention on the columns that require preprocessing.
(
df.null_count()
.transpose(include_header=True, header_name="column", column_names=["null_count"])
.filter(pl.col("null_count") > 0)
)

While the output is helpful, Polars limits the displayed table to 10 rows by default, so some columns with missing values are cut off. To see all of them, we can temporarily raise the maximum number of displayed rows to a higher value, such as 50.
# Display up to 50 rows of the null count table
with pl.Config(tbl_rows=50):
display(
(
df.null_count()
.transpose(
include_header=True, header_name="column", column_names=["null_count"]
)
.filter(pl.col("null_count") > 0)
)
)

Drop Rows with Missing Job Title
There is one row with a null value in the job_title column. Filter the DataFrame to show only this row.
df.filter(df["job_title"].is_null())

The dataset has over 780,000 rows, but only one row has a null value in the job_title and company_name columns.
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).row(0)

("['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']",)
Parse "job_posted_date" Column
The "job_posted_date" column contains date information in string format.
df["job_posted_date"].head(5)

Check the data type of the "job_posted_date" column to confirm that it is currently stored as a string.
df.schema["job_posted_date"]

String

We can parse this column into a proper datetime format using the pl.col().str.strptime() or pl.col().str.to_datetime() method in Polars, which allows us to specify the date format and handle any parsing errors gracefully.
If you don't specify the date format, Polars will attempt to infer it. Since the date format in the "job_posted_date" column is consistent (e.g., "2023-06-16 13:44:15"), Polars should be able to parse it correctly without an explicit format.
However, if you want to ensure that the parsing is done correctly and to handle any potential variations in date formats, you can specify the format using the format parameter.
df = df.with_columns(pl.col("job_posted_date").str.to_datetime())
df.select(pl.col("job_posted_date")).head(5)

df.schema["job_posted_date"]

Datetime(time_unit='us', time_zone=None)

Monthly Number of Job Postings
The value_counts() method can be used to count the number of job postings for each month. By extracting the month from the "job_posted_date" column and applying value_counts(), we can see how many job postings were made in each month. Sorting the results by month will give us a clear view of the distribution of job postings over time.
df_monthly = (
df.select(
pl.col("job_posted_date")
.dt.month()
.alias("month")
.value_counts()
.alias("month_counts")
)
.unnest("month_counts")
.sort("month")
)
df_monthly

Parse "job_skills" Column
The "job_skills" column is stored as a string that looks like a list, but it is not actually a list data type.
We can check the data type of the "job_skills" column to confirm that it is currently stored as a string. The output should indicate that the data type of the "job_skills" column is String.
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)

df.schema["job_skills"]

String

To convert it to a list, we can use the str.replace_all() method to replace single quotes with double quotes, and then use the str.json_decode() method to parse the string as JSON. This gives us a proper list of skills for each job.
df = df.with_columns(
pl.when(pl.col("job_skills").is_not_null())
.then(
pl.col("job_skills")
.str.replace_all("'", '"')
.str.json_decode(dtype=pl.List(pl.Utf8))
)
.otherwise(None)
.alias("job_skills")
)
df.filter(pl.col("job_skills").is_not_null()).select(pl.col("job_skills")).head(5)

Print the first non-null value in the "job_skills" column to see a non-truncated output.
df.select(pl.col("job_skills")).filter(pl.col("job_skills").is_not_null()).row(0)[0]

['r', 'python', 'sql', 'nosql', 'power bi', 'tableau']

Verify that the "job_skills" column has been successfully parsed as a list by checking its data type. The output should now indicate that the column is a list.
df.schema["job_skills"]

List(String)

How many jobs require "Python" as a skill?
We can use the list.contains() method to filter the DataFrame for rows where the "job_skills" list contains the skill "python" (lowercase, as it is stored in the data). The code below filters the DataFrame to include only such rows.
df.filter(pl.col("job_skills").list.contains("python")).select(
pl.col("job_title"), pl.col("company_name"), pl.col("job_skills")
)

The height attribute can be used to count the number of rows that match this condition.
It is equivalent to using the len() function or the .shape[0] attribute to get the number of rows in the filtered DataFrame.
df.filter(pl.col("job_skills").list.contains("python")).height

380909

Parse "job_type_skills" Column
Similar to the "job_skills" column, the "job_type_skills" column is also stored as a string. However, the column contains dictionary-like string values as opposed to the list-like string values in the "job_skills" column.
Print the first five non-null values in the "job_type_skills" column to understand its structure and confirm that it is stored as a string.
df.select("job_type_skills").filter(pl.col("job_type_skills").is_not_null()).head(5)

To see a non-truncated view of the first row, retrieve the first non-null value in the "job_type_skills" column and print it.
df.filter(pl.col("job_type_skills").is_not_null()).get_column("job_type_skills").item(0)

"{'analyst_tools': ['power bi', 'tableau'], 'programming': ['r', 'python', 'sql', 'nosql']}"

Check the data type of the "job_type_skills" column to confirm that it is currently stored as a string.
df.schema["job_type_skills"]

String

The output shows that the "job_type_skills" column contains string representations of dictionaries, where each key is a skill category and each value is a list of skills in that category. For example, one of the values is (line breaks added for readability):
"{
'analyst_tools': ['power bi', 'tableau'],
'programming': ['r', 'python', 'sql', 'nosql']
}"

There are two keys in this dictionary: "analyst_tools" and "programming". The value for each key is a list of skills relevant to that category. For instance, the "programming" key maps to a list of languages and technologies such as "r", "python", "sql", and "nosql".
Approach 1: Use ast.literal_eval() to parse the string values into dictionaries
The ast.literal_eval() function from Python's ast module evaluates a string containing a Python literal (such as a dictionary) and returns the corresponding Python object. Unlike eval(), it only accepts literals, so it will not execute arbitrary code. This approach is flexible and can handle varying keys across rows, but it is relatively slow, because the function runs row by row in Python and the result is a generic Object column rather than a native Polars type.
Approach 2: Create a schema for the Struct in advance
This approach can only be used if you know all possible keys in advance. Because we don't know all possible keys in the "job_type_skills" column, we will skip this approach for now. However, if you have a limited and known set of keys, you can create a schema for the Struct and use conditional logic to handle missing keys when parsing the string values.
Approach 3: Find all unique keys across the dataset and create a schema
This will be the slowest approach, but it will allow you to create a struct type column instead of a generic object type column, which will enable you to work with the nested data more efficiently in Polars.
First, find all unique keys across the dataset in the "job_type_skills" column.
%%time
import ast
keys = (
df.filter(pl.col("job_type_skills").is_not_null())
.select(
pl.col("job_type_skills")
.map_elements(ast.literal_eval, return_dtype=pl.Object)
.map_elements(lambda d: list(d.keys()), return_dtype=pl.List(pl.Utf8))
.alias("keys")
)
.explode("keys")
.unique()
)
keys

CPU times: total: 22.1 s
Wall time: 22.2 s
Then, use the list of unique keys to create a schema for the Struct and parse the string values in the "job_type_skills" column accordingly.
key_list = keys.get_column("keys").to_list()
skills_struct_dtype = pl.Struct([pl.Field(k, pl.List(pl.Utf8)) for k in key_list])
df = df.with_columns(
pl.when(pl.col("job_type_skills").is_not_null())
.then(
pl.col("job_type_skills")
.str.replace_all("'", '"')
.str.json_decode(dtype=skills_struct_dtype)
)
.otherwise(None)
.alias("job_type_skills")
)

Confirm that the "job_type_skills" column has been successfully parsed as a Struct by checking the data type of the column. The output should show a Struct with fields corresponding to the unique keys found in the previous step.
df.schema["job_type_skills"]

Struct({'cloud': List(String), 'async': List(String), 'databases': List(String), 'other': List(String), 'os': List(String), 'webframeworks': List(String), 'libraries': List(String), 'programming': List(String), 'sync': List(String), 'analyst_tools': List(String)})

Print the first non-null value in the "job_type_skills" column to see the structured data format after parsing it as a Struct.
df.select(pl.col("job_type_skills")).filter(
pl.col("job_type_skills").is_not_null()
).row(0)[0]

{'cloud': None,
'async': None,
'databases': None,
'other': None,
'os': None,
'webframeworks': None,
'libraries': None,
'programming': ['r', 'python', 'sql', 'nosql'],
'sync': None,
 'analyst_tools': ['power bi', 'tableau']}

The parsed Struct keeps all 10 keys, even if some rows have missing values for certain keys. This lets you work with the nested data consistently, regardless of whether every key is present in every row. You can access individual fields of the Struct with struct.field() as needed for your analysis.
Below is an example of how to access the "programming" field of the Struct in the "job_type_skills" column:
df.select(pl.col("job_type_skills").struct.field("programming"))

How many jobs require "sql" as a programming skill in the "job_type_skills" column?
sql_skill_required = df.filter(
pl.col("job_type_skills").struct.field("programming").list.contains("sql")
).select(pl.col("job_type_skills").struct.field("programming").alias("programming"))
sql_skill_required

num_sql_required = sql_skill_required.height
print(
f"Number of jobs that require 'sql' as a programming skill: {num_sql_required} out of {df.height} total jobs ({num_sql_required / df.height:.1%})."
)

Number of jobs that require 'sql' as a programming skill: 384849 out of 785741 total jobs (49.0%).
Export Preprocessed Dataset
(Optional) We can save the preprocessed dataset in Parquet format so that we can easily load it in the future for further analysis or modeling without having to repeat the preprocessing steps. The write_parquet() method can be used to write the DataFrame to a Parquet file.
df.write_parquet("data_jobs_preprocessed.parquet")

Explore Dataset
Job Titles
How many job postings are for remote jobs?
df["job_work_from_home"].value_counts()

Display the output in percentages.
(
df["job_work_from_home"]
.value_counts()
.with_columns(
(pl.col("count") / pl.col("count").sum() * 100).round(2).alias("percentage")
)
)

Job Country
df["job_country"].value_counts().sort("count", descending=True).head(10)

Job Posting Medium
df["job_via"].value_counts().sort("count", descending=True).head(10)

Annual Salary
Many companies choose not to disclose salary information in their job postings, which results in a large number of null values in the "salary_year_avg" column.
df["salary_year_avg"].is_null().sum()

763738

Print the percentage of null values in the "salary_year_avg" column to understand the extent of missing salary information in the dataset.
null_pct = (df["salary_year_avg"].null_count() / df.height) * 100
print(f"{null_pct:.2f}% of the salary_year_avg values are null.")

97.20% of the salary_year_avg values are null.
Use describe() to get summary statistics for the "salary_year_avg" column.
df.select(pl.col("salary_year_avg")).describe()