Case Overview

In this case, we will use a real-world job postings dataset focused on data science and analytics roles, collected from internet search results and job boards since 2023 by Luke Barousse.

We have already covered common preprocessing steps for tabular data in the previous chapters, but this dataset also includes list-like or dictionary-like string columns that require additional parsing and transformation.

For example, the job_skills column contains a string representation of a list of skills extracted from the job posting using NLP techniques.

Additionally, the job_type_skills column contains a Python dictionary-like string that maps skill types (e.g., ‘cloud’, ‘libraries’) to sets of skills.

We will need to parse these string representations into actual Python data structures (lists and dictionaries) to work with them effectively in our analysis.
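As a minimal sketch of this parsing step, the standard library's `ast.literal_eval` can turn such strings into real lists and dictionaries without executing arbitrary code. The helper name `parse_literal` and the sample strings below are illustrative assumptions, not values taken from the dataset.

```python
import ast


def parse_literal(value, default=None):
    """Parse a list-like or dict-like string into a Python object.

    Returns `default` when the value is not a string (e.g. a missing
    value such as NaN) or is not a valid Python literal.
    """
    if not isinstance(value, str):
        return default
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return default


# Hypothetical raw values shaped like the job_skills and
# job_type_skills columns described above.
raw_job_skills = "['python', 'sql', 'aws']"
raw_job_type_skills = "{'programming': ['python', 'sql'], 'cloud': ['aws']}"

skills = parse_literal(raw_job_skills, default=[])
skills_by_type = parse_literal(raw_job_type_skills, default={})
missing = parse_literal(float("nan"), default=[])  # non-string input falls back to the default

print(skills)
print(skills_by_type["cloud"])
print(missing)
```

With a pandas DataFrame, the same helper could be applied column-wise, for example `df["job_skills"].apply(parse_literal)`, assuming the data has been loaded into a DataFrame named `df`.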

📌 Dataset Information

📘 Data Dictionary

| Column Name | Description | Type | Source |
| --- | --- | --- | --- |
| job_title_short | Cleaned/standardized job title using BERT model (10-class classification) | Calculated | From job_title |
| job_title | Full original job title as scraped | Raw | Scraped |
| job_location | Location string shown in job posting | Raw | Scraped |
| job_via | Platform the job was posted on (e.g., LinkedIn, Jobijoba) | Raw | Scraped |
| job_schedule_type | Type of schedule (Full-time, Part-time, Contractor, etc.) | Raw | Scraped |
| job_work_from_home | Whether the job is remote (true/false) | Boolean | Parsed |
| search_location | Location used by the bot to generate search queries | Generated | Bot logic |
| job_posted_date | Date and time when job was posted | Raw | Scraped |
| job_no_degree_mention | Whether the posting explicitly mentions no degree is required | Boolean | Parsed |
| job_health_insurance | Whether the job mentions health insurance | Boolean | Parsed |
| job_country | Country extracted from job location | Calculated | Parsed |
| salary_rate | Indicates if salary is annual or hourly | Raw | Scraped |
| salary_year_avg | Average yearly salary (calculated from salary ranges when available) | Calculated | Derived |
| salary_hour_avg | Average hourly salary (same logic as yearly) | Calculated | Derived |
| company_name | Company name listed in job posting | Raw | Scraped |
| job_skills | List of relevant skills extracted from job posting using PySpark | Parsed List | NLP Extracted |
| job_type_skills | Dictionary mapping skill types (e.g., ‘cloud’, ‘libraries’) to skill sets | Parsed Dict | NLP Extracted |