Case Overview
In this case, we will use a real-world job postings dataset focused on data science and analytics roles, collected from internet search results and job boards since 2023 by Luke Barousse.
We have already covered common preprocessing steps for tabular data in the previous chapters, but this dataset also includes list-like or dictionary-like string columns that require additional parsing and transformation.
For example, the job_skills column contains a string representation of a list of skills extracted from the job posting using NLP techniques.
Additionally, the job_type_skills contain a Python dictionary-like string that map skill types (e.g., ‘cloud’, ‘libraries’) to sets of skills.
We will need to parse these string representations into actual Python data structures (lists and dictionaries) to work with them effectively in our analysis.
📌 Dataset Information¶
Source:
lukebarousse/data_jobsLicensing: Apache-2.0
Format: Parquet (original CSV file converted to Parquet for compression - no other preprocessing applied)
Size: ~30 MB (~786,000 rows)
The original CSV file is much larger (~230 MB) due to the large number of rows and text fields, but the Parquet format significantly reduces the file size while preserving all data.
📘 Data Dictionary¶
| Column Name | Description | Type | Source |
|---|---|---|---|
job_title_short | Cleaned/standardized job title using BERT model (10-class classification) | Calculated | From job_title |
job_title | Full original job title as scraped | Raw | Scraped |
job_location | Location string shown in job posting | Raw | Scraped |
job_via | Platform the job was posted on (e.g., LinkedIn, Jobijoba) | Raw | Scraped |
job_schedule_type | Type of schedule (Full-time, Part-time, Contractor, etc.) | Raw | Scraped |
job_work_from_home | Whether the job is remote (true/false) | Boolean | Parsed |
search_location | Location used by the bot to generate search queries | Generated | Bot logic |
job_posted_date | Date and time when job was posted | Raw | Scraped |
job_no_degree_mention | Whether the posting explicitly mentions no degree is required | Boolean | Parsed |
job_health_insurance | Whether the job mentions health insurance | Boolean | Parsed |
job_country | Country extracted from job location | Calculated | Parsed |
salary_rate | Indicates if salary is annual or hourly | Raw | Scraped |
salary_year_avg | Average yearly salary (calculated from salary ranges when available) | Calculated | Derived |
salary_hour_avg | Average hourly salary (same logic as yearly) | Calculated | Derived |
company_name | Company name listed in job posting | Raw | Scraped |
job_skills | List of relevant skills extracted from job posting using PySpark | Parsed List | NLP Extracted |
job_type_skills | Dictionary mapping skill types (e.g., ‘cloud’, ‘libraries’) to skill sets | Parsed Dict | NLP Extracted |