Case Overview - BDI 199/593 Textbook

In this case, we will use a real-world job postings dataset focused on data science and analytics roles, collected from internet search results and job boards since 2023 by Luke Barousse.

We have already covered common preprocessing steps for tabular data in the previous chapters, but this dataset also includes list-like or dictionary-like string columns that require additional parsing and transformation.

For example, the job_skills column contains a string representation of a list of skills extracted from the job posting using NLP techniques.

Additionally, the job_type_skills contain a Python dictionary-like string that map skill types (e.g., ‘cloud’, ‘libraries’) to sets of skills.

We will need to parse these string representations into actual Python data structures (lists and dictionaries) to work with them effectively in our analysis.

📌 Dataset Information¶

Source: lukebarousse/data_jobs
Licensing: Apache-2.0
Format: Parquet (original CSV file converted to Parquet for compression - no other preprocessing applied)
Size: ~30 MB (~786,000 rows)
- The original CSV file is much larger (~230 MB) due to the large number of rows and text fields, but the Parquet format significantly reduces the file size while preserving all data.

📘 Data Dictionary¶

Column Name	Description	Type	Source
`job_title_short`	Cleaned/standardized job title using BERT model (10-class classification)	Calculated	From `job_title`
`job_title`	Full original job title as scraped	Raw	Scraped
`job_location`	Location string shown in job posting	Raw	Scraped
`job_via`	Platform the job was posted on (e.g., LinkedIn, Jobijoba)	Raw	Scraped
`job_schedule_type`	Type of schedule (Full-time, Part-time, Contractor, etc.)	Raw	Scraped
`job_work_from_home`	Whether the job is remote (`true`/`false`)	Boolean	Parsed
`search_location`	Location used by the bot to generate search queries	Generated	Bot logic
`job_posted_date`	Date and time when job was posted	Raw	Scraped
`job_no_degree_mention`	Whether the posting explicitly mentions no degree is required	Boolean	Parsed
`job_health_insurance`	Whether the job mentions health insurance	Boolean	Parsed
`job_country`	Country extracted from job location	Calculated	Parsed
`salary_rate`	Indicates if salary is annual or hourly	Raw	Scraped
`salary_year_avg`	Average yearly salary (calculated from salary ranges when available)	Calculated	Derived
`salary_hour_avg`	Average hourly salary (same logic as yearly)	Calculated	Derived
`company_name`	Company name listed in job posting	Raw	Scraped
`job_skills`	List of relevant skills extracted from job posting using PySpark	Parsed List	NLP Extracted
`job_type_skills`	Dictionary mapping skill types (e.g., ‘cloud’, ‘libraries’) to skill sets	Parsed Dict	NLP Extracted