Cleaning Airbnb Listings Dataset
Inside Airbnb provides datasets containing Airbnb listings for various cities around the world. These datasets often require cleaning before they can be effectively analyzed. In this notebook, we will demonstrate how to clean an Airbnb listings dataset for Barcelona, Spain.
Data Cleaning Steps¶
Import Libraries: We will use
pandasfor data manipulation andnumpyfor numerical operations.Load Dataset: Read the CSV file containing the Airbnb listings data.
Sometimes, the dataset may be compressed (e.g., in gzip format), so we will handle that accordingly.
If the dataset is large but can fit into memory, consider loading a sample for initial exploration.
If the dataset is too large to fit into memory, consider using libraries like Dask, Spark, or Polars for out-of-core processing.
Inspect Data: Check the first few rows of the dataset to understand its structure.
Select Relevant Columns: Identify and retain only the columns that are necessary for analysis.
Handle Missing Values: Identify columns with missing values and decide on strategies to handle them (e.g., dropping rows, filling with mean/median/mode, etc.).
Data Type Conversion: Ensure that each column has the appropriate data type (e.g., converting price columns to numeric types).
A common issue is that price columns may contain currency symbols or commas, which need to be removed before conversion.
Another common issue is date columns being stored as strings; these should be converted to datetime objects if you plan to perform date-based analyses.
▶️ Import libraries.
import pandas as pd
import numpy as np
import os▶️ Display up to 100 columns in this Jupyter notebook.
pd.set_option("display.max_columns", 100)Load Dataset¶
The compression="gzip" parameter is used because the dataset file is compressed in gzip format. If your dataset is not compressed, you should omit this parameter. This parameter is optional if the file extension is .gz, as pandas can automatically detect the compression type based on the file extension. However, it is always a good practice to specify it explicitly when you know the file is compressed, because additional URL query parameters may interfere with pandas’ ability to infer the compression type from the file extension.
df = pd.read_csv(
"https://data.insideairbnb.com/spain/catalonia/barcelona/2025-09-14/data/listings.csv.gz",
compression="gzip",
)
df.head(2)▶️ Print out the DataFrame’s summary.
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19410 entries, 0 to 19409
Data columns (total 79 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 19410 non-null int64
1 listing_url 19410 non-null object
2 scrape_id 19410 non-null int64
3 last_scraped 19410 non-null object
4 source 19410 non-null object
5 name 19410 non-null object
6 description 18673 non-null object
7 neighborhood_overview 8986 non-null object
8 picture_url 19410 non-null object
9 host_id 19410 non-null int64
10 host_url 19410 non-null object
11 host_name 19405 non-null object
12 host_since 19405 non-null object
13 host_location 14708 non-null object
14 host_about 12240 non-null object
15 host_response_time 16284 non-null object
16 host_response_rate 16284 non-null object
17 host_acceptance_rate 16662 non-null object
18 host_is_superhost 19027 non-null object
19 host_thumbnail_url 19405 non-null object
20 host_picture_url 19405 non-null object
21 host_neighbourhood 8623 non-null object
22 host_listings_count 19405 non-null float64
23 host_total_listings_count 19405 non-null float64
24 host_verifications 19405 non-null object
25 host_has_profile_pic 19405 non-null object
26 host_identity_verified 19405 non-null object
27 neighbourhood 8986 non-null object
28 neighbourhood_cleansed 19410 non-null object
29 neighbourhood_group_cleansed 19410 non-null object
30 latitude 19410 non-null float64
31 longitude 19410 non-null float64
32 property_type 19410 non-null object
33 room_type 19410 non-null object
34 accommodates 19410 non-null int64
35 bathrooms 15298 non-null float64
36 bathrooms_text 19399 non-null object
37 bedrooms 17446 non-null float64
38 beds 15228 non-null float64
39 amenities 19410 non-null object
40 price 15276 non-null object
41 minimum_nights 19410 non-null int64
42 maximum_nights 19410 non-null int64
43 minimum_minimum_nights 19406 non-null float64
44 maximum_minimum_nights 19406 non-null float64
45 minimum_maximum_nights 19406 non-null float64
46 maximum_maximum_nights 19406 non-null float64
47 minimum_nights_avg_ntm 19410 non-null float64
48 maximum_nights_avg_ntm 19410 non-null float64
49 calendar_updated 0 non-null float64
50 has_availability 18346 non-null object
51 availability_30 19410 non-null int64
52 availability_60 19410 non-null int64
53 availability_90 19410 non-null int64
54 availability_365 19410 non-null int64
55 calendar_last_scraped 19410 non-null object
56 number_of_reviews 19410 non-null int64
57 number_of_reviews_ltm 19410 non-null int64
58 number_of_reviews_l30d 19410 non-null int64
59 availability_eoy 19410 non-null int64
60 number_of_reviews_ly 19410 non-null int64
61 estimated_occupancy_l365d 19410 non-null int64
62 estimated_revenue_l365d 15276 non-null float64
63 first_review 14421 non-null object
64 last_review 14421 non-null object
65 review_scores_rating 14421 non-null float64
66 review_scores_accuracy 14418 non-null float64
67 review_scores_cleanliness 14419 non-null float64
68 review_scores_checkin 14417 non-null float64
69 review_scores_communication 14420 non-null float64
70 review_scores_location 14418 non-null float64
71 review_scores_value 14418 non-null float64
72 license 13930 non-null object
73 instant_bookable 19410 non-null object
74 calculated_host_listings_count 19410 non-null int64
75 calculated_host_listings_count_entire_homes 19410 non-null int64
76 calculated_host_listings_count_private_rooms 19410 non-null int64
77 calculated_host_listings_count_shared_rooms 19410 non-null int64
78 reviews_per_month 14421 non-null float64
dtypes: float64(23), int64(20), object(36)
memory usage: 11.7+ MB
▶️ Although info() provides some information about the number of non-missing values, you can explicitly count missing values in each column using the following code:
df.isna().sum()
# Equivalent to:
df.isnull().sum()id 0
listing_url 0
scrape_id 0
last_scraped 0
source 0
...
calculated_host_listings_count 0
calculated_host_listings_count_entire_homes 0
calculated_host_listings_count_private_rooms 0
calculated_host_listings_count_shared_rooms 0
reviews_per_month 4989
Length: 79, dtype: int64To only filter columns that have missing values, you can use:
df.isna().sum()[df.isna().sum() > 0]
# Equivalent to:
df.isnull().sum()[df.isnull().sum() > 0]description 737
neighborhood_overview 10424
host_name 5
host_since 5
host_location 4702
host_about 7170
host_response_time 3126
host_response_rate 3126
host_acceptance_rate 2748
host_is_superhost 383
host_thumbnail_url 5
host_picture_url 5
host_neighbourhood 10787
host_listings_count 5
host_total_listings_count 5
host_verifications 5
host_has_profile_pic 5
host_identity_verified 5
neighbourhood 10424
bathrooms 4112
bathrooms_text 11
bedrooms 1964
beds 4182
price 4134
minimum_minimum_nights 4
maximum_minimum_nights 4
minimum_maximum_nights 4
maximum_maximum_nights 4
calendar_updated 19410
has_availability 1064
estimated_revenue_l365d 4134
first_review 4989
last_review 4989
review_scores_rating 4989
review_scores_accuracy 4992
review_scores_cleanliness 4991
review_scores_checkin 4993
review_scores_communication 4990
review_scores_location 4992
review_scores_value 4992
license 5480
reviews_per_month 4989
dtype: int64Select and rename columns¶
df_c = df[
[
"name",
"neighbourhood_cleansed",
"room_type",
"bedrooms",
"bathrooms",
"accommodates",
"minimum_nights",
"price",
"availability_365",
"number_of_reviews",
"review_scores_rating",
"latitude",
"longitude",
"host_is_superhost",
]
].copy()
df_c.rename(
columns={
"neighbourhood_cleansed": "neighbourhood",
"review_scores_rating": "review_score",
"host_is_superhost": "is_superhost",
},
inplace=True,
)
df_c.head(2)Parse price as floats¶
df_c["price"] = (
df_c["price"].str.replace("$", "").str.replace(",", "").astype(np.float64)
)Convert is_superhost to 0s and 1s¶
df_c["is_superhost"] = np.where(df_c["is_superhost"] == "t", 1, 0)
df_c[["name", "is_superhost"]].sample(10)Remove listings with missing values¶
df_c["bathrooms"].unique()array([ 2. , 1.5, 1. , 3. , nan, 3.5, 4. , 0. , 2.5, 4.5, 6. ,
5.5, 0.5, 7.5, 5. , 8. , 7. , 9. , 12. , 10. , 14. , 13. ,
11. ])df_c["bedrooms"].unique()array([ 3., 2., 1., 4., nan, 0., 7., 5., 6., 8., 9., 12., 10.,
18., 20., 15., 24., 14., 19., 16., 11., 26., 29.])df_c.dropna(subset=["bedrooms", "bathrooms", "review_score"], inplace=True)df_c["room_type"].value_counts()room_type
Entire home/apt 8503
Private room 3160
Shared room 81
Hotel room 50
Name: count, dtype: int64Check the cleaned DataFrame¶
df_c.info()<class 'pandas.core.frame.DataFrame'>
Index: 11794 entries, 0 to 19366
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 11794 non-null object
1 neighbourhood 11794 non-null object
2 room_type 11794 non-null object
3 bedrooms 11794 non-null float64
4 bathrooms 11794 non-null float64
5 accommodates 11794 non-null int64
6 minimum_nights 11794 non-null int64
7 price 11776 non-null float64
8 availability_365 11794 non-null int64
9 number_of_reviews 11794 non-null int64
10 review_score 11794 non-null float64
11 latitude 11794 non-null float64
12 longitude 11794 non-null float64
13 is_superhost 11794 non-null int64
dtypes: float64(6), int64(5), object(3)
memory usage: 1.3+ MB
df_c.head()df_c.to_csv("cleaned-output.csv", index=None)