Cleaning Airbnb Listings Dataset

Inside Airbnb provides datasets containing Airbnb listings for various cities around the world. These datasets often require cleaning before they can be effectively analyzed. In this notebook, we will demonstrate how to clean an Airbnb listings dataset for Barcelona, Spain.

Data Cleaning Steps¶

Import Libraries: We will use pandas for data manipulation and numpy for numerical operations.
Load Dataset: Read the CSV file containing the Airbnb listings data.
- Sometimes, the dataset may be compressed (e.g., in gzip format), so we will handle that accordingly.
- If the dataset is large but can fit into memory, consider loading a sample for initial exploration.
- If the dataset is too large to fit into memory, consider using libraries like Dask, Spark, or Polars for out-of-core processing.
Inspect Data: Check the first few rows of the dataset to understand its structure.
Select Relevant Columns: Identify and retain only the columns that are necessary for analysis.
Handle Missing Values: Identify columns with missing values and decide on strategies to handle them (e.g., dropping rows, filling with mean/median/mode, etc.).
Data Type Conversion: Ensure that each column has the appropriate data type (e.g., converting price columns to numeric types).
- A common issue is that price columns may contain currency symbols or commas, which need to be removed before conversion.
- Another common issue is date columns being stored as strings; these should be converted to datetime objects if you plan to perform date-based analyses.

▶️ Import libraries.

import pandas as pd
import numpy as np
import os

▶️ Display up to 100 columns in this Jupyter notebook.

pd.set_option("display.max_columns", 100)

Maximum Number of Rows

You can also set the maximum number of rows to display using the following code:

pd.set_option("display.max_rows", 100)

If you want to display all columns without any limit, you can set it to None:

pd.set_option("display.max_columns", None)

This is not recommended for very wide DataFrames, as it may clutter your output.

Load Dataset¶

The compression="gzip" parameter is used because the dataset file is compressed in gzip format. If your dataset is not compressed, you should omit this parameter. This parameter is optional if the file extension is .gz, as pandas can automatically detect the compression type based on the file extension. However, it is always a good practice to specify it explicitly when you know the file is compressed, because additional URL query parameters may interfere with pandas’ ability to infer the compression type from the file extension.

df = pd.read_csv(
    "https://data.insideairbnb.com/spain/catalonia/barcelona/2025-09-14/data/listings.csv.gz",
    compression="gzip",
)
df.head(2)

▶️ Print out the DataFrame’s summary.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19410 entries, 0 to 19409
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            19410 non-null  int64  
 1   listing_url                                   19410 non-null  object 
 2   scrape_id                                     19410 non-null  int64  
 3   last_scraped                                  19410 non-null  object 
 4   source                                        19410 non-null  object 
 5   name                                          19410 non-null  object 
 6   description                                   18673 non-null  object 
 7   neighborhood_overview                         8986 non-null   object 
 8   picture_url                                   19410 non-null  object 
 9   host_id                                       19410 non-null  int64  
 10  host_url                                      19410 non-null  object 
 11  host_name                                     19405 non-null  object 
 12  host_since                                    19405 non-null  object 
 13  host_location                                 14708 non-null  object 
 14  host_about                                    12240 non-null  object 
 15  host_response_time                            16284 non-null  object 
 16  host_response_rate                            16284 non-null  object 
 17  host_acceptance_rate                          16662 non-null  object 
 18  host_is_superhost                             19027 non-null  object 
 19  host_thumbnail_url                            19405 non-null  object 
 20  host_picture_url                              19405 non-null  object 
 21  host_neighbourhood                            8623 non-null   object 
 22  host_listings_count                           19405 non-null  float64
 23  host_total_listings_count                     19405 non-null  float64
 24  host_verifications                            19405 non-null  object 
 25  host_has_profile_pic                          19405 non-null  object 
 26  host_identity_verified                        19405 non-null  object 
 27  neighbourhood                                 8986 non-null   object 
 28  neighbourhood_cleansed                        19410 non-null  object 
 29  neighbourhood_group_cleansed                  19410 non-null  object 
 30  latitude                                      19410 non-null  float64
 31  longitude                                     19410 non-null  float64
 32  property_type                                 19410 non-null  object 
 33  room_type                                     19410 non-null  object 
 34  accommodates                                  19410 non-null  int64  
 35  bathrooms                                     15298 non-null  float64
 36  bathrooms_text                                19399 non-null  object 
 37  bedrooms                                      17446 non-null  float64
 38  beds                                          15228 non-null  float64
 39  amenities                                     19410 non-null  object 
 40  price                                         15276 non-null  object 
 41  minimum_nights                                19410 non-null  int64  
 42  maximum_nights                                19410 non-null  int64  
 43  minimum_minimum_nights                        19406 non-null  float64
 44  maximum_minimum_nights                        19406 non-null  float64
 45  minimum_maximum_nights                        19406 non-null  float64
 46  maximum_maximum_nights                        19406 non-null  float64
 47  minimum_nights_avg_ntm                        19410 non-null  float64
 48  maximum_nights_avg_ntm                        19410 non-null  float64
 49  calendar_updated                              0 non-null      float64
 50  has_availability                              18346 non-null  object 
 51  availability_30                               19410 non-null  int64  
 52  availability_60                               19410 non-null  int64  
 53  availability_90                               19410 non-null  int64  
 54  availability_365                              19410 non-null  int64  
 55  calendar_last_scraped                         19410 non-null  object 
 56  number_of_reviews                             19410 non-null  int64  
 57  number_of_reviews_ltm                         19410 non-null  int64  
 58  number_of_reviews_l30d                        19410 non-null  int64  
 59  availability_eoy                              19410 non-null  int64  
 60  number_of_reviews_ly                          19410 non-null  int64  
 61  estimated_occupancy_l365d                     19410 non-null  int64  
 62  estimated_revenue_l365d                       15276 non-null  float64
 63  first_review                                  14421 non-null  object 
 64  last_review                                   14421 non-null  object 
 65  review_scores_rating                          14421 non-null  float64
 66  review_scores_accuracy                        14418 non-null  float64
 67  review_scores_cleanliness                     14419 non-null  float64
 68  review_scores_checkin                         14417 non-null  float64
 69  review_scores_communication                   14420 non-null  float64
 70  review_scores_location                        14418 non-null  float64
 71  review_scores_value                           14418 non-null  float64
 72  license                                       13930 non-null  object 
 73  instant_bookable                              19410 non-null  object 
 74  calculated_host_listings_count                19410 non-null  int64  
 75  calculated_host_listings_count_entire_homes   19410 non-null  int64  
 76  calculated_host_listings_count_private_rooms  19410 non-null  int64  
 77  calculated_host_listings_count_shared_rooms   19410 non-null  int64  
 78  reviews_per_month                             14421 non-null  float64
dtypes: float64(23), int64(20), object(36)
memory usage: 11.7+ MB

▶️ Although info() provides some information about the number of non-missing values, you can explicitly count missing values in each column using the following code:

df.isna().sum()
# Equivalent to:
df.isnull().sum()

id                                                 0
listing_url                                        0
scrape_id                                          0
last_scraped                                       0
source                                             0
                                                ... 
calculated_host_listings_count                     0
calculated_host_listings_count_entire_homes        0
calculated_host_listings_count_private_rooms       0
calculated_host_listings_count_shared_rooms        0
reviews_per_month                               4989
Length: 79, dtype: int64

To only filter columns that have missing values, you can use:

df.isna().sum()[df.isna().sum() > 0]
# Equivalent to:
df.isnull().sum()[df.isnull().sum() > 0]

description                      737
neighborhood_overview          10424
host_name                          5
host_since                         5
host_location                   4702
host_about                      7170
host_response_time              3126
host_response_rate              3126
host_acceptance_rate            2748
host_is_superhost                383
host_thumbnail_url                 5
host_picture_url                   5
host_neighbourhood             10787
host_listings_count                5
host_total_listings_count          5
host_verifications                 5
host_has_profile_pic               5
host_identity_verified             5
neighbourhood                  10424
bathrooms                       4112
bathrooms_text                    11
bedrooms                        1964
beds                            4182
price                           4134
minimum_minimum_nights             4
maximum_minimum_nights             4
minimum_maximum_nights             4
maximum_maximum_nights             4
calendar_updated               19410
has_availability                1064
estimated_revenue_l365d         4134
first_review                    4989
last_review                     4989
review_scores_rating            4989
review_scores_accuracy          4992
review_scores_cleanliness       4991
review_scores_checkin           4993
review_scores_communication     4990
review_scores_location          4992
review_scores_value             4992
license                         5480
reviews_per_month               4989
dtype: int64

Select and rename columns¶

df_c = df[
    [
        "name",
        "neighbourhood_cleansed",
        "room_type",
        "bedrooms",
        "bathrooms",
        "accommodates",
        "minimum_nights",
        "price",
        "availability_365",
        "number_of_reviews",
        "review_scores_rating",
        "latitude",
        "longitude",
        "host_is_superhost",
    ]
].copy()

df_c.rename(
    columns={
        "neighbourhood_cleansed": "neighbourhood",
        "review_scores_rating": "review_score",
        "host_is_superhost": "is_superhost",
    },
    inplace=True,
)

df_c.head(2)

Parse price as `float`s¶

df_c["price"] = (
    df_c["price"].str.replace("$", "").str.replace(",", "").astype(np.float64)
)

Convert `is_superhost` to 0s and 1s¶

df_c["is_superhost"] = np.where(df_c["is_superhost"] == "t", 1, 0)
df_c[["name", "is_superhost"]].sample(10)

Remove listings with missing values¶

df_c["bathrooms"].unique()

array([ 2. ,  1.5,  1. ,  3. ,  nan,  3.5,  4. ,  0. ,  2.5,  4.5,  6. ,
        5.5,  0.5,  7.5,  5. ,  8. ,  7. ,  9. , 12. , 10. , 14. , 13. ,
       11. ])

df_c["bedrooms"].unique()

array([ 3.,  2.,  1.,  4., nan,  0.,  7.,  5.,  6.,  8.,  9., 12., 10.,
       18., 20., 15., 24., 14., 19., 16., 11., 26., 29.])

df_c.dropna(subset=["bedrooms", "bathrooms", "review_score"], inplace=True)

df_c["room_type"].value_counts()

room_type
Entire home/apt    8503
Private room       3160
Shared room          81
Hotel room           50
Name: count, dtype: int64

Check the cleaned DataFrame¶

df_c.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11794 entries, 0 to 19366
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               11794 non-null  object 
 1   neighbourhood      11794 non-null  object 
 2   room_type          11794 non-null  object 
 3   bedrooms           11794 non-null  float64
 4   bathrooms          11794 non-null  float64
 5   accommodates       11794 non-null  int64  
 6   minimum_nights     11794 non-null  int64  
 7   price              11776 non-null  float64
 8   availability_365   11794 non-null  int64  
 9   number_of_reviews  11794 non-null  int64  
 10  review_score       11794 non-null  float64
 11  latitude           11794 non-null  float64
 12  longitude          11794 non-null  float64
 13  is_superhost       11794 non-null  int64  
dtypes: float64(6), int64(5), object(3)
memory usage: 1.3+ MB

df_c.head()

df_c.to_csv("cleaned-output.csv", index=None)