Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Cleaning Airbnb Listings Dataset

Inside Airbnb provides datasets containing Airbnb listings for various cities around the world. These datasets often require cleaning before they can be effectively analyzed. In this notebook, we will demonstrate how to clean an Airbnb listings dataset for Barcelona, Spain.

Data Cleaning Steps

  1. Import Libraries: We will use pandas for data manipulation and numpy for numerical operations.

  2. Load Dataset: Read the CSV file containing the Airbnb listings data.

    • Sometimes, the dataset may be compressed (e.g., in gzip format), so we will handle that accordingly.

    • If the dataset is large but can fit into memory, consider loading a sample for initial exploration.

    • If the dataset is too large to fit into memory, consider using libraries like Dask, Spark, or Polars for out-of-core processing.

  3. Inspect Data: Check the first few rows of the dataset to understand its structure.

  4. Select Relevant Columns: Identify and retain only the columns that are necessary for analysis.

  5. Handle Missing Values: Identify columns with missing values and decide on strategies to handle them (e.g., dropping rows, filling with mean/median/mode, etc.).

  6. Data Type Conversion: Ensure that each column has the appropriate data type (e.g., converting price columns to numeric types).

    • A common issue is that price columns may contain currency symbols or commas, which need to be removed before conversion.

    • Another common issue is date columns being stored as strings; these should be converted to datetime objects if you plan to perform date-based analyses.

▶️ Import libraries.

import pandas as pd
import numpy as np
import os

▶️ Display up to 100 columns in this Jupyter notebook.

pd.set_option("display.max_columns", 100)

Load Dataset

The compression="gzip" parameter is used because the dataset file is compressed in gzip format. If your dataset is not compressed, you should omit this parameter. This parameter is optional if the file extension is .gz, as pandas can automatically detect the compression type based on the file extension. However, it is always a good practice to specify it explicitly when you know the file is compressed, because additional URL query parameters may interfere with pandas’ ability to infer the compression type from the file extension.

df = pd.read_csv(
    "https://data.insideairbnb.com/spain/catalonia/barcelona/2025-09-14/data/listings.csv.gz",
    compression="gzip",
)
df.head(2)
Loading...

▶️ Print out the DataFrame’s summary.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19410 entries, 0 to 19409
Data columns (total 79 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            19410 non-null  int64  
 1   listing_url                                   19410 non-null  object 
 2   scrape_id                                     19410 non-null  int64  
 3   last_scraped                                  19410 non-null  object 
 4   source                                        19410 non-null  object 
 5   name                                          19410 non-null  object 
 6   description                                   18673 non-null  object 
 7   neighborhood_overview                         8986 non-null   object 
 8   picture_url                                   19410 non-null  object 
 9   host_id                                       19410 non-null  int64  
 10  host_url                                      19410 non-null  object 
 11  host_name                                     19405 non-null  object 
 12  host_since                                    19405 non-null  object 
 13  host_location                                 14708 non-null  object 
 14  host_about                                    12240 non-null  object 
 15  host_response_time                            16284 non-null  object 
 16  host_response_rate                            16284 non-null  object 
 17  host_acceptance_rate                          16662 non-null  object 
 18  host_is_superhost                             19027 non-null  object 
 19  host_thumbnail_url                            19405 non-null  object 
 20  host_picture_url                              19405 non-null  object 
 21  host_neighbourhood                            8623 non-null   object 
 22  host_listings_count                           19405 non-null  float64
 23  host_total_listings_count                     19405 non-null  float64
 24  host_verifications                            19405 non-null  object 
 25  host_has_profile_pic                          19405 non-null  object 
 26  host_identity_verified                        19405 non-null  object 
 27  neighbourhood                                 8986 non-null   object 
 28  neighbourhood_cleansed                        19410 non-null  object 
 29  neighbourhood_group_cleansed                  19410 non-null  object 
 30  latitude                                      19410 non-null  float64
 31  longitude                                     19410 non-null  float64
 32  property_type                                 19410 non-null  object 
 33  room_type                                     19410 non-null  object 
 34  accommodates                                  19410 non-null  int64  
 35  bathrooms                                     15298 non-null  float64
 36  bathrooms_text                                19399 non-null  object 
 37  bedrooms                                      17446 non-null  float64
 38  beds                                          15228 non-null  float64
 39  amenities                                     19410 non-null  object 
 40  price                                         15276 non-null  object 
 41  minimum_nights                                19410 non-null  int64  
 42  maximum_nights                                19410 non-null  int64  
 43  minimum_minimum_nights                        19406 non-null  float64
 44  maximum_minimum_nights                        19406 non-null  float64
 45  minimum_maximum_nights                        19406 non-null  float64
 46  maximum_maximum_nights                        19406 non-null  float64
 47  minimum_nights_avg_ntm                        19410 non-null  float64
 48  maximum_nights_avg_ntm                        19410 non-null  float64
 49  calendar_updated                              0 non-null      float64
 50  has_availability                              18346 non-null  object 
 51  availability_30                               19410 non-null  int64  
 52  availability_60                               19410 non-null  int64  
 53  availability_90                               19410 non-null  int64  
 54  availability_365                              19410 non-null  int64  
 55  calendar_last_scraped                         19410 non-null  object 
 56  number_of_reviews                             19410 non-null  int64  
 57  number_of_reviews_ltm                         19410 non-null  int64  
 58  number_of_reviews_l30d                        19410 non-null  int64  
 59  availability_eoy                              19410 non-null  int64  
 60  number_of_reviews_ly                          19410 non-null  int64  
 61  estimated_occupancy_l365d                     19410 non-null  int64  
 62  estimated_revenue_l365d                       15276 non-null  float64
 63  first_review                                  14421 non-null  object 
 64  last_review                                   14421 non-null  object 
 65  review_scores_rating                          14421 non-null  float64
 66  review_scores_accuracy                        14418 non-null  float64
 67  review_scores_cleanliness                     14419 non-null  float64
 68  review_scores_checkin                         14417 non-null  float64
 69  review_scores_communication                   14420 non-null  float64
 70  review_scores_location                        14418 non-null  float64
 71  review_scores_value                           14418 non-null  float64
 72  license                                       13930 non-null  object 
 73  instant_bookable                              19410 non-null  object 
 74  calculated_host_listings_count                19410 non-null  int64  
 75  calculated_host_listings_count_entire_homes   19410 non-null  int64  
 76  calculated_host_listings_count_private_rooms  19410 non-null  int64  
 77  calculated_host_listings_count_shared_rooms   19410 non-null  int64  
 78  reviews_per_month                             14421 non-null  float64
dtypes: float64(23), int64(20), object(36)
memory usage: 11.7+ MB

▶️ Although info() provides some information about the number of non-missing values, you can explicitly count missing values in each column using the following code:

df.isna().sum()
# Equivalent to:
df.isnull().sum()
id 0 listing_url 0 scrape_id 0 last_scraped 0 source 0 ... calculated_host_listings_count 0 calculated_host_listings_count_entire_homes 0 calculated_host_listings_count_private_rooms 0 calculated_host_listings_count_shared_rooms 0 reviews_per_month 4989 Length: 79, dtype: int64

To only filter columns that have missing values, you can use:

df.isna().sum()[df.isna().sum() > 0]
# Equivalent to:
df.isnull().sum()[df.isnull().sum() > 0]
description 737 neighborhood_overview 10424 host_name 5 host_since 5 host_location 4702 host_about 7170 host_response_time 3126 host_response_rate 3126 host_acceptance_rate 2748 host_is_superhost 383 host_thumbnail_url 5 host_picture_url 5 host_neighbourhood 10787 host_listings_count 5 host_total_listings_count 5 host_verifications 5 host_has_profile_pic 5 host_identity_verified 5 neighbourhood 10424 bathrooms 4112 bathrooms_text 11 bedrooms 1964 beds 4182 price 4134 minimum_minimum_nights 4 maximum_minimum_nights 4 minimum_maximum_nights 4 maximum_maximum_nights 4 calendar_updated 19410 has_availability 1064 estimated_revenue_l365d 4134 first_review 4989 last_review 4989 review_scores_rating 4989 review_scores_accuracy 4992 review_scores_cleanliness 4991 review_scores_checkin 4993 review_scores_communication 4990 review_scores_location 4992 review_scores_value 4992 license 5480 reviews_per_month 4989 dtype: int64

Select and rename columns

df_c = df[
    [
        "name",
        "neighbourhood_cleansed",
        "room_type",
        "bedrooms",
        "bathrooms",
        "accommodates",
        "minimum_nights",
        "price",
        "availability_365",
        "number_of_reviews",
        "review_scores_rating",
        "latitude",
        "longitude",
        "host_is_superhost",
    ]
].copy()

df_c.rename(
    columns={
        "neighbourhood_cleansed": "neighbourhood",
        "review_scores_rating": "review_score",
        "host_is_superhost": "is_superhost",
    },
    inplace=True,
)

df_c.head(2)
Loading...

Parse price as floats

df_c["price"] = (
    df_c["price"].str.replace("$", "").str.replace(",", "").astype(np.float64)
)

Convert is_superhost to 0s and 1s

df_c["is_superhost"] = np.where(df_c["is_superhost"] == "t", 1, 0)
df_c[["name", "is_superhost"]].sample(10)
Loading...

Remove listings with missing values

df_c["bathrooms"].unique()
array([ 2. , 1.5, 1. , 3. , nan, 3.5, 4. , 0. , 2.5, 4.5, 6. , 5.5, 0.5, 7.5, 5. , 8. , 7. , 9. , 12. , 10. , 14. , 13. , 11. ])
df_c["bedrooms"].unique()
array([ 3., 2., 1., 4., nan, 0., 7., 5., 6., 8., 9., 12., 10., 18., 20., 15., 24., 14., 19., 16., 11., 26., 29.])
df_c.dropna(subset=["bedrooms", "bathrooms", "review_score"], inplace=True)
df_c["room_type"].value_counts()
room_type Entire home/apt 8503 Private room 3160 Shared room 81 Hotel room 50 Name: count, dtype: int64

Check the cleaned DataFrame

df_c.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11794 entries, 0 to 19366
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               11794 non-null  object 
 1   neighbourhood      11794 non-null  object 
 2   room_type          11794 non-null  object 
 3   bedrooms           11794 non-null  float64
 4   bathrooms          11794 non-null  float64
 5   accommodates       11794 non-null  int64  
 6   minimum_nights     11794 non-null  int64  
 7   price              11776 non-null  float64
 8   availability_365   11794 non-null  int64  
 9   number_of_reviews  11794 non-null  int64  
 10  review_score       11794 non-null  float64
 11  latitude           11794 non-null  float64
 12  longitude          11794 non-null  float64
 13  is_superhost       11794 non-null  int64  
dtypes: float64(6), int64(5), object(3)
memory usage: 1.3+ MB
df_c.head()
Loading...
df_c.to_csv("cleaned-output.csv", index=None)