
Build a Regressor

import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import Ridge

import xgboost
from xgboost import XGBRegressor

import plotly
import plotly.express as px

print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"xgboost version: {xgboost.__version__}")
print(f"plotly version: {plotly.__version__}")
pandas version: 3.0.0
numpy version: 2.4.2
scikit-learn version: 1.8.0
xgboost version: 3.2.0
plotly version: 6.5.2

Load Data

df = pd.read_csv(
    "https://github.com/bdi593/datasets/raw/refs/heads/main/australian-rental-market-2026/australian_rental_market_2026.csv"
)
df.head()

Explore Data

df.info()
<class 'pandas.DataFrame'>
RangeIndex: 6767 entries, 0 to 6766
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           6767 non-null   str    
 1   price_display   6767 non-null   float64
 2   description     6767 non-null   str    
 3   propertyType    6767 non-null   str    
 4   locality        6767 non-null   str    
 5   latitude        6767 non-null   float64
 6   longitude       6767 non-null   float64
 7   postcode        6767 non-null   int64  
 8   state           6767 non-null   str    
 9   street_address  6767 non-null   str    
 10  suburb          6767 non-null   str    
 11  bathrooms       6767 non-null   int64  
 12  bedrooms        6767 non-null   int64  
 13  parking_spaces  6767 non-null   int64  
 14  agency_name     6767 non-null   str    
 15  amenities       6449 non-null   str    
dtypes: float64(3), int64(4), str(9)
memory usage: 846.0 KB
df.shape
(6767, 16)
df.isnull().sum()
title               0
price_display       0
description         0
propertyType        0
locality            0
latitude            0
longitude           0
postcode            0
state               0
street_address      0
suburb              0
bathrooms           0
bedrooms            0
parking_spaces      0
agency_name         0
amenities         318
dtype: int64

Target Column

The target column is price_display, because we are interested in building a prediction model for rental prices.

fig = px.histogram(
    df,
    x="price_display",
    nbins=50,
    title="Weekly rent distribution (AUD/week)",
    template="simple_white",
)
fig.update_layout(xaxis_title="price_display", yaxis_title="count")
fig.show()

The distribution is positively skewed (right-skewed):

  • Most properties cluster between $400 and $900 per week.

  • A small number of properties extend far into the right tail, reaching $2,000-$5,000 per week.

This long right tail indicates the presence of high-end luxury properties that are relatively rare but significantly more expensive than typical rentals.

fig = px.box(
    df,
    x="price_display",
    title="Weekly rent distribution (AUD/week)",
    template="simple_white",
)
fig.show()

The box plot confirms that weekly rental prices are strongly right-skewed. The median rent is $670 per week, with the middle 50% of properties (the interquartile range) spanning from about $550 to $800. This indicates that most rental listings fall within a relatively concentrated mid-market band. However, the right whisker extends much farther than the left, and there are numerous high-value outliers beyond $1,180 per week.

The large number of extreme points stretching out toward $3,000-$5,000 per week highlights a small but significant luxury segment. These outliers will influence the mean more than the median and can inflate error metrics like RMSE in predictive models. From a modeling perspective, this reinforces the need for robust approaches (e.g., gradient boosting) and possibly a log transformation of rent to better handle skewness and extreme values.
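The effect of the suggested log transformation can be illustrated on synthetic right-skewed data. The lognormal draw below is a stand-in for the rent column, not values from the dataset itself:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "rent" values; a stand-in for price_display,
# NOT drawn from the actual dataset
rng = np.random.default_rng(42)
rent = pd.Series(rng.lognormal(mean=6.5, sigma=0.5, size=5000))

raw_skew = rent.skew()
log_skew = np.log1p(rent).skew()  # log1p also handles zeros safely

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

If a model is trained on np.log1p(y), its predictions must be mapped back with np.expm1 before computing dollar-scale metrics like MAE.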

Listing Location

The dataset includes latitude and longitude coordinates for each rental listing, allowing us to visualize the geographic distribution of properties across Australia.

Latitude is the north-south position, while longitude is the east-west position. We can plot these coordinates as x and y axes to see a rough distribution of listings across the country. However, this simple scatter plot may not be the most effective way to visualize spatial data, especially if there are many listings clustered in certain areas.

# Drop rows missing coordinates or target for this plot
tmp = df.dropna(subset=["latitude", "longitude", "price_display"]).copy()

fig = px.scatter(
    tmp,
    x="longitude",
    y="latitude",
    opacity=0.4,
    title="Listing locations",
    template="simple_white",
)
fig.show()

A better way to visualize the geographic distribution of rental listings is to use a map-based scatter plot, which can provide more context and clarity. By plotting the latitude and longitude on a map, we can see where listings are concentrated across different cities and regions in Australia. This approach also allows us to overlay additional information, such as state boundaries, to better understand the spatial distribution of rentals.

Plotly provides multiple options for map visualizations, including scatter_mapbox and scatter_map. Some styles require Mapbox API tokens, while others use open-source map styles that do not require authentication. For our purposes, we can use scatter_map with an open-source style to visualize the rental listings without needing an API token.

fig = px.scatter_map(
    df,
    lat="latitude",
    lon="longitude",
    hover_name="title",
    hover_data={
        "price_display": True,
        "bedrooms": True,
        "bathrooms": True,
        "state": True,
    },
    zoom=3,
    height=600,
    template="simple_white",
    title="Listings location on a geographic map",
)

fig.show()

Parameters Explained

lat="latitude": Column name for the y-coordinate

  • Must be numeric

  • Values between roughly -44 and -10 for Australia

lon="longitude": Column name for the x-coordinate

  • Must be numeric

  • Values roughly 110-155 for Australia

hover_data={...}: Controls what additional information appears in the tooltip.

Example:

hover_data={
    "price_display": True,
    "bedrooms": True,
    "bathrooms": True,
    "state": True
}
  • True: include the variable

  • False: hide it

  • You can also format numbers (e.g., ":.0f")

zoom=3: Controls initial zoom level.

  • 2-3: whole Australia

  • 6-8: metro-level

  • 10+: suburb-level
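
The include/hide flags can be combined with number formatting. The dict below is an illustrative configuration, not taken from the notebook; Plotly Express interprets strings beginning with ":" as d3-format specs:

```python
# Illustrative hover_data configuration for px.scatter_map.
# Strings starting with ":" are d3-format specs (not Python format strings).
hover_data = {
    "price_display": ":$,.0f",  # format rent like $1,250
    "bedrooms": True,           # include as-is
    "latitude": False,          # hide raw coordinates from the tooltip
    "longitude": False,
}
```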

Preprocess

There are not many preprocessing steps needed for this dataset, since XGBoost can handle categorical variables natively (this requires XGBoost > 1.6). We will convert the categorical features to pd.CategoricalDtype to ensure they are treated as categorical by XGBoost.

numeric_features = ["bedrooms", "bathrooms", "parking_spaces", "latitude", "longitude"]
categorical_features = [
    "propertyType",
    "locality",
    "postcode",
    "state",
    "suburb",
    "agency_name",
]

for col in categorical_features:
    df[col] = df[col].astype("category")

We will split the data into training and testing sets, using an 80/20 split. This allows us to train the model on the majority of the data while reserving a portion for evaluating its performance on unseen data.

target = "price_display"

y = df[target]
X = df[numeric_features + categorical_features]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train:", X_train.shape, "Test:", X_test.shape)
Train: (5413, 11) Test: (1354, 11)

Create the Model

A “Minimal” Regressor with Default Hyperparameters

We first build a “minimal” regressor using XGBoost with default hyperparameters. This serves as a baseline model to evaluate the performance of our features without any tuning.

  • The tree_method="hist" parameter is required to enable native categorical support in XGBoost.

  • The enable_categorical=True parameter tells XGBoost to treat categorical features as such, allowing it to handle them more effectively without needing one-hot encoding or other preprocessing steps.

model_minimal = XGBRegressor(
    tree_method="hist",  # Required for categorical support
    enable_categorical=True,  # Enable native categorical handling
)

model_minimal.fit(X_train, y_train)

Evaluate the Model

y_minimal_pred = model_minimal.predict(X_test)

mae = mean_absolute_error(y_test, y_minimal_pred)
rmse = mean_squared_error(y_test, y_minimal_pred) ** 0.5
r2 = r2_score(y_test, y_minimal_pred)

print(f"MAE:  ${mae:,.2f} AUD/week")
print(f"RMSE: ${rmse:,.2f} AUD/week")
print(f"R²:   {r2:.3f}")
MAE:  $131.76 AUD/week
RMSE: $273.35 AUD/week
R²:   0.385

Interpret the Results

MAE: $131.76 AUD/week

Mean Absolute Error (MAE) tells us the average size of the prediction error, regardless of direction.

On average, the model’s predicted rent is off by about $132 per week.

This is the most business-friendly metric. For context:

  • If the median rent is around $650-$700/week,

  • A $132 error represents roughly 18-20% average deviation.

From a pricing engine perspective, this is moderate accuracy, but likely not production-grade yet.
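
That percentage can be checked directly; the $670 median is the figure quoted in the box-plot discussion above:

```python
mae = 131.76         # MAE from the evaluation above (AUD/week)
median_rent = 670.0  # median weekly rent from the box plot section

relative_error = mae / median_rent
print(f"MAE as a share of median rent: {relative_error:.1%}")  # → 19.7%
```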

RMSE: $273.35 AUD/week

Root Mean Squared Error (RMSE) penalizes large errors more heavily than MAE.

Notice:

  • RMSE ($273) is more than double MAE ($132)

This tells us something important:

The model makes some large errors, likely on high-end or luxury properties.

Because rent is right-skewed, extreme properties in the $2,000-$5,000 range inflate RMSE significantly.

This is very typical for housing price models.
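
The MAE/RMSE gap can be reproduced with a toy example: a single large miss (on a luxury listing, say) moves RMSE far more than MAE. The error values below are made up for illustration:

```python
import numpy as np

# Four typical misses plus one luxury-listing miss (illustrative numbers)
errors = np.array([50.0, 60.0, 55.0, 45.0, 2000.0])

mae = np.abs(errors).mean()           # → 442.0
rmse = np.sqrt((errors ** 2).mean())  # ≈ 896

print(f"MAE: {mae:.0f}, RMSE: {rmse:.0f}, ratio: {rmse / mae:.2f}")
```

One outlier error is enough to push RMSE to roughly double MAE, mirroring the ~2x ratio observed above.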

R²: 0.385

R² = 0.385 means the model explains 38.5% of the variation in rental prices.

In practical terms:

  • About 39% of rent variability is captured by the features.

  • About 61% remains unexplained.

For housing data, this suggests:

  • Location + property features matter

  • But important variables are missing (e.g., floor area, property condition, distance to CBD, school zones, etc.)

An R² around 0.35-0.50 is common in early-stage real estate modeling.

A “Tuned” Regressor

We will build a “tuned” regressor using XGBoost with hyperparameter tuning to optimize performance. This involves adjusting parameters such as learning rate, max depth, and number of estimators to find the best combination for our dataset.

Note that only one sample hyperparameter configuration is shown here for demonstration purposes. In practice, you would want to perform a more systematic search (e.g., grid search or random search) to find the optimal hyperparameters for your model.
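
One way to sketch such a random search is to draw candidate configurations with scikit-learn's ParameterSampler; each candidate would then be fitted and scored with cross-validation (e.g., via RandomizedSearchCV, which bundles sampling and CV together). The search space below is an assumption for illustration, not a tuned recommendation:

```python
from sklearn.model_selection import ParameterSampler

# Illustrative search space; the ranges are assumptions, not tuned values
param_distributions = {
    "n_estimators": [400, 800, 1200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [4, 6, 8],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
}

# Draw 10 random configurations; each would be cross-validated in practice
candidates = list(ParameterSampler(param_distributions, n_iter=10, random_state=42))
for cfg in candidates[:3]:
    print(cfg)
```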

model = XGBRegressor(
    n_estimators=1200,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    objective="reg:squarederror",
    tree_method="hist",  # REQUIRED
    enable_categorical=True,  # KEY PARAMETER
    random_state=42,
    n_jobs=-1,
)

model.fit(X_train, y_train)
| Parameter | Value | What It Controls | Effect on Model | Why It Matters for Rental Pricing |
|---|---|---|---|---|
| n_estimators | 1200 | Number of boosting rounds (trees) | More trees → more learning capacity | Captures complex geographic and feature interactions |
| learning_rate | 0.05 | Shrinks each tree's contribution | Lower = slower, more stable learning | Prevents overfitting while allowing many trees |
| max_depth | 6 | Maximum tree depth | Higher = more complex splits | Captures nonlinear rent effects (e.g., suburb × bedrooms) |
| subsample | 0.8 | Fraction of rows used per tree | Adds randomness | Improves generalization, reduces overfitting |
| colsample_bytree | 0.8 | Fraction of features used per tree | Adds randomness | Prevents reliance on a single dominant variable |
| reg_lambda | 1.0 | L2 regularization strength | Higher = simpler model | Controls extreme predictions |
| objective | "reg:squarederror" | Loss function | Minimizes squared error | Standard for regression problems |
| tree_method | "hist" | Tree construction algorithm | Faster, memory-efficient | Required for categorical handling |
| enable_categorical | True | Native categorical support | Learns category splits automatically | Avoids one-hot explosion for suburb/postcode |
| random_state | 42 | Random seed | Ensures reproducibility | Important for teaching & benchmarking |
| n_jobs | -1 | Number of CPU cores used | Faster training | No effect on accuracy |

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"MAE:  ${mae:,.2f} AUD/week")
print(f"RMSE: ${rmse:,.2f} AUD/week")
print(f"R²:   {r2:.3f}")
MAE:  $123.61 AUD/week
RMSE: $258.00 AUD/week
R²:   0.452
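
Putting the two models side by side (numbers copied from the outputs above):

```python
baseline = {"mae": 131.76, "rmse": 273.35, "r2": 0.385}  # minimal model
tuned = {"mae": 123.61, "rmse": 258.00, "r2": 0.452}     # tuned model

mae_gain = (baseline["mae"] - tuned["mae"]) / baseline["mae"]
rmse_gain = (baseline["rmse"] - tuned["rmse"]) / baseline["rmse"]

print(f"MAE improved by {mae_gain:.1%}, RMSE by {rmse_gain:.1%}")
# → MAE improved by 6.2%, RMSE by 5.6%
```

Tuning buys a modest single-digit improvement on every metric. As noted in the R² discussion, larger gains would more likely come from additional features such as floor area or distance to the CBD than from further hyperparameter search.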