
Build a Regressor

import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import Ridge

import xgboost
from xgboost import XGBRegressor

import plotly
import plotly.express as px

print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"xgboost version: {xgboost.__version__}")
print(f"plotly version: {plotly.__version__}")
pandas version: 3.0.0
numpy version: 2.4.2
scikit-learn version: 1.8.0
xgboost version: 3.2.0
plotly version: 6.5.2

Load Data

df = pd.read_csv(
    "https://github.com/bdi593/datasets/raw/refs/heads/main/australian-rental-market-2026/australian_rental_market_2026.csv"
)
df.head()

Explore Data

df.info()
<class 'pandas.DataFrame'>
RangeIndex: 6767 entries, 0 to 6766
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           6767 non-null   str    
 1   price_display   6767 non-null   float64
 2   description     6767 non-null   str    
 3   propertyType    6767 non-null   str    
 4   locality        6767 non-null   str    
 5   latitude        6767 non-null   float64
 6   longitude       6767 non-null   float64
 7   postcode        6767 non-null   int64  
 8   state           6767 non-null   str    
 9   street_address  6767 non-null   str    
 10  suburb          6767 non-null   str    
 11  bathrooms       6767 non-null   int64  
 12  bedrooms        6767 non-null   int64  
 13  parking_spaces  6767 non-null   int64  
 14  agency_name     6767 non-null   str    
 15  amenities       6449 non-null   str    
dtypes: float64(3), int64(4), str(9)
memory usage: 846.0 KB
df.shape
(6767, 16)
df.isnull().sum()
title               0
price_display       0
description         0
propertyType        0
locality            0
latitude            0
longitude           0
postcode            0
state               0
street_address      0
suburb              0
bathrooms           0
bedrooms            0
parking_spaces      0
agency_name         0
amenities         318
dtype: int64

Target Column

The target column is price_display, because we are interested in building a prediction model for rental prices.

fig = px.histogram(
    df,
    x="price_display",
    nbins=50,
    title="Weekly rent distribution (AUD/week)",
    template="simple_white",
)
fig.update_layout(xaxis_title="price_display", yaxis_title="count")
fig.show()

The distribution is positively skewed (right-skewed):

  • Most properties cluster between $400 and $900 per week.

  • A small number of properties extend far into the right tail, reaching $2,000-$5,000 per week.

This long right tail indicates the presence of high-end luxury properties that are relatively rare but significantly more expensive than typical rentals.

fig = px.box(
    df,
    x="price_display",
    title="Weekly rent distribution (AUD/week)",
    template="simple_white",
)
fig.show()

The box plot confirms that weekly rental prices are strongly right-skewed. The median rent is $670 per week, with the middle 50% of properties (the interquartile range) spanning from about $550 to $800. This indicates that most rental listings fall within a relatively concentrated mid-market band. However, the right whisker extends much farther than the left, and there are numerous high-value outliers beyond $1,180 per week.

The large number of extreme points stretching out toward $3,000-$5,000 per week highlights a small but significant luxury segment. These outliers will influence the mean more than the median and can inflate error metrics like RMSE in predictive models. From a modeling perspective, this reinforces the need for robust approaches (e.g., gradient boosting) and possibly a log transformation of rent to better handle skewness and extreme values.
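The effect of the suggested log transformation can be illustrated on synthetic right-skewed data. The lognormal draw below is a stand-in for the rent column, not values from the dataset itself:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "rent" values; a stand-in for price_display,
# NOT drawn from the actual dataset
rng = np.random.default_rng(42)
rent = pd.Series(rng.lognormal(mean=6.5, sigma=0.5, size=5000))

raw_skew = rent.skew()
log_skew = np.log1p(rent).skew()  # log1p also handles zeros safely

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

If a model is trained on np.log1p(y), its predictions must be mapped back with np.expm1 before computing dollar-scale metrics like MAE.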

Listing Location

The dataset includes latitude and longitude coordinates for each rental listing, allowing us to visualize the geographic distribution of properties across Australia.

Latitude is the north-south position, while longitude is the east-west position. We can plot these coordinates as x and y axes to see a rough distribution of listings across the country. However, this simple scatter plot may not be the most effective way to visualize spatial data, especially if there are many listings clustered in certain areas.

# Drop rows missing coordinates or target for this plot
tmp = df.dropna(subset=["latitude", "longitude", "price_display"]).copy()

fig = px.scatter(
    tmp,
    x="longitude",
    y="latitude",
    opacity=0.4,
    title="Listing locations",
    template="simple_white",
)
fig.show()

A better way to visualize the geographic distribution of rental listings is to use a map-based scatter plot, which can provide more context and clarity. By plotting the latitude and longitude on a map, we can see where listings are concentrated across different cities and regions in Australia. This approach also allows us to overlay additional information, such as state boundaries, to better understand the spatial distribution of rentals.

Plotly provides multiple options for map visualizations, including scatter_mapbox and scatter_map. Some styles require Mapbox API tokens, while others use open-source map styles that do not require authentication. For our purposes, we can use scatter_map with an open-source style to visualize the rental listings without needing an API token.

fig = px.scatter_map(
    df,
    lat="latitude",
    lon="longitude",
    hover_name="title",
    hover_data={
        "price_display": True,
        "bedrooms": True,
        "bathrooms": True,
        "state": True,
    },
    zoom=3,
    height=600,
    template="simple_white",
    title="Listings location on a geographic map",
)

fig.show()

Parameters Explained

lat="latitude": Column name for the y-coordinate

  • Must be numeric

  • Values between roughly -44 and -10 for Australia

lon="longitude": Column name for the x-coordinate

  • Must be numeric

  • Values roughly 110-155 for Australia

hover_data={...}: Controls what additional information appears in the tooltip.

Example:

hover_data={
    "price_display": True,
    "bedrooms": True,
    "bathrooms": True,
    "state": True
}
  • True: include the variable

  • False: hide it

  • You can also format numbers (e.g., ":.0f")

zoom=3: Controls initial zoom level.

  • 2-3: whole Australia

  • 6-8: metro-level

  • 10+: suburb-level
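
The include/hide flags can be combined with number formatting. The dict below is an illustrative configuration, not taken from the notebook; Plotly Express interprets strings beginning with ":" as d3-format specs:

```python
# Illustrative hover_data configuration for px.scatter_map.
# Strings starting with ":" are d3-format specs (not Python format strings).
hover_data = {
    "price_display": ":$,.0f",  # format rent like $1,250
    "bedrooms": True,           # include as-is
    "latitude": False,          # hide raw coordinates from the tooltip
    "longitude": False,
}
```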

Preprocess

There are not many preprocessing steps needed for this dataset, since XGBoost can handle categorical variables natively (this requires XGBoost > 1.6). We will convert the categorical features to pd.CategoricalDtype to ensure they are treated as categorical by XGBoost.

numeric_features = ["bedrooms", "bathrooms", "parking_spaces", "latitude", "longitude"]
categorical_features = [
    "propertyType",
    "locality",
    "postcode",
    "state",
    "suburb",
    "agency_name",
]

for col in categorical_features:
    df[col] = df[col].astype("category")

We will split the data into training and testing sets, using an 80/20 split. This allows us to train the model on the majority of the data while reserving a portion for evaluating its performance on unseen data.

target = "price_display"

y = df[target]
X = df[numeric_features + categorical_features]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train:", X_train.shape, "Test:", X_test.shape)
Train: (5413, 11) Test: (1354, 11)

Create the Model

A “Minimal” Regressor with Default Hyperparameters

We first build a “minimal” regressor using XGBoost with default hyperparameters. This serves as a baseline model to evaluate the performance of our features without any tuning.

  • The tree_method="hist" parameter is required to enable native categorical support in XGBoost.

  • The enable_categorical=True parameter tells XGBoost to treat categorical features as such, allowing it to handle them more effectively without needing one-hot encoding or other preprocessing steps.

model_minimal = XGBRegressor(
    tree_method="hist",  # Required for categorical support
    enable_categorical=True,  # Enable native categorical handling
)

model_minimal.fit(X_train, y_train)

Evaluate the Model

y_minimal_pred = model_minimal.predict(X_test)

mae = mean_absolute_error(y_test, y_minimal_pred)
rmse = mean_squared_error(y_test, y_minimal_pred) ** 0.5
r2 = r2_score(y_test, y_minimal_pred)

print(f"MAE:  ${mae:,.2f} AUD/week")
print(f"RMSE: ${rmse:,.2f} AUD/week")
print(f"R²:   {r2:.3f}")
MAE:  $131.76 AUD/week
RMSE: $273.35 AUD/week
R²:   0.385

Interpret the Results

MAE: $131.76 AUD/week

Mean Absolute Error (MAE) tells us the average size of the prediction error, regardless of direction.

On average, the model’s predicted rent is off by about $132 per week.

This is the most business-friendly metric. For context:

  • If the median rent is around $650-$700/week,

  • A $132 error represents roughly 18-20% average deviation.

From a pricing engine perspective, this is moderate accuracy, but likely not production-grade yet.
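
That percentage can be checked directly; the $670 median is the figure quoted in the box-plot discussion above:

```python
mae = 131.76         # MAE from the evaluation above (AUD/week)
median_rent = 670.0  # median weekly rent from the box plot section

relative_error = mae / median_rent
print(f"MAE as a share of median rent: {relative_error:.1%}")  # → 19.7%
```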

RMSE: $273.35 AUD/week

Root Mean Squared Error (RMSE) penalizes large errors more heavily than MAE.

Notice:

  • RMSE ($273) is more than double MAE ($132)

This tells us something important:

The model makes some large errors, likely on high-end or luxury properties.

Because rent is right-skewed, extreme properties in the $2,000-$5,000 range inflate RMSE significantly.

This is very typical for housing price models.
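
The MAE/RMSE gap can be reproduced with a toy example: a single large miss (on a luxury listing, say) moves RMSE far more than MAE. The error values below are made up for illustration:

```python
import numpy as np

# Four typical misses plus one luxury-listing miss (illustrative numbers)
errors = np.array([50.0, 60.0, 55.0, 45.0, 2000.0])

mae = np.abs(errors).mean()           # → 442.0
rmse = np.sqrt((errors ** 2).mean())  # ≈ 896

print(f"MAE: {mae:.0f}, RMSE: {rmse:.0f}, ratio: {rmse / mae:.2f}")
```

One outlier error is enough to push RMSE to roughly double MAE, mirroring the ~2x ratio observed above.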

R²: 0.385

R² = 0.385 means the model explains 38.5% of the variation in rental prices.

In practical terms:

  • About 39% of rent variability is captured by the features.

  • About 61% remains unexplained.

For housing data, this suggests:

  • Location + property features matter

  • But important variables are missing (e.g., floor area, property condition, distance to CBD, school zones, etc.)

An R² around 0.35-0.50 is common in early-stage real estate modeling.

A “Tuned” Regressor

We will build a “tuned” regressor using XGBoost with hyperparameter tuning to optimize performance. This involves adjusting parameters such as learning rate, max depth, and number of estimators to find the best combination for our dataset.

Note that only one sample hyperparameter configuration is shown here for demonstration purposes. In practice, you would want to perform a more systematic search (e.g., grid search or random search) to find the optimal hyperparameters for your model.
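
One way to sketch such a random search is to draw candidate configurations with scikit-learn's ParameterSampler; each candidate would then be fitted and scored with cross-validation (e.g., via RandomizedSearchCV, which bundles sampling and CV together). The search space below is an assumption for illustration, not a tuned recommendation:

```python
from sklearn.model_selection import ParameterSampler

# Illustrative search space; the ranges are assumptions, not tuned values
param_distributions = {
    "n_estimators": [400, 800, 1200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [4, 6, 8],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
}

# Draw 10 random configurations; each would be cross-validated in practice
candidates = list(ParameterSampler(param_distributions, n_iter=10, random_state=42))
for cfg in candidates[:3]:
    print(cfg)
```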

model = XGBRegressor(
    n_estimators=1200,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    objective="reg:squarederror",
    tree_method="hist",  # REQUIRED
    enable_categorical=True,  # KEY PARAMETER
    random_state=42,
    n_jobs=-1,
)

model.fit(X_train, y_train)
| Parameter | Value | What It Controls | Effect on Model | Why It Matters for Rental Pricing |
|---|---|---|---|---|
| n_estimators | 1200 | Number of boosting rounds (trees) | More trees → more learning capacity | Captures complex geographic and feature interactions |
| learning_rate | 0.05 | Shrinks each tree's contribution | Lower = slower, more stable learning | Prevents overfitting while allowing many trees |
| max_depth | 6 | Maximum tree depth | Higher = more complex splits | Captures nonlinear rent effects (e.g., suburb × bedrooms) |
| subsample | 0.8 | Fraction of rows used per tree | Adds randomness | Improves generalization, reduces overfitting |
| colsample_bytree | 0.8 | Fraction of features used per tree | Adds randomness | Prevents reliance on a single dominant variable |
| reg_lambda | 1.0 | L2 regularization strength | Higher = simpler model | Controls extreme predictions |
| objective | "reg:squarederror" | Loss function | Minimizes squared error | Standard for regression problems |
| tree_method | "hist" | Tree construction algorithm | Faster, memory-efficient | Required for categorical handling |
| enable_categorical | True | Native categorical support | Learns category splits automatically | Avoids one-hot explosion for suburb/postcode |
| random_state | 42 | Random seed | Ensures reproducibility | Important for teaching & benchmarking |
| n_jobs | -1 | Number of CPU cores used | Faster training | No effect on accuracy |

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"MAE:  ${mae:,.2f} AUD/week")
print(f"RMSE: ${rmse:,.2f} AUD/week")
print(f"R²:   {r2:.3f}")
MAE:  $123.61 AUD/week
RMSE: $258.00 AUD/week
R²:   0.452
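
Putting the two models side by side (numbers copied from the outputs above):

```python
baseline = {"mae": 131.76, "rmse": 273.35, "r2": 0.385}  # minimal model
tuned = {"mae": 123.61, "rmse": 258.00, "r2": 0.452}     # tuned model

mae_gain = (baseline["mae"] - tuned["mae"]) / baseline["mae"]
rmse_gain = (baseline["rmse"] - tuned["rmse"]) / baseline["rmse"]

print(f"MAE improved by {mae_gain:.1%}, RMSE by {rmse_gain:.1%}")
# → MAE improved by 6.2%, RMSE by 5.6%
```

Tuning buys a modest single-digit improvement on every metric. As noted in the R² discussion, larger gains would more likely come from additional features such as floor area or distance to the CBD than from further hyperparameter search.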