Build a Regressor
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import Ridge
import xgboost
from xgboost import XGBRegressor
import plotly
import plotly.express as px
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"xgboost version: {xgboost.__version__}")
print(f"plotly version: {plotly.__version__}")
pandas version: 3.0.0
numpy version: 2.4.2
scikit-learn version: 1.8.0
xgboost version: 3.2.0
plotly version: 6.5.2
Load Data¶
df = pd.read_csv(
"https://github.com/bdi593/datasets/raw/refs/heads/main/australian-rental-market-2026/australian_rental_market_2026.csv"
)
df.head()
Explore Data¶
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 6767 entries, 0 to 6766
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 6767 non-null str
1 price_display 6767 non-null float64
2 description 6767 non-null str
3 propertyType 6767 non-null str
4 locality 6767 non-null str
5 latitude 6767 non-null float64
6 longitude 6767 non-null float64
7 postcode 6767 non-null int64
8 state 6767 non-null str
9 street_address 6767 non-null str
10 suburb 6767 non-null str
11 bathrooms 6767 non-null int64
12 bedrooms 6767 non-null int64
13 parking_spaces 6767 non-null int64
14 agency_name 6767 non-null str
15 amenities 6449 non-null str
dtypes: float64(3), int64(4), str(9)
memory usage: 846.0 KB
df.shape
(6767, 16)
df.isnull().sum()
title 0
price_display 0
description 0
propertyType 0
locality 0
latitude 0
longitude 0
postcode 0
state 0
street_address 0
suburb 0
bathrooms 0
bedrooms 0
parking_spaces 0
agency_name 0
amenities 318
dtype: int64
Target Column¶
The target column is price_display, because we are interested in building a prediction model for rental prices.
fig = px.histogram(
df,
x="price_display",
nbins=50,
title="Weekly rent distribution (AUD/week)",
template="simple_white",
)
fig.update_layout(xaxis_title="price_display", yaxis_title="count")
fig.show()
The distribution is positively skewed (right-skewed):
Most properties cluster between $400 and $900 per week.
A small number of properties extend far into the right tail, exceeding $2,000-$5,000 per week.
This long right tail indicates the presence of high-end luxury properties that are relatively rare but significantly more expensive than typical rentals.
fig = px.box(
df,
x="price_display",
title="Weekly rent distribution (AUD/week)",
template="simple_white",
)
fig.show()
The box plot confirms that weekly rental prices are strongly right-skewed. The median rent is $670 per week, with the middle 50% of properties (the interquartile range) spanning from about $550 to $800. This indicates that most rental listings fall within a relatively concentrated mid-market band. However, the right whisker extends much farther than the left, and there are numerous high-value outliers beyond $1,180 per week.
The large number of extreme points stretching out toward $3,000-$5,000 per week highlights a small but significant luxury segment. These outliers will influence the mean more than the median and can inflate error metrics like RMSE in predictive models. From a modeling perspective, this reinforces the need for robust approaches (e.g., gradient boosting) and possibly a log transformation of rent to better handle skewness and extreme values.
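As a small sketch of the log transformation mentioned above (the rent values here are hypothetical, chosen only to mimic the skew in this dataset), `np.log1p` compresses the long right tail while `np.expm1` inverts the transform exactly at prediction time:

```python
import numpy as np

# Illustrative weekly rents (hypothetical values), including one luxury outlier
rents = np.array([450.0, 670.0, 800.0, 4800.0])

# log1p compresses the long right tail; expm1 inverts the transform exactly
log_rents = np.log1p(rents)
recovered = np.expm1(log_rents)

# On the raw scale the outlier is ~7x the median; on the log scale
# the gap shrinks to well under 2x, so it pulls the loss far less
print(log_rents.round(2))
print(np.allclose(recovered, rents))  # True
```

With a log-transformed target you would fit on `np.log1p(y_train)` and apply `np.expm1` to the model's predictions before computing dollar-scale metrics.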
Listing Location¶
The dataset includes latitude and longitude coordinates for each rental listing, allowing us to visualize the geographic distribution of properties across Australia.
Latitude is the north-south position, while longitude is the east-west position. We can plot these coordinates as x and y axes to see a rough distribution of listings across the country. However, this simple scatter plot may not be the most effective way to visualize spatial data, especially if there are many listings clustered in certain areas.
# Drop rows missing coordinates or target for this plot
tmp = df.dropna(subset=["latitude", "longitude", "price_display"]).copy()
fig = px.scatter(
tmp,
x="longitude",
y="latitude",
opacity=0.4,
title="Listing locations",
template="simple_white",
)
fig.show()
A better way to visualize the geographic distribution of rental listings is to use a map-based scatter plot, which can provide more context and clarity. By plotting the latitude and longitude on a map, we can see where listings are concentrated across different cities and regions in Australia. This approach also allows us to overlay additional information, such as state boundaries, to better understand the spatial distribution of rentals.
Plotly provides multiple options for map visualizations, including scatter_mapbox and scatter_map. Some styles require Mapbox API tokens, while others use open-source map styles that do not require authentication. For our purposes, we can use scatter_map with an open-source style to visualize the rental listings without needing an API token.
fig = px.scatter_map(
df,
lat="latitude",
lon="longitude",
hover_name="title",
hover_data={
"price_display": True,
"bedrooms": True,
"bathrooms": True,
"state": True,
},
zoom=3,
height=600,
template="simple_white",
title="Listings location on a geographic map",
)
fig.show()
Parameters Explained¶
lat="latitude": Column name for the y-coordinate
Must be numeric
Values between roughly -44 and -10 for Australia
lon="longitude": Column name for the x-coordinate
Must be numeric
Values around 110-155 for Australia
hover_data={...}: Controls what additional information appears in the tooltip.
Example:
hover_data={
    "price_display": True,
    "bedrooms": True,
    "bathrooms": True,
    "state": True
}
True: include the variable
False: hide it
You can also format numbers (e.g., ":.0f")
zoom=3: Controls initial zoom level.
2-3: whole Australia
6-8: metro-level
10+: suburb-level
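As a hedged illustration of the ":.0f" option mentioned above (this assumes plotly express's documented hover_data behavior), the boolean values can be swapped for d3-style format strings:

```python
# hover_data values may be format strings instead of booleans:
# ":.0f" shows the rent with no decimals; ":.4f" keeps 4 decimal places
hover_data = {
    "price_display": ":.0f",
    "latitude": ":.4f",
    "longitude": ":.4f",
    "state": True,
}

# Python's format mini-language behaves the same way for these specs:
print(format(678.4, ".0f"))      # 678
print(format(151.20699, ".4f"))  # 151.2070
```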
Preprocess¶
There are not many preprocessing steps needed for this dataset, since XGBoost can handle categorical variables natively (this requires XGBoost > 1.6). We will convert the categorical features to pd.CategoricalDtype to ensure they are treated as categorical by XGBoost.
numeric_features = ["bedrooms", "bathrooms", "parking_spaces", "latitude", "longitude"]
categorical_features = [
"propertyType",
"locality",
"postcode",
"state",
"suburb",
"agency_name",
]
for col in categorical_features:
    df[col] = df[col].astype("category")
We will split the data into training and testing sets, using an 80/20 split. This allows us to train the model on the majority of the data while reserving a portion for evaluating its performance on unseen data.
target = "price_display"
y = df[target]
X = df[numeric_features + categorical_features]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("Train:", X_train.shape, "Test:", X_test.shape)
Train: (5413, 11) Test: (1354, 11)
Create the Model¶
A “Minimal” Regressor with Default Hyperparameters¶
We first build a “minimal” regressor using XGBoost with default hyperparameters. This serves as a baseline model to evaluate the performance of our features without any tuning.
The tree_method="hist" parameter is required to enable native categorical support in XGBoost. The enable_categorical=True parameter tells XGBoost to treat categorical features as such, allowing it to handle them more effectively without needing one-hot encoding or other preprocessing steps.
model_minimal = XGBRegressor(
tree_method="hist", # Required for categorical support
enable_categorical=True, # Enable native categorical handling
)
model_minimal.fit(X_train, y_train)
Evaluate the Model¶
y_minimal_pred = model_minimal.predict(X_test)
mae = mean_absolute_error(y_test, y_minimal_pred)
rmse = mean_squared_error(y_test, y_minimal_pred) ** 0.5
r2 = r2_score(y_test, y_minimal_pred)
print(f"MAE: ${mae:,.2f} AUD/week")
print(f"RMSE: ${rmse:,.2f} AUD/week")
print(f"R²: {r2:.3f}")
MAE: $131.76 AUD/week
RMSE: $273.35 AUD/week
R²: 0.385
Interpret the Results¶
MAE: $131.76 AUD/week¶
Mean Absolute Error (MAE) tells us the average size of the prediction error, regardless of direction.
On average, the model’s predicted rent is off by about $132 per week.
This is the most business-friendly metric. For context:
If the median rent is around $700/week,
A $132 error represents roughly 18-20% average deviation.
From a pricing engine perspective, this is moderate accuracy, but likely not production-grade yet.
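The "18-20%" figure above can be checked directly; a quick sketch using the numbers already reported in this notebook:

```python
mae = 131.76         # reported test-set MAE, AUD/week
median_rent = 670.0  # reported median weekly rent
print(f"{mae / median_rent:.1%}")  # 19.7%
```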
RMSE: $273.35 AUD/week¶
Root Mean Squared Error (RMSE) penalizes large errors more heavily than MAE.
Notice:
RMSE ($273) is more than double the MAE ($132).
This tells us something important:
The model makes some large errors, likely on high-end or luxury properties.
Because rent is right-skewed, extreme properties in the $3,000-$5,000 range inflate RMSE significantly.
This is very typical for housing price models.
R²: 0.385¶
R² = 0.385 means the model explains 38.5% of the variation in rental prices.
In practical terms:
About 39% of rent variability is captured by the features.
About 61% remains unexplained.
For housing data, this suggests:
Location + property features matter
But important variables are missing (e.g., floor area, property condition, distance to CBD, school zones, etc.)
An R² around 0.35-0.50 is common in early-stage real estate modeling.
A “Tuned” Regressor¶
We will build a “tuned” regressor using XGBoost with hyperparameter tuning to optimize performance. This involves adjusting parameters such as learning rate, max depth, and number of estimators to find the best combination for our dataset.
Note that only one sample hyperparameter configuration is shown here for demonstration purposes. In practice, you would want to perform a more systematic search (e.g., grid search or random search) to find the optimal hyperparameters for your model.
model = XGBRegressor(
n_estimators=1200,
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
reg_lambda=1.0,
objective="reg:squarederror",
tree_method="hist", # REQUIRED
enable_categorical=True, # KEY PARAMETER
random_state=42,
n_jobs=-1,
)
model.fit(X_train, y_train)

| Parameter | Value | What It Controls | Effect on Model | Why It Matters for Rental Pricing |
|---|---|---|---|---|
| n_estimators | 1200 | Number of boosting rounds (trees) | More trees → more learning capacity | Captures complex geographic and feature interactions |
| learning_rate | 0.05 | Shrinks each tree's contribution | Lower = slower, more stable learning | Prevents overfitting while allowing many trees |
| max_depth | 6 | Maximum tree depth | Higher = more complex splits | Captures nonlinear rent effects (e.g., suburb × bedrooms) |
| subsample | 0.8 | % of rows used per tree | Adds randomness | Improves generalization, reduces overfitting |
| colsample_bytree | 0.8 | % of features used per tree | Adds randomness | Prevents reliance on a single dominant variable |
| reg_lambda | 1.0 | L2 regularization strength | Higher = simpler model | Controls extreme predictions |
| objective | "reg:squarederror" | Loss function | Minimizes squared error | Standard for regression problems |
| tree_method | "hist" | Tree construction algorithm | Faster, memory-efficient | Required for categorical handling |
| enable_categorical | True | Native categorical support | Learns category splits automatically | Avoids one-hot explosion for suburb/postcode |
| random_state | 42 | Random seed | Ensures reproducibility | Important for teaching & benchmarking |
| n_jobs | -1 | Number of CPU cores used | Faster training | No effect on accuracy |
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"MAE: ${mae:,.2f} AUD/week")
print(f"RMSE: ${rmse:,.2f} AUD/week")
print(f"R²: {r2:.3f}")
MAE: $123.61 AUD/week
RMSE: $258.00 AUD/week
R²: 0.452
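Putting the two runs side by side (figures copied from the outputs above), the tuned model improves every metric:

```python
# Reported test-set metrics for the minimal vs. tuned XGBoost model
minimal = {"MAE": 131.76, "RMSE": 273.35, "R2": 0.385}
tuned = {"MAE": 123.61, "RMSE": 258.00, "R2": 0.452}

mae_gain = (minimal["MAE"] - tuned["MAE"]) / minimal["MAE"]
rmse_gain = (minimal["RMSE"] - tuned["RMSE"]) / minimal["RMSE"]
print(f"MAE down {mae_gain:.1%}, RMSE down {rmse_gain:.1%}")
# MAE down 6.2%, RMSE down 5.6%
```

A roughly 6% reduction in MAE from a single hand-picked configuration suggests a systematic search, plus features like floor area or distance to CBD, would likely push performance further.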