Exploratory Data Analysis
Why EDA Matters Before Building a Predictive Model¶
Exploratory Data Analysis (EDA) is the process of understanding the data beyond just reading it. It helps identify:
the types of features (numeric, categorical),
data quality issues such as missing values or constant columns,
distributions and shapes of variables,
relationships and correlations between predictors and the target, and
unexpected patterns or anomalies that might affect modeling.
This phase sets the foundation for building reliable predictive models.
Understanding Feature Types¶
In our dataset:
Numeric features include weather measurements (e.g., temperature, snow_depth) and operational variables (e.g., total_items, zipcode). The only categorical variable is our target: classification_ontime.
Understanding these types helps us choose the right summaries and visualizations.
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go

This chapter uses the following package versions:
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
print("Plotly version:", plotly.__version__)

Pandas version: 3.0.0
NumPy version: 2.4.2
Plotly version: 6.5.2
Read the data.
df = pd.read_csv(
    "https://raw.githubusercontent.com/bdi593/datasets/main/amazon-on-time-delivery-data/on-time-delivery-data.csv"
)
df

Data Shape and Basic Information¶
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")

Number of rows: 45000, Number of columns: 16
print("Column names and data types:")
print(df.dtypes)

Column names and data types:
zipcode int64
total_items float64
precipitation_rate float64
water_runoff float64
snow_depth float64
temperature float64
temperature_at_1500m float64
min_temperature float64
max_temperature float64
pressure float64
wind_gust_speed float64
total_cloud_cover float64
dew_point_temperature float64
relative_humidity float64
wind_speed float64
classification_ontime str
dtype: object
First Few Rows¶
df.head()

Missing Values¶
missing_values = df.isnull().sum()
missing_values

zipcode 0
total_items 0
precipitation_rate 0
water_runoff 0
snow_depth 0
temperature 0
temperature_at_1500m 0
min_temperature 0
max_temperature 0
pressure 0
wind_gust_speed 0
total_cloud_cover 0
dew_point_temperature 0
relative_humidity 0
wind_speed 0
classification_ontime 0
dtype: int64

There are no missing values in the dataset.
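Our dataset has none, but had gaps been present, a common pattern is to impute numeric columns with the median and drop rows whose target is missing. A minimal sketch on a toy frame (the column names mirror ours; the values are made up):

```python
import pandas as pd
import numpy as np

# Toy frame with deliberate gaps (hypothetical values, not from our dataset)
toy = pd.DataFrame({
    "temperature": [1.5, np.nan, 3.0, 4.5],
    "classification_ontime": ["On time", "Delayed", None, "On time"],
})

# Numeric gap: impute with the column median; missing target: drop the row
toy["temperature"] = toy["temperature"].fillna(toy["temperature"].median())
toy = toy.dropna(subset=["classification_ontime"])

print(toy.isnull().sum().sum())  # 0 remaining missing values
```

Median imputation is a deliberately simple default; for a real pipeline the choice of strategy would depend on how much is missing and why.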
Statistical Summary¶
df.describe()

Target Variable Analysis¶
The target variable for our prediction model is classification_ontime. Let’s explore its distribution and characteristics.
print("Target Variable Distribution:")
print(df["classification_ontime"].value_counts())
print("\nTarget Variable Proportions:")
print(df["classification_ontime"].value_counts(normalize=True))

Target Variable Distribution:
classification_ontime
Delayed 26138
On time 18862
Name: count, dtype: int64
Target Variable Proportions:
classification_ontime
Delayed 0.580844
On time 0.419156
Name: proportion, dtype: float64
Visualize Target Variable Distribution¶
fig = px.pie(
    df,
    names="classification_ontime",
    title="Breakdown of classification_ontime",
    template="simple_white",
)
fig.update_traces(textposition="inside", textinfo="percent+label")
fig.show()

Feature Analysis¶
Numeric Features¶
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric features ({len(numeric_features)}):")
print(numeric_features)

Numeric features (15):
['zipcode', 'total_items', 'precipitation_rate', 'water_runoff', 'snow_depth', 'temperature', 'temperature_at_1500m', 'min_temperature', 'max_temperature', 'pressure', 'wind_gust_speed', 'total_cloud_cover', 'dew_point_temperature', 'relative_humidity', 'wind_speed']
Categorical Features¶
categorical_features = df.select_dtypes(include=["str"]).columns.tolist()
print(f"Categorical features ({len(categorical_features)}):")
print(categorical_features)

Categorical features (1):
['classification_ontime']
Numeric Features Distribution¶
if numeric_features:
    for feature in numeric_features:
        fig = px.histogram(
            df, x=feature, title=f"Distribution of {feature}", template="simple_white"
        )
        fig.show()

Categorical Features Distribution¶
for feature in categorical_features:
    fig = px.bar(
        df[feature].value_counts(),
        title=f"Distribution of {feature}",
        template="simple_white",
    )
    fig.show()

The target variable is not perfectly balanced:
Delayed: ~58%
On time: ~42%
Accuracy alone can therefore be misleading during model evaluation; metrics such as ROC-AUC, precision/recall, or F1 score will be more informative. The gap is not extreme, so we can train a model without special imbalance-handling techniques, but it is worth keeping in mind when interpreting results.
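To see why accuracy can mislead at a 58/42 split, consider a degenerate "model" that always predicts the majority class. A sketch on synthetic labels mirroring our proportions (scikit-learn is used here purely for illustration; it is not otherwise part of this chapter):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels mirroring the ~58/42 split (illustrative, not our real data)
rng = np.random.default_rng(0)
y_true = rng.choice(["Delayed", "On time"], size=1000, p=[0.58, 0.42])

# A "model" that always predicts the majority class
y_pred = np.full_like(y_true, "Delayed")

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, pos_label="On time", zero_division=0)
print("Accuracy:", acc)     # close to 0.58 despite the model being useless
print("F1 (On time):", f1)  # 0.0: the minority class is never predicted
```

The baseline's accuracy equals the majority-class proportion, which is why class-sensitive metrics are the better yardstick here.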
Correlation Analysis¶
Correlation matrices help us understand linear relationships between numeric features. Features with high correlation (e.g., temperature and dew point temperature) may provide overlapping information to models, whereas features with low correlation may provide unique predictive value.
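Beyond eyeballing a heatmap, highly correlated pairs can be pulled out programmatically. A sketch on synthetic data, where `b` is almost a linear function of `a` (much as dew point tracks temperature); with our data, `df[numeric_features]` would take the toy frame's place:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: b nearly duplicates a, c is independent noise
rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep the upper triangle only so each pair is listed once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)  # only the (a, b) pair should clear the 0.9 threshold
```

The 0.9 threshold is an arbitrary choice for the sketch; in practice the cutoff depends on how redundancy-sensitive the downstream model is.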
Numeric Features Correlation¶
if numeric_features:
    corr_matrix = df[numeric_features].corr()
    fig = go.Figure(
        data=go.Heatmap(
            z=corr_matrix.values, x=corr_matrix.columns, y=corr_matrix.columns
        )
    )
    fig.update_layout(
        title="Correlation Matrix of Numeric Features", template="simple_white"
    )
    fig.show()

Numeric Features by On-Time Status¶
Box plots compare each numeric feature's distribution across the two target classes.
if numeric_features:
    for feature in numeric_features:
        fig = px.box(
            df,
            x="classification_ontime",
            y=feature,
            title=f"{feature} by On-Time Status",
            template="simple_white",
        )
        fig.show()
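The box plots above give a visual comparison; a complementary numeric view of the same question is a per-class summary via groupby. A minimal sketch on a toy frame with hypothetical values (with our data, `df` and any numeric feature would be used instead):

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not from our dataset)
toy = pd.DataFrame({
    "classification_ontime": ["Delayed", "On time", "Delayed", "On time"],
    "temperature": [2.0, 6.0, 4.0, 8.0],
})

# Mean of a numeric feature within each class; a large gap between the two
# group means suggests the feature helps separate delayed from on-time orders
group_means = toy.groupby("classification_ontime")["temperature"].mean()
print(group_means)
```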