Exploratory Data Analysis

Why EDA Matters before Building a Predictive Model

Exploratory Data Analysis (EDA) is the process of examining a dataset's structure, quality, and relationships before modeling it. It helps identify:

  • the types of features (numeric, categorical),

  • data quality issues such as missing values or constant columns,

  • distributions and shapes of variables,

  • relationships and correlations between predictors and the target, and

  • unexpected patterns or anomalies that might affect modeling.

This phase sets the foundation for building reliable predictive models.

Understanding Feature Types

In our dataset:

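  • Numeric features include weather measurements (e.g., "temperature", "snow_depth") and operational variables (e.g., "total_items", "zipcode"). Note that "zipcode" is stored as an integer but identifies a location rather than measuring a quantity.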

  • The only categorical variable is our target: "classification_ontime".

Understanding these types helps us choose the right summaries and visualizations.
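
A storage type does not always match a feature's meaning. The sketch below uses a small made-up frame (the `toy` name and its values are illustrative, not taken from the dataset) to show how an integer-coded identifier could be recast as a categorical column if you decide to treat it as a label. The rest of this chapter keeps "zipcode" numeric, so this is an optional variation rather than a step the analysis depends on.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are invented.
toy = pd.DataFrame(
    {
        "zipcode": [98101, 98101, 98052, 98004],
        "temperature": [3.5, 4.1, 2.8, 5.0],
    }
)

# Recast the identifier so pandas stops treating it as a quantity.
toy["zipcode"] = toy["zipcode"].astype("category")

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(include="category").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

After the cast, summaries such as `describe()` report counts and the most frequent value for the identifier instead of a mean and standard deviation.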

import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go

This chapter uses the following package versions:

print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
print("Plotly version:", plotly.__version__)
Pandas version: 3.0.0
NumPy version: 2.4.2
Plotly version: 6.5.2

Read data.

df = pd.read_csv(
    "http://raw.githubusercontent.com/bdi593/datasets/main/amazon-on-time-delivery-data/on-time-delivery-data.csv"
)
df

Data Shape and Basic Information

print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")
Number of rows: 45000, Number of columns: 16
print("Column names and data types:")
print(df.dtypes)
Column names and data types:
zipcode                    int64
total_items              float64
precipitation_rate       float64
water_runoff             float64
snow_depth               float64
temperature              float64
temperature_at_1500m     float64
min_temperature          float64
max_temperature          float64
pressure                 float64
wind_gust_speed          float64
total_cloud_cover        float64
dew_point_temperature    float64
relative_humidity        float64
wind_speed               float64
classification_ontime        str
dtype: object

First Few Rows

df.head()

Missing Values

missing_values = df.isnull().sum()
missing_values
zipcode                  0
total_items              0
precipitation_rate       0
water_runoff             0
snow_depth               0
temperature              0
temperature_at_1500m     0
min_temperature          0
max_temperature          0
pressure                 0
wind_gust_speed          0
total_cloud_cover        0
dew_point_temperature    0
relative_humidity        0
wind_speed               0
classification_ontime    0
dtype: int64

There are no missing values in the dataset.
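
Missing values are only one of the quality issues listed at the start of the chapter; constant columns are another. As a minimal sketch, the hypothetical helper `quality_report` below (not part of the original notebook) combines both checks, and is demonstrated on an invented frame rather than the delivery data.

```python
import pandas as pd

# Hypothetical toy frame with one constant column and one missing value.
toy = pd.DataFrame(
    {
        "snow_depth": [0.0, 0.0, 0.0, 0.0],
        "temperature": [3.5, 4.1, None, 5.0],
        "pressure": [1012.0, 1009.5, 1011.2, 1010.8],
    }
)


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts, distinct-value counts, and a constant flag."""
    report = pd.DataFrame(
        {
            "missing": frame.isnull().sum(),
            "n_unique": frame.nunique(dropna=True),
        }
    )
    # A column with at most one distinct non-null value carries no signal.
    report["is_constant"] = report["n_unique"] <= 1
    return report


report = quality_report(toy)
print(report)
```

Calling `quality_report(df)` on the real data would show whether any of the 16 columns is constant; that result is not reported here because it was not part of the original output.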

Statistical Summary

df.describe()

Target Variable Analysis

The target variable for our prediction model is classification_ontime. Let’s explore its distribution and characteristics.

print("Target Variable Distribution:")
print(df["classification_ontime"].value_counts())

print("\nTarget Variable Proportions:")
print(df["classification_ontime"].value_counts(normalize=True))
Target Variable Distribution:
classification_ontime
Delayed    26138
On time    18862
Name: count, dtype: int64

Target Variable Proportions:
classification_ontime
Delayed    0.580844
On time    0.419156
Name: proportion, dtype: float64

Visualize Target Variable Distribution

# Count each class once and reuse the counts for the chart.
ontime_delivery_counts = df["classification_ontime"].value_counts()
fig = px.pie(
    names=ontime_delivery_counts.index,
    values=ontime_delivery_counts.values,
    title="Breakdown of classification_ontime",
    template="simple_white",
)
fig.update_traces(textposition="inside", textinfo="percent+label")
fig.show()

Feature Analysis

Numeric Features

numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric features ({len(numeric_features)}):")
print(numeric_features)
Numeric features (15):
['zipcode', 'total_items', 'precipitation_rate', 'water_runoff', 'snow_depth', 'temperature', 'temperature_at_1500m', 'min_temperature', 'max_temperature', 'pressure', 'wind_gust_speed', 'total_cloud_cover', 'dew_point_temperature', 'relative_humidity', 'wind_speed']

Categorical Features

categorical_features = df.select_dtypes(include=["str"]).columns.tolist()
print(f"Categorical features ({len(categorical_features)}):")
print(categorical_features)
Categorical features (1):
['classification_ontime']

Numeric Features Distribution

if numeric_features:
    for feature in numeric_features:
        fig = px.histogram(
            df, x=feature, title=f"Distribution of {feature}", template="simple_white"
        )
        fig.show()

Categorical Features Distribution

for feature in categorical_features:
    fig = px.bar(
        df[feature].value_counts(),
        title=f"Distribution of {feature}",
        template="simple_white",
    )
    fig.show()

The target variable is not perfectly balanced:

  • Delayed: ~58%

  • On time: ~42%

Because a model that always predicts "Delayed" would already reach about 58% accuracy, accuracy alone can be misleading during model evaluation; metrics such as ROC-AUC, precision/recall, or the F1 score are more informative. The gap is not extreme, so we can still train a model without special techniques for imbalance, but keep it in mind when interpreting results.
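
To make the accuracy caveat concrete, the short calculation below derives the majority-class baseline from the class counts printed earlier (26138 "Delayed", 18862 "On time"). The variable names are illustrative; only the counts come from the chapter.

```python
# Class counts copied from the value_counts() output above.
counts = {"Delayed": 26138, "On time": 18862}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier trained on this data should be judged against that baseline. The same always-"Delayed" rule has zero recall for the "On time" class, which is the kind of failure that precision/recall and F1 expose and plain accuracy hides.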

Correlation Analysis

Correlation matrices summarize pairwise linear relationships between numeric features (pandas' corr() computes the Pearson correlation by default). Strongly correlated features (e.g., temperature and dew point temperature) may provide overlapping information to models, whereas weakly correlated features are less redundant and more likely to contribute distinct information.
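
A heatmap is easy to scan, but a ranked list of strongly correlated pairs is often easier to act on. The sketch below is one way to extract such a list; the helper name `top_correlated_pairs`, the 0.8 threshold, and the synthetic data are all assumptions made for illustration and do not come from the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: dew point is built from temperature, wind speed is independent.
rng = np.random.default_rng(0)
temperature = rng.normal(10, 5, 200)
toy = pd.DataFrame(
    {
        "temperature": temperature,
        "dew_point_temperature": temperature - 2 + rng.normal(0, 0.5, 200),
        "wind_speed": rng.normal(4, 1, 200),
    }
)


def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # Comparisons with NaN are False, so masked entries are filtered out here too.
    return pairs[pairs >= threshold].sort_values(ascending=False)


pairs = top_correlated_pairs(toy)
print(pairs)
```

On the delivery data the call would be `top_correlated_pairs(df[numeric_features])`. The threshold is a judgment call: a lower value surfaces more candidate redundancies at the cost of a longer list to review.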

Numeric Features Correlation

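if numeric_features:
    # Pearson correlation between every pair of numeric features.
    corr_matrix = df[numeric_features].corr()
    fig = go.Figure(
        data=go.Heatmap(
            z=corr_matrix.values,
            x=corr_matrix.columns,
            y=corr_matrix.columns,
            zmin=-1,  # pin the color range to the full correlation scale
            zmax=1,
            colorscale="RdBu",  # diverging colors make the sign visible
        )
    )
    fig.update_layout(
        title="Correlation Matrix of Numeric Features", template="simple_white"
    )
    fig.show()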

Target Variable Relationships

Numeric Features vs Target Variable

if numeric_features:
    for feature in numeric_features:
        fig = px.box(
            df,
            x="classification_ontime",
            y=feature,
            title=f"{feature} by On-Time Status",
            template="simple_white",
        )
        fig.show()
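
Box plots show each feature by class one chart at a time. A compact numeric companion is to compare per-class medians in a single table. The sketch below does this on an invented frame; the values are made up and the `median_gap` name is illustrative, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are invented.
toy = pd.DataFrame(
    {
        "classification_ontime": [
            "Delayed", "Delayed", "On time", "On time", "Delayed", "On time",
        ],
        "snow_depth": [0.30, 0.25, 0.00, 0.05, 0.40, 0.02],
        "temperature": [-2.0, -1.0, 6.0, 5.0, -3.0, 4.0],
    }
)

# Median of every numeric feature within each target class.
medians = toy.groupby("classification_ontime").median(numeric_only=True)

# Difference between the class medians, largest absolute gap first.
median_gap = (medians.loc["Delayed"] - medians.loc["On time"]).sort_values(
    key=lambda s: s.abs(), ascending=False
)

print(medians)
print(median_gap)
```

Raw gaps are in each feature's own units, so they are not directly comparable across features; dividing by each feature's standard deviation or interquartile range is one way to put them on a common scale before ranking.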