Exploratory Data Analysis
Why EDA Matters Before Building a Predictive Model¶
Exploratory Data Analysis (EDA) is the process of understanding the data beyond just reading it. It helps identify:
the types of features (numeric, categorical),
data quality issues such as missing values or constant columns,
distributions and shapes of variables,
relationships and correlations between predictors and the target, and
unexpected patterns or anomalies that might affect modeling.
This phase sets the foundation for building reliable predictive models.
Understanding Feature Types¶
In our dataset:
Numeric features include weather measurements (e.g., temperature, snow_depth) and operational variables (e.g., total_items, zipcode). The only categorical variable is our target: classification_ontime.
Understanding these types helps us choose the right summaries and visualizations.
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go

This chapter uses the following package versions:
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
print("Plotly version:", plotly.__version__)

Pandas version: 3.0.0
NumPy version: 2.4.2
Plotly version: 6.5.2
Read the data.
df = pd.read_csv(
    "https://raw.githubusercontent.com/bdi593/datasets/main/amazon-on-time-delivery-data/on-time-delivery-data.csv"
)
df

Data Shape and Basic Information¶
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")

Number of rows: 45000, Number of columns: 16
print("Column names and data types:")
print(df.dtypes)

Column names and data types:
zipcode int64
total_items float64
precipitation_rate float64
water_runoff float64
snow_depth float64
temperature float64
temperature_at_1500m float64
min_temperature float64
max_temperature float64
pressure float64
wind_gust_speed float64
total_cloud_cover float64
dew_point_temperature float64
relative_humidity float64
wind_speed float64
classification_ontime str
dtype: object
First Few Rows¶
df.head()

Missing Values¶
missing_values = df.isnull().sum()
missing_values

zipcode 0
total_items 0
precipitation_rate 0
water_runoff 0
snow_depth 0
temperature 0
temperature_at_1500m 0
min_temperature 0
max_temperature 0
pressure 0
wind_gust_speed 0
total_cloud_cover 0
dew_point_temperature 0
relative_humidity 0
wind_speed 0
classification_ontime 0
dtype: int64

There are no missing values in the dataset.
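Our dataset has none, but had gaps been present, a common pattern is to impute numeric columns with the median and drop rows whose target is missing. A minimal sketch on a toy frame (the column names mirror ours; the values are made up):

```python
import pandas as pd
import numpy as np

# Toy frame with deliberate gaps (hypothetical values, not from our dataset)
toy = pd.DataFrame({
    "temperature": [1.5, np.nan, 3.0, 4.5],
    "classification_ontime": ["On time", "Delayed", None, "On time"],
})

# Numeric gap: impute with the column median; missing target: drop the row
toy["temperature"] = toy["temperature"].fillna(toy["temperature"].median())
toy = toy.dropna(subset=["classification_ontime"])

print(toy.isnull().sum().sum())  # 0 remaining missing values
```

Median imputation is a deliberately simple default; for a real pipeline the choice of strategy would depend on how much is missing and why.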
Statistical Summary¶
df.describe()

Target Variable Analysis¶
The target variable for our prediction model is classification_ontime. Let’s explore its distribution and characteristics.
print("Target Variable Distribution:")
print(df["classification_ontime"].value_counts())
print("\nTarget Variable Proportions:")
print(df["classification_ontime"].value_counts(normalize=True))

Target Variable Distribution:
classification_ontime
Delayed 26138
On time 18862
Name: count, dtype: int64
Target Variable Proportions:
classification_ontime
Delayed 0.580844
On time 0.419156
Name: proportion, dtype: float64
Visualize Target Variable Distribution¶
fig = px.pie(
    df,
    names="classification_ontime",
    title="Breakdown of classification_ontime",
    template="simple_white",
)
fig.update_traces(textposition="inside", textinfo="percent+label")
fig.show()

Feature Analysis¶
Numeric Features¶
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric features ({len(numeric_features)}):")
print(numeric_features)

Numeric features (15):
['zipcode', 'total_items', 'precipitation_rate', 'water_runoff', 'snow_depth', 'temperature', 'temperature_at_1500m', 'min_temperature', 'max_temperature', 'pressure', 'wind_gust_speed', 'total_cloud_cover', 'dew_point_temperature', 'relative_humidity', 'wind_speed']
Categorical Features¶
categorical_features = df.select_dtypes(include=["str"]).columns.tolist()
print(f"Categorical features ({len(categorical_features)}):")
print(categorical_features)

Categorical features (1):
['classification_ontime']
Numeric Features Distribution¶
if numeric_features:
    for feature in numeric_features:
        fig = px.histogram(
            df, x=feature, title=f"Distribution of {feature}", template="simple_white"
        )
        fig.show()

Categorical Features Distribution¶
for feature in categorical_features:
    fig = px.bar(
        df[feature].value_counts(),
        title=f"Distribution of {feature}",
        template="simple_white",
    )
    fig.show()

The target variable is not perfectly balanced:
Delayed: ~58%
On time: ~42%
Accuracy alone can therefore be misleading during model evaluation; metrics such as ROC-AUC, precision/recall, or F1 score will be more informative. The gap is not extreme, so we can train a model without special imbalance-handling techniques, but it is worth keeping in mind when interpreting results.
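To see why accuracy can mislead at a 58/42 split, consider a degenerate "model" that always predicts the majority class. A sketch on synthetic labels mirroring our proportions (scikit-learn is used here purely for illustration; it is not otherwise part of this chapter):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels mirroring the ~58/42 split (illustrative, not our real data)
rng = np.random.default_rng(0)
y_true = rng.choice(["Delayed", "On time"], size=1000, p=[0.58, 0.42])

# A "model" that always predicts the majority class
y_pred = np.full_like(y_true, "Delayed")

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, pos_label="On time", zero_division=0)
print("Accuracy:", acc)     # close to 0.58 despite the model being useless
print("F1 (On time):", f1)  # 0.0: the minority class is never predicted
```

The baseline's accuracy equals the majority-class proportion, which is why class-sensitive metrics are the better yardstick here.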
Correlation Analysis¶
Correlation matrices help us understand linear relationships between numeric features. Features with high correlation (e.g., temperature and dew point temperature) may provide overlapping information to models, whereas features with low correlation may provide unique predictive value.
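Beyond eyeballing a heatmap, highly correlated pairs can be pulled out programmatically. A sketch on synthetic data, where `b` is almost a linear function of `a` (much as dew point tracks temperature); with our data, `df[numeric_features]` would take the toy frame's place:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: b nearly duplicates a, c is independent noise
rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep the upper triangle only so each pair is listed once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)  # only the (a, b) pair should clear the 0.9 threshold
```

The 0.9 threshold is an arbitrary choice for the sketch; in practice the cutoff depends on how redundancy-sensitive the downstream model is.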
Numeric Features Correlation¶
if numeric_features:
    corr_matrix = df[numeric_features].corr()
    fig = go.Figure(
        data=go.Heatmap(
            z=corr_matrix.values, x=corr_matrix.columns, y=corr_matrix.columns
        )
    )
    fig.update_layout(
        title="Correlation Matrix of Numeric Features", template="simple_white"
    )
    fig.show()

Numeric Features by On-Time Status¶
Box plots compare each numeric feature's distribution across the two target classes.
if numeric_features:
    for feature in numeric_features:
        fig = px.box(
            df,
            x="classification_ontime",
            y=feature,
            title=f"{feature} by On-Time Status",
            template="simple_white",
        )
        fig.show()
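The box plots above give a visual comparison; a complementary numeric view of the same question is a per-class summary via groupby. A minimal sketch on a toy frame with hypothetical values (with our data, `df` and any numeric feature would be used instead):

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not from our dataset)
toy = pd.DataFrame({
    "classification_ontime": ["Delayed", "On time", "Delayed", "On time"],
    "temperature": [2.0, 6.0, 4.0, 8.0],
})

# Mean of a numeric feature within each class; a large gap between the two
# group means suggests the feature helps separate delayed from on-time orders
group_means = toy.groupby("classification_ontime")["temperature"].mean()
print(group_means)
```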