Introduction to Polars
Polars is a fast DataFrame library for data wrangling and analytics. Like pandas, it lets you load, transform, join, group, and aggregate tabular data, but it is built on a different performance model:
- **Engine:** written in Rust (pandas is mostly Python with NumPy/C extensions).
- **Execution model:** supports lazy evaluation (build a query plan, then execute it efficiently) with query optimization; pandas is mostly eager (it does the work immediately).
- **Parallelism:** Polars is designed to use multiple CPU cores by default for many operations; pandas is often single-threaded for typical DataFrame operations.
- **Memory model:** Polars uses the Apache Arrow columnar memory format under the hood, which helps both speed and interoperability.
Pandas vs Polars¶
| Dimension | pandas | Polars |
|---|---|---|
| Primary language | Python (with C/NumPy under the hood) | Rust (Python bindings) |
| Execution model | Eager (operations run immediately) | Eager and Lazy (query planning + optimization) |
| Performance | Good for small-to-medium data; can slow down on large groupbys/joins | Often much faster, especially on large datasets |
| Parallelism | Limited; many ops are single-threaded | Built-in multi-threading by default |
| Memory format | NumPy-based blocks | Apache Arrow, columnar |
| Memory efficiency | Can be higher overhead, especially with object columns | Generally more memory-efficient |
| Data types | Flexible; object dtype common | Strict, explicit dtypes |
| Missing values | Uses NaN/None; nullable dtypes added later | Native null support via Arrow |
| Index | Central concept (powerful but sometimes confusing) | No index (explicit columns only) |
| API style | Imperative, step-by-step | Expression-based, declarative |
| Lazy evaluation | ❌ No | ✅ Yes |
| Query optimization | ❌ No | ✅ Yes |
| Streaming / out-of-core | Limited | Supported (especially with lazy mode) |
| String performance | Often slower (object strings) | Very fast (Arrow strings) |
| Time series support | Very mature | Solid and improving |
| Ecosystem support | Massive (default for ML, stats, viz) | Growing, but smaller |
| Learning curve | Low (widely taught) | Moderate (different mental model) |
| Interoperability | Native to most Python data tools | Easy conversion to/from pandas |
| Typical use cases | Data exploration, ML prep, teaching, quick analysis | ETL pipelines, large data, performance-critical workflows |
| Maturity | Very mature | Newer but rapidly evolving |
Using Polars¶
Installing Polars¶
To install Polars, use pip in your terminal or command prompt (in the environment where your Jupyter setup lives):

```
pip install polars
```

Then import Polars (together with `datetime`, used for the date values below) and create a DataFrame:

```python
import polars as pl
import datetime as dt

df_polars = pl.DataFrame(
    {
        "developer": [
            "Alice Chen",
            "Brian Patel",
            "Carlos Gomez",
            "Diana Nguyen",
        ],
        "hire_date": [
            dt.date(2019, 6, 1),
            dt.date(2020, 9, 15),
            dt.date(2018, 3, 22),
            dt.date(2021, 1, 10),
        ],
        "weekly_commits": [45, 30, 60, 25],
        "hours_worked": [40, 38, 45, 35],
    }
)
df_polars
```

Check the type of the `df_polars` object to confirm it is a Polars DataFrame:
```python
type(df_polars)
```

`polars.dataframe.frame.DataFrame`

Comparing Polars and Pandas DataFrames¶
```python
import pandas as pd
import datetime as dt

df_pandas = pd.DataFrame(
    {
        "developer": [
            "Alice Chen",
            "Brian Patel",
            "Carlos Gomez",
            "Diana Nguyen",
        ],
        "hire_date": [
            "2019-06-01",
            "2020-09-15",
            "2018-03-22",
            "2021-01-10",
        ],
        "weekly_commits": [45, 30, 60, 25],
        "hours_worked": [40, 38, 45, 35],
    }
)
df_pandas
```

```python
type(df_pandas)
```

`pandas.core.frame.DataFrame`

You'll notice that the Polars DataFrame is of type `polars.dataframe.frame.DataFrame`, while the pandas DataFrame is of type `pandas.core.frame.DataFrame`. Other than that, they look quite similar! (One subtle difference: the pandas `hire_date` column here holds strings, while the Polars column holds proper `date` values.)
However, Polars offers a more modern, expression-based API for data transformations. For example, here is an expression that calculates a productivity score for each developer from their weekly commits and hours worked:
```python
result = df_polars.select(
    pl.col("developer"),
    pl.col("hire_date").dt.year().alias("hire_year"),
    (pl.col("weekly_commits") / pl.col("hours_worked")).alias("productivity_score"),
)
result
```

Note that `select` returns only the listed columns. To add columns while keeping all the existing ones, use `with_columns` instead:
```python
result = df_polars.with_columns(
    pl.col("hire_date").dt.year().alias("hire_year"),
    (pl.col("weekly_commits") / pl.col("hours_worked")).alias("productivity_score"),
)
result
```

Filtering¶
Filtering in Polars can be done using expressions as well. For example, to filter developers with a productivity score greater than 1:
```python
result = df_polars.filter((pl.col("weekly_commits") / pl.col("hours_worked")) > 1)
result
```

You can also pass multiple expressions as separate arguments; they are combined with a logical AND, which is more convenient than chaining bitwise operators (`&`, `|`) as in pandas:
```python
result = df_polars.filter(
    pl.col("hire_date").is_between(dt.date(2018, 1, 1), dt.date(2019, 12, 31)),
    pl.col("hours_worked") >= 40,
)
result
```
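For comparison, here is a sketch of the equivalent pandas filter on an illustrative copy of the same data. Each condition must be parenthesized and combined with `&`, because plain `and`/`or` raise an error on pandas Series:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "developer": ["Alice Chen", "Brian Patel", "Carlos Gomez", "Diana Nguyen"],
        "hire_date": pd.to_datetime(
            ["2019-06-01", "2020-09-15", "2018-03-22", "2021-01-10"]
        ),
        "hours_worked": [40, 38, 45, 35],
    }
)

# Parenthesized conditions joined with the bitwise & operator.
mask = (df["hire_date"].between("2018-01-01", "2019-12-31")) & (
    df["hours_worked"] >= 40
)
result = df[mask]
print(result)
```

The Polars version reads as a flat list of conditions, while the pandas version forces you to manage operator precedence by hand.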