
NLP Preprocessing

We will use a short excerpt from a corporate earnings call transcript (Microsoft Q2 2026) to demonstrate common NLP preprocessing steps.

# Excerpt from Satya Nadella's Opening Remarks
text = """
You can think of agents as the new apps, and to build, deploy, and manage agents, 
customers will need a model catalog, tuning services, harness for orchestration, 
services for context engineering, AI safety, management, observability, and security. 
It starts with having broad model choice. Our customers expect to use multiple models 
as part of any workload that they can fine-tune and optimize based on cost, latency, 
and performance requirements. We offer the broadest selection of models of any 
hyperscaler. This quarter, we added support for GPT-5 too, as well as Claude 4.5. 
Already, over 1,500 customers have used both Anthropic and OpenAI models on Foundry. 
We are seeing increasing demand for region-specific models, including Mistral and 
Cohere, as more customers look for sovereign AI choices.
"""

First, install spaCy and download the English language model:

If using pip:

pip install spacy

Then download the language model:

python -m spacy download en_core_web_sm

If using conda:

conda install -c conda-forge spacy

Then download the language model:

python -m spacy download en_core_web_sm

Common NLP Preprocessing Steps

Recall from the previous page that raw text often needs to be cleaned before analysis.

Common preprocessing steps include:

  • Lowercasing

  • Removing punctuation

  • Tokenization

  • Stopword removal

  • Stemming or lemmatization

spaCy is a powerful NLP library that provides built-in functions for many of these preprocessing steps. We will use it to demonstrate how to preprocess our conference call transcript excerpt.

Load the model.

import spacy

nlp = spacy.load("en_core_web_sm")

Parse the Text as a spaCy Document

The first step is to parse the raw text into a spaCy Doc object.

doc = nlp(text)

Sentence Segmentation

spaCy can automatically split text into sentences.

for sent in doc.sents:
    print("---------")
    print(sent.text.strip())
---------
You can think of agents as the new apps, and to build, deploy, and manage agents, 
customers will need a model catalog, tuning services, harness for orchestration, 
services for context engineering, AI safety, management, observability, and security.
---------
It starts with having broad model choice.
---------
Our customers expect to use multiple models 
as part of any workload that they can fine-tune and optimize based on cost, latency, 
and performance requirements.
---------
We offer the broadest selection of models of any 
hyperscaler.
---------
This quarter, we added support for GPT-5 too, as well as Claude 4.5.
---------
Already, over 1,500 customers have used both Anthropic and OpenAI models on Foundry.
---------
We are seeing increasing demand for region-specific models, including Mistral and 
Cohere, as more customers look for sovereign AI choices.

Tokenization

Tokenization splits text into individual units called tokens.

for token in doc[:10]:
    print(token.text)


You
can
think
of
agents
as
the
new
apps
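
For comparison, naive whitespace splitting leaves punctuation glued to words, which is one reason a real tokenizer is worth the dependency. A minimal pure-Python sketch on the opening phrase of the excerpt:

```python
# Naive whitespace splitting keeps punctuation attached to words
snippet = "You can think of agents as the new apps, and to build, deploy, and manage agents."

naive_tokens = snippet.split()
print(naive_tokens[:10])
# → ['You', 'can', 'think', 'of', 'agents', 'as', 'the', 'new', 'apps,', 'and']
```

Note that `"apps,"` and `"agents."` keep their trailing punctuation, so a simple word count would treat `"agents"` and `"agents."` as different words; spaCy's tokenizer splits them apart.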

Token Attributes

spaCy provides built-in metadata about each token.

for token in doc[:10]:
    print(token.text, token.pos_, token.is_stop)

 SPACE False
You PRON True
can AUX True
think VERB False
of ADP True
agents NOUN False
as ADP True
the DET True
new ADJ False
apps NOUN False

We can format this into a Pandas DataFrame.

import pandas as pd

data = [
    {"token": token.text, "pos": token.pos_, "is_stop": token.is_stop}
    for token in doc[:10]
]

df = pd.DataFrame(data)
df

Useful attributes include:

| Attribute | Description |
| --- | --- |
| `token.pos_` | Part-of-speech tag |
| `token.lemma_` | Base form of the word |
| `token.is_stop` | Whether the word is a stopword |
| `token.is_punct` | Whether the token is punctuation |
| `token.like_num` | Whether the token represents a number |

Lowercasing

Each token keeps its original text in `token.text` (spaCy also exposes a pre-lowercased form as `token.lower_`). Here we lowercase manually while dropping punctuation:

tokens_lower = [token.text.lower() for token in doc if not token.is_punct]

tokens_lower[:10]
['\n', 'you', 'can', 'think', 'of', 'agents', 'as', 'the', 'new', 'apps']

Removing Stopwords and Punctuation

We can drop stopwords and punctuation in a single pass:

clean_tokens = [
    token.text.lower() for token in doc if not token.is_stop and not token.is_punct
]

clean_tokens[:10]
['\n', 'think', 'agents', 'new', 'apps', 'build', 'deploy', 'manage', 'agents', '\n']
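A common next step after cleaning is counting token frequencies. A minimal sketch using Python's `collections.Counter`, shown here on a small hand-copied slice of `clean_tokens` rather than the full list:

```python
from collections import Counter

# A few cleaned tokens copied from the output above
clean_tokens = ["think", "agents", "new", "apps", "build", "deploy", "manage", "agents"]

counts = Counter(clean_tokens)
print(counts.most_common(3))
# → [('agents', 2), ('think', 1), ('new', 1)]
```

This is exactly why stopword removal matters: without it, words like "the" and "and" would dominate the top of the frequency list.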

Lemmatization

Lemmatization reduces words to their base form.

lemmas = [
    token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct
]

lemmas[:10]
['\n', 'think', 'agent', 'new', 'app', 'build', 'deploy', 'manage', 'agent', '\n']
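Lemmatization shrinks the vocabulary because inflected variants collapse to a single base form. A minimal pure-Python sketch with a hypothetical token list in the style of the output above:

```python
# Surface forms vs. lemmas: singular and plural variants
# collapse to one vocabulary entry after lemmatization
surface = ["agent", "agents", "model", "models", "models"]
lemmas = ["agent", "agent", "model", "model", "model"]

print(len(set(surface)))  # → 4 distinct surface forms
print(len(set(lemmas)))   # → 2 distinct lemmas
```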

Feature Extraction

spaCy can also extract more complex features like named entities, noun chunks, and syntactic dependencies.

Part-of-Speech Tagging

spaCy can identify grammatical roles.

for token in doc[:10]:
    print(token.text, token.pos_)

 SPACE
You PRON
can AUX
think VERB
of ADP
agents NOUN
as ADP
the DET
new ADJ
apps NOUN

Named Entity Recognition

Named Entity Recognition (NER) identifies important entities in text.

for ent in doc.ents:
    print(ent.text, ent.label_)
AI GPE
This quarter DATE
Claude 4.5 LAW
over 1,500 CARDINAL
OpenAI ORG
Foundry ORG
Mistral ORG

Note that the small model makes mistakes here: it tags "AI" as a geopolitical entity and "Claude 4.5" as a law. Larger pipelines such as `en_core_web_trf` are generally more accurate.

Common entity types:

| Label | Meaning |
| --- | --- |
| `PERSON` | Person name |
| `ORG` | Organization |
| `GPE` | Geopolitical entity (country, city, state) |
| `DATE` | Date |
| `MONEY` | Monetary value |

On the next page, we will see how to use these features for analysis.