NLP Preprocessing
We will use a short excerpt from a corporate earnings call transcript (Microsoft Q2 2026) to demonstrate common NLP preprocessing steps.
# Excerpt from Satya Nadella's Opening Remarks
text = """
You can think of agents as the new apps, and to build, deploy, and manage agents,
customers will need a model catalog, tuning services, harness for orchestration,
services for context engineering, AI safety, management, observability, and security.
It starts with having broad model choice. Our customers expect to use multiple models
as part of any workload that they can fine-tune and optimize based on cost, latency,
and performance requirements. We offer the broadest selection of models of any
hyperscaler. This quarter, we added support for GPT-5 too, as well as Claude 4.5.
Already, over 1,500 customers have used both Anthropic and OpenAI models on Foundry.
We are seeing increasing demand for region-specific models, including Mistral and
Cohere, as more customers look for sovereign AI choices.
"""First, install spaCy and download the English language model:
If using pip:

pip install spacy

Then download the language model:

python -m spacy download en_core_web_sm

If using conda:

conda install -c conda-forge spacy

Then download the language model:

python -m spacy download en_core_web_sm

Common NLP Preprocessing Steps¶
Recall from the previous page that raw text often needs to be cleaned before analysis.
Common preprocessing steps include:
Lowercasing
Removing punctuation
Tokenization
Stopword removal
Stemming or lemmatization
spaCy is a powerful NLP library that provides built-in functions for many of these preprocessing steps. We will use it to demonstrate how to preprocess our conference call transcript excerpt.
Load the model.
import spacy
nlp = spacy.load("en_core_web_sm")Parse the Text as a spaCy Document¶
The first step is to parse the raw text into a spaCy Doc object.
doc = nlp(text)

Sentence Segmentation¶
spaCy can automatically split text into sentences.
for sent in doc.sents:
    print("---------")
    print(sent.text.strip())

---------
You can think of agents as the new apps, and to build, deploy, and manage agents,
customers will need a model catalog, tuning services, harness for orchestration,
services for context engineering, AI safety, management, observability, and security.
---------
It starts with having broad model choice.
---------
Our customers expect to use multiple models
as part of any workload that they can fine-tune and optimize based on cost, latency,
and performance requirements.
---------
We offer the broadest selection of models of any
hyperscaler.
---------
This quarter, we added support for GPT-5 too, as well as Claude 4.5.
---------
Already, over 1,500 customers have used both Anthropic and OpenAI models on Foundry.
---------
We are seeing increasing demand for region-specific models, including Mistral and
Cohere, as more customers look for sovereign AI choices.
Tokenization¶
Tokenization splits text into individual units called tokens.
for token in doc[:10]:
    print(token.text)
You
can
think
of
agents
as
the
new
apps
Token Attributes¶
spaCy provides built-in metadata about each token.
for token in doc[:10]:
    print(token.text, token.pos_, token.is_stop)
SPACE False
You PRON True
can AUX True
think VERB False
of ADP True
agents NOUN False
as ADP True
the DET True
new ADJ False
apps NOUN False
We can format this into a Pandas DataFrame.
import pandas as pd
data = [
{"token": token.text, "pos": token.pos_, "is_stop": token.is_stop}
for token in doc[:10]
]
df = pd.DataFrame(data)
df

Useful attributes include:
| Attribute | Description |
|---|---|
| token.pos_ | Part-of-speech tag |
| token.lemma_ | Base form of the word |
| token.is_stop | Whether the word is a stopword |
| token.is_punct | Whether it is punctuation |
| token.like_num | Whether it represents a number |
Lowercasing¶
spaCy tokens contain both original and normalized text.
tokens_lower = [token.text.lower() for token in doc if not token.is_punct]
tokens_lower[:10]

['\n', 'you', 'can', 'think', 'of', 'agents', 'as', 'the', 'new', 'apps']

Removing Stopwords and Punctuation¶
clean_tokens = [
    token.text.lower() for token in doc if not token.is_stop and not token.is_punct
]
clean_tokens[:10]

['\n',
'think',
'agents',
'new',
'apps',
'build',
'deploy',
'manage',
'agents',
'\n']

Lemmatization¶
Lemmatization reduces words to their base form.
lemmas = [
    token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct
]
lemmas[:10]

['\n',
'think',
'agent',
'new',
'app',
'build',
'deploy',
'manage',
'agent',
'\n']

Feature Extraction¶
spaCy can also extract more complex features like named entities, noun chunks, and syntactic dependencies.
Part-of-Speech Tagging¶
spaCy can identify grammatical roles.
for token in doc[:10]:
    print(token.text, token.pos_)
SPACE
You PRON
can AUX
think VERB
of ADP
agents NOUN
as ADP
the DET
new ADJ
apps NOUN
Named Entity Recognition¶
Named Entity Recognition (NER) identifies important entities in text.
for ent in doc.ents:
    print(ent.text, ent.label_)

AI GPE
This quarter DATE
Claude 4.5 LAW
over 1,500 CARDINAL
OpenAI ORG
Foundry ORG
Mistral ORG
Note that the small model mislabels some entities here (e.g. "AI" tagged as GPE and "Claude 4.5" as LAW); larger models tend to be more accurate. Common entity types:
| Label | Meaning |
|---|---|
| PERSON | Person name |
| ORG | Organization |
| GPE | Geopolitical entity (countries, cities, states) |
| DATE | Date |
| MONEY | Monetary value |
In the next page, we will see how to use these features for analysis.