
NLP Preprocessing

We will use a short excerpt from a corporate earnings call transcript (Microsoft Q2 2026) to demonstrate common NLP preprocessing steps.

# Excerpt from Satya Nadella's Opening Remarks
text = """
You can think of agents as the new apps, and to build, deploy, and manage agents, 
customers will need a model catalog, tuning services, harness for orchestration, 
services for context engineering, AI safety, management, observability, and security. 
It starts with having broad model choice. Our customers expect to use multiple models 
as part of any workload that they can fine-tune and optimize based on cost, latency, 
and performance requirements. We offer the broadest selection of models of any 
hyperscaler. This quarter, we added support for GPT-5 too, as well as Claude 4.5. 
Already, over 1,500 customers have used both Anthropic and OpenAI models on Foundry. 
We are seeing increasing demand for region-specific models, including Mistral and 
Cohere, as more customers look for sovereign AI choices.
"""

First, install spaCy and download the English language model:

If using pip:

pip install spacy

Then download the language model:

python -m spacy download en_core_web_sm

If using conda:

conda install -c conda-forge spacy

Then download the language model:

python -m spacy download en_core_web_sm

Common NLP Preprocessing Steps

Recall from the previous page that raw text often needs to be cleaned before analysis.

Common preprocessing steps include:

  • Lowercasing

  • Removing punctuation

  • Tokenization

  • Stopword removal

  • Stemming or lemmatization

spaCy is a powerful NLP library that provides built-in functions for many of these preprocessing steps. We will use it to demonstrate how to preprocess our conference call transcript excerpt.

Load the model.

import spacy

nlp = spacy.load("en_core_web_sm")

Parse the Text as a spaCy Document

The first step is to parse the raw text into a spaCy Doc object.

doc = nlp(text)

Sentence Segmentation

spaCy can automatically split text into sentences.

for sent in doc.sents:
    print("---------")
    print(sent.text.strip())
---------
You can think of agents as the new apps, and to build, deploy, and manage agents, 
customers will need a model catalog, tuning services, harness for orchestration, 
services for context engineering, AI safety, management, observability, and security.
---------
It starts with having broad model choice.
---------
Our customers expect to use multiple models 
as part of any workload that they can fine-tune and optimize based on cost, latency, 
and performance requirements.
---------
We offer the broadest selection of models of any 
hyperscaler.
---------
This quarter, we added support for GPT-5 too, as well as Claude 4.5.
---------
Already, over 1,500 customers have used both Anthropic and OpenAI models on Foundry.
---------
We are seeing increasing demand for region-specific models, including Mistral and 
Cohere, as more customers look for sovereign AI choices.

Tokenization

Tokenization splits text into individual units called tokens.

for token in doc[:10]:
    print(token.text)


You
can
think
of
agents
as
the
new
apps
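
For comparison, naive whitespace splitting leaves punctuation glued to words, which is one reason a real tokenizer is worth the dependency. A minimal pure-Python sketch on the opening phrase of the excerpt:

```python
# Naive whitespace splitting keeps punctuation attached to words
snippet = "You can think of agents as the new apps, and to build, deploy, and manage agents."

naive_tokens = snippet.split()
print(naive_tokens[:10])
# → ['You', 'can', 'think', 'of', 'agents', 'as', 'the', 'new', 'apps,', 'and']
```

Note that `"apps,"` and `"agents."` keep their trailing punctuation, so a simple word count would treat `"agents"` and `"agents."` as different words; spaCy's tokenizer splits them apart.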

Token Attributes

spaCy provides built-in metadata about each token.

for token in doc[:10]:
    print(token.text, token.pos_, token.is_stop)

 SPACE False
You PRON True
can AUX True
think VERB False
of ADP True
agents NOUN False
as ADP True
the DET True
new ADJ False
apps NOUN False

We can format this into a Pandas DataFrame.

import pandas as pd

data = [
    {"token": token.text, "pos": token.pos_, "is_stop": token.is_stop}
    for token in doc[:10]
]

df = pd.DataFrame(data)
df

Useful attributes include:

| Attribute | Description |
| --- | --- |
| `token.pos_` | Part-of-speech tag |
| `token.lemma_` | Base form of the word |
| `token.is_stop` | Whether the word is a stopword |
| `token.is_punct` | Whether the token is punctuation |
| `token.like_num` | Whether the token represents a number |

Lowercasing

Each token keeps its original text in `token.text` (spaCy also exposes a pre-lowercased form as `token.lower_`). Here we lowercase manually while dropping punctuation:

tokens_lower = [token.text.lower() for token in doc if not token.is_punct]

tokens_lower[:10]
['\n', 'you', 'can', 'think', 'of', 'agents', 'as', 'the', 'new', 'apps']

Removing Stopwords and Punctuation

We can drop stopwords and punctuation in a single pass:

clean_tokens = [
    token.text.lower() for token in doc if not token.is_stop and not token.is_punct
]

clean_tokens[:10]
['\n', 'think', 'agents', 'new', 'apps', 'build', 'deploy', 'manage', 'agents', '\n']
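A common next step after cleaning is counting token frequencies. A minimal sketch using Python's `collections.Counter`, shown here on a small hand-copied slice of `clean_tokens` rather than the full list:

```python
from collections import Counter

# A few cleaned tokens copied from the output above
clean_tokens = ["think", "agents", "new", "apps", "build", "deploy", "manage", "agents"]

counts = Counter(clean_tokens)
print(counts.most_common(3))
# → [('agents', 2), ('think', 1), ('new', 1)]
```

This is exactly why stopword removal matters: without it, words like "the" and "and" would dominate the top of the frequency list.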

Lemmatization

Lemmatization reduces words to their base form.

lemmas = [
    token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct
]

lemmas[:10]
['\n', 'think', 'agent', 'new', 'app', 'build', 'deploy', 'manage', 'agent', '\n']
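Lemmatization shrinks the vocabulary because inflected variants collapse to a single base form. A minimal pure-Python sketch with a hypothetical token list in the style of the output above:

```python
# Surface forms vs. lemmas: singular and plural variants
# collapse to one vocabulary entry after lemmatization
surface = ["agent", "agents", "model", "models", "models"]
lemmas = ["agent", "agent", "model", "model", "model"]

print(len(set(surface)))  # → 4 distinct surface forms
print(len(set(lemmas)))   # → 2 distinct lemmas
```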

Feature Extraction

spaCy can also extract more complex features like named entities, noun chunks, and syntactic dependencies.

Part-of-Speech Tagging

spaCy can identify grammatical roles.

for token in doc[:10]:
    print(token.text, token.pos_)

 SPACE
You PRON
can AUX
think VERB
of ADP
agents NOUN
as ADP
the DET
new ADJ
apps NOUN

Named Entity Recognition

Named Entity Recognition (NER) identifies important entities in text.

for ent in doc.ents:
    print(ent.text, ent.label_)
AI GPE
This quarter DATE
Claude 4.5 LAW
over 1,500 CARDINAL
OpenAI ORG
Foundry ORG
Mistral ORG

Note that the small model makes mistakes here: it tags "AI" as a geopolitical entity and "Claude 4.5" as a law. Larger pipelines such as `en_core_web_trf` are generally more accurate.

Common entity types:

| Label | Meaning |
| --- | --- |
| `PERSON` | Person name |
| `ORG` | Organization |
| `GPE` | Geopolitical entity (country, city, state) |
| `DATE` | Date |
| `MONEY` | Monetary value |

On the next page, we will see how to use these features for analysis.