Welcome
Welcome to a hands-on NLP lesson.
You are going to build a working English stemmer from scratch — an algorithm that strips words down to their root form.
By the end, you will have a real, tested algorithm that transforms words like running → run, happiness → happi, and organization → organiza.
You will also write unit tests, integration tests, and functional tests for your stemmer — because an untested algorithm is just a guess.
What Is Stemming?
The problem
Search engines face a fundamental problem: a user searches for "running" but the document contains "run" or "runs" or "runner". These are all the same concept — but they are different strings.
Stemming reduces inflected words to a common base form (the stem). It does not need to be a real word — it just needs to be consistent.
| Word | Stem |
|------|------|
| running | run |
| runs | run |
| runner | runner |
| happiness | happi |
| happily | happi |
| happy | happi |
Notice that happi is not a real English word. That is fine. Stemming is about grouping, not meaning. As long as happiness, happily, and happy all collapse to the same stem, search and retrieval improve.
Zellig Harris and Distributional Analysis
The origin of computational stemming
In 1955, the linguist Zellig Harris published From Phoneme to Morpheme, describing a method for finding the boundaries between meaningful units (morphemes) in words.
His insight was distributional: if you look at a large corpus of English words, the boundary between a stem and a suffix shows up as a statistical signal.
The successor variety method
For any prefix of a word, count how many distinct characters follow it in the corpus. Harris called this the successor variety.
Consider the prefix "work" in a corpus containing: worked, worker, working, works, workshop.
| Prefix | What follows | Successor variety |
|--------|-------------|-------------------|
| w | o | 1 |
| wo | r | 1 |
| wor | k | 1 |
| work | e, i, s | 3 |
| worke | d, r | 2 |
After "work", three different characters can follow, a clear spike in variety. That spike marks a morpheme boundary: the stem is work and everything after it is a suffix.
This was radical in 1955. No linguistic rules, no dictionary — just counting. Harris showed that the structure of language reveals itself through distribution.
Understanding Successor Variety
Harris's method works on any language. You do not need to know the grammar — the statistics reveal the morpheme boundaries.
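The counting itself takes only a few lines. Here is a rough sketch (the corpus and the helper name `successor_variety` are illustrative, not part of Harris's original notation):

```python
def successor_variety(prefix, corpus):
    """Count the distinct characters that follow `prefix` across the corpus."""
    followers = set()
    for word in corpus:
        if word.startswith(prefix) and len(word) > len(prefix):
            followers.add(word[len(prefix)])
    return len(followers)

corpus = ['worked', 'worker', 'working', 'works', 'workshop']
for i in range(1, 6):
    prefix = 'worked'[:i]
    print(prefix, successor_variety(prefix, corpus))
# The count spikes at 'work' (3) while its neighbors score 1 and 2:
# that spike is the morpheme boundary.
```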
In practice, pure successor variety requires a large corpus and careful peak detection. Later researchers — Lovins (1968), Porter (1980) — simplified the approach into rule-based suffix stripping: instead of computing successor variety from a corpus, they encoded the suffix rules directly.
Today you will build a rule-based suffix stripper inspired by Harris's insight. You will define the suffixes explicitly, then strip them from words. This is how most production stemmers work.
Your First Suffix Stripper
Let's code
Start simple. Write a function called `stem` that takes a word and strips these suffixes (in this order):
1. -ing (running → runn)
2. -ed (walked → walk)
3. -ly (quickly → quick)
4. -s (cats → cat)
Rules:
- Convert the word to lowercase first
- Only strip one suffix (the first match in the order above)
- Only strip if the remaining stem is at least 3 characters long
- Return the stem
Example:
```python
def stem(word):
    word = word.lower()
    # your suffix stripping logic here
    return word
```
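If you get stuck, here is one way the stub could be filled in. This reflects one reading of the rules above (in particular, stopping at the first matching suffix even when the remainder is too short to strip); it is a sketch, not the only correct answer:

```python
def stem(word):
    word = word.lower()
    # Strip at most one suffix, checked in the order given above.
    for suffix in ('ing', 'ed', 'ly', 's'):
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            if len(candidate) >= 3:   # minimum stem length rule
                return candidate
            break  # first match only, even when it is too short to strip
    return word
```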
Handling Edge Cases
Making the stemmer smarter
Your basic stripper has a problem: running → runn and hoping → hop. We need two refinements:
1. Double consonant cleanup: if stripping -ing or -ed leaves a double consonant at the end (like runn), remove the last letter → run
2. Silent-e restoration: if stripping -ing leaves a stem ending in a consonant (not a vowel), and the original might have had a silent e (like hop from hoping), add e back → hope
For the silent-e rule, keep it simple: if after stripping -ing, the stem is 3+ chars, ends with a consonant, and the second-to-last char is a vowel (a pattern like hop, mak, tak), add e back.
Also add these new suffixes (check them before -ing, -ed, -ly, -s):
5. -tion (organization → organiza)
6. -ness (happiness → happi)
7. -ment (movement → move)
8. -able (readable → read)
9. -ible (sensible → sens)
Updated suffix priority: -tion, -ness, -ment, -able, -ible, -ing, -ed, -ly, -s
Keep the minimum stem length rule: only strip if the remaining stem is 3+ characters.
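Putting the refinements together, one possible shape for the updated `stem` looks like this (a sketch; the `VOWELS` constant is a name of my choosing, and the exact behavior when a stripped stem is too short is a design choice):

```python
VOWELS = set('aeiou')

def stem(word):
    word = word.lower()
    suffixes = ('tion', 'ness', 'ment', 'able', 'ible', 'ing', 'ed', 'ly', 's')
    for suffix in suffixes:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            if len(candidate) < 3:
                break  # minimum stem length: leave the word alone
            if suffix in ('ing', 'ed'):
                # Double consonant cleanup: runn -> run
                if candidate[-1] == candidate[-2] and candidate[-1] not in VOWELS:
                    candidate = candidate[:-1]
                # Silent-e restoration (only for -ing): hop -> hope
                elif (suffix == 'ing'
                        and candidate[-1] not in VOWELS
                        and candidate[-2] in VOWELS):
                    candidate += 'e'
            return candidate
    return word
```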
The -ies and -ier Rules
More morphology
English has another common pattern: words ending in -y change to -ies, -ied, or -ier when inflected.
| Word | Should stem to |
|------|---------------|
| babies | babi |
| carried | carri |
| earlier | earli |
| flies | fli |
| studied | studi |
Add these rules before the -s and -ed checks:
- -ies → strip and add i (babies → babi)
- -ied → strip and add i (carried → carri)
- -ier → strip and add i (earlier → earli)
Same minimum stem length rule: only transform if the result is 3+ characters.
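All three transformations strip a three-letter ending and put the i back, which is the same as dropping the last two characters. They can live in a small check that runs before the general suffix loop; a sketch (the helper name `stem_y_inflection` is mine):

```python
def stem_y_inflection(word):
    # -ies, -ied, -ier: strip the ending and restore the 'i',
    # which is equivalent to dropping the last two characters.
    for suffix in ('ies', 'ied', 'ier'):
        if word.endswith(suffix):
            if len(word) - 2 >= 3:   # minimum stem length rule
                return word[:-2]
            break
    return None  # no y-inflection rule applied
```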
Why Test?
Testing is not optional
You have a working stemmer. How do you know it actually works? Right now, you are running a few examples by hand. That does not scale.
Professional software uses three levels of testing:
Unit tests — test one function in isolation with known inputs and expected outputs. Fast, numerous, specific.
Integration tests — test that multiple components work together. For a stemmer, this means testing it against a batch of words and verifying the results are consistent.
Functional tests — test the system from the outside, as a user would. For a stemmer, this means feeding it real text and verifying the output makes sense for a real use case like search.
You will write all three.
Write Unit Tests
Unit tests
Write a function called `run_unit_tests` that tests your `stem` function with at least 15 test cases covering:
1. Basic suffix stripping — words ending in -ing, -ed, -ly, -s
2. Complex suffixes — -tion, -ness, -ment, -able, -ible
3. Y-inflection — -ies, -ied, -ier
4. Edge cases — short words that should not be stripped, words with no suffix, already-stemmed words
5. Double consonant cleanup — running → run, sitting → sit
6. Silent-e restoration — hoping → hope
7. Case insensitivity — uppercase input should be lowercased
Structure your tests like this:
```python
def run_unit_tests():
    tests = [
        ('running', 'run'),
        ('cats', 'cat'),
        # ... at least 15 test cases
    ]
    passed = 0
    failed = 0
    for word, expected in tests:
        result = stem(word)
        if result == expected:
            passed += 1
        else:
            failed += 1
            print(f'FAIL: stem({word}) = {result}, expected {expected}')
    print(f'{passed}/{passed + failed} unit tests passed')
    return failed == 0
```
Write Integration Tests
Integration tests
Unit tests verify individual inputs. Integration tests verify that components work together correctly.
For a stemmer, a key integration property is consistency: if you stem the same word twice, you get the same result. And words that should group together should produce the same stem.
Write a function called `run_integration_tests` that tests:
1. Idempotency — stemming an already-stemmed word should return the same stem. `stem(stem(word)) == stem(word)` for all words.
2. Grouping — words that should share a stem actually do. Test at least 3 word families (e.g., run/runs/running should all share a stem).
3. Batch processing — process a list of 20+ words and verify no crashes, no empty strings, no None values.
```python
def run_integration_tests():
    # Test 1: idempotency
    # Test 2: word family grouping
    # Test 3: batch stability
    ...
```
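One possible implementation of the skeleton above. The stand-in `stem` at the top exists only so this sketch runs on its own; replace it with the full stemmer you built earlier, and pick your own word lists:

```python
def stem(word):
    # Minimal stand-in stemmer; swap in your full stem() from earlier.
    word = word.lower()
    for suffix in ('ness', 'ing', 'ed', 'ly', 's'):
        if word.endswith(suffix):
            base = word[:-len(suffix)]
            if len(base) >= 3:
                if base[-1] == base[-2] and base[-1] not in 'aeiou':
                    base = base[:-1]
                return base
            break
    return word

def run_integration_tests():
    words = ['running', 'runs', 'walked', 'walking', 'cats',
             'hoping', 'babies', 'carried', 'studied', 'quickly',
             'jumps', 'jumped', 'jumping', 'happiness', 'happily',
             'flies', 'earlier', 'dogs', 'sitting', 'played']
    ok = True

    # Test 1: idempotency -- a stem should stem to itself
    for w in words:
        if stem(stem(w)) != stem(w):
            print(f'FAIL idempotency: {w}')
            ok = False

    # Test 2: word family grouping -- each family collapses to one stem
    families = [['run', 'runs', 'running'],
                ['jump', 'jumps', 'jumped', 'jumping'],
                ['walk', 'walked', 'walking']]
    for family in families:
        stems = {stem(w) for w in family}
        if len(stems) != 1:
            print(f'FAIL grouping: {family} -> {stems}')
            ok = False

    # Test 3: batch stability -- no crashes, empty strings, or Nones
    for w in words:
        s = stem(w)
        if not isinstance(s, str) or not s:
            print(f'FAIL batch: {w!r} -> {s!r}')
            ok = False

    return ok
```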
Write Functional Tests
Functional tests
Functional tests verify the system works for its intended use case. Your stemmer exists to improve search — so test that.
Write a function called `run_functional_tests` that:
1. Search simulation — given a list of document strings and a query word, stem both the documents and the query, then check if stemmed query terms appear in stemmed documents. Test that searching for 'running' finds documents containing 'run' or 'runs'.
2. Precision check — verify that stemming does NOT incorrectly group unrelated words. 'university' and 'universe' might share a stem — check if your stemmer handles this (it is OK if it groups them; document the behavior).
3. Real text processing — stem every word in a paragraph of real English text. Verify the output is reasonable: no empty strings, no crashes, output has the same number of words as input.
```python
def run_functional_tests():
    # Test 1: search finds related documents
    # Test 2: precision — check over-stemming
    # Test 3: real paragraph processing
    ...
```
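A possible version of the skeleton above. As before, the stand-in `stem` is just there so the sketch runs on its own, and the documents and paragraph are illustrative:

```python
def stem(word):
    # Minimal stand-in stemmer; swap in your full stem() from earlier.
    word = word.lower()
    for suffix in ('ness', 'ing', 'ed', 'ly', 's'):
        if word.endswith(suffix):
            base = word[:-len(suffix)]
            if len(base) >= 3:
                if base[-1] == base[-2] and base[-1] not in 'aeiou':
                    base = base[:-1]
                return base
            break
    return word

def run_functional_tests():
    ok = True

    # Test 1: search simulation -- stemmed query matches stemmed documents
    docs = ['I run every day', 'She runs marathons', 'A cat sat down']
    q = stem('running')
    hits = [d for d in docs if q in {stem(w) for w in d.split()}]
    if len(hits) != 2:
        print(f'FAIL search: expected 2 hits, got {hits}')
        ok = False

    # Test 2: precision -- document whether unrelated words collapse together
    if stem('university') == stem('universe'):
        print('NOTE: university and universe share a stem (over-stemming)')

    # Test 3: real text -- same word count, no empty outputs, no crashes
    text = 'The dogs were running quickly through the muddy fields'
    stems = [stem(w) for w in text.split()]
    if len(stems) != len(text.split()) or any(not s for s in stems):
        print('FAIL text processing')
        ok = False

    return ok
```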
What You Built
What you built
You implemented a working English stemmer with:
- 12 suffix rules (-tion, -ness, -ment, -able, -ible, -ies, -ied, -ier, -ing, -ed, -ly, -s)
- Double consonant cleanup
- Silent-e restoration
- Unit tests, integration tests, and functional tests
The lineage
Your stemmer descends from a line of work that started with Zellig Harris in 1955:
- Harris (1955) — Discovered that morpheme boundaries show up as statistical signals (successor variety)
- Lovins (1968) — First published stemming algorithm, 294 suffix rules
- Porter (1980) — Simplified to ~60 rules in 5 steps, became the standard for decades
- Snowball (2001) — Porter's framework generalized to multiple languages
- Your stemmer (today) — 12 rules, same core principle
What you could do next
- Implement the full Porter algorithm (it is well documented and a great exercise)
- Port your stemmer to C for a substantial speed improvement
- Build a simple search engine that uses your stemmer to index and query text files
- Compare your stemmer's output against NLTK's PorterStemmer to measure accuracy
The code you wrote today is the same fundamental operation that runs inside every search engine on the planet. Not bad for a day's work.