Welcome

Welcome to a hands-on NLP lesson.

You are going to build a working English stemmer from scratch — an algorithm that strips words down to their root form.

By the end, you will have a real, tested algorithm that transforms words like running → run, happiness → happi, and organization → organiza.

You will also write unit tests, integration tests, and functional tests for your stemmer — because an untested algorithm is just a guess.

What Is Stemming?

The problem

Search engines face a fundamental problem: a user searches for "running" but the document contains "run" or "runs" or "runner". These are all the same concept — but they are different strings.


Stemming reduces inflected words to a common base form (the stem). It does not need to be a real word — it just needs to be consistent.


| Word | Stem |
|------|------|
| running | run |
| runs | run |
| runner | runner |
| happiness | happi |
| happily | happi |
| happy | happi |


Notice that happi is not a real English word. That is fine. Stemming is about grouping, not meaning. As long as happiness, happily, and happy all collapse to the same stem, search and retrieval improve.

In your own words, explain why a search engine that uses stemming would return better results than one that only matches exact strings. Give a concrete example.

Zellig Harris and Distributional Analysis

The origin of computational stemming

In 1955, the linguist Zellig Harris published From Phoneme to Morpheme, describing a method for finding the boundaries between meaningful units (morphemes) in words.


His insight was distributional: if you look at a large corpus of English words, the boundary between a stem and a suffix shows up as a statistical signal.


The successor variety method

For any prefix of a word, count how many distinct characters follow it in the corpus. Harris called this the successor variety.


Consider the prefix "work" in a corpus containing: worked, worker, working, works, workshop.


| Prefix | What follows | Successor variety |
|--------|--------------|-------------------|
| w | o | 1 |
| wo | r | 1 |
| wor | k | 1 |
| work | e, i, s | 3 |
| worke | d, r | 2 |

After "work", three different characters can follow — a spike in variety. That spike marks a morpheme boundary. The stem is work and everything after it is a suffix.


This was radical in 1955. No linguistic rules, no dictionary — just counting. Harris showed that the structure of language reveals itself through distribution.
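
Harris's counting procedure is small enough to sketch directly. Here is a minimal Python version (the function name `successor_variety` is ours, not a standard API):

```python
def successor_variety(corpus, prefix):
    """Count the distinct characters that follow `prefix` in the corpus."""
    followers = set()
    for word in corpus:
        if word.startswith(prefix) and len(word) > len(prefix):
            followers.add(word[len(prefix)])
    return len(followers)

corpus = ['worked', 'worker', 'working', 'works', 'workshop']
for prefix in ('w', 'wo', 'wor', 'work', 'worke'):
    print(prefix, successor_variety(corpus, prefix))
```

The variety spikes at "work" — exactly the boundary Harris was looking for.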

Understanding Successor Variety

Harris's method works on any language. You do not need to know the grammar — the statistics reveal the morpheme boundaries.


In practice, pure successor variety requires a large corpus and careful peak detection. Later researchers — Lovins (1968), Porter (1980) — simplified the approach into rule-based suffix stripping: instead of computing successor variety from a corpus, they encoded the suffix rules directly.


Today you will build a rule-based suffix stripper inspired by Harris's insight. You will define the suffixes explicitly, then strip them from words. This is how most production stemmers work.

What is the key insight of Harris's successor variety method? In other words, what statistical signal tells you where a morpheme boundary is?

Your First Suffix Stripper

Let's code

Start simple. Write a function called `stem` that takes a word and strips these suffixes (in this order):


1. -ing (running → runn)

2. -ed (walked → walk)

3. -ly (quickly → quick)

4. -s (cats → cat)


Rules:

- Convert the word to lowercase first

- Only strip one suffix (the first match in the order above)

- Only strip if the remaining stem is at least 3 characters long

- Return the stem


Example:

```
def stem(word):
    word = word.lower()
    # your suffix stripping logic here
    return word
```
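
If you want to check your work after attempting the exercise, one possible implementation of the basic stripper looks like this (a sketch, not the only correct answer):

```python
def stem(word):
    """Strip the first matching suffix, keeping a stem of 3+ characters."""
    word = word.lower()
    for suffix in ('ing', 'ed', 'ly', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(stem('running'))   # runn
print(stem('walked'))    # walk
print(stem('quickly'))   # quick
print(stem('cats'))      # cat
```

Note that short words like "sing" pass through unchanged: stripping -ing would leave only one character, below the 3-character minimum.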

Write the `stem` function. It should strip -ing, -ed, -ly, or -s (first match, in order) from the word, but only if the remaining stem has 3 or more characters. Test it by printing stem('running'), stem('walked'), stem('quickly'), and stem('cats').

Handling Edge Cases

Making the stemmer smarter

Your basic stripper has a problem: running → runn and hoping → hop. We need two refinements:


1. Double consonant cleanup: if stripping -ing or -ed leaves a double consonant at the end (like runn), remove the last letter → run

2. Silent-e restoration: if stripping -ing leaves a stem ending in a consonant (not a vowel), and the original might have had a silent e (like hop from hoping), add e back → hope


For the silent-e rule, keep it simple: if after stripping -ing, the stem is 3+ chars, ends with a consonant, and the second-to-last char is a vowel (a pattern like hop, mak, tak), add e back.


Also add these new suffixes (check them before -ing, -ed, -ly, -s):

5. -tion (organization → organiza)

6. -ness (happiness → happi)

7. -ment (movement → move)

8. -able (readable → read)

9. -ible (sensible → sens)


Updated suffix priority: -tion, -ness, -ment, -able, -ible, -ing, -ed, -ly, -s


Keep the minimum stem length rule: only strip if the remaining stem is 3+ characters.
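
The lesson leaves the ordering of the two refinements open; one reasonable choice, sketched below, is to apply the double consonant cleanup first and try silent-e restoration only when no double consonant was found:

```python
VOWELS = 'aeiou'
SUFFIXES = ('tion', 'ness', 'ment', 'able', 'ible', 'ing', 'ed', 'ly', 's')

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            result = word[:-len(suffix)]
            # double consonant cleanup: runn -> run
            if suffix in ('ing', 'ed') and result[-1] == result[-2] \
                    and result[-1] not in VOWELS:
                result = result[:-1]
            # silent-e restoration: hop -> hope
            elif suffix == 'ing' and result[-1] not in VOWELS \
                    and result[-2] in VOWELS:
                result += 'e'
            return result
    return word

print(stem('running'))       # run
print(stem('hoping'))        # hope
print(stem('happiness'))     # happi
print(stem('organization'))  # organiza
print(stem('readable'))      # read
```

The `elif` matters: without it, run (consonant-vowel-consonant) would wrongly become rune after the cleanup.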

Update your `stem` function with the new suffixes, double consonant cleanup, and silent-e restoration. Print results for: stem('running'), stem('hoping'), stem('happiness'), stem('organization'), stem('readable').

The -ies and -ier Rules

More morphology

English has another common pattern: words ending in -y change to -ies, -ied, or -ier when inflected.


| Word | Should stem to |
|------|----------------|
| babies | babi |
| carried | carri |
| earlier | earli |
| flies | fli |
| studied | studi |


Add these rules before the -s and -ed checks:

- -ies → strip and add i (babies → babi)

- -ied → strip and add i (carried → carri)

- -ier → strip and add i (earlier → earli)


Same minimum stem length rule: only transform if the result is 3+ characters.
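
The y-inflection rules can be isolated in a small helper (the name `apply_y_rules` is invented for illustration); your stem function would call it before the -ed and -s checks:

```python
def apply_y_rules(word):
    """-ies/-ied/-ier: strip the ending, restore the i (babies -> babi)."""
    for suffix in ('ies', 'ied', 'ier'):
        # the result keeps one extra character (the restored i)
        if word.endswith(suffix) and len(word) - len(suffix) + 1 >= 3:
            return word[:-len(suffix)] + 'i'
    return None  # no y-rule matched; fall through to the other suffixes

for w in ('babies', 'carried', 'earlier', 'flies', 'studied'):
    print(w, '->', apply_y_rules(w))
```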

Add the -ies, -ied, and -ier rules to your stem function. Print results for: stem('babies'), stem('carried'), stem('earlier'), stem('happiness'), stem('running').

Why Test?

Testing is not optional

You have a working stemmer. How do you know it actually works? Right now, you are running a few examples by hand. That does not scale.


Professional software uses three levels of testing:


Unit tests — test one function in isolation with known inputs and expected outputs. Fast, numerous, specific.


Integration tests — test that multiple components work together. For a stemmer, this means testing it against a batch of words and verifying the results are consistent.


Functional tests — test the system from the outside, as a user would. For a stemmer, this means feeding it real text and verifying the output makes sense for a real use case like search.


You will write all three.

Write Unit Tests

Unit tests

Write a function called `run_unit_tests` that tests your `stem` function with at least 15 test cases covering:


1. Basic suffix stripping — words ending in -ing, -ed, -ly, -s

2. Complex suffixes — -tion, -ness, -ment, -able, -ible

3. Y-inflection — -ies, -ied, -ier

4. Edge cases — short words that should not be stripped, words with no suffix, already-stemmed words

5. Double consonant cleanup — running → run, sitting → sit

6. Silent-e restoration — hoping → hope

7. Case insensitivity — uppercase input should be lowercased


Structure your tests like this:

```
def run_unit_tests():
    tests = [
        ('running', 'run'),
        ('cats', 'cat'),
        # ... at least 15 test cases
    ]
    passed = 0
    failed = 0
    for word, expected in tests:
        result = stem(word)
        if result == expected:
            passed += 1
        else:
            failed += 1
            print(f'FAIL: stem({word}) = {result}, expected {expected}')
    print(f'{passed}/{passed + failed} unit tests passed')
    return failed == 0
```

Include your complete `stem` function AND write `run_unit_tests` with at least 15 test cases covering all 7 categories above. Call `run_unit_tests()` at the end.

Write Integration Tests

Integration tests

Unit tests verify individual inputs. Integration tests verify that components work together correctly.


For a stemmer, a key integration property is consistency: if you stem the same word twice, you get the same result. And words that should group together should produce the same stem.


Write a function called `run_integration_tests` that tests:


1. Idempotency — stemming an already-stemmed word should return the same stem. `stem(stem(word)) == stem(word)` for all words.

2. Grouping — words that should share a stem actually do. Test at least 3 word families (e.g., run/runs/running should all share a stem — note that runner keeps its -er, per the rules you defined).

3. Batch processing — process a list of 20+ words and verify no crashes, no empty strings, no None values.


```
def run_integration_tests():
    # Test 1: idempotency
    # Test 2: word family grouping
    # Test 3: batch stability
    ...
```
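
Here is one possible shape for the three checks. It bundles a compact version of the stemmer so the snippet runs on its own — in your solution you would reuse your existing `stem` function instead:

```python
VOWELS = 'aeiou'

def stem(word):
    # compact version of the stemmer built in this lesson
    word = word.lower()
    for suf in ('ies', 'ied', 'ier'):
        if word.endswith(suf) and len(word) - len(suf) + 1 >= 3:
            return word[:-len(suf)] + 'i'
    for suf in ('tion', 'ness', 'ment', 'able', 'ible', 'ing', 'ed', 'ly', 's'):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            res = word[:-len(suf)]
            if suf in ('ing', 'ed') and res[-1] == res[-2] and res[-1] not in VOWELS:
                res = res[:-1]          # double consonant: runn -> run
            elif suf == 'ing' and res[-1] not in VOWELS and res[-2] in VOWELS:
                res += 'e'              # silent e: hop -> hope
            return res
    return word

def run_integration_tests():
    words = ['running', 'happiness', 'babies', 'organization', 'hoping',
             'quickly', 'movement', 'readable', 'earlier', 'cats',
             'run', 'walked', 'carried', 'flies', 'sitting',
             'studied', 'moves', 'carries', 'move', 'jumped']

    # Test 1: idempotency — stemming a stem changes nothing
    for w in words:
        assert stem(stem(w)) == stem(w), f'not idempotent: {w}'

    # Test 2: word families collapse to a single stem
    families = [['run', 'runs', 'running'],
                ['carried', 'carries'],
                ['move', 'moves', 'movement']]
    for family in families:
        stems = {stem(w) for w in family}
        assert len(stems) == 1, f'family did not group: {family} -> {stems}'

    # Test 3: batch stability — no crashes, empty strings, or Nones
    results = [stem(w) for w in words]
    assert all(isinstance(r, str) and r for r in results)

    print(f'{len(words)} words checked; integration tests passed')
    return True

run_integration_tests()
```

Picking the word list carefully matters: this simple stemmer is not idempotent on every English word (e.g. repeated -s stripping), so the batch above sticks to words whose stems are stable.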

Include your `stem` function and write `run_integration_tests` with all three test categories. Call it at the end.

Write Functional Tests

Functional tests

Functional tests verify the system works for its intended use case. Your stemmer exists to improve search — so test that.


Write a function called `run_functional_tests` that:


1. Search simulation — given a list of document strings and a query word, stem both the documents and the query, then check if stemmed query terms appear in stemmed documents. Test that searching for 'running' finds documents containing 'run' or 'runs'.

2. Precision check — verify that stemming does NOT incorrectly group unrelated words. 'university' and 'universe' might share a stem — check if your stemmer handles this (it is OK if it groups them; document the behavior).

3. Real text processing — stem every word in a paragraph of real English text. Verify the output is reasonable: no empty strings, no crashes, output has the same number of words as input.


```
def run_functional_tests():
    # Test 1: search finds related documents
    # Test 2: precision — check over-stemming
    # Test 3: real paragraph processing
    ...
```
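
A minimal sketch of the search simulation (Test 1) and the precision check (Test 2); the `search` helper and the sample documents are invented for illustration, and the compact stemmer is included only so the snippet is self-contained:

```python
VOWELS = 'aeiou'

def stem(word):
    # compact version of the stemmer built in this lesson
    word = word.lower()
    for suf in ('ies', 'ied', 'ier'):
        if word.endswith(suf) and len(word) - len(suf) + 1 >= 3:
            return word[:-len(suf)] + 'i'
    for suf in ('tion', 'ness', 'ment', 'able', 'ible', 'ing', 'ed', 'ly', 's'):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            res = word[:-len(suf)]
            if suf in ('ing', 'ed') and res[-1] == res[-2] and res[-1] not in VOWELS:
                res = res[:-1]
            elif suf == 'ing' and res[-1] not in VOWELS and res[-2] in VOWELS:
                res += 'e'
            return res
    return word

def search(query, documents):
    """Return indices of documents that share a stem with the query word."""
    q = stem(query)
    return [i for i, doc in enumerate(documents)
            if q in {stem(w) for w in doc.lower().split()}]

docs = ['the run was long', 'many runs happened', 'cats sleep all day']
print(search('running', docs))  # -> [0, 1]

# Precision spot-check: these stay distinct under our simple rules
print(stem('university'), stem('universe'))
```

An exact-string search for 'running' would find none of these documents; the stemmed search finds the first two.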

Include your `stem` function and write `run_functional_tests` with all three test categories. Call it at the end.

What You Built

What you built

You implemented a working English stemmer with:

- 12 suffix rules (-tion, -ness, -ment, -able, -ible, -ies, -ied, -ier, -ing, -ed, -ly, -s)

- Double consonant cleanup

- Silent-e restoration

- Unit tests, integration tests, and functional tests


The lineage

Your stemmer descends from a line of work that started with Zellig Harris in 1955:


- Harris (1955) — Discovered that morpheme boundaries show up as statistical signals (successor variety)

- Lovins (1968) — First published stemming algorithm, 294 suffix rules

- Porter (1980) — Simplified to ~60 rules in 5 steps, became the standard for decades

- Snowball (2001) — Porter's framework generalized to multiple languages

- Your stemmer (today) — 12 rules, same core principle


What you could do next

- Implement the full Porter algorithm (it is well documented and a great exercise)

- Port your stemmer to C for a 100x speed improvement

- Build a simple search engine that uses your stemmer to index and query text files

- Compare your stemmer's output against NLTK's PorterStemmer to measure accuracy


The code you wrote today is the same fundamental operation that runs inside every search engine on the planet. Not bad for a day's work.

Reflect on what you built. What was the most surprising thing you learned? If you were going to improve your stemmer, what would you add or change?