Welcome

Welcome to a hands-on NLP lesson.

You are going to build a working English stemmer from scratch — an algorithm that strips words down to their root form.

By the end, you will have a real, tested algorithm that transforms words like running → run, happiness → happi, and organization → organiza.

You will also write unit tests, integration tests, and functional tests for your stemmer — because an untested algorithm is just a guess.

What Is Stemming?

The problem

Search engines face a fundamental problem: a user searches for "running" but the document contains "run" or "runs" or "runner". These are all the same concept — but they are different strings.


Stemming reduces inflected words to a common base form (the stem). It does not need to be a real word — it just needs to be consistent.


| Word | Stem |
|------|------|
| running | run |
| runs | run |
| runner | runner |
| happiness | happi |
| happily | happi |
| happy | happi |


Notice that happi is not a real English word. That is fine. Stemming is about grouping, not meaning. As long as happiness, happily, and happy all collapse to the same stem, search and retrieval improve.

In your own words, explain why a search engine that uses stemming would return better results than one that only matches exact strings. Give a concrete example.

Zellig Harris and Distributional Analysis

The origin of computational stemming

In 1955, the linguist Zellig Harris published From Phoneme to Morpheme, describing a method for finding the boundaries between meaningful units (morphemes) in words.


His insight was distributional: if you look at a large corpus of English words, the boundary between a stem and a suffix shows up as a statistical signal.


The successor variety method

For any prefix of a word, count how many distinct characters follow it in the corpus. Harris called this the successor variety.


Consider the prefix "work" in a corpus containing: worked, worker, working, works, workshop.


| Prefix | What follows | Successor variety |
|--------|--------------|-------------------|
| w | o | 1 |
| wo | r | 1 |
| wor | k | 1 |
| work | e, i, s | 3 |
| worke | d, r | 2 |

After "work", three different characters can follow — a spike in variety. That spike marks a morpheme boundary. The stem is work and everything after it is a suffix.


This was radical in 1955. No linguistic rules, no dictionary — just counting. Harris showed that the structure of language reveals itself through distribution.
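
Harris's counting procedure is small enough to sketch directly. Here is a minimal Python version (the function name `successor_variety` is ours, not a standard API):

```python
def successor_variety(corpus, prefix):
    """Count the distinct characters that follow `prefix` in the corpus."""
    followers = set()
    for word in corpus:
        if word.startswith(prefix) and len(word) > len(prefix):
            followers.add(word[len(prefix)])
    return len(followers)

corpus = ['worked', 'worker', 'working', 'works', 'workshop']
for prefix in ('w', 'wo', 'wor', 'work', 'worke'):
    print(prefix, successor_variety(corpus, prefix))
```

The variety spikes at "work" — exactly the boundary Harris was looking for.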

Understanding Successor Variety

Harris's method works on any language. You do not need to know the grammar — the statistics reveal the morpheme boundaries.


In practice, pure successor variety requires a large corpus and careful peak detection. Later researchers — Lovins (1968), Porter (1980) — simplified the approach into rule-based suffix stripping: instead of computing successor variety from a corpus, they encoded the suffix rules directly.


Today you will build a rule-based suffix stripper inspired by Harris's insight. You will define the suffixes explicitly, then strip them from words. This is how most production stemmers work.

What is the key insight of Harris's successor variety method? In other words, what statistical signal tells you where a morpheme boundary is?

Your First Suffix Stripper

Let's code

Start simple. Write a function called `stem` that takes a word and strips these suffixes (in this order):


1. -ing (running → runn)

2. -ed (walked → walk)

3. -ly (quickly → quick)

4. -s (cats → cat)


Rules:

- Convert the word to lowercase first

- Only strip one suffix (the first match in the order above)

- Only strip if the remaining stem is at least 3 characters long

- Return the stem


Example:

```
def stem(word):
    word = word.lower()
    # your suffix stripping logic here
    return word
```
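
If you want to check your work after attempting the exercise, one possible implementation of the basic stripper looks like this (a sketch, not the only correct answer):

```python
def stem(word):
    """Strip the first matching suffix, keeping a stem of 3+ characters."""
    word = word.lower()
    for suffix in ('ing', 'ed', 'ly', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(stem('running'))   # runn
print(stem('walked'))    # walk
print(stem('quickly'))   # quick
print(stem('cats'))      # cat
```

Note that short words like "sing" pass through unchanged: stripping -ing would leave only one character, below the 3-character minimum.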

Write the `stem` function. It should strip -ing, -ed, -ly, or -s (first match, in order) from the word, but only if the remaining stem has 3 or more characters. Test it by printing stem('running'), stem('walked'), stem('quickly'), and stem('cats').

Handling Edge Cases

Making the stemmer smarter

Your basic stripper has a problem: running → runn and hoping → hop. We need two refinements:


1. Double consonant cleanup: if stripping -ing or -ed leaves a double consonant at the end (like runn), remove the last letter → run

2. Silent-e restoration: if stripping -ing leaves a stem ending in a consonant (not a vowel), and the original might have had a silent e (like hop from hoping), add e back → hope


For the silent-e rule, keep it simple: if after stripping -ing, the stem is 3+ chars, ends with a consonant, and the second-to-last char is a vowel (a pattern like hop, mak, tak), add e back.


Also add these new suffixes (check them before -ing, -ed, -ly, -s):

5. -tion (organization → organiza)

6. -ness (happiness → happi)

7. -ment (movement → move)

8. -able (readable → read)

9. -ible (sensible → sens)


Updated suffix priority: -tion, -ness, -ment, -able, -ible, -ing, -ed, -ly, -s


Keep the minimum stem length rule: only strip if the remaining stem is 3+ characters.
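
The lesson leaves the ordering of the two refinements open; one reasonable choice, sketched below, is to apply the double consonant cleanup first and try silent-e restoration only when no double consonant was found:

```python
VOWELS = 'aeiou'
SUFFIXES = ('tion', 'ness', 'ment', 'able', 'ible', 'ing', 'ed', 'ly', 's')

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            result = word[:-len(suffix)]
            # double consonant cleanup: runn -> run
            if suffix in ('ing', 'ed') and result[-1] == result[-2] \
                    and result[-1] not in VOWELS:
                result = result[:-1]
            # silent-e restoration: hop -> hope
            elif suffix == 'ing' and result[-1] not in VOWELS \
                    and result[-2] in VOWELS:
                result += 'e'
            return result
    return word

print(stem('running'))       # run
print(stem('hoping'))        # hope
print(stem('happiness'))     # happi
print(stem('organization'))  # organiza
print(stem('readable'))      # read
```

The `elif` matters: without it, run (consonant-vowel-consonant) would wrongly become rune after the cleanup.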

Update your `stem` function with the new suffixes, double consonant cleanup, and silent-e restoration. Print results for: stem('running'), stem('hoping'), stem('happiness'), stem('organization'), stem('readable').

The -ies and -ier Rules

More morphology

English has another common pattern: words ending in -y change to -ies, -ied, or -ier when inflected.


| Word | Should stem to |
|------|----------------|
| babies | babi |
| carried | carri |
| earlier | earli |
| flies | fli |
| studied | studi |


Add these rules before the -s and -ed checks:

- -ies → strip and add i (babies → babi)

- -ied → strip and add i (carried → carri)

- -ier → strip and add i (earlier → earli)


Same minimum stem length rule: only transform if the result is 3+ characters.
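
The y-inflection rules can be isolated in a small helper (the name `apply_y_rules` is invented for illustration); your stem function would call it before the -ed and -s checks:

```python
def apply_y_rules(word):
    """-ies/-ied/-ier: strip the ending, restore the i (babies -> babi)."""
    for suffix in ('ies', 'ied', 'ier'):
        # the result keeps one extra character (the restored i)
        if word.endswith(suffix) and len(word) - len(suffix) + 1 >= 3:
            return word[:-len(suffix)] + 'i'
    return None  # no y-rule matched; fall through to the other suffixes

for w in ('babies', 'carried', 'earlier', 'flies', 'studied'):
    print(w, '->', apply_y_rules(w))
```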

Add the -ies, -ied, and -ier rules to your stem function. Print results for: stem('babies'), stem('carried'), stem('earlier'), stem('happiness'), stem('running').

Why Test?

Testing is not optional

You have a working stemmer. How do you know it actually works? Right now, you are running a few examples by hand. That does not scale.


Professional software uses three levels of testing:


Unit tests — test one function in isolation with known inputs and expected outputs. Fast, numerous, specific.


Integration tests — test that multiple components work together. For a stemmer, this means testing it against a batch of words and verifying the results are consistent.


Functional tests — test the system from the outside, as a user would. For a stemmer, this means feeding it real text and verifying the output makes sense for a real use case like search.


You will write all three.

Write Unit Tests

Unit tests

Write a function called `run_unit_tests` that tests your `stem` function with at least 15 test cases covering:


1. Basic suffix stripping — words ending in -ing, -ed, -ly, -s

2. Complex suffixes — -tion, -ness, -ment, -able, -ible

3. Y-inflection — -ies, -ied, -ier

4. Edge cases — short words that should not be stripped, words with no suffix, already-stemmed words

5. Double consonant cleanup — running → run, sitting → sit

6. Silent-e restoration — hoping → hope

7. Case insensitivity — uppercase input should be lowercased


Structure your tests like this:

```
def run_unit_tests():
    tests = [
        ('running', 'run'),
        ('cats', 'cat'),
        # ... at least 15 test cases
    ]
    passed = 0
    failed = 0
    for word, expected in tests:
        result = stem(word)
        if result == expected:
            passed += 1
        else:
            failed += 1
            print(f'FAIL: stem({word}) = {result}, expected {expected}')
    print(f'{passed}/{passed + failed} unit tests passed')
    return failed == 0
```

Include your complete `stem` function AND write `run_unit_tests` with at least 15 test cases covering all 7 categories above. Call `run_unit_tests()` at the end.

Write Integration Tests

Integration tests

Unit tests verify individual inputs. Integration tests verify that components work together correctly.


For a stemmer, a key integration property is consistency: if you stem the same word twice, you get the same result. And words that should group together should produce the same stem.


Write a function called `run_integration_tests` that tests:


1. Idempotency — stemming an already-stemmed word should return the same stem. `stem(stem(word)) == stem(word)` for all words.

2. Grouping — words that should share a stem actually do. Test at least 3 word families (e.g., run/runs/running should all share a stem — note that runner keeps its -er, per the rules you defined).

3. Batch processing — process a list of 20+ words and verify no crashes, no empty strings, no None values.


```
def run_integration_tests():
    # Test 1: idempotency
    # Test 2: word family grouping
    # Test 3: batch stability
    ...
```
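
Here is one possible shape for the three checks. It bundles a compact version of the stemmer so the snippet runs on its own — in your solution you would reuse your existing `stem` function instead:

```python
VOWELS = 'aeiou'

def stem(word):
    # compact version of the stemmer built in this lesson
    word = word.lower()
    for suf in ('ies', 'ied', 'ier'):
        if word.endswith(suf) and len(word) - len(suf) + 1 >= 3:
            return word[:-len(suf)] + 'i'
    for suf in ('tion', 'ness', 'ment', 'able', 'ible', 'ing', 'ed', 'ly', 's'):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            res = word[:-len(suf)]
            if suf in ('ing', 'ed') and res[-1] == res[-2] and res[-1] not in VOWELS:
                res = res[:-1]          # double consonant: runn -> run
            elif suf == 'ing' and res[-1] not in VOWELS and res[-2] in VOWELS:
                res += 'e'              # silent e: hop -> hope
            return res
    return word

def run_integration_tests():
    words = ['running', 'happiness', 'babies', 'organization', 'hoping',
             'quickly', 'movement', 'readable', 'earlier', 'cats',
             'run', 'walked', 'carried', 'flies', 'sitting',
             'studied', 'moves', 'carries', 'move', 'jumped']

    # Test 1: idempotency — stemming a stem changes nothing
    for w in words:
        assert stem(stem(w)) == stem(w), f'not idempotent: {w}'

    # Test 2: word families collapse to a single stem
    families = [['run', 'runs', 'running'],
                ['carried', 'carries'],
                ['move', 'moves', 'movement']]
    for family in families:
        stems = {stem(w) for w in family}
        assert len(stems) == 1, f'family did not group: {family} -> {stems}'

    # Test 3: batch stability — no crashes, empty strings, or Nones
    results = [stem(w) for w in words]
    assert all(isinstance(r, str) and r for r in results)

    print(f'{len(words)} words checked; integration tests passed')
    return True

run_integration_tests()
```

Picking the word list carefully matters: this simple stemmer is not idempotent on every English word (e.g. repeated -s stripping), so the batch above sticks to words whose stems are stable.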

Include your `stem` function and write `run_integration_tests` with all three test categories. Call it at the end.

Write Functional Tests

Functional tests

Functional tests verify the system works for its intended use case. Your stemmer exists to improve search — so test that.


Write a function called `run_functional_tests` that:


1. Search simulation — given a list of document strings and a query word, stem both the documents and the query, then check if stemmed query terms appear in stemmed documents. Test that searching for 'running' finds documents containing 'run' or 'runs'.

2. Precision check — verify that stemming does NOT incorrectly group unrelated words. 'university' and 'universe' might share a stem — check if your stemmer handles this (it is OK if it groups them; document the behavior).

3. Real text processing — stem every word in a paragraph of real English text. Verify the output is reasonable: no empty strings, no crashes, output has the same number of words as input.


```
def run_functional_tests():
    # Test 1: search finds related documents
    # Test 2: precision — check over-stemming
    # Test 3: real paragraph processing
    ...
```
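
A minimal sketch of the search simulation (Test 1) and the precision check (Test 2); the `search` helper and the sample documents are invented for illustration, and the compact stemmer is included only so the snippet is self-contained:

```python
VOWELS = 'aeiou'

def stem(word):
    # compact version of the stemmer built in this lesson
    word = word.lower()
    for suf in ('ies', 'ied', 'ier'):
        if word.endswith(suf) and len(word) - len(suf) + 1 >= 3:
            return word[:-len(suf)] + 'i'
    for suf in ('tion', 'ness', 'ment', 'able', 'ible', 'ing', 'ed', 'ly', 's'):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            res = word[:-len(suf)]
            if suf in ('ing', 'ed') and res[-1] == res[-2] and res[-1] not in VOWELS:
                res = res[:-1]
            elif suf == 'ing' and res[-1] not in VOWELS and res[-2] in VOWELS:
                res += 'e'
            return res
    return word

def search(query, documents):
    """Return indices of documents that share a stem with the query word."""
    q = stem(query)
    return [i for i, doc in enumerate(documents)
            if q in {stem(w) for w in doc.lower().split()}]

docs = ['the run was long', 'many runs happened', 'cats sleep all day']
print(search('running', docs))  # -> [0, 1]

# Precision spot-check: these stay distinct under our simple rules
print(stem('university'), stem('universe'))
```

An exact-string search for 'running' would find none of these documents; the stemmed search finds the first two.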

Include your `stem` function and write `run_functional_tests` with all three test categories. Call it at the end.

What You Built

What you built

You implemented a working English stemmer with:

- 12 suffix rules (-tion, -ness, -ment, -able, -ible, -ies, -ied, -ier, -ing, -ed, -ly, -s)

- Double consonant cleanup

- Silent-e restoration

- Unit tests, integration tests, and functional tests


The lineage

Your stemmer descends from a line of work that started with Zellig Harris in 1955:


- Harris (1955) — Discovered that morpheme boundaries show up as statistical signals (successor variety)

- Lovins (1968) — First published stemming algorithm, 294 suffix rules

- Porter (1980) — Simplified to ~60 rules in 5 steps, became the standard for decades

- Snowball (2001) — Porter's framework generalized to multiple languages

- Your stemmer (today) — 12 rules, same core principle


What you could do next

- Implement the full Porter algorithm (it is well documented and a great exercise)

- Port your stemmer to C for a 100x speed improvement

- Build a simple search engine that uses your stemmer to index and query text files

- Compare your stemmer's output against NLTK's PorterStemmer to measure accuracy


The code you wrote today is the same fundamental operation that runs inside every search engine on the planet. Not bad for a day's work.

Reflect on what you built. What was the most surprising thing you learned? If you were going to improve your stemmer, what would you add or change?