Revision

Introduction

Stemming and lemmatization are two methods used to reduce a word to its root form.

Stemming

Stemming is a family of rule-based methods that use lists of common prefixes and suffixes to extract the root of a word.

Stemming works by cutting off the end or the beginning of the word, based on a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting works on some occasions but not always, which is the main limitation of the approach.

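Here is an example of how stemming works: a stemmer that strips the suffixes ‘-ing’ and ‘-es’ reduces ‘studying’ to ‘study’ but ‘studies’ to ‘studi’. Using this stemming method, the words ‘studying’ and ‘studies’ are not mapped to the same root word.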
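The suffix stripping described above can be illustrated with a minimal sketch in Python. The suffix list and the function name toy_stem are assumptions made for this illustration; real stemmers such as the Porter stemmer apply many more rules and conditions.

```python
# Toy suffix-stripping stemmer (illustrative only, not a real stemming algorithm).
# SUFFIXES is a hypothetical rule list; rules are tried in the order given.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied


for w in ("studying", "studies"):
    print(w, "->", toy_stem(w))
# studying -> study
# studies -> studi
```

The two inflected forms end up with different stems because the rules only look at the characters of the word, not at its grammar.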

Lemmatization

Lemmatization is based on a dictionary that relates words to their roots (lemmas). It can therefore take the morphological analysis of the words into account.

Here is an example of how lemmatization works: a lemmatization dictionary lists both ‘studying’ and ‘studies’ under the lemma ‘study’. Using such a dictionary, the words ‘studying’ and ‘studies’ are both mapped to the word ‘study’.
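A dictionary lookup of this kind can be sketched in a few lines of Python. The table LEMMA_DICT and the function toy_lemmatize are hypothetical and hand-written for this illustration; a real lemmatizer relies on a much larger lexicon and usually on part-of-speech information as well.

```python
# Toy dictionary-based lemmatizer (illustrative only).
# LEMMA_DICT is a hypothetical, hand-written table of inflected form -> lemma.
LEMMA_DICT = {
    "studying": "study",
    "studies": "study",
    "studied": "study",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary lemma, or the word unchanged if it is not listed."""
    return LEMMA_DICT.get(word.lower(), word)


for w in ("studying", "studies"):
    print(w, "->", toy_lemmatize(w))
# studying -> study
# studies -> study
```

Unlike rule-based stripping, the lookup can only handle words that appear in the dictionary, which is why coverage of the dictionary matters.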

Difference from subword tokenizers

A subword tokenizer such as WordPiece helps in several ways and is often a better choice than stemming or lemmatization, for the following reasons:

If the words ‘playful’, ‘playing’ and ‘played’ are all reduced to ‘play’, some information is lost, for example that ‘playing’ is a present-tense form and ‘played’ is a past-tense form. This does not happen with WordPiece tokenization, because the full word form is still represented by its tokens.
WordPiece tokens cover every word, including words that do not occur in the vocabulary. Such a word is split into word pieces that are in the vocabulary, so you still get embeddings for the pieces instead of removing the word or replacing it with an ‘unknown’ token.
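To make the splitting concrete, here is a minimal sketch of greedy longest-match-first subword splitting in the style of WordPiece, using the ‘##’ prefix for continuation pieces. The vocabulary VOCAB and the function name wordpiece are assumptions for illustration; a real vocabulary is learned from a corpus and contains tens of thousands of pieces.

```python
# Toy greedy longest-match-first subword splitter in the style of WordPiece.
# VOCAB is a hypothetical, hand-written vocabulary; "##" marks a continuation piece.
VOCAB = {"play", "##ing", "##ed", "##ful", "##s"}
UNK = "[UNK]"


def wordpiece(word: str, vocab: set = VOCAB) -> list:
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


for w in ("playful", "playing", "played"):
    print(w, "->", wordpiece(w))
# playful -> ['play', '##ful']
# playing -> ['play', '##ing']
# played -> ['play', '##ed']
```

In this toy setting the shared piece ‘play’ plays the role of the root, while the continuation pieces keep the information that a stemmer or lemmatizer would discard. With a realistic vocabulary that also contains single characters, falling back to the unknown token is rare.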

Using WordPiece tokenization instead of a word tokenizer followed by a lemmatizer is largely a design choice, and WordPiece tokenization generally performs well. Keep in mind, however, that WordPiece tokenization increases the number of tokens, which is not the case with lemmatization.
