You'll read about tokenization in Chapter 2 of the textbook. The ability to split strings of text into atomic units (usually words) is essential before we can even think of interpreting what was written. This becomes even more critical with social media, where words are often smashed together. Hashtags are the prime example of this:
#bigbangtheory -> big bang theory
#chickensoup -> chicken soup
#running -> running
#30times -> 30 times
#neverstop -> never stop
The classic solution is the Maximum Matching (MaxMatch) algorithm, a greedy approach to segmentation. It requires access to a vocabulary, or list of words in the target language. Starting at the front of a string, it identifies the longest prefix that is a known word in the vocabulary and splits that word off the front. Then it repeats the process on whatever remains at the tail end of the original string. Again, find the longest word, pull it off, and repeat. This continues until no characters remain. Here is an example with three steps:
nothingbetterok
nothing betterok
nothing better ok
What happens when no known word starts the string? The default behavior is to split off the first character (it becomes a token by itself) and then repeat the algorithm starting from the second character. In our example above, suppose the vocabulary didn't contain "ok" as a word. Then the result would be:
nothing better o k
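The whole procedure, including the single-character fallback, fits in a few lines of Python. This is a minimal iterative sketch, not the textbook's code; the toy vocabulary below is made up purely for illustration:

```python
def max_match(text, vocabulary):
    """Greedy left-to-right segmentation (Maximum Matching).

    At each position, split off the longest prefix found in the
    vocabulary; if no prefix matches, split off a single character.
    """
    tokens = []
    while text:
        # Try candidate prefixes from longest to shortest.
        for i in range(len(text), 0, -1):
            if text[:i] in vocabulary:
                tokens.append(text[:i])
                text = text[i:]
                break
        else:
            # No known word starts the string: emit one character.
            tokens.append(text[0])
            text = text[1:]
    return tokens

# A toy vocabulary (hypothetical, for illustration only).
vocab = {"nothing", "no", "thing", "better", "bet", "ok"}
print(max_match("nothingbetterok", vocab))             # ['nothing', 'better', 'ok']
print(max_match("nothingbetterok", vocab - {"ok"}))    # ['nothing', 'better', 'o', 'k']
```

Note the greedy choice: "nothing" wins over "no" because we scan prefixes from longest to shortest, and removing "ok" from the vocabulary triggers the character-by-character fallback.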
That's it! An earlier edition of your textbook included this algorithm, and I'm providing its suggested pseudocode to give you an even better understanding:
function MAXMATCH(string, dictionary) returns list of tokens T
    if string is empty
        return empty list
    for i <-- length(string) downto 1
        firstword = first i chars of string
        remainder = rest of string
        if InDictionary(firstword, dictionary)
            return list(firstword, MAXMATCH(remainder, dictionary))
    // Didn't find a word in the dictionary!
    letter = first char in string
    remainder = rest of string
    return list(letter, MAXMATCH(remainder, dictionary))
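The pseudocode above translates almost line for line into runnable Python. This is my own sketch of that translation, assuming the dictionary is simply a set of lowercase words:

```python
def maxmatch(string, dictionary):
    """Recursive MaxMatch, following the pseudocode step for step."""
    if not string:
        return []
    # Scan candidate first words from longest (i = len) down to 1.
    for i in range(len(string), 0, -1):
        firstword = string[:i]
        remainder = string[i:]
        if firstword in dictionary:
            return [firstword] + maxmatch(remainder, dictionary)
    # Didn't find a word in the dictionary! Split off one character.
    return [string[0]] + maxmatch(string[1:], dictionary)

# Example run with a toy dictionary (hypothetical, for illustration).
words = {"big", "bang", "theory", "chicken", "soup"}
print(maxmatch("bigbangtheory", words))   # ['big', 'bang', 'theory']
```

The only liberty taken is replacing `InDictionary` with Python's `in` operator on a set; the recursion structure, including the fallthrough to the single-letter case, mirrors the pseudocode exactly.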