Lab 1: Twitter Hashtag Segmentation

Due date: the start of the next lab, Aug 29

Motivation

This lab requires you to interpret strings of continuous characters and segment them into words. Segmentation is an important task for many languages, such as Chinese and Arabic. Even in English, where whitespace typically separates words, genres like social media often mash words together into strings that are incomprehensible (to computers, at least). Hashtags are one particular example: #segmentingisfun #whatisahashtag

Generative AI: GenAI tools are permitted on this lab (but you are not required or expected to use them). Your grade on this lab is based on your overall creativity and novel ideas to improve performance. All GenAI usage must be cited in the code through clear code comments.

Objective

You are given a set of hashtags. Your goal is to segment them into tokens using an English dictionary, the MaxMatch algorithm, and your own enhancements to MaxMatch. The examples below are the goal, but not what a vanilla MaxMatch implementation will produce.

#bigbangtheory -> big bang theory
#chickensoup -> chicken soup
#running -> running
#30times -> 30 times
#neverstop -> never stop

I am providing you two incomplete source files: segment.py and evaluate.py. Your task is to fill in the missing code.

Code Setup and Data

Download the lab's starter code and data files by clicking here.

You will edit segment.py and evaluate.py for this lab. You will use lev.py as helper code but don't need to edit it.

Run the code: I will run your code with the following commands, so make sure they work. There are 3 parts to this lab, so I expect 3 commands:

python3 segment.py max bigwordlist.txt hashtags-dev.txt > maxguesses
python3 segment.py better bigwordlist.txt hashtags-dev.txt > betterguesses
python3 evaluate.py maxguesses hashtags-dev-gold.txt

You can try these now! Right now they will print out the hashtags verbatim. It's your job to change the program to print out segmented words instead of hashtags.

Part One: MaxMatch

Implement the MaxMatch algorithm as described here. Briefly, the algorithm starts from the first character and finds the longest matching known English word. It then continues from the end of that word, repeating the lookup. You obviously will need an English dictionary. Use the one you just extracted, bigwordlist.txt. It contains a list of words and how often each occurred in a large corpus. Note the list is very noisy (as much of NLP is).
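The greedy loop described above can be sketched roughly as follows. This is only an illustrative sketch with a toy inline dictionary; in your implementation the dictionary comes from bigwordlist.txt and the code belongs in the required method in segment.py:

```python
# Illustrative MaxMatch sketch (toy dictionary; the real one is bigwordlist.txt).
def max_match(text, dictionary):
    """Greedily take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest remaining prefix first, shrinking one char at a time.
        for j in range(len(text), i, -1):
            if text[i:j].lower() in dictionary:
                match = text[i:j]
                break
        if match is None:
            # No dictionary word starts here; emit a single character and move on.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return ' '.join(tokens)

print(max_match('neverstop', {'never', 'stop', 'ever', 'top'}))  # never stop
```

Note the greedy choice: the algorithm commits to the longest match even when a shorter word would lead to a better overall segmentation, which is one source of the errors you will analyze below.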

Implement MaxMatch in segment.py. You should implement it inside the maxmatch(words, path) method. This method should simply output your segmentation guesses, one per line, and nothing else. Do not output the hashtag character, but just your segmented words (as in the Objective section example above). Run it on the provided list of hashtags at hashtags-train.txt. The code should run like this, with output like this:

python3 segment.py max bigwordlist.txt hashtags-train.txt
...
Africa
African Proverb Day
African YOLO Moment
African Yolo Moment
AguasCalientes
...

You then need to subjectively study your output: scan through it and see what kinds of errors appear. Answer the following question in a readme.txt file:

  1. When MaxMatch is incorrect, what types of failures do you see and what causes each? List at least 3 different types of errors that you observe WITH AN EXAMPLE of each.

Part Two: Empirical Evaluation

We now want to objectively evaluate our performance. This part has you compute a score for your Part 1. Note that you have data files hashtags-dev.txt and hashtags-dev-gold.txt. These contain more tags, but also the correct answers (gold tags). You will now compare your output to the gold answers.

The textbook describes the minimum edit distance score to compare how different two strings are (by counting the number of changes needed to transform one into the other). Lucky for you, I provide code that computes this for you free of charge. Your task is to use my provided distance metric to compute a Word Error Rate (WER) between your guesses and the gold answers. WER is just the length-normalized minimum edit distance (i.e., minimum edit distance divided by the length of the correct segmentation string):

WER = #edits / lengthOfGoldString

Use my provided code lev.py to compute individual edit distance scores:

edits = levenshtein('teststring', 'test string')    # edits=1 (add a space char)

Now create a working WER computation that compares your output with the gold answers and returns the average WER across the entire dev set. Do this in evaluate.py. Note that your calculation needs to read two files: (1) your output from Part One, and (2) my provided gold answers. So you'll just want to pipe Part One's output to a file. Then run your program:

python3 segment.py max bigwordlist.txt hashtags-dev.txt > myguesses.txt
python3 evaluate.py myguesses.txt hashtags-dev-gold.txt
0.0869217799211904   # for instance
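The averaging step might look roughly like this. The levenshtein() below is a stand-in included only so the sketch is self-contained and runnable; in the lab you should call the one provided in lev.py, and the file reading is left out:

```python
# Sketch of average WER over parallel lists of guess and gold lines.
def levenshtein(a, b):
    """Stand-in minimum edit distance (one-row dynamic program).
    The lab provides this in lev.py; use that instead."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def average_wer(guess_lines, gold_lines):
    """Mean of per-line WER = edits / length of the gold string."""
    wers = [levenshtein(guess.strip(), gold.strip()) / len(gold.strip())
            for guess, gold in zip(guess_lines, gold_lines)]
    return sum(wers) / len(wers)

print(average_wer(['teststring'], ['test string']))  # 1/11, about 0.0909
```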

Answer the following question in your readme.txt:

  1. What is the average WER of MaxMatch on your test data? (this is one number: the average WER scores over all test hashtags) Your number will be small, as in, far less than 1.0. An average error of 1.0 would mean you made one mistake on every character you print. An average of 0.1 means one mistake every 10 characters. For a correct MaxMatch implementation, look for something around 0.08 or 0.09.

Part Three: Improve MaxMatch

Using your analysis from Parts 1 and 2, you will now improve MaxMatch. Your goal is to improve your WER score, and the sky's the limit on what you can try. Points will be given for the number of different improvements, creativity, and overall WER improvement. Do not change your maxmatch() method. You should copy that code into improved_maxmatch() and then make your improvements there. You can run the system now:

python3 segment.py better bigwordlist.txt hashtags-dev.txt
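As one example of the kind of enhancement you might try (purely an illustration, not a required or recommended approach): plain MaxMatch can never segment '#30times' correctly if '30' is absent from the word list. A preprocessing pass that splits off runs of digits, then feeds only the non-digit chunks through MaxMatch, handles such cases:

```python
import re

# Illustrative preprocessing idea: separate digit runs from letter runs
# before dictionary matching, so numbers survive on their own.
def split_digits(text):
    """Break text into alternating digit and non-digit chunks."""
    return re.findall(r'\d+|\D+', text)

print(split_digits('30times'))  # ['30', 'times']
```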

Answer the following questions in your readme.txt:

  1. List the improvements you made.
  2. What is the final WER of your improved algorithm?
(it should be lower than your Part Two result; expected performance is usually below 0.06)

What to turn in

  1. A single text file called readme.txt with answers to the FOUR questions in Parts 1,2,3.
  2. Completed segment.py and evaluate.py files.
  3. Your segment program should output your segmentations, one per line, no '#' symbol. Do not print anything except the segmentations. Don't print extra whitespace on the end or the beginning! Follow the format shown in Part 1's instructions. The number of lines output must equal the number of lines input.

How to turn in

Upload your three files (readme.txt, segment.py, evaluate.py) to our department submission server under Lab01.

Competition!

I have a secret test set that is just like your dev set but with different tags. Make sure your Part 3 improvements are general solutions, not hard-coding answers for tags found in the dev set.

Your submissions will be run on this test set. It is thus very important that you follow the above output instructions correctly, or your final WER score will be incorrect. I will post the final results on a results page here. The top two performing submissions will receive extra credit.

Grading

Part One: 20 pts

  1. MaxMatch correctly implemented: 14 pts
  2. Failure analysis: 6 pts

Part Two: 10 pts

Part Three: 15 pts

  1. More than one improvement was made: 10 pts
  2. Final WER reported and performance improved: 5 pts (0 pts if WER does not significantly improve)

Runtime Penalty: your code crashes: -10 pts
Output Penalty: your output does not conform to the requirements: -5 pts
Extra Credit: best class performance, or creative/numerous MaxMatch improvements

Total: 45 pts