Lab 2: Email Recovery with Language Models

Part 1 Due: the start of the next lab day, Sep 5
Part 2 Due: the start of the next lab day, Sep 12

Motivation

Language modeling is one of the most fundamental concepts in natural language processing. It shows up in machine translation, speech recognition, document classification, and many other applications. Your phone uses it to predict the next word you will type. You will now build a couple of the basic language models from class.

Objective

It is late 2001 and the financial world has been rocked by irregularities in the accounting department of Enron Corporation. The investigators need your help (despite your having just been born). They obtained by subpoena thousands of e-mails from Enron employees, but the e-mails were corrupted during the investigation and need to be recovered. All of the words in each e-mail are intact, but the word order has been scrambled. Being trained in NLP, you suggest using a rich language model to reconstruct the original word order. You decide to generate possible orderings from the scrambled words, and choose the most probable sentence.

Code Setup

Starter Python code and data are provided here.

Create a lab2 directory in your local space. Download the archive from the above link and extract it. Make sure you can run the code like so:

python3 LMTester.py -model unigram 

Part One: Bigram and Trigram LMs

Milestone: Part One is due next week at the start of lab day, Sep 5.

NUTSHELL TASK: fill in BigramModel.py and TrigramModel.py

python3 LMTester.py -model bigram
python3 LMTester.py -model trigram

Implement a bigram and a trigram language model trained on the provided English text (see "Code Explanation" below). I provide for you a unigram model, fully implemented (see UnigramModel.py). Refer to that code for how to count and store tokens. Fill in the provided files named BigramModel.py and TrigramModel.py. These classes inherit from LanguageModel already, and you will see four functions that you need to implement:

  1. train(trainingSentences)
  2. get_word_probability(sentence, index)
  3. get_vocabulary()
  4. generate_sentence()

You do not need to handle unknown words, and do not smooth these models: if an n-gram is unseen, its probability should be zero. Note that UnigramModel.py handles unknown words with a simplistic smoothing approach; I am not asking you to do that for this part.
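To make the counting concrete, here is a minimal standalone sketch of unsmoothed bigram estimation. The function names, the <START>/<STOP> markers, and the dict-of-dicts storage are illustrative assumptions, not the starter code's actual interface:

```python
from collections import defaultdict

# Count bigrams over sentences (lists of tokens), then estimate
# P(word | prev) = count(prev, word) / count(prev) with no smoothing.
def train_bigrams(sentences):
    bigram_counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = ["<START>"] + sentence + ["<STOP>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[prev][word] += 1
    return bigram_counts

def bigram_probability(bigram_counts, prev, word):
    context_total = sum(bigram_counts[prev].values())
    if context_total == 0:
        return 0.0  # unseen context: the unsmoothed model assigns zero
    return bigram_counts[prev][word] / context_total
```

A trigram model is the same idea with a two-word context as the outer key.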

*** Your code must pass the "sum to one" test, and thus define valid probability distributions. The output of python3 LMTester.py (run with -model bigram or -model trigram) should print GOOD for all but the final "unseen words" test:

Checking for proper probability distributions:
Testing context [] ...GOOD!
Testing context ['united'] ...GOOD!
Testing context ['to', 'the'] ...GOOD!
Testing context ['the', 'quick', 'brown'] ...GOOD!
Testing context ['lalok', 'nok', 'crrok'] ...ERROR: probability distribution does not sum up to one, sum = 0.0

Calculating perplexity:
Training set perplexity:	66.26
Test set perplexity:    	inf

Reconstructing emails:
Perfect Email Recovery: 2 of 116 = 1.7%
Average WER: 0.9799

MILESTONE: how to turn in

Upload only BigramModel.py and TrigramModel.py to our submit system.

You must pass the two "sum to 1" tests on the online webpage. Note that this is minimal sanity checking; it does not mean you implemented everything correctly. You should compare your output to the above for BigramModel, and use your common sense to judge TrigramModel correctness.

Part Two: Lidstone Smoothing

NUTSHELL TASK: create and fill in SmoothedBigram.py and SmoothedTrigram.py

python3 LMTester.py -model smoothbi
python3 LMTester.py -model smoothtri

Now let's mimic your bigram and trigram language models (create SmoothedBigram.py and SmoothedTrigram.py), but implement actual smoothing so that unseen words and n-grams no longer receive zero probability. We covered Laplace smoothing in class, where you add 1 to the count of every vocabulary item. Lidstone smoothing is the same, except that instead of adding 1, you add some fraction less than 1. Your job is to experiment with this fraction and see what works best. You could alternatively implement Good-Turing smoothing for extra credit. Your code for this part will go in SmoothedBigram.py and SmoothedTrigram.py.

Time Saving Tip: part 2 is only about how you compute P(w | context). You should not need to change how you train() your model or how you count words. In fact, you might only change/add a small portion of code! Don't copy/paste all your functions. It's time to use your IC211 inheritance skills by inheriting the appropriate parent class:

from BigramModel import BigramModel

class SmoothedBigram(BigramModel):   # inherit all the BigramModel goodies from Part 1

    def __init__(self):
        super().__init__()           # calls your parent's init function
    

Which functions do you now need to override with new definitions here?
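Whatever you decide to override, the Lidstone-smoothed estimate itself is simple. Here is a minimal sketch of the computation; the function name and the default alpha are illustrative assumptions, not part of the starter code:

```python
# Lidstone-smoothed probability: add a fractional pseudo-count alpha to
# every vocabulary item. With alpha = 1 this reduces to Laplace smoothing.
def lidstone_probability(count, context_total, vocab_size, alpha=0.1):
    # Adding alpha * vocab_size to the denominator keeps the
    # distribution summing to one over the whole vocabulary.
    return (count + alpha) / (context_total + alpha * vocab_size)
```

Note that every word, even one never seen in this context, now gets nonzero probability, which is exactly what makes the final "unseen words" test pass.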

Your output should be in this ballpark. All the "sum to 1" tests now pass because you have smoothing, and you now get a finite test-set perplexity and decent recovery performance:

python3 LMTester.py -model smoothbi
Loading data/europarl-train.sent.txt
Loading data/europarl-test.sent.txt

Checking for proper probability distributions:
Testing context [] ...GOOD!
Testing context ['united'] ...GOOD!
Testing context ['to', 'the'] ...GOOD!
Testing context ['the', 'quick', 'brown'] ...GOOD!
Testing context ['lalok', 'nok', 'crrok'] ...GOOD!

Calculating perplexity:
Training set perplexity:	119.34
Test set perplexity:    	325.92

Reconstructing emails:
Perfect Email Recovery: 78 of 116 = 67.2%
Average WER: 0.1843

Part Three: Linear Interpolation

NUTSHELL TASK: create and fill in LinearInterpolation.py

python3 LMTester.py -model interp

Implement a single language model that combines your unigram, bigram, and trigram models with linear interpolation. Remember, the weights that you put on each of the three models must sum to one. Your code for this part will go in LinearInterpolation.py. Inherit from LanguageModel.py and implement the four functions. This must still pass the "sum to one" tests.

Optimize the weights. First choose something reasonable, then change them slowly to see what combination produces the best scores.

Time Saving Tip: DON'T YOU DARE START COPY/PASTING CODE FROM YOUR PREVIOUS PARTS TO MAKE HUGE HACKY FUNCTIONS. WE ARE NOT ANIMALS! Import your classes from parts one and two and create objects of each. Call their functions (they work, right?), and just interpolate their results. For example, how do you train() this linear interpolation? Well, you just call train() on each of your 3 models.
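The delegation pattern described above might look like the following sketch. The class, attribute, and parameter names here are assumptions for illustration, not the required interface:

```python
# Sketch: hold three already-trained component models and combine their
# probabilities with fixed weights that sum to one.
class LinearInterpolationSketch:
    def __init__(self, models, weights):
        assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to one
        self.models = models
        self.weights = weights

    def train(self, training_sentences):
        for model in self.models:  # just delegate training to each model
            model.train(training_sentences)

    def get_word_probability(self, sentence, index):
        # P(w) = l1*P_uni(w) + l2*P_bi(w) + l3*P_tri(w)
        return sum(w * m.get_word_probability(sentence, index)
                   for w, m in zip(self.weights, self.models))
```

Because the weights sum to one and each component is a valid distribution, the interpolated model still passes the "sum to one" tests.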

Ballpark results? Well, you have results from Part 2; you should be at least as good as those. Your goal is to improve the Average WER score. Your sentence generation quality might suffer, but that's OK. The two tasks are actually different! What is best for generating European parliament language is not best for de-jumbling e-mails.

Evaluation

You'll just run my LMTester.py, and its evaluation output contains 4 different tests:

  1. proper probability distributions
  2. perplexity
  3. email word error rate
  4. sentence generation

Proper Probability Distributions means your probabilities better sum to 1!

Perplexity is a measure that gives you a numeric score of how well each model matches the data's distribution. The starter code computes this automatically for you. Lower is better!
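For intuition, perplexity over a stream of per-token probabilities is the exponential of the average negative log-probability. A minimal sketch (the starter code computes this for you; names here are illustrative):

```python
import math

# Perplexity of a sequence of per-token probabilities:
# exp( -(1/N) * sum(log p_i) ). A single zero-probability token
# drives the perplexity to infinity, as in the unsmoothed output above.
def perplexity(probabilities):
    if any(p == 0.0 for p in probabilities):
        return float("inf")
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))
```

For example, a model that assigns probability 1/4 to every token has perplexity 4: it is as "confused" as a uniform choice among four words.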

Word Error Rate (WER): the core of this lab is reconstructing the jumbled Enron e-mails. Your model will choose the most probable ordering of each scrambled sentence, and your performance will be scored using WER, and also % correct (the number of sentences your model reconstructs exactly, divided by the total number of sentences).

Sentence generation: this one is more subjective, but it's like your phone's word prediction! Your model keeps sampling the next word until it generates the end-of-sentence token.
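The sampling loop behind sentence generation can be sketched as follows, assuming a hypothetical dict-of-dicts of next-word distributions (the <START>/<STOP> names and data format are assumptions, not the starter code's):

```python
import random

# Sample a sentence word by word from per-context next-word
# distributions, stopping at the end-of-sentence token.
def generate_sentence(next_word_dists, start="<START>", stop="<STOP>",
                      max_len=20):
    sentence, context = [], start
    for _ in range(max_len):  # cap length in case STOP is never drawn
        dist = next_word_dists[context]
        words = list(dist.keys())
        weights = list(dist.values())
        word = random.choices(words, weights=weights)[0]
        if word == stop:
            break
        sentence.append(word)
        context = word  # a bigram-style context: just the last word
    return sentence
```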

Below is what you should see out-of-the-box with my unigram model:

python3 LMTester.py -model unigram
Loading data/europarl-train.sent.txt
Loading data/europarl-test.sent.txt

Checking for proper probability distributions:
Testing context [] ...GOOD!
Testing context ['united'] ...GOOD!
Testing context ['to', 'the'] ...GOOD!
Testing context ['the', 'quick', 'brown'] ...GOOD!     # should see all GOODs
Testing context ['lalok', 'nok', 'crrok'] ...GOOD!

Calculating perplexity:
Training set perplexity:	754.67
Test set perplexity:    	768.85                 # lower the better

Reconstructing emails:
Perfect Email Recovery: 2 of 116 = 1.7%
Average WER: 0.795                                    # WER lower the better
    

Code Explanation and Requirements

LMTester.py is the program you will run mostly as-is. This class has the job of managing data files and constructing and testing a language model. Its behavior is controlled via command-line options. Each command-line option has a default value, and the effective values are printed at the beginning of each run.

UnigramModel.py is a fully functional language model that you can reference as you do this lab. You will build more advanced models. LMTester.py runs this by default out of the box. You can see that this extends LanguageModel.py and implements the four methods:

  1. train(trainingSentences). Trains the model from the supplied collection of training sentences. Note that these sentence collections are disk-backed, so doing anything other than iterating over them will be very slow, and should be avoided.
  2. get_word_probability(sentence, index). Returns the probability of the word at index, according to the model, within the specified sentence.
  3. get_vocabulary(). Returns the set of all tokens that your model makes predictions for. This includes the unknown UNK word and the sentence ending STOP word. Do not include the START word in the vocabulary.
  4. generate_sentence(). Returns a random sentence sampled according to the model.

The -data option to LMTester specifies the directory in which to find data. By default, this is ./data/

Training and Testing: The -train and -test options (to LMTester) specify the names of sentence files (containing one sentence per line) in the data directory to be used as training and test data. The default values are europarl-train.sent.txt and europarl-test.sent.txt; these files contain sentences from the Europarl corpus. A second test set is found in enron-test.sent.txt. This is a list of e-mails, described below. You need to test on both this and Europarl by changing the -test option.

After loading the training and test sentences, the LMTester will create a language model of the specified class (-model) and train it using the specified training sentences (-train). It will then compute the perplexity of the test sentences (-test) with respect to the language model. When the supplied unigram model is trained on the Europarl data, it gets a perplexity between 800 and 900, which is very poor. A reasonably good perplexity number should be around 200; a competitive perplexity can be around 100 on the test data.

Probability Distributions: If the -check option is set to true (the default), your model will be tested to make sure all probabilities sum to 1. To verify this experimentally, the LMTester will call the check_probability() method included in my LanguageModel prototype. It will call your implementation of get_word_probability() to compute Sum_w(P(w)). It must sum to 1.

Model Generation: If the -generate option is set to true (the default), the LMTester will print 6 random sentences generated according to your language model, so that you can get an intuitive sense of what the model is doing. Note that the outputs of the supplied UNIGRAM model aren't even vaguely like well-formed English.

Unknown Words: Language models standardly treat all unseen words as if they were a single UNKNOWN token (or event). This means that, for example, a good unigram model will actually assign a larger probability to an unknown word ("some unknown event") than to a known but rare word. Look for this in your error analysis during your experimentation. You must handle unknown words in your smoothed models (Parts 2 and 3).
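In practice, handling unknown words often amounts to mapping any out-of-vocabulary token to the single UNK symbol before any count lookup. A tiny sketch (the symbol name is an assumption; use whatever UnigramModel.py uses):

```python
# Map any token outside the training vocabulary to one shared UNK
# symbol, so all unseen words share a single probability estimate.
def normalize_token(token, vocabulary, unk="<UNK>"):
    return token if token in vocabulary else unk
```

Applying this consistently at both training and query time is what lets one UNK count stand in for every word the model has never seen.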

Programming Tips

Use my completed UnigramModel.py as an example as you work on BigramModel.py if you need some Python help and some starting direction.

For development and debugging purposes, it can be helpful to work with a very tiny data set. I included a data file called mini.sent.txt containing just 10 sentences. Of course, any results obtained from training or testing on this data are meaningless, but the run time will be very fast.

What to turn in

  1. All of your completed *.py files. It all must run successfully.
  2. I will run python3 LMTester.py -model interp, so you should set this to run your best optimized version. This will be used for grading and comparing against classmate performance.
  3. A text file readme.txt. You must include the perplexity, WER, and % correct scores for each of the five models in Parts 1-3 (just copy the few relevant lines from running LMTester.py). You then must write one paragraph explaining the differences in scores and why each model performs worse/better than the others.

How to turn in

Upload to the submit system: BigramModel.py, TrigramModel.py, SmoothedBigram.py, SmoothedTrigram.py, LinearInterpolation.py, and readme.txt

submit -c=SI425 -p=lab02 BigramModel.py TrigramModel.py SmoothedBigram.py SmoothedTrigram.py LinearInterpolation.py readme.txt

Grading

Milestone Part One on time: 10%

Part One Correctness: 40%

Part Two: 20%

Part Three: 20%

readme.txt, scores and explanation: 10%

Runtime Penalty: your code crashes: -10 to -20%
Output Penalty: your code prints debugging info and junk to screen: -5%
Extra Credit: Good-Turing smoothing: 10%!

Total: 100%