SI425, Fall 2017

Lab 2: Email Recovery with Language Models


Milestone 1: the start of class, Sep 7
Due date: the start of class, Sep 14


Language Modeling is one of the most fundamental concepts in natural language processing. It can be found in machine translation, speech recognition, document classification, and many more applications. Your phone uses one when it predicts the next word you will type. In this lab, you will build some of the basic language models that we cover in class.


It is late 2001 and the financial world has been rocked by irregularities in the accounting department of Enron Corporation. Investigators into the irregularities need your help. They have obtained by subpoena thousands of e-mails from its employees, but the e-mails were corrupted during the investigation and need to be recovered. All of the words in the e-mails are intact, but the word order has been scrambled. Being trained in NLP, you suggest using a rich language model to reconstruct the original word order. You decide to generate a list of possible sentences from the scrambled words, and choose the most probable sentence as determined by your language model.
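The selection step described above, scoring every candidate ordering and keeping the most probable one, might be sketched as follows. This is only an illustration; the class and method names here are hypothetical and are not part of the starter code, which handles scoring for you.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch: choose the candidate sentence the model scores highest.
class CandidateSelector {
    static List<String> mostProbable(List<List<String>> candidates,
                                     ToDoubleFunction<List<String>> logProb) {
        List<String> best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (List<String> cand : candidates) {
            double score = logProb.applyAsDouble(cand);  // log-probs avoid underflow
            if (score > bestScore) {
                bestScore = score;
                best = cand;
            }
        }
        return best;
    }
}
```

Note that scoring in log space is standard practice: multiplying many small probabilities underflows to zero, while summing their logs does not.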

Part One: Bigram and Trigram LMs

Milestone: Part One is due next week at the start of class, Sep 7.

Implement a bigram and a trigram language model trained on the provided English text (see "Code Setup" below). I provide for you a fully implemented unigram model (see EmpiricalUnigramLanguageModel.java). Use that code as a starting block for counting and storing n-grams. Create new files for your bigram and trigram models. Use Eclipse to inherit from the LanguageModel class, and you will see four functions that you need to implement: generateSentence(), getVocabulary(), getWordProbability(List<String> sentence, int index), and train(Collection<List<String>> trainingSentences).

You do not need to handle unknown words. Do not smooth these models. If an n-gram is unseen, the probability should be zero. Note that the provided unigram model handles unknown words with a simplistic smoothing approach; I am not asking you to do this for this part.
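As a concrete picture of what Part One asks for, here is a minimal sketch of an unsmoothed bigram model: count bigrams at training time, then return the maximum-likelihood estimate count(prev, w) / count(prev). The class and field names are hypothetical, not the lab's required API, and a real implementation would also append the STOP token and plug into the four LanguageModel methods.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative unsmoothed bigram model (names are hypothetical).
class BigramSketch {
    Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    Map<String, Integer> contextCounts = new HashMap<>();

    void train(Collection<List<String>> trainingSentences) {
        for (List<String> sentence : trainingSentences) {
            String prev = "<START>";   // assumed sentence-start marker
            for (String word : sentence) {
                bigramCounts.computeIfAbsent(prev, k -> new HashMap<>())
                            .merge(word, 1, Integer::sum);
                contextCounts.merge(prev, 1, Integer::sum);
                prev = word;
            }
        }
    }

    // Unsmoothed MLE: zero for any unseen bigram, as Part One requires.
    double prob(String prev, String word) {
        Map<String, Integer> row = bigramCounts.get(prev);
        if (row == null || !row.containsKey(word)) return 0.0;
        return row.get(word) / (double) contextCounts.get(prev);
    }
}
```

A trigram model is the same idea with a two-word context as the key.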

*** Your code must pass the "sum to one" test, and thus define valid probability distributions. The output of the run script should print GOOD for all tests except the unseen bigram/trigram tests.

Part Two: Lidstone Smoothing

Now change your Bigram and Trigram language models to implement some actual smoothing of unknown words. We covered Laplace smoothing in class, where you add 1 count to all vocabulary items. Lidstone smoothing is the same, but instead of adding 1, you add some fraction less than 1. Your job is to play with this fraction and see what works best. You could alternatively implement Good-Turing smoothing for extra credit. Your code for this part will go in your bigram and trigram model classes.

Time Saving Tip: This part is only about how you compute P(w | context). You should not need to change how you train() your model or how you count words. In fact, you might only change/add a few lines of code!
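The change really is small: with pseudocount lambda (0 < lambda <= 1) and vocabulary size |V|, the smoothed estimate is P(w | c) = (count(c, w) + lambda) / (count(c) + lambda * |V|). A hedged sketch of just that formula (names are illustrative):

```java
// Lidstone-smoothed conditional probability:
//   P(w | c) = (count(c,w) + lambda) / (count(c) + lambda * vocabSize)
// With lambda = 1 this reduces to Laplace smoothing.
class Lidstone {
    static double prob(int pairCount, int contextCount, int vocabSize, double lambda) {
        return (pairCount + lambda) / (contextCount + lambda * vocabSize);
    }
}
```

Summing this over the whole vocabulary gives (count(c) + lambda * |V|) / (count(c) + lambda * |V|) = 1, so the model still passes the "sum to one" test.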

Part Three: Linear Interpolation

Implement a single language model that combines your unigram, bigram, and trigram models with linear interpolation. Remember, the weights that you put on each of the three models must sum to one. Your code for this part will go in a new file. Again, inherit from LanguageModel and implement the four functions. This must still pass the "sum to one" tests.

Optimize the weights. First randomly choose some weights, then adjust them gradually to see which combination produces the best scores.

Time Saving Tip: Use your code from parts one and two and interpolate between your Java classes.


You will first evaluate the models' performance using perplexity. This measure gives you a numeric score of how well each model matches the data's distribution. The starter code computes this automatically for you.
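Concretely, perplexity is exp(-(1/N) * sum_i log P(w_i)) over the N test tokens; lower is better. The starter code computes this for you, but a sketch of the computation may help you interpret the number:

```java
// Perplexity over a sequence of per-word model probabilities:
//   PP = exp( -(1/N) * sum_i log P(w_i) )
class Perplexity {
    static double of(double[] wordProbs) {
        double logSum = 0.0;
        for (double p : wordProbs) {
            logSum += Math.log(p);   // accumulate in log space to avoid underflow
        }
        return Math.exp(-logSum / wordProbs.length);
    }
}
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words at each position.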

The core of this lab is reconstructing the jumbled Enron e-mails. You will predict which candidate sentence is the original, unscrambled one, and your performance will be scored using Word Error Rate (WER) as in the last lab, and also % correct (the number of sentences your model selects exactly correctly, divided by the total number of sentences). Again, the starter code computes this for you. Isn't that great? How easy!

Summary: three evaluation metrics; look for these in the output of the run script (see below).

Code Setup

Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab2/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab2/data/ : the data sets used in this assignment

Create a lab2 directory in your local space. Then copy lab2/java/ to it (cp -R /courses/nchamber/nlp/lab2/java lab2/). Do not copy the data subdirectory. There is a build.xml file included, so just type ant in the java/ directory. Make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml otherwise. Make sure you can run the code. I provided a run script that takes no arguments. Execute it from the command line and everything should look ok.

If everything is working, you'll get some output describing the construction and testing of a (pretty bad) language model. The next section will help you make sense of what you're seeing.

Code Explanation and Requirements

run: A shell script that runs the default setup, out of the box. Use this to see what flags you can send to your program!

LanguageModelTester contains the main() method that the run script invokes. Take a look at main(). This class has the job of managing data files and constructing and testing a language model. Its behavior is controlled via command-line options. Each command-line option has a default value, and the effective values are printed at the beginning of each run.

EmpiricalUnigramLanguageModel is a fully functional language model that you can reference as you do this lab. You will build more advanced models. The -model option in the run script specifies the class name of the LM you want to run. Its default value is usna.langmodel.EmpiricalUnigramLanguageModel. You can see that this class extends LanguageModel and must implement the four methods:

  1. train(Collection<List<String>> trainingSentences). Trains the model from the supplied collection of training sentences. Note that these sentence collections are disk-backed, so doing anything other than iterating over them will be very slow, and should be avoided.
  2. getWordProbability(List<String> sentence, int index). Returns the probability of the word at index, according to the model, within the specified sentence.
  3. getVocabulary(). Returns the set of all tokens that your model makes predictions for. This includes the unknown UNK word and the sentence ending STOP word. Do not include the START word in the vocabulary.
  4. generateSentence(). Returns a random sentence sampled according to the model.

The -data option to LanguageModelTester specifies the directory in which to find data. By default, this is /courses/nchamber/nlp/lab2/data/

Training and Testing: The -train and -test options (to LanguageModelTester) specify the names of sentence files (containing one sentence per line) in the data directory to be used as training and test data. The default values are europarl-train.sent.txt and europarl-test.sent.txt. These files contain sentences from the Europarl corpus. A second test set is found in enron-test.sent.txt. This is the list of e-mails described above. You need to test on both this and Europarl by changing the -test option.

After loading the training and test sentences, the LanguageModelTester will create a language model of the specified class (-model) and train it using the specified training sentences (-train). It will then compute the perplexity of the test sentences (-test) with respect to the language model. When the supplied unigram model is trained on the Europarl data, it gets a perplexity between 800 and 900, which is very poor. A reasonably good perplexity should be around 200; a competitive perplexity can be around 100 on the test data.

Probability Distribution: As part of the assignment, you are required to ensure that all of your probability distributions are valid (sum to 1). To verify this experimentally, the LanguageModelTester will call the checkProbability() method included in the LanguageModel prototype. It will call your implementation of getWordProbability() to compute Sum_w(P(w)). If the -check flag is set to true, this check will be run on your current model and the returned number will be tested for proximity to 1. It must sum to 1. The check only needs to be run once on your final model, however, so disabling the -check flag will save you time on repeated test runs.
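Conceptually, the check just sums your model's probability over the entire vocabulary. A hedged sketch of that idea (the tester's actual implementation may differ; names here are illustrative):

```java
import java.util.Set;
import java.util.function.ToDoubleFunction;

// Sum P(w) over the whole vocabulary; a valid distribution returns ~1.0.
class DistributionCheck {
    static double sumOverVocab(Set<String> vocab, ToDoubleFunction<String> prob) {
        double sum = 0.0;
        for (String w : vocab) {
            sum += prob.applyAsDouble(w);
        }
        return sum;
    }
}
```

If your result drifts from 1, the usual culprits are forgetting the STOP token, double-counting START, or a smoothing denominator that doesn't match the vocabulary size.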

Model Generation: If the -generate option is set to true (the default), the LanguageModelTester will print 10 random sentences generated according to the language model, so that you can get an intuitive sense of what the model is doing. Note that the outputs of the supplied unigram model aren't even vaguely like well-formed English.

Unknown Words: Language models standardly treat all unseen words as if they were a single UNKNOWN token (or event). This means that, for example, a good unigram model will actually assign a larger probability to an unknown word ("some unknown event") than to a known but rare word. Look for this in your error analysis during your experimentation. You must handle unknown words in your models.

Programming Tips

In the usna.util package there are a few utility classes that might be of use, particularly the Counter and CounterMap classes, which are helpful for counting n-grams.

For development and debugging purposes, it can be helpful to work with a very tiny data set. For this purpose, I've supplied a data file called mini.sent.txt containing just 10 sentences. Of course, any results obtained from training or testing on this data are meaningless, but the run time will be very fast.

If you're running out of memory, you can instruct Java to use more memory with the -mx option, for example -mx500m. You might also consider calling String.intern() on all new strings in order to conserve memory.

What to turn in

  1. All of your completed *.java files. They must all compile successfully by running "ant".
  2. A run script that is filled in to run your best model on your best training data. I will call this script, and record your best results. These will be used for grading and comparing against classmate performance.
  3. A text file readme.txt. You must include the perplexity, WER, and % correct scores for each of the five models in Parts 1-3 (just copy the few relevant lines from running run). You then must write one paragraph explaining the differences in scores and why each model performs worse/better than the others.

How to turn in

Use the submit script: /courses/nchamber/submit

Create a directory for your lab called lab2 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab2:
    /courses/nchamber/submit  lab2

Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.


Grading

Milestone Part One on time: 5pts

Part One Correctness: 20 pts

Part Two: 5 pts

Part Three: 10 pts

readme.txt, scores and explanation: 5 pts

Compiling Penalty: your code does not compile: -10 pts
Output Penalty: your code prints debugging info and junk to screen: -2 pts
Extra Credit: Good-Turing smoothing: 5 pts!

Total: 45 pts