SI486i, Spring 2015
Milestone 1: the start of class, Jan 22
Due date: the start of class, Jan 29
Language Modeling is one of the most fundamental concepts in natural language processing. It can be found in machine translation, speech recognition, document classification, and many more applications. Your phone uses these it predicts the next word you will type. You will build some of the basic language models that we cover in class.
It is late 2001 and the financial world has been rocked by irregularities in the accounting department of Enron Corporation. Investigators into the irregularities need your help. They have obtained by subpoena thousands of e-mails from its employees, but the e-mails were corrupted during the investigation and need to be recovered. All of the words in the e-mails are intact, but the word order has been scrambled. Being trained in NLP, you suggest using a rich language model to reconstruct the original word order. You decide to generate a list of possible sentences from the scrambled words, and choose the most probable sentence as determined by your language model.
Milestone: Part One is due next week at the start of class, Jan 22.
Implement a bigram and a trigram language model trained on the provided English text (see "Code Setup" below). I provide for you a unigram model, fully implemented (see EmpiricalUnigramLanguageModel.java). Use that code as a starting block for counting and storing n-grams. Create files named BigramModel.java and TrigramModel.java. Use Eclipse to inherit from the LanguageModel.java class and you will see four functions that you need to implement: generateSentence(), getVocabulary(), getWordProbability(List
You do not need to handle unknown words. Do not smooth these models. If an n-gram is unseen, the probability should be zero. Note that EmpiricalUnigramLanguageModel.java handles unknown words with a simplistic smoothing approach. I am not asking you to do this for this part.
*** Your code must pass the "sum to one" test, and thus be valid probability distributions. The output of run should print out GOOD for all the tests except the unseen bigram/trigram tests.
Now change your Bigram and Trigram language models to implement some actual smoothing of unknown words. We covered Laplace smoothing in class where you add 1 count to all vocabulary items. Lidstone smoothing is the same, but instead of adding 1, you add some fraction less than 1. Your job is to play with this fraction and see what works best. You could alternatively implement Good-Turing smoothing for extra credit. Your code for this part will go in SmoothedBigram.java and SmoothedTrigram.java.
Time Saving Tip: This part is only about how you compute P(w | context). You should not need to change how you train() your model or how you count words. In fact, you might only change/add a few lines of code!
Implement a single language model that combines your unigram, bigram, and trigram models with linear interpolation. Remember, the weights that you put on each of the three models must sum to one. Your code for this part will go in LinearInterpolation.java. Again, inherit from LanguageModel.java and implement the four functions. This must still pass the "sum to one" tests.
Optimize the weights. First randomly choose some weights, then change them slowly to see what combination produces the best scores.
Time Saving Tip: Use your code from parts one and two and interpolate between your Java classes.
You will first evaluate the models' performance using perplexity. This measure gives you a numeric score of how well each model matches the data's distribution. The starter code computes this automatically for you.
The core of this lab is reconstructing the jumbled Enron e-mails. You will predict which sentence is not scrambled, and your performance will be scored using Word Error Rate (WER) as in the last lab, and also % correct (the number of sentences your model selects exactly correct, divided by the total number of sentences). Again, the starter code computes this for you. Isn't that great? How easy!
Summary: three evaluation metrics; look for these in the output of LanguageModelTester.java (see below).
Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab2/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab2/data/ : the data sets used in this assignment
Create a lab2 directory in your local space. Then copy lab2/java/ to it (cp -R /courses/nchamber/nlp/lab2/java lab2/). Do not copy the data subdirectory. There is a build.xml file included, so just type ant in the java/ directory. Make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml otherwise. Make sure you can run the code. I provided a run script that takes no arguments. Execute it from the command line and everything should look ok.
If everything is working, you'll get some output describing the construction and testing of a (pretty bad) language model. The next section will help you make sense of what you're seeing.
run. A shell script that runs the default setup, out of the box. Use this to see what flags you can send to your program!
LanguageModelTester.java contains the main() method that the run script invokes. Take a look at main(). This class has the job of managing data files and constructing and testing a language model. Its behavior is controlled via command-line options. Each command-line option has a default value, and the effective values are printed at the beginning of each run.
EmpiricalUnigramLanguageModel.java is a fully functional language model that you can reference as you do this lab. You will build more advanced models. The -model option in the run script specifies the class name of what LM you want to run. Its default value is usna.langmodel.EmpiricalUnigramLanguageModel. You can see that this extends LanguageModel.java and must implement the four methods:
Training and Testing: The -train and -test options (to LanguageModelTester) specify the names of sentence files (containing one sentence per line) in the data directory to be used as training and test data. The default values are europarl-train.sent.txt and europarl-test.sent.txt. These files contain sentences from the Europarl corpus. A second test set is also found in enron-test.sent.txt. This is a list of e-mails, described below. You need to test on both this and Europarl by changing the -test option. After loading the training and test sentences, the LanguageModelTester will create a language model of the specified class (-model), and train it using the specified training sentences (-train). It will then compute the perplexity of the test sentences (-test) with respect to the language model. When the supplied unigram model is trained on the Europarl data, it gets a perplexity between 800 and 900, which is very poor. A reasonably good perplexity number should be around 200; a competitive perplexity can be around 100 on the test data.
Probability Distribution As part of the assignment you are required to ensure that all of your probability distributions are valid (sum to 1). To verify this experimentally, the LanguageModelTester will call the checkProbability() method included in the LanguageModel prototype. It will call your implementation of getWordProbability() to compute Sum_w(P(w)). If the -check flag is set to true, checkModel will be run on your current model and the returned number will be tested for proximity to 1. It must sum to 1. This checkModel only needs to be run once on your final model, however, so disabling the -check flag will save you time on running repeated tests.
Model Generation If the -generate option is set to true (the default), the LanguageModelTester will print 10 random sentences generated according to the language model, so that you can get an intuitive sense of what the model is doing. Note that the outputs of the supplied unigram model aren't even vaguely like well-formed English.
Unknown Words A note about unknown words. Language models most standardly treat unseen words as if they were a single UNKNOWN token (or event). This means that, for example, a good unigram model will actually assign a larger probability to an unknown word ("some unknown event") than to a known but rare word. Look for this in your error analysis during your experimentation. You must handle unknown words in your models.
In the usna.util package there are a few utility classes that might be of use, particularly the Counter and CounterMap classes, which are helpful for counting n-grams.
For development and debugging purposes, it can be helpful to work with a very tiny data set. For this purpose, I've supplied a data file called mini.sent.txt containing just 10 sentences. Of course, any results obtained from training or testing on this data are meaningless, but the run time will be very fast.
If you're running out of memory, you can instruct Java to use more memory with the -mx option, for example -mx500m. You might also consider calling String.intern() on all new strings in order to conserve memory.
Use the submit script: /courses/nchamber/submit
Create a directory for your lab called lab2 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab2:
Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.
Milestone Part One on time: 5pts
Part One Correctness: 20 pts
Part Two: 5 pts
Part Three: 10 pts
Results.txt, scores and explanation: 5 pts
Compiling Penalty: your code does not compile: -10 pts
Output Penalty: your code prints debugging info and junk to screen: -2 pts
Extra Credit: Good-Turing smoothing: 5 pts!
Total: 45 pts