SI425, Fall 2017
Due date: the start of class, Sep 28
What if you could identify who wrote a piece of text solely by the language used? Such a system would help with spam detection, catching emails that spoof your friends, analyzing historical documents (did Shakespeare write his plays?), and of course tasks important to intelligence agencies.
You have a corpus of training documents from ten known authors. Your goal is to build language models that best capture each author's use of language, and then use them to correctly identify who wrote each document in a test set.
This lab will be done in pairs. You may work alone if you so choose. At most one group of three will be allowed... and your system had better win.
Milestone: Part One is due next week at the start of lab, Sep 21.
Implement a Naive Bayes classifier for the given authors that uses unigram or bigram features (or both!). That is, be able to compute P(author | document) using proper probability distributions. By Bayes' rule, P(author | document) is proportional to P(document | author) * P(author), so your main challenge will be to compute the inverted P(document | author). Remember that P(document | author) is just the product of all P(w1|context)*P(w2|context)*...*P(wn|context), computed with that author's language model.
You may reuse any code you wrote for Lab 2 if you find that helpful (hint: that would be extremely helpful). If you did not complete Lab 2 fully, Dr. Chambers can provide a working smoothed bigram implementation. Look in TrainTest.java and complete the train() and test() methods. You will most certainly add your own methods to support the code you put in these two methods. You must also perform some basic text pre-processing before you count your n-grams. This means that you should not be counting tokens that have commas and periods attached to them. Your accuracy should get into the upper 70s.
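As a starting point for the pre-processing step, here is a minimal sketch of a tokenizer that lowercases the text and strips punctuation attached to the start or end of each token. The class name Preprocessor and the method tokenize are hypothetical and are not part of the provided starter code; the specific rules (lowercasing, keeping apostrophes inside words) are only one possible choice, and you should experiment with your own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the provided starter code.
class Preprocessor {
    /** Lowercases the text and splits it into tokens with edge punctuation removed. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on whitespace first, then clean each raw token.
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Strip leading and trailing characters that are not letters or digits,
            // so "world." becomes "world" while "it's" keeps its inner apostrophe.
            String cleaned = raw.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }
}
```

With these rules, Preprocessor.tokenize("Hello, world. It's a test!") yields the tokens hello, world, it's, a, test. Note that this sketch throws the punctuation away entirely; whether that is the best choice for authorship detection is something to test on the development set.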
Question: What rules did you apply to pre-process your text?
Question: What is your accuracy on the development set?
Handling small probabilities. You will compute the probability of an entire passage, which involves multiplying hundreds of small probabilities together. This results in a very tiny number, too small to represent with a double! What do we do? Perform all calculations in log space. Remember: log(A * B) = log(A) + log(B). Your code should not multiply two probabilities P(A)*P(B) anymore, but instead add two logarithms log(P(A)) + log(P(B)). You still just choose the highest (log space) probability in the end.
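The log-space idea can be sketched as follows. This is an illustration under simplifying assumptions, not the required design: it uses unigram probabilities stored in a plain Map, and a fixed unseenProb stands in for whatever smoothing your language model from Lab 2 actually does. The names LogSpaceScorer, logProbability, bestAuthor, and unseenProb are all hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the starter code's TrainTest.java may be organized differently.
class LogSpaceScorer {
    /** Returns log P(document | author) by summing log P(w | author) over the tokens. */
    static double logProbability(List<String> tokens,
                                 Map<String, Double> unigramProbs,
                                 double unseenProb) {
        double logProb = 0.0;
        for (String token : tokens) {
            // unseenProb is a placeholder for real smoothing of unseen words.
            double p = unigramProbs.getOrDefault(token, unseenProb);
            logProb += Math.log(p); // add logs instead of multiplying probabilities
        }
        return logProb;
    }

    /** Picks the author maximizing log P(author) + log P(document | author). */
    static String bestAuthor(List<String> tokens,
                             Map<String, Map<String, Double>> modelsByAuthor,
                             Map<String, Double> priors,
                             double unseenProb) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> entry : modelsByAuthor.entrySet()) {
            String author = entry.getKey();
            double score = Math.log(priors.get(author))
                    + logProbability(tokens, entry.getValue(), unseenProb);
            if (score > bestScore) {
                bestScore = score;
                best = author;
            }
        }
        return best;
    }
}
```

To see why this matters: for a 200-token passage where every token has probability 0.001, the direct product Math.pow(0.001, 200) underflows to 0.0 in a double, while the sum of logs is roughly -1381.6 and can still be compared across authors.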
Now that you are fully experienced in making your own n-gram language models, it is time to dig deeper into NLP. We as humans do not just look at short n-gram phrases ... what other aspects of text might a computer easily pull out and use in a Naive Bayes classifier? What makes an author unique? The best NLP research sometimes comes up with the most unique text features. This part requires your most innovative thinking, and you must integrate what you come up with into your learning system.
Come up with non-bigram features that improve performance. What else might distinguish two authors? Be creative. For instance, does punctuation vary? Number of sentences? etc. Add new features to your existing n-gram features (don't remove the n-grams!). I expect to see many new features here, not just one or two. Your grade for this part will largely be on the amount of effort and experimentation you put into it.
Features in code: I highly suggest you make a new Java class called OurFeatures.java or something along those lines. Then each feature should have its own method in this class (e.g., int hyphenFeature(String text)). During training, you can call your new class's train method, which calls these feature methods, and just comment out individual calls to turn features on or off as you test your system for better accuracy. By having a separate class, you can also naturally merge (interpolate!) it with your n-gram language model classes.
Question: What features did you try and which ones help performance? List everything in a clear, bulleted list, and whether or not each feature is in your final model (features that don't improve accuracy should be left out, but keep your code in place!).
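One way the suggested layout might look is sketched below. Only the class name OurFeatures and the signature int hyphenFeature(String text) come from the suggestion above; sentenceCountFeature, extract, and the feature names in the map are hypothetical examples. How you turn these raw counts into probabilities for Naive Bayes is left to you.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a feature class; everything except hyphenFeature's signature is an assumption.
class OurFeatures {
    /** Counts the hyphen characters in the text. */
    static int hyphenFeature(String text) {
        int count = 0;
        for (char c : text.toCharArray()) {
            if (c == '-') {
                count++;
            }
        }
        return count;
    }

    /** Rough sentence count: the number of '.', '?', and '!' characters. */
    static int sentenceCountFeature(String text) {
        int count = 0;
        for (char c : text.toCharArray()) {
            if (c == '.' || c == '?' || c == '!') {
                count++;
            }
        }
        return count;
    }

    /** Collects all active features; comment out a line to turn that feature off. */
    static Map<String, Double> extract(String text) {
        Map<String, Double> features = new LinkedHashMap<>();
        features.put("hyphens", (double) hyphenFeature(text));
        features.put("sentences", (double) sentenceCountFeature(text));
        return features;
    }
}
```

Keeping every feature behind a single extract method means that switching a feature on or off is a one-line change, which makes it easy to measure each feature's effect on development-set accuracy.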
Question: What is your final accuracy on the development set?
Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab3/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab3/data/ : the data sets used in this assignment
Create a lab3 directory in your local space, and copy lab3/java/ to it (cp -R /courses/nchamber/nlp/lab3/java lab3/). There is a build.xml file included, so just type ant in the java/ directory. Make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml otherwise. Make sure you can run the code. There is a run script which does this for you!
./run -data <path-to-data-dir>
Eclipse setup: Click New->Project->"Java Project from Existing Ant Buildfile". Browse to find the build.xml file in your new lab3 directory. You are ready to go! Open up TrainTest.java to see where you will place your code.
Use the submit script: /courses/nchamber/submit
Create a directory for your lab called lab3 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab3:
Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.
Milestone Part One on time: 5 pts
Part One, Correctness: 10 pts
Part Two: 18 pts
Performance (final accuracy score): 4 pts
readme.txt, questions answered: 3 pts
Compiling Penalty: your code does not compile: -10 pts
Extra Credit: Best performance in the class: 5 pts!
Total: 40 pts