SI486i, Spring 2015

Lab 3: Author Identification

Due date: the start of class, Feb 12

Motivation

What if you could identify who wrote a piece of text solely by the language used? Such a system would help with spam detection, flagging emails that spoof your friends, analyzing historical documents (did Shakespeare write his plays?), and of course tasks important to intelligence agencies.

Objective

You have a corpus of training documents from ten known authors. Your goal is to build a language model that best models each author's use of language, and then is able to correctly identify who wrote which documents in a test set.

Teams

This lab will be done in pairs. You may work alone if you so choose. At most one group of three will be allowed ... and your system had better win.

Part One: Bigram Naive Bayes

Milestone: Part One is due next week at the start of lab, Feb 5.

Implement a Naive Bayes classifier for the n given authors that uses unigram or bigram features (or both!). That is, be able to compute P(author | document) using proper probability distributions. By Bayes' rule, P(author | document) is proportional to P(document | author) * P(author), so your main challenge will be to compute the inverted P(document | author). Remember that P(document | author) is just the product of all P(w1|context)*P(w2|context)*...*P(wn|context), with each probability taken from that author's language model. You may reuse any code you wrote for Lab 2 if you find that helpful (hint: that would be extremely helpful).

Look in TrainTest.java and complete the train() and test() methods. You will almost certainly add your own methods to support the code you put in these two methods.

You must also perform some basic text pre-processing before you count your n-grams. This means that you should not be counting tokens that have commas and periods attached to them. Your accuracy should get into the upper 70s (percent).
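To make the pre-processing requirement concrete, here is a minimal sketch of one possible rule set: lowercase everything, split on whitespace, and strip punctuation attached to either end of a token. The class name PreprocessSketch and the method tokenize are hypothetical and are not part of the starter code; your own rules may well differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the provided starter code.
class PreprocessSketch {

    // Lowercases the text, splits on whitespace, and strips any
    // non-alphanumeric characters attached to the start or end of a token.
    // Punctuation inside a token (e.g. the apostrophe in "it's") is kept.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("\\s+")) {
            String token = raw.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, it's, a, test]
        System.out.println(tokenize("Hello, world. It's a test!"));
    }
}
```

Note that this sketch throws punctuation away entirely, which is only one of several reasonable choices.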

Question: What rules did you apply to pre-process your text?
Question: What is your accuracy on the development set?

Handling small probabilities. You will compute the probability of an entire passage, which involves multiplying hundreds of small probabilities together. This results in a very tiny number, too small to represent with a double! What do we do? Perform all calculations in log space. Remember: log(A * B) = log(A) + log(B). Your code should not multiply two probabilities P(A)*P(B) anymore, but instead add two logarithms log(P(A)) + log(P(B)). You still just choose the highest (log space) probability in the end.
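The log-space idea above can be sketched as follows. This is an illustration under stated assumptions, not the required design: it uses unigram probabilities only, assumes the probabilities in each map are already smoothed, assumes a uniform prior over authors (so P(author) can be ignored), and substitutes a fixed hypothetical unseenProb for words missing from a model. The names LogSpaceSketch, logProbability, and bestAuthor are made up for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the provided starter code.
class LogSpaceSketch {

    // Adds log probabilities instead of multiplying raw probabilities.
    // probs maps a token to an (assumed already smoothed) P(token | author).
    static double logProbability(List<String> tokens, Map<String, Double> probs,
                                 double unseenProb) {
        double total = 0.0;
        for (String token : tokens) {
            total += Math.log(probs.getOrDefault(token, unseenProb));
        }
        return total;
    }

    // Returns the author whose model gives the highest log probability.
    // Assumes a uniform prior, so log P(author) is the same for everyone.
    static String bestAuthor(List<String> tokens,
                             Map<String, Map<String, Double>> models,
                             double unseenProb) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> entry : models.entrySet()) {
            double score = logProbability(tokens, entry.getValue(), unseenProb);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Multiplying 500 probabilities of 0.001 underflows to exactly 0.0 ...
        System.out.println(Math.pow(0.001, 500));
        // ... while the equivalent sum of logs is an ordinary negative number.
        System.out.println(500 * Math.log(0.001));

        Map<String, Double> a = new HashMap<>();
        a.put("the", 0.5);
        a.put("ball", 0.4);
        Map<String, Double> b = new HashMap<>();
        b.put("the", 0.5);
        b.put("whale", 0.4);
        Map<String, Map<String, Double>> models = new HashMap<>();
        models.put("authorA", a);
        models.put("authorB", b);
        // prints authorB
        System.out.println(bestAuthor(Arrays.asList("the", "whale"), models, 1e-6));
    }
}
```

Because the logarithm is monotonic, the author with the highest log probability is also the author with the highest probability, so the final comparison is unchanged.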

Part Two: Innovative Features

Now that you are fully experienced in making your own n-gram language models, it is time to dig deeper into NLP. We as humans do not just look at short n-gram phrases ... what other aspects of text might a computer easily pull out and use in a Naive Bayes classifier? What makes an author unique? The best NLP research sometimes comes up with the most distinctive text features. This part requires your most innovative thinking, and you must integrate your ideas into your learning system.

Come up with non-bigram features that improve performance. What else might distinguish two authors? Be creative. For instance, does punctuation vary? Number of sentences? etc. Add new features to your existing n-gram features (don't remove the n-grams!). I expect to see many new features here, not just two or three. Your grade for this part will largely be based on the amount of effort and experimentation you put into it.

Features in code: each feature should have its own method (e.g., hyphenFeature(String text)). Then you can call the methods during training, and just comment out individual calls to turn features on or off as you test your system for better accuracy.
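One way the one-method-per-feature layout might look is sketched below. Only the name hyphenFeature(String text) comes from this handout; its return type, the bucket labels, the semicolonFeature method, and the extractFeatures wrapper are all assumptions made for illustration. Here each feature method returns a string label that could be counted per author just like a word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the provided starter code.
class FeatureSketch {

    // Buckets how often hyphens appear (assumed thresholds, per 100 characters).
    static String hyphenFeature(String text) {
        long hyphens = text.chars().filter(c -> c == '-').count();
        if (hyphens == 0) {
            return "HYPHEN_NONE";
        }
        double per100 = 100.0 * hyphens / text.length();
        return per100 < 1.0 ? "HYPHEN_FEW" : "HYPHEN_MANY";
    }

    // Records whether the text contains at least one semicolon.
    static String semicolonFeature(String text) {
        return text.indexOf(';') >= 0 ? "SEMICOLON_YES" : "SEMICOLON_NO";
    }

    // Collects the active features; comment out a line to switch a feature off.
    static List<String> extractFeatures(String text) {
        List<String> features = new ArrayList<>();
        features.add(hyphenFeature(text));
        // features.add(semicolonFeature(text));
        return features;
    }

    public static void main(String[] args) {
        // prints [HYPHEN_MANY]
        System.out.println(extractFeatures("a well-known, long-lost friend"));
    }
}
```

Keeping every feature behind a single call makes it easy to measure the accuracy change from each one in isolation.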

Question: What features did you try, and which ones helped performance? List everything in a clear, bulleted list, and state whether or not each feature is in your final model (features that don't improve accuracy should be left out, but keep your code in place!).

Question: What is your final accuracy on the development set?

Code Setup

Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab3/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab3/data/ : the data sets used in this assignment

Create a lab3 directory in your local space, and copy lab3/java/ to it (cp -R /courses/nchamber/nlp/lab3/java lab3/). There is a build.xml file included, so just type ant in the java/ directory. Make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml. Make sure you can run the code. There is a run script which does this for you!

Eclipse setup: Click New->Project->"Java Project from Existing Ant Buildfile". Browse to find the build.xml file in your new lab3 directory. You are ready to go! Open up TrainTest.java to see where you will place your code.

What to turn in

  1. A single text file (answers.txt) with answers to the four questions above.
  2. Your name and partner name(s) at the top of the text file.
  3. A completed TrainTest.java file that compiles and runs. I will run it as-is on a new test set that you will not see to compute final accuracy scores.

How to turn in

Use the submit script: /courses/nchamber/submit

Create a directory for your lab called lab3 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab3:
    /courses/nchamber/submit  lab3

Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.

Grading

Milestone Part One on time: 5pts

Part One, Correctness: 10 pts

Part Two: 18 pts

Performance (final accuracy score): 4 pts

answers.txt, questions answered: 3 pts

Compiling Penalty: your code does not compile: -10 pts

Extra Credit: Best performance in the class: 5 pts!

Total: 40 pts