Lab 3: Author Identification


Milestone: Part 1 by start of lab day, Sep 22

Due: at the start of lab day in two weeks, Sep 29


What if you could identify who wrote a piece of text solely by the language used? Such a system would help with spam detection, preventing emails spoofing friends, analyzing historical documents (did Shakespeare write his plays?), and even things like judging online reviews for positivity.


You have a corpus of training documents from ten known authors. Your goal is to build a language model that best models each author's use of language, and then is able to correctly identify who wrote which documents in an unlabeled test set.


This lab will be done in pairs, though you may work alone if you so choose. At most one group of three will be allowed...and your system had better win.

Code Setup

Starter Python code and data are provided here.

Create a lab3 directory in your local space, download the archive from the link above, and extract it there. Make sure you can run the starter code before moving on.


Part One: Bigram Naive Bayes

Milestone: Part One is due next week at the start of lab, Sep 22.


Implement a Naive Bayes classifier for the n given authors that uses unigram or bigram features (or both!). That is, be able to compute P(author | document) using proper probability distributions. Remember from our book and lectures:

P(author | document) ∝ P(author) * P(document | author)
  1. P(author): we will ignore this for the lab and assume each author is equally likely.

  2. P(document | author): this is just the probability a language model assigns to the document, where that model was trained on the author's text.


For example, suppose your system is given this unlabeled passage:

It is a fair, even-handed, noble adjustment of things, that while there is infection in disease and sorrow, there is nothing in the world so irresistibly contagious as laughter and good-humour.

A trained system might assign these probabilities:


P( Austen | passage ) = 0.04
P( Shakespeare | passage ) = 0.01
P( Dickens | passage ) = 0.72
P( Melville | passage ) = 0.11

Your job is thus to compute the P(author | document) numbers above. By Bayes' rule, your main task is to approximate each one with simply P(document | author)! Remember that a document's probability is just the product of all its word probabilities: P(w1|context) * P(w2|context) * ... * P(wn|context). You wrote a function to get the probability of a word in context in Lab 2, right? Ok, that's how you compute the probability of a document, but what about conditioning on the author? That's easy. You just train a language model (LM) on that author's text, and only that author's text. You now have a distribution for that author. You will train a separate LM for each author.
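To make the decision rule concrete, here is a minimal sketch with made-up unigram models (the probabilities below are invented for illustration; your real models will be the smoothed bigram LMs from Lab 2):

```python
# Toy per-author unigram "language models" -- probabilities invented for
# illustration only. Your real models come from training on each author's text.
models = {
    "Austen":  {"laughter": 0.001, "contagious": 0.0002},
    "Dickens": {"laughter": 0.004, "contagious": 0.003},
}

def score(model, tokens, unseen=1e-6):
    """P(document | author): the product of the model's per-word probabilities."""
    p = 1.0
    for tok in tokens:
        p *= model.get(tok, unseen)   # back off to a tiny constant for unseen words
    return p

passage = ["laughter", "contagious"]
best = max(models, key=lambda author: score(models[author], passage))
# With these toy numbers, Dickens scores highest, so the system guesses Dickens.
```

The real version is the same loop, just with per-author bigram models and a proper handling of unseen words via smoothing.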

Your Programming Task: complete the train() and test() methods in the starter code. Achieve a prediction accuracy well into the 70s.

Reuse the code you wrote for Lab 2. If you did not complete Lab 2 fully, Dr. Chambers can provide a working smoothed bigram implementation. Find and complete the train() and test() methods in the starter code. You'll probably want to set some global variables in train() (to store your trained models), and then use those in test(). You must also perform some basic text pre-processing before you count your n-grams: you should not be counting tokens that still have commas and periods attached to them. Your accuracy should reach the upper 70s.
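As a starting point, one possible pre-processing pass (a sketch only; your exact choices may differ and are worth experimenting with) lowercases the text and strips punctuation while tokenizing:

```python
import re

def preprocess(text):
    """Lowercase and split punctuation off tokens, so that 'laughter,'
    and 'laughter' count as the same word."""
    text = text.lower()
    # Keep runs of letters and apostrophes; commas, periods, etc. become breaks.
    return re.findall(r"[a-z']+", text)

tokens = preprocess("It is a fair, even-handed, noble adjustment of things.")
# → ['it', 'is', 'a', 'fair', 'even', 'handed', 'noble', 'adjustment', 'of', 'things']
```

Whatever you choose, apply the identical pre-processing at training time and at test time, or your counts will not line up.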

Question: What did you do to pre-process your text?
Question: What is your accuracy on the development set?

Handling small probabilities. You will compute the probability of an entire passage which involves multiplying hundreds of small probabilities together. This results in a very tiny number, too small to represent with a double! What do we do? Perform all calculations in log space. Remember: log(A * B) = log(A) + log(B). Your code should not multiply two probabilities P(A)*P(B) anymore, but instead add two logarithms log(P(A)) + log(P(B)). Python has math.log2(x) for you. All this does is move your numbers into log space -- you still just choose the highest (log) probability in the end.
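A quick demonstration of why this matters, using made-up per-word probabilities:

```python
import math

probs = [1e-5] * 400   # hundreds of tiny per-word probabilities (made-up values)

# The naive product underflows to exactly 0.0 in floating point.
product = 1.0
for p in probs:
    product *= p
print(product)          # 0.0 -- underflow, every author ties at "zero"

# Summing logs keeps the value representable.
log_total = sum(math.log2(p) for p in probs)
print(log_total)        # about -6643.9 -- compare authors by this number instead
```

The author with the highest (least negative) log probability is still the author with the highest probability, so the final argmax is unchanged.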

Part Two: Logistic Regression


For this part, install the sklearn library:
pip3 install -U scikit-learn

If the above pip3 command failed, check your Ubuntu version:
lsb_release -a

If it says version 16, then you need some additional setup because that's several years old. You need to install Python 3.6 first:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.6
sudo rm /usr/bin/python3
sudo ln -s /usr/bin/python3.6 /usr/bin/python3
pip3 install -U scikit-learn

Congratulations on programming a Naive Bayes classifier from scratch! You will find most people in the world use libraries for these things, but you'll be one of the few to really know how it works.

Part 2 allows you to join the world of using libraries, making use of some great NLP and ML implementations. A widely used library for machine learning (ML) is sklearn. This lab requires you to create a logistic regression classifier, filling in the starter code with sklearn's tools. I will describe the basic parts of an sklearn classifier here, but it is up to you to put them together. It is very little code in the end!

Train the classifier. You need to create a LogisticRegression object and call its fit(counts, labels) function:

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()               # creates a Logistic Regression object
clf.fit(X_train_counts, Y_train)         # trains the model using gradient descent

Build the inputs. That looks easy, right? You should see now that your only goal is to create the two arguments (X_train_counts, Y_train). The second argument, Y_train, is just a list of strings that represent the labels (authors) of each passage. If we have 1000 passages to train on, then this should be 1000 author names. Simple. The first argument, X_train_counts, is a bit harder because it is a sparse matrix of all your n-gram counts for each passage. One row is a vector of n-gram counts for one passage. It is a scipy object that all sklearn classifiers know how to interpret. Think of it as just an n-gram lookup for their counts, but in a nice matrix format for easy mathematical computations. How do you create such a matrix?

Count n-grams in a sparse matrix: sklearn provides a class CountVectorizer that takes text, counts n-grams, and returns the matrix.

from sklearn.feature_extraction.text import CountVectorizer

# Create a "vectorizer" object.	
vectorizer = CountVectorizer(analyzer='word', min_df=5, ngram_range=(1, 2))   # 1 and 2-grams, min 5 count

# Convert your text with the vectorizer.
X_train_counts = vectorizer.fit_transform(X_train)     # X_train is a list of strings

Could this be any easier? Note the X_train argument: it is a list of your strings, where each string is an entire document or passage. Make sure X_train and Y_train are aligned, so that the i-th entry of Y_train names the author of the i-th document in X_train. The CountVectorizer does all the token splitting, punctuation removal, and lowercasing for you. It has lots of argument options to control these things, but the defaults are pretty good.

The above thus trains a logistic regression classifier. Now how do we use the model to make predictions? Well you need to do two things:

  1. Transform your unseen text:
    X_test_counts = vectorizer.transform(X_test)   # NOT fit_transform(), just transform()
  2. Make predictions:
    guesses = clf.predict(X_test_counts)

Your job is to use the above and fill in train() and test(). This lab intentionally requires you to play with these pieces and figure things out. Not sure what each variable is? Try it out! Print things! See what's going in and out! You learn by trying.
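Putting the pieces together, a minimal end-to-end sketch might look like this (the four training passages and the two test passages below are made up for illustration; your real X_train, Y_train, and X_test come from the provided data files):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus: parallel lists of passages and their authors.
X_train = ["it was the best of times it was the worst of times",
           "it is a truth universally acknowledged",
           "the times were hard and the streets were dark",
           "a truth universally held by a single man"]
Y_train = ["Dickens", "Austen", "Dickens", "Austen"]

# Count 1- and 2-grams (min_df left at its default of 1 for this tiny corpus).
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
X_train_counts = vectorizer.fit_transform(X_train)

clf = LogisticRegression()
clf.fit(X_train_counts, Y_train)

X_test = ["it was the worst of times", "a truth universally acknowledged"]
X_test_counts = vectorizer.transform(X_test)   # transform(), NOT fit_transform()
guesses = clf.predict(X_test_counts)           # one author guess per test passage
```

Accuracy is then just the fraction of guesses that match the true test labels.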

Question: Play with the LogisticRegression arguments. What happens when you change min_df? How about ngram_range? Choose your best settings and tell me what they are.

Question: What is your final accuracy on the development set?

Extra Credit: Other Classifiers

There are many other classifiers available in sklearn. They all use the same interface, so all you have to do is import a different one and replace the LogisticRegression() declaration with another.
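For example, sklearn's MultinomialNB (the library's own Naive Bayes, a natural match for count features) drops in with a one-line change; the toy corpus below is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train = ["it was the best of times", "a truth universally acknowledged"]
Y_train = ["Dickens", "Austen"]

vectorizer = CountVectorizer()
clf = MultinomialNB()                # instead of LogisticRegression()
clf.fit(vectorizer.fit_transform(X_train), Y_train)

guesses = clf.predict(vectorizer.transform(["it was the worst of times"]))
```

Everything else -- building the count matrix, calling fit() and predict() -- stays exactly the same.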

Look up sklearn classifiers online, and experiment with others. Can you get better results? Tell me in your README which ones you tried, along with their classification accuracies. Turn in the best-performing setup you found for a chance to compete for best class performance.

What to turn in

  1. A single text file (readme.txt) with answers to the four questions above (five if extra credit).
  2. Your name and partner name(s) at the top of readme.txt.
  3. Your two completed code files. I will run them as-is on a new test set that you will not see, to compute final accuracy scores.

How to turn in

Only one member of your group needs to submit. Please don't submit duplicates.

Upload all to our external submission webpage.

Login, select SI425 and this Lab 3. Click your name and use the 'Upload Submission' option in the menu.


Grading

Milestone Part One on time: 10%
Part One, Correctness: 50%
Part Two: 30%
Performance (final accuracy score): 5%
readme.txt, questions answered: 5%

Penalty: your code crashes or runs too long: -10%
Extra Credit: + 5-10%