Lab 4: Twitter Sentiment Classification


Milestone: Part 1 by start of lab day, Oct 6

Due: in two weeks start of lab day, Oct 13

Motivation

Sentiment analysis is now one of the better-known applications of NLP. Applications that automatically identify whether text is positive or negative (happy or sad) are used by corporations to judge consumer reactions, by politicians to gauge voter sentiment, and even by political scientists to study population behavior. This lab will show you how complex and difficult the task actually is.

Objective

You will build two types of classifiers for positive, negative, and objective sentiment. The first simply uses a sentiment lexicon; the second uses logistic regression. You will see how different the two approaches are, and get a good sense of the challenges and the tools available to you for sentiment identification. Your dataset consists of millions of tweets, so you'll also get exposure to the infamously informal language of social media.

Code Setup

Download this tarball for the code and the data (790MB).

If that's too big, you can download this version with only one file of tweets (132MB). This is a temporary fix: you probably want to resolve your disk space issues before you train your Part 2 model, so you can use more tweets than just this one file (there are 6 in the original download).

How to Run the Code

python3 sentiment.py -lexicon     # Part 1
python3 sentiment.py -learn       # Part 2

Part One: Use a lexicon

You have a lexicon of sentiment words! Can you create a sentiment classifier with it, and how well does it work?

First, Twitter text is very noisy and needs to be cleaned up. It is full of things like URLs and usernames. Write a text processing function that tokenizes the text (splits up words and punctuation), normalizes links and usernames to a constant token, and handles anything else you notice.
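A cleaning function along these lines might look like the following sketch. The placeholder tokens `<url>` and `<user>` and the exact regexes are illustrative choices, not requirements:

```python
import re

def clean(tweet):
    """Tokenize a tweet, normalizing links and @usernames to constant tokens."""
    tweet = tweet.lower()
    # Normalize links and usernames to placeholder tokens (illustrative names).
    tweet = re.sub(r'https?://\S+|www\.\S+', '<url>', tweet)
    tweet = re.sub(r'@\w+', '<user>', tweet)
    # Split words and punctuation into separate tokens; keep apostrophes
    # inside words so contractions like "don't" survive as one token.
    return re.findall(r"<url>|<user>|[a-z']+|[^\sa-z']", tweet)
```

For example, `clean("Check this http://t.co/abc @bob!!")` yields `['check', 'this', '<url>', '<user>', '!', '!']`.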

Second, use a popular lexicon to label tweets with their sentiment. I provide you the Opinion Lexicon from Hu and Liu at the Univ of Illinois at Chicago. It contains a couple thousand positive and negative words. Your task is to label tweets using the lexicon. The approach is up to you, based on our class discussions. Report your final accuracy. The given sentiment.py already reads in the lexicon and calls your function with it.

Tasks:

  1. Complete the label_with_lexicon() function in sentiment.py
  2. Write a helper function clean(tweet) that cleans a tweet of punctuation and noise.
  3. Create a scoring metric that uses the given lexicon to label tweets with sentiment. Suggestion: count positive and negative words in a tweet, and do something reasonable.
  4. Optimize your algorithm to increase the final accuracy %
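One possible scoring metric following the suggestion in Task 3, as a sketch (the signature is illustrative; match whatever sentiment.py actually passes your function):

```python
def label_with_lexicon(tokens, positive_words, negative_words):
    """Label a tokenized tweet using counts of lexicon hits.
    positive_words/negative_words are assumed to be sets of words."""
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    if pos > neg:
        return 'positive'
    if neg > pos:
        return 'negative'
    return 'objective'   # tie, or no sentiment words found
```

This baseline treats ties and lexicon-free tweets as objective; your optimization in Task 4 might weight words, handle negation ("not good"), or tune a hit threshold.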

Questions you must answer in readme.txt:

  1. What is your best accuracy? (are you near 60% ???)
  2. Describe your algorithm/scoring metric.
  3. Find three positive tweets you got wrong. Why were you wrong?
  4. Find three negative tweets you got wrong. Why were you wrong?
  5. Find three objective tweets you got wrong. Why were you wrong?

Target Accuracy: 60% or higher

Part Two: Classify with Logistic Regression

Part One used a hand-created lexicon. What if we don't have a lexicon? Your task is now to create a logistic regression classifier.

Without a lexicon, we need some training data. I provide you a training corpus of ~4 million tweets here. You will train a classifier to identify positive and negative tweets, but the trick is that we have neither a lexicon nor labels on the tweets! In the author lab, each passage had an author to learn from, but these are just raw, unlabeled tweets. How do we get labels?

There is a type of learning called distant supervision that looks for a substitute signal when the true label is unknown. Here we want positive and negative. Are there any reliable signals in tweets that would reveal emotional content? Smiley faces! Find all tweets with smiley faces :) and treat them as positive examples. Do the same for sad faces :(. You can also consider other variants of faces to expand your training data size. Ignore all other tweets (which is most of them).
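The distant supervision step can be sketched as below. The emoticon lists are just examples; expanding them with variants grows your training set:

```python
# Example emoticon lists; add more variants to expand training data.
POSITIVE_FACES = [':)', ':-)', ':D', '=)']
NEGATIVE_FACES = [':(', ':-(', '=(']

def distant_label(tweet):
    """Return 'positive'/'negative' based on emoticons, or None to skip."""
    has_pos = any(face in tweet for face in POSITIVE_FACES)
    has_neg = any(face in tweet for face in NEGATIVE_FACES)
    if has_pos and not has_neg:
        return 'positive'
    if has_neg and not has_pos:
        return 'negative'
    return None   # no face, or conflicting faces: ignore this tweet
```

One design note: consider stripping the emoticons out of the text before training, so the classifier learns from the surrounding words rather than just memorizing the faces themselves.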

Step 1: train a logistic regression classifier on tweets with smiley/sad faces

Step 2: use your trained classifier to identify positive and negative tweets ... as well as objective tweets (see below)

Step 3: fine-tune your LogisticRegression arguments for better performance. Unigrams? Bigrams? Both? etc.
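Steps 1 and 3 might be sketched with scikit-learn as follows. Everything here is illustrative: the toy texts and labels stand in for your emoticon-labeled tweets, and ngram_range is one of the arguments to experiment with:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data; in the lab, these come from your distant labels.
texts  = ['love this so much', 'great happy day', 'hate this', 'so sad today']
labels = ['positive', 'positive', 'negative', 'negative']

# ngram_range=(1, 2) uses unigrams AND bigrams; compare against (1, 1).
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(['love this great day']))
```

A pipeline like this keeps the vectorizer and classifier together, so the same vocabulary is applied at training and prediction time.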

A note on Step 2: don't forget about objective tweets. We don't have training data for these, so you should use thresholds for your positive and negative decisions, and your code should predict objective when the logistic regression probability is not high enough to be confident in its decision. The following shows how to access the probabilities:

# Previous: just get the best labels
guesses = clf.predict(X_test)
# e.g. ['positive', 'positive', 'negative', 'positive']

# Now: get the actual probabilities
# 'negative' is first, 'positive' is second
allprobs = clf.predict_proba(X_test)
# e.g. [[0.32 0.68]
#       [0.41 0.59]
#       [0.88 0.12]
#       [0.23 0.77]]

# Example of iterating and accessing the numbers
for i in range(len(allprobs)):
    print('negative prob =', allprobs[i][0])
    print('positive prob =', allprobs[i][1])
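Turning those probabilities into three-way labels might look like this sketch. The 0.7 threshold is an example value only; tune it on held-out data:

```python
def three_way_label(probs, threshold=0.7):
    """probs is one [p_negative, p_positive] row from predict_proba."""
    p_neg, p_pos = probs
    if p_pos >= threshold:
        return 'positive'
    if p_neg >= threshold:
        return 'negative'
    return 'objective'   # the classifier isn't confident either way

labels = [three_way_label(p) for p in [[0.32, 0.68], [0.88, 0.12]]]
# labels == ['objective', 'negative']
```

You could also use separate thresholds for positive and negative decisions if one class needs more confidence than the other.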

Possible extra credit for other creative ways of learning sentiment words using signals other than smiley/sad faces!

Questions you must answer in readme.txt:

  1. What is your best accuracy?
  2. What types of tweets does it make mistakes on? Give an example.

Target Accuracy: approaching 60%

What to turn in

  1. A single text file readme.txt with answers to the questions in Parts 1 and 2. Note that this lab is more question-heavy than previous labs. I'm looking for thorough answers, and more time spent looking at your output and the text itself.
  2. sentiment.py

How to turn in

Upload all to our external submission webpage.

Login, select SI425 and Lab 4. Click your name and use the 'Upload Submission' option in the menu.

Grading

Milestone (part one) on time: 10%

Part One: 40%

Part Two: 40%

Questions: 10%

Extra Credit: top performers in class on either Part One or Part Two