SI486m, Spring 2015

Lab 4: Twitter Sentiment Classification

Due date: the start of class, Feb 26

Motivation

Sentiment analysis is one of the newest and now most well-known applications of learning and NLP techniques. Automatic means of identifying whether text is positive or negative (happy or sad) are used by corporations to judge consumer reactions, by politicians to judge voter sentiment, and even by political scientists to study population behavior. This lab will show you how difficult the task actually is...if you want to do it right.

Objective

You will experiment with lexicons, lists of positive/negative words. You will try out a popular lexicon, and then try to learn your own. The keyword here is "try", as this is not straightforward. Your goal is to use a lexicon to label text with happy and sad emotion. You'll do your best, and quickly realize that a lexicon only gets you so far. Your ultimate goal is to learn your own lexicon automatically. You have millions of unlabeled tweets and will use what is called distant supervision to learn a custom lexicon. Although you don't have labels, many tweets contain implicit labels in the form of emoticons (smiley faces). You will use these to identify happy/sad words and build a lexicon and/or classifier from them.

Part One: A General Lexicon

Milestone: Part One is due Thursday Feb 19. NOTE: this is easier than Part Two. Do not spend the whole first week on Part One alone.

First off, Twitter text is very noisy and needs to be cleaned up. It is full of things like URLs and usernames. You must write some text processing code that tokenizes the text (splits up words and punctuation), normalizes links/usernames to some constant token, and handles any other noise you notice.
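The cleanup step described above might look something like the following. This is only a minimal sketch, not the required solution: the class name TweetCleaner, the placeholder tokens <URL> and <USER>, and the regexes are all assumptions made for illustration, and none of them come from the starter code. The tokenizer is deliberately simplistic (for example, it splits ":D" into two tokens), so expect to refine it as you look at real tweets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the provided starter code.
class TweetCleaner {
    // Placeholder tokens are arbitrary choices for this sketch.
    static final String URL_TOKEN = "<URL>";
    static final String USER_TOKEN = "<USER>";

    private static final Pattern URL = Pattern.compile("(https?://|www\\.)\\S+");
    private static final Pattern USER = Pattern.compile("@\\w+");

    // A token is a placeholder, a word (optionally with internal apostrophes),
    // or a run of punctuation/symbols such as "!!!" or ":)".
    private static final Pattern TOKEN = Pattern.compile(
        Pattern.quote(URL_TOKEN) + "|" + Pattern.quote(USER_TOKEN)
        + "|[\\p{L}\\p{N}_]+(?:'\\p{L}+)*"
        + "|[^\\s\\p{L}\\p{N}_]+");

    static List<String> tokenize(String tweet) {
        // Normalize links and usernames first so they become single tokens.
        String text = URL.matcher(tweet).replaceAll(URL_TOKEN);
        text = USER.matcher(text).replaceAll(USER_TOKEN);

        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String tok = m.group();
            boolean placeholder = tok.equals(URL_TOKEN) || tok.equals(USER_TOKEN);
            // Lowercase ordinary tokens so they match lowercase lexicon entries.
            tokens.add(placeholder ? tok : tok.toLowerCase());
        }
        return tokens;
    }
}
```

For instance, TweetCleaner.tokenize("@bob I love this!!! http://t.co/abc :)") yields the tokens <USER>, i, love, this, !!!, <URL>, and :).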

Second, you'll use a popular lexicon to label text. I have provided you with the Opinion Lexicon from Hu and Liu at the University of Illinois at Chicago. Look in /courses/nchamber/nlp/lab4/data/lexicon. This contains a couple thousand positive and negative words. Your task is to label tweets using the lexicon. The approach is up to you, based on our class discussions. Report your final accuracy.

Tasks:

  1. Create an algorithm that uses the given lexicon to label tweets with sentiment.
  2. Complete the labelWithLexicon() function in TrainTest.java
  3. Optimize your algorithm using the final accuracy %
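
As a starting point for the tasks above, here is one very simple baseline: count lexicon hits and compare. This is a sketch under assumptions, not the intended answer. The class LexiconLabeler and the label strings "positive", "negative", and "objective" are hypothetical, and the real labelWithLexicon() signature in TrainTest.java may look quite different. A baseline like this ignores negation ("not good") and intensity, which is exactly the kind of thing you should investigate.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; the starter code's own types may differ.
class LexiconLabeler {
    private final Set<String> positive;
    private final Set<String> negative;

    LexiconLabeler(Set<String> positive, Set<String> negative) {
        this.positive = positive;
        this.negative = negative;
    }

    // Returns "positive", "negative", or "objective" (label strings are assumed).
    String label(List<String> tokens) {
        int score = 0;
        for (String tok : tokens) {
            if (positive.contains(tok)) {
                score++;          // one vote for positive
            } else if (negative.contains(tok)) {
                score--;          // one vote for negative
            }
        }
        if (score > 0) return "positive";
        if (score < 0) return "negative";
        return "objective";       // no hits, or an exact tie
    }
}
```

Treating ties as objective is a design choice you can revisit; try other decision rules and keep whichever gives the best accuracy.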

Questions you must answer:

  1. What is your best accuracy? (is it around 55% ?)
  2. Describe your algorithm.
  3. Find three positive tweets you got wrong. Why were you wrong?
  4. Find three negative tweets you got wrong. Why were you wrong?
  5. Find three objective tweets you got wrong. Why were you wrong?

Part Two: Learn a Custom Lexicon

Part One used a hand-created lexicon. What if we didn't have this? What if our language was less commonly used and nobody had spent months creating a lexicon of sentiment words? In this part, you will learn a lexicon straight from the text. I have provided you with a training corpus of ~4 million tweets. Look in /courses/nchamber/nlp/lab4/data/tweets. Your task is to learn a list of positive and negative words specific to Twitter. You may use any approach you like, but I suggest following the emoticon approach. Look at all tweets with smiley faces, and count the words that appear in each such tweet, associating each count with either positive or negative emoticons. You must then score each word. How? That is up to you, and you will determine the best way to score them through experimentation. Note that just counting frequencies isn't going to work, but try that first and see why it fails. Your code must go in learnTheLexicon() of TrainTest.java.
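
To make the emoticon approach concrete, here is a minimal sketch of counting and scoring. Everything in it is an assumption for illustration: the class EmoticonScorer is hypothetical, the emoticon lists are a tiny sample, and tokens are split on whitespace only. The score shown is just one candidate, a smoothed log ratio of relative frequencies, score(w) = log( ((h_w + 1) / (H + 1)) / ((s_w + 1) / (S + 1)) ), where h_w and s_w are the counts of w in happy and sad tweets and H and S are the total token counts in each group. You are expected to experiment with other scores.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the provided starter code.
class EmoticonScorer {
    private final Map<String, Integer> happyCounts = new HashMap<>();
    private final Map<String, Integer> sadCounts = new HashMap<>();
    private long happyTotal = 0;
    private long sadTotal = 0;

    // Small assumed samples of emoticons; extend these for real use.
    static boolean hasHappy(String s) {
        return s.contains(":)") || s.contains(":-)") || s.contains(":D");
    }

    static boolean hasSad(String s) {
        return s.contains(":(") || s.contains(":-(");
    }

    // Use the tweet's emoticon as a distant label and count its words.
    void observe(String tweet) {
        boolean happy = hasHappy(tweet);
        boolean sad = hasSad(tweet);
        if (happy == sad) return;  // no emoticon, or mixed signals: skip
        Map<String, Integer> counts = happy ? happyCounts : sadCounts;
        int n = 0;
        for (String tok : tweet.split("\\s+")) {
            // Skip empty strings and the emoticons themselves.
            if (tok.isEmpty() || hasHappy(tok) || hasSad(tok)) continue;
            counts.merge(tok.toLowerCase(), 1, Integer::sum);
            n++;
        }
        if (happy) happyTotal += n; else sadTotal += n;
    }

    // Smoothed log ratio: positive means happier, negative means sadder.
    // Adding one to every count avoids division by zero and log of zero.
    double score(String word) {
        double pHappy = (happyCounts.getOrDefault(word, 0) + 1.0) / (happyTotal + 1.0);
        double pSad = (sadCounts.getOrDefault(word, 0) + 1.0) / (sadTotal + 1.0);
        return Math.log(pHappy / pSad);
    }
}
```

Note that with this smoothing a word seen only once or twice can still receive an extreme score, so a minimum-count cutoff before thresholding is probably worth trying.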

Tasks:

  1. Write a basic "English detector" function. You'll need this...trust me. It should be very quick with only a few string checks at most. Choose a few very common English words, and use a few regexes to see if they are present or not.
  2. Count words with happy emoticons. Count words with sad emoticons.
  3. Compute a happy/sad score for each word (how to score is up to you).
  4. Set thresholds, and output SORTED BY SCORE into mypositive.txt and mynegative.txt. The learnTheLexicon() function already creates these files empty.
  5. Run your Part One code with your learned lexicon and see how you do! (are you around 40%?)
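
Tasks 1 and 4 above can be sketched roughly as follows. This is illustrative only: the class LexiconTools, the particular common words, and the method names are assumptions, not part of the starter code. The detector simply checks whether any of a handful of common English words appears as a whole word, and the selection helper keeps words at or above a threshold, sorted from highest score to lowest, ready to be written out one per line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the provided starter code.
class LexiconTools {
    // A small assumed sample of very common English words.
    private static final Pattern COMMON_ENGLISH = Pattern.compile(
        "\\b(the|and|is|to|of|you|my|it)\\b", Pattern.CASE_INSENSITIVE);

    // Quick check: does the tweet contain at least one common English word?
    static boolean looksEnglish(String tweet) {
        return COMMON_ENGLISH.matcher(tweet).find();
    }

    // Keep words scoring at or above the threshold, highest score first.
    static List<String> aboveThreshold(Map<String, Double> scores, double threshold) {
        List<Map.Entry<String, Double>> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() >= threshold) kept.add(e);
        }
        // Sort descending by score.
        kept.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Double> e : kept) {
            words.add(e.getKey());
        }
        return words;
    }
}
```

For the negative list you would flip the comparison (keep scores at or below a negative threshold, most negative first). A detector this crude will let some non-English tweets through and reject some short English ones, so check its output on real data.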

Extra credit for other creative ways of learning sentiment words using signals other than emoticons!

Questions you must answer:

  1. What metric do you use to score words? I'm looking for a math equation here.
  2. What is your best accuracy?
  3. What is most frustrating about the words you learn?
  4. What other types of knowledge might help learn a better lexicon?

Code Setup

Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab4/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab4/data/ : the data sets used in this assignment

Create a lab4 directory in your local space, and copy lab4/java/ to it (cp -R /courses/nchamber/nlp/lab4/java lab4/). There is a build.xml file included, so just type ant in the java/ directory. Make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml. Make sure you can run the code. There is a run script which does this for you!

Eclipse setup: Click New->Project->"Java Project from Existing Ant Buildfile". Browse to find the build.xml file in your new lab4 directory. You are ready to go! Open up TrainTest.java to see where you will place your code.

Evaluator.java: Computes precision/accuracy for you.
LabeledTweet.java: A single tweet and its sentiment label.
Datasets.java: Code that reads the Opinion Lexicon for you, and also reads raw unlabeled tweets for you. Just call getNextRawTweet() and it returns a single tweet (up to 8 million of them)!
TrainTest.java: This is where you will write all of your code.

How to Run the Code

The run script makes this very easy. There are three modes:
run (calls labelWithLexicon() with Opinion Lexicon)
run -usemylex (calls labelWithLexicon() with your learned lexicon)
run -learnmylex (calls learnTheLexicon())

What to turn in

  1. A single text file results.txt with answers to the questions in Parts 1 and 2. Note that this lab is more question-heavy than previous labs. I'm looking for thorough answers, and more time spent looking at your output and the text itself.
  2. Your lab4/ directory containing just your java/ subdirectory. The code should of course compile and run with the run script provided to you.

How to turn in

Use the submit script: /courses/nchamber/submit

Create a directory for your lab called lab4 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab4:
    /courses/nchamber/submit  lab4

Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.

Grading

Milestone (part one) on time: 4pts

Part One: 16 pts

Part Two: 24 pts

Compiling Penalty: your code does not compile: -10 pts
Extra Credit: top performers in class on both tasks of Part One and Part Two

Total: 44 pts