SI425, Fall 2017
Due date: the start of class, Oct 12
Sentiment analysis is one of the newest and now one of the most well-known applications of machine learning and NLP techniques. Automatic means of identifying whether text is positive or negative (happy or sad) are used by corporations to judge consumer reactions, by politicians to judge voter sentiment, and even by political scientists to study population behavior. This lab will show you how difficult the task actually is... if you want to do it right.
You will experiment with lexicons, lists of positive/negative words. You will try out a popular lexicon, and then try to learn your own. The keyword here is "try", as this is not straightforward. Your goal is to use a lexicon to label text with happy and sad emotion. You'll do your best, and quickly realize that a lexicon only gets you so far. Your ultimate goal is to learn your own lexicon automatically. You have millions of unlabeled tweets and will use what is called distant supervision to learn a custom lexicon. Although you don't have labels, many tweets contain implicit labels in the form of emoticons (smiley faces). You will use these to identify happy/sad words and build a lexicon and/or classifier from them.
Milestone: Part One is due Thursday Oct 5. NOTE: Part One is easier than Part Two. Do not spend the whole first week on Part One alone.
First off, Twitter text is very noisy and needs to be cleaned up. It is full of things like URLs and usernames. You must write some text processing code that tokenizes the text (splits up words and punctuation), normalizes links/usernames to some constant token, and handles any other noise you notice.
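As a rough illustration of the cleanup step described above, here is a minimal sketch of a tokenizer. The class name TweetCleaner, the placeholder tokens <url> and <user>, and the regular expressions are all assumptions made for this example; none of them come from the starter code, and you will likely need to refine them once you look at real tweets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the provided starter code.
public class TweetCleaner {
    // Links: anything starting with http(s):// or www. up to the next whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+|www\\.\\S+");
    // Usernames: an @ followed by word characters.
    private static final Pattern USER = Pattern.compile("@\\w+");
    // A token is a placeholder, a word (apostrophes allowed), or a run of
    // punctuation such as "!!!" or ":)".
    private static final Pattern TOKEN =
        Pattern.compile("<url>|<user>|[\\w']+|[^\\w\\s]+");

    public static List<String> tokenize(String tweet) {
        String text = tweet.toLowerCase();
        // Normalize links and usernames to constant tokens (arbitrary choices).
        text = URL.matcher(text).replaceAll(" <url> ");
        text = USER.matcher(text).replaceAll(" <user> ");

        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Calling TweetCleaner.tokenize("@Bob check http://t.co/abc so good!!! :)") yields the tokens <user>, check, <url>, so, good, !!!, and :). Note the limitations of this sketch: lowercasing followed by word/punctuation splitting breaks emoticons that contain letters (":D" becomes ":" and "d"), and punctuation that touches an emoticon is merged with it.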
Second, you'll use a popular lexicon to label text. I have provided you the Opinion Lexicon from Hu and Liu at the Univ of Illinois at Chicago. Look in /courses/nchamber/nlp/lab4/data/lexicon. This contains a couple thousand positive and negative words. Your task is to label tweets using the lexicon. The approach is up to you based on our class discussions. Report your final accuracy.
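One simple baseline for lexicon labeling is to count positive and negative word matches and pick the majority. The sketch below assumes the lexicon has already been loaded into two sets of lowercase words and that the tweet is already tokenized; the class LexiconLabeler and its Label enum are hypothetical and do not reflect the actual types in LabeledTweet.java or TrainTest.java.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; the starter code's types may differ.
public class LexiconLabeler {
    public enum Label { POSITIVE, NEGATIVE, UNKNOWN }

    private final Set<String> positive;
    private final Set<String> negative;

    public LexiconLabeler(Set<String> positive, Set<String> negative) {
        this.positive = positive;
        this.negative = negative;
    }

    // Adds +1 for each positive word and -1 for each negative word,
    // then labels the tweet by the sign of the total.
    public Label label(List<String> tokens) {
        int score = 0;
        for (String tok : tokens) {
            if (positive.contains(tok)) score++;
            if (negative.contains(tok)) score--;
        }
        if (score > 0) return Label.POSITIVE;
        if (score < 0) return Label.NEGATIVE;
        return Label.UNKNOWN; // tie, or no lexicon words found
    }
}
```

This baseline ignores negation ("not good"), intensity, and word order, and it has to decide what to do with ties and tweets containing no lexicon words. Those are some of the places where you can improve on it.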
Questions you must answer:
Part One used a hand-created lexicon. What if we didn't have this? What if our language was less commonly used and nobody spent months creating a lexicon of sentiment words? In this part, you will learn a lexicon straight from the text. I have provided you a training corpus of ~4 million tweets. Look in /courses/nchamber/nlp/lab4/data/tweets. Your task is to learn a list of positive and negative words specific to Twitter. You may use any approach you like, but I suggest following the emoticon approach. Look at all tweets with smiley faces, and count the words that appear in those tweets, keeping separate counts for tweets with positive emoticons and tweets with negative emoticons. You must then score each word. How? That is up to you, and you will determine the best way to score them through experimentation. Note that just counting frequencies isn't going to work, but try that first and see why it fails. Your code must go in learnTheLexicon() of TrainTest.java.
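To make the counting-and-scoring idea concrete, here is one possible sketch. It is not the required method and not necessarily the best one. It counts words separately in tweets with happy and sad emoticons, then scores each word with a smoothed log-ratio of its relative frequencies in the two groups. Unlike raw counts, this ratio gives words that are equally common on both sides (such as "the") a score near zero. The class name EmoticonLexiconLearner, the short emoticon lists, and the add-one smoothing are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the provided starter code.
public class EmoticonLexiconLearner {
    // Small illustrative emoticon lists; real tweets use many more variants.
    private static final Set<String> HAPPY =
        new HashSet<>(Arrays.asList(":)", ":-)", "=)"));
    private static final Set<String> SAD =
        new HashSet<>(Arrays.asList(":(", ":-(", "=("));

    private final Map<String, Integer> posCounts = new HashMap<>();
    private final Map<String, Integer> negCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private long posTotal = 0;
    private long negTotal = 0;

    // Counts the words of one tokenized tweet, using its emoticon as the label.
    public void observe(List<String> tokens) {
        boolean happy = false, sad = false;
        for (String t : tokens) {
            if (HAPPY.contains(t)) happy = true;
            if (SAD.contains(t)) sad = true;
        }
        if (happy == sad) return; // no emoticon, or conflicting emoticons

        Map<String, Integer> counts = happy ? posCounts : negCounts;
        for (String t : tokens) {
            if (HAPPY.contains(t) || SAD.contains(t)) continue; // skip the label itself
            counts.merge(t, 1, Integer::sum);
            vocab.add(t);
            if (happy) posTotal++; else negTotal++;
        }
    }

    // Add-one smoothed log-ratio: > 0 leans positive, < 0 leans negative.
    public double score(String word) {
        double v = Math.max(1, vocab.size());
        double p = (posCounts.getOrDefault(word, 0) + 1.0) / (posTotal + v);
        double n = (negCounts.getOrDefault(word, 0) + 1.0) / (negTotal + v);
        return Math.log(p / n);
    }
}
```

After observing the tokenized tweets [great, day, :)] and [awful, day, :(], the score of "great" is positive, the score of "awful" is negative, and the score of "day" is zero. With real data you would still need to choose thresholds, filter rare words whose scores are unreliable, and compare alternative scoring functions experimentally.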
Extra credit for other creative ways of learning sentiment words using signals other than emoticons!
Questions you must answer:
Starter Java code is provided, as well as training and test data. Make sure you can access the following directories:
/courses/nchamber/nlp/lab4/java/ : the Java code provided for this course
/courses/nchamber/nlp/lab4/data/ : the data sets used in this assignment
Create a lab4 directory in your local space, and copy lab4/java/ to it:
cp -R /courses/nchamber/nlp/lab4/java lab4/
There is a build.xml file included, so just type ant in the java/ directory. Make sure it compiles without error. Ant compiles all .java files in this directory structure, so you shouldn't have to change build.xml. Make sure you can run the code. There is a run script which does this for you!
Eclipse setup: Click New->Project->"Java Project from Existing Ant Buildfile". Browse to find the build.xml file in your new lab4 directory. You are ready to go! Open up TrainTest.java to see where you will place your code.
Evaluator.java: Computes precision/accuracy for you.
LabeledTweet.java: A single tweet and its sentiment label.
Datasets.java: Code that reads the Opinion Lexicon for you, and also reads raw unlabeled tweets for you. Just call getNextRawTweet() and it returns a single tweet (up to 8 million of them)!
TrainTest.java: This is where you will write all of your code.
Use the run script and it is very easy. There are three modes:
run (calls labelWithLexicon() with Opinion Lexicon)
run -usemylex (calls labelWithLexicon() with your learned lexicon)
run -learnmylex (calls learnTheLexicon())
Use the submit script: /courses/nchamber/submit
Create a directory for your lab called lab4 (all lowercase). When you are ready to turn in your code, execute the following command from the directory one level above lab4:
Double-check the output from this script to make sure your lab was submitted. It is your responsibility if the script fails and you did not notice. The script will print out "Submission successful." at the very end of its output. If you don't see this, your lab was not submitted.
Milestone (part one) on time: 4pts
Part One: 16 pts
Part Two: 24 pts
Compiling Penalty: your code does not compile: -10 pts
Extra Credit: top performers in class on both tasks of Part One and Part Two
Total: 44 pts