Milestone: Part 1 by start of lab day, Oct 3
All Due: in two weeks, Oct 10
Sentiment analysis is now one of the best-known applications of NLP. Systems that automatically identify whether text is positive or negative (happy or sad) are used by corporations to judge consumer reactions, by politicians to gauge voter sentiment, and even by political scientists to study population behavior. This lab will show you how complex and difficult the task actually is.
You will build two types of classifiers for positive, negative, and objective sentiment: the first simply uses a sentiment lexicon, and the second uses logistic regression. You will see how different the two approaches are, and get a good sense of the challenges and the tools available to you for sentiment identification. Your dataset consists of millions of tweets, so you'll also get exposure to the infamously informal language of social media.
Generative AI: GenAI tools are permitted as code assistants on this lab (but you are not required or expected to use them). Your grade on this lab is based on your ability to code within the given frameworks, and to achieve good performance. All GenAI usage must be cited in the code through clear code comments.
Download this tarball for the code and the data (790MB).
If that's too big, you can download this version with only one file of tweets (132MB). This is a temporary fix; you will probably want to resolve your disk-space issues before you train your Part 2 model so you can use more tweets than just this one file (there are 6 in the original download).
python3 sentiment.py -lexicon    # Part 1
python3 sentiment.py -learn      # Part 2
You have a lexicon of sentiment words! Can you create a sentiment classifier with it, and how well does it work?
First, Twitter text is very noisy and needs to be cleaned up. It is full of things like URLs and usernames. Write a text processing function that tokenizes the text (splits up words and punctuation), normalizes links and usernames to some constant token, and cleans up anything else you notice.
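A minimal sketch of such a cleaning function (the function name, the `<url>`/`<user>` placeholder tokens, and the regexes here are illustrative choices, not part of the provided sentiment.py):

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
USER_RE = re.compile(r'@\w+')

def clean_tweet(text):
    """Normalize noisy Twitter text, then tokenize it."""
    text = URL_RE.sub(' <url> ', text)    # replace links with a constant token
    text = USER_RE.sub(' <user> ', text)  # replace @usernames with a constant token
    text = text.lower()
    # split words from punctuation; each punctuation mark becomes its own token
    return re.findall(r"<url>|<user>|[a-z']+|[0-9]+|[^\sa-z0-9<>]", text)
```

You will likely extend this as you inspect the data (hashtags, repeated letters like "soooo", etc.).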
Second, use a popular lexicon to label tweets for their sentiment. I provide you the Opinion Lexicon from Hu and Liu at the Univ of Illinois at Chicago. It contains a couple thousand positive and negative words. Your task is to label tweets using the lexicon. The approach is up to you based on our class discussions. Report your final accuracy. The given sentiment.py reads the lexicon in already and just calls your function with it.
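One simple lexicon-based strategy is a majority vote over lexicon hits. A sketch (the function name and the tie/no-hit handling are assumptions you may change based on our class discussions):

```python
def lexicon_label(tokens, pos_words, neg_words):
    """Label a tokenized tweet by counting positive vs. negative lexicon words."""
    pos = sum(1 for t in tokens if t in pos_words)
    neg = sum(1 for t in tokens if t in neg_words)
    if pos > neg:
        return 'positive'
    if neg > pos:
        return 'negative'
    return 'objective'   # no hits or a tie: one possible (tunable) choice
```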
Tasks:
Questions you must answer in readme.txt:
Target Accuracy: 60% or higher
Part One used a hand-created lexicon. What if we don't have a lexicon? Your task is now to create a logistic regression classifier.
Without a lexicon, we need some training data. I provide you a training corpus of ~4 million tweets here. You will train a classifier to identify positive and negative tweets, but the trick here is that we don't have a lexicon nor do we have labels on the tweets! In the author lab, each passage had an author to learn from, but these are just raw unlabeled tweets. How do we get labels?
There is a type of learning called distant supervision that looks for a substitute signal when the true label is unknown. Here we want positive and negative. Are there any reliable signals in tweets that would show us some emotional tweets? Smiley faces! Find all tweets with smiley faces :) and treat them as positive examples. Do the same for sad faces :(. You can consider other variants of faces too in order to expand your training data size. Ignore all other tweets (which is most of them).
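The emoticon-harvesting step might look like this (the set of face variants and the rule for mixed-emoticon tweets are illustrative choices):

```python
# Illustrative emoticon variants; expand these to grow your training set.
POS_FACES = (':)', ':-)', ':D', '=)')
NEG_FACES = (':(', ':-(', '=(')

def distant_label(tweet):
    """Return 'positive', 'negative', or None (skip) for a raw tweet string."""
    has_pos = any(face in tweet for face in POS_FACES)
    has_neg = any(face in tweet for face in NEG_FACES)
    if has_pos and not has_neg:
        return 'positive'
    if has_neg and not has_pos:
        return 'negative'
    return None  # no emoticon, or both kinds: ignore this tweet
```

When you train, consider stripping the emoticons themselves from the text so the classifier learns the surrounding words rather than the faces.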
Step 1: train a logistic regression classifier on tweets with smiley/sad faces
Step 2: use your trained classifier to identify positive and negative tweets ... as well as objective tweets (see below)
Step 3: fine-tune your logistic regression arguments for better performance. Unigrams? Bigrams? Both? etc.
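Steps 1 and 3 can be sketched with scikit-learn (the tiny inline training set is a placeholder for your distantly-labeled tweets, and `ngram_range` is the unigram/bigram knob mentioned above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data: in the lab these come from your emoticon-labeled tweets.
train_texts = ["love this", "great day", "hate this", "so sad"]
train_labels = ["positive", "positive", "negative", "negative"]

# ngram_range is a main tuning knob: (1, 1) = unigrams, (1, 2) = unigrams + bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(train_texts)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, train_labels)

# New tweets must go through the SAME fitted vectorizer before prediction.
print(classifier.predict(vectorizer.transform(["love great day"])))
```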
A note on Step 2: don't forget about objective tweets. We don't have training data for these, so use thresholds on your positive and negative decisions: your code should predict objective when the logistic regression probability is not high enough to be confident in its decision. For example, given the probabilities below for four tweets (columns are [P(negative), P(positive)]), the most likely label for each is:

    [[0.32 0.68]    positive
     [0.41 0.59]    positive
     [0.88 0.12]    negative
     [0.23 0.77]]   positive
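One way to turn such probabilities into three-way decisions is a confidence cutoff (a sketch; the helper name and the 0.7 threshold are assumptions you should tune on real data):

```python
def predict_with_objective(classifier, X, threshold=0.7):
    """Map probabilities to positive/negative/objective labels.

    Assumes a fitted scikit-learn classifier, whose classes_ attribute
    gives the column order of predict_proba's output.
    """
    probs = classifier.predict_proba(X)
    labels = []
    for row in probs:
        best = row.argmax()
        if row[best] >= threshold:
            labels.append(classifier.classes_[best])  # confident decision
        else:
            labels.append('objective')                # not confident enough
    return labels
```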
Possible extra credit for other creative ways of learning sentiment words using signals other than smiley/sad faces!
Questions you must answer in readme.txt:
Target Accuracy: approaching 60%
Upload all to the submit system.
submit -c=SI425 -p=lab04 readme.txt sentiment.py
Milestone (part one) on time: 10%
Part One: 40%
Part Two: 40%
Questions: 10%
Extra Credit: top performers in class on either task of Part One and Part Two