The Political Twitter Corpus

Publication

A formal analysis of this data appears in the following publication. If you find this data useful, please cite:

Micol Marchetti-Bowick and Nathanael Chambers.
Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter.
In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). Avignon, France. 2012.

Data Format

Each file contains ~2000 tweets, one tweet per line. A line contains two fields separated by a single tab character: the label, and the text of the tweet.

POLIT RT @AdamSmithInst Quote of the week: My political opinions lean more and more towards Anarchy
NOT @DeeptiLamba LOL, I like quotes. Feminist, anti-men quotes.

The two labels are POLIT (political) and NOT (not political).
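
As an illustration of the format described above, here is a minimal Python sketch that splits each line on its first tab and checks the label. The function names (parse_lines, label_counts) are hypothetical and not part of the corpus distribution, and the file name and UTF-8 encoding mentioned in the final comment are assumptions rather than documented facts about the files.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# The two labels documented for this corpus.
VALID_LABELS = {"POLIT", "NOT"}


def parse_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) pairs from tab-separated corpus lines."""
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines (an assumption; files may have none)
        # Split on the first tab only, so any later tab stays in the tweet text.
        label, sep, text = line.partition("\t")
        if not sep or label not in VALID_LABELS:
            raise ValueError(
                f"line {line_number}: expected LABEL<TAB>text, got {line!r}"
            )
        yield label, text


def label_counts(lines: Iterable[str]) -> Counter:
    """Count how many tweets carry each label."""
    return Counter(label for label, _ in parse_lines(lines))


# The two example lines shown above, with the tab written explicitly.
sample = [
    "POLIT\tRT @AdamSmithInst Quote of the week: My political opinions "
    "lean more and more towards Anarchy",
    "NOT\t@DeeptiLamba LOL, I like quotes. Feminist, anti-men quotes.",
]

for label, text in parse_lines(sample):
    print(label, "|", text)

# To read a downloaded file instead (hypothetical name, assumed encoding):
#     with open("corpus.txt", encoding="utf-8") as f:
#         counts = label_counts(f)
```

Splitting on only the first tab is a defensive choice: the label never contains a tab, so everything after the first separator is treated as tweet text.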

Download

Two corpora are available. The General Tweet Corpus is a random sample of 2000 tweets from Twitter's "spritzer" feed, collected between June 1, 2009 and Dec 31, 2009. The Keyword Tweet Corpus is not drawn from the entire feed; it is a random sample from the subset of tweets that contain at least one political keyword.

(Note that the Keyword Tweet Corpus contains 2004 tweets, not 2000 tweets as the 2012 EACL paper mistakenly states. The General Tweet Corpus contains 2000 tweets.)

Politics General Tweet Corpus

Politics Keyword Tweet Corpus

Corpus Prediction of Approval Polls

The graph below, taken from the paper above, shows a trained sentiment classifier predicting presidential approval polls. The two tweet corpora above were used for an initial analysis of performance on general political topic identification.
