The Twitter Political Corpus

This page provides resources for developing learning algorithms that link political statements on Twitter to general opinions about government and politicians. We provide two datasets of tweets that have been hand-labeled by topic, specifically, as either discussing politics or not discussing politics.


A formal analysis can be found in the following publication. If you find this data useful, please cite:

Micol Marchetti-Bowick and Nathanael Chambers.
Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter.
In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Avignon, France. 2012.

Data Format

Each file contains ~2000 tweets, one tweet per line. A line contains two fields separated by a single tab character: the label, and the text of the tweet.

POLIT     RT @AdamSmithInst Quote of the week: My political opinions lean more and more towards Anarchy
NOT     @DeeptiLamba LOL, I like quotes. Feminist, anti-men quotes.

The two labels are POLIT (political) and NOT (not political).
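
The format above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `parse_corpus` and the `VALID_LABELS` set are illustrative names introduced here, and the sample lines are the two examples shown above rewritten with an explicit tab separator.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# The two labels described on this page.
VALID_LABELS = {"POLIT", "NOT"}


def parse_corpus(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'LABEL<TAB>tweet text' lines into (label, text) pairs."""
    pairs = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, in case the tweet text contains tabs.
        label, sep, text = line.partition("\t")
        label = label.strip()
        if not sep or label not in VALID_LABELS:
            raise ValueError(
                f"line {line_number}: expected 'LABEL<TAB>text', got {line!r}"
            )
        pairs.append((label, text.strip()))
    return pairs


# The two example lines from this page, with a literal tab as the separator.
sample = [
    "POLIT\tRT @AdamSmithInst Quote of the week: My political opinions lean more and more towards Anarchy\n",
    "NOT\t@DeeptiLamba LOL, I like quotes. Feminist, anti-men quotes.\n",
]

pairs = parse_corpus(sample)
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

To read a downloaded corpus file, pass an open file handle instead of the sample list, for example `with open(path, encoding="utf-8") as f: pairs = parse_corpus(f)`. Here `path` stands for wherever you saved the file, and the UTF-8 encoding is an assumption that may need adjusting.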


Two corpora are available. The first, the General Tweet Corpus, is a randomly selected set of 2000 tweets from Twitter's "spritzer" feed, collected between June 1, 2009 and Dec 31, 2009. The second, the Keyword Tweet Corpus, is not drawn from the entire feed; it is a random sample from the subset of tweets that contain at least one political keyword.

(Note that the Keyword Tweet Corpus contains 2004 tweets, not 2000 tweets as the 2012 EACL paper mistakenly states. The General Tweet Corpus contains 2000 tweets.)

Politics General Tweet Corpus

Politics Keyword Tweet Corpus

Corpus Prediction of Approval Polls

Below is a graph from the above paper showing a trained sentiment classifier predicting presidential approval polls. We used the two tweet corpora above for an initial analysis of the classifier's performance on general political topic identification.