The Twitter Username Alias Dataset

This page contains the username/real-name dataset from our publication "Aligning Entity Names with Online Aliases on Twitter". The dataset contains 110k Twitter usernames with their profile names, as well as 110k false username/profile pairs. All usernames were taken from public Twitter pages and aligned with the names entered on their profiles. The dataset is well suited to research on how people create aliases or nicknames for themselves.

Publication

A formal analysis can be found in the following publication. If you find this data useful, please cite:

Kevin McKelvey, Peter Goutzounis, Stephen da Cruz, and Nathanael Chambers.
Aligning Entity Names with Online Aliases on Twitter.
In the 5th International Workshop on NLP for Social Media. Valencia, Spain. 2017.

Data Format

We provide three files of usernames: a 160k training set, a 23k development set for testing models, and a 40k test set for final evaluation of performance.

Each file contains one username per line. Each line has three fields separated by single semicolon characters: the user's full name as entered on their profile, the user's username, and the word 'correct' or 'incorrect'. A 'correct' label marks a true username/profile pair, and 'incorrect' marks one of the false pairs.

Emily;@Emilllygibson;correct
Teen Sephirtoh Neo;@BoyNeoSephiroth;correct
Robin BinBin;@pharmaconbg;incorrect
Chloe Williams;@pepitobarcelona;incorrect

The above shows two examples for each of the two labels: 'correct' and 'incorrect'.
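
For readers who want to load the files programmatically, the following is a minimal Python sketch of a parser for the line format described above. It is not part of the original release: the names AliasPair, parse_line, and parse_lines are hypothetical, and splitting from the right is an assumption made so that a semicolon appearing inside a profile name would not break the parse.

```python
from typing import Iterable, Iterator, NamedTuple


class AliasPair(NamedTuple):
    """One dataset line: profile name, username, and label (hypothetical type)."""
    name: str
    username: str
    is_correct: bool


def parse_line(line: str) -> AliasPair:
    """Parse a single 'name;username;label' line."""
    # Split from the right at most twice. Assumption: only the profile name
    # could contain a stray semicolon; the username and label never do.
    name, username, label = line.rstrip("\r\n").rsplit(";", 2)
    if label not in ("correct", "incorrect"):
        raise ValueError(f"unexpected label: {label!r}")
    return AliasPair(name, username, label == "correct")


def parse_lines(lines: Iterable[str]) -> Iterator[AliasPair]:
    """Parse an iterable of lines, skipping blank ones."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


# The four example lines shown above.
sample = [
    "Emily;@Emilllygibson;correct",
    "Teen Sephirtoh Neo;@BoyNeoSephiroth;correct",
    "Robin BinBin;@pharmaconbg;incorrect",
    "Chloe Williams;@pepitobarcelona;incorrect",
]

pairs = list(parse_lines(sample))
num_correct = sum(pair.is_correct for pair in pairs)
print(f"{num_correct} correct out of {len(pairs)} pairs")
```

To read one of the downloaded files, pass an open file handle to parse_lines, since iterating over a text file yields its lines. The UTF-8 encoding is an assumption; adjust it if the files decode incorrectly.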

Download

Download the zip file of the train/dev/test files here.