Adversarial Phone Extraction

Code and Data from my paper on adversarial text extraction.

Nate Chambers, Department of Computer Science, US Naval Academy


This page provides data and code for our work on phone number extraction. The research investigated methods to identify the true phone number in adversarial text, such as the example shown below.

This work won a Best Paper Award at the Workshop on Noisy User-Generated Text (W-NUT) in 2019. Please cite as follows if you find the data or code useful:

Nathanael Chambers, Timothy Forman, Catherine Griswold, Kevin Lu, and Stephen Steckler.
Character-Based Models for Adversarial Phone Number Extraction: Preventing Human Sex Trafficking.
In Proceedings of the 5th Workshop on Noisy User-Generated Text. Hong Kong. 2019 [read the PDF]


Download this zip file of starter data and a data-generation script.

Run the ./ script and it will generate the following for you:

Artificially Generated Data (artificial/)

After running the scripts, you'll see the following data files in the artificial/ subdirectories (notpadded/ and padded/):


Files and their experiment usage:

  1. rnn.train and crf.train

     Non-padded snippets, used in the W-NUT 2019 experiments.

  2. padded.rnn.train and padded.crf.test

     Padded snippets surrounded by random ad text, for example:

     ? - m4w (Seattle) QR Code s&1SFive13$ix5(172)8 IK? - m4w QR Code Lin

  3. rnn.u10.train, padded.rnn.u10.train, crf.u10.train, and padded.crf.u10.train

     Duplicates of the above, but with 10% unicode inserted.


FORMAT: the RNN files should be straightforward to understand. Each line is one phone number and contains two tab-separated values: the first is the gold digits and the second is the adversarial snippet:

label   obscured
6513651728      s&1SFive13$ix5(172)8
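As a sketch, a file in this format can be read with a few lines of Python (the function name `read_rnn_file` is just for illustration):

```python
def read_rnn_file(lines):
    """Parse lines of 'gold-digits<TAB>adversarial-snippet' into pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("label"):
            continue  # skip blank lines and the header row
        gold, obscured = line.split("\t")
        pairs.append((gold, obscured))
    return pairs
```

For example, `read_rnn_file(open("rnn.train"))` would return a list of (gold, obscured) string pairs.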

The CRF files do not have one phone number per line; instead, each line holds a single character. Each phone number has a unique ID, and all consecutive lines with the same ID belong to that phone number:

Sentence #      Word    Tag
Sentence: 1     s       B6
Sentence: 1     e       I6
Sentence: 1     i       I6
Sentence: 1     S       I6

The above lines are part of the 1st phone number, showing the 4 characters 'seiS'. The first character has the 'B6' label (begin 6) and the rest are labeled I6 (inside 6). Generally you don't need to look at these files; you will just feed them into the neural network programs.
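For illustration, here is one way those per-character rows could be grouped back into one character/tag sequence per phone number. The function name and the whitespace-based column splitting are assumptions based on the excerpt above; check the generated files for the exact delimiter:

```python
from collections import defaultdict

def read_crf_file(lines):
    """Group per-character rows into {sentence_id: [(char, tag), ...]}."""
    numbers = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("Sentence #"):
            continue  # skip blank lines and the header row
        # Columns look like: "Sentence: 1     s       B6"
        _, sent_id, char, tag = line.split()
        numbers[sent_id].append((char, tag))
    return dict(numbers)
```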

Real-World Adversarial Phone Number Data (real/)

Look in the real/ subdirectory for a DEV and TEST set of real-world examples of adversarial phone numbers. As this data is difficult to obtain, the files are small. See our published work above for details. The TEST file is intended to be held-out and not looked at except for final experiments. You may use the DEV file for experiment setup and parameter searching.

The zip file above contains real/DEV.txt and real/TEST.txt, as well as "challenge" versions where a percentage of the characters were replaced with unicode lookalikes. The number in the filename indicates the percent (e.g., DEVchallenge20.txt is 20% unicode).
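As a rough illustration of what the challenge files contain, the sketch below substitutes a percentage of characters with unicode lookalikes. The tiny `LOOKALIKES` map and the function name are invented for this example; the real challenge data draws from a much larger set of confusable characters:

```python
import random

# Tiny illustrative map (Cyrillic lookalikes); the real data uses many more.
LOOKALIKES = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}

def add_unicode_noise(text, percent, seed=0):
    """Replace roughly percent% of the substitutable characters."""
    rng = random.Random(seed)
    chars = list(text)
    eligible = [i for i, c in enumerate(chars) if c in LOOKALIKES]
    k = round(len(eligible) * percent / 100)
    for i in rng.sample(eligible, k):
        chars[i] = LOOKALIKES[chars[i]]
    return "".join(chars)
```

The substituted text is visually near-identical but byte-wise different, which is what makes these files a harder test set.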


You need two GitHub repos to run these programs.

  1. Phone Number models: the Python code for all the learning models

  2. UnicodeViz database: needed only if you want to run the image-based CNN models; it contains 30k unicode images

The code was developed on Python 3.5.2 with TensorFlow 1.14.0 and Keras 2.2.4. It may not work on later versions of the libraries.

Create an environment variable UNICODEVIZ that points to your UnicodeViz download.

Examples of Running the Code

(baseline) Training the base LSTM without CRF:

python3 -con -att train rnn.train

(better model) Training the CRF:

python3 train crf.train

(better model) Training the CRF with CNN visual characters:

export UNICODEVIZ='path/to/repo/imgs/'
python3 -cnn -jiggle train crf.train

Testing a trained CRF model (modelname is output from the above training):

python3 [-cnn] test <modelname> rnn.test

Running with different parameters

The two programs each take command-line parameters to experiment with different network setups and hyperparameters. Look at the top of the code itself for all the options. One explicit example is here:

python3 -cnn -jiggle -e 100 -i 300 -d .2 -gpu 2 train crf.train

The above trains a CRF with CNNs on the bottom, so will use images as the input. It will also jiggle the images (required for good CNN training). The final layer of the CNN learning will be size 100 (-e) and the biLSTM internal representation will have dimension 300 (-i). Learning uses 0.2 dropout (-d) and this will run on the 3rd GPU (-gpu 2).