Adversarial Phone Extraction

Code and Data from my paper on adversarial text extraction.

Nate Chambers, Department of Computer Science, US Naval Academy

Overview

This page provides data and code for our work on phone number extraction. The research investigated methods to identify the true phone number in adversarial text, where the digits are deliberately obscured (for example, s&1SFive13$ix5(172)8 hides the number 6513651728).

This work won a Best Paper Award at the Workshop for Noisy User-Generated Text in 2019. Please cite as follows if you find the data or code useful:

Nathanael Chambers, Timothy Forman, Catherine Griswold, Kevin Lu, and Stephen Steckler.
Character-Based Models for Adversarial Phone Number Extraction: Preventing Human Sex Trafficking.
In Proceedings of the 5th Workshop on Noisy User-Generated Text. Hong Kong. 2019 [read the PDF]

Data

Download this zip file of starter data and a data-generation script.

Run the ./generateData.sh script and it will generate the following for you:

Artificially Generated Data (artificial/)

After running the generateData.sh script, you'll see the following data files in the artificial/ subdirectories (notpadded/ and padded/):

Files	Experiment Usage	Example
rnn.train, crf.train, rnnANDcrf.test	Non-padded snippets, W-NUT 2019 experiments	s&1SFive13$ix5(172)8
padded.rnn.train, padded.crf.test, padded.rnnANDcrf.test	Padded snippets with random ad text	? - m4w (Seattle) QR Code s&1SFive13$ix5(172)8 IK? - m4w QR Code Lin
rnn.u10.train, padded.rnn.u10.train, crf.u10.train, padded.crf.u10.train	Duplicates but with 10% Unicode inserted	ｓ&1SFiv𝓮1Ȝ$ix5(172)৪

FORMAT: the RNN files should be straightforward to understand. Each line is one phone number and contains two tab-separated values: first the gold digits, then the adversarial snippet:

label   obscured
6513651728      ｓ&1SFive13$ix5(172)8
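As a rough illustration of the RNN format, the sketch below reads such a file into (gold digits, snippet) pairs. It assumes a single tab separates the two columns and that the first row may be the "label / obscured" header shown above; the function name read_rnn_pairs is hypothetical and is not part of the released code.

```python
import io


def read_rnn_pairs(handle):
    """Yield (gold_digits, obscured_snippet) pairs from an RNN-format file.

    Assumption: columns are separated by one tab, and the first row may be
    the "label<TAB>obscured" header.
    """
    for line_number, line in enumerate(handle):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only; the snippet itself is kept untouched,
        # since padded snippets can contain spaces.
        label, _, obscured = line.partition("\t")
        if line_number == 0 and label.strip() == "label":
            continue  # skip the header row
        yield label.strip(), obscured


# Small in-memory sample copied from the example above.
sample = "label\tobscured\n6513651728\tｓ&1SFive13$ix5(172)8\n"
pairs = list(read_rnn_pairs(io.StringIO(sample)))
print(pairs)
```

For a file on disk, pass an open handle instead, e.g. open("rnn.train", encoding="utf-8"), since the u10 files contain non-ASCII characters.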

The CRF files do not have one phone number per line; instead, they show one character per line. Each phone number has a unique ID, and all consecutive lines with the same ID belong to that phone number:

Sentence #      Word    Tag
Sentence: 1     s       B6
Sentence: 1     e       I6
Sentence: 1     i       I6
Sentence: 1     S       I6

The lines above are part of the 1st phone number. They show the 4 characters 'seiS': the first carries the label 'B6' (begin 6) and the rest carry 'I6' (included in 6). Generally you don't need to look at these files; you will just feed them into the neural network programs.
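If you do want to inspect a CRF file, the following sketch groups consecutive lines by their ID and reads the digits back from the tags. It assumes the three columns are tab-separated and that every digit starts with a tag of the form B followed by the digit; any other tags are simply ignored. The function names are hypothetical and not taken from the released code.

```python
import io
from itertools import groupby


def read_crf_sequences(handle):
    """Return [(sentence_id, [(char, tag), ...]), ...] in file order.

    Assumption: each row is "<id>\t<char>\t<tag>", with an optional
    "Sentence #" header row.
    """
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3 or fields[0] == "Sentence #":
            continue  # skip blank or malformed lines and the header
        rows.append(tuple(fields))
    # Consecutive rows with the same ID form one phone number.
    return [
        (sid, [(ch, tag) for _, ch, tag in group])
        for sid, group in groupby(rows, key=lambda row: row[0])
    ]


def decode_digits(tagged_chars):
    """Emit one digit for every B<digit> tag; I<digit> tags continue it."""
    return "".join(
        tag[1:]
        for _, tag in tagged_chars
        if tag.startswith("B") and tag[1:].isdigit()
    )


sample = (
    "Sentence #\tWord\tTag\n"
    "Sentence: 1\ts\tB6\n"
    "Sentence: 1\te\tI6\n"
    "Sentence: 1\ti\tI6\n"
    "Sentence: 1\tS\tI6\n"
)
sequences = read_crf_sequences(io.StringIO(sample))
for sentence_id, chars in sequences:
    print(sentence_id, decode_digits(chars))
```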

Real-World Adversarial Phone Number Data (real/)

Look in the real/ subdirectory for a DEV and TEST set of real-world examples of adversarial phone numbers. As this data is difficult to obtain, the files are small. See our published work above for details. The TEST file is intended to be held-out and not looked at except for final experiments. You may use the DEV file for experiment setup and parameter searching.

The zip file above contains real/DEV.txt and real/TEST.txt as well as "challenge" versions in which a percentage of the characters were replaced with Unicode lookalikes. The number in the filename indicates the percentage (e.g., DEVchallenge20.txt is 20% Unicode).
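To make the idea of a lookalike substitution concrete, here is a minimal sketch. It is not the procedure used to build the challenge files, whose details are not given on this page. The tiny LOOKALIKES table is hypothetical and contains only the four substitutions visible in the example snippet above, and the sketch flips an independent coin for each replaceable character rather than replacing an exact share of all characters.

```python
import random

# Hypothetical table: only the substitutions visible in this page's example.
LOOKALIKES = {"s": "ｓ", "e": "𝓮", "3": "Ȝ", "8": "৪"}


def substitute_lookalikes(text, percent, rng):
    """Replace each character that has a lookalike with probability percent/100."""
    out = []
    for ch in text:
        # rng.random() is in [0, 1), so percent=100 always substitutes
        # and percent=0 never does.
        if ch in LOOKALIKES and rng.random() * 100 < percent:
            out.append(LOOKALIKES[ch])
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # fixed seed so runs are repeatable
print(substitute_lookalikes("s&1SFive13$ix5(172)8", 20, rng))
```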

Code

You need two GitHub repos to run these programs.

Phone Number models: the Python code for all learning models
UnicodeViz database: (if you want to run the image-based CNN models) this contains 30k Unicode images

The code was developed on Python 3.5.2 with TensorFlow 1.14.0 and Keras 2.2.4. It may not work on later versions of the libraries.

Create an environment variable UNICODEVIZ that points to your UnicodeViz download.
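Before launching a CNN run, it can help to confirm that the variable is set and points at a real directory. The check below is only a convenience sketch; the function unicodeviz_dir is hypothetical and is not how the released programs read the variable.

```python
import os


def unicodeviz_dir(environ=os.environ):
    """Return the UNICODEVIZ path, or raise if it is unset or not a directory."""
    path = environ.get("UNICODEVIZ")
    if not path:
        raise RuntimeError("UNICODEVIZ is not set")
    if not os.path.isdir(path):
        raise RuntimeError("UNICODEVIZ is not a directory: {}".format(path))
    return path
```

Calling unicodeviz_dir() with no arguments checks the current process environment.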

Examples of Running the Code

(baseline) Training the base LSTM without CRF:

python3 phoneRNN.py -con -att train rnn.train

(better model) Training the CRF:

python3 phoneCRF.py train crf.train

(better model) Training the CRF with CNN visual characters:

export UNICODEVIZ='path/to/repo/imgs/'
python3 phoneCRF.py -cnn -jiggle train crf.train

Testing a trained CRF model (modelname is output from the above training):

python3 phoneCRF.py [-cnn] test <modelname> rnn.test

Running with different parameters

The two programs, phoneCRF.py and phoneRNN.py, each take command-line parameters to experiment with different network setups and hyperparameters. Look at the top of the code itself for all the options. One explicit example is here:

python3 phoneCRF.py -cnn -jiggle -e 100 -i 300 -d .2 -gpu 2 train crf.train

The above trains a CRF with CNNs on the bottom, so it uses images as the input. It also jiggles the images (required for good CNN training). The final layer of the CNN has size 100 (-e) and the biLSTM internal representation has dimension 300 (-i). Learning uses 0.2 dropout (-d), and the run is placed on the 3rd GPU (-gpu 2).
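For a small parameter search, one option is to generate the command lines from a grid of values. The sketch below uses only the flags described above (-cnn, -jiggle, -e, -i, -d, -gpu); the helper build_crf_command and the extra grid values 200 and 0.5 are hypothetical choices for illustration, not recommended settings. It only prints the commands and does not run them.

```python
import itertools


def build_crf_command(cnn_size, lstm_size, dropout, gpu, train_file="crf.train"):
    """Build one phoneCRF.py training command as an argument list."""
    return [
        "python3", "phoneCRF.py", "-cnn", "-jiggle",
        "-e", str(cnn_size),   # size of the final CNN layer
        "-i", str(lstm_size),  # biLSTM internal dimension
        "-d", str(dropout),    # dropout rate
        "-gpu", str(gpu),      # GPU index
        "train", train_file,
    ]


# Hypothetical grid: 100/300/0.2 come from the example above, the rest are made up.
grid = itertools.product([100, 200], [300], [0.2, 0.5])
commands = [build_crf_command(e, i, d, gpu=2) for e, i, d in grid]

for cmd in commands:
    print(" ".join(cmd))
```

Each argument list can be handed to subprocess.call(cmd) to launch the run from the directory that contains phoneCRF.py.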