Code and Data from my paper on adversarial text extraction.
Nate Chambers, Department of Computer Science, US Naval Academy
This page provides data and code for our work on phone number extraction. The research investigated methods to identify the true phone number in adversarial text, such as the example shown below.
This work won a Best Paper Award at the Workshop for Noisy User-Generated Text in 2019. Please cite as follows if you find the data or code useful:
Nathanael Chambers, Timothy Forman, Catherine Griswold, Kevin Lu, and Stephen Steckler.
Character-Based Models for Adversarial Phone Number Extraction: Preventing Human Sex Trafficking.
In Proceedings of the 5th Workshop on Noisy User-Generated Text. Hong Kong. 2019 [read the PDF]
Run the ./generateData.sh script. Afterwards, you'll see the following data files in the artificial/ subdirectories (notpadded/ and padded/):
rnn.train and crf.train
Non-padded snippets, W-NUT 2019 experiments
padded.rnn.train and padded.crf.train
Padded snippets with random ad text
? - m4w (Seattle) QR Code s&1SFive13$ix5(172)8 IK? - m4w QR Code Lin
rnn.u10.train and padded.rnn.u10.train
crf.u10.train and padded.crf.u10.train
Duplicates of the above, but with 10% unicode characters inserted
FORMAT: the RNN files should be straightforward to understand. Each line is one phone number and contains two tab-separated values: the first is the gold digits and the second is the adversarial snippet:
label	obscured
6513651728	ｓ&1SFive13$ix5(172)8
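As a sanity check, a line in this format can be parsed with a couple of lines of Python (the function name here is my own, not part of the released code):

```python
# Each RNN-format line: gold digits, a tab, then the adversarial snippet.
def parse_rnn_line(line):
    gold, snippet = line.rstrip("\n").split("\t")
    return gold, snippet

gold, snippet = parse_rnn_line("6513651728\ts&1SFive13$ix5(172)8")
```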
The CRF files do not have one phone number per line; instead, they show one character per line. Each phone number has a unique ID, and all consecutive lines sharing the same ID belong to that phone number:
Sentence #   Word  Tag
Sentence: 1  s     B6
Sentence: 1  e     I6
Sentence: 1  i     I6
Sentence: 1  S     I6
The above lines are part of the 1st phone number, showing the 4 characters of 'seis'. The 'B6' label marks its start (begin 6) and the rest are labeled I6 (included in 6). Generally you don't need to look at these files; you just feed them into the neural network programs.
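For illustration, here is one way to group such lines back into per-number character/tag sequences (a sketch with a hypothetical function name, not the project's own loader):

```python
from itertools import groupby

def read_crf_lines(lines):
    """Group (char, tag) rows by sentence ID; each group is one phone number."""
    rows = [line.split() for line in lines]   # e.g. ["Sentence:", "1", "s", "B6"]
    numbers = []
    for sid, group in groupby(rows, key=lambda r: r[1]):
        numbers.append([(r[2], r[3]) for r in group])
    return numbers

numbers = read_crf_lines([
    "Sentence: 1 s B6",
    "Sentence: 1 e I6",
    "Sentence: 1 i I6",
    "Sentence: 1 S I6",
])
```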
Look in the real/ subdirectory for a DEV and TEST set of real-world examples of adversarial phone numbers. As this data is difficult to obtain, the files are small. See our published work above for details. The TEST file is intended to be held-out and not looked at except for final experiments. You may use the DEV file for experiment setup and parameter searching.
The zip file above contains real/DEV.txt and real/TEST.txt, as well as "challenge" versions in which a percentage of the characters were substituted with unicode lookalikes. The number in the filename indicates the percent (e.g., DEVchallenge20.txt is 20% unicode).
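The idea behind the challenge files (substituting a given percentage of characters with lookalikes) can be sketched as follows. The tiny homoglyph table here is illustrative only, not the actual mapping used to build the released files:

```python
import random

# Tiny illustrative lookalike table; the released challenge files draw
# on a much larger set of unicode confusables.
LOOKALIKES = {"a": "а", "e": "е", "o": "о", "s": "ѕ", "1": "１"}

def unicodeify(text, pct, seed=0):
    """Replace roughly pct% of the substitutable characters with lookalikes."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in LOOKALIKES and rng.random() < pct / 100.0:
            out.append(LOOKALIKES[ch])
        else:
            out.append(ch)
    return "".join(out)

challenge = unicodeify("call me at 6513651728", 20)
```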
You need two GitHub repos to run these programs:
Phone Number models: the Python code for all learning models
UnicodeViz database: contains 30k unicode images (needed only if you want to run the image-based CNN models)
The code was developed on Python 3.5.2 with TensorFlow 1.14.0 and Keras 2.2.4. It may not work on later versions of the libraries.
Create an environment variable UNICODEVIZ that points to your UnicodeViz download.
(baseline) Training the base LSTM without CRF:
python3 phoneRNN.py -con -att train rnn.train
(better model) Training the CRF:
python3 phoneCRF.py train crf.train
(better model) Training the CRF with CNN visual characters:
export UNICODEVIZ='path/to/repo/imgs/'
python3 phoneCRF.py -cnn -jiggle train crf.train
Testing a trained CRF model (modelname is output from the above training):
python3 phoneCRF.py [-cnn] test <modelname> crf.test
The two programs, phoneCRF.py and phoneRNN.py, each take command-line parameters for experimenting with different network setups and hyperparameters. Look at the top of the code itself for all the options. One explicit example is here:
python3 phoneCRF.py -cnn -jiggle -e 100 -i 300 -d .2 -gpu 2 train crf.train
The above trains a CRF with CNNs on the bottom, so it will use images as the input. It will also jiggle the images (required for good CNN training). The final layer of the CNN will have size 100 (-e), and the biLSTM internal representation will have dimension 300 (-i). Learning uses 0.2 dropout (-d), and the run will use the 3rd GPU (-gpu 2).
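How those flags map to values can be pictured with a small argparse sketch. The option names come from the example above, but the defaults here are my own guesses, not the scripts' actual defaults:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-cnn", action="store_true")     # use character images as input
parser.add_argument("-jiggle", action="store_true")  # randomly shift the images
parser.add_argument("-e", type=int, default=50)      # final CNN layer size
parser.add_argument("-i", type=int, default=200)     # biLSTM internal dimension
parser.add_argument("-d", type=float, default=0.5)   # dropout rate
parser.add_argument("-gpu", type=int, default=0)     # zero-indexed GPU to use
parser.add_argument("mode")                          # train or test
parser.add_argument("datafile")

args = parser.parse_args(
    "-cnn -jiggle -e 100 -i 300 -d .2 -gpu 2 train crf.train".split())
```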