Datasets for Document Timestamping

Document Timestamping

Below are the evaluation datasets used in my 2012 publication on labeling documents with timestamps. The documents come from the NYT section of the Gigaword Corpus, Fourth Edition (available from the LDC). Documents that duplicate other documents in the corpus are excluded, and the remaining documents are divided into a training/development/test split. Each of the three files lists the document IDs that belong to one of these sets.

Publication

A formal analysis can be found in the following publication. If you report new evaluations on these datasets, please cite it:

Nathanael Chambers. Labeling Documents with Timestamps: Learning from their Time Expressions.
In Proceedings of the Association for Computational Linguistics. Jeju, Republic of Korea. 2012.

Data Format

Each file contains one document ID per line. Each ID corresponds to a document in the NYT section of the Gigaword Corpus, Fourth Edition.
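Since the format is just one ID per line, reading a split takes only a few lines of code. The following is a minimal sketch, not part of the original release: the function names load_doc_ids and check_disjoint are my own, and the code assumes the files are plain UTF-8 text that may contain stray blank lines or surrounding whitespace.

```python
from pathlib import Path


def load_doc_ids(path):
    """Read one document ID per line, skipping blank lines.

    Assumption: the file is plain UTF-8 text with one ID per line.
    """
    ids = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            doc_id = line.strip()
            if doc_id:
                ids.append(doc_id)
    return ids


def check_disjoint(splits):
    """Return the pairs of split names that share at least one document ID.

    `splits` maps a split name (e.g. "train") to its list of IDs.
    An empty result means no document appears in more than one split.
    """
    names = sorted(splits)
    overlaps = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if set(splits[first]) & set(splits[second]):
                overlaps.append((first, second))
    return overlaps
```

With the three files saved locally under names of your choosing (for example train.txt, dev.txt, and test.txt, which are hypothetical names), you could call load_doc_ids on each one and pass the results to check_disjoint as a quick sanity check that the splits do not overlap before running an evaluation.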

Download

training set (725,468 docs)
development set (7,300 docs)
test set (113,000 docs)