This page contains links to download the corpus of science journalism articles created and used for work in the following papers.
The corpus also comes with the full set of feature values used in our TACL13 paper. We hope that this will enable others to replicate our results and compare against them.
The corpus is free for research purposes. Note, however, that you will need the New York Times Annotated Corpus to obtain the electronic text of the articles in our corpus. Please cite the above papers if you use this corpus.
The corpus was collected in a semi-automatic manner. The great writing category contains articles from the New York Times that were selected to appear in the "Best American Science Writing" anthologies. An index and links to online versions of the articles from some of the anthologies are here. The anthologies comprise articles published at many venues, but we only take the ones that were originally published in the New York Times.
The very good and typical categories contain articles that appeared in the New York Times around the same time as the great writing articles and on similar topics. The very good articles were written by one of the authors of the great articles, and the remaining articles are considered typical writing in the NYT.
Further filtering and article categorization were done and are explained in our D&D paper.
We have used this corpus to develop automatic methods for predicting text quality. We also believe that science journalism could be a useful genre for investigating explanatory, figurative, humorous and creative language, as well as other aspects of science writing. We hope you will find it useful in other applications as well.
The corpus linked on this page contains the identifiers of New York Times articles that we categorized by text quality. Note that it does not contain the full text of the articles, because the text was taken from another licensed corpus, the New York Times Annotated Corpus (Sandhaus 2008). You will need to obtain the NYT corpus from LDC (link to catalog) and use the text from there.
An example file identifier in our corpus looks like this: 1999_01_12_1076469.xml. It corresponds to the year 1999, month 01, day 12 and article file 1076469.xml. This information can be used to locate the corresponding XML article in the NYT corpus.
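As a rough illustration of the identifier format, the sketch below splits an identifier into its year, month, day and article file parts. The names `ArticleId`, `parse_identifier` and `nyt_relative_path` are hypothetical helpers introduced here, and the year/month/day directory layout used to build a path is only an assumption; check it against your local copy of the NYT corpus.

```python
import re
from typing import NamedTuple

# Matches identifiers such as "1999_01_12_1076469.xml".
_ID_PATTERN = re.compile(r"^(\d{4})_(\d{2})_(\d{2})_(\d+)\.xml$")


class ArticleId(NamedTuple):
    """Components of a corpus file identifier (hypothetical helper type)."""
    year: str
    month: str
    day: str
    article_file: str  # e.g. "1076469.xml"


def parse_identifier(identifier: str) -> ArticleId:
    """Split an identifier into year, month, day and article file name."""
    match = _ID_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"Unrecognized identifier: {identifier!r}")
    year, month, day, number = match.groups()
    return ArticleId(year, month, day, f"{number}.xml")


def nyt_relative_path(identifier: str) -> str:
    """Build a relative path for the article.

    ASSUMPTION: articles are stored as year/month/day/file.xml.
    Adjust this if your copy of the NYT corpus is laid out differently.
    """
    return "/".join(parse_identifier(identifier))
```

For the example identifier above, `parse_identifier` yields the parts 1999, 01, 12 and 1076469.xml, which `nyt_relative_path` joins under the assumed layout.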
Note that the current download of the corpus is based on an updated filtering method, so the numbers of articles in the categories are slightly different from those reported in the above papers.
We introduce two types of resources for text quality analysis. The first is the set of lists of great, very good and typical articles. These lists are below.
We also paired the excellent writing (great and very good articles) with topically similar articles in the typical set. For the experiments in our papers above, we used the 10 most similar typical articles for each great or very good article. This mapping is here.
This file is tab-separated, and every line contains the following fields.
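A tab-separated file of this kind can be read with Python's standard `csv` module. The minimal sketch below makes no assumption about the field names or their order; it simply yields the raw fields of each non-empty line. The function name `read_tab_separated` and the file name in the usage note are hypothetical.

```python
import csv
from typing import Iterable, Iterator, List


def read_tab_separated(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the list of tab-separated fields for each non-empty line."""
    # QUOTE_NONE treats quote characters as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # csv.reader returns [] for blank lines; skip them
            yield row
```

A possible use, assuming the mapping has been saved locally as `mapping.txt`: open it with `open("mapping.txt", encoding="utf-8", newline="")` and pass the file object to `read_tab_separated`.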
For any queries or comments about the corpus and features, please write to:
Annie Louis (firstname.lastname@example.org)
Ani Nenkova (email@example.com)