CATS: A Corpus for Analysing the Text quality of Science news articles

This page contains links to download the corpus of science journalism articles created and used for work in the following papers.

Annie Louis and Ani Nenkova, A corpus of science journalism for analyzing writing quality, Dialogue and Discourse, 2013.

Annie Louis and Ani Nenkova, What Makes Writing Great? First Experiments on Article Quality Prediction in the Science Journalism Domain, Transactions of ACL, 2013.

The corpus also comes with the full set of feature values used in our TACL13 paper. We hope that this will enable others to replicate our results and compare against them.

The corpus is free for research purposes. Note, however, that you need the New York Times Annotated Corpus to obtain the electronic text of the articles in our corpus. Please cite the above papers if you use this corpus.

Description of the corpus

The corpus contains science journalism articles, all taken from the New York Times newspaper. They are divided into three categories according to text quality.

Great writing
Very good writing
Typical writing

The corpus was collected in a semi-automatic manner. The great writing category contains articles from the New York Times that were selected to appear in the anthologies called "Best American Science Writing". An index and links to online versions of the articles from some of the anthologies are here. The anthologies comprise articles published at many venues, but we only take the ones that were originally published in the New York Times.

The very good and typical categories contain articles that appeared in the New York Times newspaper around the same time as the great writing articles and on similar topics. The very good articles are those written by an author of one of the great articles; the remaining articles are considered typical writing in the NYT newspaper.

Further filtering and article categorization steps are explained in our D&D paper.

We have used this corpus for the development of automatic methods to predict text quality. We also believe that science journalism could be a useful genre for investigating explanatory, figurative, humorous and creative language, as well as other aspects of science writing. We hope you will find it useful in other applications as well.

How to use this corpus

The corpus linked on this page contains the identifiers of New York Times articles that we categorized by text quality. Note that it does not contain the full text of the articles, because the text of the New York Times (NYT) articles was taken from another licensed corpus, the New York Times Annotated Corpus (Sandhaus 2008). You will need to obtain the NYT corpus from LDC (link to catalog) and use the text from there.

An example file identifier in our corpus looks like this: 1999_01_12_1076469.xml. This identifier corresponds to the year 1999, month 01, day 12 and article file 1076469.xml. This information can be used to locate the corresponding xml article in the NYT corpus.
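The identifier format described above can be split mechanically. The following is a minimal Python sketch, not part of the corpus distribution. The function names are hypothetical, and the year/month/day directory layout under the root folder is an assumption about how your local copy of the NYT corpus is unpacked; adjust it to match your installation.

```python
from pathlib import PurePosixPath


def split_identifier(identifier):
    """Split an identifier such as '1999_01_12_1076469.xml'
    into (year, month, day, article file name)."""
    # Split on the first three underscores only, so the file name stays whole.
    year, month, day, filename = identifier.split("_", 3)
    return year, month, day, filename


def guess_nyt_path(identifier, root="nyt_corpus/data"):
    """Build a candidate path to the article inside a local NYT corpus copy.

    ASSUMPTION: articles are stored as <root>/<year>/<month>/<day>/<file>.
    The default root is a placeholder, not a path defined by the corpus.
    """
    year, month, day, filename = split_identifier(identifier)
    return str(PurePosixPath(root) / year / month / day / filename)
```

For the example identifier, split_identifier returns the year, month, day and file name as four strings, which can then be used to look up the article in whatever layout your copy of the NYT corpus uses.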

Download

Note that the current download of the corpus is based on an updated filtering method, so the numbers of articles in the categories are slightly different from those reported in the above papers.

a) The corpus

We introduce two types of resources for text quality analysis. The first is the set of lists of great, very good and typical articles. These lists are below.

The second is a pairing of the excellent writing (great and very good articles) with topically similar articles in the typical set. For the experiments in our papers above, we used the 10 most similar typical articles for each great or very good article. This mapping is here.

This file is tab-separated, and every line contains the following fields.

(field 1) the name of a great or very good article
(field 2) 10, the number of matched typical articles
(fields 3 to 12) each contains the id of a matched typical article, followed by a ':' and then the similarity of that article to the article named in field 1.
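As an illustration of the format just described, here is a minimal Python sketch for reading one line of the mapping file. The function name is hypothetical, and the sketch assumes that similarities are written as plain decimal numbers; it is not the code used in our papers.

```python
def parse_mapping_line(line):
    """Parse one tab-separated line of the mapping file.

    Returns (article, matches), where matches is a list of
    (typical_article_id, similarity) pairs in file order.
    """
    fields = line.rstrip("\n").split("\t")
    article = fields[0]        # field 1: great or very good article
    count = int(fields[1])     # field 2: number of matched typical articles
    matches = []
    for field in fields[2:2 + count]:
        # Split on the last ':' so the article id is kept intact.
        article_id, similarity = field.rsplit(":", 1)
        # ASSUMPTION: the similarity is a plain decimal number.
        matches.append((article_id, float(similarity)))
    return article, matches
```

Applying this function to every line of the file yields, for each great or very good article, its list of matched typical articles with their similarity values.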

b) Text quality features

The computed features for all the files in our corpus (as described in our TACL13 paper) can be downloaded from here. These features include those from a number of prior studies on readability and coherence, as well as the newly proposed ones in our TACL paper. The feature names match those in our paper. Please see the relevant sections of the paper for how each feature was computed.

Contact

For any queries or comments about the corpus and features, please write to:

Annie Louis (alouis@inf.ed.ac.uk)

Ani Nenkova (nenkova@seas.upenn.edu)