The Data that Archiving Fails to Capture

Peter Buneman

University of Pennsylvania

peter@cis.upenn.edu

Introduction

When databases are useful, they get copied. Typically they do not get copied in their entirety; rather, useful subsets of data are extracted, transformed, and placed in other databases. Since storage is now relatively cheap and our tools for extracting data are, generally speaking, improving, this has led to a proliferation of "derived" databases. These databases contain little original data; their value lies in the way the data has been selected from other databases and organized into a structure that serves the needs of some individual or organization. Such databases are sometimes referred to as data warehouses. When the databases are constructed with a great deal of manual correction and supervision, they are sometimes called curated databases.

To someone interested in the preservation and archiving of data, this is good news. The information is being kept alive and there is no need to bring any new technology to bear on the problem of ensuring the longevity of data. Indeed, one can argue that historically, duplication has always been a better guarantee of data preservation than archival media or institutions (consider the loss caused by the destruction of the libraries of Alexandria, Cotton and Louvain!). Therefore, the right way to preserve data is the natural way -- to facilitate the copying and construction of new databases. One could argue further that keeping data alive by this process is natural in that it is context-sensitive. The form of the data adjusts to the context in which it is used. One can think of countless examples in which the original raw data is much less useful than its modern representation: music, literature, and so on. The original rendering is often unintelligible, and a work survives because it is constantly adjusted (re-interpreted, translated, and so on) to suit the context.

So why should the issue of derived databases be of interest to a conference on data archiving when it appears that these information sources are being naturally archived? The problem is that duplication of databases per se does not archive all the relevant data. The copying of databases typically involves extracting a subset of the data from some data source, manually cleaning it, and transforming it into a form suitable for some other data source. However, the process by which a piece of data arrives in a database, its provenance, is frequently lost. A user of a derived database may have no idea of how the data got there; worse still, the maintainers of the database may not keep this information. Knowing the provenance of a data element is crucial to one's assessment of its reliability.

An Example

The following diagram shows the interrelationships between a very small subset of the biological databases whose primary concern is genetics. The arrows between databases describe how these databases are derived from each other. It is important to remember that this extraction involves the selection and transformation of certain data elements (a database query) as well as extensive manual "cleaning" of the data. Some of the databases are general purpose -- Genbank, for example, is a general purpose sequence database -- while others, such as EpoDB (a database of genes connected with red blood cells), are specific to a research project. A genetics researcher will use the appropriate database as the most reliable source. Swissprot, for example, is regarded as the most reliable source of protein sequence data because it is heavily curated. In this figure a * indicates databases that are curated, SUB indicates that the database has some form of automatic submission process, and LIT indicates that the curators of the database may go to external sources (the literature) to augment or correct data.

Each of these databases is constantly evolving as new experimental evidence is obtained. This explains the cycles in the diagram. Data may appear first in one database, be corrected as it moves into another, and that correction is moved back into the original database. In general, the individual database curators do an excellent job of keeping old versions of their databases; the databases are sometimes available in more than one format; and it is likely that XML versions of most of these databases will shortly be available.

On the face of it, biological databases are being naturally and effectively archived. Yet despite the effort that is expended in this domain on information preservation, we are losing crucial information!

What we are losing is the linkage between the databases. How one database depends on another is a complex process involving query languages, data mining techniques, data cleaning and various forms of data translation. Taking a "data-oriented" view of the problem, when you see some data element in one of these databases, you may have no idea how it got there. Almost certainly it was extracted from some other database, which in turn extracted it from another database, and at each step some correction or transformation may have been applied. Also, the relevant data may have been available in two or more databases, and some judgment was exercised in deciding which source to use. The provenance of the data -- the process by which the data moved through the complex system of databases -- is often lost. When it is maintained, it is kept in uninterpreted comment fields and is typically partial. This information is crucial to anyone trying to assess the reliability of the databases, not least to the people (the curators) who are maintaining the databases. The tools for recording data provenance are, at best, minimal.

The Need for Data Annotation

At a recent NSF database workshop, a discussion group invented the term self aware data to describe data that carries its own history. Perhaps "self-describing" would be better, but that term has already been coined, somewhat inaccurately, by people working on data formats and semistructured data. What "self aware" means is that whenever you extract data from a database you will get not only the "face value" data -- the data you wanted -- but also some latent metadata -- metadata that describes the history of the data: where it came from, how it was transformed, who corrected it, and so on. Moreover, when you pass it on to someone else, this latent data will (perhaps automatically) be augmented with the further details of that transaction. This has consequences for how databases are constructed.

To someone coming from a document perspective of data, these desiderata are not particularly demanding. Many applications (mailers, text formatters, etc.) go a long way toward generating something like what we are advocating by adding a comment field to the document header, and a "diff" file often serves to describe how a document was modified. However, databases differ from documents in that they are typically very large and have structure. We do not just want to add the latent metadata to the whole database; we want to add it to data at any level of granularity. As an extreme example, one might want to add something like the fields specified in the Dublin Core (and then some) to each pixel in an image database, to the database itself, and to each component in between. This is a requirement that appears to call for something like a 1000-fold expansion of a terabyte database! If this example of detailed annotation appears fanciful, it is certainly the case that one wants to add annotations of this form to base pairs in genetic sequence data.
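To make the idea of "self aware" data concrete, the following sketch (in Python, with invented names; it is an illustration of the idea rather than a proposal for a concrete format) shows a value that carries its latent metadata with it and has that metadata augmented each time it is extracted or passed on.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List

@dataclass
class ProvenanceStep:
    source: str       # database or agent the data came from
    action: str       # e.g. "extracted", "corrected", "translated"
    agent: str        # curator, program, or organization responsible
    timestamp: str

@dataclass
class SelfAwareDatum:
    value: Any                                   # the "face value" data
    history: List[ProvenanceStep] = field(default_factory=list)

    def record(self, source: str, action: str, agent: str) -> "SelfAwareDatum":
        # Append one step of latent metadata; in principle this would be
        # invoked automatically on every extraction or transfer.
        step = ProvenanceStep(source, action, agent,
                              datetime.now(timezone.utc).isoformat())
        return SelfAwareDatum(self.value, self.history + [step])

# A sequence fragment is extracted from one (hypothetical) database,
# corrected by a curator, and loaded into a derived database; each step
# is recorded as latent metadata alongside the value itself.
datum = SelfAwareDatum("ATGGTGCACCTGACTCCTGAG")
datum = datum.record("SourceDB", "extracted", "loader-script")
datum = datum.record("DerivedDB", "corrected", "curator:jsmith")
for step in datum.history:
    print(step.action, "from", step.source, "by", step.agent)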

Sheer size is not the only problem. Databases have structure and, in conventional database systems, that structure is predefined and restricts what one can put into the database. Adding arbitrary annotations, which is one of the desiderata of our latent metadata, is difficult, if not impossible. Even if one could expand the database schema with fields that account for the "core" annotations, one still has the problem of the unanticipated annotations and the problem of annotating the annotations.

Fortunately there is recent work in both the database and document communities that may offer some hope of implementing such annotations. This work, on semistructured data, converges with XML: the database work offers methods for storing and querying large XML documents. Semistructured data models allow us to accommodate unanticipated structure, and there is now considerable interest in techniques for the efficient storage and retrieval of mixtures of structured and semistructured data. This offers at least the beginning of a solution to the problem of what data model and storage mechanisms might be appropriate for storing annotations.
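As a small illustration of why a semistructured substrate helps, the sketch below (Python with the standard xml.etree module; the element names and content are invented) attaches a provenance annotation -- and an annotation on that annotation -- to a data element without any schema having anticipated them, and reads them back without advance knowledge of their structure.

import xml.etree.ElementTree as ET

fragment = """
<sequence id="example-entry">
  <residues>ATGGTGCACCTGACTCCTGAG</residues>
  <annotation type="provenance">
    extracted from SourceDB, release 41
    <annotation type="correction" by="curator:jsmith">
      base 7 corrected against the published literature
    </annotation>
  </annotation>
</sequence>
"""

root = ET.fromstring(fragment)
# The query does not need to know in advance which annotations exist,
# or how deeply they are nested.
for ann in root.iter("annotation"):
    print(ann.get("type"), "-", " ".join(ann.text.split()))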

The size issue still needs to be addressed. Again, some existing ideas in databases may help us. In an image database we would expect the latent data associated with most pixels in an image to be the same. Therefore one should be able to use the same annotation for most of the pixels. Only the "deviant" pixels -- for example those that have been corrected -- need special treatment. By and large the latent data for each pixel will be inherited from the latent data of the image. Again, in genetic databases we have some idea of how much exceptional annotation is needed. One typically sees a small number of annotations -- two or three at a rough guess -- on a sequence of several hundred base pairs. Thus, while the overhead for transmitting a single base pair may be 1000%, the overhead on larger units of information may be relatively low.

The use of semistructured data as the substrate for data annotation is only a partial solution to the problem. There are many more issues concerning data models, languages and storage techniques that are involved in building an environment in which the recording of data provenance is a simple and natural process. Especially important is the development of tools for helping annotators/curators to record and repeat the corrections they make.

Acknowledgements and References

This note is intended to draw attention to the issue of data provenance, something that I believe to be as important as, and inseparable from, the problems of data preservation. For a germinal web page with pointers to relevant discussions, please consult db.cis.upenn.edu/Prov, which also contains some pointers to work on semistructured data.

Many of the ideas in this note are the result of discussions with my colleagues in the database group at the University of Pennsylvania and with John Ockerbloom. I am also grateful for discussions with David Maier and Paul Kantor at a recent meeting, the 1999 NSF Information and Data Management Workshop, at which we coined the term "self-aware data". Some of these issues also came up at the NSF Invitational Workshop on Distributed Information, Computation, and Process Management for Scientific and Engineering Environments. My extremely limited knowledge of data preservation issues was taken from the position papers of the NSF Workshop on Data Archival and Information Preservation.