Provenance is metadata describing the creation, modification history, ownership, and other influences of data. Traditionally, provenance information has been used in database and scientific computation settings, where it is essential for characterizing the quality, integrity, and authenticity of data. More recently, it has been applied to a number of new areas including probabilistic databases, synchronization, annotation propagation, version control, and archiving. Yet while good practical work is being done in a growing number of areas, no cross-cutting, foundational studies of provenance have been undertaken. We believe the time is ripe for such studies, and are organizing a workshop to bring together researchers with interests in this area and to address questions such as:
Please contact Nate Foster for additional local information.
As information seekers increasingly move from print to digital media, print resources are being digitized, both by nonprofit libraries and for-profit companies like Google and Microsoft, at an ever-accelerating rate. Increasingly, the force that most tightly constrains the process for many of the most important research sources is not the cost or complexity of the digitization itself, but the cost and complexity of clearing copyrights. For material less than 100 years old, clearing copyright can involve establishing a complex and tricky provenance chain for a work, its publication, its registration, its derivative sources, its authorship, and its ownership. Moreover, to clear copyrights at high volume with low risk, it is important to be able to quickly and automatically identify relevant factual assertions deriving from reliable sources. My talk will informally review some of the provenance-related issues and challenges in clearing copyrights at large scale, discuss some current work and proposals to more efficiently gather and share copyright information, and provide useful examples for applying principles of provenance.
The term 'provenance' has become increasingly popular in a number of foundational and applied computer science areas, in particular in databases, scientific workflows, and Grid computing. (About twenty years ago, as object-orientation became increasingly popular, the phrase "my cat is object-oriented" was coined. Today we might add, "my cat does provenance", emphasizing that it has become an activity to be engaged in and excited about, as opposed to a boring static property -- of course every cat *has* provenance.)
We first give a brief overview of what people sometimes mean by provenance in the context of scientific workflows, e.g. some distinguish 'data provenance', 'workflow provenance', 'process provenance', etc. We then present a view of (data) provenance as an approximation (and/or augmentation) of a processing history according to a model of computation (MoC). In our view, a model of provenance (MoP) approximates a MoC via a set of observables. Finally, we turn to some basic research questions and directions, in particular to provenance modeling and design to support use cases as queries.
We hope that this workshop can begin to identify basic notions and characteristics of different types of provenance, thus marking a step towards a provenance taxonomy which will bring at least some order to the teeming provenance zoo.
As more and more information from autonomous web databases becomes available, query processing over these databases must adapt to deal with the imprecise nature of user queries as well as the incompleteness of data due to missing attribute values and schema heterogeneity. In such scenarios, a query processor begins to acquire the role of a recommender system. Specifically, in addition to presenting answers which satisfy the user's query, the query processor is expected to provide highly relevant answers even though they do not exactly satisfy the query predicates. In this case, it is important that the query results recommended by the system are trusted by the user. While query result explanation is important in traditional databases, it is even more critical for obtaining user trust in the presence of imprecision and incompleteness. For example, consider a scenario where the user issues a query searching for a car whose model is Civic; one of the tuples returned by the system has a null value for model, and another has the value Corolla. Without an accompanying explanation, the user would be hard pressed to understand why these are relevant answers to her query. To provide such explanations, we need to record provenance information about how query results are derived and ranked, and to design a user-friendly interface for query result explanation.
I will briefly introduce the QUIC system, which handles data incompleteness and query imprecision during data integration, ranks answers in the order of their expected relevance to users, and attempts to explain the relevance of its answers by providing snippets of its reasoning about query result identification and ranking. Then, I will discuss several data provenance challenges that we encountered in developing QUIC.
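The abstract does not describe QUIC's internal model, but the Civic/Corolla scenario above can be made concrete with a small, purely hypothetical sketch: an exact match gets full relevance, a null value is scored by an (assumed) estimate of the probability that the missing value is the queried one, and a different value is scored by an (assumed) value-correlation measure. Each score is paired with an explanation string, which is the kind of reasoning snippet a system could surface to the user. The numbers and table names here are invented for illustration; they are not QUIC's actual model.

```python
# Hypothetical relevance model for imprecise/incomplete answers.
# SIMILARITY and P_VALUE_GIVEN_NULL are invented illustration values,
# not statistics from any real system.
SIMILARITY = {("Civic", "Corolla"): 0.78}   # assumed value correlation
P_VALUE_GIVEN_NULL = {"Civic": 0.25}        # assumed P(model = Civic | model missing)

def relevance(query_val, tuple_val):
    """Return (score, explanation) for one attribute of one answer tuple."""
    if tuple_val == query_val:
        return 1.0, "exact match"
    if tuple_val is None:
        p = P_VALUE_GIVEN_NULL.get(query_val, 0.0)
        return p, f"model is missing; estimated P(model={query_val}) = {p}"
    s = SIMILARITY.get((query_val, tuple_val), 0.0)
    return s, f"{tuple_val} is correlated with {query_val} (similarity {s})"

# Rank the three answers from the example: exact, null, and correlated.
cars = [("Civic",), (None,), ("Corolla",)]
ranked = sorted(((relevance("Civic", c[0]), c) for c in cars), reverse=True)
# ranked order: Civic (1.0), Corolla (0.78), null (0.25), each with its explanation
```

The point of the sketch is only that recording *why* each score was assigned, alongside the score itself, is what makes the ranking explainable to the user.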
We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and why-provenance are particular cases of the same general algorithms involving semirings. This further suggests a comprehensive provenance representation that uses semirings of polynomials.
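The unifying idea can be sketched in a few lines of code: annotate each tuple with an element of a commutative semiring, let join multiply annotations, and let union (and duplicate elimination under projection) add them. Instantiating the semiring then recovers the different semantics. This is an illustrative sketch of the general pattern, not the paper's formal development; in particular, only two of the semiring instances mentioned in the abstract are shown.

```python
from collections import namedtuple
from itertools import product

# A commutative semiring: (plus, times, zero, one).
Semiring = namedtuple("Semiring", "plus times zero one")

# Natural numbers: annotations are multiplicities -> bag semantics.
BAG = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b, zero=0, one=1)

# Why-provenance: annotations are sets of sets of source-tuple ids;
# plus is union, times combines one witness set from each side.
WHY = Semiring(
    plus=lambda a, b: a | b,
    times=lambda a, b: frozenset(x | y for x in a for y in b),
    zero=frozenset(),
    one=frozenset([frozenset()]),
)

def union(r, s, K):
    """K-relation union: merge annotations with the semiring's plus."""
    out = dict(r)
    for t, a in s.items():
        out[t] = K.plus(out[t], a) if t in out else a
    return out

def join(r, s, K, pred):
    """K-relation join: multiply annotations of matching tuple pairs."""
    out = {}
    for (t1, a1), (t2, a2) in product(r.items(), s.items()):
        if pred(t1, t2):
            t, ann = t1 + t2, K.times(a1, a2)
            out[t] = K.plus(out[t], ann) if t in out else ann
    return out
```

The same `join` code, run with `BAG`, multiplies multiplicities as bag semantics requires; run with `WHY`, it combines witness sets as why-provenance requires. The polynomial semiring mentioned in the abstract generalizes both, since polynomials in N[X] can be evaluated into any of these semirings.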
We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, a peer's updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions -- expressing what data and sources a peer judges to be authoritative -- which may cause a peer to reject another's updates. In order to support such filtering, updates carry provenance information.
In this talk we present methods for realizing such systems. Specifically, we extend techniques from data integration, data exchange, and incremental view maintenance to propagate updates along mappings, and we integrate a novel model for tracking data provenance so that curators may filter updates based on trust conditions over this provenance. These techniques are implemented in our Orchestra prototype system.
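The filtering step described above can be illustrated with a minimal sketch. This is not Orchestra's actual data model: here an update's provenance is reduced to just its source peer and the sequence of mappings it traversed, and a trust condition is simply a predicate over that record. The peer and mapping names are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Update:
    """An update carrying simplified provenance: who produced it, and
    which schema mappings it traveled through to reach this peer."""
    tuple_: tuple
    source_peer: str
    via_mappings: tuple = ()

def apply_trusted(db, incoming, trust_conditions):
    """Insert only those updates whose provenance satisfies every trust condition."""
    for u in incoming:
        if all(cond(u) for cond in trust_conditions):
            db.add(u.tuple_)
    return db

# Hypothetical policy: this peer trusts data originating at peer "BioSQL",
# but rejects anything that traveled through mapping "m3".
conds = [
    lambda u: u.source_peer == "BioSQL",
    lambda u: "m3" not in u.via_mappings,
]
db = apply_trusted(set(), [
    Update(("gene", "g1"), "BioSQL", ("m1",)),
    Update(("gene", "g2"), "BioSQL", ("m1", "m3")),
    Update(("gene", "g3"), "OtherPeer", ()),
], conds)
# Only ("gene", "g1") survives the trust conditions.
```

The essential point is that provenance travels with each update, so the decision to accept or reject is made locally by each peer against its own trust policy, without re-contacting the source.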