Research: Orchestra




Database Group

Data Integration

Managing the Collaborative Sharing of Evolving Data

A common problem in data management is mapping data from one database to another, where the databases may have different schemas and interfaces. Examples range from bibliographic citation databases to course grade sheets to the ACM Digital Library. Once data is mapped, it is frequently modified in multiple places at once, and the challenge lies in "synchronizing", or reconciling, the modifications.

Project Overview

The ORCHESTRA project focuses on the challenges of such data sharing scenarios in the sciences, specifically in bioinformatics. This domain has many "standardized" databases with overlapping information, similar but not identical data, differing levels of data quality and confidence, and a variety of target audiences. In general, each database owner would like to store a "live" view of all relevant knowledge in its domain; however, each site is independently extended, corrected, and analyzed. Moreover, individual biologists would like to download and maintain local "live snapshots" of data in order to run their own experiments. Unfortunately, there is often no consensus on what the best data is: certain data items will always be disputed or revised. Our focus in the ORCHESTRA collaborative data sharing system (CDSS) is on how to support reconciliation across different schemas, among users who disagree. In general, each participant in the system specifies whom it trusts, and this trust specification is used to resolve conflicts locally.
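To make the idea of trust-based local reconciliation concrete, here is a minimal sketch. It is not ORCHESTRA's actual algorithm or API: the `Update` record, the integer trust priorities, and the rule "highest-priority source wins, ties are deferred" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """Hypothetical record of one published change (not an ORCHESTRA type)."""
    source: str  # peer that published the update
    key: str     # identifier of the data item being set
    value: str   # proposed value


def reconcile(updates, trust):
    """Resolve conflicting updates from one participant's point of view.

    `trust` maps a peer name to an integer priority. Peers that are
    missing from the map, or have priority 0, are treated as untrusted
    and their updates are ignored. For each key, the value proposed by
    the highest-priority source is accepted. If several equally trusted
    sources disagree, the key is deferred rather than guessed.
    """
    by_key = {}
    for u in updates:
        if trust.get(u.source, 0) > 0:
            by_key.setdefault(u.key, []).append(u)

    accepted, deferred = {}, {}
    for key, candidates in by_key.items():
        best = max(trust[u.source] for u in candidates)
        top_values = {u.value for u in candidates if trust[u.source] == best}
        if len(top_values) == 1:
            accepted[key] = top_values.pop()
        else:
            deferred[key] = sorted(top_values)
    return accepted, deferred


# Two peers disagree about gene "g1"; a third, untrusted peer sets "g2".
example_updates = [
    Update("A", "g1", "kinase"),
    Update("B", "g1", "ligase"),
    Update("C", "g2", "unknown"),
]
```

Because each participant supplies its own `trust` map, two participants can see the same updates and still end up with different local instances, which is the behavior the paragraph above describes.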


Basic Process

The overview figure illustrates the basic functionality of ORCHESTRA. The system coordinates among a set of participating sites, each of which manages a database. Schema mappings describe how the data at these sites relates. Trust conditions specify which sites trust which data (and how much). The system allows all of the sites to be updated continuously and, on demand, propagates these updates across sites according to the specified schema mappings and trust conditions.
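The process described above can be sketched in a few lines. This is a toy model under stated assumptions, not the ORCHESTRA implementation: a site holds a single set of rows, a schema mapping is a plain function from a source row to a target row, a trust condition is a predicate over the originating site and the translated row, and only insertions are handled. The names `Site`, `Mapping`, and `propagate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Row = tuple


@dataclass(frozen=True)
class Mapping:
    """Hypothetical schema mapping from one site to another."""
    name: str
    source: str
    target: str
    translate: Callable[[Row], Optional[Row]]  # None means "row does not map"


class Site:
    """Hypothetical participating site with one local relation."""

    def __init__(self, name):
        self.name = name
        self.rows = set()
        self.pending = []  # local inserts not yet published

    def insert(self, row):
        self.rows.add(row)
        self.pending.append(row)


def propagate(sites, mappings, trust):
    """On demand, push pending inserts along mappings, filtered by trust.

    `trust` maps a site name to a predicate (origin, row) -> bool.
    Sites without an entry accept everything (an assumption of this sketch).
    """
    frontier = []
    for site in sites.values():
        frontier.extend((site.name, site.name, row) for row in site.pending)
        site.pending.clear()

    while frontier:
        at, origin, row = frontier.pop()
        for m in mappings:
            if m.source != at:
                continue
            out = m.translate(row)
            target = sites[m.target]
            if out is None or out in target.rows:
                continue  # nothing to add, or already known at the target
            accepts = trust.get(m.target, lambda origin, row: True)
            if accepts(origin, out):
                target.rows.add(out)
                frontier.append((m.target, origin, out))


# Three sites: A maps to B (columns swapped), B maps to C (unchanged).
sites = {name: Site(name) for name in ("A", "B", "C")}
mappings = [
    Mapping("M1", "A", "B", lambda r: (r[1], r[0])),
    Mapping("M2", "B", "C", lambda r: r),
]
# C only trusts data that originated at A.
trust = {"C": lambda origin, row: origin == "A"}

sites["A"].insert(("g1", "kinase"))
sites["B"].insert(("x", "y"))
propagate(sites, mappings, trust)
```

After `propagate` runs, site B holds both its own row and the translated row from A, while site C holds only the row that originated at A, because its trust condition rejects data that originated at B.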

Research Topics

The ORCHESTRA project touches on a number of important database-related topics, including update translation across mappings or views; conditional information; peer-to-peer data sharing; data provenance; and more. This project takes our past work on the Piazza system one step further in supporting decentralization. See the list of publications below for further details.

System Implementation

ORCHESTRA uses a peer-to-peer implementation that requires a runtime on each machine hosting a database; additional computation and storage nodes can be hosted in the cloud.

We have recently released the source code of the first prototype ORCHESTRA system. We will continue to improve the distribution's flexibility and installation options. Currently we are happy to arrange for demonstrations and trial deployments here at Penn.

New: we gave a demonstration of the prototype ORCHESTRA system at SIGMOD 2007 and DILS 2007.

A video demonstration can be found here.

Here are some screenshots:

Orchestra Peer View

This is the main ORCHESTRA screen, showing a series of biological databases (ellipse nodes) and mappings among them (arcs with "Mx" labels). The PCBI PlasmoDB database has been highlighted.

Orchestra Provenance Viewer

This is the ORCHESTRA provenance viewer, which shows how a given data value (the tuple selected from the list on the right side of the screen) was produced. In this case, the tuple is highlighted graphically in green, and the arrows going into it represent sources from which it was derived. This tuple was derived from Mapping M5, which combined three tuples, which were in turn direct user insertions (the "+"s in the diamond vertices). In general, derivations can be significantly more complex.
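The derivation shown in the viewer can be modeled as a small tree. The sketch below is only an illustration of that idea, not the data structure ORCHESTRA uses: `Insertion` and `Derivation` are hypothetical types, and the tuple identifiers `t`, `a`, `b`, and `c` are made up. Only the mapping name M5 and the shape "one mapping combining three user insertions" come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insertion:
    """A tuple a user inserted directly (the "+" vertices in the viewer)."""
    tuple_id: str


@dataclass(frozen=True)
class Derivation:
    """A tuple produced by a mapping from one or more input tuples."""
    tuple_id: str
    mapping: str
    inputs: tuple  # of Insertion or Derivation nodes


def base_insertions(node):
    """Collect the user insertions a tuple ultimately depends on."""
    if isinstance(node, Insertion):
        return {node.tuple_id}
    result = set()
    for child in node.inputs:
        result |= base_insertions(child)
    return result


def explain(node, depth=0):
    """Return indented text lines describing how a tuple was produced."""
    pad = "  " * depth
    if isinstance(node, Insertion):
        return [f"{pad}+ {node.tuple_id} (user insertion)"]
    lines = [f"{pad}{node.tuple_id} <- {node.mapping}"]
    for child in node.inputs:
        lines.extend(explain(child, depth + 1))
    return lines


# The case described above: mapping M5 combines three user insertions.
derived = Derivation("t", "M5", (Insertion("a"), Insertion("b"), Insertion("c")))
```

Since `Derivation` inputs may themselves be derivations, the same two functions handle the more complex, multi-level derivations mentioned above.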

Related Publications

Team Members

  • Prof. Zachary Ives
  • Prof. Val Tannen

Team Alumni

  • Olivier Biton
  • Murat Cakir
  • Charuta Joshi
  • Aneesh Kapur
  • Ivan Terziev
  • Mike Wittie
  • Nitin Khandelwal (first position: Oracle)
  • TJ Green (first position: UC Davis)
  • Grigoris Karvounarakis (first position: LogicBlox)
  • Nick Taylor (first position: Google)
  • Soeren Auer (first position: U. Leipzig)


This research has been funded by NSF CAREER grant award #IIS-0477972, awarded to Zachary G. Ives at the University of Pennsylvania, and NSF SEIII grant #IIS-0513778.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Last modified: Wed Jul 18 12:03:16 EST 2007