Tukwila: Data Integration, XML, and Adaptive Query Processing
The Tukwila data integration system was my Ph.D. thesis project. Our focus has been on developing a query processor for data integration that provides good performance. In data integration, we pose queries across a variety of heterogeneous, autonomous data sources scattered throughout the intranet and Internet (but mapped into a single, unified "mediated schema"). These sources will generally be able to export their data in XML over HTTP, but we typically have very limited knowledge about network performance or statistics about the data within the sources. As a result, it is difficult to "optimize" the plan for combining the data from the sources. My thesis proposes a combination of three techniques to address these challenges: (1) use of operators with flexible scheduling policies, to mask latencies, (2) overlapping of query operations using pipelining, even for XML data, with the x-scan operator, and (3) convergent query processing, which allows the system to choose a query plan, execute it for some amount of time, then revise statistics and cost estimates and generate an improved plan -- all in mid-stream.
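As a rough illustration of techniques (1) and (2), here is a minimal sketch of a symmetric hash join, a well-known pipelined join operator that emits results as soon as a match arrives from either input instead of waiting for one source to finish. It is not Tukwila's actual implementation; the function name `symmetric_hash_join`, the `arrivals` stream format, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, Tuple

Row = Dict[str, Any]


def symmetric_hash_join(
    arrivals: Iterable[Tuple[str, Row]],
    key: str,
) -> Iterator[Tuple[Row, Row]]:
    """Join two streams whose tuples arrive interleaved in any order.

    `arrivals` (a hypothetical format) yields ("left", row) or
    ("right", row) pairs in the order the network delivers them.
    """
    # One in-memory hash table per input, keyed on the join attribute.
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, row in arrivals:
        other = "right" if side == "left" else "left"
        k = row[key]
        # Insert into this side's table, then probe the other side's table.
        tables[side][k].append(row)
        for match in tables[other].get(k, []):
            # Always emit pairs as (left_row, right_row).
            yield (row, match) if side == "left" else (match, row)


if __name__ == "__main__":
    # Hypothetical arrival order: neither source has to finish first.
    stream = [
        ("left", {"id": 1, "a": "x"}),
        ("right", {"id": 1, "b": "y"}),
        ("right", {"id": 2, "b": "z"}),
        ("left", {"id": 2, "a": "w"}),
    ]
    for left_row, right_row in symmetric_hash_join(stream, "id"):
        print(left_row, right_row)
```

Because each arriving tuple is both inserted and probed immediately, a stall on one source does not block output that can already be produced from the other, which is the latency-masking behavior described above. The sketch keeps both tables entirely in memory and ignores overflow handling.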
I expect that we will use Tukwila as a foundation for numerous other research projects: currently it serves as the engine behind the Sagres, Piazza, and Revere systems at the University of Washington. There are many research directions that can still be explored in the space of adaptive query processing, especially once storage is considered. I am also very interested in the possibility of fleshing it out in a few directions and releasing it as an open-source codebase.
Sagres: An Initial Look at Managing Data Sharing Among Devices
The Sagres project built a series of "active rules" (event-based triggers) on top of the Tukwila core, and we showed how the basic ideas of data integration could be used to manage interactions between devices in a ubiquitous computing environment.
Piazza: Semantically Rich Data Sharing among Peers
In building Sagres, we came to understand that many of the issues in Sagres were not unique to ubiquitous computing -- and in fact, many of them (e.g., replication, propagation of updates, and data migration) also appeared in peer-to-peer systems. The Piazza peer data management system examines these and other problems, and it also generalizes the basic ideas of data integration. Instead of having a single mediated schema, peer data management allows us to have a different mediated schema at each peer. The schemas at different peers can be related via a set of mappings; as a result, all of the data sources within a peer data management system can be related by evaluating the transitive closure of the mappings between peers.
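The transitive-closure idea can be sketched with a simple graph traversal. This is a simplification rather than Piazza's algorithm: it treats each mapping as a plain directed edge between peers and only computes which peers become reachable, whereas a real system would also have to reformulate queries across each mapping. The function name `reachable_peers` and the peer names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def reachable_peers(mappings: Dict[str, List[str]], start: str) -> Set[str]:
    """Return every peer reachable from `start` by following mappings.

    `mappings[p]` lists the peers whose schemas peer `p` has a direct
    mapping to (assumed directed for this sketch).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in mappings.get(peer, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Report only the other peers, not the starting peer itself.
    seen.discard(start)
    return seen


if __name__ == "__main__":
    # Hypothetical peers and pairwise mappings.
    mappings = {
        "peer_a": ["peer_b"],
        "peer_b": ["peer_c"],
        "peer_c": [],
        "peer_d": ["peer_a"],
    }
    # peer_a has no direct mapping to peer_c, but reaches it via peer_b.
    print(sorted(reachable_peers(mappings, "peer_a")))
```

A breadth-first traversal is used here so that cycles among mappings (two peers mapped to each other) terminate cleanly; each peer is visited at most once.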
Piazza also serves as an interesting "bridge" between traditional data integration and the so-called "semantic web" advocated by Tim Berners-Lee and the World-Wide Web Consortium.
Xperanto: Processing XML Queries over DB2
I spent a summer at IBM Almaden Research Center, and was one of the initiators of the Xperanto middleware layer, which exports relational data into XML. IBM is now commercializing the technology as part of their effort to XML-ify DB2.
XQuery: Querying and Updating XML Data
I have provided a number of suggestions to the W3C's XQuery Working Group, which is developing a standard query language for XML (the "SQL for XML," if you will). Since the focus of the working group has been limited to querying data, I recently co-authored a paper looking at the next step: the semantics for updating XML data. Our update language may serve as a useful foundation for developing distributed data management systems for collaboration.