My research is concerned with how best to interconnect, query, and update heterogeneous data-producing components in a networked world. Recent decades have produced not only a plethora of new data, but also a plethora of different data representations and data versions. I seek to build fast and robust algorithms and systems that help to inter-relate, find, update, and synchronize ("reconcile") data in this world.
Major Research Projects
The CopyCat project, a collaboration with USC Information Sciences Institute (led by Craig Knoblock) and Fetch Technologies (led by Steve Minton), considers the problem of how to make it easy for users to author, use, and debug mappings for one-time integration tasks. The system presents a spreadsheet-like workspace, into which the user may paste columns and rows of data from source applications. The system attempts to learn what data is being extracted and what queries are being asked, and it makes auto-complete suggestions that generalize the user's work. The user provides feedback (either explicitly or by pasting more data) and the system refines its suggestions accordingly. Provenance information is used to explain and debug results, and it is also a foundation for the learning process. See here for an overview paper. CopyCat was funded in part by a DARPA IPTO seedling in the area of "best effort data integration," and is also funded in part by DARPA DSO through the CSSG program.
Querying over distributed, heterogeneous data
Traditional data integration allows many different structured (or semi-structured) data sources to be mapped to a single umbrella mediated schema, which can be queried by users. The data integration or mediator system masks all of the variations in schemas and interfaces, and presents a uniform interface. The huge challenge in data integration is gaining consensus about what the mediated schema should be -- with secondary challenges in extending, maintaining, and modifying the schema as needs change. Worse, the mediated schema, as the product of global standardization, may be very different from the way certain users want to think about their schema.
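As a rough illustration of the mediated-schema idea described above, the following minimal sketch translates rows from two sources into one shared schema and answers a query against it. The source names, attribute names, and the `query_mediated` helper are all hypothetical and chosen only for this example; real mediator systems use far richer mapping languages than simple attribute renaming.

```python
from typing import Dict, List, Optional

# Hypothetical sources: each has its own attribute names plus a mapping
# from those names to the attributes of the shared mediated schema.
SOURCES = {
    "crm": {
        "rows": [{"full_name": "Ada Lovelace", "mail": "ada@example.org"}],
        "mapping": {"full_name": "name", "mail": "email"},
    },
    "hr": {
        "rows": [{"employee": "Alan Turing", "contact": "alan@example.org"}],
        "mapping": {"employee": "name", "contact": "email"},
    },
}


def query_mediated(sources: Dict[str, dict],
                   attributes: List[str]) -> List[Dict[str, Optional[str]]]:
    """Project the requested mediated-schema attributes from every source."""
    results = []
    for source_name in sorted(sources):  # sorted for a deterministic order
        source = sources[source_name]
        mapping = source["mapping"]
        for row in source["rows"]:
            # Rename the source's attributes into mediated-schema attributes.
            translated = {mapping[k]: v for k, v in row.items() if k in mapping}
            # Attributes a source cannot supply come back as None.
            results.append({a: translated.get(a) for a in attributes})
    return results


if __name__ == "__main__":
    for record in query_mediated(SOURCES, ["name", "email"]):
        print(record)
```

The user of `query_mediated` never sees `full_name` or `employee`; the cost of that uniformity is that every source must agree to be described in terms of the one shared schema.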
In the Piazza peer data management system, we have proposed to make data integration more flexible and decentralized by eliminating the need for a single central schema: instead, participants or peers can each provide their own schema, and different peers will be interrelated via schema mappings. Queries over any schema will be answered using the transitive closure and merge of all mappings in the system. We are currently developing techniques for building a corresponding system implementation in a peer-to-peer fashion, to take advantage of replication for reliability and performance.
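To make the idea of answering a query through the transitive closure of mappings concrete, here is a minimal sketch. It is not Piazza's reformulation algorithm: it assumes, purely for illustration, that each mapping is a directed attribute renaming between two peers, and the peer names, data, and function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical directed mappings: (source peer, target peer) -> renaming of
# the source peer's attribute names into the target peer's attribute names.
MAPPINGS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("peer_a", "peer_b"): {"title": "paper_title"},
    ("peer_b", "peer_c"): {"paper_title": "name"},
}

# Hypothetical data stored at each peer, in that peer's own schema.
PEER_DATA: Dict[str, List[Dict[str, str]]] = {
    "peer_a": [{"title": "A"}],
    "peer_b": [{"paper_title": "B"}],
    "peer_c": [{"name": "C"}],
}


def reachable_translations(start: str, attribute: str,
                           mappings: Dict[Tuple[str, str], Dict[str, str]]
                           ) -> Dict[str, str]:
    """Follow mappings transitively (breadth-first) from the start peer.

    Returns {peer: name of the queried attribute in that peer's schema}.
    """
    translations = {start: attribute}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        local_name = translations[peer]
        for (src, dst), renames in mappings.items():
            if src == peer and dst not in translations and local_name in renames:
                translations[dst] = renames[local_name]
                queue.append(dst)
    return translations


def answer(start: str, attribute: str, mappings, peer_data) -> List[str]:
    """Merge the values of the attribute from every peer reachable via mappings."""
    values = []
    for peer, local_name in reachable_translations(start, attribute, mappings).items():
        values.extend(row[local_name] for row in peer_data.get(peer, [])
                      if local_name in row)
    return sorted(values)


if __name__ == "__main__":
    print(answer("peer_a", "title", MAPPINGS, PEER_DATA))
```

Note that `peer_a` and `peer_c` share no direct mapping; the query reaches `peer_c` only by composing the two mappings through `peer_b`, which is the point of taking the transitive closure.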
The Tukwila query engine is the component of Piazza responsible for providing high-performance query answering. Within Tukwila, we focus on adaptive query processing, which allows the query engine to "discover" properties of the data as it executes a query and to exploit those properties to produce a more efficient query plan.
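The general flavor of adaptive query processing can be sketched with a toy example: a chain of filter predicates whose evaluation order is revised at runtime from observed selectivities. This is a deliberately simplified illustration and not Tukwila's actual mechanism; the function name, the batch-based re-planning policy, and the sample predicates are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def adaptive_filter(rows: Sequence[int],
                    predicates: List[Callable[[int], bool]],
                    batch_size: int = 100) -> Tuple[List[int], List[int], int]:
    """Apply all predicates to every row, re-ordering them between batches.

    Returns (rows passing all predicates, final predicate order,
    total number of predicate evaluations).
    """
    seen = [0] * len(predicates)    # times each predicate was evaluated
    passed = [0] * len(predicates)  # times each predicate returned True
    order = list(range(len(predicates)))  # start with the order as given
    output: List[int] = []
    evaluations = 0

    def pass_rate(i: int) -> float:
        # Predicates with no observations yet are assumed unselective.
        return passed[i] / seen[i] if seen[i] else 1.0

    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            keep = True
            for i in order:
                seen[i] += 1
                evaluations += 1
                if predicates[i](row):
                    passed[i] += 1
                else:
                    keep = False
                    break  # short-circuit: later predicates are skipped
            if keep:
                output.append(row)
        # Re-plan: run the predicates that reject the most rows first.
        order.sort(key=pass_rate)
    return output, order, evaluations


if __name__ == "__main__":
    data = list(range(1000))
    preds = [lambda x: x % 2 == 0,    # passes about half of the rows
             lambda x: x % 100 == 0]  # passes about 1% of the rows
    result, final_order, cost = adaptive_filter(data, preds)
    print(result, final_order, cost)
```

With these sample predicates, the engine starts in the given order, observes during the first batch that the second predicate rejects far more rows, and moves it to the front. The total number of predicate evaluations ends up lower than the 1,500 that the original fixed order would need on the same data.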
Our work on adaptive query processing focuses on the following problems: (1) extending our query processing techniques to increasingly complex types of queries, (2) investigating whether adaptive techniques provide significant benefits in more traditional database applications, (3) extending to a distributed and peer-to-peer environment, and (4) understanding the principles and effectiveness of adaptive query processing techniques.
Collaborators: Nick Taylor, Sudipto Guha, Mohammad Daud. Former collaborators: Alon Halevy, Daniel Weld, Dan Suciu, Igor Tatarinov, University of Washington; Aneesh Kapur, Mike Wittie, Ivan Terziev.
Data Integration in Practice
In the data integration research community, there is only a limited understanding of the needs of real integration applications. I propose to build a suite of mappings, data sources, and workloads for benchmarking and evaluating data integration techniques.
One of the first domains of interest is bioinformatics, which has a rich set of complex data types, as well as a variety of publicly available data sources.
Collaborators: Lyle Ungar, Elisabetta Manduchi, Chris Stoeckert, Hina Altaf. Alumni: Weichen Wu, Hina Altaf.
Senior Projects and Independent Study Projects
Please see here for some possible undergraduate and term project ideas.
Past Research Collaborations
Please see here for past research projects.