Research
My research interests lie in the areas of databases and distributed systems,
especially as they relate to the Web, Web-scale information sharing, and
distributed networks of devices (e.g., sensors, actuators). I am a member of
the database, wireless/mobile systems, and
systems research groups at Penn.
My research projects aim to make it easier to exchange, locate, and analyze networked information.
- ORCHESTRA focuses on the problem of
collaborative data sharing: exchanging data and updates among loose confederations of
databases, when the different database owners have different schemas and different ideas of what the "right"
content is. We have developed techniques to map data and updates among different
sites, maintain data provenance, and use that provenance as the
basis for assessing trust and, ultimately, resolving conflicts. We
specifically target biological data sharing applications.
See here for an overview paper. Funded by NSF
CAREER #IIS-0477972.
- The Q query system addresses the
challenges of querying in a system like Orchestra, when one does not
know a priori where to find the most relevant data. Q takes as input a
keyword query, which it matches against schema elements to produce potential
data integration queries. The system returns answers from the most
promising queries and takes user feedback on the results. This
feedback is used to learn which sources are most relevant to the
information need that motivated the query. Funded by NSF CAREER #IIS-0477972
and SEIII #IIS-0513778.
- Aspen addresses the problem of programming and
integrating large-scale and complex sensor networks. The system focuses on a
setting in which large numbers of distributed sensors, with varying
capabilities, must be coordinated in order to manage and reason about
collections of physical entities and phenomena. My focus is on sensor
data integration, i.e., integration of data streams from multiple sensor
(and other) sources. A target application is data center monitoring for energy,
temperature, load, and other factors. Different aspects of the research are funded by NSF III
#IIS-0713267 and NOSS
#CNS-0721541.
- CopyCat, in collaboration with USC Information Sciences
Institute (led by Craig Knoblock) and Fetch
Technologies (led by Steve Minton), addresses how to make it easy for users to
author, use, and debug mappings for one-time integration tasks. The system
presents a spreadsheet-like workspace, into which the user may paste columns
and rows of data from source applications. The system attempts
to learn what data is being extracted and what queries are being
asked, and it makes auto-complete suggestions that generalize the
user's work. The user provides feedback (either explicitly or by
pasting more data) and the system refines its suggestions accordingly.
Provenance information is used to explain and debug results, and it is also a
foundation for the learning process. See
here for an overview paper.
CopyCat is funded in part by a DARPA IPTO seedling in the area of "best
effort data integration."
I also participate in several projects that are led by my
colleagues at Penn:
- SHARQ
(led by Susan Davidson) is a joint effort with the Penn Center for Bioinformatics. It leverages the core Orchestra engine
and the Q system, plus a portal (SHARQ Guide) that offers both keyword search and
browse access to data sources, schemas, and queries. Funded by NSF
SEIII #IIS-0513778.
- pPOD
(led by Val Tannen) focuses on the modeling and management of information related to phylogenetic trees. pPOD leverages the Orchestra engine.
- PIRIS (led by Doug Wiebe) focuses on integrating data
records relating to gunshot wound cases in Philadelphia, in order to help
support intervention. Funded by the State of Pennsylvania.
Acknowledgments: I have also received grants from
DARPA CSSG (#HR0011-06-1-0016), Penn
ISTAR, the State of Pennsylvania, and Lockheed Martin, and software donations from MarkLogic, Electric Software, and IBM Corp.
Selected recent courses and seminars:
Detailed information is here.
Publications
To appear:
- Invited entries on Adaptive stream
processing, Updates in P2P systems,
and XML publishing for the upcoming
Encyclopedia of Database Systems, edited by Ling Liu and M. Tamer Ozsu, soon to
be available from Springer.
- SmartCIS: Integrating Digital and
Physical Environments, with Mengmeng Liu, Svilen Mihaylov,
Zhuowei Bao, Marie Jacob, Boon Thau Loo, Sudipto Guha. Demonstration
description, to appear in SIGMOD 2009.
Selected recent publications:
- Reconciling Differences, with TJ Green and Val Tannen. ICDT 2009.
- Recursive Computation of Regions and Connectivity in Networks, with
Mengmeng Liu, Nicholas E. Taylor, Wenchao Zhou, and Boon Thau Loo. ICDE 2009.
- Interactive Data Integration through Smart Copy and Paste, with
Craig Knoblock, Steve Minton, Marie Jacob, Partha Talukdar, Rattapoom
Tuchinda, Jose Luis Ambite, Maria Muslea, Cenk Gazen. CIDR 2009.
- The Orchestra Collaborative Data Sharing
System, with Todd J. Green, Grigoris Karvounarakis, Nicholas
E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, Fernando
Pereira. ACM SIGMOD Record, September 2008.
- A Substrate for In-Network Sensor Data Integration, with Svilen
Mihaylov, Marie Jacob, and Sudipto Guha. DMSN 2008. Extended version
available as Technical Report MS-CIS-08-26.
- Learning to Create Data-Integrating Queries, with Partha Pratim
Talukdar, Marie Jacob, M. Salman Mehmood, Koby Crammer, Fernando Pereira,
and Sudipto Guha. VLDB 2008.
- Bidirectional Mappings for Data and Update Exchange, with Grigoris
Karvounarakis, WebDB 2008.
- Sideways Information Passing for Push-Style Query Processing, with
Nicholas Taylor. ICDE 2008, Cancun, Mexico.
- DBpedia: a Nucleus for a Web of Open Data, with Soeren Auer,
Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak.
ISWC/ASWC In-Use Track, 2007.
- Adaptive
Query Processing, with Amol Deshpande, Vijayshankar Raman. Tutorial, VLDB 2007. Slides
- Update Exchange with Mappings and Provenance, with Todd J. Green,
Grigoris Karvounarakis, and Val Tannen. VLDB 2007.
- Adaptive Query Processing, with Amol
Deshpande and Vijayshankar Raman. Foundations and Trends in
Databases, Vol. 1 No. 1, 2007. Hardcopy available at a discount from Now
Publishers; see here.
- ORCHESTRA: Facilitating Collaborative Data
Sharing, with TJ Green, Nick Taylor, Grigoris Karvounarakis, Olivier Biton,
Val Tannen. Demonstration description, SIGMOD 2007.
- Reconciling while Tolerating Disagreement in Collaborative Data
Sharing, with Nick Taylor. SIGMOD 2006.
A complete list is here.
PhD Advisees
Collaborators
- Steve Minton, Fetch Technologies
- Craig Knoblock, USC ISI
- Val Tannen, Penn CIS
- Insup Lee, Penn CIS
- Sudipto Guha, Penn CIS
- Matt Blaze, Penn CIS
- Fernando Pereira, Penn CIS
- Lyle Ungar, Penn CIS
- Boon Thau Loo, Penn CIS
- Chris Stoeckert, Penn Center for Bioinformatics
- Pete White, Children's Hospital of Philadelphia