Research: Distributed Stream Integration
This work is funded by NSF IIS-0713267, with PIs Zachary Ives and Sudipto Guha. The grant runs from September 15, 2007 to September 2009. Full details are available in the NSF award record.
Declarative, database-style models for programming distributed applications are becoming widely adopted in a variety of realms, ranging from sensors to publish-subscribe to network state management. They free the developer to define high-level queries for the specific data of interest, without regard to details about data sources, communications protocols, or synchronization. As this approach to programming gains momentum, there is an increasing need to abstract low-level variations among stream data sources away under a uniform representation, i.e., a view, and to integrate, i.e., conjoin, different types of stream data from large numbers of sources. Such tasks involve much more distributed communication and coordination than traditional distributed databases or even data stream management systems require. It becomes essential to compute the query in the network, and to optimize the processing of each stream (or small group of streams) separately, in a way that considers the topology of the network.
This proposal develops the technologies to support integration of data streams, including: languages for stream schema mappings, focusing on issues relating to combining distributed messages and maintaining timing information; techniques for rapidly establishing query computation paths through a network for sets of data stream elements that need to be joined and aggregated together; and offline and adaptive, network-aware query optimization techniques for distributed computation in the network. These techniques will scale across widely heterogeneous (sensor, wireless, and conventional) networks, and will be evaluated in environmental monitoring applications.
The intellectual merit is the development of new techniques for performing queries across large, highly distributed networks of stream-producing sources. This work increases understanding of the adaptive query processing space when access costs to data items are non-uniform and query processing requires distributed communication. It also clarifies the trade-offs between offline and adaptive optimization, and among different optimization granularities. The broader impact includes the development of distributed stream integration capabilities that can directly address a number of emerging and well-known challenges in the network and environmental monitoring domains. The educational component includes the training of two PhD students, including a female student, and the teaching of stream data integration in graduate and advanced undergraduate courses.
This work is done in the context of the multi-investigator Aspen sensor management platform, which develops runtime systems, languages, real-time processing, and security techniques for sensor networks. This grant specifically addresses the data acquisition, information integration, and distributed query optimization aspects of this project.