Term Project Ideas


If you are interested in doing an independent study that may lead to something longer term, please send me e-mail. If you are interested in doing a senior project with me, here are some examples of the types of short-term projects I'd be interested in. You are also encouraged to propose your own ideas.

Past Senior Projects

  • An XML-based web infrastructure for the Penn Database Group that makes use of data integration tools developed within our group, XSL tools, and various other standard web components. This project will teach skills relating to XML, XQuery, XSL, and various open source and locally developed tools. This would be a great way to gain experience and skills relevant to both research and post-academic careers, to learn new technologies, and to be able to showcase the results. BSE senior project by Eric McGlinchey.
  • The Universal Message Center -- a "Google for personal and public messages" that includes security. Develop an application for indexing, sharing, and searching e-mail, newsgroup, scheduling, and other information (e.g., RSS feeds), in a way that preserves both privacy and security. This project will teach information retrieval, XML, and database techniques. It will involve building a significant system in Java. BSE senior project by Nina Quek and Mike Daglian.
  • Meta-search as a way of combatting the hacks people use to increase their page rankings. BSE senior project by Nate Hashem.
  • A scalable infrastructure for Penn Course Review. BSE senior project by Howie Vegter and Steve MacCrory.
  • Building a better tech support help web site. BSE senior project by Dan Margolis.
  • ILMUNC, a database-backed web site for the Ivy League Model United Nations conference. BAS capstone project by Amit Vazirani and Raghav Bajaj.

Open Projects

  • (MS/BSE level) The Universal Message Center is a peer-to-peer indexing and search system for public and personal data: web pages, XML documents, blog posts, newsgroups, email, etc. The UMC (developed as a Senior Project by Mike Daglian and Nina Quek) allows for keyword search, as well as keyword-within-tag search, over data, and it supports a P2P architecture with certain privacy restrictions. There are a number of further development steps that can make for interesting projects:
    • Views or Collections. Can we come up with a way of "saving" and naming collections of data based on keyword queries, where we can selectively add related documents, and remove "bad" ones? Can we define this in a hierarchical way, as with Yahoo categories?
    • Ranking. Currently we rely on a combination of Information Retrieval-style ranking and several heuristics. Can concepts like Google's PageRank and user feedback be applied here?
    • Adding structure to unstructured data. Can we define a series of "templates" to progressively add XML tags to plain-text data, so we can take emails and web pages and query them for semantic information? Can we account for the fact that such templates are only correct with a certain probability?
    • Sharing. Given a distributed engine with public, semi-private, and private data, how should we define models of sharing between users?
  • (MS/BSE level) Peer-to-peer synchronization. We are building a peer-to-peer system, Orchestra, that (among other tasks) allows nodes to "check out" data, modify it, and synchronize with one another. There are many aspects of this system that need to be investigated.
  • (MS/BSE level) XML path matching for query processing. Our Tukwila system, used for integrating or querying XML data, can operate on data while it is still being read across the network (a form of "streaming" or "pipelining"). Currently, the implementation of this capability collects only limited information about the XML (for example, it does not record the location of the data in the document), and it has not been optimized. We would like to generalize it so more complex operations can be performed, and we would like to add further optimized behaviors.
  • (MS level) Experimenting with state-of-the-art adaptive query processing techniques in a local database. This project will involve extending the Tukwila adaptive data integration system, which queries data across a network, so it has local storage. That will be accomplished by coupling the existing Tukwila codebase with a data storage system, BerkeleyDB. This project requires skills in reading and writing C++ code.
  • (MS level) Implementing new datatypes and the XQuery function library for an XML query processor. This project involves adding the XQuery standard function library and some new data types to the Tukwila codebase. It would teach the internals of a database query processor, the fundamentals of XQuery, and software engineering skills. This project requires C++ skills and familiarity with XML and the XQuery language.
  • (MS level) Developing an XQuery rewrite-based optimizer. Query processing is a challenging and interesting area of work, with some overlap with compiler techniques but a number of unique characteristics. XQuery is a highly declarative language, so it is possible to specify a query in many forms (some with many nested expressions, some with many expressions that can be simplified, and some that depend on previously defined queries, aka views). This project would involve understanding how XQueries relate and simplifying the queries so they are more compact and more efficient.
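
As a rough illustration of the rewrite-based simplification mentioned in the last project above, here is a minimal Python sketch. It is not Tukwila code and it does not implement XQuery semantics: Const, Var, And, and simplify are hypothetical names over a toy Boolean predicate tree, and the three rules shown (dropping a true operand, collapsing to false, removing a duplicate operand) are assumptions chosen only to show the bottom-up rewriting pattern.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    """A Boolean literal in the toy predicate language (hypothetical)."""
    value: bool


@dataclass(frozen=True)
class Var:
    """An opaque predicate, standing in for some query condition."""
    name: str


@dataclass(frozen=True)
class And:
    """Conjunction of two sub-expressions."""
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Var, And]


def simplify(e: Expr) -> Expr:
    """Apply the rewrite rules bottom-up and return an equivalent expression."""
    if not isinstance(e, And):
        return e  # literals and variables are already in simplest form

    # Simplify the children first so the rules below see reduced operands.
    left = simplify(e.left)
    right = simplify(e.right)

    # Rule 1: (true and x) -> x, (x and true) -> x
    if left == Const(True):
        return right
    if right == Const(True):
        return left

    # Rule 2: anything and-ed with false is false
    if left == Const(False) or right == Const(False):
        return Const(False)

    # Rule 3: (x and x) -> x
    if left == right:
        return left

    return And(left, right)
```

For example, simplify(And(Var("p"), And(Const(True), Var("p")))) returns Var("p"): the inner conjunction loses its true operand, and the outer one then collapses its duplicate. A real XQuery optimizer would need many more rules, and each rule would have to be checked against the language's actual semantics (errors, empty sequences, ordering) before it could be applied safely.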

Last modified: Sun Jan 9 12:03:32 EST 2005