How do we tie together the world's data to answer fundamental scientific or policy questions? How do we facilitate and foster large-scale collaborative projects? My research interests lie in the areas of databases and distributed systems, especially as they relate to the Web, Web-scale information sharing, and distributed networks of devices (e.g., sensors, actuators). I am a member of the database and systems research groups, and the Warren Center for Network and Data Science at Penn. My research projects relate to making it easier to exchange, locate, and analyze networked information.
For any type of large-scale data integration, the "glue" that holds everything together is data provenance. Provenance links source data to derived data and explains the steps involved in the derivation. It is also the "social network of data", showing who used what data and how. It can be used for clustering, for recommendations, and more generally to assess the trustworthiness of data. Unfortunately, in practice today we make very little use of provenance beyond enabling other people to reproduce experiments. We are investigating how to build better, easier, and more useful mechanisms for capturing and reasoning about data provenance. Funded by NSF (CiCi) and NIH (BD2K Targeted Software).
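As a rough illustration of the idea, and not the representation used in any of the projects described here, provenance can be pictured as a graph in which each derived item points back to the step, the user, and the source items that produced it. The sketch below assumes an acyclic derivation graph; the class `ProvenanceGraph`, its methods, and the dataset and user names are all hypothetical.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy provenance store: each derived item maps to the steps that produced it."""

    def __init__(self):
        # derived item -> list of (step name, user, source items)
        self.derivations = defaultdict(list)

    def record(self, derived, step, user, sources):
        """Record that `user` ran `step` over `sources` to produce `derived`."""
        self.derivations[derived].append((step, user, tuple(sources)))

    def lineage(self, item):
        """Return the base (underived) items that `item` ultimately depends on.

        Assumes the derivation graph is acyclic.
        """
        if item not in self.derivations:
            return {item}
        base = set()
        for _step, _user, sources in self.derivations[item]:
            for source in sources:
                base |= self.lineage(source)
        return base

    def users_of(self, source):
        """Return the users who directly consumed `source` in some step."""
        return {user for steps in self.derivations.values()
                for _step, user, sources in steps if source in sources}


# Hypothetical example: two derivation steps over made-up datasets.
g = ProvenanceGraph()
g.record("clean_eeg", "filter", "alice", ["raw_eeg"])
g.record("features", "extract", "bob", ["clean_eeg", "annotations"])
print(sorted(g.lineage("features")))  # ['annotations', 'raw_eeg']
print(g.users_of("clean_eeg"))        # {'bob'}
```

Here `lineage` answers "which source data does this result depend on?", while `users_of` hints at the "social network" view: who used what data.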
The IEEG Web Portal, in collaboration with Prof. Brian Litt of Bioengineering and Neurology, and Prof. Greg Worrell at Mayo Clinic, seeks to enable community-scale data integration and cloud-hosted science for epileptic seizure prediction (and beyond). Beyond its scientific applications, IEEG serves as a testbed for technologies from the Q System and other data integration research. As of Oct 2014 we have over 1200 datasets and 450 users. We have also hosted competitions for epileptic seizure detection and epileptic seizure prediction. Funded by NIH as well as grants from Amazon.
This project has received a good deal of notice for its impact on data science:
- Seizure prediction contest results (504 teams, 82% accuracy)
- NIH Director's blog
- American Epilepsy Society press release
- Announcement of winners
- Science Daily: Crowdsourcing advances epileptic seizure detection, prediction
- NPR, A Crowd of Scientists Finds a Better Way to Predict Seizures
Several prior projects have resulted in building blocks towards our ongoing work in supporting large-scale data integration and analysis. These projects are no longer directly active, but their core ideas (and code) are part of our more recent projects:
Acknowledgments: I have also received grants from DARPA CSSG (#HRO011-06-1-0016 and HRO1107-1-0029), Penn ISTAR, the State of Pennsylvania, Amazon, Google, and Lockheed Martin, and software donations from MarkLogic, Electric Software, and IBM Corp.
I was the first Undergraduate Curriculum Chair for Penn's Singh Program on Networked and Social Systems Engineering, NETS, which was formerly known as MKSE. This Internet-centered degree program looks at how people and systems interact over networks. It combines computer science (algorithms, distributed systems) with sociology, incentives (game theory), and dynamic systems. The overall program is directed by Ali Jadbabaie. New NETS courses I co-developed include NETS (MKSE) 212 "Scalable and Cloud Computing" and NETS (MKSE) 150 "Market and Social Systems on the Internet".
Selected recent courses and seminars:
- Fall 2016: CIS 455/555, Internet and Web Systems
- Spring 2016: CIS 450/550, Database and Information Systems
- Fall 2015: CIS 455/555, Internet and Web Systems
- Spring 2015: CIS 455/555, Internet and Web Systems
- Fall 2014: NETS 212, Scalable and Cloud Computing
- Spring 2014: CIS 650, Implementing Data Management Systems
- Fall 2013: CIS 450/550, Database and Information Systems
- Spring 2012: MKSE 150, Market and Social Systems on the Internet
- Fall 2011: CIS 550, Database and Information Systems
- Spring 2011: MKSE 150, Market and Social Systems on the Internet, with Sampath Kannan
- Fall 2010: CIS 399/002 (MKSE 212 pilot offering), Scalable and Cloud Computing, with Andreas Haeberlen
- Spring 2010: CIS 555, Internet and Web Systems
- Fall 2008: CIS 650, Implementing Data Management Systems
Detailed information is here.
Principles of Data Integration, with AnHai Doan and Alon Halevy. This textbook gives a comprehensive academic treatment of the wide range of topics related to research in data integration: mappings and data transformations, query rewriting, adaptive query processing, XML and streaming data, probabilistic mappings, keyword search, data provenance, and much more. We also describe research challenges, real systems, and implementation techniques. Lecture slides are available from Elsevier. Available from Amazon in hardcopy or Kindle form; from Google Play store in e-book form; from Barnes & Noble in hardcopy or Nook form. Thanks to Xiaofeng Meng, there is also now a Chinese translation of the book.
Adaptive Query Processing, with Amol Deshpande and Vijayshankar Raman. Foundations and Trends in Databases, Vol. 1 No. 1, 2007. Hardcopy available at a discount from Now Publishers; see here.
- StreamQRE: Modular Specification and Efficient Evaluation of Quantitative Queries over Streaming Data. With Kostas Mamouras, Mukund Raghothaman, Rajeev Alur, Sanjeev Khanna. To appear, PLDI 2017.
- Enabling an Open Data Ecosystem for the Neurosciences. With Martin Wiener, Fritz Sommer, Russ Poldrack, and Brian Litt. In Neuron.
- Enabling Incremental Query Re-Optimization. With Mengmeng Liu and Boon Thau Loo. SIGMOD 2016.
- Collaborating and Sharing Data in Epilepsy Research. With Joost Wagenaar, Greg Worrell, Matthias Dumpelmann, Brian Litt, Andreas Schulze-Bonhage. Journal of Clinical Neurophysiology.
- Active Learning in Keyword Search-Based Data Integration. With Zhepeng Yan, Nan Zheng, Partha Pratim Talukdar, and Cong Yu. VLDB Journal Special Issue on Best Papers of VLDB 2013.
- Looking at Everything in Context. With Zhepeng Yan, Nan Zheng, Brian Litt, Joost B. Wagenaar. CIDR 2015.
- I recently participated in a panel on Big Data at VLDB 2013. Slides are here.
- Our work on Schema Mediation in Peer Data Management Systems (with Alon Halevy, Dan Suciu, and Igor Tatarinov), published in ICDE 2003, received the Most Influential Paper Award at ICDE 2013!
- Actively Soliciting Feedback for Query Answers in Keyword Search-Based Data Integration, with Zhepeng Yan, Nan Zheng, Partha Talukdar, and Cong Yu. VLDB 2013.
- Caravan: Provisioning for What-If Analysis, with Daniel Deutch, Tova Milo, and Val Tannen. CIDR 2013.
- Distributed Time-aware Provenance, with Wenchao Zhou, Suyog Mapara, Yiqing Ren, Yang Li, Andreas Haeberlen, Boon Thau Loo, and Micah Sherr. VLDB 2013.
- REX: Recursive, Delta-Based Data-Centric Computation, with Svilen Mihaylov and Sudipto Guha. Proc. VLDB 5(11): 1280-1291. VLDB 2012.
- Querying Provenance for Ranking and Recommending, with Andreas Haeberlen, Tao Feng, Wolfgang Gatterbauer. TaPP 2012.
- Recomputing Materialized Instances after Changes to Mappings and Data, with Todd J. Green. ICDE 2012. Runner-up, Best Paper Award. Invited to TKDE Special Issue on Best Papers of ICDE 2012.
- Sharing Work in Keyword Search over Databases, with Marie Jacob. SIGMOD 2011.
- Querying Data Provenance, with Grigoris Karvounarakis and Val Tannen. SIGMOD 2010.
- Automatically Incorporating New Sources in Keyword Search-Based Data Integration, with Partha Pratim Talukdar and Fernando Pereira. SIGMOD 2010.
- Reliable Storage and Querying for Collaborative Data Sharing Systems, with Nicholas Taylor. Full paper, ICDE 2010.
- Maintaining Recursive Views of Regions and Connectivity in Networks, with Mengmeng Liu, Nicholas Taylor, Wenchao Zhou, and Boon Thau Loo. IEEE TKDE Special Issue, "Best Papers of ICDE 2008".
- The Orchestra Collaborative Data Sharing System, with Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, Fernando Pereira. ACM SIGMOD Record, September 2008.
- Learning to Create Data-Integrating Queries, with Partha Pratim Talukdar, Marie Jacob, M. Salman Mehmood, Koby Crammer, Fernando Pereira, and Sudipto Guha. VLDB 2008.
- DBpedia: a Nucleus for a Web of Open Data, with Soeren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak. ISWC/ASWC In-Use Track, 2007.
- Update Exchange with Mappings and Provenance, with Todd J. Green, Grigoris Karvounarakis, and Val Tannen. VLDB 2007.
- Reconciling while Tolerating Disagreement in Collaborative Data Sharing, with Nick Taylor. SIGMOD 2006.
A complete list is here.
- Soonbo Han
- Yi Zhang
- Nan Zheng
- Bill La Cava (postdoc, with Jason Moore)
- Abdu Alawini (postdoc, with Susan Davidson)
- Rishabh Gupta (MS)
- Marie Jacob Rajan. Apple.
- Dr. Babak Bagheri Hariri (postdoc, with Val Tannen). System Group (Iran).
- Dr. Allen Zhepeng Yan, Google Inc.
- Dr. Mengmeng Liu (with Boon Thau Loo). @WalmartLabs.
- Ling Ding, MS. First employment: Yushkevich Lab, Radiology, Penn.
- Dr. Medha Atre (postdoc). Assistant Professor, IIT-Kanpur
- Dr. Svilen Mihaylov (with Sudipto Guha). Apptio, Inc.
- Dr. Nicholas Taylor. Google, Inc.
- Dr. Partha Pratim Talukdar (with Fernando Pereira and Mark Liberman). Assistant Professor, IISc-Bangalore.
- Dr. Soren Auer (postdoc). Professor, University of Bonn.
- Dr. Todd J. Green (with Val Tannen). First employment: University of California-Davis (now Adjunct Professor). Currently at LogicBlox, Inc.
- Dr. Grigoris Karvounarakis (with Val Tannen). LogicBlox, Inc.
- Geetika Vasudeo, MSE. Goldman Sachs.
- Ani Nenkova, Penn CIS
- Val Tannen, Penn CIS
- Susan Davidson, Penn CIS
- Sampath Kannan, Penn CIS
- Cong Yu, Google, Inc.
- Sudipto Guha, Penn CIS
- Boon Thau Loo, Penn CIS
- Andreas Haeberlen, Penn CIS
- Jason Moore, Penn Genetics / Dir, Inst for Biomedical Informatics
- Junhyong Kim, Penn Biology
- Brian Litt, Penn Bioengineering and Neurology
- Santosh Kumar, U Memphis
- Mani Srivastava, UCLA
- Ida Sim, UCSF
- Byron Wallace, Northeastern U
Tips on Interviewing
Finishing your PhD and going on the job market? I have compiled a list of references on interviewing, which you can find here.