The following are some ideas for broad topics you might explore in your project. They are by no means exhaustive -- you are highly encouraged to come up with your own project topics not from this list. Remember, projects can be entirely theoretical, entirely experimental/implementation based, or somewhere in between. The only requirement is that the project attempt original research related to some aspect of differential privacy.

The schedule for the project follows:

10/4: Email me a few paragraphs about an idea for your project, and who (if anyone) you will be working with. Only one email per group is needed. I'll give you some feedback.
10/11: Turn in (by email) your full project proposal. This should be 2-3 pages, including a summary of the idea, an overview of background and related work, what you propose to do, and your plan for how to do it.
11/10: Turn in (by email) your mid-project report. This should be a 5-6 page extension of your project proposal, explaining what you have done so far, how your plan for the project has changed based on your findings, and what else you plan on accomplishing before the project is due.
12/6, 12/8: Project presentations in class.
12/8: Final written project is due. This should be a polished 10+ page report written in the style of an academic paper, discussing and citing related work, and explaining your results in detail.

Projects can be done in groups of size 1 or 2. Only one copy of each deliverable (proposal, mid-project report, final paper) is needed per group.

Project Ideas

Efficient Heuristics

Many of the data release algorithms we saw in class would not be practical to implement on a real dataset. Several heuristics are possible to speed up these algorithms, giving the same privacy guarantees, but losing the worst-case utility guarantees. For example:
Hardt and Rothblum suggest running the multiplicative weights algorithm on a randomly selected (polynomially sized) subset of the data universe. This gives average case guarantees for randomly selected databases, but how does it do on real data?

How about random projections of the database, as used by Blum and Roth? This gives guarantees for sparse queries, but how about more realistic queries? What about multiplicative weights run on a small projection of the database?
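To make the first heuristic concrete, here is a minimal sketch of a multiplicative weights update run over a randomly sampled sub-universe. It is a simplified illustration and not the exact algorithm of Hardt and Rothblum: the function names, the learning rate `eta`, and the sign-based update rule are assumptions made for this example, and the noisy answer is assumed to come from a separate differentially private step (for example, the Laplace mechanism).

```python
import math
import random


def sample_subuniverse(universe_size, k, rng):
    """Pick k distinct universe elements uniformly at random (hypothetical helper)."""
    return rng.sample(range(universe_size), k)


def mw_update(weights, query_values, noisy_answer, eta=0.5):
    """One multiplicative-weights step on a distribution over the sub-universe.

    weights:      current synthetic distribution (non-negative, sums to 1)
    query_values: query_values[i] in [0, 1] is the query evaluated on element i
    noisy_answer: a privately computed estimate of the query on the real data
    """
    estimate = sum(w * q for w, q in zip(weights, query_values))
    # Raise the weight of elements satisfying the query if the synthetic
    # distribution under-estimates the answer; lower it if it over-estimates.
    direction = 1.0 if noisy_answer > estimate else -1.0
    updated = [w * math.exp(direction * eta * q)
               for w, q in zip(weights, query_values)]
    total = sum(updated)
    return [w / total for w in updated]


rng = random.Random(0)
elements = sample_subuniverse(universe_size=10_000, k=8, rng=rng)
weights = [1.0 / len(elements)] * len(elements)
query_values = [x % 2 for x in elements]  # toy query: "is the element odd?"
weights = mw_update(weights, query_values, noisy_answer=0.9)
```

An experimental project could swap the toy query for real workloads and measure how the error changes with the size of the sampled sub-universe.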

Privacy Preserving Machine Learning Algorithms

Kasiviswanathan, Lee, Nissim, Raskhodnikova and Smith show that in principle, private machine learning algorithms are essentially as powerful as non-private machine learning algorithms (at least in the theoretical PAC model of machine learning). These generic private learning algorithms aren't efficient, however. Blum, Dwork, McSherry, and Nissim show that algorithms in the SQ-model do have efficient versions that are differentially private. But what is the cost of privacy? How does the performance of these algorithms degrade on your favorite data set at different levels of privacy? What if you apply composition theorems to get (epsilon, delta)-differential privacy?
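As a starting point for the composition question, the sketch below compares basic composition (k queries at epsilon each cost k * epsilon in total) with the advanced composition bound of Dwork, Rothblum, and Vadhan, which gives (epsilon', delta)-differential privacy with epsilon' = epsilon * sqrt(2k ln(1/delta)) + k * epsilon * (e^epsilon - 1). The function names and the binary search helper are assumptions made for this illustration.

```python
import math


def basic_composition(eps, k):
    """Total epsilon after k adaptive eps-DP queries (pure DP)."""
    return k * eps


def advanced_composition(eps, k, delta):
    """Total epsilon' such that k eps-DP queries are (epsilon', delta)-DP."""
    return (eps * math.sqrt(2 * k * math.log(1 / delta))
            + k * eps * (math.exp(eps) - 1))


def per_query_epsilon(total_eps, k, delta, iterations=100):
    """Binary search for the largest per-query eps that fits the total budget.

    Relies on advanced_composition being increasing in eps; assumes delta is
    small enough that eps = total_eps already exceeds the budget.
    """
    lo, hi = 0.0, total_eps
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if advanced_composition(mid, k, delta) <= total_eps:
            lo = mid
        else:
            hi = mid
    return lo


k, delta = 100, 1e-6
print(basic_composition(0.1, k), advanced_composition(0.1, k, delta))
print(per_query_epsilon(1.0, k, delta), 1.0 / k)
```

Note that the advanced bound only helps when k is large and epsilon is small; for a single query it is worse than basic composition.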

How about private versions of more sophisticated machine learning algorithms such as SVMs, like those given by Chaudhuri, Monteleoni, and Sarwate, or Rubinstein, Bartlett, Huang, and Taft?

Distributed Differential Privacy

Most of what we saw in class concerned the centralized model of differential privacy, in which a trusted data curator holds (and gets to look at) the entire private database, and computes on it in a differentially private way. But what if the dataset is divided among multiple curators who are mutually untrusting, and so they have to compute by communicating differentially private messages between themselves? What kinds of things can you do?

Kasiviswanathan, Lee, Nissim, Raskhodnikova and Smith characterize what you can learn in the local privacy model (i.e. everyone holds their own data -- we have n databases each of size 1). McGregor, Mironov, Pitassi, Reingold, Talwar, and Vadhan show a lower bound for the problem of computing the Hamming distance between two databases in the two-party setting (i.e. there are two data curators, each of which holds half of the database). But almost nothing is known when the number of curators lies between 2 and n. Even in the local privacy setting and the two-party setting, little is known beyond the results in the cited papers. This is a good topic for open-ended theoretical exploration.
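For intuition about the local model, the classic example is randomized response: each person randomizes their own bit before sending it, and the analyst debiases the aggregate. The sketch below is a minimal illustration with hypothetical function names, not a construction taken from the cited papers.

```python
import math
import random


def randomize_bit(bit, eps, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    The ratio of the two report probabilities is e^eps, so each individual
    report is eps-differentially private.
    """
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < p_truth else 1 - bit


def estimate_fraction(reports, eps):
    """Debias the observed fraction of 1s to estimate the true fraction."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    observed = sum(reports) / len(reports)
    # E[observed] = (2 * p_truth - 1) * true_fraction + (1 - p_truth)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


rng = random.Random(1)
true_bits = [1] * 6000 + [0] * 14000  # true fraction of 1s is 0.3
reports = [randomize_bit(b, eps=1.0, rng=rng) for b in true_bits]
print(estimate_fraction(reports, eps=1.0))
```

The estimate's error shrinks like 1/sqrt(n), compared with 1/n for a central curator adding Laplace noise, which is one way to see the gap between the two models.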

Privacy and Game Theory

McSherry and Talwar first proposed designing auction mechanisms using differentially private mechanisms as a building block. This mechanism, while private, is only approximately truthful. Nissim, Smorodinsky, and Tennenholtz show how to convert (in some settings) differentially private mechanisms into exactly truthful mechanisms. However, in doing so, the mechanism loses its privacy properties. Xiao asks how to design mechanisms that are both truthful and private, and gives an answer in a setting in which individuals do not explicitly model privacy in their utility function. But what about when they do?
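The building block in McSherry and Talwar's approach is the exponential mechanism, which samples an outcome with probability proportional to exp(epsilon * score / (2 * sensitivity)). Below is a minimal sketch applied to choosing a single sale price for a digital good. The helper names, the candidate price grid, and the assumption that bids and prices lie in [0, 1] (so that one bidder changes revenue by at most 1) are choices made for this illustration.

```python
import math
import random


def exponential_mechanism_probs(scores, eps, sensitivity):
    """Selection probabilities proportional to exp(eps * score / (2 * sensitivity))."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(eps * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def revenue(price, bids):
    """Revenue from selling at a fixed price to everyone who bid at least that much."""
    return price * sum(1 for b in bids if b >= price)


def private_price(bids, candidate_prices, eps, rng):
    """Pick a price via the exponential mechanism, scoring prices by revenue."""
    scores = [revenue(p, bids) for p in candidate_prices]
    # Assumes bids and prices are in [0, 1], so revenue has sensitivity 1.
    probs = exponential_mechanism_probs(scores, eps, sensitivity=1.0)
    return rng.choices(candidate_prices, weights=probs, k=1)[0]


rng = random.Random(0)
bids = [0.2, 0.5, 0.5, 0.9]
prices = [0.1, 0.5, 0.9]
print(private_price(bids, prices, eps=1.0, rng=rng))
```

Because the output distribution changes by at most a factor of e^epsilon when one bid changes, no single bidder can gain much by misreporting, which is the sense in which the mechanism is approximately truthful.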

Similar issues arise in the question of how to sell access to private data, studied by Ghosh and Roth.

Privacy and Approximation Algorithms

Gupta, Ligett, McSherry, Roth, and Talwar give algorithms for various combinatorial optimization problems that preserve differential privacy. However, this paper only analyzes specific algorithms based on combinatorial (greedy) methods, without giving any kind of general theory. What about linear-programming based approximation algorithms (perhaps solved approximately using a multiplicative weights method)? Can any of these be made private? Is there any class of approximation algorithms that admits a generic reduction to privacy preserving versions, while preserving some of its utility guarantees?

Pan-Privacy and Streaming Algorithms

Suppose we do have a trusted central database administrator. Nevertheless, the threat of computer intrusions or government subpoena might at some future date expose the internal records and state of the database administrator's algorithm. Pan-private algorithms address this problem by requiring that the internal state of the algorithm itself be differentially private. Because this means storing only randomized "hashes" of the data, this setting is amenable to problems usually considered for streaming algorithms, in which hashes are often used because of space constraints. There is some work in this area, beginning with Dwork, Naor, Pitassi, Rothblum, and Yekhanin, and continuing with Mir, Muthukrishnan, Nikolov, and Wright (1 and 2). This is a good area for exploration. What can be computed in the pan-private setting? Does it have any relation to what can be computed in a distributed setting?

Computational Complexity in Differential Privacy

This course has mostly focused on information theoretic upper and lower bounds. But even when a problem in data privacy is information theoretically solvable, there may be computational barriers to solving it quickly. Dwork, Naor, Reingold, Rothblum, and Vadhan showed that under certain cryptographic assumptions, general release mechanisms (such as the net mechanism) cannot be implemented in polynomial time. Ullman and Vadhan then extended this result even to show hardness for algorithms that release small conjunctions (of 2 literals!) using synthetic data as their output representation. This is a PCP reduction from the synthetic data hardness result of DNRRV. Of course, this hardness result is specific to the output representation, since we can efficiently release the (numeric) answers to all conjunctions of size 2 using the Laplace mechanism...
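The last point, that numeric answers to all 2-literal conjunctions are easy to release, can be sketched directly. The example below assumes a database of binary rows, neighboring databases that differ by adding or removing one row (so each count has sensitivity 1), and an even split of the privacy budget across all queries via basic composition. The function names and the data representation are assumptions made for this illustration.

```python
import itertools
import random


def all_two_literal_conjunctions(d):
    """All conjunctions of two literals over distinct attributes among d.

    A conjunction ((i, vi), (j, vj)) is satisfied by a row r when
    r[i] == vi and r[j] == vj. There are 4 * C(d, 2) of them.
    """
    conjunctions = []
    for i, j in itertools.combinations(range(d), 2):
        for vi, vj in itertools.product((0, 1), repeat=2):
            conjunctions.append(((i, vi), (j, vj)))
    return conjunctions


def count(rows, conjunction):
    """Number of rows satisfying the conjunction."""
    (i, vi), (j, vj) = conjunction
    return sum(1 for r in rows if r[i] == vi and r[j] == vj)


def laplace_noise(scale, rng):
    """Laplace(scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_all_counts(rows, d, eps, rng):
    """Noisy counts for every 2-literal conjunction, eps-DP in total."""
    conjunctions = all_two_literal_conjunctions(d)
    # k queries of sensitivity 1 each, so Laplace(k / eps) noise per answer.
    scale = len(conjunctions) / eps
    return {c: count(rows, c) + laplace_noise(scale, rng) for c in conjunctions}


rng = random.Random(0)
rows = [(1, 0, 1), (1, 1, 1), (0, 0, 1)]
answers = release_all_counts(rows, d=3, eps=1.0, rng=rng)
```

The running time is polynomial in d and the number of rows, in contrast to the hardness result for producing synthetic data that answers the same queries.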

Privacy and Statistics

Smith studies the convergence rates of certain statistical estimators, and gives differentially private versions of these estimators which have the same (optimal) convergence rates. The theorem comes with certain technical conditions though (e.g. the dimension of the statistics can't be too large, epsilon can't be too small, etc.). Can you extend theorems like this to hold with less restrictive conditions? The theorems are also asymptotic, giving guarantees as the number of data points goes to infinity. How do they work in practice, with (finite) samples of real data? Compare the empirical performance of these optimal private statistical estimators with non-private versions.

Axiomatic Approaches to Privacy

Is differential privacy a good definition? Is it too strong? One way to formalize questions like this is to derive differential privacy as the solution to a set of axioms; then if you wish to weaken differential privacy, you can reduce the problem to objecting to one of the basic axioms. Kifer and Lin begin this process, but there's plenty of room here to explore.

Private Programming Languages and Implementations

Wouldn't it be nice if you could just write a program and be guaranteed that it was privacy preserving, instead of having to prove a theorem every time you come up with some algorithm? That's the idea behind differentially private programming languages. There are now several such languages: PINQ, Airavat, and (here at Penn) Fuzz. One thing you have to worry about in practice is side-channel attacks, recently studied by Haeberlen, Pierce, and Narayan. What are the limitations of these languages? What can you implement in them, and what can't you? Are these really limitations, or can you get around them with clever implementations? How might you extend these languages, and are there other attacks you might be able to mount?

Testing Function Sensitivity

Jha and Raskhodnikova give algorithms for testing the global sensitivity of a function, and for reconstructing insensitive functions given only black-box access to a (possibly) sensitive function. But global sensitivity is not the only relevant parameter in differential privacy. For example, Nissim, Raskhodnikova, and Smith introduce smooth sensitivity, which can in many cases be much lower than the global sensitivity of a function. Can similar techniques be applied to smooth sensitivity?
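To see how the three notions of sensitivity differ, it can help to compute them by brute force on a tiny domain. The sketch below assumes databases are bit vectors, neighbors differ in one coordinate, and smooth sensitivity is defined as the maximum over all y of LS(y) * exp(-beta * d(x, y)), where d is Hamming distance. It enumerates the whole hypercube, so it is only usable for very small n, and the function names are assumptions made for this illustration.

```python
import itertools
import math


def neighbors(x):
    """All bit vectors at Hamming distance 1 from x."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]


def local_sensitivity(f, x):
    """Largest change in f when a single coordinate of x is flipped."""
    return max(abs(f(x) - f(y)) for y in neighbors(x))


def global_sensitivity(f, n):
    """Worst-case local sensitivity over every database in {0, 1}^n."""
    return max(local_sensitivity(f, x)
               for x in itertools.product((0, 1), repeat=n))


def smooth_sensitivity(f, x, beta):
    """max over y of LS(y) * exp(-beta * HammingDistance(x, y))."""
    best = 0.0
    for y in itertools.product((0, 1), repeat=len(x)):
        distance = sum(a != b for a, b in zip(x, y))
        best = max(best, local_sensitivity(f, y) * math.exp(-beta * distance))
    return best


x = (1, 1, 1)
print(global_sensitivity(max, 3))         # worst case over all inputs
print(local_sensitivity(max, x))          # flipping one bit of x leaves max unchanged
print(smooth_sensitivity(max, x, beta=1.0))
```

Here `max` plays the role of a function whose global sensitivity is 1 but whose local sensitivity is 0 on most inputs, which is the kind of gap smooth sensitivity is meant to exploit.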