Prof. Michael Kearns

Time: Mondays 12-3 PM
Location: Active Learning Classroom, 3401 Walnut St., fourth floor.
(Elevator lobby just left of the Starbucks entrance)

IMPORTANT NOTE: As per the University schedule, The first course meeting will be on Weds Jan 10 from 12-3. From then on we will meet Mondays 12-3.

URL for this page:

Previous incarnations of this course: (with Grigory Yaroslavtsev) (with Jake Abernethy) (with Koby Crammer)


This course is an introduction to the theory of machine learning, which attempts to provide algorithmic, complexity-theoretic and probabilistic foundations to modern machine learning and related topics.

The first part of the course will closely follow portions of An Introduction to Computational Learning Theory, by M. Kearns and U. Vazirani (MIT Press). We will cover perhaps 6 or 7 of the chapters in K&V over (approximately) the first half of the course, often supplementing with additional readings and materials. Copies of K&V will be available at the Penn bookstore. The second portion of the course will focus on a number of models and topics in learning theory and related areas not covered in K&V.

The course will give a broad overview of the kinds of problems and techniques typically studied in theoretical machine learning, and provide a basic arsenal of powerful mathematical tools for analyzing machine learning problems.

Topics likely to be covered include:

  • Basics of the Probably Approximately Correct (PAC) Learning Model
  • Occam's Razor, Compression and Learning
  • Uniform Convergence and the Vapnik-Chervonenkis Dimension
  • Boosting
  • Learning in the Presence of Noise and Statistical Query Learning
  • Learning and Cryptography
  • Query Models
  • Online and No-Regret Learning Models
  • Agnostic Learning

    Additional topics we may also cover include:

  • Learning and Differential Privacy
  • Fairness in Machine Learning
  • Theory of Reinforcement Learning


    Much of the course will be in fairly traditional "chalk talk" lecture format, but with ample opportunity for discussion, participation, and critique. The course will meet once a week on Mondays from 12 to 3, with the first meeting on Weds Jan 10. Lunch will be served.

    The course will involve advanced mathematical material and will cover formal proofs in detail, and will be taught at the doctoral level. While there are no specific formal prerequisites, background or courses in algorithms, complexity theory, discrete math, combinatorics, probability theory and statistics will prove helpful, as will "mathematical maturity" in general. We will be examining detailed proofs throughout the course. If you have questions about the desired background, please ask. Auditors and occasional participants are welcome.

    The course requirements for registered students will be a mixture of active in-class participation, problem sets, possibly leading a class discussion, and a final project. The final projects can range from actual research work, to a literature survey, to solving some additional problems.

    The exact timing and set of topics below will depend on our progress and will be updated as we proceed.

    Wed Jan 10
    In the first meeting, we will go over course mechanics and present a course overview, then immediately begin investigating the Probably Approximately Correct (PAC) model of learning. Detailed topics covered: Learning rectangles in the real plane; definition of the PAC model; PAC learnability of rectangles in d dimensions.

    READING: K&V Chapter 1.

    Mon Jan 22
    PAC learnability of boolean conjunctions; intractability of PAC learning 3-term DNF.

    READING: K&V Chapter 1.

    Mon Jan 29
    Intractability of PAC learning 3-term DNF continued; PAC learning 3-term DNF by 3CNF; learning, consistency and compression; Occam's Razor.

    READING: K&V Chapter 1 and Chapter 2.

    PROBLEM SET #1 (due in hardcopy form in class Feb 5):

    1. As carefully as you can, prove the PAC learnability of axis-aligned rectangles in n dimensions in time polynomial in n, 1/epsilon and 1/delta. Describe the algorithm precisely and provide as detailed a proof as you can, and calculate the sample size needed.

    For problems 2. and 3. below, you may assume that the input distribution/density D is uniform over the unit square [0,1] x [0,1]. For extra credit, sketch how you might generalize your results to the case of unknown and arbitrary D. You might also find it useful to know about Chernoff bounds and related inequalities, which are discussed both here and in the appendix of K&V.

    2. Prove the PAC learnability of unions of 2 axis-aligned rectangles in the real plane in time polynomial in 1/epsilon and 1/delta.

    3. Consider the variant of the PAC model with classification noise: each time the learner asks for a random example of the target concept c, instead of receiving x,c(x) for x drawn from D, the learner receives x,y where y = c(x) with probability 2/3, and y = -c(x) with probability 1/3. Here we assume that c(x) is +1 or -1, and that the noise is independent for each example. Prove the PAC learnability of axis-aligned rectangles in the real plane in this modified model.

    Collaboration on the problem sets is permitted, but   everyone must turn in their own, independent writeup, in which you acknowledge your collaborators.

    Mon Feb 5
    PAC learning yields weak compression; consistency and learning in the agnostic/unrealizable setting; introduction to VC dimension.

    READING: K&V Chapters 2 and 3.

    Mon Feb 12
    Today to start we will have a special guest lecture from Prof. Dana Moshkovitz of UT Austin on some very exciting recent results in the PAC model; afterwards we will continue with the VC dimension. Dana's abstract:

    What Cannot Be Learned With Bounded Memory

    How does computational learning change when one cannot store all the examples one sees in memory? This question has seen a burst of interest in the past couple of years, leading to the surprising theorem that there exist simple concepts (parities) that require an extraordinary amount of time to learn unless one has quite a lot of memory. In this work we show that in fact most concepts cannot be learned without sufficient memory. This subsumes the aforementioned theorem and implies similar results for other concepts of interest. The new results follow from a general combinatorial framework that we developed to prove lower bounds for space bounded learning.

    Joint work with Michal Moshkovitz, Hebrew University.

    READING: K&V Chapter 3, and here is a link to a paper related to Dana's talk.

    Mon Feb 19

    Complete proof of VC-dimension based upper bound on the sample complexity of learning in the PAC model via Sauer's Lemma and the two-sample trick; matching lower bound; extensions to unrealizable/agnostic setting; trading off approximation error/model complexity with estimation error via stuctural risk minimization.

    READING: K&V Chapter 3; and here is a link to a very nice survey paper generalizing VC-style bounds to a wide variety of other learning models:

    "Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications", D. Haussler 1992.

    Mon Feb 26

    Brief overview of cryptographic hardness results for PAC learning boolean formulae, finite automata, and neural networks; boosting.

    READING: K&V Chapter 4 and the following papers:

  • The Boosting Apporach to Machine Learning: An Overview. Rob Schapire.
  • A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Yoav Freund and Rob Schapire.
  • Experiments with a New Boosting Algorithm. Yoav Freund and Rob Schapire.
  • On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. MK and Y. Mansour.

    PROBLEM SET #2 (due in hardcopy form in class Mar 19):

    1. Solve K&V Problem 3.2.

    2. Consider the concept class of parity functions defined over n-bit strings x: for every subset $T$ of the indices {1,...,n}, the function f_T(x) = 1 if and only if the number of 1s in x on just the indices in T is odd; otherwise f_T(x) = 0. For example, if n = 6 and T = {1,2,5} then f_T(001011) = 1 and f_T(101011) = 0. Give the best upper and lower bounds that you can on the the VC dimension of this class. Then give a computationally efficient algorithm for PAC learning parity functions. (Hint: try viewing the problem from a linear algebra perspective.)

    3. Define the consistency dimension of a concept class C to be the smallest d such that for any c in C, there exists a sample S labeled by c of size at most d, and for which c is the only concept in C consistent with S. In other words, the consistency dimension of C is the smallest d such that every concept in C can be uniquely identified by a sample of d points. Note that the sample S is allowed to depend on c. Restricting attention to finite C, carefully describe and analyze (a) a concept class C in which the consistency dimension of C is much smaller than the VC dimension of C, and (b) a concept class C in which the consistency dimension of C is much larger than the VC dimension of C.

    4. Solve K&V Problem 3.6, which is incorrect as stated --- you simply need to find *some* set of labelings/concepts of size phi_d(m) for which the VC dimension is d.

    5. Consider the problem of learning in the presence of adversarial errors or noise in the PAC model. In this setting, there is an error rate eta >= 0. Each time the learning algorithm asks for an example, with probability 1-eta, it receives a "correct" example (x,y) in which x is drawn from the target distribution D, and y = c(x), where c is the target concept in C. But with probability eta, the algorithm receives a pair (x,y) about which no assumptions whatsoever can be made. In particular, both x and y can be chosen by an adversary who knows the current state of the algorithm, and is deliberately trying to foil the algorithm. Let's call a concept class C nontrivial if there exist c1 and c2 in C, and inputs u, v, w, x such that c1(u) = 1 and c2(u) = 1; c1(v) = 1 and c2(v) = 0; c1(w) = 0 and c2(w) = 1; and c1(x) = 0 and c2(x) = 0. Show that in the adversarial noise model, PAC learning for any nontrivial C is impossible unless eta < epsilon/(1+epsilon), where epsilon is the desired error of the algorithm's hypothesis with respect to c and D. (Hint: find a "bad" distribution D that allows the adversary to "confuse" c1 and c2.)

    Mon Mar 5

    Spring Break, no meeting.

    Mon Mar 12

    Boosting continued; introduction to PAC learning with classification noise.

    READING: K&V Chapter 5.