Spring 2017
Updated notes will be available here as ppt and pdf files after each lecture. Older lecture notes are provided before class for students who want to consult them beforehand. Pointers to relevant material will also be made available.
I assume you will look at least at the Reading and the (*)-marked references.
The dates next to the lecture notes are tentative; some of the material, as well as the order of the lectures, may change during the semester.

Lecture #0: Course Introduction and
Motivation, pdf
Reading: Mitchell, Chapter 1

Lecture #1: Introduction to Machine
Learning, pdf
Also see: Weather/Whether Example
Reading: Mitchell, Chapter 2

Tutorial: Building a Classifier with Learning Based Java,
pdf,
pdf2
Walkthrough on using LBJava with examples.

Lecture #2: Decision Trees, pdf
Additional notes: Experimental
Evaluation
Reading: Mitchell, Chapter 3
References

J. Quinlan, "Induction of Decision Trees". Machine Learning, 1:81-106,
1986.

(*)
R. Rivest, "Learning Decision Lists". Machine Learning, 2(3):229-246,
1987.
(link)

J. Quinlan and R. Rivest, "Inferring Decision Trees Using the Minimum
Description Length Principle". Information and Computation,
80:227-248, 1989.

T. Dietterich, "Approximate Statistical Tests for Comparing Supervised
Classification Learning Algorithms", Neural Computation 10(7), 1998.
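As a concrete companion to the decision-tree readings above: the heart of Quinlan's ID3 is picking the attribute whose split maximizes information gain. A minimal sketch of that computation (function names are illustrative, not taken from the readings):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction from splitting `labels` by a feature's values."""
    n = len(labels)
    split = {}
    for v, y in zip(feature_values, labels):
        split.setdefault(v, []).append(y)
    # expected entropy after the split, weighted by branch size
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder
```

A perfectly predictive feature recovers the full label entropy as gain; an uninformative one yields gain 0.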
Learning Rules + ILP
(used to be Lecture #3; will not be covered in Spring 2017)
Reading: Mitchell, Chapter 10
References

(*)
W. Cohen, "Fast Effective Rule Induction". ICML, 1995.
(citeseer)

W. Cohen and Y. Singer, "A Simple, Fast, and Effective Rule Learner".
AAAI, 1999.
(link)

Bratko, I. and Muggleton, S. "Applications of Inductive Logic
Programming". Commun. ACM 38, 11 (Nov. 1995), 65-70.
(acm)
Lecture #4: On-Line Learning: Winnow, Perceptron:
P1.pptx, P2.pptx, P1.pdf, P2.pdf, notes(1), notes(2), notes(3)
References

(*)
D. Roth, "On-Line Learning of Linear Functions (course notes)".
2000.
(.pdf)

(*)
J. Kivinen and M. Warmuth, "The Perceptron Algorithm vs. Winnow:
Linear vs. Logarithmic Mistake Bounds when few Input Variables are
Relevant". 1995.
(link)

A. Blum, "On-Line Algorithms in Machine Learning". 1996.
(link)

(*)
A. Blum, "Learning Boolean Functions in an Infinite Attribute
Space". Machine Learning, 9(4):373-386, 1992.
(.ps)

R. Khardon, D. Roth, and R. Servedio, "Efficiency versus Convergence
of Boolean Kernels for On-Line Learning Algorithms". NIPS, 2001.
(link)

(*)
Y. Freund and R. Schapire, "Large Margin Classification Using the
Perceptron Algorithm". COLT, 1998.
(link)

N. Littlestone, "Learning Quickly When Irrelevant Attributes Abound".
Machine Learning 2(4):285-318, 1988.
(link)

Adam J. Grove, Nick Littlestone, and Dale Schuurmans, "General Convergence
Results for Linear Discriminant Updates". Machine Learning 43(3):
173-210, 2001.
(link)

Shai Ben-David and Hans Ulrich Simon,
"Efficient Learning of Linear Perceptrons", NIPS 2000
(link)

T. Zhang, "Large Margin Winnow Methods for Text Categorization".
(.ps)

T. Zhang and F. J. Oles, "Text Categorization Based on Regularized
Linear Classification Methods". Information
Retrieval, 4:5-31, 2001.
R. Khardon and G. Wachman, "Noise Tolerant Variants of the Perceptron
Algorithm". Journal of Machine Learning
Research, 8:227-248, 2007.
K. Crammer, O. Dekel, J. Keshet, S.
Shalev-Shwartz, and Y. Singer, "Online
Passive-Aggressive Algorithms". (link)
J. Duchi, E. Hazan, and Y. Singer,
"Adaptive Subgradient Methods for Online Learning and Stochastic Optimization". JMLR 12 (July 2011):2121-2159.
(pdf)
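The mistake-bound algorithms covered in this lecture differ mainly in their update rule: Perceptron adds the example to the weight vector, while Littlestone's Winnow updates the weights of active features multiplicatively. A minimal sketch over Boolean features (the promotion factor and threshold are illustrative defaults, not from the readings):

```python
def perceptron_update(w, x, y):
    """Additive update on a mistake: w <- w + y*x, with y in {-1, +1}."""
    return [wi + y * xi for wi, xi in zip(w, x)]

def winnow_update(w, x, y, alpha=2.0):
    """Multiplicative update on a mistake: promote (y = +1) or demote
    (y = -1) the weights of the active features, x in {0, 1}^n."""
    return [wi * (alpha ** (y * xi)) for wi, xi in zip(w, x)]

def predict(w, x, theta=0.0):
    """Both algorithms predict with a linear threshold function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else -1
```

Winnow's multiplicative step is what yields its logarithmic mistake bound in the number of irrelevant attributes (the Kivinen-Warmuth and Littlestone papers above make this comparison precise).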

Lecture #5: Computational Learning
Theory, pdf
Reading: Mitchell, Chapter 7
References

Kearns and Vazirani,
Introduction to Computational Learning Theory

(*)
L. Valiant, "A Theory of the Learnable". CACM, pp. 1134-1142, 1984. (link)

L. Pitt and L. Valiant, "Computational Limitations on Learning From
Examples". JACM, 35(4):965-984, 1988.
(.pdf)

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth, "Learnability
and the Vapnik-Chervonenkis Dimension". JACM, 36(4):929-965, 1987.
(.pdf)

V. Vapnik and A. Chervonenkis, "On the Uniform Convergence of Relative
Frequencies of Events to Their Probabilities". Theoretical Probability
and Its Applications, 16(2):264-280, 1971.
(link)

(*)
David Haussler, "Quantifying Inductive Bias: AI Learning Algorithms
and Valiant's Learning Framework". Artif. Intell. 36(2):177-221, 1988.
(link)

David Haussler, "Learning Conjunctive Concepts in Structural Domains".
Machine Learning 4:7-40, 1989.
(link)

Lecture #6: Neural Networks, NNP1.pptx, NNP1.pdf, NNP1New.pptx, NNP1New.pdf, NNP2.pptx, NNP2.pdf, NNP2New.pptx, NNP2New.pdf
References

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Cognitive Modeling 5 (1988): 3.
(link)

Barron, Andrew R. "Approximation and estimation bounds for artificial neural networks." Machine Learning, 14:115-133, 1994.
(link)

Livni, Roi, Shai Shalev-Shwartz, and Ohad Shamir. "On the computational efficiency of training neural networks." In Advances in Neural Information Processing Systems, pp. 855-863. 2014. (link)
Presentation: "On the computational complexity of deep learning", by Shai Shalev-Shwartz, 2015 (link)

Blum, Avrim L., and Ronald L. Rivest. "Training a 3-node neural network is NP-complete." In Machine Learning: From Theory to Applications, pp. 9-28. Springer Berlin Heidelberg, 1993. (link)

Lecture #6: Boosting, pdf,
Formal View
References

Robert E. Schapire, "The Strength of Weak Learnability".
Machine Learning 5(2):197-227, 1990.

Yoav Freund and Robert E. Schapire, "A decision-theoretic
generalization of on-line learning and an application to
boosting". Journal of Computer and System Sciences,
55(1):119-139, 1997. (.ps)

Erin L. Allwein, Robert E. Schapire and Yoram Singer, "Reducing
multiclass to binary: A unifying approach for margin
classifiers". Journal of Machine Learning Research, 1:113-141,
2000. (.pdf)

Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee,
"Boosting the margin: a new explanation for the effectiveness of
voting methods". The Annals of Statistics, 26(5):1651-1686,
1998. (.ps)
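The Freund-Schapire paper above defines AdaBoost: maintain a distribution over examples, call a weak learner, and reweight so mistakes get more mass in the next round. A minimal sketch, with the weak learner left as a caller-supplied function (the argument names are illustrative, not from the papers):

```python
import math

def adaboost(examples, labels, weak_learn, rounds):
    """labels are +/-1; weak_learn(examples, labels, D) returns a
    hypothesis h(x) in {-1, +1} with weighted error below 1/2."""
    n = len(examples)
    D = [1.0 / n] * n                  # uniform initial distribution
    ensemble = []                      # (alpha, hypothesis) pairs
    for _ in range(rounds):
        h = weak_learn(examples, labels, D)
        eps = sum(d for d, x, y in zip(D, examples, labels) if h(x) != y)
        if eps == 0 or eps >= 0.5:
            if eps == 0:               # perfect weak hypothesis: keep it and stop
                ensemble.append((1.0, h))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # reweight: mistakes go up, correct examples go down, then normalize
        D = [d * math.exp(-alpha * y * h(x))
             for d, x, y in zip(D, examples, labels)]
        Z = sum(D)
        D = [d / Z for d in D]
    def H(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H
```

The final hypothesis is a weighted vote of the weak hypotheses; the margin-based analysis in the Schapire-Freund-Bartlett-Lee paper above explains why this vote generalizes well.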

Lecture #7: Multiclass Classification,
pdf
References
Sariel Har-Peled, Dan Roth and Dav Zimak,
"Constraint classification for multiclass classification and ranking".
NIPS, 2003. (.pdf)
 Midterm Review,
pdf
 Midterm Exam (during class)

Lecture #8: Support Vector Machines,
pdf
Additional Notes on Optimization and
SVMs
Additional Notes on Logistic Regression and
SVMs
References

C.J. Lin, Optimization, Support Vector Machines, and Machine
Learning. Talk in DIS, University of Rome and IASI, CNR,
Italy. September 12, 2005.
(slides)

C. Burges, "A Tutorial on Support Vector Machines for Pattern
Recognition". Data Mining and Knowledge Discovery, 2(2):121-167,
1998.
(citeseer)
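The optimization notes above center on the soft-margin SVM objective: minimize (lambda/2)||w||^2 plus the average hinge loss. A short stochastic-subgradient sketch in the style of Pegasos-type solvers (the step-size schedule and parameters are common illustrative choices, not taken from the slides):

```python
import random

def svm_sgd(data, lam=0.01, epochs=100, seed=0):
    """data: list of (x, y) with x a feature list and y in {-1, +1}.
    Minimizes (lam/2)*||w||^2 + (1/n) * sum max(0, 1 - y*<w, x>)."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):   # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # subgradient step: shrink for the regularizer, and add
            # the hinge term only when the margin constraint is violated
            w = [wi * (1 - eta * lam) for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w
```

This sketch learns a homogeneous (no-bias) separator; appending a constant feature to each x is the usual way to recover a bias term.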

Lecture #9: Bayesian
Learning,
pdf
Additional Notes: naive Bayes (1) pdf,
naive Bayes (2) pdf
Reading: Mitchell, Chapter 6

Lecture #10: The EM Algorithm,
pdf
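As a concrete instance of the EM pattern (E-step: expected assignments under current parameters; M-step: re-estimate parameters from those expectations), here is a minimal sketch for the classic two-biased-coins example. The data layout and initial values are illustrative, not from the lecture notes:

```python
def em_two_coins(flip_counts, trials, steps=50, p_init=(0.6, 0.5)):
    """flip_counts: heads observed in each batch of `trials` flips,
    each batch drawn from one of two coins with unknown biases.
    Returns the estimated biases (pA, pB)."""
    pA, pB = p_init
    for _ in range(steps):
        # E-step: posterior probability that each batch came from coin A
        SA_h = SA_t = SB_h = SB_t = 0.0
        for h in flip_counts:
            t = trials - h
            la = (pA ** h) * ((1 - pA) ** t)   # likelihood under coin A
            lb = (pB ** h) * ((1 - pB) ** t)   # likelihood under coin B
            wa = la / (la + lb)
            SA_h += wa * h; SA_t += wa * t
            SB_h += (1 - wa) * h; SB_t += (1 - wa) * t
        # M-step: re-estimate each bias from the expected head/tail counts
        pA = SA_h / (SA_h + SA_t)
        pB = SB_h / (SB_h + SB_t)
    return pA, pB
```

Note the asymmetric initialization: starting both coins at the same bias would leave EM stuck at that symmetric fixed point.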

Lecture #11: Learning Probability Distributions,
pdf

Lecture #12: Clustering,
pdf
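For the clustering lecture, the standard starting point is Lloyd's k-means: alternate assigning each point to its nearest center and moving each center to the mean of its cluster. A minimal sketch (the explicit `init` parameter is an illustrative convenience, not a claim about the lecture's presentation):

```python
def kmeans(points, k, steps=100, init=None):
    """Lloyd's algorithm over tuples of floats. `init` optionally fixes
    the starting centers; otherwise the first k points are used."""
    centers = list(init) if init is not None else list(points[:k])
    for _ in range(steps):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # update step: each center moves to its cluster's mean
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: assignments are stable
            break
        centers = new_centers
    return centers
```

Each iteration can only decrease the sum of squared distances, so the loop terminates, though only at a local optimum that depends on the initialization.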
 Final Review,
pdf
 Final Exam (May 9th, 2017)
Dan Roth