CIS 639

Introduction to Statistical Natural Language Processing

Spring 2000

 

Mitch Marcus

(mitch@linc.cis.upenn.edu)

Moore 556

898-2538

 

Prerequisites: The course will assume familiarity with Natural Language Processing, elementary statistics, and simple programming.  CSE 530 as taught in Fall 1999 by Steve Bird is a perfect introduction.

 

This course provides a broad but thorough introduction to Statistical NLP, sufficient for reading and understanding the current research literature independently and for carrying out intermediate-level research projects in the field.  The syllabus will roughly follow the Manning and Schütze text.

 

Syllabus:

 

·        A brief review of discrete probability theory, information theory and Unix tools for text manipulation

·        Statistical tools for investigating the structure of text:  Collocations and n-grams.

·        Word-sense disambiguation

·        Part-of-speech tagging:  Markov models, Brill learners, etc.

·        Probabilistic Parsing:  NP chunking, PCFGs, skeletal grammars, statistical TAG parsing

·        Statistical Machine Translation

·        Information Retrieval

·        Current hot topics:  Combining information sources, boosting and the like, Multilingual IR

 

 

Text:

 

Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press

 

Additional Texts:

 

Frederick Jelinek, Statistical Methods for Speech Recognition, MIT Press. (An encyclopedic, quite advanced introduction)

 

B. V. Gnedenko and A. Ya. Khinchin, An Elementary Introduction to the Theory of Probability, Dover Publications. (A very good simple introduction to discrete probability for those with little mathematical background)

 

David Yarowsky, Three Machine Learning Algorithms for Lexical Ambiguity Resolution, Ph.D. dissertation, U. of Pennsylvania, 1996

 

Michael Collins, Head-Driven Statistical Models for Natural Language Parsing, Ph.D. dissertation, U. of Pennsylvania, 1999

 

Various papers, to be distributed.