CIS 430: Introduction to Human Language Technology

Fall 2008

Ani Nenkova
Office: Levine Hall 505
Office Hours: Tuesday 3:15 to 4:15, or by appointment
TT 4:30-6pm
Moore 212
Automatic summarization as part of information retrieval systems can help alleviate the information overload problem caused by the unprecedented amount of online textual information. Building a summarization system requires a good understanding of the properties of human language and the use of various natural language tools. In this course we will build several summarization systems of increasing complexity and sophistication. In the process we will learn about various natural language processing tools and resources, such as part-of-speech tagging, chunking, parsing, WordNet, and machine learning toolkits, and will survey the fundamentals of information retrieval systems. We will also cover probability and statistics concepts that are used in summarization and that apply to a wide range of other language-related and information retrieval tasks. Topics to be covered include:
  • Introduction to summarization: applications of summarization, data-driven methods, supervised summarization
  • Word distribution and weighting schemes: word frequency, Zipf's law, stopwords, tf.idf, entropy
  • Statistics concepts: probabilities, binomial and multinomial distributions, log-likelihood ratio statistics
  • Language processing tools: part-of-speech tagging, chunking, parsing, and language resources
  • Supervised summarization: getting data (human models, summary-input alignment), training a classifier (feature extraction, using WEKA), classifier performance measures (accuracy, precision, recall)
  • System performance evaluation and comparison: introduction to R, correlation coefficients, p-values, tests for statistical significance
  • Discussion of assigned technical papers: discussions will be held throughout the semester
There is no required textbook for the class. However, here are two texts that you will find useful and interesting if you decide to pursue some of the topics further.
  • 5 homeworks (65% total)
    • One will focus on clear writing about complex topics (a literature overview)
    • You are encouraged to work in teams, but the write-ups should be individual
    • 5 late days for the semester
  • Midterm (20%)
  • Class participation (15%)


Topic and Readings

Sep 4 Very brief class intro; no real class because of conflict with CIS400
Sep 9 Course overview
Introduction to summarization and language applications
Sep 11 Vocabulary size and term distribution: tokenization, text normalization, stemming
Reading: Chapter 23 from J-M textbook, Question Answering and Summarization
Sep 16 Term weighting and vector representation of text
Sep 18 Language models; Evaluation in information retrieval
Reading: Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status
Sep 23 Approaches to automatic summarization
Sep 25 Probability theory; Bayes' Theorem and Naive Bayes classification
Reading: How to Write a Spelling Corrector
HW1 out
Sep 29 Text categorization and feature selection; chi square test
Oct 2 Homework discussion; Exercises on text representations, calculating probabilities, similarities, etc.; Discussion of readings
Reading: (1) A trainable document summarizer
(2) Language Identification: Examining the Issues
Oct 6 Measures of association: chi square test, mutual information, binomial distribution and log likelihood ratio
Oct 9 Part of speech tagging
HW1 due
Readings: (1) Chapter 5, M&S
(2) Experiments in multi-document summarization
Oct 14 No class; fall break
Oct 16 Homework discussion; Log likelihood ratio test and topic signature words
HW2 Part1 out
Readings: (1) Topic-Focused Multi-document Summarization Using an Approximate Oracle Score
(2) The Automated Acquisition of Topic Signatures for Text Summarization
Oct 21 Introduction to WordNet
Oct 23 Word sense disambiguation and word similarity
Reading: Automatic record reviews
Oct 28 Word sense disambiguation
Oct 30 Discussion of assigned readings
Nov 4 Lexical chains for summarization
Readings: Using lexical chains for text summarization
Efficiently computed lexical chains as an intermediate representation for automatic text summarization
Nov 6 Web search
HW3 and take home midterm distributed
Nov 11
Readings: Entropy of search logs
A taxonomy of web search
Nov 14 Discourse, coherence and anaphora resolution
Nov 18 Evaluation in summarization; Summarization beyond extraction
Reading: Summarization Evaluation for Text and Speech
Nov 20 Discussion of midterm; writing
Nov 25 Randomized tests for statistical significance
Dec 2 Predicting input difficulty; Evaluation without human models
Reading: Identifying correlates of input difficulty for generic multi-document summarization
Dec 4 Final review
Reading: Opinion mining and sentiment analysis