CIS 639 Finite-State Methods in Natural Language Processing
Spring Semester 1998
Instructor
- Lauri Karttunen
IRCS 406, 573-6284, karttunen@central
- Office Hours: MW 2-4
- Time and Location: MW 9-10:30, Towne 303
Many basic steps in language processing, ranging from tokenization to
phonological and morphological analysis, disambiguation, and shallow parsing,
can be performed efficiently by means of finite-state transducers. The course
will introduce students to the theory and technology of compiling such
transducers from lexical databases, from regular expressions, and by other
means.
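As a rough illustration of what a finite-state transducer does, here is a minimal Python sketch. It is not how the course tools represent networks. The Transducer class, the toy network, and the +Sg/+Pl tag symbols are all assumptions made up for this example. The network pairs a lexical form such as "c a t +Pl" with the surface form "cats".

```python
from typing import Dict, List, Set, Tuple

# An arc is (input symbol, output symbol, next state).
# The empty string "" stands for epsilon on the output side.
Arc = Tuple[str, str, int]


class Transducer:
    """Hypothetical toy transducer; not the data structure used by xfst."""

    def __init__(self, arcs: Dict[int, List[Arc]], start: int, finals: Set[int]):
        self.arcs = arcs
        self.start = start
        self.finals = finals

    def apply(self, symbols: List[str]) -> List[str]:
        """Return every output string that the network pairs with the input."""
        results = []
        # Depth-first search over (state, input position, output so far).
        stack = [(self.start, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols):
                if state in self.finals:
                    results.append(out)
                continue
            for inp, outp, nxt in self.arcs.get(state, []):
                if inp == symbols[pos]:
                    stack.append((nxt, pos + 1, out + outp))
        return sorted(results)


# Toy network: "cat" followed by a number tag (assumed tag names).
toy = Transducer(
    arcs={
        0: [("c", "c", 1)],
        1: [("a", "a", 2)],
        2: [("t", "t", 3)],
        3: [("+Sg", "", 4), ("+Pl", "s", 4)],
    },
    start=0,
    finals={4},
)

print(toy.apply(["c", "a", "t", "+Pl"]))  # ['cats']
print(toy.apply(["c", "a", "t", "+Sg"]))  # ['cat']
print(toy.apply(["d", "o", "g"]))         # []
```

The sketch only applies the network in one direction and does not handle epsilon on the input side. A real transducer can also be inverted, composed with other networks, and minimized.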
Arabic Demo
XTAG Lexicon now available in /pkg/cis639/lex/xtaglex.fst.
Assignments
Syllabus:
- 1/19 Basic concepts: languages and relations, regular expressions, networks.
Applications: word/number mapping, lexical transducer
- 1/21 More on regular expressions: intersection, composition, restriction. Using fst.
- 1/26 Pig Latin. More on regular expressions: restriction,
contexts, equivalences, crossproduct, simple replacement.
- 1/28 More on Pig Latin, application vs. composition, replacement.
- 2/2 Conditional replacement. Assignments 2 and 3.
- 2/4 Parallel replacement. Introduction to two-level morphology.
- 2/9 Two-level rules. Assignment 4.
- 2/11 More on two-level rules: rule conflicts, epenthesis rules,
variables. Two-level description for Pig Latin.
Introduction to the lexicon compiler.
- 2/16 More on lexicon compilation. Assignments 5 and 6. Lexicon of dates.
- 2/18 Lexicon of numbers. Long-distance constraints.
- 2/23 Assignment 8. More on numerals. Introduction to Flag diacritics.
- 2/25 Assignment 9. More on Flag diacritics.
- 3/2 Left-to-right, longest-match replacement
- 3/4 Tokenization, syntactic markup
- 3/9, 3/11 Spring Break
- 3/18 Optimality theory, lenient composition
- 3/30 Disambiguation
- 4/1 Finite-state parsing
- 4/6 Bottom-up incremental parsing, xfsp tool.
- 4/8 Top-down incremental parsing, Roche 1997.
- 4/13 Top-down parsing, Optimality mapping theory, Finite-state
approximation of CF grammars.
- 4/15 Finite-state approximation of CF grammars, HMMs.
- 4/20 Project reviews
- 4/22 Project reviews, Reflexions
Software for the course
You must be a registered student or an approved auditor to have access to the
software. Please add /pkg/cis639/bin to your
PATH variable. You can launch the applications but there is no
read access to the directory.
- Development Tools
- twolc
- Two-Level Rule Compiler works
on rule systems written in the widely used two-level formalism.
The compiler converts each rule into a deterministic, minimized
transducer.
- lexc
- Finite-State Lexicon Compiler
is an authoring tool for creating lexical transducers. It is designed
to be used in conjunction with transducers produced with the Xerox
Two-level Rule Compiler.
- xfst
- Xerox Finite-State Tool is a
general-purpose utility for computing with finite-state networks.
It enables the user to create simple automata and transducers from
text and binary files, regular expressions and other networks by a
variety of operations. The user can display, examine and modify the
structure and the content of the networks. The result can be saved
as text or binary files.
- Simple runtime utilities
- tokenize
- Finite-state tokenizer.
Breaks text into tokens, one token per line.
- lookup
- Finite-state morphological
analyzer. Applies one or more lexical transducers to
each token, producing one line per possible analysis.
- disamb
- HMM disambiguator.
Takes the output from the analyzer and eliminates the lines
that contain a disfavored part-of-speech tag.
- Shell scripts
- tag
- tag -l language applies the tokenizer, analyzer,
and disambiguator in sequence to an input file
for the given language.
- inxight-tag
- inxight-tag invokes a fast commercial English HMM tagger.
This is a part of Inxight's LinguistiX
product suite. See Inxight's
documentation about tagging in general and about the specifics
of the English tagger.
- inxight-npr
- inxight-npr runs Inxight's
English noun phrase recognizer,
another component of the LinguistiX suite.
- Language Modules
- Tokenizers, analyzers, and HMM disambiguators for
English, French, German, Spanish, and Italian.
- Parsing Tools
- xfsp
- Incremental Finite-State Parser
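The runtime utilities above form a pipeline: tokenize, then look up every analysis of each token, then discard disfavored readings. The Python sketch below imitates that flow on toy data. It does not call the course tools. The lexicon, the tag names, and the PREFERRED_NEXT table are all assumptions invented for this example, and the disambiguation step is a crude greedy stand-in for an HMM, not an HMM.

```python
import re
from typing import Dict, List, Optional

# Hypothetical toy lexicon: surface form -> possible analyses (lemma+tags).
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["the+Det"],
    "can": ["can+Aux", "can+Noun", "can+Verb"],
    "rusts": ["rust+Noun+Pl", "rust+Verb+3sg"],
}

# Assumed preference table: which tag is favored after a given tag.
PREFERRED_NEXT: Dict[Optional[str], str] = {"Det": "Noun", "Noun": "Verb"}


def tokenize(text: str) -> List[str]:
    """Break text into lowercase tokens: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def lookup(token: str) -> List[str]:
    """Return one analysis per reading; unknown tokens get a '+?' marker."""
    return TOY_LEXICON.get(token, [token + "+?"])


def main_tag(analysis: str) -> str:
    """Extract the first tag after the lemma, e.g. 'rust+Verb+3sg' -> 'Verb'."""
    return analysis.split("+")[1]


def disambiguate(readings: List[List[str]]) -> List[str]:
    """Greedy left-to-right choice; a stand-in for HMM disambiguation."""
    chosen: List[str] = []
    prev_tag: Optional[str] = None
    for options in readings:
        wanted = PREFERRED_NEXT.get(prev_tag)
        # Take the favored reading if present, otherwise the first one listed.
        pick = next((a for a in options if main_tag(a) == wanted), options[0])
        chosen.append(pick)
        prev_tag = main_tag(pick)
    return chosen


def tag(text: str) -> List[str]:
    """Run tokenizer, analyzer, and disambiguator in sequence."""
    return disambiguate([lookup(t) for t in tokenize(text)])


print(tag("The can rusts."))
# ['the+Det', 'can+Noun', 'rust+Verb+3sg', '.+?']
```

A real HMM disambiguator scores whole tag sequences with transition and emission probabilities rather than committing to one tag at a time as this sketch does.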
Tutorials
Beesley & Karttunen Book (DRAFT)
- Finite-State Morphology
- A gentle introduction to the art of creating morphological
analyzers with Xerox tools. For readers who have had some
training in formal linguistics and some previous programming
experience but no prior knowledge of regular expressions,
automata, sets, relations, or formal language theory.
The first chapters are probably too elementary for most
students in this course but some of the later sections and
the exercises may be useful. This is
an unfinished draft. Please do not quote or circulate.
(Postscript, 377 pages, 3 MB)
karttunen@cis.upenn.edu
Last modified September 27, 1999