KOREAN NLP AT THE UNIVERSITY OF PENNSYLVANIA

Korean is one of the major languages in multilingual NLP research at the University of Pennsylvania. This site introduces three main projects on Korean NLP currently being conducted at Penn: Korean XTAG, Korean Treebank, and Korean/English machine translation. These projects are partially funded by the Army Research Lab via a subcontract from CoGenTex, Inc., and by NSF Grant SBR 8920230.

Korean XTAG

Korean Treebank and Propbank

Korean Morphological Analysis and Tagging

Korean Syntactic Parsing

Korean/English Machine Translation

Papers

People

Some Links to NLP in Korea

Korean XTAG

Korean XTAG is an ongoing project to develop a wide-coverage grammar for Korean using the Feature-Based Lexicalized Tree Adjoining Grammar (LTAG) formalism. As its grammar development environment, it uses the XTAG system, which we have customized for Korean TAG development. The XTAG system was originally developed for English TAG and consists of a parser, an X-windows grammar development interface and a POS tagger. We have modified the original XTAG system and incorporated a Korean morphological analyzer to handle the rich inflectional morphology of Korean and to facilitate lexicon development and parsing. A fuller description of the Korean XTAG system can be found in our TAG+5 workshop paper.

LTAG is based on the Tree Adjoining Grammar (TAG) formalism developed by Joshi, Levy and Takahashi (1975). The TAG formalism in general, and lexicalized TAG in particular, is well suited to linguistic applications. An LTAG consists of a finite set of elementary trees, each anchored by a lexical item, together with two composition operations, substitution and adjunction. The elementary trees represent extended projections of lexical items and encapsulate the syntactic/semantic arguments of the lexical anchor. In the last decade, the LTAG approach has been applied to various NLP tasks such as parsing, machine translation, information retrieval, generation and summarization. An introduction to LTAG and the current status of our Korean LTAG grammar are documented in our technical report.
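
As a rough illustration of the two composition operations, the Python sketch below combines three toy elementary trees. Everything in it is an assumption made for illustration: the Node class, the substitute and adjoin functions, and the English example trees are hypothetical and are not part of the XTAG system or of the Korean grammar, and feature structures are left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                  # category, e.g. "NP" or "VP"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                  # lexical anchor on a leaf
    subst: bool = False                         # open substitution site
    foot: bool = False                          # foot node of an auxiliary tree


def substitute(tree: Node, initial: Node) -> bool:
    """Fill the first open substitution site whose label matches initial."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(node: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, if it has one."""
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin aux at the first interior node below the root labelled aux.label.

    The subtree that was rooted at that node is re-attached under the foot.
    """
    for i, child in enumerate(tree.children):
        if child.label == aux.label and child.children and not child.subst:
            foot = find_foot(aux)
            if foot is None:
                return False
            foot.children, foot.foot = child.children, False
            tree.children[i] = aux
            return True
        if adjoin(child, aux):
            return True
    return False


def words(node: Node) -> List[str]:
    """Read the terminal string off a derived tree, left to right."""
    if node.word is not None:
        return [node.word]
    return [w for c in node.children for w in words(c)]


# Toy elementary trees (English, for readability only).
s = Node("S", [Node("NP", subst=True), Node("VP", [Node("V", word="sleeps")])])
np = Node("NP", [Node("N", word="John")])
aux = Node("VP", [Node("Adv", word="often"), Node("VP", foot=True)])

substitute(s, np)   # plug the NP tree into the subject slot
adjoin(s, aux)      # splice the adverb tree in at VP
print(" ".join(words(s)))  # John often sleeps
```

Substitution fills an argument slot that the anchor's tree already provides, whereas adjunction inserts a modifier tree into the middle of an existing structure. This is why the elementary tree of a verb can list all of its arguments locally.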

Korean Treebank

A Treebank is a corpus annotated with morphological and syntactic information. Each word in the corpus is annotated with morpho-syntactic tags and each sentence is bracketed to represent its structural analysis. This kind of corpus has served as an extremely valuable resource for computational linguistics applications, and has also proved useful in theoretical linguistics research.

Annotation Format

For syntactic bracketing, we use a phrase structure annotation. Similar phrase structure annotation schemes are used by the Penn English Treebank, the Penn Middle English Treebank and the Penn Chinese Treebank. We prefer this annotation to a pure dependency annotation because it can encode richer structural information, as the following contrasts illustrate:

Phrase structure annotation has phrasal level node labels such as VP and NP, whereas dependency annotation does not have any node labels.
Phrase structure annotation can explicitly represent empty arguments, but dependency annotation cannot.
Phrase structure annotation can distinguish between complementation and adjunction, but dependency annotation cannot.
Phrase structure annotation can make use of traces for displaced constituents, whereas dependency annotation cannot.
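
To make these contrasts concrete, the Python sketch below reads one bracketed tree and lists its phrasal labels and empty elements. The sample tree, its tag names (NP-SBJ, NP-OBJ, VV), the *pro* notation and the English glosses are assumptions chosen for illustration; they are not taken from the Korean Treebank or its guidelines.

```python
import re
from typing import List


def parse_bracketed(text: str):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (words and empty elements) are returned as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok                  # leaf token
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                        # consume the closing ")"
        return (label, children)

    return read()


def labels(tree) -> List[str]:
    """Collect node labels in pre-order."""
    if isinstance(tree, str):
        return []
    label, children = tree
    return [label] + [l for c in children for l in labels(c)]


def leaves(tree) -> List[str]:
    """Collect leaf tokens left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in leaves(c)]


# Hypothetical verb-final clause with a dropped subject, glossed in English.
SAMPLE = "(S (NP-SBJ *pro*) (VP (NP-OBJ report) (VV read)))"

tree = parse_bracketed(SAMPLE)
node_labels = labels(tree)
empties = [w for w in leaves(tree) if w.startswith("*")]

print(node_labels)  # ['S', 'NP-SBJ', 'VP', 'NP-OBJ', 'VV']
print(empties)      # ['*pro*']
```

The node labels and the explicit empty subject are exactly the kinds of information that the list above attributes to phrase structure annotation.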

Corpus

The corpus for the Korean Treebank project consists of texts from military language training manuals. These texts contain information about various aspects of the military, such as troop movement, intelligence gathering, and equipment supplies, among others. The manuals were originally in printed form, so we converted them into machine-readable form for use in our Treebank. The corpus contains 54,366 words and 5,078 sentences.

Guidelines and a Sample File

Applications

The linguistic information in the Korean Treebank will provide a standard framework in which to train and evaluate tools such as POS taggers and stochastic parsers.
The Treebank will also be used to extract lexicalized grammars, e.g. a Korean Tree Adjoining Grammar, which can be used for other applications, such as natural language generation. There are already tools developed at Penn that train parsers and extract Tree Adjoining Grammars from a phrase-structure-based Treebank (Xia 1999), which will be equally applicable to the Korean Treebank.
Having an on-line corpus of parsed texts will be extremely useful for research in corpus linguistics and will lead to many interesting theoretical results.

Korean Morphological Analysis and Tagging

... about Korean morphological analysis and tagging ...

Korean Syntactic Parsing

... about Korean syntactic parsing ...

Korean/English Machine Translation

This is a joint project with CoGenTex and Systran.

Basic Elements of our Approach

Given that Korean and English are very different from each other in structure and morphology, many challenging problems arise, demanding sophisticated linguistic analysis. The basic elements of our approach include:

Following the model described in Palmer, Rambow and Nasr (1998) for English/French translation, our system has a plug-and-play architecture that is composed of state-of-the-art off-the-shelf components in parsing (and morphological analysis) and generation. These components communicate with each other via a common predicate-argument structure representation.
Our system is a hybrid system that profits from a stochastic parser that was independently trained on domain-general corpora and a hand-crafted linguistic knowledge base in the form of a predicate-argument lexicon and linguistically sophisticated transfer rules. The linguistic knowledge base plays an important role in handling structural divergences and recovering dropped arguments.
For defining transfer rules, we use the 'lexico-structural transfer' framework, which is based on a lexicalized predicate-argument structure. In this framework, the transfer lexicon does not simply relate words (or context-free rewrite rules) from one language to words (or context-free rewrite rules) from another language. Instead, lexemes and their relevant syntactic structures (essentially, their syntactic projection along with syntactic/semantic features) are mapped. This framework was applied previously in English/French and English/Arabic MT (Nasr et al. 1997; Palmer, Rambow and Nasr 1998).
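
The elements above can be pictured with a small Python sketch in which two stages pass along a shared predicate-argument structure. It is a heavily simplified illustration under stated assumptions, not the system itself: the PredArg and TransferRule classes, the role names, the PRO placeholder and the lexicon entries are all hypothetical, and the K_READ and K_BOOK names merely stand in for Korean lexemes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PredArg:
    """Shared representation: a lexeme with labelled arguments and features."""
    lexeme: str
    args: Dict[str, "PredArg"] = field(default_factory=dict)
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class TransferRule:
    """Maps a source lexeme and its argument structure to the target side."""
    target: str                         # target-language lexeme
    role_map: Dict[str, str]            # source role -> target role
    features: Dict[str, str] = field(default_factory=dict)


# Hypothetical lexicon entries; K_* names stand in for Korean lexemes.
PRED_ARG_LEXICON: Dict[str, List[str]] = {"K_READ": ["agent", "theme"]}
TRANSFER_LEXICON: Dict[str, TransferRule] = {
    "K_READ": TransferRule("read", {"agent": "subject", "theme": "object"}),
    "K_BOOK": TransferRule("book", {}),
    "PRO": TransferRule("PRO", {}),
}


def recover_dropped(pa: PredArg) -> PredArg:
    """Add a PRO placeholder for each required role missing from the parse."""
    for role in PRED_ARG_LEXICON.get(pa.lexeme, []):
        pa.args.setdefault(role, PredArg("PRO"))
    for arg in pa.args.values():
        recover_dropped(arg)
    return pa


def transfer(pa: PredArg) -> PredArg:
    """Map each lexeme together with its argument structure, recursively."""
    rule = TRANSFER_LEXICON[pa.lexeme]
    return PredArg(
        lexeme=rule.target,
        args={rule.role_map.get(r, r): transfer(a) for r, a in pa.args.items()},
        features={**pa.features, **rule.features},
    )


# Parser output for a clause whose subject was dropped (hypothetical).
source = PredArg("K_READ", {"theme": PredArg("K_BOOK")})

result = source
for stage in (recover_dropped, transfer):   # stages share one representation
    result = stage(result)

print(result.lexeme, sorted(result.args))   # read ['object', 'subject']
```

Because each stage only consumes and produces the shared structure, a parser or generator could in principle be swapped without touching the transfer lexicon. A generation component, omitted here, would then have to decide how to realize the recovered PRO argument in English.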

Corpus

The corpus for this project is a set of Korean/English parallel texts consisting of battle scenario message traffic and military language training manuals, which contain information on typical military events such as troop movement, intelligence gathering, and equipment supplies, among others. Each half has roughly 50,000 word tokens and 5,000 sentences.

Presentations

Some issues concerning Korean/English machine translation
Presented at ARL's Federated Laboratories Symposium (FEDLAB), University of Maryland, College Park, February 2-4, 1999
Slides explaining the demo of our Korean/English machine translation system at the Army Research Lab, February 10, 2000
Example sentences and the output of the MT system
Predicate argument lexicon and transfer lexicon
Overview of the corpus and current status

Papers

Chung-hye Han, Na-Rae Han and Eon-Suk Ko
Development and Evaluation of a Korean Treebank and its Application to NLP, Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002)
Chung-hye Han, Na-Rae Han, Eon-Suk Ko, Heejong Yi and Martha Palmer
Penn Korean Treebank: Development and Evaluation, Proceedings of the 16th Pacific Asia Conference on Language, Information and Computation. The Korean Society for Language and Information. (2002)
Chung-hye Han, Na-Rae Han and Eon-Suk Ko
Bracketing Guidelines for Penn Korean TreeBank, Technical Report, IRCS-01-10 (2001)
Chung-hye Han, Na-Rae Han
Part of Speech Tagging Guidelines for Penn Korean Treebank, Technical Report, IRCS-01-09 (2001)
Chung-hye Han, Juntae Yoon, Nari Kim and Martha Palmer
A Feature-Based Lexicalized Tree Adjoining Grammar for Korean, Technical Report, IRCS-00-04 (2000)
Chung-hye Han, Benoit Lavoie, Martha Palmer, Owen Rambow, Richard Kittredge, Tanya Korelsky, Nari Kim and Myunghee Kim
Handling Structural Divergences and Recovering Dropped Arguments in a Korean/English Machine Translation System
Proceedings of the Association for Machine Translation in the Americas 2000.
Published in Lecture Notes in AI series of Springer-Verlag, © Springer-Verlag (2000).
Juntae Yoon, Chung-hye Han, Nari Kim and Mee-sook Kim
Customizing the XTAG system for efficient grammar development for Korean
Proceedings of the Fifth International Workshop on Tree Adjoining Grammars and Related Formalisms, TAG+ 5 (2000).
Chung-hye Han and Owen Rambow
The Sino-Korean light verb construction and lexical argument structure
Proceedings of the Fifth International Workshop on Tree Adjoining Grammars and Related Formalisms, TAG+ 5 (2000)
Martha Palmer, Dania Egedi, Chung-hye Han, Fei Xia and Joseph Rosenzweig
Constraining Lexical Selection Across Languages Using TAGs.
Tree Adjoining Grammars: Formal, Computational and Linguistic Aspects (TAG+ 3 Workshop Proceedings)
Eds. Anne Abeille and Owen Rambow, CSLI, Stanford (2000).
Chung-hye Han, Fei Xia, Martha Palmer, Joseph Rosenzweig.
Capturing Language Specific Constraints on Lexical Selection with Feature-Based Lexicalized Tree-Adjoining Grammars
Proceedings of International Conference on Chinese Computing '96 (ICCC '96).

People

Faculty

Graduate Students

Na-Rae Han
Jinyoung Choi
Yeongmi Jeon

Staff

Shijong Ryu

Visitors

Seunghun Lee (Rutgers University)
Sung-Dong Kim (Hansung University, Korea)
Sinwon Yoon (Paris 7 University, France)

Thanks to

Mee-sook Kim (participated from Nov. 1999 to July 2000)
Nari Kim (participated from Mar. 1998 to Dec. 1999, now at Konan Technology, Inc.)
Juntae Yoon (participated from Mar. 1999 to Mar. 2000, now at Daum Communications)
Jong-Cheol Park (participated at the very early phase of the project, now at KAIST)
Heejong Yi (participated in 1998)
Eon-Suk Ko (participated from Spring 1998 to Spring 2000)
Seungyun Yang (participated from Spring 1999 to Spring 2000)
Myuncheol Kim (participated from Spring 1999 to Spring 2000)
Chung-hye Han (participated from Spring 1998 to August 2001)
Chulwoo Park (participated from Spring 1999 to February 2002)
