KOREAN NLP AT THE UNIVERSITY OF PENNSYLVANIA

Korean is one of the major languages in multilingual NLP research at the University of Pennsylvania. This site introduces the three main Korean NLP projects currently under way at Penn: Korean XTAG, Korean Treebank, and Korean/English machine translation. These projects are partially funded by the Army Research Lab via a subcontract from CoGenTex, Inc., and by NSF Grant SBR 8920230.


Korean XTAG



Korean XTAG is an ongoing project to develop a wide-coverage grammar for Korean using the Feature-Based Lexicalized Tree Adjoining Grammar (LTAG) formalism. As its grammar development system, it uses the XTAG system, which we have customized for Korean TAG development. The XTAG system was originally developed for English TAG and consists of a parser, an X-windows grammar development interface, and a POS tagger. We have modified the original XTAG system and incorporated a Korean morphological analyzer to handle the rich inflectional morphology of Korean and to facilitate lexicon development and parsing. A fuller description of the Korean XTAG system can be found in our TAG+5 workshop paper.

LTAG is based on the Tree Adjoining Grammar (TAG) formalism developed by Joshi, Levy and Takahashi (1975). The TAG formalism in general, and lexicalized TAG in particular, is well-suited for linguistic applications. An LTAG consists of a finite set of elementary trees, each anchored by a lexical item, together with two composition operations, substitution and adjunction. The elementary trees represent extended projections of lexical items and encapsulate the syntactic/semantic arguments of the lexical anchor. In the last decade, the LTAG approach has been applied to various NLP tasks such as parsing, machine translation, information retrieval, generation, and summarization. An introduction to LTAG and the current status of our Korean LTAG grammar are documented in our technical report.
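
The two composition operations can be made concrete with a small sketch. The Python code below is an illustration only, not the XTAG implementation: the class `Node`, the functions `substitute`, `adjoin`, `words`, `show` and `build_demo`, the three toy elementary trees, and the romanized example words are all assumptions introduced here, and feature structures are left out entirely. Substitution fills an open leaf with an initial tree of the same label; adjunction splices an auxiliary tree into an internal node and hangs that node's old subtree under the auxiliary tree's foot node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None   # lexical anchor, set on leaves only
    subst: bool = False          # open substitution site
    foot: bool = False           # foot node of an auxiliary tree


def substitute(tree: Node, initial: Node) -> bool:
    """Fill the first open substitution site whose label matches initial's root."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def _find_foot(node: Node) -> Optional[Node]:
    if node.foot:
        return node
    for child in node.children:
        hit = _find_foot(child)
        if hit is not None:
            return hit
    return None


def _find_site(node: Node, label: str) -> Optional[Node]:
    """First internal node (pre-order) with the given label."""
    if node.label == label and node.children and not node.subst and not node.foot:
        return node
    for child in node.children:
        hit = _find_site(child, label)
        if hit is not None:
            return hit
    return None


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin auxiliary tree aux at the first internal node labeled like its root."""
    foot = _find_foot(aux)
    site = _find_site(tree, aux.label)
    if foot is None or foot is aux or site is None or foot.label != aux.label:
        return False
    # The site's old subtree moves under the foot; the site takes over aux's body.
    foot.children, foot.foot = site.children, False
    site.children = aux.children
    return True


def words(node: Node) -> List[str]:
    """Left-to-right lexical yield of a derived tree."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in words(child)]


def show(node: Node) -> str:
    """Bracketed rendering of a tree."""
    if node.word is not None:
        return f"({node.label} {node.word})"
    if not node.children:
        return f"({node.label})"
    return "(" + node.label + " " + " ".join(show(c) for c in node.children) + ")"


def build_demo() -> Node:
    # Toy elementary trees (hypothetical, not taken from the Korean XTAG grammar).
    verb_tree = Node("S", [Node("NP", subst=True),
                           Node("VP", [Node("V", word="dallinda")])])
    noun_tree = Node("NP", word="Cheolsu-ga")
    adverb_tree = Node("VP", [Node("Adv", word="ppalli"),
                              Node("VP", foot=True)])
    substitute(verb_tree, noun_tree)   # fill the open NP site
    adjoin(verb_tree, adverb_tree)     # wrap the VP with the adverb tree
    return verb_tree


if __name__ == "__main__":
    print(show(build_demo()))
```

In this toy derivation the verb-anchored tree supplies the clause skeleton, the noun tree is substituted into its open NP site, and the adverb tree is adjoined at VP, so the derived tree yields the three words in the order Cheolsu-ga, ppalli, dallinda.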



Korean Treebank

A Treebank is a corpus annotated with morphological and syntactic information. Each word in the corpus is annotated with morpho-syntactic tags and each sentence is bracketed to represent its structural analysis. This kind of corpus has served as an extremely valuable resource for computational linguistics applications, and has also proved useful in theoretical linguistics research.
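
As a rough sketch of how such an annotated sentence can be processed, the Python code below reads one bracketed tree and lists each word with its tag. The sample sentence, the tag names (`S`, `NP`, `VP`, `N`, `ADV`, `V`) and the function names `parse_brackets` and `tagged_words` are assumptions made for illustration; they are not the tag set or file format of the Korean Treebank.

```python
import re

# Hypothetical bracketed sentence; the tags are placeholders, not Korean Treebank tags.
SAMPLE = "(S (NP (N Cheolsu-ga)) (VP (ADV ppalli) (V dallinda)))"


def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either sub-trees (tuples) or plain word strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    try:
        tree = read()
    except IndexError:
        raise ValueError("unbalanced brackets") from None
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def tagged_words(tree):
    """Return (word, tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(tagged_words(child))
    return pairs


if __name__ == "__main__":
    for word, tag in tagged_words(parse_brackets(SAMPLE)):
        print(word, tag)
```

The nested tuples keep the bracketing, so the same structure can serve both for tag-level work (the word/tag pairs) and for queries over phrases.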

Annotation Format

For syntactic bracketing, we use a phrase structure annotation. Similar phrase structure annotation schemes were also used by the Penn English Treebank, the Penn Middle English Treebank and the Penn Chinese Treebank. This annotation is preferable to a pure dependency annotation because a phrase structure annotation can encode richer structural information than a dependency annotation, as illustrated below:

Corpus

The corpus for the Korean Treebank project consists of texts from military language training manuals. These texts contain information about various aspects of the military, such as troop movement, intelligence gathering, and equipment supplies, among others. The manuals were originally in printed form, so we converted them into machine-readable form for use in our Treebank. The corpus contains 54,366 words and 5,078 sentences.

Guidelines and a Sample File

Applications


Korean Morphological Analysis and Tagging

... about Korean morphological analysis and tagging ...



Korean Syntactic Parsing

... about Korean syntactic parsing ...



Korean/English Machine Translation

This is a joint project with CoGenTex and Systran.

Basic Elements of our Approach

Given that Korean and English are very different from each other in structure and morphology, many challenging problems arise, demanding sophisticated linguistic analysis. The basic elements of our approach include:

Corpus

The corpus for this project is a set of Korean/English parallel texts consisting of battle scenario message traffic and military language training manuals. The texts contain information on typical military events such as troop movement, intelligence gathering, and equipment supplies, among others. Each half has roughly 50,000 word tokens and 5,000 sentences.

Presentations


Papers


People

Faculty

Graduate Students

Staff

Visitors

Thanks to


Some Links to NLP in Korea

This web page is maintained by Chung-hye Han
Last changed: 2004/08/18 20:31:03