Penn Chinese Treebank Project

Growing interest in Chinese Language Processing is leading to the development of resources such as annotated corpora and automatic segmenters, part-of-speech taggers and parsers. Currently these are all being developed independently, often with quite different standards for segmentation, part-of-speech tagging and syntactic bracketing. The time is ripe for an open discussion of the methodological issues involved in achieving agreement on annotation standards.

Unlike Western and Middle Eastern writing systems, Chinese writing has no natural delimiter between words, so appropriate word segmentation becomes a prerequisite for any other NLP task. This problem has been discussed extensively in the literature. The problem of part-of-speech tagging is closely related. Both are prerequisites to the establishment of a Chinese Treebank that could be of general use.

We have completed building a 500,000-word Chinese Treebank. Our aim is to work towards a community consensus on guidelines that will include the input of influential researchers from Taiwan, Singapore, Hong Kong, China and the US. To this end, we held two workshops and a number of meetings between July 1998 and October 2000, in the USA and abroad. We are very interested in the community's reaction to our guidelines and Treebank. We encourage anyone interested in getting involved to look at the guidelines listed below, use the Treebank, which is available via LDC, and get in touch with us with comments.

Descriptions of the project:

Penn guidelines for Chinese Treebank

Publications

2000: Developing Guidelines and Ensuring Consistency for Chinese Text Annotation
Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen Okurowski, John Kovarik, Fu-Dong Chiou, Shizhe Huang, Tony Kroch, and Mitch Marcus
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 2000.
2001: Facilitating Treebank Annotation with a Statistical Parser
Fu-Dong Chiou, David Chiang, and Martha Palmer
Proceedings of the Human Language Technology Conference (HLT 2001), San Diego, California, 2001.
2002: Building a Large-Scale Annotated Chinese Corpus
Nianwen Xue, Fu-Dong Chiou, and Martha Palmer
Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.
2005: The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus
Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer
Natural Language Engineering, 11(2):207-238, 2005.

Personnel

Principal Investigators:
Martha Palmer, Mitch Marcus, Tony Kroch
Consultants:
Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevych
Project Managers:
Shudong Huang (September - December, 1998)
Fei Xia (September 1998 - December 2000)
Nianwen Xue (May 1999 - May 2000)
Fu-Dong Chiou (January 2001 - present)
Guideline Designers:
Fei Xia, Nianwen Xue
Programming Support:
Zhibiao Wu (September 1998 - September 2000)
Scott Cotton (October - December, 2000)
Annotators:
Meiyu Chang (June 2003 - present)
Fu-Dong Chiou (September 1998 - present)
Shudong Huang (September - December, 1998)
Tsan-Kuang Lee (June 2002 - present)
Nianwen Xue (September 1998 - May 2000; September 2001 - November 2002)

Sample Files

Treebank Releases on CD

  • Preliminary Release: June 2000, see the announcement

  • Second Release: Dec 2000, see the announcement

Workshops and meetings

  • 1st CLP Workshop (6-7/98), Philadelphia, USA
  • meeting during ACL-98, Montreal, Canada (8/98)
  • meeting during ICCIP-98, Beijing, China (11/98)
  • meeting during ACL-99, Maryland, USA (6/99)
  • 2nd CLP Workshop (10/00), Hong Kong, China

Links to other sites

  • Penn English Treebank Project
  • Penn Korean Treebank Project

Acknowledgment



Last modified on February 10, 2004.