Penn Chinese Treebank Project
Growing interest in Chinese Language Processing is leading to the development
of resources such as annotated corpora and automatic segmenters, part-of-speech
taggers and parsers. Currently these are all being developed independently,
often with quite different standards for segmentation, part-of-speech tagging
and syntactic bracketing. The time is ripe for an open discussion of the
methodological issues involved in achieving agreement on annotation.
Unlike Western and Middle Eastern writing systems, Chinese writing does not
have a natural delimiter between words, so appropriate word segmentation
becomes a prerequisite for any other NLP task. This problem has been
discussed extensively in the literature. The problem of part-of-speech
tagging is closely related. Both are prerequisites to the establishment
of a Chinese Treebank that could be of general use.
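To make the segmentation problem concrete, here is a minimal sketch of forward maximum matching, a simple dictionary-based baseline. It is not the method used by this project; the function name `forward_max_match`, the `max_len` parameter and the toy lexicon are assumptions made up for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}

segmented = forward_max_match("研究生命起源", LEXICON)
print(" ".join(segmented))
```

With this toy lexicon the greedy strategy picks 研究生 ("graduate student") first and leaves 命 stranded, although the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). Ambiguities of this kind are one reason shared segmentation standards and annotated data matter.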
We have completed building a 500-thousand-word Chinese Treebank.
Our aim is to work towards a community
consensus on guidelines that will include the input of influential researchers
from Taiwan, Singapore, Hong Kong, China and the US. To this end,
we held two workshops and a number of meetings between 7/1998 and 10/2000,
in the USA and abroad.
We are very interested in the community's reaction to our guidelines and
Treebank. We encourage anyone interested in getting involved to look at the
guidelines below, use the Treebank, which is available via LDC, and get in
touch with us with comments.
Descriptions of the project:
- Task: Building a segmented, POS-tagged and bracketed Chinese corpus. The
data consists of Xinhua newswire, Hong Kong news and articles from Sinorama.
- Project Status: The Chinese TreeBank (CTB) version 4.0, which has 404K
words, has been officially released via the Linguistic Data Consortium (LDC).
CTB 5.0, which will have 507K words, is also in the LDC data release
pipeline. It will be available at the end of 2004.
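As an illustration of what "segmented, POS tagged and bracketed" means in practice, the sketch below reads one Penn-style bracketed tree and recovers the segmented words with their part-of-speech tags. The sample sentence is invented for this example and is not taken from the corpus, the helper names `read_tree` and `tagged_words` are hypothetical, and the exact file layout of the LDC releases is not described here, so treat the input format as an assumption.

```python
import re


def read_tree(s):
    """Parse one Penn-style bracketed tree into nested lists:
    [label, child, ...], where a leaf child is a plain string."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]  # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())
            else:
                node.append(tokens[pos])  # the word itself
                pos += 1
        pos += 1  # consume ")"
        return node

    return parse()


def tagged_words(node):
    """Yield (word, tag) pairs from preterminal nodes, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        yield children[0], label
    else:
        for child in children:
            if isinstance(child, list):
                yield from tagged_words(child)


# Invented example sentence (roughly "Shanghai develops the economy").
SAMPLE = "(IP (NP (NR 上海)) (VP (VV 发展) (NP (NN 经济))))"

tree = read_tree(SAMPLE)
pairs = list(tagged_words(tree))
print(pairs)
```

The three annotation layers are all visible in one structure: the leaves give the segmentation, the labels directly above them give the POS tags, and the nesting gives the syntactic bracketing.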
Penn guidelines for Chinese Treebank
Developing Guidelines and Ensuring Consistency for Chinese Text Annotation
- Fei Xia, Martha Palmer, Nianwen Xue,
Mary Ellen Okurowski, John Kovarik, Fu-Dong Chiou,
Shizhe Huang, Tony Kroch, and Mitch Marcus
- Proceedings of the Second International Conference on Language Resources
and Evaluation (LREC 2000), Athens, Greece, 2000.
Facilitating Treebank Annotation with a Statistical Parser
- Fu-Dong Chiou, David Chiang, and Martha Palmer
- Proceedings of the Human Language Technology Conference (HLT 2001), San
Diego, California, 2001.
Building a Large-Scale Annotated Chinese Corpus
- Nianwen Xue, Fu-Dong Chiou, and Martha Palmer
- Proceedings of the 19th International Conference on Computational
Linguistics (COLING 2002), Taipei, Taiwan, 2002.
The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus
- Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer
- Natural Language Engineering, 11(2):207-238, 2005.
- Principal Investigators:
- Martha Palmer
- Shizhe Huang
- Mary Ellen Okurowski
- Boyan A. Onyshkevych
- Project Managers:
- Shudong Huang (September - December 1998)
- Fei Xia (September 1998 - December 2000)
- Nianwen Xue (May 1999 - May 2000)
- Fu-Dong Chiou (January 2001 - present)
- Guideline Designers:
- Fei Xia
- Programming Support:
- Zhibiao Wu (September 1998 - September 2000)
- Scott Cotton (October - December 2000)
- Meiyu Chang (June 2003 - present)
- Fu-Dong Chiou (September 1998 - present)
- Shudong Huang (September - December 1998)
- Tsan-Kuang Lee (June 2002 - present)
- Nianwen Xue (September 1998 - May 2000; September 2001 - November
Treebank Releases on
Preliminary Release: June 2000
Second Release: December 2000
Workshops and meetings
1st CLP Workshop (6-7/98), Philadelphia, USA
Meeting during ACL-98, Montreal, Canada (8/98)
Meeting during ICCIP-98, Beijing, China (11/98)
Meeting during ACL-99, Maryland, USA (6/99)
2nd CLP Workshop (10/00), Hong Kong
Links to other sites
Penn English Treebank Project
Penn Korean Treebank Project
Last modified on February 10, 2004.