(This web page is permanently under construction.)
The Penn Treebank Project
The Penn Treebank Project annotates naturally occurring text for linguistic
structure. Most notably, we produce skeletal parses showing rough
syntactic and semantic information -- a bank of linguistic
trees.
We also annotate text with part-of-speech tags and, for the Switchboard
corpus of telephone conversations, with dysfluency annotation.
We are located in the LINC Laboratory of the Computer and Information
Science Department at the University of Pennsylvania.
All data produced by the Treebank is released through the
Linguistic Data Consortium.
Descriptions and samples of annotated corpora:
Wall Street Journal |
The Brown Corpus |
Switchboard |
ATIS
On-line tgrep searches are now possible for those with LDC Online access.
Frequently Asked Questions (FAQs)
tokenization
NP heads and Base NPs in Treebank II bracketing
Annotation Style Manuals
Part-of-speech tagging
Treebank I bracketing was used until 12/92.
Treebank II bracketing is designed to allow the extraction of
simple predicate-argument structure.
Dysfluency annotation used for Switchboard corpus only
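As a rough illustration of what extracting simple predicate-argument structure from bracketed trees can look like, here is a minimal sketch. It assumes a single parenthesized tree given as a string, without the empty outer bracket that wraps trees in the distributed files, and it assumes the subject is marked with an -SBJ function tag. The example sentence and the function names (parse_bracketed, leaves, subject_and_predicate) are hypothetical and are not part of any Treebank tool.

```python
import re


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. Assumes well-formed input with no empty labels.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(tree):
    """Return (word, part-of-speech tag) pairs in sentence order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves(child))
    return pairs


def words(tree):
    return " ".join(word for word, _ in leaves(tree))


def subject_and_predicate(clause):
    """Return (subject words, verb phrase words) for one clause node."""
    _, children = clause
    subject = predicate = None
    for child in children:
        if not isinstance(child, tuple):
            continue
        child_label = child[0]
        if child_label.startswith("NP") and "-SBJ" in child_label:
            subject = words(child)
        elif child_label == "VP":
            predicate = words(child)
    return subject, predicate


# Hypothetical example sentence, not taken from any corpus.
example = "(S (NP-SBJ (NNP John)) (VP (VBD loved) (NP (NNP Mary))) (. .))"
tree = parse_bracketed(example)
print(leaves(tree))
print(subject_and_predicate(tree))
```

The sketch ignores traces, null elements, and coindexation, all of which real Treebank II trees contain; the style manuals listed above describe the actual conventions.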
Treebank Releases on CD
Preliminary Release, Version 0.5 CDROM, 1992
Release 2 CDROM, 1995
Publications
A nice
overview of the project (before Treebank II style), Computational
Linguistics, vol. 19, 1993.
Introduction to
predicate-argument bracketing (a.k.a. Treebank II), ARPA '94.
Personnel
- Principal Investigator:
- Mitchell Marcus
- Project Administrator:
- Ann Taylor
- Programmer/Data Manager:
- Robert MacIntyre
- Annotators:
- Ann Bies, Constance Cooper, Mark Ferguson, Alyson Littman
Links to other sites
AMALGAM
Project (Automatic Mapping Among Lexico-Grammatical Annotation Models)
CCALAS
(Centre for Computer Analysis of Language and Speech)
The LDC's Linguistic
Annotation Page Tools and formats for creating linguistic
annotations.
This web page is maintained by
treebank@unagi.cis.upenn.edu.
Last change: 1999/02/02 17:57:13 UTC.