University of Pennsylvania

Institute For Research in Cognitive Science
The Penn Discourse Treebank Project is an NSF-funded project, supported by NSF grants:
IIS-14-22186 (2014-2017)
IIS-14-21067 (2014-2017)
CNS-10-59353 (2011-2013)
IIS-07-05671 (2007-2012)
CNS-02-24417 (2002-2006)

The PDTB 3.0 corpus was released on March 15, 2019 through the Linguistic Data Consortium. N.B. The corpus was updated on February 4, 2020, to include the final versions of two files of to-clause annotation that were discovered not to have been loaded earlier, as well as several tokens that were inadvertently omitted on the assumption that they were duplicates, when they were not. Specific changes/additions are recorded in the file pdtb3-revision-jan-2020.txt.

For an introduction to PDTB 3.0 and the PDTB 3.0 Annotation Manual, click here. Please visit the tools page for technical support.

The PDTB 2.0 corpus is still available from this LDC page.

The Penn Discourse Treebank (PDTB) is a large-scale corpus annotated with information related to discourse structure and discourse semantics. While there are many aspects of discourse that are crucial to a complete understanding of natural language, the PDTB focuses on encoding discourse relations. The annotation methodology follows a lexically-grounded approach. The PDTB has striven to remain theory-neutral with respect to the nature of the high-level representation of discourse structure, so that the corpus can be used within different theoretical frameworks. Theory-neutrality is achieved by keeping annotations of discourse relations "low-level": each discourse relation is annotated independently of other relations, that is, dependencies across relations are not marked.
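
As a rough illustration of what "low-level" annotation implies for anyone processing the corpus, the sketch below models relations as a flat list of independent records, none of which points at another. The class name, field names, example sentences, and sense strings are all hypothetical and chosen only for illustration; they are not the PDTB file format, which is defined in the annotation manual.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DiscourseRelation:
    """Hypothetical record for one relation (not the PDTB file format)."""
    connective: Optional[str]  # lexical anchor; None when nothing is expressed
    arg1: str                  # text of the first argument
    arg2: str                  # text of the second argument
    sense: str                 # illustrative sense label


# A flat list: each record stands alone and holds no reference to any
# other record, mirroring "dependencies across relations are not marked".
relations: List[DiscourseRelation] = [
    DiscourseRelation(
        connective="because",
        arg1="The flight was delayed",
        arg2="a storm closed the airport",
        sense="Contingency.Cause.Reason",
    ),
    DiscourseRelation(
        connective="but",
        arg1="The flight was delayed",
        arg2="the passengers stayed calm",
        sense="Comparison.Concession",
    ),
]


def senses_for_connective(rels: List[DiscourseRelation], connective: str) -> List[str]:
    """Collect sense labels for one connective, treating every record independently."""
    return [r.sense for r in rels if r.connective == connective]
```

Because the records are independent, a framework that does want a tree or graph over relations can build one on top of such a list without first undoing structure imposed by the annotation.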

The PDTB aims to support the extraction of a range of inferences associated with discourse relations, for a wide range of NLP applications, such as parsing, information extraction, question answering, summarization, machine translation, and generation, as well as corpus-based studies in linguistics and psycholinguistics.

PDTB 3.0 annotation guidelines, annotation format, and summary distributions are provided in the manual:
Bonnie Webber, Rashmi Prasad, Alan Lee and Aravind Joshi. 2019. The Penn Discourse Treebank 3.0 Annotation Manual. (available at

The following publication describes the PDTB 2.0 corpus:
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco.

PDTB 2.0 annotation guidelines, annotation format, and summary distributions are provided in:
The PDTB Research Group. 2008. The PDTB 2.0 Annotation Manual. Technical Report IRCS-08-01. Institute for Research in Cognitive Science, University of Pennsylvania.

The PDTB project also aims to conduct empirical research with the PDTB corpus, for NLP as well as theoretical linguistics. See the publications for PDTB-related research supported by the project.