Explicit Discourse Connectives Tagger - December 14, 2009 ------------------------------------------------------------------------------- This tool is built to automatically identify explicit discourse connectives and their sense (Expansion, Contingency, Comparison, Temporal). It takes syntactic parse trees as input and outputs augmented trees with tags for each discourse connective. This tool is based on the work described in: Emily Pitler and Ani Nenkova. Using Syntax to Disambiguate Explicit Discourse Connectives in Text. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 13-16. QUICKSTART At a terminal, while in the addDiscourse directory, type: perl addDiscourse.pl --parses sample-parse.txt This will read in the parse trees in sample-parse.txt and output them augmented with discourse connective tags. Each word or phrase which can be a discourse connective is tagged with an ID number and its predicted sense (or 0 if predicted non-discourse). The word, its id, and its sense are separated by # marks. The possible senses are: Expansion, Contingency, Comparison, Temporal, or 0 (non-discourse usage). The four discourse usage tags correspond to the top level of the sense hierarchy in the Penn Discourse Treebank. For example, as#6#Temporal should be read as 'the word as in this context has an id of 6, and a sense of Temporal'. Id numbers are used to identify multi-word or long-distance connectives. The shared id numbers for: In#0#Expansion and addition#0#Expansion show that ``In addition'' is just one instance of an Expansion connective. INSTALLING Uncompress addDiscourse.tar.gz by typing: gunzip < addDiscourse.tar.gz | tar -xvf - The file addDiscourse.pl is the program which identifies each instance of a discourse connective, classifies them, and prints out the augmented trees. The resources folder contains two files: connectives.info and connectives.txt. connectives.info contain the learned feature weights. This is the output of Mallet's MaxEnt classifier trained on sections 2-22 of version 2 of the Penn Discourse Treebank. connectives.txt contains a list of the words and phrases to consider as possible connectives. If you wish, you may add to this list. Words or phrases that were unknown at training time will be classified solely on the basis of their syntactic context. Long-distance connectives (like ``On the one hand...on the other hand") are specified using .. between the first half and the second half (on the one hand..on the other hand). INPUT Input to the program is specified through the --parses command line argument: ie. perl addDiscourse.pl --parses sample-parse.txt Either a directory or a file can be passed in. If a directory is given, it will annotate each of the files contained in that directory. The input expected are syntactic parse trees. The trees can be either pretty-printed or one sentence per line, it makes no difference. OUTPUT The location of the output is specified through the --output command line argument: ie. perl addDiscourse.pl --parses sample-parse.txt --output sample-out.txt If a file is specified, then output is written to that file. If a directory is given, then individual files are placed in to that directory in the format directory/filename.disc If no output argument is given, then addDiscourse.pl writes to stdout.