Johns Hopkins University, Whiting School of Engineering

HOW-TO GUIDE for Evaluating Paraphrases

by Chris Callison-Burch (Released: Nov 10, 2008)

These instructions are for people who want to manually evaluate paraphrases in a similar fashion to the evaluation in my EMNLP 2008 paper.

In order to evaluate whether the syntactically-constrained models produced better paraphrases than the baseline, I substituted the paraphrases into sentences and paid annotators to judge their quality. They judged the paraphrases produced by the baseline model, my syntactically constrained model, and a number of different variants. These instructions will tell you how to generate an evaluation file.

In this example, I'll show how to generate an evaluation set from the file test-sets/nc-test2007.en.gz which is one of the sets that I evaluated in my paper. The file is distributed along with my software.

Step 1: Generate paraphrases for all phrases in the sentences you want to evaluate

Follow Steps 1-5 in the how-to guide for extracting paraphrases. That page gives detailed instructions; here I'll just give a short recap. My file is already tokenized, but if you are using a new file you should first tokenize and lowercase it before extracting all phrases from it:

cat sentence-file | perl software/tokenizer.perl -l en | perl software/lowercase.perl > sentence-file.tokenized

Next extract all phrases from the sentences in your test set.

java -cp software/linearb.jar phd.util.EnumPhrases sentence-file.tokenized 3 | sort | uniq > phrases

The reason that you need to generate paraphrases for all of the phrases in your test set is that the evaluation code randomly selects which phrases to evaluate (with some frequency cut-offs, and with the ability to select a balance between syntactic constituents and arbitrary n-grams).
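
In case it helps to see what the phrase enumeration step does, here is a simplified sketch of the idea in Java. It assumes that the trailing 3 in the command above is the maximum phrase length; the class name EnumPhrasesSketch is made up for illustration, and the real implementation is phd.util.EnumPhrases.

import java.io.*;

public class EnumPhrasesSketch {
    public static void main(String[] args) throws IOException {
        int maxLength = 3;  // assumed meaning of the "3" argument in the command above
        BufferedReader in = new BufferedReader(new FileReader("sentence-file.tokenized"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) continue;
            String[] tokens = line.trim().split("\\s+");
            // Print every phrase (contiguous n-gram) of length 1 to maxLength.
            for (int start = 0; start < tokens.length; start++) {
                StringBuilder phrase = new StringBuilder();
                for (int len = 1; len <= maxLength && start + len <= tokens.length; len++) {
                    if (len > 1) phrase.append(" ");
                    phrase.append(tokens[start + len - 1]);
                    System.out.println(phrase);
                }
            }
        }
        in.close();
    }
}

Piping the output through sort | uniq, as in the command above, removes duplicate phrases.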

Finally, generate both syntactically-constrained paraphrases and baseline paraphrases:

nohup java -d64 -Xmx10g -Dfile.encoding=utf8 -cp software/linearb.jar edu.jhu.paraphrasing.ParaphraseFinderWithSyntacticConstraints software/paraphrase.properties.en phrases paraphrases-with-syntactic-constraints &

nohup java -d64 -Xmx4g -Dfile.encoding=utf8 -cp software/linearb.jar edu.jhu.paraphrasing.ParaphraseFinder software/paraphrase.properties.en phrases paraphrases-baseline &

The tarball that you downloaded contains paraphrases for all of the files in the test-sets/ directory. The baseline paraphrases are in paraphrases/paraphrases-baseline.gz and the syntactically constrained ones are in paraphrases/paraphrases-with-syntactic-constraints.gz.

Step 2: Create a suffix array index of the file

The suffix array index is used to quickly look up frequency information for phrases in the test sentences. Create one for your test sentences like this:

java -cp software/linearb.jar edu.jhu.util.suffix_array.SuffixArrayFactory test-sets/nc-test2007.en.gz nc-test2007 en test-sets/

Where:

  • test-sets/nc-test2007.en.gz is the file you want to index (it may be gzipped or uncompressed).
  • nc-test2007 is the corpus name.
  • en is the language that it's in.
  • test-sets/ is the output directory for the index.
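
To give a sense of why the index is useful, here is a toy sketch of the idea (the class name SuffixArraySketch is made up for illustration; this is not the edu.jhu.util.suffix_array code): sorting the word positions of the corpus by the suffixes that begin at them puts all occurrences of any given phrase next to each other in the sorted order, which is what lets a real suffix array answer frequency queries with a binary search instead of a scan over the whole corpus.

import java.util.*;

public class SuffixArraySketch {
    public static void main(String[] args) {
        String[] corpus = "the terms on which the ecb accepts it as collateral".split(" ");

        // The suffix array is just the list of word positions in the corpus,
        // sorted by the suffix that starts at each position.
        Integer[] suffixArray = new Integer[corpus.length];
        for (int i = 0; i < corpus.length; i++) suffixArray[i] = i;
        Arrays.sort(suffixArray,
            (a, b) -> joinFrom(corpus, a).compareTo(joinFrom(corpus, b)));

        System.out.println(frequency(corpus, suffixArray, "the"));      // 2
        System.out.println(frequency(corpus, suffixArray, "the ecb"));  // 1
    }

    static String joinFrom(String[] corpus, int start) {
        return String.join(" ", Arrays.copyOfRange(corpus, start, corpus.length));
    }

    // Count the suffixes that begin with the phrase. For brevity this is a linear
    // pass; in the sorted order the matching suffixes form one contiguous block,
    // which is what allows the real index to find them by binary search.
    static int frequency(String[] corpus, Integer[] suffixArray, String phrase) {
        String[] words = phrase.split(" ");
        int count = 0;
        for (int start : suffixArray) {
            if (start + words.length > corpus.length) continue;
            boolean match = true;
            for (int j = 0; j < words.length; j++) {
                if (!corpus[start + j].equals(words[j])) { match = false; break; }
            }
            if (match) count++;
        }
        return count;
    }
}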

Step 3: Parse the test sentences

Because we select phrases that are syntactic constituents and because we apply syntactic constraints, the evaluation software expects a file containing parses for the test sentences. The test-sets/ directory contains parses for each of the test sets that I used.

If you are using a new test set, you'll have to parse it using Dan Bikel's parser. I used the Bikel parser trained on the Penn Treebank to parse the English sides of the parallel corpora used by my paraphrase models, so if your test-set parses come from a different parser or different training data, the syntactic labels may not line up and things probably won't work too well.
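
(For reference, Bikel's parser produces Penn Treebank-style constituent bracketings, so the parse of a test sentence looks roughly like the example below; the exact layout of the parses file may differ.)

(S (NP (DT the) (NN value)) (VP (VBZ is) (VP (VBN determined) (PP (IN by) (NP (DT the) (NNS terms))))) (. .))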

Step 4: Modify the language model configuration file

When the paraphrases are substituted into a sentence, they can also be ranked by language model probability. I used SRILM to generate a language model file from the English side of the Europarl corpus. The LM file is training/europarl.en.lm. My Java code calls an external executable to interpret this file. I've only got executables for Linux and Mac OS X. They are stored in emnlp08/software/srilm/linux/ngram and emnlp08/software/srilm/macosx/ngram. You'll need to modify the language model configuration file emnlp08/software/lm.properties to point to the ngram executable and the LM file by editing these lines:

languagemodel.datafile.en=/Users/ccb/Desktop/syntactic-constraints/europarl.en.lm
languagemodel.process=/Users/ccb/LinearB/OpenSrc/srilm/macosx/ngram
languagemodel.workingdir=/Users/ccb/LinearB/OpenSrc/srilm/macosx/
languagemodel.debug=true

Step 5: Create the file for your annotators to judge

After you've done all that prep work, you can finally create an output file for the annotators to judge by running the command:

java -cp software/linearb.jar edu.jhu.paraphrasing.ParaphraseEvaluator en nc-test2007 test-sets/ software/lm.properties 100 paraphrases/paraphrases-baseline.gz paraphrases/paraphrases-with-syntactic-constraints.gz > evaluation-file.csv

Where:

  • en - is the language that you're using.
  • nc-test2007 - is the name of the corpus that you're drawing test sentences from.
  • test-sets/ - is the directory that your suffix array index is stored in (from Step 2).
  • software/lm.properties - is the language model properties file that you modified in Step 4.
  • 100 - is the number of phrases to evaluate.
  • paraphrases/paraphrases-baseline.gz - is the file containing the baseline paraphrases.
  • paraphrases/paraphrases-with-syntactic-constraints.gz - is the file containing the syntactically-constrained paraphrases.

Unfortunately there's an error in my code which means that it won't terminate on its own, so you'll have to press Ctrl+C to kill it after a couple of minutes. Sorry about that. You'll see the following progress messages printed out:

Loading parse trees from test-sets/en_nc-test2007_parses.txt
Loading language model data from /Volumes/Models/emnlp/emnlp08/training/europarl.en.lm
Language Model initialized
Finished reading paraphrases from paraphrases/paraphrases-baseline.gz
Finished reading paraphrases from paraphrases/paraphrases-with-syntactic-constraints.gz

These messages are followed by a bunch of blank lines. After that you can kill the process.

The evaluation file will look something like this:

MORE LIKELY TO (phrase number: 1)
MEANING GRAMMAR as long as the marginal piece of german debt is used as collateral for a short-term loan or as the centerpiece of a repurchase agreement to gain liquidity , its value is much __MORE LIKELY TO__ be determined by the terms on which the ecb accepts it as collateral than by its fundamentals . (item number: 1)
as long as the marginal piece of german debt is used as collateral for a short-term loan or as the centerpiece of a repurchase agreement to gain liquidity , its value is much __to__ be determined by the terms on which the ecb accepts it as collateral than by its fundamentals . 1,9
as long as the marginal piece of german debt is used as collateral for a short-term loan or as the centerpiece of a repurchase agreement to gain liquidity , its value is much __likely to__ be determined by the terms on which the ecb accepts it as collateral than by its fundamentals . 3,11
as long as the marginal piece of german debt is used as collateral for a short-term loan or as the centerpiece of a repurchase agreement to gain liquidity , its value is much __more likely__ be determined by the terms on which the ecb accepts it as collateral than by its fundamentals . 0,8
as long as the marginal piece of german debt is used as collateral for a short-term loan or as the centerpiece of a repurchase agreement to gain liquidity , its value is much __little to__ be determined by the terms on which the ecb accepts it as collateral than by its fundamentals . 2,10

MEANING GRAMMAR first , ahmadinejad is __MORE LIKELY TO__ focus on domestic issues , trying -- with whatever degree of success is unclear -- to improve living standards for the poorest iranians . (item number: 2)
first , ahmadinejad is __to__ focus on domestic issues , trying -- with whatever degree of success is unclear -- to improve living standards for the poorest iranians . 1,9
first , ahmadinejad is __likely to__ focus on domestic issues , trying -- with whatever degree of success is unclear -- to improve living standards for the poorest iranians . 3,5,6,7,11,13,14,15
first , ahmadinejad is __more likely__ focus on domestic issues , trying -- with whatever degree of success is unclear -- to improve living standards for the poorest iranians . 0,8
first , ahmadinejad is __inclined to__ focus on domestic issues , trying -- with whatever degree of success is unclear -- to improve living standards for the poorest iranians . 4,12
first , ahmadinejad is __little to__ focus on domestic issues , trying -- with whatever degree of success is unclear -- to improve living standards for the poorest iranians . 2,10

MEANING GRAMMAR some iranian reformers and exiles put a bright face on ahmadinejad 's election , arguing that his administration is __MORE LIKELY TO__ show the regime 's real face and disabuse any western hopes of compromise . (item number: 3)
some iranian reformers and exiles put a bright face on ahmadinejad 's election , arguing that his administration is __to__ show the regime 's real face and disabuse any western hopes of compromise . 1,9
some iranian reformers and exiles put a bright face on ahmadinejad 's election , arguing that his administration is __likely to__ show the regime 's real face and disabuse any western hopes of compromise . 3,5,6,7,11,13,14,15
some iranian reformers and exiles put a bright face on ahmadinejad 's election , arguing that his administration is __more likely__ show the regime 's real face and disabuse any western hopes of compromise . 0,8
some iranian reformers and exiles put a bright face on ahmadinejad 's election , arguing that his administration is __inclined to__ show the regime 's real face and disabuse any western hopes of compromise . 4,12
some iranian reformers and exiles put a bright face on ahmadinejad 's election , arguing that his administration is __little to__ show the regime 's real face and disabuse any western hopes of compromise . 2,10

The test phrase is in uppercase, and the paraphrases produced by each different model are surrounded by underscores. There is room for the annotators to put scores for meaning and grammaticality next to each paraphrase. The experimental condition numbers are listed at the end of each line. Conditions 0-7 are the ones described in section 4.3 of my paper; conditions 8-15 are n-best paraphrases, which I didn't describe in the paper.
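
If you end up writing your own script to score the annotations, here is a rough sketch of how to pull the paraphrase and the condition numbers out of one of these lines (the class name ParseEvalLine is made up for illustration; eval/get-results.perl, described in Step 7, is what I actually used):

import java.util.Arrays;

public class ParseEvalLine {
    public static void main(String[] args) {
        String line = "first , ahmadinejad is __likely to__ focus on domestic issues . 3,5,6,7,11,13,14,15";
        // The paraphrase is the text between the double underscores.
        int start = line.indexOf("__") + 2;
        int end = line.indexOf("__", start);
        String paraphrase = line.substring(start, end);
        // The experimental condition numbers are the comma-separated list at the end of the line.
        String[] tokens = line.trim().split("\\s+");
        String[] conditions = tokens[tokens.length - 1].split(",");
        System.out.println("paraphrase: " + paraphrase);                    // likely to
        System.out.println("conditions: " + Arrays.toString(conditions));   // [3, 5, 6, 7, 11, 13, 14, 15]
    }
}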

Step 6: Judge the sentences in the evaluation file

I hired three annotators to judge the sentences in the files that I created. You can see their judgments in eval/evaluated-paraphrases.emily.csv, eval/evaluated-paraphrases.michelle.csv, eval/evaluated-paraphrases.sally.csv and eval/evaluated-paraphrases.sally.set2.csv.

I gave them the following instructions:

Paraphrases are alternative ways of wording a phrase which retain its meaning. The goal of this project is to evaluate how well different automatic paraphrasing methods do. Since the paraphrases are automatically generated, they frequently contain errors. The evaluation is conducted by selecting some phrase, generating paraphrases for it, and then substituting the paraphrases for the original phrase in a number of sentences. Your task, should you choose to accept it, is to judge these substitutions along two dimensions. First, you will be asked to judge whether the paraphrase retains the SAME MEANING as the original phrase. Second, you will be asked to judge whether the sentence with the paraphrase substituted into it REMAINS GRAMMATICAL. To quantify these, you'll assign scores along 5-point scales:

MEANING:
5 All of the meaning of the original phrase is retained, and nothing is added
4 The meaning of the original phrase is retained, although some additional information may be added that does not transform the meaning
3 The meaning of the original phrase is retained, although some information may be deleted without too great a loss in the meaning
2 A substantial amount of the meaning is different
1 The paraphrase doesn't mean anything close to the original phrase

GRAMMAR:
5 The sentence with the paraphrase inserted is perfectly grammatical, and would require no correction
4 The sentence is grammatical, but might be a little awkward sounding
3 The sentence has an agreement error (such as between its subject and verb, or between a plural noun and singular determiner)
2 The sentence has multiple errors and/or omits words that would be required to make it grammatical
1 The sentence is totally ungrammatical

Here's an example of one annotation item, with some sample numbers. The original phrase is "European countries"; in each row the first number is the meaning judgment and the second is the grammar judgment:

EUROPEAN COUNTRIES
MEANING GRAMMAR ever more open borders imply increasing racial fragmentation in __EUROPEAN COUNTRIES__ .
2 2 ever more open borders imply increasing racial fragmentation in __european__ .
5 4 ever more open borders imply increasing racial fragmentation in __countries of europe__ .
3 5 ever more open borders imply increasing racial fragmentation in __the member states__ .
3 5 ever more open borders imply increasing racial fragmentation in __the countries__ .
3 4 ever more open borders imply increasing racial fragmentation in __countries__ .
5 5 ever more open borders imply increasing racial fragmentation in __european states__ .
5 5 ever more open borders imply increasing racial fragmentation in __europe__ .
5 5 ever more open borders imply increasing racial fragmentation in __european nations__ .
3 5 ever more open borders imply increasing racial fragmentation in __nations__ .
1 1 ever more open borders imply increasing racial fragmentation in __the__ .
5 5 ever more open borders imply increasing racial fragmentation in __the european countries__ .
5 5 ever more open borders imply increasing racial fragmentation in __the countries of europe__ .

Step 7: Calculate the results of the different data conditions

After I collected all of the results I calculated the performance for each of the different data conditions. I ran the following commands:

cat eval/evaluated-paraphrases*csv > evaluated-paraphrases

perl eval/get-results.perl evaluated-paraphrases

This outputs the following:

380 PHRASES (29 excluded):
...


MEANING
	0  	1  	2  	3  	4  	5  	6  	7  
5	0.28	0.23	0.34	0.29	0.35	0.32	0.35	0.33
4	0.01	0.02	0.04	0.04	0.04	0.03	0.04	0.03
3	0.27	0.21	0.24	0.27	0.21	0.25	0.22	0.25
2	0.29	0.24	0.26	0.24	0.25	0.21	0.24	0.22
1	0.15	0.3	0.12	0.16	0.15	0.18	0.14	0.18

GRAMMAR
	0  	1  	2  	3  	4  	5  	6  	7  
5	0.21	0.32	0.41	0.48	0.41	0.49	0.43	0.49
4	0.13	0.12	0.16	0.17	0.18	0.19	0.18	0.19
3	0.04	0.02	0.02	0.02	0.02	0.02	0.02	0.02
2	0.47	0.31	0.31	0.25	0.28	0.2	0.28	0.20
1	0.15	0.22	0.09	0.09	0.1	0.11	0.09	0.10

PASSED (MEANING THRESHOLD=3, GRAMMAR THRESHOLD=4)
MEAN:	0.56	0.46	0.62	0.6	0.60	0.61	0.62	0.61
GRAM:	0.35	0.44	0.57	0.65	0.6	0.68	0.61	0.68
BOTH:	0.30	0.36	0.46	0.5	0.5	0.54	0.51	0.55
TOTALS:	1193	1194	1144	1142	876	878	876	877

There were a total of 1195 contexts (averaging 3.14473684210526 per phrase),
with 8422 judgments in total.

Those are the results that I reported in the paper.

To evaluate the inter-rater agreement I used the script:

perl eval/inter-annotator-agreement.perl eval/evaluated-paraphrases.michelle.csv eval/evaluated-paraphrases.sally.csv

Michelle and Sally were the only annotators to have overlapping items.
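
If you want to compute agreement yourself, Cohen's kappa for two annotators who have scored the same items works out to the following (a minimal reference sketch with a made-up class name, not a line-for-line copy of inter-annotator-agreement.perl):

public class CohensKappa {
    // a and b hold the two annotators' scores (1-5) for the same items, in the same order.
    public static double kappa(int[] a, int[] b, int numCategories) {
        int n = a.length;
        double observed = 0.0;
        double[] countsA = new double[numCategories + 1];
        double[] countsB = new double[numCategories + 1];
        for (int i = 0; i < n; i++) {
            if (a[i] == b[i]) observed++;
            countsA[a[i]]++;
            countsB[b[i]]++;
        }
        observed /= n;
        // Chance agreement: the probability that the annotators agree by accident,
        // given each annotator's distribution over the five scores.
        double chance = 0.0;
        for (int c = 1; c <= numCategories; c++) {
            chance += (countsA[c] / n) * (countsB[c] / n);
        }
        return (observed - chance) / (1.0 - chance);
    }

    public static void main(String[] args) {
        int[] annotator1 = {5, 4, 3, 5, 1, 2};
        int[] annotator2 = {5, 4, 2, 5, 1, 3};
        System.out.println(kappa(annotator1, annotator2, 5));
    }
}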

Further steps

If you want to modify how the test sentences are selected, or to add different paraphrase models for your own evaluation, you can modify the class at edu/jhu/paraphrasing/ParaphraseEvaluator.java.

Made Possible By

Multi-level modeling of language and translation