Johns Hopkins University, Whiting School of Engineering

HOW-TO GUIDE for Extracting Syntactically Constrained Paraphrases

by Chris Callison-Burch (Released: Nov 10, 2008)

This document gives instructions on how to use the software and data that I used in my EMNLP 2008 paper, entitled Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. There are two intended audiences for this step-by-step guide: individuals who want to generate paraphrases to use in their own applications, and researchers who want to recreate my results and extend the method that I proposed. Steps 1-4 are for people in the first category; Step 5 and the further evaluation steps are for people in the second category.

The materials that I provide include the following:

  • The source code for my paraphrase extraction methods (both the baseline and the syntactically constrained versions).
  • The complete set of training data that I used in the paper. This includes 10 bilingual parallel corpora that have been automatically word-aligned, suffix array indexes for them, parses for the English side of the parallel corpora, and a trigram language model.
  • The test sentences and the complete set of paraphrases generated for all phrases up to five words long that occur in the test sets.
  • The judgments that were collected during the manual evaluation process, and the perl scripts that I used to calculate the results and the inter-annotator agreement.

In order to run the software on the full data sets you'll need a 64-bit machine with a large amount of memory (I use 10GB). The data sets are quite large, so you'll also need about 6GB of hard drive space.
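A quick way to check whether your Java installation can run in 64-bit mode is to type java -d64 -version at the command line; if it prints a version string rather than an error message, you should be all set.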

Step 1: Download the data

You can download the data here:

After untarring the files by typing tar xf emnlp08.tar on the command line, you should have the following files:

emnlp08/eval:
   calculate_rate_of_substring_paraphrses.perl
   evaluated-paraphrases
   evaluated-paraphrases.emily.csv
   evaluated-paraphrases.michelle.csv
   evaluated-paraphrases.sally.csv
   evaluated-paraphrases.sally.set2.csv
   get-results.perl
   inter-annotator-agreement.perl
   
emnlp08/paraphrases:
   paraphrases-baseline.gz
   paraphrases-with-syntactic-constraints.gz
   phrases-to-paraphrase.gz
   
emnlp08/software:
   linearb.jar
   lm.properties
   lowercase.perl
   nonbreaking_prefixes/
   paraphrase.properties.en
   srilm/linux/ngram
   srilm/macosx/ngram
   tokenizer.perl
   
emnlp08/test-sets:
   dev2006.en.gz
   devtest2006.en.gz
   en_dev2006_parses.txt.gz
   en_devtest2006_parses.txt.gz
   en_nc-dev2007_parses.txt.gz
   en_nc-devtest2007_parses.txt.gz
   en_nc-test2007_parses.txt.gz
   en_test2006_parses.txt.gz
   en_test2007_parses.txt.gz
   nc-test2007.en.gz
   test2006.en.gz
   test2007.en.gz

   
emnlp08/training:
   da-en/
   de-en/
   el-en/
   es-en/
   europarl.en.lm
   fi-en/
   fr-en/
   it-en/
   nl-en/
   pt-en/
   sv-en/

Each of the subdirectories in emnlp08/training/ contains files similar to these:

en_europarl_parses.txt.gz
en_europarl_sentences.txt.gz
en_europarl_suffixes.txt.gz
en_europarl_vocab.txt.gz
es_en_europarl_alignments.txt.gz
es_europarl_sentences.txt.gz
es_europarl_suffixes.txt.gz
es_europarl_vocab.txt.gz
europarl.en.gz
europarl.es.gz

The parses file contains parses of the English side of the parallel corpus, which were produced by the Bikel parser trained on the Penn Treebank. The sentences, suffixes, and vocab files are suffix-array indexes of each side of the parallel corpus. The alignments file contains word alignments produced by Giza++ and the Moses toolkit. The europarl files contain the plain text of the parallel corpus; they are sentence-aligned, with one sentence per line.
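If you want a quick sanity check that the data unpacked correctly, you can peek at the sentence-aligned text. The file names below follow the pattern shown above; substitute any of the other language pairs for fr-en:

zcat training/fr-en/europarl.en.gz | head -3
zcat training/fr-en/europarl.fr.gz | head -3

Corresponding lines in the two files should be translations of one another.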

Step 2: Edit the configuration file

Before running the software you'll need to edit the paraphrase.properties.en configuration file located in the emnlp08/software/ directory. You'll need to change the paths on the paraphrases.corpus_N and paraphrases.working_directory lines (shown below as /Volumes/Models/Paraphrasing/English/...) to be the absolute path to your emnlp08/training/ directory.

# The source language is the language to paraphrase
paraphrases.source_lang=en

# The parallel corpora are listed with three values:
# targetLang,corpusName,directory
# They should be in the appropriate format and use
# the naming conventions for a SuffixArrayParallelCorpus
paraphrases.corpus_1=da,europarl,/Volumes/Models/Paraphrasing/English/da-en/
paraphrases.corpus_2=de,europarl,/Volumes/Models/Paraphrasing/English/de-en/
# An arbitrary number of parallel corpora can be specified
paraphrases.corpus_3=el,europarl,/Volumes/Models/Paraphrasing/English/el-en/
paraphrases.corpus_4=es,europarl,/Volumes/Models/Paraphrasing/English/es-en/
paraphrases.corpus_5=fi,europarl,/Volumes/Models/Paraphrasing/English/fi-en/
paraphrases.corpus_6=fr,europarl,/Volumes/Models/Paraphrasing/English/fr-en/
paraphrases.corpus_7=it,europarl,/Volumes/Models/Paraphrasing/English/it-en/
paraphrases.corpus_8=nl,europarl,/Volumes/Models/Paraphrasing/English/nl-en/
paraphrases.corpus_9=pt,europarl,/Volumes/Models/Paraphrasing/English/pt-en/
paraphrases.corpus_10=sv,europarl,/Volumes/Models/Paraphrasing/English/sv-en/

# The working directory is where the program writes incremental
# results, which are deleted before the program exits
paraphrases.working_directory=/Volumes/Models/Paraphrasing/English/tmp/

# The sample size specifies how many occurrences of
# a phrase and its translations should be examined
# when determining translation model probabilities.
paraphrases.sample_size=100

# This value specifies the maximum number of paraphrases
# to output for each phrase
paraphrases.max_paraphrases_per_phrase=10

# This value is the minimum paraphrase probability
# that will be printed.
paraphrases.min_paraphrase_probability=0.01

You will also need to create a temporary directory to store the intermediate paraphrases that are extracted from each of the parallel corpora: mkdir tmp. Make sure that the paraphrases.working_directory field in the config file points to this directory.
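If you'd rather not edit all ten corpus paths by hand, a sed one-liner along these lines should do it (the /path/to/emnlp08 here is a stand-in for wherever you untarred the data; the -i.bak flag keeps a backup of the original file):

sed -i.bak 's|/Volumes/Models/Paraphrasing/English|/path/to/emnlp08/training|g' software/paraphrase.properties.en
mkdir -p /path/to/emnlp08/training/tmp

The second command creates the working directory that the rewritten paraphrases.working_directory line now points to.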

Step 3: Assemble a list of phrases

Before you run the software you'll need to create a file containing a list of the phrases that you want to paraphrase, with one phrase per line. Your tokenization scheme should match the one used for the English side of the parallel corpus, and the text should be lowercased. You can run the following commands to tokenize and lowercase a file of sentences and then enumerate all of its phrases:

  • cat sentence-file | perl software/tokenizer.perl -l en | perl software/lowercase.perl > sentence-file.tokenized
  • java -cp software/linearb.jar phd.util.EnumPhrases sentence-file.tokenized 4 | sort | uniq > phrases

When you test the paraphrase model for the first time, you might want to create a short list of phrases (100 phrases or so) so that you can quickly figure out whether you're having any problems. After that you can try paraphrasing hundreds of thousands of phrases, although running on a long list of phrases will take some time.

For this example, I'm choosing 100 phrases from the middle of the corpus by running: zcat paraphrases/phrases-to-paraphrase.gz | head -750000 | tail -100 > phrases. The first few phrases are:

the commission about article
the commission about article 4
the commission about the
the commission about the need
the commission accepted
the commission accepted many
the commission accepted many of
the commission accepts
the commission accepts amendment
the commission accepts amendment no

Step 4: Run the paraphrase extraction code

After you assemble the list of phrases that you want to paraphrase, you can extract syntactically constrained paraphrases by running the following command to start the software:

nohup java -d64 -Xmx10g -Dfile.encoding=utf8 -cp software/linearb.jar edu.jhu.paraphrasing.ParaphraseFinderWithSyntacticConstraints software/paraphrase.properties.en phrases phrases.paraphrased &

For those of you who aren't very familiar with Java, the arguments are the following:

  • -d64 -- this tells Java to operate in 64-bit mode, which allows the program to use more than 2GB of memory.
  • -Xmx10g -- this tells Java to use 10GB of memory. If you don't have a computer with 10GB you could try using less memory (see the note after this list), but I'm afraid that since the paraphrasing is so data-intensive it really does require a lot.
  • -Dfile.encoding=utf8 -- this tells Java that the files are all encoded in UTF-8.
  • -cp software/linearb.jar -- this tells Java to use the linearb.jar jar file. If you want to look at the source code you can extract it by running the command jar xf software/linearb.jar
  • edu.jhu.paraphrasing.ParaphraseFinderWithSyntacticConstraints -- This is the class which is run. If you extract the source code from the jar file you can find the source for this class in edu/jhu/paraphrasing/ParaphraseFinderWithSyntacticConstraints.java
  • software/paraphrase.properties.en -- This is the paraphrase configuration file that you edited in Step 2.
  • phrases -- This is the input file containing the list of phrases that you created in step 3.
  • phrases.paraphrased -- This is the output file that the paraphrases will be written to.
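If your machine doesn't have 10GB of memory, one workaround, untested here but consistent with the config format from Step 2 (which allows an arbitrary number of corpora), is to list just a few of the parallel corpora in paraphrase.properties.en and lower the -Xmx value accordingly, for example:

paraphrases.corpus_1=fr,europarl,/path/to/emnlp08/training/fr-en/
paraphrases.corpus_2=es,europarl,/path/to/emnlp08/training/es-en/

The trade-off is that the paraphrase probabilities will be estimated from less evidence, so your output won't exactly match the numbers below or in the paper.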

As the code is running it will write intermediate results into the tmp/ directory that you specified in the paraphrase.properties.en file. It will create files containing paraphrases extracted from each of the individual parallel corpora before aggregating these in a final step. You can monitor these as the program is running. You can also follow the progress by looking at the nohup.out file, which will contain infrequent messages like:

Loading the europarl en-da parallel corpus ...
Loading parse trees from /Volumes/Models/emnlp/emnlp08/training/da-en/en_europarl_parses.txt
Looking up paraphrases in the europarl en-da corpus
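To follow along while the program runs, the usual tools work fine:

tail -f nohup.out
ls -l tmp/

The tmp/ directory should accumulate intermediate paraphrase files as each parallel corpus finishes.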

When the program finishes running the tab-delimited output file should contain paraphrases that look like this (head -25 phrases.paraphrased):

S	the commission accepted	the commission accepted	0.72946429
S	the commission accepted	the commission has accepted	0.11517857
S	the commission accepted	following on from the commission ' s comments	0.0625
S	the commission accepted	the committee could have delivered	0.04464286
S	the commission accepted	were accepted by the commission	0.03125
S	the commission accepted	the committee has taken on board	0.0125
S/(VP/NP NP)	the commission accepted	the commission accepted	0.875
S/(VP/NP NP)	the commission accepted	the commission adopted	0.125
S/(VP/NP PP)	the commission accepted	the commission accepted	0.875
S/(VP/NP PP)	the commission accepted	the commission approves	0.125
S/(VP/NP)	the commission accepted	the commission accepted	0.86237599
S/(VP/NP)	the commission accepted	the commission had	0.05555556
S/(VP/NP)	the commission accepted	the commission adopted	0.03687169
S/(VP/NP)	the commission accepted	the commission accept	0.01481481
S/(VP/NP)	the commission accepted	the commission approved	0.01388889
S/(VP/NP) .	the commission accepted	the commission accepted	0.92708333
S/(VP/NP) .	the commission accepted	the commission took up	0.04166667
S/(VP/NP) .	the commission accepted	the committee also accepted	0.03125
S/(VP/PP)	the commission accepted	the commission accepted	0.9
S/(VP/PP)	the commission accepted	the commission agreed	0.1
S/(VP/SBAR) .	the commission accepted	the commission accepted	0.55357143
S/(VP/SBAR) .	the commission accepted	the commission said	0.13214286
S/(VP/SBAR) .	the commission accepted	the commission agreed	0.1
S/(VP/SBAR) .	the commission accepted	the commission stated	0.07857143
S/(VP/SBAR) .	the commission accepted	the commission considered	0.05714286

The first column contains the syntactic label assigned to the phrase and the paraphrase in one or more sentences in the parallel corpora. Note that this can be a simple label like S or a more complex CCG-style label like S/(VP/NP NP), which indicates a sentence that is missing a verb phrase to its right, where that VP is in turn missing two noun phrases to its right. See the paper for more details about these complex labels.

The second column contains the original phrases. Note that the first four phrases in the example phrases file that we constructed in Step 3 do not have paraphrases. The first phrase for which a paraphrase is found is the commission accepted. 48 of the 100 original phrases have paraphrases. You can figure this out by typing cut -f2 phrases.paraphrased | uniq | wc -l at the command line.

The third column contains the paraphrases. There are a total of 1321 paraphrases generated for the 48 phrases (wc -l phrases.paraphrased). We output up to 10 paraphrases per syntactic label, and many phrases have more than one label. You'll notice that many of the paraphrases are identical across the different syntactic labels. If you ignore the labels and just count the unique phrase-paraphrase pairs then there are 436 of them (cut -f2,3 phrases.paraphrased | sort | uniq | wc -l).

The fourth column contains the probabilities that are assigned to each of the paraphrases. This is calculated as described in Equations 6 and 7 in the paper.
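In sketch form (see the paper for the precise equations), the paraphrase probability pivots through the foreign phrases f that the original English phrase e1 is aligned to, following Bannard and Callison-Burch (2005):

p(e2 | e1) = sum over f of p(e2 | f) * p(f | e1)

The syntactically constrained model additionally conditions the component translation probabilities on the syntactic label of e1, which is why each label in the first column gets its own probability distribution.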

In most cases the above instructions are all you'll need. If you'd like additional information about how to extract baseline paraphrases and generate sets for manual evaluation in order to replicate the results that I reported in my paper, the following steps are for you!

Step 5 (optional): Extracting baseline paraphrases

If you would like to create paraphrases with the baseline model you can do so by running this command:

nohup java -d64 -Xmx4g -Dfile.encoding=utf8 -cp software/linearb.jar edu.jhu.paraphrasing.ParaphraseFinder software/paraphrase.properties.en phrases phrases.paraphrased-baseline &

Once that is done running you can take a look at its output (head -25 phrases.paraphrased-baseline):

the commission about the	the commission	0.17446251
the commission about the	commission	0.11266017
the commission about the	the commission about the	0.07509643
the commission about the	the commission on the	0.07437752
the commission about the	the commission about	0.05844822
the commission about the	the	0.02667624
the commission about the	the commission on	0.02535433
the commission about the	on the	0.02387995
the commission about the	on	0.01289723
the commission about the	commission about	0.01034483
the commission accepted	the commission accepted	0.29974536
the commission accepted	commission	0.05675223
the commission accepted	the commission	0.0558402
the commission accepted	the commission has accepted	0.05148735
the commission accepted	the commission adopted	0.0478455
the commission accepted	the commission agreed	0.04574657
the commission accepted	the commission approved	0.02165048
the commission accepted	commission has	0.02058147
the commission accepted	the commission has	0.01778682
the commission accepted	commission accepted	0.01420254
the commission accepted many	the commission accepted many	0.6
the commission accepted many	commission accepted many	0.21875
the commission accepted many	warm	0.125
the commission accepted many	many	0.01875
the commission accepted many	commission	0.0125

You'll notice that the file now only has three columns since the baseline model doesn't care about syntactic labels. You'll also notice that more of the original phrases have paraphrases. The baseline model generates 557 unique paraphrases for 65 of the 100 phrases (wc -l phrases.paraphrased-baseline ; cut -f1 phrases.paraphrased-baseline | uniq | wc -l). Many of them are pretty bad. I think that you'll agree that we've improved things with the syntactic constraints.
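If you want to compare the two models' coverage directly, a few quick commands will do it (covered-syntactic and covered-baseline are just scratch file names):

cut -f2 phrases.paraphrased | sort -u > covered-syntactic
cut -f1 phrases.paraphrased-baseline | sort -u > covered-baseline
comm -12 covered-syntactic covered-baseline | wc -l

The last line counts the phrases that both models managed to paraphrase.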

Further Steps

The next page gives instructions on how to more rigorously evaluate whether the syntactically constrained paraphrase model is better than the baseline model.

Made Possible By


Multi-level modeling of language and translation