PPDB: The Paraphrase Database

PPDB is currently available as gzipped plain text files, one paraphrase rule per line. Each line is formatted as follows:
LHS ||| SOURCE ||| TARGET ||| (FEATURE=VALUE )* ||| ALIGNMENT
Here, SOURCE is the expression to be paraphrased, TARGET is its paraphrase, and LHS is the constituent label, or CCG-style slashed constituent label, shared by SOURCE and TARGET. The features include approximate conditional paraphrase probabilities estimated from bilingual data, similarity scores estimated from monolingual data, and other scores. For example:
[VBN] ||| pruned ||| cropped ||| p(e|f)=4.33 p(f|e)=4.88 ... ||| 0-0
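
A minimal Python sketch of reading this format follows. It is an illustration, not an official PPDB tool: the names `Rule`, `parse_rule`, and `iter_rules` are hypothetical, and the code assumes that no token contains `|||`, that every feature value is numeric, and that ALIGNMENT is a space-separated list of `i-j` index pairs as in the example above.

```python
import gzip
from typing import Dict, Iterator, List, NamedTuple, Tuple


class Rule(NamedTuple):
    lhs: str
    source: str
    target: str
    features: Dict[str, float]
    alignment: List[Tuple[int, int]]


def parse_rule(line: str) -> Rule:
    """Parse one 'LHS ||| SOURCE ||| TARGET ||| FEATURES ||| ALIGNMENT' line."""
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    lhs, source, target, feature_text, alignment_text = fields

    features: Dict[str, float] = {}
    for item in feature_text.split():
        # Split on the last '=' so names such as 'p(e|f)' stay intact.
        name, _, value = item.rpartition("=")
        features[name] = float(value)  # assumption: values are numeric

    alignment: List[Tuple[int, int]] = []
    for pair in alignment_text.split():
        src_index, _, tgt_index = pair.partition("-")
        alignment.append((int(src_index), int(tgt_index)))

    return Rule(lhs, source, target, features, alignment)


def iter_rules(path: str) -> Iterator[Rule]:
    """Stream rules from a gzipped file without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_rule(line)


rule = parse_rule("[VBN] ||| pruned ||| cropped ||| p(e|f)=4.33 p(f|e)=4.88 ||| 0-0")
print(rule.lhs, rule.source, rule.target, rule.features["p(e|f)"], rule.alignment)
# prints: [VBN] pruned cropped 4.33 [(0, 0)]
```

Streaming line by line matters for the larger packages, which run to several gigabytes.
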
PPDB 1.0 comes pre-packaged in six sizes, from S to XXXL. The smaller packages contain only the better-scoring, high-precision paraphrases, while the larger ones aim for high coverage. Each larger package subsumes all smaller ones.

Additionally, PPDB is broken down into lexical paraphrases (i.e. one word to one word), phrasal paraphrases (i.e. multi-word phrases), and syntactic paraphrases, which contain nonterminals. We further split the syntactic paraphrase sets into constituent rules (i.e. rules whose nonterminals are labeled with Penn Treebank constituents only) and rules that contain CCG-style slashed constituents.
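
The breakdown above can be sketched as a small classifier. This is only an illustration under stated assumptions: the function name `classify` is hypothetical, nonterminals are assumed to be written as bracketed, indexed tokens such as `[NP,1]`, and a label is treated as slashed when it contains `/` or `\`. The sample rules in the usage lines are made up for the example. One-word-to-multi-word rules are simply counted as phrasal here.

```python
import re

# Assumption: a nonterminal token looks like "[NP,1]" or "[S\NP,2]".
NONTERMINAL = re.compile(r"\[[^\[\]\s]+,\d+\]")


def classify(lhs: str, source: str, target: str) -> str:
    """Return a coarse rule type for one paraphrase rule."""
    tokens = source.split() + target.split()
    nonterminals = [tok for tok in tokens if NONTERMINAL.fullmatch(tok)]
    if nonterminals:
        # Syntactic rule: check the LHS and every nonterminal for a slash.
        labels = [lhs] + nonterminals
        if any("/" in label or "\\" in label for label in labels):
            return "syntactic (slashed)"
        return "syntactic (constituent)"
    if len(source.split()) == 1 and len(target.split()) == 1:
        return "lexical"
    return "phrasal"


print(classify("[VBN]", "pruned", "cropped"))
print(classify("[NP]", "a lot of", "many"))
print(classify("[NP]", "the [NNS,1] of [NP,2]", "[NP,2] 's [NNS,1]"))
```
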

We'll be posting updates on PPDB to Twitter: @ppdb. You can also email us directly with questions: Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch.

Available packages, by category and size. Each cell gives the file size and the number of rules.

Package                     S              M              L              XL             XXL            XXXL
All: Paraphrases            424MB, 6.8M    757MB, 11.9M   1.5GB, 23.5M   2.8GB, 43.2M   5.7GB, 86.4M   12.2GB, 169M
Lexical: Paraphrases        1.7MB, 31k     1.7MB, 69k     12MB, 198k     33MB, 548k     125MB, 2.1M    451MB, 7.6M
Lexical: Identity           16MB, 437k     16MB, 468k     19MB, 503k     20MB, 532k     21MB, 559k     22MB, 570k
One-To-Many: One-To-Many    3.8MB, 47k     7.6MB, 94k     16MB, 188k     31MB, 376k     61MB, 752k     117MB, 1.5M
One-To-Many: Many-To-One    3.8MB, 47k     7.6MB, 94k     16MB, 188k     31MB, 376k     61MB, 752k     117MB, 1.5M
Phrasal: Paraphrases        42MB, 637k     42MB, 1.2M     209MB, 3.0M    486MB, 6.9M    1.5GB, 20.2M   4.9GB, 68.4M
Phrasal: Identity           170MB, 4.1M    170MB, 4.3M    191MB, 4.5M    198MB, 4.7M    204MB, 4.8M    207MB, 4.9M
Syntactic: Constituent      38MB, 585k     69MB, 1.0M     148MB, 2.2M    300MB, 4.4M    644MB, 9.3M    1.1GB, 16.1M
Syntactic: Non-Constituent  343MB, 5.6M    601MB, 9.6M    1.2GB, 18.2M   2.1GB, 31.4M   3.6GB, 54.8M   5.1GB, 77.4M

PPDB is licensed under a Creative Commons Attribution 3.0 Unported License.

For details on how the dataset was extracted, see our NAACL 2013 short paper. If you use PPDB in your work, please cite the paper as:
@inproceedings{ganitkevitch2013ppdb,
  title = {{PPDB}: The Paraphrase Database},
  author = {Ganitkevitch, Juri and {Van Durme}, Benjamin and
    Callison-Burch, Chris},
  booktitle = {Proceedings of NAACL-HLT},
  pages = {758--764},
  month = {June},
  year = {2013},
  address = {Atlanta, Georgia},
  publisher = {Association for Computational Linguistics},
  url = {http://cs.jhu.edu/~ccb/publications/ppdb.pdf}
}