Download Parsing Engine

While I do not have any formal means to track users of this software, I would appreciate a quick e-mail message if you are a new user. My e-mail address is dan AT bikel DOT net.

See below for installation instructions and system requirements.

Date Version Download Comments
2008/07/08 1.0 download
  • This is version 1.0!!!
  • Completely re-wrote from scratch the danbikel.util.HashMap class, as well as all the implementations of the danbikel.util.MapToPrimitive interface. All these new implementations use generics wherever possible.
  • Added much javadoc documentation. Crucially, all classes (and inner classes) should have, at a minimum, a top-level javadoc comment.
Previous versions
2008/07/03 0.9.9e download
  • Added a setting that prevents tree post-processing from happening after parsing, the parser.decoder.dontPostProcess setting.
  • Fixed an egregious bug whereby I had temporarily overridden the function that strips nonterminals of their augmentations (such as -TMP) in the English package. So all models trained with version 0.9.9d were secretly using all the function tags, sparsifying the data and possibly degrading performance (at least, in English).
2008/03/21 0.9.9d download
2005/09/14 0.9.9c download
  • Added robustness: if the parser cannot produce any parse at the widest beam setting (usually due to zero-probability estimates), it removes all hard constraints and tries again. This feature is on by default, but is controllable by the parser.decoder.relaxConstraintsAfterBeamWidening setting. Please note that for maximum robustness, the parser.decoder.useCommaConstraint setting should be false.
  • Fixed an egregious bug introduced in v0.9.9b whereby the Settings.getIntProperty method would throw a NullPointerException. Inter alia, this bug prevented the parser software from running in the distributed-computing mode using the Switchboard, because the parser clients could not retrieve their timeout values without crashing.
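The two robustness settings named in the 0.9.9c notes above could be collected as follows. This is a minimal sketch, not the engine's own code: it assumes the settings file uses the standard Java properties format and that boolean values are written as true and false. The class name RobustSettingsSketch and its helper methods are hypothetical; only the two setting names come from the notes above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

public class RobustSettingsSketch {
    // Setting names are taken from the release notes; everything else is illustrative.
    static final String RELAX = "parser.decoder.relaxConstraintsAfterBeamWidening";
    static final String COMMA = "parser.decoder.useCommaConstraint";

    /** Builds the two settings recommended above for maximum robustness. */
    static Properties robustSettings() {
        Properties p = new Properties();
        p.setProperty(RELAX, "true");   // on by default; stated explicitly here
        p.setProperty(COMMA, "false");  // recommended for maximum robustness
        return p;
    }

    /** Serializes the settings as java.util.Properties text (assumed file format). */
    static String toPropertiesText(Properties p) {
        try {
            StringWriter out = new StringWriter();
            p.store(out, "robustness settings (illustrative)");
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses properties text back into a Properties object. */
    static Properties parse(String text) {
        try {
            Properties p = new Properties();
            p.load(new StringReader(text));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Print the text that could be pasted into a settings file.
        System.out.print(toPropertiesText(robustSettings()));
    }
}
```

The printed lines can be merged into an existing settings file by hand; how the engine is pointed at that file is described in the user guide and is not shown here.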
2005/06/15 0.9.9b download
  • Fixed a tree output bug that occurred when the parser.decoder.outputHeadLexicalizedLabels setting was true. Before this fix, the “exceptional” nonterminal labels that contained separators as part of the label (such as -LRB-) would not be output correctly when the “output head lexicalized labels” option was on.
  • Incorporated all of the changes made thus far for my current task, called “Operation: Document All Of”, whereby everything in the entire API will have a thorough and accurate javadoc comment. When this task is complete (“RSN”), I will release v1.0 of this software.
2004/10/25 0.9.9a download Bug fix release for Arabic language package.
2004/06/17 0.9.9 download
  • Parsing engine can now perform parameter selection as detailed in my EMNLP 2004 paper, A Distributional Analysis of a Lexicalized Statistical Parsing Model. Two new settings control parameter pruning. The result is a significantly smaller model with no loss of parsing accuracy. Coupled with a tighter beam, there is also a significant speedup in average parse time (again, with no loss in accuracy over the full model with standard-sized beam). Use the new bikel.properties file to perform model pruning when training and running an English parser.
  • Parsing accuracy in Chinese has been improved, due to an improvement in the way double-quotation preterminals are removed, and due to a bug fix in the argument-finding heuristics.
  • The new restore pruned words setting is now true by default.
2004/04/06 0.9.8 download
  • New restore pruned words setting allows re-insertion of pruned words after decoder has produced a parse tree, preserving word indices from input sentence (useful when using parser in an MT system, for example).
  • Now up to 30% faster! A new setting allows for optional use of a faster mechanism to determine whether a modifying nonterminal’s probability is zero. (The engine is capable of instantiating models for which use of this faster mechanism is inappropriate, which is why it is switchable via a run-time setting.)
    Also, parsing engine now includes 50% more zing.
  • Created a new package, danbikel.parser.ms, for all the default model structure classes. As a result, the danbikel.parser package has much less clutter.
2004/03/04 0.9.7 download
  • Trainer and parser now recognize a ".gz" filename extension and compress or decompress accordingly. The train script now creates ".gz" files by default.
  • Added two new scripts:
    • tag-and-train automates process of preparing a tagged test file using Adwait Ratnaparkhi’s MXPOST and training from a training file.
    • train-from-observed takes an observed file as output by the class danbikel.parser.Trainer and creates a derived data file.
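The ".gz" handling described in the 0.9.7 notes above can be illustrated with the standard java.util.zip classes. This sketches the general technique only and is not the engine's actual code; the class and method names (GzAwareIo, openIn, openOut, writeText, readText) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzAwareIo {
    /** True when the file name ends in ".gz". */
    static boolean isGz(Path p) {
        return p.getFileName().toString().endsWith(".gz");
    }

    /** Opens a file for reading, decompressing when the name ends in ".gz". */
    static InputStream openIn(Path p) throws IOException {
        InputStream raw = Files.newInputStream(p);
        return isGz(p) ? new GZIPInputStream(raw) : raw;
    }

    /** Opens a file for writing, compressing when the name ends in ".gz". */
    static OutputStream openOut(Path p) throws IOException {
        OutputStream raw = Files.newOutputStream(p);
        return isGz(p) ? new GZIPOutputStream(raw) : raw;
    }

    static void writeText(Path p, String text) throws IOException {
        try (OutputStream out = openOut(p)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    static String readText(Path p) throws IOException {
        try (InputStream in = openIn(p)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("gz-sketch");
        Path gz = dir.resolve("observed.gz");     // written compressed
        Path plain = dir.resolve("observed.txt"); // written as-is
        String text = "(S (NP (DT the) (NN parser)) (VP (VBZ runs)))\n";
        writeText(gz, text);
        writeText(plain, text);
        // Both files read back to the same text, regardless of compression.
        System.out.println(readText(gz).equals(text));
        System.out.println(readText(plain).equals(text));
    }
}
```

Callers never need to know whether a file is compressed; the choice follows from the file name alone, which is the behavior the release note describes.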
2004/02/11 0.9.6 download
  • The need for count-sharing has been removed. Instead of sharing counts from the last back-off level of the modifier word–generation model (which is typically p(w_Mi | t_Mi)) for the last back-off level of the +TOP+ lexical model, the trainer now collects “trivial” head-generation counts for lexicalized preterminals, so that the head events counts table contains information about all lexicalized nonterminals observed in training (instead of just containing information about only those that are the head child of some parent). See §4.9 of my Computational Linguistics journal paper Intricacies of Collins’ Parsing Model for more about the +TOP+ parameter classes and count sharing.
  • Prior to this version, I had understood that I could not license the software I had developed as a Ph.D. student at the University of Pennsylvania. This turns out not to be the case. The current license is now granted by me and not the University.
2004/02/03 0.9.5 download Incremental training can now be performed even when precomputing probabilities (removing a limitation present since v0.9.2).
2004/01/23 0.9.4 download
  • Capable of Kneser-Ney smoothing (as with most things in the engine, the type of smoothing is determined via a run-time setting in the settings file).
  • Often-requested feature: now includes a hack to do k-best parsing (see the user guide on how to activate this feature).
2004/01/15 0.9.3 download Now includes a new setting, parser.trainer.outputCollins, that causes the trainer to output events in the format output by Michael Collins’ trainer. Also, now includes a new (optional) smoothing “penalty” when a higher-order model’s history was not seen during training.
2003/09/17 0.9.2 download Trainer can now read observations from file and derive counts incrementally, reducing memory footprint for large observations files. Also, user guide has been expanded and reformatted (using my favorite document processor, LyX).
2003/09/16 0.9.1 download Now includes source code! (Sorry it wasn't there before.) Also includes various small fixes and improvements, as well as an overhaul of the language package and constrain-parsing systems.
2003/08/25 0.9.0 download Initial release. Why do you want this when there’s a new one? :-)

Installation Instructions

All downloads are self-extracting shell scripts, which should work on most Unix systems. Just click the link, save the file to disk, make sure it is executable (run chmod +x install.sh if necessary) and run it.

System Requirements

Platform

The entire software package is written in Java, on the Java 2 platform. Any Sun or IBM JVM from version 1.3 onward will work, but versions 1.4.x and higher are preferable. As Java is platform-independent, the software should work on any platform for which a compatible JVM is available. It has been tested on Linux, MacOS X, Solaris and, to a small extent, Windows.

Memory

In general, one should have at least 512MB of RAM for parsing and 1GB of RAM for training.

Training can now be performed incrementally, vastly reducing the RAM footprint. For example, incrementally training on the WSJ corpus can easily be done within a 400MB heap (800MB is a good heap size for non-incremental training).

These are generous estimates for maximum heap sizes; the working sets of the trainer and parser programs are actually much lower. Also, memory footprint is correlated with the size of the training corpus. The above estimates are based on usage of the roughly one million–word corpus of the English Penn Treebank; “your mileage may vary”. For example, one can halve the above estimates when working with the 250k-word Chinese Treebank (v3.0).
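As a rough illustration of the heap figures above, the following sketch compares the running JVM's maximum heap with the 400MB and 800MB sizes quoted for WSJ training. The class name, constants, and helper method are hypothetical and are not part of the parsing engine; the thresholds are simply the figures from this section.

```java
public class HeapCheckSketch {
    static final long MB = 1024L * 1024L;
    // Heap sizes quoted above for training on the WSJ corpus.
    static final long INCREMENTAL_TRAINING_MB = 400;
    static final long NON_INCREMENTAL_TRAINING_MB = 800;

    /** True when the given maximum heap (in bytes) is at least requiredMb megabytes. */
    static boolean heapAtLeast(long maxHeapBytes, long requiredMb) {
        return maxHeapBytes >= requiredMb * MB;
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use (controlled by the -Xmx option).
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("max heap (MB): " + max / MB);
        System.out.println("enough for incremental training: "
                + heapAtLeast(max, INCREMENTAL_TRAINING_MB));
        System.out.println("enough for non-incremental training: "
                + heapAtLeast(max, NON_INCREMENTAL_TRAINING_MB));
    }
}
```

The maximum heap is set with the standard -Xmx option, for example java -Xmx800m HeapCheckSketch. Runtime.maxMemory() may report somewhat less than the -Xmx value on some JVMs, so the result is best treated as approximate.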