A few years ago, I wrote a Java library to provide efficient access to WordNet. I'm afraid I only just recently got around to making it available.
Download: Click here to download (gzip’ed tar archive).
The switchboard framework is a collection of Java packages that provides a mechanism to implement a distributed client-server environment, with a central switchboard responsible for assigning clients to servers and for doling out objects to clients for processing. The switchboard obtains objects for processing from one or more input files. When clients finish processing an object, they return it to the switchboard, and when all the objects of an input file have been processed, the switchboard assembles them in the order in which they appeared in the input file and outputs them to an output file. RMI is used for all communication between the switchboard and its users (clients and servers).
The capabilities of the framework may all be used independently: you can use the switchboard simply to provide a central naming lookup service, to do distributed object-processing, to provide clients with load-balanced servers or any combination of the above.
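As a rough illustration of the object-processing flow described above, here is a minimal in-process sketch in Python. It is not the Java framework itself: there is no RMI, no server assignment and no file I/O, and the names `Switchboard`, `next_object`, `return_object`, `assemble`, `client` and `run` are hypothetical, chosen only for this example. Threads stand in for remote clients.

```python
import threading
from typing import Callable, Dict, List, Optional, Tuple


class Switchboard:
    """Hypothetical in-process stand-in for the central switchboard."""

    def __init__(self, objects: List[str]) -> None:
        # Remember each object's position so output order can be restored.
        self._pending: List[Tuple[int, str]] = list(enumerate(objects))
        self._pending.reverse()  # so that pop() hands objects out in input order
        self._done: Dict[int, str] = {}
        self._total = len(objects)
        self._lock = threading.Lock()

    def next_object(self) -> Optional[Tuple[int, str]]:
        """Dole out the next unprocessed object, or None when none remain."""
        with self._lock:
            return self._pending.pop() if self._pending else None

    def return_object(self, position: int, processed: str) -> None:
        """Accept a processed object back from a client."""
        with self._lock:
            self._done[position] = processed

    def assemble(self) -> List[str]:
        """Return the processed objects in their original input order."""
        with self._lock:
            if len(self._done) != self._total:
                raise RuntimeError("not all objects have been processed")
            return [self._done[i] for i in range(self._total)]


def client(switchboard: Switchboard, process: Callable[[str], str]) -> None:
    """Request objects until the switchboard has none left."""
    while True:
        item = switchboard.next_object()
        if item is None:
            return
        position, obj = item
        switchboard.return_object(position, process(obj))


def run(objects: List[str], process: Callable[[str], str],
        num_clients: int = 3) -> List[str]:
    """Process all objects with several concurrent clients."""
    switchboard = Switchboard(objects)
    threads = [threading.Thread(target=client, args=(switchboard, process))
               for _ in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return switchboard.assemble()
```

For example, `run(["a", "b", "c"], str.upper)` returns the processed objects in input order regardless of which client finished first.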
N.B.: The switchboard framework was used to provide distributed parsing facilities in the parser software (described below). The parser download, however, contains all switchboard classes (i.e., the parser download is self-contained).
Download: Click here to download the switchboard software (gzip’ed tar archive).
For David Chiang's and my Recovering Latent Information in Treebanks paper, we developed a syntax and implemented software for augmenting nodes of phrase structure trees, as an aid for developing statistical parsers (actually, David did all the implementation). Please visit the Treep webpage for more information and to download the software.
A persistent problem with parsing, particularly with English, is that of repeatedly evaluating on the same test set, noting increases in evaluation metrics like labeled bracket recall and precision, but not knowing if those differences are statistically significant (there is another, perhaps more serious problem with repeatedly evaluating on the same test set, but don’t get me started on that here). To address this issue, I have written a short Perl script that reads the output of evalb on two different parsing runs and outputs p-values for whether observed differences in recall and/or precision are statistically significant.
The test employed is a type of “stratified shuffling” (which in turn is a type of “compute-intensive randomized test”). In this testing method, the null hypothesis is that the two models that produced the observed results are the same, such that for each test instance (sentence that was parsed), the two observed scores are equally likely. This null hypothesis is tested by randomly shuffling individual sentences’ scores between the two models and then re-computing the evaluation metrics (precision and recall, in this case). If the difference in a particular metric after a shuffling is equal to or greater than the original observed difference in that metric, then a counter for that metric is incremented. Ideally, one would perform all 2^n shuffles, where n is the number of test cases (sentences), but given that this is often prohibitively expensive, the default number of iterations is 10,000. After all iterations, the likelihood of incorrectly rejecting the null is simply (n_c + 1)/(n_t + 1), where n_c is the number of random differences greater than or equal to the original observed difference, and n_t is the total number of iterations.
Caveat: This type of testing method assumes independence between test instances (sentences). This is a reasonable approximation for parsing results, but it is not strictly correct.
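The shuffling procedure described above can be sketched in Python as follows. This is an illustration, not the Perl script: it assumes the per-sentence counts of matched, gold and test brackets have already been extracted from the evalb output of each run, it assumes differences are compared by absolute value, and the names `recall_precision` and `shuffle_test` are hypothetical.

```python
import random
from typing import List, Tuple

# Per-sentence counts: (matched brackets, gold brackets, test brackets).
Counts = Tuple[int, int, int]


def recall_precision(scores: List[Counts]) -> Tuple[float, float]:
    """Compute corpus-level labeled recall and precision."""
    matched = sum(s[0] for s in scores)
    gold = sum(s[1] for s in scores)
    test = sum(s[2] for s in scores)
    recall = matched / gold if gold else 0.0
    precision = matched / test if test else 0.0
    return recall, precision


def shuffle_test(run_a: List[Counts], run_b: List[Counts],
                 iterations: int = 10000, seed: int = 0) -> Tuple[float, float]:
    """Return estimated p-values for the recall and precision differences."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same sentences")
    rng = random.Random(seed)

    recall_a, precision_a = recall_precision(run_a)
    recall_b, precision_b = recall_precision(run_b)
    observed_r = abs(recall_a - recall_b)      # assumption: absolute difference
    observed_p = abs(precision_a - precision_b)

    count_r = count_p = 0
    for _ in range(iterations):
        shuffled_a, shuffled_b = [], []
        for x, y in zip(run_a, run_b):
            # Under the null hypothesis either score is equally likely to
            # belong to either model, so swap with probability 1/2.
            if rng.random() < 0.5:
                x, y = y, x
            shuffled_a.append(x)
            shuffled_b.append(y)
        r1, p1 = recall_precision(shuffled_a)
        r2, p2 = recall_precision(shuffled_b)
        if abs(r1 - r2) >= observed_r:
            count_r += 1
        if abs(p1 - p2) >= observed_p:
            count_p += 1

    # (n_c + 1) / (n_t + 1), computed separately for each metric.
    return (count_r + 1) / (iterations + 1), (count_p + 1) / (iterations + 1)
```

Swapping within each sentence pair, rather than pooling all scores, is what keeps the shuffle stratified: every shuffled "model" still has exactly one score per test sentence.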
Warning: the script is provided as is; use at your own risk (although it can’t really harm anything to try it out).
Download: Click here to view the Perl script (with most browsers, you must shift-click to download).
I have designed and built an extensible, parallel parsing engine that accommodates many different types of generative, statistical parsing models (including an emulation of Mike Collins’ parsing model with equally good performance; click here for Mike’s original C implementation), and can easily be extended to new domains and new languages. The parser currently comes “out of the box” with settings files and resources to train and do state-of-the-art parsing in English, Chinese and Arabic (“Treebanks sold separately”). It is also fairly easy to develop a new language package. Coming soon: Korean.
Update: As of July 8th, 2008, my parsing software has reached version 1.0!
Parsing Software License highlights:
- Free for research purposes
- May not redistribute
- Must cite in published work
Please read the license itself for details. If you would like a different license, please contact me (Dan Bikel, dan AT bikel DOT net).
Download: Click here to download the parsing engine (including the user guide and API docs).
- Latest version: 1.0
- New features:
  - Now has robustness: if the parser cannot produce any parse at the widest beam setting (usually due to zero-probability estimates), it removes all hard constraints and tries again. This means the parser should always produce some kind of a parse for every input sentence.
  - Can now do parameter selection. For details, see my EMNLP 2004 paper A Distributional Analysis of a Lexicalized Statistical Parsing Model.
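The robustness fallback can be pictured with a small Python sketch. This is not the parser's API: `parse_robustly` and the `parse_fn(sentence, beam, hard_constraints)` callback are hypothetical. The sketch also assumes that beams are tried from narrowest to widest and that the unconstrained retry uses the widest beam, neither of which is stated above.

```python
from typing import Callable, Optional, Sequence

# Hypothetical parser callback: returns a parse tree as a string, or None
# when no parse could be found under the given settings.
ParseFn = Callable[[str, float, bool], Optional[str]]


def parse_robustly(sentence: str, parse_fn: ParseFn,
                   beams: Sequence[float]) -> Optional[str]:
    """Try each beam width with hard constraints, then retry without them."""
    if not beams:
        raise ValueError("at least one beam width is required")
    for beam in beams:  # assumption: ordered from narrowest to widest
        tree = parse_fn(sentence, beam, True)
        if tree is not None:
            return tree
    # No parse even at the widest beam: drop the hard constraints and retry.
    return parse_fn(sentence, beams[-1], False)
```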
If you are a researcher at a restrictive industrial research lab, click here to download the parsing engine (including the user guide and API docs, but without source).
Parsing engine resources
- Download the user guide (last updated April 6th, 2004, 10:51).
  - This guide is provided as part of the parsing engine download, but is made available here as a separate download.
  - Now includes a “Quick Start” section.
- Browse the API on-line. If you’re interested in tinkering with the behavior of the parser, be sure to browse the API of the danbikel.parser.Settings class.
- Download observed events from Sections 02–21 of the Penn Treebank (wsj-02-21.observed.gz)
  - Use the train-from-observed script along with collins.properties to create a “derived data file” for parsing (see the “Quick Start” section of the user guide for details).
  - N.B.:
    - This file must be decompressed using gunzip if using with a version of the parsing engine prior to 0.9.7.
    - The events file format used by this parsing engine is different from that used by Mike Collins’ parser.
- Download Adwait Ratnaparkhi’s MXPOST
  - While the parsing engine is fully capable of doing all its own part-of-speech tagging, I typically run it on data that has already been tagged using an automatic tagger, and that tagger is almost always MXPOST.
- Download “MXPOST extras” for English (gzip’ed tarball), containing:
  - an MXPOST project directory that is the result of training MXPOST on Sections 02–21 of the Penn Treebank and
  - replacements for the trainmxpost and mxpost scripts that come with the MXPOST distribution