Software

Java WordNet library

A few years ago, I wrote a Java library to provide efficient access to WordNet. I'm afraid I only just recently got around to making it available.

Download: Click here to download (gzip’ed tar archive).

Switchboard Framework

The switchboard framework is a collection of Java packages that provides a mechanism to implement a distributed client-server environment, with a central switchboard responsible for assigning clients to servers and for doling out objects to clients for processing. The switchboard obtains objects for processing from one or more input files. When clients finish processing an object, they return it to the switchboard, and when all the objects of an input file have been processed, the switchboard assembles them in the order in which they appeared in the input file and outputs them to an output file. RMI is used for all communication between the switchboard and its users (clients and servers).
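
The dispatch-and-reassemble idea described above can be sketched in a few lines. This is only an illustration in Python, not the framework's Java API: it uses threads in a single process where the real framework uses RMI between separate clients and servers, and the names `process_in_order`, `worker`, and `num_clients` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_order(items, worker, num_clients=4):
    """Hand each item to a pool of workers and return results in input order.

    Workers may finish in any order; each result is stored at the index of
    the item it came from, mimicking how the switchboard reassembles
    processed objects in the order they appeared in the input file.
    """
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        # Remember which input position each submitted job belongs to.
        futures = {pool.submit(worker, item): index
                   for index, item in enumerate(items)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

For example, `process_in_order(sentences, parse)` would return one result per sentence, in the original sentence order, regardless of which worker finished first (`sentences` and `parse` are placeholders here).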

The capabilities of the framework may all be used independently: you can use the switchboard simply to provide a central naming lookup service, to do distributed object-processing, to provide clients with load-balanced servers or any combination of the above.

N.B.: The switchboard framework was used to provide distributed parsing facilities in the parser software (described below). The parser download, however, contains all switchboard classes (i.e., the parser download is self-contained).

Download: Click here to download the switchboard software (gzip’ed tar archive).

Treep

For David Chiang's and my Recovering Latent Information in Treebanks paper, we developed a syntax and implemented software for augmenting nodes of phrase structure trees, as an aid for developing statistical parsers (actually, David did all the implementation). Please visit the Treep webpage for more information and to download the software.

Randomized Parsing Evaluation Comparator (Statistical Significance Tester for evalb Output)

A persistent problem with parsing, particularly with English, is that of repeatedly evaluating on the same test set, noting increases in evaluation metrics like labeled bracket recall and precision, but not knowing if those differences are statistically significant (there is another, perhaps more serious problem with repeatedly evaluating on the same test set, but don’t get me started on that here). To address this issue, I have written a short Perl script that reads the output of evalb on two different parsing runs and outputs p-values for whether observed differences in recall and/or precision are statistically significant.

The test employed is a type of “stratified shuffling” (which in turn is a type of “compute-intensive randomized test”). In this testing method, the null hypothesis is that the two models that produced the observed results are the same, such that for each test instance (sentence that was parsed), the two observed scores are equally likely. This null hypothesis is tested by randomly shuffling individual sentences’ scores between the two models and then re-computing the evaluation metrics (precision and recall, in this case). If the difference in a particular metric after a shuffling is equal to or greater than the original observed difference in that metric, then a counter for that metric is incremented. Ideally, one would perform all 2^n shuffles, where n is the number of test cases (sentences), but given that this is often prohibitively expensive, the default number of iterations is 10,000. After all iterations, the p-value for a metric (the estimated likelihood of incorrectly rejecting the null hypothesis) is simply (n_c + 1)/(n_t + 1), where n_c is the number of random differences greater than or equal to the original observed difference, and n_t is the total number of iterations.
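
The procedure above can be sketched as follows. This is a minimal illustration in Python, not the Perl script itself. It assumes each sentence is represented by a tuple of (matched brackets, gold brackets, test brackets), that differences are compared by absolute value, and that a fixed random seed is acceptable; the names `recall_precision` and `shuffle_test` are hypothetical.

```python
import random


def recall_precision(rows):
    """Compute corpus-level recall and precision from per-sentence counts.

    Each row is (matched, gold, test): matched brackets, brackets in the
    gold tree, and brackets in the parser's tree (an assumed layout).
    """
    matched = sum(m for m, _, _ in rows)
    gold = sum(g for _, g, _ in rows)
    test = sum(t for _, _, t in rows)
    recall = matched / gold if gold else 0.0
    precision = matched / test if test else 0.0
    return recall, precision


def shuffle_test(run_a, run_b, iterations=10000, seed=0):
    """Return (p_recall, p_precision) for two runs on the same sentences."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same sentences")
    rng = random.Random(seed)
    recall_a, precision_a = recall_precision(run_a)
    recall_b, precision_b = recall_precision(run_b)
    observed_recall = abs(recall_a - recall_b)
    observed_precision = abs(precision_a - precision_b)

    count_recall = count_precision = 0
    for _ in range(iterations):
        shuffled_a, shuffled_b = [], []
        for row_a, row_b in zip(run_a, run_b):
            # Under the null hypothesis either model could have produced
            # either score, so swap each sentence's pair with probability 1/2.
            if rng.random() < 0.5:
                row_a, row_b = row_b, row_a
            shuffled_a.append(row_a)
            shuffled_b.append(row_b)
        r1, p1 = recall_precision(shuffled_a)
        r2, p2 = recall_precision(shuffled_b)
        if abs(r1 - r2) >= observed_recall:
            count_recall += 1
        if abs(p1 - p2) >= observed_precision:
            count_precision += 1

    # (n_c + 1) / (n_t + 1), as in the text above.
    return ((count_recall + 1) / (iterations + 1),
            (count_precision + 1) / (iterations + 1))
```

Because of the +1 in both numerator and denominator, the smallest p-value this sketch can report is 1/(iterations + 1), so more iterations are needed to resolve very small p-values.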

Caveat: This type of testing method assumes independence between test instances (sentences). This is not a bad assumption for parsing results, but it is not strictly correct, either.

Warning: the script is provided as is; use at your own risk (although it can’t really harm anything to try it out).

Download: Click here to view the Perl script (with most browsers, you must shift-click to download).

Multilingual Statistical Parsing Engine

I have designed and built an extensible, parallel parsing engine that accommodates many different types of generative, statistical parsing models (including an emulation of Mike Collins’ parsing model with equally good performance; click here for Mike’s original C implementation), and can easily be extended to new domains and new languages. The parser currently comes “out of the box” with settings files and resources to train and do state-of-the-art parsing in English, Chinese, and Arabic (“Treebanks sold separately”). It is also fairly easy to develop a new language package. Coming soon: Korean.

Update: as of July 8th, 2008, my parsing software has reached version 1.0!

Parsing Software License highlights:

Please read the license itself for details. If you would like a different license, please contact me (Dan Bikel, dan AT bikel DOT net).

Download: Click here to download the parsing engine (including the user guide and API docs).

If you are a researcher at a restrictive industrial research lab, click here to download the parsing engine (including the user guide and API docs, but without source).

Parsing engine resources