While I do not have any formal means to track users of this software, I would
appreciate a quick e-mail message if you are a new user. My e-mail address
is dan AT bikel DOT net.
| Date | Version | Download | Comments |
| 2008/10/30 |
1.2 |
download |
- Fixed a long-standing limitation whereby settings could not reliably be
loaded or changed at run-time. Now,
the danbikel.parser.Settings
class defines an interface called
danbikel.parser.Settings.Change
which lets any class, or instance of a class, that needs to know about
settings changes be notified as soon as they occur. Please note
that only a single language may be parsed at a time; however, one may now
change to a new group of settings for a new language in the middle of a
parsing run at will.
An important consequence of removing this long-standing limitation is
that one no longer needs to specify the settings file on the command
line, using -Dparser.settingsFile=<settings file> like an
animal. One can now finally specify the settings file as a command-line
argument like a normal human being.
|
| Previous versions |
| 2008/10/29 |
1.1 |
download |
- Fixed limitation in
danbikel.switchboard.Switchboard
class whereby two streams were opened for every file immediately upon
starting a multi-file run. Now, streams are opened on demand, so that the
dreaded “too many open files” error is avoided.
- Several smaller bug fixes and javadoc fixes/improvements.
|
| 2008/07/08 |
1.0 |
download |
- This is version 1.0!!!
- Completely rewrote from scratch the
danbikel.util.HashMap
class, as well as all the implementations of the
danbikel.util.MapToPrimitive
interface. All these new implementations use generics wherever possible.
- Added much javadoc documentation. Crucially, all classes (and inner
classes) should have, at a minimum, a top-level javadoc comment.
|
| 2008/07/03 |
0.9.9e |
download |
- Added a setting that prevents tree post-processing from happening after
parsing, the parser.decoder.dontPostProcess
setting.
- Fixed an egregious bug whereby I had temporarily overridden the
function that strips nonterminals of their augmentations (such as
-TMP) in the English package. So all models trained with
version 0.9.9d were secretly using all the function tags, sparsifying
the data and possibly degrading performance (at least, in English).
|
| 2008/03/21 |
0.9.9d |
download |
|
| 2005/09/14 |
0.9.9c |
download |
- Added robustness: if the parser cannot produce any parse
at the widest beam setting (usually due to zero-probability estimates),
it removes all hard constraints and tries again. This feature is on by
default, but is controllable by the parser.decoder.relaxConstraintsAfterBeamWidening
setting. Please note that for maximum robustness, the parser.decoder.useCommaConstraint
setting should be false.
- Fixed an egregious bug introduced in v0.9.9b whereby the Settings.getIntProperty
method would throw a NullPointerException. Inter alia,
this bug prevented the parser software from running in the
distributed-computing mode using the Switchboard,
because the parser clients could not retrieve their timeout values
without crashing.
|
| 2005/06/15 |
0.9.9b |
download |
- Fixed a tree output bug that occurred when the parser.decoder.outputHeadLexicalizedLabels
setting was true. Before this fix, the “exceptional”
nonterminal labels that contained separators as part of the label (such as
-LRB-) would not be output correctly when the “output head
lexicalized labels” option was on.
- Incorporated all of the changes made thus far for my current task,
called “Operation: Document All Of”, whereby
everything in the entire API will have a thorough and accurate javadoc
comment. When this task is complete (“RSN”), I will release
v1.0 of this software.
|
| 2004/10/25 |
0.9.9a |
download |
Bug fix release for Arabic language package.
|
| 2004/06/17 |
0.9.9 |
download |
- Parsing engine can now perform parameter selection as detailed in my
EMNLP 2004
paper, A Distributional Analysis of a
Lexicalized Statistical Parsing Model. There are two new settings
that control parameter pruning:
The result is a significantly smaller model that performs with no
loss of parsing accuracy. Coupled with a tighter beam, there
is also a significant speedup in average parse time (again, with no
loss in accuracy over the full model with standard-sized beam). Use
the new bikel.properties file to perform model pruning
when training and running an English parser.
- Parsing accuracy in Chinese has been improved, due to an improvement in
the way double-quotation preterminals are removed, and due to a bug fix
in the argument-finding heuristics.
- The new restore
pruned words setting is now true by default.
|
| 2004/04/06 |
0.9.8 |
download |
- New restore
pruned words setting allows re-insertion of pruned words after the
decoder has produced a parse tree, preserving word indices from the input
sentence (useful when using the parser in an MT system, for example).
- Now up to 30% faster! A
new setting allows for optional use of a faster mechanism to
determine whether a modifying nonterminal’s probability is zero.
(The engine is capable of instantiating models for which use of this
faster mechanism is inappropriate, which is why it is switchable via a
run-time setting.)
Also, parsing engine now includes 50% more zing.
- Created a new package,
danbikel.parser.ms,
for all the default model
structure classes. As a result, the danbikel.parser
package has much less clutter.
|
| 2004/03/04 |
0.9.7 |
download |
- Trainer and parser now recognize a ".gz" filename
extension and compress or decompress accordingly. The train
script now creates ".gz" files by default.
- Added two new scripts:
- tag-and-train automates the process of preparing a tagged test
file using Adwait Ratnaparkhi’s MXPOST
and training from a training file.
- train-from-observed takes an observed file
as output by the class
danbikel.parser.Trainer
and creates a derived data file.
|
| 2004/02/11 |
0.9.6 |
download |
- The need for count-sharing
has been removed. Instead of sharing counts from the last back-off level
of the modifier word–generation model (which is typically
p(w_{M_i} | t_{M_i})) for the last back-off level of the
+TOP+ lexical model, the trainer now collects
“trivial” head-generation counts for lexicalized preterminals,
so that the head events counts table contains information about all
lexicalized nonterminals observed in training (instead of just containing
information about only those that are the head child of some parent). See
§4.9 of my Computational
Linguistics journal paper Intricacies of Collins’
Parsing Model for more about the +TOP+ parameter classes
and count sharing.
- Prior to this version, I had understood that I could not license the
software I had developed as a Ph.D. student at the University of
Pennsylvania. This turns out not to be the case. The current license is
now granted by me and not the University.
|
| 2004/02/03 |
0.9.5 |
download |
Incremental training can now be performed even when precomputing
probabilities (removing a limitation present since v0.9.2).
|
| 2004/01/23 |
0.9.4 |
download |
- Capable of Kneser-Ney smoothing (as with most things in the
engine, the type of smoothing is determined via a run-time setting in
the settings file).
- Often-requested feature: now includes a hack to do
k-best parsing (see the user guide on how to activate this
feature).
|
| 2004/01/15 |
0.9.3 |
download |
Now includes a new setting, parser.trainer.outputCollins,
that causes the trainer to output events in the format output by
Michael Collins’ trainer. Also, now includes a new (optional)
smoothing “penalty” when a higher-order model’s
history was not seen during training. |
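The run-time settings notification described in the notes for version 1.2 follows the familiar listener pattern. The Java sketch below illustrates the idea only: SettingsSketch, ChangeListener and the example.language key are hypothetical names chosen for illustration, not the actual danbikel.parser.Settings API, whose method signatures are not shown on this page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of change notification; NOT the danbikel.parser.Settings API. */
public class SettingsSketch {
    /** Hypothetical callback for objects that want to hear about settings changes. */
    public interface ChangeListener {
        void settingsChanged(Map<String, String> newSettings);
    }

    private final Map<String, String> props = new HashMap<>();
    private final List<ChangeListener> listeners = new ArrayList<>();

    /** Registers a listener to be called after every settings load. */
    public void register(ChangeListener listener) {
        listeners.add(listener);
    }

    /** Returns the current value of a setting, or null if it is not set. */
    public String get(String name) {
        return props.get(name);
    }

    /** Replaces the settings as a group, then notifies every registered listener. */
    public void load(Map<String, String> newProps) {
        props.clear();
        props.putAll(newProps);
        // Listeners receive a read-only copy so they cannot alter the live settings.
        Map<String, String> snapshot =
            Collections.unmodifiableMap(new HashMap<>(props));
        for (ChangeListener listener : listeners) {
            listener.settingsChanged(snapshot);
        }
    }

    public static void main(String[] args) {
        SettingsSketch settings = new SettingsSketch();
        settings.register(s ->
            System.out.println("language is now " + s.get("example.language")));
        // Switch to a new group of settings in the middle of a run.
        settings.load(Collections.singletonMap("example.language", "english"));
        settings.load(Collections.singletonMap("example.language", "chinese"));
    }
}
```

In this sketch the settings are replaced as a whole group on each load, which mirrors the note that only one language is active at a time while still allowing a switch mid-run.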
All downloads are self-extracting shell scripts, which should work on most Unix
systems. Just click the link, save the file to disk, make sure it is
executable (execute chmod +x install-commercial.sh if necessary) and
run.
In general, one should have at least 512MB of RAM for parsing and 1GB of RAM
for training.
These are generous estimates for maximum heap sizes; the working sets of the
trainer and parser programs are actually much smaller. Also, memory footprint is
correlated with the size of the training corpus. The above estimates are based
on usage of the roughly one million–word corpus of the English Penn Treebank;
“your mileage may vary”. For example, one can halve the above
estimates when working with the 250k-word Chinese Treebank (v3.0).
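As a rough aid, the following Java sketch compares the running JVM's maximum heap against the estimates above (512MB for parsing, 1GB for training). The class name HeapCheck and its methods are hypothetical and not part of the parser distribution; Runtime.maxMemory() is the standard JDK call, and the value it reports may be slightly less than the heap size requested with the JVM's -Xmx option.

```java
/** Hypothetical helper; not part of the parser distribution. */
public class HeapCheck {
    private static final long MB = 1024L * 1024L;

    /** Estimates from the text above, for the English Penn Treebank. */
    public static final long PARSING_MB = 512;
    public static final long TRAINING_MB = 1024;

    /** Returns true if the given maximum heap (in bytes) meets the requirement (in MB). */
    public static boolean heapSufficient(long maxHeapBytes, long requiredMb) {
        return maxHeapBytes >= requiredMb * MB;
    }

    public static void main(String[] args) {
        // Maximum amount of memory the JVM will attempt to use for the heap.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Maximum heap: " + (maxHeap / MB) + " MB");
        System.out.println("Enough for parsing:  "
            + heapSufficient(maxHeap, PARSING_MB));
        System.out.println("Enough for training: "
            + heapSufficient(maxHeap, TRAINING_MB));
    }
}
```

Running it with, for example, `java -Xmx1g HeapCheck` shows what the JVM actually grants for a given setting. For a smaller corpus such as the Chinese Treebank, the thresholds could be halved, as noted above.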