tokenize -h
tokenize -v
... | tokenize [ tokenizing_fst ] [ options ] | ...
... | tokenize -l < language > [ options ] | ...
... | tokenize -n < tokenizing_fst > | ...
| -h | Print the help message. The help message also shows the preset path of the default tokenizing transducer, which is usually language-specific. If tokenize is called without arguments, this default transducer is used. |
| -v | Print the program version number. |
| tokenizing_fst | Use the single tokenizing transducer given explicitly. A fast, deterministic tokenizing algorithm is used. |
| -l language | Use the default tokenizing_fst for the given language. The name must be one of the supported languages, written in lower case. A fast, deterministic tokenizing algorithm is used. |
| -n tokenizing_fst | Apply a possibly non-deterministic tokenizer, which allows several tokenization results for a given string. This is slower but more powerful than the default deterministic tokenizing algorithm. |
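The invocation forms above can be wrapped in a small shell helper. This is only a sketch: `tokenize_text` and its mode names (`default`, `fst`, `lang`, `nondet`) are hypothetical and not part of tokenize itself, and the language name and transducer path in the usage note are placeholders. The helper lower-cases the language name because `-l` expects it in lower case.

```shell
# Hypothetical wrapper around the documented invocation forms of tokenize.
# Reads text on stdin and writes the tokenizer output to stdout.
tokenize_text() {
    mode=$1   # one of: default, fst, lang, nondet (names invented here)
    arg=$2    # transducer path or language name, depending on the mode
    case $mode in
        default)
            # No arguments: the preset default transducer is used.
            tokenize
            ;;
        fst)
            # Explicit transducer, deterministic algorithm.
            tokenize "$arg"
            ;;
        lang)
            # -l expects a lower-case language name.
            lang=$(printf '%s' "$arg" | tr '[:upper:]' '[:lower:]')
            tokenize -l "$lang"
            ;;
        nondet)
            # Possibly non-deterministic tokenizer: slower, more powerful.
            tokenize -n "$arg"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 2
            ;;
    esac
}
```

For example, `cat input.txt | tokenize_text lang english` would run `tokenize -l english`, assuming `english` is among the supported languages, and `cat input.txt | tokenize_text nondet my_tokenizer.fst` would run `tokenize -n my_tokenizer.fst` on a transducer file of your own.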
[options] (rarely needed by human users) may be the following:
| [-d number] | Sets the size of an internal buffer. If the buffer overflows, an error message is issued. Such an overflow is a heuristic indication that the transducer used is non-deterministic. To verify this hypothesis, either use the non-deterministic mode (-n tokenizing_fst) or pass gradually increasing values of number, to allow for unusually long tokens, until the message disappears (there is no guarantee that it will). |
| [-e] | Immediate output. Recommended for daemon usage; suboptimal otherwise. |
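The strategy described for `-d` (retry with gradually increasing values until the overflow message disappears) could be scripted roughly as follows. This is a sketch under stated assumptions, not documented behavior: it assumes an overflow shows up as a non-zero exit status or as text on stderr, the starting value and upper limit are arbitrary, and `tokenize_with_growing_buffer` is a hypothetical helper name.

```shell
# Retry the deterministic tokenizer with growing -d values.
# ASSUMPTION: a buffer overflow is signalled by a non-zero exit status or
# by any output on stderr; the real tokenize may signal it differently.
tokenize_with_growing_buffer() {
    fst=$1            # path to the tokenizing transducer
    input=$2          # file holding the text to tokenize
    d=${3:-1000}      # first -d value to try (arbitrary, not a documented default)
    max=${4:-64000}   # give up once -d would exceed this (arbitrary)
    err=$(mktemp) || return 2
    while [ "$d" -le "$max" ]; do
        if out=$(tokenize "$fst" -d "$d" < "$input" 2> "$err") && [ ! -s "$err" ]; then
            printf '%s\n' "$out"
            echo "succeeded with -d $d" >&2
            rm -f "$err"
            return 0
        fi
        d=$((d * 2))  # double the buffer size and try again
    done
    rm -f "$err"
    # No value up to the limit worked: the transducer may be non-deterministic.
    echo "still failing at -d limit $max; consider: tokenize -n $fst" >&2
    return 1
}
```

A call such as `tokenize_with_growing_buffer my_tokenizer.fst input.txt 1000 64000` tries 1000, 2000, 4000 and so on. As noted for `-d`, success is not guaranteed, so the helper falls back to suggesting the `-n` mode.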