Xerox Research Centre Europe 
Copyright © 1997 Xerox Corporation. All rights reserved. 

Lexical Tokenization


Lexical tokenization breaks a given text into a sequence of tokens. It requires a tokenizing transducer, which is language-dependent.

Usage:

The tokenizing tool, tokenize, can be called in one of the following ways:

      tokenize -h
      tokenize -v
      ... | tokenize [ tokenizing_fst ] [ options ] | ...
      ... | tokenize -l <language> [ options ] | ...
      ... | tokenize -n <tokenizing_fst> | ...

Arguments

-h Print the help message. The help output also shows the preset path of the default tokenizing transducer (usually language-specific). If tokenize is called without arguments, this default is used.
-v Print the program version number.
tokenizing_fst Use the single tokenizing transducer given explicitly. A fast, deterministic tokenizing algorithm is used.
-l language Use the default tokenizing_fst for the given language. Give the name of one of the supported languages in lower case. A fast, deterministic tokenizing algorithm is used.
-n tokenizing_fst Apply a possibly non-deterministic tokenizer, which allows several tokenization results for a given string. This is slower but more powerful than the default deterministic tokenizing algorithm.
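
The call forms above can be assembled programmatically. The following is a minimal Python sketch, not part of the tool: the helper name build_tokenize_command is hypothetical, and it assumes that -l, -n and an explicit transducer path are mutually exclusive, as the usage lines suggest. It only builds the argument list and does not run anything.

```python
from typing import List, Optional


def build_tokenize_command(
    fst: Optional[str] = None,
    language: Optional[str] = None,
    nondeterministic: bool = False,
) -> List[str]:
    """Return an argv list for one of the documented call forms.

    Hypothetical helper; the mutual exclusion of -l with the other
    forms is an assumption read off the usage lines.
    """
    if language is not None:
        if fst is not None or nondeterministic:
            raise ValueError("-l cannot be combined with a transducer path or -n")
        # The documentation asks for the language name in lower case.
        return ["tokenize", "-l", language.lower()]
    if nondeterministic:
        if fst is None:
            raise ValueError("-n requires a tokenizing transducer")
        return ["tokenize", "-n", fst]
    if fst is not None:
        # Explicit transducer, deterministic algorithm.
        return ["tokenize", fst]
    # No arguments: the preset default transducer is used.
    return ["tokenize"]
```

For example, build_tokenize_command(language="French") returns ['tokenize', '-l', 'french'].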

[options]   (rarely needed by human users) may be the following:
[-d number] Sets the size of an internal buffer. If the buffer overflows, an error message is printed. Such an overflow is a heuristic indication that the transducer used is non-deterministic. To verify this hypothesis, either apply the non-deterministic mode (-n tokenizing_fst) or pass gradually increasing number values, to allow for unusually long tokens, until the message disappears (there is no guarantee that it will).
[-e] Immediate output. This is recommended for daemon usage; otherwise it is suboptimal.
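
The advice to try gradually increasing -d values can be sketched as a retry loop. This Python sketch rests on several assumptions: the function name is hypothetical, the doubling step is an arbitrary choice, and how an overflow is detected (exit status, error text) is not documented here, so that decision is left to a caller-supplied run function that reports success or failure.

```python
from typing import Callable, List, Optional, Tuple

# A runner takes an argv list and the input text, and returns
# (succeeded, output). How success is detected is up to the caller.
Runner = Callable[[List[str], str], Tuple[bool, str]]


def tokenize_with_growing_buffer(
    run: Runner,
    text: str,
    base_cmd: List[str],
    start: int,
    limit: int,
) -> Optional[Tuple[int, str]]:
    """Retry with larger -d values until a run succeeds or limit is passed.

    Returns (buffer_size, output) on success. Returns None if every size
    up to `limit` still failed, which hints that the transducer may be
    non-deterministic and that -n is worth trying instead.
    """
    if start <= 0:
        raise ValueError("start must be positive")
    size = start
    while size <= limit:
        # Options follow the transducer argument, as in the usage lines.
        ok, output = run(base_cmd + ["-d", str(size)], text)
        if ok:
            return size, output
        size *= 2  # doubling is an arbitrary choice for this sketch
    return None
```

As the option description notes, success at some larger size is not guaranteed, which is why the loop has an upper limit.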


Input

Raw text in character form. To gain speed, multi-character symbols are not allowed. If you really need multi-character symbols, please tell us.

Output

The tokenized text, one token per line.
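
Because each token is on its own line, the output is easy to consume from a script. The Python sketch below is an illustration only: the function names are hypothetical, it assumes that empty lines carry no tokens, that a failed run exits with a non-zero status, and that the text fits the locale's default encoding. None of this is confirmed by the documentation above.

```python
import subprocess
from typing import List, Sequence


def parse_tokens(output: str) -> List[str]:
    """Split tokenizer output into a list of tokens, one per line.

    Empty lines are dropped (an assumption of this sketch).
    """
    return [line for line in output.splitlines() if line]


def run_tokenize(text: str, cmd: Sequence[str] = ("tokenize",)) -> List[str]:
    """Pipe `text` through the tokenizer and return the tokens.

    check=True raises CalledProcessError on a non-zero exit status;
    whether tokenize signals errors that way is an assumption.
    """
    result = subprocess.run(
        list(cmd), input=text, capture_output=True, text=True, check=True
    )
    return parse_tokens(result.stdout)
```

For example, parse_tokens("The\ncat\n") returns ['The', 'cat'].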

Supported languages

English, French, German, Italian, Spanish.
We welcome your comments and suggestions.
Written by Tamás Gaál. Last modified on Jan. 12, 1998.