lookup -h
lookup -v
.... | lookup lexicon_file [ options ] | ....
.... | lookup -l language [ options ] | ....
.... | lookup -f lookup_script [ options ] | ....
| -h | Print the help message. |
| -v | Print the program version number. |
| lexicon_file | A single lexical transducer is defined explicitly. |
| -l language | Use the default lookup_script for the given language. |
| -f lookup_script | The script contains lookup strategies and file names of transducers (see below). |
[options] may be one or more of the following:
| [-a 1|2|d|b|D|B] | Algorithm choice: 1 or d or D means depth-first lookup (recursive). 2 or b or B means breadth-first lookup (incremental building/pruning of possible outputs from a cascade of transducers). Default: 1. |
| [-d] | Immediate output (recommended for daemon usage, otherwise suboptimal). |
| [-o #] | A value, tied to the symbol length of a token, that tunes the storage versus time trade-off in the breadth-first part. If lookup is slow, decrease it. Default: 8. |
| [-x] | Xelda style output. |
| [-flags flag_values] | Lookup flags can be set (with decreasing priority) by the -flags option or by the environment variable LOOKUP_FLAGS or by default (see below). |
setenv LOOKUP_FLAGS flag_values
If the variable LOOKUP_FLAGS does not exist in your environment,
the default flag_values, cKv29, will be used.
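
The priority order described above (-flags option, then LOOKUP_FLAGS, then the default) can be illustrated with a minimal Python sketch. It is not part of lookup itself; the function name resolve_flags and its parameters are hypothetical and serve only to show the order of precedence.

```python
import os

DEFAULT_FLAGS = "cKv29"  # default flag_values stated in the text above


def resolve_flags(cli_flags=None, environ=None):
    """Return the effective flag_values string.

    Priority (highest first): value of the -flags option,
    the LOOKUP_FLAGS environment variable, the built-in default.
    """
    env = os.environ if environ is None else environ
    if cli_flags is not None:
        return cli_flags
    if "LOOKUP_FLAGS" in env:
        return env["LOOKUP_FLAGS"]
    return DEFAULT_FLAGS
```

Passing a plain dictionary as environ makes the behaviour easy to check without touching the real environment.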
flag_values is a string that may contain the following
characters (any other symbols are ignored).
Below, # denotes a number and $ a character.
| c | Print all comments (statistics etc.) |
| n | With [-a 1]: Create a result net for the lookup of every word (very time consuming, but makes sure that no result is printed more than once) |
| k# | With [-a 1]: When making a lookup on a net vector, intermediate results are checked immediately on a deterministic simple network corresponding to the input side of the following net (transducer), if this transducer has at least # states (e.g. k1000). The deterministic check net is generated at run time. |
| K | With [-a 1]: Like k# except that the deterministic check net is taken from the same file as the transducer it belongs to. The order of networks in the file must be: 1. transducer, 2. check net. |
| v# | The input side of a transducer that is used to check intermediate results will be partially vectorized, i.e. states having at least # arcs will get an arc vector (e.g. v20). |
| e# | With [-a 1]: Both approaches (with and without a check net) will be tried and evaluated on # words each. The faster method will be used for the remaining text. |
| m$ | Multi-character symbols are allowed to occur in the input if they occur in any one of the networks on the side specified by $: i = INPUT side (usually the LOWER side), u = UPPER side, l = LOWER side, b = BOTH sides. |
| L...L | The string between the two L characters will be used in the output as the separator between surface form and lemma. Default: Tabulation. |
| T...T | The string between the two T characters will be used in the output as the separator between lemma and tag. Default: Tabulation. |
| x | Do not copy the input to the output (by default it is copied). |
| I# | The maximum length of an input line will be #. Default: 1000. |
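
To make the structure of a flag_values string concrete, here is a small Python sketch that splits such a string into its individual flags. It follows the table above, but it is an illustration only and not the parser used by lookup. The function name parse_flags is hypothetical, and the handling of malformed input (a missing number, an unterminated L or T string) is an assumption.

```python
DIGITS = "0123456789"


def parse_flags(flag_values):
    """Split a flag_values string into a dict, following the table above."""
    flags = {}
    i, n = 0, len(flag_values)
    while i < n:
        ch = flag_values[i]
        i += 1
        if ch in "cnKx":                 # simple switches
            flags[ch] = True
        elif ch in "kveI":               # letter followed by a number (#)
            j = i
            while j < n and flag_values[j] in DIGITS:
                j += 1
            if j > i:
                flags[ch] = int(flag_values[i:j])
            i = j
        elif ch == "m":                  # letter followed by one character ($)
            if i < n:
                flags[ch] = flag_values[i]
                i += 1
        elif ch in "LT":                 # separator string enclosed in L...L or T...T
            end = flag_values.find(ch, i)
            if end != -1:
                flags[ch] = flag_values[i:end]
                i = end + 1
        # any other character has no influence and is skipped
    return flags
```

With the default string cKv29 this sketch yields the switches c and K plus v set to 29.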
surface_form_1 TAB lemma_1_1 TAB tag_1_1
surface_form_1 TAB lemma_1_2 TAB tag_1_2
....
surface_form_1 TAB lemma_1_n TAB tag_1_n
<empty line>
surface_form_2 TAB lemma_2_1 TAB tag_2_1
surface_form_2 TAB lemma_2_2 TAB tag_2_2
....
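
The output layout shown above (one analysis per line, blocks separated by an empty line) is straightforward to read back in a script. The following Python sketch assumes the default tab separators, i.e. no L...L or T...T flags, and skips any line that does not have three tab-separated fields. The function name parse_lookup_output and the sample tags in the usage note are hypothetical.

```python
def parse_lookup_output(text):
    """Group lookup output lines into one entry per surface form."""
    entries = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None               # an empty line ends the current block
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue                     # not an analysis line; skipped in this sketch
        surface, lemma, tag = parts
        if current is None:
            current = {"surface": surface, "analyses": []}
            entries.append(current)
        current["analyses"].append((lemma, tag))
    return entries
```

For example, the two-block text "walks\twalk\t+Verb\nwalks\twalk\t+Noun\n\ndog\tdog\t+Noun\n" produces two entries, the first with two analyses and the second with one.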
setenv LOOKUP_SCRIPT_BASE /dir1/dir2/
setenv LOOKUP_SCRIPT_ENGLISH english/english.lsc
setenv LOOKUP_SCRIPT_FRENCH french/french.lsc
This will give for English
/dir1/dir2/english/english.lsc
and for French
/dir1/dir2/french/french.lsc
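
The way the two environment variables combine can be sketched in Python as follows. This mirrors the example above, where the base directory and the language-specific path are simply concatenated. The function name script_path is hypothetical, and deriving the variable name by upper-casing the language name is an assumption based on the ENGLISH and FRENCH examples.

```python
import os


def script_path(language, environ=None):
    """Build the lookup script path for a language from the environment."""
    env = os.environ if environ is None else environ
    base = env["LOOKUP_SCRIPT_BASE"]
    # Assumption: the variable suffix is the upper-cased language name.
    relative = env["LOOKUP_SCRIPT_" + language.upper()]
    return base + relative
```

A missing variable raises KeyError in this sketch; how lookup itself reacts in that case is not covered here.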
symbol_1 filename_1
...
symbol_n filename_n
<empty line>
symbol_a symbol_b ...
...
This may be e.g.:
lexicon /dir1/dir2/..../french.fst
normalizer /dir1/dir2/..../french.ntr
guesser /dir1/dir2/..../french.gtr
defaulttags /dir1/dir2/..../french.def
lexicon
normalizer lexicon
guesser
normalizer guesser
defaulttags
The example means:
First, all needed transducer files are listed (each
containing at least one transducer, which may be followed
by a check net; see flags k# and K),
and symbols (names) are assigned to them.
If no symbols are assigned,
defaults are used (1st line l for lexicon,
2nd line n for normalizer, 3rd line g for guesser).
Then all strategies for looking up a word are listed.
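
The two-part layout of a lookup script (symbol and file name assignments, an empty line, then one strategy per line) can be read with a short Python sketch like the one below. It is an illustration under assumptions, not the reader used by lookup: the function name parse_lookup_script is hypothetical, file names are assumed to contain no whitespace, and a line with a single field in the first part is treated as a file name that receives the default symbol for its position.

```python
DEFAULT_SYMBOLS = ["l", "n", "g"]  # defaults for lines 1 to 3, as stated above


def parse_lookup_script(text):
    """Return (transducers, strategies) parsed from a lookup script."""
    transducers = {}   # symbol -> file name
    strategies = []    # each strategy is a list of symbols
    in_strategies = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # the first empty line after the assignments starts the strategies
            if transducers:
                in_strategies = True
            continue
        if in_strategies:
            strategies.append(fields)
        elif len(fields) == 1:
            index = len(transducers)
            if index >= len(DEFAULT_SYMBOLS):
                raise ValueError("no default symbol for line %d" % (index + 1))
            transducers[DEFAULT_SYMBOLS[index]] = fields[0]
        else:
            transducers[fields[0]] = fields[1]
    return transducers, strategies
```

Each strategy is returned as the list of symbols on its line, in the order written; the sketch does not interpret what a strategy does.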