Xerox Research Centre Europe 
Copyright © 1997 Xerox Corporation. All rights reserved. 

Lexical Lookup


Lexical lookup requires a morphological analyzer to associate each token with one or more readings. Unknown words are handled by a guesser which provides potential part-of-speech categories based on affix patterns.

Function call

The Lexical lookup tool, lookup, can be called in one of the following ways:

            lookup -h
            lookup -v
     .... | lookup lexicon_file [ options ] | ....
     .... | lookup -l language  [ options ] | ....
     .... | lookup -f lookup_script [ options ] | ....

Arguments

The arguments mean:
-h                Print the help message.
-v                Print the program version number.
lexicon_file      Explicitly names the file of a single lexical transducer.
-l language       Use the default lookup_script for the given language.
-f lookup_script  Use the given script, which contains lookup strategies and the file names of transducers (see below).

[options] may be one or more of the following:
[-a 1|2|d|b|D|B]      Algorithm choice. Default: 1.
                      1, d or D: depth-first lookup (recursive).
                      2, b or B: breadth-first lookup (incremental building and pruning of possible outputs from a cascade of transducers).
[-d]                  Immediate output (recommended for daemon usage, otherwise suboptimal).
[-o #]                A value, tied to the length of a token in symbols, that tunes the trade-off between storage and time in the breadth-first part. If lookup is slow, decrease it. Default: 8.
[-x]                  Xelda-style output.
[-flags flag_values]  Lookup flags are taken, in decreasing priority, from the -flags option, from the environment variable LOOKUP_FLAGS, or from the default (see below).

Flags

Flags can be set by the -flags option. If this option is not specified, flags will be set by the environment variable LOOKUP_FLAGS. You can set this variable by:
    setenv  LOOKUP_FLAGS  flag_values
If the variable LOOKUP_FLAGS does not exist in your environment, the default flag_values cKv29 are used. The flag_values string may contain the following characters; any other characters are ignored. Below, # denotes a number and $ a character.
The flags mean:
c      Print all comments (statistics etc.).
n      With [-a 1]: Create a result net for the lookup of every word (very time-consuming, but ensures that no result is printed more than once).
k#     With [-a 1]: When making a lookup on a net vector, intermediate results are checked immediately on a deterministic simple network corresponding to the input side of the following net (transducer), if this transducer has at least # states (e.g. k1000). The deterministic check net is generated at run time.
K      With [-a 1]: Like k#, except that the deterministic check net is taken from the same file as the transducer it belongs to. The order of networks in the file must be: 1. transducer, 2. check net.
v#     The input side of a transducer, when used to check intermediate results, is partially vectorized, i.e. states having at least # arcs get an arc vector (e.g. v20).
e#     With [-a 1]: Both approaches (with and without a check net) are tried and evaluated on # words each. The faster method is used for the remaining text.
m$     Multi-character symbols are allowed in the input if they occur in any one of the networks on the side specified by $:
           $ = i : INPUT side (usually the LOWER side)
           $ = u : UPPER side
           $ = l : LOWER side
           $ = b : BOTH sides
L...L  The string between the two L characters is used in the output as the separator between surface form and lemma. Default: a tab character.
T...T  The string between the two T characters is used in the output as the separator between lemma and tag. Default: a tab character.
x      Do not copy the input to the output (by default, it is copied).
I#     The maximum length of an input line is #. Default: 1000.
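
The priority rule and the flag characters described above can be illustrated with a small Python sketch. It is not part of lookup: the function names resolve_flags and parse_flags are hypothetical, and the exact grammar of flag_values (for instance how L...L interacts with other flag characters) is an assumption derived only from the descriptions in this section.

```python
import os
import re

DEFAULT_FLAGS = "cKv29"  # default flag_values stated in this section


def resolve_flags(cli_flags=None, environ=None):
    """Choose flag_values by decreasing priority:
    -flags option, then LOOKUP_FLAGS, then the default."""
    env = os.environ if environ is None else environ
    if cli_flags is not None:
        return cli_flags
    if "LOOKUP_FLAGS" in env:
        return env["LOOKUP_FLAGS"]
    return DEFAULT_FLAGS


# Assumed grammar of flag_values; characters matching nothing are skipped.
FLAG_PATTERN = re.compile(
    r"(?P<sep>[LT])(?P<text>.*?)(?P=sep)"  # L...L or T...T separator strings
    r"|m(?P<side>[iulb])"                  # m$ with $ one of i, u, l, b
    r"|(?P<num>[kveI])(?P<value>\d+)"      # k#, v#, e#, I#
    r"|(?P<plain>[cnKx])"                  # flags without an argument
)


def parse_flags(flag_values):
    """Return a dict mapping each recognized flag to its value."""
    flags = {}
    for m in FLAG_PATTERN.finditer(flag_values):
        if m.group("sep"):
            flags[m.group("sep")] = m.group("text")
        elif m.group("side"):
            flags["m"] = m.group("side")
        elif m.group("num"):
            flags[m.group("num")] = int(m.group("value"))
        else:
            flags[m.group("plain")] = True
    return flags


if __name__ == "__main__":
    print(parse_flags(resolve_flags()))
```

Passing the environment as a dictionary keeps the sketch easy to try without changing the real environment of the shell.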

Input

The tokenized input must contain one of the following items on every line:
(1) a word
(2) a multi-word expression (containing blanks but no tabs)
(3) an already tagged item of type (1) or (2), containing tabs (word TAB lemma TAB tag). Every input line that contains tabs is written to the output unmodified.

Output

The output will be of the following form:

	surface_form_1  TAB  lemma_1_1  TAB  tag_1_1
	surface_form_1  TAB  lemma_1_2  TAB  tag_1_2
	....
	surface_form_1  TAB  lemma_1_n  TAB  tag_1_n
    <empty line>
	surface_form_2  TAB  lemma_2_1  TAB  tag_2_1
	surface_form_2  TAB  lemma_2_2  TAB  tag_2_2
	....
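
A program that consumes this output could group the readings per token as in the following Python sketch. It assumes the default tab separators (flags L...L and T...T not set) and that one empty line separates the readings of consecutive tokens. The names Reading and parse_lookup_output, as well as the sample words and tags, are invented for illustration.

```python
from collections import namedtuple

# One output line: surface form, lemma and tag.
Reading = namedtuple("Reading", "surface lemma tag")


def parse_lookup_output(text):
    """Return one list of Reading tuples per token, in output order."""
    tokens, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the block of the current token.
            if current:
                tokens.append(current)
                current = []
            continue
        fields = line.split("\t", 2)
        fields += [""] * (3 - len(fields))  # pad lines with missing fields
        current.append(Reading(*fields))
    if current:
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    # Invented sample data in the documented layout.
    sample = "walks\twalk\t+Verb\nwalks\twalk\t+Noun\n\nthe\tthe\t+Det\n"
    for readings in parse_lookup_output(sample):
        print(readings[0].surface, [(r.lemma, r.tag) for r in readings])
```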

Tags

The tags in all lexica are expected to start with the character +.

Currently supported languages

English, French, German, Italian, Spanish.

Standard lookup script for a language (script name and path):

The name and path of a standard script for a language can be specified by environment variables, e.g.:
    setenv  LOOKUP_SCRIPT_BASE     /dir1/dir2/
    setenv  LOOKUP_SCRIPT_ENGLISH  english/english.lsc
    setenv  LOOKUP_SCRIPT_FRENCH   french/french.lsc
For English, this gives
    /dir1/dir2/english/english.lsc
and for French
    /dir1/dir2/french/french.lsc
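
The way the two variables combine can be sketched in Python as follows. The sketch assumes plain string concatenation of LOOKUP_SCRIPT_BASE and LOOKUP_SCRIPT_<LANGUAGE>, which is what the example above suggests (the base ends in a slash). The function name standard_script_path is hypothetical, and what lookup does when a variable is unset is not documented here, so the sketch simply raises KeyError.

```python
import os


def standard_script_path(language, environ=None):
    """Build the script path for a language from the environment variables."""
    env = os.environ if environ is None else environ
    base = env.get("LOOKUP_SCRIPT_BASE", "")          # e.g. /dir1/dir2/
    relative = env["LOOKUP_SCRIPT_" + language.upper()]
    return base + relative                             # assumed: concatenation


if __name__ == "__main__":
    example_env = {
        "LOOKUP_SCRIPT_BASE": "/dir1/dir2/",
        "LOOKUP_SCRIPT_ENGLISH": "english/english.lsc",
        "LOOKUP_SCRIPT_FRENCH": "french/french.lsc",
    }
    print(standard_script_path("english", example_env))
    print(standard_script_path("french", example_env))
```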

Lookup script (content):

A lookup script must be of the form:
    symbol_1  filename_1
    ...
    symbol_n  filename_n
  <empty line>
    symbol_a  symbol_b ...
    ...                       
For example:
    lexicon	/dir1/dir2/..../french.fst
    normalizer	/dir1/dir2/..../french.ntr
    guesser	/dir1/dir2/..../french.gtr
    defaulttags	/dir1/dir2/..../french.def

    lexicon
    normalizer  lexicon
    guesser
    normalizer  guesser
    defaulttags
The example means the following. First, all needed transducer files are listed and symbols (names) are assigned to them; each file contains at least one transducer, which may be followed by a check net (see flags k# and K). If no symbols are assigned, defaults are used (l for the lexicon on the 1st line, n for the normalizer on the 2nd line, g for the guesser on the 3rd line). Then all strategies for looking up a word are listed.
Here, the strategies are: (1) try the lexicon only, (2) try the normalizer first and then the lexicon, (3) try the guesser only, (4) try the normalizer first and then the guesser, and (5) append a set of default tags to an unrecognized word.
The strategies are applied sequentially, beginning with the first one, and the process stops as soon as a strategy gives a result.
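
The script layout and the sequential application of strategies can be sketched in Python as follows. This is only an illustration of the control flow, not the implementation used by lookup: the real transducers are finite-state networks loaded from files, whereas here they are replaced by toy Python functions. The names parse_lookup_script and lookup_word, the short file names and the tags are all hypothetical.

```python
def parse_lookup_script(text):
    """Split a script into ({symbol: filename}, [strategy, ...]).
    Assumes every file line carries an explicit symbol."""
    files, strategies = {}, []
    in_strategies = False
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts:
            in_strategies = True       # the empty line ends the file section
        elif in_strategies:
            strategies.append(parts)   # one strategy: a sequence of symbols
        else:
            symbol, filename = parts
            files[symbol] = filename
    return files, strategies


def lookup_word(word, strategies, transducers):
    """Apply the strategies in order; stop at the first one with a result."""
    for strategy in strategies:
        candidates = [word]
        for symbol in strategy:
            step = transducers[symbol]
            candidates = [out for c in candidates for out in step(c)]
            if not candidates:
                break                  # this strategy failed, try the next
        if candidates:
            return candidates
    return []


if __name__ == "__main__":
    script = (
        "lexicon french.fst\nnormalizer french.ntr\n"
        "guesser french.gtr\ndefaulttags french.def\n\n"
        "lexicon\nnormalizer lexicon\nguesser\n"
        "normalizer guesser\ndefaulttags\n"
    )
    files, strategies = parse_lookup_script(script)
    # Toy stand-ins for the transducers named in the script.
    toy = {
        "lexicon": lambda w: {"chat": ["chat\t+Noun"]}.get(w, []),
        "normalizer": lambda w: [w.lower()],
        "guesser": lambda w: [w + "\t+Guessed"] if w.isalpha() else [],
        "defaulttags": lambda w: [w + "\t+Unknown"],
    }
    for word in ["chat", "CHAT", "xyz", "42"]:
        print(word, lookup_word(word, strategies, toy))
```

In this toy setup, "CHAT" fails on the lexicon alone and succeeds with the second strategy, while "42" falls through to the default tags.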
Written by André Kempe. Last modified on Jan. 12, 1998.