Copyright © 1996, 1997 Xerox Corporation.
Part-of-Speech Disambiguation
The part-of-speech disambiguator, disamb, is based on
a Hidden Markov Model (HMM).
For efficiency, the most probable tag sequence is computed with the Viterbi algorithm.
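To illustrate the idea, here is a minimal Python sketch of Viterbi decoding over ambiguity classes. It is not the disamb implementation. The function name, the data layout (dictionaries keyed by tag and class) and the toy probabilities are assumptions made for this example. Only the three kinds of tables (initial, transition and class probabilities) come from this document.

```python
import math


def viterbi(classes, initial, transition, class_prob):
    """Return the most probable tag sequence for a sequence of ambiguity classes.

    classes    -- list of ambiguity classes, each a tuple of candidate tags
    initial    -- initial[tag] = P(tag at sequence start)
    transition -- transition[prev][tag] = P(tag | prev)
    class_prob -- class_prob[tag][cls] = P(cls | tag)
    (This layout is hypothetical; the real HMM data file format is not shown here.)
    """
    def logp(p):
        # Work in log space; a zero probability becomes -infinity.
        return math.log(p) if p > 0 else float("-inf")

    # best[tag] = (log-probability of the best path ending in tag, that path)
    first = classes[0]
    best = {
        t: (logp(initial.get(t, 0)) + logp(class_prob.get(t, {}).get(first, 0)), [t])
        for t in first
    }

    for cls in classes[1:]:
        new_best = {}
        for t in cls:
            emit = logp(class_prob.get(t, {}).get(cls, 0))
            # Keep only the best predecessor for each candidate tag.
            score, path = max(
                ((s + logp(transition.get(p, {}).get(t, 0)) + emit, pth)
                 for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            new_best[t] = (score, path + [t])
        best = new_best

    return max(best.values(), key=lambda item: item[0])[1]


# Toy model with made-up numbers: an unambiguous word followed by an ambiguous one.
classes = [("Det",), ("Noun", "Verb")]
initial = {"Det": 0.8, "Noun": 0.1, "Verb": 0.1}
transition = {"Det": {"Noun": 0.9, "Verb": 0.1}}
class_prob = {
    "Det": {("Det",): 1.0},
    "Noun": {("Noun", "Verb"): 0.5},
    "Verb": {("Noun", "Verb"): 0.5},
}
print(viterbi(classes, initial, transition, class_prob))  # ['Det', 'Noun']
```

Because only the best path into each candidate tag is kept at every position, the work grows linearly with the number of words instead of exponentially.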
Function call
The disambiguator disamb must be called as follows:
.... | disamb [-flag_values] hmm_data_file [recode_file] | ....
Arguments
The arguments mean:
| [-flag_values] | Disambiguation flags can be set (with decreasing priority) by the -flag_values option, by the environment variable DISAMB_FLAGS, or by default (see below). |
| hmm_data_file | This file contains HMM data for disambiguation of part-of-speech tags. It is produced by the XSoft training tool hmmtrain. The file contains the list of unambiguous tags, the list of ambiguity classes, and three tables of HMM probabilities (initial, transition, and class probabilities). |
| [recode_file] | The optional recode_file is used to modify the tagset of the HMM in the output. For example, if the user wants to collapse TAG1 and TAG2 into just TAG, the recode file will contain the two lines "TAG1 TAG" and "TAG2 TAG". |
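The effect of a recode file can be sketched in Python as a simple tag-to-tag mapping. This is only an illustration. The function names are hypothetical, and the exact file syntax (two whitespace-separated columns per line) is an assumption inferred from the TAG1/TAG2 example above.

```python
def load_recode(lines):
    """Build an old-tag -> new-tag mapping from recode lines.

    Assumes each useful line holds exactly two whitespace-separated
    fields; any other line is ignored in this sketch.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            old_tag, new_tag = fields
            mapping[old_tag] = new_tag
    return mapping


def recode(tag, mapping):
    """Return the recoded tag, or the tag unchanged if it has no entry."""
    return mapping.get(tag, tag)


mapping = load_recode(["TAG1 TAG", "TAG2 TAG"])
print(recode("TAG1", mapping))  # TAG
print(recode("TAG3", mapping))  # TAG3 (no entry, left as-is)
```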
Flags
Flags can be set by the -flag_values option.
This is a string starting with '-'.
If the option is not specified, the flags are taken from the environment
variable DISAMB_FLAGS. You can set this variable (in csh) with:
setenv DISAMB_FLAGS flag_values
If the variable DISAMB_FLAGS does not exist in your environment,
the default flag_values are used, namely b1+.
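The priority order described above (command-line option, then environment variable, then default) can be sketched in Python as follows. The function name and the argument handling are hypothetical. In particular, stripping the leading '-' from the option is an assumption based on the setenv example, where the value is given without it.

```python
import os

DEFAULT_FLAGS = "b1+"  # default stated in this document


def resolve_flags(args, environ=None):
    """Pick the flag string by decreasing priority.

    args    -- command-line arguments after the program name
    environ -- mapping of environment variables (defaults to os.environ)
    """
    environ = os.environ if environ is None else environ
    # 1. An explicit -flag_values option wins (assumed to be the first argument).
    if args and args[0].startswith("-"):
        return args[0][1:]
    # 2. Otherwise fall back to the DISAMB_FLAGS environment variable.
    if "DISAMB_FLAGS" in environ:
        return environ["DISAMB_FLAGS"]
    # 3. Otherwise use the built-in default.
    return DEFAULT_FLAGS


print(resolve_flags(["-cb1+", "hmm_data_file"], {"DISAMB_FLAGS": "bi1+"}))  # cb1+
print(resolve_flags(["hmm_data_file"], {"DISAMB_FLAGS": "bi1+"}))           # bi1+
print(resolve_flags(["hmm_data_file"], {}))                                 # b1+
```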
The flag_values string may contain the characters below
(additional ones will have no influence).
Some flags select only a section of a complex tag for
disambiguation,
e.g. only +Verb from +SG+P1P2+IndP+Verb.
The selected tag section must correspond to the tags in the
HMM data file.
Below, # denotes a number and $ a symbol.
The flags mean:
| c | Print all comments. |
| b#$ | Cut the tag at the #th $ going backward. E.g.: b1+ takes Verb from +SG+P1P2+IndP+Verb. |
| bi#$ | Cut the tag at the #th $ going backward and include the $ itself in the selected part. E.g.: bi1+ takes +Verb from +SG+P1P2+IndP+Verb. |
| f#$ | Cut the tag at the #th $ going forward. |
| fi#$ | Cut the tag at the #th $ going forward and include the $ itself in the selected part. |
| d | Every fprintf() and fwrite() will be followed by fflush(). |
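The backward cut can be sketched in Python, checked against the two examples given for b1+ and bi1+. The function name is hypothetical. Returning the whole tag when it contains fewer than # occurrences of the symbol is an assumption, since the behaviour of disamb in that case is not documented here. The forward variants (f, fi) are left out because this document gives no example of their result.

```python
def cut_backward(tag, n, symbol, include=False):
    """Select the part of `tag` after the n-th `symbol`, counted from the end.

    include=False corresponds to b#$, include=True to bi#$.
    """
    pos = len(tag)
    for _ in range(n):
        # Look for the previous occurrence strictly before the last one found.
        pos = tag.rfind(symbol, 0, pos)
        if pos < 0:
            return tag  # assumption: too few symbols, keep the whole tag
    return tag[pos:] if include else tag[pos + len(symbol):]


print(cut_backward("+SG+P1P2+IndP+Verb", 1, "+"))                # Verb   (b1+)
print(cut_backward("+SG+P1P2+IndP+Verb", 1, "+", include=True))  # +Verb  (bi1+)
```

Whatever section is selected this way must match the tags stored in the HMM data file, so the same flags have to be used consistently with the data the model was trained on.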
Input
surface_form_1 TAB lemma_1_1 TAB tag_1_1
surface_form_1 TAB lemma_1_2 TAB tag_1_2
....
surface_form_1 TAB lemma_1_n TAB tag_1_n
<empty line>
surface_form_2 TAB lemma_2_1 TAB tag_2_1
surface_form_2 TAB lemma_2_2 TAB tag_2_2
....
This format corresponds to the output of the lexicon lookup tool
lookup.
Both surface_form and lemma
may contain blanks but no tab characters.
For every surface form (input word or multi-word expression),
all lemmata are given with their tags.
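As an illustration of this format, the following Python sketch groups input lines into one block per surface form, using the empty line as the block separator. The function name and the sample data are made up for the example (the second block is purely hypothetical). A line that does not contain exactly three tab-separated fields raises a ValueError in this sketch.

```python
def read_blocks(lines):
    """Group lines into (surface_form, [(lemma, tag), ...]) blocks."""
    blocks = []
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            current = None  # an empty line closes the current block
            continue
        # Fields are separated by tabs, so blanks inside a field are preserved.
        surface, lemma, tag = line.split("\t")
        if current is None:
            current = (surface, [])
            blocks.append(current)
        current[1].append((lemma, tag))
    return blocks


sample = [
    "suis\tetre\t+Verb",
    "suis\tsuivre\t+Verb",
    "",
    "la\tle\t+Det",  # hypothetical reading, for illustration only
]
for surface, readings in read_blocks(sample):
    print(surface, readings)
```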
Output
surface_form_1 TAB lemma_1_a TAB tag_1_a
....
<empty line>
surface_form_2 TAB lemma_2_b TAB tag_2_b
....
For each surface form, the input lines carrying its most probable tag
(as determined by the HMM with the Viterbi algorithm) are printed to the output.
Note that if this tag occurs in more than one line (per
surface form), then all these lines are selected, because
disambiguation only concerns tags, not lemmata. E.g.:
suis etre +Verb
suis suivre +Verb
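The selection rule can be sketched in Python: every reading whose tag equals the chosen tag is kept, so several lemmata may survive. The function name is hypothetical, the third reading is invented for the example, and the sketch compares full tags directly, ignoring any tag cutting done by the flags.

```python
def select_readings(readings, best_tag):
    """Keep every (lemma, tag) reading that carries the winning tag."""
    return [(lemma, tag) for lemma, tag in readings if tag == best_tag]


readings = [
    ("etre", "+Verb"),
    ("suivre", "+Verb"),
    ("suis", "+Other"),  # hypothetical competing reading
]
print(select_readings(readings, "+Verb"))
# [('etre', '+Verb'), ('suivre', '+Verb')]
```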
Copyright © The Document Company - Rank Xerox 1996. All rights reserved.
Written by André Kempe. Last modified on Jan. 12, 1998.