Xerox Research Centre Europe
Copyright © 1996, 1997 Xerox Corporation.

Part-of-Speech Disambiguation


The part-of-speech disambiguator, disamb, is based on a Hidden Markov Model (HMM). For efficiency, the most probable tag sequence is computed with the Viterbi algorithm.
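As a rough illustration of the technique named above, the following Python sketch runs the Viterbi algorithm over a tiny HMM. It is not the disamb source code. The tag names, the ambiguity classes and all probabilities are invented for the example, and reading the class probabilities as P(class | tag) is an assumption.

```python
import math


def viterbi(observations, states, initial, transition, emission):
    """Return the most probable state (tag) sequence for the observations."""
    if not observations:
        return []

    def log(p):
        # Work in log space to avoid underflow; log(0) is treated as -infinity.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(initial.get(s, 0.0)) + log(emission[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(transition.get(p, {}).get(s, 0.0)))
                 for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emission[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model (all values hypothetical). Observations are ambiguity classes.
STATES = ["Det", "Noun", "Verb"]
INITIAL = {"Det": 0.6, "Noun": 0.3, "Verb": 0.1}
TRANSITION = {
    "Det": {"Det": 0.05, "Noun": 0.9, "Verb": 0.05},
    "Noun": {"Det": 0.1, "Noun": 0.3, "Verb": 0.6},
    "Verb": {"Det": 0.5, "Noun": 0.3, "Verb": 0.2},
}
CLASS_PROB = {
    "Det": {"Det": 1.0},
    "Noun": {"Noun": 0.5, "Noun|Verb": 0.5},
    "Verb": {"Verb": 0.5, "Noun|Verb": 0.5},
}

result = viterbi(["Det", "Noun|Verb"], STATES, INITIAL, TRANSITION, CLASS_PROB)
print(result)  # ['Det', 'Noun']
```

The second word is ambiguous between Noun and Verb; the toy transition table makes Noun far more likely after Det, so that reading is chosen.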

Function call

The disambiguator disamb must be called as follows:
    .... | disamb [-flag_values] hmm_data_file [recode_file] | ....

Arguments

The arguments mean:
[-flag_values] Disambiguation flags are taken, in decreasing order of priority, from the -flag_values option, from the environment variable DISAMB_FLAGS, or from the built-in default (see below).
hmm_data_file This file contains HMM data for disambiguation of part-of-speech tags. It is produced by the XSoft training tool hmmtrain. The file contains the list of unambiguous tags, the list of ambiguity classes and three tables of HMM probabilities (initial, transition, and class probabilities).
[recode_file] The optional recode_file is used to modify the tagset of the HMM in the output. For example, if the user wants to collapse TAG1 and TAG2 into just TAG, the recode file will contain the lines
                TAG1    TAG
                TAG2    TAG
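
The effect of such a recode file can be pictured with a small Python sketch. It assumes that the two columns are separated by whitespace and that tags not listed in the file are passed through unchanged; neither detail is stated here. The function names are hypothetical and are not part of disamb.

```python
def load_recode(text):
    """Build an old-tag -> new-tag mapping from recode-file text."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            mapping[fields[0]] = fields[1]
    return mapping


def recode(tag, mapping):
    """Return the recoded tag, or the tag itself if it is not listed (assumption)."""
    return mapping.get(tag, tag)


RECODE_TEXT = "TAG1    TAG\nTAG2    TAG\n"
mapping = load_recode(RECODE_TEXT)
print(recode("TAG1", mapping), recode("TAG2", mapping), recode("TAG3", mapping))
```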
        

Flags

Flags can be set by the -flag_values option. This is a string starting with '-'. If the option is not specified, the flags will be set by the environment variable DISAMB_FLAGS. You can set this variable by:
    setenv  DISAMB_FLAGS  flag_values
If the variable DISAMB_FLAGS does not exist in your environment, the default flag_values are used, which are b1+.
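
The priority order (option, then environment variable, then default) can be sketched in Python as follows. This only illustrates the lookup order; resolve_flags is a hypothetical helper, and stripping exactly one leading '-' from the option is an assumption about how the option string is read.

```python
import os

DEFAULT_FLAGS = "b1+"  # default flag_values named in this document


def resolve_flags(option=None, environ=None):
    """Return flag_values from the option, else DISAMB_FLAGS, else the default."""
    if environ is None:
        environ = os.environ
    if option is not None and option.startswith("-"):
        return option[1:]  # assumption: flag_values follow the leading '-'
    if "DISAMB_FLAGS" in environ:
        return environ["DISAMB_FLAGS"]
    return DEFAULT_FLAGS


print(resolve_flags("-bi1+", {"DISAMB_FLAGS": "b1+"}))  # option wins
print(resolve_flags(None, {"DISAMB_FLAGS": "bi1+"}))    # environment variable
print(resolve_flags(None, {}))                          # default
```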
The flag_values string may contain the characters below (additional ones will have no influence).
Some flags select only a section of a complex tag for disambiguation, e.g. only +Verb from +SG+P1P2+IndP+Verb. The selected tag section must correspond to the tags in the HMM data file.
Below, # denotes a number and $ a symbol. The flags mean:
c Print all comments
b#$ Cut the tag on the #th $ going backward. E.g.: b1+ takes Verb from +SG+P1P2+IndP+Verb
bi#$ Cut the tag on the #th $ going backward and include the $ itself in the selected part. E.g.: bi1+ takes +Verb from +SG+P1P2+IndP+Verb
f#$ Cut the tag on the #th $ going forward.
fi#$ Cut the tag on the #th $ going forward and include the $ itself in the selected part.
d Every fprintf() and fwrite() will be followed by fflush()
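
The two backward-cutting flags can be reproduced with a short Python sketch that matches the b1+ and bi1+ examples above. It is an illustration, not the disamb implementation. The behaviour for # greater than 1 and for tags containing fewer than # symbols is assumed. The forward flags f#$ and fi#$ are left out because no example is given for them here.

```python
def cut_backward(tag, count, symbol, include_symbol=False):
    """Select the part of `tag` after the `count`-th `symbol` counted from the end."""
    if count < 1:
        raise ValueError("count must be at least 1")
    parts = tag.rsplit(symbol, count)
    if len(parts) <= count:
        # Assumption: a tag with fewer than `count` symbols is returned unchanged.
        return tag
    selected = symbol.join(parts[1:])
    return symbol + selected if include_symbol else selected


TAG = "+SG+P1P2+IndP+Verb"
print(cut_backward(TAG, 1, "+"))                       # like b1+  -> Verb
print(cut_backward(TAG, 1, "+", include_symbol=True))  # like bi1+ -> +Verb
```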

Input


	surface_form_1  TAB  lemma_1_1  TAB  tag_1_1
	surface_form_1  TAB  lemma_1_2  TAB  tag_1_2
	....
	surface_form_1  TAB  lemma_1_n  TAB  tag_1_n
    <empty line>
	surface_form_2  TAB  lemma_2_1  TAB  tag_2_1
	surface_form_2  TAB  lemma_2_2  TAB  tag_2_2
	....
This format corresponds to the output of the lexicon lookup tool lookup. Both surface_form and lemma may contain blanks but no tabs. For every surface form (an input word or multi-word expression), all lemmata are given with their tags.
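
A reader for this input format might look like the Python sketch below. It assumes exactly three TAB-separated fields per line and treats any blank line as the separator between surface forms. parse_lookup_output is a hypothetical name, and the second block of the sample is invented for illustration.

```python
def parse_lookup_output(text):
    """Group lines into one list of (surface_form, lemma, tag) per surface form."""
    blocks, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the readings of the current surface form.
            if current:
                blocks.append(current)
                current = []
            continue
        fields = line.split("\t")  # fields may contain blanks, so split on TAB only
        if len(fields) != 3:
            raise ValueError("expected 3 TAB-separated fields: %r" % line)
        current.append(tuple(fields))
    if current:
        blocks.append(current)
    return blocks


SAMPLE = (
    "suis\tetre\t+Verb\n"
    "suis\tsuivre\t+Verb\n"
    "\n"
    "multi word form\tsome lemma\t+Tag\n"  # hypothetical entry
)
blocks = parse_lookup_output(SAMPLE)
print(len(blocks), blocks[0])
```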

Output


	surface_form_1  TAB  lemma_1_a  TAB  tag_1_a
	....
    <empty line>
	surface_form_2  TAB  lemma_2_b  TAB  tag_2_b
	....
For each surface form, the input lines that carry its most probable tag (as determined by the HMM with the Viterbi algorithm) are printed to the output. Note that if this tag occurs in more than one line for a surface form, then all these lines are selected, because disambiguation concerns only tags, not lemmata. E.g.:
	suis	etre	+Verb
	suis	suivre	+Verb
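
The selection rule can be sketched in Python as below, assuming the winning tag for the surface form has already been chosen and that tags are compared as plain strings. select_readings is a hypothetical helper, and the third reading in the sample is invented to show a line being dropped.

```python
def select_readings(readings, best_tag):
    """Keep every (surface_form, lemma, tag) line whose tag equals best_tag."""
    return [reading for reading in readings if reading[2] == best_tag]


readings = [
    ("suis", "etre", "+Verb"),
    ("suis", "suivre", "+Verb"),
    ("suis", "some_lemma", "+OtherTag"),  # hypothetical competing reading
]
kept = select_readings(readings, "+Verb")
for surface_form, lemma, tag in kept:
    print("\t".join((surface_form, lemma, tag)))
```

Both +Verb lines survive, since only the tag is disambiguated and the lemmata etre and suivre are left side by side.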

We welcome your comments and suggestions.
Copyright © The Document Company - Rank Xerox 1996. All rights reserved.
Written by André Kempe. Last modified on Jan. 12, 1998 .