To become familiar with the capabilities LinguistX offers without doing any programming, you can use the various demo programs that come with the LXRT package. These demo programs come with source code, and serve as sample programs for the LinguistX platform.
These programs operate on text files from the command line, and most of them work together in series. For example, the tokenize program takes plaintext or HTML input, and outputs one token record per line in a plaintext tabular format. After filtering this through the field program, the tokens are provided to tagsent, which accepts tokenized input as a text stream, with one token per line. Tagsent outputs token/tag pairs in the same one-per-line format, to match the tokens it receives. The output of tagsent may be given directly to npr, which can output one noun phrase per line, or annotate the input it receives with brackets around noun phrases.
The apply program operates interactively, separate from the others.
Following is a complete description of the included demo programs.
Using the Apply Program
The apply program provides a demonstration of morphological analysis, generation, stemming, inflection, and tokenization. The intelligence and content are contained in the language module; apply provides an interactive interface to that content. It works with the following kinds of language modules: analyzers (extensions .ia and .da), stemmers (extensions .is and .ds), and tokenizers (extension .tok).
Analyzer language modules can perform morphological analysis and generation. Analysis refers to finding the stem and features of a given word; generation refers to finding the right spelling for a word, given a stem and features.
Stemmer language modules can perform stemming and inflection. Stemming refers to finding the inflectional stem (.is) or derivational root (.ds) of a given word; inflection refers to finding all possible words related to a given inflectional stem (.is) or derivational root (.ds).
Tokenizer language modules perform tokenization. This option of apply is limited in that HTML input cannot be tokenized. Please use the tokenize program to tokenize HTML text.
To start apply, which is an interactive command-line program, type:
apply language-module
Example:
apply english.ia
Apply loads the given language module and, in the example above, switches to analyze mode, because the module is an analyzer. Apply gives the following prompt:
Analyze:
At this point you may type in a word, and apply will print an analysis. For example:
Analyze: run
run -> run[Verb][PastPerf][123SP]
       run[Verb][Pres][Non3sg]
       run[Noun][Sg]
Analyze:
With analyzers, generation is also possible. To switch to generation mode, type -g:
Analyze: -g
Generate:
Now you may type a given baseform, and apply will print the word to which the baseform corresponds:
Generate: run[Verb][PastPerf][123SP]
run[Verb][PastPerf][123SP] -> run
Generate:
To return to analyze mode, type -a. Stemmer language modules provide stemming mode (-s) and inflection mode (-i). Tokenizer language modules provide tokenization mode (-t), which accepts a line of input instead of a word.
Additional options are: -f, which reads words from a file containing one word per line, and analyzes them; -d, which allows loading of a different language module; and -q, which quits the program.
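The analysis strings that apply prints, and accepts in generation mode, combine a stem with bracketed features. The following is a minimal Python sketch for taking such a string apart and putting it back together. It is not part of the LXRT package; the function names are hypothetical, and the string shape is assumed from the example session above.

```python
import re
from typing import List, Tuple

# Assumed shape, taken from the example session: a stem followed by
# zero or more bracketed features, e.g. "run[Verb][Pres][Non3sg]".
_FEATURE = re.compile(r"\[([^\[\]]*)\]")


def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'stem[Feat1][Feat2]' into ('stem', ['Feat1', 'Feat2'])."""
    bracket = analysis.find("[")
    if bracket == -1:
        # No features at all: the whole string is the stem.
        return analysis, []
    return analysis[:bracket], _FEATURE.findall(analysis[bracket:])


def join_analysis(stem: str, features: List[str]) -> str:
    """Rebuild the string form, e.g. for typing at the Generate: prompt."""
    return stem + "".join(f"[{feature}]" for feature in features)


if __name__ == "__main__":
    stem, features = split_analysis("run[Verb][PastPerf][123SP]")
    print(stem, features)
    print(join_analysis(stem, features))
```

Keeping the features as a list preserves their order, so a string that is split and rejoined comes back unchanged.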
For more information on stemming, please see the section on that subject.
Using the Tokenize Program
The program tokenize accepts plaintext or HTML input, and prints one token per line, with additional information. The input comes from the standard input, which is by default set to the terminal keyboard. Output is sent to standard output, which is by default the terminal screen. Redirection and pipe operators can be used in most command-line shells to use tokenize with files.
To run tokenize, the following command is used:
tokenize language (options)
The language given may be any language for which the file language.tok exists. For example, english may be specified. Options are -html, which directs tokenize to tokenize HTML text (plain text is assumed otherwise), and -time repeats, which generates no output and repeats the operation the given number of times, for benchmark purposes.
The output of tokenize is one token per line, plus offset and attribute information. For example, given the text:
The President considered the impact of foreign trade policy on
American businesses.
the tokenize program prints:
The         0   3   an
President   4   9   an
considered  14  10  an
the         25  3   an
impact      29  6   an
of          36  2   an
foreign     39  7   an
trade       47  5   an
policy      53  6   an
on          60  2   an
            62  1   ws
American    63  8   an
businesses  72  10  an
.           82  1   sent,post,punct
The first column contains the token; the second and third columns contain the token's character offset and its length; and the fourth column lists the attributes of the token: an means alphanumeric; ws means white space (newline tokens are included in the token stream); sent marks a sentence-ending token; punct means punctuation; and post identifies closing punctuation.
Some tokens, such as hyphenated words, are given no attributes by the tokenizer; this is printed by tokenize as --- in the attribute column.
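A program that consumes tokenize output needs to split each line back into its columns. The following is a minimal Python sketch, not part of the LXRT package. It assumes the columns are separated by whitespace, that a white-space token shows up as a line with an empty token column, and that attributes are comma-separated; the names TokenRecord and parse_token_line are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class TokenRecord(NamedTuple):
    text: str                    # the token itself ("" for a white-space token)
    offset: int                  # character offset of the token in the input
    length: int                  # length of the token in characters
    attributes: FrozenSet[str]   # e.g. {"an"} or {"sent", "post", "punct"}


def parse_token_line(line: str) -> TokenRecord:
    """Parse one line of tokenize output into a TokenRecord."""
    # Split from the right so the last three columns are always
    # offset, length, and attributes.
    parts = line.rstrip("\n").rsplit(None, 3)
    if len(parts) == 3:
        # Assumption: the token column is empty for a white-space token.
        parts.insert(0, "")
    if len(parts) != 4:
        raise ValueError(f"unexpected tokenize line: {line!r}")
    text, offset, length, attrs = parts
    # "---" in the attribute column means the token has no attributes.
    attributes = frozenset() if attrs == "---" else frozenset(attrs.split(","))
    return TokenRecord(text, int(offset), int(length), attributes)


if __name__ == "__main__":
    for sample in ["considered 14 10 an", ". 82 1 sent,post,punct"]:
        print(parse_token_line(sample))
```

Splitting from the right keeps the parser independent of how wide the token column is.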
For more information about how the tokenizer works, see the section on tokenization.
Using the Tagsent Program
The tagsent program accepts one token per line on standard input, and prints out tags on standard output. Redirection and pipe operators can be used in most command-line shells to use tagsent with files.
To run tagsent, the following command is used:
tagsent language
The language given may be any language for which the files language.clg and language.hmm exist. For example, english may be specified.
The output of tagsent is one token per line, plus tag information. For example, given the input:
The
President
considered
the
impact
of
trade
policy
on

American
businesses
.
the tagger outputs the following:
Det-Def The
Prop-Title President
V-Past considered
Det-Def the
Nn-Sg impact
Prep-of of
Nn-Sg trade
Nn-Sg policy
Prep on

Adj American
Nn-Pl businesses
Punct-Sent .
The blank line represents a newline token; the tagger just ignores it. The tags should be vaguely familiar as specifying parts of speech; for complete information on the tag sets for various languages, please see the section on tagging.
Tagsent accepts the option -mwt, which causes the tagger to find multiword tokens. When the tagger identifies a multiword token, it tags all but the last word with the special tag __mwt. The last word in the multiword token is given the tag for the entire phrase.
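A consumer of -mwt output may want each multiword token as a single unit. The following is a minimal Python sketch, not part of the LXRT package. It assumes each non-blank line has the form "tag token", as in the tagsent example output, and that joining the words with a single space is acceptable; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

MWT_TAG = "__mwt"  # special tag given to all but the last word


def merge_multiword_tokens(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (tag, token) pairs with multiword tokens joined into one token."""
    merged: List[Tuple[str, str]] = []
    pending: List[str] = []  # words of the multiword token collected so far
    for line in lines:
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # blank line (newline token) or malformed line
        tag, token = fields[0], fields[1].strip()
        pending.append(token)
        if tag == MWT_TAG:
            continue  # more words of this multiword token follow
        # The last word carries the tag for the entire phrase.
        merged.append((tag, " ".join(pending)))
        pending = []
    return merged


if __name__ == "__main__":
    # Hypothetical input: whether the tagger treats "trade policy" as a
    # multiword token is an assumption made only for illustration.
    sample = ["Det-Def the", "__mwt trade", "Nn-Sg policy", "", "Prep on"]
    print(merge_multiword_tokens(sample))
```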
To use tagsent with tokenize, the extra information provided by tokenize must be removed from its output. Use the field program to do this:
tokenize english | field -1 | tagsent english
The field program removes all but the specified column (in this case, the first column) from its output.
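The same chain can also be driven from a script instead of a shell. The following is a minimal Python sketch, not part of the LXRT package. It assumes the demo programs are on the PATH; run_pipeline and TAGGING_PIPELINE are hypothetical names. For simplicity it runs the stages one after another, passing each stage's full output to the next, instead of streaming through real pipes.

```python
import subprocess
from typing import Sequence


def run_pipeline(commands: Sequence[Sequence[str]], text: str) -> str:
    """Feed text to the first command and hand each output to the next."""
    data = text
    for command in commands:
        result = subprocess.run(
            list(command),
            input=data,           # becomes the stage's standard input
            capture_output=True,  # collect standard output and standard error
            text=True,            # work with str instead of bytes
            check=True,           # raise CalledProcessError if a stage fails
        )
        data = result.stdout
    return data


# The command line from the text, one stage per list entry.
TAGGING_PIPELINE = [
    ["tokenize", "english"],
    ["field", "-1"],
    ["tagsent", "english"],
]

# Usage (requires the LXRT demo programs to be installed):
# tagged = run_pipeline(TAGGING_PIPELINE, "The President considered the impact.\n")
```

Running the stages in sequence keeps the sketch short; for large inputs, connecting the processes with real pipes would avoid holding each intermediate result in memory.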
Using the Npr Program
The npr program finds noun phrases in the given tagged text. Npr accepts one token per line, with tag information, as output by tagsent. The input comes from the standard input, which is by default set to the terminal keyboard. Output is sent to standard output, which is by default the terminal screen. Redirection and pipe operators can be used in most command-line shells to use npr with files and the other programs tokenize, field, and tagsent.
To run npr, the following command is used:
npr language (options)
The language given may be any language for which the file language.npr exists. For example, english may be specified.
The output of npr is phrase information, one phrase per line. Each line gives the token offset of the phrase, the type of the phrase (always np), and the text of the phrase. For example, given the output from tagsent shown in the example for that program, npr prints the following:
1 np President
4 np impact of trade policy
10 np American businesses
Options are: -bracket, which instead of outputting phrases, outputs the tagged text, with brackets around noun phrases; -subgroups, which finds noun phrases inside larger noun phrases, instead of showing only the maximal-length noun phrase; and -sort, which sorts the returned phrases by their position in the sentence.
For more information about how the noun-phrase recognizer works, please see the section on noun phrase recognizers.
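The default one-phrase-per-line output of npr is easy to read back into a program. The following is a minimal Python sketch, not part of the LXRT package. It assumes each line has the form "offset type text" separated by whitespace, and it does not handle the -bracket output; the names Phrase and parse_phrases are hypothetical.

```python
from typing import List, NamedTuple


class Phrase(NamedTuple):
    token_offset: int  # position of the phrase's first token in the token stream
    kind: str          # phrase type; the text says this is always "np"
    text: str          # the words of the phrase


def parse_phrase_line(line: str) -> Phrase:
    """Parse one line of npr output such as '4 np impact of trade policy'."""
    # Split at most twice so the phrase text keeps its internal spaces.
    offset, kind, text = line.strip().split(None, 2)
    return Phrase(int(offset), kind, text)


def parse_phrases(output: str) -> List[Phrase]:
    """Parse all non-blank lines of npr output."""
    return [parse_phrase_line(line) for line in output.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "1 np President\n4 np impact of trade policy\n10 np American businesses\n"
    for phrase in parse_phrases(sample):
        print(phrase)
```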