Describes the stemming services of the LinguistX Platform library, version 2.2.
Stemming enables more efficient full-text databases and more accurate text analysis. By mapping written words to their "stems", LinguistX lexicons enhance an application's ability to focus on word meanings instead of word spellings. The lexicons understand New World/Old World spelling variations (such as American English vs. British English), inflectional variations, and word derivation, including compound word derivation.
Word analysis provides complete morphological information for a given word, and also allows a word to be generated from its stem and morphological properties.
A full-text information retrieval application keeps a repository of documents indexed by word. Users query the repository to retrieve documents based on the words the documents contain. For example, a query can ask for all the documents containing the word "ground".
Plugging LinguistX stemming into a full-text retrieval system usually means indexing the repository by word stem, and stemming the query words before consulting the index. LinguistX stemmers may return more than one stem for a given word, and all of those stems must be stored in the index for maximum recall and precision.
Consider the word "ground". The stems for this word are "ground" (the adjective, noun, and verb) and "grind" (the verb). Suppose "ground" appears in a document, and we ask the stem-based repository for documents containing "grounds" (which is a verb and a noun). With a query of "grounds", we would naturally want to find occurrences in the repository of "ground" or "grounds".
If we stored only one stem for "ground", however, and picked "grind", the "grounds" query would miss the document, because "grounds" stems to "ground" and not to "grind". If we instead stored only the stem "ground", a query for "grind" would miss the document. Therefore we must store all stems in the stem-based index.
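The reasoning above can be illustrated with a minimal sketch of a stem-based index. It does not use the LinguistX API: the STEMS table is a hand-written stand-in for a stemmer, and the names stems_of, build_index, and search are hypothetical, chosen only for this example.

```python
from collections import defaultdict

# Hand-written stand-in for a stemmer (an assumption for illustration,
# not LinguistX output). A word may map to more than one stem.
STEMS = {
    "ground": ["ground", "grind"],
    "grounds": ["ground"],
    "grind": ["grind"],
    "grinds": ["grind"],
}


def stems_of(word):
    """Return every stem of a word; unknown words stem to themselves."""
    key = word.lower()
    return STEMS.get(key, [key])


def build_index(documents):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            # Store ALL stems of the word, not just one of them.
            for stem in stems_of(word):
                index[stem].add(doc_id)
    return index


def search(index, query_word):
    """Stem the query word and collect the documents listed under any of its stems."""
    hits = set()
    for stem in stems_of(query_word):
        hits |= index.get(stem, set())
    return hits


documents = {
    1: "the coffee was ground daily",
    2: "the pilot walked away",
}
index = build_index(documents)
```

Because "ground" is stored under both "ground" and "grind", document 1 is found by the query "grounds" as well as by the query "grind". Storing only one of the two stems would lose one of those matches.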
Many of the LinguistX languages handle compound word derivation, which forms new words by concatenating existing ones. For example, the German word "Bildungsroman" is formed from the two words "Bildung" and "Roman", and is mapped by the LinguistX stemmer to "bildung#roman". Compound element boundaries are denoted by the hash mark (#). Full-text retrieval applications can choose to index each compound element as a separate term, for maximum recall, or to index the entire baseform "bildung#roman" as a single term, for maximum precision.
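The two indexing choices for compounds can be sketched as follows. The function index_terms and its split_compounds parameter are hypothetical names for this illustration and are not part of the LinguistX API; the only thing taken from the text above is that a hash mark separates the compound elements of a baseform.

```python
def index_terms(baseform, split_compounds=True):
    """Return the index terms for one stemmer baseform.

    split_compounds=True  -> one term per compound element (favors recall)
    split_compounds=False -> the whole baseform as a single term (favors precision)
    """
    if split_compounds:
        # "#" marks compound element boundaries; skip any empty pieces.
        return [part for part in baseform.split("#") if part]
    return [baseform]


recall_terms = index_terms("bildung#roman")
precision_terms = index_terms("bildung#roman", split_compounds=False)
```

With splitting enabled, a query for "Roman" can match a document containing "Bildungsroman". With splitting disabled, only queries that stem to the full baseform match it.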