Describes the tokenization services of the LinguistX Platform library, version 2.2.
The text tokenizer breaks a document into words and punctuation tokens, and categorizes each token. Because every language has different rules for breaking up text, a different tokenizer language module is used for each language. Language modules are provided for every supported language except Japanese. The tokenizer is for use on plain-text and HTML (Hypertext Markup Language) documents. It can handle regular text, as well as uppercase text, with or without diacritics.
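Tokenization also provides the location and length of each token in the original input, along with an attribute for each token. For example, the sentence "John's not the lettuce-eating American people take him for." is tokenized as follows: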
| Token | Category |
| --- | --- |
| John | alphanumeric |
| 's | none |
| not | alphanumeric |
| the | alphanumeric |
| lettuce-eating | none |
| American | alphanumeric |
| people | alphanumeric |
| take | alphanumeric |
| him | alphanumeric |
| for | alphanumeric |
| . | sentence-ending punctuation |
Note that "John's" is separated into two syntactic elements (a proper noun and a verb), and that "lettuce-eating" is kept together as one token, because the whole expression functions as an adjective. Note also that punctuation is included in the token list. The tokenizer's behavior is best suited to syntactic analysis; indexing applications have to decide for themselves how to handle hyphenations.
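The behavior described above can be approximated with a short sketch. This is not the LinguistX API: the `Token` type, the `tokenize` and `categorize` functions, and the regular expression are all hypothetical stand-ins for an English language module, written only to show how tokens, categories, offsets, and lengths relate to the input.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    offset: int    # position of the token in the original input
    length: int    # number of characters the token spans
    category: str


# Hypothetical English rules: a clitic 's first, then plain or hyphenated
# words, then any single non-space character as punctuation.
TOKEN_RE = re.compile(r"'s\b|\w+(?:-\w+)*|[^\w\s]")


def categorize(text: str) -> str:
    """Assign a category name modeled on the ones used in the table."""
    if text.isalnum():
        return "alphanumeric"
    if text in {".", "!", "?"}:
        return "sentence-ending punctuation"
    if text in {",", ";", ":"}:
        return "post-punctuation"
    return "none"


def tokenize(document: str) -> List[Token]:
    """Split the input into tokens, recording where each one was found."""
    return [
        Token(m.group(), m.start(), len(m.group()), categorize(m.group()))
        for m in TOKEN_RE.finditer(document)
    ]


for token in tokenize("John's not the lettuce-eating American people take him for."):
    print(token)
```

Because each token carries its offset and length, a caller can always recover the exact source span with `document[offset:offset + length]`.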
Breaking up all hyphenations, however, does not result in the best retrieval precision, because some hyphenations have a meaning different from the sum of their parts.
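One possible policy for an indexing application is to index a hyphenated token both as a whole and by its parts. The sketch below is an illustration of that policy only; `index_terms` is a hypothetical helper, not part of the library.

```python
from typing import List


def index_terms(token: str) -> List[str]:
    """Return the terms to index for one token.

    A hyphenated token is indexed whole and also by its parts, so a query
    for the compound or for any single component can match.
    """
    terms = [token.lower()]
    if "-" in token:
        # Skip empty parts produced by leading, trailing, or doubled hyphens.
        terms.extend(part.lower() for part in token.split("-") if part)
    return terms


print(index_terms("lettuce-eating"))
print(index_terms("American"))
```

Keeping the whole token in the index lets a search for the compound match exactly, while the component terms preserve recall for queries on the individual words.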
Abbreviations are treated as single tokens, including their period, except when they appear at the end of a sentence, in which case the period is separated from the abbreviation. For example, in the following sentence:
The jurisdiction of her several courts, general and local, of law, of equity, of admiralty, etc., is not less a source of frequent and intricate discussions, etc.
The tokenizer finds the 35 tokens listed below.
| Token | Category |
| --- | --- |
| The | alphanumeric |
| jurisdiction | alphanumeric |
| of | alphanumeric |
| her | alphanumeric |
| several | alphanumeric |
| courts | alphanumeric |
| , | post-punctuation |
| general | alphanumeric |
| and | alphanumeric |
| local | alphanumeric |
| , | post-punctuation |
| of | alphanumeric |
| law | alphanumeric |
| , | post-punctuation |
| of | alphanumeric |
| equity | alphanumeric |
| , | post-punctuation |
| of | alphanumeric |
| admiralty | alphanumeric |
| , | post-punctuation |
| etc. | none |
| , | post-punctuation |
| is | alphanumeric |
| not | alphanumeric |
| less | alphanumeric |
| a | alphanumeric |
| source | alphanumeric |
| of | alphanumeric |
| frequent | alphanumeric |
| and | alphanumeric |
| intricate | alphanumeric |
| discussions | alphanumeric |
| , | post-punctuation |
| etc | alphanumeric |
| . | sentence-ending punctuation |
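The abbreviation rule illustrated by this table can be sketched as follows. The abbreviation list, the regular expression, and the function name are assumptions made for the example, and the sketch treats its input as a single sentence; an actual language module would use its own abbreviation data and sentence-boundary detection.

```python
import re
from typing import List

# Hypothetical abbreviation list (lowercase); a real module supplies its own.
ABBREVIATIONS = {"etc.", "mr.", "dr."}

# A word with an optional trailing period, or a single punctuation mark.
PIECE_RE = re.compile(r"\w+(?:-\w+)*\.?|[^\w\s]")


def tokenize_with_abbreviations(sentence: str) -> List[str]:
    """Keep an abbreviation's period attached unless it ends the sentence."""
    tokens: List[str] = []
    end = len(sentence.rstrip())
    for m in PIECE_RE.finditer(sentence):
        piece = m.group()
        if len(piece) > 1 and piece.endswith("."):
            at_sentence_end = m.end() == end
            if piece.lower() in ABBREVIATIONS and not at_sentence_end:
                tokens.append(piece)              # keep "etc." whole
            else:
                tokens.extend([piece[:-1], "."])  # split off the period
        else:
            tokens.append(piece)
    return tokens


sentence = (
    "The jurisdiction of her several courts, general and local, of law, "
    "of equity, of admiralty, etc., is not less a source of frequent and "
    "intricate discussions, etc."
)
tokens = tokenize_with_abbreviations(sentence)
print(len(tokens), tokens)
```

In this sketch the first "etc." stays a single token because it is followed by more text, while the final one is split into "etc" and a separate period, matching the two cases in the table.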