Tokenization

Describes the tokenization services of the LinguistX Platform library, version 2.2.

Contents

  1. Introduction
  2. Features
  3. Words vs. Syntactic Elements
  4. Handling Hyphenated Words in Indexing Applications
  5. Abbreviations
  6. Rules for Specific Languages

Introduction

The text tokenizer breaks a document into words and punctuation tokens, and categorizes each token. Because every language has different rules for breaking up text, a different tokenizer language module is used for each language. Language modules are provided for every supported language except Japanese. The tokenizer is for use on plain-text and HTML (Hypertext Markup Language) documents. It can handle regular text, as well as uppercase text, with or without diacritics.

Tokenization also provides the location and length of each token in the original input, along with an attribute for each token.
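As a rough illustration of what a location and a length per token make possible, the sketch below models a token as a small record and recovers its surface form by slicing the original input. The Token class, its field names, and the slice_source helper are hypothetical and are not part of the LinguistX API; the category strings are taken from the tables in this document.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """Hypothetical token record; not the library's actual structure."""
    text: str      # surface form of the token
    offset: int    # index of the token's first character in the input
    length: int    # number of characters the token covers
    category: str  # e.g. "alphanumeric", "none"

def slice_source(source: str, token: Token) -> str:
    """Recover the token's surface form from the original input."""
    return source[token.offset:token.offset + token.length]

source = "John's not"
tokens = [
    Token("John", 0, 4, "alphanumeric"),
    Token("'s", 4, 2, "none"),
    Token("not", 7, 3, "alphanumeric"),
]

# Every token can be mapped back to its exact span in the input.
assert all(slice_source(source, t) == t.text for t in tokens)
```

Keeping offsets relative to the original input, rather than to some normalized copy, is what would let an application highlight or index the exact span a token came from.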

Tokenizer Features

Words vs. Syntactic Elements

To be useful for both syntactic analysis and indexing, the tokenizers break up text into syntactic elements, which are usually words. Some words, however, contain more than one syntactic element, and some syntactic elements are made up of more than one word. For example, in the following sentence:

John's not the lettuce-eating American people take him for.

the tokenizer finds the 11 tokens listed below.

Token            Category
John             alphanumeric
's               none
not              alphanumeric
the              alphanumeric
lettuce-eating   none
American         alphanumeric
people           alphanumeric
take             alphanumeric
him              alphanumeric
for              alphanumeric
.                sentence-ending punctuation

Note that John's is separated into two syntactic elements (a proper noun and a verb), and that lettuce-eating is kept together as one token, because the entirety functions as an adjective. Note also that punctuation is included in the token list. The tokenizer's actions are most suited for syntactic analysis; indexing applications have to decide what to do about hyphenations.
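The behavior described above can be approximated for this one English sentence with a toy regular-expression tokenizer. This is only a sketch under stated assumptions, not the LinguistX implementation: the pattern, the function names, and the rule that any token that is not purely alphanumeric and not sentence-ending punctuation gets the category "none" are all illustrative guesses based on the table.

```python
import re

# Toy pattern (assumption): hyphenated word runs stay together, the clitic 's
# is split off, and any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|'s\b|[^\sA-Za-z0-9]")

# Simplification: every ".", "!" or "?" is treated as sentence-ending.
SENTENCE_END = {".", "!", "?"}

def categorize(text):
    """Assign one of the category names used in the table above."""
    if text.isalnum():
        return "alphanumeric"
    if text in SENTENCE_END:
        return "sentence-ending punctuation"
    return "none"

def toy_tokenize(sentence):
    """Return (token, category) pairs in input order."""
    return [(m.group(), categorize(m.group()))
            for m in TOKEN_PATTERN.finditer(sentence)]

sentence = "John's not the lettuce-eating American people take him for."
tokens = toy_tokenize(sentence)
for text, category in tokens:
    print(f"{text:<16}{category}")
```

For the example sentence this yields 11 tokens, with 's and lettuce-eating left uncategorized. A real language module would rely on language-specific rules rather than a single pattern.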

Handling Hyphenated Words in Indexing Applications

To break up hyphenated words, an application must scan all uncategorized tokens (marked "none" in the example above) for hyphens. The cost in processing speed is negligible, since fewer than 1 in 10 tokens will be uncategorized, but it does add a little work when integrating the tokenizer into an application.
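A minimal sketch of that scan, assuming the application receives tokens as (text, category) pairs, might look like the following. The function name split_hyphenations and the choice to re-label purely alphanumeric parts as "alphanumeric" are assumptions made for illustration; the library does not prescribe either.

```python
def split_hyphenations(tokens):
    """Break hyphenated uncategorized tokens into their parts.

    tokens: list of (text, category) pairs.
    Only tokens whose category is "none" are inspected, so categorized
    tokens pass through without any string scanning.
    """
    result = []
    for text, category in tokens:
        if category == "none" and "-" in text:
            for part in text.split("-"):
                if part:  # skip empty pieces from leading/trailing hyphens
                    # Assumption: a purely alphanumeric part is re-labeled.
                    label = "alphanumeric" if part.isalnum() else "none"
                    result.append((part, label))
        else:
            result.append((text, category))
    return result

tokens = [
    ("John", "alphanumeric"),
    ("'s", "none"),
    ("the", "alphanumeric"),
    ("lettuce-eating", "none"),
    ("American", "alphanumeric"),
]
indexed = split_hyphenations(tokens)
```

Here lettuce-eating becomes the two index terms lettuce and eating, while 's is left as it is because it contains no hyphen.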

Breaking up all hyphenations, however, does not result in the best retrieval precision. For example, the following hyphenations have a meaning different from the sum of their parts:

Breaking up these hyphenated words results in loss of retrieval precision. For best results a good lexicon is necessary.

Abbreviations

Abbreviations are treated as single tokens, including their period, except when they appear at the end of a sentence, in which case the period is separated from the abbreviation. For example, in the following sentence:

The jurisdiction of her several courts, general and local, of law, of equity, of admiralty, etc., is not less a source of frequent and intricate discussions, etc.

the tokenizer finds the 35 tokens listed below.

Token            Category
The              alphanumeric
jurisdiction     alphanumeric
of               alphanumeric
her              alphanumeric
several          alphanumeric
courts           alphanumeric
,                post-punctuation
general          alphanumeric
and              alphanumeric
local            alphanumeric
,                post-punctuation
of               alphanumeric
law              alphanumeric
,                post-punctuation
of               alphanumeric
equity           alphanumeric
,                post-punctuation
of               alphanumeric
admiralty        alphanumeric
,                post-punctuation
etc.             none
,                post-punctuation
is               alphanumeric
not              alphanumeric
less             alphanumeric
a                alphanumeric
source           alphanumeric
of               alphanumeric
frequent         alphanumeric
and              alphanumeric
intricate        alphanumeric
discussions      alphanumeric
,                post-punctuation
etc              alphanumeric
.                sentence-ending punctuation
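The abbreviation rule can be sketched for a single English sentence as follows. This is an illustrative approximation, not the library's algorithm: the one-entry ABBREVIATIONS lexicon, the regular expression, and the function names are hypothetical, and the input is assumed to be exactly one sentence.

```python
import re

ABBREVIATIONS = {"etc."}  # hypothetical one-entry lexicon

# A word with an optional trailing period, or any single other symbol.
PIECE = re.compile(r"[A-Za-z0-9]+\.?|[^\sA-Za-z0-9]")

def categorize(text, at_end):
    """Assign one of the category names used in the table above."""
    if text.isalnum():
        return "alphanumeric"
    if text in {".", "!", "?"} and at_end:
        return "sentence-ending punctuation"
    if text == ",":
        return "post-punctuation"
    return "none"

def tokenize_sentence(sentence):
    """Return (token, category) pairs for one sentence."""
    sentence = sentence.rstrip()
    tokens = []
    for m in PIECE.finditer(sentence):
        text = m.group()
        at_end = m.end() == len(sentence)
        if text.endswith(".") and len(text) > 1:
            if text.lower() in ABBREVIATIONS and not at_end:
                # Mid-sentence abbreviation: keep the period attached.
                tokens.append((text, "none"))
                continue
            # Otherwise the period is split off as its own token.
            tokens.append((text[:-1], "alphanumeric"))
            text = "."
        tokens.append((text, categorize(text, at_end)))
    return tokens

sentence = ("The jurisdiction of her several courts, general and local, "
            "of law, of equity, of admiralty, etc., is not less a source "
            "of frequent and intricate discussions, etc.")
tokens = tokenize_sentence(sentence)
```

On the example sentence this produces 35 tokens: the first etc. keeps its period and is uncategorized, while the final one is split into etc and a sentence-ending period.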

