Tagging

Describes the tagging services of the LinguistX Platform library, version 2.2.

Contents

  1. Introduction
  2. Features
  3. Feature Abbreviations
      Major Categories
      Additional Features
  4. Tag Name Conventions
  5. Taggers for specific languages
    1. English

Introduction

The LinguistX taggers label each word in a given sentence with a tag, which is chosen from a rich tag set. The tag contains part-of-speech information, plus noun or subject number information, verb tense information, and fine distinctions of use. Tagger features include:

Feature Abbreviations

Some characteristics apply to all the taggers; these are described in this and the following sections. The tags and language-specific characteristics of each tagger are described in the documents shipped with each tagger language module.

This section describes the abbreviations used for features in tags. Tags consist of feature names separated by hyphens. The first feature name is called the major category and usually specifies the part of speech of the word. Not all languages use these feature names to mean exactly the same categories, but the categories denoted are very similar.
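The tag structure described above can be illustrated with a short Python sketch. It is not part of the LinguistX API: the names split_tag and ParsedTag are hypothetical, and the sketch assumes only that a tag is a string of feature names separated by hyphens.

```python
from typing import NamedTuple, Tuple


class ParsedTag(NamedTuple):
    """A tag split into its parts (hypothetical type, for illustration only)."""
    major_category: str
    features: Tuple[str, ...]


def split_tag(tag: str) -> ParsedTag:
    """Split a hyphen-separated tag; the first feature name is the major category."""
    if not tag:
        raise ValueError("empty tag")
    parts = tag.split("-")
    return ParsedTag(parts[0], tuple(parts[1:]))


# Det is the major category; Dem and Sg are additional features.
print(split_tag("Det-Dem-Sg"))
```

Major categories that contain a slash, such as Aux/V, are left intact because only the hyphen is treated as a separator.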

Major Categories

The following table lists the parts of speech and other major categories identified across the supported languages. Not all languages make the same distinctions between major categories, and not all languages identify every category listed:

Category  Description
Abbr      abbreviation
Adj       adjective
Adv       adverb
Art       article
Aux       auxiliary
Aux/V     auxiliary/main verb
Cmpd      part of a compound
Conj      conjunction
Conn      connector (multiple functions)
Det       determiner
DetPron   determiner or pronoun
Foreign   foreign word
Func      function word (miscellaneous category)
Init      initials
Interj    interjection
Letter    single letter
Markup    formatting markup, e.g., SGML
Meas      unit of measure
Misc      uncategorized
Money     currency expression
Nn        noun
Num       numeric expression
Onom      onomatopoeia
Ord       ordinal number
Part      particle
Prep      preposition
Pron      pronoun
Prop      proper noun
Punct     punctuation
Time      time expression
Title     title
V         verb
V/A       verb or adjective
WordPart  part of a multi-word phrase

Additional Features

Additional features are abbreviated as shown in the following table. Two kinds of features are not included in the table. The single digits 1, 2, and 3 denote 1st, 2nd, and 3rd person respectively, and a group of digits denotes a disjunction. For example, Aux-3-Sg means 3rd person singular auxiliary, and Aux-12 means 1st or 2nd person auxiliary. A feature that appears in all lower case, as in the tag Prep-para from the Spanish tagger, stands for a word in that language and means that the word's distribution differs in some way from that of other words of its category.

Feature   Description
Acc       accusative (pronoun)
Adv       adverbial
Art       article
Attr      attributive
Circ      circumposition
Clitic    pronominal clitic
Close     closing (punctuation)
Comma     comma
Comp      comparative (adjective, adverb, or conjunction)
Coord     coordinating
Def       definite
Deg       degree
Dem       demonstrative
Det       determiner
Dig       digit form (of a number, as opposed to words)
Fam       family (name)
Fin       finite (verb)
Gen       genitive
Ger       gerund
Giv       given (name)
Imp       impersonal
Impv      imperative
Indef     indefinite
Indet     indeterminate
Inf       infinitive
Infin     infinitive
Init      initial
Int       interrogative
IntRel    interrogative or relative
Item      item
Left      left (part of a compound)
Meas      measure
Money     money expression
Name      name
Neg       negative
Nom       nominative (pronoun)
Open      opening (punctuation)
Org       organization (name)
PaPart    past participle
Part      left or right part of a compound
Past      past tense
Percent   percent expression
Pers      personal (pronoun)
PersRefl  personal or reflexive (pronoun)
Pl        plural
Place     place (name)
Poss      possessive
Post      postposition
PrPart    present participle
Pre       occurring before the major category (e.g., a pre-determiner modifies a determiner)
PreCoord  conjunctional adverb
Pred      predicative (adjective)
Prefix    prefix
Prep      preposition
Pres      present tense
Prog      progressive verb
Pron      pronoun
Quant     quantifier
Quote     quote (punctuation)
Recip     reciprocal
Refl      reflexive
Rel       relative
Right     right (part of a compound)
Roman     Roman (numeral)
Sent      sentence (punctuation)
SForm     special verbal form
Sg        singular
SGML      SGML markup tag
Slash     slash (punctuation)
SP        singular / plural
Sub       subordinating (conjunction)
Sup       superlative (adjective or adverb)
Title     title, used with a name
Word      word form (of a number, as opposed to digits)
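
The conventions for features that are not listed in the table (person digits and lower-case features) can be sketched as follows. This is an illustrative Python sketch, not LinguistX code; the function name describe_feature and the wording of its return values are assumptions made for this example.

```python
def describe_feature(feature: str) -> str:
    """Classify one feature name by the conventions described in this section."""
    ordinals = {"1": "1st", "2": "2nd", "3": "3rd"}
    if feature.isdigit() and set(feature) <= set(ordinals):
        # A group of digits is a disjunction, e.g. "12" is 1st or 2nd person.
        return " or ".join(ordinals[d] for d in feature) + " person"
    if feature.islower():
        # All lower case: a word of the tagged language, as in Prep-para.
        return f"language-specific word '{feature}'"
    # Anything else is an abbreviation to look up in the table above.
    return f"abbreviated feature '{feature}'"


for name in "Aux-12".split("-")[1:] + "Prep-para".split("-")[1:]:
    print(name, "->", describe_feature(name))
```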

Tag Name Conventions

Some word classes are treated slightly differently in each language; these differences are described in the language-specific sections. Word types that receive particular attention include demonstratives (e.g., this, that), quantifiers (some, all), interrogatives (where, when, who), and relativizers (that, which). Number expressions and proper nouns are also noted.

In these taggers, words are marked for number but not for gender. The feature Pl stands for plural, and Sg for singular. If a tag contains neither feature, or the tag is referred to as "invariant" in its description, the same word can be used in both singular and plural contexts.
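
The number convention above can be expressed as a small check. This is a minimal Python sketch under the assumption that tags are hyphen-separated strings; the function name grammatical_number is hypothetical, and the SP feature from the feature table is not given special treatment here.

```python
def grammatical_number(tag: str) -> str:
    """Return 'singular', 'plural', or 'invariant' for a hyphen-separated tag."""
    # Skip the first feature name, which is the major category.
    features = tag.split("-")[1:]
    if "Sg" in features:
        return "singular"
    if "Pl" in features:
        return "plural"
    # Neither Sg nor Pl: the word serves both singular and plural contexts.
    return "invariant"


for tag in ("Det-Dem-Sg", "Det-Dem-Pl", "Det-Dem"):
    print(tag, "->", grammatical_number(tag))
```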

In some cases, these descriptions may include a statement such as, "both uses of demonstratives are tagged Det-Dem." In this case the tag Det-Dem may be shorthand, standing for all three of the tags Det-Dem, Det-Dem-Sg, and Det-Dem-Pl, if they all exist for that particular tagger.
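
The shorthand can be made explicit by expanding a tag name against the tag set of a particular tagger. The following Python sketch is an illustration only: the function name expand_shorthand is hypothetical, and the tag set passed in is a made-up example rather than the tag set of any shipped language module.

```python
from typing import Iterable, List


def expand_shorthand(tag: str, tagset: Iterable[str]) -> List[str]:
    """Return the tags that a shorthand name stands for in a given tag set."""
    known = set(tagset)
    # The shorthand covers the bare tag and its Sg and Pl variants,
    # but only those that actually exist for the tagger in question.
    candidates = [tag, f"{tag}-Sg", f"{tag}-Pl"]
    return [candidate for candidate in candidates if candidate in known]


# Hypothetical tag set in which the bare tag Det-Dem does not exist.
example_tagset = {"Det-Dem-Sg", "Det-Dem-Pl", "Det"}
print(expand_shorthand("Det-Dem", example_tagset))
```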

Each row of the tables in the language-specific sections contains a tag name for the given language, a brief description, and an example to illustrate the tag. Where some context is necessary to illustrate the meaning more clearly, the example word itself appears in bold, while the context words are in plain type. In examples without context, the illustrative word is in plain type.

For specific information about language-specific behavior of each of the language modules, please see the documentation that is shipped with the language modules.

