CIT 594 Tokenizer Examples
Spring 2004, David Matuszek

This is an attempt to clarify the definition of "Token."

Each Token has a value, which is the string of characters making up that token, and a type, which tells what kind of token it is. Your program must extract tokens from whatever string it is given; do not make any assumptions about what kind of string that will be.

Token.NAME    A string consisting of one or more letters and/or digits, beginning with a letter.
Token.NUMBER  A string consisting of one or more decimal digits.
Token.EOL     The newline character, '\n'.
Token.SYMBOL  Any single character that is not a letter, digit, newline, or whitespace. Note that this includes the underscore character, '_'.
Token.EOI     A special token to indicate that the end of input has been reached, and there are no more tokens to return.
Token.ERROR   A special token to indicate that an error has occurred, such as the user asking for more tokens after Token.EOI has been returned.
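
The definitions above can be read as a small decision procedure over a piece of text that has already been cut out of the input. The following is only a sketch of that reading, not the required implementation: the class name TokenTypeSketch, the enum Type, and the method classify are hypothetical, and the sketch assumes that "letter" and "digit" mean whatever Character.isLetter and Character.isDigit accept. Token.EOI is left out because it describes the state of the input rather than a piece of text.

```java
public class TokenTypeSketch {
    // Hypothetical stand-in for the Token type constants in the assignment.
    enum Type { NAME, NUMBER, EOL, SYMBOL, ERROR }

    /** Classifies one already-extracted chunk of text. */
    static Type classify(String value) {
        if (value.isEmpty()) return Type.ERROR;      // assumption: nothing to classify
        if (value.equals("\n")) return Type.EOL;     // newline is checked before whitespace
        char first = value.charAt(0);
        if (value.length() == 1
                && !Character.isLetterOrDigit(first)
                && !Character.isWhitespace(first)) {
            return Type.SYMBOL;                      // includes '_'
        }
        boolean allDigits = true;
        boolean allLettersOrDigits = true;
        for (char c : value.toCharArray()) {
            if (!Character.isDigit(c)) allDigits = false;
            if (!Character.isLetterOrDigit(c)) allLettersOrDigits = false;
        }
        if (allDigits) return Type.NUMBER;
        if (allLettersOrDigits && Character.isLetter(first)) return Type.NAME;
        return Type.ERROR;                           // anything that fits no definition
    }

    public static void main(String[] args) {
        String[] samples = {"sum", "x1", "10", "\n", "_", "+"};
        for (String s : samples) {
            System.out.println("\"" + s.replace("\n", "\\n") + "\" " + classify(s));
        }
    }
}
```

Checking for the newline first matters, because '\n' also counts as whitespace and would otherwise never be reported as EOL.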

Special case: If you use java.util.StringTokenizer, it will return a string such as 123XYZ as a single token. Classify such a token as Token.ERROR: it begins with a digit, so it is not a NAME, and it contains letters, so it is not a NUMBER. You can use the String method myToken.matches("[0-9]+") to test whether the string myToken contains only digits.
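
Here is a small sketch of that test in use. It assumes the input contains only letters, digits, and spaces, so that the default StringTokenizer delimiters are enough; the class name DigitCheckSketch, the method classifyChunk, and the second regular expression (ASCII letters only) are illustrative assumptions, not part of the assignment.

```java
import java.util.StringTokenizer;

public class DigitCheckSketch {
    /** Returns the type name for a chunk made up of letters and/or digits. */
    static String classifyChunk(String myToken) {
        if (myToken.matches("[0-9]+")) return "NUMBER";                 // digits only
        if (myToken.matches("[A-Za-z][A-Za-z0-9]*")) return "NAME";     // starts with a letter
        return "ERROR";                                                 // e.g. 123XYZ
    }

    public static void main(String[] args) {
        // Default delimiters: StringTokenizer splits on whitespace only.
        StringTokenizer st = new StringTokenizer("sum 10 123XYZ x1");
        while (st.hasMoreTokens()) {
            String chunk = st.nextToken();
            System.out.println(chunk + " " + classifyChunk(chunk));
        }
    }
}
```

Note that matches tests the whole string, so "[0-9]+" does not accept 123XYZ even though the string starts with digits.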

Example 1:

Input String: for (i = 0; i < 10; i++) {
    sum += i;
}
 #  value  type      #  value  type      #  value  type      #  value  type
 1  "for"  NAME      9  "10"   NUMBER   17  "sum"  NAME     25  ""     EOI
 2  "("    SYMBOL   10  ";"    SYMBOL   18  "+"    SYMBOL   26  ""     ERROR
 3  "i"    NAME     11  "i"    NAME     19  "="    SYMBOL   27  ""     ERROR
 4  "="    SYMBOL   12  "+"    SYMBOL   20  "i"    NAME     28  ""     ERROR
 5  "0"    NUMBER   13  "+"    SYMBOL   21  ";"    SYMBOL   29  ""     ERROR
 6  ";"    SYMBOL   14  ")"    SYMBOL   22  "\n"   EOL      30  ""     ERROR
 7  "i"    NAME     15  "{"    SYMBOL   23  "}"    SYMBOL   31  ""     ERROR
 8  "<"    SYMBOL   16  "\n"   EOL      24  "\n"   EOL

Notes:

Example 2:

Input String:

"If it wasn't backed up, then it wasn't important."
     -- The sysadmin's motto

 #  value     type      #  value        type      #  value       type      #  value    type
 1  "\""      SYMBOL    9  ","          SYMBOL   17  "\""        SYMBOL   25  "motto"  NAME
 2  "If"      NAME     10  "then"       NAME     18  "\n"        EOL      26  "\n"     EOL
 3  "it"      NAME     11  "it"         NAME     19  "-"         SYMBOL   27  ""       EOI
 4  "wasn"    NAME     12  "wasn"       NAME     20  "-"         SYMBOL   28  ""       ERROR
 5  "'"       SYMBOL   13  "'"          SYMBOL   21  "The"       NAME     29  ""       ERROR
 6  "t"       NAME     14  "t"          NAME     22  "sysadmin"  NAME     30  ""       ERROR
 7  "backed"  NAME     15  "important"  NAME     23  "'"         SYMBOL   31  ""       ERROR
 8  "up"      NAME     16  "."          SYMBOL   24  "s"         NAME     32  ""       ERROR
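
For comparison, here is one way the whole sequence, including the EOI and ERROR tokens at the end, might be produced by a hand-written scanner instead of java.util.StringTokenizer. This is a sketch under stated assumptions, not the reference solution: the class SketchTokenizer, its nested Token class, and the method next are hypothetical names, types are represented as plain strings, and the input is assumed to end with a newline as in the examples.

```java
public class SketchTokenizer {
    /** A token's text and its type name, e.g. ("sum", "NAME"). Hypothetical. */
    public static final class Token {
        public final String value;
        public final String type;
        Token(String value, String type) { this.value = value; this.type = type; }
    }

    private final String input;
    private int pos = 0;                 // index of the next unread character
    private boolean eoiReturned = false; // true once EOI has been handed out

    public SketchTokenizer(String input) { this.input = input; }

    /** Returns the next token; after EOI, every further call returns ERROR. */
    public Token next() {
        if (eoiReturned) return new Token("", "ERROR");
        // Skip whitespace, but stop at a newline: it is a token of its own.
        while (pos < input.length() && input.charAt(pos) != '\n'
                && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) {
            eoiReturned = true;
            return new Token("", "EOI");
        }
        char c = input.charAt(pos);
        if (c == '\n') {
            pos++;
            return new Token("\n", "EOL");
        }
        if (Character.isLetterOrDigit(c)) {
            int start = pos;
            while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) {
                pos++;
            }
            String value = input.substring(start, pos);
            if (Character.isLetter(c)) return new Token(value, "NAME");
            if (value.matches("[0-9]+")) return new Token(value, "NUMBER");
            return new Token(value, "ERROR");   // e.g. 123XYZ
        }
        pos++;
        return new Token(String.valueOf(c), "SYMBOL");
    }

    public static void main(String[] args) {
        SketchTokenizer t = new SketchTokenizer("x1 = 42;\n");
        for (int i = 1; i <= 7; i++) {
            Token tok = t.next();
            System.out.println(i + " \"" + tok.value.replace("\n", "\\n") + "\" " + tok.type);
        }
    }
}
```

Run on the string "x1 = 42;\n", the main method lists NAME, SYMBOL, NUMBER, SYMBOL, EOL, EOI, and then ERROR for the one extra request.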