CIT 591 Assignment 8: Token Counter Clarifications
Fall 2004, David Matuszek

It seems that any interesting project has "edge cases"--cases that are on some boundary, and that are easy to get wrong. These are seldom obvious when you start the project, but become apparent as you get into it. JavaTokenCounter is no exception.

Guiding Principles

  1. Any tokenizer, or program that gets and classifies tokens--such as the current assignment--should not be concerned with assembling those tokens into a coherent whole. It should classify each token without regard to context. Hence, in myArray[index], there are two names and two punctuation marks. Technically, the brackets form a single indexing operator, but that's not for the tokenizer to decide. The closing bracket can occur arbitrarily far after the opening bracket, so pairing the two is a much harder problem, one not suitable for a state machine. It isn't necessary for a tokenizer to be this complicated.

    Typically, some higher-level program, such as a parser, has the task of assembling the tokens into meaningful combinations, such as declarations, statements, methods, and classes.

  2. No token contains other tokens. Operators that are written as a group of characters with no intervening tokens or whitespace, such as +=, can and should be counted as a single token. However, operators that are separated, such as the indexing operator [] in myArray[index], or the ternary operator ?: in max = x > y ? x : y; must be counted as two separate tokens.

  3. If it looks like a name, it's either a name or a keyword.
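
As a rough illustration of principles 1 and 2, here is a minimal sketch of a context-free splitter. The class name TokenSketch, the method split, and the abbreviated operator table are all assumptions made for this example; they are not part of the assignment, and number handling is reduced to plain digit runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits text into tokens without looking at context.
class TokenSketch {
    // Illustrative subset of Java's operators (an assumption, not a full list).
    // Multi-character spellings come first so the longest match wins.
    private static final String[] OPERATORS = {
        "+=", "-=", "==", "<=", ">=", "++", "--",
        "+", "-", "=", "<", ">", "?"
    };

    static List<String> split(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace separates tokens
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;                         // name or keyword
                while (i < source.length()
                        && Character.isJavaIdentifierPart(source.charAt(i))) {
                    i++;
                }
                tokens.add(source.substring(start, i));
            } else if (Character.isDigit(c)) {
                int start = i;                         // simplified: digits only
                while (i < source.length() && Character.isDigit(source.charAt(i))) {
                    i++;
                }
                tokens.add(source.substring(start, i));
            } else {
                String op = matchOperator(source, i);
                if (op != null) {
                    tokens.add(op);                    // e.g. += is ONE token
                    i += op.length();
                } else {
                    tokens.add(String.valueOf(c));     // single punctuation mark
                    i++;
                }
            }
        }
        return tokens;
    }

    // Returns the first (and therefore longest) operator found at position i.
    private static String matchOperator(String source, int i) {
        for (String op : OPERATORS) {
            if (source.startsWith(op, i)) {
                return op;
            }
        }
        return null;
    }
}
```

With this sketch, TokenSketch.split("myArray[index]") yields the four tokens myArray, [, index, and ], while TokenSketch.split("total += 1;") keeps += together as one token. The ? and : in max = x > y ? x : y; come out as two separate tokens, as principle 2 requires.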

Specific Cases

The Java™ Tutorial, Third Edition says that true, false, and null are reserved words, but are not keywords. Count them as keywords.

Technically, new and instanceof are operators. Count them as keywords.

If we count true, false, null, new, and instanceof, there are 51 keywords in Java 1.4. Java 1.5 adds the keyword enum, for a total of 52. You should add enum to your list of keywords, if you don't already have it.
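
One way to apply these keyword rules is a simple set lookup, sketched below. The class name WordClassifier and the method classify are hypothetical, and the keyword set is deliberately abbreviated; your own list should contain all 52 entries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: anything that looks like a name is a keyword or a name.
class WordClassifier {
    // Abbreviated on purpose (an assumption for this example). Note that it
    // includes true, false, null, new, instanceof, and enum, as described above.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "class", "if", "else", "while", "return", "int",
        "true", "false", "null", "new", "instanceof", "enum"));

    // Assumes word has already been recognized as name-like by the tokenizer.
    static String classify(String word) {
        return KEYWORDS.contains(word) ? "keyword" : "name";
    }
}
```

Because Java is case sensitive, classify("while") gives "keyword" but classify("While") gives "name".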

Technically, [] and () are operators; brackets are an indexing operator, and parentheses are both a method call operator and a casting operator. Count each bracket and each parenthesis as a separate punctuation mark.

The colon (:) appears in at least three places in Java, once as part of the ternary operator ?: and twice as simple punctuation. Don't try to sort this out. Count every colon as a punctuation mark and the question mark (?) as an operator.
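
The bracket, parenthesis, colon, and question mark rules need no context at all, so they can be sketched as a plain switch. The class name SymbolClassifier is hypothetical, and characters not mentioned above are left unclassified here rather than guessed at.

```java
// Hypothetical sketch covering only the symbols discussed above.
class SymbolClassifier {
    static String classify(char c) {
        switch (c) {
            case '[': case ']':      // indexing brackets, counted one at a time
            case '(': case ')':      // call and cast parentheses, likewise
            case ':':                // every colon, whatever its role
                return "punctuation";
            case '?':                // the question mark of ?:
                return "operator";
            default:
                return "unclassified"; // outside the scope of this sketch
        }
    }
}
```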

Numbers never begin with + or -; a leading sign is a separate operator token. (These characters can, however, occur within a real literal, in its exponent, as in 1.5e-3.)

Numeric literals containing an exponent are always real. Octal and hexadecimal numbers are always integer.
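
The hexadecimal rule hides an edge case of its own: hex digits include e, d, and f, so 0xE5 must not be mistaken for a number with an exponent. The sketch below shows one possible ordering of the checks. The class name NumberClassifier is hypothetical, and the code assumes it is handed a complete, well-formed, unsigned Java 1.4 numeric literal (so it ignores the hexadecimal floating-point literals added in Java 1.5).

```java
// Hypothetical sketch: decides whether a numeric literal is integer or real.
class NumberClassifier {
    // Assumes literal is already a complete, valid numeric literal.
    static String classify(String literal) {
        String s = literal.toLowerCase();
        // Check hex first: its digits may include e, d, and f.
        if (s.startsWith("0x")) {
            return "integer";
        }
        // A decimal point, an exponent, or an f/d suffix makes it real.
        if (s.indexOf('.') >= 0 || s.indexOf('e') >= 0
                || s.endsWith("f") || s.endsWith("d")) {
            return "real";
        }
        // Plain decimal and octal literals, with or without an L suffix.
        return "integer";
    }
}
```

For example, classify("1e10") and classify("6.02E-23") both give "real", while classify("0xE5"), classify("017"), and classify("42L") all give "integer".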

If I notice or hear about any more special cases, I will post something.