CIT 594 Notes on the Tokenizer Assignment
Spring 2014, David Matuszek

The state machine

The Tokenizer is to be implemented as a state machine, so a good way to start is to draw, on paper, the state machine you want to implement. Notice which states should be final states, and which states should not be final states--the latter may represent errors.

The most complex token is the NUMBER token. The Tokenizer should recognize both unsigned integers and floating-point numbers. Floating-point numbers may have an exponent, and the exponent may be signed or unsigned. You don't need to handle octal, hexadecimal, or binary numbers, or byte, short, or long integers, though you are welcome to do so; my tests will not test for these, either way.
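
As a rough illustration, here is one way the NUMBER part of such a state machine might look in Java. This is only a sketch: the class name NumberRecognizer, the state names, and the decision to require digits on both sides of the decimal point are my assumptions for this example, not requirements of the assignment.

```java
final class NumberRecognizer {
    // Hypothetical state names; the assignment does not prescribe them.
    enum State { START, INT, DOT, FRAC, EXP, EXP_SIGN, EXP_DIGITS, ERROR }

    // One transition of the machine: current state + next character -> new state.
    static State step(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:
                return digit ? State.INT : State.ERROR;
            case INT:
                if (digit) return State.INT;
                if (c == '.') return State.DOT;
                if (c == 'e' || c == 'E') return State.EXP;
                return State.ERROR;
            case DOT: // assumption: at least one digit must follow the point
                return digit ? State.FRAC : State.ERROR;
            case FRAC:
                if (digit) return State.FRAC;
                if (c == 'e' || c == 'E') return State.EXP;
                return State.ERROR;
            case EXP:
                if (digit) return State.EXP_DIGITS;
                if (c == '+' || c == '-') return State.EXP_SIGN;
                return State.ERROR;
            case EXP_SIGN:
                return digit ? State.EXP_DIGITS : State.ERROR;
            case EXP_DIGITS:
                return digit ? State.EXP_DIGITS : State.ERROR;
            default:
                return State.ERROR;
        }
    }

    // Final states: the input so far is a complete number.
    static boolean isFinal(State s) {
        return s == State.INT || s == State.FRAC || s == State.EXP_DIGITS;
    }

    // Runs the machine over the whole string and checks where it ended.
    static boolean isNumber(String text) {
        State s = State.START;
        for (int i = 0; i < text.length(); i++) {
            s = step(s, text.charAt(i));
        }
        return isFinal(s);
    }
}
```

In this sketch, 123, 3.14, and 6.02e23 all end in final states, while 123e+ ends in EXP_SIGN, which is not final and so represents an error. A real Tokenizer would stop at the first character that does not fit rather than checking a whole string at once.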

Backing up

There is a requirement that your Tokenizer be able to "back up" (deliver again) one token. This feature will be essential in a later assignment.
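
One possible way to get one-token backup is to remember the last item delivered and a flag saying whether to deliver it again. The sketch below does this for any Iterator; the class name BackUpIterator and the method name backUp are hypothetical, chosen for this example, and your Tokenizer's actual method names come from the assignment.

```java
import java.util.Iterator;

final class BackUpIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private T last;            // most recently delivered item
    private boolean hasLast;   // false until next() has been called once
    private boolean backedUp;  // true when `last` should be delivered again

    BackUpIterator(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        return backedUp || source.hasNext();
    }

    @Override
    public T next() {
        if (backedUp) {        // redeliver the remembered item
            backedUp = false;
            return last;
        }
        last = source.next();
        hasLast = true;
        return last;
    }

    // Arranges for the most recent item to be delivered again.
    // Only one level of backup is supported in this sketch.
    void backUp() {
        if (!hasLast || backedUp) {
            throw new IllegalStateException("can back up only one item");
        }
        backedUp = true;
    }
}
```

Note that after backUp(), hasNext() returns true even if the underlying source is exhausted, because there is still one item waiting to be redelivered.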

The state machine, which turns character sequences into tokens, must also be able to back up (or somehow do the equivalent of backing up). This wasn't explicitly stated in the assignment, but is necessary in order to parse tokens that aren't separated by whitespace. In my code, I only allow backing up one character, and anything that requires backing up more than one character (for example, 123e+ABC) is an error. If you allow backing up more than one character, this could be treated as four tokens (123, e, +, and ABC).
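
For one-character backup, the standard java.io.PushbackReader is one option (it is not necessarily what my code uses). With its default constructor it holds exactly one pushed-back character. The sketch below reads a run of digits and un-reads the character that ended the run, so that the character is available to start the next token. The class name BackUpDemo and its methods are made up for this example.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class BackUpDemo {
    // Reads a run of digits, then un-reads the one character that ended it.
    static String readDigits(PushbackReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c = in.read();
        while (c != -1 && Character.isDigit(c)) {
            sb.append((char) c);
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // back up exactly one character
        }
        return sb.toString();
    }

    // Splits text into digit runs and single non-whitespace characters.
    static List<String> pieces(String text) {
        List<String> out = new ArrayList<>();
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isDigit(c)) {
                    in.unread(c);              // let readDigits see the first digit
                    out.add(readDigits(in));
                } else if (!Character.isWhitespace(c)) {
                    out.add(String.valueOf((char) c));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Given the input 123+45, with no whitespace between tokens, pieces returns the three strings 123, +, and 45. The + is read while scanning the first number, pushed back, and then read again as its own piece.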

For our purposes, backing up one character is "good enough" (but it's okay if you back up more). My tokenizer tests will be primarily concerned with whether your Tokenizer successfully tokenizes correct (error free) input, not whether your code catches every possible error.

The Reader interface

Tokenizers are usually used to tokenize the contents of a file. So, when you construct a Tokenizer, you might think that it should be given a File object as a parameter. The problem with this is that, in order to test the Tokenizer, you need to have a very carefully written file containing exactly the expected string, and you need to keep this file with the tests. This greatly increases the things that can go wrong. If the file is misplaced or damaged, your tests are worthless (and you'll never get around to reconstructing the file). So it's much better to keep your tests self-contained.

If your Tokenizer uses a Reader instead of a File, you can tokenize strings from a wide variety of sources, not just from a file. You can write your tests using a StringReader and feel confident that your program will work just as well when given a FileReader.
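
To make this concrete, here is a small stand-in class that takes a Reader in its constructor. It is not the Tokenizer. WordSplitter is a hypothetical name, and all it does is split its input on whitespace, but it shows how code written against Reader can be exercised with a StringReader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class WordSplitter {
    private final Reader in;

    // Accepts any Reader: StringReader in tests, FileReader in real use.
    WordSplitter(Reader in) {
        this.in = in;
    }

    // Returns the whitespace-separated words found in the input.
    List<String> words() {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (word.length() > 0) {   // a word just ended
                        out.add(word.toString());
                        word.setLength(0);
                    }
                } else {
                    word.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (word.length() > 0) {               // last word, if input did not end in whitespace
            out.add(word.toString());
        }
        return out;
    }
}
```

A test can construct it with new WordSplitter(new StringReader("x = 42")) and expect the words x, =, and 42, with no file on disk involved. The same class accepts a FileReader unchanged, since FileReader is also a Reader.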

Recommended alternative to EOI

The assignment specified an EOI (end of input) token. This led to some confusion (which I tried to clear up on Piazza) about exactly how next() and hasNext() should behave at the end of the input.

Here is a somewhat better design, which is both easier to implement and easier to understand. Ignore the EOI token, and never use it in your Tokenizer (but leave it in the TokenType enum, or you'll break my tests!). Instead, do this:

Unless you have already completed the assignment, I recommend that you make this change. My tests will accept it either way, so long as you leave EOI as a possible TokenType.