CIT 591 Assignment 8: Token Counter
Fall 2004, David Matuszek

Purposes of this assignment:

General idea of the assignment:

Categorize and count the tokens in an input Java file.


Here are the kinds of "tokens" that can appear in a Java program:

Names
This includes both Sun-defined names, such as String and println, and user-defined names. Do not try to distinguish between user-defined names and the thousands of Sun-defined names!
Keywords
Make a list of all the Java-defined keywords, and each time you get a "name," check and see whether it's in this list. If so, it's really a keyword rather than a name.
Character literals
A character literal is a character enclosed in single quotes. However, not all characters are as simple as 'x', because you can have escaped characters (such as '\n') and characters represented by their numeric codes (such as '\015'). Be sure to recognize all valid character literals.
Integer literals
An integer literal may be decimal, octal, or hexadecimal. You don't need to count these types separately, but you need to know the syntax of each kind.
Real literals
Real literals are always in decimal, and may have a decimal point or an exponent or both. Be sure to recognize float literals as well as double literals.
String literals
String literals consist of zero or more characters (see above) enclosed in double quotes. Strings cannot be broken across lines. Inside a string there are no operators, numbers, comments, or anything else--the only thing special is the (unescaped) double quote character that ends the string.
Operators
Operators are one to four characters (for example, =, +, !=, +=, <, <<, >>>, >>>=). Notice that each of these counts as one operator, not one per character. For purposes of this assignment, we will not count the colon (:) as an operator.
Punctuation
Punctuation consists of any single character that isn't whitespace, a letter, a digit, or an operator. We will consider parentheses, brackets, braces, commas, colons, and semicolons to be punctuation, rather than operators. (I think that's the complete list--let me know if you think of others.)
C-style comments
C-style comments begin with /* and end with */.
Javadoc comments
Javadoc comments begin with /** and end with */.
End-of-line comments
End-of-line comments begin with // and extend to the end of the line.
End-of-file (EOF)
Not an actual token as such, but it can simplify your code to treat it as one.

This isn't the only possible set of token categories, but it will do.
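The keyword check described above can be sketched as a set lookup. This is only an illustration: the class name KeywordTable and the method isKeyword are made up for this example, and the list assumes the Java 5 keyword set (including enum and assert). The words true, false, and null are literals rather than keywords in the language specification, so they are left out here; how to count them is not specified in this assignment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: decides whether a scanned "name" is really a keyword. */
class KeywordTable {
    // Java 5 keywords; const and goto are reserved even though they are unused.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "abstract", "assert", "boolean", "break", "byte", "case", "catch",
        "char", "class", "const", "continue", "default", "do", "double",
        "else", "enum", "extends", "final", "finally", "float", "for",
        "goto", "if", "implements", "import", "instanceof", "int",
        "interface", "long", "native", "new", "package", "private",
        "protected", "public", "return", "short", "static", "strictfp",
        "super", "switch", "synchronized", "this", "throw", "throws",
        "transient", "try", "void", "volatile", "while"));

    /** Returns true if the given name is in the keyword list. */
    static boolean isKeyword(String name) {
        return KEYWORDS.contains(name);
    }
}
```

A token counter could call isKeyword on each completed name and increment either the keyword count or the name count accordingly.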

There are some important methods in the Character class that you should look at and use.
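As a rough illustration of which Character methods are likely to be useful, here is a small sketch. The class CharKinds and its two methods are hypothetical names chosen for this example; only the Character calls themselves come from the Java API. One detail worth knowing: Character.isJavaIdentifierPart returns true for '\0', because Java treats it as an ignorable identifier character, so a null character used as an end-of-input marker has to be checked for separately.

```java
/** Hypothetical helper showing some java.lang.Character classification methods. */
class CharKinds {
    /** Gives a rough description of what kind of token a character could start. */
    static String describe(char ch) {
        if (Character.isWhitespace(ch)) return "whitespace";
        if (Character.isDigit(ch)) return "digit";
        // Letters, underscore, and dollar sign can all start a Java name.
        if (Character.isJavaIdentifierStart(ch)) return "name start";
        return "other";
    }

    /** True if ch can appear inside a name after its first character. */
    static boolean canContinueName(char ch) {
        // '\0' is excluded explicitly: isJavaIdentifierPart alone would accept it.
        return ch != '\0' && Character.isJavaIdentifierPart(ch);
    }
}
```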

The primary purpose of this program is to count tokens in files and display the counts in a GUI, but if it combines all those operations, JUnit testing is very difficult. So we'll use the MVC (Model-View-Controller) design pattern to divorce the actual computations from the input and output operations, as follows:

Classes and interfaces

Please be really sure to use the exact same names for everything as listed here, because we will be doing our own unit testing of your code.

interface DataSource

I'll supply the code for this interface. Here it is:

    public interface DataSource {
        char read();
        void unread(char ch);
    }

You get to supply the Javadoc comments.

Your implementations of read() should return a null character, '\0', when there are no more characters to be returned. (You can't return -1, because the return type is char.)

The unread(char ch) method is useful because you often discover you are at the end of a token by getting a character that belongs to the next token. For example, in count++, you know you are finished getting the name count when you get the '+'; but you need to put that plus back so you can use it in getting the next token.

class StringDataSource implements DataSource

This has a constructor that takes a String as an argument.
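One possible shape for a string-backed data source, together with the count++ situation described above, is sketched here. Treat it as a minimal sketch under assumptions, not as the required implementation: the interface is declared without public so the example fits in one file, the pushback stack is just one way to support unread, and NameScanner.readName is a made-up helper for the demonstration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Same methods as the assignment's interface; non-public only for this one-file sketch.
interface DataSource {
    char read();
    void unread(char ch);
}

class StringDataSource implements DataSource {
    private final String text;
    private int position = 0;
    // Characters that were put back, most recent on top.
    private final Deque<Character> pushedBack = new ArrayDeque<>();

    StringDataSource(String text) {
        this.text = text;
    }

    public char read() {
        if (!pushedBack.isEmpty()) return pushedBack.pop();
        // '\0' signals that no characters are left.
        return position < text.length() ? text.charAt(position++) : '\0';
    }

    public void unread(char ch) {
        pushedBack.push(ch);
    }
}

/** Hypothetical helper: reads one name and puts back the character that ended it. */
class NameScanner {
    static String readName(DataSource ds) {
        StringBuilder name = new StringBuilder();
        char ch = ds.read();
        while (ch != '\0' && Character.isJavaIdentifierPart(ch)) {
            name.append(ch);
            ch = ds.read();
        }
        if (ch != '\0') ds.unread(ch); // this character belongs to the next token
        return name.toString();
    }
}
```

With the input count++, readName returns count, and the next two calls to read() each return '+', so the operator scanner sees the whole ++.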

class FileDataSource implements DataSource

This has a no-argument constructor that sets up a file for use as a data source. The constructor should ask the user for a file by calling JFileChooser.

Once you have a file, see my (unchanged from last time) to see how to get a BufferedReader for it; then use this as a parameter to the constructor for PushbackReader. See the Java API for information on PushbackReader--notice that it works with ints, not chars, so you will need to do a little casting in your class.
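The int-to-char casting mentioned above might look something like the following sketch. To keep it runnable without a GUI, it wraps any Reader instead of opening a file with JFileChooser; the class name ReaderDataSource is hypothetical, and in the real FileDataSource the Reader would be the BufferedReader for the chosen file. Treating an IOException on read as end of input is an assumption made for brevity.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

// Same methods as the assignment's interface; non-public only for this one-file sketch.
interface DataSource {
    char read();
    void unread(char ch);
}

/** Hypothetical class: adapts a PushbackReader to the DataSource interface. */
class ReaderDataSource implements DataSource {
    private final PushbackReader reader;

    ReaderDataSource(Reader source) {
        // The one-argument constructor gives a pushback buffer of one character.
        this.reader = new PushbackReader(source);
    }

    public char read() {
        try {
            int code = reader.read();               // -1 means end of stream
            return code == -1 ? '\0' : (char) code; // cast int down to char
        } catch (IOException e) {
            return '\0'; // assumption: treat a read error like end of input
        }
    }

    public void unread(char ch) {
        try {
            reader.unread(ch); // char widens to int automatically
        } catch (IOException e) {
            // Thrown when more than one character is pushed back in a row.
            throw new IllegalStateException("cannot unread", e);
        }
    }
}
```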

public class JavaTokenCounter

This class should have (at least) the following methods:

public void countTokens(DataSource ds)
Counts all the various types of tokens in the DataSource, and stores them in instance variables. If called repeatedly, the new counts should be added to, not replace, the previous counts.

public void clearAllCounts()
Sets all token counts to zero.

public int getNameCount(), getKeywordCount(), getCharacterLiteralCount(), getIntegerLiteralCount(), getRealLiteralCount(), getStringLiteralCount(), getOperatorCount(), getPunctuationCount(), getCStyleCommentCount(), getJavadocCommentCount(), and getEndOfLineCommentCount()

All of these should be public int. Don't count the EOF, and don't supply methods to count various combinations of totals--the calling program can combine totals if it wants to.

public static void main(String[] args)
When used as an application, the program should ask for a file as a data source, count the tokens in it, and display the following results in a GUI:

  • Each of the above token types (not including EOF),
  • A comment count (combining the three comment types),
  • A count of non-comments (combining the other eight types, not including EOF), and
  • A total token count (not including EOF).

After reading each file, the user should be able to:

  • Choose another file, count tokens, and add them to the existing counts,
  • Clear the counts, or
  • Quit the program.

public class JavaTokenCounterTest extends TestCase

Provides JUnit tests for the JavaTokenCounter class. Use StringDataSource, not FileDataSource, in your testing (because unit testing should be completely automatic). It's a good idea to write this class first.

You should assume that the file you are reading contains a correct Java program. Your program shouldn't crash on any input, but if it's not correct Java, don't worry about how to count things in it.

How to do it:

Use state machines.

You could do this with a single (very large) state machine, but it's more convenient to break it up into multiple state machines. For example, I have a method parseNumber() which is called when I first encounter a digit; it implements a state machine with states INTEGER_PART, FRACTION_PART, EXPONENT_PART, and EXPONENT_DIGITS_PART, to keep track of which part of the number I'm in. (It doesn't yet deal with octal or hexadecimal numbers.) I'd also recommend state machines for operators and possibly for character literals, string literals, and comments.
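To make the four state names concrete, here is one way such a number-scanning state machine could be written. It is a guess at the general shape, not the instructor's parseNumber(): it scans a String from a given index rather than a DataSource, the class NumberScanner and its Result type are invented for the example, and it handles only decimal literals without type suffixes (no octal, hexadecimal, or trailing f, d, or L).

```java
/** Hypothetical sketch of a decimal-number state machine. */
class NumberScanner {
    enum State { INTEGER_PART, FRACTION_PART, EXPONENT_PART, EXPONENT_DIGITS_PART }

    /** How many characters the literal occupies, and whether it is a real literal. */
    static final class Result {
        final int length;
        final boolean isReal;
        Result(int length, boolean isReal) { this.length = length; this.isReal = isReal; }
    }

    private static boolean isDecimalDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    /** Scans a number starting at text.charAt(start), which is assumed to be a digit. */
    static Result parseNumber(String text, int start) {
        State state = State.INTEGER_PART;
        boolean isReal = false;
        int i = start;
        while (i < text.length()) {
            char ch = text.charAt(i);
            boolean consumed = false; // stays false when ch ends the number
            switch (state) {
                case INTEGER_PART:
                    if (isDecimalDigit(ch)) {
                        consumed = true;
                    } else if (ch == '.') {
                        state = State.FRACTION_PART; isReal = true; consumed = true;
                    } else if (ch == 'e' || ch == 'E') {
                        state = State.EXPONENT_PART; isReal = true; consumed = true;
                    }
                    break;
                case FRACTION_PART:
                    if (isDecimalDigit(ch)) {
                        consumed = true;
                    } else if (ch == 'e' || ch == 'E') {
                        state = State.EXPONENT_PART; consumed = true;
                    }
                    break;
                case EXPONENT_PART: // just saw e or E: expect a sign or the first digit
                    if (ch == '+' || ch == '-' || isDecimalDigit(ch)) {
                        state = State.EXPONENT_DIGITS_PART; consumed = true;
                    }
                    break;
                case EXPONENT_DIGITS_PART:
                    if (isDecimalDigit(ch)) consumed = true;
                    break;
            }
            if (!consumed) break; // ch belongs to the next token
            i++;
        }
        return new Result(i - start, isReal);
    }
}
```

Each case only needs to answer two questions: does this character belong to the number, and if so, which state comes next. That is what keeps the individual cases short.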

If you use state machines properly, and choose your state names carefully, this assignment turns out to be easier than you would expect. To give you an example of what I mean, here's what my FractionCalculator program does when a digit is clicked:

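    public String clickDigit(char digit) {
        switch (state) {
            case STARTING_NUMERATOR:
                state = GETTING_NUMERATOR;
                break;
            case GETTING_NUMERATOR:
                break; // already in the numerator; state is unchanged
            case STARTING_DENOMINATOR:
                state = GETTING_DENOMINATOR;
                break;
            case GETTING_DENOMINATOR:
                break; // already in the denominator; state is unchanged
            case COMPLETED_FRACTION:
                // A digit after a finished fraction starts a new number.
                haveFirstNumber = true;
                state = GETTING_NUMERATOR;
                break;
            case ERROR:
                assert false;
                break;
        }
        return displayString;
    }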

(This has no relevance to the current assignment, other than as a reminder of what state machines are like.)

For this assignment, you may not use the classes StringTokenizer, StreamTokenizer, or Pattern.


You should work with a partner and, as before, you both will get the same grade on the project. Ideally each of you should write at least one method for the JavaTokenCounter class that uses a state machine.

Due date:

Wednesday, December 1, before midnight.