Tenth Java Assignment: Syntax Coloring
CIT 591, David Matuszek, Fall 2001

Purposes of this assignment:

Idea of the assignment:

I post a lot of Java programs on the web. However, if I want those programs to have any syntax coloring, I have to do it all myself. This is a lot of work, and sometimes I make mistakes. It would be nice to have a program that does it for me.

Your job is to write a Java application that reads in a .java file and does some minimal syntax coloring on it. Since doing a thorough job could turn into a very large project, your program should do the most basic things and, more importantly, you should think about how you can design the program so that it can be enhanced later.

HTML

Documents for the Web are written in HTML (HyperText Markup Language). HTML uses tags, enclosed in angle brackets, to tell your browser how to format things. For example, <B>enclosed text</B> causes the enclosed text to be made bold. The entire document should be enclosed in assorted other tags, as follows:

<HTML>
<HEAD>
<TITLE>Syntax coloring assignment by
(your name goes here)</TITLE>
</HEAD>
<BODY>
<PRE>
   
Your document (modified Java program) goes here.
</PRE>
</BODY>
</HTML>
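
One way to produce this wrapper is to keep the opening and closing tags as named constants and concatenate them around the converted program. The following is only a sketch: the class name HtmlWrapper and the method name wrap are made up for illustration, and the title text is the placeholder from the template above.

```java
class HtmlWrapper {
    // Opening tags copied from the template above (the title is a placeholder).
    static final String HEADER =
        "<HTML>\n<HEAD>\n<TITLE>Syntax coloring assignment by\n"
        + "(your name goes here)</TITLE>\n</HEAD>\n<BODY>\n<PRE>\n";

    // Closing tags, in the reverse order of the opening tags.
    static final String FOOTER = "</PRE>\n</BODY>\n</HTML>\n";

    // Hypothetical helper: surrounds the already-converted program text.
    static String wrap(String body) {
        return HEADER + body + FOOTER;
    }
}
```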

Many modern program editors use syntax coloring: keywords, strings, and comments are set off by putting them in different colors.

Here's what your program should do:

  1. Use a dialog box to choose a .java file.
  2. Read in the chosen file, which you can assume contains a Java program (though not necessarily a correct Java program; it may have unclosed string constants, for example).
  3. Embed the program in the necessary HTML tags, and perform the necessary character conversions (see below).
  4. Perform some syntax coloring, as described below.
  5. Write out the result on a new file; either
    1. Replace the .java in the file name with .html; e.g. if your input file is Sample.java, your output file should be Sample.html; or
    2. Use a dialog box to choose an output file.
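
Steps 1 and 5.1 might be sketched as follows. The class and method names (FileNames, chooseInputFile, htmlNameFor) are invented for this example. Appending .html when the name does not end in .java is an assumption, since the assignment does not say what to do in that case.

```java
import java.io.File;
import javax.swing.JFileChooser;

class FileNames {
    // Step 1: show an "open" dialog; returns null if the user cancels.
    static File chooseInputFile() {
        JFileChooser chooser = new JFileChooser();
        int choice = chooser.showOpenDialog(null);
        return choice == JFileChooser.APPROVE_OPTION ? chooser.getSelectedFile() : null;
    }

    // Step 5.1: Sample.java becomes Sample.html.
    static String htmlNameFor(String javaName) {
        String suffix = ".java";
        if (javaName.endsWith(suffix)) {
            return javaName.substring(0, javaName.length() - suffix.length()) + ".html";
        }
        // Assumption: with no .java suffix, just append .html.
        return javaName + ".html";
    }
}
```

Keeping the name conversion separate from the dialog code lets you test it without opening a window.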

Entities

Since HTML uses the characters < (less than), > (greater than), and & (ampersand) for its own purposes, you must replace these in the Java program you are processing with the entities  &lt;  &gt;  and  &amp;,  respectively. Notice that the semicolon (;) is part of the entity.

The <, >, and & characters must always be replaced by their corresponding entities, but these are not the only characters that might be replaced by entities. Instead of writing special code for each of these three cases, consider writing more general code so that it is trivial to add other entities.
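
As a sketch of what "more general code" could look like, the replacements can live in a lookup table, so that adding an entity means adding one line. The names Entities, ENTITIES, and escape are hypothetical, and this version uses generics and StringBuilder from later Java releases.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Entities {
    // Table of characters and their replacements; add a line to add an entity.
    static final Map<Character, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put('<', "&lt;");
        ENTITIES.put('>', "&gt;");
        ENTITIES.put('&', "&amp;");
    }

    // Returns the entity for ch, or ch itself if it has no entity.
    static String escape(char ch) {
        String entity = ENTITIES.get(ch);
        return entity != null ? entity : String.valueOf(ch);
    }

    // Converts a whole string one character at a time, so the & inside
    // an entity that was just produced is never converted a second time.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            out.append(escape(text.charAt(i)));
        }
        return out.toString();
    }
}
```

For example, Entities.escape("a < b && c > d") returns "a &lt; b &amp;&amp; c &gt; d".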

Syntax coloring

Various editors perform various kinds of syntax coloring. There are two sets of options to keep in mind: (1) the kind of things that are recognized, and (2) how they are colored.

Some things the program might recognize:

  • keywords
  • comments (3 kinds)
  • quoted strings
  • singly-quoted characters
  • numbers
  • class/interface names
  • method names
    

Some ways it might "color" them:

  • boldface, <B>...</B>
  • italic, <I>...</I>
  • different font face, size, or color,
    <FONT FACE="Times"
          SIZE="2"
          COLOR="#663300"> ... </FONT>
     
  • combinations of the above

Just about any way you might "color" a part of the program, you can do by wrapping HTML tags around the text. You should put these tags in your program as named String constants (for example, START_KEYWORD and END_KEYWORD) instead of putting them directly in the code. That is,

String START_KEYWORD = "<B>";
String END_KEYWORD = "</B>";
...
System.out.print(START_KEYWORD + keyword + END_KEYWORD);
is much more flexible (and therefore better style) than
System.out.print("<B>" + keyword + "</B>");

Your syntax coloring assignment

Do syntax coloring for the following:

Use a different color or style for each of these.

This program is neither long nor complicated, but it is very different from most of the programs you have written. You do not need to create a lot of new classes for this assignment--probably just one class and a couple of methods is enough.

Programming notes

A comment is not a comment if it is embedded in a quoted string. Similarly, there are no strings inside comments. Keywords, such as for, while, and if are not keywords if they occur in a comment or quoted string, or as part of an identifier (such as form or knife). Singly-quoted characters, also, must be treated specially.

These complications can best be handled by a state machine. The main states that you need to implement are:

The "normal" state
In this state, keywords may be recognized. The beginning of comments, quoted strings, and singly-quoted characters may be recognized (each leading to a new state).
Inside a /* ... */ comment
In this state, which is entered by encountering a /*, no other symbols are treated as special except for */, which causes a transfer to the "normal" state. Note that javadoc comments, /** ... */, do not need to be treated differently, although you could do so if you wanted to. Note also that /*/ is not a complete comment (the second / does not end the comment), but that /**/ is complete. In addition, **/ also ends a comment.
Inside a // comment
In this state, which is entered by encountering a //, no other symbols are treated as special except for \n, the end-of-line character, which causes a transfer to the "normal" state.
Inside a string
This state is entered by encountering a " (double quote) symbol, and exited (to the normal state) by another ". Inside this state, the one character immediately following a \ (backslash) must be ignored. This requires a bit of care: \" by itself does not end the state, but \\" does (it's the backslash that's quoted, not the double-quote). Similarly, \\\" does not end the state, but \\\\" does. Finally, strings cannot extend over more than one line, so an end-of-line character ends the string (and is an error).
Inside a quoted character
This state is entered by encountering a ' (single quote) symbol, and exited (to the normal state) by another '. Inside this state, the one character immediately following a \ (backslash) must not be treated as special.
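
The string state described above might be sketched as follows, using a separate "just saw a backslash" state so that only one character is examined at a time. This is an illustration under stated assumptions, not a full solution: the names StringMarker, mark, START_STRING, and END_STRING are invented, italics are an arbitrary choice of "color", and comments, quoted characters, and entity conversion are all left out.

```java
class StringMarker {
    static final String START_STRING = "<I>";   // hypothetical choice of style
    static final String END_STRING = "</I>";

    static final int NORMAL = 1;
    static final int IN_STRING = 2;
    static final int IN_STRING_ESCAPE = 3;      // just saw a backslash inside a string

    static String mark(String source) {
        StringBuilder out = new StringBuilder();
        int state = NORMAL;
        for (int i = 0; i < source.length(); i++) {
            char ch = source.charAt(i);
            switch (state) {
                case NORMAL:
                    if (ch == '"') {
                        out.append(START_STRING).append(ch);
                        state = IN_STRING;
                    } else {
                        out.append(ch);
                    }
                    break;
                case IN_STRING:
                    if (ch == '\\') {
                        out.append(ch);
                        state = IN_STRING_ESCAPE;
                    } else if (ch == '"') {
                        out.append(ch).append(END_STRING);
                        state = NORMAL;
                    } else if (ch == '\n') {
                        // Unclosed string: close the tag at the end of the line.
                        out.append(END_STRING).append(ch);
                        state = NORMAL;
                    } else {
                        out.append(ch);
                    }
                    break;
                case IN_STRING_ESCAPE:
                    if (ch == '\n') {
                        // A string cannot continue past the end of the line.
                        out.append(END_STRING).append(ch);
                        state = NORMAL;
                    } else {
                        // Whatever follows the backslash is not special.
                        out.append(ch);
                        state = IN_STRING;
                    }
                    break;
            }
        }
        if (state != NORMAL) out.append(END_STRING);   // close any open tag
        return out.toString();
    }
}
```

The quoted-character state can be handled the same way, with ' in place of ".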

However, these are just the "main" states that your state machine might have. For example, if you encounter a / in the "normal" state, you might go into a "just saw a slash" state. From this state, a second / could put you in the "// comment" state, while a star could put you into the "/* comment" state; any other character would return you to the "normal" state. The same sort of trick can be used elsewhere, so that you only need to process a single character at a time.
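
Here is a sketch of the comment states with the "just saw a slash" trick, plus a matching "just saw a star" state inside a /* comment. It handles comments only, with no strings, quoted characters, keywords, or entity conversion. The class name, state names, and the comment color are assumptions made for the example.

```java
class CommentMarker {
    static final String START_COMMENT = "<FONT COLOR=\"#006600\">"; // hypothetical color
    static final String END_COMMENT = "</FONT>";

    static final int NORMAL = 1;
    static final int SAW_SLASH = 2;        // saw one /, not yet written out
    static final int LINE_COMMENT = 3;     // inside a // comment
    static final int BLOCK_COMMENT = 4;    // inside a /* comment
    static final int BLOCK_SAW_STAR = 5;   // inside a /* comment, just saw a *

    static String mark(String source) {
        StringBuilder out = new StringBuilder();
        int state = NORMAL;
        for (int i = 0; i < source.length(); i++) {
            char ch = source.charAt(i);
            switch (state) {
                case NORMAL:
                    if (ch == '/') state = SAW_SLASH;  // hold the slash back for now
                    else out.append(ch);
                    break;
                case SAW_SLASH:
                    if (ch == '/') {
                        out.append(START_COMMENT).append("//");
                        state = LINE_COMMENT;
                    } else if (ch == '*') {
                        out.append(START_COMMENT).append("/*");
                        state = BLOCK_COMMENT;
                    } else {
                        out.append('/').append(ch);    // it was only a division sign
                        state = NORMAL;
                    }
                    break;
                case LINE_COMMENT:
                    if (ch == '\n') {
                        out.append(END_COMMENT).append(ch);
                        state = NORMAL;
                    } else {
                        out.append(ch);
                    }
                    break;
                case BLOCK_COMMENT:
                    out.append(ch);
                    if (ch == '*') state = BLOCK_SAW_STAR;
                    break;
                case BLOCK_SAW_STAR:
                    out.append(ch);
                    if (ch == '/') {
                        out.append(END_COMMENT);
                        state = NORMAL;
                    } else if (ch != '*') {
                        state = BLOCK_COMMENT;         // another * keeps us in this state
                    }
                    break;
            }
        }
        if (state == SAW_SLASH) out.append('/');       // write out the held slash
        else if (state != NORMAL) out.append(END_COMMENT);
        return out.toString();
    }
}
```

Because the star in /* leads to BLOCK_COMMENT rather than BLOCK_SAW_STAR, the text /*/ is correctly treated as an unfinished comment, while /**/ and **/ both end one.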

If your state machine does not end in the "normal" state, you should close whatever HTML tag is currently open.

How to implement a state machine

The best way to implement a state machine is with a switch statement inside a while loop. The switch statement chooses a block of code based on the current state, and the while loop exits when the state machine is done.

The following example marks every sequence of digits as bold. Since the length of the string is known beforehand, a for loop is used instead of a while loop.


public class StateMachine {
    
    String testString = "Testing...1...2...3974...end test 1";
    StringBuffer result = new StringBuffer();
    final int NORMAL = 1;
    final int NUMBER = 2;
    
    public static void main(String args[]) {
        StateMachine machine = new StateMachine();
        machine.run();
    }
    
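    void run() {
        int state = NORMAL;
        // Examine one character at a time; the current state decides what to do with it.
        for (int i = 0; i < testString.length(); i++) {
            char ch = testString.charAt(i);
            switch (state) {
                case NORMAL:
                    if (Character.isDigit(ch)) {
                        // First digit of a number: open the bold tag.
                        result.append("<b>" + ch);
                        state = NUMBER;
                    }
                    else {
                        result.append(ch);
                    }
                    break;
                case NUMBER:
                    if (!Character.isDigit(ch)) {
                        // First non-digit after a number: close the bold tag.
                        result.append("</b>" + ch);
                        state = NORMAL;
                    }
                    else {
                        result.append(ch);
                    }
                    break;
            }
        }
        // If the input ended in the middle of a number, close the open tag.
        if (state == NUMBER) result.append("</b>");
        System.out.println(testString);
        System.out.println(result);
    }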
}
If you use a test string such as the one in the above program, remember that certain characters must be escaped with a backslash in string literals. For example, the string backslash is "\" must be written as "backslash is \"\\\"".

Possible extra credit

For 20 points extra credit, make all keywords bold (including the ones we haven't talked about in class).

While it is possible to use a state machine to recognize keywords, doing so will result in a very large number of states. Here is a better solution: when you encounter a letter, go into a state that collects all the letters of the word; then check whether it is a keyword.
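
The word-collecting idea could be sketched like this. The keyword list is deliberately partial, and the names KeywordMarker, mark, and flush are invented for the example. It also makes one change from the description above: it collects whole identifiers (letters, digits, underscores) using Character.isJavaIdentifierStart and Character.isJavaIdentifierPart, so that a name such as if2 is not mistaken for a keyword. Comments and strings are ignored here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class KeywordMarker {
    static final String START_KEYWORD = "<B>";
    static final String END_KEYWORD = "</B>";

    // Partial list, for illustration only; a real solution needs all keywords.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "for", "while", "if", "else", "int", "class", "return"));

    static final int NORMAL = 1;
    static final int IN_WORD = 2;   // collecting the characters of a word

    static String mark(String source) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        int state = NORMAL;
        for (int i = 0; i < source.length(); i++) {
            char ch = source.charAt(i);
            switch (state) {
                case NORMAL:
                    if (Character.isJavaIdentifierStart(ch)) {
                        word.append(ch);
                        state = IN_WORD;
                    } else {
                        out.append(ch);
                    }
                    break;
                case IN_WORD:
                    if (Character.isJavaIdentifierPart(ch)) {
                        word.append(ch);
                    } else {
                        flush(word, out);   // the word is complete; decide how to write it
                        out.append(ch);
                        state = NORMAL;
                    }
                    break;
            }
        }
        if (state == IN_WORD) flush(word, out);
        return out.toString();
    }

    // Writes the collected word, wrapped in tags if it is a keyword, then empties it.
    static void flush(StringBuilder word, StringBuilder out) {
        String text = word.toString();
        if (KEYWORDS.contains(text)) {
            out.append(START_KEYWORD).append(text).append(END_KEYWORD);
        } else {
            out.append(text);
        }
        word.setLength(0);
    }
}
```

With this sketch, mark("for (int form = 0; knife; )") returns "<B>for</B> (<B>int</B> form = 0; knife; )": the identifiers form and knife are left alone.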

Due date

Please turn in your program via Blackboard by midnight, December 10.