CIT 594 Creating a Parser for Tree Expressions
Spring 2015, David Matuszek

The most difficult method in the Tree API assignment is the parse method. There are many possible ways to write this method; most of them result in horrible code that is extremely difficult to debug. It doesn't need to be that bad.

The Parser

public static Tree<String> parse(String treeDescription)
Translates a String description of a tree into a Tree<String> object. The treeDescription has the form value(child child ... child), where a value is any sequence of characters not containing parentheses or whitespace, and each child is either just a (String) value or is another treeDescription. Whitespace is optional except where needed to separate values.

For example, the String "one (two three (four five(six seven eight)) )" becomes the Tree:
parsed tree
 

The input string can be described as follows:

<treeDescription> ::= <value> [ "(" { <treeDescription> } ")" ]

That is, a <treeDescription> is either a single <value>, or a <value> followed by any number of <treeDescription>s enclosed in parentheses. The <value> is any sequence of characters not containing parentheses or whitespace (the tokenizer finds parentheses and values for you).

Here is some pseudocode to help you out. You will still need to understand the method in order to get it to work.

Tree parse(input_string) {
    Create a tokenizer for the input_string;
    return parse(tokenizer);
}

Tree parse(tokenizer) {
    Get a value (token) from the tokenizer if you can,
        or return null if you can't;
    Make a tree with this value in the root;
    If the next token is an open parenthesis {
        Recur to get a (sub)tree;
        While the new (sub)tree isn't null {
            Add the tree as a child to the root;
            Recur to get another (sub)tree;
        }
        Make sure the next token is a close parenthesis;
    }
    Return the root;
}

The red lines above are what you need to add to create the tree from the description.

Understanding the Tokenizer

The parse method is much simpler if it can work with "tokens" rather than raw strings. It helps to have a specialized tokenizer that returns four kinds of things: Left parentheses, right parentheses, "values" (a sequence of non-whitespace, non-parenthesis characters), and an end-of-string indicator. Also, it turns out that the parse method will need to be able to "push back" a token that it has read and doesn't want.

You have two obvious choices:

My approach was the second of these: Use a StringTokenizer to do the majority of the work, but add pushback capability to it. Here's the code:

static class PushbackStringTokenizer {
    private StringTokenizer tokenizer;
    private String pushedValue = null;
    
    PushbackStringTokenizer(String input) {
        tokenizer = new StringTokenizer(input, "\t\n\r\f ()", true);
        pushedValue = null;
    }
    
    boolean hasNext() {
        return pushedValue != null || tokenizer.hasMoreTokens();
    }
    
    String next() {
        String temp = pushedValue;
        if (temp == null && tokenizer.hasMoreTokens()) {
            temp = tokenizer.nextToken();
        }
        pushedValue = null;
        // skip whitespace tokens
if (temp != null && temp.length() == 0) {
temp = next();
} return temp; } void pushBack(String token) { pushedValue = token; } }

I've removed the Javadoc from this code in order to make it short enough to see all at once; when you already know what the methods are supposed to do, the Javadoc just gets in the way.

Unlike most tokenizers, if you call next() when there are no more tokens, it will just return null rather than throw an exception. I think this makes it easier to write and easier to use.

Since this tokenizer is specialized to support the tree parser, I made it an inner class of the Tree class.