Recursive Descent Parsing

Parsing, you will recall, is the process of turning a stream of tokens into an abstract syntax tree. Any parsing technique requires a grammar--a formal, detailed definition of what sequence of symbols constitutes a syntactically correct program. Grammars are usually defined in BNF notation, which we will explore shortly.

For industrial-strength compilers, parsers are usually written in the yacc (Yet Another Compiler Compiler) formalism. Yacc is a program that reads a BNF-like grammar, creates the abstract syntax tree, and produces object code. Yacc is very powerful, but also quite complex, and there are subtle, difficult-to-understand constraints on the BNF that can be used.

Recursive descent parsing is an alternative to using yacc. Whereas yacc reads BNF and produces a parser, recursive descent is a technique for doing the same thing manually. That is, once you have the BNF for a language, you can apply cookbook rules to write a parser that implements the BNF. With a little practice, you can turn BNF into code almost as fast as you can type.

Recursive descent does have some drawbacks. While it is easy to detect syntax errors and produce error messages, it is more of a challenge to produce error messages that are helpful to the user. The typical error message simply tells what token was found, and the set of token types that would have been legal in that position.

Turning BNF into code

We will discuss first the simpler problem of recognizing a sequence of tokens. Recognition is like parsing, except that no parse tree is built. Instead, the result of recognition is a simple boolean value: true if the input stream is a syntactically correct program, and false otherwise. Turning a recognizer into a parser is simply a matter of adding tree-building code to each method, and returning the tree rather than a boolean value.

Each nonterminal in the BNF is represented by a single method (or function, if you are using a non-object-oriented language) in the recursive descent parser. Typically, the current token is passed in as a parameter, though it may instead be maintained as a global variable. We will assume a Token type, with accessible fields type (containing a flag that indicates to which of several categories the token belongs) and value (containing the actual characters that make up the token). We will further assume that we have a Tokenizer class with a method nextToken() to get the next Token from some input stream.

When a method for a nonterminal succeeds, it should "consume" (use up) exactly those tokens that constitute the nonterminal. If the method fails (does not recognize the expected nonterminal), it should return without consuming any tokens--this may involve "putting back" some tokens. Failure is not, in general, an error, since we may need to test for several nonterminals before finding the correct one.

Delegation

The very simplest BNF rules are those that define a nonterminal as identical to another nonterminal. For example, we might define a <variable> as being identical to a <name>. This yields the following code:

boolean variable(Token tok) {
    return name(Token tok);
}

Almost as simple is the case where a nonterminal is defined as consisting of a single terminal, for example, if <less than> is defined as "<". This requires that the next token be used and compared to the desired token.

boolean lessThan(Token tok) {
	return tok.value.equals("<");
}

Alternatives

A BNF rule may define alternative definitions for a nonterminal. For example,

<operator> ::= "+" | "-" | "*" | "/"

This is handled by testing for each alternative in turn, and returning true as soon as any alternative is satisfied, or false if none are satisfied.

boolean operator(Token tok) {
    if (tok.value.equals("+")) return true;
    if (tok.value.equals("-")) return true;
    if (tok.value.equals("*")) return true;
    if (tok.value.equals("/")) return true;
    return false;
}

Sequencing

Some BNF rules consist of a simple sequence of terminals and nonterminals. If it were not for the need to return a boolean result, we could implement this with a simple sequence of method calls. For example,

<assignment> ::= <variable> "=" <expression>

could almost be coded as

boolean assignment(Token tok) {
    variable(tok);
    equalsSign(tok);
    expression(tok);
}

There are several problems with this. First, of course, the method has to return a boolean to tell whether it succeeded. Second, after each of the first two calls, a new token has to be acquired. Finally, if a failure occurs after new token have been taken, those tokens have to be "put back" so they can be reused by some other rule. The resultant code is somewhat cluttered, though not particularly complex:

boolean assignment(Token tok) {
    Token tok2, tok3;
 
    if (variable(tok)) {
        tok2 = tokenizer.nextToken();
    } else {
        return false;
    }
    if (equalsSign(tok2)) {
        tok3 = tokenizer.nextToken();
    } else {
        tokenizer.pushBack(tok2);
        return false;
    }
    if (expression(tok3)) {
        return true;
    } else {
        tokenizer.pushBack(tok3); // notice order of pushBacks
        tokenizer.pushBack(tok2);
        return false;
    }
}

Since this pattern occurs frequently, we can simplify the code if our Tokenizer class has a way to "back up" and give us the same Tokens over again. This can be done either by having the Tokenizer class keep track of the last several tokens that it has returned, or by keeping track of the Tokens ourselves, and telling the Tokenizer class to take back those Tokens. For now, let's assume that the Tokenizer has a pushBack(int) method to reuse the most recent int tokens. Moreover, since we always return false after calling pushBack, we can simplify our code ever further by having pushBack itself return false.

boolean assignment(Token tok) { 
    if (variable(tok)) {
        tok = tokenizer.nextToken();
    } else {
        return false;
    }
    if (equalsSign(tok)) {
        tok = tokenizer.nextToken();
    } else {
        return tokenizer.pushBack(1);
    }
    if (expression(tok)) {
        return true;
    } else {
        return tokenizer.pushBack(2);
    }
}


There is a further problem, in that sometimes we may not know exactly how many tokens have been consumed, and must be returned. Consider the following BNF:

<something> ::= <foo><bar>

If <foo> is recognized but <bar> is not, we must replace the (unknown) number of tokens recognized by <foo>. This can be handled by having the Tokenizer class keep a count of the number of tokens read, and the ability to reuse tokens back to a given token number. Thus, the above code could be represented by something like:

boolean something(Token tok) {
    int oldTokenCount = tokenizer.getTokenCount();
    if (foo(tok)) {
        tok = tokenizer.nextToken();
    } else {
        return false;
    }
    if (bar(tok)) {
        return true;
    } else {
        return tokenizer.pushBack(tokenizer.getTokenCount() - oldTokenCount);
    }
}
        

Sequencing and Alternation

Sequencing combined with alternation is, in most cases, simply handled. For example,

<integer> ::= <unsigned integer> |  "-" <unsigned integer>

boolean unsignedInteger(Token tok) {
    if (number(tok)) return true;

    if (if (tok.value.equals("-"))) {
        tok = tokenizer.nextToken();
    } else {
        return false;
    }
    if (number(tok)) return true;
    } else {
        return tokenizer.pushBack(1);
    }
}

Living without pushBack

If we have a Tokenizer without adequate pushBack facilities, it is almost always better to wrap that Tokenizer in a "front end" class of our own that provides these facilities. But if we only need pushBack in a very few places, sometimes it is better to modify the grammar so as to avoid the need for reusing Tokens.

For example,

<expression> ::= <term> "+" <expression> | <term> "-" <expression> | <term>

Without pushBack, if we find a <term> but don't find a "+" after it, we can't put back the <term> for use in another alternative. However, the problem goes away entirely if we rewrite the grammar slightly:

<expression> ::= <term> <expression extender>

<expression extender> ::= "+" <expression> | "-" <expression> | <empty>

Direct Recursion

Many BNF rules are recursive: Statements can contain statements, and expressions can contain expressions. For a first example, though, we will use something simpler: an "unsigned integer." The idea here is that an integer contains other integers, as for example the integer 12345 contains the integer 345. In practice, of course, recognizing an integer would be done by the tokenizer, not by the parser.

There are four obvious recursive ways to define an unsigned integer, as follows:

// left recursive, base case first
<unsigned integer> ::= <digit> | <unsigned integer> <digit>

// right recursive, base case first
<unsigned integer> ::= <digit> | <digit> <unsigned integer>

// left recursive, base case last
<unsigned integer> ::= <unsigned integer><digit> | <digit>

// right recursive, base case last
<unsigned integer> ::= <digit><unsigned integer> | <digit>

When writing a recursive program, it is generally a good idea to put the base cases first. However, in this case, that strategy does not work. Our recursive method for <unsigned integer> would begin like this:

boolean unsignedInteger(Token tok) {
    if (tok.type == Token.DIGIT) return true;
    ...
}

Clearly, with this code, the second and subsequent digits would never be recognized. This suggests an important principle:

If one alternative is an initial substring of another alternative, the longer alternative should be attempted first.

Left recursion has an even worse problem. A left-recursive method would begin as follows:

boolean unsignedInteger(Token tok) {
    if (unsignedInteger(tok)) {
        ...
}

Any time a recursive call is made with exactly the same parameter values as the calling method, the result is an "infinite" recursion. Depending on the implementation, such a program will either hang or will terminate abnormally. This suggests a second important principle:

Avoid left recursion!

This leaves us with trying to define an unsigned integer with a right-recursive program, with the base case last:

// <unsigned integer> ::= <digit><unsigned integer> | <digit>

boolean unsignedInteger(Token tok) {

    // first case: <unsigned integer> ::= <digit><unsigned integer>
    if (tok.type == Token.DIGIT) {
        tok = tokenizer.nextToken();
    } else {
        return false;
    }
    if (unsignedInteger(tok)) {
        return true;
    } else {
        tokenizer.pushBack(1);
        tok = currentToken();
    }

    // second case: <unsigned integer> ::= <digit>
    if (tok.type == Token.DIGIT) {
        tok = tokenizer.nextToken();
    } else {
        return false;
    }
}

When the first case in the above code fails, we restore everything to the way it was on method entry, before going on to the second case. This involves putting back the tokens we consumed in the failed attempt, and resetting the tok variable to what it was upon method entry.

The above code could be somewhat simplified, but at the moment we are concerned with describing a fairly mechanical way of turning BNF into code.

Indirect Recursion

Usually recursion involves two or more rules, as for example,

<statement> ::= ... | <if statement> | ...
<if statement> ::= if ( <condition> ) <statement>

Neither of these rules is recursive in itself, but each refers to the other, and together they form an indirect recursion. Usually, nothing special must be done to handle indirect recursion; the only case to avoid is when the very first element of each nonterminal is the other nonterminal, so that together the rules form an infinite recursion. This is very unlikely to occur in practice.

Optional Elements

Although not present in original BNF, Extended BNF (EBNF) allows optional elements, enclosed in brackets. Thus, for example, the following indicates that the <else part> is optional:

<if statement> ::= <if part> [ <else part> ]

This can be implemented quite easily. All we need do is attempt to parse the optional element, and return true whether or not the parse succeeds.

boolean ifStatement(Token tok) {
    if (ifPart(tok)) {
        tok = tokenizer.nextToken();
    } else {
        return false;
    }
    elsePart(tok); // ignore result
    return true;
}

Repeated Elements

EBNF also allows zero or more repetitions of an element, enclosed in braces. For example,

<list> ::= <item> { "," <item> }

that is, a <list> consists of one or more <item>s, separated by commas. This construction is best handled with a loop, rather than with recursion.

boolean list(Token tok) {
    if (item(tok)) {
        tok = tokenizer.nextToken();
        while (tok.value.equals(",")) {
            tok = tokenizer.nextToken();
            if (!item(tok)) error();
        }
        return true;
    }
    return false;
}

EBNF adds no expressive power to BNF; that is, any grammar that can be written with EBNF could also be written with just BNF. However, just as EBNF makes it simpler to write grammars, it also makes it easier to write recognizers for those grammars.