Parsing, you will recall, is the process of turning a stream of tokens into an abstract syntax tree. Any parsing technique requires a grammar--a formal, detailed definition of what sequence of symbols constitutes a syntactically correct program. Grammars are usually defined in BNF notation, which we will explore shortly.
For industrial-strength compilers, parsers are usually written in the yacc
(Yet Another Compiler Compiler) formalism. Yacc is a program that
reads a BNF-like grammar, creates the abstract syntax tree, and produces
object code. Yacc is very powerful, but also quite complex, and
there are subtle, difficult-to-understand constraints on the BNF that can be
used.
Recursive descent parsing is an alternative to using yacc. Whereas
yacc reads BNF and produces a parser, recursive descent is a technique for doing
the same thing manually. That is, once you have the BNF for a language, you
can apply cookbook rules to write a parser that implements the BNF. With a little
practice, you can turn BNF into code almost as fast as you can type.
Recursive descent does have some drawbacks. While it is easy to detect syntax errors and produce error messages, it is more of a challenge to produce error messages that are helpful to the user. The typical error message simply tells what token was found, and the set of token types that would have been legal in that position.
We will discuss first the simpler problem of recognizing a sequence
of tokens. Recognition is like parsing, except that no parse tree is built.
Instead, the result of recognition is a simple boolean value: true
if the input stream is a syntactically correct program, and false
otherwise. Turning a recognizer into a parser is simply a matter of adding tree-building
code to each method, and returning the tree rather than a boolean value.
Each nonterminal in the BNF is represented by a single method (or function,
if you are using a non-object-oriented language) in the recursive descent
parser. Typically, the current token is passed in as a parameter, though it
may instead be maintained as a global variable. We will assume a Token type,
with accessible fields type (containing a flag that indicates
to which of several categories the token belongs) and value (containing
the actual characters that make up the token). We will further assume that
we have a Tokenizer class with a method nextToken() to get the
next Token from some input stream.
When a method for a nonterminal succeeds, it should "consume" (use up) exactly those tokens that constitute the nonterminal. If the method fails (does not recognize the expected nonterminal), it should return without consuming any tokens--this may involve "putting back" some tokens. Failure is not, in general, an error, since we may need to test for several nonterminals before finding the correct one.
The very simplest BNF rules are those that define a nonterminal as identical to another nonterminal. For example, we might define a <variable> as being identical to a <name>. This yields the following code:
boolean variable(Token tok) {
return name(Token tok);
}
Almost as simple is the case where a nonterminal is defined as consisting of a single terminal, for example, if <less than> is defined as "<". This requires that the next token be used and compared to the desired token.
boolean lessThan(Token tok) {
return tok.value.equals("<");
}
A BNF rule may define alternative definitions for a nonterminal. For example,
<operator> ::= "+" | "-" | "*" | "/"
This is handled by testing for each alternative in turn, and returning true as soon as any alternative is satisfied, or false if none are satisfied.
boolean operator(Token tok) {
if (tok.value.equals("+")) return true;
if (tok.value.equals("-")) return true;
if (tok.value.equals("*")) return true;
if (tok.value.equals("/")) return true;
return false;
}
Some BNF rules consist of a simple sequence of terminals and nonterminals. If it were not for the need to return a boolean result, we could implement this with a simple sequence of method calls. For example,
<assignment> ::= <variable> "=" <expression>
could almost be coded as
boolean assignment(Token tok) {
variable(tok);
equalsSign(tok);
expression(tok);
}
There are several problems with this. First, of course, the method has to return a boolean to tell whether it succeeded. Second, after each of the first two calls, a new token has to be acquired. Finally, if a failure occurs after new token have been taken, those tokens have to be "put back" so they can be reused by some other rule. The resultant code is somewhat cluttered, though not particularly complex:
boolean assignment(Token tok) {
Token tok2, tok3;
if (variable(tok)) {
tok2 = tokenizer.nextToken();
} else {
return false;
}
if (equalsSign(tok2)) {
tok3 = tokenizer.nextToken();
} else {
tokenizer.pushBack(tok2);
return false;
}
if (expression(tok3)) {
return true;
} else {
tokenizer.pushBack(tok3); // notice order of pushBacks
tokenizer.pushBack(tok2);
return false;
}
}
Since this pattern occurs frequently, we can simplify the code if our Tokenizer class
has a way to "back up" and give us the same Tokens over again. This can be
done either by having the Tokenizer
class keep track of the last several tokens that it has returned, or by keeping
track of the Tokens ourselves, and telling the Tokenizer class to take back
those Tokens. For now, let's assume that the Tokenizer has a pushBack(int) method
to reuse the most recent int
tokens. Moreover, since we always return false after calling pushBack,
we can simplify our code ever further by having pushBack itself
return false.
boolean assignment(Token tok) {
if (variable(tok)) {
tok = tokenizer.nextToken();
} else {
return false;
}
if (equalsSign(tok)) {
tok = tokenizer.nextToken();
} else {
return tokenizer.pushBack(1);
}
if (expression(tok)) {
return true;
} else {
return tokenizer.pushBack(2);
}
}
There is a further problem, in that sometimes we may not know exactly how many
tokens have been consumed, and must be returned. Consider the following BNF:
<something> ::= <foo><bar>
If <foo> is recognized but <bar> is not, we must replace the (unknown) number of tokens recognized by <foo>. This can be handled by having the Tokenizer class keep a count of the number of tokens read, and the ability to reuse tokens back to a given token number. Thus, the above code could be represented by something like:
boolean something(Token tok) {
int oldTokenCount = tokenizer.getTokenCount();
if (foo(tok)) {
tok = tokenizer.nextToken();
} else {
return false;
}
if (bar(tok)) {
return true;
} else {
return tokenizer.pushBack(tokenizer.getTokenCount() - oldTokenCount);
}
}
Sequencing combined with alternation is, in most cases, simply handled. For example,
<integer> ::= <unsigned integer> | "-" <unsigned integer>
boolean unsignedInteger(Token tok) {
if (number(tok)) return true;
if (if (tok.value.equals("-"))) {
tok = tokenizer.nextToken();
} else {
return false;
}
if (number(tok)) return true;
} else {
return tokenizer.pushBack(1);
}
}
pushBackIf we have a Tokenizer without adequate pushBack facilities,
it is almost always better to wrap that Tokenizer in a "front end" class of
our own that provides these facilities. But if we only need pushBack in a very
few places, sometimes it is better to modify the grammar so as to avoid the
need for reusing Tokens.
For example,
<expression> ::= <term> "+" <expression> | <term> "-" <expression> | <term>
Without pushBack, if we find a <term> but don't find a "+" after it, we can't put back the <term> for use in another alternative. However, the problem goes away entirely if we rewrite the grammar slightly:
<expression> ::= <term> <expression extender>
<expression extender> ::= "+" <expression> | "-" <expression> | <empty>
Many BNF rules are recursive: Statements can contain statements, and expressions
can contain expressions. For a first example, though, we will use something
simpler: an "unsigned integer." The idea here is that an integer contains
other integers, as for example the integer 12345 contains the integer
345. In practice, of course, recognizing an integer would be done
by the tokenizer, not by the parser.
There are four obvious recursive ways to define an unsigned integer, as follows:
// left recursive, base case first
<unsigned integer> ::= <digit> | <unsigned integer> <digit>
// right recursive, base case first
<unsigned integer> ::= <digit> | <digit> <unsigned integer>
// left recursive, base case last
<unsigned integer> ::= <unsigned integer><digit> | <digit>
// right recursive, base case last
<unsigned integer> ::= <digit><unsigned integer> | <digit>
When writing a recursive program, it is generally a good idea to put the base cases first. However, in this case, that strategy does not work. Our recursive method for <unsigned integer> would begin like this:
boolean unsignedInteger(Token tok) {
if (tok.type == Token.DIGIT) return true;
...
}
Clearly, with this code, the second and subsequent digits would never be recognized. This suggests an important principle:
| If one alternative is an initial substring of another alternative, the longer alternative should be attempted first. |
Left recursion has an even worse problem. A left-recursive method would begin as follows:
boolean unsignedInteger(Token tok) {
if (unsignedInteger(tok)) {
...
}
Any time a recursive call is made with exactly the same parameter values as the calling method, the result is an "infinite" recursion. Depending on the implementation, such a program will either hang or will terminate abnormally. This suggests a second important principle:
| Avoid left recursion! |
This leaves us with trying to define an unsigned integer with a right-recursive program, with the base case last:
// <unsigned integer> ::= <digit><unsigned integer> | <digit>
boolean unsignedInteger(Token tok) {
// first case: <unsigned integer> ::= <digit><unsigned integer>
if (tok.type == Token.DIGIT) {
tok = tokenizer.nextToken();
} else {
return false;
}
if (unsignedInteger(tok)) {
return true;
} else {
tokenizer.pushBack(1);
tok = currentToken();
}
// second case: <unsigned integer> ::= <digit>
if (tok.type == Token.DIGIT) {
tok = tokenizer.nextToken();
} else {
return false;
}
}
When the first case in the above code fails, we restore everything to the way it was on method entry, before going on to the second case. This involves putting back the tokens we consumed in the failed attempt, and resetting the tok variable to what it was upon method entry.
The above code could be somewhat simplified, but at the moment we are concerned with describing a fairly mechanical way of turning BNF into code.
Usually recursion involves two or more rules, as for example,
<statement> ::= ... | <if statement> | ... <if statement> ::= if ( <condition> ) <statement>
Neither of these rules is recursive in itself, but each refers to the other, and together they form an indirect recursion. Usually, nothing special must be done to handle indirect recursion; the only case to avoid is when the very first element of each nonterminal is the other nonterminal, so that together the rules form an infinite recursion. This is very unlikely to occur in practice.
Although not present in original BNF, Extended BNF (EBNF) allows optional elements,
enclosed in brackets. Thus, for example, the following indicates that the <else
part> is optional:
<if statement> ::= <if part> [ <else part> ]
This can be implemented quite easily. All we need do is attempt to parse the optional element, and return true whether or not the parse succeeds.
boolean ifStatement(Token tok) {
if (ifPart(tok)) {
tok = tokenizer.nextToken();
} else {
return false;
}
elsePart(tok); // ignore result
return true;
}
EBNF also allows zero or more repetitions of an element, enclosed in braces. For example,
<list> ::= <item> { "," <item> }
that is, a <list> consists of one or more <item>s, separated by commas. This construction is best handled with a loop, rather than with recursion.
boolean list(Token tok) {
if (item(tok)) {
tok = tokenizer.nextToken();
while (tok.value.equals(",")) {
tok = tokenizer.nextToken();
if (!item(tok)) error();
}
return true;
}
return false;
}
EBNF adds no expressive power to BNF; that is, any grammar that can be written with EBNF could also be written with just BNF. However, just as EBNF makes it simpler to write grammars, it also makes it easier to write recognizers for those grammars.