CIT 594 BNF for BNF
Spring 2013, David Matuszek

BNF is a metalanguage--a language for describing languages. As such, we can use BNF to describe BNF. We just have to take special care to distinguish between symbols in the language and symbols in the metalanguage.

Here is my attempt at describing BNF in BNF. This has not been machine tested, so may have errors. Use at your own risk! (But please inform me of any corrections!)

<BNF> ::= { <rule> }.
There are no terminals here. <BNF> and <rule> are nonterminals, ::=, {, }, and the terminating . are metasymbols. This says that a "BNF" consists of zero or more rules.
<rule> ::= <nonterminal> \::= <definition> { \| <definition> } \. .
This says a rule consists of a nonterminal, the ::= symbol, and one or more definitions separated by "or" (|) symbols.
<definition> ::= { <term> }.
A definition consists of a sequence of zero or more "terms" (see below). Notice that (1) a definition does not end in a period; that's a metasymbol marking the end of this rule; and (2) since we might have zero terms, it's possible to write a rule like "<empty> ::= .".
<term> ::= <terminal> | <nonterminal> | <option> | <any number of>.
A term may be a terminal, an nonterminal, an option (optional element), or something repeated zero or more times.
<option> ::= \[ <definition> \].
An option is any definition enclosed in brackets. It's important to understand that, as with all BNF, this is a description of syntax. The brackets in our metalanguage, BNF, mean that the enclosed definition is optional, but those are not the brackets we see here--the brackets here are in the target language, the language being described, and whatever they might mean in the target language is irrelevant to us here.
<any number of> ::= \{ <definition> \}.
An "any number of" consists of an open brace, a definition, and a close brace. Again, this is a description of syntax, not meaning. The rule does not say that an "any number of" is a sequence of zero or more definitions, though that may be what it means in the target language.

The above BNF does not define <terminal> or <nonterminal>. This would be possible, but long and complicated, as we would have to list every possible character. Instead, we will leave it up to our Tokenizer to provide terminals and nonterminals as tokens. Separating the concerns in this way makes parsing much easier.

Issues with characters

Consider this string:  abc\nxyz

Inside a Java program, you might say String s = "abc\nxyz"; . You now have a string contain seven characters, one of which is a newline. But if you put the above characters in a file, and read in that file, you will get eight characters, one of which is a backslash and another is an n.

In the same way, suppose you wanted to write a BNF definition of a "newline." If you read the BNF in from a file, you might find
<newline> ::= \n.
but in your unit tests (which are Java code!) your test might be
assertEquals("\\n", tokenizer.next().getValue());

There are two basic problems Java solves with backslashes:

There are two problems my assignment is trying to solve with backslashes:

Java allows certain characters, but not others, to follow a backslash. I'm trying to do something simpler, by saying that any character except 'n' or 't' following a backslash means that character, not a metasymbol. Why am I making an exception for 'n' and 't'? Because without that, you could never write a BNF description (in my notation) for Python. I'm not planning to do that, but why make it impossible?

Delimiters (especially whitespace)

I can't just say "Use BNF to describe a language." I have to be specific about how to handle things like braces in a Java program. So I have to specify some details about how to write BNF. I've tried to make those details as simple and unannoying as possible. But...

What are delimiters? That is, what separates one token from another? The basic answer is, whitespace and BNF metasymbols. (Exception: Whitespace within a nonterminal, so we can have multiword names.) So <bit> ::= 1 | 0 . and <bit>::=1|0. are equivalent.

What is whitespace? If I say <snack> ::= hot dog. (on a file), there are two tokens in the definition, but what if I say <snack> ::= hot\ dog. ? How many tokens are in the definition part of this rule? One, two, or maybe even three? I'd normally consider it to be one token (containing 7 characters), but it's a goofy thing to do, and if you want to do something else, I'm okay with that.

Whitespace is an unusual thing to want as part of a terminal, but other characters are not. For example, Java has, among other interesting cases, the "or" operator, "||". We need to be able to create this as a single token, and the obvious way to do that is to write "\|\|", so this should work. For consistency, then, probably "hot\ dog" should be a single token.

Whatever you do, you should (1) strive for consistency, and (2) document it in your Javadoc.