CIT 594 Parser for "Bugs" Language
Spring 2015, David Matuszek

Purposes

General Idea

Your assignment is to write a parser for programs in the "Bugs" language. A parser is a program that, given a stream of tokens, creates a tree that represents the structure of the program represented by that stream of tokens. Such a tree is called a parse tree or abstract syntax tree. The meaning of the program, if any, is outside the scope of this assignment.

Important data structures

We are dealing with fairly large data structures, and it is important to be consistent. Otherwise, things get too confusing. Here is the design:

You will find it essential to remember this design and keep it in mind as you program.

The Parser class

I have posted starter files Parser.java and ParserTest.java, and helper file TreeParser.java and TreeParserTest.java, on GitHub, as well as updated Token.java and new TokenTest.java. (The Token class has a new constructor which just takes a String value, and figures out the TokenType for itself.) Begin by copying these files into your Bugs project.

Next, copy all the methods that you wrote for the Recognizer into Parser.java. Don't replace the methods that I provided (helper methods and everything relating to expression handling). My provided methods have been enhanced to produce parse trees whenever they recognize something. You will be doing the same to the methods that you wrote. We will use the created parse trees in a later assignment.

There is only one change you should make in the Recognizer code you have written. The definition of <program> now ends with an <EOF>.

<program> ::= [ <allbugs code> ]
              <bug definition>
              { <bug definition> }
              <EOF>

This will allow detection of some additional errors.

To change a Recognizer method into a Parser method, keep a global Stack of Trees as you go. As you step through the parts of a definition, those parts go onto the stack; when you reach the end, you remove those parts from the stack, combine them into a tree for the thing being defined, and put the tree onto the stack.

For example, consider the grammar rule <expression> ::= <term> { <add_operator> <term> }:

In the Recognizer: In the Parser:
public boolean expression() {
    if (!term()) return false;
    while (addOperator()) {
        if (!term()) {
            error("Error after '+' or '-'");
    }
    return true;
}
public boolean expression() {
    if (!term()) return false;  // Note 1
    while (addOperator()) {     // Note 2
        if (!term()) {          // Note 3
            error("Error after '+' or '-'");
        }
        makeTree(2, 3, 1)       // Note 4
    }
    return true;
}

Notes:

  1. If the first term() succeeds, it leaves a Tree representing that term on the stack.
  2. If addOperator() succeeds, it leaves a second Tree node (a leaf) on the stack, containing the add operator ("+" or "-").
  3. If the second term() succeeds, it leaves a third Tree representing that term on the stack.
  4. If you get here, remove the top three Trees from the stack, create a new one that represents this expression, and leave it on the stack.

Making parse trees

The general idea is that as you recognize things in the input, you build Trees to represent them, and push these Trees onto a stack. As you go along, you pop Trees from the stack, combine them into a larger Tree, and put this new Tree onto the stack. When you are done, you should have exactly one Tree on the stack, representing the complete <program>.

There are two utility methods that you will use a lot.

For example, suppose you find "x = 5\n". You will recognize the x as a <variable>, make it into a (single node) Tree, and put it on the stack. Similarly for the "=" symbol and the "5" <expression>. You also recognize the "\n", but discard it. Then using makeTree, you will pop these three Trees from the stack, combine them into a single tree (with the "=" Tree node as the root), and put the result back on the stack.

stack  
  makeTree(2, 3, 1)  

First, you should realize that not everything goes into the parse tree. Some tokens, such parentheses and ends-of-lines, are there to help define the structure or as an aid to the recognizer. Some nonterminals, such as <expression> and <term>, are in the grammar mostly to define the structure that the resulting tree should have. In general, anything that provides structure is represented in the structure of the tree; anything that provides content is represented as nodes with values.

Second, there may be things in the parse tree that are not explicit in the original Bugs program. For example, a block occurs in many places in the grammar, but there is no "block" keyword. As another example, when you create a method node, one of its children is a parameters node to hold the list of formal parameters. The pushNewNode method can be used to create the appropriate single-node Tree.

In the following table I will attempt to show what goes into the parse tree for each grammar rule or rule alternative. Where a rule has more than one alternative, the first column contains only the alternative that is used in the example.

Icon Intended Meaning
A rectangle indicates a specific type of Tree node, representing the action to be taken when the node is evaluated. In this example, a "block" Tree node, which indicates that its children (commands) are to executed in order. Many, but not all, nodes like this are named after keywords. Others, like block, are "pseudo-keywords"--words that do not appear in the grammar and would be treated like ordinary variable names if they occur in a program, but aid in structuring the parse tree.

The "cloud" indicates that a Tree of the appropriate type belongs here. In this example, the "expression" means that a Tree built from an <expression> belongs here. The Tree might, for example, contain a "+" and have children, or the number "17", or anything else representing a valid expression. You can think of the cloud as being a "nonterminal" of my "picture language."

Note: For space reasons, <expression> is sometimes abbreviated <expr> below.

A rounded rectangle indicates that a single terminal of the indicated type belongs here. In this example, "number" might be "5" or "83", but cannot be a name or a more complicated expression.

Note #1: Some nodes, such as block and var, have zero or more children, all of the same kind. Other nodes, such as program and Bug, need to allow for different kinds of children, but may have more than one of some kinds (for example, a program has an Allbugs node, but can contain many Bug nodes). These more complicated nodes have an intermediate list child.

Note #2: "Optional" elements, such as the Allbugs part of a program, may be omitted from the program text, but must be present in the tree if they are needed in order to keep the positions of other children correct. For example, the second child of a program is always a list. However, a node that represents an omitted text element (Allbugs in this example) need not have children.
Another way to say this is: Any node in the tree (other than variable nodes, such as block and list) must have either the correct number of children, or none at all.

Grammar Rule Tree Comments
<program> ::=
    [ <allbugs code> ]
<bug definition>
{ <bug definition> }
program A program node has exactly two children: An Allbugs node (which must be present, but doesn't need to contain any code), and a list node with one or more Bug children.
<allbugs code> ::=
    "Allbugs"  "{" <eol>
{ <var declaration> }
{ <function definition> }
"}" <eol>
allbugs An Allbugs node has exactly two children: A list node with zero or more var children, and a list node with one or more function children.
<bug definition> ::=
    "Bug" <name> "{" <eol>
{ <var declaration> }
[ <initialization block> ]
<command> { <command> }
{ <function definition> }
"}" <eol>
bug definition

A Bug node has exactly five children: a name, a list with var children, an initialization block (with a root node initially), a block (which must contain at least one command), and a list with function children.

The initialization block is optional in the Bugs source code, but must be present in the tree. This is so that the commands and function definitions are always in the same location in the tree. For similar reasons, the list nodes must be present, even if the lists are emtpy.

<var declaration> ::=
    "var" <NAME> { "," <NAME> } <eol>

A var node has zero or more children, each of which is a variable name.

(It's silly to have a var without any named variables, but it simplifies the grammar slightly. If you wish, you may modify the grammar to disallow this case.)
<initialization block> ::=
    "initially" <block>
An <initialization block> consists of an initially node with one child, a block.
<block> ::=
    "{" <eol>
         { <command> }
    "}" <eol>
A block node has zero or more commands as children. The children are in the same order as they appear in the Bugs source code.
<move action> ::=
    "move" <expr> <eol>

<turn action> ::=
    "turn" <expr> <eol>

<turnto action> ::=
    "turnto" <expr> <eol>

<return statement> ::=
    "return" <expr> <eol>
A <move action> requires exactly one <expression>. <turn action>, <turnto action> and <return statement> all have the same structure.
<moveto action> ::=
   "moveto" <expr> "," <expr> <eol>
A <moveto action> requires exactly two <expression>s.
<line action> ::=
   "line" <expr> "," <expr> ","
          <expr> "," <expr> <eol>
A <line action> requires exactly four <expression>s.
<assignment statement> ::=
    <variable> "=" <expression> <eol>
assignment statement  
<loop statement> ::=
    "loop" <block>
loop statement  
<exit if statement> ::=
    "exit" "if" <expression> <eol>
exit if  
<switch statement> ::=
    "switch" "{" <eol>
      { "case" <expression> <eol>
           { <command> } }
    "}" <eol>
switch statement  
<color statement> ::=
    "color" <KEYWORD> <eol>
color The syntax allows any keyword as a "color name," for example, color switch. This is a weakness in the grammar which we will not correct, but will instead detect at runtime.
<function definition> ::=
    "define" <NAME>
     [ "using" <variable>
         { "," <variable> }  ] <block>
function definition A function node always has exactly three children. The var node may have zero or more children.
<function call> ::=
    <NAME> <parameter list>

<do statement> ::=
    "do" <variable> [ <parameter list> ] <eol>
A <function call> occurs in an expression, while a <do statement> occurs as an independent statement, but they are represented by the same kind of Tree.
<color statement> ::=
    "color" <KEYWORD> <eol>
The grammar I provided does not distinguish between color names and other keywords. We can do this later.
<expression>

<arithmetic expression>

<term>

<unsigned factor>

In any sort of expression, the top node is one of :

  • A number
  • A variable name
  • A function call
  • An operator: ".", "+", "-", "*", "/", "<", "<=", "=", "!=", ">=", ">".

Parentheses are represented implicitly in the shape of the tree, not explicitly.

<factor> ::= [ "+" | "-" ] <unsigned factor>
unary operator

Any <factor> may begin with a unary plus or minus

Order of evaluation

An operation cannot be performed until its operands are known. When an operation is represented by a node in a binary tree, that operation cannot be performed until the children of that node have been evaluated. (This requires a postorder tree traversal.) Therefore, the structure of the binary tree imposes restrictions on the order in which operations are performed.

For example, consider the expression 20 * 3 / 4. In this expression, it is important to do the operations left to right. Our parser might construct either of two trees from this expression, only one of which is correct.

This is the correct representation--if you evaluate the left and right children, then do the division, the result is (20 * 3) / 4 = 60 / 4 = 15.

This is a wrong representation. 3/4 is an integer division, so the result is 0, so 20 * (3 / 4) = 20 * 0 = 0.

Note that you do not evaluate expressions in the current assignment. However, you need to build the parse tree correctly so that you can evaluate it correctly in a subsequent assignment.

Testing

Testing primarily consists of determining whether you have built the correct trees. For that you need a working equals method, and an easy way of building the "expected" tree. For that purpose I have provided a TreeParser class which has a default (no argument) constructor and a parse(String) method. Given an input similar to that of Tree.parse(String), this method uses the new typeOf method in the Token class to construct a tree of Tokens.

For example, given the string "moveto(2 abc)", the parse method would construct a tree with the token KEYWORD:moveto at the root and children NUMBER:2 and NAME:abc.

Due date

Turn your assignment in to Canvas before 6 a.m. Tuesday, March 17. Be sure to include all relevant files; your Parser and ParserTest should be immediately runnable.