CIT 594 Assignment 4: Parser
Spring 2004, David Matuszek

Purposes:

General Idea:

In this assignment you will continue to build on the previous assignments. Since there will be some changes in each of the preceding classes, you should make copies of your earlier work, rather than modify it directly.

You will need the Token, Tokenizer, BinaryTree, LogoParseException, and Recognizer classes, and their corresponding JUnit test classes. You will not need the ArithmeticExpression class.

Your assignment is to write a parser for programs in that language. A parser is a program that, given a stream of tokens, creates a tree that represents the structure of the program represented by that stream of tokens. Such a tree is called a parse tree. In this assignment we will use binary trees, rather than general trees, to represent the program. The meaning of the program, if any, is outside the scope of this assignment.

Changes to the Tokenizer:

Add a new type of token, the "keyword." A keyword is any of the words defined in the grammar--if, red, etc.

Add a new constant, public static final int KEYWORD, to the Token class, and use it in the Tokenizer class.

To implement this, define a Set (see java.util), and fill it with the keywords from the Recognizer assignment. (The best way to do this is to create a String array with the keywords as its initial values--you can do this in one statement, if you understand array initialization--and then create a Set from the array.) Each time your Tokenizer finds a NAME, check to see whether it is a member of the Set of keywords, and if so, return it as a KEYWORD rather than as a NAME.

Your Tokenizer should have the pushBack(Token) method as used in the Recognizer assignment.

Be sure to add JUnit tests for all your changes to the Tokenizer.

The Parser class:

First, make a copy of your Recognizer class, and change its name to Parser. (You can do this easily with Eclipse's "Rename" refactoring.) You will be transforming this class from a recognizer, which simply returns true or false, into a parser, which creates a (binary) tree corresponding to its input. We will use this parse tree in a later, final assignment.

One way of doing this transformation is to change each Recognizer method to return a BinaryTree instead of a boolean. Don't do this. There is an easier way that allows you to make smaller changes and do unit testing as you go. That easier way is to keep a "global" Stack of binary trees as you go.

For example, consider the grammar rule <expression> ::= <term> | <term> <add_operator> <expression>:

In the Recognizer: In the Parser:
public boolean expression() {
    if (!term())
        return false;
    if (!addOperator())
        return true;
    if (!expression())
        error("Error in expression after '+' or '-'");
}
public boolean expression() {
    if (!term())              // Note 1
        return false;
    if (!addOperator())   // Note 2
        return true;
    if (!expression())     // Note 3
        error("Error in expression after '+' or '-'");
    makeTree(2, 1, 3); // Note 4
    return true;
}

Notes:

  1. If term() succeeds, it leaves a BinaryTree representing that term on the stack.
  2. If addOperator() succeeds, it leaves a BinaryTree leaf node on the stack, containing the add operator ("+" or "-").
  3. If expression() succeeds, it leaves a BinaryTree representing that expression on the stack.
  4. If we get here, my makeTree method removes the three preceeding BinaryTrees from the stack, creates a new one that represents this expression, and leaves it on the stack.

Changes to the grammar:

  1. Change
         <variable> ::= ":" <name>
    to simply
         <variable> ::= <name>

    The "real" Logo language uses a colon to indicate a variable name, but I don't believe we need one.

  2. Change
          <eol> ::= "\n"
    to
         <eol> ::= "\n" { "\n" }

    This keeps mandatory newlines, but also allows for possible blank lines.

  3. To the definition of <command>, add the alternative
         "do" <name> { <expression> } <eol>

    (I just forgot this command before, and we will want it. I hope I haven't forgotten any others!)

  4. Eliminate "backward" from the rules for <move>, but keep "forward", "left", and "right". Therefore, the word "backward" should not be treated as a keyword.

  5. Change
          <expression> ::= <term> | <term> <add_operator> <expression>
    to
          <expression> ::= <term> { <add_operator> <term> }
    and change
          <term> ::= <factor> | <factor> <multiply_operator> <term>
    to
         <term> ::= <factor> { <multiply_operator> <factor> }

    These changes do not affect what is recognized by the grammar, but they have a significant influence on the form of the binary tree that results. How to make this change, and the effect it will have, is discussed in detail below.

For your convenience, I have posted a page containing the modified grammar.

Making parse trees:

First, you should realize that not everything goes into the parse tree. Some things, such parentheses and ends-of-lines, are there to help define the structure or as an aid to the recognizer. Most everything else does go into the parse tree, though.

In the following table I will attempt to show what goes into the parse tree for each grammar rule or rule alternative. Where a rule has more than one alternative, the first column contains only the alternative that is used in the example.

Grammar rule (partial) Example input
What it should push onto the stack
Comments
<variable> ::= <name> "foo"
Leaf node, value is the name.
<command> ::=
<move> <expression> <eol>
"forward 5 \n"
The left child is the BinaryTree that represents the <expression>; it may be significantly more complicated than this.
<command> ::=
"penup" <eol>
"penup \n"
Leaf node, value is the keyword.
<command> ::=
"pendown" <eol>
"penup \n"
Leaf node, value is the keyword.
<command> ::=
"color" <colorname> <eol>
"color red \n"
The left child is the BinaryTree that represents the <colorname>.
<command> ::=
   "home" <eol>
"home \n"
Leaf node, value is the keyword.
<command> ::=
 "set" <variable> <expression> <eol>
"set x 5 \n"
The left child is the BinaryTree that represents the <variable>, while the right child is the BinaryTree that represents the (possibly complex) <expression>.
<command> ::=
 "repeat" <expression> <block> <eol>
"repeat 5 [ \n
penup \n
pendown \n
] \n"
The left child is the BinaryTree that represents the <expression>, while the right child is the BinaryTree that represents the <block>. (The <block> subtree is not shown in detail.)
<command> ::=
 "while" <condition> <block> <eol>
"while x < 5 [ \n
penup \n
] \n"
The left child is the BinaryTree that represents the <condition>, (shown explicitly here) while the right child is the BinaryTree that represents the <block>. (The <block> subtree is not shown in detail.)
<command> ::=
 "if" <condition> <block> <eol>
"if x < 5 [ \n
penup \n
] \n"
The left child is the BinaryTree that represents the <condition>, (shown explicitly here) while the right child is the BinaryTree that represents the <block>. (The <block> subtree is not shown in detail.)
<command> ::=
"do" <name> { <expression> } <eol>
"do foo 5 \n"
The right child is not shown because, although the example has only one expression, the right child could be a binary tree containing several expressions. See below.
<move> ::= "forward" "forward"
Leaf node, value is the keyword. Similarly for right and left.
<colorname> ::= "red" "red"
Leaf node, value is the keyword. Similarly for each other color name.
<block> ::= "[" <eol> { <command> } "]" <eol> "[ \n
home \n
pendown \n
penup \n
] \n

Each internal node is a "list" node. Note that each <command> within the <block> is a left child. (To keep the diagram simple, I used only simple commands; each might be a large, complex subtree.)

This kind of structure is necessary because we have only binary trees to work with, not general trees.

<expression> ::= <term> { <add_operator> <term> } "foo + 5"

When an <expression> consists of a single term, the binary tree is the same as it is for that <term> alone. It never contains any "list" nodes.

<term> ::= <factor> { <multiply_operator> <factor> } "foo * 5"
When a <term> consists of a single factor, the binary tree is the same as it is for that <factor> alone. It never contains any "list" nodes.
<factor> ::= <name> "foo"
When a <factor> consists of a name or number, the binary tree is the same as it is for that <name> or <number>.
<factor> ::= "(" <expression> ")" "(foo + 5)"
When a <factor> consists of parentheses around an expression, the binary tree is the same as it is for that <expression>.
<condition> ::= <expression> <comparator> <expression> "x < 5"
The children of a comparator node are arbitrarily complex binary trees representing <expression>s. Only simple expressions are used in this picture.
<comparator> ::= "<" | "=" | ">" "<"
Leaf node, value is the symbol. Similarly for the two other comparators.
<procedure> ::= "to" <name> { <variable> } <eol>
{ <command> } "end" <eol>
"to foo x y z \n
forward x \n
end \n"

This is the most complex structure. A <procedure> requires a <name>, a list of <variable>s, and a list of <command>s.

To accommodate these three children, I decided to invent a "header" node to be the parent of the first two. The two lists can be built using "list" nodes.

<program> ::= <command> { <command> } { <procedure> } penup \n
pendown \n

The root node is a "program" node, and each child is a "list" node. Note that an empty list is represented by a "list" node with no children.

<eol> ::= "\n" { "\n" } "\n"

 

 

No binary tree is built. If something was put on the stack in the process of recognizing the <eol>, it should be removed.

Order of evaluation:

An operation cannot be performed until its operands are known. When an operation is represented by a node in a binary tree, that operation cannot be performed until the children of that node have been evaluated. Therefore, the structure of the binary tree imposes restrictions on the order in which operations are performed.

For example, consider the expression 20 * 3 / 4. In this expression, it is important to do the multiplication before doing the division. Our parser might construct either of two trees from this expression, only one of which is correct.

This is the correct representation--if you evaluate the left and right children, then do the division, the result is (20 * 3) / 4 = 60 / 4 = 15.

This is a wrong representation--if you evaluate the left and right children, then do the division, .the result is 20 * (3 / 4) = 20 * 0 = 0. (Remember, this is integer division.)

Unfortunately, this is what will result if you use the definition of <term> that I gave you. While that definition is correct for a simple recognizer, it needs to be modified to be suitable for a parser. The definition of <expression> needs to be modified in a similar manner.

Note that you do not evaluate expressions in the current assignment. However, you need to build the parse tree correctly so that you can evaluate it correctly in the next assignment.

Lists:

A BNF definition that uses { } to indicate "zero or more" needs some special treatment, since a node in a binary tree cannot have "zero or more" children--it can have at most two children. For everything we are doing here, the children need to be "evaluated" from left to right, so we want a tree such as the one pictured above for <block>.

There are four places in the grammar where you need to create a list: one in <block>, two in the <procedure>, and one in the "to" command. You don't need to create one for <name> or <number>, because the Tokenizer takes care of that for you; and <term> and <expression> require binary trees of a slightly different form.

For the four places that require a "list" (with "list" nodes), I recommend that you make a private method to create the necessary binary tree. Here's the basic idea: When you realize a list is about to be needed, push a marker of some sort onto the stack. When you reach the end of the list, work down to the marker to create the necessary binary tree.

For example, if you see "to fiddle", you know a list of variables will be next, so put a marker on the stack. This is so you can later figure out where the list began. The marker should be something that cannot occur as a Token value (for example, the string "<marker>", including the angle brackets).

If the input is "to fiddle one two three \n", your stack should look like this:


(before)

Note that everything on the stack is a BinaryTree. (The ellipsis at the bottom indicates that there may be other binary trees on the stack as well.)

Your method can then take items from the stack, down to and including the marker, assemble the BinaryTree shown at the right, and put it onto the stack. This is a fairly simple--but recursive--pointer manipulation method.


(after)

Starter files:

Parser.java and ParserTest.java.

Due date:

Monday, February 16, before midnight. Turn in your program via Blackboard, and follow these conventions:

It is your responsibility to check and make sure that you have correctly submitted the correct files via Blackboard. Errors in the submission process will no longer be taken as an excuse for late homework.