CIT 594 Assignment 6: Parser
Spring 2006, David Matuszek

Purposes:

General Idea:

In this assignment you will continue to build on the previous assignments. Since there will be some changes in each of the preceding classes, you should make copies of your earlier work, rather than modify it directly.

You will need the Type, Token, Tokenizer, Tree, LogoException, and Recognizer classes, and their corresponding JUnit test classes. You will not need the TreeMethods or the HashTable class.

Your assignment is to write a parser for programs in the "Logo2006" language. A parser is a program that, given a stream of tokens, creates a tree that represents the structure of the program represented by that stream of tokens. Such a tree is called a parse tree. The meaning of the program, if any, is outside the scope of this assignment.

Changes to Type and (not to) Tokenizer:

Add a new constant KEYWORD to the Type class. A keyword is any of the words defined in the grammar--move, if, red, etc.

This change should not break your Tokenizer class. Use your JUnit tests to make sure that Tokenizer still works correctly. Do not change your Tokenizer class to use the new enum type; but if you uncover any bugs in your Tokenizer, write JUnit tests to test for them, then add the corrections to Tokenizer.

Class LogoTokenizer:

The Tokenizer class is almost, but not quite, what we want for our parser. First, it would be convenient to have a different kind of token for keywords than for other kinds of names. Second, it would have been better if Tokenizer implemented Iterable rather than Iterator, because this would allow us to use the new for loop in Java 5. We can correct these shortcomings with an adapter class.

Your new class should provide all the same methods as Tokenizer. Do not duplicate code--let the old Tokenizer class do most of the work. The new LogoTokenizer does only two things: It changes the type field of keywords to KEYWORD, and it implements java.lang.Iterable.

To implement keywords, define a Set (see java.util), and fill it with the keywords from the Recognizer assignment. (The best way to do this is to create a String array with the keywords as its initial values--you can do this in one statement, if you understand array initialization--and then create a Set from the array.) Each time your Tokenizer finds a NAME, check to see whether it is a member of the Set of keywords, and if so, return it as a KEYWORD rather than as a NAME.

Your JUnit tests for LogoTokenizer should check all keywords, to make sure they are all returned correctly, as well as at least one non-keyword. Tests for other methods can be very short, since all you really need to test is that you can get to the Tokenizer methods by way of LogoTokenizer.

The Parser class:

First, make a copy of your Recognizer class, and change its name to Parser. You will be transforming this class from a recognizer, which simply returns true or false, into a parser, which creates a tree corresponding to its input. We will use this parse tree later in this course.

One way of doing this transformation is to change each Recognizer method to return a Tree instead of a boolean. Don't do this. There is an easier way that allows you to make smaller changes and do unit testing as you go. That easier way is to keep a "global" Stack of trees as you go.

For example, consider the grammar rule <expression> ::= <term> { <add_operator> <term> }:

In the Recognizer: In the Parser:
public boolean expression() {
    if (!term()) return false;
    while (addOperator()) {
        if (!term()) {
            error("Error after '+' or '-'");
    }
    return true;
}
public boolean expression() {
    if (!term()) return false;  // Note 1
    while (addOperator()) {     // Note 2
        if (!term()) {          // Note 3
            error("Error after '+' or '-'");
            // Add code here    // Note 4
    }
    return true;
}

Notes:

  1. If the first term() succeeds, it leaves a Tree representing that term on the stack.
  2. If addOperator() succeeds, it leaves a Tree leaf node on the stack, containing the add operator ("+" or "-").
  3. If the second term() succeeds, it leaves a Tree representing that term on the stack.
  4. If you get here, remove the top three Trees from the stack, create a new one that represents this expression, and leave it on the stack. (Since you will be doing similar tree-building operations in a lot of places, remember the DRY principle and write a method or methods to do this work.)

Making parse trees:

The general idea is that as you recognize things in the input, you build Trees to represent them, and push these Trees onto a stack. As you go along, you pop Trees from the stack, combine them into a larger Tree, and put this new Tree onto the stack. When you are done, you should have exactly one Tree on the stack, representing the complete <program>.

For example, suppose you find "x = 5\n". You will recognize the x as a <variable>, make it into a (single node) Tree, and put it on the stack. Similarly for the "=" symbol and the "5" <expression>. Then you will pop these three Trees from the stack, combine them into a single tree (with the "=" Tree node as the root), and put the result back on the stack.

First, you should realize that not everything goes into the parse tree. Some tokens, such parentheses and ends-of-lines, are there to help define the structure or as an aid to the recognizer. Some nonterminals, such as <expression> and <term>, are in the grammar mostly to define the structure that the resulting tree should have. In general, anything that provides structure is represented in the structure of the tree; anything that provides content is represented as nodes with values.

Second, there may be things in the parse tree that are not explicit in the original Logo2006 program. For example, when you create a program node, one of its children is a block to hold the statements in the program (even though the grammar does not call for a block), and another is a parameters node to hold the list of formal parameters.

In the following table I will attempt to show what goes into the parse tree for each grammar rule or rule alternative. Where a rule has more than one alternative, the first column contains only the alternative that is used in the example.

Grammar rule (partial) Example input
What it should push onto the stack
Comments
<variable> ::= <NAME> "foo"
Leaf node, value is the name.
<command> ::=
"move" <expression> <eol>
"move 5 \n"
The (one) child is the Tree that represents the <expression>; the expression may be significantly more complicated than this.
<command> ::=
"turn" <direction> <eol>
"turn east \n" The child is the Tree that represents the <direction>. In this example, <direction> has one child, east.
<command> ::=
"turn" <direction> <eol>
"turn left 45\n" The child is the Tree that represents the <direction>. In this example, <direction> has two children, left and 45.
<command> ::=
"penup" <eol>
"penup \n"
Leaf node, value is the keyword.
<command> ::=
"pendown" <eol>
"penup \n"
Leaf node, value is the keyword.
<command> ::=
"color" <colorName> <eol>
"color red \n"
The child is the Tree that represents the <colorName>.
<command> ::=
   "home" <eol>
"home \n"
Leaf node, value is the keyword.
<command> ::=
 <variable> "=" <expression> <eol>
"x = 5 \n"
The first child is the Tree that represents the <variable>, while the second child is the Tree that represents the (possibly complex) <expression>.
<command> ::=
 "repeat" <expression> <block>
"repeat 5 [ \n
  penup \n
] \n"
The first child is the Tree that represents the <expression>, while the second child is the Tree that represents the <block>.
<command> ::=
 "while" <condition> <block>
"while x < 5 [ \n
  penup \n
] \n"
The first child is the Tree that represents the <condition>, while the second child is the Tree that represents the <block>.
<command> ::=
 "if" <condition> <block>
"if x < 5 [\n
  penup\n
]\n"
The first child is the Tree that represents the <condition>, while the second child is the Tree that represents the <block>.
<command> ::=
 "if" <condition> <block> "else" <block>

"if x < 5 [\n
  penup \n
]\n
else [\n
  pendown \n
]\n"

The first child is the Tree that represents the <condition>, the second child is the Tree that represents the first <block>, and the third child represents the second <block>.
<command> ::=
"call" <name> { <expression> } <eol>
"call foo 1 2 3\n"
The (required) first child of a do command is a name, while any additional children are subtrees that represent expressions.
<colorName> ::= "red" "red"
Leaf node, value is the keyword. Similarly for each other color name.
<block> ::= "[" <eol> { <command> } "]" <eol>

"[ \n
  penup \n
  home \n
  pendown \n
] \n

Each command is a child (subtree) of the block node. (To keep the diagram small, I used only simple commands; each command might be a large, complex subtree.)

<comparator> ::= "<" | "=" | ">" "<"
Leaf node, value is the symbol. Similarly for the two other comparators.
<comparison> ::= <expression> <comparator> <expression> "x < 5"
The two children of a comparator node are arbitrarily complex Trees representing <expression>s. Only simple expressions are used in this picture.
<condition> ::= <disjunct> { "or" <disjunct> } "x<0 or x>100" or is a binary operator, hence has exactly two children.
<condition> ::= <disjunct> { "or" <disjunct> } "x<0 or x>y and y>100"

and and or are binary operators, so each has two children.

and has higher precedence than or, hence must be lower down in the Tree structure, where it will be evaluated sooner.

<disjunct> ::= <conjunct> { "and" <conjunct> } "x>y and y>100" and is a binary operator, hence has exactly two children.
<disjunct> ::= <conjunct> { "and" <conjunct> } "x<y and y<z and z<100" and is a binary operator, hence has exactly two children. The structure of the Tree implies left-to-right evaluation.
<conjunct> ::= [ "not" ] <logicalFactor> "not x < 5" If there is no "not" present, the Tree for a <conjunct> is the same as that for a <logicalFactor>.
<logicalFactor> := <comparison> "x < 5" The Tree for a <logicalFactor> is the same as that for its constituent <comparison>. (This is a minor inefficiency in the grammar, but we'll keep it anyway.)
<expression> ::= <term> { <add_operator> <term> } "foo + 5"

When an <expression> consists of a single term, the Tree is the same as it is for that <term> alone.

<expression> ::= <term> { <add_operator> <term> } "x + y + z"

Operations are done left to right, so the tree structure must reflect this fact.

Other binary operations (of equal precedence) are done the same way. Higher precedence operators must be lower in the Tree.

<term> ::= <factor> { <multiply_operator> <factor> } "foo * 5"
When a <term> consists of a single factor, the tree is the same as it is for that <factor> alone. It never contains any "list" nodes.
<factor> ::= <name> "foo"
When a <factor> consists of a name or number, the tree is the same as it is for that <name> or <number>.
<factor> ::= "(" <expression> ")" "(foo + 5)"
When a <factor> consists of parentheses around an expression, the tree is the same as it is for that <expression>.
<procedure> ::= "define" <name> { <variable> } <eol>
{ <command> } "end" <eol>
"define foo x y z \n
  move x \n
end \n"

This is the most complex structure. A <procedure> requires three children:

  • The first child is just a <name>.
  • The second child should be a new type of node, <parameters>, which may have zero or more children, each of which is a <variable>. It should be present even if there are no parameters.
  • The third child is an implicit <block>--it is not denoted by explicit "[ ]" brackets in the input, but rather by the define line and the end.
<program> ::= <command> { <command> } { <procedure> } "penup \n
home \n
define foo x y z \n
  move x \n
end \n"

The root node is a program node, and it has two children--an implicit <block> around the commands, and a new type of node, procedures, which has zero or more define nodes as children.

<eol> ::= "\n" { "\n" } "\n"

 

 

No Tree is built. If something was put on the stack in the process of recognizing the <eol>, it should be removed.

Order of evaluation:

An operation cannot be performed until its operands are known. When an operation is represented by a node in a binary tree, that operation cannot be performed until the children of that node have been evaluated. (This requires a postorder tree traversal.) Therefore, the structure of the binary tree imposes restrictions on the order in which operations are performed.

For example, consider the expression 20 * 3 / 4. In this expression, it is important to do the multiplication before doing the division. Our parser might construct either of two trees from this expression, only one of which is correct.

This is the correct representation--if you evaluate the left and right children, then do the division, the result is (20 * 3) / 4 = 60 / 4 = 15.

This is a wrong representation--if you evaluate the left and right children, then do the division, .the result is 20 * (3 / 4) = 20 * 0 = 0. (Remember, this is integer division.)

Note that you do not evaluate expressions in the current assignment. However, you need to build the parse tree correctly so that you can evaluate it correctly in a subsequent assignment.

Starter files:

Parser.java and ParserTest.java.

Due date:

Tuesday, February 28, before midnight. There will be no extensions (except for actual emergencies); however, I don't count Spring break, so late points will be as follows:

By midnight Wednesday, February 29    -5 points
By midnight Thursday, March 1 -10 points
By midnight Monday, March 13 -15 points
By midnight Tuesday, March 14 -25 points
By midnight Wednesday, March 15 -20 points
By midnight Thursday, March 16 -30 points
By midnight Friday, March 17 -35 points

Turn in your program via Blackboard, and follow these conventions:

It is your responsibility to check and make sure that you have correctly submitted the correct files via Blackboard. Errors in the submission process will not be taken as an excuse for late homework.