CIT 594 Assignment 8: Parser
Spring 2014, David Matuszek

Purposes

General Idea

Your assignment is to write a parser for programs in the language described by the Grammar in the Recognizer assignment. A parser is a program that, given a stream of tokens, creates a tree that represents the structure of the program represented by that stream of tokens. Such a tree is called a parse tree, or abstract syntax tree (AST) . The "meaning" of this tree, if any, is outside the scope of this assignment.

In this assignment you will build on the Tree, Tokenizer, and Recognizer assignments. Create a Parser project. Make a copy of the Recognizer class, rename it Parser, and put it in a new package parser in the Parser project. Leave the other classes where they are, and import them. Copy the packages tree, tokenizer, parser, and (preferably) tests from previous assignments into this new project. Create an AllTests.java JUnit class in the tests package to call all the previous JUnit tests, along with your new Parser tests.

The Parser class

I am providing starter code for the Parser (see below). This code can parse <expression>s and the other nonterminals that go to make up expressions. You should read and become familiar with all the provided starter code.

You will need to merge your Recognizer with this starter code, keeping the most complete versions of each method. You will be transforming your recognizer, which simply returns true or false, into a parser, which creates a tree corresponding to its input. We will use this parse tree in the next assignment.

The Recognizer methods all return a boolean. Don't change this. The Parser methods will manipulate a global Stack<Tree<Token>>, but will continue to return a boolean result.

For example, consider the grammar rule
      <term> ::= <factor> | <factor> { <multiplyOperator> <factor> }:

In the Recognizer:
public boolean isTerm() {
    if (!isFactor()) return false;
    while (isMultiplyOperator()) {
        if (!isFactor()) {
            error("No factor after '*' or '/'");
        }
    }
    return true;
}
In the Parser:
public boolean isTerm() {
    if (!isFactor()) return false;              // Note 1
    while (isMultiplyOperator()) {              // Note 2
        if (!isFactor()) {                      // Note 3
            error("No factor after '*' or '/'");
        }
        // Add tree construction code here      // Note 4
    }
    return true;
}

Notes:

  1. If the first isFactor() succeeds, it leaves a Tree representing that factor on the stack.
  2. If isMultiplyOperator() succeeds, it leaves a Tree leaf node on the stack, containing the multiply operator token ("*" or "/").
  3. If the second isFactor() succeeds, it leaves a Tree representing that factor on the stack.
  4. If you get here, remove the top three Trees from the stack, create a new one that represents this term, and leave it on the stack. (Since you will be doing similar tree-building operations in a lot of places, remember the DRY principle and use a method or methods to do this work.)

Making parse trees:

The general idea is that as you recognize things in the input, you build Trees to represent them, and push these Trees onto a stack. As you go along, you pop Trees from the stack, combine them into a larger Tree, and put this new Tree onto the stack. When you are done, you should have exactly one Tree on the stack, representing the complete nonterminal (usually <program>) being parsed.

For example, suppose you are trying to parse a <condition> (defined as <expression> <comparator> <expression>) and the string being parsed is "x < 5 { \n". You will parse the x as an <expression> by calling the expression() method, which will leave a Tree representing this expression on the stack. You do similar things for the <comparator> and the second <expression>. Then you will pop these three Trees from the stack, combine them into a single tree (with the "<" Tree node as the root), and put the result back on the stack. The remainder of the input string (in this case, "{ \n") is left for later use.

First, it is important to realize that not everything goes into the parse tree. Some tokens, such parentheses, braces, and newlines, are there to help define the tree structure, or as an aid to the recognizer. Some nonterminals, such as <expression> and <term>, are in the grammar mostly to define the structure that the resulting tree should have. In general, anything that provides structure is represented in the structure of the tree; anything that provides content is represented as nodes with values.

Second, there may be things in the parse tree that are not explicit in the original program. For example, when you create a program node, one of its children is a block to hold the statements in the program (even though the grammar does not call for a block), and another is a parameters node to hold the list of formal parameters.

In the following table I will attempt to show what goes into the parse tree for each grammar rule or rule alternative. Where a rule has more than one alternative, the first column contains only the alternative that is used in the example. Note that all node types have a fixed number of children, with the exception of <list> and <block>, each of which may have any number (including zero) of children.

If there are discrepancies between the grammar given in the Recognizer assignment, and in the examples below, the one in the Recognizer assignment is the correct one.

Grammar rule (partial) Example input What it should push onto the stack Comments
<variable> ::= <name> "foo" Leaf node, value is the name.
<command> ::=
<move> <expression> <eol>
"forward 5 \n" The (one) child is the Tree that represents the <expression>; the expression may be significantly more complicated than this.
<command> ::=
"penup" <eol>
"penup \n" Leaf node, value is the keyword.
<command> ::=
"pendown" <eol>
"pendown \n" Leaf node, value is the keyword.
<command> ::=
"color" <expression> <expression> <expression> <eol>
"color 255 0 0\n" The three children are the red, green, blue components of the <color>. Each value must be in the range 0 to 255.
<command> ::=
"red" <eol>
"red \n" Explicit color names may be represented as a leaf,
-or-
may be represented as if the three color components were explicitly listed.
<command> ::=
   "home" <eol>
"home \n" Leaf node, value is the keyword.
<command> ::=
 "jump" <expression> <expression> <eol>
"jump x y \n" jump x y The children are the two expressions, in order.
<command> ::=
 "set" <variable> <expression> <eol>
"set w 5 \n" The first child is the Tree that represents the <variable>, while the second child is the Tree that represents the (possibly complex) <expression>.
<command> ::=
 "repeat" <expression> <block>
"repeat 5 { \n
  penup \n
} \n"
The first child is the Tree that represents the <expression>, while the second child is the Tree that represents the <block>.
<command> ::=
 "while" <condition> <block>
"while x < 5 { \n
  penup \n
} \n"
The first child is the Tree that represents the <condition>, while the second child is the Tree that represents the <block>.
<command> ::=
 "if" <condition> <block>
"if x < 5 {\n
  penup\n
}\n"
The first child is the Tree that represents the <condition>, while the second child is the Tree that represents the <block>.
<command> ::=
 "if" <condition> <block> "else" <block>

"if x < 5 {\n
  penup \n
}\n
else {\n
  pendown \n
}\n"

The first child is the Tree that represents the <condition>, the second child is the Tree that represents the first <block>, and the third child represents the second <block>.
<command> ::=
"do" <name> { <expression> } <eol>
"do foo 1 2 3\n" The (required) first child of a do command is a name, while the second child is a (possibly empty) list of expressions.
<block> ::=
"{" <eol> { <command> } "}" <eol>

"{ \n
  penup \n
  home \n
  pendown \n
} \n"

Each command is a child (subtree) of the block node. (To keep the diagram small, I used only simple commands; each command might be a large, complex subtree.)

<comparator> ::=
"<" | "=" | >"
"<" Leaf node, value is the symbol. Similarly for the two other comparators.
<condition> ::=
<expression> <comparator> <expression>
"x < 5" The children of a comparator node are arbitrarily complex binary trees representing <expression>s. Only simple expressions are used in this picture.

<procedure> ::=
"def" <name> { <variable> } <block>

"def foo a b c{ \n
forward b \n
} \n"

This is the most complex structure. A <procedure> requires a <name>, a list of <variable>s, and a <block> (list of <command>s).

A "header" node is the parent of the <name> and the (possibly empty) list of <variable>s.

<program> ::=
<command> { <command> } { <procedure> }
"penup \n
home \n \n def foo a b c{ \n left a \n } \n"

The root node is a program node--it has two children, a block of commands and a list of procedures. The block must have at least one child; the list may have zero or more children.

A "block" indicates a sequence of commands. Unlike other places in the language, the block child of a program is implicit (not indicated by braces).

<eol> ::= "\n" { "\n" } "\n"

 

 

No binary tree is built. If something was put on the stack in the process of recognizing the <eol>, it should be removed.


Colors and their RGB values

  red 255 0 0   blue 0 64 255   black 0 0 0
  orange 255 128 0   purple 128 0 255   gray 128 128 128
  yellow 255 255 0   magenta 255 0 255   white 255 255 255
  green 0 153 0   pink 250 175 190   brown 128 64 0
  cyan 0 255 255   olive 128 128 0   tan 210 180 140

Keywords and factors

Please revise the definition of <factor> as follows: And add this to your isFactor method:
<factor> ::= <variable>
           | <number>
           | "getX"
           | "getY"
           | "(" <expression> ")"
if (isKeyword("getX") || isKeyword("getY")) {
return true;
}

Include getX and getY as keywords.

In order to build the tree as shown, you will need internal Tokens containing these words: program, list, header, and block. But if the user uses these words as the names of variables, the program might get confused (this is similar to why you should never use $ in a Java name). We can avoid the problem by making these Tokens into "fake keywords." If your Tokenizer encounters these words in its input, it will classify them as NAMEs; but you can bypass the Tokenizer and create these directly, by (for example) new Token(TokenType.KEYWORD, "program"). This way you can have both "keyword Tokens" as part of the structure of the tree, and the user can use these names just like any other variable name.

Order of evaluation

An operation cannot be performed until its operands are known. When an operation is represented by a node in a binary tree, that operation cannot be performed until the children of that node have been evaluated. (This requires a postorder tree traversal.) Therefore, the structure of the binary tree imposes restrictions on the order in which operations are performed.

For example, consider the expression 20 * 3 / 4. In this expression, it is important to do the multiplication before doing the division. Our parser might construct either of two trees from this expression, only one of which is correct.

This is the correct representation--if you evaluate the left and right children, then do the division, the result is (20 * 3) / 4 = 60 / 4 = 15.

This is a wrong representation--if you evaluate the left and right children, then do the division, .the result is 20 * (3 / 4) = 20 * 0 = 0. (Remember, this is integer division.)

Note that you do not evaluate expressions in the current assignment. However, you need to build the parse tree correctly so that you can evaluate it correctly in a subsequent assignment.

Starter files

Parser.java, ParserTest.java, and Tree.parse.java.

The stack used by the Parser is declared as public Stack<Tree<Token>> stack = new Stack<Tree<Token>>();. After each call to an isXXX method, the top of this stack is tested to see whether it contains the correct Tree.

The Parser test methods depend on two Tree methods that I did not assign, parse(<String> string) and parse(<List<String> list), whose function is to turn a String into a Tree<String>. I am providing these functions for you to add to your Tree class.

Due date

Tuesday, April 1, before midnight. Turn in your program via Canvas, and zip and include all necessary packages from previous projects. The easiest way to do this is to simply copy the packages from previous assignments into the current assignment.

It is your responsibility to check and make sure that you have correctly submitted the correct files via Blackboard. Errors in the submission process will not be taken as an excuse for late homework.