CSE240 - Homework 9 - LC-3 Assembler

Homework 9: LC-3 Assembler
CSE240 - Introduction to Computer Architecture
Autumn 2005

Due: Wednesday, Dec 7 at 11:59PM

Now that you have built a disassembler (HW 8), let's build an assembler, just like the as command in the simulator. Let's call our assembler mylc3as. This assignment is composed of three distinct components: (1) writing the symbol table used by the assembler, (2) writing the assembler code, and (3) augmenting your assembler from part 2 so that it includes pseudo instructions for traps, subtraction, and moving constant and register values into a register. Please complete each part before proceeding to the next. This document describes each of these three components separately and then describes the logistics for compiling, testing, debugging, and turning in your assignment. Note that your assembler should print nothing to the display unless an error is encountered.

Part #1: The Symbol Table

The symbol table allows you to map strings representing labels (for example, "START") to addresses (for example, 0x3000).

Symbol Table Interface

We will use the following interface for our symbol table implementation:

void sym_table_insert(char* sym, int addr): This function adds label sym with the address addr to the symbol table.
int sym_table_lookup(char* sym): This function returns the address associated with symbol sym in the symbol table.

Symbol Table Implementation

Symbol tables can be implemented in many ways (hash tables, linked lists, arrays, etc.), but we will use a binary tree representation. Each node in the tree is represented via the structure below.

typedef struct sym_node_struct sym_node_t;

struct sym_node_struct {
  char* symbol;
  int address;
  sym_node_t* left;
  sym_node_t* right;
};

The symbol field points to an array of characters (string) representing the symbol. The address field is the address of the symbol. The left and right fields are pointers to the left and right sym_node_t child structures in the tree (NULL if child does not exist). The "root" of the tree is pointed to by the global variable g_symboltable_ptr, initially NULL:

  sym_node_t* g_symboltable_ptr = NULL;  // global symbol table

To provide fast lookup of symbols, we want to build the tree such that given any node in the tree, all labels reachable from the left child are alphabetically less than the label of the node ("car" is alphabetically before "dog", but it is after "ant"). Similarly, all labels reachable from the right child are alphabetically greater than the label of the node. The strcmp(char* s1, char* s2) function from the C standard library does just such an alphabetical comparison. The function strcmp(char* s1, char* s2) returns a 0 if the strings are the same, a number greater than 0 if s1 is alphabetically after s2, and a number less than 0 if s1 is alphabetically before s2.

To implement the symbol table, define functions implementing the symbol table interface, above.

int sym_table_lookup(char* sym): This function traverses the tree based on the symbol string it is passed. When the string is found in a node, the value of the address field is returned. If a non-matching childless node is encountered, the symbol does not appear in the symbol table. If symbol SYMBOL cannot be found, print exactly the following error message (so our testing scripts can detect it) followed by a new line and exit (to exit a C program, using the following function call: exit(1);).

  ERROR: symbol not found (SYMBOL)

The pseudocode:

initialize a node pointer to g_symboltable_ptr

while the node pointer is not NULL do:
    use strcmp() to compare sym and the symbol field of structure pointed to by the node pointer
    if the result of the comparison is less than zero:
        set the node pointer to be the left pointer of the structure pointed to by the node pointer
    otherwise, if the result of the comparison is greater than zero:
        set the node pointer to be the right pointer of the structure pointed to by the node pointer
    otherwise:
        the result of the function is the address field of the structure pointed to by the node pointer

If the symbol was not found, print an error message (using printf) and then exit(1)

Our C implementation of this function is less than 20 lines long.

void sym_table_insert(char* sym, int addr): This function begins by allocating and initializing a sym_node_t (via malloc) and populates the structure's fields with appropriate values passed into the function. Set the left and right pointers to NULL. When setting the symbol field, you don't want it to point to the string passed to the function. The caller of this function may free the memory containing the string, and then we would have a dangling pointer. To avoid this, make a copy of the string using the strdup() function. The function strdup() returns a pointer to a new copy of the string it is passed and has the following signature:
```
  char* strdup(char* str);
```
After allocating a new node, this function traverses the tree based on the symbol string passed into the function. If the symbol comes alphabetically before the string in the current node, it traverses the left child; if the symbols comes after, it traverses the right child; if the symbol is the same as the string in the current node, someone is trying to insert a duplicate string into the symbol table, so print exactly the following error message (followed by new line) and exit (again, using exit(1);). (NOTE: SYMBOL is the name of the duplicate symbol.)
```
  ERROR: duplicate symbol added to symbol table (SYMBOL)
```
When a childless node is encountered, the NULL pointer is changed to point to the newly allocated sym_node_t structure. The pseudocode:
```
allocate and initialize a new sym_node_t structure

if the global pointer g_symboltable_ptr is NULL:
   set g_symboltable_ptr to point to the newly allocated structure

otherwise:
   set a current node pointer to g_symboltable_ptr
   while true do:
      use strcmp() to compare sym and the symbol field of structure pointed to by the node pointer
      if the result of the comparison is less than zero:

          if the left pointer is NULL:
              make the left pointer point to the newly allocated structure
              break out of the loop
          otherwise:
              set the node pointer to the left pointer

      otherwise, if the result of the comparison is greater than zero:

          if the right pointer is NULL:
              make the right pointer point to the newly allocated structure
              break out of the loop
          otherwise:
              set the node pointer to the right pointer

      otherwise:
          print out an error and exit
```
Our C implementation of this function is less than 30 lines long.

Part #2: The Assembler

Recall from Chapter 7 that assemblers make multiple passes over the program. Our assembler will make several passes over the program (but one could combine some of these passes):

Pass 0: parse the assembly program into a linked list,
Pass 1: add symbols to symbol table,
Pass 2: resolve label references using symbol table, and
Pass 3: actually generate instructions.

Pass 1 and 2 require a symbol table, which you built in the first part of this assignment. Please carefully test your symbol table and ensure that it is working before using it in your assembler.

Pass #0: Parsing Instructions into a List

Parsing is quite complex, so we provide this code (in parser.c and parser.h). Feel free to examine this code if you like. Although we provide the parser, you still need to understand the structure into which instructions are placed. The program is parsed into a linked list where each node of the list is an instance of the structure below.

enum opcodes_t { 
  // the 1st 16 represent real LC3 opcodes
  BR, ADD, LD, ST, JSR_JSRR, AND, LDR, STR, 
  RTI, NOT, LDI, STI, JMP_RET, MUL, LEA, TRAP, 
  // these represent assembler directives
  ORIG, FILL, BLKW, STRINGZ, END, 
  // these represent kinds of TRAPs; ignore until part 3
  GETC, OUT, PUTS, IN, PUTSP, HALT, 
  // these represent psuedo instructions; ignore until part 3
  SUB, MOV,
  // represents no operation
  UNDEF
};

typedef struct insn_struct insn_t;

struct insn_struct {
  enum opcodes_t opcode;  
  char* label;            
  int first_reg;          
  int second_reg;         
  int third_reg;          
  int offset_immediate;   
  char* label_ref;        
  char* string;           
  int n;                  
  int z;                  
  int p;                  
  insn_t* next;           
};

The fields are:

The opcode field indicates what kind of instruction this structure represents (e.g., ADD, AND, etc.). Opcodes values must come from the opcodes_t enumeration. Note that the first 16 opcodes correspond directly to actually LC-3 opcode values. For example, the opcode of BR is 0x0, ADD is 0x1, LD is 0x2, etc. Note the JSR_JSRR and JMP_RET enum values. We have named these in this way because, for example, JSR and JSRR both have the same opcode encoding in an LC-3 instruction. The 17th (ORIG) through 21st (END) are for assemble directives (i.e., they do not correspond to actual instructions). The 22nd (GETC) through 27th (HALT) are for pseudo instructions for abbreviating certain TRAPs. The 28th (SUB) and 29th (MOV) are also for pseudo instructions. Please ignore the pseudo instructions until part 3. UNDEF is a special opcode meaning the structure does not correspond to an instruction at all. This is used when a label appears on a line without an instruction.
The label field is a pointer to a null-terminated array of characters (i.e., a C-style string) representing the label associated with this line. The value of this pointer is NULL for instructions that are not preceded by a label.
The first_reg, second_reg, and third_reg fields specify (as ints) the register operands for the instruction. Instructions that require fewer than 3 registers will have -1 in the unused registers. For example, if an instruction requires on two register operands (e.g., NOT), third_reg will be -1.
The offset_immediate field is used to hold offsets or immediate values (depending on the instruction). For example, an ADD immediate instruction will have an immediate value in this field, and an LD instruction will have a PCoffset in it. This field is 0 for instructions that don't require an offset or immediate, but you can't conclude this by looking at the field because an instruction might have an immediate with a value of 0. For example, to determine whether the third operand of an ADD, AND, or MUL instruction is a register or immediate, examine the value of third_reg; if it is -1, use the immediate in offset_immediate.
The label_ref field is a pointer to a null-terminated array of characters that represents the label referenced by the instruction (e.g., LABEL1 in "BR LABEL1"). For instructions not using a label, this pointer will be NULL.
The string field is a pointer to a null-terminated array of characters that represents the string operant to the .stringz directive. It has a NULL value for all instructions except STRINGZ.
The n, z, and p fields specify the n, z, and p fields in the BR instruction (1 indicates the corresponding bit is set in the instruction). These fields are obviously meaningless for non-BR instructions.
The next field is a pointer to the next insn_t structure, representing the next instruction in the program. The next field has the value NULL if this is the last instruction in the input file.

Passes #1 and #2: Inserting and Looking Up Labels into the Symbol Table

Passes 1 and 2 use a symbol table to store the address that corresponds to all of the labels in the program. Pass 1 iterates over all the instructions in the linked list, looking at the label field and adding label/address pairs to the symbol table. Pass 2 iterates over all the instructions again, but this time it looks up all label_ref fields in the symbol table and uses the address of the current instruction and the address of the label to compute the correct value of offset_immediate. After pass 2, the label_ref field is unused.

To complete the code for passes 1 and 2, you'll need to build the following three functions:

int next_address(insn_t* insn_ptr, int current_address): This function takes as arguments a pointer to an insn_t structure and the address associated with that instruction, and it returns the address of the next instruction. If the opcode is BLKW, the next instruction in memory is current_pc + size of the block word. If the opcode is STRINGZ, the next instruction is current_pc + the length of the string associated with the STRINGZ + 1 (for the null terminator). If the opcode is ORIG, the next instruction is simply the argument to ORIG. If the opcode is UNDEF, the next instruction is the same as current_pc (because the insn_ptr doesn't represent an instruction). Otherwise, the next instruction in memory is current_pc+1.
void pass_one(insn_t* insn_ptr): This function looks at each instruction to see if it has a label (that is, the label field is not NULL). If so, the label is inserted into the symbol table with the current_address. This function will need to call the sym_table_insert() function described in the first part of this assignment.
void pass_two(insn_t* insn_ptr): This function looks at each instruction for a non-NULL label_ref field. If the label_ref is non-NULL, the label_ref is looked up in the symbol table to get the address associate with the label. If the opcode is FILL, this address becomes the offset_immediate field. Otherwise, we must compute the PC-relative offset as follows: address associated with symbol in symbol table - (current_pc + 1). This value is then stored in offset_immediate. This function will need to call the sym_table_lookup() function described in the first part of this assignment.

Pass #3: Instruction Generation

After pass 1 and 2, the list of insn_t structures contain all the information necessary to generate encoded LC-3 instructions. Instruction encoding consists of taking the information in the insn_t structure and encoding it in a single 16-bit value. It's just the opposite of what we did in the previous assignment. Like last time, we'll provide some auxiliary routines to make this task more manageable. You'll need to complete the pass_three() function described below.

void write_word_to_file(int word): This function (which we provide for you) outputs the 16-bit binary representation of a single instruction (or ORIG value) to the output file.
int add_field(int bits, int new_bits, int hi_bit, int lo_bit): This function (also provided for you) takes a value (bits) and replaces the bits from hi_bit to lo_bit with the value in new_bits and returns that value. For example, to encode an instruction opcode we could do something like the following:
```
    new_insn = add_field(old_insn, insn_ptr->opcode, 15, 12);
```
Similarly, to encode the destination register in an ADD instruction we would do the following:
```
    new_insn = add_field(old_insn, insn_ptr->first_reg, 11, 9);
```
void pass_three(insn_t* insn_ptr): We've given you the basic structure of this pass in the pass_three() function. This function consists principally of a switch statement, and there is a case for each opcode. Within each case you will uses the fields of the insn_t structure and add_field() to build an instruction, which you will then output to a file with write_word_to_file(). We have completed the ADD case for you.

Part #3: Pseudo Instructions

As you know, LC-3 assembly language is very limited. In order to make LC-3 programming slightly simpler, we will introduce several pseudo instructions. Pseudo instructions are instructions that are recognized by the assembler but don't actually exist in the machine. For example, LC-3 does not actually support a HALT instruction, but we've been putting them in our assembly code. The assembler recognizes that HALT is not a real instruction and generates a TRAP x25. Note that pseudo instructions do not give programmers any additional power, because anything that can be done with a pseudo instruction can be done with 1 or more regular instructions. They just make programming easier. Our assembler will also recognize pseudo instructions for abbreviating TRAPs (GETC, OUT, PUTS, IN, PUTSP, HALT) and we'll add a couple other useful pseudo instructions that will make programming a little simpler.

Note that you will not need to modify the parser (parser.c) for this (or any) part of the assignment. The parser already knows how to parse all these pseudo instructions; we've just been ignoring them up until now.

TRAPs

Implementing the pseudo instructions for abbreviating TRAPs is simple. You just need to add a new case for each in the switch in pass_three(). In each case you generate an instruction encoding for the appropriate TRAP and then write it to the output file.

New Pseudo Instructions: SUB and MOV

Adding SUB and MOV pseudo instructions is only slightly more tricky the TRAPs, above. First, let's describe their semantics. SUB is just like ADD except that it performs subtraction. Like ADD, the third operand can be a register or an immediate. For reasons that will become clear soon, if the third operand is an immediate, it can only have a value from decimal -15 to 16 (versus -16 to 15 for ADD). Don't worry about checking that the immediate value is in range. The MOV pseudo instruction allows you to copy a register or immediate value directly into another register. An immediate in MOV is limited to decimal -16 to 15. Consider the following examples.

  MOV r1,r2     ; copies value of r2 into r1
  MOV r1,#3     ; sets the value of r1 to #3

Clearly, both of these instruction will be useful. They allow you to perform subtraction and assign constants into registers with 1 logical instruction. Naturally, our assembler will generate as many instructions as are necessary to realize the intended behavior. To add the SUB instruction, we must add a new case to the switch in pass_three(). The code actually emitted by the assembler depends on whether or not the third operand is an immediate. If it is, you must negate the immediate (i.e., compute -imm in your C code) and emit a single ADD instruction that adds the negated immediate to the second operand and puts the result in the first (this is why SUB immediates must fall within a different range than those in ADDs). If it is not, you must emit instructions to do the following: (i) invert the bits of the third operand (using the NOT instruction), assigning the result into the third register, (ii) add the second register of the SUB to the previous negated third register and put the result in the first register of SUB, (iii) increment the first register of SUB, and (iv) negate the bits of the third operand, assigning the result back into the third register. Pretty neat how we performed addition without an additional register, eh? Here's an example of the instruction emitted for the instruction SUB r1,r2,r3.

  NOT r3,r3
  ADD r1,r2,r3
  ADD r1,r1,#1
  NOT r3,r3
  ADD r1,r1,#0

But... There's a problem if the first and third registers are the same register (SUB r1,r2,r1), because the last instruction (NOT) will corrupt the result of the subtraction. In this case, don't emit the last NOT. And there's another problem if the second and third registers are the same (SUB r1,r2,r2); when you negate the third register you will corrupt the value of the second register (which is the same as the third). In this case, rather than the 4-instruction sequence above, we'll simply emit an instruction that puts 0 in the first register (AND r1,r1,#0).

The MOV pseudo instruction is similar. We will add a new case for it. If the second operand is a register (i.e., the second_reg field is not -1) we simply emit an instruction that adds the second operand of the MOV to #0 and puts the result in the first operand of the MOV. If the second operand is an immediate, we emit two instructions to do the following: (i) set the value of the first operand to the MOV to 0 (via an AND instruction) and (ii) add the value of the first operand (0) to the immediate in the MOV pseudo instruction and place the result back in the first operand of the MOV instruction.

We don't quite have a working assembler yet, because the next_address() doesn't know anything about these new pseudo instructions. The abbreviated TRAPs aren't a problem because they all correspond to one real machine instruction, but SUB and MOV sometimes correspond to multiple instructions. So you must modify next_address() to properly compute the next addresses for SUB and MOV instructions. Note that in both case you will need to figure out whether register or immediate operands are used (and what registers are the same for SUB) to determine whether 1, 2, 3, or 4 instructions will be emitted for the pseudo instruction.

In this way you can make your assembly language easier to use without actually changing the machine. I'll bet you can dream up a couple other useful pseudo instructions!

We needed to add approximately 230 lines of code to mylc3as.c to implement the assembler (excluding the symbol table, above).

Logistics

Files

This assignment consists of a number of files:

mylc3as.c: This file contains working code for functions write_word_to_file(), add_field(), and main(). It also contains unfinished versions of next_address(), pass_one(), pass_two(), pass_three(), which you will need to complete.
mylc3as.h: This file defines the insn_t and opcodes_t types. It is included by mylc3as.c and parser.c. You will not need to (nor should you) modify this file.
sym_table.c: This file contains sym_table_insert(), and sym_table_lookup(). You'll add a bunch of code to this file (although no new functions are required).
sym_table.h: This file contains the function prototypes for sym_table_insert() and sym_table_lookup(). You will not need to (nor should you) modify this file.
parser.c: This file contains the implementation of parse() which is the function that actually parses the assembly text. You should only bother looking at this if you are really interested. Note that there as some limitations in this parser. For example, a prefix ('#', 'x', or 'b') must appear before all immediates (i.e., "102" does not default to decimal; you must specify "#102" instead).
parser.h: This file contains the function prototypes for parser().
test_st1.c and test_st2.c: These are files used to test your symbol table implementation (see below). You should write your own tests, too!
We've also included some .asm files for testing your assembler. You should write your own tests, too!

Getting Started

Begin by creating a directory to work in and copying the files we provide. These files are available at ~cse240/project/hw/hw9 on eniac or in a zip file. If you are working on eniac, you can get these files as follows.

% cd ~
% mkdir cse240hw9
% cd cse240hw9
% cp ~cse240/project/hw/hw9/* .

Compiling Your Code

You may use any C compiler you like, but we recommend that you use gcc because that is what we'll be using to test your code. Use the following gcc command line to compile your assembler:

% gcc -g -Wall mylc3as.c sym_table.c parser.c -o mylc3as

The -g file specifies to add debugging symbols (so you can debug with gdb). The -Wall flag turns on all compiler warnings. The -o flag says that the generated program should be called mylc3as. If the -o flag is omitted, the default name is a.out. To avoid retyping this command each time, you can repeat previous shell commands by using the up arrow key to cycle through past commands.

To run your assembler, specify the input assembly file and the output object file. For example, to assemble test.asm into test.obj:

% mylc3as test.asm test.obj

Debugging

You will certainly need to use appropriate tools to debug your code (staring at the source code will only damage your eyesight!). The GDB debugger will be very useful. You can use it to determine exactly what statement caused a segmentation fault. You can also set break points and step through the program's execution. See the lecture notes for details on getting started with GDB.

You will also want to use the Valgrind memory check. Valgrind can help identify all kinds of memory errors such as the use of uninitialized memory, dangling pointers, wild pointers, corrupting the stack, memory leaks, and more. Volgrind is installed on eniac-l. To run your program with Valgrind:

% valgrind mylc3as test.asm test.obj

When errors are encountered, Valgrind prints them to the display. They are mostly self explanatory, but you may want to refer to the following sources of more information on Valgrind:

Valgrind Project Homepage - http://valgrind.kde.org/
Valgrind HOWTO - http://www.tldp.org/HOWTO/Valgrind-HOWTO/closerview.html
Linux Magazine's Tutorial on Debugging Memory Problems - http://www.linux-mag.com/2003-05/compile_01.html

Testing

Although we have provided some test inputs for both the assembler and the symbol table, it is very important that you use only our tests after you have thoroughly tested and debugged your code. Don't simply run the provided tests and look surprised when your code doesn't work.

We provide a similar set of .asm files that you used for testing in the previous assignment. You can test your assembler by comparing its output to that of the as command in the simulator (note that as does not recognize SUB nor MOV, so you will need to test those manually). For example, suppose you use as to assemble t1.asm to t1.obj. Now use mylc3as to assemble t1.asm and compare the two object files:

% mylc3as t1.asm my-t1.obj  (this will produce a file called my-t1.obj)
% cmp t1.obj my-t1.obj

If cmp does not produce any output, the two files are identical, so your assembler produced the same output as the reference assembler. If cmp reports a difference, examine t1.asm to see what kind of instruction(s) it contains. Use this a starting point for debugging. If t1.asm contains 1000s of instructions, you'll have no idea where the problem is. Generate your own test cases that have 1 or 2 instructions. If cmp reports that the two object files are different, you will know that the problem is with those 1 or 2 instructions.

We've also provided a little code to test your symbol table. Naturally, you will want to write more extensive tests. The files test_st1.c and test_st2.c both contain main() functions that will test your symbol table functions. To build, do the following.

% gcc -g -Wall test_st1.c sym_table.c -o test_st1 
% gcc -g -Wall test_st2.c sym_table.c -o test_st2

We recommend that you run these tests with Valgrind as follows:

% valgrind test_st1
% valgrind test_st2

The output of each program will indicate whether the test was passed or not. If you fail the tests, examine the .c files to determine the problem. Please build your own tests, too.

Submission

Please submit your code in files called lc3as.c and sym_tab.c in the usual way on the HW 9 Submission Page (available from the online version of this document).

Homework 9: LC-3 Assembler CSE240 - Introduction to Computer Architecture Autumn 2005