Homework 9: LC-3 Assembler
CSE240 - Introduction to Computer Architecture
Autumn 2006
Due: Wednesday, Dec 6 at 11:59PM
Now that you have built a disassembler (homework 8), let's build an assembler,
just like the as command in the simulator. Let's call our
assembler mylc3as. This assignment is composed of two distinct
components: (1) writing the assembler code and (2) writing the symbol table
used by the assembler. I suggest you complete each part before proceeding to
the next. This document describes each of these components separately and then
describes the logistics for compiling, testing, debugging, and turning in your
assignment.
Part #1: The Assembler
Recall from Chapter 7 that assemblers make multiple passes over the
program. Our assembler will make several passes over the program (but
one could combine some of these passes):
- Pass 0: parse the assembly program into a linked list,
- Pass 1: add symbols to symbol table,
- Pass 2: resolve label references using symbol table, and
- Pass 3: actually generate instructions.
Pass 1 and 2 require a symbol table, which you will build in the
second half of this assignment. To allow you to complete part 1
without first completing part 2, we've included a binary
implementation of the symbol table (see below for more information).
Pass #0: Parsing Instructions into a List
Parsing is quite complex, so we provide this code (in
parser.c and parser.h). Feel free to examine this
code if you like. Although we provide the parser, you still need to
understand the structure into which instructions are placed. The
program is parsed into a linked list where each node of the list is an
instance of the structure below.
enum opcodes_t {
// LC3 opcodes
BR, ADD, LD, ST, JSR_JSRR, AND, LDR, STR,
RTI, NOT, LDI, STI, JMP_RET, MUL, LEA, TRAP, SUB,
// assembler directives
ORIG, FILL, BLKW, STRINGZ,
// no operation
UNDEF
};
typedef struct insn_struct insn_t;
struct insn_struct {
enum opcodes_t opcode; // the opcode of instruction (see above)
char* label; // label on this line (can be NULL)
int first_reg; // 1st reg number (-1 implies there isn't one)
int second_reg; // 2nd reg number (-1 implies there isn't one)
int third_reg; // 3rd reg number (-1 implies there isn't one)
int offset_immediate; // immediate or offset value
char* label_ref; // label referenced by BR/LD/ST/JSR/LDI/STI/LEA/FILL
char* string; // string for STRINGZ directive
int n; // neg bit (only for BR)
int z; // zero bit (only for BR)
int p; // pos bit (only for BR)
insn_t* next; // pointer to next instruction in list
};
The fields are:
- The opcode field indicates what kind of instruction this
structure represents (e.g., ADD, AND, etc.).
Opcodes values must come from the opcodes_t enumeration.
Note that the first 16 opcodes correspond directly to actually LC-3
opcode values. For example, the opcode of BR is 0x0, ADD is 0x1, LD is
0x2, etc. SUB does not correspond to the opcode encoding directly.
The remaining ones are for assembler directive
(i.e., they do not correspond to actual instructions). Note the
JSR_JSRR and JMP_RET enum values. We have named these in this way
because, for example, JSR and JSRR both have the same opcode encoding
in an LC-3 instruction. UNDEF is a special opcode meaning the
structure does not correspond to an instruction at all. This is used
when a label appears on a line without an instruction.
- The label field is a pointer to a null-terminated array
of characters (i.e., a C-style string) representing the label
associated with this line. The value of this pointer is NULL for
instructions that are not preceded by a label.
- The first_reg, second_reg, and
third_reg fields specify the register operands for the
instruction. Instructions that require fewer than 3 registers will
have -1 in the unused registers. For example, if an instruction
requires on two register operands
(e.g., NOT), third_reg will be -1.
- The offset_immediate field is used to hold offsets or
immediate values (depending on the instruction). For example,
an ADD immediate instruction will have an immediate value in
this field, and an LD instruction will have a PCoffset in it.
This field is 0 for instructions that don't require an offset or
immediate, but you can't conclude this by looking at the field because
an instruction might have an immediate with a value of 0. For
example, to determine whether the third operand of
an ADD, AND, or MUL instruction is a
register or immediate, examine the value of third_reg; if it
is -1, use the immediate in offset_immediate.
- The label_ref field is a pointer to a null-terminated
array of characters that represents the label referenced by the
instruction (e.g., LABEL1 in "BR LABEL1"). For
instructions not using a label, this pointer will be NULL.
- The string field is a pointer to a null-terminated array of
characters that represents the string operant to the .stringz
directive. It has a NULL value for all instructions except STRINGZ.
- The n, z, and p fields specify the n,
z, and p fields in the BR instruction (1 indicates the
corresponding bit is set in the instruction). These fields are
obviously meaningless for non-BR instructions.
- The next field is a pointer to the next insn_t
structure, representing the next instruction in the program. The next field
has the value NULL if this is the last instruction in the input file.
Passes #1 and #2: Inserting and Looking Up Labels into the Symbol Table
Passes 1 and 2 use a symbol table to store the address that corresponds to all
of the labels in the program. Pass 1 iterates over all the instructions in the
linked list, looking at the label field and adding label/address pairs
to the symbol table. Pass 2 iterates over all the instructions again, but this
time it looks up all label_ref fields in the symbol table and uses the
address of the current instruction and the address of the label to compute the
correct value of offset_immediate. After pass 2,
the label_ref field is unused.
To complete the code for passes 1 and 2, you'll need to build the
following three functions:
- int next_address(insn_t* insn_ptr, int current_address):
This function takes as arguments a pointer to an insn_t structure and
the address associated with that instruction, and it returns the address of the
next instruction. If the opcode is BLKW, the next instruction in
memory is current_pc + size of the block word. If the opcode
is STRINGZ, the next instruction is current_pc + the length of the
string + 1 (for the null terminator). If the opcode is ORIG, the next
instruction is simply the argument to ORIG. If the opcode
is UNDEF, the next instruction is the same as current_pc (because the
insn_ptr doesn't represent an instruction). Otherwise, the next instruction in
memory is current_pc+1.
- void pass_one(insn_t* insn_ptr): This function
looks at each instruction to see if it has a label (that is, the label
field is not NULL). If so, the label is inserted into the symbol
table with the
current_address. This function will need to call the
sym_table_insert() function described in the second part of
this assignment.
- void pass_two(insn_t* insn_ptr): This function looks
at each instruction for a non-NULL label_ref field. If the
label_ref is non-NULL, the label_ref is looked up in
the symbol table to get the address associate with the label. If the
opcode is FILL, this address becomes the
offset_immediate field. Otherwise, we must compute the
PC-relative offset as follows: address associated with symbol in symbol
table - (current_pc + 1). This value is then stored in
offset_immediate. This function will need to call the
sym_table_lookup() function described in the second part of
this assignment.
Pass #3: Instruction Generation
After pass 1 and 2, the list of insn_t structures contain all
the information necessary to generate encoded LC-3 instructions.
Instruction encoding consists of taking the information in the
insn_t structure and encoding it in a single 16-bit value.
It's just the opposite of what we did in the previous assignment.
Like last time, we'll provide some auxiliary routines to make this
task more manageable. You'll need to complete
the pass_three() function described below.
- void write_word_to_file(int word): This function
(which we provide for you) outputs the 16-bit binary representation of
a single instruction (or ORIG value) to the output file.
- int add_field(int bits, int new_bits, int hi_bit, int
lo_bit): This function (also provided for you) takes a value
(bits) and replaces the bits from hi_bit
to lo_bit with the value in new_bits and returns
that value. For example, to encode an instruction opcode we could do
something like the following:
new_insn = add_field(old_insn, insn_ptr->opcode, 15, 12);
Similarly, to encode the destination register in an ADD
instruction we would do the following:
new_insn = add_field(old_insn, insn_ptr->first_reg, 11, 9);
- void pass_three(insn_t* insn_ptr): We've given you
the basic structure of this pass in the pass_three()
function. This function consists principally of a switch statement,
and there is a case for each opcode. Within each case you will uses
the fields of the insn_t structure and add_field()
to build an instruction, which you will then output to a file
with write_word_to_file(). We have completed
the ADD case for you.
Note that your assembler should print nothing to the display unless an
error is encountered.
We needed to add approximately 80 lines of code to mylc3as.c to
implement the assembler (excluding the symbol table, below).
Part #2: The Symbol Table
The symbol table allows you to map strings representing labels (for
example, "START") to addresses (for example, 0x3000).
Symbol Table Interface
We will use the following interface for our symbol table
implementation:
- void sym_table_insert(char* sym, int addr): This
function adds label sym with the address addr to the
symbol table.
- int sym_table_lookup(char* sym): This function
returns the address associated with symbol sym in the symbol table.
We provide a reference implementation of the symbol table
(sym_table.o). This allows you to implement the rest of the
homework without your own symbol table. It also helps in confirming
that your symbol table works properly.
Symbol Table Implementation
Symbol tables can be implemented in many ways (hash tables, linked
lists, arrays, etc.), but we will use a binary tree representation.
Each node in the tree is represented via the structure below.
typedef struct sym_node_struct sym_node_t;
struct sym_node_struct {
char* symbol;
int address;
sym_node_t* left;
sym_node_t* right;
};
The symbol field points to an array of characters (string)
representing the symbol. The address field is the address of
the symbol. The left and right fields are pointers to
the left and right sym_node_t child structures in the tree
(NULL if child does not exist). The "root" of the tree is pointed to by
the global variable g_symboltable_ptr, initially NULL:
sym_node_t* g_symboltable_ptr = NULL; // global symbol table
To provide fast lookup of symbols, we want to build the tree such that
given any node in the tree, all labels reachable from the left child
are alphabetically less than the label of the node ("car" is
alphabetically before "dog", but it is after "ant"). Similarly, all
labels reachable from the right child are alphabetically greater than
the label of the node. The strcmp(char* s1, char* s2)
function from the C standard library does just such an alphabetical
comparison. The function strcmp(char* s1, char* s2) returns
a 0 if the strings are the same, a number greater than 0 if s1 is
alphabetically after s2, and a number less than 0 if s1 is
alphabetically before s2.
To implement the symbol table, implement the following three
functions:
- int sym_table_lookup(char* sym): This function
traverses the tree based on the symbol string it is passed. When the
string is found in a node, the value of the address field is
returned. If a non-matching childless node is encountered, the symbol
does not appear in the symbol table. If symbol SYMBOL cannot
be found, print exactly the following error message (so our
testing scripts can detect it) followed by a new line and exit (to exit
a C program, using the following function call: exit(1);).
ERROR: symbol not found (SYMBOL)
The pseudocode:
initialize a node pointer to g_symboltable_ptr
while the node pointer is not NULL do:
use strcmp() to compare sym and the symbol field of structure pointed to by the node pointer
if the result of the comparison is less than zero:
set the node pointer to be the left pointer of the structure pointed to by the node pointer
otherwise, if the result of the comparison is greater than zero:
set the node pointer to be the right pointer of the structure pointed to by the node pointer
otherwise:
the result of the function is the address field of the structure pointed to by the node pointer
If the symbol was not found, print an error message (using printf) and then exit(1)
Our C implementation of this function is less than 20 lines long.
- void sym_table_insert(char* sym, int addr): This
function begins by allocating and initializing a sym_node_t
(via malloc) and populates the structure's fields with appropriate
values passed into the function. Set the left and right pointers to
NULL. When setting the symbol field, you don't want it to
point to the string passed to the function. The caller of this
function may free the memory containing the string, and then we would
have a dangling pointer. To avoid this, make a copy of the string
using the strdup() function. The function strdup()
returns a pointer to a new copy of the string it is passed and has the
following signature:
char* strdup(char* str);
After allocating a new node, this function traverses the tree based on
the symbol string passed into the function. If the symbol comes
alphabetically before the string in the current node, it traverses the
left child; if the symbols comes after, it traverses the right child; if
the symbol is the same as the string in the current node, someone is
trying to insert a duplicate string into the symbol table, so print
exactly the following error message (followed by new line) and
exit (again, using exit(1);). (NOTE: SYMBOL is the
name of the duplicate symbol.)
ERROR: duplicate symbol added to symbol table (SYMBOL)
When a childless node is encountered, the NULL
pointer is changed to point to the newly allocated sym_node_t
structure. The pseudocode:
allocate and initialize a new sym_node_t structure
if the global pointer g_symboltable_ptr is NULL:
set g_symboltable_ptr to point to the newly allocated structure
otherwise:
set a current node pointer to g_symboltable_ptr
while true do:
use strcmp() to compare sym and the symbol field of structure pointed to by the node pointer
if the result of the comparison is less than zero:
if the left pointer is NULL:
make the left pointer point to the newly allocated structure
break out of the loop
otherwise:
set the node pointer to the left pointer
otherwise, if the result of the comparison is greater than zero:
if the right pointer is NULL:
make the right pointer point to the newly allocated structure
break out of the loop
otherwise:
set the node pointer to the right pointer
otherwise:
print out an error and exit
Our C implementation of this function is less than 30 lines long.
Logistics
Files
This assignment consists of a number of files:
- mylc3as.c: This file contains working code for
functions write_word_to_file(), add_field(), and
main(). It also contains unfinished versions of
next_address(), pass_one(), pass_two(),
pass_three(), which you will need to complete.
- mylc3as.h: This file defines the insn_t
and opcodes_t types. It is included
by mylc3as.c and parser.c. You
will not need to (nor should you) modify this file.
- sym_table.c: This file contains
sym_table_insert(), and sym_table_lookup(). You'll
add a bunch of code to this file (although no new functions are
required).
- sym_table.h: This file contains the function prototypes
for sym_table_insert() and sym_table_lookup(). You
will not need to (nor should you) modify this file.
- parser.o: This file contains the implementation of
parse() which is the function that actually parses the
assembly text. The actual implementation of the parser is in
parser.c, but you should only bother looking at this if you are really
interested. Note that there as some limitations in this parser. For example,
a prefix ('#', 'x', or 'b') must appear before all immediates (i.e., "102" does
not default to decimal; you must specify "#102" instead).
- parser.h: The file parser.h defines the
prototypes used by mylc3as.c.
- test_st1.c and test_st2.c: These are files used to
test your symbol table implementation (see below).
- We've also included some .asm files for testing your assembler. You
should write your own tests, too!
Getting Started
Begin by creating a directory to work in and copying the files we
provide. These files are available at ~cse240/html/handouts/hw9/
on eniac or in a zip file. If you are
working on eniac, you can get these files as follows.
% cd ~
% mkdir cse240hw9
% cd cse240hw9
% cp ~cse240/html/handouts/hw9/* .
Compiling Your Code
Use gcc on eniac.seas.upenn.edu to compile and run
your assembler. Use the following command line:
% gcc -g -Wall mylc3as.c sym_table.c parser.o -o mylc3as
The -g file specifies to add debugging symbols. The -Wall flag turns
on all compiler warnings. The -o flag says that the generated program
should be called mylc3as. If the -o flag is omitted, the default name
is a.out. To avoid retyping this command each time, you can repeat
previous shell commands by using the up arrow key to cycle through
past commands.
To work on your assembler before implementing your symbol table, you
can use our implementation by replacing sym_table.c with
sym_table.o:
% gcc -g -Wall mylc3as.c sym_table.o parser.o -o mylc3as
To run your assembler, specify the input assembly file and the output
object file. For example, to assemble test.asm into test.obj:
% mylc3as test.asm test.obj
Debugging
You will certainly need to use appropriate tools to debug your code
(staring at the source code will only damage your eyesight!). The GDB
debugger will be very useful. You can use it to determine exactly
what statement caused a segmentation fault. You can also set break
points and step through the program's execution.
You will also want to use the Valgrind memory check. Valgrind can help
identify all kinds of memory errors such as the use of uninitialized
memory, dangling pointers, wild pointers, corrupting the stack, memory
leaks, and more. To run your program with Valgrind:
% valgrind mylc3as test.asm test.obj
When errors are encountered, Valgrind prints them to the display.
They are mostly self explanatory, but you may want to refer to the
following sources of more information on Valgrind:
Testing
In previous assignments, we've given you lots of test cases that covered most
of the cases. In this assignment, we're giving you just two basic test cases.
Many, many cases are not covered by these two short .asm files. As such,
you'll need to create your own .asm test cases, assemble them with PennSim, and
verify your code generates the same .obj file. Don't simply run the provided tests
and look surprised when your code doesn't work.
You can test your assembler by comparing its output
to that of the as command in the simulator. For example, suppose you
use as (in PennSim) to assemble t1.asm to t1.obj.
Now use mylc3as to assemble t1.asm and compare the two object
files:
% mylc3as t1.asm my-t1.obj (this will produce a file called my-t1.obj)
% cmp t1.obj my-t1.obj
If cmp does not produce any output, the two files are
identical, so your assembler produced the same output as the reference
assembler. If cmp reports a difference, examine
t1.asm to see what kind of instruction(s) it contains. Use
this a starting point for debugging. If t1.asm contains 1000s
of instructions, you'll have no idea where the problem is. Generate
your own test cases that have 1 or 2 instructions. If cmp
reports that the two object files are different, you will know that the
problem is with those 1 or 2 instructions.
You may also find the hex command useful to look at what hex values
are in a binary file.
% hex t1.obj
You may want to use your disassembler from the last assignment to help debug
your code (by disassembling the object file created by your assembler). You
can also disassemble an object file by loading it into the PennSim simulator
and examining the values in memory.
For part#2, we've provided a little code to test your symbol table.
Naturally, you will want to write more extensive tests. The files
test_st1.c and test_st2.c both contain main()
functions that will test your symbol table functions. To build, do the
following.
% gcc -g -Wall test_st1.c sym_table.c -o test_st1
% gcc -g -Wall test_st2.c sym_table.c -o test_st2
We recommend that you run these tests with Valgrind as follows:
% valgrind ./test_st1
% valgrind ./test_st2
The output of each program will indicate whether the test was passed or
not. If you fail the tests, examine the .c files to determine the
problem. You'll want to create your own tests, too.
Turnin
Please submit your code in files called mylc3as.c and sym_table.c in the
usual way:
% cd cse240hw9
% turnin -c cse240 -p hw9 mylc3as.c sym_table.c
% exit