Due: Thursday, Feb. 23
1. Multi-Cycle Datapath (15 points)
The multi-cycle datapath we went over in class supports add, add-immeidate, load, store, branch-on-equal, and unconditional-jump instructions. For this exercise, you will modify the datapath to support two other instructions.
The first instruction is branch-on-condition-code-zero which replaces branch-on-equal (which compares two registers). To implement this instruction, of course, you also have to implement the zero condition code and modify other instructions to set it. The second instruction you have to add is an LC3-like load-pc-indirect instruction. In MIPS, this will be an I-type instruction with the semantics REGS[rt] = MEM[MEM[PC+SEXT(imm16)]]. For this instruction, you can (and should) use a single data memory port and go around the datapath twice. Describe the necessary changes. Draw the datapath and label any new structures, buses, muxes, and control signals, and any changes to current muxes, or control signals.
2. Multi-Cycle Control (15 points)
How many cycles would the branch-on-condition-code-zero instruction take in a multi-cycle datapath? How many cycles would the load-pc-indirect instruction take?
Write the multi-cycle controller (both the control table and the dispatch table) for the datapath that contains these two instructions (as well as add, add-immediate, load, store, and jump).
3. Pipelining (20 points)
Consider the code for the dot product of 100 element arrays:
for (i=0; i<100; i++)
DP += A[i] * B[i];
Assuming that A, B, DP, and i, are initially in registers $1, $2, $3, and $4 and $3 and $4 are initially 0, the assembly code might look something like this.
L: lw $5, 0($1)
lw $6, 0($2)
mul $6, $5, $6
add $3, $3, $6
addi $1, $1, 4
addi $2, $2, 4
addi $4, $4, 1
slti $5, $4, 100
bnez $5, L
a. Assuming a 5-stage pipeline with no bypassing and a multiply instruction that takes 8 cycles but does not have to go through its own memory stage, draw the pipeline diagram for a single iteration of this loop. You may assume that all branches are predicted correctly. The latency of an iteration can be defined as the time in cycles between the writeback stage of the first instruction in iteration i and the writeback stage of the first instruction in iteration i+1. What is the latency of one iteration in this case? What is the latency of the full loop?
b. Repeat the exercise but for a pipeline with MX bypassing.
c. Repeat the exercise but for a pipeline with full bypassing.
d. Reschedule the code to eliminate some of the data hazards in a single iteration. Redraw the diagram and calculate the iteration and full loop latency on a pipeline with full bypassing.
4. Loop Unrolling (10 points)
Pages 438 and 439 of the textbook describe a scheduling technique called loop unrolling which effectively fuses loop iterations together to create a bigger scheduling scope. A 2-unrolled version the dot-product loop loop would be something like:
for (i=0; i<100; i+=2)
DP += (A[i] * B[i]) + (A[i+1] * B[i+1]);
Write the assembly code for the unrolled loop and schedule it to reduce data hazards. Draw the pipeline diagram on a pipeline with full bypassing. What is the new latency of one iteration? What is the latency of the entire loop?
5. Branch Prediction (10 points)
A simple branch predictor consists of a target predictor which for each static control transfer instruction (i.e., the instruction's PC) stores the target address of the last taken dynamic instance of that instruction, and a direction predictor which for each static conditional branch instruction stores the direction (taken or not taken) of the last dynamic instance of the branch. Simple target and direction prediction work well for some kinds of control transfers and conditional branches and less well for others.
a. Give an example of a control transfer instruction for which simple target prediction would work well and an example of one for which it wouldn't work well. Why is this?
b. Give an example of a conditional branch for which simple direction prediction would work well and an example of one for which it wouldn't work well. (You will need to put the branch in a larger context to answer this question effectively). Why is this?
6. PennSim (40 points)
For this assignment, you will not be required to write any P37X code. Instead, you will write Java code to modify PennSim itself. In addition to the jar file PennSim.jar, we are also giving you the java source file Machine.java. These files are also on eniac in ~amir/cse371/pennsim/. You will modify this source file, compile it (javac -deprecation -classpath PennSim.jar Machine.java) and replace the Machine.class file in the PennSim jarchive with your own copy (jar uf PennSim.jar Machine.class).
The updated PennSim simulator collects some statistics on the running program. These include number of instructions (total and broken down into OS instructions and User instructions), cycles (same breakdown), and CPI (same breakdown again). You can view these stats at any time (when simulation is suspended) by typing pstats at the GUI command line. You can clear the stats by typing cstats.
It is easy to understand PennSim's method for counting instructions, for every instruction executed it increments some counter by 1. But how does PennSim count cycles? What underlying implementation does it assume? A single cycle implementation? A multi-cycle implementation? A pipelined implementation? Well, PennSim assumes a 5-stage pipelined implementation like the one we looked at in class. By default, it assumes that this 5-stage pipeline has no bypassing. But PennSim has startup commandline options [-mx], [-wx], and [-wm] that turn on MX, WX, and WM bypassing, respectively.
Your job is to implement the cycle counting functionality in the function updateStats. This function is called for each executed instruction and gives you the instruction's PC, the 16-bit instruction word itself, a pointer to an object def which is a class that implements the functionality of that instruction type, and two other inputs you will not need for this assignment. The methods on def that should interest you are boolean isStore() and boolean isLoad() which return the true/false values and int getRegDep(Word w, int dep) which returns the names of the input and output registers of the instruction w with the dep parameter specifying which register name you are interested in (0 is the first input register, 1 is the second input register, and 2 is the output register). If a given instruction doesn't have a given register, then getRegDep returns -1 (e.g., if you ask for the second input register of a LDR or the output register of an STR). Use these functions and the variables PennSim.MX_BYPASS, PennSim.WX_BYPASS, and PennSim.WM_BYPASS to calculate the appropriate value for the variable nCyclesStall for each instruction. Notice, to calculate a stall for a given instruction you have to remember some information about the previous two instructions, you can remember this information by adding private variables to the class Machine.
Once the calc programming assignment is done, I will give you a benchmark on which to run your code. Run your code on this file using all 8 possible bypassing configurations (none, MX, WX, WM, MX/WX, MX/WM, WX/WM, full). Report the user-code CPI for each configuration. Which kind of bypassing has the biggest performance impact? Why?
Please also submit your modified Machine.java file via email to the TA.