Due Tuesday, October 28 at the beginning of class
Hazards and Pipelining
1. Consider the following loop:
for (I=1; I < 100; I++)
X[I] = A*X[I-1] + Y[I]
mulf f0,f2,f4 // assume A is in f2
beqz r2, loop
a) Identify the data dependences, RAW, WAR, and WAW. Be sure to include dependence through memory and inter-iteration dependences (dependences of an instruction in one iteration on an instruction in a previous iteration).
b) Draw the pipeline diagram for a scalar, in order pipeline with stages F,D,X,M,W. Be sure to show all data stalls (d*), structural stalls (s*) and propagated pipeline stalls (p*). Assume that the execution of an FP add takes 2 cycles, an FP multiply takes 4 cycles and both operations can be pipelined. There is no M stage for FP operations. Assume branches are predicted taken. The pipeline has no bypassing. What is the latency of a single iteration? (Iteration latency is computed by subtracting the W cycles of the first instructions of consecutive iterations). What is the IPC and the pipeline utilization (actual IPC divided by peak IPC)?
c) Repeat exercise assuming full bypassing.
d) Again with full bypassing, pipeline schedule the loop to reduce RAW stalls. Repeat the diagram and calculation for the scheduled loop.
e) Now unroll the loop once and reschedule the two iterations together to reduce RAW stalls. Be sure to rename registers appropriately to remove any WAR or WAW hazards created by the scheduling. Repeat the diagram and calculation. (Loop unrolling will be covered in class on Thursday).
f) Software pipeline two iterations of the loop and redraw the diagram (software pipelining will also be covered in class on Thursday).
Suggestion: if done by hand pipeline diagrams can be pretty tedious. A good thing to do is to use some software which can deal with tables (e.g., Excel or FrameMaker) or just create a simple table in a text file which you can then cut and paste.
2. H+P Problem A.11
Simulation and Programming
3. The eniac simulator directory contains updated code for a new simulator, sim-DLX.c. Copy this file over to your private simulator directory. sim-DLX models a simple 5 stage pipeline, but cheats to do so. It actually doesn't model a pipeline at all, it just computes delays for different sorts of events like cache misses mis-predicted branches and data hazards. You can learn more about sim-DLX one the SimpleScalar help page.
Using the benchmark gcc, running at 2% sampling with 4% warmup
and 10M instruction sampling granularity.
Configure the simulation to use
a) no branch prediction (i.e., stall on every branch)
b) predict not-taken
c) predict taken with a 2K entry, 2-way set-associative tagged BTB
d) same as configuration c) but add a 16 entry RAS
e) same as configuration d), but substitute the fixed taken prediction for a simple 1K-entry BHT with 1-bit counters
f) same as configuration e), but 2-bit counters instead of 1-bit counters
g) same as configuration f), but add a 10-bit BHR which is XOR'ed with the PC
h) same as configuration g), but using a hybrid of f) and g) schemes with a 1K-entry chooser
4. The simulator does not currently model delays due to RAW stalls. Your job is to add this to the model. You can do this by checking the input registers of the current instruction for matches against the output registers of the previous instructions (Hint: the pdi structures contain the register numbers). The functional part of your code (not the variable declarations) should go under the comment which starts with HW3. Your code should add the appropriate number of delay cycles for the appropriate dependence and should be able to model any combination of X->X and M->X bypassing or lack thereof (Clarification: assume that for any instruction, both inputs must be present at the beginning of the X stage, this should simplify your checking). The variables f_bypass_xx and f_bypass_mx have been created for this purpose. If you think carefully about what it is you are checking for, the whole thing should less than 10 lines (notice how I use pdi_2 and pdi_1 to keep track of the instruction 1 ahead and 2 ahead in the pipeline of the current instruction, whatever it is).
Using the final branch predictor configuration, measures gcc's IPC's using no bypassing, X->X bypassing only, M->X bypassing only, and full bypassing. What do these results tell you about the nature of dependences in gcc?