Due Tuesday, November 25 at the beginning of class
Out of Order Execution
Remember the SAXPY loop:
#0: ldf X(r1),f0
#1: mulf f0,f2,f4
#2: ldf Y(r1),f6
#3: addf f4,f6,f8
#4: stf f8,Z(r1)
#5: addi r1,#4,r1
#6: blt r1,r2,#0
General assumptions: branches are always predicted as taken, mulf latency is 5 cycles, addf latency is 2 cycles (both pipelined), and all other operations execute in a single cycle. The machine is 1-wide (i.e., only one operation can be in any stage in a given cycle, unless it's in a pipelined functional unit like one of the floating point units).
Your job is to draw timing diagrams like those I have shown in
class for two full iterations of the loop (12 instructions total).
In each case, you need to show only the instruction status portion, as
shown for the first two instructions below. You do not need to show
the functional unit status tables or reservation stations, since these
change cycle-by-cycle.
| instruction |
|
|
|
|
| ldf X(r1),f0 |
|
|
|
|
| mulf f0,f2,f4 |
|
|
|
|
NOTE: When drawing the timing diagrams, please try to be legible. The best idea is to use some software (e.g., word or excel) which can draw nice tables which you can then fill in.
1. Draw the timing diagram for a machine that uses a scoreboard with an infinite number of entries and an infinite number of functional units.
2. Draw the timing diagram for a machine that uses Tomasulo's algorithm using an infinite number of reservation stations.
3. Draw the timing diagram for a machine that uses Tomasulo's algorithm with an infinite number of reservation stations, but a 5-entry ROB.
4. Draw the timing diagram for the same machine as in question 3, but this time assume that the first instance of the branch blt was mis-predicted, such that the ROB had to be flushed and restarted before executing the second loop iteration.
5. If you did the diagrams correctly, you should have noticed that in a few instances there multiple instructions ready to issue in a single cycle. Because the machine you are hand-simulating is 1-wide, you had to choose one of these instructions and delay the issue of the other to the next cycle. When there are more instructions ready to issue than there are issue slots, the "scheduling policy" kicks in. The default scheduling policy is "oldest-first", if two instructions are ready to issue, the one that occurs earlier in the program wins. What is the rationale for this policy? (Hint: think about the scenario in which you have a ROB). The "oldest-first" policy is not always the best one from a performance standpoint, however. Can you construct an example in which the "oldest-first" policy might produce sub-optimal performance? (I want you to construct one, not to answer yes or no). What scheduling policy would be better in this case?
6. Consider a symmetric 4-wide (each pipeline stage can operate on up to 4 instructions in parallel) implementation of the MIPS R10000 algorithm with a 32-entry ROB and 16 reservation stations. Assume that each instruction can have a maximum of 2 register inputs and one register output.
Simulation (no actual programming this time)
For this part of the homework, you will be using sim-R10K, which simulates a dynamically scheduled superscalar processor with reservation stations, a ROB, MIPS R10K style register renaming, speculative execution with branch prediction, a full cache hierarchy, and load speculation. For the purposes of this assignment, you are not required to hack the simulator, only to run it. However, it may be a good idea to look at the file sim-R10K.c, just to see how the simulation is structured. This is especially important for those few of you who would like to use the timing simulator in your projects. The simulator has an outer loop that increments a cycle counter. Every cycle, it models each pipeline stage from the back of the pipeline to the front: commit, writeback, schedule, rename, and fetch. The simulator uses a "dirty" trick to speed up simulation: it executes in order during the rename stage and simply uses the subsequent out-of-order stages to model timing. You can learn more about sim-R10K on the SimpleScalar help page.
Configure the simulator to use the same cache parameters as you used in question 3 of homework #3 and the branch prediction parameters from part h of the same. Use the benchmark gcc with 2% sampling, 4% warmup and a 10M sampling granularity.
Here are some other settings for your runs:
8. Measure the impact of load scheduling policy. Rerun the out-of-order experiment with "opportunistic" load scheduling. What is the speedup?
9. Measure the impact of ROB/MOB size. Rerun the out-of-order experiments
with both "conservative" and "opportunistic" scheduling two more
times each: once with the ROB/MOB sizes doubled, and once with
those sizes halved. What are the speedups/slowdowns in each case? With
the six simulations you have now (3 ROB sizes and two load scheduling
policies), can you make a statement about which is more important to performance? Is
there some interaction between ROB size and load scheduling policy
(i.e., does a large ROB require one policy over another)?