Homework 5 - CIS501 Fall 2012

Instructor: Prof. Milo Martin
Due: Wednesday, November 21st at 6pm (if you use a "late period", you can turn it in up to Monday, November 26th at 6pm).
Instructions: This is an individual assignment. Sharing of answers or code is strictly prohibited. For the short answer questions, show your work to document how you arrived at your answers.

Overview

This assignment explores the effectiveness of superscalar execution on program performance by extending and integrating your code from the last three homework assignments (the pipeline simulator, the branch simulator, and the cache simulator). This assignment uses the same traces as previous assignments (see Trace Format document).

As before, you'll write a program that reads in the trace and simulates different processor core configurations. You're welcome to write the program in whatever language you like; although we are going to ask you to turn in your source code, the end result of this assignment is the experimental results (the program source code is just a means to that end).

Note

Unlike previous assignments, which have been debugged over the years, this assignment is mostly new, and thus it may have ambiguities in the description and/or errors in the debug traces. If your output isn't matching up exactly with the trace we've provided, it is more plausible than in prior assignments that the bug is in our trace rather than in your simulator.

Hints & Comments

Simulating a Superscalar Pipeline

The first step in exploring the impact on IPC of a superscalar pipeline is to extend the pipeline simulator you already wrote for an earlier assignment to simulate an N-wide superscalar pipeline.

Simplifications. As before, we are going to make some assumptions to keep the simulation simple. First, we'll assume that up to N instructions can execute per cycle without restricting the types of those instructions. Whereas a real processor likely could not start the execution of, say, four divides at once, we will not model this limitation. That is, any mix of N instructions is allowed. Second, we'll assume perfect fetch (the processor can always fetch N instructions per cycle).

Algorithm. Recall that the earlier pipeline assignment used a scoreboard to track how many cycles remain until a specific register value is ready. This same scoreboard algorithm can be extended to model a superscalar processor by querying the scoreboard status for up to the next N instructions before advancing to the next cycle. All this requires is adding a for loop to the original pipeline algorithm. Below is the modified algorithm:

next_instruction = true
while (true):
   for (i=0; i<N; i++):        // ADDED: SIMULATES SUPERSCALAR
       if (next_instruction):
         read next micro-op from trace
         next_instruction = false

       if instruction is "ready" based on scoreboard entries for valid input and output registers:

         if instruction is a load:
           latency = 2
         else:
           latency = 1

         if the instruction writes a register:
           "execute" the instruction by recording when its register(s) will be ready
         next_instruction = true
       else:
         break           // ADDED: BREAK OUT EARLY, NO NEED TO KEEP CHECKING

   advance by one cycle by decrementing non-zero entries in the scoreboard
   increment total cycles counter
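
To make the loop above concrete, here is a minimal runnable Python sketch of it (illustrative only, not the required implementation). It assumes micro-ops are simple (src1, src2, dst, is_load) tuples with -1 meaning "no register", uses a tiny hand-written micro-op stream in place of the real trace reader, and folds in the destination-register check described in the next paragraph. The flags would be handled as one more register (number 49), as described below.

# Minimal N-wide scoreboard loop (illustrative sketch, not the required code).
N = 4                       # superscalar width
scoreboard = [0] * 50       # cycles until each register is ready (49 = flags)
total_uops = 0
total_cycles = 0

# Hypothetical micro-op stream; a real simulator would parse the course trace.
trace_uops = [
    (3, -1, 4, False),      # r4 <- f(r3)
    (4, -1, 5, True),       # load: r5 <- mem[r4], two-cycle latency
    (5, 6, 7, False),       # r7 <- f(r5, r6), must wait for the load
]

def is_ready(uop):
    src1, src2, dst, is_load = uop
    regs = [r for r in (src1, src2, dst) if r != -1]
    return all(scoreboard[r] == 0 for r in regs)

stream = iter(trace_uops)
uop = next(stream, None)

while uop is not None:
    for _ in range(N):
        if uop is None or not is_ready(uop):
            break                            # stall: stop issuing this cycle
        latency = 2 if uop[3] else 1         # loads take 2 cycles, others 1
        if uop[2] != -1:
            scoreboard[uop[2]] = latency     # record when the result is ready
        total_uops += 1
        uop = next(stream, None)
    scoreboard = [max(0, c - 1) for c in scoreboard]   # advance one cycle
    total_cycles += 1

print("uops:", total_uops, "cycles:", total_cycles,
      "IPC:", total_uops / total_cycles)

For the toy three-instruction stream this prints 3 micro-ops in 4 cycles (an IPC of 0.75), since the second and third micro-ops each stall on a not-yet-ready register.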

When is an instruction ready? As before, if any of an instruction's source (input) registers is not ready (a value greater than zero indicates it isn't ready), that instruction must stall. In addition, we want to avoid out-of-order register file writes. To do this, also ensure that the destination (output) register is "ready" by stalling if its scoreboard entry is non-zero. This check prevents out-of-order register writes. It also avoids the situation in which two instructions attempt to write the same register in the same cycle, by stalling the second instruction.

(Technically, the above check is a bit conservative. A stricter check would not stall as long as the destination's scoreboard entry is less than the instruction's latency. That is, the instruction can execute as long as it will write the register after any preceding instruction has updated that register. But, to keep things simple, just check that the destination register's scoreboard entry is zero.)
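
As a concrete illustration (in terms of the hypothetical scoreboard array from the sketch above), the two checks differ by a single comparison:

# Conservative check used in this assignment: the destination register's
# scoreboard entry must be zero before the instruction may execute.
dst_ok = scoreboard[dst] == 0

# Stricter alternative described above: allow execution as long as this
# instruction's write would land strictly after any pending write to dst.
dst_ok_strict = scoreboard[dst] < latency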

Handling flags. One wrinkle in dealing with x86 is that we need to handle the flags (condition codes). In the first pipeline simulator, we ignored the flag registers (condition codes) because all instructions that set the condition codes had a single-cycle latency. However, for the superscalar processor, we want to prevent an instruction that reads the flags from executing in the same cycle in which an earlier instruction updates the flags. That is, we need to enforce data dependencies via the flags. To enforce such dependencies, we'll treat the flags like just another register by assigning them register number 49. Any instruction that reads the flags is then treated as reading register 49; any instruction that writes the flags is treated as writing register 49. Thus, each micro-op now implicitly has up to three source registers and up to two destination registers that must be checked to be "ready", with the latency recorded upon instruction execution.
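
For example, one way to fold the flags into the same scoreboard machinery is to expand each micro-op's register lists before the readiness check. The sketch below assumes a hypothetical micro-op record with explicit reads_flags/writes_flags fields; the field names are illustrative and are not the trace format.

from collections import namedtuple

# Hypothetical micro-op record (field names are illustrative only).
Uop = namedtuple("Uop", "src1 src2 dst reads_flags writes_flags is_load")

FLAGS_REG = 49   # treat the x86 flags as one more architected register

def registers_of(uop):
    # Return (source registers, destination registers), folding in the flags.
    srcs = [r for r in (uop.src1, uop.src2) if r != -1]
    dsts = [uop.dst] if uop.dst != -1 else []
    if uop.reads_flags:
        srcs.append(FLAGS_REG)
    if uop.writes_flags:
        dsts.append(FLAGS_REG)
    return srcs, dsts

The readiness check then requires every register in both lists to have a zero scoreboard entry, and executing the micro-op records its latency for every register in the destination list.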

Experiment 1 - Superscalar Impact on IPC

How much does superscalar execution improve IPC? To answer this question, run your simulator with different superscalar widths. Set the latency of load instructions to two cycles (a one-cycle load-use penalty) and set all other instructions to a latency of just a single cycle (allowing dependent instructions to execute in back-to-back cycles but not in the same cycle).
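
A note on the arithmetic for the questions below (and for the percent-slowdown questions in later experiments): because every configuration runs the same trace, the speedup from configuration A to configuration B is simply IPC_B / IPC_A, and the percent slowdown from A to B is (IPC_A - IPC_B) / IPC_A x 100%. For example, with purely hypothetical numbers, going from an IPC of 0.80 to 1.20 would be a 1.5x speedup, and dropping from 1.00 to 0.90 would be a 10% slowdown.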

  1. Use your simulator to simulate the full trace for varying superscalar widths (1 to 6) and record the micro-op IPC. Plot these results as a line graph with the pipeline width on the x-axis and the micro-op IPC on the y-axis.

  2. What is the speedup from a 1-wide processor to a 2-wide superscalar processor?

  3. What is the speedup from 2-wide to 4-wide?

  4. Even though the peak speedup potential for each is 2x, why is the speedup from 2-wide to 4-wide smaller than from 1-wide to 2-wide?

Adding Branch Prediction

For the next experiment, we are going to consider realistic branch prediction by extending the simulator to also model a gshare branch direction predictor. Obviously, you should reuse your code from the branch prediction assignment.

Simplifications. Assume that the pipeline can predict up to N branches per cycle, has perfect fetch, and has a perfect BTB. In addition, a real machine would only know that a branch was mis-predicted once it executes. However, in our trace-based simulation, we actually know immediately whether the prediction was correct. This allows us to simplify the modeling of branch mis-predictions. Instead of actually flushing the pipeline, we can model a branch mis-prediction by stalling fetch until the branch would have resolved. In fact, the trace doesn't even contain the "wrong path" instructions, so there really isn't anything else we could do when using just a trace.

Algorithm. Whenever the simulator looks to fetch a conditional branch instruction, it will query the branch predictor. As the trace also contains the correct outcome, we can both determine whether the prediction was correct and train the predictor as part of querying it. If the conditional branch was predicted correctly, continue as before. If the prediction was incorrect, set the "branch stall" counter to the mis-prediction penalty as part of executing the instruction. At the end of each cycle, decrement this "branch stall" counter (unless it is already zero). In essence, you can consider this "branch stall" counter as just another counter in the scoreboard. The final change is that before determining whether any instruction is "ready" to execute, the simulator should also check the "branch stall" counter. If the counter is not zero, the instruction cannot yet execute, so break out of the loop. The new pseudocode:

"branch stall" counter = 0  // ADDED: Initialize
next_instruction = true
while (true):
   for (i=0; i<N; i++):
       if (next_instruction):
         read next micro-op from trace
         next_instruction = false

       if "branch stall" > 0:   // ADDED
          break                 // ADDED

       if instruction is "ready" based on scoreboard entries for valid input and output registers:

         if instruction is a load:
           latency = 2
         else:
           latency = 1

         if the instruction writes a register:
           "execute" the instruction by recording when its register(s) will be ready
         next_instruction = true

         if instruction is a conditional branch:                   // ADDED
           if (make_branch_prediction(...) == incorrect):          // ADDED
             "branch stall" counter = misprediction_penalty        // ADDED

       else:
         break

   advance by one cycle by decrementing non-zero entries in the scoreboard
   decrement "branch stall" counter if non-zero                    // ADDED
   increment total cycles counter

Note

Let me reemphasize that in a real machine, the hardware would not know the branch was mis-predicted, so it would go on to fetch instructions down the "wrong path", inject them into the pipeline, and execute them speculatively. As we are working from a trace, we know immediately that we mis-predicted the branch, so we can approximate the performance of speculative execution without actually implementing recovery and such. In fact, fetching down wrong paths is really difficult to model using traces, as we have only the "correct" trace of instructions. This is a limitation of simple trace-based simulation. In contrast, a simulator that actually executes the program can model the performance (and energy) impact of executing down the wrong path.

Experiment 2 - Superscalar and Branch Prediction

To calculate the impact of branch prediction, perform simulations with a gshare predictor. The predictor uses a table of 2^16 two-bit counters (16KB of predictor state, which is a reasonably large predictor) and 16 bits of history. Assume a restart penalty of five cycles (this represents the time it takes to redirect fetch and access the instruction cache).
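
If your predictor code from the earlier assignment isn't handy, a gshare predictor of this size is only a few lines. Below is a minimal Python sketch under the stated parameters (2^16 two-bit counters indexed by the PC XORed with 16 bits of global history); the names are illustrative, and details such as the initial counter values and exactly when the history is updated should follow whatever your branch prediction assignment specified.

HISTORY_BITS = 16
TABLE_SIZE = 1 << HISTORY_BITS        # 2^16 two-bit counters = 16KB of state
counters = [0] * TABLE_SIZE           # two-bit saturating counters
history = 0                           # global branch history register

def make_branch_prediction(pc, taken):
    # Predict the branch at pc, train the predictor, and return True iff correct.
    global history
    index = (pc ^ history) & (TABLE_SIZE - 1)
    prediction = counters[index] >= 2          # 2 or 3 means "predict taken"
    # Train the two-bit counter toward the actual outcome.
    if taken:
        counters[index] = min(3, counters[index] + 1)
    else:
        counters[index] = max(0, counters[index] - 1)
    # Shift the actual outcome into the global history.
    history = ((history << 1) | (1 if taken else 0)) & (TABLE_SIZE - 1)
    return prediction == taken

The make_branch_prediction(...) call in the pseudocode above would map onto a query-and-train routine of this shape.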

  1. Run the same six simulations as in the previous experiment (1-wide to 6-wide superscalar) but with the gshare predictor. Plot these results as another line on the same graph (superscalar width on the x-axis and IPC on the y-axis).

  2. What percent slowdown (versus the IPC from the last experiment) results from modeling the gshare predictor for the scalar (1-wide, non-superscalar) pipeline?

  3. What percent slowdown results from modeling the predictor for the 6-wide superscalar?

  4. Why is the impact larger for the superscalar pipeline than for the scalar pipeline?

Experiment 3 - Non-Perfect Fetch

Thus far, we've assumed the pipeline can always fetch N instructions per cycle, even past taken branches. To make things a bit more realistic, change the simulator so that it cannot fetch past any taken branch. Taken branches include any instruction that causes fetch to be redirected, including unconditional branches, calls, and returns. To model not fetching past a taken branch, break out of the for loop after executing a taken branch instruction; a simple if statement with a "break" in it is all that is needed, as illustrated below.
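
In terms of the earlier Python sketch, this amounts to one extra check at the end of the for-loop body, for example (is_taken_branch is a hypothetical helper that flags any micro-op that redirects fetch):

        total_uops += 1
        redirects_fetch = is_taken_branch(uop)   # taken conditional branch,
                                                 # unconditional branch, call, or return
        uop = next(stream, None)
        if redirects_fetch:
            break    # cannot fetch past a taken branch in this cycle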

  1. Run the same six simulations as in the previous experiment (1-wide to 6-wide superscalar). Plot these results as another line on the same graph (superscalar width on the x-axis and IPC on the y-axis). The IPC of the scalar (non-superscalar) pipeline should be identical to your results from experiment 2. The IPC of the other widths (2-wide through 6-wide) should be lower than with the perfect fetch of the previous experiment.

  2. What percent slowdown does enforcing this more realistic fetch limitation introduce for the 6-wide superscalar pipeline? (Calculate this by comparing the IPC with and without this limitation.)

Note

Although the modeling of instruction fetch is now more realistic, we are still ignoring the important issue of not being able to fetch from two different cache blocks in the same cycle.

Adding Caches

Next, model the impact of cache misses on the performance of the pipeline by integrating your cache simulator from the previous assignment into your pipeline simulator.

Simplifications. First, we're going to focus on the data cache, so assume the instruction cache remains ideal. Second, we're going to ignore stores by assuming a write-through write-non-allocate cache with an unbounded store buffer. Under these assumptions, only load accesses are sent to the cache simulation code.

Algorithm. The only change to the algorithm is determining the latency of a load instruction. Previously all loads had a two-cycle latency (one-cycle load-to-use penalty). Now, if a load hits in the cache it has the same latency as before; if it misses in the cache, assign two cycles plus the miss penalty. The new algorithm becomes:

if instruction is "ready" based on scoreboard entries for valid input and output registers:

  if instruction is a load:
     if (simulate_cache(...) == hit):   // ADDED
       latency = 2
     else:                              // ADDED
       latency = 2 + miss penalty       // ADDED
  else:
    latency = 1

  if the instruction writes a register:
    "execute" the instruction by recording when its register(s) will be ready
  next_instruction = true

  ...
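
If you want a self-contained reference while wiring the cache in, below is a minimal Python sketch of a set-associative, LRU, load-only cache lookup with illustrative names. For the actual experiments, reuse your own cache simulator from the previous assignment; its block handling and replacement policy are what must match.

# Minimal set-associative data cache with LRU replacement (illustrative only).
CAPACITY = 8 * 1024      # bytes
BLOCK_SIZE = 64          # bytes
WAYS = 2
NUM_SETS = CAPACITY // (BLOCK_SIZE * WAYS)   # 64 sets for these parameters

# Each set is a list of tags, most recently used last.
sets = [[] for _ in range(NUM_SETS)]

def simulate_cache(address):
    # Return True on a hit, False on a miss, updating LRU state either way.
    block = address // BLOCK_SIZE
    index = block % NUM_SETS
    tag = block // NUM_SETS
    ways = sets[index]
    if tag in ways:
        ways.remove(tag)
        ways.append(tag)      # move to the most-recently-used position
        return True
    if len(ways) == WAYS:
        ways.pop(0)           # evict the least recently used block
    ways.append(tag)
    return False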

Experiment 4 - Superscalar and Caching

To calculate the impact of a data cache, perform simulations with a two-way set-associative cache with a capacity of 8KB and 64-byte blocks. Assume a miss penalty of 7 cycles (so the total miss latency is 2 + 7 = 9 cycles).
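
(With these parameters, the cache has 8KB / (2 ways x 64 bytes) = 64 sets, so for a byte address the low 6 bits are the block offset, the next 6 bits select the set, and the remaining high-order bits form the tag.)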

  1. Run the same six simulations as in the previous two experiments (1-wide to 6-wide superscalar) but with the data cache. As before, plot these results as another line on the same graph (superscalar width on x-axis and IPC on y-axis).

  2. What percent slowdown (versus the IPC from the last experiment) results from modeling the data cache for the scalar (1-wide, non-superscalar) pipeline?

  3. What percent slowdown results from modeling the data cache for the 6-wide superscalar?

  4. Comparing these results to the relative slowdowns caused by imperfect branch prediction, what does this say about the interaction between cache effects and superscalar execution?

Overall Analysis

  1. In this final configuration, what is the speedup of a 2-wide superscalar pipeline over a 1-wide pipeline?

  2. What is the speedup of 4-wide over a 2-wide pipeline?

  3. Assuming this trace is representative and considering all the potentially negative effects on clock frequency and energy consumption of wider and wider superscalar processors, what pipeline width do the simulation results indicate would be the most promising design point?

Test Cases

As before, we've provided debug traces for this assignment. The format is exactly the same as the previous pipeline debug format, with the addition of the "branch stall" counter as the last column. The format for the debug trace is:

23    38  4 (Ready)    -1 (Ready)     ->  4  -  ----1--------------------------------------------1 -
23    39 -1 (Ready)     7 (Ready)     -> 44  L  ----1---------------------------------------2----1 -
23    40 44 (NotReady) -1 (Ready)     ->  0  -  ----1---------------------------------------2----1 -
24    40 44 (NotReady) -1 (Ready)     ->  0  -  --------------------------------------------1----- -
25    40 44 (Ready)    -1 (Ready)     ->  0  -  1------------------------------------------------- -
25    41  0 (NotReady) -1 (Ready)     -> 44  -  1------------------------------------------------- -
26    41  0 (Ready)    -1 (Ready)     -> 44  -  --------------------------------------------1----1 -
26    42 -1 (Ready)    -1 (Ready)     -> -1  -  --------------------------------------------1----1 -
27    42 -1 (Ready)    -1 (Ready)     -> -1  -  -------------------------------------------------- 5
28    42 -1 (Ready)    -1 (Ready)     -> -1  -  -------------------------------------------------- 4
29    42 -1 (Ready)    -1 (Ready)     -> -1  -  -------------------------------------------------- 3
30    42 -1 (Ready)    -1 (Ready)     -> -1  -  -------------------------------------------------- 2
31    42 -1 (Ready)    -1 (Ready)     -> -1  -  -------------------------------------------------- 1
32    43 -1 (Ready)     7 (Ready)     ->  5  L  -----2-------------------------------------------- -
32    44  5 (NotReady) -1 (Ready)     -> 44  -  -----2-------------------------------------------- -
33    44  5 (NotReady) -1 (Ready)     -> 44  -  -----1-------------------------------------------- -
34    44  5 (Ready)    -1 (Ready)     -> 44  -  --------------------------------------------1----1 -
34    45 -1 (Ready)    -1 (Ready)     -> -1  -  --------------------------------------------1----1 -
35    45 -1 (Ready)    -1 (Ready)     -> -1  -  -------------------------------------------------- -
35    46 -1 (Ready)    -1 (Ready)     -> 45  -  ---------------------------------------------1---- -
35    47 -1 (Ready)    45 (NotReady)  -> 45  L  ---------------------------------------------1---- -
36    47 -1 (Ready)    45 (Ready)     -> 45  L  ---------------------------------------------2---- -
36    48  5 (Ready)    45 (NotReady)  -> 44  -  ---------------------------------------------2---- -
37    48  5 (Ready)    45 (NotReady)  -> 44  -  ---------------------------------------------1---- -
38    48  5 (Ready)    45 (Ready)     -> 44  -  --------------------------------------------1----1 -
38    49 -1 (Ready)    -1 (Ready)     -> -1  -  --------------------------------------------1----1 -

The columns (left to right) are: cycle count, micro-op count, source register one (ready or not ready), source register two (ready or not ready), the destination register, load/store, the current contents of the scoreboard (the cycles until ready for each register, or "-" if zero), and finally the current "branch stall" counter (the cycles until ready, or "-" if zero).
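
When diffing your own debug output against the provided traces, the scoreboard column is the easiest part to get subtly wrong. A small helper along these lines (Python, illustrative; the spacing of the other columns may need adjusting to match) reproduces it:

def scoreboard_string(scoreboard):
    # One character per register 0..49: the remaining cycles until ready, or '-' if zero.
    return "".join(str(cycles) if cycles else "-" for cycles in scoreboard)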

We've provided four annotated results for you, one for each experiment, each using a three-wide superscalar configuration. Instead of just giving you the debug trace of the first 1K micro-ops, we've included the first 10K (as there are more cases to check against).

What to Turn In

Addendum