Homework 3a - CIS501 Fall 2012

Instructor: Prof. Milo Martin
Due: Friday, October 19th at 6pm (If you use a "late period", you can turn it in up to Tuesday, October 23rd at 6pm).
Instructions: This is an individual work assignment. Sharing of answers or code is strictly prohibited. For the short answer questions, show your work to document how you came up with your answers.
Note: This is one part of a two-part assignment (HW3a and HW3b) due on the same day. Please create a separate PDF file for each part and submit them electronically via Canvas.

Part A

Overview

In this assignment you will explore the effectiveness of branch direction prediction (taken vs not taken) on an actual program. Your task is to use the trace format (see Trace Format document) to evaluate the effectiveness of a few simple branch prediction schemes.

To do this, you'll write a program that reads in the trace and simulates different branch prediction schemes and different predictor sizes. You can write the program in whatever language you like; although we are going to ask for a printout of your source code, the end result of this assignment is the experimental results (and not the program source code).

Processing the Trace

In this assignment we want to focus only on conditional branches. In the trace format, a conditional branch:

  1. reads the flags register (that is, conditionRegister == 'R'), and
  2. is either taken or not taken (that is, TNnotBranch != '-').

Ignore all lines in the trace that fail to meet both of the above criteria.

You'll be predicting the taken/not-taken status of each branch. You'll use the program counter from the trace to make the prediction and the taken/not-taken field to determine if the prediction was correct or not.
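The filtering described above can be sketched as a small predicate. Note that the field layout below is hypothetical; consult the Trace Format document for the actual column positions of conditionRegister and TNnotBranch.

```cpp
#include <string>

// Hypothetical record layout for one trace line; the real parsing depends on
// the Trace Format document.
struct TraceRecord {
    unsigned long long pc;   // program counter of the instruction
    char conditionRegister;  // 'R' if the instruction reads the flags register
    char TNnotBranch;        // 'T' (taken), 'N' (not taken), or '-' (not a branch)
};

// A conditional branch both reads the flags and has a taken/not-taken outcome.
bool isConditionalBranch(const TraceRecord& r) {
    return r.conditionRegister == 'R' && r.TNnotBranch != '-';
}
```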

To help you debug your predictors, we've provided some annotated prediction results for the short (1K) trace for various predictors (see below). Comparing your results to these annotated outputs should help ensure your predictor is working properly.

Also, please see below for the skeleton of the code we used to generate the debug traces. The code also checks against the debug trace to ensure each prediction is correct.

Question 1 - Static Branch Predictor

Before looking at dynamic branch prediction, we are going to look at simple "static" branch prediction policies of "always predict taken" and "always predict not taken". Modify the trace reader to calculate the mis-prediction rate (that is, the percentage of conditional branches that were mis-predicted) for these two simple schemes.

  1. What is the prediction accuracy for "always taken"?

  2. What is the prediction accuracy for "always not taken"?

Bimodal Predictor

The simplest dynamic branch direction predictor is an array of 2^n two-bit saturating counters. Each counter holds one of four values: strongly taken (T), weakly taken (t), weakly not taken (n), and strongly not taken (N).

Prediction. To make a prediction, the predictor selects a counter from the table using the low-order n bits of the instruction's address (its program counter value). The direction prediction is made based on the value of the counter.

Note

The easiest way to calculate the index is to modulo (%) the address by the number of entries in the predictor table. However, as the table size is always a power of two, it is almost as easy to use shifts and bit-wise and/or operations to calculate the index.

In the past some students have converted the integer address to a string, extracted a sub-string, and then converted the string back into an integer. Don't do this; it is unnecessary, slow, and ugly.
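As a concrete illustration of the index calculation above, here is one way to write it (a sketch; the function name is our own):

```cpp
#include <cstdint>

// Index into a table of 2^indexBits counters. Because the table size is a
// power of two, pc % tableSize and pc & (tableSize - 1) are equivalent.
uint32_t predictorIndex(uint64_t pc, uint32_t indexBits) {
    uint32_t tableSize = 1u << indexBits;               // number of counters
    return static_cast<uint32_t>(pc & (tableSize - 1)); // keep low-order bits
}
```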

Training. After each branch (correctly predicted or not), the hardware increments or decrements the corresponding counter to bias the counter toward the actual branch outcome (the outcome given in the trace file). As these are two-bit saturating counters, decrementing a counter already at its minimum or incrementing a counter already at its maximum has no effect.

Initialization. Although initialization doesn't affect the results in any significant way, your code should initialize the predictor to "strongly not taken" so that your results match the example traces that we've given you.
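Putting the prediction, training, and initialization rules together, a bimodal predictor can be sketched as below. This is only one possible structure (the class name matches the skeleton's style but the details are our own), assuming counter values 0..3 where 0 = strongly not taken (N), 1 = weakly not taken (n), 2 = weakly taken (t), and 3 = strongly taken (T):

```cpp
#include <cstdint>
#include <vector>

class BimodalPredictor {
public:
    explicit BimodalPredictor(uint32_t indexBits)
        : mask_((1u << indexBits) - 1),
          counters_(1u << indexBits, 0) {}  // initialize to strongly not taken

    // Predict taken when the counter is in one of the two "taken" states.
    bool predict(uint64_t pc) const {
        return counters_[pc & mask_] >= 2;
    }

    // Train toward the actual outcome; the counter saturates at 0 and 3.
    void train(uint64_t pc, bool taken) {
        uint8_t& c = counters_[pc & mask_];
        if (taken && c < 3) ++c;
        else if (!taken && c > 0) --c;
    }

private:
    uint32_t mask_;
    std::vector<uint8_t> counters_;
};
```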

Question 2 - Bimodal Accuracy versus Predictor Size

Analyze the impact of predictor size on prediction accuracy. Write a program to simulate the bimodal predictor. Use your program to simulate varying sizes of bimodal predictor. Generate data for bimodal predictors with 2^2, 2^3, 2^4, 2^5, ..., 2^20 counters. These sizes correspond to predictor index sizes of 2 bits, 3 bits, 4 bits, 5 bits, ..., 20 bits. Generate a line plot of the data using MS Excel or some other graphing program. On the y-axis, plot "percentage of branches mis-predicted" (a metric in which smaller is better). On the x-axis, plot the log of the predictor size (basically, the number of index bits). By plotting the predictor size in terms of number of index bits, the x-axis in essence becomes a log scale, which is what we want for this graph.

Answer the following questions based on the data you collected:

  1. Given a large enough predictor, what is the best mis-prediction rate obtainable by the bimodal predictor?

  2. How large must the predictor be to reduce the number of mis-predictions by approximately half as compared to the better of "always taken" and "always not taken"? Give the predictor size both in terms of number of counters as well as bytes.

  3. At what point does the performance of the predictor pretty much max out? That is, how large does the predictor need to be before it captures almost all of the benefit of a much larger predictor?

Prediction using Global Branch History (gshare)

A gshare predictor is a more advanced dynamic branch predictor that uses the history of recently executed branches to predict the next branch. Gshare uses a history register to record the taken/not-taken history of the last h branches. It does this by shifting the most recent conditional branch outcome into the low-order bit of the branch history register. It then hashes the branch history and the PC of the branch when making predictions.

Prediction. Whereas bimodal uses the lowest n bits of the program counter to index into the table, gshare uses the lowest n bits of the xor of the program counter and the history register. (Note: in C/C++/Java, the "^" operator is the xor operator.) Except for this different index into the table, the prediction mechanism is otherwise unchanged from a bimodal predictor. In essence, a gshare predictor with zero history bits is just a bimodal predictor.

Note

The easiest way to encode the history is to use an integer and then use bit manipulation operators such as shifts, bitwise-and (&), bitwise-or (|), xor (^), etc.

Adjusting the history register can be done in three steps: shifting the bits left by one, clearing the hth bit (to model a bounded-size history of h branches), and setting the lowest bit to zero or one, depending on the taken/not-taken status of the branch.

All in all, this can be done in just a few lines of C or Java code. The following two wikipedia pages might give you some hints:

http://en.wikipedia.org/wiki/Bit_manipulation

http://en.wikipedia.org/wiki/Bitwise_operation
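The three-step history update described above can be sketched as follows (the function name is our own; `historyBits` plays the role of h):

```cpp
#include <cstdint>

// Update the global history register after a branch resolves:
// shift left, mask down to h bits, then record the newest outcome.
uint32_t updateHistory(uint32_t history, bool taken, uint32_t historyBits) {
    history <<= 1;                       // make room for the newest outcome
    history &= (1u << historyBits) - 1;  // keep only the last h branches
    if (taken) history |= 1;             // taken = 1, not taken = 0
    return history;
}
```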

Training. The counter used for prediction (selected by history XOR program counter) is trained just as in a bimodal predictor.

Initialization. Same as bimodal.
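The only change from bimodal, then, is the table index. A sketch of the gshare index calculation (function name our own):

```cpp
#include <cstdint>

// Gshare index: xor the history into the low-order PC bits, then mask to the
// table size. With a zero history register this reduces to the bimodal index.
uint32_t gshareIndex(uint64_t pc, uint32_t history, uint32_t indexBits) {
    uint32_t mask = (1u << indexBits) - 1;
    return (static_cast<uint32_t>(pc) ^ history) & mask;
}
```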

Question 3 - Gshare Accuracy versus Predictor Size

Similar to the experiment in question 2, analyze the impact of predictor size on prediction accuracy for a gshare predictor. Extend your program for simulating bimodal predictors to also simulate gshare predictors. Recall that a gshare predictor's configuration is determined by two numbers: the number of history bits and the size of the predictor table.

Use your program to simulate gshare predictors with 8 history bits and varying table sizes (the same sizes as in the previous question). Add the gshare data to the graph you created in the previous question. That is, your new graph will have two lines.

Answer the following questions based on the data you collected:

  1. Given a large enough predictor, what is the best mis-prediction rate obtainable by this gshare predictor with 8 history bits?

  2. At what table size is the gshare predictor generally better than the simple bimodal predictor?

  3. Explain why gshare is sometimes more accurate than bimodal.

  4. Explain why bimodal is sometimes more accurate than gshare.

Question 4 - Gshare History Length vs Prediction Accuracy

The previous question used a gshare predictor with 8 bits of history. But how much history should a predictor use? Use your simulator to simulate gshare predictors with varying history lengths (zero bits of history through 19 bits of history). Simulate two different predictor sizes: 2^10 and 2^16 predictor entries. Plot one line for each size. The y-axis is as before. The x-axis is now history length.

  1. What is the optimal history length (the history length with the lowest mis-prediction rate) for each of these two predictor sizes?

  2. What do these results say about picking a history length? Generally speaking, what history lengths are the best performing?

Question 5 - Gshare Accuracy versus Predictor Size Revisited

Generate a new series of data in which the gshare history length is the same as the number of index bits. Plot a line of this gshare data on the graph from question #3. This graph should have three lines: the bimodal data, the gshare with a fixed 8-bit history, and the data for gshare where the history length grows with predictor size.

  1. Looking at this new data, what is the new lowest (best) mis-prediction rate for gshare?

  2. When comparing the new gshare data to bimodal, where is the crossover point between them? (That is, what is the predictor size at which gshare becomes better than bimodal?) How does this crossover point compare to the crossover point of gshare with 8-bit history versus bimodal?

  3. In these experiments, we're looking at a trace of just one program. Of course, processors run many different programs. Each of these programs might have a different crossover point. What properties of a program might most significantly impact where the crossover point between bimodal and gshare occurs?

Tournament Predictor

From the previous data, you can see that neither bimodal nor gshare is always best. To remedy this situation, we can use a hybrid predictor that tries to capture the best of both styles of predictors. A tournament predictor consists of three tables. The first and second tables are just normal bimodal and gshare predictors. The third table is a "chooser" table that predicts whether the bimodal or gshare predictor will be more accurate.

The basic idea is that some branches do better with a bimodal predictor and some do better with a gshare predictor. So, the chooser is a table of two-bit saturating counters, indexed by the low-order bits of the PC, that determines which of the other two tables' predictions to return. For each entry in the chooser, the two-bit counter encodes: strongly prefer bimodal (B), weakly prefer bimodal (b), weakly prefer gshare (g), and strongly prefer gshare (G).

Prediction. Access the chooser table using the low-order bits of the branch's program counter address. In parallel, the bimodal and gshare tables are accessed just as in normal predictors, and each generates an independent prediction. Based on the result of the lookup in the chooser table, the final prediction is either the prediction from the bimodal predictor (if the chooser counter indicates a preference for bimodal) or the prediction from the gshare predictor (otherwise).

Training. Both the gshare and bimodal predictors are trained on every branch using their normal training algorithm. The chooser predictor is trained toward whichever of the two predictors was more accurate on that specific prediction:

  • Case #1: The two predictors make the same prediction. In this case, either both predictors were correct or both predictors were wrong. In both cases, the chooser table isn't updated (as we didn't really learn anything).
  • Case #2: The two predictors made different predictions (thus, one of the predictors was correct and the other incorrect). In this case, the chooser counter is trained (incremented or decremented by one) toward preferring the predictor that was correct. For example, if the gshare predictor was correct (and thus the bimodal predictor was incorrect), the chooser counter is adjusted by one toward preferring gshare.

As stated above, the bimodal and gshare tables are trained on every branch, totally independent of the chooser table.

Initialization. The chooser table is initialized to strongly prefer bimodal. The gshare and bimodal tables are initialized as normal.
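The two training cases above can be captured in a single helper (a sketch; the function name and encoding are our own), assuming chooser values 0..3 where 0/1 prefer bimodal (B, b) and 2/3 prefer gshare (g, G):

```cpp
#include <cstdint>

// Update one chooser counter given whether each component predictor was
// correct. The counter moves only when the two predictions disagreed
// (Case #2); when they agreed (Case #1), nothing was learned.
uint8_t trainChooser(uint8_t chooser, bool bimodalCorrect, bool gshareCorrect) {
    if (bimodalCorrect == gshareCorrect)
        return chooser;                  // Case #1: both right or both wrong
    if (bimodalCorrect && chooser > 0)
        --chooser;                       // saturating step toward bimodal
    if (gshareCorrect && chooser < 3)
        ++chooser;                       // saturating step toward gshare
    return chooser;
}
```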

Question 6 - Tournament Predictor Accuracy

Add support to your simulator for the tournament predictor by adding support for the chooser table. If your code was written in a modular way, you should be able to fully re-use your gshare and bimodal code (and thus avoid re-writing or replicating code).

Let's compare the tournament predictor's accuracy versus the data from its two constituent predictors (using the data from the previous question). For now, let's compare a 2^n-counter bimodal (or gshare) predictor versus a tournament predictor with three 2^n-counter tables. (We'll make the comparison more fair in terms of a "same number of bits" comparison in a moment.) As above for the gshare predictor, the gshare component of the tournament predictor should use a history length equal to the number of index bits (log of the table size). The graph will have three lines total.

  1. How does the tournament predictor's accuracy compare to bimodal and gshare? Is the tournament predictor successful in meeting its goal?

  2. Does the tournament predictor improve the overall peak (best-case) accuracy? If so, why? If not, what are the benefits of the tournament predictor?

Question 7 - Tournament Predictor Accuracy -- Same Size Comparison

In the previous question, the tournament predictor was given the unfair advantage of having three times the storage. In this question, run another set of experimental data in which all the predictors at the same location on the x-axis have the same storage capacity. To do this, compare a 2^n-counter predictor to a tournament predictor with the following three table sizes:

  • Chooser table: 2^(n-2) counters
  • Bimodal table: 2^(n-2) counters
  • Gshare table: 2^(n-1) counters

As 2^(n-2) + 2^(n-2) + 2^(n-1) is equal to 2^n, this becomes a fair "same size" comparison.

Add a line to the graph from the previous question with this new "fair tournament" data. The graph now has four lines: bimodal, gshare, tournament, tournament-fair.

  1. Compare the old and new tournament predictor data. What was the impact of moving to the smaller (fairer) table sizes?

  2. Once adjusted to be a fair-size comparison, does the tournament predictor succeed in its goal of being the best of bimodal and gshare?

  3. Given a fixed transistor budget for a branch predictor (say, 4KBs), what predictor approach would you use?

Note

The idea of gshare and tournament (or "combining") branch predictors was proposed by Scott McFarling in his seminal paper published in 1993 entitled: Combining Branch Predictors. I encourage you to look at McFarling's paper. After having completed this assignment, I think you'll find the graphs and other results in this paper familiar.

Test Cases

We're providing you with three annotated results (one for each predictor type) for the shorter (1K) trace file:

Simulator Skeleton

To assist you in this assignment, we've made available a simplified skeleton of the branch prediction simulator we used to generate the homework solutions and to create the above debug trace.

bpred.C (We currently only have a C++ version available)

The code illustrates a few things:

This code is just an example; it won't actually compile without filling in more of the code (we didn't include the files that actually implement the predictors). But the main loop is likely helpful to see. It contains code that reads in the trace and calls the "makePrediction" method on a branch predictor class. It also contains code to spit out the debug traces; you can use this code to write an output file that matches directly. Finally, it will even auto-check each prediction against the above traces and stop the program the first time a prediction doesn't agree with the debug trace.

The program takes in three command line parameters. The first command line parameter is passed in as the log of the size of the predictor (the predictor is 2^N). Thus, the following would simulate a predictor with 1024 counters:

zcat gcc-1K.trace.gz | bpred 10

The next command line parameter, if present, specifies the filename to write the debug output trace:

zcat gcc-1K.trace.gz | bpred 3 bimodal3.debug

The final command line parameter, if present, specifies the filename to read in to match the predictions against:

zcat gcc-1K.trace.gz | bpred 3 bimodal3.debug bimodal3.output

The above command line would read in from "bimodal3.output". When doing this checking, it only checks the actual prediction, not the state of the predictors. Thus, if a prediction is wrong, you'll need to look back in the trace to see what went wrong. Comparing the "bimodal3.output" we gave you to the "bimodal3.debug" file generated by the above code is a really good way to do that.

Finally, the code also contains some hints at how one might use object-oriented programming to encapsulate and reuse components. For example, the code contains the following line:

Predictor* predictor = new TournamentPredictor(3, new BimodalPredictor(3), new GSharePredictor(4, 4));

This creates a tournament predictor from any two arbitrary predictors, in this case bimodal and gshare. All variants of predictors share the same interface, allowing for such arbitrary composition. Of course, you don't have to structure it this way, but I thought you might find it helpful to see how we structured our version of the code.

In fact, the skeleton above is a simplified version of our code. Our actual code runs through the trace once and calls the makePrediction() method on an entire suite of predictors, simulating all the design points in a single pass through the trace. Perhaps that is more effort than it is worth, but it makes data collection fast and formatting the output results for graphing easier, too.

What to Turn In

Hints & Comments


Addendum

  • None, yet.