Lab 3 - Pipelined Processor

CSE 372 (Spring 2007): Digital Systems Organization and Design Lab

Datapath design document due Friday, March 30

Preliminary Demo by Friday, April 13

Final Demo by Friday, April 20 (last day of classes)

Writeup due Friday, April 20 (last day of classes)

This lab is to be done in groups.

Contents

Overview

In this final lab of the semester, you will construct a dual-issue superscalar pipelined processor for the P37X ISA.

Note

This final lab of the semester is the most intellectually challenging of the labs. As with the previous lab, this lab doesn't require that much Verilog code, but reasoning about the pipelining bypass and stall logic requires significant planning.

Part 1: Five-Stage Pipeline

This processor will have the now-familiar five-stage pipeline:

  1. Fetch (F): reads the next instruction from the memory
  2. Decode (D): reads the register file, determines the control signals for the rest of the pipeline stages, and selects the proper immediate values from the instruction.
  3. Execute (X): the ALU performs its calculation and branches are resolved
  4. Memory (M): read or write the data memory
  5. Writeback (W): write the register file

All instructions travel through all five stages. Branches are resolved in the execute stage, so a mispredicted branch has a two-cycle penalty. To reduce the number of branch mispredictions, your processor will use a simple "next-PC" branch predictor during the fetch stage. The data memory is read during the memory stage, so a load followed by a dependent instruction will add a one-cycle stall to the pipeline. A load followed by a store instruction where only the value to be written is dependent should not cause a stall. The pipeline is fully bypassed, so all other instructions execute without stalling.

Part 2: Improving the CPI

The second part of the assignment is to improve your processor's performance by improving its CPI. To help focus your CPI-enhancing efforts, we'll give you suite of simple benchmark programs. Your goal is to improve the CPI of these specific programs. Your final lab report will require you compare the CPI of the baseline 5-stage pipeline specified below to your enhanced processor. You'll look to enhance the CPI of your processor in two ways:

  1. Reduce lost cycles due to branch misprediction by increase the prediction accuracy.
  2. Extending the pipeline to fetch, decode, and execute up to two instructions per cycle.

Exactly how you choose to design and implement this improved processor is up to you.

Note

Your pipeline must keep the full five-stages (including the memory stage), otherwise the clock frequency would suffer. As such, there will still be load-use stalls, and there isn't any really good way to mitigate those stalls.

Tiered Lab Grading

This lab has an unusual tiered grading structure. As mentioned in class, the goal is to both provide a challenge to those students that want it, but yet allow everyone to complete the lab without ridiculous amounts of effort. You as a group will need to decide which tier you wish to shoot for:

The tiers above are a upper bound for your course grade, assuming that you completed all of the other course projects with good scores. This tiered system is a replacement of last year's honors points.

Design Document

Before you begin writing the Verilog code, you're going to need to figure out the design of your new processor. This process should include considering all aspects of the design and testing of the processor. You'll want to consider how the datapath is broken into stages, identify which values are placed in pipeline latches, uncover any tricky bypassing cases, and decide upon a strategy for stalling the pipeline and recovering from branch mispredictions. You'll likely want to include a detailed pipeline diagram and list of what values are placed in pipeline latches between each stage.

Important

The entire goal of the design document is to help (force) you to really think through the design. The design document is not intended to be busy work; I'm forcing you to do this because I think it will save you time in the end. One corollary to this is that you shouldn't spend significant time polishing the document. However, you need to convince the grader that you've actually deeply considered all aspects of the design and that you're adequately prepared to begin the implementation.

The design document will have two parts, and the two parts of the design document directly correspond to the two parts of the assignment.

Part 1: Five-Stage Pipeline

  • Design decisions: Describe in prose any high-level design decisions for your pipelined datapath. This should be a substantial portion of this part of the design document.
  • Pipeline diagram: You'll probably want a pipeline datapath diagram. You can draw out the diagram on a single page or use one page per pipeline stage (it is actually hard to fit it all on one page). If you find it useful, you can use the single-cycle datapath drawn in powerpoint as a starting point.
  • Pipeline latches: Give a table or list that describes which values you're latching between which stages. Give these signals good names that you'll then use in your actual Verilog code. For example, you might want to call the PC in the fetch stage F_PC, the PC in the decode stage D_PC, and latch between the stages the FD_PC register. You'll probably want every signal prepended with either F, D, X, M, or W to distinguish in which stage the signal belongs.
  • Testing plan: How are you going to test your design? Both for functional correctness and performance correctness.
  • Implementation plan: Which group members are going to do what part of the implementation? Which part will be done first? What can be done at the same time? What must be done consecutively?

Part 2: Improving the CPI

In the second part of the design document, you'll want to discuss the CPI improvements you plan and the overall design and implementation plan. You should also explicitly state what tier you're targeting. Discuss the techniques you'd like to use for improving the branch prediction rate. If you're targeting the superscalar design, you'll want to describe the trickiest aspects of the design.

The Pipelined Datapath

The pipelined datapath is similar to the non-pipelined datapath from the previous lab, but has some extensions. The most important change is that each part of the datapath logically needs to live in one pipeline stage or another. Some structures such as the memory and register file are in multiple pipeline stages (F&M and D&W, respectively).

You should re-use most of your controller logic from the single-cycle pipeline by placing the controller module in the decode stage, and then adding the appropriate pipeline registers so that each control signal is used in the correct stage.

Pipeline Registers

In addition to all the structures in the datapath in lab2, the pipelined datapath introduces pipeline "latches" or registers. To implement these registers, use the same Nbit_reg Verilog module as you used for the program counter and internals of the register file.

Note

We'll use both of the terms "pipeline registers" and "pipeline latches" to refer to flip-flop based structures for holding state between pipeline stages. No actual level-sensitive latches are used in the pipeline.

Memory

The memory module is similar to the module you used for the previous lab. As before, we're using the same "global write enable (GWE)" trick, so the memory module will appear as a transparent memory (whereas in reality the block RAMs are internally latched).

To support superscalar fetch and execute, we'll provide you when an updated memory module that includes two fetch ports and two read/write data ports. This will allow your processor to fetch two instructions per cycle and execute two memory operations per cycle. If a store and load are both executed in the cycle to the same address, which ports are used determines if the load (read) will receive the old value or new value. If the load is in the first or "top" port and the store is in the second of "bottom" port, the load will receive the old value. If the store is in the top port and the load is in the bottom port, the load will return the new value. If two stores (writes) to the same address occur in the same cycle, the "lower" store will be the value that persists. Basically, as long as you always route the older instruction to the top data port, this should allow your processor to execute any mixture of two memory operations in the same cycle.

When working on the initial single-issue processor, just ignore the additional ports.

Full Bypassing

The pipeline is fully bypassed, so you should implement M->X bypassing, W->X bypassing, and W->M bypassing. To do this, you'll need to add another mux in front of both ALUs inputs and a mux in front of the data input into the memory. The data bypass logic (the control for these muxes) examines the registers written by prior instructions (looking at the write enable signal to make sure the instruction actually wrote a register). Notice that the destination register and its corresponding write enable signal must already be latched for use in the writeback stage. Be careful not to bypass a value from a prior instruction that either doesn't write a register or an instruction that has been squashed due to a branch misprediction.

You'll also need to bypass around the register file. The register file you created in the last lab is somewhat different than the register file used in the pipeline described in the CSE371 lecture. When a register is read and written in the same cycle, our register file from lab2 returns the old value. For this pipelined design, we want it to return the newly written value. More specifically, if the register file write enable is asserted, and the register being read matches that being written, you'll write to return the new value being written (and not the old value).

You can implement this register file bypass one of two ways: (1) make a wrapper around the existing register file that does this bypassing locally or (2) latch the written value separately and then add an additional input into the ALU-input bypass muxes. The first option is perhaps simpler to reason about, but it adds an extra mux to the register read critical path. Option two is a little bit different than what you've seen in lecture, but it doesn't add a new mux (it just expands an existing mux), so it is perhaps faster. It is your choice on how you want to implement this.

Load Stalling

In this pipeline, loads have an extra load-use delay. Thus, if a load is followed by a dependent instruction that uses the value in the execute stage, the dependent instruction must be stalled for one cycle. If the dependent register value isn't needed until the memory stage (for example, as the store's data input value), it should not be stalled.

Detect the stall during the decode stage by looking both at the current instruction and at the instruction in the execute stage. To implement the stall, insert a no-operation into the pipeline (which will propagate down the pipeline just like a normal instruction). The next cycle, the same logic that detected the stall on the previous cycle will then detect "no stall", because the load will have moved further down the pipeline.

Be sure not to falsely stall any instruction whose encoding might match the previous instruction's output register, but does not actually read the register. For example, the JUMP instruction doesn't read any registers, so it shouldn't be stalled just because the bits in its immediate field happen to match the output register of a prior load. Over-stalling is difficult to detect because it is a "performance bug" and not a "correctness bug". That is, you'll get the right answer, but your processor will have a worse CPI than it should.

Handling Branches

Instructions that manipulate the program counter (TRAP, RTT, JUMP, JUMPR, JSR, JSRR, and BR) need additional handling. As the value of the next program counter of the instruction that follows one of these instructions is determined in the execute stage, this pipeline uses a simple branch predictor to predict the next program counter and speculatively fetch and decode the next two instructions. During the execute stage, the "actual next PC" of the instruction is compared with the predicted PC (the PC of the instruction in the decode stage). Only if the "actual next PC" and predicted PC match, was the prediction correct.

If the predicted "next PC" is correct, execution continues normally. If the predicted "next PC" is incorrect, the instructions currently in the fetch and decode stages must be annulled (also know as "squashed"). To annul these instructions, their control signals should be transformed to convert it into a no-operation before they are latched into the next stage (for example, by using a mux right before the pipeline latches).

Branch Prediction

In this lab you'll implement a few different branch prediction schemes that guess a "next PC":

  1. Predict Not-Taken. Just predict PC+1 as the next PC.

  2. Single-entry Predictor. To make our branch predictor more accurate, we're going to use a simple "next PC" predictor accessed during the fetch stage and updated in the execute stage. To generate a prediction, the predictor takes in the current PC (a 16-bit input) and generates a "next PC" prediction (a 16-bit output). For updating during the execute stage, the predictor takes in the PC of the instruction in the execute stage (a 16-bit input), the actual "next PC" of the instruction (a 16-bit input), and a single-bit write enable (asserted only if the prediction was incorrect). Notice that nowhere are the actual bits of the instruction itself used.

    This simple single-entry predictor consists of two registers: the "tag" register and the "predicted PC" register. Each time the predictor is accessed, it compares the input PC with the contents of the tag register. If the tag matches the input PC, the predicted PC will be the value of the "predicted PC" register. If the tag does not match, the predicted next PC is PC+1. The predictor registers are both updated when an instruction's next PC is mispredicted. Both the "tag" register and "next PC" register is updated on a mispredict.

  3. Your own new and improved predictor. In part 2 of this lab, you're asked to create your own improved predictor to reduce the misprediction rate. The most simple change would be to extend the predictor to track more than one branch (use a multi-entry table, indexed with the low-order bits of PC). Much like a cache, such as table could be direct mapped, set-associative, or even fully-associative (with direct-mapped being by far the easiest). For other ideas on how to improve the branch prediction, you may wish to look at the CIS501 slides on branch prediction starting on slide 48.

For the final writeup, you'll need to compare the performance of these three predictors using the performance counters (described next).

To make this comparison easier, you may want to have your processor select the branch prediction mode using one of the board's switches, allowing you to select the specific predictor on the fly.

Performance Counters

Real processors have performance counters that track various events within a processor to help understand its performance. In this lab, we're going to add four memory-mapped performance counters:

  • Cycle count - 0xFF00: the number of cycles since the processor was last reset
  • Instruction count - 0xFF01: the number of actual instructions executed since the processor was last reset
  • Load stall count - 0xFF02: the number of cycles lost to load-use stalls (that is, the number of cycles in which zero instructions executed because of a load-use stall)
  • Branch stall count - 0xFF03: the number of cycles lost to branch mis-predictions and/or stalls (that is, the number of cycles in which zero instructions executed because of a branch misprediction)

The cycle count is incremented every cycle. For the single-issue pipeline, every cycle one (and only one) of the instruction count, load stall, or branch stall counters is incremented. As such, the sum of these three registers should be equal to the cycle count.

For the dual-issue processor, only one of the instruction count, load stall, or branch stall counters is increased, but the instruction count register may sometimes be incremented by two (for cycles in which two instructions execute). As such, the sum of these three registers will be greater than or equal to the cycle count.

Your processor should update The performance counters during the writeback stage of the pipeline. The performance counters should count an actual "NOOP" instruction as an instruction being executed. That is, it isn't either a branch stall or a load stall cycle.

The current value of these counters is determined by using a LD or LDR instruction to access them. For simplicity, stores to these locations do not change the value of the registers, but stores may still update the contents of memory (it doesn't really matter, as anytime you read from these locations, it will use the value in the counter and not the value in the memory). Basically, these counters can be reset to zero only when the entire system is reset.

Multi-Phased Implementation

To give you some mid-project milestones, this project is broken into multiple phases. This phased implementation is just a suggestion and isn't required to be followed exactly.

Phase A

Build up the basic pipeline in two steps:

  1. Create a non-bypassed pipeline and confirm that it can execute simple sequences of instructions that don't require bypassing or stalling (for example, if multiple no-ops are placed after each instruction).
  2. Create a fully-bypassed pipeline, and make sure that all code you test has an explicit no-operation instruction after every load and two no-operations after every branch instruction.

Phase B

Next, extend the fully-bypassed pipeline with:

  1. Load stalling: stall all instructions following a load instruction. This is simpler than detecting an actual dependent operation, but it is still correct (just slower than it should be).
  2. Branch stalling: instead of implementing speculative execution for branches and the branch predictor, initially stall the pipeline by inserting two no-operation instructions after every branch instruction. Basically, during the decode stage, if there is a branch in the execute or memory stages, insert a no-operation down the pipeline.

Note

After Phase A & B, you should still have a 100% functional processor for the first demo, but it will have a worse CPI than your final design.

Phase C

The next phase is to stall only dependent operations following a load and implement the branch prediction and related logic. When implementing the branches, you may want to first always predict branches as "not taken" (always predict PC+1) to test the basic workings of your pipeline. Once you have that working you can implement the actual predictor described above. In fact, the project writeup asks you to compare the performance of a simple "not taken" with the more accurate branch predictors, so you'll want to be able to do both. You'll also want to create your own enhanced branch predictors to improve the CPI of the benchmark suite programs.

Phase D

Extend the pipeline to allow it to achieve an IPC of greater than 1 (CPI < 1) on some of the benchmark suite programs. It is up to you to design, document, and implement your dual-issue pipeline. For later analysis, you're going to want to have a switch that can disable/enable the dual-issue logic, allowing you to compare the performance of the two designs.

A set of skeleton files containing dual-ported memory and device modules has been provided for you here. Please see the contained files for a description of how to use the new modules.

Testing

Please see the lab 3 testbench page for a file that can be used to test the functional correctness of your pipelined processor.

In addition to testing functional correctness, you may also want some sort of plan for testing timing correctness. That is, if you stall when you should not or incorrectly update the branch predictor, your processor will correctly execute, but still isn't fully "correct". Closer to the final demos we'll give you some programs that will test some of the various timing aspects of your pipeline, but you'll want to do your own testing of the timing issues of the pipeline.

Verilog Restrictions

You can use the same subset of Verilog as in lab2. In addition, you may need to modify the behavioral testbench code to test your processor. Ask a TA or post on the forums if you have any questions about the testbench code.

Coding Tips and Style Suggestions

Demos

You'll have both a "preliminary demo" and a "final demo":

All group members should be present at the demos. The TAs will ask you some questions about your design during both the demos. All group members should understand the entire design well enough to be able to answer any such questions.

Final Demo Specifics

We'll give you a number of test programs that are designed to test for tricky bypassing and stalling cases. These programs will self-verify their correct execution and record the results of the performance counters.

The final demo consists of reveal test cases in the p37os-test.hex file. (the associated symbol table file is also available: p37os-test.sym. The .hex file §contains several tests for correctness that execute at system startup, recording a "P" for pass or "F" for failure of each test. The program also spits out the performance counter readings for the program allow us to verify your stalling, flushing, and branch predictor are working correctly. We'll also run the various game programs options from the menu.

Finally, in the p37os-test.hex input, the menu contains a new "benchmarks" menu choice. When run, this program prints out the IPC (instruction per cycle). Remember, a higher IPC is better. The TAs will record the IPCs you obtained to determine that your branch predictor is significantly better and/or your superscalar processor optains an IPC of higher than one on the expected benchmarks.

You're free to experiment with the tests, disassemble the hex file (by loading it into the simulator), or whatever.

What to Turn In

Design Document: You'll turn in a design document draft as specific above.

Final Writeup: The final writeup should include:

  1. Design update and retrospective. Write a "design retrospective", focusing on what you learned during the design that you wish you would have known when you began the implementation. Even if you scored well on your initial design document, you'll probably have uncovered a lot of additional information in your final lab writeup. In the end, how did you actually implemented bypassing and stalling? How do the performance counters work? Etc.

    Basically, describe how your design changed and evolved over time, what parts you left out of the design document, which parts you got wrong, and assess the effectiveness of your approach to testing and debugging your implementation. This shouldn't (just) be a narrative of what when wrong; instead, it should be a deep analysis of your original design. You also might want to include things that, in retrospect, you wish you could go a month back in time and tell yourself concerning the lab.

  2. CPI analysis: Depending on what tier of the project you completed, you'll have up to four different performance points to consider: (1) single-issue pipeline with "predict not taken", (2) pipeline with simple single-entry branch predictor, (3) the pipeline with your custom predictor, and (4) your dual-issue pipeline with your custom predictor. For each of the programs in the benchmark suite, record the CPI for each of these processor configurations you implemented (a table and/or graph would work best to display the data). For each benchmark, briefly describe and explain the differences between the various configurations, commenting on the relative importance of the enhancements (especially relative to what the program does). Did your custom branch predictor significantly increase performance? When does the dual-issue design help the most? Using the results from the performance counters, how much faster would these programs run if there was a perfect branch predictor? Approximately how much faster would the design be with a zero-cycle load-use penalty?

  3. Verilog code. As before, your Verilog code should be well-formatted, easy to understand, and include comments where appropriate. Some part of the project grade will be dependent on the style and readability of your Verilog, including formatting, comments, good signal names, and proper use of hierarchy.

  4. Answer the following questions:

    • What problems, if any, did you encounter while doing this lab?
    • How many hours did it take you to complete this assignment (total for entire group)?
    • On a scale of 1 (least) to 5 (most), how difficult was this assignment?
  5. Finally, to encourage equal contribution by all group members, each group member should send me (Prof. Martin) an e-mail describing the contribution of each member of your group. That is, for each group member (including yourself), give a percentage. You're also free to give a short paragraph for each group member describing their role and efforts on the project. The sum of the percentages must add up to 100%.

Addendum