Lab 4 - Pipelined Processor

CSE 372 (Spring 2006): Digital Systems Organization and Design Lab

Datapath design document due Friday, April 7

Preliminary Demo by Friday, April 14

Final Demo by Friday, April 21 (last day of classes)

Writeup due Friday, April 21 (last day of classes)

This lab is to be done in pairs (groups of two).

This lab is worth 30 points.

Contents

Overview

In this final lab of the semester, you will construct a pipelined processor for the P37X ISA. This processor will have the now-familiar five-stage pipeline:

  1. Fetch (F): reads the next instruction from the memory
  2. Decode (D): selects the proper immediate values from the instruction, determines the control signals for the rest of the pipeline stages, and reads the register file
  3. Execute (X): the ALU performs its calculation and branches are resolved
  4. Memory (B): read or write the data memory
  5. Writeback (W): write the register file

As branches are resolved in the execute stage, a mispredicted branch has a two-cycle penalty. As the data memory is read during the memory stage, a load followed by a dependent instruction will add a one-cycle stall to the pipeline. A load followed by a store instruction where only the value to be written is dependent should not cause a stall. The pipeline is fully bypassed, so all other instructions execute without stalling. All instructions travel through all five stages.

Note

This final lab of the semester is the most intellectually challenging of the labs. As with the previous lab, this lab doesn't require that much Verilog code, but reasoning about the pipelining bypass and stall logic requires significant planning.

Design Document

Before you begin writing the Verilog code, you're going to need to figure out the pipelined datapath. This process should including a detailed pipeline diagram and list of what values are placed in pipeline latches between each stage.

This design document is due on Friday, April 7 by 7pm (turn it into the TAs in the lab).

The Pipelined Datapath

The pipelined datapath is similar to the non-pipelined datapath, but has some extensions. The most important change is that each part of the datapath logically needs to live in one pipeline stage or another. Some structures such as the memory and register file are in multiple pipeline stages (F&M and D&W, respectively).

You should re-use most of your controller logic from the single-cycle pipeline by placing the controller module in the decode stage, and then adding the appropriate pipeline registers so that each control signal is used in the correct stage.

Pipeline Registers

In addition to all the structures in the datapath in lab3, the pipelined datapath introduces pipeline "latches" or registers. To implement these registers, use the same Nbit_reg Verilog module as you used for the program counter and internals of the register file.

Note

We'll use both of the terms "pipeline registers" and "pipeline latches" to refer to flip-flop based structures for holding state between pipeline stages. No actual level-sensitive latches are used in the pipeline.

Memory

Important: The values coming out of the instruction memory and data memory are already implicitly latched internally to the bram module. Thus, you won't need to add pipeline latches for these values.

Note

This pipelined design no longer needs the pseudo multi-cycle hack using the global write enable from the last lab. The global write enable is still useful for single-stepping the processor on the board, but the global write enable could just be hardwired to "on" (in your testbench, for example) to achieve the highest performance.

Another option for getting the memory to work with your pipelined design is to manipulate the read-enable signals for the data and instruction memory to stall the pipeline correctly. You will need to get the new memory and top-level modules that have read-enable support, and to modify your datapath with two additional single-bit output ports: IMEM_RE and DMEM_RE. These files are marked as version 1.2.

Full Bypassing

As the pipeline is fully bypassed, you should implement M->X bypassing, W->X bypassing, and W->M bypassing. To do this, you'll need to add another mux in front of both ALUs inputs and a mux in front of the data input into the memory. The data bypass logic (the control for these muxes) examines the registers written by prior instructions (looking at the write enable signal to make sure the instruction actually wrote a register). Notice that the destination register and its corresponding write enable signal must already be latched for use in the writeback stage. Be careful not to bypass a value from a prior instruction that either doesn't write a register or an instruction that has been squashed.

You'll also need to bypass around the register file. The register file from lab3 is somewhat different than the register file used in the pipeline described in the CSE371 lecture. When a register is read and written in the same cycle, our register file from lab3 returns the old value. For this pipelined design, we want it to return the newly written value. More specifically, if the register file write enable is asserted, and the register being read matches that being written, you'll write to return the new value being written (and not the old value).

You can implement this register file bypass one of two ways: (1) make a wrapper around the existing register file that does this bypassing locally or (2) latch the written value separately and then add an additional input into the ALU-input bypass muxes. The first option is perhaps simpler to reason about, but it adds an extra mux to the register read critical path. Option two is a little bit different than what you've seen in lecture, but it doesn't add a new mux (it just expands an existing mux), so it is perhaps faster. It is your choice on how you want to implement this.

Load Stalling

In this pipeline, loads have an extra load-use delay. Thus, if a load is followed by a dependent instruction that uses the value in the execute stage, the dependent instruction must be stalled for one cycle. If the dependent register value isn't needed until the memory stage (for example, as the store's data input value), it should not be stalled.

Detect the stall during the decode stage by looking both at the current instruction and at the instruction in the execute stage. To implement the stall, insert a no-operation into the pipeline (which will propagate down the pipeline just like a normal instruction). The next cycle, the same logic that detected the stall on the previous cycle will then detect "no stall", because the load will have moved further down the pipeline.

Be sure not to falsely stall any instruction whose encoding might match the previous instruction's output register, but does not actually read the register. For example, the JUMP instruction doesn't read any registers, so it shouldn't be stalled just because the bits in its immediate field happen to match the output register of a prior load. Over-stalling is difficult to detect because it is a "performance bug" and not a "correctness bug". That is, you'll get the right answer, but your processor will have a worse CPI than it should.

Handling Branches

Instructions that manipulate the program counter (TRAP, RTT, JUMP, JUMPR, JSR, JSRR, and BR) need additional handling. As the value of the next program counter of the instruction that follows one of these instructions is determined in the execute stage, this pipeline uses a simple branch predictor to predict the next program counter and speculatively fetch and decode the next two instructions. During the execute stage, the "actual next PC" of the instruction is compared with the predicted PC (the PC of the instruction in the decode stage). Only if the "actual next PC" and predicted PC match, was the prediction correct.

If the predicted "next PC" is correct, execution continues normally. If the predicted "next PC" is incorrect, the instructions currently in the fetch and decode stages must be annulled. To annul these instructions, they should be transformed into no-operations before they are latched into the next stage (for example, by using a mux right before the pipeline latches).

Branch Predictors

In this lab you'll implement two simple branch predictors that guess a "next PC":

  1. Predict Not-Taken. Just predict PC+1 as the next PC.

  2. Table Predictor. To make our branch predictor more accurate, we're going to use a simple "next PC" table predictor accessed during the fetch stage and updated in the execute phase. To generate a prediction, the predictor takes in the current PC (a 16-bit input) and generates a "next PC" prediction (a 16-bit output). For updating during the execute phase, the predictor takes in the PC of the instruction in the execute phase (a 16-bit input), the actual "next PC" of the instruction (a 16-bit input), and a single-bit write enable (asserted only if the prediction was incorrect). Notice that nowhere are the actual bits of the instruction itself used.

    This simple table-based predictor will be an 8-entry direct-mapped "next PC" predictor. The predictor consists of two memory arrays: the "tag" array and the "predicted PC" array. Each time the predictor is accessed, it uses the lower bits of the input PC to determine which entry to examine. If that entry's tag matches the input PC, the predicted PC will be the value of the corresponding entry in the "predicted PC" array. If the tag does not match, the predicted next PC is PC+1. The predictor tables are both updated when an instruction's next PC is mispredicted. Both the "tag" array entry for the PC that was mispredicted is updated, and the "next PC" array (for the same entry). The table has 8 entries to allow you to borrow Verilog code from your register file to create a two-ported 8-entry by 16-bit memory.

For the final writeup, please compare the performance of these two predictors using the performance counters (described next). We'll give you some P37X code for you to perform this analysis.

To make this comparision easier (and for a couple of honors points), you may want to have your processor select the branch prediction mode using one of the board's switches, allowing you to enable and disable the predictor on the fly.

Performance Counters

Real processors have performance counters that track various events within a processor to help understand its performance. In this lab, we're going to add four memory-mapped performance counters:

  • Cycle count - 0xFF00: the number of cycles since the processor was last reset
  • Instruction count - 0xFF01: the number of actual instructions executed since the processor was last reset
  • Load stall count - 0xFF02: the number of cycles lost to load-use stalls
  • Branch stall count - 0xFF03: the number of cycles lost to branch mis-predictions and/or stalls

The cycle count is incremented every cycle. Every cycle one (and only one) of the instruction, load stall, or branch stall counters is incremented. As such, the sum of these three registers should be approximately equal to the cycle count. All of these counts should be updated during the writeback stage.

The current value of these counters is determined by using a LD or LDR instruction to access them. For simplicity, stores to these locations do not change the value of the registers, but stores may still update the contents of memory (it doesn't really matter, as anytime you read from these locations, it will use the value in the counter and not the value in the memory). Basically, these counters can be reset to zero only when the entire system is reset.

Multi-Phased Implementation

To give you some mid-project milestones, this project is broken into multiple phases.

Phase One

Build up the basic pipeline in two steps:

  1. Create a non-bypassed pipeline and confirm that it can execute simple sequences of instructions that don't require bypassing or stalling.
  2. Create a fully-bypassed pipeline, and make sure that all code you test has an explicit no-operation instruction after every load and two no-operations after every branch instruction.

Phase Two

Next, extend the fully-bypassed pipeline with:

  1. Load stalling: stall all instructions following a load instruction. This is simpler than detecting an actual dependent operation, but it is still correct (just slower than it should be).
  2. Branch stalling: instead of implementing speculative execution for branches and the branch predictor, initially stall the pipeline by inserting two no-operation instructions after every branch instruction. Basically, during the decode stage, if there is a branch in the execute or memory stages, insert a no-operation down the pipeline.

With these two simplifications, you should still have a 100% functional processor for the first demo, but it will still have a worse CPI than your final design.

Phase Three

The final phase is to stall only dependent operations following a load and implement the branch prediction and related logic. When implementing the branches, you may want to first always predict branches as "taken" to test the basic workings of your pipeline. Once you have that working you can implement the actual predictor described above. In fact, the project writeup asks you to compare the performance of a simple "always predict PC+1" with the more accurate branch predictor described above, so you'll want to be able to do both.

Testing

You're on your own as far as testing for this lab, so you must devise your own testing strategy. Mind what you have learned. Save you it can.

You'll probably want to use the testbench we've gave you in lab3 as a starting point. In addition to testing functional correctness, you'll also need to test timing correctness. That is, if you stall when you should not or incorrectly update the branch predictor, your processor will correctly execute, but still isn't "correct". You'll need to carefully test for such "timing bugs" as well.

Verilog Restrictions

You can use the same subset of Verilog as in lab3. In addition, you may need to modify the behavioral testbench code to test your processor. Ask a TA if you have any questions about the testbench code.

Coding and Style Suggestions

Demos

You'll have both a "preliminary demo" and a "final demo":

The TAs will ask you some questions about your design during both the demos. All group members should understand the entire design well enough to be able to answer any such questions.

Final Demo Specifics

The final demo consists of 13 test cases in the lab4_testcases.hex input file. The switches on the expansion board determine which test will be run each time (set the switches to binary values from zero to 12). If the test passes, the low-order 4 expansion LEDs are lit. Otherwise, a single LED lit signifies a failure. You'll need to re-download the bitfile to re-run the test (by setting the switches differently first). You can also get the assembly source for this input file: lab4_testcases.asm.

We're also going to be recording the output of the performance counters of each of the tests, and grading the "performance" correctness of your design based on the performance counters. The expansion switches also determine which of the performance counter results are displayed on the seven-segment display. The low-order 4 switches control, from left to right: insn count, cycle count, load-use stall cycles, and branch mispredict stall cycles. Without any other combination of switches, the stall cycle (cycle count - insn count) count is displayed.

You're free to begin experimenting with the tests, disassembling the hex file or whatever.

What to Turn In

Design Document: You'll turn in a design document draft as specific above.

Final Writeup: The final writeup should include:

  1. Design document. An updated version of your design document
  2. Verilog code. As before, your Verilog code should be well-formatted, easy to understand, and include comments where appropriate. Some part of the project grade will be dependent on the style and readability of your Verilog, including formatting, comments, good signal names, and proper use of hierarchy.
  3. CPI analysis: Answer the following questions based on the data from the performance counters for modified version of breakout that we'll give you:
    • What is the CPI of the processor with the predict-not-taken predictor and the table predictor? How much faster is the table-based predictor than the predict-not-taken predictor?
    • What would the CPI be with a perfect branch predictor? How much faster would that design be than the table-based predictor?
    • How much faster would a design with a zero-cycle load-use penalty be than your pipeline with the table-based predictor?
  4. Clock frequency analysis: Answer the following questions based on the post-place-and-route timing report:
  • Based on the place-and-route timing report, what is the maximum clock frequency obtainable by your design?
  • Based on the place-and-route timing report, which pipeline stage is likely the bottleneck in your design?
  • What design changes might you make to further increase the clock frequency?
  1. Finally, answer the following questions:
  • What problems, if any, did you encounter while doing this lab?
  • How many hours did it take you to complete this assignment (total for both partners)?
  • On a scale of 1 (least) to 5 (most), how difficult was this assignment?

Honor Points

In addition to the honor points possibilities described in lab3, there are lots of other possible honor point projects. As described above, you'll need to demonstrate any non-trivial honors points projects directly to me (Prof. Martin).

Performance Reconfigurability

  • Branch predictor selection [2 points]. Change your hardware to allow one of the board's switches or pushbuttons to control the mode of the branch predictor (either use the table predictor or always predict PC+1). This change would allow you to actually observe the impact in real time on branch prediction on a running program.
  • Load stall selection [2 point]. Use a hardware switch to toggle between a mode that stalls any instruction after a load versus just stalling dependent instructions.
  • Bypassing selection [5 points]. Use a hardware switch to toggle between bypassing and stalling in your design. This change requires implementing the pipeline stall logic for a non-bypassed design, which is not trivial.

Better CPI

  • Better branch predictor [2 to 10 points, depending on the changes and analysis]. Implement a better branch predictor and show on some code (using the performance counters) that it really is better. Possible improvements to the predictor include (1) make it larger, (2) make it two-way set associative (to avoid conflicts in the table), or (3) add hysteresis in the form of two-bit saturating counters. For more information on branch prediction, see the pipelining lecture notes from CIS501 (starting with around slide 45).

  • Conditional move instruction [5 points]. Most real ISAs have a "conditional move" instruction (or CMOV) that conditionally copies one register into another. For example, CMOVz R1, R2, R3 might copy the value of R2 into R1 if R3 is zero. This instruction allows some hard-to-predict branches to be converted into CMOV instructions (that don't need to be predicted). For example:

    if (r3 == 0) {
      r1 = r2;
    }
    

    Could be written as:

    CMOVz R1, R2, R3
    

    In essence, a CMOV is implemented by disabling the register file write enable when the condition isn't satisfied. Pipelining a CMOV is interesting because the bypassing is now data-dependent, but by changing the pipelined control bits as the condition is resolved in the execute stage should take care of most of the corner cases.

  • Superscalar execution [15 points]. Change your pipeline to fetch two instruction per cycle and execute them in parallel if they are not dependent. If you're interested, see me for more information.

Better Clock Frequency

  • Deeper pipeline [15 points]: increase the clock frequency by splitting the memory access into two stages. This would increase the branch mis-predict penalty to three and the load-use penalty to two. However, it could significantly improve the frequency of your design. This change also makes bypassing a bit more challenging.
  • Instruction cache [10 points]: replace the instruction memory accesses with an access to a smaller (faster) instruction cache.
  • Data cache [15 points]: replace the data memory accesses with an access to a smaller (faster) data cache. The data cache is harder than the instruction cache because you need to handle writes (either via write-back or write-update).
  • Clock tuning [5 to 15 points]: find some other way to substantially improve the clock frequency of your design. Number of points is dependent on the depth of analysis and amount of improvement.

New Instructions

Use one or more of the "unused" instruction encodings to implement some new instruction.

  • Conditional move [5 points]: see above
  • Hardware divide [5 points]: add a hardware integer divide instruction to the ISA.
  • Floating point [12 points]: add hardware floating point support using your own 16-bit floating point representation. 3 points for each of the following instructions (do one or more): floating point add, floating point multiply, integer to floating point conversion, and floating point to integer conversion instructions.
  • SIMD instructions [5 points]: Add some instructions that operate on two 8-bit values in a single 16-bit word, much like Intel's MMX/SSE/SSE2 instructions.

Support for New Hardware Devices

  • New devices [5 to 15 points]: Make use of additional devices on our FPGA boards. The boards have support for more I/O devices than we're actually using right now. Examples include: audio input and output, mouse input, and reading/writing a compact flash card. See me or the TAs for guidance and documentation on these devices.

Software

You also have the option to pursue some software projects.

  • Video character output [5 points]: Use a basic bit-mapped font to allow characters to be displayed using our video output device. A basic 8x8 bitmapped font is available.
  • P37X code [points will vary]: Write some other non-trivial program in P37X assembly that makes use of the devices and such. Possibilities might include a simple OS, more games, simple 3D graphics, etc.
  • P37X compiler [points will vary]: Write a compiler to translate some language into the P37X assembly. If you're taking Prof. Lewis's CSE341, retargeting your COOL compiler to P37X would be a great little project.

Addendum

[March 29] - Stated that honor points demonstrations need to be performed when Prof. Martin is present.

[April 5] - Added description of simple PC+1 predictor as well as the table-based predictor. You'll need to compare these two different designs. Also added the honors points section on "Performance Reconfigurability".

[April 5] - Added the writeup question on CPI analysis of branches and load delay slots.

[April 19] - Clarification: The performance counters should count the "NOOP" instruction as an instruction being executed. That is, it isn't either a branch stall or a load stall cycle. Remember, the three performance counters for: (1) instruction executed, (1) load stalls, and (2) branch stalls should add up to the cycle counter.

[April 19] - The revised version of your design document should include all the of issues that you may have failed to mention in your original design document. Even if you got a 5/5 on your first design document, you'll probably need a lot of additional information in your final lab writeup. How did you handle the block RAM issues? How did you actually implemented bypassing and stalling? How do the performance counters work? Etc. Don't focus as much on datapath pictures as much as good, tight, prose descriptions of the some the high-level and other important design choices you made along the way. Recall they you also have some other questions and analysis as part of the writeup as well.

[April 20] - Here's the source files for breakout: breakout.asm breakos.asm breakbg.obj

[April 20] - Added the description of and link to the final demo hex test file.

[April 21] - Added the source code for the lab4 testcase .hex file. Also, here is a patched version of memory_with_devices.v that makes the memory devices latched.

[April 21] - Bugs in the lab4 testcases (tests 7 and 8) have been resolved, and the .asm and .hex files updated.