Datapath design document due Friday, April 7
Preliminary Demo by Friday, April 14
Final Demo by Friday, April 21 (last day of classes)
Writeup due Friday, April 21 (last day of classes)
This lab is to be done in pairs (groups of two).
This lab is worth 30 points.
In this final lab of the semester, you will construct a pipelined processor for the P37X ISA. This processor will have the now-familiar five-stage pipeline:
As branches are resolved in the execute stage, a mispredicted branch has a two-cycle penalty. As the data memory is read during the memory stage, a load followed by a dependent instruction will add a one-cycle stall to the pipeline. A load followed by a store instruction where only the value to be written is dependent should not cause a stall. The pipeline is fully bypassed, so all other instructions execute without stalling. All instructions travel through all five stages.
Note
This final lab of the semester is the most intellectually challenging of the labs. As with the previous lab, this lab doesn't require that much Verilog code, but reasoning about the pipelining bypass and stall logic requires significant planning.
Before you begin writing the Verilog code, you're going to need to figure out the pipelined datapath. This process should include a detailed pipeline diagram and a list of what values are placed in the pipeline latches between each stage.
This design document is due on Friday, April 7 by 7pm (turn it in to the TAs in the lab).
The pipelined datapath is similar to the non-pipelined datapath, but has some extensions. The most important change is that each part of the datapath logically needs to live in one pipeline stage or another. Some structures such as the memory and register file are in multiple pipeline stages (F&M and D&W, respectively).
You should re-use most of your controller logic from the single-cycle design by placing the controller module in the decode stage, and then adding the appropriate pipeline registers so that each control signal is used in the correct stage.
In addition to all the structures in the datapath in lab3, the pipelined datapath introduces pipeline "latches" or registers. To implement these registers, use the same Nbit_reg Verilog module as you used for the program counter and internals of the register file.
Note
We'll use both of the terms "pipeline registers" and "pipeline latches" to refer to flip-flop based structures for holding state between pipeline stages. No actual level-sensitive latches are used in the pipeline.
Important: The values coming out of the instruction memory and data memory are already implicitly latched internally to the bram module. Thus, you won't need to add pipeline latches for these values.
Note
This pipelined design no longer needs the pseudo multi-cycle hack using the global write enable from the last lab. The global write enable is still useful for single-stepping the processor on the board, but the global write enable could just be hardwired to "on" (in your testbench, for example) to achieve the highest performance.
Another option for getting the memory to work with your pipelined design is to manipulate the read-enable signals for the data and instruction memory to stall the pipeline correctly. You will need to get the new memory and top-level modules that have read-enable support, and to modify your datapath with two additional single-bit output ports: IMEM_RE and DMEM_RE. These files are marked as version 1.2.
As the pipeline is fully bypassed, you should implement M->X bypassing, W->X bypassing, and W->M bypassing. To do this, you'll need to add another mux in front of both ALU inputs and a mux in front of the data input into the memory. The data bypass logic (the control for these muxes) examines the registers written by prior instructions (looking at the write enable signal to make sure the instruction actually wrote a register). Notice that the destination register and its corresponding write enable signal must already be latched for use in the writeback stage. Be careful not to bypass a value from a prior instruction that either doesn't write a register or has been squashed.
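The priority among these bypass sources matters: if the instructions in both the memory and writeback stages wrote the register being read, the younger (memory-stage) value is the correct one. As an illustration, here is a small Python model of the select logic for one ALU input; the function and signal names are invented for this sketch, not taken from any provided Verilog.

```python
def alu_input_select(rs, x_m_rd, x_m_we, m_w_rd, m_w_we):
    """Choose the source for one ALU input in the execute stage.

    rs      -- register number read by the instruction in execute
    x_m_rd  -- destination register of the instruction in memory
    x_m_we  -- that instruction's register write enable
    m_w_rd  -- destination register of the instruction in writeback
    m_w_we  -- that instruction's register write enable
    Returns which mux input to select: "M->X", "W->X", or "regfile".
    """
    # The instruction in memory is younger, so its value takes priority.
    if x_m_we and x_m_rd == rs:
        return "M->X"
    if m_w_we and m_w_rd == rs:
        return "W->X"
    return "regfile"
```

Note that a squashed instruction or one that writes no register simply has its write enable deasserted, so it never matches.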
You'll also need to bypass around the register file. The register file from lab3 is somewhat different from the register file used in the pipeline described in the CSE371 lecture. When a register is read and written in the same cycle, our register file from lab3 returns the old value. For this pipelined design, we want it to return the newly written value. More specifically, if the register file write enable is asserted, and the register being read matches the one being written, you'll want to return the new value being written (and not the old value).
You can implement this register file bypass one of two ways: (1) make a wrapper around the existing register file that does this bypassing locally or (2) latch the written value separately and then add an additional input into the ALU-input bypass muxes. The first option is perhaps simpler to reason about, but it adds an extra mux to the register read critical path. Option two is a little different from what you've seen in lecture, but it doesn't add a new mux (it just expands an existing mux), so it is perhaps faster. The choice of implementation is yours.
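Option (1) amounts to a read port with an internal bypass. The following Python model illustrates the idea (the function and argument names are hypothetical; in Verilog this would be one mux per read port inside a wrapper module):

```python
def regfile_read(regs, read_addr, write_en, write_addr, write_data):
    """Read port with an internal bypass: if the same register is being
    written this cycle, return the new value instead of the stored one."""
    if write_en and read_addr == write_addr:
        return write_data          # forward the value being written
    return regs[read_addr]         # otherwise, the stored value
```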
In this pipeline, loads have an extra load-use delay. Thus, if a load is followed by a dependent instruction that uses the value in the execute stage, the dependent instruction must be stalled for one cycle. If the dependent register value isn't needed until the memory stage (for example, as the store's data input value), it should not be stalled.
Detect the stall during the decode stage by looking both at the current instruction and at the instruction in the execute stage. To implement the stall, insert a no-operation into the pipeline (which will propagate down the pipeline just like a normal instruction). The next cycle, the same logic that detected the stall on the previous cycle will then detect "no stall", because the load will have moved further down the pipeline.
Be sure not to falsely stall any instruction whose encoding might match the previous instruction's output register, but does not actually read the register. For example, the JUMP instruction doesn't read any registers, so it shouldn't be stalled just because the bits in its immediate field happen to match the output register of a prior load. Over-stalling is difficult to detect because it is a "performance bug" and not a "correctness bug". That is, you'll get the right answer, but your processor will have a worse CPI than it should.
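To illustrate the decode-stage stall decision, here is a Python sketch. It assumes the decode logic can classify which source registers the instruction actually needs in the execute stage (as opposed to only in the memory stage, like a store's data input); the function and its arguments are invented for this illustration:

```python
def load_use_stall(x_is_load, x_rd, decode_execute_srcs):
    """True if the instruction in decode must stall one cycle: the
    instruction in execute is a load, and decode needs its result in
    the execute stage.

    decode_execute_srcs lists only registers read in execute; a register
    needed only in the memory stage (e.g., a store's data input) is
    covered by the W->M bypass and is deliberately excluded, so it
    causes no stall. An instruction like JUMP contributes an empty
    list, so immediate bits that happen to match x_rd never stall it.
    """
    return x_is_load and x_rd in decode_execute_srcs
```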
Instructions that manipulate the program counter (TRAP, RTT, JUMP, JUMPR, JSR, JSRR, and BR) need additional handling. As the next program counter of such an instruction is not determined until the execute stage, this pipeline uses a simple branch predictor to predict the next program counter and speculatively fetch and decode the next two instructions. During the execute stage, the "actual next PC" of the instruction is compared with the predicted PC (the PC of the instruction in the decode stage). The prediction was correct only if the "actual next PC" and the predicted PC match.
If the predicted "next PC" is correct, execution continues normally. If the predicted "next PC" is incorrect, the instructions currently in the fetch and decode stages must be annulled. To annul these instructions, they should be transformed into no-operations before they are latched into the next stage (for example, by using a mux right before the pipeline latches).
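The compare-and-annul step in execute can be sketched as follows; this Python model is only an illustration of the idea, with invented names:

```python
def resolve_branch(actual_next_pc, decode_pc):
    """Branch resolution in the execute stage.

    The predicted PC is simply the PC of the instruction currently in
    decode. On a mismatch, the instructions in fetch and decode are
    annulled (turned into no-ops before the next pipeline latch) and
    fetch resumes at the actual next PC -- a two-cycle penalty.
    Returns (mispredicted, annul_fetch, annul_decode).
    """
    mispredicted = actual_next_pc != decode_pc
    return mispredicted, mispredicted, mispredicted
```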
In this lab you'll implement two simple branch predictors that guess a "next PC":
Predict Not-Taken. Just predict PC+1 as the next PC.
Table Predictor. To make our branch predictor more accurate, we're going to use a simple "next PC" table predictor accessed during the fetch stage and updated in the execute stage. To generate a prediction, the predictor takes in the current PC (a 16-bit input) and generates a "next PC" prediction (a 16-bit output). For updating during the execute stage, the predictor takes in the PC of the instruction in the execute stage (a 16-bit input), the actual "next PC" of that instruction (a 16-bit input), and a single-bit write enable (asserted only if the prediction was incorrect). Notice that the actual bits of the instruction itself are never used.
This simple table-based predictor will be an 8-entry direct-mapped "next PC" predictor. The predictor consists of two memory arrays: the "tag" array and the "predicted PC" array. Each time the predictor is accessed, it uses the lower bits of the input PC to determine which entry to examine. If that entry's tag matches the input PC, the predicted PC will be the value of the corresponding entry in the "predicted PC" array. If the tag does not match, the predicted next PC is PC+1. When an instruction's next PC is mispredicted, both the "tag" array entry and the "predicted PC" array entry for that PC are updated. The table has 8 entries to allow you to borrow Verilog code from your register file to create a two-ported 8-entry by 16-bit memory.
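As a behavioral sketch, the table predictor might look like the following Python model. The class and method names are invented for illustration; a real implementation would hold only the upper tag bits in two 8-entry Verilog memories rather than storing full PCs.

```python
class NextPCPredictor:
    """8-entry direct-mapped next-PC predictor: a tag array plus a
    predicted-PC array, indexed by the low 3 bits of the PC.
    Falls back to PC+1 on a tag miss."""

    def __init__(self, entries=8):
        self.entries = entries
        self.tags = [None] * entries       # full PC used as tag here
        self.next_pcs = [0] * entries

    def predict(self, pc):
        """Fetch-stage lookup: 16-bit PC in, 16-bit prediction out."""
        i = pc % self.entries
        if self.tags[i] == pc:
            return self.next_pcs[i]
        return (pc + 1) & 0xFFFF           # default: PC+1, 16-bit wrap

    def update(self, pc, actual_next_pc, mispredicted):
        """Execute-stage update; the write enable (mispredicted) is
        asserted only when the prediction was wrong, and both arrays
        are written together."""
        if mispredicted:
            i = pc % self.entries
            self.tags[i] = pc
            self.next_pcs[i] = actual_next_pc
```

Note how two PCs that share their low 3 bits (e.g., 0x10 and 0x18) conflict for the same entry; the tag check makes the loser fall back to PC+1 rather than using the other PC's target.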
For the final writeup, please compare the performance of these two predictors using the performance counters (described next). We'll give you some P37X code for you to perform this analysis.
To make this comparison easier (and for a couple of honors points), you may want to have your processor select the branch prediction mode using one of the board's switches, allowing you to enable and disable the predictor on the fly.
Real processors have performance counters that track various events within a processor to help understand its performance. In this lab, we're going to add four memory-mapped performance counters:
The cycle count is incremented every cycle. Every cycle one (and only one) of the instruction, load stall, or branch stall counters is incremented. As such, the sum of these three registers should be approximately equal to the cycle count. All of these counts should be updated during the writeback stage.
The current value of these counters is determined by using a LD or LDR instruction to access them. For simplicity, stores to these locations do not change the value of the registers, but stores may still update the contents of memory (it doesn't really matter, as anytime you read from these locations, it will use the value in the counter and not the value in the memory). Basically, these counters can be reset to zero only when the entire system is reset.
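The counter update rule can be modeled like this Python sketch (the names are invented; in hardware, each counter would be a register incremented during the writeback stage):

```python
def tick(counters, event):
    """Update the performance counters for one cycle.

    event says what reached writeback this cycle: 'insn' (an executed
    instruction, including NOOP), 'load_stall', or 'branch_stall'.
    Exactly one of the three event counters is incremented each cycle,
    so their sum tracks the cycle count.
    """
    counters['cycles'] += 1
    counters[event] += 1
```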
To give you some mid-project milestones, this project is broken into multiple phases.
Build up the basic pipeline in two steps:
Next, extend the fully-bypassed pipeline with:
With these two simplifications, you should still have a 100% functional processor for the first demo, but it will have a worse CPI than your final design.
The final phase is to stall only dependent operations following a load and implement the branch prediction and related logic. When implementing the branches, you may want to first always predict branches as "taken" to test the basic workings of your pipeline. Once you have that working you can implement the actual predictor described above. In fact, the project writeup asks you to compare the performance of a simple "always predict PC+1" with the more accurate branch predictor described above, so you'll want to be able to do both.
You're on your own as far as testing for this lab, so you must devise your own testing strategy. Mind what you have learned. Save you it can.
You'll probably want to use the testbench we gave you in lab3 as a starting point. In addition to testing functional correctness, you'll also need to test timing correctness. That is, if you stall when you should not or incorrectly update the branch predictor, your processor will still execute correctly, but it isn't "correct". You'll need to carefully test for such "timing bugs" as well.
You can use the same subset of Verilog as in lab3. In addition, you may need to modify the behavioral testbench code to test your processor. Ask a TA if you have any questions about the testbench code.
You'll have both a "preliminary demo" and a "final demo":
The TAs will ask you some questions about your design during both the demos. All group members should understand the entire design well enough to be able to answer any such questions.
The final demo consists of 13 test cases in the lab4_testcases.hex input file. The switches on the expansion board determine which test will be run each time (set the switches to binary values from zero to 12). If the test passes, the low-order 4 expansion LEDs are lit. Otherwise, a single lit LED signifies a failure. You'll need to re-download the bitfile to re-run a test (setting the switches differently first). You can also get the assembly source for this input file: lab4_testcases.asm.
We're also going to be recording the output of the performance counters of each of the tests, and grading the "performance" correctness of your design based on the performance counters. The expansion switches also determine which of the performance counter results are displayed on the seven-segment display. The low-order 4 switches select, from left to right: insn count, cycle count, load-use stall cycles, and branch mispredict stall cycles. For any other combination of switches, the stall cycle count (cycle count - insn count) is displayed.
You're free to begin experimenting with the tests, disassembling the hex file or whatever.
Design Document: You'll turn in a design document draft as specified above.
Final Writeup: The final writeup should include:
- Based on the place-and-route timing report, what is the maximum clock frequency obtainable by your design?
- Based on the place-and-route timing report, which pipeline stage is likely the bottleneck in your design?
- What design changes might you make to further increase the clock frequency?
- What problems, if any, did you encounter while doing this lab?
- How many hours did it take you to complete this assignment (total for both partners)?
- On a scale of 1 (least) to 5 (most), how difficult was this assignment?
In addition to the honors points possibilities described in lab3, there are many other possible honors points projects. As described above, you'll need to demonstrate any non-trivial honors points projects directly to me (Prof. Martin).
Better branch predictor [2 to 10 points, depending on the changes and analysis]. Implement a better branch predictor and show on some code (using the performance counters) that it really is better. Possible improvements to the predictor include (1) make it larger, (2) make it two-way set associative (to avoid conflicts in the table), or (3) add hysteresis in the form of two-bit saturating counters. For more information on branch prediction, see the pipelining lecture notes from CIS501 (starting with around slide 45).
Conditional move instruction [5 points]. Most real ISAs have a "conditional move" instruction (or CMOV) that conditionally copies one register into another. For example, CMOVz R1, R2, R3 might copy the value of R2 into R1 if R3 is zero. This instruction allows some hard-to-predict branches to be converted into CMOV instructions (that don't need to be predicted). For example:
if (r3 == 0) { r1 = r2; }
Could be written as:
CMOVz R1, R2, R3
In essence, a CMOV is implemented by disabling the register file write enable when the condition isn't satisfied. Pipelining a CMOV is interesting because the bypassing is now data-dependent, but changing the pipelined control bits as the condition is resolved in the execute stage should take care of most of the corner cases.
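Behaviorally, CMOVz could be modeled by this Python sketch (the register file is a plain list and the names are invented; the write-enable gating is the key point):

```python
def cmovz(regs, rd, rs, rt):
    """CMOVz rd, rs, rt: copy rs into rd only if rt is zero.
    Implemented by gating the register file write enable on the
    condition, rather than by muxing the written value."""
    write_enable = (regs[rt] == 0)
    if write_enable:
        regs[rd] = regs[rs]
    return write_enable
```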
Superscalar execution [15 points]. Change your pipeline to fetch two instructions per cycle and execute them in parallel if they are not dependent. If you're interested, see me for more information.
Use one or more of the "unused" instruction encodings to implement some new instruction.
You also have the option to pursue some software projects.
[March 29] - Stated that honor points demonstrations need to be performed when Prof. Martin is present.
[April 5] - Added description of simple PC+1 predictor as well as the table-based predictor. You'll need to compare these two different designs. Also added the honors points section on "Performance Reconfigurability".
[April 5] - Added the writeup question on CPI analysis of branches and load delay slots.
[April 19] - Clarification: The performance counters should count the "NOOP" instruction as an instruction being executed. That is, it counts as neither a branch stall nor a load stall cycle. Remember, the three performance counters for (1) instructions executed, (2) load stalls, and (3) branch stalls should add up to the cycle counter.
[April 19] - The revised version of your design document should include all of the issues that you may have failed to mention in your original design document. Even if you got a 5/5 on your first design document, you'll probably need a lot of additional information in your final lab writeup. How did you handle the block RAM issues? How did you actually implement bypassing and stalling? How do the performance counters work? Etc. Don't focus as much on datapath pictures as on good, tight prose descriptions of the high-level and other important design choices you made along the way. Recall that you also have some other questions and analysis as part of the writeup as well.
[April 20] - Here's the source files for breakout: breakout.asm breakos.asm breakbg.obj
[April 20] - Added the description of and link to the final demo hex test file.
[April 21] - Added the source code for the lab4 testcase .hex file. Also, here is a patched version of memory_with_devices.v that makes the memory devices latched.
[April 21] - Bugs in the lab4 testcases (tests 7 and 8) have been resolved, and the .asm and .hex files updated.