Instructor: | Prof. Milo Martin |
---|---|
Design Document Due: | |
Wednesday, April 13th (turn in PDF via Blackboard by 7pm) | |
Preliminary Demo Due: | |
Thursday, April 21st (you must demo it in the KLab by 7pm) | |
Final Demo Due: | Wednesday, April 27th (you must demo it in the KLab by 7pm) |
Writeup Due: | Thursday, April 28th (turn in PDF via Blackboard by 7pm) |
Instructions: | This lab should be done in groups of two or three. |
This lab is worth 12% of your final grade.
In this lab, you will design and build a scalar five-stage pipelined processor for the LC4 ISA. As described below, the pipeline is fully bypassed, includes a simple branch predictor, and implements MOD and DIV.
As described below (see "What to Turn In"), the first step is creating a design document. That is, before you begin writing the Verilog code, you're going to need to figure out the design of your new processor. This process should include considering all aspects of the design and testing of the processor. You'll want to consider how the datapath is broken into stages, identify which values are placed in pipeline latches, uncover any tricky bypassing cases, and decide upon a strategy for stalling the pipeline and recovering from branch mispredictions. You'll likely want to include a detailed pipeline diagram and list of what values are placed in pipeline latches between each stage.
Important
The entire goal of the design document is to help (force) you to really think through the design. The design document is not intended to be busy work; I'm requiring the document because I think it will save you time in the end. One corollary to this is that you shouldn't spend significant time polishing the document. However, you need to convince the grader that you've actually deeply considered all aspects of the design and that you're adequately prepared to begin the implementation.
The pipeline is a scalar five-stage pipelined processor. To improve the CPI, the pipeline is fully bypassed (including the NZP condition codes) and the design includes a simple branch predictor.
This processor will use the "standard" five-stage pipeline:
All instructions travel through all five stages.
In the standard five-stage fully bypassed pipeline, the only stall is a single-cycle load-to-use penalty. There are a few tricky cases to consider. First, you should only stall an instruction after the load if the subsequent instruction is actually dependent on the load (either by register value or NZP condition code):
LDR R1, R2, 5 // load r1 <- [R2+10] // stall ADD R3, R1, R4 // add r3 <= R1+R4
Second, a load followed by a dependent store may or may not stall, depending on if the output of the load feeds the address generation (stall) or the data input of a store (no stall):
LDR R1, R2, 5 // load R1 <- [R2+10] // no stall STR R1, R3, 6 // store R1 -> [R3+6] LDR R1, R2, 5 // load R1 <- [R2+10] // stall STR R7, R1, 6 // store R7 -> [R1+6]
Third, you'll need to bypass the NZP condition codes (which are read only by BR instructions). The tricky case is that not all instructions write the condition codes, so you'll need to use the NZP bits based on the last instruction that writes the condition codes prior to the BR instruction.
Note: technically a "BRnzp" or just "BR" instruction wouldn't technically need to stall for condition codes (as they are unconditional branches), however LC4 also has a PC-relative JMP that is used for unconditional branching by convention, so all the "BR" instructions can be treated uniformly.
The register file you used in the single-cycle design is different than the register file used in the pipeline described in the CIS371 lecture. Unlike what we've been assuming in class, a write to the register file will not effect the register being read in the same cycle, so you will need to explicitly add another stage of "bypassing" around the register file.
Thus, when a register is read and written in the same cycle, the register file you used in your single-cycle design returns the old value. For this pipelined design, you want it to return the newly written value. More specifically, if the register file write enable is asserted, and the register being read matches the register being written, you'll write to return the new value being written (and not the old value).
One simple way to do this is to make a new module that contains the existing register file and the additional logic for this "local" bypassing. You'll also need to consider how this impacts the bypassing of the NZP condition codes as well.
Branches are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty. The branch predictor is an eight-entry direct-mapped branch target buffer (BTB). Each of the eight entries in the BTB has both a 16-bit "tag" and a 16-bit "next PC" field.
Prediction. On an instruction fetch, the BTB is indexed using the lowest three bits of the PC. If the tag matches (the tag for that entry is equal to the PC) then the predicted next PC is whatever was in the "next PC" field. If the tag doesn't match, the predicted PC is just PC+1. Recall, the BTB is accessed during the fetch stage, and it is accessed for every instruction fetched.
Mis-prediction, recovery, and training. Control instructions resolve during the execute phase. If the predicted PC is wrong (doesn't match the actual PC), three things occur:
- The two instructions following the mis predicted branch are squashed (nullified)
- Fetch is re-started the next cycle at the correct PC
- The branch predictor is trained. Using the lowest three bits of the mis-predicted PC to index into the BTB, the tag is set to the PC and the next PC field is set to the calculated PC.
Note
We suggest that you start by always predicting PC+1 and testing your design with that. Once that is working, getting the predictor working isn't so difficult.
Note
Notice that the register file module you already have is an eight-entry RAM with 16-bit values... so there is an opportunity for code re-use.
Unlike the previous lab, your pipelined processor must also support the DIV and MOD instructions. As DIV and MOD arithmetic operations require substantial amounts of logic, these are to be implemented as multi-cycle operation (Note: you should implement MUL as a normal single-cycle operation). In LC4, DIV and MOD operate on unsigned values. To assist you, the following C code gives the basic algorithm for a 16-iteration 16-bit divide/modulo.
int divide(int dividend, int divisor) { int quotient = 0; int remainder = 0; if (divisor == 0) { return 0; } for (int i = 0; i < 16; i++) { remainder = (remainder << 1) | ((dividend >> 15) & 0x1); if (remainder < divisor) { quotient = (quotient << 1) | 0x0; } else { quotient = (quotient << 1) | 0x1; remainder = remainder - divisor; } dividend = dividend << 1; } return quotient; }
Although DIV and MOD on actual processors can take dozens of cycles, your DIV and MOD should be a four-cycle operation. In essence, each cycle your processor can do the work performed by four loop iterations above. Four cycles is likely enough cycles that it shouldn't impact the cycle time of the processor (too much), but long enough that you'll need to handle MOD and DIV as special four-cycle operation.
To further assist you, we're also giving you some Verilog code. The file lc4_divider_four_iter.v (below) contains the Verilog for a single iteration and four iterations together. Use four of the four_iteration modules (in subsequent cycles) to implement MOD and DIV.
`timescale 1ns / 1ps module lc4_divider_one_iter(dividend_in, divisor_in, remainder_in, quotient_in, dividend_out, remainder_out, quotient_out); input [15:0] dividend_in, divisor_in, remainder_in, quotient_in; output [15:0] dividend_out, remainder_out, quotient_out; wire [15:0] remainder = {remainder_in[14:0],dividend_in[15]}; wire condition = (remainder < divisor_in); assign quotient_out = condition ? {quotient_in[14:0], 1'b0} : {quotient_in[14:0], 1'b1}; assign remainder_out = condition ? remainder : remainder - divisor_in; assign dividend_out = {dividend_in[14:0], 1'b0}; endmodule module lc4_divider_four_iter(dividend_in, divisor_in, remainder_in, quotient_in, dividend_out, remainder_out, quotient_out); input [15:0] dividend_in, divisor_in, remainder_in, quotient_in; output [15:0] dividend_out, remainder_out, quotient_out; wire [15:0] iter1_dividend_out, iter1_remainder_out, iter1_quotient_out; wire [15:0] iter2_dividend_out, iter2_remainder_out, iter2_quotient_out; wire [15:0] iter3_dividend_out, iter3_remainder_out, iter3_quotient_out; wire [15:0] iter4_dividend_out, iter4_remainder_out, iter4_quotient_out; wire [15:0] iter1_dividend_in = dividend_in; wire [15:0] iter2_dividend_in = iter1_dividend_out; wire [15:0] iter3_dividend_in = iter2_dividend_out; wire [15:0] iter4_dividend_in = iter3_dividend_out; wire [15:0] iter1_remainder_in = remainder_in; wire [15:0] iter2_remainder_in = iter1_remainder_out; wire [15:0] iter3_remainder_in = iter2_remainder_out; wire [15:0] iter4_remainder_in = iter3_remainder_out; wire [15:0] iter1_quotient_in = quotient_in; wire [15:0] iter2_quotient_in = iter1_quotient_out; wire [15:0] iter3_quotient_in = iter2_quotient_out; wire [15:0] iter4_quotient_in = iter3_quotient_out; lc4_divider_one_iter iter1 (iter1_dividend_in, divisor_in, iter1_remainder_in, iter1_quotient_in, iter1_dividend_out, iter1_remainder_out, iter1_quotient_out); lc4_divider_one_iter iter2 (iter2_dividend_in, divisor_in, iter2_remainder_in, iter2_quotient_in, iter2_dividend_out, iter2_remainder_out, iter2_quotient_out); lc4_divider_one_iter iter3 (iter3_dividend_in, divisor_in, iter3_remainder_in, iter3_quotient_in, iter3_dividend_out, iter3_remainder_out, iter3_quotient_out); lc4_divider_one_iter iter4 (iter4_dividend_in, divisor_in, iter4_remainder_in, iter4_quotient_in, iter4_dividend_out, iter4_remainder_out, iter4_quotient_out); assign dividend_out = iter4_dividend_out; assign remainder_out = iter4_remainder_out; assign quotient_out = iter4_quotient_out; endmodule
Depending on your target CPI (and subsequent target grade for the lab, as described in the "grading" section below), you have several options for integrating MOD and DIV with the pipeline:
- Don't implement DIV and MOD (and not receive full credit, see "Grading" below).
- Stall all subsequent instructions (in essence, disable pipeline while a MOD or DIV is executing).
- Stall subsequent non-MOD/non-DIV instructions only if they are independent (and look out for out-of-order writes and structural hazards). Subsequent MOD/DIV instruction would be stalled (that is, the MOD and DIV units themselves are not pipelined).
- Pipeline the MOD and DIV units such that independent MOD/DIV instructions can be pipelined.
Note
Extending the pipeline so that all instructions travel through extra stages is likely not the right way to add MOD and DIV to the pipeline. Review the lecture notes on pipelining and multi-cycle operations.
We'll evaluate your pipeline's CPI using LC4 benchmark program specifically designed to measure how effectively your pipeline handles MOD and DIV.
Use the file lc4_pipeline.v as the start of your pipelined design. Of course, you will likely want to retain the modules (such as the register file, ALU, and branch unit) as well as create additional modules (say, a branch predictor module). As before, labfiles.tar.gz contains this file, Verilog code for the memory and all of the devices (except for the timer), user constraints, test fixtures, and various trace and hex files. The CLK, RST, GWE and instruction and data memory interfaces are the same as before.
To verify your pipeline is working correctly, requires both "functionality" testing (does it calculate the right answer?) and "performance" testing (does your pipeline stall too much?).
You can use the same test harness and testing traces as in the previous lab, but with two twists.
First, the pipeline spreads the execution of an instruction over many stages, but the test fixture expects all the test_ signals to correspond to a single instruction. To handle this issue, you'll need to carry the relevant signals all the way into the "writeback" pipeline stage and assign them to the proper test_ signals.
Second, the pipelined processor doesn't complete an instruction each cycle (due to stalls and pipeline flushes). However, the test fixture we've given you already includes a test_stall signal. If this signal is zero (indicating no stall), the test fixture will check the various test_ signals versus the next line in the .trace file. When the test_stall signal is non-zero (indicating a stall), it ignores the cycle.
Part of having a "correct" pipeline is not stalling too much (and thus not capturing all the performance benefits of the pipeline). To measure and understand your pipeline's performance, the test_stall signal is actually a two-bit value to indicate the source of the stall as follows:
The test fixture can then use these values to measure the source of stalls.
Note
If a stall occurred due to a load-use penalty, but the instruction ended up being squashed due to a branch mis-prediction, that stall cycle should be classified as a branch mis-prediction cycle.
Use the same subset of Verilog as in the previous lab.
There will be two demos:
All group members should understand the entire design well enough to be able to answer any such questions. As before, all group members should be present at the demos.
This lab counts for 12% of your total grade. Rather than giving you a binary trade-off, I am going to refine it a little bit. This lab essentially has three parts to it: a basic pipeline, simple branch prediction, and MOD/DIV.
The percentages above represent the maximum amount earned for the lab, including the design document, demos, and final writeup.
Stated another way, if your group chooses, you can stop after implement the fully bypassed pipeline and earn up to half the points on the assignment. In fact, if your group chooses such a route prior to the design document, the design document need only describe the parts your group has decided to tackle.
There will be three documents to turn in: