Lab 3 - Pipelined LC4 Processor

CIS 371 (Spring 2011): Computer Organization and Design

Design Document Due:
Instructor:	Prof. Milo Martin
	Wednesday, April 13th (turn in PDF via Blackboard by 7pm)
Preliminary Demo Due:
	Thursday, April 21st (you must demo it in the KLab by 7pm)
Final Demo Due:	Wednesday, April 27th (you must demo it in the KLab by 7pm)
Writeup Due:	Thursday, April 28th (turn in PDF via Blackboard by 7pm)
Instructions:	This lab should be done in groups of two or three.

This lab is worth 12% of your final grade.

Announcements

[April 12] Clarified that MUL (multiply) is just another single-cycle operation.
[April 9] Updated the test cases (same names, better testing coverage), updated the testbench to display stall cycle breakdowns, and provided Verilog code to assist with MOD & DIV. So be sure to download the newest labfiles.tar.gz.

Overview

In this lab, you will design and build a scalar five-stage pipelined processor for the LC4 ISA. As described below, the pipeline is fully bypassed, includes a simple branch predictor, and implements MOD and DIV.

Design Document

As described below (see "What to Turn In"), the first step is creating a design document. That is, before you begin writing the Verilog code, you're going to need to figure out the design of your new processor. This process should include considering all aspects of the design and testing of the processor. You'll want to consider how the datapath is broken into stages, identify which values are placed in pipeline latches, uncover any tricky bypassing cases, and decide upon a strategy for stalling the pipeline and recovering from branch mispredictions. You'll likely want to include a detailed pipeline diagram and list of what values are placed in pipeline latches between each stage.

Important

The entire goal of the design document is to help (force) you to really think through the design. The design document is not intended to be busy work; I'm requiring the document because I think it will save you time in the end. One corollary to this is that you shouldn't spend significant time polishing the document. However, you need to convince the grader that you've actually deeply considered all aspects of the design and that you're adequately prepared to begin the implementation.

Specification

The pipeline is a scalar five-stage pipelined processor. To improve the CPI, the pipeline is fully bypassed (including the NZP condition codes) and the design includes a simple branch predictor.

Basic Pipeline Stages

This processor will use the "standard" five-stage pipeline:

Fetch (F): reads the next instruction from the memory, predicts the next PC.
Decode (D): stall logic and reads the register file
Execute (X): performs ALU calculations, branch unit calculations, resolves branches. As the FPGAs have a fast built-in multiple block, you can consider the multiple just another single-cycle instruction.
Memory (M): read or write the data memory
Writeback (W): write the register file

All instructions travel through all five stages.

Bypassing & Stall Logic

In the standard five-stage fully bypassed pipeline, the only stall is a single-cycle load-to-use penalty. There are a few tricky cases to consider. First, you should only stall an instruction after the load if the subsequent instruction is actually dependent on the load (either by register value or NZP condition code):

LDR R1, R2, 5     // load r1 <- [R2+10]
// stall
ADD R3, R1, R4    // add r3 <= R1+R4

Second, a load followed by a dependent store may or may not stall, depending on if the output of the load feeds the address generation (stall) or the data input of a store (no stall):

LDR R1, R2, 5     // load R1 <- [R2+10]
// no stall
STR R1, R3, 6     // store R1 -> [R3+6]

LDR R1, R2, 5     // load R1 <- [R2+10]
// stall
STR R7, R1, 6     // store R7 -> [R1+6]

Third, you'll need to bypass the NZP condition codes (which are read only by BR instructions). The tricky case is that not all instructions write the condition codes, so you'll need to use the NZP bits based on the last instruction that writes the condition codes prior to the BR instruction.

Note: technically a "BRnzp" or just "BR" instruction wouldn't technically need to stall for condition codes (as they are unconditional branches), however LC4 also has a PC-relative JMP that is used for unconditional branching by convention, so all the "BR" instructions can be treated uniformly.

Register File & Bypassing

The register file you used in the single-cycle design is different than the register file used in the pipeline described in the CIS371 lecture. Unlike what we've been assuming in class, a write to the register file will not effect the register being read in the same cycle, so you will need to explicitly add another stage of "bypassing" around the register file.

Thus, when a register is read and written in the same cycle, the register file you used in your single-cycle design returns the old value. For this pipelined design, you want it to return the newly written value. More specifically, if the register file write enable is asserted, and the register being read matches the register being written, you'll write to return the new value being written (and not the old value).

One simple way to do this is to make a new module that contains the existing register file and the additional logic for this "local" bypassing. You'll also need to consider how this impacts the bypassing of the NZP condition codes as well.

Branch Prediction

Branches are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty. The branch predictor is an eight-entry direct-mapped branch target buffer (BTB). Each of the eight entries in the BTB has both a 16-bit "tag" and a 16-bit "next PC" field.

Prediction. On an instruction fetch, the BTB is indexed using the lowest three bits of the PC. If the tag matches (the tag for that entry is equal to the PC) then the predicted next PC is whatever was in the "next PC" field. If the tag doesn't match, the predicted PC is just PC+1. Recall, the BTB is accessed during the fetch stage, and it is accessed for every instruction fetched.

Mis-prediction, recovery, and training. Control instructions resolve during the execute phase. If the predicted PC is wrong (doesn't match the actual PC), three things occur:

The two instructions following the mis predicted branch are squashed (nullified)

Fetch is re-started the next cycle at the correct PC

The branch predictor is trained. Using the lowest three bits of the mis-predicted PC to index into the BTB, the tag is set to the PC and the next PC field is set to the calculated PC.

Note

We suggest that you start by always predicting PC+1 and testing your design with that. Once that is working, getting the predictor working isn't so difficult.

Note

Notice that the register file module you already have is an eight-entry RAM with 16-bit values... so there is an opportunity for code re-use.

Implementing MOD and DIV

Unlike the previous lab, your pipelined processor must also support the DIV and MOD instructions. As DIV and MOD arithmetic operations require substantial amounts of logic, these are to be implemented as multi-cycle operation (Note: you should implement MUL as a normal single-cycle operation). In LC4, DIV and MOD operate on unsigned values. To assist you, the following C code gives the basic algorithm for a 16-iteration 16-bit divide/modulo.

int divide(int dividend, int divisor)
{
  int quotient = 0;
  int remainder = 0;

  if (divisor == 0) {
    return 0;
  }

  for (int i = 0; i < 16; i++) {
    remainder = (remainder << 1) | ((dividend >> 15) & 0x1);
    if (remainder < divisor) {
      quotient = (quotient << 1) | 0x0;
    } else {
      quotient = (quotient << 1) | 0x1;
      remainder = remainder - divisor;
    }
    dividend = dividend << 1;
  }

  return quotient;
}

Although DIV and MOD on actual processors can take dozens of cycles, your DIV and MOD should be a four-cycle operation. In essence, each cycle your processor can do the work performed by four loop iterations above. Four cycles is likely enough cycles that it shouldn't impact the cycle time of the processor (too much), but long enough that you'll need to handle MOD and DIV as special four-cycle operation.

To further assist you, we're also giving you some Verilog code. The file lc4_divider_four_iter.v (below) contains the Verilog for a single iteration and four iterations together. Use four of the four_iteration modules (in subsequent cycles) to implement MOD and DIV.

`timescale 1ns / 1ps

module lc4_divider_one_iter(dividend_in, divisor_in, remainder_in, quotient_in, 
                            dividend_out, remainder_out, quotient_out);
   
   input [15:0] dividend_in, divisor_in, remainder_in, quotient_in;
   output [15:0] dividend_out, remainder_out, quotient_out;
   
   
   wire [15:0]   remainder = {remainder_in[14:0],dividend_in[15]};
   wire          condition = (remainder < divisor_in);
   
   assign quotient_out = condition ? {quotient_in[14:0], 1'b0} : {quotient_in[14:0], 1'b1};
   assign remainder_out = condition ? remainder : remainder - divisor_in;
   assign dividend_out = {dividend_in[14:0], 1'b0};
   
endmodule

module lc4_divider_four_iter(dividend_in, divisor_in, remainder_in, quotient_in, 
                             dividend_out, remainder_out, quotient_out);
   
   input [15:0] dividend_in, divisor_in, remainder_in, quotient_in;
   output [15:0] dividend_out, remainder_out, quotient_out;

   wire [15:0]   iter1_dividend_out, iter1_remainder_out, iter1_quotient_out;
   wire [15:0]   iter2_dividend_out, iter2_remainder_out, iter2_quotient_out;
   wire [15:0]   iter3_dividend_out, iter3_remainder_out, iter3_quotient_out;
   wire [15:0]   iter4_dividend_out, iter4_remainder_out, iter4_quotient_out;

   wire [15:0]   iter1_dividend_in = dividend_in;
   wire [15:0]   iter2_dividend_in = iter1_dividend_out;
   wire [15:0]   iter3_dividend_in = iter2_dividend_out;
   wire [15:0]   iter4_dividend_in = iter3_dividend_out;

   wire [15:0]   iter1_remainder_in = remainder_in;
   wire [15:0]   iter2_remainder_in = iter1_remainder_out;
   wire [15:0]   iter3_remainder_in = iter2_remainder_out;
   wire [15:0]   iter4_remainder_in = iter3_remainder_out;
   
   wire [15:0]   iter1_quotient_in = quotient_in;
   wire [15:0]   iter2_quotient_in = iter1_quotient_out;
   wire [15:0]   iter3_quotient_in = iter2_quotient_out;
   wire [15:0]   iter4_quotient_in = iter3_quotient_out;

   
   lc4_divider_one_iter iter1 (iter1_dividend_in, divisor_in, iter1_remainder_in, iter1_quotient_in, 
                               iter1_dividend_out, iter1_remainder_out, iter1_quotient_out);
   
   lc4_divider_one_iter iter2 (iter2_dividend_in, divisor_in, iter2_remainder_in, iter2_quotient_in, 
                               iter2_dividend_out, iter2_remainder_out, iter2_quotient_out);
   
   lc4_divider_one_iter iter3 (iter3_dividend_in, divisor_in, iter3_remainder_in, iter3_quotient_in, 
                               iter3_dividend_out, iter3_remainder_out, iter3_quotient_out);
   
   lc4_divider_one_iter iter4 (iter4_dividend_in, divisor_in, iter4_remainder_in, iter4_quotient_in, 
                               iter4_dividend_out, iter4_remainder_out, iter4_quotient_out);
   
   assign dividend_out = iter4_dividend_out;
   assign remainder_out = iter4_remainder_out;
   assign quotient_out = iter4_quotient_out;

endmodule

Depending on your target CPI (and subsequent target grade for the lab, as described in the "grading" section below), you have several options for integrating MOD and DIV with the pipeline:

Don't implement DIV and MOD (and not receive full credit, see "Grading" below).

Stall all subsequent instructions (in essence, disable pipeline while a MOD or DIV is executing).

Stall subsequent non-MOD/non-DIV instructions only if they are independent (and look out for out-of-order writes and structural hazards). Subsequent MOD/DIV instruction would be stalled (that is, the MOD and DIV units themselves are not pipelined).

Pipeline the MOD and DIV units such that independent MOD/DIV instructions can be pipelined.

Note

Extending the pipeline so that all instructions travel through extra stages is likely not the right way to add MOD and DIV to the pipeline. Review the lecture notes on pipelining and multi-cycle operations.

We'll evaluate your pipeline's CPI using LC4 benchmark program specifically designed to measure how effectively your pipeline handles MOD and DIV.

Project Files

Use the file lc4_pipeline.v as the start of your pipelined design. Of course, you will likely want to retain the modules (such as the register file, ALU, and branch unit) as well as create additional modules (say, a branch predictor module). As before, labfiles.tar.gz contains this file, Verilog code for the memory and all of the devices (except for the timer), user constraints, test fixtures, and various trace and hex files. The CLK, RST, GWE and instruction and data memory interfaces are the same as before.

Testing

To verify your pipeline is working correctly, requires both "functionality" testing (does it calculate the right answer?) and "performance" testing (does your pipeline stall too much?).

Functional Testing

You can use the same test harness and testing traces as in the previous lab, but with two twists.

First, the pipeline spreads the execution of an instruction over many stages, but the test fixture expects all the test_ signals to correspond to a single instruction. To handle this issue, you'll need to carry the relevant signals all the way into the "writeback" pipeline stage and assign them to the proper test_ signals.

Second, the pipelined processor doesn't complete an instruction each cycle (due to stalls and pipeline flushes). However, the test fixture we've given you already includes a test_stall signal. If this signal is zero (indicating no stall), the test fixture will check the various test_ signals versus the next line in the .trace file. When the test_stall signal is non-zero (indicating a stall), it ignores the cycle.

Performance Testing

Part of having a "correct" pipeline is not stalling too much (and thus not capturing all the performance benefits of the pipeline). To measure and understand your pipeline's performance, the test_stall signal is actually a two-bit value to indicate the source of the stall as follows:

test_stall == 0: no stall
test_stall == 1: flushed due to branch mis-prediction
test_stall == 2: stalled due to load-use penalty
test_stall == 3: stalled due to MOD or DIV

The test fixture can then use these values to measure the source of stalls.

Note

If a stall occurred due to a load-use penalty, but the instruction ended up being squashed due to a branch mis-prediction, that stall cycle should be classified as a branch mis-prediction cycle.

Verilog Restrictions

Use the same subset of Verilog as in the previous lab.

Demos

There will be two demos:

Preliminary demo: Demonstrate that the "test_alu.trace" works correctly on your pipeline. This trace file is a good stress-test for various bypassing corner cases, but contains no branches or memory operations.
Final demo: Demonstrate the fully-working design, including bypassing, branch prediction, and MOD/DIV. This includes:
1. Showing that all the .trace files we've provided work in simulation.
2. Demonstrating that mc.hex and invaders.hex work on the FPGA board.
3. Passing additional functionality tests (including MOD and DIV instructions) provided by the TAs.
4. Recording the stall cycle accounting for traces provided by the TAs (to check for over-stalling).

All group members should understand the entire design well enough to be able to answer any such questions. As before, all group members should be present at the demos.

Grading

This lab counts for 12% of your total grade. Rather than giving you a binary trade-off, I am going to refine it a little bit. This lab essentially has three parts to it: a basic pipeline, simple branch prediction, and MOD/DIV.

A fully bypassed 5-stage pipeline that passes the preliminary demo is worth 4%.
A fully bypassed 5-stage pipeline (without MOD and DIV) that passes all the test and runs on the board is worth another 3%.
Correctly implementing branch prediction using the branch predictor (as measured by the stall accounting) is worth another 3%.
Implementing MOD/DIV correctly and efficiently within the pipeline (as determined by the stall accounting) is worth another 2%. The exact amount will depend on how effectively they are implemented (1% for functional correctness and up to 1% for more efficient implementations).

The percentages above represent the maximum amount earned for the lab, including the design document, demos, and final writeup.

Stated another way, if your group chooses, you can stop after implement the fully bypassed pipeline and earn up to half the points on the assignment. In fact, if your group chooses such a route prior to the design document, the design document need only describe the parts your group has decided to tackle.

What to Turn In

There will be three documents to turn in:

Design Document. Designing a pipelined processor is significantly more complex than designing a single-cycle version of the same processor. Before you begin writing the Verilog code, you should think through the design carefully. You'll want to consider how the datapath is broken into stages, identify which values need to be propagated in pipeline latches, think about any tricky bypassing cases, and decide upon a strategy for stalling the pipeline and recovering from branch mis-predictions. Don't spend a lot of time polishing this document, but do include:
1. Design decisions and overview: Describe in prose any high-level design decisions for your pipelined datapath. A short description is fine.
2. Module descriptions: Describe any additional modules in your design. What are they? What are their interfaces? How are you going to test them? (You don't need to re-explain the register file, ALU, and branch modules from the previous labs, but you should explain how they are going to be used and/or modified.)
3. Pipeline diagram: You'll probably want a pipeline datapath diagram. You can either try to fit the whole diagram on a single page or use one page per pipeline stage. You may wish to start with the single-cycle datapath we gave you (lc4_datapath.pdf or lc4_datapath.pptx).
4. Pipeline registers: Give a table or list that describes which values you're latching between which stages. Give these signals good names that you'll then use in your actual Verilog code. For example, you might want to call the PC in the fetch stage F_PC, the PC in the decode stage D_PC, and latch between the stages the FD_register. We strongly suggest that you adopt some signal naming convention, such as prepending each signal with either F, D, X, M, or W to distinguish in which stage the signal belongs.
5. Design issues: Describe your approach and strategies for handling various control and datapath issues such as implementing the bypass logic, stalling logic, branch predictor, flushing, and MOD/DIV.
6. Testing plan: Describe how you plan to test the processor (both for "functional" and "performance" correctness). Especially considering MOD and DIV, which aren't covered by the existing test cases.
7. Implementation plan: What order do you plan to complete the components? How are you going to divide up the work among the members of the group? Are their any intermediate milestones to target?
Final Writeup:
- Design update and retrospective. Write a short "design retrospective", focusing on what you learned during the design that you wish you would have known when you began the implementation. Even if you scored well on your initial design document, you'll probably have uncovered a lot of additional information in your final lab writeup. In the end, how did you actually implemented bypassing and stalling? Etc. Basically, describe how your design changed, what parts you left out of the design document, which parts you got wrong, and assess the effectiveness of your approach to testing and debugging your implementation. What are a few things that, in retrospect, you wish you could got a month back in time and tell yourself about this lab? This shouldn't (just) be a narrative of what when wrong; instead, it should be a critic and analysis of your original design.
- Verilog code. As before, your Verilog code should be well-formatted, easy to understand, and include comments where appropriate. Some part of the project grade will be dependent on the style and readability of your Verilog, including formatting, comments, good signal names, and proper use of hierarchy.
- Answer the following questions:
  - What problems, if any, did you encounter while doing this lab?
  - How many hours did it take you to complete this assignment (total for entire group)?
  - On a scale of 1 (least) to 5 (most), how difficult was this assignment?
Group Contribution Report: Finally, to encourage equal participation by all group members, each group member must individually submit (via blackboard) an document describing your view of the contribution of each member of your group. For each group member (including yourself), give a percentage that member contributed to the project and describe their role in the project. The sum of the percentages should add up to 100%.

Hints

The preliminary demo will make sure that you are on track. But you probably want to set some finer-grained intermediate milestones as well. For instance, you could create a short program in which all dependent instructions are separated by two explict NOPs. A program like this will run correctly on a pipeline that has no bypassing or stall logic. You can then add bypassing logic in stages.
Use good signal names and comments. Using well-named signals and well-documented code will help you and your partner understand each other's code. Plus, this code is complicated enough that good structure, good signal names, and comments really will pay off. I really cannot stress this enough.
Be careful not to bypass a value from a prior instruction that either doesn't write a register or an instruction that has been squashed due to a branch misprediction.
Similarly, be sure not to falsely stall any instruction whose encoding might match the previous instruction's output register, but does not actually read the register. For example, the JUMP instruction doesn't read any registers, so it shouldn't be stalled just because the bits in its immediate field happen to match the output register of a prior load. Over-stalling is difficult to detect because it is a "performance bug" and not a "correctness bug". That is, you'll get the right answer, but your processor will have a worse CPI than it should.
The NZP bits are implicit values that have to be bypassed between register writing instructions and conditional branches. Basically, you can think of the NZP bits as an additional register.
In several places in the design, you'll have instructions that either (1) don't write a register value (for example, stores) or (2) don't read both input registers (for example, loads). You might consider encoding these cases using an explicit "valid" bit for each 3-bit register specifier. Doing so could dramatically simplify your bypass and stall logic.
You can implement flushing and stalling (NOP insertion) by using a mux to set the instruction bits to all zeros (which is a NOP in LC4). However, you'll still need to record the source of the stall.
When a pipeline bubble arrives at the Writeback stage, it is difficult to figure out what caused it to set the test_stall signal correctly. Thus, you will need to remember why different pipeline registers are cleared at those latches and then propagate the information forward. Careful, although you can clear the main pipeline registers, you should not clear the pipeline registers that track why the main pipeline registers were cleared.