Lab 3 - Pipelined LC4 Processor

CIS 371 (Spring 2012): Computer Organization and Design

Instructor: Prof. Milo Martin
Instructions: This lab should be done in groups.

Due Dates

Announcements

Overview

In this lab, you will design and build a superscalar five-stage pipelined processor for the LC4 ISA. As described below, the pipeline is fully bypassed, includes a simple branch predictor, and can execute up to two instructions per cycle. This lab is broken down into three distinct parts.

Be sure to read this entire document before starting, especially the "Hints" section at the very end.


Part A: Design Document For Scalar Pipelined Processor

Designing a pipelined processor is significantly more complex than designing a single-cycle version of the same processor. So, the first step is creating a design document for the pipelined processor (which is described in Part B). Although Part C asks you to further extend your design, this initial design document needs to cover only the initial pipelined processor in Part B.

Before you begin writing any Verilog code, I'm asking you to first figure out the design of your new processor. This process should include considering all aspects of the design and testing of the processor. Focus on the trickiest parts of the design and use this exercise to help identify and address implementation issues early.

You'll want to consider how the datapath is broken into stages, identify which values are placed in pipeline latches, uncover any tricky bypassing cases, and decide upon a strategy for stalling the pipeline and recovering from branch mis-predictions. You'll likely want to include a detailed pipeline diagram and list of what values are placed in pipeline latches between each stage. Don't spend a lot of time polishing this document, but you should touch on at least all of the following issues:

Important

The entire goal of the design document is to help (force) you to really think through the design. The design document is not intended to be busy work; I'm requiring the document because I think it will save you time in the end. One corollary to this is that you shouldn't spend significant time polishing the document. However, you need to convince the grader that you've actually deeply considered all aspects of the design and that you're adequately prepared to begin the implementation.


Part B: Pipelined Scalar Processor

The next step is to implement a scalar five-stage pipelined processor. To improve the CPI, the pipeline is fully bypassed (including the NZP condition codes). In the next part, you'll be asked to extend your implementation to use a simple branch predictor and execute two instructions per cycle.

Basic Pipeline Stages

This processor will use the "standard" five-stage pipeline:

  • Fetch (F): reads the next instruction from memory and predicts the next PC.
  • Decode (D): computes the stall logic and reads the register file.
  • Execute (X): performs the ALU and branch-unit calculations and resolves branches. All non-memory instructions execute in a single cycle, including MUL, MOD, and DIV.
  • Memory (M): reads or writes the data memory.
  • Writeback (W): writes the register file.

All instructions travel through all five stages.
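
As a concrete (and purely hypothetical) illustration of the stage boundaries, the sketch below shows one way a generic pipeline latch might look. Every module and signal name here is my own, not a name from the lab files, and if your allowed Verilog subset rules out behavioral always blocks you can build the same structure out of whatever register module you used in the single-cycle lab.

// Hypothetical generic pipeline-latch sketch; every name here is illustrative.
// flush inserts a bubble into the downstream stage (e.g., on a branch
// misprediction); hold keeps the current contents (e.g., the F/D latch
// during a load-to-use stall).
module pipe_reg #(parameter WIDTH = 16, parameter RESET_VAL = 0)
  (input  wire             clk,
   input  wire             rst,
   input  wire             flush,
   input  wire             hold,
   input  wire [WIDTH-1:0] in,
   output reg  [WIDTH-1:0] out);

   always @(posedge clk) begin
      if (rst || flush)
         out <= RESET_VAL;        // for the insn field, pick a NOP encoding
      else if (!hold)
         out <= in;
      // otherwise: hold the current value for one more cycle
   end
endmodule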

Bypassing & Stall Logic

In the standard five-stage fully bypassed pipeline, the only stall is a single-cycle load-to-use penalty. There are a few tricky cases to consider. First, you should stall the instruction following a load only if it actually depends on the load (either through a register value or the NZP condition codes):

LDR R1, R2, 5     // load R1 <- [R2+5]
// stall
ADD R3, R1, R4    // add R3 <- R1 + R4

Second, a load followed by a dependent store may or may not stall, depending on whether the output of the load feeds the store's address generation (stall) or its data input (no stall):

LDR R1, R2, 5     // load R1 <- [R2+5]
// no stall
STR R1, R3, 6     // store R1 -> [R3+6]

LDR R1, R2, 5     // load R1 <- [R2+5]
// stall
STR R7, R1, 6     // store R7 -> [R1+6]
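
As a rough sketch of how the load-to-use stall decision might be expressed (computed while the load is in X and the possibly dependent instruction is in D), something like the following could work. All of the decoded signals here (x_is_load, x_wsel, d_rs, d_rt, d_rs_used, d_rt_used, d_is_store, d_is_branch, x_nzp_we) are assumptions about your own decode logic, not signals provided with the lab.

// Hypothetical load-to-use stall detection (combinational); all names are
// assumptions about your own decoder.  d_rs is assumed to be the ALU/address
// operand and d_rt the second operand (which, for STR, is the data being stored).
wire rs_hazard  = d_rs_used && (d_rs == x_wsel);                // address/ALU use: must stall
wire rt_hazard  = d_rt_used && (d_rt == x_wsel) && !d_is_store; // store data can be bypassed in M
wire nzp_hazard = d_is_branch && x_nzp_we;                      // BR consuming the load's NZP bits

wire load_use_stall = x_is_load && (rs_hazard || rt_hazard || nzp_hazard);

During the stall cycle, the PC and F/D latch hold their values while a bubble (NOP) is inserted into the D/X latch.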

Third, you'll need to bypass the NZP condition codes (which are read only by BR instructions). The tricky case is that not all instructions write the condition codes, so you'll need to bypass the NZP bits from the most recent instruction prior to the BR that actually writes the condition codes.

Note: technically a "BRnzp" (or just "BR") instruction wouldn't need to stall for condition codes, as it is an unconditional branch. However, by convention LC4 uses its PC-relative JMP for unconditional branching, so you can treat all "BR" instructions uniformly.

Register File & Bypassing

The register file you used in the single-cycle design is different from the register file used in the pipeline described in the CIS371 lectures. Unlike what we've been assuming in class, a write to the register file will not affect the register being read in the same cycle, so you will need to explicitly add another stage of "bypassing" around the register file.

Thus, when a register is read and written in the same cycle, the register file you used in your single-cycle design returns the old value. For this pipelined design, you want it to return the newly written value. More specifically, if the register file write enable is asserted and the register being read matches the register being written, you'll want to return the new value being written (not the old value).

One simple way to do this is to make a new module that contains the existing register file plus the additional logic for this "local" bypassing. You'll need to consider how this impacts the bypassing of the NZP condition codes as well.
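
For example, a wrapper along these lines could work; the inner module name and port names below are guesses modeled on a typical single-cycle register file, so adjust them to match your own code.

// One possible wrapper sketch: the existing register file plus write-to-read
// bypassing.  "lc4_regfile" and its port names are assumptions; substitute
// your own module and ports.
module lc4_regfile_bypassed (
   input  wire        clk, gwe, rst,
   input  wire [2:0]  rssel,    // read port 1 select
   output wire [15:0] rsdata,
   input  wire [2:0]  rtsel,    // read port 2 select
   output wire [15:0] rtdata,
   input  wire [2:0]  wsel,     // write port select
   input  wire [15:0] wdata,
   input  wire        we);

   wire [15:0] rsdata_raw, rtdata_raw;

   lc4_regfile regfile (.clk(clk), .gwe(gwe), .rst(rst),
                        .rssel(rssel), .rsdata(rsdata_raw),
                        .rtsel(rtsel), .rtdata(rtdata_raw),
                        .wsel(wsel), .wdata(wdata), .we(we));

   // If the register being read is also being written this cycle, forward
   // the new value instead of the stale one from the array.
   assign rsdata = (we && (wsel == rssel)) ? wdata : rsdata_raw;
   assign rtdata = (we && (wsel == rtsel)) ? wdata : rtdata_raw;
endmodule

The same trick extends naturally to the NZP register: if the NZP bits are written in the same cycle they are read, forward the newly written value.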

Handling Branches

For branches, always predict "not taken" and thus the next PC is always predicted as "PC+1". This approach does not require an explicit branch predictor table, but it does require implementing the logic to handle branch mispredictions by squashing and restarting the pipeline. All control transfers (branches, etc.) are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty.
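
A minimal sketch of the recovery logic, assuming your execute stage produces the signals named below (they are illustrative, not names from the lab files):

// Hypothetical misprediction check for the predict-not-taken pipeline.
// x_is_ctrl, x_pc, x_actual_next_pc, and f_pc are assumed signals from your
// own design.
wire        x_mispredict = x_is_ctrl && (x_actual_next_pc != (x_pc + 16'd1));

// Squash the two younger instructions by flushing the F/D and D/X latches,
// and redirect fetch to the correct target on the next cycle.
wire        fd_flush = x_mispredict;
wire        dx_flush = x_mispredict;
wire [15:0] next_pc  = x_mispredict ? x_actual_next_pc : (f_pc + 16'd1);

In Part C you would instead compare against the predicted next PC carried down the pipeline from the fetch stage.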

Project Files

Use the file lc4_pipeline.v as the start of your pipelined design. Of course, you will likely want to reuse modules from your single-cycle design (such as the register file, ALU, and branch unit) as well as create additional modules (say, a branch predictor module). As before, labfiles.tar.gz contains this file, Verilog code for the memory and all of the devices, user constraints, test fixtures, and various trace and hex files. The CLK, RST, GWE, and instruction and data memory interfaces are the same as before.


Part C: Superscalar Processor with Branch Prediction

In Part C, your goal is to improve the CPI by adding two more advanced features: a (simple) branch predictor and two-way superscalar execution.

Branch Prediction

Branches are resolved in the execute stage, so a mis-predicted branch has a two-cycle penalty. The branch predictor is an eight-entry direct-mapped branch target buffer (BTB). Each of the eight entries in the BTB has both a 16-bit "tag" and a 16-bit "next PC" field.

Prediction. On an instruction fetch, index into the BTB using the lowest three bits of the PC. If the tag matches (the tag for that entry is equal to the PC) then the predicted next PC is whatever was in the "next PC" field. If the tag doesn't match, the predicted PC is just PC+1. Recall, the BTB is accessed during the fetch stage, and it is accessed for every instruction fetched.

Mis-prediction, recovery, and training. Control instructions resolve during the execute phase. If the predicted PC is wrong (doesn't match the actual PC), three things occur:

  1. The two instructions following the mis-predicted branch are squashed (nullified)
  2. Fetch is re-started the next cycle at the correct PC
  3. The branch predictor is trained. Using the lowest three bits of the mis-predicted PC to index into the BTB, the tag is set to the PC and the next PC field is set to the calculated PC.

Note

Once you've completed Part B, you already have a working pipeline that always predicts PC+1. With that working correctly, getting the predictor integrated into the pipeline isn't so difficult.

Note

Notice that the register file module you already have is an eight-entry RAM with 16-bit values... so there is an opportunity for code re-use.
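
To make the BTB behavior concrete, here is one possible behavioral sketch. All module and signal names are mine, not from the lab files; you could equally build the two eight-entry arrays out of the same kind of eight-entry, 16-bit memory as your register file, per the note above, and as with the earlier sketches you may need to restructure it to fit the allowed Verilog subset.

// Hypothetical eight-entry direct-mapped BTB sketch; all names are illustrative.
module lc4_btb (input  wire        clk,
                input  wire        rst,
                input  wire [15:0] f_pc,          // PC being fetched
                output wire [15:0] f_pred_pc,     // predicted next PC
                input  wire        train_we,      // asserted on a misprediction in X
                input  wire [15:0] train_pc,      // PC of the mispredicted instruction
                input  wire [15:0] train_target); // its actual next PC

   reg [15:0] tags    [0:7];   // 8 x 16-bit tag array
   reg [15:0] targets [0:7];   // 8 x 16-bit next-PC array
   reg [7:0]  valid;           // one valid bit per entry (avoids spurious hits)

   wire [2:0] rd_idx = f_pc[2:0];       // index with the low three PC bits
   wire [2:0] wr_idx = train_pc[2:0];

   wire hit = valid[rd_idx] && (tags[rd_idx] == f_pc);
   assign f_pred_pc = hit ? targets[rd_idx] : (f_pc + 16'd1);

   always @(posedge clk) begin
      if (rst)
         valid <= 8'b0;
      else if (train_we) begin
         valid[wr_idx]   <= 1'b1;
         tags[wr_idx]    <= train_pc;
         targets[wr_idx] <= train_target;
      end
   end
endmodule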

Superscalar

You'll also design and implement a superscalar extension, as described below.


Testing

Verifying that your designs are working correctly requires both "functionality" testing (does it calculate the right answer?) and "performance" testing (does your pipeline stall too much?).

Functional Testing

The same test harness and testing traces we gave you in the previous lab will also work for testing the scalar pipeline, but with two twists.

First, the pipeline spreads the execution of an instruction over many stages, but the test fixture expects all the test_ signals to correspond to a single instruction. To handle this issue, you'll need to carry the relevant signals all the way into the "writeback" pipeline stage and assign them to the proper test_ signals.

Second, the pipelined processor doesn't complete an instruction each cycle (due to stalls and pipeline flushes). However, the test fixture we've given you already includes a test_stall signal. If this signal is zero (indicating no stall), the test fixture will check the various test_ signals versus the next line in the .trace file. When the test_stall signal is non-zero (indicating a stall), it ignores the cycle.

Performance Testing

Part of having a "correct" pipeline is not stalling too much (stalling unnecessarily squanders the performance benefits of the pipeline). To measure and understand your pipeline's performance, the test_stall signal is actually a two-bit value that indicates the source of each stall as follows:

  • test_stall == 0: no stall
  • test_stall == 1: flushed due to branch mis-prediction
  • test_stall == 2: stalled due to load-use penalty
  • test_stall == 3: used only for superscalar, as described below

The test fixture can then use these values to measure the source of stalls.

Note

If a stall occurred due to a load-use penalty, but the instruction ended up being squashed due to a branch mis-prediction, that stall cycle should be classified as a branch mis-prediction cycle.
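
For instance, the writeback-stage logic that drives test_stall might look roughly like this, where w_was_squashed and w_was_load_stall are assumed bits that you carry down the pipeline alongside each instruction (not provided signals). Checking the squash bit first implements the classification rule in the note above.

// Hypothetical test_stall assignment in the writeback stage.  Squash takes
// priority, so a load-use bubble that was later squashed is reported as a
// branch-misprediction cycle, per the note above.
assign test_stall = w_was_squashed   ? 2'd1 :  // flushed by a branch misprediction
                    w_was_load_stall ? 2'd2 :  // bubble from the load-to-use stall
                                       2'd0;   // a real instruction completed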

Superscalar Pipeline

The superscalar version of the pipeline executes up to two instructions per cycle using the same five pipeline stages. It fetches, decodes, and executes up to two instructions per cycle, but with the limitation that only one load or store can execute per cycle (as explained below). A summary of some of the considerations for each of the stages:

Note

I suggest you design your pipeline such that the instruction in the "first" (or "top" or "A") part of the pipeline is always the older instruction and that the instruction in the "second" (or "bottom" or "B") part of the pipeline is always the younger instruction. This approach makes it clear which instruction should have priority when calculating stalls, bypassing, and register writeback.

Note

I strongly suggest you adopt a naming convention to make it clear which wires belong to which instruction. For example, "X_A_PC" could be the PC of the "A" (first) instruction in the execute stage; "X_B_PC" would be the PC of the "B" (second) instruction in the execute stage.

Note

An instruction that reads the NZP bits is considered dependent on a previous instruction that writes the NZP bits, so such a pair of instructions cannot execute in the same cycle.
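
As a sketch of the dual-issue decision made in decode, something like the following could work. All of the decoded signals below (a_wsel, a_regfile_we, a_nzp_we, b_rs, b_rt, and so on) are assumptions about your own decode logic, not provided names.

// Hypothetical check: can the younger "B" instruction issue alongside the
// older "A" instruction this cycle?  All names are illustrative.

// B reads a register that A writes; the value cannot be bypassed within the
// same cycle, so B must wait.
wire raw_dependence = a_regfile_we &&
                      ((b_rs_used && (b_rs == a_wsel)) ||
                       (b_rt_used && (b_rt == a_wsel)));

// B consumes NZP bits that A produces (see the note above).
wire nzp_dependence = a_nzp_we && b_uses_nzp;

// Only one load or store may execute per cycle.
wire mem_conflict = (a_is_load || a_is_store) && (b_is_load || b_is_store);

wire issue_both = !(raw_dependence || nzp_dependence || mem_conflict);

When issue_both is false, B simply stays behind A by one cycle; keeping A as the older instruction (per the note above) makes this priority decision straightforward.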

Testing. Testing a superscalar design requires checking two instructions per cycle. To allow you to do this, we've extended all the test_... signals from the original tester to include a test_A_... and a test_B_... version of each signal sent to the testbench. The testbench will then check these signals against two, one, or zero of the instructions from the trace, depending on the values of test_A_stall and test_B_stall.

Stall cycle accounting. As with the regular pipeline, the superscalar testbench counts the number of stall cycles and the source of those stalls:

Verilog Restrictions

Use the same subset of Verilog as in the previous lab.

Demos

There will be three demos:

  1. Preliminary demo I: Demonstrate that the "test_alu.trace" works correctly on your pipeline. This trace file is a good stress-test for various bypassing corner cases, but contains no branches or memory operations.
  2. Preliminary demo II: Demonstrate the fully-working design, including bypassing and PC+1 branch prediction. This includes:
    1. Showing that all the .trace files we've provided work in simulation.
    2. Demonstrating that tetris.hex and invaders.hex work on the FPGA board.
  3. Final demo: Demonstrate your fully-working superscalar pipeline with the branch predictor. This demo includes:
    1. Showing that all the .trace files we've provided work in simulation.
    2. Demonstrating that tetris.hex and invaders.hex work on the FPGA board.
    3. Recording the stall cycle accounting for traces provided by the TAs, which will check for over-stalling and that your superscalar execution is working properly.

All group members should understand the entire design well enough to answer questions about any part of it. As before, all group members should be present at the demos.

Grading

This lab counts for a significant portion of your lab grade for the course. Rather than giving you a binary all-or-nothing trade-off, I am going to refine it a little bit:

The percentages above represent the maximum amount earned for the lab, including the design document, demos, and final writeup.

Stated another way, if your group chooses, you can stop after implementing the fully bypassed pipeline and earn up to half the points on the assignment. In fact, if your group chooses such a route prior to the design document, the design document need only describe the parts your group has decided to tackle.

Notice that getting test_alu + test_mem working for superscalar will give you a maximum of 95%.

Late Demo Policy

If your group falls behind on the demos, you can demo the above milestones late for 50% of their value. So if you demo a working pipeline with branch prediction (70%) on time and then demo test_alu working on the superscalar (15%) late, your maximum score is 70 + (15 * 50%) = 77.5%. For the preliminary demos, the "late" demo day is Wednesday. For the final demo, Wednesday is "on time" and Thursday is "late". The TAs will have a limited number of demo slots on Thursday to handle those groups that have not completed by Wednesday.

What to Turn In

There will be three documents to turn in via BlackBoard.

  1. Design Document. As described above in Part A.
  2. Final Writeup:
    • Design update and retrospective. Write a short "design retrospective", focusing on what you learned during the design that you wish you had known when you began the implementation. Even if you scored well on your initial design document, you'll probably have uncovered a lot of additional information by the time of your final lab writeup. In the end, how did you actually implement bypassing and stalling? Etc. Basically, describe how your design changed, what parts you left out of the design document, which parts you got wrong, and assess the effectiveness of your approach to testing and debugging your implementation. What are a few things that, in retrospect, you wish you could go back a month in time and tell yourself about this lab? This shouldn't (just) be a narrative of what went wrong; instead, it should be a critique and analysis of your original design.
    • Verilog code. As before, your Verilog code should be well-formatted, easy to understand, and include comments where appropriate. Some part of the project grade will be dependent on the style and readability of your Verilog, including formatting, comments, good signal names, and proper use of hierarchy.
    • Assessment. Answer the following questions:
      • How many hours did it take you to complete this assignment (total for entire group)?
      • On a scale of 1 (least) to 5 (most), how difficult was each part of this assignment?
        1. Pipelined processor with PC+1 branch prediction
        2. Pipelined processor with a branch predictor
        3. Superscalar pipelined processor
  3. Group Contribution Report: Finally, to encourage equal participation by all group members, each group member must individually submit (via Blackboard) a document describing your view of the contribution of each member of your group. For each group member (including yourself), give a percentage that member contributed to the project and describe their role in the project. The percentages should sum to 100%.

Hints