Lab 2 - Pipelined Processor

CIS 372 (Spring 2009): Computer Organization and Design Lab

Preliminary design document due Friday, April 3, 6:30pm

Preliminary Demo by Friday, April 17, 6:30pm

Final Demo and Writeup by Friday, May 1 (last day of classes), 6:30pm

Overview and Specifications

In this final lab of the semester, you will design and build a scalar five-stage pipelined processor for the LC4 ISA. Your pipeline will target frequency (of course) and IPC (via bypassing and branch prediction). As usual, I am providing you with skeleton code. The pipeline skeleton and all of the supporting modules are in this compressed tarball. You may want to split out large sub-components of your pipeline, like the register file, ALU, and branch predictor, into separate files, but most of your code will be in the file lc4_pipe.v, whose current contents are given below:

module lc4_pipe(CLK, RST, GWE,
               IMEM_ADDR, IMEM_OUT,
               DMEM_ADDR, DMEM_OUT, DMEM_IN, DMEM_WE,
               // options
               BPRED_ON, BYPASS_ON,
               // debug interface
               _TEST_W_PC,
               _TEST_W_STALL,
               _TEST_W_INSN,
               _TEST_W_REGFILE_DATA_IN,
               _TEST_W_REGFILE_WE,
               _TEST_W_NZP_IN,
               _TEST_W_NZP_WE,
               _TEST_W_DMEM_ADDR,
               _TEST_W_DMEM_IN,
               _TEST_W_DMEM_WE,
               _TEST_PMC_CYCLE,
               _TEST_PMC_INSN,
               _TEST_PMC_LOAD_STALL,
               _TEST_PMC_BRANCH_STALL);
                               
  input         CLK;   // main clock
  input         RST;   // global reset
  input         GWE;   // global we for single-step clock
  
  output [15:0] IMEM_ADDR;   // instruction address
  input [15:0]  IMEM_OUT;    // output from instruction memory
  output        DMEM_WE;     // data memory write-enable
  output [15:0] DMEM_ADDR;   // data memory address
  input [15:0]  DMEM_OUT;    // output from data memory
  output [15:0] DMEM_IN;     // input to data memory

  input         BPRED_ON;    // branch prediction is on
  input         BYPASS_ON;   // bypassing is on
  
  output [15:0] _TEST_W_PC;
  output [15:0] _TEST_W_INSN;
  output [15:0] _TEST_W_REGFILE_DATA_IN;
  output        _TEST_W_REGFILE_WE;
  output [2:0]  _TEST_W_NZP_IN;
  output        _TEST_W_NZP_WE;
  output [15:0] _TEST_W_DMEM_ADDR;
  output [15:0] _TEST_W_DMEM_IN;
  output        _TEST_W_DMEM_WE;
  output        _TEST_W_STALL;
  
  output [15:0] _TEST_PMC_CYCLE, _TEST_PMC_INSN, _TEST_PMC_LOAD_STALL, _TEST_PMC_BRANCH_STALL;
  
  // YOUR CODE GOES HERE
  
  always @(posedge CLK)
    if (GWE)
      begin
         $display("--------------------------------------------------------------------------------");
         $display("F:");
         $display("D:");
         $display("X:");
         $display("M:");
         $display("W:");
      end
  
endmodule // lc4_pipe

The pipeline module interface is a superset of the single-cycle module interface. The CLK, RST, GWE and instruction and data memory interfaces are the same. However, there are two additional mode switches (BPRED_ON and BYPASS_ON), a set of output signals that start with _TEST_W_, and four output signals that start with _TEST_PMC_. The _TEST_ outputs will be used for debugging and test fixtures only. There is also skeleton behavioral code for displaying interior values. You can modify this code to print snapshots of the pipeline, one per cycle, that you can observe in ModelSim to help you debug.

Anyway, here are the basic specifications for the pipeline.

Five-Stage Pipeline

This processor will have a five-stage pipeline:

  • Fetch (F): reads the next instruction from the memory, predicts next PC
  • Decode (D): reads the register file, generates datapath decode/control signals
  • Execute (X): performs ALU and branch calculations, resolves branches
  • Memory (M): read or write the data memory
  • Writeback (W): write the register file

All instructions travel through all five stages. The _TEST_W_ outputs of the pipeline module should contain the corresponding values for the instruction currently in the Writeback (W) stage. Some of these values (PC, instruction bits, register inputs) are typically not needed at the W stage. You will have to propagate them through pipeline registers for debugging purposes. The _TEST_W_STALL signal should be 1 if the Writeback stage currently contains a bubble.
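The propagation described above can be sketched as one pipeline register that carries the PC, the instruction bits, and a bubble flag from one stage to the next. This is only an illustration under assumptions: the module and port names (stage_reg_sketch, flush, in_bubble, and so on) are hypothetical, the reset is assumed to be synchronous, and a behavioral always block is used for brevity even though the lab's Verilog subset may require building this from the provided register modules.

```verilog
// Hypothetical stage register: carries debug values plus a bubble flag.
module stage_reg_sketch(CLK, RST, GWE, flush,
                        in_pc, in_insn, in_bubble,
                        out_pc, out_insn, out_bubble);
  input         CLK, RST, GWE;
  input         flush;       // assumed: squash the incoming instruction
  input  [15:0] in_pc, in_insn;
  input         in_bubble;   // 1 if the previous stage holds a bubble
  output [15:0] out_pc, out_insn;
  output        out_bubble;

  reg    [15:0] out_pc, out_insn;
  reg           out_bubble;

  always @(posedge CLK)
    if (RST | (GWE & flush)) begin
      // After reset or a flush, this stage holds a bubble.
      out_pc     <= 16'h0000;
      out_insn   <= 16'h0000;
      out_bubble <= 1'b1;
    end else if (GWE) begin
      out_pc     <= in_pc;
      out_insn   <= in_insn;
      out_bubble <= in_bubble;
    end
endmodule
```

With a chain of such registers, the out_bubble of the register feeding the W stage is the value that _TEST_W_STALL would report.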

Branch Prediction

Branches are resolved in the execute stage, so a mispredicted branch has a two-cycle penalty. Your pipeline should be able to operate in two branch prediction modes: i) using "implicit" branch prediction where the predicted PC is the current PC + 1 and ii) using "explicit" branch prediction, specifically a tagged 8-entry branch target buffer (BTB). This mode is controlled by switch 7 on the daughter-board: "up" for BTB, "down" for no BTB. The mode switch is passed to the pipeline module via the signal BPRED_ON.
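As a rough sketch of the "explicit" mode, the module below models a tagged 8-entry BTB. Several choices here are assumptions rather than requirements of the lab: the index is taken from PC[2:0] and the tag from PC[15:3], an entry is written whenever the hypothetical x_update signal reports a taken branch resolved in X, and behavioral arrays are used for brevity (the lab's Verilog subset may require a structural version).

```verilog
// Hypothetical 8-entry tagged BTB; names and update policy are assumptions.
module btb_sketch(CLK, RST, GWE, BPRED_ON,
                  f_pc, f_next_pc,
                  x_update, x_pc, x_target);
  input         CLK, RST, GWE, BPRED_ON;
  input  [15:0] f_pc;        // PC being fetched
  output [15:0] f_next_pc;   // predicted next PC
  input         x_update;    // assumed: taken branch resolved in X
  input  [15:0] x_pc, x_target;

  reg           valid  [0:7];
  reg    [12:0] tag    [0:7];
  reg    [15:0] target [0:7];
  integer       i;

  wire   [2:0]  f_idx = f_pc[2:0];
  wire          hit   = BPRED_ON & valid[f_idx] & (tag[f_idx] == f_pc[15:3]);

  // On a miss, or with BPRED_ON low, fall back to the implicit PC + 1.
  assign f_next_pc = hit ? target[f_idx] : f_pc + 16'd1;

  always @(posedge CLK)
    if (RST)
      for (i = 0; i < 8; i = i + 1)
        valid[i] <= 1'b0;
    else if (GWE & x_update) begin
      valid[x_pc[2:0]]  <= 1'b1;
      tag[x_pc[2:0]]    <= x_pc[15:3];
      target[x_pc[2:0]] <= x_target;
    end
endmodule
```

Because BPRED_ON gates the hit signal, the same fetch logic covers both modes: with the switch down, every prediction is PC + 1.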

Bypassing

Your pipeline should also be able to operate in two bypassing modes: i) no bypassing, and ii) full bypassing including MX, WX, and WM value bypassing and MX and WX NZP bypassing (yes, the NZP bits have to be bypassed too). When bypassing is on, the only stalls are for load-to-use (this includes load to conditional branch). The bypassing mode is controlled by switch 6 on the daughter-board: "up" for bypassing, "down" for no bypassing. The mode switch is passed to the pipeline module via the signal BYPASS_ON.
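The MX and WX value bypasses for a single ALU operand can be sketched as a priority mux. All names below are hypothetical, register numbers are 3 bits wide because LC4 has eight registers, and the sketch assumes the load-to-use stall logic already guarantees that a load in M never feeds a dependent instruction in X.

```verilog
// Hypothetical bypass select for one source operand in the X stage.
module bypass_sel_sketch(BYPASS_ON,
                         x_rs, x_rs_val,
                         m_rd, m_rd_we, m_alu_out,
                         w_rd, w_rd_we, w_wdata,
                         x_operand);
  input         BYPASS_ON;
  input  [2:0]  x_rs;        // source register of the insn in X
  input  [15:0] x_rs_val;    // value read from the register file in D
  input  [2:0]  m_rd;        // destination register of the insn in M
  input         m_rd_we;     // insn in M writes a register
  input  [15:0] m_alu_out;   // ALU result held in M
  input  [2:0]  w_rd;        // destination register of the insn in W
  input         w_rd_we;     // insn in W writes a register
  input  [15:0] w_wdata;     // value being written back in W
  output [15:0] x_operand;

  wire mx = BYPASS_ON & m_rd_we & (m_rd == x_rs);
  wire wx = BYPASS_ON & w_rd_we & (w_rd == x_rs);

  // MX takes priority over WX: the insn in M is the younger producer.
  assign x_operand = mx ? m_alu_out :
                     wx ? w_wdata   :
                          x_rs_val;
endmodule
```

With BYPASS_ON low, both selects are 0 and the operand always comes from the register file value, so the hazard must then be handled by stalling instead.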

Performance Counters

Real processors have performance counters that track various events within a processor to help understand its performance. The pipelined LC4 processor also has performance counters. These are memory-mapped and there are four of them.

  • Cycle count - 0xFF00: the number of cycles since the processor was last reset
  • Instruction count - 0xFF01: the number of actual instructions executed since the processor was last reset
  • Load stall count - 0xFF02: the number of cycles lost to load-use stalls (that is, the number of cycles in which zero instructions executed because of a load-use stall)
  • Branch stall count - 0xFF03: the number of cycles lost to branch mispredictions and/or stalls (that is, the number of cycles in which zero instructions executed because of a branch misprediction)

The cycle count is incremented every cycle. Every cycle one (and only one) of the instruction count, load stall, or branch stall counters is incremented. As such, the sum of these three registers should be equal to the cycle count.

Your processor should update the performance counters during the writeback stage. The performance counters should count an actual "NOOP" instruction as an instruction being executed. That is, a NOOP cycle is neither a branch stall cycle nor a load stall cycle. The counters reset to 0 only when the entire system is reset. The performance counters should also be hooked up to the testing interface via the _TEST_PMC_ buses.
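One way to keep the invariant that exactly one of the three event counters advances per cycle is to classify each cycle at the W stage, as in the sketch below. The signal names w_stall and w_stall_is_load are hypothetical, the reset is assumed to be synchronous, and the sketch counts every bubble not flagged as a load-use stall as a branch stall, which is a simplification of its own. Behavioral registers are used for brevity; the lab's Verilog subset may require the provided register modules instead.

```verilog
// Hypothetical performance-counter update, driven from the W stage.
module pmc_sketch(CLK, RST, GWE,
                  w_stall, w_stall_is_load,
                  cycle, insn, load_stall, branch_stall);
  input         CLK, RST, GWE;
  input         w_stall;          // W holds a bubble this cycle
  input         w_stall_is_load;  // assumed: the bubble came from a load-use stall
  output [15:0] cycle, insn, load_stall, branch_stall;

  reg    [15:0] cycle, insn, load_stall, branch_stall;

  always @(posedge CLK)
    if (RST) begin
      cycle        <= 16'd0;
      insn         <= 16'd0;
      load_stall   <= 16'd0;
      branch_stall <= 16'd0;
    end else if (GWE) begin
      cycle <= cycle + 16'd1;
      // Exactly one of the three branches below is taken each cycle.
      if (!w_stall)
        insn <= insn + 16'd1;             // includes real NOOP instructions
      else if (w_stall_is_load)
        load_stall <= load_stall + 16'd1;
      else
        branch_stall <= branch_stall + 16'd1;
    end
endmodule
```

The four outputs would drive _TEST_PMC_CYCLE, _TEST_PMC_INSN, _TEST_PMC_LOAD_STALL, and _TEST_PMC_BRANCH_STALL, and insn + load_stall + branch_stall stays equal to cycle by construction.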

New Daughterboard Interface

Because the main board switches are difficult to get at, I have moved the debugging interface to the daughterboard switches (I kept reset and single-step hooked up to the board buttons). I have also used the larger number of switches to expand the debugging interface and you can expand it further if you want.

  • Switch 8 is the new clock mode switch: "up" is auto, "down" is single-step.
  • Switch 7 is the branch prediction mode switch.
  • Switch 6 is the bypass mode switch.
  • Switches 5-1 control the output on the 7-segment display. Interpreting these as a binary number, you can display up to 32 different values. The value 16 (switch 5 up, 4-1 down) displays the PC currently at the Writeback stage. You can figure out what other things can be displayed by looking at the file dio4.v in the include/ subdirectory. This is the daughterboard IO controller code.

Debugging Using Trace Files

I modified PennSim to generate trace files for LC4 programs. You can generate a trace file for any program by entering trace on <tracefilename> at the PennSim command line, running and stopping the program, and then entering trace off. The trace file contains one line of five 16-bit words for each instruction executed. These are:

  • The PC
  • The instruction
  • A control word that actually contains four quantities. Bit 12 is the register write enable signal. Bit 8 is the memory write enable signal. Bit 4 is the NZP write enable signal. Bits 2-0 are the NZP bits themselves. For an STR, this word would appear as 0100, because an STR writes neither a register nor the NZP bits. For an ADD, it would be 101X with X either 1, 2, or 4 depending on the value being written (1 if the value is positive, 2 if the value is zero, 4 if the value is negative). For a CMP, it would be 001X.
  • The value written to the register file, or to memory if the instruction is a store.
  • The memory address if this instruction is a load or store.
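The layout of the third trace word can be written down as a single concatenation. The module and port names below are hypothetical and only illustrate the bit positions given in the list above.

```verilog
// Hypothetical packing of the trace control word from W-stage signals.
module trace_ctrl_word_sketch(regfile_we, dmem_we, nzp_we, nzp, word);
  input         regfile_we;  // goes to bit 12
  input         dmem_we;     // goes to bit 8
  input         nzp_we;      // goes to bit 4
  input  [2:0]  nzp;         // goes to bits 2-0
  output [15:0] word;

  assign word = {3'b000, regfile_we,   // bits 15-12
                 3'b000, dmem_we,      // bits 11-8
                 3'b000, nzp_we,       // bits 7-4
                 1'b0,   nzp};         // bits 3-0
endmodule
```

For example, an ADD that writes a positive value has regfile_we = 1, dmem_we = 0, nzp_we = 1, and nzp = 001, which packs to 1011 in hexadecimal. An STR with nzp = 000 packs to 0100.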

The pipeline module we gave you has hooks to dump out various values associated with the instruction currently in the W stage. The new version of the ModelSim test fixture test_lc4_pipe.tf reads in a trace file and compares the values in the trace to the corresponding values of the instruction currently in the W stage. If any of the values mismatch, it will tell you what the mismatch is. If you connect _TEST_W_STALL properly, the test fixture knows to ignore cycles in which no instruction is in the W stage. You can use this to help debug your pipeline. The trace corresponding to the harness4.hex file is harness4.trace.

Timing Test

Here are the timing.hex, timing.trace, and test_pl_timing.tf files for the timing test. Here is also the timing.asm file in case you want to look at the code and/or run it on PennSim.

Verilog Restrictions

You can use the same subset of Verilog as in lab1. You should pass all signals to modules by name (as opposed to by position).
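As an illustration of passing signals by name, the sketch below instantiates a small mux both ways. The mux module and all signal names are hypothetical and exist only to make the example self-contained.

```verilog
// Hypothetical 16-bit 2-to-1 mux, defined here only for the example.
module mux2to1_16_sketch(sel, a, b, out);
  input         sel;
  input  [15:0] a, b;
  output [15:0] out;
  assign out = sel ? b : a;
endmodule

module named_ports_example(pc_plus_one, btb_target, btb_hit, next_pc);
  input  [15:0] pc_plus_one, btb_target;
  input         btb_hit;
  output [15:0] next_pc;

  // By name: each port is tied explicitly, so reordering the
  // connections cannot silently swap two signals.
  mux2to1_16_sketch next_pc_mux(.sel(btb_hit),
                                .a(pc_plus_one),
                                .b(btb_target),
                                .out(next_pc));

  // By position (avoid):
  //   mux2to1_16_sketch next_pc_mux(btb_hit, pc_plus_one, btb_target, next_pc);
endmodule
```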

Demos

There will be two demos.

All group members should be present at the demos. All group members should understand the entire design well enough to be able to answer questions about it.

Writeups

There will also be two writeups.

Hints