SimpleScalar Simulation Tools

The SimpleScalar simulation tools were written by Todd Austin at the University of Wisconsin-Madison. They are currently maintained and distributed (free of charge for academic use) by SimpleScalar LLC. SimpleScalar is the most widely used simulation toolset among academic microarchitecture researchers. We will use a version of the tools which I have modified for the purposes of this class.

Getting the Simulation Tools

A copy of the basic simulation tool is on the eniac file system. If you are on eniac itself (red or blue), you can access my directories using ~amir/. Otherwise, you can access them from the mount point /mnt/eniac/home/amir/ (if that doesn't work, try /mnt/eniac/home8/a/amir/).  The files of interest are in the subdirectory ~amir/cis501/simplescalar/simulators/.  You should make a private copy of this directory.  The directories ~amir/cis501/simplescalar/traces/specint2000/test/ and ~amir/cis501/simplescalar/traces/specfp2000/test/ contain execution traces of the SPEC2000 integer and floating-point benchmarks executing their test inputs. You should make symbolic links to these directories.  You will not be modifying these traces, and there is no need for each of you to have a private copy of them.

Some Introductory Notes about SimpleScalar

Over the course of the semester, we will use four different simulators: sim-func (a functional simulator), sim-cache (a functional simulator with caches), sim-DLX (a timing simulator for a simple 5-stage pipeline with caches and branch prediction), and sim-R10K (a timing simulator for a dynamically scheduled superscalar processor that resembles the MIPS R10000).  You will first run and then modify (slightly) each of these simulators.  Here is some information that applies to all of the simulators.

Instruction Set and Binaries

The SimpleScalar simulators simulate the Alpha AXP (EV6) instruction set and will directly run most executables compiled on/for an Alpha. The simulator is a cross-simulator: it is designed for an x86/Linux host.  You can run it on the Linux machines on your desks and on the machines in Moore 207 (or on your laptops, if those run Linux).

System Call Interface and EIO Traces

We will not actually be using Alpha binaries directly.  Instead, we will use SimpleScalar eio traces. The most useful component of the SimpleScalar toolset is a module called syscall.c, which "implements" an operating system by translating Alpha system calls to the corresponding calls on the host OS (Linux). An eio (external-input-output) trace is an initial loaded memory image of the simulated program, plus register/memory diffs for all system calls made by that program.  An eio trace is useful because it allows you to set up a benchmark run once (with all the input files, output files, paths, arguments, etc.) and then forget about it: all that information is encoded in one way or another in the eio trace.  For subsequent runs, the simulator interacts with the eio trace rather than with the fake operating system.  I have created eio traces for the SPECint2000 benchmark test inputs.  These are in /mnt/gradient/home/amir/cis501/simplescalar_tools/specint2000_traces/*.test.eio. To run a simulation, you just type:

simulator_name [simulator_options] input_trace_file

for instance, the following line runs a functional simulation of the first 10M instructions of the benchmark gcc:

<my_simulator_dir>/sim-func -insn:limit 10000000 <my_link_to_traces_dir>/gcc.eio

Instruction Set Definition and Attributes

The Alpha machine definition is contained in three files: machine.h, machine.c, and machine.def.  The machine.def file is particularly interesting.  It contains a functional description of the instruction set as a list of C macros, two macros per instruction. The first macro is called *_IMPL and provides the functional implementation of the instruction.  The second macro is called DEFINST and provides some other information about the instruction.  Here are the macros for the Alpha instruction LDQ (load quad-word).

#define LDQ_IMPL                                                        \
  {                                                                     \
    quad_t _result;                                                     \
    enum md_fault_t _fault;                                             \
    SET_ADDR_DSIZE(READ_REG_Q(IR2) + SEXT(OFS), sizeof(quad_t));                \
    _fault = READ(READ_REG_Q(IR2) + SEXT(OFS), &_result, sizeof(quad_t));       \
    if (_fault != md_fault_none)                                        \
      DECLARE_FAULT(_fault);                                            \
    WRITE_REG_Q(OR1, _result);                                          \
  }
DEFINST(LDQ,                    0x29,
        "ldq",                  "a,o(b)",
        fuclass_MEMREAD,        F_MEM|F_LOAD|F_DISP|F_OFS,
        DGPRA, DNA,             DNA, DGPRB, DNA)

Notice that LDQ_IMPL relies on some other macros:

The simulator defines READ, READ_REG_Q and WRITE_REG_Q to point to its representation of registers and memory.
For example:

#define READ_REG_Q(N)          (regs.regs[N].q)
#define WRITE_REG_Q(N,EXPR)    (regs.regs[N].q = (EXPR))
#define READ(ADDR, PVAL, SIZE) (mem_access(mem, mc_READ, (ADDR), (PVAL), (SIZE)))

Once these macros are defined, you can implement execution with the following code. This way you don't have to create a special case for each instruction; #including the machine.def file does it for you.

      switch (opcode)
        {
#define DEFINST(OP,MSK,NAME,OPFORM,RES,FLAGS,O1,O2,I1,I2,I3)            \
        case OP:                                                        \
          SYMCAT(OP,_IMPL);                                             \
          break;
#include "machine.def"
        default:
          fprintf(stderr, "unknown opcode");
        }

The DEFINST macro itself contains the other information about the instruction: the opcode and its encoding mask, the assembly mnemonic and operand format, the functional unit class, the instruction flags, and the output and input register specifiers.

You can find out any of this information about the current instruction by using a similar macro.  For instance, to find the current instruction's input and output registers, use the following code:

      switch (opcode)
        {
#define DEFINST(OP,MSK,NAME,OPFORM,RES,FLAGS,O1,O2,I1,I2,I3)            \
        case OP:                                                        \
          output_reg1 = O1;                                             \
          output_reg2 = O2;                                             \
          input_reg1 = I1;                                              \
          input_reg2 = I2;                                              \
          input_reg3 = I3;                                              \
          break;
#include "machine.def"
        default:
          fprintf(stderr, "unknown opcode");
        }

You will see code like this in the simulators.
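Code like this can look mysterious the first time you see it. Here is a minimal, self-contained sketch of the same X-macro expansion trick; the toy instruction set, its _IMPL macros, and TOY_MACHINE_DEF are all made up for illustration (real machine.def entries carry many more fields):

```c
#include <assert.h>
#include <stdio.h>

/* A miniature "machine.def": one DEFINST line per instruction. */
#define TOY_MACHINE_DEF \
  DEFINST(ADD, 0x01)    \
  DEFINST(SUB, 0x02)

/* Per-instruction _IMPL macros, in the style of LDQ_IMPL. */
#define ADD_IMPL (result = a + b)
#define SUB_IMPL (result = a - b)

/* Token-pasting helper, like SimpleScalar's SYMCAT. */
#define SYMCAT(X, Y) X##Y

enum toy_opcode { ADD = 0x01, SUB = 0x02 };

int execute(enum toy_opcode opcode, int a, int b)
{
  int result = 0;
  switch (opcode)
    {
/* Expand one case per DEFINST line -- the simulators' trick. */
#define DEFINST(OP, MSK)  \
    case OP:              \
      SYMCAT(OP, _IMPL);  \
      break;
    TOY_MACHINE_DEF
#undef DEFINST
    default:
      fprintf(stderr, "unknown opcode\n");
    }
  return result;
}
```

Redefining DEFINST before each expansion is exactly how one instruction list can generate an execution switch in one place and a register-extraction switch in another.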

Except for the registers (which differ across static instructions), all the information in the DEFINST macro is constant across operation instances (i.e., all static instructions which have that opcode).  For instance, all LDQ instructions will have the same flags.  Because of this, there are preset macros in the machine.h file that get these attributes for you without forcing you to write one of these macro switch statements yourself.  For instance, the macro MD_OP_FLAGS(opcode) will get you the flags for a given opcode. You can then find out if the instruction is a load or a branch by testing these flags.
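As a sketch, flag testing looks like this; the flag values and the table standing in for MD_OP_FLAGS are illustrative (the real definitions live in machine.h and machine.def):

```c
#include <assert.h>

/* Illustrative flag bits -- the real values are defined in machine.h. */
#define F_MEM  0x1
#define F_LOAD 0x2
#define F_CTRL 0x4

/* Stand-in for MD_OP_FLAGS(op): a flags table indexed by opcode.
   In SimpleScalar this information comes from machine.def. */
static const unsigned toy_flags[] = {
  /* opcode 0: an ALU op */ 0,
  /* opcode 1: a load    */ F_MEM | F_LOAD,
  /* opcode 2: a branch  */ F_CTRL,
};
#define MD_OP_FLAGS(OP) (toy_flags[(OP)])

int is_load(int op)   { return (MD_OP_FLAGS(op) & F_LOAD) != 0; }
int is_branch(int op) { return (MD_OP_FLAGS(op) & F_CTRL) != 0; }
```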

Pre-Decoded Static Instructions

All instructions with the same opcode share a certain set of attributes (e.g., flags and functional unit requirements). One thing that programs commonly do is read and write registers.  Now, not all loads read and write the same registers, but all dynamic instances of the same static load do use the same logical registers.  To speed up simulation and save you from having to #include machine.def everywhere, the simulator includes a module called predec, which gives you an interface to pre-decoded instructions.  The pre-decoded instruction structure struct predec_insn_t contains the information unique to each static instruction.  In general, you can access an instruction's pre-decoded form by calling pdi_lookup with the instruction's PC as the argument.

The most useful field in the pdi structure is the lregnums field, a four-element array which contains the three input registers (DEP_I1, DEP_I2, and DEP_I3) and the one output register (DEP_O1).   You will notice that the execution macros IR1, IR2, IR3, and OR1 are defined using these fields.

#define IR1 (pdi->lregnums[DEP_I1])
#define IR2 (pdi->lregnums[DEP_I2])
#define IR3 (pdi->lregnums[DEP_I3])
#define OR1 (pdi->lregnums[DEP_O1])

The reason *_IMPL macros are defined using this level of indirection (IR1, IR2, etc. instead of DGPRA, DGPRB, etc.) is to allow us to easily redirect the registers at runtime.  This is especially useful when simulating out-of-order execution and register renaming.
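The payoff of this indirection can be sketched with simplified stand-in structures (toy names and types, not the simulator's actual ones): once the execution code reads register numbers only through the lregnums-style array, rewriting that array at rename time silently redirects the instruction to different (physical) registers.

```c
#include <assert.h>

enum { DEP_I1, DEP_I2, DEP_I3, DEP_O1, DEP_NUM };

/* Simplified stand-in for struct predec_insn_t. */
struct toy_pdi { int lregnums[DEP_NUM]; };

/* The execution macros see registers only through the array... */
#define IR1 (pdi->lregnums[DEP_I1])
#define OR1 (pdi->lregnums[DEP_O1])

static int regfile[64];   /* pretend register file */

/* A toy "move" instruction: OR1 <- IR1. */
void exec_move(struct toy_pdi *pdi)
{
  regfile[OR1] = regfile[IR1];
}

/* ...so "renaming" is just overwriting an array slot: afterwards the
   same exec_move reads/writes different (physical) registers. */
void rename_reg(struct toy_pdi *pdi, int dep, int preg)
{
  pdi->lregnums[dep] = preg;
}
```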

Universal Simulation Parameters

The file main.c defines several parameters for controlling simulation:

-insn:limit X tells the simulator to simulate only X instructions. -insn:limit 10000000 tells the simulator to stop after 10M instructions. -insn:limit 0 tells the simulator to simulate until the end of the program (or trace).  The default limit is 0.

-insn:progress X tells the simulator to spit out statistics every X instructions. -insn:progress 10000000 tells the simulator to print statistics every 10M instructions. The default progress update is 100M instructions.

-insn:sample X:Y:Z tells the simulator to cyclically sample the program at an interval of X+Y+Z and granularity Z.  For instance, -insn:sample 90000000:0:10000000 tells the simulator to cyclically skip 90M instructions, then simulate 10M instructions.  The default is no sampling, meaning that the simulator does not skip over any instructions.  When sampling is on, -insn:limit and -insn:progress only count sampled (non-skipped) instructions.

-insn:sample:first is similar to -insn:sample except it allows you to define a different interval for the first sample, to create a phase shift between two different sampling intervals.
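Here is a sketch of which dynamic instructions fall into the measured sample under -insn:sample X:Y:Z. It treats both the X and Y regions as unsampled (the precise role of Y is an assumption here; check main.c for its exact semantics):

```c
#include <assert.h>

/* Is dynamic instruction number i (0-based) in the measured sample,
   for -insn:sample X:Y:Z? */
int in_sample(long long i, long long x, long long y, long long z)
{
  long long pos = i % (x + y + z);
  return pos >= x + y;   /* past the skip (and assumed warm-up) regions */
}
```

For the gcc trace used later in this handout (1,768,334,067 instructions) with -insn:sample 90000000:0:10000000, this model predicts 17 full intervals of 10M each, i.e., 170,000,000 sampled instructions, which matches the sim_sample_insn line of the sampled run shown below.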


sim-func

sim-func is the simplest simulator.  It functionally simulates a program but does not measure anything about the program except for the number of each different kind of instruction the program executes.  To build sim-func, go to your copy of the simulators directory and type:

make sim-func

You can then run sim-func as follows:

<my_simulator_dir>/sim-func  <my_link_to_traces_dir>/gcc.eio

The simulator will eventually print out the following output:

sim: ** simulation statistics @ Wed Sep 18 14:15:34 2002 **
sim_elapsed_time                       346 # simulation time in seconds
sim_insn_rate                 5110792.1012 # simulation speed (insts/sec)
sim_num_insn                    1768334067 # instructions simulated (fast-forwarding included)
sim_sample_insn                 1768334067 # instructions (in sample)
sim_sample_load                  445084248 # loads
sim_sample_store                 218320657 # stores
sim_sample_branch                317230787 # branches
sim_sample_fp                      1009049 # floating point operations
sim_sample_prefetch                4771896 # prefetches
sim_sample_sys                        6710 # syscalls

Notice that because the simulator did not use sampling, the number of simulated instructions (sim_num_insn) and the number of sampled instructions (sim_sample_insn) are equal.  Here is an example of sampled execution where we take statistics for 10M out of every 100M instructions.

<my_simulator_dir>/sim-func -insn:sample 90000000:0:10000000 <my_link_to_traces_dir>/gcc.eio

sim: ** simulation statistics @ Wed Sep 18 14:43:32 2002 **
sim_elapsed_time                       266 # simulation time in seconds
sim_insn_rate                 6647872.4323 # simulation speed (insts/sec)
sim_num_insn                    1768334067 # instructions simulated (fast-forwarding included)
sim_sample_insn                  170000000 # instructions (in sample)
sim_sample_load                   42875170 # loads
sim_sample_store                  20676541 # stores
sim_sample_branch                 29998344 # branches
sim_sample_fp                        94262 # floating point operations
sim_sample_prefetch                 410902 # prefetches
sim_sample_sys                         509 # syscalls

Here is a similar run and output where we have limited execution to the first 10M instructions.

<my_simulator_dir>/sim-func -insn:limit 10000000 <my_link_to_traces_dir>/gcc.eio

Reached instruction limit: 10000000

sim: ** simulation statistics @ Wed Sep 18 14:48:33 2002 **
sim_elapsed_time                         2 # simulation time in seconds
sim_insn_rate                 5000000.0000 # simulation speed (insts/sec)
sim_num_insn                      10000000 # instructions simulated (fast-forwarding included)
sim_sample_insn                   10000000 # instructions (in sample)
sim_sample_load                    2492827 # loads
sim_sample_store                   1134749 # stores
sim_sample_branch                  1541242 # branches
sim_sample_fp                         1390 # floating point operations
sim_sample_prefetch                   3760 # prefetches
sim_sample_sys                         711 # syscalls


sim-cache

sim-cache is a functional simulator that also simulates a memory hierarchy.  First copy the files sim-cache.c, cache.h, cache.c, and Makefile from /mnt/gradient/home/amir/cis501/simplescalar_tools/simulators into your private simulator directory.  Then, to build sim-cache, type the following in your directory:

make sim-cache

You can run sim-cache exactly in the same way as you run sim-func:

<my_simulator_dir>/sim-cache [args]  <my_link_to_traces_dir>/gcc.eio

sim-cache supports instruction limits and sampling just like sim-func does.  In addition, sim-cache has five additional parameters, used to configure the memory hierarchy: -cache:dl1, -cache:il1, -cache:l2, -cache:dtlb, and -cache:itlb. All of these are configured in the same style: a colon-delimited string of 4 values <nsets>:<bsize>:<assoc>:<rpolicy>.

The total size of the structure in bytes is the product of the number of sets, the block size, and the associativity.  For example, to configure the simulator with a 2-way set associative, 32KB first-level data cache, with 32B blocks and random replacement, use the following line:

-cache:dl1 512:32:2:r
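Sanity-checking a configuration string against that size formula is easy:

```c
#include <assert.h>

/* Total cache capacity in bytes for an <nsets>:<bsize>:<assoc> triple. */
unsigned cache_bytes(unsigned nsets, unsigned bsize, unsigned assoc)
{
  return nsets * bsize * assoc;
}

/* Number of sets needed to reach a target capacity. */
unsigned cache_nsets(unsigned bytes, unsigned bsize, unsigned assoc)
{
  return bytes / (bsize * assoc);
}
```

cache_nsets(32 * 1024, 32, 2) returns the 512 used in the example above.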

When configuring a TLB, the block size is not the size of the TLB entry, but rather the page size. So, to configure a direct-mapped 16-entry instruction TLB with 1KB pages and LRU replacement, use the following line:

-cache:itlb 16:1024:1:l

If you don't want to include a certain structure in the hierarchy, replace the configuration string with the string none.  For instance, here's how to simulate a machine without a data TLB:

-cache:dtlb none

The way to simulate a unified structure in simplescalar (i.e., instructions and data living in the same structure) is to point the instruction structure to the corresponding data structure.  For example, to implement a unified first-level cache, use:

-cache:dl1 512:32:2:r
-cache:il1 dl1

In addition, there are the following three optional lines per cache for specifying hit latency, a victim buffer, and write-thru policy.

-cache:<cache_name>:hlat <hit_latency>             // default = 1
-cache:<cache_name>:vb {none|<nvictims>}           // default = none
-cache:<cache_name>:wthru {true|false}             // default = false

To specify a direct-mapped, 32KB first-level data cache with 32B lines, LRU replacement, write thru policy and a 2 item victim buffer, use:

-cache:dl1 1024:32:1:l
-cache:dl1:wthru true
-cache:dl1:vb 2

The hit latency parameter (-cache:<cache_name>:hlat) is only meaningful during timing simulation.

The SimpleScalar module that implements the cache functionality is called cache.c.  After initialization, sim-cache interacts with the cache module using the function cache_access. cache_access has the following signature:

unsigned int
cache_access(cache_t *cp,
             enum mem_cmd_t cmd,
             md_addr_t addr,
             unsigned int nbytes,
             tick_t now,
             bool_t miss_info[ct_NUM],
             miss_handler_t miss_handler)

Here is what these parameters mean: cp is a pointer to the cache being accessed, cmd is the command (read or write), addr is the address of the access, nbytes is its size, now is the current cycle (meaningful only during timing simulation), miss_info is an array of flags that the hierarchy fills in to record where the access missed, and miss_handler is explained next.

OK. So what is miss_handler? miss_handler is a function that has the following signature:

unsigned int
miss_handler(enum mem_cmd_t cmd,
             md_addr_t addr,
             unsigned int nbytes,
             tick_t now,
             bool_t miss_info[ct_NUM])

Basically, it's like cache_access but without the cp pointer and without its own miss_handler pointer. cache_access calls miss_handler internally whenever it needs to do something to the lower level of the memory hierarchy.  This could be a read (if there is a miss) or a write (if there is a dirty replacement or a write-thru).  By giving each cache its own personal miss_handler, we effectively connect the different caches to one another.  For instance, here is the first-level data cache's miss_handler, which effectively connects it to the L2 cache.

unsigned int
dl1_miss_handler(enum mem_cmd_t cmd,
                 md_addr_t addr,
                 unsigned int nbytes,
                 tick_t now,
                 bool_t miss_info[ct_NUM])
{
  if (cache_l2)
    cache_access(cache_l2, cmd, addr, nbytes, now, miss_info, l2_miss_handler);
  return 0;
}
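The connection scheme can be demonstrated outside the simulator with a self-contained miniature: each level does its own bookkeeping and, on a miss, simply calls the handler it was handed (the types and names below are toys, not the real cache.c interface):

```c
#include <assert.h>

typedef unsigned long long addr_t;
typedef unsigned (*handler_t)(addr_t addr);

static int l1_accesses, l2_accesses, mem_accesses;

/* Bottom of the hierarchy: "memory" always satisfies the request. */
unsigned mem_handler(addr_t addr)
{
  mem_accesses++;
  return 0;
}

/* A cache level reduced to its bookkeeping: count the access, then --
   pretending every access misses -- call the handler it was handed. */
unsigned level_access(int *counter, addr_t addr, handler_t miss_handler)
{
  (*counter)++;
  return miss_handler(addr);   /* "go to the next level down" */
}

/* The L2's personal handler goes to memory... */
unsigned l2_miss_handler(addr_t addr)
{
  return level_access(&l2_accesses, addr, mem_handler);
}

/* ...and the L1's personal handler goes to the L2: the levels are
   connected purely by which handler each one is given. */
unsigned l1_access(addr_t addr)
{
  return level_access(&l1_accesses, addr, l2_miss_handler);
}
```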

In the course of the programming assignment, you will be asked to modify the cache.c module somewhat.  Here are a few warnings about the sort of zaniness you may find in there.  SimpleScalar in general uses a lot of shortcuts to make simulation run faster.  In particular, SimpleScalar does not do things that would actually be done in a real processor but that are not needed to measure performance.  For instance, the cache module does not model the cache data itself, only the tags and the valid bits, since these are the things you have to have to calculate miss rates, etc.  Whenever you read or write something, the simulator pretends to get it from the cache and updates the cache tags and valid bits to indicate which blocks are present.  But in fact, the actual values are read and written directly from/to memory.  If the simulator were to copy entire blocks of data on every cache miss, simulation would slow down significantly.

SimpleScalar also does not model the TLB contents, meaning it doesn't actually do address translation.   In fact, there is no address translation anywhere in SimpleScalar; all addresses are virtual.  The memory module does use pages and an inverted page table to keep track of its own memory usage, but memory is accessed by virtual addresses as well.  There is nothing to say you can't put all that stuff into SimpleScalar if you were interested; it's just that most people are not.  The performance of paging and the memory-disk interface is on a different scale than the one most architects think about, so they abstract it. And you can model the performance of address translation without actually doing address translation.
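Here is roughly what the tags-and-valid-bits-only shortcut looks like, as a direct-mapped sketch (the real cache.c is set-associative and tracks more per-block state):

```c
#include <assert.h>

#define NSETS 512
#define BSIZE 32   /* bytes per block */

struct tagline { unsigned long long tag; int valid; };
static struct tagline cache[NSETS];

/* Returns 1 on hit, 0 on miss; on a miss, installs the new tag.
   No data is stored anywhere: values go straight to/from memory,
   yet hits and misses -- and therefore miss rates -- are exact. */
int tag_only_access(unsigned long long addr)
{
  unsigned long long block = addr / BSIZE;
  unsigned set = block % NSETS;
  unsigned long long tag = block / NSETS;

  if (cache[set].valid && cache[set].tag == tag)
    return 1;
  cache[set].valid = 1;
  cache[set].tag = tag;
  return 0;
}
```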


sim-DLX

sim-DLX is a simple simulator that models the execution of a simple DLX-style, 5-stage pipeline.  sim-DLX works by assuming that each instruction takes a single cycle to execute (for a large number of instructions, this is equivalent to assuming that a single instruction exits the pipeline every cycle) and then computing the delay on each instruction due to the memory hierarchy, branch prediction, and data dependence hazards. As usual, to build sim-DLX, type:

make sim-DLX

You run sim-DLX exactly in the same way as you run sim-func and sim-cache:

<my_simulator_dir>/sim-DLX [args] <my_link_to_traces_dir>/gcc.eio

sim-DLX supports instruction limits and sampling as usual.  It also implements a cache hierarchy similar to the one in sim-cache.  However, since sim-DLX is a timing simulator, some cache timing parameters must be supplied as well.  Each of the caches and TLBs has a hit latency parameter, -cache:<cache_name>:hlat, which must be set.  Since the simulator calculates instruction delay rather than stage latency, the first-level caches and TLBs should have a latency of 0. In addition to the cache latencies, there are two more latencies that can be parameterized: the main memory hit latency, -mem:hlat <latency>, and the TLB miss latency, -tlb:mlat <latency>.

sim-DLX also implements a branch predictor, which you configure using the following parameters:

-bpred:dirmethod {nottaken|taken|dynamic}
-bpred:dir1 {none|<size>:<pbits>:<hbits>:<pcshift>}
-bpred:dir2 {none|<size>:<pbits>:<hbits>:<pcshift>}
-bpred:dirchooser {none|<size>:<pbits>}
-bpred:btb {none|<nsets>:<assoc>:<dbits>:<tbits>}
-bpred:ras {none|<size>}

The first four parameters configure the mechanism by which the branch predictor predicts the directions (taken vs. not-taken) of conditional branches.

-bpred:dirmethod selects the method by which conditional branches are predicted.  Supported methods are nottaken (statically predict every branch not-taken), taken (statically predict every branch taken), and dynamic (predict using the tables configured below).

If the method is dynamic, some of the tables have to be configured as well.  There are two base prediction tables, dir1 and dir2, and a chooser table (for a hybrid or tournament predictor). Here are the parameters for the base direction tables.

-bpred:dir1 {none|<size>:<pbits>:<hbits>:<pcshift>}
-bpred:dir2 {none|<size>:<pbits>:<hbits>:<pcshift>}

The chooser is configured with only size and pbits fields. The simulator does not support a history-based chooser.

-bpred:dirchooser {none|<size>:<pbits>}

The last two parameters of the branch predictor configure its mechanism for predicting taken targets for conditional and unconditional branches.  The first parameter configures a BTB:

-bpred:btb {none|<nsets>:<assoc>:<dbits>:<tbits>}

The BTB is built like a cache: nsets and assoc configure the number of sets and the associativity.  nsets must, of course, be a power of two.  dbits and tbits specify the bit widths of the BTB's data and tag fields. In the Alpha architecture, all instruction addresses are four-byte aligned, so the two least-significant bits are ignored. A BTB match occurs if the least-significant tbits of the PC match the tag; the returned target replaces the least-significant dbits of the PC with the stored value.
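The match and target reconstruction can be written out directly. This sketch assumes the two alignment bits sit below both the tag and data fields; treat that detail (and the omission of set indexing) as assumptions, and check bpred.c for the exact bit positions:

```c
#include <assert.h>

typedef unsigned long long md_addr_t;

/* A BTB entry stores tbits of tag and dbits of target, both taken from
   instruction addresses whose two low bits are always zero on Alpha. */
struct btb_entry { md_addr_t tag; md_addr_t data; };

#define MASK(BITS) ((((md_addr_t)1) << (BITS)) - 1)

/* Does the entry match this PC, given a tbits-wide tag? */
int btb_match(const struct btb_entry *e, md_addr_t pc, int tbits)
{
  return ((pc >> 2) & MASK(tbits)) == e->tag;
}

/* Reconstruct the predicted target: replace the low dbits of the PC
   (above the two alignment bits) with the stored value. */
md_addr_t btb_target(const struct btb_entry *e, md_addr_t pc, int dbits)
{
  md_addr_t high = (pc >> 2) & ~MASK(dbits);
  return (high | e->data) << 2;
}
```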

There is also the option to configure a return address stack (RAS) to predict the targets of procedure returns.

-bpred:ras {none|<size>}

size is simply the number of entries in the RAS.
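A RAS is just a small circular stack of return addresses: push the return address on a call, pop a predicted target on a return. A minimal sketch:

```c
#include <assert.h>

#define RAS_SIZE 8

static unsigned long long ras[RAS_SIZE];
static int ras_top;   /* index of the next free slot */

/* On a call: push the return address (the instruction after the call). */
void ras_push(unsigned long long ret_addr)
{
  ras[ras_top] = ret_addr;
  ras_top = (ras_top + 1) % RAS_SIZE;
}

/* On a return: pop the predicted target.  Being circular, the stack
   silently overwrites its oldest entries when calls nest too deeply. */
unsigned long long ras_pop(void)
{
  ras_top = (ras_top - 1 + RAS_SIZE) % RAS_SIZE;
  return ras[ras_top];
}
```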

The branch prediction module is implemented in bpred.h and bpred.c.  The simulator interacts with the module using three functions.

bpred_lookup(struct bpred_t *bp,
             md_addr_t pc,
             enum md_opcode_t op,
             struct bpred_state_t *state);

bpred_lookup takes a pointer to the branch predictor, the branch PC, the branch instruction bits (so that the predictor knows whether the branch is conditional or unconditional, a call or a return, etc.), and a pointer to an empty bpred_state_t structure.  The predictor returns a predicted target address (or 0 if it has no prediction) and fills in the bpred_state_t structure with a copy of the predictor state (e.g., the BHR or the RAS) as it was before the prediction itself. Don't worry about what goes on in this state.

If execution discovers that the prediction was incorrect, the following function must be called:

bpred_recover(struct bpred_t *pred,
              md_addr_t pc,
              enum md_opcode_t op,
              md_addr_t next_pc,
              struct bpred_state_t *pre_state);

bpred_recover takes the same parameters (the state structure must be the one filled in by bpred_lookup) plus the actual target PC and restores the branch predictor to a correct (pre mis-prediction) internal state.

Regardless of whether or not the prediction was correct, the following function must be called when any branch completes.

bpred_update(struct bpred_t *pred,
             md_addr_t pc,
             enum md_opcode_t op,
             md_addr_t next_pc,
             md_addr_t targ_pc,
             md_addr_t pred_pc,
             struct bpred_state_t *pre_state);

bpred_update takes the same arguments, plus the target PC (the computed non-fallthru PC specified in the instruction) and the predicted PC. It updates the predictor state and collects some statistics.

The reason there are three functions rather than one, and the reason these functions communicate via the bpred_state_t structure, is that in a real pipeline each of these actions takes place in a different stage, meaning multiple lookups can be performed between updates and recoveries.  More on that in sim-R10K.
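The three-function protocol can be made concrete with a toy predictor: a single 2-bit counter plus a global history register that is updated speculatively at lookup time. The saved-state structure plays the role of bpred_state_t (everything here is illustrative, not the bpred.c implementation):

```c
#include <assert.h>

static int counter = 1;      /* 2-bit counter: 0..1 not-taken, 2..3 taken */
static unsigned bhr = 0;     /* global history, updated speculatively */

struct toy_bpred_state { int counter; unsigned bhr; };

/* Lookup: checkpoint the pre-prediction state, predict, and
   speculatively shift the prediction into the history. */
int toy_lookup(struct toy_bpred_state *s)
{
  int pred;
  s->counter = counter;
  s->bhr = bhr;
  pred = counter >= 2;
  bhr = (bhr << 1) | (unsigned)pred;
  return pred;
}

/* Recover: rebuild correct state from the checkpoint and the
   actual outcome, undoing the bogus speculative history. */
void toy_recover(const struct toy_bpred_state *s, int actual_taken)
{
  counter = s->counter;
  bhr = (s->bhr << 1) | (actual_taken ? 1u : 0u);
}

/* Update: train on the actual outcome. */
void toy_update(int taken)
{
  if (taken && counter < 3) counter++;
  if (!taken && counter > 0) counter--;
}
```

A pipeline would call toy_lookup at fetch, toy_recover when execution detects a misprediction, and toy_update when the branch commits.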

Finally, sim-DLX also simulates stalls due to true data dependences (RAW hazards). When computing stalls, the pipeline can be configured to simulate different combinations of bypasses.  X to X forwarding is enabled using the option: -bypass:xx {true|false}.  M to X forwarding is enabled using the option -bypass:mx {true|false}. The simulator does not currently correctly compute data-dependence stalls.  You will add those in as part of homework #3.


sim-R10K

sim-R10K is a cycle-level simulator for a superscalar out-of-order processor with branch prediction, a two-level cache hierarchy, and MIPS R10K-style register renaming.  As usual, to build sim-R10K, type:

make sim-R10K

You run sim-R10K exactly in the same way as you run sim-func and sim-cache:

<my_simulator_dir>/sim-R10K [args] <my_link_to_traces_dir>/gcc.eio

Unlike sim-DLX, sim-R10K measures execution time by actually modeling the pipeline structures and their contents on a cycle-by-cycle basis.  The simulated pipeline has five canonical stages: fetch, rename (which includes decode), schedule (which includes execute), writeback, and commit.  The main simulator loop (in the function sim_sample_on) calls the stage functions once per cycle, in reverse pipeline order so that each stage sees the state its successor left behind in the previous cycle:

for (;;)
  {
    commit_stage();
    writeback_stage();
    scheduler_stage();
    rename_stage();
    fetch_stage();
  }

The simulator models all the internal processor structures: the ROB, the LSQ (MOB), and the physical register file.  The simulator also models an instruction fetch queue (IFQ), which sits between the fetch and rename stages.  The IFQ allows the processor to keep fetching instructions when the ROB stalls for a cycle or two.  In addition to these structures, which in the simulator body are referred to as ROB, LSQ, IFQ, and pregs, the simulator also uses some auxiliary lists and queues to help with the book-keeping of out-of-order execution.  The most important of these are scheduler_queue, a queue of instructions that are ready to execute, and writeback_queue, a queue of completed instructions that are ready to broadcast their results.

It may help to think of how the stages and queues are connected.  fetch_stage places instructions on IFQ. rename_stage removes instructions from IFQ and places them in ROB, LSQ, and scheduler_queue. scheduler_stage removes instructions from scheduler_queue and places them on writeback_queue. writeback_stage removes instructions from writeback_queue and places dependent instructions on scheduler_queue. commit_stage removes instructions from ROB and LSQ.

sim-R10K shares many configuration parameters with sim-func, sim-cache, and sim-DLX. It supports sampling and instruction limits via the usual parameters, and its two-level memory hierarchy and branch predictor can be configured using the same parameters as sim-DLX.  sim-R10K also has a number of parameters for configuring the pipeline itself.  To make them easy to understand, parameter names follow a convention.  Parameters are collections of names separated by colons.  The first name is the name of the pipeline stage (e.g., fetch or sched).  The second name is the name of the structure, if any (e.g., ROB or LSQ).  The third name is the type of the parameter, which is most often one of three things: size/num (number of entries in a structure or number of structures), lat (structure access latency in cycles or number of pipeline stages), or width (bandwidth or number of parallel accesses to the structure).  So, by this convention, -fetch:lat is the number of pipeline stages in the fetch stage, -sched:ROB:size is the number of entries in the ROB, and -commit:width is the number of instructions that can be committed per cycle.

Here are the other sim-R10K pipeline configuration parameters, defaults and options are in braces.

There are three parameters for setting the fetch stage width, latency, and IFQ size.

-fetch:width <width>
-fetch:lat <lat>
-fetch:ifq:size <size>

There are three parameters for setting the rename stage width, latency, and number of physical registers.

-rename:width <width>
-rename:lat <lat>
-rename:pregs:num <num>

The scheduling stage has the most options, so we will leave that for last.

There are no parameters for controlling the width or latency of the writeback stage (although there probably should be).

There are two parameters for controlling the width of the commit stage.  They control how many total instructions can be retired in a given cycle and how many stores can be written to the data cache in a single cycle.

-commit:width <width>
-commit:store:width <width>

There is also a parameter for the rate at which recovery can take place.  Setting the recovery width equal to the commit width effectively models serial recovery (serial undoing of mappings by walking backwards through the ROB).  Setting the recovery width to the ROB size effectively models single-cycle recovery from a checkpoint.

-recover:width <width>

The most intricate part of the processor is the scheduler, or out-of-order execution engine.  There are about 15 parameters which configure this engine.  The first four configure the size of the scheduler structures.  In the LSQ you can actually limit the number of stores and loads independently.

-sched:rob:size <size>
-sched:ldq:size <size>
-sched:stq:size <size>
-sched:rs:num <num>

The next five parameters configure the scheduler bandwidth, namely how many instructions of each kind can be executed in a given cycle.  There are four instruction type scheduling slots (integer, floating-point, store, and load) and a total scheduling width slot.

-sched:int:width <width>
-sched:fp:width <width>
-sched:load:width <width>
-sched:store:width <width>
-sched:total:width <width>

Since scheduling slots are not exactly the same as functional units (e.g., your scheduler may allow you to execute only one floating-point instruction per cycle, but your processor may have two floating-point units: an adder, and a multiplier), functional units can be configured separately.  For these, the size parameter indicates the number of units of each type.  There are no parameters for controlling unit latency, although again there probably should be.

-sched:ialu:num <num>
-sched:imult:num <num>
-sched:fpalu:num <num>
-sched:fpmult:num <num>
-sched:cacheport:num <num>

The next group of parameters configures the scheduler's latency.  The raw scheduling latency is the number of pipeline stages it takes to schedule an instruction and read its inputs from the physical register file.  However, there are two additional latencies that can be specified.  Loads and stores must compute their addresses before accessing the LSQ; this is the address generation latency.  There is also a latency for forwarding a value from a store to a load via the LSQ; this is the forwarding latency.  The store-to-load forwarding latency should always be set equal to the data cache hit latency.

-sched:lat <lat>
-sched:agen:lat <lat>
-sched:fwd:lat <lat>

The last three scheduling parameters deal with scheduling policy.  The first two are boolean flags that control whether the scheduler schedules instructions in order or out of order, and whether it schedules speculatively (i.e., whether it schedules instructions on wrong execution paths).

-sched:inorder {true|false}
-sched:spec {true|false}

The last parameter controls the load scheduling policy.  The simulator models one of three policies: conservative, opportunistic, and selective.  The conservative and opportunistic policies are chosen using the names "conservative" and "opportunistic".  The selective policy is chosen by specifying the dimensions of the selection predictor, <sets>:<ways>.

-sched:adisambig {conservative|opportunistic|<sets>:<ways>}

Hacking sim-R10K

For your project, you may find yourself needing to hack sim-R10K.  If you are having trouble, make an appointment with the TA or myself.  I will post hacking advice to the newsgroup and to this spot when I return.