The SimpleScalar simulation tools were written
by Todd Austin at the University of Wisconsin-Madison. They are currently
maintained and distributed (free of charge for academic use) by SimpleScalar
LLC. The SimpleScalar simulation tools are the most widely used tools
by academic microarchitecture researchers. We will use a version
of the tools which I have modified for the purposes of this class.
simulator_name [simulator_options] input_trace_file
for instance, the following line runs a functional simulation of the first 10M instructions of the benchmark gcc:
<my_simulator_dir>/sim-func
-insn:limit 10000000 <my_link_to_traces_dir>/gcc.eio
#define LDQ_IMPL
\
{
\
quad_t _result;
\
enum md_fault_t _fault;
\
\
SET_ADDR_DSIZE(READ_REG_Q(IR2) + SEXT(OFS), sizeof(quad_t));
\
_fault = READ(READ_REG_Q(IR2) + SEXT(OFS), &_result, sizeof(quad_t));
\
if (_fault != md_fault_none)
\
DECLARE_FAULT(_fault);
\
\
WRITE_REG_Q(OR1, _result);
\
}
DEFINST(LDQ,
0x29,
"ldq",
"a,o(b)",
fuclass_MEMREAD, F_MEM|F_LOAD|F_DISP|F_OFS,
DGPRA, DNA,
DNA, DGPRB, DNA)
Notice, LDQ_IMPL relies on some other macros:
#define READ_REG_Q(N)
(regs.regs[N].q)
#define WRITE_REG_Q(N,EXPR)
(regs.regs[N].q = (EXPR))
#define READ(ADDR,
PVAL, SIZE) (mem_access(mem, mc_READ, (ADDR), (PVAL), (SIZE)))
Once these macros are defined, you can implement execution with the following code. This way you don't have to create a special case for each instruction, #including the machine.def file does it for you.
switch (opcode)
{
#define DEFINST(OP,MSK,NAME,OPFORM,RES,FLAGS,O1,O2,I1,I2,I3)
\
case OP:
\
SYMCAT(OP,_IMPL);
\
break;
#include "machine.def"
default:
fprintf(stderr, "unknown opcode");
break;
}
The DEFINST macro itself contains other information about the instruction:
switch (opcode)
{
#define DEFINST(OP,MSK,NAME,OPFORM,RES,FLAGS,O1,O2,I1,I2,I3)
\
case OP:
\
output_reg1 = O1;
\
output_reg2 = O2;
\
input_reg1 = I1;
\
input_reg1 = I2;
\
input_reg1 = I3;
\
break;
#include "machine.def"
default:
fprintf(stderr, "unknown opcode");
break;
}
You will see code like this in the simulators.
Except for the registers (which are different for every different static
instruction), all the information in the DEFINST
macro is constant across operation instances (i.e., all static instructions
which have that opcode). For instance, all LDQ
instructions will have the same flags. Because of this, there are
preset macros in the machine.h file that get these attributes for you without
forcing you to write one of these macro switch statements yourself.
For instance, the macro MD_OP_FLAGS(opcode)
will get you the flags for a given opcode. You can then find out if the
instruction is a load or a branch by testing these flags.
The most useful field in the pdi structure is the lregnums field, a 4 element array which contains three input registers (DEP_I1, DEP_I2, and DEP_I3) and one output register (DEP_O1). You will notice that the execution macros IR1, IR2, IR3, and OR1 are defined using these fields.
#define IR1 (pdi->lregnums[DEP_I1])
#define IR2
(pdi->lregnums[DEP_I2])
#define IR3
(pdi->lregnums[DEP_I3])
#define OR1
(pdi->lregnums[DEP_O1])
The reason *_IMPL
macros are defined using this level of indirection (IR1,
IR2,
etc. instead of DGPRA,
DGPRB,
etc.) is to allow us to easily redirect the registers at runtime.
This is especially useful when simulating out-of-order execution and register
renaming.
-insn:limit X tells the simulator to simulate only X instructions. -insn:limit 10000000 tells the simulator to stop after 10M instructions. -insn:limit 0 tells the simulator to simulate until the end of the program (or trace). The default limit is 0.
-insn:progress X tells the simulator to spit out statistics every X instructions. -insn:progress 10000000 tells the simulator to print statistics every 10M instructions. The default progress update is 100M instructions.
-insn:sample X:Y:Z tells the simulator to cyclically sample the program at an interval of X+Y+Z and granularity Z. For instance, -insn:sample 90000000:0:10000000 tells the simulator to cyclically skip 90M instructions, then simulate 10M instructions. The default sample is no, which means that the simulator does not skip over any instructions. When sampling is on, -insn:limit and -insn:progress only count sampled (non-skipped) instructions.
-insn:sample:first
is similar to -insn:sample except
it allows you to define a different interval for the first sample, to create
a phase shift between two different sampling intervals.
sim-func
sim-func is the simplest simulator. It functionally simulates a program but does not measure anything about the program except for the number of each different kind of instruction the program executes. To build sim-func, go to your copy of the simulators directory and type:
make sim-func
You can then run sim-func as follows:
<my_simulator_dir>/sim-func <my_link_to_traces_dir>/gcc.eio
the simulator will eventually print out the following output:
sim: ** simulation statistics @ Wed
Sep 18 14:15:34 2002 **
sim_elapsed_time
346 # simulation time in seconds
sim_insn_rate
5110792.1012 # simulation speed (insts/sec)
sim_num_insn
1768334067 # instructions simulated (fast-forwarding included)
sim_sample_insn
1768334067 # instructions (in sample)
sim_sample_load
445084248 # loads
sim_sample_store
218320657 # stores
sim_sample_branch
317230787 # branches
sim_sample_fp
1009049 # floating point operations
sim_sample_prefetch
4771896 # prefetches
sim_sample_sys
6710 # syscalls
Notice, because the simulator did not use sampling, the number of simulated instructions (sim_num_insn) and the number of sampled instructions (sim_sample_insn) are equal. Here is an example of sampled execution where we take statistics for 10M out of every 100M instructions.
<my_simulator_dir>/sim-func -insn:sample 90000000:0:10000000 <my_link_to_traces_dir>/gcc.eio
sim: ** simulation statistics @ Wed
Sep 18 14:43:32 2002 **
sim_elapsed_time
266 # simulation time in seconds
sim_insn_rate
6647872.4323 # simulation speed (insts/sec)
sim_num_insn
1768334067 # instructions simulated (fast-forwarding included)
sim_sample_insn
170000000 # instructions (in sample)
sim_sample_load
42875170 # loads
sim_sample_store
20676541 # stores
sim_sample_branch
29998344 # branches
sim_sample_fp
94262 # floating point operations
sim_sample_prefetch
410902 # prefetches
sim_sample_sys
509 # syscalls
Here is a similar run and output where we have limited execution to the first 10M instructions.
<my_simulator_dir>/sim-func -insn:limit 10000000 <my_link_to_traces_dir>/gcc.eio
Reached instruction limit: 10000000
sim: ** simulation statistics @ Wed
Sep 18 14:48:33 2002 **
sim_elapsed_time
2 # simulation time in seconds
sim_insn_rate
5000000.0000 # simulation speed (insts/sec)
sim_num_insn
10000000 # instructions simulated (fast-forwarding included)
sim_sample_insn
10000000 # instructions (in sample)
sim_sample_load
2492827 # loads
sim_sample_store
1134749 # stores
sim_sample_branch
1541242 # branches
sim_sample_fp
1390 # floating point operations
sim_sample_prefetch
3760 # prefetches
sim_sample_sys
711 # syscalls
sim-cache
sim-cache is a functional simulator that also simulates a memory hierarchy. To build sim-cache, copy the files sim-cache.c, cache.h, cache.c and Makefile from /mnt/gradient/home/amir/cis501/simplescalar_tools/simulators into your private simulator directory. To build sim-cache, type the following in your directory:
make sim-cache
You can run sim-cache exactly in the same way as you run sim-func:
<my_simulator_dir>/sim-cache [args] <my_link_to_traces_dir>/gcc.eio
sim-cache supports instruction limits and sampling just like sim-func does. In addition, sim-cache has 5 additional parameters. These are used to configure the memory hierarchy. They are, -cache:dl1, -cache:il1, -cache:l2, -cache:dtlb and -cache:itlb. All of these are configured in the same style-a colon delimited string of 4 values <nsets>:<bsize>:<assoc>:<rpolicy>.
-cache:dl1 512:32:2:r
When configuring a TLB, the block size is not the size of the TLB entry, but rather the page size. So, to configure a direct-mapped 16-entry instruction TLB with 1KB pages and LRU replacement, use the following line:
-cache:itlb 16:1024:1:l
If you don't want to include a certain structure in the hierarchy, replace the configuration string with the string none. For instance, here's how to simulate a machine without a data TLB:
-tlb:dtlb none
The way to simulate a unified structure in simplescalar (i.e., instructions and data living in the same structure) is to point the instruction structure to the corresponding data structure. For example, to implement a unified first-level cache, use:
-cache:dl1 512:32:2:r
-cache:il1 dl1
In addition, there are the following three optional lines per cache for specifying hit latency, a victim buffer, and write-thru policy.
-cache:<cache_name>:hlat
<hit_latency>
// default = 1
-cache:<cache_name>:vb
{none|<nvictims>}
// default = none
-cache:<cache_name>:wthru
{true|false}
// default = false
To specify a direct-mapped, 32KB first-level data cache with 32B lines, LRU replacement, write thru policy and a 2 item victim buffer, use:
-cache:dl1 1024:32:1:l
-cache:dl1:wthru true
-cache:dl1:vb 2
The hit latency parameter (-cache:<cache_name>:hlat) is only meaningful during timing simulation.
The SimpleScalar module that implements the cache functionality is called cache.c. After initialization, sim-cache interacts with the cache module using the function cache_access. cache_access has the following signature:
unsigned int
cache_access(cache_t *cp,
enum mem_cmd_t cmd,
md_addr_t addr,
unsigned int nbytes,
tick_t now,
bool_t miss_info[ct_NUM],
miss_handler_t miss_handler)
here is what these parameters mean:
unsigned int
miss_handler(enum mem_cmd_t cmd,
md_addr_t addr,
unsigned int nbytes,
tick_t now,
bool_t miss_info[ct_NUM])
Basically, it's like cache access but without the cp pointer and its own miss_handler pointer. cache_access calls miss_handler internally whenever it needs to do something to the lower level of the memory hierarchy. This could be a read (if there is a miss), or a write (if there is a dirty replacement or a write-thru). By giving each cache its own personal miss_handler, we effectively connect the different caches to one another. For instance, here is the first level data cache's miss_handler, which effectively connects it to the L2 cache.
unsigned int
dl1_miss_handler(enum mem_cmd cmd,
md_addr_t addr,
unsigned int nbytes,
tick_t now,
bool_t miss_info[ct_NUM])
{
if (cache_l2)
cache_access(cache_l2, cmd, addr, nbytes, now, miss_info, l2_miss_handler);
return 0;
}
In the course of the programming assignment, you will be asked to modify the cache.c module somewhat. Here are a few warnings about the sort of zaniness you may find in there. SimpleScalar in general uses a lot of shortcuts to make simulation run faster. In particular, SimpleScalar does not do the things that would actually be done in a real processor, but for which are not needed to measure anything about performance. For instance, the cache module does not model the cache data itself, only the tags and the valid bits, since these are the things you have to have to calculate miss rates, etc. Whenever you read or write something, the simulator pretends to get it from the cache, and updates the cache tags and valid bits to indicate which blocks are present. But in fact, the actual values are read and written directly from/to memory. If you were to copy entire blocks of data on every cache miss, the simulation would slow down significantly.
SimpleScalar also does not model the TLB contents, meaning it doesn't
actually do address translation. In fact, there is no actual
address translation anywhere in SimpleScalar, all addresses are virtual.
The memory module does use pages and an inverted page table to keep track
of its own memory usage, but memory is accessed by virtual addresses as
well. There is nothing to say you can't put all that stuff into SimpleScalar
if you were interested. It's just that most people are not interested.
The performance of paging and the memory-disk interface is just on a different
scale than the one most architects think about, so they abstract it. And
you can model the performance of address translation without actually doing
address translation.
sim-DLX
sim-DLX is a simple simulator that models the execution of a simple DLX-style, 5 stage pipeline. sim-DLX works assuming that each instruction takes a single cycle to execute (for a large number of instructions this is equivalent to assuming that a single instruction exits the pipeline every cycle), and computes the delay on each instruction due to the memory hierarchy, branch prediction, and data dependence hazards. As usual, to build sim-DLX type:
make sim-DLX
You run sim-DLX exactly in the same way as you run sim-func and sim-cache:
<my_simulator_dir>/sim-DLX [args] <my_link_to_traces_dir>/gcc.eio
sim-DLX supports instruction limits and sampling as usual. It also implements a cache hierarchy similar to the one in sim-cache. However, since sim-DLX is a timing simulator, some cache timing statistics must be supplied as well. Each of the caches and TLBs has a hit latency parameter, -cache:<cache_name>:hlat which must be filled. Since the simulator calculates instruction delay, rather than stage latency, the first level caches and TLBs should have a latency of 0. In addition the cache latencies, there are two more latencies that can be parameterized, the main memory hit latency, -mem:hlat <latency>, and the TLB miss latency -tlb:mlat <latency>.
sim-DLX also implements a branch predictor, which you configure using the following parameters:
-bpred:dirmethod
{nottaken|taken|dynamic}
-bpred:dir1
{none|<size>:<pbits>:<hbits>:<pcshift>}
-bpred:dir2
{none|<size>:<pbits>:<hbits>:<pcshift>}
-bpred:dirchooser
{none|<size>:<pbits>}
-bpred:btb
{none|<nsets>:<assoc>:<tagged>}
-bpred:ras
{none|<size>}
The first four parameters configure the mechanism by which the branch predictor predicts the directions (taken vs. not-taken) of conditional branches.
-bpred:dirmethod selects the method by which conditional branches are predicted. Supported methods are:
-bpred:dir1 {none|<size>:<pbits>:<hbits>:<pcshift>}
-bpred:dir2 {none|<size>:<pbits>:<hbits>:<pcshift>}
-bpred:dirchooser {none|<size>:<pbits>}
The last two parameters of the branch predictor configure its mechanism for predicting taken targets for conditional and unconditional branches. The first parameter configures a BTB:
-bpred:btb {none|<nsets>:<assoc>:<dbits>:<tbits>}
The BTB is built like a cache, nsets and assoc configure the number of sets and associativity. nsets must be a power of two obviously. dbits and tbits specify the bit width of the BTB's data and tag fields. In the Alpha architecture, all instruction addresses are four-byte aligned so the two least-significant bits are ignored. A BTB match occurs if the least-significant tbits of the PC match the tag; the returned target replaces the least-significant dbits of the PC with the stored value.
There is also the option to configure a return address stack (RAS) to predict the targets of procedure returns.
-bpred:ras {none|<size>}
size is simply the number of entries in the RAS.
The branch prediction module is implemented in bpred.h and bpred.c. The simulator interacts with the module using three functions.
md_addr_t
bpred_lookup(struct bpred_t *bp,
md_addr_t pc,
enum md_opcode_t op,
struct bpred_state_t *state);
bpred_lookup takes a pointer to the branch predictor, the branch PC, the branch instruction bits (so that the predictor knows whether the branch is conditional or unconditional, a call or a return, etc., and a pointer to an empty bpred_state_t structure. The predictor returns a predicted target address (or 0 if it has no prediction) and fills in the bpred_state_t structure with a copy of the predictor state (e.g., the BHR or the RAS) as it was before the prediction itself. Don't worry about what goes on in this state.
If execution discovers that the prediction was incorrect, the following function must be called:
void
bpred_recover(struct bpred_t *pred,
md_addr_t pc,
enum md_opcode_t op,
md_addr_t next_pc,
struct bpred_state_t *pre_state);
bpred_recover takes the same parameters (the state structure must be the one filled in by bpred_lookup) plus the actual target PC and restores the branch predictor to a correct (pre mis-prediction) internal state.
Regardless of whether or not the prediction was correct, the following function must be called when any branch completes.
void
bpred_update(struct bpred_t *pred,
md_addr_t pc,
enum md_opcode_t op,
md_addr_t next_pc,
md_addr_t targ_pc,
md_addr_t pred_pc,
struct bpred_state_t *pre_state);
bpred_update takes the same arguments, plus the target PC (the computed non-fallthru PC specified in the instruction) and the predicted PC. It updates the predictor state and takes some statistics.
The reason there are three functions rather than one and these functions essentially communicate via the bpred_state_t structure is that in a real pipeline simulation each one of these functions would take place in a different stage, meaning multiple lookups can be performed between updates and recoveries. More on that in sim-R10K.
Finally, sim-DLX also simulates stalls due to true data dependences
(RAW hazards). When computing stalls, the pipeline can be configured to
simulate different combinations of bypasses. X to X forwarding is
enabled using the option: -bypass:xx
{true|false}. M to X forwarding is enabled using
the option -bypass:mx
{true|false}. The simulator does not currently correctly
compute data-dependence stalls. You will add those in as part of
homework #3.
sim-R10K
Sim-R10K is a cycle-level simulator for a superscalar out-of-order processor with branch prediction, a two-level cache hierarchy and MIPS R10K style register renaming. As usual, to build sim-R10K type:
make sim-R10K
You run sim-DLX exactly in the same way as you run sim-func and sim-cache:
<my_simulator_dir>/sim-R10K [args] <my_link_to_traces_dir>/gcc.eio
Unlike sim-DLX, sim-R10K measures execution time by actually modeling the pipeline structures and their contents on a cycle-by-cycle basis. The simulated pipeline has five canonical stages: fetch, rename (which includes decode), schedule (which includes execute), writeback, and commit. The main simulator loop (in the function sim_sample_on) looks like this:
for (;;)
{
commit_stage();
writeback_stage();
schedule_stage();
rename_stage();
fetch_stage();
sim_cycle++;
}
The simulator models all the internal processor structures, the ROB, the LSQ (MOB), and physical register file. The simulator also models an instruction fetch queue (IFQ), which sits between the fetch and rename stages. The IFQ allows the processor to keep fetching instructions when the ROB stalls for a cycle or two. In addition to these structures, which in the simulator body are referred to as ROB, LSQ, IFQ, and pregs, the simulator also uses some auxiliary lists and queues to help with the book-keeping of out-of-order execution. The most important of these are scheduler_queue, a queue of instructions that are ready to execute-and writeback_queue, a queue of completed instructions that are ready to broadcast their results.
It may help to think of how the stages and queues are connected. fetch_stage places instructions on IFQ. rename_stage removes instructions from IFQ and places them in ROB, LSQ, and scheduler_queue. scheduler_stage removes instructions from scheduler_queue and places them on writeback_queue. writeback_stage removes instructions from writeback_queue and places dependent instructions on scheduler_queue. commit_stage removes instructions from ROB and LSQ.
Sim-R10K shares many of the configuration parameters as sim-func, sim-cache, and sim-DLX. It supports sampling and instruction limits via the usual configuration parameters, its two level memory hierarchy and branch predictor can be configured using the same parameters as those used for sim-DLX. Sim-R10K also has a number of parameters for configuring the pipeline itself. For easy understanding parameter names follow a convention. Parameters are collections of names separated by colons. The first name is the name of the pipeline stage (e.g., fetch or sched). The second name is the name of the structure if any (e.g., ROB or LSQ). The third name is the type of the parameter, which is most often one of three things: size/num (number of entries in a structure or number of structures), lat (structure access latency in cycles or number of pipeline stages), or width (bandwidth or number of parallel accesses to the structure). So, by this convention: -fetch:lat is the number of pipeline stages in the fetch stage, -sched:ROB:size is the number of entries in the ROB, and -commit:width is the number of instructions that can be committed per cycle.
Here are the other sim-R10K pipeline configuration parameters, defaults and options are in braces.
There are three parameters for setting the fetch stage width, latency, and IFQ size.
-fetch:width
<width>
-fetch:lat
<lat>
-fetch:ifq:size
<size>
There are three parameters for setting the rename stage width, latency, and number of physical registers.
-rename:width
<width>
-rename:lat
<lat>
-rename:pregs:num
<num>
The scheduling stage has the most options, so we will leave that for last.
There are no parameters for controlling the width or latency of the writeback stage (although there probably should be).
There are two parameters for controlling the width of the commit stage. The parameters control how many total instructions can be retired in a given cycle, and how many stores can be written to the data cache ins a single cycle.
-commit:width
<width>
-commit:store:width
<width>
The final parameter the simulator takes is the rate at which recovery can take place. Setting recovery bandwidth equal to commit bandwidth is effectively modeling serial recovery (serial undoing of mappings by walking backwards through the ROB). Setting recovery width to ROB size effectively models single cycle recovery from a checkpoint.
-recover:width <width>
The most intricate part of the processor is the scheduler, or out-of-order execution engine. There are about 15 parameters which configure this engine. The first four configure the size of the scheduler structures. In the LSQ you can actually limit the number of stores and loads independently.
-sched:rob:size
<size>
-sched:ldq:size
<size>
-sched:stq:size
<size>
-sched:rs:num
<num>
The next five parameters configure the scheduler bandwidth, namely how many instructions of each kind can be executed in a given cycle. There are four instruction type scheduling slots (integer, floating-point, store, and load) and a total scheduling width slot.
-sched:int:width
<width>
-sched:fp:width
<width>
-sched:load:width
<width>
-sched:store:width
<width>
-sched:total:width
<width>
Since scheduling slots are not exactly the same as functional units (e.g., your scheduler may allow you to execute only one floating-point instruction per cycle, but your processor may have two floating-point units: an adder, and a multiplier), functional units can be configured separately. For these, the size parameter indicates the number of units of each type. There are no parameters for controlling unit latency, although again there probably should be.
-sched:ialu:num
<num>
-sched:imult:num
<num>
-sched:fpalu:num
<num>
-sched:fpmult:num
<num>
-sched:cacheport:num
<num>
The next group of parameters configure the scheduler's latency. The raw scheduling latency is the number of pipeline stages it takes to schedule an instruction and read its inputs from the physical register file. However, there are two additional latencies that can be specified. Loads and stores must compute their addresses before accessing the LSQ. This is the address generation latency. There is also a latency for forwarding a value from a store to a load via the LSQ, this is the forwarding latency. The store-to-load forwarding latency should always be set equal to the data cache hit latency.
-sched:lat
<lat>
-sched:agen:lat
<lat>
-sched:fwd:lat
<lat>
The last three scheduling parameters deal with the scheduling policy. The first two are boolean flags that control whether or not the scheduler schedules instructions in-order or out-of-order and whether it schedules speculatively (i.e., whether it schedules instructions on wrong execution paths).
-sched:inorder
{true|false}
-sched:spec
{true|false}
The last parameter controls the load scheduling policy. The simulator models one of three policies: conservative, opportunistic and selective. The conservative and opportunistic policies are chosen using the names "conservative" and "opportunistic". The selective policy is chosen by specifying the dimensions of the selection predictor <sets>:<ways>.
-sched:adisambig
{conservative|opportunistic|<sets>:<ways>}
Hacking sim-R10K
For your project, you may find yourself needing to hack sim-R10K. If you are having trouble, make an appointment with the TA or myself. I will post hacking advice to the newsgroup and to this spot when I return.