Homework 3 - CIS371 Spring 2009

Instructor:	Prof. Milo Martin
Due:	3pm, Monday, April 13th (at the start of class)

Overview

In this assignment you will explore the effectiveness of caches and the impact of different cache configurations. We've used Pin, a binary instrumentation tool, to generate a trace of data accesses (all loads and stores) of a program. Your task will be use this representative trace to evaluate the impact of cache size, associativity, block size, etc.

To do this, you'll write a program that reads in the trace and simulates different cache configurations. You're welcome to write the program in whatever language you like; although we are going to ask for a printout of your source code, to end result of this assignment are the experimental results (and not the program source code).

Trace and Trace Format

The trace we're giving you is a trace of load and store operations from an execution of the program GCC (Gnu C Compiler) from the SPECint2000 benchmark suite. Instruction cache accesses are not recorded (we're focusing on the data cache in this assignment). Each line of the trace file has two fields. Below are the first few lines of a trace file:

bfede22c    L
bfede228    S

The first field is the memory address (in hex) being read or written. The second field is the character "L" or "S" to distinguish between load and store operations. You may assume that no memory operations cross cache blocks.

The trace is available as an 81MB gzip compressed file (memory-trace-gcc.trace.gz). If you wish to verify that the files downloaded correctly, the MD5 sum is 5036ef77c947267e5b1512080eeced68.

You can either (1) download the compressed file, uncompress it, and process the uncompressed file, or (2) process the compressed file directly. Under Linux this can be done by using zcat to pipe the uncompressed trace into a program reading from standard input. For example: zcat file.trace.gz | your-program. Another option is to use zlib (for C/C++) or java.util.zip (for Java) to open and process the compressed file directly.

To help you debug your cache simulation code, we've provided some annotated results for the first thousand or so memory references for various cache configurations (see below). Comparing your results to these annotated outputs should help ensure your code is working properly.

Question 1 - Direct-mapped Tag/Index/Offset Calculations

As talked about in class, a cache has three primary configuration parameters:

S: Cache size (how much data the cache holds)
B: Block size (the granularity of the data)
A: Associativity (number of blocks with the same index that can be in the cache at the same time).

These three parameters the number of sets and ways in the cache. These parameters also specify how a memory reference address is split into its three components: block offset bits, index bits, and tag bits.

Assuming 32-bit address, calculate the following values for a direct-mapped 32KB cache with 64B blocks (that is: S=32KB, B=64B, and A=1):

The number of bits in the block offset

The number of "sets" in the cache

The number of "ways" in the cache

The number of index bits

The number of tag bits

Question 2 - Set-Associative Tag/Index/Offset Calculations

For a two-way set associative version of the cache in question 1 (that is: S=32KB, B=64B, and A=2), calculate the following values:

The number of bits in the block offset

The number of "sets" in the cache

The number of "ways" in the cache

The number of index bits

The number of tag bits

Question 3 - Generalizing the Calculations

Assuming 32-bit address, write formulas in terms of S, B, and A for calculating the following quantities. You may find using log(S), log(B), and log(A) in the formulas easier than S, B, and A directly. For this assignment, you may assume S, B, and A are all powers of 2. Write the formulas for:

The number of bits in the block offset

The number of "sets" in the cache

The number of "ways" in the cache

The number of index bits

The number of tag bits

Hint: you may find it helpful to first work out the numbers for a direct-mapped cache (that is, consider just S and B), and then extend it to set-associative caches.

Question 4 - Miss Rate vs Cache Size

The first experiment is to gauge the impact of cache size on cache miss rate. Write a program to simulate a direct-mapped cache parametrized by S and B. You may actually find it more convenient to configure your cache in terms of log(S) and log(B) instead of S and B.

For this first direct-mapped cache simulations, the contents of the cache can be represented by a single array of values: the tag array. To avoid needing to track a separate valid bit, just initialize all the tag values to zero and assume that all blocks are valid and remain valid throughout the simulation. As we're not actually manipulating the actual data values (as a real processor would) but just calculating hit/miss for memory reference, there is no need to track the contents of the data array (in fact, the trace doesn't even have the information you would need to do this).

For now, handle loads and stores in the same fashion.

Use your cache simulator to produce cache miss rates for varying cache sizes. Generate the data for caches from 256 bytes (2⁸) to 4MB (2²²). Configure the block size to 64 bytes. Generate a line plot of this data. On the y-axis, plot the "cache miss rate (percent of all memory references)". The smaller the miss rate, the better. On the x-axis, plot the log of the cache size (basically, the log of the cache sizes), in essence giving us a log scale on the x-axis.

How large must the cache be for the miss rate to be less than 10%? How large to be less than 5%?

Today's processors generally have 32KB or 64KB first-level data caches. The best improvement you would expect in general from doubling the size of the cache would be a 2x reduction in misses (for example, reducing the miss rate from 20% to 10%). By what ratio does increasing the cache size from 32KB to 64KB reduce the miss rate? (2.0 would be halving the miss rate; 1.0 would be no change in miss rate; less than 1.0 would be an increase in misses).

Question 5 - Performance Impact of Cache Size

In that last question, you plotted the miss rate, but the miss rate is only indirectly related to performance. In this question, plot the CPI (lower is better) for a processor with varying cache size based on the following assumptions: a base CPI of 1, 33.33% of instructions are memory operations, and the cache miss latency is either 10, 25, or 100 cycles. The x-axis is cache size (same as the previous question). The y-axis is CPI. Plot three lines on a graph (once for each miss latency). Creating this graph shouldn't require additional simulations, but can instead be calculated from the data collected in the previous question.

For each of the cache miss latency, how large must the cache be to achieve a CPI of 2? What does this say about the impact on cache miss latency?

In the previous question you calculated the ratio of miss rates for increasing the cache size from 32KB to 64KB. What is the speedup that results from increasing the cache size from 32KB to 64KB (this is basically a "speedup" calculation in which 1.0 is no speedup, less than 1.0 is a slowdown, and greater than 1.0 is a speedup). Calculate the speedup for each of the cache miss latencies.

How do these speedup numbers compare to the cache miss ratio you calculated in part b of the previous question?

Question 6 - Traffic of Write-Back vs Write-Through Caches

Modify your simulator to calculate the total number of bytes transfered. The first source of traffic is the data transfers to fill cache blocks. This quantity is easy to calculate: it is just the number of misses multiplied by the block size. The other source of traffic is write traffic (either writeback or writethrough). In this question, you'll calculate the traffic for both write-back and write-through caches.

Write-through: For the write-through cache, the traffic is the average number of bytes written by each store multiplied by the number of stores. For this trace, the average number of bytes written by each store is just a little under 4, so just assume each write through is four bytes 4. You'll need to modify your simulator to count the number of stores.
Write-back: For the write-back cache system, the traffic is the number of dirty evictions multiplied by the block size. Because at most one eviction happens per miss, this writeback traffic can at most double the overall traffic (in the case in which all evicted blocks were dirty). To calculate the number of dirty evictions, modify your simulator by adding a "dirty" bit for each block in the cache. Set the dirty bit whenever a store operations writes to the block. This dirty bit is not per word or per byte; it is one dirty bit for the entire cache block.

In either case, use a write-allocate policy. As such, the number of misses between write-through and write-back is the same under our assumptions (no write buffers and such), only the traffic is different.

For a direct-mapped cache with 64-byte blocks, generate a graph plotting the same cache sizes as the previous question (on the x-axis) versus total data traffic per memory operation (total number of bytes transferred divide by the total number of memory references). Plot two lines: (1) the write-through cache (miss fill traffic + write-through traffic) and (2) the write-back cache (miss fill traffic + write-back of dirty blocks).

At what cache size do the two write policies generate approximately the same amount of traffic?

Why does the difference between the two schemes diverge at large cache sizes?

Why does the difference between the two schemes diverge at small cache sizes?

As the write-back cache uses less traffic for reasonable cache sizes, all subsequent experiments use write-back caches.

Question 7 - Set-Associative Caches

Modify your simulator to support two-way set associative caches. Each set in the cache will now need two "tag" entries (one for each way), two dirty bits (one for each way), and a "LRU" (least recently used) bit used for cache replacement. Whenever you make a memory reference to a block, update the LRU bit to point to the other entry in the way. When replacing a block, replace the LRU entry (and don't forget to update the LRU bit to point to the other entry). When performing a store, be sure to set the correct dirty bit.

Generate miss rate data for the same block size and caches sizes as in the previous question, but simulate two-way set associative caches. Plot this result with the original data collected for a direct-mapped cache (two lines on the graph: one for direct-mapped caches and one for the two-way set associative cache).

How large must the set-associative cache be for the miss rate to be less than 10%? How large to be less than 5%?

How large must the direct-mapped cache be before it equals or exceeds the performance of the 16KB two-way set-associative cache (ignore any noise in which the direct-mapped cache might slightly outperform the set-associative cache for some configurations)?

This graph shows that the performance gap between set-associative and direct-mapped caches narrows as the cache size grows very large. Explain what likely causes this narrowing.

As the set-associative cache performs better than the direct-mapped cache, all subsequent experiments use the two-way set associative cache.

Question 8 - Cache Block Size

Now let's explore the impact of different cache block sizes. Modify your simulator to support various block sizes. Use the simulator to calculate the miss rate for several different block sizes: 4B, 8B, 16B, 32B, 64B, 128B, 256B, 512B, and 1024B blocks. Do this for two cache sizes: 8KB and 32KB. Create two graphs from this data:

Miss Rate Graph. Plot the data on graph with miss rate on the y-axis, log of the cache block size on the x-axis, and the two caches sizes plotted as two different lines.
Traffic Graph. Plot the data on graph with bytes per memory reference on the y-axis, log of the cache block size on the x-axis, and the two caches sizes plotted as two different lines.

Answer the following questions:

What is the block size with the lowest miss rate for the 8KB cache? And the 32KB cache?

What is the block size with the lowest traffic (bytes per memory reference) for the 8KB cache? And the 32KB cache?

What are the two (or three) sources of additional traffic as the block size grows? Explain why each component grows.

Given that current processors typically use, say, 64B blocks, which metric (miss rate or traffic) are today's caches designed to minimize?

Question 9 - Way Prediction

One way to reduce the hit latency penalty of a set-associative cache is with way prediction. Way prediction allows the data array and tag array to be accessed in parallel. If the predicted way was correct (determined by a tag match), no penalty occurs. If the prediction was incorrect, additional cycles are needed to find the data.

A way predictor is a table that guesses which "way" a given address should access. For our two-way set-associative cache, this is just a single bit per entry (way "zero" or way "one"). The table uses n bits of the data block address to index into the table that contains 2ⁿ entries. The "offset" bits of the address are not used, so the n bits are the lower order bits from the combination of tag and index bits.

The way predictor is trained in two cases. First, whenever a block is inserted into the cache because of a miss, set the corresponding entry in the way predictor to record in which way of the cache the block was placed. Second, whenever a way mis-prediction occurs, update the way predictor to reflect the way of the block that caused the mis-prediction.

Simulate a way predictor for a two-way set-associative cache with 64B blocks and capacity of 32KB. Simulate way predictors sizes ranging from 1 entry to 2¹⁶ entries. Record how many way mis-predictions occur by categorizing each access as either a normal hit, a way mis-predicted hit, or a normal cache miss. If the block isn't in the cache, it isn't consider a mis-prediction; it is considered a cache miss.

Calculate the "percentage of cache hits that are way mis-predicted" (the lower the better). Plot a graph of this mis-prediction rate (y-axis) versus way predictor size (x-axis).

How accurate is just a single-entry way predictor? Was it more or less accurate than you anticipated? Give two reasons that explains its accuracy.

How large must the predictor be for the number of mis-predictions + cache misses to be smaller than the miss rate of a direct-mapped cache with the same capacity and block size? Basically, we're asking you to determine the point at which the two-way set associative cache that uses a way predictor is almost certainly better than a direct-mapped cache.

At the predictor size calculated above, approximately what is the percentage overhead in number of bits that the way predictor adds to the overall cache? (You can ignore the cache's tags if you wish to simplify the calculation.)

How many entries are required to get a 1% mis-prediction rate (a 99% accuracy)? What is the overhead in terms of relative number bits for this way predictor?

Test Cases

We have also provided you with a sample trace output run on the first thousand memory references from the trace. The output files have the following format:

[Set 0: {Way 0:000000, C} {Way 1:000000, C} LRU: 0] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S miss clean
[Set 0: {Way 0:ffb788, D} {Way 1:000000, C} LRU: 1] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S  hit    --
[Set 0: {Way 0:ffb788, D} {Way 1:000000, D} LRU: 0] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S  hit    --

The contents of each set are specified, these are the contents of each way (way number: tag, state, where C=clean, D=dirty) and the LRU way for that set. Each line says what the contents of the cache were before a particular cache access and then lists the address being accessed, the access type (S for store, L for load), the outcome (hit or miss) and if the state of the LRU way being evicted (clean or dirty on a miss, none on a hit). Note: the address in the trace above is the "block address". That is, it is the address from the trace with the offset bits set to zero.

The two output traces are:

cache-128-assoc-1-block-32.trace.output: A direct-mapped cache with 32B blocks and a total cache size of 128 bytes.
cache-256-assoc-2-block-64.trace.output: A two-way set-associative cache 64B blocks and a total cache size of 256 bytes.

What to Turn In

Short answers. Turn in your answers to the above questions. I strongly urge you to type up your answers.
Graphs. Turn in printouts of the following graphs. One per sheet of paper (to make them big enough to read) and label axises and gives the lines descriptive names in a legend.
- Graph A [Questions #4 and #7]: Miss rate for various cache sizes. Two lines: direct-mapped and two-way set associative.
- Graph B [Question #5]: The impact of cache size on CPI. One line: the CPI of the processor over various cache sizes
- Graph C [Question #6]: Traffic vs cache size. Two lines: write-through and write-back caches.
- Graph D [Question #8]: Impact of cache block size. Two lines: one for 8KB and one for 32KB cache.
- Graph E [Question #9]: Impact of way prediction. One line: way prediction accuracy (percent correct)
Source Code. Print out your final source code for your cache simulator.

Hints & Comments

Runtime. Running through the trace should take just a few seconds in most cases. Even if you run through the entire trace to generate each data point, it shouldn't take more than a few minutes to generate all the data you need for this assignment. If your code is taking much longer, you're probably not doing something right (ask me or the TAs about it).
Computer resources. If you wish to write the code by ssh'ing into a Linux box, you can log into minus.seas.upenn.edu. These machines are multi-core 3Ghz Intel chips, so they are quite fast. However, if your jobs run for too long (say 10 or 20 minutes) the system will kill off the job (to make sure that users don't use these machines for very long running jobs). Of course, you're also welcome to write and run the code on your personal computers or the lab machines.
Code reuse. In the end, the total amount of code needed to perform these simulations isn't actually that much. Don't go crazy in making the code super modular, but conversely there is significant potential for code reuse that shouldn't be ignored.
Automation of data collection. Put a little bit of though into how you're going to collect the various data for each question. If you're running your program multiple times by hand and then copying and pasting the individual results into Excel, you're doing something wrong (or at least super inefficiently). Your code should be able to perform a range of simulations in an automated fashion (either by building it into the main program of using scripts to call it with different parameters). You program should be able to split out data files that can be imported into Excel (or whatever) to make graphs with minimal tedium.

Addendum

[April 10] - Clarified some wording on way predictors
[April 12] - Clarified that the address in the output traces are "block address". That is, they are the address from the trace with the offset bits set to zero.