Homework 2 - CIS501 Fall 2009

Instructor:Prof. Milo Martin
Due:Noon, Thursday, Oct 22nd (at the start of class)

Overview

In this assignment you will explore the effectiveness of caches and the impact of different cache configurations. We've used Pin, a binary instrumentation tool, to generate a trace of data accesses (all loads and stores) of a program. Your task will be use this representative trace to evaluate the impact of cache size, associativity, block size, etc.

To do this, you'll write a program that reads in the trace and simulates different cache configurations. You're welcome to write the program in whatever language you like; although we are going to ask for a printout of your source code, to end result of this assignment are the experimental results (and not the program source code).

Trace and Trace Format

The trace we're giving you is a trace of load and store operations from an execution of the program GCC (Gnu C Compiler) from the SPECint2000 benchmark suite. Instruction cache accesses are not recorded (we're focusing on the data cache in this assignment). Each line of the trace file has two fields. Below are the first few lines of a trace file:

bfede22c    L
bfede228    S

The first field is the memory address (in hex) being read or written. The second field is the character "L" or "S" to distinguish between load and store operations. You may assume that no memory operations cross cache blocks.

The trace is available as an 81MB gzip compressed file (memory-trace-gcc.trace.gz). If you wish to verify that the files downloaded correctly, the MD5 sum is 5036ef77c947267e5b1512080eeced68.

You can either (1) download the compressed file, uncompress it, and process the uncompressed file, or (2) process the compressed file directly. Under Linux this can be done by using zcat to pipe the uncompressed trace into a program reading from standard input. For example: zcat file.trace.gz | your-program. Another option is to use zlib (for C/C++) or java.util.zip (for Java) to open and process the compressed file directly.

To help you debug your cache simulation code, we've provided some annotated results for the first thousand or so memory references for various cache configurations (see below). Comparing your results to these annotated outputs should help ensure your code is working properly.

Question 1 - Direct-mapped Tag/Index/Offset Calculations

As talked about in class, a cache has three primary configuration parameters:

These three parameters the number of sets and ways in the cache. These parameters also specify how a memory reference address is split into its three components: block offset bits, index bits, and tag bits.

Assuming 32-bit address, calculate the following values for a direct-mapped 32KB cache with 64B blocks (that is: S=32KB, B=64B, and A=1):

  1. The number of bits in the block offset

    Answer

    6 bits

  2. The number of "sets" in the cache

    Answer

    512 sets

  3. The number of "ways" in the cache

    Answer

    Each set has 1 way. So total number of ways = 512* 1 = 512

  4. The number of index bits

    Answer

    9 bits

  5. The number of tag bits

    Answer

    17 bits

Question 2 - Set-Associative Tag/Index/Offset Calculations

For a two-way set associative version of the cache in question 1 (that is: S=32KB, B=64B, and A=2), calculate the following values:

  1. The number of bits in the block offset

    Answer

    6 bits

  2. The number of "sets" in the cache

    Answer

    256 sets

  3. The number of "ways" in the cache

    Answer

    Each set has 2 ways. So total number of ways = 2 * 256 = 512

  4. The number of index bits

    Answer

    8 bits

  5. The number of tag bits

    Answer

    18 bits

Question 3 - Generalizing the Calculations

Assuming 32-bit address, write formulas in terms of S, B, and A for calculating the following quantities. You may find using log(S), log(B), and log(A) in the formulas easier than S, B, and A directly. For this assignment, you may assume S, B, and A are all powers of 2. Write the formulas for:

  1. The number of bits in the block offset

    Answer

    log(B)

  2. The number of "sets" in the cache

    Answer

    S/(B * A)

  3. The number of "ways" in the cache

    Answer

    S/B

  4. The number of index bits

    Answer

    log(S/(B * A))

  5. The number of tag bits

    Answer

    32 - log(B) - log(S/(B * A))

Hint: you may find it helpful to first work out the numbers for a direct-mapped cache (that is, consider just S and B), and then extend it to set-associative caches.

Question 4 - Extracting Bits from Addresses

When writing your cache simulator, you'll find it useful to have function (or methods) to extract the tag bits and index bits from an address. Given the four integer variables: address (address of the access), num_tag_bits (number of tag bits), num_index_bits (number of index bits), "num_offset_bits (number of offset bits), use bit manipulation operations to write C/C++/Java code to:

  1. Calculate the tag of the address (an integer value):

    tag = <your code here>;
    

    Answer

    tag = address >> (num_offset_bits + num_index_bits);

  2. Calculate the index of the address (an integer value):

    index = <your code here>;
    

    Answer

    index = (address >> num_offset_bits) & ~(~0 << num_index_bits);

IMPORTANT: Do not convert the integer values to character strings (which is slow and unnecessarily verbose). Using integer bit manipulation instructions (shifts, bitwise and, or, not, etc.) can be used to to easily and efficiently accomplish such bit extraction operations. Hint: to generate an integer value with all bits set to one, you can use ~0 (bit-wise not of zero).

Question 5 - Miss Rate vs Cache Size

The first experiment is to gauge the impact of cache size on cache miss rate. Write a program to simulate a direct-mapped cache parametrized by S and B. You may actually find it more convenient to configure your cache in terms of log(S) and log(B) instead of S and B.

For this first direct-mapped cache simulations, the contents of the cache can be represented by a single array of values: the tag array. To avoid needing to track a separate valid bit, just initialize all the tag values to zero and assume that all blocks are valid and remain valid throughout the simulation. As we're not actually manipulating the actual data values (as a real processor would) but just calculating hit/miss for memory reference, there is no need to track the contents of the data array (in fact, the trace doesn't even have the information you would need to do this).

For now, handle loads and stores in the same fashion.

Use your cache simulator to produce cache miss rates for varying cache sizes. Generate the data for caches from 256 bytes (28) to 4MB (222). Configure the block size to 64 bytes. Generate a line plot of this data. On the y-axis, plot the "cache miss rate (percent of all memory references)". The smaller the miss rate, the better. On the x-axis, plot the log of the cache size (basically, the log of the cache sizes), in essence giving us a log scale on the x-axis.

  1. How large must the cache be for the miss rate to be less than 10%? How large to be less than 5%?

    Answer

    For miss rate less than 10% - 211

    For miss rate less than 5% - 214

  2. Today's processors generally have 32KB or 64KB first-level data caches. The best improvement you would expect in general from doubling the size of the cache would be a 2x reduction in misses (for example, reducing the miss rate from 20% to 10%). By what ratio does increasing the cache size from 32KB to 64KB reduce the miss rate? (2.0 would be halving the miss rate; 1.0 would be no change in miss rate; less than 1.0 would be an increase in misses).

    Answer

    2.38

Question 6 - Performance Impact of Cache Size

In that last question, you plotted the miss rate, but the miss rate is only indirectly related to performance. In this question, plot the CPI (lower is better) for a processor with varying cache size based on the following assumptions: a base CPI of 1, 33.33% of instructions are memory operations, and the cache miss latency is either 10, 25, or 100 cycles. The x-axis is cache size (same as the previous question). The y-axis is CPI. Plot three lines on a graph (once for each miss latency). Creating this graph shouldn't require additional simulations, but can instead be calculated from the data collected in the previous question.

  1. For each of the cache miss latency, how large must the cache be to achieve a CPI of 2? What does this say about the impact on cache miss latency?

    Answer

    10 cycle latency - it is less than 2 at 28

    25 cycle latency - 210

    100 cycle latency - 215

  2. In the previous question you calculated the ratio of miss rates for increasing the cache size from 32KB to 64KB. What is the speedup that results from increasing the cache size from 32KB to 64KB (this is basically a "speedup" calculation in which 1.0 is no speedup, less than 1.0 is a slowdown, and greater than 1.0 is a speedup). Calculate the speedup for each of the cache miss latencies.

    Answer

    10 cycle - 1.04

    25 cycle - 1.1

    100 cycle - 1.34

  3. How do these speedup numbers compare to the cache miss ratio you calculated in part b of the previous question?

    Answer

    Reduction in CPI is not as significant as the miss rate. This is because time spent missing in the cache is only a fraction of the total exeuction time.

Question 7 - Traffic of Write-Back vs Write-Through Caches

Modify your simulator to calculate the total number of bytes transfered. The first source of traffic is the data transfers to fill cache blocks. This quantity is easy to calculate: it is just the number of misses multiplied by the block size. The other source of traffic is write traffic (either writeback or writethrough). In this question, you'll calculate the traffic for both write-back and write-through caches.

In either case, use a write-allocate policy. As such, the number of misses between write-through and write-back is the same under our assumptions (no write buffers and such), only the traffic is different.

For a direct-mapped cache with 64-byte blocks, generate a graph plotting the same cache sizes as the previous question (on the x-axis) versus total data traffic per memory operation (total number of bytes transferred divide by the total number of memory references). Plot two lines: (1) the write-through cache (miss fill traffic + write-through traffic) and (2) the write-back cache (miss fill traffic + write-back of dirty blocks).

  1. At what cache size do the two write policies generate approximately the same amount of traffic?

    Answer

    212

  2. Why does the difference between the two schemes diverge at large cache sizes?

    Answer

    With the increase in the cache size, misses and thus evictions are less frequent. Because write-back caches generate traffic only on evictions, it generates less traffic. In contrast, the write-through traffic of stores is independent of the size of the cache.

  3. Why does the difference between the two schemes diverge at small cache sizes?

    Answer

    With smaller caches, miss and thus evictions are more frequent. Write-back caches evict entire cache blocks (64 bytes) in contrast to write-through which writes a word (4 bytes).

As the write-back cache uses less traffic for reasonable cache sizes, all subsequent experiments use write-back caches.

Question 8 - Set-Associative Caches

Modify your simulator to support two-way set associative caches. Each set in the cache will now need two "tag" entries (one for each way), two dirty bits (one for each way), and a "LRU" (least recently used) bit used for cache replacement. Whenever you make a memory reference to a block, update the LRU bit to point to the other entry in the way. When replacing a block, replace the LRU entry (and don't forget to update the LRU bit to point to the other entry). When performing a store, be sure to set the correct dirty bit.

Generate miss rate data for the same block size and caches sizes as in the previous question, but simulate two-way set associative caches. Plot this result with the original data collected for a direct-mapped cache (two lines on the graph: one for direct-mapped caches and one for the two-way set associative cache).

  1. How large must the set-associative cache be for the miss rate to be less than 10%? How large to be less than 5%?

    Answer

    Less than 10% - 211

    Less than 5% - 213

  2. How large must the direct-mapped cache be before it equals or exceeds the performance of the 16KB two-way set-associative cache (ignore any noise in which the direct-mapped cache might slightly outperform the set-associative cache for some configurations)?

    Answer

    32KB

  3. This graph shows that the performance gap between set-associative and direct-mapped caches narrows as the cache size grows very large. Explain what likely causes this narrowing.

    Answer

    As cache size increases, each set in the cache has fewer memory locations that map to it. As a result, the number of conflict misses decreases. In the limit, where cache size is sufficiently large, direct mapped cache, set associative cache and a fully associative would have the same performance.

As the set-associative cache performs better than the direct-mapped cache, all subsequent experiments use the two-way set associative cache.

Question 9 - Cache Block Size

Now let's explore the impact of different cache block sizes. Modify your simulator to support various block sizes. Use the simulator to calculate the miss rate for several different block sizes: 4B, 8B, 16B, 32B, 64B, 128B, 256B, 512B, and 1024B blocks. Do this for two cache sizes: 8KB and 32KB. Create two graphs from this data:

Answer the following questions:

  1. What is the block size with the lowest miss rate for the 8KB cache? And the 32KB cache?

    Answer

    8 KB - 256 byte block size

    32 KB - 256 byte block size

  2. What is the block size with the lowest traffic (bytes per memory reference) for the 8KB cache? And the 32KB cache?

    Answer

    8KB - 4 bytes

    32KB - 4 bytes

  3. What are the two (or three) sources of additional traffic as the block size grows? Explain why each component grows.

    Answer

    Several factors increase: the miss rate (more misses cause more traffic), the amount of data being brought into the cache (more traffic on each miss), and the amount of data being written for dirty blocks (more data is written back, even if not all of it has changed).

  4. Given that current processors typically use, say, 64B blocks, which metric (miss rate or traffic) are today's caches designed to minimize?

    Answer

    64B is a sweet spot with respect to miss-rate and traffic. Miss rate starts to saturate at 64B whereas the traffic starts to increase after 64B.

Question 10 - Way Prediction

One way to reduce the hit latency penalty of a set-associative cache is with way prediction. Way prediction allows the data array and tag array to be accessed in parallel. If the predicted way was correct (determined by a tag match), no penalty occurs. If the prediction was incorrect, additional cycles are needed to find the data.

A way predictor is a table that guesses which "way" a given address should access. For our two-way set-associative cache, this is just a single bit per entry (way "zero" or way "one"). The table uses n bits of the data block address to index into the table that contains 2n entries. The "offset" bits of the address are not used, so the n bits are the lower order bits from the combination of tag and index bits.

The way predictor is trained in two cases. First, whenever a block is inserted into the cache because of a miss, set the corresponding entry in the way predictor to record in which way of the cache the block was placed. Second, whenever a way mis-prediction occurs, update the way predictor to reflect the way of the block that caused the mis-prediction.

Simulate a way predictor for a two-way set-associative cache with 64B blocks and capacity of 32KB. Simulate way predictors sizes ranging from 1 entry to 216 entries. Record how many way mis-predictions occur by categorizing each access as either a normal hit, a way mis-predicted hit, or a normal cache miss. If the block isn't in the cache, it isn't consider a mis-prediction; it is considered a cache miss.

Calculate the "percentage of cache hits that are way mis-predicted" (the lower the better). Plot a graph of this mis-prediction rate (y-axis) versus way predictor size (x-axis).

  1. How accurate is just a single-entry way predictor? Was it more or less accurate than you anticipated? Give two reasons that explains its accuracy.

    Answer

    83% correct (17% wrong). Why? First, the accuracy would be 50% just by guessing randomly (even a stuck clock is right twice a day). Second, spatial locality means that the same block is likely accessed multiple times, increasing its accuracy.

  2. How large must the predictor be for the number of mis-predictions + cache misses to be smaller than the miss rate of a direct-mapped cache with the same capacity and block size? Basically, we're asking you to determine the point at which the two-way set associative cache that uses a way predictor is almost certainly better than a direct-mapped cache.

    Answer

    211

  3. At the predictor size calculated above, approximately what is the percentage overhead in number of bits that the way predictor adds to the overall cache? (You can ignore the cache's tags if you wish to simplify the calculation.)

    Answer

    Less than 1% (0.8%). Why? 211 is 2048 bits, which is 256 bytes. 256 bytes is less than 1% of the 32KB L1 cache.

  4. How many entries are required to get a 1% mis-prediction rate (a 99% accuracy)? What is the overhead in terms of relative number bits for this way predictor?

    Answer

    29 bits, with overhead of only 0.2%.

Test Cases

We have also provided you with a sample trace output run on the first thousand memory references from the trace. The output files have the following format:

[Set 0: {Way 0:000000, C} {Way 1:000000, C} LRU: 0] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S miss clean
[Set 0: {Way 0:ffb788, D} {Way 1:000000, C} LRU: 1] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S  hit    --
[Set 0: {Way 0:ffb788, D} {Way 1:000000, D} LRU: 0] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S  hit    --

The contents of each set are specified, these are the contents of each way (way number: tag, state, where C=clean, D=dirty) and the LRU way for that set. Each line says what the contents of the cache were before a particular cache access and then lists the address being accessed, the access type (S for store, L for load), the outcome (hit or miss) and if the state of the LRU way being evicted (clean or dirty on a miss, none on a hit). Note: the address in the trace above is the "block address". That is, it is the address from the trace with the offset bits set to zero.

The two output traces are:

What to Turn In

Hints & Comments


Addendum