CSE 371/372 Spring 2005
Homework 4

Due: Thursday, April 7 in class.

Floating Point

1. (5 points) a. Add the numbers (in decimal, not binary) 8.43*10^2 and 9.31*10^3 on a machine with 3 significant digits and guard and round digits. b. Repeat the addition on a machine with 3 significant digits but no guard and round digits.
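As a reminder of how guard and round digits work, here is a worked example of the same procedure with different operands (2.56*10^0 + 2.34*10^2), assuming round-to-nearest:

```latex
% Align the smaller operand to the larger exponent:
% 2.56 \times 10^{0} = 0.0256 \times 10^{2}
\begin{align*}
  0.02\,\underbrace{5}_{\text{guard}}\underbrace{6}_{\text{round}} \times 10^{2}
    + 2.34 \times 10^{2} &= 2.3656 \times 10^{2}
    \approx 2.37 \times 10^{2} &&\text{(guard and round digits kept)}\\
  0.02 \times 10^{2} + 2.34 \times 10^{2} &= 2.36 \times 10^{2}
    &&\text{(extra digits discarded before adding)}
\end{align*}
```

The two machines differ in the last significant digit of the result; your answer to part b should exhibit the same kind of discrepancy.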

2. (5 points) The MIPS instruction c.X.s (where X is one of eq, ne, lt, le, gt, ge) compares two single-precision floating-point registers and puts a 1 in the specified integer register if the corresponding test holds. In a paragraph, explain the structure of a functional unit that can execute this instruction. How does the IEEE 754 representation help to keep this unit simple?

Memory Hierarchy

3. (5 points) What is the tag overhead of a 128KB, two-way set associative cache, with 64B blocks on a 32-bit machine?

4. (12 points) Consider a cache with a hit time of 1 cycle and a lower-level memory hierarchy with an average access time of 20 cycles. The cache miss rate is 5%. Assume an access stream composed of 80% loads and 20% stores and a processor that must wait for stores to fully complete.

a. What is the average access time of the cache if we assume that it is write-back/write-allocate, that 25% of all blocks are dirty and that there is no write buffer?

b. What is the average access time if we add a write buffer?

c. What is the average access time if we assume that the cache is write-thru/write-non-allocate with no write buffer?

d. What is the average access time if we add a write buffer?

5. (8 points) Consider a direct-mapped 4B cache with four 1B blocks and a machine with 4-bit addresses. Construct a repeating sequence of 4 accesses that would generate fewer misses if we were to double the block size. Construct a repeating sequence of 4 accesses that would generate more misses if we were to double the block size.

6. (10 points) Consider a cache with 64B lines, an 8B-wide 200MHz bus, and 8B-wide DRAM chips with a 10ns access time and a 20ns cycle time. What would be the cache miss penalty (in ns) if you used a single DRAM chip? (Ignore the latency of the address bus.) What is the smallest number of DRAM chips you could use that would minimize the cache miss penalty?

7. (20 points) Consider the following memory system parameters:

• 8KB pages
• 4B PTEs
• A program P that uses 8MB of code, 256MB of heap and 1MB of stack

a. What is the size of a single-level page table for program P?

b. Assuming that 2nd-level tables are page-sized, how much space would a two-level page table for program P occupy? Assume that partially used pages count as full pages.

c. What is the total tag storage for a virtually-indexed/physically-tagged cache with the same organization as in question 3? (Ignore the fact that a cache of this size combined with a page of this size couldn't support a TLB)

d. What is the total tag storage of a 64-entry 4-way set-associative TLB with 1-PTE blocks?

R372 Simulator Programming

For this assignment, you will again modify the R372 simulator. This time you will add a module that simulates caches. Make a copy of the directory ~amir/cse371/r372/src/ to a private directory.

This simulator has some new arguments. The optional argument [-i<nsets>:<nways>:<bsize>] simulates an instruction cache with <nsets> sets, an associativity of <nways> and a block size of <bsize> 16-bit words. The optional argument [-d<nsets>:<nways>:<bsize>] simulates a data cache with similar parameters. The optional argument [-tm<tmiss>] sets the simulated miss latency to <tmiss>.

The cache interface is defined in the file cache.h and consists of three functions. The file cache.c provides a partial implementation of this interface, which it is your job to complete. cache_create creates a cache structure given the three parameters above. cache_lookup takes a cache and an address. It returns 1 if the access results in a miss and 0 if it results in a hit. It also updates the cache contents to reflect the most recent access. cache_stats prints out some statistics. Your job is to modify cache.c to simulate a cache with an LRU replacement policy. Do not worry about distinguishing reads from writes.

HINT: Notice that the function cache_lookup does not contain a parameter for actually getting any data out of the cache. That's because you don't really need to model the data in the cache. There is a useful trick for simulating caches and it is to only simulate the tags, which are needed to compute hits and misses. Data resides in "main memory" only. The interface given assumes you are using this trick, which actually makes modeling a cache quite a bit simpler.

Turn in the cache.c file electronically. Also, use your implementation to answer the following questions using the btree benchmark in ~amir/cse371/r372/mbench/btree/. To run this program with this input, make a private copy of that directory and use the command <my_simulator> btree -1 [-other_arguments].

8. (15 points) a. How many misses does a direct-mapped, 32-word data cache with 2-word blocks generate? What is the new number of misses if you halve the number of sets? What is the new number of misses if you double them? b. What configuration of a cache with a total capacity of 32 words generates the fewest misses?

9. (20 points) Recall the 3C miss classification definition. As it turns out, you can measure individual kinds of misses for a given cache configuration. The way you do this is by measuring the misses of two other carefully-chosen configurations and subtracting. Design experiments to measure the compulsory, capacity, and conflict misses in a direct-mapped, 32-word data cache with 2-word blocks.