Due: Tuesday, Apr. 4
1. Floating Point (10 points)
Describe (and draw) a circuit that would convert a 32-bit signed integer to a single-precision IEEE floating point number. This circuit can be combinational or sequential.
2. Cache Access Times (10 points)
Assume an access stream composed of 75% loads and 25% stores, a data cache with a hit time of 2 cycles, and miss rate of 5% and a lower-level memory hierarchy with an average access time of 20 cycles.
a. What is the average cache access time in cycles if the cache is write-back/write-allocate with no write buffer and 40% of all blocks are dirty?
b. What is the average access time if we add a write buffer?
c. What is the average access time if we assume that the cache is write-thru/write-non-allocate with no write buffer?
d. What is the average access time if we add a write buffer?
3. Simulating Miss Rates (20 points)
For this assignment, you will again modify PennSim. This time you will add a module that simulates caches. Here is the new PennSim.jar file. This simulator has a new optional argument. The optional argument [-cache <nsets>:<nways>:<blocksize>:<prefetchdist>] simulates a unified cache (instructions and data in the same cache) with <nsets> sets, an associativity of <nways>, a block size of <blocksize> 16-bit words, and next block prefetching for the <prefetchdist> blocks ahead. Remember, P37X has a single 16-bit data type and memory is addressible at 16-bit word granularity. P37X addresses correspond to individual words, not bytes.
The cache interface is defined in the file Cache.java. This file also provides a partial implementation of this interface, which is your job to complete. Your job is to complete this implementation by writing the constructor and the functions access and prefetch. You may use any additional private fields and structures to do this. A few notes about the implementations of these functions. The most important function is the access function. It is called on every instruction fetch, load, store with the reference address and an argument that tells you which action is taking place. This function should track cache state based on these accesses, and should also increment the nAccesses and nMisses arrays appropriately. For example, nAccesses[CC_IFETCH] should be incremented on every instruction fetch and nMisses[CC_LOAD] should be incremented on every load miss. This function should implement LRU (not NMRU) replacement. You can do this by implementing each set as a list of blocks and moving blocks to the front of that list as they are accessed. Look at the Java List (and ArrayList or LinkedList) class to see how to do this.
On a miss, the access function should call the prefetch function <prefetchdist> times to prefetch the next <prefetchdist> blocks. The prefetch function is similar to the access function except that it does not update LRU (i.e., it prefetches into the least recently used block in each set and leaves that block at the end of the list) and it should not update statistics.
Since you are simulating only a single level of cache, you do not need to worry about write-thru, write-back, write-buffers, dirty-bits or any other such things. Also, notice that the functions access and prefetch do not contain parameters for actually getting any data into or out of the cache. That's because you don't really need to model the data in the cache. There is a useful trick for simulating caches and it is to only simulate the tags, which are needed to compute hits and misses. Actual data resides in "main memory" only. The interface given assumes you are using this trick, which actually makes modeling a cache quite a bit simpler.
Once you have modified the Java source file, you can compile it (javac -deprecation -classpath PennSim.jar Cache.java) and replace the Cache.class file in the jarchive with your own copy (jar uf PennSim.jar Cache.class). As part of this assignment, you are to turn in the Cache.java file electronically. Use the program pcalc (here are the pcalcos.asm and pcalc.asm and a pcalc.script) to compute the instruction and data miss counts and miss rates for a 2-way set-associative 64-word cache with 4-word blocks. Why may either instruction or data miss rate be higher than the other?
4. Optimal Block Size (10 points)
For a 64-word 2-way set-associative cache, what is the optimal block size for instructions (i.e., what block size minimizes instruction misses)? What is the optimal block size for data? If these two block sizes are different, why might this be the case?
5. 3C (10 points)
Recall the 3C miss classification definition. As it turns out, you can measure individual kinds of misses for a given cache configuration. The way you do this is by measuring the misses of two other carefully-chosen configurations and subtracting. Design experiments to measure the compulsory, capacity, and conflict misses for both instructions and data for the cache configuration from question 3. What are the compulsory, capacity, and conflict instruction and data miss counts (and miss rates)? What experiments did you use to calculate these?
6. Prefetching (10 points)
For the 2-way set-associative 64-word cache with 4-word blocks, what are the instruction and data miss counts and miss rates when you use "next block" and "next two blocks" prefetching? Is prefetching more effective for instructions or for data? Why? Is more prefetching always better? How many of each kind of miss (compulsory, capacity, or conflict) does "next block" prefetching eliminate?
7. Next Block Prefetching vs. Doubling Block Size (10 points)
Which is more effective: next block prefetching or doubling the block size? What is the difference between these two schemes which basically go after the same thing (spatial locality)? Can you construct an example of an access sequence that prefetching handles effectively, but doubling the block size does not? What about the other way around?
8. Memory System Design (10 points)
Consider the following memory system parameters:
a. Ignoring the latency of the address bus, what would be the cache miss penalty (in ns) if you used a single DRAM chip?
b. What is the smallest number of DRAM chips you could use that would minimize the cache miss penalty? What would this minimal penalty be?
9. Virtual Memory (10 points)
Consider the following memory system parameters:
a. What is the size of a single-level page table for program P?
b. Assuming that 2nd-level tables are page-sized, how much space would a two-level page table for program P occupy? Assume that partially used pages count as full pages.