Homework 3b - CIS501 Fall 2012

Instructor:	Prof. Milo Martin
Due:	Friday, October 19th at 6pm (If you use a "late period", you can turn it in up to Tuesday, 23rd at 6pm).
Instructions:	This is an individual work assignment. Sharing of answers or code is strictly prohibited. For the short answer questions, Show your work to document how you came up with your answers.
Note:	This is part of a two-part assignment (HW3a and HW3b) due on the same day. Create a separate PDF file for each part and turn them in electronically via Canvas.

Part B

Performance and CPI. Assume a typical program has the following instruction type breakdown:
- 30% loads
- 15% stores
- 50% adds
- 4% multiplies
- 1% divides
Assume the current-generation processor has the following instruction latencies:
- loads: 2 cycles
- stores: 6 cycles
- adds: 1 cycles
- multiplies: 14 cycles
- divides: 50 cycles
If for the next-generation design you could pick one type of instruction to make twice as fast (half the latency), which instruction type would you pick? Why?
Averaging. Assume that a program executes one branch every 8 instructions for the first 1M instructions and a branch every 16 instructions for the next 2M instructions. What is the average number instructions per branch for the entire program?
Diminishing Returns (Amdahl's Law). A program has 20% divide instructions. All non-divide instructions take one cycle. All divide instructions take 20 cycles.
1. What is the CPI of this program on this processor?
1. What percent of time is spent just doing divides?
1. What would the speedup be if we sped up divide by 2x?
1. What would the speedup be if we sped up divide by 5x?
1. What would the speedup be if we sped up divide by 20x?
1. What would the speedup be if divide instructions were infinitely fast (zero cycles)?
Performance and ISA. Chip A executes the ARM ISA and has a 2.5Ghz clock frequency. Chip B executes the x86 and has a 3Ghz clock frequency. Assume that on average, programs execute 2.5 times as many ARM instructions than x86 instructions.
1. For Program P1, Chip A has a CPI of 2 and Chip B has a CPI of 3. Which chip is faster for P1? What is the speedup for Program P1?
1. For Program P2, Chip A has a CPI of 1 and Chip B has a CPI of 2. Which chip is faster for P2? What is the speedup for Program P2?
1. Assuming that Programs P1 and P2 are equally important workloads for the target market of this chip, which chip is "faster"? Calculate the average speedup.
ISA Modification and Performance Metrics. You are the architect of a new line of processors, and you have the choice to add new instructions to the ISA if you wish. You consider adding a fused multiply/add instruction to the ISA (which is helpful in many signal processing and image processing applications). The advantage of this new instruction is that it can reduce the overall number of instructions in the program by converting some pairs of instructions into a single instruction. The disadvantage is that its implementation requires extending the clock period by 10% and increases the CPI by 20%.
1. Calculate under what conditions would adding this instruction lead to an overall performance increase.
2. What qualitative impact would this modification have on the overall MIPS (millions of instructions per second) of the processor?
3. What does this say about using MIPS as a performance metric?
Caching. Consider the following progression of on-chip caches for three generations of chips (loosely modeled after the Alpha 21064, 21164, and 21264).
1. Assume the first generation chip had a direct-mapped 8KB single-cycle first-level data cache. Assume T_hit for this cache is one cycle, T_miss is 100 cycles (to accesses off-chip memory), and the miss rate is 20%. What is the average memory access latency?
2. The second generation chip used the same direct-mapped 8KB single-cycle first-level data cache, but added a 96KB second-level cache on the chip. Assume the second-level cache has a 10-cycle hit latency. If an access misses in the second-level cache, it takes an additional 100 cycles to fetch the block from the off-chip memory. The second-level cache has a global miss rate of 4% of memory operations (which corresponds to a local miss rate of 20%). What is the average memory access latency?
3. The third generation chip replaced the two-levels of on-chip caches with a single-level of cache: a set-associative 64KB cache. Assume this cache has a 5% miss rate and the same 100 cycle miss latency as above. Under what conditions does this new cache configuration have an average memory latency lower than the second generation configuration?
Cache Miss Rates Consider two different cache configurations for an 8-bit processor. Both caches have two 16-byte blocks (for a total capacity of 32 bytes), but one is direct-mapped and the other is two-way set associative and uses the LRU replacement algorithm. All caches begin empty.
1. For the direct-mapped cache, how many bits are in the tag, index, and offset?
1. For the set-associative cache, how many bits are in the tag, index, and offset?
1. Give a short sequence of addresses (4 or fewer) in which the set-associative cache has a better hit rate than the direct-mapped cache. (Give the addresses as an 8-digit binary number). Give the miss rate for both the direct mapped cache and the set-associative cache.
1. Give a short sequence of addresses (4 or fewer) in which the set-associative cache has a worse hit rate than the direct-mapped cache. (Give the addresses as an 8-digit binary number). Give the miss rate for both the direct mapped cache and the set-associative cache.

Addendum

[Oct 16th] Fixed the last question to ask when set-associative caches are both better and worse (a cut-and-paste typo resulted in a repeated question).