Due: Thursday, Apr. 27 (by noon, in my mailbox or under my office door, no grace days!)
1. Disk Access Latency (15 points)
Consider a 1 GB 1 platter disk with 512 tracks and 512 sectors per track. The disk spins at 6,000 RPM. Ignore controller latency. Seeking to a random track takes 5ms, but seeking to the next track over takes 1ms.
a. How long would it take to read an 8 MB file if the file was made up of 4 KB pages that were arbitrarily distributed throughout the disk?
b. How long would it take to read the same 8 MB file if its pages were consecutively located on tracks, but if the tracks were randomly located on the platter?
c. How long would it take to read the file if the pages were consecutively located on tracks and the tracks were consecutively located on the disk?
d. Suppose that in addition to reducing seek time, you also wanted to reduce rotational delay. How would you arrange the file on the disk to do this?
2. Polling, Interrupts, and DMA (15 points)
A 1 GHz CPU is connected to a 1 Gb/s (Gigabit/second) NIC with an 8B interface. Network packets are 512 B in size. Packets arrive continuously but only 1 of every 16 packets is intended for this CPU (i.e., it matches the NIC's MAC address). On this system it takes 50 cycles for the CPU to check the status of the NIC (i.e., to poll or to service an interrupt) and 20 cycles to transfer 8B of data. There is no DMA.
a. What percentage of the time does the CPU spend transferring NIC data?
b. What percentage of the time would the CPU spend polling the NIC?
c. What percentage of the time would the CPU spend handling NIC interrupts?
Now assume there is a DMA controller with a 450 cycle initialization time, DMA can transfer entire packets at a time.
d. What percentage of the time would the CPU spend initializing the DMA and handling DMA interrupts?
3. P37X (0 points)
Remember the "worm" game from Homework 1. How many instructions does a game in which you don't touch the keyboard and the worm just crawls up to the top of the screen execute? Rewrite the worm code to use interrupts instead of polling. How many instructions does the same game take now?
4. ILP (20 points)
A common routine in linear algebra is SAXPY (Single-precision A X Plus Y), where A is a scalar and X and Y are vectors. SAXPY (and it's sister DAXPY) performs a linear combination of two vectors. Here is the source code:
for (i = 0; i < 100; i++)
Z[i] = A * X[i] + Y[i];
And here is the MIPS assembly code. Assume that initially $r1 is 0, $r2 is &X[0], $r3 is &Y[0], $r4 is &Z[0], and $f0 is A.
Loop:
l.s $f1, 0($r2)
mul.s $f2, $f1, $f0
l.s $f3, 0($r3)
add.s $f4, $f3, $f2
s.s $f4, 0($r4)
addi $r1, $r1, 1
addi $r2, $r2, 4
addi $r3, $r3, 4
addi $r4, $r4, 4
slti $r5, $r1, 100
bnez $r5, Loop
a. Assume a scalar 5-stage pipelined processor with full bypassing on which every instruction takes 1 cycle. Basically after every load there is one "delay slot". Schedule this code to minimize stalls on this pipeline. Ignoring the lateny to fill the pipeline, and assuming that all branches are correctly predicted how long would it take to execute this loop?
b. Now assume a 2-wide VLIW processor in which the first slot can execute any instruction but the second slot can only execute integer ALU instructions, i.e., not loads stores and branches. (By the way, except for the VLIW part, this is the way the original Penitum was). Schedule the code to execute on this processor. How long would the loop take to execute on a pipeline with similar latencies? Remember, in a VLIW processor instructions travel down the pipeline in lockstep, so if one stalls, effectively both do. Assume the VLIW has hardware stall logic so that it only needs nops if there is nothing to put in the second VLIW slot, it doesn't need nops to implement stalls.
c. Now assume a scalar 5-stage pipelined processor on which every integer instruction (including all loads and stores, FP loads and stores count as integer instructions) takes 1 cycle but every FP instruction takes 5 cycles. In other words, in addition to the load "delay slot", there are four "delay slots" after every FP operation (mul.s and add.s). Reschedule the code for this pipeline. How long would it take to execute the loop?
d. Finally, assume a 2-wide VLIW version of the processor with the same latencies as the one in part c. Schedule the code to execute on this processor. How long would the loop take to execute on a pipeline with similar latencies?
e. Given these four "experiments", what can you say about instruction level parallelism (ILP) in general and about VLIW in particular as a way to exploit ILP?
5. DLP (20 points)
a. Assume a MIPS processor that is augmented with a vector unit that has a natural vector length of 4. The vector unit has 8 vector registers, $v0-$v7, the VLR register $vlr , and the following vector instructions: l.s.v (load vector), s.s.v (store vector), mul.s.sv (multiply scalar by vector), and add.s.vv (add two vectors). Write vector-MIPS code to vectorize the SAXPY loop.
Assume each vector execution unit consists of just 1 scalar execution unit through which all vector elements have to be processed in a pipelined fashion. The scalar unit latency for the vector load/store units is 1 cycle (it can execute the first load or store in 1 cycle); the scalar unit latency for the vector FP units is 5 cycles (the first vector add or mul executes in 5 cycles). Assume vector instructions are not pipelined with and can't even be overlapped with one another but can be overlapped with scalar instructions. This means you cannot put any vector instruction into the delay slot of another vector instruction, but you can put a scalar instruction into this delay slot. What is the execution time of the loop?
b. Now assume that each vector execution unit consists of 2 scalar execution units such that each vector can be processed pairwise. Reschedule the code if needed. What is the new execution time?
c. Finally, assume that the vector operations are implemented using SSE2 extensions and that each vector operation can perform all 4 operations in parallel. Reschedule the code if needed. What is the new execution time?
6. Multithreading (15 points)
A database/web server is executing on a 1 GHz processor. The server runs each query as a separate thread. Queries are memory intensive and so each query has 20M cycles of active CPU time but also 30M cycles of memory (L2 miss) stall time. Because queries have a lot of stall time but also a high degree of parallelism, one way to improve query throughput is to use a multithreaded processor. While one thread is stalled waiting for memory, other threads can execute. However, there is a catch. Adding threads complicates some of the pipeline structures such that each additional thread adds 2M execution cycles to all threads. Adding threads also thrashes the L2 cache such that each additional thread adds 10M cycles worth of memory stall time to each active thread.
With support for only 1 thread, query latency is 50M cycles or 50ms, throughput is 20 queries/s, and processor utilization is 40% (the processor is active 40% of the time and idle while waiting for memory the other 60%).
a. What would query latency, query throughput, and processor utilization be if the processor supported 2 threads?
b. What would query latency, query throughput, and processor utilization be if the processor supported 4 threads?
c. What number of threads should the processor suport to maximize query throughput?
7. Cache Coherence (15 points)
In class, we discussed the simple MSI (modified/shared/invalid) coherence protocol. Real processors use more complex protocols that have additional states. One commonly used protocol is the MESI (modified/exclusive/shared/invalid) protocol which is sometimes also called the Illinois protocol. The meaning of a block in the E state is that the block is exclusively owned by the cache (i.e., no other caches have copies) but isn't actually dirty. The E state is kind of strange in that the processor transitions to it on a local Read from the Invalid state, in other words under the same conditions as it would transition to the S state. For now, let's assume that the processor decides which state (E or S) to transition to by flipping a coin (which is implemented in hardware).
a. Draw the protocol diagram for the MESI protocol.
b. What could the E state be used for? How could the processor decide (either deterministically or speculatively) to enter the E state as opposed to the S state?
To answer this question, it may help to think about this. In an MSI protocol, on a read miss (i.e., on a Read transition from the Invalid state), the processor actually has two options. It could transition to the S state. But it could also transition to the M state; this would perform poorly but would not be incorrect. The E state is something between the S state and the M state, which means it has some of the advantages and disadvantages of being in one and some of the advantages and disadvantages of being in the other. Think of what these advantages and disadvantages may be.
Another hint is to look at the protocol diagram from part a.