CSE 371/372 Spring 2004
Homework 5

Due: Tuesday, April 26 under my door by 5PM.

I/O

1. (6 points) Consider 2 different transaction processing I/O systems: A support 500 IOPS, B supports 250 IOPS. A single transaction requires 5 I/O operation, each of which requires 200,000 instructions. What is the maximum number of transactions per second (TPS) each system can sustain if coupled to a 75MIPS processor?

2. (9 points) Consider a 5GB disk with 10 platters, 128 tracks per platter, and 512 sectors per track. The disk spins at 12,000 RPM. Controller latency is 1ms and average seek time is 5ms. What is the peak data rate of the disk? What is the effective data rate if data is read in 8KB pages? What is the effective data rate if data is read in 64KB pages?

Parallelism

3. (15 points) A common routine in linear algebra is SAXPY (Single-precision A X Plus Y), where A is a scalar and X and Y are vectors. SAXPY (and it's sister DAXPY) performs a linear combination of two vectors. This is the key operation is both Gaussian Elimination and Simplex-style optimization. Here is the source code for SAXPY:

for (i = 0; i < 100; i++)
Z[i] = A * X[i] + Y[i];

And here is the MIPS assembly code, assume that initially \$r1 is 0, \$r2 is &X[0], \$r3 is &Y[0], \$r4 is &Z[0], and \$f0 is A:

Loop:
slti \$r5, \$r1, 100
beqz \$r5, LoopExit
l.s \$f1, 0(\$r2)
mul.s \$f2, \$f1, \$f0
l.s \$f3, 0(\$r3)
add.s \$f4, \$f3, \$f2
s.s \$f4, 0(\$r4)
addi \$r1, \$r1, 1
addi \$r2, \$r2, 4
addi \$r3, \$r3, 4
addi \$r4, \$r4, 4
j Loop
LoopExit:

a. Assume a scalar "one instruction at a time" MIPS processor on which every instruction takes 1 cycle, and all branches are correctly predicted. Excluding the latency to fill the pipeline, how long would the previous loop take to execute?

b. Now, assume a MIPS processor that is augmented with a vector unit that has a natural vector length of 16. The vector unit has 8 vector registers, \$v0-\$v7, the VLR register \$vlr , and the following vector instructions: l.s.v (load vector), s.s.v (store vector), mul.s.sv (multiply scalar by vector), and add.s.vv (add two vectors). Write vector-MIPS code to vectorize the SAXPY loop.

c. Assume each vector execution unit consists of just 1 scalar execution unit (through which all vector elements have to be processed serially) and that each vector operation therefore takes a number of cycles equal to \$vlr. What is the execution time of the loop?

d. Now assume that each vector execution unit consists of 2 scalar execution units such that each vector can be processed pairwise and each vector operation takes \$vlr/2. What is the new execution time?

4. (20 points) Suppose you wanted to execute the SAXPY loop from the previous problem on a 4-node message passing multiprocessor rather than a vector processor.

a. Write the C code for the parallel version of SAXPY, including initialization, communication, and collection code using the send and send.

b. How many floating-point multiplications and additions does each processor have to perform?

c. Assuming each single-precision FP number is 4B long, how many bytes of data does processor 0 have to send on initialization and receive on collection?

d. How many bytes does each processor have to send and receive during communication?

e. Assume that floating-point additions and multiplications take 10 cycles and that all operations take 0 cycles. If each message takes 20 cycles plus 1 cycle per 4B transferred, what is the speedup obtained by parallelization?

5. (15 points) Consider a simplified version of the code sequence to do a money transfer from account A to account B. Suppose this code executes on two processors in a shared-memory multiprocessor at once.

lw \$r3, transAmt
lw \$r1, acctA
sub \$r1,\$r3,\$r1
sw \$r1, acctA
lw \$r2, acctB
sw \$r2, acctB

a. Show a correct interleaving that would add 2*transAmt to acctB and subtract 2*transAmt from acctA.

b. Show a incorrect interleaving that would add 2*transAmt to acctB but subtract transAmt from acctA.

c. Incorrect interleavings can be avoided using synchronization. Add "minimal" synchronization to this code. "Minimal" synchronization is synchronization that guarantees correct execution but puts as little code into individual critical sections as possible.

R372 Programming

(25 points) The Pentium was one of the earlier "super-scalar" processors: it could execute two instructions per cycle. In this assignment, you will change the cache simulator to model a quasi-Pentium.

The Pentium has two pipelines. In any given cycle, it examines the next two instructions. If the first (i.e., "older") instruction is not stalled on any older in-flight instruction, it is sent down the first pipeline. Sending an instruction from the decode stage down the rest of the pipeline is called "issuing". Otherwise, both instructions are stalled, the second (i.e., "younger") instruction cannot pass the first.

If the first instruction is issued, the second instruction is examined. If it has no stalls either on any older instruction currently in any downstream pipeline stage or on the older instruction next to it, it issues to the second pipeline. Otherwise, it moves over to the first pipeline and is considered for issue again the following cycle, this time in a pairing with the instruction after it.

Start with the simulator in ~amir/cse371/r372/src/. This simulator is somewhat simplified in that it assumes full bypassing and doesn't model caches. The way to model the additional "superscalar" behavior is by controlling the increment of the clock counter (counters[1]). If an instruction can issue in parallel with the immediately older instruction, the counter is not incremented. You need to add logic to determine if this is the case.

The original simulator executes btree -1 (btree with first input) in 659 clock cycles and search in 463 clock cycles. How many clock cycles does it take your quasi-Pentium simulator to "execute" these programs? Submit your simulator.c file.