HW5 Solution Key 1. (6 points, 3 for each part) Consider 2 different transaction processing I/O systems: A support 500 IOPS, B supports 250 IOPS. A single transaction requires 5 I/O operation, each of which requires 200,000 instructions. What is the maximum number of transactions per second (TPS) each system can sustain if coupled to a 75MIPS processor? A single transaction requires 1 M instructions. A 75MIPS processor can therefore sustain 75 TPS. Now, what about the I/O systems? I/O system A supports 100 TPS, so here the processor is the weakest link and complete system A can sustain only 75 TPS. I/O system B supports 50 TPS, so here it is the bottleneck. Complete system B supports 50 TPS. 2. (9 points, 3 for each part) Consider a 5GB disk with 10 platters, 128 tracks per platter, and 512 sectors per track. The disk spins at 12,000 RPM. Controller latency is 1ms and average seek time is 5ms. What is the peak data rate of the disk? What is the effective data rate if data is read in 8KB pages? What is the effective data rate if data is read in 64KB pages? There is 512MB per platter and 4MB per track. The disk spins at 12,000 RPM which is 200 RPS. The peak data rate is 4MB * 200 RPS = 800MB/s. It takes the disk 5ms (1/200 RPS) to complete a single rotation, so the rotational latency is 2.5ms (5ms/2). Combined with seek time and controller delay this is 8.5ms. The transfer latency for 8KB pages is trivial compared to this number: 5ms * (8KB/4MB) ~ 0.01ms. Transfer time just doesn't matter in this equation. Since the total access time for the disk is 8.5ms per page, the disk can do 118 page transfers per second, or 940 KB/s. If we increase the page size to 64KB, the transfer latency would still be tiny and total access latency would still be dominated by a combination of controller, seek, and rotation time, so it would still take ~8.5ms per page, or 118 pages per second. However, since each page is bigger, the data rate is 7.5 MB/s. 3. (15 points, 2 for the first part, 5 for the code, and 4 for each of the last two calculations) a. The original loop executes 100 iterations of 12 insructions each and one iteration of 2 instructions. The total instruction count is 1202 and since each instruction takes a single cycle, that is also the total cycle count. If you ignored the last iteration and put down 1200, that's OK too. b. Here is the vector-MIPS code. addi $vlr, $r0, 4 addi $r1, $r1, 4 Loop: slti $r5, $r1, 100 bnez $r5, LoopExit l.s.v $v1, 0($r2) mul.s.sv $v2, $v1, $f0 l.s.v $v3, 0($r3) add.s.vv $v4, $v3, $v2 s.s.v $v4, 0($r4) addi $vlr, $r0, 16 addi $r1, $r1, 16 addi $r2, $r2, 64 addi $r3, $r3, 64 addi $r4, $r4, 64 j Loop LoopExit: How many instructions does this code execute? Well, 2 in the prologue, then the loop body which is 13 instructions long executes 7 times in its entirety, and then the first two instructions in the loop body execute once more. That's a total of 95 instructions. Now 30 of these instructions are full-length vector instructions that take 16 cycles to execute. An additional 5 instructions are partial length vector instructions that take 4 cycles each. The remaining 60 are single-cycle scalar instructions. So the new cycle count is: 60*1 + 30*16 + 5*4 = 60 + 560. c. Now assume that the vector operations each take $vlr/2 cycles rather than $vlr cycles. The new cycle count is: 60*1 + 30*8 + 5*2 = 310. 4. (20 points, 4 for each part) Suppose you wanted to execute the SAXPY loop from the previous problem on a 4-node message passing multiprocessor rather than a vector processor. a. Write the C code for the parallel version of SAXPY, including initialization, communication, and collection code using the send and send. /* Initialization */ if (my_id() == 0) { memcpy (my_X, &X[0], 25 /* number of elements */); memcpy (my_Y, &Y[0], 25); for (id = 1; id < 4; id++) { send (id, &X[id * 25], 25); sned (id, &Y[id * 25], 25); } } else { recv (0, &my_X, 25); recv (0, &my_Y, 25); } /* Body */ for (I = 0; I < 25; I++) my_Z[I] = A * my_X[I] + my_Y[I]; /* Collection */ if (my_id() == 0) { for (id = 1; id < 4; id++) recv (id, &Z[id * 25], 25); } else { send (0, my_Z, 25); } b. How many floating-point multiplications and additions does each processor have to perform? The full loop performs 200 FP operations. Divided over 4 processors, each processor performs 50. c. Assuming each single-precision FP number is 4B long, how many bytes of data does processor 0 have to send on initialization and receive on collection? During initialization, processor 0 sends two arrays of 25 elements to each of 3 processors. Each element is 4B, so the total is 2 * 25 * 3 * 4 = 600B. During collection, processor 0 receives one array of 25 elements from each of 3 processors, so the total is 25 * 3 * 4 = 300B. d. How many bytes does each processor have to send and receive during communication? There is no communication in SAXPY, so 0. e. Assume that floating-point additions and multiplications take 10 cycles and that all operations take 0 cycles. If each message takes 20 cycles plus 1 cycle per 4B transferred, what is the speedup obtained by parallelization? Let's start with the non-sequential version. It performs 200 FP operations at 10 cycles each, for a total of 2000 cycles. Now what about processor 0 of the parallel version. It performs 50 FP operations, for a total of 500 cycles of computation. During initialization, it sends 6 messages of 25 elements each, that's 6 * (20 + 25) = 270 cycles. There is no "communication" phase. During collection, it receives 3 messages of 25 elements each, that's 3 * (20 + 25) = 135. So total runtime for processor 0 is 500 + 270 + 135 = 905. Speedup is 2000 / 905 = 2.2. 5. (15 points) Consider a simplified version of the code sequence to do a money transfer from account A to account B. Suppose this code executes on two processors in a shared-memory multiprocessor at once. lw $r3, transAmt lw $r1, acctA sub $r1,$r3,$r1 sw $r1, acctA lw $r2, acctB add $r2,$r3,$r2 sw $r2, acctB a. Show a correct interleaving that would add 2*transAmt to acctB and subtract 2*transAmt from acctA. There are many answers here, but here is one in which there is an actual interleaving. Processor 0 Processor 1 lw $r3, transAmt lw $r3, transAmt lw $r1, acctA sub $r1,$r3,$r1 sw $r1, acctA lw $r1, acctA sub $r1,$r3,$r1 sw $r1, acctA lw $r2, acctB add $r2,$r3,$r2 sw $r2, acctB lw $r2, acctB add $r2,$r3,$r2 sw $r2, acctB b. Show a incorrect interleaving that would add 2*transAmt to acctB but subtract transAmt from acctA. Here is a similar interleaving, but which is wrong. Processor 0 Processor 1 lw $r3, transAmt lw $r3, transAmt lw $r1, acctA lw $r1, acctA sub $r1,$r3,$r1 sw $r1, acctA sub $r1,$r3,$r1 sw $r1, acctA lw $r2, acctB add $r2,$r3,$r2 sw $r2, acctB lw $r2, acctB add $r2,$r3,$r2 sw $r2, acctB c. Incorrect interleavings can be avoided using synchronization. Add "minimal" synchronization to this code. "Minimal" synchronization is synchronization that guarantees correct execution but puts as little code into individual critical sections as possible. The first interleaving actually shows the maximum level of interleaving allowed while still maintaining correct execution. To enforce this execution, we need the following synchronization. lw $r3, transAmt ACQUIRE(lockA) lw $r1, acctA sub $r1,$r3,$r1 sw $r1, acctA RELEASE(lockA) ACQUIRE(lockB) lw $r2, acctB add $r2,$r3,$r2 sw $r2, acctB RELEASE(lockB) Something like this ACQUIRE(lock) lw $r3, transAmt lw $r1, acctA sub $r1,$r3,$r1 sw $r1, acctA lw $r2, acctB add $r2,$r3,$r2 sw $r2, acctB RELEASE(lock) would be correct, but would not allow any interleaving at all. 6. (25 points, 10 points if you have a working simulator that actually reduces the cycle count when the -p is turned on, the remaining 15 points if you got the right cycle counts or within 5 cycles of them. Maybe there were some corner cases that you forgot, but that's it.) The answers I got for cycle counts here are 482 for btree and 351 for search.