CSE 371/372 Spring 2004
Homework 5

Due: Wednesday, April 21 in the folder outside Cheryl Hickey's office (Levine 502) by 5PM.

Memory Hierarchy with Virtual Memory

For questions 1-4, use the following memory system parameters:

1. (5 points) How big would a single-level page table for program A be on this machine?

2. (10 points) Assuming that second-level tables are page-sized, how big would a two-level page table for program A be on this machine? Assume that partially used pages count as full pages.

3. (5 points) What is the total tag storage for the cache?

4. (5 points) What is the total tag storage for the TLB?

I/O

5. (5 points) Consider 2 different I/O systems: A support 1000 IOPS, B supports 500 IOPS. A single transaction requires 5 I/O operation, each of which requires 100,000 instructions. What is the maximum number of transactions per second (TPS) each system can sustain if coupled to a 75MIPS processor?

6. (10 points) Consider 10GB disk with 10 platters, 128 tracks per platter, and 512 sectors per track. The disk spins at 12,000 RPM. Controller latency is 1ms and average seeks time is 5 ms. What is the peak data rate of the disk? What is the effective data rate if data is read in 16KB pages? What is the effective data rate if data is read in 64KB pages?

Parallelism

7. (15 points) A common routine in linear algebra is SAXPY (Single-precision A X Plus Y), where A is a scalar and X and Y are vectors. SAXPY (and it's sister DAXPY) performs a linear combination of two vectors. This is the key operation is both Gaussian Elimination and Simplex-style optimization. Here is the source code for SAXPY:

   for (i = 0; i < 100; i++)
      Z[i] = A * X[i] + Y[i];

And here is the MIPS assembly code, assume that initially $r1 is 0, $r2 is &X[0], $r3 is &Y[0], $r4 is &Z[0], and $f0 is A:

Loop:
   slti $r5, $r1, 100
   beqz $r5, LoopExit
   l.s $f1, 0($r2)
   mul.s $f2, $f1, $f0
   l.s $f3, 0($r3)
   add.s $f4, $f3, $f2
   s.s $f4, 0($r4)
   addi $r1, $r1, 1
   addi $r2, $r2, 4
   addi $r3, $r3, 4
   addi $r4, $r4, 4
   j Loop
LoopExit:

Assume a scalar "one instruction at a time" MIPS processor on which every instruction takes 1 cycle, and all branches are correctly predicted. Excluding the latency to fill the pipeline, how long would the previous loop take to execute? Now, assume a MIPS processor that is augmented with a vector unit that has a natural vector length of 32. The vector unit has 8 vector registers, $v0-$v7 and the VLR register $vlr register the following instructions: l.s.v (load vector), s.s.v (store vector), mul.s.sv (multiply scalar by vector), and add.s.vv (add two vectors). Write vector-MIPS code to vectorize the SAXPY loop.

Assume each vector execution unit consists of just 1 scalar execution unit (through which all vector elements have to be processed serially) and that each vector operation therefore takes a number of cycles equal to $vlr. What is the execution time of the loop?

Now assume that each vector execution unit consists of 2 scalar execution units such that each vector can be processed pairwise and each vector operation takes $vlr/2. What is the new execution time?

8. (10 points) P+H problem 9.2

9. (10 points) Consider the standard matrix multiply routine:

   for (i = 0; i < 64; i++)
      for (j = 0; j < 64; j++)
         for (k = 0; k < 64; k++)
            C[i][j] += A[i][k] * B[k][j];

Suppose you wanted to parallelize this benchmark on a 16-node message-passing multiprocessor. Suppose also that this parallelization meant that you had to divide each of the three arrays A, B, and C among the 16 processors. One way to do this is to treat the processor grid as a 4-by-4 square and map the matrices onto that square. If each processor computes the values in the portion of the C matrix assigned to it, then how many multiplication operations does each processor have to perform? How much A and B data does each processor have to send to each of its neighbor processors?

R372 Programming

(25 points) The Pentium was one of the earlier "super-scalar" processors: it could execute two instructions per cycle. In this assignment, you will change the cache simulator to model a quasi-Pentium.

The Pentium has two pipelines. In any given cycle, it examines the next two instructions. If the first (i.e., "older") instruction is not stalled on any older in-flight instruction, it is sent down the first pipeline. Sending an instruction from the decode stage down the rest of the pipeline is called "issuing". Otherwise, both instructions are stalled, the second (i.e., "younger") instruction cannot pass the first.

If the first instruction is issued, the second instruction is examined. If it has no stalls either on any older instruction currently in any downstream pipeline stage or on the older instruction next to it, it issues to the second pipeline. Otherwise, it moves over to the first pipeline and is considered for issue again the following cycle, this time in a pairing with the instruction after it.

Start with the simulator in ~amir/cse371/r372/src5/. This simulator is somewhat simplified in that it assumes full bypassing and doesn't model caches. The way to model the additional "superscalar" behavior is by controlling the increment of the clock counter (counters[1]). If an instruction can issue in parallel with the immediately older instruction, the counter is not incremented. You need to add logic to determine if this is the case.

The original simulator "executes" bubble -1 in 577 clock cycles and matrix -1 in 918 clock cycles. How many clock cycles does it take your quasi-Pentium simulator to "execute" these programs? Submit your simulator.c file to the TA.