Homework 1 - CIS501 Fall 2009

Instructor:	Prof. Milo Martin
Due:	Tuesday, Oct 6th (at the start of class)
Instructions:	Your solutions to this assignment must either be hand-written clearly and legibly or, preferably, typed. Show all your work to document how you came up with your answers.

[5 points] Cost and Yield I. Your company's current generation of chips has a die area of 100 mm² and sells for $25 per chip. It is manufactured using wafers of 300mm in diameter at a cost of $5000 per wafer. The manufacturing facility's average defect rate when producing these wafers is 1 defect per 500 mm² (0.002 defects per mm²). A mask set for this manufacturing process costs $2 million dollars. The "alpha" for this process technology, an estimate fo the fabrication complexity, is 2.0. For this question, assume wafer yield is 100% and ignore design cost, testing cost, and package cost. Hint: section 1.6 of your textbook contains equations you'll find useful to answer this question.
1. What is the yield for this die?
1. How many total die per wafer?
1. How many good die per wafer?
1. What is the unit cost of this chip?
1. How many chips must be sold to break even?
[8 points] Cost & Yield II. For the next chip generation, your company plans to offer both a dual-core and single-core chip. The new dual-core chip will sell for the price of the old single-core chip ($25); the new single-core chip will sell for half the price ($12.50). Based on the prices and demand, your company estimates it will sell the same number of dual-core and single-core chips (that is, each will account for half of sales).

These chips will be manufactured on the next-generation of process technology, doubling the transistor density (and halving the area of the single-core chip). Thus, the new chips are 100 mm² and 50 mm². The defect rate, mask cost, and wafer cost are unchanged from the previous question.

Let's consider two ways to manufacture these chips. The first option is to manufacture the two variants of the chip as a distinct design (each with its own mask set). The second option is to manufacture only dual-core chips (and thus require only a single mask set) and disable one of the cores to be sold as the single-core part.
1. What is the per-unit cost for the dual-core chip?
2. When manufactured as a distinct design, what is the per-unit cost for the single-core chip?
3. Create a graph of the "profit" on the y-axis versus number of chips sold on the x-axis. Plot one line for each of the two approaches described above. Note: the profit will be negative below the break-even point because of the fixed cost of mask sets. Also, recall that half the chips sold will be dual-core (the other half are single-core chips).
4. Based on the graph, what is the minimum number of chips sold to break even? Which of the two manufacturing techniques gives the lower break-even point?
5. Again based on the graph, for what range of total volume of chips should these two approaches be used? That is, for what volumes does each approach have the largest profit?

[10 points] Error Correction Codes. The lecture notes gave an example of a SECDED ECC code for protecting four bits (d₁, d₂, d₃, d₄) using four check bits (c₁, c₂, c₃, c₄). In this question, you'll mimic the behavior of a memory controller that reads the four data bits and four check bits given below. For each, you must determine if a detectable error has occurred or not. If such an error has, you must also give the corrected bits string or state the an uncorrectable (double-bit) error has occurred.

The code calculated when writing the bits:
- c₁ = d₁ ^ d₂ ^ d₄
- c₂ = d₁ ^ d₃ ^ d₄
- c₃ = d₂ ^ d₃ ^ d₄
- c₄ = c₁ ^ c₂ ^ d₁ ^ c₃ ^ d₂ ^ d₃ ^ d₄
When checking the bits, calculate c₁', c₂', c₃', and c₄':
- c₁' = c₁ ^ d₁ ^ d₂ ^ d₄
- c₂' = c₂ ^ d₁ ^ d₃ ^ d₄
- c₃' = c₃ ^ d₂ ^ d₃ ^ d₄
- c₄' = c₄ ^ c₁ ^ c₂ ^ d₁ ^ c₃ ^ d₂ ^ d₃ ^ d₄
Calculate the syndrome (S) by interpreting c₃' c₂' c₁' as a binary number. This gives us four cases:
- S is zero, c₄' is zero: no error
- S not zero, c₄' is not zero: single bit error
- S is zero, c₄' is not zero: single-bit error in c₄
- S not zero, c₄' is zero: two-bit error
In the case of a single-bit flip, the bit flipped is using S to index into the bits in the following order starting with one: c₁ c₂ d₁ c₃ d₂ d₃ d₄ c₄. For example, is S is 5, it was d₂ that flipped.

Note

Unlike in the lecture notes, the above formulas for the syndrome bits (c₃' c₂' c₁') already include the xor with the check bits read from the memory (so you don't need to xor it twice).

The bits patterns below are given in the form: d₁ d₂ d₃ d₄ c₁ c₂ c₃ c₄
1. 0001 0011
1. 1001 0011
1. 1001 1011
1. 1001 0010
1. 1101 1011

[12 points] Wire Delay. Wire delay is an increasing consideration when designing microprocessors. The delay introduced by a long wire to the first order is quadratic in its length, but adding inverters at regular intervals to act as repeaters can reduce the wire delay to be linear with length (distance). This question explores how and why this works.

Definitions:

L: length of the wire

R_t: Resistance of a transistor (channel) - 107 Ohm

C_t: Capacitance of the transistor (gate) - 143 femto Farad

R_w: Resistance per unit length of the wire - 0.16 Ohm per micrometre

C_w: Capacitance per unit length of the wire - 0.21 femto Farad per micrometre

L:	length of the wire
R_t:	Resistance of a transistor (channel) - 107 Ohm
C_t:	Capacitance of the transistor (gate) - 143 femto Farad
R_w:	Resistance per unit length of the wire - 0.16 Ohm per micrometre
C_w:	Capacitance per unit length of the wire - 0.21 femto Farad per micrometre

Consider two inverters connected by a wire. The delay of the circuit is proportional to its resistance (R) multiplied by its capacitance (C). If the wire is so short it can effectively be ignored, the delay is based only the resistance and capacitance of the transistors:

RC = R_t * C_t

Assuming the resistance and capacitance of the transistor is 107 Ohm and 143 femto Farad respectively, the delay is 15,301 femto-seconds which is 15.3 pico-seconds or 0.0153 nano-seconds.

However, when the wire is long enough, it begins to contribute to both capacitance and resistance of the circuit:

RC = ((L * R_w) + R_t) ((L * C_w) + C_t)

The resistance of the wire is 0.16 Ohm per micrometre and the capacitance of the wire is 0.21 femto Farad per micrometre.

Calculate the total switching delay of two inverters connected by a wire of length 6000 micrometers (6 mm or 0.6 cm).

If the wire is divided into two equal sized segments with an inverter between them, what is the total delay of the circuit? (Hint: calculate the delay of each segment and then add them together.)

Give a formula for calculating the delay of a wire of length L divided into N equal sized segments. You may keep the constants as symbols or replace them with the values specified above.

Plot a line graph of the total delay (y-axis) for 1 to 20 segments (x-axis).

Based on the above graph, (1) what is the number of segments that minimizes total delay and (2) what is the total delay with the optimal number of segments? Note: you can get this data directly from the graph.

Derive a formula to calculate the number of segments (N) that minimizes delay as a function of the length of the wire. As before, you may wish to multiply out the constants to reduce the number of symbols. Hint: The derivative of (1/N) with respect to N is (-1/N²).

Using the above formulas, derive a formula for the overall delay of a wire of length L that has repeaters placed optimally. This formula should be a function of L but not N (that is, the number of segments should not appear explicitly in the formula). As before, you may wish to multiply out the constants to reduce the number of symbols.

If the processor has a 1.84 Ghz clock frequency (0.544 nanosecond clock period, which is 544 picoseconds), assuming optimally inserted repeaters, approximately how far can a signal reach in a single cycle?

If the die are of the chip is 200 mm², approximately what percentage of the chip can be reached in a single cycle?

[8 points] Performance II. Computer A executes the MIPS ISA and has a 2.5Ghz clock frequency. Computer B executes the x86 and has a 3Ghz clock frequency. On average, programs execute 1.5 times as many MIPS instructions than x86 instructions.
1. For Program P1, Computer A has a CPI of 2 and Computer B has a CPI of 3. Which computer is faster for P1? What is the speedup?
1. For Program P2, Computer A has a CPI of 1 and Computer B has a CPI of 2. Which computer is faster for P2? What is the speedup?
[6 points] Performance I. Assume a typical program has the following instruction type breakdown:
- 30% loads
- 10% stores
- 50% adds
- 8% multiplies
- 2% divides
Assume the current-generation processor has the following instruction latencies:
- loads: 4 cycles
- stores: 4 cycles
- adds: 2 cycles
- multiplies: 16 cycles
- divides: 50 cycles
If for the next-generation design you could pick one type of instruction to make twice as fast (half the latency), which instruction type would you pick? Why?
[7 points] Amdahl's Law. A program has 10% divide instructions. All non-divide instructions take one cycle. All divide instructions take 50 cycles.
1. What is the CPI of this program on this processor?
1. What percent of time is spent just doing divides?
1. What would the speedup be if we sped up divide by 2x?
1. What would the speedup be if we sped up divide by 5x?
1. What would the speedup be if we sped up divide by 10x?
1. What would the speedup be if we sped up divide by 50x?
1. What would the speedup be if divide instructions were infinitely fast (zero cycles)?
[20 points] Effectiveness of Compiler Optimizations. The compiler's goal is to generate the fastest code for the given input program. The programmer specifies the "optimization" level the compiler performs. To ask the compiler to perform no compiler optimizations, the "-O0" flag can be used (that is "dash oh zero"). This is useful when debugging. It is also the fastest compilation and the least likely to encounter bugs in the compiler. The flag "-O3" (dash oh three) enables most optimizations. Finally, the flag "-funroll-loops" enables yet another compiler optimization.

This question asks you to explore the impact of compiler optimizations using the following simple vector-vector addition:
```
void vector_add(int* a, int* b, int* c)
{
  for (int i = 0; i < 10000; i++) {
    c[i] = a[i] + b[i];
  }
}
```
You can also get this code from the file code.c. Log into minus.seas.upenn.edu via SSH, and compile the code with three different levels of optimization:
```
gcc -O0 -m32 -Wall -std=c99 -masm=intel -fomit-frame-pointer -S code.c -o code-O0.s
gcc -O3 -m32 -Wall -std=c99 -masm=intel -fomit-frame-pointer -S code.c -o code-O3.s
gcc -funroll-loops -O3 -m32 -Wall -std=c99 -masm=intel -fomit-frame-pointer -S code.c -o code-unrolled.s
```
This will create three .s files in the directory (code-O0.s, code-O3.s, and code-unrolled.s). These files are the x86 assembly of this function. You can ignore everything except the code between the label vector_add and the return statement (ret).

Note

You don't have to understand much x86 code to answer the following questions, but you may find seeing actual x86 code interesting. For example, instead of using "load" and "store" instructions, in x86 these instructions are both "mov" (move) instructions. An example load instruction is mov %eax, DWORD PTR [%esp+12]. An example store instruction is mov DWORD PTR [%ecx], %eax. The "DWORD PTR" just indicates that a 32-bit value is being loaded or stored. FYI, an instruction like add %ecx, DWORD PTR [%esp+28] performs both a memory access and an add (an example of a CISC-y x86 instruction). The x86 ISA also has some more advanced memory addressing modes. For example, mov %eax, DWORD PTR [%ebx-4+%edx*4] is using the scaled addressing mode. One more note: x86 uses condition codes, so jump (branch) instructions don't have explicit register inputs. If you want to lookup an x86 instruction, you may find the Intel x86 reference manual useful: http://developer.intel.com/design/pentium/manuals/24319101.pdf

For each of these files, answer the following questions:
1. What is the static instruction count (the number of instructions in the source code listing)?
1. Approximately what is the dynamic instruction count of this function (to the nearest thousand instructions)?
1. Assuming a processor with a CPI of 1, how much "faster than" is the -O3 than -O0? How much is unrolled faster than -O3?
1. When comparing the last two optimized .s files, how does the number of static instructions relate to the dynamic instruction count? Give a brief explanation.
1. Using what you learned from this data, what is one advantage and one disadvantage of loop unrolling?
Next, run the three versions. To do this, link each source file with driver.c and run it. The driver calls the vector addition function 100,000 times and reports the runtime in seconds:
```
gcc -m32 code-O0.s driver.c -o code-O0
gcc -m32 code-O3.s driver.c -o code-O3
gcc -m32 code-unrolled.s driver.c -o code-unrolled

./code-O0
./code-O3
./code-unrolled
```
1. Give the runtime of each of the three versions. How much faster is -O3 that -O0? How much faster is unrolled over -O3? How does this compare to your estimate above based only on instruction counts?
1. The clock frequency of the processors in "minus.seas.upenn.edu" is 3.0 Ghz. Based on your estimate of the number of instructions in each call to vector_add and that driver.c calls vector_add 100000 times, calculate the CPI of the processor running each of these programs.
1. What do the above CPI numbers tell you about these compiler optimizations?

Addendum

None, yet.