| Instructor: | Prof. Milo Martin |
|---|---|
| Due: | Thursday, Oct 9th (at the start of class) |
| Instructions: | Your solutions to this assignment must either be hand-written clearly and legibly or, preferably, typed. Show all your work, documenting how you came up with your answers. |
[5 points] Cost and Yield I. Your company's current generation of chips has a die area of 100 mm² and sells for $25 per chip. It is manufactured using wafers 300 mm in diameter at a cost of $5000 per wafer. The manufacturing facility's average defect rate when producing these wafers is 1 defect per 500 mm² (0.002 defects per mm²). A mask set for this manufacturing process costs $2 million. The "alpha" for this process technology, an estimate of the fabrication complexity, is 2.0. For this question, assume wafer yield is 100% and ignore design cost, testing cost, and package cost. Hint: section 1.6 of your textbook contains equations you'll find useful to answer this question.
[8 points] Cost & Yield II. For the next chip generation, your company plans to offer both a dual-core and single-core chip. The new dual-core chip will sell for the price of the old single-core chip ($25); the new single-core chip will sell for half the price ($12.50). Based on the prices and demand, your company estimates it will sell the same number of dual-core and single-core chips (that is, each will account for half of sales).
These chips will be manufactured on the next generation of process technology, doubling the transistor density (and halving the area of the single-core chip). Thus, the new chips are 100 mm² and 50 mm². The defect rate, mask cost, and wafer cost are unchanged from the previous question.
Let's consider two ways to manufacture these chips. The first option is to manufacture the two variants of the chip as two distinct designs (each with its own mask set). The second option is to manufacture only dual-core chips (and thus require only a single mask set) and disable one of the cores to sell the result as the single-core part.
What is the per-unit cost for the dual-core chip?
When manufactured as a distinct design, what is the per-unit cost for the single-core chip?
Create a graph of the "profit" on the y-axis versus number of chips sold on the x-axis. Plot one line for each of the two approaches described above. Note, the profit will be negative below the break-even point because of the fixed cost of mask sets. Also, recall that half the chips sold will be dual-core (the other half are single-core chips).
Based on the graph, what is the minimum number of chips sold to break even? Which of the two manufacturing techniques gives the lower break-even point?
Again based on the graph, for what range of total volume of chips should these two approaches be used? That is, for what volumes does each approach have the largest profit?
[10 points] Error Correction Codes. The lecture notes gave an example of a SECDED ECC code for protecting four bits (d1, d2, d3, d4) using four check bits (c1, c2, c3, c4). In this question, you'll mimic the behavior of a memory controller that reads the four data bits and four check bits given below. For each, you must determine whether a detectable error has occurred. If such an error has occurred, you must also give the corrected bit string or state that an uncorrectable (double-bit) error has occurred.
The code calculated when writing the bits:
When checking the bits, calculate c1', c2', c3', and c4':
Calculate the number S by interpreting c3' c2' c1' as a binary number. Together with the overall parity check c4', this gives us four cases:

- S = 0 and c4' = 0: no error.
- S ≠ 0 and c4' = 1: a single-bit error at position S (correctable).
- S = 0 and c4' = 1: the overall parity bit c4 itself flipped (correctable).
- S ≠ 0 and c4' = 0: an uncorrectable double-bit error.
In the case of a single-bit flip, the flipped bit is found by using S to index into the bits in the following order, starting with one: c1 c2 d1 c3 d2 d3 d4 c4. For example, if S is 5, it was d2 that flipped.
Important
The lecture notes are wrong (or at best incomplete) as to how to interpret which bit has been flipped. The lecture notes said to use c1' c2' c3' to calculate S. You actually first have to reverse the bits. The examples in the slides are still correct, but only because the values of S in those examples are symmetric.
The bit patterns below are given in the form: d1 d2 d3 d4 c1 c2 c3 c4
[12 points] Wire Delay. Wire delay is an increasing consideration when designing microprocessors. The delay introduced by a long wire to the first order is quadratic in its length, but adding inverters at regular intervals to act as repeaters can reduce the wire delay to be linear with length (distance). This question explores how and why this works.
Definitions:
L: length of the wire
Rt: Resistance of a transistor (channel) - 107 Ohm
Ct: Capacitance of the transistor (gate) - 143 femto Farad
Rw: Resistance per unit length of the wire - 0.16 Ohm per micrometre
Cw: Capacitance per unit length of the wire - 0.21 femto Farad per micrometre
Consider two inverters connected by a wire. The delay of the circuit is proportional to its resistance (R) multiplied by its capacitance (C). If the wire is so short it can effectively be ignored, the delay is:
RC = Rt * Ct
Assuming the resistance and capacitance of the transistor are 107 Ohm and 143 femto Farad respectively, the delay is 15301 femto-seconds, which is 15.3 pico-seconds or 0.0153 nano-seconds.
However, when the wire is long enough, it begins to contribute to both capacitance and resistance of the circuit:
RC = ((L * Rw) + Rt) * ((L * Cw) + Ct)
The resistance of the wire is 0.16 Ohm per micrometre and the capacitance of the wire is 0.21 femto Farad per micrometre.
[8 points] Performance. Computer A executes the MIPS ISA and has a 2.5 GHz clock frequency. Computer B executes the x86 ISA and has a 3 GHz clock frequency. On average, programs execute 1.5 times as many MIPS instructions as x86 instructions.
[7 points] Amdahl's Law. A program has 10% divide instructions. All non-divide instructions take one cycle. All divide instructions take 50 cycles.
[20 points] Effectiveness of Compiler Optimizations. The compiler's goal is to generate the fastest code for the given input program. The programmer specifies the "optimization" level the compiler performs. To ask the compiler to perform no optimizations, the "-O0" flag can be used (that is, "dash oh zero"). This is useful when debugging; it is also the fastest compilation setting and the least likely to trigger bugs in the compiler. The flag "-O3" (dash oh three) enables most optimizations. Finally, the flag "-funroll-loops" enables yet another compiler optimization.
This question asks you to explore the impact of compiler optimizations using the following simple vector-vector addition:
```c
void vector_add(int* a, int* b, int* c) {
  for (int i = 0; i < 1000; i++) {
    c[i] = a[i] + b[i];
  }
}
```
You can also get this code from the file code.c. Log into halfdome.cis.upenn.edu via SSH, and compile the code with three different levels of optimization:
```
gcc -O0 -m32 -Wall -std=c99 -masm=intel -fomit-frame-pointer -S code.c -o code-O0.s
gcc -O3 -m32 -Wall -std=c99 -masm=intel -fomit-frame-pointer -S code.c -o code-O3.s
gcc -funroll-loops -O3 -m32 -Wall -std=c99 -masm=intel -fomit-frame-pointer -S code.c -o code-unrolled.s
```
This will create three .s files in the directory (code-O0.s, code-O3.s, and code-unrolled.s). These files are the x86 assembly of this function. You can ignore everything except the code between the label vector_add and the return statement (ret).
Note
You don't have to understand much x86 code to answer the following questions, but you may find seeing actual x86 code interesting. For example, instead of using "load" and "store" instructions, in x86 these instructions are both "mov" (move) instructions. An example load instruction is mov %eax, DWORD PTR [%esp+12]. An example store instruction is mov DWORD PTR [%ecx], %eax. The "DWORD PTR" just indicates that a 32-bit value is being loaded or stored. FYI, an instruction like add %ecx, DWORD PTR [%esp+28] performs both a memory access and an add (an example of a CISC-y x86 instruction). The x86 ISA also has some more advanced memory addressing modes. For example, mov %eax, DWORD PTR [%ebx-4+%edx*4] uses the scaled addressing mode. One more note: x86 uses condition codes, so jump (branch) instructions don't have explicit register inputs. If you want to look up an x86 instruction, you may find the Intel x86 reference manual useful: http://developer.intel.com/design/pentium/manuals/24319101.pdf
For each of these files, answer the following questions:
Next, run the three versions. To do this, link each source file with driver.c and run it. The driver calls the vector addition function 100,000 times and reports the runtime in seconds:
```
gcc -m32 code-O0.s driver.c -o code-O0
gcc -m32 code-O3.s driver.c -o code-O3
gcc -m32 code-unrolled.s driver.c -o code-unrolled
./code-O0
./code-O3
./code-unrolled
```