Due: Tuesday, Jan. 31
1. Cost (4 points)
Your company invests $100M to design a new chip and plans to produce this chip in your existing fabrication plant. Your fabrication plant manufactures 500-chip wafers at a cost of $7500/wafer. Your die yield is 60%. Your packaging and testing cost is $25/die and your test yield is 50%. At a profit margin is 20% per chip sold, how many chips does your company have to sell before it starts turning a profit on this particular chip design?
2. Performance/Cost (4 points)
Computer A and B have the same ISA. Computer A can has a CPI of 1, a 500MHz clock and costs $200. Computer B has a CPI of 0.5 and a 1GHz clock and costs $1000.
a) What is the performance/cost (MIPS/$) of each computer?
b) Suppose you were running an internet search engine and wanted to be able to
handle 1000 queries per second where each query executed 1B (billion)
instructions. What is the minimum amount of money you could spend on a cluster
of machines that could handle this volume?
3. Performance (4 points)
Computer A executes the X86 ISA, has a CPI of 1, and a 3GHz clock. Computer B executes the MIPS ISA, has a CPI of 0.8, and a 4GHz clock. A given program compiled to the MIPS ISA will execute 1.5 times as many instructions as the same program compiled to X86. Which computer has the higher performance?
4. Performance: Hardware vs. Software (6 points)
Java has automatic "garbage collection" (garbage is a piece of dynamically allocated memory to which no referece exists) and Java programs spend 25% of their instructions collecting garbage. C++ has no garbage collection. On average, you run C++ programs half the time and Java programs the other half. Consider the following three processors. Processor A has no special support for garbage collection and a 1ns clock period. Processor B implements garbage collection in hardware and has a 1.2ns clock period. Processor C has ISA support for garbage collection that can cut the number of garbage collection instructions by 80% and has a 1.1ns clock period. All processors have the same CPI and the same base ISA (except for possible garbage collection extensions).
a) Which processor has the best performance for C++ programs?
b) Which processor has the best performance for Java programs?
c) Which processor has the best performance for your mixed Java/C++ workload?
5. Moore (6 points)
A program that executes for 100 cycles spends 20 of those cycles on multiplication, 50 on memory access, and 30 on other instructions. Engineer Matt6 comes up with a design that will speed up multiplication by a factor of 2 and will take 4 months to design and implement. Engineer Matt7 comes up with a design that will speed up memory operations by a factor of 2 and will take 8 months design and implement. You are engineering manager Matt8.
a) Would you incorporate Matt6's design into your proposal?
b) Would you incorporate Matt7's design into your proposal?
c) What would you do if Matt6 and Matt7 could implement their proposals independently?
6. Amdahl (6 points)
A program is composed of 75 integer operations and 25 FP operations. Each integer operation takes 1 cycle and each FP operation takes 10 cycles.
a) What would the speedup factor of the entire program be if you were to
"magically" speed up each FP operation by a factor of 2? 5? 10? 50?
What is the peak speedup factor you could achieve if you could only optimize
the FP operations? What is the "efficiency" of each of these speedups (the
"efficiency" is whole-program speedup divided by component speedup)
b) What FP speedup factor do you have to achieve before it starts making sense
optimizing the integer operations instead (given equivalent speed up factors)?
7. SPEC (4 points)
Go to www.spec.org.
a) Use their search feature to find which current CPU has the
best performance (speed, not throughput) on SPEC2000 integer and SPEC2000
floating-point (FP) programs. For each, write the manufacturer, the Processor
model, the clock speed, and the "result".
b) SPEC requires that you report speed for an entire benchmark suite as a single
number. As you will see, this number usually winds up being higher than 1000.
How does SPEC ask that you calculate this number?
8. C to MIPS (10 points)
for (i=0; i<100; i++)
c += a[i] * b[i];
a) Write the MIPS assembly for the C code above, use the direct translation of a
"for" loop. Assume that the addresses of arrays a and b are initially in registers $a0 and $a1, and
that the variable c is initially in register $t0.
b) How many instructions execute during the run of this program? How many loads?
How many branches and jumps? What is the execution time of this program if
integer-alu instructions take 1 cycle, loads take 2, and branches/jumps take 3?
c) Because this is an arithmetic for-loop and its bounds are constant (i.e., 0 and
99), the compiler can convert this for loop into a more efficient "do-while"
loop. Write the assembly code for this style loop.
d) How many instructions execute during the run of this program? How many
loads? How many branches and jumps? What is the execution time of this
program if integer-alu instructions take 1 cycle, loads take 2, and
branches/jumps take 3? What is the speedup over the "for" style
impelemtnation?
9. Storage Models (10 points)
d = a + b + c
c = a - d
b = d + a
Write the code above in each of four operand model architecture styles: accumulator, stack, load-store with 3 registers and load-store with 2 registers. For each architecture, compute the static code size and the data memory traffic assuming that opcodes and register specifiers are 4-bits, addresses and data values are 16-bits and all instructions have an integral number of bytes (i.e., a 1.5 byte instruction is really a 2 byte instruction). Which architecture is most efficient from a static code size standpoint? Which is more efficient from a data memory traffic standpoint? Is either of these results surprising? Why or why not?
10. Addressing modes (6 points)
Register-displacement addressing directly captures many reference idioms. A few of the more common ones are: local variables in a stack frame, fields in a structure or class, and global variables in a global data segment. But there are some idioms which it doesn't capture directly. That's what other addressing modes were invented for.
a) What is the scaled addressing mode used for? Recall, in scaled
addressing there are two registers and two immediates and the effective address
is regs[r1] + (regs[r2] * imm1) +
imm2. Write a C/Java statement that would use this addressing mode, and
then the corresponding pseudo-MIPS assembly code. Then write down actual MIPS
assembly that would synthesize scaled addressing from other instructions.
b) Suppose there was a "super-scaled" addressing mode which instead of two
registers and two immediates would use three registers and two immediates. In
super-scaled addressing, the effective address is regs[r1] + (regs[r2] * imm1) + (regs[r3] * imm2). What
could this mode be used for? Write a C/Java statement that would use this
addressing mode, and then the corresponding pseudo-MIPS assembly code. Then
write down actual MIPS assembly that would synthesize scaled addressing from
other instructions.
11. P37X Programming (40 points)
For this exercise, you will write a little game called "worm" using P37X assembly. Remember the classic game "snake" where you have to move a snake around the board, trying to eat all the apples you can while trying not to hit any of the walls or your own growing body? Worm is a simpler version of this game. There are no apples. And the worm has a fixed tail, so it doesn't actuall move, it just grows. The object is to keep the game going for as long as possible. The game logic is also very simple. The worm grows on its own, one pixel per frame, and the only control you have is the direction in which it grows. Pressing the 'A' key will reorient the worm 90 degrees to the left (i.e., counter-clockwise). Pressing the 'D' key will reorient the worm 90 degrees to the right (i.e., clockwise). There is no need for an explicit quit function. If you don't do anything, the worm will run into a wall and the game will end in about 2 seconds.
What do you need to write this game?
P37X
Well, the first thing you need is to learn P37X assembly. And
the learning curve should be gentle because after some thought (and after
trying to port the breakout operating system), I decided to keep P37X close to
LC3, to eliminate only the features that would be difficult to implement in
hardware, and to add a few things which might be useful. So indirect stores
and loads (STI/LDI) are out because
you will not enjoy implementing them in Verilog. But PC-relative stores and
loads (ST/LD) are back in, because
their implementation is really very similar to register-relative stores and
loads (STR/LDR) and because it turns
out that when implementing an operating systems, it's super-convenient to be
able to save all the registers without actually using a register to generate a
base address. The branching scheme has also been reverted to be more LC3-like,
with tests against zero inside the branch rather than separate
set-condition-and-branch instructions. Of course, the branches test an
explicit register rather than condition codes set by the previous register
writing instruction. Those and the fact that you can subtract, or, xor, and
shift are the big changes.
PennSim
The second thing you need is to learn about the P37X
simulation environment, PennSim. And here too the learning curve should be
gentle because PennSim is just
a port of LC3 simulation environment you used in CSE240, LC3Sim.
PennSim is actually much better
than a port. Through the magic of something Prof. Martin calls "functors", it
actually supports both LC3 and P37X! Of course, it doesn't support both at the
same time. You can choose which one it supports at startup with the arguments
-lc3 and -p37x.
A word of caution about LC3 support. PennSim supports LC3 binaries and symbol tables, but it won't assemble LC3 code as is. Specifically, it will not allow you to use the opcode BRnzp (because it is redundant with BR), it will force you to explicitly write ADDI, ANDI, and MULI when doing a register-immediate operation (rather than overloading ADD, AND, and MUL) and it will force you to use the directive .FILLL if you want to use a label as the constant (rather than overloading .FILL with both numeric and symbolic forms). Also, it doesn't currently support the .STRINGZ directive.
But aside from that, PennSim is exactly the same as LC3Sim. It has the same commands, e.g., reset, as, load, etc. and the work the same way. Actually, PennSim is slightly better than LC3Sim. With an extended symbol table format and a few interior tweaks to the Memory module, PennSim can tell instructions from data. That means it will not try to confuse you by disassembling data. It also means you can set "watchpoints" in addition to breakpoints, i.e., when you set a "breakpoint" on a piece of data PennSim will break when that data is written to rather than when that data is executed (which is never).
PennSim is available as a jar file, which is also found in :eniac:~amir/cse371/pennsim/PennSim.jar. You need Java version 1.5 to run this jar file (I've tried it with 1.4.2_09 and it doesn't work).
Worm skeleton code
The directory :eniac:~amir/cse371/p37x-code/worm/ contains the
skeleton codes for the worm game logic worm.asm and the OS trap routines it uses wormos.asm. Within these files you
will find descriptions of the interfaces and functionality of the routines you
need to write. For the OS, you need to write routines that resemble some of
the ones you wrote for CSE240 (GET_POINT, DRAW_POINT, DRAW_RECTANGLE, and GET_EVENT). For the game itself, you will need to write
the routines INIT_VIDEO_MEM,
GROW_WORM, and ROTATE_HEAD.
The whole thing should be about 120 to 180 lines of P37X assembly, depending on how you choose to do different things. That's not very long, "breakout" I believe was about four or five times longer (and the game logic was much more complex). This part of the assignment will be Due. Wed, Feb. 1 at midnight. Directions for turning in your code will follow shortly.