Homework 3 - CIS534 Spring 2010

Instructor: Prof. Milo Martin

Initial Data Due: Monday, February 22nd (emailed to me by 10am)

Final Data and Writeup Due: Friday, February 26th (under my door by 5pm).

Overview

Create parallel implementations of computations from Homework 2. The goal is minimize overall runtime. The first two parts of the assignment is relatively concrete. The remaining parts of more open-ended, for which you'll be asked to make your own design choices, perform your own analysis, and lucidly describe them.

I'll provide some code for the low-level primitives you'll need. We'll call it Penn Parallelism Primitives library.

Part 1

Write a parallel implementation of Computation 1a (the n² double-nested loop version) from Homework 2. To start with perform a static partitioning of work in which each core takes a equal share of particles.

Calculate the runtimes for 200,000 particles for 1, 2, 4, and 8 cores for the Core 2 machines. Throughout this assignment, when four cores or less, use the following qub options: -q core2-quad.q -pe threaded 4; for 8 cores, use -q core2-oct.q -pe threaded 8.
Create two graphs. First, plot the both runtimes (y-axis) vs cores (x-axis). Second, plot the speedup (y-axis) versus cores (x-axis). For speedup, higher is better. For the speedup graph, make sure that both axis are either linear or logarithmic (both don't mix them). Plot a "perfect scalability" line (n speedup on n cores), which should be a straight line if done right.

Note: The speedups throughout this assignment should be the speedup as compared to your fastest sequential version of the code, not the original non-optimized codes from the first assignment.
Plot the speedup versus number of particles (up to 200,000 or larger) for four threads. What is the smallest number of particles (approximately) in which your code can achieve at least a 2x on 4 cores. The point of this question is to give you some feel for what is the smallest amount of computation than be reasonably performed in parallel.

Part 2

For the same computation as part 1 above, use a dynamic partitioning approach. Use a shared counter to distribute the work among the threads in chunks of particles as defined by a grain size parameter.

Plot the speedup versus grain size (from 1 to number of particles) for four threads and 200,000 particles. As the interesting part of the graph is where grain sizes are relatively small and relatively large, you'll want to run more data points in those ranges.

What is the impact of too small a grain size? What is the impact of too large a grain size? A "reasonable" grain size is the smallest grain size that is within a few percent of "optimal" grain size. What is a reasonable grain size?
Plot speedup versus number of threads for 200,000 particles. Plot two lines: your dynamic one with a "reasonable" grain size along with the results from Part 1b. To ensure the baseline is independent of the grain size, be sure to normalize the runtimes to the static partitioned version running on a single core, and not just the dynamically parallel version running on a single core.

Part 3

There is no part 3. (I had original asked you to parallelize the optimized variant of computation 1 from Homework 2, but for sake of brevity, please skip right to Part 4.

Part 4

Parallelize computation 2 from Homework 2. Try a few implementations, measure and graph their speedups, and describe what worked well and/or what worked less well.

Note: you may wish to complete Part 5 first, and then you can treat Part 4 as a simpler 1-dimensional case of part 5's 2-dimensional computation. Or, you can use a different algorithm for this part to obtain the highest performance.

Part 5

Parallelize computation 3 from Homework 2. Try a few implementations, measure and graph their speedups, and describe what worked well and/or what worked less well.

Part 6

Re-run one of the previous parts on both the Core i7 and the Niagara T2 machine. Compare and contrast the machines (recall that our Core i7 machine is a single-chip, has more memory bandwidth, and had more on-chip cache). Explore any challenges you met with achieving scalability on 128 threads on the Niagara T2 system. These machines are both hardware multithreaded, so you wouldn't really expect linear speedup beyond the number of cores (but you will hopefully observe some speedup).

Note: When using the Niagara T2, use one data set size to compare the performance from one thread to eight threads. Use a larger size to compare eight threads thorough 128 threads. The larger size likely won't complete on a single core within the CPU time limits set on the SGE queue for the machine.

Things to Consider

Understanding parallel performance. When trying to understand the parallel performance of your code, consider all the issues we've talked about thus far (mind what you have learned; save you it can): synchronization overhead, load imbalance, data locality, true and false read/write sharing, cache capacity, memory bandwidth limitations, etc. Create hypotheses, devise critical experiments to isolate effects, etc.
Sorting. As many of your sequential optimized versions of these computations used sorting, you may end up wanting to do your sorting in parallel. There are at least as many parallel sorting methods as the overwhelming number of choices of sequential sorting methods. You can implement anything you want, but a parallel bucket or counting sort might be the easiest (see the entry at Wikipedia on bucket sort and counting sort).

Penn Parallelism Primitives

The Penn Parallelism Primitives is a collection of low-level shared-memory primitives for writing multicore software written specifically for this course. It includes functions for spawning threads, atomic operations, simple barriers, etc.

See: https://www.cis.upenn.edu/~cis534/ppp.html

All of the source code you'll need is available as PPPv1.tar.gz.

To compile with these primitives:

g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver.C compute.C

To cross-compile for SPARC:

~cis534/public/cross/bin/sparc-sun-solaris2.10-g++ -R /home1/c/cis534/public/cross/sparc-sun-solaris2.10/lib/sparcv9/ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver.C compute.C

Logistics

Computations: See Homework 2 for descriptions of the computations, data distribution, links to the sequential code, etc.
Eniac: Log into eniac.seas.upenn.edu to develop and test your code You should run your timing results using the dedicated machines (see https://www.cis.upenn.edu/~cis534/sge.html).
Individual work: Each student must submit their own writeup, but I encourage you to discuss this general problem with each other outside of class. Sharing of code is not allowed.

What to Turn In

For the initial due date:

By 10am on the day of class, send an email to cis534@cis.upenn.edu send your speedup on four and eight cores for compuation 1a with the static and dynamic partitioning. Just four numbers total (normalized speedup for four cores and eight cores for static and dynamic). This is basically Part 1 and Part 2.
Be ready to describe your approaches for the in-class discussion.

Print out the following to bring to class on the final due date:

Insightful writeup. There are few "right" answers for this assignment. And I don't actually know the "right" answers, anyway. The point of this assignment is the process of writing parallel code and showing whatever insight you gained from the process. I'm not looking for lots of prose, but lots of insight described tersely.

We'll discuss these insights in-class after the assignments are due.
Graphs. Include the graphs described above. Label the x-axis and y-axis. Give each graph a descriptive title.
Your code. Print out your final (fastest) versions of each of the computations.
Your faster results. Your results for the fastest results for Part two, part three, and part four on the eight-core Core 2 machine. Give the results for both raw runtime in seconds and normalized speedup.

Disclaimer

It is quite likely that the primitives library I give you will have bugs in it. It is a small amount of code, but tricky code and parallel code is difficult to test. I've done what I can to test it, but if something odd is happening it could be a bug in the code I wrote.

There probably isn't any "right" solution. If there is, I don't know what it is, as I haven't tried this out myself. It could be the obvious way is the fastest way of doing it; or it could be more subtle.

Addendum

[Feb 15] Removed Part 3
[Feb 15] Included information about the Penn Parallelism Primitives.
[Feb 22] Note: When using the Niagara T2, use one data set size to compare the performance from one thread to eight threads. You a larger size to compare eight threads thorough 128 threads. The larger size likely won't complete on a single core.
[Feb 23] Clarified that speedup should be calculated with respect to the best sequential baseline rather than some non-optimized baseline version.

Retrospective

After assigning and grading this assignment, a few things to consider correcting in future years:

For the first two parts, the work per iteration of the first part is both large (as it has a whole inner loop in it) and the work is reasonably balanced as it is an n² algorithm. Thus, the experiments of dynamic work partitioning and grain size aren't so exciting (a grain size of 1 is okay and there isn't much speedup from dynamic work partitioning).
The other parts might have been a bit too free form. It allowed the students to explore, but some of the students got lots in the number of options in making an efficient sequential and parallel versions.
On the final two-dimensional computation, it seems the radius doesn't scale large enough to really allow good parallel computation. Too few particles in each radius results in all the bookkeeping overhead dominating, which is harder to parallelize. Adjusting the way the driver scales the radius as the number of particles increases could fix this problem.