Homework 3 - CIS534 Spring 2010

Instructor: Prof. Milo Martin

Initial Data Due: Monday, February 22nd (emailed to me by 10am)

Final Data and Writeup Due: Friday, February 26th (under my door by 5pm).


Create parallel implementations of the computations from Homework 2. The goal is to minimize overall runtime. The first two parts of the assignment are relatively concrete. The remaining parts are more open-ended; for them, you'll be asked to make your own design choices, perform your own analysis, and describe them lucidly.

I'll provide some code for the low-level primitives you'll need. We'll call it the Penn Parallelism Primitives library.

Part 1

Write a parallel implementation of Computation 1a (the n^2 double-nested loop version) from Homework 2. To start, perform a static partitioning of work in which each core takes an equal share of the particles.
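
One way to structure the static split is sketched below. This is an illustration only: the actual PPP thread-spawning API may differ (this sketch uses std::thread instead), and the inner force kernel here is a stand-in for the real Homework 2 computation.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for the O(n^2) all-pairs kernel from compute.C: each particle's
// "force" is the sum of its signed distances to every other particle.
static void work_range(const std::vector<double>& pos,
                       std::vector<double>& force,
                       std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        double f = 0.0;
        for (std::size_t j = 0; j < pos.size(); ++j)
            if (i != j) f += pos[j] - pos[i];
        force[i] = f;
    }
}

// Static partitioning: thread t takes a contiguous, nearly equal share of
// the particles, fixed before any thread starts.
void compute_static(const std::vector<double>& pos,
                    std::vector<double>& force,
                    unsigned num_threads) {
    std::vector<std::thread> workers;
    const std::size_t n = pos.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = n * t / num_threads;
        std::size_t end = n * (t + 1) / num_threads;
        workers.emplace_back(work_range, std::cref(pos), std::ref(force),
                             begin, end);
    }
    for (std::size_t t = 0; t < workers.size(); ++t)
        workers[t].join();  // wait for all shares to finish
}
```

Because each thread writes only its own slice of `force` and reads shared `pos` without modifying it, no locking is needed inside the loop.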

  1. Calculate the runtimes for 200,000 particles on 1, 2, 4, and 8 cores of the Core 2 machines. Throughout this assignment, when using four cores or fewer, use the following qsub options: -q core2-quad.q -pe threaded 4; for 8 cores, use -q core2-oct.q -pe threaded 8.

  2. Create two graphs. First, plot the runtimes (y-axis) versus cores (x-axis). Second, plot the speedup (y-axis) versus cores (x-axis). For speedup, higher is better. For the speedup graph, make sure that both axes are either linear or logarithmic (don't mix them). Plot a "perfect scalability" line (a speedup of n on n cores), which should be a straight line if done right.

    Note: The speedups throughout this assignment should be computed relative to your fastest sequential version of the code, not the original non-optimized code from the first assignment.

  3. Plot the speedup versus number of particles (up to 200,000 or larger) for four threads. What is the smallest number of particles (approximately) for which your code can achieve at least a 2x speedup on 4 cores? The point of this question is to give you some feel for the smallest amount of computation that can reasonably be performed in parallel.

Part 2

For the same computation as part 1 above, use a dynamic partitioning approach. Use a shared counter to distribute the work among the threads in chunks of particles as defined by a grain size parameter.
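
A sketch of the shared-counter scheme follows. Again this is hypothetical: the PPP library presumably provides its own atomic fetch-and-add primitive, which this sketch replaces with std::atomic, and the force kernel is the same stand-in as before.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Dynamic partitioning: threads repeatedly claim the next "grain"-sized
// chunk of particles from a shared counter until the work runs out.
void compute_dynamic(const std::vector<double>& pos,
                     std::vector<double>& force,
                     unsigned num_threads, std::size_t grain) {
    std::atomic<std::size_t> next(0);  // index of the next unclaimed particle
    const std::size_t n = pos.size();

    auto worker = [&]() {
        for (;;) {
            std::size_t begin = next.fetch_add(grain);  // claim one chunk
            if (begin >= n) break;                      // all work claimed
            std::size_t end = std::min(begin + grain, n);
            for (std::size_t i = begin; i < end; ++i) {
                double f = 0.0;
                for (std::size_t j = 0; j < n; ++j)
                    if (i != j) f += pos[j] - pos[i];   // stand-in kernel
                force[i] = f;
            }
        }
    };

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back(worker);
    for (std::size_t t = 0; t < workers.size(); ++t)
        workers[t].join();
}
```

The grain size trades off load balance against contention: every chunk costs one atomic increment on the shared counter, which is why very small grains hurt.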

  1. Plot the speedup versus grain size (from 1 to the number of particles) for four threads and 200,000 particles. As the interesting parts of the graph are where grain sizes are relatively small or relatively large, you'll want to run more data points in those ranges.

    What is the impact of too small a grain size? What is the impact of too large a grain size? A "reasonable" grain size is the smallest grain size that is within a few percent of the "optimal" grain size. What is a reasonable grain size?

  2. Plot speedup versus number of threads for 200,000 particles. Plot two lines: your dynamic version with a "reasonable" grain size along with the results from Part 1b. To ensure the baseline is independent of the grain size, be sure to normalize the runtimes to the static partitioned version running on a single core, not to the dynamically partitioned version running on a single core.
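
Concretely, the normalization looks like this (the runtimes in the check below are hypothetical numbers, not measurements):

```cpp
// Normalized speedup: the baseline runtime divided by the parallel runtime.
// For the Part 2 graph, the baseline is the single-core *static* runtime,
// so both the static and dynamic curves are divided by the same number.
double speedup(double baseline_secs, double parallel_secs) {
    return baseline_secs / parallel_secs;
}
```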

Part 3

Part 4

Part 5

Part 6

Things to Consider

Penn Parallelism Primitives

The Penn Parallelism Primitives is a collection of low-level shared-memory primitives for multicore software, written specifically for this course. It includes functions for spawning threads, atomic operations, simple barriers, etc.

See: https://www.cis.upenn.edu/~cis534/ppp.html

All of the source code you'll need is available as PPPv1.tar.gz.

To compile with these primitives:

g++ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver.C compute.C

To cross-compile for SPARC:

~cis534/public/cross/bin/sparc-sun-solaris2.10-g++ -R /home1/c/cis534/public/cross/sparc-sun-solaris2.10/lib/sparcv9/ -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C driver.C compute.C


What to Turn In

For the initial due date:

  1. By 10am on the day of class, send an email to cis534@cis.upenn.edu with your speedups on four and eight cores for Computation 1a with static and dynamic partitioning. Just four numbers total (normalized speedup on four cores and on eight cores, for static and for dynamic). This is basically Part 1 and Part 2.
  2. Be ready to describe your approaches for the in-class discussion.

Print out the following to bring to class on the final due date:

  1. Insightful writeup. There are few "right" answers for this assignment, and I don't actually know the "right" answers anyway. The point of this assignment is the process of writing parallel code and conveying whatever insight you gained from that process. I'm not looking for lots of prose, but for lots of insight, described tersely.

    We'll discuss these insights in-class after the assignments are due.

  2. Graphs. Include the graphs described above. Label the x-axis and y-axis. Give each graph a descriptive title.

  3. Your code. Print out your final (fastest) versions of each of the computations.

  4. Your fastest results. Your fastest results for Part 2, Part 3, and Part 4 on the eight-core Core 2 machine. Give both the raw runtime in seconds and the normalized speedup.


It is quite likely that the primitives library I give you will have bugs in it. It is a small amount of code, but it is tricky code, and parallel code is difficult to test. I've done what I can to test it, but if something odd is happening, it could be a bug in the code I wrote.

There probably isn't any single "right" solution. If there is, I don't know what it is, as I haven't tried this out myself. It could be that the obvious way is the fastest way of doing it, or it could be more subtle.



After assigning and grading this assignment, a few things to consider correcting in future years: