Homework 5 - CIS534 Spring 2010

Instructor: Prof. Milo Martin

Final Data and Writeup Due: Monday, April 5th (at start of class).

Overview

In this assignment you will complete the implementation of a task scheduling extension to the Penn Parallelism Primitives library. The task extension has both a lower-level and a higher-level interface.

Low-Level Task API

The low-level API to the Penn Parallelism Primitives task scheduling allows a programmer to explicitly spawn/wait (fork/join) tasks.

Task. All tasks are decedents of the Task base class. All tasks have an execute() method that is overloaded by each sub-class of Task. The Task objects also record other information about task, such as which counter to atomically decrement after it has completed executing.

TaskGroup. To spawn/wait on tasks, the programmer creates a TaskGroup object. This object has a spawn(Task& t) method for invoking tasks and a wait() method. A spawn operation immediately enqueues the task in the task queue. The wait() method completes only once all tasks spawned as part of that task group have completed. While waiting, a thread will look for other work to perform.

Example. For example, the following code spawns two tasks and then waits for them to complete:

ppp::TaskGroup tg;
MyTask t1(a, b, c);
MyTask t2(d, e, f);
MyTask t3(g, h, i);
tg.spawn(t1);
tg.spawn(t2);
tg.spawn(t3);
tg.wait();

In the above example the tasks are on the stack, and must remain live until wait() completes. Alternatively, tasks may be heap allocated:

ppp::TaskGroup tg;
for (int i=0; i<partitions; i++) {
  tg.spawn(new MyTask(i, x, y);
}
tg.wait();

The above code allocates the tasks on the heap, and the task scheduler will free/delete them once they have completed executing. Generally speaking, it is better to use recursive fork/join parallelism than spawning lots of tasks in a loop. Like OpenMP (and unlike Cilk), each spawn() adds a task to the task queue.

Task Scheduling. The code I've provided uses a centralized stack of tasks shared by all worker threads. As described below, you'll extend this to into a work-stealing scheduler with per-thread task queues.

The High-Level Task API

Alternatively, the task system can be invoked by using the parallel_sort() and parallel_for() functions.

Parallel sort. The parallel_sort() function has the prototype:

template <typename T>
void parallel_sort(T* array, int64_t left, int64_t right, int64_t grainsize)

A call to parallel_sort will sort an array of any type, given its starting and ending index. It uses the built in (or programmer-specified) "less than" and "greater than" operators on type T. To give you a concrete example to work from, I've implemented parallel_sort() and included the code.

Parallel for. The parallel_for() function has the prototype:

template <typename T>
void parallel_for(int64_t start, int64_t end, T* functor, int64_t grainsize)

This applies a recursive divide-and-conquer parallelism from start to end of the specified functor. The functor is any object of a class with an calculate(int64_t start, int64_t end) method. The grainsize is optional.

Example. For example, consider the following loop:

void foo()
{
  float* arr = new float(size);
  ...
  for (int64_t i=0; i<size; i++) {
    arr[i] = arr[i] * 2.0;
  }
  ...
}

The code to make the above loop would be:

#include "parallel_for.h"
class MyFunc {
public:
  MyFunc(float* array, float scalar) {
    m_array = array;
    m_scalar = scalar;
  }
private:
  float* m_array;
  float m_scalar;
public:
  void calculate(int64_t start, int64_t end) {
    for (int64_t i=start; i<end; i++) {
      m_array[i] = m_array[i] * m_scalar;
    }
  }
};

void foo()
{
  float* arr = new float(size);
  ...
  MyFunc f(arr, 2.0);
  parallel_for(0, size, &f);
  ...
}

Although this looks nasty (or at least verbose), the lambda functions in C++0x will clean this up substantially. GCC doesn't support it yet (and thus we won't be using it in this assignment), but it might look something like this:

void foo()
{
  float* arr = new float(size);
  ...
  parallel_for(0, size, [=](int64_t start, int64_t end) { 
    for (int64_t i=start; i<end; i++) {
      arr[i] = arr[i] * 2.0;
    }
  });
  ...
}

Or, an even simpler construct could be supported:

void foo()
{
  float* arr = new float(size);
  ...
  parallel_for(0, size, [=](int64_t i) { 
    arr[i] = arr[i] * 2.0;
  });
  ...
}

Part 1: Characterize Sequential Overhead of parallel_sort

The first part of the assignment is to evaluate the recursive fork/join parallel quicksort (see parallel_sort.h). I've also provided code to invoke the parallel_sort() on some random data (see driver-sort.C). For all the test, we'll be sorting 40 million items (--particles 40000000 on the command line). Use at least five trials or so (--trials 5), as different random numbers will cause different pivot points and thus effect the load balance.

The parallel_sort() spawns recursive tasks until the a task is smaller than the specified "grainsize" (--grainsize n on the command line), which controls the granularity of the parallelism (for load balancing at the cost of more task scheduling overhead). If the work is smaller than the grainsize, the code calls an efficient serial sort routine.

For a single thread, explore the performance of the parallel sort code with varying grainsize (use the Core 2 machines in the cluster for the measurements). For a single thread, the larger the grainsize the better. In fact, a grainsize equal to the number of elements to sort will just call the standard highly-optimized sort routine. You'll want to try a range of values, starting with large grainsizes and reducing it until runtimes become too large. Plot the data with grainsize on the x-axis and the runtime on the y-axis. Normalize the runtime to the infinite grainsize.

What is the smallest grainsize that is still within 5% of the infinite grainsize?

Part 2: Characterize Parallelism in parallel_sort

Now perform the same analysis, but with four threads (on the four-thread Core2 machines). Plot the data on the same graph (normalizing the runtimes to the infinite grainsize single-thread runtime).

What is the maximum speedup on four cores? At approximately what grainsize does this occur?

Note: If you're observing random segmentation faults with large number of particles and small grainsizes with the default centralized task queue, you're running out of per-thread stack space. Just don't report data for such small grainsizes.

Part 3: Distributed Task Queue

The task scheduler (in TaskGroup.C) provided to you uses a single TaskQueue object shared by all worker threads.

Modify the code in TaskGroup.C to implement a task-stealing distributed work queue. The g_queues_ptr is an array of TaskQueue objects, even though only one of them is used in the code provided. The TaskQueue class (see TaskQueue.h) is a wrapper around a STL deque class and it adds locking on each operation (thus insuring proper synchronization). The TaskQueue objects already have a steal() method (which steals from the end of the queue opposite from where tasks are normally enqueued and dequeued. As TaskQueue methods already included locking, your code doesn't need any additional synchronization.

After make the modifications to TaskGroup.C (less than a dozen lines of code), re-perform the evaluation of parallel_sort() from Part 2. Add this data as a third line to the graph.

A decentralized work queue has two effects on runtime (1) reduces contention on the central queue and (2) impacts order in which the tasks execute. Is the distributed queue always faster? Why or why not. Comment on the likely relative impact of the two effects. (Basically, look at the data and some something insightful about it).

Part 4: Implement parallel_for

Using parallel_sort() as an example, create a recursive divide-and-conquer implementation of parallel_for discussed above. The file parallel_for.h contains some code to get you started. As with parallel_sort(), parallel_for() should recursively split the work until the work remaining is smaller than the specified grainsize.

The driver-compute.C code includes an optimized version of "computation two" from the prior homeworks. It first sorts the data and then uses binary search to identity the start and end of the range of particles to consider. This code already calls parallel_sort() and parallel_for() (see compute.h for the code to perform the core calculation). So, all you need to do is complete parallel_sort() in parallel_for.h.

Repeat the steps in Part 1 through Part 3 above, but instead characterizing the parallel_for() behavior in this case. Use just the compute_seconds runtime from the program's output (this is just the time to takes to the the parallel_for(). Use 10 million particles when performing the measurements (--particles 10000000). Create a second graph (three lines, as with the prior graph). Answer the following questions:

For a single thread, what is the smallest grainsize that is still within 5% of the infinite grainsize?
What is the maximum speedup on four cores? At approximately what grainsize does this occur?
A decentralized work queue has two effects on runtime (1) reduces contention on the central queue and (2) impacts order in which the tasks execute. Is the distributed queue always faster? Why or why not. Comment on the likely relative impact of the two effects. (Basically, look at the data and some something insightful about it).

Part 5: Implement parallel_reduce

Note: choose only one of Part 5 and Part 6 to complete

Implement parallel_reduce() analogously to the above parallel_sort() function. Use it to sum an array of numbers (rather than sort them). Use a large enough dataset and multiple trials so that it runs for at least a few seconds. Perform the same analysis as in Part 4 (and answer the same questions).

Part 6: Experiment on Other Machines

Note: choose only one of Part 5 and Part 6 to complete

For either parallel_for() or parallel_sort(), perform the same analysis as in Part 4 (and answer the same questions), but for the 8-core Core2 system, the Core i7, and the Niagara II system. You may need to adjust the problem size somewhat to make the runtimes reasonable for the Niagara II machine. Identify any differences in the data or anything else insightful or interesting about the data.

Logistics

PPP. All of the source code you'll need is available as PPPv3.tar.gz. See https://www.cis.upenn.edu/~cis534/ppp3.html for the most recent information on the Penn Parallelism Primitives.

Compiling. To compile the code:

g++ -ggdb3 -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C random.C TaskGroup.C driver-sort.C -o driver-sort
g++ -ggdb3 -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C random.C TaskGroup.C driver-compute.C -o driver-compute
g++ -ggdb3 -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C random.C TaskGroup.C driver-reduce.C -o driver-reduce

Debugging output. To help you understand your code execution, I've inserted some debugging printout statements in the code. They can be enabled by compiling with the -DPPP_DEBUG flag:
```
g++ -DPPP_DEBUG -ggdb3 -m64 ...
```

As before, to cross-compile for SPARC:

~cis534/public/cross/bin/sparc-sun-solaris2.10-g++ -R /home1/c/cis534/public/cross/sparc-sun-solaris2.10/lib/sparcv9/ -ggdb3 -m64 -std=c++0x -pthread -lrt -Wall -O3 ...

The various drivers takes the following command-line parameters:
- --threads: number of threads (defaults to 1)
- --particles: problem size
- --trials: number of times to repeat the computation
- --grainsize: sets the grainsize for the computation
For example:
```
./driver-sort --particles 40000000 --trials 5 --grainsize 1000 --threads 4
./driver-compute --particles 10000000 --trials 5 --grainsize 1000 --threads 4
./driver-reduce --particles 100000000 --trials 5 --grainsize 1000 --threads 4
```
Data collection: This assignment requires only a little code, but lots of runs. I suggest you consider writing shell or Python scripts that both (1) setup the jobs to submit to SGE and (2) also assist in analyzes the data files by making data files that is easy to import into Excel or that generates graphs directly using GNUPlot, jgraph, or other script-based graphing tool.
Eniac: Log into eniac.seas.upenn.edu to develop and test your code You should run your timing results using the dedicated machines (see https://www.cis.upenn.edu/~cis534/sge.html).
Individual work: Each student must submit their own writeup, but I encourage you to discuss this general problem with each other outside of class. Sharing of code is not allowed.

Hints and Tips

The centralized task queue (the default given to you) can lead to deep recursion depths. If your code is segmentation faulting on large inputs sizes for small grainsizes, it is likely running out of the per-thread stack space (it took me a long time to debug this problem when I first encountered it). You can eiter up the size of the stack (in ppp.C), use a larger grainsize, etc.

What to Turn In

Print out the following to bring to class on the final due date:

Insightful writeup. The goal of this assignment is to get you to think more deeply about the various lock primitives choices and implementations. The role of the writeup is to convince with that you learned something by saying insightful things in the writeup. I'm not looking for lots of prose, but lots of insight described tersely.
Graphs. Include the graphs described above (one for parts 1-3, another for part 4, and one or more additional graphs for either Part 4 or Part 5. Label the x-axis and y-axis on all graphs. Give each graph a descriptive title. Make sure to use different symbols or dot/dash patterns to clearly identify which data is which.
Your code. Print out your code, but just the modified files:
- TaskGroup.C (part 3)
- parallel_for.h (part 4)
- parallel_reduce.h (part 5)

Disclaimer

It is quite likely that the code I gave you will have bugs in it. It is a small amount of code, but tricky code and parallel code is difficult to test. I've done what I can to test it, but if something odd is happening it could be a bug in the code I wrote.

I haven't collected data for these configurations. Thus, I don't know if the data is actually going to be interesting or not. It could be that four cores isn't enough to show much of anything. In that case, I might end up asking you all to collect the data for the first parts on the few larger machines we have.

Addendum

[March 31] Updated the PPPv3 code to display the grainsize with the output

Retrospective

After assigning and grading this assignment, a few things to consider correcting in future years:

For the "reduce" operation, I had a version in mind that used neither locking nor per-thread accumulators. Of course, using a lock to protect a global counter for each coarse-grained task might also perform well, but it wasn't what I had in mind. The assignment should be clarified to give the students more guidance.
I didn't explicitly say that the work stealing should just be random. Instead, students used all sorts of round-robin or other stealing. At least one used "always steal from queue zero", which caused lots of problems (it just doesn't work so well).