I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models.
I'll be teaching CIS 501: Computer Architecture in Fall 2015.
We present Consequence, a deterministic multi-threading library. Consequence achieves deterministic execution via store buffering and strict ordering of synchronization operations. To ensure high performance under a wide variety of conditions, the ordering of synch operations is based on a deterministic clock, and store buffering is implemented using version-controlled memory.
Recent work on deterministic concurrency has proposed relaxing the consistency model beyond total store order (TSO). Through novel optimizations, Consequence achieves the same or better performance on the Phoenix, PARSEC and SPLASH-2 benchmark suites, while retaining TSO memory consistency. Across 19 benchmark programs, Consequence incurs a worst-case slowdown of 3.9× vs. pthreads, with 14 out of 19 programs at or below 2.5×. We believe this performance improvement takes parallel programming one step closer to "determinism by default".
Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.
Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
Data-race freedom is a valuable safety property for multithreaded programs that helps with catching bugs, simplifying memory consistency model semantics, and verifying and enforcing both atomicity and determinism. Unfortunately, existing software-only race detectors are precise but slow; proposals with hardware support offer higher performance but are imprecise. Both precision and performance are necessary to achieve the many advantages always-on race detection could provide.
To resolve this trade-off, we propose RADISH, a hybrid hardware-software race detector that is always-on and fully precise. In RADISH, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that RADISH is correct by proving equivalence to a conventional happens-before race detector.
Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. RADISH can furthermore leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core RADISH processor using PARSEC benchmarks shows runtime overheads from negligible to 2x. Furthermore, RADISH outperforms the leading software-only race detector by 2x-37x.