I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models. I recently finished my PhD at the University of Washington, working with Luis Ceze and Dan Grossman.
I am eagerly looking for new PhD students interested in the programmability and security of multicore architectures. If you are interested in these topics or other work going on in the Architecture & Compilers Group please do apply to our PhD program and drop me an email as well.
I'm co-organizing the Workshop on Determinism and Correctness in Parallel Programming (WoDet) 2014 workshop with Brandon Lucia from MSR, to be held in conjunction with ASPLOS 2014 in Salt Lake City. Please consider submitting work improving the correctness of parallel programs!
I'm on the PC of the Systems for Future Multicore Architectures (SFMA) 2014 workshop, held in conjunction with EuroSys 2014 in Amsterdam. Please consider submitting work at the interface of operating systems and multicores!
I'm teaching CIS 501, graduate computer architecture, in Fall 2013 (TR 4:30-6:00). More info at the course web page.
Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.
Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
Data-race freedom is a valuable safety property for multithreaded programs that helps with catching bugs, simplifying memory consistency model semantics, and verifying and enforcing both atomicity and determinism. Unfortunately, existing software-only race detectors are precise but slow; proposals with hardware support offer higher performance but are imprecise. Both precision and performance are necessary to achieve the many advantages always-on race detection could provide.
To resolve this trade-off, we propose RADISH, a hybrid hardware-software race detector that is always-on and fully precise. In RADISH, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that RADISH is correct by proving equivalence to a conventional happens-before race detector.
Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. RADISH can furthermore leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core RADISH processor using PARSEC benchmarks shows runtime overheads from negligible to 2x. Furthermore, RADISH outperforms the leading software-only race detector by 2x-37x.
Nondeterminism is a key contributor to the difficulty of parallel programming. Many research projects have shown how to provide deterministic parallelism, but with unfortunate trade-offs. Deterministic execution enforces determinism for arbitrary programs but with significant runtime cost, while deterministic languages enforce determinism statically (without runtime overhead) but only for fork-join programs expressible in their static type systems.
MELD unifies these approaches. We explain the requirements for soundly integrating a deterministic language into a deterministic execution system, and describe a simple qualifier-based type checker that ensures isolation for code written in a deterministic language. We also extend MELD to incorporate nondeterministic operations without compromising the determinism of the rest of the program. Our experiments with benchmarks from the SPLASH2 and PARSEC suites show that a small number of annotations can accelerate the performance of deterministic versions of these programs by 2-6x.
Providing deterministic execution significantly simplifies the debugging, testing, replication, and deployment of multithreaded programs. Recent work has developed deterministic multiprocessor architectures as well as compiler and runtime systems that enforce determinism in current hardware. Such work has incidentally imposed strong memory-ordering properties. Historically, memory ordering has been relaxed in favor of higher performance in shared memory multiprocessors and, interestingly, determinism exacerbates the cost of strong memory ordering. Consequently, we argue that relaxed memory ordering is vital to achieving faster deterministic execution.
This paper introduces RCDC, a deterministic multiprocessor architecture that takes advantage of relaxed memory orderings to provide high-performance deterministic execution with low hardware complexity. RCDC has two key innovations: a hybrid HW/SW approach to enforcing determinism; and a new deterministic execution strategy that leverages data-race-free-based memory models (e.g., the models for Java and C++) to improve performance and scalability without sacrificing determinism, even in the presence of races. In our hybrid HW/SW approach, the only hardware mechanisms required are software-controlled store buffering and support for precise instruction counting; we do not require speculation. A runtime system uses these mechanisms to enforce determinism for arbitrary programs.
We evaluate RCDC using PARSEC benchmarks and show that relaxing memory ordering leads to performance and scalability close to nondeterministic execution without requiring any form of speculation. We also compare our new execution strategy to one based on TSO (total-store-ordering) and show that some applications benefit significantly from the extra relaxation. We also evaluate a software-only implementation of our new deterministic execution strategy.
This paper takes a critical look at the benefits provided by state-of-the-art deterministic execution techniques. Specifically, we look at four applications of deterministic execution: debugging, fault-tolerant replication, testing, and security. For each application, we discuss what an ideal system would provide, and then look at how deterministic systems compare to the ideal. Further, we discuss alternative approaches, not involving determinism, and we judge whether or not these alternatives are more suitable. Along the way, we identify open questions and suggest future work.
Ultimately, we find that there are competitive alternatives to determinism for debugging and replicating multithreaded programs; that determinism has high, though unproven, potential to improve testing; and that determinism has distinct security benefits in eliminating some covert timing channels. Furthermore, determinism is a unified solution for all four applications: this confers a distinct advantage over point solutions that do not compose well with one another.