I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models.
I'm teaching CIS 601: GPGPU Programming Models in Spring 2016.
As ever more computation shifts onto multicore architectures, it is increasingly critical to find effective ways of dealing with multithreaded performance bugs like true and false sharing. Previous approaches to fixing false sharing in unmanaged languages have had to resort to highly-invasive runtime program modification. We observe that managed language runtimes, with garbage collection and JIT code compilation, present unique opportunities to repair such bugs directly, mirroring the techniques used in manual repairs.
We present Remix, a modified version of the Oracle HotSpot JVM which can detect cache contention bugs and repair false sharing at runtime. Remix’s detection mechanism leverages recent performance counter improvements on Intel platforms, which allow for precise, unobtrusive monitoring of cache contention at the hardware level. Remix can detect and repair known false sharing issues in the LMAX Disruptor high-performance inter-thread messaging library and the Spring Reactor event-processing framework, automatically providing 1.5-2x speedups over unoptimized code and matching the performance of hand-optimization. Remix also finds a new false sharing bug in SPECjvm2008, and uncovers a true sharing bug in the HotSpot JVM that, when fixed, improves the performance of three NAS Parallel Benchmarks by 7-25x. Remix incurs no statistically-significant performance overhead on other benchmarks that do not exhibit cache contention, making Remix practical for always-on use.
Data races are one of the main culprits behind the complexity of multithreaded programming. Existing data race detectors require large amounts of metadata for each program variable to perform their analyses. The SlimFast system exploits the insight that there is a large amount of redundancy in this metadata: many program variables often have identical metadata state. By sharing metadata across variables, a large reduction in space usage can be achieved compared to the state-of-the-art FastTrack algorithm. SlimFast uses immutable metadata to safely support metadata sharing across threads while also accelerating concurrency control. SlimFast’s lossless metadata compression achieves these benefits while preserving soundness and completeness.
Across a range of Java benchmarks from the Java Grande, DaCapo and NAS Parallel Benchmark suites, SlimFast is able to reduce memory consumption by 1.76x on average, and up to 4.47x for some benchmarks, compared to FastTrack. By improving cache locality and simplifying concurrency control, SlimFast also accelerates data race detection by 1.24x on average, and up to 8.33x for some benchmarks, compared to FastTrack.
Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and can even arise invisibly in the program due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime system environment.
This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware.
We present Consequence, a deterministic multi-threading library. Consequence achieves deterministic execution via store buffering and strict ordering of synchronization operations. To ensure high performance under a wide variety of conditions, the ordering of synch operations is based on a deterministic clock, and store buffering is implemented using version-controlled memory.
Recent work on deterministic concurrency has proposed relaxing the consistency model beyond total store order (TSO). Through novel optimizations, Consequence achieves the same or better performance on the Phoenix, PARSEC and SPLASH-2 benchmark suites, while retaining TSO memory consistency. Across 19 benchmark programs, Consequence incurs a worst-case slowdown of 3.9× vs. pthreads, with 14 out of 19 programs at or below 2.5×. We believe this performance improvement takes parallel programming one step closer to "determinism by default".
Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.
Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.