my email address: my last name at cis dot upenn dot edu

(215) 746-4223

Levine Hall 572
3330 Walnut Street

Philadelphia, PA 19104-3409

Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.

Conference Papers

RPG²: Robust Profile-Guided Runtime Prefetch GenerationRPG²: Robust Profile-Guided Runtime Prefetch Generation
Yuxuan Zhang, Nathan Sobotka, Soyoon Park, Saba Jamilan, Tanvir Ahmed Khan, Baris Kasikci, Gilles Pokam, Heiner Litz and Joseph Devietti

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '24), May 2024

[abstract][paper]

Data cache prefetching is a well-established optimization to overcome the limits of the cache hierarchy and keep the processor pipeline fed with data. In principle, accurate, well-timed prefetches can sidestep the majority of cache misses and dramatically improve performance. In practice, however, it is challenging to identify which data to prefetch and when to do so. In particular, data can be easily requested too early, causing eviction of useful data from the cache, or requested too late, failing to avoid cache misses. Competition for limited off-chip memory bandwidth must also be balanced between prefetches and a program's regular "demand" accesses. Due to these challenges, prefetching can both help and hurt performance, and the outcome can depend on program structure, decisions about what to prefetch and when to do it, and, as we demonstrate in a series of experiments, program input, processor microarchitecture, and their interaction as well.
To try to meet these challenges, we have designed the RPG² system for online prefetch injection and tuning. RPG² is a pure-software system that operates on running C/C++ programs, profiling them, injecting prefetch instructions, and then tuning those prefetches to maximize performance. Across dozens of inputs, we find that RPG² can provide speedups of up to 2.15×, comparable to the best profile-guided prefetching compilers, but can also respond when prefetching ends up being harmful and roll back to the original code - something that static compilers cannot. RPG² improves prefetching robustness by preserving its performance benefits, while avoiding slowdowns.
OCOLOS: Online COde Layout OptimizationSOCOLOS: Online COde Layout OptimizationS
Yuxuan Zhang, Tanvir Ahmed Khan, Gilles Pokam, Baris Kasikci, Heiner Litz and Joseph Devietti

ACM IEEE International Symposium on Microarchitecture (MICRO '22), October 2022

[abstract][paper]

Selected for IEEE Micro Top Picks 2023 [article]

The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application’s code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities.
In this paper, we propose OCOLOS, the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be re-launched. By running online, profile data is always relevant to the current execution and always maps perfectly to the running code. OCOLOS demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that OCOLOS can accelerate MySQL by up to 1.41x, the Verilator hardware simulator by up to 2.20x, and a build of the Clang compiler by up to 1.14x.
Twig: Profile-Guided BTB Prefetching for Data Center ApplicationsTwig: Profile-Guided BTB Prefetching for Data Center Applications
Tanvir Ahmed Khan, Nathan Brown, Akshitha Sriraman, Niranjan K Soundararajan, Rakesh Kumar, Joseph Devietti, Sreenivas Subramoney, Gilles Pokam, Heiner Litz and Baris Kasikci

ACM IEEE International Symposium on Microarchitecture (MICRO '21), October 2021

[abstract][paper]

Modern data center applications have deep software stacks, with instruction footprints that are orders of magnitude larger than typical instruction cache (I-cache) sizes. To efficiently prefetch instructions into the I-cache despite large application footprints, modern server-class processors implement a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP). In this work, we first characterize the limitations of a decoupled frontend processor with FDIP and find that FDIP suffers from significant Branch Target Buffer (BTB) misses. We also find that existing techniques (e.g., stream prefetchers and predecoders) are unable to mitigate these misses, as they rely on an incomplete understanding of a program’s branching behavior.
To address the shortcomings of existing BTB prefetching techniques, we propose Twig, a novel profile-guided BTB prefetching mechanism. Twig analyzes a production binary’s execution profile to identify critical BTB misses and inject BTB prefetch instructions into code. Additionally, Twig coalesces multiple non-contiguous BTB prefetches to improve the BTB’s locality. Twig exposes these techniques via new BTB prefetch instructions. Since Twig prefetches BTB entries without modifying the underlying BTB organization, it is easy to adopt in modern processors. We study Twig’s behavior across nine widely-used data center applications, and demonstrate that it achieves an average 20.86% (up to 145%) performance speedup over a baseline 8K-entry BTB, outperforming the state-of-the-art BTB prefetch mechanism by 19.82% (on average).
Ripple: Profile-Guided Instruction Cache Replacement for Data Center ApplicationsRipple: Profile-Guided Instruction Cache Replacement for Data Center Applications
Tanvir Ahmed Khan, Dexin Zhang, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, Heiner Litz and Baris Kasikci

International Symposium on Computer Architecture (ISCA '21), June 2021

[abstract]

Modern data center applications exhibit deep software stacks yielding large instruction footprints that frequently lead to instruction cache misses degrading performance, cost-efficiency, and energy efficiency. Although numerous mechanisms have been proposed to mitigate instruction cache misses, they still fall short of ideal cache behavior, and furthermore, introduce significant hardware overheads. We first investigate why existing I-cache miss mitigation mechanisms achieve sub-optimal performance for data center applications. We find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced evictions that are not handled by existing replacement policies. Alas, existing replacement policies are unable to mitigate wasteful evictions since they lack complete knowledge of a data center application’s complex program behavior.
To make existing replacement policies aware of these eviction-inducing program behaviors, we propose Ripple, a novel software-only technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. Ripple carefully identifies program contexts that lead to I-cache misses and sparingly injects “cache line eviction” instructions in suitable program locations at link time. We evaluate Ripple using nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve speedup that is closer to that of an ideal I-cache. Specifically, Ripple achieves an average performance improvement of 1.6% (up to 2.13%) due to a mean 19% (up to 28.6%) I-cache miss reduction.
I-SPY: Context-Driven Conditional Instruction Prefetching with CoalescingI-SPY: Context-Driven Conditional Instruction Prefetching with Coalescing
Tanvir Ahmed Khan, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, Heiner Litz and Baris Kasikci

ACM IEEE International Symposium on Microarchitecture (MICRO '20), October 2020

[abstract][paper]

Modern data center applications have rapidly expanding instruction footprints that lead to frequent instruction cache misses, increasing cost and degrading data center performance and energy efficiency. Mitigating instruction cache misses is challenging since existing techniques (1) require significant hardware modifications, (2) expect impractical on-chip storage, or (3) prefetch instructions based on inaccurate understanding of program miss behavior. To overcome these limitations, we first investigate the challenges of effective instruction prefetching. We then use insights derived from our investigation to develop I-SPY, a novel profile-driven prefetching technique. I-SPY uses dynamic miss profiles to drive an offline analysis of I-cache miss behavior, which it uses to inform prefetching decisions. Two key techniques underlie I-SPY's design: (1) conditional prefetching, which only prefetches instructions if the program context is known to lead to misses, and (2) prefetch coalescing, which merges multiple prefetches of non-contiguous cache lines into a single prefetch instruction. I-SPY exposes these techniques via a family of light-weight hardware code prefetch instructions. We study I-SPY in the context of nine data center applications and show that it provides an average of 15.5% (up to 45.9%) speedup and 95.9% (and up to 98.4%) reduction in instruction cache misses, outperforming the state-of-the-art prefetching technique by 22.5%. We show that I-SPY achieves performance improvements that are on average 90.5% of the performance of an ideal cache with no misses.
Deterministic Atomic BufferingDeterministic Atomic Buffering
Yuan Hsi Chou, Christopher Ng, Shaylin Cattell, Jeremy Intan, Matthew Sinclair, Joseph Devietti, Timothy G. Rogers and Tor Aamodt

ACM IEEE International Symposium on Microarchitecture (MICRO '20), October 2020

[abstract][paper][full talk][lightning talk][simulator code]

Deterministic execution for GPUs is a desirable property as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions that are required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, do not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are isolated initially and made visible in the future in a deterministic order. This allows the GPU to execute deterministically in parallel without having to serialize its threads for atomic operations as opposed to GPUDet. Our simulation results show that, for atomic-intensive applications, DAB performs 4× better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations.
Reproducible ContainersReproducible Containers
Omar Navarro Leija, Kelly Shiptoski, Ryan Scott, Baojun Wang, Nicholas Renner, Ryan Newton and Joseph Devietti

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '20), March 2020

[abstract][paper]

In this paper, we describe the design and implementation of DetTrace, a reproducible container abstraction for Linux implemented in user space. All computation that occurs inside a DetTrace container is a pure function of the initial filesystem state of the container. Reproducible containers can be used for a variety of purposes, including replication for fault-tolerance, reproducible software builds and reproducible data analytics. We use DetTrace to achieve, in an automatic fashion, reproducibility for 12,130 Debian package builds, containing over 800 million lines of code, as well as bioinformatics and machine learning workflows. We show that, while software in each of these domains is initially irreproducible, DetTrace brings reproducibility without requiring any hardware, OS or application changes. DetTrace’s performance is dictated by the frequency of system calls: IO-intensive software builds have an average overhead of 3.49x, while a compute-bound bioinformatics workflow is under 2%.
Hurdle: Securing Jump Instructions Against Code Reuse AttacksHurdle: Securing Jump Instructions Against Code Reuse Attacks
Christian DeLozier, Kavya Lakshminarayanan, Gilles Pokam and Joseph Devietti

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '20), March 2020

[abstract][paper]

Code-reuse attacks represent the state-of-the-art in exploiting memory safety vulnerabilities. Control-flow integrity techniques offer a promising direction for preventing code-reuse attacks, but these attacks are resilient against imprecise and heuristic-based detection and prevention mechanisms.
In this work, we propose a new context-sensitive controlflow integrity system (HURDLE) that guarantees pairwise gadgets cannot be chained in a code-reuse attack. HURDLE improves upon prior techniques by using SMT constraint solving to ensure that indirect control flow transfers cannot be maliciously redirected to execute gadget chains. At the same time, HURDLE’s security policy is flexible enough that benign executions are only rarely mischaracterized as malicious. When such mischaracterizations occur, HURDLE can generalize its constraint solving to avoid these mischaracterizations at low marginal cost.
We propose architecture extensions for HURDLE which consist of an extended branch history register and new instructions. Thanks to its hardware support, HURDLE enforces a context-sensitive control-flow integrity policy with <1% average runtime overhead.
Lazy Determinism for Faster Deterministic MultithreadingLazy Determinism for Faster Deterministic Multithreading
Timothy Merrifield, Sepideh Roghanchi, Joseph Devietti and Jakob Eriksson

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '19), April 2019

[abstract][paper]
Block-Size Independence for GPU ProgramsBlock-Size Independence for GPU Programs
Rajeev Alur, Joseph Devietti and Nimit Singhania

Static Analysis Symposium (SAS '18), August 2018

[abstract][paper]

Radhia Cousot Young Researcher Best Paper Award

Optimizing GPU programs by tuning execution parameters is essential to realizing the full performance potential of GPU hardware. However, many of these optimizations do not ensure correctness and subtle errors can enter while optimizing a GPU program. Further, lack of formal models and the presence of non-trivial transformations prevent verification of optimizations.
In this work, we verify transformations involved in tuning the execution parameter, block-size. First, we present a formal programming and execution model for GPUs, and then formalize block-size independence of GPU programs, which ensures tuning block-size preserves program semantics. Next, we present an inter-procedural analysis to verify block-size independence for synchronization-free GPU programs. Finally, we evaluate the analysis on the Nvidia CUDA SDK samples, where 35 global kernels are verified to be block-size independent.
CURD: A Dynamic CUDA Race DetectorCURD: A Dynamic CUDA Race Detector
Yuanfeng Peng, Vinod Grover and Joseph Devietti

ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '18), June 2018

[abstract][paper]

As GPUs have become an integral part of nearly every processor, GPU programming has become increasingly popular. GPU programming requires a combination of extreme levels of parallelism and low-level programming, making it easy for concurrency bugs such as data races to arise. These concurrency bugs can be extremely subtle and di cult to debug due to the massive numbers of threads running concurrently on a modern GPU. While some tools exist to detect data races in GPU programs, they are often prohibitively slow or focused only on a small class of data races in shared memory. Compared to prior work, our race detector, CURD, can detect data races precisely on both shared and global memory, selects an appropriate race detection algorithm based on the synchronization used in a program, and utilizes efficient compiler instrumentation to reduce performance overheads. Across 53 benchmarks, we find that using CURD incurs an average slowdown of just 2.88x over native execution. CURD is 2.1x faster than Nvidia’s CUDA-Racecheck race detector, despite detecting a much broader class of races. CURD finds 35 races across our benchmarks, including bugs in established benchmark suites and in sample programs from Nvidia.
SlimFast: Reducing Metadata Redundancy in Sound and Complete Dynamic Data Race DetectionSlimFast: Reducing Metadata Redundancy in Sound and Complete Dynamic Data Race Detection
Yuanfeng Peng, Christian DeLozier, Ariel Eizenberg, William Mansky and Joseph Devietti

IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018

[abstract][paper][code]

Data races are one of the main culprits behind the complexity of multithreaded programming. Existing data race detectors require large amounts of metadata for each program variable to perform their analyses. The SlimFast system exploits the insight that there is a large amount of redundancy in this metadata: many program variables often have identical metadata state. By sharing metadata across variables, a large reduction in space usage can be realized. SlimFast uses immutable metadata to safely support metadata sharing across threads while also accelerating concurrency control. SlimFast’s lossless metadata compression achieves these benefits while preserving soundness and completeness.
Across a range of benchmarks from Java Grande, DaCapo, NAS Parallel Benchmarks and Oracle’s BerkeleyDB Java Edition, SlimFast is able to reduce memory consumption by 1.83x on average, and up to 4.90x for some benchmarks, compared to the state-of-the-art FastTrack system. By improving cache locality and simplifying concurrency control, SlimFast also accelerates data race detection by 1.40x on average, and up to 8.8x for some benchmarks, compared to FastTrack.
SOFRITAS: Serializable Ordering-Free Regions for Increasing Thread Atomicity ScalablySOFRITAS: Serializable Ordering-Free Regions for Increasing Thread Atomicity Scalably
Christian DeLozier, Ariel Eizenberg, Brandon Lucia and Joseph Devietti

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '18), March 2018

[abstract][paper]

Correctly synchronizing multithreaded programs is challenging and errors can lead to program failures such as atomicity violations. Existing strong memory consistency models rule out some possible failures, but are limited by depending on programmer-defined locking code. We present the new Ordering-Free Region (OFR) serializability consistency model that ensures atomicity for OFRs, which are spans of dynamic instructions between consecutive ordering constructs (e.g., barriers), without breaking atomicity at lock operations. Our platform, Serializable Ordering-Free Regions for Increasing Thread Atomicity Scalably (SOFRITAS), ensures a C/C++ program’s execution is equivalent to a serialization of OFRs by default. We build two systems that realize the SOFRITAS idea: a concurrency bug finding tool for testing called SofriTest, and a production runtime system called SoPro.
SofriTest uses OFRs to find concurrency bugs, including a multi-critical-section atomicity violation in memcached that weaker consistency models will miss. If OFRs are too coarse-grained, SofriTest suggests refinement annotations automatically. Our software-only SoPro implementation has high performance, scales well with increased parallelism, and prevents failures despite bugs in locking code. SoPro has an average overhead of just 1.59x compared to pthreads, despite pthreads’ much weaker memory model.
Race Detection and Reachability in Nearly Series-Parallel DAGsRace Detection and Reachability in Nearly Series-Parallel DAGs
Jeremy Fineman, Kunal Agrawal, Joseph Devietti, I-Ting Angelina Lee, Robert Utterback and Changming Xu

ACM-SIAM Symposium on Discrete Algorithms (SODA '18), January 2018

[abstract][paper]

A program is said to have a determinacy race if logically parallel parts of a program access the same memory location and one of the accesses is a write. These races are generally bugs in the program, as different schedules of the program can lead to different results. Most prior work on detecting these races focuses on a subclass of programs with series-parallel or nested parallelism. This paper presents a race-detection algorithm for detecting races in a more general class of programs, namely programs that include arbitrary ordering constraints in additional to the series-parallel constructs. Our race-detection algorithm performs a serial execution of the program, augmented to detect races, in O(T₁ + k²) time, where T₁ is the sequential running time of the original program and k is the number of non series-parallel constraints. The main technical novelty is a new data structure for answering reachability queries in nearly series-parallel (SP) directed acyclic graphs (DAGs). Given as input a graph comprising an n-node series parallel graph and k additional non-SP edges, the total construction time of the data structure is O(n + k²), and each reachability query can be answered in O(1) time. The data structure is traversally incremental, meaning that it supports the insertion of nodes/edges, but only as they are discovered through a graph traversal.
Monadic composition for deterministic, parallel batch processingMonadic composition for deterministic, parallel batch processing
Ryan Scott, Omar Navarro Leija, Joseph Devietti and Ryan Newton

ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '17), October 2017

[abstract][paper]

Achieving determinism on real software systems remains difficult. Everyday code interacts with a wide array of nondeterministic sources, including those internal to the system (OS system calls, CPU instructions, other processes), external to the system (time, date, network communication), and arising from software abstractions (concurrency, threaded execution, data races). Nondeterminism complicates many tasks, from achieving reliable software builds across systems to creating reproducible scientific results. Existing approaches to determinism enforcement assume source code is available or require changing the operating system. Several approaches have high overhead as runtime monitoring causes performance to suffer.
In this work we present DetFlow, a framework for writing and deploying new and legacy software which guarantees determinism. DetFlow uses a novel approach combining static, language-level guarantees with a lightweight runtime enforcement system. Applications that leverage DetFlow must have an entrypoint that lives in the DetIO monad, a type which requires all operations —- including I/O -— be deterministic. Furthermore, DetFlow allows the execution of arbitrary code not written in this framework by executing it in a determinizing runtime. This allows for batch processing tasks to be composed of otherwise untrusted external tasks in a way that assures correctness. Combining support for deterministic parallelism, filesystem access, and logging, DetFlow is an ideal platform for writing scripted workflows that process large data sets simultaneously. We show several use cases of DetFlow by applying it to bioinformatics data pipelines and software build systems. Our evaluation shows we can determinize existing software with minimal modifications, while preserving performance and exploiting software parallelism. We show that DetFlow makes it easier to discover nondeterminism and data races sooner, as DetFlow forces programmers to get reproducibility and parallelism right from the onset.
TMI: Thread Memory Isolation for False Sharing RepairTMI: Thread Memory Isolation for False Sharing Repair
Christian DeLozier, Ariel Eizenberg, Shiliang Hu, Gilles Pokam and Joseph Devietti

ACM IEEE International Symposium on Microarchitecture (MICRO '17), October 2017

[abstract][paper]

Cache contention in the form of false sharing and true sharing arises when threads overshare cache lines at high frequency. Such oversharing can reduce or negate the performance benefits of parallel execution. Prior systems for detecting and repairing cache contention lack efficiency in detection or repair, contain subtle memory consistency flaws, or require invasive changes to the program environment.
In this paper, we introduce a new way to combat cache line oversharing via the Thread Memory Isolation (TMI) system. TMI operates completely in userspace, leveraging performance counters and the Linux ptrace mechanism to tread lightly on monitored applications, intervening only when necessary. TMI’s compatible-by-default design allows it to scale to real-world workloads, unlike previous proposals. TMI introduces a novel code-centric consistency model to handle cross-language memory consistency issues. TMI exploits the flexibility of code-centric consistency to efficiently repair false sharing while preserving strong consistency model semantics when necessary.
TMI has minimal impact on programs without oversharing, slowing their execution by just 2% on average. We also evaluate TMI on benchmarks with known false sharing, and manually inject a false sharing bug into the leveldb key-value store from Google. For these programs, TMI provides an average speedup of 5.2x and achieves 88% of the speedup possible with manual source code fixes.
PARSNIP: Performant Architecture for Race Safety with No Impact on PrecisionPARSNIP: Performant Architecture for Race Safety with No Impact on Precision
Yuanfeng Peng, Ben Wood and Joseph Devietti

ACM IEEE International Symposium on Microarchitecture (MICRO '17), October 2017

[abstract][paper]

Data race detection is a useful dynamic analysis for multithreaded programs that is a key building block in record-and-replay, enforcing strong consistency models, and detecting concurrency bugs. Existing software race detectors are precise but slow, and hardware support for precise data race detection relies on assumptions like type safety that many programs violate in practice.
We propose PARSNIP, a fully precise hardware-supported data race detector. PARSNIP exploits new insights into the redundancy of race detection metadata to reduce storage overheads. PARSNIP also adopts new race detection metadata encodings that accelerate the common case while preserving soundness and completeness. When bounded hardware resources are exhausted, PARSNIP falls back to a software race detector to preserve correctness. PARSNIP does not assume that target programs are type safe, and is thus suitable for race detection on arbitrary code.
Our evaluation of PARSNIP on several PARSEC benchmarks shows that it incurs performance overheads from negligible to 2.6x, with an average overhead of just 1.5x. Moreover, PARSNIP outperforms the state-of-the-art RADISH hardware race detector by 4.6x.
GPUDrano: Detecting Uncoalesced Accesses in GPU ProgramsGPUDrano: Detecting Uncoalesced Accesses in GPU Programs
Rajeev Alur, Joseph Devietti, Omar Navarro Leija and Nimit Singhania

International Conference on Computer-Aided Verification (CAV '17), July 2017

[abstract][paper]

Invited journal paper in Formal Methods in System Design [article]

Graphics Processing Units (GPUs) have become widespread and popular over the past decade. Fully utilizing the parallel compute and memory resources that GPUs present remains a significant challenge, however. In this paper, we describe GPUDrano: a scalable static analysis that detects uncoalesced global memory accesses in CUDA programs. Uncoalesced global memory accesses arise when a GPU program accesses DRAM in an ill-structured way, increasing latency and energy consumption. We formalize the GPUDrano static analysis and compare it empirically against a dynamic analysis to demonstrate that false positives are rare for most programs. We implement GPUDrano in LLVM and show that it can run on GPU programs of over a thousand lines of code. GPUDrano finds 133 of the 143 uncoalesced static memory accesses in the popular Rodinia GPU benchmark suite, demonstrating the precision of our implementation. Fixing these bugs leads to real performance improvements of up to 25%.
BARRACUDA: Binary-level Analysis of Runtime RAces in CUDA programsBARRACUDA: Binary-level Analysis of Runtime RAces in CUDA programs
Ariel Eizenberg, Yuanfeng Peng, Toma Pigli, William Mansky and Joseph Devietti

ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '17), June 2017

[abstract][paper]

GPU programming models enable and encourage massively parallel programming with over a million threads, requiring extreme parallelism to achieve good performance. Massive parallelism brings significant correctness challenges by increasing the possibility for bugs as the number of thread interleavings balloons. Conventional dynamic safety analyses struggle to run at this scale.
We present Barracuda, a data race detector for GPU programs written in Nvidia’s CUDA language. Barracuda handles a wider range of parallelism constructs than previous work, including branch operations, low-level atomics and memory fences, which allows Barracuda to detect new classes of races. Barracuda operates at the binary level for increased compatibility with existing code, leveraging a new binary instrumentation framework that is extensible to other dynamic analyses. Barracuda incorporates a number of novel optimizations that are crucial for scaling data race detection to over a million threads.
Remix: Online Detection and Repair of Cache Contention for the JVMRemix: Online Detection and Repair of Cache Contention for the JVM
Ariel Eizenberg, Shiliang Hu, Gilles Pokam and Joseph Devietti

ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16), June 2016

[abstract][paper][slides: pdf pptx]

As ever more computation shifts onto multicore architectures, it is increasingly critical to find effective ways of dealing with multithreaded performance bugs like true and false sharing. Previous approaches to fixing false sharing in unmanaged languages have had to resort to highly-invasive runtime program modification. We observe that managed language runtimes, with garbage collection and JIT code compilation, present unique opportunities to repair such bugs directly, mirroring the techniques used in manual repairs.
We present Remix, a modified version of the Oracle HotSpot JVM which can detect cache contention bugs and repair false sharing at runtime. Remix’s detection mechanism leverages recent performance counter improvements on Intel platforms, which allow for precise, unobtrusive monitoring of cache contention at the hardware level. Remix can detect and repair known false sharing issues in the LMAX Disruptor high-performance inter-thread messaging library and the Spring Reactor event-processing framework, automatically providing 1.5-2x speedups over unoptimized code and matching the performance of hand-optimization. Remix also finds a new false sharing bug in SPECjvm2008, and uncovers a true sharing bug in the HotSpot JVM that, when fixed, improves the performance of three NAS Parallel Benchmarks by 7-25x. Remix incurs no statistically-significant performance overhead on other benchmarks that do not exhibit cache contention, making Remix practical for always-on use.
LASER: Light, Accurate Sharing dEtection and RepairLASER: Light, Accurate Sharing dEtection and Repair
Liang Luo, Akshitha Sriraman, Brooke Fugate, Shiliang Hu, Gilles Pokam, Chris Newburn and Joseph Devietti

IEEE International Symposium on High Performance Computer Architecture (HPCA '16), March 2016

[abstract][paper][slides: pdf pptx][data]

Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and can even arise invisibly in the program due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime system environment.
This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware.
Co-Design of Anytime Computation and Robust ControlCo-Design of Anytime Computation and Robust Control
Yash Pant, Kartik Mohta, Houssam Abbas, Truong X. Nghiem, Joseph Devietti and Rahul Mangharam

IEEE Real-Time Systems Symposium (RTSS '15), December 2015

[abstract][paper][appendix]

Control software of autonomous robots has stringent real-time requirements that must be met to achieve the control objectives. One source of variability in the performance of a control system is the execution time and accuracy of the state estimator that provides the controller with state information. This estimator is typically perception-based (e.g., Computer Vision-based) and is computationally expensive. When the computational resources of the hardware platform become overloaded, the estimation delay can compromise control performance and even stability. In this paper, we define a framework for co-designing anytime estimation and control algorithms, in a manner that accounts for implementation issues like delays and inaccuracies. We construct an anytime perception-based estimator from standard off-the-shelf Computer Vision algorithms, and show how to obtain a trade-off curve for its delay vs estimate error behavior. We use this anytime estimator in a controller that can use this trade- off curve at runtime to achieve its control objectives at a reduced energy cost. When the estimation delay is too large for correct operation, we provide an optimal manner in which the controller can use this curve to reduce estimation delay at the cost of higher inaccuracy, all the while guaranteeing basic objectives are met. We illustrate our approach on an autonomous hexrotor and demonstrate its advantage over a system that does not exploit co-design.
High-Performance Determinism with Total Store Order ConsistencyHigh-Performance Determinism with Total Store Order Consistency
Timothy Merrifield, Joseph Devietti and Jakob Eriksson

European Conference on Computer Systems (EuroSys '15), April 2015

[abstract][paper]

We present Consequence, a deterministic multi-threading library. Consequence achieves deterministic execution via store buffering and strict ordering of synchronization operations. To ensure high performance under a wide variety of conditions, the ordering of synch operations is based on a deterministic clock, and store buffering is implemented using version-controlled memory.
Recent work on deterministic concurrency has proposed relaxing the consistency model beyond total store order (TSO). Through novel optimizations, Consequence achieves the same or better performance on the Phoenix, PARSEC and SPLASH-2 benchmark suites, while retaining TSO memory consistency. Across 19 benchmark programs, Consequence incurs a worst-case slowdown of 3.9× vs. pthreads, with 14 out of 19 programs at or below 2.5×. We believe this performance improvement takes parallel programming one step closer to "determinism by default".
GPUDet: A Deterministic GPU ArchitectureGPUDet: A Deterministic GPU Architecture
Hadi Jooybar, Wilson W. L. Fung, Mike O'Connor, Joseph Devietti and Tor Aamodt

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '13), March 2013

[abstract][paper][code]

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.
Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.
Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
RADISH: Always-On Sound and Complete Race Detection in Software and HardwareRADISH: Always-On Sound and Complete Race Detection in Software and Hardware
Joseph Devietti, Ben Wood, Karin Strauss, Luis Ceze, Dan Grossman and Shaz Qadeer

International Symposium on Computer Architecture (ISCA '12), June 2012

[abstract][paper][slides: key pdf][tech report with proofs][code][data]

Data-race freedom is a valuable safety property for multithreaded programs that helps with catching bugs, simplifying memory consistency model semantics, and verifying and enforcing both atomicity and determinism. Unfortunately, existing software-only race detectors are precise but slow; proposals with hardware support offer higher performance but are imprecise. Both precision and performance are necessary to achieve the many advantages always-on race detection could provide.
To resolve this trade-off, we propose RADISH, a hybrid hardware-software race detector that is always-on and fully precise. In RADISH, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that RADISH is correct by proving equivalence to a conventional happens-before race detector.
Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. RADISH can furthermore leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core RADISH processor using PARSEC benchmarks shows runtime overheads from negligible to 2x. Furthermore, RADISH outperforms the leading software-only race detector by 2x-37x.
RCDC: A Relaxed-Consistency Deterministic ComputerRCDC: A Relaxed-Consistency Deterministic Computer
Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze and Dan Grossman

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '11), March 2011

[abstract][paper][slides: pdf pptx][simulator code][compiler code][code readme][data]

Providing deterministic execution significantly simplifies the debugging, testing, replication, and deployment of multithreaded programs. Recent work has developed deterministic multiprocessor architectures as well as compiler and runtime systems that enforce determinism in current hardware. Such work has incidentally imposed strong memory-ordering properties. Historically, memory ordering has been relaxed in favor of higher performance in shared memory multiprocessors and, interestingly, determinism exacerbates the cost of strong memory ordering. Consequently, we argue that relaxed memory ordering is vital to achieving faster deterministic execution.
This paper introduces RCDC, a deterministic multiprocessor architecture that takes advantage of relaxed memory orderings to provide high-performance deterministic execution with low hardware complexity. RCDC has two key innovations: a hybrid HW/SW approach to enforcing determinism; and a new deterministic execution strategy that leverages data-race-free-based memory models (e.g., the models for Java and C++) to improve performance and scalability without sacrificing determinism, even in the presence of races. In our hybrid HW/SW approach, the only hardware mechanisms required are software-controlled store buffering and support for precise instruction counting; we do not require speculation. A runtime system uses these mechanisms to enforce determinism for arbitrary programs.

We evaluate RCDC using PARSEC benchmarks and show that relaxing memory ordering leads to performance and scalability close to nondeterministic execution without requiring any form of speculation. We also compare our new execution strategy to one based on TSO (total-store-ordering) and show that some applications benefit significantly from the extra relaxation. We also evaluate a software-only implementation of our new deterministic execution strategy.
CoreDet: A Compiler and Runtime System for Deterministic Multithreaded ExecutionCoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution
Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze and Dan Grossman

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '10), March 2010

[abstract][paper][slides: pdf ppt][code]

The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically.
A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually.
Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.
DMP: Deterministic Shared Memory MultiprocessingDMP: Deterministic Shared Memory Multiprocessing
Joseph Devietti, Brandon Lucia, Luis Ceze and Mark Oskin

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '09), March 2009

[abstract][paper][slides: pdf ppt+flash][slides readme]

Selected for IEEE Micro Top Picks '09 [article]

Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
Atom-Aid: Surviving and Detecting Atomicity ViolationsAtom-Aid: Surviving and Detecting Atomicity Violations
Brandon Lucia, Joseph Devietti, Karin Strauss and Luis Ceze

International Symposium on Computer Architecture (ISCA '08), June 2008

[abstract][paper][slides: ppt]

Selected for IEEE Micro Top Picks '08 [article]

Writing shared-memory parallel programs is error-prone. Among the concurrency errors that programmers often face are atomicity violations, which are especially challenging. They happen when programmers make incorrect assumptions about atomicity and fail to enclose memory accesses that should occur atomically inside the same critical section. If these accesses happen to be interleaved with conflicting accesses from different threads, the program might behave incorrectly.
Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce memory ordering at a coarse grain. This provides what we call implicit atomicity, as the atomic blocks are not derived from explicit program annotations. In this paper, we make the fundamental observation that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. We then propose Atom-Aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves. Atom-Aid is also able to report where atomicity violations might exist in the code, providing resilience and debuggability. We evaluate Atom-Aid using buggy code from applications including Apache, MySQL, and XMMS, showing that Atom-Aid virtually eliminates the manifestation of atomicity violations.
HardBound: Architectural Support for Spatial Safety of the C Programming LanguageHardBound: Architectural Support for Spatial Safety of the C Programming Language
Joseph Devietti, Colin Blundell, Milo Martin and Steve Zdancewic

International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '08), March 2008

[abstract][paper][slides: pdf pptx][video]

The C programming language is at least as well known for its absence of spatial memory safety guarantees (i.e., lack of bounds checking) as it is for its high performance. C's unchecked pointer arithmetic and array indexing allow simple programming mistakes to lead to erroneous executions, silent data corruption, and security vulnerabilities. Many prior proposals have tackled enforcing spatial safety in C programs by checking pointer and array accesses. However, existing software-only proposals have significant drawbacks that may prevent wide adoption, including: unacceptably high runtime overheads, lack of completeness, incompatible pointer representations, or need for non-trivial changes to existing C source code and compiler infrastructure.
Inspired by the promise of these software-only approaches, this paper proposes a hardware bounded pointer architectural primitive that supports cooperative hardware/software enforcement of spatial memory safety for C programs. This bounded pointer is a new hardware primitive datatype for pointers that leaves the standard C pointer representation intact, but augments it with bounds information maintained separately and invisibly by the hardware. The bounds are initialized by the software, and they are then propagated and enforced transparently by the hardware, which automatically checks a pointer's bounds before it is dereferenced. One mode of use requires instrumenting only malloc, which enables enforcement of per-allocation spatial safety for heap-allocated objects for existing binaries. When combined with simple intra-procedural compiler instrumentation, hardware bounded pointers enable a low-overhead approach for enforcing complete spatial memory safety in unmodified C programs.
Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional MemoryMaking the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory
Colin Blundell, Joseph Devietti, E Christopher Lewis and Milo Martin

International Symposium on Computer Architecture (ISCA '07), June 2007

[abstract][paper][slides: pdf ppt]

Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a generalpurpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex.
This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible. Second, we propose ONETM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system.
The permissions-only cache efficiently maintains the coherence permissions--but not data--for blocks read or written transactionally that have been evicted from the processor's caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block. ONETM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In ONETM-Serialized, an overflowed transaction simply stalls all other threads in the application. In ONETM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining ONETM's core simplifying assumption.

Journal Papers

Online Code Layout Optimizations via OCOLOSOnline Code Layout Optimizations via OCOLOS
Yuxuan Zhang, Tanvir Ahmed Khan, Gilles Pokam, Baris Kasikci, Heiner Litz and Joseph Devietti

IEEE Micro, Vol. 43 No. 4, July 2023

[abstract][paper]
Static detection of uncoalesced accesses in GPU programsStatic detection of uncoalesced accesses in GPU programs
Rajeev Alur, Joseph Devietti, Omar Navarro Leija and Nimit Singhania

Formal Methods in System Design, March 2021

[abstract][paper]

GPU programming has become popular due to the high computational capabilities of GPUs. Obtaining significant performance gains with GPU is however challenging and the programmer needs to be aware of various subtleties of the GPU architecture. One such subtlety lies in accessing GPU memory, where certain access patterns can lead to poor performance. Such access patterns are referred to as uncoalesced global memory accesses. This work presents a light-weight compile-time static analysis to identify such accesses in GPU programs. The analysis relies on a novel abstraction which tracks the access pattern across multiple threads. The abstraction enables quick prediction while providing correctness guarantees. We have implemented the analysis in LLVM and compare it against a dynamic analysis implementation. The static analysis identifies 95 pre-existing uncoalesced accesses in Rodinia, a popular benchmark suite of GPU programs, and finishes within seconds for most programs, in comparison to the dynamic analysis which finds 69 accesses and takes orders of magnitude longer to finish.
Anytime Computation and Control for Autonomous SystemsAnytime Computation and Control for Autonomous Systems
Yash Pant, Houssam Abbas, Kartik Mohta, Rhudii Quaye, Truong X. Nghiem, Joseph Devietti and Rahul Mangharam

IEEE Transactions on Control Systems Technology, March 2021

[abstract][paper]

The correct and timely completion of the sensing and action loop is of utmost importance in safety critical autonomous systems. Crucial to the performance of this feedback control loop are the computation time and accuracy of the estimator which produces state estimates used by the controller. These state estimators often use computationally expensive perception algorithms like visual feature tracking. With on-board computers on autonomous robots being computationally limited, the computation time of such an estimation algorithm can at times be high enough to result in poor control performance. We develop a framework for codesign of anytime estimation and robust control algorithms, taking into account computation delays and estimation inaccuracies. This is achieved by constructing an anytime estimator from an off-the-shelf perception-based estimation algorithm and obtaining a trade-off curve for its computation time versus estimation error. This is used in the design of a robust predictive control algorithm that at run-time decides a contract, or operation mode, for the estimator in addition to controlling the dynamical system to meet its control objectives at a reduced computation energy cost. This codesign provides a mechanism through which the controller can use the tradeoff curve to reduce estimation delay at the cost of higher inaccuracy, while guaranteeing satisfaction of control objectives. Experiments on a hexrotor platform running a visual-based algorithm for state estimation show how our method results in up to a 10% improvement in control performance while simultaneously saving 5%-6% in computation energy as compared to a method that does not leverage the codesign.
Monadic composition for deterministic, parallel batch processingMonadic composition for deterministic, parallel batch processing
Ryan Scott, Omar Navarro Leija, Joseph Devietti and Ryan Newton

Proceedings of the ACM on Programming Languages, alternate version of the OOPSLA 2017 paper, October 2017

[abstract][paper]

Achieving determinism on real software systems remains difficult. Even a batch-processing job, whose task is to map input bits to output bits, risks nondeterminism from thread scheduling, system calls, CPU instructions, and leakage of environmental information such as date or CPU model. In this work, we present a system for achieving low-overhead deterministic execution of batch-processing programs that read and write the file system—turning them into pure functions on files.
We allow multi-process executions where a permissions system prevents races on the file system. Process separation enables different processes to enforce permissions and enforce determinism using distinct mechanisms. Our prototype, DetFlow, allows a statically-typed coordinator process to use shared-memory parallelism, as well as invoking process-trees of sandboxed legacy binaries. DetFlow currently implements the coordinator as a Haskell program with a restricted I/O type for its main function: a new monad we call DetIO. Legacy binaries launched by the coordinator run concurrently, but internally each process schedules threads sequentially, allowing dynamic determinism-enforcement with predictably low overhead.
We evaluate DetFlow by applying it to bioinformatics data pipelines and software build systems. DetFlow enables determinizing these data-processing workflows by porting a small amount of code to become a statically-typed coordinator. This hybrid approach of static and dynamic determinism enforcement permits freedom where possible but restrictions where necessary.
DMP: Deterministic Shared-Memory MultiprocessingDMP: Deterministic Shared-Memory Multiprocessing
Joseph Devietti, Brandon Lucia, Luis Ceze and Mark Oskin

IEEE Micro, Vol. 30 No. 1, January 2010

[abstract][paper][journal site]

Shared-memory multicore and multiprocessor systems are nondeterministic, which frustrates debugging and complicates testing of multithreaded code, impeding parallel programming's widespread adoption. The authors propose fully deterministic shared-memory multiprocessing that not only enhances debugging by offering repeatability by default, but also improves the quality of testing and the deployment of production code. They show that determinism can be provided with little performance cost on future hardware.
Atom-Aid: Detecting and Surviving Atomicity ViolationsAtom-Aid: Detecting and Surviving Atomicity Violations
Brandon Lucia, Joseph Devietti, Luis Ceze and Karin Strauss

IEEE Micro, Vol. 29 No. 1, January 2009

[abstract][paper][journal site]

Hardware can play a significant role in improving reliability of multithreaded software. Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce coarse-grained memory ordering, providing implicit atomicity. The authors of this article observe that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. They propose Atom-Aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves.

Workshop Papers

Verifying Dynamic Race DetectionVerifying Dynamic Race Detection
William Mansky, Yuanfeng Peng, Steve Zdancewic and Joseph Devietti

Certified Programs and Proofs (CPP '17), co-located with POPL 2017, January 2017

[abstract][paper]
MAMA: Mostly Automatic Management of AtomicityMAMA: Mostly Automatic Management of Atomicity
Christian DeLozier, Joseph Devietti and Milo Martin

Workshop on Determinism and Correctness in Parallel Programming (WoDet '14), held in conjunction with ASPLOS '14, March 2014

[abstract][paper][slides: pdf ppt]

Correctly synchronizing the parallel execution of tasks remains one of the most difficult aspects of parallel programming. Without proper synchronization, many kinds of subtle concurrency errors can arise and cause a program to produce intermittently wrong results. The long-term goal of this work is to design a system that automatically synchronizes a set of programmer-specified, partially-independent parallel tasks. We present here our progress on the MAMA (Mostly Automatic Management of Atomicity) system, which can infer much but not all of this synchronization. MAMA provides a safety guarantee that a program either executes in a correctly atomic fashion or it deadlocks. Initial experiments indicate that MAMA can semi-automatically provide atomicity for a set of Java benchmarks while still allowing parallel execution.
The Case For Merging Execution- and Language-level Determinism with MELDThe Case For Merging Execution- and Language-level Determinism with MELD
Joseph Devietti, Dan Grossman and Luis Ceze

Workshop on Determinism and Correctness in Parallel Programming (WoDet '12), held in conjunction with ASPLOS '12, March 2012

[abstract][paper][slides: key pdf]

Nondeterminism is a key contributor to the difficulty of parallel programming. Many research projects have shown how to provide deterministic parallelism, but with unfortunate trade-offs. Deterministic execution enforces determinism for arbitrary programs but with significant runtime cost, while deterministic languages enforce determinism statically (without runtime overhead) but only for fork-join programs expressible in their static type systems.
MELD unifies these approaches. We explain the requirements for soundly integrating a deterministic language into a deterministic execution system, and describe a simple qualifier-based type checker that ensures isolation for code written in a deterministic language. We also extend MELD to incorporate nondeterministic operations without compromising the determinism of the rest of the program. Our experiments with benchmarks from the SPLASH2 and PARSEC suites show that a small number of annotations can accelerate the performance of deterministic versions of these programs by 2-6x.
The Deterministic Execution Hammer: How Well Does it Actually Pound Nails?The Deterministic Execution Hammer: How Well Does it Actually Pound Nails?
Tom Bergan, Joseph Devietti, Nicholas Hunt and Luis Ceze

Workshop on Determinism and Correctness in Parallel Programming (WoDet '11), held in conjunction with ASPLOS '11, March 2011

[abstract][paper]

This paper takes a critical look at the benefits provided by state-of-the-art deterministic execution techniques. Specifically, we look at four applications of deterministic execution: debugging, fault-tolerant replication, testing, and security. For each application, we discuss what an ideal system would provide, and then look at how deterministic systems compare to the ideal. Further, we discuss alternative approaches, not involving determinism, and we judge whether or not these alternatives are more suitable. Along the way, we identify open questions and suggest future work.
Ultimately, we find that there are competitive alternatives to determinism for debugging and replicating multithreaded programs; that determinism has high, though unproven, potential to improve testing; and that determinism has distinct security benefits in eliminating some covert timing channels. Furthermore, determinism is a unified solution for all four applications: this confers a distinct advantage over point solutions that do not compose well with one another.
Lock PredictionLock Prediction
Brandon Lucia, Joseph Devietti, Tom Bergan, Luis Ceze and Dan Grossman

USENIX Workshop on Hot Topics in Parallelism (HotPar '10), accepted for poster session, June 2010

[abstract][paper]

This paper makes the case for studying and exploiting the predictability of lock operations in multithreaded programs. We discuss several uses of lock prediction and identify the two key events we believe are useful to predict: (1) which thread will be the next one to acquire a given lock and (2) which lock a given thread will acquire next. We describe multiple prediction algorithms and show that while their accuracy varies widely, they are likely accurate enough for many uses.
The Case for System Support for Concurrency ExceptionsThe Case for System Support for Concurrency Exceptions
Luis Ceze, Joseph Devietti, Brandon Lucia and Shaz Qadeer

USENIX Workshop on Hot Topics in Parallelism (HotPar '09), March 2009

[abstract][paper]

In this position paper we argue that concurrency errors should be fail-stop. We want to put concurrency errors in the same category as division-by-zero, segmentation fault in unmanaged languages and cast exceptions in managed languages. This would make nondeterminism in multithreaded execution be much more manageable. Concurrency exceptions would improve the debugging process during development and make crashes due to concurrency errors that happen in the field be more descriptive. Our goal in this paper is to justify our position, propose a general approach to concurrency exceptions and discuss system requirements and implications. Specifically, we discuss the semantics of concurrency exceptions at the language level, their implications in the compiler and run-time systems, how they should be delivered and, finally, how they are enabled by efficient architecture support.
Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic!Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic!
Joseph Devietti, Brandon Lucia, Luis Ceze and Mark Oskin

Workshop on Software and Hardware Challenges of Manycore Platforms (SHCMP '08), held in conjunction with ISCA '08, June 2008

[abstract][paper][slides: odp pdf ppt]

The root of all (or most) evil in programming shared memory multiprocessors is that execution is not deterministic. Bugs are hard to find and non-repeatable. This makes debugging a nightmare and gives little assurance in the testing - there is no way to know how the program will behave in a different environment. Testing and debugging is already difficult with a single thread on uniprocessors. Pervasive parallel programs and chip multiprocessors will make this difficulty worse, and more widespread throughout our industry.
In this paper we make the case for fully deterministic shared memory multiprocessing. The idea is to make memory interleaving fully deterministic, in contrast to past approaches of simply replaying an execution based on a memory interleaving log. This causes the execution of a parallel program to be only function of its inputs, making a parallel program effectively behave like a sequential program. We show that determinism can be provided with reasonable performance cost and we also discuss the benefits. Finally, we propose and evaluate a range of implementation approaches.

Posters

SlimFast: Reducing Metadata Redundancy in Sound & Complete Dynamic Data Race DetectionSlimFast: Reducing Metadata Redundancy in Sound & Complete Dynamic Data Race Detection
Yuanfeng Peng and Joseph Devietti

PLDI Student Research Competition (PLDI SRC '15), held in conjunction with PLDI '15, June 2015

[abstract][poster]

Dynamic race detectors that are sound and complete often incur too much memory overhead to be practical in many scenarios; to address this issue, we present SlimFast, a sound-and-complete race detector that significantly reduces the memory overhead, with comparable performance to a state-of-the-art race detector. According to experiments on DaCapo and Grande benchmarks, SlimFast brings up to 72% space savings and a 21% speedup over FastTrack.

Selected Talks

Feedback-Driven ProcessorsFeedback-Driven Processors
ASPLOS PC Workshop, University of Washington, 14 November 2019

[abstract]
Modern processor architectures are complex machines due to decades' worth of performance optimizations. This complexity can lead to subtle performance issues at the microarchitectural level that are very difficult to diagnose. Fortunately, modern processors also contain a rich set of performance monitoring facilities that enable introspection of many aspects of program execution. My research goal is to build automatic systems that can diagnose and repair performance issues in running programs, relieving programmers of the burden of performance tuning.
I'll discuss our work on a series of hardware-software systems for C/C++ and Java that use performance counters to diagnose false sharing bugs in multicore programs, and then perform automatic online repair of these bugs to improve performance through careful consideration of the memory consistency model. These repair techniques are able to achieve much of the benefit of hand-written expert fixes, but require no programmer intervention. In ongoing work we are applying the feedback-driven processor methodology to the problem of front-end stalls in the datacenter. I'll also discuss related efforts to improve performance and correctness via static and dynamic analysis of GPU programs and via deterministic execution systems.
Feedback-Driven ProcessorsFeedback-Driven Processors
UIUC, 28 October 2019

[abstract]
Modern processor architectures are complex machines due to decades' worth of performance optimizations. This complexity can lead to subtle performance issues at the microarchitectural level that are very difficult to diagnose. Fortunately, modern processors also contain a rich set of performance monitoring facilities that enable introspection of many aspects of program execution. My research goal is to build automatic systems that can diagnose and repair performance issues in running programs, relieving programmers of the burden of performance tuning.
I'll discuss our work on a series of hardware-software systems for C/C++ and Java that use performance counters to diagnose false sharing bugs in multicore programs, and then perform automatic online repair of these bugs to improve performance through careful consideration of the memory consistency model. These repair techniques are able to achieve much of the benefit of hand-written expert fixes, but require no programmer intervention. In ongoing work we are applying the feedback-driven processor methodology to the problem of front-end stalls in the datacenter. I'll also discuss related efforts to improve performance and correctness via static and dynamic analysis of GPU programs and via deterministic execution systems.
Feedback-Driven ProcessorsFeedback-Driven Processors
University of Michigan, 25 September 2019

[abstract][web]
Modern processor architectures are complex machines due to decades' worth of performance optimizations. This complexity can lead to subtle performance issues at the microarchitectural level that are very difficult to diagnose. Fortunately, modern processors also contain a rich set of performance monitoring facilities that enable introspection of many aspects of program execution. My research goal is to build automatic systems that can diagnose and repair performance issues in running programs, relieving programmers of the burden of performance tuning.
I'll discuss our work on a series of hardware-software systems for C/C++ and Java that use performance counters to diagnose false sharing bugs in multicore programs, and then perform automatic online repair of these bugs to improve performance through careful consideration of the memory consistency model. These repair techniques are able to achieve much of the benefit of hand-written expert fixes, but require no programmer intervention. In ongoing work we are applying the feedback-driven processor methodology to the problem of front-end stalls in the datacenter. I'll also discuss related efforts to improve performance and correctness via static and dynamic analysis of GPU programs and via deterministic execution systems.
Feedback-Driven ProcessorsFeedback-Driven Processors
MIT, 23 September 2019

[abstract]
Modern processor architectures are complex machines due to decades' worth of performance optimizations. This complexity can lead to subtle performance issues at the microarchitectural level that are very difficult to diagnose. Fortunately, modern processors also contain a rich set of performance monitoring facilities that enable introspection of many aspects of program execution. My research goal is to build automatic systems that can diagnose and repair performance issues in running programs, relieving programmers of the burden of performance tuning.
I'll discuss our work on a series of hardware-software systems for C/C++ and Java that use performance counters to diagnose false sharing bugs in multicore programs, and then perform automatic online repair of these bugs to improve performance through careful consideration of the memory consistency model. These repair techniques are able to achieve much of the benefit of hand-written expert fixes, but require no programmer intervention. In ongoing work we are applying the feedback-driven processor methodology to the problem of front-end stalls in the datacenter. I'll also discuss related efforts to improve performance and correctness via static and dynamic analysis of GPU programs and via deterministic execution systems.
Leveraging Intel Platforms for Automatic Cache Contention Detection and RepairLeveraging Intel Platforms for Automatic Cache Contention Detection and Repair
Intel Labs, 16 May 2018

[abstract]
Modern processor architectures contain a set of extensive performance optimizations, and provide good performance on a range of workloads without much programmer effort. However, the complexity of these architectures can also lead to subtle performance bugs at the microarchitectural level that are very difficult to diagnose.
I'll discuss our work on a series of runtime systems for C/C++ and Java that use performance counters to diagnose false sharing bugs in multicore chips, and then perform automatic online bug repair to improve performance. These repair techniques are able to achieve much of the benefit of a hand-written expert fix, but require no programmer intervention. I'll discuss our current work on performance bugs relating to front-end stalls, and discuss ways we might mitigate these stalls in an online fashion.
Automatically Finding & Fixing Cache Contention BugsAutomatically Finding & Fixing Cache Contention Bugs
Washington University in St. Louis, 18 November 2016

[abstract]
Multicore architectures continue to pervade every part of our computing infrastructure, from servers to phones and smart watches. While these parallel architectures bring established performance and energy efficiency gains compared to single-core designs, parallel code written for these architectures can suffer from subtle performance bugs that are difficult to understand and repair with current tools.
We'll discuss two systems that leverage hardware-software co-design to tackle cache contention bugs like false sharing, in the context of both unmanaged languages like C/C++ and managed languages like Java. These systems use hardware performance counters for efficient bug detection, and novel runtime systems to repair bugs online without any programmer intervention. Along the way, we'll discuss some subtle memory consistency model issues brought to light by these optimizations.
Automatically Finding & Fixing Cache Contention BugsAutomatically Finding & Fixing Cache Contention Bugs
Carnegie Mellon University, 20 September 2016

[abstract]
Multicore architectures continue to pervade every part of our computing infrastructure, from servers to phones and smart watches. While these parallel architectures bring established performance and energy efficiency gains compared to single-core designs, parallel code written for these architectures can suffer from subtle performance bugs that are difficult to understand and repair with current tools.
We'll discuss two systems that leverage hardware-software co-design to tackle cache contention bugs like false sharing, in the context of both unmanaged languages like C/C++ and managed languages like Java. These systems use hardware performance counters for efficient bug detection, and novel runtime systems to repair bugs online without any programmer intervention. Along the way, we'll discuss some subtle memory consistency model issues brought to light by these optimizations.
Towards Automatic Synchronization of Parallel ProgramsTowards Automatic Synchronization of Parallel Programs
Qualcomm Research, San Diego, 13 March 2015

[abstract]
Correctly synchronizing threads’ executions is one of the main difficulties of multithreaded programming. Incorrect synchronization causes many subtle concurrency errors such as data races and atomicity violations. To overcome the difficulty of synchronizing code, we propose Mostly Automatic Management of Atomicity (MAMA): a new programming and execution model, in which the programmer is relieved of specifying the program’s atomicity constraints and need only specify parallelism and ordering constraints. At runtime, MAMA conservatively over-approximates atomicity, leveraging the insight that a large atomic region provides the atomicity of a smaller atomic region it contains. MAMA’s conservatism is sometimes excessive, and can cause deadlocks. However, our experiments with PARSEC workloads and the memcached server show that deadlocks are rare and can be repaired through programmer annotations. Our system precisely guides the programmer to annotation sites and suggests over 80% of annotations automatically. With modest hardware support, MAMA incurs just 40% average slowdown and nearly identical performance scaling compared to native pthreads.
Low-overhead, Unobtrusive Cache Contention Detection and RepairLow-overhead, Unobtrusive Cache Contention Detection and Repair
Intel Labs, Santa Clara, 6 February 2015

[abstract]
Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Cache contention is often invisible at the source code level as it is a product of the opaque decisions of the memory allocator, and contention can appear or disappear across different hardware platforms.
We describe the Low-overhead, Effective, and Non-intrusive Sharing detection (LENS) system, which leverages new performance counter capabilities available on Intel’s Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a cache contention detector that operates with low performance overhead and does not require any invasive program, compiler or operating system changes. Our experiments show that LENS imposes just 3.6% average runtime overhead on the Phoenix, Parsec and Splash2X benchmarks, and accurately detects instances of true and false sharing in these workloads. We discuss several potential extensions for LENS to automatically repair contention in both managed and unmanaged code.
Towards Automatic Synchronization of Parallel ProgramsTowards Automatic Synchronization of Parallel Programs
Intel Labs, Santa Clara, 7 February 2014

[abstract]
Correctly synchronizing the parallel execution of tasks remains one of the most difficult aspects of parallel programming. Without proper synchronization, many kinds of subtle concurrency errors can arise and cause a program to produce intermittently wrong results.
I'll describe our initial steps towards a system that can automatically synchronize a set of programmer-specified, partially-independent parallel threads. Our current prototype system can infer much but not all of this synchronization, while providing a safety guarantee that a program either executes in a correctly atomic fashion or it deadlocks. Initial experiments indicate that our prototype can semi-automatically provide atomicity for a set of Java benchmarks while still allowing these benchmarks to execute in parallel.
No Such Thing as Luck: Improving Parallel Programming with DeterminismNo Such Thing as Luck: Improving Parallel Programming with Determinism
Rutgers University, 3 December 2013
No Such Thing as Luck: Improving Parallel Programmability with DeterminismNo Such Thing as Luck: Improving Parallel Programmability with Determinism
Microsoft Research, Redmond, 23 April 2012

[abstract][video]
Nondeterminism is a key complication in programming multicore systems. Previous approaches to coping with it have focused on replay or required the adoption of new languages, but my research has shown how to provide deterministic execution for arbitrary parallel programs written in existing languages. My work has examined how to build deterministic platforms using novel hardware, compilers, runtimes, and type systems. I will also discuss the many uses of determinism, from debugging, testing, and replicating multithreaded programs to using determinism to accelerate dynamic safety and security checks.
No Such Thing as Luck: Improving Parallel Programmability with DeterminismNo Such Thing as Luck: Improving Parallel Programmability with Determinism
Penn State, Computer Science & Engineering, 18 April 2012

[abstract]
Nondeterminism is a key complication in programming multicore systems. It makes testing more difficult and less useful, since another run of the program can potentially introduce new behaviors. Nondeterminism also frustrates debugging efforts by making bugs hard to reproduce. Previous approaches to coping with nondeterminism in parallel programs have focused on recording an execution for subsequent replay, or required that programs be written in restrictive languages, but have not addressed the underlying nondeterminism of multicore systems in a direct way.
In this talk, I will show how to use novel hardware and software techniques to provide deterministic execution for arbitrary parallel programs written in today's languages. I've built a series of deterministic platforms, from new hardware architectures to compilers and language extensions, that show how the challenge of nondeterminism can be addressed across the computing stack. Hardware speculation, memory consistency relaxations, and hardware-software co-design all play key roles in improving the performance and simplicity of determinism. I'll also share my plans for future work, from leveraging determinism to accelerate safety and security checks, to new parallel computer architectures that enable a unified task+data parallelism abstraction.

Technical Reports

Code-Centric Communication Graphs for Shared-Memory Multithreaded ProgramsCode-Centric Communication Graphs for Shared-Memory Multithreaded Programs
Ben Wood, Joseph Devietti, Luis Ceze and Dan Grossman

Technical Report UW-CSE-09-05-02, May 2009

[abstract][paper]

With more and more complex pieces of software using explicit multithreading, tools to extract the structure of such software are increasingly important. We present a novel tool that builds graphs describing how threads in shared-memory parallel programs communicate. Compared to prior work, our communication graphs are code-centric: nodes represent units of code (e.g., functions) and edges represent inter-thread shared-memory communication via these units of code. Our approach is dynamic, using actual executions to build graphs, and exploits binary-code instrumentation to work for large, real-world applications. The graphs are useful for understanding software structure and computing program properties, such as the effect of nondeterministic thread-scheduling on the communication pattern.

Dissertation

Deterministic Execution for Arbitrary Multithreaded ProgramsDeterministic Execution for Arbitrary Multithreaded Programs
Joseph Devietti

PhD Dissertation, University of Washington, November 2012

[abstract][paper]

Nondeterminism is one of the main reasons that parallel programming is so difficult. Bugs can vanish when programs are rerun or run under a debugger, thwarting attempts at their removal. Stress-testing is a common practice to flush out rare defects though it consumes extensive processing power and offers no real guarantees of discovering bugs. Deployment can similarly expose new issues that are difficult to reproduce. Finally, nondeterminism frustrates replicating multithreaded programs for fault-tolerance or performance as the replicas can diverge silently. Determinism eliminates these problems, making debugging and replication possible and making testing more valuable and efficient.
Previous efforts at providing determinism required programs to be (re)written in restrictive languages. In contrast to these language-level determinism schemes, this dissertation shows how execution-level determinism can be provided for arbitrary parallel programs, even programs that contain concurrency errors. First, we employ a hardware-based approach to provide determinism for unmodified binaries running on a deterministic multiprocessor system. Second, we show that memory consistency relaxations both enable a pure software-based implementation of execution-level determinism for arbitrary programs and also admit a simpler deterministic multiprocessor design. Finally, we describe a hybrid mechanism that integrates execution-level and language-level determinism techniques to provide determinism for arbitrary programs with higher performance than an execution-level approach alone.