Joseph Devietti
my email address: my last name at cis dot upenn dot edu
(215) 746-4223
Levine Hall 572
3330 Walnut Street
Philadelphia, PA 19104-3409

I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models.


I am eagerly looking for new PhD students interested in improving the programmability and performance of multicore architectures. If you are interested in these topics, please apply to our PhD program and drop me an email as well.

I'm on the PC of the 5th Workshop on Systems for Future Multicore Architectures (SFMA 2015), held in conjunction with EuroSys 2015 in Bordeaux. Please consider submitting work on the interaction of multicore architectures with operating systems, language runtimes and virtual machines. The submission deadline is 17 February 2015.


I'm teaching CIS 501: Computer Architecture in Fall 2015.


I'm lucky to be working with a great group of students.

Recent Publications

Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.

  • SlimFast: Reducing Metadata Redundancy in Sound & Complete Dynamic Data Race Detection
    Yuanfeng Peng and Joseph Devietti
    PLDI Student Research Competition (PLDI SRC '15), held in conjunction with PLDI '15, June 2015
    Dynamic race detectors that are sound and complete often incur too much memory overhead to be practical in many scenarios; to address this issue, we present SlimFast, a sound-and-complete race detector that significantly reduces the memory overhead, with comparable performance to a state-of-the-art race detector. According to experiments on DaCapo and Grande benchmarks, SlimFast brings up to 72% space savings and a 21% speedup over FastTrack.
  • High-Performance Determinism with Total Store Order Consistency
    European Conference on Computer Systems (EuroSys '15), April 2015

    We present Consequence, a deterministic multi-threading library. Consequence achieves deterministic execution via store buffering and strict ordering of synchronization operations. To ensure high performance under a wide variety of conditions, the ordering of synchronization operations is based on a deterministic clock, and store buffering is implemented using version-controlled memory.

    Recent work on deterministic concurrency has proposed relaxing the consistency model beyond total store order (TSO). Through novel optimizations, Consequence achieves the same or better performance on the Phoenix, PARSEC and SPLASH-2 benchmark suites, while retaining TSO memory consistency. Across 19 benchmark programs, Consequence incurs a worst-case slowdown of 3.9× vs. pthreads, with 14 out of 19 programs at or below 2.5×. We believe this performance improvement takes parallel programming one step closer to "determinism by default".
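The strict ordering of synchronization operations can be sketched as follows. This is a simplified hypothetical Python illustration, not Consequence's implementation: it serializes synchronization operations in a fixed round-robin order (Consequence's deterministic clock is more sophisticated), so every run observes the same interleaving regardless of OS scheduling.

```python
# Hypothetical sketch: admit each thread's synchronization operation only
# when a deterministic logical clock says it is that thread's turn, so the
# interleaving of sync operations is identical on every run.
import threading

class DetOrder:
    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.turn = 0  # deterministic clock: whose sync op commits next
        self.cv = threading.Condition()

    def sync(self, tid, op):
        with self.cv:
            while self.turn % self.nthreads != tid:
                self.cv.wait()
            result = op()        # e.g. commit this thread's store buffer
            self.turn += 1
            self.cv.notify_all()
            return result

order = DetOrder(2)
log = []

def worker(tid):
    for i in range(3):
        order.sync(tid, lambda: log.append((tid, i)))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(2)]
for t in threads: t.start()
for t in threads: t.join()

# The interleaving is the same on every run, independent of scheduling:
assert log == [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

Combining such an ordering with per-thread store buffers that commit only at these ordered points is what lets a system of this shape stay deterministic while preserving TSO-style visibility of writes.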

  • MAMA: Mostly Automatic Management of Atomicity
    Workshop on Determinism and Correctness in Parallel Programming (WoDet '14), held in conjunction with ASPLOS '14, March 2014
    Correctly synchronizing the parallel execution of tasks remains one of the most difficult aspects of parallel programming. Without proper synchronization, many kinds of subtle concurrency errors can arise and cause a program to produce intermittently wrong results. The long-term goal of this work is to design a system that automatically synchronizes a set of programmer-specified, partially-independent parallel tasks. We present here our progress on the MAMA (Mostly Automatic Management of Atomicity) system, which can infer much but not all of this synchronization. MAMA provides a safety guarantee that a program either executes in a correctly atomic fashion or it deadlocks. Initial experiments indicate that MAMA can semi-automatically provide atomicity for a set of Java benchmarks while still allowing parallel execution.
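One way to see the atomic-or-deadlock guarantee is through two-phase locking with no early release. The following is a hypothetical Python sketch (the `Task` API is invented for illustration; MAMA targets Java): if a task acquires a lock on first access to each shared object and holds every lock until it finishes, then any task that completes was atomic, and the only other possible outcome is deadlock.

```python
# Hypothetical sketch of an atomic-or-deadlock discipline: acquire a lock
# on first touch of each shared object, release nothing until the task
# ends. A task that finishes ran atomically; otherwise tasks deadlock.
import threading

class Task:
    def __init__(self, locks):
        self.locks = locks   # one lock per shared object
        self.held = set()

    def access(self, obj):
        if obj not in self.held:      # first touch: acquire and keep holding
            self.locks[obj].acquire()
            self.held.add(obj)

    def finish(self):
        for obj in self.held:         # release only at task completion
            self.locks[obj].release()
        self.held.clear()

locks = {"x": threading.Lock(), "y": threading.Lock()}
t = Task(locks)
t.access("x"); t.access("x"); t.access("y")  # repeated touches are no-ops
t.finish()
assert not locks["x"].locked() and not locks["y"].locked()
```

The interesting engineering then lies in inferring which accesses need which locks and in keeping enough parallelism, which is where the "mostly automatic" in MAMA comes in.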
  • GPUDet: A Deterministic GPU Architecture
    Hadi Jooybar, Wilson W. L. Fung, Mike O'Connor, Joseph Devietti and Tor Aamodt
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '13), March 2013

    Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This lack of reproducibility is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs.

    Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations.

    Our simulation results indicate that GPUDet incurs only 2× slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
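The Z-Buffer-style write resolution in GPUDet can be sketched in miniature. This hypothetical Python illustration assumes the simplest deterministic tie-break (lowest thread id wins, by analogy with a depth test); the point is that writes may arrive in any order, yet the final memory state is a pure function of the set of writes.

```python
# Hypothetical sketch of deterministic out-of-order write commit: each
# write carries its writer's thread id, and for each address the write
# with the smallest id wins, like a depth test in a Z-Buffer Unit. The
# result is independent of the order in which writes arrive.

def commit_writes(writes):
    """writes: iterable of (address, thread_id, value) in arbitrary order."""
    mem = {}
    for addr, tid, val in writes:
        if addr not in mem or tid < mem[addr][0]:
            mem[addr] = (tid, val)  # this write "passes the depth test"
    return {addr: val for addr, (tid, val) in mem.items()}

# The same set of writes, delivered in two different orders, commits to
# an identical memory state:
w = [(0x10, 3, "a"), (0x10, 1, "b"), (0x20, 2, "c")]
assert commit_writes(w) == commit_writes(reversed(w)) == {0x10: "b", 0x20: "c"}
```

Because the merge is associative and order-insensitive, the hardware can let wavefronts flush their writes in parallel and out of order without sacrificing a deterministic final state.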