Joseph Devietti
my email address: my last name at cis dot upenn dot edu
(215) 746-4223
Levine Hall 572
3330 Walnut Street
Philadelphia, PA 19104-3409

I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models. You can find my tenure materials here: CV Research Statement Teaching Statement Diversity Statement

I am looking for new PhD students interested in systems and computer architecture. If you are interested in these topics please apply to our PhD program and drop me an email as well.


I'm teaching CIS 501: Computer Architecture in Spring 2019.

This iteration of CIS 501 will have the content from CIS 371: Computer Organization & Design, notably the Verilog processor labs. The course content will be the same for graduate and undergraduate students. CIS 240 is the pre-requisite, just like for CIS 371.

We have chosen the 501 course number, however, to allow master's and PhD students to more easily take the class for credit. Undergraduates who need CIS 371 for their CSCI or CMPE degrees can fill out the CIS course substitution form to use this Spring 2019 offering of CIS 501 instead. A petition is not needed, nor will other offerings of CIS 501 be eligible for this substitution.


I'm lucky to be working with the following great students:

Former students

  • Yuanfeng Peng (PhD). First employment: Google
  • Nicholas Renner (Master's 2019, now a PhD student at NYU)
  • Nimit Singhania (PhD, co-advised with Rajeev Alur). First employment: Google
  • Christian DeLozier (PhD 2018). First employment: Assistant Professor at United States Naval Academy
  • Kavya Lakshminarayanan (Master's 2018)
  • Richard Zang (Master's 2018)
  • Sana Kamboj (Master's 2017)
  • Ariel Eizenberg (Master's 2016)
  • Brooke Fugate (Master's 2015, co-advised with André DeHon)
  • Liang Luo (Master's 2015, now a PhD student at the University of Washington)
  • Akshitha Sriraman (Master's 2015, now a PhD student at the University of Michigan)

Recent Publications full list

Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.

  • Lazy Determinism for Faster Deterministic MultithreadingLazy Determinism for Faster Deterministic Multithreading
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '19), April 2019
  • Block-Size Independence for GPU ProgramsBlock-Size Independence for GPU Programs
    Static Analysis Symposium (SAS '18), August 2018
    Radhia Cousot Young Researcher Best Paper Award

    Optimizing GPU programs by tuning execution parameters is essential to realizing the full performance potential of GPU hardware. However, many of these optimizations do not ensure correctness and subtle errors can enter while optimizing a GPU program. Further, lack of formal models and the presence of non-trivial transformations prevent verification of optimizations.

    In this work, we verify transformations involved in tuning the execution parameter, block-size. First, we present a formal programming and execution model for GPUs, and then formalize block-size independence of GPU programs, which ensures tuning block-size preserves program semantics. Next, we present an inter-procedural analysis to verify block-size independence for synchronization-free GPU programs. Finally, we evaluate the analysis on the Nvidia CUDA SDK samples, where 35 global kernels are verified to be block-size independent.

  • CURD: A Dynamic CUDA Race DetectorCURD: A Dynamic CUDA Race Detector
    ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '18), June 2018
    As GPUs have become an integral part of nearly every processor, GPU programming has become increasingly popular. GPU programming requires a combination of extreme levels of parallelism and low-level programming, making it easy for concurrency bugs such as data races to arise. These concurrency bugs can be extremely subtle and di cult to debug due to the massive numbers of threads running concurrently on a modern GPU. While some tools exist to detect data races in GPU programs, they are often prohibitively slow or focused only on a small class of data races in shared memory. Compared to prior work, our race detector, CURD, can detect data races precisely on both shared and global memory, selects an appropriate race detection algorithm based on the synchronization used in a program, and utilizes efficient compiler instrumentation to reduce performance overheads. Across 53 benchmarks, we find that using CURD incurs an average slowdown of just 2.88x over native execution. CURD is 2.1x faster than Nvidia’s CUDA-Racecheck race detector, despite detecting a much broader class of races. CURD finds 35 races across our benchmarks, including bugs in established benchmark suites and in sample programs from Nvidia.
  • SlimFast: Reducing Metadata Redundancy in Sound and Complete Dynamic Data Race DetectionSlimFast: Reducing Metadata Redundancy in Sound and Complete Dynamic Data Race Detection
    IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018

    Data races are one of the main culprits behind the complexity of multithreaded programming. Existing data race detectors require large amounts of metadata for each program variable to perform their analyses. The SlimFast system exploits the insight that there is a large amount of redundancy in this metadata: many program variables often have identical metadata state. By sharing metadata across variables, a large reduction in space usage can be realized. SlimFast uses immutable metadata to safely support metadata sharing across threads while also accelerating concurrency control. SlimFast’s lossless metadata compression achieves these benefits while preserving soundness and completeness.

    Across a range of benchmarks from Java Grande, DaCapo, NAS Parallel Benchmarks and Oracle’s BerkeleyDB Java Edition, SlimFast is able to reduce memory consumption by 1.83x on average, and up to 4.90x for some benchmarks, compared to the state-of-the-art FastTrack system. By improving cache locality and simplifying concurrency control, SlimFast also accelerates data race detection by 1.40x on average, and up to 8.8x for some benchmarks, compared to FastTrack.

  • SOFRITAS: Serializable Ordering-Free Regions for Increasing Thread Atomicity ScalablySOFRITAS: Serializable Ordering-Free Regions for Increasing Thread Atomicity Scalably
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '18), March 2018

    Correctly synchronizing multithreaded programs is challenging and errors can lead to program failures such as atomicity violations. Existing strong memory consistency models rule out some possible failures, but are limited by depending on programmer-defined locking code. We present the new Ordering-Free Region (OFR) serializability consistency model that ensures atomicity for OFRs, which are spans of dynamic instructions between consecutive ordering constructs (e.g., barriers), without breaking atomicity at lock operations. Our platform, Serializable Ordering-Free Regions for Increasing Thread Atomicity Scalably (SOFRITAS), ensures a C/C++ program’s execution is equivalent to a serialization of OFRs by default. We build two systems that realize the SOFRITAS idea: a concurrency bug finding tool for testing called SofriTest, and a production runtime system called SoPro.

    SofriTest uses OFRs to find concurrency bugs, including a multi-critical-section atomicity violation in memcached that weaker consistency models will miss. If OFRs are too coarse-grained, SofriTest suggests refinement annotations automatically. Our software-only SoPro implementation has high performance, scales well with increased parallelism, and prevents failures despite bugs in locking code. SoPro has an average overhead of just 1.59x compared to pthreads, despite pthreads’ much weaker memory model.

  • Race Detection and Reachability in Nearly Series-Parallel DAGsRace Detection and Reachability in Nearly Series-Parallel DAGs
    ACM-SIAM Symposium on Discrete Algorithms (SODA '18), January 2018

    A program is said to have a determinacy race if logically parallel parts of a program access the same memory location and one of the accesses is a write. These races are generally bugs in the program, as different schedules of the program can lead to different results. Most prior work on detecting these races focuses on a subclass of programs with series-parallel or nested parallelism. This paper presents a race-detection algorithm for detecting races in a more general class of programs, namely programs that include arbitrary ordering constraints in additional to the series-parallel constructs. Our race-detection algorithm performs a serial execution of the program, augmented to detect races, in O(T1 + k2) time, where T1 is the sequential running time of the original program and k is the number of non series-parallel constraints. The main technical novelty is a new data structure for answering reachability queries in nearly series-parallel (SP) directed acyclic graphs (DAGs). Given as input a graph comprising an n-node series parallel graph and k additional non-SP edges, the total construction time of the data structure is O(n + k2), and each reachability query can be answered in O(1) time. The data structure is traversally incremental, meaning that it supports the insertion of nodes/edges, but only as they are discovered through a graph traversal.

  • Monadic composition for deterministic, parallel batch processingMonadic composition for deterministic, parallel batch processing
    ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '17), October 2017

    Achieving determinism on real software systems remains difficult. Everyday code interacts with a wide array of nondeterministic sources, including those internal to the system (OS system calls, CPU instructions, other processes), external to the system (time, date, network communication), and arising from software abstractions (concurrency, threaded execution, data races). Nondeterminism complicates many tasks, from achieving reliable software builds across systems to creating reproducible scientific results. Existing approaches to determinism enforcement assume source code is available or require changing the operating system. Several approaches have high overhead as runtime monitoring causes performance to suffer.

    In this work we present DetFlow, a framework for writing and deploying new and legacy software which guarantees determinism. DetFlow uses a novel approach combining static, language-level guarantees with a lightweight runtime enforcement system. Applications that leverage DetFlow must have an entrypoint that lives in the DetIO monad, a type which requires all operations —- including I/O -— be deterministic. Furthermore, DetFlow allows the execution of arbitrary code not written in this framework by executing it in a determinizing runtime. This allows for batch processing tasks to be composed of otherwise untrusted external tasks in a way that assures correctness. Combining support for deterministic parallelism, filesystem access, and logging, DetFlow is an ideal platform for writing scripted workflows that process large data sets simultaneously. We show several use cases of DetFlow by applying it to bioinformatics data pipelines and software build systems. Our evaluation shows we can determinize existing software with minimal modifications, while preserving performance and exploiting software parallelism. We show that DetFlow makes it easier to discover nondeterminism and data races sooner, as DetFlow forces programmers to get reproducibility and parallelism right from the onset.

  • TMI: Thread Memory Isolation for False Sharing RepairTMI: Thread Memory Isolation for False Sharing Repair
    ACM IEEE International Symposium on Microarchitecture (MICRO '17), October 2017

    Cache contention in the form of false sharing and true sharing arises when threads overshare cache lines at high frequency. Such oversharing can reduce or negate the performance benefits of parallel execution. Prior systems for detecting and repairing cache contention lack efficiency in detection or repair, contain subtle memory consistency flaws, or require invasive changes to the program environment.

    In this paper, we introduce a new way to combat cache line oversharing via the Thread Memory Isolation (TMI) system. TMI operates completely in userspace, leveraging performance counters and the Linux ptrace mechanism to tread lightly on monitored applications, intervening only when necessary. TMI’s compatible-by-default design allows it to scale to real-world workloads, unlike previous proposals. TMI introduces a novel code-centric consistency model to handle cross-language memory consistency issues. TMI exploits the flexibility of code-centric consistency to efficiently repair false sharing while preserving strong consistency model semantics when necessary.

    TMI has minimal impact on programs without oversharing, slowing their execution by just 2% on average. We also evaluate TMI on benchmarks with known false sharing, and manually inject a false sharing bug into the leveldb key-value store from Google. For these programs, TMI provides an average speedup of 5.2x and achieves 88% of the speedup possible with manual source code fixes.

  • PARSNIP: Performant Architecture for Race Safety with No Impact on PrecisionPARSNIP: Performant Architecture for Race Safety with No Impact on Precision
    ACM IEEE International Symposium on Microarchitecture (MICRO '17), October 2017

    Data race detection is a useful dynamic analysis for multithreaded programs that is a key building block in record-and-replay, enforcing strong consistency models, and detecting concurrency bugs. Existing software race detectors are precise but slow, and hardware support for precise data race detection relies on assumptions like type safety that many programs violate in practice.

    We propose PARSNIP, a fully precise hardware-supported data race detector. PARSNIP exploits new insights into the redundancy of race detection metadata to reduce storage overheads. PARSNIP also adopts new race detection metadata encodings that accelerate the common case while preserving soundness and completeness. When bounded hardware resources are exhausted, PARSNIP falls back to a software race detector to preserve correctness. PARSNIP does not assume that target programs are type safe, and is thus suitable for race detection on arbitrary code.

    Our evaluation of PARSNIP on several PARSEC benchmarks shows that it incurs performance overheads from negligible to 2.6x, with an average overhead of just 1.5x. Moreover, PARSNIP outperforms the state-of-the-art RADISH hardware race detector by 4.6x.

  • Monadic composition for deterministic, parallel batch processingMonadic composition for deterministic, parallel batch processing
    Proceedings of the ACM on Programming Languages, alternate version of the OOPSLA 2017 paper, October 2017

    Achieving determinism on real software systems remains difficult. Even a batch-processing job, whose task is to map input bits to output bits, risks nondeterminism from thread scheduling, system calls, CPU instructions, and leakage of environmental information such as date or CPU model. In this work, we present a system for achieving low-overhead deterministic execution of batch-processing programs that read and write the file system—turning them into pure functions on files.

    We allow multi-process executions where a permissions system prevents races on the file system. Process separation enables different processes to enforce permissions and enforce determinism using distinct mechanisms. Our prototype, DetFlow, allows a statically-typed coordinator process to use shared-memory parallelism, as well as invoking process-trees of sandboxed legacy binaries. DetFlow currently implements the coordinator as a Haskell program with a restricted I/O type for its main function: a new monad we call DetIO. Legacy binaries launched by the coordinator run concurrently, but internally each process schedules threads sequentially, allowing dynamic determinism-enforcement with predictably low overhead.

    We evaluate DetFlow by applying it to bioinformatics data pipelines and software build systems. DetFlow enables determinizing these data-processing workflows by porting a small amount of code to become a statically-typed coordinator. This hybrid approach of static and dynamic determinism enforcement permits freedom where possible but restrictions where necessary.