Joe Devietti
Associate Professor and Undergraduate Curriculum Chair, Computer & Information Science
my email address: my last name at cis dot upenn dot edu
(215) 746-4223
Levine Hall 572
3330 Walnut Street
Philadelphia, PA 19104-3409

I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models.

I am looking for new PhD students interested in systems and computer architecture. If you are interested in these topics, please apply to our PhD program and drop me an email as well.


I'm teaching a PhD seminar, CIS 700-002: Software & Hardware Support for Memory Safety, in Fall 2021. Learn more at the course website.


I'm lucky to be working with the following great students:

Former students

  • Gautam Mohan (Master's 2020). First employment: Amazon
  • Yuanfeng Peng (PhD 2019). First employment: Google
  • Nicholas Renner (Master's 2019), now a PhD student at NYU
  • Nimit Singhania (PhD 2018, co-advised with Rajeev Alur). First employment: Google
  • Christian DeLozier (PhD 2018). First employment: Assistant Professor at the United States Naval Academy
  • Kavya Lakshminarayanan (Master's 2018). First employment: Microsoft
  • Richard Zang (Master's 2018). First employment: Microsoft
  • Sana Kamboj (Master's 2017). First employment: Qualcomm
  • Ariel Eizenberg (Master's 2016). First employment: Government of Israel
  • Brooke Fugate (Master's 2015, co-advised with André DeHon)
  • Liang Luo (Master's 2015), then a PhD student at the University of Washington
  • Akshitha Sriraman (Master's 2015), then a PhD student at the University of Michigan

Recent Publications (full list)

Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.

  • Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications
    Tanvir Ahmed Khan, Dexin Zhang, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, et al.
    International Symposium on Computer Architecture (ISCA '21), June 2021

    Modern data center applications exhibit deep software stacks yielding large instruction footprints that frequently lead to instruction cache misses, degrading performance, cost-efficiency, and energy efficiency. Although numerous mechanisms have been proposed to mitigate instruction cache misses, they still fall short of ideal cache behavior, and furthermore, introduce significant hardware overheads. We first investigate why existing I-cache miss mitigation mechanisms achieve sub-optimal performance for data center applications. We find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced evictions that are not handled by existing replacement policies. Alas, existing replacement policies are unable to mitigate wasteful evictions since they lack complete knowledge of a data center application's complex program behavior.

    To make existing replacement policies aware of these eviction-inducing program behaviors, we propose Ripple, a novel software-only technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. Ripple carefully identifies program contexts that lead to I-cache misses and sparingly injects “cache line eviction” instructions in suitable program locations at link time. We evaluate Ripple using nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve speedup that is closer to that of an ideal I-cache. Specifically, Ripple achieves an average performance improvement of 1.6% (up to 2.13%) due to a mean 19% (up to 28.6%) I-cache miss reduction.

  • Anytime Computation and Control for Autonomous Systems
    IEEE Transactions on Control Systems Technology, March 2021
    The correct and timely completion of the sensing and action loop is of utmost importance in safety-critical autonomous systems. Crucial to the performance of this feedback control loop are the computation time and accuracy of the estimator, which produces state estimates used by the controller. These state estimators often use computationally expensive perception algorithms like visual feature tracking. With on-board computers on autonomous robots being computationally limited, the computation time of such an estimation algorithm can at times be high enough to result in poor control performance. We develop a framework for codesign of anytime estimation and robust control algorithms, taking into account computation delays and estimation inaccuracies. This is achieved by constructing an anytime estimator from an off-the-shelf perception-based estimation algorithm and obtaining a trade-off curve for its computation time versus estimation error. This is used in the design of a robust predictive control algorithm that at run-time decides a contract, or operation mode, for the estimator in addition to controlling the dynamical system to meet its control objectives at a reduced computation energy cost. This codesign provides a mechanism through which the controller can use the trade-off curve to reduce estimation delay at the cost of higher inaccuracy, while guaranteeing satisfaction of control objectives. Experiments on a hexrotor platform running a visual-based algorithm for state estimation show how our method results in up to a 10% improvement in control performance while simultaneously saving 5%-6% in computation energy as compared to a method that does not leverage the codesign.
  • I-SPY: Context-Driven Conditional Instruction Prefetching with Coalescing
    ACM IEEE International Symposium on Microarchitecture (MICRO '20), October 2020
    Modern data center applications have rapidly expanding instruction footprints that lead to frequent instruction cache misses, increasing cost and degrading data center performance and energy efficiency. Mitigating instruction cache misses is challenging since existing techniques (1) require significant hardware modifications, (2) expect impractical on-chip storage, or (3) prefetch instructions based on inaccurate understanding of program miss behavior. To overcome these limitations, we first investigate the challenges of effective instruction prefetching. We then use insights derived from our investigation to develop I-SPY, a novel profile-driven prefetching technique. I-SPY uses dynamic miss profiles to drive an offline analysis of I-cache miss behavior, which it uses to inform prefetching decisions. Two key techniques underlie I-SPY's design: (1) conditional prefetching, which only prefetches instructions if the program context is known to lead to misses, and (2) prefetch coalescing, which merges multiple prefetches of non-contiguous cache lines into a single prefetch instruction. I-SPY exposes these techniques via a family of light-weight hardware code prefetch instructions. We study I-SPY in the context of nine data center applications and show that it provides an average of 15.5% (up to 45.9%) speedup and 95.9% (and up to 98.4%) reduction in instruction cache misses, outperforming the state-of-the-art prefetching technique by 22.5%. We show that I-SPY achieves performance improvements that are on average 90.5% of the performance of an ideal cache with no misses.
  • Deterministic Atomic Buffering
    ACM IEEE International Symposium on Microarchitecture (MICRO '20), October 2020
    Deterministic execution for GPUs is a desirable property as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety-critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions that are required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, do not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are isolated initially and made visible in the future in a deterministic order. This allows the GPU to execute deterministically in parallel without having to serialize its threads for atomic operations as opposed to GPUDet. Our simulation results show that, for atomic-intensive applications, DAB performs 4× better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations.
  • Reproducible Containers
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '20), March 2020
    In this paper, we describe the design and implementation of DetTrace, a reproducible container abstraction for Linux implemented in user space. All computation that occurs inside a DetTrace container is a pure function of the initial filesystem state of the container. Reproducible containers can be used for a variety of purposes, including replication for fault-tolerance, reproducible software builds and reproducible data analytics. We use DetTrace to achieve, in an automatic fashion, reproducibility for 12,130 Debian package builds, containing over 800 million lines of code, as well as bioinformatics and machine learning workflows. We show that, while software in each of these domains is initially irreproducible, DetTrace brings reproducibility without requiring any hardware, OS or application changes. DetTrace’s performance is dictated by the frequency of system calls: IO-intensive software builds have an average overhead of 3.49x, while a compute-bound bioinformatics workflow is under 2%.
  • Hurdle: Securing Jump Instructions Against Code Reuse Attacks
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '20), March 2020

    Code-reuse attacks represent the state-of-the-art in exploiting memory safety vulnerabilities. Control-flow integrity techniques offer a promising direction for preventing code-reuse attacks, but these attacks are resilient against imprecise and heuristic-based detection and prevention mechanisms.

    In this work, we propose a new context-sensitive control-flow integrity system (HURDLE) that guarantees pairwise gadgets cannot be chained in a code-reuse attack. HURDLE improves upon prior techniques by using SMT constraint solving to ensure that indirect control flow transfers cannot be maliciously redirected to execute gadget chains. At the same time, HURDLE's security policy is flexible enough that benign executions are only rarely mischaracterized as malicious. When such mischaracterizations occur, HURDLE can generalize its constraint solving to avoid these mischaracterizations at low marginal cost.

    We propose architecture extensions for HURDLE which consist of an extended branch history register and new instructions. Thanks to its hardware support, HURDLE enforces a context-sensitive control-flow integrity policy with <1% average runtime overhead.
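A note on why deterministic ordering of atomics (as in the DAB paper above) matters at all: floating-point addition is not associative, so the order in which parallel atomic updates commit changes the final result bit-for-bit. This is a minimal illustrative sketch in Python, not code from any of the papers:

```python
# Floating-point addition is not associative: summing the same three
# values in two different orders yields two different results.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one possible commit order of atomic adds
right_to_left = a + (b + c)   # a different interleaving

# The two orders disagree in the low bits of the mantissa. On a GPU,
# nondeterministic scheduling of atomic reductions picks an arbitrary
# order each run, so results vary run to run; enforcing one deterministic
# commit order makes the output reproducible.
print(left_to_right == right_to_left)  # False
```

Any system that replays atomic operations in a fixed order (regardless of which order it picks) recovers bit-identical results across runs; the research question is doing so without serializing all threads.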