Joseph Devietti
my email address: my last name at cis dot upenn dot edu
(215) 746-4223
Levine Hall 572
3330 Walnut Street
Philadelphia, PA 19104-3409

I work on making multiprocessors easier to program by leveraging changes in both computer architectures and parallel programming models.

I am looking for new PhD students interested in systems and computer architecture. If you are interested in these topics please apply to our PhD program and drop me an email as well.

Teaching

In Spring 2026 I'm teaching CIS 4710/5710: Computer Organization and Design.

Students

I'm lucky to be working with a great student:

Former students

  • Yuxuan Zhang (PhD)
  • Kelly Shiptoski (PhD 2023. First employment: Bolt Labs)
  • Omar Navarro Leija (PhD 2022. First employment: Bolt Labs)
  • Gautam Mohan (Master's 2020. First employment: Amazon)
  • Yuanfeng Peng (PhD 2019). First employment: Google
  • Nicholas Renner (Master's 2019, now a PhD student at NYU)
  • Nimit Singhania (PhD 2018, co-advised with Rajeev Alur). First employment: Google
  • Christian DeLozier (PhD 2018). First employment: Assistant Professor at United States Naval Academy
  • Kavya Lakshminarayanan (Master's 2018) First employment: Microsoft
  • Richard Zang (Master's 2018) First employment: Microsoft
  • Sana Kamboj (Master's 2017) First employment: Qualcomm
  • Ariel Eizenberg (Master's 2016) First employment: Government of Israel
  • Brooke Fugate (Master's 2015, co-advised with André DeHon)
  • Liang Luo (Master's 2015, then a PhD student at the University of Washington)
  • Akshitha Sriraman (Master's 2015, then a PhD student at the University of Michigan)

Recent Publications full list

Many of the paper links below use the ACM's Author-izer service, which tracks download statistics and provides a small kickback to various ACM Special Interest Groups for each download.

  • NanoTag: Systems Support for Efficient Byte-Granular Overflow Detection on ARM MTE NanoTag: Systems Support for Efficient Byte-Granular Overflow Detection on ARM MTE
    IEEE Symposium on Security and Privacy (to appear at Oakland '26), May 2026
  • Lobster: A GPU-Accelerated Framework for Neurosymbolic ProgrammingLobster: A GPU-Accelerated Framework for Neurosymbolic Programming
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '26), March 2026

    Neurosymbolic programs combine deep learning with symbolic reasoning to achieve better data efficiency, interpretability, and generalizability compared to standalone deep learning approaches. However, existing neurosymbolic learning frameworks implement an uneasy marriage between a highly scalable, GPU-accelerated neural component and a slower symbolic component that runs on CPUs.

    We propose Lobster, a unified framework for harnessing GPUs in an end-to-end manner for neurosymbolic learning. Lobster maps a general neurosymbolic language based on Datalog to the GPU programming paradigm. This mapping is implemented via compilation to a new intermediate language called APM. The extra abstraction provided by APM allows Lobster to be both flexible, supporting discrete, probabilistic, and differentiable modes of reasoning on GPU hardware with a library of provenance semirings, and performant, implementing new optimization passes.

    We demonstrate that Lobster programs can solve interesting problems spanning the domains of natural language processing, image processing, program reasoning, bioinformatics, and planning. On a suite of 9 applications, Lobster achieves an average speedup of 3.9x over Scallop, a state-of-the-art neurosymbolic framework, and enables scaling of neurosymbolic solutions to previously infeasible tasks.

  • RPG2: Robust Profile-Guided Runtime Prefetch GenerationRPG2: Robust Profile-Guided Runtime Prefetch Generation
    International Conference on Architectural Support for Programming Languages & Operating Systems (ASPLOS '24), May 2024

    Data cache prefetching is a well-established optimization to overcome the limits of the cache hierarchy and keep the processor pipeline fed with data. In principle, accurate, well-timed prefetches can sidestep the majority of cache misses and dramatically improve performance. In practice, however, it is challenging to identify which data to prefetch and when to do so. In particular, data can be easily requested too early, causing eviction of useful data from the cache, or requested too late, failing to avoid cache misses. Competition for limited off-chip memory bandwidth must also be balanced between prefetches and a program's regular "demand" accesses. Due to these challenges, prefetching can both help and hurt performance, and the outcome can depend on program structure, decisions about what to prefetch and when to do it, and, as we demonstrate in a series of experiments, program input, processor microarchitecture, and their interaction as well.

    To try to meet these challenges, we have designed the RPG2 system for online prefetch injection and tuning. RPG2 is a pure-software system that operates on running C/C++ programs, profiling them, injecting prefetch instructions, and then tuning those prefetches to maximize performance. Across dozens of inputs, we find that RPG2 can provide speedups of up to 2.15×, comparable to the best profile-guided prefetching compilers, but can also respond when prefetching ends up being harmful and roll back to the original code - something that static compilers cannot. RPG2 improves prefetching robustness by preserving its performance benefits, while avoiding slowdowns.

  • Online Code Layout Optimizations via OCOLOSOnline Code Layout Optimizations via OCOLOS
    IEEE Micro, Vol. 43 No. 4, July 2023
  • OCOLOS: Online COde Layout OptimizationSOCOLOS: Online COde Layout OptimizationS
    ACM IEEE International Symposium on Microarchitecture (MICRO '22), October 2022
    Selected for IEEE Micro Top Picks 2023 [article]

    The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application’s code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities.

    In this paper, we propose OCOLOS, the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be re-launched. By running online, profile data is always relevant to the current execution and always maps perfectly to the running code. OCOLOS demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that OCOLOS can accelerate MySQL by up to 1.41x, the Verilator hardware simulator by up to 2.20x, and a build of the Clang compiler by up to 1.14x.