I work on making multiprocessors easier to program by
leveraging changes in both computer architectures and parallel programming
- Twig: Profile-Guided BTB Prefetching for Data Center Applications
ACM IEEE International Symposium on Microarchitecture (MICRO '21), October 2021
Modern data center applications have deep software stacks, with instruction footprints that are orders of magnitude larger than typical instruction cache (I-cache) sizes. To efficiently prefetch instructions into the I-cache despite large application footprints, modern server-class processors implement a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP). In this work, we first characterize the limitations of a decoupled frontend processor with FDIP and find that FDIP suffers from significant Branch Target Buffer (BTB) misses. We also find that existing techniques (e.g., stream prefetchers and predecoders) are unable to mitigate these misses, as they rely on an incomplete understanding of a program’s branching behavior.
To address the shortcomings of existing BTB prefetching techniques, we propose Twig, a novel profile-guided BTB prefetching mechanism. Twig analyzes a production binary’s execution profile to identify critical BTB misses and inject BTB prefetch instructions into code. Additionally, Twig coalesces multiple non-contiguous BTB prefetches to improve the BTB’s locality. Twig exposes these techniques via new BTB prefetch instructions. Since Twig prefetches BTB entries without modifying the underlying BTB organization, it is easy to adopt in modern processors. We study Twig’s behavior across nine widely-used data center applications, and demonstrate that it achieves an average 20.86% (up to 145%) performance speedup over a baseline 8K-entry BTB, outperforming the state-of-the-art BTB prefetch mechanism by 19.82% (on average).
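The profile-analysis step can be pictured with a minimal Python sketch. It assumes a hypothetical profile format (one entry per observed BTB miss, keyed by branch PC) and a hypothetical coverage threshold; the abstract does not say how Twig actually ranks misses, so `select_critical_branches` illustrates the idea and is not the paper's algorithm.

```python
from collections import Counter

def select_critical_branches(miss_profile, coverage=0.8):
    """Return branch PCs, hottest first, until `coverage` of all misses is covered.

    `miss_profile` is a hypothetical list with one branch PC per BTB miss.
    """
    counts = Counter(miss_profile)
    total = sum(counts.values())
    selected, covered = [], 0
    for pc, n in counts.most_common():
        if covered / total >= coverage:
            break  # the remaining branches contribute too few misses
        selected.append(pc)
        covered += n
    return selected

# Hypothetical profile: three branches with 6, 3, and 1 misses.
profile = [0x400] * 6 + [0x480] * 3 + [0x500]
critical = select_critical_branches(profile, coverage=0.8)
print([hex(pc) for pc in critical])
```

A prefetch-injection pass would then target only the selected branches, which keeps the number of added instructions small.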
- Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications
International Symposium on Computer Architecture (ISCA '21), June 2021
Modern data center applications exhibit deep software stacks, yielding large instruction footprints that frequently lead to instruction cache misses, which degrade performance, cost-efficiency, and energy efficiency. Although numerous mechanisms have been proposed to mitigate instruction cache misses, they still fall short of ideal cache behavior and, furthermore, introduce significant hardware overheads. We first investigate why existing I-cache miss mitigation mechanisms achieve sub-optimal performance for data center applications. We find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced evictions that are not handled by existing replacement policies. Alas, existing replacement policies are unable to mitigate wasteful evictions since they lack complete knowledge of a data center application’s complex program behavior.
To make existing replacement policies aware of these eviction-inducing program behaviors, we propose Ripple, a novel software-only technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. Ripple carefully identifies program contexts that lead to I-cache misses and sparingly injects “cache line eviction” instructions in suitable program locations at link time. We evaluate Ripple using nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve speedup that is closer to that of an ideal I-cache. Specifically, Ripple achieves an average performance improvement of 1.6% (up to 2.13%) due to a mean 19% (up to 28.6%) I-cache miss reduction.
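A toy simulation shows why an explicit eviction hint can help a replacement policy. The sketch below assumes a fully associative two-line LRU cache and a hypothetical trace format of `("access", line)` and `("evict", line)` operations; it is not Ripple's analysis, only an illustration of the effect an injected eviction can have.

```python
from collections import OrderedDict

def count_misses(trace, capacity=2):
    """Count misses of a fully associative LRU cache over a trace.

    Each trace entry is ("access", line) or ("evict", line); the
    "evict" operation stands in for a hypothetical eviction hint.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for op, line in trace:
        if op == "evict":
            cache.pop(line, None)  # drop the line if it is present
            continue
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the LRU line
            cache[line] = True
    return misses

# "P" is a line that is never reused; it pushes out the useful line "A".
plain = [("access", "A"), ("access", "P"), ("access", "B"), ("access", "A")]
hinted = [("access", "A"), ("access", "P"), ("evict", "P"),
          ("access", "B"), ("access", "A")]
print(count_misses(plain), count_misses(hinted))
```

In this trace, plain LRU evicts `A` to make room for `B`, so the final access to `A` misses; with the hint, `P` is gone before `B` arrives and `A` survives.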
- Static detection of uncoalesced accesses in GPU programs
Formal Methods in System Design, March 2021
GPU programming has become popular due to the high computational capabilities of GPUs. Obtaining significant performance gains with GPUs is, however, challenging, and the programmer needs to be aware of various subtleties of the GPU architecture. One such subtlety lies in accessing GPU memory, where certain access patterns can lead to poor performance. Such access patterns are referred to as uncoalesced global memory accesses. This work presents a light-weight compile-time static analysis to identify such accesses in GPU programs. The analysis relies on a novel abstraction which tracks the access pattern across multiple threads. The abstraction enables quick prediction while providing correctness guarantees. We have implemented the analysis in LLVM and compared it against a dynamic analysis implementation. The static analysis identifies 95 pre-existing uncoalesced accesses in Rodinia, a popular benchmark suite of GPU programs, and finishes within seconds for most programs, in comparison to the dynamic analysis which finds 69 accesses and takes orders of magnitude longer to finish.
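To make "uncoalesced" concrete, the following Python sketch counts how many memory segments one warp touches for a given per-thread stride. The warp size of 32, the 128-byte segment, and the 4-byte element are assumptions chosen as typical values, and the model is a simplification for illustration; it is not the static analysis described above.

```python
def transactions_per_warp(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Number of distinct memory segments touched by one warp.

    Thread `tid` is assumed to access byte address tid * stride_elems * elem_bytes.
    """
    addresses = [tid * stride_elems * elem_bytes for tid in range(warp_size)]
    return len({addr // segment_bytes for addr in addresses})

# Unit stride: neighbouring threads read neighbouring words (coalesced).
print(transactions_per_warp(1))
# Stride of 32 elements: every thread lands in its own segment (uncoalesced).
print(transactions_per_warp(32))
```

Under these assumptions a unit-stride access needs a single segment per warp, while a stride of 32 elements needs 32, one per thread.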
- Anytime Computation and Control for Autonomous Systems
IEEE Transactions on Control Systems Technology, March 2021
The correct and timely completion of the sensing and action loop is of utmost importance in safety critical autonomous systems. Crucial to the performance of this feedback control loop are the computation time and accuracy of the estimator which produces state estimates used by the controller. These state estimators often use computationally expensive perception algorithms like visual feature tracking. With on-board computers on autonomous robots being computationally limited, the computation time of such an estimation algorithm can at times be high enough to result in poor control performance. We develop a framework for codesign of anytime estimation and robust control algorithms, taking into account computation delays and estimation inaccuracies. This is achieved by constructing an anytime estimator from an off-the-shelf perception-based estimation algorithm and obtaining a trade-off curve for its computation time versus estimation error. This is used in the design of a robust predictive control algorithm that at run-time decides a contract, or operation mode, for the estimator in addition to controlling the dynamical system to meet its control objectives at a reduced computation energy cost. This codesign provides a mechanism through which the controller can use the tradeoff curve to reduce estimation delay at the cost of higher inaccuracy, while guaranteeing satisfaction of control objectives. Experiments on a hexrotor platform running a visual-based algorithm for state estimation show how our method results in up to a 10% improvement in control performance while simultaneously saving 5%-6% in computation energy as compared to a method that does not leverage the codesign.
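The role of the trade-off curve can be sketched in a few lines of Python. The curve values, the `max_error` bound, and the selection rule below are all hypothetical; the paper's controller solves a robust predictive control problem, which this sketch does not attempt to reproduce.

```python
def pick_contract(curve, max_error):
    """Pick the fastest estimator mode whose error stays within `max_error`.

    `curve` is a hypothetical list of (delay_ms, error) pairs, one per contract.
    Returns None when no contract is accurate enough.
    """
    feasible = [c for c in curve if c[1] <= max_error]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c[0])  # smallest delay wins

# Hypothetical trade-off curve: longer computation gives lower error.
curve = [(10, 0.30), (25, 0.12), (60, 0.05)]
print(pick_contract(curve, max_error=0.15))
```

When the tolerable error is loose the fastest contract is chosen, and the controller pays for accuracy with delay only when it has to.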
- I-SPY: Context-Driven Conditional Instruction Prefetching with Coalescing
ACM IEEE International Symposium on Microarchitecture (MICRO '20), October 2020
Modern data center applications have rapidly expanding instruction footprints that lead to frequent instruction cache misses, increasing cost and degrading data center performance and energy efficiency. Mitigating instruction cache misses is challenging since existing techniques (1) require significant hardware modifications, (2) expect impractical on-chip storage, or (3) prefetch instructions based on an inaccurate understanding of program miss behavior. To overcome these limitations, we first investigate the challenges of effective instruction prefetching. We then use insights derived from our investigation to develop I-SPY, a novel profile-driven prefetching technique. I-SPY uses dynamic miss profiles to drive an offline analysis of I-cache miss behavior, which it uses to inform prefetching decisions. Two key techniques underlie I-SPY's design: (1) conditional prefetching, which only prefetches instructions if the program context is known to lead to misses, and (2) prefetch coalescing, which merges multiple prefetches of non-contiguous cache lines into a single prefetch instruction. I-SPY exposes these techniques via a family of light-weight hardware code prefetch instructions. We study I-SPY in the context of nine data center applications and show that it provides an average of 15.5% (up to 45.9%) speedup and 95.9% (up to 98.4%) reduction in instruction cache misses, outperforming the state-of-the-art prefetching technique by 22.5%. We show that I-SPY achieves performance improvements that are on average 90.5% of the performance of an ideal cache with no misses.
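Prefetch coalescing can be illustrated with a small Python sketch that packs nearby cache-line numbers into one operation made of a base line and a bitmask. The encoding, the 8-line window, and the names `coalesce` and `expand` are assumptions made for the example; the abstract does not specify the instruction format.

```python
def coalesce(lines, window=8):
    """Merge cache-line numbers into (base, mask) prefetch operations.

    Bit i of `mask` means "prefetch line base + i"; bit 0 is the base itself.
    The bitmask encoding and the window size are illustrative assumptions.
    """
    ops = []
    for line in sorted(set(lines)):
        if ops and line - ops[-1][0] < window:
            base, mask = ops[-1]
            ops[-1] = (base, mask | (1 << (line - base)))  # fold into last op
        else:
            ops.append((line, 1))  # start a new op with this line as base
    return ops

def expand(ops):
    """Recover the list of cache lines that a list of operations covers."""
    return [base + i
            for base, mask in ops
            for i in range(mask.bit_length())
            if (mask >> i) & 1]

ops = coalesce([105, 100, 300, 102])
print(ops, expand(ops))
```

Here four prefetch targets collapse into two operations: lines 100, 102, and 105 share one base, while line 300 is too far away and gets its own.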
- Deterministic Atomic Buffering
ACM IEEE International Symposium on Microarchitecture (MICRO '20), October 2020
Deterministic execution for GPUs is a desirable property as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions that are required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, does not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are isolated initially and made visible in the future in a deterministic order. This allows the GPU to execute deterministically in parallel without having to serialize its threads for atomic operations, unlike GPUDet. Our simulation results show that, for atomic-intensive applications, DAB performs 4× better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations.
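The reason atomic ordering matters for reductions is that floating-point addition is not associative. The Python sketch below contrasts committing updates in arrival order with buffering them and committing in a fixed order; sorting by thread id is an assumption made for illustration and is not claimed to be DAB's hardware scheme.

```python
def commit_in_arrival_order(updates):
    """Apply (thread_id, value) atomic adds in the order they arrive."""
    total = 0.0
    for _tid, value in updates:
        total += value
    return total

def commit_buffered(updates):
    """Buffer all updates, then apply them in a fixed order (by thread id)."""
    total = 0.0
    for _tid, value in sorted(updates):
        total += value
    return total

# Two possible arrival orders of the same four atomic adds.
run_a = [(0, 1e16), (1, 1.0), (2, -1e16), (3, 1.0)]
run_b = [(0, 1e16), (2, -1e16), (1, 1.0), (3, 1.0)]

print(commit_in_arrival_order(run_a), commit_in_arrival_order(run_b))
print(commit_buffered(run_a), commit_buffered(run_b))
```

The two arrival orders give different sums because adding 1.0 to 1e16 is lost to rounding, whereas the buffered version sees the same order in both runs and therefore the same result.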