Publications of E Christopher Lewis

Journal Publications

Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. "Subleties of Transactional Memory Atomicity Semantics," IEEE Computer Architecture Letters, 5(2), July-December 2006.

Marc L. Corliss, E Christopher Lewis, and Amir Roth. "The Implementation and Evaluation of Dynamic Code Decompression using DISE," ACM Transactions on Embedded Computing Systems, 4(1):38-72, February 2005.

Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby. "ZPL: A Machine Independent Programming Language for Parallel Computers," IEEE Transactions on Software Engineering, 26(3):197-211, March 2000.

Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby. "The Case for High-Level Parallel Programming in ZPL," IEEE Computational Science and Engineering, 5(3):76-86, July-September 1998.

Refereed Conferences
Marc L. Corliss, E Christopher Lewis, and Amir Roth, "Low-Overhead Debugging via Flexible Dynamic Instrumentation with DISE," In the Proceedings of the Eleventh International Symposium on High-Performance Computer Architecture (HPCA-11), pages 303-314, February 2005.

Marc L. Corliss, E Christopher Lewis, and Amir Roth, "DISE: A Programmable Macro Engine for Customizing Applications," In the Proceedings of the Thirtieth International Symposium on Computer Architecture (ISCA-30), June 2003.

Marc L. Corliss, E Christopher Lewis, and Amir Roth, "A DISE Implementation of Dynamic Code Decompression," In the Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '03), June 2003.

Sung-Eun Choi and E Christopher Lewis, "A Study of Common Pitfalls in Simple Multi-Threaded Programs," In the Proceedings of the Thirty-first ACM SIGCSE Technical Symposium on Computer Science Education, March 2000.

Bradford L. Chamberlain, E Christopher Lewis, Calvin Lin, and Lawrence Snyder, "Regions: An Abstraction for Expressing Array Computation," In the Proceedings of the 1999 ACM SIGAPL/SIGPLAN International Conference on Array Programming Languages (APL '99), pages 41-49, August 1999.

Bradford L. Chamberlain, E Christopher Lewis, Lawrence Snyder, "Problem Space Promotion and Its Evaluation as a Technique for Efficient Parallel Computation," In the Proceedings of the 13th International Conference on Supercomputing, pages 311-318, June 1999.

E Christopher Lewis, Calvin Lin, and Lawrence Snyder, "The Implementation and Evaluation of Fusion and Contraction in Array Languages," In the Proceedings of the ACM SIGPLAN `98 Conference on Programming Language Design and Implementation (PLDI), pages 50-59, June 1998.

E Christopher Lewis, Calvin Lin, Lawrence Snyder, and George Turkiyyah, "A Portable Parallel N-Body Solver," In the Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pages 331-336, February 1995.

Reviewed Workshop Publications
Matt R. Jacobs and E Christopher Lewis. "SMART C: A Semantic Macro Replacement Translator for C," In the Proceedings of the IEEE International Workshop on Source Code Analysis and Manipulation (SCAM '06), September 2006.

Marc L. Corliss, Vlad Petric, and E Chritopher Lewis. "Dynamic Translation as a Systems Service," In the Proceedings of the Workshop on the Interaction Between Operating Systems and Computer Architecture (WIOSCA) (held in conjunction with ISCA), June 2006.

Colin Blundell, E Christopher Lewis, and Milo M.K. Martin. "Deconstructing Transactional Semantics: The Subtleties of Atomicity," In the Proceedings of the Annual Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with ISCA), June 2005.

Marc L. Corliss, E Christopher Lewis, and Amir Roth. "Using DISE to Protect Return Addresses from Attack," In the Proceedings of the Workshop on Architectural Support for Security and Anti-Virus (held in conjunction with ASPLOS), October 2004.

E Christopher Lewis and Lawrence Snyder, "Pipelining Wavefront Computations: Experiences and Performance," In the Proceedings of the 5th IEEE International Workshop on High-Level Parallel Programming Models and Supportive Environments (held in conjunction with IPDPS), May 2000.

Bradford L. Chamberlain, E Christopher Lewis, and Lawrence Snyder, "Language Support for Pipelining Wavefront Computations," In the Proceedings of the Workshop on Languages and Compilers for Parallel Computing, August 1999.

Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby, "ZPL's WYSIWYG Performance Model," In the Proceedings of the IEEE Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 50-61, March 1998.

Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby, "Factor-Join: A Unique Approach to Compiling Array Languages for Parallel Machines," In the Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 481-500, August 1996.

Technical Reports and Theses
Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. "Unrestricted Transactional Memory: Supporting I/O and System Calls within Transactions," Technical Report CIS-06-09, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, April 2006.

Marc L. Corliss and E Christopher Lewis. "A DISE Framework for Securing Software," Technical Report MS-CIS-05-13, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, April 2005.

Marc L. Corliss, E Christopher Lewis, and Amir Roth. "DISE: Dynamic Instruction Stream Editing," Technical Report MS-CIS-02-24, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, July 2002.

Ibrahim Hur, E Christopher Lewis and Calvin Lin, "Evaluation of Optimizing Multi-Dimensional Shift Communication via Piggybacking," Technical Report, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, Decemeber 1999.

Bradford L. Chamberlain, E Christopher Lewis, and Lawrence Snyder, "A Region-based Approach to Sparse Parallel Computation," Technical Report UW-CSE-98-11-01, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, November 1998.

E Christopher Lewis, "Support for Software Assisted Speculative Execution", Technical Report UW-CSE-98-09-05, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, September 1998.

Calvin Lin, Lawrence Snyder, Ruth Anderson, Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, and W. Derrick Weathersby, "ZPL vs. HPF: A Comparison of Performance and Programming Style," Technical Report UW-CSE-95-11-05, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, November 1995.

Theses
Marc L. Corliss, The Design, Implementation, and Evaluation of the Dynamic Instruction Stream Editor (DISE), PhD Thesis, University of Pennsylvania, 2006.

E Christopher Lewis, Achieving Robust Performance in Parallel Programming Languages, PhD Thesis, University of Washington, 2001.


Abstracts

Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. "Subleties of Transactional Memory Atomicity Semantics," IEEE Computer Architecture Letters, 5(2), July-December 2006. (pdf)

Abstract: Transactional memory has great potential for simplifying multithreaded programming by allowing programmers to specify regions of the program that must appear to execute atomically. Transactional memory implementations then optimistically execute these transactions concurrently to obtain high performance. This work shows that the same atomic guarantees that give transactions their power also have unexpected and potentially serious negative effects on programs that were written assuming narrower scopes of atomicity. We make four contributions: (1) we show that a direct translation of lock-based critical sections into transactions can introduce deadlock into otherwise correct programs, (2) we introduce the terms strong atomicity and weak atomicity to describe the interaction of transactional and non-transactional code, (3) we show that code that is correct under weak atomicity can deadlock under strong atomicity, and (4) we demonstrate that sequentially composing transactional code can also introduce deadlocks. These observations invalidate the intuition that transactions are strictly safer than lock-based critical sections, that strong atomicity is strictly safer than weak atomicity, and that transactions are always composable.


Matt R. Jacobs and E Christopher Lewis. "SMART C: A Semantic Macro Replacement Translator for C," In the Proceedings of the IEEE International Workshop on Source Code Analysis and Manipulation (SCAM '06), September 2006. (pdf)

Abstract: Programmers often want to transform the source or binary representations of their programs (e.g., to optimize, add dynamic safety checks, or add profile gathering code). Unfortunately, existing approaches to program transformation are either disruptive to the source (hand transformation), difficult to implement (ad hoc analysis tools), or functionally limited (macros). We propose an extension to the C programming language called the Semantic Macro Replacement Translator (SMART C). SMART C allows for the specification of very general type-aware transformations of all operations, statements, and declarations of the C programming language without exposing the programmer to the complexities of the system's internal representations. We have implemented a prototype SMART C source-to-source translator and show its use in transforming programs for buffer overflow detection, format string vulnerability detection, and weighted call graph profiling. We show that SMART C achieves a pragmatic balance between generality and ease of use.


Marc L. Corliss, Vlad Petric, and E Chritopher Lewis. "Dynamic Translation as a Systems Service," In the Proceedings of the Workshop on the Interaction Between Operating Systems and Computer Architecture (WIOSCA) (held in conjunction with ISCA), June 2006. (pdf)

Abstract: Dynamic translation is a well-known and powerful technique for transforming programs as they run. Dynamic translators have many uses including profiling, security assurance, dynamic optimization, and bug patching. However, the utility of dynamic translation is severely limited by a lack of integration with the system in which it is used, instead, requiring individual users to initiate and program the translator. As a result, translation is not transparent, cannot be used to protect system-level security, and can only be programmed by a single party (the user).

In this work, we propose integrating dynamic translation with the operating system, providing dynamic translation as an operating system service (DTSS). With DTSS, the OS dynamically translates all applications according to translation rules provided by any entity (with sufficient access rights) in the system. In addition, this paper describes a number of challenges in implementing DTSS and presents solutions to address them. Performance overhead poses the greatest challenge, but code caching can dramatically reduce overhead. That multiple parties may transform a single program is also challenging. We describe how transformations may be composed and how they are isolated from one another. DTSS has great utility beyond traditional dynamic translation systems, but new implementation strategies are required.



Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. "Unrestricted Transactional Memory: Supporting I/O and System Calls within Transactions," Technical Report CIS-06-09, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, April 2006. (pdf)

Abstract: Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, enabling programmers to exploit the soon-to-be-ubiquitous multi-core designs. Transactions are simply segments of code that are guaranteed to execute without interference from other concurrently-executing threads. The hardware executes transactions in parallel, ensuring non-interference via abort/rollback/restart when conflicts are detected. Transactions thus provide both a simple programming interface and a highly-concurrent implementation that serializes only on data conflicts. A progression of recent work has broadened the utility of transactional memory by lifting the bound on the size and duration of transactions, called unbounded transactions. Nevertheless, two key challenges remain: (i) I/O and system calls cannot appear in transactions and (ii) existing unbounded transactional memory proposals require complex implementations.

We describe a system for fully unrestricted transactions (i.e., they can contain I/O and system calls in addition to being unbounded in size and duration). We achieve this via two modes of transaction execution: restricted (which limits transaction size, duration, and content but is highly concurrent) and unrestricted (which is unbounded and can contain I/O and system calls but has limited concurrency because there can be only one unrestricted transaction executing at a time). Transactions transition to unrestricted mode only when necessary. We introduce unoptimized and optimized implementations in order to balance performance and design complexity.



Colin Blundell, E Christopher Lewis, and Milo M.K. Martin, "Deconstructing Transactional Semantics: The Subtleties of Atomicity," In the Proceedings of the Annual Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with ISCA), June 2005. (pdf)

Abstract: Researchers have recently proposed software and hardware support for transactions as a replacement for the traditional lock-based synchronization most common in multithreaded programs. Transactions allow the programmer to specify a region of the program that should appear to execute atomically, while the hardware and runtime system optimistically execute the transactions concurrently to obtain high performance. The transactional abstraction is thus a promising approach for creating both faster and simpler multithreaded programs.

Although transactions have great potential for simplifying multithreaded programming due to their strong atomicity guarantees, this work shows that these same guarantees can have unexpected and potentially serious negative effects on programs that were written assuming weaker synchronization primitives. We make three contributions: (1) we show that a naive translation (statically or dynamically) of lock-based critical sections into transactions can introduce deadlocks into otherwise correct programs, (2) we define an atomicity model for transactions, in which strong atomicity guarantees atomicity between transactions and non-transactional memory operations, while weak atomicity guarantees atomicity only among transactions, and (3) we show that the decision to enforce strong atomicity as opposed to enforcing weak atomicity can also introduce deadlocks. These results invalidate the intuitive idea that transactions are strictly safer than lock-based critical sections and strong atomicity is strictly safer than weak atomicity. We assert that the research community must confront these subtle issues of transactional semantics by exploring the design space and deciding upon the most appropriate semantics for future transactional systems.



Marc L. Corliss and E Christopher Lewis. "A DISE Framework for Securing Software," Technical Report MS-CIS-05-13, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, April 2005. (pdf not yet available)

Abstract: Software security vulnerabilities cost billions of dollars a year. Despite an acute awareness of the causes of these vulnerabilities (e.g., unchecked buffer accesses and improper use of format strings) and the availability of modern type-safe programming languages such as Java, practical concerns, legacy codes, and inertia have resulted in the continued use and deployment of vulnerable software. Nevertheless, vulnerable programs written in unsafe languages may still be protected by dynamically detecting when a vulnerability is exploited (i.e., when a system is attacked).

We propose and evaluate the use of the Dynamic Instruction Stream Editing (DISE) mechanism as an effective and practical framework for attack detection. DISE is a previously-proposed, programmable, hardware facility for dynamically customizing applications by transforming the instruction stream as it is decoded. We use DISE to detect three diverse and dangerous attacks: stack-resident return address corruption, memory-resident pointer corruption, and sensitive data theft or corruption. A DISE-based attack detector separates the application from the detection mechanism so the latter may be changed to fit the needs of the user or to combat new attacks. We find that a DISE-based attack detector is simple, effective, and efficient.



Marc L. Corliss, E Christopher Lewis, and Amir Roth, "Low-Overhead Debugging via Flexible Dynamic Instrumentation with DISE," In the Proceedings of the Eleventh International Symposium on High-Performance Computer Architecture (HPCA-11), pages 303-314, February 2005. (pdf)

Abstract: Breakpoints, watchpoints, and conditional variants of both are essential debugging primitives, but their natural implementations often degrade performance significantly. Slowdown arises because the debugger---the tool implementing the breakpoint/watchpoint interface---is implemented in a process separate from the debugged application. Since the debugger evaluates the watchpoint expressions and conditional predicates to determine whether to invoke the user, a debugging session typically requires many expensive application-debugger context switches, resulting in slowdowns of 40,000 times or more in current commercial and open-source debuggers!

In this paper, we present an effective and efficient implementation of (conditional) breakpoints and watchpoints by dynamically embedding debugger logic into the running application via DISE. DISE (dynamic instruction stream editing) is a previously-proposed, programmable hardware facility for dynamically customizing applications by transforming the fetched instruction stream as it is decoded. DISE embedding preserves the logical separation of application and debugger---instructions are added dynamically and transparently, existing application code and data are not statically modified---and has little startup cost. Cycle-level simulation on the SPEC 2000 integer benchmarks shows that the DISE approach eliminates all unnecessary context switching, typically limits debugging overhead to 25% or less for a wide range of watchpoints, and outperforms alternative implementations.



Marc L. Corliss, E Christopher Lewis, and Amir Roth. "The Implementation and Evaluation of Dynamic Code Decompression using DISE," ACM Transactions on Embedded Computing Systems, 4(1):38-72, February 2005. (pdf)

Abstract: Code compression coupled with dynamic decompression is an important technique for both embedded and general-purpose microprocessors. Post-fetch decompression, in which decompression is performed after the compressed instructions have been fetched, allows the instruction cache to store compressed code but requires a highly efficient decompression implementation. We propose implementing post-fetch decompression using a new hardware facility called dynamic instruction stream editing (DISE). DISE provides a programmable decodersimilar in structure to those in many IA-32 processorsthat is used to add functionality to an application by injecting custom code snippets into its fetched instruction stream. We present a DISE-based implementation of post-fetch decompression and show that it naturally supports customized program-specific decompression dictionaries, enables parameterized decompression allowing similar-but-not-identical instruction sequences to share dictionary entries, and uses no decompression-specific hardware. We present extensive experimental results showing the virtue of this approach and evaluating the factors that impact its efficacy. We also present implementation-neutral results that give insight into the characteristics of any post-fetch decompression technique. Our experiments not only demonstrate significant reduction in code size (up to 35%) but also significant improvements in performance (up to 20%) and energy (up to 10%).


Marc L. Corliss, E Christopher Lewis, and Amir Roth. "Using DISE to Protect Return Addresses from Attack," In the Proceedings of the Workshop on Architectural Support for Security and Anti-Virus (held in conjunction with ASPLOS), pages 65-72, October 2004. (pdf)
Abstract: Stack-smashing by buffer overflow is a common tactic used by viruses and worms to crash or hijack systems. Exploiting a bounds-unchecked copy into a stack buffer, an attacker can---by supplying a specially-crafted and unexpectedly long input---overwrite a stored return address and trigger the execution of code of her choosing. In this paper, we propose to protect code from this common form of attack using dynamic instruction stream editing (DISE), a previously proposed hardware mechanism that implements binary rewriting in a transparent, efficient, and convenient way by rewriting the dynamic instruction stream rather than the static executable. Simply, we define productions (rewriting rules) that instrument program calls and returns to maintain and verify a ``shadow'' stack of return addresses in a protected region of memory. When invalid return addresses are detected, the application is terminated.

The DISE implementation resembles previous software schemes like StackGuard and the Return Address Defender (RAD), but it can operate without source code and in dynamically-linked libraries and dynamically-generated code. It also has natural facilities for protecting the shadow stack, which provides little security if it itself is vulnerable. Finally, unlike software instrumentation, DISE checks---which are inserted by the processor at runtime---cannot be bypassed or subverted.



Marc L. Corliss, E Christopher Lewis, and Amir Roth, "DISE: A Programmable Macro Engine for Customizing Applications," In the Proceedings of the Thirtieth International Symposium on Computer Architecture (ISCA-30), June 2003.(pdf)
Dynamic Instruction Stream Editing (DISE) is a cooperative software---hardware scheme for efficiently adding customization functionality---e.g, safety/security checking, profiling, dynamic code decompression, and dynamic optimization---to an application. In DISE, application customization functions (ACFs) are formulated as rules for macro-expanding certain instructions into parameterized instruction sequences. The processor executes the rules on the fetched instructions, feeding the execution engine an instruction stream that contains ACF code. Dynamic instruction macro-expansion is widely used in many of today's processors to convert a complex ISA to an easier-to-execute, finer-grained internal form. DISE co-opts this technology and adds a programming interface to it.

DISE unifies the implementation of a large class of ACFs that would otherwise require either special-purpose hardware widgets or static binary rewriting. We show DISE implementations of two ACFs---memory fault isolation and dynamic code decompression---and their composition. Simulation shows that DISE ACFs have better performance than their software counterparts, and more flexibility (which sometimes translates into performance) than hardware implementations.



Marc L. Corliss, E Christopher Lewis, and Amir Roth, "A DISE Implementation of Dynamic Code Decompression," In the Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES '03), June 2003. (pdf)
Abstract: Code compression coupled with dynamic decompression is an important technique for both embedded and general-purpose microprocessors. Post-fetch decompression, in which decompression is performed after the compressed instructions have been fetched, allows the instruction cache to store compressed code but requires a highly efficient decompression implementation. We propose implementing post-fetch decompression using dynamic instruction stream editing (DISE), a programmable decoder---similar in structure to those in many IA32 processors---that is used to add functionality to an application by injecting custom code snippets into its fetched instruction stream. A DISE implementation of post-fetch decompression naturally supports customized program-specific decompression dictionaries, enables parameterized decompression allowing similar instruction sequences to share dictionary entries, and uses no decompression-specific hardware. Cycle-level simulation of DISE decompression shows that it can reduce static program size by 35% and execution time by 20%. Parameterized decompression, a feature unique to DISE, accounts for 20% of the code size reduction by making more effective use of the dictionary and allowing PC-relative branches to be included in compressed sequences. DISE-based compression can reduce total energy consumption by 10% and the energy-delay product by as much as 20%.


Marc L. Corliss, E Christopher Lewis, and Amir Roth. "DISE: Dynamic Instruction Stream Editing," Technical Report MS-CIS-02-24, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, July 2002. (pdf)

Abstract: Dynamic Instruction Stream Editing (DISE) is a cooperative software-hardware scheme for efficiently adding meta-featurese.g., safety/security checking, profiling, and dynamic code decompressionto an application. DISE implements meta-features by providing a programmable processor decoder that macro-expands certain instructions into parameterized instruction sequences. The technique is similar in spirit to programmable microcode and exploits the macro-expansion-style decoding technology already present in many of todays processors.

DISE is a single facility that unifies the implementation of a large class of meta-features previously requiring either special-purpose hardware widgets or software-only tools like binary rewriters. Meta-features implemented via DISE require no new hardware widgets and avoid many of the costs of binary rewriting, primarily static code bloat and the fixed overhead of the rewriting process itself. The dynamic character of DISE makes it appropriate for meta-features with fine grain intermittent usage patterns (e.g., sampled profiling) as well as applications that have no natural software implementations (e.g., dynamic code decompression).

In this paper, we introduce DISE and show how it can be used to formulate three meta-features: path profiling, memory fault isolation, and dynamic code decompression. We find DISE to be general, powerful, and efficient. We describe the requirements of a DISE implementation and show one particular design. We also discuss the implications of DISE on all facets of a system. Using cycle-level simulation we show that DISE-implemented meta-features have competitive raw performance and a profitable dynamic cost structure.



E Christopher Lewis, Achieving Robust Performance in Parallel Programming Languages, PhD Thesis, University of Washington, 2001. (postscript, pdf)
Abstract: Despite more than two decades of research effort, the question remains: how can we realize the potential of large-scale parallel machines? It can be done now, but only at great expense (i.e., development time and effort) and with limited portability, rendering the exploitation of parallelism impractical for most users. Advanced--ZPL (A--ZPL) is a parallel programming language intended to address this problem. It's design was guided by a predictive performance model that clearly defines the role of the programmer and the compiler, called the programmer-compiler separation. The former is responsible for abstract parallel and sequential algorithmic issues, while the latter manages the tractable elements of mapping abstract representations to a particular machine. This dissertation evaluates the design and implementation of A--ZPL in the light of this design criteria. Specifically, we examine two aspects of the language and the compiler implications: efficient loop generation and pipelining wavefront computations. We find the language is highly effective both relatively and absolutely as a direct consequence of considering the programmer-compiler separation.


Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby. "ZPL: A Machine Independent Programming Language for Parallel Computers," IEEE Transactions on Software Engineering, 26(3):197-211, March 2000. (postscript, pdf)
Abstract: The goal of producing architecture-independent parallel programs is complicated by the competing need for high performance. The ZPL programming language achieves both goals by building upon an abstract parallel machine and by providing programming constructs that allow the programmer to "see" this underlying machine. This paper describes ZPL and provides a comprehensive evaluation of the language with respect to its goals of performance, portability, and programming convenience. In particular, we describe ZPL's machine-independent performance model, describe the programming benefits of ZPL's region-based constructs, summarize the compilation benefits of the language's high-level semantics, and summarize empirical evidence that ZPL has achieved both high performance and portability on diverse machines such as the IBM SP-2, Cray T3E, and SGI Power Challenge.


Ibrahim Hur, E Christopher Lewis, and Calvin Lin, "Evaluation of Optimizing Multi-Dimensional Shift Communication via Piggybacking," Technical Report, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, Decemeber 1999.
Abstract: This paper evaluates the benefits of the piggybacking communication optimization for multi-dimensional shifts. This optimization has appeared in at least one commercial HPF compiler but has never been extensively evaluated. This paper studies the effects of this optimization on different machines, and explores the effects of problem size and problem dimensionality on this optimization. We use microbenchmarks written in MPI to evaluate the optimization's effects on different parallel machines, and we use the LogGP model to understand the behavior. We find that the effects of the optimization are complex, but that the LogGP model was able to capture first order effects.


E Christopher Lewis and Lawrence Snyder, "Pipelining Wavefront Computations: Experiences and Performance," In the Proceedings of the 5th IEEE International Workshop on High-Level Parallel Programming Models and Supportive Environments (held in conjunction with IPDPS), May 2000. (postscript, pdf)
Abstract: Wavefront computations are common in scientific applications. Although it is well understood how wavefronts are pipelined for parallel execution, the question remains: How are they best presented to the compiler for the effective generation of pipelined code? We address this question through a quantitative and qualitative study of three approaches to expressing pipelining: programmer implemented via message passing, compiler discovered via automatic parallelization, and programmer defined via explicit parallel language features for pipelining. This work is the first assessment of the efficacy of these approaches in solving wavefront computations, and in the process, we reveal surprising characteristics of commercial compilers. We also demonstrate that a parallel language-level solution simplifies development and consistently performs well.


Sung-Eun Choi and E Christopher Lewis, "A Study of Common Pitfalls in Simple Multi-Threaded Programs," In the Proceedings of the Thirty-first ACM SIGCSE Technical Symposium on Computer Science Education, March 2000. (postscript, pdf)
Abstract: It is generally acknowledged that developing correct multi-threaded codes is difficult, because threads may interact with each other in unpredictable ways. The goal of this work is to discover common multi-threaded programming pitfalls, the knowledge of which will be useful in instructing new programmers and in developing tools to aid in multi-threaded programming. To this end, we study multi-threaded applications written by students from introductory operating systems courses. Although the applications are simple, careful inspection and the use of an automatic race detection tool reveal a surprising quantity and variety of synchronization errors. We describe and discuss these errors, evaluate the role of automated tools, and propose new tools for use in the instruction of multi-threaded programming.


Bradford L. Chamberlain, E Christopher Lewis, Calvin Lin, and Lawrence Snyder, "Regions: An Abstraction for Expressing Array Computation," In the Proceedings of the 1999 ACM SIGAPL/SIGPLAN International Conference on Array Programming Languages (APL '99), pages 41-49, August 1999. (postscript, pdf)
Abstract: Most array languages, such as Fortran 90, Matlab, and APL, provide support for referencing arrays by extending the traditional array subscripting construct found in scalar languages. We present an alternative approach that exploits the concept of regions -- a representation of index sets that can be named, manipulated with high-level operators, and syntactically separated from array references. This paper develops the concept of region-based programming and describes its benefits in the context of an idealized array language called RL. We show that regions simplify programming, reduce the likelihood of errors, and enable code reuse. Furthermore, we describe how regions accentuate the locality of array expressions and how this locality is important when targeting parallel computers. We also show how the concepts of region-based programming have been used in ZPL, a fully-implemented practical parallel programming language in use by scientists and engineers. In addition, we contrast region-based programming with the array reference constructs of other array languages.


Bradford L. Chamberlain, E Christopher Lewis, and Lawrence Snyder, "Language Support for Pipelining Wavefront Computations," In the Proceedings of the Workshop on Languages and Compilers for Parallel Computing, August 1999. (postscript, pdf)
Abstract: Wavefront computations, characterized by a data dependent flow of computation across a data space, are receiving increasing attention as an important class of parallel computations. Though sophisticated compiler optimizations can often produce efficient pipelined implementations from sequential representations of these computations, we argue that a language-based approach to representing wavefront computations is a more practical technique. A language-based approach is simple for the programmer yet unambiguously parallel. In this paper we introduce simple array language extensions that directly support wavefront computations. We describe their implementation, and we give performance data showing the importance of parallelizing these codes.


Bradford L. Chamberlain, E Christopher Lewis, Lawrence Snyder, "Problem Space Promotion and Its Evaluation as a Technique for Efficient Parallel Computation," In the Proceedings of the 13th International Conference on Supercomputing, pages 311-318, June 1999. (postscript, pdf)
Abstract: In this paper we describe a parallel programming paradigm called problem space promotion (PSP), a technique that increases parallelism by reducing communication and synchronization. We present four algorithms that exploit PSP and evaluate their communication characteristics relative non-PSP solutions. Our analysis is aided by the use of parallel algorithm notation that is concise, yet accurately reflects parallelism and communication costs. Our analysis illustrates circumstances under which the use of PSP is beneficial and detrimental to performance, and experiments on the Cray T3E attest to the validity of the analysis. We find that PSP can significantly improve the performance and scaling behavior of certain computations, even when compared to existing high quality parallel algorithms.


Bradford L. Chamberlain, E Christopher Lewis, and Lawrence Snyder, "A Region-based Approach to Sparse Parallel Computation," Technical Report UW-CSE-98-11-01, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, November 1998.
Abstract: This paper introduces and evaluates programming language abstractions for explicitly managing sparse computations at the source language level. This is accomplished by extending the array-language concept of regions---regular programmer-specified index sets used for specifying array computations. We introduce the notion of sparse regions which can represent an arbitrary set of indices. Sparse regions inherit the benefits of regular regions, including conciseness, direct encapsulation of parallelism, and support for language performance models that highlight parallel overheads. We show that region-based array languages can benefit from the use of sparse regions, both in terms of the semantic richness available to the programmer and the performance of the resulting program. We also demonstrate that regions result in efficient implementations as compared to array-based approaches, due to their role in amortizing sparse overheads and enabling optimizations.


E Christopher Lewis, "Support for Software Assisted Speculative Execution", Technical Report UW-CSE-98-09-05, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, September 1998. (postscript, pdf)
Abstract: Computer architects strive to improve machine performance by exploiting parallelism, but control flow and data dependences limit available parallelism. Speculative execution enhances parallelism by selectively ignoring the constraints of control flow and data dependences, thereby executing instructions before it it known whether they are needed or correct. Software assisted speculative execution is a form of this tack where the running program directs the hardware in what instructions should be speculatively executed and how. This report identifies and characterized the fundamental architectural, implementation, and compiler issues of software assisted speculative execution. These issues serve as the basis for describing, comparing and contrasting proposed architectures from the literature.


Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby. "The Case for High-Level Parallel Programming in ZPL," IEEE Computational Science and Engineering, 5(3):76-86, July-September 1998. (postscript, pdf)
Abstract: Message-passing programs are efficient, but fall short on convenience and portability. ZPL is a high-level language that offers competitive performance and portability, as well as programming convenience lacking in low-level approaches.


E Christopher Lewis, Calvin Lin, and Lawrence Snyder, "The Implementation and Evaluation of Fusion and Contraction in Array Languages," In the Proceedings of the ACM SIGPLAN `98 Conference on Programming Language Design and Implementation (PLDI), pages 50-59, June 1998. (postscript, pdf)
Abstract: Array languages such as Fortran 90, HPF and ZPL have many benefits in simplifying array-based computations and expressing data parallelism. However, they can suffer large performance penalties because they introduce intermediate arrays---both at the source level and during the compilation process---which increase memory usage and pollute the cache. Most compilers address this problem by simply scalarizing the array language and relying on a scalar language compiler to perform loop fusion and array contraction. We instead show that there are advantages to performing a form of loop fusion and array contraction at the array level. This paper describes this approach and explains its advantages. Experimental results show that our scheme typically yields runtime improvements of greater than 20% and sometimes up to 400%. In addition, it yields superior memory use when compared against commercial compilers and exhibits comparable memory use when compared with scalar languages. We also explore the interaction between these transformations and communication optimizations.


Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby, "ZPL's WYSIWYG Performance Model," In the Proceedings of the IEEE Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 50-61, March 1998. (postscript, pdf)
Abstract: ZPL is a parallel array language designed for high performance scientific and engineering computations. Unlike other parallel languages, ZPL is founded on a machine model (the CTA) that accurately abstracts contemporary MIMD parallel computers. This makes it possible to correlate ZPL programs with machine behavior. As a result, programmers can reason about how code will perform on a typical parallel machine and thereby make informed decisions between alternative programming solutions. This paper describes ZPL's performance model and its syntactic cues for conveying operation cost. The what-you-see-is-what-you-get (WYSIWYG) nature of ZPL operations is demonstrated on the IBM SP-2, Intel Paragon, SGI Power Challenge, and Cray T3E. Additionally, the model is used to evaluate two algorithms for matrix multiplication. Experiments show that the performance model correctly predicts the faster solution on all four platforms for a range of problem sizes.


Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby, "Factor-Join: A Unique Approach to Compiling Array Languages for Parallel Machines," In the Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 481-500, August 1996. (postscript, pdf)
Abstract: This paper describes a new approach to compiling and optimizing array languages for parallel machines. This approach first decomposes array language operations into factors, where each factor corresponds to a different communication or computation structure. Optimizations are then achieved by combining, or joining, these factors. Because factors preserve high level information about array operations, the analysis necessary to perform these join operations is simpler than that required for scalar programs. In particular, we show how data parallel programs written in the ZPL programming language are compiled and optimized using the factor-join approach, and we show that a small number of factors are sufficient to represent ZPL programs.


Calvin Lin, Lawrence Snyder, Ruth Anderson, Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, and W. Derrick Weathersby, "ZPL vs. HPF: A Comparison of Performance and Programming Style," Technical Report UW-CSE-95-11-05, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, November 1995. (postscript, pdf)
Abstract: This paper compares two data parallel languages, ZPL and HPF, in terms of programming style and performance. The results show that for eight programs from a number of standard benchmark suites, ZPL generally outperforms HPF, and ZPL expresses problems at higher levels of abstraction, yielding programs that are shorter, less error prone and easier to maintain. ZPL's better performance comes from its clean expression of parallelism that allows for better compiler analysis.


E Christopher Lewis, Calvin Lin, Lawrence Snyder, and George Turkiyyah, "A Portable Parallel N-Body Solver," In the Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pages 331-336, February 1995. (postscript, pdf)
Abstract: We present parallel solutions for direct and fast n-body solvers written in the ZPL language. We describe the direct solver, compare its performance against a sequential C program, and show performance results for two very different parallel machines: the KSR-2 and the Paragon. We also discuss the implementation of the fast solver in ZPL, including factors pertinent to data movement.