ON-Core

One of the big questions in computer architecture today is whether improvements in single-thread performance are still necessary and, if so, whether they are still possible. We, for one, certainly believe that they are necessary. And necessity being the mother of invention (although Jared Diamond makes a clever case that it's actually the other way around), we believe that they are also possible. Projects like Washington's WaveScalar further substantiate this belief.

The fundamental challenge to maintaining and improving single-thread performance is hiding memory latency, and the conceptually simplest way to hide memory latency is to build a large instruction window. However, building a large instruction window begets two secondary challenges: overcoming the physical complexity of building a large window and overcoming branch mispredictions, which limit the utilization of a large window. Because of these challenges, people have proposed architectures that use hierarchy and/or distribution to "fake" large windows. Some hierarchical designs, like Wisconsin's Multiscalar, can actually use the hierarchy to overcome branch mispredictions too. Researchers have also proposed architectures that use redundant execution to hide memory latency "beyond the window". Redundant execution techniques include a lightweight run-ahead execution mode and multi-threaded pre-execution, which can actually help with branch mispredictions as well. Of course, WaveScalar uses dataflow mechanisms to establish large windows and overcome branch mispredictions.

In this project, we attempt the direct approach and try to actually build a large window. We overcome the challenges associated with physical complexity and branch mispredictions not by stepping outside the superscalar paradigm, but by redesigning key algorithms and structures within the confines of the paradigm. This approach allows us to attack individual problem areas separately. It also means that our design can be adopted piecemeal or wholesale, and that individual solutions can have impact regardless of when (and even whether) the large-window design is ultimately adopted.

The first problem area we have attacked is the in-flight memory system. The relevant algorithms are memory-dependence violation detection and in-flight memory communication (i.e., store-load forwarding), and the relevant structures are the load and store queues. The problem with conventional designs is that they implement both violation detection and store-load forwarding using age-ordered associative search of the load queue and store queue, and associative search does not scale to large sizes or high bandwidths. We have a design for an in-flight memory system that does not use associative search at all. For store-load forwarding, we use memory-dependence prediction to predict, for each load, the single in-flight store most likely to forward to it. Load execution does not associatively search the store queue but instead makes a single indexed read at the predicted location. As usual, forwarding takes place only if the load's address and the predicted store's address match. Speculative indexed forwarding works because the combination of address checking and the inherent stability and predictability of store-load communication means that store queue indices can be predicted with accuracies of 99.9%. For detecting memory-dependence violations (and forwarding mispredictions), we use filtered in-order load re-execution immediately prior to commit. In-order load re-execution as a mechanism for verifying load speculation with respect to older stores is not a new idea. Our contribution is the Store Vulnerability Window (SVW) mechanism, which can reduce load re-execution rates by factors of 30-50. If only 2-3% of all loads have to re-execute, re-execution can share a data cache port with store commit with no performance penalty.
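As a rough illustration of speculative indexed forwarding, the Python sketch below models a store queue that a load reads at one predicted index instead of searching associatively. It is a toy software model under stated assumptions, not the hardware design from the papers: the names `IndexedStoreQueue`, `execute_load`, and `reexecute_at_commit` are hypothetical, the predicted index is simply passed in rather than produced by a memory-dependence predictor, and the commit-time check is plain unfiltered re-execution (the SVW filter is not modeled).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StoreEntry:
    addr: int
    value: int


class IndexedStoreQueue:
    """Toy circular store queue that is read by index, never searched."""

    def __init__(self, size: int) -> None:
        self.entries: List[Optional[StoreEntry]] = [None] * size
        self.tail = 0  # next slot to allocate

    def allocate(self, addr: int, value: int) -> int:
        """Insert an in-flight store and return its queue index."""
        idx = self.tail
        self.entries[idx] = StoreEntry(addr, value)
        self.tail = (self.tail + 1) % len(self.entries)
        return idx

    def read(self, idx: int) -> Optional[StoreEntry]:
        """Single indexed read; no address comparison across entries."""
        return self.entries[idx]


def execute_load(addr: int, predicted_idx: Optional[int],
                 sq: IndexedStoreQueue,
                 memory: Dict[int, int]) -> Tuple[int, bool]:
    """Return (value, forwarded) for a speculatively executed load.

    Forwarding happens only if the one predicted entry's address matches
    the load's address; otherwise the load reads (possibly stale) memory.
    """
    if predicted_idx is not None:
        entry = sq.read(predicted_idx)
        if entry is not None and entry.addr == addr:
            return entry.value, True
    return memory.get(addr, 0), False


def reexecute_at_commit(addr: int, speculative_value: int,
                        memory: Dict[int, int]) -> bool:
    """Unfiltered check: re-read memory once older stores have committed.

    Returns True if the speculative value was correct.
    """
    return memory.get(addr, 0) == speculative_value


memory = {0x100: 1}
sq = IndexedStoreQueue(8)
i0 = sq.allocate(0x100, 42)   # older store to the load's address
i1 = sq.allocate(0x200, 7)    # unrelated store

good = execute_load(0x100, i0, sq, memory)  # correct index prediction
bad = execute_load(0x100, i1, sq, memory)   # wrong index: address mismatch

# Older stores commit in order before the load re-executes.
memory[0x100] = 42
memory[0x200] = 7

print(good, reexecute_at_commit(0x100, good[0], memory))
print(bad, reexecute_at_commit(0x100, bad[0], memory))
```

In this model a wrong index prediction never forwards a wrong value, because of the address check, but it can leave the load with stale data from memory. The commit-time re-execution is what catches that case, which is why forwarding and verification are designed together.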

We also have ideas for redesigning register communication, branch misprediction recovery, register renaming, and fetch.

This work is supported by NSF CAREER award CCR-0238203 and NSF CPA grant CCF-0541292.

People

Publications

NoSQ: Store-Load Forwarding without a Store Queue. (pdf)
Tingting Sha, Milo M.K. Martin and Amir Roth.
39th International Symposium on Microarchitecture (MICRO-39), Dec. 9-13, 2006.

Store Vulnerability Window (SVW): A Filter and Potential Replacement for Load Re-Execution. (pdf)
Amir Roth.
Journal of Instruction Level Parallelism, Vol. 8, 2006.

Scalable Store-Load Forwarding via Store Queue Index Prediction. (pdf)
Tingting Sha, Milo M.K. Martin and Amir Roth.
In proc. of 38th International Symposium on Microarchitecture (MICRO-38), Nov. 14-16, 2005.

Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization. (pdf)
Amir Roth.
In proc. of ISCA-32, Jun. 6-8, 2005.

A High-Bandwidth Load/Store Unit for Single- and Multi-Threaded Processors. (pdf)
Amir Roth.
CIS Technical Report MS-CIS-04-09, Jun. 2004.