Moore's Law and architectural advances enable exponential improvements in computer performance. These advances, however, degrade two aspects of system usability. First, system availability decreases, because shrinking transistor sizes and growing transistor counts lead to more hardware faults. Second, system designability suffers, because architects can design more complicated systems, which makes both design and verification more challenging. Regaining these losses in availability and designability, which are even more pronounced in multiprocessor systems, requires architectural techniques for tolerating device and design faults.
I will present a system-wide hardware checkpoint/recovery scheme, called SafetyNet, that addresses both availability and designability. When a device fault or a design fault occurs, SafetyNet recovers the system to a consistent, pre-fault checkpoint state. SafetyNet uses logical time to coordinate checkpoints efficiently across the system, and it uses "logically atomic" coherence transactions to keep checkpoints free of transient coherence state. Runtime overhead is minimized by pipelining checkpoint validation with subsequent parallel execution.
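As a rough software illustration of the checkpoint/recovery idea, and not a description of SafetyNet's actual hardware, the sketch below models one component that numbers its checkpoints with a logical-time counter, keeps running while older checkpoints await validation, and rolls back to the most recent validated checkpoint on a fault. The class `CheckpointedMemory`, its methods, and the per-interval undo log are all hypothetical names and assumptions chosen for this example; the abstract does not specify how state is saved.

```python
# Illustrative sketch only: a software model of checkpoint/recovery for a
# single component. All names here are hypothetical, not SafetyNet's design.

_ABSENT = object()  # marks an address that had no value before the write


class CheckpointedMemory:
    def __init__(self):
        self.data = {}
        # Checkpoint k is the state at the start of interval k.
        self.current = 0      # interval now executing (logical time)
        self.validated = 0    # newest checkpoint known to be fault-free
        # Assumption: each interval logs the old value on the first write
        # to an address, so the interval can be undone later.
        self.undo = {0: {}}

    def write(self, addr, value):
        log = self.undo[self.current]
        if addr not in log:
            log[addr] = self.data.get(addr, _ABSENT)
        self.data[addr] = value

    def take_checkpoint(self):
        # Advance logical time; execution continues without waiting
        # for earlier checkpoints to be validated.
        self.current += 1
        self.undo[self.current] = {}

    def validate(self, upto):
        # Declare all intervals before `upto` fault-free and drop their logs.
        upto = min(upto, self.current)
        for interval in range(self.validated, upto):
            del self.undo[interval]
        self.validated = max(self.validated, upto)

    def recover(self):
        # Undo unvalidated intervals, newest first, restoring the state
        # held at the validated checkpoint.
        for interval in range(self.current, self.validated - 1, -1):
            for addr, old in self.undo[interval].items():
                if old is _ABSENT:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.current = self.validated
        self.undo = {self.validated: {}}


if __name__ == "__main__":
    mem = CheckpointedMemory()
    mem.write("x", 1)
    mem.take_checkpoint()   # checkpoint 1 holds x = 1
    mem.write("x", 2)
    mem.take_checkpoint()   # checkpoint 2 holds x = 2
    mem.write("x", 3)
    mem.validate(1)         # only checkpoint 1 is known fault-free so far
    mem.recover()           # a fault is detected: roll back to checkpoint 1
    print(mem.data)
```

In this model, validation lags behind execution, which loosely mirrors the idea of pipelining checkpoint validation with continued execution. A multiprocessor version would additionally need every component to agree on the logical time at which each checkpoint is taken, which this single-component sketch omits.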
I will illustrate how SafetyNet avoids system crashes caused by both soft and hard faults. Using full-system simulation of a 16-way multiprocessor running commercial workloads, I will show that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.