Understanding Large System Failures - A Fault Injection Experiment

Ram Chillarege and Nicholas S. Bowen
IBM Thomas J. Watson Research Center

Published: IEEE International Symposium on Fault Tolerant Computing, 1989.

Abstract:
This paper uses fault injection to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system and this paper:

  • Introduces the idea of failure acceleration to conduct such experiments.
  • Estimates total loss of the primary service to occur in only 16% of the faults.
  • Reveals errors termed potential hazards that do not affect short term availability but cause catastrophic failure following a change in operating state.
  • Identifies at least 41% of errors as potential candidates for repair before total failure. Estimates are provided of the likely location of such errors in terms of storage area and code-data split.

These results enhance our understanding of large system failures and provide a foundation for design enhancements and availability modeling.

Failure Acceleration

An important attribute of a fault injection experiment is the ability to measure in the lab what cannot be measured in the field. This covers many aspects of the failure process including, when possible, cause and effect analysis. Interestingly, there seems to be a mechanism that allows a number of such measurements if a few conditions are met, or if the system can be stressed towards meeting these conditions. These conditions result in accelerating the failure process. We first define failure acceleration formally and then discuss its attributes.


Excerpt from Section II. DESIGN OF THE EXPERIMENT

--------------

Definition - Failure Acceleration:
The failure process is said to be accelerated when the fault model is not altered and:

  1. The fault latency is decreased.
  2. The error latency is decreased.
  3. The probability of a fault causing a failure is increased.

Definition - Maximum Failure Acceleration:
The failure process is said to be maximally accelerated when the fault model is not altered and:

  1. The fault latency is zero.
  2. The error latency is a minimum.
  3. The probability of a fault causing a failure is maximized.
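
The two definitions above can be read as predicates over a description of the failure process. The following Python fragment is a minimal sketch of that reading, not code from the paper. The `FailureProcess` record, its field names, the function names, and the `min_error_latency` and `max_p_failure` bounds are all hypothetical names introduced for illustration. It also assumes that all three numbered conditions must hold together and that the achievable bounds are known to the experimenter.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class FailureProcess:
    """Hypothetical summary of a failure process (names are illustrative)."""
    fault_model: str      # label identifying the fault model in use
    fault_latency: float  # fault latency, in arbitrary time units
    error_latency: float  # error latency, in the same time units
    p_failure: float      # probability that a fault causes a failure


def is_accelerated(baseline: FailureProcess, candidate: FailureProcess) -> bool:
    """True if `candidate` is accelerated relative to `baseline`.

    Assumes all three conditions of the definition must hold at once,
    with the fault model left unaltered.
    """
    return (
        candidate.fault_model == baseline.fault_model
        and candidate.fault_latency < baseline.fault_latency
        and candidate.error_latency < baseline.error_latency
        and candidate.p_failure > baseline.p_failure
    )


def is_maximally_accelerated(
    candidate: FailureProcess,
    fault_model: str,
    min_error_latency: float,
    max_p_failure: float,
) -> bool:
    """True if `candidate` meets the maximum-acceleration conditions.

    `min_error_latency` and `max_p_failure` are assumed, externally
    supplied bounds for the system under test.
    """
    return (
        candidate.fault_model == fault_model
        and candidate.fault_latency == 0
        and math.isclose(candidate.error_latency, min_error_latency)
        and math.isclose(candidate.p_failure, max_p_failure)
    )


# Illustrative values only; they are not measurements from the experiment.
field = FailureProcess("model-A", fault_latency=120.0, error_latency=30.0, p_failure=0.1)
lab = FailureProcess("model-A", fault_latency=0.0, error_latency=5.0, p_failure=0.6)

print(is_accelerated(field, lab))
print(is_maximally_accelerated(lab, "model-A", min_error_latency=5.0, max_p_failure=0.6))
```

Comparing the fault model by label is a simplification: it only records that the model was not altered, and says nothing about how one would verify that in practice.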

-----------------

Scanned copies of the paper from the IEEE Proceedings of the Symposium on Fault Tolerant Computing, FTCS 1989, pages 356-363.