Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems

Mark Sullivan and Ram Chillarege IBM Thomas J. Watson Research Center

Abstract

In recent years, software defects have become the dominant cause of customer outage, and improvements in software reliability and quality have not kept pace with those of hardware. Yet software defects are not well enough understood to provide a clear methodology for avoiding or recovering from them. To gain the necessary insight, we study defects reported between 1986 and 1989 on a high-end operating system product. We compare a typical defect (regular) to one that corrupts a program’s memory (overlay), given that overlays are considered by field services to be particularly hard to find and fix.

This paper:

  • Shows that the impact of an overlay defect is, on average, much higher than that of a regular defect.
  • Defines error types to classify the programming mistakes that cause software to fail.
  • Defines error triggers to classify the events that cause latent errors in programs to surface. The error trigger distribution points to the events and environments that are probably inadequately tested.
  • Shows that boundary conditions and allocation management are the major causes of overlay defects, not timing.
  • Shows that most overlays are small and not too far from their intended destination.

Further analyses are provided on defects in fixes to other defects, on symptoms, and on their impact. These results provide a baseline understanding useful to designers and developers. The data will also help develop realistic fault models for use in fault-injection experiments.

Published: The 21st IEEE International Symposium on Fault-tolerant Computing, 1991.

Summary

This paper uses five years of field data on software defects to develop a taxonomy of defects, providing insight into their behaviour and impact. The data comes from IBM’s field service database called RETAIN. This paper focuses on those software defects reported against a specific (unnamed) high-end operating system product.

This study is performed against the backdrop of a computer industry faced with tremendous challenges in software reliability, quality, and availability. Recent studies have demonstrated that while hardware reliability has improved tremendously in the past five years, software reliability has not. Unless software reliability improves, it will limit the total reliability and availability possible in computer systems.

Software errors found in the field are fundamentally different from classical hardware errors. Like hardware design errors, once fixed, they will not reappear. It is important to understand the types of software errors that remain undiscovered after system test and the conditions in a customer environment that allow them to surface. We have chosen to call these attributes the error type and the trigger, respectively. This paper provides distributions of both the error type and the trigger, and provides customer impact information for each of these attributes. The paper focuses on a particular type of defect that field service personnel call the overlay: an error that corrupts program memory. The overlay defect is contrasted with the typical defect, herein referred to as the regular defect.
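As a rough illustration of how defect reports could be tabulated by these two attributes, the sketch below computes an attribute distribution and an impact rate per group. The record layout, the function names, and the sample records are all hypothetical; they are not the RETAIN schema and not data from the study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DefectReport:
    """Hypothetical record layout, assumed for illustration only."""
    error_type: str   # the programming mistake, e.g. "allocation management"
    trigger: str      # the event that surfaced it, e.g. "recovery"
    is_overlay: bool  # True if the defect corrupted program memory
    caused_ipl: bool  # True if the failure forced an IPL (system restart)


def distribution(reports, attribute):
    """Return {value: fraction of reports} for the named attribute."""
    counts = Counter(getattr(r, attribute) for r in reports)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def ipl_rate(reports):
    """Fraction of reports whose failure forced an IPL (0.0 if empty)."""
    reports = list(reports)
    if not reports:
        return 0.0
    return sum(r.caused_ipl for r in reports) / len(reports)


# Made-up records for illustration only; not data from the study.
SAMPLE = [
    DefectReport("boundary condition", "boundary conditions", True, True),
    DefectReport("allocation management", "recovery", True, False),
    DefectReport("undefined state", "timing", False, False),
    DefectReport("undefined state", "boundary conditions", False, False),
]

error_type_share = distribution(SAMPLE, "error_type")
overlay_ipl_rate = ipl_rate(r for r in SAMPLE if r.is_overlay)
regular_ipl_rate = ipl_rate(r for r in SAMPLE if not r.is_overlay)
```

Comparing `overlay_ipl_rate` with `regular_ipl_rate` mirrors the kind of overlay versus regular comparison the paper reports, though the numbers here carry no meaning beyond the example.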

The study finds:

  • (1) Overlay defects have, on average, a much higher impact on the system than regular defects. Impact is measured by a defect's probability of causing an IPL (initial program load, that is, a system restart), its probability of receiving a severity 1 rating, and its probability of being flagged as ‘‘highly pervasive’’ across the customer base.
  • (2) Most overlay defects are due to boundary condition and allocation problems. Contrary to popular folklore, they are less likely to result from timing or synchronization problems.
  • (3) Most overlays are small (on the order of a few bytes) and occur near the address the software was supposed to write. Fewer than a fifth of the overlays cause wild stores in a process address space.
  • (4) Non-overlay defects are dominated by undefined state errors in which the module implementing a network or device protocol mistakes the current protocol state and goes into a wait or deadlock state.
  • (5) Untested boundary conditions in the software trigger a majority of failures. Recovery and timing-triggered failures have slightly higher impact than failures triggered by boundary conditions.
  • (6) Errors introduced by fixes to other errors are mostly caused by mismatches in data types and interfaces.
  • (7) While overlay errors are more likely to cause addressing faults than non-overlay errors, most overlay errors do not cause the system to take an address fault. This suggests that the corrupted data can actually be used before the failure occurs, making error propagation more likely.
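Findings (2), (3), and (7) can be made concrete with a toy model. The sketch below simulates memory as a Python `bytearray` holding an 8-byte buffer followed by an adjacent 4-byte field, and uses an off-by-one bound as the boundary-condition mistake. The layout, sizes, and function names are assumptions chosen for illustration; they do not come from the paper.

```python
BUF_START, BUF_SIZE = 0, 8
NEIGHBOR_START = BUF_START + BUF_SIZE  # field stored right after the buffer
NEIGHBOR_SIZE = 4


def make_memory():
    """Return simulated memory: an empty buffer plus a neighbor holding 42."""
    mem = bytearray(BUF_SIZE + NEIGHBOR_SIZE)
    mem[NEIGHBOR_START:] = (42).to_bytes(NEIGHBOR_SIZE, "little")
    return mem


def read_neighbor(mem):
    """Read the adjacent field as a little-endian integer."""
    return int.from_bytes(mem[NEIGHBOR_START:NEIGHBOR_START + NEIGHBOR_SIZE], "little")


def buggy_copy(mem, data):
    """Copy data into the buffer using an off-by-one bound (the defect)."""
    limit = BUF_SIZE + 1  # BUG: the bound should be BUF_SIZE
    for i, byte in enumerate(data[:limit]):
        mem[BUF_START + i] = byte


def checked_copy(mem, data):
    """Copy data into the buffer, rejecting input that does not fit."""
    if len(data) > BUF_SIZE:
        raise ValueError("data does not fit in buffer")
    mem[BUF_START:BUF_START + len(data)] = data


mem = make_memory()
buggy_copy(mem, b"A" * 9)          # one byte too many
corrupted = read_neighbor(mem)     # 65: the neighbor was silently overwritten
```

The overlay is one byte long, lands immediately past its intended destination, and raises no fault, so the corrupted value can be read and propagated by later code. The defect only surfaces when the input is exactly one byte longer than the buffer, which is the kind of boundary case that testing can easily miss.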

The above list summarizes some of the salient findings. It is also the intent of this paper to provide a more structured and systematic method to classify and understand software defects. This understanding is critical to designing appropriate techniques to shield against them, either in development or in operation. Furthermore, the paper provides a framework from which fault models for fault-injection-based evaluation can be developed.
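One way such a fault model might be turned into an injector is sketched below. It corrupts a few bytes close to an intended write address, in the spirit of finding (3). The function name, the default window of 16 bytes, and the maximum length of 4 bytes are placeholder assumptions, not parameters taken from the paper.

```python
import random


def inject_overlay(mem, intended_addr, rng, max_offset=16, max_len=4):
    """Invert 1..max_len bytes starting within max_offset of intended_addr.

    Assumes len(mem) >= max_len. The start address is clipped so that the
    corrupted range stays inside mem. Returns (start, length).
    """
    length = rng.randint(1, max_len)
    start = intended_addr + rng.randint(-max_offset, max_offset)
    start = max(0, min(start, len(mem) - length))
    for addr in range(start, start + length):
        mem[addr] ^= 0xFF  # invert the byte so the corruption is always visible
    return start, length


# Example: inject one small overlay near address 32 in 64 bytes of memory.
memory = bytearray(64)
rng = random.Random(0)  # seeded so that an experiment can be repeated
start, length = inject_overlay(memory, 32, rng)
```

Passing in a seeded `random.Random` keeps each injection reproducible, which helps when a failure observed during an experiment has to be traced back to the fault that caused it.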