Ram Chillarege, Wei-Lun Kao and Richard G. Condit IBM Thomas J. Watson Research Center, 1991
- uses defect data from an operating systems development project.
- finds that initialization defects were strongly related to the inflection noticed in the reliability growth.
- finds that the defect type distribution identified process problems that concurred with the developers' hindsight.
Proceedings 13th International Conference on Software Engineering, May 13-17, 1991, Austin Texas. Copyright IEEE
1. Introduction
An important question in software engineering is whether there exist measurable cause-effect relationships between the kinds of defects and the resultant development effort. If such quantifiable relationships exist, schemes could be devised to recognize them and take suitable corrective action on the process. This would also mark a turning point in making software development a controllable process, much like other manufacturing processes. Although this is a hard question to answer completely, one can formulate sub-problems that could eventually lead to an answer. This paper investigates one such sub-problem and demonstrates the existence of such relationships.

Reliability growth modeling has existed for several years. Growth modeling traditionally deals with prediction after making some fundamental assumptions about the error detection process. The reliability growth is, in a sense, an effect of the defects and the testing process. On the other hand, defect characterization has been used as a means to identify the nature of problems inherent in the process. In this sense, the defects are a cause, at least with respect to measurements such as reliability growth. Within this framework, the sub-problem pertinent to the goals of this paper is to investigate relationships between types of defects and their effect on the process as measured by, say, reliability growth.

Having defined the sub-problem of investigating relationships between defect types and reliability growth, we can look at what data is available for such a study. As it happens, there is plenty of data, but not quite in a form that would allow us to answer the question. Many labs do not have their defect data elaborately categorized by defect type. The information exists, but in the form of plain English text that cannot be easily parsed for semantic content.
Thus, the approach of separating the defects into sub-populations by defect type and then investigating their growth characteristics cannot be applied directly, and the analysis gets a little more complicated. Given the lack of defect-type information in the data, a reverse approach has to be tried. If it were possible to identify sub-populations that had different growth characteristics from the average, those sub-populations could be analyzed for possible differences in their composition of defects. This approach works only if one can identify such sub-populations, and it still does not solve the problem of not having the defects categorized by defect type. However, once the separation is achieved, it is possible to sample the defects and categorize them.

There have been several studies on the topics of both growth modeling and defect characterization. However, this search for potential cause-effect relationships is not so common. Related work includes some interesting studies on the relationship between the distribution of errors and complexity, an experimental investigation of the effect of Fortran and Ada on program reliability, and a study of software defects found in the field in operating systems code and their impact on availability.

The data used in this paper comes from the development of a fairly large software product. We discuss the details of the data source and its attributes in Section 2. Section 3 shows how we were able to identify sub-populations that had varying degrees of reliability growth. Section 4 describes how defect type categorization is done and also shows some results that come directly out of the categorization. Section 5 brings together results from the previous two sections, relating defect types (cause) to the differences in the observed growth (effect). Section 6 summarizes the key results from Sections 3, 4 and 5.
The defect data used for this paper comes from the development of operating systems code. Very small portions of the code needed to be in assembler; for the most part the code was written in a high-level language (e.g. C). The number of programmers and testers throughout the project ranged in the hundreds. The code included both new and changed code, given that there was an earlier base to work with. The defect data included defects from functional test, system test, and early ship. By functional test we mean tests on specific modules and components that do not require the entire system to be up. System test required almost all the code to be ready. The defects found during these phases of testing are called Program Trouble Memorandums, or PTMs for short. The PTM database tracked defects from functional test, system test, and also for a short time after the product was released. Field defects in IBM are commonly tracked in another database called RETAIN. However, defects found during early ship are tracked by the developer as PTMs. This data is thus what is tracked by the developer, and it includes some defects found after system test, during early ship.

The PTM data is maintained in a database with a considerable amount of tracking information. Each defect entry contains information on the date it was opened, its current status, the date it was closed, the symptom of the error, a description, ownership, the person tracking it, etc. This is an on-line database, and data entry is through panels with a capability for searches. The specific fields in the defect entry contain tracking information and are also useful as search arguments on the database. Such fields help in generating quantitative information such as reliability growth curves, the distribution of severity, the average time a defect is open, etc. However, the descriptive information recorded against a defect does not provide such conveniences.
Yet the descriptive information contains the meat: the problems experienced, the tests used, the fixes necessary, etc. Thus, analysis that depends on the descriptive information in a defect is not easy to perform. Keyword searches are possible, but they help only in a limited fashion. Such analysis invariably requires reading and comprehending a few paragraphs of descriptive text per defect.

To aid the identification and tracking of defects, a defect entry is opened with a key for the symptom experienced in the failure. A symptom is what the tester notices as the effect of the defect on the system. The symptom may be the response to a test case, to a workload running on the system, or just to commands issued by the tester. Symptoms are useful to testers, who search the database for other defects with the same symptom to check whether the same problem was reported earlier within the same component. Note, however, that the symptom does not allude to any possible cause of the defect and only identifies its effects. A set of symptom descriptors such as hang, data-lost, performance, loop, etc., is agreed upon for use against defects. It is not always possible to map a defect's symptoms onto a very specific descriptor, and therefore some descriptors have a more general scope, such as functional claim. Table 1 shows the set of symptom descriptors used and what they stand for. Since the data is keyed by symptom, it provides a mechanism for us to partition the database. Another key used when the defect is opened is severity. This is a number between 1 and 4 that reflects a combination of the difficulty caused by the defect and the urgency of a fix.
Symptom  Description
S1       Defect resulted in data being lost
S2       Hang; machine does not respond
S3       Incorrect data was shown on display
S4       Incorrect message displayed
S5       Incorrect data output
S6       Problem during build; Library
S7       Performance problem (noticeable)
S8       Unpredictable results occur
S9       Useability of machine affected
S10      Problems uncovered during a build
S11      Unverifiable claim (no test case)
S12      Functional claim (needs test case)
S13      Performance claim (needs test case)
S14      Architectural claim
S15      Core dump (memory fault, abort)
Table 1. The 15 Symptom Descriptors.
This section addresses the question of whether the population of defects consists of sub-populations that exhibit different reliability growth characteristics. Given the nature of defect data, there is no straightforward answer. We searched for an answer by trial and error, inspecting growth curves of various sub-populations identified by keys such as defect severity, symptom, component, etc. The qualitative inspection revealed that symptom held promise. This prompted a quantitative identification of the sub-populations using estimated parameters from a reliability growth model. We first describe the growth model used and then the clustering method used to identify the sub-populations of interest.

We use the inflection S-shaped growth model for our analysis. The choice is motivated by the shape of the observed reliability growth curve. In this paper, the primary purpose of fitting a model is not prediction, but to provide a quantitative means to identify sub-populations. (The data used has been made available in the Appendix.) The inflection S-shaped model is characterized by the mean value function h(t) of a non-homogeneous Poisson process (NHPP):

$$h(t) = N \, \frac{1 - e^{-\phi t}}{1 + \psi e^{-\phi t}}$$

where N is the estimated total number of defects, phi ($\phi$) is the failure detection rate (in the sense of the Jelinski-Moranda model), psi ($\psi$) is the inflection parameter, and h(t) is the number of failures detected up to time t. The inflection parameter is defined for a given r by

$$\psi(r) = \frac{1 - r}{r}, \quad r > 0,$$

where r is the inflection rate. The S shape and the inflection are attributed to mutually dependent defects, such that some faults are not detectable before others. Essentially, r indicates the ratio of the number of detectable faults to the total number of faults in the program. An r value of 1 reduces the model to an exponential curve and implies that the defects are independent. A lower value of r causes the curve to be more inflected and implies, in this model, that the defects are dependent.
A detailed discussion of this model, its use, and algorithms for estimating its parameters can be found in the literature on the model. Figure 1 shows the observed and fitted reliability growth curve of the entire population of detected defects. The estimated parameters N, phi and psi are shown against the figure. A vertical reference line is drawn in the figure at a specific instant, T, during the testing cycle. We call the percentage of total defects detected at time T $P_T$; it is about 65% in the figure. We use the parameter $P_T$ later in this section. The S-shaped model is a good fit for most of the testing phase. Towards the late stages of testing there is a pronounced increase in detection rate, which this model is not intended to capture.
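The model described above can be written out in a few lines. The following is a minimal Python sketch, not the authors' code: the function names are illustrative assumptions, and `fraction_detected` is the model-based counterpart of the percentage detected by time T, whereas the paper reads that percentage from the data.

```python
import math


def psi_from_r(r):
    """Inflection parameter psi for a given inflection rate r (requires r > 0)."""
    if r <= 0:
        raise ValueError("r must be greater than 0")
    return (1.0 - r) / r


def mean_value(t, n, phi, psi):
    """Inflection S-shaped mean value function h(t): defects detected by time t."""
    decay = math.exp(-phi * t)
    return n * (1.0 - decay) / (1.0 + psi * decay)


def fraction_detected(t, phi, psi):
    """Model-based fraction of all defects detected by time t, i.e. h(t) / N."""
    return mean_value(t, 1.0, phi, psi)
```

With r = 1 the inflection parameter is 0 and `mean_value` reduces to the exponential curve N(1 - e^(-phi t)); for a fixed phi, a larger psi (smaller r) delays detection early in the test cycle.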
We separate defects by their failure symptom and generate both the observed and the fitted reliability growth curve for each of the 15 symptoms. A qualitative inspection of the 15 growth curves (not shown) revealed considerable differences among some of them. This motivated the identification of symptom-groups such that each symptom-group has a characteristic reliability growth different from the others. Our interest is in quantifying the differences in the shape of the growth curve for the various symptoms. Therefore, we fit the inflection S-shaped growth curve using non-linear regression, specifically the Marquardt method. From the fit, we estimate the parameters N, phi and psi and their asymptotic 95% confidence intervals. Note that N corresponds to the estimated number of defects for that specific symptom. The parameters phi and psi characterize the growth curve and can be used to quantify the difference in growth experienced by the fifteen symptoms. We also calculate $P_T$ for each symptom, which measures the percentage of the defects detected until time T.

Identifying the sub-populations of interest is best done by plotting the parameters phi, psi and $P_T$ in three dimensions. We use r instead of psi since r relates to the independence of defects. Figure 2 shows such a plot. The horizontal plane contains the parameters r and phi, and the vertical axis contains $P_T$. A lower value of r implies greater inflection and correspondingly more dependent defects. The parameter phi corresponds to the rate of detection: a lower phi corresponds to errors that are harder to detect, a higher phi to errors that are easier to detect. Each arrow in the figure corresponds to defects with one of the 15 symptoms. The height of the arrow shows the value of its $P_T$, that is, what fraction of the defects had been detected at time T. The position on the r - phi plane indicates, in relative terms, whether the defects are dependent and how hard it is to detect them.
Defects that have a low r and a high phi are relatively more dependent but are also easier to find. Defects with a high r and a low phi are relatively more independent and are harder to find, and so on. Thus, partitions of the r - phi plane correspond to defects that demonstrate different growth characteristics, which is what we want to identify.
Note that the purpose is to identify sub-populations of defects that demonstrate different growth curves, and symptom is only a means to that end. We cluster the symptoms in Figure 2 into symptom-groups such that each symptom-group identifies a sub-population of defects that experiences similar reliability growth characteristics. The circles drawn around the arrows identify clusters of symptoms with phi, r and $P_T$ close to each other. Four such symptom-groups (groups) have been identified in the figure. The choice of the first three groups as clusters is more obvious than that of the fourth. Each group typically has between 20% and 30% of the defects.

We now divide the entire population of defects into the four sub-populations (symptom-groups) identified by the clusters of symptoms in Figure 2. For each symptom-group we generate the observed and fitted reliability growth curve, shown in Figure 3. The fit with four sub-populations will be better, since it is equivalent to fitting data with more parameters. Therefore, we do not provide the sum of squares of error for the four sub-populations to compare with the overall fit, since such a comparison is not meaningful. The parameters of the fitted model, namely phi, psi and N, are shown for each symptom-group in Table 2. Qualitative differences among the symptom-groups are more apparent from their growth curves, whereas the extent of the differences is better seen from their parameters. Notice that the parameters phi and psi have a reasonable separation among the groups. The clustering approach is a reasonably good method to identify sub-populations, provided it is done with an understanding of the underlying data. We have deliberately avoided the use of complicated analysis, since verifying normality assumptions is often impossible in such data. Indeed, we strive to keep the analysis simple without straying far from the physical interpretation of the data.
Group 1 contains defects with 6 symptoms that have a relatively low r and a high phi value. This should correspond to defects that are dependent but easy to find. The observed and fitted growth curves in Figure 3a show this: the group has the most inflection and the highest $P_T$ value. Group 2 contains defects that have a slightly higher r value and a lower phi. This group still has the S shape, but is much less inflected than Group 1. The lower phi corresponds to defects that are harder to find, and is well reflected in the lower $P_T$ value. Group 3 is unusual in that it is just one symptom and has a relatively low phi and low r value. This corresponds to defects that are more independent and also harder to find (reflected in a $P_T$ value much lower than that of Group 2, which has an almost similar r value). The growth curves of Group 2 and Group 3 have similar inflection, except that a significant number of defects in Group 3 are discovered late. Group 4 is a collection of three symptoms whose r and phi values are not as close to each other as those in the other groups. This group is characterized by a low phi and a high r, corresponding to relatively independent and harder to detect defects. This sub-population is also the set of defects for which the S-shaped curve is not the best model: there is almost a multi-stage S shape during the testing phase. We have put these defects in one group since they stand out from the other three groups. It is interesting that 30% of the defects, coming from 3 specific symptoms, tend to be different from the rest of the population.

In this section we used a clustering technique to identify sub-populations using the parameters of the inflection S-shaped growth model. We identified four sub-populations using the symptoms and called them symptom-groups. The symptom-groups differ in having relatively dependent or independent defects and in demonstrating either slow or quick growth.
In the next section we investigate the causes of the defects (defect types), and in the subsequent section we relate the defect types to the symptom-groups identified in this section.
GRP  PARM   ESTIMATE   STD ERROR   95% CONFIDENCE INTERVAL
                                   LOWER       UPPER
1    N       148.534     0.9282     146.700     150.368
1    PHI       0.049     0.0008       0.047       0.050
1    PSI    3164.837   436.6523    2302.137    4027.537
2    N       257.067     1.09645    254.907     259.226
2    PHI       0.031     0.00033      0.030       0.031
2    PSI     219.710    11.15628    197.741     241.678
3    N       178.490     2.70529    173.147     183.832
3    PHI       0.026     0.00061      0.025       0.027
3    PSI     254.467    24.62409    205.837     303.098
4    N       233.039    12.91579    207.562     258.516
4    PHI       0.019     0.00123      0.017       0.022
4    PSI      71.088    11.06830     49.255      92.920
Table 2. Parameter Estimates
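Since the discussion is phrased in terms of r while Table 2 reports psi, it can help to convert one to the other by inverting psi = (1 - r) / r. The short Python sketch below does this for the four symptom-groups; the phi and psi values are copied from Table 2, and the variable and function names are illustrative assumptions.

```python
# (phi, psi) point estimates for the four symptom-groups, copied from Table 2.
TABLE_2 = {
    1: (0.049, 3164.837),
    2: (0.031, 219.710),
    3: (0.026, 254.467),
    4: (0.019, 71.088),
}


def r_from_psi(psi):
    """Invert psi = (1 - r) / r to recover the inflection rate r."""
    return 1.0 / (1.0 + psi)


inflection_rate = {group: r_from_psi(psi) for group, (phi, psi) in TABLE_2.items()}

for group in sorted(TABLE_2):
    phi, psi = TABLE_2[group]
    print(f"Group {group}: phi = {phi:.3f}, r = {inflection_rate[group]:.6f}")
```

On these estimates Group 1 has by far the smallest r together with the largest phi, Groups 2 and 3 have r values close to each other, and Group 4 has the largest r and the smallest phi.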
4. Analysis by Defect Type
This section looks at what caused the defects, whereas the earlier section primarily dealt with the effect of defects. Defects are classified into types that represent a class of causes. We first describe the method used to identify the types and then discuss the various defect types, their distribution and their impact.

Software defects arise for a variety of reasons and can require elaborate descriptions to communicate their nature. Some defects are simple to communicate, such as an incorrect branch condition or output printed in the wrong format. However, not all defects have a simple explanation. Errors due to an ambiguous specification could be attributed either to the specification or to the code. In some instances the code may be correct, but perform so poorly that it cannot be shipped as is. This may not be a bug, but the code is certainly defective and needs to be fixed. Classifying defects into types is thus not a straightforward process. Yet defects may share common deficiencies in code, structure, performance, design, accuracy, etc. Such deficiencies can possibly be recognized by the design and development process and used for improvements.

We have tried to develop a defect type classification that captures descriptions providing insight into both the defect and the development process. The choice of defect types is important, possibly critical, to the formulation of cause and effect relationships. We wanted the defect types to give the developer insight into where problems exist and to identify possible corrective actions. At the same time, the defect types should be simple and obvious to any programmer, leaving little room for confusion during classification. Our first pass at the data resulted in lengthy descriptions of defect types. In a second pass, however, they crystallized when we began defining defect types that were orthogonal and could be associated with specific programming steps and errors. Our resulting defect types form a five by two matrix.
The five defect types chosen were function, checking, assignment, initialization, and documentation. Further, each defect could be due either to the omission of code or to the presence of incorrect statements. Thus, for each defect type we determined whether it was missing or incorrect. This, in our opinion, conveys the essence of what we were after. It provides information on the probable cause and also identifies a phase in the process to associate the defect with.

A fundamental problem in identifying defect types is the work involved in doing so. Reports on defects are not classified by defect type. Therefore, information on defect type has to be learned by reading descriptions of defects, tracking their fixes and comprehending the semantics surrounding each defect. Descriptions of defects can vary from a short paragraph to a few pages of text. The text contains details on the failure and a discussion of the probable causes, along with a description of the fixes put in place. Given several hundred defects in the database, it is almost impossible to attempt classifying all the defects. Furthermore, depending on the degree of categorization, it is statistically not necessary that all the defect reports be read.

We adopted random sampling to pick representative samples of defects to categorize into defect types. This significantly reduced the amount of reading necessary to develop a defect type categorization. Random sampling also provided a means to assure ourselves that the samples are representative of the four symptom-groups generated in the earlier section. This is important for relating a defect type distribution to the symptom-groups. We chose a sample size of 30 to repeatedly sample the defect population. These samples (of size 30) were then checked to be representative of the symptom-group distribution contained in the population. The sample size was indeed a good choice, and we proceeded to use the samples.
We initially chose 3 samples, totaling 90 defects, to read in detail and classify into defect types; if necessary, we could pick more samples to read. It is an arduous task, considering that one not only has to read the description but must also get involved in the semantics of the defect. Before we discuss the details of the types, we dwell a little more on the issue of sampling. We used the defect types to generate a defect type distribution for each sample. This was done after each sample was categorized. The defect type distribution stabilized after 3 samples (90 defects), so no further sampling was necessary. In each sample there were a few defect reports (4 to 5) which we were not able to classify. It is not that these defects were inadequately documented, but that we could not locate the right specification surrounding the defect in question. We chose to drop those specific defects from our sample, effectively reducing our sample size from 30 to, say, 26. An alternative to reducing the sample size would have been to randomly choose a replacement defect to make up the sample size. Given that our defect type distributions were very stable, we chose to proceed with the slightly reduced sample size. We provide the aggregate defect type distribution in Figure 4.
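The sampling check described above can be sketched as follows. This is an illustrative Python sketch only: the population counts are made up, the seed is arbitrary, and the "largest share gap" measure is an assumed way of judging representativeness, since the paper does not state which criterion was used.

```python
import random
from collections import Counter


def group_shares(labels):
    """Fraction of defects that fall in each symptom-group."""
    counts = Counter(labels)
    return {group: counts[group] / len(labels) for group in sorted(counts)}


def largest_share_gap(population, sample):
    """Largest absolute difference in any group's share, population vs. sample."""
    sample_shares = group_shares(sample)
    return max(abs(share - sample_shares.get(group, 0.0))
               for group, share in group_shares(population).items())


# Hypothetical population: one symptom-group label per defect (made-up counts).
population = [1] * 200 + [2] * 300 + [3] * 200 + [4] * 300

rng = random.Random(1991)  # fixed, arbitrary seed so the sketch is repeatable
samples = [rng.sample(population, 30) for _ in range(3)]  # 3 samples of size 30

for index, sample in enumerate(samples, start=1):
    gap = largest_share_gap(population, sample)
    print(f"sample {index}: largest share gap = {gap:.2f}")
```

A sample whose gap is judged too large could simply be redrawn; what threshold counts as "too large" is a choice left to the analyst.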
Although the defect types are for the most part self-explanatory, a little description is useful. Function defects can require either small or large changes to the code; function is, however, the only defect type requiring large changes. Examples of missing function are: failed to log the record, and option in the service not supported. An incorrect function can include defects with unexpected results, such as loops. Checking defects can occur due to incorrect statements (such as conditions and boundary checks), for example, an overlay in the last element of an array. Examples of a missing check are: failure to check a return value and failure to check a set debug flag. Assignment and initialization have been separated intentionally, although they may seem similar. Assignment refers to instructions in the middle of the code needed for the correct functioning of the logic. Thus, assignments are statements after initialization, such as those representing a change of state. Examples of assignment defects are: failure to reload registers (missing) and setting a wrong value (incorrect). Initialization deals specifically with the constants or run-time setups expected by a program. Initialization defects involve default values, system parameters, initial values of variables, and special files such as files containing device information. Examples of initialization defects are: missing device information (missing) and setting a wrong value (incorrect). Finally, we have documentation defects, which include an incorrect description of the program and missing documentation for the specified operation.

Note from Figure 4 that function and checking constitute almost 60% of the defects. Function defects are linked to the specification and design. Among the function defects, missing function dominates. Checking defects are also linked to specification or design, but at a different level. In operating systems code, the checks have mainly to do with interfaces, status conditions, states, locks, etc.
The lack of such details can lead to code that does not check for certain conditions and yields erroneous results. It is evident from the data that improvements in the area of specification and design would have made a significant difference on this project. Although this is true for any product, the distribution shows that in this project the defects are dominated by these types of problems. It is not common programming errors that dominate the errors in this development cycle, but the bigger problems such as those due to missing function.

After this analysis we took the results back to the development team to share our views of where we believed the problems stemmed from. Our assertions were based purely on the data analysis, and we did not have knowledge of the development history. It is interesting that the development team concurred with our views, since they knew the details in hindsight. We conjecture that this type of classification, done in-process, could provide considerable insight and warning that would allow corrective action to be taken in-flight.

In this section a random sample of defects was studied to classify defects into defect types. Five major defect types and their sub-types were used to generate a defect type distribution. It was found that function and checking defects contributed more than half the defects. Furthermore, it was observed that improved specification could greatly reduce the number of defects by potentially decreasing the number of function and checking defects.
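The five by two classification and the resulting distribution can be tabulated mechanically once each sampled defect has been read and classified. The Python sketch below illustrates the bookkeeping only; the classified sample is hypothetical (its counts are made up and are not the paper's data), and the function names are illustrative assumptions.

```python
from collections import Counter

DEFECT_TYPES = ("function", "checking", "assignment", "initialization", "documentation")
SUB_TYPES = ("missing", "incorrect")


def type_distribution(classified):
    """Share of each (defect type, sub-type) cell in a classified sample."""
    for defect_type, sub_type in classified:
        if defect_type not in DEFECT_TYPES or sub_type not in SUB_TYPES:
            raise ValueError(f"unknown classification: {defect_type}, {sub_type}")
    counts = Counter(classified)
    total = len(classified)
    return {(d, s): counts[(d, s)] / total for d in DEFECT_TYPES for s in SUB_TYPES}


def major_type_share(distribution, defect_type):
    """Share of one major defect type, summed over its two sub-types."""
    return sum(distribution[(defect_type, s)] for s in SUB_TYPES)


# Hypothetical classified sample of 20 defects; the counts are made up.
sample = (
    [("function", "missing")] * 6 + [("function", "incorrect")] * 3
    + [("checking", "missing")] * 3 + [("checking", "incorrect")] * 2
    + [("assignment", "incorrect")] * 3 + [("initialization", "missing")] * 2
    + [("documentation", "missing")] * 1
)

distribution = type_distribution(sample)
for defect_type in DEFECT_TYPES:
    print(f"{defect_type}: {major_type_share(distribution, defect_type):.2f}")
```

Rejecting labels outside the matrix keeps the classification restricted to the ten cells, which is the property the orthogonal defect types are meant to provide.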
5. Relating Defect Types to Symptom Groups
In this section we bring together the results from the earlier two sections. Section 3 discussed partitioning defects into four sub-populations (called symptom-groups) that demonstrated varying degrees of reliability growth. Section 4 developed the defect type distribution. We now proceed to relate the defect types (cause) to the reliability growth (effect) experienced by the symptom-groups.

We relate defect types to the symptom-groups by examining the defect type distribution for each of the four symptom-groups. Thus we identify possible differences in the makeup of defect types leading to differences in the corresponding reliability growth curves. Figure 5 contains four bar charts, one for each of the symptom-groups, showing the distribution of defect types. Each major defect type is represented by a bar which is sub-divided to show the proportions of its sub-types. Notice that the defect type distributions for Symptom Groups 2, 3 and 4 are similar to each other and also to the aggregate distribution in Figure 4. That is, these three symptom-groups have almost the same relative proportions of the major defect types, with function defects being the largest and documentation the smallest. However, there are differences among them when considering the proportions of the individual defect sub-types. Symptom Group 1 is markedly different from the rest, both in the major defect types and in the sub-types. We now look at these differences in greater detail.

Symptom Group 1 is a major departure from the other three groups. Firstly, it contains defects from only 3 defect types. (Note that, statistically, this does not mean that documentation and checking defects are absent from this sub-population. They may exist, but are likely to be very small contributors.) Secondly, the fraction of initialization defects is much higher than in the other three sub-populations or the aggregate. Initialization defects constitute almost 35%, whereas they are typically 10-15% in the others.
A large fraction of the initialization defects (two thirds) is due to missing initialization. By comparison, missing initialization exists in a nominal fraction (relative to the aggregate) in Group 2 and does not exist in either Group 3 or Group 4. The other defect types of Group 1, namely function and assignment, do not show similarly strong differences. Function defects continue to be the largest, slightly higher than in the others, as expected. The assignment defects are about the same fraction as in the aggregate.

Recall that Group 1 demonstrated the most inflected growth curve. The inflection corresponds quantitatively to a low r value, and physically to defects that may have a lot of dependence on each other. To visualize dependence, imagine defects on a graph where the edges correspond to execution paths. Two defects along the same path could require the removal of the first to detect the other. Thus, two defects along the same path could be dependent, affecting the shape of the growth curve. However, if they were along disjoint paths they could be detected independently. Dependence can cause an initial delay in the detection process, with the rate of detection rapidly increasing once a few defects are detected. Thus, it is conjectured that dependence causes an inflection in the growth curve. A more detailed discussion of the topic can be found in the literature on this growth model.

From the defect type distribution we know that Symptom Group 1 has a relatively larger contribution of initialization defects. Given that initialization defects occur upstream in the execution path, they are likely to be shared by more than one path, contributing to greater dependence between defects. Therefore, until the initialization defects are fixed, others that share the same path are not likely to be detected. This indicates that initialization defects could contribute to a very inflected growth curve.
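The dependence argument above can be illustrated with a toy simulation. The Python sketch below is an assumption-laden illustration, not the authors' model: each execution path is a list of defects in execution order, a round of testing reveals only the first unfixed defect on each path, and every revealed defect is fixed before the next round. The defect names and path layouts are hypothetical.

```python
def detections_per_round(paths):
    """Number of new defects found in each test round until all are fixed.

    Assumption: a path exposes only its first unfixed defect, so defects
    further down the path stay hidden until the earlier ones are removed.
    """
    all_defects = {defect for path in paths for defect in path}
    fixed = set()
    per_round = []
    while fixed != all_defects:
        found = set()
        for path in paths:
            for defect in path:
                if defect not in fixed:
                    found.add(defect)
                    break
        fixed |= found
        per_round.append(len(found))
    return per_round


# Three paths that all begin with the same (hypothetical) initialization defect.
shared_init = [["init", "a1", "a2"], ["init", "b1", "b2"], ["init", "c1", "c2"]]
# Three disjoint paths with no shared upstream defect.
disjoint = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]

print(detections_per_round(shared_init))  # [1, 3, 3]
print(detections_per_round(disjoint))     # [3, 3, 3]
```

With the shared upstream defect, the first round finds a single defect and detection picks up only after it is fixed, which is the delayed start that produces an inflected cumulative curve; with disjoint paths, detection proceeds at a steady rate from the first round.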
However, note that Group 2, which also has a fairly inflected curve, does not have as large a number of initialization defects. This seeming inconsistency is useful in identifying the probable cause and clarifying the relationship. Recall that although Group 1 has a large fraction of initialization defects, the major difference is due to missing initialization, which accounts for almost two thirds of them. Missing initialization defects also exist in Group 2 but not in Groups 3 or 4. This points to a combination of missing initialization and initialization defects being strongly related to the inflected nature of the growth curve. Analysis of variance can be used to further assess the relative contributions of defect types. However, more important is the physical reason behind this plausible relationship and its impact on the overall growth experienced. Primarily, the testing strategy can be modified to weed out missing initialization defects well before further path testing is attempted.

Groups 2, 3 and 4 do not display a distribution of defect types that immediately draws attention to the type of growth they experience. The only exception is that Group 4 has a much smaller than average share of initialization defects. However, since the Group 4 sub-population is not as well defined by the cluster of r and phi parameters, we are not inclined to analyze it further.

It is interesting that we are able to establish a cause-effect relationship between a defect type (cause) and the inflection in the reliability growth curve (effect). The fact that there are plausible cause and effect relationships should be welcome. Establishing more such relationships will allow the software development effort to be understood and controlled better. It will also pave the way for using more advanced techniques in software development, akin to other manufacturing processes.
We are pursuing the use of defect type classification in more products to allow us the additional insight that it can provide.
This paper used defect data from a large operating system project to empirically understand the impact of defect types on the net reliability growth experienced. The analysis was complicated by the fact that defect-type information was not available in the original defect database. The goal of this study was to examine the existence of possible cause and effect relationships that allow greater understanding and control over the software development process. The study finds that:
- Sub-populations with very inflected growth curves are found to be strongly related to the number of initialization, especially missing initialization, defects. This is consistent with the proposed model, wherein inflection is claimed to be caused by dependence between defects sharing the same execution path; considering that initializations usually occur early in the code path. This observation is also useful to test case design.
- The defect types chosen provided an insight into the problems experienced in the development and their probable causes. In this product, missing function dominated the defect types. Although, in hindsight, it was known that the specification and design phases needed improvement, it is interesting that the defect type distribution specifically identified that. It is likely that this could provide early recognition of problems, leading to in-process corrective action.
As with any empirical study, it should be noted that these findings are the result of analysis on one large software project and do not necessarily generalize. Generalization of the findings cannot be assumed unless many independent studies concur with these results. However, the method employed has been instructive and the results are useful to enhance the development process of the project studied.