Ram Chillarege
Chillarege Inc., 2004
Abstract -- Measurement is a vital aspect of the test process, and traditionally the weakest. This paper discusses how ODC trigger-based measurements are fundamental to test, and contrasts them with classical notions of coverage. Triggers complement classical coverage, and together they can yield a holistic test measurement.
SPLiT – Workshop on Software Product Line Testing, Software Product Lines, Lecture Notes in Computer Science, 2004, Volume 3154/2004. Springer
1. Introduction
Software test productivity and its measurement remain blurred. While a number of measurements are used in practice, none is definitive or complete, and this remains an area of research. A few of the popular measures are: number of test cases run, percent regression, coverage, and defects found. Software reliability is a related measure, but it becomes available late in the development cycle and is not practical during the majority of the test cycle. While all these measures provide some indication of goodness, they suffer from a common weakness: they are incidental to the fault detection process itself.
The classical measures of code coverage [1] stand apart from the rest, since they are well defined, precise and supported by a reasonable tool set [2]. This natural appeal has led to a significant following, and there are measures of effectiveness defined and debated [3]. However, as with many a software engineering metric, there is a dearth of industry data and empirical studies to help guide us. Needless to say, classical coverage measures are stretched beyond their applicable limits.
To examine the software measurement issue in greater detail, we need to peel a few layers off the defect activation, detection, observation and testing mechanisms to gain clarity. It is only then that we can begin to roll up measurements that are closer to the action, and potentially arrive at a more meaningful set of measures at higher levels of abstraction.
2. Software Faults and Detection
We need to distinguish software faults from hardware faults so that we do not confuse software test methods with those of hardware testing. To a first approximation, hardware faults and test methods are about manufacturing faults and defects. Separately, there are design faults in hardware that, to a large extent, are no different from those in software. The one practical difference is that specifications in the hardware world are significantly better than in the software industry. That said, we can make the first high-level generalization that all software faults are human created and are, in loose terms, "design" defects. Once fixed, they ought not to break again. In passing, let us also note that the world of hardware has a much better defined fault model, whereas the world of software has a much more amorphous definition of fault models.
Software defects are therefore latent by design. The testing process is often a defect removal process, with quality assurance on the side, achieved by checking the product's functionality and dependability. Thus, testing begins with the design process and continues past the release and delivery of a product. Most current practices spend at least as much on testing as on development, and sometimes more.
Figure 1 illustrates the three well-known states of fault, error and failure. The fault refers to the latent software "bug"; the error is a state that should not be, but exists because of the fault; and the failure is a manifestation of the fault that impacts the product's ability to deliver service. To an end user, the failure is the state most likely to be observable. Ordinarily, the fault is not observable (for if it were, it would be corrected, assuming a non-malicious developer), and the error condition may or may not be observable, depending on the situation.
The software fault is thus naturally dormant until some mechanism activates it. That does not mean that the fault is identified yet - one can be aware of the existence of a fault, and not be able to observe it. This observability of the fault only happens when there is a diagnostic mechanism or process that indicates the fault.
The task in software testing is to exercise the system enough to detect faults and correct them. The variety of methods, from inspection through unit test, functional test and system test, each drive one or more of these transitions. To detect and fix a fault, several sub-steps need to occur:
1. The fault is activated
2. The error or failure is made observable
3. The fault is identified
4. A viable fix is developed
5. The fix is integrated and validated
The fourth and fifth steps (developing a viable fix, then integrating and validating it) may seem obvious, but they are real-world issues. For instance, a viable fix is not a given. There are situations where a fix is not possible and the fault has to be circumvented. This may be done by blocking the fault-to-error transition, or by masking the error.
Difference among Active, Passive and Environmental Triggers
What is it that activates software faults? Triggers.
The concept of triggers was developed by studying field failures [4] and then adopted as one of the dimensions in Orthogonal Defect Classification (ODC) [5]. While the original definitions of trigger did not partition them by the terms we use here, these terms provide a better explanation of the mechanism at work and of its relationship to measurement.
There are three classes of triggers: active, passive, and environmental. Table 1 lists the triggers, and their definitions are available on the web [6]. An active trigger (for example, a function test case) activates the fault into an error or failure, thus facilitating its detection. A passive trigger, on the other hand, has no activation mechanism such as a test case. A good example is the software inspection process, in which human perception detects faults directly. It is called passive because the code is not executed. Environmental triggers are external to the code but have an influence on it. Examples are configurations and stress.
Active and environmental triggers activate the fault into an error and possibly a failure. Minimally, an error state is generated, which may or may not be observable. In contrast, a passive trigger may never generate an error condition. For instance, in the case of software inspection, the fault is identified directly; the error state may never be known, and consequently there is no failure either. Environmental triggers work on gross-level controls and depend largely on the generation of a failure. In such instances the error state, while generated, may or may not be observable.
The different test methodologies use one or more triggers to do their work. This analysis brings to light the distinctions among the methods as viewed by their triggers, and it helps us appreciate the intrinsic differences in the costs of different methods. For instance, it is quite well established that inspection is one of the least costly methods of fault removal, and system test among the most expensive. Inspection relies on passive triggers, which go after the fault directly. No error or failure is generated, so the costs of making them observable are avoided. System test, in contrast, relies on generating the failure, which requires multiple transitions to be activated and detected.
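The relationship between trigger class and the states that must be reached before a fault can be identified can be sketched in code. This is a minimal illustration under a simplified reading of the text above, not a model taken from the ODC definitions: the names `TriggerClass`, `Mechanism` and `states_to_reach` are hypothetical, and an active trigger is treated as needing only the error state even though it may also proceed to a failure.

```python
from dataclasses import dataclass
from enum import Enum


class TriggerClass(Enum):
    ACTIVE = "active"                # e.g. a function test case
    PASSIVE = "passive"              # e.g. software inspection
    ENVIRONMENTAL = "environmental"  # e.g. configuration, stress


@dataclass(frozen=True)
class Mechanism:
    generates_error: bool    # does the trigger drive the fault into an error state?
    relies_on_failure: bool  # does detection depend largely on a visible failure?


# Simplified mapping (an assumption for illustration, based on the prose above).
MECHANISMS = {
    TriggerClass.ACTIVE: Mechanism(generates_error=True, relies_on_failure=False),
    TriggerClass.PASSIVE: Mechanism(generates_error=False, relies_on_failure=False),
    TriggerClass.ENVIRONMENTAL: Mechanism(generates_error=True, relies_on_failure=True),
}


def states_to_reach(trigger_class: TriggerClass) -> list[str]:
    """Return the chain of states involved before the fault can be identified."""
    mechanism = MECHANISMS[trigger_class]
    chain = ["fault"]
    if mechanism.generates_error:
        chain.append("error")
    if mechanism.relies_on_failure:
        chain.append("failure")
    return chain


for trigger_class in TriggerClass:
    print(trigger_class.value, "->", " -> ".join(states_to_reach(trigger_class)))
```

A longer chain means more transitions to activate and more observability to set up, which is one way to read the cost difference between inspection and system test.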
3. Trigger Coverage
Empirical data from ODC studies has taught us that different classes of defects lend themselves to activation by certain types of triggers [7]. At the same time, we also know that different methods of testing accentuate different triggers. This knowledge brings us closer to measuring the effectiveness of test based on the activation methods used. Trigger coverage is quantified by the probability distribution of the triggers present in a test plan and also in the defects identified. The measure of goodness is not the shape of the distribution alone, but its match to the incident distribution of defects and their propensity to be activated by the given triggers.
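As a rough illustration of comparing two trigger distributions, the sketch below computes the distribution of triggers in a test plan and in the defects found, then measures the gap between them. The choice of total variation distance is an assumption made here for illustration; the paper does not prescribe a particular match statistic. The trigger labels and counts are invented sample data.

```python
from collections import Counter


def distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of trigger labels into a probability distribution."""
    if not labels:
        raise ValueError("need at least one trigger label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {trigger: n / total for trigger, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    triggers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in triggers)


# Hypothetical data: one label per planned test case and per defect found.
plan_triggers = ["function test", "function test", "stress", "configuration"]
defect_triggers = ["function test", "stress", "stress", "configuration"]

plan = distribution(plan_triggers)
found = distribution(defect_triggers)
gap = total_variation(plan, found)
print(f"plan={plan} found={found} gap={gap:.2f}")
```

A small gap only says the two distributions have a similar shape. Judging goodness would still require knowing which triggers the incident defects are prone to, as the text notes.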
How does this concept compare to our traditional notions of coverage?
Classical coverage is defined often on a structural entity. In the case of code - the structural aspects are well defined: code, blocks, branches, etc. In the case of a function, we can extend the notion of coverage on a functional description, or a structural representation such as a state transition diagram. Coverage is then defined on that structure - namely states, transitions, etc.
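To make the structural notion concrete, here is a minimal sketch of state and transition coverage over a state transition diagram. The representation (a dictionary from each state to its successor states) and the function name `structural_coverage` are assumptions chosen for illustration, and the three-state model is invented.

```python
def structural_coverage(
    model: dict[str, set[str]], executed_paths: list[list[str]]
) -> tuple[float, float]:
    """Return (state coverage, transition coverage) for a non-empty model.

    model maps each state to the set of states reachable in one transition.
    executed_paths holds the state sequences visited by each test run.
    """
    all_states = set(model) | {t for successors in model.values() for t in successors}
    all_transitions = {(s, t) for s, successors in model.items() for t in successors}

    seen_states: set[str] = set()
    seen_transitions: set[tuple[str, str]] = set()
    for path in executed_paths:
        seen_states.update(path)
        seen_transitions.update(zip(path, path[1:]))

    # Count only states and transitions that exist in the model.
    seen_states &= all_states
    seen_transitions &= all_transitions
    return (
        len(seen_states) / len(all_states),
        len(seen_transitions) / len(all_transitions),
    )


# Hypothetical three-state model and a single test run through it.
model = {"idle": {"running"}, "running": {"idle", "done"}, "done": set()}
paths = [["idle", "running", "done"]]
state_cov, trans_cov = structural_coverage(model, paths)
print(f"state coverage={state_cov:.2f}, transition coverage={trans_cov:.2f}")
```

In this example every state is visited but the `running` to `idle` transition is not, so the two coverage figures differ even for the same test run.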
However, the definitions of faults, errors and failures, which are key to the defect detection and correction mechanisms, are not necessarily tied to structure. In fact, software faults often do not even lend themselves to structural identification. For instance, a requirement that is implemented incorrectly may have the same structural description whether it is right or wrong.
This has led to triggers becoming a viable mechanism to measure test effectiveness [8, 9]. In these studies, the trigger distribution during the different test phases is examined for the coverage it provides in terms of the type of activation needed by each phase of testing. The studies have used the defects detected, indexed by the trigger that found them, to measure whether the current phase of testing was well suited for the yield that a particular trigger produces.
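One simple way to index defects by phase and trigger is sketched below. It is an illustration only and does not reproduce the analysis in the cited studies: the function `phase_fit`, the `expected` mapping of triggers to phases, the trigger labels, and the defect records are all hypothetical.

```python
from collections import Counter, defaultdict


def phase_fit(
    defects: list[tuple[str, str]], expected: dict[str, set[str]]
) -> dict[str, float]:
    """For each phase, return the share of its defects found by expected triggers.

    defects is a list of (phase, trigger) pairs, one per defect.
    expected maps a phase to the triggers that phase is meant to exercise.
    """
    found: dict[str, Counter] = defaultdict(Counter)
    for phase, trigger in defects:
        found[phase][trigger] += 1

    fit = {}
    for phase, counts in found.items():
        total = sum(counts.values())
        on_target = sum(
            n for trigger, n in counts.items() if trigger in expected.get(phase, set())
        )
        fit[phase] = on_target / total
    return fit


# Hypothetical defect records and phase expectations (placeholder labels).
defects = [
    ("function test", "variation"),
    ("function test", "variation"),
    ("function test", "stress"),
    ("system test", "stress"),
    ("system test", "configuration"),
]
expected = {
    "function test": {"variation"},
    "system test": {"stress", "configuration"},
}
fit = phase_fit(defects, expected)
print(fit)
```

A low share for a phase would suggest that its defects are being found by triggers the phase was not designed around, which is the kind of mismatch a trigger-based retrospective looks for.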
A next step in such analysis would be to combine the trigger distribution with a measure of classical coverage. This has the potential of blending the structural view with the perceptual view.
4. Summary
This paper studies the fundamental mechanisms at play in the testing and defect detection process. Triggers, the activation mechanism of software defects, are partitioned based on the states and transitions they facilitate. This division explains how the triggers work, and also what observability criteria are needed.
Developing measures based on trigger distributions gives us a measure complementary to classical coverage. While the classical coverage measures are based on the structural elements of software, and not tightly tied to the efficiency of fault detection, triggers work in the opposite fashion. They are not tied to the structural elements of software, but directly to the error detection mechanism, or perceptive behavior.
Combining the classical structural coverage with the perceptive coverage from triggers gives us a more holistic measure of testing.
References
[1] Beizer, B., Software Testing Techniques. Second ed. 1990: Van Nostrand Reinhold. 550.
[2] xSuds, The Telcordia Software Visualization and Analysis Toolsuite (xSuds), Telcordia Technologies.
[3] Ye, R. and Y.K. Malaiya, Relationship Between Test Effectiveness and Coverage. International Symposium on Software Reliability Engineering, 2002. 2(Fast Abstracts).
[4] Sullivan, M. and R. Chillarege. Software Defects and their Impact on System Availability - A study of Field Failures in Operating Systems. in Fault-Tolerant Computing Systems. 1991.
[5] Chillarege, R., et al., Orthogonal Defect Classification - A Concept for In-Process Measurements. IEEE Transactions on Software Engineering, 1992. 18(11).
[6] ODC, Web Resources. 2004, www.chillarege.com, www.research.ibm.com/softeng.
[7] Chillarege, R. and K. Bassin. Software Triggers and their Characteristics - A case study on field failures from an Operating Systems product. in Fifth IFIP Working Conference on Dependable Computing for Critical Applications. 1995: IFIP.
[8] Butcher, M., H. Munro, and T. Kratschmer, Improving software testing via ODC: Three case studies. IBM Systems Journal, 2002. 41(1): p. 31-44.
[9] Chillarege, R. and K.R. Prasad. Test and Development Process Retrospective - Case Study using ODC Triggers. in International Performance and Dependability Symposium, IPDS. 2002. Washington DC: IEEE Computer Society.