Enlightening and frightening information about code coverage measurement
Author: Frank Büchner, Hitex GmbH
Contribution – Embedded Software Engineering Congress 2015
The following discussion does not focus on defining code coverage measures. Rather, it aims to offer (perhaps) surprising insights, dispel (perhaps) naive views, highlight different possible interpretations, and avoid potential misunderstandings.
What code coverage measures are mentioned in standards?
| Coverage measure | IEC 61508 | ISO 26262 | DO-178 |
| --- | --- | --- | --- |
| Entry Point Coverage | X | | |
| Statement Coverage | X | X | X |
| Branch Coverage | X | X | |
| Decision Coverage | | | X |
| Modified Condition/Decision Coverage (MC/DC) | X | X | X |
Table 1: Code coverage measures from standards
How are the measures defined and how do they relate to each other?
The measures can be sorted according to the test depth typically required to achieve 100% coverage for a given measure. This correlates with the safety criticality at which the aforementioned standards require the respective measure. Table 1 above shows this order: Entry Point Coverage (top) is the measure with the lowest test depth and MC/DC (bottom) is the measure with the highest test depth. In other words, in safety-critical systems requiring a high level of risk reduction, MC/DC should be measured; in safety-critical systems requiring only a low level of risk reduction, measuring Entry Point Coverage or Statement Coverage is sufficient.
A coverage measure generally indicates (usually as a percentage) what portion of the respective measurement criterion was achieved/completed/executed by the software tests. The individual measures have the following meanings:
Entry Point Coverage:
The measurement criterion is the entry points into the software. In the C programming language, these are the functions. (A label in C can only be accessed from within the function containing the label. In our context, labels are therefore not entry points into the software.) If the software under test consists of 100 functions, and 75 of these functions have been called at least once in previous tests, the entry point coverage is 75%. Entry point coverage provides a very limited indication. This is because functions can vary greatly in size, and it doesn't matter what portion of the function was executed in a single call.
Statement Coverage:
The measurement criterion is the number of statements in the software. Statement coverage therefore indicates the proportion of statements executed by the tests relative to the total number of statements. Statement coverage is useful because values below 100% reveal statements that were never executed during the tests. Depending on the programming language, it can be difficult to define precisely what constitutes a statement.
Branch Coverage:
The measurement criterion is the number of branches in the software. Branch coverage therefore indicates the proportion of branches executed by the tests relative to the total number of branches. An if statement, for example, has two branches: the then branch and the else branch. Surprisingly for some, the else branch exists even if it was not explicitly programmed. (ISO 26262 feels compelled to point this out explicitly.) Branches also exist in other statements, of course; for example, the do-while loop has a "back" branch (which is never executed in the case of do { ... } while(0)). In switch statements, a branch leads to each case label, raising the question of whether this should also apply when several labels appear consecutively and thus collectively mark the beginning of a block of statements. (The author believes the answer is "yes" and is happy to provide further information.) Branch coverage is a stronger measure than statement coverage, because branch coverage, for example, reveals implicit else branches that were never taken, which statement coverage cannot.
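The difference between statement and branch coverage for an if statement without an explicit else can be illustrated with a minimal sketch. The function clamp_to_limit() is a hypothetical example and does not come from the original contribution.

```c
/* Hypothetical example: limits a value to an upper bound. */
int clamp_to_limit(int value, int limit)
{
    if (value > limit) {
        value = limit;      /* then branch */
    }
    /* The else branch is not written down, but it still exists:
     * it is the path taken when (value > limit) is false. */
    return value;
}
```

A single test case such as clamp_to_limit(10, 5) executes every statement, so statement coverage is 100%, but only the then branch of the if statement is taken. A second test case such as clamp_to_limit(3, 5) is needed to take the implicit else branch as well.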
Decision Coverage:
The measurement criterion here is the decisions made in the software. Many equate decision coverage with branch coverage. For them, for example, the if statement contains a decision, and the then branch of the if statement is executed if the decision is true, and the else part of the if statement is executed if the decision is false. Thus, they (incorrectly) conclude that 100% branch coverage also implies 100% decision coverage. However, if one examines the definition of decision coverage in [DO-178C] and related papers [CAST-10] more closely, it turns out that the structure of a decision is also considered. Decisions, therefore, consist of conditions linked by Boolean operators. Decision coverage is thus more accurately described as a condition coverage measure. (With regard to branch coverage, decisions are monolithic.) To achieve 100% decision coverage, all conditions in a decision contained in the program must have evaluated as true at least once and false at least once. Under these conditions, one can construct examples where 100% branch coverage is achieved, but not 100% decision coverage. This will not be discussed further here. (The author will gladly provide an example with explanations upon request.)
Modified Condition / Decision Coverage (MC/DC):
This is a measure from the class of condition coverage measures. A decision consists of conditions linked by logical operators. MC/DC tests whether these conditions contribute sufficiently to the overall decision. This is the case if, for a given condition, a pair of test cases/input combinations (of truth values for the conditions) exists for which three things are true: (1) The two input combinations must differ in their truth value for the condition in question. (2) The two input combinations must have the same truth value for all other conditions in this decision. (3) The two input combinations must lead to different truth values for the overall decision. If such a pair of input combinations has been found and tested for all conditions in the decision, MC/DC is achieved. If a decision has n conditions, the required pairs can be assembled using n+1 input combinations.
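The three criteria for a pair of input combinations can be written down as a small check. The following is a sketch only: the type Combination, the helper names, and the example decision a && b are assumptions made for illustration, and every condition is treated as fully evaluated.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_CONDITIONS 2   /* the example decision a && b has two conditions */

/* One input combination: a truth value per condition plus the decision outcome. */
typedef struct {
    bool cond[NUM_CONDITIONS];
    bool outcome;
} Combination;

/* Builds the combination for the example decision a && b. */
Combination evaluate(bool a, bool b)
{
    Combination c = { { a, b }, a && b };
    return c;
}

/* Returns true if x and y form a pair for condition k according to
 * the three criteria given in the text. */
bool is_independence_pair(const Combination *x, const Combination *y, size_t k)
{
    /* (1) The condition in question must differ. */
    if (x->cond[k] == y->cond[k]) {
        return false;
    }
    /* (2) All other conditions must have the same truth value. */
    for (size_t i = 0; i < NUM_CONDITIONS; i++) {
        if (i != k && x->cond[i] != y->cond[i]) {
            return false;
        }
    }
    /* (3) The overall decision must differ. */
    return x->outcome != y->outcome;
}
```

For a && b, the n+1 = 3 input combinations (true, true), (false, true), and (true, false) are enough: the first two form the pair for a, and the first and third form the pair for b.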
Does a name always mean the same coverage measure, or is there always only one name for a coverage measure?
Unfortunately not. Coverage measures are also abbreviated, for example, as C0, C1, and C2, and the meanings of these abbreviations are not consistent. For instance, in Boris Beizer's book "Software Testing Techniques" [BEIZER], C1 refers to Statement Coverage, while in the book "Basiswissen Softwaretest" [SPILLNER], Statement Coverage is abbreviated as C0. In this latter book, C1 is then the abbreviation for Branch Coverage, which Beizer abbreviates as C2. But even when using "proper" names, there are surprises: for example, the measure known as "Multiple Condition Coverage" is called "branch condition combination testing" in the aforementioned book "Basiswissen Softwaretest," whereas the book "Software-Qualität" [LIGGESMEYER] uses the literal translation "Multiple Condition Coverage." So there is no need to give up immediately if you need to measure Multiple Condition Coverage but your available tool can only measure branch condition combination testing. It is always important to clarify the terminology with the other party.
At which process step should code coverage be measured and why?
If the goal is to achieve a metric of 100%, this may only be possible in unit or module tests, but not in integration or system tests. This is because unit tests allow you to test the test object with arbitrary test input data, which may not be possible in system tests.
Figure 1 (see PDF) illustrates a situation where 100% code coverage cannot be achieved in system testing. The function `f1()` calls the function `f2()` with a pointer as a parameter, but only after ensuring that this pointer is not a NULL pointer. However, the function `f2()` checks whether the passed pointer is a NULL pointer before using it and returns an error message if it is (defensive programming). Assuming that `f2()` is not called by other functions, it will not be possible in system testing to ensure that a NULL pointer is passed to `f2()` and thus that the `then` branch of the `if` statement or the `return` statement is executed. In unit testing, any value can be passed to the test object, including a NULL pointer to `f2()`. This allows 100% coverage for `f2()` and also verifies the correctness of the return value.
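Because the figure is not reproduced here, the situation can be sketched as follows. This is not the code from Figure 1: the function bodies, the error code F2_ERR_NULL, and the doubling of the value are assumptions chosen only to make the example complete.

```c
#include <stddef.h>

#define F2_ERR_NULL (-1)   /* hypothetical error code */

/* Defensive programming: f2() checks the pointer before using it. */
int f2(const int *p)
{
    if (p == NULL) {
        return F2_ERR_NULL;   /* not reachable through f1() */
    }
    return *p * 2;
}

/* f1() calls f2() only after making sure the pointer is not NULL. */
int f1(const int *p)
{
    if (p != NULL) {
        return f2(p);
    }
    return 0;
}
```

In a system test, f2() is only reached through f1(), so the NULL check in f2() never evaluates to true. A unit test can call f2(NULL) directly, execute the then branch, and check the returned error code.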
What is not taken into account in code coverage measurement?
The function shown in Figure 2 (see PDF) essentially performs calculations. Based on its name, one might assume that the function calculates the sine value of its input. Indeed, if you call this function with the input value x_deg (presumably a value in degrees) set to 0, the function returns 0. This would be the correct value. The code coverage measurement for this single test case (input 0, return value 0) yields 100% for all the measures mentioned so far. So, are we finished testing? After all, we have a passed test case and 100% coverage! I hope the answer is no. It should be clear that several additional test cases are needed to strengthen our assumption that the function calculates the sine value of its input. This is because code coverage measures do not consider calculations. As a consequence of the example above, one must recognize that achieving 100% coverage is not a reliable indicator of software quality.
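The effect can be reproduced with a hypothetical straight-line function. The following sketch is not the code from Figure 2; the truncated series and the constant PI are assumptions made purely for illustration.

```c
#define PI 3.14159265358979323846

/* Hypothetical sine approximation for an input in degrees.
 * The function has no branches, so one call executes everything. */
double sine(double x_deg)
{
    double x = x_deg * PI / 180.0;   /* degrees to radians */
    double result = x;               /* first term of the series */
    result -= (x * x * x) / 6.0;     /* second term */
    /* Assumption for this sketch: the series stops here, so the
     * result becomes inaccurate for larger angles. */
    return result;
}
```

The single test case sine(0.0) passes and yields 100% for every measure discussed above. Only an additional test case such as sine(90.0), whose expected value is 1, shows that the calculation is not accurate enough.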
What problem cannot be found through code coverage measurement?
Quite simply: omissions in the code are not detected by coverage measurement. For example, if the check whether the passed parameter is a NULL pointer had been omitted from the function f2() in Figure 1 (see PDF), this could not be detected by coverage measurement. The same is true for the sine() function shown in Figure 2 (see PDF): an important line of code is missing, which would increase the accuracy of the calculation. This omission is also not detected by the coverage measurement. This should be obvious, yet it is not always recognized.
Why shouldn't one derive the test cases for achieving 100% coverage from the code?
The main reason, as mentioned above, is that this approach prevents the detection of omissions in the code. Therefore, test cases should always be created based on the requirements. An additional reason is that this approach essentially assumes the code is correct. While it's possible to check the results of test cases derived from the code against the requirements (and thereby discover errors and perhaps even requirements for which no tests exist), in my opinion, it's still the wrong approach.
What calculation methods are available for MC/DC?
The left part of Figure 3 (see PDF) shows the four test cases required to achieve 100% MC/DC for the decision (A && (B || C)). [Because C evaluates Boolean expressions incompletely (short-circuit evaluation), only four test cases are possible despite the three conditions, not eight. And because four test cases are necessary to achieve 100% MC/DC with three conditions, all four test cases must be executed.] Surprisingly, different MC/DC values are obtained as long as not all four test cases have been executed, depending on the calculation method used.
One method (test case counting) starts with the number of necessary test cases, four in our example. If one of these four test cases (any one) has been executed, the result is 25% MC/DC. Two test cases result in 50%, three test cases in 75%, and four test cases finally in 100% MC/DC.
The other method (test pair counting) starts with the number of necessary pairs of test cases. In our example there are three, because there are three conditions. When a single test case, for example test case no. 1, has been executed, no pair has yet been executed, because a pair requires two test cases. Therefore, this method determines 0% MC/DC for test case no. 1 alone. If another test case, for example test case no. 2, is then executed, the MC/DC value remains at 0%, because test case no. 1 and test case no. 2 both have the same overall result and therefore cannot form a pair for any condition. With test case no. 3, two pairs are then completed (one for condition A and one for condition C), and thus 66% MC/DC is reached at once.
These value curves are shown on the right side of Figure 3 (see PDF). It is noteworthy that, for example, to fulfill a requirement of 70% MC/DC, one method requires 3 test cases, while the other requires 4 test cases (see also Figure 4, PDF).
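The two calculation methods can be compared with a short sketch. The numbering of the four test cases below is an assumption chosen to be consistent with the description above (the figure itself is not reproduced here), and conditions skipped by short-circuit evaluation are treated as "don't care" when forming pairs. All names are hypothetical.

```c
#include <stddef.h>

#define N_COND 3             /* conditions A, B, C */
#define N_TEST_CASES 4
#define NOT_EVALUATED (-1)   /* skipped by short-circuit evaluation */

typedef struct {
    int cond[N_COND];   /* 1 = true, 0 = false, or NOT_EVALUATED */
    int outcome;        /* result of A && (B || C) */
} TestCase;

/* Assumed numbering of the four possible test cases. */
static const TestCase TEST_CASES[N_TEST_CASES] = {
    { { 0, NOT_EVALUATED, NOT_EVALUATED }, 0 },   /* no. 1 */
    { { 1, 0, 0 },                         0 },   /* no. 2 */
    { { 1, 0, 1 },                         1 },   /* no. 3 */
    { { 1, 1, NOT_EVALUATED },             1 },   /* no. 4 */
};

/* Do x and y form a pair for condition k? */
static int forms_pair(const TestCase *x, const TestCase *y, size_t k)
{
    if (x->cond[k] == NOT_EVALUATED || y->cond[k] == NOT_EVALUATED) {
        return 0;
    }
    if (x->cond[k] == y->cond[k]) {
        return 0;
    }
    for (size_t i = 0; i < N_COND; i++) {
        if (i == k) {
            continue;
        }
        if (x->cond[i] == NOT_EVALUATED || y->cond[i] == NOT_EVALUATED) {
            continue;   /* skipped condition: treated as "don't care" */
        }
        if (x->cond[i] != y->cond[i]) {
            return 0;
        }
    }
    return x->outcome != y->outcome;
}

/* Method 1: percentage of necessary test cases executed so far. */
int mcdc_by_test_case_counting(size_t executed)
{
    return (int)(executed * 100 / N_TEST_CASES);
}

/* Method 2: percentage of conditions for which a pair exists among
 * the first `executed` test cases (rounded down). */
int mcdc_by_pair_counting(size_t executed)
{
    int covered = 0;
    for (size_t k = 0; k < N_COND; k++) {
        int found = 0;
        for (size_t i = 0; i < executed && !found; i++) {
            for (size_t j = i + 1; j < executed && !found; j++) {
                found = forms_pair(&TEST_CASES[i], &TEST_CASES[j], k);
            }
        }
        covered += found;
    }
    return covered * 100 / N_COND;
}
```

With this numbering, test case counting yields 25, 50, 75, and 100 after one to four test cases, whereas pair counting yields 0, 0, 66, and 100.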
Is it permissible to add coverage values for code variants?
The left side of Figure 5 (see PDF) shows a function that exhibits different behavior depending on whether a preprocessor constant is defined and, if so, which one. The right side of Figure 5 lists five test cases for this function. In test case 1, no preprocessor constant is defined, the input is 3, so the `then` branch of the `if` statement is executed, and the return value (the output) is 3. In test case 4, the preprocessor constant `VARIANT_1` is defined, the input is 0, so the `else` branch is executed, and the output is 12 (=12/(0+1)). The test cases show that the `else` branch of the `if` statement is not executed when the preprocessor constant `VARIANT_2` is defined. Thus, the branch coverage for this variant is not 100%. Since the `else` branch is identical for all variants, one might argue that it has already been executed in the other variants. The insufficient coverage in one variant could then be "supplemented" to 100% by the coverage of the other variants. That this is not a good idea becomes clear when one runs the still missing test case for variant 2: the input 3 with the preprocessor constant `VARIANT_2` defined leads to a division by zero. Therefore, the cumulative coverage of the three variants must not be counted as 100% (see Figure 6, PDF).
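A function with this behavior could look like the following sketch. It is a hypothetical reconstruction that merely matches the test cases quoted above; it is not the code from Figure 5, and the function name and the variant-specific adjustments of x are assumptions.

```c
/* Hypothetical function with three build variants:
 * no constant defined, VARIANT_1 defined, or VARIANT_2 defined. */
int compute(int x)
{
#if defined(VARIANT_1)
    x = x * 2;   /* assumed adjustment for variant 1 */
#elif defined(VARIANT_2)
    x = x - 4;   /* assumed adjustment for variant 2 */
#endif
    if (x > 0) {
        return x;              /* then branch */
    } else {
        return 12 / (x + 1);   /* else branch, identical in all variants */
    }
}
```

Without a preprocessor constant, the input 3 takes the then branch and returns 3. With VARIANT_1, the input 0 takes the else branch and returns 12. With VARIANT_2, the input 3 becomes -1 and reaches the else branch, where the divisor (x + 1) is zero. The source text of the else branch is the same in every variant, but the values that reach it are not, which is why coverage from other variants cannot stand in for the missing test case.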
Does coverage measure the quality of the test cases?
Because coverage measurements do not adequately consider calculations, because they cannot detect omissions, and because even poor test cases often result in 100% coverage, coverage cannot be used to measure the quality of test cases. One method for assessing this quality would be, for example, mutation testing (or error seeding, as this procedure is called in IEC 61508).
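The idea behind mutation testing can be shown with a minimal sketch: a small change (a mutant) is deliberately introduced into the code, and the test cases are judged by whether they detect it. The function is_in_range() and its mutant are hypothetical examples.

```c
/* Original: checks whether value lies in the closed range [low, high]. */
int is_in_range(int value, int low, int high)
{
    return value >= low && value <= high;
}

/* Mutant: ">=" has been replaced by ">" (seeded error). */
int is_in_range_mutant(int value, int low, int high)
{
    return value > low && value <= high;
}
```

A test case with value 5 and the range 0 to 10 produces the same result for the original and the mutant, so it does not detect the seeded error, although it may already contribute fully to coverage. Only a test case at the boundary, with value 0, distinguishes the two and thereby demonstrates its quality.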
Bibliography and list of sources
[TESSY] https://www.hitex.de/tessy: More about the unit testing tool TESSY.
[BEIZER] Beizer, Boris: Software Testing Techniques, 2nd edition, New York, 1990.
[LIGGESMEYER] Liggesmeyer, Peter: Software-Qualität: Testen, Analysieren und Verifizieren von Software. Heidelberg, Berlin, 2002. Spektrum Akademischer Verlag.
[SPILLNER] Spillner, A., Linz, T.: Basiswissen Softwaretest. Heidelberg, 2003. dpunkt-Verlag.
[61508] IEC 61508, Functional Safety of electrical/electronic/programmable electronic safety-related systems, 2010.
[26262] ISO 26262, Road Vehicles – Functional Safety, 2011.
[DO-178C] Software Considerations In Airborne Systems And Equipment Certification, RTCA, 2011.
[CAST-10] Certification Authorities Software Team, Position Paper 10, 2002.
