About the automatic generation of module test cases
Authors: Dr. Stephan Grünfelder, Bernhard Peischl
Contribution – Embedded Software Engineering Congress 2015
For several years now, tools for the automatic generation of unit test cases have been available. These tools generate unit tests quickly, without any prior knowledge of the correct functionality of the software under test. The "tests" are created solely based on the existing source code, and the tools aim to achieve high structural test coverage. A completely different approach is the test-first strategy: unit tests are created exclusively based on the (design) specification, even before the code exists. Structural test coverage is not crucial for this approach; high functional test coverage is the primary objective. This article explores the advantages and disadvantages of these two strategies and shows how to combine these seemingly incompatible approaches effectively.
Test-Driven Development (TDD), also known as the test-first strategy, is a methodology from the agile software development portfolio. It's not a testing method but a development methodology where the developer begins writing the unit test for their own code before starting to code. Step by step, in small iterations, the test is implemented or extended, followed by the code to be tested. Once the test no longer reports any errors, a new iteration begins.
This development method forces the developer of a software module to design the module in such a way that its interfaces are easily testable. Furthermore, it compels them to derive their test cases solely from the module's design (specification). They are not tempted to simply execute the code in unit tests, focusing only on the source code under test without thoroughly examining its purpose. On the contrary, the test definitively verifies the code's correct behavior, even though the code under test is not yet available.
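One such iteration can be sketched in C as follows. The function saturating_add() and its one-sentence specification are assumptions made up for this illustration; they do not come from the article. The point is only the order of the steps: the test exists before the body of the function does.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical unit under development; at first only its interface exists. */
int saturating_add(int a, int b);

/* Step 1: the test is derived from the specification alone:
   "return a + b, but clip the result to INT_MAX or INT_MIN". */
void test_saturating_add(void)
{
    assert(saturating_add(2, 3) == 5);
    assert(saturating_add(INT_MAX, 1) == INT_MAX);
    assert(saturating_add(INT_MIN, -1) == INT_MIN);
    assert(saturating_add(INT_MAX, INT_MIN) == -1);
}

/* Step 2: only now is the code written, just enough to satisfy the test. */
int saturating_add(int a, int b)
{
    if (b > 0 && a > INT_MAX - b) return INT_MAX;   /* would overflow upwards */
    if (b < 0 && a < INT_MIN - b) return INT_MIN;   /* would overflow downwards */
    return a + b;
}
```

Calling test_saturating_add() from a test driver completes the iteration; the next iteration would extend the test first and the code second.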
TDD vs. traditional unit testing
Unlike traditional unit testing, which means testing after programming, TDD doesn't have a defined "test end criterion." Agile literature recommends stopping test case writing "when you run out of things to say." In classic unit testing, achieving a certain level of structural test coverage is usually a prerequisite for completion. For embedded systems, 100% Decision Coverage (that is, traversing all branches of the code under test) is often required. However, as with TDD, the tester should test the unit against its intended purpose (its design specification) and additionally measure test coverage [1]. If the source code already exists, there is undoubtedly a temptation to simply base the test cases on the source code and be content with merely executing the code instead of testing it against the design specification. But for creating "test cases" that simply execute the code, there are now modern unit testing tools that can do this automatically in a few seconds by generating test cases from the code.
Automatic generation of unit tests
A test typically checks the response of the test object against an expected response for specific inputs. When discussing the automatic generation of test cases, it is important to note that, for given inputs, the expected result of a function under test is not known to a test tool. Therefore, a test tool that automatically generates test cases for a function only creates inputs for that function; it can then calculate the actual result and suggest it as the expected result.
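The consequence can be illustrated with a small C sketch. The function absolute(), its deliberate defect, and the helper generate_cases() are hypothetical and only mimic the principle; they are not how any particular tool is implemented.

```c
#include <assert.h>

/* Hypothetical function under test. The specification asks for the absolute
   value, but the comparison is off by one, so -1 is returned unchanged. */
static int absolute(int v)
{
    if (v < -1) return -v;   /* bug: should be v < 0 */
    return v;
}

/* A generated "test case": an input plus the result observed for it. */
struct generated_case { int input; int suggested; };

/* Mimics a generator: it picks inputs, runs the code and records what
   comes back. It has no way of knowing what should come back. */
void generate_cases(struct generated_case *cases, const int *inputs, int n)
{
    assert(cases != 0 && inputs != 0);
    for (int i = 0; i < n; i++) {
        cases[i].input = inputs[i];
        cases[i].suggested = absolute(inputs[i]);   /* actual result, offered as "expected" */
    }
}
```

For the input -1 the sketch suggests -1 as the expected result. A test built from this suggestion passes, and only a tester who compares the suggestion with the specification notices the defect.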
Many tools proceed as follows: they select test cases from a typically vast number of possibilities in such a way as to maximize structural test coverage. Many authors refer to this as search-based testing [7]. Academic tools, such as the Java tool EvoSuite, use genetic algorithms for this search [2]. A genetic algorithm attempts to imitate the processes of evolution, above all the idea of "survival of the fittest." In our case, "fit" means achieving good test coverage. An initial population of solution candidates (in our case, sets of test cases) is generated randomly or using heuristic methods. The fittest pairs of these candidates produce offspring by combining their genetic material (their test parameters). With a certain probability, a crossover or a mutation occurs. With each iteration of the search process, the fitness (test coverage) increases through natural selection until either one of the candidates in the population is a solution (i.e., the test set achieves 100% test coverage) or another termination criterion is met, such as exceeding a time limit.
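The search loop described above can be sketched in a few lines of C. This is a strongly simplified illustration and not the algorithm of EvoSuite or any other tool: the function classify(), the population size, the input range, and the mutation probability are all assumptions chosen for brevity, and a small fixed-seed random generator is used so that runs are reproducible.

```c
#include <assert.h>

#define BRANCHES 6   /* branch outcomes in classify() */
#define CASES 3      /* test cases per candidate test set */
#define POP 8        /* candidate test sets in the population */
#define RANGE 32     /* inputs are drawn from 0..RANGE-1 */

struct candidate { int a[CASES]; int b[CASES]; };

static unsigned covered;             /* bit i set: branch outcome i was executed */
static unsigned long rng_state = 1;  /* fixed seed keeps the sketch reproducible */

static int next_random(int range)    /* small linear congruential generator */
{
    rng_state = rng_state * 1103515245UL + 12345UL;
    return (int)((rng_state >> 16) % (unsigned long)range);
}

/* Hypothetical function under test, instrumented by hand. */
static void classify(int a, int b)
{
    if (a > 28) covered |= 1u << 0; else covered |= 1u << 1;
    if (b < 4)  covered |= 1u << 2; else covered |= 1u << 3;
    if (a == b) covered |= 1u << 4; else covered |= 1u << 5;
}

/* Fitness of a test set: the number of branch outcomes it executes. */
static int fitness(const struct candidate *c)
{
    int count = 0;
    covered = 0;
    for (int k = 0; k < CASES; k++) classify(c->a[k], c->b[k]);
    for (unsigned bits = covered; bits != 0; bits >>= 1) count += (int)(bits & 1u);
    return count;
}

/* Returns the best fitness found; *initial_best receives the best fitness
   of the random start population. */
int evolve(int max_generations, int *initial_best)
{
    struct candidate pop[POP], child;
    int fit[POP], p1, p2, worst;

    assert(max_generations >= 0 && initial_best != 0);
    for (int i = 0; i < POP; i++)                      /* random initial population */
        for (int k = 0; k < CASES; k++) {
            pop[i].a[k] = next_random(RANGE);
            pop[i].b[k] = next_random(RANGE);
        }

    for (int gen = 0; ; gen++) {
        for (int i = 0; i < POP; i++) fit[i] = fitness(&pop[i]);
        p1 = 0; worst = 0;
        for (int i = 1; i < POP; i++) {
            if (fit[i] > fit[p1]) p1 = i;              /* fittest, first on ties */
            if (fit[i] <= fit[worst]) worst = i;       /* least fit, last on ties */
        }
        p2 = (p1 == 0) ? 1 : 0;
        for (int i = 0; i < POP; i++)
            if (i != p1 && fit[i] > fit[p2]) p2 = i;   /* second parent */

        if (gen == 0) *initial_best = fit[p1];
        if (fit[p1] == BRANCHES || gen == max_generations) return fit[p1];

        for (int k = 0; k < CASES; k++) {              /* crossover of the two parents */
            const struct candidate *parent = next_random(2) ? &pop[p1] : &pop[p2];
            child.a[k] = parent->a[k];
            child.b[k] = parent->b[k];
        }
        if (next_random(4) == 0) {                     /* occasional mutation */
            int k = next_random(CASES);
            child.a[k] = next_random(RANGE);
            child.b[k] = next_random(RANGE);
        }
        pop[worst] = child;                            /* selection: replace the least fit */
    }
}
```

Because the fittest candidate is never replaced, the best coverage cannot decrease from one generation to the next. Whether full coverage is reached within the generation limit depends on how quickly the search stumbles on inputs with a == b, which is the kind of condition a purely coverage-driven search finds hard.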
Researching efficient search methods
Pure genetic search methods are not particularly intelligent in their searches. For example, it can take thousands of iterations before a string-compare function reports a match with a password. Therefore, these methods are sometimes replaced or supplemented by Dynamic Symbolic Execution. In this more intelligent approach, to put it simply, specific test runs log which parameters influence branching conditions, and these parameters are then selectively modified for new test runs. The PEX tool uses Dynamic Symbolic Execution for testing .NET languages [3].
Academic publications suggest that search-based testing could become increasingly important. However, research teams rarely present tools for low-level programming languages. One exception is the tool AUSTIN (AUgmented Search-based TestING), see https://code.google.com/p/austin-sbst. This tool supports both search-based testing and dynamic symbolic execution and is a freely available unit testing tool for C programs, making it particularly interesting for embedded developers. AUSTIN has been used in the automotive sector [8] as well as for testing open-source programs [9]. A study at King's College London compared AUSTIN with other evolutionary methods for test case generation and concluded that AUSTIN is just as effective in terms of structural coverage but considerably more efficient [8].
Commercial tools
Unlike tools of academic origin, little is known about how commercial tools work. However, these tools are particularly interesting for embedded software testing because they generate tests for C/C++. The unit testing tool Cantata, for example, likely uses a type of backtracking. In backtracking, randomness plays a significantly smaller role. Similar to finding a way out of a maze, these algorithms constantly reassess the current situation and, if necessary, deliberately return to a previously visited location to search for a different solution. As a starting point for this search, Cantata likely simply chooses zero for the environment values and parameters of the functions under test, as suggested by the test cases this tool generates for Listing 1.
#include <stdio.h>
#include <stdlib.h>

static int *values;
static int size = 0;
static int max_size = 3;

void init_stack(void)
{
    values = malloc(3 * sizeof(int));
}

/* Doubles the capacity of the stack. */
static void resize(void)
{
    values = realloc(values, max_size * 2 * sizeof(int)); /* size in bytes */
    max_size = max_size * 2;
}

void push(int x)
{
    if (size >= max_size) resize(); /* full stack */
    if (size < max_size)
    {
        values[size++] = x;
    }
    /* else branch is infeasible, thus 100% DC is infeasible */
}

int pop(void)
{
    if (size > 0)
    {
        return values[--size]; /* top element is at index size - 1 */
    }
    else
    {
        printf("error");
        return 0;
    }
}
Listing 1: A stack implementation taken from [4] and translated from Java to C.
Listing 1 illustrates a particular challenge for a testing tool. Strictly speaking, no set of test cases can achieve 100% branch coverage for this code, because the else branch of the second if statement in push() can never be reached. The EvoSuite tool attempts to achieve 100% decision coverage by repeatedly calling the provided functions and ultimately solves the problem by accepting, after a certain time, that certain paths cannot be reached [4]. Cantata has a more "brutal" solution that still allows 100% decision coverage for Listing 1: the variables max_size and size are overwritten with zeros before push() is called. With a project's default settings, the tool is also allowed to manipulate static variables from the outside, and so 100% branch coverage is reachable for push(). However, the path taken in this way can never be traversed in the production system, because max_size can never be zero there. Figure 1 (see PDF) shows how Cantata presents the 6 automatically generated test cases, and Figure 2 (see PDF) shows how the user is informed about the achieved test coverage in a summary report.
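The effect of overwriting the two static variables can be reproduced with a small sketch. It repeats push() and resize() from Listing 1, with the byte count passed to realloc() written out in full, and adds a counter in the else branch. The counter and the two helper functions are assumptions added for this illustration; they are not output of Cantata.

```c
#include <assert.h>
#include <stdlib.h>

static int *values;
static int size = 0;
static int max_size = 3;
static int else_branch_hits = 0;   /* added instrumentation, not part of Listing 1 */

static void resize(void)
{
    values = realloc(values, (size_t)max_size * 2 * sizeof(int));
    max_size = max_size * 2;
}

void push(int x)
{
    if (size >= max_size) resize();   /* full stack */
    if (size < max_size) {
        values[size++] = x;
    } else {
        else_branch_hits++;           /* unreachable through the public interface */
    }
}

/* Regular use: the stack grows on demand, so the else branch is never taken. */
int hits_with_regular_calls(void)
{
    values = malloc(3 * sizeof(int));
    size = 0; max_size = 3; else_branch_hits = 0;
    for (int i = 0; i < 100; i++) push(i);
    free(values); values = NULL;
    return else_branch_hits;
}

/* The described trick: both statics are set to zero before push() is called.
   Doubling a capacity of zero yields zero again, so the else branch is taken. */
int hits_with_zeroed_statics(void)
{
    values = NULL;
    size = 0; max_size = 0; else_branch_hits = 0;
    push(42);
    free(values); values = NULL;
    return else_branch_hits;
}
```

A hundred regular calls never reach the else branch, while a single call with zeroed statics does. The second scenario gains coverage but exercises a state the production code cannot enter.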
Academic tools
Research teams are equipping some academic tools with product features that are (still) not found in commercial unit testing tools for embedded systems and may not even be really needed in the future: they test the code against universally valid requirements: no overflow of arithmetic operations, no null pointer dereferencing, no division by zero, array indices must be within the defined range of values.
The absence of one of the described errors cannot, of course, be proven by testing, but one can certainly try to provoke such an error. To make provoking an out-of-bounds array access more attractive to the search algorithm, the EvoSuite tool transforms array accesses in the Java bytecode as follows:
void test(int x)
{
    /* inserted checks: explicit branches the search can try to cover */
    if (x < 0) throw new NegativeArraySizeException();
    if (x >= foo.length) throw new ArrayIndexOutOfBoundsException();
    foo[x] = 0; /* original array access */
}
Listing 2: Demo of the code transformation used by EvoSuite, see [3].
Here, unlike the previously discussed generated test cases, there is a high probability that test cases are actually generated that aim to verify compliance with a specification requirement: the software should not crash. However, in the embedded systems field, robust code is sometimes intentionally omitted for performance reasons, and overflows, for example, are not prohibited if the application's environment is indifferent to them or if they are extremely unlikely. This means that many, but not all, automatically generated tests of this kind are actually useful in practice.
Troubleshooting with and without tools, Part 1
A large-scale study with over 100 participants was conducted at the University of Sheffield to determine whether using EvoSuite's automatically generated unit tests led to more errors being found than manually implementing tests with JUnit [5]. The participants were mostly students. In the study, they were given 2 or 3 hours to complete a test task with and without the test generator, and shortly before receiving the task, they attended a JUnit refresher course. The study organizers endeavored to obtain the most objective results possible in this comparison.
- The comments in the source code were detailed enough to serve as a complete specification.
- The Java classes to be tested were neither trivial nor so complex that special expertise was necessary/helpful.
- The classes to be tested could be understood without time-consuming study of other classes and contained a mix of string processing and numerical tasks.
One result of the experiment was that the test coverage achieved was roughly the same with and without the tool (the participants had no way of measuring the achieved test coverage). This was quite encouraging for the scientists, as it showed that their test generator could compete with human counterparts. However, the second result was sobering for some: the number of errors found was also roughly the same with and without the test generator.
A fool with a tool is still a fool
The authors of the experiment correctly noted in their published results that the participants were unfamiliar with the code prior to the experiment. This is a situation that applies to very few projects, as the author of unit tests is almost always the author of the code. However, a few further points warrant criticism:
- For an industrial project, not only the test depth/quality achieved with a tool is of interest, but also the costs incurred for that achieved quality. Therefore, a performance comparison of the two groups would likely be one of the most interesting figures for industrial applications.
- To do this correctly, one would need to compare test subjects who have months of experience using the respective technique. Such a comparison will probably never happen due to the effort involved.
- Many (comparatively inexperienced) test subjects spent (wasted) some of their time figuring out how the automatically generated tests performed on the test object instead of using the test data to assess whether the automatically generated test made sense or not, and using the time to manually create further test cases.
The final point of criticism offers a possible explanation for the identical error detection rate with and without a test generator: errors are found through well-thought-out and effective test cases. Relevant training, skill, and experience on the part of the tester contribute to this. The test tool cannot replace the tester's thinking and therefore will not increase the number of errors found. However, it can certainly accelerate test creation, because it relieves the tester of comparatively tedious work and leaves more time to concentrate on the test strategy, as discussed in the following paragraphs of the article.
Troubleshooting with and without tools, Part 2
Even unit testing tools without automatic test generation accelerate test writing. For example, stubs (mock-ups of functions that are called by the code under test but are not available) are generated automatically, and test drivers are created by the tool based on tabular input. This acceleration compared to the test-first strategy is possible because the tool can analyze the interfaces of the code under test, thus relieving the user of typing work.
Test tools like Cantata, Tessy, VectorCast, and others can also log structural test coverage upon request. As mentioned above, while achieving the desired coverage of 100% is not a sufficient condition to end testing, failing to reach 100% almost always indicates that test cases are still missing. The test-first strategy omits this warning system.
And Cantata offers another form of support that is not possible with TDD: to a certain extent, the testing tool allows the tester not only to determine that the code under test does what it is supposed to do (by defining test cases), but also to detect when something happens that should not happen. After each test case, Cantata checks all variables for changes and can therefore detect, for example, when a stray pointer unintentionally overwrites other variables.
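The principle behind such a check can be sketched by hand in C: take a snapshot of the unit's state before the call and compare it afterwards, ignoring only the variables that are allowed to change. The structure unit_state, the misdirected pointer, and both functions are hypothetical and chosen for illustration; this is not how Cantata implements the check.

```c
#include <assert.h>

/* Hypothetical state of a unit under test. */
struct unit_state { int counter; int calibration; };
static struct unit_state state = { 0, 1234 };

/* Bug: the pointer was meant to refer to state.counter. */
static int *target = &state.calibration;

void increment_counter(void)
{
    *target += 1;   /* silently modifies the calibration value */
}

/* Runs the function and reports whether anything other than the
   counter, the only variable that is allowed to change, was modified. */
int changes_unexpected_state(void)
{
    struct unit_state before = state;   /* snapshot before the call */
    increment_counter();
    return state.calibration != before.calibration;
}
```

In this sketch the check reports the unintended change to the calibration value, a side effect that is easy to overlook when a hand-written test inspects only the variables the tester happens to think of.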
If these three advantages of a modern unit testing tool over TDD also apply without automatic test generation, and if the University of Sheffield experiment showed that testers with test generators don't find more errors than testers without them, what remains of automatic test generation? Test acceleration. The following paragraphs show how to apply the core ideas of TDD to testing with test generators, thereby further reducing test time and making unit tests more profitable.
Test FAST instead of Test FIRST
The article began by discussing the advantages of TDD. The previous section explained the disadvantages of TDD compared to using modern unit testing tools. Now we will show how to combine and utilize the advantages of both approaches while mitigating their respective disadvantages.
Undoubtedly, the advantages and core ideas of TDD—(1) the exclusive focus on the specification when defining test cases and (2) the design of units that are easily testable at their interfaces—can also be applied in traditional testing with tools. This simply requires a certain level of discipline.
Test generators shouldn't be overestimated, and you shouldn't let them mislead you. Tests generated by a generator can be completely nonsensical. Thoughtful test design is and remains the tester's responsibility. Generators can reduce tedious typing, but not the need for creative test design. Therefore, if you want to use automatic unit test generation effectively, you need to know how to use the tools. The following "Test FAST" code of conduct for using test generators can help you get the most out of a unit test project and save testing time compared to other approaches. Here's a little teaser: automatically generating the tests for Listing 1 takes less than 2 seconds.
To implement Test FAST, it is recommended to observe the following rules:
- When designing new code, consider its testability.
- Let the testing tool generate test cases and evaluate their suitability without looking at the source code. The design specification is the sole reference. Delete all test cases that are irrelevant.
- Now add your own test cases as soon as possible after writing the code. Design these test cases solely based on the design specification; avoid looking at the source code.
- Use methodical black-box testing techniques, such as boundary value analysis, pairwise testing, decision table techniques, and so on, to derive powerful test cases.
- Measure test coverage only when you believe the unit under test is fully tested. If the tests unexpectedly do not yet reach 100% coverage, first try to achieve higher coverage by analyzing the test cases before debugging the test and inspecting the code under test.
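As an illustration of the black-box techniques named in the rules above, the following sketch applies boundary value analysis. The function is_valid_duty_cycle() and its specification ("duty cycles from 0 to 100 percent are valid") are assumptions invented for this example.

```c
#include <assert.h>

/* Hypothetical unit: according to its design specification,
   duty cycles from 0 to 100 percent are valid. */
int is_valid_duty_cycle(int percent)
{
    return percent >= 0 && percent <= 100;
}

/* Boundary value analysis: each edge of the valid range is tested
   together with its nearest invalid neighbour. */
void test_duty_cycle_boundaries(void)
{
    assert(!is_valid_duty_cycle(-1));
    assert( is_valid_duty_cycle(0));
    assert( is_valid_duty_cycle(100));
    assert(!is_valid_duty_cycle(101));
}
```

All four test cases follow from the specification alone. None of them requires a look at the source code, which is the point of the third and fourth rules.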
Many companies have tried to implement Test First to achieve quality improvements. Many abandoned it because programmers didn't find it particularly appealing to write (even trivial) test cases that they might later have to discard if they changed the architecture of a software unit during development. Test FAST has the potential to counteract this: programmers can first focus on implementing the product, then on implementing the concise test cases; the simple tests are generated automatically.
Summary and Outlook
This article demonstrated how unit test generators work and how to use them effectively. It highlighted the undeniable advantages of TDD in relation to tool usage and introduced the Test FAST methodology.
The article pointed out that while achieving 100% test coverage is usually a necessary condition for good tests, both for automated and manually created tests, it is certainly not sufficient. Listing 3 shows another example. The test cases achieve 100% branch coverage and yet still fail to detect a very serious error.
unsigned max(unsigned a, unsigned b, unsigned c)
{
    unsigned max = 0;
    if (a > c) max = a;
    if (b > a) max = b;
    if (c > b) max = c;
    return max;
}

/*
   Test 1: max(3,5,7) == 7
   Test 2: max(5,7,3) == 7
   Test 3: max(7,5,3) == 7
*/
Listing 3: An example of bad tests from [6]. A bug is not detected.
Most currently available commercial tools for automatically generating C/C++ unit tests do not offer test cases for Listing 3 that alert the user to the error, because they only maximize test coverage and do not yet offer intelligent test data selection. However, vendors are constantly working to make test generators more intelligent. By 2016, it is expected that many commercial generators will be intelligent enough to present the following test suggestions for Listing 3, thus enabling quick identification of the code error:
- max(INT_MAX, 0, 0) == INT_MAX
- max(0, INT_MAX, 0) == INT_MAX
- max(0, 0, INT_MAX) == INT_MAX
- max(INT_MAX, INT_MAX, INT_MAX) == 0
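The four suggestions can be checked against the code with a short sketch. It repeats the function from Listing 3 unchanged, including its defect, and compares each result with the specification "return the largest argument". The helper count_contradictions() is an assumption added for this illustration.

```c
#include <assert.h>
#include <limits.h>

/* The function from Listing 3, unchanged, including its defect. */
unsigned max(unsigned a, unsigned b, unsigned c)
{
    unsigned max = 0;
    if (a > c) max = a;
    if (b > a) max = b;
    if (c > b) max = c;
    return max;
}

/* Counts how many of the four suggested inputs produce a result that
   contradicts the specification "return the largest argument". */
int count_contradictions(void)
{
    const unsigned big = INT_MAX;
    int wrong = 0;
    if (max(big, 0, 0) != big) wrong++;
    if (max(0, big, 0) != big) wrong++;
    if (max(0, 0, big) != big) wrong++;
    if (max(big, big, big) != big) wrong++;   /* no condition fires, result stays 0 */
    return wrong;
}
```

Three of the suggestions agree with the specification. The fourth suggests 0 as the maximum of three equal non-zero values, which a tester who reads the suggestion against the specification will recognize as wrong at a glance.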
Technical Afterword
A method that is based exclusively on the source code, but for which there is hardly any tool support, is baseline testing, which is often illustrated using control flow graphs. The idea behind this method is to generate a set of independent paths through the control flow graph of the software under test. Put simply, a new test case is created from an existing one by changing only a single branch decision at a time, until this is no longer possible.
int x = 0;

int do_nothing(int a, int b)
{
    if (a > 10) x += 47;
    if (b > 10) x -= 47;
    return x;
}
Listing 4: This code should never change x. But it does.
For example, using this method, the following three test cases could be derived from the code in Listing 4:
- a case where neither of the two conditions applies, e.g. do_nothing(10,10);
- a case where only the first condition is true, e.g. do_nothing(11,10);
- a case where only the second condition is true, e.g. do_nothing(10,11).
To achieve 100% Decision Coverage, two test cases would be sufficient here. The two test cases do_nothing(11,11) and do_nothing(10,10) take both decisions in every possible direction, but they do not detect that x is changed improperly. Baseline testing would also have found the error in the function max() of Listing 3 relatively easily, unless the path on which none of the three if conditions fires is exercised by the call max(0,0,0). This is the only parameter choice for which baseline testing fails to detect the error. While baseline testing selects test cases more intelligently based on the source code, it is no more effective if the test data is poorly chosen.
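The difference between the two test sets for Listing 4 can be made visible with a short sketch. It repeats the code from Listing 4 unchanged; the helper leaves_x_unchanged() is an assumption added for this illustration and checks the requirement stated in the listing's caption.

```c
#include <assert.h>

int x = 0;

int do_nothing(int a, int b)   /* Listing 4, unchanged */
{
    if (a > 10) x += 47;
    if (b > 10) x -= 47;
    return x;
}

/* Returns 1 if a call with the given arguments leaves x unchanged. */
static int leaves_x_unchanged(int a, int b)
{
    int before = x;
    int unchanged;
    do_nothing(a, b);
    unchanged = (x == before);
    x = before;   /* restore x so that the test cases stay independent */
    return unchanged;
}
```

With the arguments (10,10) and (11,11) the helper returns 1, so the two test cases that suffice for 100% Decision Coverage both pass. With (11,10) and (10,11), the two additional baseline paths, it returns 0 and the defect becomes visible.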
References
The publications by Gordon Fraser et al. can be downloaded free of charge from www.evosuite.org.
[1] Stephan Grünfelder, Software testing for embedded systems, Dpunkt-Verlag, Heidelberg 2013.
[2] Gordon Fraser, Andrea Arcuri, Phil McMinn: A Memetic Algorithm for Whole Test Suite Generation. Journal of Systems and Software, Volume 103, May 2015, pages 311–327.
[3] Gordon Fraser, Andrea Arcuri: 1600 Faults in 100 Projects: Automatically Finding Faults While Achieving High Coverage with EvoSuite. Empirical Software Engineering, June 2015, Volume 20, Issue 3, pp 611-639.
[4] Gordon Fraser, Andrea Arcuri: Evolutionary Generation of Whole Test Suites. Proc. 11th International Conference on Quality Software, 2011, pp. 31-40.
[5] G. Fraser, M. Staats, P. McMinn, A. Arcuri, F. Padberg: Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study. ACM Transactions on Software Engineering and Methodology, vol. 24, no. 4, 2015.
[6] Andreas Spillner, "Agility and Systematic Testing", lecture at the BCD Acceptance Café on April 16, 2013 in Vienna.
[7] S. Ali, L.C. Briand, H. Hemmati, R.K. Panesar-Walawege: A Systematic Review of the Application and Empirical Investigation of Search-Based Test Case Generation. IEEE Transactions on Software Engineering, vol. 36, no. 6, pp. 742-762, Nov.-Dec. 2010.
[8] K. Lakhotia, M. Harman, H. Gross: AUSTIN: A Tool for Search Based Software Testing for the C Language and Its Evaluation on Deployed Automotive Systems. Second International Symposium on Search Based Software Engineering (SSBSE), 7-9 Sept. 2010, pp. 101-110.
[9] K. Lakhotia, P. McMinn, and M. Harman, Automated Test Data Generation for Coverage: Haven't We Solved This Problem Yet? in 4th Testing Academia and Industry Conference – Practice and Research Techniques, 2009, pp. 95–104.
[10] P. McMinn, Search-based software test data generation: A survey, Software Testing, Verification and Reliability, vol. 14, no. 2, pp. 105–156, Jun. 2004.