Evaluation of software verification tools

What can they do and what are their capabilities?

Authors: Ralf Gerlich, Dr. Rainer Gerlich, BSSE System and Software Engineering (BSSE), Christian R. Prause, German Aerospace Center (DLR)

Contribution – Embedded Software Engineering Congress 2016

Software verification tools are designed to find errors in software. However, there is little information available about what these tools can actually do. Usually, only the manufacturer's description is available, which users can use as a (rough) guide. As a result, tools are often chosen based on ease of use or prevalence in the software community rather than their actual ability to find errors – which may differ from the manufacturer's specifications.

But without detailed information about a tool's capabilities, objectively speaking, it's really just a good feeling to have done more for quality assurance, at least a little more than before. But could even more be done? To answer this question, the Space Administration of the German Aerospace Center (DLR) commissioned a study of six tools applied to representative middleware (embedded software) from the space sector. This activity is currently being continued with additional tools and software. The results of the completed study are presented in this article.

overview

The six tools represent a cross-section of static and dynamic verification methods. A key selection criterion was that, apart from tool configuration, no manual intervention is required during the analysis, particularly with the dynamic method. The tools employ the following methods:

symbolic execution (2)
abstract interpretation (2)
automatic testing (1).

The test tool applies massive stimulation to every function of the middleware, observes and evaluates any anomalies such as exceptions or index out-of-range errors. Additionally, a compiler (gcc with –Wall under mingw) taken into account to get an indication of how far the capabilities of today's compilers are comparable to those of StaticAnalyzers.

Comparing the results from the different tools proved to be a major challenge and could not be automated as intended: in some cases, line numbers for matching were missing, and messages for the same error could be reported for different lines, even duplicates. The follow-up project will now attempt to increase the level of automation based on the previous experience.

Advice for tool selection and use

The following advice – based on common misconceptions – should help to avoid errors in the selection and use of verification tools:

The tools are no different.

Incorrect. The tools always implement a specific error detection method and apply a specific set of rules. Therefore, the results regarding detected errors can differ significantly.

One tool finds all errors.

Incorrect. A tool can only find the errors, or more precisely, the error types, that its underlying method and implementation support. The implementation can further restrict the method's scope.

Simply use the tool, that's all it takes.

Incorrect. A strategy should be developed for the application. Which errors can a tool detect, which should be detected, what is the acceptable residual risk, and which tools should be combined?

Such a strategy requires knowledge of the capabilities.

All tools aim to find all errors.

Incorrect. Tools can or must make compromises due to limited resources (memory, time, accuracy). Some minimize (waiting) time at the expense of accuracy and completeness.

This faster, more „superficial“ error detection can certainly be useful for eliminating many simple errors, reducing the runtimes of more accurate/complete tools, i.e., clearing away most of the rough garbage so that the finer garbage can then be better captured – i.e., a multi-stage optimization strategy.

The tools generate many inaccurate messages.

False. Current research shows that only a small percentage of reports are inaccurate, averaging approximately 25% across all tools.

The value that is usually communicated or assumed is a multiple of this observed value.

The tool is only used for acceptance testing at the end of the project.

Incorrect. Experience shows that this strategy leads to a high number of reports, a large proportion of which may be inaccurate – meaning "the error cannot occur" – or only applicable to a few marginal cases – meaning "it is unlikely that the error will occur."„

Every reported error can be fixed.

Incorrect. With a large number of reports, i.e., a mix of critical, less critical, and non-critical reports, critical reports can be overlooked. The goal should therefore be to reduce the number of inaccurate reports through preventative measures such as applying best practices and using tools early on.

Errors that are supported by the implementation of a tool will always be reported.

Incorrect. Certain errors can suppress messages for other errors. A complete overview is only obtained when no (critical) messages appear.

Inaccurate reports can be ignored.

Incorrect. Whether a message is accurate, i.e., indicates an error that needs to be corrected, can only be determined after (thorough) analysis. This requires (probably considerable) effort.

The aim should therefore be to avoid such reports through appropriate measures such as the application of "best practices" and the early use of tools, so that the causes of inaccurate reports can be identified as soon as possible.

Messages are always understandable and comprehensible.

Incorrect. The message is not always easy to understand. With some tools, you need to know the underlying method to understand the message and locate the source of the error.

The reports are unambiguous regarding the identification of an actual error.

Incorrect. The messages usually need to be interpreted. Whether the messages are accurate also depends on the tool's view of the software. Depending on whether context is taken into account—that is, limiting conditions, for example, in an integrated system—or not, the message can be a true or false positive (TP or FP). Some tools allow you to specify whether or not context is considered. If so, then the message is unambiguous with regard to the context setting.

Key figures

For the evaluation, 20 error types were identified, to which all messages from all tools could be mapped, and divided into three categories: critical, potentially critical, and non-critical. Sensitivity and precision were determined as key performance indicators (KPIs) across all error types, for the three categories, and for each error type individually. Both values range from 0 to 1. The closer to 1, the better.

Sensitivity characterizes the ability to detect errors and is the quotient of the number of errors found to the total number of errors. Since the number of errors in the tested software cannot, in principle, be determined, it was approximated by the union of all errors found with all tools, supplemented by further errors found manually during the analysis of the tool messages.

Precision is a measure of a tool's accuracy and is the quotient of the number of actual errors among all messages from the tool (TP / (TP+FP)).

Characterization of the tools

The methods used by the tools are briefly characterized in Table 1.

Tool	type	Analysis method
1	static	Symbolic execution, data flow analysis
2	static	Abstract Interpretation
3	dynamic	Auto-stimulation
4	static	Symbolic execution, data flow analysis
5	static	Data flow analysis
6	Compiler	Syntax, semantics

Table 1: Characterization of the tools

Results

The results presented here represent only a small selection of the total findings. They are based on specific evaluation criteria. Varying these criteria also changes the values for sensitivity and precision.

Sensitivity and precision are not correlated. Sensitivity values range from 0.04 to 0.7, while precision values range from 0.5 to 0.98. A tool with low sensitivity can have high precision.

Surprisingly, approximately two-thirds of the alerts are generated by only one of the tools, suggesting a significant increase in sensitivity when combining multiple tools. For example, the highest sensitivity value rises from 0.5 for a single tool to 0.8 when combining two tools.

Another result of the investigations and discussions with the tool manufacturers is shown in Figure 1 (PDFThe tools do not necessarily all have to aim to find as many errors as possible. The implementation can be based on a compromise: sensitivity versus execution time. It can be quite useful to use a tool with lower sensitivity to quickly find certain error types, in order to later perform more comprehensive and/or efficient analyses for more complex error types. One should simply be aware of the compromise that was made.

Summary

The results show that when planning software verification, knowledge of the characteristics of the tools to be selected is essential if the goal of optimal software quality is to be achieved. For further optimization and increased efficiency, the tools should be used early and continuously, and potentially in combination, throughout the coding phase. This allows the coding style to be tailored to the conditions of the subsequent verification. Many errors can be avoided in this way, along with the otherwise likely significant manual effort.

The compiler with only the -Wall option is not a serious competitor to static analyzers. Further investigation is planned to determine whether enabling optimization yields better results.

Further information on the evaluation and the results can be found in [1] and [2], or upon request from the authors.

thanksgiving

The authors thank the tool manufacturers and distributors for tools 4 and 5, who provided an evaluation license.

The content of this article is a result of the project „Evaluation of Software Verification Methods and Tools“ (ESVW) for the Space Administration of the German Aerospace Center (DLR) on behalf of the Federal Ministry for Economic Affairs and Energy (BMWi), grant number 50PS1505.

Bibliography and list of sources

[1] CRPrause, R.Gerlich, R.Gerlich, A.Fischer: „Characterizing Verification Tools Through Coding Error Candidates Reported in Space Flight Software“, Eurospace Symposium DASIA'15 ’Data Systems in Aerospace„, May 19 – 21, 2015, Barcelona, Spain

[2] R.Gerlich, R.Gerlich, A.Fischer, M.Pinto, CRPrause: „Early Results from Characterizing Verification Tools Through Coding Error Candidates Reported in Space Flight Software“, Eurospace Symposium DASIA'16 ’Data Systems in Aerospace„, 10 – 12 May, 2016, Tallinn, Estonia

Download the article as a PDF

Testing, Quality & Debugging – Our Training & Coaching

Do you want to bring yourself up to date with the latest technology?

Then find out more here MircoConsult offers training courses/seminars/workshops and individual coaching on the topics of testing, quality & debugging.

Training & coaching on the other topics in our portfolio can be found here. here.

Testing, Quality & Debug – Expertise

Valuable expertise on the topics of testing, quality & debugging is available. here Available for you to download free of charge.

To the specialist information

You can find expertise on other topics in our portfolio here. here.

MicroConsult Newsletter

With the MicroConsult newsletter, you'll stay on the pulse of the embedded world. Look forward to proven practical knowledge, real professional tips, and current events – directly from our experts for your project success.

Subscribe now!

Published by

weissblau media

← Use of debuggers in hardware-in-the-loop testing Software analytics in complex software products →