
Software analytics in complex software products

Usable knowledge from data about software and development

Author: Professor Dr. Rainer Koschke, University of Bremen

Contribution – Embedded Software Engineering Congress 2016

Summary

This article describes techniques, methods, and tools of software analytics. Software analytics is the science of acquiring and evaluating data about a software product or its development. The following questions are addressed: What usable data is generated during the development process? How can the relevant data for a given question be identified? How can the data be collected? Which data mining techniques can be used to analyze the data? What tools are available for data mining? How can data be meaningfully visualized? What are the opportunities and risks of (mis)interpretation?

What is software analytics and why is it important?

Software project managers have to make countless far-reaching decisions. Is the system sufficiently tested? Has the introduction of a new method or tool proven effective? Mistakes in these decisions can be costly. Unfortunately, the data needed to make informed decisions is often lacking. In such cases, managers have to rely on their own intuition or the opinions of others. However, the advice of others is only partly reliable, as it is often based on intuition itself. Human intuition, whether one's own or that of others, can be wrong. For example, common intuition suggests that large modules contain disproportionately more errors than small ones. Following this belief, one might test small modules less than large ones. In reality, however, empirical studies have shown that it is the small modules that contain disproportionately more errors.

For this reason, decisions should be based on empirical knowledge whenever possible. This first requires appropriate data, and then conclusions must be drawn from that data. Software analytics addresses both of these points. It is the science of acquiring and analyzing data about a software product and/or its development process. It is the essential tool of every empirical researcher in software engineering, but it is by no means limited to research. Relevant questions also arise in practice; the answers may not aim for universal generalizability, but they are of great importance for sustainable decision-making within a company.

What usable data is generated during the development process?

Development generates a wealth of data that can be used to provide well-founded answers to important questions in software development. This data is contained in the source code, version control systems, issue trackers, email conversations, chat logs, and other sources. Some of it can be extracted automatically. Other data must be gathered through surveys or observations, because much of the information resides in the minds of the developers or other people involved in the process.

How do you identify the relevant data for a given question?

The starting point of any investigation must be a clearly defined question. And before even attempting to answer it, the question itself should be examined critically. The test for this is: If I had an answer to this question, what could and would I do differently? If the question fails this test, it may be interesting, but it has no relevance whatsoever. The question to be answered, therefore, always starts with a concrete and relevant objective. For example, the objective could be to improve quality assurance. This then leads to several relevant questions, such as which quality assurance measures are actually applied, how and to what extent, what their effectiveness is, and how efficiently they are implemented.

A precise question is essential because only a precisely formulated question can be operationalized; that is, only then can a measurement method be specified that can answer the question. For example, if the question is about the scope of testing, we need a clear and calculable definition of "scope." The scope of testing can be calculated using various coverage metrics (such as instruction or branch coverage, and many others). Alternatively, it could be defined by the time testers spent on the test.

If the question is precisely operationalized, then a measurement procedure for answering it has essentially already been found. If the initial question regarding the scope of the test was operationalized as "degree of test coverage measured by instruction coverage," then instruction coverage is the measure to be collected.
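To make the operationalization concrete, here is a minimal sketch of how instruction coverage could be computed once the sets of executable and executed statements are known. The function name and the line numbers are hypothetical and chosen only for illustration; in practice a coverage tool would supply these sets.

```python
def statement_coverage(executed: set[int], executable: set[int]) -> float:
    """Fraction of executable statements run at least once by the tests."""
    if not executable:
        raise ValueError("no executable statements to cover")
    # Only statements that are actually executable count as covered.
    return len(executed & executable) / len(executable)


# Hypothetical data: line numbers of the executable statements in one module
executable_lines = {1, 2, 3, 5, 8, 9, 10, 12}
# Hypothetical data: line numbers reached while running the test suite
executed_lines = {1, 2, 3, 5, 8}

coverage = statement_coverage(executed_lines, executable_lines)
print(f"instruction coverage: {coverage:.1%}")
```

Branch coverage could be computed the same way by replacing statements with branch outcomes; which measure is appropriate depends on how the question was operationalized.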

This approach derives, from concrete and relevant goals, the questions that must be answered to achieve those goals, and from these questions the measures that quantify the answers. In the scientific and practical literature it is known as the Goal-Question-Metric method.
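The chain from goal to question to measure can be written down as a small data structure, which also makes it easy to see which questions still lack a measure. The following is only a sketch: the class names, the method name, and the example entries are assumptions made for illustration, loosely based on the quality assurance example above.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    metrics: list[str] = field(default_factory=list)


@dataclass
class Goal:
    text: str
    questions: list[Question] = field(default_factory=list)

    def unanswerable(self) -> list[Question]:
        """Questions to which no measure has been assigned yet."""
        return [q for q in self.questions if not q.metrics]


# Hypothetical goal with one operationalized and one still open question
goal = Goal(
    "Improve quality assurance",
    [
        Question(
            "To what extent is the system tested?",
            ["instruction coverage", "branch coverage"],
        ),
        Question("How effective are the quality assurance measures?"),
    ],
)

open_questions = goal.unanswerable()
for question in open_questions:
    print("no measure yet:", question.text)
```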

Since the aim is to collect data as economically as possible, and not all data necessary to answer a question is actually available or of sufficient quality, this process may need to be iterated several times until all answerable questions and practical measurement methods have been found.

How can the data be collected?

Data about the source code itself, and to some extent about its underlying architecture, can be obtained from code analysis tools such as the Axivion Bauhaus Suite. These tools can often collect measurements throughout the development history, not just for a single version. Such historical data is particularly valuable when trying to identify trends.
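As a rough illustration of how a trend can be read from such historical measurements, the following sketch fits a least-squares line through one metric value per version. The version numbers and metric values are invented for illustration and are not the output of any particular tool.

```python
def trend_slope(xs: list[float], ys: list[float]) -> float:
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical history: one measurement of some metric per released version
versions = [1, 2, 3, 4, 5]
metric_values = [10, 12, 15, 15, 18]

slope = trend_slope(versions, metric_values)
print(f"average change per version: {slope:+.2f}")
```

A positive slope indicates that the metric has grown on average from version to version. Whether that growth is a problem is a question of interpretation, not of the calculation.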

Most development tools that generate data offer interfaces and export options. If the data is in a fixed syntax, it can be easily processed further. However, content is often written in natural language. In these cases, methods from computational linguistics are sometimes necessary. Since these are generally simple texts and not literary works, there is usually a good chance of automatically analyzing them. Often, regular expressions are sufficient to search for specific keywords. Researchers are now also successfully applying more advanced text mining techniques. For example, automated sentiment analysis (also known as sentiment detection) is used to identify feelings expressed in texts, such as agreement or disagreement. Text mining is becoming increasingly powerful.
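A keyword search with regular expressions might look like the following sketch, which flags commit messages that appear to describe a bug fix. The keyword list and the sample messages are assumptions for illustration; a real project would need to adapt the pattern to its own conventions.

```python
import re

# Assumed keyword list. The \b word boundaries prevent matches inside
# longer words such as "prefix".
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|defect|error)\b", re.IGNORECASE)


def is_corrective(message: str) -> bool:
    """True if the message contains one of the assumed bug-fix keywords."""
    return FIX_PATTERN.search(message) is not None


# Hypothetical commit messages
commit_messages = [
    "Fix null pointer in CAN driver",
    "Add prefix handling for log output",
    "Refactor scheduler, no functional change",
    "Fixed bug #17 in the timer module",
]

corrective = [m for m in commit_messages if is_corrective(m)]
print(f"{len(corrective)} of {len(commit_messages)} commits look corrective")
```

Such a heuristic misclassifies messages that use other wording, which is one more reason to validate the extracted data before drawing conclusions from it.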

However, some data is not yet available electronically and must then be obtained in other ways, for example through interviews or questionnaires.

The data in electronic archives such as version control systems often only tells half the story. It can be erroneous or incomplete. Therefore, one should never blindly trust the recorded data. Instead, it should be checked for validity and discussed with the developers, whose knowledge often provides deeper insights and interpretations. For this reason, a purely quantitative analysis should always be accompanied by a qualitative analysis that seeks explanations for the recorded data and the (possibly only apparent) correlations found.

Which data mining techniques can be used to analyze the data?

Software analytics utilizes classical statistics as well as advanced data mining and visualization techniques to uncover patterns and relationships. Classical statistics offers various association coefficients, regression analyses, significance tests, and much more, which can be used to evaluate data. Various visualization techniques, such as box plots and distribution diagrams, are also employed. Popular advanced data mining and machine learning techniques include, for example, the discovery of association rules (known from retail as rules of the form "Those who bought product X also bought product Y") or the automatic learning of decision trees for data classification.
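Transferred to software data, an association rule could read "commits that changed file X also changed file Y". The sketch below computes the two standard measures for such a rule, support and confidence, over a handful of commits. The file names and commits are invented, and treating each commit as one transaction is an assumption made for this example.

```python
def support(transactions: list[set[str]], items: set[str]) -> float:
    """Share of transactions that contain all of the given items."""
    if not transactions:
        raise ValueError("no transactions given")
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(
    transactions: list[set[str]], antecedent: set[str], consequent: set[str]
) -> float:
    """Share of transactions containing the antecedent that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs")
    return support(transactions, antecedent | consequent) / base


# Hypothetical commits, each given as the set of files it changed
commits = [
    {"parser.c", "parser.h"},
    {"parser.c", "parser.h", "lexer.c"},
    {"parser.c"},
    {"lexer.c"},
]

rule_support = support(commits, {"parser.c", "parser.h"})
rule_confidence = confidence(commits, {"parser.c"}, {"parser.h"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here the rule "parser.c changed, so parser.h changed too" holds in two of the three commits that touch parser.c. Data mining tools search for all rules whose support and confidence exceed chosen thresholds, rather than evaluating a single rule by hand.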

What tools are available for data mining?

For statistical analysis and data mining, there are both free and commercial tools available. SPSS is one example of a commercial tool. R is a very powerful free tool. R is, strictly speaking, a programming language for solving statistical problems. However, R is also a large ecosystem of packages provided by a large community of R developers. These include packages for classic statistical analyses and tests, as well as modern data mining and machine learning algorithms, and even powerful visualization tools. R is definitely worth a closer look. In my daily research work, I have so far found a suitable package in R for every statistical question I have encountered.

How can data be meaningfully visualized?

The statistical tools mentioned, such as SPSS or R, offer a wide range of visualizations that can be used to present complex data in a clear and understandable way. A further development of these visualizations is the approach of visual analytics. Whereas visualizations have traditionally been used to display the results of an analysis, visual analytics prepares a visualization of the raw data so that the human observer can recognize patterns themselves. In this way, inference is shifted from an algorithm to the human. This approach leverages the particularly well-developed ability of human vision for pattern recognition.

What are the opportunities and risks of (mis)interpretation?

Software analytics is not simply blindly trusting numbers. Numbers represent a highly condensed, sometimes complex, understanding of relationships. They must be interpreted. Therefore, purely quantitative analysis should always be complemented by qualitative analysis that delves into the underlying phenomena expressed in numbers. Furthermore, ethical and data protection considerations must always be taken into account when collecting and analyzing personal data.

Conclusion

Both the knowledge of the correct methodology and the empirical results obtained with software analytics methods have increased rapidly in recent years. Analysis tools for data collection and evaluation are freely available. The time is ripe for companies to also embrace software analytics. The goal is to provide empirically sound answers to relevant practical questions within the context of their own business. It is sufficient to assign one or two people within the company who only need to dedicate a portion of their working time to software analytics. Much of the data acquisition and evaluation process can be automated. A basic understanding of statistics and a willingness to learn data mining techniques are all that is required. Programming skills are a significant advantage, as the evaluation can and should be largely programmed, i.e., automated, using statistical software such as R. This not only avoids tedious manual work but also ensures the reproducibility and traceability of the results.

The software analytics experts can then use their methodological knowledge to determine which data needs to be collected and how it can be meaningfully analyzed for relevant questions. Other people within the company can use these provided analytical tools and prepare the results for decision-makers. In this way, decision-makers can make decisions based on empirically sound data.

Further links and literature

R: https://www.r-project.org/
SPSS: https://www-01.ibm.com/software/de/analytics/spss/
Axivion Bauhaus Suite: https://www.axivion.com/de
Software Analytics: Perspectives on Data Science for Software Engineering. Edited by Tim Menzies, Laurie Williams, and Thomas Zimmermann. Morgan Kaufmann, 2016.
Data Mining: Practical Machine Learning Tools and Techniques. By Ian H. Witten, Eibe Frank, and Mark A. Hall. Morgan Kaufmann Series in Data Management Systems, 2011.

Published by

weissblau media
