What will become of my code?

Finally, a sound way to evaluate software performance

Author: Daniel Penning, embeff GmbH

Contribution – Embedded Software Engineering Congress 2018

Software performance plays a crucial role in virtually every embedded project. Faster code improves response times and increases system throughput. A given task can then potentially be accomplished with less computing power and a correspondingly smaller microcontroller. Energy consumption decreases, which in battery-powered systems means longer operating times or a smaller battery. These effects ultimately lead to more cost-effective hardware.

In the context of this positive chain of effects, it is surprising that many projects pay little attention to the relationship between individual software components and the resulting performance. Highly optimizing compilers and innovative processor instructions now offer enormous potential for writing high-performance code.

Nevertheless, many discussions in the embedded systems field are characterized by generalizations and prejudices. Figure 1 illustrates techniques that are often criticized in the name of performance.

Technique	Untapped positive effects
Consistent encapsulation into new types & modules (abstraction)	Reusability, maintainability
Use of external libraries	Fewer errors through proven implementations, reduced time-to-market, higher performance
Modern C++ language features	Reusability, error detection during implementation, higher performance

Figure 1: Techniques of modern software engineering

Rejecting these techniques on principle holds back innovation in embedded software engineering, and that innovation is essential for increasingly complex tasks.

Performance evaluation for embedded systems

Assessing the performance of embedded code proves difficult for several reasons:

There are very different target architectures with significantly different runtime behavior.
Profiling is often only possible with expensive tools and special hardware.
Setting up a profiling-capable environment can be complex.
The target hardware must have features for performance evaluation.

The following will demonstrate how a performance evaluation can be implemented using simple means.

Performance evaluation for ARM Cortex-M4

The ARM Cortex-M4 core is licensed by a wide variety of manufacturers, and microcontrollers based on it are used in a broad range of applications. The underlying ARMv7-M architecture [1] will therefore serve as the starting point for this discussion.

The manufacturer can optionally equip these processors with a "Data Watchpoint and Trace Unit" (DWT) [2]. This is the case in the vast majority of models. The DWT provides performance registers that can be read out.

CMSIS Register	Description
DWT_CYCCNT	Cycle Count Register
DWT_CPICNT	CPI Count Register
DWT_EXCCNT	Exception Overhead Count Register
DWT_SLEEPCNT	Sleep Count Register
DWT_LSUCNT	LSU Count Register
DWT_FOLDCNT	Folded-instruction Count Register

Figure 2: ARM DWT Register

The DWT_CYCCNT register can be used to measure the performance of individual code segments. This register increments once per core clock cycle, which is the finest granularity that can theoretically be measured on a processor. Because embedded MCUs typically run at a fixed clock frequency, the absolute time can be calculated from the number of cycles if needed.

In pseudocode, a measurement therefore looks like this:

preCycleCount = DWT->CYCCNT
CodeUnderTest( ); // Measure runtime
postCycleCount = DWT->CYCCNT
cyclesUsed = postCycleCount - preCycleCount

Figure 3: Pseudocode for runtime measurement
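In C, the pseudocode from Figure 3 could be sketched roughly as follows. This is a minimal sketch and not the author's implementation: the type cycle_counter_t and the helper names are hypothetical, and the registers are passed in as pointers so that the same code can also be exercised on a host PC. The register addresses and bit positions are those defined by the ARMv7-M architecture. The sketch also enables the counter first, since DWT_CYCCNT only counts once the TRCENA bit in DEMCR and the CYCCNTENA bit in DWT_CTRL are set.

```c
#include <stdint.h>

/* Register addresses and bits as defined by the ARMv7-M architecture. */
#define DEMCR_ADDR          0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR       0xE0001000u  /* DWT Control Register */
#define DWT_CYCCNT_ADDR     0xE0001004u  /* DWT Cycle Count Register */
#define DEMCR_TRCENA        (1u << 24)   /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)    /* enables the cycle counter */

/* Hypothetical helper type: the three registers involved in a measurement. */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *ctrl;
    volatile uint32_t *cyccnt;
} cycle_counter_t;

/* Switch on the DWT unit, reset the counter, then start counting. */
static inline void cycle_counter_enable(const cycle_counter_t *cc)
{
    *cc->demcr |= DEMCR_TRCENA;
    *cc->cyccnt = 0u;
    *cc->ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Volatile read, so the access itself is never optimized away. */
static inline uint32_t cycle_counter_read(const cycle_counter_t *cc)
{
    return *cc->cyccnt;
}

/* Unsigned subtraction stays correct across a single 32-bit wrap-around. */
static inline uint32_t cycles_elapsed(uint32_t pre, uint32_t post)
{
    return (uint32_t)(post - pre);
}
```

On the target, a cycle_counter_t would be filled with the three addresses above cast to volatile uint32_t pointers. With the CMSIS headers available, DWT->CTRL and DWT->CYCCNT can be used directly instead. Note that volatile only fixes the order of the register accesses relative to each other; the compiler may still move ordinary code across them.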

One could read the `cyclesUsed` variable in the debugger at runtime and achieve the desired result. However, there are two problems with this approach:

With optimization enabled, the compiler may reorder read accesses to the DWT_CYCCNT register.
At the assembly level, the load/store instructions that copy the DWT_CYCCNT register into an internal processor register consume cycles themselves and thus distort the measurement results.

A better approach is therefore to use a dedicated halt instruction that stops the processor, without side effects, immediately before and after the code to be measured. On ARMv7-M, this is achieved using the BKPT instruction. While the processor is halted, the DWT_CYCCNT register can be read by a debug probe over the SWD interface [3]. The pseudocode is thus reduced to:

BKPT //< Debug probe reads CYCCNT externally
CodeUnderTest( )
BKPT //< Debug probe reads CYCCNT externally

Figure 4: Improved pseudocode for runtime measurement
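With GCC-style inline assembly, Figure 4 might be written in C along the following lines. This is a sketch under assumptions: MEASURE_ON_TARGET is a hypothetical build flag, MEASURE_HALT and run_between_halts are illustrative names, and code_under_test is a placeholder for the function being measured. Without the flag the macro expands to nothing, so the file also builds on a host. On a Cortex-M target, a BKPT executed without an attached debugger typically escalates to a fault, which is another reason for the guard.

```c
/* MEASURE_ON_TARGET is a hypothetical build flag: define it only for
 * builds that run on the Cortex-M target with a debug probe attached. */
#if defined(MEASURE_ON_TARGET)
#define MEASURE_HALT() __asm__ volatile("bkpt #0")
#else
#define MEASURE_HALT() ((void)0)  /* no-op on the host */
#endif

/* Placeholder for the code to be measured. */
int code_under_test(int x)
{
    return x * 3;
}

typedef int (*code_under_test_fn)(int);

/* Halts before and after the call. At each halt the debug probe can read
 * DWT_CYCCNT; the difference of the two readings is the runtime in cycles. */
static inline int run_between_halts(code_under_test_fn fn, int arg)
{
    MEASURE_HALT();
    int result = fn(arg);
    MEASURE_HALT();
    return result;  /* returning the result keeps the call from being dropped */
}
```

The indirect call through a function pointer adds a few cycles of its own. For very short code sequences, the two MEASURE_HALT() lines can instead be placed directly around the statements of interest.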

This variant allows cycle-accurate runtimes to be determined for any program segment. At a clock frequency of 100 MHz, for example, this corresponds to a temporal resolution of 10 ns.
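Converting a cycle count into absolute time is a single division by the clock frequency. A minimal sketch, in which the function name cycles_to_ns is an assumption chosen for illustration:

```c
#include <stdint.h>

/* Converts a cycle count to nanoseconds for a fixed core clock.
 * The product is computed in 64 bits so that it cannot overflow;
 * the result is truncated to whole nanoseconds.
 * core_clock_hz must be non-zero. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_clock_hz)
{
    return ((uint64_t)cycles * 1000000000u) / core_clock_hz;
}
```

At 100 MHz, cycles_to_ns(1, 100000000u) yields 10, which matches the 10 ns resolution mentioned above.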

Compiler optimizations and performance measurements

With optimization enabled, the compiler applies measures to improve code performance. One of the most effective techniques is replacing calls to short functions with the function body itself. This is called inlining. Furthermore, the compiler will attempt to evaluate as many values as possible at compile time.

It can easily happen that the compiler completely optimizes out a function call that is supposed to be measured.

Disabling optimization altogether is not a solution, as these measures contribute significantly to overall performance. There are various ways to selectively disable such optimizations locally. The documentation of the Google Benchmark Library [4] describes useful options in this regard.
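One such local measure is an empty inline assembly statement that the compiler has to treat as opaque. The sketch below follows the same idea as the DoNotOptimize helper in Google Benchmark, but it is not that library's API: opaque_input and consume_result are hypothetical names, and the syntax is a GCC/Clang extension rather than standard C.

```c
/* Makes a value opaque to the optimizer. The compiler must assume the
 * empty asm statement may have changed it, so it cannot constant-fold
 * a computation that depends on the returned value. */
static inline int opaque_input(int value)
{
    __asm__ volatile("" : "+r"(value));
    return value;
}

/* Tells the compiler that a result is used, so the computation that
 * produced it cannot be removed as dead code. The "memory" clobber also
 * keeps memory accesses from being moved across this point. */
static inline void consume_result(int value)
{
    __asm__ volatile("" : : "r"(value) : "memory");
}
```

A measured call would then look like consume_result(f(opaque_input(42))) for some function f under test. The call can still be inlined, but it can no longer be evaluated at compile time or dropped entirely.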

Example: FPU vs. Soft-FPU

A simple example will demonstrate how the approach presented above can be used to evaluate performance at a very granular level. This will involve measuring the runtime of a single function.

The function under test (FUT) simply multiplies an integer input value by the number pi in single-precision floating-point arithmetic and returns the result as an integer.

int fut(int input) {
    return input * 3.14159265359f;
}

Figure 5: Function whose runtime is to be measured

The examples were compiled with the arm-none-eabi-gcc toolchain (version 7-2017-q4-major) with optimization enabled (-O2) and run on an STM32F4.

The microcontroller used has a built-in FPU for floating-point numbers. If the FPU is disabled via a compiler option, the multiplication must be emulated in software.

With FPU

fut(int):                            # cycles
    vmov          s15, r0 @ int      # 1
    vldr.32       s14, .L3           # 2
    vcvt.f32.s32  s15, s15           # 1
    vmul.f32      s15, s15, s14      # 1
    vcvt.s32.f32  s15, s15           # 1
    vmov          r0, s15 @ int      # 1
    bx            lr                 # 2-4
.L3:
    .word         1078530011

Without FPU (Soft-FPU)

fut(int):
    push          {r3, lr}
    bl            __aeabi_i2f
    ldr           r1, .L4
    bl            __aeabi_fmul
    bl            __aeabi_f2iz
    pop           {r3, pc}
.L4:
    .word         1078530011

Figure 6: Assembly Listing for Function to be Measured

Figure 6 shows the assembly listing for both variants. It is already apparent that the soft-FPU variant contains calls (bl) to floating-point emulation routines. The toolchain typically ships these routines only as compiled object code. With proprietary compilers, their implementation is therefore unknown and can only be reverse-engineered from the disassembly. In particular, it is unclear whether these functions have a constant runtime.

In contrast, the FPU variant requires no calls; all operations are handled directly by FPU instructions. The required cycles per instruction were taken from the Reference Manual [5] and are noted next to each instruction. Only the branch instruction has a variable cycle count (2-4), as the number of cycles required for the pipeline refill depends on the alignment.
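The predicted runtime of the FPU variant is simply the sum of these per-instruction figures. A small sketch of that arithmetic, in which the array and function names are chosen purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Cycles of the FPU listing in Figure 6, excluding the final "bx lr":
 * vmov, vldr.32, vcvt.f32.s32, vmul.f32, vcvt.s32.f32, vmov */
static const uint8_t fpu_body_cycles[] = { 1, 2, 1, 1, 1, 1 };

/* The branch needs 2 to 4 cycles depending on the pipeline refill. */
enum { BRANCH_MIN_CYCLES = 2, BRANCH_MAX_CYCLES = 4 };

/* Adds up a list of per-instruction cycle counts. */
static inline uint32_t sum_cycles(const uint8_t *cycles, size_t count)
{
    uint32_t total = 0u;
    for (size_t i = 0; i < count; ++i) {
        total += cycles[i];
    }
    return total;
}
```

The six instructions before the branch add up to 7 cycles. Together with the 2 to 4 cycles of the branch, this gives a predicted 9 to 11 cycles for the function body.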

In fact, the runtime measurements (Figure 7, see PDF) show that the FPU variant has a constant runtime, while the runtime of the emulated variant varies. The 15 cycles result from the predicted 9-11 cycles plus another 2-4 cycles required for the jump to the function itself. Since 15 is the upper end of that combined range, each branch instruction requires 4 cycles in this case.

At critical points in the program, it is important to be aware of runtime-variable program components. In these cases, the longest possible path must be selected to determine the Worst-Case Execution Time (WCET). A single measurement here could easily have led to seemingly plausible but incorrect conclusions.
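When a code segment is measured repeatedly with different inputs, the samples can be reduced to their worst case and checked for variability. A minimal sketch with hypothetical helper names; the samples are assumed to be cycle counts obtained as described above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest observed cycle count. Returns 0 for an empty sample set. */
static inline uint32_t worst_case_cycles(const uint32_t *samples, size_t count)
{
    uint32_t worst = 0u;
    for (size_t i = 0; i < count; ++i) {
        if (samples[i] > worst) {
            worst = samples[i];
        }
    }
    return worst;
}

/* True if every sample has the same value, i.e. no variation was observed. */
static inline bool runtime_is_constant(const uint32_t *samples, size_t count)
{
    for (size_t i = 1; i < count; ++i) {
        if (samples[i] != samples[0]) {
            return false;
        }
    }
    return true;
}
```

The observed maximum is only a lower bound for the true WCET: it is reliable only if the inputs used actually exercise the longest path.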

Summary

The presented methodology is suitable for subjecting functions and code fragments to precise performance measurements. The high temporal resolution allows for investigations in all areas of application, especially critical interrupt service routines and control loops.

Such a methodology provides the necessary basis for evaluating the software techniques listed in Figure 1 on a case-by-case basis. Cycle-accurate results enable a well-founded assessment of the applicability of languages, libraries, and design features. If compromises become necessary, decisions can be made based on real-world data.

Note: The author operates a free web platform for convenient performance evaluation of small code fragments [6].

Sources

[1] ARMv7-M Reference Manual

[2] Data Watchpoint and Trace Unit

[3] ARM Serial Wire Debug

[4] Google Benchmark Library

[5] Cortex-M4 Reference Manual

[6] Online platform for MCU performance measurements

Author

Daniel Penning studied electrical engineering in Karlsruhe and is the managing director of embeff GmbH. He has over 10 years of experience in various areas of software development. He now focuses exclusively on the specific requirements of embedded systems. The efficiency of products and development processes is of particular importance to him.

Contact: daniel.penning@embeff.com

Download the article as a PDF file
