Finally, a sound way to evaluate software performance
Author: Daniel Penning, embeff GmbH
Contribution – Embedded Software Engineering Congress 2018
Software performance plays a crucial role in virtually every embedded project. Faster code leads to improved response times and higher system throughput. A specified task can thus potentially be accomplished with less computing power and a correspondingly smaller microcontroller. Energy consumption decreases, which allows longer operating times or a smaller battery, especially in battery-powered systems. These effects ultimately lead to more cost-effective hardware.
In the context of this positive chain of effects, it is surprising that many projects pay little attention to the relationship between individual software components and the resulting performance. Highly optimizing compilers and innovative processor instructions now offer enormous potential for writing high-performance code.
Nevertheless, many discussions in the embedded systems field are characterized by generalizations and prejudices. Figure 1 illustrates techniques that are often criticized in the name of performance.
| Technique | Untapped positive effects |
| Consistent encapsulation into new types & modules (abstraction) | Reusability, maintainability |
| Use of external libraries | Fewer errors through proven implementations, reduced time-to-market, higher performance |
| Modern C++ language features | Reusability, error detection during implementation, higher performance |
Figure 1: Techniques of modern software engineering
Rejecting these techniques on principle prevents innovation in embedded software engineering, which is essential for handling increasingly complex tasks.
Performance evaluation for embedded systems
Assessing the performance of embedded code proves difficult for several reasons:
- There are very different target architectures with significantly different runtime behavior.
- Profiling is often only possible with expensive tools and special hardware.
- Setting up a profiling-capable environment can be complex.
- The target hardware must have features for performance evaluation.
The following will demonstrate how a performance evaluation can be implemented using simple means.
Performance evaluation for ARM Cortex-M4
Microcontrollers based on the ARM Cortex-M4 core are offered by a wide variety of manufacturers and are versatile in their applications. The underlying ARMv7-M architecture [1] will therefore serve as the starting point for this discussion.
These processors can optionally include a "Data Watchpoint and Trace Unit" (DWT) [2]; whether it is present is decided by the chip manufacturer. The vast majority of models include it. The DWT provides registers for performance measurement.
| CMSIS Register | Description |
| DWT_CYCCNT | Cycle Count Register |
| DWT_CPICNT | CPI Count Register |
| DWT_EXCCNT | Exception Overhead Count Register |
| DWT_SLEEPCNT | Sleep Count Register |
| DWT_LSUCNT | LSU Count Register |
| DWT_FOLDCNT | Folded Instruction Count Register |
Figure 2: ARM DWT registers
The DWT_CYCCNT register can be used to measure the performance of individual code segments. This register counts cycles with clock precision. It therefore represents the most accurate unit that can theoretically be measured on a processor. Due to the fixed clock frequency typical of embedded MCUs, the absolute time can be calculated from a number of cycles if needed.
In pseudocode, a measurement therefore looks like this:
| preCycleCount = DWT->CYCCNT
CodeUnderTest( ); // Measure runtime
postCycleCount = DWT->CYCCNT
cyclesUsed = postCycleCount - preCycleCount |
Figure 3: Pseudocode for runtime measurement
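The subtraction in Figure 3 can be sketched in C as follows. DWT_CYCCNT is a free-running 32-bit counter, so it eventually wraps around; the helper name `cycles_elapsed` is an assumption made for this illustration and is not part of CMSIS.

```c
#include <stdint.h>

/* Difference between two readings of a free-running 32-bit cycle counter.
 * Unsigned arithmetic is performed modulo 2^32, so the result stays
 * correct even if the counter wrapped once between the two readings. */
static inline uint32_t cycles_elapsed(uint32_t pre_count, uint32_t post_count)
{
    return (uint32_t)(post_count - pre_count);
}
```

If the code under test runs longer than 2^32 cycles, the counter wraps more than once and the difference is no longer meaningful.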
One could read the `cyclesUsed` variable in the debugger at runtime and achieve the desired result. However, there are two problems with this approach:
- With optimization enabled, the compiler may reorder read accesses to the DWT_CYCCNT register.
- At the assembly level, the load/store instructions from the DWT_CYCCNT register into an internal processor register distort the measurement results.
A better approach is therefore to use a special HALT instruction that halts the processor immediately before and after executing the code under test, without side effects. In ARMv7-M, this is achieved using the BKPT instruction. While the processor is halted, the DWT_CYCCNT register can be read by a debug probe via the SWD interface [3]. The pseudocode is thus reduced to:
|
BKPT //< Read external CYCCNT
CodeUnderTest( )
BKPT //< Read external CYCCNT |
Figure 4: Improved pseudocode for runtime measurement
This variant allows cycle-accurate runtimes to be determined for any program segment. At a clock frequency of 100 MHz, the temporal resolution is, for example, a remarkable 10 ns.
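The conversion from cycles to absolute time mentioned above is a single division. A minimal sketch, assuming a fixed and known core clock; the function name `cycles_to_ns` is hypothetical:

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a fixed core clock frequency.
 * The multiplication is done in 64 bits because cycles * 10^9 does not
 * fit into 32 bits. core_clock_hz must be non-zero. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_clock_hz)
{
    return ((uint64_t)cycles * 1000000000ull) / core_clock_hz;
}
```

With a 100 MHz clock, `cycles_to_ns(1, 100000000u)` yields 10, which matches the 10 ns resolution stated above.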
Compiler optimizations and performance measurements
With optimization enabled, the compiler applies measures to improve code performance. One of the most effective techniques is replacing calls to short functions with the function body itself. This is called inlining. Furthermore, the compiler will attempt to compute as many values as possible at compile time.
It can easily happen that the compiler completely optimizes out a function call that is supposed to be measured.
Forgoing optimization altogether is not a solution, as these measures contribute significantly to overall performance. There are various ways to selectively disable such optimizations locally. The documentation of the Google Benchmark Library [4] provides interesting possibilities in this regard.
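One way to disable such optimizations locally is an empty inline-assembly barrier, in the spirit of the helpers described in [4]. The following is a sketch under assumptions, not the library's implementation: it relies on the GCC/Clang extended `asm` syntax, and the macro names `HIDE_VALUE` and `KEEP_VALUE` as well as the wrapper `run_fut_guarded` are made up for this example.

```c
/* GCC/Clang only. Both macros emit no instructions.
 * HIDE_VALUE: the compiler must assume the variable was modified,
 *             so it can no longer treat it as a known constant.
 * KEEP_VALUE: the compiler must assume the value is read and that memory
 *             may change, so the computation producing it cannot be dropped. */
#define HIDE_VALUE(var) __asm__ volatile("" : "+r"(var))
#define KEEP_VALUE(var) __asm__ volatile("" : : "r"(var) : "memory")

/* Function under test, same body as in the article's example. */
static int fut(int input) { return input * 3.14159265359f; }

/* Hypothetical wrapper: the multiplication is neither precomputed at
 * compile time nor removed as unused. */
int run_fut_guarded(int input)
{
    HIDE_VALUE(input);
    int result = fut(input);
    KEEP_VALUE(result);
    return result;
}
```

The barriers do not prevent inlining of `fut`; they only ensure that the computation itself remains in the generated code.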
Example: FPU vs. Soft-FPU
A simple example will demonstrate how the approach presented above can be used to evaluate performance at a very granular level. This will involve measuring the runtime of a single function.
The function under test (FUT) simply multiplies an integer input value by the number pi in single-precision floating point.
| int fut(int input) {
    return input * 3.14159265359f;
} |
Figure 5: Function whose runtime is to be measured
The examples were built with the arm-none-eabi-gcc toolchain (version 7-2017-q4-major) with optimization enabled (-O2) and run on an STM32F4.
The microcontroller used has a built-in FPU for floating-point numbers. If this is disabled via a compiler option, the multiplication must be simulated in software.
| With FPU | Without FPU (Soft-FPU) |
| fut(int):                      # cycles
    vmov s15, r0 @ int           # 1
    vldr.32 s14, .L3             # 2
    vcvt.f32.s32 s15, s15        # 1
    vmul.f32 s15, s15, s14       # 1
    vcvt.s32.f32 s15, s15        # 1
    vmov r0, s15 @ int           # 1
    bx lr                        # 2-4
.L3:
    .word 1078530011
|
fut(int):
    push {r3, lr}
    bl __aeabi_i2f
    ldr r1, .L4
    bl __aeabi_fmul
    bl __aeabi_f2iz
    pop {r3, pc}
.L4:
    .word 1078530011
|
Figure 6: Assembly Listing for Function to be Measured
Figure 6 shows the assembly listing for both variants. It is already apparent that the soft FPU variant contains jumps to the FPU emulations. These functions are typically only delivered as compiled object code by the toolchain. Therefore, their implementation is unknown with proprietary compilers and can only be reverse-engineered from the assembly. In particular, it is unclear whether these functions have a constant runtime.
In contrast, the FPU variant requires no jumps – all operations can be directly handled by instructions. The required cycles per instruction were taken from the Reference Manual [5] and noted after the assembly. Only the branch instruction has non-deterministic cycles (2-4), as the number of cycles required for a pipeline refill varies depending on the alignment.
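The `.word 1078530011` entry in both listings is the bit pattern of the single-precision constant from the source code. A small host-side sketch can confirm this, assuming the host uses IEEE 754 single precision for `float`; the helper name `word_to_float` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit word as a single-precision float.
 * memcpy avoids the undefined behavior of pointer-based type punning. */
static float word_to_float(uint32_t word)
{
    float value;
    memcpy(&value, &word, sizeof value);
    return value;
}
```

`word_to_float(1078530011u)` compares equal to the literal `3.14159265359f`, because that literal rounds to the nearest representable single-precision value, 0x40490FDB.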
In fact, the runtime measurements (Figure 7, see PDF) show that the FPU variant has a constant runtime, while the runtime of the emulated variant varies. The 15 cycles of the FPU variant result from the predicted 9-11 cycles plus another 2-4 cycles required for the jump to the function itself. The branch instruction thus requires 4 cycles in this case.
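The cycle prediction can be reproduced by summing the per-instruction annotations from Figure 6 and adding the 2-4 cycles for the call. The sketch below does only this arithmetic; the type and function names are assumptions made for illustration.

```c
#include <stddef.h>

struct cycle_range { unsigned min; unsigned max; };

/* Cycle annotations of the FPU variant, taken from Figure 6. */
static const struct cycle_range fpu_listing[] = {
    {1, 1}, /* vmov s15, r0           */
    {2, 2}, /* vldr.32 s14, .L3       */
    {1, 1}, /* vcvt.f32.s32 s15, s15  */
    {1, 1}, /* vmul.f32 s15, s15, s14 */
    {1, 1}, /* vcvt.s32.f32 s15, s15  */
    {1, 1}, /* vmov r0, s15           */
    {2, 4}, /* bx lr                  */
};

/* Add up the lower and upper bounds of all entries. */
static struct cycle_range sum_ranges(const struct cycle_range *ranges, size_t count)
{
    struct cycle_range total = {0u, 0u};
    for (size_t i = 0; i < count; ++i) {
        total.min += ranges[i].min;
        total.max += ranges[i].max;
    }
    return total;
}

/* Function body plus the branch into the function (2-4 cycles). */
static struct cycle_range fut_call_budget(void)
{
    struct cycle_range body =
        sum_ranges(fpu_listing, sizeof fpu_listing / sizeof fpu_listing[0]);
    struct cycle_range total = { body.min + 2u, body.max + 4u };
    return total;
}
```

The body sums to 9-11 cycles and the total budget to 11-15 cycles, so a measured value of 15 lies at the upper end of the predicted range.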
At critical points in the program, it is important to be aware of runtime-variable program components. In these cases, the longest possible path must be selected to determine the Worst-Case Execution Time (WCET). A single measurement here could easily have led to seemingly plausible but incorrect conclusions.
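For runtime-variable code, a series of measurements can be reduced to its largest value. This is a minimal sketch; the function name `worst_observed_cycles` is hypothetical, and the input array stands for cycle counts collected with the method described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest value in a series of measured cycle counts.
 * Returns 0 for an empty series. The result is only the worst case that
 * was observed; the true WCET can be higher if the longest path was
 * never exercised by the chosen inputs. */
static uint32_t worst_observed_cycles(const uint32_t *samples, size_t count)
{
    uint32_t worst = 0u;
    for (size_t i = 0; i < count; ++i) {
        if (samples[i] > worst) {
            worst = samples[i];
        }
    }
    return worst;
}
```

The inputs used for the measurement series therefore need to be chosen so that the longest path is actually executed.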
Summary
The presented methodology is suitable for subjecting functions and code fragments to precise performance measurements. The high temporal resolution allows for investigations in all areas of application, especially critical interrupt service routines and control loops.
Such a methodology provides the necessary basis for evaluating the software techniques listed in Figure 1 on a case-by-case basis. Cycle-accurate results enable a well-founded assessment of the applicability of languages, libraries, and design features. If compromises become necessary, decisions can be made based on real-world data.
Note: The author operates a free web platform for convenient performance evaluation of small code fragments [6].
Sources
[2] Data Watchpoint and Trace Unit
[5] Cortex-M4 Reference Manual
[6] Online platform for MCU performance measurements
Author
Daniel Penning studied electrical engineering in Karlsruhe and is the managing director of embeff GmbH. He has over 10 years of experience in various areas of software development. He now focuses exclusively on the specific requirements of embedded systems. The efficiency of products and development processes is of particular importance to him.
Contact: daniel.penning@embeff.com
