Old and new tricks for the smallest and fastest code
Author: André Schmitz, Green Hills Software
Contribution – Embedded Software Engineering Congress 2018
Compiler optimization is nothing new and has been around since compilers first existed. Nevertheless, new compiler versions are released every year, generating code that is, for example, 5% smaller or 10% faster. How is this possible? Are the existing compilers somehow bad, or are the compiler vendors using non-standard tricks? Compiler optimization is inherently complex, and its success depends heavily on the source code being optimized. This article examines fundamental concepts of optimizing compilers for C and C++, presents examples of both well-established and very recent optimizations, and questions the purpose of these optimizations. It also highlights pitfalls in attempting to optimize code manually.
Why optimization?
Especially in embedded software projects, the goal is to generate the fastest and smallest possible code. The faster the code executes, the faster it can react to events, improving response time and usability. The faster a task is completed, the sooner it can switch to power-saving mode, which saves energy and, in battery-powered devices, extends battery life. Smaller code allows more functionality to be packed into the same device memory.
Sometimes software developers try to write pre-optimized source code. This can be advantageous, but in some cases it can also degrade the program's quality because the compiler's optimizer can no longer optimize this manually optimized code as effectively, which can ultimately result in worse optimization (see below).
Compiler optimizations can generally be divided into two categories: target-dependent optimization, which works at the instruction level, and target-independent optimization, which optimizes at the source code level. Within this category, some optimizations improve both the size and speed of the program, making it both smaller and faster. However, there are also opposing measures, meaning some optimizations reduce code size at the expense of speed, or vice versa. Let's look at some examples below.
Examples of compiler optimizations
Typical and well-known source-level optimizations include loop unrolling and function inlining. Both aim to increase program speed at the expense of code size. Another common optimization is common subexpression elimination (CSE), where multiple redundant calculations occurring in every program path are replaced by a single calculation (see example in Figure 1)., PDF).
A newer algorithm, more aggressive than CSE, is Partial Redundancy Elimination (PRE). With PRE, it's sufficient for the computation to be redundant in only one path, and it also handles loops. CSE and PRE also help with computations involving many redundant memory operations. For example,
arr[i] + arr[i+3]
through
ptr = arr + i; ptr[0] + ptr[3];
They can be replaced, which saves one or more instructions. Other well-known optimizations include "Dead Code Elimination" (DCE), where the compiler attempts to delete code components that are irrelevant to the result of a function, or "Constant Propagation".
Busy Code Motion (BCM) reduces program size by having the compiler attempt to combine similar instructions from multiple blocks into a single block. (see Figure 2), PDFThis idea is further developed in linker optimization, which is described in the next chapter. Target-dependent instruction scheduling optimization utilizes knowledge about instruction scheduling in the CPU pipeline. While one instruction is executing in the pipeline, the execution of the next instruction is already beginning. And if the next instruction has to wait for the result of the previous one, pipeline stalls occur. The compiler, within its capabilities, sorts the instructions to minimize pipeline stalls.
Another area of optimization deals with vectorization, where a distinction is generally made between manual and automatic vectorization. This area has seen significant developments in recent years, not least because many modern CPUs now include SIMD (Single Instruction Multiple Data) instructions that enable vectorization, such as Power Architecture AltiVec, ARM NEON, and Intel SSE. Manual vectorization can lead to substantial performance improvements, but it requires the developer to decide what can be parallelized and how, as well as the necessary steps. They must learn and apply new data types and compiler intrinsics, and also consider alignment and aliasing.
Automatic vectorization has the same goal and, at first glance, promises to relieve the developer of some tasks, but it is usually not as easy to use as one might hope. Similar to automatic parallelization of programs in general, a clean alias analysis must be performed, and this is very difficult in the C and C++ languages. Therefore, even with automatic vectorization, the developer should clearly indicate to the compiler, using pragmas or other keywords, where it is safe to vectorize.
Other areas for optimization, particularly in recent times, include register allocation, or, in the case of C++, the removal of unused virtual functions or the handling of C++ exceptions. Furthermore, more and more specialized CPU instructions are being supported directly by the compiler or through intrinsic methods. Intermodular optimization has enormous potential, but requires the compiler to process each source file twice. The first time, the module's structure is recognized and recorded; the second time, each module is actually compiled and optimized using the structural information from all other modules.
In some projects, speed is paramount and code size is irrelevant; in others, code size is the most important factor, regardless of how slow the code becomes. Most of the time, however, a middle ground is desired, and optimization settings are chosen that represent a compromise between speed and size. This middle ground isn't necessarily the best approach, because the program often spends a significant amount of CPU time on very little code, and a large portion of the code is executed infrequently. So, how do you best optimize such a situation? The optimal solution is profiler-based optimization, where the actual optimization is parameterized at runtime by the results of the profiling.
Profiling identifies which parts of the program are called how often and how much time is spent in each. This allows you to identify areas where a lot of time is spent and where speed optimization is therefore highly beneficial, and areas where little processing time is spent and where it may be more sensible to optimize for size, as the potential trade-off regarding speed is negligible. Figure 3 (see below) illustrates this. PDFFor example, it can be seen that most CPU time is spent on just two functions (minSpanningTreePrims and minSpanningTreeKruskals). These two functions can therefore be optimized for extreme speed, even if this makes them significantly larger. All others could probably benefit from optimization for size. Functions that are called very frequently, even if they only take a short time, would likely be good candidates for inlining.
Ultimately, the developer can use the profiling results to identify the hotspots of their program and adjust the optimization options in the build environment accordingly.
Link optimization
Few developers know that the linker can also optimize. Why is this useful? The linker has to process all modules and, as the last step in the build process, sees which functions and data are actually used. Therefore, it can simply omit any unused functions and data during the linking process. Of course, some caution is advised when using function pointers defined at runtime, but you can provide the linker with specific instructions on what it absolutely must not remove. Furthermore, the linker can see throughout the entire program whether certain instruction sequences occur in different locations, which can then potentially be replaced by a subroutine. This approach can also be called "outlining" (the opposite of inlining).
Typical pitfalls
Of course, there are also things you should avoid to allow the compiler to optimize as effectively as possible. Some developers think, "I can do better than what the compiler wants," or simply want to implement certain aspects of a program efficiently in assembly language. Generally speaking, hand-written assembly code can be significantly better than the instructions a compiler generates from a high-level language like C or C++. However, things become critical when the developer uses inline assembly, i.e., inserts individual instructions manually within the C code. These inline assembly instructions automatically act as optimization barriers for the compiler. All the tricks mentioned above, such as CSE, BCM, DCE, etc., cannot be applied across the boundary of the inserted assembly instruction. Attempting manual optimization can therefore backfire spectacularly.
Further problems can arise from incorrect assumptions about how a compiler stores data in memory and how data access may be optimized. Compilers and linkers are allowed to rearrange variables in memory as they see fit, as long as they are not declared as elements of a structure or attributes of a class. The compiler can place local variables on the stack or in registers, and in the best-case scenario, accesses to variables could be optimized away entirely (see DCE above). If one wants to prevent the removal of accesses, the use of the "volatile" type qualifier is recommended, as this tells the compiler that the underlying variable can be modified even outside the control of the compiled code.
Summary
As you can see, there are many different optimizations, each with very different goals and operating on very different source code or instruction sequences. Therefore, depending on the underlying code and hardware architecture, very different results can be expected from these optimizations.
References
[1] https://compileroptimizations.com/index.html
author
Andre Schmitz received his diploma in physics from the University of Bonn in 1997. He then developed control and simulation software for autonomous robots at the Fraunhofer Institute for Geosciences (FhG). From 2000 to 2005, Mr. Schmitz developed embedded software for UMTS communication systems. Since 2005, Mr. Schmitz has been responsible for providing technical customer support and conducting training courses at Green Hills Software. He has also been a regular speaker at various industry conferences.
Implementation – our training & coaching
Do you want to bring yourself up to date with the latest technology?
Then find out more here MircoConsult offers training courses/seminars/workshops and individual coaching on the topic of implementation/embedded and real-time software development.
Training & coaching on the other topics in our portfolio can be found here. here.
Implementation – Expertise
Valuable expertise in the field of implementation/embedded and real-time software development is available. here Available for you to download free of charge.
You can find expertise on other topics in our portfolio here. here.
