More performance at no cost?
Author: Dr. Andreas Ehmanns, MBDA Deutschland GmbH
Contribution – Embedded Software Engineering Congress 2018
Vector units have been standard in common processor families for many years, yet they are often overlooked by software developers, even in the embedded sector. Especially with ARM processors, development in this area has progressed significantly in recent years, opening up new performance possibilities for (embedded) developers. Making use of these very powerful units and deciding when their use offers tangible advantages often seems much more difficult than it actually is.
When software developers hear the term "vector unit," they often think it has something to do with supercomputing and are unaware that almost every PC, most tablets and smartphones, and many embedded devices also contain SIMD units. SIMD (Single Instruction, Multiple Data) refers to the execution of a single instruction on multiple data elements. Different processor families use different names for the SIMD units used. Well-known examples of this group are:
- Intel x86: MMX, SSE/AVX
- AMD x86: 3DNow!
- Power family: Altivec
- ARM: NEON
History
Although the first versions of these units existed as early as the late 1990s, they were slow to gain software support. Programs written in the "normal" way do not use the SIMD unit; it has to be programmed explicitly. The situation improved only with the introduction of "autovectorization," where the compiler attempts to recognize computations that can be executed in parallel and maps them onto the vector unit. However, even current compilers (as of autumn 2018) still do not achieve the performance possible with direct programming of the SIMD unit. The potential for performance gains through the SIMD unit was first recognized in the fields of graphics and multimedia, as these often involve performing identical operations on many data elements (e.g., pixels). This may also explain why some software developers still associate SIMD units solely with graphics and multimedia, even today, some 20 years later. Apart from special data types (such as "pixels" in Altivec), these units are not limited to a specific use case, but can be used for all kinds of calculations, provided the calculation task is parallelizable.
Even though the use of SIMD units is much more widespread today, some software developers still cling to the persistent rumor that programming these vector units is very complicated and requires being a hardware and/or assembly guru. Compilers such as gcc offer built-in functions (intrinsics) for programming the vector unit. They are enabled simply via compiler options, for example -maltivec -mabi=altivec for PowerPC or -mavx2 for x86 AVX2. The intrinsic functions are a kind of C API and, in most cases, map directly to the corresponding assembly instructions. Extensive documentation on the individual functions, parameters, and variations can be found online.
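As a minimal sketch of what such intrinsics look like, the following adds four 32-bit integers at once with the x86 SSE2 intrinsic _mm_add_epi32. The helper name add4_i32 is hypothetical, and the scalar branch is only there so that the file still compiles on targets without SSE2. On PowerPC or ARM the header and the intrinsic names are different.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128i, _mm_add_epi32, ... */
#endif

/* add4_i32 (hypothetical helper): out[i] = a[i] + b[i] for i = 0..3 */
static void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
#if defined(__SSE2__)
    /* Unaligned loads: copy four integers each into a 128-bit register. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* One intrinsic call adds all four 32-bit lanes. */
    __m128i vr = _mm_add_epi32(va, vb);

    /* Copy the four results back into a plain array. */
    _mm_storeu_si128((__m128i *)out, vr);
#else
    for (size_t i = 0; i < 4; ++i)   /* scalar fallback without SSE2 */
        out[i] = a[i] + b[i];
#endif
}
```

On x86-64, SSE2 is part of the base instruction set, so no extra compiler option is needed for this particular example.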
Requirements
Before using the SIMD unit, the first question is whether the code to be optimized is parallelizable, or whether individual computational steps are interdependent. Classic examples of easily parallelizable computational tasks are loops, where the calculation of each iteration does not depend on the results of the previous one. Examples include vector addition, vector multiplication, matrix operations, and many more.
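The difference can be illustrated with two small loops. This is a sketch with hypothetical function names: the first loop has independent iterations and is a natural SIMD candidate, while the second carries a dependency from one iteration to the next and cannot simply be spread across vector lanes as written.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on a[i] and b[i],
   so several elements can be computed at the same time. */
void vec_add(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Loop-carried dependency: x[i] needs the value of x[i - 1] that was
   produced by the previous iteration (running sum). */
void running_sum(float *x, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        x[i] += x[i - 1];
}
```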
Furthermore, it must first be clarified whether and to what extent the hardware supports SIMD. Especially with ARM, it's quite possible that the SIMD unit is completely absent. Conversely, with x86, the sheer number of extensions that Intel has regularly introduced is vast and can be overwhelming at first glance. These range from MMX through SSE, SSE2, and so on, to AVX, AVX2, and AVX-512.
Depending on the extension, certain instructions may or may not be available. More detailed information can be found, for example, in the reference documents online.
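One simple way to see which extension a build targets is to query the macros that gcc and compatible compilers predefine. The following is a sketch; the function name simd_target is hypothetical. Note that these macros reflect the compiler options in effect (such as -mavx2), not what the CPU at hand can actually execute, so the hardware data sheet still has to be consulted.

```c
/* simd_target (hypothetical helper): name of the widest SIMD extension
   the compiler was allowed to use for this translation unit. */
const char *simd_target(void)
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__ARM_NEON)
    return "NEON";
#elif defined(__ALTIVEC__)
    return "AltiVec";
#else
    return "none";   /* no SIMD extension enabled for this build */
#endif
}
```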
Implementation
If the software task is parallelizable and hardware support is available, then nothing stands in the way of using the SIMD unit. Current compilers readily support the common SIMD units and their variants.
A fundamental aspect of programming SIMD units is the data format. Most current units have a width of 128 bits (exceptions: AVX and AVX2 with 256 bits, AVX-512 with 512 bits), meaning a 128-bit unit can hold, for example, four 32-bit floats or four 32-bit integers. Specific data types exist for this, which means that data in a non-vector format must first be copied into these types and, after processing, possibly converted back. Special functions exist to handle this task elegantly.
Example: In ARM NEON, the type for a vector of four 32-bit floats is called float32x4_t.
Caution: There are many examples online that use a union to access data from the vector unit and avoid the copying process. However, this requires that the original data in memory is aligned to 16 bytes, exactly the width of the vector unit. While this is usually the case, the compiler doesn't always guarantee it. If the data is not aligned, there's no error message or exception; instead, the lower four bits of the address (in Altivec) are simply ignored, and the wrong data is processed. To be on the safe side, use the available conversion functions.
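The principle of copying instead of reinterpreting memory can be sketched as follows. As an assumption for illustration, the example uses the generic vector extension of gcc and Clang (vector_size attribute) in place of a family-specific type such as float32x4_t, and memcpy in place of the family-specific load and store intrinsics. The names v4f, load4, store4, and scale4 are hypothetical.

```c
#include <string.h>

/* Four 32-bit floats in one 128-bit vector (gcc/Clang vector extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Copy four floats from any address into a vector value. memcpy makes no
   alignment assumption, unlike a pointer cast or a union overlay. */
static inline v4f load4(const float *p)
{
    v4f v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Copy a vector value back into a plain float array. */
static inline void store4(float *p, v4f v)
{
    memcpy(p, &v, sizeof v);
}

/* out[i] = in[i] * factor for i = 0..3, for arbitrarily aligned pointers. */
void scale4(const float *in, float *out, float factor)
{
    v4f k = { factor, factor, factor, factor };
    store4(out, load4(in) * k);   /* element-wise multiply of all four lanes */
}
```

Compilers typically turn such fixed-size memcpy calls into unaligned vector loads and stores, so the safety usually costs little; whether that holds for a given target should be checked in the generated code.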
Once the data is in the correct format, programming can begin. It's advisable to study the reference manuals to see which functions are available. These include mathematical functions (essentially the basic arithmetic operations), as well as copy functions, bit manipulation functions, and several others. Anyone needing more than basic arithmetic (e.g., a sine or exponential function) will quickly reach the limits of the available options. However, there are many libraries online that provide precisely these missing functions, so it's rare that a function needs to be implemented from scratch. If that does become necessary, it's recommended to take the implementation in the math library as a template and recreate the required function in the desired SIMD syntax.
If a function (or part of it) that was previously implemented using traditional methods has been converted to use the SIMD unit, the first step should be to verify that the function works correctly. It is recommended to generate reference input and output data and use this to check the correctness of the implementation. It is important to ensure that all branches in the program are executed to avoid overlooking errors in untested branches.
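Such a check can be as simple as comparing the output of the SIMD version against the output of the original version for the same reference input. The following is a sketch with a hypothetical helper name. A tolerance is used as an assumption, since floating-point results of the two versions may differ slightly in the last bits.

```c
#include <stddef.h>

/* first_mismatch (hypothetical helper): index of the first element where
   the two results differ by more than tol, or n if they agree everywhere. */
size_t first_mismatch(const float *expected, const float *actual,
                      size_t n, float tol)
{
    for (size_t i = 0; i < n; ++i) {
        float d = expected[i] - actual[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol))   /* written this way so that NaN counts as mismatch */
            return i;
    }
    return n;
}
```

To reach all branches, it helps to include input lengths that are not a multiple of the vector width, so that any code handling leftover elements is exercised as well.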
If the program functions correctly, the fruits of the labor can be reaped. It's often expected that the optimized part of the program will be four times faster when working with, for example, 32-bit integers, since a 128-bit SIMD unit can perform four calculations simultaneously. However, this assumption doesn't take into account that the SIMD unit, depending on the architecture and design, has its own completely separate connection to the peripherals and memory. Speed increases of significantly more than a factor of four are therefore not uncommon. However, the opposite can also occur. This is especially true when little computation is required, when large amounts of data need to be copied back and forth between SIMD and non-SIMD data types, or when the hardware (e.g., the memory connection) is the bottleneck.
It's generally impossible to make a reliable prediction of the performance gains to be expected when switching to the SIMD unit, as this depends on many factors. In principle, it's advisable to first analyze where the most time is consumed in the (non-SIMD) code (profiling). These code sections or functions are the candidates for rewriting. Only after the rewrite, measured with the actual compiler on the target hardware, can the developer obtain a reliable assessment of performance. Since performance depends heavily on the instructions used, a result measured for one code section can be transferred to other parts of the code only to a limited extent.
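A measurement of this kind can be sketched with the standard C function clock(). The names kernel_fn, add_scalar, and seconds_per_call are hypothetical, and the kernel signature is an assumption chosen for illustration. The same harness would be called once with the original function and once with the SIMD version, on the target hardware.

```c
#include <stddef.h>
#include <time.h>

/* Common signature for the variants to be compared (assumption). */
typedef void (*kernel_fn)(const float *a, const float *b,
                          float *out, size_t n);

/* Plain C variant, serving as the baseline of the comparison. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Average CPU time in seconds for one call of `kernel`, taken over
   `repeats` calls to smooth out the coarse resolution of clock(). */
double seconds_per_call(kernel_fn kernel, const float *a, const float *b,
                        float *out, size_t n, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        kernel(a, b, out, n);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}
```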
Despite the many "buts," the software developer is ultimately rewarded when they see the performance gains possible with SIMD units. Various algorithms have shown that, under optimal parallelization conditions, speed increases of a factor of 10-15 are achievable.
Autovectorization
Autovectorization offers a relatively simple way to at least partially utilize the performance of the SIMD unit. Most common compilers support this, and compilers have improved considerably in this area, especially in recent years. With autovectorization, the compiler automatically attempts to convert suitable code segments into SIMD format, thus utilizing the vector unit, without the programmer having to explicitly use corresponding intrinsics when creating their program. The amount of performance gain achievable depends on the program's structure, the computational operations used, and the specific compiler version. If a program already exists, it's definitely worth testing autovectorization. However, it's always advisable to verify that the program produces the correct results with this option enabled.
In gcc, autovectorization is enabled with the option -ftree-vectorize. gcc itself enables this option automatically only at -O3; however, there are also compilers and/or wrappers that activate it at -O2.
To get an idea of what the compiler does during autovectorization, the option -ftree-vectorizer-verbose=N helps. Newer gcc versions report this information through the -fopt-info-vec option instead.
If, however, autovectorization is not desired, or should be explicitly deactivated for the purpose of comparative measurements, this can be done with the option -fno-tree-vectorize.
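A loop of the following shape is a typical candidate for autovectorization. This is a sketch: the function name saxpy, the file name in the comment, and the use of the restrict qualifier are assumptions for illustration. restrict tells the compiler that the two arrays do not overlap, which makes it easier for the compiler to prove that the iterations are independent.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for all i. Each iteration is independent, and
   `restrict` promises the compiler that x and y do not overlap. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/*
 * Two builds for a comparative measurement (saxpy.c is a hypothetical
 * file name):
 *   gcc -O2 -ftree-vectorize    -c saxpy.c
 *   gcc -O2 -fno-tree-vectorize -c saxpy.c
 */
```

Whether the compiler actually vectorizes the loop depends on its version and the target options, so the generated code or the compiler's vectorization report should be checked.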
Even though compilers have improved significantly in recent years, direct programming of the SIMD unit generally yields considerably more performance than autovectorization. However, the exact amount of this improvement is difficult to estimate and can only be determined through actual measurements.
One disadvantage of SIMD units should not be overlooked. Because the intrinsics are specific to a processor family, code portability suffers. While processor manufacturers ensure that newer versions of a vector unit also support all older instructions, switching processor families necessitates porting the SIMD code. Generally, it is recommended to encapsulate the code segments that use SIMD instructions in a wrapper or similar construct, provided the application design allows for it.
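Such an encapsulation might look like the following sketch, in which callers see only one plain C function while the family-specific intrinsics stay inside it. The wrapper name simd_add_f32 is hypothetical, and only an x86 SSE and an ARM NEON variant are shown. On any other target the plain loop at the end does all the work.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* x86 SSE intrinsics */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM NEON intrinsics */
#endif

/* simd_add_f32 (hypothetical wrapper): out[i] = a[i] + b[i] for i < n.
   Porting to another processor family only touches this function. */
void simd_add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif

    /* Leftover elements, or the whole array if no SIMD variant applies. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```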
Summary
If a performance increase in a program is desired or necessary, it is often worthwhile to utilize the hardware's vector unit (if available). Autovectorization by the compiler can already increase computing power in simpler cases, but direct programming of the SIMD unit usually offers a significantly greater gain for parallelizable code, which under optimal conditions can amount to a speedup of several times. Especially in the embedded field, where computing power is often considerably more limited, this can extract significantly more performance from the available hardware.
Bibliography
[1] Altivec Technology Programming Interface Manual
Author
Andreas Ehmanns has been working with embedded systems and the challenges of soft and hard real-time systems for more than 20 years. As early as the late 1990s, he began using Linux on both established and new systems with real-time requirements. He works as a 'Technical Consultant for Embedded Software Systems' and investigates, among other things, the suitability of various processor architectures for use in embedded systems.