Software development for multicore systems

What's new, and where is the journey headed?

Author: André Schmitz, Green Hills Software

Contribution – Embedded Software Engineering Congress 2018

Many embedded systems already use multicore processors, and this proportion is steadily increasing. Developing and migrating software to these architectures is becoming increasingly easy, and highly mature technologies for developing software for multicore systems already exist. This article examines the technologies in the areas of code generation, real-time operating systems, and debugging that facilitate the development of multicore software. It also looks to the future and explores how these technologies will scale with the expected increase in the number of cores.

Why multicore?

There are various reasons for switching to multicore systems. The most common goal is to achieve higher performance with a multicore CPU, performance that cannot be achieved simply by increasing the clock speed of a single core. For example, compared to a highly clocked single-core system, a downclocked dual-core system consumes significantly less power while offering more processing power. Another reason for using multicore processors could be the desire to run multiple operating systems on a single CPU. On a dual-core processor, one core could run Linux and the other a real-time operating system (RTOS). These days, this is often referred to as virtualization; previously, it was also known as "Asymmetric Multicore Processing" (AMP) (see below).

There are also use cases where multicore is employed to achieve functional safety. A typical configuration is the lockstep principle, where two similar cores execute the same algorithms, and at the end, it is checked whether the results of both calculations are identical [1]. This increases hardware fault tolerance. Sometimes, attempts are made to run two software components on different cores, arguing that this allows them to run without interfering with each other. This is, of course, only true if these cores do not share any resources (caches, bus, etc.), which is not the case in most multicore hardware systems (see loosely vs. tightly coupled systems below).

Multicore hardware

Generally, a distinction can be made between heterogeneous systems with different cores and homogeneous systems with identical cores. Heterogeneous systems are primarily used for inherently asymmetric tasks, where, for example, a general-purpose processor (e.g., ARM) handles the user interface and system control, while a dedicated processing unit (e.g., DSP or NPU) ensures efficient and fast data processing and input/output. Access to certain peripherals is often restricted to specific cores. The advantage of heterogeneous systems, therefore, lies in the dedicated performance of the cores. Homogeneous systems, on the other hand, offer an elegant and energy-efficient solution for increasing a system's computing power.

When considering the connection of the cores to the memory, a distinction is made between tightly coupled systems, where all cores have access to the same memory, and loosely coupled systems, where each core has its own dedicated memory. In the latter, communication typically does not occur directly via shared memory, but rather through other channels. Furthermore, in tightly coupled systems, performance does not scale proportionally with the number of processors, as the bus usually becomes a bottleneck and cache coherence must be ensured.

This performance bottleneck can be alleviated using specialized hardware architectures. One possible architecture is Non-Uniform Memory Access (NUMA), in which each core has local, fast memory whose access is not slowed down by bus arbitration. Access to the memory of the other cores is achieved via a high-speed connection, which, while cache-coherent, is significantly slower than access to local memory.

Programming of multicore systems

When developing software for multicore systems, it is essential to distribute the tasks across the cores effectively. In heterogeneous systems, this distribution is much more obvious and must be done manually by the developer. However, programming these systems is rather complex, as different tools (compilers and debuggers) are often required for the different cores, and interprocessor communication typically needs to be implemented in a hardware-specific manner. If operating systems are used, they are usually different on the participating cores, adding another layer of difficulty when it comes to interoperability.

Operating system solutions

In homogeneous multicore systems, there are usually two different approaches to distributing tasks across the cores (AMP and SMP), which are described below.

AMP

Looking at the technical literature, one finds quite different interpretations of what Asymmetric Multi-Processing (AMP) actually means. Some use the term AMP for heterogeneous hardware architectures in which not all cores have access to the same resources [2]. Others consider it a purely software-related matter on homogeneous hardware architectures [3]. Here, I will only use AMP as a software aspect on otherwise homogeneous multicore hardware. Ultimately, however, in all forms of AMP, the developer decides on the partitioning and distribution of functions between the cores. Each core runs its own independent software, which can be implemented with or without an operating system. If operating systems are used, they do not necessarily have to be identical on all cores. The latter is often referred to as virtualization today, although virtualization can also be understood in a completely different way.

An AMP software approach can be helpful for reserving computing time for specific tasks, although such guarantees do not strictly hold as long as the cores share resources. Unfortunately, applications on AMP systems are usually less portable, and there is no way to achieve dynamic load balancing at runtime. Similar to heterogeneous systems, fault-tolerant message exchange between the cores is also required.
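One common way to exchange messages between two cores through shared memory is a single-producer, single-consumer ring buffer. The following is only a minimal sketch under stated assumptions: the class name `SpscQueue` and its methods are hypothetical, it uses C++11 atomics rather than any vendor-specific mailbox hardware, and it assumes both sides see the same cache-coherent memory. A full queue is reported to the caller instead of blocking, so the sender can decide how to react.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical single-producer/single-consumer queue (illustration only).
// One core calls try_push(), exactly one other core calls try_pop().
// The queue holds at most Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "need at least one usable slot");

    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to read (consumer side)
    std::atomic<std::size_t> tail_{0};  // next slot to write (producer side)

public:
    // Returns false instead of blocking when the queue is full.
    bool try_push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full: the consumer has not caught up yet
        }
        slots_[tail] = item;
        // Release: the slot write becomes visible before the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Returns false when no message is available.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = slots_[head];
        // Release: the slot is handed back only after it has been read.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }
};
```

A sender would typically call `try_push` and, on `false`, retry later or count the dropped message; which reaction is appropriate depends on the application.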

SMP

Symmetric multi-processing (SMP) is only possible on homogeneous, tightly coupled systems using an SMP-enabled operating system (OS). One instance of the operating system runs on all cores and handles the distribution of threads across them. In principle, an existing multithreaded application that previously ran on a single core, but was already implemented on an SMP-enabled OS, can therefore be moved to a multicore system with a significant performance gain. The OS automatically distributes threads between the cores, and all resources and internal processes are transparently monitored by the OS [4]. Of course, there are some pitfalls to consider.

Another strength of SMP systems is resource sharing. This allows for a high degree of flexibility, but the true concurrency of threads can also lead to resource competition. In SMP systems, threads with different priorities can run truly concurrently. Therefore, if someone developing a single-core application previously assumed that a low-priority thread could only run if no higher-priority thread was running, they might encounter some surprises with an SMP system. It is advisable to use an SMP OS with Memory Management Unit (MMU) support to implement software granularity directly at the process level, rather than at the thread level. This avoids troublesome errors that often occur with multithreading. The MMU is then used to isolate the OS components and define clean interfaces between them.
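The priority pitfall described above can be illustrated with a minimal C++ sketch. It assumes a C++11 toolchain with `std::thread`; the names `SharedCounter` and `run_two_workers` are hypothetical, and the two threads merely stand in for a high-priority and a low-priority thread. Because both may run at the same time on an SMP system, the shared update is protected explicitly instead of relying on scheduling priority.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared state updated by two threads of different priority.
struct SharedCounter {
    std::mutex lock;  // explicit protection; a thread priority is not a lock
    long value = 0;

    void increment() {
        std::lock_guard<std::mutex> guard(lock);
        ++value;
    }
};

// Starts two workers that may execute truly concurrently on an SMP system
// and returns the final counter value.
long run_two_workers(long iterations_per_worker) {
    assert(iterations_per_worker >= 0);
    SharedCounter counter;
    auto work = [&counter, iterations_per_worker] {
        for (long i = 0; i < iterations_per_worker; ++i) {
            counter.increment();
        }
    };
    std::thread high(work);  // stands in for the "high-priority" thread
    std::thread low(work);   // stands in for the "low-priority" thread
    high.join();
    low.join();
    return counter.value;
}
```

Without the mutex, the two increments could interleave on different cores and updates could be lost, even if the code had behaved correctly on a single core.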

To implement real-time behavior, a real-time operating system (RTOS) is required on a multicore system. Ideally, this RTOS should have low, deterministic interrupt latency and maintain the determinism of an application when migrating to a multicore system. In addition to determinism, the OS should also scale well when additional cores are added. These two properties usually depend on the type of kernel parallelization, i.e., what kind of "locks" are implemented to protect critical areas of the kernel. There are very scalable approaches, such as that of the Linux kernel, but these are comparatively non-deterministic. The improved real-time capability and reliability of an RTOS, however, are sometimes achieved at the expense of scalability.

Scheduling and core affinity

Last but not least, as mentioned earlier, the memory bus becomes a bottleneck with an increasing number of cores. To at least effectively utilize the cores' caches, it makes sense to bind threads to specific cores (affinity). Here, threads can be statically bound to cores via operating system calls (user-defined affinity). While this reduces the flexibility of load balancing at runtime, it can, for example, help with interrupt handling by assigning the interrupt-awakened thread to the core that also handles the interrupt. In this case, performance is improved through more efficient cache utilization, and an inter-processor interrupt (IPI) can be avoided.
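As a rough illustration of user-defined affinity, the sketch below binds the calling thread to one core. It assumes Linux with glibc, where `pthread_setaffinity_np` and `pthread_getaffinity_np` are available; an RTOS would provide its own, differently named call. The helper names `bind_current_thread_to_core` and `first_allowed_core` are hypothetical.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Binds the calling thread to a single core. Returns true on success.
// Assumption: Linux/glibc; this is not a portable or RTOS-specific API.
bool bind_current_thread_to_core(int core) {
    assert(core >= 0 && core < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask) == 0;
}

// Returns the lowest-numbered core the calling thread is allowed to run on,
// or -1 if the affinity mask cannot be read or is empty.
int first_allowed_core() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask) != 0) {
        return -1;
    }
    for (int core = 0; core < CPU_SETSIZE; ++core) {
        if (CPU_ISSET(core, &mask)) {
            return core;
        }
    }
    return -1;
}
```

A thread woken by an interrupt could call `bind_current_thread_to_core` once at startup with the number of the core that services that interrupt; how interrupts are routed to cores is platform-specific and not shown here.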

Furthermore, an SMP OS can attempt to automatically execute threads on the same core whenever possible ("natural affinity"). This improves cache utilization and thus performance while maintaining flexibility and priority guarantees at runtime. Ultimately, core affinity allows an SMP system to be partially configured in the direction of software AMP, which can then be referred to as Bound Multicore Processing (BMP).

Automatic parallelization

Furthermore, there are also approaches to the automated parallelization of code using libraries. Significant progress has been made in this area in recent years. While these libraries are generally easy to use and scale well, they all come with certain limitations. Some only function effectively on large loops (e.g., when using OpenMP) or require explicit design using C++ classes (parallelization objects) [5]. Of course, newer C++ versions (C++11 and later) also help to implement threading on SMP. There is an increasing number of tools designed to help fully automatically parallelize existing serial code. This seems to work quite well for languages like Fortran, because Fortran has stricter guarantees regarding memory overlap (aliasing) than languages like C or C++. The latter allow indirect addressing, recursion, and many other dynamic aspects that make it difficult for a parallelization tool to perform fully automatic parallelization. The existing solutions on the market usually require support from the developer or can only parallelize very specific aspects of a program, such as the aforementioned loops [6].
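The loop-oriented style of OpenMP mentioned above can be sketched as follows. This is only an illustration with a hypothetical function name, `sum_of_squares`. It assumes a compiler with OpenMP support (for GCC and Clang this is typically enabled with `-fopenmp`); without that option the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Sums the squares of all elements. With OpenMP enabled, the iterations are
// divided among threads and each thread accumulates a private partial sum
// that is combined at the end (the "reduction" clause).
long long sum_of_squares(const std::vector<int>& values) {
    long long sum = 0;
    const long long n = static_cast<long long>(values.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long long i = 0; i < n; ++i) {
        const long long v = values[static_cast<std::size_t>(i)];
        sum += v * v;
    }
    return sum;
}
```

The example works because the iterations are independent of each other apart from the accumulation; loops with dependencies between iterations generally cannot be parallelized this simply.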

Memory models

The behavior of memory accesses on multicore systems also requires special consideration. Besides the obvious need to protect shared resources from concurrent access (thread safety), for example using atomic operations or mutexes, it is particularly important to consider how memory accesses from different cores relate to each other. The order in which one core writes to memory can be completely different from the order in which another core sees these accesses. This is not about the potential reordering of instructions by a compiler to optimize execution speed, but about the hardware optimizing its memory accesses. To avoid problems caused by this reordering, instruction set architectures offer special barrier instructions. A barrier guarantees that certain types of memory accesses issued before it are completed before certain types of memory accesses issued after it begin. Newer C++ versions (C++11 and later) also provide support for this. Software developers should therefore be aware of the issues associated with memory models [7].
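The ordering problem can be made concrete with a small C++11 sketch that uses release and acquire ordering on an atomic flag. The function name `publish_and_consume` and the variable names are hypothetical. The compiler maps these orderings to whatever barrier instructions the target architecture requires, so no architecture-specific code appears here.

```cpp
#include <atomic>
#include <thread>

// One thread publishes a payload, another waits for a flag and reads it.
// Returns the value the consumer observed.
int publish_and_consume(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // synchronization flag
    int received = -1;

    std::thread producer([&] {
        payload = value;
        // Release: the payload write cannot be reordered after this store.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Acquire: reads after this load cannot be reordered before it.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        received = payload;  // observes the producer's write
    });

    producer.join();
    consumer.join();
    return received;
}
```

If both operations on `ready` used `std::memory_order_relaxed` instead, the language would no longer guarantee that the consumer sees the updated payload, which is exactly the class of error described above.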

Development tools

Another challenge when working with multicore systems is preventing software errors, finding unavoidable ones, and optimizing the code. In my view, finding software errors remains one of the biggest challenges in software development. A multicore system, due to its increased complexity, doesn't make things any easier. Therefore, a structured development process is crucial, helping to avoid errors from the outset or to find them as easily and early as possible. Even with single-core systems, we are familiar with the problems of dealing with virtual concurrency (e.g., race conditions), and these don't become any simpler when dealing with true concurrency. It is therefore essential to have the right development tools that efficiently support software development for multicore systems.

For driver development, tools that enable synchronous runtime control (stop mode) of all system cores are ideal, allowing multiple cores to be started and stopped simultaneously. If an operating system is used, application debugging (run mode) is also useful, enabling the search for program errors at the thread level during runtime without halting the rest of the system. For a better understanding of the runtime behavior of concurrent processes, a tool for analyzing operating system events is essential for both AMP and SMP systems. If non-intrusive analysis of the program flow is required, offering a simple way to find hidden or difficult-to-reproduce software errors, tools for multicore trace acquisition and trace analysis are necessary [6]. Of course, it is optimal to find errors automatically. This can be achieved either through static source code analysis or through automatic runtime debugging during program execution.

Outlook

The trend towards more multicore processors, especially in the embedded sector, will certainly continue, if only to satisfy the demand for performance with simultaneously low power consumption, and we need to consider how we will overcome the associated challenges. Regarding hardware, further developments will likely primarily proceed in three directions.

Functional safety through lockstep
Application-specific heterogeneous architectures
Homogeneous high-end multicore SoCs

Regarding code generation, while certain areas can benefit from automatic code generation, software developers and system architects will still have plenty to do with multicore software architecture (concurrency, memory models, etc.). We need to ensure that operating systems scale well enough for multicore systems. We're on the right track with tooling, but when it comes to program tracing, a crucial aspect for analyzing complex systems, we have to reckon with ever-increasing amounts of data per second. Current hardware tracing solutions, even those using highly efficient serial protocols, will likely not be sufficient to map the entire program flow in enough detail in real time. Further work is needed here so that we can continue to understand precisely what our program is doing on which core at any given time.

References

[1] https://de.wikipedia.org/wiki/Lockstep_(Computertechnik)

[2] https://en.wikipedia.org/wiki/Asymmetric_multiprocessing

[3] https://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CFHBJBIE.html

[4] https://en.wikipedia.org/wiki/Symmetric_multiprocessing

[5] https://en.wikipedia.org/wiki/OpenMP

[6] https://en.wikipedia.org/wiki/Automatic_parallelization_tool

[7] https://en.wikipedia.org/wiki/Memory_ordering

Author

André Schmitz received his diploma in physics from the University of Bonn in 1997. He then developed control and simulation software for autonomous robots at the Fraunhofer Institute for Geosciences (FhG). From 2000 to 2005, Mr. Schmitz developed embedded software for UMTS communication systems. Since 2005, Mr. Schmitz has been responsible for providing technical customer support and conducting training courses at Green Hills Software. He has also been a regular speaker at various industry conferences.
