Parallel programming without spinlocks

How to avoid deadlocks

Authors: Jens Harnisch, Li Lin, Albrecht Mayer, Gerhard Wirrer, Infineon Technologies AG

Contribution – Embedded Software Engineering Congress 2017

To fully utilize the performance of modern multi-core processors, a certain degree of synchronization, for example through barriers, and resource protection, for example through spinlocks, are necessary depending on the application. This can lead to deadlocks, which are naturally undesirable, especially for safety-critical systems. An alternative programming pattern is lock-free algorithms. These must be specifically adapted for the data structures shared by multiple cores. An unbounded queue is presented, evaluated, and compared to other approaches.

Motivation: Spinlocks and their problems

Mutexes, semaphores, and events are provided by operating systems for the synchronization of parallel processes and the protection of shared resources, such as data structures.

The implementation uses atomic instructions of the respective computer architecture. A pseudo-notation for a spinlock is given below:

inline void get_spinlock(volatile unsigned long* spinlock_ptr)

{

while (swap(spinlock_ptr, BUSY) == BUSY)

;

}

The `swap` function, within which the atomic `swap` machine instruction is used, reads the current value from the address `spinlock_ptr` and writes the value `BUSY` to the address of `spinlock_ptr`. The execution of the machine instruction itself will proceed without blocking. However, if the spinlock becomes unavailable, the `while` loop will not exit, resulting in a deadlock. For example, the software might have crashed on another core, preventing the spinlock from being released. Spinlocks can also be involved in priority inversion. The complexity increases with multiple spinlocks in the design. Therefore, the first version of Linux for multi-core systems only used the single Big Kernel Lock. However, this partially reduces the performance of a multi-core system to that of a single core.

Block-free programming: principles, advantages and disadvantages

Lock-free programming has been a well-known alternative to working with mutexes and spinlocks for many years [1]. Deadlocks can no longer occur with this approach. Lock-free algorithms must be adapted to the data structure to be protected, for example, a queue. Often, this only involves a few lines of source code, but the design is complex. Furthermore, these algorithms potentially waste computing power due to speculative execution. At least this problem is mitigated when many cores are available. The principle of lock-free programming is usually as follows:

Creating a copy of the sensitive data structure
Execution of the operation on the copy of the data structure
Before writing back the result, check whether another process has modified the structure in the meantime. If so, the operation must be repeated on a new copy of the structure (back to step 1). If the data structure has not been modified in the meantime, the calculated data structure can be written back.

To check whether the data structure has been modified in the meantime, a pointer can be used, for example. If the pointer still points to the same address as when the copy was created, then apparently no other process has worked on the structure in the interim. However, the "ABA problem" is well known. In this case, the pointer points to address A when the copy is created, as well as when checking after the calculation is executed. This could be coincidental. In reality, the pointer pointed to address B in the meantime; another process has manipulated the structure and written its result back. If we ignore this problem, we overwrite the result of the other process on the sensitive data structure. To solve this problem, a counter can be introduced, for example, which is incremented each time the structure is written back.

Example: An unlimited queue

Unlike working with mutexes or spinlocks, which can be used quite generically, lock-free programming must be adapted to the specific data structure, such as linked lists, stacks, or queues. A lock-free unlimited queue, implemented in C, is presented below. The software was run on a 6-core AURIX 2nd Generation microcontroller.

The core element of the queue is a node, containing the current value and a reference to the next node:

struct Node

{

int value;

struct Node* next;

};

The queue design supports an unlimited number of elements. While an unlimited number of elements is unrealistic for an embedded system, supporting a variable number of elements, such as detected objects in an autonomous driving scene, is a realistic scenario. Inserting and deleting elements is explained below.

Inserting an element into the queue (enqueue)
Using the knot with the pointer at the end (see Figure 1, PDFThe last actual node is determined. In enqueue step 1, the current end of the queue (node B) is determined, and the pointer `next` of node B is set to the address of node A. Node A will become a new node in the queue. The queue is now internally consistent, so the pointer `next` of node `tail` can also be set to the address of node A.

Deleting an element from the queue (dequeue)
The current first node is determined using the Head node; initially, this is the sentinel node. The Sentinel node points to node D, the actual last node. In Dequeue step 1, the actual value of the node is read. In step 2, the next pointer of the Head node is set to node D.

Three of the steps must be executed using an atomic compare-and-swap instruction, labeled CAS in the diagram. The execution of the atomic instruction cannot be blocked. The above queue implementation can be compared to conventional queue implementations. These either use a single spinlock for the entire queue, regardless of whether an element is being deleted or added, or they use two spinlocks, one for inserting and one for deleting an element. However, all of these conventional implementations are potentially prone to blocking.

Analyzing and testing non-blocking algorithms is more demanding than analyzing blocking algorithms. Non-blocking algorithms explicitly support simultaneity and do not temporarily exclude it, as is the case with implementations using spinlocks. On the other hand, it is a largely unsynchronized programming pattern, not a synchronized one like Logical Execution Time. Programming with non-blocking algorithms ultimately resembles Transactional Memory, where transactions are executed speculatively but rolled back in case of conflicts. The complexity of non-blocking algorithms must also be considered during testing. The developed algorithm was not formally verified, but only analyzed. For this analysis, the availability of the non-intrusive trace system MCDS (Multi-Core Debug Solution) on the target platform, the AURIX 2nd Generation Microcontroller TC39x, was very helpful. This allowed for the precise tracking of various simultaneity scenarios, such as the sequence of accesses by different cores for correctness analysis, as well as the exact timing behavior for performance analysis. Both are made possible by special MCDS properties. These include the parallel and temporally aligned tracing at different observation points with very high temporal resolution.

Conclusion

Since lock-free programming inherently eliminates deadlocks, its use can be particularly advantageous in safety-critical software. Reference [1] documented how lock-free programming even achieved higher performance than spinlocks. On the other hand, lock-free programming must always be specifically tailored to the data structure being protected, and the implementations are more difficult to understand than those using a spinlock. Although lock-free programming has been known for many years [1], its adoption remains rather limited. The support for lock-free programming in the C++ STL through execution policies will certainly increase its prevalence. As C++ gains increasing acceptance in the embedded systems field, including limited use of the STL, lock-free programming will undoubtedly find its place there as well, even in safety-critical systems with hard real-time requirements. To verify a specific implementation, hardware-assisted tracing on the microcontroller is the method of choice.

Bibliography and list of sources

[1] Henry Massalin and Calton Pu. A Lock-Free Multiprocessor OS Kernel. Technical Report CUCS-005-91, Columbia University, 1991.

Download the article as a PDF

Multicore – our training & coaching

Do you want to bring yourself up to date with the latest technology?

Then find out more here MircoConsult offers training courses/seminars/workshops and individual coaching on the topic of multicore/microcontrollers.

Training & coaching on the other topics in our portfolio can be found here. here.

Multicore – Expertise

Valuable expertise on the topic of multicore/microcontrollers is available. here Available for you to download free of charge.

To the specialist information

You can find expertise on other topics in our portfolio here. here.

MicroConsult Newsletter

With the MicroConsult newsletter, you'll stay on the pulse of the embedded world. Look forward to proven practical knowledge, real professional tips, and current events – directly from our experts for your project success.

Subscribe now!

Published by

weissblau media

← PLAT4MC: Multicore Performance Optimization with Open Source OpenAMP - finally! →