Minimizing the consequences through efficient memory management
Authors: Philipp Jungklaß, Ingenieurgesellschaft Auto und Verkehr GmbH,
Prof. Dr.-Ing. Mladen Berekovic, University of Lübeck, Institute for Technical Computer Science
Contribution – Embedded Software Engineering Congress 2018
The use of embedded multicore microcontrollers in modern control units with strict real-time requirements presents developers with ongoing challenges, as software separation is often not possible to the extent dictated by the number of processor cores. This necessitates data exchange between the processor cores. Currently, this intercore communication occurs via shared memory, which the processor cores access concurrently. These parallel accesses result in wait cycles that are difficult to calculate for a system with strict real-time requirements. Therefore, this article presents a priority-based method for intercore communication that minimizes wait cycles through effective utilization of the existing memory hierarchy. To demonstrate its functionality, the method is ported to two embedded multicore microcontrollers from the AURIX family and compared with the existing approach.
1. Introduction
Since the introduction of the first electronic control units (ECUs) with embedded multicore microcontrollers, the number of available processor cores has steadily increased. This development places particular demands on the ECU software, which must be distributed across the available processor cores. However, it is important to note that dependencies exist between the software components on the different cores, necessitating cross-core data exchange. Currently, data exchange between the processor cores occurs via a shared global memory, which the cores access concurrently. The resulting waiting times are difficult to predict and can compromise the system's real-time capability. Therefore, this article presents a method that minimizes the number of wait cycles by intelligently utilizing the existing memory hierarchy of the embedded multicore microcontroller, thereby increasing predictability [10][11].
This article is structured into six sections. After the introduction, the current state of the art is presented. The developed concept is then introduced and described in detail. Next, the experimental setup used is explained, followed by a presentation of the results obtained from the two test platforms. The article concludes with a discussion of the results achieved.
2. State of the art
Structure of embedded multicore microcontrollers
Fundamentally, the currently available multicore microcontrollers from various manufacturers have a similar architecture. Depending on the processor family and derivative, it comprises a varying number of processor cores, which are connected to the system's global memory and to each other via a crossbar. The advantage of a crossbar is that it allows parallel connections between different participants. Additionally, each core has a local memory that it can access directly without wait cycles. Figure 1 (see PDF) shows the schematic structure [1][2][3][7].
Intercore communication
In modern electronic control units (ECUs) with embedded multicore microcontrollers, data exchange between the processor cores takes place via a global memory to which all communication participants have access. To avoid inconsistent data that can arise from concurrent access by multiple communication participants, three basic safeguarding mechanisms are used [5][6].
Mutex
A mutex is a method used for the mutual exclusion of different communication partners. It is important to note that only the partner who has reserved exclusive access to the shared resources can release them again [6].
Semaphore
A semaphore is also used to regulate exclusive access to shared resources. However, unlike a mutex, the reserving communication partner does not necessarily have to be the releasing partner [6].
Spinlock
The term spinlock describes a method that governs exclusive access to a shared resource. Unlike a mutex or a semaphore, however, the communication partners actively wait if the required resource is not currently exclusively available [6].
All three methods for controlling access to shared memory would benefit from accelerated intercore communication. This is because each exclusive access involves executing a critical section of the ECU's program code, which must not be interrupted. The longer such a section lasts, the more difficult it becomes to maintain hard real-time capability.
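The difference between the three mechanisms can be illustrated on a host PC with Python's `threading` module. This is only a sketch of the semantics, not ECU code: `threading.RLock` stands in for a mutex with ownership, `threading.Semaphore` for a semaphore, and the `SpinLock` class and the `run_in_thread` helper are hypothetical names introduced for this illustration.

```python
import threading


class SpinLock:
    """Hypothetical spinlock: polls a non-blocking lock instead of sleeping."""

    def __init__(self):
        self._flag = threading.Lock()
        self.spins = 0  # number of failed acquisition attempts

    def acquire(self):
        # Active waiting: retry until the resource becomes free.
        while not self._flag.acquire(blocking=False):
            self.spins += 1

    def release(self):
        self._flag.release()


def run_in_thread(func):
    """Run func in a second thread and return its result or the raised exception."""
    result = {}

    def target():
        try:
            result["value"] = func()
        except Exception as exc:
            result["error"] = exc

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return result


# Mutex: only the thread that reserved the resource may release it.
mutex = threading.RLock()
mutex.acquire()
mutex_result = run_in_thread(mutex.release)  # the non-owner fails
mutex.release()                              # the owner succeeds

# Semaphore: a different thread may release what this thread reserved.
semaphore = threading.Semaphore(1)
semaphore.acquire()
semaphore_result = run_in_thread(semaphore.release)

# Spinlock: the waiting thread polls until the holder releases.
spin = SpinLock()
spin.acquire()
releaser = threading.Timer(0.05, spin.release)
releaser.start()
spin.acquire()  # busy-waits until the timer thread releases the lock
releaser.join()
spin.release()

print("mutex released by non-owner:", "error" not in mutex_result)
print("semaphore released by non-owner:", "error" not in semaphore_result)
print("spinlock polled while waiting:", spin.spins > 0)
```

The polling loop in `SpinLock.acquire` is the critical point for real-time behavior: the waiting participant consumes processor time for as long as the holder stays inside its critical section.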
3. Concept
As described in Section 2, processor cores access their local memory significantly faster than the global memory, which is connected via a crossbar. Therefore, the concept presented here implements intercore communication using the local memories. Note that a core accesses another core's local memory considerably more slowly than its own, but still faster than the global memory. Unlike the previously used method, this concept takes into account the priority of the processor cores within a control unit. The priority can be derived from the task assigned to a core, for example its safety level, or from its workload.
To illustrate the concept, the example system in Figure 2 (see PDF) is used. This system has three processor cores, each with its own local memory, connected to the global memory and to each other via a crossbar. Each of the three cores has a distinct priority. In the following examples, red marks a write access and green a read access of the intercore communication. Yellow indicates an operation by a core on its local memory that is not part of intercore communication.
In the first scenario, shown in Figure 3 (see PDF), Core 0 provides information that Core 1 and Core 2 need. Since Core 0 has the lowest priority in this system, it writes the data to the local memories of Core 1 and Core 2. This allows them to read the data much faster, thus avoiding wait cycles for the higher-priority cores.
In the second use case, shown in Figure 4 (see PDF), Core 1 provides data to the other two cores. Since Core 1 has the highest priority in the system shown here, it writes its values to its own local memory. While this configuration means that Core 0 and Core 2 require more time for read access, it reduces the load on Core 1.
In the third scenario, shown in Figure 5 (see PDF), Core 2 calculates data that Core 0 and Core 1 require for further processing. Because Core 2 has the middle priority in this system, it writes the values for Core 1 directly to Core 1's local memory, as Core 1 has a higher priority. Core 2 stores the values for Core 0 in its own local memory. While this distribution means that Core 0 needs more time to access the data, it relieves Core 1, which can read from its own memory.
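The three scenarios follow one rule: for each pair of producing and consuming core, the data is stored in the local memory of the core with the higher priority. The following Python sketch models this rule on a host PC. The function name `placement` and the numeric priority values are illustrative assumptions; only the priority order (Core 1 highest, Core 2 middle, Core 0 lowest) is taken from the scenarios.

```python
def placement(producer, consumers, priority):
    """Return {consumer: core whose local memory holds the data}.

    For each producer/consumer pair, the data is stored in the local
    memory of the higher-priority core of the two.
    """
    result = {}
    for consumer in consumers:
        if priority[consumer] > priority[producer]:
            result[consumer] = consumer  # producer writes remotely
        else:
            result[consumer] = producer  # consumer reads remotely
    return result


# Assumed numeric priorities: a larger value means a higher priority.
PRIORITY = {0: 1, 1: 3, 2: 2}

print(placement(0, [1, 2], PRIORITY))  # first scenario:  {1: 1, 2: 2}
print(placement(1, [0, 2], PRIORITY))  # second scenario: {0: 1, 2: 1}
print(placement(2, [0, 1], PRIORITY))  # third scenario:  {0: 2, 1: 1}
```

Because every core in the example system has a distinct priority, ties do not occur; the sketch would place tied data in the producer's memory, which is an arbitrary choice.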
As the use cases in Figures 3 to 5 show, the processor cores use their local memory both for intercore communication and for calculations performed exclusively on a single core. This results in concurrent accesses and thus in wait times that can impair real-time behavior. For this reason, some processor manufacturers have begun integrating two local memories per core, so that intercore communication can take place via a separate memory. Figure 6 (see PDF) shows the schematic representation of the extended memory hierarchy [4].
In the following scenario, shown in Figure 7 (see PDF), Core 0 provides values required by Core 1 and Core 2. In this configuration, Core 0 again has the lowest priority and therefore writes its values to the memories of Core 1 and Core 2. The difference from the first use case is that local memory X-0 is used exclusively by its own processor core, while memory X-1 is used for intercore communication. This effectively prevents concurrent accesses while a core processes its exclusive tasks.
In the scenario in Figure 8 (see PDF), Core 1 provides the data. Since Core 1 has the highest priority in this example, it writes its values for Core 0 and Core 2 to its local memory 1-1. While this increases the read access time for Core 0 and Core 2, it reduces the write access time for Core 1. With two local memories available, exclusive calculations and intercore communication use separate memories, which reduces concurrent accesses.
In the next scenario, shown in Figure 9 (see PDF), the data is provided by Core 2. Since Core 2 has the middle priority in this system, it writes the values for Core 1 to Core 1's local memory, because Core 1 has the higher priority. Core 2 writes the values for Core 0 to its own local memory. Thanks to the extended memory hierarchy, the data can again be divided into exclusive and shared variables, which reduces the concurrent accesses caused by intercore communication.
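With two local memories per core, the assignment of a variable can be sketched as a two-step decision: exclusive variables stay in memory X-0 of their core, and shared variables go to memory X-1 of the higher-priority communication partner. The Python model below is a minimal illustration under these assumptions; `assign_memory` and the numeric priorities are hypothetical names and values, and the memory labels follow the X-0 / X-1 notation used above.

```python
def assign_memory(owner, readers, priority):
    """Map each reading core to the local memory that holds the variable.

    owner:   core that writes the variable
    readers: other cores that read it (empty for an exclusive variable)
    """
    if not readers:
        # Exclusive data stays in the owner's first local memory.
        return {owner: f"{owner}-0"}
    target = {}
    for reader in readers:
        # Shared data goes to the second local memory of the
        # higher-priority core of the pair.
        host = reader if priority[reader] > priority[owner] else owner
        target[reader] = f"{host}-1"
    return target


# Assumed numeric priorities: Core 1 highest, Core 2 middle, Core 0 lowest.
PRIORITY = {0: 1, 1: 3, 2: 2}

print(assign_memory(0, [1, 2], PRIORITY))  # {1: '1-1', 2: '2-1'}
print(assign_memory(1, [0, 2], PRIORITY))  # {0: '1-1', 2: '1-1'}
print(assign_memory(2, [0, 1], PRIORITY))  # {0: '2-1', 1: '1-1'}
print(assign_memory(1, [], PRIORITY))      # {1: '1-0'}
```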
This system can be extended by the DMA controller, which is integrated into most modern embedded multicore microcontrollers. Each processor core then uses only its own local memory, and the DMA controller copies the data to the local memory of another core. Figure 10 (see PDF) illustrates this separation, showing only the memory areas for intercore communication.
In the following application example, shown in Figure 11 (see PDF), Core 1 provides data to Core 2. Core 1 writes the updated values to its local memory at maximum speed. Subsequently, the DMA controller is activated and writes the new data to Core 2's local memory. Because the DMA controller performs the transfer, Core 2 can also access its own local memory at full speed.
Embedded multicore microcontrollers with two local memories per processor core are also well suited to this approach, since otherwise the DMA controller and the respective processor core could access the same local memory concurrently.
4. Experimental setup
The concept presented in this article is tested on two evaluation boards from Hitex. One carries the first-generation Infineon AURIX TC277, the other the second-generation AURIX TC397. A Lauterbach debugger is used for flashing, debugging, and reading the measured values. The exact designations of the hardware and tools can be found in Table 1.
| Microcontroller | Evaluation board | Clock frequency |
| SAK-TC277TF-64F200S CA ES | TriBoard TC2X7 V1.0 | 200 MHz |
| SAK-TC397XE-256F300S AA EES | TriBoard TC3X7 TH V1.0 | 300 MHz |
| Designation | Use |
| Lauterbach Power Debug Interface / USB3; TRACE32 R2016 | Flash adapter, debugger |
Table 1: Experimental tools
Table 2 below provides a detailed breakdown of the software used, including its version numbers and usage.
| Designation | Use |
| TASKING VX toolset for TriCore v6r2 | Compiler, Linker |
| Infineon Software Framework | Development framework |
| Infineon Low Level Driver 1.0.0.12.0 | Microcontroller driver |
Table 2: Test software used
5. Results
To validate the concept described above, a 4 KB array of 32-bit values is exchanged between the processor cores in each measurement. Timing is performed using the processor cores' internal performance counters, which record both the elapsed ticks and the number of executed instructions. Furthermore, to deliberately provoke concurrent accesses, the copy operations are triggered synchronously using a broadcast interrupt.
Infineon AURIX TC277
To verify the concept, the method was ported to a first-generation AURIX embedded multicore microcontroller. The TC277 derivative uses three processor cores, each of which provides local memory for particularly fast access. In addition, the TC277 integrates a global RAM memory that all cores can access at the same speed [13].
The first series of measurements investigates the extent to which memory usage affects copy time. For this purpose, Core 1 copies a 4 KB array back and forth between the different memories of the TC277 in each measurement. To avoid concurrent access in this series, Core 0, Core 2, and the DMA controller are disabled.
| Source | Destination | Ticks | Instructions |
| Local memory 1 | Local memory 1 | 2076 | 3082 |
| Local memory 1 | Global memory | 12281 | 3082 |
| Global memory | Local memory 1 | 11287 | 3082 |
| Global memory | Global memory | 22529 | 3082 |
| Local memory 1 | Local memory 2 | 8197 | 3082 |
| Local memory 2 | Local memory 1 | 719 | 3082 |
| Local memory 2 | Local memory 2 | 14345 | 3082 |
Table 3: Memory performance during a copy operation, Core 1 active
Table 3 shows that Core 1's access to its own local memory is significantly faster than its access to global memory. Furthermore, the measurements show that accessing the local memory of another processor core is also significantly faster than performing an operation on global memory. The number of instructions required is independent of the source and destination of the copy operation. The duration depends on the memory itself and its connection to the processor core.
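To put these figures in relation, the ticks can be normalized to the 1024 words of the 4 KB array and to the fastest case, the copy within Core 1's own local memory. The short Python calculation below does this for selected rows of Table 3. The variable and function names are illustrative. The instruction count of 3082 corresponds to about three instructions per 32-bit word plus a small fixed overhead, which would be consistent with a simple load, store, and loop sequence, although the copy routine itself is not described here.

```python
WORDS = 4 * 1024 // 4  # 4 KB array of 32-bit values -> 1024 words

# (source, destination): (ticks, instructions), selected rows of Table 3
TABLE_3 = {
    ("local 1", "local 1"): (2076, 3082),
    ("local 1", "global"): (12281, 3082),
    ("global", "local 1"): (11287, 3082),
    ("global", "global"): (22529, 3082),
    ("local 1", "local 2"): (8197, 3082),
    ("local 2", "local 2"): (14345, 3082),
}

BASELINE = TABLE_3[("local 1", "local 1")][0]


def slowdown(source, destination):
    """Ticks of a copy relative to the copy within Core 1's own local memory."""
    return TABLE_3[(source, destination)][0] / BASELINE


for (src, dst), (ticks, instructions) in TABLE_3.items():
    print(f"{src:>8} -> {dst:<8} {ticks / WORDS:6.2f} ticks/word  "
          f"{slowdown(src, dst):5.2f}x  {instructions / WORDS:.2f} instr/word")
```

In this normalization, a copy within the own local memory costs about 2 ticks per word, a copy into another core's local memory about 8, and a copy within global memory about 22.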
For the second series of measurements, concurrent access to the local memory of Core 1 is simulated. For this purpose, Core 0 and Core 2 copy a 4 KB array from Core 1's local memory to their own. This configuration corresponds to the scenario shown in Figure 8.
| Source | Destination | Ticks | Instructions |
| Local memory 1 | Local memory 1 | 4121 | 3082 |
| Local memory 1 | Global memory | 12283 | 3082 |
| Global memory | Local memory 1 | 11288 | 3082 |
| Global memory | Global memory | 22530 | 3082 |
| Local memory 1 | Local memory 2 | 8199 | 3082 |
| Local memory 2 | Local memory 1 | 7197 | 3082 |
| Local memory 2 | Local memory 2 | 14353 | 3082 |
Table 4: Memory performance of Core 1 during concurrent access by Core 0 / 2
As can be seen in the measurement series in Table 4, a significant change in copy time compared to Table 3 is only observed in the first measurement from local memory 1 to local memory 1. The other measurements are not affected by the concurrent access. However, this clearly shows that intercore communication also affects calculations that are executed exclusively on Core 1. This should be taken into account when evaluating the real-time capability of a system.
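The effect on the exclusive copy within local memory 1 can be quantified with a short calculation. The Python sketch below compares the undisturbed value from Table 3 with the value under concurrent access from Table 4 and converts the difference into time. The conversion assumes that one tick corresponds to one CPU clock cycle at the 200 MHz listed in Table 1; this is an assumption, as the counter configuration is not stated.

```python
CLOCK_MHZ = 200  # TC277 clock frequency from Table 1


def ticks_to_us(ticks, clock_mhz=CLOCK_MHZ):
    """Convert ticks to microseconds, assuming one tick per CPU clock cycle."""
    return ticks / clock_mhz


undisturbed = 2076  # Table 3: local memory 1 -> local memory 1
contended = 4121    # Table 4: same copy while Core 0 and Core 2 read

factor = contended / undisturbed
extra_us = ticks_to_us(contended - undisturbed)

print(f"slowdown factor: {factor:.3f}")
print(f"additional delay: {extra_us} us")
```

Under this assumption, the concurrent read accesses of the two other cores roughly double the duration of a copy that involves only Core 1 and its own memory.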
In the third series of measurements, shown in Table 5 and corresponding to Figure 11, the DMA controller is used for intercore communication. Here, too, access to global memory is slower, but the difference compared to local memory is significantly smaller than when the processor cores perform the copy. The instruction count reflects only the instructions Core 1 executes to start the DMA transfer.
| Source | Destination | Ticks | Instructions |
| Local memory 1 | Local memory 1 | 2971 | 17 |
| Local memory 1 | Global memory | 3483 | 17 |
| Global memory | Local memory 1 | 3483 | 17 |
| Global memory | Global memory | 3995 | 17 |
| Local memory 1 | Local memory 2 | 2971 | 17 |
| Local memory 2 | Local memory 1 | 2971 | 17 |
| Local memory 2 | Local memory 2 | 2971 | 17 |
Table 5: Memory performance of the DMA controller
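A direct comparison of the copy performed by Core 1 (Table 3) with the DMA transfer (Table 5) shows where the DMA controller pays off. The following Python calculation uses selected rows of both tables; the dictionary and function names are illustrative.

```python
# (source, destination): ticks, selected rows of Table 3 (copy by Core 1)
CPU_TICKS = {
    ("local 1", "local 1"): 2076,
    ("local 1", "global"): 12281,
    ("global", "local 1"): 11287,
    ("global", "global"): 22529,
    ("local 1", "local 2"): 8197,
    ("local 2", "local 2"): 14345,
}

# Same routes from Table 5 (copy by the DMA controller)
DMA_TICKS = {
    ("local 1", "local 1"): 2971,
    ("local 1", "global"): 3483,
    ("global", "local 1"): 3483,
    ("global", "global"): 3995,
    ("local 1", "local 2"): 2971,
    ("local 2", "local 2"): 2971,
}

CPU_INSTRUCTIONS = 3082  # instructions executed by Core 1 for a copy
DMA_INSTRUCTIONS = 17    # instructions executed by Core 1 to start the DMA


def speedup(route):
    """Ticks of the core copy divided by ticks of the DMA transfer."""
    return CPU_TICKS[route] / DMA_TICKS[route]


for route in CPU_TICKS:
    print(f"{route[0]:>8} -> {route[1]:<8} {speedup(route):5.2f}x")

print("instructions saved on Core 1:", CPU_INSTRUCTIONS - DMA_INSTRUCTIONS)
```

In these rows, the DMA transfer is slower than the core only for the copy within Core 1's own local memory. For every route that crosses the crossbar it is faster, and in all cases Core 1 executes only 17 instead of 3082 instructions.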
Infineon AURIX TC397
With the second generation of the AURIX microcontroller family, Infineon significantly revised the number of processor cores and the memory hierarchy compared to the first generation. The maximum number of processor cores was doubled to six, and each core now has two local memories, which minimizes concurrent accesses during intercore communication. The measurements in this article use the Infineon AURIX TC397, which offers six cores, each with two local memories. In addition, Infineon increased the number of global RAM memories to four. However, since these global memories play a subordinate role in the concept presented here, they serve only as a reference [12].
The measurement series in Table 6 shows that access to local memory 1-0 is identical to the first AURIX generation. Access to the second local memory 1-1 behaves equivalently to local memory 1-0. The only exception is the copy operation in which both the source and the destination are local memory 1-1. This memory appears to have only a single connection to Core 1, which significantly slows down parallel accesses. Comparing Table 3 and Table 6 shows that copy operations involving the global memory require fewer ticks on the second-generation AURIX. This is due to Infineon's improved connection to the global memory. As a result, in the second AURIX generation a core accesses the global memory at the same speed as the local memory of another core.
| Source | Destination | Ticks | Instructions |
| Local memory 1-0 | Local memory 1-0 | 2076 | 3082 |
| Local memory 1-0 | Local memory 1-1 | 2076 | 3082 |
| Local memory 1-1 | Local memory 1-0 | 2076 | 3082 |
| Local memory 1-1 | Local memory 1-1 | 4113 | 3082 |
| Local memory 1-0 | Global memory | 2076 | 3082 |
| Global memory | Local memory 1-0 | 9260 | 3082 |
| Global memory | Global memory | 9239 | 3082 |
| Local memory 1-0 | Local memory 2-0 | 2076 | 3082 |
| Local memory 2-0 | Local memory 1-0 | 9239 | 3082 |
| Local memory 2-0 | Local memory 2-0 | 9239 | 3082 |
Table 6: Memory performance of Core 1 (TC397)
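Because the two platforms run at different clock frequencies, a comparison in absolute time is also informative. The Python sketch below converts selected tick values from Table 3 (TC277) and Table 6 (TC397) into microseconds, assuming that one tick corresponds to one CPU clock cycle at the frequencies listed in Table 1. This assumption, like the names used in the code, is not taken from the measurements themselves.

```python
# Clock frequencies from Table 1; ticks from Table 3 (TC277) and Table 6 (TC397)
PLATFORMS = {
    "TC277": {"mhz": 200, "local->local": 2076,
              "global->local": 11287, "global->global": 22529},
    "TC397": {"mhz": 300, "local->local": 2076,
              "global->local": 9260, "global->global": 9239},
}


def copy_time_us(platform, route):
    """Copy time in microseconds, assuming one tick per CPU clock cycle."""
    data = PLATFORMS[platform]
    return data[route] / data["mhz"]


for route in ("local->local", "global->local", "global->global"):
    old = copy_time_us("TC277", route)
    new = copy_time_us("TC397", route)
    print(f"{route:<15} TC277 {old:7.2f} us   TC397 {new:6.2f} us   "
          f"{old / new:.2f}x")
```

Under this assumption, the copy within the own local memory benefits only from the higher clock frequency, whereas the copy within global memory additionally benefits from the lower tick count.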
6. Discussion
Since the introduction of the first embedded multicore microcontrollers for control units with hard real-time requirements, performance improvements have primarily been achieved by increasing the number of available processor cores. Unfortunately, existing algorithms often cannot be separated in such a way that they can be executed completely independently on different processor cores. For this reason, intercore communication will become increasingly challenging in the coming years as the number of processor cores grows. Both the access speed to shared memory and the effects of concurrent accesses pose a problem.
The concept presented in this article for the effective use of the existing memory hierarchy represents a first step. By prioritizing processor cores, the wait cycles during intercore communication can be distributed in such a way that higher-priority cores are specifically relieved of the load. Furthermore, by distributing the data to be exchanged across different memory locations, the effects of concurrent accesses are significantly reduced.
Future development envisions integrating the concept into a framework that automatically assigns shared data to available memory. To this end, trace data from electronic control units (ECUs) will be analyzed, and the shared values and their associated communication participants will be automatically extracted. Furthermore, the framework will be provided with a general description of the embedded multicore microcontroller and its memory hierarchy. This information will then allow for the calculation of an optimal distribution. Another objective is the integration of the LET paradigm for the respective memory. This will enable better planning of the effects of concurrent accesses and minimize their impact on real-time capability, thereby reducing the worst-case execution time [8][9][14][15].
Bibliography and list of sources
[1] Abbi Ashok and Jens Harnisch. 2017. AURIX – Programming close to hardware for best performance. In Embedded Multi-Core Conference. Infineon Technologies AG, 81726 Munich, Germany.
[2] Infineon Technologies AG 2014. AURIX TC27x C-Step User's Manual V2.2. Infineon Technologies AG, 81726 Munich, Germany.
[3] Infineon Technologies AG 2014. AURIX TC29x B-Step User's Manual V1.3. Infineon Technologies AG, 81726 Munich, Germany.
[4] Infineon Technologies AG 2016. AURIX TC3xx Target Specification V2.0.1. Infineon Technologies AG, 81726 Munich, Germany.
[5] Thomas Barth and Peter Fromm. 2016. Warp 3 between all cores – Development of a fast and secure multicore RTE. In Proceedings of the Embedded Software Engineering Congress 2016.
[6] Günther Bengel, Christian Baun, Marcel Kunze, and Karl-Uwe Stucky. 2015. Masterkurs Parallele und Verteilte Systeme. Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-8348-2151-5
[7] Hartmut Ernst, Jochen Schmidt, and Gerd Beneken. 2016. Grundkurs Informatik. Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-14634-4
[8] Christoph M Kirsch and Ana Sokolova. 2012. The Logical Execution Time Paradigm. In Advances in Real-Time Systems. Springer, 103-120.
[9] Florian Kluge, Martin Schoeberl, and Theo Ungerer. 2016. Support for the Logical Execution Time Model on a Time-predictable Multicore Processor. SIGBED Rev. 13, 4 (Nov. 2016), 61-66. https://doi.org/10.1145/3015037.3015047
[10] Philipp Jungklass and Mladen Berekovic. 2018. Effects of concurrent access to embedded multicore microcontrollers with hard real-time demands. 13th International Symposium on Industrial Embedded Systems (2018).
[11] Philipp Jungklass and Mladen Berekovic. 2018. Performance-Oriented Memory Management for Embedded Multicore Microcontrollers. In 26th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing.
[12] Infineon Technologies AG 2015. TriCore TC1.6.2 Core Architecture. Infineon Technologies AG, 81726 Munich, Germany.
[13] Infineon Technologies AG 2012. TriCore TC1.6P & TC1.6E Core Architecture. Infineon Technologies AG, 81726 Munich, Germany.
[14] Florian Kluge, Mike Gerdes, and Theo Ungerer. 2014. An Operating System for Safety-Critical Applications on Manycore Processors. 2014 IEEE 17th International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (2014), 238–245.
[15] Gang Yao, Rodolfo Pellizzoni, Stanley Bak, Emiliano Betti, and Marco Caccamo. 2012. Memory-centric Scheduling for Multicore Hard Real-time Systems. Real-Time Syst. 48, 6 (Nov. 2012), 681–715. https://doi.org/10.1007/s11241-012-9158-9
Author
Philipp Jungklass, M.Sc., studied computer science at Stralsund University of Applied Sciences and has worked for many years as a development engineer in the automotive sector. His career began with driver programming for communication systems in vehicles. He currently focuses on multicore microcontrollers in safety-critical applications and is responsible for training-related supervision. He also regularly presents his work at embedded systems symposia.
