PLAT4MC: Multicore Performance Optimization with Open Source

Open Source Technologies (APP4MC) in productive usage

Author: Syed Aoun Raza, Robert Bosch GmbH

Paper - Embedded Software Engineering Kongress 2017

With the advent of multi-core ECUs in the automotive domain, the tooling environment that supports multi-core software development has gained significance, especially tools which can provide an early indication of architectural behavior before any code exists. Another important aspect in large-scale automotive production systems is the ability to design and develop multi-core software at module and component level and eventually integrate it into the multi-core system. Domain-specific multi-core development tool platforms which enable analysis (e.g., data consistency) and optimization (memory management, task-to-core mapping, timing simulation and distribution) are not easily available. There are several commercial solutions on the market that support multi-core software scenarios; however, they cannot be applied with off-the-shelf configuration options. The reason for this limitation is the specific customer scenarios in the Bosch solution domain. Another significant hurdle is the existence of huge single-core code bases which have been successfully certified and tested according to automotive standards. Although we at Bosch require tailored solutions for our multi-core software systems and tools, anyone can benefit from our AMALTHEA-based (https://www.eclipse.org/app4mc/) multicore tooling strategy.

1. AMALTHEA Based Multi-Core Tooling

As mentioned previously, the initial motivation was to establish a common tooling platform for Bosch production system development, where several existing and future development tools for multi-core analysis can be synergistically combined. We started with a vision and strategy that this tooling should cover Bosch-specific solutions and support multi-core developers in all design steps, i.e. at different abstraction levels. After several discussions, we identified two categories in the problem and solution space:

Legacy system transition to multi-core
Green Field Approach (GFA) where project development starts from scratch

An important aspect was to reduce the cost of failures and errors by detecting them at early design stages through frontloading techniques.

These requirements resulted in PLAT4MC (see Figure 1, PDF), the Multicore Tooling Platform, with the AMALTHEA object model as its core component. There are several practical reasons for selecting AMALTHEA as the object model. First, it is part of APP4MC, the open source Eclipse project extensively used inside Bosch. Secondly, it has the potential to become an industry-wide standard for multi-core system modelling. Finally, it is a strong candidate for being the software sharing medium between Bosch and its customers. Any use case which can work with AMALTHEA has the potential to benefit from PLAT4MC tooling. The strategy and lessons learned during the development of this tooling platform are quite generic and can be applied in any multicore system scenario.

2. Inside PLAT4MC

PLAT4MC provides different tools to obtain an AMALTHEA model from an existing software specification (AUTOSAR/MSR), from C sources, or from Executable and Linkable Format (ELF) files. The obtained AMALTHEA model can then be refined, enriched or adapted according to the optimization use case. For this purpose, PLAT4MC additionally provides interfaces and tools such as data consistency checks, local core memory optimization and the merger. For future use cases we plan AMALTHEA-based visualization editors.

3. PLAT4MC Features

Currently, the platform offers the following capabilities (see Figure 2, PDF):

SCA2AMALTHEA: AMALTHEA model export from C source code analysis
ELF2AMALTHEA: AMALTHEA model export from ELF
AUTOSAR/MSR2AMALTHEA: AMALTHEA model export from MSR/AUTOSAR specification
AMALTHEA MERGER: Merge/Diff tooling for AMALTHEA models obtained from several sources
RB_LOCOMO: Local core memory optimizations
Data Consistency Checks (DCC): Static checks for data access inconsistencies
Multi-Core Cookbook Checks (MCC): System-specific checks for data access inconsistencies
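
To make the idea of a static data access check more concrete, the following is a minimal sketch in Python. It is not the DCC implementation; it only assumes that a model provides a task-to-core mapping and a list of read/write accesses to data elements (called labels here). The function name, the task names and the label names are hypothetical.

```python
def find_cross_core_conflicts(task_core, accesses):
    """Flag labels that are written on one core and accessed from another.

    task_core: {task name: core index}
    accesses:  list of (task name, label name, mode), mode is "r" or "w"
    Returns {label: sorted list of cores that access it}.
    """
    users = {}    # label -> set of cores that read or write it
    written = set()  # labels with at least one writer
    for task, label, mode in accesses:
        users.setdefault(label, set()).add(task_core[task])
        if mode == "w":
            written.add(label)
    # A label is suspicious if it has a writer and is used on more than one core.
    return {label: sorted(users[label])
            for label in sorted(written) if len(users[label]) > 1}


# Hypothetical model content, for illustration only.
task_core = {"T_10ms": 0, "T_1ms": 1, "T_Bg": 0}
accesses = [
    ("T_10ms", "engine_speed", "w"),
    ("T_1ms", "engine_speed", "r"),   # read on another core: flagged
    ("T_10ms", "calib_map", "r"),
    ("T_1ms", "calib_map", "r"),      # read-only everywhere: not flagged
    ("T_Bg", "counter", "w"),
    ("T_10ms", "counter", "r"),       # same core as the writer: not flagged
]
conflicts = find_cross_core_conflicts(task_core, accesses)
print(conflicts)
```

A flagged label is only a candidate for closer inspection; whether it is a real inconsistency depends on protection mechanisms that this sketch does not model.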

4. Software Distribution and Load Balancing

The initial step toward a multi-core system design is to distribute the software according to the underlying hardware architecture. Afterwards, the system architect can simulate the system and refine the distribution based on the simulation results. In the same way, the load on the cores can be optimized further, i.e. load balancing can be achieved.
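
As an illustration of load balancing, here is a minimal Python sketch of a common greedy heuristic: the heaviest task is assigned to the currently least loaded core. This is not the method used by PLAT4MC or by any simulation tool; the task names and load values are hypothetical, and real distributions must also respect timing and data dependencies.

```python
import heapq


def balance_tasks(task_loads, num_cores):
    """Greedy heuristic: place tasks, heaviest first, on the least loaded core.

    task_loads: {task name: load in percent of one core}
    Returns (mapping {task: core}, list of accumulated load per core).
    """
    # Min-heap of (accumulated load, core index).
    heap = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    mapping = {}
    for task, load in sorted(task_loads.items(), key=lambda kv: kv[1], reverse=True):
        core_load, core = heapq.heappop(heap)
        mapping[task] = core
        heapq.heappush(heap, (core_load + load, core))
    core_loads = [0.0] * num_cores
    for task, core in mapping.items():
        core_loads[core] += task_loads[task]
    return mapping, core_loads


# Hypothetical task loads, for illustration only.
task_loads = {"Task_1ms": 30.0, "Task_10ms": 25.0, "Task_Sync": 20.0,
              "Task_Bg": 15.0, "Task_100ms": 10.0}
mapping, core_loads = balance_tasks(task_loads, num_cores=2)
print(mapping, core_loads)
```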

As mentioned previously, our requirement analysis within Bosch identified two approaches which PLAT4MC multicore tooling should support: legacy system transition and the green field approach. We therefore provide several ways to generate AMALTHEA models that cover these use cases. Further, PLAT4MC tools aim to shorten the stages before feedback becomes available (see Figure 3, PDF). Here, performance simulation tools like Timing Architects (TA) can help to migrate projects from single-core to multi-core HW or from one multi-core device to another.

4.1 Frontloading with Performance Simulation

Performance simulation can provide an initial indication of the software behavior at the design level. Figure 4 (see PDF) shows how a system can be simulated in TA using an AMALTHEA data model. The AMALTHEA model components can be generated from static code analysis or from an AUTOSAR/MSR specification. As depicted in Figure 4, different types of information are available after a simulation run and give an initial impression of the system behavior even before a single line of code exists.

5. Bringing It All Together in Performance Optimization

After the distribution of SW processes to tasks and the assignment of tasks to cores has been completed, it is time to give the ECU a final performance boost by optimizing memory allocation of data and code.

For demonstration purposes we have chosen a sample engine ECU software project which runs on the Infineon AURIX 2G (2nd Generation) microcontroller. This µC provides two local RAMs and one local flash memory for each of its 6 cores. Every core can access all the memories, but because of its direct connection a core accesses its local memories much faster than the non-local memories, which can only be reached through the Shared Resource Interconnect Bus. The advantages of direct access are relevant for read/write access to variables (RAM), read access to constants (Flash) and fetching code (Flash) for function execution.

5.1 Modelling Local and Global Bus Latencies

The µC memory structure can be provided in an AMALTHEA HW model that also contains the access latencies from a given core to any memory on the bus. The simple model of fixed latencies does not consider caches or instruction pipelines, but is sufficient for memory allocation optimization.
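
A fixed-latency model of this kind can be pictured as a simple lookup table from (core, memory) to access cost. The Python sketch below is only an illustration of that idea, not the AMALTHEA HW model format. The memory names and the cycle counts are hypothetical and are not AURIX data; one local memory per core and one global memory are assumed for brevity.

```python
# Hypothetical latencies in CPU cycles per access (illustrative values only).
LOCAL_LATENCY = 1    # core accesses its own local memory
REMOTE_LATENCY = 8   # core accesses another core's local memory over the bus
GLOBAL_LATENCY = 10  # core accesses the global memory over the bus


def build_latency_table(num_cores):
    """Return {(core, memory): latency in cycles} for a simplified memory layout."""
    memories = [f"mem_core{c}" for c in range(num_cores)] + ["mem_global"]
    table = {}
    for core in range(num_cores):
        for mem in memories:
            if mem == f"mem_core{core}":
                table[(core, mem)] = LOCAL_LATENCY
            elif mem == "mem_global":
                table[(core, mem)] = GLOBAL_LATENCY
            else:
                table[(core, mem)] = REMOTE_LATENCY
    return table


latency = build_latency_table(num_cores=3)
print(latency[(0, "mem_core0")], latency[(0, "mem_core1")], latency[(0, "mem_global")])
```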

5.2 Optimization Goal: Core Load Improvement

Every core spends a significant amount of time on data and code access. The goal of the RB_Locomo optimization is to allocate data and code to the available memories in such a way that the overall performance of the µC is optimized. In other words, the cross-core loads (CPU loads that stem from accesses of core X to memories local to core Y) are reduced to a minimum.

This optimization only affects the linking stage of the SW build, i.e. the code itself is not changed by the allocation distribution. In order to determine the optimal distribution, RB_Locomo calculates for every core the CPU load that is consumed by core-to-memory communication. Since RB_Locomo runs on a built project, the current location of every memory element, and therefore also the bus latency from any core to this element, is known. The information that is still missing is how often a specific core accesses a data or code element. This dynamic data is provided to RB_Locomo by means of a Call/Access Statistics table.

The Call/Access Statistics table contains the average number of accesses per second from every core to data and/or code elements. The PLAT4MC tool suite provides functionality to extract this data either from debugger traces taken on a running ECU at a specific operating point or from static code analysis (SCA) of the C source code or of the opcodes in the ELF file.
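
The following Python sketch shows one plausible way to combine such a statistics table with fixed latencies: for each core, the accesses per second are multiplied by the latency of the memory that holds the element, and the resulting cycles per second are divided by the core clock. The formula, the function name and all numbers are assumptions for illustration; the source does not state how RB_Locomo computes the load.

```python
def memory_load_per_core(access_rate, allocation, latency, clock_hz):
    """Estimate the fraction of CPU time each core spends on memory access.

    access_rate: {(core, element): average accesses per second}
    allocation:  {element: memory that currently holds it}
    latency:     {(core, memory): cycles per access}
    clock_hz:    core clock in cycles per second
    """
    cycles = {}
    for (core, element), rate in access_rate.items():
        cost = rate * latency[(core, allocation[element])]
        cycles[core] = cycles.get(core, 0) + cost
    return {core: total / clock_hz for core, total in cycles.items()}


# Hypothetical inputs, for illustration only.
access_rate = {(0, "var_a"): 1_000_000, (1, "var_a"): 200_000, (1, "func_b"): 500_000}
allocation = {"var_a": "mem_core0", "func_b": "mem_global"}
latency = {(0, "mem_core0"): 1, (1, "mem_core0"): 8, (1, "mem_global"): 10}
load = memory_load_per_core(access_rate, allocation, latency, clock_hz=100_000_000)
print(load)
```

In this example core 1 pays for every access with a bus latency, so its memory-related load is noticeably higher than that of core 0, which only uses its local memory.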

With a given allocation distribution and the access statistics, the CPU load for memory communication can be calculated for every core. The RB_Locomo optimization algorithm minimizes the CPU loads over all cores. Furthermore, the algorithm can be parameterized, e.g. in order to prioritize the optimization of one core over the others. The RB_Locomo optimization algorithm has been published as a patent (US2017090820 AA, DE102015218589 A1, …) in several countries.
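
The actual RB_Locomo algorithm is described in the patent documents and is not reproduced here. Purely to illustrate the kind of problem being solved, the Python sketch below uses a simple greedy heuristic as a stand-in: elements are placed, most frequently accessed first, into the memory with the lowest weighted access cost that still has room. The `core_weight` parameter is a hypothetical way to prioritize one core; all names, sizes and rates are made up.

```python
def greedy_allocate(sizes, access_rate, latency, capacity, core_weight):
    """Greedy stand-in heuristic, NOT the RB_Locomo algorithm.

    sizes:       {element: size in bytes}
    access_rate: {(core, element): accesses per second}
    latency:     {(core, memory): cycles per access}
    capacity:    {memory: free bytes}
    core_weight: {core: priority weight, higher means more important}
    """
    free = dict(capacity)

    def weighted_cost(element, memory):
        return sum(core_weight[core] * rate * latency[(core, memory)]
                   for (core, el), rate in access_rate.items() if el == element)

    def heat(element):
        return sum(core_weight[core] * rate
                   for (core, el), rate in access_rate.items() if el == element)

    allocation = {}
    for element in sorted(sizes, key=heat, reverse=True):
        candidates = [m for m in free if free[m] >= sizes[element]]
        if not candidates:
            raise ValueError(f"no memory has room for {element}")
        best = min(candidates, key=lambda m: weighted_cost(element, m))
        allocation[element] = best
        free[best] -= sizes[element]
    return allocation


# Hypothetical inputs, for illustration only.
latency = {(0, "mem_core0"): 1, (0, "mem_core1"): 8, (0, "mem_global"): 10,
           (1, "mem_core0"): 8, (1, "mem_core1"): 1, (1, "mem_global"): 10}
capacity = {"mem_core0": 8, "mem_core1": 12, "mem_global": 64}
sizes = {"a": 8, "b": 8, "c": 4}
access_rate = {(0, "a"): 1000, (0, "b"): 500, (1, "b"): 100, (1, "c"): 300}
allocation = greedy_allocate(sizes, access_rate, latency, capacity,
                             core_weight={0: 1.0, 1: 1.0})
print(allocation)
```

Here element b ends up in the local memory of core 1 although core 0 accesses it more often, because the local memory of core 0 is already full. A greedy pass like this can therefore miss the best overall distribution, which is why a dedicated optimization algorithm is used in practice.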

The process flow of data/code allocation optimization with RB_Locomo is shown in Figure 5 (see PDF).

For this project, the access statistics data was derived from debugger traces taken at an engine speed of 4000 rpm. The call statistics data for functions was calculated by static code analysis based on the opcodes in the ELF file. The SW contains ~50k variables and functions for which an allocation optimization has been performed.

5.3 Outcome

Figure 6 (see PDF) depicts the linker memory distribution for the built sample project before and after RB_Locomo optimization. The initial allocation is defined by default linker rules which first fill up the global memory and subsequently the core local memories. The (automatically generated) linker rules for the optimized allocation deploy data and code to all memories in order to achieve optimal µC performance.

The effective core load improvement has been measured in a HiL (Hardware in the Loop) LabCar environment before and after RB_Locomo optimization – see Figure 7 (PDF) for the measurement results.

The load improvement potential depends on the software, the initial distribution of data and code, and the operating point. For this project, an average absolute per-core load improvement of up to 6% could be achieved, depending on the operating point. The improvement increases with the absolute CPU load, i.e. for highly loaded cores a better absolute improvement can be expected. For the above project, the relative average per-core load improvement amounts to more than 20%.

6. Summary

In this paper we have introduced PLAT4MC and provided an end-to-end use case from a Bosch internal production system to show its benefits. Further, we described how the open source AMALTHEA model from the APP4MC project can be used as an exchange format to fulfill multicore system tooling requirements. We have also discussed how to obtain an AMALTHEA model from different sources and, finally, provided a deep dive into the memory optimization use case with the PLAT4MC tool chain.
