Parallel design for real-time software

Applicability in the AUTOSAR environment

Author: Ralph Mader, Continental Automotive GmbH

Contribution – Embedded Software Engineering Congress 2017

Patterns for designing parallel-executable algorithms have not yet been systematically adopted in embedded real-time systems. Continental is currently investigating the applicability of patterns in AUTOSAR-based applications. This presentation will focus on patterns applicable in this field and describe the analysis methodology. Furthermore, the application of a pattern will be demonstrated using a concrete example from the engine control environment. Possible solutions for scheduling support, such as state transitions synchronized across core boundaries or the application of logical execution time, will be presented in this context.

1. Introducing the application domain

Embedded systems in the powertrain of a motor vehicle have been in use for more than 25 years. They serve to control combustion, gear shifting, and battery management, as well as increasingly to implement higher-level algorithms in domain control units. Multi-core computers have entered this field of application in the last five years or so. Microcontrollers with up to three nearly homogeneous cores and a clock frequency of up to 300 MHz are used. The software running on these systems has grown considerably over the decades; in an engine control unit, it comprises approximately 2 million lines of code and requires about 6 MB of program memory.

The software architecture used is based on the AUTOSAR 4.0 standard.[1] During the transition to multi-core computers, Continental essentially redesigned the application software to be "multicore ready," with the aim of distributing software across cores and continuing to ensure data consistency.[2] During distribution, care was taken to maintain functional chains and essentially exploit the inherent task parallelism. With approximately 60 tasks currently (see Figure 1) in an engine control system, the resulting speedup has been sufficient to meet the performance requirements of the current generation of ECUs.

(see Fig. 1, PDF)

2. Challenges of the application

A key challenge for the applications mentioned in Section 1 is real-time capability. These applications run their computations in periodic tasks with periods as short as 1 millisecond and deadlines as short as 200 microseconds. Furthermore, there are repetitive tasks linked to the engine speed and valve clearance of internal combustion engines. The calculations performed within these tasks are a crucial factor for combustion quality and, consequently, exhaust gas composition. Due to increasing demands on exhaust gas purity and fuel consumption, these calculations are expected to require even more computing time in the future. Therefore, it is essential to design these algorithms in a way that allows for parallel execution to ensure continued real-time performance.

3. How can parallelism be found?

The question now arises as to which strategy is most effective for finding parallelism. Code analysis of these types of applications very often reveals longer computational chains in which clusters can be formed and distributed across cores. Furthermore, one can search the source code and, for example, analyze it for loop parallelism. Such analysis frequently shows that the loops are very short, usually iterating only over the number of cylinders. On the given computer architectures, parallelizing such short loops across cores is not very efficient.

For this reason, a different method was adopted. In interviews, the functional developers, mostly mechanical engineers, were asked about the functional problem to be solved, and based on this problem definition, an attempt was made to completely redesign the software to enable parallel execution. Several design patterns were presented to the development teams to support the ideation process. The following sections present some of the design patterns that proved to be feasible.

4. Design Patterns for Parallel Programming

4.1 Pipe and Filter Design Pattern

The "Pipe and Filter" design pattern helps identify parallelism at a structural level, even when a series of tasks are interdependent. With a forward data flow, and where task execution does not depend on the result of the preceding stage, different tasks can be grouped into stages, similar to the "pipeline" method in microprocessors (see Figure 2, PDF).

The granularity of the stages must be balanced with the synchronization effort. Ideally, all stages should have approximately the same execution time. If this is not the case, the longest stage determines the timing of subsequent stages.

In a control application, such as an engine control unit, the data flow for the relevant data of a system function can be assumed to be forward-flowing. The various tasks can be categorized into stages. One scheme for this categorization is the phase model [4] used at Continental, which can be translated into this design pattern.
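The stage structure described above can be sketched as follows. This is a minimal illustration in Python rather than ECU production code: each stage runs in its own thread and the pipes are FIFO queues. The helper names `run_stage` and `run_pipeline` and the three stages `acquire`, `compute`, and `actuate` are hypothetical and not taken from the original work.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the data stream


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(stages, inputs):
    """Run each stage in its own thread, connected by FIFO queues (the pipes)."""
    pipes = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, pipes[i], pipes[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in inputs:
        pipes[0].put(item)
    pipes[0].put(SENTINEL)
    results = []
    while True:
        out = pipes[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical three-stage chain: acquire -> compute -> actuate.
def acquire(raw):
    return raw * 0.5        # e.g. scale a raw sensor value


def compute(value):
    return value + 1.0      # e.g. apply a correction


def actuate(value):
    return round(value, 2)  # e.g. quantize the actuator command


commands = run_pipeline([acquire, compute, actuate], [2, 4, 6])
```

Because every stage is a single thread reading from a FIFO queue, the results arrive in input order. In steady state such a pipeline can deliver at most one result per execution time of its slowest stage, which is why balanced stages matter.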

4.2 Divide and Conquer (Algorithm Strategy Pattern)

The "Divide and Conquer" pattern can be applied to many algorithms. A task is divided into subtasks. During the design phase, it is important to ensure that these subtasks can be solved independently. This allows the subtasks to be offloaded to different cores (see Figure 2, PDF). Finally, the partial results must be merged, which entails a certain amount of synchronization effort. There are several use cases for this in control applications. It is advantageous to combine several independent algorithms of this type to minimize the synchronization effort.
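The divide, solve, and merge steps can be sketched as follows. This is a minimal illustration under stated assumptions: the workload (a sum of squares), the helper names, and the worker count of three are hypothetical. In standard CPython, threads do not speed up CPU-bound work, so the sketch shows the structure of the pattern rather than a measurable speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` contiguous chunks of nearly equal size."""
    size, rest = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < rest else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Independent subtask: needs only its own chunk, no shared state."""
    return sum(x * x for x in chunk)


def sum_of_squares_parallel(data, workers=3):
    """Divide, solve the subtasks concurrently, then merge the partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)  # merge step: the only synchronization point


total = sum_of_squares_parallel(list(range(10)))
```

The merge is the only point where the workers have to wait for each other, which is why bundling several independent algorithms behind one merge point reduces the synchronization effort.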

4.3 Bulk Synchronous Parallel

The "Bulk Synchronous Parallel Pattern" allows tasks running on different cores to be merged at a barrier and thereby synchronized. At the beginning of a parallel phase, the data is made available to each task in a local copy from global memory. During execution, work is performed on this local copy. Communication during task execution is only permitted between runnables within that task.

Once all parallel tasks have finished, the results are made available again in global memory before the next phase starts.

This type of barrier can be used to implement both the "Divide and Conquer" and "Pipe and Filter" patterns. Successive phases can represent the different stages of the pipeline.
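A minimal sketch of such a barrier-separated phase, assuming three workers that stand in for the cores: each worker copies the global data at the start of a phase, works only on its copy, and the results are written back once per phase when all workers have reached the barrier. The names `global_memory`, `staged`, and `publish`, and the trivial computation, are hypothetical.

```python
import threading

NUM_CORES = 3  # assumption: one worker thread per core

global_memory = {"x": 1}
staged = [None] * NUM_CORES  # per-core results awaiting publication
history = []                 # value of x after each phase, for inspection


def publish():
    """Barrier action: runs exactly once per phase, after all workers arrived."""
    global_memory["x"] = sum(staged)
    history.append(global_memory["x"])


barrier = threading.Barrier(NUM_CORES, action=publish)


def worker(core_id, phases):
    for _ in range(phases):
        local = dict(global_memory)        # local copy at the start of the phase
        local["x"] = local["x"] + core_id  # work only on the local copy
        staged[core_id] = local["x"]
        barrier.wait()                     # phase ends; publish() runs before release


threads = [threading.Thread(target=worker, args=(i, 2)) for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

`threading.Barrier` runs its action in one thread before any waiting thread is released, so no worker can start the next phase before the global data has been updated.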

(see Fig. 3, PDF)

5. Implementation examples

5.1 State transition synchronized across cores

In vehicle control units, state changes occur very frequently. Some of these have an impact on the overall system, which means that in an application distributed across the cores of a controller, these state changes must be synchronized. To avoid impacting the real-time behavior of the cyclic tasks, an "event coordinator" terminates the cyclically running tasks after a system state change request is detected and delays activated tasks with less critical deadlines; this creates a free time slot in the overall system. The event coordinator then simultaneously fires a state change task on each core. Initialization functions can be executed very effectively in parallel within these tasks, as they are typically uncoupled or only weakly coupled. Once the state change tasks have finished on all cores, the event coordinator releases the execution of the cyclic tasks again.
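The sequence described above can be modeled roughly as follows. This is a strongly simplified sketch, not the actual coordinator: a flag stands in for terminating and delaying the cyclic tasks, threads stand in for cores, and the class names, method names, and the state name `AFTERRUN` are hypothetical.

```python
import threading


class Core:
    """Simplified model of one core with a cyclic task and a state change task."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.cyclic_enabled = True
        self.state = "RUN"
        self.log = []

    def cyclic_task(self):
        if self.cyclic_enabled:  # skipped while a state change is in progress
            self.log.append("cyclic")

    def state_change_task(self, new_state):
        self.log.append(f"init:{new_state}")  # initialization work goes here
        self.state = new_state


class EventCoordinator:
    def __init__(self, cores):
        self.cores = cores

    def change_state(self, new_state):
        # 1. Create a free time slot: hold back the cyclic tasks on every core.
        for core in self.cores:
            core.cyclic_enabled = False
        # 2. Fire one state change task per core and let them run in parallel.
        threads = [
            threading.Thread(target=core.state_change_task, args=(new_state,))
            for core in self.cores
        ]
        for t in threads:
            t.start()
        # 3. Wait until the state change has finished on all cores.
        for t in threads:
            t.join()
        # 4. Release the cyclic tasks again.
        for core in self.cores:
            core.cyclic_enabled = True


cores = [Core(i) for i in range(3)]
coordinator = EventCoordinator(cores)
for core in cores:
    core.cyclic_task()
coordinator.change_state("AFTERRUN")
for core in cores:
    core.cyclic_task()
```

Each core's log then reads cyclic, init, cyclic: no cyclic task runs between the request and the completion of the state change on all cores.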

This method significantly reduces the required computing time for such state transitions. Initializing an error memory with approximately 1000 error locations yields the most impressive results, as this is one of the few ways to benefit from loop optimization in a control application. The degree of parallelization is nearly 100%, which, according to Amdahl's Law, corresponds to a speedup equal to the number of cores.
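For reference, Amdahl's Law in its usual form is given below. The symbols are introduced here for illustration and do not appear in the original: p is the fraction of the work that can be parallelized and N is the number of cores.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad \lim_{p \to 1} S(N) = N
```

As an illustrative calculation, p = 0.99 on N = 3 cores gives S = 1 / (0.01 + 0.33) ≈ 2.94, which is close to the ideal speedup of 3.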

(see Fig. 4, PDF)

5.2 Logical Execution Time (LET)

Describing and implementing the timing behavior of multi-core computer systems using the logical execution time model allows the implementation of both the "Pipe and Filter" and "Divide and Conquer" patterns described in Section 4.

By dividing a time period into segments with logical execution time, it is possible to represent computational chains that are processed sequentially but include parallelization possibilities in certain sections. For example, analyzing the sequence of a task, as shown in Figure 5, reveals this pattern. The sequential and parallel clusters thus identified can be processed within a logical execution timeframe as shown in Figure 6 [5]. Communication between sequentially executed LETs occurs without additional data consistency mechanisms, while communication between parallel tasks is protected by consistency buffers [6].
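The basic LET idea can be sketched as a small simulation over logical time: a task takes a snapshot of its input at the start of its interval, and its output becomes visible only at the end of the interval, regardless of when the computation actually finishes within it. This is a minimal model under stated assumptions; the class `LetTask`, the function `simulate`, and the example signals are hypothetical and do not reproduce the implementation in [5] or [6].

```python
class LetTask:
    """A task whose output becomes visible only at the end of its LET interval."""

    def __init__(self, name, period, func, source):
        self.name = name      # also the name of the signal this task writes
        self.period = period  # length of the LET interval in ticks
        self.func = func
        self.source = source  # signal read at the start of the interval
        self.snapshot = None  # input buffer
        self.pending = None   # output buffer, published at the interval end


def simulate(tasks, signals, ticks):
    """Advance logical time; return a copy of all signals after each tick."""
    trace = []
    for t in range(ticks):
        due = [task for task in tasks if t % task.period == 0]
        # 1. Interval end: publish what was computed during the last interval.
        if t > 0:
            for task in due:
                signals[task.name] = task.pending
        # 2. Interval start: every due task takes its input snapshot.
        for task in due:
            task.snapshot = signals[task.source]
        # 3. Computation: may run at any time within the interval, on any core.
        for task in due:
            task.pending = task.func(task.snapshot)
        trace.append(dict(signals))
    return trace


signals = {"sensor": 1, "a": 0, "b": 0}
tasks = [
    LetTask("a", 1, lambda x: x + 1, "sensor"),  # LET interval: 1 tick
    LetTask("b", 2, lambda x: x * 10, "a"),      # LET interval: 2 ticks
]
trace = simulate(tasks, signals, 5)
```

Because all outputs are published before any snapshot is taken, the observable data flow does not depend on the order or the core in which the tasks actually execute. In the example, task b sees the new value of a only at tick 2 and publishes its own result at tick 4.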

(see Figs. 5 and 6, PDF)

6. Summary and Outlook

As shown, only limited parallelism has been identified in control applications beyond the task parallelism already exploited. Nevertheless, some patterns can be identified and successfully applied. A key factor for success is considering parallelism during the function design phase. An architectural framework, such as that provided by the application of the logical execution time model, can be made available to developers in an abstracted form similar to the phase concept [4]. This would significantly simplify the applicability of this approach. Further support through design tools is desirable in this regard.

List of sources

[1] AUTOSAR Consortium, in https://www.autosar.org/index.php?p=3&up=2&uup=0, Rev 4.0.3.

[2] D. Claraz, F. Grimal, T. Ledier, R. Mader, and G. Wirrer, “Introducing multi-core at automotive engine systems,” in ERTS2, 2014.

[3] M. Negrean, R. Ernst, and S. Schliecker, “Mastering timing challenges for the design of multi-mode applications on multicore real-time embedded systems,” in 6th International Congress of Embedded Real-Time Software and Systems (ERTS), 2012.

[4] D. Claraz, S. Kuntz, U. Margull, M. Niemetz, and G. Wirrer, “Deterministic Execution Sequence in Component Based Multi-Contributor Powertrain Control Systems,” in ERTSS 2012, 2012.

[5] J. Hennig, H. von Hasseln, H. Mohammad, S. Resmerita, S. Lukesch, and A. Naderlinger, “Towards parallelizing legacy embedded control software using the LET programming paradigm,” in Proc. of WiP Papers of the 22nd IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS '16), 2016.

[6] S. Resmerita, A. Naderlinger, and S. Lukesch, “Efficient Realization of Logical Execution Times in Legacy Embedded Software,” in MEMOCODE'17, Vienna, Austria, 2017.

Author

Ralph Mader studied Electrical Engineering at the University of Applied Sciences in Regensburg. He worked on the design and selection of microcontrollers for the Powertrain area and on the efficient use of microcontroller resources. Since 2010, he has also been leading the development of the multi-core software architecture for engine management systems. Mr. Mader represents Continental Automotive GmbH in several research projects in this area.


Published by

weissblau media
