Optimization approaches at the RTOS/code level; single/multicore
Author: Peter Gliwa, GLIWA GmbH embedded systems
Contribution – Embedded Software Engineering Congress 2018
The resource "computing time" often becomes scarce during the development of many projects. The following section will highlight some practical approaches for optimizing ECU software with regard to runtime in such situations. Furthermore, measures will be discussed to prevent runtime problems early on in the design, configuration, and implementation phases.
Over the past 20 years, significant progress has been made regarding timing in the development of ECU software. While timing was previously only explicitly considered when it caused problems, it is now often addressed systematically and early in the development process. For example, the operating system configuration is no longer simply adopted from the previous project and tweaked here and there. Instead, thorough consideration is given to the timing requirements and how these can be met through appropriate operating system configuration, task distribution across the various cores of a multicore processor, and other measures.
Techniques such as scheduling simulation or scheduling analysis are being used more and more frequently. Nevertheless, virtually every project reaches a point where the behavior of the real system deviates, sometimes severely, from the expected behavior, that is, from the simulated or modeled timing. Only an analysis of the real system can help here, if necessary in the real environment, i.e., in the vehicle.
Runtime analysis: Which technology is suitable for which situation?
Before we turn to specific timing analysis techniques, it is important to clarify that timing analysis should always begin with two fundamental questions. First: What phase is the project in? Possible answers are "early" or "late," or, to use the terminology of the V-model, "design," "implementation," or "verification." Second: At what level of abstraction should the timing analysis be conducted? Possible levels are the "code level," "scheduling level," or "network level." Any examination of a specific timing aspect can thus be located in a coordinate system with the axes "phase/project duration" and "level." Figure 1 (see PDF) illustrates this using V-models at the respective level.
The same levels can be found in Figure 2 (see PDF) below the granularity axis. Above this axis are the various timing analysis techniques; their horizontal position roughly indicates the levels they cover. "Static code analysis," for example, takes place at the code level and covers the range from "Opcode States" to "TASK/ISR".
Figure 3 (see PDF) supplements the overview with the aspects "type of timing analysis" and the (project) phase already mentioned.
A brief description of how the different analysis techniques work can be found on the timing poster [1], from which the previous illustrations are also taken.
The following list shows use cases for the various timing analysis techniques.
- Static code analysis
- Use case: Determination of the WCET (worst-case execution time) independent of hardware availability and independent of test vectors
- Notes: The impact of interrupts on the cache and pipeline is ignored, as are various multicore effects such as access conflicts at the memory interface. Indirect function calls and loop bounds may not be resolved automatically and must then be "annotated" by hand, which is error-prone.
- Code simulation
- Use case: Rough estimate of the CET (core execution time) for the given test scenario
- Note: Code simulation does not play a major role in timing analysis.
- Measurement
- Use cases: Analysis of the real system (the software runs on the target hardware) for profiling, verification, or monitoring purposes.
- Notes: Timing profiling refers to the determination of timing parameters such as CPU utilization, CET (core execution time), RT (response time), etc. The results depend on the test vectors that the software processed during the measurement.
- Tracing
- Use cases: Analysis of the real system (the software runs on the target hardware) for the purpose of visualization, debugging, optimization, profiling or verification.
- Notes: Tracing refers to the recording of events for later analysis and visualization. With regard to timing, scheduling traces, which include events such as "activation," "start," "interruption," and "termination" of tasks, are particularly well-suited. A distinction is made between hardware-based and software-based tracing. The former can function without any software modification, while the latter involves instrumenting the software, which allows the same hardware to be used as in the actual vehicle.
- Scheduling simulation
- Use case: Design and optimization of scheduling concepts and operating system configuration; analysis of typical scheduling behavior.
- Static scheduling analysis
- Use case: Design and optimization of scheduling concepts and operating system configuration; analysis of worst-case scheduling behavior, for example, determining the WCRT (worst-case response time)
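To make the idea of software-based tracing from the list above more concrete, the following is a minimal sketch of an instrumentation that records scheduling events in a ring buffer and derives a response time from them. It is not taken from any particular tracing tool: all names (`trace_record`, `trace_response_time`, `fake_timer_ticks`), the buffer size, and the numeric task IDs are assumptions for illustration, and the timer is simulated by a variable so that the sketch also runs on a host PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Event types named after the scheduling events mentioned in the text. */
typedef enum {
    EV_ACTIVATION,
    EV_START,
    EV_INTERRUPTION,
    EV_TERMINATION
} trace_event_t;

typedef struct {
    uint32_t      timestamp;  /* timer ticks at the time of the event */
    uint8_t       task_id;    /* hypothetical numeric task identifier */
    trace_event_t event;
} trace_entry_t;

#define TRACE_CAPACITY 256u   /* assumption: power of two, sized for the demo */

static trace_entry_t trace_buffer[TRACE_CAPACITY];
static uint32_t      trace_count;   /* total number of events recorded so far */

/* Stand-in for a free-running hardware timer. On a real target this would
 * read a timer register instead of a variable. */
static uint32_t fake_timer_ticks;
static uint32_t read_timer(void) { return fake_timer_ticks; }

/* Record one event; the oldest entries are overwritten once the buffer is
 * full. On a real system this must not be interrupted (e.g. call it with
 * interrupts disabled). */
void trace_record(uint8_t task_id, trace_event_t event)
{
    trace_entry_t *e = &trace_buffer[trace_count % TRACE_CAPACITY];
    e->timestamp = read_timer();
    e->task_id   = task_id;
    e->event     = event;
    trace_count++;
}

/* Response time of the oldest complete instance of a task in the buffer:
 * ticks from its ACTIVATION to the following TERMINATION.
 * Returns 1 and writes *out_ticks on success, 0 if no such pair exists. */
int trace_response_time(uint8_t task_id, uint32_t *out_ticks)
{
    assert(out_ticks != NULL);
    uint32_t n = (trace_count < TRACE_CAPACITY) ? trace_count : TRACE_CAPACITY;
    uint32_t first = trace_count - n;        /* sequence number of oldest entry */
    uint32_t t_activation = 0u;
    int      have_activation = 0;

    for (uint32_t i = 0u; i < n; i++) {
        const trace_entry_t *e = &trace_buffer[(first + i) % TRACE_CAPACITY];
        if (e->task_id != task_id) {
            continue;
        }
        if (e->event == EV_ACTIVATION && !have_activation) {
            t_activation    = e->timestamp;
            have_activation = 1;
        } else if (e->event == EV_TERMINATION && have_activation) {
            *out_ticks = e->timestamp - t_activation;  /* unsigned, wrap-safe */
            return 1;
        }
    }
    return 0;
}
```

In such a setup, calls to `trace_record` would typically be placed in operating system hooks or at the beginning and end of tasks, and the recorded buffer would be evaluated offline or, as here, directly on the target.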
Runtime optimization
It is impossible to create a complete list of all runtime optimization approaches. Therefore, the following will only address a few fundamental aspects and present some concrete measures as examples.
As a general rule, runtime optimization should always be done "top-down," i.e., from the upper to the lower levels. If you were to start optimizing the code directly, you wouldn't know whether the code in question would even be executed at a time when timing is critical.
It is better to first gain a good understanding of the current situation at the scheduling level, then to carry out optimizations at this level, and only then to address the remaining "hotspots" by means of code optimization.
The importance of this good understanding is unfortunately often underestimated. In many projects, I've observed integrators who, upon seeing the first downloaded trace, were somewhat dismayed to find that the system behaved completely differently than expected and previously simulated.
Runtime optimization at the RTOS (scheduling) level
The cardinal rule for runtime optimization at the scheduling level is unfortunately often ignored, even though it is incredibly simple: Keep it simple!
Specifically, this means, for example, that the operating system configuration should be kept as simple as possible. Ideally, BCC1 (Basic Conformance Class without multiple activations) should be used. Unfortunately, most AUTOSAR RTE generators suggest the use of ECC by creating a non-terminating ECC task and introducing a second level of scheduling within it: The RTE runnables are triggered by events in this ECC task. The resulting complexity is beyond the grasp of most project managers, and the analysis of timing problems becomes significantly more difficult.
Incidentally, using BCC significantly reduces the stack size.
Another measure to avoid typical real-time problems and further minimize runtime and stack requirements is the use of cooperative multitasking. "Cooperative" here means that task switching is only permitted at specific times—ideally when no runnable is currently running. The RTE or similar mechanisms then determine that no data copies are necessary to ensure data consistency. The result: savings in RAM and runtime. It's also clear that stack requirements are typically drastically reduced, as nesting runnables is no longer possible.
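The effect of cooperative multitasking can be illustrated with a small host-side simulation. This is only a sketch under simplifying assumptions, not an OSEK or AUTOSAR implementation: the names (`task_t`, `run_cooperative`, `runnable_a`, and so on) are hypothetical, there are only two tasks, and the activation of the higher-priority task is simulated by a flag that runnable A sets while it is running.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*runnable_t)(void);

typedef struct {
    const runnable_t *runnables;   /* runnables mapped to this task, in order */
    size_t            count;
} task_t;

/* Set when the higher-priority task is activated, e.g. from a timer ISR. */
static volatile int high_prio_pending;

/* Execution log so that the resulting order can be inspected. */
#define LOG_SIZE 16u
static char   exec_log[LOG_SIZE];
static size_t exec_log_len;

static void log_run(char id)
{
    if (exec_log_len < LOG_SIZE) {
        exec_log[exec_log_len++] = id;
    }
}

/* Hypothetical runnables. Runnable A simulates an activation of the
 * higher-priority task that arrives while A is still running. */
void runnable_a(void) { log_run('A'); high_prio_pending = 1; }
void runnable_b(void) { log_run('B'); }
void runnable_h(void) { log_run('H'); }

static void run_all(const task_t *t)
{
    for (size_t i = 0u; i < t->count; i++) {
        t->runnables[i]();
    }
}

/* Runs the low-priority task cooperatively: a pending higher-priority task
 * is only dispatched at the scheduling points between two runnables. */
void run_cooperative(const task_t *low, const task_t *high)
{
    assert(low != NULL && high != NULL);
    for (size_t i = 0u; i < low->count; i++) {
        low->runnables[i]();        /* runs to completion, never nested */
        if (high_prio_pending) {    /* scheduling point */
            high_prio_pending = 0;
            run_all(high);
        }
    }
}
```

With a low-priority task consisting of A and B and a high-priority task consisting of H, the resulting order is A, H, B: although H is activated during A, it only starts once A has finished. Because no runnable is ever interrupted by a runnable of another task, data shared between them needs no protective copies, and the stack never has to hold two nested runnables.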
Figure 4 (see PDF) shows the trace of a very positive example of successful timing design: the active steering of the BMW X5 (E70). Depending on the operating state, the system load reaches more than 93%, but the system is never overloaded. Thanks to optimization measures, a less expensive, less powerful processor than in the previous generation could be used, while still offering additional functionality.
Runtime optimization at the code level
As an example of runtime optimization at the code level, the well-known function `memcpy` will be optimized here. `memcpy` copies a defined number of bytes from one memory location to another. A standard implementation is shown in Figure 5 (see PDF).
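Since the figure is only available in the PDF, the following sketch shows what a typical byte-wise implementation looks like. It is not necessarily identical to the code in the figure; the name `memcpy_bytewise` is chosen here only to avoid a clash with the C library's own `memcpy`.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-wise copy: one load and one store per byte.
 * As with the standard memcpy, the two areas must not overlap. */
void *memcpy_bytewise(void *dest, const void *src, size_t n)
{
    /* Precondition check; compiled out when NDEBUG is defined. */
    assert(dest != NULL && src != NULL);

    unsigned char       *d = dest;
    const unsigned char *s = src;

    while (n-- > 0u) {
        *d++ = *s++;
    }
    return dest;
}
```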
When this function is used to copy one kibibyte (1024 bytes) on an Infineon AURIX TC275 with a 200 MHz clock frequency and using a TASKING compiler, the minimum CET (core execution time) for the default memory locations is 114.395 µs, or 111.7 ns per byte. This is the starting point for the optimizations that follow. The generated assembly code is shown in Figure 6 (see PDF).
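The per-byte figure follows directly from the measured CET, and converting it into clock cycles (derived here from the stated 200 MHz clock, not a value given in the source) gives a feeling for how expensive each copied byte is in the initial state:

```latex
\[
t_{\text{byte}} = \frac{114.395\,\mu\text{s}}{1024\ \text{bytes}} \approx 111.7\,\text{ns per byte},
\qquad
n_{\text{cycles}} = 111.7\,\text{ns} \times 200\,\text{MHz} \approx 22.3\ \text{clock cycles per byte}
\]
```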
The first step will be to use different memory locations. Figure 7 (see PDF) shows an excerpt from the AURIX manual that specifies the access time to various memory locations in clock cycles. A crucial multicore aspect should be mentioned here: The figures given in the table apply only if no access conflict occurs at the memory interface, meaning that no other core is currently accessing the memory area with a higher priority. In the event of a conflict, the delay can be significantly greater.
However, even without conflicts, the difference between "fast" and "slow" memory is significant. The default memory locations in the initial state were: Cached Flash0 for the code, LMU RAM for the destination of the copy, and Cached Flash0 for the source.
If the destination is now placed in Local DSPR0 instead of LMU RAM, the copy time drops to 100.6 ns per byte.
In the next step, the compiler is instructed, via #pragma or the compiler option -t0, to generate the fastest possible code during compilation. The compiler now generates different assembly code, see Figure 8 (see PDF), using "post-increment" memory accesses, which are desirable in principle, and using special instructions from the AURIX instruction set, in this case the loop instruction.
The code generated in this way runs significantly faster and requires only 59.6 ns per byte.
Finally, the code is optimized manually. In most cases, this is possible at the C code level, which requires continuously monitoring a) the generated assembly code and b) the actual runtime. In code optimization, there is a strong temptation to rely on assumptions and guesswork. Even the most experienced developers encounter surprises and have to admit that the expected runtime behavior of an optimization approach can be far removed from the actual behavior; only measurements on the real system can reveal this.
Before delving into the actual manual optimization, let's consider a few points about `memcpy`. According to its specification, it can copy data with a granularity of one byte. The AURIX, as a 32-bit processor, can handle four-byte words very efficiently. Analyzing the data objects of a typical automotive application reveals that the size of most data objects is an integer multiple of four bytes and that they are also four-byte aligned, meaning that their memory address is also an integer multiple of four. The implementation of the crucial parts of `memcpy` shown in Figure 9 (see PDF) makes use of these findings: it checks whether the source and destination are four-byte aligned and whether the number of bytes to be copied is an integer multiple of four. If this is the case, four bytes are copied in each loop iteration.
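The idea described above can be sketched as follows. This is an illustration of the technique under stated assumptions, not the code from the figure: the name `memcpy_fast` is hypothetical, and the word-wise path assumes that the compiler tolerates accessing the copied objects through `uint32_t` pointers, which is common for embedded toolchains but is not guaranteed by ISO C for objects of other types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

void *memcpy_fast(void *dest, const void *src, size_t n)
{
    /* Precondition check; compiled out when NDEBUG is defined. */
    assert(dest != NULL && src != NULL);

    /* One combined test: the two lowest bits of both addresses and of the
     * length must all be zero (four-byte aligned, multiple of four). */
    if ((((uintptr_t)dest | (uintptr_t)src | (uintptr_t)n) & 3u) == 0u) {
        /* Fast path: copy one 32-bit word per loop iteration.
         * Assumption: word access to these objects is acceptable for the
         * toolchain in use (see the aliasing remark in the text). */
        uint32_t       *d = dest;
        const uint32_t *s = src;
        for (size_t words = n / 4u; words > 0u; words--) {
            *d++ = *s++;
        }
    } else {
        /* Fallback: byte-wise copy for unaligned or odd-sized requests. */
        unsigned char       *d = dest;
        const unsigned char *s = src;
        while (n-- > 0u) {
            *d++ = *s++;
        }
    }
    return dest;
}
```

The alignment check costs a few instructions per call, so the fast path only pays off when most calls actually qualify for it. Whether that is the case should be confirmed by measurement on the real system rather than assumed.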
This manual optimization, together with all the previously mentioned optimizations, results in a runtime requirement of 14.7 ns per byte. After all, that is a saving in runtime of around 87% compared to the original version.
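For orientation, the cumulative savings relative to the initial 111.7 ns per byte can be worked out from the per-byte figures quoted in the text. The percentages and the speedup factor below are derived from those figures and are not stated in the source, apart from the final value of around 87%:

```latex
\begin{align*}
\text{destination in Local DSPR0:}      \quad & 1 - \frac{100.6}{111.7} \approx 9.9\,\% \\
\text{plus compiler optimized for speed:} \quad & 1 - \frac{59.6}{111.7} \approx 46.6\,\% \\
\text{plus manual word-wise copy:}       \quad & 1 - \frac{14.7}{111.7} \approx 86.8\,\% \\
\text{overall speedup:}                  \quad & \frac{111.7}{14.7} \approx 7.6
\end{align*}
```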
Multicore-specific aspects
For multicore systems, it has proven effective to distribute functionality so that intensive calculations and the handling of numerous interrupts are distributed across different cores. As a result, the pipeline and cache are utilized more efficiently for complex calculations, thus increasing throughput.
"Busy spinning," that is, one core actively waiting for another core to release a resource, should be avoided conceptually wherever possible. In many cases, the use of spin locks can be completely avoided through appropriate software design. When porting single-core software to multi-core systems, blindly replacing `DisableInterrupts()`/`EnableInterrupts()` with `GetSpinlock()`/`ReleaseSpinlock()` is a bad idea. The best mechanism for protecting data is the one you don't need.
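One possible design that needs no spin lock at all is a single-producer/single-consumer queue for passing data from one core to another. The article does not prescribe this technique; it is shown here only as an example of protecting data by construction. The sketch uses C11 atomics, all names (`spsc_queue_t`, `spsc_push`, `spsc_pop`) are hypothetical, and it assumes that exactly one core writes to the queue and exactly one core reads from it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SIZE 8u   /* assumption: must be a power of two */
_Static_assert((QUEUE_SIZE & (QUEUE_SIZE - 1u)) == 0u,
               "QUEUE_SIZE must be a power of two");

typedef struct {
    uint32_t    data[QUEUE_SIZE];
    atomic_uint head;   /* read position, written only by the consumer core */
    atomic_uint tail;   /* write position, written only by the producer core */
} spsc_queue_t;

void spsc_init(spsc_queue_t *q)
{
    assert(q != NULL);
    atomic_init(&q->head, 0u);
    atomic_init(&q->tail, 0u);
}

/* Producer core only. Returns false if the queue is full; never waits. */
bool spsc_push(spsc_queue_t *q, uint32_t value)
{
    assert(q != NULL);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_SIZE) {
        return false;                         /* full */
    }
    q->data[tail % QUEUE_SIZE] = value;
    /* Release: the data write above becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1u, memory_order_release);
    return true;
}

/* Consumer core only. Returns false if the queue is empty; never waits. */
bool spsc_pop(spsc_queue_t *q, uint32_t *value)
{
    assert(q != NULL && value != NULL);
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) {
        return false;                         /* empty */
    }
    *value = q->data[head % QUEUE_SIZE];
    /* Release: the slot is handed back only after it has been read. */
    atomic_store_explicit(&q->head, head + 1u, memory_order_release);
    return true;
}
```

Because each index is written by only one core, neither side ever has to wait for the other: a full or empty queue is reported to the caller instead of being spun on. Whether the chosen compiler and target support `<stdatomic.h>` in this form has to be checked for the specific project.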
Summary, Outlook
Runtime optimization is complex and multifaceted. It can and should take place at different levels: at the scheduling level and at the code level – in that order, to prevent (code optimization) effort from being expended that does not improve the overall situation.
A prerequisite for targeted analysis and optimization is knowledge of the various analysis techniques. Only then can the right tool be used for the specific application.
The most important first step in transforming an existing application into efficient and secure software is understanding how the system actually behaves. Assumptions and guesswork are out of place here; insight into and analysis of the real system under real-world conditions (including in a vehicle) are crucial.
For several years now, so-called C-to-C compilers have been gaining attention, promising to translate given C code into parallelizable C code. I'm skeptical of this approach. My assumption (or should I say "hope"?) is that better utilization of multicore processors will be achieved by fundamentally different code generators. Future generators will allow the creation of either single-core (or single-threaded) code or highly parallelizable multicore code from a generic model ("generic" in the sense of not being tied to single-core or multicore).
Bibliography and list of sources
[1] Timing poster, Peter Gliwa, February 2013
Author
Since its founding in 2003, Peter Gliwa has been the managing partner of GLIWA. For many years, he developed the T1 timing suite and today advises international clients on timing issues and topics related to real-time operating systems. Previously, he worked at ETAS, first as a developer and then as product manager for the ERCOSEK real-time operating system. Between 2001 and 2006, he also worked as a lecturer in the field of "microcomputer technology". Peter Gliwa studied electrical engineering at the Stuttgart Cooperative State University.
Internet: https://gliwa.com
Email: peter.gliwa@gliwa.com
