Tools and resources for multicore debugging
Author: Jens Braunes, PLS Programmable Logic & Systems GmbH
Contribution – Embedded Software Engineering Congress 2015
Debugging deeply embedded multicore systems is no longer just about tracking down incorrect variable values. Deadlocks, resource conflicts, and timing issues are now commonplace. This presents a significant challenge for developers, one that can only be overcome with appropriate support from on-chip debugging functions working in close conjunction with powerful software tools. This article introduces such solutions and demonstrates their capabilities as well as their limitations.
Looking solely at the consumer market, multicore systems have been mainstream for a decade. For deeply embedded systems, such as those found primarily in industrial applications or automotive control systems, the technological shift has occurred only in recent years, and even then rather hesitantly. One reason is certainly the high demands on safety, reliability, and real-time performance, which take absolute priority in these areas. The extensive portfolio of proven and well-tested software modules for single-core systems, which could only have been ported to multiple cores with considerable effort, has undoubtedly also slowed adoption. And let's not forget:
Not all multicore is created equal.
The multicore approach familiar from the Windows, Linux, and Android worlds allows tasks and processes to be created dynamically and then executed on any available core, depending on the workload. This is possible because the processors used have identical processing cores, meaning they are homogeneous multicore systems. Every core is equally capable of executing any task assigned to it.
However, industrial and automotive applications typically require defined processing times for individual tasks and guaranteed response times. Therefore, heterogeneous multicore systems with several cores, each tailored to particular tasks, are usually employed, as illustrated by the example of the AURIX™ microcontrollers from Infineon (see Figure 1). While all three main cores originate from the TriCore architecture family, only two of them are designated as so-called performance cores (P-cores) for standard computationally intensive tasks. The third central processing unit, designated as the economy core (E-core), primarily manages peripherals and handles general tasks requiring less processing power. For safety-critical tasks, one of the P-cores and the E-core are equipped with an additional lockstep core, which executes the same operations as the main core in the background. If a comparison of the results reveals a discrepancy, the system can no longer be considered reliable and may need to be reset to a safe state.
Complex timing algorithms and the efficient, parallel processing of signals are supported by a fourth "core," the so-called Generic Timer Module (GTM). Completely different from the TriCore cores, the GTM is nevertheless programmable via its own instruction set. This allows tasks to be executed on it.
Therefore, when it comes to distributing application loads across the processing cores, one can hardly rely on the operating system. Instead, it must be clear during the software design phase which core is responsible for which tasks.
Deeply embedded debugging
Applications that perform individual tasks with high real-time requirements and are distributed across different processing cores often pose a significant challenge for debugging, testing, and system analysis. The typically substantial dependencies between tasks running on different cores naturally impact stop-go debugging. You can't simply stop one core while all the others continue running. Sometimes, the remaining cores and peripherals must also be stopped simultaneously to prevent the application from becoming completely unresponsive and entering an undefined state. However, truly simultaneous stopping is virtually impossible with heterogeneous cores that have different clock speeds and execution pipelines. Therefore, in practice, there will always be a certain time lag that developers have to accept. Sometimes, stopping an entire multicore system can even have disastrous consequences, for example, if other applications are running concurrently that should not or cannot be debugged at that time. The application scenarios mentioned above clearly demonstrate the importance of a flexible, synchronous run control for the multicore debug infrastructure.
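To get a feel for the size of this time lag, a back-of-the-envelope estimate can help. The following is only a minimal sketch: the function name, the 20 ns signal delay, the clock frequencies, and the assumption of one instruction per clock cycle are illustrative values and are not taken from any datasheet.

```python
def skid_instructions(delay_ns: float, core_mhz: float, ipc: float = 1.0) -> float:
    """Estimate how many instructions a core still executes before it halts.

    delay_ns: assumed time the stop signal needs to reach the core
    core_mhz: clock frequency of the core
    ipc:      assumed average number of instructions per clock cycle
    """
    cycles = delay_ns * core_mhz / 1000.0  # ns * MHz / 1000 = clock cycles
    return cycles * ipc


# Hypothetical system: the same stop signal reaches a fast and a slow core.
for name, mhz in (("fast core", 300.0), ("slow core", 100.0)):
    print(f"{name}: about {skid_instructions(20.0, mhz):.0f} instructions of skid")
```

Even with an identical signal delay, the faster core executes more instructions before it stops, which is why cores with different clock speeds never halt at exactly the same point in their programs.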
A second important aspect is the analysis of runtime behavior without influencing it. This non-intrusive system observation plays a crucial role not only in real-time critical applications but also in profiling tasks and for observing communication between cores. Often, it is desirable to be able to read the current system state at a specific point in time from the target system using an externally connected debugger. However, halting the application could potentially alter the system behavior so fundamentally that it would no longer resemble the behavior without the debugger connected. Therefore, tracing is essential for efficient non-intrusive system observation.
On-Chip Debugger
But first, let's return to the topic of synchronous run control. This requires fast signal paths between the cores, which can only be achieved with debug hardware directly on the chip. Transmitting stop and go signals externally via the debug interface would take far too long at today's common high clock frequencies. The application would inevitably become unresponsive.
Now, every chip manufacturer offers its own on-chip debugging solution. Infineon, for example, calls it OCDS (On-Chip Debug System). A key component is a trigger switch that distributes individually configurable stop and suspend signals system-wide. This allows individual processing cores and peripheral units to be selectively stopped and restarted simultaneously without affecting the other functional groups. Additionally, individual trigger lines of the trigger switch can be routed externally via pins. This is an interesting option for connecting an oscilloscope, for example, or even triggering a break from outside the chip.
Besides Infineon's AURIX family, there are of course a number of other multicore microcontrollers that cover the industrial and automotive sectors, including, for example, the Freescale MPC57xx family or SoCs based on the Arm Cortex-R architecture. Let's first take a look at CoreSight™ [1], Arm's on-chip debug hardware.
A cross-trigger matrix (CTM) with connected cross-trigger interfaces (CTIs) is used to distribute the break and go signals between the cores. Channels in the CTM forward the signals in broadcast mode to the connected CTIs. These are directly connected to the cores and can be configured to either forward or block the run control signals between the core and the CTM. Due to the necessary handshake mechanisms between the components involved, signal delays of several clock cycles occur. The actual magnitude of these delays depends on the specific implementation and, of course, the clock speed of the individual components. However, the slight slippage of a few instructions, typically in the single digits, that occurs during synchronous stop cannot be completely avoided. Whether the respective Arm controller offers this hardware support is up to the chip manufacturer. They are free to decide whether to implement the necessary CoreSight components on the chip at all (see Figure 2).
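The broadcast-and-gate principle described above can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions, not a description of the real CoreSight register interface: the class names, the attributes to_ctm and from_ctm, and the channel numbering are hypothetical, and handshake delays are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class CrossTriggerInterface:
    """Hypothetical CTI model: gates events between one core and the CTM."""
    core: str
    to_ctm: set = field(default_factory=set)    # channels this core may raise
    from_ctm: set = field(default_factory=set)  # channels that may halt this core
    halted: bool = False


class CrossTriggerMatrix:
    """Broadcasts a channel event to every connected CTI."""

    def __init__(self, ctis):
        self.ctis = {cti.core: cti for cti in ctis}

    def breakpoint_hit(self, core: str, channel: int) -> None:
        source = self.ctis[core]
        source.halted = True             # the core that hit the breakpoint stops
        if channel not in source.to_ctm:
            return                       # its CTI blocks the event: no broadcast
        for cti in self.ctis.values():
            if channel in cti.from_ctm:  # this CTI passes the halt request on
                cti.halted = True

    def halted_cores(self):
        return sorted(c.core for c in self.ctis.values() if c.halted)


def demo():
    ctm = CrossTriggerMatrix([
        CrossTriggerInterface("core0", to_ctm={0}, from_ctm={0}),
        CrossTriggerInterface("core1", to_ctm={0}, from_ctm={0}),
        CrossTriggerInterface("core2"),  # not in the group, keeps running
    ])
    ctm.breakpoint_hit("core0", channel=0)
    return ctm.halted_cores()


print(demo())
```

In this model, core0 and core1 stop together because both CTIs are connected to channel 0, while core2 continues to run because its CTI neither raises nor listens on that channel.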
Synchronous run control is also supported at the hardware level by the Power Architecture-based controllers of the MPC57xx family. The unit responsible for this is called DCI (Debug and Calibration Interface). The advantage over the Arm solution: As with the trigger switch of the AURIX, the peripheral units are also connected to the DCI, which allows the entire system, and not just the cores, to be stopped.
Of course, developers would prefer not to have to deal with such differing implementations in detail at all. Debuggers therefore hide the unnecessary details of configuring the synchronous run control behind an easy-to-use interface. An example of this is the Multi-Core Run-Control Manager in the Universal Debug Engine (UDE) from PLS (see Figure 3). This allows cores to be grouped into run control groups, which can then be synchronously stopped at a breakpoint and subsequently restarted synchronously.
Following the trail with trace
Especially for real-time applications, there is another important prerequisite for accurate and reliable system analysis in addition to synchronous run control: on-chip tracing. This technology is, of course, also available for the multicore controllers mentioned above. Freescale, for example, uses Nexus [2] for its MPC57xx family, Infineon employs the Multi-Core Debug Solution (MCDS) [3], and Arm offers the well-known CoreSight. All of these solutions can record traces for multiple cores in parallel; with MCDS, however, this is limited to at most two selectable cores. Timestamps place the trace data in temporal order so that the precise sequence of events can be reconstructed. This makes it possible to detect deadlocks and race conditions, as well as communication bottlenecks.
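How timestamps help to reconstruct the sequence of events can be shown with a short sketch. It assumes that each core delivers its trace events already sorted and stamped against a common time base; the TraceEvent type, the tick values, and the lock names are invented purely for illustration.

```python
import heapq
from typing import NamedTuple


class TraceEvent(NamedTuple):
    timestamp: int  # ticks of a common time base (assumed)
    core: str
    message: str


def merge_traces(*streams):
    """Merge per-core event lists, each sorted by time, into one timeline."""
    return list(heapq.merge(*streams, key=lambda ev: ev.timestamp))


# Invented example data: two cores take two locks in opposite order.
core0 = [TraceEvent(100, "core0", "acquire lock A"),
         TraceEvent(250, "core0", "wait for lock B")]
core1 = [TraceEvent(120, "core1", "acquire lock B"),
         TraceEvent(260, "core1", "wait for lock A")]

timeline = merge_traces(core0, core1)
for ev in timeline:
    print(f"{ev.timestamp:>5}  {ev.core}: {ev.message}")
```

Only in the merged timeline does it become visible that each core is waiting for a lock the other one already holds, which is the classic pattern of a deadlock.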
A major challenge lies in transferring the recorded trace data to the debugger, which then performs further analysis. This data is either temporarily stored on the chip in a trace buffer and read via the debug interface, or transmitted over a high-bandwidth interface during recording. The former offers significantly higher bandwidth but very limited storage capacity. While the latter allows for theoretically unlimited observation time, overflows can occur more frequently if more trace data is generated than can be transmitted. In both cases, sophisticated filter and trigger mechanisms help by limiting the amount of trace data. Cross-triggering is also possible. This allows, for example, the trace for one core to be started when a condition for another core is triggered. This function is helpful, for instance, for debugging communication between cores. Cross-triggering is a standard feature in MCDS and CoreSight. However, in CoreSight, it competes with the synchronous run control, as both utilize the same hardware resources. Freescale, on the other hand, had to extend its Nexus implementation with a proprietary unit, the so-called Sequence Processing Unit (SPU), because cross-triggering is not provided for in the Nexus standard.
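The trade-off between the two transfer methods, and the effect of filtering, can be made tangible with a rough calculation. The sketch below uses a deliberately simple model with constant data rates; the function names and all sizes and rates are assumptions chosen for illustration and do not describe any particular chip.

```python
from typing import Optional


def buffer_window_ms(buffer_bytes: float, trace_bytes_per_ms: float) -> float:
    """Observation time until an on-chip trace buffer is full."""
    return buffer_bytes / trace_bytes_per_ms


def streaming_overflow_ms(fifo_bytes: float, trace_bytes_per_ms: float,
                          link_bytes_per_ms: float) -> Optional[float]:
    """Time until the output FIFO overflows, or None if the link keeps up."""
    surplus = trace_bytes_per_ms - link_bytes_per_ms
    if surplus <= 0:
        return None  # the interface drains data at least as fast as it arrives
    return fifo_bytes / surplus


# Assumed values: 512 KiB trace buffer, 4 KiB of trace data per millisecond.
print(buffer_window_ms(512 * 1024, 4096))          # window of the on-chip buffer
# Assumed values: 4 KiB FIFO, link carries only 2 KiB per millisecond.
print(streaming_overflow_ms(4096, 4096, 2048))     # unfiltered: FIFO overflows
# A filter that keeps a quarter of the data brings the rate below the link rate.
print(streaming_overflow_ms(4096, 4096 / 4, 2048)) # filtered: no overflow
```

With these assumed numbers, the on-chip buffer covers only a short window, while the streamed trace overflows quickly unless a filter reduces the data rate below what the interface can carry.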
The debugger's capabilities are also essential for the targeted observation of system behavior using tracing, as well as the subsequent evaluation and analysis. For example, UDE provides a graphical tool for creating trace tasks, which allows even complex cross-triggers to be configured quite easily. This works for various on-chip trace systems without requiring the user to concern themselves with the technical details. Different views, such as those visualizing the parallel execution of code on multiple cores, facilitate trace analysis. If desired or needed, the debugger can also be used to obtain profiling information or determine code coverage. However, these two options are not specific to multicore systems, but rather are generally useful for system analysis and optimization.
Conclusion
Without the support of suitable on-chip hardware, multicore debugging for deeply embedded systems would be extremely difficult. Even a modern debugger inevitably reaches its limits when it comes to synchronously stopping and starting multiple cores. Truly synchronous run control is only possible with suitable on-chip debugging hardware featuring configurable cross-triggers. The same applies to comprehensive system analysis of multicore applications: without on-chip tracing, the limits of what is feasible are quickly reached. Although chip manufacturers do not adhere to a uniform standard for on-chip debugging, modern debuggers such as UDE cope with this well and make the functions easy to use, so developers rarely have to delve into chip-specific details.
Sources
[1] ARM Ltd: CoreSight Debug and Trace
[2] Nexus 5001 Forum: IEEE-ISTO 5001™-2012, The Nexus 5001™ Forum Standard for a Global Embedded Processor Debug Interface
[3] A. Mayer, H. Siebert, C. Lipsky: Multi-Core Debug Solution IP; Whitepaper, IPextreme, 2007
