Cross-platform software for multicore & FPGAs

Portable MATLAB®, Simulink and Scilab software development

Authors: Timo Stripf, Michael Rückauer and Oliver Oey, emmtrix Technologies GmbH

Contribution – Embedded Software Engineering Congress 2016

The latest embedded systems are increasingly equipped with high-performance multicore processors and FPGA accelerators. This is necessary to meet the ever-increasing performance demands of applications while minimizing energy consumption. In industry, such heterogeneous multicore systems are still programmed predominantly by hand. This is complex, expensive, and time-consuming, and poses a significant challenge for software developers. For this reason, many companies avoid using multicore processors altogether.

The embedded systems market is desperately seeking programming solutions for multicore systems. This presentation introduces tools for automated C code generation and parallelization using MATLAB®, Simulink, and Scilab for heterogeneous embedded multicore systems.

Motivation

According to studies [1], programming embedded multicore systems is 4.5 times more expensive than programming single-core systems, requires 3 times as many software engineers, and takes 25% longer. The main problems are the increased effort required to distribute tasks across the individual cores, the additional testing effort, and additional sources of error such as race conditions or deadlocks. The latter are especially difficult to debug and are therefore feared by developers. Heterogeneous systems (compared to homogeneous systems) add the effort of deciding which part of the application should be executed on which processor type or accelerator to achieve an efficient result in terms of performance and energy consumption.

emmtrix Technologies GmbH has developed a tool that addresses the challenges of parallel software development for embedded heterogeneous multi-core processors. Through automatic parallelization and code generation, complex tasks are automated, and parallelization is significantly simplified within a graphical user interface. MATLAB® scripts, Simulink models, and the open-source alternative Scilab are supported as input. The programmer can focus on developing the algorithm, while the toolchain automatically handles adaptations to the target hardware. Finally, the solution can be executed and tested directly on the target hardware. This automation of parallelization enables rapid prototyping, which significantly reduces the effort required in the development cycle for evaluating a parallel solution.

Example application & test environment

As an example application, a manageable sub-problem from industrial image processing is used, implemented as a MATLAB® program. An image is read from a camera, edge detection is performed using a Sobel filter, and the result is displayed on a screen.

The target system is a ZedBoard [2] with a Zynq-7000 SoC (XC7Z020), which combines a dual-core ARM Cortex-A9 processor with FPGA logic. The programmable logic handles image acquisition from the connected camera, image output to the monitor, and hardware acceleration for parts of the example application. The system runs Linux.

The following shows the basic structure of the MATLAB® application (see PDF):

The filter first applies two 2D convolutions (conv2) to detect horizontal and vertical edges. The matrix k and its transpose k' are used as the convolution kernels. Direction-independent edge information is obtained by combining the results of both filters. A pixel is marked as an edge in the output image wherever the combined value exceeds a threshold. In our example, this threshold is assumed to be 0.7.
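
The filter structure described here can be sketched in C as follows. This is only an illustration, not the code produced by the emmtrix Code Generator. Several details are assumptions because the original listing is only available in the PDF: the kernel k is taken to be the common 3x3 Sobel matrix, conv2 is assumed to behave like the 'same' mode with zero padding, the two results are assumed to be combined as a gradient magnitude, and the image is assumed to be stored row by row as floats. The names conv2_3x3 and the signature of sobel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 3x3 Sobel kernel k; the article shows the actual matrix only in the PDF. */
static const float K[3][3] = {
    { 1.0f,  2.0f,  1.0f},
    { 0.0f,  0.0f,  0.0f},
    {-1.0f, -2.0f, -1.0f}
};

/* Same-size 2D convolution with zero padding (assumed conv2 'same' behavior).
 * With transpose != 0 the transposed kernel k' is applied instead of k. */
static void conv2_3x3(const float *in, float *out,
                      size_t rows, size_t cols, int transpose)
{
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            float acc = 0.0f;
            for (int i = -1; i <= 1; ++i) {
                for (int j = -1; j <= 1; ++j) {
                    /* Convolution: out(r,c) = sum over i,j of k(i,j) * in(r-i, c-j) */
                    long rr = (long)r - i;
                    long cc = (long)c - j;
                    if (rr < 0 || cc < 0 || rr >= (long)rows || cc >= (long)cols)
                        continue; /* zero padding outside the image */
                    float kv = transpose ? K[j + 1][i + 1] : K[i + 1][j + 1];
                    acc += kv * in[(size_t)rr * cols + (size_t)cc];
                }
            }
            out[r * cols + c] = acc;
        }
    }
}

/* h receives the result for k (horizontal edges), v the result for k'
 * (vertical edges). edges[n] is 1 where the combined value exceeds the
 * threshold, else 0. Comparing squared values avoids a square root. */
void sobel(const float *img, float *h, float *v, unsigned char *edges,
           size_t rows, size_t cols, float threshold)
{
    assert(img && h && v && edges && threshold >= 0.0f);
    conv2_3x3(img, h, rows, cols, 0);
    conv2_3x3(img, v, rows, cols, 1);
    for (size_t n = 0; n < rows * cols; ++n) {
        float mag2 = h[n] * h[n] + v[n] * v[n];
        edges[n] = (mag2 > threshold * threshold) ? 1 : 0;
    }
}
```

A caller would pass the image together with three buffers of the same size and the threshold, for example sobel(img, h, v, edges, rows, cols, 0.7f). The two conv2_3x3 calls do not depend on each other, which is what makes them candidates for parallel execution.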

Generation of sequential C code

As a first step, the MATLAB® program is translated into sequential C source code for the target architecture. The emmtrix Code Generator is used for this purpose. It supports MATLAB® and Scilab scripts as well as Simulink models. The generated C code is already prepared for parallelization.

An important aspect here is the modeling of the interface code (see PDF).

The interface code receives the input data and emits the output data. This code varies depending on where and in which environment the program is executed. For example, to evaluate the program on a PC, interface_indata can read an image file within the MATLAB® application and interface_outdata can write the result to the hard drive. This is how, for example, the image in Fig. 1 (see PDF) was generated. For later execution on the hardware, the interface code is replaced by code that drives the camera and the screen.
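
To make the idea of exchangeable interface code more concrete, the following is a minimal sketch of what a PC variant could look like on the C side. It is not the code from the article: the function signatures are hypothetical, and the choice of binary PGM (P5) as the file format is an assumption made only to keep the example short. Header comments in PGM files are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical PC variant of the output interface: writes an 8-bit
 * grayscale image as a binary PGM (P5) file. Returns 0 on success. */
int interface_outdata(const char *path, const unsigned char *img,
                      size_t rows, size_t cols)
{
    assert(path && img);
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    int ok = fprintf(f, "P5\n%zu %zu\n255\n", cols, rows) > 0
          && fwrite(img, 1, rows * cols, f) == rows * cols;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical PC variant of the input interface: reads a binary PGM
 * file whose size must match rows x cols. Returns 0 on success. */
int interface_indata(const char *path, unsigned char *img,
                     size_t rows, size_t cols)
{
    assert(path && img);
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t w = 0, h = 0;
    unsigned maxval = 0;
    int ok = fscanf(f, "P5 %zu %zu %u", &w, &h, &maxval) == 3
          && w == cols && h == rows && maxval == 255
          && fgetc(f) != EOF /* single whitespace byte after the header */
          && fread(img, 1, rows * cols, f) == rows * cols;
    fclose(f);
    return ok ? 0 : -1;
}
```

On the target hardware, two functions with the same names and signatures would instead access the camera and the display, so the filter code between them stays unchanged.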

Analysis of the program

After generating the sequential program, it is executed on the PC using test input data. Profiling is performed to determine the execution times for the specified target architecture. The result is displayed in emmtrix Parallel Studio [3] as a hierarchical program representation showing the execution times of the various program components. The horizontal axis represents time. At the lowest level is the main function, which represents the entire program and spans the entire duration. At the second level, the calls to the interface functions are located at the beginning and end. The largest block in the middle is the call to the sobel function. Within this function, the two 2D convolution filters (conv2_1 and conv2_2) and the for loop that combines the two convolution results are clearly visible. Deeper levels are not shown in this example for clarity.

Parallelization of the program

First, the program is parallelized for the Zynq system's dual-core processor. The automatic scheduler has already found a sensible distribution. Alternatively, the user can always intervene in the process via the hierarchical view and, for example, specify the processor core for each program segment/block. This ensures the user always retains full control over the parallelization.

One result of parallelization is the granularity: which parts of the program are executed together on one processor core, that is, how deeply the program is broken down. In the diagram (see Fig. 2, PDF), this is recognizable by the red line. The assignment to the respective processor core is shown by the boxes in the bottom right.

These program components appear again in the scheduling view (see Fig. 3, PDF), which represents the execution of the parallel program. Here you can see that the two 2D convolution operations were distributed across two different processor cores, thereby speeding up the application by approximately 50%.

Rapid Prototyping

The emmtrix Parallel Studio allows the direct execution of the parallelized application on the target architecture. First, automatic code generation is performed based on the calculated schedule, creating parallel code. For the parallel code, various APIs such as MPI, processes, or pthreads can be selected. For the Zynq target architecture, pthreads (POSIX threads) are recommended because the architecture uses shared memory.
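
As an illustration of the pthreads variant, the following sketch runs two independent tasks in two threads of one process. It is hand-written and not the output of emmtrix Parallel Studio. The type task_t, the functions run_task and run_in_two_threads, and the trivial scaling workload are all hypothetical stand-ins for the two convolution calls. Pinning the threads to specific cores is not shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical task description: both tasks may read the same input
 * buffer (shared memory) but must write to separate output buffers. */
typedef struct {
    const float *in;
    float *out;
    size_t n;
    float gain; /* stand-in for the filter-specific computation */
} task_t;

static void run_task(const task_t *t)
{
    for (size_t i = 0; i < t->n; ++i)
        t->out[i] = t->gain * t->in[i];
}

static void *worker(void *arg)
{
    run_task((const task_t *)arg);
    return NULL;
}

/* Runs t2 in a second thread while the calling thread runs t1.
 * pthread_join is the synchronization point: after it returns, both
 * output buffers are complete and can be combined. Returns 0 on success. */
int run_in_two_threads(const task_t *t1, const task_t *t2)
{
    assert(t1 && t2);
    pthread_t th;
    if (pthread_create(&th, NULL, worker, (void *)t2) != 0)
        return -1;
    run_task(t1);
    return pthread_join(th, NULL) == 0 ? 0 : -1;
}
```

Because both threads share one address space, the input image does not have to be copied or sent as a message, which is the property that makes pthreads a natural fit for a shared-memory architecture. No lock is needed in this sketch since the tasks never write to the same memory.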

For rapid prototyping, the generated code is combined with a template for the target system. The template includes the initialization, main loop, camera and screen control, and hardware implementation of the interface functions.

The final C files are transferred to the target system and compiled directly on its local Linux system. This eliminates the need to set up a cross-compiler on the host system. After compilation, the application can be run directly on the target system, allowing for live testing. The entire process, from sequential code generation to application execution, is fully automated and runs through the whole toolchain in a short amount of time.

Hardware profiling

In addition to functional testing of the application, rapid prototyping also involves evaluating its performance on the target architecture. Profiling instructions are integrated into the application code for this purpose. These instructions measure the schedule of the parallel application on the hardware, thus allowing a comparison with the previously calculated schedule.
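
The kind of instrumentation described here can be sketched as follows, assuming a POSIX system such as the Linux running on the target. The names prof_event_t, prof_begin, prof_end, and prof_deviation_ns are hypothetical and are not taken from the emmtrix tools. The sketch only records the start and end time of one block and compares the measured duration with a predicted one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* One measured program block, for example the interface code block BB3. */
typedef struct {
    int block_id;
    int64_t start_ns;
    int64_t end_ns;
} prof_event_t;

/* Monotonic timestamp in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Call immediately before the block starts. */
void prof_begin(prof_event_t *e, int block_id)
{
    assert(e);
    e->block_id = block_id;
    e->start_ns = now_ns();
    e->end_ns = e->start_ns;
}

/* Call immediately after the block has finished. */
void prof_end(prof_event_t *e)
{
    assert(e);
    e->end_ns = now_ns();
}

/* Positive result: the block took longer on the hardware than predicted. */
int64_t prof_deviation_ns(const prof_event_t *e, int64_t predicted_ns)
{
    assert(e);
    return (e->end_ns - e->start_ns) - predicted_ns;
}
```

In use, prof_begin and prof_end would bracket each scheduled block on each core, and the collected events would be compared with the calculated schedule after the run.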

The differences can be visualized in emmtrix Parallel Studio. Fig. 4 (see PDF) shows the difference between the calculated and the measured schedule on the first processor core. A notable deviation occurs for the interface code (BB3). These deviations can then be taken into account in a subsequent pass.

Support for heterogeneous systems

The presented workflow for parallelization can also be easily applied to FPGA accelerators. In our example, we now want to execute one of the two convolution filters on the FPGA (see Fig. 5, PDF). By right-clicking on Block30 or the conv2_2 function, it can be manually assigned to the FPGA. In the background, the C subprogram is then synthesized as a hardware accelerator using Xilinx high-level synthesis [4]. After a synthesis run of approximately one minute, initial performance information is available, which can then be used for scheduling.

The freed-up capacity on the second processor core can then be used, for example, by splitting the first convolution filter across two processor cores, thereby further improving the overall performance of the application.

Summary

This article introduced the emmtrix tools, which simplify the programming of embedded heterogeneous multicore systems by automating the otherwise time-consuming parallelization process. Evaluations have shown that this can reduce development effort by 50 to 80% [5].

Using an example from industrial image processing, a MATLAB® application was optimized for a Zynq system. First, the MATLAB® code was translated into sequential C code, and the performance of the program components was displayed in a hierarchical view. This hierarchical view allows for program analysis and control of parallelization. For example, the user can specify on which processor core a particular application component should be executed. Additionally, individual program components can be offloaded to the FPGA. A scheduling view then displays the precise sequence of the parallelized application. Parallel C code is then generated automatically. This code can be compiled directly for the target architecture and executed on it, thus enabling rapid prototyping.

Bibliography and list of sources

[1] "Next Generation Embedded Hardware Architectures: Driving Onset of Project Delays, Cost Overruns, and Software Development Challenges", VDC Research, September 2010.

[2] "ZedBoard™ – Xilinx Zynq®-7000 All Programmable SoC" [Online]

[3] "emmtrix Parallel Studio", emmtrix Technologies GmbH, 2016 [Online]

[4] "Vivado High-Level Synthesis", Xilinx Inc. [Online]

[5] "ALMA Project, Test cases evaluation report", 2015 [Online]

