Efficient embedded multicore programming

Automatic parallelization of Scilab/MATLAB applications

Authors: Oliver Oey, Timo Stripf, emmtrix Technologies

Contribution – Embedded Software Engineering Congress 2015

Due to ever-increasing performance demands, multi-core processors are being used in more and more areas instead of single-core processors. This shift has already taken place in the realm of desktop PCs and smartphones, but the transformation is still underway in the field of embedded systems. Parallel program execution can increase performance while simultaneously reducing power consumption. However, parallel programming has thus far been time-consuming and expensive, and requires specialized knowledge of the target systems. Within the ALMA-EU project, a consortium of research and industry partners has developed a toolchain that significantly simplifies parallel programming. Using automatic parallelization, sequential Scilab/MATLAB code is parallelized for embedded multi-core processors. This not only eliminates the need for time-consuming manual parallelization but also allows the code to be reused on different processors.

Motivation

According to studies [1], programming embedded multi-core systems is 4.5 times more expensive, takes 25% longer, and requires 3 times as many software engineers as programming single-core systems. The goal of the EU project ALMA was to reduce these demands on the developer. To this end, development was divided into two areas: the pure programming of the algorithm and the adaptation to the given hardware. ALMA therefore investigated automatic parallelization and code generation from array-based programming languages. These include MATLAB and similar languages such as Scilab, which stay very close to the purely mathematical description and use matrices as the standard data type. For this reason, they are also well suited to users without extensive programming experience. The programmer can concentrate on developing the algorithm, while the adaptations to the target hardware are handled automatically by the toolchain.

ALMA – Project Overview

The ALMA toolchain, as presented in [2], is divided into several subcomponents, which are shown in Figure 1 (see PDF). The application files written in Scilab/MATLAB, along with an abstract architecture description, are used as input files. The Matrix FrontEnd first converts the input code into sequential C code, which serves as the basis for parallelization. The parallelization tools then generate optimized code for the target architectures by parallelizing the code and adapting it to the specific characteristics of the target platforms, as defined in the architecture description. In further optimization steps, runtime information obtained with the multicore simulator can be used to improve hardware utilization.

Evaluation

Two target architectures and two test applications were used to evaluate the approach. The architectures were the research processor KAHRISMA [3] and the X2014 system from Recore Systems. Both platforms rely on distributed memory and multiple independently operating cores. One application came from the image processing domain at Fraunhofer IOSB, where object recognition is performed using the SIFT algorithm. The other came from the telecommunications domain at Intracom Telecom, where parts of the WiMAX standard are implemented in software.

ALMA Toolchain & Workflow

Scilab and MATLAB are scripting languages, meaning that instructions are interpreted and executed sequentially by a runtime environment. As a result, properties such as the size or type of a variable may only be determined at runtime. Reproducing this dynamic behavior in C is possible, but it is impractical because of its cost in performance and memory consumption. The first step is therefore to translate the general MATLAB code into C code that is well suited to static analysis and subsequent parallelization. All dynamic decisions are resolved, allowing for a better analysis of the static program flow. For each variable, its use across the entire program is analyzed to determine the data type needed to represent all of its values without restrictions. This data type is then used in the C code, which reduces both dynamic decisions and memory consumption. Another advantage of MATLAB for parallelization is the absence of pointers. This allows the data flow within a program to be determined unambiguously, which is crucial for data transfer between different cores.

The abstract hardware description relies on an architecture description language (ADL) developed within the project [4]. A key feature is its support for descriptions at different levels of abstraction. This allows hardware modules to be represented both in a purely functional way and in detail, including cycle-accurate instructions. In this way, both information necessary for hardware simulation and more abstract information required for parallelization decisions can be represented.

To achieve good performance in a parallelized application, parallelization must operate at two different levels: a fine and a coarse level. The fine level aims to optimize execution on a single core, while the coarse level optimizes simultaneous execution on multiple cores. This combination allows for the most efficient use of the available hardware.

Fine-grained parallelism extraction first analyzes the required data types to strike a balance between efficient hardware utilization and the required accuracy of the results. This allows the exploitation of the architecture's SIMD (single instruction, multiple data) units by, for example, performing four 8-bit additions concurrently instead of one 32-bit addition. As shown in [5], it is also determined whether numbers can be represented in a fixed-point representation. While this can improve execution performance, it impacts accuracy. Furthermore, loops are transformed to better adapt data access to the available caches.

Coarse-grained parallelization, as presented in [6], distributes the application across the individual cores of the target platform. The goal of this optimization is to reduce the execution time of the entire program by utilizing as many parallel computing units of the architecture as possible simultaneously. Parallelization relies on a hierarchical task representation of the program. Each control flow construct within the program flow, such as a loop or condition, adds a new level to the hierarchy. An example is shown in Figure 2 (see PDF). The representation allows parallelization to be performed both top-down and bottom-up. In this way, an optimal distribution of tasks across the individual cores can be determined for each level, and the optimal overall solution can subsequently be derived from these.

Parallel code generation [6] ultimately produces parallel C code that can be executed on the hardware or its associated simulators. To achieve this, code generation creates separate source code for each core of the hardware platform by resolving data dependencies between the individual processors via communication instructions. Communication synthesis focuses on minimizing waiting times and is optimized for, but not limited to, distributed memory systems. Both common communication models such as MPI (message passing interface) and platform-specific features can be used.

The generated code can be automatically instrumented to determine runtime information using a simulator. This information can be used to iteratively improve parallelization and thus performance [7]. To this end, the measured runtimes of the individual code segments and of the data transfers are fed back into the coarse-grained parallelization, enabling a partitioning that is better adapted to the actual hardware.

Results

The toolchain is designed to fulfill two objectives: Firstly, it should enable the exploitation of the performance capabilities of multi-core systems through meaningful parallelization of the application. Secondly, it should increase productivity by reducing the development effort for parallel embedded systems.

Figure 3 (see PDF) illustrates the reduction in development effort determined in the project. A manual implementation of an existing algorithm in parallel C code was compared to the use of the parallelization tool developed in ALMA. In the selected examples, savings of 30 to 57% in development time were achieved. As shown in Figure 4 (see PDF), a speedup of 2.8 can be achieved when using four cores.

Summary and Outlook

The presented ALMA toolchain simplifies the programming of embedded multicore systems by automating the time-consuming parallelization process. The evaluation showed that development effort could be reduced by up to 57% and that a speedup of 2.8 can be achieved when using four cores.

ALMA technology is made available to industry, further developed, and adapted to specific customer requirements by the EXIST-funded KIT spin-off "emmtrix Technologies" (www.emmtrix.com). The technology enables companies to more easily deploy multi-core processors in embedded systems. Development costs are reduced, and time-to-market is shortened. For developers, automation and the integration of specialized knowledge simplify the programming of embedded multi-core systems.

emmtrix relies on an industry-established workflow in which algorithms are developed in MATLAB and translated into C code for hardware execution. For single-core processors, this translation is done automatically using a code generator. For multi-core processors, however, manual reimplementation is currently required. The emmtrix solution is fully integrated into this workflow (see Figure 5, PDF) and allows the developer to efficiently implement the MATLAB code on embedded multicore systems.

Bibliography and list of sources

[1]	"Next Generation Embedded Hardware Architectures: Driving Onset of Project Delays, Costs Overruns, and Software Development Challenges," VDC Research, September 2010.
[2]	J. Becker, T. Stripf, O. Oey, M. Huebner, S. Derrien, D. Menard, O. Sentieys, G. Rauwerda, K. Sunesen, N. Kavvadias, K. Masselos, G. Goulas, P. Alefragis, N. Voros, D. Kritharidis, N. Mitas and D. Goehringer, "From Scilab to High Performance Embedded Multicore Systems: The ALMA Approach," in Digital System Design (DSD), 2012 15th Euromicro Conference on, 2012.
[3]	R. Koenig, L. Bauer, T. Stripf, M. Shafique, W. Ahmed, J. Becker and J. Henkel, "KAHRISMA: A Novel Hypermorphic Reconfigurable-instruction-set Multi-grained-array Architecture," in Proceedings of the Conference on Design, Automation and Test in Europe, Dresden, Germany, 2010.
[4]	T. Bruckschloegl, O. Oey, M. Ruckauer, T. Stripf, and J. Becker, "A Hierarchical Architecture Description for Flexible Multicore System Simulation," in Parallel and Distributed Processing with Applications (ISPA), 2014 IEEE International Symposium on, Milan, Italy, 2014.
[5]	G. Deest, T. Yuki, O. Sentieys, and S. Derrien, "Toward scalable source level accuracy analysis for floating-point to fixed-point conversion," in Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD '14), Piscataway, NJ, USA, 2014.
[6]	G. Goulas, C. Valouxis, P. Alefragis, N. Voros, O. Oey, T. Stripf, T. Bruckschloegl, J. Becker, C. Gogos, A. El Moussawi, M. Naullet, and T. Yuki, "Coarse-Grain Optimization and Code Generation for Embedded Multicore Systems," in Digital System Design (DSD), 2013 Euromicro Conference on, 2013.
[7]	J. Becker, T. Bruckschloegl, O. Oey, T. Stripf, G. Goulas, N. Raptis, C. Valouxis, P. Alefragis, N. Voros, and C. Gogos, "Profile-Guided Compilation of Scilab Algorithms for Multiprocessor Systems," in Reconfigurable Computing: Architectures, Tools, and Applications, Springer International Publishing, 2014, pp. 330-336.
