Hardware and software aspects for optimizing embedded system designs

Increased system reliability through the selection of ECC-capable processors and memory.

Author: Kei Thomsen, MicroSys Electronics GmbH

Contribution – Embedded Software Engineering Congress 2016

With highly integrated and powerful system-on-chip solutions, intelligence is increasingly migrating down to the sensor level of complex embedded applications. Achieving reliable design even in small system structures, coupled with high performance and low power consumption, remains a crucial competency in modern systems engineering.

This presentation/article compares ARM, PowerPC, and x86 platforms with the same clock speed in terms of overall performance and system safety. Using two C code examples that appear almost identical at first glance, it explains how well-designed programming can achieve performance increases of up to 30 times. The ever-shrinking chip structures highlight the importance of ECC (Error Correcting Code) memory and processor support. Furthermore, new insights into NAND flash memory are presented, including methods such as scrubbing.

Part 1: CPU platform comparison regarding system performance

C code example: "good" and "bad" programming in memory accesses
Results of the C example on ARM, PowerPC and x86 platforms with CPUs clocked at approximately the same speed.
Causes and differences, and what does that mean for overall performance?

Part 2: Storage Architectures and System Reliability

Background information on ever smaller chip structures in memory and associated sources of error
Increased reliability for CPU platforms and memory through ECC
NAND (SLC, MLC, TLC) flash memory and its protection against data loss

1. CPU platform comparison regarding system performance

There is ongoing debate about which processor is more performant and better for a given application. The debate usually centers on x86, PowerPC, and ARM in their various iterations. First of all: Processors with the same clock speed are roughly equally fast, give or take a few percent. Since most local data resides in the cache, differences are barely noticeable there. However, as soon as data is transferred to external RAM, bus width, memory type, and the cache-RAM connection become crucial. The following sections will show whether and where a difference exists.

C code example: "good" and "bad" programming in memory accesses

But first, let's look at the C code example used to perform the measurement. Why "good" and "bad" programming? Often, people simply program and accept the result as a given. With a little understanding of how the processor works with the cache and RAM, memory-intensive functions can run significantly faster.

As an example, two almost identical C functions are used here.

long array[1024][1024];
long b = 0;
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
        b += array[j][i];   /* inner loop steps down a column ("vertical") */

C Source Code 1

long array[1024][1024];
long b = 0;
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
        b += array[i][j];   /* inner loop steps along a row ("horizontal") */

C Source Code 2

Results of the C example on ARM, PowerPC and x86 platforms with CPUs clocked at approximately the same speed.

Each measurement runs the C loops 100 times, i.e. it reads and sums the 4 MB array 100 times.

CPU	Clock (MHz)	Vertical (left), ms	Vertical, MB/s	Horizontal (right), ms	Horizontal, MB/s	Factor vertical/horizontal
ARM Cortex-A9 NXP i.MX6	800	19782	20	1209	338	16
ARM Cortex-A9 Xilinx Zynq	666	20077	20	1399	292	14
PowerPC QorIQ P2020	1200	41415	9	1351	303	30
Vortex86-DX	800	15676	26	5246	78	3
Intel i7	2100	1480	270	376	1080	4

Table 1

Table 1 shows the results for several CPUs as examples. Other CPUs with different memory configurations will naturally produce different values! The acceleration factor (right column, Table 1) provides important insights. This varies between 3 and 30. Therefore, "good" programming can often be far more important than a faster CPU.

The cause of the differences and what this means for overall performance

Table 2 (see PDF) shows the execution of C Source Code 1 and illustrates the reading of data in vertical order. Table 3 (see PDF) shows the execution of C Source Code 2, which reads the data in horizontal order.

First, let's analyze Table 3. A 32-bit value is to be loaded from address $0000. This value is not in the L1/L2 cache, so a burst read of 32 or 64 bytes (depending on the cache line size) is performed to fetch it from RAM. This takes significantly longer than reading from the cache. The CPU waits (stalls) until the value has arrived in the L1 cache and can be read. Subsequent accesses to $4, $8, $C, etc., are then served from the L1 cache and are therefore extremely fast.

In Table 2, however, $0 and then $1000 are read. The CPU must therefore wait for the RAM again on every access, because the next value is not yet in the cache. Typically, L1 caches are 32 KB in size and hold 1024 cache lines of 32 bytes each. In this example, exactly 1024 lines are read per pass, which fills the cache precisely. The remaining data in those lines can therefore be read from the cache on the next iteration.

Furthermore, the MMU (Memory Management Unit) plays a crucial role in system performance. Each MMU page is typically 4 KB in size (ARM can also use 64 KB). Therefore, one entry points to exactly one data row of 1024 x 32 bits. Since the address $0 is not found in the MMU table, an exception is thrown to reload the MMU entry for this address in software, which consequently happens with every access. Here, the PowerPC has a slight disadvantage, as it does not use L1/L2 tables like ARM and x86, but instead has 512 entries of 4 KB each. This results in an exception being generated for the row in every iteration, which explains the worse performance factor of 30 (Table 1) for the PowerPC. If the example is slightly modified (512 x 512 instead of 1024 x 1024), the PowerPC's performance factor drops to 8, while ARM still remains at 10-12. Therefore, the conclusion is: Design and optimize your benchmark to achieve your desired result!

Summary of memory access: The impact of "poor programming" on performance depends largely on the architecture. With "good programming," all processors are roughly equally fast (given the same memory types) and depend only minimally on CPU clock speed (666 / 800 / 1200 MHz in the example). CPU speed only becomes a significant factor when the workload is computationally intensive and the majority of the data resides in the L1 cache. The Intel i7 processor stands out because its memory is connected via a 128-bit dual-channel interface, allowing it to fill a cache line in two cycles, a task that takes the other systems eight cycles.

2. Storage architecture and system reliability

Background information on ever smaller chip structures in memory and associated sources of error

Modern semiconductor structures are becoming ever smaller while simultaneously enabling ever greater storage capacity. This applies to RAM (DDR3/4) as well as NAND flash memory. Let's first consider RAM with regard to system reliability.

The current chip structure of DDR4 RAM is 20 nm, and it operates at a voltage of 1.2 V to keep the charge being moved low (< 20,000 electrons). This is necessary to reduce power consumption and increase speed. However, it also creates the risk of soft errors caused by cosmic and radioactive radiation, as well as by high-energy interference from electric motors and static fields. These effects are known as Single Event Upsets (SEUs). In addition, the RAM timings are often not precisely calculated and set. Since the timing is temperature-dependent, bit errors can occur at high or low temperatures. Without appropriate countermeasures, this can be quite dangerous for safe system operation! For safety-critical applications, the use of ECC (Error Correcting Code) memory is a suitable measure.

Increased reliability for CPU platforms and memory through ECC

To detect and correct memory errors, hardware ECC is used in most cases. This involves extending the 64-bit RAM by 8 bits to 72 bits (for a 32-bit RAM, 7 bits are added, giving 39 bits). This allows for the direct correction of a 1-bit error and the detection of a 2-bit error, which of course requires support from the memory controller. However, by no means all processors today have such an ECC-capable memory controller on the chip (System on Chip, SoC). Almost all PowerPC processors have an ECC-capable controller. For ARM CPUs, this is the exception, and for x86 CPUs, it is usually only the larger server chipsets and rarely the embedded SoCs that are equipped with this function.

If the SoC supports ECC and the correct number of RAM chips is present (typically 5 or 9 on the board), then this feature should definitely be used. The immediate question is how much performance this costs. Based on our experience with PowerPC boards and ARM Xilinx Zynq systems, the loss in memory performance is less than 5%. It's advisable for the operating system to support ECC interrupts to make errors visible. The memory controller can trigger an interrupt for a single-bit error that can be corrected with ECC. During interrupt handling, the memory cell is read and rewritten, since the read data is (still) correct. However, this isn't possible with a multi-bit error, and a specific decision must be made regarding how to handle it: safe state, error message, reboot, etc. This can be easily tested, as every ECC-capable controller also offers error injection to simulate errors.

Why is ECC memory so important for system reliability? A single-bit error occurs in memory, for whatever reason. With ECC memory, a message is issued, and the corrected data is written back—all good! Without ECC memory, at best the data is only slightly incorrect; at worst, the error occurs in the operating system code, and a jump doesn't go forward by 100 bytes, but backward by 100. Cheers to that!

NAND (SLC, MLC, TLC) flash memory and its protection against data loss

Things get really scary, however, when NAND flash memory is to be used in reliable embedded systems. It is found in all common storage media such as USB sticks, CF cards, SD cards, microSD cards, and SSDs. Normally, one would assume: 1 memory cell = 1 bit. This is only true for SLC flash chips. SLC = Single Level Cell, MLC = Multi Level Cell, TLC = Triple Level Cell.

MLC and TLC technologies are particularly problematic. With MLC (Black Magic), 2 bits are stored per cell, so each cell must hold one of four charge levels: 0, 0.33, 0.66, and 1. With TLC (Alien Technology), it's even 3 bits, so each cell is charged to one of 8 different levels.

One can imagine the effort required within the chips to distinguish between these minute charge differences. To maintain data accuracy, the ECC is stored in the out-of-band (OOB) data, which is an additional data area alongside the actual data (16 bytes OOB / 512 bytes user data). With SLC, typically 3 bytes per 512 bytes are used for ECC; with MLC and TLC, almost the entire 16-byte area is used. Most NAND flash controllers on SoCs use these 3 bytes to generate a 1-bit error correction in hardware. If more ECC bits are needed, this must be handled in the driver software. Since the cells can only withstand a certain number of write/erase cycles (SLC ~100,000, MLC ~10,000, TLC ~3,000–5,000), the data must be evenly distributed across the NAND blocks using wear leveling. In SD, microSD, CF, USB flash drives, and SSDs, this task is performed by the built-in flash controller. For soldered NAND flash memory, a suitable file system such as UBIFS, JFFS2, or YAFFS2 manages this.

Several years ago, NAND flash memory reached storage densities per mm² where individual bits would flip on their own after a few months, even in sectors that are only read once or twice a day during boot. After persistent inquiries with the manufacturers, one eventually receives the information that this is within the expected range. And this is for use in industrial environments!

This isn't mentioned anywhere in datasheets. We noticed it because after 12 months, a NAND flash drive suddenly became unreadable by Linux. This was due to two bits flipping in a 512-byte sector, which couldn't be corrected by the 1-bit ECC. A solution is to regularly (every 1-2 months) read the entire NAND flash drive and check the error status of the NAND flash controller.

If the controller reports that it has corrected a bit, this sector can simply be overwritten and checked again. If this information cannot be obtained from the controller, then only regular scrubbing, in which the sectors are read and overwritten, will help. Ideally, external flash storage media should do this automatically. But how does, for example, an SD card that has been sitting in a cupboard as a backup for three years manage this?

A partner company discovered that, for example, newer SanDisk CF cards (>= 2 GB) apparently no longer perform wear leveling across the entire area, but, for speed reasons, only across the area required for FAT32 as typically used in cameras. Other file systems such as EXT2/3/4, OS-9, QNX, etc., whose data does not lie within the wear-leveled area, are therefore at a disadvantage: such cards cease to function after a few thousand accesses.

Our market research revealed only a few CF cards suitable for industrial use that utilize SLC technology and wear leveling across the entire range. The use of MLC or TLC technologies is therefore strongly discouraged for industrial applications. As for SD cards... no idea!

Now you're probably wondering how this relates to the increasingly widespread SSD hard drives?

Yes, SSDs are based on NAND flash memory. There are SLC, MLC, and TLC versions. The louder the advertising claims of "groundbreaking 3D NAND technology," the more cautious you should be, as this usually refers to MLC or TLC. They are certainly inexpensive, but what about their long-term reliability?

SSDs with SLC, on the other hand, require some searching and are no longer cheap, but are certainly suitable for industrial applications. During my research for this article, I contacted the E4You association (a network of more than 30 embedded solution providers) to ask if anyone had any particular experience with SSD technology. And the response came promptly: "We've been using SSDs in 10 computers for development for a year and a half, and within two weeks, three SSDs failed completely. They were unreadable. Luckily, we had backups for two of them that were no more than a week old."

I would like to draw your attention to a test that attempted to cyclically write 1 petabyte of data to 250GB SSDs (1 petabyte = 1024 terabytes). In this test, 3 out of a total of 6 SSDs could only be written to just over 700 terabytes before failing completely. Detailed information about these tests can be found on the following website (see PDF).

3. Summary & Discussion

Due to ever smaller structures and the use of "alien technologies," safeguards in the SoC, the operating system, and the file systems are becoming ever more important. In industrial applications under extreme conditions, such as temperature extremes and interference fields, ECC memory should be used whenever possible, as otherwise the consequences can be disastrous. The minimal performance loss is more than compensated for by the massive improvement in stability and system reliability. Since ECC has long been standard in PowerPC processors, these processors remain the first choice for applications with special requirements, at least within our company.

Most ARM processors, on the other hand, are built for commercial use and rarely have an ECC-capable memory controller. For industrial applications, only NAND flash memory with SLC technology can truly be recommended. Anything else would be a false economy. As described at the beginning, there's no substitute for performance except more performance. But that doesn't come solely from the hardware; it also comes largely from well-written programs.

4. Sources

DDR RAM
Soft Error
ECC
Heavy Ion sensitivity of 16/32-Gbit NAND Flash and 4-Gbit DDR3 SDRAM, ESA February 3, 2012
SLC, MLC, TLC
CF card problem: ESD report
SSD Petabyte Club

Download the article as a PDF file

