
Dynamic Memory Allocation: Justifiably Taboo?

Author: Steven Graves, McObject LLC

Paper - Embedded Software Engineering Kongress 2015

 

Abstract

Developers of fault-tolerant embedded systems must identify and eliminate possible failure points. Dynamic memory allocation is one key concern. A sound approach contributes to predictable and robust systems, while inattention can lead to instability, slow and/or unpredictable performance, or failure. This paper argues that dynamic allocation is acceptable only in non-critical portions of fault-tolerant embedded systems, and then only when the technique’s risks can be successfully mitigated. Fault-tolerant systems should instead employ custom memory allocators that are more precisely suited to the application’s specific allocation patterns. Custom memory managers presented in the paper include block, stack, bitmap and thread-local allocators. The solutions presented retain the power and flexibility of dynamic memory management while mitigating common risks such as fragmentation and memory leaks, and improving efficiency and performance.

Introduction

Users have different quality expectations for embedded systems than for business and desktop applications.  We wouldn’t tolerate our cell phone or set-top box failing regularly, or even once in a while.  And some embedded systems, such as the avionics that enable a jet to fly safely, are expected to be even more dependable: they must be fail-safe. In short, the effect of errors, unpredictable behavior and degraded performance in embedded systems ranges from a bad user experience and eroded customer loyalty, to potential loss of life – none of which is acceptable to organizations releasing this technology out into the world.

So what makes embedded applications, ranging from consumer electronics to mission critical aerospace and industrial control, more dependable?  Apart from reliable hardware platforms, it is the software components of those systems: the operating system, middleware and applications.  In embedded environments, these components should be immune from crashes. And if an application does fail, the failure shouldn’t affect other applications or the operating system.  In an embedded setting, this requirement is not always so easy to satisfy, because the applications are often implemented as separate tasks that run within the same address space. Finally, the operating system and application software should not introduce unpredictable latencies. That is, they should exhibit predictable performance.

Memory Management and Performance, Predictability and Resilience

Embedded systems’ reliability imperative requires a close examination of all aspects of software development, and how approaches to concepts such as scheduling, synchronization across multiple tasks and processes, locking strategies, deadline management, and memory management affect quality. For example, unpredictable latencies can be introduced by techniques such as message passing or garbage collection. Such approaches should be avoided or their unpredictable latencies mitigated.

This paper focuses on the fundamental programming concept of memory management. Safe, predictable and efficient program execution is often the result of sound memory management, while the wrong practices, or inattention to memory management, can result in slow and/or unpredictable performance (which may be a result of memory fragmentation) as well as instability or failure due to memory leaks.

The impact of badly written or improperly used memory management can vary in severity. Less severe might mean slow or unpredictable performance in specific circumstances, such as when an application is moved from a single-core to a multi-core environment. At the more severe end of the spectrum, a high level of memory fragmentation may increasingly degrade performance, resulting finally in application failure. Even more severe could be system-wide instability or failure due to memory leaks.

Software designers who develop safety-critical applications for airborne systems are well aware of the risks of unsound memory management. Industry norms dictate that safety-critical applications should avoid using techniques that could introduce instability. Those readers involved in creating airborne systems should be familiar with the DO-178B standard, used by the U.S. Federal Aviation Administration to certify avionics software. Among other strictures, the DO-178B document states:

              "Software Design Standards should include…constraints on design, for example, exclusion of recursion, dynamic objects, data aliases, and compacted expressions."

[DO-178B, Software Considerations in Airborne Systems and Equipment Certification, RTCA, Inc. (jointly developed with European Organisation for Civil Aviation Equipment, EUROCAE)]

"Dynamic Objects" in this quote refers to objects created in the application through dynamic memory allocation, a technique in many programming languages, including C, in which system memory is allocated to processes on an as-needed basis at run-time.

An Overview of Dynamic Memory Management

Dynamic memory allocation was popularized with the C programming language and Unix systems in the late 1960s and early 1970s, and is present in virtually all modern operating systems and programming languages.

To implement dynamic memory allocation, the C runtime library provides malloc() and free() functions that allow applications to allocate and release memory as needed during a program’s execution. Free memory is colloquially called the "heap".  The function of the memory allocation mechanisms used "under the covers" in dynamic allocation (we will refer to these mechanisms as "allocators" or "memory managers") is to organize the heap in some coherent way. They must satisfy requests for dynamic memory allocation by assigning a portion of the heap to a requesting task, and return memory to the heap for subsequent reuse when the task relinquishes it.
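
For example, a task might allocate and release a small object at run-time as follows (sample_t and record_sample are illustrative names, not part of any particular system):

    #include <stdlib.h>

    typedef struct {
        double timestamp;
        double value;
    } sample_t;

    int record_sample(double ts, double v)
    {
        sample_t *s = malloc(sizeof *s);    /* request memory from the heap at run-time */
        if (s == NULL)
            return -1;                      /* the heap could not satisfy the request */

        s->timestamp = ts;
        s->value     = v;
        /* ... use the object ... */

        free(s);                            /* return the memory to the heap for reuse */
        return 0;
    }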

An allocator keeps track of which parts of memory are in use and which are free. A design goal of any allocator is to minimize wasted memory space, balancing the amount of wasted space against the time and processing resources needed to recover it. A major objective of allocators is to limit the fragmentation that occurs when an application frees memory blocks in a random order.

In dynamic allocation, the heap is commonly managed with what is known as a "list allocator".  It organizes the heap into a singly-linked chain of pointers, where each link in the chain points to a free block of memory.  Each free block (in a 32-bit system) requires 8 bytes of overhead that we call "meta-data": 4 bytes for the chain pointer, and 4 bytes to record the size of the free block (see image, PDF).

To satisfy a request for memory allocation, the allocator walks the chain of pointers until it finds a free block that is large enough to satisfy the allocation request. If it cannot find a block of memory large enough, it returns a NULL pointer. If it finds a block large enough, and if the block is exactly the right size, it unlinks the block from the chain and returns a pointer to it back to the application. Otherwise, the block is divided into one "chunk" that is the requested allocation size, and a remaining piece. The remainder is linked back into the chain and the pointer to the allocated memory is returned to the application. When memory is subsequently freed, it is linked back into the chain. If the newly freed block happens to be adjacent to an already free block, then the free blocks are joined together to form a single larger block, minimizing fragmentation (see below).
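
A much-simplified first-fit list allocator might look like the following C sketch (32-bit assumptions; the structure and function names are illustrative, and both the bookkeeping that free() would need to learn an allocated block's size and the coalescing of adjacent free blocks are omitted for brevity):

    #include <stddef.h>
    #include <stdint.h>

    /* Meta-data kept for each free block: 8 bytes on a 32-bit target. */
    typedef struct free_block {
        struct free_block *next;   /* 4 bytes: link to the next free block */
        uint32_t           size;   /* 4 bytes: size of this free block     */
    } free_block_t;

    static free_block_t *free_list;   /* head of the chain of free blocks */

    void *list_alloc(uint32_t size)
    {
        free_block_t **link = &free_list;

        /* Walk the chain from the front until a block large enough is found. */
        for (free_block_t *blk = free_list; blk != NULL;
             link = &blk->next, blk = blk->next) {
            if (blk->size < size)
                continue;

            if (blk->size >= size + sizeof(free_block_t)) {
                /* Split: hand out the front of the block, re-link the remainder. */
                free_block_t *rest = (free_block_t *)((uint8_t *)blk + size);
                rest->size = blk->size - size;
                rest->next = blk->next;
                *link = rest;
            } else {
                /* Exact (or near-exact) fit: unlink the whole block. */
                *link = blk->next;
            }
            return blk;
        }
        return NULL;   /* no free block large enough: the request fails */
    }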

Risks of Dynamic Memory Management

De-fragmentation strategies used by the general purpose allocation mechanisms that underlie malloc() and free() can cause non-deterministic behavior and/or impose CPU overhead that is too high for embedded systems.  Application developers tend to use dynamic memory allocation liberally, without thinking about such effects. A significant drawback is that standard allocators are not limited in their consumption of physical or virtual memory. Their de facto limit is the total amount of system memory – introducing a risk of running out of memory.

One of the greatest risks in dynamic memory allocation is failure to diligently relinquish (free) allocated memory when it is no longer needed. This results in memory leaks that, no matter how much system memory is available, will cause the system to become progressively slower and eventually stop (or crash) altogether. Dynamically allocated memory needs to be carefully released when it is no longer needed or when it is out of scope.

Memory fragmentation is the phenomenon that occurs when a task allocates and de-allocates memory in random order. For the sake of simplicity, assume that the "heap" is 100 bytes and memory is allocated in the following order:

A.     10 bytes
B.     46 bytes
C.     13 bytes
D.     15 bytes

This leaves 16 bytes of our original 100 bytes free (see table, PDF).

Subsequently, the allocation for 13 bytes (C) is released (see table, PDF).

And an allocation request is made for 21 bytes. There are 29 bytes of memory free, in the aggregate, but the allocation request cannot be satisfied: due to fragmentation, there is no free hole of 21 contiguous bytes. This is the essence of fragmentation. The longer a system runs, the more fragmented the heap becomes due to the randomness of allocating and freeing memory.
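
Laid out end to end, the 100-byte heap from this example looks roughly as follows after C has been freed (offsets in bytes); the two free holes of 13 and 16 bytes are not adjacent, so they cannot be combined to satisfy the 21-byte request:

    offset:  0          10                        56            69               84          100
             | A (10)    | B (46)                  | free (13)   | D (15)         | free (16)  |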

List Allocators: A Closer Look

List allocators used by malloc() and free() are, by necessity, general purpose allocators that aren’t optimized for any particular memory allocation pattern.  Compared to allocators that take specific allocation needs into account, these general purpose mechanisms are more likely to introduce unpredictability, high performance cost, and excessive resource (particularly memory) consumption. To understand, let’s examine commonly used list allocation algorithms: first-fit, next-fit, best-fit and quick-fit.

The first-fit algorithm always walks the chain of pointers from the beginning of the chain, and therefore attempts to allocate memory from the "front" of the heap first, leaving large free holes in the back.  Therefore, this algorithm minimizes fragmentation, but at the expense of unpredictable performance: the longer the system runs, the longer the chain becomes, and the allocator typically walks a greater distance before finding a free hole large enough.

The next-fit algorithm attempts to smooth the performance at the expense of greater fragmentation.  This algorithm will walk the chain from wherever it last left off, so it will allocate memory from more-or-less random locations in the heap.

The best-fit algorithm attempts to minimize fragmentation, but again at the risk of unpredictable, and potentially very bad, performance.  This algorithm walks the chain until it finds a free hole that is exactly the right size, or is as close as possible to the allocation request.  It could find the perfect free hole immediately, or it might have to walk the entire chain.

The quick-fit algorithm maintains a separate list of the most commonly requested allocation sizes, as well as pointers to free holes in the heap that match those commonly requested sizes.  It increases the overhead (i.e. there’s more meta-data to be managed), but it can provide better, and more consistent, performance for common allocation request sizes.

Pulling it together

In summary, general purpose allocators:

  •      impose no limit on how much memory they consume
  •      have unpredictable and potentially unacceptable performance
  •      suffer from fragmentation
  •      can impose excessive overhead


Avoiding general purpose allocators is generally a good strategy for embedded and real-time systems.  The drawbacks of these allocators are not surprising: after all, they must, by definition, satisfy a variety of application scenarios and allocation patterns. It stands to reason that a tool designed to address every scenario will fail to fill some needs precisely. And as discussed above, embedded systems are held to higher standards than non-embedded applications in areas including performance, predictability and reliability. They need software tools and components that do their jobs precisely.

Mitigating the Risks with Custom Memory Managers

When a task makes an allocation request (calls malloc), control is passed to the C run-time library.  Malloc is a software "black box". The embedded system developer cannot predict what will happen. Will the allocation request succeed or fail? How many clock cycles will transpire before control is returned to the calling function?

In contrast, the allocators that best serve embedded systems concentrate on just a few allocation patterns, and ONLY those patterns required by the system. Therefore, efficiency is gained by using allocators that are optimized for these patterns. To mitigate the fact that general purpose allocators have no limits (i.e. will allow an application to exhaust all available memory), embedded systems developers should put every task "in a box", essentially fencing off tasks from one another.  In other words, force the task to operate within a predefined memory arena that’s large enough for the task to accomplish its purpose but small enough that when memory utilization approaches 100%, it still can’t adversely affect the operating system or other tasks.

To accomplish this, the developer should start with the assumption that responsibility for memory management must be taken away from the standard memory management routines (e.g. the C run-time library's malloc and free functions) and assigned to the application.

Within every application, it is sometimes possible to separate non-critical application activities from critical ones. For non-critical tasks, the general purpose, standard dynamic allocator exposed by the C runtime can still be used; malloc and free can be introduced for these tasks, provided that memory leaks that might be introduced will not affect safety-critical parts of the system.

For the safety-critical tasks, the developer can replace the standard allocator with a custom allocator that sets aside a buffer for the exclusive use of that task, and satisfies all memory allocation requests out of that buffer. If the memory reserved for this buffer is exhausted, the task is informed. When the task receives such a notification, it must free up memory within the buffer, or find more memory.
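
One way to sketch this "box" in C is a statically reserved, per-task arena with a notification hook for the exhaustion case; the arena size, the names and the simple bump-style allocation below are illustrative assumptions rather than a specific implementation:

    #include <stddef.h>

    #define TASK_ARENA_SIZE (64 * 1024)          /* sized for the task's worst-case needs */

    typedef void (*oom_handler_t)(void);         /* invoked when the buffer is exhausted */

    typedef struct {
        unsigned char  buf[TASK_ARENA_SIZE];     /* memory reserved exclusively for this task */
        size_t         used;
        oom_handler_t  on_exhausted;
    } task_arena_t;

    void *arena_alloc(task_arena_t *a, size_t size)
    {
        size = (size + 7u) & ~(size_t)7u;        /* keep returned pointers 8-byte aligned */
        if (a->used + size > TASK_ARENA_SIZE) {
            if (a->on_exhausted != NULL)
                a->on_exhausted();               /* inform the task: free memory or find more */
            return NULL;                         /* the OS and other tasks are unaffected */
        }
        void *p = &a->buf[a->used];
        a->used += size;
        return p;
    }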

The nature of the custom allocators that replace malloc and free will differ according to memory usage patterns that occur in the application, and will be optimized to deliver predictability, performance, and efficiency when addressing these patterns. The remainder of this paper presents four examples of custom allocator algorithms: the block allocator, stack allocator, bitmap allocator and thread-local allocator. Custom memory managers can use any one of these algorithms or a combination of different algorithms.

Block Allocator

Applications use block allocators when they need to allocate a large number of small objects of the same size. Typical examples of this pattern include small scalar values such as the timestamp/value pairs that represent sensor data, the abstract syntax tree nodes built by various parsers, and the pages of an in-memory database. Block allocators handle such objects very efficiently with minimal overhead, and eliminate fragmentation by design.

The idea behind a block allocator is very simple. The allocator is given a quantity of memory, which we’ll call the "large block"; it divides this into equal-size pieces and organizes them in a linked list of free elements. To serve a request for memory, the allocator returns a pointer to one element and removes it from the list. When there are no more elements in the "free list", a new large block is selected from the memory pool using some other allocator (for example, a list allocator). The new large block again gets divided by the block allocator into equal-size elements. The elements are put into a new linked list, and allocation requests are handled from this newly created list.

When an object is freed, it is placed back into its original "free list". Since all allocated objects in a given list are of the same size, there is no need for the block allocator to "remember" each element’s size, or to aggregate smaller chunks into larger ones. Avoiding the aggregation of adjacent free blocks means less overhead and fewer CPU cycles. A block allocator also eliminates fragmentation by design: because all blocks are of equal size, freed blocks are not, and do not need to be, joined with adjacent free blocks.

There is also less overhead (this type of allocator is more efficient) because each block is a fixed, known size, so there is no need to carry additional meta-data (specifically, the size of the free block) in the linked list.  This frees up 4 bytes of memory per block in a 32-bit system.
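
A minimal C sketch of such a block allocator follows (the names are illustrative; the element size is assumed to be at least the size of a pointer, and obtaining new large blocks from another allocator is left to the caller):

    #include <stddef.h>

    /* A free element needs only a single pointer of meta-data; its size is implied. */
    typedef struct block { struct block *next; } block_t;

    typedef struct {
        size_t   elem_size;    /* fixed size of every element handed out */
        block_t *free_list;    /* chain of free elements                 */
    } block_allocator_t;

    /* Carve a "large block" into equal-size elements and chain them together. */
    void block_allocator_add_page(block_allocator_t *a, void *page, size_t page_size)
    {
        unsigned char *p = page;
        for (size_t off = 0; off + a->elem_size <= page_size; off += a->elem_size) {
            block_t *b = (block_t *)(p + off);
            b->next = a->free_list;
            a->free_list = b;
        }
    }

    void *block_alloc(block_allocator_t *a)
    {
        block_t *b = a->free_list;
        if (b == NULL)
            return NULL;            /* caller must supply a new large block here */
        a->free_list = b->next;     /* unlink the element and hand it out        */
        return b;
    }

    void block_free(block_allocator_t *a, void *ptr)
    {
        block_t *b = ptr;           /* no size lookup, no coalescing needed      */
        b->next = a->free_list;
        a->free_list = b;
    }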

The most basic block allocators satisfy allocation requests only for objects that fit into their pre-determined element size, making such algorithms useful only when the allocation pattern is known in advance (for example, when the application always allocates 16-byte objects). In practice, many memory managers, including database memory managers, need to satisfy several allocation patterns. So, to take advantage of the simplicity of the block allocator algorithm while at the same time dealing with this variability, a block allocator can be combined with some other technique into a hybrid memory manager.

For example, a block allocator can maintain multiple lists of different-sized elements, choosing the list that is suited for a particular allocation request. Meanwhile, the blocks themselves, and objects that exceed the chunk size of any of the blocks, are allocated using another general purpose allocator (for example a page allocator, or a list allocator). In such an implementation, the number of allocations (or objects) processed by the block algorithm is typically many times higher than those made by the general purpose allocator, and this can greatly enhance performance.

Stack Allocator

Memory allocators would be simpler to design, and perform better, if they only had to allocate objects, and not free them. The stack allocator is designed to follow this simple strategy.

Stack-based allocators fit when memory requirements can be divided into a first phase in which objects are allocated, and a second phase in which they are de-allocated.  In order to use the stack allocator, the number of allocations and the total amount of required memory should be (A) known in advance, and (B) limited.

Stack-based allocators use a concept similar to the application’s call stack.  When a function is called, memory space for its arguments and local variables is pushed onto the stack (by advancing the stack pointer).  When the function returns, the arguments and local variables are no longer needed and so the stack pointer is rewound to its starting point before the function call (i.e. the stack is "popped"). With the stack-based allocator, for each allocation a pointer is advanced.  When the memory is no longer required, the pointer is simply rewound to the starting point.
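
A minimal C sketch of the idea, assuming the total memory requirement is known in advance and reserved up front (all names are illustrative):

    #include <stddef.h>

    typedef struct {
        unsigned char *base;     /* start of the reserved region               */
        size_t         size;     /* total amount of memory, known in advance   */
        size_t         top;      /* the single "stack pointer"                 */
    } stack_allocator_t;

    void *stack_alloc(stack_allocator_t *a, size_t n)
    {
        n = (n + 7u) & ~(size_t)7u;         /* keep returned pointers aligned    */
        if (a->top + n > a->size)
            return NULL;                    /* the phase-one budget is exhausted */
        void *p = a->base + a->top;
        a->top += n;                        /* "push": just advance the pointer  */
        return p;
    }

    /* Phase two: release every allocation at once by rewinding the pointer. */
    void stack_release_all(stack_allocator_t *a)
    {
        a->top = 0;
    }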

One useful example of the stack allocator model is XML parsers, in which short-lived objects can be released all at once. Another example is SQL statement execution, in which all memory is released upon the statement’s commit or rollback. The process of evaluating and executing a SQL statement involves prodigious amounts of memory allocation:  The parser needs to build a parse tree of tokens; the optimizer is invoked next, which involves more allocations to hold possible execution plan steps (the optimizer’s job is to assemble a sequence of steps that is less costly than any other sequence of steps); finally the execution plan must be carried out, which can involve more allocations to hold the result set.  In all, executing a SQL statement can involve thousands of individual allocations. When the application program releases the SQL statement handle, all of that memory can be released.  This fits the two-phase pattern in which objects are first allocated, and then de-allocated.

An important by-product of the stack approach is improved safety: because memory is allocated and de-allocated in two phases (not in random order), the application doesn't have to track individual allocations and it is impossible to accidentally introduce a memory leak through improper de-allocation.

Another benefit is much improved performance. During allocation, there is no chain of pointers to walk.  And de-allocation performance benefits even more: though there might have been thousands of allocations during the first phase, de-allocating the memory does not require calling the equivalent of free thousands of times.  Rather, the stack pointer is rewound in one move.

The stack allocator imposes virtually no overhead.  There is a single stack pointer (4 bytes in a 32-bit system) in place of a chain of pointers in a list allocator or block allocator.

A final benefit is improved resiliency. Whereas list and block allocators intersperse chains of pointers in the heap, the stack allocator has just the stack pointer. So the chance of overwriting that memory and compromising the integrity of the allocator's meta-data is greatly reduced.

Bitmap Allocator

The bitmap allocation algorithm is a little more complex. It acts on the memory pool by allocating objects in a pre-defined unit called the allocation quantum. The size of the quantum can be a word or a double word. A bitmap is a vector of bit-flags, with each bit corresponding to one quantum of memory. A bit value of 0 indicates a free quantum, while a value of 1 indicates an allocated quantum.

The bitmap allocator is similar to the block allocator in that it always allocates memory in a pre-specified size, but the bitmap allocator preserves the flexibility of a list allocator’s ability to satisfy random allocation request sizes by finding adjacent free bits and allocating them together. Advantages of bitmap allocators include the following:

  1. They can allocate groups of objects sequentially, thus maintaining locality of reference.  This means objects allocated sequentially are placed close to one another in storage, on the same physical "page".
  2. Fragmentation is reduced. When objects are allocated by a quantum of comparable size, small unused holes in the memory pool are avoided.
  3. A performance advantage of the algorithm lies in the fact that the bitmaps themselves are not interleaved with main storage, which improves the locality of searching.
  4. Keeping the bitmap separate from the heap also improves safety, in that memory overwrites (e.g. allocating 20 bytes and writing 24 bytes to the pointer that was returned) will not compromise the integrity of the allocator’s meta-data.

 

Bitmap allocators are commonly used in garbage collectors, database systems and file system disk block managers, where enforcing locality of reference and reducing fragmentation are imperative.
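
A simplified C sketch of the algorithm, assuming an 8-byte (double word) quantum and a statically sized pool; the naive bit-by-bit search shown here is for clarity, and real implementations typically scan a word of the bitmap at a time:

    #include <stddef.h>
    #include <stdint.h>

    #define QUANTUM      8u                       /* allocation quantum: one double word */
    #define POOL_QUANTA  1024u                    /* pool size = 1024 * 8 bytes = 8 KB   */

    static unsigned char pool[POOL_QUANTA * QUANTUM];
    static uint8_t       bitmap[POOL_QUANTA / 8]; /* one bit per quantum, kept outside the pool */

    static int  bit_get(size_t i) { return (bitmap[i / 8] >> (i % 8)) & 1; }
    static void bit_set(size_t i, int v)
    {
        if (v) bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
        else   bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
    }

    /* Allocate 'size' bytes by finding a run of adjacent free quanta. */
    void *bitmap_alloc(size_t size)
    {
        size_t need = (size + QUANTUM - 1) / QUANTUM;
        size_t run  = 0;

        for (size_t i = 0; i < POOL_QUANTA; i++) {
            run = bit_get(i) ? 0 : run + 1;
            if (run == need) {
                size_t first = i + 1 - need;
                for (size_t j = first; j <= i; j++)
                    bit_set(j, 1);                /* mark the quanta as allocated */
                return &pool[first * QUANTUM];
            }
        }
        return NULL;                              /* no sufficiently large run of free quanta */
    }

    /* Free must be told the size (or look it up elsewhere); the bitmap stores only bits. */
    void bitmap_free(void *p, size_t size)
    {
        size_t first = (size_t)((unsigned char *)p - pool) / QUANTUM;
        size_t need  = (size + QUANTUM - 1) / QUANTUM;
        for (size_t j = first; j < first + need; j++)
            bit_set(j, 0);
    }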

Thread-local Allocator

In today’s world of multi-core processors and multi-threaded applications, developers need to constantly think about how to harness the power of multiple CPU cores. Increasing application performance depends on the proper use of multiple application threads, which in turn hinges on the right approach to memory management.

In particular, multi-threaded applications running on multi-core systems can slow down markedly when performing many memory allocations. Often an application will run fine with a single CPU, but placing it on a system with two or more processors or processor cores yields a slowdown in performance, not the expected doubling of performance. This performance impact is very easy to miss at the application’s design stage as it is hidden deep inside the C runtime library.

Why does the standard list allocator perform so poorly in a multi-core environment? It stems from the fact that this allocator’s chain of pointers is a shared resource that must be protected. Chaos would ensue if a thread in the middle of breaking the chain to insert a new link was interrupted by a context switch, and another thread then tried to walk the (now broken) chain.  So, the chain is protected by a mutex to prevent concurrent access and preserve the allocator’s consistency (see PDF for image).

Locking a mutex presents minor overhead when there are no, or few, conflicts. However, as the number of malloc and free calls increases, contention for this shared resource increases, creating a lock conflict. To resolve the conflict, the operating system imposes a context switch, suspending the thread that attempted to access the allocator and inserting it into the kernel’s waiting queue. When the allocator is released, the thread is allowed to run and access the allocator. Even if each thread accesses only objects that it created, and so otherwise requires no synchronization, there is still only one memory allocator; the allocator doesn’t "know" that no synchronization is required, and acts to protect its meta-data, which results in a lot of conflicts between threads. As a result, the same application would perform better on a single CPU because the CPU can be kept busy with other tasks (it does not try to schedule tasks that are in the waiting queue). Conversely, in a multi-core setting, all but one core can be idled, with respect to dynamic memory management, because of the serialization of access to the heap (see PDF for image).

The ideal solution, from a performance standpoint, would be for each thread to allocate objects on the stack rather than in dynamic memory (in other words, with local variables declared in the function body). However, this simplified approach is rarely viable: thread stack size is limited and therefore allocating large objects or a large number of smaller objects on the stack is impossible. A more practical approach is to provide a separate memory allocator for each thread, so that each allocator manages memory independently of the others. This approach is called a thread-local allocator. The thread-local allocator is a custom allocator that avoids creating locking conflicts when objects are allocated and released within a single task. When allocating in one task and de-allocating in another, a lock is, of course, required. However, this allocator takes measures to minimize those locking conflicts.

The implementation of the thread-local memory manager is based on two concepts: the block allocator algorithm that we have already discussed, and the concept of thread-local storage, or TLS. Thread-local storage provides a means to map global memory to a local thread’s memory. Normally, data in a global variable is located at the same memory location when it is referred to by threads of the same process; sometimes it is advantageous to have different threads refer to the same global variable yet access different memory locations, and thread-local storage accomplishes exactly that. Likewise, the thread-local allocator maps portions of the global heap to individual threads. Building on the block allocator algorithm, the allocator creates and maintains a number of chains of same-size small blocks that are carved out of large pages. To allocate memory, the allocator simply unlinks a block from the appropriate chain and returns the pointer to the block to the application. When and if a new large page is necessary, the allocator can use a general-purpose memory manager (standard malloc) to allocate the page.

As long as all objects are allocated and de-allocated locally (by the same thread), this algorithm does not require any synchronization at all, because each thread has its own allocator (and therefore doesn't need to synchronize with any other thread when allocating or de-allocating its own memory). What happens when objects are not local? The memory manager maintains a Pending free Requests List (PRL) for each thread. When an object allocated in one thread is de-allocated by another thread, the de-allocating thread simply links the object into the owning thread's PRL. Of course, PRL access is protected by a mutex. Each thread periodically de-allocates all of the objects in its PRL at once. When does this occur? It could be based on a timer, or when a certain number of requests are pending, or when a certain amount of memory has accumulated in the PRL, or according to any other application-specific criteria.
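
The following C sketch shows the two central structures, using C11 _Thread_local for the per-thread heap pointer and a pthreads mutex to protect the PRL. All names are illustrative; how each thread's heap is created, how the owning thread of a freed object is identified, and the full set of size-segregated block lists are omitted:

    #include <pthread.h>
    #include <stddef.h>

    /* Per-thread allocator state: block free-lists plus a Pending free Requests List (PRL). */
    typedef struct pending { struct pending *next; } pending_t;

    typedef struct thread_heap {
        void            *free_lists[8];   /* chains of same-size small blocks (block allocator) */
        pending_t       *prl;             /* objects freed by *other* threads                   */
        pthread_mutex_t  prl_lock;        /* the only lock in the design                        */
    } thread_heap_t;

    static _Thread_local thread_heap_t *my_heap;   /* each thread sees its own allocator */

    /* Stand-in for the block allocator's free: push onto a free list, no locking needed,
       because only the owning thread ever touches its own free lists. */
    static void free_local(thread_heap_t *h, void *obj)
    {
        *(void **)obj = h->free_lists[0];
        h->free_lists[0] = obj;
    }

    /* Foreign free: link the object into the owning thread's PRL under that PRL's mutex. */
    void thread_free_foreign(thread_heap_t *owner, void *obj)
    {
        pending_t *p = obj;
        pthread_mutex_lock(&owner->prl_lock);
        p->next = owner->prl;
        owner->prl = p;
        pthread_mutex_unlock(&owner->prl_lock);
    }

    /* Called periodically by the owning thread: drain its PRL all at once. */
    void thread_drain_prl(void)
    {
        thread_heap_t *h = my_heap;
        if (h == NULL)                    /* per-thread heap set-up is omitted in this sketch */
            return;

        pthread_mutex_lock(&h->prl_lock);
        pending_t *list = h->prl;
        h->prl = NULL;
        pthread_mutex_unlock(&h->prl_lock);

        while (list != NULL) {
            pending_t *next = list->next;
            free_local(h, list);          /* now a purely local, lock-free de-allocation */
            list = next;
        }
    }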

It’s important to note that regardless of the criteria, the number of synchronization requests is reduced significantly using this approach. First, objects are often freed by the same thread that allocated them. Second, even when the object is de-allocated by a different thread, it does not interfere with all other threads, but only with those that need to use the same PRL. For example, assume you have eight threads, and you know based on your application’s logic flow that memory allocated by thread #1 will only ever be de-allocated by threads #4 or #7. Therefore, locking the PRL in any one of those threads will only interfere with the other two threads, not with all seven of the other threads, as would be the case with the default allocator. In this way, locking conflicts are reduced even when allocation/de-allocation is not "local".

Following is a diagram of the internal structures of thread local allocators (see PDF).

The thread-local allocator can create an arbitrary number of block lists, of varying sizes. The diagram shows just one possible example (see PDF).

To applications, the allocator exports three functions with syntax similar to the standard C runtime allocation API:  thread_malloc(), thread_realloc() and thread_free(). For applications written in C++, the memory manager’s interface also includes a simple way to redefine the default new and delete operators.
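
The exact signatures are not given here; assuming they mirror the standard C allocation API, usage might look like the following sketch (worker() is simply an illustrative caller):

    #include <stddef.h>

    /* Assumed to mirror the standard C allocation API. */
    void *thread_malloc(size_t size);
    void *thread_realloc(void *ptr, size_t size);
    void  thread_free(void *ptr);

    void worker(void)
    {
        char *buf = thread_malloc(256);       /* served from this thread's own allocator */
        if (buf == NULL)
            return;

        char *grown = thread_realloc(buf, 512);
        if (grown != NULL)                    /* on failure the original block is kept */
            buf = grown;

        thread_free(buf);                     /* a local free: no locking required */
    }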

We developed two tests to examine the impact of the thread-local memory manager.  The first test compares performance of the thread-local allocator and the standard C runtime allocator when the allocation pattern is thread-local: all de-allocations are performed by the same thread as the original allocations. This is a "best case" scenario.

The second test compares performance when objects are allocated by a "producer" thread and freed by a "consumer" thread. This is a "worst case" scenario (see PDF).

We ran these tests on a Sunfire x4450 system with four 6-core Xeon processors and 24GB of memory.  The first test performed 10 million allocation/free pairs in each of 24 threads (for a total of 240 million allocation/free pairs). Because all the allocations were so-called local and required no synchronization, all 24 cores were utilized with the thread-local allocator. Standard malloc, of course, could only utilize a single core, and this accounts for the dramatic performance difference.

The second test also performed 10 million allocation/free pairs but with only two threads.  In this case, performance improved, but only by about 20%, due to three factors:  (1) Allocation doesn’t require any synchronization, (2) there was some benefit from the reduced synchronization on the PRL (but minimal, because there were only two threads), and (3) the block allocator that the thread-local allocator uses is simply a superior allocator compared to standard malloc.

The results show that significant performance improvements are obtained by replacing the standard allocation mechanism with a thread-local allocator, especially as the number of cores increases. This is a classic case of "your mileage may vary" (YMMV): the benefit that any application will experience will be a function of (1) the number of cores, (2) the ratio of local to global allocations, and (3) the logic flow that determines the number of synchronizations on the PRLs.

Conclusion

Approaches to memory management significantly affect embedded code safety, performance and predictability, as well as prospects for DO-178B airborne software certification. This is due to the fact that dynamic memory allocation is risky. It can and should be eliminated from safety-critical processes. General purpose C language memory allocators are not optimized for embedded systems. When possible, they should be replaced with custom allocators that deliver safety, predictability and reduced overhead when allocating memory. A number of algorithms can be considered, including bitmap allocators, block allocators and stack-based allocators. Finally, the performance of multi-threaded applications on multi-core systems can be improved with a custom thread-local allocator.

 

