Fixed-point arithmetic – easier than you think

Use in custom algorithms and libraries

Author: Ferdinand Englberger, University of the German Federal Armed Forces Munich

Contribution – Embedded Software Engineering Congress 2018

Although more and more microcontrollers have floating-point arithmetic units, fixed-point arithmetic is still used in many libraries, for example, for digital signal processing, neural networks, or control algorithms. The processors' instruction sets have extensions for handling fixed-point numbers and, with SIMD instructions, enable their efficient processing. This course explains the fundamentals of fixed-point arithmetic, providing the necessary understanding to effectively use this arithmetic in custom algorithms and library functions.

The direct support of floating-point arithmetic by processors makes it easy for developers to implement algorithms. For efficiency reasons, single-precision floating-point numbers are used. However, it is easily overlooked that using this number type can introduce errors that lead to the loss of the desired design goal. Figure 1 (see below) illustrates this. PDFThis is illustrated by a simple example. Adding small numerical values followed by a large value produces the desired result. However, the reverse process disregards the small values in the result. Implementing FIR filters without considering this will result in the loss of linear-phase behavior.

Fixed-point arithmetic aims to perform calculations with minimal overhead. This is achieved by utilizing the multiplier of the processor or an FPGA. Divisions are avoided due to their longer computation times. Figure 2 (see...). PDFFigure 1 shows a simple example of a fixed-point calculation. For result1 (integer arithmetic), a calculation using multiplication and division is used, while for result2 (fixed-point arithmetic), the division is replaced by a shift operation, which is integrated into modern processors as part of a memory instruction. Fixed-point arithmetic is thus an optimized calculation method in which divisions are only performed with values that can be expressed as powers of two.

Representation of numerical values

Figure 3 (see. PDF) shows the structure of fixed-point numbers. With x will be in Qx The position of the decimal point is specified. The decimal places are shown as... fractional bits This is called. If numerical values with a range greater than one are needed, additionally... integer bits required. Since every numerical value x To ensure the most accurate representation possible, as few as possible are used. integer bits n_i used to determine a numerical value.

The required number of integer bits n_i in a data word x with n Bits are calculated by taking the base-2 logarithm of the absolute value and rounding up to the nearest whole bit. (see...). PDF)

The number of decimal places P a QPThe resulting number is: (see. PDF)

The equivalent integer value x_i The value of a fixed-point number is obtained by determining the numerical value x by solving the number 2^–P The number is divided and rounded. (see below). PDF)

The following simple rules must be observed when performing calculations:

When adding, the decimal points must be aligned.
In multiplications, the new position of the decimal point is determined by adding the decimal point positions of the operands, e.g., Q15×Q15 = Q30. The word width of the result is the sum of the positions of the two operands.
Further additions are performed using the word width of the multiplication result.
The word width of the final result is reduced to the desired width. This often involves the use of saturation arithmetic and rounding.

As shown in Figure 4 (see below). PDFAs shown, the result variable contains a range with the desired result and the desired accuracy. Additionally, fractional bits There are features available that help improve the result during additions. The guard area is used to prevent erroneous results when the result range is insufficient during additions.

The problem of saturation arithmetic and the necessity of the guard region are shown in Figure 5 (see Figure 5). PDFThis illustrates the point. Algorithms are often designed to expect a specific range of values for the result. This applies, for example, to digital filters. Despite this design, intermediate results can exceed this range. However, errors due to exceeding the range should be avoided, and the final result should be as accurate as possible. In the example shown, adding the two operands leads to a change in the sign bit in the result range. Without the guard area, this would be a serious error. Performing saturation at the end of the calculation also doesn't yield a correct result, but the error is much smaller than without saturation. In this example, one guard bit was required for adding two operands. With a higher number of additions, the number of guard bits must be adjusted accordingly to reliably detect when the result range is exceeded.

calculations

Figure 6 (see. PDFFigure 1 shows the calculation of a multiply-accumulate (MAC) operation with 16-bit numerical values. On a 32-bit processor, the result can be stored in a register. Both operands have the number format Q15. After the multiplication, the resulting number format is Q30. This provides a one-bit guard area. To use the desired target number format Q15, the result must be shifted 15 places to the right and then subjected to a saturation operation. Since only one guard bit is available, this number representation can lead to undetected range overruns if more than two additions are performed.

If a higher precision than 16 bits is required for an operand, a 64-bit result (two 32-bit registers) is necessary. For modern 32-bit processors, an assembly instruction is available for this purpose. Figure 7 (see...). PDFFigure 1 shows the arithmetic operation Q15×Q31. It is assumed that only 16 bits are needed as the result word width. This leaves 17 guard bits available, and up to 2¹⁷ Additions are performed under the protection of the guard. However, this solution has the significant disadvantage that 64-bit additions require two machine instructions and that shift operations must be performed across two registers. This leads to increased computational overhead.

To avoid the disadvantage of storing the result in two different processor registers, the numerical format of the sample operand can be adjusted. Figure 8 (see below) illustrates this. PDFThe input value was set to Q31. In a 32-bit × 32-bit multiplication, the 16-bit result is then stored in only one register. Due to the high number of additional fractional bits Outside the result area, the use of the second register can be omitted. However, in the representation shown in Figure 8, only one guard bit is available. If more guard bits are needed, this can be adjusted by placing them in the sample variable. In Figure 9 (see Figure 8), the guard bits can be adjusted by adjusting their position. PDFThe position was changed so that three guard bits are now available.

FPGA and ASIC

The previous description highlighted the functionality of the Guard area. The additional fractional bits were not considered in detail. For a single-processor solution, this approach is sufficient, as it processes data that is many times its word width. However, with an FPGA or ASIC, one would try to allocate only the necessary resources. If additional fractional bits Since they cannot influence the result, they can be omitted. The reasoning is similar to that used when determining the guard area. For two additions, one additional bit must be considered; for four additions, two bits; and for twoⁿ Additions n Bit. Figure 10 (see below). PDFThis example demonstrates how to proceed with an FPGA. The example involves performing seven MAC operations with 10-bit values. A solution should be found that achieves an exact and error-free result with minimal resource consumption. Seven additions require three guard bits to prevent number overflow and three additional fractional bits, to obtain the most accurate results possible. Multiplying two 10-bit values yields a 20-bit result with one guard bit and 9 additional bits. fractional bits. For further processing, two guard bits must be added, and six can be used. fractional bits These steps can be omitted. The summation can then be performed using the newly generated 16-bit variable. The calculation is completed with a saturation operation.

Summary

Fixed-point arithmetic is an efficient and resource-saving method for implementing algorithms. It allows for the effective use of special processor capabilities such as SIMD.

literature

[1] ARM Ltd, CMSIS – Cortex Microcontroller Software Interface Standard, https://www.arm.com/products/processors/cortex-m/cortex-microcontroller-software-interface-standard.php

author

Prof. Dr.-Ing. Ferdinand Englberger is Professor of Embedded Systems and Digital Signal Processing at the Bundeswehr University Munich in the Faculty of Electrical Engineering and Computer Science. His teaching and research areas include embedded systems, robotics, systems on a chip, and digital signal processing.

Download the article as a PDF

Implementation – our training & coaching

Do you want to bring yourself up to date with the latest technology?

Then find out more here MircoConsult offers training courses/seminars/workshops and individual coaching on the topic of implementation/embedded and real-time software development.

Training & coaching on the other topics in our portfolio can be found here. here.

Implementation – Expertise

Valuable expertise in the field of implementation/embedded and real-time software development is available. here Available for you to download free of charge.

To the specialist information

You can find expertise on other topics in our portfolio here. here.

MicroConsult Newsletter

With the MicroConsult newsletter, you'll stay on the pulse of the embedded world. Look forward to proven practical knowledge, real professional tips, and current events – directly from our experts for your project success.

Subscribe now!

Published by

weissblau media

← Embedded Software Engineering Expertise: ESE 2018 (Part 1) Modern compiler optimizations →