.. _blog-04-fixed-point:

========================================================================
Pigweed Blog #4: Fixed-point arithmetic as a replacement for soft floats
========================================================================
*Published on 2024-09-03 by Leonard Chan*

Fixed-point arithmetic is an alternative to floating-point arithmetic for
representing fractional data values (0.5, -1.125, 10.75, etc.). Fixed-point
types can be represented as :ref:`scaled integers <blog-04-fixed-point-intro>`.
The advantage here is that many arithmetic operations (+, -, \*, /, etc.) can
be implemented as normal integral instructions. This can be useful for embedded
systems or other applications where floating-point operations are too expensive
or not available. Clang supports fixed-point types according to
`ISO TR 18037 <https://standards.iso.org/ittf/PubliclyAvailableStandards/c051126_ISO_IEC_TR_18037_2008.zip>`_.
I work on a Google project that uses Pigweed. Recently, we replaced the use of
floats in an embedded project running on a Cortex-M0+. This MCU doesn't provide
any hardware support for floating-point operations, so floating-point
operations are implemented in software. We've observed a **~2x speed
improvement** in classification algorithms and a **small decrease in binary
size** using fixed-point **without sacrificing correctness**.

All Clang users can benefit from this by adding ``-ffixed-point`` as a
compilation flag. All of the types and semantics described in ISO TR 18037 are
supported.

Problems with (soft) floats
===========================
One way to represent a decimal number is with a significand, a base, and an
exponent. This is how floating-point numbers are represented according to
IEEE 754. The base is fixed at two, but the significand and exponent are
adjustable.

When doing arithmetic with floats, many operations are needed to align the
exponents of the operands and normalize the result. Most modern computers have
Floating-Point Units (`FPUs <https://en.wikipedia.org/wiki/Floating-point_unit>`_)
that do this arithmetic very quickly, but not all platforms (especially
embedded ones) have this luxury. The alternative is to emulate floating-point
arithmetic in software, which can be costly.

.. _blog-04-fixed-point-intro:

----------------------------------------------
Intro to fixed-point types and scaled integers
----------------------------------------------
`ISO TR 18037 <https://standards.iso.org/ittf/PubliclyAvailableStandards/c051126_ISO_IEC_TR_18037_2008.zip>`_
describes 12 fixed-point types with varying scales and ranges:

* ``unsigned short _Fract``
* ``unsigned _Fract``
* ``unsigned long _Fract``
* ``signed short _Fract``
* ``signed _Fract``
* ``signed long _Fract``
* ``unsigned short _Accum``
* ``unsigned _Accum``
* ``unsigned long _Accum``
* ``signed short _Accum``
* ``signed _Accum``
* ``signed long _Accum``

The scale is the number of fractional bits in the type. Under the hood, Clang
implements each of these types as an integer. To get the decimal value of the
type, we just divide the underlying integer by ``2^scale``. For example, on
x86_64 Linux, an ``unsigned _Accum`` type is represented as an unsigned 32-bit
integer with a scale of 16, so 16 bits are used to represent the fractional
part and the remaining 16 are used to represent the integral part. An
``unsigned _Accum`` with a value of ``16.625`` would be represented as
``1089536`` because ``1089536 / 2**16 = 16.625``. Additionally, an
``unsigned _Accum`` has a range of ``[0, 65535.99998474121]``, where the max
value is represented as ``2^32-1``. The smallest positive value a fixed-point
type can represent is ``1/2^scale``.

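To make this mapping concrete, here is a small demo. This is a sketch written
in C (the extension is specified for C) that assumes the 32-bit, scale-16
``unsigned _Accum`` layout described above; compile with
``clang -ffixed-point``:

.. code-block:: c

   #include <inttypes.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <string.h>

   int main(void) {
     // 'uk' is the unsigned _Accum literal suffix from ISO TR 18037.
     unsigned _Accum x = 16.625uk;

     // View the underlying scaled integer (assumes a 32-bit representation).
     uint32_t raw;
     memcpy(&raw, &x, sizeof(raw));

     printf("raw  = %" PRIu32 "  (16.625 * 2^16 = 1089536)\n", raw);
     printf("back = %f\n", (double)x);  // conversion divides by 2^scale
     return 0;
   }
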
The width and scale of each type are platform-dependent. In Clang, different
targets are free to provide whatever widths and scales are best suited to
their needs and capabilities. Likewise, LLVM backends are free to lower the
different fixed-point intrinsics to native fixed-point operations. The default
configuration for all targets is the "typical desktop processor"
implementation described in Annex A.3 of ISO TR 18037.

Each of these types also has a saturating equivalent (denoted by the ``_Sat``
keyword). Operations on a saturating type that would normally overflow instead
clamp to the maximal or minimal value for that type.

Fixed-point to the rescue
=========================
One of the main advantages of fixed-point types is that their operations can
be done with normal integer operations. Addition and subtraction between the
same fixed-point types normally require just a regular integral ``add`` or
``sub`` instruction. Multiplication and division can normally be done with a
regular ``mul`` and ``div`` instruction (plus an extra shift to account for
the scale). Platforms that don't support hard floats normally need to call a
library function that implements the floating-point equivalents of these
operations, which can be large or CPU-intensive.

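For intuition, here is roughly what that scaled-integer arithmetic looks like
when written out by hand for an unsigned Q16.16 value. This is an illustrative
sketch of the usual recipes, not Clang's exact lowering; with
``-ffixed-point`` the compiler emits the equivalent instructions for you:

.. code-block:: c

   #include <stdint.h>

   // Hand-rolled unsigned Q16.16 arithmetic (scale = 16), for illustration.
   typedef uint32_t q16_16;
   enum { kScale = 16 };

   static inline q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }  // plain add
   static inline q16_16 q_sub(q16_16 a, q16_16 b) { return a - b; }  // plain sub

   static inline q16_16 q_mul(q16_16 a, q16_16 b) {
     // Widen, multiply, then shift back down by the scale.
     return (q16_16)(((uint64_t)a * b) >> kScale);
   }

   static inline q16_16 q_div(q16_16 a, q16_16 b) {
     // Pre-shift the dividend up by the scale so the quotient stays scaled.
     return (q16_16)(((uint64_t)a << kScale) / b);
   }
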
Fixed-point types, however, are limited in their range and precision, which
depend on the number of integral and fractional bits. A ``signed _Accum``
(which uses 15 fractional bits, 16 integral bits, and 1 sign bit on x86_64
Linux) has a range of ``[-65536, 65535.99996948242]`` and a precision of
``3.0517578125e-05``. Floats can effectively range over ``(-∞, +∞)`` and have
precision as small as :math:`10^{-38}`. Maintaining correctness for your
program largely depends on how much range and precision you need your
fractional numbers to have. To address the limited range, there are sometimes
a couple of clever math tricks you can use to get large values to fit. Some of
these methods are discussed later.

.. list-table:: Fixed-Point vs Floating-Point
   :widths: 10 45 45
   :header-rows: 1
   :stub-columns: 1

   * -
     - Fixed-Point
     - Floating-Point
   * - Range
     - Limited by integral and fractional bits
     - +/-∞
   * - Precision
     - Limited by the fractional bits (:math:`1/2^{scale}`)
     - :math:`10^{-38}`
   * - Binary Operations
     - | Typically fast
       | Usually just involves normal integral binary ops with some shifts
     - | *Hard floats* - Typically fast
       | *Soft floats* - Can be very slow/complex; depends on the implementation
   * - Correctness
     - | Always within 1 ULP of the true result
       | Will always produce the same result for a given op
     - | Always within 0.5 ULP of the true result
       | May not always produce the same result for a given op due to rounding errors

---------------------
Running on Cortex-M0+
---------------------
The Arm Cortex-M0+ processor is a common choice for constrained embedded
applications because it is very energy-efficient. Because it doesn't have an
FPU, floating-point operations targeting this CPU depend on a helper
`library <https://github.com/ARM-software/abi-aa/blob/7c2fbbd6d32c906c709804f873b67308d1ab46e2/rtabi32/rtabi32.rst#the-standard-compiler-helper-function-library>`_
that defines them as runtime functions. For our application, the code running
on this processor runs different classification algorithms. One such
classifier uses `softmax regression <http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/>`_.
The algorithm is roughly:

.. code-block:: c++

   // This is roughly a trimmed-down version of the softmax regression
   // classifier.
   size_t Classify(std::span<const float> sample) const {
     // 1. Compute activations for each category.
     // All activations are initially zero.
     std::array<float, NumCategories> activations{};
     for (size_t i = 0; i < NumCategories; i++) {
       for (size_t j = 0; j < sample.size(); j++) {
         activations[i] += coefficients_[i][j] * sample[j];
       }
       activations[i] += biases_[i];
     }
     float max_activation = *std::max_element(activations.begin(),
                                              activations.end());

     // 2. Get e raised to each of these activations.
     std::array<float, NumCategories> exp_activations;
     for (size_t i = 0; i < NumCategories; i++) {
       exp_activations[i] = expf(activations[i]);
     }
     float sum_exp_activations = std::accumulate(exp_activations.begin(),
                                                 exp_activations.end(), 0.0f);

     // 3. Compute each likelihood, which is exp(x[i]) / sum(x).
     std::array<float, NumCategories> likelihoods;
     float max_likelihood = std::numeric_limits<float>::lowest();
     size_t prediction = 0;
     for (size_t i = 0; i < NumCategories; i++) {
       // 0 <= likelihoods[i] < 1
       likelihoods[i] = exp_activations[i] / sum_exp_activations;
       if (likelihoods[i] > max_likelihood) {
         max_likelihood = likelihoods[i];
         prediction = i;  // The highest likelihood is the prediction.
       }
     }

     return prediction;  // Return the index of the highest likelihood.
   }

Many of these operations involve normal floating-point comparison, addition,
multiplication, and division. Each of these expands to a call to some
``__aeabi_*`` function. This particular function is on a hot path, so these
soft-float operations account for much of the computation time and power
usage. For the normal binary operations, fixed-point types might be a good fit
because it's very likely each of the binary operations will result in a value
that can fit into one of the fixed-point types. (Each of the sample values is
in the range ``[-256, +256]`` and there are at most 10 categories. Likewise,
each of the ``coefficients_`` and ``biases_`` are small values ranging from
roughly ``(-3, +3)``.)

If we wanted to completely replace floats with fixed-point types for this
function, we'd run into two issues:

#. There is no ``expf`` function that operates on fixed-point types.
#. ``expf(x)`` can grow very large even for small values of ``x``
   (``expf(23)`` would exceed the largest value possible for an
   ``unsigned long _Accum``).

Fortunately, we can refactor the code as needed, and we can make changes to
llvm-libc, the libc implementation this device uses.

llvm-libc and some math
=======================
The embedded device's codebase depends on a handful of functions that take
floats: ``expf``, ``sqrtf``, and ``roundf``. ``printf`` is also used for
printing floats, but support for the fixed-point format specifiers is needed.
The project already uses llvm-libc, so we can implement versions of these
functions that operate on fixed-point types. The llvm-libc team at Google has
been very responsive and eager to implement these functions for us! Michael
Jones (who implemented much of printf in llvm-libc) provided `printf support
for each of the fixed-point types <https://github.com/llvm/llvm-project/pull/82707>`_.
Tue Ly provided implementations of `abs <https://github.com/llvm/llvm-project/pull/81823>`_,
`roundf <https://github.com/llvm/llvm-project/pull/81994>`_,
`sqrtf <https://github.com/llvm/llvm-project/pull/83042>`_,
`integral sqrt with a fixed-point output <http://github.com/llvm/llvm-project/issues/83924>`_,
and `expf <https://github.com/llvm/llvm-project/pull/84391>`_ for different
fixed-point types. Now we have the necessary math functions that accept
fixed-point types.

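On the printf side, ISO TR 18037 defines ``%r``/``%R`` for the ``_Fract``
types and ``%k``/``%K`` for the ``_Accum`` types (lowercase for signed,
uppercase for unsigned). A minimal sketch of what this enables, assuming a
toolchain whose ``printf`` comes from llvm-libc with its fixed-point support
built in:

.. code-block:: c

   #include <stdio.h>

   int main(void) {
     _Accum a = -1.5k;             // 'k': signed _Accum literal
     unsigned _Fract f = 0.625ur;  // 'ur': unsigned _Fract literal

     printf("a = %k\n", a);  // prints: a = -1.5
     printf("f = %R\n", f);  // prints: f = 0.625
     return 0;
   }
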
With implementations provided, the next big step is avoiding overflow and
getting results to fit within the new types. Let's look at one case involving
``expf`` and one involving ``sqrtf``. (Tue Ly also provided these solutions.)

The softmax algorithm above easily causes overflow with:

.. code-block::

   exp_activations[i] = expf(activations[i]);

But the larger goal is to get the likelihood that a sample matches each
category. This is a value in ``[0, 1]``:

.. math::

   likelihood(i) = \frac{e^{activation[i]}}{\sum_{k=0}^{NumCategories-1}{e^{activation[k]}}}

This can, however, be normalized by the max activation:

.. math::

   likelihood(i)
   = \frac{e^{activation[i]}}{\sum_{k}{e^{activation[k]}}}
   \cdot \frac{e^{-\max(activation)}}{e^{-\max(activation)}}
   = \frac{e^{activation[i] - \max(activation)}}{\sum_{k}{e^{activation[k] - \max(activation)}}}

This makes the exponent of each term at most zero and the result of ``exp(x)``
at most 1, which can definitely fit in the fixed-point types. Likewise, the
sum of these terms is at most ``NumCategories`` (which happens to be 10 for
us). For the above code, the only necessary change is:

.. code-block::

   exp_activations[i] = expf(activations[i] - max_activation);

And lucky for us, the precision of a ``signed _Accum`` is enough to get us the
same results. Interestingly, this change alone also improved performance when
using floats by 11%. The theory is that the ``expf`` implementation has a
quick check to see if the exponents need to be adjusted, and skips that
scaling work when it's unnecessary.

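Putting this together, step 2 of the classifier ends up looking roughly like
the sketch below. ``SumExpActivations`` and ``exp_accum`` are illustrative
names (``exp_accum`` stands in for whichever llvm-libc fixed-point exp entry
point matches the type); the point is the normalization, which bounds every
intermediate value:

.. code-block:: c

   #include <stddef.h>

   enum { kNumCategories = 10 };

   // Placeholder declaration for a fixed-point exp such as the one llvm-libc
   // provides (the actual entry point name differs per type).
   _Accum exp_accum(_Accum x);

   // Step 2 of the classifier, reworked for _Accum. Every exponent is
   // normalized to be <= 0, so each term lands in (0, 1] and the running sum
   // never exceeds kNumCategories; nothing can overflow.
   _Accum SumExpActivations(const _Accum activations[], _Accum max_activation,
                            _Accum exp_activations[]) {
     _Accum sum = 0.0k;
     for (size_t i = 0; i < kNumCategories; i++) {
       exp_activations[i] = exp_accum(activations[i] - max_activation);
       sum += exp_activations[i];
     }
     return sum;
   }
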
The code involving ``sqrtf`` can be adjusted similarly:

.. code-block:: c++

   void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) {
     // Compute center of touch.
     int32_t sum_x_w = 0, sum_y_w = 0, sum_w = 0;
     // touch.num_position_estimates is at most 100
     for (size_t i = 0; i < touch.num_position_estimates; i++) {
       sum_x_w += touch.position_estimates[i].position.x * 255;
       sum_y_w += touch.position_estimates[i].position.y * 255;
       sum_w += 255;
     }

     // Cast is safe, since the average can't exceed any of the individual values.
     touch.center = Point{
         .x = static_cast<int16_t>(sum_x_w / sum_w),
         .y = static_cast<int16_t>(sum_y_w / sum_w),
     };

     // Compute radial deviation from center.
     float sum_d_squared_w = 0.0f;
     for (size_t i = 0; i < touch.num_position_estimates; i++) {
       int32_t dx = touch.position_estimates[i].position.x - touch.center.x;
       int32_t dy = touch.position_estimates[i].position.y - touch.center.y;
       sum_d_squared_w += static_cast<float>(dx * dx + dy * dy) * 255;
     }

     // deviation = sqrt(sum((dx^2 + dy^2) * w) / sum(w))
     touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
         = sqrtf(sum_d_squared_w / sum_w);
   }

We know beforehand that the maximal values of ``dx`` and ``dy`` are +/-750, so
``sum_d_squared_w`` can require as many as 35 integral bits to represent the
final sum. ``sum_w`` also needs 11 bits, so ``sqrtf`` would need to accept a
value that can be held in ~24 bits. This can definitely fit into a
``signed long _Accum``, which has 32 integral bits, but ideally we'd like to
use a ``_Sat signed _Accum`` for consistency. Similar to before, we can
normalize the value by 255. That is, we can avoid multiplying by 255 when
calculating ``sum_x_w``, ``sum_y_w``, and ``sum_w``, since this results in the
exact same ``touch.center``. Likewise, if we avoid the ``* 255`` when
calculating ``sum_d_squared_w``, the final ``sqrtf`` result is unchanged. This
makes the code:

.. code-block:: c++

   void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) {
     // Compute center of touch.
     int32_t sum_x_w = 0, sum_y_w = 0, sum_w = touch.num_position_estimates;
     // touch.num_position_estimates is at most 100
     for (size_t i = 0; i < touch.num_position_estimates; i++) {
       sum_x_w += touch.position_estimates[i].position.x;
       sum_y_w += touch.position_estimates[i].position.y;
     }

     // Cast is safe, since the average can't exceed any of the individual values.
     touch.center = Point{
         .x = static_cast<int16_t>(sum_x_w / sum_w),
         .y = static_cast<int16_t>(sum_y_w / sum_w),
     };

     // Compute radial deviation from center.
     float sum_d_squared_w = 0.0f;
     for (size_t i = 0; i < touch.num_position_estimates; i++) {
       int32_t dx = touch.position_estimates[i].position.x - touch.center.x;
       int32_t dy = touch.position_estimates[i].position.y - touch.center.y;
       sum_d_squared_w += static_cast<float>(dx * dx + dy * dy);
     }

     // deviation = sqrt(sum(dx^2 + dy^2) / sum(w))
     touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
         = sqrtf(sum_d_squared_w / sum_w);
   }

This allows ``sum_d_squared_w`` to fit within 31 integral bits. We can go
further and note that ``sum_d_squared_w`` always holds an integral value, so
we can change its type to ``uint32_t``. This means any fractional part of
``sum_d_squared_w / sum_w`` is discarded, but for our use case this is
acceptable since the integral part of ``sum_d_squared_w / sum_w`` is so large
that the fractional part is negligible. ``touch.features[i]`` also isn't
propagated through further computation, so we don't need to worry about the
error compounding. With this, we can replace ``sqrtf`` with a function that
accepts an integral type and returns a fixed-point type:

.. code-block:: c++

   touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
       = isqrt(sum_d_squared_w / sum_w);

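The trick behind an integer-in, fixed-point-out square root is that
``floor(sqrt(n) * 2^k) == floor(sqrt(n << 2*k))``: one widening shift turns an
ordinary integer square root into one with ``k`` fractional bits of output. A
sketch of the idea (llvm-libc's actual implementation differs):

.. code-block:: c

   #include <stdint.h>

   // floor(sqrt(x)) via the classic digit-by-digit (binary restoring)
   // method, using only shifts, adds, and compares.
   static uint32_t usqrt64(uint64_t x) {
     uint64_t root = 0;
     uint64_t bit = UINT64_C(1) << 62;  // highest power of four in a uint64_t
     while (bit > x) bit >>= 2;
     while (bit != 0) {
       if (x >= root + bit) {
         x -= root + bit;
         root = (root >> 1) + bit;
       } else {
         root >>= 1;
       }
       bit >>= 2;
     }
     return (uint32_t)root;
   }

   // Integer in, scaled integer (Q16.16) out:
   // floor(sqrt(n) * 2^16) == floor(sqrt((uint64_t)n << 32)).
   static uint32_t isqrt_q16(uint32_t n) {
     return usqrt64((uint64_t)n << 32);
   }
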
-------
Results
-------

Classification runtime performance
==================================
Before any of this effort, a single run of classifying sensor input took on
average **856.8 us**.

Currently, with all floats substituted with fixed-point types, a single run
takes **412.845673 us**, a **~2.1x speed improvement**.

It's worth noting that the optimizations described above also improved
performance when using floats, bringing a single float-based run down to
**771.475464 us**; fixed-point is still a **~1.9x speed improvement** over
that optimized float version.

Size
====
We observed **~1.25 KiB** saved (out of a **~177 KiB** binary) when switching
to fixed-point.

.. code-block::

   $ bloaty app.elf.fixed -- app.elf.float
       FILE SIZE        VM SIZE
    --------------  --------------
     +0.5% +3.06Ki  [ = ]       0    .debug_line
     +0.1% +1.89Ki  [ = ]       0    .debug_str
     +0.2%    +496  [ = ]       0    .debug_abbrev
     +0.6%    +272  +0.6%    +272    .bss
     -0.3%     -16  -0.3%     -16    .data
     -0.3%    -232  [ = ]       0    .debug_frame
     -1.3%    -712  [ = ]       0    .debug_ranges
     -0.3% -1.11Ki  [ = ]       0    .debug_loc
     -1.2% -1.50Ki  -1.2% -1.50Ki    .text
     -1.6% -1.61Ki  [ = ]       0    .symtab
     -3.2% -3.06Ki  [ = ]       0    .pw_tokenizer.entries
     -1.2% -4.00Ki  [ = ]       0    .strtab
     -0.4% -19.6Ki  [ = ]       0    .debug_info
     -0.3% -26.1Ki  -0.7% -1.25Ki    TOTAL

A large amount of this can be attributed to no longer needing many of the
``__aeabi_*`` float functions provided by libc. All instances of floats were
removed, so they don't get linked into the final binary. Instead, we use the
fixed-point equivalents provided by llvm-libc. Many of these fixed-point
functions are also smaller and make fewer calls to other functions. For
example, the ``sqrtf`` we used makes calls to ``__ieee754_sqrtf``,
``__aeabi_fcmpun``, ``__aeabi_fcmplt``, ``__errno``, and ``__aeabi_fdiv``,
whereas ``isqrtf_fast`` only makes calls to ``__clzsi2``, ``__aeabi_lmul``,
and ``__aeabi_uidiv``. It's worth noting that individual fixed-point binary
operations are slightly larger, since they involve a bunch of shifts or
saturation checks. Floats tend to make a call to the same library function,
but fixed-point types emit instructions at every callsite.

.. code-block::

   float_add:
           push    {r7, lr}
           add     r7, sp, #0
           bl      __aeabi_fadd
           pop     {r7, pc}

   fixed_add:
           adds    r0, r0, r1
           bvc     .LBB1_2
           asrs    r1, r0, #31
           movs    r0, #1
           lsls    r0, r0, #31
           eors    r0, r1
   .LBB1_2:
           bx      lr

Although the callsite is shorter, the soft-float functions can't be inlined
because they tend to be very large. ``__aeabi_fadd`` in particular is **444
bytes**.

Correctness
===========
We observe exactly the same results for our classification algorithms when
using fixed-point types. We have **over 15000 classification tests**, with
none of them producing differing results. This effectively means we get all
the benefits mentioned above without any correctness cost.

-----------------
Using fixed-point
-----------------
Fixed-point arithmetic can be used out of the box with Clang by adding
``-ffixed-point`` on the command line. All of the types and semantics
described in ISO TR 18037 are supported. ``llvm-libc`` is the only ``libc``
that supports fixed-point printing and math functions at the moment. Not all
libc functions described in ISO TR 18037 are supported yet, but they will be
added eventually! Pigweed users can use these fixed-point functions through
the ``//pw_libc:stdfix`` target.

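For example, the TR 18037 ``<stdfix.h>`` functions that llvm-libc implements
can be used like this (a sketch, assuming your toolchain's ``<stdfix.h>``
comes from llvm-libc):

.. code-block:: c

   #include <stdfix.h>
   #include <stdio.h>

   int main(void) {
     _Accum x = -2.71828k;
     _Accum a = absk(x);       // TR 18037: absolute value of a signed _Accum
     _Accum r = roundk(x, 2);  // TR 18037: round to 2 fractional bits: -2.75
     printf("absk   = %f\n", (double)a);
     printf("roundk = %f\n", (double)r);
     return 0;
   }
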
-----------
Future work
-----------
* One option for reducing size further (at the cost of some performance) is
  to teach LLVM to outline fixed-point additions. This could result in
  something similar to the float libcalls.
* See if we can use a regular ``signed _Accum`` in the codebase instead of a
  ``_Sat signed _Accum`` to save some instructions (see the sketch below), or
  even see if we can use a smaller type.
* Investigate performance compared to hard floats. If we're able to afford
  using an unsaturated type, then many native floating-point ops could be
  replaced with normal integral ops.
* Full libc support for the fixed-point C functions.

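To illustrate the second bullet, dropping ``_Sat`` should let the compiler
omit the overflow handling seen in the ``fixed_add`` listing above, leaving a
single ``adds``. A rough sketch:

.. code-block:: c

   // Both compiled with -ffixed-point. The saturating version needs the
   // overflow check shown in the fixed_add listing above; the plain version
   // compiles down to a single integer add.
   _Sat _Accum sat_add(_Sat _Accum a, _Sat _Accum b) { return a + b; }

   _Accum plain_add(_Accum a, _Accum b) { return a + b; }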