.. _blog-04-fixed-point:

========================================================================
Pigweed Blog #4: Fixed-point arithmetic as a replacement for soft floats
========================================================================
*Published on 2024-09-03 by Leonard Chan*
7
8Fixed-point arithmetic is an alternative to floating-point arithmetic for
9representing fractional data values (0.5, -1.125, 10.75, etc.). Fixed-point
10types can be represented as :ref:`scaled integers <blog-04-fixed-point-intro>`.
11The advantage here is many
12arithmetic operations (+, -, \*, /, etc.) can be implemented as normal integral
13instructions. This can be useful for embedded systems or other applications
14where floating-point operations can be too expensive or not available.
15Clang has support for fixed-point types according to `ISO TR 18037 <https://standards.iso.org/ittf/PubliclyAvailableStandards/c051126_ISO_IEC_TR_18037_2008.zip>`_.
16I work on a Google project that uses Pigweed.
17Recently, we replaced usage of floats in an embedded project running
18on Cortex M0+. This MCU doesn't provide any hardware support for floating-point
19operations, so floating-point operations are implemented in software. We’ve
20observed a **~2x speed improvement** in
21classification algorithms and a **small decrease in binary size** using fixed-point
22**without sacrificing correctness**.
23
24All Clang users can benefit from this by adding ``-ffixed-point`` as a
25compilation flag. All of the types and semantics described
26in ISO 18037 are supported.
27
Problems with (soft) floats
===========================
One way to represent a fractional number is with a significand, a base, and an
exponent. This is how floating-point numbers are represented according to
IEEE 754. The base is fixed at two, while the significand and exponent are
adjustable.

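For example, the value 6.5 is the significand 1.625 scaled by the base two
raised to an exponent of 2:

.. math::

   6.5 = 1.625 \times 2^{2}
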
When doing arithmetic with floats, many operations are needed to align the
exponents of the operands and normalize the result. Most modern computers have
Floating-Point Units (`FPUs <https://en.wikipedia.org/wiki/Floating-point_unit>`_) that do this arithmetic very quickly, but not all
platforms (especially embedded ones) have this luxury. The alternative is to
emulate floating-point arithmetic in software, which can be costly.

.. _blog-04-fixed-point-intro:

----------------------------------------------
Intro to fixed-point types and scaled integers
----------------------------------------------
`ISO TR 18037 <https://standards.iso.org/ittf/PubliclyAvailableStandards/c051126_ISO_IEC_TR_18037_2008.zip>`_
describes 12 fixed-point types with varying scales and ranges:

* ``unsigned short _Fract``
* ``unsigned _Fract``
* ``unsigned long _Fract``
* ``signed short _Fract``
* ``signed _Fract``
* ``signed long _Fract``
* ``unsigned short _Accum``
* ``unsigned _Accum``
* ``unsigned long _Accum``
* ``signed short _Accum``
* ``signed _Accum``
* ``signed long _Accum``

The scale represents the number of fractional bits in the type. Under the hood,
Clang implements each of these types as integers. To get the decimal value of
the type, we divide the underlying integer by :math:`2^{scale}`. For example,
on x86_64 Linux, an ``unsigned _Accum`` type is represented as an unsigned
32-bit integer with a scale of 16, so 16 bits are used to represent the
fractional part and the remaining 16 are used to represent the integral part.
An ``unsigned _Accum`` type with a value of ``16.625`` would be represented as
``1089536`` because ``1089536 / 2**16 = 16.625``. Additionally, an
``unsigned _Accum`` type has a range of ``[0, 65535.99998474121]``, where the
max value is represented as ``2^32-1``. The smallest positive value we can
represent with a fixed-point type is ``1/2^scale``.

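As a concrete illustration, here's a minimal sketch (assuming Clang with
``-ffixed-point`` and the default 32-bit, scale-16 ``unsigned _Accum`` layout
described above) that inspects the underlying scaled integer:

.. code-block:: c++

   #include <cstdint>
   #include <cstdio>
   #include <cstring>

   int main() {
     // The `uk` suffix denotes an unsigned _Accum literal (ISO TR 18037).
     unsigned _Accum x = 16.625uk;

     // Copy the underlying storage into a plain integer to inspect it.
     uint32_t raw;
     static_assert(sizeof(x) == sizeof(raw), "assumes a 32-bit unsigned _Accum");
     std::memcpy(&raw, &x, sizeof(raw));

     // Prints: raw = 1089536, value = 16.625   (1089536 / 2^16 == 16.625)
     std::printf("raw = %u, value = %f\n", raw, raw / 65536.0);
     return 0;
   }
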
The width and scale of each type are platform-dependent. In Clang, different
targets are free to provide whatever widths
and scales are best-suited to their needs and capabilities. Likewise,
LLVM backends are free to lower the different fixed-point intrinsics to native
fixed-point operations. The default configuration for all targets is the
“typical desktop processor” implementation described in A.3 of ISO TR 18037.

Each of these types also has a saturating equivalent (denoted by the ``_Sat``
keyword). This means operations on a saturating type that would normally
overflow instead clamp to the maximal or minimal value for that type.

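A minimal sketch of this clamping behavior (again assuming Clang with
``-ffixed-point`` and the default 32-bit ``unsigned _Accum`` layout):

.. code-block:: c++

   #include <cstdio>

   int main() {
     _Sat unsigned _Accum s = 65535.5uk;
     s += 1.0uk;  // Would overflow; a _Sat type clamps to the maximum instead.
     // Prints the type's maximum value, ~65535.99998.
     std::printf("%f\n", static_cast<double>(s));
     return 0;
   }
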
Fixed-point to the rescue
=========================
One of the main advantages of fixed-point types is that their operations can be
done with normal integer operations. Addition and subtraction between the same
fixed-point types normally require just a regular integral ``add`` or ``sub``
instruction. Multiplication and division can normally be done with a regular
``mul`` and ``div`` instruction (plus an extra shift to account for the scale).
Platforms that don’t support hard floats normally need to make a call to a
library function that implements the floating-point equivalents of these, which
can be large or CPU-intensive.

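To see why, here's roughly what these operations look like when written out
with plain integers (a sketch using hypothetical Q16.16 helpers, not Clang's
actual lowering):

.. code-block:: c++

   #include <cstdint>

   // Multiply two Q16.16 scaled integers: widen to 64 bits, multiply, then
   // shift right by the scale to renormalize the result.
   int32_t q16_mul(int32_t a, int32_t b) {
     return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> 16);
   }

   // Addition needs no adjustment at all when the operands share a scale.
   int32_t q16_add(int32_t a, int32_t b) { return a + b; }
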
Fixed-point types, however, are limited in their range and precision, which
depend on the number of integral and fractional bits. A ``signed _Accum``
(which uses 15 fractional bits, 16 integral bits, and 1 sign bit on x86_64
Linux) will have a range of ``[-65536, 65535.99996948242]`` and a precision of
``3.0517578125e-05``. Floats can effectively range from ``(-∞, +∞)`` and have
precision as small as :math:`10^{-38}`. Maintaining correctness for your program
largely depends on how much range and precision you intend your fractional
numbers to have. To address the limited range, there are sometimes clever
math tricks you can use to get large values to fit. Some of these methods
will be discussed later.

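These limits follow directly from the scaled-integer encoding. With 15
fractional bits and 16 integral bits:

.. math::

   precision = \frac{1}{2^{15}} = 3.0517578125 \times 10^{-5}, \qquad
   max = \frac{2^{31} - 1}{2^{15}} \approx 65535.99997
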
.. list-table:: Fixed-Point vs Floating-Point
   :widths: 10 45 45
   :header-rows: 1
   :stub-columns: 1

   * -
     - Fixed-Point
     - Floating-Point
   * - Range
     - Limited by integral and fractional bits
     - +/-∞
   * - Precision
     - Limited by the fractional bits (:math:`1/2^{scale}`)
     - :math:`10^{-38}`
   * - Binary Operations
     - | Typically fast
       | Usually just involves normal integral binary ops with some shifts
     - | *Hard floats* - Typically fast
       | *Soft floats* - Can be very slow/complex; depends on the implementation
   * - Correctness
     - | Always within 1 ULP of the true result
       | Will always produce the same result for a given op
     - | Always within 0.5 ULP of the true result
       | May not always produce the same result for a given op due to rounding errors

---------------------
Running on Cortex-M0+
---------------------
The Arm Cortex-M0+ processor is a common choice for constrained embedded
applications because it is very energy-efficient. Because it doesn’t have an
FPU, many floating-point operations targeting this CPU depend on a helper
`library <https://github.com/ARM-software/abi-aa/blob/7c2fbbd6d32c906c709804f873b67308d1ab46e2/rtabi32/rtabi32.rst#the-standard-compiler-helper-function-library>`_
that defines them as runtime functions. In our application, the code
running on this processor executes several classification algorithms. One such
classifier utilizes `softmax regression <http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/>`_.
The algorithm is roughly:

.. code-block:: c++

   // This is roughly a trimmed down version of the softmax regression
   // classifier.
   size_t Classify(std::span<const float> sample) const {
       // 1. Compute activations for each category.
       //    All activations are initially zero.
       std::array<float, NumCategories> activations{};
       for (size_t i = 0; i < NumCategories; i++) {
           for (size_t j = 0; j < sample.size(); j++) {
               activations[i] += coefficients_[i][j] * sample[j];
           }
           activations[i] += biases_[i];
       }
       // (Used by the overflow fix discussed later in this post.)
       float max_activation = *std::max_element(activations.begin(),
                                                activations.end());

       // 2. Get e raised to each of these activations.
       std::array<float, NumCategories> exp_activations;
       for (size_t i = 0; i < NumCategories; i++) {
           exp_activations[i] = expf(activations[i]);
       }
       // Accumulate with a float initial value (0.0f); an int 0 would
       // truncate every partial sum.
       float sum_exp_activations = std::accumulate(exp_activations.begin(),
                                                   exp_activations.end(), 0.0f);

       // 3. Compute each likelihood, which is exp(x[i]) / sum(x).
       std::array<float, NumCategories> likelihoods;
       float max_likelihood = std::numeric_limits<float>::min();
       size_t prediction = 0;
       for (size_t i = 0; i < NumCategories; i++) {
           // 0 <= likelihoods[i] < 1
           likelihoods[i] = exp_activations[i] / sum_exp_activations;
           if (likelihoods[i] > max_likelihood) {
               max_likelihood = likelihoods[i];
               prediction = i;  // The highest likelihood is the prediction.
           }
       }

       return prediction;  // Return the index of the highest likelihood.
   }

Many of these operations involve normal floating-point comparison, addition,
multiplication, and division. Each of these expands to a call to some
``__aeabi_*`` function. This particular function is on a hot path, meaning the
soft float operations take up much of the computation and power.
For the normal binary operations, fixed-point types might be a good fit because
it’s very likely each of the binary operations will result in a value that
can fit into one of the fixed-point types. (Each of
the sample values is in the range ``[-256, +256]`` and there are at most 10
categories. Likewise, the ``coefficients_`` and ``biases_`` are small values
ranging from roughly ``(-3, +3)``.)

If we wanted to completely replace floats with fixed-point types for this
function, we’d run into two issues:

#. There doesn’t exist an ``expf`` function that operates on fixed-point types.
#. ``expf(x)`` can grow very large even for small values of ``x``
   (``expf(23)`` would exceed the largest value possible for an
   ``unsigned long _Accum``, as the quick check below shows).

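To make the second point concrete (assuming the default layout, where an
``unsigned long _Accum`` is 64 bits with 32 fractional bits):

.. math::

   e^{23} \approx 9.7 \times 10^{9} > \frac{2^{64} - 1}{2^{32}} \approx 4.3 \times 10^{9}
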
Fortunately, we can refactor the code as needed, and we can make changes to
llvm-libc, the libc implementation this device uses.

llvm-libc and some math
=======================
The embedded device's codebase depends on a handful of functions that
take floats: ``expf``, ``sqrtf``, and ``roundf``.
``printf`` is also used for printing floats, but support for the fixed-point format
specifiers is needed. The project already uses llvm-libc, so we can implement
versions of these functions that operate on fixed-point types.
The llvm-libc team at Google has been very responsive
and eager to implement these functions for us! Michael Jones (who implemented
much of printf in llvm-libc) provided `printf support for each of the fixed
point types <https://github.com/llvm/llvm-project/pull/82707>`_. Tue Ly
provided implementations for `abs <https://github.com/llvm/llvm-project/pull/81823>`_,
`roundf <https://github.com/llvm/llvm-project/pull/81994>`_,
`sqrtf <https://github.com/llvm/llvm-project/pull/83042>`_,
`integral sqrt with a fixed-point output <http://github.com/llvm/llvm-project/issues/83924>`_,
and `expf <https://github.com/llvm/llvm-project/pull/84391>`_ for different
fixed-point types. Now we have the necessary math functions that accept
fixed-point types.

With implementations provided, the next big step is avoiding overflow and
getting results to fit within the new types. Let’s look at one case involving
``expf`` and one involving ``sqrtf``. (Tue Ly also provided these solutions.)

The softmax algorithm above easily causes overflow with:

.. code-block:: c++

   exp_activations[i] = expf(activations[i]);

But the larger goal is to get the likelihood that a sample matches each
category. This is a value in ``[0, 1]``.

.. math::

   likelihood(i) = \frac{e^{activation[i]}}{\sum_{k=0}^{NumCategories-1}{e^{activation[k]}}}

This can, however, be normalized by the maximum activation
:math:`m = \max_{k}(activation[k])`, multiplying the numerator and denominator
by :math:`e^{-m}`:

.. math::

   likelihood(i) = \frac{e^{activation[i]}}{\sum_{k}{e^{activation[k]}}}
                 = \frac{e^{activation[i]} \cdot e^{-m}}{\sum_{k}{e^{activation[k]} \cdot e^{-m}}}
                 = \frac{e^{activation[i] - m}}{\sum_{k}{e^{activation[k] - m}}}

This makes the exponent for each component at most zero and the result of
``exp(x)`` at most 1, which easily fits in the fixed-point types.
Likewise, the sum of these terms is at most ``NumCategories`` (which happens
to be 10 for us). For the above code, the only change required is:

.. code-block:: c++

   exp_activations[i] = expf(activations[i] - max_activation);

Luckily for us, the precision of a ``signed _Accum`` type is enough to get
us the same results. Interestingly, this change alone also improved
performance when using floats by 11%. The theory is that the ``expf``
implementation has a quick check to see if the exponents need to be adjusted
for floats, and skips that scaling step when it's unnecessary.

The code involving ``sqrtf`` can be adjusted similarly:

.. code-block:: c++

   void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) {
       // Compute center of touch.
       int32_t sum_x_w = 0, sum_y_w = 0, sum_w = 0;
       // touch.num_position_estimates is at most 100
       for (size_t i = 0; i < touch.num_position_estimates; i++) {
           sum_x_w += touch.position_estimates[i].position.x * 255;
           sum_y_w += touch.position_estimates[i].position.y * 255;
           sum_w += 255;
       }

       // Cast is safe, since average can't exceed any of the individual values.
       touch.center = Point{
           .x = static_cast<int16_t>(sum_x_w / sum_w),
           .y = static_cast<int16_t>(sum_y_w / sum_w),
       };

       // Compute radial deviation from center.
       float sum_d_squared_w = 0.0f;
       for (size_t i = 0; i < touch.num_position_estimates; i++) {
           int32_t dx = touch.position_estimates[i].position.x - touch.center.x;
           int32_t dy = touch.position_estimates[i].position.y - touch.center.y;
           sum_d_squared_w += static_cast<float>(dx * dx + dy * dy) * 255;
       }

       // deviation = sqrt(sum((dx^2 + dy^2) * w) / sum(w))
       touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
           = sqrtf(sum_d_squared_w / sum_w);
   }

We know beforehand that the maximal values of ``dx`` and ``dy`` are +/-750, so
``sum_d_squared_w`` will require as many as 35 integral bits to represent the
final sum. ``sum_w`` also needs 11 bits, so ``sqrtf`` would need to accept a value
that can be held in ~24 bits. This can definitely fit into a
``signed long _Accum``, which has 32 integral bits, but ideally we’d like to
use a ``_Sat signed _Accum`` for consistency. Similar to before, we can
normalize the value by 255. That is, we can avoid multiplying by 255 when
calculating ``sum_x_w``, ``sum_y_w``, and ``sum_w``, since this will result in
the exact same ``touch.center`` results. Likewise, if we avoid the ``* 255`` when
calculating ``sum_d_squared_w``, the final ``sqrtf`` result will be unchanged.
This makes the code:

.. code-block:: c++

   void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) {
       // Compute center of touch.
       int32_t sum_x_w = 0, sum_y_w = 0, sum_w = touch.num_position_estimates;
       // touch.num_position_estimates is at most 100
       for (size_t i = 0; i < touch.num_position_estimates; i++) {
           sum_x_w += touch.position_estimates[i].position.x;
           sum_y_w += touch.position_estimates[i].position.y;
       }

       // Cast is safe, since average can't exceed any of the individual values.
       touch.center = Point{
           .x = static_cast<int16_t>(sum_x_w / sum_w),
           .y = static_cast<int16_t>(sum_y_w / sum_w),
       };

       // Compute radial deviation from center.
       float sum_d_squared_w = 0.0f;
       for (size_t i = 0; i < touch.num_position_estimates; i++) {
           int32_t dx = touch.position_estimates[i].position.x - touch.center.x;
           int32_t dy = touch.position_estimates[i].position.y - touch.center.y;
           sum_d_squared_w += static_cast<float>(dx * dx + dy * dy);
       }

       // deviation = sqrt(sum(dx^2 + dy^2) / count)
       touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
           = sqrtf(sum_d_squared_w / sum_w);
   }

This allows ``sum_d_squared_w`` to fit within 31 integral bits. We can go
further and realize ``sum_d_squared_w`` will always hold an integral value and
replace its type with a ``uint32_t``. This will mean any fractional part from
``sum_d_squared_w / sum_w`` will be discarded, but for our use case, this is
acceptable since the integral part of ``sum_d_squared_w / sum_w`` is so large
that the fractional part is negligible. ``touch.features[i]`` also isn’t
propagated through significant further computation, so we don’t need to worry
about accumulating error. With this, we can replace the ``sqrtf`` with one that
accepts an integral type and returns a fixed-point type:

.. code-block:: c++

   touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
       = isqrt(sum_d_squared_w / sum_w);

-------
Results
-------

Classification runtime performance
==================================
Before any of this effort, a single run of classifying sensor input took on
average **856.8 us**.

Now, with all floats substituted with fixed-point types, a single run takes
**412.845673 us**, a **~2.1x speed improvement**.

It’s worth noting that the optimizations described above also improved
performance when using floats, bringing a single run using floats down to
**771.475464 us**. Even against this improved float baseline, fixed-point is
still **~1.9x faster**.

Size
====
We’ve observed **~1.25 KiB** saved (out of a **~177 KiB** binary) when
switching to fixed-point.

.. code-block::

   $ bloaty app.elf.fixed -- app.elf.float
       FILE SIZE        VM SIZE
    --------------  --------------
     +0.5% +3.06Ki  [ = ]       0    .debug_line
     +0.1% +1.89Ki  [ = ]       0    .debug_str
     +0.2%    +496  [ = ]       0    .debug_abbrev
     +0.6%    +272  +0.6%    +272    .bss
     -0.3%     -16  -0.3%     -16    .data
     -0.3%    -232  [ = ]       0    .debug_frame
     -1.3%    -712  [ = ]       0    .debug_ranges
     -0.3% -1.11Ki  [ = ]       0    .debug_loc
     -1.2% -1.50Ki  -1.2% -1.50Ki    .text
     -1.6% -1.61Ki  [ = ]       0    .symtab
     -3.2% -3.06Ki  [ = ]       0    .pw_tokenizer.entries
     -1.2% -4.00Ki  [ = ]       0    .strtab
     -0.4% -19.6Ki  [ = ]       0    .debug_info
     -0.3% -26.1Ki  -0.7% -1.25Ki    TOTAL

A large amount of this can be attributed to no longer needing many of the
``__aeabi_*`` float functions provided by libc. All instances of floats were
removed so they don’t get linked into the final binary. Instead, we use the
fixed-point equivalents provided by llvm-libc. Many of these fixed-point
functions are also smaller and make fewer calls to other functions. For
example, the ``sqrtf`` we used makes calls to ``__ieee754_sqrtf``,
``__aeabi_fcmpun``, ``__aeabi_fcmplt``, ``__errno``, and ``__aeabi_fdiv``,
whereas ``isqrtf_fast`` only makes calls to ``__clzsi2``, ``__aeabi_lmul``, and
``__aeabi_uidiv``. It’s worth noting that individual fixed-point binary
operations are slightly larger, since they involve a bunch of shifts or
saturation checks. Each float operation compiles to a call to the same shared
library function, whereas fixed-point types emit their instructions inline at
every call site.

.. code-block::

   float_add:
     push    {r7, lr}
     add     r7, sp, #0
     bl      __aeabi_fadd     @ every float add is a call into the runtime
     pop     {r7, pc}

   fixed_add:
     adds    r0, r0, r1       @ plain integer add; sets the overflow flag
     bvc     .LBB1_2          @ no overflow: the sum is already correct
     asrs    r1, r0, #31      @ overflow: compute the saturation value
     movs    r0, #1
     lsls    r0, r0, #31      @ r0 = 0x80000000 (the minimum)
     eors    r0, r1           @ flip to 0x7FFFFFFF (the maximum) if needed
   .LBB1_2:
     bx      lr

Although the call site for a float operation is shorter, it’s worth noting that
the soft float functions can’t be inlined because they tend to be very large.
``__aeabi_fadd`` in particular is **444 bytes**.

Correctness
===========
We observe the exact same results for our classification algorithms when
using fixed-point types. We have **over 15000 classification tests**, none
of which produce differing results. This effectively means we get all the
mentioned benefits without any correctness cost.

-----------------
Using fixed-point
-----------------
Fixed-point arithmetic can be used out-of-the-box with Clang by adding
``-ffixed-point`` on the command line. All of the types and semantics described
in ISO TR 18037 are supported. ``llvm-libc`` is currently the only ``libc``
that supports fixed-point printing and math functions. Not all of the libc
functions described in ISO TR 18037 are supported yet, but they will be added
eventually! Pigweed users can use these fixed-point functions through
the ``//pw_libc:stdfix`` target.

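As a minimal end-to-end sketch (assuming a Clang with C++ fixed-point support
and an llvm-libc-based ``printf``; the ``%k`` conversion for ``_Accum`` comes
from ISO TR 18037):

.. code-block:: c++

   // Compile with: clang++ -ffixed-point example.cc
   #include <cstdio>

   int main() {
     _Accum a = 1.5k;       // 'k' suffix: signed _Accum literal
     _Accum b = a * 2.25k;  // lowers to an integer multiply plus a shift
     std::printf("b = %k\n", b);  // prints "b = 3.375" with llvm-libc
     return 0;
   }
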
-----------
Future work
-----------
* One option we could take to reduce size (at the cost of some of the
  performance gains) is to teach LLVM to outline fixed-point additions. This
  could result in something similar to the float libcalls.
* See if we can use a regular ``signed _Accum`` in the codebase instead
  of a ``_Sat signed _Accum`` to save some instructions (or even see if we can
  use a smaller type).
* Investigate performance compared to hard floats. If we’re able to afford
  using an unsaturated type, then many native floating-point ops could be
  replaced with normal integral ops.
* Full libc support for the fixed-point C functions.
459