xref: /aosp_15_r20/external/mesa3d/docs/drivers/panfrost/instancing.rst (revision 6104692788411f58d303aa86923a9ff6ecaded22)
1Instancing
2==========
3
4The attribute descriptor lets the attribute unit compute the address of an
5attribute given the vertex and instance ID. Unfortunately, the way this works is
6rather complicated when instancing is enabled.
7
8To explain this, first we need to explain how compute and vertex threads are
9dispatched.  When a quad is dispatched, it receives a single, linear index.
10However, we need to translate that index into a (vertex id, instance id) pair.
11One option would be to do:
12
13.. math::
14   \text{vertex id} = \text{linear id} \% \text{num vertices}
15
16   \text{instance id} = \text{linear id} / \text{num vertices}
17
18but this involves a costly division and modulus by an arbitrary number.
19Instead, we could pad ``num_vertices``. We dispatch
20:math:`\text{padded_num_vertices} \cdot \text{num_instances}` threads instead
21of :math:`\text{num_vertices} \cdot \text{num_instances}`, which results
22in some "extra" threads with :math:`\text{vertex_id} \geq \text{num_vertices}`,
23which we have to discard.  The more we pad ``num_vertices``, the more "wasted"
24threads we dispatch, but the division is potentially easier.
25
26One straightforward choice is to pad ``num_vertices`` to the next power
27of two, which means that the division and modulus are just simple bit shifts
28and masking. But the actual algorithm is a bit more complicated. The thread
29dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
30to dividing by a power of two. As a result, ``padded_num_vertices`` can
31be 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
32since we need less padding.
33
34``padded_num_vertices`` is picked by the hardware. The driver just specifies
35the actual number of vertices. Note that ``padded_num_vertices`` is a multiple
36of four (presumably because threads are dispatched in groups of 4). Also,
37``padded_num_vertices`` is always at least one more than ``num_vertices``,
38which seems like a quirk of the hardware. For larger ``num_vertices``, the
39hardware uses the following algorithm: using the binary representation of
40``num_vertices``, we look at the most significant set bit as well as the
41following 3 bits. Let n be the number of bits after those 4 bits. Then we
42set ``padded_num_vertices`` according to the following table:
43
44==========  =======================
45high bits   ``padded_num_vertices``
46==========  =======================
471000		   :math:`9 \cdot 2^n`
481001		   :math:`5 \cdot 2^{n+1}`
49101x		   :math:`3 \cdot 2^{n+2}`
50110x		   :math:`7 \cdot 2^{n+1}`
51111x		   :math:`2^{n+4}`
52==========  =======================
53
54For example, if :math:`\text{num_vertices} = 70` is passed to
55:c:func:`glDraw()`, its binary representation is 1000110, so :math:`n = 3`
56and the high bits are 1000, and therefore
57:math:`\text{padded_num_vertices} = 9 \cdot 2^3 = 72`.
58
59The attribute unit works in terms of the original ``linear_id``. if
60:math:`\text{num_instances} = 1`, then they are the same, and everything
61is simple. However, with instancing things get more complicated. There are
62four possible modes, two of them we can group together:
63
641. Use the ``linear_id`` directly. Only used when there is no instancing.
65
662. Use the ``linear_id`` modulo a constant. This is used for per-vertex
67attributes with instancing enabled by making the constant equal
68``padded_num_vertices``. Because the modulus is always ``padded_num_vertices``,
69this mode only supports a modulus that is a power of 2 times 1, 3, 5, 7,
70or 9. The shift field specifies the power of two, while the ``extra_flags``
71field specifies the odd number. If :math:`\text{shift} = n` and
72:math:`\text{extra_flags} = m`, then the modulus is
73:math:`(2m + 1) \cdot 2^n`. As an example, if
74:math:`\text{num_vertices} = 70`, then as computed above,
75:math:`\text{padded_num_vertices} = 9 \cdot 2^3`, so we should set
76:math:`\text{extra_flags} = 4` and :math:`\text{shift} = 3`. Note that we
77must exactly follow the hardware algorithm used to get ``padded_num_vertices``
78in order to correctly implement per-vertex attributes.
79
803. Divide the ``linear_id`` by a constant. In order to correctly implement
81instance divisors, we have to divide ``linear_id`` by ``padded_num_vertices``
82times to user-specified divisor. So first we compute ``padded_num_vertices``,
83again following the exact same algorithm that the hardware uses, then multiply
84it by the GL-level divisor to get the hardware-level divisor. This case is
85further divided into two more cases. If the hardware-level divisor is a
86power of two, then we just need to shift. The shift amount is specified by
87the shift field, so that the hardware-level divisor is just
88:math:`2^\text{shift}`.
89
90If it isn't a power of two, then we have to divide by an arbitrary integer.
91For that, we use the well-known technique of multiplying by an approximation
92of the inverse. The driver must compute the magic multiplier and shift
93amount, and then the hardware does the multiplication and shift. The
94hardware and driver also use the "round-down" optimization as described in
95https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
96The hardware further assumes the multiplier is between :math:`2^{31}` and
97:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
98to 0 by the driver -- presumably this simplifies the hardware multiplier a
99little. The hardware first multiplies ``linear_id`` by the multiplier and
100takes the high 32 bits, then applies the round-down correction if
101:math:`\text{extra_flags} = 1`, then finally shifts right by the shift field.
102
103There are some differences between ridiculousfish's algorithm and the Mali
104hardware algorithm, which means that the reference code from ridiculousfish
105doesn't always produce the right constants. Mali does not use the pre-shift
106optimization, since that would make a hardware implementation slower (it
107would have to always do the pre-shift, multiply, and post-shift operations).
108It also forces the multiplier to be at least :math:`2^{31}`, which means
109that the exponent is entirely fixed, so there is no trial-and-error.
110Altogether, given the divisor d, the algorithm the driver must follow is:
111
1121. Set :math:`\text{shift} = \lfloor \log_2(d) \rfloor`.
1132. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`.
1143. If :math:`e \leq 2^{shift}`, then we need to use the round-down algorithm.
115   Set :math:`\text{magic_divisor} = m - 1` and :math:`\text{extra_flags} = 1`.
1164. Otherwise, set :math:`\text{magic_divisor} = m` and
117   :math:`\text{extra_flags} = 0`.
118