1Instancing 2========== 3 4The attribute descriptor lets the attribute unit compute the address of an 5attribute given the vertex and instance ID. Unfortunately, the way this works is 6rather complicated when instancing is enabled. 7 8To explain this, first we need to explain how compute and vertex threads are 9dispatched. When a quad is dispatched, it receives a single, linear index. 10However, we need to translate that index into a (vertex id, instance id) pair. 11One option would be to do: 12 13.. math:: 14 \text{vertex id} = \text{linear id} \% \text{num vertices} 15 16 \text{instance id} = \text{linear id} / \text{num vertices} 17 18but this involves a costly division and modulus by an arbitrary number. 19Instead, we could pad ``num_vertices``. We dispatch 20:math:`\text{padded_num_vertices} \cdot \text{num_instances}` threads instead 21of :math:`\text{num_vertices} \cdot \text{num_instances}`, which results 22in some "extra" threads with :math:`\text{vertex_id} \geq \text{num_vertices}`, 23which we have to discard. The more we pad ``num_vertices``, the more "wasted" 24threads we dispatch, but the division is potentially easier. 25 26One straightforward choice is to pad ``num_vertices`` to the next power 27of two, which means that the division and modulus are just simple bit shifts 28and masking. But the actual algorithm is a bit more complicated. The thread 29dispatcher has special support for dividing by 3, 5, 7, and 9, in addition 30to dividing by a power of two. As a result, ``padded_num_vertices`` can 31be 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads, 32since we need less padding. 33 34``padded_num_vertices`` is picked by the hardware. The driver just specifies 35the actual number of vertices. Note that ``padded_num_vertices`` is a multiple 36of four (presumably because threads are dispatched in groups of 4). Also, 37``padded_num_vertices`` is always at least one more than ``num_vertices``, 38which seems like a quirk of the hardware. For larger ``num_vertices``, the 39hardware uses the following algorithm: using the binary representation of 40``num_vertices``, we look at the most significant set bit as well as the 41following 3 bits. Let n be the number of bits after those 4 bits. Then we 42set ``padded_num_vertices`` according to the following table: 43 44========== ======================= 45high bits ``padded_num_vertices`` 46========== ======================= 471000 :math:`9 \cdot 2^n` 481001 :math:`5 \cdot 2^{n+1}` 49101x :math:`3 \cdot 2^{n+2}` 50110x :math:`7 \cdot 2^{n+1}` 51111x :math:`2^{n+4}` 52========== ======================= 53 54For example, if :math:`\text{num_vertices} = 70` is passed to 55:c:func:`glDraw()`, its binary representation is 1000110, so :math:`n = 3` 56and the high bits are 1000, and therefore 57:math:`\text{padded_num_vertices} = 9 \cdot 2^3 = 72`. 58 59The attribute unit works in terms of the original ``linear_id``. if 60:math:`\text{num_instances} = 1`, then they are the same, and everything 61is simple. However, with instancing things get more complicated. There are 62four possible modes, two of them we can group together: 63 641. Use the ``linear_id`` directly. Only used when there is no instancing. 65 662. Use the ``linear_id`` modulo a constant. This is used for per-vertex 67attributes with instancing enabled by making the constant equal 68``padded_num_vertices``. Because the modulus is always ``padded_num_vertices``, 69this mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, 70or 9. The shift field specifies the power of two, while the ``extra_flags`` 71field specifies the odd number. If :math:`\text{shift} = n` and 72:math:`\text{extra_flags} = m`, then the modulus is 73:math:`(2m + 1) \cdot 2^n`. As an example, if 74:math:`\text{num_vertices} = 70`, then as computed above, 75:math:`\text{padded_num_vertices} = 9 \cdot 2^3`, so we should set 76:math:`\text{extra_flags} = 4` and :math:`\text{shift} = 3`. Note that we 77must exactly follow the hardware algorithm used to get ``padded_num_vertices`` 78in order to correctly implement per-vertex attributes. 79 803. Divide the ``linear_id`` by a constant. In order to correctly implement 81instance divisors, we have to divide ``linear_id`` by ``padded_num_vertices`` 82times to user-specified divisor. So first we compute ``padded_num_vertices``, 83again following the exact same algorithm that the hardware uses, then multiply 84it by the GL-level divisor to get the hardware-level divisor. This case is 85further divided into two more cases. If the hardware-level divisor is a 86power of two, then we just need to shift. The shift amount is specified by 87the shift field, so that the hardware-level divisor is just 88:math:`2^\text{shift}`. 89 90If it isn't a power of two, then we have to divide by an arbitrary integer. 91For that, we use the well-known technique of multiplying by an approximation 92of the inverse. The driver must compute the magic multiplier and shift 93amount, and then the hardware does the multiplication and shift. The 94hardware and driver also use the "round-down" optimization as described in 95https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf. 96The hardware further assumes the multiplier is between :math:`2^{31}` and 97:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set 98to 0 by the driver -- presumably this simplifies the hardware multiplier a 99little. The hardware first multiplies ``linear_id`` by the multiplier and 100takes the high 32 bits, then applies the round-down correction if 101:math:`\text{extra_flags} = 1`, then finally shifts right by the shift field. 102 103There are some differences between ridiculousfish's algorithm and the Mali 104hardware algorithm, which means that the reference code from ridiculousfish 105doesn't always produce the right constants. Mali does not use the pre-shift 106optimization, since that would make a hardware implementation slower (it 107would have to always do the pre-shift, multiply, and post-shift operations). 108It also forces the multiplier to be at least :math:`2^{31}`, which means 109that the exponent is entirely fixed, so there is no trial-and-error. 110Altogether, given the divisor d, the algorithm the driver must follow is: 111 1121. Set :math:`\text{shift} = \lfloor \log_2(d) \rfloor`. 1132. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`. 1143. If :math:`e \leq 2^{shift}`, then we need to use the round-down algorithm. 115 Set :math:`\text{magic_divisor} = m - 1` and :math:`\text{extra_flags} = 1`. 1164. Otherwise, set :math:`\text{magic_divisor} = m` and 117 :math:`\text{extra_flags} = 0`. 118