Primitive Ordered Pixel Shading
===============================

Primitive Ordered Pixel Shading (POPS) is a feature, available starting from
GFX9, that provides the Fragment Shader Interlock or Fragment Shader Ordering
functionality.

It allows a part of a fragment shader, called an ordered section (or a critical
section), to be executed sequentially in rasterization order for different
invocations covering the same pixel position.

This article describes how POPS is set up in shader code and in the registers.
The information here is currently provided for architecture generations up to
GFX11.

Note that the information in this article is **not official** and may contain
inaccuracies, as well as incomplete or incorrect assumptions. It is based on the
shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage
in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references,
and experimentation with the hardware.

Shader code
-----------

With POPS, a wave can dynamically execute at most one ordered section. However,
it is fine for a wave not to enter an ordered section at all if it doesn't need
ordering on its execution path.

The setup of the ordered section consists of three parts:

1. Entering the ordered section in the current wave: awaiting the completion of
   ordered sections in overlapped waves.
2. Resolving overlap within the current wave: intrawave collisions (optional
   and GFX9–10.3 only).
3. Exiting the ordered section: resuming overlapping waves trying to enter
   their ordered sections.

GFX9–10.3: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Awaiting the completion of ordered sections in overlapped waves is performed by
setting the POPS packer hardware register, and then polling the volatile
``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest
overlapped wave ID for the current wave.

The information needed for the wave to perform the waiting is provided to it via
the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in the
``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers (note that,
unlike various other arguments, the POPS arguments need to be enabled not only
in ``RSRC``, but in ``PA_SC_SHADER_CONTROL`` as well).

The collision wave ID argument contains the following unsigned values:

* [31]: Whether overlap has occurred.
* [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated
  with.
* [25:16]: Newest overlapped wave ID.
* [9:0]: Current wave ID.

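As an illustration of the bit layout listed above, the fields could be unpacked
on the CPU side roughly as follows. This is only a sketch for reasoning about
the layout: the ``pops_collision_info`` structure and the
``pops_decode_collision_wave_id`` function are hypothetical names that do not
come from Mesa or PAL, and real shader code would extract the fields with scalar
bitfield instructions instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the fields of the COLLISION_WAVEID SGPR. */
struct pops_collision_info {
   bool did_overlap;                   /* [31] */
   uint32_t packer_id;                 /* [29:28] on GFX10+, [28] on GFX9 */
   uint32_t newest_overlapped_wave_id; /* [25:16], 10-bit wrapping ID */
   uint32_t current_wave_id;           /* [9:0], 10-bit wrapping ID */
};

/* Unpacks the raw 32-bit argument according to the layout in the list above. */
static inline struct pops_collision_info
pops_decode_collision_wave_id(uint32_t value, bool is_gfx10_plus)
{
   struct pops_collision_info info;
   info.did_overlap = ((value >> 31) & 0x1) != 0;
   /* GFX9 only has a single packer ID bit. */
   info.packer_id = (value >> 28) & (is_gfx10_plus ? 0x3 : 0x1);
   info.newest_overlapped_wave_id = (value >> 16) & 0x3ff;
   info.current_wave_id = value & 0x3ff;
   return info;
}
```

For instance, the raw value 0xA3FF0001 decodes on GFX10+ to "did overlap" set,
packer 2, newest overlapped wave 1023, and current wave 1.
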
The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of
the fields, possibly from an early development iteration, but the meanings of
the fields are accurate there.

The wait must not be performed if the "did overlap" bit 31 is set to 0,
otherwise it will result in a hang. Also, the bit being set to 0 indicates that
there are *both* no wave overlap *and no intrawave collisions* for the current
wave. So if the bit is 0, it's safe for the wave, as a potential additional
optimization, to skip all of the POPS logic completely and execute the contents
of the ordered section simply as usual with unordered access. The packer
hardware register, however, may safely be set even without overlap: it's the
wait loop itself that must not be executed if it was reported that there was no
overlap.

The packer ID needs to be passed to the packer hardware register using
``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.

On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer
the wave is associated with:

* [25]: The wave is associated with packer 1.
* [24]: The wave is associated with packer 0.

Initially, both of these bits are set to 0, meaning that POPS is disabled for
the wave. If the wave needs to enter the ordered section, it must set bit 24 to
1 if the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the
packer ID is 1.

Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead,
containing the following fields:

* [2:1]: Packer ID.
* [0]: POPS enabled for the wave.

Initially, POPS is disabled for a wave. To start entering the ordered section,
bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs
to be set to 1.

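The two register encodings described above can be summarized by a small helper
that computes the field value a wave would write with ``s_setreg_b32``. This is
a CPU-side sketch under the assumption that only the packer field is written
(``MODE[25:24]`` on GFX9, ``POPS_PACKER[2:0]`` on GFX10+); the function name
``pops_packer_field_value`` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the value for the packer field of the hardware register:
 * - GFX10+: POPS_PACKER[2:0] = enable bit 0 plus the packer ID in bits 2:1.
 * - GFX9: MODE[25:24] = one-hot selection of packer 0 (bit 24) or 1 (bit 25).
 */
static inline uint32_t
pops_packer_field_value(uint32_t packer_id, bool is_gfx10_plus)
{
   if (is_gfx10_plus) {
      assert(packer_id <= 3); /* 2-bit packer ID */
      return 1u | (packer_id << 1);
   }
   assert(packer_id <= 1); /* 1-bit packer ID */
   return packer_id ? 0x2u : 0x1u;
}
```

With this encoding, packer 3 on GFX10+ yields 0b111, while packer 1 on GFX9
yields 0b10 (only bit 25 of ``MODE`` set).
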
The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are
10-bit values wrapping around on overflow: consecutive waves are numbered 1022,
1023, 0, 1… This wraparound needs to be taken into account when comparing the
exiting wave ID and the newest overlapped wave ID.

Specifically, until the current wave exits the ordered section, its ID can't be
smaller than the newest overlapped wave ID or the exiting wave ID. So
``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to
monotonically increasing unsigned values. In this case, the largest value,
0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current
wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from
before the last wraparound will be near 0 increasing away from it. Subtracting
``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.

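The remapping arithmetic can be checked on the CPU with 32-bit unsigned
integers. The sketch below only models the comparison described above; the
function names ``pops_remap_wave_id`` and ``pops_must_keep_waiting`` are
hypothetical, and the GFX9 adjustment of the newest overlapped wave ID is
deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Remaps a wrapping 10-bit wave ID to a monotonically increasing 32-bit value
 * by adding ~current_wave_id, which is the same as subtracting
 * (current_wave_id + 1) modulo 2^32.
 */
static inline uint32_t
pops_remap_wave_id(uint32_t wave_id_10bit, uint32_t current_wave_id_10bit)
{
   return wave_id_10bit + ~current_wave_id_10bit;
}

/* Models the wait loop condition: the wave keeps waiting until the remapped
 * exiting wave ID becomes greater than the remapped newest overlapped wave ID.
 */
static inline bool
pops_must_keep_waiting(uint32_t exiting_wave_id_10bit,
                       uint32_t newest_overlapped_wave_id_10bit,
                       uint32_t current_wave_id_10bit)
{
   return !(pops_remap_wave_id(exiting_wave_id_10bit, current_wave_id_10bit) >
            pops_remap_wave_id(newest_overlapped_wave_id_10bit,
                               current_wave_id_10bit));
}
```

For example, with the current wave ID 1, the IDs 1022, 1023, 0, 1 are remapped
to 1020, 1021, 0xFFFFFFFE, 0xFFFFFFFF, so the order across the wraparound is
preserved. If the newest overlapped wave is 1023, the wave keeps waiting while
the exiting wave ID is 1023, and proceeds once it becomes 0.
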
GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit
newest overlapped wave ID is greater than the 10-bit current wave ID (meaning
that it's behind the last wraparound point), 1 needs to be added to the newest
overlapped wave ID before using it in the comparison. This was corrected in
GFX10.

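A minimal sketch of this GFX9 adjustment, assuming both inputs are the raw
10-bit fields from ``COLLISION_WAVEID``, is shown below. The function name is
hypothetical. The result is intentionally not masked back to 10 bits, since the
description above only calls for adding 1 before the comparison; whether any
masking would be needed in practice is not something this sketch establishes.

```c
#include <assert.h>
#include <stdint.h>

/* Applies the GFX9 off-by-one correction to the newest overlapped wave ID.
 * A newest overlapped ID greater than the current ID means that it is from
 * before the last wraparound point, and only then 1 is added.
 */
static inline uint32_t
pops_gfx9_adjust_newest_overlapped(uint32_t newest_overlapped_wave_id_10bit,
                                   uint32_t current_wave_id_10bit)
{
   assert(newest_overlapped_wave_id_10bit < 1024);
   assert(current_wave_id_10bit < 1024);
   if (newest_overlapped_wave_id_10bit > current_wave_id_10bit)
      return newest_overlapped_wave_id_10bit + 1;
   return newest_overlapped_wave_id_10bit;
}
```

With the current wave ID 1, a reported newest overlapped wave ID of 1022 is
adjusted to 1023, while a reported ID of 0 is left unchanged.
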
The exiting wave ID (not to be confused with "exited": the exiting wave is the
wave that will exit the ordered section next) is queried via the
``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be
one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave
ID to a monotonically increasing one.

It's a volatile operand, and it needs to be read in a loop until its value
becomes greater than the newest overlapped wave ID (after remapping both to
monotonic). However, if it's too early for the current wave to enter the ordered
section, it needs to yield execution, via ``s_sleep``, to other waves that may
potentially be overlapped. GFX9 requires a finite amount of delay to be
specified; AMD uses 3. Starting from GFX10, exiting the ordered section wakes up
the waiting waves, so the maximum delay of 0xFFFF can be used.

In pseudocode, the entering logic would look like this::

   bool did_overlap = collision_wave_id[31];
   if (did_overlap) {
      if (gfx_level >= GFX10) {
         uint packer_id = collision_wave_id[29:28];
         s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
      } else {
         uint packer_id = collision_wave_id[28];
         s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
      }

      uint current_10bit_wave_id = collision_wave_id[9:0];
      // Or -(current_10bit_wave_id + 1).
      uint wave_id_remap_offset = ~current_10bit_wave_id;

      uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
      if (gfx_level < GFX10 &&
          newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
         ++newest_overlapped_10bit_wave_id;
      }
      uint newest_overlapped_wave_id =
         newest_overlapped_10bit_wave_id + wave_id_remap_offset;

      while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
               newest_overlapped_wave_id)) {
         s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
      }
   }

The SPIR-V fragment shader interlock specification requires an invocation (an
individual invocation, not the whole subgroup) to execute
``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple
begin instructions, or even multiple begin/end pairs, under divergent
conditions, a wave may end up waiting for the overlapped waves multiple times.
Thankfully, it's safe to set the POPS packer hardware register to the same
value, or to run the wait loop, multiple times during the wave's execution, as
long as the ordered section isn't exited in between by the wave.

GFX11: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave
status flag to report that the wave may enter the ordered section. It's awaited
by the ``s_wait_event`` instruction, with bit 0 ("don't wait for
``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD
passes 0 as the whole immediate operand.

The "export ready" wait can be done multiple times safely.

GFX9–10.3: Resolving intrawave collisions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX9–10.3, it's possible for overlapping fragment shader invocations to be
placed not only in different waves, but also in the same wave, with the shader
code making sure that the ordered section is executed for overlapping
invocations in order.

This functionality is optional: it can be activated by enabling loading of the
``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and
``PA_SC_SHADER_CONTROL``.

The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION``
contain the mask of whether each quad in the wave starts a new layer of
overlapping invocations, and thus the ordered section code for them needs to be
executed after running it for all lanes with indices preceding that quad index
multiplied by 4. The rest of the bits in the argument need to be ignored. AMD
explicitly masks them out in shader code (although this is not necessary if the
shader uses "find first 1" to obtain the start of the next set of overlapping
quads or expands this quad mask into a lane mask).

For example, in a 64-lane wave, if the intrawave collision mask is
0b0000001110000100, or ``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code
of the ordered section needs to be executed first only for quads 1:0 (lanes
7:0), then only for quads 6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then
for quad 8 (lanes 35:32), and then for the remaining quads 15:9 (lanes 63:36).

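The interpretation of the mask in the example above can be modeled on the CPU
as splitting the wave into consecutive lane ranges. This is only a sketch of how
the mask is read, not the code AMD or Mesa emits in shaders; the
``pops_lane_range`` structure and the ``pops_split_intrawave_collisions``
function are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inclusive range of lanes that may run the ordered section
 * together.
 */
struct pops_lane_range {
   unsigned first_lane;
   unsigned last_lane;
};

/* Splits a wave into lane ranges at every quad whose bit is set in the
 * intrawave collision mask. The ranges are written in the order in which the
 * ordered section needs to be executed for them. `out` must have room for
 * wave_size / 4 entries. Returns the number of ranges written.
 */
static inline unsigned
pops_split_intrawave_collisions(uint32_t collision_mask, unsigned wave_size,
                                struct pops_lane_range *out)
{
   assert(wave_size == 32 || wave_size == 64);
   unsigned quad_count = wave_size / 4;
   /* Only the lower 8 or 16 bits are meaningful. */
   collision_mask &= (1u << quad_count) - 1;

   unsigned range_count = 0;
   unsigned first_quad = 0;
   /* Bit 0 needs no handling, as no lanes precede quad 0. */
   for (unsigned quad = 1; quad <= quad_count; ++quad) {
      if (quad == quad_count || (collision_mask & (1u << quad))) {
         out[range_count].first_lane = first_quad * 4;
         out[range_count].last_lane = quad * 4 - 1;
         ++range_count;
         first_quad = quad;
      }
   }
   return range_count;
}
```

For the mask 0b0000001110000100 in a 64-lane wave, this produces five ranges:
lanes 0 to 7, 8 to 27, 28 to 31, 32 to 35, and 36 to 63.
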
This effectively causes the ordered section to be executed as smaller
"sub-subgroups" within the original subgroup.

However, this is not always compatible with the execution model of SPIR-V or
GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of
the shader in a loop may be unsafe in some cases. One particular example is when
the shader uses subgroup operations influenced by lanes outside the current
quad. In this case, the code outside and inside the ordered section may be
executed with different sets of active invocations, affecting the results of
subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not
supposed to modify the set of active invocations in any way. So the intrawave
collision loop may break the results of subgroup operations in unpredictable
ways, even outside the driver's compiler infrastructure. Even if the driver
splits the subgroup exactly at ``OpBeginInvocationInterlockEXT`` and makes the
lane subsets rejoin exactly at ``OpEndInvocationInterlockEXT``, the application
and the compilers that created the source shader are still not aware of that
happening: the input SPIR-V or GLSL shader might have already gone through
various optimizations, such as common subexpression elimination, which might
have considered a subgroup operation before ``OpBeginInvocationInterlockEXT``
and one after it equivalent.

The idea behind reporting intrawave collisions to shaders is to reduce the
impact on the parallelism of the part of the shader that doesn't depend on the
ordering, to avoid wasting lanes in the wave and to allow the code outside the
ordered section in different invocations to run in parallel lanes as usual. This
may be especially helpful if the ordered section is small compared to the rest
of the shader, for instance, a custom blending equation at the end of the usual
fragment shader for a surface in the world.

However, whether handling intrawave collisions is preferred is not a question
with one universal answer. Intrawave collisions are pretty uncommon without
multisampling, or when using sample interlock with multisampling, although
they're highly frequent with pixel interlock with multisampling, when adjacent
primitives cover the same pixels along the shared edge (though that's an
extremely expensive situation in general). But resolving intrawave collisions
adds some overhead costs to the shader. If intrawave overlap is unlikely to
happen often, or even more importantly, if the majority of the shader is inside
the ordered section, handling it in the shader may cause more harm than good.

GFX11 removes this concept entirely: overlapping invocations are always placed
in different waves.

GFX9–10.3: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To exit the ordered section and let overlapping waves resume execution and enter
their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message
(7) using ``s_sendmsg``.

If the wave has enabled POPS by setting the packer hardware register, it *must
not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the
message must be sent on all execution paths after the packer register setup.
However, if the wave exits before having configured the packer register, sending
the message is not required, though it's still fine to send it regardless of
that.

Note that if the shader has multiple ``OpEndInvocationInterlockEXT``
instructions executed in the same wave (depending on a divergent condition, for
example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave
only once, and especially not before any awaiting of overlapped waves.

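The rules above amount to a small piece of per-wave state. The following sketch
models them on the CPU, assuming a compiler or a simulator tracks two flags per
wave; the ``pops_wave_state`` structure and both functions are hypothetical and
only illustrate the "send exactly once, and always before ``s_endpgm`` if the
packer register was set" requirement.

```c
#include <stdbool.h>

/* Hypothetical per-wave tracking state. */
struct pops_wave_state {
   bool packer_configured; /* The packer hardware register has been set. */
   bool done_sent;         /* ORDERED_PS_DONE has already been sent. */
};

/* Called at every point where the ordered section may end. Returns true if
 * ORDERED_PS_DONE needs to be sent here, which happens only the first time.
 */
static inline bool
pops_exit_ordered_section(struct pops_wave_state *state)
{
   if (state->done_sent)
      return false;
   state->done_sent = true;
   return true;
}

/* Returns whether the wave is allowed to execute s_endpgm in this state. */
static inline bool
pops_may_end_program(const struct pops_wave_state *state)
{
   return !state->packer_configured || state->done_sent;
}
```

A wave that has configured the packer register may not end until the first call
to ``pops_exit_ordered_section`` has returned true; later calls return false, so
the message is never sent twice.
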
Before the message is sent, all counters for memory accesses that need to be
primitive-ordered, both writes and (in case something after the ordered section
depends on the per-pixel data, for instance, the tail blending fallback in
order-independent transparency) reads, must be awaited. Those may include
``vm``, ``vs``, and in some cases ``lgkm``. Normally, primitive-ordered memory
accesses will be done through VMEM with divergent addresses, not SMEM, as
there's no synchronization between fragments at different pixel coordinates.
However, it's still technically possible for a shader, even though pointless and
nonoptimal, to explicitly perform them in a waterfall loop, for instance, and
that must work correctly too. Without awaiting the counters, a race condition
will occur when the newly resumed waves start accessing the memory locations to
which there still are outstanding accesses in the current wave.

Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction,
which combines waiting for all the counters, sending the ``ORDERED_PS_DONE``
message, and ending the program. Generally, however, it's desirable to resume
overlapping waves as early as possible, including before the export, as the
export may also stall the wave for some time.

GFX11: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The overlapping waves are resumed when the wave performs the last export (with
the ``done`` flag).

The same requirements for awaiting the memory access counters as on GFX9–10.3
still apply.

Memory access requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^

The compiler needs to ensure that entering the ordered section implements
acquire semantics, and exiting it implements release semantics, in the fragment
interlock memory scope for ``UniformMemory`` and ``ImageMemory`` SPIR-V storage
classes.

A fragment interlock memory scope instance includes overlapping fragment shader
invocations executed by commands inside a single subpass. It may be considered a
subset of a queue family memory scope instance from the perspective of memory
barriers.

Fragment shader interlock doesn't perform implicit memory availability or
visibility operations. Shaders must do them by themselves for accesses requiring
primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL
or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope
in SPIR-V.

On AMD hardware, this means that the accessed memory locations must be made
available or visible between waves that may be executed on any compute unit, so
accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag
and L1$ via DLC.

However, it should be noted that memory accesses in the ordered section may be
expected by the application to be done in primitive order even if they don't
have the GLC and DLC flags. Coherent access not only bypasses, but also
invalidates the lower-level caches for the accessed memory locations. Thus,
considering that normally per-pixel data is accessed exclusively by the
invocation executing the ordered section, it's not necessary to make all reads
or writes in the ordered section for one memory location GLC/DLC, just the first
read and the last write: it doesn't matter if per-pixel data is cached in L0/L1
in the middle of a dependency chain in the ordered section, as long as it's
invalidated in them at the beginning and flushed to L2 at the end. Therefore,
optimizations in the compiler must not simply assume that only coherent accesses
need primitive ordering. Moreover, the compiler must also take into account that
the same data may be accessed through different bindings.

334*61046927SAndroid Build Coastguard WorkerExport requirements
335*61046927SAndroid Build Coastguard Worker^^^^^^^^^^^^^^^^^^^
336*61046927SAndroid Build Coastguard Worker
337*61046927SAndroid Build Coastguard WorkerWith POPS, on all hardware generations, the shader must have at least one
338*61046927SAndroid Build Coastguard Workerexport, though it can be a null or an ``off, off, off, off`` one.
339*61046927SAndroid Build Coastguard Worker
340*61046927SAndroid Build Coastguard WorkerAlso, even if the shader doesn't need to export any real data, the export
341*61046927SAndroid Build Coastguard Workerskipping that was added in GFX10 must not be used, and some space must be
342*61046927SAndroid Build Coastguard Workerallocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for
343*61046927SAndroid Build Coastguard Workersome color output to ``SPI_SHADER_32_R``.
344*61046927SAndroid Build Coastguard Worker
345*61046927SAndroid Build Coastguard WorkerWithout this, the shader will be executed without the needed synchronization on
346*61046927SAndroid Build Coastguard WorkerGFX10, and will hang on GFX11.
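As a minimal sketch of the export buffer requirement above, a driver could patch
the color export format when POPS is used and no color output allocates any
export space. The helper name and the field encoding (4 bits per color output,
``SPI_SHADER_32_R`` = 1) are assumptions made for illustration and should be
checked against the register headers.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /* Assumed encoding of a 4-bit per-output SPI_SHADER_COL_FORMAT field. */
   #define SPI_SHADER_32_R 0x1u

   /* Hypothetical helper returning the SPI_SHADER_COL_FORMAT value to program.
    * With POPS, at least one color output must allocate export buffer space,
    * so an all-zero value gets the 32_R format for color output 0. */
   static uint32_t
   pops_fixup_col_format(uint32_t col_format, bool uses_pops)
   {
      if (uses_pops && col_format == 0)
         col_format = SPI_SHADER_32_R << 0; /* bits [3:0]: color output 0 */

      /* Postcondition: POPS draws never end up with an empty export format. */
      assert(!uses_pops || col_format != 0);
      return col_format;
   }

The shader itself would still need at least one export instruction, which this
sketch does not cover.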

Drawing context setup
---------------------

Configuring POPS
^^^^^^^^^^^^^^^^

Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register.

To enable POPS for the draw,
``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1.

On GFX9–10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which
fragment shader invocations are considered overlapping:

* For pixel interlock, it must be set to 0 (1 sample).
* If sample interlock is sufficient (only synchronizing between invocations that
  have any common sample mask bits), it may be set to
  ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``, the number of sample coverage mask
  bits passed to the shader, which is expected to use the sample mask to
  determine whether it's allowed to access the data for each of the samples. As
  of April 2023, PAL for some reason doesn't use non-1x
  ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer
  Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading.
  Those APIs tie the interlock granularity to the shading frequency. Vulkan
  and OpenGL fragment shader interlock, however, allow specifying the interlock
  granularity independently of it, making it possible both to ask for finer
  synchronization guarantees and to require stronger ones than Direct3D ROVs can
  provide. Note that with MSAA, on AMD hardware, pixel interlock generally
  performs *massively*, sometimes prohibitively, slower than sample interlock,
  because it causes fragment shader invocations along the common edge of
  adjacent primitives to be ordered as they cover the same pixels (even though
  they don't cover any common samples). So it's highly desirable for the driver
  to provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES``
  accordingly, if the shader declares that it's enough for it via the execution
  mode.

On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is
used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier
architecture generations (and has a different bit offset in the register), and
``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11
blending performance workaround overriding the intrinsic rate must not be
applied if POPS is used in the draw: in this case, the intrinsic rate override
must be used solely to control the interlock granularity.
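The granularity selection described above could be sketched roughly as follows.
The structure is hypothetical and holds one member per ``DB_SHADER_CONTROL``
field instead of using real bit offsets, and the sketch assumes that
``OVERRIDE_INTRINSIC_RATE`` accepts the same values as
``POPS_OVERLAP_NUM_SAMPLES``, which this article does not confirm.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical generation enumeration, ordered oldest to newest. */
   enum pops_gfx_level { POPS_GFX9, POPS_GFX10, POPS_GFX10_3, POPS_GFX11 };

   /* Hypothetical unpacked view of the POPS-related DB_SHADER_CONTROL fields. */
   struct pops_db_shader_control {
      bool primitive_ordered_pixel_shader;
      uint32_t pops_overlap_num_samples;   /* GFX9-10.3 only */
      bool override_intrinsic_rate_enable; /* GFX11 only */
      uint32_t override_intrinsic_rate;    /* GFX11 only */
   };

   /* msaa_exposed_samples is the PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES value. */
   static struct pops_db_shader_control
   pops_configure(enum pops_gfx_level level, bool sample_interlock,
                  uint32_t msaa_exposed_samples)
   {
      struct pops_db_shader_control ctl = {0};
      /* 0 (1 sample) requests pixel interlock. */
      uint32_t granularity = sample_interlock ? msaa_exposed_samples : 0;

      assert(level <= POPS_GFX11); /* this article only covers up to GFX11 */

      ctl.primitive_ordered_pixel_shader = true;
      if (level == POPS_GFX11) {
         ctl.override_intrinsic_rate_enable = true;
         ctl.override_intrinsic_rate = granularity;
      } else {
         ctl.pops_overlap_num_samples = granularity;
      }
      return ctl;
   }

Keeping the interlock granularity as the only input to the GFX11 override
reflects the rule that the blending performance workaround must not touch the
intrinsic rate in POPS draws.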

No explicit flushes or synchronization are needed when changing the pipeline
state variables that may be involved in POPS, such as the rasterization sample
count. POPS automatically keeps synchronizing invocations even between draws
with different sample counts (invocations with common coverage mask bits are
considered overlapping by the hardware, regardless of what those samples
actually are; only the indices are important).

Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage
sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``
even if there's no depth/stencil target.

Hardware bug workarounds
^^^^^^^^^^^^^^^^^^^^^^^^

Early revisions of GFX9 (``CHIP_VEGA10`` and ``CHIP_RAVEN``) contain a
hardware bug that may result in a hang, and need a workaround to be enabled.
Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or
more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP``
must be set to 1 for draws that satisfy this condition. In PAL, this is the
``waMiscPopsMissedOverlap`` workaround. It results in lower performance
in those cases, increasing the frame time by around 1.5 to 2 times in
`nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
on the RX Vega 10, but it's required in a fairly rare case (8x+ MSAA) and is
mandatory to ensure stability.

Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required
on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if
it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``,
``CHIP_NAVI14``), and the draw uses POPS,
``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to
``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL).
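The two workaround conditions above can be expressed as a pair of predicates.
This is only a sketch: the enumeration is a hypothetical stand-in that lists
just the chips named in this section, and the function names are illustrative.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   /* Hypothetical chip list limited to the chips named in this section. */
   enum pops_chip {
      CHIP_VEGA10,
      CHIP_RAVEN,
      CHIP_NAVI10,
      CHIP_NAVI12,
      CHIP_NAVI14,
      CHIP_OTHER,
   };

   /* Whether DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP must be 1 for a draw.
    * ds_samples may be 0 if there is no depth/stencil target. */
   static bool
   pops_needs_drain_ps_on_overlap(enum pops_chip chip, bool uses_pops,
                                  unsigned raster_samples, unsigned ds_samples)
   {
      assert(raster_samples >= 1);

      if (!uses_pops)
         return false;
      if (chip != CHIP_VEGA10 && chip != CHIP_RAVEN)
         return false;
      return raster_samples >= 8 || ds_samples >= 8;
   }

   /* Whether DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL must be
    * PSLC_ON_HANG_ONLY, given the POPS_DRAIN_PS_ON_OVERLAP value in use. */
   static bool
   pops_needs_pslc_on_hang_only(enum pops_chip chip, bool uses_pops,
                                bool drain_ps_on_overlap)
   {
      bool gfx10_1 = chip == CHIP_NAVI10 || chip == CHIP_NAVI12 ||
                     chip == CHIP_NAVI14;

      return gfx10_1 && uses_pops && drain_ps_on_overlap;
   }

Both predicates depend on per-draw state (POPS usage and sample counts), so
they would have to be evaluated whenever that state changes rather than once
per device.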

Out-of-order rasterization interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This topic is currently largely unresearched. However, considering that POPS
is primarily the functionality of the Depth Block, similarity to the behavior of
out-of-order rasterization in depth/stencil testing may be expected.

If the shader specifies an ordered interlock execution mode, out-of-order
rasterization likely must not be enabled implicitly.

As of April 2023, PAL doesn't have any rules specifically for POPS in the logic
determining whether out-of-order rasterization can be enabled automatically.
Some POPS use cases may be covered by the rule that always
disables out-of-order rasterization if the shader writes to Unordered Access
Views (storage resources), though fragment shader interlock can be used for
read-only purposes too (for ordering between draws that only read per-pixel data
and draws that may write it), so that may be an oversight.

Explicitly enabled relaxed rasterization order modifies the concept of
rasterization order itself in Vulkan, so from the point of view of the
specification of fragment shader interlock, relaxed rasterization order should
still be applicable regardless of whether the shader requests ordered interlock.
PAL also doesn't make any POPS-specific exceptions here as of April 2023.

Variable-rate shading interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces
the shading rate to be 1x1, so the
``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must
be false.

On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the
``fragmentShadingRateWithFragmentShaderInterlock`` property must be true.
However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set,
enabling POPS will force the 1x1 shading rate.

However, the widest interlock granularity available on GFX11 (with the lowest
possible Depth Block intrinsic rate, 1x) is per-fine-pixel. There's no
synchronization between coarse fragment shader invocations if they don't cover
common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device
feature is not available.
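How a driver might derive the two Vulkan values discussed above is sketched
below. The type and function names are hypothetical. Reporting the property as
false when ``FORCE_SC_VRS_RATE_FINE_POPS`` is set is an inference from the
GFX10.3 case rather than something this article states directly.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   /* Hypothetical enumeration of the VRS-capable generations covered here. */
   enum pops_vrs_gfx_level { POPS_VRS_GFX10_3, POPS_VRS_GFX11 };

   struct pops_vrs_caps {
      /* fragmentShadingRateWithFragmentShaderInterlock device property */
      bool shading_rate_with_interlock;
      /* fragmentShaderShadingRateInterlock device feature */
      bool shading_rate_interlock;
   };

   static struct pops_vrs_caps
   pops_get_vrs_caps(enum pops_vrs_gfx_level level,
                     bool force_sc_vrs_rate_fine_pops)
   {
      struct pops_vrs_caps caps;

      assert(level == POPS_VRS_GFX10_3 || level == POPS_VRS_GFX11);

      /* GFX10.3 always forces 1x1 shading with POPS. GFX11 does so only if
       * PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS is set. */
      caps.shading_rate_with_interlock =
         level == POPS_VRS_GFX11 && !force_sc_vrs_rate_fine_pops;

      /* The interlock is per-fine-pixel at most, never per coarse fragment. */
      caps.shading_rate_interlock = false;
      return caps;
   }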

Additional configuration
^^^^^^^^^^^^^^^^^^^^^^^^

These are some largely unresearched options found in the register declarations.
PAL doesn't use them, so it's unknown if they make any significant difference.
No effect was found in `nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_NAVI31``.

* ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9–10.3.
* ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+.