Primitive Ordered Pixel Shading
===============================

Primitive Ordered Pixel Shading (POPS) is the feature available starting from
GFX9 that provides the Fragment Shader Interlock or Fragment Shader Ordering
functionality.

It allows a part of a fragment shader, an ordered section (or a critical
section), to be executed sequentially in rasterization order for different
invocations covering the same pixel position.

This article describes how POPS is set up in shader code and the registers. The
information here is currently provided for architecture generations up to GFX11.

Note that the information in this article is **not official** and may contain
inaccuracies, as well as incomplete or incorrect assumptions. It is based on the
shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage
in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references,
and experimentation with the hardware.

Shader code
-----------

With POPS, a wave can dynamically execute up to one ordered section. It is fine
for a wave not to enter an ordered section at all if it doesn't need ordering on
its execution path, however.

The setup of the ordered section consists of three parts:

1. Entering the ordered section in the current wave: awaiting the completion of
   ordered sections in overlapped waves.
2. Resolving overlap within the current wave: intrawave collisions (optional
   and GFX9–10.3 only).
3. Exiting the ordered section: resuming overlapping waves trying to enter
   their ordered sections.

GFX9–10.3: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Awaiting the completion of ordered sections in overlapped waves is performed by
setting the POPS packer hardware register, and then polling the volatile
``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest
overlapped wave ID for the current wave.

The information needed for the wave to perform the waiting is provided to it via
the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in the
``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers (note that,
unlike various other arguments, the POPS arguments need to be enabled not only
in ``RSRC``, but in ``PA_SC_SHADER_CONTROL`` as well).

The collision wave ID argument contains the following unsigned values:

* [31]: Whether overlap has occurred.
* [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated
  with.
* [25:16]: Newest overlapped wave ID.
* [9:0]: Current wave ID.
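
As an illustration of the field layout listed above, the following is a minimal
C sketch of unpacking the argument on the CPU side (for example, in a test or a
simulator). The struct and function names are hypothetical and exist only for
this example; only the bit positions come from the list above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields; names are illustrative. */
struct collision_wave_id_fields {
   bool did_overlap;              /* [31] */
   uint32_t packer_id;            /* [29:28] on GFX10+, [28] on GFX9 */
   uint32_t newest_overlapped_id; /* [25:16], wrapping 10-bit wave ID */
   uint32_t current_id;           /* [9:0], wrapping 10-bit wave ID */
};

/* Splits a raw COLLISION_WAVEID value into its fields. */
struct collision_wave_id_fields
decode_collision_wave_id(uint32_t value, bool is_gfx10_plus)
{
   struct collision_wave_id_fields f;
   f.did_overlap = ((value >> 31) & 0x1) != 0;
   /* The packer ID is two bits wide on GFX10+, one bit wide on GFX9. */
   f.packer_id = (value >> 28) & (is_gfx10_plus ? 0x3 : 0x1);
   f.newest_overlapped_id = (value >> 16) & 0x3FF;
   f.current_id = value & 0x3FF;
   return f;
}
```

In a real shader, these extractions would be done with scalar bitfield
instructions rather than a helper function.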

The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of
the fields, possibly from an early development iteration, but their meanings
are described accurately there.

The wait must not be performed if the "did overlap" bit 31 is set to 0;
otherwise, it will result in a hang. Also, the bit being set to 0 indicates that
there are *both* no wave overlap *and no intrawave collisions* for the current
wave, so if the bit is 0, it's safe for the wave to skip all of the POPS logic
completely and execute the contents of the ordered section simply as usual with
unordered access as a potential additional optimization. The packer hardware
register, however, may safely be set even without overlap; it's the wait loop
itself that must not be executed if it was reported that there was no overlap.

The packer ID needs to be passed to the packer hardware register using
``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.

On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer
the wave is associated with:

* [25]: The wave is associated with packer 1.
* [24]: The wave is associated with packer 0.

Initially, both of these bits are set to 0, meaning that POPS is disabled for
the wave. If the wave needs to enter the ordered section, it must set bit 24 to
1 if the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the
packer ID is 1.

Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead,
containing the following fields:

* [2:1]: Packer ID.
* [0]: POPS enabled for the wave.

Initially, POPS is disabled for a wave. To start entering the ordered section,
bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs
to be set to 1.
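
The two register layouts above can be compared side by side with a small C
sketch. The function names are hypothetical and chosen for this example only;
the code just computes the register images implied by the bit descriptions
above and does not model ``s_setreg_b32`` itself.

```c
#include <stdint.h>

/* GFX9: returns the MODE register value with the POPS bit for the given
 * packer set (bit 24 for packer 0, bit 25 for packer 1). */
uint32_t gfx9_mode_with_pops_packer(uint32_t mode, uint32_t packer_id)
{
   return mode | (UINT32_C(1) << (24 + (packer_id & 0x1)));
}

/* GFX10+: returns the POPS_PACKER[2:0] value with POPS enabled (bit 0)
 * and the packer ID placed in bits 2:1. */
uint32_t gfx10_pops_packer_value(uint32_t packer_id)
{
   return UINT32_C(1) | ((packer_id & 0x3) << 1);
}
```

On GFX9 the packer is selected by a one-hot pair of bits, while GFX10 and later
use a separate enable bit plus a binary packer ID.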

The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are
10-bit values wrapping around on overflow: consecutive waves are numbered 1022,
1023, 0, 1… This wraparound needs to be taken into account when comparing the
exiting wave ID and the newest overlapped wave ID.

Specifically, until the current wave exits the ordered section, its ID can't be
smaller than the newest overlapped wave ID or the exiting wave ID. So
``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to
monotonically increasing unsigned values. In this case, the largest value,
0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current
wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from
before the last wraparound will be near 0 increasing away from it. Subtracting
``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.
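
The remapping described above can be checked with a short C sketch. The
function name is hypothetical; the arithmetic relies only on 32-bit unsigned
wraparound, as in the description.

```c
#include <stdint.h>

/* Remaps a wrapping 10-bit wave ID to a monotonically increasing 32-bit
 * value relative to the current wave, which itself maps to 0xFFFFFFFF. */
uint32_t remap_wave_id(uint32_t wave_id_10bit, uint32_t current_wave_id_10bit)
{
   /* Adding ~current equals subtracting (current + 1) modulo 2^32. */
   return wave_id_10bit + ~current_wave_id_10bit;
}
```

For instance, with a current wave ID of 1, the consecutive wave IDs 1022, 1023,
0, and 1 are remapped to 1020, 1021, 0xFFFFFFFE, and 0xFFFFFFFF, so a plain
unsigned comparison orders them correctly across the wraparound.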

GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit
newest overlapped wave ID is greater than the 10-bit current wave ID (meaning
that it's behind the last wraparound point), 1 needs to be added to the newest
overlapped wave ID before using it in the comparison. This was corrected in
GFX10.

The exiting wave ID (not to be confused with "exited": the exiting wave ID is
the ID of the wave that will exit the ordered section next) is queried via the
``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be
one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave
ID to a monotonically increasing one.

It's a volatile operand, and it needs to be read in a loop until its value
becomes greater than the newest overlapped wave ID (after remapping both to
monotonic). However, if it's too early for the current wave to enter the ordered
section, it needs to yield execution to other waves that may potentially be
overlapped, via ``s_sleep``. GFX9 requires a finite amount of delay to be
specified; AMD uses 3.
Starting from GFX10, exiting the ordered section wakes up
the waiting waves, so the maximum delay of 0xFFFF can be used.

In pseudocode, the entering logic would look like this::

   bool did_overlap = collision_wave_id[31];
   // Without overlap, the wait loop must be skipped, or the wave will hang.
   if (did_overlap) {
      // Associate the wave with the packer reported by the hardware.
      if (gfx_level >= GFX10) {
         uint packer_id = collision_wave_id[29:28];
         s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
      } else {
         uint packer_id = collision_wave_id[28];
         s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
      }

      uint current_10bit_wave_id = collision_wave_id[9:0];
      // Or -(current_10bit_wave_id + 1).
      uint wave_id_remap_offset = ~current_10bit_wave_id;

      uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
      // GFX9 off-by-one workaround for IDs from before the last wraparound.
      if (gfx_level < GFX10 &&
          newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
         ++newest_overlapped_10bit_wave_id;
      }
      uint newest_overlapped_wave_id =
         newest_overlapped_10bit_wave_id + wave_id_remap_offset;

      // Poll the volatile exiting wave ID, yielding while it's too early.
      while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
               newest_overlapped_wave_id)) {
         s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
      }
   }

The SPIR-V fragment shader interlock specification requires an invocation (an
individual invocation, not the whole subgroup) to execute
``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple
begin instructions, or even multiple begin/end pairs, under divergent
conditions, a wave may end up waiting for the overlapped waves multiple times.
Thankfully, it's safe to set the POPS packer hardware register to the same
value, or to run the wait loop, multiple times during the wave's execution, as
long as the ordered section isn't exited in between by the wave.

GFX11: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave
status flag to report that the wave may enter the ordered section. It's awaited
by the ``s_wait_event`` instruction, with bit 0 ("don't wait for
``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD
passes 0 as the whole immediate operand.

The "export ready" wait can be done multiple times safely.

GFX9–10.3: Resolving intrawave collisions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX9–10.3, it's possible for overlapping fragment shader invocations to be
placed not only in different waves, but also in the same wave, with the shader
code making sure that the ordered section is executed for overlapping
invocations in order.

This functionality is optional. It can be activated by enabling loading of the
``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and
``PA_SC_SHADER_CONTROL``.

The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION``
contain a mask indicating whether each quad in the wave starts a new layer of
overlapping invocations. For a quad that does, the ordered section code needs
to be executed after running it for all lanes with indices preceding that quad
index multiplied by 4.
The rest of the bits in the argument need to be ignored: AMD
explicitly masks them out in shader code (although this is not necessary if the
shader uses "find first 1" to obtain the start of the next set of overlapping
quads or expands this quad mask into a lane mask).

For example, if the intrawave collision mask is 0b0000001110000100, or
``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code of the ordered section
needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads
6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32),
and then for the remaining quads 15:9 (lanes 63:36).

This effectively causes the ordered section to be executed as smaller
"sub-subgroups" within the original subgroup.

However, this is not always compatible with the execution model of SPIR-V or
GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of
the shader in a loop may be unsafe in some cases. One particular example is when
the shader uses subgroup operations influenced by lanes outside the current
quad.
In this case, the code outside and inside the ordered section may be
executed with different sets of active invocations, affecting the results of
subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not
supposed to modify the set of active invocations in any way. So the intrawave
collision loop may break the results of subgroup operations in unpredictable
ways, even outside the driver's compiler infrastructure. Even if the driver
splits the subgroup exactly at ``OpBeginInvocationInterlockEXT`` and makes the
lane subsets rejoin exactly at ``OpEndInvocationInterlockEXT``, the application
and the compilers that created the source shader are still not aware of that
happening: the input SPIR-V or GLSL shader might have already gone through
various optimizations, such as common subexpression elimination which might
have considered a subgroup operation before ``OpBeginInvocationInterlockEXT``
and one after it equivalent.
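
To make the layer splitting described earlier in this section concrete, here is
a minimal C sketch that expands an ``INTRAWAVE_COLLISION`` quad mask into one
lane mask per layer, in the order the layers would need to run the ordered
section. The function name and the output array are hypothetical; an actual
shader would typically do this incrementally with scalar instructions inside
the loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Expands the quad mask into per-layer lane masks (at most 16 layers).
 * wave_size is 32 or 64. Returns the number of layers written. */
unsigned split_intrawave_layers(uint32_t intrawave_collision,
                                unsigned wave_size,
                                uint64_t layer_lane_masks[16])
{
   unsigned num_quads = wave_size / 4;
   /* Only the lower 8 or 16 bits are meaningful, ignore the rest. */
   uint32_t quad_mask = intrawave_collision & ((1u << num_quads) - 1u);
   unsigned num_layers = 0;
   unsigned layer_start_quad = 0;

   for (unsigned quad = 1; quad <= num_quads; ++quad) {
      /* A layer ends where the next quad starts a new one, or at the
       * end of the wave. */
      bool ends_layer =
         quad == num_quads || (quad_mask & (1u << quad)) != 0;
      if (!ends_layer)
         continue;
      uint64_t lanes = 0;
      for (unsigned q = layer_start_quad; q < quad; ++q)
         lanes |= UINT64_C(0xF) << (q * 4);
      layer_lane_masks[num_layers++] = lanes;
      layer_start_quad = quad;
   }
   return num_layers;
}
```

For the 64-lane example mask 0b0000001110000100, this produces five layers
covering lanes 7:0, 27:8, 31:28, 35:32, and 63:36, matching the order given in
the example.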

The idea behind reporting intrawave collisions to shaders is to reduce the
impact on the parallelism of the part of the shader that doesn't depend on the
ordering, to avoid wasting lanes in the wave and to allow the code outside the
ordered section in different invocations to run in parallel lanes as usual. This
may be especially helpful if the ordered section is small compared to the rest
of the shader, for instance, a custom blending equation at the end of the usual
fragment shader for a surface in the world.

However, whether handling intrawave collisions is preferred is not a question
with one universal answer. Intrawave collisions are pretty uncommon without
multisampling, or when using sample interlock with multisampling, although
they're highly frequent with pixel interlock with multisampling, when adjacent
primitives cover the same pixels along the shared edge (though that's an
extremely expensive situation in general). But resolving intrawave collisions
adds some overhead costs to the shader.
If intrawave overlap is unlikely to
happen often, or even more importantly, if the majority of the shader is inside
the ordered section, handling it in the shader may cause more harm than good.

GFX11 removes this concept entirely; instead, overlapping invocations are
always placed in different waves.

GFX9–10.3: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To exit the ordered section and let overlapping waves resume execution and enter
their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message
(7) using ``s_sendmsg``.

If the wave has enabled POPS by setting the packer hardware register, it *must
not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the
message must be sent on all execution paths after the packer register setup.
However, if the wave exits before having configured the packer register, sending
the message is not required, though it's still fine to send it regardless of
that.

Note that if the shader has multiple ``OpEndInvocationInterlockEXT``
instructions executed in the same wave (depending on a divergent condition, for
example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave
only once, and especially not before any awaiting of overlapped waves.

Before the message is sent, all counters for memory accesses that need to be
primitive-ordered must be awaited. This covers both writes and reads (the
latter in case something after the ordered section depends on the per-pixel
data, for instance, the tail blending fallback in order-independent
transparency). Those counters may include ``vm``, ``vs``, and in some cases
``lgkm``. Normally, primitive-ordered memory accesses will be done through VMEM
with divergent addresses, not SMEM, as there's no synchronization between
fragments at different pixel coordinates. It's still technically possible for a
shader, even though pointless and suboptimal, to explicitly perform them in a
waterfall loop, for instance, and that must work correctly too. Without
awaiting the counters, a race condition will occur when the newly resumed waves
start accessing the memory locations to which there still are outstanding
accesses in the current wave.
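
The ``ORDERED_PS_DONE`` rules in this section can be summarized as a small
piece of bookkeeping. The following C sketch is only a hypothetical model of
those rules, such as a compiler pass or a validator might keep per execution
path; the struct and function names are not taken from any driver.

```c
#include <stdbool.h>

/* Hypothetical per-wave state tracked along one execution path. */
struct pops_exit_state {
   bool packer_configured; /* The packer hardware register has been set. */
   bool done_sent;         /* ORDERED_PS_DONE has already been sent. */
};

/* The message must be sent by the wave at most once. */
bool may_send_ordered_ps_done(const struct pops_exit_state *state)
{
   return !state->done_sent;
}

/* s_endpgm is allowed if POPS was never enabled for the wave, or if the
 * message has been sent after enabling it. */
bool may_end_program(const struct pops_exit_state *state)
{
   return !state->packer_configured || state->done_sent;
}
```

The model deliberately leaves out the counter waits, which have to precede the
message as described above.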

Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction,
which combines waiting for all the counters, sending the ``ORDERED_PS_DONE``
message, and ending the program. Generally, however, it's desirable to resume
overlapping waves as early as possible, including before the export, as it may
stall the wave for some time too.

GFX11: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The overlapping waves are resumed when the wave performs the last export (with
the ``done`` flag).

The same requirements for awaiting the memory access counters as on GFX9–10.3
still apply.

Memory access requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^

The compiler needs to ensure that entering the ordered section implements
acquire semantics, and exiting it implements release semantics, in the fragment
interlock memory scope for ``UniformMemory`` and ``ImageMemory`` SPIR-V storage
classes.

A fragment interlock memory scope instance includes overlapping fragment shader
invocations executed by commands inside a single subpass. It may be considered a
subset of a queue family memory scope instance from the perspective of memory
barriers.

Fragment shader interlock doesn't perform implicit memory availability or
visibility operations. Shaders must do them by themselves for accesses requiring
primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL
or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope
in SPIR-V.

On AMD hardware, this means that the accessed memory locations must be made
available or visible between waves that may be executed on any compute unit, so
accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag
and L1$ via DLC.

However, it should be noted that memory accesses in the ordered section may be
expected by the application to be done in primitive order even if they don't
have the GLC and DLC flags.
Coherent access not only bypasses, but also
invalidates the lower-level caches for the accessed memory locations. Thus,
considering that normally per-pixel data is accessed exclusively by the
invocation executing the ordered section, it's not necessary to make all reads
or writes in the ordered section for one memory location GLC/DLC, just
the first read and the last write: it doesn't matter if per-pixel data is cached
in L0/L1 in the middle of a dependency chain in the ordered section, as long as
it's invalidated in them at the beginning and flushed to L2 at the end.
Therefore, optimizations in the compiler must not simply assume that only
coherent accesses need primitive ordering. Moreover, the compiler must also
take into account that the same data may be accessed through different bindings.

Export requirements
^^^^^^^^^^^^^^^^^^^

With POPS, on all hardware generations, the shader must have at least one
export, though it can be a null or an ``off, off, off, off`` one.

Also, even if the shader doesn't need to export any real data, the export
skipping that was added in GFX10 must not be used, and some space must be
allocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for
some color output to ``SPI_SHADER_32_R``.

Without this, the shader will be executed without the needed synchronization on
GFX10, and will hang on GFX11.

Drawing context setup
---------------------

Configuring POPS
^^^^^^^^^^^^^^^^

Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register.

To enable POPS for the draw,
``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1.

On GFX9–10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which
fragment shader invocations are considered overlapping:

* For pixel interlock, it must be set to 0 (1 sample).
* If sample interlock is sufficient (only synchronizing between invocations that
  have any common sample mask bits), it may be set to
  ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES`` — the number of sample coverage mask
  bits passed to the shader, which is expected to use the sample mask to
  determine whether it's allowed to access the data for each of the samples. As
  of April 2023, PAL for some reason doesn't use non-1x
  ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer
  Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading
  (those APIs tie the interlock granularity to the shading frequency — Vulkan
  and OpenGL fragment shader interlock, however, allows specifying the interlock
  granularity independently of it, making it possible both to ask for finer
  synchronization guarantees and to require stronger ones than Direct3D ROVs can
  provide). However, with MSAA, on AMD hardware, pixel interlock generally
  performs *massively*, sometimes prohibitively, slower than sample interlock,
  because it causes fragment shader invocations along the common edge of
  adjacent primitives to be ordered as they cover the same pixels (even though
  they don't cover any common samples).
  So it's highly desirable for the driver
  to provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES``
  accordingly, if the shader declares that it's enough for it via the execution
  mode.

On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is
used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier
architecture generations (and has a different bit offset in the register), and
``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11
blending performance workaround overriding the intrinsic rate must not be
applied if POPS is used in the draw — the intrinsic rate override must be used
solely to control the interlock granularity in this case.

No explicit flushes/synchronization are needed when changing the pipeline state
variables that may be involved in POPS, such as the rasterization sample count.
POPS automatically keeps synchronizing invocations even between draws with
different sample counts (invocations with common coverage mask bits are
considered overlapping by the hardware, regardless of what those samples
actually are — only the indices are important).
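
The ``DB_SHADER_CONTROL`` choices described above can be summarized in a small
sketch. All type and function names here (``pops_interlock_mode``,
``pops_db_shader_control``, ``pops_fill_db_shader_control``) are hypothetical
and are not taken from PAL or any driver. The struct only mirrors the register
fields named in the text and does not model real bit offsets. The sketch also
assumes that ``OVERRIDE_INTRINSIC_RATE`` on GFX11 accepts the same value that
``POPS_OVERLAP_NUM_SAMPLES`` would get on earlier generations.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical names for illustration only; not real driver or PAL types. */
   enum pops_interlock_mode {
      POPS_INTERLOCK_NONE,
      POPS_INTERLOCK_PIXEL,
      POPS_INTERLOCK_SAMPLE,
   };

   /* Mirrors the DB_SHADER_CONTROL fields named in the text, not the layout. */
   struct pops_db_shader_control {
      bool primitive_ordered_pixel_shader;
      uint32_t pops_overlap_num_samples;   /* GFX9-10.3 only */
      bool override_intrinsic_rate_enable; /* GFX11 only */
      uint32_t override_intrinsic_rate;    /* GFX11 only */
   };

   static struct pops_db_shader_control
   pops_fill_db_shader_control(bool is_gfx11, enum pops_interlock_mode mode,
                               uint32_t msaa_exposed_samples)
   {
      struct pops_db_shader_control ctl = {0};

      if (mode == POPS_INTERLOCK_NONE)
         return ctl;

      ctl.primitive_ordered_pixel_shader = true;

      /* Pixel interlock: 0 (1 sample). Sample interlock: the value of
       * PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES, passed through unchanged.
       */
      uint32_t overlap =
         mode == POPS_INTERLOCK_SAMPLE ? msaa_exposed_samples : 0;

      if (is_gfx11) {
         /* Assumption: same value encoding as POPS_OVERLAP_NUM_SAMPLES. */
         ctl.override_intrinsic_rate_enable = true;
         ctl.override_intrinsic_rate = overlap;
      } else {
         ctl.pops_overlap_num_samples = overlap;
      }

      return ctl;
   }

A driver following this sketch would pick ``POPS_INTERLOCK_SAMPLE`` only when
the shader's execution mode declares that sample interlock is sufficient, and
would skip the GFX11 blending performance workaround whenever
``primitive_ordered_pixel_shader`` ends up set.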

Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage
sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``
even if there's no depth/stencil target.

Hardware bug workarounds
^^^^^^^^^^^^^^^^^^^^^^^^

Early revisions of GFX9 — ``CHIP_VEGA10`` and ``CHIP_RAVEN`` — contain a
hardware bug that may result in a hang, and need a workaround to be enabled.
Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or
more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP``
must be set to 1 for draws that satisfy this condition. In PAL, this is the
``waMiscPopsMissedOverlap`` workaround. It lowers performance in those cases,
increasing the frame time by around 1.5 to 2 times in
`nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
on the RX Vega 10, but it's required only in a rare case (8x+ MSAA) and is
mandatory to ensure stability.
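
The condition for the ``POPS_DRAIN_PS_ON_OVERLAP`` workaround can be written as
a simple predicate. This is a minimal sketch: the ``pops_chip`` enum and the
``pops_needs_drain_ps_on_overlap`` function are hypothetical names introduced
for illustration (a real driver would use its own chip identifiers), and the
sample counts are assumed to be plain counts such as 1, 2, 4 or 8 rather than
register encodings.

.. code-block:: c

   #include <stdbool.h>

   /* Hypothetical chip identifiers for this sketch only. */
   enum pops_chip {
      POPS_CHIP_VEGA10,
      POPS_CHIP_RAVEN,
      POPS_CHIP_OTHER,
   };

   /* Returns true if DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP should be set
    * to 1 for the draw, following the rule described in the text.
    */
   static bool
   pops_needs_drain_ps_on_overlap(enum pops_chip chip, bool pops_enabled,
                                  unsigned rasterization_samples,
                                  unsigned depth_stencil_samples)
   {
      if (!pops_enabled)
         return false;

      /* Only the early GFX9 revisions are affected. */
      if (chip != POPS_CHIP_VEGA10 && chip != POPS_CHIP_RAVEN)
         return false;

      return rasterization_samples >= 8 || depth_stencil_samples >= 8;
   }

Evaluating this per draw keeps the performance cost limited to the 8x and
higher MSAA draws that actually need the workaround.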

Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required
on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if
it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``,
``CHIP_NAVI14``), and the draw uses POPS,
``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to
``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL).

Out-of-order rasterization interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a largely unresearched topic currently. However, considering that POPS
is primarily the functionality of the Depth Block, similarity to the behavior of
out-of-order rasterization in depth/stencil testing may possibly be expected.

If the shader specifies an ordered interlock execution mode, out-of-order
rasterization likely must not be enabled implicitly.

As of April 2023, PAL doesn't have any rules specifically for POPS in the logic
determining whether out-of-order rasterization can be enabled automatically.
Some of the POPS use cases may be covered by the rule that always
disables out-of-order rasterization if the shader writes to Unordered Access
Views (storage resources), though fragment shader interlock can be used for
read-only purposes too (for ordering between draws that only read per-pixel data
and draws that may write it), so that may be an oversight.

Explicitly enabled relaxed rasterization order modifies the concept of
rasterization order itself in Vulkan, so from the point of view of the
specification of fragment shader interlock, relaxed rasterization order should
still be applicable regardless of whether the shader requests ordered interlock.
PAL also doesn't make any POPS-specific exceptions here as of April 2023.

Variable-rate shading interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces
the shading rate to be 1x1, thus the
``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must
be false.

On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the
``fragmentShadingRateWithFragmentShaderInterlock`` property must be true.
However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set,
enabling POPS will force 1x1 shading rate.

The widest interlock granularity available on GFX11 — with the lowest possible
Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There's no
synchronization between coarse fragment shader invocations if they don't cover
common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device
feature is not available.

Additional configuration
^^^^^^^^^^^^^^^^^^^^^^^^

These are some largely unresearched options found in the register declarations.
PAL doesn't use them, so it's unknown if they make any significant difference.
No effect was found in `nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_NAVI31``.

* ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9–10.3.
* ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+.