Primitive Ordered Pixel Shading
===============================

Primitive Ordered Pixel Shading (POPS) is a feature available starting from
GFX9 that provides the Fragment Shader Interlock or Fragment Shader Ordering
functionality.

It allows a part of a fragment shader, called an ordered section (or a critical
section), to be executed sequentially in rasterization order for different
invocations covering the same pixel position.

This article describes how POPS is set up in shader code and the registers. The
information here is currently provided for architecture generations up to GFX11.

Note that the information in this article is **not official** and may contain
inaccuracies, as well as incomplete or incorrect assumptions. It is based on the
shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage
in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references,
and experimentation with the hardware.

Shader code
-----------

With POPS, a wave can dynamically execute up to one ordered section. It is also
fine for a wave not to enter an ordered section at all if it doesn't need
ordering on its execution path.

The setup of the ordered section consists of three parts:

1. Entering the ordered section in the current wave: awaiting the completion of
   ordered sections in overlapped waves.
2. Resolving overlap within the current wave: intrawave collisions (optional
   and GFX9–10.3 only).
3. Exiting the ordered section: resuming overlapping waves trying to enter
   their ordered sections.

GFX9–10.3: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Awaiting the completion of ordered sections in overlapped waves is performed by
setting the POPS packer hardware register, and then polling the volatile
``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest
overlapped wave ID for the current wave.

The information needed for the wave to perform the waiting is provided to it via
the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in the
``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers (note that,
unlike various other arguments, the POPS arguments need to be enabled not only
in ``RSRC``, but in ``PA_SC_SHADER_CONTROL`` as well).

The collision wave ID argument contains the following unsigned values:

* [31]: Whether overlap has occurred.
* [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated
  with.
* [25:16]: Newest overlapped wave ID.
* [9:0]: Current wave ID.

The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of
the fields, possibly from an early development iteration, but their meanings
are described accurately there.

The wait must not be performed if the "did overlap" bit 31 is set to 0,
otherwise it will result in a hang. Also, the bit being set to 0 indicates that
there is *both* no wave overlap *and no intrawave collisions* for the current
wave. So if the bit is 0, it's safe for the wave to skip all of the POPS logic
completely and execute the contents of the ordered section simply as usual with
unordered access, as a potential additional optimization. The packer hardware
register, however, may be set safely even without overlap: it's the wait loop
itself that must not be executed if it was reported that there was no overlap.
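
As an illustration only, the field layout listed above can be decoded with a few
shifts and masks. The following C sketch is not taken from any driver:
``pops_collision_info``, ``extract_bits`` and ``decode_collision_wave_id`` are
hypothetical names, and ``gfx_major`` is assumed to be 9 for GFX9 or 10 for
GFX10–10.3.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical container for the decoded fields (not a Mesa or PAL type). */
   struct pops_collision_info {
      bool did_overlap;              /* bit 31 */
      uint32_t packer_id;            /* bits 29:28 on GFX10+, bit 28 on GFX9 */
      uint32_t newest_overlapped_id; /* bits 25:16, wrapping 10-bit wave ID */
      uint32_t current_id;           /* bits 9:0, wrapping 10-bit wave ID */
   };

   /* Returns `count` bits of `value` starting at bit `first`. */
   static uint32_t extract_bits(uint32_t value, unsigned first, unsigned count)
   {
      assert(count >= 1 && count < 32 && first + count <= 32);
      return (value >> first) & ((1u << count) - 1u);
   }

   /* Splits the COLLISION_WAVEID SGPR argument into its fields. */
   static struct pops_collision_info
   decode_collision_wave_id(uint32_t arg, unsigned gfx_major)
   {
      assert(gfx_major == 9 || gfx_major == 10);

      struct pops_collision_info info;
      info.did_overlap = extract_bits(arg, 31, 1) != 0;
      /* The packer ID is 1 bit wide on GFX9 and 2 bits wide on GFX10+. */
      info.packer_id = extract_bits(arg, 28, gfx_major >= 10 ? 2 : 1);
      info.newest_overlapped_id = extract_bits(arg, 16, 10);
      info.current_id = extract_bits(arg, 0, 10);
      return info;
   }

A shader compiler would emit the equivalent scalar bitfield extraction
instructions instead. The sketch is only meant to make the offsets and widths
explicit, including the narrower packer ID field on GFX9.
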

The packer ID needs to be passed to the packer hardware register using
``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.

On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer
the wave is associated with:

* [25]: The wave is associated with packer 1.
* [24]: The wave is associated with packer 0.

Initially, both of these bits are set to 0, meaning that POPS is disabled for
the wave. If the wave needs to enter the ordered section, it must set bit 24 to
1 if the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the
packer ID is 1.

Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead,
containing the following fields:

* [2:1]: Packer ID.
* [0]: POPS enabled for the wave.

Initially, POPS is disabled for a wave. To start entering the ordered section,
bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs
to be set to 1.

The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are
10-bit values wrapping around on overflow: consecutive waves are numbered 1022,
1023, 0, 1… This wraparound needs to be taken into account when comparing the
exiting wave ID and the newest overlapped wave ID.

Specifically, until the current wave exits the ordered section, its ID can't be
smaller than the newest overlapped wave ID or the exiting wave ID. So
``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to
monotonically increasing unsigned values. In this case, the largest value,
0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current
wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from
before the last wraparound will be near 0 increasing away from it. Subtracting
``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.
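
The register values and the remapping described above can be modeled on the CPU
as follows. This is a minimal sketch under the assumptions stated in this
article, with hypothetical function names. In a real shader, the same arithmetic
is done with scalar ALU instructions and the result is written with
``s_setreg_b32``.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Value for the POPS_PACKER[2:0] field on GFX10+:
    * bit 0 enables POPS, bits 2:1 hold the packer ID. */
   static uint32_t pops_packer_value_gfx10(uint32_t packer_id)
   {
      assert(packer_id < 4);
      return 1u | (packer_id << 1);
   }

   /* Value for the MODE[25:24] field on GFX9:
    * bit 24 selects packer 0, bit 25 selects packer 1. */
   static uint32_t pops_mode_bits_gfx9(uint32_t packer_id)
   {
      assert(packer_id < 2);
      return packer_id ? 0x2u : 0x1u;
   }

   /* Remaps a wrapping 10-bit wave ID to a monotonically increasing 32-bit
    * value by adding ~current_id, which equals subtracting (current_id + 1)
    * modulo 2^32. The current wave itself maps to 0xFFFFFFFF. */
   static uint32_t remap_wave_id(uint32_t wave_id_10bit,
                                 uint32_t current_id_10bit)
   {
      assert(wave_id_10bit < 1024 && current_id_10bit < 1024);
      return wave_id_10bit + ~current_id_10bit;
   }

For example, with a current wave ID of 1, the 10-bit IDs 1022, 1023, 0 and 1 map
to 1020, 1021, 0xFFFFFFFE and 0xFFFFFFFF respectively, so a plain unsigned
comparison orders them correctly across the wraparound.
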

GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit
newest overlapped wave ID is greater than the 10-bit current wave ID (meaning
that it's behind the last wraparound point), 1 needs to be added to the newest
overlapped wave ID before using it in the comparison. This was corrected in
GFX10.

The exiting wave ID (not to be confused with "exited": the exiting wave ID is
the wave that will exit the ordered section next) is queried via the
``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be
one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave
ID to a monotonically increasing one.

It's a volatile operand, and it needs to be read in a loop until its value
becomes greater than the newest overlapped wave ID (after remapping both to
monotonic). However, if it's too early for the current wave to enter the ordered
section, it needs to yield execution to other waves that may potentially be
overlapped, via ``s_sleep``. GFX9 requires a finite amount of delay to be
specified; AMD uses 3. Starting from GFX10, exiting the ordered section wakes up
the waiting waves, so the maximum delay of 0xFFFF can be used.

In pseudocode, the entering logic would look like this::

   bool did_overlap = collision_wave_id[31];
   if (did_overlap) {
      if (gfx_level >= GFX10) {
         uint packer_id = collision_wave_id[29:28];
         s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
      } else {
         uint packer_id = collision_wave_id[28];
         s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
      }

      uint current_10bit_wave_id = collision_wave_id[9:0];
      // Or -(current_10bit_wave_id + 1).
      uint wave_id_remap_offset = ~current_10bit_wave_id;

      uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
      if (gfx_level < GFX10 &&
          newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
         ++newest_overlapped_10bit_wave_id;
      }
      uint newest_overlapped_wave_id =
         newest_overlapped_10bit_wave_id + wave_id_remap_offset;

      while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
               newest_overlapped_wave_id)) {
         s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
      }
   }

The SPIR-V fragment shader interlock specification requires an invocation (an
individual invocation, not the whole subgroup) to execute
``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple
begin instructions, or even multiple begin/end pairs, under divergent
conditions, a wave may end up waiting for the overlapped waves multiple times.
Thankfully, it's safe to set the POPS packer hardware register to the same
value, or to run the wait loop, multiple times during the wave's execution, as
long as the ordered section isn't exited in between by the wave.

GFX11: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave
status flag to report that the wave may enter the ordered section. It's awaited
by the ``s_wait_event`` instruction, with bit 0 ("don't wait for
``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD
passes 0 as the whole immediate operand.

The "export ready" wait can be done multiple times safely.

GFX9–10.3: Resolving intrawave collisions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX9–10.3, it's possible for overlapping fragment shader invocations to be
placed not only in different waves, but also in the same wave, with the shader
code making sure that the ordered section is executed for overlapping
invocations in order.

This functionality is optional. It can be activated by enabling loading of the
``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and
``PA_SC_SHADER_CONTROL``.

The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION``
contain the mask of whether each quad in the wave starts a new layer of
overlapping invocations, and thus the ordered section code for them needs to be
executed after running it for all lanes with indices preceding that quad index
multiplied by 4. The rest of the bits in the argument need to be ignored. AMD
explicitly masks them out in shader code (although this is not necessary if the
shader uses "find first 1" to obtain the start of the next set of overlapping
quads or expands this quad mask into a lane mask).

For example, if the intrawave collision mask is 0b0000001110000100, or
``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code of the ordered section
needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads
6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32),
and then for the remaining quads 15:9 (lanes 63:36).

This effectively causes the ordered section to be executed as smaller
"sub-subgroups" within the original subgroup.

However, this is not always compatible with the execution model of SPIR-V or
GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of
the shader in a loop may be unsafe in some cases.
One particular example is when the shader uses subgroup operations influenced
by lanes outside the current quad. In this case, the code outside and inside
the ordered section may be executed with different sets of active invocations,
affecting the results of subgroup operations. But in SPIR-V and GLSL, fragment
shader interlock is not supposed to modify the set of active invocations in any
way. So the intrawave collision loop may break the results of subgroup
operations in unpredictable ways, even outside the driver's compiler
infrastructure. Even if the driver splits the subgroup exactly at
``OpBeginInvocationInterlockEXT`` and makes the lane subsets rejoin exactly at
``OpEndInvocationInterlockEXT``, the application and the compilers that created
the source shader are still not aware of that happening. The input SPIR-V or
GLSL shader might have already gone through various optimizations, such as
common subexpression elimination, which might have considered a subgroup
operation before ``OpBeginInvocationInterlockEXT`` and one after it equivalent.

The idea behind reporting intrawave collisions to shaders is to reduce the
impact on the parallelism of the part of the shader that doesn't depend on the
ordering, to avoid wasting lanes in the wave and to allow the code outside the
ordered section in different invocations to run in parallel lanes as usual. This
may be especially helpful if the ordered section is small compared to the rest
of the shader (for instance, a custom blending equation at the end of the usual
fragment shader for a surface in the world).

However, whether handling intrawave collisions is preferred is not a question
with one universal answer.
Intrawave collisions are pretty uncommon without multisampling, or when using
sample interlock with multisampling, although they're highly frequent with
pixel interlock with multisampling, when adjacent primitives cover the same
pixels along the shared edge (though that's an extremely expensive situation in
general). But resolving intrawave collisions adds some overhead costs to the
shader. If intrawave overlap is unlikely to happen often, or even more
importantly, if the majority of the shader is inside the ordered section,
handling it in the shader may cause more harm than good.

GFX11 removes this concept entirely; instead, overlapping invocations are
always placed in different waves.

GFX9–10.3: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To exit the ordered section and let overlapping waves resume execution and enter
their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message
(7) using ``s_sendmsg``.

If the wave has enabled POPS by setting the packer hardware register, it *must
not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the
message must be sent on all execution paths after the packer register setup.
However, if the wave exits before having configured the packer register, sending
the message is not required, though it's still fine to send it regardless of
that.

Note that if the shader has multiple ``OpEndInvocationInterlockEXT``
instructions executed in the same wave (depending on a divergent condition, for
example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave
only once, and especially not before any awaiting of overlapped waves.

Before the message is sent, all counters for memory accesses that need to be
primitive-ordered must be awaited. This covers writes and also reads, in case
something after the ordered section depends on the per-pixel data (for
instance, the tail blending fallback in order-independent transparency). The
counters may include ``vm``, ``vs``, and in some cases ``lgkm``. Normally,
primitive-ordered memory accesses will be done through VMEM with divergent
addresses, not SMEM, as there's no synchronization between fragments at
different pixel coordinates. It's still technically possible, even though
pointless and suboptimal, for a shader to explicitly perform them in a
waterfall loop, for instance, and that must work correctly too. Without the
waiting, a race condition will occur when the newly resumed waves start
accessing the memory locations to which there still are outstanding accesses in
the current wave.

Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction,
which combines waiting for all the counters, sending the ``ORDERED_PS_DONE``
message, and ending the program. Generally, however, it's desirable to resume
overlapping waves as early as possible, including before the export, as the
export may stall the wave for some time too.

GFX11: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The overlapping waves are resumed when the wave performs the last export (with
the ``done`` flag).

The same requirements for awaiting the memory access counters as on GFX9–10.3
still apply.

Memory access requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^

The compiler needs to ensure that entering the ordered section implements
acquire semantics, and exiting it implements release semantics, in the fragment
interlock memory scope for ``UniformMemory`` and ``ImageMemory`` SPIR-V storage
classes.

A fragment interlock memory scope instance includes overlapping fragment shader
invocations executed by commands inside a single subpass. It may be considered a
subset of a queue family memory scope instance from the perspective of memory
barriers.

Fragment shader interlock doesn't perform implicit memory availability or
visibility operations. Shaders must do them by themselves for accesses requiring
primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL
or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope
in SPIR-V.

On AMD hardware, this means that the accessed memory locations must be made
available or visible between waves that may be executed on any compute unit, so
accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag
and L1$ via DLC.

However, it should be noted that memory accesses in the ordered section may be
expected by the application to be done in primitive order even if they don't
have the GLC and DLC flags. Coherent access not only bypasses, but also
invalidates the lower-level caches for the accessed memory locations. Thus,
considering that normally per-pixel data is accessed exclusively by the
invocation executing the ordered section, it's not necessary to make all reads
or writes in the ordered section for one memory location GLC/DLC. Only the
first read and the last write need to be: it doesn't matter if per-pixel data
is cached in L0/L1 in the middle of a dependency chain in the ordered section,
as long as it's invalidated in them at the beginning and flushed to L2 at the
end. Therefore, optimizations in the compiler must not simply assume that only
coherent accesses need primitive ordering. Moreover, the compiler must also
take into account that the same data may be accessed through different bindings.

Export requirements
^^^^^^^^^^^^^^^^^^^

With POPS, on all hardware generations, the shader must have at least one
export, though it can be a null or an ``off, off, off, off`` one.

Also, even if the shader doesn't need to export any real data, the export
skipping that was added in GFX10 must not be used, and some space must be
allocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for
some color output to ``SPI_SHADER_32_R``.

Without this, the shader will be executed without the needed synchronization on
GFX10, and will hang on GFX11.

Drawing context setup
---------------------

Configuring POPS
^^^^^^^^^^^^^^^^

Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register.

To enable POPS for the draw,
``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1.

On GFX9–10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which
fragment shader invocations are considered overlapping:

* For pixel interlock, it must be set to 0 (1 sample).
* If sample interlock is sufficient (only synchronizing between invocations that
  have any common sample mask bits), it may be set to
  ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``, the number of sample coverage mask
  bits passed to the shader, which is expected to use the sample mask to
  determine whether it's allowed to access the data for each of the samples.
  As of April 2023, PAL for some reason doesn't use non-1x
  ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer
  Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading
  (those APIs tie the interlock granularity to the shading frequency; Vulkan
  and OpenGL fragment shader interlock, however, allow specifying the interlock
  granularity independently of it, making it possible both to ask for finer
  synchronization guarantees and to require stronger ones than Direct3D ROVs can
  provide). However, with MSAA, on AMD hardware, pixel interlock generally
  performs *massively*, sometimes prohibitively, slower than sample interlock,
  because it causes fragment shader invocations along the common edge of
  adjacent primitives to be ordered as they cover the same pixels (even though
  they don't cover any common samples). So it's highly desirable for the driver
  to provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES``
  accordingly, if the shader declares that it's enough for it via the execution
  mode.

On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is
used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier
architecture generations (and has a different bit offset in the register), and
``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11
blending performance workaround overriding the intrinsic rate must not be
applied if POPS is used in the draw. The intrinsic rate override must be used
solely to control the interlock granularity in this case.

No explicit flushes/synchronization are needed when changing the pipeline state
variables that may be involved in POPS, such as the rasterization sample count.
POPS automatically keeps synchronizing invocations even between draws with
different sample counts (invocations with common coverage mask bits are
considered overlapping by the hardware, regardless of what those samples
actually are; only the indices are important).

Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage
sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``
even if there's no depth/stencil target.

Hardware bug workarounds
^^^^^^^^^^^^^^^^^^^^^^^^

Early revisions of GFX9 (``CHIP_VEGA10`` and ``CHIP_RAVEN``) contain a
hardware bug that may result in a hang, and need a workaround to be enabled.
Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or
more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP``
must be set to 1 for draws that satisfy this condition. In PAL, this is the
``waMiscPopsMissedOverlap`` workaround. It results in lower performance in
those cases, increasing the frame time by around 1.5 to 2 times in
`nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
on the RX Vega 10, but it's required in a pretty rare case (8x+ MSAA) and is
mandatory to ensure stability.

Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required
on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if
it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``,
``CHIP_NAVI14``), and the draw uses POPS,
``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to
``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL).

Out-of-order rasterization interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a largely unresearched topic currently.
However, considering that POPS is primarily the functionality of the Depth
Block, behavior similar to that of out-of-order rasterization in depth/stencil
testing may possibly be expected.

If the shader specifies an ordered interlock execution mode, out-of-order
rasterization likely must not be enabled implicitly.

As of April 2023, PAL doesn't have any rules specifically for POPS in the logic
determining whether out-of-order rasterization can be enabled automatically.
Some of the POPS usage cases may possibly be covered by the rule that always
disables out-of-order rasterization if the shader writes to Unordered Access
Views (storage resources), though fragment shader interlock can be used for
read-only purposes too (for ordering between draws that only read per-pixel data
and draws that may write it), so that may be an oversight.

Explicitly enabled relaxed rasterization order modifies the concept of
rasterization order itself in Vulkan, so from the point of view of the
specification of fragment shader interlock, relaxed rasterization order should
still be applicable regardless of whether the shader requests ordered interlock.
PAL also doesn't make any POPS-specific exceptions here as of April 2023.

Variable-rate shading interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces
the shading rate to be 1x1, thus the
``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must
be false.

On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the
``fragmentShadingRateWithFragmentShaderInterlock`` property must be true.
However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set,
enabling POPS will force 1x1 shading rate.

However, the widest interlock granularity available on GFX11 (with the lowest
possible Depth Block intrinsic rate, 1x) is per-fine-pixel. There's no
synchronization between coarse fragment shader invocations if they don't cover
common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device
feature is not available.

Additional configuration
^^^^^^^^^^^^^^^^^^^^^^^^

These are some largely unresearched options found in the register declarations.
PAL doesn't use them, so it's unknown if they make any significant difference.
No effect was found in `nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_NAVI31``.

* ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9–10.3.
* ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+.