Primitive Ordered Pixel Shading
===============================

Primitive Ordered Pixel Shading (POPS) is the feature available starting from
GFX9 that provides the Fragment Shader Interlock or Fragment Shader Ordering
functionality.

It allows a part of a fragment shader — an ordered section (or a critical
section) — to be executed sequentially in rasterization order for different
invocations covering the same pixel position.

This article describes how POPS is set up in shader code and the registers. The
information here is currently provided for architecture generations up to GFX11.

Note that the information in this article is **not official** and may contain
inaccuracies, as well as incomplete or incorrect assumptions. It is based on the
shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage
in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references,
and experimentation with the hardware.

Shader code
-----------

With POPS, a wave can dynamically execute up to one ordered section. However,
it is fine for a wave not to enter an ordered section at all if it doesn't need
ordering on its execution path.

The setup of the ordered section consists of three parts:

1. Entering the ordered section in the current wave — awaiting the completion of
   ordered sections in overlapped waves.
2. Resolving overlap within the current wave — intrawave collisions (optional
   and GFX9–10.3 only).
3. Exiting the ordered section — resuming overlapping waves trying to enter
   their ordered sections.

GFX9–10.3: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Awaiting the completion of ordered sections in overlapped waves is performed by
setting the POPS packer hardware register, and then polling the volatile
``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest
overlapped wave ID for the current wave.

The information needed for the wave to perform the waiting is provided to it via
the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in the
``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers (note that
unlike various other arguments, the POPS arguments need to be enabled not only
in ``RSRC``, but in ``PA_SC_SHADER_CONTROL`` as well).

The collision wave ID argument contains the following unsigned values:

* [31]: Whether overlap has occurred.
* [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated
  with.
* [25:16]: Newest overlapped wave ID.
* [9:0]: Current wave ID.

The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of
these fields, possibly from an early development iteration, but their meanings
are accurate there.

The wait must not be performed if the "did overlap" bit 31 is set to 0;
otherwise, it will result in a hang. Also, the bit being set to 0 indicates that
there are *both* no wave overlap *and no intrawave collisions* for the current
wave — so if the bit is 0, it's safe for the wave to skip all of the POPS logic
completely and execute the contents of the ordered section simply as usual with
unordered access, as a potential additional optimization. The packer hardware
register, however, may safely be set even without overlap — it's the wait loop
itself that must not be executed if it was reported that there was no overlap.

The packer ID needs to be passed to the packer hardware register using
``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.

On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer
the wave is associated with:

* [25]: The wave is associated with packer 1.
* [24]: The wave is associated with packer 0.

Initially, both of these bits are set to 0, meaning that POPS is disabled for
the wave. If the wave needs to enter the ordered section, it must set bit 24 to
1 if the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the
packer ID is 1.

Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead,
containing the following fields:

* [2:1]: Packer ID.
* [0]: POPS enabled for the wave.

Initially, POPS is disabled for a wave. To start entering the ordered section,
bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs
to be set to 1.

The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are
10-bit values wrapping around on overflow — consecutive waves are numbered 1022,
1023, 0, 1… This wraparound needs to be taken into account when comparing the
exiting wave ID and the newest overlapped wave ID.

Specifically, until the current wave exits the ordered section, its ID can't be
smaller than the newest overlapped wave ID or the exiting wave ID. So
``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to
monotonically increasing unsigned values. In this case, the largest value,
0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current
wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from
before the last wraparound will be near 0 increasing away from it. Subtracting
``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.

GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit
newest overlapped wave ID is greater than the 10-bit current wave ID (meaning
that it's behind the last wraparound point), 1 needs to be added to the newest
overlapped wave ID before using it in the comparison. This was corrected in
GFX10.
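
For illustration, suppose the current 10-bit wave ID is 5, so the remap offset
is ``~5`` = 0xFFFFFFFA (the same as subtracting ``5 + 1``)::

   // Current wave:                        5 + 0xFFFFFFFA = 0xFFFFFFFF
   // Wave 3, launched after the wrap:     3 + 0xFFFFFFFA = 0xFFFFFFFD
   // Wave 1020, from before the wrap:  1020 + 0xFFFFFFFA = 0x000003F6
   //
   // The launch order is preserved by the remapped values:
   // 0x3F6 < 0xFFFFFFFD < 0xFFFFFFFF. On GFX9, if the reported newest
   // overlapped 10-bit wave ID is 1020 (greater than 5), it's first corrected
   // to 1021, which remaps to 1021 + 0xFFFFFFFA = 0x000003F7.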

The exiting wave ID (not to be confused with "exited" — the exiting wave ID is
the wave that will exit the ordered section next) is queried via the
``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be
one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave
ID to a monotonically increasing one.

It's a volatile operand, and it needs to be read in a loop until its value
becomes greater than the newest overlapped wave ID (after remapping both to
monotonic). However, if it's too early for the current wave to enter the ordered
section, it needs to yield execution to other waves that may potentially be
overlapped — via ``s_sleep``. GFX9 requires a finite amount of delay to be
specified; AMD uses 3. Starting from GFX10, exiting the ordered section wakes up
the waiting waves, so the maximum delay of 0xFFFF can be used.

In pseudocode, the entering logic would look like this::

   bool did_overlap = collision_wave_id[31];
   if (did_overlap) {
      if (gfx_level >= GFX10) {
         uint packer_id = collision_wave_id[29:28];
         s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
      } else {
         uint packer_id = collision_wave_id[28];
         s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
      }

      uint current_10bit_wave_id = collision_wave_id[9:0];
      // Or -(current_10bit_wave_id + 1).
      uint wave_id_remap_offset = ~current_10bit_wave_id;

      uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
      if (gfx_level < GFX10 &&
          newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
         ++newest_overlapped_10bit_wave_id;
      }
      uint newest_overlapped_wave_id =
         newest_overlapped_10bit_wave_id + wave_id_remap_offset;

      while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
               newest_overlapped_wave_id)) {
         s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
      }
   }

The SPIR-V fragment shader interlock specification requires an invocation — an
individual invocation, not the whole subgroup — to execute
``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple
begin instructions, or even multiple begin/end pairs, under divergent
conditions, a wave may end up waiting for the overlapped waves multiple times.
Thankfully, it's safe to set the POPS packer hardware register to the same
value, or to run the wait loop, multiple times during the wave's execution, as
long as the wave doesn't exit the ordered section in between.

GFX11: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave
status flag to report that the wave may enter the ordered section. It's awaited
by the ``s_wait_event`` instruction, with bit 0 ("don't wait for
``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD
passes 0 as the whole immediate operand.

The "export ready" wait can be done multiple times safely.
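
For comparison with the GFX9–10.3 pseudocode above, a minimal sketch of the
GFX11 entering sequence (in the same pseudocode style)::

   // Immediate 0: bit 0 ("don't wait for export_ready") is not set, so the
   // wave waits until "export ready" is reported.
   s_wait_event(0);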

GFX9–10.3: Resolving intrawave collisions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX9–10.3, it's possible for overlapping fragment shader invocations to be
placed not only in different waves, but also in the same wave, with the shader
code making sure that the ordered section is executed for overlapping
invocations in order.

This functionality is optional — it can be activated by enabling loading of the
``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and
``PA_SC_SHADER_CONTROL``.

The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION``
contain a mask of whether each quad in the wave starts a new layer of
overlapping invocations — the ordered section code for such a quad needs to be
executed only after it has run for all lanes with indices below that quad index
multiplied by 4. The rest of the bits in the argument need to be ignored — AMD
explicitly masks them out in shader code (although this is not necessary if the
shader uses "find first 1" to obtain the start of the next set of overlapping
quads or expands this quad mask into a lane mask).

For example, if the intrawave collision mask is 0b0000001110000100, or
``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code of the ordered section
needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads
6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32),
and then for the remaining quads 15:9 (lanes 63:36).

This effectively causes the ordered section to be executed as smaller
"sub-subgroups" within the original subgroup.
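
As a rough sketch (in the same pseudocode style as above; ``lane_mask``, the
``find_first_set`` and ``lane_range_mask`` helpers, and the ``exec``
manipulation are purely illustrative), the intrawave collision loop, running
inside the ordered section after the interwave wait, could be structured like
this::

   uint num_quads = wave_size / 4;
   uint quad_mask = intrawave_collision & ((1 << num_quads) - 1);
   lane_mask saved_exec = exec;
   lane_mask remaining_lanes = exec;
   // The first layer always starts at quad 0, so bit 0 needs no special
   // handling.
   uint layer_start_quad = 0;
   while (remaining_lanes != 0) {
      // The current layer ends at the next quad that collides with an earlier
      // one, or at the end of the wave if there are no more collisions.
      // find_first_set returns the index of the lowest set bit.
      uint higher_collisions = quad_mask >> (layer_start_quad + 1);
      uint layer_end_quad = higher_collisions
         ? layer_start_quad + 1 + find_first_set(higher_collisions)
         : num_quads;
      lane_mask layer_lanes =
         lane_range_mask(layer_start_quad * 4, layer_end_quad * 4);
      exec = remaining_lanes & layer_lanes;
      if (exec != 0) {
         // Run the ordered section only for the lanes of this layer.
         ordered_section();
      }
      remaining_lanes &= ~layer_lanes;
      layer_start_quad = layer_end_quad;
   }
   exec = saved_exec;

With the example mask above, this loop runs the ordered section for the five
lane subsets listed there, in that order.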

However, this is not always compatible with the execution model of SPIR-V or
GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of
the shader in a loop may be unsafe in some cases. One particular example is when
the shader uses subgroup operations influenced by lanes outside the current
quad. In this case, the code outside and inside the ordered section may be
executed with different sets of active invocations, affecting the results of
subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not
supposed to modify the set of active invocations in any way. So the intrawave
collision loop may break the results of subgroup operations in unpredictable
ways, even outside the driver's compiler infrastructure. Even if the driver
splits the subgroup exactly at ``OpBeginInvocationInterlockEXT`` and makes the
lane subsets rejoin exactly at ``OpEndInvocationInterlockEXT``, the application
and the compilers that created the source shader are still not aware of that
happening — the input SPIR-V or GLSL shader might have already gone through
various optimizations, such as common subexpression elimination, which might
have considered a subgroup operation before ``OpBeginInvocationInterlockEXT``
and one after it equivalent.

The idea behind reporting intrawave collisions to shaders is to reduce the
impact on the parallelism of the part of the shader that doesn't depend on the
ordering, to avoid wasting lanes in the wave and to allow the code outside the
ordered section in different invocations to run in parallel lanes as usual. This
may be especially helpful if the ordered section is small compared to the rest
of the shader — for instance, a custom blending equation at the end of the usual
fragment shader for a surface in the world.

However, whether handling intrawave collisions is preferred is not a question
with one universal answer. Intrawave collisions are pretty uncommon without
multisampling, or when using sample interlock with multisampling. They are,
however, highly frequent with pixel interlock with multisampling, when adjacent
primitives cover the same pixels along the shared edge (though that's an
extremely expensive situation in general). At the same time, resolving intrawave
collisions adds some overhead to the shader. If intrawave overlap is unlikely to
happen often, or, even more importantly, if the majority of the shader is inside
the ordered section, handling it in the shader may cause more harm than good.

GFX11 removes this concept entirely; instead, overlapping invocations are always
placed in different waves.

GFX9–10.3: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To exit the ordered section and let overlapping waves resume execution and enter
their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message
(7) using ``s_sendmsg``.

If the wave has enabled POPS by setting the packer hardware register, it *must
not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the
message must be sent on all execution paths after the packer register setup.
However, if the wave exits before having configured the packer register, sending
the message is not required, though it's still fine to send it regardless.

Note that if the shader has multiple ``OpEndInvocationInterlockEXT``
instructions executed in the same wave (depending on a divergent condition, for
example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave
only once, and especially not before any awaiting of overlapped waves.

Before the message is sent, all counters for memory accesses that need to be
primitive-ordered must be awaited, for both writes and reads (in case something
after the ordered section depends on the per-pixel data, for instance, the tail
blending fallback in order-independent transparency). Those may include ``vm``,
``vs``, and in some cases ``lgkm``. Normally, primitive-ordered memory accesses
will be done through VMEM with divergent addresses rather than SMEM, as there's
no synchronization between fragments at different pixel coordinates; it's still
technically possible, though pointless and nonoptimal, for a shader to perform
them explicitly in a waterfall loop, for instance, and that must work correctly
too. Without the wait, a race condition will occur when the newly resumed waves
start accessing the memory locations to which there are still outstanding
accesses in the current wave.

Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction,
which combines waiting for all the counters, sending the ``ORDERED_PS_DONE``
message, and ending the program. Generally, however, it's desirable to resume
overlapping waves as early as possible, including before the export, as it may
stall the wave for some time too.
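
In the same pseudocode style as earlier, with the counter handling abstracted
(the exact counters to wait for are the ones described above), the exit sequence
would roughly be::

   // Await all outstanding memory accesses that need primitive ordering (vm,
   // vs, and lgkm if applicable) before letting the overlapping waves resume.
   s_waitcnt(...);
   // Resume the overlapping waves. Must be done exactly once on every
   // execution path after the packer hardware register has been set.
   s_sendmsg(ORDERED_PS_DONE);
   // ... exports and the rest of the shader ...
   s_endpgm();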

GFX11: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The overlapping waves are resumed when the wave performs the last export (with
the ``done`` flag).

The same requirements for awaiting the memory access counters as on GFX9–10.3
still apply.

Memory access requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^

The compiler needs to ensure that entering the ordered section implements
acquire semantics, and exiting it implements release semantics, in the fragment
interlock memory scope for the ``UniformMemory`` and ``ImageMemory`` SPIR-V
storage classes.

A fragment interlock memory scope instance includes overlapping fragment shader
invocations executed by commands inside a single subpass. It may be considered a
subset of a queue family memory scope instance from the perspective of memory
barriers.

Fragment shader interlock doesn't perform implicit memory availability or
visibility operations. Shaders must do them by themselves for accesses requiring
primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL
or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope
in SPIR-V.

On AMD hardware, this means that the accessed memory locations must be made
available or visible between waves that may be executed on any compute unit — so
accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag
and L1$ via DLC.

However, it should be noted that memory accesses in the ordered section may be
expected by the application to be done in primitive order even if they don't
have the GLC and DLC flags. Coherent access not only bypasses, but also
invalidates the lower-level caches for the accessed memory locations. Thus,
considering that normally per-pixel data is accessed exclusively by the
invocation executing the ordered section, it's not necessary for all reads and
writes to one memory location in the ordered section to be GLC/DLC — just the
first read and the last write: it doesn't matter if per-pixel data is cached in
L0/L1 in the middle of a dependency chain in the ordered section, as long as
it's invalidated in them in the beginning and flushed to L2 in the end.
Therefore, optimizations in the compiler must not simply assume that only
coherent accesses need primitive ordering — and moreover, the compiler must also
take into account that the same data may be accessed through different bindings.
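
As a sketch of what this means for the accesses themselves (the load/store
helpers and flag names here are illustrative, standing for VMEM accesses with
the GLC and, on GFX10+, DLC bits)::

   // First read of the per-pixel location in the ordered section: bypass and
   // invalidate L0/L1 so the data written by the overlapped waves is visible.
   value = buffer_load(per_pixel_address, GLC | DLC);
   // Intermediate accesses to the same location may use the caches as usual.
   value = blend(value, fragment_output);
   // Last write: write through to the global L2 so that overlapping waves,
   // possibly running on other compute units, will see the result.
   buffer_store(per_pixel_address, value, GLC | DLC);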

Export requirements
^^^^^^^^^^^^^^^^^^^

With POPS, on all hardware generations, the shader must have at least one
export, though it can be a null or an ``off, off, off, off`` one.

Also, even if the shader doesn't need to export any real data, the export
skipping that was added in GFX10 must not be used, and some space must be
allocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for
some color output to ``SPI_SHADER_32_R``.

Without this, the shader will be executed without the needed synchronization on
GFX10, and will hang on GFX11.
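
For instance, a shader that produces no real color or depth output might be
handled roughly like this (a sketch only; the export target and register field
naming is illustrative)::

   // Shader side: keep a single export with the "done" flag even though all
   // components are disabled.
   export(mrt0, off, off, off, off, done = true);

   // State side: reserve export buffer space for one 32-bit channel so the
   // export isn't skipped.
   SPI_SHADER_COL_FORMAT.COL0_EXPORT_FORMAT = SPI_SHADER_32_R;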

Drawing context setup
---------------------

Configuring POPS
^^^^^^^^^^^^^^^^

Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register.

To enable POPS for the draw,
``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1.

On GFX9–10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which
fragment shader invocations are considered overlapping:

* For pixel interlock, it must be set to 0 (1 sample).
* If sample interlock is sufficient (only synchronizing between invocations that
  have any common sample mask bits), it may be set to
  ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES`` — the number of sample coverage mask
  bits passed to the shader, which is expected to use the sample mask to
  determine whether it's allowed to access the data for each of the samples. As
  of April 2023, PAL for some reason doesn't use non-1x
  ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer
  Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading
  (those APIs tie the interlock granularity to the shading frequency — Vulkan
  and OpenGL fragment shader interlock, however, allows specifying the interlock
  granularity independently of it, making it possible both to ask for finer
  synchronization guarantees and to require stronger ones than Direct3D ROVs can
  provide). However, with MSAA, on AMD hardware, pixel interlock generally
  performs *massively*, sometimes prohibitively, slower than sample interlock,
  because it causes fragment shader invocations along the common edge of
  adjacent primitives to be ordered as they cover the same pixels (even though
  they don't cover any common samples). So it's highly desirable for the driver
  to provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES``
  accordingly, if the shader declares via its execution mode that sample
  interlock is enough for it.

On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is
used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier
architecture generations (and has a different bit offset in the register), and
``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11
blending performance workaround overriding the intrinsic rate must not be
applied if POPS is used in the draw — the intrinsic rate override must be used
solely to control the interlock granularity in this case.

No explicit flushing or synchronization is needed when changing the pipeline
state variables that may be involved in POPS, such as the rasterization sample
count. POPS automatically keeps synchronizing invocations even between draws
with different sample counts (invocations with common coverage mask bits are
considered overlapping by the hardware, regardless of what those samples
actually are — only the indices are important).

Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage
sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``
even if there's no depth/stencil target.
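
Putting this together, an illustrative sketch of the ``DB_SHADER_CONTROL`` setup
(``sample_interlock`` reflects the shader's interlock execution mode, and
``msaa_exposed_samples`` stands for the value of
``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``; both names are placeholders)::

   db_shader_control.PRIMITIVE_ORDERED_PIXEL_SHADER = 1;
   if (gfx_level >= GFX11) {
      // The interlock granularity is controlled through the intrinsic rate
      // override instead of POPS_OVERLAP_NUM_SAMPLES.
      db_shader_control.OVERRIDE_INTRINSIC_RATE_ENABLE = 1;
      db_shader_control.OVERRIDE_INTRINSIC_RATE =
         sample_interlock ? msaa_exposed_samples : 0;
   } else {
      // 0 means 1 sample, as required for pixel interlock.
      db_shader_control.POPS_OVERLAP_NUM_SAMPLES =
         sample_interlock ? msaa_exposed_samples : 0;
   }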

Hardware bug workarounds
^^^^^^^^^^^^^^^^^^^^^^^^

Early revisions of GFX9 — ``CHIP_VEGA10`` and ``CHIP_RAVEN`` — contain a
hardware bug that may result in a hang, and need a workaround to be enabled.
Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or
more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP``
must be set to 1 for draws that satisfy this condition. In PAL, this is the
``waMiscPopsMissedOverlap`` workaround. It results in slightly lower performance
in those cases, increasing the frame time by around 1.5 to 2 times in
`nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
on the RX Vega 10, but it's only required in a pretty rare case (8x+ MSAA), and
there it's mandatory to ensure stability.

Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required
on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if
it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``,
``CHIP_NAVI14``), and the draw uses POPS,
``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to
``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL).
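
A sketch of the workaround conditions described above (the chip and state
variable names are illustrative)::

   if (pops_enabled && (chip == CHIP_VEGA10 || chip == CHIP_RAVEN) &&
       (num_rasterization_samples >= 8 || num_depth_stencil_samples >= 8)) {
      db_dfsm_control.POPS_DRAIN_PS_ON_OVERLAP = 1;
   }

   if (pops_enabled && gfx_level == GFX10_1 &&
       db_dfsm_control.POPS_DRAIN_PS_ON_OVERLAP) {
      db_render_override2.PARTIAL_SQUAD_LAUNCH_CONTROL = PSLC_ON_HANG_ONLY;
   }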

Out-of-order rasterization interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a largely unresearched topic currently. However, considering that POPS
is primarily functionality of the Depth Block, behavior similar to that of
out-of-order rasterization in depth/stencil testing may possibly be expected.

If the shader specifies an ordered interlock execution mode, out-of-order
rasterization likely must not be enabled implicitly.

As of April 2023, PAL doesn't have any rules specifically for POPS in the logic
determining whether out-of-order rasterization can be enabled automatically.
Some of the POPS usage cases may possibly be covered by the rule that always
disables out-of-order rasterization if the shader writes to Unordered Access
Views (storage resources), though fragment shader interlock can be used for
read-only purposes too (for ordering between draws that only read per-pixel data
and draws that may write it), so that may be an oversight.

Explicitly enabled relaxed rasterization order modifies the concept of
rasterization order itself in Vulkan, so from the point of view of the
specification of fragment shader interlock, relaxed rasterization order should
still be applicable regardless of whether the shader requests ordered interlock.
PAL also doesn't make any POPS-specific exceptions here as of April 2023.

Variable-rate shading interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces
the shading rate to be 1x1, thus the
``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must
be false.

On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the
``fragmentShadingRateWithFragmentShaderInterlock`` property must be true.
However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set,
enabling POPS will force 1x1 shading rate.

The widest interlock granularity available on GFX11 — with the lowest possible
Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There's no
synchronization between coarse fragment shader invocations if they don't cover
common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device
feature is not available.

Additional configuration
^^^^^^^^^^^^^^^^^^^^^^^^

These are some largely unresearched options found in the register declarations.
PAL doesn't use them, so it's unknown if they make any significant difference.
No effect was found in `nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_NAVI31``.

* ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9–10.3.
* ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+.