Asahi
=====

The Asahi driver aims to provide an OpenGL implementation for the Apple M1.

Wrap (macOS only)
-----------------

Mesa includes a library that wraps the key IOKit entrypoints used in the macOS
UABI for AGX. The wrapped routines print information about the kernel calls made
and dump work submitted to the GPU using agxdecode. This facilitates
reverse-engineering the hardware, acting as glue to get at the "interesting" GPU
memory.

The library is only built if ``-Dtools=asahi`` is passed. It builds a single
``libwrap.dylib`` file, which should be inserted into a process with the
``DYLD_INSERT_LIBRARIES`` environment variable.

For example, to trace an app ``./app``, run:

    DYLD_INSERT_LIBRARIES=~/mesa/build/src/asahi/lib/libwrap.dylib ./app

Hardware varyings
-----------------

At an API level, vertex shader outputs need to be interpolated to become
fragment shader inputs. This process is logically pipelined in AGX, with a value
traveling from a vertex shader to remapping hardware to coefficient register
setup to the fragment shader to the iterator hardware. Each stage is described
below.

Vertex shader
`````````````

A vertex shader (running on the :term:`Unified Shader Cores`) outputs varyings with the
``st_var`` instruction. ``st_var`` takes a *vertex output index* and a 32-bit
value. The maximum number of *vertex outputs* is specified as the "output count"
of the shader in the "Bind Vertex Pipeline" packet. The value may be interpreted
as a single 32-bit value or an aligned 16-bit register pair, depending
on whether interpolation should happen at 32-bit or 16-bit.
Vertex outputs are
indexed starting from 0, with the *vertex position* always coming first, the
32-bit user varyings coming next with perspective, flat, and linear interpolated
varyings grouped in that order, then 16-bit user varyings with the same groupings,
and finally *point size* and *clip distances* at the end if present. Note that
*clip distances* are not accessible from the fragment shader; if the fragment
shader needs to read the interpolated clip distance, the vertex shader must
*also* write the clip distance values to a user varying for the fragment shader
to interpolate. Also note there is no clip plane enable mask anywhere; that must
be lowered for APIs that require it (OpenGL but not Vulkan).

.. list-table:: Ordering of vertex outputs with all outputs used
   :widths: 25 75
   :header-rows: 1

   * - Size (words)
     - Value
   * - 4
     - Vertex position
   * - 1
     - 32-bit smooth varying 0
   * -
     - ...
   * - 1
     - 32-bit smooth varying m
   * - 1
     - 32-bit flat varying 0
   * -
     - ...
   * - 1
     - 32-bit flat varying n
   * - 1
     - 32-bit linear varying 0
   * -
     - ...
   * - 1
     - 32-bit linear varying o
   * - 1
     - Packed pair of 16-bit smooth varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit smooth varyings p
   * - 1
     - Packed pair of 16-bit flat varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit flat varyings q
   * - 1
     - Packed pair of 16-bit linear varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit linear varyings r
   * - 1
     - Point size
   * - 1
     - Clip distance for plane 0
   * -
     - ...
   * - 1
     - Clip distance for plane 15

Remapping
`````````

Vertex outputs are remapped to varying slots to be interpolated.
The output of remapping consists of the following items: the *W* fragment
coordinate, the *Z* fragment coordinate, and the user varyings in the vertex
output order. *Z* may be omitted, but *W* may not be.
This remapping is
configured by the "Output select" word.

.. list-table:: Ordering of remapped slots
   :widths: 25 75
   :header-rows: 1

   * - Index
     - Value
   * - 0
     - Fragment coord W
   * - 1
     - Fragment coord Z
   * - 2
     - 32-bit varying 0
   * -
     - ...
   * - 2 + m
     - 32-bit varying m
   * - 2 + m + 1
     - Packed pair of 16-bit varyings 0
   * -
     - ...
   * - 2 + m + n + 1
     - Packed pair of 16-bit varyings n

Coefficient registers
`````````````````````

The fragment shader does not see the physical slots.
Instead, it references varyings through *coefficient registers*. A coefficient
register is a register that is constant for all fragment shader invocations in
a given polygon. Physically, it contains the values output by the vertex shader
for each vertex of the polygon. Coefficient registers are preloaded with values
from varying slots. This preloading appears to occur in fixed-function hardware,
a simplification from PowerVR, which requires a specialized program for the
programmable data sequencer to do the preload.

The "Bind fragment pipeline" packet points to coefficient register bindings,
preceded by a header. The header contains the number of 32-bit varying slots. As
the *W* slot is always present, this field is always nonzero. Slots whose index
is below this count are treated as 32-bit. The remaining slots are treated as
16-bit.

The header also contains the total number of coefficient registers bound.

Each binding that follows maps a (vector of) varying slots to (consecutive)
coefficient registers. Some details about the varying (perspective
interpolation, flat shading, point sprites) are configured here.

Coefficient registers may be ordered the same as the internal varying slots.
However, this may be inconvenient for some APIs that require a separable shader
model.
For these APIs, the flexibility to mix-and-match slots and coefficient
registers allows mixing shaders without shader variants. In that case, the
bindings should be generated outside of the compiler. For simple APIs where the
bindings are fixed and known at compile-time, the bindings could be generated
within the compiler.

Fragment shader
```````````````

In the fragment shader, coefficient registers, identified by the prefix ``cf``
followed by a decimal index, act as opaque handles to varyings. For flat
shading, coefficient registers may be loaded into general registers with the
``ldcf`` instruction. For smooth shading, the coefficient register corresponding
to the desired varying is passed as an argument to the "iterate" instruction
``iter`` in order to "iterate" (interpolate) a varying. As perspective correct
interpolation also requires the W component of the fragment coordinate, the
coefficient register for W is passed as a second argument. As an example, if
there's a single varying to interpolate, an instruction like ``iter r0, cf1, cf0``
is used.

Iterator
````````

To actually interpolate varyings, AGX provides fixed-function iteration hardware
to multiply the specified coefficient registers with the required barycentrics,
producing an interpolated value, hence the name "coefficient register". This
operation is purely mathematical and does not require any memory access, as
the required coefficients are preloaded before the shader begins execution.
That means the iterate instruction executes in constant time, does not signal
a data fence, and does not require the shader to wait on a data fence before
using the value.

Image layouts
-------------

AGX supports several image layouts, described here. To work with image layouts
in the drivers, use the ail library, located in ``src/asahi/layout``.

The simplest layout is **strided linear**.
Pixels are stored in raster-order in
memory with a software-controlled stride. Strided linear images are useful for
working with modifier-unaware window systems; however, performance will suffer.
Strided linear images have numerous limitations:

- Strides must be a multiple of 16 bytes.
- Strides must be nonzero. For 1D images where the stride is logically
  irrelevant, ail will internally select the minimal stride.
- Only 1D, 2D, and 2D Array images may be linear. In particular, 3D images and
  cube maps may not be linear.
- 2D images must not be mipmapped.
- Block-compressed formats and multisampled images are unsupported. Elements of
  a strided linear image are simply pixels.

With these limitations, addressing into a strided linear image is as simple as

.. math::

   \text{address} = (y \cdot \text{stride}) + (x \cdot \text{bytes per pixel})

In practice, this suffices for window system integration and little else.

The most common uncompressed layout is **twiddled**. The image is divided into
power-of-two sized tiles. The tiles themselves are stored in raster-order.
Within each tile, elements (pixels/blocks) are stored in Morton (Z) order.

The tile size used depends on both the image size and the block size of the
image format. For large images, :math:`n \times n` or :math:`2n \times n` tiles
are used (:math:`n` power-of-two). :math:`n` is such that each page contains
exactly one tile. Only power-of-two block sizes are supported in hardware,
ensuring such a tile size always exists. The hardware uses 16 KiB pages, so tile
sizes are as follows:

.. list-table:: Tile sizes for large images
   :widths: 50 50
   :header-rows: 1

   * - Bytes per block
     - Tile size
   * - 1
     - 128 x 128
   * - 2
     - 128 x 64
   * - 4
     - 64 x 64
   * - 8
     - 64 x 32
   * - 16
     - 32 x 32

The dimensions of large images are rounded up to be multiples of the tile size.
In addition, non-power-of-two large images have extra padding tiles when
mipmapping is used, see below.

That rounding would waste a great deal of memory for small images. If
an image is smaller than this tile size, a smaller tile size is used to reduce
the memory footprint. For small images, the tile size is :math:`m \times m`
where

.. math::

   m = 2^{\lceil \log_2( \min \{ \text{width}, \text{ height} \}) \rceil}

In other words, small images use the smallest square power-of-two tile such that
the image's minor axis fits in one tile.

For mipmapped images, tile sizes are determined independently for each level.
Typically, the first levels of an image are "large" and the remaining levels are
"small". This scheme reduces the memory footprint of mipmapping, compared to a
fixed tile size for the whole image. Each mip level is padded to fill at least
one cache line (128 bytes), ensuring no cache line contains multiple mip levels.

There is a wrinkle: the dimensions of large mip levels in tiles are determined
by the dimensions of level 0. For power-of-two images, the two calculations are
equivalent. However, they differ subtly for non-power-of-two images. To
determine the number of tiles to allocate for level :math:`l`, the number of
tiles for level 0 should be right-shifted by :math:`2l`. That appears to divide
by :math:`2^l` in both width and height, matching the definition of mipmapping;
however, it rounds down incorrectly. To compensate, the level contains one extra
row, column, or both (with the corner) as required if any of the first :math:`l`
levels were rounded down. This hurts the memory footprint. However, it means
non-power-of-two integer multiplication is only required for level 0.
Calculating the sizes for subsequent levels requires only addition and bitwise
math. That simplifies the hardware (but complicates software).
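
The tile-size rules for twiddled images can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the ail
implementation: the names ``large_tile_size`` and ``small_tile_size`` are
hypothetical, tile sizes are assumed to be listed as width x height in
elements, and the 16 KiB page size is taken from the text.

.. code-block:: python

   PAGE_SIZE = 16 * 1024  # bytes; the hardware page size stated in the text


   def large_tile_size(bytes_per_block):
       """Return (width, height) of the tile that fills exactly one page.

       Hypothetical helper. Assumes bytes_per_block is a power of two and
       that the wider axis of a 2n x n tile is the width.
       """
       elements = PAGE_SIZE // bytes_per_block
       log2_elements = elements.bit_length() - 1
       height_log2 = log2_elements // 2
       width_log2 = log2_elements - height_log2  # one larger when odd
       return (1 << width_log2, 1 << height_log2)


   def small_tile_size(width, height):
       """Return m for the m x m tile of a small image.

       Hypothetical helper implementing m = 2^ceil(log2(min(width, height)))
       with integer math; width and height must be at least 1.
       """
       minor = min(width, height)
       return 1 << (minor - 1).bit_length()

For instance, ``large_tile_size(2)`` gives ``(128, 64)``, matching the table
entry for 2 bytes per block, and ``small_tile_size(5, 100)`` gives ``8``, the
smallest power-of-two tile that covers a minor axis of 5 elements.
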

A 2D image consists of a full miptree (constructed as above) rounded up to the
page size (16 KiB).

3D images consist simply of an array of 2D layers (constructed as above). That
means cube maps, 2D arrays, cube map arrays, and 3D images all use the same
layout. The only difference is the number of layers. Notably, 3D images (like
``GL_TEXTURE_3D``) reserve space even for mip levels that do not exist
logically. These extra levels pad out layers of 3D images to the size of the
first layer, simplifying layout calculations for both software and hardware.
Although the padding is logically unnecessary, it wastes little space compared
to the sizes of large mipmapped 3D textures.

drm-shim (Linux only)
---------------------

Mesa includes a library that mocks out the DRM UABI used by the Asahi driver
stack, allowing the Mesa driver to run on non-M1 Linux hardware. This can be
useful for exercising the compiler. To build, use the options:

::

   -Dgallium-drivers=asahi -Dtools=drm-shim

Then run an OpenGL workload with the environment variable:

.. code-block:: sh

   LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so

For example, to compile a shader with shaderdb and print some statistics along
with the IR:

.. code-block:: sh

   ~/shader-db$ AGX_MESA_DEBUG=shaders,shaderdb ASAHI_MESA_DEBUG=precompile LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so ./run shaders/glmark/1-12.shader_test

The drm-shim implementation for Asahi is located in ``src/asahi/drm-shim``. The
drm-shim implementation there should be updated as new UABI is added.

Hardware glossary
-----------------

AGX is a tiled renderer descended from the PowerVR architecture. Some hardware
concepts used in PowerVR GPUs appear in AGX.

.. glossary:: :sorted:

   VDM
   Vertex Data Master
      Dispatches vertex shaders.

   PDM
   Pixel Data Master
      Dispatches pixel shaders.

   CDM
   Compute Data Master
      Dispatches compute kernels.

   USC
   Unified Shader Cores
      A unified shader core is a small CPU that runs shader code. The core is
      unified because a single ISA is used for vertex, pixel and compute
      shaders. This differs from older GPUs, where the vertex, fragment and
      compute shader stages have separate ISAs.

   PPP
   Primitive Processing Pipeline
      The Primitive Processing Pipeline is a hardware unit that does primitive
      assembly. The PPP is between the :term:`VDM` and :term:`ISP`.

   ISP
   Image Synthesis Processor
      The Image Synthesis Processor is responsible for the rasterization stage
      of the rendering pipeline.

   PBE
   Pixel BackEnd
      Hardware unit which writes to color attachments and images. Also the
      name for a descriptor passed to :term:`PBE` instructions.

   UVS
   Unified Vertex Store
      Hardware unit which buffers the outputs of the vertex shader (varyings).