xref: /aosp_15_r20/external/gemmlowp/doc/kernel.md (revision 5f39d1b313f0528e11bae88b3029b54b9e1033e7)
1*5f39d1b3SJooyung Han# Kernels in gemmlowp
2*5f39d1b3SJooyung Han
3*5f39d1b3SJooyung Han## Kernels provide an inner-loop implementation, and a format
4*5f39d1b3SJooyung Han
5*5f39d1b3SJooyung HanHere we assume familiarity with the concepts of kernels and of packing as
6*5f39d1b3SJooyung Hanexplained in [design.md](design.md).
7*5f39d1b3SJooyung Han
8*5f39d1b3SJooyung Hangemmlowp is designed to be easily extensible to different architectures and
9*5f39d1b3SJooyung Hanother low-level details, while achieving high performance. Thus a line had to be
10*5f39d1b3SJooyung Handrawn between the generic GEMM code and the specific parts that need to be
11*5f39d1b3SJooyung Hanmanually designed for each architecture, etc. The design choice made in gemmlowp
12*5f39d1b3SJooyung Hanis to have easily swappable GEMM kernels.
13*5f39d1b3SJooyung Han
14*5f39d1b3SJooyung HanIn itself, a GEMM kernel is just an implementation of the inner-most loop in a
15*5f39d1b3SJooyung HanGEMM (That inner-most loop has to be over the 'depth' dimension so as to be able
16*5f39d1b3SJooyung Hanto accumulate into a small enough number of accumulators to fit in registers).
17*5f39d1b3SJooyung Han
18*5f39d1b3SJooyung HanThus, by itself, a GEMM kernel should be just a function computing a block of
19*5f39d1b3SJooyung HanGEMM.
20*5f39d1b3SJooyung Han
21*5f39d1b3SJooyung HanHowever, GEMM kernels may need to differ not just in how they implement this
22*5f39d1b3SJooyung Hancomputation, but also in the format of data that they operate on. Indeed, in
23*5f39d1b3SJooyung Hanorder to maximize the ratio of arithmetic instructions to memory access
24*5f39d1b3SJooyung Haninstructions, GEMM kernels want to handle blocks as wide as possible given the
25*5f39d1b3SJooyung Hannumber of registers of the CPU architecture.
26*5f39d1b3SJooyung Han
27*5f39d1b3SJooyung HanThus, in order to allow efficient specialization to diverse architectures,
28*5f39d1b3SJooyung Hangemmlowp allows each GEMM kernel to dictate the format of data that it expects,
29*5f39d1b3SJooyung Hanin addition to providing its inner-loop implementation.
30*5f39d1b3SJooyung Han
31*5f39d1b3SJooyung HanThe former is given by a 'Format' typedef, and the latter by a 'Run' method.
32*5f39d1b3SJooyung Han
33*5f39d1b3SJooyung HanA good example is to look at internal/kernel_neon.h, and specifically at the
34*5f39d1b3SJooyung HanNEONKernel12x4Depth2 kernel, which specifies its format as
35*5f39d1b3SJooyung Han
36*5f39d1b3SJooyung Han```
37*5f39d1b3SJooyung Han  typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
38*5f39d1b3SJooyung Han                       KernelSideFormat<CellFormat<4, 2>, 1> > Format;
39*5f39d1b3SJooyung Han```
40*5f39d1b3SJooyung Han
41*5f39d1b3SJooyung HanThe meaning of these terms is explained in the lengthy comment at the top of
42*5f39d1b3SJooyung Haninternal/kernel.h. Here, they mean that this kernel handles at each iteration
43*5f39d1b3SJooyung Han(along the depth dimension):
44*5f39d1b3SJooyung Han
45*5f39d1b3SJooyung Han- 3 'cells' of size 4x2 each of the lhs, so a total lhs block of size 12x2
46*5f39d1b3SJooyung Han
47*5f39d1b3SJooyung Han- 1 'cell' of size 2x4 of the rhs.
48*5f39d1b3SJooyung Han
49*5f39d1b3SJooyung HanIn other words, this kernel handles 12 rows of the lhs and 4 columns of the
50*5f39d1b3SJooyung Hanrhs, and handles two levels of depth at once. The 'cells' and `CellFormat`
51*5f39d1b3SJooyung Handetail the layout of these 12x2 and 2x4 blocks.
52*5f39d1b3SJooyung Han
53*5f39d1b3SJooyung HanThis kernel then loads these 12x2 and 2x4 blocks and computes the corresponding
54*5f39d1b3SJooyung Han12x4 GEMM; for ease of reference let us paste the critical comment and code
55*5f39d1b3SJooyung Hanhere:
56*5f39d1b3SJooyung Han
57*5f39d1b3SJooyung Han```
58*5f39d1b3SJooyung Han"loop_NEONKernel12x4Depth2_%=:\n"
59*5f39d1b3SJooyung Han
60*5f39d1b3SJooyung Han// Overview of register layout:
61*5f39d1b3SJooyung Han//
62*5f39d1b3SJooyung Han// A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0).
63*5f39d1b3SJooyung Han// A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7
64*5f39d1b3SJooyung Han// (q1--q3).
65*5f39d1b3SJooyung Han// A 12x4 block of accumulators is stored in 32bit in q4--q15.
66*5f39d1b3SJooyung Han//
67*5f39d1b3SJooyung Han//                   +-----+-----+-----+-----+
68*5f39d1b3SJooyung Han//                   |d0[0]|d0[1]|d0[2]|d0[3]|
69*5f39d1b3SJooyung Han//              Rhs  +-----+-----+-----+-----+
70*5f39d1b3SJooyung Han//                   |d1[0]|d1[1]|d1[2]|d1[3]|
71*5f39d1b3SJooyung Han//                   +-----+-----+-----+-----+
72*5f39d1b3SJooyung Han//
73*5f39d1b3SJooyung Han//                   |     |     |     |     |
74*5f39d1b3SJooyung Han//
75*5f39d1b3SJooyung Han//    Lhs            |     |     |     |     |
76*5f39d1b3SJooyung Han//
77*5f39d1b3SJooyung Han//  +--+--+ - - - -  +-----+-----+-----+-----+
78*5f39d1b3SJooyung Han//  |d2|d3|          | q4  | q5  | q6  | q7  |
79*5f39d1b3SJooyung Han//  |d2|d3|          | q4  | q5  | q6  | q7  |
80*5f39d1b3SJooyung Han//  |d2|d3|          | q4  | q5  | q6  | q7  |
81*5f39d1b3SJooyung Han//  |d2|d3|          | q4  | q5  | q6  | q7  |
82*5f39d1b3SJooyung Han//  +--+--+ - - - -  +-----+-----+-----+-----+
83*5f39d1b3SJooyung Han//  |d4|d5|          | q8  | q9  | q10 | q11 |
84*5f39d1b3SJooyung Han//  |d4|d5|          | q8  | q9  | q10 | q11 |
85*5f39d1b3SJooyung Han//  |d4|d5|          | q8  | q9  | q10 | q11 |
86*5f39d1b3SJooyung Han//  |d4|d5|          | q8  | q9  | q10 | q11 |
87*5f39d1b3SJooyung Han//  +--+--+ - - - -  +-----+-----+-----+-----+
88*5f39d1b3SJooyung Han//  |d6|d7|          | q12 | q13 | q14 | q15 |
89*5f39d1b3SJooyung Han//  |d6|d7|          | q12 | q13 | q14 | q15 |
90*5f39d1b3SJooyung Han//  |d6|d7|          | q12 | q13 | q14 | q15 |
91*5f39d1b3SJooyung Han//  |d6|d7|          | q12 | q13 | q14 | q15 |
92*5f39d1b3SJooyung Han//  +--+--+ - - - -  +-----+-----+-----+-----+
93*5f39d1b3SJooyung Han//
94*5f39d1b3SJooyung Han//                            Accumulator
95*5f39d1b3SJooyung Han
96*5f39d1b3SJooyung Han// Load 1 Rhs cell of size 2x4
97*5f39d1b3SJooyung Han"vld1.8 {d0}, [%[rhs_ptr]:64]!\n"
98*5f39d1b3SJooyung Han
99*5f39d1b3SJooyung Han// Load 3 Lhs cells of size 4x2 each
100*5f39d1b3SJooyung Han"vld1.8 {d2}, [%[lhs_ptr]:64]!\n"
101*5f39d1b3SJooyung Han"vld1.8 {d4}, [%[lhs_ptr]:64]!\n"
102*5f39d1b3SJooyung Han"vld1.8 {d6}, [%[lhs_ptr]:64]!\n"
103*5f39d1b3SJooyung Han
104*5f39d1b3SJooyung Han// Expand Lhs/Rhs cells to 16 bit.
105*5f39d1b3SJooyung Han"vmovl.u8 q0, d0\n"
106*5f39d1b3SJooyung Han"vmovl.u8 q1, d2\n"
107*5f39d1b3SJooyung Han"vmovl.u8 q2, d4\n"
108*5f39d1b3SJooyung Han"vmovl.u8 q3, d6\n"
109*5f39d1b3SJooyung Han
110*5f39d1b3SJooyung Han// Multiply-accumulate, level of depth 0
111*5f39d1b3SJooyung Han"vmlal.u16 q4, d2, d0[0]\n"
112*5f39d1b3SJooyung Han"vmlal.u16 q5, d2, d0[1]\n"
113*5f39d1b3SJooyung Han"vmlal.u16 q6, d2, d0[2]\n"
114*5f39d1b3SJooyung Han"vmlal.u16 q7, d2, d0[3]\n"
115*5f39d1b3SJooyung Han"vmlal.u16 q8, d4, d0[0]\n"
116*5f39d1b3SJooyung Han"vmlal.u16 q9, d4, d0[1]\n"
117*5f39d1b3SJooyung Han"vmlal.u16 q10, d4, d0[2]\n"
118*5f39d1b3SJooyung Han"vmlal.u16 q11, d4, d0[3]\n"
119*5f39d1b3SJooyung Han"vmlal.u16 q12, d6, d0[0]\n"
120*5f39d1b3SJooyung Han"vmlal.u16 q13, d6, d0[1]\n"
121*5f39d1b3SJooyung Han"vmlal.u16 q14, d6, d0[2]\n"
122*5f39d1b3SJooyung Han"vmlal.u16 q15, d6, d0[3]\n"
123*5f39d1b3SJooyung Han
124*5f39d1b3SJooyung Han// Multiply-accumulate, level of depth 1
125*5f39d1b3SJooyung Han"vmlal.u16 q4, d3, d1[0]\n"
126*5f39d1b3SJooyung Han"vmlal.u16 q5, d3, d1[1]\n"
127*5f39d1b3SJooyung Han"vmlal.u16 q6, d3, d1[2]\n"
128*5f39d1b3SJooyung Han"vmlal.u16 q7, d3, d1[3]\n"
129*5f39d1b3SJooyung Han"vmlal.u16 q8, d5, d1[0]\n"
130*5f39d1b3SJooyung Han"vmlal.u16 q9, d5, d1[1]\n"
131*5f39d1b3SJooyung Han"vmlal.u16 q10, d5, d1[2]\n"
132*5f39d1b3SJooyung Han"vmlal.u16 q11, d5, d1[3]\n"
133*5f39d1b3SJooyung Han"vmlal.u16 q12, d7, d1[0]\n"
134*5f39d1b3SJooyung Han"vmlal.u16 q13, d7, d1[1]\n"
135*5f39d1b3SJooyung Han"vmlal.u16 q14, d7, d1[2]\n"
136*5f39d1b3SJooyung Han"vmlal.u16 q15, d7, d1[3]\n"
137*5f39d1b3SJooyung Han
138*5f39d1b3SJooyung Han// Loop. Decrement loop index (depth) by 2, since we just handled 2
139*5f39d1b3SJooyung Han// levels of depth (Kernel::kDepth=2).
140*5f39d1b3SJooyung Han"subs %[run_depth], #2\n"
141*5f39d1b3SJooyung Han"bne loop_NEONKernel12x4Depth2_%=\n"
142*5f39d1b3SJooyung Han```
143*5f39d1b3SJooyung Han
144*5f39d1b3SJooyung Han## Packing code adapts to the format chosen by the kernel
145*5f39d1b3SJooyung Han
146*5f39d1b3SJooyung HanAs explained in [design.md](design.md), gemmlowp starts by packing blocks of the
147*5f39d1b3SJooyung Hanlhs and rhs matrices for optimally efficient traversal by the kernel. This
148*5f39d1b3SJooyung Handepends on fine details of the kernel format, in ways that can only be
149*5f39d1b3SJooyung Hanefficiently handled by knowing these kernel format details at compile-time.
150*5f39d1b3SJooyung Han
151*5f39d1b3SJooyung HanThis is the reason why all the code in [internal/pack.h](../internal/pack.h) is
152*5f39d1b3SJooyung Hantemplated in the corresponding kernel format.
153*5f39d1b3SJooyung Han
154*5f39d1b3SJooyung HanThe code in internal/pack.h isn't tightly optimized by itself, but it is
155*5f39d1b3SJooyung Hanstructured in such a way that the critical code is in a template,
156*5f39d1b3SJooyung Han`PackingRegisterBlock`, that can easily be specialized to override the slow
157*5f39d1b3SJooyung Hangeneric code with fast specific packing code for specific formats, on specific
158*5f39d1b3SJooyung Hanplatforms.
159*5f39d1b3SJooyung Han
160*5f39d1b3SJooyung HanSee [internal/pack_neon.h](../internal/pack_neon.h) which provides NEON
161*5f39d1b3SJooyung Hanspecializations of the packing code for the particular kernel formats that are
162*5f39d1b3SJooyung Hanused by the NEON kernels in [internal/kernel_neon.h](../internal/kernel_neon.h).
163*5f39d1b3SJooyung Han
164*5f39d1b3SJooyung Han## Wrapping up: how to optimize gemmlowp for a CPU architecture
165*5f39d1b3SJooyung Han
166*5f39d1b3SJooyung HanIn conclusion, the key feature of gemmlowp when it comes to efficiently
167*5f39d1b3SJooyung Hansupporting a specific CPU architecture, is that it allows to freely replace the
168*5f39d1b3SJooyung Haninner loop of the GEMM by providing one's own GEMM kernel, which is also free to
169*5f39d1b3SJooyung Handictate its required data layout; each data layout then also needs optimized
170*5f39d1b3SJooyung Hanpacking code. The steps are thus:
171*5f39d1b3SJooyung Han
172*5f39d1b3SJooyung Han1.  Freely design a GEMM kernel with a freely chosen data layout.
173*5f39d1b3SJooyung Han2.  Implement the GEMM kernel, similar to
174*5f39d1b3SJooyung Han    [internal/kernel_neon.h](../internal/kernel_neon.h).
175*5f39d1b3SJooyung Han3.  Implement the optimized packing code, similar to
176*5f39d1b3SJooyung Han    [internal/pack_neon.h](../internal/pack_neon.h).
177