1*5f39d1b3SJooyung Han# Kernels in gemmlowp 2*5f39d1b3SJooyung Han 3*5f39d1b3SJooyung Han## Kernels provide an inner-loop implementation, and a format 4*5f39d1b3SJooyung Han 5*5f39d1b3SJooyung HanHere we assume familiarity with the concepts of kernels and of packing as 6*5f39d1b3SJooyung Hanexplained in [design.md](design.md). 7*5f39d1b3SJooyung Han 8*5f39d1b3SJooyung Hangemmlowp is designed to be easily extensible to different architectures and 9*5f39d1b3SJooyung Hanother low-level details, while achieving high performance. Thus a line had to be 10*5f39d1b3SJooyung Handrawn between the generic GEMM code and the specific parts that need to be 11*5f39d1b3SJooyung Hanmanually designed for each architecture, etc. The design choice made in gemmlowp 12*5f39d1b3SJooyung Hanis to have easily swappable GEMM kernels. 13*5f39d1b3SJooyung Han 14*5f39d1b3SJooyung HanIn itself, a GEMM kernel is just an implementation of the inner-most loop in a 15*5f39d1b3SJooyung HanGEMM (That inner-most loop has to be over the 'depth' dimension so as to be able 16*5f39d1b3SJooyung Hanto accumulate into a small enough number of accumulators to fit in registers). 17*5f39d1b3SJooyung Han 18*5f39d1b3SJooyung HanThus, by itself, a GEMM kernel should be just a function computing a block of 19*5f39d1b3SJooyung HanGEMM. 20*5f39d1b3SJooyung Han 21*5f39d1b3SJooyung HanHowever, GEMM kernels may need to differ not just in how they implement this 22*5f39d1b3SJooyung Hancomputation, but also in the format of data that they operate on. Indeed, in 23*5f39d1b3SJooyung Hanorder to maximize the ratio of arithmetic instructions to memory access 24*5f39d1b3SJooyung Haninstructions, GEMM kernels want to handle blocks as wide as possible given the 25*5f39d1b3SJooyung Hannumber of registers of the CPU architecture. 26*5f39d1b3SJooyung Han 27*5f39d1b3SJooyung HanThus, in order to allow efficient specialization to diverse architectures, 28*5f39d1b3SJooyung Hangemmlowp allows each GEMM kernel to dictate the format of data that it expects, 29*5f39d1b3SJooyung Hanin addition to providing its inner-loop implementation. 30*5f39d1b3SJooyung Han 31*5f39d1b3SJooyung HanThe former is given by a 'Format' typedef, and the latter by a 'Run' method. 32*5f39d1b3SJooyung Han 33*5f39d1b3SJooyung HanA good example is to look at internal/kernel_neon.h, and specifically at the 34*5f39d1b3SJooyung HanNEONKernel12x4Depth2 kernel, which specifies its format as 35*5f39d1b3SJooyung Han 36*5f39d1b3SJooyung Han``` 37*5f39d1b3SJooyung Han typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>, 38*5f39d1b3SJooyung Han KernelSideFormat<CellFormat<4, 2>, 1> > Format; 39*5f39d1b3SJooyung Han``` 40*5f39d1b3SJooyung Han 41*5f39d1b3SJooyung HanThe meaning of these terms is explained in the lengthy comment at the top of 42*5f39d1b3SJooyung Haninternal/kernel.h. Here, they mean that this kernel handles at each iteration 43*5f39d1b3SJooyung Han(along the depth dimension): 44*5f39d1b3SJooyung Han 45*5f39d1b3SJooyung Han- 3 'cells' of size 4x2 each of the lhs, so a total lhs block of size 12x2 46*5f39d1b3SJooyung Han 47*5f39d1b3SJooyung Han- 1 'cell' of size 2x4 of the rhs. 48*5f39d1b3SJooyung Han 49*5f39d1b3SJooyung HanIn other words, this kernel handles 12 rows of the lhs and 4 columns of the 50*5f39d1b3SJooyung Hanrhs, and handles two levels of depth at once. The 'cells' and `CellFormat` 51*5f39d1b3SJooyung Handetail the layout of these 12x2 and 2x4 blocks. 52*5f39d1b3SJooyung Han 53*5f39d1b3SJooyung HanThis kernel then loads these 12x2 and 2x4 blocks and computes the corresponding 54*5f39d1b3SJooyung Han12x4 GEMM; for ease of reference let us paste the critical comment and code 55*5f39d1b3SJooyung Hanhere: 56*5f39d1b3SJooyung Han 57*5f39d1b3SJooyung Han``` 58*5f39d1b3SJooyung Han"loop_NEONKernel12x4Depth2_%=:\n" 59*5f39d1b3SJooyung Han 60*5f39d1b3SJooyung Han// Overview of register layout: 61*5f39d1b3SJooyung Han// 62*5f39d1b3SJooyung Han// A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0). 63*5f39d1b3SJooyung Han// A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7 64*5f39d1b3SJooyung Han// (q1--q3). 65*5f39d1b3SJooyung Han// A 12x4 block of accumulators is stored in 32bit in q4--q15. 66*5f39d1b3SJooyung Han// 67*5f39d1b3SJooyung Han// +-----+-----+-----+-----+ 68*5f39d1b3SJooyung Han// |d0[0]|d0[1]|d0[2]|d0[3]| 69*5f39d1b3SJooyung Han// Rhs +-----+-----+-----+-----+ 70*5f39d1b3SJooyung Han// |d1[0]|d1[1]|d1[2]|d1[3]| 71*5f39d1b3SJooyung Han// +-----+-----+-----+-----+ 72*5f39d1b3SJooyung Han// 73*5f39d1b3SJooyung Han// | | | | | 74*5f39d1b3SJooyung Han// 75*5f39d1b3SJooyung Han// Lhs | | | | | 76*5f39d1b3SJooyung Han// 77*5f39d1b3SJooyung Han// +--+--+ - - - - +-----+-----+-----+-----+ 78*5f39d1b3SJooyung Han// |d2|d3| | q4 | q5 | q6 | q7 | 79*5f39d1b3SJooyung Han// |d2|d3| | q4 | q5 | q6 | q7 | 80*5f39d1b3SJooyung Han// |d2|d3| | q4 | q5 | q6 | q7 | 81*5f39d1b3SJooyung Han// |d2|d3| | q4 | q5 | q6 | q7 | 82*5f39d1b3SJooyung Han// +--+--+ - - - - +-----+-----+-----+-----+ 83*5f39d1b3SJooyung Han// |d4|d5| | q8 | q9 | q10 | q11 | 84*5f39d1b3SJooyung Han// |d4|d5| | q8 | q9 | q10 | q11 | 85*5f39d1b3SJooyung Han// |d4|d5| | q8 | q9 | q10 | q11 | 86*5f39d1b3SJooyung Han// |d4|d5| | q8 | q9 | q10 | q11 | 87*5f39d1b3SJooyung Han// +--+--+ - - - - +-----+-----+-----+-----+ 88*5f39d1b3SJooyung Han// |d6|d7| | q12 | q13 | q14 | q15 | 89*5f39d1b3SJooyung Han// |d6|d7| | q12 | q13 | q14 | q15 | 90*5f39d1b3SJooyung Han// |d6|d7| | q12 | q13 | q14 | q15 | 91*5f39d1b3SJooyung Han// |d6|d7| | q12 | q13 | q14 | q15 | 92*5f39d1b3SJooyung Han// +--+--+ - - - - +-----+-----+-----+-----+ 93*5f39d1b3SJooyung Han// 94*5f39d1b3SJooyung Han// Accumulator 95*5f39d1b3SJooyung Han 96*5f39d1b3SJooyung Han// Load 1 Rhs cell of size 2x4 97*5f39d1b3SJooyung Han"vld1.8 {d0}, [%[rhs_ptr]:64]!\n" 98*5f39d1b3SJooyung Han 99*5f39d1b3SJooyung Han// Load 3 Lhs cells of size 4x2 each 100*5f39d1b3SJooyung Han"vld1.8 {d2}, [%[lhs_ptr]:64]!\n" 101*5f39d1b3SJooyung Han"vld1.8 {d4}, [%[lhs_ptr]:64]!\n" 102*5f39d1b3SJooyung Han"vld1.8 {d6}, [%[lhs_ptr]:64]!\n" 103*5f39d1b3SJooyung Han 104*5f39d1b3SJooyung Han// Expand Lhs/Rhs cells to 16 bit. 105*5f39d1b3SJooyung Han"vmovl.u8 q0, d0\n" 106*5f39d1b3SJooyung Han"vmovl.u8 q1, d2\n" 107*5f39d1b3SJooyung Han"vmovl.u8 q2, d4\n" 108*5f39d1b3SJooyung Han"vmovl.u8 q3, d6\n" 109*5f39d1b3SJooyung Han 110*5f39d1b3SJooyung Han// Multiply-accumulate, level of depth 0 111*5f39d1b3SJooyung Han"vmlal.u16 q4, d2, d0[0]\n" 112*5f39d1b3SJooyung Han"vmlal.u16 q5, d2, d0[1]\n" 113*5f39d1b3SJooyung Han"vmlal.u16 q6, d2, d0[2]\n" 114*5f39d1b3SJooyung Han"vmlal.u16 q7, d2, d0[3]\n" 115*5f39d1b3SJooyung Han"vmlal.u16 q8, d4, d0[0]\n" 116*5f39d1b3SJooyung Han"vmlal.u16 q9, d4, d0[1]\n" 117*5f39d1b3SJooyung Han"vmlal.u16 q10, d4, d0[2]\n" 118*5f39d1b3SJooyung Han"vmlal.u16 q11, d4, d0[3]\n" 119*5f39d1b3SJooyung Han"vmlal.u16 q12, d6, d0[0]\n" 120*5f39d1b3SJooyung Han"vmlal.u16 q13, d6, d0[1]\n" 121*5f39d1b3SJooyung Han"vmlal.u16 q14, d6, d0[2]\n" 122*5f39d1b3SJooyung Han"vmlal.u16 q15, d6, d0[3]\n" 123*5f39d1b3SJooyung Han 124*5f39d1b3SJooyung Han// Multiply-accumulate, level of depth 1 125*5f39d1b3SJooyung Han"vmlal.u16 q4, d3, d1[0]\n" 126*5f39d1b3SJooyung Han"vmlal.u16 q5, d3, d1[1]\n" 127*5f39d1b3SJooyung Han"vmlal.u16 q6, d3, d1[2]\n" 128*5f39d1b3SJooyung Han"vmlal.u16 q7, d3, d1[3]\n" 129*5f39d1b3SJooyung Han"vmlal.u16 q8, d5, d1[0]\n" 130*5f39d1b3SJooyung Han"vmlal.u16 q9, d5, d1[1]\n" 131*5f39d1b3SJooyung Han"vmlal.u16 q10, d5, d1[2]\n" 132*5f39d1b3SJooyung Han"vmlal.u16 q11, d5, d1[3]\n" 133*5f39d1b3SJooyung Han"vmlal.u16 q12, d7, d1[0]\n" 134*5f39d1b3SJooyung Han"vmlal.u16 q13, d7, d1[1]\n" 135*5f39d1b3SJooyung Han"vmlal.u16 q14, d7, d1[2]\n" 136*5f39d1b3SJooyung Han"vmlal.u16 q15, d7, d1[3]\n" 137*5f39d1b3SJooyung Han 138*5f39d1b3SJooyung Han// Loop. Decrement loop index (depth) by 2, since we just handled 2 139*5f39d1b3SJooyung Han// levels of depth (Kernel::kDepth=2). 140*5f39d1b3SJooyung Han"subs %[run_depth], #2\n" 141*5f39d1b3SJooyung Han"bne loop_NEONKernel12x4Depth2_%=\n" 142*5f39d1b3SJooyung Han``` 143*5f39d1b3SJooyung Han 144*5f39d1b3SJooyung Han## Packing code adapts to the format chosen by the kernel 145*5f39d1b3SJooyung Han 146*5f39d1b3SJooyung HanAs explained in [design.md](design.md), gemmlowp starts by packing blocks of the 147*5f39d1b3SJooyung Hanlhs and rhs matrices for optimally efficient traversal by the kernel. This 148*5f39d1b3SJooyung Handepends on fine details of the kernel format, in ways that can only be 149*5f39d1b3SJooyung Hanefficiently handled by knowing these kernel format details at compile-time. 150*5f39d1b3SJooyung Han 151*5f39d1b3SJooyung HanThis is the reason why all the code in [internal/pack.h](../internal/pack.h) is 152*5f39d1b3SJooyung Hantemplated in the corresponding kernel format. 153*5f39d1b3SJooyung Han 154*5f39d1b3SJooyung HanThe code in internal/pack.h isn't tightly optimized by itself, but it is 155*5f39d1b3SJooyung Hanstructured in such a way that the critical code is in a template, 156*5f39d1b3SJooyung Han`PackingRegisterBlock`, that can easily be specialized to override the slow 157*5f39d1b3SJooyung Hangeneric code with fast specific packing code for specific formats, on specific 158*5f39d1b3SJooyung Hanplatforms. 159*5f39d1b3SJooyung Han 160*5f39d1b3SJooyung HanSee [internal/pack_neon.h](../internal/pack_neon.h) which provides NEON 161*5f39d1b3SJooyung Hanspecializations of the packing code for the particular kernel formats that are 162*5f39d1b3SJooyung Hanused by the NEON kernels in [internal/kernel_neon.h](../internal/kernel_neon.h). 163*5f39d1b3SJooyung Han 164*5f39d1b3SJooyung Han## Wrapping up: how to optimize gemmlowp for a CPU architecture 165*5f39d1b3SJooyung Han 166*5f39d1b3SJooyung HanIn conclusion, the key feature of gemmlowp when it comes to efficiently 167*5f39d1b3SJooyung Hansupporting a specific CPU architecture, is that it allows to freely replace the 168*5f39d1b3SJooyung Haninner loop of the GEMM by providing one's own GEMM kernel, which is also free to 169*5f39d1b3SJooyung Handictate its required data layout; each data layout then also needs optimized 170*5f39d1b3SJooyung Hanpacking code. The steps are thus: 171*5f39d1b3SJooyung Han 172*5f39d1b3SJooyung Han1. Freely design a GEMM kernel with a freely chosen data layout. 173*5f39d1b3SJooyung Han2. Implement the GEMM kernel, similar to 174*5f39d1b3SJooyung Han [internal/kernel_neon.h](../internal/kernel_neon.h). 175*5f39d1b3SJooyung Han3. Implement the optimized packing code, similar to 176*5f39d1b3SJooyung Han [internal/pack_neon.h](../internal/pack_neon.h). 177