==========================
Auto-Vectorization in LLVM
==========================

.. contents::
   :local:

LLVM has two vectorizers: the :ref:`Loop Vectorizer <loop-vectorizer>`,
which operates on loops, and the :ref:`SLP Vectorizer
<slp-vectorizer>`. These vectorizers
focus on different optimization opportunities and use different techniques.
The SLP Vectorizer merges multiple scalars that are found in the code into
vectors, while the Loop Vectorizer widens instructions in loops
to operate on multiple consecutive iterations.

Both the Loop Vectorizer and the SLP Vectorizer are enabled by default.

.. _loop-vectorizer:

The Loop Vectorizer
===================

Usage
-----

The Loop Vectorizer is enabled by default, but it can be disabled
through clang using the command line flag:

.. code-block:: console

   $ clang ... -fno-vectorize  file.c

Command line flags
^^^^^^^^^^^^^^^^^^

The Loop Vectorizer uses a cost model to decide on the optimal vectorization factor
and unroll factor. However, users of the vectorizer can force the vectorizer to use
specific values. Both ``clang`` and ``opt`` support the flags below.

Users can control the vectorization SIMD width using the command line flag ``-force-vector-width``.

.. code-block:: console

  $ clang  -mllvm -force-vector-width=8 ...
  $ opt -loop-vectorize -force-vector-width=8 ...

Users can control the unroll factor using the command line flag ``-force-vector-unroll``.

.. code-block:: console

  $ clang  -mllvm -force-vector-unroll=2 ...
  $ opt -loop-vectorize -force-vector-unroll=2 ...

Pragma loop hint directives
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``#pragma clang loop`` directive allows loop vectorization hints to be
specified for the subsequent for, while, do-while, or C++11 range-based for
loop. The directive allows vectorization and interleaving to be enabled or
disabled. Vector width as well as interleave count can also be manually
specified. The following example explicitly enables vectorization and
interleaving:

.. code-block:: c++

  #pragma clang loop vectorize(enable) interleave(enable)
  while(...) {
    ...
  }

The following example implicitly enables vectorization and interleaving by
specifying a vector width and interleaving count:

.. code-block:: c++

  #pragma clang loop vectorize_width(2) interleave_count(2)
  for(...) {
    ...
  }

See the Clang
`language extensions
<http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations>`_
for details.
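
As a concrete, compilable illustration of the second form, the sketch below
attaches width and interleave hints to a simple element-wise loop. The function
name ``scale_add`` and its parameters are hypothetical and chosen only for this
example; compilers other than Clang will typically ignore the pragma, possibly
with a warning.

.. code-block:: c++

  // Hypothetical example: y[i] = a * x[i] + y[i] with explicit loop hints.
  void scale_add(float *y, const float *x, float a, int n) {
    // Request a vector width of 4 and an interleave count of 2 for this loop.
    #pragma clang loop vectorize_width(4) interleave_count(2)
    for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
  }

The hints are requests to the optimizer and do not change what the loop
computes.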

Diagnostics
-----------

Many loops cannot be vectorized, including loops with complicated control flow,
unvectorizable types, and unvectorizable calls. The Loop Vectorizer generates
optimization remarks which can be queried using command line options to identify
and diagnose loops that are skipped by the Loop Vectorizer.

Optimization remarks are enabled using:

``-Rpass=loop-vectorize`` identifies loops that were successfully vectorized.

``-Rpass-missed=loop-vectorize`` identifies loops that failed vectorization and
indicates if vectorization was specified.

``-Rpass-analysis=loop-vectorize`` identifies the statements that caused
vectorization to fail.

Consider the following loop:

.. code-block:: c++

  #pragma clang loop vectorize(enable)
  for (int i = 0; i < Length; i++) {
    switch(A[i]) {
    case 0: A[i] = i*2; break;
    case 1: A[i] = i;   break;
    default: A[i] = 0;
    }
  }

The command line ``-Rpass-missed=loop-vectorize`` prints the remark:

.. code-block:: console

  no_switch.cpp:4:5: remark: loop not vectorized: vectorization is explicitly enabled [-Rpass-missed=loop-vectorize]

And the command line ``-Rpass-analysis=loop-vectorize`` indicates that the
switch statement cannot be vectorized.

.. code-block:: console

  no_switch.cpp:4:5: remark: loop not vectorized: loop contains a switch statement [-Rpass-analysis=loop-vectorize]
    switch(A[i]) {
    ^

To ensure line and column numbers are produced, include the command line options
``-gline-tables-only`` and ``-gcolumn-info``. See the Clang `user manual
<http://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
for details.

Features
--------

The LLVM Loop Vectorizer has a number of features that allow it to vectorize
complex loops.

Loops with unknown trip count
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Loop Vectorizer supports loops with an unknown trip count.
In the loop below, the iteration ``start`` and ``end`` points are unknown,
and the Loop Vectorizer has a mechanism to vectorize loops that do not start
at zero. In this example, the trip count ``end - start`` may not be a multiple
of the vector width, and the vectorizer has to execute the last few iterations
as scalar code. Keeping a scalar copy of the loop increases the code size.

.. code-block:: c++

  void bar(float *A, float* B, float K, int start, int end) {
    for (int i = start; i < end; ++i)
      A[i] *= B[i] + K;
  }
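
Conceptually, the vectorized code for such a loop has the shape sketched below:
a main loop that handles a fixed number of iterations per step, followed by a
scalar remainder loop for the leftover iterations. This is a hand-written
illustration, not the code LLVM actually emits; the name ``bar_blocked`` and
the width ``VF = 4`` are assumptions made for the example.

.. code-block:: c++

  // Hand-written illustration of the loop structure; VF is an assumed width.
  void bar_blocked(float *A, float *B, float K, int start, int end) {
    const int VF = 4;
    int i = start;
    // Main loop: each step covers VF consecutive iterations, which a
    // vectorizer could execute as a single vector operation.
    for (; i + VF <= end; i += VF)
      for (int lane = 0; lane < VF; ++lane)
        A[i + lane] *= B[i + lane] + K;
    // Scalar remainder: the last (end - start) % VF iterations.
    for (; i < end; ++i)
      A[i] *= B[i] + K;
  }

The remainder loop is the scalar copy mentioned above, which is why handling
unknown trip counts costs extra code size.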

Runtime Checks of Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^

In the example below, if the pointers A and B point to consecutive addresses,
then it is illegal to vectorize the code because some elements of A will be
written before they are read from array B.

Some programmers use the 'restrict' keyword to notify the compiler that the
pointers are disjoint, but in our example, the Loop Vectorizer has no way of
knowing that the pointers A and B do not alias. The Loop Vectorizer handles this
loop by placing code that checks, at runtime, if the arrays A and B point to
disjoint memory locations. If arrays A and B overlap, then the scalar version
of the loop is executed.

.. code-block:: c++

  void bar(float *A, float* B, float K, int n) {
    for (int i = 0; i < n; ++i)
      A[i] *= B[i] + K;
  }
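
The runtime check can be pictured as follows. This is a simplified sketch, not
LLVM's generated code: the helper ``ranges_overlap`` and the function
``bar_checked`` are hypothetical names, and the two branches contain the same
loop only to mark where the vector and scalar versions would go.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical helper: true if [A, A+n) and [B, B+n) share any element.
  static bool ranges_overlap(const float *A, const float *B, int n) {
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(A);
    std::uintptr_t b = reinterpret_cast<std::uintptr_t>(B);
    std::uintptr_t bytes = static_cast<std::uintptr_t>(n) * sizeof(float);
    return a < b + bytes && b < a + bytes;
  }

  void bar_checked(float *A, float *B, float K, int n) {
    if (n <= 0) return;
    if (!ranges_overlap(A, B, n)) {
      // No overlap: this copy of the loop is safe to vectorize.
      for (int i = 0; i < n; ++i)
        A[i] *= B[i] + K;
    } else {
      // Overlap detected: fall back to the original scalar order.
      for (int i = 0; i < n; ++i)
        A[i] *= B[i] + K;
    }
  }

The check runs once before the loop, so its cost is small compared with the
loop body for all but very short trip counts.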

Reductions
^^^^^^^^^^

In this example the ``sum`` variable is used by consecutive iterations of
the loop. Normally, this would prevent vectorization, but the vectorizer can
detect that ``sum`` is a reduction variable. The variable ``sum`` becomes a vector
of integers, and at the end of the loop the elements of the vector are added
together to create the correct result. We support a number of different
reduction operations, such as addition, multiplication, XOR, AND and OR.

.. code-block:: c++

  int foo(int *A, int *B, int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; ++i)
      sum += A[i] + 5;
    return sum;
  }

We support floating point reduction operations when ``-ffast-math`` is used.
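
The integer reduction above can be sketched by hand as follows. This is only an
illustration of the idea, with an assumed vector width of 4 and a hypothetical
function name ``foo_reduced``; a plain array stands in for the vector register.

.. code-block:: c++

  // Illustration only: four partial sums stand in for one vector of integers.
  int foo_reduced(int *A, int n) {
    const int VF = 4;                  // assumed vector width
    unsigned partial[VF] = {0, 0, 0, 0};
    int i = 0;
    // Each lane accumulates every VF-th element independently.
    for (; i + VF <= n; i += VF)
      for (int lane = 0; lane < VF; ++lane)
        partial[lane] += A[i + lane] + 5;
    // After the loop, the lanes are added together.
    unsigned sum = partial[0] + partial[1] + partial[2] + partial[3];
    // Scalar remainder for the leftover iterations.
    for (; i < n; ++i)
      sum += A[i] + 5;
    return sum;
  }

Reordering the additions this way is exact for integers. Floating point
addition is not associative, so the same reordering can change the result,
which is why floating point reductions require ``-ffast-math``.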

Inductions
^^^^^^^^^^

In this example the value of the induction variable ``i`` is saved into an
array. The Loop Vectorizer knows to vectorize induction variables.

.. code-block:: c++

  void bar(float *A, float* B, float K, int n) {
    for (int i = 0; i < n; ++i)
      A[i] = i;
  }
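
A vectorized induction variable can be thought of as a vector holding one value
of ``i`` per lane, stepped by the vector width on every iteration. The sketch
below illustrates this by hand; the name ``bar_induction`` and the width
``VF = 4`` are assumptions for the example, and the array ``iv`` stands in for
a vector register.

.. code-block:: c++

  // Illustration only: the induction variable becomes a vector of lane values.
  void bar_induction(float *A, int n) {
    const int VF = 4;                  // assumed vector width
    int iv[VF] = {0, 1, 2, 3};         // vector form of i: {i, i+1, i+2, i+3}
    int i = 0;
    for (; i + VF <= n; i += VF) {
      for (int lane = 0; lane < VF; ++lane) {
        A[i + lane] = iv[lane];        // store the whole vector of indices
        iv[lane] += VF;                // step every lane by the width
      }
    }
    for (; i < n; ++i)                 // scalar remainder
      A[i] = i;
  }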

If Conversion
^^^^^^^^^^^^^

The Loop Vectorizer is able to "flatten" the IF statement in the code and
generate a single stream of instructions. The Loop Vectorizer supports any
control flow in the innermost loop. The innermost loop may contain complex
nesting of IFs, ELSEs and even GOTOs.

.. code-block:: c++

  int foo(int *A, int *B, int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; ++i)
      if (A[i] > B[i])
        sum += A[i] + 5;
    return sum;
  }
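
The flattened form of the loop above can be sketched by hand as shown below:
the condition is computed as a value and used to select what gets added, so
every iteration runs the same straight-line code. The name ``foo_flattened``
is hypothetical, and this is an illustration of the idea rather than the
transformation LLVM performs internally.

.. code-block:: c++

  // Illustration only: the branch becomes a per-iteration select.
  int foo_flattened(int *A, int *B, int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; ++i) {
      bool mask = A[i] > B[i];                // the condition becomes a mask
      unsigned addend = mask ? A[i] + 5 : 0;  // select between value and zero
      sum += addend;                          // executed on every iteration
    }
    return sum;
  }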

Pointer Induction Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^

This example uses the ``accumulate`` function of the C++ standard library. This
loop uses C++ iterators, which are pointers, and not integer indices.
The Loop Vectorizer detects pointer induction variables and can vectorize
this loop. This feature is important because many C++ programs use iterators.

.. code-block:: c++

  #include <numeric>

  int baz(int *A, int n) {
    return std::accumulate(A, A + n, 0);
  }
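
For reference, the call above behaves like the hand-written loop below, in
which the loop counter is itself a pointer. The function name ``baz_expanded``
is hypothetical; the loop mirrors what ``std::accumulate`` computes for this
call rather than reproducing any particular library implementation.

.. code-block:: c++

  int baz_expanded(int *A, int n) {
    int sum = 0;
    // 'p' is a pointer induction variable: it advances by one element per step.
    for (int *p = A, *last = A + n; p != last; ++p)
      sum += *p;
    return sum;
  }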

Reverse Iterators
^^^^^^^^^^^^^^^^^

The Loop Vectorizer can vectorize loops that count backwards.

.. code-block:: c++

  void foo(int *A, int *B, int n) {
    for (int i = n; i > 0; --i)
      A[i] += 1;
  }

Scatter / Gather
^^^^^^^^^^^^^^^^

The Loop Vectorizer can vectorize code that becomes a sequence of scalar instructions
that scatter/gather memory.

.. code-block:: c++

  #include <cstdint>  // for intptr_t

  void foo(int * A, int * B, int n) {
    for (intptr_t i = 0; i < n; ++i)
        A[i] += B[i * 4];
  }

In many situations the cost model will inform LLVM that this is not beneficial
and LLVM will only vectorize such code if forced with ``-mllvm -force-vector-width=#``.

Vectorization of Mixed Types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Loop Vectorizer can vectorize programs with mixed types. The Vectorizer
cost model can estimate the cost of the type conversion and decide if
vectorization is profitable.

.. code-block:: c++

  void foo(int *A, char *B, int n, int k) {
    for (int i = 0; i < n; ++i)
      A[i] += 4 * B[i];
  }

Global Structures Alias Analysis
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Access to global structures can also be vectorized, with alias analysis being
used to make sure accesses don't alias. Run-time checks can also be added on
pointer access to structure members.

Many variations are supported, but some that rely on undefined behaviour being
ignored (as other compilers do) are still being left un-vectorized.

.. code-block:: c++

  struct { int A[100], K, B[100]; } Foo;

  void foo() {
    for (int i = 0; i < 100; ++i)
      Foo.A[i] = Foo.B[i] + 100;
  }

Vectorization of function calls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Loop Vectorizer can vectorize intrinsic math functions.
See the table below for a list of these functions.

+-----+-----+---------+
| pow | exp |  exp2   |
+-----+-----+---------+
| sin | cos |  sqrt   |
+-----+-----+---------+
| log |log2 |  log10  |
+-----+-----+---------+
|fabs |floor|  ceil   |
+-----+-----+---------+
|fma  |trunc|nearbyint|
+-----+-----+---------+
|     |     | fmuladd |
+-----+-----+---------+

The Loop Vectorizer knows about special instructions on the target and will
vectorize a loop containing a function call that maps to the instructions. For
example, the loop below will be vectorized on Intel x86 if the SSE4.1 roundps
instruction is available.

.. code-block:: c++

  #include <math.h>

  void foo(float *f) {
    for (int i = 0; i != 1024; ++i)
      f[i] = floorf(f[i]);
  }

Partial unrolling during vectorization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Modern processors feature multiple execution units, and only programs that contain a
high degree of parallelism can fully utilize the entire width of the machine.
The Loop Vectorizer increases the instruction level parallelism (ILP) by
performing partial-unrolling of loops.

In the example below the entire array is accumulated into the variable ``sum``.
This is inefficient because only a single execution port can be used by the processor.
By unrolling the code the Loop Vectorizer allows two or more execution ports
to be used simultaneously.

.. code-block:: c++

  int foo(int *A, int *B, int n) {
    unsigned sum = 0;
    for (int i = 0; i < n; ++i)
        sum += A[i];
    return sum;
  }

The Loop Vectorizer uses a cost model to decide when it is profitable to unroll loops.
The decision to unroll the loop depends on the register pressure and the generated code size.
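
The effect of partial unrolling on the accumulation loop can be sketched by
hand as shown below. The function name ``foo_unrolled`` and the unroll factor
of two are assumptions made for this illustration; the code LLVM generates
combines unrolling with vectorization and will look different.

.. code-block:: c++

  // Illustration only: unrolling by two with independent accumulators.
  int foo_unrolled(int *A, int n) {
    unsigned sum0 = 0, sum1 = 0;
    int i = 0;
    // The two additions do not depend on each other, so the processor can
    // issue them to different execution ports at the same time.
    for (; i + 2 <= n; i += 2) {
      sum0 += A[i];
      sum1 += A[i + 1];
    }
    if (i < n)          // leftover element when n is odd
      sum0 += A[i];
    return sum0 + sum1;
  }

Each extra accumulator occupies a register and enlarges the loop body, which
is the register pressure and code size trade-off the cost model weighs.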

Performance
-----------

This section shows the execution time of Clang on a simple benchmark:
`gcc-loops <http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/UnitTests/Vectorizer/>`_.
This benchmark is a collection of loops from the GCC autovectorization
`page <http://gcc.gnu.org/projects/tree-ssa/vectorization.html>`_ by Dorit Nuzman.

The chart below compares GCC-4.7, ICC-13, and Clang-SVN with and without loop vectorization at -O3, tuned for "corei7-avx", running on a Sandybridge iMac.
The Y-axis shows the time in msec. Lower is better. The last column shows the geomean of all the kernels.

.. image:: gcc-loops.png

The chart below shows Linpack-pc with the same configuration. Results are in Mflops; higher is better.

.. image:: linpack-pc.png

.. _slp-vectorizer:

The SLP Vectorizer
==================

Details
-------

The goal of SLP vectorization (a.k.a. superword-level parallelism) is
to combine similar independent instructions
into vector instructions. Memory accesses, arithmetic operations, comparison
operations, and PHI-nodes can all be vectorized using this technique.

For example, the following function performs very similar operations on its
inputs (a1, b1) and (a2, b2). The SLP Vectorizer may combine these
into vector operations.

.. code-block:: c++

  void foo(int a1, int a2, int b1, int b2, int *A) {
    A[0] = a1*(a1 + b1)/b1 + 50*b1/a1;
    A[1] = a2*(a2 + b2)/b2 + 50*b2/a2;
  }

The SLP Vectorizer processes the code bottom-up, across basic blocks, in search of scalars to combine.
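
One way to picture the result is to rewrite the function so that each scalar
expression operates on a two-element "lane" array. This is a hand-written
illustration with a hypothetical name ``foo_lanes``; it shows which operations
line up, not the instructions LLVM emits.

.. code-block:: c++

  void foo_lanes(int a1, int a2, int b1, int b2, int *A) {
    int a[2] = {a1, a2};   // lane 0 holds the (a1, b1) inputs,
    int b[2] = {b1, b2};   // lane 1 holds the (a2, b2) inputs
    // Every operation below is applied to both lanes in lockstep, which is
    // the pattern a two-wide vector instruction can implement.
    for (int lane = 0; lane < 2; ++lane)
      A[lane] = a[lane] * (a[lane] + b[lane]) / b[lane] + 50 * b[lane] / a[lane];
  }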

Usage
-----

The SLP Vectorizer is enabled by default, but it can be disabled
through clang using the command line flag:

.. code-block:: console

   $ clang -fno-slp-vectorize file.c

LLVM has a second basic block vectorization phase
which is more compile-time intensive (The BB vectorizer). This optimization
can be enabled through clang using the command line flag:

.. code-block:: console

   $ clang -fslp-vectorize-aggressive file.c