xref: /aosp_15_r20/external/llvm/docs/CompileCudaWithLLVM.rst (revision 9880d6810fe72a1726cb53787c6711e909410d58)
1*9880d681SAndroid Build Coastguard Worker===================================
2*9880d681SAndroid Build Coastguard WorkerCompiling CUDA C/C++ with LLVM
3*9880d681SAndroid Build Coastguard Worker===================================
4*9880d681SAndroid Build Coastguard Worker
5*9880d681SAndroid Build Coastguard Worker.. contents::
6*9880d681SAndroid Build Coastguard Worker   :local:
7*9880d681SAndroid Build Coastguard Worker
8*9880d681SAndroid Build Coastguard WorkerIntroduction
9*9880d681SAndroid Build Coastguard Worker============
10*9880d681SAndroid Build Coastguard Worker
11*9880d681SAndroid Build Coastguard WorkerThis document contains the user guides and the internals of compiling CUDA
12*9880d681SAndroid Build Coastguard WorkerC/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
13*9880d681SAndroid Build Coastguard Workerand developers who want to improve LLVM for GPUs. This document assumes a basic
14*9880d681SAndroid Build Coastguard Workerfamiliarity with CUDA. Information about CUDA programming can be found in the
15*9880d681SAndroid Build Coastguard Worker`CUDA programming guide
16*9880d681SAndroid Build Coastguard Worker<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
17*9880d681SAndroid Build Coastguard Worker
18*9880d681SAndroid Build Coastguard WorkerHow to Build LLVM with CUDA Support
19*9880d681SAndroid Build Coastguard Worker===================================
20*9880d681SAndroid Build Coastguard Worker
21*9880d681SAndroid Build Coastguard WorkerCUDA support is still in development and works the best in the trunk version
22*9880d681SAndroid Build Coastguard Workerof LLVM. Below is a quick summary of downloading and building the trunk
23*9880d681SAndroid Build Coastguard Workerversion. Consult the `Getting Started
24*9880d681SAndroid Build Coastguard Worker<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
25*9880d681SAndroid Build Coastguard Workerup LLVM.
26*9880d681SAndroid Build Coastguard Worker
27*9880d681SAndroid Build Coastguard Worker#. Checkout LLVM
28*9880d681SAndroid Build Coastguard Worker
29*9880d681SAndroid Build Coastguard Worker   .. code-block:: console
30*9880d681SAndroid Build Coastguard Worker
31*9880d681SAndroid Build Coastguard Worker     $ cd where-you-want-llvm-to-live
32*9880d681SAndroid Build Coastguard Worker     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
33*9880d681SAndroid Build Coastguard Worker
34*9880d681SAndroid Build Coastguard Worker#. Checkout Clang
35*9880d681SAndroid Build Coastguard Worker
36*9880d681SAndroid Build Coastguard Worker   .. code-block:: console
37*9880d681SAndroid Build Coastguard Worker
38*9880d681SAndroid Build Coastguard Worker     $ cd where-you-want-llvm-to-live
39*9880d681SAndroid Build Coastguard Worker     $ cd llvm/tools
40*9880d681SAndroid Build Coastguard Worker     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
41*9880d681SAndroid Build Coastguard Worker
42*9880d681SAndroid Build Coastguard Worker#. Configure and build LLVM and Clang
43*9880d681SAndroid Build Coastguard Worker
44*9880d681SAndroid Build Coastguard Worker   .. code-block:: console
45*9880d681SAndroid Build Coastguard Worker
46*9880d681SAndroid Build Coastguard Worker     $ cd where-you-want-llvm-to-live
47*9880d681SAndroid Build Coastguard Worker     $ mkdir build
48*9880d681SAndroid Build Coastguard Worker     $ cd build
49*9880d681SAndroid Build Coastguard Worker     $ cmake [options] ..
50*9880d681SAndroid Build Coastguard Worker     $ make
51*9880d681SAndroid Build Coastguard Worker
52*9880d681SAndroid Build Coastguard WorkerHow to Compile CUDA C/C++ with LLVM
53*9880d681SAndroid Build Coastguard Worker===================================
54*9880d681SAndroid Build Coastguard Worker
55*9880d681SAndroid Build Coastguard WorkerWe assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
56*9880d681SAndroid Build Coastguard WorkerCUDA installation guide
57*9880d681SAndroid Build Coastguard Worker<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
58*9880d681SAndroid Build Coastguard Workeryou have not.
59*9880d681SAndroid Build Coastguard Worker
60*9880d681SAndroid Build Coastguard WorkerSuppose you want to compile and run the following CUDA program (``axpy.cu``)
61*9880d681SAndroid Build Coastguard Workerwhich multiplies a ``float`` array by a ``float`` scalar (AXPY).
62*9880d681SAndroid Build Coastguard Worker
63*9880d681SAndroid Build Coastguard Worker.. code-block:: c++
64*9880d681SAndroid Build Coastguard Worker
65*9880d681SAndroid Build Coastguard Worker  #include <iostream>
66*9880d681SAndroid Build Coastguard Worker
67*9880d681SAndroid Build Coastguard Worker  __global__ void axpy(float a, float* x, float* y) {
68*9880d681SAndroid Build Coastguard Worker    y[threadIdx.x] = a * x[threadIdx.x];
69*9880d681SAndroid Build Coastguard Worker  }
70*9880d681SAndroid Build Coastguard Worker
71*9880d681SAndroid Build Coastguard Worker  int main(int argc, char* argv[]) {
72*9880d681SAndroid Build Coastguard Worker    const int kDataLen = 4;
73*9880d681SAndroid Build Coastguard Worker
74*9880d681SAndroid Build Coastguard Worker    float a = 2.0f;
75*9880d681SAndroid Build Coastguard Worker    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
76*9880d681SAndroid Build Coastguard Worker    float host_y[kDataLen];
77*9880d681SAndroid Build Coastguard Worker
78*9880d681SAndroid Build Coastguard Worker    // Copy input data to device.
79*9880d681SAndroid Build Coastguard Worker    float* device_x;
80*9880d681SAndroid Build Coastguard Worker    float* device_y;
81*9880d681SAndroid Build Coastguard Worker    cudaMalloc(&device_x, kDataLen * sizeof(float));
82*9880d681SAndroid Build Coastguard Worker    cudaMalloc(&device_y, kDataLen * sizeof(float));
83*9880d681SAndroid Build Coastguard Worker    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
84*9880d681SAndroid Build Coastguard Worker               cudaMemcpyHostToDevice);
85*9880d681SAndroid Build Coastguard Worker
86*9880d681SAndroid Build Coastguard Worker    // Launch the kernel.
87*9880d681SAndroid Build Coastguard Worker    axpy<<<1, kDataLen>>>(a, device_x, device_y);
88*9880d681SAndroid Build Coastguard Worker
89*9880d681SAndroid Build Coastguard Worker    // Copy output data to host.
90*9880d681SAndroid Build Coastguard Worker    cudaDeviceSynchronize();
91*9880d681SAndroid Build Coastguard Worker    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
92*9880d681SAndroid Build Coastguard Worker               cudaMemcpyDeviceToHost);
93*9880d681SAndroid Build Coastguard Worker
94*9880d681SAndroid Build Coastguard Worker    // Print the results.
95*9880d681SAndroid Build Coastguard Worker    for (int i = 0; i < kDataLen; ++i) {
96*9880d681SAndroid Build Coastguard Worker      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
97*9880d681SAndroid Build Coastguard Worker    }
98*9880d681SAndroid Build Coastguard Worker
99*9880d681SAndroid Build Coastguard Worker    cudaDeviceReset();
100*9880d681SAndroid Build Coastguard Worker    return 0;
101*9880d681SAndroid Build Coastguard Worker  }
102*9880d681SAndroid Build Coastguard Worker
103*9880d681SAndroid Build Coastguard WorkerThe command line for compilation is similar to what you would use for C++.
104*9880d681SAndroid Build Coastguard Worker
105*9880d681SAndroid Build Coastguard Worker.. code-block:: console
106*9880d681SAndroid Build Coastguard Worker
107*9880d681SAndroid Build Coastguard Worker  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch>  \
108*9880d681SAndroid Build Coastguard Worker      -L<CUDA install path>/<lib64 or lib>              \
109*9880d681SAndroid Build Coastguard Worker      -lcudart_static -ldl -lrt -pthread
110*9880d681SAndroid Build Coastguard Worker  $ ./axpy
111*9880d681SAndroid Build Coastguard Worker  y[0] = 2
112*9880d681SAndroid Build Coastguard Worker  y[1] = 4
113*9880d681SAndroid Build Coastguard Worker  y[2] = 6
114*9880d681SAndroid Build Coastguard Worker  y[3] = 8
115*9880d681SAndroid Build Coastguard Worker
116*9880d681SAndroid Build Coastguard Worker``<CUDA install path>`` is the root directory where you installed CUDA SDK,
117*9880d681SAndroid Build Coastguard Workertypically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability of
118*9880d681SAndroid Build Coastguard Workeryour GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want
119*9880d681SAndroid Build Coastguard Workerto run your program on a GPU with compute capability of 3.5, you should specify
120*9880d681SAndroid Build Coastguard Worker``--cuda-gpu-arch=sm_35``.
121*9880d681SAndroid Build Coastguard Worker
122*9880d681SAndroid Build Coastguard WorkerDetecting clang vs NVCC
123*9880d681SAndroid Build Coastguard Worker=======================
124*9880d681SAndroid Build Coastguard Worker
125*9880d681SAndroid Build Coastguard WorkerAlthough clang's CUDA implementation is largely compatible with NVCC's, you may
126*9880d681SAndroid Build Coastguard Workerstill want to detect when you're compiling CUDA code specifically with clang.
127*9880d681SAndroid Build Coastguard Worker
128*9880d681SAndroid Build Coastguard WorkerThis is tricky, because NVCC may invoke clang as part of its own compilation
129*9880d681SAndroid Build Coastguard Workerprocess!  For example, NVCC uses the host compiler's preprocessor when
130*9880d681SAndroid Build Coastguard Workercompiling for device code, and that host compiler may in fact be clang.
131*9880d681SAndroid Build Coastguard Worker
132*9880d681SAndroid Build Coastguard WorkerWhen clang is actually compiling CUDA code -- rather than being used as a
133*9880d681SAndroid Build Coastguard Workersubtool of NVCC's -- it defines the ``__CUDA__`` macro.  ``__CUDA_ARCH__`` is
134*9880d681SAndroid Build Coastguard Workerdefined only in device mode (but will be defined if NVCC is using clang as a
135*9880d681SAndroid Build Coastguard Workerpreprocessor).  So you can use the following incantations to detect clang CUDA
136*9880d681SAndroid Build Coastguard Workercompilation, in host and device modes:
137*9880d681SAndroid Build Coastguard Worker
138*9880d681SAndroid Build Coastguard Worker.. code-block:: c++
139*9880d681SAndroid Build Coastguard Worker
140*9880d681SAndroid Build Coastguard Worker  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
141*9880d681SAndroid Build Coastguard Worker    // clang compiling CUDA code, host mode.
142*9880d681SAndroid Build Coastguard Worker  #endif
143*9880d681SAndroid Build Coastguard Worker
144*9880d681SAndroid Build Coastguard Worker  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
145*9880d681SAndroid Build Coastguard Worker    // clang compiling CUDA code, device mode.
146*9880d681SAndroid Build Coastguard Worker  #endif
147*9880d681SAndroid Build Coastguard Worker
148*9880d681SAndroid Build Coastguard WorkerBoth clang and nvcc define ``__CUDACC__`` during CUDA compilation.  You can
149*9880d681SAndroid Build Coastguard Workerdetect NVCC specifically by looking for ``__NVCC__``.
150*9880d681SAndroid Build Coastguard Worker
151*9880d681SAndroid Build Coastguard WorkerFlags that control numerical code
152*9880d681SAndroid Build Coastguard Worker=================================
153*9880d681SAndroid Build Coastguard Worker
154*9880d681SAndroid Build Coastguard WorkerIf you're using GPUs, you probably care about making numerical code run fast.
155*9880d681SAndroid Build Coastguard WorkerGPU hardware allows for more control over numerical operations than most CPUs,
156*9880d681SAndroid Build Coastguard Workerbut this results in more compiler options for you to juggle.
157*9880d681SAndroid Build Coastguard Worker
158*9880d681SAndroid Build Coastguard WorkerFlags you may wish to tweak include:
159*9880d681SAndroid Build Coastguard Worker
160*9880d681SAndroid Build Coastguard Worker* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
161*9880d681SAndroid Build Coastguard Worker  compiling CUDA) Controls whether the compiler emits fused multiply-add
162*9880d681SAndroid Build Coastguard Worker  operations.
163*9880d681SAndroid Build Coastguard Worker
164*9880d681SAndroid Build Coastguard Worker  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
165*9880d681SAndroid Build Coastguard Worker    and add instructions.
166*9880d681SAndroid Build Coastguard Worker  * ``on``: fuse multiplies and adds within a single statement, but never
167*9880d681SAndroid Build Coastguard Worker    across statements (C11 semantics).  Prevent ptxas from fusing other
168*9880d681SAndroid Build Coastguard Worker    multiplies and adds.
169*9880d681SAndroid Build Coastguard Worker  * ``fast``: fuse multiplies and adds wherever profitable, even across
170*9880d681SAndroid Build Coastguard Worker    statements.  Doesn't prevent ptxas from fusing additional multiplies and
171*9880d681SAndroid Build Coastguard Worker    adds.
172*9880d681SAndroid Build Coastguard Worker
173*9880d681SAndroid Build Coastguard Worker  Fused multiply-add instructions can be much faster than the unfused
174*9880d681SAndroid Build Coastguard Worker  equivalents, but because the intermediate result in an fma is not rounded,
175*9880d681SAndroid Build Coastguard Worker  this flag can affect numerical code.
176*9880d681SAndroid Build Coastguard Worker
177*9880d681SAndroid Build Coastguard Worker* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
178*9880d681SAndroid Build Coastguard Worker  floating point operations may flush `denormal
179*9880d681SAndroid Build Coastguard Worker  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
180*9880d681SAndroid Build Coastguard Worker  Operations on denormal numbers are often much slower than the same operations
181*9880d681SAndroid Build Coastguard Worker  on normal numbers.
182*9880d681SAndroid Build Coastguard Worker
183*9880d681SAndroid Build Coastguard Worker* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
184*9880d681SAndroid Build Coastguard Worker  compiler may emit calls to faster, approximate versions of transcendental
185*9880d681SAndroid Build Coastguard Worker  functions, instead of using the slower, fully IEEE-compliant versions.  For
186*9880d681SAndroid Build Coastguard Worker  example, this flag allows clang to emit the ptx ``sin.approx.f32``
187*9880d681SAndroid Build Coastguard Worker  instruction.
188*9880d681SAndroid Build Coastguard Worker
189*9880d681SAndroid Build Coastguard Worker  This is implied by ``-ffast-math``.
190*9880d681SAndroid Build Coastguard Worker
191*9880d681SAndroid Build Coastguard WorkerOptimizations
192*9880d681SAndroid Build Coastguard Worker=============
193*9880d681SAndroid Build Coastguard Worker
194*9880d681SAndroid Build Coastguard WorkerCPU and GPU have different design philosophies and architectures. For example, a
195*9880d681SAndroid Build Coastguard Workertypical CPU has branch prediction, out-of-order execution, and is superscalar,
196*9880d681SAndroid Build Coastguard Workerwhereas a typical GPU has none of these. Due to such differences, an
197*9880d681SAndroid Build Coastguard Workeroptimization pipeline well-tuned for CPUs may be not suitable for GPUs.
198*9880d681SAndroid Build Coastguard Worker
199*9880d681SAndroid Build Coastguard WorkerLLVM performs several general and CUDA-specific optimizations for GPUs. The
200*9880d681SAndroid Build Coastguard Workerlist below shows some of the more important optimizations for GPUs. Most of
201*9880d681SAndroid Build Coastguard Workerthem have been upstreamed to ``lib/Transforms/Scalar`` and
202*9880d681SAndroid Build Coastguard Worker``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
203*9880d681SAndroid Build Coastguard Workercustomizable target-independent optimization pipeline.
204*9880d681SAndroid Build Coastguard Worker
205*9880d681SAndroid Build Coastguard Worker* **Straight-line scalar optimizations**. These optimizations reduce redundancy
206*9880d681SAndroid Build Coastguard Worker  in straight-line code. Details can be found in the `design document for
207*9880d681SAndroid Build Coastguard Worker  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
208*9880d681SAndroid Build Coastguard Worker
209*9880d681SAndroid Build Coastguard Worker* **Inferring memory spaces**. `This optimization
210*9880d681SAndroid Build Coastguard Worker  <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
211*9880d681SAndroid Build Coastguard Worker  infers the memory space of an address so that the backend can emit faster
212*9880d681SAndroid Build Coastguard Worker  special loads and stores from it.
213*9880d681SAndroid Build Coastguard Worker
214*9880d681SAndroid Build Coastguard Worker* **Aggressive loop unrooling and function inlining**. Loop unrolling and
215*9880d681SAndroid Build Coastguard Worker  function inlining need to be more aggressive for GPUs than for CPUs because
216*9880d681SAndroid Build Coastguard Worker  control flow transfer in GPU is more expensive. They also promote other
217*9880d681SAndroid Build Coastguard Worker  optimizations such as constant propagation and SROA which sometimes speed up
218*9880d681SAndroid Build Coastguard Worker  code by over 10x. An empirical inline threshold for GPUs is 1100. This
219*9880d681SAndroid Build Coastguard Worker  configuration has yet to be upstreamed with a target-specific optimization
220*9880d681SAndroid Build Coastguard Worker  pipeline. LLVM also provides `loop unrolling pragmas
221*9880d681SAndroid Build Coastguard Worker  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
222*9880d681SAndroid Build Coastguard Worker  and ``__attribute__((always_inline))`` for programmers to force unrolling and
223*9880d681SAndroid Build Coastguard Worker  inling.
224*9880d681SAndroid Build Coastguard Worker
225*9880d681SAndroid Build Coastguard Worker* **Aggressive speculative execution**. `This transformation
226*9880d681SAndroid Build Coastguard Worker  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
227*9880d681SAndroid Build Coastguard Worker  mainly for promoting straight-line scalar optimizations which are most
228*9880d681SAndroid Build Coastguard Worker  effective on code along dominator paths.
229*9880d681SAndroid Build Coastguard Worker
230*9880d681SAndroid Build Coastguard Worker* **Memory-space alias analysis**. `This alias analysis
231*9880d681SAndroid Build Coastguard Worker  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
232*9880d681SAndroid Build Coastguard Worker  special memory spaces do not alias. It has yet to be integrated to the new
233*9880d681SAndroid Build Coastguard Worker  alias analysis infrastructure; the new infrastructure does not run
234*9880d681SAndroid Build Coastguard Worker  target-specific alias analysis.
235*9880d681SAndroid Build Coastguard Worker
236*9880d681SAndroid Build Coastguard Worker* **Bypassing 64-bit divides**. `An existing optimization
237*9880d681SAndroid Build Coastguard Worker  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
238*9880d681SAndroid Build Coastguard Worker  enabled in the NVPTX backend. 64-bit integer divides are much slower than
239*9880d681SAndroid Build Coastguard Worker  32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
240*9880d681SAndroid Build Coastguard Worker  divides in our benchmarks have a divisor and dividend which fit in 32-bits at
241*9880d681SAndroid Build Coastguard Worker  runtime. This optimization provides a fast path for this common case.
242*9880d681SAndroid Build Coastguard Worker
243*9880d681SAndroid Build Coastguard WorkerPublication
244*9880d681SAndroid Build Coastguard Worker===========
245*9880d681SAndroid Build Coastguard Worker
246*9880d681SAndroid Build Coastguard Worker| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
247*9880d681SAndroid Build Coastguard Worker| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
248*9880d681SAndroid Build Coastguard Worker| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
249*9880d681SAndroid Build Coastguard Worker| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
250*9880d681SAndroid Build Coastguard Worker
251*9880d681SAndroid Build Coastguard WorkerTutorial
252*9880d681SAndroid Build Coastguard Worker========
253*9880d681SAndroid Build Coastguard Worker
254*9880d681SAndroid Build Coastguard Worker`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_
255*9880d681SAndroid Build Coastguard Worker
256*9880d681SAndroid Build Coastguard WorkerObtaining Help
257*9880d681SAndroid Build Coastguard Worker==============
258*9880d681SAndroid Build Coastguard Worker
259*9880d681SAndroid Build Coastguard WorkerTo obtain help on LLVM in general and its CUDA support, see `the LLVM
260*9880d681SAndroid Build Coastguard Workercommunity <http://llvm.org/docs/#mailing-lists>`_.
261