1*9880d681SAndroid Build Coastguard Worker=================================== 2*9880d681SAndroid Build Coastguard WorkerCompiling CUDA C/C++ with LLVM 3*9880d681SAndroid Build Coastguard Worker=================================== 4*9880d681SAndroid Build Coastguard Worker 5*9880d681SAndroid Build Coastguard Worker.. contents:: 6*9880d681SAndroid Build Coastguard Worker :local: 7*9880d681SAndroid Build Coastguard Worker 8*9880d681SAndroid Build Coastguard WorkerIntroduction 9*9880d681SAndroid Build Coastguard Worker============ 10*9880d681SAndroid Build Coastguard Worker 11*9880d681SAndroid Build Coastguard WorkerThis document contains the user guides and the internals of compiling CUDA 12*9880d681SAndroid Build Coastguard WorkerC/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM 13*9880d681SAndroid Build Coastguard Workerand developers who want to improve LLVM for GPUs. This document assumes a basic 14*9880d681SAndroid Build Coastguard Workerfamiliarity with CUDA. Information about CUDA programming can be found in the 15*9880d681SAndroid Build Coastguard Worker`CUDA programming guide 16*9880d681SAndroid Build Coastguard Worker<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_. 17*9880d681SAndroid Build Coastguard Worker 18*9880d681SAndroid Build Coastguard WorkerHow to Build LLVM with CUDA Support 19*9880d681SAndroid Build Coastguard Worker=================================== 20*9880d681SAndroid Build Coastguard Worker 21*9880d681SAndroid Build Coastguard WorkerCUDA support is still in development and works the best in the trunk version 22*9880d681SAndroid Build Coastguard Workerof LLVM. Below is a quick summary of downloading and building the trunk 23*9880d681SAndroid Build Coastguard Workerversion. Consult the `Getting Started 24*9880d681SAndroid Build Coastguard Worker<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting 25*9880d681SAndroid Build Coastguard Workerup LLVM. 26*9880d681SAndroid Build Coastguard Worker 27*9880d681SAndroid Build Coastguard Worker#. Checkout LLVM 28*9880d681SAndroid Build Coastguard Worker 29*9880d681SAndroid Build Coastguard Worker .. code-block:: console 30*9880d681SAndroid Build Coastguard Worker 31*9880d681SAndroid Build Coastguard Worker $ cd where-you-want-llvm-to-live 32*9880d681SAndroid Build Coastguard Worker $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm 33*9880d681SAndroid Build Coastguard Worker 34*9880d681SAndroid Build Coastguard Worker#. Checkout Clang 35*9880d681SAndroid Build Coastguard Worker 36*9880d681SAndroid Build Coastguard Worker .. code-block:: console 37*9880d681SAndroid Build Coastguard Worker 38*9880d681SAndroid Build Coastguard Worker $ cd where-you-want-llvm-to-live 39*9880d681SAndroid Build Coastguard Worker $ cd llvm/tools 40*9880d681SAndroid Build Coastguard Worker $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang 41*9880d681SAndroid Build Coastguard Worker 42*9880d681SAndroid Build Coastguard Worker#. Configure and build LLVM and Clang 43*9880d681SAndroid Build Coastguard Worker 44*9880d681SAndroid Build Coastguard Worker .. code-block:: console 45*9880d681SAndroid Build Coastguard Worker 46*9880d681SAndroid Build Coastguard Worker $ cd where-you-want-llvm-to-live 47*9880d681SAndroid Build Coastguard Worker $ mkdir build 48*9880d681SAndroid Build Coastguard Worker $ cd build 49*9880d681SAndroid Build Coastguard Worker $ cmake [options] .. 50*9880d681SAndroid Build Coastguard Worker $ make 51*9880d681SAndroid Build Coastguard Worker 52*9880d681SAndroid Build Coastguard WorkerHow to Compile CUDA C/C++ with LLVM 53*9880d681SAndroid Build Coastguard Worker=================================== 54*9880d681SAndroid Build Coastguard Worker 55*9880d681SAndroid Build Coastguard WorkerWe assume you have installed the CUDA driver and runtime. Consult the `NVIDIA 56*9880d681SAndroid Build Coastguard WorkerCUDA installation guide 57*9880d681SAndroid Build Coastguard Worker<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if 58*9880d681SAndroid Build Coastguard Workeryou have not. 59*9880d681SAndroid Build Coastguard Worker 60*9880d681SAndroid Build Coastguard WorkerSuppose you want to compile and run the following CUDA program (``axpy.cu``) 61*9880d681SAndroid Build Coastguard Workerwhich multiplies a ``float`` array by a ``float`` scalar (AXPY). 62*9880d681SAndroid Build Coastguard Worker 63*9880d681SAndroid Build Coastguard Worker.. code-block:: c++ 64*9880d681SAndroid Build Coastguard Worker 65*9880d681SAndroid Build Coastguard Worker #include <iostream> 66*9880d681SAndroid Build Coastguard Worker 67*9880d681SAndroid Build Coastguard Worker __global__ void axpy(float a, float* x, float* y) { 68*9880d681SAndroid Build Coastguard Worker y[threadIdx.x] = a * x[threadIdx.x]; 69*9880d681SAndroid Build Coastguard Worker } 70*9880d681SAndroid Build Coastguard Worker 71*9880d681SAndroid Build Coastguard Worker int main(int argc, char* argv[]) { 72*9880d681SAndroid Build Coastguard Worker const int kDataLen = 4; 73*9880d681SAndroid Build Coastguard Worker 74*9880d681SAndroid Build Coastguard Worker float a = 2.0f; 75*9880d681SAndroid Build Coastguard Worker float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f}; 76*9880d681SAndroid Build Coastguard Worker float host_y[kDataLen]; 77*9880d681SAndroid Build Coastguard Worker 78*9880d681SAndroid Build Coastguard Worker // Copy input data to device. 79*9880d681SAndroid Build Coastguard Worker float* device_x; 80*9880d681SAndroid Build Coastguard Worker float* device_y; 81*9880d681SAndroid Build Coastguard Worker cudaMalloc(&device_x, kDataLen * sizeof(float)); 82*9880d681SAndroid Build Coastguard Worker cudaMalloc(&device_y, kDataLen * sizeof(float)); 83*9880d681SAndroid Build Coastguard Worker cudaMemcpy(device_x, host_x, kDataLen * sizeof(float), 84*9880d681SAndroid Build Coastguard Worker cudaMemcpyHostToDevice); 85*9880d681SAndroid Build Coastguard Worker 86*9880d681SAndroid Build Coastguard Worker // Launch the kernel. 87*9880d681SAndroid Build Coastguard Worker axpy<<<1, kDataLen>>>(a, device_x, device_y); 88*9880d681SAndroid Build Coastguard Worker 89*9880d681SAndroid Build Coastguard Worker // Copy output data to host. 90*9880d681SAndroid Build Coastguard Worker cudaDeviceSynchronize(); 91*9880d681SAndroid Build Coastguard Worker cudaMemcpy(host_y, device_y, kDataLen * sizeof(float), 92*9880d681SAndroid Build Coastguard Worker cudaMemcpyDeviceToHost); 93*9880d681SAndroid Build Coastguard Worker 94*9880d681SAndroid Build Coastguard Worker // Print the results. 95*9880d681SAndroid Build Coastguard Worker for (int i = 0; i < kDataLen; ++i) { 96*9880d681SAndroid Build Coastguard Worker std::cout << "y[" << i << "] = " << host_y[i] << "\n"; 97*9880d681SAndroid Build Coastguard Worker } 98*9880d681SAndroid Build Coastguard Worker 99*9880d681SAndroid Build Coastguard Worker cudaDeviceReset(); 100*9880d681SAndroid Build Coastguard Worker return 0; 101*9880d681SAndroid Build Coastguard Worker } 102*9880d681SAndroid Build Coastguard Worker 103*9880d681SAndroid Build Coastguard WorkerThe command line for compilation is similar to what you would use for C++. 104*9880d681SAndroid Build Coastguard Worker 105*9880d681SAndroid Build Coastguard Worker.. code-block:: console 106*9880d681SAndroid Build Coastguard Worker 107*9880d681SAndroid Build Coastguard Worker $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \ 108*9880d681SAndroid Build Coastguard Worker -L<CUDA install path>/<lib64 or lib> \ 109*9880d681SAndroid Build Coastguard Worker -lcudart_static -ldl -lrt -pthread 110*9880d681SAndroid Build Coastguard Worker $ ./axpy 111*9880d681SAndroid Build Coastguard Worker y[0] = 2 112*9880d681SAndroid Build Coastguard Worker y[1] = 4 113*9880d681SAndroid Build Coastguard Worker y[2] = 6 114*9880d681SAndroid Build Coastguard Worker y[3] = 8 115*9880d681SAndroid Build Coastguard Worker 116*9880d681SAndroid Build Coastguard Worker``<CUDA install path>`` is the root directory where you installed CUDA SDK, 117*9880d681SAndroid Build Coastguard Workertypically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability of 118*9880d681SAndroid Build Coastguard Workeryour GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want 119*9880d681SAndroid Build Coastguard Workerto run your program on a GPU with compute capability of 3.5, you should specify 120*9880d681SAndroid Build Coastguard Worker``--cuda-gpu-arch=sm_35``. 121*9880d681SAndroid Build Coastguard Worker 122*9880d681SAndroid Build Coastguard WorkerDetecting clang vs NVCC 123*9880d681SAndroid Build Coastguard Worker======================= 124*9880d681SAndroid Build Coastguard Worker 125*9880d681SAndroid Build Coastguard WorkerAlthough clang's CUDA implementation is largely compatible with NVCC's, you may 126*9880d681SAndroid Build Coastguard Workerstill want to detect when you're compiling CUDA code specifically with clang. 127*9880d681SAndroid Build Coastguard Worker 128*9880d681SAndroid Build Coastguard WorkerThis is tricky, because NVCC may invoke clang as part of its own compilation 129*9880d681SAndroid Build Coastguard Workerprocess! For example, NVCC uses the host compiler's preprocessor when 130*9880d681SAndroid Build Coastguard Workercompiling for device code, and that host compiler may in fact be clang. 131*9880d681SAndroid Build Coastguard Worker 132*9880d681SAndroid Build Coastguard WorkerWhen clang is actually compiling CUDA code -- rather than being used as a 133*9880d681SAndroid Build Coastguard Workersubtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is 134*9880d681SAndroid Build Coastguard Workerdefined only in device mode (but will be defined if NVCC is using clang as a 135*9880d681SAndroid Build Coastguard Workerpreprocessor). So you can use the following incantations to detect clang CUDA 136*9880d681SAndroid Build Coastguard Workercompilation, in host and device modes: 137*9880d681SAndroid Build Coastguard Worker 138*9880d681SAndroid Build Coastguard Worker.. code-block:: c++ 139*9880d681SAndroid Build Coastguard Worker 140*9880d681SAndroid Build Coastguard Worker #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__) 141*9880d681SAndroid Build Coastguard Worker // clang compiling CUDA code, host mode. 142*9880d681SAndroid Build Coastguard Worker #endif 143*9880d681SAndroid Build Coastguard Worker 144*9880d681SAndroid Build Coastguard Worker #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__) 145*9880d681SAndroid Build Coastguard Worker // clang compiling CUDA code, device mode. 146*9880d681SAndroid Build Coastguard Worker #endif 147*9880d681SAndroid Build Coastguard Worker 148*9880d681SAndroid Build Coastguard WorkerBoth clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can 149*9880d681SAndroid Build Coastguard Workerdetect NVCC specifically by looking for ``__NVCC__``. 150*9880d681SAndroid Build Coastguard Worker 151*9880d681SAndroid Build Coastguard WorkerFlags that control numerical code 152*9880d681SAndroid Build Coastguard Worker================================= 153*9880d681SAndroid Build Coastguard Worker 154*9880d681SAndroid Build Coastguard WorkerIf you're using GPUs, you probably care about making numerical code run fast. 155*9880d681SAndroid Build Coastguard WorkerGPU hardware allows for more control over numerical operations than most CPUs, 156*9880d681SAndroid Build Coastguard Workerbut this results in more compiler options for you to juggle. 157*9880d681SAndroid Build Coastguard Worker 158*9880d681SAndroid Build Coastguard WorkerFlags you may wish to tweak include: 159*9880d681SAndroid Build Coastguard Worker 160*9880d681SAndroid Build Coastguard Worker* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when 161*9880d681SAndroid Build Coastguard Worker compiling CUDA) Controls whether the compiler emits fused multiply-add 162*9880d681SAndroid Build Coastguard Worker operations. 163*9880d681SAndroid Build Coastguard Worker 164*9880d681SAndroid Build Coastguard Worker * ``off``: never emit fma operations, and prevent ptxas from fusing multiply 165*9880d681SAndroid Build Coastguard Worker and add instructions. 166*9880d681SAndroid Build Coastguard Worker * ``on``: fuse multiplies and adds within a single statement, but never 167*9880d681SAndroid Build Coastguard Worker across statements (C11 semantics). Prevent ptxas from fusing other 168*9880d681SAndroid Build Coastguard Worker multiplies and adds. 169*9880d681SAndroid Build Coastguard Worker * ``fast``: fuse multiplies and adds wherever profitable, even across 170*9880d681SAndroid Build Coastguard Worker statements. Doesn't prevent ptxas from fusing additional multiplies and 171*9880d681SAndroid Build Coastguard Worker adds. 172*9880d681SAndroid Build Coastguard Worker 173*9880d681SAndroid Build Coastguard Worker Fused multiply-add instructions can be much faster than the unfused 174*9880d681SAndroid Build Coastguard Worker equivalents, but because the intermediate result in an fma is not rounded, 175*9880d681SAndroid Build Coastguard Worker this flag can affect numerical code. 176*9880d681SAndroid Build Coastguard Worker 177*9880d681SAndroid Build Coastguard Worker* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled, 178*9880d681SAndroid Build Coastguard Worker floating point operations may flush `denormal 179*9880d681SAndroid Build Coastguard Worker <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0. 180*9880d681SAndroid Build Coastguard Worker Operations on denormal numbers are often much slower than the same operations 181*9880d681SAndroid Build Coastguard Worker on normal numbers. 182*9880d681SAndroid Build Coastguard Worker 183*9880d681SAndroid Build Coastguard Worker* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the 184*9880d681SAndroid Build Coastguard Worker compiler may emit calls to faster, approximate versions of transcendental 185*9880d681SAndroid Build Coastguard Worker functions, instead of using the slower, fully IEEE-compliant versions. For 186*9880d681SAndroid Build Coastguard Worker example, this flag allows clang to emit the ptx ``sin.approx.f32`` 187*9880d681SAndroid Build Coastguard Worker instruction. 188*9880d681SAndroid Build Coastguard Worker 189*9880d681SAndroid Build Coastguard Worker This is implied by ``-ffast-math``. 190*9880d681SAndroid Build Coastguard Worker 191*9880d681SAndroid Build Coastguard WorkerOptimizations 192*9880d681SAndroid Build Coastguard Worker============= 193*9880d681SAndroid Build Coastguard Worker 194*9880d681SAndroid Build Coastguard WorkerCPU and GPU have different design philosophies and architectures. For example, a 195*9880d681SAndroid Build Coastguard Workertypical CPU has branch prediction, out-of-order execution, and is superscalar, 196*9880d681SAndroid Build Coastguard Workerwhereas a typical GPU has none of these. Due to such differences, an 197*9880d681SAndroid Build Coastguard Workeroptimization pipeline well-tuned for CPUs may be not suitable for GPUs. 198*9880d681SAndroid Build Coastguard Worker 199*9880d681SAndroid Build Coastguard WorkerLLVM performs several general and CUDA-specific optimizations for GPUs. The 200*9880d681SAndroid Build Coastguard Workerlist below shows some of the more important optimizations for GPUs. Most of 201*9880d681SAndroid Build Coastguard Workerthem have been upstreamed to ``lib/Transforms/Scalar`` and 202*9880d681SAndroid Build Coastguard Worker``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a 203*9880d681SAndroid Build Coastguard Workercustomizable target-independent optimization pipeline. 204*9880d681SAndroid Build Coastguard Worker 205*9880d681SAndroid Build Coastguard Worker* **Straight-line scalar optimizations**. These optimizations reduce redundancy 206*9880d681SAndroid Build Coastguard Worker in straight-line code. Details can be found in the `design document for 207*9880d681SAndroid Build Coastguard Worker straight-line scalar optimizations <https://goo.gl/4Rb9As>`_. 208*9880d681SAndroid Build Coastguard Worker 209*9880d681SAndroid Build Coastguard Worker* **Inferring memory spaces**. `This optimization 210*9880d681SAndroid Build Coastguard Worker <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_ 211*9880d681SAndroid Build Coastguard Worker infers the memory space of an address so that the backend can emit faster 212*9880d681SAndroid Build Coastguard Worker special loads and stores from it. 213*9880d681SAndroid Build Coastguard Worker 214*9880d681SAndroid Build Coastguard Worker* **Aggressive loop unrooling and function inlining**. Loop unrolling and 215*9880d681SAndroid Build Coastguard Worker function inlining need to be more aggressive for GPUs than for CPUs because 216*9880d681SAndroid Build Coastguard Worker control flow transfer in GPU is more expensive. They also promote other 217*9880d681SAndroid Build Coastguard Worker optimizations such as constant propagation and SROA which sometimes speed up 218*9880d681SAndroid Build Coastguard Worker code by over 10x. An empirical inline threshold for GPUs is 1100. This 219*9880d681SAndroid Build Coastguard Worker configuration has yet to be upstreamed with a target-specific optimization 220*9880d681SAndroid Build Coastguard Worker pipeline. LLVM also provides `loop unrolling pragmas 221*9880d681SAndroid Build Coastguard Worker <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_ 222*9880d681SAndroid Build Coastguard Worker and ``__attribute__((always_inline))`` for programmers to force unrolling and 223*9880d681SAndroid Build Coastguard Worker inling. 224*9880d681SAndroid Build Coastguard Worker 225*9880d681SAndroid Build Coastguard Worker* **Aggressive speculative execution**. `This transformation 226*9880d681SAndroid Build Coastguard Worker <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is 227*9880d681SAndroid Build Coastguard Worker mainly for promoting straight-line scalar optimizations which are most 228*9880d681SAndroid Build Coastguard Worker effective on code along dominator paths. 229*9880d681SAndroid Build Coastguard Worker 230*9880d681SAndroid Build Coastguard Worker* **Memory-space alias analysis**. `This alias analysis 231*9880d681SAndroid Build Coastguard Worker <http://reviews.llvm.org/D12414>`_ infers that two pointers in different 232*9880d681SAndroid Build Coastguard Worker special memory spaces do not alias. It has yet to be integrated to the new 233*9880d681SAndroid Build Coastguard Worker alias analysis infrastructure; the new infrastructure does not run 234*9880d681SAndroid Build Coastguard Worker target-specific alias analysis. 235*9880d681SAndroid Build Coastguard Worker 236*9880d681SAndroid Build Coastguard Worker* **Bypassing 64-bit divides**. `An existing optimization 237*9880d681SAndroid Build Coastguard Worker <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ 238*9880d681SAndroid Build Coastguard Worker enabled in the NVPTX backend. 64-bit integer divides are much slower than 239*9880d681SAndroid Build Coastguard Worker 32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit 240*9880d681SAndroid Build Coastguard Worker divides in our benchmarks have a divisor and dividend which fit in 32-bits at 241*9880d681SAndroid Build Coastguard Worker runtime. This optimization provides a fast path for this common case. 242*9880d681SAndroid Build Coastguard Worker 243*9880d681SAndroid Build Coastguard WorkerPublication 244*9880d681SAndroid Build Coastguard Worker=========== 245*9880d681SAndroid Build Coastguard Worker 246*9880d681SAndroid Build Coastguard Worker| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_ 247*9880d681SAndroid Build Coastguard Worker| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt 248*9880d681SAndroid Build Coastguard Worker| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)* 249*9880d681SAndroid Build Coastguard Worker| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_ 250*9880d681SAndroid Build Coastguard Worker 251*9880d681SAndroid Build Coastguard WorkerTutorial 252*9880d681SAndroid Build Coastguard Worker======== 253*9880d681SAndroid Build Coastguard Worker 254*9880d681SAndroid Build Coastguard Worker`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_ 255*9880d681SAndroid Build Coastguard Worker 256*9880d681SAndroid Build Coastguard WorkerObtaining Help 257*9880d681SAndroid Build Coastguard Worker============== 258*9880d681SAndroid Build Coastguard Worker 259*9880d681SAndroid Build Coastguard WorkerTo obtain help on LLVM in general and its CUDA support, see `the LLVM 260*9880d681SAndroid Build Coastguard Workercommunity <http://llvm.org/docs/#mailing-lists>`_. 261