Cheatsheet
==========

.. raw:: pdf

   Spacer 0,1

.. only:: html

   Download pdf version :download:`here <../../cheatsheet/cheatsheet.pdf>`

General
-------

- Getting alpaka: https://github.com/alpaka-group/alpaka
- Issue tracker, questions, support: https://github.com/alpaka-group/alpaka/issues
- All alpaka names are in namespace ``alpaka`` and header file ``alpaka/alpaka.hpp``
- This document assumes

  .. code-block:: c++

     #include <alpaka/alpaka.hpp>
     using namespace alpaka;

Accelerator, Platform and Device
--------------------------------

Define in-kernel thread indexing type
  .. code-block:: c++

     using Dim = DimInt<constant>;
     using Idx = IntegerType;

Define accelerator type (CUDA, OpenMP, etc.)
  .. code-block:: c++

     using Acc = AcceleratorType<Dim, Idx>;

  AcceleratorType:

  .. code-block:: c++

     AccGpuCudaRt,
     AccGpuHipRt,
     AccCpuSycl,
     AccFpgaSyclIntel,
     AccGpuSyclIntel,
     AccCpuOmp2Blocks,
     AccCpuOmp2Threads,
     AccCpuTbbBlocks,
     AccCpuThreads,
     AccCpuSerial

Create platform and select a device by index
  .. code-block:: c++

     auto const platform = Platform<Acc>{};
     auto const device = getDevByIdx(platform, index);

Queue and Events
----------------

Create a queue for a device
  .. code-block:: c++

     using Queue = Queue<Acc, Property>;
     auto queue = Queue{device};

  Property:

  .. code-block:: c++

     Blocking,
     NonBlocking

Enqueue a task for execution
  .. code-block:: c++

     enqueue(queue, task);

Wait for all operations in the queue
  .. code-block:: c++

     wait(queue);

Create an event
  .. code-block:: c++

     Event<Queue> event{device};

Enqueue an event
  .. code-block:: c++

     enqueue(queue, event);

Check if the event is completed
  .. code-block:: c++

     isComplete(event);

Wait for the event (and all operations put to the same queue before it)
  .. code-block:: c++

     wait(event);

Memory
------

Memory allocation and transfers are symmetric for host and devices; both are done via the alpaka API.

Create a CPU device for memory allocation on the host side
  .. code-block:: c++

     auto const platformHost = PlatformCpu{};
     auto const devHost = getDevByIdx(platformHost, 0);

Allocate a buffer in host memory
  .. code-block:: c++

     // Use alpaka vector as a static array for the extents
     Vec<DimInt<1>, Idx> extent = value;
     Vec<DimInt<2>, Idx> extent = {valueY, valueX};
     // Allocate memory for the alpaka buffer, which is a dynamic array
     using BufHost = Buf<DevHost, DataType, Dim, Idx>;
     BufHost bufHost = allocBuf<DataType, Idx>(devHost, extent);

Create a view to host memory represented by a pointer
  .. code-block:: c++

     // Create an alpaka vector which is a static array
     Vec<Dim, Idx> extent = size;
     DataType* ptr = ...;
     auto hostView = createView(devHost, ptr, extent);

Create a view to host std::vector
  .. code-block:: c++

     auto vec = std::vector<DataType>(42u);
     auto hostView = createView(devHost, vec);

Create a view to host std::array
  .. code-block:: c++

     std::array<DataType, 2> array = {42u, 23};
     auto hostView = createView(devHost, array);

Get a raw pointer to a buffer or view, e.g. for initialization
  .. code-block:: c++

     DataType* raw = getPtrNative(hostBufOrView);

Get the pitches of a buffer or view
  .. code-block:: c++

     // distance in bytes to the next element in the buffer/view along each dimension
     auto pitchBufOrViewAcc = getPitchesInBytes(accBufOrView);

Get an mdspan to a buffer or view, e.g. for initialization
  .. code-block:: c++

     auto bufOrViewMdSpan = experimental::getMdSpan(bufOrViewAcc);
     auto value = bufOrViewMdSpan(y,x); // access 2D mdspan
     bufOrViewMdSpan(y,x) = value; // assign item to 2D mdspan

Allocate a buffer in device memory
  .. code-block:: c++

     auto bufDevice = allocBuf<DataType, Idx>(device, extent);

Enqueue a memory copy from host to device
  .. code-block:: c++

     // arguments can also be View instances instead of Buf
     memcpy(queue, bufDevice, bufHost, extent);

Enqueue a memory copy from device to host
  .. code-block:: c++

     memcpy(queue, bufHost, bufDevice, extent);

.. raw:: pdf

   PageBreak

Kernel Execution
----------------

Prepare Kernel Bundle
  .. code-block:: c++

     HeatEquationKernel heatEqKernel;

Automatically select a valid kernel launch configuration
  .. code-block:: c++

     Vec<Dim, Idx> const globalThreadExtent = vectorValue;
     Vec<Dim, Idx> const elementsPerThread = vectorValue;

     KernelCfg<Acc> const kernelCfg = {
       globalThreadExtent,
       elementsPerThread,
       false,
       GridBlockExtentSubDivRestrictions::Unrestricted};

     auto autoWorkDiv = getValidWorkDiv(
       kernelCfg,
       device,
       kernel,
       kernelParams...);

Manually set a kernel launch configuration
  .. code-block:: c++

     Vec<Dim, Idx> const blocksPerGrid = vectorValue;
     Vec<Dim, Idx> const threadsPerBlock = vectorValue;
     Vec<Dim, Idx> const elementsPerThread = vectorValue;

     using WorkDiv = WorkDivMembers<Dim, Idx>;
     auto manualWorkDiv = WorkDiv{blocksPerGrid,
                                  threadsPerBlock,
                                  elementsPerThread};

Instantiate a kernel (does not launch it yet)
  .. code-block:: c++

     Kernel kernel{argumentsForConstructor};

  The ``acc`` parameter of the kernel is provided automatically and does not need to be specified here.

Get information about the kernel from the device (size, maxThreadsPerBlock, sharedMemSize, registers, etc.)
  .. code-block:: c++

     auto kernelFunctionAttributes = getFunctionAttributes<Acc>(devAcc, kernel, parameters...);

Enqueue the kernel for execution
  .. code-block:: c++

     exec<Acc>(queue, workDiv, kernel, parameters...);

Kernel Implementation
---------------------

Define a kernel as a C++ functor
  .. code-block:: c++

     struct Kernel {
        template<typename Acc>
        ALPAKA_FN_ACC void operator()(Acc const & acc, parameters) const { ... }
     };

  ``ALPAKA_FN_ACC`` is required for kernels and for functions called inside them. ``acc`` is the mandatory first parameter; its type is the template parameter.

Access multi-dimensional indices and extents of blocks, threads, and elements
  .. code-block:: c++

     auto idx = getIdx<Origin, Unit>(acc);
     auto extent = getWorkDiv<Origin, Unit>(acc);
     // Origin: Grid, Block, Thread
     // Unit: Blocks, Threads, Elems

Access components of and destructure multi-dimensional indices and extents
  .. code-block:: c++

     auto idxX = idx[0];
     auto [z, y, x] = extent3D;

Linearize multi-dimensional vectors
  .. code-block:: c++

     auto linearIdx = mapIdx<1u>(idxND, extentND);

More generally, index multi-dimensional vectors with a different dimensionality
  .. code-block:: c++

     auto idxND = mapIdx<N>(idxMD, extentMD);

.. raw:: pdf

   Spacer 0,8

Allocate static shared memory variable
  .. code-block:: c++

     Type& var = declareSharedVar<Type, __COUNTER__>(acc);       // scalar
     auto& arr = declareSharedVar<float[256], __COUNTER__>(acc); // array

Get dynamic shared memory pool, requires the kernel to specialize
  .. code-block:: c++

     trait::BlockSharedMemDynSizeBytes<Kernel, Acc>
     Type * dynamicSharedMemoryPool = getDynSharedMem<Type>(acc);

Synchronize threads of the same block
  .. code-block:: c++

     syncBlockThreads(acc);

Atomic operations
  .. code-block:: c++

     auto result = atomicOp<Operation>(acc, arguments);
     // Operation: AtomicAdd, AtomicSub, AtomicMin, AtomicMax, AtomicExch,
     //            AtomicInc, AtomicDec, AtomicAnd, AtomicOr, AtomicXor, AtomicCas
     // Also dedicated functions available, e.g.:
     auto old = atomicAdd(acc, ptr, 1);

Memory fences on block-, grid- or device level (guarantees LoadLoad and StoreStore ordering)
  .. code-block:: c++

     mem_fence(acc, memory_scope::Block{});
     mem_fence(acc, memory_scope::Grid{});
     mem_fence(acc, memory_scope::Device{});

Warp-level operations
  .. code-block:: c++

     uint64_t result = warp::ballot(acc, idx == 1 || idx == 4);
     assert( result == (1<<1) + (1<<4) );

     int32_t valFromSrcLane = warp::shfl(acc, val, srcLane);

Math functions take acc as additional first argument
  .. code-block:: c++

     math::sin(acc, argument);

  Similar for other math functions.

Generate random numbers
  .. code-block:: c++

     auto distribution = rand::distribution::createNormalReal<double>(acc);
     auto generator = rand::engine::createDefault(acc, seed, subsequence);
     auto number = distribution(generator);
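
The two ``mapIdx`` entries under Kernel Implementation convert between a multi-dimensional index and a linear one. As a rough illustration in plain C++ (this is not alpaka code), the sketch below assumes row-major ordering, i.e. the last component (x) varies fastest, in line with the (z, y, x) component order used in this cheatsheet. ``linearize`` and ``delinearize`` are hypothetical helper names chosen for this example; they are not part of the alpaka API.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper, for illustration only (in alpaka, use mapIdx<1u>).
// Assumption: row-major order, the last component varies fastest.
template<std::size_t N>
std::size_t linearize(
    std::array<std::size_t, N> const& idx,
    std::array<std::size_t, N> const& extent)
{
    std::size_t linear = 0;
    // Horner-style accumulation from the slowest to the fastest dimension.
    for(std::size_t d = 0; d < N; ++d)
        linear = linear * extent[d] + idx[d];
    return linear;
}

// Hypothetical inverse helper: recover the N-dimensional index
// from a linear index under the same row-major assumption.
template<std::size_t N>
std::array<std::size_t, N> delinearize(
    std::size_t linear,
    std::array<std::size_t, N> const& extent)
{
    std::array<std::size_t, N> idx{};
    // Peel off components starting with the fastest (last) dimension.
    for(std::size_t d = N; d-- > 0;)
    {
        idx[d] = linear % extent[d];
        linear /= extent[d];
    }
    return idx;
}
```

For an extent of {2, 3, 4}, the index {1, 2, 3} maps to (1*3 + 2)*4 + 3 = 23, and delinearizing 23 with the same extent gives back {1, 2, 3}.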