Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from January 28 to February 10, 2022.

We welcome your feedback and suggestions. Let us know if we missed anything interesting, or if you want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.

Industry News and Community Events

LLVM and Clang


  • Jay Foad noticed a NewGVN regression. The missed optimization may hurt GPU performance the most, as vector loads are common and load latency is very high. There are no replies at the time of writing.


  • AMDGPU’s enable-flat-scratch was made a subtarget feature instead of a command line argument. The goal was to reduce the amount of global state. D119425
  • CUDA support for the --offload parameter to override the default device target was added. This is meant for SPIR-V targets. D117137
  • HIP gained support for code object v5. D118949




  • The AMDGPU hostcall module flag was replaced with a function attribute. D119216
  • Added device-side async copy operations to the GPU dialect. D119191

OpenMP (Target Offloading)



  • Removed the hard limit on the number of teams (65536) for Nvidia GPUs. D119313
  • Fixed a performance regression in BabelStream. D119187
  • Fixed shadow map traversals creating an infinite loop in TestSNAP. D119471
  • Fixed a bug where an AMDGPU barrier was not actually aligned but treated as such. ede248e
  • The new OpenMP offloading driver landed and is now in the test suite. D116541, D118637.
  • Completely removed the old device runtime. The -fopenmp-target-new-runtime option is now deprecated. D118934
  • Added a new flag -fembed-offload-binary to support bundling offloading files with the host binary (as an ELF section). D116542
  • -Xopenmp-target= was extended to accept shortened triples. D118495
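
To illustrate what the team-limit change above affects, here is a minimal sketch of a target region that requests a number of teams through the standard num_teams clause. The function name scale, the array size, and the team count are hypothetical and chosen only for illustration; how many teams the runtime actually launches remains implementation-defined.

```cpp
#include <vector>

// Hypothetical helper: multiplies every element of `data` by `factor`,
// requesting `teams` teams for the offloaded loop.
// Without OpenMP offloading enabled, the pragma is ignored and the loop
// simply runs sequentially on the host.
void scale(std::vector<float> &data, float factor, int teams) {
  float *ptr = data.data();
  const long n = static_cast<long>(data.size());
#pragma omp target teams distribute parallel for num_teams(teams) map(tofrom: ptr[0:n])
  for (long i = 0; i < n; ++i)
    ptr[i] *= factor;
}
```

A call such as scale(v, 2.0f, 100000) asks for more teams than the old 65536 cap. Building with something like clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda assumes a toolchain with NVPTX offloading support; otherwise the code falls back to the host.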

External Compilers


oneAPI DPC++

CUDA/HIP support

  • Added HIP AMD support for ilogb, log2, and remainder. DPCPP#5272
  • Added generic address space implementation for frexp, modf, and sincos built-ins on the HIP backend. DPCPP#5377
  • Added CUDA-specific values for memory advice query. DPCPP#5090
  • Made the PTXAS optimization level default to -O3 to fix issues caused by optimization level inconsistencies between PTXAS and clang. DPCPP#5188
  • Added bitwise reductions for CUDA. DPCPP#5416
  • Added the -fsycl-fp32-prec-sqrt flag to enable a high-precision sqrt implementation. DPCPP#5309

SYCL 2020 support

  • Old-style sycl::atomics are deprecated in the SYCL 2020 mode. DPCPP#5440

Non-standard extensions

Explicit SIMD
  • Fixed a bug in the comparison of two bf16 wi_element values. DPCPP#5493
  • Added support for bf16 wi_element. DPCPP#5397
  • Added support for the [[intel::loop_count()]] attribute. DPCPP#5520
  • Refactored the [[intel::max_work_group_size()]] attribute implementation. DPCPP#5392
std::complex support
  • Added complex support to group algorithms. DPCPP#5394
Device global
Optional device features
  • Allowed the [[sycl::device_has]] attribute on a SYCL kernel. DPCPP#5503