Issue #29

Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from February 11 to February 24 2022.

We welcome your feedback and suggestions. Let us know if we missed anything interesting, or want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.

Industry News and Community Events

We now have a ‘gpu’ tag on Discourse, use it when creating GPU-related topics.
Future LLVM GPU Working Group meetings will be announced in the ‘Community’ discourse category and tagged as ‘gpu’. The third meeting is tentatively scheduled for March 18.

LLVM and Clang

Discussions

Alessandro Fanfarillo asked for feedback on adding a new Clang/LLVM flag to print AMDGPU register usage of HIP kernels. One possibility is to use -Rpass-analysis= to activate this functionality. Alessandro shared a draft implementation, D95063.
Badhi asked about missing CUDA support for variadic functions in clangd/Clang. Artem Belevich explained that early versions of PTX did not support variadic functions on GPU.

Commits

SILoadStoreOptimizer was extended to handle global loads and stores. D120279, D120346
AMDGPU now promotes recursive loads from kernel arguments to constant address space. D119886
Exposed more NVPTX builtins/intrinsics. D117887, D118977

MLIR

Discussions

Commits

OpenMP (Target Offloading)

Discussions

LLVM GPU Working Group Meeting on February 18: initial work done towards OpenMP interoperability with other languages (e.g., CUDA, HIP) by porting them to use the new driver. An initial proof-of-concept is in development at D120273.

Commits

Added the -fopenmp-offload-mandatory to remove host-fallback code. D120353
Added the -fopenmp-assume-no-thread-state to statically remove thread-states in the device runtime. D120106
Support for offloading to the CPU added to the new driver. The new driver now passes 100% of the test suite. D119613
The new offloading driver now passes the AMDGPU math libraries to the linker wrapper. D119841
Initial support for the atomic compare compare-and-swap operation. D118632
Added HIP support for linking archive files of bundled bitcode with -fgpu-rdc. D120070
Introduced -fgpu-default-stream={legacy|per-thread} option to support per-thread default stream for the HIP runtime. D120298

External Compilers

LLPC

The SPIR-V DontInline function control mask is no longer translated to the LLVM noinline attribute. This is to prepare for the changes to the LLVM noinline callsite attribute handling (D119553). LLPC#1700
Pipeline options hashing was refactored for correctness and testability. LLPC#1677
More work towards enabling the New Pass Manager by default. LLPC#1682

oneAPI DPC++

CUDA/HIP support

Optimized async_work_group_copy for sm_80+ NVIDIA GPUs. DPCPP#5611
Set minimum CUDA version for Turing devices (sm_75). DPCPP#5642

SYCL 2020 support

Added SYCL 2020 support for the group class. DPCPP#5447
Aligned sycl::make_event<backend::opencl> with rev 4 of the SYCL 2020 specification. DPCPP#5498

Non-standard extensions

Explicit SIMD

Added support for the dpas API DPCPP#5637 and named barrier APIs DPCPP#5583.
Enabled op(vector, scalar) variant of binary math functions. DPCPP#5651

FPGA

Silenced unknown attribute warnings on host compilation. DPCPP#5619

Misc

Added an experimental Windows build in the GitHub Actions nightly (DPCPP#5560) and post commit (DPCPP#5639) workflows.
Turned on nightly testing with the new PM enabled by default. DPCPP#5340
Level Zero backend improvements:
- Honor property::queue::enable_profiling. DPCPP#5543
- Fixed use of copy-engines in L0 interop queue. DPCPP#5579
- Fixed timestamp calculation (in ns). DPCPP#5555
- Tuned USM pooling parameters. DPCPP#5457
Optimized online compilation for sub-devices. DPCPP#5240
Finished documentation layout restructuring. DPCPP#5605, DPCPP#5607, DPCPP#5578, DPCPP#5556
Improved get_kernel_bundle performance. DPCPP#5496
Added a warning for converting ‘C’ input to ‘C++’ in the SYCL mode. DPCPP#5598
Fixed buffer creation from an rvalue iterator. DPCPP#5609
Silenced “unknown attribute” warnings for device_indirectly_callable. DPCPP#5591
Fixed sync of host task vs kernel for in-order queue. DPCPP#5551