Issue #29
Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from February 11 to February 24 2022.
We welcome your feedback and suggestions. Let us know if we missed anything interesting, or want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.
Industry News and Community Events
- We now have a ‘gpu’ tag on Discourse, use it when creating GPU-related topics.
- Future LLVM GPU Working Group meetings will be announced in the ‘Community’ discourse category and tagged as ‘gpu’. The third meeting is tentatively scheduled for March 18.
LLVM and Clang
Discussions
- Alessandro Fanfarillo asked for feedback on adding a new Clang/LLVM flag to print AMDGPU register usage of HIP kernels. One possibility is to use
-Rpass-analysis=
to activate this functionality. Alessandro shared a draft implementation, D95063. - Badhi asked about missing CUDA support for variadic functions in clangd/Clang. Artem Belevich explained that early versions of PTX did not support variadic functions on GPU.
Commits
SILoadStoreOptimizer
was extended to handle global loads and stores. D120279, D120346- AMDGPU now promotes recursive loads from kernel arguments to constant address space. D119886
- Exposed more NVPTX builtins/intrinsics. D117887, D118977
MLIR
Discussions
Commits
OpenMP (Target Offloading)
Discussions
- LLVM GPU Working Group Meeting on February 18: initial work done towards OpenMP interoperability with other languages (e.g., CUDA, HIP) by porting them to use the new driver. An initial proof-of-concept is in development at D120273.
Commits
- Added the
-fopenmp-offload-mandatory
to remove host-fallback code. D120353 - Added the
-fopenmp-assume-no-thread-state
to statically remove thread-states in the device runtime. D120106 - Support for offloading to the CPU added to the new driver. The new driver now passes 100% of the test suite. D119613
- The new offloading driver now passes the AMDGPU math libraries to the linker wrapper. D119841
- Initial support for the
atomic compare
compare-and-swap operation. D118632 - Added HIP support for linking archive files of bundled bitcode with
-fgpu-rdc
. D120070 - Introduced
-fgpu-default-stream={legacy|per-thread}
option to support per-thread default stream for the HIP runtime. D120298
External Compilers
LLPC
- The SPIR-V
DontInline
function control mask is no longer translated to the LLVMnoinline
attribute. This is to prepare for the changes to the LLVMnoinline
callsite attribute handling (D119553). LLPC#1700 - Pipeline options hashing was refactored for correctness and testability. LLPC#1677
- More work towards enabling the New Pass Manager by default. LLPC#1682
oneAPI DPC++
CUDA/HIP support
- Optimized
async_work_group_copy
for sm_80+ NVIDIA GPUs. DPCPP#5611 - Set minimum CUDA version for Turing devices (sm_75). DPCPP#5642
SYCL 2020 support
- Added SYCL 2020 support for the group class. DPCPP#5447
- Aligned
sycl::make_event<backend::opencl>
with rev 4 of the SYCL 2020 specification. DPCPP#5498
Non-standard extensions
Explicit SIMD
- Added support for the dpas API DPCPP#5637 and named barrier APIs DPCPP#5583.
- Enabled
op(vector, scalar)
variant of binary math functions. DPCPP#5651
FPGA
- Silenced unknown attribute warnings on host compilation. DPCPP#5619
Misc
- Added an experimental Windows build in the GitHub Actions nightly (DPCPP#5560) and post commit (DPCPP#5639) workflows.
- Turned on nightly testing with the new PM enabled by default. DPCPP#5340
- Level Zero backend improvements:
- Honor
property::queue::enable_profiling
. DPCPP#5543 - Fixed use of copy-engines in L0 interop queue. DPCPP#5579
- Fixed timestamp calculation (in ns). DPCPP#5555
- Tuned USM pooling parameters. DPCPP#5457
- Honor
- Optimized online compilation for sub-devices. DPCPP#5240
- Finished documentation layout restructuring. DPCPP#5605, DPCPP#5607, DPCPP#5578, DPCPP#5556
- Improved
get_kernel_bundle
performance. DPCPP#5496 - Added a warning for converting ‘C’ input to ‘C++’ in the SYCL mode. DPCPP#5598
- Fixed buffer creation from an rvalue iterator. DPCPP#5609
- Silenced “unknown attribute” warnings for
device_indirectly_callable
. DPCPP#5591 - Fixed sync of host task vs kernel for in-order queue. DPCPP#5551