Issue #16
Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from July 2 to July 22, 2021.
We welcome your feedback and suggestions. Let us know if we missed anything interesting, or if you want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.
LLVM and Clang
Discussions
- Frank Winter asked about adding support for the NVSHMEM API for GPU clusters to the NVPTX backend. There are no replies as of writing.
Commits
- Backends can now split register allocation into multiple `RegAlloc` runs. This is implemented by giving each `RegAlloc` pass a callback which decides if a register class should be handled or not. AMDGPU uses this to allocate SGPRs and VGPRs separately. Interestingly, this had been in review since December 2018. D55301
- OpenCL gained support for new feature macros:
- The NVPTX matrix operation intrinsics were extended with the `.and.popc` variant of the `b1` MMA instruction. D105384
- Waterfall loops are now marked as `SI_WATERFALL_LOOP` to aid AMDGPU register allocation optimizations. D105467, D105192
- AMDGPU frontends can now generate export-free fragment shaders. D105683
- Some AMDGPU instructions are now marked as rematerializable:
- A new `llvm-mca` `CustomBehaviour` implementation for AMDGPU to handle the `s_waitcnt` instructions landed but was later reverted.
MLIR
Discussions
- chudur-budur asked about lowering `linalg.matmul` to GPU code that can be executed with the CUDA runtime. There are no replies as of writing.
Commits
OpenMP (Target Offloading)
Discussions
Commits
- The first GPU-runtime-specific call folding has landed (D105787), though some issues remain to be resolved. Folding of more runtime calls is already prepared (e.g., D106154, D106033).
- OpenMP-Opt will now emit more concise remarks with ID tags (D105939). The OpenMP documentation now describes these remarks in detail, with examples and mitigation strategies where applicable (D106018).
- SPMDization detection was extended to include `AAHeapToStack`/`AAHeapToShared` analysis results (D105634). This optimization, first introduced in D102307, transforms Generic-mode OpenMP kernels that contain separate worker / main threads into an SPMD kernel in which all threads are workers and active throughout the entire region. This is closer to how CUDA operates and is generally faster when the optimization applies. The new patch allows it to handle globalization. Full support still requires a third optimization that guards operations meant to be performed by a single thread with barriers. Even though full support is still in progress, we have already seen good improvements from this transformation.
External Compilers
LLPC
Mesa
- David Airlie is working on bringing back Mesa's OpenCL 3.0 support and ran into issues with the SPIR target always enabling all optional CL extensions. Anastasia Stulova noted that SPIR was originally added as a device-agnostic target. However, for Mesa/clover it is desirable to give Clang a concrete device and the list of supported extensions.