Issue #16
Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from July 2 to July 22, 2021.
We welcome your feedback and suggestions. Let us know if we missed anything interesting, or if you want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.
LLVM and Clang
Discussions
- Frank Winter asked about adding support for the NVSHMEM API for GPU clusters to the NVPTX backend. There are no replies as of writing.
Commits
- Backends can now split register allocation into multiple `RegAlloc` runs. This is implemented by giving each `RegAlloc` pass a callback which decides if a register class should be handled or not. AMDGPU uses this to allocate SGPRs and VGPRs separately. Interestingly, this had been in review since December 2018. D55301
- OpenCL gained support for new feature macros:
- The NVPTX matrix operation intrinsics were extended with the `.and.popc` variant of the `b1` MMA instruction. D105384
- Waterfall loops are now marked as `SI_WATERFALL_LOOP` to aid AMDGPU register allocation optimizations. D105467, D105192
- AMDGPU frontends can now generate export-free fragment shaders. D105683
- Some AMDGPU instructions are now marked as rematerializable:
- A new `llvm-mca` `CustomBehaviour` implementation for AMDGPU to handle the `s_waitcnt` instructions landed but was later reverted.
MLIR
Discussions
- chudur-budur asked about lowering `linalg.matmul` to GPU code that can be executed with the CUDA runtime. There are no replies as of writing.
Commits
OpenMP (Target Offloading)
Discussions
Commits
- The first GPU-runtime-specific call folding has landed (D105787), though some issues remain to be resolved. Folding of more runtime calls is already prepared (e.g., D106154, D106033).
- OpenMP-Opt will now emit more concise remarks with ID tags (D105939). The OpenMP documentation now describes these remarks in detail, with examples and mitigation strategies where applicable (D106018).
- SPMDization detection was extended to include `AAHeapToStack`/`AAHeapToShared` analysis results (D105634). This optimization, first introduced in D102307, transforms Generic-mode OpenMP kernels that contain separate worker / main threads into an SPMD kernel in which all threads are workers and active throughout the entire region. This is closer to how CUDA operates and is generally faster when the optimization applies. The new patch allows it to handle globalization. Full support still requires a third optimization that guards operations meant to be performed by a single thread with barriers. Even though full support is still in progress, we have already seen good improvements from this transformation.
External Compilers
LLPC
Mesa
- David Airlie is working on bringing back Mesa's OpenCL 3.0 support and ran into issues with the SPIR target always enabling all optional CL extensions. Anastasia Stulova noted that SPIR was originally added as a device-agnostic target. However, for Mesa/clover it is desirable to give Clang a concrete device and the list of supported extensions.