Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from July 2 to July 22, 2021.

We welcome your feedback and suggestions. Let us know if we missed anything interesting, or if you want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.

LLVM and Clang

Discussions

  • Frank Winter asked about adding support for the NVSHMEM API for GPU clusters to the NVPTX backend. There were no replies as of this writing.

Commits

  • Backends can now split register allocation into multiple runs. This is implemented by giving each register allocation pass a callback that decides whether a given register class should be handled. AMDGPU uses this to allocate SGPRs and VGPRs separately; a toy model of the mechanism follows this list. Interestingly, this patch had been in review since December 2018. D55301
  • OpenCL gained support for new feature macros (see the usage sketch after this list):
    • __opencl_c_generic_address_space D103401
    • __opencl_c_read_write_images D104915
    • __opencl_c_program_scope_global_variables D103191
  • The NVPTX matrix operation intrinsics were extended with the .and.popc variant of the b1 MMA instruction. D105384
  • Waterfall loops are now marked as SI_WATERFALL_LOOP to aid AMDGPU register allocation optimizations. D105467, D105192
  • AMDGPU frontends can now generate export-free fragment shaders. D105683
  • Some AMDGPU instructions are now marked as rematerializable.
  • A new llvm-mca CustomBehaviour implementation for AMDGPU, handling the s_waitcnt instructions, landed but was later reverted.
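
To make the split-allocation mechanism concrete, here is a minimal, self-contained toy model in C. Everything in it (names, types, the allocator itself) is invented for illustration; the real implementation lives in LLVM's C++ pass infrastructure:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the D55301 mechanism: the same allocation routine runs
 * several times, each time with a callback that filters which register
 * classes it should handle. */
typedef struct { const char *name; bool is_vector; } reg_class;
typedef bool (*reg_class_filter)(const reg_class *rc);

static void run_regalloc(const reg_class *classes, int n,
                         reg_class_filter should_allocate) {
  for (int i = 0; i < n; ++i)
    if (should_allocate(&classes[i]))
      printf("allocating %s\n", classes[i].name);
}

static bool only_sgprs(const reg_class *rc) { return !rc->is_vector; }
static bool only_vgprs(const reg_class *rc) { return rc->is_vector; }

int main(void) {
  reg_class classes[] = {{"SGPR_32", false}, {"VGPR_32", true}};
  /* AMDGPU-style split: allocate scalar registers first, then vector
   * registers in a second run. */
  run_regalloc(classes, 2, only_sgprs);
  run_regalloc(classes, 2, only_vgprs);
  return 0;
}
```

The point of the design is that there is still a single allocator implementation; only the filter varies between runs, which is what lets AMDGPU order SGPR allocation before VGPR allocation.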
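For the new OpenCL feature macros, a short usage sketch in OpenCL C (a C dialect). The kernels are hypothetical and only show how the macros can guard optional OpenCL 3.0 features:

```c
/* Hypothetical kernels: feature-test optional OpenCL 3.0 functionality. */
#if defined(__opencl_c_program_scope_global_variables)
global int scratch;                 /* program-scope global variable */
#endif

#if defined(__opencl_c_generic_address_space)
/* With the generic address space, one helper serves pointers into any
 * address space. */
void bump(int *p) { *p += 1; }
#endif

kernel void copy(
#if defined(__opencl_c_read_write_images)
    read_write image2d_t img        /* single read-write image */
#else
    read_only image2d_t in, write_only image2d_t out
#endif
) {
  /* ... image processing elided ... */
}
```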

MLIR

Discussions

Commits

OpenMP (Target Offloading)

Discussions

Commits

  • The first folding of a GPU-runtime-specific call landed in D105787, though some issues remain to be resolved; a conceptual sketch follows this list. Folding of more runtime calls is already prepared (e.g., D106154, D106033).
  • OpenMP-Opt now emits more concise remarks tagged with IDs (D105939). The OpenMP documentation describes these remarks in detail, with examples and mitigation strategies where applicable (D106018).
  • SPMDization detection was extended to take AAHeapToStack/AAHeapToShared analysis results into account (D105634). This optimization, first introduced in D102307, transforms Generic-mode OpenMP kernels, which contain a single main thread and separate worker threads, into SPMD kernels in which all threads are active throughout the entire region. This is closer to how CUDA operates and is generally faster. The new patch allows the transformation to handle globalization. Full support still requires a third optimization: guarding operations meant to be executed by a single thread with barriers. Even though that piece is still in progress, we have already seen good improvements from this transformation. The second sketch below shows the Generic-mode pattern it targets.
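
To illustrate what the call folding buys: once OpenMP-Opt proves the execution mode of a kernel, branches on the runtime's mode query become constant. The sketch below is our own illustration in C; we believe __kmpc_is_spmd_exec_mode is the query targeted by D105787, and the declaration and surrounding function are invented:

```c
/* Conceptual sketch, not actual device-runtime source: once the optimizer
 * proves a kernel always runs in SPMD mode, the query below folds to a
 * constant and the generic-mode branch becomes dead code. */
extern signed char __kmpc_is_spmd_exec_mode(void);

int pick_path(int x) {
  if (__kmpc_is_spmd_exec_mode())
    return x * 2;   /* SPMD path: kept */
  return x + 1;     /* generic path: eliminated after folding */
}
```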
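And a self-contained C/OpenMP example (our own, not taken from the patches) of the Generic-mode pattern that SPMDization targets: sequential setup before a parallel region inside a target kernel.

```c
#include <stdio.h>

int main(void) {
  int sum = 0;
  /* Generic-mode pattern: the statement before the parallel region runs on
   * a single "main" device thread while worker threads idle. SPMDization
   * rewrites such kernels so all threads are active from the start, with
   * the setup replicated or guarded. */
  #pragma omp target map(tofrom : sum)
  {
    int scale = 2;  /* sequential setup */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < 128; ++i)
      sum += scale * i;
  }
  printf("sum = %d\n", sum);  /* 2 * (0 + 1 + ... + 127) = 16256 */
  return 0;
}
```

After the transformation, all device threads start active, so the kernel matches the SPMD launch model that CUDA uses.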

External Compilers

LLPC

Mesa