Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from December 3 to December 16 2021.

We welcome your feedback and suggestions. Let us know if we missed anything interesting, or want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.

Industry News and Conferences

LLVM and Clang

Discussions

  • Intel, ARM, and Khronos proposed to start the integration of SPIR-V backend to LLVM. The code has been refactored so it does not require extra changes from GlobalISel or other LLVM target-independent infrastructure. With some SPIR-V functionality missing, the new code is proposed to land as an experimental backend. There are no replies to the proposal at the time of writing.

Commits

  • The HIP toolchain is being refactored with SPIR-V generation in mind. The old toolchain was renamed to HIPAMDToolChain and a new one HIPSPVToolChain was added. The new toolchain uses the SPIRV-LLVM-Translator as the SPIR-V LLVM backend is not ready yet. Next, the in-review patches will allow emitting SPIR-V binary using llc. D110549, D110618, D110622, D110685
  • Documentation was added for the DWARF location descriptions, as a part of the ‘Dwarf Extensions For Heterogeneous Debugging’ used by AMDGPU. D115587

MLIR

Discussions

  • Thomas Raoux is working on adding async copy from global to shared memory to the GPU dialect: ‘async copy is a new feature in Nvidia Ampere GPUs, but other GPUs are likely to support similar feature in the future. It works like a DMA operation that copies data from global to shared memory without going through register and without blocking the execution unit. The big advantage is that it saves register usage and it allows aggressive software pipelining to hide latency.’ For more details, see the blog post. Thomas is seeking feedback on how async copy is modeled before sending out patches for review.

Commits

OpenMP (Target Offloading)

Discussions

Commits

  • The new device runtime is now enabled by default for AMDGPU offloading. D115157

External Compilers

LLPC

  • The standalone compiler tool amdllpc can now compile multiple shader stages together without having to provide a .pipe file. LLPC#1591

oneAPI DPC++

CUDA/HIP support

  • Added HIP backend support to filter selector extension.
  • Fixed infinite loop when parallel_for range exceeds INT_MAX.
  • Fixed platform query for USM allocation information.
  • Improved performance on CUDA backed by reducing runtime overhead on event handling and using more efficient implementation for barrier and other synchronization built-in functions.

SYCL 2020 support

  • Switched blocking USM free for OpenCL GPU devices.
  • Fixed support for classes implicitly converted from items in parallel_for.

Non-standard extensions

  • Added support for HW threads per EU device query.
  • Added implementation of discard_events extension enabling optimization of queue operations.
  • Added experimental FPGA latency control feature.
  • ESIMD: Add support for an arbitrary number of elements to simd::copy_from/to.
  • Fixed bfloat16 construction from integer.

Misc

  • Improved runtime library performance by reducing overhead of XPTI instrumentation.
  • Made initialization of host task thread pool thread-safe.
  • Multiple GitHub Actions continuous system improvements:
  • libclc testing added to regular validation.
  • Build of HIP and CUDA back-ends enabled in nightly Docker containers.
  • Improved runtime error diagnostics by handling ZE_RESULT_ERROR_INVALID_ARGUMENT and out-of-memory Level Zero error codes.