Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from March 19 to April 1, 2021.
We welcome your feedback and suggestions. Let us know if we missed anything interesting, or if you want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.
Industry News and Conference Talks
LLVM and Clang
- Discussion on the ‘Abstracting over SSA form IRs to implement generic analyses’ RFC has seen some new activity. Sameer Sahasrabuddhe shared their perspective and identified the main issue: LLVM IR/MIR basic blocks do not explicitly track their successors and predecessors. Nicolai Hähnle clarified the most important decisions needed to move the proposal forward. In addition, Nicolai noted that changing the in-memory representation of basic blocks to contain predecessor and successor vectors would allow terminator instructions to refer to those, and could potentially reduce memory usage.
- AMDGPU PAL usage documentation was updated.
- (In-review) AMDGPU Machine IR optimization to remove unnecessary cache invalidations.
- Conversion to NVVM/ROCDL now uses a data layout entry to specify the bitwidth of the index type.
OpenMP (Target Offloading)
- Nader Al Awar asked about using the `-fembed-bitcode` Clang option with OpenMP target offload for CUDA. There are no replies as of writing.
- Asynchronous offloading bugs were discovered and are being discussed on the mailing list and the bugtracker.
- The device runtime for LLVM 12 shows performance regressions that will be addressed in the 12.0.1 release.
- A rewrite of the device runtime is being tested right now. The first results look promising with regards to performance and memory usage.
- Issues with Clang’s device code generation were detected and will be resolved soon.
- OpenMP `declare mapper` will now pass variable names to the runtime for better feedback.
- Asynchronous errors reported by the device runtime will be less confusing.
- Failed offloading will not cause an assertion error anymore.
- Optimization for variable globalization on the device is already available while we prepare to switch to the new system.
- Dave Airlie reported that they found lavapipe, Mesa’s CPU-based Vulkan implementation, to be faster than SwiftShader, Google’s CPU-based Vulkan implementation. This is based on a set of randomly picked Vulkan samples from Sascha Willems.