Issue #28

Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from January 28 to February 10 2022.

We welcome your feedback and suggestions. Let us know if we missed anything interesting, or want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.

Industry News and Community Events

The second LLVM GPU Working Group Meeting is scheduled for the coming Friday, February 18 at 11am ET / 4pm UTC. You can find the details in the agenda document.
The GPU tag was added to the LLVM discourse. All LLVM GPU News issues will use it. If you want to increase the discoverability of your GPU-related discussions, remember to add it when you create a new Topic. See the full list of new tags.
Jason Ekstrand blogged on why the Mesa graphics driver stack prefers a custom intermediate representation, NIR, over LLVM.
Lei Zhang started a blog post series on compilers and IRs. The first post touches on IR design and discusses LLVM IR, SPIR-V, and MLIR.
AMD ROCm v5.0 has been released.

LLVM and Clang

Discussions

Jay Foad noticed a NewGVN regression. The missed optimization may hurt GPU performance most as vector loads are common and load latency very high. There are no replies at the time of writing.

Commits

AMGPU’s enable-flat-scratch was made a subtarget feature instead of a command line argument. The goal was to reduce the amount of global state. D119425
CUDA support for the --offload parameter to override the default device target was added. This is meant for SPIR-V targets. D117137
HIP gained support for code object v5. D118949

MLIR

Discussions

Krzysztof Drewniak proposed ‘enabling ROCm conversions in pre-merge builds’ to prevent future breakages in AMDGPU-related MLIR passes. Mehdi Amini suggested for AMD to set up a buildbot with an AMD GPU, similar to the existing mlir-nvidia builder. Such machines can be fully isolated.

Commits

The AMDGPU hostcall module flag was replaced with a function attribute. D119216
Added device-side async copy operations to the GPU dialect. D119191

OpenMP (Target Offloading)

Discussions

Commits

Removed the hard limit of number of teams (65536) for Nvidia GPUs. D119313
Fixed a performance regression in BabelStream. D119187
Fixed shadow map traversals creating an infinite loop in TestSNAP. D119471
Fixed a bug where an AMDGPU barrier was not actually aligned but treated as such. ede248e
The new OpenMP offloading driver landed and is now in the test suite. D116541, D118637.
Completely removed the old device runtime. The -fopenmp-target-new-runtime option is now deprecated. D118934
Added a new flag -fembed-offload-binary to support bundling offloading files with the host binary (as an ELF section). D116542
-Xopenmp-target= was extended to accept shortened triples. D118495

External Compilers

LLPC

oneAPI DPC++

CUDA/HIP support

Added HIP AMD support for ilogb, log2, and remainder. DPCPP#5272
Added generic address space implementation for frexp, modf, and sincos built-ins on the HIP backend. DPCPP#5377
Added CUDA-specific values for memory advice query. DPCPP#5090
Made PTXAS the optimization level default to -O3 to fix issues caused by optimization levels inconsistencies between PTXAS and clang compilers. DPCPP#5188
Added bitwise reductions for CUDA. DPCPP#5416
Added the -fsycl-fp32-prec-sqrt flag to enable high precision sqrt implementation. DPCPP#5309

SYCL 2020 support

Old style sycl::atomics are deprecated in the SYCL 2020 mode. DPCPP#5440

Non-standard extensions

Explicit SIMD

Updated README.md (DPCPP#5515) and Doxygen documentation (DPCPP#5531, DPCPP#5485, DPCPP#5472, and DPCPP#5443).
Fixed merge (DPCPP#5490) and public simd and simd_view APIs (DPCPP#5465).
Removed deprecated APIs. (DPCPP#5402, DPCPP#5426)
esimd::saturate: updated saturation usage interface and cleanup APIs. DPCPP#5507, DPCPP#5441
Optimized out loops in popular simd constructors. DPCPP#5425

Matrix

Fixed the bug of comparison between two bf16’s wi_element. DPCPP#5493
Added support for bf16’s wi_element. DPCPP#5397

FPGA

Added support for the [[intel::loop_count()]] attribute. DPCPP#5520
Refactored the [[intel::max_work_group_size()]] attribute implementation. DPCPP#5392

`std::complex` support

Added complex support to group algorithms. DPCPP#5394

Device global

Updated device global design document. DPCPP#5525, DPCPP#5432
Started implementation with initial device_global registration. DPCPP#5499

Optional device features

Allowed the [[sycl::device_has]] attribute on a SYCL kernel. DPCPP#5503

Misc

Made documentation layout restructuring inside the sycl/doc directory. DPCPP#5437, DPCPP#5439, DPCPP#5381, DPCPP#5436, DPCPP#5421, DPCPP#5405, DPCPP#5463, DPCPP#5453, DPCPP#5470, DPCPP#5452, DPCPP#5456, DPCPP#5458, DPCPP#5459, DPCPP#5454, DPCPP#5451
Added new proposals for sub_group::load/store extension (DPCPP#5523) and new free function queries (DPCPP#5106).
Added code instrumentation for the tools supporting XPTI (i.e., profilers, dynamic analyzers, etc.). DPCPP#5391, DPCPP#5259, DPCPP#4998, DPCPP#5335, DPCPP#5533
Improved Level Zero backed support:
- Emit more informative errors at command execution failure. DPCPP#5524
- Emit program build logs for warning levels >= 2. DPCPP#5319
- Avoid resetting the same command list multiple times. DPCPP#5529
- Fix memory leak in USM prefetch (DPCPP#5461) and a possible null pointer dereference (DPCPP#5521).
- Refine support for query of USM capabilities. DPCPP#5442
- Relax the mutex lock duration during queue finish operation. DPCPP#5406
Added compiler diagnostics for explicit cast from default address space to a named one. DPCPP#5500
Turned off parts of the SimplifyCFG optimization pass breaking the “convergent” functions semantic in the SYCL mode. DPCPP#5283
Initial support for allowing fat static -lname processing. DPCPP#5413
Fallback implementation of the assert functions has been disabled by default. DPCPP#4694
Allowed calls through constant expr function pointers. DPCPP#5390

Industry News and Community Events

LLVM and Clang

Discussions

Commits

MLIR

Discussions

Commits

OpenMP (Target Offloading)

Discussions

Commits

External Compilers

LLPC

oneAPI DPC++

CUDA/HIP support

SYCL 2020 support

Non-standard extensions

Explicit SIMD

Matrix

FPGA

std::complex support

Device global

Optional device features

Misc

`std::complex` support