Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from January 28 to February 10 2022.
We welcome your feedback and suggestions. Let us know if we missed anything interesting, or want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.
Industry News and Community Events
- The second LLVM GPU Working Group Meeting is scheduled for the coming Friday, February 18 at 11am ET / 4pm UTC. You can find the details in the agenda document.
- The GPU tag was added to the LLVM discourse. All LLVM GPU News issues will use it. If you want to increase the discoverability of your GPU-related discussions, remember to add it when you create a new Topic. See the full list of new tags.
- Jason Ekstrand blogged on why the Mesa graphics driver stack prefers a custom intermediate representation, NIR, over LLVM.
- Lei Zhang started a blog post series on compilers and IRs. The first post touches on IR design and discusses LLVM IR, SPIR-V, and MLIR.
- AMD ROCm v5.0 has been released.
LLVM and Clang
- Jay Foad noticed a NewGVN regression. The missed optimization may hurt GPU performance most as vector loads are common and load latency very high. There are no replies at the time of writing.
enable-flat-scratchwas made a subtarget feature instead of a command line argument. The goal was to reduce the amount of global state. D119425
- CUDA support for the
--offloadparameter to override the default device target was added. This is meant for SPIR-V targets. D117137
- HIP gained support for code object v5. D118949
- Krzysztof Drewniak proposed ‘enabling ROCm conversions in pre-merge builds’ to prevent future breakages in AMDGPU-related MLIR passes. Mehdi Amini
suggested for AMD to set up a buildbot with an AMD GPU, similar to the existing
mlir-nvidiabuilder. Such machines can be fully isolated.
- The AMDGPU
hostcallmodule flag was replaced with a function attribute. D119216
- Added device-side async copy operations to the GPU dialect. D119191
OpenMP (Target Offloading)
- Removed the hard limit of number of teams (65536) for Nvidia GPUs. D119313
- Fixed a performance regression in
- Fixed shadow map traversals creating an infinite loop in
- Fixed a bug where an AMDGPU barrier was not actually aligned but treated as such. ede248e
- The new OpenMP offloading driver landed and is now in the test suite. D116541, D118637.
- Completely removed the old device runtime. The
-fopenmp-target-new-runtimeoption is now deprecated. D118934
- Added a new flag
-fembed-offload-binaryto support bundling offloading files with the host binary (as an ELF section). D116542
-Xopenmp-target=was extended to accept shortened triples. D118495
- Added HIP AMD support for
- Added generic address space implementation for
sincosbuilt-ins on the HIP backend. DPCPP#5377
- Added CUDA-specific values for memory advice query. DPCPP#5090
- Made PTXAS the optimization level default to
-O3to fix issues caused by optimization levels inconsistencies between PTXAS and clang compilers. DPCPP#5188
- Added bitwise reductions for CUDA. DPCPP#5416
- Added the
-fsycl-fp32-prec-sqrtflag to enable high precision
SYCL 2020 support
- Old style
sycl::atomicsare deprecated in the SYCL 2020 mode. DPCPP#5440
README.md(DPCPP#5515) and Doxygen documentation (DPCPP#5531, DPCPP#5485, DPCPP#5472, and DPCPP#5443).
- Fixed merge (DPCPP#5490) and public
- Removed deprecated APIs. (DPCPP#5402, DPCPP#5426)
esimd::saturate: updated saturation usage interface and cleanup APIs. DPCPP#5507, DPCPP#5441
- Optimized out loops in popular simd constructors. DPCPP#5425
- Fixed the bug of comparison between two
- Added support for
- Added support for the
- Refactored the
[[intel::max_work_group_size()]]attribute implementation. DPCPP#5392
- Added complex support to group algorithms. DPCPP#5394
- Updated device global design document. DPCPP#5525, DPCPP#5432
- Started implementation with initial
Optional device features
- Allowed the
[[sycl::device_has]]attribute on a SYCL kernel. DPCPP#5503
- Made documentation layout restructuring inside the
sycl/docdirectory. DPCPP#5437, DPCPP#5439, DPCPP#5381, DPCPP#5436, DPCPP#5421, DPCPP#5405, DPCPP#5463, DPCPP#5453, DPCPP#5470, DPCPP#5452, DPCPP#5456, DPCPP#5458, DPCPP#5459, DPCPP#5454, DPCPP#5451
- Added new proposals for
storeextension (DPCPP#5523) and new free function queries (DPCPP#5106).
- Added code instrumentation for the tools supporting XPTI (i.e., profilers, dynamic analyzers, etc.). DPCPP#5391, DPCPP#5259, DPCPP#4998, DPCPP#5335, DPCPP#5533
- Improved Level Zero backed support:
- Emit more informative errors at command execution failure. DPCPP#5524
- Emit program build logs for warning
levels >= 2. DPCPP#5319
- Avoid resetting the same command list multiple times. DPCPP#5529
- Fix memory leak in USM prefetch (DPCPP#5461) and a possible null pointer dereference (DPCPP#5521).
- Refine support for query of USM capabilities. DPCPP#5442
- Relax the mutex lock duration during queue finish operation. DPCPP#5406
- Added compiler diagnostics for explicit cast from default address space to a named one. DPCPP#5500
- Turned off parts of the
SimplifyCFGoptimization pass breaking the “convergent” functions semantic in the SYCL mode. DPCPP#5283
- Initial support for allowing fat static
- Fallback implementation of the
assertfunctions has been disabled by default. DPCPP#4694
- Allowed calls through constant expr function pointers. DPCPP#5390