Issue #28
Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things under the LLVM umbrella. This issue covers the period from January 28 to February 10 2022.
We welcome your feedback and suggestions. Let us know if we missed anything interesting, or want us to bring attention to your (sub)project, revisions under review, or proposals. Please see the bottom of the page for details on how to submit suggestions and contribute.
Industry News and Community Events
- The second LLVM GPU Working Group Meeting is scheduled for the coming Friday, February 18 at 11am ET / 4pm UTC. You can find the details in the agenda document.
- The GPU tag was added to the LLVM discourse. All LLVM GPU News issues will use it. If you want to increase the discoverability of your GPU-related discussions, remember to add it when you create a new Topic. See the full list of new tags.
- Jason Ekstrand blogged on why the Mesa graphics driver stack prefers a custom intermediate representation, NIR, over LLVM.
- Lei Zhang started a blog post series on compilers and IRs. The first post touches on IR design and discusses LLVM IR, SPIR-V, and MLIR.
- AMD ROCm v5.0 has been released.
LLVM and Clang
Discussions
- Jay Foad noticed a NewGVN regression. The missed optimization may hurt GPU performance most as vector loads are common and load latency very high. There are no replies at the time of writing.
Commits
- AMGPU’s
enable-flat-scratch
was made a subtarget feature instead of a command line argument. The goal was to reduce the amount of global state. D119425 - CUDA support for the
--offload
parameter to override the default device target was added. This is meant for SPIR-V targets. D117137 - HIP gained support for code object v5. D118949
MLIR
Discussions
- Krzysztof Drewniak proposed ‘enabling ROCm conversions in pre-merge builds’ to prevent future breakages in AMDGPU-related MLIR passes. Mehdi Amini
suggested for AMD to set up a buildbot with an AMD GPU, similar to the existing
mlir-nvidia
builder. Such machines can be fully isolated.
Commits
- The AMDGPU
hostcall
module flag was replaced with a function attribute. D119216 - Added device-side async copy operations to the GPU dialect. D119191
OpenMP (Target Offloading)
Discussions
Commits
- Removed the hard limit of number of teams (65536) for Nvidia GPUs. D119313
- Fixed a performance regression in
BabelStream
. D119187 - Fixed shadow map traversals creating an infinite loop in
TestSNAP
. D119471 - Fixed a bug where an AMDGPU barrier was not actually aligned but treated as such. ede248e
- The new OpenMP offloading driver landed and is now in the test suite. D116541, D118637.
- Completely removed the old device runtime. The
-fopenmp-target-new-runtime
option is now deprecated. D118934 - Added a new flag
-fembed-offload-binary
to support bundling offloading files with the host binary (as an ELF section). D116542 -Xopenmp-target=
was extended to accept shortened triples. D118495
External Compilers
LLPC
oneAPI DPC++
CUDA/HIP support
- Added HIP AMD support for
ilogb
,log2
, andremainder
. DPCPP#5272 - Added generic address space implementation for
frexp
,modf
, andsincos
built-ins on the HIP backend. DPCPP#5377 - Added CUDA-specific values for memory advice query. DPCPP#5090
- Made PTXAS the optimization level default to
-O3
to fix issues caused by optimization levels inconsistencies between PTXAS and clang compilers. DPCPP#5188 - Added bitwise reductions for CUDA. DPCPP#5416
- Added the
-fsycl-fp32-prec-sqrt
flag to enable high precisionsqrt
implementation. DPCPP#5309
SYCL 2020 support
- Old style
sycl::atomics
are deprecated in the SYCL 2020 mode. DPCPP#5440
Non-standard extensions
Explicit SIMD
- Updated
README.md
(DPCPP#5515) and Doxygen documentation (DPCPP#5531, DPCPP#5485, DPCPP#5472, and DPCPP#5443). - Fixed merge (DPCPP#5490) and public
simd
andsimd_view
APIs (DPCPP#5465). - Removed deprecated APIs. (DPCPP#5402, DPCPP#5426)
esimd::saturate
: updated saturation usage interface and cleanup APIs. DPCPP#5507, DPCPP#5441- Optimized out loops in popular simd constructors. DPCPP#5425
Matrix
- Fixed the bug of comparison between two
bf16
’swi_element
. DPCPP#5493 - Added support for
bf16
’swi_element
. DPCPP#5397
FPGA
- Added support for the
[[intel::loop_count()]]
attribute. DPCPP#5520 - Refactored the
[[intel::max_work_group_size()]]
attribute implementation. DPCPP#5392
std::complex
support
- Added complex support to group algorithms. DPCPP#5394
Device global
- Updated device global design document. DPCPP#5525, DPCPP#5432
- Started implementation with initial
device_global
registration. DPCPP#5499
Optional device features
- Allowed the
[[sycl::device_has]]
attribute on a SYCL kernel. DPCPP#5503
Misc
- Made documentation layout restructuring inside the
sycl/doc
directory. DPCPP#5437, DPCPP#5439, DPCPP#5381, DPCPP#5436, DPCPP#5421, DPCPP#5405, DPCPP#5463, DPCPP#5453, DPCPP#5470, DPCPP#5452, DPCPP#5456, DPCPP#5458, DPCPP#5459, DPCPP#5454, DPCPP#5451 - Added new proposals for
sub_group::load
/store
extension (DPCPP#5523) and new free function queries (DPCPP#5106). - Added code instrumentation for the tools supporting XPTI (i.e., profilers, dynamic analyzers, etc.). DPCPP#5391, DPCPP#5259, DPCPP#4998, DPCPP#5335, DPCPP#5533
- Improved Level Zero backed support:
- Emit more informative errors at command execution failure. DPCPP#5524
- Emit program build logs for warning
levels >= 2
. DPCPP#5319 - Avoid resetting the same command list multiple times. DPCPP#5529
- Fix memory leak in USM prefetch (DPCPP#5461) and a possible null pointer dereference (DPCPP#5521).
- Refine support for query of USM capabilities. DPCPP#5442
- Relax the mutex lock duration during queue finish operation. DPCPP#5406
- Added compiler diagnostics for explicit cast from default address space to a named one. DPCPP#5500
- Turned off parts of the
SimplifyCFG
optimization pass breaking the “convergent” functions semantic in the SYCL mode. DPCPP#5283 - Initial support for allowing fat static
-lname
processing. DPCPP#5413 - Fallback implementation of the
assert
functions has been disabled by default. DPCPP#4694 - Allowed calls through constant expr function pointers. DPCPP#5390