path: root/clang/test/AST/ast-print-openacc-wait-construct.cpp
author Krzysztof Drewniak <Krzysztof.Drewniak@amd.com> 2025-07-24 10:26:03 -0700
committer GitHub <noreply@github.com> 2025-07-24 12:26:03 -0500
commit a4dd51d72f18df5ebc447e3c9070bc392fddb9b5
tree aa43000afff2bf62be3c9b7ca1135a5a587f84be /clang/test/AST/ast-print-openacc-wait-construct.cpp
parent 33c94450f02ca9c7fea1366b14186dcf1a1b8cd7
[mlir][ArithToAMDGPU] Use native packing support (#150342)
The current arith-to-amdgpu patterns for scaling_extf and scaling_truncf don't take full advantage of the native packing ability of the intrinsics being targeted.

Scaling extension takes the location of the two elements to be extended as a constant argument (a byte for fp4, a half for fp8), and scaling truncation takes a 32-bit input register and a byte or half to write the truncated values to.

Not using these features causes unnecessary register pressure. This PR resolves the inefficiency.

It also adds a test for the expected use case of extending or truncating a block of 32 values to/from fp4 with a uniform scale, to ensure that this usage involves a minimal amount of vector shuffling.
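
As a rough illustration of the packing idea described above (not the actual intrinsics or the MLIR patterns), the following Python sketch models a 32-bit register holding eight 4-bit codes. The function names, the low-nibble-first ordering, and the treatment of fp4 values as opaque 4-bit codes are all assumptions made for this example. It shows a byte selector picking two elements out of a register, and a write that updates one byte of an existing register in place rather than building a new one.

```python
MASK32 = 0xFFFFFFFF

def select_fp4_pair(reg, byte_index):
    """Return the two 4-bit codes held in byte `byte_index` (0..3) of `reg`.

    Hypothetical model of 'extension takes the location of the two elements
    as a constant argument'. Low nibble first is an assumed ordering.
    """
    if not 0 <= byte_index <= 3:
        raise ValueError("byte_index must be in 0..3")
    byte = (reg >> (8 * byte_index)) & 0xFF
    return byte & 0xF, byte >> 4

def write_fp4_pair(reg, byte_index, lo, hi):
    """Return `reg` with byte `byte_index` replaced by the pair (lo, hi).

    Hypothetical model of 'truncation takes a 32-bit input register and a
    byte to write to': the other three bytes are preserved.
    """
    if not 0 <= byte_index <= 3:
        raise ValueError("byte_index must be in 0..3")
    shift = 8 * byte_index
    byte = ((hi & 0xF) << 4) | (lo & 0xF)
    return ((reg & ~(0xFF << shift)) | (byte << shift)) & MASK32

def pack_block(codes):
    """Pack an even-length list of 4-bit codes into 32-bit words, 8 per word."""
    if len(codes) % 2 != 0:
        raise ValueError("expected an even number of codes")
    words = []
    for base in range(0, len(codes), 8):
        chunk = codes[base:base + 8]
        word = 0
        for i in range(0, len(chunk), 2):
            # Reuse the same word and fill in the next byte.
            word = write_fp4_pair(word, i // 2, chunk[i], chunk[i + 1])
        words.append(word)
    return words
```

Under this model, a block of 32 fp4 codes fits in four 32-bit words, and each pair is addressed by a (word, byte) position instead of being moved into its own register first.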
Diffstat (limited to 'clang/test/AST/ast-print-openacc-wait-construct.cpp')
0 files changed, 0 insertions, 0 deletions