author     Fabian Ritter <fabian.ritter@amd.com>  2024-10-28 09:04:19 +0100
committer  GitHub <noreply@github.com>  2024-10-28 09:04:19 +0100
commit     a4fd3dba6e285734bc635b0651a30dfeffedeada
tree       8534b669c66318b8e867065b74f1a0182201d54a
parent     35f6cc6af09f48f9038fce632815a2ad6ffe8689
[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332)
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in LowerMemIntrinsics.cpp, the loop consists of a single load/store pair per iteration. We can improve performance in some cases by emitting multiple load/store pairs per iteration. This patch achieves that by increasing the width of the loop lowering type in the GCN target and letting legalization split the resulting too-wide access pairs into multiple legal access pairs.

This change only affects lowered memcpys and memmoves with large (>= 1024 bytes) constant lengths. Smaller constant lengths are handled by ISel directly; non-constant lengths would be slowed down by this change if the dynamic length was smaller or slightly larger than what an unrolled iteration copies.

The chosen default unroll factor is the result of microbenchmarks on gfx1030. This change leads to speedups of 15-38% for global memory and 1.9-5.8x for scratch in these microbenchmarks.

Part of SWDEV-455845.
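For illustration, a minimal C++ sketch of the idea behind the change: the hook that picks the loop lowering type returns a deliberately over-wide vector for large constant lengths, and legalization later splits the resulting accesses into multiple legal load/store pairs per iteration. The function name, the 16-byte base width, and the unroll factor of 8 are hypothetical placeholders, not the exact API or tuned values of the actual patch.

```cpp
// Sketch only: not the real GCN implementation.
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"

using namespace llvm;

// Pick the per-iteration access type for a lowered memcpy/memmove loop.
// The 1024-byte threshold mirrors the "large constant length" case from
// the commit message; BaseBytes and the unroll factor are illustrative.
static Type *pickMemcpyLoopLoweringType(LLVMContext &Ctx,
                                        uint64_t ConstantLength) {
  const unsigned BaseBytes = 16; // one legal 4 x i32 access per iteration
  // For large constant lengths, request a wider (possibly illegal) vector;
  // type legalization splits it into several legal 16-byte accesses, so the
  // loop body ends up with multiple load/store pairs per iteration.
  const unsigned UnrollFactor = (ConstantLength >= 1024) ? 8 : 1;
  const unsigned NumI32 = (BaseBytes / 4) * UnrollFactor;
  return FixedVectorType::get(Type::getInt32Ty(Ctx), NumI32);
}
```

The key design point is that the lowering code itself stays a simple single-access-per-iteration loop; the widening happens purely through the chosen element type, and legalization does the splitting.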
Diffstat (limited to 'llvm/lib/CodeGen/MachineOperand.cpp')
0 files changed, 0 insertions, 0 deletions