aboutsummaryrefslogtreecommitdiff
path: root/llvm/lib/Object/ELFObjectFile.cpp
diff options
context:
space:
mode:
authorChengjun <chengjunp@Nvidia.com>2025-09-19 15:37:41 -0700
committerGitHub <noreply@github.com>2025-09-19 15:37:41 -0700
commit4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76 (patch)
tree3d5a0f6e501e5906c8eba067ce5bfd9121f13435 /llvm/lib/Object/ELFObjectFile.cpp
parentb4a17b13b7327e583fb16384004155508f31a09d (diff)
downloadllvm-4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76.zip
llvm-4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76.tar.gz
llvm-4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76.tar.bz2
[SROA] Use tree-structure merge to remove alloca (#152793)
This patch introduces a new optimization in SROA that handles the pattern where multiple non-overlapping vector `store`s completely fill an `alloca`. The current approach to handle this pattern introduces many `.vecexpand` and `.vecblend` instructions, which can dramatically slow down compilation when dealing with large `alloca`s built from many small vector `store`s. For example, consider an `alloca` of type `<128 x float>` filled by 64 `store`s of `<2 x float>` each. The current implementation requires: - 64 `shufflevector`s( `.vecexpand`) - 64 `select`s ( `.vecblend` ) - All operations use masks of size 128 - These operations form a long dependency chain This kind of IR is both difficult to optimize and slow to compile, particularly impacting the `InstCombine` pass. This patch introduces a tree-structured merge approach that significantly reduces the number of operations and improves compilation performance. Key features: - Detects when vector `store`s completely fill an `alloca` without gaps - Ensures no loads occur in the middle of the store sequence - Uses a tree-based approach with `shufflevector`s to merge stored values - Reduces the number of intermediate operations compared to linear merging - Eliminates the long dependency chains that hurt optimization Example transformation: ``` // Before: (stores do not have to be in order) %alloca = alloca <8 x float> store <2 x float> %val0, ptr %alloca ; offset 0-1 store <2 x float> %val2, ptr %alloca+16 ; offset 4-5 store <2 x float> %val1, ptr %alloca+8 ; offset 2-3 store <2 x float> %val3, ptr %alloca+24 ; offset 6-7 %result = load <8 x float>, ptr %alloca // After (tree-structured merge): %shuffle0 = shufflevector %val0, %val1, <4 x i32> <i32 0, i32 1, i32 2, i32 3> %shuffle1 = shufflevector %val2, %val3, <4 x i32> <i32 0, i32 1, i32 2, i32 3> %result = shufflevector %shuffle0, %shuffle1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7> ``` Benefits: - Logarithmic depth (O(log n)) instead of linear dependency chains - Fewer total operations for large vectors - Better optimization opportunities for subsequent passes - Significant compilation time improvements for large vector patterns For some large cases, the compile time can be reduced from about 60s to less than 3s. --------- Co-authored-by: chengjunp <chengjunp@nividia.com>
Diffstat (limited to 'llvm/lib/Object/ELFObjectFile.cpp')
0 files changed, 0 insertions, 0 deletions