diff options
author | Chengjun <chengjunp@Nvidia.com> | 2025-09-19 15:37:41 -0700 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-09-19 15:37:41 -0700 |
commit | 4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76 (patch) | |
tree | 3d5a0f6e501e5906c8eba067ce5bfd9121f13435 /llvm/lib/Object/ELFObjectFile.cpp | |
parent | b4a17b13b7327e583fb16384004155508f31a09d (diff) | |
download | llvm-4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76.zip llvm-4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76.tar.gz llvm-4bc9d29fbaa7f0d0a3b522e1e085e228d5d40d76.tar.bz2 |
[SROA] Use tree-structure merge to remove alloca (#152793)
This patch introduces a new optimization in SROA that handles the
pattern where multiple non-overlapping vector `store`s completely fill
an `alloca`.
The current approach to handle this pattern introduces many `.vecexpand`
and `.vecblend` instructions, which can dramatically slow down
compilation when dealing with large `alloca`s built from many small
vector `store`s. For example, consider an `alloca` of type `<128 x
float>` filled by 64 `store`s of `<2 x float>` each. The current
implementation requires:
- 64 `shufflevector`s( `.vecexpand`)
- 64 `select`s ( `.vecblend` )
- All operations use masks of size 128
- These operations form a long dependency chain
This kind of IR is both difficult to optimize and slow to compile,
particularly impacting the `InstCombine` pass.
This patch introduces a tree-structured merge approach that
significantly reduces the number of operations and improves compilation
performance.
Key features:
- Detects when vector `store`s completely fill an `alloca` without gaps
- Ensures no loads occur in the middle of the store sequence
- Uses a tree-based approach with `shufflevector`s to merge stored
values
- Reduces the number of intermediate operations compared to linear
merging
- Eliminates the long dependency chains that hurt optimization
Example transformation:
```
// Before: (stores do not have to be in order)
%alloca = alloca <8 x float>
store <2 x float> %val0, ptr %alloca ; offset 0-1
store <2 x float> %val2, ptr %alloca+16 ; offset 4-5
store <2 x float> %val1, ptr %alloca+8 ; offset 2-3
store <2 x float> %val3, ptr %alloca+24 ; offset 6-7
%result = load <8 x float>, ptr %alloca
// After (tree-structured merge):
%shuffle0 = shufflevector %val0, %val1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%shuffle1 = shufflevector %val2, %val3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%result = shufflevector %shuffle0, %shuffle1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```
Benefits:
- Logarithmic depth (O(log n)) instead of linear dependency chains
- Fewer total operations for large vectors
- Better optimization opportunities for subsequent passes
- Significant compilation time improvements for large vector patterns
For some large cases, the compile time can be reduced from about 60s to
less than 3s.
---------
Co-authored-by: chengjunp <chengjunp@nividia.com>
Diffstat (limited to 'llvm/lib/Object/ELFObjectFile.cpp')
0 files changed, 0 insertions, 0 deletions