author | Joseph Huber <35342157+jhuber6@users.noreply.github.com> | 2023-10-27 14:55:37 -0500 |
---|---|---|
committer | GitHub <noreply@github.com> | 2023-10-27 14:55:37 -0500 |
commit | 8e447a123b0006b4c14fcb61049559039b3da32f (patch) | |
tree | d1dded85da1afa2ff47dbcaa8e619478c39497ac /llvm/lib/CodeGen/MachineBasicBlock.cpp | |
parent | 741579930e9b49cee9d859a7468f5e9d0c6f2006 (diff) | |
[libc] Optimize the RPC memory copy for the AMDGPU target (#70467)
Summary:
We previously changed the GPU targets to use the builtin
implementations of the memory copy functions. However, this had the
negative effect of massively increasing register usage when using the
printing interface. For example, a `printf` call went from using 25
VGPRs to 54 simply because of the builtin. We probably still want to
export the builtin, but for the RPC interface we heavily prefer small
resource usage over the performance gains of fully unrolling this loop.
For NVPTX, however, the builtin implementation causes the resource usage
to go down (36 registers total for a regular `fputs` call), so we will
keep that implementation there.
I think specializing this is the right call as we will always prefer the
implementation with the smallest resource footprint for this interface,
as performance is already going to be heavily bottlenecked by the use of
fine-grained memory.
Diffstat (limited to 'llvm/lib/CodeGen/MachineBasicBlock.cpp')
0 files changed, 0 insertions, 0 deletions