riscv-gnu-toolchain/llvm.git

author	Durgadoss R <durgadossr@nvidia.com>	2024-11-07 15:21:53 +0530
committer	GitHub <noreply@github.com>	2024-11-07 15:21:53 +0530
commit	1b01064faad2cd93c516341cfaf047b7a0f8da42 (patch)
tree	e84a1dd29dec16a07b63bf6aca3376eb717bb0de /llvm/lib/Target/ARM/Disassembler/ARMDisassembler.cpp
parent	2d7f34f2a5df9396a33a0ea044cfe3ddf33e1f5c (diff)

[NVPTX] Add TMA bulk tensor copy intrinsics (#96083)

This patch adds NVVM intrinsics and NVPTX codegen for:

* cp.async.bulk.tensor.S2G.1D -> 5D variants, supporting both Tile and
  Im2Col modes. These intrinsics optionally support cache_hints, as
  indicated by the boolean flag argument.
* cp.async.bulk.tensor.G2S.1D -> 5D variants, with support for both Tile
  and Im2Col modes. The Im2Col variants have an extra set of offsets as
  parameters. These intrinsics optionally support multicast and
  cache_hints, as indicated by the boolean arguments at the end of the
  intrinsics.
* The backend looks through these flag arguments and lowers to the
  appropriate PTX instruction.
* Lit tests are added for all combinations of these intrinsics in
  cp-async-bulk-tensor-g2s/s2g.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.
* PTX Spec reference:
  https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
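To illustrate how the boolean flag arguments might map onto the PTX instruction, here is a minimal sketch for the G2S direction. It is not the code from this patch: the helper names g2sMnemonic and im2colOffsetCount are hypothetical, and the qualifier order follows the PTX spec linked above as I understand it. The sketch also assumes, per that spec, that Im2Col mode applies only to 3D and higher tensors and takes (dim - 2) offsets.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the LLVM sources): builds the PTX mnemonic
// for a global -> shared::cluster (G2S) bulk tensor copy from the dimension,
// the load mode, and the two optional-feature flags.
std::string g2sMnemonic(int dim, bool im2col, bool multicast, bool cacheHint) {
    if (dim < 1 || dim > 5)
        throw std::invalid_argument("dim must be in [1, 5]");
    // Assumption from the PTX spec: im2col is defined for 3D and above only.
    if (im2col && dim < 3)
        throw std::invalid_argument("im2col requires dim >= 3");

    std::string s = "cp.async.bulk.tensor." + std::to_string(dim) + "d";
    s += ".shared::cluster.global";
    s += im2col ? ".im2col" : ".tile";
    s += ".mbarrier::complete_tx::bytes";
    // Each flag only appends its qualifier; when a flag is false, the
    // corresponding operand would simply be dropped from the instruction.
    if (multicast) s += ".multicast::cluster";
    if (cacheHint) s += ".L2::cache_hint";
    return s;
}

// Hypothetical helper: number of extra im2col offset operands, assumed to be
// two fewer than the number of tensor dimensions.
int im2colOffsetCount(int dim, bool im2col) {
    return im2col ? dim - 2 : 0;
}
```

For example, g2sMnemonic(3, true, true, true) yields the 3D Im2Col form with both the multicast and cache-hint qualifiers, while passing false for both flags yields the plain form. The actual backend selects among predefined instruction variants rather than building strings, so treat this only as a picture of the flag-to-qualifier mapping.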

Diffstat (limited to 'llvm/lib/Target/ARM/Disassembler/ARMDisassembler.cpp')

0 files changed, 0 insertions, 0 deletions

