path: root/llvm/lib/Target/NVPTX
Age  Commit message  (Author, files, lines)
2024-03-29  Reland "[NVPTX] Use .common linkage for common globals" (#86824)  (Alex MacLean, 1 file, -7/+9)
Switch from `.weak` to `.common` linkage for common global variables where possible. The `.common` linkage is described in [PTX ISA 11.6.4. Linking Directives: .common](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common):

> Declares identifier to be globally visible but “common”.
>
> Common symbols are similar to globally visible symbols. However multiple object files may declare the same common symbol and they may have different types and sizes and references to a symbol get resolved against a common symbol with the largest size.
>
> Only one object file can initialize a common symbol and that must have the largest size among all other definitions of that common symbol from different object files.
>
> .common linking directive can be used only on variables with .global storage. It cannot be used on function symbols or on symbols with opaque type.

I've updated the logic and tests to only use `.common` for PTX 5.0 or greater and verified that the new tests now pass with `ptxas`.
2024-03-26  [NFC][NVPTX] remove truncating c-style cast (#85889)  (Alex MacLean, 1 file, -1/+1)
While a stack size large enough for this truncation to be a problem would certainly cause other issues and would not produce a valid program anyway, the cast was triggering our Coverity static analysis. Removing it seems cleaner.
2024-03-23  [NewPM][NVPTX] Add NVPTXPassRegistry.def NFCI (#86246)  (paperchalice, 2 files, -39/+42)
Prepare for dag-isel migration.
2024-03-19  [NVPTX][DebugInfo] avoid emitting extra .loc directives (#84584)  (Alex MacLean, 1 file, -2/+5)
This change removes an extra, unneeded debug directive emitted in the PTX at the beginning of non-empty functions:

```nvptx
.visible .func (.param .b32 func_retval0) foo(
        .param .b32 foo_param_0,
        .param .b32 foo_param_1
)
{
        .reg .b32 %r<4>;
        .loc 1 26 0        <---- unneeded (removed by the PR)
$L__func_begin0:
        .loc 1 26 0
```
2024-03-19  [NVPTX] Use PTX prmt for llvm.bswap (#85545)  (Alex MacLean, 2 files, -3/+19)
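For illustration (a sketch, not from the patch): a byte swap in device code. `__builtin_bswap32` lowers to `llvm.bswap`, which this change selects as a single PTX `prmt.b32` rather than a shift/or sequence.

```cuda
// Hypothetical kernel: __builtin_bswap32 is a clang builtin mapping to
// llvm.bswap; after this patch NVPTX can select it as one prmt.b32.
__global__ void bswap32(unsigned *out, const unsigned *in) {
  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = __builtin_bswap32(in[i]);  // e.g. 0x11223344 -> 0x44332211
}
```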
2024-03-19  [NFC][RemoveDIs] Use iterators for insertion at various call-sites (#84736)  (Jeremy Morse, 3 files, -9/+9)
These are the last remaining "trivial" changes to passes that use Instruction pointers for insertion. All of this should be NFC; it's just changing the spelling of how we identify a position. In one or two locations, I'm also switching uses of getNextNode etc. to std::next with iterators. This too should be NFC.

Merged by: Stephen Tozer <stephen.tozer@sony.com>
2024-03-15  Revert "[NVPTX] Use .common linkage for common globals (#84416)"  (Sterling Augustine, 1 file, -9/+7)
This reverts commit 8f0012d3dc2ae6d40e9f812cae111ca7a6eb2a2d. The common-linkage.ll test fails with ptxas enabled.
2024-03-15  [NVPTX] support dynamic allocas with PTX alloca instruction (#84585)  (Alex MacLean, 4 files, -14/+63)
Add support for dynamically sized alloca instructions using the PTX `alloca` instruction introduced in PTX 7.3 ([9.7.15.3. Stack Manipulation Instructions: alloca](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#stack-manipulation-instructions-alloca)).
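A minimal sketch of what this enables, assuming a toolchain new enough to emit PTX 7.3 (hypothetical device function, not from the patch):

```cuda
// Hypothetical device function: the alloca size is only known at run
// time, so it must lower to the PTX alloca instruction rather than a
// fixed stack slot (previously such code could not be compiled).
__device__ int sum_prefix(const int *src, int n) {
  int *buf = static_cast<int *>(__builtin_alloca(n * sizeof(int)));
  int total = 0;
  for (int i = 0; i < n; ++i) {
    buf[i] = src[i];  // stage through the dynamically sized buffer
    total += buf[i];
  }
  return total;
}
```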
2024-03-15  Reland "[NVPTX] Add support for atomic add for f16 type" (#85197)  (Adrian Kuegel, 2 files, -1/+19)
atom.add.noftz.f16 is supported since SM 7.0
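For illustration, the CUDA-level operation this enables (a sketch assuming sm_70 or newer, where `atomicAdd` on `__half` is available):

```cuda
#include <cuda_fp16.h>
// Sketch: on sm_70+, this atomicAdd on __half can now be lowered to an
// atom.add.noftz.f16 instruction instead of an atomic-expansion loop.
__global__ void accumulate(__half *acc, const __half *vals) {
  atomicAdd(acc, vals[threadIdx.x]);
}
```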
2024-03-14  [NVPTX] Use .common linkage for common globals (#84416)  (Alex MacLean, 1 file, -7/+9)
Switch from `.weak` to `.common` linkage for common global variables where possible. The `.common` linkage is described in [PTX ISA 11.6.4. Linking Directives: .common](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common):

> Declares identifier to be globally visible but “common”.
>
> Common symbols are similar to globally visible symbols. However multiple object files may declare the same common symbol and they may have different types and sizes and references to a symbol get resolved against a common symbol with the largest size.
>
> Only one object file can initialize a common symbol and that must have the largest size among all other definitions of that common symbol from different object files.
>
> .common linking directive can be used only on variables with .global storage. It cannot be used on function symbols or on symbols with opaque type.
2024-03-12  Revert "[NVPTX] Add support for atomic add for f16 type" (#84918)  (Danial Klimkin, 2 files, -18/+0)
Reverts llvm/llvm-project#84295 due to breakages.
2024-03-12  [NVPTX] Add support for atomic add for f16 type (#84295)  (Adrian Kuegel, 2 files, -0/+18)
atom.add.noftz.f16 is supported since SM 7.0
2024-03-05  [NVPTX] Remove sub.s16x2 instruction  (Benjamin Kramer, 2 files, -2/+0)
According to the PTX ISA, this instruction doesn't exist (and ptxas rejects it). See https://github.com/pytorch/pytorch/issues/118589
2024-02-25  [CodeGen] Port AtomicExpand to new Pass Manager (#71220)  (Rishabh Bali, 1 file, -1/+1)
Port the `atomicexpand` pass to the new Pass Manager. Fixes #64559
2024-02-22  [NVPTX] fixup support for unaligned parameters and returns (#82562)  (Alex MacLean, 3 files, -20/+271)
Add support for unaligned parameters and return values. These must be loaded and stored one byte at a time, after which bit manipulation is used to assemble the correct final result.
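The shape of that lowering, sketched at the source level (illustrative device code, not the actual SDAG lowering):

```cuda
// Sketch of the byte-wise assembly described above: an unaligned 32-bit
// value is read one byte at a time and recombined with shifts and ORs.
__device__ unsigned load_unaligned_u32(const unsigned char *p) {
  return (unsigned)p[0] | ((unsigned)p[1] << 8) |
         ((unsigned)p[2] << 16) | ((unsigned)p[3] << 24);
}
```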
2024-02-21  [NVPTX] Correctly guard int -> bf16 on PTX version and SM version  (David Majnemer, 2 files, -10/+12)
2024-02-21  [NVPTX] Simplify handling of ISD::BF16_TO_FP  (David Majnemer, 1 file, -4/+1)
We only use it to get from BF16 to F32. After that point, we insert an FP_EXTEND to get the rest of the way.
2024-02-21  [NVPTX] Set ISD::FP_{EXTEND,ROUND} to Custom for more types  (David Majnemer, 1 file, -2/+4)
Sometimes these nodes are queried with the non-bf16 type. We need to tell SDAG that we want to handle the non-bf16 side so that the handler can detect whether bf16 is being used on either side.
2024-02-21  Correctly round FP -> BF16 when SDAG expands such nodes (#82399)  (David Majnemer, 3 files, -5/+87)
We did something pretty naive:

- round FP64 -> BF16 by first rounding to FP32
- skip FP32 -> BF16 rounding entirely
- take the top 16 bits of an FP32, which will turn some NaNs into infinities

Let's do this in a more principled way by rounding types with more precision than FP32 to FP32 using round-inexact-to-odd, which will negate double rounding issues.
2024-02-13  [LLVM] Add `__builtin_readsteadycounter` intrinsic and builtin for realtime clocks (#81331)  (Joseph Huber, 3 files, -1/+3)
Summary: This patch adds a new intrinsic and builtin function mirroring the existing `__builtin_readcyclecounter`. The difference is that this implementation targets a separate counter, available on some targets, that returns a fixed-frequency clock usable for determining elapsed time; this differs from the cycle counter, which often has a variable frequency. This patch only adds support for the NVPTX and AMDGPU targets. This is done as a new and separate builtin rather than an argument to `readcyclecounter` to avoid needing to change existing code and to make the separation more explicit.
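A sketch of the two builtins side by side on a device target (illustrative kernel; on NVPTX the steady counter is expected to come from the fixed-frequency global timer):

```cuda
// Sketch: both builtins return a 64-bit count, but only the steady
// counter ticks at a fixed frequency suitable for measuring elapsed time.
__global__ void read_counters(unsigned long long *out) {
  out[0] = __builtin_readcyclecounter();   // cycles, variable frequency
  out[1] = __builtin_readsteadycounter();  // fixed-frequency clock
}
```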
2024-02-12  [NVPTX] Fix the error in a pattern match in v4i8 comparisons (#81308)  (Artem Belevich, 1 file, -23/+69)
The replacement should have used BFE() as the arguments for the comparison, not the source register. While at it, tighten the patterns a bit and expand them to cover variants with immediate arguments. Also change the default lowering of bfe() to the unsigned variant, so that the value of the upper bits is predictable.
2024-02-12  [NVPTX] Implement `__builtin_readcyclecounter` on NVPTX (#81344)  (Joseph Huber, 2 files, -0/+6)
Summary: This patch simply states that `__builtin_readcyclecounter` is legal on NVPTX and makes it return the value from the `clock64` sreg. The timer intrinsics are marked as having side effects, which is desirable for timing primitives and required to pattern match the intrinsic in the DAG.
2024-02-12  Fix use after free error in NVVMReflect (#81471)  (Petr, 1 file, -5/+13)
I have a Triton kernel that triggered a heap-use-after-free error in LLVM. The problem was that the same instruction could be added to the `ToSimplify` array multiple times. If this duplicated instruction is trivially dead, it gets deleted on the first pass; then, on the second pass, the already-freed instruction is processed. To fix this, the instructions are added to a `ToRemove` array, which is filtered for duplicates, to avoid possible double frees.
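A sketch of the shape of the fix in plain C++ (illustrative names, not the actual NVVMReflect code): deduplicate the deletion worklist so nothing is freed twice.

```cuda
#include <algorithm>
#include <vector>

struct Instr {};  // stand-in for llvm::Instruction

// Sketch: sort + unique removes duplicate entries before deletion, so an
// instruction that was queued twice is only freed once.
void deleteQueued(std::vector<Instr *> &ToRemove) {
  std::sort(ToRemove.begin(), ToRemove.end());
  ToRemove.erase(std::unique(ToRemove.begin(), ToRemove.end()),
                 ToRemove.end());
  for (Instr *I : ToRemove)
    delete I;
}
```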
2024-02-12  Do not use PerformEXTRACTCombine for v8i8 types (#81242)  (Adrian Kuegel, 1 file, -3/+4)
Same as with v4i8 types, we should not be using PerformEXTRACTCombine for v8i8 types.
2024-02-09  [NVVMReflect] Improve folding inside of the NVVMReflect pass (#81253)  (Joseph Huber, 1 file, -53/+17)
Summary: The previous patch did very simple folding that only worked for directly used branches. This patch improves this by traversing the use-def chain to simplify every constant subexpression until it reaches a terminator we can delete. The support should work for all expected cases now.
2024-02-08  [NVVMReflect][Reland] Force dead branch elimination in NVVMReflect (#81189)  (Joseph Huber, 1 file, -0/+65)
Summary: The `__nvvm_reflect` function is used to guard invalid code that varies between architectures. One problem with this feature is that if it is used without optimizations, it will leave invalid code in the module that will then make it to the backend. The `__nvvm_reflect` pass is already mandatory, so it should do some trivial branch removal to ensure that constants are handled correctly. This dead branch elimination only works in the trivial case of a compare on a branch and does not touch any conditionals that were not related to the `__nvvm_reflect` call, in order to preserve `O0` semantics as much as possible. This should allow the following to work on NVPTX targets:

```c
int foo() {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
    asm("valid;\n");
}
```

Relanding after fixing a bug.
2024-02-08  Revert "[NVVMReflect] Force dead branch elimination in NVVMReflect (#81189)"  (Joseph Huber, 1 file, -62/+0)
This reverts commit 9211e67da36782db44a46ccb9ac06734ccf2570f.

Summary: This seemed to crash on one of the CUDA math tests. Revert until it can be fixed.
2024-02-08  [NVVMReflect] Force dead branch elimination in NVVMReflect (#81189)  (Joseph Huber, 1 file, -0/+62)
Summary: The `__nvvm_reflect` function is used to guard invalid code that varies between architectures. One problem with this feature is that if it is used without optimizations, it will leave invalid code in the module that will then make it to the backend. The `__nvvm_reflect` pass is already mandatory, so it should do some trivial branch removal to ensure that constants are handled correctly. This dead branch elimination only works in the trivial case of a compare on a branch and does not touch any conditionals that were not related to the `__nvvm_reflect` call, in order to preserve `O0` semantics as much as possible. This should allow the following to work on NVPTX targets:

```c
int foo() {
  if (__nvvm_reflect("__CUDA_ARCH") >= 700)
    asm("valid;\n");
}
```
2024-02-08  [NVPTX] Add support for calling aliases (#81170)  (Alex MacLean, 2 files, -30/+29)
The current implementation of aliases tries to remove all the aliases in the module to prevent the generic version of `AsmPrinter` from emitting them incorrectly. Unfortunately, if the aliases are used this will fail. Instead let's override the function to print aliases directly. In addition, the declarations of the alias functions must occur before the uses. To fix this we emit alias declarations as part of `emitDeclarations` and only emit the `.alias` directives at the end (where we can assume the aliasee has also already been declared).
2024-02-08  [NVPTX][NFC] cleanup dead vars, use MAKE_CASE (#81161)  (Alex MacLean, 2 files, -621/+428)
Clean up some dead variables. In addition, switch to a `MAKE_CASE` macro, similar to other targets, to reduce boilerplate.
2024-02-01  [TTI] Use Register in isLoadFromStackSlot and isStoreToStackSlot [nfc] (#80339)  (Philip Reames, 1 file, -2/+2)
2024-01-31  [NVPTX] improve Boolean ISel (#80166)  (Alex MacLean, 1 file, -0/+10)
Add TableGen patterns to convert more instructions to boolean expressions:

- **mul -> and/or**: i1 multiply instructions currently cannot be selected, causing the compiler to crash. See https://github.com/llvm/llvm-project/issues/57404
- **select -> and/or**: Converting selects to and/or can enable more optimizations. `InstCombine` cannot do this as aggressively due to poison semantics.
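The equivalences behind these patterns, sketched as device code (illustrative only, not the TableGen patterns themselves):

```cuda
// For i1 (boolean) values, multiplication is AND, and a select
// decomposes into AND/OR:  c ? x : y  ==  (c & x) | (~c & y).
__device__ bool mul_i1(bool x, bool y) { return x && y; }  // x * y
__device__ bool select_i1(bool c, bool x, bool y) {
  return (c && x) || (!c && y);                            // c ? x : y
}
```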
2024-01-29  Revert "Disable incorrect peephole optimizations" (#79916)  (Justin Fargnoli, 1 file, -0/+13)
This reverts commit ff77058141e8026357ca514ad0d45c6c50921290.
2024-01-29  Disable incorrect peephole optimizations  (Justin Fargnoli, 1 file, -13/+0)
2024-01-29  [NVPTX] Add builtin support for 'globaltimer' (#79765)  (Joseph Huber, 1 file, -0/+2)
Summary: This patch adds support for `globaltimer` to match `clock` and `clock64`. See the PTX ISA reference for details. This patch does not implement the `hi` or `lo` variants for brevity, as they can be obtained from this one at the cost of an additional register. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-globaltimer-globaltimer-lo-globaltimer-hi
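A sketch of reading the timer from device code, assuming the builtin is exposed as `__nvvm_read_ptx_sreg_globaltimer` (previously this required inline assembly):

```cuda
// Sketch: read the fixed-frequency %globaltimer special register via the
// builtin instead of asm volatile("mov.u64 %0, %globaltimer;" : "=l"(t)).
__global__ void stamp(unsigned long long *out) {
  out[0] = __nvvm_read_ptx_sreg_globaltimer();
}
```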
2024-01-29  [NVPTX] Add builtin for 'exit' handling (#79777)  (Joseph Huber, 1 file, -0/+3)
Summary: The PTX ISA has always supported the 'exit' instruction to terminate individual threads. This patch adds a builtin to handle it. See the PTX documentation for further details. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#control-flow-instructions-exit
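A sketch of per-thread termination, assuming the builtin is exposed as `__nvvm_exit` (illustrative kernel, not from the patch):

```cuda
// Sketch: odd-numbered threads terminate immediately; the rest continue.
// Previously this required inline assembly such as asm("exit;").
__global__ void keep_even_threads(int *out) {
  if (threadIdx.x & 1)
    __nvvm_exit();
  out[threadIdx.x] = threadIdx.x;
}
```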
2024-01-29  [NVPTX] Add builtin support for 'nanosleep' PTX instruction (#79888)  (Joseph Huber, 1 file, -0/+6)
Summary: This patch adds a builtin for the `nanosleep` PTX instruction. It takes either an immediate or a register argument t and sleeps for some duration in [0, 2t] nanoseconds. More information at the documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-nanosleep
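For example, a backoff loop, sketched with CUDA's documented `__nanosleep` wrapper (sm_70+), which the same PTX instruction backs:

```cuda
// Sketch: exponential backoff on a spin lock; each __nanosleep(t) pauses
// the calling thread for somewhere in [0, 2t] nanoseconds.
__device__ void lock(int *mutex) {
  unsigned ns = 8;
  while (atomicCAS(mutex, 0, 1) != 0) {
    __nanosleep(ns);
    if (ns < 256)
      ns *= 2;
  }
}
```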
2024-01-29  [NVPTX] Add 'activemask' builtin and intrinsic support (#79768)  (Joseph Huber, 2 files, -1/+7)
Summary: This patch adds support for getting the 'activemask' instruction's value without needing to use inline assembly. See the relevant PTX reference for details. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-activemask
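A sketch using CUDA's `__activemask()` wrapper, which (assumption) this intrinsic lets the compiler implement without inline assembly:

```cuda
// Sketch: each warp counts how many of its lanes are currently active.
__global__ void active_lanes(int *out) {
  unsigned mask = __activemask();        // bit i set <=> lane i is active
  if ((threadIdx.x & 31) == 0)
    out[threadIdx.x / 32] = __popc(mask);
}
```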
2024-01-26  [NVPTX] improve identifier renaming for PTX (#79459)  (Alex MacLean, 1 file, -3/+6)
Update `NVPTXAssignValidGlobalNames` to convert all characters which are illegal in PTX identifiers to `_$_`. ([PTX ISA: 4.4 Identifiers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers)).
2024-01-25  [llvm] Move CodeGenTypes library to its own directory (#79444)  (Nico Weber, 2 files, -2/+2)
Finally addresses https://reviews.llvm.org/D148769#4311232 :) No behavior change.
2024-01-24  [NVPTX] use incomplete aggregate initializers (#79062)  (Alex MacLean, 1 file, -2/+13)
The PTX ISA specifies that initializers may be incomplete ([5.4.4. Initializers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#initializers)):

> As in C, array initializers may be incomplete, i.e., the number of initializer elements may be less than the extent of the corresponding array dimension, with remaining array locations initialized to the default value for the specified array type.

Emitting initializers in this form is preferable because it reduces the size of the PTX, in some cases significantly, and can improve compile time of ptxas as a result.
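For example (hypothetical global), a mostly-zero array no longer needs all of its elements spelled out in the PTX initializer:

```cuda
// Sketch: only the two leading nonzero elements need to appear in the
// emitted PTX initializer; the remaining 1022 default to zero.
__device__ int table[1024] = {7, 42};
```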
2024-01-19  [LLVM][NVPTX] Add cp.async.bulk.commit/wait intrinsics (#78698)  (Durgadoss R, 1 file, -0/+16)
This patch adds NVVM intrinsics and NVPTX codegen for the bulk variants of the async-copy commit/wait instructions. lit tests are added to verify the generated PTX. PTX Doc link: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-commit-group Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-01-18  [NVPTX][NFC] Remove unused parameter of getArgumentAlignment (#78604)  (Alex MacLean, 2 files, -7/+6)
2024-01-17  [NVPTX] extend type support for nvvm.{min,max,mulhi,sad} (#78385)  (Alex MacLean, 1 file, -1/+12)
Ensure intrinsics and auto-upgrades support i16, i32, and i64 for `nvvm.{min,max,mulhi,sad}`:

- `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select` instructions, but it is still nice to support the 16-bit variants just in case any generators of IR are still trying to use these intrinsics.
- `nvvm.sad`: Added both the 16- and 64-bit variants, and also marked this intrinsic as speculatable. These directly correspond to the PTX `sad.{u16,s16,u64,s64}` instructions.
- `nvvm.mulhi`: Added the 16-bit variants. These directly correspond to the PTX `mul.hi.{s,u}16` instructions.
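For illustration, the long-standing 32-bit form (the patch adds analogous 16- and 64-bit variants; the exact new builtin names are not shown here):

```cuda
// Sketch: __nvvm_mulhi_i returns the high 32 bits of the 64-bit product,
// mapping to PTX mul.hi.s32; this patch adds the 16-bit (mul.hi.{s,u}16)
// and sad.{u16,s16,u64,s64} counterparts at the intrinsic level.
__global__ void high_product(int *out, const int *a, const int *b) {
  out[threadIdx.x] = __nvvm_mulhi_i(a[threadIdx.x], b[threadIdx.x]);
}
```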
2024-01-17  [NVPTX] Add tex.grad.cube{array} intrinsics (#77693)  (Alex MacLean, 5 files, -0/+129)
Extend IR support for PTX `tex` instruction described in [PTX ISA. 9.7.9.3. Texture Instructions: tex](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-instructions-tex). Add support for unified-move versions of `tex.grad.cube{array}` variants added in PTX ISA 4.3.
2024-01-16  [NVPTX] Fix generating permute bytes from register pair when the initial values are undefined (#74437)  (mmoadeli, 1 file, -2/+4)
When generating the permute bytes for the prmt instruction, the existence of an undefined initial value initializes the int32 that holds the mask with all 1's (0xFFFFFFFF). That initialization leads to complications during the subsequent OR operations, producing incorrect mask values for the following bytes. Consequently, the final value persists as a constant -1, irrespective of the actual mask values that follow the initial set value.
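A sketch of the bug's shape in plain C++ (illustrative, not the backend code): seeding the mask with all ones makes every later OR a no-op.

```cuda
// Sketch: building a prmt-style byte-selector mask four bits at a time.
// If the accumulator started at 0xFFFFFFFF (the buggy undef handling),
// each OR below would change nothing and the mask would stay -1.
unsigned buildSelectorMask(const unsigned char sel[4]) {
  unsigned mask = 0;  // start from zero so the ORs are meaningful
  for (int i = 0; i < 4; ++i)
    mask |= (unsigned)(sel[i] & 0xF) << (4 * i);
  return mask;
}
```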
2024-01-14  [Target] Use getConstantOperandVal (NFC)  (Kazu Hirata, 1 file, -3/+1)
2024-01-13  [LLVM][NVPTX]: Add aligned versions of cluster barriers (#77940)  (Durgadoss R, 1 file, -0/+10)
2024-01-10  [Target] Use getConstantOperandAPInt (NFC)  (Kazu Hirata, 1 file, -3/+1)
2024-01-09  [LLVM][NVPTX]: Add intrinsic for setmaxnreg (#77289)  (Durgadoss R, 2 files, -0/+16)
This patch adds an intrinsic for the `setmaxnreg` PTX instruction.

* PTX Doc link for this instruction: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-setmaxnreg
* The i32 argument, an immediate value, specifies the actual absolute register count for the instruction.
* The `setmaxnreg` instruction is available in SM90a, so this patch adds a 'hasSM90a' predicate for use in the NVPTX backend.
* lit tests are added to verify the lowering of the intrinsic.
* Verifier logic (and tests) are added to check the register-count range and the divisibility-by-8 requirement.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>