Age | Commit message (Collapse) | Author | Files | Lines |
|
Switch from `.weak` to `.common` linkage for common global variables
where possible. The `.common` linkage is described in
[PTX ISA 11.6.4. Linking Directives: .common]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common)
> Declares identifier to be globally visible but “common”.
>
>Common symbols are similar to globally visible symbols. However
multiple object files may declare the same common symbol and they may
have different types and sizes and references to a symbol get resolved
against a common symbol with the largest size.
>
>Only one object file can initialize a common symbol and that must have
the largest size among all other definitions of that common symbol from
different object files.
>
>.common linking directive can be used only on variables with .global
storage. It cannot be used on function symbols or on symbols with opaque
type.
I've updated the logic and tests to only use `.common` for PTX 5.0 or
greater and verified that the new tests now pass with `ptxas`.
|
|
While a stack size large enough to cause this truncation to be a problem
would certainly cause other issues and not produce a valid program
anyway, this cast is triggering our Coverity static analysis. Removing
it seems cleaner.
|
|
Prepare for dag-isel migration.
|
|
This change removes an extra, unneeded debug directive emitted in the
PTX at the beginning on non-empty functions:
```nvptx
.visible .func (.param .b32 func_retval0) foo(
.param .b32 foo_param_0,
.param .b32 foo_param_1
)
{
.reg .b32 %r<4>;
.loc 1 26 0 <---- unneeded (removed by the PR)
$L__func_begin0:
.loc 1 26 0
```
|
|
|
|
These are the last remaining "trivial" changes to passes that use
Instruction pointers for insertion. All of this should be NFC, it's just
changing the spelling of how we identify a position.
In one or two locations, I'm also switching uses of getNextNode etc to
using std::next with iterators. This too should be NFC.
---------
Merged by: Stephen Tozer <stephen.tozer@sony.com>
|
|
This reverts commit 8f0012d3dc2ae6d40e9f812cae111ca7a6eb2a2d.
The common-linkage.ll test fails with ptxas enabled.
|
|
Add support for dynamically sized alloca instructions with the PTX
alloca instruction introduced in PTX 7.3
([9.7.15.3. Stack Manipulation Instructions: alloca]
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#stack-manipulation-instructions-alloca))
|
|
atom.add.noftz.f16 is supported since SM 7.0
|
|
Switch from `.weak` to `.common` linkage for common global variables
where possible. The `.common` linkage is described in [PTX ISA 11.6.4.
Linking Directives:
.common](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#linking-directives-common)
> Declares identifier to be globally visible but “common”.
>
>Common symbols are similar to globally visible symbols. However
multiple object files may declare the same common symbol and they may
have different types and sizes and references to a symbol get resolved
against a common symbol with the largest size.
>
>Only one object file can initialize a common symbol and that must have
the largest size among all other definitions of that common symbol from
different object files.
>
>.common linking directive can be used only on variables with .global
storage. It cannot be used on function symbols or on symbols with opaque
type.
|
|
Reverts llvm/llvm-project#84295 due to breakages.
|
|
atom.add.noftz.f16 is supported since SM 7.0
|
|
According to the PTX ISA this doesn't exist (and ptxas rejects it)
See https://github.com/pytorch/pytorch/issues/118589
|
|
Port the `atomicexpand` pass to the new Pass Manager.
Fixes #64559
|
|
Add support for unaligned parameters and return values. These must be
loaded and stored one byte at a time and then bit manipulation is used
to assemble the correct final result.
|
|
|
|
We only use it to get from BF16 to F32. After that point, we insert
an FP_EXTEND to get the rest of the way.
|
|
Sometimes those nodes are queried with the non-bf16. We need to request
to SDAG that we want to handle the non-bf16 side so that the handler can
detect if bf16 is being used on either side.
|
|
We did something pretty naive:
- round FP64 -> BF16 by first rounding to FP32
- skip FP32 -> BF16 rounding entirely
- taking the top 16 bits of a FP32 which will turn some NaNs into
infinities
Let's do this in a more principled way by rounding types with more
precision than FP32 to FP32 using round-inexact-to-odd which will negate
double rounding issues.
|
|
clocks (#81331)
Summary:
This patch adds a new intrinsic and builtin function mirroring the
existing `__builtin_readcyclecounter`. The difference is that this
implementation targets a separate counter that some targets have which
returns a fixed frequency clock that can be used to determine elapsed
time, this is different compared to the cycle counter which often has
variable frequency.
This patch only adds support for the NVPTX and AMDGPU targets.
This is done as a new and separate builtin rather than an argument to
`readcyclecounter` to avoid needing to change existing code and to make
the separation more explicit.
|
|
The replacement should've had BFE() as the arguments for the comparison,
not the source register.
While at that, tighten the patterns a bit, and expand them to cover
variants with immediate arguments. Also change the default lowering of
bfe() to use unsigned variant, so the value of the upper bits is
predictable.
|
|
Summary:
This patch simply states that `__builtin_readcyclecounter` is legal on
NVPTX and makes it return the value from the `clock64` sreg. The timer
intrinsics are marked as having side effects, which is desireable for
timing primitives and required to pattern match the instrinic DAG.
|
|
I have a Triton kernel, which triggered a heap-use-after-free error in
LLVM.
The problem was that the same instruction may be added to the
`ToSimplify` array multiple times. If this duplicate instruction is
trivially dead, it gets deleted on the first pass. Then, on the second
pass, the freed instruction is passed.
To fix this, I'm adding the instructions to the `ToRemove` array and
filter it out for duplicates to avoid possible double frees.
|
|
Same as with v4i8 types, we should not be using PerformEXTRACTCombine
for v8i8 types.
|
|
Summary:
The previous patch did very simple folding that only worked for driectly
used branches. This patch improves this by traversing the use-def chain
to sipmlify every constant subexpression until it reaches a terminator
we can delete. The support should work for all expected cases now.
|
|
Summary:
The `__nvvm_reflect` function is used to guard invalid code that varies
between architectures. One problem with this feature is that if it is
used without optimizations, it will leave invalid code in the module
that will then make it to the backend. The `__nvvm_reflect` pass is
already mandatory, so it should do some trivial branch removal to ensure
that constants are handled correctly. This dead branch elimination only
works in the trivial case of a compare on a branch and does not touch
any conditionals that were not realted to the `__nvvm_reflect` call in
order to preserve `O0` semantics as much as possible. This should allow
the following to work on NVPTX targets
```c
int foo() {
if (__nvvm_reflect("__CUDA_ARCH") >= 700)
asm("valid;\n");
}
```
Relanding after fixing a bug.
|
|
This reverts commit 9211e67da36782db44a46ccb9ac06734ccf2570f.
Summary:
This seemed to crash one one of the CUDA math tests. Revert until it can
be fixed.
|
|
Summary:
The `__nvvm_reflect` function is used to guard invalid code that varies
between architectures. One problem with this feature is that if it is
used without optimizations, it will leave invalid code in the module
that will then make it to the backend. The `__nvvm_reflect` pass is
already mandatory, so it should do some trivial branch removal to ensure
that constants are handled correctly. This dead branch elimination only
works in the trivial case of a compare on a branch and does not touch
any conditionals that were not realted to the `__nvvm_reflect` call in
order to preserve `O0` semantics as much as possible. This should allow
the following to work on NVPTX targets
```c
int foo() {
if (__nvvm_reflect("__CUDA_ARCH") >= 700)
asm("valid;\n");
}
```
|
|
The current implementation of aliases tries to remove all the aliases in
the module to prevent the generic version of `AsmPrinter` from emitting
them incorrectly. Unfortunately, if the aliases are used this will fail.
Instead let's override the function to print aliases directly.
In addition, the declarations of the alias functions must occur before
the uses. To fix this we emit alias declarations as part of
`emitDeclarations` and only emit the `.alias` directives at the end
(where we can assume the aliasee has also already been declared).
|
|
Cleanup some dead variables. In addition, switch to a `MAKE_CASE` macro,
similar to other targets, to reduce boilerplate.
|
|
|
|
Add TableGen patterns to convert more instructions to boolean
expressions:
- **mul -> and/or**: i1 multiply instructions currently cannot be
selected causing the compiler to crash. See
https://github.com/llvm/llvm-project/issues/57404
- **select -> and/or**: Converting selects to and/or can enable more
optimizations. `InstCombine` cannot do this as aggressively due to
poison semantics.
|
|
This reverts commit ff77058141e8026357ca514ad0d45c6c50921290.
|
|
|
|
Summary:
This patch adds support for `globaltimer` to match `clock` and
`clock64`. See the PTX ISA reference for details. This patch does not
implement the `hi` or `lo` variants for brevity as they can be obtained
from this with the cost of an additional register.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-globaltimer-globaltimer-lo-globaltimer-hi
|
|
Summary:
The PTX ISA has always supported the 'exit' instruction to terminate
individual threads. This patch adds a builtin to handle it. See the PTX
documentation for further details.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#control-flow-instructions-exit
|
|
Summary:
This patch adds a builtin for the `nanosleep` PTX function. It takes
either an immediate or a register and sleeps for [0, 2t] nanoseconds
given t. More information at the documentation:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-nanosleep
|
|
Summary:
This patch adds support for getting the 'activemask' instruction's value
without needing to use inline assembly. See the relevant PTX reference
for details.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-activemask
|
|
Update `NVPTXAssignValidGlobalNames` to convert all characters which are
illegal in PTX identifiers to `_$_`. ([PTX ISA: 4.4
Identifiers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers)).
|
|
Finally addresses https://reviews.llvm.org/D148769#4311232 :)
No behavior change.
|
|
The PTX ISA specifies that initializers may be incomplete ([5.4.4.
Initializers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#initializers))
> As in C, array initializers may be incomplete, i.e., the number of
initializer elements may be less than the extent of the corresponding
array dimension, with remaining array locations initialized to the
default value for the specified array type.
Emitting initializers in this form is preferable because it reduces the
size of the PTX, in some cases significantly, and can improve compile
time of ptxas as a result.
|
|
This patch adds NVVM intrinsics and NVPTX codegen for the bulk variants
of the async-copy commit/wait instructions.
lit tests are added to verify the generated PTX.
PTX Doc link:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-commit-group
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
|
|
|
|
Ensure intrinsics and auto-upgrades support i16, i32, and i64 for for
`nvvm.{min,max,mulhi,sad}`
- `nvvm.min` and `nvvm.max`: These are auto-upgraded to `select`
instructions but it is still nice to support the 16 bit variants just in
case any generators of IR are still trying to use these intrinsics.
- `nvvm.sad` added both the 16 and 64 bit variants, also marked this
instruction as speculateble. These directly correspond to the PTX
`sad.{u16,s16,u64,s64}` instructions.
- `nvvm.mulhi` added the 16 bit variants. These directly correspond to
the PTX `mul.hi.{s,u}16` instructions.
|
|
Extend IR support for PTX `tex` instruction described in [PTX ISA.
9.7.9.3. Texture Instructions:
tex](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-instructions-tex).
Add support for unified-move versions of `tex.grad.cube{array}` variants
added in PTX ISA 4.3.
|
|
values are undefined (#74437)
When generating the permute bytes for the prmt instruction, the
existence of an undefined initial value initialises the int32 that holds
the mask with all 1's (0xFFFFFFFF). That initialization subsequently
leads to complications during the subsequent OR operation, leading to
inaccuracies in populating mask values for the following bytes.
Consequently, the final value persists as a constant -1, irrespective of
the actual mask values that succeed the initial set value.
|
|
|
|
|
|
|
|
This patch adds an intrinsic for setmaxnreg PTX instruction.
* PTX Doc link for this instruction:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-setmaxnreg
* The i32 argument, an immediate value, specifies the actual
absolute register count for the instruction.
* The `setmaxnreg` instruction is available in SM90a.
So, this patch adds 'hasSM90a' predicate to use in
the NVPTX backend.
* lit tests are added to verify the lowering of the intrinsic.
* Verifier logic (and tests) are added to test the register
count range and divisibility-by-8 requirements.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
|