Age | Commit message (Collapse) | Author | Files | Lines |
|
Reverts llvm/llvm-project#148758
[As
requested.](https://github.com/llvm/llvm-project/pull/148758#pullrequestreview-3076627201)
|
|
This patch implements the `llvm.loop.estimated_trip_count` metadata
discussed in [[RFC] Fix Loop Transformations to Preserve Block
Frequencies](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785).
As [suggested in the RFC
comments](https://discourse.llvm.org/t/rfc-fix-loop-transformations-to-preserve-block-frequencies/85785/4),
it adds the new metadata to all loops at the time of profile ingestion
and estimates each trip count from the loop's `branch_weights` metadata.
As [suggested in the PR #128785
review](https://github.com/llvm/llvm-project/pull/128785#discussion_r2151091036),
it does so via a new `PGOEstimateTripCountsPass` pass, which creates the
new metadata for each loop but omits the value if it cannot estimate a
trip count due to the loop's form.
An important observation not previously discussed is that
`PGOEstimateTripCountsPass` *often* cannot estimate a loop's trip count,
but later passes can sometimes transform the loop in a way that makes it
possible. Currently, such passes do not necessarily update the metadata,
but eventually that should be fixed. Until then, if the new metadata has
no value, `llvm::getLoopEstimatedTripCount` disregards it and tries
again to estimate the trip count from the loop's current
`branch_weights` metadata.
|
|
As mentioned in https://github.com/llvm/llvm-project/pull/145071,
LoopInterchange should be part of the optimization pipeline rather than
the simplification pipeline. This patch moves LoopInterchange into the
optimization pipeline.
More contexts:
- By default, LoopInterchange attempts to improve data locality,
however, it also takes increasing vectorization opportunities into
account. Given that, it is reasonable to run it as close to
vectorization as possible.
- I looked into previous changes related to the placement of
LoopInterchange, but couldn’t find any strong motivation suggesting that
it benefits other simplifications.
- As far as I tried some tests (including llvm-test-suite), removing
LoopInterchange from the simplification pipeline does not affect other
simplifications. Therefore, there doesn't seem to be much value in
keeping it there.
- The new position reduces compile-time for ThinLTO, probably because it
only runs once per function in post-link optimization, rather than both
in pre-link and post-link optimization.
I haven't encountered any cases where the positional difference affects
optimization results, so please feel free to revert if you run into any issues.
|
|
The entry count of a function needs to be updated after a callsite is elided by TRE: before elision, the entry count accounted for the recursive call at that callsite. After TRE, we need to remove that callsite's contribution.
This patch enables this for instrumented profiling cases because, there, we know the function entry count captured entries before TRE. We cannot currently address this for sample-based (because we don't know whether this function was TRE-ed in the binary that donated samples)
|
|
This is a fairly exotic pass, I don't think it makes a lot of sense to
run it at O1, esp. as vectorization wouldn't run at O1 anyway.
|
|
This pass figures out whether inlining has exposed a constant address to
a lowered type test, and remove the test if so and the address is known
to pass the test. Unfortunately this pass ends up needing to reverse
engineer what LowerTypeTests did; this is currently inherent to the design
of ThinLTO importing where LowerTypeTests needs to run at the start.
Reviewers: teresajohnson
Reviewed By: teresajohnson
Pull Request: https://github.com/llvm/llvm-project/pull/141327
|
|
Most of the recent development on the MemProfiler has been on the Use part. The instrumentation has been quite stable for a while. As the complexity of the use grows (with undrifting, diagnostics etc) I figured it would be good to separate these two implementations.
|
|
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
|
|
(#131005)
When we enable EVL-based loop vectorization w/ predicated tail-folding,
each vectorized loop has effectively two induction variables: one
calculates the step using (VF x vscale) and the other one increases the
IV by values returned from experiment.get.vector.length. The former,
also known as canonical IV, is more favorable for analyses as it's
"countable" in the sense of SCEV; the latter (EVL-based IV), however, is
more favorable to codegen, at least for those that support scalable
vectors like AArch64 SVE and RISC-V.
The idea is that we use canonical IV all the way until the end of all
vectorizers, where we replace it with EVL-based IV using EVLIVSimplify
introduced here. Such that we can have the best from both worlds.
This Pass is enabled by default in RISC-V. However, since we haven't
really vectorize loops with predicate tail-folding by default, this Pass
is no-op at this moment.
|
|
In this change, NVPTX AA is moved before Basic AA to potentially improve
compile time. Additionally, it introduces a flag in the
`ExternalAAWrapper` that allows other backends to run their
target-specific AA passes before Basic AA, if desired.
The change works for both New Pass Manager and Legacy Pass Manager.
Original implementation by Princeton Ferro <pferro@nvidia.com>
|
|
If a hot callsite function is not inlined in the 1st build, inlining the
hot callsite in pre-link stage of SPGO 2nd build may lead to Function
Sample not found in profile file in link stage. It will miss some
profile info.
ThinLTO has already considered and dealed with it by setting
HotCallSiteThreshold to 0 to stop the inline. This patch just adds the
same processing for FullLTO.
|
|
Flatten the profile pre-thinlink so that ThinLTO has something to work with for the parts of the binary that aren't covered by contextual profiles. Post-thinlink, the flattener is re-run and will actually change profile info, but just for the modules containing contextual trees ("specialized modules"). For the rest, the flattener just yanks out the instrumentation.
|
|
When coroutines are used w/ both -ffat-lto-objects and -flto=thin,
the coroutine passes are not added to the optimization pipelines.
Ensure they are added before ModuleOptimization to generate a
working ELF object.
Fixes #134409.
|
|
|
|
Reverts llvm/llvm-project#128654
Breaks FatLTO
https://github.com/llvm/llvm-project/pull/128654#issuecomment-2700053700
|
|
specified (#128940)
This patch makes it easy to enable ctxprof instrumentation for targets where the build has a bunch of defaults for instrumented PGO that we want to inherit for ctxprof.
This is switching experimental defaults: we'll eventually enable ctxprof instrumentation through `PGOOpt` but that type is currently quite entangled and, for the time being, no point adding to that.
|
|
Contextual profiling identifies functions by GUID. Functions that may get overridden by the linker with a prevailing copy may have, during instrumentation, different variants in different modules. If these variants get inlined before linking (here I assume thinlto), they will identify themselves to the ctxprof runtime as their GUID, leading to issues - they may have different counter counts, for instance.
If we block their inlining in the pre-thinlink compilation, only the prevailing copy will survive post-thinlink and the confusion is avoided.
The change introduces a small pass just for this purpose, which marks any symbols that could be affected by the above as `noinline` (even if they were `alwaysinline`). We already carried out some inlining (via the preinliner), before instrumenting, so technically the `alwaysinline` directives were honored.
We could later (different patch) choose to mark them back to their original attribute (none or `alwaysinline`) post-thinlink, if we want to - but experimentally that doesn't really change much of the performance of the instrumented binary.
|
|
```
if (!isLTOPostLink(Phase))
CoroPM.addPass(CoroEarlyPass());
if (!isLTOPreLink(Phase))
// Other Coro passes
```
Followup to #126168.
|
|
|
|
When using FatLTO, it is common to want to enable certain types of whole
program optimizations (WPD) or security transforms (CFI), so that they
can be made available when performing LTO. However, these transforms
should not be used when compiling the non-LTO object code. Since the
frontend must emit different IR, we cannot simply clone the module and
optimize the LTO section and non-LTO section differently to work around
this. Instead, we need to remove any problematic instruction sequences.
This patch adds a new pass whose responsibility is to clean up the IR
in the FatLTO pipeline after creating the bitcode section, which is
after running the pre-link pipeline but before running module
optimization. This allows us to safely drop any conflicting instructions
or IR constructs that are inappropriate for non-LTO compilation.
|
|
ThinLTO delays handling of coroutines to ThinLTO backend.
However it's usually possible to use ThinLTO prelink objects for FullLTO.
In this case we have left-over coroutines which crash in codegen.
Issue #104525.
|
|
Helper for #126168.
`Phase` will be used in followup patches.
|
|
This introduces options `-floop-interchange` and `-fno-loop-interchange`
to enable/disable the loop-interchange pass. This is part of the work
that tries to get that pass enabled by default (#124911), where it was
remarked that a user facing option to control this would be convenient
to have. The option name is the same as GCC's.
|
|
Summary:
This code is intended to block transformations if the call isn't
present, however the way it's coded it silently lets it pass if the
definition doesn't exist at all. This previously was always valid since
we included the runtime as one giant blob so everything was always
there, but now that we want to move towards separate ones, it's not
quite correct.
|
|
Added an extension point after vectorizer passes in the PassBuilder.
Additionally, added extension points before and after vectorizer passes
in `buildLTODefaultPipeline`. Credit goes to @mshockwave for guiding me
through my first LLVM contribution (and my first open source
contribution in general!) :)
- Implemented `registerVectorizerEndEPCallback`
- Implemented `invokeVectorizerEndEPCallbacks`
- Added `VectorizerEndEPCallbacks` SmallVector
- Added a command line option `passes-ep-vectorizer-end` to
`NewPMDriver.cpp`
- `buildModuleOptimizationPipeline` now calls
`invokeVectorizerEndEPCallbacks`
- `buildO0DefaultPipeline` now calls `invokeVectorizerEndEPCallbacks`
- `buildLTODefaultPipeline` now calls BOTH
`invokeVectorizerStartEPCallbacks` and `invokeVectorizerEndEPCallbacks`
- Added LIT tests to `new-pm-defaults.ll`, `new-pm-lto-defaults.ll`,
`new-pm-O0-ep-callbacks.ll`, and `pass-pipeline-parsing.ll`
- Renamed `CHECK-EP-Peephole` to `CHECK-EP-PEEPHOLE` in
`new-pm-lto-defaults.ll` for consistency.
This code is intended for developers that wish to implement and run
custom passes after the vectorizer passes in the PassBuilder pipeline.
For example, in #91796, a pass was created that changed the induction
variables of vectorized code. This is right after the vectorization
passes.
|
|
[PassBuilder] Add RelLookupTableConverterPass to LTO
This patch adds RelLookupTableConverterPass into the LTO
post-link optimization pass pipeline. This optimization
converts lookup tables to relative lookup tables to make
them PIC-friendly, which is already included in the non-LTO
pass pipeline. This patch adds this optimization to the
post-link optimization pipeline to discover more
opportunities in the LTO context.
|
|
|
|
Generalize ShouldRunExtraVectorPasses to ShouldRunExtraPasses, to allow
re-use for other transformations.
PR: https://github.com/llvm/llvm-project/pull/118323
|
|
This implements a global function merging pass. Unlike traditional
function merging passes that use IR comparators, this pass employs a
structurally stable hash to identify similar functions while ignoring
certain constant operands. These ignored constants are tracked and
encoded into a stable function summary. When merging, instead of
explicitly folding similar functions and their call sites, we form a
merging instance by supplying different parameters via thunks. The
actual size reduction occurs when identically created merging instances
are folded by the linker.
Currently, this pass is wired to a pre-codegen pass, enabled by the
`-enable-global-merge-func` flag.
In a local merging mode, the analysis and merging steps occur
sequentially within a module:
- `analyze`: Collects stable function hashes and tracks locations of
ignored constant operands.
- `finalize`: Identifies merge candidates with matching hashes and
computes the set of parameters that point to different constants.
- `merge`: Uses the stable function map to optimistically create a
merged function.
We can enable a global merging mode similar to the global function
outliner
(https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753/),
which will perform the above steps separately.
- `-codegen-data-generate`: During the first round of code generation,
we analyze local merging instances and publish their summaries.
- Offline using `llvm-cgdata` or at link-time, we can finalize all these
merging summaries that are combined to determine parameters.
- `-codegen-data-use`: During the second round of code generation, we
optimistically create merging instances within each module, and finally,
the linker folds identically created merging instances.
Depends on #112664
This is a patch for
https://discourse.llvm.org/t/rfc-global-function-merging/82608.
|
|
Add support for enabling sample loader pass in O0 mode(under
`-fsample-profile-use`). This can help verify PGO raw profile count
quality or provide a more accurate performance proxy(predictor), as O0
mode has minimal or no compiler optimizations that might otherwise
impact profile count accuracy.
- Explicitly disable the sample loader inlining to ensure it only emits
sampling annotation.
- Use flattened profile for O0 mode.
- Add the pass after `AddDiscriminatorsPass` pass to work with
`-fdebug-info-for-profiling`.
|
|
(#114004)
|
|
Run ArgumentPromotion before IPSCCP in the LTO pipeline, to expose more
constants to be propagated. We also run PostOrderFunctionAttrs to
improve the information available to ArgumentPromotion's alias analysis,
and SROA to clean up allocas.
Relands #111163.
|
|
|
|
(#114547)
The early simplication pipeline is used in non-LTO and (Thin/Full)LTO
pre-link
stage. There are some passes that we want them in non-LTO mode, but not
at LTO
pre-link stage. The control is missing currently. This PR adds the
support. To
demonstrate the use, we only enable the internalization pass in non-LTO
mode for
AMDGPU because having it run in pre-link stage causes some issues.
|
|
(#114564)
This will allow more fine-grained control in the future.
|
|
In https://github.com/llvm/llvm-project/pull/109837, it sets a global
variable(`PGOInstrumentColdFunctionOnly`) in PassBuilderPipelines.cpp
which introduced a data race detected by TSan. To fix this, I decouple
the flag setting, the flags are now set
separately(`instrument-cold-function-only-path` is required to be used
with `--pgo-instrument-cold-function-only`).
|
|
(#112788)
We want to support CFI instrumentation for the bitcode section, without
miscompiling the object code portion of a FatLTO object. We can reuse
the existing mechanisms in the LowerTypeTestsPass to do that, by just
adding the pass to the FatLTO pipeline after the EmbedBitcodePass with
the correct options set.
Fixes #112053
|
|
This reverts commit e517cfc531886bf6ed64b4e7109bb3141ac7f430.
|
|
Currently, the `DropTypeTests` parameter only fully works with phi nodes
and llvm.assume instructions. However, we'd like CFI to work in
conjunction with FatLTO, in so far as the bitcode section should be able
to contain the CFI instrumentation, while any incompatible bits are
dropped when compiling the object code.
To do that, we need to drop the llvm.type.test instructions everywhere,
and not just their uses in phi nodes. This patch updates the
LowerTypeTest pass so that uses are removed, and replaced with `true` in
all cases, and not just in phi nodes.
Addressing this will allow us to fix #112053 by modifying the FatLTO
pipeline.
Reviewers: pcc, nikic
Reviewed By: pcc
Pull Request: https://github.com/llvm/llvm-project/pull/112787
|
|
This patch adds support for cold function coverage instrumentation based
on sampling PGO counts. The major motivation is to detect dead functions
for the services that are optimized with sampling PGO. If a function is
covered by sampling profile count (e.g., those with an entry count > 0),
we choose to skip instrumenting those functions, which significantly
reduces the instrumentation overhead.
More details about the implementation and flags:
- Added a flag `--pgo-instrument-cold-function-only` in
`PGOInstrumentation.cpp` as the main switch to control skipping the
instrumentation.
- Built the extra instrumentation passes(a bundle of passes in
`addPGOInstrPasses`) under sampling PGO pipeline. This is controlled by
`--instrument-cold-function-only-path` flag.
- Added a driver flag `-fprofile-generate-cold-function-coverage`:
- 1) Config the flags in one place, i,e. adding
`--instrument-cold-function-only-path=<...>` and
`--pgo-function-entry-coverage`. Note that the instrumentation file path
is passed through `--instrument-sample-cold-function-path`, because we
cannot use the `PGOOptions.ProfileFile` as it's already used by
`-fprofile-sample-use=<...>`.
- 2) makes linker to link `compiler_rt.profile` lib(see
[ToolChain.cpp#L1125-L1131](https://github.com/llvm/llvm-project/blob/main/clang/lib/Driver/ToolChain.cpp#L1125-L1131)
).
- Added a flag(`--pgo-cold-instrument-entry-threshold`) to config entry
count to determine cold function.
Overall, the full command is like:
```
clang++ -O2 -fprofile-generate-cold-function-coverage=<...> -fprofile-sample-use=<...> code.cc -o code
```
|
|
This patch enables support for cloning in indirect callsites.
This is done by synthesizing callsite records for each virtual call
target from the profile metadata. In the thin link all the synthesized
records for a particular indirect callsite initially share the same
context node, but support is added to partition the callsites and
outgoing edges based on the callee function, creating a separate node
for each target.
In the LTO backend, when cloning is needed we first perform indirect
call promotion, then change the target of the new direct call to the
desired clone.
Note this is ThinLTO-specific, since for regular LTO indirect call
promotion should have already occurred.
|
|
This flag has been on for a while without any complaints.
|
|
Reverts llvm/llvm-project#111163, as this was merged prematurely.
|
|
Run ArgumentPromotion before IPSCCP in the LTO pipeline, to expose more
constants to be propagated. We also run PostOrderFunctionAttrs to
improve the information available to ArgumentPromotion's alias analysis,
and SROA to clean up allocas.
|
|
`PGOOpt` could have a value if, for instance, debug info for profiling
is requested. Relaxing the requirement, for now, following that
eventually we would factor `PGOOpt` to better capture the supported
interplay between the various profiling options.
|
|
After landing https://github.com/llvm/llvm-project/pull/99285 we found
that the call graph update was causing the following crash when
expensive checks are turned on
```
llvm-project/llvm/lib/Analysis/CGSCCPassManager.cpp:982: LazyCallGraph::SCC &updateCGAndAnalysisManagerForPass(LazyCallGraph &, LazyCallGraph::SCC &, LazyCallGraph::Node &, CGSCCAnalysisManager &, CGSCCUpdateResult &, FunctionAnalysisManager &, bool): Assertion `(RC == &TargetRC || RC->isAncestorOf(Targe
tRC)) && "New call edge is not trivial!"' failed.
```
I have to admit I believe that the call graph update process I did for
that patch could be wrong.
After reading the code in `CGSCCToFunctionPassAdaptor`, I am convinced
that `CoroAnnotationElidePass` can be a FunctionPass and rely on the
adaptor to update the call graph for us, so long as we properly
invalidate the caller's analyses.
After this patch,
`llvm/test/Transforms/Coroutines/coro-transform-must-elide.ll` no longer
fails under expensive checks.
|
|
This patch enables experimenting with the contextual profile. ICP is currently disabled in this case - will reenable it subsequently. Also subsequently the inline cost model / decision making would be updated to be context-aware. Right now, this just achieves "complete use" of the profile, in that it's ingested, maintained, and sunk to a flat profile when not needed anymore.
Issue [#89287](https://github.com/llvm/llvm-project/issues/89287)
|
|
coroutines to the `noalloc` variant (#99285)
This patch is episode three of the middle end implementation for the
coroutine HALO improvement project published on discourse:
https://discourse.llvm.org/t/language-extension-for-better-more-deterministic-halo-for-c-coroutines/80044
After we attribute the calls to some coroutines as "coro_elide_safe" in
the C++ FE and creating a `noalloc` ramp function, we use a new middle
end pass to move the call to coroutines to the noalloc variant.
This pass should be run after CoroSplit. For each node we process in
CoroSplit, we look for its callers and replace the attributed ones in
presplit coroutines to the noalloc one. The transformed `noalloc` ramp
function will also require a frame pointer to a block of memory it can
use as an activation frame. We allocate this on the caller's frame with
an alloca.
Please note that we cannot safely transform such attributed calls in
post-split coroutines due to memory lifetime reasons. The CoroSplit pass
is responsible for creating the coroutine frame spills for all the
allocas in the coroutine. Therefore it will be unsafe to create new
allocas like this one in post-split coroutines. This happens relatively
rarely because CGSCC performs the passes on the callees before the
caller. However, if multiple coroutines coexist in one SCC, this
situation does happen (and prevents us from having potentially unbound
frame size due to recursion.)
You can find episode 1: Clang FE of this patch series at
https://github.com/llvm/llvm-project/pull/99282
Episode 2: CoroSplit at https://github.com/llvm/llvm-project/pull/99283
|
|
synthetic count passes. (#107471)
The primary motivation is to remove `EntryCount` from `FunctionSummary`.
This frees 8 bytes out of `sizeof(FunctionSummary)` (136 bytes as of
https://github.com/llvm/llvm-project/commit/64498c54831bed9cf069e0923b9b73678c6451d8).
While I'm at it, this PR clean up {SummaryBasedOptimizations,
SyntheticCountsPropagation} since they were not used and there are no
plans to further invest on them.
With this patch, bitcode writer writes a placeholder 0 at the byte
offset of `EntryCount` and bitcode reader can parse the function entry
count at the correct byte offset. Added a TODO to stop writing
`EntryCount` and bump bitcode version
|
|
Pass to flatten and lower the contextual profile to profile (i.e. `MD_prof`) metadata. This is expected to be used after all IPO transformations have happened.
Prior to lowering, the instrumentation is maintained during IPO and the contextual profile is kept in sync (see PRs #105469, #106154). Flattening (#104539) sums up all the counters belonging to all a function's context nodes.
We first propagate counter values (from the flattened profile) using the same propagation algorithm as `PGOUseFunc::populateCounters`, then map the edge values to `branch_weights`. Functions. in the module that don't have an entry in the flattened profile are deemed cold, and any `MD_prof` metadata they may have is reset. The profile summary is also reset at this point.
Issue [#89287](https://github.com/llvm/llvm-project/issues/89287)
|