Age | Commit message (Collapse) | Author | Files | Lines |
|
Previously some checkers attached explicitly created program point tags
to some of the exploded graph nodes that they created. In most of the
checkers this ad-hoc tagging only affected the debug dump of the
exploded graph (and they weren't too relevant for debugging) so this
commit removes them.
There were two checkers where the tagging _did_ have a functional role:
- In `RetainCountChecker` the presence of tags were checked by
`RefCountReportVisitor`.
- In `DynamicTypePropagation` the checker sometimes wanted to create two
identical nodes and had to apply an explicit tag on the second one to
avoid "caching out".
In these two situations I preserved the tags but switched to using
`SimpleProgramPointTag` instead of `CheckerProgramPointTag` because
`CheckerProgramPointTag` didn't provide enough benefits to justify its
existence.
Note that this commit depends on the earlier commit "[analyzer] Fix
tagging of PostAllocatorCall" ec96c0c072ef3f78813c378949c00e1c07aa44e5
and would introduce crashes when cherry-picked onto a branch that
doesn't contain that commit.
For more details about the background see the discourse thread
https://discourse.llvm.org/t/role-of-programpointtag-in-the-static-analyzer/
As a tangentially related changes, this commit also adds some comments
to document the surprising behavior of `CheckerContext::addTransition`
and an assertion in the constructor of `PathSensitiveBugReport` to get a
more readable crash dump in the case when the report is constructed with
`nullptr` as the `ErrorNode`. (This can happen due to "caching out".)
|
|
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.
|
|
|
|
Previously `optin.taint.GenericTaint` misinterpreted the parameter
indices and produced false positives in situations when a [format
attribute](https://clang.llvm.org/docs/AttributeReference.html#format)
is applied on a non-static method. This commit fixes this bug
|
|
checkers (#98157)
Taint propagation is a a generic modeling feature of the Clang Static
Analyzer which many other checkers depend on. Therefore
GenericTaintChecker is split into a TaintPropagation modeling checker
and a GenericTaint reporting checker.
|
|
This commit refactors GenericTaintChecker and performs various
improvements in the list of taint propagation functions:
1. The matching mode (usually `CDM::CLibrary` or
`CDM::CLibraryMaybeHardened`) was specified to avoid matching e.g. C++
methods or functions from a user-defined namespace that happen to share
the name of a well-known library function.
2. With these matching modes, a `CallDescription` can automatically
match builtin variants of the functions, so entries that explicitly
specified a builtin function were removed. This eliminated
inconsistencies where the "normal" and the builtin variant of the same
function was handled differently (e.g. `__builtin_strlcat` was covered,
while plain `strlcat` wasn't; while `__builtin_memcpy` and `memcpy` were
both on the list with different propagation rules).
3. The modeling of the functions `strlcat` and `strncat` was updated to
propagate taint from the first argument (index 0), because a tainted
string should remain tainted even if we append something else to it.
Note that this was already applied to `strcat` and `wcsncat` by commit
6ceb1c0ef9f544be0eed65e46cc7d99941a001bf.
4. Some functions were updated to propagate taint from a size/length
argument to the result: e.g. `memcmp(p, q, get_tainted_int())` will now
return a tainted value (because the attacker can manipulate it). This
principle was already present in some propagation rules (e.g.
`__builtin_memcpy` was handled this way), and even after this commit
there are still some functions where it isn't applied. (I only aimed for
consistency within the same function family.)
5. Functions that have hardened `__FOO_chk()` variants are matched in
`CDM:CLibraryMaybeHardened` to ensure consistent handling of the
"normal" and the hardened variant. I added special handling for the
hardened variants of "sprintf" and "snprintf" because there the extra
parameters are inserted into the middle of the parameter list.
6. Modeling of `sscanf_s` was added, to complete the group of `fscanf`,
`fscanf_s` and `sscanf`.
7. The `Source()` specifications for `gets`, `gets_s` and `wgetch` were
ill-formed: they were specifying variadic arguments starting at argument
index `ReturnValueIndex`. (That is, in addition to the return value they
were propagating taint to all arguments.)
8. Functions that were related to each other were grouped together. (I
know that this makes the diff harder to read, but I felt that the full
list is unreadable without some reorganization.)
9. I spotted and removed some redundant curly braces. Perhaps would be
good to switch to a cleaner layout with less nested braces...
10. I updated some obsolete comments and added two TODOs for issues that
should be fixed in followup commits.
|
|
I'm planning to remove StringRef::equals in favor of
StringRef::operator==.
- StringRef::operator==/!= outnumber StringRef::equals by a factor of
24 under clang/ in terms of their usage.
- The elimination of StringRef::equals brings StringRef closer to
std::string_view, which has operator== but not equals.
- S == "foo" is more readable than S.equals("foo"), especially for
!Long.Expression.equals("str") vs Long.Expression != "str".
|
|
checker (#68607)
Before this commit the the checker alpha.security.taint.TaintPropagation always reported warnings when the size argument of a memcpy-like or malloc-like function was tainted. However, this produced false positive reports in situations where the size was tainted, but correctly performed bound checks guaranteed the safety of the call.
This commit removes the rough "always warn if the size argument is tainted" heuristic; but it would be good to add a more refined "warns if the size argument is tainted and can be too large" heuristic in follow-up commits. That logic would belong to CStringChecker and MallocChecker, because those are the checkers responsible for the more detailed modeling of memcpy-like and malloc-like functions. To mark this plan, TODO comments are added in those two checkers.
There were several test cases that used these sinks to test generic properties of taint tracking; those were adapted to use different logic.
As a minor unrelated change, this commit ensures that strcat (and its wide variant, wcsncat) propagates taint from the first argument to the first argument, i.e. a tainted string remains tainted if we concatenate it with another string. This change was required because the adapted variant of multipleTaintedArgs is relying on strncat to compose a value that combines taint from two different sources.
|
|
In builds that use source hardening (-D_FORTIFY_SOURCE), many standard
functions are implemented as macros that expand to calls of hardened
functions that take one additional argument compared to the "usual"
variant and perform additional input validation. For example, a `memcpy`
call may expand to `__memcpy_chk()` or `__builtin___memcpy_chk()`.
Before this commit, `CallDescription`s created with the matching mode
`CDM::CLibrary` automatically matched these hardened variants (in a
addition to the "usual" function) with a fairly lenient heuristic.
Unfortunately this heuristic meant that the `CLibrary` matching mode was
only usable by checkers that were prepared to handle matches with an
unusual number of arguments.
This commit limits the recognition of the hardened functions to a
separate matching mode `CDM::CLibraryMaybeHardened` and applies this
mode for functions that have hardened variants and were previously
recognized with `CDM::CLibrary`.
This way checkers that are prepared to handle the hardened variants will
be able to detect them easily; while other checkers can simply use
`CDM::CLibrary` for matching C library functions (and they won't
encounter surprising argument counts).
The initial motivation for refactoring this area was that previously
`CDM::CLibrary` accepted calls with more arguments/parameters than the
expected number, so I wasn't able to use it for `malloc` without
accidentally matching calls to the 3-argument BSD kernel malloc.
After this commit this "may have more args/params" logic will only
activate when we're actually matching a hardened variant function (in
`CDM::CLibraryMaybeHardened` mode). The recognition of "sprintf()" and
"snprintf()" in CStringChecker was refactored, because previously it was
abusing the behavior that extra arguments are accepted even if the
matched function is not a hardened variant.
This commit also fixes the oversight that the old code would've
recognized e.g. `__wmemcpy_chk` as a hardened variant of `memcpy`.
After this commit I'm planning to create several follow-up commits that
ensure that checkers looking for C library functions use `CDM::CLibrary`
as a "sane default" matching mode.
This commit is not truly NFC (it eliminates some buggy corner cases),
but it does not intentionally modify the behavior of CSA on real-world
non-crazy code.
As a minor unrelated change I'm eliminating the argument/variable
"IsBuiltin" from the evalSprintf function family in CStringChecker,
because it was completely unused.
---------
Co-authored-by: Balazs Benics <benicsbalazs@gmail.com>
|
|
The class `CallDescription` is used to define patterns that are used for
matching `CallEvent`s. For example, a
`CallDescription{{"std", "find_if"}, 3}`
matches a call to `std::find_if` with 3 arguments.
However, these patterns are somewhat fuzzy, so this pattern could also
match something like `std::__1::find_if` (with an additional namespace
layer), or, unfortunately, a `CallDescription` for the well-known
function `free()` can match a C++ method named `free()`:
https://github.com/llvm/llvm-project/issues/81597
To prevent this kind of ambiguity this commit introduces the enum
`CallDescription::Mode` which can limit the pattern matching to
non-method function calls (or method calls etc.). After this NFC change,
one or more follow-up commits will apply the right pattern matching
modes in the ~30 checkers that use `CallDescription`s.
Note that `CallDescription` previously had a `Flags` field which had
only two supported values:
- `CDF_None` was the default "match anything" mode,
- `CDF_MaybeBuiltin` was a "match only C library functions and accept
some inexact matches" mode.
This commit preserves `CDF_MaybeBuiltin` under the more descriptive
name `CallDescription::Mode::CLibrary` (or `CDM::CLibrary`).
Instead of this "Flags" model I'm switching to a plain enumeration
becasue I don't think that there is a natural usecase to combine the
different matching modes. (Except for the default "match anything" mode,
which is currently kept for compatibility, but will be phased out in the
follow-up commits.)
|
|
Variadic arguments were not considered as taint sink arguments. I also
decided to extend the list of exec-like functions.
(Juliet CWE78_OS_Command_Injection__char_connect_socket_execl)
|
|
strlen(..) call should not propagate taintedness,
because it brings in many false positive findings. It is a common
pattern to copy user provided input to another buffer. In these cases we
always
get warnings about tainted data used as the malloc parameter:
buf = malloc(strlen(tainted_txt) + 1); // false warning
This pattern can lead to a denial of service attack only, when the
attacker can directly specify the size of the allocated area as an
arbitrary large number (e.g. the value is converted from a user provided
string).
Later, we could reintroduce strlen() as a taint propagating function
with the consideration not to emit warnings when the tainted value
cannot be "arbitrarily large" (such as the size of an already allocated
memory block).
The change has been evaluated on the following open source projects:
- memcached: [1 lost false
positive](https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=memcached_1.6.8_ednikru_taint_nostrlen_baseline&newcheck=memcached_1.6.8_ednikru_taint_nostrlen_new&is-unique=on&diff-type=Resolved)
- tmux: 0 lost reports
- twin [3 lost false
positives](https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=twin_v0.8.1_ednikru_taint_nostrlen_baseline&newcheck=twin_v0.8.1_ednikru_taint_nostrlen_new&is-unique=on&diff-type=Resolved)
- vim [1 lost false
positive](https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=vim_v8.2.1920_ednikru_taint_nostrlen_baseline&newcheck=vim_v8.2.1920_ednikru_taint_nostrlen_new&is-unique=on&diff-type=Resolved)
- openssl 0 lost reports
- sqliste [2 lost false
positives](https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=sqlite_version-3.33.0_ednikru_taint_nostrlen_baseline&newcheck=sqlite_version-3.33.0_ednikru_taint_nostrlen_new&is-unique=on&diff-type=Resolved)
- ffmpeg 0 lost repots
- postgresql [3 lost false
positives](https://codechecker-demo.eastus.cloudapp.azure.com/Default/reports?run=postgres_REL_13_0_ednikru_taint_nostrlen_baseline&newcheck=postgres_REL_13_0_ednikru_taint_nostrlen_new&is-unique=on&diff-type=Resolved)
- tinyxml 0 lost reports
- libwebm 0 lost reports
- xerces 0 lost reports
In all cases the lost reports are originating from copying untrusted
environment variables into another buffer.
There are 2 types of lost false positive reports:
1) [Where the warning is emitted at the malloc call by the
TaintPropagation Checker
](https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=memcached_1.6.8_ednikru_taint_nostrlen_baseline&newcheck=memcached_1.6.8_ednikru_taint_nostrlen_new&is-unique=on&diff-type=Resolved&report-id=2648506&report-hash=2079221954026f17e1ecb614f5f054db&report-filepath=%2amemcached.c)
`
len = strlen(portnumber_filename)+4+1;
temp_portnumber_filename = malloc(len);
`
2) When pointers are set based on the length of the tainted string by
the ArrayOutofBoundsv2 checker.
For example [this
](https://codechecker-demo.eastus.cloudapp.azure.com/Default/report-detail?run=vim_v8.2.1920_ednikru_taint_nostrlen_baseline&newcheck=vim_v8.2.1920_ednikru_taint_nostrlen_new&is-unique=on&diff-type=Resolved&report-id=2649310&report-hash=79dc8522d2cd34ca8e1b2dc2db64b2df&report-filepath=%2aos_unix.c)case.
|
|
Functions like `fgets`, `strlen`, `strcat` propagate taint.
However, their `wchar_t` variants don't. This patch fixes that.
Notice, that there could be many more APIs missing.
This patch intends to fix those that so far surfaced,
instead of exhaustively fixing this issue.
https://github.com/llvm/llvm-project/pull/66074
|
|
This allows to track taint on real code from `socket()`
to reading into a buffer using `recv()`.
https://github.com/llvm/llvm-project/pull/66074
|
|
The `stdin` declaration should be within `extern "C" {...}`, in C++
mode. In addition, it should be also marked `extern` in both C and
C++ modes.
I tightened the check to ensure we only accept `stdin` if both of these
match. However, from the Juliet test suite's perspective, this commit
should not matter.
https://github.com/llvm/llvm-project/pull/66074
|
|
There was a bug in alpha.security.taint.TaintPropagation checker
in Clang Static Analyzer.
Taint filtering could only sanitize const arguments.
After this patch, taint filtering is effective also
on non-const parameters.
Differential Revision: https://reviews.llvm.org/D155848
|
|
|
|
This patch improves the diagnostics of the alpha.security.taint.TaintPropagation
checker and taint related checkers by showing the "Taint originated here" note
at the correct place, where the attacker may inject it. This greatly improves
the understandability of the taint reports.
In the baseline the taint source was pointing to an invalid location, typically
somewhere between the real taint source and sink.
After the fix, the "Taint originated here" tag is correctly shown at the taint
source. This is the function call where the attacker can inject a malicious data
(e.g. reading from environment variable, reading from file, reading from
standard input etc.).
This patch removes the BugVisitor from the implementation and replaces it with 2
new NoteTags. One, in the taintOriginTrackerTag() prints the "taint originated
here" Note and the other in taintPropagationExplainerTag() explaining how the
taintedness is propagating from argument to argument or to the return value
("Taint propagated to the Xth argument"). This implementation uses the
interestingess BugReport utility to track back the tainted symbols through
propagating function calls to the point where the taintedness was introduced by
a source function call.
The checker which wishes to emit a Taint related diagnostic must use the
categories::TaintedData BugType category and must mark the tainted symbols as
interesting. Then the TaintPropagationChecker will automatically generate the
"Taint originated here" and the "Taint propagated to..." diagnostic notes.
|
|
This patch replaces (llvm::|)Optional< with std::optional<. I'll post
a separate patch to remove #include "llvm/ADT/Optional.h".
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
|
|
call information
This avoids recomputing string length that is already known at compile time.
It has a slight impact on preprocessing / compile time, see
https://llvm-compile-time-tracker.com/compare.php?from=3f36d2d579d8b0e8824d9dd99bfa79f456858f88&to=e49640c507ddc6615b5e503144301c8e41f8f434&stat=instructions:u
This a recommit of e953ae5bbc313fd0cc980ce021d487e5b5199ea4 and the subsequent fixes caa713559bd38f337d7d35de35686775e8fb5175 and 06b90e2e9c991e211fecc97948e533320a825470.
The above patchset caused some version of GCC to take eons to compile clang/lib/Basic/Targets/AArch64.cpp, as spotted in aa171833ab0017d9732e82b8682c9848ab25ff9e.
The fix is to make BuiltinInfo tables a compilation unit static variable, instead of a private static variable.
Differential Revision: https://reviews.llvm.org/D139881
|
|
builtin and call information"
Revert "Fix lldb option handling since e953ae5bbc313fd0cc980ce021d487e5b5199ea4 (part 2)"
Revert "Fix lldb option handling since e953ae5bbc313fd0cc980ce021d487e5b5199ea4"
GCC build hangs on this bot https://lab.llvm.org/buildbot/#/builders/37/builds/19104
compiling CMakeFiles/obj.clangBasic.dir/Targets/AArch64.cpp.d
The bot uses GNU 11.3.0, but I can reproduce locally with gcc (Debian 12.2.0-3) 12.2.0.
This reverts commit caa713559bd38f337d7d35de35686775e8fb5175.
This reverts commit 06b90e2e9c991e211fecc97948e533320a825470.
This reverts commit e953ae5bbc313fd0cc980ce021d487e5b5199ea4.
|
|
call information
This avoids recomputing string length that is already known at compile
time.
It has a slight impact on preprocessing / compile time, see
https://llvm-compile-time-tracker.com/compare.php?from=3f36d2d579d8b0e8824d9dd99bfa79f456858f88&to=e49640c507ddc6615b5e503144301c8e41f8f434&stat=instructions:u
This is a recommit of 719d98dfa841c522d8d452f0685e503538415a53 that into
account a GGC issue (probably
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92181) when dealing with
intiailizer_list and constant expressions.
Workaround this by avoiding initializer list, at the expense of a
temporary plain old array.
Differential Revision: https://reviews.llvm.org/D139881
|
|
builtin and call information"
There are still remaining issues with GCC 12, see for instance
https://lab.llvm.org/buildbot/#/builders/93/builds/12669
This reverts commit 5ce4e92264102de21760c94db9166afe8f71fcf6.
|
|
call information
This avoids recomputing string length that is already known at compile
time.
It has a slight impact on preprocessing / compile time, see
https://llvm-compile-time-tracker.com/compare.php?from=3f36d2d579d8b0e8824d9dd99bfa79f456858f88&to=e49640c507ddc6615b5e503144301c8e41f8f434&stat=instructions:u
This is a recommit of 719d98dfa841c522d8d452f0685e503538415a53 with a
change to llvm/utils/TableGen/OptParserEmitter.cpp to cope with GCC bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108158
Differential Revision: https://reviews.llvm.org/D139881
|
|
builtin and call information"
Failing builds: https://lab.llvm.org/buildbot#builders/9/builds/19030
This is GCC specific and has been reported upstream: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108158
This reverts commit 719d98dfa841c522d8d452f0685e503538415a53.
|
|
call information
This avoids recomputing string length that is already known at compile
time.
It has a slight impact on preprocessing / compile time, see
https://llvm-compile-time-tracker.com/compare.php?from=3f36d2d579d8b0e8824d9dd99bfa79f456858f88&to=e49640c507ddc6615b5e503144301c8e41f8f434&stat=instructions:u
Differential Revision: https://reviews.llvm.org/D139881
|
|
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
|
|
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
|
|
|
|
|
|
|
|
Previously, system globals were treated as immutable regions, unless it
was the `errno` which is known to be frequently modified.
D124244 wants to add a check for stores to immutable regions.
It would basically turn all stores to system globals into an error even
though we have no reason to believe that those mutable sys globals
should be treated as if they were immutable. And this leads to
false-positives if we apply D124244.
In this patch, I'm proposing to treat mutable sys globals actually
mutable, hence allocate them into the `GlobalSystemSpaceRegion`, UNLESS
they were declared as `const` (and a primitive arithmetic type), in
which case, we should use `GlobalImmutableSpaceRegion`.
In any other cases, I'm using the `GlobalInternalSpaceRegion`, which is
no different than the previous behavior.
---
In the tests I added, only the last `expected-warning` was different, compared to the baseline.
Which is this:
```lang=C++
void test_my_mutable_system_global_constraint() {
assert(my_mutable_system_global > 2);
clang_analyzer_eval(my_mutable_system_global > 2); // expected-warning {{TRUE}}
invalidate_globals();
clang_analyzer_eval(my_mutable_system_global > 2); // expected-warning {{UNKNOWN}} It was previously TRUE.
}
void test_my_mutable_system_global_assign(int x) {
my_mutable_system_global = x;
clang_analyzer_eval(my_mutable_system_global == x); // expected-warning {{TRUE}}
invalidate_globals();
clang_analyzer_eval(my_mutable_system_global == x); // expected-warning {{UNKNOWN}} It was previously TRUE.
}
```
---
Unfortunately, the taint checker will be also affected.
The `stdin` global variable is a pointer, which is assumed to be a taint
source, and the rest of the taint propagation rules will propagate from
it.
However, since mutable variables are no longer treated immutable, they
also get invalidated, when an opaque function call happens, such as the
first `scanf(stdin, ...)`. This would effectively remove taint from the
pointer, consequently disable all the rest of the taint propagations
down the line from the `stdin` variable.
All that said, I decided to look through `DerivedSymbol`s as well, to
acquire the memregion in that case as well. This should preserve the
previously existing taint reports.
Reviewed By: martong
Differential Revision: https://reviews.llvm.org/D127306
|
|
Depends on D125709
Reviewed By: martong
Differential Revision: https://reviews.llvm.org/D127742
|
|
Thanks @kazu for helping me clean these parts in D127799.
I'm leaving the dump methods, along with the unused visitor handlers and
the forwarding methods.
The dead parts actually helped to uncover two bugs, to which I'm going
to post separate patches.
Reviewed By: martong
Differential Revision: https://reviews.llvm.org/D127836
|
|
Reviewed By: NoQ, xazax.hun, steakhal
Differential Revision: https://reviews.llvm.org/D123155
|
|
Add more functions as taint propators to GenericTaintChecker.
Reviewed By: steakhal
Differential Revision: https://reviews.llvm.org/D120369
|
|
Add more functions as taint sources to GenericTaintChecker.
Reviewed By: steakhal
Differential Revision: https://reviews.llvm.org/D120236
|
|
7fd60ee6e0a87957a718297a4a42d9881fc561e3
|
|
|
|
This reverts commit 2acead35c1289d2b3593a992b0639ca6427e481f.
Let's try `REQUIRES: asserts`.
|
|
location context""
This reverts commit d16c5f4192c30d53468a472c6820163a81192825.
Let's try `REQUIRES: asserts`.
|
|
propagation""
This reverts commit b8ae323cca61dc1edcd36e9ae18c7e4c3d76d52e.
Let's try `REQUIRES: asserts`.
|
|
This reverts commit 744745ae195f0997e5bfd5aa2de47b9ea156b6a6.
I'm reverting this since this patch caused a build breakage.
https://lab.llvm.org/buildbot/#/builders/91/builds/3818
|
|
This reverts commit b099e1e562555fbc67e2e0abbc15074e14a85ff1.
I'm reverting this since the head of the patch stack caused a build
breakage.
https://lab.llvm.org/buildbot/#/builders/91/builds/3818
|
|
This reverts commit bf5963bf19670ea58facdf57492e147c13bb650f.
I'm reverting this since the head of the patch stack caused a build
breakage.
https://lab.llvm.org/buildbot/#/builders/91/builds/3818
|
|
There was a typo in the rule.
`{{0}, ReturnValueIndex}` meant that the discrete index is `0` and the
variadic index is `-1`.
What we wanted instead is that both `0` and `-1` are in the discrete index
list.
Instead of this, we wanted to express that both `0` and the
`ReturnValueIndex` is in the discrete arg list.
The manual inspection revealed that `setproctitle_init` also suffered a
probably incomplete propagation rule.
Reviewed By: Szelethus, gamesh411
Differential Revision: https://reviews.llvm.org/D119129
|
|
Fixes the issue D118987 by mapping the propagation to the callsite's
LocationContext.
This way we can keep track of the in-flight propagations.
Note that empty propagation sets won't be inserted.
Reviewed By: NoQ, Szelethus
Differential Revision: https://reviews.llvm.org/D119128
|
|
Recently we uncovered a serious bug in the `GenericTaintChecker`.
It was already flawed before D116025, but that was the patch that turned
this silent bug into a crash.
It happens if the `GenericTaintChecker` has a rule for a function, which
also has a definition.
char *fgets(char *s, int n, FILE *fp) {
nested_call(); // no parameters!
return (char *)0;
}
// Within some function:
fgets(..., tainted_fd);
When the engine inlines the definition and finds a function call within
that, the `PostCall` event for the call will get triggered sooner than the
`PostCall` for the original function.
This mismatch violates the assumption of the `GenericTaintChecker` which
wants to propagate taint information from the `PreCall` event to the
`PostCall` event, where it can actually bind taint to the return value
**of the same call**.
Let's get back to the example and go through step-by-step.
The `GenericTaintChecker` will see the `PreCall<fgets(..., tainted_fd)>`
event, so it would 'remember' that it needs to taint the return value
and the buffer, from the `PostCall` handler, where it has access to the
return value symbol.
However, the engine will inline fgets and the `nested_call()` gets
evaluated subsequently, which produces an unimportant
`PreCall<nested_call()>`, then a `PostCall<nested_call()>` event, which is
observed by the `GenericTaintChecker`, which will unconditionally mark
tainted the 'remembered' arg indexes, trying to access a non-existing
argument, resulting in a crash.
If it doesn't crash, it will behave completely unintuitively, by marking
completely unrelated memory regions tainted, which is even worse.
The resulting assertion is something like this:
Expr.h: const Expr *CallExpr::getArg(unsigned int) const: Assertion
`Arg < getNumArgs() && "Arg access out of range!"' failed.
The gist of the backtrace:
CallExpr::getArg(unsigned int) const
SimpleFunctionCall::getArgExpr(unsigned int)
CallEvent::getArgSVal(unsigned int) const
GenericTaintChecker::checkPostCall(const CallEvent &, CheckerContext&) const
Prior to D116025, there was a check for the argument count before it
applied taint, however, it still suffered from the same underlying
issue/bug regarding propagation.
This path does not intend to fix the bug, rather start a discussion on
how to fix this.
---
Let me elaborate on how I see this problem.
This pre-call, post-call juggling is just a workaround.
The engine should by itself propagate taint where necessary right where
it invalidates regions.
For the tracked values, which potentially escape, we need to erase the
information we know about them; and this is exactly what is done by
invalidation.
However, in the case of taint, we basically want to approximate from the
opposite side of the spectrum.
We want to preserve taint in most cases, rather than cleansing them.
Now, we basically sanitize all escaping tainted regions implicitly,
since invalidation binds a fresh conjured symbol for the given region,
and that has not been associated with taint.
IMO this is a bad default behavior, we should be more aggressive about
preserving taint if not further spreading taint to the reachable
regions.
We have a couple of options for dealing with it (let's call it //tainting
policy//):
1) Taint only the parameters which were tainted prior to the call.
2) Taint the return value of the call, since it likely depends on the
tainted input - if any arguments were tainted.
3) Taint all escaped regions - (maybe transitively using the cluster
algorithm) - if any arguments were tainted.
4) Not taint anything - this is what we do right now :D
The `ExprEngine` should not deal with taint on its own. It should be done
by a checker, such as the `GenericTaintChecker`.
However, the `Pre`-`PostCall` checker callbacks are not designed for this.
`RegionChanges` would be a much better fit for modeling taint propagation.
What we would need in the `RegionChanges` callback is the `State` prior
invalidation, the `State` after the invalidation, and a `CheckerContext` in
which the checker can create transitions, where it would place `NoteTags`
for the modeled taint propagations and report errors if a taint sink
rule gets violated.
In this callback, we could query from the prior State, if the given
value was tainted; then act and taint if necessary according to the
checker's tainting policy.
By using RegionChanges for this, we would 'fix' the mentioned
propagation bug 'by-design'.
Reviewed By: Szelethus
Differential Revision: https://reviews.llvm.org/D118987
|
|
|
|
GenericTaintChecker now uses CallDescriptionMap to describe the possible
operation in code which trigger the introduction (sources), the removal
(filters), the passing along (propagations) and detection (sinks) of
tainted values.
Reviewed By: steakhal, NoQ
Differential Revision: https://reviews.llvm.org/D116025
|