-
Notifications
You must be signed in to change notification settings - Fork 1.9k
C++: Initial IR-based dataflow implementation #597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++: Initial IR-based dataflow implementation #597
Conversation
The AST dataflow library essentially ignores conversions, which is probably the right behavior. Converting an `int` to a `long` preserves the value, even if the bit pattern might be different. It's arguable whether narrowing conversions should be treated as dataflow, but we'll do so for now. We can revisit that if we see it cause problems.
This fixes a subtle bug in the construction of aliased SSA. `getResultMemoryAccess` was failing to return a `MemoryAccess` for a store to a variable whose address escaped. This is because no `VirtualIRVariable` was being created for such variables. The code was assuming that any access to such a variable would be via `UnknownMemoryAccess`. The result is that accesses to such variables were not being modeled in SSA at all. Instead, the way to handle this is to have a `VariableMemoryAccess` even when the variable being accessed has escaped, and to have `VariableMemoryAccess::getVirtualVariable()` return the `UnknownVirtualVariable` for escaped variables. In the future, this will also let us be less conservative about inserting `Chi` nodes, because we'll be able to determine that there's an exact overlap between two accesses to the same escaped variable in some cases.
This commit adds a new model interface that describes the known side effects (or lack thereof) of a library function. Does it read memory, does it write memory, and do any of its parameters escape? Initially, we have models for just two Standard Library functions: `std::move` and `std::forward`, which neither read nor write memory, and do not escape their parameter. IR construction has been updated to insert the correct side effect instruction (or no side effect instruction) based on the model.
rdmarsh2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nits in the side-effect-free modeling library
| abstract predicate writesMemory(); | ||
|
|
||
| /** | ||
| * Holds if any address passed to the parameter at the specified index is retained after the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a caveat about storing vs returning addresses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new API and comments clear this up.
| * This memory could be from global variables, or from other memory that was reachable from a | ||
| * pointer that was passed into the function. | ||
| */ | ||
| predicate functionReadsMemory(Function f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename these to indicate that they're only based on the model, not based on some deeper analysis, given that there's likely to be a library for that in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new API for this doesn't have the non-member predicates.
Made `Node::getType()`, `Node::asParameter()`, and `Node::asUninitialized()` operate directly on the IR. This actually fixed several diffs compared to the AST dataflow, because `getType()` wasn't holding for nodes that weren't `Exprs`. Made `Uninitialized` a `VariableInstruction`. This makes it consistent with `InitializeParameter`.
|
Very exciting! Do I understand it correctly that this initial PR does not use the recently added Chi instructions? And that it can't use those until they are more precise? It would be great to have tests that (1) show the difference between the current library and this new one, and (2) demonstrate known false negatives or false positives. This doesn't have to block this PR. |
|
@jbj I don't expect most analyses, including this one, to actually do anything meaningful with Chi instructions directly. A Chi instruction more or less just says "something may or may not have changed this memory location", without any precise knowledge about what that potential change was. A Chi instruction is just what we insert when we know that something may have changed, but can't be more specific. As we make Chi insertion more precise, we'll insert fewer Chi instructions, and instead have more operands connected to an exact definition. That's how future improvements to Chi instructions will help this analysis (and all other SSA-based analyses). |
|
@jbj To respond to your testing suggestion, the |
|
Yes, my testing suggestion was to make a |
|
I've hand-edited the output slightly and made a diff of how the current test output compares to the new test output. For each difference, I've hand-annotated it with the consequence and likely cause. --- ast.expected 2018-12-03 14:05:42.000000000 +0100
+++ ir.expected 2018-12-03 14:14:21.000000000 +0100
@@ -2,19 +2,18 @@
| test.cpp:9:8:9:9 | t1 | test.cpp:6:12:6:17 | call to source |
| test.cpp:10:8:10:9 | t2 | test.cpp:6:12:6:17 | call to source |
| test.cpp:15:8:15:9 | t2 | test.cpp:6:12:6:17 | call to source |
+| test.cpp:21:8:21:9 | t1 | test.cpp:6:12:6:17 | call to source | REGRESSION (lack of CFG pruning)
| test.cpp:26:8:26:9 | t1 | test.cpp:6:12:6:17 | call to source |
| test.cpp:30:8:30:8 | t | test.cpp:35:10:35:15 | call to source |
| test.cpp:31:8:31:8 | c | test.cpp:36:13:36:18 | call to source |
| test.cpp:58:10:58:10 | t | test.cpp:50:14:50:19 | call to source |
-| test.cpp:71:8:71:9 | x4 | test.cpp:66:30:66:36 | source1 | REGRESSION (`std::move` not special-cased)
| test.cpp:76:8:76:9 | u1 | test.cpp:75:7:75:8 | u1 |
| test.cpp:84:8:84:18 | ... ? ... : ... | test.cpp:83:7:83:8 | u2 |
| test.cpp:86:8:86:9 | i1 | test.cpp:83:7:83:8 | u2 |
| test.cpp:90:8:90:14 | source1 | test.cpp:89:28:89:34 | source1 |
-| test.cpp:103:10:103:12 | ref | test.cpp:100:13:100:18 | call to source | IMPROVEMENT
-| test.cpp:126:8:126:19 | sourceArray1 | test.cpp:120:9:120:20 | sourceArray1 | REGRESSION (why?)
-| test.cpp:137:27:137:28 | m1 | test.cpp:136:27:136:32 | call to source | REGRESSION (lack of flow through fields)
-| test.cpp:140:22:140:23 | m1 | test.cpp:136:27:136:32 | call to source | IMPROVEMENT (lack of flow through fields)
+| test.cpp:92:8:92:14 | source1 | test.cpp:89:28:89:34 | source1 | REGRESSION (lack of heap modelling)
+| test.cpp:132:22:132:23 | m1 | test.cpp:122:18:122:30 | sourceStruct1 | REGRESSION (flow from struct to field)
+| test.cpp:140:22:140:23 | m1 | test.cpp:122:18:122:30 | sourceStruct1 | REGRESSION (flow from struct to field)
| test.cpp:188:8:188:8 | y | test.cpp:186:27:186:32 | call to source |
| test.cpp:192:8:192:8 | s | test.cpp:199:33:199:38 | call to source |
| test.cpp:200:8:200:8 | y | test.cpp:199:33:199:38 | call to source |
@@ -25,12 +24,13 @@
| test.cpp:314:12:314:12 | x | test.cpp:313:22:313:27 | call to source |
| test.cpp:337:14:337:14 | x | test.cpp:353:17:353:22 | call to source |
| test.cpp:366:7:366:7 | x | test.cpp:362:4:362:9 | call to source |
-| test.cpp:397:10:397:18 | globalVar | test.cpp:395:17:395:22 | call to source | REGRESSION (lack of flow through globals)
-| test.cpp:423:10:423:14 | field | test.cpp:421:13:421:18 | call to source | REGRESSION (lack of flow through fields)
+| true_upon_entry.cpp:13:8:13:8 | x | true_upon_entry.cpp:9:11:9:16 | call to source | REGRESSION (lack of true-upon-entry support)
| true_upon_entry.cpp:21:8:21:8 | x | true_upon_entry.cpp:17:11:17:16 | call to source |
| true_upon_entry.cpp:29:8:29:8 | x | true_upon_entry.cpp:27:9:27:14 | call to source |
| true_upon_entry.cpp:39:8:39:8 | x | true_upon_entry.cpp:33:11:33:16 | call to source |
| true_upon_entry.cpp:49:8:49:8 | x | true_upon_entry.cpp:43:11:43:16 | call to source |
| true_upon_entry.cpp:57:8:57:8 | x | true_upon_entry.cpp:54:11:54:16 | call to source |
+| true_upon_entry.cpp:66:8:66:8 | x | true_upon_entry.cpp:62:11:62:16 | call to source | REGRESSION (lack of true-upon-entry support)
| true_upon_entry.cpp:78:8:78:8 | x | true_upon_entry.cpp:70:11:70:16 | call to source |
| true_upon_entry.cpp:86:8:86:8 | x | true_upon_entry.cpp:83:11:83:16 | call to source |
+| true_upon_entry.cpp:105:8:105:8 | x | true_upon_entry.cpp:98:11:98:16 | call to source | IMPROVEMENT (lack of (imperfect) true-upon-entry support)Let's discuss in the next stream meeting which regressions are most important to fix for 1.19. |
|
I hope the lack of flow through fields can be fixed by using the support for flow through fields in the shared Java/C++ data-flow library. I know you also want to add field flow to the IR itself, but that won't give us interprocedural field flow. I don't know whether we have time for any of these options for 1.19. |
cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
|
Do you feel we're ready to add imports in |
|
@jbj The intent is for interprocedural field flow to be based on the Java implementation, but not in 1.19. I expect it will take a bit of work to adapt it to the more complicated aliasing of C++. |
|
I'm chasing up on 1.19 PRs. Is this ready to go in soon? Do you still plan to make other PRs based on it for 1.19, as mentioned in the PR description? |
I've separated the model interface for memory side effects from the model for escaped addresses. It will be fairly common for a given model to extend both interfaces, but they are used for two different purposes. I've also put each model interface and the non-member predicates that query it into a named module, which seemed cleaner than having predicates named `functionModelReadsMemory()` and `getFunctionModelParameterAliasBehavior()`.
|
@adityasharad This PR is almost ready to go in, but I do expect another couple related PRs over the next week or so. Ping me on Slack if you want more details. |
This sort of fixes one FP and causes a new FN, but for the wrong reasons. The IR dataflow is tracking the reference itself, rather than the referred-to object. Once we can better model indirections, we can make this work correctly. This change is still the right thing to do, because it ensures that the dataflow is looking at actual expression being computed by the instruction.
cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
|
|
||
| /** Gets the expression corresponding to this node, if any. */ | ||
| Expr asExpr() { result = this.getUnconvertedResultExpression() } | ||
| Expr asExpr() { result = this.getConvertedResultExpression() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree a predicate with this body is technically the most precise, but I think it's for advanced users only. It could be named asConvertedExpr, perhaps.
I don't think we should generally expect users of asExpr to call getFullyConverted. It's too easy to forget, and it's different from the current data flow library (and all other libraries except range analysis). Can we make it so e = i.asExpr() holds if i is the getConvertedResultExpression of e.getFullyConverted()? I think what I'm proposing is to write
Expr asExpr() {
result.getFullyConverted() = this.getConvertedResultExpression() and
not result instanceof Conversion
}
Expr asConvertedExpr() { result = this.getConvertedResultExpression() }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we had:
short func(int& r) {
return (short)r;
}
There are three Exprs of interest: the VariableAccess of r, the ReferenceDeref, and the Cast to short. Would there be three ExprNodes, with the asExpr() for all three returning the same VariableAccess? That seems confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly what I'm proposing. I think in your example it's what most users will want, keeping in mind that the data flow library is often the first thing novice users reach for, long before learning about conversions. Users that come from our other languages or from the current data flow library will expect it to "just work" when they mark the VariableAccess to r as a source.
If we turn the argument around, is there then a situation where the user would get unexpected false positives due to the behaviour I'm proposing?
I agree that there should be a predicate available that models conversions accurately, but that would be for advanced users only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case that causes the most trouble is references. In the example above, the ExprNodes for both the VariableAccess (the reference) and the result of dereferencing the reference would have the same asExpr() (the VariableAccess). In any scenario I can think of, one or the other of these is incorrect, and would lead to false positives. This is what caused the FP that I fixed with commit 2b80aee.
With the current behavior in this PR, in the absence of references, user code can just look at the ExprNodes for non-conversion expressions, and the fact that IR-based dataflow treats all Convert instructions as flow should take care of the conversion expressions. If the user code needs to match expressions that involve references, then it still has to distinguish between the reference and the referred-to value, but I don't think that your proposal makes that any easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so concerned about false positives in this case. First, I don't think it'll actually happen in practice. If a sink has type int, then a value of type int& should not be able to reach it and vice versa. If the sink itself is a lvalue-to-reference conversion, then it might happen that both the value and the reference can flow there, but what's the harm in that? Second, false negatives are much worse than false positives in security queries. Queries in our standard suite should not use the asExpr that I'm proposing but should use asConvertedExpr.
With the behaviour of asExpr you propose, I don't know how to write the QLDoc to recommend to users whether they should call getFullyConverted() or not. They should certainly do it in your return (short)r; example. In the case of array-to-pointer conversions they probably shouldn't if the tainted data is in the array itself. To fully understand whether to call getFullyConverted, they'd need to understand our Conversion system, and I wouldn't expect even 10% of non-Semmle QL users to understand that.
Thinking about it more, I can see it's too extreme to let asExpr include any conversion. It's really only conversions between values and pointers/references that we want. That's array-to-pointer, value-to-reference, and reference-to-value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should match the existing behaviour. As I remember it, it’s completely unaware of conversions. That means there is a Node for each Conversion, but such nodes have no flow from them or to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I say "match the existing behaviour" I really mean allow the same kind of caller to work the same way.
If it seems too confusing to have multiple instructions map to the same Expr, we can also prescribe one of those instructions to be the right choice. Unless you can think of reasonable use cases for tracking the reference, rather than its contents, we can make it the default choice to track contents. I think it's easier to state what I mean in terms of DataFlow::exprNode, which would be a total and injective function from non-conversions to Node.
- If a non-conversion
ehas aReferenceDereferenceExpr cdirectly on it, thenexprNode(e)is the instruction ofc. As far as I can tell,ReferenceDereferenceExpralways comes first in the chain of conversions. - If a non-conversion
ehas anArrayToPointerConversion cin its conversion chain, thenexprNode(e)is the instruction ofc.getExpr(). - If a non-conversion
ehas aReferenceToExpr cin its conversion chain, thenexprNode(e)is the instruction ofc.getExpr().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with your suggestion. However, as I started to implement the predicate to select the "appropriate" Expr, I hit enough corner cases that I'm not comfortable merging that change for 1.19. For example:
int a[10];
int* p = (int (&)[10])(float (&)[10])a;
I believe that expression has both an ArrayToPointerConversion and a ReferenceDereferenceExpr, with the ReferenceDereferenceExpr not being the innermost Conversion.
For now, I've simply stopped creating an ExprNode for a Conversion at all, which should be close to the AST-based behavior. I've opened CPP-318 to do the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a terrifying edge case you have there.
I don't see how it helps to stop creating ExprNode for Conversions. Callers that use this library like the current data flow library, never calling getFullyConverted, will get the same result as before. Callers that know about conversions will get worse results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've reverted that commit.
cpp/ql/src/semmle/code/cpp/models/implementations/IdentityFunction.qll
Outdated
Show resolved
Hide resolved
|
In addition to our long thread about
|
|
I've added this library to ImportAdditionalLibraries.ql. |
rdmarsh2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe all feedback has been addressed at this point.
jbj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this merged for 1.19.0, and then we can address CPP-318 after that.
Update dbscheme to add table for variadic signature types
I suggest reviewing this PR commit-by-commit.
This PR adds a new dataflow library that operates on the IR rather than ASTs. The first commit was mostly @rdmarsh2's work, and I just brought it up-to-date with the latest IR API. The remaining commits are fixes that improve the dataflow results.
The new library shares much of the code with the existing C++ AST dataflow and Java dataflow libraries via the identical-files.json mechanism. The main difference is that in the IR-based dataflow library,
DataFlow::NodeextendsInstruction, since all three of the node types from the AST version (Expr, Parameter, and Uninitialized) are all just different instructions in the IR.The API that's exposed to consumers of the dataflow library is pretty much source-compatible with the AST dataflow library. Consumers do not have to use the IR directly to write a dataflow query; they can still define sources and sinks in terms of
Expr,Parameter, etc.All local flow is done just by following SSA def-use links, so there's no need for all the LocalFlow stuff.
There are still several diffs in the results compared to the AST version. Significant known weaknesses include:
There will be more PRs on this for 1.19, but this gets us something that is basically working.