C++: Initial IR-based dataflow implementation #597

dave-bartolomeo · 2018-11-30T22:20:22Z

I suggest reviewing this PR commit-by-commit.

This PR adds a new dataflow library that operates on the IR rather than ASTs. The first commit was mostly @rdmarsh2's work, and I just brought it up-to-date with the latest IR API. The remaining commits are fixes that improve the dataflow results.

The new library shares much of the code with the existing C++ AST dataflow and Java dataflow libraries via the identical-files.json mechanism. The main difference is that in the IR-based dataflow library, DataFlow::Node extends Instruction, since all three of the node types from the AST version (Expr, Parameter, and Uninitialized) are all just different instructions in the IR.

The API that's exposed to consumers of the dataflow library is pretty much source-compatible with the AST dataflow library. Consumers do not have to use the IR directly to write a dataflow query; they can still define sources and sinks in terms of Expr, Parameter, etc.

All local flow is done just by following SSA def-use links, so there's no need for all the LocalFlow stuff.

There are still several diffs in the results compared to the AST version. Significant known weaknesses include:

The IR version does not ignore unreachable code.
There's still work to do to handle flow through library functions that lack definitions.
Handling of C++ reference types is different from the AST version. The AST version is inconsistent here, but sometimes it get things right by accident. The IR version needs more work in the Aliased SSA component to handle these indirection via references better.

There will be more PRs on this for 1.19, but this gets us something that is basically working.

The AST dataflow library essentially ignores conversions, which is probably the right behavior. Converting an `int` to a `long` preserves the value, even if the bit pattern might be different. It's arguable whether narrowing conversions should be treated as dataflow, but we'll do so for now. We can revisit that if we see it cause problems.

This fixes a subtle bug in the construction of aliased SSA. `getResultMemoryAccess` was failing to return a `MemoryAccess` for a store to a variable whose address escaped. This is because no `VirtualIRVariable` was being created for such variables. The code was assuming that any access to such a variable would be via `UnknownMemoryAccess`. The result is that accesses to such variables were not being modeled in SSA at all. Instead, the way to handle this is to have a `VariableMemoryAccess` even when the variable being accessed has escaped, and to have `VariableMemoryAccess::getVirtualVariable()` return the `UnknownVirtualVariable` for escaped variables. In the future, this will also let us be less conservative about inserting `Chi` nodes, because we'll be able to determine that there's an exact overlap between two accesses to the same escaped variable in some cases.

This commit adds a new model interface that describes the known side effects (or lack thereof) of a library function. Does it read memory, does it write memory, and do any of its parameters escape? Initially, we have models for just two Standard Library functions: `std::move` and `std::forward`, which neither read nor write memory, and do not escape their parameter. IR construction has been updated to insert the correct side effect instruction (or no side effect instruction) based on the model.

rdmarsh2

Some minor nits in the side-effect-free modeling library

rdmarsh2 · 2018-12-01T00:29:07Z

cpp/ql/src/semmle/code/cpp/models/interfaces/SideEffectFunction.qll

+  abstract predicate writesMemory();
+
+  /**
+   * Holds if any address passed to the parameter at the specified index is retained after the


Add a caveat about storing vs returning addresses?

I think the new API and comments clear this up.

rdmarsh2 · 2018-12-01T00:52:10Z

cpp/ql/src/semmle/code/cpp/models/interfaces/SideEffectFunction.qll

+  * This memory could be from global variables, or from other memory that was reachable from a
+  * pointer that was passed into the function.
+  */
+predicate functionReadsMemory(Function f) {


I would rename these to indicate that they're only based on the model, not based on some deeper analysis, given that there's likely to be a library for that in the future.

The new API for this doesn't have the non-member predicates.

Made `Node::getType()`, `Node::asParameter()`, and `Node::asUninitialized()` operate directly on the IR. This actually fixed several diffs compared to the AST dataflow, because `getType()` wasn't holding for nodes that weren't `Exprs`. Made `Uninitialized` a `VariableInstruction`. This makes it consistent with `InitializeParameter`.

jbj · 2018-12-02T12:44:18Z

Very exciting!

Do I understand it correctly that this initial PR does not use the recently added Chi instructions? And that it can't use those until they are more precise?

It would be great to have tests that (1) show the difference between the current library and this new one, and (2) demonstrate known false negatives or false positives. This doesn't have to block this PR.

dave-bartolomeo · 2018-12-03T06:18:16Z

@jbj I don't expect most analyses, including this one, to actually do anything meaningful with Chi instructions directly. A Chi instruction more or less just says "something may or may not have changed this memory location", without any precise knowledge about what that potential change was. A Chi instruction is just what we insert when we know that something may have changed, but can't be more specific. As we make Chi insertion more precise, we'll insert fewer Chi instructions, and instead have more operands connected to an exact definition. That's how future improvements to Chi instructions will help this analysis (and all other SSA-based analyses).

dave-bartolomeo · 2018-12-03T06:20:47Z

@jbj To respond to your testing suggestion, the test_ir.ql file I added for this is in the same directory as the AST-based dataflow tests, using the same test sources. The difference between test.expected and test_ir.expected is the difference in results between the two libraries. Were you suggesting something like a test_diff.ql that runs both AST- and IR-based dataflow, and prints out only those results that differ between the two?

jbj · 2018-12-03T13:02:44Z

Yes, my testing suggestion was to make a test_diff.ql like you describe. But perhaps it's not a big deal -- for the initial PR I can make a diff manually, and for subsequent PRs we'll see both expected files change, and it should be quite clear to see the difference between how they change.

jbj · 2018-12-03T13:40:01Z

I've hand-edited the output slightly and made a diff of how the current test output compares to the new test output. For each difference, I've hand-annotated it with the consequence and likely cause.

--- ast.expected	2018-12-03 14:05:42.000000000 +0100
+++ ir.expected	2018-12-03 14:14:21.000000000 +0100
@@ -2,19 +2,18 @@
 | test.cpp:9:8:9:9 | t1 | test.cpp:6:12:6:17 | call to source |
 | test.cpp:10:8:10:9 | t2 | test.cpp:6:12:6:17 | call to source |
 | test.cpp:15:8:15:9 | t2 | test.cpp:6:12:6:17 | call to source |
+| test.cpp:21:8:21:9 | t1 | test.cpp:6:12:6:17 | call to source | REGRESSION (lack of CFG pruning)
 | test.cpp:26:8:26:9 | t1 | test.cpp:6:12:6:17 | call to source |
 | test.cpp:30:8:30:8 | t | test.cpp:35:10:35:15 | call to source |
 | test.cpp:31:8:31:8 | c | test.cpp:36:13:36:18 | call to source |
 | test.cpp:58:10:58:10 | t | test.cpp:50:14:50:19 | call to source |
-| test.cpp:71:8:71:9 | x4 | test.cpp:66:30:66:36 | source1 | REGRESSION (`std::move` not special-cased)
 | test.cpp:76:8:76:9 | u1 | test.cpp:75:7:75:8 | u1 |
 | test.cpp:84:8:84:18 | ... ? ... : ... | test.cpp:83:7:83:8 | u2 |
 | test.cpp:86:8:86:9 | i1 | test.cpp:83:7:83:8 | u2 |
 | test.cpp:90:8:90:14 | source1 | test.cpp:89:28:89:34 | source1 |
-| test.cpp:103:10:103:12 | ref | test.cpp:100:13:100:18 | call to source | IMPROVEMENT
-| test.cpp:126:8:126:19 | sourceArray1 | test.cpp:120:9:120:20 | sourceArray1 | REGRESSION (why?)
-| test.cpp:137:27:137:28 | m1 | test.cpp:136:27:136:32 | call to source | REGRESSION (lack of flow through fields)
-| test.cpp:140:22:140:23 | m1 | test.cpp:136:27:136:32 | call to source | IMPROVEMENT (lack of flow through fields)
+| test.cpp:92:8:92:14 | source1 | test.cpp:89:28:89:34 | source1 | REGRESSION (lack of heap modelling)
+| test.cpp:132:22:132:23 | m1 | test.cpp:122:18:122:30 | sourceStruct1 | REGRESSION (flow from struct to field)
+| test.cpp:140:22:140:23 | m1 | test.cpp:122:18:122:30 | sourceStruct1 | REGRESSION (flow from struct to field)
 | test.cpp:188:8:188:8 | y | test.cpp:186:27:186:32 | call to source |
 | test.cpp:192:8:192:8 | s | test.cpp:199:33:199:38 | call to source |
 | test.cpp:200:8:200:8 | y | test.cpp:199:33:199:38 | call to source |
@@ -25,12 +24,13 @@
 | test.cpp:314:12:314:12 | x | test.cpp:313:22:313:27 | call to source |
 | test.cpp:337:14:337:14 | x | test.cpp:353:17:353:22 | call to source |
 | test.cpp:366:7:366:7 | x | test.cpp:362:4:362:9 | call to source |
-| test.cpp:397:10:397:18 | globalVar | test.cpp:395:17:395:22 | call to source | REGRESSION (lack of flow through globals)
-| test.cpp:423:10:423:14 | field | test.cpp:421:13:421:18 | call to source | REGRESSION (lack of flow through fields)
+| true_upon_entry.cpp:13:8:13:8 | x | true_upon_entry.cpp:9:11:9:16 | call to source | REGRESSION (lack of true-upon-entry support)
 | true_upon_entry.cpp:21:8:21:8 | x | true_upon_entry.cpp:17:11:17:16 | call to source |
 | true_upon_entry.cpp:29:8:29:8 | x | true_upon_entry.cpp:27:9:27:14 | call to source |
 | true_upon_entry.cpp:39:8:39:8 | x | true_upon_entry.cpp:33:11:33:16 | call to source |
 | true_upon_entry.cpp:49:8:49:8 | x | true_upon_entry.cpp:43:11:43:16 | call to source |
 | true_upon_entry.cpp:57:8:57:8 | x | true_upon_entry.cpp:54:11:54:16 | call to source |
+| true_upon_entry.cpp:66:8:66:8 | x | true_upon_entry.cpp:62:11:62:16 | call to source | REGRESSION (lack of true-upon-entry support)
 | true_upon_entry.cpp:78:8:78:8 | x | true_upon_entry.cpp:70:11:70:16 | call to source |
 | true_upon_entry.cpp:86:8:86:8 | x | true_upon_entry.cpp:83:11:83:16 | call to source |
+| true_upon_entry.cpp:105:8:105:8 | x | true_upon_entry.cpp:98:11:98:16 | call to source | IMPROVEMENT (lack of (imperfect) true-upon-entry support)

Let's discuss in the next stream meeting which regressions are most important to fix for 1.19.

jbj · 2018-12-03T13:47:54Z

I hope the lack of flow through fields can be fixed by using the support for flow through fields in the shared Java/C++ data-flow library. I know you also want to add field flow to the IR itself, but that won't give us interprocedural field flow. I don't know whether we have time for any of these options for 1.19.

cpp/ql/src/semmle/code/cpp/ir/dataflow/DataFlow.qll

cpp/ql/src/semmle/code/cpp/ir/dataflow/DataFlow2.qll

cpp/ql/src/semmle/code/cpp/ir/dataflow/TaintTracking.qll

cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowPrivate.qll

cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

jbj · 2018-12-03T21:05:24Z

Do you feel we're ready to add imports in ImportAdditionalLibraries.ql so we can use this new library on LGTM? It will only be accessible those who know about it ;).

dave-bartolomeo · 2018-12-04T07:50:35Z

@jbj The intent is for interprocedural field flow to be based on the Java implementation, but not in 1.19. I expect it will take a bit of work to adapt it to the more complicated aliasing of C++.

adityasharad · 2018-12-05T11:10:24Z

I'm chasing up on 1.19 PRs. Is this ready to go in soon? Do you still plan to make other PRs based on it for 1.19, as mentioned in the PR description?

I've separated the model interface for memory side effects from the model for escaped addresses. It will be fairly common for a given model to extend both interfaces, but they are used for two different purposes. I've also put each model interface and the non-member predicates that query it into a named module, which seemed cleaner than having predicates named `functionModelReadsMemory()` and `getFunctionModelParameterAliasBehavior()`.

dave-bartolomeo · 2018-12-05T19:14:50Z

@adityasharad This PR is almost ready to go in, but I do expect another couple related PRs over the next week or so. Ping me on Slack if you want more details.

dave-bartolomeo · 2018-12-05T19:36:04Z

@jbj @rdmarsh2 I think I've fixed all of the pending feedback.

This sort of fixes one FP and causes a new FN, but for the wrong reasons. The IR dataflow is tracking the reference itself, rather than the referred-to object. Once we can better model indirections, we can make this work correctly. This change is still the right thing to do, because it ensures that the dataflow is looking at actual expression being computed by the instruction.

cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

jbj · 2018-12-06T08:40:14Z

cpp/ql/src/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll


  /** Gets the expression corresponding to this node, if any. */
-  Expr asExpr() { result = this.getUnconvertedResultExpression() }
+  Expr asExpr() { result = this.getConvertedResultExpression() }


I agree a predicate with this body is technically the most precise, but I think it's for advanced users only. It could be named asConvertedExpr, perhaps.

I don't think we should generally expect users of asExpr to call getFullyConverted. It's too easy to forget, and it's different from the current data flow library (and all other libraries except range analysis). Can we make it so e = i.asExpr() holds if i is the getConvertedResultExpression of e.getFullyConverted()? I think what I'm proposing is to write

Expr asExpr() { result.getFullyConverted() = this.getConvertedResultExpression() and not result instanceof Conversion } Expr asConvertedExpr() { result = this.getConvertedResultExpression() }

So if we had:

short func(int& r) { return (short)r; }

There are three Exprs of interest: the VariableAccess of r, the ReferenceDeref, and the Cast to short. Would there be three ExprNodes, with the asExpr() for all three returning the same VariableAccess? That seems confusing.

That's exactly what I'm proposing. I think in your example it's what most users will want, keeping in mind that the data flow library is often the first thing novice users reach for, long before learning about conversions. Users that come from our other languages or from the current data flow library will expect it to "just work" when they mark the VariableAccess to r as a source.

If we turn the argument around, is there then a situation where the user would get unexpected false positives due to the behaviour I'm proposing?

I agree that there should be a predicate available that models conversions accurately, but that would be for advanced users only.

The case that causes the most trouble is references. In the example above, the ExprNodes for both the VariableAccess (the reference) and the result of dereferencing the reference would have the same asExpr() (the VariableAccess). In any scenario I can think of, one or the other of these is incorrect, and would lead to false positives. This is what caused the FP that I fixed with commit 2b80aee.

With the current behavior in this PR, in the absence of references, user code can just look at the ExprNodes for non-conversion expressions, and the fact that IR-based dataflow treats all Convert instructions as flow should take care of the conversion expressions. If the user code needs to match expressions that involve references, then it still has to distinguish between the reference and the referred-to value, but I don't think that your proposal makes that any easier.

I'm not so concerned about false positives in this case. First, I don't think it'll actually happen in practice. If a sink has type int, then a value of type int& should not be able to reach it and vice versa. If the sink itself is a lvalue-to-reference conversion, then it might happen that both the value and the reference can flow there, but what's the harm in that? Second, false negatives are much worse than false positives in security queries. Queries in our standard suite should not use the asExpr that I'm proposing but should use asConvertedExpr.

With the behaviour of asExpr you propose, I don't know how to write the QLDoc to recommend to users whether they should call getFullyConverted() or not. They should certainly do it in your return (short)r; example. In the case of array-to-pointer conversions they probably shouldn't if the tainted data is in the array itself. To fully understand whether to call getFullyConverted, they'd need to understand our Conversion system, and I wouldn't expect even 10% of non-Semmle QL users to understand that.

Thinking about it more, I can see it's too extreme to let asExpr include any conversion. It's really only conversions between values and pointers/references that we want. That's array-to-pointer, value-to-reference, and reference-to-value.

I agree we should match the existing behaviour. As I remember it, it’s completely unaware of conversions. That means there is a Node for each Conversion, but such nodes have no flow from them or to them.

When I say "match the existing behaviour" I really mean allow the same kind of caller to work the same way.

If it seems too confusing to have multiple instructions map to the same Expr, we can also prescribe one of those instructions to be the right choice. Unless you can think of reasonable use cases for tracking the reference, rather than its contents, we can make it the default choice to track contents. I think it's easier to state what I mean in terms of DataFlow::exprNode, which would be a total and injective function from non-conversions to Node.

If a non-conversion e has a ReferenceDereferenceExpr c directly on it, then exprNode(e) is the instruction of c. As far as I can tell, ReferenceDereferenceExpr always comes first in the chain of conversions.

If a non-conversion e has an ArrayToPointerConversion c in its conversion chain, then exprNode(e) is the instruction of c.getExpr().

If a non-conversion e has a ReferenceToExpr c in its conversion chain, then exprNode(e) is the instruction of c.getExpr().

I think I agree with your suggestion. However, as I started to implement the predicate to select the "appropriate" Expr, I hit enough corner cases that I'm not comfortable merging that change for 1.19. For example:

int a[10]; int* p = (int (&)[10])(float (&)[10])a;

I believe that expression has both an ArrayToPointerConversion and a ReferenceDereferenceExpr, with the ReferenceDereferenceExpr not being the innermost Conversion.

For now, I've simply stopped creating an ExprNode for a Conversion at all, which should be close to the AST-based behavior. I've opened CPP-318 to do the right thing.

What a terrifying edge case you have there.

I don't see how it helps to stop creating ExprNode for Conversions. Callers that use this library like the current data flow library, never calling getFullyConverted, will get the same result as before. Callers that know about conversions will get worse results.

OK, I've reverted that commit.

cpp/ql/src/semmle/code/cpp/models/implementations/IdentityFunction.qll

cpp/ql/src/semmle/code/cpp/models/interfaces/Alias.qll

cpp/ql/src/semmle/code/cpp/models/interfaces/SideEffect.qll

jbj · 2018-12-10T21:42:35Z

In addition to our long thread about asExpr, I also had this question.

Do you feel we're ready to add imports in ImportAdditionalLibraries.ql so we can use this new library on LGTM? It will only be accessible those who know about it ;).

This reverts commit df882a9.

dave-bartolomeo · 2018-12-10T23:10:46Z

I've added this library to ImportAdditionalLibraries.ql.

rdmarsh2

I believe all feedback has been addressed at this point.

jbj

Let's get this merged for 1.19.0, and then we can address CPP-318 after that.

Update dbscheme to add table for variadic signature types

dave-bartolomeo added 4 commits November 30, 2018 12:15

C++: IR-based dataflow

58f7596

dave-bartolomeo requested a review from a team as a code owner November 30, 2018 22:20

dave-bartolomeo added the C++ label Nov 30, 2018

dave-bartolomeo added this to the 1.19 milestone Nov 30, 2018

dave-bartolomeo requested review from jbj and rdmarsh2 November 30, 2018 22:22

rdmarsh2 reviewed Dec 1, 2018

View reviewed changes

C++: Add missing changes to test_ir.expected

2822d14

jbj reviewed Dec 3, 2018

View reviewed changes

C++: Fix IR Dataflow PR feedback

e11b4b6

C++: Remove StoreDestinationAsPostUpdateNode

e8efb32

jbj reviewed Dec 6, 2018

View reviewed changes

dave-bartolomeo added 2 commits December 6, 2018 12:35

C++: Simplify models for side effects and alias info.

84b39bf

C++: Fix PR feedback

ebbd701

dave-bartolomeo mentioned this pull request Dec 10, 2018

C++: Remove unreachable IR #648

Merged

C++: Avoid creating ExprNodes for Conversions

df882a9

dave-bartolomeo added 2 commits December 10, 2018 15:06

Revert "C++: Avoid creating ExprNodes for Conversions"

2399371

This reverts commit df882a9.

C++: Add IR dataflow to ImportAdditionalQueries.ql

78e5b3a

rdmarsh2 approved these changes Dec 10, 2018

View reviewed changes

jbj approved these changes Dec 11, 2018

View reviewed changes

adityasharad merged commit ce905e7 into github:rc/1.19 Dec 11, 2018

jbj mentioned this pull request Jan 16, 2019

C++: Split DataFlow::Node.asExpr into two predicates #774

Merged

cklin pushed a commit that referenced this pull request May 23, 2022

Merge pull request #597 from owen-mc/var-args

6d90b81

Update dbscheme to add table for variadic signature types

C++: Initial IR-based dataflow implementation #597

C++: Initial IR-based dataflow implementation #597

Uh oh!

Conversation

dave-bartolomeo commented Nov 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rdmarsh2 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jbj commented Dec 2, 2018

Uh oh!

dave-bartolomeo commented Dec 3, 2018

Uh oh!

dave-bartolomeo commented Dec 3, 2018

Uh oh!

jbj commented Dec 3, 2018

Uh oh!

jbj commented Dec 3, 2018

Uh oh!

jbj commented Dec 3, 2018

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jbj commented Dec 3, 2018

Uh oh!

dave-bartolomeo commented Dec 4, 2018

Uh oh!

adityasharad commented Dec 5, 2018

Uh oh!

dave-bartolomeo commented Dec 5, 2018

Uh oh!

dave-bartolomeo commented Dec 5, 2018

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dave-bartolomeo Dec 9, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jbj commented Dec 10, 2018

Uh oh!

dave-bartolomeo commented Dec 10, 2018

Uh oh!

rdmarsh2 left a comment

dave-bartolomeo commented Nov 30, 2018 •

edited

Loading

dave-bartolomeo Dec 9, 2018 •

edited

Loading