Skip to content

Conversation

@dave-bartolomeo
Copy link
Contributor

@dave-bartolomeo dave-bartolomeo commented Nov 30, 2018

I suggest reviewing this PR commit-by-commit.

This PR adds a new dataflow library that operates on the IR rather than ASTs. The first commit was mostly @rdmarsh2's work, and I just brought it up-to-date with the latest IR API. The remaining commits are fixes that improve the dataflow results.

The new library shares much of the code with the existing C++ AST dataflow and Java dataflow libraries via the identical-files.json mechanism. The main difference is that in the IR-based dataflow library, DataFlow::Node extends Instruction, since all three of the node types from the AST version (Expr, Parameter, and Uninitialized) are all just different instructions in the IR.

The API that's exposed to consumers of the dataflow library is pretty much source-compatible with the AST dataflow library. Consumers do not have to use the IR directly to write a dataflow query; they can still define sources and sinks in terms of Expr, Parameter, etc.

All local flow is done just by following SSA def-use links, so there's no need for all the LocalFlow stuff.

There are still several diffs in the results compared to the AST version. Significant known weaknesses include:

  • The IR version does not ignore unreachable code.
  • There's still work to do to handle flow through library functions that lack definitions.
  • Handling of C++ reference types is different from the AST version. The AST version is inconsistent here, but sometimes it get things right by accident. The IR version needs more work in the Aliased SSA component to handle these indirection via references better.

There will be more PRs on this for 1.19, but this gets us something that is basically working.

The AST dataflow library essentially ignores conversions, which is probably the right behavior. Converting an `int` to a `long` preserves the value, even if the bit pattern might be different. It's arguable whether narrowing conversions should be treated as dataflow, but we'll do so for now. We can revisit that if we see it cause problems.
This fixes a subtle bug in the construction of aliased SSA. `getResultMemoryAccess` was failing to return a `MemoryAccess` for a store to a variable whose address escaped. This is because no `VirtualIRVariable` was being created for such variables. The code was assuming that any access to such a variable would be via `UnknownMemoryAccess`. The result is that accesses to such variables were not being modeled in SSA at all.

Instead, the way to handle this is to have a `VariableMemoryAccess` even when the variable being accessed has escaped, and to have `VariableMemoryAccess::getVirtualVariable()` return the `UnknownVirtualVariable` for escaped variables. In the future, this will also let us be less conservative about inserting `Chi` nodes, because we'll be able to determine that there's an exact overlap between two accesses to the same escaped variable in some cases.
This commit adds a new model interface that describes the known side effects (or lack thereof) of a library function. Does it read memory, does it write memory, and do any of its parameters escape? Initially, we have models for just two Standard Library functions: `std::move` and `std::forward`, which neither read nor write memory, and do not escape their parameter.

IR construction has been updated to insert the correct side effect instruction (or no side effect instruction) based on the model.
@dave-bartolomeo dave-bartolomeo requested a review from a team as a code owner November 30, 2018 22:20
@dave-bartolomeo dave-bartolomeo added this to the 1.19 milestone Nov 30, 2018
Copy link
Contributor

@rdmarsh2 rdmarsh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits in the side-effect-free modeling library

abstract predicate writesMemory();

/**
* Holds if any address passed to the parameter at the specified index is retained after the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a caveat about storing vs returning addresses?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new API and comments clear this up.

* This memory could be from global variables, or from other memory that was reachable from a
* pointer that was passed into the function.
*/
predicate functionReadsMemory(Function f) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename these to indicate that they're only based on the model, not based on some deeper analysis, given that there's likely to be a library for that in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new API for this doesn't have the non-member predicates.

Made `Node::getType()`, `Node::asParameter()`, and `Node::asUninitialized()` operate directly on the IR. This actually fixed several diffs compared to the AST dataflow, because `getType()` wasn't holding for nodes that weren't `Exprs`.

Made `Uninitialized` a `VariableInstruction`. This makes it consistent with `InitializeParameter`.
@jbj
Copy link
Contributor

jbj commented Dec 2, 2018

Very exciting!

Do I understand it correctly that this initial PR does not use the recently added Chi instructions? And that it can't use those until they are more precise?

It would be great to have tests that (1) show the difference between the current library and this new one, and (2) demonstrate known false negatives or false positives. This doesn't have to block this PR.

@dave-bartolomeo
Copy link
Contributor Author

@jbj I don't expect most analyses, including this one, to actually do anything meaningful with Chi instructions directly. A Chi instruction more or less just says "something may or may not have changed this memory location", without any precise knowledge about what that potential change was. A Chi instruction is just what we insert when we know that something may have changed, but can't be more specific. As we make Chi insertion more precise, we'll insert fewer Chi instructions, and instead have more operands connected to an exact definition. That's how future improvements to Chi instructions will help this analysis (and all other SSA-based analyses).

@dave-bartolomeo
Copy link
Contributor Author

@jbj To respond to your testing suggestion, the test_ir.ql file I added for this is in the same directory as the AST-based dataflow tests, using the same test sources. The difference between test.expected and test_ir.expected is the difference in results between the two libraries. Were you suggesting something like a test_diff.ql that runs both AST- and IR-based dataflow, and prints out only those results that differ between the two?

@jbj
Copy link
Contributor

jbj commented Dec 3, 2018

Yes, my testing suggestion was to make a test_diff.ql like you describe. But perhaps it's not a big deal -- for the initial PR I can make a diff manually, and for subsequent PRs we'll see both expected files change, and it should be quite clear to see the difference between how they change.

@jbj
Copy link
Contributor

jbj commented Dec 3, 2018

I've hand-edited the output slightly and made a diff of how the current test output compares to the new test output. For each difference, I've hand-annotated it with the consequence and likely cause.

--- ast.expected	2018-12-03 14:05:42.000000000 +0100
+++ ir.expected	2018-12-03 14:14:21.000000000 +0100
@@ -2,19 +2,18 @@
 | test.cpp:9:8:9:9 | t1 | test.cpp:6:12:6:17 | call to source |
 | test.cpp:10:8:10:9 | t2 | test.cpp:6:12:6:17 | call to source |
 | test.cpp:15:8:15:9 | t2 | test.cpp:6:12:6:17 | call to source |
+| test.cpp:21:8:21:9 | t1 | test.cpp:6:12:6:17 | call to source | REGRESSION (lack of CFG pruning)
 | test.cpp:26:8:26:9 | t1 | test.cpp:6:12:6:17 | call to source |
 | test.cpp:30:8:30:8 | t | test.cpp:35:10:35:15 | call to source |
 | test.cpp:31:8:31:8 | c | test.cpp:36:13:36:18 | call to source |
 | test.cpp:58:10:58:10 | t | test.cpp:50:14:50:19 | call to source |
-| test.cpp:71:8:71:9 | x4 | test.cpp:66:30:66:36 | source1 | REGRESSION (`std::move` not special-cased)
 | test.cpp:76:8:76:9 | u1 | test.cpp:75:7:75:8 | u1 |
 | test.cpp:84:8:84:18 | ... ? ... : ... | test.cpp:83:7:83:8 | u2 |
 | test.cpp:86:8:86:9 | i1 | test.cpp:83:7:83:8 | u2 |
 | test.cpp:90:8:90:14 | source1 | test.cpp:89:28:89:34 | source1 |
-| test.cpp:103:10:103:12 | ref | test.cpp:100:13:100:18 | call to source | IMPROVEMENT
-| test.cpp:126:8:126:19 | sourceArray1 | test.cpp:120:9:120:20 | sourceArray1 | REGRESSION (why?)
-| test.cpp:137:27:137:28 | m1 | test.cpp:136:27:136:32 | call to source | REGRESSION (lack of flow through fields)
-| test.cpp:140:22:140:23 | m1 | test.cpp:136:27:136:32 | call to source | IMPROVEMENT (lack of flow through fields)
+| test.cpp:92:8:92:14 | source1 | test.cpp:89:28:89:34 | source1 | REGRESSION (lack of heap modelling)
+| test.cpp:132:22:132:23 | m1 | test.cpp:122:18:122:30 | sourceStruct1 | REGRESSION (flow from struct to field)
+| test.cpp:140:22:140:23 | m1 | test.cpp:122:18:122:30 | sourceStruct1 | REGRESSION (flow from struct to field)
 | test.cpp:188:8:188:8 | y | test.cpp:186:27:186:32 | call to source |
 | test.cpp:192:8:192:8 | s | test.cpp:199:33:199:38 | call to source |
 | test.cpp:200:8:200:8 | y | test.cpp:199:33:199:38 | call to source |
@@ -25,12 +24,13 @@
 | test.cpp:314:12:314:12 | x | test.cpp:313:22:313:27 | call to source |
 | test.cpp:337:14:337:14 | x | test.cpp:353:17:353:22 | call to source |
 | test.cpp:366:7:366:7 | x | test.cpp:362:4:362:9 | call to source |
-| test.cpp:397:10:397:18 | globalVar | test.cpp:395:17:395:22 | call to source | REGRESSION (lack of flow through globals)
-| test.cpp:423:10:423:14 | field | test.cpp:421:13:421:18 | call to source | REGRESSION (lack of flow through fields)
+| true_upon_entry.cpp:13:8:13:8 | x | true_upon_entry.cpp:9:11:9:16 | call to source | REGRESSION (lack of true-upon-entry support)
 | true_upon_entry.cpp:21:8:21:8 | x | true_upon_entry.cpp:17:11:17:16 | call to source |
 | true_upon_entry.cpp:29:8:29:8 | x | true_upon_entry.cpp:27:9:27:14 | call to source |
 | true_upon_entry.cpp:39:8:39:8 | x | true_upon_entry.cpp:33:11:33:16 | call to source |
 | true_upon_entry.cpp:49:8:49:8 | x | true_upon_entry.cpp:43:11:43:16 | call to source |
 | true_upon_entry.cpp:57:8:57:8 | x | true_upon_entry.cpp:54:11:54:16 | call to source |
+| true_upon_entry.cpp:66:8:66:8 | x | true_upon_entry.cpp:62:11:62:16 | call to source | REGRESSION (lack of true-upon-entry support)
 | true_upon_entry.cpp:78:8:78:8 | x | true_upon_entry.cpp:70:11:70:16 | call to source |
 | true_upon_entry.cpp:86:8:86:8 | x | true_upon_entry.cpp:83:11:83:16 | call to source |
+| true_upon_entry.cpp:105:8:105:8 | x | true_upon_entry.cpp:98:11:98:16 | call to source | IMPROVEMENT (lack of (imperfect) true-upon-entry support)

Let's discuss in the next stream meeting which regressions are most important to fix for 1.19.

@jbj
Copy link
Contributor

jbj commented Dec 3, 2018

I hope the lack of flow through fields can be fixed by using the support for flow through fields in the shared Java/C++ data-flow library. I know you also want to add field flow to the IR itself, but that won't give us interprocedural field flow. I don't know whether we have time for any of these options for 1.19.

@jbj
Copy link
Contributor

jbj commented Dec 3, 2018

Do you feel we're ready to add imports in ImportAdditionalLibraries.ql so we can use this new library on LGTM? It will only be accessible those who know about it ;).

@dave-bartolomeo
Copy link
Contributor Author

@jbj The intent is for interprocedural field flow to be based on the Java implementation, but not in 1.19. I expect it will take a bit of work to adapt it to the more complicated aliasing of C++.

@adityasharad
Copy link
Collaborator

I'm chasing up on 1.19 PRs. Is this ready to go in soon? Do you still plan to make other PRs based on it for 1.19, as mentioned in the PR description?

I've separated the model interface for memory side effects from the model for escaped addresses. It will be fairly common for a given model to extend both interfaces, but they are used for two different purposes.

I've also put each model interface and the non-member predicates that query it into a named module, which seemed cleaner than having predicates named `functionModelReadsMemory()` and `getFunctionModelParameterAliasBehavior()`.
@dave-bartolomeo
Copy link
Contributor Author

@adityasharad This PR is almost ready to go in, but I do expect another couple related PRs over the next week or so. Ping me on Slack if you want more details.

@dave-bartolomeo
Copy link
Contributor Author

@jbj @rdmarsh2 I think I've fixed all of the pending feedback.

This sort of fixes one FP and causes a new FN, but for the wrong reasons. The IR dataflow is tracking the reference itself, rather than the referred-to object. Once we can better model indirections, we can make this work correctly.

This change is still the right thing to do, because it ensures that the dataflow is looking at actual expression being computed by the instruction.

/** Gets the expression corresponding to this node, if any. */
Expr asExpr() { result = this.getUnconvertedResultExpression() }
Expr asExpr() { result = this.getConvertedResultExpression() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a predicate with this body is technically the most precise, but I think it's for advanced users only. It could be named asConvertedExpr, perhaps.

I don't think we should generally expect users of asExpr to call getFullyConverted. It's too easy to forget, and it's different from the current data flow library (and all other libraries except range analysis). Can we make it so e = i.asExpr() holds if i is the getConvertedResultExpression of e.getFullyConverted()? I think what I'm proposing is to write

Expr asExpr() {
  result.getFullyConverted() = this.getConvertedResultExpression() and
  not result instanceof Conversion
}

Expr asConvertedExpr() { result = this.getConvertedResultExpression() }

Copy link
Contributor Author

@dave-bartolomeo dave-bartolomeo Dec 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we had:

short func(int& r) {
  return (short)r;
}

There are three Exprs of interest: the VariableAccess of r, the ReferenceDeref, and the Cast to short. Would there be three ExprNodes, with the asExpr() for all three returning the same VariableAccess? That seems confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly what I'm proposing. I think in your example it's what most users will want, keeping in mind that the data flow library is often the first thing novice users reach for, long before learning about conversions. Users that come from our other languages or from the current data flow library will expect it to "just work" when they mark the VariableAccess to r as a source.

If we turn the argument around, is there then a situation where the user would get unexpected false positives due to the behaviour I'm proposing?

I agree that there should be a predicate available that models conversions accurately, but that would be for advanced users only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case that causes the most trouble is references. In the example above, the ExprNodes for both the VariableAccess (the reference) and the result of dereferencing the reference would have the same asExpr() (the VariableAccess). In any scenario I can think of, one or the other of these is incorrect, and would lead to false positives. This is what caused the FP that I fixed with commit 2b80aee.

With the current behavior in this PR, in the absence of references, user code can just look at the ExprNodes for non-conversion expressions, and the fact that IR-based dataflow treats all Convert instructions as flow should take care of the conversion expressions. If the user code needs to match expressions that involve references, then it still has to distinguish between the reference and the referred-to value, but I don't think that your proposal makes that any easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so concerned about false positives in this case. First, I don't think it'll actually happen in practice. If a sink has type int, then a value of type int& should not be able to reach it and vice versa. If the sink itself is a lvalue-to-reference conversion, then it might happen that both the value and the reference can flow there, but what's the harm in that? Second, false negatives are much worse than false positives in security queries. Queries in our standard suite should not use the asExpr that I'm proposing but should use asConvertedExpr.

With the behaviour of asExpr you propose, I don't know how to write the QLDoc to recommend to users whether they should call getFullyConverted() or not. They should certainly do it in your return (short)r; example. In the case of array-to-pointer conversions they probably shouldn't if the tainted data is in the array itself. To fully understand whether to call getFullyConverted, they'd need to understand our Conversion system, and I wouldn't expect even 10% of non-Semmle QL users to understand that.

Thinking about it more, I can see it's too extreme to let asExpr include any conversion. It's really only conversions between values and pointers/references that we want. That's array-to-pointer, value-to-reference, and reference-to-value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should match the existing behaviour. As I remember it, it’s completely unaware of conversions. That means there is a Node for each Conversion, but such nodes have no flow from them or to them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I say "match the existing behaviour" I really mean allow the same kind of caller to work the same way.

If it seems too confusing to have multiple instructions map to the same Expr, we can also prescribe one of those instructions to be the right choice. Unless you can think of reasonable use cases for tracking the reference, rather than its contents, we can make it the default choice to track contents. I think it's easier to state what I mean in terms of DataFlow::exprNode, which would be a total and injective function from non-conversions to Node.

  1. If a non-conversion e has a ReferenceDereferenceExpr c directly on it, then exprNode(e) is the instruction of c. As far as I can tell, ReferenceDereferenceExpr always comes first in the chain of conversions.
  2. If a non-conversion e has an ArrayToPointerConversion c in its conversion chain, then exprNode(e) is the instruction of c.getExpr().
  3. If a non-conversion e has a ReferenceToExpr c in its conversion chain, then exprNode(e) is the instruction of c.getExpr().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with your suggestion. However, as I started to implement the predicate to select the "appropriate" Expr, I hit enough corner cases that I'm not comfortable merging that change for 1.19. For example:

int a[10];
int* p = (int (&)[10])(float (&)[10])a;

I believe that expression has both an ArrayToPointerConversion and a ReferenceDereferenceExpr, with the ReferenceDereferenceExpr not being the innermost Conversion.

For now, I've simply stopped creating an ExprNode for a Conversion at all, which should be close to the AST-based behavior. I've opened CPP-318 to do the right thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a terrifying edge case you have there.

I don't see how it helps to stop creating ExprNode for Conversions. Callers that use this library like the current data flow library, never calling getFullyConverted, will get the same result as before. Callers that know about conversions will get worse results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've reverted that commit.

@jbj
Copy link
Contributor

jbj commented Dec 10, 2018

In addition to our long thread about asExpr, I also had this question.

Do you feel we're ready to add imports in ImportAdditionalLibraries.ql so we can use this new library on LGTM? It will only be accessible those who know about it ;).

@dave-bartolomeo
Copy link
Contributor Author

I've added this library to ImportAdditionalLibraries.ql.

Copy link
Contributor

@rdmarsh2 rdmarsh2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe all feedback has been addressed at this point.

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this merged for 1.19.0, and then we can address CPP-318 after that.

@adityasharad adityasharad merged commit ce905e7 into github:rc/1.19 Dec 11, 2018
cklin pushed a commit that referenced this pull request May 23, 2022
Update dbscheme to add table for variadic signature types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants