C++: New range analysis #633

rdmarsh2 · 2018-12-06T23:52:35Z

This PR adds an IR-based port of the Java range analysis library, which has more precision than the existing SimpleRangeAnalysis and supports bounds relative to other IR instructions and operands. The library will need more correctness tests as well as some performance testing before it's ready to merge.

The library currently only reports bounds relative to Operands and Instructions. A mapping layer to support the AST classes may be added in a followup PR.

A few features of the Java library are currently missing from the C++ reimplementation. The Java library supports range analysis across a small set of math functions. Those may be addressed as part of this PR or as a follow up PR. There is also a modulus analysis library used to improve precision in the Java range analysis, which will be ported in a separate PR.

kevinbackhouse · 2018-12-07T09:04:41Z

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeUtils.qll

+predicate backEdge(PhiInstruction phi, PhiOperand op) {
+  phi.getAnOperand() = op and
+  (
+    phi.getBlock().dominates(op.getPredecessorBlock())


I don't think this is correct if the control flow graph isn't reducible. It works for Java, because Java doesn't have arbitrary gotos, so the CFG is always reducible in Java.

cpp/ql/test/library-tests/rangeanalysis/rangeanalysis/RangeAnalysis.expected

jbj

Thanks for posting this PR early. This is only half of a review as I didn't get as far as reading the code in RangeAnalysis.qll.

jbj · 2018-12-07T15:54:09Z

cpp/ql/src/semmle/code/cpp/controlflow/IRGuards.qll

+     * return x;
+     * ```
+     */
+    cached predicate controlsEdge(ConditionalBranchInstruction branch, IRBlock succ, boolean testIsTrue) {


Does this need to be cached?

The "controls" part appears to be missing from the body of this predicate - as it's written it's only going to include inferences where the guard-successor leads directly to the phi-node. Consider e.g. if the example above gets augmented with an unrelated block between the x >= y edge and the phi-node:

if (x < y) { x = y; } else if (foo) { return 0; } return x;

Found the missing part, as per comment below. In that case, this predicate just has a misleading name.

renamed to controlsEdgeDirectly

jbj · 2018-12-07T17:33:22Z

cpp/ql/src/semmle/code/cpp/ir/implementation/aliased_ssa/Instruction.qll

    result = getAnOperand().(RightOperand).getDefinitionInstruction()
  }
+
+  final predicate hasOperands(Operand op1, Operand op2) {


QLDoc, please.

jbj · 2018-12-07T18:56:14Z

cpp/ql/src/semmle/code/cpp/rangeanalysis/Bound.qll

+  TBoundOperand(Operand o) {
+    o.getDefinitionInstruction().getResultType() instanceof IntegralType or
+    o.getDefinitionInstruction().getResultType() instanceof PointerType
+  }


If TBoundOperand should correspond to Java's TBoundExpr, then this includes too much. But at the time of writing this I haven't yet seen how you deal with the SsaReadPosition difference, so maybe that will explain it.

I had thought that bounding relative to Operand might give a tighter bound than Instruction in some cases involving guard conditions; I'm no longer convinced that that's true.

It also looks like TBoundInstruction includes too much if it's trying to correspond to TBoundSsa - it should likely exclude temporary variables that are introduced by the IR construction. Note, that the overall algorithm works by constructing a graph (that's all the *step* predicates), which should be fast as its size is likely linear in the size of the program, and the recursive calculation in the graph essentially calculates the transitive closure in the graph restricted to certain starting points, and the Bound class is exactly those starting points. So Bound should be as small as possible to maintain good performance, while including all the interesting nodes in the graph that one might want as a bound for something else, and all phi nodes as they are needed as bounds for themselves to get any sort of relevant bounds in loops.

Right. I think the set of interesting bounds in C++ is going to be bigger than in Java, but I haven't narrowed it down fully. Also, @dave-bartolomeo suggested that TBoundInstruction might be better as TBoundValueNumber. We'd probably need to go through ValueNumber anyway to match up cases where the allocation is done in the same function. I'm not sure how well it will work for cases where the bound is passed into the function as a field of an inparam.

I absolutely don't think value numbers in general should be part of Bound - that's going to be too big without any benefits. E.g. bounding anything by the value number of x+1 is never going to be useful, as it's always better to bound directly by x.

Right, sorry, just now read the commit with the code change above to use value numbers. When restricted to atomic things, then it looks like it might be fine. I'm slightly concerned by the i.getAUse() instanceof ArgumentOperand disjunct - doesn't that open the door for valuenumbers of things like x+1?

Yes; Consider the following code:

int *foo(int x) { int *arr = new int[x*2]; for(int i = 0; i < x*2; i++) { arr[i] = i; // we need to be able to relate i to the size of arr, which is x*2 } return arr; }

I think we will need to be able to have bounds relative to arbitrary ValueNumbers that are used as arguments to allocating functions, and given that C++ allows custom allocators, I don't think we'll be able to do that without allowing any ValueNumber used as an argument to be a bound.

Ok, I can see why it can be useful. Then let me just say that the size of the Bound type and whether a smaller set of valuenumbers can be used here is something to potentially revisit if performance becomes a problem.

I'm starting on performance testing; it looks like the main stage of range analysis is about 50% slower with the i.getAUse() instanceof ArgumentOperand disjunct than without.

cpp/ql/src/semmle/code/cpp/rangeanalysis/Bound.qll

cpp/ql/test/library-tests/rangeanalysis/rangeanalysis/RangeAnalysis.ql

jbj · 2018-12-07T19:03:45Z

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

+ *
+ * An inferred bound can either be a specific integer, the abstract value of an
+ * SSA variable, or the abstract value of an interesting expression. The latter
+ * category includes array lengths that are not SSA variables.


This paragraph needs updating.

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

aschackmull · 2018-12-10T14:42:13Z

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

+ * - `isEq = true`  : `i == bound + delta`
+ * - `isEq = false` : `i != bound + delta`
+ */
+private IRGuardCondition eqFlowCond(Operand op, Operand bound, int delta,


Hmm, this predicate doesn't seem to be referenced from anywhere useful? Also, I don't think I understand what this predicate does - at the very least it doesn't seem to match the corresponding java predicate in any way.

it looks like I didn't finish implementing unequalOperand, which would use it indirectly.

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

aschackmull · 2018-12-10T16:18:53Z

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

+    (if r1 instanceof NoReason then reason = r2 else reason = r1)
+  )
+  or
+  boundedInstruction(op.getDefinitionInstruction(), b, delta, upper, fromBackEdge, origdelta, reason)


This case here combined with the op2.getDefinitionInstruction().getAnOperand().(CopySourceOperand) = op1 case in boundFlowStepSsa appears to duplicate some work.

Fixed by removing the case in boundFlowStepSsa

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

aschackmull · 2018-12-10T17:07:37Z

So, I've commented along the way as I read the code. I think the code could be streamlined a bit and made easier to understand, since I felt like some of the predicates got slightly confusing in the translation from the java implementation.
The general algorithm works by propagating bound-facts along the edges of a graph, and the predicates are roughly divided into two groups - those that define the edges of the graph and those that propagate and update bound-facts (the latter group is easy to identify as they all have a paramenter of type Bound). I think the former group of predicates (those defining the graph edges) can be made clearer.
To get a better grasp of this, we should consider exactly which nodes and which edges we need in the graph - and in particular avoid accidentally including edges that are unnecessary due to being the concatenation of two other edges.
In the java implementation there are 3 kinds of nodes: expressions, ssa variables, and ssa variable-ssa read position pairs. A read of an ssa variable (at some position) yield edges from ssa variables and ssa variable-read position pairs to expressions, most expressions yield expr-to-expr edges, ssa variable definitions yield edges from their defining expr to the variable, and guards comparing an ssa variable to some expr yield edges from that expr to all ssa variable-read position pairs that are corresponding ssa reads guarded by the condition.
In the C implementation there are as far as I've been able to gather just two kinds of nodes, Instruction and Operand. Edges go from operands to instructions, say in the case of an addition, or from an instruction to an operand (if the operand is defined by the instruction). Then there's guards, and I'm not sure how to factor those edges in in the best way, but it should likely result in an operand-to-operand edge.

I hope this makes sense. :-)

geoffw0 · 2018-12-12T12:26:27Z

cpp/ql/test/library-tests/rangeanalysis/rangeanalysis/RangeAnalysis.expected

+| test.cpp:37:6:37:10 | Load: begin | test.cpp:35:28:35:30 | InitializeParameter: end | 0 | true | CompareLT: ... < ... |
+| test.cpp:37:16:37:20 | Load: begin | test.cpp:35:28:35:30 | InitializeParameter: end | 0 | true | CompareLT: ... < ... |
+| test.cpp:38:5:38:11 | Load: ... ++ | test.cpp:35:28:35:30 | InitializeParameter: end | 0 | true | CompareLT: ... < ... |
+| test.cpp:44:13:44:13 | Load: y | test.cpp:42:29:42:29 | InitializeParameter: z | 0 | true | CompareLT: ... < ... |


Are your upper bounds non-inclusive? (otherwise the upper bound for y here appears to be z - 1)

~~Oh actually both bounds appear to be non-inclusive. That's consistent. No problem.~~

I think the rule is lower bound <= value < upper bound, can you confirm that?

The library ought to compute bounds that are inclusive. That goes for both upper and lower bounds.

Then the upper bound for the read of y on line 44 should be z - 1, right?

Yes, I believe so. There's a strengthening that's missing in boundFlowCond (I've also put a comment about it above in that predicate).

geoffw0 · 2018-12-12T12:35:10Z

cpp/ql/test/library-tests/rangeanalysis/rangeanalysis/RangeAnalysis.expected

+| test.cpp:16:9:16:9 | Load: y | test.cpp:14:15:14:15 | InitializeParameter: x | 1 | false | CompareLT: ... < ... |
+| test.cpp:18:9:18:9 | Load: x | test.cpp:14:22:14:22 | InitializeParameter: y | 0 | false | CompareLT: ... < ... |
+| test.cpp:20:10:20:10 | Load: x | test.cpp:14:15:14:15 | InitializeParameter: x | -2 | false | NoReason |
+| test.cpp:20:10:20:10 | Load: x | test.cpp:14:22:14:22 | InitializeParameter: y | -2 | false | CompareLT: ... < ... |


Impressive results here, took me a while to figure out why they're correct. 👍

cpp/ql/src/semmle/code/cpp/controlflow/IRGuards.qll

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

aschackmull · 2018-12-13T10:50:27Z

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

 * - `isEq = false` : `i != bound + delta`
 */
-private IRGuardCondition eqFlowCond(Operand op, Operand bound, int delta,
+private IRGuardCondition eqFlowCond(ValueNumber vn, Operand bound, int delta,


Much better!

aschackmull · 2018-12-13T11:01:30Z

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

+  exists(int d |
+    result.comparesLt(vn.getAUse(), bound, d, upper, testIsTrue) and
+    // strengthen from x < y to x <= y-1
+    if upper = true


This strengthening appears wrong. There are at least four cases to be handled with each of the true-false combinations of upper and testIsTrue. I'd think that in two of those cases we'd have delta = d, one case with delta = d-1, and one case with delta = d+1.

The C++ guards library's comparelt provides the same semantics as boundFlowCond but with a < rather than a <= when upper = true. I've added a test to demonstrate.

Maybe the comment ought to be expanded a bit here. I think I've convinced myself that the code is right, but it would probably help other readers if the transformations implicit in comparesLt were elaborated a bit here to better connect the dots.

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll

rdmarsh2 · 2018-12-15T03:11:46Z

I've retargeted to next for the moment because I need #648 for termination. Initial performance testing looks acceptable but not great; 4 minutes of evaluation after building IR on ChakraCore. Half of that time is before the main stage of the range analysis, which probably means there's performance improvements to be made in the sign analysis library. That time was with a 1.19 snapshot, my changes on top of next, and a 1.18 evaluator, so the performance is likely better with a 1.19 evaluator.

I've done a bit of glancing at stats on ChakraCore; about half of instructions have a bound, about a tenth of those have a non-self bound. There's 370 instructions after pruning by #648 that have an impossible pair of bounds (upper and lower bounds relative to the same value number, with the lower bound's delta > the upper bound's delta). I suspect that most of those are actually unreachable, but I haven't done a manual review yet.

aschackmull · 2019-01-07T10:29:26Z

We can't do anything useful with negation for relative bounds, but I could add something to handle it for constant bounds.

As you say, for constant bounds this is a fairly easy extension, which I've also considered adding on the java side (but haven't gotten around to actually implementing). We could also do something for relative bounds, but that requires extending the Bound class to include negated ssa variables.

However, some other differences are regressions where a type-derived bound is relevant after widening; line 371 in test.c is an example of this

Lower bounds of 0 for unsigned types might be relevant to add.

the algorithm we're using doesn't combine constant bounds from both operands of an arithmetic operation, so we won't get a constant upper bound on x + y in the following case

I originally excluded this in the java implementation, but it should be possible to add if it's relevant. Note that this needs some care to get proper performance as it involves joining two recursive calls, so a naive implementation will likely perform badly due to lots of #prev scanning. To get proper performance the technique that's used for conditional expressions should be used.

rdmarsh2 · 2019-01-07T18:33:58Z

@jbj and I discussed the possible constant bounds improvements on a call this morning; we would like to do those in the future and deprecate SimpleRangeAnalysis.qll, but we decided not to do them as part of this PR. I think I've addressed all feedback, and I'm removing the WIP label; Jonas and Schack should feel free to merge at this point.

jbj

LGTM, but some files are now in conflict. This might be because the master branch now reflects the 1.19 release, not the 1.18 release as it did yesterday. You should be able to sort it out with qltest --accept-output.

This reduces the number of bounds computed, and will simplify use of the library. The resulting locations in the tests may be slightly strange, because the example `Instruction` for a `ValueNumber` is the first appearing in the IR, regardless of source order, and may not be the most closely related `Instruction` to the bounded value. I think that's worth doing for the performance and usability benefits.

jbj · 2019-01-09T09:34:39Z

The tests failed. I'm guessing the failures come from the switch from 1.18 to 1.19, so please check them carefully as they may be symptoms of problems with 1.19.

rdmarsh2 · 2019-01-09T18:19:10Z

Sign analysis and range analysis changes look like improvements - I think they're related to better CFG pruning after the rebase. GVN changes also look like improvements, but I'm not sure what the change was. Alias analysis improvements, maybe?

rdmarsh2 · 2019-01-10T22:57:54Z

@jbj the failing test is qlcfg, but I didn't expect anything I've touched to affect it... Can you take a look?

jbj · 2019-01-11T07:31:38Z

That's a semantic merge conflict that's crept in because the submodule pointer is too old to include this test. I've opened #747 to fix it.

jbj

The test failure is unrelated to this PR, so I'll merge now.

Fix incorrect type name in database/sql model

rdmarsh2 requested a review from a team as a code owner December 6, 2018 23:52

rdmarsh2 requested review from aschackmull, dave-bartolomeo and jbj December 6, 2018 23:52

rdmarsh2 added the WIP This is a work-in-progress, do not merge yet! label Dec 6, 2018

kevinbackhouse reviewed Dec 7, 2018

View reviewed changes

kevinbackhouse mentioned this pull request Dec 7, 2018

C++: identify back-edges in the control flow graph. #639

Closed

geoffw0 reviewed Dec 7, 2018

View reviewed changes

cpp/ql/test/library-tests/rangeanalysis/rangeanalysis/RangeAnalysis.expected Outdated Show resolved Hide resolved

jbj reviewed Dec 7, 2018

View reviewed changes

aschackmull reviewed Dec 10, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 10, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 10, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 10, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 10, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Show resolved Hide resolved

aschackmull reviewed Dec 10, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

geoffw0 reviewed Dec 12, 2018

View reviewed changes

aschackmull reviewed Dec 12, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/controlflow/IRGuards.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 12, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 13, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 13, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

aschackmull reviewed Dec 13, 2018

View reviewed changes

cpp/ql/src/semmle/code/cpp/rangeanalysis/RangeAnalysis.qll Outdated Show resolved Hide resolved

rdmarsh2 changed the base branch from master to next December 14, 2018 22:58

rdmarsh2 force-pushed the rdmarsh/cpp/range-analysis branch from 91d41ca to 5c779c5 Compare December 14, 2018 23:01

rdmarsh2 added C++ and removed WIP This is a work-in-progress, do not merge yet! labels Jan 7, 2019

jbj previously approved these changes Jan 8, 2019

View reviewed changes

Robert Marsh added 12 commits January 8, 2019 09:34

C++: move SimpleRangeAnalysis tests

a06a20d

C++: Initial implementation of new range analysis

ed68f91

C++: fix/add comments

fe32aea

C++: respond to PR comments, add some TODOs

ae4ffd9

C++: respond to further PR comments

89148a9

C++: add test for false comparisons

b2cd9a2

C++: allow phi nodes to self-bound

567eee1

C++: performance and termination fixes

8c9c316

C++: respond to further PR comments

0040a2d

C++: expand a comment

af8a3f2

C++: change in-predicate comment formatting

c39de75

rdmarsh2 dismissed jbj’s stale review via c39de75 January 8, 2019 17:34

rdmarsh2 force-pushed the rdmarsh/cpp/range-analysis branch from 8f47d70 to c39de75 Compare January 8, 2019 17:34

C++: update test expectations

c455db9

jbj approved these changes Jan 11, 2019

View reviewed changes

jbj merged commit ef331ee into master Jan 11, 2019

jbj mentioned this pull request Jan 15, 2019

C++: Sync files that should be identical #771

Merged

jbj mentioned this pull request Jan 23, 2019

C++: IR back-edge detection based on TranslatedStmt #812

Merged

rdmarsh2 deleted the rdmarsh/cpp/range-analysis branch February 25, 2019 19:31

cklin pushed a commit that referenced this pull request May 23, 2022

Merge pull request #633 from owen-mc/database-sql-model-incorrect

89b2a2f

Fix incorrect type name in database/sql model

C++: New range analysis #633

C++: New range analysis #633

Uh oh!

Conversation

rdmarsh2 commented Dec 6, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jbj left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdmarsh2 Dec 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdmarsh2 Dec 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

aschackmull commented Dec 10, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

geoffw0 Dec 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

rdmarsh2 Dec 12, 2018 •

edited

Loading

rdmarsh2 Dec 13, 2018 •

edited

Loading

geoffw0 Dec 12, 2018 •

edited

Loading

aschackmull Dec 13, 2018 •

edited

Loading