Skip to content

Conversation

@rdmarsh2
Copy link
Contributor

@rdmarsh2 rdmarsh2 commented Dec 6, 2018

This PR adds an IR-based port of the Java range analysis library, which has more precision than the existing SimpleRangeAnalysis and supports bounds relative to other IR instructions and operands. The library will need more correctness tests as well as some performance testing before it's ready to merge.

The library currently only reports bounds relative to Operands and Instructions. A mapping layer to support the AST classes may be added in a followup PR.

A few features of the Java library are currently missing from the C++ reimplementation. The Java library supports range analysis across a small set of math functions. Those may be addressed as part of this PR or as a follow up PR. There is also a modulus analysis library used to improve precision in the Java range analysis, which will be ported in a separate PR.

@rdmarsh2 rdmarsh2 requested a review from a team as a code owner December 6, 2018 23:52
@rdmarsh2 rdmarsh2 added the WIP This is a work-in-progress, do not merge yet! label Dec 6, 2018
predicate backEdge(PhiInstruction phi, PhiOperand op) {
phi.getAnOperand() = op and
(
phi.getBlock().dominates(op.getPredecessorBlock())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct if the control flow graph isn't reducible. It works for Java, because Java doesn't have arbitrary gotos, so the CFG is always reducible in Java.

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for posting this PR early. This is only half of a review as I didn't get as far as reading the code in RangeAnalysis.qll.

* return x;
* ```
*/
cached predicate controlsEdge(ConditionalBranchInstruction branch, IRBlock succ, boolean testIsTrue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be cached?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "controls" part appears to be missing from the body of this predicate - as it's written it's only going to include inferences where the guard-successor leads directly to the phi-node. Consider e.g. if the example above gets augmented with an unrelated block between the x >= y edge and the phi-node:

if (x < y) {
  x = y;
} else if (foo) {
  return 0;
}
return x;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the missing part, as per comment below. In that case, this predicate just has a misleading name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to controlsEdgeDirectly

result = getAnOperand().(RightOperand).getDefinitionInstruction()
}

final predicate hasOperands(Operand op1, Operand op2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QLDoc, please.

TBoundOperand(Operand o) {
o.getDefinitionInstruction().getResultType() instanceof IntegralType or
o.getDefinitionInstruction().getResultType() instanceof PointerType
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If TBoundOperand should correspond to Java's TBoundExpr, then this includes too much. But at the time of writing this I haven't yet seen how you deal with the SsaReadPosition difference, so maybe that will explain it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought that bounding relative to Operand might give a tighter bound than Instruction in some cases involving guard conditions; I'm no longer convinced that that's true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also looks like TBoundInstruction includes too much if it's trying to correspond to TBoundSsa - it should likely exclude temporary variables that are introduced by the IR construction. Note, that the overall algorithm works by constructing a graph (that's all the *step* predicates), which should be fast as its size is likely linear in the size of the program, and the recursive calculation in the graph essentially calculates the transitive closure in the graph restricted to certain starting points, and the Bound class is exactly those starting points. So Bound should be as small as possible to maintain good performance, while including all the interesting nodes in the graph that one might want as a bound for something else, and all phi nodes as they are needed as bounds for themselves to get any sort of relevant bounds in loops.

Copy link
Contributor Author

@rdmarsh2 rdmarsh2 Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I think the set of interesting bounds in C++ is going to be bigger than in Java, but I haven't narrowed it down fully. Also, @dave-bartolomeo suggested that TBoundInstruction might be better as TBoundValueNumber. We'd probably need to go through ValueNumber anyway to match up cases where the allocation is done in the same function. I'm not sure how well it will work for cases where the bound is passed into the function as a field of an inparam.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I absolutely don't think value numbers in general should be part of Bound - that's going to be too big without any benefits. E.g. bounding anything by the value number of x+1 is never going to be useful, as it's always better to bound directly by x.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, sorry, just now read the commit with the code change above to use value numbers. When restricted to atomic things, then it looks like it might be fine. I'm slightly concerned by the i.getAUse() instanceof ArgumentOperand disjunct - doesn't that open the door for valuenumbers of things like x+1?

Copy link
Contributor Author

@rdmarsh2 rdmarsh2 Dec 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; Consider the following code:

int *foo(int x) {
  int *arr = new int[x*2];
  for(int i = 0; i < x*2; i++) {
    arr[i] = i; // we need to be able to relate i to the size of arr, which is x*2
  }
  return arr;
}

I think we will need to be able to have bounds relative to arbitrary ValueNumbers that are used as arguments to allocating functions, and given that C++ allows custom allocators, I don't think we'll be able to do that without allowing any ValueNumber used as an argument to be a bound.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can see why it can be useful. Then let me just say that the size of the Bound type and whether a smaller set of valuenumbers can be used here is something to potentially revisit if performance becomes a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting on performance testing; it looks like the main stage of range analysis is about 50% slower with the i.getAUse() instanceof ArgumentOperand disjunct than without.

*
* An inferred bound can either be a specific integer, the abstract value of an
* SSA variable, or the abstract value of an interesting expression. The latter
* category includes array lengths that are not SSA variables.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph needs updating.

* - `isEq = true` : `i == bound + delta`
* - `isEq = false` : `i != bound + delta`
*/
private IRGuardCondition eqFlowCond(Operand op, Operand bound, int delta,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this predicate doesn't seem to be referenced from anywhere useful? Also, I don't think I understand what this predicate does - at the very least it doesn't seem to match the corresponding java predicate in any way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like I didn't finish implementing unequalOperand, which would use it indirectly.

(if r1 instanceof NoReason then reason = r2 else reason = r1)
)
or
boundedInstruction(op.getDefinitionInstruction(), b, delta, upper, fromBackEdge, origdelta, reason)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case here combined with the op2.getDefinitionInstruction().getAnOperand().(CopySourceOperand) = op1 case in boundFlowStepSsa appears to duplicate some work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by removing the case in boundFlowStepSsa

@aschackmull
Copy link
Contributor

So, I've commented along the way as I read the code. I think the code could be streamlined a bit and made easier to understand, since I felt like some of the predicates got slightly confusing in the translation from the java implementation.
The general algorithm works by propagating bound-facts along the edges of a graph, and the predicates are roughly divided into two groups - those that define the edges of the graph and those that propagate and update bound-facts (the latter group is easy to identify as they all have a paramenter of type Bound). I think the former group of predicates (those defining the graph edges) can be made clearer.
To get a better grasp of this, we should consider exactly which nodes and which edges we need in the graph - and in particular avoid accidentally including edges that are unnecessary due to being the concatenation of two other edges.
In the java implementation there are 3 kinds of nodes: expressions, ssa variables, and ssa variable-ssa read position pairs. A read of an ssa variable (at some position) yield edges from ssa variables and ssa variable-read position pairs to expressions, most expressions yield expr-to-expr edges, ssa variable definitions yield edges from their defining expr to the variable, and guards comparing an ssa variable to some expr yield edges from that expr to all ssa variable-read position pairs that are corresponding ssa reads guarded by the condition.
In the C implementation there are as far as I've been able to gather just two kinds of nodes, Instruction and Operand. Edges go from operands to instructions, say in the case of an addition, or from an instruction to an operand (if the operand is defined by the instruction). Then there's guards, and I'm not sure how to factor those edges in in the best way, but it should likely result in an operand-to-operand edge.

I hope this makes sense. :-)

| test.cpp:37:6:37:10 | Load: begin | test.cpp:35:28:35:30 | InitializeParameter: end | 0 | true | CompareLT: ... < ... |
| test.cpp:37:16:37:20 | Load: begin | test.cpp:35:28:35:30 | InitializeParameter: end | 0 | true | CompareLT: ... < ... |
| test.cpp:38:5:38:11 | Load: ... ++ | test.cpp:35:28:35:30 | InitializeParameter: end | 0 | true | CompareLT: ... < ... |
| test.cpp:44:13:44:13 | Load: y | test.cpp:42:29:42:29 | InitializeParameter: z | 0 | true | CompareLT: ... < ... |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are your upper bounds non-inclusive? (otherwise the upper bound for y here appears to be z - 1)

Copy link
Contributor

@geoffw0 geoffw0 Dec 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh actually both bounds appear to be non-inclusive. That's consistent. No problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the rule is lower bound <= value < upper bound, can you confirm that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library ought to compute bounds that are inclusive. That goes for both upper and lower bounds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the upper bound for the read of y on line 44 should be z - 1, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so. There's a strengthening that's missing in boundFlowCond (I've also put a comment about it above in that predicate).

| test.cpp:16:9:16:9 | Load: y | test.cpp:14:15:14:15 | InitializeParameter: x | 1 | false | CompareLT: ... < ... |
| test.cpp:18:9:18:9 | Load: x | test.cpp:14:22:14:22 | InitializeParameter: y | 0 | false | CompareLT: ... < ... |
| test.cpp:20:10:20:10 | Load: x | test.cpp:14:15:14:15 | InitializeParameter: x | -2 | false | NoReason |
| test.cpp:20:10:20:10 | Load: x | test.cpp:14:22:14:22 | InitializeParameter: y | -2 | false | CompareLT: ... < ... |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive results here, took me a while to figure out why they're correct. 👍

* - `isEq = false` : `i != bound + delta`
*/
private IRGuardCondition eqFlowCond(Operand op, Operand bound, int delta,
private IRGuardCondition eqFlowCond(ValueNumber vn, Operand bound, int delta,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better!

exists(int d |
result.comparesLt(vn.getAUse(), bound, d, upper, testIsTrue) and
// strengthen from x < y to x <= y-1
if upper = true
Copy link
Contributor

@aschackmull aschackmull Dec 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strengthening appears wrong. There are at least four cases to be handled with each of the true-false combinations of upper and testIsTrue. I'd think that in two of those cases we'd have delta = d, one case with delta = d-1, and one case with delta = d+1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ guards library's comparelt provides the same semantics as boundFlowCond but with a < rather than a <= when upper = true. I've added a test to demonstrate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the comment ought to be expanded a bit here. I think I've convinced myself that the code is right, but it would probably help other readers if the transformations implicit in comparesLt were elaborated a bit here to better connect the dots.

@rdmarsh2 rdmarsh2 changed the base branch from master to next December 14, 2018 22:58
@rdmarsh2 rdmarsh2 force-pushed the rdmarsh/cpp/range-analysis branch from 91d41ca to 5c779c5 Compare December 14, 2018 23:01
@rdmarsh2
Copy link
Contributor Author

I've retargeted to next for the moment because I need #648 for termination. Initial performance testing looks acceptable but not great; 4 minutes of evaluation after building IR on ChakraCore. Half of that time is before the main stage of the range analysis, which probably means there's performance improvements to be made in the sign analysis library. That time was with a 1.19 snapshot, my changes on top of next, and a 1.18 evaluator, so the performance is likely better with a 1.19 evaluator.

I've done a bit of glancing at stats on ChakraCore; about half of instructions have a bound, about a tenth of those have a non-self bound. There's 370 instructions after pruning by #648 that have an impossible pair of bounds (upper and lower bounds relative to the same value number, with the lower bound's delta > the upper bound's delta). I suspect that most of those are actually unreachable, but I haven't done a manual review yet.

@aschackmull
Copy link
Contributor

We can't do anything useful with negation for relative bounds, but I could add something to handle it for constant bounds.

As you say, for constant bounds this is a fairly easy extension, which I've also considered adding on the java side (but haven't gotten around to actually implementing). We could also do something for relative bounds, but that requires extending the Bound class to include negated ssa variables.

However, some other differences are regressions where a type-derived bound is relevant after widening; line 371 in test.c is an example of this

Lower bounds of 0 for unsigned types might be relevant to add.

the algorithm we're using doesn't combine constant bounds from both operands of an arithmetic operation, so we won't get a constant upper bound on x + y in the following case

I originally excluded this in the java implementation, but it should be possible to add if it's relevant. Note that this needs some care to get proper performance as it involves joining two recursive calls, so a naive implementation will likely perform badly due to lots of #prev scanning. To get proper performance the technique that's used for conditional expressions should be used.

@rdmarsh2
Copy link
Contributor Author

rdmarsh2 commented Jan 7, 2019

@jbj and I discussed the possible constant bounds improvements on a call this morning; we would like to do those in the future and deprecate SimpleRangeAnalysis.qll, but we decided not to do them as part of this PR. I think I've addressed all feedback, and I'm removing the WIP label; Jonas and Schack should feel free to merge at this point.

@rdmarsh2 rdmarsh2 added C++ and removed WIP This is a work-in-progress, do not merge yet! labels Jan 7, 2019
jbj
jbj previously approved these changes Jan 8, 2019
Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but some files are now in conflict. This might be because the master branch now reflects the 1.19 release, not the 1.18 release as it did yesterday. You should be able to sort it out with qltest --accept-output.

Robert Marsh added 12 commits January 8, 2019 09:34
This reduces the number of bounds computed, and will simplify use of the
library. The resulting locations in the tests may be slightly strange,
because the example `Instruction` for a `ValueNumber` is the first
appearing in the IR, regardless of source order, and may not be the most
closely related `Instruction` to the bounded value. I think that's worth
doing for the performance and usability benefits.
@jbj
Copy link
Contributor

jbj commented Jan 9, 2019

The tests failed. I'm guessing the failures come from the switch from 1.18 to 1.19, so please check them carefully as they may be symptoms of problems with 1.19.

@rdmarsh2
Copy link
Contributor Author

rdmarsh2 commented Jan 9, 2019

Sign analysis and range analysis changes look like improvements - I think they're related to better CFG pruning after the rebase. GVN changes also look like improvements, but I'm not sure what the change was. Alias analysis improvements, maybe?

@rdmarsh2
Copy link
Contributor Author

@jbj the failing test is qlcfg, but I didn't expect anything I've touched to affect it... Can you take a look?

@jbj
Copy link
Contributor

jbj commented Jan 11, 2019

That's a semantic merge conflict that's crept in because the submodule pointer is too old to include this test. I've opened #747 to fix it.

Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failure is unrelated to this PR, so I'll merge now.

@jbj jbj merged commit ef331ee into master Jan 11, 2019
@rdmarsh2 rdmarsh2 deleted the rdmarsh/cpp/range-analysis branch February 25, 2019 19:31
cklin pushed a commit that referenced this pull request May 23, 2022
Fix incorrect type name in database/sql model
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants