fix nested column range index range computation #13297
clintropolis merged 2 commits into apache:master
    localStartIndex = -(localFound + 1);
    // if the computed local start index violates the global range, shift up by 1
    int actualGlobalStartIndex = localDictionary.get(localStartIndex);
    if (actualGlobalStartIndex < globalStartIndex) {
Should this be done in a loop or can we be sure that the global index at the next localStartIndex would be valid?
OK, so I looked a lot closer in the debugger and worked things out by hand. This code on the start index is pointless: it can only be hit in cases where the range will effectively be empty anyway, so it can be dropped. It also uncovered a missing bounds check on the FixedIndexed get method, which I have fixed and added tests for 😅
I've rearranged some things so the logic is simpler. It was really only the end index that needed adjustment when it was missing from the local dictionary, and that was because I was doing the shifting to account for the end index being exclusive in a funny manner. I think it's clearer now. Thanks for asking this question!
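For readers wondering why a missing bounds check on a fixed-width get matters, here is a minimal sketch. It assumes values are stored as fixed-width ints in a region of a larger shared ByteBuffer; the class name FixedWidthInts and its layout are hypothetical and are not Druid's FixedIndexed.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a fixed-width dictionary; NOT Druid's FixedIndexed.
final class FixedWidthInts
{
  private final ByteBuffer buffer;
  private final int offset; // byte offset of the first value in the shared buffer
  private final int size;   // number of values that belong to this dictionary

  FixedWidthInts(ByteBuffer buffer, int offset, int size)
  {
    this.buffer = buffer;
    this.offset = offset;
    this.size = size;
  }

  int size()
  {
    return size;
  }

  int get(int index)
  {
    // Without this check, an index >= size would silently read whatever bytes
    // follow this dictionary in the shared buffer instead of failing.
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("Index " + index + " out of bounds for size " + size);
    }
    return buffer.getInt(offset + index * Integer.BYTES);
  }
}
```

With three values followed by unrelated data in the same buffer, get(3) throws rather than returning the unrelated bytes as if they were a dictionary entry.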
Thanks! Simpler to follow now.
I also wonder if the last value returned from FixedIndexed.indexOf() should always be -(minIndex + 1).
My concern is that if we haven't found the value until now, then the insertion point could either be before or after the currIndex that we are evaluating. I think we should incorporate that fact into the returned value.
Say at the last step, we found that our value is bigger than the currValue, so minIndex becomes currIndex + 1, then we break out of the loop and then we return -(minIndex + 1) = -(currIndex + 2).
Doesn't this mean that we have actually skipped the insertion point? Or maybe I am missing something.
(Although, I guess this should be okay because the Indexed interface doesn't make any promises about the returned negative value.)
FixedIndexed.indexOf is the same-ish implementation as GenericIndexed.indexOf, both of which are basically the same as Arrays.binarySearch, so I think it is OK. Btw, the range finder method is used for both FixedIndexed and GenericIndexed global dictionaries, depending on whether it is making a LexicographicalRangeIndex for nested strings or a NumericRangeIndex for nested numbers, though the local dictionary is always a FixedIndexed.
Or did you mean something about how the range calculation is using indexOf to compute the ranges?
btw, I did recently strengthen the contract of indexOf to be (-(insertion point) - 1) for missing values if isSorted() returns true, since these index things basically require the (-(insertion point) - 1) behavior to work correctly.
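To make the (-(insertion point) - 1) contract concrete, here is a small sketch of a binary search over a sorted int array. The class name SortedLookup is hypothetical and this is not the Druid implementation; the variable names minIndex and currIndex simply follow the discussion above.

```java
// Hypothetical helper illustrating the contract; NOT Druid's FixedIndexed.indexOf.
final class SortedLookup
{
  static int indexOf(int[] sorted, int value)
  {
    int minIndex = 0;
    int maxIndex = sorted.length - 1;
    while (minIndex <= maxIndex) {
      int currIndex = (minIndex + maxIndex) >>> 1;
      int currValue = sorted[currIndex];
      if (currValue == value) {
        return currIndex;
      }
      if (currValue < value) {
        // everything at or below currIndex is smaller than value
        minIndex = currIndex + 1;
      } else {
        // everything at or above currIndex is larger than value
        maxIndex = currIndex - 1;
      }
    }
    // When the loop exits, every element before minIndex is smaller than value
    // and every element from minIndex onward is larger, so minIndex is the
    // insertion point.
    return -(minIndex + 1);
  }
}
```

For the array {10, 20, 30}, searching for 25 ends with minIndex = 2, which is where 25 would be inserted, so the result is -3. In the case raised above, where the last comparison finds the value bigger than currValue, minIndex = currIndex + 1 is the insertion point rather than one past it. This matches what Arrays.binarySearch returns for the same input.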
Makes sense, thanks for the clarification!
kfaraz left a comment
Minor query, otherwise looks good.
* fix nested column range index range computation * simplify, add missing bounds check for FixedIndexed
Description
Fixes a bug with nested column range index value range computation that can lead to incorrect values matching filters when the range is not present in the nested field's local dictionary.
This PR fixes the issue by checking that, whenever the actual global ids are not present in the local dictionary, the global ids that the computed local start and end indexes map to do not violate the original global id range. That's a lot of words that are probably hard to follow, but it's basically an off-by-one error when mapping the global range to the local range in certain cases.
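As an illustration of the mapping described above, here is one way to translate a global id range with an exclusive end into a local index range. It assumes the local dictionary is a sorted array of the global ids that actually appear in the field. The class LocalRange and its method names are hypothetical, and this sketch is not necessarily how the PR implements the fix.

```java
import java.util.Arrays;

// Hypothetical sketch of global-to-local range mapping; NOT the code from this PR.
final class LocalRange
{
  /**
   * Maps the global id range [globalStartIndex, globalEndIndex) to the local
   * index range {localStart, localEnd}, where localEnd is also exclusive.
   */
  static int[] toLocalRange(int[] localDictionary, int globalStartIndex, int globalEndIndex)
  {
    int localStartIndex = firstAtOrAfter(localDictionary, globalStartIndex);
    int localEndIndex = firstAtOrAfter(localDictionary, globalEndIndex);
    // guard against an inverted global range so the result is never negative-sized
    return new int[]{localStartIndex, Math.max(localStartIndex, localEndIndex)};
  }

  /** Index of the first local entry whose global id is >= the given global id. */
  private static int firstAtOrAfter(int[] sortedGlobalIds, int globalId)
  {
    int found = Arrays.binarySearch(sortedGlobalIds, globalId);
    // a missing id comes back as (-(insertion point) - 1); recover the insertion point
    return found >= 0 ? found : -(found + 1);
  }
}
```

With a local dictionary of {2, 5, 9}, the global range [3, 7) maps to the local range [1, 2), which covers only global id 5, and the global range [6, 8) maps to the empty local range [2, 2). Treating a missing end id as if it were present, or shifting it by one in the wrong direction, is the kind of off-by-one mistake that would let 9 match in the first case.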
This wrong behavior was accidentally encoded in a test I wrote, which would have caught the issue had it been expecting the correct answer. I went through and manually verified that the tests now expect the correct results, and tried to get 100% coverage on everything that matters in practice; coverage is now pretty high.

The remaining uncovered lines are largely things like safety checks that are not hit in regular operation such as calling next on an iterator before calling hasNext, etc, which I cannot hit during normal usage of the index supplier.
Release note
Fixes a bug with nested columns when using filters that use range indexes such as greater/less than or like filters.