```java
} else {
  final int bucket = adjustedIndex >> div;
  if (readCachedBucket == bucket) {
    readBufferCache.position(0);
```
Check warning (Code scanning / CodeQL): Dereferenced variable may be null
```diff
 {
   offset.set(rowNum);
-  return valueSelector.isNull();
+  return valueSelector.getObject() == null;
```
Does this have a different contract than BaseNullableColumnValueSelector? Its isNull method means that getLong/getFloat/getDouble will return a null value, not necessarily that getObject is null. E.g., non-numeric selectors will sometimes implement these primitive getters even for rows that are not null.
Looking at some of the things that implement it, it doesn't appear to be different. For example, ObjectColumnAccessorBase has an implementation of getLong that checks whether the instance is a String and, if so, tries to parse it. This means that getObject will not return null, but getLong isn't valid, so there is no way to determine that the 0 it returns is actually a null because we couldn't parse the string 'hello' into a long.
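To make the ambiguity concrete, here is a minimal, hypothetical sketch (not Druid's actual ObjectColumnAccessorBase) of the getLong behavior being described: an unparseable string, a null, and a real zero all come back as 0, so the caller cannot tell them apart.

```java
// Hypothetical sketch of implicit String-to-long casting in a getter.
// The class and method names are illustrative, not Druid APIs.
public class ImplicitCastSketch
{
    public static long getLong(Object value)
    {
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
        if (value instanceof String) {
            try {
                return Long.parseLong((String) value);
            }
            catch (NumberFormatException e) {
                return 0L; // unparseable string silently becomes 0
            }
        }
        return 0L; // null also becomes 0
    }

    public static void main(String[] args)
    {
        // "hello" is not null, so an isNull() based on getObject() reports false,
        // yet getLong() returns 0, indistinguishable from parsing the string "0".
        System.out.println(getLong("hello") + " " + getLong("0") + " " + getLong(null));
    }
}
```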
The contract here is supposed to be abandoning the "numerical null" thing that the ValueSelectors follow. It is explicitly "is it null" rather than "will it be null if cast to a primitive type".
How do you know if it is safe to call getLong if that is the contract?
Or maybe we should eliminate implicit casting in the numeric getter methods to make it so that they only work if the column type really is a number?
From the javadoc for ColumnAccessor (Get whether the value of a row is null), the updated behaviour would be correct.
I think we need to revisit this contract, since it doesn't really feel compatible with implicit casting on the getLong/getFloat etc. methods. If you have a string column and want to read numbers from it, but not all of the values can be parsed into numbers, you can't reliably use any of the primitive getters: isNull would say false, but getLong would fail to parse the value and return 0. See ObjectColumnAccessorBase.

This means you can't actually use these methods if you want to try to get numbers from an object column, so they should probably throw exceptions instead of silently giving you zeros that should have been handled as nulls.

We could fix this by just forbidding this sort of implicit casting. I'm not sure it does us any favors in terms of making code clear to follow, while having explicit casting wrapper gizmos would make it obvious when something is turning a string into a number.
Not that it has to be done now, but it does seem problematic, and it is why ColumnValueSelector.isNull has the strange contract it does: you kind of need it if you allow implicit casting.
The semantics as defined here are that a non-parseable String, when read from getLong, is 0 instead of null. If something wants that to be null, it needs to do its own parsing/handling. If you have a String object, it is entirely possible to use the getObject() form and handle it yourself.

We could make this logic generically accessible by having a LongAccessor interface and asking the column to become one of those when it wants to force-cast to Long. What I'm trying to say is that I understand the semantics you are looking for and I think they're totally valid; I just don't think they have to be delivered from this set of getters, which exist to provide primitive values in the simplest possible way. That said, perhaps if we had the type-specific accessors we would end up not needing these generic get methods, and then life would just be better. Either way, it's something that can be adjusted in another PR.
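As a rough illustration of the LongAccessor idea floated above, here is a hypothetical sketch (the interface and method names are illustrative, not Druid code): a column is asked to present itself as a long view, and may refuse outright when it cannot guarantee every value casts cleanly.

```java
// Hypothetical sketch: explicit "become a long accessor" instead of
// implicit casting inside a generic getLong.
public class LongAccessorSketch
{
    public interface LongAccessor
    {
        boolean isNull(int rowNum);

        long getLong(int rowNum);
    }

    // A string column only yields an accessor if every non-null row parses as a long;
    // otherwise it returns null, making the failed cast visible at the call site.
    public static LongAccessor asLongAccessor(String[] rows)
    {
        final long[] parsed = new long[rows.length];
        final boolean[] nulls = new boolean[rows.length];
        for (int i = 0; i < rows.length; i++) {
            if (rows[i] == null) {
                nulls[i] = true;
            } else {
                try {
                    parsed[i] = Long.parseLong(rows[i]);
                }
                catch (NumberFormatException e) {
                    return null; // cannot be viewed as a long column
                }
            }
        }
        return new LongAccessor()
        {
            @Override
            public boolean isNull(int rowNum)
            {
                return nulls[rowNum];
            }

            @Override
            public long getLong(int rowNum)
            {
                return parsed[rowNum];
            }
        };
    }

    public static void main(String[] args)
    {
        LongAccessor ok = asLongAccessor(new String[]{"1", null, "3"});
        LongAccessor bad = asLongAccessor(new String[]{"1", "hello"});
        System.out.println((ok != null) + " " + ok.getLong(2) + " " + (bad == null));
    }
}
```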
```java
return FindResult.found(startIndex, end);
}

int i = Arrays.binarySearch(vals, startIndex, endIndex, val);
```
I know this isn't really new since the others do this too, but is there any guarantee that vals is sorted?
These methods are part of the BinarySearchableAccessor, which is fundamentally just saying "if you want to binary search and you know it's safe, use these methods". Instead of making the accessor itself implement this, it probably would've been better to make it another interface off of the Column. That's my bad really.
Fwiw, the whole BinarySearchableAccessor is actually only used by an experimental join implementation that exists to exercise the operator stuff. Your commentary about the oddities is totally valid and belongs against that. It's not really a solidified part of the code base though, so we should perhaps add some javadocs to the interface itself, talk a bit about why it's awkward, and when/if we get back to it, we can improve it.
The worst-case time complexity of findLong is O(n) if all the values from startIndex to endIndex are the same. It can be improved by doing two binary searches to find the upper and the lower bound. The current implementation is the same in other places. Is it worth switching to the proposed one instead?
If the array always has mostly distinct values, then the current implementation would work better on average.
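The two-binary-search variant suggested above can be sketched as follows. This is a minimal illustration, not the Druid implementation: lowerBound/upperBound find the first index >= val and the first index > val respectively, so the run of equal values is located in O(log n) even when every value in the range is the same.

```java
// Sketch of finding the [first, last) range of a value in a sorted long array
// via two binary searches instead of a linear scan from a single hit.
public class BoundSearchSketch
{
    // First index in [start, end) whose value is >= val.
    public static int lowerBound(long[] vals, int start, int end, long val)
    {
        while (start < end) {
            int mid = (start + end) >>> 1;
            if (vals[mid] < val) {
                start = mid + 1;
            } else {
                end = mid;
            }
        }
        return start;
    }

    // First index in [start, end) whose value is > val.
    public static int upperBound(long[] vals, int start, int end, long val)
    {
        while (start < end) {
            int mid = (start + end) >>> 1;
            if (vals[mid] <= val) {
                start = mid + 1;
            } else {
                end = mid;
            }
        }
        return start;
    }

    public static void main(String[] args)
    {
        long[] vals = {1, 3, 3, 3, 7};
        // The run of 3s occupies the half-open range [lowerBound, upperBound).
        System.out.println(lowerBound(vals, 0, vals.length, 3) + " " + upperBound(vals, 0, vals.length, 3));
    }
}
```

The worst case the reviewer describes (all values equal) costs two O(log n) searches here instead of an O(n) scan.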
No, none of the "find" stuff is actually being used for anything meaningful. Ignore it.
I'm not sure if it's more likely that we have distinct values here, but currently, this is in line with the other type columns. If we make a call on using multiple searches instead, that should be consistent with the other columns here.
@adarshsanjeev Let's javadoc the oddity for now and move forward? We can take this up as part of a follow-up refactor.
I have added this to the javadoc in BinarySearchableAccessor
```java
@Override
public FindResult findComplex(int startIndex, int endIndex, Object val)
{
  return findLong(startIndex, endIndex, Numbers.tryParseLong(val, 0));
```
I know this isn't really new, but why doesn't this need to distinguish between the case where val is parseable into a number and the case where it is not?
The meta answer to your question is that it's not an interface that's used for anything actually meaningful yet.
The actual answer to your question is that the one place that tries to use this first checks for null and calls findNull. It then does a switch on the searcher's type and calls the relevant method for that. The type of the LongArrayColumn is LONG, so it won't actually call findComplex() and will call findLong() instead. This fundamentally puts the responsibility on the call site to "do the right thing". Is that the correct behavior or not? I dunno, but it's an experimental interface that you would have to be incredibly intrepid to actually hit, so it's a detail that perhaps doesn't matter.
If your suggestion is to switch this to a NotYetImplemented exception, I'd probably agree with that suggestion.
Changed the number columns to throw an exception instead of using 0 if the string is unparseable, for findComplex and findString.
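The shape of that change can be sketched roughly as follows. This is a hypothetical illustration, not the actual Druid code: the exception type and messages are stand-ins; the point is that an unparseable search value is rejected rather than silently searched as 0.

```java
// Hypothetical sketch: strict parsing of a search value for a long column,
// throwing instead of defaulting to 0 (cf. Numbers.tryParseLong(val, 0)).
public class FindComplexSketch
{
    public static long parseLongStrict(Object val)
    {
        if (val instanceof Number) {
            return ((Number) val).longValue();
        }
        if (val instanceof String) {
            try {
                return Long.parseLong((String) val);
            }
            catch (NumberFormatException e) {
                throw new UnsupportedOperationException("Cannot search long column for [" + val + "]");
            }
        }
        throw new UnsupportedOperationException("Cannot search long column for value [" + val + "]");
    }

    public static void main(String[] args)
    {
        System.out.println(parseLongStrict("42"));
        try {
            parseLongStrict("hello");
        }
        catch (UnsupportedOperationException e) {
            // Unparseable input surfaces as an error instead of matching rows with value 0.
            System.out.println("rejected");
        }
    }
}
```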
imply-cheddar left a comment:

Generally LGTM. Make sure you have green checks and that Clint is satisfied before merging.
```java
*/
public class NotYetImplemented extends DruidException.Failure
{
  public static DruidException ex(String msg, Object... args)
```
I think it would be better not to have factory methods which omit the cause exception.

I've seen places where there was a catch block, but the newly created exception did not have the cause filled in, most likely by mistake, and as a result the stack trace was clipped.

I think forcing the API user to write out null for the cause makes this mistake less likely.
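The API shape being suggested could look roughly like this hypothetical sketch (DruidException.Failure is not reproduced; the names are illustrative): with the cause as a required leading parameter, dropping it is a compile error, and passing null becomes a visible, deliberate choice.

```java
// Hypothetical sketch: a factory method that forces callers to state the cause.
public class RequiredCauseSketch
{
    public static RuntimeException ex(Throwable cause, String msg, Object... args)
    {
        // The caller cannot forget the cause; "no cause" must be an explicit null.
        return new RuntimeException(String.format(msg, args), cause);
    }

    public static void main(String[] args)
    {
        Throwable root = new IllegalStateException("root");
        RuntimeException wrapped = ex(root, "wrapping [%s]", "root");
        // The cause is preserved, so the full stack trace survives re-throwing.
        System.out.println((wrapped.getCause() == root) + " " + wrapped.getMessage());
        System.out.println(ex(null, "no cause").getCause() == null);
    }
}
```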
```java
++firstNonDruidExceptionIndex;
}
if (firstNonDruidExceptionIndex < stackTrace.length) {
  retVal.setStackTrace(Arrays.copyOfRange(stackTrace, firstNonDruidExceptionIndex, stackTrace.length));
```
This conditional looks odd... it's not wrong, but it will only skip the set if it went over the full stack trace; when the index is 0 it still does the copy.

I guess the intention here is to clip the exception-construction frames?

Could you add a test for this, with a case that checks what happens if a DruidException is set as the cause of another DruidException?
Changed this to if (firstNonDruidExceptionIndex > 0).

If firstNonDruidExceptionIndex == stackTrace.length, we should copy an empty array, which is what happens; and if the index is 0, we don't need a copy at all.
Just looking at the comments, it looks like things have generally been addressed. Clint has an outstanding comment about the semantics of the ColumnAccessor interface, which is valid but also not the central theme of this PR. As such, I don't see a reason to delay the merge of these changes, so I'm going to push the merge button.
Some general refactors across Druid.