ARROW-5198: [JAVA] add hasNull flag to Vectors #4199
Conversation
yuruiz
commented
Apr 25, 2019
- Add a hasNull flag to BaseFixedWidthVector, BaseVariableWidthVector, FixedSizeListVector and ListVector to indicate whether the current vector contains any nulls.
- Skip null checking if the hasNull flag is false.
- When a vector transfer occurs, transfer the hasNull flag as well.
- Move the setNull method from the subclasses of BaseFixedWidthVector into BaseFixedWidthVector.
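The idea in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `SketchIntVector` and all of its members are hypothetical stand-ins for a fixed-width vector, a `BitSet` plays the role of the validity buffer, and the sketch assumes every slot starts out valid.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a fixed-width vector; not the Arrow API. */
class SketchIntVector {
    private final int[] values;
    private final BitSet validity;  // bit set => slot holds a value
    private boolean hasNull;        // becomes true once setNull() is called

    SketchIntVector(int capacity) {
        values = new int[capacity];
        validity = new BitSet(capacity);
        validity.set(0, capacity);  // assumption: all slots start as valid zeros
    }

    void set(int index, int value) {
        values[index] = value;
        validity.set(index);
    }

    void setNull(int index) {
        validity.clear(index);
        hasNull = true;
    }

    boolean isNull(int index) {
        // Fast path: the validity lookup is skipped while hasNull is false.
        return hasNull && !validity.get(index);
    }

    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("Value at index " + index + " is null");
        }
        return values[index];
    }

    /** Copies data, validity bits and the hasNull flag into a same-sized target. */
    void transferTo(SketchIntVector target) {
        System.arraycopy(values, 0, target.values, 0, values.length);
        target.validity.clear();
        target.validity.or(validity);
        target.hasNull = hasNull;
    }
}
```

In this sketch the flag is only ever raised by `setNull`, so the fast path is correct only as long as every null is written through that method.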
If vectors are immutable, why don't you memoize the null count as on the C++ side? That would make getNullCount faster while providing the same functionality.
Can you please start a discussion on the mailing list about changes to the hot code path before you start working on them? Thanks
@fsaintjacques From what I understand, Arrow arrays in C++ represent immutable data and do not provide update accessors. In Java, however, that is not the case for the Arrow ValueVector: it provides update accessors that allow external consumers to update values. So it makes no sense to memoize the null count, since the cached value can become invalid after the user updates the vector.
Hi @jacques-n, I'm not sure I understand your question here. The purpose of this PR is to remove the unnecessary null check when there are no nulls in the current vector. In vectorized data processing, accessing data via the get accessor is a high-frequency operation, and we want to remove every redundant operation from that code path. Does that answer your question?
Codecov Report
@@ Coverage Diff @@
## master #4199 +/- ##
==========================================
+ Coverage 87.78% 89.18% +1.4%
==========================================
Files 758 617 -141
Lines 92513 82203 -10310
Branches 1251 0 -1251
==========================================
- Hits 81210 73315 -7895
+ Misses 11186 8888 -2298
+ Partials 117 0 -117
Then, if vectors are not immutable, your implementation of hasNull memoization is also prone to invalidation, i.e. hasNull has to be recomputed every time a value is set (since that can flip the last null to non-null)?
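To make the invalidation concern concrete, here is a small sketch under stated assumptions. `StaleFlagSketch` and its methods are hypothetical and not part of Arrow. It compares a flag that is only ever raised with two ways of keeping the answer exact: a null counter updated on every write, and a scan of the validity bits.

```java
import java.util.BitSet;

/** Hypothetical illustration of a cached null flag on a mutable vector. */
class StaleFlagSketch {
    private final int size;
    private final BitSet validity;  // bit set => slot holds a value
    private boolean hasNull;        // "sticky" flag: raised by setNull, never lowered
    private int nullCount;          // exact count, maintained on every write

    StaleFlagSketch(int size) {
        this.size = size;
        validity = new BitSet(size);
        validity.set(0, size);      // assumption: all slots start valid
    }

    void setNull(int index) {
        if (validity.get(index)) {
            nullCount++;
        }
        validity.clear(index);
        hasNull = true;
    }

    void setValid(int index) {
        if (!validity.get(index)) {
            nullCount--;
        }
        validity.set(index);
        // hasNull is deliberately left untouched here.
    }

    boolean stickyHasNull() { return hasNull; }

    boolean exactHasNull() { return nullCount > 0; }

    boolean scannedHasNull() { return validity.cardinality() < size; }
}
```

After `setNull(1)` followed by `setValid(1)`, the sticky flag still reports a null although none remains. In this sketch that is safe but gives up the fast path; keeping the flag exact costs either extra bookkeeping on every write or a scan of the validity bits.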
You're assuming a certain usage of the API and changing the hot path of the code. As @fsaintjacques points out, given that the vectors are mutable, the behavior of this patch can also lead to returning wrong results. Earlier versions of the vector classes had specialized classes specifically for this purpose. However, we decided that the extra complexity was not worth maintaining, because specific memory algorithms that can take advantage of this will likely interact with memory directly, as opposed to always going through the vector interface, such as the [pivot algorithm we implemented|https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java]. So my comment was really about the idea that for fundamental hot-path changes like this, you should discuss the requirements you're trying to address, and your ideas for possible changes, on the mailing list before coding something up. It will likely save you some time. -1 on this in general
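One way to picture the wrong-result risk is the sketch below. It is an assumption-laden illustration, not code from this PR or from Arrow: `DirectWriteSketch` is hypothetical, and its public `BitSet` stands in for a validity buffer that other code can write to without going through the vector interface.

```java
import java.util.BitSet;

/** Hypothetical illustration: a cached flag cannot see writes that bypass it. */
class DirectWriteSketch {
    final BitSet validity;      // exposed on purpose, like a raw validity buffer
    private boolean hasNull;    // cached flag, updated only by setNull()

    DirectWriteSketch(int size) {
        validity = new BitSet(size);
        validity.set(0, size);  // assumption: all slots start valid
    }

    void setNull(int index) {
        validity.clear(index);
        hasNull = true;
    }

    /** Flag-guarded check: skips the bitmap while hasNull is false. */
    boolean isNullFast(int index) {
        return hasNull && !validity.get(index);
    }

    /** Unconditional check against the bitmap. */
    boolean isNullChecked(int index) {
        return !validity.get(index);
    }
}
```

If some code clears `validity` bit 1 directly, `isNullChecked(1)` returns true while `isNullFast(1)` still returns false, because the cached flag was never told about the write.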