Skip to content

Conversation

@yuruiz
Copy link

@yuruiz yuruiz commented Apr 25, 2019

  • Add hasNull flag to BaseFixedWidthVector, BaseVariableWidthVector, FixedSizeListVector and ListVector to indicate if current vector has null or not.
  • Skip null checking if hasNull flag is false
  • When vector transfer occurs, transfer hasNull flag as well
  • Move setNull method from the subclass of BaseFixedWidthVector to BaseFixedWidthVector

@yuruiz yuruiz closed this Apr 25, 2019
@yuruiz yuruiz reopened this Apr 25, 2019
@yuruiz yuruiz changed the title ARROW-5197: [JAVA] add hasNull flag to Vectors ARROW-5198: [JAVA] add hasNull flag to Vectors Apr 25, 2019
@yuruiz yuruiz closed this Apr 25, 2019
@yuruiz yuruiz reopened this Apr 25, 2019
@fsaintjacques
Copy link
Contributor

If vectors are immutable, why don't you memoize the null count like in the C++ side? That'll make the getNullCount faster while providing the same functionality.

@jacques-n
Copy link
Contributor

Can you please start on this list to discussion changes on the hot code path before you start working on them? Thanks

@yuruiz
Copy link
Author

yuruiz commented Apr 26, 2019

If vectors are immutable, why don't you memoize the null count like in the C++ side? That'll make the getNullCount faster while providing the same functionality.

@fsaintjacques For what I understand, in C++ the Arrow Arrays represent immutable data and do not provide update accessors. However, in Java the Arrow ValueVector is not the case, it indeed provides the update accessor that allow external consumer to update values. So it makes no sense to memorize the null count since the value can be invalid after user updates.

@yuruiz
Copy link
Author

yuruiz commented Apr 26, 2019

Can you please start on this list to discussion changes on the hot code path before you start working on them? Thanks

Hi @jacques-n , not sure I understand you question here. The purpose of this PR is to remove unnecessary null check when there is no null in current vector. In vectorized data processing, accessing data via get accessor is a high frequency operation and we want to make sure to remove all the redundant operation from the code path. Is that answer your question?

@codecov-io
Copy link

Codecov Report

Merging #4199 into master will increase coverage by 1.4%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4199      +/-   ##
==========================================
+ Coverage   87.78%   89.18%    +1.4%     
==========================================
  Files         758      617     -141     
  Lines       92513    82203   -10310     
  Branches     1251        0    -1251     
==========================================
- Hits        81210    73315    -7895     
+ Misses      11186     8888    -2298     
+ Partials      117        0     -117
Impacted Files Coverage Δ
cpp/src/arrow/util/thread-pool-test.cc 97.66% <0%> (-0.94%) ⬇️
go/arrow/math/uint64_amd64.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/ipc/file_reader.go
js/src/enum.ts
go/arrow/array/builder.go
js/src/Arrow.node.ts
js/src/schema.ts
go/arrow/type_traits_boolean.go
js/src/ipc/node/writer.ts
... and 133 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67efb73...bacf9a7. Read the comment docs.

@fsaintjacques
Copy link
Contributor

Then if vectors are not immutable, your implementation of hasNull memoization is also prone to invalidation, i.e. hasNull has to be recomputed every time a value is set (since it can flip the last null to non-null)?

@jacques-n
Copy link
Contributor

Can you please start on this list to discussion changes on the hot code path before you start working on them? Thanks

Hi @jacques-n , not sure I understand you question here. The purpose of this PR is to remove unnecessary null check when there is no null in current vector. In vectorized data processing, accessing data via get accessor is a high frequency operation and we want to make sure to remove all the redundant operation from the code path. Is that answer your question?

You're assuming a certain usage of the API and changing the hot path of the code. As @fsaintjacques points out, given the the vectors are mutable, the behavior of this patch can also lead to returning wrong result.

Earlier versions of the vector classes had specialized classes specifically for this purpose however we decided that the extra complexity was not something that was worth maintaining as specific memory algorithms that can take advantage of this will likely interact with memory directly as opposed always through the vector interface, such as the [pivot algorithm we implemented|https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java].

So my comment was really about the idea that for fundamental hot-path changes like this, you should discuss the requirements you're trying to resolve and your idea around possible changes on the mailing list before coding something up. It will likely save you some time.

-1 on this in general

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants