-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-549: [C++] Add arrow::Concatenate function to combine multiple arrays into a single Array #3746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be on the right track. The test coverage is a bit thin, though. It would be better to return NotImplemented in paths that are untested if tests don't get written for them
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just writing:
CheckConcatenate(type, arrays_as_json, expected_json)
The ConcatenateParam seems unnecessarily elaborate to me.
It doesn't appear that the non-zero offset cases are being tested right now, so in those cases will have to construct and slice the test arrays manually
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation needs a more appropriate home than array.cc. We could create a subdirectory of miscellaneous array algorithms. I don't have a strong opinion, @xhochy @pitrou @fsaintjacques @emkornfield thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
under util/ might make sense for now. We can see how many random algorithms get put here vs just used in kernels (my second choice would be under compute I suppose).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Algorithms could be grouped under src/arrow/array/algorithm_<category>.{h,cc} with a single convenience header/source src/arrow/algorithm.{h,cc} similar to builders
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is all private, we should avoid convenience headers that balloon compile times.
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment what this is?
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should return Invalid if any type is unequal
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the null bitmap is getting concatenated twice here
|
I think you need to define 2^31 - 1 overflow behavior for offsets (lists and binary) |
573be68 to
4ef76e7
Compare
|
@fsaintjacques re property testing |
cpp/src/arrow/buffer.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this at the top-level namespace, maybe static Status Buffer::Concatenate(...) or BuffersConcatenate()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the renaming seems unnecessary- the overload will not conflict with concatenate(arrays)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not concerned about conflict. I'm concerned about readability. buffer::Concatenate or Buffer::Concatenate speaks a lot more than just Concatenate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it's more readable to avoid manual name mangling; the argument types are part of what users will read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it help to introduce an alias like:
using BufferVector = std::vector<std::shared_ptr<Buffer>>
I think this would help make the signature easier to read.
cpp/src/arrow/util/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, there's no need for this function to depends on internal state. In fact, I'd want to see this function explicitly unit tested.
|
@wesm @fsaintjacques how's this look? |
cpp/src/arrow/util/concatenate.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could use the ArrayVector alias in the first input argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, that will neaten this up quite a bit. I see there's also BufferVector which should be useful too
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Mersenne-Twister PRNG engine has been identified as a source of test slowness. See discussion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I'll switch to std::default_random_engine
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another place where the ArrayVector alias could be used.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. Here are some comments below. In particular, I would welcome a bit more clarity in the implementation.
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it? What if I have sliced a StringArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in the spec: the first offset in an offsets buffer is always 0.
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here.
cpp/src/arrow/array-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather have this in a separate test file. array-test.cc is already too large, and too slow to compile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I'll move it to arrow-concatenate-test.
cpp/src/arrow/buffer.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please call this ConcatenateBuffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I rename Concatenate(arrays) to ConcatenateArrays as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you can, though disambiguating requires only one rename ;-)
cpp/src/arrow/util/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining what this does?
cpp/src/arrow/util/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining what this does?
- test concatenation of randomly generated array slices - ensure some of those array slices will have offset != 0 - test all primitive types, StringType, ListType, StructType, and DictionaryType - add disabled test for UnionType
vagrind was complaining about uninitialized values due to the way CopyBitmap restores trailing bits; extended CopyBitmap to (optionally) elide that step
- renaming - use memcpy not std::copy in ConcatenateBuffers - concatenating zero buffers does not terminate - refactor for clarity - add comments
14a78f7 to
c1600bc
Compare
|
@pitrou does this address your concerns? |
|
Thanks @bkietz. Can you open a JIRA for concatenation of union arrays? |
Concatenate arrays into a single array