-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5336: [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal dictionaries #8984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cpp/src/arrow/array/array_dict.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DictionaryUnifier is typed to the value type of the dictionary but not the index type so I needed to create this utility for accessing the max possible value for a given index type. I wasn't really sure where to put this. There could also be a general purpose visitor for finding the min/max of all numeric types. Also, do we have this capability anywhere else in the code base I could just leverage? This is needed by DictionaryUnifierImpl::GetResultWithIndexType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which numeric type would you return for the min/max of all numeric types? Let's just stick with integers :-)
As for where to put it, util/int_util.h sounds like a decent place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even better, it turns out util/int_util.h already had what I needed. I switched to using internal::IntegersCanFit.
cpp/src/arrow/array/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the concatenation functions simply memcpy'd the buffers. However, the dictionary concatenation needs to map buffers to potentially new index values. As a result, this function needs to know the type of the buffer for the reinterpret case on line 190. Also, the fact that memo table indices are 32 bit and dictionary indices could be 64 bit is a potential problem but one that already existed and it seems unlikely that a dictionary array would be used when there are 4B unique values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are at it, can all non-public functions/classes in this module be put in the anonymous namespace? This reduces the number of exported symbols and can also open more optimization opportunities for the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/arrow/type.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be an overload but the behavior was different enough I felt it warranted its own name. GetResult is not actually used in the code base anywhere but DictionaryUnifier is an exported type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the docstring a bit off? It doesn't seem a DictionaryType is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, fixed.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Here a bunch of comments.
cpp/src/arrow/array/array_dict.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which numeric type would you return for the min/max of all numeric types? Let's just stick with integers :-)
As for where to put it, util/int_util.h sounds like a decent place.
cpp/src/arrow/type.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the docstring a bit off? It doesn't seem a DictionaryType is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean i += 2? Or simply UnsafeAppend(i * 2)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think i*2 would work. It's building up two arrays, the first has values [0,size) and the second has values [size,2*size). I changed it a little so it is one loop if that is clearer.
cpp/src/arrow/array/array_dict.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning int64_t is not right if the index type is uint64_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. N/A anymore since using int_util.
cpp/src/arrow/array/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why mutable_data? A const pointer should be sufficient here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cpp/src/arrow/array/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are at it, can all non-public functions/classes in this module be put in the anonymous namespace? This reduces the number of exported symbols and can also open more optimization opportunities for the compiler.
cpp/src/arrow/array/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems common to both if branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cpp/src/arrow/array/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If out_data a CType*? If so, spell it out explicitly for clarity. (is a reinterpret_cast missing too?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the missing reinterpret_cast was a bug (as I think you noticed) for index types with more than one byte. Fixed.
cpp/src/arrow/array/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: const auto&? Though copying the shared_ptr is probably not a bottleneck here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed anyways for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... can you test with index types larger than 1 byte? There may be an issue with them in the current impl (not sure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test and there was an issue. Both the missing reinterpret cast and the way I was computing size (# bytes vs. # elements)
b3fc15d to
7075b20
Compare
…aled issue with the index mapping. Reusing IntUtil functions for transpose and identifying the max index value. Cleaned up some comments and styling
|
Thanks for the insight @pitrou . Your guess was right, there was a bug with multi-byte index types. I believe I have addressed your concerns. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @westonpace !
|
A bit late that I think of it, but the arrow->pandas code already has some dictionary concatenation code ( arrow/cpp/src/arrow/python/arrow_to_pandas.cc Lines 1641 to 1676 in 97f8160
|
|
@jorisvandenbossche It's pretty close but there are a few differences.
I'll defer to @pitrou if we want to combine them but it seems simpler to just leave them separate for now. |
|
Ah, yes, I forgot about the null handling needed for pandas. That's certainly something that shouldn't be added to the general arrow functionality, but can be left as specific handling the arrow_to_pandas code. |
|
The differences look a bit annoying to reconcile (especially the different null representation). Not sure it's worth doing anything. |
The dictionaries still need to have the same index & value types. It is possible that concatenating two dictionaries still fails because the resulting dictionary has more values than its index type can represent.
The unification will still fail if nulls are present in either dictionary. The canonical approach seems to be representing nulls in the indices array with a validity bitmap. The existing unifier had this constraint in place. My guess is that this was to avoid making the memo table null-aware. It could be handled without modification to the memo table by using a -1 index and so I could easily add this if desired. I wasn't sure if support for this non-typical case justified the complexity.