-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-554: [C++] Add functions to unify dictionary types and arrays #3165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
28e55ce to
cf6f8e2
Compare
f961288 to
a72a759
Compare
|
This PR is ready for review now. |
a72a759 to
7d2579b
Compare
Codecov Report
@@ Coverage Diff @@
## master #3165 +/- ##
==========================================
+ Coverage 86.44% 87.63% +1.19%
==========================================
Files 508 446 -62
Lines 70077 66052 -4025
==========================================
- Hits 60576 57885 -2691
+ Misses 9398 8167 -1231
+ Partials 103 0 -103
Continue to review full report at Codecov.
|
fsaintjacques
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| } | ||
|
|
||
| TEST(TestDictionary, TransposeBasic) { | ||
| std::shared_ptr<Array> arr, out, expected; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you type them DictionaryArray at the declaration to avoid all casts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, other APIs take Array pointers, not DictionaryArray.
| std::vector<int32_t> transpose_map = {1111, 2222, 3333, 4444, 5555, 6666, 7777}; | ||
| std::vector<int64_t> dest(src.size()); | ||
|
|
||
| TransposeInts(src.data(), dest.data(), 6, transpose_map.data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src.size() instead of 6, just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may emit a cast warning, though :-/
| template <typename InputInt, typename OutputInt> | ||
| void TransposeInts(const InputInt* src, OutputInt* dest, int64_t length, | ||
| const int32_t* transpose_map) { | ||
| while (length >= 4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang does this unrolling naturally, gcc does not :( https://godbolt.org/z/ZHobe1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither does MSVC. The compiler generally doesn't have enough context to know whether unrolling is a good idea (I'm surprised than clang does it at -O2, btw). I'd rather unroll explicitly where we have that context.
| while (length >= 4) { | ||
| dest[0] = static_cast<OutputInt>(transpose_map[src[0]]); | ||
| dest[1] = static_cast<OutputInt>(transpose_map[src[1]]); | ||
| dest[2] = static_cast<OutputInt>(transpose_map[src[2]]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't you concerned by truncation due to incompatible size, maybe a static_assert somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would we static_assert? Indeed it's up to the caller to ensure the input is ok (it is by construction in case of dictionary unification).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sizeof(in_type) <= sizeof(out_type)
is_signed(in_type) == is_signed(out_type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but the converse is supported as well.
| /// one per input type. Each integer vector represents the transposition | ||
| /// of input type indices into unified type indices. | ||
| // XXX Should we return something special (an empty transpose map?) when | ||
| // the transposition is the identity function? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an instance where the first transposition isn't the identity function? If not then maybe we could just return types.size()-1 transpositions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this implementation, but the XXX is about the more general case (all transpositions could actually be the identity function).
|
Giving this a look before merging, bear with me |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for kicking off this work. This can be a nuanced topic -- there's a number of follow up items that I've noted in comments. I'm going to go ahead and merge this patch and then we can create follow up JIRAs for discussion and work. A number of these items of follow up work will be forced by the development of other things, so no specific urgency
| in_data.null_count); | ||
| internal::TransposeInts(in_data.GetValues<in_c_type>(1), | ||
| out_data->GetMutableValues<out_c_type>(1), in_data.length, | ||
| transpose_map.data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is a specific case of the "take" function, a la ndarray.take or "from" verb in J. Here we have
out = in TAKE transpose_map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though this one has transparent up/downcasting as well (which we probably don't want in the general case of a "take" function).
| // implement dictionary-to-dictionary casting. | ||
| auto in_type_id = dict_type_->index_type()->id(); | ||
| auto out_type_id = out_dict_type.index_type()->id(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth the (minor) expense of checking for no-op case where transpose_map is range(dictionary_size)? This covers the case where the unified dictionary type is an extended version of the current type, e.g. as the result of a delta dictionary
Perhaps let's create a follow up JIRA so as to not hold up this patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already an XXX for that in type.h. I agree it could be beneficial in many cases (such as when you are simply augmenting an existing dictionary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Status Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type, | ||
| const std::vector<int32_t>& transpose_map, | ||
| std::shared_ptr<Array>* out) const; | ||
| // XXX Do we also want an unsafe in-place Transpose? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do want that but how we handle it may require some care. This also came up in the context of Arrow Flight where the dictionaries are unknown at the start of an RPC session
https://issues.apache.org/jira/browse/ARROW-3144
I would suggest that we address both cases through a single design, i.e. a MutableDictionaryType subtype or something similar.
One complexity that this introduces (which we need to be prepared for anyway) is the some code will need to check whether type->dictionary() is null or not (if it were null, it would mean that some data has been used inappropriately before the dictionary has been resolved), but I think that's OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment in ARROW-3144 directing to this discussion
| dict_types.push_back(checked_cast<const DictionaryType*>(type)); | ||
| } | ||
|
|
||
| // XXX Should we check the ordered flag? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky. We should consider it in a follow up JIRA.
You could conceivably have a sequence of dictionaries like
a < b < c
a < b < c < d
a < b < c < d < e < f
This could result from a sequence of ordered delta dictionaries. You could argue in this case that the output type should also be ordered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO the "ordered" flag doesn't seem to make sense. A dictionary is simply a linear lookup table, so it's always ordered in a sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered is part of the protocol https://github.com/apache/arrow/blob/master/format/Schema.fbs#L236
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor optimization: if dictionaries have an ordering, that could be used to merge them without building a memo table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkietz I'm not sure I understand your suggestion. "ordered" here does not mean "sorted", just that the dictionary's order is semantically meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 0 1 2 3 4 5 6
B 7 8 1 3 4 10 11 6
U 0 7 8 1 2 3 4 5 10 11 6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but how does the algorithm look like and why would it more efficient than a series of hash lookups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you're right; talking about optimization is not justified when there is not a version which preserves order using a hash table. I'll write an example of each and benchmark them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say it's not worth spending time right now on handling multiple "ordered" dictionaries (where the only case where "ordered" would be preserved is when all the dictionaries are prefixes of a common dictionary)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created https://issues.apache.org/jira/browse/ARROW-4096. This is low priority
|
|
||
| Status DictionaryType::Unify(MemoryPool* pool, const std::vector<const DataType*>& types, | ||
| std::shared_ptr<DataType>* out_type, | ||
| std::vector<std::vector<int32_t>>* out_transpose_maps) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also consider the case where we want to conform a dictionary to an other, without unification. For example:
[a, b, c, a, b, c], dictionary [a, b, c]
conform to dictionary [a, c, d]
yields
[a, null, c, a, null, c]
This isn't unification so much as conformance. This was what was meant by my comment "Conform the dictionary indices to reference a common dictionary" in ARROW-554. The use case would be implementing casts between dictionary types -- this would include a safe and unsafe mode (safe would error if a value is made null)
Since this use case is not addressed in this patch (unless I missed it), let us create a follow up JIRA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see. I have misinterpreted the original JIRA then. In any case, I was thinking of the CSV use case, where we'll be unifying dictionaries at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto t2 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 5, 6]")); | ||
| auto t3 = dictionary(int16(), ArrayFromJSON(int32(), "[3, 4, 5, 6]")); | ||
| auto t4 = dictionary(int8(), ArrayFromJSON(int16(), "[3, 4, 5, 6]")); | ||
| auto t5 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 7, 6]")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite glad we have this function now =) We might consider adding even more testing sugar so that we can write jsarray::int64("[3, 4, 5, 6]") or some such
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"[3, 1, 4, 2]"_array(int64())
| ty1 = pa.dictionary(pa.float32(), pa.array([1.0, 2.0]), ordered=True) | ||
| assert ty1.index_type == pa.float32() | ||
| ty1 = pa.dictionary(pa.int8(), pa.array([1.0, 2.0]), ordered=True) | ||
| assert ty1.index_type == pa.int8() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
OK I think I have all the follow up items accounted for in JIRA |
No description provided.