ARROW-554: [C++] Add functions to unify dictionary types and arrays #3165

pitrou · 2018-12-12T18:16:16Z

No description provided.

cpp/src/arrow/builder-dict.cc

pitrou · 2018-12-18T20:10:43Z

This PR is ready for review now.

codecov-io · 2018-12-19T12:04:15Z

Codecov Report

Merging #3165 into master will increase coverage by 1.19%.
The diff coverage is 89.71%.

@@            Coverage Diff             @@
##           master    #3165      +/-   ##
==========================================
+ Coverage   86.44%   87.63%   +1.19%     
==========================================
  Files         508      446      -62     
  Lines       70077    66052    -4025     
==========================================
- Hits        60576    57885    -2691     
+ Misses       9398     8167    -1231     
+ Partials      103        0     -103

Impacted Files	Coverage Δ
cpp/src/arrow/visitor_inline.h	`96.89% <ø> (ø)`	⬆️
cpp/src/arrow/util/hashing.h	`99.25% <ø> (ø)`	⬆️
cpp/src/arrow/type.h	`87.01% <ø> (ø)`	⬆️
python/pyarrow/tests/test_types.py	`96.89% <100%> (ø)`	⬆️
cpp/src/arrow/type-test.cc	`100% <100%> (ø)`	⬆️
cpp/src/arrow/util/int-util-test.cc	`100% <100%> (ø)`	⬆️
cpp/src/arrow/type.cc	`92.5% <100%> (+0.08%)`	⬆️
cpp/src/arrow/util/int-util.cc	`99.6% <100%> (+0.02%)`	⬆️
cpp/src/arrow/array.h	`98.44% <50%> (-0.77%)`	⬇️
cpp/src/arrow/array/builder_dict.cc	`65.15% <74.62%> (+11.3%)`	⬆️
... and 87 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 944b9e3...7d2579b. Read the comment docs.

fsaintjacques

LGTM

fsaintjacques · 2018-12-19T13:15:33Z

cpp/src/arrow/array-dict-test.cc

 }

+TEST(TestDictionary, TransposeBasic) {
+  std::shared_ptr<Array> arr, out, expected;


Can you type them DictionaryArray at the declaration to avoid all casts?

Well, other APIs take Array pointers, not DictionaryArray.

fsaintjacques · 2018-12-19T13:33:05Z

cpp/src/arrow/util/int-util-test.cc

+  std::vector<int32_t> transpose_map = {1111, 2222, 3333, 4444, 5555, 6666, 7777};
+  std::vector<int64_t> dest(src.size());
+
+  TransposeInts(src.data(), dest.data(), 6, transpose_map.data());


src.size() instead of 6, just in case.

It may emit a cast warning, though :-/

fsaintjacques · 2018-12-19T13:33:44Z

cpp/src/arrow/util/int-util.cc

+template <typename InputInt, typename OutputInt>
+void TransposeInts(const InputInt* src, OutputInt* dest, int64_t length,
+                   const int32_t* transpose_map) {
+  while (length >= 4) {


Clang does this unrolling naturally, gcc does not :( https://godbolt.org/z/ZHobe1

Neither does MSVC. The compiler generally doesn't have enough context to know whether unrolling is a good idea (I'm surprised than clang does it at -O2, btw). I'd rather unroll explicitly where we have that context.

fsaintjacques · 2018-12-19T13:34:15Z

cpp/src/arrow/util/int-util.cc

+  while (length >= 4) {
+    dest[0] = static_cast<OutputInt>(transpose_map[src[0]]);
+    dest[1] = static_cast<OutputInt>(transpose_map[src[1]]);
+    dest[2] = static_cast<OutputInt>(transpose_map[src[2]]);


Aren't you concerned by truncation due to incompatible size, maybe a static_assert somewhere?

What would we static_assert? Indeed it's up to the caller to ensure the input is ok (it is by construction in case of dictionary unification).

sizeof(in_type) <= sizeof(out_type)
is_signed(in_type) == is_signed(out_type)

Ah, but the converse is supported as well.

bkietz · 2018-12-19T12:00:55Z

cpp/src/arrow/type.h

+  ///     one per input type.  Each integer vector represents the transposition
+  ///     of input type indices into unified type indices.
+  // XXX Should we return something special (an empty transpose map?) when
+  // the transposition is the identity function?


Is there an instance where the first transposition isn't the identity function? If not then maybe we could just return types.size()-1 transpositions

Not in this implementation, but the XXX is about the more general case (all transpositions could actually be the identity function).

wesm · 2018-12-19T17:50:48Z

Giving this a look before merging, bear with me

wesm

Thank you for kicking off this work. This can be a nuanced topic -- there's a number of follow up items that I've noted in comments. I'm going to go ahead and merge this patch and then we can create follow up JIRAs for discussion and work. A number of these items of follow up work will be forced by the development of other things, so no specific urgency

wesm · 2018-12-19T18:09:55Z

cpp/src/arrow/array.cc

+                                  in_data.null_count);
+  internal::TransposeInts(in_data.GetValues<in_c_type>(1),
+                          out_data->GetMutableValues<out_c_type>(1), in_data.length,
+                          transpose_map.data());


Note: this is a specific case of the "take" function, a la ndarray.take or "from" verb in J. Here we have

out = in TAKE transpose_map

https://issues.apache.org/jira/browse/ARROW-772

Yes, though this one has transparent up/downcasting as well (which we probably don't want in the general case of a "take" function).

wesm · 2018-12-19T18:11:57Z

cpp/src/arrow/array.cc

+  // implement dictionary-to-dictionary casting.
+  auto in_type_id = dict_type_->index_type()->id();
+  auto out_type_id = out_dict_type.index_type()->id();
+


Do you think it's worth the (minor) expense of checking for no-op case where transpose_map is range(dictionary_size)? This covers the case where the unified dictionary type is an extended version of the current type, e.g. as the result of a delta dictionary

Perhaps let's create a follow up JIRA so as to not hold up this patch

There is already an XXX for that in type.h. I agree it could be beneficial in many cases (such as when you are simply augmenting an existing dictionary).

https://issues.apache.org/jira/browse/ARROW-4095

wesm · 2018-12-19T18:14:05Z

cpp/src/arrow/array.h

+  Status Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type,
+                   const std::vector<int32_t>& transpose_map,
+                   std::shared_ptr<Array>* out) const;
+  // XXX Do we also want an unsafe in-place Transpose?


I think we do want that but how we handle it may require some care. This also came up in the context of Arrow Flight where the dictionaries are unknown at the start of an RPC session

https://issues.apache.org/jira/browse/ARROW-3144

I would suggest that we address both cases through a single design, i.e. a MutableDictionaryType subtype or something similar.

One complexity that this introduces (which we need to be prepared for anyway) is the some code will need to check whether type->dictionary() is null or not (if it were null, it would mean that some data has been used inappropriately before the dictionary has been resolved), but I think that's OK.

I left a comment in ARROW-3144 directing to this discussion

wesm · 2018-12-19T18:23:29Z

cpp/src/arrow/array/builder_dict.cc

+    dict_types.push_back(checked_cast<const DictionaryType*>(type));
+  }
+
+  // XXX Should we check the ordered flag?


This is a bit tricky. We should consider it in a follow up JIRA.

You could conceivably have a sequence of dictionaries like

a < b < c a < b < c < d a < b < c < d < e < f

This could result from a sequence of ordered delta dictionaries. You could argue in this case that the output type should also be ordered.

IMHO the "ordered" flag doesn't seem to make sense. A dictionary is simply a linear lookup table, so it's always ordered in a sense.

ordered is part of the protocol https://github.com/apache/arrow/blob/master/format/Schema.fbs#L236

Minor optimization: if dictionaries have an ordering, that could be used to merge them without building a memo table.

@bkietz I'm not sure I understand your suggestion. "ordered" here does not mean "sorted", just that the dictionary's order is semantically meaningful.

A 0 1 2 3 4 5 6 B 7 8 1 3 4 10 11 6 U 0 7 8 1 2 3 4 5 10 11 6

Right, but how does the algorithm look like and why would it more efficient than a series of hash lookups?

Sorry, you're right; talking about optimization is not justified when there is not a version which preserves order using a hash table. I'll write an example of each and benchmark them

I would say it's not worth spending time right now on handling multiple "ordered" dictionaries (where the only case where "ordered" would be preserved is when all the dictionaries are prefixes of a common dictionary)

I created https://issues.apache.org/jira/browse/ARROW-4096. This is low priority

wesm · 2018-12-19T18:26:41Z

cpp/src/arrow/array/builder_dict.cc

+
+Status DictionaryType::Unify(MemoryPool* pool, const std::vector<const DataType*>& types,
+                             std::shared_ptr<DataType>* out_type,
+                             std::vector<std::vector<int32_t>>* out_transpose_maps) {


We should also consider the case where we want to conform a dictionary to an other, without unification. For example:

[a, b, c, a, b, c], dictionary [a, b, c] conform to dictionary [a, c, d] yields [a, null, c, a, null, c]

This isn't unification so much as conformance. This was what was meant by my comment "Conform the dictionary indices to reference a common dictionary" in ARROW-554. The use case would be implementing casts between dictionary types -- this would include a safe and unsafe mode (safe would error if a value is made null)

Since this use case is not addressed in this patch (unless I missed it), let us create a follow up JIRA

Hmm, I see. I have misinterpreted the original JIRA then. In any case, I was thinking of the CSV use case, where we'll be unifying dictionaries at the end.

I opened https://issues.apache.org/jira/browse/ARROW-4097

wesm · 2018-12-19T18:28:56Z

cpp/src/arrow/type-test.cc

+  auto t2 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 5, 6]"));
+  auto t3 = dictionary(int16(), ArrayFromJSON(int32(), "[3, 4, 5, 6]"));
+  auto t4 = dictionary(int8(), ArrayFromJSON(int16(), "[3, 4, 5, 6]"));
+  auto t5 = dictionary(int8(), ArrayFromJSON(int32(), "[3, 4, 7, 6]"));


I'm quite glad we have this function now =) We might consider adding even more testing sugar so that we can write jsarray::int64("[3, 4, 5, 6]") or some such

"[3, 1, 4, 2]"_array(int64())

wesm · 2018-12-19T18:31:57Z

python/pyarrow/tests/test_types.py

-    ty1 = pa.dictionary(pa.float32(), pa.array([1.0, 2.0]), ordered=True)
-    assert ty1.index_type == pa.float32()
+    ty1 = pa.dictionary(pa.int8(), pa.array([1.0, 2.0]), ordered=True)
+    assert ty1.index_type == pa.int8()


wesm · 2018-12-21T00:42:22Z

OK I think I have all the follow up items accounted for in JIRA

pitrou added the WIP PR is work in progress label Dec 12, 2018

pitrou force-pushed the ARROW-554-conform-dicts branch 2 times, most recently from 28e55ce to cf6f8e2 Compare December 12, 2018 19:35

bkietz reviewed Dec 13, 2018

View reviewed changes

cpp/src/arrow/builder-dict.cc Outdated Show resolved Hide resolved

fsaintjacques reviewed Dec 16, 2018

View reviewed changes

cpp/src/arrow/builder-dict.cc Outdated Show resolved Hide resolved

pitrou force-pushed the ARROW-554-conform-dicts branch 3 times, most recently from f961288 to a72a759 Compare December 18, 2018 20:05

pitrou changed the title ~~[WIP] ARROW-554: [C++] Add functions to conform dictionaries~~ ARROW-554: [C++] Add functions to conform dictionaries Dec 18, 2018

pitrou removed the WIP PR is work in progress label Dec 18, 2018

ARROW-554: [C++] Add functions to conform dictionaries

7d2579b

pitrou force-pushed the ARROW-554-conform-dicts branch from a72a759 to 7d2579b Compare December 19, 2018 11:15

fsaintjacques approved these changes Dec 19, 2018

View reviewed changes

bkietz approved these changes Dec 19, 2018

View reviewed changes

wesm reviewed Dec 19, 2018

View reviewed changes

wesm changed the title ~~ARROW-554: [C++] Add functions to conform dictionaries~~ ARROW-554: [C++] Add functions to unify dictionary types and arrays Dec 19, 2018

wesm closed this in f66fa80 Dec 19, 2018

pitrou deleted the ARROW-554-conform-dicts branch December 19, 2018 22:11

valkum mentioned this pull request Oct 15, 2025

DictionaryKeyOverflowError on DataFrame.write_parquet apache/datafusion#17445

Open

ARROW-554: [C++] Add functions to unify dictionary types and arrays #3165

ARROW-554: [C++] Add functions to unify dictionary types and arrays #3165

Conversation

pitrou commented Dec 12, 2018

Uh oh!

Uh oh!

Uh oh!

pitrou commented Dec 18, 2018

Uh oh!

codecov-io commented Dec 19, 2018

Codecov Report

Uh oh!

fsaintjacques left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wesm commented Dec 19, 2018

Uh oh!

wesm left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bkietz Dec 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bkietz Dec 20, 2018 •

edited

Loading