ARROW-5336: [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal dictionaries #8984

westonpace · 2020-12-21T21:17:49Z

The dictionaries still need to have the same index & value types. It is possible that concatenating two dictionaries still fails because the resulting dictionary has more values than its index type can represent.

The unification will still fail if nulls are present in either dictionary. The canonical approach seems to be representing nulls in the indices array with a validity bitmap. The existing unifier had this constraint in place. My guess is that this was to avoid making the memo table null-aware. It could be handled without modification to the memo table by using a -1 index and so I could easily add this if desired. I wasn't sure if support for this non-typical case justified the complexity.
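For readers unfamiliar with the unification step, here is a minimal sketch of the memo-table idea described above. It is an illustration only: `ToyUnifier` is a hypothetical name, it is not Arrow's `DictionaryUnifier`, and like the constraint mentioned above it does not handle null values in the dictionary.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dictionary unifier (NOT Arrow's implementation).
class ToyUnifier {
 public:
  // Adds one dictionary and returns its transpose map, where
  // transpose[old_index] is the index of the same value in the unified dictionary.
  std::vector<int32_t> Unify(const std::vector<std::string>& dictionary) {
    std::vector<int32_t> transpose;
    transpose.reserve(dictionary.size());
    for (const std::string& value : dictionary) {
      // Insert the value with the next free unified index if it is new.
      auto inserted =
          memo_.emplace(value, static_cast<int32_t>(unified_.size()));
      if (inserted.second) {
        unified_.push_back(value);
      }
      transpose.push_back(inserted.first->second);
    }
    return transpose;
  }

  // Unified dictionary values, in first-seen order.
  const std::vector<std::string>& unified() const { return unified_; }

 private:
  std::unordered_map<std::string, int32_t> memo_;  // value -> unified index
  std::vector<std::string> unified_;
};
```

Unifying `{"a", "b"}` and then `{"b", "c"}` with this sketch yields a three-value unified dictionary, and each call returns the map needed to rewrite that array's indices.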

westonpace · 2020-12-21T21:20:21Z

cpp/src/arrow/array/array_dict.cc

The DictionaryUnifier is typed to the value type of the dictionary but not the index type, so I needed to create this utility for accessing the max possible value for a given index type. I wasn't really sure where to put this. There could also be a general-purpose visitor for finding the min/max of all numeric types. Also, do we have this capability anywhere else in the code base that I could just leverage? This is needed by DictionaryUnifierImpl::GetResultWithIndexType.

Which numeric type would you return for the min/max of all numeric types? Let's just stick with integers :-)
As for where to put it, util/int_util.h sounds like a decent place.

Even better, it turns out util/int_util.h already had what I needed. I switched to using internal::IntegersCanFit.
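As a rough illustration of the check being discussed (whether a unified dictionary still fits its index type), here is a standalone sketch. `DictionaryFitsIndexType` is a hypothetical helper written for this example; it does not reproduce `internal::IntegersCanFit` or its signature.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not part of Arrow): returns true if a dictionary with
// `dict_length` values can be addressed by indices of type IndexType.
template <typename IndexType>
bool DictionaryFitsIndexType(uint64_t dict_length) {
  static_assert(std::is_integral<IndexType>::value,
                "index type must be an integer type");
  if (dict_length == 0) {
    return true;
  }
  // The largest index in use is dict_length - 1. The comparison is done in
  // uint64_t so that a uint64_t index type is not truncated through int64_t.
  const uint64_t max_index =
      static_cast<uint64_t>(std::numeric_limits<IndexType>::max());
  return dict_length - 1 <= max_index;
}
```

With this sketch an `int8_t` index type accepts at most 128 dictionary values and a `uint8_t` index type at most 256.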

westonpace · 2020-12-21T21:25:46Z

cpp/src/arrow/array/concatenate.cc

The rest of the concatenation functions simply memcpy'd the buffers. However, the dictionary concatenation needs to map the index buffers to potentially new index values. As a result, this function needs to know the type of the buffer for the reinterpret_cast on line 190. Also, the fact that memo table indices are 32-bit and dictionary indices could be 64-bit is a potential problem, but one that already existed, and it seems unlikely that a dictionary array would be used when there are 4B unique values.
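To show why the index width matters when remapping, here is a small sketch under stated assumptions. `RemapIndices` is a hypothetical helper, not the function from this PR, and it uses `std::memcpy` to read and write each element instead of the `reinterpret_cast` mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not Arrow's implementation): rewrites `length` indices
// stored as raw bytes in `src` through `transpose` and writes them to `dst`.
// `length` counts elements of IndexType, not bytes.
template <typename IndexType>
void RemapIndices(const uint8_t* src, uint8_t* dst, int64_t length,
                  const std::vector<int32_t>& transpose) {
  for (int64_t i = 0; i < length; ++i) {
    const std::size_t offset = static_cast<std::size_t>(i) * sizeof(IndexType);
    IndexType old_index;
    std::memcpy(&old_index, src + offset, sizeof(IndexType));
    // Look up the position of this value in the unified dictionary.
    const IndexType new_index = static_cast<IndexType>(
        transpose[static_cast<std::size_t>(old_index)]);
    std::memcpy(dst + offset, &new_index, sizeof(IndexType));
  }
}
```

Treating the buffer as one-byte elements, or passing a byte count as `length`, would only be correct for 8-bit index types.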

While we are at it, can all non-public functions/classes in this module be put in the anonymous namespace? This reduces the number of exported symbols and can also open more optimization opportunities for the compiler.
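For reference, the pattern being requested looks roughly like the following. All names here (`example`, `SumLengths`, `TotalLength`) are hypothetical and chosen only for this illustration.

```cpp
#include <cstdint>

namespace example {  // hypothetical namespace standing in for the module's own

namespace {

// Internal linkage: this helper is invisible to other translation units, so
// it is not an exported symbol and the compiler is free to inline it.
int64_t SumLengths(const int64_t* lengths, int n) {
  int64_t total = 0;
  for (int i = 0; i < n; ++i) {
    total += lengths[i];
  }
  return total;
}

}  // namespace

// Public entry point with external linkage.
int64_t TotalLength(const int64_t* lengths, int n) {
  return SumLengths(lengths, n);
}

}  // namespace example
```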

cpp/src/arrow/array/concatenate.cc

cpp/src/arrow/array/concatenate_test.cc

westonpace · 2020-12-21T21:31:25Z

cpp/src/arrow/type.h

This could be an overload but the behavior was different enough I felt it warranted its own name. GetResult is not actually used in the code base anywhere but DictionaryUnifier is an exported type.

Isn't the docstring a bit off? It doesn't seem a DictionaryType is returned.

Correct, fixed.

github-actions · 2020-12-21T21:55:24Z

https://issues.apache.org/jira/browse/ARROW-5336

pitrou

Thank you! Here are a bunch of comments.

pitrou · 2021-01-04T15:40:24Z

cpp/src/arrow/array/array_dict.cc

Which numeric type would you return for the min/max of all numeric types? Let's just stick with integers :-)
As for where to put it, util/int_util.h sounds like a decent place.

pitrou · 2021-01-04T15:44:12Z

cpp/src/arrow/type.h

Isn't the docstring a bit off? It doesn't seem a DictionaryType is returned.

pitrou · 2021-01-04T15:46:51Z

cpp/src/arrow/array/concatenate_test.cc

Did you mean i += 2? Or simply UnsafeAppend(i * 2)?

I don't think i*2 would work. It's building up two arrays: the first has values [0,size) and the second has values [size,2*size). I changed it a little so it is one loop, if that is clearer.
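A sketch of the single-loop shape described above, using plain `std::vector` instead of the Arrow builders from the actual test. `MakeDisjointValues` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical test helper: returns {first, second}, where first holds the
// values [0, size) and second holds the values [size, 2 * size).
std::pair<std::vector<int32_t>, std::vector<int32_t>> MakeDisjointValues(
    int32_t size) {
  std::vector<int32_t> first;
  std::vector<int32_t> second;
  first.reserve(static_cast<std::size_t>(size));
  second.reserve(static_cast<std::size_t>(size));
  for (int32_t i = 0; i < size; ++i) {
    first.push_back(i);
    second.push_back(size + i);
  }
  return std::make_pair(std::move(first), std::move(second));
}
```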

pitrou · 2021-01-04T15:49:02Z

cpp/src/arrow/array/array_dict.cc

Returning int64_t is not right if the index type is uint64_t.

Good catch. N/A anymore since using int_util.

pitrou · 2021-01-04T15:52:38Z

cpp/src/arrow/array/concatenate.cc

Why mutable_data? A const pointer should be sufficient here.

pitrou · 2021-01-04T15:58:24Z

cpp/src/arrow/array/concatenate.cc

While we are at it, can all non-public functions/classes in this module be put in the anonymous namespace? This reduces the number of exported symbols and can also open more optimization opportunities for the compiler.

pitrou · 2021-01-04T15:59:25Z

cpp/src/arrow/array/concatenate.cc

This seems common to both if branches.

pitrou · 2021-01-04T16:02:59Z

cpp/src/arrow/array/concatenate.cc

Is out_data a CType*? If so, spell it out explicitly for clarity. (Is a reinterpret_cast missing too?)

Yes, the missing reinterpret_cast was a bug (as I think you noticed) for index types with more than one byte. Fixed.

pitrou · 2021-01-04T16:04:40Z

cpp/src/arrow/array/concatenate.cc

Nit: const auto&? Though copying the shared_ptr is probably not a bottleneck here...

Fixed anyways for consistency.

pitrou · 2021-01-04T16:05:40Z

cpp/src/arrow/array/concatenate_test.cc

Hmm... can you test with index types larger than 1 byte? There may be an issue with them in the current impl (not sure).

I added a test and it exposed an issue: both the missing reinterpret_cast and the way I was computing size (number of bytes vs. number of elements) were wrong.

westonpace · 2021-01-08T22:24:09Z

Thanks for the insight @pitrou . Your guess was right, there was a bug with multi-byte index types. I believe I have addressed your concerns.

pitrou

+1, thank you @westonpace !

jorisvandenbossche · 2021-01-11T09:26:48Z

I'm a bit late in thinking of it, but the arrow->pandas code already has some dictionary concatenation code (arrow/cpp/src/arrow/python/arrow_to_pandas.cc, lines 1641 to 1676 in 97f8160):

    Status WriteIndicesVarying(const ChunkedArray& data, std::shared_ptr<Array>* out_dict) {
      // Yield int32 indices to allow for dictionary outgrowing the current index
      // type
      RETURN_NOT_OK(this->AllocateNDArray(NPY_INT32, 1));
      auto out_values = reinterpret_cast<int32_t*>(this->block_data_);
      const auto& dict_type = checked_cast<const DictionaryType&>(*data.type());
      ARROW_ASSIGN_OR_RAISE(auto unifier, DictionaryUnifier::Make(dict_type.value_type(),
                                                                  this->options_.pool));
      for (int c = 0; c < data.num_chunks(); c++) {
        const auto& arr = checked_cast<const DictionaryArray&>(*data.chunk(c));
        const auto& indices = checked_cast<const ArrayType&>(*arr.indices());
        auto values = reinterpret_cast<const T*>(indices.raw_values());
        std::shared_ptr<Buffer> transpose_buffer;
        RETURN_NOT_OK(unifier->Unify(*arr.dictionary(), &transpose_buffer));
        auto transpose = reinterpret_cast<const int32_t*>(transpose_buffer->data());
        int64_t dict_length = arr.dictionary()->length();
        RETURN_NOT_OK(CheckIndexBounds(*indices.data(), dict_length));
        // Null is -1 in CategoricalBlock
        for (int i = 0; i < arr.length(); ++i) {
          if (indices.IsValid(i)) {
            *out_values++ = transpose[values[i]];
          } else {
            *out_values++ = -1;
          }
        }
      }
      std::shared_ptr<DataType> unused_type;
      return unifier->GetResult(&unused_type, out_dict);
    }

I am not that familiar with the code, but there might be room for simplification there with the functionality added in this PR.

westonpace · 2021-01-12T02:11:57Z

@jorisvandenbossche It's pretty close but there are a few differences.

The pandas code allows the index type to expand (e.g. from uint8_t to uint16_t). In fact, it looks like it always sets it to int32_t. Also, Arrow doesn't allow dictionary indices to be negative.
The pandas code puts -1 in the map for a null value. Arrow uses null in the validity bitmap for the indices array and/or null as an item in the dictionary itself with a valid index (both Arrow approaches are legal, but the pandas approach is neither of those).

I'll defer to @pitrou if we want to combine them but it seems simpler to just leave them separate for now.
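To make the second difference concrete, here is a small illustration. It is neither Arrow nor pandas code: `ToPandasCodes` is a hypothetical function, and a `std::vector<bool>` stands in for the validity bitmap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration: converts Arrow-style indices plus one validity
// flag per slot into pandas-style categorical codes, where null is -1.
std::vector<int32_t> ToPandasCodes(const std::vector<int32_t>& indices,
                                   const std::vector<bool>& is_valid) {
  std::vector<int32_t> codes(indices.size());
  for (std::size_t i = 0; i < indices.size(); ++i) {
    // A null slot keeps whatever index value it had in Arrow, because the
    // validity flag alone marks it as null. The pandas form needs the
    // -1 sentinel written into the codes themselves.
    codes[i] = is_valid[i] ? indices[i] : -1;
  }
  return codes;
}
```

The resulting codes contain a negative value for every null slot, which is why they cannot be used directly as Arrow dictionary indices.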

jorisvandenbossche · 2021-01-12T08:46:22Z

Ah, yes, I forgot about the null handling needed for pandas. That's certainly something that shouldn't be added to the general Arrow functionality, but can be left as specific handling in the arrow_to_pandas code.

pitrou · 2021-01-12T10:52:49Z

The differences look a bit annoying to reconcile (especially the different null representation). Not sure it's worth doing anything.

github-actions bot added the Component: C++ label Dec 21, 2020

pitrou reviewed Jan 4, 2021

westonpace added 2 commits January 8, 2021 10:07

Added concatenate implementation for dictionaries

83bd54e

Fixed a test line I had accidentally moved

7075b20

westonpace force-pushed the feature/arrow-5336 branch from b3fc15d to 7075b20 Compare January 8, 2021 20:09

westonpace added 2 commits January 8, 2021 11:56

Addressing PR comments. Added test case for 2 byte indices which revealed issue with the index mapping. Reusing IntUtil functions for transpose and identifying the max index value. Cleaned up some comments and styling

9fd5dbd

Reworked the double loops a bit to make it clearer what is being created

a3a478f

pitrou approved these changes Jan 11, 2021

pitrou closed this in 97f8160 Jan 11, 2021

westonpace deleted the feature/arrow-5336 branch March 3, 2021 17:35

asfimport mentioned this pull request Jan 12, 2021

[C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal dictionaries #21796

Closed
