@westonpace
Member

The dictionaries still need to have the same index & value types. It is possible that concatenating two dictionaries still fails because the resulting dictionary has more values than its index type can represent.

The unification will still fail if nulls are present in either dictionary. The canonical approach seems to be representing nulls in the indices array with a validity bitmap. The existing unifier had this constraint in place. My guess is that this was to avoid making the memo table null-aware. It could be handled without modification to the memo table by using a -1 index and so I could easily add this if desired. I wasn't sure if support for this non-typical case justified the complexity.
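The overflow condition described above can be illustrated with a small stand-alone check. This is only a sketch: `DictionaryFitsIndexType` is a hypothetical helper name, not an Arrow API, and it assumes the only question is whether the largest index needed by the combined dictionary is representable in the index type.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Arrow): returns true if a dictionary with
// `num_values` entries can be addressed by indices of type IndexT.
template <typename IndexT>
bool DictionaryFitsIndexType(uint64_t num_values) {
  if (num_values == 0) return true;  // nothing to index
  // The largest index needed is num_values - 1. The comparison is done in
  // uint64_t so that a uint64_t index type is handled without overflow.
  const uint64_t max_index =
      static_cast<uint64_t>(std::numeric_limits<IndexT>::max());
  return num_values - 1 <= max_index;
}
```

For example, two int8-indexed dictionaries with 100 distinct values each would yield 200 unified values, and `DictionaryFitsIndexType<int8_t>(200)` reports that this does not fit.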

Member Author

The DictionaryUnifier is typed to the value type of the dictionary but not the index type, so I needed to create this utility for accessing the max possible value for a given index type. I wasn't really sure where to put this. There could also be a general-purpose visitor for finding the min/max of all numeric types. Also, do we have this capability anywhere else in the code base that I could just leverage? This is needed by DictionaryUnifierImpl::GetResultWithIndexType.

Member

Which numeric type would you return for the min/max of all numeric types? Let's just stick with integers :-)
As for where to put it, util/int_util.h sounds like a decent place.

Member Author

Even better, it turns out util/int_util.h already had what I needed. I switched to using internal::IntegersCanFit.

Member Author

The rest of the concatenation functions simply memcpy the buffers. However, dictionary concatenation needs to map the index buffers to potentially new index values. As a result, this function needs to know the type of the buffer for the reinterpret_cast on line 190. Also, the fact that memo table indices are 32-bit while dictionary indices could be 64-bit is a potential problem, but one that already existed, and it seems unlikely that a dictionary array would be used when there are 4B unique values.

Member

While we are at it, can all non-public functions/classes in this module be put in the anonymous namespace? This reduces the number of exported symbols and can also open more optimization opportunities for the compiler.
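For illustration, a sketch of the pattern being requested. The names `example`, `ClampToZero`, and `NonNegativeLength` are made up for this example and do not come from the Arrow code base.

```cpp
#include <cstdint>

namespace example {
namespace {

// Internal linkage: this helper is visible only inside this translation
// unit, so no symbol is exported for it and the compiler may inline it freely.
int64_t ClampToZero(int64_t v) { return v < 0 ? 0 : v; }

}  // namespace

// Public entry point: the only function other translation units can link to.
int64_t NonNegativeLength(int64_t length) { return ClampToZero(length); }

}  // namespace example
```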

Member Author

Done

Member Author

This could be an overload but the behavior was different enough I felt it warranted its own name. GetResult is not actually used in the code base anywhere but DictionaryUnifier is an exported type.

Member

Isn't the docstring a bit off? It doesn't seem a DictionaryType is returned.

Member Author

Correct, fixed.

Member

@pitrou pitrou left a comment

Thank you! Here are a bunch of comments.

Member

Did you mean i += 2? Or simply UnsafeAppend(i * 2)?

Member Author

I don't think i*2 would work. It's building up two arrays, the first has values [0,size) and the second has values [size,2*size). I changed it a little so it is one loop if that is clearer.
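A sketch of the single-loop construction described here, using plain `std::vector` instead of Arrow builders. `MakeDisjointValues` is a hypothetical stand-in for the test setup and assumes `size` is non-negative.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the test setup: builds two value arrays with no
// overlap, the first holding [0, size) and the second [size, 2 * size).
std::pair<std::vector<int32_t>, std::vector<int32_t>> MakeDisjointValues(
    int32_t size) {
  std::vector<int32_t> first;
  std::vector<int32_t> second;
  first.reserve(size);
  second.reserve(size);
  for (int32_t i = 0; i < size; ++i) {
    first.push_back(i);          // 0, 1, ..., size - 1
    second.push_back(size + i);  // size, size + 1, ..., 2 * size - 1
  }
  return {std::move(first), std::move(second)};
}
```

Appending `i * 2` instead would produce only even values in a single array, which is a different data set from two disjoint consecutive ranges.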

Member

Returning int64_t is not right if the index type is uint64_t.

Member Author

Good catch. N/A anymore since using int_util.

Member

Why mutable_data? A const pointer should be sufficient here.

Member Author

Fixed.

Member

This seems common to both if branches.

Member Author

Fixed.

Member

Is out_data a CType*? If so, spell it out explicitly for clarity. (Is a reinterpret_cast missing too?)

Member Author

Yes, the missing reinterpret_cast was a bug (as I think you noticed) for index types with more than one byte. Fixed.

Member

Nit: const auto&? Though copying the shared_ptr is probably not a bottleneck here...

Member Author

Fixed anyways for consistency.

Member

Hmm... can you test with index types larger than 1 byte? There may be an issue with them in the current impl (not sure).

Member Author

I added a test and there was an issue: both the missing reinterpret_cast and the way I was computing the size (number of bytes vs. number of elements) were wrong.

…aled issue with the index mapping. Reusing IntUtil functions for transpose and identifying the max index value. Cleaned up some comments and styling
@westonpace
Member Author

Thanks for the insight, @pitrou. Your guess was right: there was a bug with multi-byte index types. I believe I have addressed your concerns.

Member

@pitrou pitrou left a comment

+1, thank you @westonpace !

@pitrou pitrou closed this in 97f8160 Jan 11, 2021
@jorisvandenbossche
Member

A bit late to think of it, but the arrow->pandas code already has some dictionary concatenation code:

```cpp
  Status WriteIndicesVarying(const ChunkedArray& data, std::shared_ptr<Array>* out_dict) {
    // Yield int32 indices to allow for dictionary outgrowing the current index
    // type
    RETURN_NOT_OK(this->AllocateNDArray(NPY_INT32, 1));
    auto out_values = reinterpret_cast<int32_t*>(this->block_data_);

    const auto& dict_type = checked_cast<const DictionaryType&>(*data.type());

    ARROW_ASSIGN_OR_RAISE(auto unifier, DictionaryUnifier::Make(dict_type.value_type(),
                                                                this->options_.pool));
    for (int c = 0; c < data.num_chunks(); c++) {
      const auto& arr = checked_cast<const DictionaryArray&>(*data.chunk(c));
      const auto& indices = checked_cast<const ArrayType&>(*arr.indices());
      auto values = reinterpret_cast<const T*>(indices.raw_values());

      std::shared_ptr<Buffer> transpose_buffer;
      RETURN_NOT_OK(unifier->Unify(*arr.dictionary(), &transpose_buffer));

      auto transpose = reinterpret_cast<const int32_t*>(transpose_buffer->data());
      int64_t dict_length = arr.dictionary()->length();

      RETURN_NOT_OK(CheckIndexBounds(*indices.data(), dict_length));

      // Null is -1 in CategoricalBlock
      for (int i = 0; i < arr.length(); ++i) {
        if (indices.IsValid(i)) {
          *out_values++ = transpose[values[i]];
        } else {
          *out_values++ = -1;
        }
      }
    }

    std::shared_ptr<DataType> unused_type;
    return unifier->GetResult(&unused_type, out_dict);
  }
```

I am not that familiar with the code, but there might be room for simplification there with the functionality added in this PR.

@westonpace
Member Author

westonpace commented Jan 12, 2021

@jorisvandenbossche It's pretty close but there are a few differences.

  • The pandas code allows the index type to expand (e.g. from uint8_t to uint16_t). In fact, it looks like it always sets it to int32_t. Also, arrow doesn't allow dictionary indices to be negative.
  • The pandas code puts -1 in the map for a null value. Arrow uses null in the validity bitmap for the indices array and/or null as an item in the dictionary itself with a valid index (both arrow approaches are legal but the pandas approach is neither of those)

I'll defer to @pitrou if we want to combine them but it seems simpler to just leave them separate for now.

@jorisvandenbossche
Member

Ah, yes, I forgot about the null handling needed for pandas. That's certainly something that shouldn't be added to the general arrow functionality, but can be left as specific handling in the arrow_to_pandas code.

@pitrou
Member

pitrou commented Jan 12, 2021

The differences look a bit annoying to reconcile (especially the different null representation). Not sure it's worth doing anything.
