Skip to content

[C++][Compute] Support Recordbatch sorting for dictionary type #38386

@llama90

Description

@llama90

Describe the enhancement requested

Hello. While implementing join operation support for the Dictionary type, I encountered the following message.

I am attempting to support the Dictionary type through the following steps:

  1. Random generation of Dictionary data
  2. Supporting Dictionary type for non-key columns
  3. Supporting Dictionary type for key columns

I discovered the following error while taking step 2.

'_error_or_value55.status()' failed with Type error: Unsupported type for RecordBatch sorting: dictionary<values=string, indices=int32, ordered=0>

The detailed content is as follows.

...

Test 2: LEFT_SEMI EQ_EQ parallel = true bloom_filter = true
	 left schema: large_string / string / dictionary<values=string, indices=int32, ordered=0> / uint64 / null
	 right schema: large_string / string / fixed_size_binary[17] / null
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/test_util_internal.cc:476: Failure
Failed
'_error_or_value55.status()' failed with Type error: Unsupported type for RecordBatch sorting: dictionary<values=string, indices=int32, ordered=0>
/Users/lama/workspace/arrow-2/cpp/src/arrow/compute/kernels/vector_sort.cc:410  VisitTypeInline(*physical_type, this)
/Users/lama/workspace/arrow-2/cpp/src/arrow/compute/kernels/vector_sort.cc:391  factory.MakeColumnSort()
/Users/lama/workspace/arrow-2/cpp/src/arrow/compute/kernels/vector_sort.cc:665  sorter.Sort(begin_offset)
/Users/lama/workspace/arrow-2/cpp/src/arrow/compute/kernels/vector_sort.cc:1038  sorter.Sort()
/Users/lama/workspace/arrow-2/cpp/src/arrow/compute/api_vector.cc:316  CallFunction("sort_indices", {datum}, &options, ctx)
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/test_util_internal.cc:465  SortIndices(tab, SortOptions(sort_keys))
Google Test trace:
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/hash_join_node_test.cc:1168: LEFT_SEMI EQ_EQ parallel = true bloom_filter = true

...

It appears that this error occurs due to the absence of sorting operation implementation for the Dictionary type, which is observed in the process of verifying the result values after performing the join operation.

Additionally, I attempted to support key column operations for the Null type, but encountered a similar type of error in this case as well.

'_error_or_value45.status()' failed with NotImplemented: Function 'equal' has no kernel matching input types (null, null)

Following these two error messages led me to the files below:

  • cpp/src/arrow/compute/kernels/codegen_internal.cc
  • cpp/src/arrow/compute/kernels/codegen_internal.h

Should I reference the logic in these files to implement the following functionalities, and then proceed with the join operation?

  • Implementing sorting for dictionary type.
  • Implementing equal operations for null, null types

The files mentioned above may not fundamentally be the problem, but it seems that type-specific operations are primarily needed to perform the join operation.

I am aiming to support sorting for the Dictionary Type to address the feature that triggers the error.

It would be great for some advice if I am misunderstanding the problem, or if anyone is well-informed about this part..

Component(s)

C++

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions