[C++] Display the name of the problematic field when returning status "Data type ... is not supported in join non-key field" for HashJoin #36187

@rowillia

Describe the enhancement requested

Joining two tables where one of them has any column of type list (even if it is not the join column) raises an exception. For example:

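import pyarrow as pa

NUM_ITEMS = 30

# Left table: a binary join key plus a list-typed column that is NOT a join key.
t1 = pa.Table.from_pydict({
    'id': [x.to_bytes(4, 'big') for x in range(NUM_ITEMS)],
    'array_column': [list(range(3)) for _ in range(NUM_ITEMS)],
})
# Right table: the same binary join key plus a plain integer column.
t2 = pa.Table.from_pydict({
    'id': [x.to_bytes(4, 'big') for x in range(NUM_ITEMS)],
    'value': list(range(NUM_ITEMS)),
})
# The join fails even though 'array_column' is only a payload column.
t1.join(t2, 'id', join_type='inner')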

This results in the following exception:
ArrowInvalid: Data type list<item: int64> is not supported in join non-key field

This exception is fairly unintuitive (I spent a few hours today trying to understand what was causing it) and could be made a lot clearer by including the field name when it is available. I'm new to Arrow, but I believe the name should be available:

arrow/cpp/src/arrow/type.h

Lines 1829 to 1831 in f959a2e

const std::string* name() const {
  return IsName() ? &std::get<std::string>(impl_) : NULLPTR;
}

Component(s)

C++
