CometReader.loadVector should not overwrite dictionary ids #475

@viirya

Describe the bug

Java Arrow provides the Data.importVector API, which imports an Arrow array and schema through the C Data Interface. The caller must pass a dictionary provider, and Data.importVector fills that provider with dictionary value vectors, keyed by dictionary id.

Dictionary ids are the keys used to look up the dictionary values for a dictionary-encoded array, so each id must be unique within a provider. Otherwise, dictionary-encoded arrays resolve to the wrong dictionary values and produce incorrect results.

In the Java Arrow API, a class named SchemaImport tracks the current dictionary id while an array is being imported. A design drawback of Data.importVector is that this SchemaImport is instantiated internally, inside Data.importVector, so uniqueness is guaranteed only within the array being imported. For example, if the array has a nested type, all dictionary-encoded arrays inside it get unique dictionary ids.

But once you import another array by calling Data.importVector again, the dictionary id counter is reset. The API therefore cannot provide unique dictionary ids across all the arrays you import.
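The reset can be illustrated with a small toy model. This is only a sketch of the behavior described above, not Arrow's code: ToyProvider and importVector here are hypothetical stand-ins, and the model assumes that both imports share one provider and that ids start at 0 on every call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the collision; none of these classes are Arrow's API. */
final class DictionaryIdCollisionDemo {

    /** Hypothetical stand-in for a dictionary provider: dictionary id -> dictionary values. */
    static final class ToyProvider {
        final Map<Long, List<String>> dictionaries = new HashMap<>();
    }

    /** Hypothetical stand-in for one import call; the id counter restarts on every call. */
    static List<Long> importVector(ToyProvider provider, List<List<String>> incoming) {
        long nextId = 0; // per-call state (assumption of this model)
        List<Long> ids = new ArrayList<>();
        for (List<String> values : incoming) {
            long id = nextId++;
            provider.dictionaries.put(id, values); // replaces any entry already stored under id
            ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        ToyProvider provider = new ToyProvider();
        // Two separate imports into the same provider, one dictionary each.
        long idA = importVector(provider, List.of(List.of("a", "b"))).get(0);
        long idB = importVector(provider, List.of(List.of("x", "y"))).get(0);
        System.out.println("idA=" + idA + ", idB=" + idB); // both are 0
        // Column A now looks up the dictionary that belongs to column B.
        System.out.println("A decodes index 0 as " + provider.dictionaries.get(idA).get(0)); // x
    }
}
```

Within a single call the model still hands out distinct ids (0, 1, ...), which matches the nested-type case; the collision only appears across calls.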

As a result, CometReader.loadVector, which calls Data.importVector to import arrays from native code, overwrites the dictionary ids of other arrays.
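One possible direction, sketched in a toy model, is to keep the dictionary id counter in an object that outlives a single import, so that every imported array draws from the same sequence. This is an illustration only: ImportSession and its importVector method are hypothetical names, not Arrow or Comet APIs, and the issue does not say how the fix is actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: a hypothetical importer whose id counter is shared across imports. */
final class SharedDictionaryIdDemo {

    /** Hypothetical session that lives as long as the dictionaries it holds. */
    static final class ImportSession {
        private long nextId = 0; // never reset between imports
        final Map<Long, List<String>> dictionaries = new HashMap<>();

        /** Registers each incoming dictionary under a fresh id and returns the ids assigned. */
        List<Long> importVector(List<List<String>> incoming) {
            List<Long> ids = new ArrayList<>();
            for (List<String> values : incoming) {
                long id = nextId++;
                dictionaries.put(id, values);
                ids.add(id);
            }
            return ids;
        }
    }

    public static void main(String[] args) {
        ImportSession session = new ImportSession();
        long idA = session.importVector(List.of(List.of("a", "b"))).get(0);
        long idB = session.importVector(List.of(List.of("x", "y"))).get(0);
        System.out.println("idA=" + idA + ", idB=" + idB); // 0 and 1
        // Column A still resolves to its own dictionary after the second import.
        System.out.println("A decodes index 0 as " + session.dictionaries.get(idA).get(0)); // a
    }
}
```

The design choice in this sketch is simply to tie the counter's lifetime to the provider's lifetime rather than to one import call.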

This is the cause of the CometTPCDSQuerySuite test failures in #437.

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Labels

bug
