[Python, Java] UnionArray round trip not working #17700

@asfimport

Description

I'm currently working on making pyarrow.serialization data available from the Java side. One problem I ran into is that the Java implementation apparently cannot read UnionArrays generated from C++. To make this easy to reproduce, I created a clean Python implementation for creating UnionArrays: #1216

The data is generated with the following script:

import pyarrow as pa

binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)

batch = pa.RecordBatch.from_arrays([result], ["test"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)

writer.write_batch(batch)

sink.close()

b = sink.get_result()

with open("union_array.arrow", "wb") as f:
    f.write(b)

# Sanity check: Read the batch in again

with open("union_array.arrow", "rb") as f:
    b = f.read()
    reader = pa.RecordBatchStreamReader(pa.BufferReader(b))

batch = reader.read_next_batch()

print("union array is", batch.column(0))

I attached the file generated by that script. When I then run the following code in Java:

RootAllocator allocator = new RootAllocator(1000000000);

ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));

ArrowStreamReader reader = new ArrowStreamReader(in, allocator);

reader.loadNextBatch()

I get the following error:

|  java.lang.IllegalArgumentException thrown: Could not load buffers for field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can not truncate buffer to a larger size 7: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#7:1)

It seems that Java does not pick up that the UnionArray is Dense rather than Sparse. After changing the default mode in java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I get this:

jshell> reader.getVectorSchemaRoot().getSchema()
$9 ==> Schema<list: Union(Dense, [0])<: Struct<list: List<item: Union(Dense, [0])<: Int(64, true)>>>>>

but then reading doesn't work:

jshell> reader.loadNextBatch()
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field list: Union(Dense, [1])<: Struct<list: List<$data$: Union(Dense, [5])<: Int(64, true)>>>>. error message: can not truncate buffer to a larger size 1: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#8:1)
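
To make the Dense/Sparse distinction concrete, here is a minimal pure-Python sketch of how the two union layouts map slots to child values, using the same data as the script above. It is not pyarrow or Arrow Java code; the helper names `resolve_dense_union` and `can_be_sparse` are hypothetical and only illustrate the layout rules: a dense union looks up slot i at `children[type_ids[i]][value_offsets[i]]`, while a sparse union has no offsets and requires every child to be as long as the union itself.

```python
def resolve_dense_union(children, type_ids, value_offsets):
    """Return the logical values of a dense union (hypothetical helper).

    Slot i lives in child type_ids[i] at position value_offsets[i].
    """
    return [children[t][o] for t, o in zip(type_ids, value_offsets)]


def can_be_sparse(children, type_ids):
    """A sparse union needs every child to have the union's full length."""
    return all(len(child) == len(type_ids) for child in children)


# Same data as the reproduction script, as plain Python lists.
binary = [b"a", b"b", b"c", b"d"]
int64 = [1, 2, 3]
type_ids = [0, 1, 0, 0, 1, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]

dense_values = resolve_dense_union([binary, int64], type_ids, value_offsets)
print(dense_values)

# The children have lengths 4 and 3 while the union has 7 slots,
# so this data cannot be interpreted as a sparse union.
print(can_be_sparse([binary, int64], type_ids))
```

With this data the children are shorter than the 7-slot union, so a reader that assumes the Sparse mode has no consistent way to interpret the buffers.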

Any help with this is appreciated!

Reporter: Philipp Moritz / @pcmoritz
Assignee: Ryan Murray / @rymurr

Note: This issue was originally created as ARROW-1692. Please see the migration documentation for further details.
