Description
When ArrowStreamWriter writes a RecordBatch that contains nulls, it mixes up the columns' NullCount values.
You can see this in arrow/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs, lines 195 to 200 in 90affbd:
    for (var i = 0; i < fieldCount; i++)
    {
        var fieldArray = recordBatch.Column(i);
        fieldNodeOffsets[i] =
            Flatbuf.FieldNode.CreateFieldNode(Builder, fieldArray.Length, fieldArray.NullCount);
    }
This loop writes the field nodes in order from 0 to fieldCount. Further down, however, the fields are written in the opposite order, from fieldCount down to 0.
The Java implementation has this comment:
// struct vectors have to be created in reverse order
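To illustrate why the write order matters, here is a minimal sketch. It assumes, as the Java comment suggests, that the underlying buffer is filled from back to front, so the last item written is the first one read. The `FieldNodeOrderSketch` class and its `WriteByPrepending` helper are hypothetical names for a toy model; they are not part of FlatBuffers or Apache.Arrow.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: this is NOT the FlatBuffers or Apache.Arrow API.
public static class FieldNodeOrderSketch
{
    // Models a buffer that grows toward the front: each write lands before
    // everything already written, so a reader sees the last write first.
    public static List<long> WriteByPrepending(IEnumerable<long> nullCounts)
    {
        var buffer = new List<long>();
        foreach (var nullCount in nullCounts)
        {
            buffer.Insert(0, nullCount);
        }
        return buffer;
    }

    public static void Main()
    {
        // Null counts for the "age" and "CharCount" columns from this issue.
        var nullCounts = new long[] { 1, 0 };

        // Writing in column order 0..N-1 makes the reader see them swapped.
        var forward = WriteByPrepending(nullCounts);

        // Writing in order N-1..0 makes the reader see the original order.
        var reversed = WriteByPrepending(Enumerable.Reverse(nullCounts));

        Console.WriteLine("forward writes:  " + string.Join(", ", forward));  // 0, 1
        Console.WriteLine("reversed writes: " + string.Join(", ", reversed)); // 1, 0
    }
}
```

In this model, only the reversed write order preserves the pairing between each column and its null count, which matches the symptom described in this issue.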
A simple test of roundtripping the following RecordBatch shows the issue:
    var result = new RecordBatch(
        new Schema.Builder()
            .Field(f => f.Name("age").DataType(Int32Type.Default))
            .Field(f => f.Name("CharCount").DataType(Int32Type.Default))
            .Build(),
        new IArrowArray[]
        {
            new Int32Array(
                new ArrowBuffer.Builder<int>().Append(0).Build(),
                new ArrowBuffer.Builder<byte>().Append(0).Build(),
                length: 1,
                nullCount: 1,
                offset: 0),
            new Int32Array(
                new ArrowBuffer.Builder<int>().Append(7).Build(),
                ArrowBuffer.Empty,
                length: 1,
                nullCount: 0,
                offset: 0)
        },
        length: 1);
Here, the "age" column should have a null in it. However, when you write this RecordBatch and read it back, the "CharCount" column has NullCount == 1 and the "age" column has NullCount == 0.
Reporter: Eric Erhardt / @eerhardt
Assignee: Eric Erhardt / @eerhardt
Note: This issue was originally created as ARROW-5887. Please see the migration documentation for further details.