Description
This is an odd one and likely to be a PICNIC...
Problem: Missingness in a string column is lost after saving and loading an Arrow file.
When it happens: When a column in my dataset has type Union{Missing,String}, I partition the dataset, and the missing value appears only in a later partition (not in the first one). It's easily reproducible (see below).
Debugging:
- It happens only with DataFrames (not with a Tables.rowtable created from a NamedTuple)
- Only when partitioned as Iterators.partition(Tables.rows(df), 2). If partitioned as Iterators.partition(df, 2) (available from version >1.5.0), it is fine
- If the missing value appears in the first partition, it's fine
- Validity bitmap is written correctly
- But the field is marked as not nullable (!), as the debug output shows:
┌ Debug: building field: name = x1, nullable = false, T = String, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
--- in the correct cases, this appears instead:
┌ Debug: building field: name = x1, nullable = true, T = Union{Missing, String}, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
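One way to narrow this down might be to look at the schema each partition yields on its own. The sketch below is a guess at the mechanism, not a confirmed diagnosis: `partition_schemas` is a hypothetical helper, and it assumes that the writer derives each partition's schema roughly the way `Tables.schema(Tables.columns(part))` does.

```julia
using DataFrames, Tables

# Hypothetical helper: the schema each partition reports when it is
# materialized into columns on its own.
partition_schemas(parts) = [Tables.schema(Tables.columns(p)) for p in parts]

df = DataFrame(x1 = ["a", "b", missing, "c"], x2 = 1:4)

# Partitioning the row iterator (the case reported as broken)
for sch in partition_schemas(Iterators.partition(Tables.rows(df), 2))
    println(sch)
end

# Partitioning the DataFrame directly (the case reported as working)
for sch in partition_schemas(Iterators.partition(df, 2))
    println(sch)
end
```

If the first partition of the row-iterator case reports x1 as String while the later one reports Union{Missing, String}, that would be consistent with the field being built as non-nullable from the first partition only.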
MWE
using Arrow, Tables, Random, DataFramesMeta
using Logging
debuglogger = ConsoleLogger(stderr, Logging.Debug)
# Create dataset
fn = "test_types.arrow"
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
# Works okay
Arrow.write(fn, df; compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}
# Works okay
Arrow.write(fn, Iterators.partition(df,2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:
# Broken: missingness is lost
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{String, Arrow.List{String, Int32, Vector{UInt8}}}
# Works okay with a Tables.rowtable (no DataFrame involved)
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
Arrow.write(fn, Iterators.partition(Tables.rows(t), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}
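The checks above can be condensed into a small round-trip test, which may be handy as a regression check. This is only a sketch: `keeps_missing` and `roundtrip_keeps_missing` are hypothetical helper names, and the code uses only the Arrow.write / Arrow.Table calls already shown in the MWE.

```julia
using Arrow, DataFrames, Tables

# Hypothetical helper: true if the column's element type still admits `missing`.
keeps_missing(col) = Missing <: eltype(col)

# Hypothetical helper: write `parts` to `path`, read the file back,
# and check whether column `name` kept its missingness.
function roundtrip_keeps_missing(parts, path, name::Symbol)
    Arrow.write(path, parts; compress = nothing)
    t = Arrow.Table(path)
    return keeps_missing(t[name])
end

fn = tempname() * ".arrow"
df = DataFrame(x1 = ["a", "b", missing, "c"], x2 = 1:4)

# Case reported as working
@show roundtrip_keeps_missing(Iterators.partition(df, 2), fn, :x1)

# Case reported as broken
@show roundtrip_keeps_missing(Iterators.partition(Tables.rows(df), 2), fn, :x1)
```

Checking the element type rather than the printed array type keeps the test independent of whether the result is an Arrow.List or a SentinelArrays.ChainedVector.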
Versioninfo:
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores
Arrow: 2.4.3 on main branch