Missing type gets lost when writing partitions of DataFrame #403

@svilupp


This is an odd one and likely to be a PICNIC...

Problem: Missingness in a string column is lost after saving and loading an Arrow file

When it happens: A column in my dataset has type Union{Missing,String}, I write the dataset in partitions, and the missing value appears only in a later partition (not the first one). It's easily reproducible (see below).

Debugging:

  • It happens only with DataFrames (not with a Tables.rowtable created from a NamedTuple)
  • Only when partitioned as Iterators.partition(Tables.rows(df), 2). If partitioned as Iterators.partition(df,2), available from version >1.5.0, it is fine
  • If a missing value appears in the first partition, it's fine
  • The validity bitmap is written correctly
  • But the field is marked as not nullable (!)

┌ Debug: building field: name = x1, nullable = false, T = String, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
--- in correct cases, this appears
┌ Debug: building field: name = x1, nullable = true, T = Union{Missing, String}, type = Arrow.Flatbuf.Utf8
└ @ Arrow ~/Documents/GitHub/arrow-julia/src/write.jl:486
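
A quick way to check whether a written file still carries the missingness is to read it back and inspect the column's element type. The sketch below is only an illustration: `allows_missing` is a hypothetical helper name, not part of Arrow.jl, and it assumes the column is fetched with `Tables.getcolumn`.

```julia
using Arrow, Tables

# Hypothetical helper (not part of Arrow.jl): returns true if the column `col`,
# as read back from the Arrow file at `path`, still allows `missing` values.
function allows_missing(path::AbstractString, col::Symbol)
    t = Arrow.Table(path)
    # Missing <: Union{Missing, String} is true; Missing <: String is false.
    return Missing <: eltype(Tables.getcolumn(t, col))
end
```

Calling `allows_missing(fn, :x1)` after each `Arrow.write` in the MWE should separate the working cases from the broken one, going by the column types reported there.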

MWE

using Arrow, Tables, Random, DataFramesMeta
using Logging
debuglogger = ConsoleLogger(stderr, Logging.Debug)
# To see the Debug output shown above, wrap a write call in: with_logger(debuglogger) do ... end

# Create dataset
fn = "test_types.arrow"
df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame

# Works okay
Arrow.write(fn, df; compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}

# Works okay
Arrow.write(fn, Iterators.partition(df,2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:

# broken -- missingness is lost
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{String, Arrow.List{String, Int32, Vector{UInt8}}}

# Works okay with Tables
t = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4))
Arrow.write(fn, Iterators.partition(Tables.rows(t), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]
# SentinelArrays.ChainedVector{Union{Missing, String}, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}

Versioninfo:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores

Arrow: 2.4.3 on main branch
