refactor Arrow.write to support incremental writes #277

baumgold · 2022-01-12T22:22:04Z

An alternative to #160 for solving #105, implemented by refactoring the write function. It improves on #160 because it supports incremental writes to both the Arrow IPC stream format (currently supported via append) and the Arrow file format (currently unsupported).

ericphanson · 2022-01-15T03:21:38Z

FYI we unfortunately don’t have CI until #273 is resolved or worked around, so that will likely slow things down.

baumgold · 2022-02-13T02:38:57Z

> FYI we unfortunately don’t have CI until #273 is resolved or worked around, so that will likely slow things down.

@ericphanson - now that #273 appears to be resolved, could you (or someone) please approve the workflow so CI can be run on this PR? Thanks!

kou · 2022-02-13T20:43:17Z

Approved.

baumgold · 2022-02-14T23:59:24Z

Could someone please approve running CI again?

As a side note, it appears that the main branch currently fails tests on Julia v1.3 and v1.4 (all tests pass on v1.5, v1.6, and v1.7). The relevant stacktrace is below. Does anyone know when this regression was introduced?

misc: Error During Test at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:98
  Got exception outside of a @test
  TaskFailedException:
  MethodError: no method matching resize!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Int64)
  Closest candidates are:
    resize!(!Matched::Array{T,1} where T, ::Integer) at array.jl:1017
    resize!(!Matched::BitArray{1}, ::Integer) at bitarray.jl:773
    resize!(!Matched::PooledArray{T,R,1,RA} where RA, ::Integer) where {T, R} at ~/.julia/packages/PooledArrays/DuIZ1/src/PooledArrays.jl:284
    ...
  Stacktrace:
   [1] _append!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Base.HasLength, ::Tuple{Int64}) at ./array.jl:921
   [2] append!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Tuple{Int64}) at ./array.jl:915
   [3] push!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Int64) at ./array.jl:916
   [4] push!(::SentinelArrays.ChainedVector{Int64,Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}}, ::Int64) at ~/.julia/packages/SentinelArrays/pYV2X/src/chainedvector.jl:458
   [5] append!(::SentinelArrays.ChainedVector{Int64,Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}}, ::Arrow.DictEncoded{Int64,Int8,SentinelArrays.ChainedVector{Int64,Arrow.Primitive{Int64,Array{Int64,1}}}}) at ~/.julia/packages/SentinelArrays/pYV2X/src/chainedvector.jl:615
   [6] (::Arrow.var"#87#92"{Array{Any,1},Arrow.Table})(::Int64) at ~/.julia/packages/Arrow/rVa71/src/table.jl:288
   [7] foreach(::Arrow.var"#87#92"{Array{Any,1},Arrow.Table}, ::UnitRange{Int64}) at ./abstractarray.jl:1920
   [8] macro expansion at ~/.julia/packages/Arrow/rVa71/src/table.jl:287 [inlined]
   [9] (::Arrow.var"#84#89"{Arrow.Table,Channel{Task}})() at ./threadingconstructs.jl:113
  Stacktrace:
   [1] wait at ./task.jl:251 [inlined]
   [2] #Table#83(::Bool, ::Type{Arrow.Table}, ::Array{Arrow.ArrowBlob,1}) at ~/.julia/packages/Arrow/rVa71/src/table.jl:358
   [3] Table at ~/.julia/packages/Arrow/rVa71/src/table.jl:271 [inlined]
   [4] #Table#78(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Type{Arrow.Table}, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int64, ::Nothing) at ~/.julia/packages/Arrow/rVa71/src/table.jl:265
   [5] Table at ~/.julia/packages/Arrow/rVa71/src/table.jl:265 [inlined] (repeats 2 times)
   [6] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:294
   [7] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1107
   [8] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:101
   [9] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1107
   [10] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:39
   [11] include at ./boot.jl:328 [inlined]
   [12] include_relative(::Module, ::String) at ./loading.jl:1105
   [13] include(::Module, ::String) at ./Base.jl:31
   [14] include(::String) at ./client.jl:424
   [15] top-level scope at none:6
   [16] eval(::Module, ::Any) at ./boot.jl:330
   [17] exec_options(::Base.JLOptions) at ./client.jl:263
   [18] _start() at ./client.jl:460

kou · 2022-02-15T02:00:14Z

Approved.

codecov-commenter · 2022-02-15T02:20:06Z

Codecov Report

Merging #277 (fca21c1) into main (3afb6b2) will increase coverage by 0.13%.
The diff coverage is 93.63%.

❗ Current head fca21c1 differs from pull request most recent head e09a78b. Consider uploading reports for the commit e09a78b to get more accurate results

@@            Coverage Diff             @@
##             main     #277      +/-   ##
==========================================
+ Coverage   87.00%   87.13%   +0.13%     
==========================================
  Files          26       26              
  Lines        3263     3297      +34     
==========================================
+ Hits         2839     2873      +34     
  Misses        424      424

Impacted Files                  Coverage Δ
src/write.jl                    96.63% <93.51%> (+0.36%) ⬆️
src/arraytypes/arraytypes.jl    89.52% <100.00%> (+0.20%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

complyue · 2022-03-01T10:34:11Z

Bump!?

I've been looking through pyarrow for a way to append to files in the Arrow file format, but it doesn't seem to exist.

This is crucial for accumulating large (out-of-core) data files from day-to-day batches without generating too many individual files over time.

Or do the Arrow devs really think you should have a writer process running forever without crashing? Maybe they'd rather you accumulate datasets as separate files per day or so, but then (besides the space wasted by each file's schema payload) it becomes a nightmare even for modern OSes to mmap the full dataset after some time, if tens of thousands of files are produced every day.

This PR seems like a perfect cure! Hope it gets merged soon.

src/write.jl

quinnj

This functionality looks great! Thanks @baumgold!

I left a few comments on tweaks to make. It looks like the two main things needed here are some Arrow.Writer-specific tests (I'd recommend adding a new @testset in the runtests.jl file, or adding a new file of tests altogether) and some documentation (you can add "in code" docs attached to the Writer struct directly, which will get included in the auto-docs generation, and it would also be helpful to add a section to the manual, somewhere around here).

Let me know if you're unfamiliar with writing tests or docs and I'm happy to give more direction/help. Thanks again for the contribution!

baumgold · 2022-03-06T21:24:52Z

> I left a few comments on tweaks to make. It looks like the two main things needed here are some Arrow.Writer-specific tests (I'd recommend adding a new @testset in the runtests.jl file, or adding a new file of tests altogether) and some documentation (you can add "in code" docs attached to the Writer struct directly, which will get included in the auto-docs generation, and it would also be helpful to add a section to the manual, somewhere around here).

Thanks for your review. I've addressed your comments in my most recent commit. I'll add some tests and documentation shortly.

baumgold · 2022-03-24T21:14:35Z

@quinnj - Based on your suggestions, I've now added Arrow.Writer-specific tests and documentation both in-code and in the manual. Can you please review again? Thanks.
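
For readers following along, here is a minimal sketch of the incremental-write workflow that these tests and docs cover. It assumes the `Arrow.Writer` interface from this PR is used via `open(Arrow.Writer, path)`, `Arrow.write(writer, tbl)`, and `close(writer)`; the temporary path and the column names are illustrative only.

```julia
using Arrow

# Illustrative output location; any writable path would do.
path = tempname()

# Open a writer once, then add one record batch (partition) per call.
writer = open(Arrow.Writer, path)
Arrow.write(writer, (col1 = [1, 2], col2 = ["A", "B"]))
Arrow.write(writer, (col1 = [3, 4], col2 = ["C", "D"]))

# Closing the writer finalizes the output so it can be read back.
close(writer)

# Reading the result yields a single table spanning both partitions.
tbl = Arrow.Table(path)
```

Each `Arrow.write` call contributes one partition, so the data does not all have to be in memory at once.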

baumgold · 2022-04-04T21:10:37Z

Any chance this PR can be merged and released sooner rather than later? Thanks.

quinnj

This looks good to me; thanks for all the work here @baumgold and for persisting with us even though the review has been slow.

baumgold mentioned this pull request on Jan 12, 2022 in "Allow for file appends" #105 (closed)

baumgold force-pushed the incremental_write branch 3 times, most recently from 3c78925 to d982ea1, on January 14, 2022 18:23

baumgold force-pushed the incremental_write branch from d982ea1 to c08afcf on January 16, 2022 03:27

baumgold force-pushed the incremental_write branch 2 times, most recently from 61d30cc to 5346f8e, on February 13, 2022 02:19

baumgold force-pushed the incremental_write branch 2 times, most recently from 92370d1 to 9fe4510, on February 14, 2022 22:18