Skip to content

Conversation

@baumgold
Copy link
Member

An alternate solution to #105 than #160 by refactoring the write function. This is an improvement over #160 because it supports incremental writes to both the Arrow IPC stream format (currently supported via append) and the Arrow file format (currently unsupported).

@baumgold baumgold mentioned this pull request Jan 12, 2022
@baumgold baumgold force-pushed the incremental_write branch 3 times, most recently from 3c78925 to d982ea1 Compare January 14, 2022 18:23
@ericphanson
Copy link
Member

FYI we unfortunately don’t have CI until #273 is resolved or worked-around, so that will likely slow things down

@baumgold baumgold force-pushed the incremental_write branch 2 times, most recently from 61d30cc to 5346f8e Compare February 13, 2022 02:19
@baumgold
Copy link
Member Author

FYI we unfortunately don’t have CI until #273 is resolved or worked-around, so that will likely slow things down

@ericphanson - now that #273 appears to be resolved, could you (or someone) please approve the workflow so CI can be run on this PR? Thanks!

@kou
Copy link
Member

kou commented Feb 13, 2022

Approved.

@baumgold baumgold force-pushed the incremental_write branch 2 times, most recently from 92370d1 to 9fe4510 Compare February 14, 2022 22:18
@baumgold
Copy link
Member Author

Could someone please approve running CI again?

As a side-note, it appears that the main branch currently produces failing tests with Julia v1.3 and v1.4 (note that all tests pass in v1.5, v1.6, and v1.7). The relevant stacktrace is below. Does anyone have any idea when this regression was introduced?

misc: Error During Test at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:98
  Got exception outside of a @test
  TaskFailedException:
  MethodError: no method matching resize!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Int64)
  Closest candidates are:
    resize!(!Matched::Array{T,1} where T, ::Integer) at array.jl:1017
    resize!(!Matched::BitArray{1}, ::Integer) at bitarray.jl:773
    resize!(!Matched::PooledArray{T,R,1,RA} where RA, ::Integer) where {T, R} at ~/.julia/packages/PooledArrays/DuIZ1/src/PooledArrays.jl:284
    ...
  Stacktrace:
   [1] _append!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Base.HasLength, ::Tuple{Int64}) at ./array.jl:921
   [2] append!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Tuple{Int64}) at ./array.jl:915
   [3] push!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Int64) at ./array.jl:916
   [4] push!(::SentinelArrays.ChainedVector{Int64,Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}}, ::Int64) at ~/.julia/packages/SentinelArrays/pYV2X/src/chainedvector.jl:458
   [5] append!(::SentinelArrays.ChainedVector{Int64,Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}}, ::Arrow.DictEncoded{Int64,Int8,SentinelArrays.ChainedVector{Int64,Arrow.Primitive{Int64,Array{Int64,1}}}}) at ~/.julia/packages/SentinelArrays/pYV2X/src/chainedvector.jl:615
   [6] (::Arrow.var"#87#92"{Array{Any,1},Arrow.Table})(::Int64) at ~/.julia/packages/Arrow/rVa71/src/table.jl:288
   [7] foreach(::Arrow.var"#87#92"{Array{Any,1},Arrow.Table}, ::UnitRange{Int64}) at ./abstractarray.jl:1920
   [8] macro expansion at ~/.julia/packages/Arrow/rVa71/src/table.jl:287 [inlined]
   [9] (::Arrow.var"#84#89"{Arrow.Table,Channel{Task}})() at ./threadingconstructs.jl:113
  Stacktrace:
   [1] wait at ./task.jl:251 [inlined]
   [2] #Table#83(::Bool, ::Type{Arrow.Table}, ::Array{Arrow.ArrowBlob,1}) at ~/.julia/packages/Arrow/rVa71/src/table.jl:358
   [3] Table at ~/.julia/packages/Arrow/rVa71/src/table.jl:271 [inlined]
   [4] #Table#78(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Type{Arrow.Table}, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int64, ::Nothing) at ~/.julia/packages/Arrow/rVa71/src/table.jl:265
   [5] Table at ~/.julia/packages/Arrow/rVa71/src/table.jl:265 [inlined] (repeats 2 times)
   [6] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:294
   [7] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1107
   [8] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:101
   [9] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1107
   [10] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:39
   [11] include at ./boot.jl:328 [inlined]
   [12] include_relative(::Module, ::String) at ./loading.jl:1105
   [13] include(::Module, ::String) at ./Base.jl:31
   [14] include(::String) at ./client.jl:424
   [15] top-level scope at none:6
   [16] eval(::Module, ::Any) at ./boot.jl:330
   [17] exec_options(::Base.JLOptions) at ./client.jl:263
   [18] _start() at ./client.jl:460

@kou
Copy link
Member

kou commented Feb 15, 2022

Approved.

@codecov-commenter
Copy link

codecov-commenter commented Feb 15, 2022

Codecov Report

Merging #277 (fca21c1) into main (3afb6b2) will increase coverage by 0.13%.
The diff coverage is 93.63%.

❗ Current head fca21c1 differs from pull request most recent head e09a78b. Consider uploading reports for the commit e09a78b to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #277      +/-   ##
==========================================
+ Coverage   87.00%   87.13%   +0.13%     
==========================================
  Files          26       26              
  Lines        3263     3297      +34     
==========================================
+ Hits         2839     2873      +34     
  Misses        424      424              
Impacted Files Coverage Δ
src/write.jl 96.63% <93.51%> (+0.36%) ⬆️
src/arraytypes/arraytypes.jl 89.52% <100.00%> (+0.20%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3afb6b2...e09a78b. Read the comment docs.

@complyue
Copy link

complyue commented Mar 1, 2022

Bump!?

I've been going over pyarrow for append-ability upon arrow file format files, but seemingly it's not there.

This is crucial to accumulate large (out-of-core) data files with day-to-day batches, without generating too many individual files overtime.

Or Arrow devs really think you should have a writer process running forever without crashing? Maybe they'd rather you accumulate datasets with separate files per each day or so, then (despite wasted space by each file's schema payload) it's nightmare for even modern OSes to mmap the full dataset after some time, if tens of thousands of files are produced everyday.

This PR seems a perfect cure! Hope it get merge sooner.

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functionality looks great! Thanks @baumgold!

I left a few comments on tweaks to make. It looks like the 2 main things needed here are some Arrow.Writer-specific tests (I'd recommend adding a new @testset in the runtests.jl file, or adding a new file of tests all together), and adding some documentation (You can add "in code" docs attached to the Writer struct directly which will get included in the auto-docs generation, and it would also be helpful to add a section to the manual, somewhere around here).

Let me know if you're unfamiliar with writing tests or docs and I'm happy to give more direction/help. Thanks again for the contribution!

@baumgold baumgold force-pushed the incremental_write branch from 9fe4510 to fca21c1 Compare March 6, 2022 21:10
@baumgold
Copy link
Member Author

baumgold commented Mar 6, 2022

I left a few comments on tweaks to make. It looks like the 2 main things needed here are some Arrow.Writer-specific tests (I'd recommend adding a new @testset in the runtests.jl file, or adding a new file of tests all together), and adding some documentation (You can add "in code" docs attached to the Writer struct directly which will get included in the auto-docs generation, and it would also be helpful to add a section to the manual, somewhere around here).

Thanks for your review. I've addressed your comments in my most recent commit. I'll add some tests and documentation shortly.

@baumgold baumgold force-pushed the incremental_write branch from e09a78b to e076afb Compare March 24, 2022 19:31
@baumgold
Copy link
Member Author

baumgold commented Mar 24, 2022

@quinnj - Based on your suggestions, I've now added Arrow.Writer-specific tests and documentation both in-code and in the manual. Can you please review again? Thanks.

@baumgold baumgold requested a review from quinnj April 1, 2022 23:01
@baumgold
Copy link
Member Author

baumgold commented Apr 4, 2022

Any chance this PR can be merged and released sooner than later? Thanks.

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me; thanks for all the work here @baumgold and for persisting with us even though the review has been slow.

@quinnj quinnj merged commit a3f6da7 into apache:main Apr 9, 2022
@baumgold baumgold deleted the incremental_write branch April 9, 2022 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants