-
Notifications
You must be signed in to change notification settings - Fork 70
refactor Arrow.write to support incremental writes #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3c78925 to
d982ea1
Compare
|
FYI we unfortunately don’t have CI until #273 is resolved or worked-around, so that will likely slow things down |
d982ea1 to
c08afcf
Compare
61d30cc to
5346f8e
Compare
@ericphanson - now that #273 appears to be resolved, could you (or someone) please approve the workflow so CI can be run on this PR? Thanks! |
|
Approved. |
92370d1 to
9fe4510
Compare
|
Could someone please approve running CI again? As a side-note, it appears that the main branch currently produces failing tests with Julia v1.3 and v1.4 (note that all tests pass in v1.5, v1.6, and v1.7). The relevant stacktrace is below. Does anyone have any idea when this regression was introduced? misc: Error During Test at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:98
Got exception outside of a @test
TaskFailedException:
MethodError: no method matching resize!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Int64)
Closest candidates are:
resize!(!Matched::Array{T,1} where T, ::Integer) at array.jl:1017
resize!(!Matched::BitArray{1}, ::Integer) at bitarray.jl:773
resize!(!Matched::PooledArray{T,R,1,RA} where RA, ::Integer) where {T, R} at ~/.julia/packages/PooledArrays/DuIZ1/src/PooledArrays.jl:284
...
Stacktrace:
[1] _append!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Base.HasLength, ::Tuple{Int64}) at ./array.jl:921
[2] append!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Tuple{Int64}) at ./array.jl:915
[3] push!(::Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}, ::Int64) at ./array.jl:916
[4] push!(::SentinelArrays.ChainedVector{Int64,Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}}, ::Int64) at ~/.julia/packages/SentinelArrays/pYV2X/src/chainedvector.jl:458
[5] append!(::SentinelArrays.ChainedVector{Int64,Arrow.DictEncoded{Int64,Int8,Arrow.Primitive{Int64,Array{Int64,1}}}}, ::Arrow.DictEncoded{Int64,Int8,SentinelArrays.ChainedVector{Int64,Arrow.Primitive{Int64,Array{Int64,1}}}}) at ~/.julia/packages/SentinelArrays/pYV2X/src/chainedvector.jl:615
[6] (::Arrow.var"#87#92"{Array{Any,1},Arrow.Table})(::Int64) at ~/.julia/packages/Arrow/rVa71/src/table.jl:288
[7] foreach(::Arrow.var"#87#92"{Array{Any,1},Arrow.Table}, ::UnitRange{Int64}) at ./abstractarray.jl:1920
[8] macro expansion at ~/.julia/packages/Arrow/rVa71/src/table.jl:287 [inlined]
[9] (::Arrow.var"#84#89"{Arrow.Table,Channel{Task}})() at ./threadingconstructs.jl:113
Stacktrace:
[1] wait at ./task.jl:251 [inlined]
[2] #Table#83(::Bool, ::Type{Arrow.Table}, ::Array{Arrow.ArrowBlob,1}) at ~/.julia/packages/Arrow/rVa71/src/table.jl:358
[3] Table at ~/.julia/packages/Arrow/rVa71/src/table.jl:271 [inlined]
[4] #Table#78(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Type{Arrow.Table}, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int64, ::Nothing) at ~/.julia/packages/Arrow/rVa71/src/table.jl:265
[5] Table at ~/.julia/packages/Arrow/rVa71/src/table.jl:265 [inlined] (repeats 2 times)
[6] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:294
[7] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1107
[8] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:101
[9] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Test/src/Test.jl:1107
[10] top-level scope at ~/.julia/packages/Arrow/rVa71/test/runtests.jl:39
[11] include at ./boot.jl:328 [inlined]
[12] include_relative(::Module, ::String) at ./loading.jl:1105
[13] include(::Module, ::String) at ./Base.jl:31
[14] include(::String) at ./client.jl:424
[15] top-level scope at none:6
[16] eval(::Module, ::Any) at ./boot.jl:330
[17] exec_options(::Base.JLOptions) at ./client.jl:263
[18] _start() at ./client.jl:460 |
|
Approved. |
Codecov Report
@@ Coverage Diff @@
## main #277 +/- ##
==========================================
+ Coverage 87.00% 87.13% +0.13%
==========================================
Files 26 26
Lines 3263 3297 +34
==========================================
+ Hits 2839 2873 +34
Misses 424 424
Continue to review full report at Codecov.
|
|
Bump!? I've been going over This is crucial to accumulate large (out-of-core) data files with day-to-day batches, without generating too many individual files overtime. Or Arrow devs really think you should have a writer process running forever without crashing? Maybe they'd rather you accumulate datasets with separate files per each day or so, then (despite wasted space by each file's schema payload) it's nightmare for even modern OSes to This PR seems a perfect cure! Hope it get merge sooner. |
quinnj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality looks great! Thanks @baumgold!
I left a few comments on tweaks to make. It looks like the 2 main things needed here are some Arrow.Writer-specific tests (I'd recommend adding a new @testset in the runtests.jl file, or adding a new file of tests all together), and adding some documentation (You can add "in code" docs attached to the Writer struct directly which will get included in the auto-docs generation, and it would also be helpful to add a section to the manual, somewhere around here).
Let me know if you're unfamiliar with writing tests or docs and I'm happy to give more direction/help. Thanks again for the contribution!
9fe4510 to
fca21c1
Compare
Thanks for your review. I've addressed your comments in my most recent commit. I'll add some tests and documentation shortly. |
e09a78b to
e076afb
Compare
|
@quinnj - Based on your suggestions, I've now added |
|
Any chance this PR can be merged and released sooner than later? Thanks. |
quinnj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me; thanks for all the work here @baumgold and for persisting with us even though the review has been slow.
An alternate solution to #105 than #160 by refactoring the write function. This is an improvement over #160 because it supports incremental writes to both the Arrow IPC stream format (currently supported via append) and the Arrow file format (currently unsupported).