ARROW-9782: [C++][Dataset] More configurable Dataset writing #8305
Closed
34 commits
754e559 ARROW-9782: [C++][Dataset] More configurable Dataset writing (bkietz)
e2c1199 Minimal hacking to get the R tests passing (nealrichardson)
9eea1bd fix Scanner splitting (bkietz)
c263f6c remove debug print()s (bkietz)
e70dd9d don't double unlock std::mutex (bkietz)
fa97c52 extract a helper for single-lookup insertion into maps (bkietz)
0815d6e add a python binding for custom parquet write properties (bkietz)
2830bd0 remove unused schema parameter (bkietz)
9b8b8fc repair parquet write options in R (bkietz)
bf7d392 add a test for writing with a selection (bkietz)
733af53 lint (bkietz)
0567537 make doc (bkietz)
f0da04a document basename_template parameter (bkietz)
18ea1c3 enable on-write filtering of written datasets (bkietz)
bca8764 add LockFreeStack (bkietz)
1c6db50 extract and unit test string interpolation (bkietz)
6eb546e refactor ::Write() to use explicit WriteQueues (bkietz)
ed5ec52 cache queue mapping local to each thread (bkietz)
1a541d6 lint, simplify WriteQueue storage, try workaround for atomic::atomic() (bkietz)
95e548e comparator must be const (bkietz)
7fd7185 simplify thread local caching (bkietz)
dde5eed simplify: revert local queue lookup caching (bkietz)
dfc2291 revert lock_free (bkietz)
a3454d9 more exact typing in GetOrInsertGenerated (bkietz)
448e04e move lazy initialization locking into Flush() (bkietz)
87d863a fix comment (bkietz)
7db8bf3 address review comments (bkietz)
33257a6 add default basename_template for python (bkietz)
d46c1af R code/doc polishing (nealrichardson)
7f1255e Update vignette now that you can filter when writing (nealrichardson)
16d9d53 lint fix (bkietz)
086f59d writing without partitioning will create a single file (bkietz)
998d760 address review comments (bkietz)
5602aa8 correct R doc after dat_{i} -> part-{i} (bkietz)
Conversations
Since we already know all fragments (and their expressions), can we avoid all the locking and multi-threading in WriterSet (IIRC, you only need them to create each writer once)? That would greatly simplify all of this.
In this context, fragments are the object of writing rather than the target (for example, a fragment might represent an in-memory table that is being copied to disk). Writers are not known ahead of time: they depend on the partitioning, which depends on the set of unique values in a given column, and we discover those values only after running GroupBy on an input batch.
We could do two scans of the input data:
This doesn't seem worthwhile to me; scanning the input is potentially expensive, so we should avoid doing it twice. Furthermore, we would still need to coordinate between threads, since two input batches might contain rows bound for the same output writer.