Support All Statistics and Enable Backpressure in Parallel Parquet Writer #7632
Conversation
alamb left a comment
This is very cool @devinjdangelo -- thank you for pushing it forward. I think it is a very nice improvement, and while there might be additional improvements possible, this is a nice step in the right direction.
```rust
let arc_props = Arc::new(parquet_props.clone());
let arc_props_clone = arc_props.clone();
let schema_clone = output_schema.clone();
let launch_serialization_task: JoinHandle<Result<(), DataFusionError>> =
```
It probably doesn't matter, but if something goes wrong and `output_single_parquet_file_parallelized` returns an error, this code may well still launch tasks and try to buffer / serialize the streams.
I think this could be avoided if we put all the handles into a `JoinSet`, so that when they were dropped all the tasks would be canceled: https://docs.rs/tokio/1.32.0/tokio/task/struct.JoinSet.html
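A minimal sketch of that pattern, assuming tokio's `JoinSet` (the function and the loop body are hypothetical stand-ins for the real serialization tasks):

```rust
use tokio::task::JoinSet;

async fn run_serialization() -> Result<(), tokio::task::JoinError> {
    let mut join_set: JoinSet<usize> = JoinSet::new();
    for i in 0..4 {
        join_set.spawn(async move {
            // stand-in for serializing one stream / row group
            i
        });
    }
    // Await completion; if an error makes this function return early, the
    // JoinSet is dropped and every task still in it is aborted.
    while let Some(result) = join_set.join_next().await {
        let _row_group = result?;
    }
    Ok(())
}
```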
```rust
Ok((writer.close()?, inner_row_count))
}))
.await
.map_err(|_| {
```
I think the only way the send will fail is if the receiver was dropped -- either due to an early plan cancel or some other error.
Thus it might make sense to ignore the error and return (with a comment about the rationale) rather than returning an error.
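A hedged sketch of that suggested pattern (the producer loop and item type are hypothetical):

```rust
use tokio::sync::mpsc::Sender;

async fn produce(tx: Sender<u64>) {
    for i in 0..100 {
        if tx.send(i).await.is_err() {
            // The receiver is gone: the plan was cancelled or a downstream
            // task already failed. Nothing useful to report from here, so
            // stop producing quietly instead of surfacing a new error.
            return;
        }
    }
}
```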
```rust
while let Some(batch) = data_stream.next().await.transpose()? {
    inner_row_count += batch.num_rows();
    writer.write(&batch)?;
let (serialize_tx, mut serialize_rx) =
```
My reading of this suggests it allows up to 100 row groups to be created in parallel, which likely results in more buffering than necessary.
Rather than formulating this as an `mpsc::channel`, it would be really neat to see it formulated as a `Stream<(ArrowColumnChunk, ColumnCloseResult)>`.
Then, in combination with `StreamExt::buffered()`, we could control the parallelism at the row group level.
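A minimal sketch of that formulation, with a hypothetical `serialize_row_group` helper standing in for the real encoding work:

```rust
use futures::stream::{self, StreamExt};

async fn serialize_all(row_groups: Vec<Vec<u8>>) -> Vec<usize> {
    stream::iter(row_groups)
        .map(|rg| async move { serialize_row_group(rg).await })
        .buffered(2) // at most 2 row groups in flight at once
        .collect::<Vec<_>>()
        .await
}

async fn serialize_row_group(rg: Vec<u8>) -> usize {
    // stand-in for CPU-heavy parquet encoding of one row group
    rg.len()
}
```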
I am using mpsc::channel even more extensively in the new implementation (#7655). I have experimented with StreamExt::buffered() in the past, but it did not seem to leverage multiple CPU cores, whereas spawning tokio tasks communicating across a channel did.
I can revisit this though as it could simplify the code, and I may have just messed something up last time I tried it 🤔 .
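One plausible explanation, sketched under assumption (the helper names are hypothetical): `buffered` alone polls its futures on a single task, so CPU-bound encoding stays on one core; wrapping each item in `tokio::task::spawn` moves the work onto the runtime's worker threads, while `buffered(n)` still bounds how many spawned tasks are in flight.

```rust
use futures::stream::{self, StreamExt};

async fn serialize_all(row_groups: Vec<Vec<u8>>) -> Vec<usize> {
    stream::iter(row_groups)
        // spawn moves each row group's encoding onto the runtime's worker
        // threads, so the work actually spreads across CPU cores
        .map(|rg| tokio::task::spawn(async move { rg.len() /* encode here */ }))
        // buffered(2) still bounds how many spawned tasks run at once
        .buffered(2)
        .map(|joined| joined.expect("serialization task panicked"))
        .collect::<Vec<_>>()
        .await
}
```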
@alamb Thank you for the review! I have significantly reworked this PR in a new PR, #7655, relying primarily on column-wise parallelization rather than row-group-wise. The new approach is much more complex in the code, but the performance advantage is huge: 20% faster and 90% lower memory overhead vs. this PR.
This PR is a draft since it cannot compile without arrow-rs changes: apache/arrow-rs#4850
Which issue does this PR close?
Closes #7591
Closes #7589
Related to apache/arrow-rs#1718
Rationale for this change
The parallel parquet serialization process implemented in #7562 did not support all parquet metadata (indexes / bloom filters) and had no backpressure on serialization tasks. This PR aims to address these two deficiencies.
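For illustration, a minimal sketch of channel-based backpressure (all names hypothetical): with a bounded channel, `send` suspends once the channel is full, so the serializer cannot run arbitrarily far ahead of the writer.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity 2: at most two serialized row groups queued at any time.
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(2);

    let producer = tokio::spawn(async move {
        for i in 0..10u8 {
            // `send` suspends here whenever the channel is full, so the
            // serializer cannot outpace the writer: backpressure.
            tx.send(vec![i; 1024]).await.expect("receiver dropped");
        }
    });

    while let Some(chunk) = rx.recv().await {
        // stand-in for appending one serialized row group to the output file
        println!("wrote {} bytes", chunk.len());
    }
    producer.await.unwrap();
}
```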
What changes are included in this PR?
- Use `ArrowRowGroupWriter` directly rather than `ArrowWriter`
- `arrow-rs` changes filed to make `ArrowRowGroupWriter` public and `Send` so it can be used across an `.await`

Benchmarking Results
The results show the parallel parquet process in this PR is ~10% faster than the previous one, in addition to supporting statistics/bloom filters. Channel_limit=N means that the maximum number of parallel parquet serialization tasks running at one time is set to N. Surprisingly, setting this number low can actually increase peak memory usage. Perhaps RecordBatches accumulate in memory as some streams are consumed but not others, more than offsetting the benefit of limiting the amount of serialized parquet data in memory.
See #7562 for benchmarking script used.
Test 1, All 16 Columns, ~3.6GB Parquet File (release build)
[charts: execution time (s), peak memory usage (MB)]
Test 2, Subset of 3 Columns, ~895MB Parquet File (release build)
[charts: execution time (s), peak memory consumption (MB)]
Are these changes tested?
Yes, by existing tests and ad hoc benchmarking.
Are there any user-facing changes?
No
Next Steps
So far, parquet serialization is only parallelized across RowGroups. This means parallelism is limited by the number of RowGroups we want in the file, which in general can be as low as 1. Parquet files typically have a large number of columns, so we could additionally parallelize at the column level to speed things up further; a conceptual sketch follows.
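A hedged conceptual sketch of column-wise parallelization (the `encode_column` helper is hypothetical, standing in for the real arrow-rs column writers): each column of a row group is encoded on its own task, and the resulting chunks are then appended in column order.

```rust
async fn encode_row_group(columns: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    // One task per column; the tasks run concurrently across cores.
    let handles: Vec<_> = columns
        .into_iter()
        .map(|col| tokio::spawn(async move { encode_column(col) }))
        .collect();

    // Awaiting the handles in order preserves the column order that the
    // parquet format requires within a row group.
    let mut chunks = Vec::with_capacity(handles.len());
    for handle in handles {
        chunks.push(handle.await.expect("column task panicked"));
    }
    chunks
}

fn encode_column(col: Vec<u8>) -> Vec<u8> {
    col // stand-in for encoding + compressing one column chunk
}
```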
We could also break free of the Parallelism=RowGroupNumber limit if it were possible to concatenate (`ArrowColumnChunk`, `ColumnCloseResult`) tuples together before writing them into a RowGroup. This might not be possible to do efficiently, since `ArrowColumnChunk`s are already compressed. Aggregating column min/max statistics would be trivial, but distinct values and bloom filters would not be.