@kazuyukitanimura
Contributor

Which issue does this PR close?

Closes #6871

Rationale for this change

We would like to use streaming_merge on its own.

What changes are included in this PR?

Changing physical_plan::merge::streaming_merge to public

Are these changes tested?

Yes, existing tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jul 7, 2023
@alamb
Contributor

alamb commented Jul 7, 2023

Build failure is due to #6875

Contributor

@alamb alamb left a comment

Thanks @kazuyukitanimura -- this is fine with me.

Without some care, I worry that this change will be lost in a future refactoring (for example, if the code is moved to another module that is not pub).

To avoid this happening, I recommend you write a test, perhaps as an example of how to use this API, in: https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples/examples

You might be able to adapt some of the existing code in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/fuzz_cases/merge_fuzz.rs or https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/benches/sort.rs
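
A minimal sketch of the visibility concern raised above, using a hypothetical module layout rather than DataFusion's actual tree: a `pub fn` is only reachable from outside if every module on its path is public or the item is re-exported. The two-slice `streaming_merge` below is a stand-in invented for illustration; it does not have the real function's signature.

```rust
// Hypothetical layout for illustration only; not DataFusion's module tree.
pub mod physical_plan {
    // Private module: `pub` items inside it are not reachable from outside
    // `physical_plan` unless they are re-exported.
    mod merge {
        /// Stand-in for the real function: merges two sorted slices.
        pub fn streaming_merge(a: &[i32], b: &[i32]) -> Vec<i32> {
            let mut out = Vec::with_capacity(a.len() + b.len());
            let (mut i, mut j) = (0, 0);
            while i < a.len() && j < b.len() {
                if a[i] <= b[j] {
                    out.push(a[i]);
                    i += 1;
                } else {
                    out.push(b[j]);
                    j += 1;
                }
            }
            // One side is exhausted; append whatever remains of the other.
            out.extend_from_slice(&a[i..]);
            out.extend_from_slice(&b[j..]);
            out
        }
    }

    // Dropping this re-export during a refactor would silently remove the
    // function from the public API even though it is still declared `pub fn`.
    pub use merge::streaming_merge;
}
```

A test or example that calls `physical_plan::streaming_merge(&[1, 4], &[2, 3])` through the public path would stop compiling if the re-export were removed, which is the kind of guard the suggested example would provide.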

@kazuyukitanimura
Contributor Author

Thank you @alamb for the test examples. I will definitely write more tests on this.
It would be great if you could merge this first. I plan to send follow up PRs separately.

@alamb
Contributor

alamb commented Jul 7, 2023

> It would be great if you could merge this first. I plan to send follow up PRs separately.

Will do

@alamb
Contributor

alamb commented Jul 7, 2023

(BTW this PR needs a merge up from main to get the CI to pass -- I can do it later today if no one else gets around to it first)

@viirya
Member

viirya commented Jul 7, 2023

@kazuyukitanimura Can you sync up with the latest main branch?

Comment on lines 54 to 55
  /// Perform a streaming merge of [`SendableRecordBatchStream`]
- pub(crate) fn streaming_merge(
+ pub fn streaming_merge(
Member

For a public API, I suggest we provide clearer documentation for it, e.g., that this will sort the stream based on the provided expressions.

Contributor Author

Thanks, updated

- /// Perform a streaming merge of [`SendableRecordBatchStream`]
+ /// Perform a streaming merge of [`SendableRecordBatchStream`] based on provided sort expressions
+ /// while preserving order. This is a convenience wrapper for [`SortPreservingMergeStream`] and
+ /// chooses a right cursor for the expressions and the data type
Member

Not sure about the cursor. You mean RowCursorStream?

Member

I think we can skip cursor in the doc and only mention this merges SendableRecordBatchStreams by sorting and preserving the order.

Contributor Author

This function switches between RowCursorStream and FieldCursorStream depending on the data type.
SortPreservingMergeStream expects a cursor, and that's what this wrapper provides as a convenience.
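
The wrapper idea described in this comment can be sketched with the standard library alone. This is not DataFusion's implementation: `merge_sorted`, `streaming_merge_sketch`, and the `Row` type are hypothetical names, plain `Vec`s stand in for record batch streams, and the key closure stands in for a cursor. The split between a single-column path and a combined-key path is an assumption of the sketch; the comment itself only says the choice depends on the data type.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of inputs that are each already sorted by `key`.
/// The `key` closure plays the role of a cursor: it decides how rows compare.
fn merge_sorted<T, K, F>(inputs: Vec<Vec<T>>, key: F) -> Vec<T>
where
    K: Ord,
    F: Fn(&T) -> K,
{
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    // Min-heap of (key, input index); the index breaks ties in favor of earlier inputs.
    let mut heap = BinaryHeap::new();
    // Current head row of each input, if any.
    let mut heads: Vec<Option<T>> = Vec::with_capacity(iters.len());
    for (idx, it) in iters.iter_mut().enumerate() {
        let head = it.next();
        if let Some(row) = &head {
            heap.push(Reverse((key(row), idx)));
        }
        heads.push(head);
    }

    let mut out = Vec::new();
    while let Some(Reverse((_, idx))) = heap.pop() {
        // Emit the smallest head, then refill from the same input.
        if let Some(row) = heads[idx].take() {
            out.push(row);
        }
        heads[idx] = iters[idx].next();
        if let Some(row) = &heads[idx] {
            heap.push(Reverse((key(row), idx)));
        }
    }
    out
}

/// Hypothetical row shape for this sketch: two sortable columns.
type Row = (i64, String);

/// Convenience wrapper: picks the comparison strategy so the caller does not have to.
fn streaming_merge_sketch(inputs: Vec<Vec<Row>>, sort_columns: usize) -> Vec<Row> {
    match sort_columns {
        // One primitive sort column: compare the value directly.
        1 => merge_sorted(inputs, |r| r.0),
        // Several sort columns: compare a combined key, standing in for a row encoding.
        _ => merge_sorted(inputs, |r| (r.0, r.1.clone())),
    }
}
```

The caller passes only the sorted inputs and the number of sort columns; which key is used stays an internal detail, which is the convenience the wrapper is meant to provide.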

Contributor Author

Removed the cursor explanation.

Member

I feel it goes into too much implementation detail. No strong opinion here. It's fine.

@alamb alamb merged commit 85be509 into apache:main Jul 8, 2023
@kazuyukitanimura
Contributor Author

Thank you all!

@alamb
Contributor

alamb commented Jul 9, 2023

> Thank you all!

Thanks again for the contribution @kazuyukitanimura

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make streaming_merge public

3 participants