Spark 4.0: Implement SupportsReportOrdering DSv2 API #14948
Note: This PR builds on top of #14683, which is still in review. Reviewers can focus on the changes in commit cc08ff2.
This PR implements the Spark DSv2 SupportsReportOrdering API so that Spark can apply its sort elimination optimization when reading partitioned Iceberg tables that have a defined sort order and whose files were written respecting that order.
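For context, a DSv2 scan opts into this optimization by implementing SupportsReportOrdering and returning its output ordering as Spark SortOrder expressions. The sketch below is only a minimal illustration of that contract, not this PR's classes; the class name OrderedScanSketch and its fields are hypothetical.

```java
// Minimal sketch of a DSv2 Scan that reports its output ordering to Spark.
// When outputOrdering() returns a non-empty array that satisfies a query's
// required ordering, Spark can drop a redundant Sort from the plan.
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.SupportsReportOrdering;
import org.apache.spark.sql.types.StructType;

class OrderedScanSketch implements Scan, SupportsReportOrdering {
  private final StructType schema;
  private final SortOrder[] ordering;

  OrderedScanSketch(StructType schema, SortOrder[] ordering) {
    this.schema = schema;
    this.ordering = ordering;
  }

  @Override
  public StructType readSchema() {
    return schema;
  }

  @Override
  public SortOrder[] outputOrdering() {
    // Return an empty array when the ordering cannot be guaranteed,
    // so Spark does not assume the scan output is sorted.
    return ordering;
  }
}
```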
Implementation summary:
Ordering Validation:
SparkPartitioningAwareScan.outputOrdering() validates that all files carry the current table's sort order ID before reporting an ordering to Spark. If validation fails, no ordering is reported.
Merging Sorted Files:
Since sorted files within a partition may have overlapping ranges, this PR introduces MergingSortedRowDataReader, which merges rows from multiple sorted files using a k-way merge backed by a min-heap.
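As a rough illustration of the k-way merge technique, the sketch below merges already sorted inputs through a min-heap keyed by a row comparator; in this PR the inputs would be the per-file readers of a partition and the comparator would be derived from the table's sort order. The class name KWayMergeIterator is hypothetical, and this is not the MergingSortedRowDataReader implementation.

```java
// K-way merge over k sorted iterators: the min-heap holds at most one pending
// element per input, so the smallest remaining element is always at the top.
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

class KWayMergeIterator<T> implements Iterator<T> {
  // One heap entry per input iterator: its current head element plus the source.
  private static final class Entry<T> {
    T head;
    final Iterator<T> source;

    Entry(T head, Iterator<T> source) {
      this.head = head;
      this.source = source;
    }
  }

  private final PriorityQueue<Entry<T>> heap;

  KWayMergeIterator(List<Iterator<T>> sortedInputs, Comparator<T> comparator) {
    this.heap = new PriorityQueue<>((a, b) -> comparator.compare(a.head, b.head));
    for (Iterator<T> input : sortedInputs) {
      if (input.hasNext()) {
        heap.add(new Entry<>(input.next(), input));
      }
    }
  }

  @Override
  public boolean hasNext() {
    return !heap.isEmpty();
  }

  @Override
  public T next() {
    if (heap.isEmpty()) {
      throw new NoSuchElementException();
    }
    Entry<T> smallest = heap.poll();
    T result = smallest.head;
    // Advance the winning input and push its next element back onto the heap.
    if (smallest.source.hasNext()) {
      smallest.head = smallest.source.next();
      heap.add(smallest);
    }
    return result;
  }
}
```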
Row Comparison:
InternalRowComparator compares Spark InternalRows according to the Iceberg sort order.
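To make the comparator's role concrete, here is a deliberately simplified sketch that orders InternalRows on a single ascending long column with nulls first; the real InternalRowComparator handles multiple sort fields, directions, null orderings, and arbitrary types according to the Iceberg sort order. The class name and field ordinal below are hypothetical. A comparator like this is what the k-way merge above would be keyed on.

```java
// Simplified single-field comparator over Spark InternalRows
// (ascending, nulls first).
import java.util.Comparator;
import org.apache.spark.sql.catalyst.InternalRow;

class SingleLongFieldComparator implements Comparator<InternalRow> {
  private final int ordinal;  // position of the sort column in the row

  SingleLongFieldComparator(int ordinal) {
    this.ordinal = ordinal;
  }

  @Override
  public int compare(InternalRow left, InternalRow right) {
    boolean leftNull = left.isNullAt(ordinal);
    boolean rightNull = right.isNullAt(ordinal);
    if (leftNull && rightNull) {
      return 0;
    } else if (leftNull) {
      return -1;  // nulls sort first
    } else if (rightNull) {
      return 1;
    }
    return Long.compare(left.getLong(ordinal), right.getLong(ordinal));
  }
}
```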
Constraints
Sort elimination examples
Without reporting sort order:
With sort order reporting: