fix: distributed RangePartitioning bounds calculation with native shuffle #2258

Merged
mbutrovich merged 47 commits into apache:main from mbutrovich:fix_range_partitioning
Sep 24, 2025

Conversation

Contributor

@mbutrovich mbutrovich commented Aug 28, 2025

Which issue does this PR close?

Closes #1906.

Rationale for this change

#1862 tried to implement RangePartitioning with native shuffle. Its bounds calculation didn't work in a distributed setting because each executor calculated its own partition boundaries, so executors could disagree on where the boundaries fell and rows were not partitioned consistently.

What changes are included in this PR?

This modifies the flow so that the driver calculates the boundaries (as Spark does). At a high level:

  • Hoist the code from Spark's ShuffleExchangeExec that uses Spark's RangePartitioner to calculate boundary rows on the driver.
  • Serialize the boundary rows and send them to the native side.
  • Deserialize the boundary rows and pass them as part of the partitioning scheme, so every executor now shares the same boundary values (see the sketch after this list).
  • Remove range_partitioner.rs, which performed reservoir sampling and bounds calculation in native code.
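
To make the deserialize-and-partition step concrete, below is a minimal sketch (not the code merged in this PR) of how driver-computed boundary rows can be used on the native side to pick a target partition for each row. It relies on arrow-rs's RowConverter/OwnedRow row encoding, which the partitioning scheme now carries; the assign_partitions helper, the example data, and the Spark-style tie-breaking (a value equal to a boundary stays in the lower partition) are illustrative assumptions.

// Hypothetical sketch: assign rows to range partitions using boundary rows
// that were computed on the driver and shipped to every executor.
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{OwnedRow, RowConverter, SortField};

fn assign_partitions(
    converter: &RowConverter,
    sort_columns: &[ArrayRef],
    bounds: &[OwnedRow], // driver-computed boundary rows, sorted ascending
) -> Result<Vec<usize>, ArrowError> {
    // Encode the batch's sort-key columns with the same converter that encoded
    // the boundaries, so the row encodings are directly comparable.
    let rows = converter.convert_columns(sort_columns)?;
    Ok(rows
        .iter()
        // Count the boundaries strictly less than the row: that count is the
        // partition index (ties go to the lower partition, Spark-style).
        .map(|row| bounds.partition_point(|b| b.row() < row))
        .collect())
}

fn main() -> Result<(), ArrowError> {
    let converter = RowConverter::new(vec![SortField::new(DataType::Int64)])?;

    // Two boundaries (10 and 50) split the key space into three partitions.
    let bound_col: ArrayRef = Arc::new(Int64Array::from(vec![10_i64, 50]));
    let bounds: Vec<OwnedRow> = converter
        .convert_columns(&[bound_col])?
        .iter()
        .map(|r| r.owned())
        .collect();

    let data: ArrayRef = Arc::new(Int64Array::from(vec![5_i64, 15, 30, 40, 70, 90]));
    let parts = assign_partitions(&converter, &[data], &bounds)?;
    assert_eq!(parts, vec![0, 1, 1, 1, 2, 2]);
    Ok(())
}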

How are these changes tested?

@mbutrovich mbutrovich self-assigned this Aug 28, 2025

codecov-commenter commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 87.35632% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.46%. Comparing base (f09f8af) to head (c87aba7).
⚠️ Report is 539 commits behind head on main.

Files with missing lines | Patch % | Lines
...t/execution/shuffle/CometNativeShuffleWriter.scala | 80.95% | 3 Missing and 5 partials ⚠️
...t/execution/shuffle/CometShuffleExchangeExec.scala | 90.00% | 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2258      +/-   ##
============================================
+ Coverage     56.12%   58.46%   +2.34%     
- Complexity      976     1440     +464     
============================================
  Files           119      146      +27     
  Lines         11743    13520    +1777     
  Branches       2251     2351     +100     
============================================
+ Hits           6591     7905    +1314     
- Misses         4012     4381     +369     
- Partials       1140     1234      +94     


@mbutrovich mbutrovich changed the title from "fix: RangePartitioning boundaries with native shuffle" to "fix: RangePartitioning with native shuffle" Aug 28, 2025
@mbutrovich mbutrovich added this to the 0.11.0 milestone Sep 11, 2025
@mbutrovich mbutrovich marked this pull request as ready for review September 16, 2025 21:25
…erate the partitioning scheme. This solves the issue where the input schema says it contains dictionaries that were later going to be unpacked by CopyExec. Will open an issue to understand why we even wrap the child in CopyExec in the first place.
# Conflicts:
#	native/core/src/execution/planner.rs
@mbutrovich
Contributor Author

Depends on #2434

Updated this branch after merging and looks clean still.

@mbutrovich mbutrovich changed the title from "fix: RangePartitioning with native shuffle" to "fix: distributed RangePartitioning bounds calculation with native shuffle" Sep 22, 2025
@mbutrovich
Contributor Author

For Spark it can be something like

That's effectively what the test "fix: range partitioning #1906" does, but I can abstract out the partition bounds checking and add some more interesting data sets.

MapStatus.apply(SparkEnv.get.blockManager.shuffleServerId, partitionLengths, mapId)
}

private def isSinglePartitioning(p: Partitioning): Boolean = p match {
Contributor


The entire method might be simplified to just
p.numPartitions <= 1

?

Contributor Author

@mbutrovich mbutrovich Sep 22, 2025


The number of partition bounds can be less than the number of target partitions depending on value cardinality (for example, a key column with only two distinct values yields at most one boundary regardless of the target partition count), so we still need to check rangePartitionBounds.

Comment thread common/src/main/scala/org/apache/comet/CometConf.scala Outdated
Comment thread native/core/src/execution/planner.rs Outdated
Comment thread native/core/src/execution/planner.rs Outdated
val numParts = rdd.getNumPartitions

// The code block below is mostly brought over from
// ShuffleExchangeExec::prepareShuffleDependency
Contributor


It might be non-trivial to do, but we could think about making this a plan that we execute on the native side. Essentially, your original range partitioner, but distributed.

serializer: Serializer,
metrics: Map[String, SQLMetric]): ShuffleDependency[Int, ColumnarBatch, ColumnarBatch] = {
val numParts = rdd.getNumPartitions

Contributor


Can we report the time spent in this? It might be useful to decide if this is worthy of optimization.

Contributor Author


I'll open a follow-up issue.

@@ -26,15 +27,15 @@ pub enum CometPartitioning {
Hash(Vec<Arc<dyn PhysicalExpr>>, usize),
/// Allocate rows based on the lexical order of one or more expressions and the specified number of
Contributor


I'm thinking: would it be that intuitive for the user to have

Arc<RowConverter>, Vec<OwnedRow>

here? 🤔

Contributor Author


Could you expand on this please? I'm not sure I understand the requested change.

Contributor

@comphead comphead Sep 24, 2025


Sorry for the misleading comment.
I was thinking about comparing it with the other variants, like Hash:

 Hash(Vec<Arc<dyn PhysicalExpr>>, usize),

It is quite intuitive that hash depends on numPartitions and the expressions to be hashed.

For Range it is

RangePartitioning(LexOrdering, usize, Arc<RowConverter>, Vec<OwnedRow>),

which looks not so intuitive IMO, because when reading it you cannot tell the meaning of the last two params.
Anyway, this design question can be addressed in a follow-up if needed.
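
Purely to illustrate that naming concern, here is a hypothetical sketch (not part of this PR; the field names are made up and the import paths may differ across DataFusion versions) of a struct-style variant that makes the last two parameters self-describing:

use std::sync::Arc;

use arrow::row::{OwnedRow, RowConverter};
use datafusion::physical_expr::{LexOrdering, PhysicalExpr};

pub enum CometPartitioning {
    Hash(Vec<Arc<dyn PhysicalExpr>>, usize),
    /// Range partitioning with bounds computed on the driver.
    RangePartitioning {
        ordering: LexOrdering,
        num_partitions: usize,
        /// Converter that encoded the boundary rows (and encodes incoming rows the same way).
        converter: Arc<RowConverter>,
        /// Driver-computed partition boundary rows, sorted ascending.
        bounds: Vec<OwnedRow>,
    },
}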

Comment thread native/core/src/execution/planner.rs Outdated
// Create a RowConverter and use it to create OwnedRows from the Arrays
let converter = RowConverter::new(sort_fields)?;
let rows = converter.convert_columns(&arrays)?;
let owned_rows: Vec<OwnedRow> = rows.iter().map(|row| row.owned()).collect();
Contributor


Maybe we should add a comment here about what owned_rows is?
Is it the actual rows before shuffle?

For simplicity, attaching a diagram with the RR flow

                 [ Before Shuffle ]
 Executor 1: (5,A) (15,B) (30,C)    Executor 2: (40,D) (70,E) (90,F)
                          |                               |
                          v                               v

                 [ Shuffle Write: Buckets on Disk ]
 Executor 1: [P0:(5,15,30)] [P1: ] [P2: ]   Executor 2: [P0: ] [P1:(40)] [P2:(70,90)]

                          |                               |
                          v                               v

                 [ Shuffle Read: Reducers Fetch Buckets ]
 Reducer P0  <----  E1.Bucket0 + E2.Bucket0  ---->  (5,15,30)
 Reducer P1  <----  E1.Bucket1 + E2.Bucket1  ---->  (40)
 Reducer P2  <----  E1.Bucket2 + E2.Bucket2  ---->  (70,90)

                 [ After Shuffle = Range Partitions ]
 P0: (5,15,30)   P1: (40)   P2: (70,90)

Contributor Author


owned_rows are the boundary values. I can make it more explicit.
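
For instance (a hedged suggestion, not necessarily the wording that was merged), the comment in the snippet above could spell that out:

// Encode the driver-computed range-partition boundary values into OwnedRows.
// These are the partition bounds, not the rows of the batches being shuffled.
let converter = RowConverter::new(sort_fields)?;
let rows = converter.convert_columns(&arrays)?;
let owned_rows: Vec<OwnedRow> = rows.iter().map(|row| row.owned()).collect();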

.doc("Whether to enable range partitioning for Comet native shuffle.")
.booleanConf
.createWithDefault(false)
.createWithDefault(true)
Contributor


Should we keep it as false, then run some benches and real tests with this param set to true, and enable it by default later?

Contributor Author


I discussed with @andygrove and we were comfortable merging with true back in June. I think if you're opting into native shuffle we should try to accelerate all partitioning schemes, and if we discover issues it can be toggled off.

Member


Enabling it by default now gives us more opportunities to find bugs over the next few weeks before we release 0.11.0 and we can always disable if we find issues in that time.

Contributor

@comphead comphead left a comment


Thanks @mbutrovich, I left some minor comments, but overall I think the PR is good to go.

mbutrovich and others added 2 commits September 24, 2025 12:51
…fle/CometNativeShuffleWriter.scala

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
Member

@andygrove andygrove left a comment


This is a significant improvement! Thanks @mbutrovich

@mbutrovich mbutrovich merged commit 25d5924 into apache:main Sep 24, 2025
102 checks passed
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
@mbutrovich mbutrovich deleted the fix_range_partitioning branch March 13, 2026 18:58


Development

Successfully merging this pull request may close these issues.

RangePartitioning does not yield correct results with native shuffle

5 participants