Spark 3.3: Adding Rebalance operator solving for small files problem #8042
Conversation
@huaxingao is probably the right person to review this

RussellSpitzer left a comment:
Probably need some tests to prove the property is taking effect and that we get different plans.
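For instance, such a test might assert which exchange operator shows up in the plan (a rough sketch, not the PR's test code; `db.tbl`, the SQL, and the eager execution of the DELETE are placeholder assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{RebalancePartitions, RepartitionByExpression}

val spark = SparkSession.builder().getOrCreate()

// With strict distribution disabled, the write should plan a rebalance;
// with it enabled (the default), a fixed-count RepartitionByExpression.
val plan = spark.sql("DELETE FROM db.tbl WHERE id = 1").queryExecution.optimizedPlan
assert(plan.collectFirst { case r: RebalancePartitions => r }.isDefined)
```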
Thanks for extending my PR #7932 by introducing strict table distribution as a table property.
Commits:
- …l files problem
- Revert "Spark 3.3: Adding Rebalance operator for handling skew - solving small files problem" (reverts commit 9c82c35)
- Revert "Spark 3.3: Adding Rebalance operator for handling skew - solving small files problem" (reverts commit 5f0094d)
- Revert "Spark 3.3: Adding Rebalance operator for handling skew - solving small files problem" (reverts commit 0612543)
```java
// Controls whether the set distribution mode has to be followed or not.
public static final String STRICT_TABLE_DISTRIBUTION_AND_ORDERING =
    "strict-table-distribution-and-ordering";
public static final String STRICT_TABLE_DISTRIBUTION_AND_ORDERING_DEFAULT = "true";
```
A new table-level property is introduced to decide whether the distribution mode specified in the table properties must be followed strictly, since Rebalance supports only hash partitioning in Spark 3.3.
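For illustration, this is how a user might toggle the proposed property (the key comes from the diff above; `db.tbl` is a placeholder and an active `spark` session is assumed):

```scala
// Disable strict distribution so the write can fall back to a rebalance + AQE.
spark.sql(
  "ALTER TABLE db.tbl SET TBLPROPERTIES " +
    "('strict-table-distribution-and-ordering' = 'false')")
```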
Minor suggestion here: since this is Spark specific, maybe call this "write.spark.strict-table-distribution-ordering". The comment should also reflect that this is a Spark property.
I believe this is only going to be for Spark 3.3 as well, right? We should probably document that, since we won't be doing a patch in 3.4.
The doc which is currently in the planning section should probably be here instead:

```scala
// if strict distribution mode is not enabled, then we fall back to Spark AQE
// to determine the number of partitions by coalescing and un-skewing partitions
// Also to note, Rebalance is only supported for hash distribution mode till spark 3.3
// By default the strictDistributionMode is set to true, to not disrupt regular
// plan of RepartitionByExpression
```
I don't think it is a good idea to add a new public-facing table property that is already applicable only to older versions of Spark. Can we add a SQL property instead in our 3.3 module? Also, I would be okay skipping it and just assuming the distribution has to be strict. Then Spark will coalesce small files but won't split large ones. If we want feature parity with 3.4, let's do a SQL property.
Also, the ordering is always required. It is just the distribution that can be strict or not. It should be reflected in the name of the property.
```scala
RepartitionByExpression(ArraySeq.unsafeWrapArray(distribution), query, finalNumPartitions)
```

```scala
val tableProperties = if (table.isInstanceOf[RowLevelOperationTable]) {
  table.asInstanceOf[RowLevelOperationTable].table.properties()
```
In the case of RowLevelOperationTable, we had to get the nested table to read its properties.
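A pattern match expresses the same unwrapping a bit more idiomatically (a sketch, assuming the types from the snippet above):

```scala
// Unwrap RowLevelOperationTable to reach the underlying table's properties.
val tableProperties = table match {
  case rowLevelOp: RowLevelOperationTable => rowLevelOp.table.properties()
  case other => other.properties()
}
```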
```scala
// Also to note, Rebalance is only supported for hash distribution mode till spark 3.3
// By default the strictDistributionMode is set to true, to not disrupt regular
// plan of RepartitionByExpression
RebalancePartitions(ArraySeq.unsafeWrapArray(distribution), query)
```
New operator which helps in reducing the number of files via AQE.
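For context, these are the standard Spark AQE settings that govern how the output of a `RebalancePartitions` node gets coalesced or split (stock Spark configuration, not part of this PR):

```scala
// AQE must be on for RebalancePartitions to have any effect.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Merge small shuffle partitions after the rebalance...
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// ...targeting roughly this many bytes per output partition.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```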
```java
}

@Test
public void testCoalesceDelete() throws Exception {
```
Added tests from #7532
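Roughly, such a test checks that far fewer files are produced than shuffle partitions. A sketch under assumptions (an active `spark` session with an Iceberg catalog, placeholder table `db.tbl`; not the PR's actual test):

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200")

// A row-level operation that would otherwise fan out into many small files.
spark.sql("DELETE FROM db.tbl WHERE id % 2 = 0")

// Iceberg's `files` metadata table lists the table's current data files.
val fileCount = spark.table("db.tbl.files").count()
assert(fileCount < 200, "expected AQE to coalesce write tasks into fewer files")
```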
@RussellSpitzer Added tests from Spark 3.4.
LGTM. cc @RussellSpitzer @aokolnychyi

@RussellSpitzer Addressed all review comments! Please take another look.

I will have some time today/tomorrow to take a look as well.
```scala
if (strictDistributionMode.equals("false") && isHashDistributionMode) {
  // if strict distribution mode is not enabled, then we fall back to Spark AQE
  // to determine the number of partitions by coalescing and un-skewing partitions
  // Also to note, Rebalance is only supported for hash distribution mode till spark 3.3
```
Till Spark 3.3? Or 3.4? I am confused cause this change is for 3.3.
@aokolnychyi In Spark 3.4, both hash and range are supported for the rebalance operator, but in Spark 3.3 only hash is supported. I will change the statement to "in spark 3.3". Do we want to fall back to rebalance even for the none distribution mode (aka round-robin partitioning)?
That's OK. Let's just fix the comment then, because it states that Spark 3.3 is supposed to work.
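Putting the quoted pieces together, the branch under discussion looks roughly like this (reconstructed from the snippets in this review, not the exact PR code):

```scala
import scala.collection.immutable.ArraySeq
import org.apache.spark.sql.catalyst.plans.logical.{RebalancePartitions, RepartitionByExpression}

val distributedQuery =
  if (strictDistributionMode.equals("false") && isHashDistributionMode) {
    // Non-strict + hash: let AQE pick partition counts, coalescing small
    // partitions and splitting skewed ones.
    RebalancePartitions(ArraySeq.unsafeWrapArray(distribution), query)
  } else {
    // Strict (the default): keep the regular fixed-count repartition.
    RepartitionByExpression(
      ArraySeq.unsafeWrapArray(distribution), query, finalNumPartitions)
  }
```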
The change seems correct to me but I would not add a table property given that it is needed only for older versions of a particular engine. I'd add a SQL property in 3.3 instead. Thanks, @namrathamyske!
@aokolnychyi Thanks for reviewing! Addressed your comments.

@RussellSpitzer @aokolnychyi Please give this PR another look.
```java
// When set, new snapshots will be committed to this branch.
public static final String WAP_BRANCH = "spark.wap.branch";

// This property doesn't need to be transferred to Spark 3.4 because we have already set
```
This description seems to just describe the implementation; what we would need here is just what an end user would need to know: what does this property do for the user, and why should they change it?
I would probably change the whole name to something like "ENABLE_MERGE_AQE", i.e. "spark.merge-aqe.enabled" defaulting to false.
@aokolnychyi what do you think?
I believe this would apply to all writes, not only row-level operations, if extensions are enabled?
What about spark.sql.iceberg.write-aqe.enabled with false by default? I am not sure about the write-aqe part of it, but I think it has to be generic and start with the spark.sql.iceberg prefix.
Also, the comment is probably too specific; I agree with @RussellSpitzer that it should be for the user.
I agree that this applies to all writes, not just row-level operations. This property changes the final distribution and is used in ExtendedDistributionAndOrderingUtils, so I would prefer spark.sql.iceberg.write-distribution.aqe.enabled. Otherwise, spark.sql.iceberg.write-aqe.enabled could be confused with enabling or disabling AQE for the whole Spark job.
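A minimal sketch of reading such a session conf (the key is the one proposed above; nothing here is a committed API, and an active `spark` session is assumed):

```scala
// Default to false so existing write plans are unchanged unless opted in.
val writeAqeEnabled = spark.conf
  .getOption("spark.sql.iceberg.write-distribution.aqe.enabled")
  .getOrElse("false")
  .toBoolean
```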
Force-pushed 443256d to 7d82055, then 7d82055 to bf7e866.
@RussellSpitzer @aokolnychyi Addressed the comments. Please give this another look.

Hi all, any progress on this?

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
For Spark 3.4, #7637 and #7520 have been added, enabling AQE to solve the small files problem and take care of skews.
For Spark 3.4, `RequiresDistributionAndOrdering` has `distributionStrictlyRequired()`, which Iceberg has set to false: https://github.com/apache/iceberg/blame/37f53518a09803e4ef6b4669f58fbcc960ea5994/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L139. To solve the same problem in Spark 3.3, without `distributionStrictlyRequired()` being present in `RequiresDistributionAndOrdering`, we can add a table-level property which can be checked when determining the Repartition vs Rebalance operator. Idea taken from https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala#L66. Extending the PR #7932. @bowenliang123
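For reference, the Spark 3.4 hook this PR emulates for 3.3 looks roughly like this (a Scala sketch of a DSv2 write, not Iceberg's actual code; `SketchWrite` and the unspecified distribution are illustrative):

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.SortOrder
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

class SketchWrite extends Write with RequiresDistributionAndOrdering {
  override def requiredDistribution(): Distribution = Distributions.unspecified()
  override def requiredOrdering(): Array[SortOrder] = Array.empty
  // Returning false (available since Spark 3.4, default true) tells Spark it
  // may rebalance instead of strictly repartitioning to the required distribution.
  override def distributionStrictlyRequired(): Boolean = false
}
```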
@RussellSpitzer @rdblue @aokolnychyi @amogh-jahagirdar @jackye1995 @singhpk234 could you take a look at this PR?
cc: @SreeramGarlapati