Conversation

@aokolnychyi (Contributor)

This PR enables Spark's Adaptive Query Execution (AQE) to handle skew in writes.
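For context, the Spark settings that govern this behavior are AQE itself plus the skew optimization for rebalance partitions; below is a minimal sketch of a session wired for it. The config keys are standard Spark SQL properties (all default to on in Spark 3.4); the local master, app name, and advisory size are illustrative.

import org.apache.spark.sql.SparkSession;

public class AqeSkewWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[4]") // illustrative
        .appName("aqe-skew-write")
        // AQE must be on for any adaptive rewrites to apply
        .config("spark.sql.adaptive.enabled", "true")
        // lets AQE split skewed partitions produced by a rebalance before the write
        .config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
        // target partition size AQE aims for when splitting or coalescing
        .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
        .getOrCreate();
    spark.stop();
  }
}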

@github-actions bot added the spark label, May 4, 2023
@aokolnychyi (Contributor Author)

}

@Test
public void testSkewDelete() throws Exception {
@aokolnychyi (Contributor Author), May 4, 2023

Tests for CoW row-level operations already cover SparkWrite, which is used in normal writes. There is not much logic on the Iceberg side; the rest is covered by Spark tests.

@RussellSpitzer (Member)

This is just for 3.4 because of the new rebalance code for writes, right?

@aokolnychyi (Contributor Author)

@RussellSpitzer, correct. This API does not exist in 3.3.


@Override
public boolean distributionStrictlyRequired() {
  // let Spark's AQE adjust the requested write distribution (e.g., split skewed partitions)
  return false;
}
Contributor Author

I may actually need to move this to SparkWriteBuilder, as SparkWrite is also used for compaction. We explicitly disable table distribution/ordering and AQE in shuffling rewriters, but not in bin-pack when the output spec mismatches.

Thoughts, @RussellSpitzer?

Member

I was going to say I don't really mind our compaction solution at the moment. I think disabling AQE is our best bet there.

Contributor Author

Let's stick to that then, I agree.
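For context, "disabling AQE" for a compaction job comes down to flipping Spark's standard AQE switch around the rewrite; a minimal sketch under assumptions: "spark" is an existing SparkSession, and runBinPackRewrite is a hypothetical placeholder for the actual compaction entry point.

// Save, override, and restore Spark's AQE switch around a compaction-style rewrite.
String previous = spark.conf().get("spark.sql.adaptive.enabled", "true");
spark.conf().set("spark.sql.adaptive.enabled", "false"); // keep task sizing deterministic
try {
  runBinPackRewrite(spark); // hypothetical compaction entry point
} finally {
  spark.conf().set("spark.sql.adaptive.enabled", previous); // restore the original setting
}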

// that means there are 4 shuffle blocks, all assigned to the same reducer
// AQE detects that all 4 shuffle blocks are big and processes them in 4 separate tasks
// otherwise, there would be 1 task processing 4 shuffle blocks
int addedFiles = Integer.parseInt(summary.get(SnapshotSummary.ADDED_DELETE_FILES_PROP));
Contributor

[minor] can use PropertyUtil here

Contributor Author

I did not use PropertyUtil as it requires a default value and won't fit on one line. I can switch, though.
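For reference, the two variants side by side. PropertyUtil.propertyAsInt is Iceberg's helper in org.apache.iceberg.util; this sketch assumes summary is the snapshot's Map&lt;String, String&gt;, and the default of 0 is an arbitrary placeholder.

import org.apache.iceberg.SnapshotSummary;
import org.apache.iceberg.util.PropertyUtil;

// Direct parse (as in the PR): throws if the property is missing
int addedDeleteFiles =
    Integer.parseInt(summary.get(SnapshotSummary.ADDED_DELETE_FILES_PROP));

// PropertyUtil variant: needs an explicit default, hence the extra line width
int addedDeleteFilesOrDefault =
    PropertyUtil.propertyAsInt(summary, SnapshotSummary.ADDED_DELETE_FILES_PROP, 0);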

Comment on lines 159 to 160
int addedFiles = Integer.parseInt(summary.get(SnapshotSummary.ADDED_FILES_PROP));
Assert.assertEquals("Must produce 4 files", 4, addedFiles);
Contributor

[minor] Can this be moved to a private function, e.g. assertAddedFiles, to use in both MOR and COW?

Contributor Author

Let me add something.

Contributor Author

There was an existing validateProperty, which I forgot about. I switched to that.
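The exact signature of the existing validateProperty helper is not shown in this thread; a hypothetical sketch of what such a shared assertion could look like in the test base class (name, parameters, and message are illustrative):

import java.util.Map;
import org.junit.Assert;

// Hypothetical helper: asserts a snapshot summary property has the expected value,
// so both the MOR and COW tests can share the same check.
private void validateProperty(Map<String, String> summary, String property, String expected) {
  Assert.assertEquals("Unexpected value for " + property, expected, summary.get(property));
}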


@Override
public boolean distributionStrictlyRequired() {
  // let Spark's AQE adjust the requested write distribution (e.g., split skewed partitions)
  return false;
}
Contributor

Should we also check that ADAPTIVE_OPTIMIZE_SKEWS_IN_REBALANCE_PARTITIONS_ENABLED is true before disabling this requirement? Otherwise it will be a no-op for OptimizeSkewInRebalancePartitions.

Contributor Author

Is there ever a good reason to return true from this method? We don't require distributions to be strict, and it is up to Spark to either handle the skew or not.

Contributor

Agree. I was mostly coming from the point that we are overriding this and setting it to false in the hope that Spark will optimize the skew, whereas if the above conf is disabled, Spark will never do so. I am fine with keeping it as it is.

@aokolnychyi (Contributor Author), May 4, 2023

I feel it is better to always return false and leave it up to Spark. It seems the safest way, as Spark may add new configs or logic for when to do that in the future.
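For reference, the conditional variant discussed above, which was rejected in favor of always returning false, would have looked roughly like this. The conf is read by its string key, which is Spark's standard property; "spark" is assumed to be a SparkSession field on the write.

@Override
public boolean distributionStrictlyRequired() {
  // Rejected alternative: only relax the requirement when AQE can actually
  // split skewed rebalance partitions.
  boolean skewOptimizationEnabled = Boolean.parseBoolean(
      spark.conf().get("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true"));
  return !skewOptimizationEnabled;
}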

Comment on lines +157 to +158
// AQE detects that all shuffle blocks are big and processes them in 4 independent tasks
// otherwise, there would be 2 tasks processing 2 shuffle blocks each
Contributor

[doubt] Should we also add a unit test where coalescing is happening?

Contributor Author

I was planning to do so in a separate PR. This change focuses on skew.

@singhpk234 (Contributor) left a comment

LGTM as well. Thanks, @aokolnychyi!

@aokolnychyi aokolnychyi merged commit 7fa5fca into apache:master May 4, 2023
@aokolnychyi (Contributor Author)

Thanks for reviewing, @singhpk234 @RussellSpitzer!
