Skip to content

Spark 3.4: Fix write and SQL options to override delete file compression config#8438

Merged
szehon-ho merged 22 commits intoapache:masterfrom
roryqi:delete_fix
Sep 7, 2023
Merged

Spark 3.4: Fix write and SQL options to override delete file compression config#8438
szehon-ho merged 22 commits intoapache:masterfrom
roryqi:delete_fix

Conversation

@roryqi
Copy link
Copy Markdown
Contributor

@roryqi roryqi commented Aug 31, 2023

No description provided.

@roryqi roryqi marked this pull request as draft August 31, 2023 03:05
@github-actions github-actions Bot added the spark label Aug 31, 2023
@roryqi roryqi changed the title Delete fix Spark 3.4: Fix write and SQL options to override delete file compression config Aug 31, 2023
Comment thread spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java Outdated
@roryqi roryqi marked this pull request as ready for review August 31, 2023 07:37
});
}

private Object[][] getOverridePropertiesSuites() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case 1: Spark properties override the write properties.

}

private Object[][] getNonOverridePropertiesSuite() {
return new Object[][] {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case 2: We reuse the data file format properties because of the absence of delete file format properties.

},
{
ImmutableMap.of(),
ImmutableMap.of(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case 3: The properties of delete file format exists.

Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments

break;

case ORC:
String orcCodec = deleteOrcCompressionCodec();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would a helper method be better to re-use?

setWithFallback(writeProperties, DELETE_ORC_COMPRESSION, deleteOrcCompressionCodec(), orcCompressionCodec());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good point.

writeProperties.put(
DELETE_ORC_COMPRESSION, orcCodec != null ? orcCodec : orcCompressionCodec());
String strategy = deleteOrcCompressionStrategy();
if (strategy != null) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simlarly, maybe setWithFallback(writeProperties, DELETE_ORC_COMPRESSION_STRATEGY, deleteOrcCompressionStrategy(), orcCompressionStrategy()) (I feel null handling is similar in the two cases?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks good to me

Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should get this in, I left some comments on the test as it was a bit harder to read for me.

}
}

private void testWritePropertiesBySuite(Object[] propertiesSuite) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any point to make it Object[], and not just Map<String, String>[]?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, not sure 'bySuite' adds any value here, maybe just testWriteProperties

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any point to make it Object[]

Fixed.

Copy link
Copy Markdown
Contributor Author

@roryqi roryqi Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any point to make it Object[], and not just Map<String, String>[]?

We can't create a generic type array or cast the arry. So I use Object[] here. This seems confusing. So I changed the type from the Array to the List.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, not sure 'bySuite' adds any value here, maybe just testWriteProperties

Fixed.

(Map<String, String>) propertiesSuite[0],
() -> {
Table table = validationCatalog.loadTable(tableIdent);
Map<String, String> writeOptions = ImmutableMap.of();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definition seems quite far from where its used. How about just inline it below?

      SparkWriteConf writeConf = new SparkWriteConf(spark, table, ImmutableMap.of());

}

@Test
public void testSparkWriteConfWriteProperties() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is good for code re-use and test coverage , its hard to read. Iceberg test typically split to be as fine grained as possible.

I see you have at least three cases, can we make separate test for the three cases? We should also inline the properties in the test itself to make it easier to read.

@Test
testSparkConfOverride() {
   Map<String, String>[] properties = // inline properties here
   testWriteProperties(properties)...
}

testDataPropsDefaultsAsDeleteProps() {
  ...
}

testDeleteFileWriteConf() {
  ...
}

...

feel free to clarify the titles.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

updateProperties.commit();

Map<String, String> writeOptions = ImmutableMap.of();
SparkWriteConf writeConf = new SparkWriteConf(spark, table, writeOptions);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still slightly prefer to inline writeOptions here, but not a blocker.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will raise a follow-up pr to fix this.

@szehon-ho szehon-ho merged commit 3ed03c6 into apache:master Sep 7, 2023
@szehon-ho
Copy link
Copy Markdown
Member

Merged, thanks @jerqi !

@roryqi
Copy link
Copy Markdown
Contributor Author

roryqi commented Sep 7, 2023

Merged, thanks @jerqi !

Thanks for your careful review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants