Core: use ZSTD compression parquet by default for new tables#8299
szehon-ho wants to merge 4 commits into apache:master
|
Will need to fix tests, but the idea is here. cc @dbtsai, @RussellSpitzer |
| this.identifier = identifier; | ||
| this.schema = schema; | ||
| this.tableProperties.putAll(tableDefaultProperties()); | ||
| // Explicitly set ZSTD for new tables |
There was a problem hiding this comment.
Maybe // Explicitly set Parquet compression codec for new tables
| // Explicitly set ZSTD for new tables | ||
| this.tableProperties.put( | ||
| TableProperties.PARQUET_COMPRESSION, | ||
| TableProperties.PARQUET_COMPRESSION_NEW_TABLE_DEFAULT); |
There was a problem hiding this comment.
this.tableProperties.put(
    TableProperties.PARQUET_COMPRESSION,
    TableProperties.PARQUET_COMPRESSION_DEFAULT);
so we don't need to introduce a new conf, PARQUET_COMPRESSION_NEW_TABLE_DEFAULT
| Map<String, String> newProperties = Maps.newHashMap(base.properties()); | ||
| newProperties.put( | ||
| TableProperties.PARQUET_COMPRESSION, | ||
| TableProperties.PARQUET_COMPRESSION_DEFAULT); |
There was a problem hiding this comment.
if (base.properties().get(TableProperties.PARQUET_COMPRESSION) == null) {
  Map<String, String> newProperties = Maps.newHashMap(base.properties());
  newProperties.put(TableProperties.PARQUET_COMPRESSION, "gzip");
| public static final String PARQUET_COMPRESSION = "write.parquet.compression-codec"; | ||
| public static final String DELETE_PARQUET_COMPRESSION = "write.delete.parquet.compression-codec"; | ||
| public static final String PARQUET_COMPRESSION_DEFAULT = "gzip"; | ||
| public static final String PARQUET_COMPRESSION_NEW_TABLE_DEFAULT = "zstd"; |
There was a problem hiding this comment.
Then, we don't need PARQUET_COMPRESSION_NEW_TABLE_DEFAULT
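A minimal sketch of the existing-table branch suggested above, assuming the goal is to pin gzip on tables that never set a codec, so that a later change to the library default does not alter what they write. It uses plain JDK maps rather than Iceberg's `Maps` and `TableMetadata`; the class and method names are hypothetical, and only the property key and the "gzip" value come from the diff.

```java
import java.util.HashMap;
import java.util.Map;

public class PinLegacyCodec {
  // Property key quoted from the TableProperties diff above.
  static final String PARQUET_COMPRESSION = "write.parquet.compression-codec";

  // Hypothetical helper: returns a copy of the properties with "gzip" pinned
  // when no codec is set; an explicitly configured codec is left untouched.
  static Map<String, String> pinLegacyCodec(Map<String, String> existing) {
    Map<String, String> updated = new HashMap<>(existing);
    updated.putIfAbsent(PARQUET_COMPRESSION, "gzip");
    return updated;
  }

  public static void main(String[] args) {
    // A table created before the change, with no codec persisted.
    Map<String, String> oldTable = new HashMap<>();
    System.out.println(pinLegacyCodec(oldTable).get(PARQUET_COMPRESSION)); // gzip

    // A table whose owner chose a codec explicitly.
    Map<String, String> snappyTable = new HashMap<>();
    snappyTable.put(PARQUET_COMPRESSION, "snappy");
    System.out.println(pinLegacyCodec(snappyTable).get(PARQUET_COMPRESSION)); // snappy
  }
}
```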
|
Minor comments above. Thank you for working on this follow-up, @szehon-ho. |
a71a5a4 to
4d0e4d7
|
Thanks for looking. Updated and fixed a round of tests. |
| this.schema = schema; | ||
| this.tableProperties.putAll(tableDefaultProperties()); | ||
|
|
||
| // Explicitly set default Parquet compression codecs for new tables |
There was a problem hiding this comment.
This seems to only cover catalogs that extend this class.
Is it something we can address in TableMetadata to avoid changes in multiple places?
There was a problem hiding this comment.
Theoretically we could, but it sounds strange to me to have the TableMetadata object constructed with a few properties set by default, given that it's a public class.
There was a problem hiding this comment.
Well, the problem here is that we update multiple places with the same logic and it only covers some implementations. I am not sure how I feel about changing TableMetadata either. Let me think.
There was a problem hiding this comment.
Took a look: the REST catalog table builder doesn't seem to build TableMetadata itself (it gets it from the response object), so this won't help that case. I tried to organize the code a little bit into the TableMetadata class in any case.
| old: "method org.apache.iceberg.view.ViewBuilder org.apache.iceberg.view.ViewBuilder::withQueryColumnNames(java.util.List<java.lang.String>)" | ||
| justification: "Acceptable break due to updating View APIs and the View Spec" | ||
| org.apache.iceberg:iceberg-core: | ||
| - code: "java.field.constantValueChanged" |
There was a problem hiding this comment.
Another idea could be to add a new constant for new tables and persist the codec only during table creation, while still relying on the old PARQUET_COMPRESSION_DEFAULT for existing tables.
That way we won't have to change SnapshotProducer and encode old values.
There was a problem hiding this comment.
That being said, I am not against the current implementation.
There was a problem hiding this comment.
I had that earlier (not changing the constant); I think @dbtsai wanted to do it the other way.
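The alternative raised in this thread can be sketched roughly as follows: persist the new codec only when a table is created, and keep resolving a missing property to the old default at write time. This is an illustration with plain JDK maps, not the PR's code; the class name, both method names, and the `NEW_TABLE_DEFAULT` constant name are hypothetical, while the property key and the "gzip" and "zstd" values come from the diff.

```java
import java.util.HashMap;
import java.util.Map;

public class NewTableCodecDefault {
  // Key and old default quoted from the TableProperties diff.
  static final String PARQUET_COMPRESSION = "write.parquet.compression-codec";
  static final String PARQUET_COMPRESSION_DEFAULT = "gzip";
  // Hypothetical constant name for the value persisted into new tables.
  static final String NEW_TABLE_DEFAULT = "zstd";

  // Creation time: persist the new codec first, then let user properties override it.
  static Map<String, String> propertiesForNewTable(Map<String, String> userProperties) {
    Map<String, String> props = new HashMap<>();
    props.put(PARQUET_COMPRESSION, NEW_TABLE_DEFAULT);
    props.putAll(userProperties);
    return props;
  }

  // Write time: a table with no persisted codec still falls back to the old default.
  static String resolveCodec(Map<String, String> tableProperties) {
    return tableProperties.getOrDefault(PARQUET_COMPRESSION, PARQUET_COMPRESSION_DEFAULT);
  }

  public static void main(String[] args) {
    Map<String, String> existingTable = new HashMap<>(); // created before the change
    Map<String, String> newTable = propertiesForNewTable(new HashMap<>());

    System.out.println(resolveCodec(existingTable)); // gzip
    System.out.println(resolveCodec(newTable));      // zstd
  }
}
```

Because the old constant keeps its value in this variant, nothing needs to be written back to existing tables, which is the point made about not having to change SnapshotProducer.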
f33366f to
ac6e6fc
|
Looks like there are some open questions. First, I assume we want to do this to reduce the impact on users (keep gzip for old tables and use zstd for new tables). If that's the case, the questions are:
|
|
@rdblue @nastra @jackye1995 @stevenzwu @danielcweeks, could you take a look? It seems like it can go either way here. |
|
I kind of like option 2) where we would essentially have |
d09c720 to
6ec04c5
|
Thanks, makes sense to me, added new constant PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0 |
| public static final String PARQUET_COMPRESSION = "write.parquet.compression-codec"; | ||
| public static final String DELETE_PARQUET_COMPRESSION = "write.delete.parquet.compression-codec"; | ||
| public static final String PARQUET_COMPRESSION_DEFAULT = "gzip"; | ||
| public static final String PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0 = "zstd"; |
There was a problem hiding this comment.
Let me think. I do like the name, but I also wonder if it sends the message that the new default applies to all existing tables that did not set a value. That is probably not what happens, as we only use this value for new tables.
There was a problem hiding this comment.
I can go either way here, up to you @szehon-ho.
| schema, spec, sortOrder, location, unreservedProperties(properties), formatVersion); | ||
| } | ||
|
|
||
| public static TableMetadata newTableMetadataWithDefaultProperties( |
There was a problem hiding this comment.
The name kind of indicates that all defaults are persisted. Are there any better alternatives?
There was a problem hiding this comment.
After thinking about this a bit more, I'd probably just update the existing method given that we modify all places we know and that's the behavior we want to achieve.
There was a problem hiding this comment.
Alright then, I'll just change the existing method; hopefully there are no breakages.
| TableProperties.DELETE_PARQUET_COMPRESSION, | ||
| TableProperties.PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0); | ||
| defaults.putAll(unreservedProperties(properties)); | ||
| int formatVersion = |
There was a problem hiding this comment.
I wonder whether we should add a method for fetching the format version, it is used in 3 places now.
private static int formatVersion(Map<String, String> properties) {
  return PropertyUtil.propertyAsInt(
      properties, TableProperties.FORMAT_VERSION, DEFAULT_TABLE_FORMAT_VERSION);
}
There was a problem hiding this comment.
I think it's a good idea, but I no longer add a new method, so it may be an unrelated change now.
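For reference, a standalone approximation of the helper proposed in this thread, written against plain JDK types so it runs without Iceberg on the classpath. The "format-version" key is assumed to be what `TableProperties.FORMAT_VERSION` resolves to, and the default of 2 is a placeholder for illustration only, not necessarily the value of `DEFAULT_TABLE_FORMAT_VERSION`.

```java
import java.util.HashMap;
import java.util.Map;

public class FormatVersionLookup {
  // Assumed to match the key behind TableProperties.FORMAT_VERSION.
  static final String FORMAT_VERSION = "format-version";
  // Placeholder default for illustration; not taken from the PR.
  static final int DEFAULT_TABLE_FORMAT_VERSION = 2;

  // Mirrors the shape of the suggested helper: parse the property, or fall back to the default.
  static int formatVersion(Map<String, String> properties) {
    String value = properties.get(FORMAT_VERSION);
    return value == null ? DEFAULT_TABLE_FORMAT_VERSION : Integer.parseInt(value);
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    System.out.println(formatVersion(props)); // falls back to the placeholder default

    props.put(FORMAT_VERSION, "1");
    System.out.println(formatVersion(props)); // 1
  }
}
```

Centralizing the lookup this way keeps the three call sites mentioned above from drifting apart if the default ever changes.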
c08e78b to
43e3cab
|
@jerqi FYI, I commented out the failing test for now, since I think @aokolnychyi wanted to get this one in for 1.4, and we can look at the fixes in the meantime. |
OK, I got it. I will fix the issue as soon as I can. |
47b9ce5 to
3e8b227
|
Rebased on #8438; TestCompression tests now pass. FYI @aokolnychyi, for when you are back. |
|
Szehon is on parental leave. I took his work and added Spark 3.5 changes in #8593. Closing this. |
This is another attempt at #8158, based on #8158 (comment).