Skip to content

Core: use ZSTD compression parquet by default for new tables#8299

Closed
szehon-ho wants to merge 4 commits intoapache:masterfrom
szehon-ho:new_zstd_default
Closed

Core: use ZSTD compression parquet by default for new tables#8299
szehon-ho wants to merge 4 commits intoapache:masterfrom
szehon-ho:new_zstd_default

Conversation

@szehon-ho
Copy link
Copy Markdown
Member

This is another attempt at #8158, based on #8158 (comment).

@szehon-ho
Copy link
Copy Markdown
Member Author

Will need to fix tests, but the idea is here, cc @dbtsai , @RussellSpitzer

this.identifier = identifier;
this.schema = schema;
this.tableProperties.putAll(tableDefaultProperties());
// Explicitly set ZSTD for new tables
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe // Explicitly set Parquet compression codec for new tables

// Explicitly set ZSTD for new tables
this.tableProperties.put(
TableProperties.PARQUET_COMPRESSION,
TableProperties.PARQUET_COMPRESSION_NEW_TABLE_DEFAULT);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.tableProperties.put(
    TableProperties.PARQUET_COMPRESSION,
    TableProperties.PARQUET_COMPRESSION_DEFAULT);

so we don't need to introduce a new conf, PARQUET_COMPRESSION_NEW_TABLE_DEFAULT

Map<String, String> newProperties = Maps.newHashMap(base.properties());
newProperties.put(
TableProperties.PARQUET_COMPRESSION,
TableProperties.PARQUET_COMPRESSION_DEFAULT);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (base.properties().get(TableProperties.PARQUET_COMPRESSION) == null) {
  Map<String, String> newProperties = Maps.newHashMap(base.properties());
  newProperties.put(TableProperties.PARQUET_COMPRESSION, "gzip");

public static final String PARQUET_COMPRESSION = "write.parquet.compression-codec";
public static final String DELETE_PARQUET_COMPRESSION = "write.delete.parquet.compression-codec";
public static final String PARQUET_COMPRESSION_DEFAULT = "gzip";
public static final String PARQUET_COMPRESSION_NEW_TABLE_DEFAULT = "zstd";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, we don't need PARQUET_COMPRESSION_NEW_TABLE_DEFAULT

@dbtsai
Copy link
Copy Markdown
Member

dbtsai commented Aug 12, 2023

Minor comments above. Thank you for working on this followup. @szehon-ho

@szehon-ho
Copy link
Copy Markdown
Member Author

Thanks for looking. Updated and fixed a round of tests.

this.schema = schema;
this.tableProperties.putAll(tableDefaultProperties());

// Explicitly set default Parquet compression codecs for new tables
Copy link
Copy Markdown
Contributor

@aokolnychyi aokolnychyi Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to only cover catalogs that extend this class.
Is it something we can address in TableMetadata to avoid changes in multiple places?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theroetically we could but it sounds strange to me, making the TableMetadata object construct by default with a few properties set , as its a public class?

Copy link
Copy Markdown
Contributor

@aokolnychyi aokolnychyi Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the problem here is that we update multiple places with the same logic and it only covers some implementations. I am not sure how I feel about changing TableMetadata either. Let me think.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look, REST catalog table builder doesnt seem to build TableMetadata themselves (gets from the response object). So this wont help that case. I tried to organize the code a little bit into TableMetadata class in any case

Comment thread .palantir/revapi.yml Outdated
old: "method org.apache.iceberg.view.ViewBuilder org.apache.iceberg.view.ViewBuilder::withQueryColumnNames(java.util.List<java.lang.String>)"
justification: "Acceptable break due to updating View APIs and the View Spec"
org.apache.iceberg:iceberg-core:
- code: "java.field.constantValueChanged"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea could be to add a new constant for new tables and persist the codec only during table creation, while still relying on the old PARQUET_COMPRESSION_DEFAULT for existing tables.

That way we won't have to change SnapshotProducer and encode old values.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I am not against the current implementation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that earlier (not change the constant), I think @dbtsai wanted to do it the other way.

@szehon-ho
Copy link
Copy Markdown
Member Author

szehon-ho commented Aug 15, 2023

Looks like there's some open question. First, I assume we want to do this, to reduce the impact for user (keep gzip for old table and use zstd for new tables). If that's the case, the questions:

  1. Should we change TableMetadata constructor to set these two compression table properties for new TableMetadata? Core: use ZSTD compression parquet by default for new tables #8299 (comment)
  • Currently changing each catalog risks missing this change in custom catalog.
  • But TableMetadata is public API and it may be strange to have new instances default with 2 properties. But then again, maybe this is a good direction and we should start returning with other default table property as well (which is a way for us to start persisting the defaults).
  1. Should we just have two compression constants (DEFAULT and NEW_TABLE_DEFAULT) and not have to persist GZIP explicitly in SnapshotProducer? Then, the only change would be making createTable persist ZSTD default.
    Core: use ZSTD compression parquet by default for new tables #8299 (comment)
  • Its less intrusive code in SnapshotProducer, and doesnt change a public constant. But it just looks a bit weird to have two defaults.

@szehon-ho
Copy link
Copy Markdown
Member Author

szehon-ho commented Aug 15, 2023

@rdblue @nastra @jackye1995 @stevenzwu @danielcweeks , could you guys take a look? Seems like it can go either way here.

@nastra
Copy link
Copy Markdown
Contributor

nastra commented Aug 16, 2023

I kind of like option 2) where we would essentially have PARQUET_COMPRESSION_DEFAULT and PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0, indicating that the newer default should be used on new tables since Iceberg 1.4.0 (or any other name that basically indicates that this is the new default for new tables).

@szehon-ho szehon-ho force-pushed the new_zstd_default branch 3 times, most recently from d09c720 to 6ec04c5 Compare August 16, 2023 23:03
Comment thread hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java Outdated
@szehon-ho
Copy link
Copy Markdown
Member Author

Thanks, makes sense to me, added new constant PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0

public static final String PARQUET_COMPRESSION = "write.parquet.compression-codec";
public static final String DELETE_PARQUET_COMPRESSION = "write.delete.parquet.compression-codec";
public static final String PARQUET_COMPRESSION_DEFAULT = "gzip";
public static final String PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0 = "zstd";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me think. I do like the name but I also wonder if it sends a message the new default applies to all existing tables that did not provide a default value. It is probably not what happens as we only use this value in new tables.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

@aokolnychyi aokolnychyi Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go either way here, up to you @szehon-ho.

Comment thread core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java Outdated
schema, spec, sortOrder, location, unreservedProperties(properties), formatVersion);
}

public static TableMetadata newTableMetadataWithDefaultProperties(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name kind of indicates that all defaults are persisted. Are there any better alternatives?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this a bit more, I'd probably just update the existing method given that we modify all places we know and that's the behavior we want to achieve.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright then , just change the existing method, hope there's no breakages

@aokolnychyi aokolnychyi added this to the Iceberg 1.4.0 milestone Aug 24, 2023
TableProperties.DELETE_PARQUET_COMPRESSION,
TableProperties.PARQUET_COMPRESSION_DEFAULT_SINCE_1_4_0);
defaults.putAll(unreservedProperties(properties));
int formatVersion =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should add a method for fetching the format version, it is used in 3 places now.

private static int formatVersion(Map<String, String> properties) {
  return PropertyUtil.propertyAsInt(
      properties, TableProperties.FORMAT_VERSION, DEFAULT_TABLE_FORMAT_VERSION);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its a good idea, I no longer make a new method though now so it may be unrelated as a change.

Comment thread core/src/main/java/org/apache/iceberg/TableMetadata.java Outdated
@szehon-ho
Copy link
Copy Markdown
Member Author

@jerqi FYI i commented out the failing test for now, I think as @aokolnychyi wanted to get this one for 1.4, and we can also look at the fixes in the meantime.

@roryqi
Copy link
Copy Markdown
Contributor

roryqi commented Aug 31, 2023

@jerqi FYI i commented out the failing test for now, I think as @aokolnychyi wanted to get this one for 1.4, and we can also look at the fixes in the meantime.

OK, I got it. I will fix the issue as soon as I can.

@szehon-ho
Copy link
Copy Markdown
Member Author

szehon-ho commented Sep 7, 2023

Rebased on #8438 , TestCompression tests now pass, fyi @aokolnychyi when you are back

@aokolnychyi
Copy link
Copy Markdown
Contributor

Szehon is on parental leave. I took his work and added Spark 3.5 changes in #8593. Closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants