Skip to content

Conversation

@hililiwei
Copy link
Contributor

@hililiwei hililiwei commented Mar 8, 2022

Based on #3810.
Refer:#3810 (comment)

For Parquet the Hadoop configuration is used to pass options into the data file. It is not used as the source of table configuration. Table configuration properties should never come from the Hadoop Configuration.

The steps should be:

  1. Get the Hadoop configuration, if it is present in the InputFile or OutputFile. If not, default it
  2. Get config from the builder and the table properties
  3. Set configuration from step 2 on the Hadoop config
  4. Create the reader or writer with the Hadoop config

The only configuration coming from the Hadoop Configuration itself is whatever was in the environment.

One possibly confusing thing is that we also set config values directly in the Hadoop configuration. That handles cases where the user wants to pass config properties that are not standardized in Iceberg. So you could use set("parquet.bloom.filter.enabled#id", "true") for example. Standardized table settings should override these.

@github-actions github-actions bot added the ORC label Mar 8, 2022
@hililiwei hililiwei marked this pull request as draft March 8, 2022 16:58
@hililiwei hililiwei force-pushed the optimize-orc-config branch from 52c8f00 to a0b8b5f Compare March 10, 2022 02:11
@hililiwei hililiwei marked this pull request as ready for review March 10, 2022 02:18
@hililiwei hililiwei changed the title ORC:Optimize the use of table properties for ORC ORC:Optimize table properties usage for ORC Mar 10, 2022
@hililiwei
Copy link
Contributor Author

cc @rdblue @openinx @zhongyujiang

@hililiwei hililiwei force-pushed the optimize-orc-config branch from a0b8b5f to fc1c118 Compare March 10, 2022 02:25
@github-actions github-actions bot added the spark label Mar 10, 2022
// write in such a way that the file contains 10 stripes each with 100 rows
.set("iceberg.orc.vectorbatch.size", "100")
.set(OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(), "100")
.set(OrcConf.STRIPE_SIZE.getAttribute(), "1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looks like the default value of table property write.orc.stripe-size-bytes will override the set value from original ORC config key (orc.stripe.size ) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, using the original ORC config key directly here is invalid and will be overwritten by the table property or by the table property default value.

Copy link
Contributor Author

@hililiwei hililiwei Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the standard iceberg table properties is not found, do we feedback to try the origin ORC key?
From another perspective, this is not a hadoop configuration, but a subjective intention of the developer. It's just not used in a proper way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the table property should override the Hadoop configuration property. We should probably also support the old iceberg.orc.vectorbatch.size property passed in like this. I think the table property should override it.

@hililiwei hililiwei force-pushed the optimize-orc-config branch 2 times, most recently from 056673f to 64645ed Compare March 10, 2022 03:41
@github-actions github-actions bot added the core label Mar 10, 2022
@hililiwei hililiwei requested a review from openinx March 10, 2022 14:17
@hililiwei hililiwei force-pushed the optimize-orc-config branch from ee1b8f6 to 9edd728 Compare March 11, 2022 06:01
Remove the extra `org.apache.iceberg.` prefix.
Copy link
Member

@openinx openinx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now. Thanks @hililiwei for the contribution !

@zhongyujiang
Copy link
Contributor

Sorry for the delay.

About backwards compatibility, there is another issue I worried about. Users used to be able to set ORC props in Iceberg table properties like "orc.stripe.size" = '1', which could be passed directly to ORC to control stripe size, now we introduced standardized Iceberg ORC_STRIPE_SIZE_BYTES, but I think still we should not ignore those ORC properties which already had been set in table properties by users. I think we can use priorities like #3810, just pull the props from config instead of conf. What do you think?

The rest looks good to me.

@openinx
Copy link
Member

openinx commented Mar 14, 2022

@zhongyujiang As we will still copy the configure key&values to the hadoop configuration here https://github.com/apache/iceberg/pull/4291/files#diff-34d7fce4c1d9417fa9342247cf5ace0636cd86591efbed0555ae80197f79303dR172-R174 . So I think the current patch could still meet your requirement about Users used to be able to set ORC props in Iceberg table properties like "orc.stripe.size" = '1', which could be passed directly to ORC to control stripe size, right ?

@hililiwei
Copy link
Contributor Author

hililiwei commented Mar 14, 2022

@openinx One thing to note. Although the k-v of config is copied to hadoop conf, if a standardized iceberg key, such as "orc.stripe.size", is set in conf using only the origin key, instead of the standardized key "write.orc.stripe-size-bytes", it will eventually be overwritten by iceberg's default value.
@zhongyujiang seems to be talking about this. I raised a similar question above, which you can refer to:#4291 (comment)

That is, for the config, if the key has been standardized, it is invalid to use the origin of the orc.

@openinx
Copy link
Member

openinx commented Mar 15, 2022

@hililiwei Would you mind to fix those travis CI issues ?

@hililiwei hililiwei force-pushed the optimize-orc-config branch from e331ff8 to 5c451a0 Compare March 15, 2022 09:16
@hililiwei
Copy link
Contributor Author

@hililiwei Would you mind to fix those travis CI issues ?

Done.

@openinx openinx merged commit b4d6489 into apache:master Mar 21, 2022
@openinx
Copy link
Member

openinx commented Mar 21, 2022

Got this merged, thanks @hililiwei for contribution, and thanks all for reviewing !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants