ORC: Optimize table properties usage for ORC #4291
Conversation
```java
// write in such a way that the file contains 10 stripes each with 100 rows
.set("iceberg.orc.vectorbatch.size", "100")
.set(OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(), "100")
.set(OrcConf.STRIPE_SIZE.getAttribute(), "1")
```
So it looks like the default value of the table property write.orc.stripe-size-bytes will override the value set through the original ORC config key (orc.stripe.size)?
Yes, setting the original ORC config key directly here has no effect; it will be overwritten by the table property, or by the table property's default value if none is set.
If the standardized Iceberg table property is not found, should we fall back to the original ORC key?
From another perspective, this is not really a Hadoop configuration but an explicit intention of the developer; it is just not expressed in the proper way.
I think the table property should override the Hadoop configuration property. We should probably also keep supporting the old iceberg.orc.vectorbatch.size key passed in like this, with the table property overriding it.
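For illustration, here is a minimal Java sketch of the precedence order suggested above: Iceberg table property first, then the original ORC key on the Hadoop configuration, then the built-in default. This is not Iceberg's actual code; the helper name and the 64 MiB default are assumptions.

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

// Hypothetical resolver illustrating the suggested precedence; not the Iceberg API.
public class OrcStripeSizeResolver {

  static long resolveStripeSize(Map<String, String> tableProps, Configuration conf) {
    // 1. The standardized Iceberg table property wins if present.
    String fromTable = tableProps.get("write.orc.stripe-size-bytes");
    if (fromTable != null) {
      return Long.parseLong(fromTable);
    }
    // 2. Fall back to the original ORC key set on the Hadoop configuration.
    String fromConf = conf.get("orc.stripe.size");
    if (fromConf != null) {
      return Long.parseLong(fromConf);
    }
    // 3. Finally, use the built-in default (64 MiB here, purely illustrative).
    return 64L * 1024 * 1024;
  }
}
```

Under this order, a user who only sets "orc.stripe.size" on the Hadoop configuration still gets the intended value, and the old iceberg.orc.vectorbatch.size key could be handled the same way.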
Remove the extra `org.apache.iceberg.` prefix.
openinx left a comment:
Looks good to me now. Thanks @hililiwei for the contribution!
Sorry for the delay. About backwards compatibility, there is another issue I am worried about. Users used to be able to set ORC properties in Iceberg table properties, like "orc.stripe.size" = '1', which would be passed directly to ORC to control stripe size. Now that we have introduced the standardized ORC_STRIPE_SIZE_BYTES, I think we still should not ignore ORC properties that users have already set in table properties. I think we can use priorities like in #3810: just pull the props from … The rest looks good to me.
@zhongyujiang Since we still copy the configured keys and values into the Hadoop configuration here https://github.com/apache/iceberg/pull/4291/files#diff-34d7fce4c1d9417fa9342247cf5ace0636cd86591efbed0555ae80197f79303dR172-R174, I think the current patch could still meet your requirement about …
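A rough sketch of the copy step the linked diff is said to perform, assumed shape based only on the comment above, not the exact code in the PR: every table property is mirrored into the writer's Hadoop configuration, so non-standardized ORC keys users put in table properties still reach the ORC writer.

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

// Assumed shape of the copy step described above; not the exact code in the PR.
final class TablePropsToConf {
  static void copy(Map<String, String> tableProps, Configuration conf) {
    for (Map.Entry<String, String> entry : tableProps.entrySet()) {
      conf.set(entry.getKey(), entry.getValue());
    }
  }
}
```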
@openinx One thing to note. Although the key-value pairs of the config are copied to the Hadoop conf, if a property that has a standardized Iceberg key is set in the conf using only the original ORC key, e.g. "orc.stripe.size" instead of the standardized "write.orc.stripe-size-bytes", it will eventually be overwritten by Iceberg's default value. That is, once a key has been standardized, setting it via the original ORC key is ineffective.
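To make the pitfall concrete, here is a self-contained sketch of why a value set only under the original ORC key gets clobbered once the key is standardized. The resolution logic and the 64 MiB default are assumptions for illustration, not the actual Iceberg code path.

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class StripeSizeOverrideDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("orc.stripe.size", "1"); // user intent expressed via the raw ORC key

    Map<String, String> tableProps = Map.of(); // standardized key is absent

    // The standardized property (or its default) is resolved without consulting
    // the raw ORC key, then written back into the conf, clobbering the user's "1".
    long stripeSize = tableProps.containsKey("write.orc.stripe-size-bytes")
        ? Long.parseLong(tableProps.get("write.orc.stripe-size-bytes"))
        : 64L * 1024 * 1024; // illustrative default
    conf.setLong("orc.stripe.size", stripeSize);

    System.out.println(conf.get("orc.stripe.size")); // prints 67108864, not 1
  }
}
```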
@hililiwei Would you mind fixing those Travis CI issues?
Done.
Got this merged. Thanks @hililiwei for the contribution, and thanks all for reviewing!
Based on #3810.
Refer: #3810 (comment)