ORC: Add configurable write properties #3810
Conversation
reopen rerun CI

Force-pushed from 90a5d04 to f45f0bb
public static final String ORC_STRIPE_SIZE_BYTES = "write.orc.stripe-size-bytes";
public static final long ORC_STRIPE_SIZE_BYTES_DEFAULT = 64L * 1024 * 1024; // 64 MB

public static final String ORC_BLOCK_SIZE_BYTES = "write.orc.block-size-bytes";
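A writer typically resolves such a property from the table's properties map, falling back to the constant default. A minimal sketch of that lookup (the helper class below is illustrative, not the actual Iceberg code, which uses its own property-parsing utilities):

```java
import java.util.Map;

class OrcWriteConfig {
    static final String ORC_STRIPE_SIZE_BYTES = "write.orc.stripe-size-bytes";
    static final long ORC_STRIPE_SIZE_BYTES_DEFAULT = 64L * 1024 * 1024; // 64 MB

    // Resolve the stripe size: an explicit table property wins,
    // otherwise fall back to the built-in default.
    static long stripeSize(Map<String, String> tableProps) {
        String value = tableProps.get(ORC_STRIPE_SIZE_BYTES);
        return value != null ? Long.parseLong(value) : ORC_STRIPE_SIZE_BYTES_DEFAULT;
    }
}
```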
@omalley or @shardulm94, is this a config that we should expose as an Iceberg table setting? Is it still needed for object stores?
I think if we decide that we need the default configuration values set for Iceberg reads, then we should do this the same way as it has been done for the other file formats: basically, we should create a Context object for ORC as well.
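For reference, a Parquet-style write context for ORC could look roughly like the sketch below. The class and method names are hypothetical, and the 256 MB block-size default is an assumption for illustration; only the 64 MB stripe-size default comes from this PR.

```java
import java.util.Map;

// Hypothetical write context: all table properties are parsed once,
// up front, instead of being read back out of a Hadoop Configuration.
class OrcWriteContext {
    private final long stripeSize;
    private final long blockSize;

    private OrcWriteContext(long stripeSize, long blockSize) {
        this.stripeSize = stripeSize;
        this.blockSize = blockSize;
    }

    static OrcWriteContext from(Map<String, String> props) {
        long stripe = Long.parseLong(props.getOrDefault(
            "write.orc.stripe-size-bytes", String.valueOf(64L * 1024 * 1024)));
        long block = Long.parseLong(props.getOrDefault(
            "write.orc.block-size-bytes", String.valueOf(256L * 1024 * 1024)));
        return new OrcWriteContext(stripe, block);
    }

    long stripeSize() { return stripeSize; }

    long blockSize() { return blockSize; }
}
```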
Added in the latest commit. PTAL, thanks.
@hililiwei: Do we have any tests for this in Parquet/Avro?
@pvary I checked, and there seem to be no relevant tests. I added one unit test for ORC this time. How about I open another PR for Parquet/Avro?
@szlta: Would you mind taking a look at the tests? Thanks.
Review threads (resolved):
- flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/TestFlinkTableSink.java
- flink/v1.14/flink/src/test/java/org/apache/iceberg/flink/TestOrcTableProperties.java
- orc/src/test/java/org/apache/iceberg/orc/TestTableProperties.java
openinx left a comment:
Overall this almost looks good to me; I just left several comments.
openinx left a comment:
Looks good to me now. Thanks @hililiwei for the contribution, and thanks @rdblue and @jackye1995 for the reviews!
Thank you all for reviewing.
@openinx, @hililiwei, I think we need to revert this. It looks like Iceberg doesn't use the Hadoop Configuration. We can either fix this directly in another PR so that it works like the Parquet contexts, or we should revert and open a replacement PR. What do you think is the right way forward?
I roughly understand what you mean. Let me raise a separate PR; it may go beyond the scope of this PR.
@rdblue The [1]. https://github.com/apache/iceberg/blob/master/orc/src/main/java/org/apache/iceberg/orc/ORC.java#L122
Referenced code: iceberg/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java, lines 153 to 156, lines 158 to 161, and lines 261 to 270 (at c6710cd).

I mainly referred to Parquet's code above. Do you mean we should use the same approach? @openinx
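The Parquet builder methods referenced above accumulate options in a plain in-memory map and interpret them only when the writer is built. A stripped-down sketch of that pattern (this is not the actual Iceberg class, just the shape of it):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Parquet-style builder: options accumulate in a plain map
// rather than being copied into a Hadoop Configuration.
class WriteBuilder {
    private final Map<String, String> config = new HashMap<>();

    WriteBuilder set(String property, String value) {
        config.put(property, value);
        return this;
    }

    WriteBuilder setAll(Map<String, String> properties) {
        config.putAll(properties);
        return this;
    }

    // Defensive copy so callers cannot mutate the builder's state.
    Map<String, String> config() {
        return new HashMap<>(config);
    }
}
```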
Let's make this clearer: in the ORC class, we just copy all key-value pairs from the property map into the Hadoop configuration, and all subsequent configuration keys are parsed from that Hadoop configuration instance. In the current Parquet class, we just keep all the properties in an in-memory hash map, and then the … In my mind, I just don't see the difference between the two approaches. Both look good to me.
@openinx, I think what @rdblue means is that in the current logic, the Iceberg configuration might be overridden by the Hadoop configuration, for example: … So we need a props map to receive the table props passed in, and then set the corresponding ORC props in the Hadoop configuration after parsing, just like Parquet.
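The override risk can be illustrated with plain maps standing in for the Hadoop Configuration (a simplification; this is not the actual ORC builder code): when the table property is absent, the value silently comes from whatever happens to be in the environment's configuration rather than from Iceberg's documented default.

```java
import java.util.HashMap;
import java.util.Map;

class PrecedenceDemo {
    // Mimics the current approach: table props are copied over the Hadoop
    // conf, and the writer then reads everything from the merged conf.
    static long resolveViaConf(Map<String, String> tableProps, Map<String, String> hadoopConf) {
        Map<String, String> merged = new HashMap<>(hadoopConf);
        merged.putAll(tableProps);
        String value = merged.get("orc.stripe.size");
        // Only if neither source set the key do we reach the intended default.
        return value != null ? Long.parseLong(value) : 64L * 1024 * 1024;
    }
}
```

With an empty table-properties map and `orc.stripe.size=1` lurking in the environment conf, the writer ends up with 1 byte instead of the 64 MB Iceberg default.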
For Parquet the Hadoop configuration is used to pass options into the data file. It is not used as the source of table configuration. Table configuration properties should never come from the Hadoop Configuration. The steps should be:
The only configuration coming from the Hadoop Configuration itself is whatever was in the environment. One possibly confusing thing is that we also set config values directly in the Hadoop configuration. That handles cases where the user wants to pass config properties that are not standardized in Iceberg. So you could use |
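The flow described above could be sketched as follows, again with a map standing in for Hadoop's Configuration (the method and key names are illustrative): each value is resolved from the table properties with Iceberg's default, then pushed into the conf handed to the ORC writer, so the environment can never override it.

```java
import java.util.HashMap;
import java.util.Map;

class OrcConfResolver {
    // Build the conf for the ORC writer: the Iceberg table property (or its
    // default) is written into the conf last, so it always wins over
    // whatever the environment's configuration contained.
    static Map<String, String> buildWriterConf(Map<String, String> tableProps,
                                               Map<String, String> envConf) {
        Map<String, String> writerConf = new HashMap<>(envConf);
        String stripeSize = tableProps.getOrDefault(
            "write.orc.stripe-size-bytes", String.valueOf(64L * 1024 * 1024));
        writerConf.put("orc.stripe.size", stripeSize);
        return writerConf;
    }
}
```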
Okay, I think I get the drawbacks of the current ORC builder (which copies the table properties map into the Hadoop conf and just uses the Hadoop conf as the source of truth):

I agree to make another PR to change the ORC config parsing, as @rdblue suggested. Thanks.
closes #3272