
Conversation

@hililiwei (Contributor)

closes #3272

To be consistent with Parquet, it would be good to introduce write table properties for ORC too, and to make sure a separate configuration is available for ORC.
Discussed here: #2935 (comment)
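
For context, a minimal sketch of how the proposed properties could be set through the standard table API. The values below are illustrative, and the table is assumed to be loaded from a catalog elsewhere; this is not code from the PR itself.

import org.apache.iceberg.Table;

public class OrcWritePropertiesExample {
  // Minimal sketch: set the ORC write properties proposed in this PR on an
  // existing table. The values are illustrative; the caller passes a table
  // that was loaded from a catalog.
  public static void configureOrcWrites(Table table) {
    table.updateProperties()
        .set("write.orc.stripe-size-bytes", String.valueOf(64L * 1024 * 1024))
        .set("write.orc.block-size-bytes", String.valueOf(256L * 1024 * 1024))
        .commit();
  }
}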

@hililiwei (Contributor Author)

Reopening to rerun CI.

@hililiwei hililiwei closed this Dec 27, 2021
@hililiwei hililiwei reopened this Dec 27, 2021
@hililiwei hililiwei force-pushed the #3272 branch 2 times, most recently from 90a5d04 to f45f0bb on December 28, 2021 01:42
@hililiwei hililiwei requested a review from jackye1995 December 28, 2021 03:43
public static final String ORC_STRIPE_SIZE_BYTES = "write.orc.stripe-size-bytes";
public static final long ORC_STRIPE_SIZE_BYTES_DEFAULT = 64L * 1024 * 1024; // 64 MB

public static final String ORC_BLOCK_SIZE_BYTES = "write.orc.block-size-bytes";
@rdblue (Contributor) commented Dec 29, 2021

@omalley or @shardulm94, is this a config that we should expose as an Iceberg table setting? Is it still needed for object stores?

@pvary (Contributor) commented Jan 3, 2022

I think if we decide that we need the default configuration values set for Iceberg reads, then we should do this the same way as it has been done with the other file formats:

  • Parquet:

    static Context dataContext(Map<String, String> config) {
      int rowGroupSize = Integer.parseInt(config.getOrDefault(
          PARQUET_ROW_GROUP_SIZE_BYTES, PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT));
      int pageSize = Integer.parseInt(config.getOrDefault(
          PARQUET_PAGE_SIZE_BYTES, PARQUET_PAGE_SIZE_BYTES_DEFAULT));
      int dictionaryPageSize = Integer.parseInt(config.getOrDefault(
          PARQUET_DICT_SIZE_BYTES, PARQUET_DICT_SIZE_BYTES_DEFAULT));
      String codecAsString = config.getOrDefault(PARQUET_COMPRESSION, PARQUET_COMPRESSION_DEFAULT);
      CompressionCodecName codec = toCodec(codecAsString);
      String compressionLevel = config.getOrDefault(PARQUET_COMPRESSION_LEVEL, PARQUET_COMPRESSION_LEVEL_DEFAULT);
      return new Context(rowGroupSize, pageSize, dictionaryPageSize, codec, compressionLevel);
    }

  • Avro:

    static Context dataContext(Map<String, String> config) {
      String codecAsString = config.getOrDefault(AVRO_COMPRESSION, AVRO_COMPRESSION_DEFAULT);
      CodecFactory codec = toCodec(codecAsString);
      return new Context(codec);
    }

Basically we should create a Context object for the ORC as well.
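
For illustration only, a minimal sketch of what such an ORC context could look like, mirroring the Parquet dataContext above. The stripe-size key and its 64 MB default appear in this PR's diff; the block-size default, the compression key and default, and the Context shape are assumptions for the example, not necessarily what this PR merges.

import java.util.Locale;
import java.util.Map;
import org.apache.iceberg.util.PropertyUtil;
import org.apache.orc.CompressionKind;

class OrcContextSketch {
  // Stripe/block size keys and the 64 MB stripe default come from this PR's diff;
  // the block-size default and the compression key/default are assumed here.
  static final String ORC_STRIPE_SIZE_BYTES = "write.orc.stripe-size-bytes";
  static final long ORC_STRIPE_SIZE_BYTES_DEFAULT = 64L * 1024 * 1024;
  static final String ORC_BLOCK_SIZE_BYTES = "write.orc.block-size-bytes";
  static final long ORC_BLOCK_SIZE_BYTES_DEFAULT = 256L * 1024 * 1024;
  static final String ORC_COMPRESSION = "write.orc.compression-codec";
  static final String ORC_COMPRESSION_DEFAULT = "zlib";

  static class Context {
    final long stripeSize;
    final long blockSize;
    final CompressionKind codec;

    Context(long stripeSize, long blockSize, CompressionKind codec) {
      this.stripeSize = stripeSize;
      this.blockSize = blockSize;
      this.codec = codec;
    }
  }

  // Resolve the write settings from the table property map, falling back to
  // the Iceberg defaults, the same way Parquet's dataContext does.
  static Context dataContext(Map<String, String> config) {
    long stripeSize = PropertyUtil.propertyAsLong(
        config, ORC_STRIPE_SIZE_BYTES, ORC_STRIPE_SIZE_BYTES_DEFAULT);
    long blockSize = PropertyUtil.propertyAsLong(
        config, ORC_BLOCK_SIZE_BYTES, ORC_BLOCK_SIZE_BYTES_DEFAULT);
    String codecAsString = config.getOrDefault(ORC_COMPRESSION, ORC_COMPRESSION_DEFAULT);
    CompressionKind codec = CompressionKind.valueOf(codecAsString.toUpperCase(Locale.ROOT));
    return new Context(stripeSize, blockSize, codec);
  }
}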

@hililiwei (Contributor Author)

I think if we decide that we need the default configuration values set for Iceberg reads, then we should do this the same way as it has been done with the other file formats:
…………
Basically we should create a Context object for the ORC as well.

Added in the latest commit. PTAL, thx.

@hililiwei hililiwei requested a review from rdblue January 4, 2022 08:35
@pvary (Contributor) commented Jan 4, 2022

@hililiwei: Do we have any tests for this in Parquet/Avro?
I think it would be good to add some tests, so we can be sure that these configurations are working as expected.

@hililiwei hililiwei marked this pull request as draft January 4, 2022 17:04
@hililiwei hililiwei marked this pull request as ready for review January 5, 2022 08:15
@github-actions github-actions bot added the flink label Jan 5, 2022
@hililiwei (Contributor Author)

@hililiwei: Do we have any tests for this in Parquet/Avro? I think it would be good to add some tests, so we can be sure that these configurations are working as expected.

@pvary I checked, and there seem to be no relevant tests. I added one unit test for ORC in this PR. How about I open another PR for Parquet/Avro?
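
For reference, one way such a test could look, as a hedged JUnit 4 sketch. It is written against the hypothetical OrcContextSketch from the earlier comment, not against the class actually added in this PR.

import java.util.HashMap;
import java.util.Map;
import org.junit.Assert;
import org.junit.Test;

public class TestOrcWritePropertiesSketch {
  @Test
  public void testTablePropertyOverridesDefault() {
    Map<String, String> props = new HashMap<>();
    props.put("write.orc.stripe-size-bytes", String.valueOf(32L * 1024 * 1024));

    OrcContextSketch.Context context = OrcContextSketch.dataContext(props);

    // The explicit table property wins over the built-in default.
    Assert.assertEquals(32L * 1024 * 1024, context.stripeSize);
    // Unset properties fall back to the Iceberg defaults.
    Assert.assertEquals(OrcContextSketch.ORC_BLOCK_SIZE_BYTES_DEFAULT, context.blockSize);
  }
}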

@pvary (Contributor) commented Jan 5, 2022

@szlta: Would you mind taking a look at the tests?

Thanks,
Peter

@openinx (Member) left a comment

Almost looks good to me overall, just left several comments.

@openinx (Member) left a comment

Looks good to me now, thanks @hililiwei for the contribution, and thanks @rdblue and @jackye1995 for reviewing!

@openinx openinx changed the title from "ORC: Add ORC support for write properties" to "ORC: Add configurable write properties" Mar 3, 2022
@openinx openinx added this to the Iceberg 0.14.0 Release milestone Mar 3, 2022
@openinx openinx merged commit 79c8978 into apache:master Mar 3, 2022
@hililiwei hililiwei deleted the #3272 branch March 3, 2022 06:42
@hililiwei (Contributor Author)

Thank you all for reviewing.

@rdblue (Contributor) commented Mar 7, 2022

@openinx, @hililiwei, I think we need to revert this. It looks like the dataContext and deleteContext methods accept Hadoop Configuration objects rather than property maps.

Iceberg doesn't use Hadoop Configuration. We can either directly fix this to work like the Parquet contexts with another PR, or I think we should revert and open a replacement PR. What do you think is the right way forward?

@hililiwei (Contributor Author) commented Mar 8, 2022

@openinx, @hililiwei, I think we need to revert this. It looks like the dataContext and deleteContext methods accept Hadoop Configuration objects rather than property maps.

Iceberg doesn't use Hadoop Configuration. We can either directly fix this to work like the Parquet contexts with another PR, or I think we should revert and open a replacement PR. What do you think is the right way forward?

I roughly understand what you mean. Let me raise a separate PR; it may go beyond the scope of this one.

@openinx (Member) commented Mar 8, 2022

@rdblue The ORC#WriterBuilder copies all the key-value pairs from the property map into the Hadoop configuration, please see [1] and [2]. So in my view, we can just use the unified Hadoop configuration to read all of the config keys. Does this address your concern?

[1]. https://github.com/apache/iceberg/blob/master/orc/src/main/java/org/apache/iceberg/orc/ORC.java#L122
[2]. https://github.com/apache/iceberg/blob/master/orc/src/main/java/org/apache/iceberg/orc/ORC.java#L132

@hililiwei (Contributor Author)

public WriteBuilder set(String property, String value) {
  config.put(property, value);
  return this;
}

public WriteBuilder setAll(Map<String, String> properties) {
  config.putAll(properties);
  return this;
}

Configuration conf;
if (file instanceof HadoopOutputFile) {
  conf = ((HadoopOutputFile) file).getConf();
} else {
  conf = new Configuration();
}
for (Map.Entry<String, String> entry : config.entrySet()) {
  conf.set(entry.getKey(), entry.getValue());
}

The code above is mainly from Parquet. Do you mean we should use this approach as well? @openinx

@openinx (Member) commented Mar 8, 2022

Let's make this clearer:

In the ORC class, we just copy all the key-value pairs from the property map into the Hadoop configuration, and all subsequent config keys are parsed from that Hadoop configuration instance.

In the current Parquet class, we keep the property map in an in-memory hash map, the dataContext & deleteContext parse that map directly, and finally we copy the key-value pairs from the hash map into the Hadoop configuration.

In my mind, I just don't see a difference between the two approaches. Both look good to me.

@zhongyujiang (Contributor) commented Mar 8, 2022

@openinx, I think what @rdblue means is that with the current logic the Iceberg configuration might be overridden by the Hadoop configuration. For example: when the Hadoop configuration sets a property
"orc.compress" = "snappy"
and no corresponding property is set in the Iceberg table properties, ORC will currently use Snappy for compression as configured in Hadoop. But the correct behavior is to use the default codec defined in TableProperties.

So we need a props map to receive the table properties passed in, and then set the corresponding ORC properties in the Hadoop configuration after parsing, just like Parquet.
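
To make the precedence problem concrete, a hedged sketch of the two lookup orders. ORC_COMPRESSION and its default are assumed names for this example; "orc.compress" is ORC's native Hadoop key.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

class OrcCodecPrecedenceSketch {
  static final String ORC_COMPRESSION = "write.orc.compression-codec"; // assumed key
  static final String ORC_COMPRESSION_DEFAULT = "zlib";                // assumed default

  // Problematic order (as described above): the Hadoop conf is the source of
  // truth, so an environment-level orc.compress=snappy wins whenever the table
  // sets nothing.
  static String codecFromConfOnly(Configuration conf) {
    return conf.get("orc.compress", ORC_COMPRESSION_DEFAULT);
  }

  // Intended order: resolve the Iceberg property (and its Iceberg default)
  // from the table property map first, then push the result into the conf.
  static String codecFromTableProps(Map<String, String> config, Configuration conf) {
    String codec = config.getOrDefault(ORC_COMPRESSION, ORC_COMPRESSION_DEFAULT);
    conf.set("orc.compress", codec);
    return codec;
  }
}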

@rdblue (Contributor) commented Mar 8, 2022

For Parquet the Hadoop configuration is used to pass options into the data file. It is not used as the source of table configuration. Table configuration properties should never come from the Hadoop Configuration.

The steps should be:

  1. Get the Hadoop configuration, if it is present in the InputFile or OutputFile. If not, default it
  2. Get config from the builder and the table properties
  3. Set configuration from step 2 on the Hadoop config
  4. Create the reader or writer with the Hadoop config

The only configuration coming from the Hadoop Configuration itself is whatever was in the environment.

One possibly confusing thing is that we also set config values directly in the Hadoop configuration. That handles cases where the user wants to pass config properties that are not standardized in Iceberg. So you could use set("parquet.bloom.filter.enabled#id", "true") for example. Standardized table settings should override these.
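
A hedged sketch of that ordering inside a writer builder, modeled on the Parquet code quoted earlier. The class, the helper method, and the specific property key/default in step 3 are assumptions for illustration, not the actual ORC builder.

import java.util.Locale;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopOutputFile;
import org.apache.iceberg.io.OutputFile;

class OrcWriteConfigOrderingSketch {
  static Configuration resolveConf(OutputFile file, Map<String, String> config) {
    // Step 1: take the Hadoop conf from the OutputFile if present, otherwise default it.
    Configuration conf = (file instanceof HadoopOutputFile)
        ? ((HadoopOutputFile) file).getConf()
        : new Configuration();

    // Step 2: copy the builder/table property map into the conf so that
    // non-standardized, pass-through keys still reach the ORC writer...
    for (Map.Entry<String, String> entry : config.entrySet()) {
      conf.set(entry.getKey(), entry.getValue());
    }

    // Step 3: ...then resolve the standardized Iceberg settings with their
    // Iceberg defaults and apply them last, so they override both the
    // environment and the pass-through keys (assumed key/default below).
    String codec = config.getOrDefault("write.orc.compression-codec", "zlib");
    conf.set("orc.compress", codec.toUpperCase(Locale.ROOT));

    // Step 4: the caller creates the ORC writer with this conf.
    return conf;
  }
}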

@openinx (Member) commented Mar 9, 2022

Okay, I think I see the drawbacks of the current ORC builder (which copies the table properties map into the Hadoop conf and just uses the Hadoop conf as the source of truth):

  • As @zhongyujiang said, the default values of the standardized Iceberg settings will be replaced by Hadoop config values.
  • Non-standardized Iceberg settings added to the Iceberg table properties will be passed through to the underlying ORC readers & writers.
  • The standardized Iceberg settings will be added to the Hadoop configuration, which may pollute the Hadoop configuration.

I agree with making another PR to change the ORC config parsing as @rdblue suggested.

Thanks.
