
Conversation

@hililiwei (Contributor)

closes #3272

To be consistent with Parquet, it would be good to introduce write table properties for ORC too, and to make sure a separate configuration is available for ORC.
Discussed here: #2935 (comment)
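
For context, a minimal sketch of how the proposed properties could be set through the standard table API. The values below are illustrative, and the table is assumed to be loaded from a catalog elsewhere; this is not code from the PR itself.

import org.apache.iceberg.Table;

public class OrcWritePropertiesExample {
  // Minimal sketch: set the ORC write properties proposed in this PR on an
  // existing table. The values are illustrative; the caller passes a table
  // that was loaded from a catalog.
  public static void configureOrcWrites(Table table) {
    table.updateProperties()
        .set("write.orc.stripe-size-bytes", String.valueOf(64L * 1024 * 1024))
        .set("write.orc.block-size-bytes", String.valueOf(256L * 1024 * 1024))
        .commit();
  }
}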

@hililiwei (Contributor Author)

Reopening to rerun CI.

@hililiwei hililiwei closed this Dec 27, 2021
@hililiwei hililiwei reopened this Dec 27, 2021
@hililiwei hililiwei force-pushed the #3272 branch 2 times, most recently from 90a5d04 to f45f0bb on December 28, 2021 01:42
@hililiwei hililiwei requested a review from jackye1995 December 28, 2021 03:43
public static final String ORC_STRIPE_SIZE_BYTES = "write.orc.stripe-size-bytes";
public static final long ORC_STRIPE_SIZE_BYTES_DEFAULT = 64L * 1024 * 1024; // 64 MB

public static final String ORC_BLOCK_SIZE_BYTES = "write.orc.block-size-bytes";
@rdblue (Contributor) commented Dec 29, 2021

@omalley or @shardulm94, is this a config that we should expose as an Iceberg table setting? Is it still needed for object stores?

@pvary (Contributor) commented Jan 3, 2022

I think if we decide that we need the default configuration values set for Iceberg reads, then we should do this the same way as it has been done with the other file formats:

  • Parquet:

    static Context dataContext(Map<String, String> config) {
      int rowGroupSize = Integer.parseInt(config.getOrDefault(
          PARQUET_ROW_GROUP_SIZE_BYTES, PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT));
      int pageSize = Integer.parseInt(config.getOrDefault(
          PARQUET_PAGE_SIZE_BYTES, PARQUET_PAGE_SIZE_BYTES_DEFAULT));
      int dictionaryPageSize = Integer.parseInt(config.getOrDefault(
          PARQUET_DICT_SIZE_BYTES, PARQUET_DICT_SIZE_BYTES_DEFAULT));
      String codecAsString = config.getOrDefault(PARQUET_COMPRESSION, PARQUET_COMPRESSION_DEFAULT);
      CompressionCodecName codec = toCodec(codecAsString);
      String compressionLevel = config.getOrDefault(PARQUET_COMPRESSION_LEVEL, PARQUET_COMPRESSION_LEVEL_DEFAULT);
      return new Context(rowGroupSize, pageSize, dictionaryPageSize, codec, compressionLevel);
    }

  • Avro:

    static Context dataContext(Map<String, String> config) {
      String codecAsString = config.getOrDefault(AVRO_COMPRESSION, AVRO_COMPRESSION_DEFAULT);
      CodecFactory codec = toCodec(codecAsString);
      return new Context(codec);
    }

Basically we should create a Context object for the ORC as well.
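
For illustration only, a minimal sketch of what such an ORC context could look like, mirroring the Parquet dataContext above. The stripe-size key and its 64 MB default appear in this PR's diff; the block-size default, the compression key and default, and the Context shape are assumptions for the example, not necessarily what this PR merges.

import java.util.Locale;
import java.util.Map;
import org.apache.iceberg.util.PropertyUtil;
import org.apache.orc.CompressionKind;

class OrcContextSketch {
  // Stripe/block size keys and the 64 MB stripe default come from this PR's diff;
  // the block-size default and the compression key/default are assumed here.
  static final String ORC_STRIPE_SIZE_BYTES = "write.orc.stripe-size-bytes";
  static final long ORC_STRIPE_SIZE_BYTES_DEFAULT = 64L * 1024 * 1024;
  static final String ORC_BLOCK_SIZE_BYTES = "write.orc.block-size-bytes";
  static final long ORC_BLOCK_SIZE_BYTES_DEFAULT = 256L * 1024 * 1024;
  static final String ORC_COMPRESSION = "write.orc.compression-codec";
  static final String ORC_COMPRESSION_DEFAULT = "zlib";

  static class Context {
    final long stripeSize;
    final long blockSize;
    final CompressionKind codec;

    Context(long stripeSize, long blockSize, CompressionKind codec) {
      this.stripeSize = stripeSize;
      this.blockSize = blockSize;
      this.codec = codec;
    }
  }

  // Resolve the write settings from the table property map, falling back to
  // the Iceberg defaults, the same way Parquet's dataContext does.
  static Context dataContext(Map<String, String> config) {
    long stripeSize = PropertyUtil.propertyAsLong(
        config, ORC_STRIPE_SIZE_BYTES, ORC_STRIPE_SIZE_BYTES_DEFAULT);
    long blockSize = PropertyUtil.propertyAsLong(
        config, ORC_BLOCK_SIZE_BYTES, ORC_BLOCK_SIZE_BYTES_DEFAULT);
    String codecAsString = config.getOrDefault(ORC_COMPRESSION, ORC_COMPRESSION_DEFAULT);
    CompressionKind codec = CompressionKind.valueOf(codecAsString.toUpperCase(Locale.ROOT));
    return new Context(stripeSize, blockSize, codec);
  }
}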

@hililiwei (Contributor Author)

I think if we decide that we need the default configuration values set for Iceberg reads, then we should do this the same way as it has been done with the other file formats:
…………
Basically we should create a Context object for the ORC as well.

Added in the latest commit. PTAL, thx.

@hililiwei hililiwei requested a review from rdblue January 4, 2022 08:35
@pvary (Contributor) commented Jan 4, 2022

@hililiwei: Do we have any tests for this in Parquet/Avro?
I think it would be good to add some tests, so we can be sure that these configurations are working as expected.

@hililiwei hililiwei marked this pull request as draft January 4, 2022 17:04
@hililiwei hililiwei marked this pull request as ready for review January 5, 2022 08:15
@github-actions github-actions bot added the flink label Jan 5, 2022
@hililiwei (Contributor Author)

@hililiwei: Do we have any tests for this in Parquet/Avro? I think it would be good to add some tests, so we can be sure that these configurations are working as expected.

@pvary I checked, and there seem to be no relevant tests. I added one unit test for ORC in this PR. How about I open another PR for Parquet/Avro?
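
For reference, one way such a test could look, as a hedged JUnit 4 sketch. It is written against the hypothetical OrcContextSketch from the earlier comment, not against the class actually added in this PR.

import java.util.HashMap;
import java.util.Map;
import org.junit.Assert;
import org.junit.Test;

public class TestOrcWritePropertiesSketch {
  @Test
  public void testTablePropertyOverridesDefault() {
    Map<String, String> props = new HashMap<>();
    props.put("write.orc.stripe-size-bytes", String.valueOf(32L * 1024 * 1024));

    OrcContextSketch.Context context = OrcContextSketch.dataContext(props);

    // The explicit table property wins over the built-in default.
    Assert.assertEquals(32L * 1024 * 1024, context.stripeSize);
    // Unset properties fall back to the Iceberg defaults.
    Assert.assertEquals(OrcContextSketch.ORC_BLOCK_SIZE_BYTES_DEFAULT, context.blockSize);
  }
}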

@pvary (Contributor) commented Jan 5, 2022

@szlta: Would you mind taking a look at the tests?

Thanks,
Peter

@openinx (Member) left a comment

Almost looks good to me overall, just left several comments.

@openinx (Member) left a comment

Looks good to me now, thanks @hililiwei for the contribution, and thanks @rdblue and @jackye1995 for reviewing!

@openinx openinx changed the title from "ORC: Add ORC support for write properties" to "ORC: Add configurable write properties" Mar 3, 2022
@openinx openinx added this to the Iceberg 0.14.0 Release milestone Mar 3, 2022
@openinx openinx merged commit 79c8978 into apache:master Mar 3, 2022
@hililiwei hililiwei deleted the #3272 branch March 3, 2022 06:42
@hililiwei (Contributor Author)

Thank you all for reviewing.

@rdblue (Contributor) commented Mar 7, 2022

@openinx, @hililiwei, I think we need to revert this. It looks like the dataContext and deleteContext methods accept Hadoop Configuration objects rather than property maps.

Iceberg doesn't use Hadoop Configuration. We can either directly fix this to work like the Parquet contexts with another PR, or I think we should revert and open a replacement PR. What do you think is the right way forward?

@hililiwei (Contributor Author) commented Mar 8, 2022

@openinx, @hililiwei, I think we need to revert this. It looks like the dataContext and deleteContext methods accept Hadoop Configuration objects rather than property maps.

Iceberg doesn't use Hadoop Configuration. We can either directly fix this to work like the Parquet contexts with another PR, or I think we should revert and open a replacement PR. What do you think is the right way forward?

I roughly understand what you mean. Let me raise a separate PR; it may go beyond the scope of this one.

@openinx (Member) commented Mar 8, 2022

@rdblue The ORC#WriterBuilder copies all the key-value pairs from the property map into the Hadoop configuration, please see [1] and [2]. So in my view, we can just use the unified Hadoop configuration to read all of the config keys. Does this address your concern?

[1]. https://github.com/apache/iceberg/blob/master/orc/src/main/java/org/apache/iceberg/orc/ORC.java#L122
[2]. https://github.com/apache/iceberg/blob/master/orc/src/main/java/org/apache/iceberg/orc/ORC.java#L132

@hililiwei (Contributor Author)

public WriteBuilder set(String property, String value) {
  config.put(property, value);
  return this;
}

public WriteBuilder setAll(Map<String, String> properties) {
  config.putAll(properties);
  return this;
}

Configuration conf;
if (file instanceof HadoopOutputFile) {
  conf = ((HadoopOutputFile) file).getConf();
} else {
  conf = new Configuration();
}
for (Map.Entry<String, String> entry : config.entrySet()) {
  conf.set(entry.getKey(), entry.getValue());
}

The code above is mainly from Parquet. Do you mean we should use this approach as well? @openinx

@openinx (Member) commented Mar 8, 2022

Let's make this clearer:

In the ORC class, we just copy all the key-value pairs from the property map into the Hadoop configuration, and all subsequent config keys are parsed from that Hadoop configuration instance.

In the current Parquet class, we keep the property map in an in-memory hash map, the dataContext & deleteContext parse that map directly, and finally we copy the key-value pairs from the hash map into the Hadoop configuration.

In my mind, I just don't see a difference between the two approaches. Both look good to me.

@zhongyujiang (Contributor) commented Mar 8, 2022

@openinx, I think what @rdblue means is that with the current logic the Iceberg configuration might be overridden by the Hadoop configuration. For example: when the Hadoop configuration sets a property
"orc.compress" = "snappy"
and no corresponding property is set in the Iceberg table properties, ORC will currently use Snappy for compression as configured in Hadoop. But the correct behavior is to use the default codec defined in TableProperties.

So we need a props map to receive the table properties passed in, and then set the corresponding ORC properties in the Hadoop configuration after parsing, just like Parquet.
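
To make the precedence problem concrete, a hedged sketch of the two lookup orders. ORC_COMPRESSION and its default are assumed names for this example; "orc.compress" is ORC's native Hadoop key.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

class OrcCodecPrecedenceSketch {
  static final String ORC_COMPRESSION = "write.orc.compression-codec"; // assumed key
  static final String ORC_COMPRESSION_DEFAULT = "zlib";                // assumed default

  // Problematic order (as described above): the Hadoop conf is the source of
  // truth, so an environment-level orc.compress=snappy wins whenever the table
  // sets nothing.
  static String codecFromConfOnly(Configuration conf) {
    return conf.get("orc.compress", ORC_COMPRESSION_DEFAULT);
  }

  // Intended order: resolve the Iceberg property (and its Iceberg default)
  // from the table property map first, then push the result into the conf.
  static String codecFromTableProps(Map<String, String> config, Configuration conf) {
    String codec = config.getOrDefault(ORC_COMPRESSION, ORC_COMPRESSION_DEFAULT);
    conf.set("orc.compress", codec);
    return codec;
  }
}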

@rdblue (Contributor) commented Mar 8, 2022

For Parquet the Hadoop configuration is used to pass options into the data file. It is not used as the source of table configuration. Table configuration properties should never come from the Hadoop Configuration.

The steps should be:

  1. Get the Hadoop configuration, if it is present in the InputFile or OutputFile. If not, default it
  2. Get config from the builder and the table properties
  3. Set configuration from step 2 on the Hadoop config
  4. Create the reader or writer with the Hadoop config

The only configuration coming from the Hadoop Configuration itself is whatever was in the environment.

One possibly confusing thing is that we also set config values directly in the Hadoop configuration. That handles cases where the user wants to pass config properties that are not standardized in Iceberg. So you could use set("parquet.bloom.filter.enabled#id", "true") for example. Standardized table settings should override these.
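
A hedged sketch of that ordering inside a writer builder, modeled on the Parquet code quoted earlier. The class, the helper method, and the specific property key/default in step 3 are assumptions for illustration, not the actual ORC builder.

import java.util.Locale;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopOutputFile;
import org.apache.iceberg.io.OutputFile;

class OrcWriteConfigOrderingSketch {
  static Configuration resolveConf(OutputFile file, Map<String, String> config) {
    // Step 1: take the Hadoop conf from the OutputFile if present, otherwise default it.
    Configuration conf = (file instanceof HadoopOutputFile)
        ? ((HadoopOutputFile) file).getConf()
        : new Configuration();

    // Step 2: copy the builder/table property map into the conf so that
    // non-standardized, pass-through keys still reach the ORC writer...
    for (Map.Entry<String, String> entry : config.entrySet()) {
      conf.set(entry.getKey(), entry.getValue());
    }

    // Step 3: ...then resolve the standardized Iceberg settings with their
    // Iceberg defaults and apply them last, so they override both the
    // environment and the pass-through keys (assumed key/default below).
    String codec = config.getOrDefault("write.orc.compression-codec", "zlib");
    conf.set("orc.compress", codec.toUpperCase(Locale.ROOT));

    // Step 4: the caller creates the ORC writer with this conf.
    return conf;
  }
}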

@openinx (Member) commented Mar 9, 2022

Okay, I think I see the drawbacks of the current ORC builder (which copies the table properties map into the Hadoop conf and just uses the Hadoop conf as the source of truth):

  • As @zhongyujiang said, the default values of the standardized Iceberg settings will be replaced by Hadoop config values.
  • Non-standardized Iceberg settings added to the Iceberg table properties will be passed through to the underlying ORC readers & writers.
  • The standardized Iceberg settings will be added to the Hadoop configuration, which may pollute the Hadoop configuration.

I agree with making another PR to change the ORC config parsing as @rdblue suggested.

Thanks.
