Conversation

@marton-bod
Collaborator

Implement new FanoutPositionDeleteWriter, similar to the already-existing FanoutDataWriter class.
This new DeleteWriter implementation should provide an alternative to ClusteredPositionDeleteWriter for those users who are not inserting their data clustered by partition spec/partition values.

@aokolnychyi @openinx @pvary - can you please take a look? Thanks!

@jackye1995
Contributor

Overall looks good to me, just some nits in testing.

@marton-bod
Collaborator Author

Thanks for the review @jackye1995 !

@aokolnychyi
Contributor

@marton-bod, what is the use case for this type of writer? The spec requires the position delete records to be sorted by file_path and pos. If we are sorting the records anyway, why not include the spec_id and partition metadata columns as well? Is it because we don't support these metadata columns in some query engines?
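For context, the ordering the spec mandates for position delete files can be illustrated with a plain comparator over (file_path, pos). This is a minimal sketch: `PosDelete` is a hypothetical stand-in record, not Iceberg's actual `PositionDelete<T>` API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionDeleteOrdering {

    // Hypothetical stand-in for a position delete entry (file_path, pos).
    record PosDelete(String filePath, long pos) {}

    // Sort deletes by file_path, then pos, as the spec requires.
    static List<PosDelete> sortForSpec(List<PosDelete> deletes) {
        List<PosDelete> sorted = new ArrayList<>(deletes);
        sorted.sort(Comparator.comparing(PosDelete::filePath)
                              .thenComparingLong(PosDelete::pos));
        return sorted;
    }

    public static void main(String[] args) {
        List<PosDelete> deletes = List.of(
            new PosDelete("s3://warehouse/data/file-b.parquet", 12L),
            new PosDelete("s3://warehouse/data/file-a.parquet", 7L),
            new PosDelete("s3://warehouse/data/file-a.parquet", 3L));

        for (PosDelete d : sortForSpec(deletes)) {
            System.out.println(d.filePath() + " pos=" + d.pos());
        }
    }
}
```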


@Override
protected FileWriter<PositionDelete<T>, DeleteWriteResult> newWriter(PartitionSpec spec, StructLike partition) {
return new RollingPositionDeleteWriter<>(writerFactory, fileFactory, io, targetFileSizeInBytes, spec, partition);
@aokolnychyi
Contributor (Oct 27, 2021)

Do we support rolling ORC writers? Other writers such as ClusteredPositionDeleteWriter have a check here.

Contributor

I think it does not fail right now as we disable ORC tests in TestPartitioningWriters. We can enable them now.

Contributor

Yes, it seems that some of these tests can be re-enabled for ORC now, whereas previously they had an initial Assume to not run against ORC.

As an example:

Assume.assumeFalse("ORC delete files are not supported", fileFormat == FileFormat.ORC);

Collaborator Author

I think since ORC: Add DeleteWriteBuilder for format v2 (#3250) went in, we don't need any special treatment for the ORC format in the DeleteWriters. That means we could remove the Assume.assumeFalse() checks as well.

Contributor

What about enabling them in a separate PR?

@aokolnychyi
Contributor

This looks mostly good to me. We need to fix ORC writers and enable appropriate tests. I'd like to also know a little bit more about the use case.

@marton-bod
Collaborator Author

Thanks for reviewing @aokolnychyi! You're right about the spec mandating that delete entries must be sorted by file_path and file_pos. That's what we are doing on the Hive side as well, but we came across the problem that, since data files can be added via the API with any arbitrary name, an alphabetical sort of the file_paths can still lead to out-of-order partition values.

As for the spec_id and partition columns, to be honest I missed that they had been added to MetadataColumns :) I haven't tried them out yet, but I'm assuming you could include them in the sort too (e.g. sort by spec_id, partition, file_path, file_pos) and have your data perfectly clustered with their help.

I think there might still be some utility in keeping this writer implementation for problems similar to the one described above, but I'll leave it up to you. What do you think?
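The clustering issue described above can be sketched with a comparator that prefixes the sort key with the spec id and partition value: ordering by file_path alone can interleave partitions when file names are arbitrary, while the extended key keeps each partition contiguous. `Keyed` is a made-up type for illustration; the real metadata columns live in Iceberg's MetadataColumns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ClusteredDeleteOrdering {

    // Hypothetical row carrying the metadata-column sort key alongside
    // the position delete fields.
    record Keyed(int specId, String partition, String filePath, long pos) {}

    // Sort by (spec_id, partition, file_path, pos) so deletes stay
    // clustered by partition regardless of file naming.
    static List<Keyed> sortClustered(List<Keyed> deletes) {
        List<Keyed> sorted = new ArrayList<>(deletes);
        sorted.sort(Comparator.comparingInt(Keyed::specId)
                              .thenComparing(Keyed::partition)
                              .thenComparing(Keyed::filePath)
                              .thenComparingLong(Keyed::pos));
        return sorted;
    }

    public static void main(String[] args) {
        // Under a pure file_path ordering, "a-arbitrary.parquet" (day=2)
        // would sort before "b-arbitrary.parquet" (day=1); the extended
        // key restores partition order.
        List<Keyed> deletes = List.of(
            new Keyed(0, "day=2021-10-02", "a-arbitrary.parquet", 4L),
            new Keyed(0, "day=2021-10-01", "b-arbitrary.parquet", 1L));

        for (Keyed d : sortClustered(deletes)) {
            System.out.println(d.partition() + " " + d.filePath() + " pos=" + d.pos());
        }
    }
}
```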

@rdblue
Contributor

rdblue commented Oct 31, 2021

@marton-bod, I would rather not include this if we don't know that it is definitely needed. Otherwise it would be easy to start using it even though there are better ways to prepare the data so that it isn't needed. Does that make sense? If there's a way around adding a class, then we should avoid adding it.

@aokolnychyi
Contributor

Fanout writers may be inefficient so I'd rather use _spec_id and _partition metadata columns if possible. I'd be happy to review PRs to make those columns available in query engines other than Spark.

That being said, we should probably still enable ORC tests in TestPartitioningWriters.

@marton-bod
Collaborator Author

@rdblue @aokolnychyi Makes sense! Thanks for reviewing, I'll close this PR.
I'll open a new PR for enabling ORC tests in TestPartitioningWriters.
