Add sort order to writer classes #2214
Conversation
The code change looks good to me, but I am a bit confused about the scope of this PR (I only read the Spark path, but the same applies to Flink). In
Thanks Jack for reviewing the PR!
I think
Yes, in this PR I was trying to add the capability of passing sortOrder info to writers within the Iceberg library. I did think about expanding the support to
Yes I was referring to the appender. Thanks for the clarification.
Okay I see, in that case this should be fine.
```java
    .withPartition(partition)
    .equalityFieldIds(equalityFieldIds)
    .withKeyMetadata(file.keyMetadata())
    .withSortOrder(sortOrder)
```
@yyanyy, I think we cannot mark files with the current sort order id by default as there is no guarantee the incoming records are sorted. If we had an API to request a specific distribution and ordering in a query engine (will be possible in Spark 3.2, for example), only then we could annotate the files.
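The constraint described above can be sketched as metadata that records a sort order id only when the caller has explicitly guaranteed ordering. The class and method names below are hypothetical illustrations, not Iceberg's actual API:

```java
// Hypothetical sketch: file metadata that carries a sort order id only when
// the caller has guaranteed the incoming rows are actually sorted.
class DataFileMetadata {
    private final Integer sortOrderId; // null means "unsorted / unknown"

    private DataFileMetadata(Integer sortOrderId) {
        this.sortOrderId = sortOrderId;
    }

    // Default path: no ordering guarantee, so no sort order id is recorded.
    static DataFileMetadata unsorted() {
        return new DataFileMetadata(null);
    }

    // Only appropriate once an engine-side API (e.g. Spark 3.2's requested
    // distribution and ordering) has guaranteed the rows arrive sorted.
    static DataFileMetadata sortedBy(int sortOrderId) {
        return new DataFileMetadata(sortOrderId);
    }

    Integer sortOrderId() {
        return sortOrderId;
    }
}
```

The point of the two factory methods is that annotating a file with the table's current sort order id is an opt-in assertion, never a default.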
Thank you for the review! Yes, I wasn't sure when sortOrder would be available from the engines, and it is currently always null. I assumed that this information could be available when the engine constructs the appender factory, like SparkAppenderFactory, since the factory seems to be created at the task level in Spark, so sortOrder could be assigned per task. But since I don't know the details of each engine, I'll revert the changes to each appender factory for now.
cc @szehon-ho @RussellSpitzer @rdblue who participated in #2240. This PR has been open for a while. I think it is a great basis for propagating the sort order to writers, but I am afraid we cannot mark written files with the current sort order unless we are sure those files are sorted. That will require a query engine API to request a distribution and ordering. For Spark, it will be possible only in 3.2. What about the other query engine integrations, @openinx @pvary? If my understanding is correct, what about limiting the scope of this PR to just propagating the sort order to writers?
Thanks for updating the PR, @yyanyy! Let me take a look now.
aokolnychyi
left a comment
The change looks correct to me. I had just a couple of questions.
```java
    return this;
  }

  public DeleteWriteBuilder withSortOrder(SortOrder newSortOrder) {
```
We add an API for setting the sort order for equality deletes. Shall we also add the ability to set an order for regular data files? We won't immediately use it but it will make the code more consistent.
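The suggested symmetry could look roughly like the fluent sketch below. This is a simplified stand-in class, not Iceberg's real write builder, and it uses a plain string in place of SortOrder for illustration:

```java
// Hypothetical sketch: both the data-file and delete-file builders would
// expose withSortOrder in the same fluent style as the diff above.
class SketchWriteBuilderWithOrder {
    private String sortOrder; // stand-in for Iceberg's SortOrder; null = unsorted

    SketchWriteBuilderWithOrder withSortOrder(String newSortOrder) {
        this.sortOrder = newSortOrder;
        return this; // fluent, so it chains like .withKeyMetadata(...)
    }

    String build() {
        // a real builder would construct a writer; here we just surface the state
        return sortOrder == null ? "unsorted" : "sorted by " + sortOrder;
    }
}
```

Even if the data path never sets an order today, having the same entry point on both builders keeps the two APIs consistent.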
If we decide to do that, we should probably cover ORC.
Thank you for the review! I think setting an order for regular data files is covered by the changes in the DataWriter file: for Avro/Parquet/ORC, what WriteBuilder builds is actually a FileAppender that doesn't track file properties, and it's the DataWriter wrapping that file appender that actually populates this information into a DataFile when it closes. I do think this is a bit confusing, and we may want to refactor these classes so that delete and data writers are constructed in a more similar way.
Regarding ORC: the reason I did not change ORC is that it does not seem to support a delete writer. I wasn't sure if that was intentional, or just because no one has had time to do it. If the latter is the case, I'm happy to create a separate PR to add it.
Oh, yeah, you are right. I think we should refactor this to make it clear (independently of this PR).
There are a few points to discuss.
The first one is FileAppenderFactory. It was originally created to produce FileAppender instances only, which made sense. Now it creates data and delete writers too, which makes less sense to me. I'd consider deprecating the writer-creation methods in FileAppenderFactory and creating a WriterFactory instead:
```
WriterFactory {
  newDataWriter(...)
  newEqualityDeleteWriter(...)
  newPositionDeleteWriter(...)
}
```
Second, I'd consider adding a method called buildWriter to our Parquet, Avro, ORC classes that would create and initialize DataWriter, making it consistent with our delete writers. Otherwise, it looks a bit weird.
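A hedged sketch of what a buildWriter step like this could look like, using simplified stand-in types (Appender, SketchDataWriter, SketchWriteBuilder are illustrations, not Iceberg's real Parquet/Avro/ORC builders):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for FileAppender and DataWriter; not Iceberg's
// real classes, just an illustration of the proposed shape.
interface Appender<T> {
    void add(T row);
    long length();
}

class SketchDataWriter<T> {
    private final Appender<T> appender;

    SketchDataWriter(Appender<T> appender) {
        this.appender = appender;
    }

    void write(T row) {
        appender.add(row);
    }

    long length() {
        return appender.length();
    }
}

// The proposed buildWriter() would create the appender and wrap it in a
// data writer in one step, mirroring how the delete writers are built today.
class SketchWriteBuilder<T> {
    Appender<T> buildAppender() {
        List<T> rows = new ArrayList<>();
        return new Appender<T>() {
            public void add(T row) { rows.add(row); }
            public long length() { return rows.size(); }
        };
    }

    SketchDataWriter<T> buildWriter() {
        return new SketchDataWriter<>(buildAppender());
    }
}
```

With buildWriter on each format's builder, callers would never assemble a DataWriter from a raw appender themselves, matching the delete-writer entry points.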
Thoughts, @yyanyy @rdblue @openinx @jackye1995?
W.r.t ORC, I think we just did not have time to implement it. I'd be happy to review if there is a PR.
I think the first sounds good to me! Both DataWriter and *DeleteWriter are wrappers around an appender, but today, to construct the latter, we have to specify everything in the appender factories for each file format, and the Parquet and Avro ones actually pass everything directly to the underlying appender. I wonder if we can have separate newAppender and newDeleteAppender methods in FileAppenderFactory (or even generalize them into one) so that we don't do this delegation per file format, and move the three methods you mentioned to WriterFactory to separate the two factories. In this case WriterFactory may always rely on a FileAppenderFactory in its constructor, which might still be fine.
For adding buildWriter to the Parquet, Avro, and ORC classes, I'm not entirely sure, since DataWriter does not differ much among the three file formats, so adding it to each file format may actually increase code duplication. I haven't looked into the code deeply, but I wonder if we can abstract appender creation out of the delete builders (similar to the above), so that the new WriterFactory can create delete writers as simply as it creates data writers today, and we would just have this logic in each WriterFactory.
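The composition being suggested here could be sketched as follows; all type and method names (SketchAppender, SketchAppenderFactory, SketchWriterFactory) are hypothetical stand-ins, not Iceberg's API:

```java
// Hypothetical sketch: a format-agnostic WriterFactory that delegates
// appender creation to a per-format FileAppenderFactory, so the
// writer-wrapping logic is written once rather than per file format.
interface SketchAppender<T> {
    void add(T row);
}

interface SketchAppenderFactory<T> {
    SketchAppender<T> newAppender();        // for data files
    SketchAppender<T> newDeleteAppender();  // for delete files
}

class SketchWriterFactory<T> {
    private final SketchAppenderFactory<T> appenders;

    SketchWriterFactory(SketchAppenderFactory<T> appenders) {
        this.appenders = appenders;
    }

    SketchAppender<T> newDataWriter() {
        // file-level bookkeeping (partition, key metadata, sort order)
        // would be added here, once, for every file format
        return appenders.newAppender();
    }

    SketchAppender<T> newEqualityDeleteWriter() {
        return appenders.newDeleteAppender();
    }
}
```

The appeal of this split is that adding a new file format only requires a new appender factory, while the writer logic stays shared.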
I'd need to take a closer look too. However, I do like the idea of abstracting appenders from writers. Let's discuss this in a separate issue, @yyanyy.
```java
    this(appender, format, location, spec, partition, keyMetadata, null);
  }

  public DataWriter(FileAppender<T> appender, FileFormat format, String location,
```
We instantiate DataWriter in FileAppenderFactory implementations. It is absolutely correct that we don't propagate the sort order id now. However, should we consider adding a comment that explains why? Like we cannot guarantee the incoming records are sorted as needed?
Sorry, just to make sure I understand the suggestion correctly: do you mean adding a comment here to discourage people from using this new constructor for now?
I was referring to the place where, for example, SparkAppenderFactory instantiates DataWriter:
```java
@Override
public DataWriter<InternalRow> newDataWriter(EncryptedOutputFile file, FileFormat format, StructLike partition) {
  return new DataWriter<>(newAppender(file.encryptingOutputFile(), format), format,
      file.encryptingOutputFile().location(), spec, partition, file.keyMetadata());
}
```
Here we are using the constructor without sort order. I thought about a comment explaining why but I think we should do that later. We will eventually add sort order to classes like SparkAppenderFactory and then can add the comment.
To sum up, let's ignore my original comment for now.
```java
    this(appender, format, location, spec, partition, keyMetadata, null, equalityFieldIds);
  }

  public EqualityDeleteWriter(FileAppender<T> appender, FileFormat format, String location,
```
Do we want to deprecate the old constructor? Not a big deal, but I am a bit worried about maintaining a new constructor each time we modify this class. At the same time, converting this into a builder is probably overkill. This is a pretty internal API too, so I'd be fine with breaking it.
Good point. I think we can remove the old constructor in this case: this class supports v2 tables and thus shouldn't have many production dependencies outside the library packages, so we should be able to do it now. I will update the PR to change the existing constructor unless people have further comments on this thread.
I'd be fine keeping it if there is a valid use case, but I'd hope folks use a higher-level API for writing deletes.
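For illustration, the two options debated in this thread look roughly like the sketch below. SketchEqualityDeleteWriter is a simplified stand-in, not the real EqualityDeleteWriter signature, and a string stands in for SortOrder:

```java
// Simplified stand-in illustrating the trade-off: keep a deprecated
// delegating constructor, or delete it in favor of the new signature.
class SketchEqualityDeleteWriter {
    private final String sortOrder; // stand-in for SortOrder; null = none

    // New constructor: callers pass the sort order explicitly (possibly null).
    SketchEqualityDeleteWriter(String location, String sortOrder) {
        this.sortOrder = sortOrder;
    }

    // Option A: keep the old constructor but deprecate it, delegating with a
    // null sort order. Option B, which the thread leans toward for this
    // internal API, is simply removing this constructor.
    @Deprecated
    SketchEqualityDeleteWriter(String location) {
        this(location, null);
    }

    String sortOrder() {
        return sortOrder;
    }
}
```

The delegating pattern keeps one source of truth for initialization either way, so dropping the old constructor later is a one-line deletion.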
aokolnychyi
left a comment
We may avoid keeping the old constructors but this change looks good to me.
Looks like some CI job hung. Triggering another testing round.
Thanks for working on this, @yyanyy! And thanks for the review, @aokolnychyi. I agree that we can probably improve the appender factory or replace it with something for all writers. It's a little messy there, so I'm glad you're thinking about how to clean it up. Thanks!
Follow-up of #1975.