Core: Add data and delete writers in FileAppenderFactory. #1836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
flink/src/main/java/org/apache/iceberg/flink/sink/FlinkAppenderFactory.java (review thread resolved)
```java
          .buildPositionWriter();

      default:
        throw new UnsupportedOperationException("Cannot write unknown file format: " + format);
```
I think it should be "unsupported" rather than "unknown" because ORC is known, but not supported.
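A minimal sketch of the suggested wording, reusing the switch arm from the diff above:

```java
default:
  // ORC is a known format here; it simply is not supported by this writer yet
  throw new UnsupportedOperationException("Cannot write unsupported file format: " + format);
```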
```java
public GenericAppenderFactory(Schema schema) {
public GenericAppenderFactory(Schema schema, PartitionSpec spec) {
  this(schema, spec, null, schema, null);
```
Why not set eqDeleteRowSchema to null since equalityFieldIds is null?
OK, it's more reasonable to set eqDeleteRowSchema to null when equalityFieldIds is null.
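A minimal sketch of the agreed change, assuming the five-argument constructor order (schema, spec, equalityFieldIds, eqDeleteRowSchema, posDeleteRowSchema) implied by the diff above:

```java
public GenericAppenderFactory(Schema schema, PartitionSpec spec) {
  // no equality field ids are configured, so there is no equality-delete row schema either
  this(schema, spec, null, null, null);
}
```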
```java
// before:
FileAppender<Record> appender = new GenericAppenderFactory(table.schema()).newAppender(

// after:
private static DataFile appendToLocalFile(Table table, File file, FileFormat format, StructLike partition,
                                          List<Record> records, PartitionSpec spec) throws IOException {
  FileAppender<Record> appender = new GenericAppenderFactory(table.schema(), spec).newAppender(
```
Could we keep the constructor this was using before so we don't need to change any tests that only use newAppender? There are 4 files just here that don't appear to need any change other than adding a spec that won't be used.
I think I've counted at least 10 files that would not need to change if we kept the original constructor.
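A minimal sketch of keeping the schema-only constructor as an overload so test code that only calls newAppender does not need to change; defaulting to PartitionSpec.unpartitioned() is an assumption here, not a quote from the PR:

```java
public GenericAppenderFactory(Schema schema) {
  // delegate to the spec-aware constructor with an unpartitioned spec
  this(schema, PartitionSpec.unpartitioned());
}
```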
```java
public <T> PositionDeleteWriter<T> buildPositionWriter() throws IOException {
public <T> PositionDeleteWriter<T> buildPositionWriter(ParquetValueWriters.PathPosAccessor<?, ?> accessor)
```
I'd rather not add this to the build method. Nothing distinguishes it from other options, so I think we should add transformations as configuration methods, like we do for equalityFieldIds.

I'm also thinking that it would be good to have a more light-weight way to add these transforms. Rather than an additional accessor that has two abstract methods, why not just register functions? It could look like this:

```java
Avro.writeDeletes(outFile)
    ...
    .transformPaths(StringData::fromString)
    .buildPositionWriter();
```

That way it's easier to use a method reference rather than creating a class. And nothing actually needs to transform pos yet, so we can just leave that out.
I like the idea of registering a light-weight function to convert the CharSequence to StringData.
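A minimal sketch of what such a configuration method might look like on the delete-writer builder; the DeleteWriteBuilder name and the field shown here are assumptions based on this discussion, not the exact code in the PR:

```java
import java.util.function.Function;

// inside the (assumed) DeleteWriteBuilder class
private Function<CharSequence, ?> pathTransformFunc = Function.identity();

public DeleteWriteBuilder transformPaths(Function<CharSequence, ?> newPathTransformFunc) {
  this.pathTransformFunc = newPathTransformFunc;
  return this;
}
```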
```java
Record record = GenericRecord.create(table.spec().schema()).copy(ImmutableMap.of("data", "aaa"));
```
I think this could just be table.schema(). No need to go through the spec to get the table schema.
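The suggested simplification as a one-line sketch:

```java
Record record = GenericRecord.create(table.schema()).copy(ImmutableMap.of("data", "aaa"));
```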
```java
List<T> deletes = Lists.newArrayList(
    createRow(1, "aaa"),
    createRow(3, "bbb"),
```
If the data here is bbb instead of ccc on purpose, then could you add a comment that this is testing that just id is used for comparison?
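A minimal sketch of the requested comment on the test data; the assumption here is that the table row for id 3 carries different data, so the mismatch exercises the equality-field comparison:

```java
List<T> deletes = Lists.newArrayList(
    createRow(1, "aaa"),
    // "bbb" intentionally differs from the row stored in the table for id = 3,
    // verifying that only the equality field (id) is used for the delete comparison
    createRow(3, "bbb"));
```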
```java
    createRow(2, "bbb"),
    createRow(4, "ddd")
);
Assert.assertEquals("Should have the expected records", expectedRowSet(expected), actualRowSet("*"));
```
Instead of reading from the table, I would rather see a test that the equality delete file contains the expected row data. In this case, it should not contain the data column. I would like to see that checked. And it would be good to add a case where the whole original row is written to the file.
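A minimal sketch of what such a check could look like; readRecords is a hypothetical test helper that reads the written delete file back with the given schema, and eqDeleteRowSchema is assumed to project only the id column:

```java
// read the equality delete file back; it should contain only the projected equality fields
List<Record> written = readRecords(format, eqDeleteRowSchema, eqDeleteFile.path());

Assert.assertEquals("Equality delete file should not contain the data column",
    Lists.newArrayList(1, 3),
    written.stream().map(rec -> rec.getField("id")).collect(Collectors.toList()));
```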
```java
    createRow(2, "bbb"),
    createRow(4, "ddd")
);
Assert.assertEquals("Should have the expected records", expectedRowSet(expected), actualRowSet("*"));
```
Similar to above, I think this should check that only the path and position columns are written to the file and that they are the expected values. The test below should check that the row column is present and set correctly for each row.
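A minimal sketch of the position-delete check, again using the hypothetical readRecords helper; pathPosSchema is an assumed schema containing only the file_path and pos metadata columns:

```java
// read the position delete file back and verify it only records the path and position
List<Record> posDeletes = readRecords(format, pathPosSchema, posDeleteFile.path());

Record first = posDeletes.get(0);
Assert.assertEquals("Should record the data file path", dataFile.path().toString(),
    first.getField("file_path").toString());
Assert.assertEquals("Should record the deleted position", 0L, first.getField("pos"));
```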
```java
StructType structType = SparkSchemaUtil.convert(schema);
// before:
SparkAppenderFactory appenderFactory = new SparkAppenderFactory(properties, schema, structType);
// after:
SparkAppenderFactory appenderFactory = new SparkAppenderFactory(properties, schema, structType, spec);
```
Here I use the constructor that has the spec argument because I believe we will use the DataWriter to append records once we switch to the RollingFileWriter: https://github.com/apache/iceberg/pull/1818/files#diff-fc9a9fd84d24c607fd85e053b08a559f56dd2dd2a46f1341c528e7a0269f873cR263.
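A minimal sketch of how the spec-aware factory could then be used to build a data writer; the newDataWriter call and the writer methods used here are assumptions based on this PR's description, not a quote of the final API:

```java
SparkAppenderFactory appenderFactory = new SparkAppenderFactory(properties, schema, structType, spec);

// assumed signature: newDataWriter(EncryptedOutputFile, FileFormat, StructLike)
DataWriter<InternalRow> writer = appenderFactory.newDataWriter(outputFile, format, partition);
try {
  for (InternalRow row : rows) {
    writer.add(row);   // assumed method for appending rows
  }
} finally {
  writer.close();
}
DataFile dataFile = writer.toDataFile();
```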
```java
private StructLike partition = null;
private EncryptionKeyMetadata keyMetadata = null;
private int[] equalityFieldIds = null;
private Function<CharSequence, ?> pathTransformFunc = t -> t;
```
Nit: I think it would be better to use Function.identity().
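The suggested default as a one-line sketch:

```java
private Function<CharSequence, ?> pathTransformFunc = Function.identity();
```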
```java
public <T> PositionDeleteWriter<T> buildPositionWriter() throws IOException {
public <T> PositionDeleteWriter<T> buildPositionWriter()
    throws IOException {
```
Does this line need to change? I'm fine removing the empty line, but I think throws can still fit on the previous line.
I can revert it.
```java
appenderBuilder.createWriterFunc(parquetSchema ->
    new PositionDeleteStructWriter<T>((StructWriter<?>) GenericParquetWriter.buildWriter(parquetSchema)));
// changed to:
appenderBuilder.createWriterFunc(parquetSchema ->
    new PositionDeleteStructWriter<T>((StructWriter<?>) GenericParquetWriter.buildWriter(parquetSchema),
        t -> t));
```
Shouldn't this pass pathTransformFunc as well?
We shouldn't pass pathTransformFunc here, because in this path we will use GenericParquetWriter (rather than FlinkParquetWriter or SparkParquetWriter) to write the PositionDelete. If we converted the path CharSequence to StringData, the GenericParquetWriter could not find the correct writer to write the values.
It's good to use Function.identity() here.
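A minimal sketch of the agreed fix for the generic Parquet path, using the two-argument PositionDeleteStructWriter constructor shown in the diff above:

```java
// the generic path keeps delete paths as CharSequence, so the identity transform
// is used here instead of the engine-specific pathTransformFunc
appenderBuilder.createWriterFunc(parquetSchema ->
    new PositionDeleteStructWriter<T>(
        (StructWriter<?>) GenericParquetWriter.buildWriter(parquetSchema),
        Function.identity()));
```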
rdblue left a comment:
+1 overall. There is only one issue, which is the case where the new transform func isn't passed in Parquet. I'm not sure whether that is a bug, so please have a look.
Once that's done, please merge!
All checks passed, let me merge this PR. Thanks @rdblue for reviewing.

This is a PR that was split out from #1818.