
Conversation

@openinx (Member) commented Aug 17, 2020

This patch wraps Flink's DataStream as a StreamTable, which allows users to use SQL to insert records into an Iceberg table; it aims to provide an experience similar to Spark SQL. Currently, this patch depends on #1185.
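For illustration, a minimal sketch of the user experience this patch aims for, using the Flink 1.11 Table API; the catalog name, database, table, and view names below are hypothetical:

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class FlinkSqlInsertSketch {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(
        env, EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

    // Wrap an arbitrary DataStream as a temporary table.
    DataStream<Row> stream = env
        .fromElements(Row.of(1, "a"), Row.of(2, "b"))
        .returns(Types.ROW(Types.INT, Types.STRING));
    tEnv.createTemporaryView("source_table", tEnv.fromDataStream(stream, $("id"), $("data")));

    // Insert into an Iceberg table registered through the Flink catalog.
    tEnv.executeSql("INSERT INTO iceberg_catalog.db.sample SELECT id, data FROM source_table");
  }
}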

@rdblue (Contributor) commented Aug 29, 2020

@openinx, can you rebase this? I think it is next up for review, right?


return returnStream.addSink(new DiscardingSink())
.name(String.format("IcebergSink %s", table.toString()))
.setParallelism(1);
Contributor

Since we are already returning the DataStream, would it make sense to avoid the discarding sink and possibly let people stream the iceberg commit files instead? Like what if I wanted to also feed them into kafka?

Member Author

You mean you want to feed the committed data files to Kafka? Is that meaningful for users? It would be easier to justify if we had such use cases, I guess.

Some context

In the first sink version, I made IcebergFilesCommitter implement SinkFunction, so we could chain the function via addSink directly. But we found that did not work for bounded streams, because there was no interface/method to indicate that the stream is bounded, so we had no way to commit the remaining data files to the Iceberg table when the stream reached its end. So we had to switch to AbstractStreamOperator and implement the BoundedOneInput interface. Finally, in this version we transform the data stream twice (first: RowData -> DataFile, second: DataFile -> Void) and then add a discarding sink.
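To make that concrete, here is a minimal, self-contained sketch of the pattern (not the actual classes from this patch): an AbstractStreamOperator that implements BoundedOneInput, so it receives endInput() on bounded streams, chained via transform() and terminated with a DiscardingSink. The class and variable names are illustrative only.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class BoundedCommitterSketch {

  /** Stands in for the committer: buffers "files" and commits them at checkpoints and at end of input. */
  static class CommitterOperator extends AbstractStreamOperator<Void>
      implements OneInputStreamOperator<String, Void>, BoundedOneInput {

    @Override
    public void processElement(StreamRecord<String> element) {
      // buffer the incoming "data file" until the next commit point
    }

    @Override
    public void endInput() {
      // called only for bounded input: flush and commit the remaining files here
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> dataFiles = env.fromElements("file-1", "file-2");

    dataFiles
        .transform("Committer", Types.VOID, new CommitterOperator())
        .setParallelism(1)
        // a sink is still needed to terminate the pipeline, so a no-op sink is attached
        .addSink(new DiscardingSink<>())
        .setParallelism(1);

    env.execute();
  }
}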

@openinx (Member Author) commented Aug 31, 2020

@rdblue Yeah, thanks for merging, let me rebase this.

@openinx (Member Author) commented Sep 1, 2020

Ping @rdblue @JingsongLi for review, thanks.

TableIdentifier icebergIdentifier = catalog.toIdentifier(objectPath);
try {
Table table = catalog.loadIcebergTable(objectPath);
return new IcebergTableSink(icebergIdentifier, table,
Contributor

I think it is better to just pass a table loader to the sink; the source and sink can then reuse this loader creation function, just like in:
https://github.com/apache/iceberg/pull/1293/files#diff-0ad7dfff9cfa32fbb760796d976fd650R61
What do you think?
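A rough sketch of that suggestion; TableLoader.fromCatalog follows the linked PR, while the IcebergTableSink constructor arguments shown here are just assumptions for illustration:

TableLoader tableLoader = TableLoader.fromCatalog(catalogLoader, catalog.toIdentifier(objectPath));
// hypothetical constructor: the sink only needs a way to (re)load the table, not the identifier
return new IcebergTableSink(isStreaming, tableLoader, table.schema());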

Member Author

Makes sense to me. We also don't need to pass the icebergIdentifier to IcebergTableSink, which makes the code simpler.


@Override
public TableSink<RowData> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes) {
if (!Arrays.equals(tableSchema.getFieldNames(), fieldNames)) {
Contributor

This is a deprecated method that no one will call; you can just return this.
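A tiny sketch of that suggestion, keeping the signature from the quoted diff:

@Override
public TableSink<RowData> configure(String[] fieldNames, TypeInformation<?>[] fieldTypes) {
  // Deprecated hook that the Blink planner no longer calls; keep the sink unchanged.
  return this;
}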

Member Author

OK, I see. Will do.

EnvironmentSettings settings = EnvironmentSettings
.newInstance()
.useBlinkPlanner()
.inStreamingMode()
Contributor

Can we use Parameterized for batch too?

Member Author

That's a great idea, we could reuse almost all of the code then.
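A hedged sketch of what that parameterization could look like; the class name, parameter wiring, and test body below are illustrative rather than the exact code in this patch:

import java.util.Arrays;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestFlinkTableSinkModes {

  @Parameterized.Parameters(name = "isStreaming={0}")
  public static Iterable<Object[]> parameters() {
    return Arrays.asList(new Object[] {true}, new Object[] {false});
  }

  private final boolean isStreaming;

  public TestFlinkTableSinkModes(boolean isStreaming) {
    this.isStreaming = isStreaming;
  }

  private TableEnvironment getTableEnv() {
    EnvironmentSettings.Builder settings = EnvironmentSettings.newInstance().useBlinkPlanner();
    return TableEnvironment.create(
        isStreaming ? settings.inStreamingMode().build() : settings.inBatchMode().build());
  }

  @Test
  public void testInsertInto() {
    TableEnvironment tEnv = getTableEnv();
    // the rest of the test body is identical for streaming and batch modes
  }
}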

tEnv.executeSql(String.format("create catalog iceberg_catalog with (" +
"'type'='iceberg', 'catalog-type'='hadoop', 'warehouse'='%s')", warehouse));
tEnv.executeSql("use catalog iceberg_catalog");
tEnv.getConfig().getConfiguration().set(TableConfigOptions.TABLE_DYNAMIC_TABLE_OPTIONS_ENABLED, true);
Contributor

Looks like there are no dynamic table options (table hints) used here.

Member Author

OK, it could be removed now.

DataStream<RowData> stream = generateInputStream(rows);

// Register the rows into a temporary table named 'sourceTable'.
tEnv.createTemporaryView("sourceTable", tEnv.fromDataStream(stream, $("id"), $("data")));
Contributor

Can we use TableEnvironment.fromValues?
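For reference, a small self-contained sketch of the fromValues alternative (Flink 1.11 Table API); the view name and column types just mirror the quoted test code:

import static org.apache.flink.table.api.Expressions.row;

import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class FromValuesSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

    // Build the source rows directly from literals instead of wrapping a DataStream.
    Table source = tEnv.fromValues(
        DataTypes.ROW(
            DataTypes.FIELD("id", DataTypes.INT()),
            DataTypes.FIELD("data", DataTypes.STRING())),
        row(1, "hello"),
        row(2, "world"));
    tEnv.createTemporaryView("sourceTable", source);
  }
}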

SimpleDataUtil.assertTableRecords(warehouse.concat("/default/sourceTable"), expected);
}

private static void waitComplete(TableResult result) {
Contributor

You can just add a method like:

  def execInsertSqlAndWaitResult(tEnv: TableEnvironment, insert: String): JobExecutionResult = {
    tEnv.executeSql(insert).getJobClient.get
      .getJobExecutionResult(Thread.currentThread.getContextClassLoader)
      .get
  }
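The same helper in Java, assuming the Flink 1.11 JobClient API where getJobExecutionResult still takes a ClassLoader:

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.table.api.TableEnvironment;

public class SqlTestUtil {
  public static JobExecutionResult execInsertSqlAndWaitResult(TableEnvironment tEnv, String insert)
      throws Exception {
    // Submit the INSERT and block until the job finishes.
    return tEnv.executeSql(insert)
        .getJobClient()
        .get()
        .getJobExecutionResult(Thread.currentThread().getContextClassLoader())
        .get();
  }
}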

import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.flink.sink.FlinkSink;

public class IcebergTableSink implements AppendStreamTableSink<RowData> {
Contributor

We can add TODOs for these interfaces:
- Implement OverwritableTableSink, so that in Flink SQL users can write: INSERT OVERWRITE t ...
- Implement PartitionableTableSink, so that users can write: INSERT OVERWRITE/INTO t PARTITION(...)
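A rough skeleton of what implementing those two interfaces could look like, using the Flink 1.11 legacy sink interfaces; the field names and the abstract class shape are illustrative only:

import java.util.Map;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.sinks.AppendStreamTableSink;
import org.apache.flink.table.sinks.OverwritableTableSink;
import org.apache.flink.table.sinks.PartitionableTableSink;

public abstract class IcebergTableSinkSketch
    implements AppendStreamTableSink<RowData>, OverwritableTableSink, PartitionableTableSink {

  private boolean overwrite = false;
  private Map<String, String> staticPartitions;

  @Override
  public void setOverwrite(boolean overwrite) {
    // set by the planner when the statement is INSERT OVERWRITE
    this.overwrite = overwrite;
  }

  @Override
  public void setStaticPartition(Map<String, String> partitions) {
    // set by the planner for INSERT ... PARTITION(part='value') clauses
    this.staticPartitions = partitions;
  }
}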

Member Author

Thanks for the reminder. Rather than only adding the TODO comments, I will try to implement those two interfaces in the next patch.

import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestFlinkTableSink extends AbstractTestBase {
Member Author

Thinking about this unit test again, we'd better extend FlinkCatalogTestBase so that we can cover both the Hive and Hadoop catalog cases.

sql("USE %s", DATABASE);

Map<String, String> properties = ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format.name());
this.icebergTable = validationCatalog
Member Author

We could use Flink DDL to create the table here once #1393 is merged.

Contributor

It was merged!
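For illustration, a hedged sketch of the Flink DDL version, using the sql() helper from the test base; TABLE_NAME is a hypothetical constant, and passing the write format through the WITH clause is an assumption:

sql("CREATE TABLE %s (id INT, data STRING) WITH ('write.format.default'='%s')",
    TABLE_NAME, format.name());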

try (TableLoader loader = tableLoader) {
this.table = loader.loadTable();
} catch (IOException e) {
throw new UncheckedIOException("Failed to load iceberg table.", e);
Contributor

Minor: it would be nice to have more context here. Maybe the table loader should define a toString that could be used in the error message here.

Member Author

Defining the toString sounds good to me.
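A small sketch of the resulting error message; the assumption here is that TableLoader implementations override toString() with their catalog or location details:

try (TableLoader loader = tableLoader) {
  this.table = loader.loadTable();
} catch (IOException e) {
  // the loader's toString gives the reader the catalog/path context for the failure
  throw new UncheckedIOException("Failed to load iceberg table from " + tableLoader, e);
}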

}

private void replacePartitions(List<DataFile> dataFiles, long checkpointId) {
ReplacePartitions dynamicOverwrite = table.newReplacePartitions();
Contributor

I just want to note that we don't encourage the use of ReplacePartitions because the data it deletes is implicit. It is better to specify what data should be overwritten, like in the new API for Spark:

df.writeTo("iceberg.db.table").overwrite($"date" === "2020-09-01")

If Flink's semantics are to replace partitions for overwrite, then it should be okay. But I highly recommend being more explicit about data replacement.
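For comparison, a sketch of the more explicit Iceberg overwrite API mentioned above (Table.newOverwrite() with a row filter), applied to the same table and dataFiles as in the surrounding committer code; the filter column and value are illustrative:

OverwriteFiles overwrite = table.newOverwrite()
    .overwriteByRowFilter(Expressions.equal("date", "2020-09-01"));
for (DataFile file : dataFiles) {
  overwrite.addFile(file);
}
overwrite.commit();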

@openinx (Member Author) Sep 3, 2020

Yes, Flink's semantics are to replace partitions for overwrite; see the references: 1, 2

Table newTable = new HadoopTables().load(tablePath);
try (CloseableIterable<Record> iterable = IcebergGenerics.read(newTable).build()) {
public static void assertTableRecords(Table table, List<Record> expected) throws IOException {
table.refresh();
Contributor

Can this be done automatically when a write completes, or is this a completely separate copy of the table?

Member Author

Since we don't support scanning the table via Flink SQL yet, we have to read records from the Iceberg table with the Iceberg Java API in the unit tests. In this test we get the icebergTable instance first; the following test methods then commit to the Iceberg table via Flink SQL, so icebergTable needs a refresh to pick up the latest changes.

}

@Test
public void testOverwriteTable() throws Exception {
Contributor

It would be good to also have a partitioned test.
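A hedged sketch of such a partitioned overwrite test, in the style of the existing tests; the table name, helper methods, and SQL below are illustrative:

@Test
public void testOverwritePartitionedTable() throws Exception {
  sql("CREATE TABLE %s (id INT, data STRING) PARTITIONED BY (data)", TABLE_NAME);
  execInsertSqlAndWaitResult(tEnv,
      String.format("INSERT INTO %s VALUES (1, 'a'), (2, 'b')", TABLE_NAME));
  execInsertSqlAndWaitResult(tEnv,
      String.format("INSERT OVERWRITE %s PARTITION (data='a') SELECT 3", TABLE_NAME));
  // expect the data='a' partition to be replaced with the single row (3, 'a'),
  // while the data='b' partition stays untouched
}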

@rdblue (Contributor) commented Sep 2, 2020

+1 overall. I'd merge this now even with a couple of minor comments, but it appears that merging #1393 caused conflicts.

@openinx (Member Author) commented Sep 3, 2020

The broken unit test on JDK 11 is (JDK 8 works fine):

org.apache.iceberg.spark.sql.TestCreateTableAsSelect > testDataFrameV2Replace[1] FAILED
    java.lang.AssertionError: Should have rows matching the source table: number of results should match expected:<3> but was:<6>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:645)
        at org.apache.iceberg.spark.SparkTestBase.assertEquals(SparkTestBase.java:100)
        at org.apache.iceberg.spark.sql.TestCreateTableAsSelect.testDataFrameV2Replace(TestCreateTableAsSelect.java:206)

openinx closed this Sep 3, 2020
openinx reopened this Sep 3, 2020
rdblue merged commit f153349 into apache:master Sep 3, 2020
@rdblue (Contributor) commented Sep 3, 2020

Merged! Thanks for getting this done, @openinx! It's great to see Flink SQL writes working.

rdblue added this to the Java 0.10.0 Release milestone Nov 16, 2020
