
Conversation

Fokko (Contributor) commented Dec 20, 2020

Exploring how to write Iceberg tables using Beam.

By removing the PDone and instead emitting the files that are written, we can add those files to a table format such as Iceberg, Hudi, or Delta.

I'm still exploring the possible approaches, but I think this is a very fruitful direction, since we can fully reuse the existing Avro/Parquet/etc. writers and emit the WriteFilesResult to the next operator, which appends the new files to the Iceberg log.

My suggestion would be to change the API in Apache Beam so we can use the WriteFilesResult, and add the Iceberg extension in the Iceberg repository itself. This will be a PTransform<WriteFilesResult<?>, PDone>. Maybe I'll change the PDone to a Table (an Iceberg Table) as well, so you can signal downstream consumers. I'll open that PR sometime next week, but I would like to hear your ideas as well! :)
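
To make the direction concrete, here is a rough sketch of what such a transform could look like. This is only an illustration, not the actual change: the class name AppendToIcebergTable is hypothetical, it assumes an Iceberg HadoopTables catalog, it hardcodes the file metrics a real implementation would collect from the writers, and it commits per file where a real implementation would gather all file names into a single atomic append.

import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PDone;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

/** Hypothetical sketch: append files emitted by a Beam write to an Iceberg table. */
class AppendToIcebergTable extends PTransform<WriteFilesResult<Void>, PDone> {

  private final String tableLocation;

  AppendToIcebergTable(String tableLocation) {
    this.tableLocation = tableLocation;
  }

  @Override
  public PDone expand(WriteFilesResult<Void> writtenFiles) {
    writtenFiles
        // The file names produced by the upstream write.
        .getPerDestinationOutputFilenames()
        .apply(Values.create())
        .apply(ParDo.of(new DoFn<String, Void>() {
          @ProcessElement
          public void processElement(@Element String path) {
            Table table = new HadoopTables(new Configuration()).load(tableLocation);
            DataFile dataFile = DataFiles.builder(table.spec())
                .withPath(path)
                .withFormat(FileFormat.AVRO)
                .withFileSizeInBytes(1L) // placeholder: use the real file size
                .withRecordCount(1L)     // placeholder: use the real record count
                .build();
            // Appends the new file to the Iceberg log; a real implementation
            // would collect all paths and commit them once, atomically.
            table.newAppend().appendFile(dataFile).commit();
          }
        }));
    return PDone.in(writtenFiles.getPipeline());
  }
}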


Fokko (Contributor, Author) commented Dec 21, 2020

R: @iemejia 👍

ghost commented Dec 21, 2020

Run Spark StructuredStreaming ValidatesRunner

iemejia (Member) commented Dec 21, 2020

@tszerszen The Spark Structured Streaming ValidatesRunner tests are failing, so I think we can ignore those for the moment.

iemejia (Member) left a comment

Hello Fokko!
Great to see you contributing to Beam, and I'm super excited to have Iceberg support soon. Don't hesitate to share more details (or an early WIP PR).

I left one question: do you think you can get ahead without changing AvroIO.Write? If so, just revert that part of the changes and I will merge.

-  public PDone expand(PCollection<T> input) {
-    input.apply(inner);
-    return PDone.in(input.getPipeline());
+  public WriteFilesResult<?> expand(PCollection<T> input) {
A reviewer (Member) commented on the change:

This solution looks OK, but I am wary of the consequences of changing the return type. It looks like we introduced TypedWrite on Beam to achieve the same goal without breaking backwards compatibility.

The 'modern', preferred way to write files on Beam is via FileIO.write(), which already returns WriteFilesResult.
Can you check whether we can achieve the intended results by relying on FileIO.write() + AvroIO.sink(), or is there anything missing?

CC: @jkff in case you have some extra detail to add.
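
For illustration, a minimal sketch of that combination, assuming GenericRecord elements with a known Avro schema; the class, method, and parameter names are illustrative:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.values.PCollection;

class AvroFileIOExample {
  // FileIO.write() + AvroIO.sink() already returns WriteFilesResult, so the
  // written file names stay available to a downstream Iceberg-commit
  // transform without any change to AvroIO.Write.
  static WriteFilesResult<Void> writeAvro(
      PCollection<GenericRecord> records, Schema schema, String outputPrefix) {
    return records.apply(
        FileIO.<GenericRecord>write()
            .via(AvroIO.sink(schema))
            .to(outputPrefix)
            .withSuffix(".avro"));
  }
}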

A reviewer (Contributor) replied:

Agreed, this should use FileIO.write + AvroIO.sink - the current change is incompatible and will break anybody's transforms of the form:

PDone expand(...) {
  ...
  return AvroIO.write()...;  // If return type changes to WFR, this stops compiling
}
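
For context, the TypedWrite path mentioned above keeps the existing PDone-returning AvroIO.write() intact while still exposing the file names to callers who ask for them. A hedged sketch, assuming AvroIO.Write offers withOutputFilenames() in the same way TextIO.Write does, and using a hypothetical MyEvent element class:

import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.values.PCollection;

class TypedWriteExample {
  // Hypothetical element type; AvroIO derives its schema via reflection.
  static class MyEvent {
    long id;
    String name;
  }

  // withOutputFilenames() switches from AvroIO.Write (PDone) to
  // AvroIO.TypedWrite, whose expansion returns WriteFilesResult, so callers
  // opt in without breaking existing PDone-based code.
  static WriteFilesResult<Void> writeEvents(
      PCollection<MyEvent> events, String outputPrefix) {
    return events.apply(
        AvroIO.write(MyEvent.class)
            .to(outputPrefix)
            .withSuffix(".avro")
            .withOutputFilenames());
  }
}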


   @Override
-  public PTransform<PCollection<Row>, POutput> buildWriter() {
+  public PTransform<PCollection<Row>, WriteFilesResult<?>> buildWriter() {
A reviewer (Member) commented on the change:

This looks good; here the change causes fewer issues since this class is @Internal.
CC @TheNeuralBit for awareness.

A reviewer (Member) replied:

Ack, thank you

 * A {@link PTransform} that writes to a {@link FileBasedSink}. A write begins with a sequential
 * global initialization of a sink, followed by a parallel write, and ends with a sequential
-* finalization of the write. The output of a write is {@link PDone}.
+* finalization of the write. The output of a write is {@link WriteFilesResult} with the files
A reviewer (Member) commented on the change:

👍

Fokko (Contributor, Author) commented Dec 21, 2020

Thanks for the pointers. It turns out this wasn't needed: I'm able to wire everything together using the FileIO API. A first attempt is in apache/iceberg#1972.

Fokko closed this Dec 21, 2020
iemejia (Member) commented Dec 22, 2020

Thanks Fokko, I will keep track of this on the Iceberg side; do not hesitate to ping me if anything is needed!
