[BEAM-10160] Change API to enable to append to Iceberg #13585
Conversation
Exploring how to write Iceberg tables using Beam. By removing the PDone and emitting the files that are written, we can add these to a table format such as Iceberg. Exploring the possible approaches, but I think this is a very fruitful direction since we can fully re-use the existing Avro/Parquet/... writers and emit the WriteFilesResult to the next operator, which will append the new files to the Iceberg log.
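To make the direction concrete, here is a minimal hedged sketch of the difference for a pipeline author; `records` and `resultReturningWrite` are placeholders for the example, not the actual API proposed here.

```java
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;

// Today a write such as AvroIO.write(...) terminates its branch with PDone, so the
// names of the files it produced cannot be handed to a following transform.
//
// Explored direction: the write surfaces a WriteFilesResult instead, so the produced
// file names can flow onwards, e.g. into a table-format commit step.
WriteFilesResult<?> result = records.apply(resultReturningWrite); // placeholder transform
PCollection<String> newFiles =
    result.getPerDestinationOutputFilenames().apply(Values.<String>create());
// `newFiles` is what a downstream step would append to the Iceberg (or Hudi/Delta) log.
```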
f1b0597 to f1a7d26
R: @iemejia 👍

Run Spark StructuredStreaming ValidatesRunner

@tszerszen Spark Structured Streaming Validates Runner tests are failing so I think we can ignore those for the moment.
iemejia left a comment
Hello Fokko!
Great to see you contributing to Beam and super excited to have Iceberg support soon. Don't hesitate to share more details (or early WIP PR).
I left one question: do you think you can get ahead without changing AvroIO.Write? If so, just revert that part of the changes and I will merge.
public PDone expand(PCollection<T> input) {
  input.apply(inner);
  return PDone.in(input.getPipeline());
public WriteFilesResult<?> expand(PCollection<T> input) {
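For readers, a sketch of what the signature change above amounts to, assuming (as the snippet suggests) that `inner` is the underlying TypedWrite, whose own expand already produces a WriteFilesResult:

```java
@Override
public WriteFilesResult<?> expand(PCollection<T> input) {
  // Return the inner TypedWrite's result instead of discarding it and emitting PDone.
  return input.apply(inner);
}
```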
This solution looks ok, but I am wary of the consequences of changing the return type. It looks like we introduced TypedWrite in Beam to achieve the same goal without breaking backwards compatibility.
The 'modern' preferred way to write files in Beam is via FileIO.write(), which already returns WriteFilesResult.
Can you check if we can achieve the intended results by relying on FileIO.write() + AvroIO.sink()? Or is there anything missing?
CC: @jkff in case you have some extra detail to add.
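For reference, a minimal sketch of that combination; the output path and the MyRecord class are placeholders. FileIO.write() already hands back a WriteFilesResult, without any change to AvroIO.Write:

```java
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.WriteFilesResult;

WriteFilesResult<Void> written =
    records.apply(
        FileIO.<MyRecord>write()
            .via(AvroIO.sink(MyRecord.class))   // write elements as Avro
            .to("/path/to/table/data")          // placeholder output directory
            .withSuffix(".avro"));
// written.getPerDestinationOutputFilenames() lists the Avro files that were produced.
```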
Agreed, this should use FileIO.write + AvroIO.sink - the current change is incompatible and will break anybody's transforms of the form:
PDone expand(...) {
...
return AvroIO.write()...; // If return type changes to WFR, this stops compiling
}
@Override
public PTransform<PCollection<Row>, POutput> buildWriter() {
public PTransform<PCollection<Row>, WriteFilesResult<?>> buildWriter() {
This looks good; here the change is less of an issue since this class is @Internal.
CC @TheNeuralBit for awareness.
Ack, thank you
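Purely as an illustration of what the narrower return type enables for internal callers; `provider` and `rows` are placeholders, not actual Beam identifiers:

```java
WriteFilesResult<?> result = rows.apply(provider.buildWriter());
// The written file names are now observable to whatever wiring sits behind the provider.
PCollection<String> writtenFiles =
    result.getPerDestinationOutputFilenames().apply(Values.<String>create());
```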
 * A {@link PTransform} that writes to a {@link FileBasedSink}. A write begins with a sequential
 * global initialization of a sink, followed by a parallel write, and ends with a sequential
 * finalization of the write. The output of a write is {@link PDone}.
 * finalization of the write. The output of a write is {@link WriteFilesResult} with the files
👍
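For context, a WriteFilesResult like the one this doc change describes is what the TypedWrite-based entry points already return today; for example with TextIO (the `lines` collection and output path are placeholders):

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;

// withOutputFilenames() switches to the TypedWrite form, whose output is a
// WriteFilesResult rather than PDone.
WriteFilesResult<?> res =
    lines.apply(TextIO.write().to("/tmp/output/part").withOutputFilenames());
// Finalization fills the result with the names of the files that were committed.
res.getPerDestinationOutputFilenames();
```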
Thanks for the pointers. Turns out that this wasn't needed. I'm able to wire everything together using the existing FileIO.write() + AvroIO.sink() API.
Thanks Fokko, I will keep track of this on the Iceberg side; do not hesitate to ping me if anything is needed!
Exploring how to write Iceberg tables using Beam.

By removing the PDone and emitting the files that are written, we can add these to a table format such as Iceberg, Hudi or Delta. Exploring the possible approaches, but I think this is a very fruitful direction since we can fully re-use the existing Avro/Parquet/... writers and emit the WriteFilesResult to the next operator, which will append the new files to the Iceberg log.

My suggestion would be to change the API in Apache Beam so we can use the WriteFilesResult, and to add the Iceberg extension in the Iceberg repository itself. This will be a PTransform<WriteFilesResult<?>, PDone>. Maybe I'll change the PDone to a Table (Iceberg Table) as well, so you can signal downstream consumers. I'll open this PR somewhere next week, but would like to know your ideas as well! :)
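A hedged sketch of how such an extension outside Beam could look. Everything below is hypothetical: the class name, the table identifier, and the Iceberg commit itself (left as a placeholder comment); for simplicity the sketch consumes the file-name PCollection taken from the WriteFilesResult rather than the WriteFilesResult itself.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

// Hypothetical downstream step: append the files reported by a Beam write to an Iceberg table.
class AppendFilesToIcebergTable extends PTransform<PCollection<String>, PDone> {
  private final String tableIdentifier; // e.g. "db.my_table" (placeholder)

  AppendFilesToIcebergTable(String tableIdentifier) {
    this.tableIdentifier = tableIdentifier;
  }

  @Override
  public PDone expand(PCollection<String> newFiles) {
    newFiles.apply(
        ParDo.of(
            new DoFn<String, Void>() {
              @ProcessElement
              public void processElement(@Element String file) {
                // Placeholder: load the table named by tableIdentifier and append `file`
                // as a new data file; a real implementation would batch files into one commit.
              }
            }));
    return PDone.in(newFiles.getPipeline());
  }
}

// Wiring it after a write (writeResult is the WriteFilesResult of the file write):
// writeResult.getPerDestinationOutputFilenames()
//     .apply(Values.<String>create())
//     .apply(new AppendFilesToIcebergTable("db.my_table"));
```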