Flink: Dynamic Iceberg Sink: Add dynamic writer and committer #13080

Merged
pvary merged 23 commits into apache:main from mxm:dynamic-sink-contrib-breakdown3
Jun 4, 2025

Conversation

@mxm
Contributor

@mxm mxm commented May 16, 2025

This adds the dynamic version of the writer and committer for the Flink Dynamic Iceberg Sink. Conceptually, they work similarly to the IcebergSink, but they support writing to multiple tables. Write results from each table are aggregated from the DynamicWriter instances by the DynamicWriteResultAggregator, from which they are sent to the DynamicCommitter.

Broken out of #12424.
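The writer-to-aggregator-to-committer flow described above can be sketched with hypothetical, heavily simplified types. The names WriteResult, aggregate, and commit below are illustrative stand-ins, not the actual sink API:

```java
import java.util.*;

// Hypothetical simplification of the described flow: per-table write results
// produced by the writers are grouped by target table in an aggregation step
// and then handed to a committer, one commit per table.
public class DynamicSinkFlowSketch {
  record WriteResult(String table, String dataFile) {}

  static Map<String, List<WriteResult>> aggregate(List<WriteResult> fromWriters) {
    Map<String, List<WriteResult>> perTable = new HashMap<>();
    for (WriteResult r : fromWriters) {
      perTable.computeIfAbsent(r.table(), unused -> new ArrayList<>()).add(r);
    }
    return perTable;
  }

  static List<String> commit(Map<String, List<WriteResult>> aggregated) {
    List<String> commits = new ArrayList<>();
    aggregated.forEach((table, results) ->
        commits.add("commit " + results.size() + " file(s) to " + table));
    return commits;
  }

  public static void main(String[] args) {
    List<WriteResult> results = List.of(
        new WriteResult("db.a", "f1"),
        new WriteResult("db.a", "f2"),
        new WriteResult("db.b", "f3"));
    System.out.println(commit(aggregate(results)).size()); // 2 tables, 2 commits
  }
}
```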

@github-actions github-actions Bot added the flink label May 16, 2025
@mxm mxm changed the title Flink: Dynamic Sink: Add dynamic writer and committer Flink: Dynamic Iceberg Sink: Add dynamic writer and committer May 16, 2025
@pvary
Contributor

pvary commented May 19, 2025

Could you please provide tests?
Also double-check that every class we made public is marked with @Internal.
Thanks,
Peter

/**
* The aggregated results of a single checkpoint which should be committed. Containing the
* serialized {@link org.apache.iceberg.flink.sink.DeltaManifests} file - which contains the commit
* data, and the jobId, operatorId, checkpointId triplet which help identifying the specific commit
Contributor

nit: help -> helps
Add a . at the end of the sentence

@mxm
Contributor Author

mxm commented May 23, 2025

Thanks for the review @pvary. I'll add some more tests for the writer / aggregator / committer. There is more integration / e2e testing coming with the subsequent PRs, but we should also test the individual components.

@mxm mxm force-pushed the dynamic-sink-contrib-breakdown3 branch from 073623b to 000e7df on June 4, 2025 09:10
Contributor

@pvary pvary left a comment

Please move all tests to package private, and remove @Internal from non-public classes.
Otherwise +1 LGTM.

@pvary pvary merged commit b8cc8eb into apache:main Jun 4, 2025
18 checks passed
@mxm mxm deleted the dynamic-sink-contrib-breakdown3 branch June 4, 2025 12:52
@pvary
Contributor

pvary commented Jun 4, 2025

Merged to main.
Thanks for the PR @mxm!

Could you please create the backport PRs for this and the previous commits too?

@mxm
Contributor Author

mxm commented Jun 5, 2025

Thanks for merging! I'll create the backport PRs.

mxm added a commit to mxm/iceberg that referenced this pull request Jun 5, 2025
mxm added a commit to mxm/iceberg that referenced this pull request Jun 5, 2025
mxm added a commit to mxm/iceberg that referenced this pull request Jun 5, 2025
mxm added a commit to mxm/iceberg that referenced this pull request Jun 5, 2025
pvary pushed a commit that referenced this pull request Jun 5, 2025
cogwirrel pushed a commit to cogwirrel/iceberg that referenced this pull request Aug 10, 2025
private static final Duration CACHE_EXPIRATION_DURATION = Duration.ofMinutes(1);

private final CatalogLoader catalogLoader;
private transient Map<WriteTarget, Collection<DynamicWriteResult>> results;
Contributor

@mxm @pvary @rodmeneses should Map<WriteTarget, Collection<DynamicWriteResult>> results; be serialised with the class for recovery in Flink? Or does Flink guarantee to reissue all write results after recovery if the pre-commit topology fails? This is serialisable in the original IcebergSink:

Contributor Author

We do not serialize the WriteResults in the non-dynamic sink. We flush them on checkpoint via the prepareSnapshotPreBarrier method.

@Override
public void prepareSnapshotPreBarrier(long checkpointId) throws IOException {
Collection<CommittableWithLineage<DynamicCommittable>> committables =
Sets.newHashSetWithExpectedSize(results.size());
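The buffer-and-flush pattern the snippet above belongs to can be sketched in a self-contained form. Names and types here are illustrative stand-ins, not the actual DynamicWriteResultAggregator API; the point is only that buffered results are emitted and cleared when the checkpoint barrier arrives, so they never need to be stored in operator state:

```java
import java.util.*;

// Minimal sketch (illustrative names): results are buffered in an in-memory
// map keyed by write target and flushed downstream when the checkpoint
// barrier arrives, then the buffer is cleared.
public class FlushOnCheckpointSketch {
  private final Map<String, Set<String>> results = new HashMap<>();

  // Stand-in for collecting a write result keyed by its target.
  public void collect(String target, String writeResult) {
    results.computeIfAbsent(target, unused -> new HashSet<>()).add(writeResult);
  }

  // Stand-in for prepareSnapshotPreBarrier(long checkpointId): emit everything
  // buffered since the last checkpoint, then clear the buffer.
  public List<String> prepareSnapshotPreBarrier(long checkpointId) {
    List<String> committables = new ArrayList<>();
    results.forEach((target, set) ->
        set.forEach(r -> committables.add(checkpointId + "/" + target + "/" + r)));
    results.clear();
    return committables;
  }

  public static void main(String[] args) {
    FlushOnCheckpointSketch agg = new FlushOnCheckpointSketch();
    agg.collect("tableA", "file-1");
    agg.collect("tableB", "file-2");
    System.out.println(agg.prepareSnapshotPreBarrier(42).size()); // 2
    System.out.println(agg.prepareSnapshotPreBarrier(43).size()); // 0
  }
}
```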
Contributor

@mxm @pvary Why do we need to use a Set here? Do we rely on it to deduplicate CommittableWithLineage below?

Contributor Author

Set is used here because there isn't a particular order to the committables.

DynamicWriteResult result =
((CommittableWithLineage<DynamicWriteResult>) element.getValue()).getCommittable();
WriteTarget key = result.key();
results.computeIfAbsent(key, unused -> Sets.newHashSet()).add(result);
Contributor

@mxm @pvary Same as the above, can we use a regular List here?

Contributor Author

We use a Set here because there is no enforced order. I'm curious, why do you want to use List?
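The trade-off being discussed can be shown in isolation (illustrative code only, not the Iceberg implementation): a Set makes the lack of ordering explicit and collapses equal results, whereas a List would preserve an arrival order that carries no meaning here.

```java
import java.util.*;

// Illustrative comparison: the same three results, one of them a duplicate,
// buffered in a List versus a Set.
public class SetVsListSketch {
  public static void main(String[] args) {
    List<String> asList = new ArrayList<>();
    Set<String> asSet = new HashSet<>();
    for (String result : List.of("manifest-1", "manifest-2", "manifest-1")) {
      asList.add(result);
      asSet.add(result);
    }
    System.out.println(asList.size()); // 3: duplicate kept, order preserved
    System.out.println(asSet.size());  // 2: equal result collapsed
  }
}
```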

devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025
devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025