Conversation

@liferoad (Contributor) commented Aug 28, 2025

Implement WriteToPubSub batch mode support by reusing the DirectRunner implementation. Add a PTransform override for batch mode, plus integration tests.

Fixes #35990

Internal bug: b/441584693

Will update CHANGES.md when the PR is good to go.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch): build Python source distribution and wheels, Python tests, Java tests, Go tests.
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@liferoad liferoad changed the title [TEST-ONLY] feat(pubsub): add batch mode support for WriteToPubSub in DataflowRunner feat(pubsub): add batch mode support for WriteToPubSub in DataflowRunner Aug 29, 2025
# use BundleBasedDirectRunner
# since Prism does not support transform overrides
transform_overrides = _get_transform_overrides(options)
if transform_overrides:
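The fallback in the snippet above can be illustrated with a toy sketch. The function and runner names mirror the snippet, but the logic here is a simplified stand-in, not Beam's actual runner-selection code.

```python
# Toy stand-in for the fallback: if a pipeline needs any transform
# overrides, choose the bundle-based DirectRunner instead of Prism,
# since Prism does not apply transform overrides.

def get_transform_overrides(options):
    # Hypothetical: only a Pub/Sub write needs an override in this sketch.
    return ["WriteToPubSubOverride"] if options.get("uses_pubsub_write") else []

def choose_runner(options):
    if get_transform_overrides(options):
        return "BundleBasedDirectRunner"
    return "PrismRunner"

print(choose_runner({"uses_pubsub_write": True}))  # BundleBasedDirectRunner
print(choose_runner({}))                           # PrismRunner
```

This is the blanket behavior questioned later in the thread: any override at all forces the fallback, rather than checking which specific feature is missing.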
Contributor Author:

@shunping @damccorm Falling back to the DirectRunner for now when transform overrides are detected.

Contributor Author:

Tracked under #36011

Contributor:

Why do we need to apply these overrides in Prism?

If we're applying an override, it is usually because some feature the transform needs is missing. It seems like it would be better to exclude Prism only when the feature itself is needed, rather than assuming that all transforms with overrides don't work.

Contributor Author:

I can scope this to Pub/Sub only if we can confirm the other transform overrides are not needed for Prism. I did it this way simply because transform overrides are not applied in Prism right now.

Contributor:

Why do we need the pub/sub transform override? I assume there is a missing feature here?

Contributor Author:

Without this PR, batch mode does not work: WriteToPubSub does not perform any writes. The DirectRunner uses the overrides to make it work, and I applied the same idea to the DataflowRunner.

Please check the internal bug b/441584693
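The override idea described above (the DirectRunner swaps WriteToPubSub for an implementation that actually publishes) can be illustrated with a toy sketch. The class and function names below are simplified stand-ins, not Beam's actual APIs; Beam's real mechanism is a PTransformOverride with a matcher and a replacement transform.

```python
# Toy illustration of the transform-override pattern: the stock batch
# WriteToPubSub writes nothing, and a runner-applied override swaps in a
# replacement that performs the publish. Names are illustrative only.

class WriteToPubSub:
    def expand(self, messages):
        return []  # In batch mode the stock transform performs no writes.

class BatchWriteToPubSub(WriteToPubSub):
    def __init__(self, published):
        self.published = published

    def expand(self, messages):
        self.published.extend(messages)  # The replacement actually publishes.
        return []

def apply_overrides(transform, published):
    # Matcher: swap in the batch implementation for WriteToPubSub.
    if type(transform) is WriteToPubSub:
        return BatchWriteToPubSub(published)
    return transform

published = []
t = apply_overrides(WriteToPubSub(), published)
t.expand([b"hello", b"world"])
print(published)  # [b'hello', b'world']
```

Without the override step, `expand` returns without touching `published`, which mirrors the silent no-op behavior the bug report describes.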

Contributor:

For Java, instead of using overrides to implement Dataflow batch, direct runner, etc., we only override for Dataflow streaming (which specializes the write by publishing internally). I think this is a better approach because the basic Pub/Sub write transform then works as is, and overriding is reserved for specialization.

See
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L1624
and the override (which also has an experiment to disable):
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L659

Can we do that for Python too by:

  • implementing PubsubSink in pubsub.py along the lines of the existing DirectRunner implementation. We may need to add some pushback if the publisher client itself doesn't provide it; otherwise we may pull all the messages into memory, have them pile up in publishing, and OOM.
  • removing all the DirectRunner-specific override machinery
  • adding a transform override for Dataflow streaming and removing it for Dataflow batch
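The pushback concern in the first bullet can be sketched as a publisher wrapper that bounds the number of in-flight publish futures, blocking before issuing more. This is a toy stand-in, assuming nothing about the real Pub/Sub client beyond it returning futures; the actual `google-cloud-pubsub` client has its own flow-control settings.

```python
# Sketch of "pushback": cap in-flight publishes so a batch worker does not
# buffer every message in memory while waiting on the publish backend.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

class BoundedPublisher:
    def __init__(self, publish_fn, max_in_flight=4):
        self._executor = ThreadPoolExecutor(max_workers=max_in_flight)
        self._futures = set()
        self._max = max_in_flight
        self._publish_fn = publish_fn  # stand-in for a real publisher client

    def publish(self, message):
        if len(self._futures) >= self._max:
            # Block until at least one outstanding publish completes.
            done, self._futures = wait(self._futures, return_when=FIRST_COMPLETED)
            for f in done:
                f.result()  # surface publish errors instead of dropping them
        self._futures.add(self._executor.submit(self._publish_fn, message))

    def flush(self):
        # Drain remaining in-flight publishes before the bundle finishes.
        done, self._futures = wait(self._futures)
        for f in done:
            f.result()
        self._executor.shutdown()

sent = []
pub = BoundedPublisher(sent.append, max_in_flight=2)
for i in range(10):
    pub.publish(i)
pub.flush()
print(sorted(sent))  # all ten messages published, never more than 2 in flight
```

A real PubsubSink would call `flush` from the DoFn's `finish_bundle` so every message is acknowledged before the bundle commits.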

Contributor Author:

My PR keeps the changes minimal and avoids breaking any potential update compatibility. That is why I override the batch implementation instead of touching any of the streaming path.

return pcoll | Write(self._sink)

Regarding this line: are you sure we can easily change this to match the Java implementation? Or is it worth matching the Java one, given that this issue has existed for a while and my current PR (not perfect) solves it for now?

Contributor:

> I think this is a better approach because then the basic pubsub write transform works as is and overriding is just for specialization.

I think this is the key thing: ideally you would not need to implement an override for this to work on arbitrary runners; overrides should be the exceptional case.

@liferoad liferoad marked this pull request as ready for review August 29, 2025 14:36
@liferoad liferoad requested a review from scwhittle August 29, 2025 14:37
@github-actions:
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers


@liferoad (Contributor Author) commented Sep 3, 2025

Closing this; #36027 is the recommended way.

@liferoad liferoad closed this Sep 3, 2025


Successfully merging this pull request may close these issues.

[Bug]: WriteToPubSub Sink breaks in batch mode
