feat(pubsub): add batch mode support for WriteToPubSub in DataflowRunner #36001
Conversation
Implement WriteToPubSub batch mode support by reusing the DirectRunner implementation. Add a PTransform override for batch mode and integration tests.
# use BundleBasedDirectRunner
# since Prism does not support transform overrides
transform_overrides = _get_transform_overrides(options)
if transform_overrides:
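(For illustration: a minimal sketch of what a runner typically does with the overrides computed in the snippet above. The `pipeline` variable name is an assumption; `Pipeline.replace_all()` is the Beam API that applies `PTransformOverride`s in place before translation.)

```python
# Illustrative only, not code from this PR: applying the overrides computed above.
# `pipeline` is assumed to be the Pipeline object the runner is about to execute.
if transform_overrides:
  pipeline.replace_all(transform_overrides)
```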
Tracked under #36011
Why do we need to apply these overrides in Prism?
If we're applying an override, it is usually because some feature the transform needs is missing. It seems better to exclude Prism only when that feature is actually needed, rather than assuming all transforms with overrides don't work.
I can change this to apply only to Pub/Sub if we can confirm the other transform overrides are not needed for Prism. I did it this way simply because transform overrides are not applied in Prism at the moment.
Why do we need the pub/sub transform override? I assume there is a missing feature here?
Without this PR, batch mode does not work because WriteToPubSub does not perform any write. DirectRunner uses the overrides to make it work; I just applied the same idea to the Dataflow runner.
Please check the internal bug b/441584693
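(For illustration: a minimal sketch, not the code in this PR, of what a runner-level override that swaps WriteToPubSub for a publishing ParDo can look like in the Python SDK. `_BatchPubSubPublishFn`, `_WriteToPubSubBatchOverride`, and the `.topic` attribute access are hypothetical names and assumptions; `PTransformOverride`, `DoFn`, `ParDo`, and the google-cloud-pubsub `PublisherClient.publish()` API are real.)

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import WriteToPubSub
from apache_beam.pipeline import PTransformOverride


class _BatchPubSubPublishFn(beam.DoFn):
  """Hypothetical DoFn: publishes each already-encoded message to the topic."""
  def __init__(self, topic):
    self._topic = topic
    self._client = None

  def setup(self):
    # Assumes the google-cloud-pubsub client shipped with Beam's GCP extras.
    from google.cloud import pubsub_v1
    self._client = pubsub_v1.PublisherClient()

  def process(self, element):
    # element is assumed to be serialized message bytes; blocking on the
    # future keeps the sketch simple at the cost of per-message latency.
    self._client.publish(self._topic, element).result()


class _WriteToPubSubBatchOverride(PTransformOverride):
  """Hypothetical override replacing WriteToPubSub in batch pipelines."""
  def matches(self, applied_ptransform):
    return isinstance(applied_ptransform.transform, WriteToPubSub)

  def get_replacement_transform_for_applied_ptransform(
      self, applied_ptransform):
    # Reading '.topic' off WriteToPubSub is an assumption for this sketch;
    # the real transform keeps the topic in its own internal fields.
    topic = getattr(applied_ptransform.transform, 'topic', None)
    return beam.ParDo(_BatchPubSubPublishFn(topic))
```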
For Java, instead of using overrides to implement Dataflow batch, the direct runner, etc., we only override for Dataflow streaming (which specializes the write with internal publishing). I think this is a better approach because then the basic pubsub write transform works as is and overriding is just for specialization.
See
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L1624
and the override (which also has an experiment to disable):
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L659
Can we do that for Python too by:
- implementing PubsubSink in pubsub.py with something like the existing Direct runner implementation. We may need to add some pushback if the publisher client itself doesn't do it, as otherwise we may pull all the messages into memory and have them pile up in publishing and OOM (see the sketch after this list).
- removing all the direct runner stuff
- adding the transform override for Dataflow streaming and removing it for Dataflow batch
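(A rough illustration of the kind of pushback described above. All names here, such as `_PubsubSinkDoFn` and `_MAX_OUTSTANDING`, are hypothetical and not from the PR; the idea is simply to bound the number of in-flight publish futures so messages do not accumulate in memory.)

```python
import apache_beam as beam

_MAX_OUTSTANDING = 1000  # assumed bound on in-flight publishes


class _PubsubSinkDoFn(beam.DoFn):
  """Hypothetical default publisher DoFn with simple pushback."""
  def __init__(self, topic):
    self._topic = topic
    self._futures = []

  def setup(self):
    from google.cloud import pubsub_v1
    self._client = pubsub_v1.PublisherClient()

  def process(self, message_bytes):
    self._futures.append(self._client.publish(self._topic, message_bytes))
    if len(self._futures) >= _MAX_OUTSTANDING:
      self._drain()  # pushback: wait for the backlog before accepting more

  def finish_bundle(self):
    self._drain()

  def _drain(self):
    # Blocks on every outstanding future; surfaces publish errors and keeps
    # memory bounded instead of accumulating the whole bundle in memory.
    for future in self._futures:
      future.result()
    self._futures = []
```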
My PR makes sure only minimal changes are introduced without breaking any potential update compatibility. This is why I override the batch implementation instead of touching any of the streaming parts.
beam/sdks/python/apache_beam/io/gcp/pubsub.py
Line 438 in 17d5039
return pcoll | Write(self._sink)
> I think this is a better approach because then the basic pubsub write transform works as is and overriding is just for specialization.

I think this is the key thing - ideally you would not need to implement an override for this to work on arbitrary runners; overrides should be the exceptional case.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment.
Close this. #36027 is the recommended way.
Fixes #35990
Internal bug: b/441584693
Will update CHANGES.md when PR is good to go.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.