Cleanly separate v1 worker and v2 sdk harness container image handling in DataflowRunner #36138
Summary of Changes
Hello @kennknowles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the DataflowRunner to explicitly differentiate between container images used for V1 workers and V2 SDK harnesses. The change improves the clarity and correctness of how container images are managed within Dataflow, ensuring that the appropriate image is used for each pipeline version and addressing a previous misinterpretation of related configuration options.
Highlights
- Container Image Handling Separation: The pull request cleanly separates the handling of container images for Dataflow V1 workers and V2 SDK harnesses. This clarifies the distinct roles of these container types, addressing previous ambiguities where the --workerHarnessContainerImageUrl and --sdkHarnessContainerImageUrl options were incorrectly treated as synonymous.
- Introduction of Dedicated Methods: New, explicit methods have been introduced in DataflowRunner.java to retrieve the appropriate container image URLs: getV1WorkerContainerImageForJob for V1 workers and getV2SdkHarnessContainerImageForJob for V2 SDK harnesses. This replaces a single, ambiguous getContainerImageForJob method.
- Correction of Deprecation: The @Deprecated annotations and associated comments for getWorkerHarnessContainerImage and setWorkerHarnessContainerImage in DataflowPipelineWorkerPoolOptions.java have been removed. This acknowledges that the V1 worker container image option is still relevant and distinct from the V2 SDK harness image option.
- No Breaking Changes for V1: The changes do not introduce breaking changes for V1, as V1 does not support custom container usage, and there is no supported usage of the V2 parameter for V1.
- Improved Code Clarity and Maintainability: By making the container image handling explicit and separate, the codebase becomes easier to understand and maintain, reducing the potential for misconfiguration and paving the way for future service-side adjustments.
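The split described in the highlights can be pictured with a minimal, self-contained sketch. Only the two method names come from this pull request. The `ImageOptions` holder, the simplified signatures, and the default image strings are assumptions made up for illustration, and the real methods in DataflowRunner.java work against the Dataflow pipeline options and may differ substantially.

```java
import java.util.Objects;

/** Illustrative sketch only: simplified stand-ins, not Beam's real classes or signatures. */
class ContainerImageSketch {

  /** Hypothetical holder for the two independent image settings (null means "not set"). */
  static final class ImageOptions {
    final String workerHarnessContainerImage; // V1 worker image override
    final String sdkHarnessContainerImage; // V2 SDK harness image override

    ImageOptions(String workerHarnessContainerImage, String sdkHarnessContainerImage) {
      this.workerHarnessContainerImage = workerHarnessContainerImage;
      this.sdkHarnessContainerImage = sdkHarnessContainerImage;
    }
  }

  // Assumed default image names, invented for this example.
  static final String DEFAULT_V1_IMAGE = "example.invalid/v1-worker:latest";
  static final String DEFAULT_V2_IMAGE = "example.invalid/v2-sdk-harness:latest";

  /** V1 code path: consults only the V1 worker option. */
  static String getV1WorkerContainerImageForJob(ImageOptions options) {
    Objects.requireNonNull(options, "options");
    return options.workerHarnessContainerImage != null
        ? options.workerHarnessContainerImage
        : DEFAULT_V1_IMAGE;
  }

  /** V2 code path: consults only the V2 SDK harness option. */
  static String getV2SdkHarnessContainerImageForJob(ImageOptions options) {
    Objects.requireNonNull(options, "options");
    return options.sdkHarnessContainerImage != null
        ? options.sdkHarnessContainerImage
        : DEFAULT_V2_IMAGE;
  }

  public static void main(String[] args) {
    // Only the V2 option is set; the V1 path must not pick it up.
    ImageOptions options = new ImageOptions(null, "example.invalid/custom-sdk:1.0");
    System.out.println(getV1WorkerContainerImageForJob(options));
    System.out.println(getV2SdkHarnessContainerImageForJob(options));
  }
}
```

The point of the sketch is that neither method reads the other's option, so a value set for one container type cannot leak into the other code path.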
R: @Abacn
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
Force-pushed from 52b658d to 0b5d216.
Abacn left a comment:
Thanks, just had one comment.
  // it is a v1 or v2 job submission.
  RunnerApi.Environment defaultEnvironmentForDataflow =
-     Environments.createDockerEnvironment(workerHarnessContainerImageURL);
+     Environments.createDockerEnvironment(v2SdkHarnessContainerImageURL);
Are the comments on these lines (L1284-1290) still relevant, or do they need an update?
Good catch. Removed the comment and named the variable to make it obvious.
Force-pushed from 0b5d216 to 615fcc5.
…s container image url parameters
…g in DataflowRunner
Force-pushed from 615fcc5 to 5e25608.
The integration tests are now green. I haven't changed anything except the comment, I think, so I'll merge.
Internal tests didn't catch anything but I do think this has caused the service to not use the specified image. This is actually as expected, but we need to roll back and add tests. https://console.cloud.google.com/dataflow/jobs/us-central1/2025-09-18_12_07_20-4196960529059100076?project=apache-beam-testing is a job from #34902 which clearly has
What do you mean by "this is actually expected"? From this example job looks like
Yeah, we should probably revert it for the moment.
I don't remember what I meant when I typed "this is expected" 🤷 But yes, this is a bug on the service side. Or, potentially, if the portable pipeline environment did not change the container, then it could be an SDK bug. I actually didn't fully decode the pipeline proto to see what was in the ParDoPayload.
This is a simple first step to fix #30634
The V1 worker container is wholly independent of the V2 SDK harness container. This change makes them very obviously separate codepaths.
The prior changes to try to deprecate the --workerHarnessContainerImageUrl option were made in error. These two options are analogous but not synonymous.
There are probably service-side changes necessary to adjust for this, since we probably are incorrectly passing a V1 worker container image URL in the --sdkHarnessContainerImageUrl option.
Since V1 does not support custom container usage, this is not a breaking change. There is no supported usage of the V2 parameter for V1.
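As a rough illustration of the last point, a pre-submission check could flag a V2 SDK harness image that is set on a V1 job. This is a hypothetical sketch and not part of this change: the class, the method, and the warning text are all invented, and it encodes only the one rule stated above (no supported usage of the V2 parameter for V1).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-submission check; not code from this pull request. */
class ImageOptionCheckSketch {

  /**
   * Returns warnings for option combinations described above as unsupported.
   * A null image means the corresponding option was not set.
   */
  static List<String> check(boolean isV2Job, String v2SdkHarnessImage) {
    List<String> warnings = new ArrayList<>();
    if (!isV2Job && v2SdkHarnessImage != null) {
      // The description states there is no supported usage of the V2 parameter for V1.
      warnings.add("V2 SDK harness image is set on a V1 job and has no supported effect.");
    }
    return warnings;
  }

  public static void main(String[] args) {
    System.out.println(check(false, "example.invalid/custom-sdk:1.0"));
    System.out.println(check(true, "example.invalid/custom-sdk:1.0"));
  }
}
```

Whether such a check should warn or fail the submission is a design choice left open here.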