Skip to content

Conversation

@damccorm
Copy link
Contributor

Add the ability for a python pipeline to specify a --gbek option which will cause all data passing through a GBK to get encrypted. This doesn't handle x-lang GBKs yet. Next steps include supporting this in Java, making sure we respect x-lang pipeline options, and making sure that x-lang GBKs in languages other than Python/Java are validated away.

Part of #36214


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@damccorm
Copy link
Contributor Author

Failing checks are mostly due to java license issues or unrelated flaky tests, none seem to be real issues

@damccorm damccorm marked this pull request as ready for review September 30, 2025 15:08
@github-actions
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@damccorm
Copy link
Contributor Author

assign set of reviewers

@github-actions
Copy link
Contributor

Assigning reviewers:

R: @liferoad for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@damccorm
Copy link
Contributor Author

damccorm commented Oct 3, 2025

ML tests are red because of #36377

from apache_beam.transforms.util import GroupByEncryptedKey
from apache_beam.transforms.util import Secret

secret = Secret.parse_secret_option(replace_with_gbek_secret)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we somehow retry get_secret_bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets called during setup, so it should get automatic retries from the runner -

self._hmac_key = self.hmac_key_secret.get_secret_bytes()

pcoll
| beam.ParDo(_EncryptMessage(self._hmac_key, key_coder, value_coder))
| beam.GroupByKey()
| gbk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this breaks the update compatibility given the line number changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true - that's why I think we should do #36251 (comment) to permanently avoid this problem.

I can follow up with a PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#36381 will fix this broadly.

Copy link
Contributor

@liferoad liferoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the failed workflows.

@damccorm
Copy link
Contributor Author

damccorm commented Oct 3, 2025

Please fix the failed workflows.

They're red from #36377 - I'll merge in master once that's in to confirm everything is green

@liferoad
Copy link
Contributor

liferoad commented Oct 3, 2025

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a --gbek pipeline option to enforce encryption for GroupByKey operations, which is a great step towards enhancing data security at rest. The implementation correctly replaces GroupByKey with GroupByEncryptedKey when the option is present and prevents recursive replacement.

My review includes a couple of suggestions. First, a minor correction to the help text of the new --gbek option to avoid confusion. More importantly, I've proposed a refactoring of the parse_secret_option method in util.py to improve its robustness against malformed inputs and to make it more extensible for future secret types. The current parsing logic has a bug when parameter values contain colons.

The tests are comprehensive and cover the new functionality well.

@damccorm damccorm merged commit 659cc4d into master Oct 3, 2025
118 of 140 checks passed
@damccorm damccorm deleted the users/damccorm/enforce_gbek branch October 3, 2025 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants