-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-55062][Protobuf] Support proto2 extensions in protobuf functions #53828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
JIRA Issue Information=== Improvement SPARK-55062 === This comment was automatically generated by GitHub Actions |
| checkAnswer(fromProtoDf, expectedDf) | ||
| } | ||
|
|
||
| test("SPARK-55062: roundtrip - proto2 extension basic types") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about adding test cases for these edge cases?
- extension field name collision with regular fields
- schema evolution: read old data without extensions using new schema with extensions.
- map extensions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, the original unit tests were in hindsight pretty lacking. I've added the above plus a few more. For #3, however, I believe the Protobuf grammar doesn't allow map fields in extensions (protoc will error), so there shouldn't be a need for a test
| val binary = input.asInstanceOf[Array[Byte]] | ||
| try { | ||
| result = DynamicMessage.parseFrom(messageDescriptor, binary) | ||
| result = DynamicMessage.parseFrom(messageDescriptor, binary, extensionRegistry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A pretty flagrant error--I must have dropped this when reapplying changes from my other repo. I also realized as a result that the original tests were not explicitly checking for the added fields (round trip was working because the fields were being dropped), so I've added in those assertions.
EDIT: Also retested the operators manually in Spark Shell as a sanity check, and everything now looks good
What changes were proposed in this pull request?
This PR adds support for proto2 extensions to
from_protobufandto_protobuf(when file descriptor set is provided, as Java classes do not contain enough information to support extensions).This is done by building an ExtensionRegistry and a map from descriptor name to its extensions. The registry is used during construction of the DynamicMessage to provide the Protobuf library with visibility of the extensions. The index is plumbed through the various helper classes for use in schema conversion and serde.
Why are the changes needed?
Proto2 extensions are a valid, if somewhat uncommon, feature of Protobuf, and it therefore makes sense to incorporate them into the schema when provided so as to not confuse the user.
Does this PR introduce any user-facing change?
Yes. Previously, extension fields would be dropped by both
from_protobufandto_protobuf. Now, they are retained. This can be demonstrated with the minimal example below. See the unit tests for more examples.How was this patch tested?
Unit tests were added for the new behavior, including basic behavior, extending nested messages, and extensions defined in separate files.
Was this patch authored or co-authored using generative AI tooling?
Initial draft authored with Claude Code.
Generated-by: claude-4.5-opus