@dichlorodiphen

What changes were proposed in this pull request?

This PR adds support for proto2 extensions to from_protobuf and to_protobuf. Extensions are supported only when a file descriptor set is provided, because Java classes do not contain enough information to support them.

This is done by building an ExtensionRegistry and an index that maps each message descriptor's name to the extensions declared for it. The registry is passed when constructing the DynamicMessage so that the Protobuf library can see the extensions. The index is plumbed through the various helper classes for use in schema conversion and serde.
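
For illustration, here is a minimal sketch of how such a registry and index could be built from already-loaded file descriptors. It is not the code in this PR: the names `ExtensionIndexSketch`, `allExtensions`, and `build` are hypothetical, and it assumes protobuf-java on the classpath and Scala 2.13.

```scala
import com.google.protobuf.{DynamicMessage, ExtensionRegistry}
import com.google.protobuf.Descriptors.{Descriptor, FieldDescriptor, FileDescriptor}
import scala.jdk.CollectionConverters._

// Hypothetical helper, not taken from the PR.
object ExtensionIndexSketch {

  /** Collects every extension declared in a file: top-level ones and those nested in messages. */
  def allExtensions(file: FileDescriptor): Seq[FieldDescriptor] = {
    def nested(d: Descriptor): Seq[FieldDescriptor] =
      d.getExtensions.asScala.toSeq ++ d.getNestedTypes.asScala.flatMap(nested)
    file.getExtensions.asScala.toSeq ++ file.getMessageTypes.asScala.flatMap(nested)
  }

  /** Returns the registry plus a map from extended message full name to its extensions. */
  def build(files: Seq[FileDescriptor]): (ExtensionRegistry, Map[String, Seq[FieldDescriptor]]) = {
    val registry = ExtensionRegistry.newInstance()
    val extensions = files.flatMap(allExtensions)
    extensions.foreach { ext =>
      if (ext.getJavaType == FieldDescriptor.JavaType.MESSAGE) {
        // Message-typed extensions must be registered with a default instance.
        registry.add(ext, DynamicMessage.getDefaultInstance(ext.getMessageType))
      } else {
        registry.add(ext)
      }
    }
    // getContainingType on an extension is the message being extended.
    (registry, extensions.groupBy(_.getContainingType.getFullName))
  }
}
```

Keying the index by the extended message's full name lets schema conversion look up the extra fields for a descriptor without consulting the registry, which is only needed at parse time.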

Why are the changes needed?

Proto2 extensions are a valid, if somewhat uncommon, feature of Protobuf, and it therefore makes sense to incorporate them into the schema when provided so as to not confuse the user.

Does this PR introduce any user-facing change?

Yes. Previously, extension fields would be dropped by both from_protobuf and to_protobuf. Now, they are retained. This can be demonstrated with the minimal example below. See the unit tests for more examples.

syntax = "proto2";

message Person {
    optional int32 id = 1;
    extensions 100 to 200;
}
extend Person {
    optional int32 age = 100;
}

How was this patch tested?

Unit tests were added for the new behavior, including basic behavior, extending nested messages, and extensions defined in separate files.

Was this patch authored or co-authored using generative AI tooling?

Initial draft authored with Claude Code.

Generated-by: claude-4.5-opus

@github-actions

JIRA Issue Information

=== Improvement SPARK-55062 ===
Summary: from_protobuf and to_protobuf do not support proto2 extensions
Assignee: None
Status: Open
Affected: ["4.1.1"]


This comment was automatically generated by GitHub Actions

checkAnswer(fromProtoDf, expectedDf)
}

test("SPARK-55062: roundtrip - proto2 extension basic types") {

What do you think about adding test cases for these edge cases?

  1. extension field name collision with regular fields
  2. schema evolution: read old data without extensions using new schema with extensions.
  3. map extensions

Author

Good call, the original unit tests were in hindsight pretty lacking. I've added the above plus a few more. For #3, however, I believe the Protobuf grammar doesn't allow map fields in extensions (protoc will error), so there shouldn't be a need for a test.

  val binary = input.asInstanceOf[Array[Byte]]
  try {
-   result = DynamicMessage.parseFrom(messageDescriptor, binary)
+   result = DynamicMessage.parseFrom(messageDescriptor, binary, extensionRegistry)
Author

@dichlorodiphen dichlorodiphen Jan 21, 2026

A pretty flagrant error--I must have dropped this when reapplying changes from my other repo. I also realized as a result that the original tests were not explicitly checking for the added fields (round trip was working because the fields were being dropped), so I've added in those assertions.

EDIT: Also retested the operators manually in Spark Shell as a sanity check, and everything now looks good
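
A round trip that passes only because the field is dropped in both directions can be caught by asserting on the extension itself. The sketch below is not one of the PR's tests: it is a standalone check against protobuf-java (assumed to be on the classpath), and the object name `ExtensionParseCheck` and helper `int32Field` are hypothetical. It builds the Person example programmatically and parses the same bytes with and without a registry.

```scala
import com.google.protobuf.DescriptorProtos.{DescriptorProto, FieldDescriptorProto, FileDescriptorProto}
import com.google.protobuf.Descriptors.FileDescriptor
import com.google.protobuf.{DynamicMessage, ExtensionRegistry}

// Hypothetical standalone check, not taken from the PR.
object ExtensionParseCheck {
  private def int32Field(name: String, number: Int): FieldDescriptorProto.Builder =
    FieldDescriptorProto.newBuilder()
      .setName(name)
      .setNumber(number)
      .setLabel(FieldDescriptorProto.Label.LABEL_OPTIONAL)
      .setType(FieldDescriptorProto.Type.TYPE_INT32)

  // Programmatic equivalent of the Person example (proto2, no package).
  val file: FileDescriptor = FileDescriptor.buildFrom(
    FileDescriptorProto.newBuilder()
      .setName("person.proto")
      .setSyntax("proto2")
      .addMessageType(DescriptorProto.newBuilder()
        .setName("Person")
        .addField(int32Field("id", 1))
        // "extensions 100 to 200" is stored with an exclusive end.
        .addExtensionRange(DescriptorProto.ExtensionRange.newBuilder().setStart(100).setEnd(201)))
      .addExtension(int32Field("age", 100).setExtendee(".Person"))
      .build(),
    Array.empty[FileDescriptor])

  val person = file.findMessageTypeByName("Person")
  val age = file.findExtensionByName("age")

  val registry: ExtensionRegistry = ExtensionRegistry.newInstance()
  registry.add(age)

  // Serialize a message that carries the extension.
  val bytes: Array[Byte] = DynamicMessage.newBuilder(person)
    .setField(person.findFieldByName("id"), Int.box(1))
    .setField(age, Int.box(30))
    .build()
    .toByteArray

  val withoutRegistry: DynamicMessage = DynamicMessage.parseFrom(person, bytes)
  val withRegistry: DynamicMessage = DynamicMessage.parseFrom(person, bytes, registry)

  def main(args: Array[String]): Unit = {
    // Without the registry, field 100 survives only as an unknown field.
    assert(!withoutRegistry.hasField(age))
    assert(withoutRegistry.getUnknownFields.hasField(100))
    // With the registry, the extension is parsed as a regular field.
    assert(withRegistry.hasField(age))
    assert(withRegistry.getField(age) == Int.box(30))
  }
}
```

Checking `hasField` on the extension descriptor, rather than only comparing round-tripped output, is what distinguishes a retained field from one that was silently dropped.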
