fix: regression of dict_id in physical plan proto#20063
Conversation
af1a108 to
e0a0061
Compare
brancz
left a comment
There was a problem hiding this comment.
lgtm, thank you for fixing this!
Thanks for the review! |
alamb
left a comment
There was a problem hiding this comment.
Thank you @kumarUjjawal
FYI @dispanser
|
thanks @kumarUjjawal , appreciated! |
| root_as_message(encoded_schema.ipc_message.as_slice()).map_err( | ||
| |e| { | ||
| Error::General(format!( | ||
| "Error IPC schema message while deserializing ScalarValue::List: {e}" |
There was a problem hiding this comment.
Why the error messages say ScalarValue::List ? (List)
Isn't this used for any nested type ? List, Map, Struct, ...
There was a problem hiding this comment.
Good catch! I will update.
…f-45 Undoing [apache#14227](apache#14227) which introduces a bug in protobuf deserialization by passing a wrong schema to record batch construction of the IPC message containing the dictionary. The associated test and a fix will probably be merged into datafusion as apache#20063 but the test fails in main for completely different reason so the patch cannot be backported and doesn't make sense in the context of df46.
alamb
left a comment
There was a problem hiding this comment.
Thank you @kumarUjjawal and @brancz
I feel intuitively something could be made much simpler here, but given this fixes the bug and i don't have any specific ideas of how to improve the PR, let's go with this one
| )); | ||
| }; | ||
|
|
||
| // IPC dictionary batch IDs are assigned when encoding the schema, but our protobuf |
There was a problem hiding this comment.
I feel like somehow I missing something key -- this seems like a pretty massive overhead to create / decode a single value here.
That being said, it seems like the existing code is also doing the overhead, so maybe it is fine for now 🤔
I wonder if we could pull this logic for creating the schema into its own function to try and reduce the size of the overall method / make it easier to understand
| Ok(()) | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
With the code change reverted, this test fails like
---- cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict stdout ----
Error: Plan("General error: Error encoding ScalarValue::List as IPC: Ipc error: no dict id for field item")
|
Thank you @alamb. I will look into your suggestions. |
Thanks -- let's do it as a follow on PR |
…f-45 --- [Cherry-pick summary: v46→v47] Source commit: 1481576 (Forward-porting protobuf decode logic for scalar nested values from df-45) Strategy: cherry-picked, minor adaptions (1 conflict in roundtrip_physical_plan.rs) Upstream PR: fork-only (fix for apache#14227 bug; upstream fix tracked as apache#20063 but not in v47) Conflict resolved: HEAD had two new tests (roundtrip_empty_projection, roundtrip_physical_plan_node); cherry-pick added roundtrip_call_null_scalar_struct_dict; kept all three tests Test coverage: adequate (adds roundtrip test for null scalar struct with dict column) Tests: cargo nextest run -p datafusion-proto passed
…f-45 --- [Cherry-pick summary: v46→v47] Source commit: 1481576 (Forward-porting protobuf decode logic for scalar nested values from df-45) Strategy: cherry-picked, minor adaptions (1 conflict in roundtrip_physical_plan.rs) Upstream PR: fork-only (fix for apache#14227 bug; upstream fix tracked as apache#20063 but not in v47) Conflict resolved: HEAD had two new tests (roundtrip_empty_projection, roundtrip_physical_plan_node); cherry-pick added roundtrip_call_null_scalar_struct_dict; kept all three tests Test coverage: adequate (adds roundtrip test for null scalar struct with dict column) Tests: cargo nextest run -p datafusion-proto passed
…f-45 --- [Cherry-pick summary: v46→v47] Source commit: 1481576 (Forward-porting protobuf decode logic for scalar nested values from df-45) Strategy: cherry-picked, minor adaptions (1 conflict in roundtrip_physical_plan.rs) Upstream PR: fork-only (fix for apache#14227 bug; upstream fix tracked as apache#20063 but not in v47) Conflict resolved: HEAD had two new tests (roundtrip_empty_projection, roundtrip_physical_plan_node); cherry-pick added roundtrip_call_null_scalar_struct_dict; kept all three tests Test coverage: adequate (adds roundtrip test for null scalar struct with dict column) Tests: cargo nextest run -p datafusion-proto passed
…f-45 --- [Cherry-pick summary: v46→v47] Source commit: 1481576 (Forward-porting protobuf decode logic for scalar nested values from df-45) Strategy: cherry-picked, minor adaptions (1 conflict in roundtrip_physical_plan.rs) Upstream PR: fork-only (fix for apache#14227 bug; upstream fix tracked as apache#20063 but not in v47) Conflict resolved: HEAD had two new tests (roundtrip_empty_projection, roundtrip_physical_plan_node); cherry-pick added roundtrip_call_null_scalar_struct_dict; kept all three tests Test coverage: adequate (adds roundtrip test for null scalar struct with dict column) Tests: cargo nextest run -p datafusion-proto passed
…f-45 --- [Cherry-pick summary: v46→v47] Source commit: 1481576 (Forward-porting protobuf decode logic for scalar nested values from df-45) Strategy: cherry-picked, minor adaptions (1 conflict in roundtrip_physical_plan.rs) Upstream PR: fork-only (fix for apache#14227 bug; upstream fix tracked as apache#20063 but not in v47) Conflict resolved: HEAD had two new tests (roundtrip_empty_projection, roundtrip_physical_plan_node); cherry-pick added roundtrip_call_null_scalar_struct_dict; kept all three tests Test coverage: adequate (adds roundtrip test for null scalar struct with dict column) Tests: cargo nextest run -p datafusion-proto passed
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#20011. ## Rationale for this change - `dict_id` is intentionally not preserved protobuf (it’s deprecated in Arrow schema metadata), but Arrow IPC still requires dict IDs for dictionary encoding/decoding. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Fix protobuf serde for nested ScalarValue (list/struct/map) containing dictionary arrays by using Arrow IPC’s dictionary handling correctly. - Seed DictionaryTracker by encoding the schema before encoding the nested scalar batch. - On decode, reconstruct an IPC schema from the protobuf schema and use arrow_ipc::reader::read_dictionary to build dict_by_id before reading the record batch. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes added a test for this <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Which issue does this PR close?
Struct(Dict)data type #20011.Rationale for this change
dict_idis intentionally not preserved protobuf (it’s deprecated in Arrow schema metadata), but Arrow IPC still requires dict IDs for dictionary encoding/decoding.What changes are included in this PR?
Are these changes tested?
Yes added a test for this
Are there any user-facing changes?
No