feat(utilities): add option to make all schema columns nullable for backwards compatibility #17777
+28
−5
Describe the issue this Pull Request addresses
This PR adds an option to ensure all columns in the schema are nullable when using HoodieStreamer with row-based sources like SQLSource or SQLFileBasedSource. When new columns are added via SQL queries, the schema must be backwards compatible. New columns added to a table must be nullable because existing records don't have values for them. This change provides a configuration option to automatically make all columns nullable, ensuring smooth schema evolution.
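As a rough illustration of the behaviour described above (not the code in this PR), the sketch below models a schema in plain Java and shows an opt-in conversion that marks every column, including nested ones, as nullable. The names NullableSchemaSketch, Field, and extractSchema are hypothetical and exist only for this example; the actual change operates on Spark's StructType.

```java
import java.util.List;
import java.util.stream.Collectors;

public class NullableSchemaSketch {

    // Hypothetical stand-in for a schema field; not a Spark or Hudi type.
    record Field(String name, String type, boolean nullable, List<Field> children) {
        static Field of(String name, String type, boolean nullable) {
            return new Field(name, type, nullable, List.of());
        }
    }

    // Returns a copy of the schema in which every field, including nested ones, is nullable.
    static List<Field> asNullable(List<Field> schema) {
        return schema.stream()
            .map(f -> new Field(f.name(), f.type(), true, asNullable(f.children())))
            .collect(Collectors.toList());
    }

    // Applies the conversion only when the flag is set, mirroring an opt-in config
    // that is disabled by default. The flag name here is an assumption for the sketch.
    static List<Field> extractSchema(List<Field> incoming, boolean ensureAllColumnsNullable) {
        return ensureAllColumnsNullable ? asNullable(incoming) : incoming;
    }

    public static void main(String[] args) {
        List<Field> incoming = List.of(
            Field.of("id", "long", false),
            Field.of("new_col", "string", false));
        // With the flag enabled, both columns are reported as nullable.
        System.out.println(extractSchema(incoming, true));
    }
}
```

With the flag left at false the incoming schema is returned unchanged, which matches the intent that existing pipelines are unaffected unless the option is explicitly enabled.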
Summary and Changelog
What users gain: Users can now set hoodie.deltastreamer.transformed.row.nullable=true to automatically make all columns in the incoming schema nullable, preventing schema compatibility issues during schema evolution.
Changes:
- HoodieStreamer.java: new configuration
  - ENSURE_ALL_COLUMNS_NULLABLE_KEY = "hoodie.deltastreamer.transformed.row.nullable"
  - ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT = false
- UtilHelpers.java: new extractSchemaFromDataset() method that optionally converts the schema to nullable using Spark's StructType.asNullable()
- RowSource.java: changed to use the new schema extraction method
- StreamSync.java: changed to use the new schema extraction method for transformed datasets
Impact
New config: hoodie.deltastreamer.transformed.row.nullable (default: false)
Risk Level
low - The feature is disabled by default and only affects schema handling when explicitly enabled. The implementation uses Spark's built-in asNullable() method, which is well-tested.
Documentation Update
The new config hoodie.deltastreamer.transformed.row.nullable should be documented:
hoodie.deltastreamer.transformed.row.nullable (default: false): when true, all columns in the incoming dataset schema are made nullable. This is useful for maintaining backwards compatibility when new columns are added via SQL queries.
Contributor's checklist