@prashantwason prashantwason commented Jan 4, 2026

Describe the issue this Pull Request addresses

This PR adds an option to ensure all columns in the schema are nullable when using HoodieStreamer with row-based sources like SQLSource or SQLFileBasedSource.

When new columns are added via SQL queries, the schema must be backwards compatible. New columns added to a table must be nullable because existing records don't have values for them. This change provides a configuration option to automatically make all columns nullable, ensuring smooth schema evolution.
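To illustrate the compatibility argument above, here is a minimal sketch that does not use Hudi or Spark types. `Column` and `readWithSchema` are hypothetical names invented for this example. It models reading an already-written record under an evolved schema: a missing nullable column can be filled with null, while a missing non-nullable column cannot be satisfied.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullableEvolutionSketch {

    /** Hypothetical, simplified column description (not a Hudi or Spark type). */
    public static final class Column {
        public final String name;
        public final boolean nullable;

        public Column(String name, boolean nullable) {
            this.name = name;
            this.nullable = nullable;
        }
    }

    /**
     * Reads a stored record under a (possibly newer) schema.
     * Columns the record lacks are filled with null when the column is nullable;
     * otherwise the read fails because no value can be supplied.
     */
    public static Map<String, Object> readWithSchema(Map<String, Object> stored, List<Column> schema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Column c : schema) {
            if (stored.containsKey(c.name)) {
                out.put(c.name, stored.get(c.name));
            } else if (c.nullable) {
                out.put(c.name, null);
            } else {
                throw new IllegalStateException("No value for required column: " + c.name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A record written before the "city" column existed.
        Map<String, Object> oldRecord = Collections.<String, Object>singletonMap("id", 1);

        // Evolved schema where the new column is nullable: the old record is still readable.
        List<Column> evolved = Arrays.asList(new Column("id", false), new Column("city", true));
        System.out.println(readWithSchema(oldRecord, evolved));
    }
}
```

If `city` were declared non-nullable in this toy model, the same read would throw, which is the situation the new option is meant to avoid.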

Summary and Changelog

What users gain: Users can now set hoodie.deltastreamer.transformed.row.nullable=true to automatically make all columns in the incoming schema nullable, preventing schema compatibility issues during schema evolution.

Changes:

  • Added new configuration constants in HoodieStreamer.java:
    • ENSURE_ALL_COLUMNS_NULLABLE_KEY = "hoodie.deltastreamer.transformed.row.nullable"
    • ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT = false
  • Added extractSchemaFromDataset() method in UtilHelpers.java that optionally converts schema to nullable using Spark's StructType.asNullable()
  • Updated RowSource.java to use the new schema extraction method
  • Updated StreamSync.java to use the new schema extraction method for transformed datasets
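The sketch below shows the shape of the schema extraction described in the changes above. It is not the code from this PR: `Field`, `asNullable` and `extractSchema` are hypothetical stand-ins so the example runs without Spark on the classpath. The actual change relies on Spark's `StructType.asNullable()`, which, to my understanding, also marks nested fields nullable; the toy version recurses for the same reason.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NullableSchemaSketch {

    /** Hypothetical stand-in for a schema field; children is empty for non-struct fields. */
    public static final class Field {
        public final String name;
        public final boolean nullable;
        public final List<Field> children;

        public Field(String name, boolean nullable, List<Field> children) {
            this.name = name;
            this.nullable = nullable;
            this.children = children;
        }
    }

    /** Returns a copy of the fields with every level, including nested ones, marked nullable. */
    public static List<Field> asNullable(List<Field> fields) {
        List<Field> out = new ArrayList<>();
        for (Field f : fields) {
            out.add(new Field(f.name, true, asNullable(f.children)));
        }
        return out;
    }

    /** Converts the schema only when the flag is on; otherwise returns it unchanged. */
    public static List<Field> extractSchema(List<Field> datasetSchema, boolean ensureAllColumnsNullable) {
        return ensureAllColumnsNullable ? asNullable(datasetSchema) : datasetSchema;
    }

    public static void main(String[] args) {
        List<Field> address = Collections.singletonList(
            new Field("zip", false, Collections.<Field>emptyList()));
        List<Field> schema = Arrays.asList(
            new Field("id", false, Collections.<Field>emptyList()),
            new Field("address", false, address));

        for (Field f : extractSchema(schema, true)) {
            System.out.println(f.name + " nullable=" + f.nullable);
        }
    }
}
```

With the flag off the input schema is returned as is, which matches the stated goal of preserving existing behavior by default.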

Impact

  • New configuration option hoodie.deltastreamer.transformed.row.nullable (default: false)
  • No breaking changes - existing behavior is preserved when config is not set
  • When enabled, all columns in the dataset schema are converted to nullable before being used

Risk Level

low - The feature is disabled by default and only affects schema handling when explicitly enabled. The implementation uses Spark's built-in asNullable() method which is well-tested.

Documentation Update

The new config hoodie.deltastreamer.transformed.row.nullable should be documented:

  • Key: hoodie.deltastreamer.transformed.row.nullable
  • Default: false
  • Description: When set to true, all columns in the incoming dataset schema are made nullable. This is useful for maintaining backwards compatibility when new columns are added via SQL queries.
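For the documentation, a small sketch of how the key and default could be resolved may help. The key and default value are taken from this PR; the use of plain `java.util.Properties` and the helper name `ensureAllColumnsNullable` are assumptions made so the example is self-contained, and the real code path may read the value differently.

```java
import java.util.Properties;

public class NullableConfigSketch {

    // Key and default as listed in this PR.
    public static final String ENSURE_ALL_COLUMNS_NULLABLE_KEY =
        "hoodie.deltastreamer.transformed.row.nullable";
    public static final boolean ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT = false;

    /** Hypothetical helper: returns the configured value, or the default when the key is absent. */
    public static boolean ensureAllColumnsNullable(Properties props) {
        String raw = props.getProperty(ENSURE_ALL_COLUMNS_NULLABLE_KEY);
        return raw == null ? ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("unset:   " + ensureAllColumnsNullable(props));

        // Opting in, as a user would in the properties passed to the streamer.
        props.setProperty(ENSURE_ALL_COLUMNS_NULLABLE_KEY, "true");
        System.out.println("enabled: " + ensureAllColumnsNullable(props));
    }
}
```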

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

This is required to keep the schema backwards compatible when new columns are added via SQL queries (e.g. when using SQLSource or SQLFileBasedSource as the source of records written into a table).
@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Jan 4, 2026
hudi-bot commented Jan 4, 2026

CI report:

