
Conversation

@KiteSoar (Contributor) commented Jan 3, 2026

Describe the issue this Pull Request addresses

This PR migrates the HoodieRecord API from org.apache.avro.Schema to org.apache.hudi.common.schema.HoodieSchema for all record-related methods, reducing coupling with Avro-specific implementations.
Closes #17689

Summary and Changelog

Users can now use HoodieSchema consistently across all HoodieRecord methods. This provides a unified schema abstraction layer.

Impact

All HoodieRecord methods that previously accepted org.apache.avro.Schema now accept org.apache.hudi.common.schema.HoodieSchema
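
For illustration, a typical call site would adapt roughly as follows (a minimal sketch; the wrapper class RecordKeyExample is hypothetical, and the one-time conversion via HoodieSchema.fromAvroSchema mirrors the call sites changed in this PR):

    import org.apache.avro.Schema;
    import org.apache.hudi.common.model.HoodieRecord;
    import org.apache.hudi.common.schema.HoodieSchema;

    class RecordKeyExample {
      // Callers that still hold an Avro Schema wrap it once and pass the
      // HoodieSchema to the migrated HoodieRecord methods.
      static String recordKeyOf(HoodieRecord<?> record, Schema avroSchema, String keyFieldName) {
        HoodieSchema hoodieSchema = HoodieSchema.fromAvroSchema(avroSchema);
        return record.getRecordKey(hoodieSchema, keyFieldName);
      }
    }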

Risk Level

Low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@KiteSoar force-pushed the hoodieRecord-migration branch 2 times, most recently from 5c8cf22 to 691a8ae, on January 3, 2026 03:34
@github-actions bot added the size:L label (PR with lines of changes in (300, 1000]) on Jan 3, 2026
@KiteSoar force-pushed the hoodieRecord-migration branch from 691a8ae to 8cc8fa2 on January 3, 2026 03:40
@KiteSoar changed the title from "refactor: Migrate HoodieRecord methods to use HoodieSchema instead of Avro Schema." to "feat(scheme): Migrate HoodieRecord methods to use HoodieSchema instead of Avro Schema." on Jan 3, 2026
@KiteSoar changed the title from "feat(scheme): Migrate HoodieRecord methods to use HoodieSchema instead of Avro Schema." to "feat(schema): Migrate HoodieRecord methods to use HoodieSchema instead of Avro Schema." on Jan 3, 2026
@voonhous changed the title from "feat(schema): Migrate HoodieRecord methods to use HoodieSchema instead of Avro Schema." to "feat(schema): Migrate HoodieRecord methods to use HoodieSchema instead of Avro.Schema" on Jan 3, 2026
@KiteSoar force-pushed the hoodieRecord-migration branch from 88e0a16 to b892713 on January 3, 2026 08:39
@voonhous (Member) commented Jan 3, 2026

Sorry, I went through the changes again and found a few more things that can be improved. I believe this should be all of them.

Let's wait for @the-other-tim-brown's review too after you've made the changes I suggested.

Thank you so much for the contribution, we really appreciate it!

@KiteSoar force-pushed the hoodieRecord-migration branch from 57bec67 to d40ef89 on January 4, 2026 15:36
@voonhous (Member) left a comment:

LGTM

@hudi-bot (Collaborator) commented Jan 5, 2026

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

      return OrderingValues.getDefault();
    } else {
      return OrderingValues.create(orderingFields, field -> {
        if (recordSchema.getField(field) == null) {
Contributor:

HoodieSchema#getField never returns null; it returns Option.empty() if the field is not found. So all invocations of HoodieSchema#getField should be revisited and revised so that they behave as before (i.e., use recordSchema.getField(field).isEmpty() or recordSchema.getField(field).isPresent() instead of a null check).
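
For example, the check above could be revised along these lines (a sketch; only the condition changes, the surrounding lambda stays the same):

    // Before the migration, the Avro Schema API returned null for a missing field:
    if (recordSchema.getField(field) == null) {
      // ... handle the missing ordering field ...
    }

    // With HoodieSchema, getField returns an Option, so test it explicitly:
    if (recordSchema.getField(field).isEmpty()) {
      // ... handle the missing ordering field ...
    }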

Contributor:

Four places to check in HoodieFlinkRecord

doGetOrderingValue(HoodieSchema, Properties, String[])
        if (recordSchema.getField(field) == null) {
getOrderingValueAsJava(HoodieSchema, Properties, String[])
        if (recordSchema.getField(field) == null) {
getRecordKey(HoodieSchema, Option<BaseKeyGenerator>)
      ValidationUtils.checkArgument(recordSchema.getField(RECORD_KEY_METADATA_FIELD) != null,
updateMetaField(HoodieSchema, int, String)
    boolean withOperation = recordSchema.getField(OPERATION_METADATA_FIELD) != null;
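
For the last two call sites above, the revision would look roughly like this (a sketch; the error-message argument of checkArgument is omitted here):

    // getRecordKey: assert that the record key metadata field exists in the schema.
    ValidationUtils.checkArgument(recordSchema.getField(RECORD_KEY_METADATA_FIELD).isPresent());

    // updateMetaField: detect whether the operation metadata field is part of the schema.
    boolean withOperation = recordSchema.getField(OPERATION_METADATA_FIELD).isPresent();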

Contributor:

Also could you add unit tests to cover this method?

@yihua (Contributor) left a comment:

Thanks for your patience on the review



  @Override
- public Comparable<?> getOrderingValueAsJava(Schema recordSchema, Properties props, String[] orderingFields) {
+ public Comparable<?> getOrderingValueAsJava(HoodieSchema recordSchema, Properties props, String[] orderingFields) {
Contributor:

Same here for if (recordSchema.getField(field) == null): use the Option-based check instead of the null check.

  public String getRecordKey(HoodieSchema recordSchema, String keyFieldName) {
    if (key == null) {
-     String recordKey = Objects.toString(RowDataAvroQueryContexts.fromAvroSchema(recordSchema).getFieldQueryContext(keyFieldName).getFieldGetter().getFieldOrNull(data));
+     String recordKey = Objects.toString(RowDataAvroQueryContexts.fromAvroSchema(recordSchema.toAvroSchema()).getFieldQueryContext(keyFieldName).getFieldGetter().getFieldOrNull(data));
Contributor:

Is RowDataAvroQueryContexts going to be migrated away from Avro schema separately?

Contributor:

The reason I ask is that there's still back-and-forth conversion to the Avro Schema for RowDataAvroQueryContexts to use.

Contributor:

We'll take this on in #17739


    int seqId = 1;
    for (HoodieRecord record : records) {
-     GenericRecord avroRecord = (GenericRecord) record.rewriteRecordWithNewSchema(schema, CollectionUtils.emptyProps(), schema).getData();
+     GenericRecord avroRecord = (GenericRecord) record.rewriteRecordWithNewSchema(HoodieSchema.fromAvroSchema(schema),
Contributor:

nit: The Avro schema can be converted to HoodieSchema once beforehand, instead of converting it for every record.
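
i.e., something along these lines (a sketch of the suggestion; it assumes the other schema argument of rewriteRecordWithNewSchema also takes the HoodieSchema after this migration):

    // Convert the Avro schema to HoodieSchema once, before the loop,
    // instead of calling HoodieSchema.fromAvroSchema(schema) for every record.
    HoodieSchema hoodieSchema = HoodieSchema.fromAvroSchema(schema);
    int seqId = 1;
    for (HoodieRecord record : records) {
      GenericRecord avroRecord = (GenericRecord) record
          .rewriteRecordWithNewSchema(hoodieSchema, CollectionUtils.emptyProps(), hoodieSchema)
          .getData();
      // ... rest of the loop body unchanged ...
    }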

    int seqId = 1;
    for (HoodieRecord record : records) {
-     GenericRecord avroRecord = (GenericRecord) record.toIndexedRecord(schema, CollectionUtils.emptyProps()).get().getData();
+     GenericRecord avroRecord = (GenericRecord) record.toIndexedRecord(HoodieSchema.fromAvroSchema(schema), CollectionUtils.emptyProps()).get().getData();
Contributor:

nit: similar; the HoodieSchema conversion can also be done once before the loop here.


    while (recordItr.hasNext()) {
      HoodieRecord record = recordItr.next();
-     String recordKey = record.getRecordKey(readerSchema, keyFieldName);
+     String recordKey = record.getRecordKey(HoodieSchema.fromAvroSchema(readerSchema), keyFieldName);
Contributor:

The HoodieSchema should be generated outside the while loop.
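
For example (a sketch; hoodieReaderSchema is a hypothetical local variable name):

    // Build the HoodieSchema once, before iterating over the records.
    HoodieSchema hoodieReaderSchema = HoodieSchema.fromAvroSchema(readerSchema);
    while (recordItr.hasNext()) {
      HoodieRecord record = recordItr.next();
      String recordKey = record.getRecordKey(hoodieReaderSchema, keyFieldName);
      // ... rest of the loop body unchanged ...
    }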

  void testConvertColumnValueForLogicalTypeWithNullValue() {
-   Schema dateSchema = Schema.create(Schema.Type.INT);
-   LogicalTypes.date().addToSchema(dateSchema);
+   HoodieSchema dateSchema = HoodieSchema.create(HoodieSchemaType.INT);
Contributor:

Should this be HoodieSchema.createDate()?
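
i.e., keeping the date logical type that the old Avro code attached (a sketch; HoodieSchema.createDate() as suggested above):

    // Old Avro setup: an INT schema with the date logical type added.
    // Schema dateSchema = Schema.create(Schema.Type.INT);
    // LogicalTypes.date().addToSchema(dateSchema);

    // Suggested replacement: a date schema rather than a bare INT.
    HoodieSchema dateSchema = HoodieSchema.createDate();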

Development

Successfully merging this pull request may close these issues.

Phase 26: HoodieRecord migration

6 participants