Flink: Add DynamicRecord / DynamicRecordInternal / DynamicRecordInternalSerializer #12996
Conversation
9081c13 to 0e50889 (force-pushed)
private PartitionSpec spec;
private int writerKey;
private RowData rowData;
private boolean upsertMode;

Should we rename this to isUpsert, or if it denotes an actual mode, use an enum instead?

Can do, but it's consistent with the coding style. We often omit these verbs from the getters in Iceberg.

In that case, upsert or useUpsertMode would probably be a better name.
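The verb-less accessor convention being discussed can be sketched as follows. This is a hypothetical class for illustration, not the actual DynamicRecordInternal code:

```java
// Sketch of the Iceberg-style accessor convention: getters omit the
// "get"/"is" verbs and simply reuse the field name. (Illustrative only.)
public class DynamicRecordSketch {
  private boolean upsertMode;

  // Iceberg-style: no "is" prefix, the accessor name matches the field name.
  public boolean upsertMode() {
    return upsertMode;
  }

  public void setUpsertMode(boolean upsertMode) {
    this.upsertMode = upsertMode;
  }
}
```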
private String tableName;
private String branch;
private Schema schema;
private PartitionSpec spec;

Should we rename this to partitionSpec in case some other kind of spec appears in the future?

I was also leaning towards this name in the beginning, but it's the Iceberg convention to use this name across the code base. We can rename it though if this is a concern.
// Check that the schema id can be resolved. Not strictly necessary for serialization.
Tuple3<RowDataSerializer, Schema, PartitionSpec> serializer =
    serializerCache.serializerWithSchemaAndSpec(
        toSerialize.tableName(),
        toSerialize.schema().schemaId(),
        toSerialize.spec().specId());

If it's not strictly necessary, why do we do it? What happens if this fails / why would it fail?

This is basically a sanity test, to verify that looking up the serializer by id on the remote side will work. The remote side won't have the schema available, because it is not written in this branch. If there are any issues, we will know about them on the sender side, as opposed to the receiving side.
I've added a JavaDoc which should clarify things.
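The "fail on the sender side" idea can be sketched like this. SchemaCache and its methods are hypothetical stand-ins for the real serializer cache, kept self-contained for illustration:

```java
// Hedged sketch: the serializer/schema is looked up eagerly by id before
// writing, so an unresolvable id surfaces where the record is produced,
// not on the receiving side. SchemaCache is illustrative, not Iceberg code.
import java.util.HashMap;
import java.util.Map;

public class SenderSideCheck {
  public static class SchemaCache {
    private final Map<Integer, String> schemasById = new HashMap<>();

    public void register(int schemaId, String schema) {
      schemasById.put(schemaId, schema);
    }

    // Mirrors the serializerWithSchemaAndSpec lookup: throws if the id
    // cannot be resolved.
    public String schemaById(int schemaId) {
      String schema = schemasById.get(schemaId);
      if (schema == null) {
        throw new IllegalStateException("Unknown schema id: " + schemaId);
      }
      return schema;
    }
  }

  public static byte[] serialize(SchemaCache cache, int schemaId, String payload) {
    // Sanity check: resolve the schema id eagerly, even though the schema
    // itself is not written out. A failure happens here, on the sender.
    cache.schemaById(schemaId);
    return payload.getBytes();
  }
}
```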
private String branch;
private Schema schema;
private RowData rowData;
private PartitionSpec spec;

Should this be called partitionSpec in case other specs are added in the future?
private PartitionSpec spec;
private DistributionMode mode;
private int writeParallelism;
private boolean upsertMode;

A boolean doesn't really describe a mode; should this be an enum, or maybe isUpsert?

I think it does. If enabled, upsert mode will be used.
private Schema schema;
private RowData rowData;
private PartitionSpec spec;
private DistributionMode mode;

Should this be distribution or distributionMode? (It is already clashing with upsertMode a little.)

Yes, it makes sense to rename it to distributionMode.
Thanks for the review @gyfora! I think it makes sense to rename the API-facing fields / getters / setters to avoid confusion for users.
private DistributionMode mode;
private int writeParallelism;
private boolean upsertMode;
@Nullable private List<String> equalityFields;

Only this field is nullable?
Shall we use the annotation consistently?

Correct, only this field is currently nullable / optional. We could add some defaults. I was thinking of adding a builder, what do you think?

A builder makes sense to me, as we have many parameters.
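The builder idea can be sketched as below. The class and field names are illustrative (not the actual DynamicRecord API): defaults live in the builder, and the optional equalityFields stays nullable without scattering @Nullable across call sites:

```java
// Hypothetical builder sketch for a many-parameter sink configuration.
// Names and defaults are assumptions for illustration only.
import java.util.List;

public class SinkConfig {
  private final int writeParallelism;
  private final boolean upsertMode;
  private final List<String> equalityFields; // optional, may be null

  private SinkConfig(Builder builder) {
    this.writeParallelism = builder.writeParallelism;
    this.upsertMode = builder.upsertMode;
    this.equalityFields = builder.equalityFields;
  }

  public int writeParallelism() { return writeParallelism; }
  public boolean upsertMode() { return upsertMode; }
  public List<String> equalityFields() { return equalityFields; }

  public static class Builder {
    private int writeParallelism = 1;        // default
    private boolean upsertMode = false;      // default
    private List<String> equalityFields = null; // optional

    public Builder writeParallelism(int parallelism) {
      this.writeParallelism = parallelism;
      return this;
    }

    public Builder upsertMode(boolean upsert) {
      this.upsertMode = upsert;
      return this;
    }

    public Builder equalityFields(List<String> fields) {
      this.equalityFields = fields;
      return this;
    }

    public SinkConfig build() {
      return new SinkConfig(this);
    }
  }
}
```

Note that a builder produces immutable objects, which is where the later discussion about object reuse comes in.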
It might be strange for new developers, but we always omit …

In general I get the idea, but my particular concern was related to …

I assume this is …

I've pushed an update to address the comments. On the name discussion: I think this is all just convention. Every community has its own styles. I don't think either way makes more sense. The most important reason is consistency. All existing Flink Iceberg sinks use that name. I don't see a strong case to deviate from it. I did rename …
This comment was marked as resolved.
…nalSerializer This adds the user-facing type DynamicRecord, along with its internal representation DynamicRecordInternal and its type information and serializer. Broken out of github.com/apache/pull/12424.
665aa07 to ec7d036 (force-pushed)
(rebased and squashed commits)
return tableIdentifier;
}

public void setTableIdentifier(TableIdentifier tableIdentifier) {

Do we need these setters, if we have a builder?

We wouldn't. I'm not sure, though, that we should remove these methods, as they allow DynamicRecord to be reused. If we add the builder, that won't be possible anymore.
return tableName;
}

public void setTableName(String tableName) {

We currently use these setters here to allow for Flink's object reuse mode:
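The object-reuse argument can be sketched like this. ReusableRecord and the map method are illustrative stand-ins, not the actual sink code:

```java
// Hedged sketch of why setters support Flink's object reuse mode: a single
// long-lived record instance is mutated and re-emitted per input element,
// instead of allocating a fresh object each time. A builder producing
// immutable objects would force a new allocation per element.
public class ObjectReuseSketch {
  public static class ReusableRecord {
    private String tableName;
    private String payload;

    public void setTableName(String tableName) { this.tableName = tableName; }
    public void setPayload(String payload) { this.payload = payload; }
    public String tableName() { return tableName; }
    public String payload() { return payload; }
  }

  // One instance, reused across invocations (as Flink does when object
  // reuse is enabled via ExecutionConfig#enableObjectReuse).
  private final ReusableRecord reusable = new ReusableRecord();

  public ReusableRecord map(String tableName, String payload) {
    reusable.setTableName(tableName);
    reusable.setPayload(payload);
    return reusable;
  }
}
```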
This adds the classes around schema / spec comparison and evolution. A breakdown of the classes follows:

# CompareSchemasVisitor
Compares the user-provided schema against the current table schema.

# EvolveSchemaVisitor
Computes the changes required to the table schema to be compatible with the user-provided schema.

# PartitionSpecEvolution
Code for checking compatibility with the user-provided PartitionSpec and computing a set of changes to rewrite the PartitionSpec.

# TableDataCache
Cache which holds all relevant metadata of a table, like its name, branch, schema, and partition spec. Also holds a cache of past comparison results for a given table's schema and the user-provided input schema.

# TableUpdater
Core logic to compare and create/update a table given a user-provided input schema.

Broken out of apache#12424, depends on apache#12996.
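The compare-then-evolve idea can be sketched in a much-simplified form. The real visitors traverse full Iceberg Schema objects including nested types; this illustrative version only diffs flat field-name lists:

```java
// Heavily simplified sketch of the CompareSchemasVisitor / EvolveSchemaVisitor
// split: compare the user-provided fields against the table's current fields
// and compute the additions needed to make the table compatible.
// (Illustrative only; the real code works on org.apache.iceberg.Schema.)
import java.util.ArrayList;
import java.util.List;

public class SchemaEvolutionSketch {
  // Returns the fields present in the input schema but missing from the table.
  public static List<String> fieldsToAdd(List<String> tableFields, List<String> inputFields) {
    List<String> toAdd = new ArrayList<>();
    for (String field : inputFields) {
      if (!tableFields.contains(field)) {
        toAdd.add(field);
      }
    }
    return toAdd;
  }
}
```

A TableUpdater-style caller would then apply the computed additions to the table and cache the comparison result so identical input schemas skip the diff next time.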
This backports apache#12996 to Flink version 1.19. The patch applied cleanly, apart from re-adding an interface method in DynamicRecordInternalSerializer that was removed in 2.0.
This backports apache#12996 to Flink version 1.20. The patch applied cleanly, apart from re-adding an interface method in DynamicRecordInternalSerializer that was removed in 2.0.
…cordInternalSerializer to Flink 1.19 / 1.20 (apache#13246) backports apache#12996
This adds the user-facing type DynamicRecord, along with its internal representation DynamicRecordInternal and its type information and serializer.
Broken out of #12424.
The original PR is based on Flink 1.20. This version is based on Flink 2.0.