Spec: Clarify multi-arg transform behavior for different versions by szehon-ho · Pull Request #9661 · apache/iceberg

szehon-ho · 2024-02-05T23:49:38Z

This pr clarifies multi-arg transform behavior in relation to different Iceberg versions. It proposes to make the behavior default in V3 but enabled in V1/V2 with a new table config. It also cleans up some of the affected tables and notes.

This is a follow up on: #8579 based on the email thread discussion : https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0.

szehon-ho · 2024-02-05T23:50:52Z

-
-1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
-2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
+1. In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec.


Points 1 and 2 here are existing ones, just organized in number points

szehon-ho · 2024-02-05T23:53:04Z

@rdblue @advancedxy @aokolnychyi @emkornfield , wanted to get the conversation started, on this proposal to clarify for V1-V3 behaviors for multi-arg transforms as discussed, let me know your thoughts.

manuzhang · 2024-02-06T01:53:58Z

Do we also need to update https://github.com/apache/iceberg/blob/main/site/docs/spec.md?

szehon-ho · 2024-02-06T02:25:59Z

@manuzhang i believe #8579 is not published yet, hence wanted to get this change in before the 1.5 release, if we want to add the clarification. Let me know if I am mistaken though about the new docs site flow.

advancedxy · 2024-02-06T02:46:42Z

Thanks for taking this over @szehon-ho, I will review it in today or tomorrow.

hence wanted to get this change in before the 1.5 release

I do agree that we should get the clarification in before the 1.5 release.

manuzhang · 2024-02-06T03:26:12Z

-Each partition field in the fields list is stored as an object. See the table for more detail:
+Each partition field in the `fields` is stored as a JSON object with the following properties.
+
+| V1       | V2       | V3       | Field            | JSON representation | Example      |


There's no mention of V3 format before this. Readers don't know its existence.

It's mentioned in appendix E already.

That's why I said "before". Meanwhile, multi-arg transform is not mention in appendix E.

OK, it seems the problem existed before then (that V3 is mentioned without a proper introduction), maybe we can tackle it in another PR if we go ahead with this one?

it seems the problem existed before then (that V3 is mentioned without a proper introduction)

Maybe v3 format is not completed and adopted by the community.

How about we introduce multi-arg transform in the ### Partitioning and ### Sorting section and point it to the details in the appendix E. In the appendix, we can write detailed documentation about which compatibility flag to use and how partition field and sort field are json serialized?

Something like this:

### Partitioning ... omitted ... Tables are configured with a **partition spec** that defines how to produce a tuple of partition values from a record. A partition spec has a list of fields that consist of: * A **source column id** or a list of **source column ids** from the table’s schema * A **partition field id** that is used to identify a partition field and is unique within a partition spec. In v2 table metadata, it is unique across all partition specs. * A **transform** that is applied to the source column(s)[1] to produce a partition value * A **partition name** ... omitted ... Partition field IDs must be reused if an existing partition spec contains an equivalent field. Note: 1. multi-arg transform is added in format Version 3. For details on how multi-arg transform is serialized in JSON, see appendix E

OK, I made an attempt at this.

There is a section about V1 and V2 versions at the beginning. What if we extend it and say that the V3 spec hasn't been adopted yet and under active development?

Versions 1 and 2 of the Iceberg spec are complete and adopted by the community. Version 3 is under active development and has not been formally adopted.

aokolnychyi · 2024-02-09T21:10:18Z

-Each partition field in the fields list is stored as an object. See the table for more detail:
+Each partition field in the `fields` is stored as a JSON object with the following properties.
+
+| V1       | V2       | V3       | Field            | JSON representation | Example      |


There is a section about V1 and V2 versions at the beginning. What if we extend it and say that the V3 spec hasn't been adopted yet and under active development?

Versions 1 and 2 of the Iceberg spec are complete and adopted by the community. Version 3 is under active development and has not been formally adopted.

rdblue · 2024-02-12T23:53:17Z

-2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
+1. In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec.
+2. The `field-id` property was added for each partition field in v2. In v1, the reference implementation assigned field ids sequentially in each spec starting at 1,000. See Partition Evolution for more details.
+3. For tables of version < V3, the ID of the source field of each partition field is set in `source-id`. For tables of version >= V3, the ID(s) of the source field(s) is set on `source-ids`, and `source-id` is omitted. See Appendix E for more details.


Rather than moving the first two clarifications into notes, I'd probably remove notes and just add a paragraph on source-id vs source-ids.

For the paragraph on source IDs, I think it should be something like this:

Transforms that accept multiple arguments specify source field IDs using source-ids instead of source-id.

Writers producing v1 and v2 metadata should continue to produce the source-id field for older readers that require it by setting it to the first ID from the source-ids list. Older versions of the reference implementation can read tables with unknown transforms and will ignore multi-arg transforms, but other implementations may break if they encounter unknown transform names.

Writers producing v3 metadata should omit the source-id field because v3 readers are required to support multi-arg transforms and accept the source-ids field.

@rdblue I added these paragraphs.

I added some minor clarification to parts that made me have to read twice. Clarified 'writers' to 'writers producing these transforms.' and used 'additionally' in the V1/V2 case to be more clear it is populated in addition to 'source-ids'. Let me know if that sounds ok.

Transforms that accept multiple arguments specify source field IDs using source-ids instead of source-id. Writers producing these transforms in v1 and v2 metadata should additionally produce the source-id field by setting it to the first ID from the source-ids list. Writers producing these transforms in v3 metadata should populate only the source-ids field because v3 readers will fully-support multi-arg transforms by reading this field.

This sentence actually made me think a bit:

Older versions of the reference implementation can read tables with unknown transforms and will ignore multi-arg transforms, but other implementations may break if they encounter unknown transform names.

I was thinking to pull out the sentence to a next paragraph, as it seems its a more general statement and the flow of the paragraph is better without it, let me know what you think. I was also thinking it may make sense to just say this for all unknown transforms, without having to mention multi-arg in particular, something like:

Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms.

What do you think?

advancedxy

Some minor comments, otherwise LGTM.

advancedxy · 2024-02-18T06:40:07Z

+
+Writing older version metadata:
+
+* For single-arg transforms, partition field and sort order field `source-id` should be written; `source-ids` must be omitted


source-ids must be omitted

source-ids is optional for V1 and V2 metadata, therefore this sentence could be removed? it's up to the implementations to decide whether to emit source-ids for V1/V2 metadata?

To me , it is optional for multi-arg transforms, I dont see much point to allow implementations to write it for single-arg transform (although fair point that it should not hurt).

I think the question is how to handle a case where a field has both source-id and source-ids. Here, I would simply state that it must be consistent:

For a single-arg transform, partition field and sort order field source-id must be written; if source-ids is also written it must be a list of one ID that matches the source-id field."

We can also state above under "Reading v1 or v2 metadata" that for a single-arg transform, source-id takes precedence over source-ids although we may not want to specify this either.

Yea you guys are right, it is slightly clearer to write 'source-ids' as a single element list for single-arg transform on v1/v2 tables, changed.

rdblue · 2024-02-25T23:29:37Z

 *   A **partition name**

-The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.
+The source columns, selected by ids, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.


I was going to suggest adding a note here that addresses compatibility, rather than only noting it in Appendix C. The problem is that it doesn't really fit here. I think a good solution is to note compatibility with any multi-arg transforms that are defined in the next section.

Since we don't have any multi-arg transforms right now, I think we can skip it for now, but we should definitely call out the compatibility of transforms that may not be supported in v1 and v2.

Makes sense. iiuc, implementations would optionally support them in v1/v2 based on a flag, and required to support them in v3.

rdblue · 2024-02-25T23:36:12Z


-1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
-2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
+Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms.


Should we state that in v3, readers are required to ignore unknown transforms and are not allowed to write to tables that use unknown transforms in the target partition spec?

Done, added here and also in Appendix E

rdblue

I suggested some clarifications, but I think these changes look great overall. Thanks for pushing this through, @szehon-ho and @advancedxy!

rdblue · 2024-02-26T23:18:36Z

+In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms. In v1 and v2 metadata, writers must always write `source-id`; for multi-arg transforms, writers must produce `source-ids` and set `source-id` to the first ID from the field ID list.

-Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms.
+Older versions of the reference implementation can read tables with transforms unknown to it, ignoring them. But other implementations may break if they encounter unknown transforms. All v3 readers are required to read tables with unknown transforms, ignoring them. Writers should not write to tables with unknown transforms.


This is okay for now, but the constraint for writers with an unknown transform is a bit more relaxed. Sort orders are best effort... so technically it's up to the writer. Similarly, the table's partition spec is the default spec because there may be more than one spec that is valid in a table. Neither of these cases is necessarily blocking so "should" is a strong word to use. I'd remove that language here for the sort order, and update the partition spec language to "Writers should not write using partition specs that use unknown transforms".

I went ahead and made this slight edit to unblock this.

Ah you are right, thanks for the change.

rdblue · 2024-02-26T23:21:40Z

Merging this. Thanks, @szehon-ho!

…ache#9661)

I was looking at adding support for `source-ids` in PyIceberg, but noticed that it was also still lacking for Java. I've noticed that `source-ids` are also backported to V1 and V2 tables, which suprised me, since this might break existing V2 implementations that are not aware of the `source-ids`. See apache#9661 And more specific: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0 I think it would be good to only allow multi-arg transforms from V3 onwards.

I was looking at adding support for `source-ids` in PyIceberg, but noticed that it was also still lacking for Java. I've noticed that `source-ids` are also backported to V1 and V2 tables, which suprised me, since this might break existing V2 implementations that are not aware of the `source-ids`. See #9661 And more specific: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0 I think it would be good to only allow multi-arg transforms from V3 onwards.

github-actions Bot added the Specification Issues that may introduce spec changes. label Feb 5, 2024

szehon-ho changed the title ~~Spec: Clarify iceberg-spec.md~~ Spec: Clarify multi-arg transform behavior for different versions Feb 5, 2024

szehon-ho commented Feb 5, 2024

View reviewed changes

szehon-ho force-pushed the multi_arg_spec_clarify branch from 4fa6ac0 to 0e80614 Compare February 5, 2024 23:54

Spec: Clarify multi-arg transform behavior for different versions

81b5dc7

szehon-ho force-pushed the multi_arg_spec_clarify branch from 0e80614 to 81b5dc7 Compare February 5, 2024 23:56

advancedxy mentioned this pull request Feb 6, 2024

API, Core: add multi-arg transform and add zOrder as the first one #9662

Closed

manuzhang reviewed Feb 6, 2024

View reviewed changes

advancedxy reviewed Feb 7, 2024

View reviewed changes

Comment thread format/spec.md Outdated

szehon-ho force-pushed the multi_arg_spec_clarify branch from 73a8971 to 508350d Compare February 8, 2024 21:49

Review comments

58026cc

szehon-ho force-pushed the multi_arg_spec_clarify branch from 508350d to 58026cc Compare February 8, 2024 21:58

aokolnychyi reviewed Feb 9, 2024

View reviewed changes

More review comments

70d92af

szehon-ho force-pushed the multi_arg_spec_clarify branch from d3e8749 to 70d92af Compare February 10, 2024 00:16

szehon-ho added 2 commits February 12, 2024 11:02

Add missing statement

e8c962c

Minor change

fcf2a63

szehon-ho force-pushed the multi_arg_spec_clarify branch from f44a76f to 02658ce Compare February 12, 2024 19:28

Fill relevant section in Appendix E

6c9c43b

szehon-ho force-pushed the multi_arg_spec_clarify branch from 02658ce to 6c9c43b Compare February 12, 2024 19:31

rdblue reviewed Feb 12, 2024

View reviewed changes

Comment thread format/spec.md Outdated

rdblue reviewed Feb 12, 2024

View reviewed changes

Comment thread format/spec.md Outdated

rdblue reviewed Feb 12, 2024

View reviewed changes

Comment thread format/spec.md Outdated

rdblue reviewed Feb 12, 2024

View reviewed changes

Comment thread format/spec.md Outdated

Address comments

740d887

szehon-ho added 2 commits February 13, 2024 14:56

Adapting statement to match pr comment for review

f2ca9ac

Replicate fix of type to sort order

bf86446

szehon-ho force-pushed the multi_arg_spec_clarify branch from 00a5416 to bf86446 Compare February 14, 2024 00:41

advancedxy reviewed Feb 18, 2024

View reviewed changes

Review comments

4c3f294

advancedxy approved these changes Feb 19, 2024

View reviewed changes

rdblue reviewed Feb 25, 2024

View reviewed changes

Comment thread format/spec.md Outdated

rdblue approved these changes Feb 25, 2024

View reviewed changes

szehon-ho force-pushed the multi_arg_spec_clarify branch 2 times, most recently from aa77268 to acc6169 Compare February 26, 2024 22:34

Review comments

7534ce1

szehon-ho force-pushed the multi_arg_spec_clarify branch from acc6169 to 7534ce1 Compare February 26, 2024 22:36

rdblue reviewed Feb 26, 2024

View reviewed changes

Clarify wording about unknown transforms.

3c5cd1d

rdblue approved these changes Feb 26, 2024

View reviewed changes

rdblue merged commit c18cb96 into apache:main Feb 26, 2024

bitsondatadev pushed a commit to bitsondatadev/iceberg that referenced this pull request Mar 3, 2024

Spec: Clarify multi-arg transform behavior for different versions (ap…

685a9e5

…ache#9661)

devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024

Spec: Clarify multi-arg transform behavior for different versions (ap…

6908420

…ache#9661)

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Spec: Clarify multi-arg transform behavior for different versions (ap…

0b9a480

…ache#9661)

Fokko mentioned this pull request Feb 3, 2025

spec: Remove source-ids for V{1,2} tables #12161

Merged


		Writing older version metadata:

		* For single-arg transforms, partition field and sort order field `source-id` should be written; `source-ids` must be omitted

Conversation

szehon-ho commented Feb 5, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

szehon-ho commented Feb 5, 2024

Uh oh!

manuzhang commented Feb 6, 2024

Uh oh!

szehon-ho commented Feb 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

advancedxy commented Feb 6, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

szehon-ho Feb 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

advancedxy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rdblue left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

szehon-ho commented Feb 6, 2024 •

edited

Loading

szehon-ho Feb 13, 2024 •

edited

Loading