Skip to content

Spec: Clarify multi-arg transform behavior for different versions#9661

Merged
rdblue merged 12 commits intoapache:mainfrom
szehon-ho:multi_arg_spec_clarify
Feb 26, 2024
Merged

Spec: Clarify multi-arg transform behavior for different versions#9661
rdblue merged 12 commits intoapache:mainfrom
szehon-ho:multi_arg_spec_clarify

Conversation

@szehon-ho
Copy link
Copy Markdown
Member

This pr clarifies multi-arg transform behavior in relation to different Iceberg versions. It proposes to make the behavior default in V3 but enabled in V1/V2 with a new table config. It also cleans up some of the affected tables and notes.

This is a follow up on: #8579 based on the email thread discussion : https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0.

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Feb 5, 2024
@szehon-ho szehon-ho changed the title Spec: Clarify iceberg-spec.md Spec: Clarify multi-arg transform behavior for different versions Feb 5, 2024
Comment thread format/spec.md Outdated

1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
1. In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Points 1 and 2 here are existing ones, just organized in number points

@szehon-ho
Copy link
Copy Markdown
Member Author

@rdblue @advancedxy @aokolnychyi @emkornfield , wanted to get the conversation started, on this proposal to clarify for V1-V3 behaviors for multi-arg transforms as discussed, let me know your thoughts.

@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from 4fa6ac0 to 0e80614 Compare February 5, 2024 23:54
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from 0e80614 to 81b5dc7 Compare February 5, 2024 23:56
@manuzhang
Copy link
Copy Markdown
Member

@szehon-ho
Copy link
Copy Markdown
Member Author

szehon-ho commented Feb 6, 2024

@manuzhang i believe #8579 is not published yet, hence wanted to get this change in before the 1.5 release, if we want to add the clarification. Let me know if I am mistaken though about the new docs site flow.

@advancedxy
Copy link
Copy Markdown
Contributor

Thanks for taking this over @szehon-ho, I will review it in today or tomorrow.

hence wanted to get this change in before the 1.5 release

I do agree that we should get the clarification in before the 1.5 release.

Comment thread format/spec.md
Each partition field in the fields list is stored as an object. See the table for more detail:
Each partition field in the `fields` is stored as a JSON object with the following properties.

| V1 | V2 | V3 | Field | JSON representation | Example |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no mention of V3 format before this. Readers don't know its existence.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mentioned in appendix E already.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I said "before". Meanwhile, multi-arg transform is not mention in appendix E.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it seems the problem existed before then (that V3 is mentioned without a proper introduction), maybe we can tackle it in another PR if we go ahead with this one?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems the problem existed before then (that V3 is mentioned without a proper introduction)

Maybe v3 format is not completed and adopted by the community.

How about we introduce multi-arg transform in the ### Partitioning and ### Sorting section and point it to the details in the appendix E. In the appendix, we can write detailed documentation about which compatibility flag to use and how partition field and sort field are json serialized?

Something like this:

### Partitioning
... omitted ...
Tables are configured with a **partition spec** that defines how to produce a tuple of partition values from a record. A partition spec has a list of fields that consist of:

*   A **source column id** or a list of **source column ids** from the table’s schema
*   A **partition field id** that is used to identify a partition field and is unique within a partition spec. In v2 table metadata, it is unique across all partition specs.
*   A **transform** that is applied to the source column(s)[1] to produce a partition value
*   A **partition name**

... omitted ...

Partition field IDs must be reused if an existing partition spec contains an equivalent field.

Note:
1. multi-arg transform is added in format Version 3. For details on how multi-arg transform is serialized in JSON, see appendix E

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I made an attempt at this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a section about V1 and V2 versions at the beginning. What if we extend it and say that the V3 spec hasn't been adopted yet and under active development?

Versions 1 and 2 of the Iceberg spec are complete and adopted by the community. Version 3 is under active development and has not been formally adopted.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread format/spec.md Outdated
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from 73a8971 to 508350d Compare February 8, 2024 21:49
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from 508350d to 58026cc Compare February 8, 2024 21:58
Comment thread format/spec.md
Each partition field in the fields list is stored as an object. See the table for more detail:
Each partition field in the `fields` is stored as a JSON object with the following properties.

| V1 | V2 | V3 | Field | JSON representation | Example |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a section about V1 and V2 versions at the beginning. What if we extend it and say that the V3 spec hasn't been adopted yet and under active development?

Versions 1 and 2 of the Iceberg spec are complete and adopted by the community. Version 3 is under active development and has not been formally adopted.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from d3e8749 to 70d92af Compare February 10, 2024 00:16
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from f44a76f to 02658ce Compare February 12, 2024 19:28
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from 02658ce to 6c9c43b Compare February 12, 2024 19:31
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
1. In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec.
2. The `field-id` property was added for each partition field in v2. In v1, the reference implementation assigned field ids sequentially in each spec starting at 1,000. See Partition Evolution for more details.
3. For tables of version < V3, the ID of the source field of each partition field is set in `source-id`. For tables of version >= V3, the ID(s) of the source field(s) is set on `source-ids`, and `source-id` is omitted. See Appendix E for more details.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than moving the first two clarifications into notes, I'd probably remove notes and just add a paragraph on source-id vs source-ids.

For the paragraph on source IDs, I think it should be something like this:

Transforms that accept multiple arguments specify source field IDs using source-ids instead of source-id.

Writers producing v1 and v2 metadata should continue to produce the source-id field for older readers that require it by setting it to the first ID from the source-ids list. Older versions of the reference implementation can read tables with unknown transforms and will ignore multi-arg transforms, but other implementations may break if they encounter unknown transform names.

Writers producing v3 metadata should omit the source-id field because v3 readers are required to support multi-arg transforms and accept the source-ids field.

Copy link
Copy Markdown
Member Author

@szehon-ho szehon-ho Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue I added these paragraphs.

I added some minor clarification to parts that made me have to read twice. Clarified 'writers' to 'writers producing these transforms.' and used 'additionally' in the V1/V2 case to be more clear it is populated in addition to 'source-ids'. Let me know if that sounds ok.

Transforms that accept multiple arguments specify source field IDs using source-ids instead of source-id. Writers producing these transforms in v1 and v2 metadata should additionally produce the source-id field by setting it to the first ID from the source-ids list. Writers producing these transforms in v3 metadata should populate only the source-ids field because v3 readers will fully-support multi-arg transforms by reading this field.

This sentence actually made me think a bit:

Older versions of the reference implementation can read tables with unknown transforms and will ignore multi-arg transforms, but other implementations may break if they encounter unknown transform names.

I was thinking to pull out the sentence to a next paragraph, as it seems its a more general statement and the flow of the paragraph is better without it, let me know what you think. I was also thinking it may make sense to just say this for all unknown transforms, without having to mention multi-arg in particular, something like:

Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms.

What do you think?

Comment thread format/spec.md Outdated
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from 00a5416 to bf86446 Compare February 14, 2024 00:41
Copy link
Copy Markdown
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, otherwise LGTM.

Comment thread format/spec.md
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated

Writing older version metadata:

* For single-arg transforms, partition field and sort order field `source-id` should be written; `source-ids` must be omitted
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source-ids must be omitted

source-ids is optional for V1 and V2 metadata, therefore this sentence could be removed? it's up to the implementations to decide whether to emit source-ids for V1/V2 metadata?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me , it is optional for multi-arg transforms, I dont see much point to allow implementations to write it for single-arg transform (although fair point that it should not hurt).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question is how to handle a case where a field has both source-id and source-ids. Here, I would simply state that it must be consistent:

For a single-arg transform, partition field and sort order field source-id must be written; if source-ids is also written it must be a list of one ID that matches the source-id field."

We can also state above under "Reading v1 or v2 metadata" that for a single-arg transform, source-id takes precedence over source-ids although we may not want to specify this either.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea you guys are right, it is slightly clearer to write 'source-ids' as a single element list for single-arg transform on v1/v2 tables, changed.

Comment thread format/spec.md
* A **partition name**

The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.
The source columns, selected by ids, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest adding a note here that addresses compatibility, rather than only noting it in Appendix C. The problem is that it doesn't really fit here. I think a good solution is to note compatibility with any multi-arg transforms that are defined in the next section.

Since we don't have any multi-arg transforms right now, I think we can skip it for now, but we should definitely call out the compatibility of transforms that may not be supported in v1 and v2.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. iiuc, implementations would optionally support them in v1/v2 based on a flag, and required to support them in v3.

Comment thread format/spec.md Outdated

1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1.
Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we state that in v3, readers are required to ignore unknown transforms and are not allowed to write to tables that use unknown transforms in the target partition spec?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added here and also in Appendix E

Comment thread format/spec.md Outdated
Copy link
Copy Markdown
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested some clarifications, but I think these changes look great overall. Thanks for pushing this through, @szehon-ho and @advancedxy!

@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch 2 times, most recently from aa77268 to acc6169 Compare February 26, 2024 22:34
@szehon-ho szehon-ho force-pushed the multi_arg_spec_clarify branch from acc6169 to 7534ce1 Compare February 26, 2024 22:36
Comment thread format/spec.md Outdated
In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms. In v1 and v2 metadata, writers must always write `source-id`; for multi-arg transforms, writers must produce `source-ids` and set `source-id` to the first ID from the field ID list.

Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms.
Older versions of the reference implementation can read tables with transforms unknown to it, ignoring them. But other implementations may break if they encounter unknown transforms. All v3 readers are required to read tables with unknown transforms, ignoring them. Writers should not write to tables with unknown transforms.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is okay for now, but the constraint for writers with an unknown transform is a bit more relaxed. Sort orders are best effort... so technically it's up to the writer. Similarly, the table's partition spec is the default spec because there may be more than one spec that is valid in a table. Neither of these cases is necessarily blocking so "should" is a strong word to use. I'd remove that language here for the sort order, and update the partition spec language to "Writers should not write using partition specs that use unknown transforms".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and made this slight edit to unblock this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you are right, thanks for the change.

@rdblue rdblue merged commit c18cb96 into apache:main Feb 26, 2024
@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Feb 26, 2024

Merging this. Thanks, @szehon-ho!

bitsondatadev pushed a commit to bitsondatadev/iceberg that referenced this pull request Mar 3, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Fokko added a commit to Fokko/iceberg that referenced this pull request Feb 3, 2025
I was looking at adding support for `source-ids` in PyIceberg,
but noticed that it was also still lacking for Java.

I've noticed that `source-ids` are also backported to V1 and V2
tables, which suprised me, since this might break existing V2
implementations that are not aware of the `source-ids`.

See apache#9661
And more specific: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0

I think it would be good to only allow multi-arg transforms from V3 onwards.
Fokko added a commit to Fokko/iceberg that referenced this pull request Feb 3, 2025
I was looking at adding support for `source-ids` in PyIceberg,
but noticed that it was also still lacking for Java.

I've noticed that `source-ids` are also backported to V1 and V2
tables, which suprised me, since this might break existing V2
implementations that are not aware of the `source-ids`.

See apache#9661
And more specific: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0

I think it would be good to only allow multi-arg transforms from V3 onwards.
Fokko added a commit that referenced this pull request Feb 13, 2025
I was looking at adding support for `source-ids` in PyIceberg,
but noticed that it was also still lacking for Java.

I've noticed that `source-ids` are also backported to V1 and V2
tables, which suprised me, since this might break existing V2
implementations that are not aware of the `source-ids`.

See #9661
And more specific: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0

I think it would be good to only allow multi-arg transforms from V3 onwards.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants