Spec: Clarify multi-arg transform behavior for different versions #9661
rdblue merged 12 commits into apache:main
| 1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted. | ||
| 2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1. | ||
| 1. In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec. |
There was a problem hiding this comment.
Points 1 and 2 here are existing ones, just organized in number points
|
@rdblue @advancedxy @aokolnychyi @emkornfield, I wanted to get the conversation started on this proposal to clarify the V1-V3 behavior for multi-arg transforms, as discussed. Let me know your thoughts.
4fa6ac0 to 0e80614
0e80614 to 81b5dc7
|
Do we also need to update https://github.com/apache/iceberg/blob/main/site/docs/spec.md?
|
@manuzhang I believe #8579 is not published yet, so I wanted to get this change in before the 1.5 release if we want to add the clarification. Let me know if I am mistaken about the new docs site flow, though.
|
Thanks for taking this over @szehon-ho, I will review it today or tomorrow.
I do agree that we should get the clarification in before the 1.5 release.
| Each partition field in the fields list is stored as an object. See the table for more detail: | ||
| Each partition field in the `fields` is stored as a JSON object with the following properties. | ||
|
|
||
| | V1 | V2 | V3 | Field | JSON representation | Example | |
There was a problem hiding this comment.
There's no mention of V3 format before this. Readers don't know its existence.
There was a problem hiding this comment.
It's mentioned in appendix E already.
There was a problem hiding this comment.
That's why I said "before". Meanwhile, multi-arg transforms are not mentioned in Appendix E.
There was a problem hiding this comment.
OK, it seems the problem existed before then (that V3 is mentioned without a proper introduction), maybe we can tackle it in another PR if we go ahead with this one?
There was a problem hiding this comment.
> it seems the problem existed before then (that V3 is mentioned without a proper introduction)

Maybe the v3 format is not yet completed and adopted by the community.
How about we introduce multi-arg transforms in the ### Partitioning and ### Sorting sections and point to the details in Appendix E? In the appendix, we can document in detail which compatibility flag to use and how partition fields and sort fields are serialized to JSON.
Something like this:
### Partitioning
... omitted ...
Tables are configured with a **partition spec** that defines how to produce a tuple of partition values from a record. A partition spec has a list of fields that consist of:
* A **source column id** or a list of **source column ids** from the table’s schema
* A **partition field id** that is used to identify a partition field and is unique within a partition spec. In v2 table metadata, it is unique across all partition specs.
* A **transform** that is applied to the source column(s)[1] to produce a partition value
* A **partition name**
... omitted ...
Partition field IDs must be reused if an existing partition spec contains an equivalent field.
Note:
1. multi-arg transform is added in format Version 3. For details on how multi-arg transform is serialized in JSON, see Appendix E.
There was a problem hiding this comment.
OK, I made an attempt at this.
There was a problem hiding this comment.
There is a section about V1 and V2 versions at the beginning. What if we extend it and say that the V3 spec hasn't been adopted yet and is under active development?
Versions 1 and 2 of the Iceberg spec are complete and adopted by the community. Version 3 is under active development and has not been formally adopted.
73a8971 to 508350d
508350d to 58026cc
d3e8749 to 70d92af
f44a76f to 02658ce
02658ce to 6c9c43b
| 2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1. | ||
| 1. In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated `partition-spec` field in table metadata. The object format should be used unless otherwise noted in this spec. | ||
| 2. The `field-id` property was added for each partition field in v2. In v1, the reference implementation assigned field ids sequentially in each spec starting at 1,000. See Partition Evolution for more details. | ||
| 3. For tables of version < V3, the ID of the source field of each partition field is set in `source-id`. For tables of version >= V3, the ID(s) of the source field(s) is set on `source-ids`, and `source-id` is omitted. See Appendix E for more details. |
There was a problem hiding this comment.
Rather than moving the first two clarifications into notes, I'd probably remove notes and just add a paragraph on source-id vs source-ids.
For the paragraph on source IDs, I think it should be something like this:
> Transforms that accept multiple arguments specify source field IDs using `source-ids` instead of `source-id`.
>
> Writers producing v1 and v2 metadata should continue to produce the `source-id` field for older readers that require it by setting it to the first ID from the `source-ids` list. Older versions of the reference implementation can read tables with unknown transforms and will ignore multi-arg transforms, but other implementations may break if they encounter unknown transform names.
>
> Writers producing v3 metadata should omit the `source-id` field because v3 readers are required to support multi-arg transforms and accept the `source-ids` field.
There was a problem hiding this comment.
@rdblue I added these paragraphs.
I added some minor clarifications to the parts I had to read twice: I changed 'writers' to 'writers producing these transforms' and used 'additionally' in the V1/V2 case to make it clearer that `source-id` is populated in addition to `source-ids`. Let me know if that sounds OK.
> Transforms that accept multiple arguments specify source field IDs using `source-ids` instead of `source-id`. Writers producing these transforms in v1 and v2 metadata should additionally produce the `source-id` field by setting it to the first ID from the `source-ids` list. Writers producing these transforms in v3 metadata should populate only the `source-ids` field because v3 readers will fully support multi-arg transforms by reading this field.
This sentence actually made me think a bit:

> Older versions of the reference implementation can read tables with unknown transforms and will ignore multi-arg transforms, but other implementations may break if they encounter unknown transform names.

I was thinking of pulling that sentence out into its own paragraph, since it seems to be a more general statement and the paragraph flows better without it. It may also make sense to say this for all unknown transforms, without mentioning multi-arg in particular, something like:

> Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms.

What do you think?
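
To make the proposed wording concrete, here is a minimal sketch of how a writer might serialize a multi-arg partition field under the paragraph quoted above. The helper name `multi_arg_field_to_json` and the transform name `example_multi_arg` are hypothetical (the spec defines no multi-arg transform yet); only the property names `source-id`, `source-ids`, `field-id`, `name` and `transform` come from the spec text under review.

```python
from typing import Any, Dict, List


def multi_arg_field_to_json(
    name: str,
    transform: str,
    source_ids: List[int],
    field_id: int,
    format_version: int,
) -> Dict[str, Any]:
    """Hypothetical helper: build the JSON object for a multi-arg partition field."""
    if not source_ids:
        raise ValueError("a transform needs at least one source field ID")

    field: Dict[str, Any] = {
        "source-ids": list(source_ids),
        "field-id": field_id,
        "name": name,
        "transform": transform,
    }
    if format_version < 3:
        # v1/v2: additionally write source-id, set to the first ID from
        # source-ids, for older readers that require the field.
        field["source-id"] = source_ids[0]
    # v3: only source-ids is populated, so source-id is left out.
    return field


v2_field = multi_arg_field_to_json("p", "example_multi_arg", [3, 5], 1000, 2)
v3_field = multi_arg_field_to_json("p", "example_multi_arg", [3, 5], 1000, 3)
```

With these inputs, `v2_field` carries both `source-id` and `source-ids`, while `v3_field` carries only `source-ids`.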
00a5416 to bf86446
advancedxy left a comment
There was a problem hiding this comment.
Some minor comments, otherwise LGTM.
|
|
||
| Writing older version metadata: | ||
|
|
||
| * For single-arg transforms, partition field and sort order field `source-id` should be written; `source-ids` must be omitted |
There was a problem hiding this comment.
> `source-ids` must be omitted

`source-ids` is optional for V1 and V2 metadata, so could this sentence be removed? It would then be up to implementations to decide whether to emit `source-ids` for V1/V2 metadata.
There was a problem hiding this comment.
To me, it is optional for multi-arg transforms. I don't see much point in allowing implementations to write it for single-arg transforms (although, fair point, it should not hurt).
There was a problem hiding this comment.
I think the question is how to handle a case where a field has both source-id and source-ids. Here, I would simply state that it must be consistent:
> For a single-arg transform, partition field and sort order field `source-id` must be written; if `source-ids` is also written it must be a list of one ID that matches the `source-id` field.

We can also state above under "Reading v1 or v2 metadata" that for a single-arg transform, source-id takes precedence over source-ids although we may not want to specify this either.
There was a problem hiding this comment.
Yeah, you are both right: it is slightly clearer to write `source-ids` as a single-element list for single-arg transforms on v1/v2 tables. Changed.
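
The consistency rule suggested in this thread can be sketched as a small reader-side check. This is only an illustration under assumptions: `single_arg_source_id` is a hypothetical helper, and raising `ValueError` on inconsistent metadata is one possible choice, since the thread leaves open whether the spec should define precedence instead.

```python
from typing import Any, Dict


def single_arg_source_id(field: Dict[str, Any]) -> int:
    """Hypothetical check for a single-arg transform field in v1/v2 metadata."""
    if "source-id" not in field:
        raise ValueError("source-id must be written for a single-arg transform")

    source_id = field["source-id"]
    source_ids = field.get("source-ids")
    # If source-ids is also written, it must be a list of one ID that
    # matches source-id.
    if source_ids is not None and source_ids != [source_id]:
        raise ValueError(
            f"source-ids {source_ids} is inconsistent with source-id {source_id}"
        )
    return source_id
```

A field with only `source-id`, or with a matching one-element `source-ids`, passes; anything else is rejected.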
| * A **partition name** | ||
|
|
||
| The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C. | ||
| The source columns, selected by ids, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct. For details on how to serialize a partition spec to JSON, see Appendix C. |
There was a problem hiding this comment.
I was going to suggest adding a note here that addresses compatibility, rather than only noting it in Appendix C. The problem is that it doesn't really fit here. I think a good solution is to note compatibility with any multi-arg transforms that are defined in the next section.
Since we don't have any multi-arg transforms right now, I think we can skip it for now, but we should definitely call out the compatibility of transforms that may not be supported in v1 and v2.
There was a problem hiding this comment.
Makes sense. IIUC, implementations would optionally support them in v1/v2 based on a flag, and would be required to support them in v3.
|
|
||
| 1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted. | ||
| 2. For partition fields with a transform of multiple arguments, the IDs of the source fields are set on `source-ids`. To preserve backward compatibility, `source-id` is set to -1. | ||
| Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms. |
There was a problem hiding this comment.
Should we state that in v3, readers are required to ignore unknown transforms and are not allowed to write to tables that use unknown transforms in the target partition spec?
There was a problem hiding this comment.
Done, added here and also in Appendix E
rdblue left a comment
There was a problem hiding this comment.
I suggested some clarifications, but I think these changes look great overall. Thanks for pushing this through, @szehon-ho and @advancedxy!
aa77268 to acc6169
acc6169 to 7534ce1
| In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms. In v1 and v2 metadata, writers must always write `source-id`; for multi-arg transforms, writers must produce `source-ids` and set `source-id` to the first ID from the field ID list. | ||
|
|
||
| Older versions of the reference implementation can read tables with transforms unknown to it, without the ability to push down filters or write. But other implementations may break if they encounter unknown transforms. | ||
| Older versions of the reference implementation can read tables with transforms unknown to it, ignoring them. But other implementations may break if they encounter unknown transforms. All v3 readers are required to read tables with unknown transforms, ignoring them. Writers should not write to tables with unknown transforms. |
There was a problem hiding this comment.
This is okay for now, but the constraint for writers with an unknown transform is a bit more relaxed. Sort orders are best effort... so technically it's up to the writer. Similarly, the table's partition spec is the default spec because there may be more than one spec that is valid in a table. Neither of these cases is necessarily blocking so "should" is a strong word to use. I'd remove that language here for the sort order, and update the partition spec language to "Writers should not write using partition specs that use unknown transforms".
There was a problem hiding this comment.
I went ahead and made this slight edit to unblock this.
There was a problem hiding this comment.
Ah you are right, thanks for the change.
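
For readers following along, the unknown-transform behavior discussed in this thread could look roughly like the sketch below. The function names are hypothetical, the set of "known" transforms is an assumption about what one particular implementation understands, and `zorder` in the usage lines is just a made-up unknown transform name.

```python
import re
from typing import Any, Dict, List

# Assumption: the transforms this hypothetical implementation understands.
_KNOWN_TRANSFORM = re.compile(
    r"^(identity|void|year|month|day|hour|bucket\[\d+\]|truncate\[\d+\])$"
)


def unknown_transforms(fields: List[Dict[str, Any]]) -> List[str]:
    """Return the transform names in a partition spec that are not recognized."""
    return [f["transform"] for f in fields if not _KNOWN_TRANSFORM.match(f["transform"])]


def fields_usable_by_reader(fields: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Readers ignore fields with unknown transforms instead of failing."""
    return [f for f in fields if _KNOWN_TRANSFORM.match(f["transform"])]


def should_write_with_spec(fields: List[Dict[str, Any]]) -> bool:
    """Writers should not write using a partition spec with unknown transforms."""
    return not unknown_transforms(fields)


spec_fields = [
    {"name": "id_bucket", "transform": "bucket[16]", "source-id": 1},
    {"name": "z", "transform": "zorder", "source-ids": [2, 3]},  # made-up name
]
```

Here a reader would keep only `id_bucket`, and `should_write_with_spec(spec_fields)` is false, so a writer would pick a different spec rather than treat the table as unwritable.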
|
Merging this. Thanks, @szehon-ho!
I was looking at adding support for `source-ids` in PyIceberg, but noticed that it was also still lacking for Java. I've noticed that `source-ids` are also backported to V1 and V2 tables, which surprised me, since this might break existing V2 implementations that are not aware of `source-ids`. See #9661 and, more specifically, https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0. I think it would be good to only allow multi-arg transforms from V3 onwards.
This PR clarifies multi-arg transform behavior in relation to different Iceberg versions. It proposes to make the behavior default in V3 but enabled in V1/V2 with a new table config. It also cleans up some of the affected tables and notes.
This is a follow-up to #8579, based on the email thread discussion: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0.