spec: Remove source-ids for V{1,2} tables#12161
Conversation
I was looking at adding support for `source-ids` in PyIceberg, but noticed that it was also still lacking for Java. I've noticed that `source-ids` are also backported to V1 and V2 tables, which suprised me, since this might break existing V2 implementations that are not aware of the `source-ids`. See apache#9661 And more specific: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0 I think it would be good to only allow multi-arg transforms from V3 onwards.
|
@Fokko Thanks ! As discussed together, I was about to propose this, so totally agree ! |
|
By the way I gonna start impl side of the thing :) |
SGTM, if you do the Java side, I'm happy to take care of PyIceberg 👍 |
| |----------|----------|----------|------------------|---------------------|--------------| | ||
| | required | required | omitted | **`source-id`** | `JSON int` | 1 | | ||
| | optional | optional | required | **`source-ids`** | `JSON list of ints` | `[1,2]` | | ||
| | | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` | |
|
@Fokko Thanks for bring this up. I'm supporting this to only allow multi-arg transforms for V3 onwards. When multi-arg transform was first added to the spec, V3 was far from completion and we want to make it possible for V2 to support that. I think it's good timing to reconsider that as the multi-arg transform is still lacking in the java side.
To clarify, this might not happen. All the existing transform should be the singe-arg transform. For multi-arg transforms, it shall be added to the spec first and other implementation should be aware of the new transforms if it's spec compliant. |
|
@advancedxy Thanks for jumping in here. Yes, I hope to complete the V3 spec soon. By simplifying the impl as suggested in this spec, it will also be easier to implement in PyIceberg. Feel free to chime in on the dev list as well: https://lists.apache.org/thread/4njo7dmxmc50pmj9fw972bvo91r0bl8s This way it is also part of the official docs. |
|
Thanks @Fokko initially was hoping that we could support it for v1/v2 tables but understand the issues you are pointing out. I guess we can patiently wait v3 tables then! |
I was looking at adding support for
source-idsin PyIceberg, but noticed that it was also still lacking for Java.I've noticed that
source-idsare also backported to V1 and V2 tables, which surprised me since this might break existing V2 implementations that are unaware of thesource-ids.This PR reconsiders #9661
And more specifically: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0
It would be good to only allow multi-arg transforms from V3 onwards, and avoid having some implementations support this by setting a flag. Other implementations might not be aware of this implementation and drop the 2nd argument onward:
{ "source-id": 19, "source-ids": [19, 25], "field-id": 1000, "name": "ts_bucket", "transform": "bucket" }The V2 implementation that is unaware of the
source-ids(PyIceberg, Iceberg-Rust and others), would produce:{ "source-id": 19, "field-id": 1000, "name": "ts_bucket", "transform": "bucket" }Breaking the partitioning silently 😱
cc @rdblue @szehon-ho @advancedxy @jbonofre