Skip to content

spec: Remove source-ids for V{1,2} tables#12161

Merged
Fokko merged 1 commit intoapache:mainfrom
Fokko:fd-spec-revert-v2-multi-arg-transforms
Feb 13, 2025
Merged

spec: Remove source-ids for V{1,2} tables#12161
Fokko merged 1 commit intoapache:mainfrom
Fokko:fd-spec-revert-v2-multi-arg-transforms

Conversation

@Fokko
Copy link
Copy Markdown
Contributor

@Fokko Fokko commented Feb 3, 2025

I was looking at adding support for source-ids in PyIceberg, but noticed that it was also still lacking for Java.

I've noticed that source-ids are also backported to V1 and V2 tables, which surprised me since this might break existing V2 implementations that are unaware of the source-ids.

This PR reconsiders #9661
And more specifically: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0

It would be good to only allow multi-arg transforms from V3 onwards, and avoid having some implementations support this by setting a flag. Other implementations might not be aware of this implementation and drop the 2nd argument onward:

{
  "source-id": 19,
  "source-ids": [19, 25],
  "field-id": 1000,
  "name": "ts_bucket",
  "transform": "bucket"
}

The V2 implementation that is unaware of the source-ids (PyIceberg, Iceberg-Rust and others), would produce:

{
  "source-id": 19,
  "field-id": 1000,
  "name": "ts_bucket",
  "transform": "bucket"
}

Breaking the partitioning silently 😱

cc @rdblue @szehon-ho @advancedxy @jbonofre

I was looking at adding support for `source-ids` in PyIceberg,
but noticed that it was also still lacking for Java.

I've noticed that `source-ids` are also backported to V1 and V2
tables, which suprised me, since this might break existing V2
implementations that are not aware of the `source-ids`.

See apache#9661
And more specific: https://lists.apache.org/thread/9opgkrpqhzp3nl8hdohgnk1m1zxnxmq0

I think it would be good to only allow multi-arg transforms from V3 onwards.
@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Feb 3, 2025
@jbonofre
Copy link
Copy Markdown
Member

jbonofre commented Feb 3, 2025

@Fokko Thanks ! As discussed together, I was about to propose this, so totally agree !

Comment thread format/spec.md
@jbonofre
Copy link
Copy Markdown
Member

jbonofre commented Feb 3, 2025

By the way I gonna start impl side of the thing :)

@Fokko
Copy link
Copy Markdown
Contributor Author

Fokko commented Feb 3, 2025

By the way I gonna start impl side of the thing :)

SGTM, if you do the Java side, I'm happy to take care of PyIceberg 👍

Comment thread format/spec.md
|----------|----------|----------|------------------|---------------------|--------------|
| required | required | omitted | **`source-id`** | `JSON int` | 1 |
| optional | optional | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@advancedxy
Copy link
Copy Markdown
Contributor

@Fokko Thanks for bring this up. I'm supporting this to only allow multi-arg transforms for V3 onwards. When multi-arg transform was first added to the spec, V3 was far from completion and we want to make it possible for V2 to support that. I think it's good timing to reconsider that as the multi-arg transform is still lacking in the java side.

Other implementations might not be aware of this implementation and drop the 2nd argument onward:

To clarify, this might not happen. All the existing transform should be the singe-arg transform. For multi-arg transforms, it shall be added to the spec first and other implementation should be aware of the new transforms if it's spec compliant.

@Fokko
Copy link
Copy Markdown
Contributor Author

Fokko commented Feb 5, 2025

@advancedxy Thanks for jumping in here. Yes, I hope to complete the V3 spec soon. By simplifying the impl as suggested in this spec, it will also be easier to implement in PyIceberg. Feel free to chime in on the dev list as well: https://lists.apache.org/thread/4njo7dmxmc50pmj9fw972bvo91r0bl8s This way it is also part of the official docs.

@szehon-ho
Copy link
Copy Markdown
Member

Thanks @Fokko initially was hoping that we could support it for v1/v2 tables but understand the issues you are pointing out. I guess we can patiently wait v3 tables then!

@Fokko Fokko merged commit 1153c30 into apache:main Feb 13, 2025
@Fokko Fokko deleted the fd-spec-revert-v2-multi-arg-transforms branch February 13, 2025 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.