-
Notifications
You must be signed in to change notification settings - Fork 3k
Spec: Clarify spec_id field in Data File #8730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| | | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | | ||
| | _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | | ||
|
|
||
| | _optional_ | _optional_ | **`141 spec_id`** | `int` | ID representing partition spec for this file [4]. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if we inherit from the manifest I think we may not want to add this here since it may confuse folks since this spec_id doesn't matter even if it's written. Maybe this change should just be limited towards clarifying why 141 is reserved. Thoughts @szehon-ho @Fokko @rdblue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is definitely confusing. I think should not reserve an ID for it if we don't write it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't optional. The field ID is reserved, but it should never be written to a metadata file.
We need to reserve an ID for it because it is present in metadata tables. We don't want to have conflicting IDs when reading these files. I think all we need to communicate is that no one should write any field with ID 141, which is why we noted it was reserved here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on not adding it here but mentioning this ID is reserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about keeping it blank? This means that the field should not be written.
| | _optional_ | _optional_ | **`141 spec_id`** | `int` | ID representing partition spec for this file [4]. | | |
| | | | **`141 spec_id`** | `int` | ID representing partition spec for this file [4]. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit confusing to have a type that isn't included in v1 or v2. I'd remove it from here and maybe clarify what it is in the footnote.
| 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper bounds. | ||
| 3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files. | ||
| 4. The following field ids are reserved on `data_file`: 141. | ||
| 4. Field ID 141 is reserved in `data_file` for `spec_id`` representing the partition spec. Note that in practice spec_id is not written in the data file and is inherited from the manifest file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a typo. spec_id has an extra backtick after it.
The spec_id isn't just not written in practice. It can be passed in a data file's in-memory representation using field ID 141, but that is not a requirement and it should never be written into a manifest.
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Fixes #8712 . This clarifies the
spec_idpartition spec field in DataFile. Note that in practice this field looks to not be written and is simply inherited from the manifest file. We already had a note indicating that the field 141 is reserved, but it wasn't clear what it was reserved for so this change adds those details to the spec.Some related context PRs:
1.) #1317
2.) https://github.com/apache/iceberg/pull/3015/files
3.) https://github.com/apache/iceberg/pull/4750/files