Skip to content

Conversation

@amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Oct 6, 2023

Fixes #8712 . This clarifies the spec_id partition spec field in DataFile. Note that in practice this field looks to not be written and is simply inherited from the manifest file. We already had a note indicating that the field 141 is reserved, but it wasn't clear what it was reserved for so this change adds those details to the spec.

Some related context PRs:

1.) #1317
2.) https://github.com/apache/iceberg/pull/3015/files
3.) https://github.com/apache/iceberg/pull/4750/files

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Oct 6, 2023
| | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |

| _optional_ | _optional_ | **`141 spec_id`** | `int` | ID representing partition spec for this file [4]. |
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if we inherit from the manifest I think we may not want to add this here since it may confuse folks since this spec_id doesn't matter even if it's written. Maybe this change should just be limited towards clarifying why 141 is reserved. Thoughts @szehon-ho @Fokko @rdblue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely confusing. I think should not reserve an ID for it if we don't write it.

Copy link
Contributor

@rdblue rdblue Oct 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't optional. The field ID is reserved, but it should never be written to a metadata file.

We need to reserve an ID for it because it is present in metadata tables. We don't want to have conflicting IDs when reading these files. I think all we need to communicate is that no one should write any field with ID 141, which is why we noted it was reserved here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on not adding it here but mentioning this ID is reserved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about keeping it blank? This means that the field should not be written.

Suggested change
| _optional_ | _optional_ | **`141 spec_id`** | `int` | ID representing partition spec for this file [4]. |
| | | **`141 spec_id`** | `int` | ID representing partition spec for this file [4]. |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit confusing to have a type that isn't included in v1 or v2. I'd remove it from here and maybe clarify what it is in the footnote.

2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper bounds.
3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.
4. The following field ids are reserved on `data_file`: 141.
4. Field ID 141 is reserved in `data_file` for `spec_id`` representing the partition spec. Note that in practice spec_id is not written in the data file and is inherited from the manifest file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a typo. spec_id has an extra backtick after it.

The spec_id isn't just not written in practice. It can be passed in a data file's in-memory representation using field ID 141, but that is not a requirement and it should never be written into a manifest.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 22, 2024
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes. stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spec: Add 141: spec_id and 142: schema_id to the spec

4 participants