Increase Partition Start Id to 10000 #6369
d4bb25d to
c8f3e10
Compare
|
This seems like a reasonable change to me. Just a question for my better understanding: the tables that we have already written will still have their partition field IDs starting from 1000, right? So any tables that have more than 1000 columns and were written prior to this change will still have the collision with the partition field IDs, and it will only be fixed if they are rewritten, or at least their metadata is, right? |
yep |
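For readers following along, the collision described above can be sketched as follows. This is a hypothetical illustration, not Iceberg code: the class and method names are made up, and it assumes column IDs are assigned densely as 1..N and partition field IDs are assigned consecutively from a start value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Iceberg.
class FieldIdCollisionSketch {

  // Returns the IDs used both as a column ID (assumed to be 1..columnCount)
  // and as a partition field ID (assumed to be partitionStart, partitionStart + 1, ...).
  static List<Integer> collisions(int columnCount, int partitionStart, int partitionFields) {
    List<Integer> shared = new ArrayList<>();
    for (int i = 0; i < partitionFields; i++) {
      int partitionFieldId = partitionStart + i;
      if (partitionFieldId >= 1 && partitionFieldId <= columnCount) {
        shared.add(partitionFieldId);
      }
    }
    return shared;
  }
}
```

Under these assumptions, a table with 1200 columns and two partition fields starting at 1000 shares IDs 1000 and 1001, while the same table with a start of 10000 shares none. An existing table keeps whatever IDs were already written, which is the point of the question above.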
|
Thanks for the answer, @ayushtkn! Would it make sense? I can create a separate issue for this for further discussion and if it does make sense I could give the implementation a try. @RussellSpitzer ? |
|
@gaborkaszab I would probably just recommend dropping and recreating the table (via metadata) or having a separate utility for modifying existing tables. I really don't think many folks have 1000 columns, since we would have seen this before, so I don't think the upgrade procedure really needs to exist inside the Iceberg repo. I think this change is pretty small and harmless though. |
RussellSpitzer
left a comment
This looks good to me, but since this is a pretty core change I'd like at least one other committer to sign off. @szehon-ho or @aokolnychyi ?
szehon-ho
left a comment
lgtm
|
@szehon-ho https://www.youtube.com/watch?v=rR4n-0KYeKQ about the LGTM :) Just for fun. |
|
Double-checking the relevant part of the spec: we never actually demand that partition IDs start at 1000. So I think we are in the clear here from a backwards compatibility standpoint as well. |
|
I'd be really careful with this change. Even though the spec may not mention it directly, that was always our assumption. I will need to take a closer look in a bit. |
|
The area I'm worried about now is iceberg/api/src/main/java/org/apache/iceberg/PartitionSpec.java, lines 602 to 609 in 2918735,
which is a check for V1 tables that is only used in iceberg/core/src/main/java/org/apache/iceberg/TableMetadata.java, lines 617 to 626 in 6697129,
and iceberg/core/src/main/java/org/apache/iceberg/TableMetadata.java, lines 1409 to 1414 in 6697129.
So I may have to think about this more. |
RussellSpitzer
left a comment
I was feeling nervous, so now I think we should hold on this. I think we need a test that starts a table at 1000 and then, with the new start ID set to 10000, continues to work. Specifically, I think this is an issue for V1 tables; V2 tables are probably fine with the IDs being non-sequential.
I think with the current code a V1 table, whether created unpartitioned or partitioned, would become unmodifiable because its spec would no longer look sequential ... I think
|
I also checked if PartitionSpec.hasSequentialIds() could cause any issues with existing tables. The first use that you linked seems to be the case when we re-write the table metadata and it checks the new partition field ID. This should be fine in my opinion. iceberg/core/src/main/java/org/apache/iceberg/TableMetadata.java Lines 1409 to 1414 in 6697129
I can take another look tomorrow (it's kind of late now to think :) ) |
|
Thanks, folks. Just thinking about the sequentialId check: why does it need to rely on the start ID? Does changing that check like that help? |
|
Yeah, I think the hasSequential code needs to be modified. Otherwise I can have a table whose first spec id is 1000 before this patch, and then after this patch I try to add another field. The hasSequential check would see 1000 != 10000 and would throw an error. I'm not sure whether having 1000 and then 10000 would be OK in a v1 partition spec, so we need to check that. |
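The failure mode described in this comment can be sketched roughly like this. It is a simplified stand-in under stated assumptions, not the actual PartitionSpec.hasSequentialIds() implementation: it assumes the check requires field i to have ID start + i, and it takes the start value as a parameter so both constants can be compared.

```java
// Hypothetical stand-in for a "sequential IDs" check; not the Iceberg source.
class SequentialIdSketch {

  // Assumed rule: the field at position i must have ID start + i.
  static boolean hasSequentialIds(int[] fieldIds, int start) {
    for (int i = 0; i < fieldIds.length; i++) {
      if (fieldIds[i] != start + i) {
        return false;
      }
    }
    return true;
  }
}
```

With this rule, a spec written before the patch with IDs {1000, 1001} passes when the start is 1000 but fails when the same check runs against a start of 10000, since 1000 != 10000 at position 0. An unpartitioned spec (no fields) passes either way in this sketch.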
|
@TuroczyX lol, I did see that before :) Yep, this is my careless kitchen review of the day. Nice catch, I didn't realize it would throw an exception if it's not sequential. Hm, I'm not 100% sure why we need to throw an exception in this case, versus starting ID assignment from the last assigned ID. Looks like it came from this comment: #845 (comment) |
|
And some in-depth discussion: #280 |
|
Went through the discussion. One good thing is that it has lastPartitionId, which is used for the next allocations, so that should prevent any old table from breaking due to this change. I couldn't figure out where 1000 came from, and it looks like the reason for requiring sequential IDs isn't mentioned there. I need to explore a bit more around that area. |
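As a rough sketch of why allocating from the last assigned ID would shield existing tables: the class below is hypothetical and only assumes that the next partition field ID is the last assigned ID plus one, and that a brand-new table seeds that counter from the start constant. It does not reproduce how TableMetadata actually tracks lastPartitionId.

```java
// Hypothetical allocator; names and behavior are assumptions, not Iceberg code.
class PartitionIdAllocatorSketch {
  private int lastAssignedId;

  // For an existing table: continue from whatever ID was last written.
  PartitionIdAllocatorSketch(int lastAssignedId) {
    this.lastAssignedId = lastAssignedId;
  }

  // For a new table: the first ID handed out equals the start constant.
  static PartitionIdAllocatorSketch forNewTable(int start) {
    return new PartitionIdAllocatorSketch(start - 1);
  }

  // Assumed rule: next ID = last assigned ID + 1.
  int nextId() {
    lastAssignedId += 1;
    return lastAssignedId;
  }
}
```

Under these assumptions, a table whose last assigned ID is 1001 gets 1002 next regardless of the new start constant, while only newly created tables begin at 10000. Whether a sequential-ID check then accepts the result is the separate question raised earlier in the thread.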
|
Looks like @RussellSpitzer, @szehon-ho, and @aokolnychyi are looking at this and have noted the issues with v1 tables. I think that this is risky because not all v1 readers will use partition field IDs, but we do write them into partition specs now. Currently, we are careful that those IDs are always the same, but this change would cause them to differ. It may be safe, but I'd test very thoroughly and possibly put this behind a flag. I'd also like to understand why this is needed. Partition field IDs are stored in manifest files, not data files. Partition field IDs should generally not mix with data field IDs from the Iceberg schema. The only case I can think of right now is projecting the |
|
Any update on this? Should we keep it open or are we pursuing other solutions? |
|
any update on this? |
|
I think you created : #7221 |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Increase the partition field start to 10K to avoid collisions with columns.
#6368