core: initial support of multi-arg bucket#8259
core: initial support of multi-arg bucket#8259advancedxy wants to merge 2 commits intoapache:mainfrom
Conversation
This commit makes one simple E2E of multi-arg bucket transform works on spark 3.3 and spark 3.4
6873be7 to
90eda6d
Compare
|
This looks interesting, looks like we are discussing first in: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit |
Yeah, let's discuss this in the google doc first. I probably would raise a spec PR in this or next week. Once we have discussed this in the next community sync, I will refine and divide the PR into multiple parts. This PR is for reference only to get what the whole picture looks like for Spark(at least) engine: like roughly how many lines of code, which classes would be modified, how the interface would be changed/replaced etc. |
| break; | ||
| } | ||
| } | ||
| Preconditions.checkArgument(!isNull, "All fields are null"); |
There was a problem hiding this comment.
just check here for now, if all fields are null, this bucket function should produce null instead.
|
@advancedxy, thanks for your patience on this. I'm going to be looking at it more once the 1.4 release is out. I have a couple of things that I'm thinking through on this:
|
That sounds reasonable and great to me.
If we are going to add
Currently, I don't think we can promote int to str due to the truncate/bucket(original) transform since we cannot produce same partition ing value. Unless we add more checks and restrictions about type promotion. I will also put more thinkings in this, and let's discuss after 1.4 is released . |
|
Good point about the truncate transform not matching. We'll have to check when promoting types that there aren't incompatible transforms either way so we shouldn't over-complicate the bucket function to accomplish this. |
|
@advancedxy do you mind taking the generic changes here (not including bucketv2) and splitting into another pr? (Just like we did for the doc changes). I feel we can parallelize more this way. Thanks for the great work, by the way. |
Of course, I will split the API/Core parts into another prs, it may consist of multiple PRs to be reviewed. As for |
|
@advancedxy feel free to work on bucketv2, based on the ideas of @rdblue and @aokolnychyi . Yes for sure, it will be great to get bucketv2 in. We were interested to work on geo partitioning multi-arg partition functions, like z ordering. So would love to get the base code in, to work in parallel. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
@advancedxy Hi, thanks for the create work. Seems the pr has been closed by github actions. Is it possible to push this pr forward? We do need it while integrating Fluss with Iceberg. |
This commit makes one simple E2E of multi-arg bucket transform works on spark 3.3 and spark 3.4
this should resolves #8258