core: initial support of multi-arg bucket by advancedxy · Pull Request #8259 · apache/iceberg

advancedxy · 2023-08-08T15:01:15Z

This commit makes one simple E2E of multi-arg bucket transform works on spark 3.3 and spark 3.4

this should resolves #8258

This commit makes one simple E2E of multi-arg bucket transform works on spark 3.3 and spark 3.4

szehon-ho · 2023-08-15T22:02:43Z

This looks interesting, looks like we are discussing first in: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit

advancedxy · 2023-08-16T06:38:47Z

This looks interesting, looks like we are discussing first in: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit

Yeah, let's discuss this in the google doc first. I probably would raise a spec PR in this or next week. Once we have discussed this in the next community sync, I will refine and divide the PR into multiple parts. This PR is for reference only to get what the whole picture looks like for Spark(at least) engine: like roughly how many lines of code, which classes would be modified, how the interface would be changed/replaced etc.

advancedxy · 2023-08-17T06:25:22Z

+            break;
+        }
+      }
+      Preconditions.checkArgument(!isNull, "All fields are null");


just check here for now, if all fields are null, this bucket function should produce null instead.

rdblue · 2023-09-28T23:00:15Z

@advancedxy, thanks for your patience on this. I'm going to be looking at it more once the 1.4 release is out.

I have a couple of things that I'm thinking through on this:

If we're updating the function, should we use a faster algorithm like XXH?
Should we use a different approach to get to a bucket number rather than modulus?
Should we make hash(str) and hash(int) the same to enable type promotion?

advancedxy · 2023-09-30T04:34:10Z

@advancedxy, thanks for your patience on this. I'm going to be looking at it more once the 1.4 release is out.

That sounds reasonable and great to me.

I have a couple of things that I'm thinking through on this:

If we're updating the function, should we use a faster algorithm like XXH?

If we are going to add bucketV2 as the new bucket function for both single-arg and multi-arg parameters. We can certainly use a faster/new hash algorithm.

Should we make hash(str) and hash(int) the same to enable type promotion?

Currently, I don't think we can promote int to str due to the truncate/bucket(original) transform since we cannot produce same partition ing value. Unless we add more checks and restrictions about type promotion.

I will also put more thinkings in this, and let's discuss after 1.4 is released .

rdblue · 2023-10-01T19:02:32Z

Good point about the truncate transform not matching. We'll have to check when promoting types that there aren't incompatible transforms either way so we shouldn't over-complicate the bucket function to accomplish this.

szehon-ho · 2024-01-24T18:46:00Z

@advancedxy do you mind taking the generic changes here (not including bucketv2) and splitting into another pr? (Just like we did for the doc changes). I feel we can parallelize more this way. Thanks for the great work, by the way.

advancedxy · 2024-01-25T02:28:07Z

@advancedxy do you mind taking the generic changes here (not including bucketv2) and splitting into another pr? (Just like we did for the doc changes). I feel we can parallelize more this way. Thanks for the great work, by the way.

Of course, I will split the API/Core parts into another prs, it may consist of multiple PRs to be reviewed. As for bucketV2, I do think it's important and the motivation of multi-arg transform is to support multi-arg bucket initially. Do you or @aokolnychyi have interest to work on bucketV2?

szehon-ho · 2024-01-25T18:35:05Z

@advancedxy feel free to work on bucketv2, based on the ideas of @rdblue and @aokolnychyi . Yes for sure, it will be great to get bucketv2 in.

We were interested to work on geo partitioning multi-arg partition functions, like z ordering. So would love to get the base code in, to work in parallel.

github-actions · 2024-09-13T00:14:38Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2024-09-22T00:16:00Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

luoyuxia · 2025-07-16T08:01:22Z

@advancedxy Hi, thanks for the create work. Seems the pr has been closed by github actions. Is it possible to push this pr forward? We do need it while integrating Fluss with Iceberg.

github-actions Bot added API spark core labels Aug 8, 2023

advancedxy mentioned this pull request Aug 8, 2023

multi-arg transform support #8258

Closed

core: initial support of multi-arg bucket

90eda6d

This commit makes one simple E2E of multi-arg bucket transform works on spark 3.3 and spark 3.4

advancedxy force-pushed the multi-arg-transform branch from 6873be7 to 90eda6d Compare August 8, 2023 15:05

chore: fix ci failure

595aa93

Fokko mentioned this pull request Aug 9, 2023

feat: Add transform apache/iceberg-rust#26

Merged

advancedxy commented Aug 17, 2023

View reviewed changes

advancedxy mentioned this pull request Aug 23, 2023

Flink: Custom partitioner for bucket partitions #7161

Merged

advancedxy mentioned this pull request Sep 18, 2023

Spec: add multi-arg transform support #8579

Merged

github-actions Bot added the stale label Sep 13, 2024

github-actions Bot closed this Sep 22, 2024

szehon-ho mentioned this pull request Mar 27, 2025

Spec: Allow the use of source-id in V3 #12644

Merged

luoyuxia mentioned this pull request Jan 26, 2026

Fast fail when creating table of iceberg with multi key fields apache/fluss#2449

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core: initial support of multi-arg bucket#8259

core: initial support of multi-arg bucket#8259
advancedxy wants to merge 2 commits intoapache:mainfrom
advancedxy:multi-arg-transform

advancedxy commented Aug 8, 2023 •

edited

Loading

Uh oh!

szehon-ho commented Aug 15, 2023

Uh oh!

advancedxy commented Aug 16, 2023

Uh oh!

advancedxy Aug 17, 2023

Uh oh!

rdblue commented Sep 28, 2023

Uh oh!

advancedxy commented Sep 30, 2023

Uh oh!

rdblue commented Oct 1, 2023

Uh oh!

szehon-ho commented Jan 24, 2024 •

edited

Loading

Uh oh!

advancedxy commented Jan 25, 2024

Uh oh!

szehon-ho commented Jan 25, 2024 •

edited

Loading

Uh oh!

github-actions Bot commented Sep 13, 2024

Uh oh!

github-actions Bot commented Sep 22, 2024

Uh oh!

luoyuxia commented Jul 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

advancedxy commented Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

szehon-ho commented Aug 15, 2023

Uh oh!

advancedxy commented Aug 16, 2023

Uh oh!

advancedxy Aug 17, 2023

Choose a reason for hiding this comment

Uh oh!

rdblue commented Sep 28, 2023

Uh oh!

advancedxy commented Sep 30, 2023

Uh oh!

rdblue commented Oct 1, 2023

Uh oh!

szehon-ho commented Jan 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

advancedxy commented Jan 25, 2024

Uh oh!

szehon-ho commented Jan 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Sep 13, 2024

Uh oh!

github-actions Bot commented Sep 22, 2024

Uh oh!

luoyuxia commented Jul 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

advancedxy commented Aug 8, 2023 •

edited

Loading

szehon-ho commented Jan 24, 2024 •

edited

Loading

szehon-ho commented Jan 25, 2024 •

edited

Loading