Skip to content

core: initial support of multi-arg bucket#8259

Closed
advancedxy wants to merge 2 commits intoapache:mainfrom
advancedxy:multi-arg-transform
Closed

core: initial support of multi-arg bucket#8259
advancedxy wants to merge 2 commits intoapache:mainfrom
advancedxy:multi-arg-transform

Conversation

@advancedxy
Copy link
Copy Markdown
Contributor

@advancedxy advancedxy commented Aug 8, 2023

This commit makes one simple E2E of multi-arg bucket transform works on spark 3.3 and spark 3.4

this should resolves #8258

This commit makes one simple E2E of multi-arg bucket transform works on
spark 3.3 and spark 3.4
@advancedxy advancedxy force-pushed the multi-arg-transform branch from 6873be7 to 90eda6d Compare August 8, 2023 15:05
@szehon-ho
Copy link
Copy Markdown
Member

This looks interesting, looks like we are discussing first in: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit

@advancedxy
Copy link
Copy Markdown
Contributor Author

This looks interesting, looks like we are discussing first in: https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit

Yeah, let's discuss this in the google doc first. I probably would raise a spec PR in this or next week. Once we have discussed this in the next community sync, I will refine and divide the PR into multiple parts. This PR is for reference only to get what the whole picture looks like for Spark(at least) engine: like roughly how many lines of code, which classes would be modified, how the interface would be changed/replaced etc.

break;
}
}
Preconditions.checkArgument(!isNull, "All fields are null");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just check here for now, if all fields are null, this bucket function should produce null instead.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Sep 28, 2023

@advancedxy, thanks for your patience on this. I'm going to be looking at it more once the 1.4 release is out.

I have a couple of things that I'm thinking through on this:

  • If we're updating the function, should we use a faster algorithm like XXH?
  • Should we use a different approach to get to a bucket number rather than modulus?
  • Should we make hash(str) and hash(int) the same to enable type promotion?

@advancedxy
Copy link
Copy Markdown
Contributor Author

@advancedxy, thanks for your patience on this. I'm going to be looking at it more once the 1.4 release is out.

That sounds reasonable and great to me.

I have a couple of things that I'm thinking through on this:

  • If we're updating the function, should we use a faster algorithm like XXH?

If we are going to add bucketV2 as the new bucket function for both single-arg and multi-arg parameters. We can certainly use a faster/new hash algorithm.

  • Should we make hash(str) and hash(int) the same to enable type promotion?

Currently, I don't think we can promote int to str due to the truncate/bucket(original) transform since we cannot produce same partition ing value. Unless we add more checks and restrictions about type promotion.

I will also put more thinkings in this, and let's discuss after 1.4 is released .

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Oct 1, 2023

Good point about the truncate transform not matching. We'll have to check when promoting types that there aren't incompatible transforms either way so we shouldn't over-complicate the bucket function to accomplish this.

@szehon-ho
Copy link
Copy Markdown
Member

szehon-ho commented Jan 24, 2024

@advancedxy do you mind taking the generic changes here (not including bucketv2) and splitting into another pr? (Just like we did for the doc changes). I feel we can parallelize more this way. Thanks for the great work, by the way.

@advancedxy
Copy link
Copy Markdown
Contributor Author

@advancedxy do you mind taking the generic changes here (not including bucketv2) and splitting into another pr? (Just like we did for the doc changes). I feel we can parallelize more this way. Thanks for the great work, by the way.

Of course, I will split the API/Core parts into another prs, it may consist of multiple PRs to be reviewed. As for bucketV2, I do think it's important and the motivation of multi-arg transform is to support multi-arg bucket initially. Do you or @aokolnychyi have interest to work on bucketV2?

@szehon-ho
Copy link
Copy Markdown
Member

szehon-ho commented Jan 25, 2024

@advancedxy feel free to work on bucketv2, based on the ideas of @rdblue and @aokolnychyi . Yes for sure, it will be great to get bucketv2 in.

We were interested to work on geo partitioning multi-arg partition functions, like z ordering. So would love to get the base code in, to work in parallel.

@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Sep 13, 2024
@github-actions
Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@luoyuxia
Copy link
Copy Markdown

@advancedxy Hi, thanks for the create work. Seems the pr has been closed by github actions. Is it possible to push this pr forward? We do need it while integrating Fluss with Iceberg.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

multi-arg transform support

4 participants