[SPARK-23928][SQL][WIP] Add shuffle collection function. #21386
|
For the tests, I was trying to do this:
This is my first time contributing to Spark and codegen. I hope committers and contributors can help with the tests and with the shuffle function code. Thanks a lot! |
python/pyspark/sql/functions.py
Generates / Returns
Isn't ExpectsInputTypes sufficient in this case?
Correct. Input is an Array. No string for input. Fixed.
line wrapping
Good catch. I must have hit the Enter key by mistake...
I see strong similarities with reverse function. Would it be possible to separate common code into a new trait or class and subsequently reference it?
Yes, I was thinking about it. But I did not want to change the code in reverse function without reviewer's comments. Will do it.
python/pyspark/sql/functions.py
This will trigger a python test. Won't it fail if it's random?
Cool. My bad. Not familiar with this. Thought they were just doc like comments... Will fix it.
|
@mn-mikke Can you give some comments on the tests code? I believe they need to be improved, using functions like Multiset? Thanks. |
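One way to read the multiset idea is that a test for a random function should compare element counts rather than element order. A minimal Python sketch of that check follows; `is_permutation_of` is a hypothetical helper, and `random.sample` only stands in for the function under test, so this is not the code from the PR.

```python
from collections import Counter
import random


def is_permutation_of(original, shuffled):
    """True when both sequences hold the same elements with the same counts."""
    # Counter acts as a multiset: order is ignored, duplicates are counted.
    return Counter(original) == Counter(shuffled)


# Stand-in for the function under test: any random permutation of the input.
data = [1, 2, 2, 3, None]
shuffled = random.sample(data, k=len(data))

# The assertion holds for every possible ordering, so the test is not flaky.
assert is_permutation_of(data, shuffled)
```

The same comparison also catches dropped or duplicated elements, which a length check alone would miss.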
nit: revert unrelated change.
child.nullable?
We don't need to override this in that case because it's the same in UnaryExpression.
I guess this should extend Stateful trait and implement the related methods. Uuid and its test MiscExpressionsSuite should be the reference of the implementation to handle randomness.
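As a rough illustration of why a stateful, seeded design makes randomness testable, here is a Python sketch. It assumes the pattern being suggested is "create the random generator once per partition from a fixed seed plus the partition index"; the class `SeededShuffler` and its method names are hypothetical and are not Spark's API.

```python
import random


class SeededShuffler:
    """Hypothetical stand-in for a stateful expression; not Spark's API."""

    def __init__(self, seed):
        self.seed = seed
        self._rng = None  # created lazily, once per partition

    def initialize(self, partition_index):
        # Mixing in the partition index gives each partition its own
        # stream while keeping every stream reproducible from the seed.
        self._rng = random.Random(self.seed + partition_index)

    def eval(self, array):
        # Null input yields null output, mirroring a null-propagating expression.
        if array is None:
            return None
        result = list(array)  # copy so the input row is not mutated
        self._rng.shuffle(result)
        return result
```

With a fixed seed, two instances initialized for the same partition produce the same permutation, so a test can compare their outputs directly.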
Thanks a lot, Takuya @ueshin ! This is a good tip! Working on it and will push a new commit.
|
Thanks for your contribution! |
|
@pkuwm Btw, if you want to implement an algorithm such as the Fisher-Yates algorithm by hand, please add a comment near the code and, if possible, a link to the reference you used. |
|
Can one of the admins verify this patch? |
|
I learned more of the code and am polishing my second commit. I had something else to do and also attended the spark summit this week. Sorry for being late. Will submit a new commit over the weekend. |
|
@pkuwm Hi, any updates on this? If you have any questions, please let us know. Thanks! |
a8a36ef to 70cd78c
|
@ueshin Really sorry for the delay. I was handling some personal stuff recently and was not able to modify this patch, as I am not really familiar with this part. About the similarities between Reverse and Shuffle: I was trying to implement a trait, but did not have a good idea because the code is not the same. Can you help? If you have a better idea, maybe you can continue and complete this implementation. Thanks a lot! |
|
Okay, I'll take this over, and ping you when I submit a PR to ask a review. Thanks! |
|
Thanks for the reminder, @dongjoon-hyun
What changes were proposed in this pull request?
This PR adds a new collection function: shuffle. It generates a random permutation of the given array. This implementation uses the modern version of the Fisher-Yates algorithm.
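For readers unfamiliar with the algorithm, the modern Fisher-Yates shuffle can be sketched in a few lines of Python. This is only an illustration of the technique, not the PR's Scala or codegen implementation; the function name `fisher_yates_shuffle` and the optional `rng` parameter are assumptions made for the example.

```python
import random


def fisher_yates_shuffle(items, rng=None):
    """Return a new list holding a random permutation of items."""
    if rng is None:
        rng = random.Random()
    result = list(items)  # copy so the input stays untouched
    # Walk from the last index down to 1, swapping each slot with a
    # randomly chosen slot at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
    return result
```

Each element is swapped at most once per loop step, so the shuffle runs in linear time, and passing a seeded `random.Random` makes the output reproducible.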
How was this patch tested?
New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala.