[SPARK-23928][SQL][WIP] Add shuffle collection function. #21386

huizhilu · 2018-05-21T21:28:53Z

What changes were proposed in this pull request?

This PR adds a new collection function: shuffle. It generates a random permutation of the given array. The implementation uses the modern version of the Fisher-Yates algorithm.
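For readers unfamiliar with the algorithm named above, here is a minimal sketch of the modern (in-place swap) Fisher-Yates shuffle in plain Python. It is not the code from this PR; the function name `fisher_yates_shuffle` and the optional `rng` parameter are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def fisher_yates_shuffle(items: Sequence[T],
                         rng: Optional[random.Random] = None) -> List[T]:
    """Return a new list holding a random permutation of `items`."""
    rng = rng or random.Random()
    result = list(items)  # copy, so the caller's sequence is left untouched
    # Walk from the last index down to 1 and swap each slot with a slot
    # chosen uniformly from the positions at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
    return result
```

Each element is swapped at most once per loop step, so the shuffle runs in linear time and needs no extra storage beyond the copy.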

How was this patch tested?

New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala.

huizhilu · 2018-05-21T21:36:05Z

For the tests, I was trying to do this:

assertEqualsIgnoreOrder(shuffle(originSeq), originSeq)
But Spark does not have assertEqualsIgnoreOrder implemented. I was thinking of comparing the multisets of shuffle(originSeq) and originSeq, but I had trouble using a Multiset with an expression and a Seq.
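The order-insensitive comparison described above can be sketched with a multiset in plain Python. This is only an illustration of the idea, not Spark test code: `assert_equals_ignore_order` is a hypothetical helper, and `random.sample` stands in for the shuffle expression.

```python
import random
from collections import Counter


def assert_equals_ignore_order(actual, expected):
    """Hypothetical helper: pass when both hold the same elements
    with the same multiplicities, regardless of order."""
    assert Counter(actual) == Counter(expected), (
        f"multisets differ: {actual!r} vs {expected!r}"
    )


origin_seq = [1, 2, 2, 3, None]
# Stand-in for shuffle(originSeq): a random permutation of the input.
shuffled = random.sample(origin_seq, k=len(origin_seq))
assert_equals_ignore_order(shuffled, origin_seq)
```

Comparing counts rather than sets matters here because a shuffle must preserve duplicates and nulls, which a plain set comparison would hide.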
As for randomness, I was thinking of generating a Seq from range(1, 501) and shuffling it 30 times; at least 80% of the results should be distinct permutations, which could be checked with something like HashSet.add(shuffledResult). But I don't know how to implement this idea in Scala and codegen for expressions.
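The randomness check proposed above (shuffle 1 to 500 thirty times and expect at least 80% distinct results) can be sketched as follows. The helper name `distinct_permutation_ratio` is hypothetical, and `rng.sample` again stands in for the shuffle expression under test.

```python
import random


def distinct_permutation_ratio(shuffle_fn, origin_seq, runs):
    """Shuffle `origin_seq` `runs` times and return the fraction of
    results that are distinct permutations."""
    seen = set()
    for _ in range(runs):
        # Tuples are hashable, so each permutation can be stored in a set.
        seen.add(tuple(shuffle_fn(origin_seq)))
    return len(seen) / runs


origin_seq = list(range(1, 501))
rng = random.Random(0)  # fixed seed so the check itself is repeatable
ratio = distinct_permutation_ratio(
    lambda xs: rng.sample(xs, k=len(xs)), origin_seq, runs=30
)
assert ratio >= 0.8
```

With 500 elements a collision between two independent shuffles is vanishingly unlikely, so the 80% threshold mostly guards against a shuffle that returns its input unchanged or reuses the same permutation.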

This is my first time contributing to Spark and codegen. I hope committers and contributors can help with the tests and with the shuffle function code. Thanks a lot!

mn-mikke · 2018-05-21T21:34:40Z

python/pyspark/sql/functions.py

Generates / Returns

mn-mikke · 2018-05-21T21:42:20Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

Isn't ExpectsInputTypes sufficient in this case?

Correct. The input is an array, never a string. Fixed.

mn-mikke · 2018-05-21T21:44:29Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

line wrapping

Good catch. I must have hit the Enter key by mistake...

mn-mikke · 2018-05-21T21:56:44Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

I see strong similarities with the reverse function. Would it be possible to extract the common code into a new trait or class and reference it from both?

Yes, I was thinking about it, but I did not want to change the code of the reverse function without input from reviewers. Will do it.

mn-mikke · 2018-05-21T21:59:35Z

python/pyspark/sql/functions.py

This will trigger a Python test. Won't it fail if the output is random?

Cool. My bad. I'm not familiar with this and thought they were just doc-like comments... Will fix it.
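To illustrate the concern above: examples in a Python docstring are executed as doctests, so an example with random output fails unless it is skipped or rewritten to check an order-independent property. The sketch below uses only the standard `doctest` module; `shuffle_list` is a hypothetical stand-in, not the PySpark function.

```python
import doctest
import random


def shuffle_list(values):
    """Return a random permutation of values.

    The first example has nondeterministic output, so it is skipped.
    The second checks an order-independent property and can run.

    >>> shuffle_list([1, 2, 3])  # doctest: +SKIP
    [2, 3, 1]
    >>> sorted(shuffle_list([1, 2, 3]))
    [1, 2, 3]
    """
    return random.sample(values, k=len(values))


# Parse and run only this docstring's examples.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    shuffle_list.__doc__, {"shuffle_list": shuffle_list}, "shuffle_list", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

Without the `+SKIP` directive the first example would pass only when the random permutation happened to match the text shown.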

huizhilu · 2018-05-21T22:21:46Z

@mn-mikke Can you comment on the test code? I believe the tests need to be improved, perhaps by using something like Multiset. Thanks.

ueshin · 2018-05-31T19:39:21Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

nit: revert unrelated change.

ueshin · 2018-05-31T19:40:18Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

child.nullable?
We don't need to override this in that case because it's the same in UnaryExpression.

ueshin · 2018-05-31T20:18:51Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

I guess this should extend the Stateful trait and implement the related methods. Uuid and its test in MiscExpressionsSuite should serve as the reference for how to handle randomness.

Thanks a lot, Takuya @ueshin ! This is a good tip! Working on it and will push a new commit.
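One general way to make a random expression testable, in the spirit of the suggestion above, is to keep the random generator as explicit state that is initialized from a fixed seed. The sketch below is plain Python and does not mirror Spark's Stateful trait or the Uuid expression; the class `SeededShuffle`, its methods, and the choice to add the partition index to the seed are all assumptions made for illustration.

```python
import random


class SeededShuffle:
    """Hypothetical stateful shuffle: same seed and partition index
    always produce the same sequence of permutations."""

    def __init__(self, seed):
        self.seed = seed
        self._rng = None  # created lazily by initialize()

    def initialize(self, partition_index):
        # Assumption: derive a per-partition generator from the fixed seed
        # so that partitions do not all produce identical permutations.
        self._rng = random.Random(self.seed + partition_index)

    def eval(self, values):
        if self._rng is None:
            raise RuntimeError("initialize() must be called before eval()")
        result = list(values)  # copy, so the input is left untouched
        self._rng.shuffle(result)
        return result
```

A test can then build two instances with the same seed, initialize both with the same partition index, and assert that they return identical results, while different seeds are expected to diverge.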

ueshin · 2018-05-31T20:53:31Z

Thanks for your contribution!
You can refer to Uuid and its test, as I mentioned in a comment. Could you update this?
I'll revisit when you push new commits.

ueshin · 2018-05-31T22:35:32Z

@pkuwm Btw, if you want to implement an algorithm such as the Fisher-Yates algorithm by hand, please add a comment near the code and, ideally, a link to the reference you used.

AmplabJenkins · 2018-06-09T00:03:52Z

Can one of the admins verify this patch?

huizhilu · 2018-06-09T03:21:15Z

I have learned more about the code and am polishing my second commit. I had other things to do and also attended the Spark Summit this week. Sorry for the delay. I will submit a new commit over the weekend.

ueshin · 2018-07-13T08:37:48Z

@pkuwm Hi, any updates on this? If you have any questions, please let us know. Thanks!

huizhilu · 2018-07-17T23:41:59Z

@ueshin Really sorry for the delay. I was dealing with some personal matters recently and was not able to rework this patch, as I am not really familiar with this part.
I updated the commit: I fixed the line wrapping and typos, and replaced the shuffle tests in functions.py with a note.

As for the similarities between Reverse and Shuffle, I tried to implement a trait but could not come up with a good design because the code is not the same.
I am also not sure whether Stateful would be a good fit for this function.

Can you help? If you have a better idea, maybe you can continue and complete this implementation. Thanks a lot!

ueshin · 2018-07-18T05:35:18Z

Okay, I'll take this over and ping you for a review when I submit a PR. Thanks!

ueshin · 2018-07-18T12:52:28Z

@pkuwm I submitted a PR #21802 based on this. Could you take a look if you have time? Thanks!

dongjoon-hyun · 2018-09-13T17:08:52Z

@pkuwm Could you close this PR now that #21802 is merged?

huizhilu · 2018-09-13T21:37:45Z

Thanks for the reminder, @dongjoon-hyun.

mn-mikke reviewed May 21, 2018

ueshin reviewed May 31, 2018

huizhilu force-pushed the SPARK-23928 branch 3 times, most recently from a8a36ef to 70cd78c, on July 17, 2018 23:34

Add shuffle collection function.

7f395c3

huizhilu force-pushed the SPARK-23928 branch from 70cd78c to 7f395c3 on July 17, 2018 23:50

huizhilu closed this Sep 13, 2018

5 participants