Skip to content

Conversation

@huizhilu
Copy link
Contributor

What changes were proposed in this pull request?

This PR adds a new collection function: shuffle. It generates a random permutation of the given array. This implementation uses the modern version of Fisher-Yates algorithm.

How was this patch tested?

New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala.

@huizhilu
Copy link
Contributor Author

For the tests, I was trying to do this:

  1. assertEqualsInorgeOrder(shuffle(originSeq), originSeq)
    But spark does not have assertEqualsInorgeOrder implemented. I was thinking to check Multiset of shuffle(originSeq) and originSeq. But had trouble using Multiset for expression and seq.
  2. About the randomness, I was thinking to generate a Seq range(1, 501) and shuffle it 30 times. And it should produce at least 80% distinct permutations. Say using HashSet.add(shuffledResult). But I don't know how to implement this idea in scala and codeGen for expressions.

This is my 1st time contributing to spark and codeGen. I hope committers and contributors could help with tests, and also the shuffle function code. Thanks a lot!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generates / Returns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't ExpectsInputTypes sufficient in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Input is an Array. No string for input. Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line wrapping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I must hit the enter button by mistake...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see strong similarities with reverse function. Would it be possible to separate common code into a new trait or class and subsequently reference it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was thinking about it. But I did not want to change the code in reverse function without reviewer's comments. Will do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will trigger a python test. Won't it fail if it's random?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. My bad. Not familiar with this. Thought they were just doc like comments... Will fix it.

@huizhilu
Copy link
Contributor Author

@mn-mikke Can you give some comments on the tests code? I believe they need to be improved, using functions like Multiset? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: revert unrelated change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

child.nullable?
We don't need to override this in that case because it's the same in UnaryExpression.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this should extend Stateful trait and implement the related methods. Uuid and its test MiscExpressionsSuite should be the reference of the implementation to handle randomness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, Takuya @ueshin ! This is a good tip! Working on it and will push a new commit.

@ueshin
Copy link
Member

ueshin commented May 31, 2018

Thanks for your contribution!
You can refer Uuid and its test as I mentioned in a comment. Could you update this?
I'll revisit when you push new commits.

@ueshin
Copy link
Member

ueshin commented May 31, 2018

@pkuwm Btw, if you want to implement some algorithm like "Fisher-Yates algorithm" by hand, please add a comment near the code, and a link you referred hopefully.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@huizhilu
Copy link
Contributor Author

huizhilu commented Jun 9, 2018

I learned more of the code and am polishing my second commit. I had something else to do and also attended the spark summit this week. Sorry for being late. Will submit a new commit over the weekend.

@ueshin
Copy link
Member

ueshin commented Jul 13, 2018

@pkuwm Hi, any updates on this? If you have any questions, please let us know. Thanks!

@huizhilu huizhilu force-pushed the SPARK-23928 branch 3 times, most recently from a8a36ef to 70cd78c Compare July 17, 2018 23:34
@huizhilu
Copy link
Contributor Author

@ueshin Really sorry for the delay. I was handling some personal stuff recently and was not able to modify this patch as I am not really familiar with this part.
I updated the commit and fixed line wraps, typos, replacing the shuffle tests in functions.py with a note.

About the similarities of Reverse and Shuffle, I was trying to implement a trait, but did not have a good idea because the code is not the same.
And not sure if Stateful would be a good fit for this function.

Can you help? If you have better idea, maybe you can continue completing this implementation. Thanks a lot!

@ueshin
Copy link
Member

ueshin commented Jul 18, 2018

Okay, I'll take this over, and ping you when I submit a PR to ask a review. Thanks!

@ueshin
Copy link
Member

ueshin commented Jul 18, 2018

@pkuwm I submitted a PR #21802 based on this. Could you take a look if you have time? Thanks!

@dongjoon-hyun
Copy link
Member

@pkuwm . Could you close this PR since #21802 is merged?

@huizhilu
Copy link
Contributor Author

Thanks for reminding, @dongjoon-hyun

@huizhilu huizhilu closed this Sep 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants