Handle null partitions in P2P shuffling #7922
Columns with sparse data can have partitions with empty columns. This creates a conflict of data types between the partitions and an error during concatenation.
autosquash! Promote null types during concat_tables (dask#7837).
Can one of the admins verify this patch? Admins can comment
jrbourbeau
left a comment
Thanks @detroyejr!
@hendrikmakait @phofl do either of you have bandwidth to review?
jrbourbeau
left a comment
Also, could you add a small test for this? Something similar to the snippet in the original issue should be sufficient.
phofl
left a comment
The change itself looks good!
Unit Test Results: See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
20 files ±0, 20 suites ±0, 12h 51m 14s ⏱️ -24m 42s
For more details on these failures, see this check. Results for commit f8a46bc. ± Comparison against base commit cf97a7c. ♻️ This comment has been updated with latest results.
Test added. Seems to do the trick, but let me know if you have any feedback.
phofl
left a comment
Test generally looks good to me and should cover the bug.
Needs another pair of eyes from someone who is more familiar with distributed tests, @hendrikmakait maybe?
hendrikmakait
left a comment
Thanks, @detroyejr, for fixing this! The code looks good to me. I would like some improvements for the test case before approving this.
Move to shuffle/tests, implement some cleanup based on test_concurrent, and assert that the final data returned by dask DataFrame matches the original pandas DataFrame.
Thanks for the quick feedback. Hopefully that looks a little better.
hendrikmakait
left a comment
@detroyejr: Overall, the test looks great now. I only have one minor suggestion for improvement.
Co-authored-by: Hendrik Makait <hendrik@makait.com>
hendrikmakait
left a comment
Thanks, @detroyejr, great work! I've merged the latest main to avoid CI failures caused by upstream changes and will merge once CI is done and green-ish.
Awesome, thanks for all your help!
Ran into this issue after upgrading to 2023.6. Luckily, it's an easy fix.
Columns with sparse data can have partitions with empty/null columns. This creates a conflict of data types between the partitions and an error during concatenation.
Closes #7837
pre-commit run --all-files