Skip to content

Conversation

@detroyejr
Copy link
Contributor

Ran into this issue after upgrading to 2023.6. Luckily, it's an easy fix.

Columns with sparse data can have partitions with empty/null columns. This creates a conflict of data types between the partitions and an error during concatenation.

Closes #7837

  • [ x] Tests added / passed
  • [ x] Passes pre-commit run --all-files

Columns with sparse data can have partitions with empty columns. This
creates a conflict of data types between the partitions and an error
during concatenation.

autosquash! Promote null types during concat_tables (dask#7837).
@detroyejr detroyejr requested a review from fjetter as a code owner June 15, 2023 16:30
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @detroyejr!

@hendrikmakait @phofl do either of you have bandwidth to review?

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could you add a small test for this? Something similar to the snippet in the original issue should be sufficient

Copy link
Collaborator

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change itself looks good!

@jrbourbeau jrbourbeau changed the title Promote null types during concat_tables (#7837). Handle null partitions in P2P shuffling Jun 15, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Jun 15, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       20 files  ±0         20 suites  ±0   12h 51m 14s ⏱️ - 24m 42s
  3 687 tests +1    3 576 ✔️ +  2     106 💤 ±0    5  - 1 
35 658 runs  +8  33 893 ✔️ +11  1 748 💤  - 2  17  - 1 

For more details on these failures, see this check.

Results for commit f8a46bc. ± Comparison against base commit cf97a7c.

♻️ This comment has been updated with latest results.

@detroyejr
Copy link
Contributor Author

Test added. Seems to do the trick, but let me know if you have any feedback.

Copy link
Collaborator

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test generally looks good to me and should cover the bug.

Needs another pair of eyes from someone who is more familiar with distributed tests, @hendrikmakait maybe?

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @detroyejr, for fixing this! The code looks good to me. I would like some improvements for the test case before approving this.

Move to shuffle/tests, implement some cleanup based on test_concurrent,
and assert that the final data returned by dask DataFrame matches the
original pandas DataFrame.
@detroyejr
Copy link
Contributor Author

Thanks for the quick feedback. Hopefully that looks a little better.

@detroyejr detroyejr requested a review from hendrikmakait June 16, 2023 13:44
Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@detroyejr: Overall, the test looks great now, I only have one minor suggestion for improvement.

Co-authored-by: Hendrik Makait <hendrik@makait.com>
@detroyejr detroyejr requested a review from hendrikmakait June 20, 2023 14:56
Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @detroyejr, great work! I've merged the latest main to avoid CI failures caused by upstream changes and will merge once CI is done and green-ish.

@detroyejr
Copy link
Contributor Author

Awesome, thanks for all your help!

@hendrikmakait hendrikmakait merged commit c98d4ce into dask:main Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

p2p shuffling failed if Arrow infers different dtypes for partitions

5 participants