Skip to content

Conversation

@hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Dec 20, 2022

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait marked this pull request as ready for review December 20, 2022 17:13
@hendrikmakait hendrikmakait self-assigned this Dec 20, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Dec 20, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       24 files  ±  0         24 suites  ±0   10h 33m 13s ⏱️ - 25m 23s
  3 341 tests +  5    3 237 ✔️ +  8     102 💤  - 1  2  - 2 
39 386 runs  +55  37 532 ✔️ +59  1 852 💤  - 2  2  - 2 

For more details on these failures, see this check.

Results for commit 50238de. ± Comparison against base commit fef78d4.

♻️ This comment has been updated with latest results.

Comment on lines +16 to +27
if pd.api.types.is_complex_dtype(column):
raise TypeError(
f"p2p does not support data of type '{column.dtype}' found in column '{name}'."
)
# FIXME: Serializing custom objects to PyArrow is not supported in P2P shuffling
if pd.api.types.is_object_dtype(column):
raise TypeError(
f"p2p does not support custom objects found in column '{name}'."
)
# FIXME: PyArrow does not support sparse data: https://issues.apache.org/jira/browse/ARROW-8679
if pd.api.types.is_sparse(column):
raise TypeError("p2p does not support sparse data found in column '{name}'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of me likes the explicitness of these checks, but I wonder if they've got full coverage. For example, what happens if someone if using their own custom extension array dtype? My guess is this check wouldn't raise an error and shuffling would fail (though I've not confirmed this).

Maybe a there's a simple try/except we can do here instead to check that shuffling doesn't work? Would not being able to roundtrip through the shuffle-specific serialization be a sufficient check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You bring up a good point. Let me check what I need to do to trigger reliably trigger an error that we could utilize.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if it's not straightforward to come up with such a check, we can totally include what you have here and follow-up as needed. The current set of changes are already an improvement over the current main branch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just try to roundtrip meta? That would not be such a nice message but it would fail early

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just try to roundtrip meta? That would not be such a nice message but it would fail early

I've tried this before going out on holidays and it detected some issues but not all. IIRC, the main problem was object requiring actual data. We could combine explicit checks with the roundtrip though. That would give us nice error messages for some stuff and a fallback for anything we haven't explicitly checked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also df._meta_nonempty for cases like this. Might not catch super funky stuff but that's ok from my POV. After all, the shuffle will fail either way it's just about when it fails, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look at df._meta_nonempty. It's both about failing early and providing useful feedback to the user.

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please ping again once we should merge. I would like to see a roundtrip test but I'm also fine with the current state

@fjetter fjetter merged commit 16a6462 into dask:main Feb 22, 2023
@fjetter fjetter mentioned this pull request Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants