-
-
Notifications
You must be signed in to change notification settings - Fork 748
Check for dtype support in p2p #7425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check for dtype support in p2p #7425
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 24 files ± 0 24 suites ±0 10h 33m 13s ⏱️ - 25m 23s For more details on these failures, see this check. Results for commit 50238de. ± Comparison against base commit fef78d4. ♻️ This comment has been updated with latest results. |
| if pd.api.types.is_complex_dtype(column): | ||
| raise TypeError( | ||
| f"p2p does not support data of type '{column.dtype}' found in column '{name}'." | ||
| ) | ||
| # FIXME: Serializing custom objects to PyArrow is not supported in P2P shuffling | ||
| if pd.api.types.is_object_dtype(column): | ||
| raise TypeError( | ||
| f"p2p does not support custom objects found in column '{name}'." | ||
| ) | ||
| # FIXME: PyArrow does not support sparse data: https://issues.apache.org/jira/browse/ARROW-8679 | ||
| if pd.api.types.is_sparse(column): | ||
| raise TypeError("p2p does not support sparse data found in column '{name}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of me likes the explicitness of these checks, but I wonder if they've got full coverage. For example, what happens if someone if using their own custom extension array dtype? My guess is this check wouldn't raise an error and shuffling would fail (though I've not confirmed this).
Maybe a there's a simple try/except we can do here instead to check that shuffling doesn't work? Would not being able to roundtrip through the shuffle-specific serialization be a sufficient check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You bring up a good point. Let me check what I need to do to trigger reliably trigger an error that we could utilize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if it's not straightforward to come up with such a check, we can totally include what you have here and follow-up as needed. The current set of changes are already an improvement over the current main branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just try to roundtrip meta? That would not be such a nice message but it would fail early
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just try to roundtrip
meta? That would not be such a nice message but it would fail early
I've tried this before going out on holidays and it detected some issues but not all. IIRC, the main problem was object requiring actual data. We could combine explicit checks with the roundtrip though. That would give us nice error messages for some stuff and a fallback for anything we haven't explicitly checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also df._meta_nonempty for cases like this. Might not catch super funky stuff but that's ok from my POV. After all, the shuffle will fail either way it's just about when it fails, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look at df._meta_nonempty. It's both about failing early and providing useful feedback to the user.
fjetter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please ping again once we should merge. I would like to see a roundtrip test but I'm also fine with the current state
P2PShuffleserialization for categorical data #7410pre-commit run --all-files