Fix P2PShuffle serialization for categorical data
#7410
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
22 files +10, 22 suites +10, 9h 23m 22s ⏱️ +5h 3m 45s
For more details on these failures, see this check.
Results for commit 5fb858d. ± Comparison against base commit 3ac8631.
♻️ This comment has been updated with latest results.
TODO: I still need to improve testing, e.g. by extending
```python
# f"col{next(counter)}": pd.array(
#     [np.nan, np.nan, 1.0, np.nan, np.nan] * 20,
#     dtype="Sparse[float64]",
# ),
```
raises TypeError: Sparse pandas data (column col27) not supported.
```python
f"col{next(counter)}": pd.array(range(100), dtype="float16"),
f"col{next(counter)}": pd.array(range(100), dtype="float32"),
f"col{next(counter)}": pd.array(range(100), dtype="float64"),
# f"col{next(counter)}": pd.array(range(100), dtype="csingle"),
```
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 14
Looks like Arrow doesn't support complex dtypes: https://issues.apache.org/jira/browse/ARROW-638
```python
f"col{next(counter)}": pd.array(range(100), dtype="float32"),
f"col{next(counter)}": pd.array(range(100), dtype="float64"),
# f"col{next(counter)}": pd.array(range(100), dtype="csingle"),
# f"col{next(counter)}": pd.array(range(100), dtype="cdouble"),
```
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 15
```python
f"col{next(counter)}": pd.array(range(100), dtype="float64"),
# f"col{next(counter)}": pd.array(range(100), dtype="csingle"),
# f"col{next(counter)}": pd.array(range(100), dtype="cdouble"),
# f"col{next(counter)}": pd.array(range(100), dtype="clongdouble"),
```
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 16
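The excerpts above all follow the same test-fixture pattern: a shared counter generates unique column names while each column exercises a different dtype. A minimal sketch of that pattern (the exact dtype matrix here is illustrative, not the PR's full list):

```python
# Sketch of the test-fixture pattern quoted above: itertools.count yields a
# fresh index for each column name, so every dtype under test gets a unique
# "colN" key in the same DataFrame.
import itertools

import pandas as pd

counter = itertools.count()
df = pd.DataFrame(
    {
        f"col{next(counter)}": pd.array(range(100), dtype="float16"),
        f"col{next(counter)}": pd.array(range(100), dtype="float32"),
        f"col{next(counter)}": pd.array(range(100), dtype="float64"),
        # Complex dtypes (csingle, cdouble, clongdouble) stay commented out
        # because pyarrow raises ArrowNotImplementedError for them.
    }
)
print(list(df.columns))  # → ['col0', 'col1', 'col2']
```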
jrbourbeau
left a comment
Thanks @hendrikmakait -- I agree that complex, sparse, and object dtypes aren't supported by pyarrow, so raising an informative error message in that case makes sense.
It'd be good to also test that pyarrow-backed dtypes work (e.g. int64[pyarrow], string[pyarrow], etc.)
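The validation the review asks for could look something like the sketch below: reject complex, sparse, and object columns with a clear error before the data ever reaches pyarrow. The function name and error message are hypothetical, not taken from the PR.

```python
# Hypothetical sketch of an informative dtype check, run before handing a
# DataFrame to pyarrow. Rejects the three dtype families pyarrow cannot
# serialize: complex, sparse, and plain object.
import pandas as pd


def check_dtype_support(df: pd.DataFrame) -> None:
    for name, dtype in df.dtypes.items():
        if (
            pd.api.types.is_complex_dtype(dtype)
            or isinstance(dtype, pd.SparseDtype)
            or dtype == object
        ):
            raise TypeError(
                f"p2p does not support data of type {dtype} in column {name!r}"
            )


good = pd.DataFrame({"a": pd.array(range(10), dtype="float64")})
check_dtype_support(good)  # passes silently

bad = pd.DataFrame({"b": pd.array(range(10), dtype="complex128")})
try:
    check_dtype_support(bad)
except TypeError as e:
    print(e)
```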
```python
# FIXME: distributed#7420
# f"col{next(counter)}": pd.array(
#     ["lorem ipsum"] * 100,
#     dtype="string[pyarrow]",
# ),
# f"col{next(counter)}": pd.array(
#     ["lorem ipsum"] * 100,
#     dtype=pd.StringDtype("pyarrow"),
# ),
```
My guess is we're running into pandas-dev/pandas#50074 here
Do we have access to, or could we easily keep track of, the original input dtypes at the point where we convert the pa.Table to a pd.DataFrame?
Left for follow-up work.
I'll leave this to a follow-up PR since we currently cast all objects to
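One way the follow-up idea could work, sketched here with plain pandas (this is not what the PR implements): record the input dtypes before the round-trip and restore them with `astype` afterwards.

```python
# Hypothetical sketch of dtype round-trip preservation: capture df.dtypes
# before serialization, then cast back after conversion loses extension
# dtypes (here simulated by casting string -> object).
import pandas as pd

df = pd.DataFrame({"s": pd.array(["a", "b"], dtype="string")})
original_dtypes = df.dtypes  # record before conversion

# Simulate a round-trip that degrades the extension dtype to object.
roundtripped = df.astype(object)

# Restore the recorded dtypes column by column.
restored = roundtripped.astype(dict(original_dtypes))
assert restored.dtypes.equals(original_dtypes)
```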
I added a bunch of arrow-based types; hopefully I didn't miss any.
jrbourbeau
left a comment
Thanks @hendrikmakait
distributed/shuffle/_arrow.py
Outdated
```python
if str(e) == "Tried reading schema message, was null or length 0":
    return pa.concat_tables(shards)
```
Parsing this error message seems brittle. Is there some other check we can do on the file to make sure we've read it all?
I forgot that we can seek to the end of the open file object; I'm using that instead now.
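The seek-based EOF check mentioned above can be sketched with a plain file object (the fixed record size here is purely illustrative): seek to the end once to learn the total size, then stop reading when the cursor reaches it, instead of catching and string-matching an exception.

```python
# Sketch of an EOF check via seek/tell: learn the file's total size up
# front, then loop until the read cursor reaches it.
import io
import os

f = io.BytesIO(b"payload-one" + b"payload-two")
end = f.seek(0, os.SEEK_END)  # byte offset of end-of-file
f.seek(0)

chunks = []
while f.tell() < end:
    chunks.append(f.read(11))  # fixed-size records, for illustration only

print(len(chunks))  # → 2
```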
```python
assert sum(map(len, out.values())) == len(df)
assert all(v.to_pandas().dtypes.equals(df.dtypes) for v in out.values())
```
(Not meant as a blocking comment) We're checking lengths and dtypes here. Maybe we could just use pd.testing.assert_frame_equal instead?
Good point!
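For reference, the suggested replacement in one call: pd.testing.assert_frame_equal checks values, dtypes, index, and column order together, subsuming the separate length and dtype assertions quoted above.

```python
# Example of the suggested check: a single assert_frame_equal call covers
# values, dtypes, index, and column order at once.
import pandas as pd

df = pd.DataFrame({"x": pd.array(range(5), dtype="float64")})
result = df.copy()

pd.testing.assert_frame_equal(result, df)  # raises AssertionError on mismatch
```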
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
I have incorporated all feedback and CI looks good; this is ready for another review.
jrbourbeau
left a comment
Thanks @hendrikmakait!
Closes #7400
pre-commit run --all-files