Fix P2PShuffle serialization for categorical data
#7410
Conversation
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
22 files +10, 22 suites +10, 9h 23m 22s ⏱️ +5h 3m 45s
For more details on these failures, see this check.
Results for commit 5fb858d. ± Comparison against base commit 3ac8631.
♻️ This comment has been updated with latest results.
TODO: I still need to improve testing, e.g. by extending
```python
# f"col{next(counter)}": pd.array(
#     [np.nan, np.nan, 1.0, np.nan, np.nan] * 20,
#     dtype="Sparse[float64]",
# ),
```
raises TypeError: Sparse pandas data (column col27) not supported.
```python
f"col{next(counter)}": pd.array(range(100), dtype="float16"),
f"col{next(counter)}": pd.array(range(100), dtype="float32"),
f"col{next(counter)}": pd.array(range(100), dtype="float64"),
# f"col{next(counter)}": pd.array(range(100), dtype="csingle"),
```
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 14
Looks like Arrow doesn't support complex dtypes: https://issues.apache.org/jira/browse/ARROW-638
```python
f"col{next(counter)}": pd.array(range(100), dtype="float32"),
f"col{next(counter)}": pd.array(range(100), dtype="float64"),
# f"col{next(counter)}": pd.array(range(100), dtype="csingle"),
# f"col{next(counter)}": pd.array(range(100), dtype="cdouble"),
```
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 15
```python
f"col{next(counter)}": pd.array(range(100), dtype="float64"),
# f"col{next(counter)}": pd.array(range(100), dtype="csingle"),
# f"col{next(counter)}": pd.array(range(100), dtype="cdouble"),
# f"col{next(counter)}": pd.array(range(100), dtype="clongdouble"),
```
raises pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 16
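The excerpts above all follow the same test-fixture pattern: a shared counter generates unique column names while each column exercises a different dtype. A minimal sketch of that pattern (the exact dtype matrix here is illustrative, not the PR's full list):

```python
# Sketch of the test-fixture pattern quoted above: itertools.count yields a
# fresh index for each column name, so every dtype under test gets a unique
# "colN" key in the same DataFrame.
import itertools

import pandas as pd

counter = itertools.count()
df = pd.DataFrame(
    {
        f"col{next(counter)}": pd.array(range(100), dtype="float16"),
        f"col{next(counter)}": pd.array(range(100), dtype="float32"),
        f"col{next(counter)}": pd.array(range(100), dtype="float64"),
        # Complex dtypes (csingle, cdouble, clongdouble) stay commented out
        # because pyarrow raises ArrowNotImplementedError for them.
    }
)
print(list(df.columns))  # → ['col0', 'col1', 'col2']
```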
jrbourbeau
left a comment
Thanks @hendrikmakait -- I agree that complex, sparse, and object dtypes aren't supported by pyarrow, so raising an informative error message in that case makes sense.
It'd be good to also test that pyarrow-backed dtypes work (e.g. int64[pyarrow], string[pyarrow], etc.)
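The validation the review asks for could look something like the sketch below: reject complex, sparse, and object columns with a clear error before the data ever reaches pyarrow. The function name and error message are hypothetical, not taken from the PR.

```python
# Hypothetical sketch of an informative dtype check, run before handing a
# DataFrame to pyarrow. Rejects the three dtype families pyarrow cannot
# serialize: complex, sparse, and plain object.
import pandas as pd


def check_dtype_support(df: pd.DataFrame) -> None:
    for name, dtype in df.dtypes.items():
        if (
            pd.api.types.is_complex_dtype(dtype)
            or isinstance(dtype, pd.SparseDtype)
            or dtype == object
        ):
            raise TypeError(
                f"p2p does not support data of type {dtype} in column {name!r}"
            )


good = pd.DataFrame({"a": pd.array(range(10), dtype="float64")})
check_dtype_support(good)  # passes silently

bad = pd.DataFrame({"b": pd.array(range(10), dtype="complex128")})
try:
    check_dtype_support(bad)
except TypeError as e:
    print(e)
```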
```python
# FIXME: distributed#7420
# f"col{next(counter)}": pd.array(
#     ["lorem ipsum"] * 100,
#     dtype="string[pyarrow]",
# ),
# f"col{next(counter)}": pd.array(
#     ["lorem ipsum"] * 100,
#     dtype=pd.StringDtype("pyarrow"),
# ),
```
My guess is we're running into pandas-dev/pandas#50074 here
Do we have access to, or could we easily keep track of, the original input dtypes at the point where we convert the pa.Table to a pd.DataFrame?
Left for follow-up work.
I'll leave this to a follow-up PR since we currently cast all objects to
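One way the follow-up idea could work, sketched here with plain pandas (this is not what the PR implements): record the input dtypes before the round-trip and restore them with `astype` afterwards.

```python
# Hypothetical sketch of dtype round-trip preservation: capture df.dtypes
# before serialization, then cast back after conversion loses extension
# dtypes (here simulated by casting string -> object).
import pandas as pd

df = pd.DataFrame({"s": pd.array(["a", "b"], dtype="string")})
original_dtypes = df.dtypes  # record before conversion

# Simulate a round-trip that degrades the extension dtype to object.
roundtripped = df.astype(object)

# Restore the recorded dtypes column by column.
restored = roundtripped.astype(dict(original_dtypes))
assert restored.dtypes.equals(original_dtypes)
```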
I added a bunch of arrow-based types; hopefully I didn't miss any.
jrbourbeau
left a comment
Thanks @hendrikmakait
distributed/shuffle/_arrow.py
Outdated
```python
if str(e) == "Tried reading schema message, was null or length 0":
    return pa.concat_tables(shards)
```
Parsing this error message seems brittle. Is there some other check we can do on the file to make sure we've read it all?
I forgot that we can seek to the end of the open file object; I'm using that instead now.
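The seek-based EOF check mentioned above can be sketched with a plain file object (the fixed record size here is purely illustrative): seek to the end once to learn the total size, then stop reading when the cursor reaches it, instead of catching and string-matching an exception.

```python
# Sketch of an EOF check via seek/tell: learn the file's total size up
# front, then loop until the read cursor reaches it.
import io
import os

f = io.BytesIO(b"payload-one" + b"payload-two")
end = f.seek(0, os.SEEK_END)  # byte offset of end-of-file
f.seek(0)

chunks = []
while f.tell() < end:
    chunks.append(f.read(11))  # fixed-size records, for illustration only

print(len(chunks))  # → 2
```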
```python
assert sum(map(len, out.values())) == len(df)
assert all(v.to_pandas().dtypes.equals(df.dtypes) for v in out.values())
```
(Not meant as a blocking comment) We're checking lengths and dtypes here. Maybe we could just use pd.testing.assert_frame_equal instead?
Good point!
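For reference, the suggested replacement in one call: pd.testing.assert_frame_equal checks values, dtypes, index, and column order together, subsuming the separate length and dtype assertions quoted above.

```python
# Example of the suggested check: a single assert_frame_equal call covers
# values, dtypes, index, and column order at once.
import pandas as pd

df = pd.DataFrame({"x": pd.array(range(5), dtype="float64")})
result = df.copy()

pd.testing.assert_frame_equal(result, df)  # raises AssertionError on mismatch
```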
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
I have incorporated all feedback and CI looks good; this is ready for another review.
jrbourbeau
left a comment
Thanks @hendrikmakait!
Closes #7400
pre-commit run --all-files