
Conversation

@rjzamora (Member) commented Apr 9, 2020

Some dask.dataframe/dask_cudf shuffling algorithms utilize tasks that produce a dictionary of pandas/cudf DataFrame objects. Since these dictionaries are typically larger than five elements, they are usually pickled when the output needs to be transferred between workers. For the cudf-backed shuffle algorithm, this can seriously degrade performance (changing the dominant communication mechanism from IPC to TCP on systems with NVLink support).

This PR increases the size threshold below which each element of a list/set/dict will be separately serialized (from 5 to 64). If the length is greater than 5, the element-wise serialization is only performed if the "first" element is dask-serializable. This check clearly doesn't guarantee that all items in the list/set/dict are dask-serializable, but it always captures the case when all elements are.
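A minimal sketch of the heuristic described above, under one plausible reading of the thresholds. The names here (`SERIALIZABLE_TYPES`, `FakeDataFrame`, `should_serialize_elementwise`) are hypothetical stand-ins; the real implementation dispatches on dask's `dask_serialize` registry rather than a set of types:

```python
# Hypothetical sketch of the heuristic described above. SERIALIZABLE_TYPES
# and FakeDataFrame stand in for dask's real dask_serialize registry; the
# thresholds follow one reading of the PR description.

class FakeDataFrame:
    """Stand-in for a pandas/cudf DataFrame with custom serialization."""

SERIALIZABLE_TYPES = {FakeDataFrame}

def first_element(x):
    # Inspect values for dicts, an arbitrary element otherwise.
    return next(iter(x.values() if isinstance(x, dict) else x))

def should_serialize_elementwise(x):
    # Small collections: always serialize element-by-element (old behavior).
    if len(x) <= 5:
        return True
    # Larger (but bounded) collections: only iterate if the first element
    # looks dask-serializable -- this avoids pickling dicts of DataFrames.
    if len(x) < 64:
        return type(first_element(x)) in SERIALIZABLE_TYPES
    return False
```

Note that only one element is inspected, so the check stays cheap even for large collections.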

@rjzamora (Member, Author) commented Apr 9, 2020

@pentschev - Any thoughts on using logic like this to avoid pickling of the group-shuffle dictionary output?

cc @beckernick @VibhuJawa

@jakirkham (Member)

cc @mrocklin

@mrocklin (Member) commented Apr 9, 2020

                 dask_serialize.dispatch(type(next(iter(x))))

This is a fun approach. Really, what we want to say is "if this dictionary is a bunch of big, complex things that Dask knows how to handle, then please serialize each of them independently. If it's just a big JSON-like blob, then please don't bother". The previous check of if len(d) > 5 was a very poor approximation of this. The check here of "does dask_serialize grok these objects in a special way" seems a lot nicer to me. If we can trust this approach, it would be great to remove the len(d) < 64 part of the check as well.

How well does the dask_serialize approach work? If I had a dict like {"x": 1} does this reliably not dive through the dict? It might be worth pulling this functionality into a separate function, testing it with a bunch of cases that we know about

{"x": 1} -> False
{"x": np.ones(5)} -> True
{"a": 1, "x": np.ones(5)} -> True (maybe xfails today?)
{"x": [np.ones(5)]} -> True
{("x", i): np.ones(5) for i in range(100)} -> True
...

And then make sure that it runs in constant time relative to the size of the dict.

In short, I like this approach if it works. I'd love to see us drop the 64 length limit if we can. I think that being able to do that would be easier if we had a function that we could all agree did the right thing all (most?) of the time, which would be easy to show through a nice test like this.
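The suggested test could look roughly like the sketch below. Since neither dask's registry nor NumPy is imported here, a hypothetical `Registered` class and an explicit `registry` argument stand in for a type (such as `np.ndarray`) that `dask_serialize` knows about:

```python
def check_dask_serializable(x, registry=frozenset()):
    """Return True if x (or, recursively, its first element) has a
    registered serializer. Only the first element of a collection is
    inspected, so the check runs in constant time relative to its size."""
    if isinstance(x, (list, set, tuple)) and x:
        return check_dask_serializable(next(iter(x)), registry)
    if isinstance(x, dict) and x:
        return check_dask_serializable(next(iter(x.values())), registry)
    return type(x) in registry

class Registered:
    """Stand-in for a dask-serializable type such as a NumPy array."""

REGISTRY = frozenset({Registered})
```

With this first-element-only design, the `{"a": 1, "x": np.ones(5)}` case above returns False (the first value is a plain int), which matches the "maybe xfails today?" caveat.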

@rjzamora (Member, Author) commented Apr 9, 2020

In short, I like this approach if it works. I'd love to see us drop the 64 length limit if we can. I think that being able to do that would be easier if we had a function that we could all agree did the right thing all (most?) of the time, which would be easy to show through a nice test like this.

Thanks for the feedback @mrocklin! The approach will indeed not dive through the dict for the case you mentioned. I will work on the test you suggested and confirm that the check runs in constant time.

@mrocklin (Member) commented Apr 9, 2020 via email

@rjzamora rjzamora marked this pull request as ready for review April 10, 2020 02:36
@rjzamora rjzamora changed the title [WIP] Dask-serialize dicts longer than five elements Dask-serialize dicts longer than five elements Apr 10, 2020
@rjzamora (Member, Author) commented Apr 10, 2020

Thanks again for your advice here @mrocklin - I moved the check into a standalone function (check_dask_serializable_collection), and added a test for the behavior you mentioned. One difference from your example, however, is that I could not use a numpy array as a "dask-serializable" object. As far as I can tell, there are no custom serialization mechanisms registered for numpy types. [EDIT: Probably my mistake here - will follow up soon]

@rjzamora (Member, Author)

Okay - It seems that the removal of the len <= 5 criterion is leading to two new test failures...

  • test_nested_types is serializing [[[x]]], and expecting to recursively iterate through the list (but no longer will).
  • test_pickle_safe is serializing [1, 2, 3], but only offering the "msgpack" serializer. Since we are no longer iterating through the list and recording the "collection" datatype, we erroneously convert from list to tuple during the dumps-loads round trip.

@rjzamora (Member, Author)

Note: I am adding back the len <= 5 criterion for now (to allow tests to pass) - I'll need to follow up later

@mrocklin (Member)

For nested types it seems like the check_dask_serializable function might have to recurse if the subtype is list/dict/...

@rjzamora (Member, Author)

For nested types it seems like the check_dask_serializable function might have to recurse if the subtype is list/dict/...

Right - Not sure how else a "dask-serializable" object can be detected within a nested collection... Having a bit of trouble finding a balance here (between functionality and simplicity)

@mrocklin (Member)

Maybe something like this? (I don't really know, I haven't thought too much about this.)

def is_serializable(x):
    if isinstance(x, (tuple, list)) and x:
        return is_serializable(x[0])
    if isinstance(x, dict) and x:
        return is_serializable(toolz.first(x.values()))
    # e.g. consult the dask_serialize registry for this type
    try:
        dask_serialize.dispatch(type(x))
        return True
    except TypeError:
        return False

@rjzamora (Member, Author)

Sorry - I'm not seeing how that is very different from check_dask_serializable. It will still need to recurse, right?

@mrocklin (Member)

Oh I see, yes. I apologize I hadn't looked at the recent code (pushing quickly through issues this morning)

test_nested_types is serializing [[[x]]], and expecting to recursively iterate through the list (but no longer will).

I guess I'm surprised by this, then.

@rjzamora (Member, Author)

I guess I'm surprised by this, then.

Sorry - I was using github/CI a bit too much for debugging :)

That test failure is no longer a problem (the recursion fixes it). The only remaining issue is that I still need the len <= 5 criterion to get all tests to pass. (There also seems to be a recent CI problem unrelated to these changes.)

@mrocklin (Member)

What tests are failing due to the length check? Is it possible to fix whatever is causing those tests to fail?

@rjzamora (Member, Author)

What tests are failing due to the length check? Is it possible to fix whatever is causing those tests to fail?

On my local system, it is only distributed/tests/test_publish.py::test_pickle_safe - The problem seems to be that a msgpack round-trip converts a list to a tuple. I removed the len <= 5 check and added a special case for when the "pickle" serializer is unavailable -- I will try to find time to look into the root cause/extent of the msgpack problem.

@pentschev (Member)

Sorry for the very late reply here @rjzamora. I think your approach is great - nice that you managed to improve my very naive implementation, thanks for doing that! Also, it seems that it passed now! :)

@rjzamora (Member, Author)

@mrocklin - Do you think the current state of this PR is reasonable? (thanks again for your attention on this)

@jakirkham (Member) left a comment:

Looks good. Thanks Rick! 😄

Made a small observation below. Don't think anything needs to be done for it though. Just something for us to keep in mind 😉

        return check_dask_serializable(next(iter(x.items()))[1])
    else:
        try:
            dask_serialize.dispatch(type(x))
@jakirkham (Member):

Just a note, this requires that objects be registered with dask_serialize (as we don't check cuda_serialize for example). However we do register all CUDA objects with dask_serialize now and use this in other contexts (like Dask-CUDA's spilling). So this seems like a good enough check.

@rjzamora (Member, Author):

Thanks for the note @jakirkham - The fact that you have cleaned up serialization and registered everything as dask_serialize definitely made my life easier here - Thanks! :)

@jakirkham (Member)

Planning on merging tomorrow if no comments.

    # Check for "dask"-serializable data in dict/list/set
    supported = (
        isinstance(x, list) and "pickle" not in serializers
    ) or check_dask_serializable(x)
(Member):

What's going on here? I know that you added this because a particular test was failing, but what was the underlying cause?

@rjzamora (Member, Author):

Sorry - Didn't have time to look into this carefully yet. A list is being converted to a tuple during a round trip. I originally thought it was a msgpack limitation (since the failing test only provides the "msgpack" serializer). However, the problem may be something else in client.publish_dataset or get_dataset (a simple msgpack dumps-loads test works fine).

@rjzamora (Member, Author):

The problem seems to be that the use_list=False argument is passed to msgpack.loads in msgpack_loads - Not sure if there is a reason for this?
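For reference, the effect of that argument can be reproduced directly with the msgpack package (a small demonstration, assuming `msgpack` is installed; `distributed`'s own wrapper is not needed here):

```python
import msgpack

packed = msgpack.packb([1, 2, 3])

# Default behavior: msgpack arrays come back as Python lists.
assert msgpack.unpackb(packed) == [1, 2, 3]

# With use_list=False (as passed in distributed's msgpack_loads wrapper),
# the same data comes back as a tuple -- the round-trip change seen here.
assert msgpack.unpackb(packed, use_list=False) == (1, 2, 3)
```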

@jakirkham (Member):

Looks like that might be intentional. Appears to have been added in PR ( #2000 ). @mrocklin, how would you like us to proceed here?

@rjzamora (Member, Author):

Agreed @jakirkham - Removing that option allows tests to pass here, but may have unintended consequences.

(Member):

But doesn't [an_unsupported_object] pass this check if the msgpack serializer is included?

@rjzamora (Member, Author):

Yes - You are right. supported just means “we should iterate through the collection”

@rjzamora (Member, Author):

@mrocklin - I just pushed some changes to (hopefully) make the logic a bit more clear. For example, I am using the term iterate_collection instead of supported. I am also now considering the order of the serializers list, because "msgpack" is only a problem for lists if it comes before "pickle" (I added test_serialize_lists to show/check this).
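A rough sketch of that ordering-sensitive decision (hypothetical; the function name and signature are illustrative, not the actual code in distributed.protocol.serialize):

```python
def iterate_collection_sketch(x, serializers, contains_dask_serializable):
    """Decide whether to serialize a collection element-by-element.

    Iterate when the collection holds dask-serializable objects, or when
    x is a list and "msgpack" would be tried before "pickle" -- otherwise
    a msgpack round trip would silently turn the list into a tuple.
    """
    if contains_dask_serializable:
        return True
    if isinstance(x, list):
        if "pickle" not in serializers:
            return True
        if "msgpack" in serializers:
            return serializers.index("msgpack") < serializers.index("pickle")
    return False
```

The key point is that the serializer list is ordered, so "msgpack" is only a hazard for lists when it would be chosen ahead of "pickle".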

(Member):

Ah, I see what's going on now. OK, thanks @rjzamora !

(Member):

With PR ( #4575 ), I think we are able to work around this MsgPack oddity. So I dropped this code there.

@jakirkham (Member)

Outside of that small comment, I think this is about as good for now. Hopefully issue ( #3716 ) will garner some attention and we can determine how to improve things more in the future.

Co-Authored-By: jakirkham <jakirkham@gmail.com>