Dask-serialize dicts longer than five elements #3689
Conversation
@pentschev - Any thoughts on using logic like this to avoid pickling of the
cc @mrocklin
dask_serialize.dispatch(type(next(iter(x))))

This is a fun approach. Really, what we want to say is "if this dictionary is a bunch of big-complex things that Dask knows how to handle, then please serialize each of them independently. If it's just a big json-like blob, then please don't bother". The previous check (based only on length) didn't do this well. How well does the new check handle cases like the following?

{"x": 1} -> False
{"x": np.ones(5)} -> True
{"a": 1, "x": np.ones(5)} -> True (maybe xfails today?)
{"x": [np.ones(5)]} -> True
{("x", i): np.ones(5) for i in range(100)} -> True
...

And then make sure that it runs in constant time relative to the size of the dict. In short, I like this approach if it works. I'd love to see us drop the 64 length limit if we can. I think that being able to do that would be easier if we had a function that we could all agree did the right thing all (most?) of the time, which would be easy to show through a nice test like this.
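For reference, a minimal sketch of such a test (assuming the helper is named check_dask_serializable, as in the code quoted later in this thread, and that numpy arrays are registered with dask_serialize):

import numpy as np
import pytest

from distributed.protocol.serialize import check_dask_serializable

@pytest.mark.parametrize(
    "data, expected",
    [
        ({"x": 1}, False),
        ({"x": np.ones(5)}, True),
        # ({"a": 1, "x": np.ones(5)}, True),  # fails a first-element-only check
        ({"x": [np.ones(5)]}, True),
        ({("x", i): np.ones(5) for i in range(100)}, True),
    ],
)
def test_check_dask_serializable(data, expected):
    # A first-element-only check runs in constant time
    # regardless of the collection's size.
    assert check_dask_serializable(data) == expected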
Thanks for the feedback @mrocklin! The approach will indeed not dive through the dict for the case you mentioned. I will work on the test you suggested and confirm that the check runs in constant time.
I think that you've done a constant time check by only checking the first
one. This means that some of the test cases will fail (like the one with an
integer and a numpy array in the dict), but makes things fast.
Thanks again for your advice here @mrocklin - I moved the check into a standalone function (check_dask_serializable)
Okay - It seems that the removal of the
Note: I am adding back the
For nested types it seems like the
Right - Not sure how else a "dask-serializable" object can be detected within a nested collection... Having a bit of trouble finding a balance here (between functionality and simplicity)
Maybe something like this? (I don't know really, I haven't thought too much about this)

def is_serializable(x):
    if isinstance(x, (tuple, list)) and x:
        return is_serializable(x[0])
    if isinstance(x, dict) and x:
        return is_serializable(toolz.first(x.values()))
    return dask_serialize.....(x)
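A runnable completion of that sketch might look like the following (an assumption: the elided last line resolves to dask_serialize.dispatch, which raises TypeError for unregistered types):

import toolz
from distributed.protocol.serialize import dask_serialize

def is_serializable(x):
    # Recurse into only the first element of nested collections,
    # keeping the check constant-time in the collection size.
    if isinstance(x, (tuple, list)) and x:
        return is_serializable(x[0])
    if isinstance(x, dict) and x:
        return is_serializable(toolz.first(x.values()))
    # dispatch() raises TypeError when no dask serializer is registered
    try:
        dask_serialize.dispatch(type(x))
        return True
    except TypeError:
        return False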
Sorry - I'm not seeing how that is very different from the current check_dask_serializable
Oh I see, yes. I apologize, I hadn't looked at the recent code (pushing quickly through issues this morning).
I guess I'm surprised by this then.
Sorry - I was using github/CI a bit too much for debugging :) That test failure is no longer a problem (the recursion fixes it). The only remaining issue is that I still need the length check.
What tests are failing due to the length check? Is it possible to fix whatever is causing those tests to fail?
On my local system, it is only
Sorry for the very late reply here @rjzamora. I think your approach is great; nice that you managed to improve my very naive implementation, thanks for doing that! Also it seems that it passed now! :)
@mrocklin - Do you think the current state of this PR is reasonable? (thanks again for your attention on this)
jakirkham left a comment:
Looks good. Thanks Rick! 😄
Made a small observation below. Don't think anything needs to be done for it though. Just something for us to keep in mind 😉
        return check_dask_serializable(next(iter(x.items()))[1])
    else:
        try:
            dask_serialize.dispatch(type(x))
Just a note, this requires that objects be registered with dask_serialize (as we don't check cuda_serialize for example). However we do register all CUDA objects with dask_serialize now and use this in other contexts (like Dask-CUDA's spilling). So this seems like a good enough check.
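For context, registering a custom type with dask_serialize looks roughly like this (MyType and its byte layout are hypothetical; the header/frames pattern follows the distributed serialization docs):

from distributed.protocol.serialize import dask_serialize, dask_deserialize

class MyType:
    def __init__(self, payload: bytes):
        self.payload = payload

@dask_serialize.register(MyType)
def serialize_mytype(x):
    header = {}            # metadata needed to rebuild the object
    frames = [x.payload]   # raw buffers, sent without pickling
    return header, frames

@dask_deserialize.register(MyType)
def deserialize_mytype(header, frames):
    return MyType(frames[0])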
Thanks for the note @jakirkham - The fact that you have cleaned up serialization and registered everything as dask_serialize definitely made my life easier here - Thanks! :)
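Pieced together from the fragments quoted above, the helper looks roughly like this (only the quoted lines are verbatim; the surrounding branches are a reconstruction and an assumption):

def check_dask_serializable(x):
    # Inspect only the first element, so the check is constant-time
    # in the size of the collection.
    if type(x) in (list, set, tuple) and len(x):
        return check_dask_serializable(next(iter(x)))
    elif type(x) is dict and len(x):
        return check_dask_serializable(next(iter(x.items()))[1])
    else:
        try:
            dask_serialize.dispatch(type(x))
            return True
        except TypeError:
            pass
    return False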
Planning on merging tomorrow if no comments.
In distributed/protocol/serialize.py (outdated):
# Check for "dask"-serializable data in dict/list/set
supported = (
    isinstance(x, list) and "pickle" not in serializers
) or check_dask_serializable(x)
What's going on here? I know that you added this because a particular test was failing, but what was the underlying cause?
Sorry - Didn't have time to look into this carefully yet. A list is being converted to a tuple during a round trip. I originally thought it was a msgpack limitation (since the failing test only provides the "msgpack" serializer). However, the problem may be something else in client.publish_dataset or get_dataset (a simple msgpack dumps-loads test works fine).
The problem seems to be that the use_list=False argument is passed to msgpack.loads in msgpack_loads - Not sure if there is a reason for this?
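To illustrate the behavior in question (standard msgpack API; this mirrors what msgpack_loads does with and without the flag):

import msgpack

packed = msgpack.dumps(["a", "b", "c"])
print(msgpack.loads(packed))                  # ['a', 'b', 'c']
print(msgpack.loads(packed, use_list=False))  # ('a', 'b', 'c') - lists come back as tuples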
Agreed @jakirkham - Removing that option allows tests to pass here, but may have unintended consequences.
But doesn't [an_unsupported_object] pass this check if the msgpack serializer is included?
Yes - You are right. supported just means “we should iterate through the collection”
@mrocklin - I just pushed some changes to (hopefully) make the logic a bit more clear. For example, I am using the term iterate_collection instead of supported. I am also now considering the order of the serializers list, because "msgpack" is only a problem for lists if it comes before "pickle" (I added test_serialize_lists to show/check this).
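A hypothetical sketch of that order-sensitive decision (helper and parameter names are assumptions; the actual merged logic may differ):

def iterate_collection(x, serializers):
    # msgpack coerces a list to a tuple on the round trip (use_list=False),
    # so iterate through lists whenever "msgpack" would be tried before "pickle".
    if isinstance(x, list) and "msgpack" in serializers:
        if "pickle" not in serializers:
            return True
        return serializers.index("msgpack") < serializers.index("pickle")
    return check_dask_serializable(x)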
Ah, I see what's going on now. OK, thanks @rjzamora !
With PR ( #4575 ), I think we are able to work around this MsgPack oddity, so that code was dropped there.
Outside of that small comment, I think this is about as good as it gets for now. Hopefully issue ( #3716 ) will garner some attention and we can determine how to improve things more in the future.
Co-Authored-By: jakirkham <jakirkham@gmail.com>
Some dask.dataframe/dask_cudf shuffling algorithms utilize tasks that produce a dictionary of pandas/cudf DataFrame objects. Since these dictionaries are typically larger than five elements, they are usually pickled when the output needs to be transferred between workers. For the cudf-backed shuffle algorithm, this can seriously degrade performance (changing the dominant communication mechanism from IPC to TCP on systems with NVLink support).
This PR increases the size limit below which each element of a list/set/dict will be separately serialized (from 5 to 64). If the length is >5, the iterative serialization is only performed if the "first" element is dask-serializable. The check clearly doesn't assert that all items in the list/set/dict are dask-serializable, but it always captures the case when all elements are.
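A sketch of the policy just described (assumed helper name; not the exact merged code):

def serialize_elementwise(x):
    # Collections of five or fewer elements are always handled element-wise,
    # matching the previous behavior.
    if len(x) <= 5:
        return True
    # Larger collections (up to the new 64-element limit) are handled
    # element-wise only when the first element is dask-serializable.
    return len(x) <= 64 and check_dask_serializable(x)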