Add iterate_collection argument to serialize #4641
Conversation
cc @jrbourbeau @ian-r-rose - This seems to resolve most of the serialization problems we discussed today. If you use …
```diff
-def serialize(x, serializers=None, on_error="message", context=None):
+def serialize(
+    x, serializers=None, on_error="message", context=None, iterate_collection=None
```
Perhaps, if iterate_collection=None isn't supposed to have a distinct meaning from iterate_collection=False (which seems to be the case), the default value could be set to False?
Right - I guess the current logic will use False as the default anyway, so we might as well be explicit. Thanks, good suggestion.
Yes, thank you! Basically, my thinking was that this way the intended value space would be documented more clearly.
After looking back at the code a bit, I think the `iterate_collection=None` default actually makes sense. The problem is that we do not need the `iterate_collection = iterate_collection or False` line below.
The behavior we want:
- `iterate_collection=True`: Serialization always dumps the elements of the collection separately.
- `iterate_collection=False`: Serialization always dumps the entire collection together.
- `iterate_collection=None` (default): The ideal approach is inferred from the existence/location of `"pickle"` among the list of available serializers, and the types of data detected in the collection.
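For concreteness, a minimal sketch of the `iterate_collection=True` case described above (the payload value is invented, and this assumes the semantics added in this PR):

```python
from distributed.protocol import Serialized, deserialize, serialize

# Pre-serialize one element, as the scheduler might hand it to us.
sub_header, sub_frames = serialize("payload")
pre = Serialized(sub_header, sub_frames)

# With iterate_collection=False the tuple would be dumped as a single
# blob, and `pre` would still be Serialized after a round trip. Forcing
# iteration keeps a separate header/frames pair per element, so
# deserialize can unpack it back to the original value.
header, frames = serialize((1, pre), iterate_collection=True)
assert deserialize(header, frames) == (1, "payload")
```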
Indeed, that `iterate_collection = iterate_collection or False` line was redundant (apologies I couldn't be of more help by spotting it myself). Thanks for updating the docstring!
One potential style point about `in (list, set, tuple, dict)` (further down). Not sure if there are any firm guidelines about this (in distributed or Dask organization-wide), but perhaps that could be considered as well (i.e., `in <tuple>` vs. `in <set>`).
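For reference, the two spellings under discussion (the helper name here is hypothetical):

```python
def _is_collection(x):
    # Tuple-literal membership test, as written in the PR:
    return type(x) in (list, set, tuple, dict)

def _is_collection_alt(x):
    # Set-literal alternative raised above; for four elements the
    # practical difference is negligible, so this is purely a style call.
    return type(x) in {list, set, tuple, dict}
```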
cc @madsbk (in case you have thoughts here 🙂)
distributed/worker.py (outdated):
```python
_dumps = warn_serialize if _serialized_task_arg(task[2:]) else warn_dumps
d = {"function": dumps_function(task[1]), "args": _dumps(task[2])}
```
@gjoseph92 @ian-r-rose @jrbourbeau - The changes in this PR seem to address the requirements in #4673 (in fact, nested tasks also work), but I am not happy with the changes in dumps_task.
Note that _serialized_task_arg is checking the type of each (non-function) element in task, and using warn_serialize rather than warn_dumps if there are any Serialized objects. This extra pass over the data partially defeats the purpose of the check (we might as well always use warn_serialize if we are going to take the time to iterate over elements anyway). Perhaps we should introduce a standalone serialize_task function that will do the same thing as dumps_task, but will always call warn_serialize on the args and kwargs? In that case, the decision whether to use dumps_task vs serialize_task would need to be made upstream, but the logic in dumps_task could remain the same.
I guess we could also cache the _serialized_task_arg result based on the function, but that seems like it could be fragile.
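A rough sketch of what that split might look like, following the apply-style branch shown above (`warn_serialize` is the serialize-based counterpart of `warn_dumps` from this PR's diff; the real `dumps_task` handles more cases):

```python
from distributed.worker import dumps_function, warn_serialize

def serialize_task(task):
    # Same shape as dumps_task's apply branch, but always run the
    # args/kwargs through warn_serialize, skipping the per-element
    # _serialized_task_arg type check entirely.
    d = {"function": dumps_function(task[1]), "args": warn_serialize(task[2])}
    if len(task) == 4:
        d["kwargs"] = warn_serialize(task[3])
    return d
```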
Okay - I think I'd like to leave dumps_task alone and introduce the new logic for serializing a task on the scheduler in a new serialize_task function (I introduced this in 093d240).
jrbourbeau
left a comment
Thanks for pushing on this @rjzamora! I'm still grokking all the changes here.
Reading through the discussion over in #4673, I was wondering what your thoughts were on @gjoseph92's comments, specifically:
> Register a dummy serialization family for "task", which caches, then just calls back into deserialize. By marking things as "task" when serializing, we can reduce cache misses. Kinda hacky, could work?

> Alternatively, the quickest solution would be to leave everything as it is, and have worker._deserialize just call deserialize on the args/kwargs, to unpack any Serialize/Serialized objects they may contain. But I imagine that is undesirable, especially when we already have this nice machinery for handling nested serialization with to_serialize and Serialized.
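For concreteness, a rough sketch of what the first ("dummy family") option might look like, using distributed's `register_serialization_family` hook; the `task-parts` header layout and frame bookkeeping are invented for illustration, and the caching piece is omitted:

```python
from distributed.protocol.serialize import (
    deserialize,
    register_serialization_family,
    serialize,
)

def task_dumps(task):
    # Dump each task element separately so nested Serialize/Serialized
    # objects keep their own header/frames.
    headers, frames = [], []
    for element in task:
        h, fs = serialize(element)
        headers.append({"sub-header": h, "num-frames": len(fs)})
        frames.extend(fs)
    return {"task-parts": headers}, frames

def task_loads(header, frames):
    # Call straight back into deserialize for every element.
    out, start = [], 0
    for part in header["task-parts"]:
        stop = start + part["num-frames"]
        out.append(deserialize(part["sub-header"], frames[start:stop]))
        start = stop
    return tuple(out)

register_serialization_family("task", task_dumps, task_loads)
```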
Yeah - I would consider this to be the main alternative to using something like …
This could work for the …
gjoseph92
left a comment
> we would still need the `iterate_collection` changes for `Serialized` objects to be captured correctly.

Agreed: no matter what we do for task serialization, we'll need core `serialize` to be better at descending through collections.
Since we're less sure about the approach for task serialization though, could it make sense to focus this PR on just iterate_collection, and move the serialize_task changes into a different one?
My mistake for not (re-)marking this as a draft PR. I am fairly certain that we will need to expose an …
Confession: I actually thought I was pushing to a separate branch when I first started the serialize_task experiment, but just carried on after I had already made a mess.
Some notes: In another branch, I am also experimenting with a …
Okay, good(ish) news: …
```diff
-    if args:
+    if args and isinstance(args, bytes):
         args = pickle.loads(args)
-    if kwargs:
+    if kwargs and isinstance(kwargs, bytes):
```
Note that this is the only change that wasn't in this PR originally. When constructing a task on the scheduler, we will need to have the option of using to_serialize on args/kwargs (so that we can avoid pickling Serialized/Serialize objects while still enabling function caching). Therefore, these dict values may not be pickled bytes when we get to _deserialize.
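A small illustration of that scheduler-side construction (the message shape here is hypothetical; `dumps_function` caches the pickled function, while `to_serialize` defers serialization of the args):

```python
from operator import add

from distributed.protocol import to_serialize
from distributed.worker import dumps_function

# The function is pickled (and cached) by dumps_function, but the args
# are wrapped rather than pickled, so any embedded Serialize/Serialized
# objects survive. On the worker, _deserialize then receives a tuple
# instead of bytes -- hence the isinstance(args, bytes) check above.
msg = {"function": dumps_function(add), "args": to_serialize((1, 2))}
```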
Just a note: I have also experimented with a …
Any other reason not to always iterate collections besides performance?
Only for performance reasons (as far as I can tell).
I think the hypothetical case we are guarding against is a many-element collection (much larger than a task-based tuple) that doesn't contain any special elements. My intuition tells me that we don't want to iterate through these elements in Python and generate separate header/frames that will then need to be iterated through again on the worker. At the same time, it would also be nice to avoid the iterate_collection inference step (which, as you point out, also has real overhead). I certainly agree that this is probably an appropriate time to question the iterate_collection approach altogether. When a graph is generated on the client, we always use …
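A rough way to see the trade-off being described, assuming the new keyword (the collection size is arbitrary):

```python
import time

from distributed.protocol import serialize

# A large, homogeneous collection with no special elements -- the case
# where per-element iteration only adds overhead.
data = list(range(100_000))

t0 = time.perf_counter()
serialize(data, iterate_collection=False)  # one serialized blob
t1 = time.perf_counter()
serialize(data, iterate_collection=True)   # header/frames per element
t2 = time.perf_counter()
print(f"whole-collection: {t1 - t0:.4f}s, per-element: {t2 - t1:.4f}s")
```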
We could also do something in the middle and set …
Thanks @jrbourbeau - I like the idea of avoiding the need to string …
Hrm, that's unfortunate. Specifically, I'm wondering if we can avoid the need for double … with something like:
```diff
diff --git a/distributed/protocol/core.py b/distributed/protocol/core.py
index fb85d32f..e471ca8b 100644
--- a/distributed/protocol/core.py
+++ b/distributed/protocol/core.py
@@ -48,8 +48,6 @@ def dumps(msg, serializers=None, on_error="message", context=None) -> list:
     def _encode_default(obj):
         typ = type(obj)
         if typ is Serialize or typ is Serialized:
-            if typ is Serialize:
-                obj = obj.data
             offset = len(frames)
             if typ is Serialized:
                 sub_header, sub_frames = obj.header, obj.frames
```
Ah - Good suggestion @jrbourbeau. That works. Sorry I missed it!
I assume the …
Okay - CI finally passing here :) Since we probably don't want to rush #4699, and this PR no longer breaks the public API, I think there is still good motivation to get this in.
jrbourbeau
left a comment
Nice green check marks @rjzamora : )
@jakirkham if you get a moment, your feedback on this PR would be valuable. As Rick pointed out in his original description:
> One critical issue is that `serialize` is not guaranteed to iterate through the elements of a task (and generate separate header/frames for each element). This is a problem when one or more of those task elements is a `Serialized` object: we don't want these objects to be grouped together with the other elements in a single blob of bytes, otherwise the object will remain `Serialized` when it is passed to the actual function on the worker.

This PR updates `serialize` to always iterate through collections when it encounters a `Serialize` object.
In the future we will probably address task serialization with a more widespread refactor (e.g. something like @madsbk's work over in #4699), but the updates here serve as a smaller set of changes we could make today to unblock moving more operations to HighLevelGraph representations over in dask (e.g. dask/dask#7455, dask/dask#7415).
Thanks for the summary @jrbourbeau - I couldn't have put it better myself :)
jrbourbeau
left a comment
Thanks @rjzamora! Planning to merge later today if there are no further comments
gjoseph92
left a comment
This LGTM as well
This PR is motivated by serialization needs for the ongoing Blockwise-IO work in dask/dask (e.g. dask#7417, dask#7415). One critical issue is that `serialize` is not guaranteed to iterate through the elements of a task (and generate separate header/frames for each element). This is a problem when one or more of those task elements is a `Serialized` object: we don't want these objects to be grouped together with the other elements in a single blob of bytes, otherwise the object will remain `Serialized` when it is passed to the actual function on the worker.
Note that I also think we should always use `serialize(..., iterate_collection=True)` when the object is a task. However, I'd rather leave that work for a separate PR.