-
-
Notifications
You must be signed in to change notification settings - Fork 748
[WIP] Distinguish tuples & lists in MsgPack serialization
#4575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
6272ed6 to
7f90128
Compare
As `tuple`s receive special handling that `list`s don't, decode to `list`s instead of to `tuple`s as any `list` that should go into a `tuple` will be converted to that.
This ensures that `tuple`s is handled by the `default` encoder.
0e6eb6c to
b39c832
Compare
b39c832 to
41025d8
Compare
tuples & lists in MsgPack serializationtuples & lists in MsgPack serialization
|
Thanks @jakirkham for working on this but I am a little concerned about the performance implications. AFAIK, we have been using |
|
|
True, but
The reason why msgpack defaults to But anyways, if the performance impact is reasonable, I think it is a great change :) |
|
If you know better ways to approach this with MsgPack, am open to suggestions 🙂 |
|
AFAICT the choice to use |
|
cc @jrbourbeau (for awareness) |
|
Yep also mentioned here ( #3689 (comment) ) The reason that code was there IIUC was just to workaround the fact that MsgPack can't distinguish between a |
I didn't look at this PR yet, but |
|
Yeah that makes sense. Given that this may already be somewhat unnecessary given the recent surgery that Mads did on serialization ( #4531 ). In particular that performs |
|
Quick comment, historically the solution to this was for the application
code to never care about the distinction between tuples and lists.
If there were a performance impact then I would suggest that we just change
the application code that is conflicting here.
…On Tue, Mar 30, 2021 at 1:36 PM jakirkham ***@***.***> wrote:
Yeah that makes sense. Given that this may already be somewhat unnecessary
given the recent surgery that Mads did on serialization ( #4531
<#4531> ). In particular that
performs "dask" or "cuda" serialization as needed with objects it sees
while doing MsgPack serialization. Also it relaxes the requirement that
everything in the message be serialized the same way. So there's less of a
need to look ahead at objections in a collection and make a decision since
that decision will be made per object seen in the collection while
serializing them
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#4575 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTFR5K7XB3FDLB26MMTTGIK3JANCNFSM4Y4WL5UA>
.
|
|
Unfortunately it's not that simple as these occur all over the place for reasons that are unclear (at least not to me). So have focused instead on just preserving that information somehow. Agree it would be nice to make things consistent, but actually that may be harder lift |
| if ( | ||
| type(x) in (list, set, tuple) | ||
| and iterate_collection | ||
| or type(x) is dict | ||
| and iterate_collection | ||
| and dict_safe | ||
| ): | ||
| if type(x) in (list, set, tuple) or type(x) is dict and dict_safe: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this change makes #4641 unnecessary, because this PR is effectively setting iterate_collection=True in all cases. This is great for HLG serialization on the scheduler, because we need the serialization code path to detect existing Serialized objects within the elements of a serialized task.
With that said, I do want to make sure that we want this to be the default behavior for all objects. That is, do we want to be diving in to every element when the tuple length is 1000+?
What are these places? Looking at issues I'm seeing #4625 . Anything else? It seems like if we try not to mutate things then we should be ok here. That seems like good practice anyway in the common case. I wouldn't mind avoiding fancy msgpack things and instead focusing on not mutating. That seems like it might be a long-term simpler solution. |
|
Here is a possible solution to the issue mentioned above: #4653 |
"msgpack"doesn't preservelists #3716 and Fixes Test failures with explicit comms rapidsai/dask-cuda#549black distributed/flake8 distributedMsgPack does not itself have a way to distinguish between a
listor atupleand will convert them all into the same thing. However in Dask and 3rd party libraries we often rely on one or the other being used. So there is value in tracking whether alistortuplewas present and unpack it into the expected type.Here we solve this issue by enabling
use_strictto forcetupleto not be handled as alistand instead throughdefaultobject encoding. Withindefaultencoding, we pack thistupleinto adictnoting that it is atupleand coercing the value into alist. Then inobject_hookdecoding we convert the containedlistback into atuple. This way we can ensure we get atuplewhen expected.lists simply go through regular MsgPack handling.cc @madsbk @rjzamora @pentschev