@jakirkham
Member

@jakirkham jakirkham commented Mar 9, 2021

MsgPack itself has no way to distinguish between a list and a tuple and converts both into the same thing. However, in Dask and 3rd party libraries we often rely on one or the other being used. So there is value in tracking whether a list or a tuple was present and unpacking it into the expected type.

Here we solve this issue by enabling strict_types, which forces a tuple to not be handled as a list and to go through default object encoding instead. Within default encoding, we pack the tuple into a dict that notes it is a tuple and coerces the value into a list. Then in object_hook decoding we convert the contained list back into a tuple. This way we can ensure we get a tuple when expected. Lists simply go through regular MsgPack handling.

cc @madsbk @rjzamora @pentschev
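
The approach described above can be illustrated with a minimal sketch. It is not the code from this PR: the function names and the `"__tuple__"` / `"items"` keys are hypothetical, chosen only for illustration. The msgpack-python options it relies on (`default`, `strict_types`, `object_hook`, `use_list`) are the ones named in the description.

```python
def encode_default(obj):
    """`default` hook: msgpack calls this for objects it will not pack natively."""
    if type(obj) is tuple:
        # Tag the tuple and hand msgpack a plain list it can pack.
        # The key names are illustrative assumptions, not the PR's actual keys.
        return {"__tuple__": True, "items": list(obj)}
    raise TypeError(f"cannot serialize object of type {type(obj).__name__}")


def decode_hook(obj):
    """`object_hook`: msgpack calls this for every decoded map."""
    if obj.get("__tuple__"):
        return tuple(obj["items"])
    return obj


def roundtrip(obj):
    # msgpack is third-party; importing it here keeps the hooks usable without it.
    import msgpack

    # strict_types=True stops msgpack from packing tuples as arrays,
    # so they are forwarded to `encode_default` instead.
    packed = msgpack.packb(obj, default=encode_default, strict_types=True)
    # use_list=True decodes arrays as lists; tagged maps are turned back into tuples.
    return msgpack.unpackb(packed, object_hook=decode_hook, use_list=True)
```

With these hooks, `roundtrip({"a": (1, [2, 3])})` is expected to give back a tuple for the outer value and a list for the inner one. One limitation of a sketch like this is that a user dict that happens to contain the tag key would be decoded as a tuple.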

As `tuple`s receive special handling that `list`s don't, decode to
`list`s instead of to `tuple`s as any `list` that should go into a
`tuple` will be converted to that.
This ensures that `tuple`s are handled by the `default` encoder.
@jakirkham jakirkham force-pushed the ser_list_msgpack branch 2 times, most recently from 0e6eb6c to b39c832 Compare March 10, 2021 00:00
@jakirkham jakirkham marked this pull request as draft March 10, 2021 00:54
@jakirkham jakirkham changed the title Distinguish tuples & lists in MsgPack serialization [WIP] Distinguish tuples & lists in MsgPack serialization Mar 10, 2021
@madsbk
Contributor

madsbk commented Mar 10, 2021

Thanks @jakirkham for working on this but I am a little concerned about the performance implications. AFAIK, we have been using strict_types=False and use_list=False in order to minimize overhead. It might not have a significant impact but we should do some benchmarking before merge.

@jakirkham
Member Author

strict_types just uses more fine-tuned checks like this. For this reason would expect strict_types=False to be slower than strict_types=True as the former checks for more things than the latter

use_list just gets used in a bunch of ternary expressions like this. So would expect the performance is similar in either case

@madsbk
Contributor

madsbk commented Mar 10, 2021

> strict_types just uses more fine-tuned checks like this. For this reason would expect strict_types=False to be slower than strict_types=True as the former checks for more things than the latter

True, but msgpack_encode_default() will be called for each tuple.

> use_list just gets used in a bunch of ternary expressions like this. So would expect the performance is similar in either case

The reason why msgpack defaults to use_list=False is because a tuple is a bit faster than a list to create.

But anyways, if the performance impact is reasonable, I think it is a great change :)
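
One way to get a first number for the overhead discussed here is a small `timeit` comparison. This is only a sketch: the helper name, the payload shape, and the `"__tuple__"` tag are assumptions made for illustration, and a real benchmark would use representative Dask messages.

```python
import timeit


def bench_tuple_packing(n_tuples=1000, number=50):
    """Time packing a list of small tuples with and without a `default` hook."""
    # msgpack is third-party; imported lazily so the helper can be defined without it.
    import msgpack

    data = [(i, i + 1) for i in range(n_tuples)]

    def default(obj):
        # Called once per tuple when strict_types=True (hypothetical tag format).
        if type(obj) is tuple:
            return {"__tuple__": True, "items": list(obj)}
        raise TypeError(f"cannot serialize object of type {type(obj).__name__}")

    # Baseline: tuples are packed directly as msgpack arrays.
    plain = timeit.timeit(lambda: msgpack.packb(data), number=number)
    # With the hook: every tuple takes a Python-level call plus an extra map.
    tagged = timeit.timeit(
        lambda: msgpack.packb(data, default=default, strict_types=True),
        number=number,
    )
    return plain, tagged
```

Calling `bench_tuple_packing()` returns the two timings in seconds; the ratio between them gives a rough idea of the per-tuple cost of going through the `default` hook.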

@jakirkham
Member Author

If you know better ways to approach this with MsgPack, am open to suggestions 🙂

@jakirkham
Member Author

AFAICT the choice to use tuples has more to do with where they can be used ( #2000 ). When it comes to MsgPack defaults, it appears using a list is the default

@jakirkham
Member Author

cc @jrbourbeau (for awareness)

@jrbourbeau
Member

cc @rjzamora for visibility as it looks like this PR is removing iterate_collection which will impact #4641

@jakirkham
Member Author

Yep also mentioned here ( #3689 (comment) )

The reason that code was there IIUC was just to work around the fact that MsgPack can't distinguish between a list & a tuple, which this would fix, making that code obsolete

@rjzamora
Member

rjzamora commented Mar 30, 2021

> The reason that code was there IIUC was just to work around the fact that MsgPack can't distinguish between a list & a tuple, which this would fix, making that code obsolete

I didn't look at this PR yet, but iterate_collection is just a flag that determines whether serialization should dive into collection elements individually or serialize the entire collection at once (with something like pickle). The msgpack list/tuple concern was just a "special" case where we wanted to set iterate_collection=True, but we also want to iterate when the elements contain special objects (like GPU-serializable types) or Serialized elements.
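
The two behaviors that the flag selects between can be shown with a toy serializer. This is a sketch only: `toy_serialize` and its output format are hypothetical and are not distributed's actual API.

```python
import pickle


def toy_serialize(x, iterate_collection=False):
    """Toy illustration of the flag; the output layout is made up for this example."""
    if iterate_collection and type(x) in (list, set, tuple):
        # Dive in: each element is serialized on its own, so an element that
        # needs special handling could be routed to its own serializer.
        return {
            "type": type(x).__name__,
            "items": [toy_serialize(el, iterate_collection=True) for el in x],
        }
    # Otherwise the whole object is serialized in one shot.
    return {"type": "pickle", "data": pickle.dumps(x)}
```

For example, `toy_serialize((1, 2))` produces a single pickled blob, while `toy_serialize((1, 2), iterate_collection=True)` produces one entry per element and remembers that the container was a tuple.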

@jakirkham
Member Author

Yeah, that makes sense. That said, this may already be somewhat unnecessary given the recent surgery that Mads did on serialization ( #4531 ). In particular, that performs "dask" or "cuda" serialization as needed on objects it sees while doing MsgPack serialization. It also relaxes the requirement that everything in the message be serialized the same way. So there's less of a need to look ahead at objects in a collection and make a decision, since that decision will be made per object seen in the collection while serializing them

@mrocklin
Member

mrocklin commented Mar 30, 2021 via email

@jakirkham
Member Author

Unfortunately it's not that simple, as these occur all over the place for reasons that are unclear (at least to me). So have focused instead on just preserving that information somehow. Agree it would be nice to make things consistent, but actually that may be a harder lift

Comment on lines -259 to +251
- if (
-     type(x) in (list, set, tuple)
-     and iterate_collection
-     or type(x) is dict
-     and iterate_collection
-     and dict_safe
- ):
+ if type(x) in (list, set, tuple) or type(x) is dict and dict_safe:

Note that this change makes #4641 unnecessary, because this PR is effectively setting iterate_collection=True in all cases. This is great for HLG serialization on the scheduler, because we need the serialization code path to detect existing Serialized objects within the elements of a serialized task.

With that said, I do want to make sure that we want this to be the default behavior for all objects. That is, do we want to be diving into every element when the tuple length is 1000+?
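
The claim that the new condition behaves like the old one with iterate_collection=True can be checked with a short sketch. The two predicates below copy the conditions from the diff; the wrapper function names and the sample values are assumptions added for illustration.

```python
from itertools import product


def old_condition(x, iterate_collection, dict_safe):
    # Condition removed by the diff. `and` binds tighter than `or`.
    return (
        type(x) in (list, set, tuple)
        and iterate_collection
        or type(x) is dict
        and iterate_collection
        and dict_safe
    )


def new_condition(x, dict_safe):
    # Condition added by the diff.
    return type(x) in (list, set, tuple) or type(x) is dict and dict_safe


def conditions_agree():
    """Compare both predicates with iterate_collection fixed to True."""
    samples = [[1], {1}, (1,), {"a": 1}, 1, "text"]
    return all(
        bool(old_condition(x, True, dict_safe)) == bool(new_condition(x, dict_safe))
        for x, dict_safe in product(samples, (True, False))
    )
```

The behavior change shows up only when the flag used to be off: `old_condition((1,), False, True)` is false, while `new_condition((1,), True)` is true, so a tuple is now always iterated element by element.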

@mrocklin
Member

> Unfortunately it's not that simple, as these occur all over the place for reasons that are unclear (at least to me). So have focused instead on just preserving that information somehow. Agree it would be nice to make things consistent, but actually that may be a harder lift

What are these places? Looking at issues I'm seeing #4625 . Anything else?

It seems like if we try not to mutate things then we should be ok here. That seems like good practice anyway in the common case. I wouldn't mind avoiding fancy msgpack things and instead focusing on not mutating. That seems like it might be a long-term simpler solution.

@mrocklin
Member

Here is a possible solution to the issue mentioned above: #4653

Development

Successfully merging this pull request may close these issues.

Test failures with explicit comms

5 participants