[WIP] Distinguish `tuple`s & `list`s in MsgPack serialization #4575

jakirkham · 2021-03-09T23:27:13Z

- Fixes #3716 (Serialization with "msgpack" doesn't preserve lists) and rapidsai/dask-cuda#549 (Test failures with explicit comms)
- Tests added / passed
- Passes `black distributed` / `flake8 distributed`

MsgPack does not itself have a way to distinguish between a list and a tuple and will convert them all into the same thing. However, in Dask and 3rd party libraries we often rely on one or the other being used. So there is value in tracking whether a list or a tuple was present and unpacking it into the expected type.
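
A minimal illustration of the problem, assuming the `msgpack` Python package is installed: a tuple and a list with the same elements pack to identical bytes, so the container type seen on unpack depends only on the `use_list` flag.

```python
import msgpack

# A tuple and a list with the same elements produce identical MsgPack bytes.
packed_tuple = msgpack.packb((1, 2, 3))
packed_list = msgpack.packb([1, 2, 3])
same_bytes = packed_tuple == packed_list

# On unpack, the container type is decided by `use_list`,
# not by what was originally packed.
as_list = msgpack.unpackb(packed_tuple)  # use_list=True is the default
as_tuple = msgpack.unpackb(packed_list, use_list=False)

print(same_bytes, type(as_list).__name__, type(as_tuple).__name__)
```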

Here we solve this issue by enabling `strict_types`, which forces a `tuple` to go through `default` object encoding instead of being handled as a list. Within `default` encoding, we pack the tuple into a `dict` that notes it is a tuple and holds its values coerced into a `list`. Then in `object_hook` decoding we convert the contained `list` back into a `tuple`. This way we can ensure we get a `tuple` when expected. `list`s simply go through regular MsgPack handling.
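
The approach described above can be sketched roughly as follows. This is an illustration rather than the code in this PR: the `__tuple__` marker key and the helper names (`encode_default`, `decode_object_hook`, `dumps`, `loads`) are made up for the example, and it assumes the `msgpack` package is installed.

```python
import msgpack

TUPLE_TAG = "__tuple__"  # hypothetical marker key, chosen for this sketch


def encode_default(obj):
    # With strict_types=True, msgpack forwards tuples here instead of
    # packing them as arrays.
    if type(obj) is tuple:
        return {TUPLE_TAG: True, "values": list(obj)}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode_object_hook(obj):
    # Called for every decoded map; rebuild tagged tuples.
    if obj.get(TUPLE_TAG):
        return tuple(obj["values"])
    return obj


def dumps(msg):
    return msgpack.packb(msg, default=encode_default, strict_types=True)


def loads(data):
    # use_list=True so untagged arrays come back as lists.
    return msgpack.unpackb(data, object_hook=decode_object_hook, use_list=True)


original = {"key": ("a", 1), "args": [1, (2, [3, 4])], "plain": [5, 6]}
restored = loads(dumps(original))
print(restored == original)
```

Nested tuples are handled because the packer calls `default` again for each tuple it meets while packing the tagged `dict`, and `object_hook` runs on inner maps before outer ones.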

cc @madsbk @rjzamora @pentschev

As `tuple`s receive special handling that `list`s don't, decode to `list`s instead of to `tuple`s as any `list` that should go into a `tuple` will be converted to that.

This ensures that `tuple`s are handled by the `default` encoder.

madsbk · 2021-03-10T08:06:13Z

Thanks @jakirkham for working on this, but I am a little concerned about the performance implications. AFAIK, we have been using `strict_types=False` and `use_list=False` in order to minimize overhead. It might not have a significant impact, but we should do some benchmarking before merging.

jakirkham · 2021-03-10T18:05:02Z

strict_types just uses more fine-tuned checks like this. For this reason would expect strict_types=False to be slower than strict_types=True as the former checks for more things than the latter

use_list just gets used in a bunch of ternary expressions like this. So would expect the performance is similar in either case

madsbk · 2021-03-10T18:55:59Z

> strict_types just uses more fine-tuned checks like this. For this reason would expect strict_types=False to be slower than strict_types=True as the former checks for more things than the latter

True, but msgpack_encode_default() will be called for each tuple.

> use_list just gets used in a bunch of ternary expressions like this. So would expect the performance is similar in either case

The reason why msgpack defaults to use_list=False is because a tuple is a bit faster than a list to create.

But anyways, if the performance impact is reasonable, I think it is a great change :)
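
One rough way to measure the impact is a micro-benchmark like the sketch below. It assumes the `msgpack` package is installed; the tagging callbacks, the `__tuple__` key, and the synthetic message are made up for illustration and are not the Dask code paths. Timings will vary by machine, so no numbers are claimed here.

```python
import timeit

import msgpack


def _default(obj):
    # Hypothetical tuple tagging, used only to measure the callback cost.
    if type(obj) is tuple:
        return {"__tuple__": True, "values": list(obj)}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def _object_hook(obj):
    if obj.get("__tuple__"):
        return tuple(obj["values"])
    return obj


def roundtrip_plain(msg):
    # Tuples are packed as arrays and every array comes back as a tuple.
    return msgpack.unpackb(msgpack.packb(msg), use_list=False)


def roundtrip_tagged(msg):
    # Each tuple goes through the Python-level `default` callback.
    data = msgpack.packb(msg, default=_default, strict_types=True)
    return msgpack.unpackb(data, object_hook=_object_hook)


# Synthetic message: a list of many small tuples.
message = [("key-%d" % i, i) for i in range(1_000)]

plain_s = timeit.timeit(lambda: roundtrip_plain(message), number=20)
tagged_s = timeit.timeit(lambda: roundtrip_tagged(message), number=20)
print(f"plain: {plain_s:.4f}s  tagged: {tagged_s:.4f}s")
```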

jakirkham · 2021-03-10T19:16:57Z

If you know better ways to approach this with MsgPack, am open to suggestions 🙂

jakirkham · 2021-03-11T00:54:38Z

AFAICT the choice to use tuples has more to do with where they can be used ( #2000 ). When it comes to MsgPack defaults, it appears using a list is the default

jakirkham · 2021-03-23T21:11:56Z

cc @jrbourbeau (for awareness)

jrbourbeau · 2021-03-30T17:17:02Z

cc @rjzamora for visibility as it looks like this PR is removing iterate_collection which will impact #4641

jakirkham · 2021-03-30T18:07:22Z

Yep also mentioned here ( #3689 (comment) )

The reason that code was there IIUC was just to work around the fact that MsgPack can't distinguish between a list & a tuple, which this would fix, making that code obsolete

rjzamora · 2021-03-30T18:16:56Z

> The reason that code was there IIUC was just to work around the fact that MsgPack can't distinguish between a list & a tuple, which this would fix, making that code obsolete

I didn't look at this PR yet, but `iterate_collection` is just a flag that determines whether serialization should dive into the collection elements individually or serialize the entire collection at once (with something like pickle). The msgpack list/tuple concern was just a "special" case where we wanted to set `iterate_collection=True`, but we also want to iterate when the elements contain special objects (like GPU-serializable types) or `Serialized` elements.
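
As a toy illustration of what such a flag controls (this is not the actual `distributed.protocol.serialize` implementation; `serialize_sketch` and its header layout are hypothetical), the two code paths look roughly like this:

```python
import pickle


def serialize_sketch(x, iterate_collection=False):
    """Toy stand-in that returns a (header, frames) pair."""
    if type(x) in (list, tuple, set) and iterate_collection:
        # Dive in: serialize each element on its own.
        subs = [serialize_sketch(item, iterate_collection) for item in x]
        header = {
            "type": type(x).__name__,
            "sub-headers": [sub_header for sub_header, _ in subs],
        }
        frames = [frame for _, sub_frames in subs for frame in sub_frames]
        return header, frames
    # Otherwise treat the whole object as one opaque pickled blob.
    return {"type": "pickle"}, [pickle.dumps(x)]


data = [1, "two", (3.0, 4.0)]
whole_header, whole_frames = serialize_sketch(data, iterate_collection=False)
each_header, each_frames = serialize_sketch(data, iterate_collection=True)
print(len(whole_frames), len(each_frames))
```

Iterating yields one frame per leaf element, which is what lets special elements be detected individually, at the cost of more per-element work.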

jakirkham · 2021-03-30T18:36:19Z

Yeah, that makes sense. That said, this may already be somewhat unnecessary given the recent surgery that Mads did on serialization ( #4531 ). In particular, that change performs "dask" or "cuda" serialization as needed on objects it sees while doing MsgPack serialization. It also relaxes the requirement that everything in the message be serialized the same way. So there's less of a need to look ahead at the objects in a collection and make a decision up front, since that decision will be made per object as the collection is serialized

mrocklin · 2021-03-30T19:58:47Z

Quick comment, historically the solution to this was for the application code to never care about the distinction between tuples and lists. If there were a performance impact then I would suggest that we just change the application code that is conflicting here.

jakirkham · 2021-03-30T20:16:18Z

Unfortunately it's not that simple, as these occur all over the place for reasons that are unclear (at least to me). So have focused instead on just preserving that information somehow. Agree it would be nice to make things consistent, but that may actually be a harder lift

rjzamora · 2021-03-30T20:25:07Z

distributed/protocol/serialize.py

-    if (
-        type(x) in (list, set, tuple)
-        and iterate_collection
-        or type(x) is dict
-        and iterate_collection
-        and dict_safe
-    ):
+    if type(x) in (list, set, tuple) or type(x) is dict and dict_safe:


Note that this change makes #4641 unnecessary, because this PR is effectively setting iterate_collection=True in all cases. This is great for HLG serialization on the scheduler, because we need the serialization code path to detect existing Serialized objects within the elements of a serialized task.

With that said, I do want to make sure that we want this to be the default behavior for all objects. That is, do we want to be diving into every element when the tuple length is 1000+?
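
To make the simplified condition concrete, here is a small sketch (`should_iterate` is a hypothetical helper wrapping the new expression from the diff, not a function in the codebase). Because `and` binds tighter than `or`, the condition groups as `(type(x) in (list, set, tuple)) or (type(x) is dict and dict_safe)`:

```python
def should_iterate(x, dict_safe):
    # `and` binds tighter than `or`, so lists, sets and tuples always
    # iterate, while dicts iterate only when dict_safe is true.
    return type(x) in (list, set, tuple) or type(x) is dict and dict_safe


checks = {
    "list": should_iterate([1, 2], dict_safe=False),
    "long tuple": should_iterate(tuple(range(1000)), dict_safe=False),
    "unsafe dict": should_iterate({"a": 1}, dict_safe=False),
    "safe dict": should_iterate({"a": 1}, dict_safe=True),
    "other object": should_iterate(object(), dict_safe=True),
}
print(checks)
```

Under this condition a 1000-element tuple always takes the per-element path, regardless of its length.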

mrocklin · 2021-03-30T20:57:20Z

> Unfortunately it's not that simple, as these occur all over the place for reasons that are unclear (at least to me). So have focused instead on just preserving that information somehow. Agree it would be nice to make things consistent, but that may actually be a harder lift

What are these places? Looking at issues I'm seeing #4625 . Anything else?

It seems like if we try not to mutate things then we should be ok here. That seems like good practice anyway in the common case. I wouldn't mind avoiding fancy msgpack things and instead focusing on not mutating. That seems like it might be a long-term simpler solution.
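
As a hedged illustration of what "not mutating" could look like in application code (the function names below are hypothetical and not taken from the codebase), a helper that appends in place only works when the message field arrives as a list, while one that builds a new sequence works for either type:

```python
def add_key_mutating(keys, new_key):
    # Breaks if `keys` arrived as a tuple after a MsgPack round trip.
    keys.append(new_key)
    return keys


def add_key(keys, new_key):
    # Accepts a list or a tuple and never modifies the argument.
    return [*keys, new_key]


from_list = add_key(["a", "b"], "c")
from_tuple = add_key(("a", "b"), "c")

try:
    add_key_mutating(("a", "b"), "c")
    mutating_failed = False
except AttributeError:
    # Tuples have no `append` method.
    mutating_failed = True

print(from_list == from_tuple, mutating_failed)
```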

mrocklin · 2021-03-30T21:22:43Z

Here is a possible solution to the issue mentioned above: #4653

This was referenced Mar 9, 2021

Serialization with "msgpack" doesn't preserve lists #3716

Open

Test failures with explicit comms rapidsai/dask-cuda#549

Closed

jakirkham force-pushed the ser_list_msgpack branch from 6272ed6 to 7f90128 Compare March 9, 2021 23:43

jakirkham added 6 commits March 9, 2021 15:52

Rename as-list to values

e78b4e3

Use lists when decoding with MsgPack

728ac75

As `tuple`s receive special handling that `list`s don't, decode to `list`s instead of to `tuple`s as any `list` that should go into a `tuple` will be converted to that.

Use strict type encoding with MsgPack

50f3da4

This ensures that `tuple`s are handled by the `default` encoder.

Serialize tuples specially

0626182

Test serializing tuple & list

9cc4d79

Drop old list MsgPack workaround

3050ab8

jakirkham force-pushed the ser_list_msgpack branch 2 times, most recently from 0e6eb6c to b39c832 Compare March 10, 2021 00:00

jakirkham mentioned this pull request Mar 10, 2021

Dask-serialize dicts longer than five elements #3689

Merged

Fix-up tests

41025d8

jakirkham force-pushed the ser_list_msgpack branch from b39c832 to 41025d8 Compare March 10, 2021 00:04

jakirkham marked this pull request as draft March 10, 2021 00:54

jakirkham changed the title ~~Distinguish tuples & lists in MsgPack serialization~~ [WIP] Distinguish tuples & lists in MsgPack serialization Mar 10, 2021

rjzamora reviewed Mar 30, 2021

jrbourbeau mentioned this pull request Mar 30, 2021

Handle Blockwise HLG pack/unpack for concatenate=True dask/dask#7455

Merged

3 tasks

rjzamora mentioned this pull request Apr 5, 2021

[Discussion] Serialize objects within tasks #4673

Open

madsbk mentioned this pull request Jun 9, 2021

[WIP] Fine grained serialization #4897

Closed

7 tasks

madsbk mentioned this pull request Jun 17, 2021

[REVIEW] Formalization of Computation #4923

Closed

9 tasks
