Serialize some lists using Dispatch #7029
Conversation
…contains a uniform, hashable datatype. Otherwise defaults to recursively serializing with pickle
Can one of the admins verify this patch? Admins can comment
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0   15 suites ±0   6h 20m 2s ⏱️ +4m 47s
For more details on these failures, see this check. Results for commit 886108d. ± Comparison against base commit 1fd07f0.
♻️ This comment has been updated with latest results.
cc: @madsbk -- Wondering if you would mind taking a look at this PR. Also interested in your thoughts on dropping the requirement to use pickle with protocol=4. xref: rapidsai/dask-cuda#746
cc: @ian-r-rose
madsbk left a comment
Thanks @hayesgb, overall I think this is a good idea, but we have to make sure that we are not serializing anything needed by the scheduler, such as task graph keys.
distributed/protocol/serialize.py
Outdated
if is_dask_collection(first_val) or typename(type(first_val)) not in [
    "str",
    "int",
    "float",
]:
Suggested change:
- if is_dask_collection(first_val) or typename(type(first_val)) not in [
-     "str",
-     "int",
-     "float",
- ]:
+ if is_dask_collection(first_val) or isinstance(first_val, (str, int, float)):
Trying to use isinstance(first_val, (str, int, float)) results in test failures in shuffle.
Additionally, the existing check seems more consistent with the convention of using type(x), which is the preferred approach in serialize.py. I could see replacing it with type(x) in [int, float] if that makes more sense...
Thoughts?
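For context, a small illustration (mine, not from the PR) of how the two checks can disagree for subclasses of the builtin scalar types; typename here is assumed to be dask.utils.typename:

# Illustration only: typename(type(x)) compares the exact class name, while
# isinstance() also accepts subclasses such as bool (subclass of int) and
# numpy.float64 (subclass of float).
import numpy as np
from dask.utils import typename

for val in [True, np.float64(1.0)]:
    by_name = typename(type(val)) in ("str", "int", "float")
    by_isinstance = isinstance(val, (str, int, float))
    print(type(val).__name__, by_name, by_isinstance)

# Expected output:
#   bool False True
#   float64 False True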
This makes me a bit nervous.
AFAIK, the use of type(x) is only here to avoid sub-classes of list, set, tuple, and dict being converted to their base class by msgpack. By disabling iterate_collection and using Dispatch on the whole collection, we make sure to preserve the sub-classes.
Do you have a reproducer for the shuffle failure?
PS: I think this is another good reason why we need to re-design and simplify the serialization in Dask.
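To make the subclass concern concrete, a tiny example (mine, not from the PR) of how an exact type check distinguishes a list subclass where isinstance() would not:

# Illustration only: an exact type check detects container subclasses, which
# element-wise conversion (e.g. through msgpack) would flatten back to the
# base class on the receiving side.
class TaggedList(list):
    pass

x = TaggedList([1, 2, 3])

print(type(x) is list)      # False -> subclass, keep the whole object intact
print(isinstance(x, list))  # True  -> would be treated like a plain list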
and iterate_collection is True
or type(x) is dict
and iterate_collection
and iterate_collection is True
Any reason for this semantically?
Nothing other than readability.
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
…reflect the fact that type(x.data) is being checked, not type(x)
…distributed into dispatch_some_lists
@madsbk -- Can we drop the requirement to use pickle with protocol=4?
jrbourbeau left a comment
cc @rjzamora @jakirkham if either of you get a chance to look at this. FYI for others, NVIDIA folks are off Sep 15-16 for a company-wide holiday, so pings may not be seen until next week.
Yes
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
Closes #6368: Future submit()/result() takes very long time to complete with 8MB Python object
Related to #6940: Don't force recursing into python collections when serializing objects.
[x] Tests added / passed
[x] Passes pre-commit run --all-files

Using the reproducer from @adbreind in #6368 as a starting point:
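The actual reproducer lives in #6368; the sketch below is only my approximation of that kind of timing harness (cluster sizing, list contents, and the pure=False flag are my assumptions):

# Hypothetical timing harness, not the exact #6368 reproducer: submit a task
# whose argument is a large plain-Python list and time the round trip, which
# exercises serialization of the list on every submit (pure=False avoids key
# reuse between runs).
import time
from distributed import Client

if __name__ == "__main__":
    with Client(n_workers=1, threads_per_worker=1) as client:
        big_list = [float(i) for i in range(1_000_000)]  # ~8 MB when pickled

        runs = []
        for _ in range(5):
            start = time.time()
            client.submit(len, big_list, pure=False).result()
            runs.append(time.time() - start)

        print(
            f"Run time (in seconds) for 5 runs is: {runs}, "
            f"and mean runtime: {sum(runs) / len(runs)} seconds"
        )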
On current main, I get:
Run time (in seconds) for 5 runs is: [13.804176807403564, 13.784174680709839, 13.835507869720459, 13.706598997116089, 13.749552011489868], and mean runtime: 13.776002073287964 seconds

While on this branch, I get:
Run time (in seconds) for 5 runs is: [0.15462398529052734, 0.1298370361328125, 0.1314990520477295, 0.13030385971069336, 0.12935090065002441], and mean runtime: 0.13512296676635743 seconds

What is happening?
When serializing collections, we prefer to use pickle and recurse into the collections, serializing each object in the collection separately. This decision was motivated by Blockwise-IO work as described by @rjzamora. While it makes sense, it also has the unfortunate consequence of making it expensive to serialize collections in general.
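For reference, a minimal sketch of the entry point being discussed, assuming the serialize()/deserialize() helpers in distributed.protocol and the iterate_collection flag touched by this PR:

# Sketch of the code path under discussion: serialize() can walk into a
# collection and serialize each element separately (iterate_collection=True),
# which is the behaviour that becomes expensive for large plain-Python lists.
from distributed.protocol import serialize, deserialize

header, frames = serialize([1, 2, 3], iterate_collection=True)
assert deserialize(header, frames) == [1, 2, 3]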
Here we create a Dispatch() method for lists that converts a list to a numpy array, which can then be serialized. We add infer_if_recurse_to_serialize_list. Now that lists can be serialized recursively using pickle, or with dask_serialize, we offload the decision about whether to iterate_collection to infer_if_recurse_to_serialize_list.

We also need to handle the case where a Serialize object must itself be serialized. To handle this, we add an iterate_collection attribute.
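The PR's actual helpers (infer_if_recurse_to_serialize_list and its list handler) are not reproduced here; the following is only a rough sketch of the general Dispatch-based idea using the public dask_serialize/dask_deserialize registries, with the handler names and the uniform-numeric assumption being mine:

# Rough sketch of the idea, not the PR's implementation: register a handler
# for lists on Dask's serialization Dispatch that packs a uniform numeric
# list into a single NumPy buffer instead of pickling each element.
import numpy as np
from distributed.protocol import dask_serialize, dask_deserialize


@dask_serialize.register(list)
def serialize_uniform_list(x):
    # Assumes the list is non-empty and holds one plain numeric type; the PR
    # falls back to recursive serialization when that is not the case.
    arr = np.asarray(x)
    header = {"dtype": arr.dtype.str, "shape": arr.shape}
    frames = [arr.tobytes()]
    return header, frames


@dask_deserialize.register(list)
def deserialize_uniform_list(header, frames):
    arr = np.frombuffer(frames[0], dtype=header["dtype"]).reshape(header["shape"])
    return arr.tolist()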