Serialize some lists using Dispatch #7029

hayesgb · 2022-09-10T00:22:04Z

Closes Future submit()/result() takes very long time to complete with 8MB Python object #6368
Related to: Don't force recursing into python collections when serializing objects. #6940
Tests added / passed
[X ] Passes pre-commit run --all-files

Using the reproducer from @adbreind in #6368 as a starting point:

from distributed import Client, LocalCluster
from time import time
import numpy as np


def run_test():
    runtime = []
    def foo():
        return [1.5] * 1_000_000

    # with LocalCluster(n_workers=2, threads_per_worker=1, memory_limit='8GiB') as cluster:
    for i in range(5):
        with Client() as client:
            s = time()
            res = client.submit(foo).result()
            runtime.append(time() - s)
    print(f"Run time (in seconds) for 5 runs is: {runtime}, and mean runtime:  {np.mean(runtime)} seconds")

if __name__ == "__main__":
    run_test()

On current main, I get:
Run time (in seconds) for 5 runs is: [13.804176807403564, 13.784174680709839, 13.835507869720459, 13.706598997116089, 13.749552011489868], and mean runtime: 13.776002073287964 seconds

While on this branch, I get:
Run time (in seconds) for 5 runs is: [0.15462398529052734, 0.1298370361328125, 0.1314990520477295, 0.13030385971069336, 0.12935090065002441], and mean runtime: 0.13512296676635743 seconds

What is happening?
When serializing collections, we prefer to use pickle and recurse into the collections, serializing each object in the collection separately. This decision was motivated by Blockwise-IO work as described by @rjzamora. While it makes sense, it also has the unfortunate consequence of making it expensive to serialize collections in general.
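As a rough illustration of that cost, the sketch below compares pickling a list of floats in one call with pickling each element on its own. It uses only the standard library and does not go through distributed's actual serialization path; `compare_pickle_cost` is a hypothetical helper written for this comparison.

```python
import pickle
from time import perf_counter


def compare_pickle_cost(n=100_000):
    """Compare pickling a list at once with pickling each element separately."""
    data = [1.5] * n

    # One pickle call for the whole list.
    start = perf_counter()
    whole = pickle.dumps(data)
    whole_seconds = perf_counter() - start

    # One pickle call per element, similar in spirit to recursing into the list.
    start = perf_counter()
    pieces = [pickle.dumps(item) for item in data]
    pieces_seconds = perf_counter() - start

    return {
        "whole_bytes": len(whole),
        "pieces_bytes": sum(len(piece) for piece in pieces),
        "whole_seconds": whole_seconds,
        "pieces_seconds": pieces_seconds,
    }


if __name__ == "__main__":
    print(compare_pickle_cost())
```

Each per-element pickle carries its own protocol header, so the per-element total is larger in bytes as well as slower to produce. The real overhead in distributed also includes per-object headers and frames, which this sketch does not model.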

Here we create a Dispatch() method for lists that converts a list to a numpy array, which can then be serialized. We also add infer_if_recurse_to_serialize_list. Since lists can now be serialized either recursively with pickle or with dask_serialize, we offload the decision about whether to iterate_collection to infer_if_recurse_to_serialize_list.

We also need to handle the case where a Serialize object must itself be serialized. To handle this, we add an iterate_collection attribute.
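The kind of decision described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the code added in this PR: the name `should_recurse_into_list`, the `array_types` parameter, and the choice to inspect every element rather than only the first are all hypothetical.

```python
def should_recurse_into_list(values, array_types=(int, float)):
    """Return True when `values` should be serialized element by element.

    Hypothetical sketch: a plain, non-empty list whose items all share one
    exact type from `array_types` could be handled as a single array, so
    recursion is skipped (False). Anything else falls back to recursive
    serialization (True).
    """
    # Only exact lists qualify; subclasses and other containers recurse.
    if type(values) is not list or not values:
        return True
    first_type = type(values[0])
    if first_type not in array_types:
        return True
    # Exact type comparison: bool is a subclass of int but is not accepted.
    return not all(type(item) is first_type for item in values)


if __name__ == "__main__":
    print(should_recurse_into_list([1.5] * 10))
    print(should_recurse_into_list([1, "a", 2.0]))
```

Checking every element is linear in the list length, which is still far cheaper than serializing each element separately, but an implementation could reasonably inspect fewer elements.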

GPUtester · 2022-09-10T00:22:08Z

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

github-actions · 2022-09-10T01:08:45Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0, 15 suites ±0, duration 6h 20m 2s (+4m 47s)
3 117 tests (+16): 3 031 passed (+17), 84 skipped (-1), 2 failed (±0)
23 066 runs (+112): 22 163 passed (+115), 899 skipped (-5), 4 failed (+2)

For more details on these failures, see this check.

Results for commit 886108d. ± Comparison against base commit 1fd07f0.

♻️ This comment has been updated with latest results.

hayesgb · 2022-09-12T01:15:01Z

cc: @madsbk -- Wondering if you would mind taking a look at this PR. Also interested in your thoughts on dropping the requirement to use pickle with protocol=4. xref : rapidsai/dask-cuda#746

hayesgb · 2022-09-12T01:54:48Z

cc: @ian-r-rose

distributed/protocol/serialize.py

madsbk

Thanks @hayesgb, overall I think this is a good idea, but we have to make sure that we are not serializing anything needed by the scheduler, such as task graph keys.

distributed/protocol/serialize.py

madsbk · 2022-09-14T07:15:42Z

distributed/protocol/serialize.py

+        if is_dask_collection(first_val) or typename(type(first_val)) not in [
+            "str",
+            "int",
+            "float",
+        ]:


Suggested change

-        if is_dask_collection(first_val) or typename(type(first_val)) not in [
-            "str",
-            "int",
-            "float",
-        ]:
+        if is_dask_collection(first_val) or isinstance(first_val, (str, int, float)):

Trying to use isinstance(first_val, (str, int, float)) causes test failures in shuffle.

Additionally, this seems more consistent with the convention of using type(x), which is the preferred approach in serialize.py. I could see replacing with type(x) in [int, float] if that makes more sense...

Thoughts?

This makes me a bit nervous.
AFAIK, the use of type(x) is only here to prevent sub-classes of list, set, tuple, and dict from being converted to their base class by msgpack. By disabling iterate_collection and using Dispatch on the whole collection, we make sure to preserve the sub-classes.
Do you have a reproducer for the shuffle failing?

PS: I think this is another good reason why we need to re-design and simplify the serialization in Dask.
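For readers following this thread, the general difference between an exact type check and isinstance can be shown with a few lines of plain Python. This is only an illustration of the language behavior under discussion, not a reproducer for the shuffle failures; `TaggedList` and `compare_checks` are hypothetical names.

```python
class TaggedList(list):
    """List subclass used only for this illustration."""


def compare_checks(value, base):
    """Return (exact type match, isinstance match) for `value` against `base`."""
    return type(value) is base, isinstance(value, base)


if __name__ == "__main__":
    # A plain list matches both checks.
    print(compare_checks([1, 2], list))
    # A subclass matches isinstance only.
    print(compare_checks(TaggedList([1, 2]), list))
    # bool is a subclass of int, so isinstance(True, int) holds.
    print(compare_checks(True, int))
```

The exact check treats subclasses and bool as distinct types, while isinstance accepts them, so swapping one for the other changes which values take the array path.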

distributed/protocol/serialize.py

madsbk · 2022-09-14T07:48:00Z

distributed/protocol/serialize.py

+        and iterate_collection is True
        or type(x) is dict
-        and iterate_collection
+        and iterate_collection is True


Any reason for this semantically?

Nothing other than readability.
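As a side note on the language semantics involved, `x is True` and a plain truthiness test agree for True, False, and None but differ for other truthy values, and `and` binds more tightly than `or`. The sketch below assumes a condition shaped like the diff fragment above; the surrounding `type(x) is list` clause and both function names are assumptions made for illustration.

```python
def truthy_check(x, iterate_collection):
    """Condition using plain truthiness, grouped the way Python parses it."""
    return bool(
        (type(x) is list and iterate_collection)
        or (type(x) is dict and iterate_collection)
    )


def identity_check(x, iterate_collection):
    """Same condition, but requiring iterate_collection to be exactly True."""
    return (type(x) is list and iterate_collection is True) or (
        type(x) is dict and iterate_collection is True
    )


if __name__ == "__main__":
    for flag in (True, False, None, 1):
        print(flag, truthy_check([], flag), identity_check([], flag))
```

The two versions only diverge when the flag is a truthy value other than True, such as 1, so for a flag that is always True, False, or None the change is indeed a matter of readability.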

hayesgb · 2022-09-14T18:46:15Z

@madsbk -- Can we drop the requirement to use protocol=4 now that distributed requires python>=3.8?

jrbourbeau

cc @rjzamora @jakirkham if either of you get a chance to look at this. FYI for others, NVIDIA folks are off Sep 15-16 for a company-wide holiday, so it may be next week before pings are seen.

madsbk · 2022-09-15T06:50:59Z

> @madsbk -- Can we drop the requirement to use protocol=4 now that distributed requires python>=3.8?

Yes

hayesgb added 6 commits September 9, 2022 11:40

Adds a dispatch method for lists, that only gets used when the list c…

882723e

…ontains a uniform, hashable datatypes. Otherwise defaults to recursively serializing with pickle

Clean up

7de79bd

Added work on shuffle serialization

7b683c8

Handles strings contain tuples in lists

212fe08

Handle serialized strings from shuffling

8cdf818

Cleaned up function name

71b4335

hayesgb marked this pull request as draft September 10, 2022 00:24

hayesgb changed the title from "[WIP] Dispatch some lists" to "Serialize some lists using Dispatch" Sep 10, 2022

hayesgb added 4 commits September 9, 2022 21:13

Added tests for inferring list status

70800d9

linting

dce4a56

Linting

87934fc

Patch test_pickle_safe

2f357fb

hayesgb added 2 commits September 11, 2022 20:47

Update iterate_collection attribute for Serialize object

fe867f5

Linting

2dc93a7

Remove test files

948550f

fjetter reviewed Sep 12, 2022

hayesgb added 6 commits September 12, 2022 12:44

Removed unneeded set call

040fcbc

Linting

eeda4ac

Fix docs. Clean and add additional test for infer_if_serialize_list

44afedd

Linting

e263079

Revert to protocol=4 for dask-cuda

8c937e6

Linting

aec1b8d

madsbk suggested changes Sep 14, 2022

hayesgb added 3 commits September 14, 2022 09:08

Merge branch 'main' into dispatch_some_lists

e870fc2

Addresses @madsbk suggestion to remove try:...except: block

0c56f15

Linting and drop strings from dtypes that can be serialized using array

4a07152

hayesgb and others added 4 commits September 14, 2022 13:19

Update distributed/protocol/serialize.py

9885c6c

Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>

Updating with suggestions from madsk

b64428b

Accepted suggestion to handle iterate_collection, but adjusted to ref…

91d46a2

…lect fact that type(x.data) is being checked, not type(x)

Merge branch 'dispatch_some_lists' of https://github.com/hayesgb/dist…

a31647d

…ributed into dispatch_some_lists

hayesgb marked this pull request as ready for review September 14, 2022 18:46

jrbourbeau reviewed Sep 14, 2022

madsbk reviewed Sep 15, 2022

hayesgb and others added 2 commits September 15, 2022 05:54

Update distributed/protocol/serialize.py

eb8ad89

Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>

Drop protocol 4 requirement

886108d
