Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Feb 21, 2023

This is a requirement for HLG serialization via pickle, see #7564

To actually allow for graphs to be shipped via pickle we actually need to implement a custom Pickler class since otherwise user arguments to graphs may end up being not serializable. One example are H5Py objects which intentionally disallow pickling. All of these cases are already handled by our dask_(de)serialize dispatch so the new pickler simply uses the dask serializer if one is available and falls back to ordinary pickle instead.

Comment on lines +26 to +30
try:
serialize = dask_serialize.dispatch(type(obj))
deserialize = dask_deserialize.dispatch(type(obj))
return deserialize, serialize(obj)
except TypeError:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we should be dealing with other families, e.g. cuda as well here, don't we? any suggestions on how to do this in an elegant way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how widespread the use of families truly is. Open to suggestions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible way of dealing with this would be a custom dispatcher that controls this specific behavior and isn't entangled with the other serialization logic. Thoughts?

@fjetter
Copy link
Member Author

fjetter commented Feb 21, 2023

in the existing test suite there are only few types that are actually not pickle-able. For instance, h5py forbids pickle by default. most other objects are pickle-able. This raises the question if this Pickler object shouldn't attempt to ordinarily pickle the data first before falling back to the dask_serializer

@github-actions
Copy link
Contributor

github-actions bot commented Feb 21, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       24 files  ±  0         24 suites  ±0   10h 19m 47s ⏱️ + 3m 43s
  3 340 tests +  2    3 239 ✔️ +  3     100 💤 ±0  1  - 1 
39 379 runs  +24  37 553 ✔️ +26  1 825 💤  - 1  1  - 1 

For more details on these failures, see this check.

Results for commit 354308b. ± Comparison against base commit 9e8876d.

♻️ This comment has been updated with latest results.

@mrocklin
Copy link
Member

In general I think that this is fine. Pickle should be fine in almost all cases and the serialization families idea was premature (pretty much everything is in the dask family). Honestly, the only real cases I can think of where this is necessary are the cases that you mention, like h5py and netcdf files (although I expect that netcdf will be exclusive to Xarray users, and Xarray has its own solution to this).

I think that we can be pretty lax here.

@fjetter
Copy link
Member Author

fjetter commented Feb 22, 2023

I chose to go for the "try standard pickle first" approach such that we only use the dispatching if pickle doesn't work, i.e. in most cases we just use plain pickle.

I'll move forward with this once CI is green-ish. If there are some issues with this we can follow up later, I think

@fjetter fjetter merged commit 41fdb91 into dask:main Feb 23, 2023
@fjetter fjetter deleted the allow_pickle_fallback_to_dask_serialize branch February 23, 2023 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants