Question: would the Dask/Distributed community be interested in an improved memory-spilling model that fixes the shortcomings of the current one but makes use of proxy-object wrappers?
In Dask-CUDA we have introduced a new approach to memory spilling that handles object aliasing and JIT memory un-spilling: rapidsai/dask-cuda#451
The result is memory spilling that:
- Avoids double counting: Double Counting and Issues w/Spilling #4186
- Avoids spilling of the same object multiple times.
- Avoids memory spikes because of incorrect memory tally.
- Implements just-in-time un-spilling: Delay deserialization of Data in workers until actual usage. #3998
- Supports communication of spilled data, so that GPU data doesn't have to be un-spilled just to be spilled again as part of communication: [FEA] Allow communicating spilled data rapidsai/dask-cuda#342
- A first step toward partially spilled objects, such as spilling individual columns of a data frame: Ability to have output dask_cudf.DataFrame not necessarily all in memory BlazingDB/blazingsql#1128
The current implementation in Dask-CUDA handles CUDA device objects, but it is possible to generalize the approach to also handle spilling to disk.
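For illustration, here is a minimal hypothetical sketch of the core idea (not the actual Dask-CUDA implementation; all names are made up): a proxy that spills its wrapped object to disk and un-spills it just in time, on first attribute access:

```python
import os
import pickle
import tempfile


class DiskProxy:
    """Hypothetical sketch: wrap an object, spill it to disk on demand,
    and un-spill it lazily on first attribute access (JIT un-spilling)."""

    def __init__(self, obj):
        self._obj = obj
        self._path = None

    def spill(self):
        """Serialize the wrapped object to a temporary file and drop
        the in-memory reference."""
        if self._obj is not None:
            fd, self._path = tempfile.mkstemp()
            with os.fdopen(fd, "wb") as f:
                pickle.dump(self._obj, f)
            self._obj = None  # free the in-memory copy

    def _unspill(self):
        """Load the object back from disk if it is currently spilled."""
        if self._obj is None:
            with open(self._path, "rb") as f:
                self._obj = pickle.load(f)
            os.remove(self._path)
            self._path = None
        return self._obj

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself,
        # i.e. anything belonging to the wrapped object - this is the
        # point where un-spilling happens just in time.
        return getattr(self._unspill(), name)
```

Because the worker only ever holds the proxy, the same wrapped object is counted (and spilled) once, no matter how many task keys alias it.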
The disadvantage of this approach is that the proxy objects get exposed to the users. The inputs to a task might be wrapped in a proxy object, which doesn't mimic the proxied object perfectly. E.g.:
```python
# Type checking using isinstance() works as expected, but direct type checking doesn't:
>>> import numpy as np
>>> from dask_cuda.proxy_object import asproxy
>>> x = np.arange(3)
>>> isinstance(asproxy(x), type(x))
True
>>> type(asproxy(x)) is type(x)
False
```

Because of this, the approach shouldn't be enabled by default. But do you think that the Dask community would be interested in a generalization of this approach? Or is the proxy-object hurdle too much of an issue?
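For readers curious how this behavior arises, here is a hedged, self-contained sketch (hypothetical class name, not the Dask-CUDA code): exposing the wrapped object's class via a `__class__` property makes `isinstance()` succeed, while `type()` still reveals the proxy:

```python
class TransparentProxy:
    """Hypothetical sketch of proxy transparency: isinstance() consults
    __class__, so reporting the wrapped object's class makes isinstance
    checks pass, but type() bypasses __class__ and exposes the proxy."""

    def __init__(self, obj):
        # Bypass our own attribute machinery while setting up.
        object.__setattr__(self, "_obj", obj)

    @property
    def __class__(self):
        # Report the wrapped object's class for isinstance() checks.
        return type(self._obj)

    def __getattr__(self, name):
        # Forward everything else to the wrapped object.
        return getattr(self._obj, name)
```

This is the fundamental limit of the technique: any code doing exact `type(...) is ...` comparisons sees the proxy, which is why opt-in rather than default behavior seems safer.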
cc. @mrocklin, @jrbourbeau, @quasiben