When dask uses partd for e.g. shuffle operations, the dataframes always come back as a plain pandas.DataFrame, even if a subclass was stored (xref geopandas/dask-geopandas#59 (comment)).
For example:
import geopandas
import partd

gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

# dask.dataframe shuffle operations use PandasBlocks
p = partd.PandasBlocks(partd.Dict())
p.append({"gdf": gdf})
res = p.get("gdf")
>>> type(gdf)
geopandas.geodataframe.GeoDataFrame
>>> type(res)
pandas.core.frame.DataFrame
To be able to use dask's shuffle operations with dask_geopandas, which uses a pandas subclass as the partition type, the subclass should be preserved in the partd roundtrip (or are there other ways that you can override / dispatch this operation in dask?).
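To illustrate where the subclass gets lost, here is a minimal sketch (not using partd itself) with a hypothetical `TaggedFrame` subclass standing in for `GeoDataFrame`: pickling a frame round-trips the subclass intact, whereas rebuilding a frame from its raw values, which is roughly what a block-level serializer does on deserialization, produces a plain `pandas.DataFrame`.

```python
import pickle
import pandas as pd


# Hypothetical minimal pandas subclass, standing in for geopandas.GeoDataFrame.
class TaggedFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Keep operations returning TaggedFrame rather than DataFrame.
        return TaggedFrame


tf = TaggedFrame({"a": [1, 2, 3]})

# pickle serializes the full object, so the subclass survives the roundtrip.
restored = pickle.loads(pickle.dumps(tf))
assert type(restored) is TaggedFrame

# Reconstructing from raw values (as a block-based serializer does) loses
# the subclass: the result is a plain pandas.DataFrame.
rebuilt = pd.DataFrame(tf.values, columns=tf.columns)
assert type(rebuilt) is pd.DataFrame
```

This suggests that either partd would need to store enough metadata to call the right constructor on the way back, or the reconstruction step would need to be dispatchable per partition type.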
I was wondering how other dask.dataframe subclasses handle this, but e.g. dask_cudf doesn't seem to support "disk"-based shuffling.