Closed
Description
Describe the issue:
An outer join involving an int64 column keeps the column as int64, even though the join may have introduced NaNs.
Minimal Complete Verifiable Example:
from dask.distributed import Client
from dask.distributed import LocalCluster
import dask.dataframe as dd
cluster = LocalCluster()
client = Client(cluster)
client.cluster.scale(1)
dask_df = dd.from_dict(
    {
        "a": [1, 2],
    },
    npartitions=1,
)
dask_df2 = dd.from_dict(
    {
        "b": [1],
    },
    npartitions=1,
)
print(dask_df2.b.dtype) # int64
df = dask_df.join(dask_df2, how="left")
print(df.b.dtype) # int64, this would previously be a float
df.shuffle(on="a").compute()  # causes IntCastingNaNError
Anything else we need to know?:
In a previous version (dask 2022.5.2; I don't know in which version the change first occurred), this same code produced a float column and therefore did not raise an error.
Is this expected behaviour?
The problem can of course be avoided by first manually casting b to "Int64" (the pandas nullable integer dtype) or to "float".
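For reference, here is a minimal sketch of that workaround in plain pandas rather than dask, so it only illustrates the dtype semantics I would expect. The frame names (`left`, `right`) and result names are made up for this illustration, and I am assuming that calling `astype` on the dask DataFrame before the join has the same effect.

```python
import pandas as pd

# Illustrative frames mirroring the example above (names are hypothetical).
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [1]})

# Plain int64: pandas upcasts b to float64 so it can hold the missing value.
joined_default = left.join(right, how="left")

# Workaround 1: cast to the nullable "Int64" dtype before joining.
# The missing row becomes pd.NA and the column stays an integer type.
joined_nullable = left.join(right.astype({"b": "Int64"}), how="left")

# Workaround 2: cast to float before joining, so NaN is representable.
joined_float = left.join(right.astype({"b": "float64"}), how="left")

print(joined_default["b"].dtype)
print(joined_nullable["b"].dtype)
print(joined_float["b"].dtype)
```

In both workarounds the column dtype can represent the missing value before the join happens, so no later cast back to int64 has to deal with NaN.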
Environment:
- Dask version:
  - dask 2023.9.1
  - dask-bigquery 2023.5.1
  - dask-glm 0.2.0
  - dask-kubernetes 2023.9.0
  - dask_labextension 7.0.0
  - dask-ml 2023.3.24
- Python version: Python 3.10.11
- Operating System: Linux
- Install method: conda