IntCastingNaNError after outer join with int64 column #8183

@NickSchouten

Describe the issue:
An outer join involving an int64 column keeps the column as int64, even though the join may have introduced NaNs.

Minimal Complete Verifiable Example:

from dask.distributed import Client
from dask.distributed import LocalCluster
import dask.dataframe as dd

cluster = LocalCluster()
client = Client(cluster)
client.cluster.scale(1)


dask_df = dd.from_dict(
    {
        "a": [1, 2],
    },
    npartitions=1,
)
dask_df2 = dd.from_dict(
    {
        "b": [1],
    },
    npartitions=1,
)
print(dask_df2.b.dtype) # int64
df = dask_df.join(dask_df2, how="left")
print(df.b.dtype) # int64, this would previously be a float
df.shuffle(on="a").compute() # causes IntCastingNaNError

Anything else we need to know?:

In an earlier version (dask 2022.5.2; I don't know in which version the change first occurred), this same code produced a float column and therefore did not raise an error.
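
For comparison, here is a minimal pandas-only sketch of the same join, using pandas frames as a stand-in for a single partition. It shows that pandas promotes the column to float64 once the join introduces a NaN, and that forcing the result back to int64 raises the same error class. That the dask failure comes from such a cast to the stale int64 dtype is an assumption, not something confirmed here.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [1]})

# Row 1 of `left` has no match in `right`, so the join introduces a NaN
# and pandas promotes "b" from int64 to float64.
joined = left.join(right, how="left")
print(joined["b"].dtype)  # float64

# Casting the NaN-containing column back to int64 fails.
# Assumption: something similar happens when a partition is coerced
# to the int64 dtype that dask still reports for "b".
cast_error = None
try:
    joined["b"].astype("int64")
except pd.errors.IntCastingNaNError as exc:
    cast_error = exc
    print(type(exc).__name__)
```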

Is this expected behaviour?

The problem can of course be avoided by manually casting b to "Int64" (pandas nullable) or "float" before the join.
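
A minimal sketch of what the two casts do to the joined column, again on plain pandas frames rather than dask dataframes. The assumption is that calling `astype` with the same dtype on `dask_df2` before the join behaves the same way; that is not verified against the dask versions listed below.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [1]})

# Option 1: pandas nullable integer. The missing row becomes <NA>
# and the column stays an integer type.
nullable = left.join(right.astype({"b": "Int64"}), how="left")
print(nullable["b"].dtype)  # Int64

# Option 2: plain float. The missing row becomes NaN, so the dtype
# is the same before and after the join.
as_float = left.join(right.astype({"b": "float64"}), how="left")
print(as_float["b"].dtype)  # float64
```

With either cast the dtype is the same before and after the join, so there is no int64 dtype left for a NaN-containing column to be coerced to.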

Environment:

  • Dask version:
    • dask 2023.9.1
    • dask-bigquery 2023.5.1
    • dask-glm 0.2.0
    • dask-kubernetes 2023.9.0
    • dask_labextension 7.0.0
    • dask-ml 2023.3.24
  • Python version: Python 3.10.11
  • Operating System: Linux
  • Install method: conda
