Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Feb 22, 2024

Closes #8519

The KeyError was used for control flow but I believe one should use dedicated exception types for this use case and keep the std exceptions clear for error reporting. This logic swallowed a very obvious error.

I'll see if I can reproduce #8519 even with artificial input somehow. Even if I don't have a reproducer I believe this change is sensible.

Comment on lines +2761 to +2771
ddf = dd.from_delayed(
[data_gen()] * 2, meta=[("a", int), ("b", int)], verify_meta=False
)

with raises_with_cause(
RuntimeError,
r"shuffling \w* failed",
ValueError,
"meta",
):
await c.gather(c.compute(ddf.shuffle(on="a")))
Copy link
Member Author

@fjetter fjetter Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On main this indeed causes data loss and is returning an empty frame. While the from_delayed path above might be slightly artificial I don't think it's that difficult to produce a slight meta drift in general.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_csv is an example where this could happen in real world use cases

@fjetter fjetter changed the title Use dedicated exception class if data is not in buffer Avoid data loss of meta provided to P2P shuffle is wrong Feb 22, 2024
@fjetter fjetter changed the title Avoid data loss of meta provided to P2P shuffle is wrong Raise and avoid data loss of meta provided to P2P shuffle is wrong Feb 22, 2024
@github-actions
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  ± 0      27 suites  ±0   9h 54m 19s ⏱️ - 3m 46s
 3 995 tests + 1   3 883 ✅ + 3    110 💤 ±0  2 ❌  - 2 
50 241 runs  +29  47 946 ✅ +37  2 293 💤  - 6  2 ❌  - 2 

For more details on these failures, see this check.

Results for commit 67eb545. ± Comparison against base commit a5a6e99.

@fjetter fjetter merged commit 1211e79 into dask:main Feb 22, 2024
@fjetter fjetter deleted the shuffle_keyerror_nodata branch February 22, 2024 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

P2P shuffle silently dropping data if provided meta includes columns that do not exist in data

2 participants