Add chaos test for dataset shuffle #25161
Manually ran them, and they passed most of the time.
mwtian
left a comment
I'm currently observing Ray getting stuck in dataset_shuffle_push_based_random_shuffle_1tb after 10~20 minutes, during the map phase, when many workers get OOM-killed on a single node. It would be nice to reproduce that. Maybe we need to kill workers as well as Raylets.
type: sdk_command
file_manager: sdk

- name: chaos_dataset_shuffle_sort_1tb
Is there a plan to add a random-shuffle-based chaos test? The random shuffle tests seem to put more stress on Ray than the sort nightly tests.
Yeah, I chose sort for now because random shuffle is not stable yet. I'd like the non-chaos random shuffle test to pass consistently before adding chaos ones.
Yeah, I can improve the chaos test framework so it can kill workers as well.
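To make the "kill workers as well as Raylets" idea concrete, here is a minimal, self-contained sketch of how a chaos harness might choose its kill targets. It is not the Ray chaos test framework's code: the `Target` type, the `plan_kills` function, and the node names and PIDs are all hypothetical, and the sketch only selects targets rather than killing any process.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Target:
    """A killable process. Hypothetical type, for illustration only."""
    kind: str  # "raylet" or "worker"
    node: str  # hypothetical node name
    pid: int   # hypothetical process id


def plan_kills(
    targets: Sequence[Target],
    num_kills: int,
    kinds: Sequence[str] = ("raylet", "worker"),
    seed: Optional[int] = None,
) -> List[Target]:
    """Pick up to num_kills distinct targets whose kind is in kinds.

    A seed makes the selection reproducible, which helps when a chaos
    run needs to be replayed to debug a hang.
    """
    rng = random.Random(seed)
    eligible = [t for t in targets if t.kind in kinds]
    # Never ask for more targets than are eligible.
    return rng.sample(eligible, min(num_kills, len(eligible)))


if __name__ == "__main__":
    # Made-up cluster state, not taken from a real run.
    cluster = [
        Target("raylet", "node-1", 100),
        Target("worker", "node-1", 101),
        Target("worker", "node-1", 102),
        Target("raylet", "node-2", 200),
        Target("worker", "node-2", 201),
    ]
    for target in plan_kills(cluster, num_kills=2, seed=0):
        print(f"would kill {target.kind} pid={target.pid} on {target.node}")
```

Passing `kinds=("worker",)` restricts the plan to worker processes, which is closer to the OOM-kill scenario described in the review comment above.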
Why are these changes needed?
Add chaos tests for dataset shuffle: both push-based and non-push-based.
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.