Add chaos test for dataset shuffle #25161

Merged
jjyao merged 4 commits into ray-project:master from
jjyao:jjyao/chaos
May 24, 2022

@jjyao jjyao commented May 24, 2022

Why are these changes needed?

Add chaos tests for dataset shuffle: both push-based and non-push-based.
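As a rough illustration of what a chaos test for a shuffle-style job checks, here is a toy, standard-library-only sketch. It is not the test added in this PR and does not use Ray: `InjectedFailure`, `run_with_retries`, and `chaos_sort` are hypothetical names, and the failure rate and retry budget are assumed values. The idea is that tasks are randomly failed and retried, and the job must still produce the correct sorted output.

```python
import heapq
import random


class InjectedFailure(Exception):
    """Raised by the toy fault injector to simulate a lost node (hypothetical)."""


def run_with_retries(task, arg, rng, failure_rate, max_retries):
    # Run the task, retrying whenever the injector simulates a failure.
    for attempt in range(max_retries + 1):
        try:
            if rng.random() < failure_rate:
                raise InjectedFailure(f"simulated failure on attempt {attempt}")
            return task(arg)
        except InjectedFailure:
            continue
    raise RuntimeError("task failed after exhausting its retry budget")


def chaos_sort(values, num_partitions, rng, failure_rate=0.3, max_retries=20):
    # Map phase: sort each partition independently, under injected failures.
    partitions = [values[i::num_partitions] for i in range(num_partitions)]
    sorted_parts = [
        run_with_retries(sorted, part, rng, failure_rate, max_retries)
        for part in partitions
    ]
    # Reduce phase: merge the sorted partitions, also under injected failures.
    return run_with_retries(
        lambda parts: list(heapq.merge(*parts)),
        sorted_parts,
        rng,
        failure_rate,
        max_retries,
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    data = list(range(1000))
    rng.shuffle(data)
    result = chaos_sort(data, num_partitions=8, rng=rng)
    assert result == sorted(data)
```

A real chaos test would kill actual cluster processes instead of raising an exception, but the pass criterion is the same: the job finishes with a correct result despite the failures.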

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

jjyao commented May 24, 2022

I ran them manually and they passed most of the time.

@mwtian mwtian left a comment

Currently I'm observing Ray becoming stuck in dataset_shuffle_push_based_random_shuffle_1tb after 10 to 20 minutes, during the map phase, when many workers get OOM-killed on a single node. It would be nice to reproduce that. Maybe we need to kill workers as well as Raylets.

type: sdk_command
file_manager: sdk

- name: chaos_dataset_shuffle_sort_1tb

Is there a plan to add a chaos test based on random shuffle? The random shuffle tests seem to put more stress on Ray than the nightly sort tests.

Yeah, I chose sort for now because random shuffle is not stable yet. I'd like the non-chaos random shuffle tests to pass consistently before adding chaos versions.

jjyao commented May 24, 2022

> Currently I'm observing Ray becoming stuck in dataset_shuffle_push_based_random_shuffle_1tb after 10 to 20 minutes, during the map phase, when many workers get OOM-killed on a single node. It would be nice to reproduce that. Maybe we need to kill workers as well as Raylets.

Yea, I can improve the chaos test framework so it can kill workers as well.
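To make the distinction between killing a whole node (Raylet) and killing a single worker concrete, here is a toy model in plain Python. It is only a sketch of the idea discussed above, not Ray's chaos test framework: `Node`, `kill_random_target`, and the `kill_worker_prob` parameter are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a cluster node and the worker processes it hosts."""
    name: str
    workers: set = field(default_factory=set)
    alive: bool = True


def kill_random_target(nodes, rng, kill_worker_prob=0.5):
    """Kill one worker or one whole node.

    Returns a (kind, name) tuple, or None when no node is left alive.
    """
    alive = [n for n in nodes if n.alive]
    if not alive:
        return None
    node = rng.choice(alive)
    # Worker-level kill: only one process dies, the node stays up.
    if node.workers and rng.random() < kill_worker_prob:
        worker = rng.choice(sorted(node.workers))
        node.workers.discard(worker)
        return ("worker", worker)
    # Node-level kill: the node and every worker on it go away.
    node.alive = False
    node.workers.clear()
    return ("node", node.name)


if __name__ == "__main__":
    rng = random.Random(42)
    cluster = [Node(f"node-{i}", {f"w{i}-{j}" for j in range(3)}) for i in range(2)]
    while (event := kill_random_target(cluster, rng)) is not None:
        print(event)
```

Worker-level kills resemble the OOM scenario described in the review, where individual processes die while the node itself stays up.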

@jjyao jjyao merged commit 00cdd8d into ray-project:master May 24, 2022
@jjyao jjyao deleted the jjyao/chaos branch May 24, 2022 22:12

3 participants