[TEST] Splitting the python tests to test_reduction_single #20
lohiaj merged 1 commit into amd-integration
…hen other workers are using the gpu
# The test_reduction_single* tests put the GPU under a lot of stress and can run
# out of time during the test run with lots of GPU workers. We should run them separately.
python tests/run_tests.py -v -r 3 -a amdgpu -t 16 -k "not test_reduction_single"
python tests/run_tests.py -v -r 3 -a amdgpu -t 16 -k "test_reduction_single"
I am a little confused about how this helps. The test_reduction_single tests are still in contention with the other tests, since worksteal initially distributes at the test item level. Do we just get lucky that splitting the tests pulls the test_reduction_single tests back from a timeout cliff?
As of now, 2 tests keep failing. Pre-submit pipelines showed that this PR fixed this issue.
There are only 10 reduction tests. We were running the tests with 16 threads, so even if all 10 reduction tests were running, there were still an additional 6 tests hammering the GPU. I didn't dig in to find out exactly which tests didn't play nice with test_reduction_single, but my guess is that it was more than a couple of tests.
hey @jamesETsmith, was the 50 to 8 min purely from removing the two timeouts * -r 3 retries? If so, that's totally fine, but it's worth saying so explicitly in the PR description so future readers (or someone tempted to revert this) understand the savings come from "avoid timeouts" rather than from "more parallelism."
lohiaj left a comment
Reviewed offline with @jamesETsmith. Diff is +4/-1 in 4_test.sh: same suite, just split into two pytest invocations via -k so the heavy test_reduction_single_* tests don't contend with the other 15 GPU workers. Coverage preserved (union of -k 'not X' and -k 'X' = full suite), and run_tests.py already treats pytest exit code 5 as success so an empty -k match is safe. Trusting Yao's pre-submit validation. Merging.
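The two claims in the review above (the `-k 'not X'` and `-k 'X'` selections together cover the full suite, and exit code 5 counts as success) can be illustrated with a small sketch. This is not the code in run_tests.py: `select` is a simplified stand-in for pytest's `-k` matching, `is_success` is a hypothetical helper, and the test names are made up for illustration. The only outside fact relied on is that pytest reports "no tests collected" with exit code 5.

```python
# Exit code pytest returns when no tests were collected (pytest.ExitCode.NO_TESTS_COLLECTED).
NO_TESTS_COLLECTED = 5


def select(test_ids, keyword, negate=False):
    """Simplified stand-in for `-k keyword` / `-k "not keyword"` (plain substring match)."""
    return {t for t in test_ids if (keyword in t) != negate}


def is_success(returncode):
    """Hypothetical helper: treat an empty selection (exit code 5) like a pass."""
    return returncode in (0, NO_TESTS_COLLECTED)


# Hypothetical test names, for illustration only.
all_tests = {
    "test_reduction_single_sum",
    "test_reduction_single_max",
    "test_matmul",
    "test_atomic_add",
}

heavy = select(all_tests, "test_reduction_single")
rest = select(all_tests, "test_reduction_single", negate=True)

# The two selections are disjoint and together cover the whole suite.
assert heavy | rest == all_tests
assert heavy & rest == set()
```

Because each test either contains the keyword or does not, the two selections always partition the suite, whatever the actual test names are.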
Summary
This PR splits the python tests into two calls. All the same tests are still run. We now run the test_reduction_single* tests separately because they time out when other workers are using the GPU too. Splitting the tests up reduces the test time from 50 to 8 min. The savings come from "avoid timeouts" rather than from "more parallelism."
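For readers who want to reproduce the split outside the shell script, here is a minimal sketch that builds the same two invocations and runs them one after the other. The flags are copied from the diff. The helper names (`build_commands`, `run_all`) and the choice to run both calls even if the first fails are assumptions of this sketch, not something the PR specifies.

```python
import subprocess

# Flags copied verbatim from the two commands in the diff.
BASE = ["python", "tests/run_tests.py", "-v", "-r", "3", "-a", "amdgpu", "-t", "16"]
HEAVY = "test_reduction_single"


def build_commands():
    """Return the two invocations: everything except the heavy tests, then the heavy tests."""
    return [BASE + ["-k", f"not {HEAVY}"], BASE + ["-k", HEAVY]]


def run_all(commands, runner=subprocess.call):
    """Run every command in order; return 0 if all passed, else the first nonzero code.

    `runner` is injectable so the logic can be exercised without a GPU.
    """
    codes = [runner(cmd) for cmd in commands]
    return next((code for code in codes if code != 0), 0)


# Usage (assumes the repository layout from the diff):
#   raise SystemExit(run_all(build_commands()))
```

Running the second call even when the first one fails means a single failure does not hide results from the other half of the suite.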