
[TEST] Splitting the python tests to test_reduction_single #20

Merged
lohiaj merged 1 commit into amd-integration from test/jets/split_reduction_tests on Apr 28, 2026

Conversation

@jamesETsmith (Collaborator) commented Apr 28, 2026

Summary

This PR splits the python tests into two invocations; all the same tests are still run. The test_reduction_single* tests now run separately because they time out when other workers are using the GPU at the same time. Splitting the tests reduces the test time from 50 to 8 min.

The savings come from avoiding timeouts rather than from more parallelism.

@jamesETsmith self-assigned this Apr 28, 2026
@jamesETsmith added the enhancement label Apr 28, 2026
@jamesETsmith requested a review from yaoliu13 Apr 28, 2026 00:18
@yaoliu13 (Collaborator) left a comment
LGTM

# The test_reduction_single* tests put the GPU under heavy load and can time out
# when many other GPU workers are running, so we run them separately.
python tests/run_tests.py -v -r 3 -a amdgpu -t 16 -k "not test_reduction_single"
python tests/run_tests.py -v -r 3 -a amdgpu -t 16 -k "test_reduction_single"
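
One way to sanity-check that the two -k expressions partition the full suite (a hypothetical check, not part of this PR; it assumes pytest is on the PATH and the suite lives under tests/):

import subprocess

def collected(kexpr=None):
    # Count the test node ids pytest would collect for a given -k expression.
    cmd = ["pytest", "tests/", "--collect-only", "-q"]
    if kexpr:
        cmd += ["-k", kexpr]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines() if "::" in line)

total = collected()
split = collected("not test_reduction_single") + collected("test_reduction_single")
assert split == total, f"filters don't cover the full suite: {split} != {total}"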

I'm a little confused about how this helps. The test_reduction_single tests are still in contention with the other tests, since worksteal initially distributes work at the test-item level. Do we just get lucky that splitting the tests pulls the test_reduction_single tests back from the timeout cliff?
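
If run_tests.py maps -t 16 onto pytest-xdist (an assumption; its internals aren't shown in this thread), the contention described above would come from an invocation along these lines:

import subprocess

# With --dist worksteal, idle workers steal individual test items, so the
# reduction tests share the GPU with whatever other items are still queued.
subprocess.call(["pytest", "tests/", "-n", "16", "--dist", "worksteal"])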

Collaborator

On the current branch, 2 tests keep failing; pre-submit pipelines showed that this PR fixes the issue.

Collaborator Author

There are only 10 reduction tests. We were running the tests with 16 threads, so even if all 10 reduction tests were running, there were still 6 other tests hammering the GPU. I didn't dig in to find out exactly which tests didn't play nicely with test_reduction_single, but my guess is that it was more than a couple.

@lohiaj commented Apr 28, 2026

Hey @jamesETsmith, was the 50 to 8 min improvement purely from removing the two timeouts × the -r 3 retries? If so, that's totally fine, but it's worth saying so explicitly in the PR description so future readers (or someone tempted to revert this) understand the savings come from "avoiding timeouts" rather than from "more parallelism."
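
Back-of-the-envelope, assuming a per-test timeout on the order of 7 min (the actual value isn't stated in this thread):

timeouts_hit = 2   # tests that were timing out
retries = 3        # from -r 3
timeout_min = 7    # assumed per-test timeout; not stated in the thread
print(timeouts_hit * retries * timeout_min)  # -> 42, roughly the 50 - 8 gap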

@lohiaj left a comment

Reviewed offline with @jamesETsmith. Diff is +4/-1 in 4_test.sh: same suite, just split into two pytest invocations via -k so the heavy test_reduction_single_* tests don't contend with the other 15 GPU workers. Coverage preserved (union of -k 'not X' and -k 'X' = full suite), and run_tests.py already treats pytest exit code 5 as success so an empty -k match is safe. Trusting Yao's pre-submit validation. Merging.
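
For reference, pytest exits with code 5 when it collects no tests, so an empty -k match would otherwise fail the run. A minimal sketch of the behavior described above (run_tests.py's actual handling isn't shown here, so this is illustrative only):

import subprocess
import sys

PYTEST_NO_TESTS_COLLECTED = 5  # pytest's exit code when -k matches nothing

def run_pytest(args):
    # Treat "no tests collected" as success so an empty -k filter is safe.
    rc = subprocess.call(["pytest", *args])
    return 0 if rc == PYTEST_NO_TESTS_COLLECTED else rc

if __name__ == "__main__":
    sys.exit(run_pytest(sys.argv[1:]))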

@lohiaj lohiaj merged commit 2a990fd into amd-integration Apr 28, 2026
33 of 42 checks passed
@jamesETsmith jamesETsmith deleted the test/jets/split_reduction_tests branch April 28, 2026 13:17