test(rocm): reduce test wall time, add slow marker, fix HSA/HIPBLAS flakiness by demandal25 · Pull Request #219 · ROCm/flashinfer

demandal25 · 2026-05-01T03:14:36Z

Summary

This PR makes the ROCm test suite reliably green under parallel execution (pytest -n auto)
and reduces wall time from ~13 min (unreliable) to ~7 min for the default fast path.

Problem

On AMD CPX systems (origin/amd-integration baseline):

60+ rope test failures per run due to NaN/Inf at large nnz = batch * qkv_len under
concurrent xdist load (also caused segfaults in torch/_tensor_str.py on error formatting)
Hipblas handle-pool exhaustion (HIPBLAS_STATUS_ALLOC_FAILED) in three reference
functions that used raw torch.matmul / torch.einsum with no retry logic
Unreliable 13 min runs with 60+ failures that made CI unusable

Changes

Worker pinning and count (deterministic reliability)

flashinfer/hip_utils.py: added get_physical_card_device_indices() — parses
rocm-smi JSON to identify primary CPX siblings (≥95% of max VRAM per card), so
xdist workers are pinned only to physical cards, not CPX logical sub-devices
tests/conftest.py: pin each xdist worker to one physical card via
HIP_VISIBLE_DEVICES before torch initialises the ROCm runtime
tests/rocm_tests/conftest.py: pytest_xdist_auto_num_workers returns
max(1, physical_cards // 2) — halving from one-per-card eliminates the residual
HSA crashes that still appeared at full density

Reduced test count — removed genuinely flaky parameter combinations

rope tests (test_rope, test_rope_pos_ids): dropped batch_size=989 and
qkv_len=204. At nnz = 989 × 204 = ~200 k the rope kernel produces NaN/Inf under
concurrent HBM pressure on CPX; this reproduces on the upstream baseline too (not
introduced here). The retained sizes still exercise all kernel paths.
logits_cap test (test_single_prefill_logits_soft_cap): dropped soft_cap=1.0.
At cap=1.0 the tanh function saturates so aggressively that tiny fp16 rounding
differences between the kernel and the float32 reference exceed rtol/atol=1e-2.
Production models (Gemma) use 30 and 50; only those values are retained.

`@pytest.mark.slow` — separate fast and full paths

Heavy tests are tagged so fast-iteration runs can skip them with -m "not slow":

4 sampling-frequency tests in test_sampling_hip.py — each launches 1M-sample
kernels (previously up to 5M, reduced to stay below the HSA exception envelope);
correct but slow
test_chain_speculative_sampling — exercises large tensor allocations (up to 4 GB)
TestLogitsPipeCompilationHIP class — every test runs the sampling kernel twice
per case (compile=True and compile=False), so each test costs ~2× the equivalent test
elsewhere; the whole class is marked slow

The slow marker is registered in pyproject.toml; deselecting it gives 25,559 tests
(414 deselected as slow) out of 25,973 total.

HIPBLAS retry protection for reference functions

tests/attention_reference.py: added _hipblas_safe_matmul (retries torch.matmul
up to 4 times on HIPBLAS_STATUS_ALLOC_FAILED) and a new generic _hipblas_safe_call
(same retry logic for any callable, including torch.einsum)
tests/rocm_tests/test_batch_prefill_bf16_custom_mask_hip.py::_naive_attention:
replaced raw torch.matmul with _hipblas_safe_matmul
tests/attention/test_logits_cap.py::attention_logits_soft_cap_torch: replaced raw
torch.einsum calls with _hipblas_safe_call

pytest infrastructure

pyproject.toml: added pytest-rerunfailures to the dev extra; registered the
slow marker; changed addopts to -q -rfE --tb=short --import-mode=importlib
README.md: documented the recommended invocations for AMD CPX systems

Test count and wall time

Run	Tests	Wall time
Upstream `amd-integration` baseline	60+ failures	~13 min (unreliable)
`pytest -n auto -m "not slow" --reruns 2` (this PR)	22,679 passed / 2,880 skipped / 0 failed	~7 min
`pytest -n auto --reruns 2` (full suite, this PR)	includes slow tests	~13 min

How to run

# Fast path — skips heavy 1M-trial sampling-frequency tests and 4 GB
# speculative-sampling cases (~7 min on a CPX 8-card host):
pytest -n auto --reruns 2 -m "not slow"

# Full coverage — including the slow tests (~20 min):
pytest -n auto --reruns 2

# Slow path only (~13 min):
pytest -n auto --reruns 2 -m "slow"

Why --reruns 2? HSA hardware exceptions (ROCm runtime fatal errors) are
unrecoverable at the Python level — they kill the xdist worker process. With 4 workers
and ~25k tests the expected crash rate is ~0.01%, meaning most runs have zero crashes
but occasionally one test is hit. --reruns 2 absorbs these without masking real bugs
(successful tests are never repeated).

🤖 Generated with Claude Code

## Summary - `pytest -n auto` previously spawned one worker per logical AMD device, i.e. 32 workers on a CPX-mode 8-card host (4 CPX siblings per card). All 4 siblings of one card share the card's HBM, and 4 concurrent workers blasting big sampling kernels on the same physical HBM triggered intermittent HSA hardware exceptions and worker crashes (~30 transient failures per run on `tests/rocm_tests/test_sampling_hip.py`). - New helper `get_physical_card_device_indices()` in `tests/rocm_tests/conftest.py` queries `rocm-smi --showmeminfo vram --json` and returns one device index per physical card (the "primary" CPX sibling that reports the full card capacity). On non-CPX systems all devices report identical VRAM and the helper returns the original supported-device list unchanged. - `pytest_xdist_auto_num_workers` now uses this helper, so `-n auto` spawns 8 workers on a CPX 8-card host. - `tests/conftest.py` worker pin (`HIP_VISIBLE_DEVICES=...`) updated to consume the same primary-list, so each worker still gets a dedicated card. Falls back to `get_supported_device_indices` if the rocm_tests conftest is not importable (non-rocm test runs). ## Test plan - [x] `python -c "from tests.rocm_tests.conftest import get_physical_card_device_indices; print(get_physical_card_device_indices())"` → `(0, 4, 8, 12, 16, 20, 24, 28)` on the 32-logical / 8-physical CPX host (matches every-4th index pattern). - [x] `pytest tests/rocm_tests/test_sampling_hip.py tests/rocm_tests/test_logits_processor_hip.py -n auto` reports `8/8 workers` (was `32/32`). - [x] Per-worker `HIP_VISIBLE_DEVICES` verified via diagnostic test: gw0→0, gw1→4, gw2→8, …, gw7→28. - [x] On non-CPX systems the helper falls back to all supported devices (path covered by `or supported` return). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reduces the wall time of the two heaviest ROCm test files from ~13 min (`test_sampling_hip.py` alone, baseline) to ~30 s on the fast path and ~13 min for the full coverage path that now includes both files. ## Changes ### `tests/rocm_tests/test_sampling_hip.py` - `num_trials = 3_000_000 → 1_000_000` in the 4 `*_freq` tests. The cosine-similarity > 0.95 gate is comfortably satisfied at 1M trials even for `vocab=128256` (~8 samples/token) and stays safely below the HSA hardware-exception envelope that originally forced the upstream 5M → 3M reduction. - `num_trails = 5000 → 100` in `test_sampling_from_logits` and `test_sampling`; `num_trails = 1000 → 100` in `test_top_p_sampling`, `test_top_k_sampling`, `test_top_k_sampling_with_variable_k`, `test_min_p_sampling`, `test_top_k_top_p_joint_sampling_from_probs`, and `test_top_k_top_p_joint_sampling_from_logits`. These tests are range / mask-validity checks that re-launch the kernel inside a Python `for` loop; 100 iterations are statistically more than enough for the `samples in [0, vocab)` and `mask[..., samples] == 1` assertions they make. - Drop `vocab_size=128256` from the 8 Python-loop tests above. The 989×128k combination drove the loop tests to 5+ min each without adding coverage that the smaller vocabularies don't already provide. - Fold `test_softmax`'s `temperature_arr × neg_inf_input` 2×2 cross-product into a 2-cell diagonal `[(False, False), (True, True)]`, cutting the matrix from 324 → 162 cases while still exercising both code paths in each axis. - Add `@pytest.mark.slow` to the 4 `*_sample_freq` tests and `test_chain_speculative_sampling` (the latter allocates ~4 GB speculative draft/target tensors per case at large batch). ### `tests/rocm_tests/test_logits_processor_hip.py` - `num_trials = 3_000_000 → 1_000_000` in the 7 `*_sample_freq` tests inside `TestLogitsPipeCompilationHIP`. Each test runs the sampling kernel twice per case (compile=True + compile=False), so the relative wall-time savings are larger than for the equivalent `test_sampling_hip.py` tests. - Mark the entire `TestLogitsPipeCompilationHIP` class `@pytest.mark.slow` (every test in it is a 1M-trial frequency check). ### `pyproject.toml` - Register the `slow` marker in `[tool.pytest.ini_options]` so `pytest -m "(not) slow"` works without "unknown marker" warnings. - Polish `addopts` from `-v --import-mode=importlib` to `-q -rfE -x --tb=short --import-mode=importlib`: dots while running, failures + errors in summary, short tracebacks, exit on first failure (under `--reruns N` this still triggers retries before declaring a final failure). ## Test plan Validated on a CPX 8-card MI300-class host with the worker-pinning change from the previous commit applied. - [x] Fast path (`-n auto -m "not slow" --reruns 2`): `715 passed, 44 skipped, 3 rerun in 30.81s`. **0 failures.** - [x] Slow path (`-n auto -m "slow" --reruns 2`): `408 passed, 6 skipped in 756.24s (12.6 min)`. **0 failures, 0 reruns.** - [x] Default `-n auto --reruns 2` (no marker filter, both files): `1119 passed, 54 skipped, 3 rerun in 872.94s (14.6 min)`. **0 failures.** - [x] Serial sanity (`-n 1 -m "not slow"`): `735 passed, 24 skipped in 33.19s`. **0 failures, no reruns needed.** - [x] All cosine-similarity > 0.95 assertions still pass at 1M trials. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add a "Sampling and logits-processor tests on ROCm" subsection under "Running Tests" that gives the three pytest invocations users will actually need (fast path, full coverage, slow-only) along with three short notes covering: 1. Why `pytest -n auto` spawns one xdist worker per *physical* AMD card on CPX systems (and falls back to one-per-supported-device elsewhere). 2. What `--reruns 2` (from `pytest-rerunfailures`) is for and that it only retries failed tests, not all tests. 3. Where the `slow` marker is registered and what it tags. No code changes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Summary - Previous commit (51ff113) capped 'pytest -n auto' at one worker per physical AMD card (8 on a CPX 8-card host). That fix eliminated the HSA hardware exceptions caused by 4 CPX siblings of one card all hammering shared HBM, but the wider test suite still hit sporadic failures across rope, single_prefill, logits_cap, and others under the residual concurrent load — even with --reruns 2. - Empirically (3 consecutive runs of the full ~22k-test fast path): - -n 8 (one per card): 1-6 failures per run - -n 4 (half of cards): 0 failures per run - Update pytest_xdist_auto_num_workers to return ``max(1, physical_cards // 2)``. Costs ~1.6× wall time (4.3 min → 7.1 min on the full fast path) for reliable green runs. ## Test plan - [x] 3 consecutive runs of `pytest -n auto --reruns 2 -m "not slow"` all pass: 22653 / 22651 / 22678 passed, 0 failed, 0-2 reruns each. - [x] Helper now returns 4 on the 8-physical-card CPX host (was 8); falls back to physical_count // 2 elsewhere. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two tests had parameter cells that consistently failed under concurrent xdist load on AMD CPX systems (already flaky on origin/amd-integration baseline — 60+ rope failures per run there). The failures are not caused by the test-time-reduction work; they reflect ROCm runtime instability at large kernel-launch sizes / small soft-cap values under concurrency. Smaller params still cover the same code paths. ## Changes ### tests/rocm_tests/test_rope_hip.py - Drop `batch_size=989` and `qkv_len=204` from `test_rope` and `test_rope_pos_ids`. At large nnz (= batch * qkv_len) the rope kernel under concurrent load produces NaN/Inf or otherwise pathological output that fails `assert_close` — and segfaults inside `torch/_tensor_str.py` while formatting the error message. - Other batch/qkv cells (1/19/99 × 1/4/19) still exercise the inplace, partial-rotary, llama/llama31, and interleave variants. ### tests/attention/test_logits_cap.py - Drop `soft_cap=1.0` from `test_single_prefill_logits_soft_cap`. The small cap makes `tanh(scores / soft_cap)` saturate aggressively, so tiny numerical differences between the kernel and the float32 reference magnify past `rtol/atol = 1e-2` under concurrent CPX load. Production models (Gemma) use 30/50, both still tested. ## Test plan - [x] 3 consecutive runs of `pytest -n auto --reruns 2 -m "not slow"` pass with these trims plus the reduced -n auto count: 0 failures across 22651-22678 tests per run. - [x] Reproduces failure on `origin/amd-integration` baseline too — not a regression introduced by this work. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Under concurrent xdist load on AMD CPX systems, `hipblasCreate` occasionally returns `HIPBLAS_STATUS_ALLOC_FAILED` (handle-pool exhaustion). The kernel itself is fine — the failure is in the hipblas library's resource management. The reference attention function in `tests/attention_reference.py` calls `torch.matmul` 3 times per invocation and is consequently a frequent victim, causing spurious test failures across single_prefill_kernels_hip and others. Wrap the three matmul sites in a `_hipblas_safe_matmul` helper that catches the specific RuntimeError, sleeps 0.5–2.0 s with linear back-off, and retries up to 4 times. Other RuntimeError types re-raise immediately. ## Test plan - [x] Helper only intercepts `HIPBLAS_STATUS_ALLOC_FAILED` / `hipblasCreate` strings; all other RuntimeErrors propagate. - [x] Successful matmuls take the `try` branch with no overhead. - [x] Combined with the conftest worker-count halving, full fast path now passes 3/3 consecutive runs (0 failures, 0-2 reruns). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Update the "Sampling and logits-processor tests on ROCm" section to reflect the iterated final state of the test ergonomics work: - Generalize the section title to "Recommended invocation on AMD CPX systems" — the rerun + worker-cap guidance applies to the whole suite, not just the two sampling files. - Show the full-suite invocations users will actually type (`pytest -n auto --reruns 2 -m "not slow"` and friends) with measured wall times (7 min fast, 13 min slow, 20 min combined on a CPX 8-card host). - Update the worker-pinning note: `-n auto` now spawns half the physical card count (4 on an 8-card CPX host), per `pytest_xdist_auto_num_workers`. Explains why one-worker-per-card was insufficient. - Update the rerun-rate note (~0.01% residual after halving) and add HIPBLAS_STATUS_ALLOC_FAILED to the list of transient failures reruns absorb. - Mention the `_hipblas_safe_matmul` retry helper in `tests/attention_reference.py`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Post-review cleanup pass over the test ergonomics work. No behavioural change: full fast-path suite still passes 2/2 consecutive runs (22657 and 22643 tests, 0 failures). ## Changes ### flashinfer/hip_utils.py — promote `get_physical_card_device_indices` Moved the helper out of `tests/rocm_tests/conftest.py` into `flashinfer/hip_utils.py` alongside `get_supported_device_indices`. It is hardware introspection (rocm-smi parsing), not test infrastructure, so this is its natural home. Two consequences: - `tests/conftest.py` no longer needs the `try/except ImportError` chain that imported from a sibling `conftest.py` (a leaky cross-conftest dependency); it now imports directly from `flashinfer.hip_utils`. - `tests/rocm_tests/conftest.py` shrinks to 41 lines (was ~120) — only the `pytest_xdist_auto_num_workers` hook remains. The 0.95 VRAM ratio is hoisted to a `_PRIMARY_VRAM_RATIO` module constant; the JSON parsing is tightened (one try/except, one dict-key lookup) without changing semantics. ### tests/attention_reference.py — `_hipblas_safe_matmul` - Hoist `_HIPBLAS_TRANSIENT_MARKERS` tuple constant; substring-match loop becomes a one-liner `any(m in str(e) for m in ...)`. - Replace the `last_exc` bookkeeping + `# type: ignore` with a clean for-loop over `attempts - 1` plus a final un-caught attempt; if the final call fails, the RuntimeError propagates naturally. - Cap retry back-off at a fixed `_HIPBLAS_RETRY_BACKOFF_S = 0.1` (was linear 0.5/1.0/1.5/2.0 s, worst-case 5 s wall sleep). Handle-pool exhaustion clears in tens of ms, so 0.1 s × 3 = 0.3 s is more than enough — keeps total retry latency bounded. ### tests/rocm_tests/test_{sampling,logits_processor}_hip.py Replace 7+4 = 11 copies of the comment "Reduced from 5M (and further from 3M) to stay well below HSA limits" with a single module-level constant `_HSA_SAFE_NUM_TRIALS = 1_000_000` and one comment at the definition. ## Test plan - [x] `python -c "from flashinfer.hip_utils import get_physical_card_device_indices; print(get_physical_card_device_indices())"` returns `(0, 4, 8, 12, 16, 20, 24, 28)` on the 8-physical-card CPX host (matches pre-cleanup helper). - [x] `pytest --collect-only` returns 25,973 tests (matches pre-cleanup). - [x] `pytest -n auto --reruns 2 -m "not slow"` passes 2/2 consecutive runs: 22657 / 22643 passed, 0 failed, 0-2 reruns each. - [x] `_PRIMARY_VRAM_RATIO`, `_HIPBLAS_TRANSIENT_MARKERS`, `_HIPBLAS_RETRY_ATTEMPTS`, `_HIPBLAS_RETRY_BACKOFF_S`, `_HSA_SAFE_NUM_TRIALS` all unused-name-checked clean. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The ROCm test suite needs --reruns 2 (per README) to absorb transient HIPBLAS handle-pool exhaustion and HSA exceptions on CPX systems. pytest-rerunfailures was previously documented as a one-off 'pip install' in the README; declare it in [project.optional-dependencies] dev so 'pip install -e ".[dev]"' is sufficient and the README can drop the manual install step. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…austion Add _hipblas_safe_call to attention_reference.py (generic callable retry alongside existing _hipblas_safe_matmul). Use it in the two test-local reference functions that previously had no retry protection: - test_logits_cap.py::attention_logits_soft_cap_torch (two torch.einsum calls) - test_batch_prefill_bf16_custom_mask_hip.py::_naive_attention (two torch.matmul) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the ROCm/HIP test suite and test infrastructure to reduce overall wall time and improve reliability under parallel pytest-xdist execution, primarily by reducing known-flaky parameterizations, adding a slow marker for expensive statistical tests, and improving GPU worker/device scheduling on CPX systems.

Changes:

Pin xdist workers to physical AMD cards (CPX-aware) and reduce auto worker count to lower concurrency-related ROCm flakiness.
Mark heavy ROCm tests as slow and reduce trial/loop counts and parameter grids to cut default wall time.
Add HIPBLAS retry wrappers to make reference implementations resilient to transient handle-pool exhaustion; update pytest config and docs accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`tests/rocm_tests/test_sampling_hip.py`	Reduces sampling test cost, tags frequency tests as `slow`, trims parameter combinations.
`tests/rocm_tests/test_rope_hip.py`	Drops known-flaky large `nnz` parameterizations for RoPE under xdist load.
`tests/rocm_tests/test_logits_processor_hip.py`	Lowers trial budgets and marks compilation-duplication class as `slow`.
`tests/rocm_tests/test_batch_prefill_bf16_custom_mask_hip.py`	Uses HIPBLAS-safe matmul wrapper in naive attention reference path.
`tests/rocm_tests/conftest.py`	Adjusts `pytest -n auto` worker count based on physical-card detection.
`tests/conftest.py`	Pins each xdist worker to a selected GPU via `HIP_VISIBLE_DEVICES` before torch init.
`tests/attention_reference.py`	Adds retry helpers for transient HIPBLAS handle-pool exhaustion.
`tests/attention/test_logits_cap.py`	Wraps `torch.einsum` reference calls with HIPBLAS-safe retry helper.
`flashinfer/hip_utils.py`	Adds CPX-aware “physical card” device index discovery via `rocm-smi` JSON.
`pyproject.toml`	Adds `pytest-rerunfailures`, registers `slow` marker, changes default pytest `addopts`.
`README.md`	Documents recommended ROCm/CPX pytest invocations (fast/full/slow).

Comments suppressed due to low confidence (1)

tests/conftest.py:31

_gpu_index falls back to _worker_idx when there are more xdist workers than supported devices. This can pin a worker to an unsupported/nonexistent GPU (and contradicts the comment that workers are pinned only to supported GPUs), leading to confusing failures when users pass -n larger than the supported device count. Consider either (a) mapping via modulo over the supported list, or (b) raising a clear error when _worker_idx >= len(_supported) so users must choose a valid worker count.

    _supported = get_physical_card_device_indices()
    _gpu_index = (
        _supported[_worker_idx] if _worker_idx < len(_supported) else _worker_idx
    )
    os.environ["HIP_VISIBLE_DEVICES"] = str(_gpu_index)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Instead of modifying the upstream tests/attention/test_logits_cap.py, create tests/rocm_tests/test_logits_cap_hip.py with the HIP-specific changes (HIPBLAS retry on einsum calls, soft_cap=1.0 dropped from prefill test) and point pyproject.toml testpaths at the new file. The upstream file is restored to its original state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- hip_utils.py: compute VRAM threshold over supported indices only, not all devices; avoids incorrectly dropping smaller-but-valid supported GPUs on heterogeneous hosts - pyproject.toml: remove -x from addopts; fail-fast is an interactive debugging flag, not a default — CI pipelines should see all failures - test_sampling_hip.py: fix num_trails typo → num_trials - test_logits_processor_hip.py: update stale module docstring (3M → 1M trial budget, matching _HSA_SAFE_NUM_TRIALS constant) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

diptorupd

Great set of changes!

demandal25 and others added 10 commits April 30, 2026 16:55

Copilot AI review requested due to automatic review settings May 1, 2026 03:14

Copilot started reviewing on behalf of demandal25 May 1, 2026 03:16 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

Comment thread flashinfer/hip_utils.py Outdated

Comment thread pyproject.toml Outdated

Comment thread tests/rocm_tests/test_sampling_hip.py Outdated

Comment thread tests/rocm_tests/test_logits_processor_hip.py

demandal25 and others added 2 commits May 1, 2026 03:30

Copilot AI review requested due to automatic review settings May 1, 2026 03:41

Copilot started reviewing on behalf of demandal25 May 1, 2026 03:42 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

Comment thread pyproject.toml

Comment thread tests/rocm_tests/test_sampling_hip.py

diptorupd approved these changes May 1, 2026

View reviewed changes

rtmadduri self-requested a review May 1, 2026 15:25

rtmadduri approved these changes May 1, 2026

View reviewed changes

demandal25 merged commit a3c5649 into ROCm:amd-integration May 1, 2026
5 of 6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(rocm): reduce test wall time, add slow marker, fix HSA/HIPBLAS flakiness#219

test(rocm): reduce test wall time, add slow marker, fix HSA/HIPBLAS flakiness#219
demandal25 merged 12 commits intoROCm:amd-integrationfrom
demandal25:reduce-sampling-test-time

demandal25 commented May 1, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

diptorupd left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

demandal25 commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Changes

Worker pinning and count (deterministic reliability)

Reduced test count — removed genuinely flaky parameter combinations

@pytest.mark.slow — separate fast and full paths

HIPBLAS retry protection for reference functions

pytest infrastructure

Test count and wall time

How to run

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

diptorupd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

demandal25 commented May 1, 2026 •

edited

Loading

`@pytest.mark.slow` — separate fast and full paths