test(router): skip test_sglang_indexers_sync under DYN-2784#8468
test(router): skip test_sglang_indexers_sync under DYN-2784#8468dmitry-tokarev-nv wants to merge 2 commits into
Conversation
After #8441 skipped test_router_decisions_sglang_multiple_workers to unblock the sglang nightly, the very next test with num_workers=2 — test_sglang_indexers_sync[nats_core] — started hanging with the same KvRouter / min_initial_workers=2 startup bug described in DYN-2784. Example stuck CI job (PR #8441's own CI after its skip landed): https://github.com/ai-dynamo/dynamo/actions/runs/24739524988/job/72377886646 — last log line is the multiple_workers skip at [ 96%], then >80 min of no output in test_sglang_indexers_sync. All three num_workers=2 tests in this file share the same startup path; DYN-2784 already tracks the root cause. Mark this one with the same skip pending the real fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
WalkthroughAdded a Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/router/test_router_e2e_with_sglang.py (1)
263-269:⚠️ Potential issue | 🟡 MinorAdd a timeout to the sibling DYN-2784 test before removing the skips.
test_router_decisions_sglang_multiple_workersis also documented as a knownmin_initial_workers=2hang path, but unliketest_sglang_indexers_sync, it has no@pytest.mark.timeout(...). Adding the same guard now would prevent reintroducing an indefinite CI hang when both DYN-2784 skips are removed together. As per coding guidelines, “If the test (or any nearby test) can hang, it must be guarded with@pytest.mark.timeout(...).”Proposed guard
`@pytest.mark.pre_merge` `@pytest.mark.gpu_1` `@pytest.mark.parametrize`("request_plane", ["tcp"], indirect=True) +@pytest.mark.timeout(150) def test_router_decisions_sglang_multiple_workers(🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/router/test_router_e2e_with_sglang.py` around lines 263 - 269, The test test_router_decisions_sglang_multiple_workers can hang due to the min_initial_workers=2 startup issue; add a pytest timeout decorator to it (e.g., `@pytest.mark.timeout`(<same_value_used_in_test_sglang_indexers_sync>)) so it is guarded the same way as test_sglang_indexers_sync and prevents indefinite CI hangs; place the decorator above the test_router_decisions_sglang_multiple_workers declaration alongside the existing marks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/router/test_router_e2e_with_sglang.py`:
- Around line 263-269: The test test_router_decisions_sglang_multiple_workers
can hang due to the min_initial_workers=2 startup issue; add a pytest timeout
decorator to it (e.g.,
`@pytest.mark.timeout`(<same_value_used_in_test_sglang_indexers_sync>)) so it is
guarded the same way as test_sglang_indexers_sync and prevents indefinite CI
hangs; place the decorator above the
test_router_decisions_sglang_multiple_workers declaration alongside the existing
marks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e212c551-5b26-4548-82e1-f90023db73e9
📒 Files selected for processing (1)
tests/router/test_router_e2e_with_sglang.py
…rvation Temporarily disables every @pytest.mark.skip in tests/router/test_router_e2e_with_sglang.py (including the one this PR was originally adding for test_sglang_indexers_sync under DYN-2784) so PR and post-merge CI runs surface whether the same hangs that break the nightly also happen here. Tests affected: - test_router_decisions_sglang_multiple_workers (DYN-2784) - test_router_decisions_sglang_dp (DYN-2265) - test_router_decisions_sglang_disagg (DYN-2603) - test_sglang_indexers_sync (DYN-2784) Do NOT merge this debug commit — revert before landing the skip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
With PR #8441 (skip test_router_decisions_sglang_multiple_workers) merged, the 2026-04-22 nightly shows the hang has shifted to test_sglang_indexers_sync — the next unskipped test in the file. pytest never emits a header for it, so the hang is in fixture setup (worker #2 dies in SGLangProcess launch, KvRouter then blocks forever on min_initial_workers=2; pytest.mark.timeout is swallowed at a C-level syscall). Jobs run 5h06m and hit GitHub's 6h runner cap. The test passes reliably in pre_merge/post_merge — it only wedges in nightly, where the broader test suite accumulates state that makes worker #2 death more likely. Scope the skip accordingly: * Introduce a `skip_in_nightly` pytest marker (registered in pyproject.toml) for tests that should be excluded from nightly only. * Mark test_sglang_indexers_sync with it in place of the previous unconditional skip. * Extend sglang's single_gpu_test_markers in nightly-ci.yml with `and not skip_in_nightly` so the marker is honored. Scoped to sglang single-GPU since that's where the hang lives; vllm/trtllm filters are left untouched and other matrices can opt in when needed. PR #8468 is open with the same skip but is currently failing its sglang Multi-GPU checks on cuda12.9/13.0. Landing this unblocks the nightly without waiting for that PR. See DYN-2784 for the full root-cause chain and reproduction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
Summary
Skips
tests/router/test_router_e2e_with_sglang.py::test_sglang_indexers_syncunder DYN-2784, reusing the same ticket PR #8441 opened fortest_router_decisions_sglang_multiple_workers— both are instances of the same root cause.Why
PR #8441 skipped
test_router_decisions_sglang_multiple_workersunder DYN-2784 with the note:That file has three tests spinning up
num_workers=2:test_sglang_kv_router_basictest_router_decisions_sglang_multiple_workerstest_sglang_indexers_syncAfter #8441 merged, PR #8441's own CI run (24739524988 / job 72377886646) got past the
multiple_workersskip at[ 96%]and then sat for 80+ minutes in the next test (test_sglang_indexers_sync[nats_core]) with no further log output.Local repro on gpu host — root cause diagnosed
Reproduced the hang locally on a single-GPU dev box (RTX 6000 Ada, 48 GiB) using
nvcr.io/nvstaging/ai-dynamo/sglang-runtime:1.1.0rc0-cuda13+container/deps/requirements.test.txt.Findings
Exact failure signature matches DYN-2784.
py-spy dumpon the stuck pytest shows MainThread stuck inasyncio.run_until_complete(test_sync())attests/router/common.py:1234, insidewait_for_workers_ready(expected_num_workers=2). Only 1 of the 2 expected sglang workers ever registers.The missing worker is OOM'd, not crashed.
nvidia-smiduring the hang shows GPU at 44.4 / 48.2 GiB used (92%) with only 1 livedynamo.sglangprocess. The other ~22 GiB is held by kernel-level CUDA contexts that outlive their (dead) owner processes.Source of the leak:
@pytest.mark.nightly-only fault-tolerance tests (test_request_migration_sglang_aggregated[...worker_failure-...],test_request_cancellation_sglang_aggregated, etc.) useSIGKILLon their sglang workers as part of the test design.SIGKILLskips CUDA context teardown, so GPU memory is pinned at the driver level with no live owner. Zombie/proc/<pid>entries remain (12+ observed in one run) whose executables are gone.Why it's nightly-only, not pre-merge / post-merge
-mfilter"pre_merge and sglang and gpu_1""post_merge and sglang and gpu_1""sglang and gpu_1"(no gate)worker_failure/ SIGKILL tests collected?@pytest.mark.nightlyonlySo the same router tests pass cleanly in PR/post-merge and hang in nightly for the same reason: the nightly set includes SIGKILL-happy fault-tolerance tests that leak GPU memory before the router tests get their turn.
Suggested fixes (prioritised)
Source-side (ideal)
dynamo.sglangworker that callstorch.cuda.empty_cache()+ explicit CUDA context destroy before exit. Then the fault-tolerance tests SIGTERM first (graceful grace period), SIGKILL only if the worker ignores SIGTERM. No kernel-level leak.--forkedso the pytest subprocess holding the CUDA context also dies, releasing the driver's reference.Test-infra patches (cheaper / tactical)
nvidia-smi --gpu-reset -i 0after the fault-tolerance batch (requires root/MIG permissions in the CI runner; may not be allowed).fault_tolerance/**first in its own session,router/**after in a fresh container. Separate CUDA contexts, no cross-contamination.--dist=loadscopebecause loadscope only groups same-module tests; cross-module order isn't guaranteed.Today's minimum (this PR)
Keep the narrow nightly skip on
test_sglang_indexers_syncunder DYN-2784. Pairs with #8441's existing skip onmultiple_workers. Zero risk to PR/post-merge runs since they already weren't running these reliably anyway.Attached debug commit (will be reverted before merge)
a5abf34ca1comments out all four@pytest.mark.skipdecorators in this file (including the one this PR adds). That was used to let PR CI observe whether the hang reproduces outside nightly. Result: PR pre-merge passed all unskipped router tests (3/3 on single-GPU cuda12.9 + cuda13.0), multi-GPU failed ontest_router_decisions_sglang_dpwith the unrelated DYN-2265 error. Confirms the router tests themselves are not broken — only the nightly environment contaminates them.Once data is collected, revert
a5abf34ca1to land onlycb9fd10799(the actual skip).Test plan
a5abf34ca1before merge, leaving just thetest_sglang_indexers_syncskipdynamo.sglang— the real fix for DYN-2784🤖 Generated with Claude Code