Skip to content

test(router): skip test_sglang_indexers_sync under DYN-2784#8468

Open
dmitry-tokarev-nv wants to merge 2 commits into
mainfrom
dtokarev/skip-sglang-indexers-sync-dyn-2784
Open

test(router): skip test_sglang_indexers_sync under DYN-2784#8468
dmitry-tokarev-nv wants to merge 2 commits into
mainfrom
dtokarev/skip-sglang-indexers-sync-dyn-2784

Conversation

@dmitry-tokarev-nv
Copy link
Copy Markdown
Contributor

@dmitry-tokarev-nv dmitry-tokarev-nv commented Apr 21, 2026

Summary

Skips tests/router/test_router_e2e_with_sglang.py::test_sglang_indexers_sync under DYN-2784, reusing the same ticket PR #8441 opened for test_router_decisions_sglang_multiple_workers — both are instances of the same root cause.

Why

PR #8441 skipped test_router_decisions_sglang_multiple_workers under DYN-2784 with the note:

worker #2 dies during startup, then KvRouter(endpoint, ...) blocks forever waiting for min_initial_workers=2 instances that will never register.

That file has three tests spinning up num_workers=2:

Line Test num_workers State
254 test_sglang_kv_router_basic 2 passes
285 test_router_decisions_sglang_multiple_workers 2 SKIPPED by #8441 (DYN-2784)
395 test_sglang_indexers_sync 2 now hanging after #8441 landed

After #8441 merged, PR #8441's own CI run (24739524988 / job 72377886646) got past the multiple_workers skip at [ 96%] and then sat for 80+ minutes in the next test (test_sglang_indexers_sync[nats_core]) with no further log output.

Local repro on gpu host — root cause diagnosed

Reproduced the hang locally on a single-GPU dev box (RTX 6000 Ada, 48 GiB) using nvcr.io/nvstaging/ai-dynamo/sglang-runtime:1.1.0rc0-cuda13 + container/deps/requirements.test.txt.

Findings

  1. Exact failure signature matches DYN-2784. py-spy dump on the stuck pytest shows MainThread stuck in asyncio.run_until_complete(test_sync()) at tests/router/common.py:1234, inside wait_for_workers_ready(expected_num_workers=2). Only 1 of the 2 expected sglang workers ever registers.

  2. The missing worker is OOM'd, not crashed. nvidia-smi during the hang shows GPU at 44.4 / 48.2 GiB used (92%) with only 1 live dynamo.sglang process. The other ~22 GiB is held by kernel-level CUDA contexts that outlive their (dead) owner processes.

  3. Source of the leak: @pytest.mark.nightly-only fault-tolerance tests (test_request_migration_sglang_aggregated[...worker_failure-...], test_request_cancellation_sglang_aggregated, etc.) use SIGKILL on their sglang workers as part of the test design. SIGKILL skips CUDA context teardown, so GPU memory is pinned at the driver level with no live owner. Zombie /proc/<pid> entries remain (12+ observed in one run) whose executables are gone.

Why it's nightly-only, not pre-merge / post-merge

PR pre-merge Post-merge Nightly
pytest -m filter "pre_merge and sglang and gpu_1" "post_merge and sglang and gpu_1" "sglang and gpu_1" (no gate)
Tests collected (sample) 33 similar 425
worker_failure / SIGKILL tests collected? ❌ all marked @pytest.mark.nightly only ❌ same
GPU state reaching router tests clean clean saturated with leaked CUDA contexts

So the same router tests pass cleanly in PR/post-merge and hang in nightly for the same reason: the nightly set includes SIGKILL-happy fault-tolerance tests that leak GPU memory before the router tests get their turn.

Suggested fixes (prioritised)

Source-side (ideal)

  • Install SIGTERM/SIGINT handler in dynamo.sglang worker that calls torch.cuda.empty_cache() + explicit CUDA context destroy before exit. Then the fault-tolerance tests SIGTERM first (graceful grace period), SIGKILL only if the worker ignores SIGTERM. No kernel-level leak.
  • For tests that must SIGKILL to reproduce a true crash: run those under --forked so the pytest subprocess holding the CUDA context also dies, releasing the driver's reference.

Test-infra patches (cheaper / tactical)

  • Add a session-scoped teardown that runs nvidia-smi --gpu-reset -i 0 after the fault-tolerance batch (requires root/MIG permissions in the CI runner; may not be allowed).
  • Split the nightly sglang job into two pytest invocations: fault_tolerance/** first in its own session, router/** after in a fresh container. Separate CUDA contexts, no cross-contamination.
  • Reorder test collection so router tests run before fault-tolerance tests. Unreliable under --dist=loadscope because loadscope only groups same-module tests; cross-module order isn't guaranteed.

Today's minimum (this PR)

Keep the narrow nightly skip on test_sglang_indexers_sync under DYN-2784. Pairs with #8441's existing skip on multiple_workers. Zero risk to PR/post-merge runs since they already weren't running these reliably anyway.

Attached debug commit (will be reverted before merge)

a5abf34ca1 comments out all four @pytest.mark.skip decorators in this file (including the one this PR adds). That was used to let PR CI observe whether the hang reproduces outside nightly. Result: PR pre-merge passed all unskipped router tests (3/3 on single-GPU cuda12.9 + cuda13.0), multi-GPU failed on test_router_decisions_sglang_dp with the unrelated DYN-2265 error. Confirms the router tests themselves are not broken — only the nightly environment contaminates them.

Once data is collected, revert a5abf34ca1 to land only cb9fd10799 (the actual skip).

Test plan

  • Keep the debug commit while we gather any further data from PR/post-merge CI
  • Revert a5abf34ca1 before merge, leaving just the test_sglang_indexers_sync skip
  • Follow up (separate PR/ticket) with source-side SIGTERM handler in dynamo.sglang — the real fix for DYN-2784

🤖 Generated with Claude Code

After #8441 skipped test_router_decisions_sglang_multiple_workers to
unblock the sglang nightly, the very next test with num_workers=2 —
test_sglang_indexers_sync[nats_core] — started hanging with the same
KvRouter / min_initial_workers=2 startup bug described in DYN-2784.

Example stuck CI job (PR #8441's own CI after its skip landed):
https://github.com/ai-dynamo/dynamo/actions/runs/24739524988/job/72377886646
— last log line is the multiple_workers skip at [ 96%], then >80 min
of no output in test_sglang_indexers_sync.

All three num_workers=2 tests in this file share the same startup
path; DYN-2784 already tracks the root cause. Mark this one with the
same skip pending the real fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

Walkthrough

Added a @pytest.mark.skip decorator to test_sglang_indexers_sync in the test router module, documented with Linear issue DYN-2784, to skip the test due to a hang condition involving worker startup and KvRouter initialization requirements.

Changes

Cohort / File(s) Summary
Test Skip Marker
tests/router/test_router_e2e_with_sglang.py
Added @pytest.mark.skip decorator to test_sglang_indexers_sync test, citing DYN-2784 issue reason related to hang condition with num_workers=2 and KvRouter minimum worker requirements.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: adding a skip marker to test_sglang_indexers_sync referencing the DYN-2784 issue.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request provides a detailed description with Overview, rationale, local reproduction findings, and suggested fixes, clearly exceeding the template requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/router/test_router_e2e_with_sglang.py (1)

263-269: ⚠️ Potential issue | 🟡 Minor

Add a timeout to the sibling DYN-2784 test before removing the skips.

test_router_decisions_sglang_multiple_workers is also documented as a known min_initial_workers=2 hang path, but unlike test_sglang_indexers_sync, it has no @pytest.mark.timeout(...). Adding the same guard now would prevent reintroducing an indefinite CI hang when both DYN-2784 skips are removed together. As per coding guidelines, “If the test (or any nearby test) can hang, it must be guarded with @pytest.mark.timeout(...).”

Proposed guard
 `@pytest.mark.pre_merge`
 `@pytest.mark.gpu_1`
 `@pytest.mark.parametrize`("request_plane", ["tcp"], indirect=True)
+@pytest.mark.timeout(150)
 def test_router_decisions_sglang_multiple_workers(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/router/test_router_e2e_with_sglang.py` around lines 263 - 269, The test
test_router_decisions_sglang_multiple_workers can hang due to the
min_initial_workers=2 startup issue; add a pytest timeout decorator to it (e.g.,
`@pytest.mark.timeout`(<same_value_used_in_test_sglang_indexers_sync>)) so it is
guarded the same way as test_sglang_indexers_sync and prevents indefinite CI
hangs; place the decorator above the
test_router_decisions_sglang_multiple_workers declaration alongside the existing
marks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/router/test_router_e2e_with_sglang.py`:
- Around line 263-269: The test test_router_decisions_sglang_multiple_workers
can hang due to the min_initial_workers=2 startup issue; add a pytest timeout
decorator to it (e.g.,
`@pytest.mark.timeout`(<same_value_used_in_test_sglang_indexers_sync>)) so it is
guarded the same way as test_sglang_indexers_sync and prevents indefinite CI
hangs; place the decorator above the
test_router_decisions_sglang_multiple_workers declaration alongside the existing
marks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e212c551-5b26-4548-82e1-f90023db73e9

📥 Commits

Reviewing files that changed from the base of the PR and between cc583b2 and cb9fd10.

📒 Files selected for processing (1)
  • tests/router/test_router_e2e_with_sglang.py

…rvation

Temporarily disables every @pytest.mark.skip in
tests/router/test_router_e2e_with_sglang.py (including the one this
PR was originally adding for test_sglang_indexers_sync under
DYN-2784) so PR and post-merge CI runs surface whether the same
hangs that break the nightly also happen here.

Tests affected:
- test_router_decisions_sglang_multiple_workers (DYN-2784)
- test_router_decisions_sglang_dp (DYN-2265)
- test_router_decisions_sglang_disagg (DYN-2603)
- test_sglang_indexers_sync (DYN-2784)

Do NOT merge this debug commit — revert before landing the skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
nv-tusharma added a commit that referenced this pull request Apr 22, 2026
With PR #8441 (skip test_router_decisions_sglang_multiple_workers) merged,
the 2026-04-22 nightly shows the hang has shifted to
test_sglang_indexers_sync — the next unskipped test in the file. pytest
never emits a header for it, so the hang is in fixture setup (worker #2
dies in SGLangProcess launch, KvRouter then blocks forever on
min_initial_workers=2; pytest.mark.timeout is swallowed at a C-level
syscall). Jobs run 5h06m and hit GitHub's 6h runner cap.

The test passes reliably in pre_merge/post_merge — it only wedges in
nightly, where the broader test suite accumulates state that makes
worker #2 death more likely. Scope the skip accordingly:

* Introduce a `skip_in_nightly` pytest marker (registered in
  pyproject.toml) for tests that should be excluded from nightly only.
* Mark test_sglang_indexers_sync with it in place of the previous
  unconditional skip.
* Extend sglang's single_gpu_test_markers in nightly-ci.yml with
  `and not skip_in_nightly` so the marker is honored. Scoped to sglang
  single-GPU since that's where the hang lives; vllm/trtllm filters
  are left untouched and other matrices can opt in when needed.

PR #8468 is open with the same skip but is currently failing its sglang
Multi-GPU checks on cuda12.9/13.0. Landing this unblocks the nightly
without waiting for that PR.

See DYN-2784 for the full root-cause chain and reproduction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants