test: skip test_sglang_indexers_sync in nightly only (DYN-2784) #8539
Merged
/ok to test ba734dc
With PR #8441 (skip test_router_decisions_sglang_multiple_workers) merged, the 2026-04-22 nightly shows the hang has shifted to test_sglang_indexers_sync, the next unskipped test in the file. pytest never emits a header for it, so the hang is in fixture setup (worker #2 dies in SGLangProcess launch, KvRouter then blocks forever on min_initial_workers=2; pytest.mark.timeout is swallowed at a C-level syscall). Jobs run 5h06m and hit GitHub's 6h runner cap.

The test passes reliably in pre_merge/post_merge; it only wedges in nightly, where the broader test suite accumulates state that makes worker #2 death more likely. Scope the skip accordingly:

* Introduce a `skip_in_nightly` pytest marker (registered in pyproject.toml) for tests that should be excluded from nightly only.
* Mark test_sglang_indexers_sync with it in place of the previous unconditional skip.
* Extend sglang's single_gpu_test_markers in nightly-ci.yml with `and not skip_in_nightly` so the marker is honored.

Scoped to sglang single-GPU since that's where the hang lives; vllm/trtllm filters are left untouched and other matrices can opt in when needed.

PR #8468 is open with the same skip but is currently failing its sglang Multi-GPU checks on cuda12.9/13.0. Landing this unblocks the nightly without waiting for that PR.

See DYN-2784 for the full root-cause chain and reproduction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
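The failure mode in the commit message (a worker that dies during launch while setup carries on as if it were alive) can be illustrated with a small fail-fast launcher. This is only a sketch under stated assumptions: it is not the project's `SGLangProcess` or fixture code, and `launch_workers`, `WorkerDiedError`, and `settle_seconds` are hypothetical names. The only real API it relies on is `subprocess.Popen.poll()`, which returns `None` while a child is still running.

```python
import subprocess
import sys
import time


class WorkerDiedError(RuntimeError):
    """Hypothetical error: a worker exited before setup finished."""


def launch_workers(commands, settle_seconds=0.5):
    """Start one process per command; fail fast if any exits during startup."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    deadline = time.monotonic() + settle_seconds
    try:
        while True:
            for index, proc in enumerate(procs):
                code = proc.poll()  # None while the child is still running
                if code is not None:
                    raise WorkerDiedError(
                        f"worker #{index + 1} exited with code {code}"
                    )
            if time.monotonic() >= deadline:
                return procs
            time.sleep(0.05)
    except BaseException:
        # Do not leave orphans behind when setup fails.
        for proc in procs:
            if proc.poll() is None:
                proc.kill()
            proc.wait()
        raise


if __name__ == "__main__":
    healthy = [sys.executable, "-c", "import time; time.sleep(5)"]
    dying = [sys.executable, "-c", "import sys; sys.exit(3)"]
    try:
        launch_workers([healthy, dying], settle_seconds=3.0)
    except WorkerDiedError as exc:
        print(f"setup aborted: {exc}")
```

With a check of this kind, a dead second worker surfaces as an immediate setup error instead of a router waiting indefinitely for a registration that never arrives. Whether the real fix takes this shape is tracked in DYN-2784, not here.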
ba734dc to 0dc7f84
dmitry-tokarev-nv approved these changes on Apr 22, 2026
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
Overview:
Skip `tests/router/test_router_e2e_with_sglang.py::test_sglang_indexers_sync` in nightly only so the nightly `sglang-pipeline / Test (amd64)` matrix stops burning 5+ hours on a known root-caused hang (DYN-2784). The test continues to run, and pass, in pre_merge and post_merge CI.

Parallel to #8468, which does an unconditional skip for the same test under DYN-2784 but is currently red on sglang Multi-GPU checks. Whichever lands first unblocks the next nightly; this one is scoped narrower and stays green in the other pipelines.
Details:
With #8441 (skip `test_router_decisions_sglang_multiple_workers`) merged yesterday, today's nightly (run 24769044442) shows the hang has shifted downstream by one test, exactly as DYN-2784's root-cause chain predicts:

* `test_router_decisions_sglang_multiple_workers[tcp]` now reports `SKIPPED [96%]` (the prior skip is taking effect)
* `test_sglang_indexers_sync[nats_core]` (since `test_router_decisions_sglang_dp` is already skip-marked under DYN-2265)
* `Terminate orphan process: pid (580) (docker)` cleanup at ~14:11 UTC

Evidence:
Root cause (from DYN-2784): worker #2 dies in `SGLangProcess.__init__` launch; `ManagedEngineProcessMixin.__enter__` never `proc.poll()`s, so it proceeds as if both workers are alive; `KvRouter` then blocks forever waiting for `min_initial_workers=2` since only one registered. `@pytest.mark.timeout(150)` doesn't rescue because signal delivery is swallowed by a C-level syscall in the fixture. The test passes reliably in pre_merge/post_merge; it only wedges in nightly, where the broader test suite accumulates state that makes worker #2's death more likely.

Approach

Rather than an unconditional `@pytest.mark.skip`, introduce a narrow opt-in:

* `skip_in_nightly`, registered in `pyproject.toml`. Tests flagged with it are excluded from nightly but continue running in pre_merge and post_merge. Applied to `test_sglang_indexers_sync`.
* `.github/workflows/nightly-ci.yml`: sglang `single_gpu_test_markers` changed from `'sglang and gpu_1'` to `'sglang and gpu_1 and not skip_in_nightly'`. Scoped to sglang single-GPU, where the hang lives; vllm/trtllm filters are left untouched and other matrices can opt in if/when needed.

This keeps coverage in the PR/post-merge pipelines (where the test passes) while unblocking the nightly amd64 matrix.
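To make the effect of the filter change concrete, here is a small stdlib-only model of how a `-m` marker expression selects tests. It is a rough approximation of pytest's marker matching (names combined with `and`, `or`, `not`, and parentheses), not pytest's implementation. The marker set assigned to `test_sglang_indexers_sync` is assumed from the filters quoted above, and `test_some_other_sglang_case` is a hypothetical test added for contrast.

```python
import re

# Marker sets are illustrative assumptions, not copied from the repository.
TESTS = {
    "test_sglang_indexers_sync": {"sglang", "gpu_1", "skip_in_nightly"},
    "test_some_other_sglang_case": {"sglang", "gpu_1"},  # hypothetical test
}

OLD_NIGHTLY_FILTER = "sglang and gpu_1"
NEW_NIGHTLY_FILTER = "sglang and gpu_1 and not skip_in_nightly"

_KEYWORDS = {"and", "or", "not", "(", ")"}


def is_selected(marker_expr, markers):
    """Approximate `pytest -m`: a marker name is true if the test carries it."""
    tokens = re.findall(r"[()]|[A-Za-z_]\w*", marker_expr)
    translated = [t if t in _KEYWORDS else str(t in markers) for t in tokens]
    # Only True/False literals and boolean operators remain at this point.
    return eval(" ".join(translated), {"__builtins__": {}})


def select(marker_expr):
    """Return the names of the modelled tests that the expression selects."""
    return sorted(
        name for name, marks in TESTS.items() if is_selected(marker_expr, marks)
    )


if __name__ == "__main__":
    print("old nightly filter:", select(OLD_NIGHTLY_FILTER))
    print("new nightly filter:", select(NEW_NIGHTLY_FILTER))
```

In this model the old filter selects both tests, while the new one drops only the test carrying `skip_in_nightly`. A pipeline whose filter never mentions `skip_in_nightly` is unaffected by the marker, which is why pre_merge and post_merge keep running the test.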
Where should the reviewer start?
* `tests/router/test_router_e2e_with_sglang.py`: `@pytest.mark.skip_in_nightly` added above the existing decorators on `test_sglang_indexers_sync`, with a comment explaining the fixture-setup hang and the re-enable condition.
* `pyproject.toml`: `skip_in_nightly` registered in the `markers` list (required by `--strict-markers`).
* `.github/workflows/nightly-ci.yml`: one-line filter change on sglang's single-GPU matrix with an inline comment documenting the pattern.

Related Issues:
🤖 Generated with Claude Code