Skip to content

test: skip test_sglang_indexers_sync in nightly only (DYN-2784)#8539

Merged
nv-tusharma merged 4 commits into
mainfrom
tusharma/skip-sglang-indexers-sync-dyn-2784
Apr 23, 2026
Merged

test: skip test_sglang_indexers_sync in nightly only (DYN-2784)#8539
nv-tusharma merged 4 commits into
mainfrom
tusharma/skip-sglang-indexers-sync-dyn-2784

Conversation

@nv-tusharma
Copy link
Copy Markdown
Contributor

@nv-tusharma nv-tusharma commented Apr 22, 2026

Overview:

Skip tests/router/test_router_e2e_with_sglang.py::test_sglang_indexers_sync in nightly only so the nightly sglang-pipeline / Test (amd64) matrix stops burning 5+ hours on a known root-caused hang (DYN-2784). The test continues to run — and pass — in pre_merge and post_merge CI.

Parallel to #8468 which does an unconditional skip for the same test under DYN-2784 but is currently red on sglang Multi-GPU checks. Whichever lands first unblocks the next nightly; this one is scoped narrower and stays green in the other pipelines.

Details:

With #8441 (skip test_router_decisions_sglang_multiple_workers) merged yesterday, today's nightly (run 24769044442) shows the hang has shifted downstream by one test, exactly as DYN-2784's root-cause chain predicts:

  • test_router_decisions_sglang_multiple_workers[tcp] now reports SKIPPED [96%] (the prior skip is taking effect)
  • pytest never emits a header for the next unskipped test — the hang is in its fixture setup
  • The next unskipped test in file order is test_sglang_indexers_sync[nats_core] (since test_router_decisions_sglang_dp is already skip-marked under DYN-2265)
  • Jobs run 5h06m and hit GitHub's 6h runner cap, triggering the Terminate orphan process: pid (580) (docker) cleanup at ~14:11 UTC

Evidence:

Root cause (from DYN-2784): worker #2 dies in SGLangProcess.__init__ launch; ManagedEngineProcessMixin.__enter__ never proc.poll()s so it proceeds as if both workers are alive; KvRouter then blocks forever waiting for min_initial_workers=2 since only one registered. @pytest.mark.timeout(150) doesn't rescue because signal delivery is swallowed by a C-level syscall in the fixture. The test passes reliably in pre_merge/post_merge — it only wedges in nightly, where the broader test suite accumulates state that makes worker #2's death more likely.

Approach

Rather than an unconditional @pytest.mark.skip, introduce a narrow opt-in:

  1. New pytest marker skip_in_nightly — registered in pyproject.toml. Tests flagged with it are excluded from nightly but continue running in pre_merge and post_merge. Applied to test_sglang_indexers_sync.
  2. Nightly filter tweak.github/workflows/nightly-ci.yml sglang single_gpu_test_markers changed from 'sglang and gpu_1' to 'sglang and gpu_1 and not skip_in_nightly'. Scoped to sglang single-GPU, where the hang lives; vllm/trtllm filters are left untouched and other matrices can opt in if/when needed.

This keeps coverage in the PR/post-merge pipelines (where the test passes) while unblocking the nightly amd64 matrix.

Where should the reviewer start?

  • tests/router/test_router_e2e_with_sglang.py@pytest.mark.skip_in_nightly added above the existing decorators on test_sglang_indexers_sync, with a comment explaining the fixture-setup hang and the re-enable condition.
  • pyproject.tomlskip_in_nightly registered in the markers list (required by --strict-markers).
  • .github/workflows/nightly-ci.yml — one-line filter change on sglang's single-GPU matrix with an inline comment documenting the pattern.

Related Issues:

🤖 Generated with Claude Code

@nv-tusharma nv-tusharma requested review from a team as code owners April 22, 2026 19:48
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 22, 2026

Walkthrough

A pytest.mark.skip(...) decorator was added to the test_sglang_indexers_sync test function with reason "DYN-2784: hangs in fixture setup; see linked ticket", preventing the test from executing. An accompanying comment block was moved to appear immediately before the decorator stack.

Changes

Cohort / File(s) Summary
Test Skip Marker
tests/router/test_router_e2e_with_sglang.py
Added @pytest.mark.skip(...) decorator to test_sglang_indexers_sync and repositioned DYN-2784 comment block for clarity regarding fixture hang issue.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly describes the main change: skipping a specific test in nightly CI for a known issue (DYN-2784). It is concise, specific, and directly related to the changeset.
Description check ✅ Passed The description follows the template with all required sections (Overview, Details, Where should the reviewer start, Related Issues) comprehensively completed. It provides thorough context about the problem, root cause, approach, and implementation details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@nv-tusharma
Copy link
Copy Markdown
Contributor Author

/ok to test ba734dc

With PR #8441 (skip test_router_decisions_sglang_multiple_workers) merged,
the 2026-04-22 nightly shows the hang has shifted to
test_sglang_indexers_sync — the next unskipped test in the file. pytest
never emits a header for it, so the hang is in fixture setup (worker #2
dies in SGLangProcess launch, KvRouter then blocks forever on
min_initial_workers=2; pytest.mark.timeout is swallowed at a C-level
syscall). Jobs run 5h06m and hit GitHub's 6h runner cap.

The test passes reliably in pre_merge/post_merge — it only wedges in
nightly, where the broader test suite accumulates state that makes
worker #2 death more likely. Scope the skip accordingly:

* Introduce a `skip_in_nightly` pytest marker (registered in
  pyproject.toml) for tests that should be excluded from nightly only.
* Mark test_sglang_indexers_sync with it in place of the previous
  unconditional skip.
* Extend sglang's single_gpu_test_markers in nightly-ci.yml with
  `and not skip_in_nightly` so the marker is honored. Scoped to sglang
  single-GPU since that's where the hang lives; vllm/trtllm filters
  are left untouched and other matrices can opt in when needed.

PR #8468 is open with the same skip but is currently failing its sglang
Multi-GPU checks on cuda12.9/13.0. Landing this unblocks the nightly
without waiting for that PR.

See DYN-2784 for the full root-cause chain and reproduction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
@nv-tusharma nv-tusharma force-pushed the tusharma/skip-sglang-indexers-sync-dyn-2784 branch from ba734dc to 0dc7f84 Compare April 22, 2026 22:43
@nv-tusharma nv-tusharma changed the title test: skip test_sglang_indexers_sync (DYN-2784) test: skip test_sglang_indexers_sync in nightly only (DYN-2784) Apr 22, 2026
Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
@nv-tusharma nv-tusharma enabled auto-merge (squash) April 22, 2026 23:17
@nv-tusharma nv-tusharma merged commit b8e3293 into main Apr 23, 2026
89 checks passed
@nv-tusharma nv-tusharma deleted the tusharma/skip-sglang-indexers-sync-dyn-2784 branch April 23, 2026 05:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants