ci: update trigger token#2
Closed
nv-anants wants to merge 2 commits into
Closed
Conversation
This was referenced Aug 1, 2025
Merged
elyasmnvidian
added a commit
that referenced
this pull request
Aug 13, 2025
elyasmnvidian
added a commit
that referenced
this pull request
Aug 13, 2025
elyasmnvidian
added a commit
that referenced
this pull request
Aug 13, 2025
elyasmnvidian
added a commit
that referenced
this pull request
Sep 23, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian
added a commit
that referenced
this pull request
Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian
added a commit
that referenced
this pull request
Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian
added a commit
that referenced
this pull request
Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian
added a commit
that referenced
this pull request
Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian
added a commit
that referenced
this pull request
Sep 30, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
keivenchang
added a commit
that referenced
this pull request
Feb 5, 2026
Changes: - Move wait_until_events_processed to BEFORE sending requests #2 and #3 (was AFTER each request, causing stale cache view during routing) - Add DIS-1406/analyze_test_errors.py script to categorize A1/A2/A3/A4 errors - Remove unnecessary sleep delays before first request attempt - Reorganize reproduction materials into DIS-1406/ directory Fix13 improves from Fix12 (38/50 → 39/50 passed, 24% → 22% A1 errors) but still shows ~22% A1 failures due to non-deterministic dp_rank selection.
ryanolson
added a commit
that referenced
this pull request
Feb 16, 2026
4 tasks
tedzhouhk
added a commit
that referenced
this pull request
Apr 18, 2026
Adds TestQwen235BugReproCase with the exact two picks from the original triage report: * Prefill: tp=4, dp=1, moe_tp=1, moe_ep=4 * Decode: tp=1, dp=8, moe_tp=2, moe_ep=4 Each test pins down a specific symptom that the old profiler path produced: * test_prefill_pick_satisfies_aic_identity — Bug #1 (missing moe kwargs) and Bug #2 (using .tp_size property) would break the identity for this pick. Verifies all five kwargs and the MoE identity. * test_decode_pick_satisfies_aic_identity — same for the decode pick, which was the exact shape that triggered Bug #3 (missing attention_dp_size). * test_tp_size_property_differs_from_aic_tp_size — direct regression guard: documents that the property returns 1 here while AIC's tp_size is 4. Anyone who wires .tp_size back into AIC kwargs will break this. * test_end_to_end_sweep_delivers_complete_kwargs_to_aic — runs both sweeps and asserts every AIC call gets complete kwargs. This is the direct reproducer of the originally-reported failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
ranrubin
added a commit
that referenced
this pull request
Apr 20, 2026
Fixes all actionable items from the second review: Bug fixes: - #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED) - #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original) Test quality: - #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE) - #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure) Design / correctness: - #3: Fix _models_dir_env docstring ("exactly once" → "once per worker") - #4: Add comment noting TRANSFORMERS_CACHE deprecation - #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare - #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture - #11: Raise hub/ detection log to WARNING with guidance - #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred — complex due to conftest dependencies, low severity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin
added a commit
that referenced
this pull request
Apr 20, 2026
Bug fixes: - #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache - #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout (bare-layout path was missing the writable-dir check present in HF_HOME test) Minor cleanups: - #4: Move `import pytest` from inside download_lora() to module top-level (lora_utils.py is test-only infra; pytest is always available) - #5: Replace pytestconfig.getoption("--models-dir") re-checks in predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR") (_models_dir_env runs first and sets the var; single source of truth) New coverage tests: - #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test verifying pytest_configure exits with returncode=2 on bad path - #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora() raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set Not addressed: #3 (keep gpu_0 per project guidelines and previous review retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>
nv-tusharma
added a commit
that referenced
this pull request
Apr 21, 2026
Follow-up on the PR #8415 repro. The reproduction showed two problems in how multi-worker sglang tests handle startup failures: 1. Both workers in SGLangProcess share log_dir = request.node.name, so their stdout/stderr race into a single python3.log.txt. When worker #2 dies during init, worker #1's later output clobbers the crash reason and we cannot tell why worker #2 failed. 2. ManagedEngineProcessMixin.__enter__ launches workers sequentially with a 5 s init_delay but never calls proc.poll(), so a dead worker is indistinguishable from a healthy one. The test then blocks in KvRouter(endpoint, ...) (see tests/router/common.py:1993) waiting for min_initial_workers=2 instances that will never register -- manifesting as the 5 h nightly hang. Changes: - tests/router/test_router_e2e_with_sglang.py: give each worker its own log_dir subdirectory (worker_<idx>) so crash output is preserved. - tests/router/e2e_harness.py: add _assert_workers_alive() that polls every launched worker after delayed_start and after init_delay, and raise RuntimeError with the log path if any worker exited during startup. Fails the test in ~seconds instead of hanging for hours. - .github/workflows/pr.yaml: extend the sglang-hang-repro diagnostic step to tail each worker's python3.log.txt so the next repro captures worker #2's death reason directly. DYN-2784: https://linear.app/nvidia/issue/DYN-2784 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
nv-tusharma
added a commit
that referenced
this pull request
Apr 21, 2026
Root cause for the nightly hang in test_router_decisions_sglang_multiple_workers[tcp]: The sglang-hang-repro job on PR #8415 captured worker #2's final stderr via the per-worker log_dir introduced in the previous commit. Worker #2 consistently dies at startup with: RuntimeError: Not enough memory. Please try to increase --mem-fraction-static. Current value: mem_fraction_static=0.4 On sglang 0.5.10 two workers sharing one GPU at mem_fraction_static=0.4 (2 * 0.4 = 0.8 of VRAM for KV cache) do not leave enough headroom for CUDA context, library init, PyTorch caching allocator overhead, and model weights on top. sglang 0.5.10 bumped the baseline footprint beyond what the previous 0.4 + --disable-piecewise-cuda-graph combination could absorb, and worker #2 OOMs during scheduler init. pytest then blocks forever in KvRouter(...) waiting for min_initial_workers=2 instances. Drop mem_fraction_static from 0.4 to 0.3 and context_length from 1024 to 512 so two workers fit comfortably: - KV cache: 2 * 0.3 = 0.6 of VRAM - Overhead: remaining 0.4 is plenty for two CUDA contexts + runtime Keep the per-worker log dirs, the proc.poll() fast-fail, and the diagnostic step added earlier in this PR -- they made root cause diagnosis possible and will catch future regressions in minutes instead of the 5h workflow timeout. DYN-2784: https://linear.app/nvidia/issue/DYN-2784 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
15 tasks
nv-tusharma
added a commit
that referenced
this pull request
Apr 22, 2026
With PR #8441 (skip test_router_decisions_sglang_multiple_workers) merged, the 2026-04-22 nightly shows the hang has shifted to test_sglang_indexers_sync — the next unskipped test in the file. pytest never emits a header for it, so the hang is in fixture setup (worker #2 dies in SGLangProcess launch, KvRouter then blocks forever on min_initial_workers=2; pytest.mark.timeout is swallowed at a C-level syscall). Jobs run 5h06m and hit GitHub's 6h runner cap. The test passes reliably in pre_merge/post_merge — it only wedges in nightly, where the broader test suite accumulates state that makes worker #2 death more likely. Scope the skip accordingly: * Introduce a `skip_in_nightly` pytest marker (registered in pyproject.toml) for tests that should be excluded from nightly only. * Mark test_sglang_indexers_sync with it in place of the previous unconditional skip. * Extend sglang's single_gpu_test_markers in nightly-ci.yml with `and not skip_in_nightly` so the marker is honored. Scoped to sglang single-GPU since that's where the hang lives; vllm/trtllm filters are left untouched and other matrices can opt in when needed. PR #8468 is open with the same skip but is currently failing its sglang Multi-GPU checks on cuda12.9/13.0. Landing this unblocks the nightly without waiting for that PR. See DYN-2784 for the full root-cause chain and reproduction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
4 tasks
yifjiang
added a commit
to yifjiang/dynamo
that referenced
this pull request
May 4, 2026
…code-waiter cleanup Addresses two correctness issues raised by dynamo-ops review on PR 8965. 1. **Race on PrefillReady re-insert** (model_manager.rs) Both `register_prefill_router` and `activate_prefill_router` previously used a `remove → process → insert` pattern on `prefill_router_activators`. Between the `remove` and the `insert` the entry is gone from the map, so a concurrent `remove_prefill_activator` call (called by the watcher on prefill-component teardown) hits the empty map, skips its cleanup, and we then re-insert a stale `PrefillReady` for a prefill that's already gone. The next decode rebuild's lookup finds the resurrected cache and activates a `PrefillRouter` against a dead endpoint. Refactor both functions to use DashMap's `Entry` API. The shard lock is held for the duration of the OccupiedEntry, so any concurrent `remove_prefill_activator` serializes after us and observes the entry it needs to clear. The PrefillReady-consume path in `register_prefill_router` now reads-and-clones in place without removing; the DecodeWaiting/PrefillReady → PrefillReady transition in `activate_prefill_router` uses `OccupiedEntry::insert(...)` for an atomic value swap that returns the old value. 2. **Decode-waiter cleanup gated on `removed.is_some()`** (watcher.rs) `handle_delete_helper`'s decode branch only called `remove_decode_prefill_waiter` when `remove_worker_set` returned `Some(_)`. If decode registered (creating a `DecodeWaiting` activator entry via `register_prefill_router`) but `handle_add_helper` failed later (e.g., on `kv_chooser_for`, monitor setup, or `build_routed_pipeline`) before `add_worker_set`, no WorkerSet ever landed in the manager — and on later teardown `remove_worker_set` returns `None`, leaving the stale `DecodeWaiting` entry orphaned in the activator map. Drop the `removed.is_some()` guard. The helper itself is state-safe (uses `DashMap::remove_if(|_, v| matches!(v, DecodeWaiting(_)))`) so calling it on a key that's vacant or holds `PrefillReady` is a no-op. The prefill-teardown branch keeps its `removed.is_some()` guard because `remove_prefill_activator` blanket-removes both states, and calling it without a real WorkerSet teardown could orphan a live decode-side `DecodeWaiting`. Behavior is unchanged for all four reproducers (ai-dynamo#1 burst-sweep canary OFF, ai-dynamo#2 Pattern 1 multi-decode restart, ai-dynamo#3 Pattern 2 stale-DecodeWaiting, ai-dynamo#4 burst-sweep canary ON); will re-verify on a fresh build.
MatejKosec
added a commit
that referenced
this pull request
May 10, 2026
…Bug A) Why --- Commit 3a7b775 fixed `VeloSession::close()` so that it drains BOTH `available_pins` (the `Mutex<BTreeMap<SequenceHash, ImmutableBlock<G2>>>` that holds strong refs to peer-puller-bound G2 blocks) AND `inbound_pulls` (the `DashMap<u64, Vec<SequenceHash>>` that authorizes peer pulls). Without that drain, the prefill-side `ManagedBlockPool::reset()` failed the two-request smoke with `Reset pool count mismatch: expected 1089, got 1086` whenever the puller errored before emitting `Frame::PullAck`. The fix shipped with `close_drains_unacked_holder_pins`, but that test asserts only `test_available_pin_count() == 0` after `close()`. A future refactor that deletes `self.inner.inbound_pulls.clear();` while keeping `self.inner.available_pins.lock().clear();` would silently pass the existing test but reintroduce the asymmetric protocol-state leak. What ---- New regression test `close_drains_inbound_pulls_and_releases_pool_blocks` that pins the invariant on three INDEPENDENT observable axes: 1. `test_available_pin_count() == 0` after `close()` 2. `test_inbound_pulls_count() == 0` after `close()` (NEW accessor, mirrors the existing `test_available_pin_count`) 3. `g2_manager.available_blocks() == g2_manager.total_blocks()` after the local `ImmutableBlock` handles drop, AND `g2_manager.reset_inactive_pool()` succeeds — the same shape as the production `BlockPoolError::ResetError` failure. The test forges TWO inbound `Frame::Pull`s (so `inbound_pulls` has two entries) and never injects `Frame::PullAck`, simulating the recompute-policy peer abort. It asserts non-empty mid-state on both maps before close, then full drain plus pool-parity after close. Two intermediate sanity checks guard against false positives: * `make_available` actually pins 3 blocks (not 0) * `g2_manager.available_blocks() == total_blocks - 3` mid-test (so the pool-parity assertion is measuring real strong-ref drop) Field-type note --------------- `SequenceHash` is `PositionalLineageHash(u128)` — a plain hash value, not a strong block ref. So clearing `inbound_pulls` alone does not itself release G2 blocks; the load-bearing strong-ref drain is on `available_pins`. But the `Frame::Pull`/`PullAck` protocol contract requires both maps to drain in lockstep (see the `Frame::PullAck` arm in `dispatch_frame`); an asymmetric drain at `close()` leaves the session with stale authorize-but-unacked tracking that any future addition of a strong-ref-bearing field to `inbound_pulls` (e.g. for backpressure) would silently leak. Axis #2 is checkable today via the new accessor, so the contract is locked in now. Validation (computelab-build-5, x86_64, dev profile) --------- * `cargo check -p kvbm-engine --tests` — clean (Finished in 29.54s) * `cargo test -p kvbm-engine --test velo_session_loopback` — all 9 tests pass (the new test plus 8 pre-existing siblings, including `close_drains_unacked_holder_pins`) * Negative path A — `close()` body with BOTH drain lines reverted (the literal pre-fix state): assertion `left == right` failed: close() must drain `available_pins` left: 3 right: 0 at velo_session_loopback.rs:723 (Axis #1 fires — production-shape failure of the original bug.) * Negative path B — `close()` body with ONLY `self.inner.inbound_pulls.clear();` reverted (asymmetric drain): assertion `left == right` failed: close() must drain `inbound_pulls`; otherwise authorized- but-unacked pull entries keep `Vec<SequenceHash>` mirrors alive and the corresponding G2 blocks leak past pool reset left: 2 right: 0 at velo_session_loopback.rs:728 (Axis #2 fires — proves the new test independently guards the inbound_pulls drain that the sibling test does not cover.) Also adds `VeloSession::test_inbound_pulls_count()` under the existing `#[cfg(any(test, feature = "testing"))]` gate, mirroring the existing `test_available_pin_count()`. No new dependencies; no production-path unwrap/expect/panic. Signed-off-by: Matej Kosec <mkosec@nvidia.com>
furionw
added a commit
that referenced
this pull request
May 12, 2026
…later Reader-first ordering. Old order buried Quick start at position 8 of 9; new order surfaces the runnable commands above all the reference sections (FSx vs EBS, label conventions, aiperf SHA rationale). New section order: 1. Title + 3-config table 2. Pre-requisites (was #3) 3. Quick start (was #8 — promoted) 4. Directory layout (was #7 — now serves as map for the rest) 5. Hardware targets (was #2 — now pure reference; invocation examples moved into Quick start) 6. Storage (was #5) 7. aiperf install (was #6) 8. Naming & ownership (was #4) 9. Notes (unchanged) Also drop the stray "We use ebs by default" sentence — it contradicted both the Storage section and the actual yaml (where the PVC block is fully commented out, no default storage class is set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng
added a commit
that referenced
this pull request
May 12, 2026
…proportional_clamp_pair PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling beyond original demand (Open Question #2). The rebase took main's #9294 refactor that delegates to the new budget.proportional_clamp_pair helper. That helper distributes the remaining budget proportionally and can over-allocate decode in the asymmetric (p_gpu != d_gpu) case. Re-introduce the clamp at the call site rather than in the shared helper, so we don't change the contract for other callers (e.g. SLA-mode scale-down consolidation, also using the helper). Fixes: test_decode_never_exceeds_input_when_prefill_binding Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng
added a commit
that referenced
this pull request
May 12, 2026
…proportional_clamp_pair PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling beyond original demand (Open Question #2). The rebase took main's #9294 refactor that delegates to the new budget.proportional_clamp_pair helper. That helper distributes the remaining budget proportionally and can over-allocate decode in the asymmetric (p_gpu != d_gpu) case. Re-introduce the clamp at the call site rather than in the shared helper, so we don't change the contract for other callers (e.g. SLA-mode scale-down consolidation, also using the helper). Fixes: test_decode_never_exceeds_input_when_prefill_binding Signed-off-by: Kai Ma <kaim@nvidia.com>
tanmayv25
added a commit
that referenced
this pull request
May 16, 2026
…cycle Address senior-architect review of the metrics infrastructure: #1 (two parallel Prometheus paths): merge `register_prometheus` + `setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings` method on the Rust `LLMEngine` trait. The trait surface shrinks from 7 → 6 methods; both expfmt bridging and the structured component publisher go through one place. The Python ABC keeps its two methods — the PyO3 bridge adapts (engine-author API unchanged for Python). #3 + #8 (framework lifecycle gauges gated on engine opt-in): - `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side `LifecycleGauges` (prometheus crate gauges registered via the runtime's `MetricsRegistry`). Worker constructs them after `engine.start()` succeeds; observes during cleanup/drain. Emits regardless of engine opt-in. - Drop `set_cleanup_time` / `set_drain_time` from the `ComponentMetricsPublisher` trait — engine no longer responsible. - Drop the corresponding methods from `PyComponentMetricsPublisher`. - `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for parity with the legacy entry points (both legacy `main.py` and the unified bridge populate it). The unified bridge now constructs `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when the engine returns no per-rank sources. #5 (kv_cache_hit_rate Optional semantics): clearer docstring on both Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly documented (`None` = no data, `0.0` = legitimate zero-hit). Deferred with rationale (explained in the post-review architect's note): - #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing but no current bug; 5 fields for 5 signals isn't worth the refactor cost yet. Revisit at ~10+ signals. - #4 (per-source cadence): explicit defer in original review. - #7 (conformance in CI): workflow change, separate from trait surface. Conformance kit collapses `check_register_prometheus` + `check_component_metrics` into `check_setup_metrics`. All 68 backend-common unit tests pass. Mocker example updated. PyO3 bridge wheel rebuilt successfully. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng
added a commit
that referenced
this pull request
May 18, 2026
…proportional_clamp_pair PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling beyond original demand (Open Question #2). The rebase took main's #9294 refactor that delegates to the new budget.proportional_clamp_pair helper. That helper distributes the remaining budget proportionally and can over-allocate decode in the asymmetric (p_gpu != d_gpu) case. Re-introduce the clamp at the call site rather than in the shared helper, so we don't change the contract for other callers (e.g. SLA-mode scale-down consolidation, also using the helper). Fixes: test_decode_never_exceeds_input_when_prefill_binding Signed-off-by: Kai Ma <kaim@nvidia.com>
8 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does the PR do?
Checklist
<commit_type>: <Title>Commit Type:
Check the conventional commit type
box here and add the label to the github PR.
Related PRs:
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)