Skip to content

ci: update trigger token#2

Closed
nv-anants wants to merge 2 commits into
mainfrom
anants/update-ci
Closed

ci: update trigger token#2
nv-anants wants to merge 2 commits into
mainfrom
anants/update-ci

Conversation

@nv-anants
Copy link
Copy Markdown
Member

What does the PR do?

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@nv-anants nv-anants closed this Mar 4, 2025
elyasmnvidian added a commit that referenced this pull request Aug 13, 2025
elyasmnvidian added a commit that referenced this pull request Aug 13, 2025
elyasmnvidian added a commit that referenced this pull request Aug 13, 2025
athreesh added a commit that referenced this pull request Aug 18, 2025
elyasmnvidian added a commit that referenced this pull request Sep 23, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 26, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 30, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
keivenchang added a commit that referenced this pull request Feb 5, 2026
Changes:
- Move wait_until_events_processed to BEFORE sending requests #2 and #3
  (was AFTER each request, causing stale cache view during routing)
- Add DIS-1406/analyze_test_errors.py script to categorize A1/A2/A3/A4 errors
- Remove unnecessary sleep delays before first request attempt
- Reorganize reproduction materials into DIS-1406/ directory

Fix13 improves from Fix12 (38/50 → 39/50 passed, 24% → 22% A1 errors)
but still shows ~22% A1 failures due to non-deterministic dp_rank selection.
tedzhouhk added a commit that referenced this pull request Apr 18, 2026
Adds TestQwen235BugReproCase with the exact two picks from the original
triage report:

* Prefill: tp=4, dp=1, moe_tp=1, moe_ep=4
* Decode: tp=1, dp=8, moe_tp=2, moe_ep=4

Each test pins down a specific symptom that the old profiler path
produced:

* test_prefill_pick_satisfies_aic_identity — Bug #1 (missing moe kwargs)
  and Bug #2 (using .tp_size property) would break the identity for
  this pick. Verifies all five kwargs and the MoE identity.
* test_decode_pick_satisfies_aic_identity — same for the decode pick,
  which was the exact shape that triggered Bug #3 (missing
  attention_dp_size).
* test_tp_size_property_differs_from_aic_tp_size — direct regression
  guard: documents that the property returns 1 here while AIC's tp_size
  is 4. Anyone who wires .tp_size back into AIC kwargs will break this.
* test_end_to_end_sweep_delivers_complete_kwargs_to_aic — runs both
  sweeps and asserts every AIC call gets complete kwargs. This is the
  direct reproducer of the originally-reported failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
nv-tusharma added a commit that referenced this pull request Apr 21, 2026
Follow-up on the PR #8415 repro. The reproduction showed two problems
in how multi-worker sglang tests handle startup failures:

1. Both workers in SGLangProcess share log_dir = request.node.name, so
   their stdout/stderr race into a single python3.log.txt. When worker
   #2 dies during init, worker #1's later output clobbers the crash
   reason and we cannot tell why worker #2 failed.

2. ManagedEngineProcessMixin.__enter__ launches workers sequentially
   with a 5 s init_delay but never calls proc.poll(), so a dead worker
   is indistinguishable from a healthy one. The test then blocks in
   KvRouter(endpoint, ...) (see tests/router/common.py:1993) waiting
   for min_initial_workers=2 instances that will never register --
   manifesting as the 5 h nightly hang.

Changes:

- tests/router/test_router_e2e_with_sglang.py: give each worker its own
  log_dir subdirectory (worker_<idx>) so crash output is preserved.
- tests/router/e2e_harness.py: add _assert_workers_alive() that polls
  every launched worker after delayed_start and after init_delay, and
  raise RuntimeError with the log path if any worker exited during
  startup. Fails the test in ~seconds instead of hanging for hours.
- .github/workflows/pr.yaml: extend the sglang-hang-repro diagnostic
  step to tail each worker's python3.log.txt so the next repro captures
  worker #2's death reason directly.

DYN-2784: https://linear.app/nvidia/issue/DYN-2784

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nv-tusharma added a commit that referenced this pull request Apr 21, 2026
Root cause for the nightly hang in test_router_decisions_sglang_multiple_workers[tcp]:

The sglang-hang-repro job on PR #8415 captured worker #2's final
stderr via the per-worker log_dir introduced in the previous commit.
Worker #2 consistently dies at startup with:

  RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
  Current value: mem_fraction_static=0.4

On sglang 0.5.10 two workers sharing one GPU at mem_fraction_static=0.4
(2 * 0.4 = 0.8 of VRAM for KV cache) do not leave enough headroom for
CUDA context, library init, PyTorch caching allocator overhead, and
model weights on top. sglang 0.5.10 bumped the baseline footprint beyond
what the previous 0.4 + --disable-piecewise-cuda-graph combination could
absorb, and worker #2 OOMs during scheduler init. pytest then blocks
forever in KvRouter(...) waiting for min_initial_workers=2 instances.

Drop mem_fraction_static from 0.4 to 0.3 and context_length from 1024
to 512 so two workers fit comfortably:
  - KV cache: 2 * 0.3 = 0.6 of VRAM
  - Overhead: remaining 0.4 is plenty for two CUDA contexts + runtime

Keep the per-worker log dirs, the proc.poll() fast-fail, and the
diagnostic step added earlier in this PR -- they made root cause
diagnosis possible and will catch future regressions in minutes
instead of the 5h workflow timeout.

DYN-2784: https://linear.app/nvidia/issue/DYN-2784

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nv-tusharma added a commit that referenced this pull request Apr 22, 2026
With PR #8441 (skip test_router_decisions_sglang_multiple_workers) merged,
the 2026-04-22 nightly shows the hang has shifted to
test_sglang_indexers_sync — the next unskipped test in the file. pytest
never emits a header for it, so the hang is in fixture setup (worker #2
dies in SGLangProcess launch, KvRouter then blocks forever on
min_initial_workers=2; pytest.mark.timeout is swallowed at a C-level
syscall). Jobs run 5h06m and hit GitHub's 6h runner cap.

The test passes reliably in pre_merge/post_merge — it only wedges in
nightly, where the broader test suite accumulates state that makes
worker #2 death more likely. Scope the skip accordingly:

* Introduce a `skip_in_nightly` pytest marker (registered in
  pyproject.toml) for tests that should be excluded from nightly only.
* Mark test_sglang_indexers_sync with it in place of the previous
  unconditional skip.
* Extend sglang's single_gpu_test_markers in nightly-ci.yml with
  `and not skip_in_nightly` so the marker is honored. Scoped to sglang
  single-GPU since that's where the hang lives; vllm/trtllm filters
  are left untouched and other matrices can opt in when needed.

PR #8468 is open with the same skip but is currently failing its sglang
Multi-GPU checks on cuda12.9/13.0. Landing this unblocks the nightly
without waiting for that PR.

See DYN-2784 for the full root-cause chain and reproduction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tushar Sharma <tusharma@nvidia.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 4, 2026
…code-waiter cleanup

Addresses two correctness issues raised by dynamo-ops review on PR 8965.

1. **Race on PrefillReady re-insert** (model_manager.rs)

   Both `register_prefill_router` and `activate_prefill_router` previously
   used a `remove → process → insert` pattern on
   `prefill_router_activators`. Between the `remove` and the `insert` the
   entry is gone from the map, so a concurrent `remove_prefill_activator`
   call (called by the watcher on prefill-component teardown) hits the
   empty map, skips its cleanup, and we then re-insert a stale
   `PrefillReady` for a prefill that's already gone. The next decode
   rebuild's lookup finds the resurrected cache and activates a
   `PrefillRouter` against a dead endpoint.

   Refactor both functions to use DashMap's `Entry` API. The shard lock is
   held for the duration of the OccupiedEntry, so any concurrent
   `remove_prefill_activator` serializes after us and observes the entry
   it needs to clear. The PrefillReady-consume path in
   `register_prefill_router` now reads-and-clones in place without
   removing; the DecodeWaiting/PrefillReady → PrefillReady transition in
   `activate_prefill_router` uses `OccupiedEntry::insert(...)` for an
   atomic value swap that returns the old value.

2. **Decode-waiter cleanup gated on `removed.is_some()`** (watcher.rs)

   `handle_delete_helper`'s decode branch only called
   `remove_decode_prefill_waiter` when `remove_worker_set` returned
   `Some(_)`. If decode registered (creating a `DecodeWaiting` activator
   entry via `register_prefill_router`) but `handle_add_helper` failed
   later (e.g., on `kv_chooser_for`, monitor setup, or
   `build_routed_pipeline`) before `add_worker_set`, no WorkerSet ever
   landed in the manager — and on later teardown
   `remove_worker_set` returns `None`, leaving the stale `DecodeWaiting`
   entry orphaned in the activator map.

   Drop the `removed.is_some()` guard. The helper itself is state-safe
   (uses `DashMap::remove_if(|_, v| matches!(v, DecodeWaiting(_)))`) so
   calling it on a key that's vacant or holds `PrefillReady` is a no-op.
   The prefill-teardown branch keeps its `removed.is_some()` guard
   because `remove_prefill_activator` blanket-removes both states, and
   calling it without a real WorkerSet teardown could orphan a live
   decode-side `DecodeWaiting`.

Behavior is unchanged for all four reproducers (ai-dynamo#1 burst-sweep canary
OFF, ai-dynamo#2 Pattern 1 multi-decode restart, ai-dynamo#3 Pattern 2 stale-DecodeWaiting,
ai-dynamo#4 burst-sweep canary ON); will re-verify on a fresh build.
MatejKosec added a commit that referenced this pull request May 10, 2026
…Bug A)

Why
---
Commit 3a7b775 fixed `VeloSession::close()` so that it drains BOTH
`available_pins` (the `Mutex<BTreeMap<SequenceHash, ImmutableBlock<G2>>>`
that holds strong refs to peer-puller-bound G2 blocks) AND `inbound_pulls`
(the `DashMap<u64, Vec<SequenceHash>>` that authorizes peer pulls).
Without that drain, the prefill-side `ManagedBlockPool::reset()` failed
the two-request smoke with `Reset pool count mismatch: expected 1089,
got 1086` whenever the puller errored before emitting `Frame::PullAck`.

The fix shipped with `close_drains_unacked_holder_pins`, but that test
asserts only `test_available_pin_count() == 0` after `close()`.  A future
refactor that deletes `self.inner.inbound_pulls.clear();` while keeping
`self.inner.available_pins.lock().clear();` would silently pass the
existing test but reintroduce the asymmetric protocol-state leak.

What
----
New regression test
`close_drains_inbound_pulls_and_releases_pool_blocks` that pins the
invariant on three INDEPENDENT observable axes:

  1. `test_available_pin_count() == 0` after `close()`
  2. `test_inbound_pulls_count() == 0` after `close()`
     (NEW accessor, mirrors the existing `test_available_pin_count`)
  3. `g2_manager.available_blocks() == g2_manager.total_blocks()`
     after the local `ImmutableBlock` handles drop, AND
     `g2_manager.reset_inactive_pool()` succeeds — the same shape as
     the production `BlockPoolError::ResetError` failure.

The test forges TWO inbound `Frame::Pull`s (so `inbound_pulls` has two
entries) and never injects `Frame::PullAck`, simulating the
recompute-policy peer abort.  It asserts non-empty mid-state on both
maps before close, then full drain plus pool-parity after close.

Two intermediate sanity checks guard against false positives:
  * `make_available` actually pins 3 blocks (not 0)
  * `g2_manager.available_blocks() == total_blocks - 3` mid-test
    (so the pool-parity assertion is measuring real strong-ref drop)

Field-type note
---------------
`SequenceHash` is `PositionalLineageHash(u128)` — a plain hash value,
not a strong block ref.  So clearing `inbound_pulls` alone does not
itself release G2 blocks; the load-bearing strong-ref drain is on
`available_pins`.  But the `Frame::Pull`/`PullAck` protocol contract
requires both maps to drain in lockstep (see the `Frame::PullAck` arm
in `dispatch_frame`); an asymmetric drain at `close()` leaves the
session with stale authorize-but-unacked tracking that any future
addition of a strong-ref-bearing field to `inbound_pulls` (e.g. for
backpressure) would silently leak.  Axis #2 is checkable today via
the new accessor, so the contract is locked in now.

Validation (computelab-build-5, x86_64, dev profile)
---------
  * `cargo check -p kvbm-engine --tests` — clean (Finished in 29.54s)
  * `cargo test -p kvbm-engine --test velo_session_loopback` — all 9
    tests pass (the new test plus 8 pre-existing siblings, including
    `close_drains_unacked_holder_pins`)
  * Negative path A — `close()` body with BOTH drain lines reverted
    (the literal pre-fix state):
      assertion `left == right` failed: close() must drain
      `available_pins`
        left: 3
        right: 0
      at velo_session_loopback.rs:723
    (Axis #1 fires — production-shape failure of the original bug.)
  * Negative path B — `close()` body with ONLY
    `self.inner.inbound_pulls.clear();` reverted (asymmetric drain):
      assertion `left == right` failed: close() must drain
      `inbound_pulls`; otherwise authorized- but-unacked pull entries
      keep `Vec<SequenceHash>` mirrors alive and the corresponding
      G2 blocks leak past pool reset
        left: 2
        right: 0
      at velo_session_loopback.rs:728
    (Axis #2 fires — proves the new test independently guards the
    inbound_pulls drain that the sibling test does not cover.)

Also adds `VeloSession::test_inbound_pulls_count()` under the existing
`#[cfg(any(test, feature = "testing"))]` gate, mirroring the existing
`test_available_pin_count()`.  No new dependencies; no production-path
unwrap/expect/panic.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
…proportional_clamp_pair

PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline
ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling
beyond original demand (Open Question #2).

The rebase took main's #9294 refactor that delegates to the new
budget.proportional_clamp_pair helper. That helper distributes the
remaining budget proportionally and can over-allocate decode in the
asymmetric (p_gpu != d_gpu) case.

Re-introduce the clamp at the call site rather than in the shared helper,
so we don't change the contract for other callers (e.g. SLA-mode
scale-down consolidation, also using the helper).

Fixes: test_decode_never_exceeds_input_when_prefill_binding
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
…proportional_clamp_pair

PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline
ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling
beyond original demand (Open Question #2).

The rebase took main's #9294 refactor that delegates to the new
budget.proportional_clamp_pair helper. That helper distributes the
remaining budget proportionally and can over-allocate decode in the
asymmetric (p_gpu != d_gpu) case.

Re-introduce the clamp at the call site rather than in the shared helper,
so we don't change the contract for other callers (e.g. SLA-mode
scale-down consolidation, also using the helper).

Fixes: test_decode_never_exceeds_input_when_prefill_binding
Signed-off-by: Kai Ma <kaim@nvidia.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
…proportional_clamp_pair

PR 9369 was based on a pre-#9294 _apply_global_budget that had an inline
ceiling-shrink path with min(num_d, ...) to prevent decode up-scaling
beyond original demand (Open Question #2).

The rebase took main's #9294 refactor that delegates to the new
budget.proportional_clamp_pair helper. That helper distributes the
remaining budget proportionally and can over-allocate decode in the
asymmetric (p_gpu != d_gpu) case.

Re-introduce the clamp at the call site rather than in the shared helper,
so we don't change the contract for other callers (e.g. SLA-mode
scale-down consolidation, also using the helper).

Fixes: test_decode_never_exceeds_input_when_prefill_binding
Signed-off-by: Kai Ma <kaim@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant