Skip to content

Update README.md#1

Merged
nnshah1 merged 2 commits into
mainfrom
Update-README.md
Mar 4, 2025
Merged

Update README.md#1
nnshah1 merged 2 commits into
mainfrom
Update-README.md

Conversation

@nvda-mesharma
Copy link
Copy Markdown
Contributor

No description provided.

@nnshah1 nnshah1 merged commit 5033457 into main Mar 4, 2025
@nnshah1 nnshah1 deleted the Update-README.md branch March 4, 2025 17:16
kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
copy-pr-bot Bot pushed a commit that referenced this pull request Sep 9, 2025
* add metrics to disconnect

* fmt

* fmt

Signed-off-by: michaelfeil <me@michaelfeil.eu>
grahamking pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
ayushag-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
Signed-off-by: ayushag <ayushag@nvidia.com>
zhongdaor-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: michaelfeil <me@michaelfeil.eu>
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>
elyasmnvidian added a commit that referenced this pull request Sep 22, 2025
Signed-off-by: Elyas Mehtabuddin <emehtabuddin@nvidia.com>
ShounakRay added a commit to ShounakRay/fuzzy-dynamo that referenced this pull request Mar 20, 2026
* chore: explanation

* feat: add advanced fuzzing harnesses for dynamo-parsers

Add invariant, differential, ReDoS, and tool-definition-aware fuzz
targets with seed corpus, token dictionary, and coverage tooling.

* test: add regression tests for 5 parser bugs

Bugs 1-2 (granite streaming prefix loss/trim asymmetry), bug 5
(DeepSeek V3 newline normalization), bug 6 (pythonic text drop),
bug 9 (base_json text drop), bug 14 (harmony empty content).

* refactor: deduplicate parser code with shared utilities

Extract chunk_ends_with_token_prefix() and decode_xml_entities() into
tool_calling/utils.rs, replacing 6 inline loops and 2 entity decoders.
Fixes UTF-8 byte-slicing panics in XML and GLM47 parsers. Includes
regression tests for bugs 3-4, 7-8, 10-13.

* feat: add kv-router fuzz targets

5 fuzz targets for dynamo-kv-router: radix tree state machine,
block hash computation, KV protocol JSON deserialization,
positional indexer, and request extra info splitting.
Includes dictionary and seed corpus.

* feat: add content preservation and streaming monotonicity fuzz oracles

Two new parser fuzz targets:
- fuzz_content_preservation: verifies text outside tool-call
  delimiters survives parsing without fabrication
- fuzz_streaming_monotonicity: verifies streaming output
  accumulates monotonically (never shrinks)

* feat: add tokens hash fuzz targets

3 fuzz targets for dynamo-tokens: positional sequence hash
round-trip, positional lineage hash boundary testing, and
token block sequence stateful invariants.

* feat: add runtime codec fuzz targets

4 fuzz targets for dynamo-runtime: two-part message encode/decode
round-trip, two-part decode crash oracle, TCP request message
decode, and event plane frame decode.

* test: add llm protocol regression tests

* feat: add unified fuzzing runner, update findings and next steps

* refactor: deduplicate fuzz targets with shared utility modules

* feat: confirm kv_block_size=0 crash bug via fuzzing

* feat: confirm 4 new crash bugs via extended fuzzing campaign

New crashes found:
- RadixTree RefCell reentrant borrow (radix_tree.rs:371)
- RequestExtraInfo division by zero (protocols.rs:365)
- TwoPartCodec integer overflow (two_part.rs:58)
- MiniMaxAppendThink differential trim asymmetry

All 27 fuzz targets run across 5 crates (~140M total executions).
Updated findings and next steps documents.
Added adversarial seeds targeting integer boundaries and hash collisions.

* refactor: minimize fuzz targets and scripts (-1187 lines)

Remove verbose comments, doc strings, and redundant code across all 27
fuzz targets, 2 shared libs, and 2 scripts. Delete superseded files
(run_parser_fuzz.sh, repo_overview.md). Make coverage.sh auto-discover
crates instead of maintaining a hardcoded target list.

* docs: expand FINDINGS.md with detailed reproduction steps for all 20 bugs

Each bug now includes: description, severity, how found (fuzzing vs audit),
exact file/line, crash artifacts, fuzzing reproduction commands, standalone
reproducers, root cause analysis, and suggested fixes.

* docs: add individual upstream issue templates for all 20 bugs

Each file follows the ai-dynamo/dynamo [BUG] issue format with:
description, steps to reproduce, expected/actual behavior, suggested
fix, environment, and additional context. Related bugs are grouped
into single issues where appropriate (3+4, 6+9, 12+13, 19+20).

* docs: update findings and issues based on intentionality audit

- Remove bug 12/13 issue template (already fixed upstream via
  character-based iteration in chunk_ends_with_token_prefix)
- Mark bugs 12/13 as already fixed in FINDINGS.md
- Add debatable verdicts to bugs 5, 10, 19 in FINDINGS.md
- Add context notes to issue templates for debatable bugs
- Update NEXT_STEPS.md with extended run suggestions

* chore: remove upstream CI workflows (require NVIDIA infra)

* feat: add tool-call roundtrip and radix tree consistency fuzz targets

* docs: add security assessment, severity rankings, and extended campaign results

* chore: add setup script for remote configuration

* chore: add WORKFLOW.md, tracked hooks, and updated setup script

* docs: restore repo_overview.md

* 🧪 feat: add ZMQ wire and SSE codec fuzz targets

Expand fuzzing coverage to kv-router ZMQ wire protocol and LLM SSE codec

* Add 4 kv-router fuzz targets: decode, convert, roundtrip, extra_keys
* Add ZMQ wire dictionary entries and msgpack markers
* Add 16 ZMQ wire seed corpus files for structured fuzzing
* Add llm/fuzz crate with 3 SSE codec fuzz targets (decode, decode_data, streaming)
* Add SSE seed corpus (14 files) and dictionary
* Add rmp-serde dependency for msgpack serialization in fuzz harness

---------

Co-authored-by: ShounakRay <shounak@stanford.edu>
ShounakRay added a commit to ShounakRay/fuzzy-dynamo that referenced this pull request Mar 20, 2026
…ers)

LaTeX book covering transformer inference, KV caching, distributed
serving, Dynamo architecture, KV-aware routing, token blocks, network
codecs, fuzzing methodology, and 16 bugs found across 36 fuzz targets.

All bug references use ch10 canonical numbering (ai-dynamo#1-ai-dynamo#16), all cross-
references resolve, all arithmetic verified. Includes TikZ diagrams,
whynotbox pedagogical environments, and marginnotes throughout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Apr 17, 2026
tedzhouhk added a commit that referenced this pull request Apr 18, 2026
Adds TestQwen235BugReproCase with the exact two picks from the original
triage report:

* Prefill: tp=4, dp=1, moe_tp=1, moe_ep=4
* Decode: tp=1, dp=8, moe_tp=2, moe_ep=4

Each test pins down a specific symptom that the old profiler path
produced:

* test_prefill_pick_satisfies_aic_identity — Bug #1 (missing moe kwargs)
  and Bug #2 (using .tp_size property) would break the identity for
  this pick. Verifies all five kwargs and the MoE identity.
* test_decode_pick_satisfies_aic_identity — same for the decode pick,
  which was the exact shape that triggered Bug #3 (missing
  attention_dp_size).
* test_tp_size_property_differs_from_aic_tp_size — direct regression
  guard: documents that the property returns 1 here while AIC's tp_size
  is 4. Anyone who wires .tp_size back into AIC kwargs will break this.
* test_end_to_end_sweep_delivers_complete_kwargs_to_aic — runs both
  sweeps and asserts every AIC call gets complete kwargs. This is the
  direct reproducer of the originally-reported failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
nv-tusharma added a commit that referenced this pull request Apr 21, 2026
Follow-up on the PR #8415 repro. The reproduction showed two problems
in how multi-worker sglang tests handle startup failures:

1. Both workers in SGLangProcess share log_dir = request.node.name, so
   their stdout/stderr race into a single python3.log.txt. When worker
   #2 dies during init, worker #1's later output clobbers the crash
   reason and we cannot tell why worker #2 failed.

2. ManagedEngineProcessMixin.__enter__ launches workers sequentially
   with a 5 s init_delay but never calls proc.poll(), so a dead worker
   is indistinguishable from a healthy one. The test then blocks in
   KvRouter(endpoint, ...) (see tests/router/common.py:1993) waiting
   for min_initial_workers=2 instances that will never register --
   manifesting as the 5 h nightly hang.

Changes:

- tests/router/test_router_e2e_with_sglang.py: give each worker its own
  log_dir subdirectory (worker_<idx>) so crash output is preserved.
- tests/router/e2e_harness.py: add _assert_workers_alive() that polls
  every launched worker after delayed_start and after init_delay, and
  raise RuntimeError with the log path if any worker exited during
  startup. Fails the test in ~seconds instead of hanging for hours.
- .github/workflows/pr.yaml: extend the sglang-hang-repro diagnostic
  step to tail each worker's python3.log.txt so the next repro captures
  worker #2's death reason directly.

DYN-2784: https://linear.app/nvidia/issue/DYN-2784

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
keivenchang added a commit that referenced this pull request Apr 30, 2026
Adds 8 integration tests in lib/llm/tests/tool_choice.rs covering each
parser × {auto, required, named-correct, named-wrong}. Two tests pin
the parser-vs-immediate-jail conflict for tool_choice=required with a
TODO(CASE.11) note pointing at DIS-1842 work-item #1.

Renames the parser-unit placeholder tests to drop the bogus CASE.11/12
labels (parser is contract-blind by design). The renamed tests document
the parser-level invariant: parser does NOT filter by tool_choice and
its output is byte-stable regardless of upstream finish_reason.

Updates DSv4 coverage chart to point CASE.11 at the new integration
tests and CASE.12 at the parser-invariant test + cross-parser fixtures
in lib/llm/tests/test_streaming_tool_parsers.rs.

8 new integration tests; 16 tool_choice tests pass; 424 parser tests
pass; clippy clean.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
keivenchang added a commit that referenced this pull request Apr 30, 2026
CodeRabbit flagged the new tests as carrying internal Linear ticket
references in their doc-comments — repo policy forbids Linear refs in
source code. Replaced each occurrence with neutral wording:

- DIS-1842 work-item #1 → cross-parser tool_choice parametrisation
  work-item (tracked separately)
- DIS-1842 work-item #5 → cross-parser finish_reason mapping work-item
  (tracked separately)
- Universal gap noted in DIS-1842 → Universal gap noted in the test
  taxonomy

Linear refs remain in the PR description (allowed per repo convention).
424 parser tests + 16 tool_choice tests pass.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
yifjiang added a commit to yifjiang/dynamo that referenced this pull request May 4, 2026
…code-waiter cleanup

Addresses two correctness issues raised by dynamo-ops review on PR 8965.

1. **Race on PrefillReady re-insert** (model_manager.rs)

   Both `register_prefill_router` and `activate_prefill_router` previously
   used a `remove → process → insert` pattern on
   `prefill_router_activators`. Between the `remove` and the `insert` the
   entry is gone from the map, so a concurrent `remove_prefill_activator`
   call (called by the watcher on prefill-component teardown) hits the
   empty map, skips its cleanup, and we then re-insert a stale
   `PrefillReady` for a prefill that's already gone. The next decode
   rebuild's lookup finds the resurrected cache and activates a
   `PrefillRouter` against a dead endpoint.

   Refactor both functions to use DashMap's `Entry` API. The shard lock is
   held for the duration of the OccupiedEntry, so any concurrent
   `remove_prefill_activator` serializes after us and observes the entry
   it needs to clear. The PrefillReady-consume path in
   `register_prefill_router` now reads-and-clones in place without
   removing; the DecodeWaiting/PrefillReady → PrefillReady transition in
   `activate_prefill_router` uses `OccupiedEntry::insert(...)` for an
   atomic value swap that returns the old value.

2. **Decode-waiter cleanup gated on `removed.is_some()`** (watcher.rs)

   `handle_delete_helper`'s decode branch only called
   `remove_decode_prefill_waiter` when `remove_worker_set` returned
   `Some(_)`. If decode registered (creating a `DecodeWaiting` activator
   entry via `register_prefill_router`) but `handle_add_helper` failed
   later (e.g., on `kv_chooser_for`, monitor setup, or
   `build_routed_pipeline`) before `add_worker_set`, no WorkerSet ever
   landed in the manager — and on later teardown
   `remove_worker_set` returns `None`, leaving the stale `DecodeWaiting`
   entry orphaned in the activator map.

   Drop the `removed.is_some()` guard. The helper itself is state-safe
   (uses `DashMap::remove_if(|_, v| matches!(v, DecodeWaiting(_)))`) so
   calling it on a key that's vacant or holds `PrefillReady` is a no-op.
   The prefill-teardown branch keeps its `removed.is_some()` guard
   because `remove_prefill_activator` blanket-removes both states, and
   calling it without a real WorkerSet teardown could orphan a live
   decode-side `DecodeWaiting`.

Behavior is unchanged for all four reproducers (ai-dynamo#1 burst-sweep canary
OFF, ai-dynamo#2 Pattern 1 multi-decode restart, ai-dynamo#3 Pattern 2 stale-DecodeWaiting,
ai-dynamo#4 burst-sweep canary ON); will re-verify on a fresh build.
keivenchang added a commit that referenced this pull request May 7, 2026
…p-N / Others

DIS-1842's top-N model #1 (DeepSeek V4) used the deepseek_v4 parser, but
that family was never wired into the M2 cross-impl harness — the matrix
shipped with deepseek_v3 + deepseek_v3_1 only. Closes the gap.

* Add 10 PARSER.batch.* INPUTS entries for deepseek_v4 to
  regenerate_fixtures.py (DSML wire format: <DSML|tool_calls>,
  <DSML|invoke>, <DSML|parameter string="true|false">). Generated via
  the live Dynamo binding; new fixture file at
  tests/parity/parser/fixtures/deepseek_v4/PARSER.batch.yaml.
* Wire "deepseek_v4": "deepseek_v4" in _FAMILY_TO_VLLM_KEY.
* SGLang has no deepseek_v4 detector — whole-impl n/a, footnote added.
* 2 new KNOWN_DIVERGENCES from smoke run:
  - vllm/deepseek_v4/batch.5 → recovery contract (vLLM surfaces raw text
    in normal_text on missing end-marker; Dynamo drops everything).
  - vllm/deepseek_v4/batch.7 → vLLM emits JSON-typed parameter values
    as raw strings; Dynamo coerces nested object/array types.

Restructure the README matrix into two charts:

* Top-N (7 rows) — primary parser per top-N model in DIS-1842, with
  (model) annotation: deepseek_v4 (DeepSeek V4), glm47 (GLM 5.1),
  harmony (gpt-oss), kimi_k2 (Kimi K2.6), minimax_m2 (MiniMax 2.7),
  nemotron_deci (Nemotron), qwen3_coder (Qwen 3.5).
* Others (4 rows) — parsers in the harness without a top-N owner:
  deepseek_v3, deepseek_v3_1, gemma4, pythonic.

Tally: 200 → 220 cells, 128 → 136 parity, 32 → 34 divergence, 40 → 50 n/a.

Also rolls in earlier local-only edits to the same section: compact
b1..b10 column headers (full PARSER.batch.N was too wide), case-name
dictionary moved into the parity-status section directly above the
table, case 1 ... case 10 references in the "Why families' YAMLs look
so similar" section renamed to PARSER.batch.N.

Smoke: vllm 8 passed / 2 xfailed; sglang 10 skipped (no peer); dynamo
10/10. Doc + harness wiring; no parser-side code change.

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
MatejKosec added a commit that referenced this pull request May 10, 2026
…Bug A)

Why
---
Commit 3a7b775 fixed `VeloSession::close()` so that it drains BOTH
`available_pins` (the `Mutex<BTreeMap<SequenceHash, ImmutableBlock<G2>>>`
that holds strong refs to peer-puller-bound G2 blocks) AND `inbound_pulls`
(the `DashMap<u64, Vec<SequenceHash>>` that authorizes peer pulls).
Without that drain, the prefill-side `ManagedBlockPool::reset()` failed
the two-request smoke with `Reset pool count mismatch: expected 1089,
got 1086` whenever the puller errored before emitting `Frame::PullAck`.

The fix shipped with `close_drains_unacked_holder_pins`, but that test
asserts only `test_available_pin_count() == 0` after `close()`.  A future
refactor that deletes `self.inner.inbound_pulls.clear();` while keeping
`self.inner.available_pins.lock().clear();` would silently pass the
existing test but reintroduce the asymmetric protocol-state leak.

What
----
New regression test
`close_drains_inbound_pulls_and_releases_pool_blocks` that pins the
invariant on three INDEPENDENT observable axes:

  1. `test_available_pin_count() == 0` after `close()`
  2. `test_inbound_pulls_count() == 0` after `close()`
     (NEW accessor, mirrors the existing `test_available_pin_count`)
  3. `g2_manager.available_blocks() == g2_manager.total_blocks()`
     after the local `ImmutableBlock` handles drop, AND
     `g2_manager.reset_inactive_pool()` succeeds — the same shape as
     the production `BlockPoolError::ResetError` failure.

The test forges TWO inbound `Frame::Pull`s (so `inbound_pulls` has two
entries) and never injects `Frame::PullAck`, simulating the
recompute-policy peer abort.  It asserts non-empty mid-state on both
maps before close, then full drain plus pool-parity after close.

Two intermediate sanity checks guard against false positives:
  * `make_available` actually pins 3 blocks (not 0)
  * `g2_manager.available_blocks() == total_blocks - 3` mid-test
    (so the pool-parity assertion is measuring real strong-ref drop)

Field-type note
---------------
`SequenceHash` is `PositionalLineageHash(u128)` — a plain hash value,
not a strong block ref.  So clearing `inbound_pulls` alone does not
itself release G2 blocks; the load-bearing strong-ref drain is on
`available_pins`.  But the `Frame::Pull`/`PullAck` protocol contract
requires both maps to drain in lockstep (see the `Frame::PullAck` arm
in `dispatch_frame`); an asymmetric drain at `close()` leaves the
session with stale authorize-but-unacked tracking that any future
addition of a strong-ref-bearing field to `inbound_pulls` (e.g. for
backpressure) would silently leak.  Axis #2 is checkable today via
the new accessor, so the contract is locked in now.

Validation (computelab-build-5, x86_64, dev profile)
---------
  * `cargo check -p kvbm-engine --tests` — clean (Finished in 29.54s)
  * `cargo test -p kvbm-engine --test velo_session_loopback` — all 9
    tests pass (the new test plus 8 pre-existing siblings, including
    `close_drains_unacked_holder_pins`)
  * Negative path A — `close()` body with BOTH drain lines reverted
    (the literal pre-fix state):
      assertion `left == right` failed: close() must drain
      `available_pins`
        left: 3
        right: 0
      at velo_session_loopback.rs:723
    (Axis #1 fires — production-shape failure of the original bug.)
  * Negative path B — `close()` body with ONLY
    `self.inner.inbound_pulls.clear();` reverted (asymmetric drain):
      assertion `left == right` failed: close() must drain
      `inbound_pulls`; otherwise authorized- but-unacked pull entries
      keep `Vec<SequenceHash>` mirrors alive and the corresponding
      G2 blocks leak past pool reset
        left: 2
        right: 0
      at velo_session_loopback.rs:728
    (Axis #2 fires — proves the new test independently guards the
    inbound_pulls drain that the sibling test does not cover.)

Also adds `VeloSession::test_inbound_pulls_count()` under the existing
`#[cfg(any(test, feature = "testing"))]` gate, mirroring the existing
`test_available_pin_count()`.  No new dependencies; no production-path
unwrap/expect/panic.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
shqizhang pushed a commit to shqizhang/dynamo that referenced this pull request May 11, 2026
The shared TCP server keys handlers by {cid:x}/{endpoint_name} where cid
is the process-scoped DRT connection_id. Registering both backend.generate
and prefill.generate from the same dual-mode worker collided on cid/generate
in the dispatcher's DashMap; the second insert silently overwrote the first.
After /switch_role, the prefill MDC's transport URL pointed at the same TCP
slot as the backend MDC, so PrefillRouter dispatches landed on whichever
generate handler won the race -- usually the plain decode handler -- which
returns disaggregated_params=None on every chunk. Rust PrefillRouter
rejected with HTTP 500 NoDisaggregatedParams (~50% of chats failed,
matching the load-balance fraction routed to the switched pod).

Fix: register exactly one TCP handler under cid/generate. The single
generate handler is a polymorphic wrapper that, at request time, looks at
dual_mode.current_role and either runs the decode handler or runs a
chunk-buffering wrapper around partner_prefill_handler.generate (which
consolidates vLLM's multi-chunk prefill output -- kv_transfer_params is
only published on the FINAL chunk by NixlConnector -- into a single chunk
whose disaggregated_params carries the merged kv_transfer_params, so the
Rust PrefillRouter sees the field on its mandatory chunk ai-dynamo#1).

_dual_partner_endpoint is still constructed as an Endpoint object purely
so VllmReregistrar.register('prefill') can publish the prefill MDC; the
MDC's transport URL points at the (now correctly handled) cid/generate
slot.
andyluo7 added a commit to andyluo7/dynamo that referenced this pull request May 12, 2026
Re-ran the Phase 4 M2.5 escalation without --enforce-eager (HIP graph
capture enabled). Production numbers:

  conc | eager (was) | HIP graphs (now) | speedup
  -----|-------------|------------------|--------
   c=1 | 18.6 tok/s  | 94.7 tok/s       | 5.1x
   c=4 | 68.0 tok/s  | 261.4 tok/s      | 3.8x
   c=8 | 72.7 tok/s  | 587.1 tok/s      | 8.1x

Notable: 587 tok/s @ c=8 disagg now BEATS Phase 2.5 single-node agg
(521 tok/s) -- disagg uses 2 nodes' compute (8 GPUs vs 4 in agg) and
the frontend + KV-router overhead does not exceed the doubled capacity.

Per-GPU at c=8: 73.4 tok/s/GPU on 8 MI355X. Fork's InferenceX MoRI 1P1D
at c=32 reports 73.9 tok/s/GPU -- we match the fork's per-GPU efficiency
on a model the fork didn't even test in this configuration.

HIP graph capture: ~1:34 per worker for 5 PIECEWISE + 4 FULL graphs.
KV cache available after capture: 170.93 GiB per worker. No aiter
segfaults observed on gfx950 with M2.5+TP=4.

Updates:
- docs/08-phase34-escalation-results.md: replaces Phase 4 numbers
  table with HIP-graphs result; keeps eager for comparison; updates
  Tier 1 callout to mark item ai-dynamo#1 as DONE; updates final scorecard
- README.md: updates headline table M2.5 row to 587 tok/s with footnote
  explaining the eager vs HIP-graphs context; updates per-GPU
  efficiency observations to highlight matching fork's published numbers
- scripts/phase4_m25_disagg.sh: removes --enforce-eager flag, raises
  --max-num-seqs from 4 to 8 to amortize HIP graph capture
yao531441 pushed a commit to yao531441/dynamo that referenced this pull request May 13, 2026
Signed-off-by: michaelfeil <me@michaelfeil.eu>
tanmayv25 added a commit that referenced this pull request May 15, 2026
Four CI failures on pre-merge after the metrics refactor:

1. cargo fmt --check: 4 files weren't `cargo fmt`-clean (engine.rs,
   metrics.rs, publisher.rs, mocker/engine.rs). Applied.

2. mypy: sglang/llm_engine.py imports `register_engine_registry` from
   `dynamo.common.backend.metrics` for vendor-registry bridging (SGLang's
   multiprocess CollectorRegistry path). The refactor wrongly classified
   this as framework-only and moved it to `_internal_metrics`. Re-export
   from `metrics.py` — engines legitimately use this primitive for
   custom-registry scenarios (sglang) alongside the global-registry
   helper `register_global_registry` (vllm/trtllm).

3. pre-commit isort: 3 files reformatted to single-line imports.

4. pre-commit black: trtllm/args.py reformatted (pre-existing
   line-length cleanup pulled in by black --all-files).

rust-tests (lib/bindings/python) and (kvbm) were cascade-canceled by
the workspace fmt failure; they pass once #1 is fixed.

Fern docs broken-links + preview failures are unrelated tooling
issues (Fern 4.66.1→5.26.5 upgrade prompt) — pre-existing on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
Four CI failures on pre-merge after the metrics refactor:

1. cargo fmt --check: 4 files weren't `cargo fmt`-clean (engine.rs,
   metrics.rs, publisher.rs, mocker/engine.rs). Applied.

2. mypy: sglang/llm_engine.py imports `register_engine_registry` from
   `dynamo.common.backend.metrics` for vendor-registry bridging (SGLang's
   multiprocess CollectorRegistry path). The refactor wrongly classified
   this as framework-only and moved it to `_internal_metrics`. Re-export
   from `metrics.py` — engines legitimately use this primitive for
   custom-registry scenarios (sglang) alongside the global-registry
   helper `register_global_registry` (vllm/trtllm).

3. pre-commit isort: 3 files reformatted to single-line imports.

4. pre-commit black: trtllm/args.py reformatted (pre-existing
   line-length cleanup pulled in by black --all-files).

rust-tests (lib/bindings/python) and (kvbm) were cascade-canceled by
the workspace fmt failure; they pass once #1 is fixed.

Fern docs broken-links + preview failures are unrelated tooling
issues (Fern 4.66.1→5.26.5 upgrade prompt) — pre-existing on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants