Skip to content

fix(planner): backfill max_num_batched_tokens from discovery for VirtualConnector#8042

Merged
jthomson04 merged 3 commits into
mainfrom
jthomson04/planner-discovery-backfill
Apr 20, 2026
Merged

fix(planner): backfill max_num_batched_tokens from discovery for VirtualConnector#8042
jthomson04 merged 3 commits into
mainfrom
jthomson04/planner-discovery-backfill

Conversation

@jthomson04
Copy link
Copy Markdown
Contributor

@jthomson04 jthomson04 commented Apr 9, 2026

Summary

  • The VirtualConnector doesn't implement get_worker_info, so max_num_batched_tokens was always None — blocking both load-based and throughput-based scaling in agg and prefill mode.
  • Engines already publish max_num_batched_tokens in their ModelDeploymentCard via discovery, and the FPM subscriber already watches those MDC events. This change captures the card_json on Added events and exposes it via get_model_cards().
  • The planner now backfills WorkerInfo.max_num_batched_tokens from the first available model card when the connector didn't provide the value. No-ops if the value is already set (e.g. Kubernetes connector).

Test plan

  • Added unit tests for _populate_worker_info_from_discovery: backfill happy path, no-op when already set, no-op without subscriber, skip incomplete cards
  • Verify cargo check -p dynamo-py3 passes (confirmed locally)
  • Deploy planner with "mode": "agg" using VirtualConnector and confirm max_num_batched_tokens is populated from discovery and scaling proceeds

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved throughput and load-based scaling to automatically discover missing worker configuration from model deployment metadata. The system now uses complete information when adjusting resource allocation and batch processing limits.
  • Tests

    • Added comprehensive test coverage validating configuration discovery and backfill behavior in various scenarios.

Open with Devin

@jthomson04 jthomson04 requested a review from a team as a code owner April 9, 2026 20:03
@jthomson04 jthomson04 requested a review from a team April 9, 2026 20:03
@jthomson04 jthomson04 requested a review from a team as a code owner April 9, 2026 20:03
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 9, 2026

Walkthrough

The changes add support for backfilling max_num_batched_tokens from discovered model deployment cards when not explicitly provided. A new private method is integrated into planner scaling loops to refresh worker information from FPM subscriber model cards, with corresponding Rust-side tracking, type stub declarations, and test coverage.

Changes

Cohort / File(s) Summary
Core Planner Logic
components/src/dynamo/planner/core/base.py, components/src/dynamo/planner/core/agg.py, components/src/dynamo/planner/core/prefill.py
Added new private method _populate_worker_info_from_discovery() to backfill missing max_num_batched_tokens from model cards with early-exit guards; integrated invocations before batching limit retrieval in throughput and load-based scaling loops.
FPM Subscriber
lib/bindings/python/rust/llm/fpm.rs
Added model card tracking in FpmEventSubscriber via concurrent map keyed by worker ID; updated MDC discovery task to capture and remove card JSON on model instance events; exposed new public method get_model_cards() returning filtered snapshot by worker ID.
Type Stubs
lib/bindings/python/src/dynamo/_core.pyi
Updated Python type stubs to declare new FpmEventSubscriber.get_model_cards(self) -> dict[str, str] method with documented runtime error on missing start_tracking() invocation.
Tests
components/src/dynamo/planner/tests/unit/test_load_based_scaling.py
Added TestPopulateWorkerInfoFromDiscovery test suite validating successful backfill from model cards, no-op behavior for already-set values, unset FPM subscriber, and graceful skipping of malformed cards.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description covers the problem, solution, and test plan, but is missing the required template structure with Overview/Details/Where should the reviewer start sections and Related Issues. Restructure the description to follow the template: add Overview, Details, and Where should the reviewer start sections; add Related Issues section with GitHub issue reference.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: backfilling max_num_batched_tokens from discovery for VirtualConnector, which is the core objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

@tedzhouhk
Copy link
Copy Markdown
Contributor

Here's a review comment you can paste based on our conversation:


Thanks for unblocking VirtualConnector scaling — this works, but I think it solves the symptom in the wrong layer and is worth reshaping before merge.

The architectural concern

max_num_batched_tokens is a WorkerInfo field, and WorkerInfo population is the connector's contract (resolve_worker_info in monitoring/worker_info.py, KubernetesConnector.get_worker_info in connectors/kubernetes.py:491). The root cause here is simply that VirtualConnector doesn't implement get_worker_info. Fixing it by threading a per-field getter through FpmEventSubscriber (whose documented purpose is Forward Pass Metrics) couples two unrelated concerns and only patches one of several missing fields — total_kv_blocks, kv_cache_block_size, max_num_seqs, and context_length come from the same card_json.runtime_config (see kubernetes.py:453-487) and are all None on VirtualConnector today. Each new field would need another get_<field>() on the subscriber; that's a clear sign the abstraction is off.

Suggested shape

Factor the MDC plumbing into two pieces:

  1. A shared worker_info_from_mdc(card_json, sub_component_type) helper (basically lift _build_worker_info_from_mdc out of kubernetes.py) — pure function, no K8s/discovery dependencies. Also pull the LoRA-card filter and prefill/decode heuristic into the same module; they're properties of MDC, not of K8s.
  2. A small McdSource protocol with two implementations:
    • KubeCrdSource — wraps today's _extract_mdc_entries.
    • DiscoverySource — reads MDC from the existing discovery watch.

Then VirtualConnector.get_worker_info composes DiscoverySource + worker_info_from_mdc. K8s keeps its DGD enrichment (k8s_name, container-arg model-name fallback) as a thin wrapper around the shared extractor.

@jthomson04 jthomson04 force-pushed the jthomson04/planner-discovery-backfill branch from 7d69b37 to c35acbc Compare April 17, 2026 05:16
@jthomson04 jthomson04 force-pushed the jthomson04/planner-discovery-backfill branch from c35acbc to 20fb262 Compare April 17, 2026 15:54
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

devin-ai-integration[bot]

This comment was marked as resolved.

@jthomson04
Copy link
Copy Markdown
Contributor Author

/ok to test 20fb262

jthomson04 and others added 3 commits April 20, 2026 11:09
…ualConnector

The VirtualConnector does not implement get_worker_info, so
max_num_batched_tokens was always None — blocking both load-based and
throughput-based scaling in agg (and prefill) mode.

Engines already publish max_num_batched_tokens in their
ModelDeploymentCard via the discovery plane, and the FPM subscriber
already watches those MDC events. The subscriber now extracts
max_num_batched_tokens from each worker's runtime_config at discovery
time and resolves the minimum across all known workers, exposing it via
a narrow get_max_num_batched_tokens() -> Option<u64> binding.

NativePlannerBase backfills WorkerInfo.max_num_batched_tokens from the
subscriber when the connector didn't provide it, and updates the state
machine's capabilities via a new update_capabilities() method so
subsequent scaling decisions pick up the value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Narrow model_name_fallback except to (RuntimeError, OSError, ValueError)
per python-guidelines.md, and hoist unittest.mock / planner imports in
test_load_based_scaling.py to module top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jthomson04 jthomson04 force-pushed the jthomson04/planner-discovery-backfill branch from 0382e05 to a16693f Compare April 20, 2026 18:09
@jthomson04 jthomson04 merged commit 4419009 into main Apr 20, 2026
89 of 90 checks passed
@jthomson04 jthomson04 deleted the jthomson04/planner-discovery-backfill branch April 20, 2026 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants