fix(planner): backfill max_num_batched_tokens from discovery for VirtualConnector #8042
Here's a review comment you can paste based on our conversation:
Thanks for unblocking VirtualConnector scaling. This works, but I think it solves the symptom in the wrong layer and is worth reshaping before merge.
The architectural concern
Suggested shape
Factor the MDC plumbing into two pieces:
Then
/ok to test 20fb262
…ualConnector

The VirtualConnector does not implement get_worker_info, so max_num_batched_tokens was always None, which blocked both load-based and throughput-based scaling in agg (and prefill) mode.

Engines already publish max_num_batched_tokens in their ModelDeploymentCard via the discovery plane, and the FPM subscriber already watches those MDC events. The subscriber now extracts max_num_batched_tokens from each worker's runtime_config at discovery time and resolves the minimum across all known workers, exposing it via a narrow get_max_num_batched_tokens() -> Option<u64> binding.

NativePlannerBase backfills WorkerInfo.max_num_batched_tokens from the subscriber when the connector didn't provide it, and updates the state machine's capabilities via a new update_capabilities() method so subsequent scaling decisions pick up the value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
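The subscriber-side behavior described in this commit message (record each worker's value at discovery time, then resolve the minimum across known workers) can be sketched in Python as follows. This is only an illustration, not the actual Rust implementation: the class `BatchedTokenTracker`, the `on_added` / `on_removed` method names, the handling of removal events, and the exact card JSON layout are all assumptions. Only `runtime_config`, `max_num_batched_tokens`, and the `get_max_num_batched_tokens()` accessor name come from the commit message.

```python
import json
from typing import Dict, Optional


class BatchedTokenTracker:
    """Hypothetical stand-in for the subscriber's per-worker bookkeeping."""

    def __init__(self) -> None:
        # worker id -> max_num_batched_tokens reported in that worker's card
        self._by_worker: Dict[str, int] = {}

    def on_added(self, worker_id: str, card_json: str) -> None:
        # Assumed card layout: {"runtime_config": {"max_num_batched_tokens": <int>}}
        card = json.loads(card_json)
        value = (card.get("runtime_config") or {}).get("max_num_batched_tokens")
        # Ignore cards that do not carry a usable positive integer.
        if isinstance(value, int) and not isinstance(value, bool) and value > 0:
            self._by_worker[worker_id] = value

    def on_removed(self, worker_id: str) -> None:
        # Assumption: a departed worker no longer constrains the minimum.
        self._by_worker.pop(worker_id, None)

    def get_max_num_batched_tokens(self) -> Optional[int]:
        # Minimum across known workers; None when no worker has reported a value,
        # mirroring the Option<u64> shape of the binding.
        return min(self._by_worker.values()) if self._by_worker else None
```

Taking the minimum is the conservative choice: a planner that sizes batches against the smallest reported limit cannot overcommit any single worker.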
Narrow the model_name_fallback except clause to (RuntimeError, OSError, ValueError) per python-guidelines.md, and hoist the unittest.mock / planner imports in test_load_based_scaling.py to module top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `VirtualConnector` doesn't implement `get_worker_info`, so `max_num_batched_tokens` was always `None`, which blocked both load-based and throughput-based scaling in agg and prefill mode.
- Engines already publish `max_num_batched_tokens` in their `ModelDeploymentCard` via discovery, and the FPM subscriber already watches those MDC events. This change captures the `card_json` on `Added` events and exposes it via `get_model_cards()`.
- Backfills `WorkerInfo.max_num_batched_tokens` from the first available model card when the connector didn't provide the value. No-ops if the value is already set (e.g. Kubernetes connector).

Test plan
- `_populate_worker_info_from_discovery`: backfill happy path, no-op when already set, no-op without subscriber, skip incomplete cards
- `cargo check -p dynamo-py3` passes (confirmed locally)
- With `"mode": "agg"` using VirtualConnector, confirm `max_num_batched_tokens` is populated from discovery and scaling proceeds

🤖 Generated with Claude Code
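The four unit-test cases listed in the test plan can be illustrated with a minimal Python sketch of the backfill logic. It is an approximation under stated assumptions, not the planner's real code: the simplified `WorkerInfo` dataclass, the `FakeSubscriber` class, the free function `populate_worker_info_from_discovery`, and the idea that `get_model_cards()` returns parsed dicts with a `runtime_config` mapping are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class WorkerInfo:
    # Simplified, hypothetical version of the planner's WorkerInfo.
    max_num_batched_tokens: Optional[int] = None


class FakeSubscriber:
    """Hypothetical subscriber; assumes get_model_cards() yields parsed dicts."""

    def __init__(self, cards: List[Dict[str, Any]]) -> None:
        self._cards = cards

    def get_model_cards(self) -> List[Dict[str, Any]]:
        return list(self._cards)


def populate_worker_info_from_discovery(
    info: WorkerInfo, subscriber: Optional[FakeSubscriber]
) -> bool:
    """Backfill info in place; return True only if a value was written."""
    if info.max_num_batched_tokens is not None:
        return False  # connector already provided the value: no-op
    if subscriber is None:
        return False  # nothing to read from: no-op
    for card in subscriber.get_model_cards():
        value = (card.get("runtime_config") or {}).get("max_num_batched_tokens")
        if value is not None:
            # First card that actually carries the field wins.
            info.max_num_batched_tokens = value
            return True
        # Otherwise the card is incomplete; skip it and try the next one.
    return False
```

Returning a boolean is a choice made for this sketch so that callers and tests can tell a backfill apart from a no-op without comparing before and after values.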