fix(planner): match MDC component field against backend default, not DGD key #8489
Merged
…DGD key
KubernetesConnector.get_worker_info()'s secondary filter was comparing the
MDC entry's 'component' field (written by the Rust runtime from the
registered Endpoint name, e.g. "prefill" / "backend" / "tensorrt_llm")
against service.name from the DGD (the spec.services dict key, typically
PascalCase like "VllmPrefillWorker"). These are fundamentally different
identifiers, so for every upstream example that uses a PascalCase services
key, the filter skipped every real MDC entry, fell back to defaults with
context_length=None, and emitted:
WARN load_scaling._prefill_easy_decision: context_length not available,
skipping easy prefill scaling
on every load tick -- silently breaking easy-mode autoscaling.
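
A minimal sketch of the mismatch described above. The entry dicts, the `filter_by_component` helper, and the `context_length` value are illustrative assumptions, not the planner's actual data structures:

```python
# Hypothetical MDC entries; only "component" and "context_length" matter here.
# The context_length value is made up for illustration.
mdc_entries = [
    {"component": "prefill", "context_length": 8192},
    {"component": "backend", "context_length": 8192},
]


def filter_by_component(entries, identifier):
    """Keep only the entries whose 'component' field equals the identifier."""
    return [entry for entry in entries if entry["component"] == identifier]


# Old behaviour: the DGD services key (PascalCase) is used as the identifier,
# so no lowercase MDC component name can ever match it.
old_matches = filter_by_component(mdc_entries, "VllmPrefillWorker")

# Fixed behaviour: the component name the worker writes to MDC is used.
new_matches = filter_by_component(mdc_entries, "prefill")
```

With no match, the caller has nothing to read context_length from, which is the fall-back-to-defaults path that produced the warning.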
Fix _resolve_dgd_service() to return the component name the worker
actually writes to MDC as the filter identifier. Source of truth, in
priority order:
1. Parse --endpoint <ns>.<component>.<ep> from the DGD container args
(all three backends -- vllm/sglang/trtllm -- honor this override).
Handled by the new Service.get_component_name_from_endpoint_arg().
2. Backend-specific default from build_worker_info_from_defaults()
("prefill" / "backend" / "tensorrt_llm").
service.name (the PascalCase DGD key) is still returned as the first
tuple element for Kubernetes operations that need it (replica patches,
WorkerInfo.k8s_name).
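
The priority order above could be sketched roughly as follows. Both function names are hypothetical stand-ins (this is not the body of Service.get_component_name_from_endpoint_arg()), and the sketch assumes the flag and its value are separate list items in the container args:

```python
from typing import Optional, Sequence


def component_from_endpoint_arg(args: Sequence[str]) -> Optional[str]:
    """Return <component> from '--endpoint <ns>.<component>.<ep>', else None."""
    value = None
    for i, arg in enumerate(args):
        # Assumption: the value follows the flag as the next list item.
        if arg == "--endpoint" and i + 1 < len(args):
            value = args[i + 1]
            break
    if value is None:
        return None  # flag absent, or present without a value
    if value.startswith("dyn://"):
        value = value[len("dyn://"):]  # tolerate the optional scheme prefix
    parts = value.split(".")
    if len(parts) != 3 or not all(parts):
        return None  # malformed: not exactly three non-empty segments
    return parts[1]


def resolve_component(args: Sequence[str], backend_default: str) -> str:
    """Prefer the --endpoint override; otherwise use the backend default."""
    return component_from_endpoint_arg(args) or backend_default
```

For example, `resolve_component(["--endpoint", "dyn://ns.custom.generate"], "backend")` yields `"custom"`, while a malformed or missing value falls through to the default.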
Note: the naive fix of using sub_component_type.value ("prefill" /
"decode") would break decode filtering because MDC decode carries
backend-specific names ("backend" for vLLM/SGLang/Mocker, "tensorrt_llm"
for TRT-LLM), not "decode".
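
To illustrate the note above, a small sketch. The component names come from this commit message; the dict layout and the lowercase backend keys are assumptions made for the example:

```python
# Component name each backend's decode worker writes to MDC, per the note
# above. The dict itself and its keys are illustrative, not planner code.
DECODE_COMPONENT = {
    "vllm": "backend",
    "sglang": "backend",
    "mocker": "backend",
    "trtllm": "tensorrt_llm",
}

# The naive identifier for decode workers would be the literal "decode".
naive_identifier = "decode"

# Backends whose decode entries the naive filter would actually match.
naive_decode_hits = [
    backend
    for backend, component in DECODE_COMPONENT.items()
    if component == naive_identifier
]
```

The list comes out empty: no backend registers its decode worker under the name "decode", so the naive filter would skip every decode entry.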
Added regression tests covering prefill, decode (vLLM "backend"), TRT-LLM
decode ("tensorrt_llm"), the DGD-lookup-fails path, the --endpoint user
override (with and without dyn:// prefix), and malformed --endpoint
fallback.
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
0ce4de2 to ed8e0b0
PeaBrane
approved these changes
Apr 22, 2026
Summary
Fixes silent autoscaling breakage in the Kubernetes Planner: the secondary filter in KubernetesConnector.get_worker_info() was comparing the MDC entry's component field (written by the Rust runtime from the registered Endpoint name; lowercase, e.g. "prefill" / "backend" / "tensorrt_llm") against service.name from the DGD (the spec.services dict key, typically PascalCase like "VllmPrefillWorker").

These are fundamentally different identifiers. For every upstream example (all of which use PascalCase services keys), the filter skipped every MDC entry, fell back to defaults with context_length=None, and the easy-mode load scaling loop emitted:

WARN load_scaling._prefill_easy_decision: context_length not available, skipping easy prefill scaling

on every 5-second adjustment interval, silently breaking autoscaling on every real-world 1.1.0 deployment. Inference correctness was unaffected, which is why this slipped past #8384 (that PR addressed a separate case-sensitivity bug in get_model_name, not this filter).

Fix
_resolve_dgd_service() now returns, as the filter identifier, the component name the worker actually writes to MDC. Source of truth, in priority order:
1. --endpoint <ns>.<component>.<ep> override in the worker's container args. All three backends support this (vllm/args.py:171-176, sglang/args.py:428, trtllm/args.py:137). Handled by the new Service.get_component_name_from_endpoint_arg().
2. Backend-specific default from build_worker_info_from_defaults() ("prefill" / "backend" / "tensorrt_llm").

service.name (the PascalCase DGD key) is still returned as the first tuple element for Kubernetes operations that need it (replica patches, WorkerInfo.k8s_name).

Why not sub_component_type.value?
The naive fix QA suggested in the ticket (compare against sub_component_type.value, i.e. "prefill" / "decode") would work for prefill but break decode: MDC decode workers carry backend-specific names ("backend" for vLLM/SGLang/Mocker, "tensorrt_llm" for TRT-LLM), never "decode".

Test Plan
- Regression tests for prefill / decode (vLLM "backend") / TRT-LLM decode ("tensorrt_llm")
- --endpoint ns.comp.ep override (+ dyn:// prefix variant)
- Malformed --endpoint falls back to default
- Service.get_component_name_from_endpoint_arg present / absent / missing-value edge cases
- test_kubernetes_connector.py suite passes (50/50)
- context_length not available WARN is gone and Planner makes scaling decisions under load

Closes