fix(planner): match MDC component field against backend default, not DGD key #8489

Merged
tedzhouhk merged 1 commit into main from hzhou/dyn-2747-worker-info-filter on Apr 22, 2026

@tedzhouhk tedzhouhk commented Apr 22, 2026

Summary

Fixes silent autoscaling breakage in the Kubernetes Planner. The secondary filter in KubernetesConnector.get_worker_info() compared the MDC entry's component field against service.name from the DGD. The component field is written by the Rust runtime from the registered Endpoint name and is lowercase (e.g. "prefill" / "backend" / "tensorrt_llm"), whereas service.name is the spec.services dict key, typically PascalCase like "VllmPrefillWorker".

These are fundamentally different identifiers. For every upstream example (all of which use PascalCase services keys), the filter skipped every MDC entry, fell back to defaults with context_length=None, and the easy-mode load scaling loop emitted:

WARN load_scaling._prefill_easy_decision: context_length not available, skipping easy prefill scaling
WARN load_scaling._decode_easy_decision: max_kv_tokens not available, skipping easy decode scaling

on every 5-second adjustment interval — silently breaking autoscaling on every real-world 1.1.0 deployment. Inference correctness was unaffected, which is why this slipped past #8384 (that PR addressed a separate case-sensitivity bug in get_model_name, not this filter).
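The mismatch can be illustrated with a minimal sketch. The dict layout, the filter_entries helper, and the context_length values below are illustrative assumptions, not the Planner's actual data structures; only the identifier strings come from the description above.

```python
# Hypothetical MDC entries; only the "component" field mirrors the PR text.
# The context_length values are made up for illustration.
mdc_entries = [
    {"component": "prefill", "context_length": 32768},
    {"component": "backend", "context_length": 32768},
]


def filter_entries(entries, identifier):
    """Keep entries whose component field equals the given identifier."""
    return [e for e in entries if e["component"] == identifier]


# Old behaviour: the filter identifier was the DGD spec.services key.
old_matches = filter_entries(mdc_entries, "VllmPrefillWorker")

# Fixed behaviour: the identifier is the component name the worker registers.
new_matches = filter_entries(mdc_entries, "prefill")
```

With the PascalCase key nothing matches, so a caller would have no real entry to read context_length from; with the registered component name the prefill entry is found.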

Fix

_resolve_dgd_service() now returns, as the filter identifier, the component name the worker actually writes to MDC. Source of truth, in priority order:

  1. The user's --endpoint <ns>.<component>.<ep> override in the worker's container args — all three backends support this (vllm/args.py:171-176, sglang/args.py:428, trtllm/args.py:137). Handled by the new Service.get_component_name_from_endpoint_arg().
  2. The backend-specific default from build_worker_info_from_defaults() ("prefill" / "backend" / "tensorrt_llm").
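The priority order above can be sketched roughly as follows. This is not the code from the PR: component_from_endpoint_arg and resolve_component are hypothetical stand-ins for Service.get_component_name_from_endpoint_arg() and the lookup in _resolve_dgd_service(), and the sketch assumes the flag and its value appear as two separate container args.

```python
from typing import Optional, Sequence


def component_from_endpoint_arg(args: Sequence[str]) -> Optional[str]:
    """Return <component> from `--endpoint <ns>.<component>.<ep>`, else None."""
    for i, arg in enumerate(args):
        if arg != "--endpoint":
            continue
        if i + 1 >= len(args):
            return None  # flag present but its value is missing
        value = args[i + 1]
        if value.startswith("dyn://"):
            value = value[len("dyn://"):]  # tolerate the dyn:// prefix variant
        parts = value.split(".")
        if len(parts) != 3 or not all(parts):
            return None  # malformed value: caller falls back to the default
        return parts[1]
    return None  # flag absent


def resolve_component(args: Sequence[str], backend_default: str) -> str:
    """Prefer the user's --endpoint override, else the backend default."""
    return component_from_endpoint_arg(args) or backend_default
```

For example, resolve_component(["--endpoint", "ns.custom.generate"], "backend") yields "custom", while a malformed value such as "ns.custom" falls through to "backend".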

service.name (the PascalCase DGD key) is still returned as the first tuple element for Kubernetes operations that need it (replica patches, WorkerInfo.k8s_name).

Why not sub_component_type.value?

The naive fix QA suggested in the ticket (compare against sub_component_type.value, i.e. "prefill" / "decode") would work for prefill but break decode: MDC decode workers carry backend-specific names ("backend" for vLLM/SGLang/Mocker, "tensorrt_llm" for TRT-LLM), never "decode".
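A small sketch of why the naive comparison fails for decode. The component names are taken from the text above; the dict, its ("backend", "role") keys, and the naive_matches helper are illustrative assumptions rather than anything in the Planner code.

```python
# Default MDC component names as described in this PR. The backend labels
# used as dict keys ("vllm", "sglang", ...) are hypothetical.
MDC_DEFAULT_COMPONENT = {
    ("vllm", "prefill"): "prefill",
    ("vllm", "decode"): "backend",
    ("sglang", "decode"): "backend",
    ("mocker", "decode"): "backend",
    ("trtllm", "decode"): "tensorrt_llm",
}


def naive_matches(backend: str, role: str) -> bool:
    """Naive filter: compare the MDC component against the role name itself."""
    return MDC_DEFAULT_COMPONENT[(backend, role)] == role


# Every (backend, role) pair the naive comparison would wrongly filter out.
mismatches = [key for key in MDC_DEFAULT_COMPONENT if not naive_matches(*key)]
```

Under these assumptions the prefill pair matches, while every decode pair ends up in mismatches.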

Test Plan

  • 10 new unit tests covering:
    • Prefill / vLLM decode ("backend") / TRT-LLM decode ("tensorrt_llm")
    • DGD-lookup-fails path
    • --endpoint ns.comp.ep override (+ dyn:// prefix variant)
    • Malformed --endpoint falls back to default
    • Service.get_component_name_from_endpoint_arg present / absent / missing-value edge cases
  • Full test_kubernetes_connector.py suite passes (50/50)
  • pre-commit (isort/black/flake8/ruff/codespell) clean on changed files
  • E2E validation on K8s — still needs end-to-end reproduction with QA's DGD YAML from DYN-2747 (Qwen3-8B disagg, planner active mode) to confirm context_length not available WARN is gone and Planner makes scaling decisions under load

Closes


coderabbitai Bot commented Apr 22, 2026

Walkthrough

Refactored _resolve_dgd_service to return a tuple containing both the DGD service name and the component name used for MDC filtering. The expected component is now computed earlier in the flow, and the docstring was updated to describe the tuple elements.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Function Refactoring: components/src/dynamo/planner/connectors/kubernetes.py | Updated _resolve_dgd_service to return a tuple (dgd_service_name, component_name_for_filter), computing the expected component name once from backend defaults and returning it consistently on both success and error paths. Docstring revised to clarify tuple elements. |
| Test Coverage: components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py | Added four new unit tests validating _resolve_dgd_service tuple return contract across vLLM prefill, vLLM decode, TRT-LLM decode, and error scenarios, confirming correct component name mapping to MDC values. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main fix: changing MDC component field matching from DGD key to backend default, which directly addresses the autoscaling breakage described in the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, addressing all key aspects: the problem (autoscaling silent breakage), root cause analysis, the fix with implementation details, and a detailed test plan. |

…DGD key

KubernetesConnector.get_worker_info()'s secondary filter was comparing the
MDC entry's 'component' field (written by the Rust runtime from the
registered Endpoint name, e.g. "prefill" / "backend" / "tensorrt_llm")
against service.name from the DGD (the spec.services dict key, typically
PascalCase like "VllmPrefillWorker"). These are fundamentally different
identifiers, so for every upstream example that uses a PascalCase services
key, the filter skipped every real MDC entry, fell back to defaults with
context_length=None, and emitted:

  WARN load_scaling._prefill_easy_decision: context_length not available,
       skipping easy prefill scaling

on every load tick -- silently breaking easy-mode autoscaling.

Fix _resolve_dgd_service() to return the component name the worker
actually writes to MDC as the filter identifier. Source of truth, in
priority order:

  1. Parse --endpoint <ns>.<component>.<ep> from the DGD container args
     (all three backends -- vllm/sglang/trtllm -- honor this override).
     Handled by the new Service.get_component_name_from_endpoint_arg().
  2. Backend-specific default from build_worker_info_from_defaults()
     ("prefill" / "backend" / "tensorrt_llm").

service.name (the PascalCase DGD key) is still returned as the first
tuple element for Kubernetes operations that need it (replica patches,
WorkerInfo.k8s_name).

Note: the naive fix of using sub_component_type.value ("prefill" /
"decode") would break decode filtering because MDC decode carries
backend-specific names ("backend" for vLLM/SGLang/Mocker, "tensorrt_llm"
for TRT-LLM), not "decode".

Added regression tests covering prefill, decode (vLLM "backend"), TRT-LLM
decode ("tensorrt_llm"), the DGD-lookup-fails path, the --endpoint user
override (with and without dyn:// prefix), and malformed --endpoint
fallback.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk force-pushed the hzhou/dyn-2747-worker-info-filter branch from 0ce4de2 to ed8e0b0 Compare April 22, 2026 02:31
@tedzhouhk tedzhouhk merged commit 2cd4288 into main Apr 22, 2026
68 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/dyn-2747-worker-info-filter branch April 22, 2026 15:53
nv-nmailhot pushed a commit that referenced this pull request Apr 22, 2026
…DGD key (cherry-pick of #8489) (#8512)

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>