fix(planner): match MDC component field against backend default, not DGD key by tedzhouhk · Pull Request #8489 · ai-dynamo/dynamo

tedzhouhk · 2026-04-22T02:19:51Z

Summary

Fixes silent autoscaling breakage in the Kubernetes Planner: the secondary filter in KubernetesConnector.get_worker_info() was comparing the MDC entry's component field (written by the Rust runtime from the registered Endpoint name — lowercase, e.g. "prefill" / "backend" / "tensorrt_llm") against service.name from the DGD (the spec.services dict key — typically PascalCase like "VllmPrefillWorker").

These are fundamentally different identifiers. For every upstream example (all of which use PascalCase services keys), the filter skipped every MDC entry, fell back to defaults with context_length=None, and the easy-mode load scaling loop emitted:

WARN load_scaling._prefill_easy_decision: context_length not available, skipping easy prefill scaling
WARN load_scaling._decode_easy_decision: max_kv_tokens not available, skipping easy decode scaling

on every 5-second adjustment interval — silently breaking autoscaling on every real-world 1.1.0 deployment. Inference correctness was unaffected, which is why this slipped past #8384 (that PR addressed a separate case-sensitivity bug in get_model_name, not this filter).
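The mismatch can be reproduced in a few lines. This is a minimal sketch, not the Planner's code: `MdcEntry`, `pick_context_length`, and the `32768` value are hypothetical stand-ins; only the identifier strings come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MdcEntry:
    """Hypothetical stand-in for one MDC record."""
    component: str
    context_length: Optional[int]


def pick_context_length(entries: List[MdcEntry], filter_name: str) -> Optional[int]:
    """Return context_length of the first entry whose component matches, else None."""
    for entry in entries:
        if entry.component == filter_name:
            return entry.context_length
    return None  # caller falls back to defaults, i.e. context_length=None


# Illustrative entries; the context_length value is made up.
entries = [MdcEntry("prefill", 32768), MdcEntry("backend", 32768)]

# Old behaviour: filter by the DGD services key, so nothing matches.
old = pick_context_length(entries, "VllmPrefillWorker")
# Fixed behaviour: filter by the component name the worker registers.
new = pick_context_length(entries, "prefill")
```

With the PascalCase key, `old` is `None`, which is the state that triggers the "context_length not available" warning; with the component name, `new` carries the real value.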

Fix

_resolve_dgd_service() now returns, as the filter identifier, the component name the worker actually writes to MDC. Source of truth, in priority order:

1. The user's --endpoint <ns>.<component>.<ep> override in the worker's container args. All three backends support this (vllm/args.py:171-176, sglang/args.py:428, trtllm/args.py:137). Handled by the new Service.get_component_name_from_endpoint_arg().
2. The backend-specific default from build_worker_info_from_defaults() ("prefill" / "backend" / "tensorrt_llm").
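The priority order above could be sketched roughly as follows. This is an illustration under assumptions, not the actual `Service.get_component_name_from_endpoint_arg()`: the function names, the handling of only the space-separated `--endpoint <value>` form, and the exact malformed-input rules are all hypothetical.

```python
from typing import List, Optional


def component_from_endpoint_arg(args: List[str]) -> Optional[str]:
    """Extract <component> from `--endpoint <ns>.<component>.<ep>`, or return None."""
    value = None
    for i, arg in enumerate(args):
        if arg == "--endpoint":
            # A trailing `--endpoint` with no value is treated as absent.
            if i + 1 < len(args):
                value = args[i + 1]
            break
    if value is None:
        return None
    # Accept the dyn:// prefix variant mentioned in the test plan.
    if value.startswith("dyn://"):
        value = value[len("dyn://"):]
    parts = value.split(".")
    # Anything other than three non-empty segments counts as malformed.
    if len(parts) != 3 or not all(parts):
        return None
    return parts[1]


def filter_component(args: List[str], backend_default: str) -> str:
    """Priority 1: --endpoint override. Priority 2: backend-specific default."""
    return component_from_endpoint_arg(args) or backend_default
```

For example, `filter_component(["--endpoint", "ns.custom.generate"], "backend")` yields `"custom"`, while a malformed or missing `--endpoint` yields the default that was passed in.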

service.name (the PascalCase DGD key) is still returned as the first tuple element for Kubernetes operations that need it (replica patches, WorkerInfo.k8s_name).
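The two-element return can be pictured with the sketch below. It is hypothetical throughout: the `subComponentType` lookup key, the plain-dict service specs, and returning `None` as the first element when the lookup fails are assumptions made for illustration, not the connector's actual behaviour.

```python
from typing import Dict, Optional, Tuple


def resolve_service(
    services: Dict[str, dict],
    sub_component_type: str,
    component_for_filter: str,
) -> Tuple[Optional[str], str]:
    """Return (dgd_service_name, component_name_for_filter).

    The first element is the DGD key used for Kubernetes operations;
    the second is the name compared against the MDC component field.
    """
    for name, spec in services.items():
        # Assumed lookup field; the real matching logic may differ.
        if spec.get("subComponentType") == sub_component_type:
            return name, component_for_filter
    # Lookup failed: still return the filter name so MDC filtering can proceed.
    return None, component_for_filter


services = {"VllmPrefillWorker": {"subComponentType": "prefill"}}
k8s_name, mdc_component = resolve_service(services, "prefill", "prefill")
```

Keeping the two names in separate tuple slots means replica patches keep using the PascalCase key while the MDC filter uses the runtime component name.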

Why not `sub_component_type.value`?

The naive fix QA suggested in the ticket (compare against sub_component_type.value — "prefill" / "decode") would work for prefill but break decode: MDC decode workers carry backend-specific names — "backend" for vLLM/SGLang/Mocker, "tensorrt_llm" for TRT-LLM — never "decode".
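A small check makes the failure mode concrete. The enum below is a hypothetical mirror of `sub_component_type`, and the backend labels are illustrative keys; the component strings are the ones quoted above.

```python
from enum import Enum


class SubComponentType(Enum):
    """Hypothetical mirror of the planner's sub-component enum."""
    PREFILL = "prefill"
    DECODE = "decode"


# MDC component names as described in this PR, keyed by (backend, role).
MDC_COMPONENT = {
    ("vllm", SubComponentType.PREFILL): "prefill",
    ("vllm", SubComponentType.DECODE): "backend",
    ("sglang", SubComponentType.DECODE): "backend",
    ("trtllm", SubComponentType.DECODE): "tensorrt_llm",
}

# The naive fix compares sub_component_type.value against the MDC component.
naive_matches = {
    key: key[1].value == component for key, component in MDC_COMPONENT.items()
}
```

Only the prefill row matches under the naive comparison; every decode row is `False`, so decode entries would still be filtered out.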

Test Plan

10 new unit tests covering:
- Prefill / vLLM decode ("backend") / TRT-LLM decode ("tensorrt_llm")
- DGD-lookup-fails path
- --endpoint ns.comp.ep override (+ dyn:// prefix variant)
- Malformed --endpoint falls back to default
- Service.get_component_name_from_endpoint_arg present / absent / missing-value edge cases
Full test_kubernetes_connector.py suite passes (50/50)
pre-commit (isort/black/flake8/ruff/codespell) clean on changed files
E2E validation on K8s: still needs end-to-end reproduction with QA's DGD YAML from DYN-2747 (Qwen3-8B disagg, planner active mode) to confirm the "context_length not available" WARN is gone and the Planner makes scaling decisions under load

Closes

DYN-2747 (real fix; the ticket was prematurely closed by #8384, "fix(planner): normalize model_name case in KubernetesConnector comparisons", which addressed an unrelated case-sensitivity bug)

coderabbitai · 2026-04-22T02:23:15Z

Walkthrough

Refactored _resolve_dgd_service to separate concerns by returning a tuple containing both the DGD service name and the component name for MDC filtering, computing the expected component earlier in the flow and updating related docstring documentation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Function Refactoring: `components/src/dynamo/planner/connectors/kubernetes.py` | Updated `_resolve_dgd_service` to return a tuple `(dgd_service_name, component_name_for_filter)`, computing the expected component name once from backend defaults and returning it consistently on both success and error paths. Docstring revised to clarify tuple elements. |
| Test Coverage: `components/src/dynamo/planner/tests/unit/test_kubernetes_connector.py` | Added four new unit tests validating `_resolve_dgd_service` tuple return contract across vLLM prefill, vLLM decode, TRT-LLM decode, and error scenarios, confirming correct component name mapping to MDC values. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main fix: changing MDC component field matching from DGD key to backend default, which directly addresses the autoscaling breakage described in the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, addressing all key aspects: the problem (autoscaling silent breakage), root cause analysis, the fix with implementation details, and a detailed test plan. |


…DGD key

KubernetesConnector.get_worker_info()'s secondary filter was comparing the MDC entry's 'component' field (written by the Rust runtime from the registered Endpoint name, e.g. "prefill" / "backend" / "tensorrt_llm") against service.name from the DGD (the spec.services dict key, typically PascalCase like "VllmPrefillWorker").

These are fundamentally different identifiers, so for every upstream example that uses a PascalCase services key, the filter skipped every real MDC entry, fell back to defaults with context_length=None, and emitted:

WARN load_scaling._prefill_easy_decision: context_length not available, skipping easy prefill scaling

on every load tick -- silently breaking easy-mode autoscaling.

Fix _resolve_dgd_service() to return the component name the worker actually writes to MDC as the filter identifier. Source of truth, in priority order:

1. Parse --endpoint <ns>.<component>.<ep> from the DGD container args (all three backends -- vllm/sglang/trtllm -- honor this override). Handled by the new Service.get_component_name_from_endpoint_arg().
2. Backend-specific default from build_worker_info_from_defaults() ("prefill" / "backend" / "tensorrt_llm").

service.name (the PascalCase DGD key) is still returned as the first tuple element for Kubernetes operations that need it (replica patches, WorkerInfo.k8s_name).

Note: the naive fix of using sub_component_type.value ("prefill" / "decode") would break decode filtering because MDC decode carries backend-specific names ("backend" for vLLM/SGLang/Mocker, "tensorrt_llm" for TRT-LLM), not "decode".

Added regression tests covering prefill, decode (vLLM "backend"), TRT-LLM decode ("tensorrt_llm"), the DGD-lookup-fails path, the --endpoint user override (with and without dyn:// prefix), and malformed --endpoint fallback.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>


tedzhouhk requested review from a team as code owners April 22, 2026 02:19

pull-request-size Bot added the size/L label Apr 22, 2026

github-actions Bot added fix planner labels Apr 22, 2026

tedzhouhk force-pushed the hzhou/dyn-2747-worker-info-filter branch from 0ce4de2 to ed8e0b0 Compare April 22, 2026 02:31

copy-pr-bot Bot temporarily deployed to GITLAB April 22, 2026 02:31 Inactive

PeaBrane approved these changes Apr 22, 2026


copy-pr-bot Bot temporarily deployed to GITLAB April 22, 2026 03:42 Inactive

tedzhouhk merged commit 2cd4288 into main Apr 22, 2026
68 checks passed

tedzhouhk deleted the hzhou/dyn-2747-worker-info-filter branch April 22, 2026 15:53

tedzhouhk mentioned this pull request Apr 22, 2026

fix(planner): match MDC component field against backend default, not DGD key (cherry-pick of #8489) #8512

Merged

4 tasks

nv-nmailhot pushed a commit that referenced this pull request Apr 22, 2026

fix(planner): match MDC component field against backend default, not …

e0269c0

…DGD key (cherry-pick of #8489) (#8512) Signed-off-by: hongkuanz <hongkuanz@nvidia.com>


tedzhouhk merged 1 commit into main from hzhou/dyn-2747-worker-info-filter


2 participants
