Skip to content

feat: OAI compatible endpoints for TRTLLM#14

Merged
NVShreyas merged 62 commits into
mainfrom
shreyasm/trtllm-http
Mar 5, 2025
Merged

feat: OAI compatible endpoints for TRTLLM#14
NVShreyas merged 62 commits into
mainfrom
shreyasm/trtllm-http

Conversation

@NVShreyas
Copy link
Copy Markdown
Contributor

What does the PR do?

  • Adds OAI compatible HTTP endpoints to the TRTLLM example.
    • Integrates chat completions from Tanmay's branch.
  • Refactor some percent of duplicate code
  • Refactor input of LLMAPI to make it more extensible

Comment thread examples/python_rs/llm/tensorrt_llm/monolith/worker.py Outdated
Copy link
Copy Markdown
Contributor

@rmccorm4 rmccorm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, but left a few comments on some places for future cleanup, magic numbers, and README nits

Nice work!

Hopefully a lot of the OpenAI related processing could be simplified or removed in the future by using the Rust Preprocessor.

Copy link
Copy Markdown
Contributor

@rmccorm4 rmccorm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 🚀

@NVShreyas NVShreyas merged commit b7cca38 into main Mar 5, 2025
@NVShreyas NVShreyas deleted the shreyasm/trtllm-http branch March 5, 2025 01:24
kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
Co-authored-by: Tanmay Verma <tanmayv@nvidia.com>
Co-authored-by: Tanmay Verma <tanmayv@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
Adds power-aware intelligence to the Dynamo planner across three layers,
plus tooling that pipecleans the Phase 4 selection path with currently
available single-TDP AIC data. Reimplementation of unmerged PR #5280
against post-refactor ToT (planner/utils/ no longer exists; configuration
is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1 — infrastructure
  - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
    Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls
    nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop.
    Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  - Pod annotation actuation: connectors/kubernetes.py writes
    dynamo.nvidia.com/gpu-power-limit on worker pods.
  - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  - Operator chart fix: ClusterRole now grants `pods` permissions
    (deploy/helm/charts/platform/components/operator/templates/planner.yaml,
    chart bumped to 1.2.1). Older clusters can apply
    deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement
  - _apply_power_budget() in core/state_machine.py clamps replica
    scaling within a watt budget. New PlannerConfig fields:
    enable_power_awareness, total_gpu_power_limit (required when
    awareness enabled — validator enforces, no silent default),
    prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer
  - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks
    (replica, power) configs that maximise throughput within the watt
    budget. Per-component EMA correction (c_ttft, c_itl, c_power_p,
    c_power_d for disagg; c_power_agg for agg) closes the loop on
    silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded
    re-optimize, auto-disable on consecutive failures.
  - core/base.py: NativePlannerBase plumbing (admission threshold push,
    named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)
  - tools/integrate_aic_power_data.py: copies measured H200/B200 power
    data into an AIC checkout and reinstalls.
  - tools/validate_aic_power_integration.py: asserts estimate_perf()
    returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  - tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  - examples/deployments/powerplanner/: PIPECLEAN.md runbook,
    disagg-{power-aware,conservative-cold-start}.yaml,
    verify_poweraware.bash (8-section health check inc. Phase 4 preview),
    MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------
  - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner
    couldn't list workers or patch annotations. Fixed in chart template;
    workaround manifest provided for clusters not yet upgraded.
  - Stale label selectors in connectors/kubernetes.py:
    `dynamo-graph-deployment` → `dynamo-graph-deployment-name`,
    `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
  - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the
    wrong namespace; corrected to `exported_pod=~` +
    `exported_namespace=<k8s-namespace>` and pod regex now matches the
    operator's `<dgd>-<replica-idx>-<service-key>-...` format
    (monitoring/traffic_metrics.py).
  - Frontend-source metric methods returned NaN on quiet clusters
    (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN
    guard, frontend didn't — added matching `math.isnan` filter.

Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS
dev cluster (qwen3-quickstart DGD, single dev namespace):

  Suite                                                  Pass  Skip  Fail
  ----------------------------------------------------   ----  ----  ----
  planner/tests/unit/                                     465     0     0
  planner/tests/integration/test_aic_power_e2e_sim       15     0     0
  planner/tests/integration/test_aic_power_optimizer     34     0     0
  planner/tests/integration/test_metric_paths_live       22-23  3-2    0
  planner/tests/integration/test_actuation_knobs_live    10-11  1-0    0
  power_agent/tests/                                      43     0     0
  ----------------------------------------------------   ----  ----  ----
  Total (cold deploy / after sanity chat completion)    590-591  4-3    0

Documented skips:
  - test_frontend_metric_series_exists — passes after any traffic.
  - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose
    dynamo_frontend_worker_* gauges (open-question #14 in design doc).
  - TestScalingRealMutation — opt-in disruptive test, gated by
    RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
              "From-Scratch Repro Script" section.

Compatibility
-------------
Power awareness is opt-in (enable_power_awareness=False by default).
AIC optimizer is opt-in (enable_aic_optimizer=False by default).
No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation)
remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4
pipeclean code path so it can be exercised before AIC ships the full
sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
kaim-eng added a commit that referenced this pull request May 12, 2026
Adds power-aware intelligence to the Dynamo planner across three layers,
plus tooling that pipecleans the Phase 4 selection path with currently
available single-TDP AIC data. Reimplementation of unmerged PR #5280
against post-refactor ToT (planner/utils/ no longer exists; configuration
is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1 — infrastructure
  - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
    Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls
    nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop.
    Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  - Pod annotation actuation: connectors/kubernetes.py writes
    dynamo.nvidia.com/gpu-power-limit on worker pods.
  - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  - Operator chart fix: ClusterRole now grants `pods` permissions
    (deploy/helm/charts/platform/components/operator/templates/planner.yaml,
    chart bumped to 1.2.1). Older clusters can apply
    deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement
  - _apply_power_budget() in core/state_machine.py clamps replica
    scaling within a watt budget. New PlannerConfig fields:
    enable_power_awareness, total_gpu_power_limit (required when
    awareness enabled — validator enforces, no silent default),
    prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer
  - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks
    (replica, power) configs that maximise throughput within the watt
    budget. Per-component EMA correction (c_ttft, c_itl, c_power_p,
    c_power_d for disagg; c_power_agg for agg) closes the loop on
    silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded
    re-optimize, auto-disable on consecutive failures.
  - core/base.py: NativePlannerBase plumbing (admission threshold push,
    named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)
  - tools/integrate_aic_power_data.py: copies measured H200/B200 power
    data into an AIC checkout and reinstalls.
  - tools/validate_aic_power_integration.py: asserts estimate_perf()
    returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  - tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  - examples/deployments/powerplanner/: PIPECLEAN.md runbook,
    disagg-{power-aware,conservative-cold-start}.yaml,
    verify_poweraware.bash (8-section health check inc. Phase 4 preview),
    MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------
  - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner
    couldn't list workers or patch annotations. Fixed in chart template;
    workaround manifest provided for clusters not yet upgraded.
  - Stale label selectors in connectors/kubernetes.py:
    `dynamo-graph-deployment` → `dynamo-graph-deployment-name`,
    `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
  - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the
    wrong namespace; corrected to `exported_pod=~` +
    `exported_namespace=<k8s-namespace>` and pod regex now matches the
    operator's `<dgd>-<replica-idx>-<service-key>-...` format
    (monitoring/traffic_metrics.py).
  - Frontend-source metric methods returned NaN on quiet clusters
    (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN
    guard, frontend didn't — added matching `math.isnan` filter.

Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS
dev cluster (qwen3-quickstart DGD, single dev namespace):

  Suite                                                  Pass  Skip  Fail
  ----------------------------------------------------   ----  ----  ----
  planner/tests/unit/                                     465     0     0
  planner/tests/integration/test_aic_power_e2e_sim       15     0     0
  planner/tests/integration/test_aic_power_optimizer     34     0     0
  planner/tests/integration/test_metric_paths_live       22-23  3-2    0
  planner/tests/integration/test_actuation_knobs_live    10-11  1-0    0
  power_agent/tests/                                      43     0     0
  ----------------------------------------------------   ----  ----  ----
  Total (cold deploy / after sanity chat completion)    590-591  4-3    0

Documented skips:
  - test_frontend_metric_series_exists — passes after any traffic.
  - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose
    dynamo_frontend_worker_* gauges (open-question #14 in design doc).
  - TestScalingRealMutation — opt-in disruptive test, gated by
    RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
              "From-Scratch Repro Script" section.

Compatibility
-------------
Power awareness is opt-in (enable_power_awareness=False by default).
AIC optimizer is opt-in (enable_aic_optimizer=False by default).
No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation)
remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4
pipeclean code path so it can be exercised before AIC ships the full
sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
kaim-eng added a commit that referenced this pull request May 18, 2026
Adds power-aware intelligence to the Dynamo planner across three layers,
plus tooling that pipecleans the Phase 4 selection path with currently
available single-TDP AIC data. Reimplementation of unmerged PR #5280
against post-refactor ToT (planner/utils/ no longer exists; configuration
is Pydantic; layout is config/, connectors/, core/, monitoring/).

Full design: docs/design-docs/powerplanner-design.md
Repro / dev environment: docs/components/planner/dpp-dev-env.md

What's in this PR
-----------------

Phase 1 — infrastructure
  - Power Agent DaemonSet: components/power_agent/{power_agent.py,tests/}
    Watches pod annotations, maps cgroup→pod_uid→GPU via NVML, calls
    nvmlDeviceSetPowerManagementLimit on a 15 s reconciliation loop.
    Multi-pod-per-GPU policy, fail-closed safe-default, SIGTERM restore.
  - Pod annotation actuation: connectors/kubernetes.py writes
    dynamo.nvidia.com/gpu-power-limit on worker pods.
  - K8s manifests: deploy/power_agent/{daemonset,rbac}.yaml.
  - Operator chart fix: ClusterRole now grants `pods` permissions
    (deploy/helm/charts/platform/components/operator/templates/planner.yaml,
    chart bumped to 1.2.1). Older clusters can apply
    deploy/planner-pod-rbac-dev.yaml as a temporary patch.

Phase 2 — budget enforcement
  - _apply_power_budget() in core/state_machine.py clamps replica
    scaling within a watt budget. New PlannerConfig fields:
    enable_power_awareness, total_gpu_power_limit (required when
    awareness enabled — validator enforces, no silent default),
    prefill/decode_engine_gpu_power_limit, power_agent_safe_default_watts.

Phase 3 — AIC optimizer
  - monitoring/aic_power_optimizer.py: AIConfigurator sweep picks
    (replica, power) configs that maximise throughput within the watt
    budget. Per-component EMA correction (c_ttft, c_itl, c_power_p,
    c_power_d for disagg; c_power_agg for agg) closes the loop on
    silicon-vs-AIC drift. SLA gate, budget gate, capacity-exceeded
    re-optimize, auto-disable on consecutive failures.
  - core/base.py: NativePlannerBase plumbing (admission threshold push,
    named-port resolution, AIC integration).

Phase 4 — pipeclean (using existing single-TDP data)
  - tools/integrate_aic_power_data.py: copies measured H200/B200 power
    data into an AIC checkout and reinstalls.
  - tools/validate_aic_power_integration.py: asserts estimate_perf()
    returns power_w in [100, 710] W (not 0.0, not 700.0 TDP fallback).
  - tools/compute_power_budget.py: rack-capacity → safe-budget helper.
  - examples/deployments/powerplanner/: PIPECLEAN.md runbook,
    disagg-{power-aware,conservative-cold-start}.yaml,
    verify_poweraware.bash (8-section health check inc. Phase 4 preview),
    MULTI_DGD.md operator playbook, README.md.

Bug fixes (found by live integration tests, fixed in this patchset)
-------------------------------------------------------------------
  - Operator chart 1.2.0 ClusterRole had no `pods` rules → planner
    couldn't list workers or patch annotations. Fixed in chart template;
    workaround manifest provided for clusters not yet upgraded.
  - Stale label selectors in connectors/kubernetes.py:
    `dynamo-graph-deployment` → `dynamo-graph-deployment-name`,
    `dynamo-service` → `dynamo-component`/`dynamo-component-type`.
  - DCGM queries used `pod=~` (matched the DCGM exporter pod) and the
    wrong namespace; corrected to `exported_pod=~` +
    `exported_namespace=<k8s-namespace>` and pod regex now matches the
    operator's `<dgd>-<replica-idx>-<service-key>-...` format
    (monitoring/traffic_metrics.py).
  - Frontend-source metric methods returned NaN on quiet clusters
    (0/0 in `increase(_sum)/increase(_count)`); router path had a NaN
    guard, frontend didn't — added matching `math.isnan` filter.

Tests
-----

Verified 2026-05-10 on a clean from-scratch repro on a live Azure AKS
dev cluster (qwen3-quickstart DGD, single dev namespace):

  Suite                                                  Pass  Skip  Fail
  ----------------------------------------------------   ----  ----  ----
  planner/tests/unit/                                     465     0     0
  planner/tests/integration/test_aic_power_e2e_sim       15     0     0
  planner/tests/integration/test_aic_power_optimizer     34     0     0
  planner/tests/integration/test_metric_paths_live       22-23  3-2    0
  planner/tests/integration/test_actuation_knobs_live    10-11  1-0    0
  power_agent/tests/                                      43     0     0
  ----------------------------------------------------   ----  ----  ----
  Total (cold deploy / after sanity chat completion)    590-591  4-3    0

Documented skips:
  - test_frontend_metric_series_exists — passes after any traffic.
  - TestDirectRouterMetricsClientLive::* — LocalRouter doesn't expose
    dynamo_frontend_worker_* gauges (open-question #14 in design doc).
  - TestScalingRealMutation — opt-in disruptive test, gated by
    RUN_DISRUPTIVE_TESTS=1 (passes when enabled).

Repro recipe: docs/components/planner/dpp-dev-env.md
              "From-Scratch Repro Script" section.

Compatibility
-------------
Power awareness is opt-in (enable_power_awareness=False by default).
AIC optimizer is opt-in (enable_aic_optimizer=False by default).
No existing planner behavior changes when both flags are off.

Phase 4 (multi-power-level AIC sweep) and Phase 5 (silicon validation)
remain follow-up work; this PR delivers Phases 1–3 plus the Phase 4
pipeclean code path so it can be exercised before AIC ships the full
sweep API.

Signed-off-by: Kai Ma <kaim@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Kai Ma <kaim@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants