feat(planner): add advisory mode for scaling decisions by tedzhouhk · Pull Request #8244 · ai-dynamo/dynamo

tedzhouhk · 2026-04-15T20:08:28Z

Summary

- Add a `scaling_mode` config field (`active` / `advisory`) so operators can observe planner decisions before enabling auto-scaling.
- In advisory mode the full pipeline runs identically (data collection, state machine, Prometheus metrics, Plotly HTML reports); only `_apply_scaling_targets` is skipped.
- Add a periodic `[summary]` log line (both modes) with a structured one-line digest: action, current vs. recommended replicas, deltas, reasons, estimated latencies. It is throttled by `advisory_log_interval` (default 60s).
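The throttled `[summary]` line can be pictured with a minimal sketch. The class name `SummaryThrottle`, the helper `format_summary`, and the exact digest layout are assumptions made for illustration and are not taken from `base.py`; only the `_last_advisory_log_s` field name and the 60-second default come from this PR. Estimated latencies are left out to keep the sketch short.

```python
import time
from typing import Callable, Mapping, Optional, Sequence


class SummaryThrottle:
    """Allow at most one summary emission per `interval_s` seconds (illustrative)."""

    def __init__(self, interval_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval_s = interval_s
        self._clock = clock
        # Timestamp of the last emitted summary; None until the first emission.
        self._last_advisory_log_s: Optional[float] = None

    def should_emit(self) -> bool:
        now = self._clock()
        last = self._last_advisory_log_s
        if last is not None and now - last < self.interval_s:
            return False
        self._last_advisory_log_s = now
        return True


def format_summary(action: str,
                   current: Mapping[str, int],
                   recommended: Mapping[str, int],
                   reasons: Sequence[str]) -> str:
    """Build a one-line digest; the field layout here is hypothetical."""
    parts = [f"[summary] action={action}"]
    for role in current:
        delta = recommended[role] - current[role]
        parts.append(f"{role}={current[role]}->{recommended[role]}({delta:+d})")
    parts.append("reasons=" + ",".join(reasons))
    return " ".join(parts)
```

Injecting the clock keeps the throttle testable without sleeping; a run loop would call `should_emit()` once per tick and log `format_summary(...)` only when it returns `True`.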

Changed files

| File | Change |
| --- | --- |
| `planner/config/defaults.py` | `ScalingMode` enum (active/advisory) + defaults |
| `planner/config/planner_config.py` | `scaling_mode` and `advisory_log_interval` config fields |
| `planner/core/base.py` | Advisory guard in `_apply_scaling_targets`, `_log_decision_summary`, startup mode log |
| `planner/tests/unit/test_advisory_mode.py` | 19 unit tests |

Design

Advisory mode intercepts at the narrowest point — _apply_scaling_targets — so all subclass _apply_effects logic (Prometheus metrics like predicted_num_*_replicas) still runs. This mirrors the existing no_operation guard pattern.
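A minimal sketch of the guard placement described above, under stated assumptions: `SketchPlanner`, `RecordingConnector`, and its `scale_to` method are hypothetical stand-ins, and a plain dict stands in for the Prometheus gauges. It is meant only to show why intercepting inside `_apply_scaling_targets` leaves the `_apply_effects` side effects intact.

```python
import asyncio
import logging
from enum import Enum

logger = logging.getLogger("planner.sketch")


class ScalingMode(str, Enum):
    ACTIVE = "active"
    ADVISORY = "advisory"


class RecordingConnector:
    """Hypothetical connector stand-in that records requested targets."""

    def __init__(self) -> None:
        self.calls = []

    async def scale_to(self, targets: dict) -> None:
        self.calls.append(dict(targets))


class SketchPlanner:
    def __init__(self, scaling_mode: ScalingMode, connector: RecordingConnector) -> None:
        self.scaling_mode = scaling_mode
        self.connector = connector
        self.predicted = {}  # stands in for predicted_num_*_replicas gauges

    async def _apply_effects(self, targets: dict) -> None:
        # Runs in both modes: the recommendation is always published.
        self.predicted = dict(targets)
        await self._apply_scaling_targets(targets)

    async def _apply_scaling_targets(self, targets: dict) -> None:
        # Narrowest interception point: only the connector call is skipped.
        if self.scaling_mode is ScalingMode.ADVISORY:
            logger.info("advisory mode: would scale to %s", targets)
            return
        await self.connector.scale_to(targets)
```

With this shape, an advisory run still updates `predicted` while `connector.calls` stays empty.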

Test plan

- 19 new unit tests covering enum, config, guard logic, action classification (hold/scale_up/scale_down/rebalance), startup behavior
- 20 existing `test_planner_config.py` tests pass (no regressions)
- ruff + pre-commit lint clean

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added advisory operating mode that computes and logs scaling recommendations without executing scaling actions.
- Introduced configurable logging interval for advisory mode decision reports, allowing periodic review of scaling decisions.
- Added startup indication in logs to distinguish between active and advisory operation modes.
Tests
- Added comprehensive tests for advisory mode configuration and decision classification.

Add a `scaling_mode` config field (active/advisory) that lets operators observe what the planner would do before enabling auto-scaling. In advisory mode the full pipeline runs identically to active mode (data collection, state machine, regression models, diagnostics, Prometheus metrics, Plotly HTML reports); only the actual connector call in `_apply_scaling_targets` is skipped. A periodic `[summary]` log line is emitted in both modes with a structured one-line digest of the decision (action, current vs. recommended replicas, deltas, reasons, estimated latencies), throttled by `advisory_log_interval` (default 60 s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

coderabbitai · 2026-04-15T20:11:03Z

Walkthrough

Added a new ScalingMode enum (ACTIVE/ADVISORY) to control planner operating modes. Extended configuration schemas in defaults and planner config with scaling_mode and advisory_log_interval fields. Implemented advisory mode logic in the planner core to skip connector updates and log throttled decision summaries during operation.
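For orientation, the configuration surface at this revision might look roughly like the sketch below. It uses a plain dataclass named `PlannerConfigSketch`, which is an assumption for illustration; the real `PlannerConfig` and `SLAPlannerDefaults` may be built differently. Only the field names and the defaults (ACTIVE, 60 seconds) come from the walkthrough.

```python
from dataclasses import dataclass
from enum import Enum


class ScalingMode(str, Enum):
    ACTIVE = "active"
    ADVISORY = "advisory"


@dataclass
class PlannerConfigSketch:
    """Illustrative config holder, not the real PlannerConfig."""

    scaling_mode: ScalingMode = ScalingMode.ACTIVE
    advisory_log_interval: float = 60.0  # seconds between [summary] lines

    def __post_init__(self) -> None:
        # Accept either an enum member or its string value ("active"/"advisory");
        # an unknown string raises ValueError from the Enum lookup.
        self.scaling_mode = ScalingMode(self.scaling_mode)
        if self.advisory_log_interval <= 0:
            raise ValueError("advisory_log_interval must be positive")
```

Coercing through the enum constructor means a config file value such as `scaling_mode: advisory` is validated at load time rather than at the first scaling decision.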

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration: `components/src/dynamo/planner/config/defaults.py`, `components/src/dynamo/planner/config/planner_config.py` | Added `ScalingMode` enum with ACTIVE and ADVISORY members. Extended `SLAPlannerDefaults` and `PlannerConfig` with `scaling_mode` (defaulting to ACTIVE) and `advisory_log_interval` (defaulting to 60 seconds) configuration fields. |
| Core Planner Logic: `components/src/dynamo/planner/core/base.py` | Updated imports for `ScalingMode`; added `_last_advisory_log_s` throttle field; branched startup logging based on scaling mode; modified `_apply_scaling_targets` to skip connector replica updates in ADVISORY mode; implemented `_log_decision_summary()` method to classify actions and log throttled decision summaries in the main run loop. |
| Advisory Mode Tests: `components/src/dynamo/planner/tests/unit/test_advisory_mode.py` | New test module with mocked Rust dependencies validating `ScalingMode` enum correctness, default configuration values, planner config field acceptance, action classification logic (hold/scale_up/scale_down/rebalance), and startup behavior across advisory and active modes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(planner): add advisory mode for scaling decisions' clearly and concisely summarizes the main change: introducing an advisory mode feature for the planner's scaling behavior. |

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms

Remove the `no_operation` boolean from `PlannerConfig`/defaults and absorb its role into the `ScalingMode` enum (active/advisory). Advisory mode now goes through the full startup path (connector, deployment validation, worker discovery, FPM subscribers); only the actual scaling call in `_apply_scaling_targets` is skipped. This is a breaking config change: users who had `no_operation=true` should switch to `scaling_mode=advisory`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
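For operators carrying older config files, the switch described in this commit could be scripted along the lines below. `migrate_no_operation` is a hypothetical helper, not something this PR ships, and it treats the config as a plain dict. Note that the two settings are not behaviorally identical, since advisory mode runs the full startup path.

```python
def migrate_no_operation(raw: dict) -> dict:
    """Return a copy of `raw` with the legacy `no_operation` key translated.

    Hypothetical helper: `no_operation=true` becomes `scaling_mode=advisory`
    unless the config already sets `scaling_mode` explicitly.
    """
    cfg = dict(raw)  # do not mutate the caller's mapping
    if "no_operation" in cfg:
        legacy = cfg.pop("no_operation")
        if legacy:
            cfg.setdefault("scaling_mode", "advisory")
    return cfg
```

Letting an explicit `scaling_mode` win over the legacy flag is a design choice of this sketch; a stricter variant could raise on the conflict instead.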

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

components/src/dynamo/planner/tests/unit/test_advisory_mode.py (2)
186-215: These tests are tautological—they test Python boolean logic, not planner behavior.

Tests like `test_advisory_mode_validates_deployment` simply verify that `if not False: flag = True` sets the flag. They don't actually invoke any planner code or mock `_async_init`. Consider either:

- Removing these tests (the logic is trivially correct), or
- Refactoring to actually test `NativePlannerBase._async_init` with mocked dependencies

As written, they provide minimal confidence that the startup path works correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py` around lines
186 - 215, The three tautological tests
(test_advisory_mode_validates_deployment, test_active_mode_validates_deployment,
test_no_operation_skips_validation) should be replaced or removed: either delete
them, or refactor to actually exercise NativePlannerBase._async_init by
importing the class, instantiating a planner (or a minimal subclass), and
invoking _async_init with no_operation True/False while mocking/stubbing its
external dependencies and the validation call (e.g., mock the method that
performs deployment validation) so the tests assert the planner invoked
validation when no_operation is False and skipped it when True.
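As a rough illustration of the refactor suggested here, the tests below drive an async startup method and assert on a mocked dependency instead of on a local flag. `StartupSketch` and the connector method name `validate_deployment` are assumptions for the sketch; a real test would import `NativePlannerBase` and stub its actual dependencies.

```python
import asyncio
from unittest.mock import AsyncMock


class StartupSketch:
    """Hypothetical stand-in for the planner's startup path."""

    def __init__(self, no_operation: bool, connector) -> None:
        self.no_operation = no_operation
        self.connector = connector

    async def _async_init(self) -> None:
        # Deployment validation is skipped only when no_operation is set.
        if not self.no_operation:
            await self.connector.validate_deployment()


def run_startup(no_operation: bool) -> AsyncMock:
    connector = AsyncMock()  # every attribute is an awaitable mock
    asyncio.run(StartupSketch(no_operation, connector)._async_init())
    return connector


def test_validation_runs_when_enabled() -> None:
    run_startup(no_operation=False).validate_deployment.assert_awaited_once()


def test_validation_skipped_when_no_operation() -> None:
    run_startup(no_operation=True).validate_deployment.assert_not_awaited()
```

The assertions now fail if the code under test stops calling (or starts calling) validation, which is the confidence the original flag-based tests could not provide.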
113-121: Helper duplicates logic from base.py—consider importing or refactoring.

`_classify_action` mirrors the action classification in `_log_decision_summary`. If the logic in `base.py` changes, this helper could drift out of sync. Consider either:

- Extracting the classification logic to a shared utility that both can import, or
- Directly testing `_log_decision_summary` via mocking (if feasible)

For now this is acceptable since it's explicitly documented as mirroring the base logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py` around lines
113 - 121, The test helper _classify_action duplicates the action classification
logic implemented in _log_decision_summary (in base.py), which can drift if base
logic changes; either move the shared logic into a single utility function that
both the test and base._log_decision_summary import (e.g., extract to a new
classify_action helper used by both), or remove this duplicate and update tests
to call/mock base._log_decision_summary directly; update imports and references
accordingly so _classify_action is eliminated or delegated to the shared
classify function.
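One possible shape for the shared helper is sketched below. The classification rule is an assumption inferred from the four action names (hold/scale_up/scale_down/rebalance); the actual rule in `_log_decision_summary` is not shown in this thread and may differ.

```python
from typing import Mapping


def classify_action(current: Mapping[str, int],
                    recommended: Mapping[str, int]) -> str:
    """Classify a scaling decision from per-role replica counts.

    Assumed rule: mixed-sign deltas are a rebalance, any increase alone is
    scale_up, any decrease alone is scale_down, and no change is hold.
    """
    deltas = [recommended[role] - current[role] for role in current]
    ups = any(d > 0 for d in deltas)
    downs = any(d < 0 for d in deltas)
    if ups and downs:
        return "rebalance"
    if ups:
        return "scale_up"
    if downs:
        return "scale_down"
    return "hold"
```

If both `base.py` and the test module import a single function like this, the test exercises the production logic directly and cannot drift from it.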
components/src/dynamo/planner/core/base.py (1)
631-633: Consider adding public accessors to PlannerStateMachine for worker counts.

Accessing private attributes _num_p_workers and _num_d_workers directly couples this code to the internal implementation of the state machine. These attributes lack public properties or methods despite being accessed across multiple files; adding them would improve encapsulation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/base.py` around lines 631 - 633, The code
reads private attributes _num_p_workers and _num_d_workers from the
PlannerStateMachine instance (referenced as sm/state_machine), which breaks
encapsulation; add public accessor properties or methods on PlannerStateMachine
(e.g., num_p_workers and num_d_workers or
get_num_p_workers()/get_num_d_workers()) and update this code to call those
accessors (replace sm._num_p_workers and sm._num_d_workers with sm.num_p_workers
/ sm.num_d_workers or the getter methods) so other modules use the public API
instead of private attributes.
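The accessor suggestion amounts to something like the following. `PlannerStateMachineSketch` is a stand-in with only the two counters; the property names `num_p_workers` and `num_d_workers` follow the reviewer's example and are not confirmed to exist in the codebase.

```python
class PlannerStateMachineSketch:
    """Stand-in showing read-only accessors over private worker counts."""

    def __init__(self, num_p_workers: int = 0, num_d_workers: int = 0) -> None:
        self._num_p_workers = num_p_workers
        self._num_d_workers = num_d_workers

    @property
    def num_p_workers(self) -> int:
        """Current number of prefill workers (read-only)."""
        return self._num_p_workers

    @property
    def num_d_workers(self) -> int:
        """Current number of decode workers (read-only)."""
        return self._num_d_workers
```

Callers would then read `sm.num_p_workers` instead of `sm._num_p_workers`; because no setter is defined, assignment from outside raises `AttributeError`.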

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py`:
- Around line 50-53: The test module's pytest mark list (pytestmark) is missing
the required test type marker; update the pytestmark variable to include
pytest.mark.unit alongside pytest.mark.gpu_0 and pytest.mark.pre_merge so the
test is properly classified as a unit test (modify the pytestmark list in
test_advisory_mode.py to add pytest.mark.unit).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c92f1510-89cb-46d9-a628-62b783c9f28b

📥 Commits

Reviewing files that changed from the base of the PR and between bba26d5 and 6cbb28c.

📒 Files selected for processing (4)

components/src/dynamo/planner/config/defaults.py
components/src/dynamo/planner/config/planner_config.py
components/src/dynamo/planner/core/base.py
components/src/dynamo/planner/tests/unit/test_advisory_mode.py

Replace the `ScalingMode` enum + `advisory_log_interval` with a simple `advisory: bool` config flag (same pattern as the old `no_operation`). The `[summary]` log line now prints unconditionally after every tick in both modes, so no throttling config is needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

The replay planner sets advisory mode so it logs decisions without executing scaling. Missed during the `no_operation` → `advisory` rename.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Wire `DiagnosticsRecorder` into `ReplayPlannerAdapter` so offline planner-in-the-loop replay generates the same interactive Plotly HTML report as the live planner. Also hide per-engine FPM traces from the legend to prevent overlap with chart panels. Fix CI test collection errors by adding import guards for `forward_pass_metrics` and `dynamo.llm` bindings in planner/profiler unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

Allow setting a fixed filename for HTML diagnostics reports via the `report_filename` planner config field. Useful for replay iteration where you want to refresh the same file in the browser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

tedzhouhk requested review from a team as code owners April 15, 2026 20:08

pull-request-size Bot added the size/L label Apr 15, 2026

github-actions Bot added feat planner labels Apr 15, 2026

pc

6cbb28c

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 20:10 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 20:16 Inactive

coderabbitai Bot reviewed Apr 15, 2026

Comment thread components/src/dynamo/planner/tests/unit/test_advisory_mode.py

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 20:19 Inactive

tedzhouhk enabled auto-merge (squash) April 15, 2026 20:27

fix(planner): add pytest.mark.unit marker to advisory mode tests

c9e3a95

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 20:44 Inactive

PeaBrane approved these changes Apr 15, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 21:50 Inactive

Merge branch 'main' into hzhou/planner-advisory-mode

dfce480

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 21:56 Inactive

tedzhouhk disabled auto-merge April 15, 2026 22:07

copy-pr-bot Bot temporarily deployed to GITLAB April 15, 2026 23:02 Inactive

tedzhouhk enabled auto-merge (squash) April 15, 2026 23:04

copy-pr-bot Bot had a problem deploying to GITLAB April 15, 2026 23:04 Failure

tedzhouhk disabled auto-merge April 15, 2026 23:38

tedzhouhk enabled auto-merge (squash) April 15, 2026 23:46

tedzhouhk merged commit 3da6f4d into main Apr 15, 2026
82 of 83 checks passed

tedzhouhk deleted the hzhou/planner-advisory-mode branch April 15, 2026 23:52

brluobt mentioned this pull request Apr 20, 2026

bug(planner): KubernetesConnector.get_model_name() case-mismatch causes active mode CrashLoopBackOff #8359

Closed

dmitry-tokarev-nv mentioned this pull request Apr 20, 2026

fix(planner/tests): scope test_advisory_mode stubs so they don't leak #8418

Merged

3 tasks

brluobt mentioned this pull request May 7, 2026

feat(planner): Add Advisory Mode for Scaling Decisions #8114

Closed

5 tasks

tedzhouhk merged 9 commits into main from hzhou/planner-advisory-mode

2 participants
