
feat(planner): add advisory mode for scaling decisions #8244

Merged: tedzhouhk merged 9 commits into main from hzhou/planner-advisory-mode on Apr 15, 2026


@tedzhouhk tedzhouhk commented Apr 15, 2026

Summary

  • Add scaling_mode config field (active / advisory) so operators can observe planner decisions before enabling auto-scaling
  • In advisory mode the full pipeline runs identically (data collection, state machine, Prometheus metrics, Plotly HTML reports) — only _apply_scaling_targets is skipped
  • Add periodic [summary] log line (both modes) with structured one-line digest: action, current vs recommended replicas, deltas, reasons, estimated latencies — throttled by advisory_log_interval (default 60s)
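A minimal sketch of how a throttled `[summary]` line could be emitted, assuming a monotonic clock and a simple key=value format. The `SummaryLogger` class, its method names, and the exact log fields are hypothetical and not taken from the PR; only `advisory_log_interval` and its 60 s default come from the description above.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("planner.summary")


class SummaryLogger:
    """Hypothetical helper: emits at most one [summary] line per interval."""

    def __init__(
        self,
        interval_s: float = 60.0,  # mirrors advisory_log_interval (default 60 s)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval_s = interval_s
        self._clock = clock
        self._last_log_s: Optional[float] = None

    def maybe_log(self, action: str, current: int, recommended: int, reason: str) -> bool:
        """Log the digest if the interval has elapsed; return True when a line was emitted."""
        now = self._clock()
        if self._last_log_s is not None and now - self._last_log_s < self.interval_s:
            return False  # throttled: the previous line is still recent
        self._last_log_s = now
        logger.info(
            "[summary] action=%s replicas=%d->%d delta=%+d reason=%s",
            action, current, recommended, recommended - current, reason,
        )
        return True
```

Injecting the clock keeps the throttle testable without sleeping; a test can advance a fake clock and check which calls emit a line.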

Changed files

| File | Change |
|---|---|
| planner/config/defaults.py | ScalingMode enum (active/advisory) + defaults |
| planner/config/planner_config.py | scaling_mode and advisory_log_interval config fields |
| planner/core/base.py | Advisory guard in _apply_scaling_targets, _log_decision_summary, startup mode log |
| planner/tests/unit/test_advisory_mode.py | 19 unit tests |
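For orientation, the enum and the two config fields could be shaped roughly as below. This is a sketch only: `PlannerConfigSketch` is a hypothetical dataclass stand-in for the real `PlannerConfig` (whose base class and validation are not shown in this PR page), and making the enum string-valued is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class ScalingMode(str, Enum):
    """Operating modes named in the PR; the string values are an assumption."""

    ACTIVE = "active"      # planner applies its scaling decisions
    ADVISORY = "advisory"  # planner computes and logs decisions only


@dataclass
class PlannerConfigSketch:
    """Hypothetical stand-in for PlannerConfig, limited to the new fields."""

    scaling_mode: ScalingMode = ScalingMode.ACTIVE
    advisory_log_interval: float = 60.0  # seconds between [summary] lines

    def __post_init__(self) -> None:
        # Accept plain strings from YAML/CLI and reject unknown modes early.
        self.scaling_mode = ScalingMode(self.scaling_mode)
        if self.advisory_log_interval <= 0:
            raise ValueError("advisory_log_interval must be positive")
```

Usage: `PlannerConfigSketch(scaling_mode="advisory")` yields `ScalingMode.ADVISORY`, while an unknown value such as `"dry_run"` raises `ValueError` at construction time.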

Design

Advisory mode intercepts at the narrowest point — _apply_scaling_targets — so all subclass _apply_effects logic (Prometheus metrics like predicted_num_*_replicas) still runs. This mirrors the existing no_operation guard pattern.
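The guard placement described above can be illustrated with a small sketch. Everything except the method names `_apply_effects` and `_apply_scaling_targets` is hypothetical: the connector class, its `set_replicas` method, and the `predicted` attribute standing in for the Prometheus gauges are assumptions made for the example.

```python
import asyncio
from typing import Dict, List


class RecordingConnector:
    """Hypothetical connector stand-in that records requested replica counts."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, int]] = []

    async def set_replicas(self, targets: Dict[str, int]) -> None:
        self.calls.append(dict(targets))


class PlannerSketch:
    """Illustrative planner: only the final connector call is mode-dependent."""

    def __init__(self, connector: RecordingConnector, scaling_mode: str = "active") -> None:
        self.connector = connector
        self.scaling_mode = scaling_mode
        self.predicted: Dict[str, int] = {}  # stand-in for predicted_num_*_replicas metrics

    async def _apply_effects(self, targets: Dict[str, int]) -> None:
        # Metrics are published in both modes, so dashboards stay comparable.
        self.predicted = dict(targets)
        await self._apply_scaling_targets(targets)

    async def _apply_scaling_targets(self, targets: Dict[str, int]) -> bool:
        if self.scaling_mode == "advisory":
            return False  # guard at the narrowest point: skip only the connector call
        await self.connector.set_replicas(targets)
        return True
```

Because the guard sits below `_apply_effects`, a subclass that overrides `_apply_effects` does not need to know about the mode at all.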

Test plan

  • 19 new unit tests covering enum, config, guard logic, action classification (hold/scale_up/scale_down/rebalance), startup behavior
  • 20 existing test_planner_config.py tests pass (no regressions)
  • ruff + pre-commit lint clean
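The four action labels covered by the tests suggest a classification over prefill and decode replica deltas. The rules below are an assumption about what such a classifier might do (the actual logic in `_log_decision_summary` is not shown on this page), and `classify_action` is a hypothetical name.

```python
def classify_action(cur_p: int, cur_d: int, rec_p: int, rec_d: int) -> str:
    """Label a decision from current vs recommended prefill/decode replicas.

    Assumed rules (not confirmed by the PR):
      - no change on either side        -> "hold"
      - one side grows, the other shrinks -> "rebalance"
      - otherwise any growth            -> "scale_up"
      - otherwise                       -> "scale_down"
    """
    delta_p = rec_p - cur_p
    delta_d = rec_d - cur_d
    if delta_p == 0 and delta_d == 0:
        return "hold"
    if delta_p * delta_d < 0:  # opposite signs: capacity moves between pools
        return "rebalance"
    if delta_p > 0 or delta_d > 0:
        return "scale_up"
    return "scale_down"
```

For example, moving from 2 prefill / 4 decode to 3 prefill / 3 decode is labelled `rebalance` under these rules, since the total is unchanged but the split shifts.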

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added advisory operating mode that computes and logs scaling recommendations without executing scaling actions.
    • Introduced configurable logging interval for advisory mode decision reports, allowing periodic review of scaling decisions.
    • Added startup indication in logs to distinguish between active and advisory operation modes.
  • Tests

    • Added comprehensive tests for advisory mode configuration and decision classification.

Add a `scaling_mode` config field (active/advisory) that lets operators
observe what the planner would do before enabling auto-scaling.

In advisory mode the full pipeline runs identically to active mode
(data collection, state machine, regression models, diagnostics,
Prometheus metrics, Plotly HTML reports) — only the actual connector
call in `_apply_scaling_targets` is skipped.

A periodic `[summary]` log line is emitted in both modes with a
structured one-line digest of the decision (action, current vs
recommended replicas, deltas, reasons, estimated latencies), throttled
by `advisory_log_interval` (default 60 s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

coderabbitai Bot commented Apr 15, 2026

Walkthrough

Added a new ScalingMode enum (ACTIVE/ADVISORY) to control planner operating modes. Extended configuration schemas in defaults and planner config with scaling_mode and advisory_log_interval fields. Implemented advisory mode logic in the planner core to skip connector updates and log throttled decision summaries during operation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Configuration**: components/src/dynamo/planner/config/defaults.py, components/src/dynamo/planner/config/planner_config.py | Added ScalingMode enum with ACTIVE and ADVISORY members. Extended SLAPlannerDefaults and PlannerConfig with scaling_mode (defaulting to ACTIVE) and advisory_log_interval (defaulting to 60 seconds) configuration fields. |
| **Core Planner Logic**: components/src/dynamo/planner/core/base.py | Updated imports for ScalingMode; added _last_advisory_log_s throttle field; branched startup logging based on scaling mode; modified _apply_scaling_targets to skip connector replica updates in ADVISORY mode; implemented _log_decision_summary() method to classify actions and log throttled decision summaries in the main run loop. |
| **Advisory Mode Tests**: components/src/dynamo/planner/tests/unit/test_advisory_mode.py | New test module with mocked Rust dependencies validating ScalingMode enum correctness, default configuration values, planner config field acceptance, action classification logic (hold/scale_up/scale_down/rebalance), and startup behavior across advisory and active modes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(planner): add advisory mode for scaling decisions' clearly and concisely summarizes the main change: introducing an advisory mode feature for the planner's scaling behavior. |


Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms



Remove the no_operation boolean from PlannerConfig/defaults and absorb
its role into the ScalingMode enum (active/advisory).  Advisory mode
now goes through the full startup path (connector, deployment
validation, worker discovery, FPM subscribers) — only the actual
scaling call in _apply_scaling_targets is skipped.

This is a breaking config change: users who had no_operation=true
should switch to scaling_mode=advisory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
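For users affected by the breaking change in this commit, a one-off config rewrite could look like the sketch below. `migrate_no_operation` is a hypothetical helper, not part of the PR, and it assumes the config is available as a plain dict. Note that a later commit in this PR replaces the enum with an `advisory: bool` flag, so the target key would change accordingly.

```python
from typing import Any, Dict


def migrate_no_operation(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `raw` with the legacy no_operation flag translated.

    Hypothetical migration: no_operation=true becomes scaling_mode=advisory,
    as the commit message recommends. An explicit scaling_mode wins over
    the legacy flag.
    """
    cfg = dict(raw)  # do not mutate the caller's dict
    if "no_operation" in cfg:
        legacy = bool(cfg.pop("no_operation"))
        if "scaling_mode" not in cfg:
            cfg["scaling_mode"] = "advisory" if legacy else "active"
    return cfg
```

The two modes are not behaviorally identical: per the commit message, advisory mode runs the full startup path (connector, deployment validation, worker discovery, FPM subscribers), so a deployment that only worked with `no_operation=true` may surface validation errors after the switch.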

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
components/src/dynamo/planner/tests/unit/test_advisory_mode.py (2)

186-215: These tests are tautological—they test Python boolean logic, not planner behavior.

Tests like test_advisory_mode_validates_deployment simply verify that if not False: flag = True sets the flag. They don't actually invoke any planner code or mock _async_init. Consider either:

  1. Removing these tests (the logic is trivially correct), or
  2. Refactoring to actually test NativePlannerBase._async_init with mocked dependencies

As written, they provide minimal confidence that the startup path works correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py` around lines
186 - 215, The three tautological tests
(test_advisory_mode_validates_deployment, test_active_mode_validates_deployment,
test_no_operation_skips_validation) should be replaced or removed: either delete
them, or refactor to actually exercise NativePlannerBase._async_init by
importing the class, instantiating a planner (or a minimal subclass), and
invoking _async_init with no_operation True/False while mocking/stubbing its
external dependencies and the validation call (e.g., mock the method that
performs deployment validation) so the tests assert the planner invoked
validation when no_operation is False and skipped it when True.
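A sketch of what the non-tautological variant could look like, using `unittest.mock.AsyncMock` to observe whether validation was awaited. `MiniPlanner` and the `validate_deployment` method name are hypothetical stand-ins; a real test would import `NativePlannerBase` and mock whatever its `_async_init` actually calls.

```python
import asyncio
from unittest.mock import AsyncMock


class MiniPlanner:
    """Hypothetical stand-in for NativePlannerBase; the real constructor differs."""

    def __init__(self, connector, no_operation: bool) -> None:
        self.connector = connector
        self.no_operation = no_operation

    async def _async_init(self) -> None:
        # Behavior under test: validation is skipped only when no_operation is set.
        if not self.no_operation:
            await self.connector.validate_deployment()


def run_init(no_operation: bool) -> AsyncMock:
    """Run _async_init against a mocked connector and return the mock for inspection."""
    connector = AsyncMock()
    asyncio.run(MiniPlanner(connector, no_operation)._async_init())
    return connector


def test_validation_runs_when_enabled() -> None:
    run_init(no_operation=False).validate_deployment.assert_awaited_once()


def test_validation_skipped_when_no_operation() -> None:
    run_init(no_operation=True).validate_deployment.assert_not_awaited()
```

The assertions here fail if the guard in `_async_init` is inverted or removed, which is the property the original tests could not detect.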

113-121: Helper duplicates logic from base.py—consider importing or refactoring.

_classify_action mirrors the action classification in _log_decision_summary. If the logic in base.py changes, this helper could drift out of sync. Consider either:

  1. Extracting the classification logic to a shared utility that both can import, or
  2. Directly testing _log_decision_summary via mocking (if feasible)

For now this is acceptable since it's explicitly documented as mirroring the base logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py` around lines
113 - 121, The test helper _classify_action duplicates the action classification
logic implemented in _log_decision_summary (in base.py), which can drift if base
logic changes; either move the shared logic into a single utility function that
both the test and base._log_decision_summary import (e.g., extract to a new
classify_action helper used by both), or remove this duplicate and update tests
to call/mock base._log_decision_summary directly; update imports and references
accordingly so _classify_action is eliminated or delegated to the shared
classify function.
components/src/dynamo/planner/core/base.py (1)

631-633: Consider adding public accessors to PlannerStateMachine for worker counts.

Accessing private attributes _num_p_workers and _num_d_workers directly couples this code to the internal implementation of the state machine. These attributes lack public properties or methods despite being accessed across multiple files; adding them would improve encapsulation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/planner/core/base.py` around lines 631 - 633, The code
reads private attributes _num_p_workers and _num_d_workers from the
PlannerStateMachine instance (referenced as sm/state_machine), which breaks
encapsulation; add public accessor properties or methods on PlannerStateMachine
(e.g., num_p_workers and num_d_workers or
get_num_p_workers()/get_num_d_workers()) and update this code to call those
accessors (replace sm._num_p_workers and sm._num_d_workers with sm.num_p_workers
/ sm.num_d_workers or the getter methods) so other modules use the public API
instead of private attributes.
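The accessor suggestion amounts to read-only properties over the existing private fields. A minimal sketch, assuming the state machine stores plain integers; `PlannerStateMachineSketch` and its constructor are hypothetical, and only the attribute names `_num_p_workers` and `_num_d_workers` come from the review comment.

```python
class PlannerStateMachineSketch:
    """Hypothetical reduction of PlannerStateMachine to its worker counts."""

    def __init__(self, num_p_workers: int, num_d_workers: int) -> None:
        self._num_p_workers = num_p_workers
        self._num_d_workers = num_d_workers

    @property
    def num_p_workers(self) -> int:
        """Current number of prefill workers (read-only)."""
        return self._num_p_workers

    @property
    def num_d_workers(self) -> int:
        """Current number of decode workers (read-only)."""
        return self._num_d_workers
```

Callers would then read `sm.num_p_workers` instead of `sm._num_p_workers`; because no setter is defined, accidental writes from outside the class raise `AttributeError`.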
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py`:
- Around line 50-53: The test module's pytest mark list (pytestmark) is missing
the required test type marker; update the pytestmark variable to include
pytest.mark.unit alongside pytest.mark.gpu_0 and pytest.mark.pre_merge so the
test is properly classified as a unit test (modify the pytestmark list in
test_advisory_mode.py to add pytest.mark.unit).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c92f1510-89cb-46d9-a628-62b783c9f28b

📥 Commits

Reviewing files that changed from the base of the PR and between bba26d5 and 6cbb28c.

📒 Files selected for processing (4)
  • components/src/dynamo/planner/config/defaults.py
  • components/src/dynamo/planner/config/planner_config.py
  • components/src/dynamo/planner/core/base.py
  • components/src/dynamo/planner/tests/unit/test_advisory_mode.py

Replace ScalingMode enum + advisory_log_interval with a simple
`advisory: bool` config flag (same pattern as the old no_operation).

The [summary] log line now prints unconditionally after every tick
in both modes — no throttling config needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk enabled auto-merge (squash) April 15, 2026 20:27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The replay planner sets advisory mode so it logs decisions without
executing scaling.  Missed during the no_operation → advisory rename.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Wire DiagnosticsRecorder into ReplayPlannerAdapter so offline
planner-in-the-loop replay generates the same interactive Plotly
HTML report as the live planner. Also hide per-engine FPM traces
from the legend to prevent overlap with chart panels.

Fix CI test collection errors by adding import guards for
forward_pass_metrics and dynamo.llm bindings in planner/profiler
unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
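One common shape for the import guards mentioned in this commit is a module-availability check evaluated at collection time. The sketch below uses only the standard library; `module_available` and the `HAS_DYNAMO_LLM` flag are hypothetical names, and the actual guards in the PR may be written differently.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located without fully importing it.

    For dotted names, find_spec imports the parent package first, so a
    missing parent raises ModuleNotFoundError; treat that as unavailable.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Evaluated once at import time; tests can branch or skip on this flag.
HAS_DYNAMO_LLM = module_available("dynamo.llm")
```

In a pytest module, such a flag is typically combined with `pytest.mark.skipif(not HAS_DYNAMO_LLM, reason=...)`, or replaced outright by `pytest.importorskip("dynamo.llm")`, so that collection skips the module instead of erroring when the bindings are not built.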
Allow setting a fixed filename for HTML diagnostics reports via
the report_filename planner config field. Useful for replay
iteration where you want to refresh the same file in the browser.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
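The fixed-filename behavior can be sketched as a small path resolver. Only the `report_filename` field name comes from the commit message; the `resolve_report_path` helper and the timestamped fallback naming scheme are assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Union


def resolve_report_path(
    output_dir: Union[str, Path],
    report_filename: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Path:
    """Pick the HTML report path.

    With report_filename set, every run overwrites the same file, which is
    convenient for refreshing one browser tab during replay iteration.
    Otherwise fall back to a timestamped name (assumed scheme).
    """
    if report_filename:
        return Path(output_dir) / report_filename
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"planner_report_{stamp}.html"
```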
@tedzhouhk tedzhouhk enabled auto-merge (squash) April 15, 2026 23:04
@tedzhouhk tedzhouhk disabled auto-merge April 15, 2026 23:38
@tedzhouhk tedzhouhk enabled auto-merge (squash) April 15, 2026 23:46
@tedzhouhk tedzhouhk merged commit 3da6f4d into main Apr 15, 2026
82 of 83 checks passed
@tedzhouhk tedzhouhk deleted the hzhou/planner-advisory-mode branch April 15, 2026 23:52