feat(planner): add advisory mode for scaling decisions#8244
Conversation
Add a `scaling_mode` config field (active/advisory) that lets operators observe what the planner would do before enabling auto-scaling. In advisory mode the full pipeline runs identically to active mode (data collection, state machine, regression models, diagnostics, Prometheus metrics, Plotly HTML reports) — only the actual connector call in `_apply_scaling_targets` is skipped. A periodic `[summary]` log line is emitted in both modes with a structured one-line digest of the decision (action, current vs recommended replicas, deltas, reasons, estimated latencies), throttled by `advisory_log_interval` (default 60 s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
WalkthroughAdded a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsTimed out fetching pipeline failures after 30000ms Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Remove the no_operation boolean from PlannerConfig/defaults and absorb its role into the ScalingMode enum (active/advisory). Advisory mode now goes through the full startup path (connector, deployment validation, worker discovery, FPM subscribers) — only the actual scaling call in _apply_scaling_targets is skipped. This is a breaking config change: users who had no_operation=true should switch to scaling_mode=advisory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
components/src/dynamo/planner/tests/unit/test_advisory_mode.py (2)
186-215: These tests are tautological—they test Python boolean logic, not planner behavior.Tests like
test_advisory_mode_validates_deploymentsimply verify thatif not False: flag = Truesets the flag. They don't actually invoke any planner code or mock_async_init. Consider either:
- Removing these tests (the logic is trivially correct), or
- Refactoring to actually test
NativePlannerBase._async_initwith mocked dependenciesAs written, they provide minimal confidence that the startup path works correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py` around lines 186 - 215, The three tautological tests (test_advisory_mode_validates_deployment, test_active_mode_validates_deployment, test_no_operation_skips_validation) should be replaced or removed: either delete them, or refactor to actually exercise NativePlannerBase._async_init by importing the class, instantiating a planner (or a minimal subclass), and invoking _async_init with no_operation True/False while mocking/stubbing its external dependencies and the validation call (e.g., mock the method that performs deployment validation) so the tests assert the planner invoked validation when no_operation is False and skipped it when True.
113-121: Helper duplicates logic frombase.py—consider importing or refactoring.
_classify_actionmirrors the action classification in_log_decision_summary. If the logic inbase.pychanges, this helper could drift out of sync. Consider either:
- Extracting the classification logic to a shared utility that both can import, or
- Directly testing
_log_decision_summaryvia mocking (if feasible)For now this is acceptable since it's explicitly documented as mirroring the base logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py` around lines 113 - 121, The test helper _classify_action duplicates the action classification logic implemented in _log_decision_summary (in base.py), which can drift if base logic changes; either move the shared logic into a single utility function that both the test and base._log_decision_summary import (e.g., extract to a new classify_action helper used by both), or remove this duplicate and update tests to call/mock base._log_decision_summary directly; update imports and references accordingly so _classify_action is eliminated or delegated to the shared classify function.components/src/dynamo/planner/core/base.py (1)
631-633: Consider adding public accessors toPlannerStateMachinefor worker counts.Accessing private attributes
_num_p_workersand_num_d_workersdirectly couples this code to the internal implementation of the state machine. These attributes lack public properties or methods despite being accessed across multiple files; adding them would improve encapsulation.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/planner/core/base.py` around lines 631 - 633, The code reads private attributes _num_p_workers and _num_d_workers from the PlannerStateMachine instance (referenced as sm/state_machine), which breaks encapsulation; add public accessor properties or methods on PlannerStateMachine (e.g., num_p_workers and num_d_workers or get_num_p_workers()/get_num_d_workers()) and update this code to call those accessors (replace sm._num_p_workers and sm._num_d_workers with sm.num_p_workers / sm.num_d_workers or the getter methods) so other modules use the public API instead of private attributes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py`:
- Around line 50-53: The test module's pytest mark list (pytestmark) is missing
the required test type marker; update the pytestmark variable to include
pytest.mark.unit alongside pytest.mark.gpu_0 and pytest.mark.pre_merge so the
test is properly classified as a unit test (modify the pytestmark list in
test_advisory_mode.py to add pytest.mark.unit).
---
Nitpick comments:
In `@components/src/dynamo/planner/core/base.py`:
- Around line 631-633: The code reads private attributes _num_p_workers and
_num_d_workers from the PlannerStateMachine instance (referenced as
sm/state_machine), which breaks encapsulation; add public accessor properties or
methods on PlannerStateMachine (e.g., num_p_workers and num_d_workers or
get_num_p_workers()/get_num_d_workers()) and update this code to call those
accessors (replace sm._num_p_workers and sm._num_d_workers with sm.num_p_workers
/ sm.num_d_workers or the getter methods) so other modules use the public API
instead of private attributes.
In `@components/src/dynamo/planner/tests/unit/test_advisory_mode.py`:
- Around line 186-215: The three tautological tests
(test_advisory_mode_validates_deployment, test_active_mode_validates_deployment,
test_no_operation_skips_validation) should be replaced or removed: either delete
them, or refactor to actually exercise NativePlannerBase._async_init by
importing the class, instantiating a planner (or a minimal subclass), and
invoking _async_init with no_operation True/False while mocking/stubbing its
external dependencies and the validation call (e.g., mock the method that
performs deployment validation) so the tests assert the planner invoked
validation when no_operation is False and skipped it when True.
- Around line 113-121: The test helper _classify_action duplicates the action
classification logic implemented in _log_decision_summary (in base.py), which
can drift if base logic changes; either move the shared logic into a single
utility function that both the test and base._log_decision_summary import (e.g.,
extract to a new classify_action helper used by both), or remove this duplicate
and update tests to call/mock base._log_decision_summary directly; update
imports and references accordingly so _classify_action is eliminated or
delegated to the shared classify function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c92f1510-89cb-46d9-a628-62b783c9f28b
📒 Files selected for processing (4)
components/src/dynamo/planner/config/defaults.pycomponents/src/dynamo/planner/config/planner_config.pycomponents/src/dynamo/planner/core/base.pycomponents/src/dynamo/planner/tests/unit/test_advisory_mode.py
Replace ScalingMode enum + advisory_log_interval with a simple `advisory: bool` config flag (same pattern as the old no_operation). The [summary] log line now prints unconditionally after every tick in both modes — no throttling config needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
The replay planner sets advisory mode so it logs decisions without executing scaling. Missed during the no_operation → advisory rename. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Wire DiagnosticsRecorder into ReplayPlannerAdapter so offline planner-in-the-loop replay generates the same interactive Plotly HTML report as the live planner. Also hide per-engine FPM traces from the legend to prevent overlap with chart panels. Fix CI test collection errors by adding import guards for forward_pass_metrics and dynamo.llm bindings in planner/profiler unit tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Allow setting a fixed filename for HTML diagnostics reports via the report_filename planner config field. Useful for replay iteration where you want to refresh the same file in the browser. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Summary
scaling_modeconfig field (active/advisory) so operators can observe planner decisions before enabling auto-scaling_apply_scaling_targetsis skipped[summary]log line (both modes) with structured one-line digest: action, current vs recommended replicas, deltas, reasons, estimated latencies — throttled byadvisory_log_interval(default 60s)Changed files
planner/config/defaults.pyScalingModeenum (active/advisory) + defaultsplanner/config/planner_config.pyscaling_modeandadvisory_log_intervalconfig fieldsplanner/core/base.py_apply_scaling_targets,_log_decision_summary, startup mode logplanner/tests/unit/test_advisory_mode.pyDesign
Advisory mode intercepts at the narrowest point —
_apply_scaling_targets— so all subclass_apply_effectslogic (Prometheus metrics likepredicted_num_*_replicas) still runs. This mirrors the existingno_operationguard pattern.Test plan
test_planner_config.pytests pass (no regressions)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests