feat: Add bring-down sequence in RLA (#477)
Conversation
Summary by CodeRabbit
Walkthrough

Adds a complete "bring-down" operation: a new BringDownRack RPC and request payload, a TaskType/OpCode/operation payload, component-manager interfaces and Carbide client checks, Temporal activities/workflow and action wiring, a hardcoded bring-down rule sequence, a service handler to submit tasks, and related tests.

Changes: Bring-Down Feature (single cohort)
Sequence Diagram

```mermaid
sequenceDiagram
participant Client
participant RLA as RLA Service
participant TM as TaskManager
participant Temporal
participant Activities
participant Carbide
Client->>RLA: BringDownRack(target_spec, rule_id)
RLA->>RLA: validate request, build BringDownTaskInfo
RLA->>TM: SubmitTask(operation.Request)
TM->>Temporal: schedule/start BringDown workflow
Temporal->>Temporal: load rule definition, build targets
Temporal->>Activities: VerifyNoInstance(target)
Activities->>Carbide: InstanceVerifier.VerifyNoInstance(target)
Carbide-->>Activities: result
Activities-->>Temporal: result
Temporal->>Activities: EnterMaintenance(target)
Activities->>Carbide: MaintenanceController.EnterMaintenance(target)
Carbide-->>Activities: result
Activities-->>Temporal: result
Temporal->>Activities: PowerControl OFF (compute)
Activities-->>Temporal: result
Temporal->>Activities: PausePowerOnGate(target)
Activities->>Carbide: PowerOnGateController.PausePowerOnGate(target)
Carbide-->>Activities: result
Activities-->>Temporal: result
Temporal->>Temporal: remaining stages (NVL, power shelf), update status
Temporal-->>Client: final task status / error
```
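For orientation, calling the new RPC from Go might look like the sketch below. This is illustrative only: the generated package path, the client constructor name, and the target-spec contents are assumptions; only the RPC name and the BringDownRackRequest fields come from this PR.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	rlav1 "example.com/rla/pkg/proto/v1" // hypothetical import path for the generated stubs
)

func main() {
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := rlav1.NewRLAClient(conn) // hypothetical constructor name

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// rule_id is omitted, so the hardcoded default bring-down rule applies.
	resp, err := client.BringDownRack(ctx, &rlav1.BringDownRackRequest{
		TargetSpec:  &rlav1.OperationTargetSpec{}, // fill in the target racks
		Description: "planned rack maintenance",
	})
	if err != nil {
		log.Fatalf("BringDownRack: %v", err)
	}
	log.Printf("submitted bring-down task: %v", resp)
}
```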
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 4 passed, ❌ 1 failed (1 warning).
Test Results: 8 974 tests (+6), 8 974 ✅ (+6), 7m 46s ⏱️ (+26s). Results for commit 9715fc1; comparison against base commit bce7503. This pull request removes 4 and adds 10 tests (renamed tests count towards both).
Actionable comments posted: 1
🧹 Nitpick comments (2)
rla/internal/task/operationrules/sequences_test.go (1)
95-100: ⚡ Quick win: Add bring_down to constant-parity coverage as well.

You added validity coverage for SequenceBringDown, but TestSequenceConstantsMatchSharedCodes still does not assert parity for this new code path.

Suggested test extension:
```diff
 func TestSequenceConstantsMatchSharedCodes(t *testing.T) {
 	tests := []struct {
 		name          string
 		sequenceConst string
 		sharedCode    string
 	}{
@@
 		{"Rollback", SequenceRollback, common.OpCodeFirmwareControlRollback},
+		{"BringDown", SequenceBringDown, common.OpCodeBringDown},
 	}
```
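For context, the full table-driven check would then look roughly like the sketch below; it assumes the parity assertion is a direct string comparison, which is not shown in the PR.

```go
func TestSequenceConstantsMatchSharedCodes(t *testing.T) {
	tests := []struct {
		name          string
		sequenceConst string
		sharedCode    string
	}{
		{"Rollback", SequenceRollback, common.OpCodeFirmwareControlRollback},
		{"BringDown", SequenceBringDown, common.OpCodeBringDown},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Each sequence constant must stay in lockstep with its shared op code.
			if tt.sequenceConst != tt.sharedCode {
				t.Errorf("sequence constant %q != shared code %q",
					tt.sequenceConst, tt.sharedCode)
			}
		})
	}
}
```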
rla/internal/service/server_impl.go (1)
768-783: ⚡ Quick win: Align BringUpTaskInfo.OpCode with bring_down to avoid divergent metadata.

BringDownRack overrides the wrapper opcode, but the serialized BringUpTaskInfo still defaults to bring-up semantics. Setting info.OpCode keeps both representations consistent for any downstream reader of Operation.Info.

Suggested patch:
```diff
 info := &operations.BringUpTaskInfo{
 	RuleID: protobuf.UUIDStringFrom(req.GetRuleId()),
+	OpCode: taskcommon.OpCodeBringDown,
 }
@@
-// Override the operation code so the rule resolver picks the
-// bring-down rule instead of the default bring-up rule.
-opReq.Operation.Code = taskcommon.OpCodeBringDown
+// Keep request wrapper and serialized task info aligned.
+opReq.Operation.Code = info.CodeString()
```
📒 Files selected for processing (15)
- rla/internal/service/server_impl.go
- rla/internal/task/common/operation_codes.go
- rla/internal/task/componentmanager/componentmanager.go
- rla/internal/task/componentmanager/compute/carbide/carbide.go
- rla/internal/task/componentmanager/mock/mock.go
- rla/internal/task/executor/temporalworkflow/activity/activity.go
- rla/internal/task/executor/temporalworkflow/activity/registry.go
- rla/internal/task/executor/temporalworkflow/activity/registry_test.go
- rla/internal/task/executor/temporalworkflow/workflow/actions.go
- rla/internal/task/operationrules/resolver_defaults.go
- rla/internal/task/operationrules/resolver_defaults_test.go
- rla/internal/task/operationrules/rules.go
- rla/internal/task/operationrules/sequences.go
- rla/internal/task/operationrules/sequences_test.go
- rla/proto/v1/rla.proto
```proto
// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
// It powers off compute, pauses the per-machine power-on gate so the power
// manager will not automatically bring it back up, then powers off the
// remaining components (NVLSwitch and PowerShelf).
message BringDownRackRequest {
  OperationTargetSpec target_spec = 1; // Target racks for bring-down
  string description = 2;              // optional task description
  optional UUID rule_id = 3;           // optional: override rule resolution with a specific rule
}
```
Clarify default vs overridden bring-down behavior in request docs.
Lines 610-613 currently read as an unconditional sequence, but Line 617 permits a custom rule_id override. This can mislead API consumers about guaranteed behavior.
Suggested doc-only fix
```diff
-// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
-// It powers off compute, pauses the per-machine power-on gate so the power
-// manager will not automatically bring it back up, then powers off the
-// remaining components (NVLSwitch and PowerShelf).
+// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
+// By default, bring-down powers off compute, pauses the per-machine power-on
+// gate so the power manager will not automatically bring it back up, then
+// powers off remaining components (NVLSwitch and PowerShelf).
+// If rule_id is provided, execution follows the selected rule.
 message BringDownRackRequest {
```

As per coding guidelines for **/*.proto: "Review the Protobuf definitions, point out issues relative to compatibility, and expressiveness."
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found (scanned for 700+ types of secrets and credentials). Last updated: 2026-05-04 18:05:49 UTC | Commit: 99ae4f1
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rla/internal/service/server_impl.go`:
- Around line 751-754: Update the top-of-function comment for BringDownRack to
remove the incorrect statement that it “reuses the BringUp workflow” and instead
describe that this RPC invokes the new dedicated BringDown workflow; mention the
key steps performed (power off compute, pause per-machine power-on gate to
prevent auto-revive, then power off NVLSwitch and PowerShelf) and reference the
BringDown workflow by name so readers know this function maps to that workflow
rather than BringUp.
In `@rla/internal/task/operationrules/resolver_defaults.go`:
- Around line 1299-1428: The RuleDefinition default sequence for bring-down can
exceed the TaskTypeBringDown workflow timeout (60m); update either the workflow
timeout in options.go or reduce stage/retry/poll budgets so the worst-case sum
fits under the workflow window. Concretely, adjust RuleDefinition
(CurrentRuleDefinitionVersion) SequenceSteps timeouts/retry policies and
ActionConfig polling windows — e.g. lower Stage 3/5/6 Timeout and
RetryPolicy.MaxAttempts, reduce PostOperation VerifyPowerStatus Timeout and
PollInterval and the RetryPolicy.InitialInterval/BackoffCoefficient for stages
that use
ActionPowerControl/ActionVerifyPowerStatus/ActionVerifyNoInstance/ActionEnterMaintenance/ActionPausePowerOnGate
so the aggregated worst-case duration is safely below TaskTypeBringDown;
alternatively increase the TaskTypeBringDown timeout to a value greater than the
calculated worst-case sum.
In `@rla/internal/task/operations/operations.go`:
- Around line 212-253: New() currently lacks a branch for
taskcommon.TaskTypeBringDown, so generic deserialization returns "unsupported
task type: bring_down"; update the factory in New() to handle TaskTypeBringDown
by adding a switch/case (or equivalent) that instantiates a BringDownTaskInfo,
unmarshals into it, validates it, and returns it (use the existing pattern from
other cases for creation, calling BringDownTaskInfo.Unmarshal/Validate and
returning the resulting Task instance).
📒 Files selected for processing (25)
- rla/internal/carbideapi/grpc.go
- rla/internal/carbideapi/mock.go
- rla/internal/carbideapi/mod.go
- rla/internal/scheduler/taskschedule/template.go
- rla/internal/service/server_impl.go
- rla/internal/task/common/common.go
- rla/internal/task/common/common_test.go
- rla/internal/task/common/operation_codes.go
- rla/internal/task/componentmanager/componentmanager.go
- rla/internal/task/componentmanager/compute/carbide/carbide.go
- rla/internal/task/componentmanager/mock/mock.go
- rla/internal/task/conflict/conflict.go
- rla/internal/task/executor/temporalworkflow/activity/activity.go
- rla/internal/task/executor/temporalworkflow/activity/registry.go
- rla/internal/task/executor/temporalworkflow/activity/registry_test.go
- rla/internal/task/executor/temporalworkflow/workflow/actions.go
- rla/internal/task/executor/temporalworkflow/workflow/bringdown.go
- rla/internal/task/executor/temporalworkflow/workflow/registry_test.go
- rla/internal/task/operationrules/resolver_defaults.go
- rla/internal/task/operationrules/rules.go
- rla/internal/task/operationrules/sequences.go
- rla/internal/task/operationrules/sequences_test.go
- rla/internal/task/operations/operations.go
- rla/internal/task/operations/options.go
- rla/proto/v1/rla.proto
✅ Files skipped from review due to trivial changes (4)
- rla/internal/carbideapi/mod.go
- rla/internal/task/common/operation_codes.go
- rla/internal/task/executor/temporalworkflow/activity/registry.go
- rla/internal/task/operationrules/rules.go
🚧 Files skipped from review as they are similar to previous changes (3)
- rla/internal/task/operationrules/sequences_test.go
- rla/internal/task/executor/temporalworkflow/activity/registry_test.go
- rla/proto/v1/rla.proto
```go
RuleDefinition: RuleDefinition{
	Version: CurrentRuleDefinitionVersion,
	Steps: []SequenceStep{
		// === Stage 1: Compute — verify no allocated instance ===
		{
			ComponentType: devicetypes.ComponentTypeCompute,
			Stage:         1,
			MaxParallel:   0,
			Timeout:       5 * time.Minute,
			// No retry: a "machine still has an instance" result
			// is not transient; retrying just delays the failure.
			MainOperation: ActionConfig{
				Name: ActionVerifyNoInstance,
			},
		},
		// === Stage 2: Compute — enter maintenance ===
		{
			ComponentType: devicetypes.ComponentTypeCompute,
			Stage:         2,
			MaxParallel:   0,
			Timeout:       5 * time.Minute,
			RetryPolicy: &RetryPolicy{
				MaxAttempts:        3,
				InitialInterval:    5 * time.Second,
				BackoffCoefficient: 2.0,
			},
			MainOperation: ActionConfig{
				Name: ActionEnterMaintenance,
			},
		},
		// === Stage 3: Compute — power off, verify ===
		{
			ComponentType: devicetypes.ComponentTypeCompute,
			Stage:         3,
			MaxParallel:   0,
			Timeout:       20 * time.Minute,
			RetryPolicy: &RetryPolicy{
				MaxAttempts:        3,
				InitialInterval:    1 * time.Second,
				BackoffCoefficient: 2.0,
			},
			MainOperation: ActionConfig{
				Name: ActionPowerControl,
				Parameters: map[string]any{
					ParamOperation: "power_off",
				},
			},
			PostOperation: []ActionConfig{
				{
					Name:         ActionVerifyPowerStatus,
					Timeout:      10 * time.Minute,
					PollInterval: 15 * time.Second,
					Parameters: map[string]any{
						ParamExpectedStatus: "off",
					},
				},
			},
		},
		// === Stage 4: Compute — pause power-on gate ===
		{
			ComponentType: devicetypes.ComponentTypeCompute,
			Stage:         4,
			MaxParallel:   0,
			Timeout:       5 * time.Minute,
			RetryPolicy: &RetryPolicy{
				MaxAttempts:        3,
				InitialInterval:    5 * time.Second,
				BackoffCoefficient: 2.0,
			},
			MainOperation: ActionConfig{
				Name: ActionPausePowerOnGate,
			},
		},
		// === Stage 5: NVLSwitch — power off, verify ===
		{
			ComponentType: devicetypes.ComponentTypeNVLSwitch,
			Stage:         5,
			MaxParallel:   0,
			Timeout:       15 * time.Minute,
			RetryPolicy: &RetryPolicy{
				MaxAttempts:        3,
				InitialInterval:    5 * time.Second,
				BackoffCoefficient: 2.0,
			},
			MainOperation: ActionConfig{
				Name: ActionPowerControl,
				Parameters: map[string]any{
					ParamOperation: "power_off",
				},
			},
			PostOperation: []ActionConfig{
				{
					Name:         ActionVerifyPowerStatus,
					Timeout:      10 * time.Minute,
					PollInterval: 15 * time.Second,
					Parameters: map[string]any{
						ParamExpectedStatus: "off",
					},
				},
			},
		},
		// === Stage 6: PowerShelf — power off, verify ===
		{
			ComponentType: devicetypes.ComponentTypePowerShelf,
			Stage:         6,
			MaxParallel:   0,
			Timeout:       15 * time.Minute,
			RetryPolicy: &RetryPolicy{
				MaxAttempts:        3,
				InitialInterval:    5 * time.Second,
				BackoffCoefficient: 2.0,
			},
			MainOperation: ActionConfig{
				Name: ActionPowerControl,
				Parameters: map[string]any{
					ParamOperation: "power_off",
				},
			},
			PostOperation: []ActionConfig{
				{
					Name:         ActionVerifyPowerStatus,
					Timeout:      10 * time.Minute,
					PollInterval: 15 * time.Second,
					Parameters: map[string]any{
						ParamExpectedStatus: "off",
					},
				},
			},
		},
	},
```
Bring-down timeout budget can exceed the workflow timeout window.
The new default sequence can run beyond the configured TaskTypeBringDown workflow timeout (60m in rla/internal/task/operations/options.go), especially with per-step retries + verify polling. This risks timing out healthy runs mid-bring-down and leaving partial state.
Please align budgets before merge: either increase the workflow timeout for bring-down, or reduce per-stage/per-action timeout+retry envelopes so worst-case execution stays safely under the workflow timeout.
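To make the budget concrete: summing only the per-stage Timeout values above gives 5 + 5 + 20 + 5 + 15 + 15 = 65 minutes, already past a 60-minute window before any retry backoff is counted. A quick sanity check along these lines (stage values copied from the rule above; the 60m figure comes from this review comment, not from code shown here):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Per-stage Timeout values copied from the default bring-down rule above.
	stageTimeouts := []time.Duration{
		5 * time.Minute,  // Stage 1: verify no instance
		5 * time.Minute,  // Stage 2: enter maintenance
		20 * time.Minute, // Stage 3: compute power off + verify
		5 * time.Minute,  // Stage 4: pause power-on gate
		15 * time.Minute, // Stage 5: NVLSwitch power off + verify
		15 * time.Minute, // Stage 6: PowerShelf power off + verify
	}
	var worstCase time.Duration
	for _, t := range stageTimeouts {
		worstCase += t
	}
	workflowTimeout := 60 * time.Minute // TaskTypeBringDown window per this comment
	fmt.Printf("stage budget %v vs workflow timeout %v (over by %v)\n",
		worstCase, workflowTimeout, worstCase-workflowTimeout)
	// Output: stage budget 1h5m0s vs workflow timeout 1h0m0s (over by 5m0s)
}
```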
Actionable comments posted: 2
📒 Files selected for processing (2)
- rla/internal/service/server_impl.go
- rla/internal/task/operationrules/resolver_defaults.go
🚧 Files skipped from review as they are similar to previous changes (1)
- rla/internal/service/server_impl.go
```go
// === Stage 1: Compute — verify no allocated instance ===
{
	ComponentType: devicetypes.ComponentTypeCompute,
	Stage:         1,
	MaxParallel:   0,
	Timeout:       5 * time.Minute,
	// No retry: a "machine still has an instance" result
	// is not transient; retrying just delays the failure.
	MainOperation: ActionConfig{
		Name: ActionVerifyNoInstance,
	},
},
// === Stage 2: Compute — enter maintenance ===
{
	ComponentType: devicetypes.ComponentTypeCompute,
	Stage:         2,
	MaxParallel:   0,
	Timeout:       5 * time.Minute,
	RetryPolicy: &RetryPolicy{
		MaxAttempts:        3,
		InitialInterval:    5 * time.Second,
		BackoffCoefficient: 2.0,
	},
	MainOperation: ActionConfig{
		Name: ActionEnterMaintenance,
	},
},
```
Close the verify/maintenance TOCTOU before any power action.
Stage 1 proves the compute is empty only at that instant. Because allocator lockout is not applied until Stage 2, a new instance can still land in the gap and Stage 3 can power off a machine that has become allocated.
Suggested minimal fix
```diff
 {
 	ComponentType: devicetypes.ComponentTypeCompute,
 	Stage:         2,
 	MaxParallel:   0,
 	Timeout:       5 * time.Minute,
 	RetryPolicy: &RetryPolicy{
 		MaxAttempts:        3,
 		InitialInterval:    5 * time.Second,
 		BackoffCoefficient: 2.0,
 	},
 	MainOperation: ActionConfig{
 		Name: ActionEnterMaintenance,
 	},
+	PostOperation: []ActionConfig{
+		{
+			Name: ActionVerifyNoInstance,
+		},
+	},
 },
```

If you want to keep the early fail-fast behavior, a second VerifyNoInstance after maintenance is the smallest way to make the check authoritative.
```go
// === Stage 3: Compute — power off, verify ===
{
	ComponentType: devicetypes.ComponentTypeCompute,
	Stage:         3,
	MaxParallel:   0,
	Timeout:       20 * time.Minute,
	RetryPolicy: &RetryPolicy{
		MaxAttempts:        3,
		InitialInterval:    1 * time.Second,
		BackoffCoefficient: 2.0,
	},
	MainOperation: ActionConfig{
		Name: ActionPowerControl,
		Parameters: map[string]any{
			ParamOperation: "power_off",
		},
	},
	PostOperation: []ActionConfig{
		{
			Name:         ActionVerifyPowerStatus,
			Timeout:      10 * time.Minute,
			PollInterval: 15 * time.Second,
			Parameters: map[string]any{
				ParamExpectedStatus: "off",
			},
		},
	},
},
// === Stage 4: Compute — pause power-on gate ===
{
	ComponentType: devicetypes.ComponentTypeCompute,
	Stage:         4,
	MaxParallel:   0,
	Timeout:       5 * time.Minute,
	RetryPolicy: &RetryPolicy{
		MaxAttempts:        3,
		InitialInterval:    5 * time.Second,
		BackoffCoefficient: 2.0,
	},
	MainOperation: ActionConfig{
		Name: ActionPausePowerOnGate,
	},
```
Pause the power-on gate before compute shutdown.
ActionPausePowerOnGate is the safeguard against automatic repower, but it currently runs only after compute has been powered off and verified off. That leaves a real window where a reconcile can turn the node back on, and the workflow never re-checks it before powering off the NVL switch and power shelf.
Suggested direction
```diff
-// === Stage 3: Compute — power off, verify ===
+// === Stage 3: Compute — pause gate, power off, verify ===
 {
 	ComponentType: devicetypes.ComponentTypeCompute,
 	Stage:         3,
 	MaxParallel:   0,
 	Timeout:       20 * time.Minute,
 	RetryPolicy: &RetryPolicy{
 		MaxAttempts:        3,
 		InitialInterval:    1 * time.Second,
 		BackoffCoefficient: 2.0,
 	},
+	PreOperation: []ActionConfig{
+		{
+			Name: ActionPausePowerOnGate,
+		},
+	},
 	MainOperation: ActionConfig{
 		Name: ActionPowerControl,
 		Parameters: map[string]any{
 			ParamOperation: "power_off",
 		},
 	},
 	PostOperation: []ActionConfig{
 		{
 			Name:         ActionVerifyPowerStatus,
 			Timeout:      10 * time.Minute,
 			PollInterval: 15 * time.Second,
 			Parameters: map[string]any{
 				ParamExpectedStatus: "off",
 			},
 		},
 	},
 },
-// === Stage 4: Compute — pause power-on gate ===
-{
-	ComponentType: devicetypes.ComponentTypeCompute,
-	Stage:         4,
-	MaxParallel:   0,
-	Timeout:       5 * time.Minute,
-	RetryPolicy: &RetryPolicy{
-		MaxAttempts:        3,
-		InitialInterval:    5 * time.Second,
-		BackoffCoefficient: 2.0,
-	},
-	MainOperation: ActionConfig{
-		Name: ActionPausePowerOnGate,
-	},
-},
```

If you keep a separate stage, it should at least be followed by another VerifyPowerStatus(off) before Stage 5.
jw-nvidia left a comment:
Between 1 and 2, a machine can be allocated again.
Between 3 and 4, a machine can be powered on.
[1 and 2], [3 and 4] should be atomic operations.
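One way to honor that constraint with the existing rule schema is to collapse each pair into a single SequenceStep, so the verify runs as a post-operation of maintenance and the power-off runs as a post-operation of the gate pause. The sketch below is only illustrative: it reuses field names from the rule shown above and assumes PostOperation actions execute within the same stage immediately after MainOperation, with no scheduling gap in between, which this PR does not explicitly confirm.

```go
// Hypothetical fused stages; field names reuse the rule schema shown above.
steps := []SequenceStep{
	// Former stages 1+2: lock the allocator out first, then verify emptiness
	// while the maintenance lockout is already in force.
	{
		ComponentType: devicetypes.ComponentTypeCompute,
		Stage:         1,
		Timeout:       10 * time.Minute,
		MainOperation: ActionConfig{Name: ActionEnterMaintenance},
		PostOperation: []ActionConfig{
			{Name: ActionVerifyNoInstance},
		},
	},
	// Former stages 3+4: pause the power-on gate before cutting power, so a
	// reconcile pass cannot repower the node between the two actions.
	{
		ComponentType: devicetypes.ComponentTypeCompute,
		Stage:         2,
		Timeout:       25 * time.Minute,
		MainOperation: ActionConfig{Name: ActionPausePowerOnGate},
		PostOperation: []ActionConfig{
			{
				Name:       ActionPowerControl,
				Parameters: map[string]any{ParamOperation: "power_off"},
			},
			{
				Name:         ActionVerifyPowerStatus,
				Timeout:      10 * time.Minute,
				PollInterval: 15 * time.Second,
				Parameters:   map[string]any{ParamExpectedStatus: "off"},
			},
		},
	},
}
```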
Actionable comments posted: 2
♻️ Duplicate comments (2)
rla/internal/task/operationrules/resolver_defaults.go (2)
1297-1323: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win: Re-check compute emptiness after maintenance.

Stage 1 only proves the node is empty before allocator lockout. A new instance can land between Line 1305 and Line 1321, and Stage 3 would still power that machine off. Add a second VerifyNoInstance after ActionEnterMaintenance so the authoritative check happens under maintenance.

Suggested minimal fix:
```diff
 {
 	ComponentType: devicetypes.ComponentTypeCompute,
 	Stage:         2,
 	MaxParallel:   0,
 	Timeout:       5 * time.Minute,
 	RetryPolicy: &RetryPolicy{
 		MaxAttempts:        3,
 		InitialInterval:    5 * time.Second,
 		BackoffCoefficient: 2.0,
 	},
 	MainOperation: ActionConfig{
 		Name: ActionEnterMaintenance,
 	},
+	PostOperation: []ActionConfig{
+		{
+			Name: ActionVerifyNoInstance,
+		},
+	},
 },
```
1324-1365: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win: Pause the power-on gate before compute shutdown.

With ActionPausePowerOnGate delayed until Stage 4, a reconcile can repower compute after Line 1335 succeeds and before Line 1363 runs. The workflow never re-verifies compute is still off before proceeding to the switch and shelf. Move the gate pause into Stage 3 pre-ops, or at minimum add another "off" verification after Stage 4.

Suggested direction:
```diff
-// === Stage 3: Compute — power off, verify ===
+// === Stage 3: Compute — pause gate, power off, verify ===
 {
 	ComponentType: devicetypes.ComponentTypeCompute,
 	Stage:         3,
 	MaxParallel:   0,
 	Timeout:       20 * time.Minute,
 	RetryPolicy: &RetryPolicy{
 		MaxAttempts:        3,
 		InitialInterval:    1 * time.Second,
 		BackoffCoefficient: 2.0,
 	},
+	PreOperation: []ActionConfig{
+		{
+			Name: ActionPausePowerOnGate,
+		},
+	},
 	MainOperation: ActionConfig{
 		Name: ActionPowerControl,
 		Parameters: map[string]any{
 			ParamOperation: "power_off",
 		},
 	},
 	PostOperation: []ActionConfig{
 		{
 			Name:         ActionVerifyPowerStatus,
 			Timeout:      10 * time.Minute,
 			PollInterval: 15 * time.Second,
 			Parameters: map[string]any{
 				ParamExpectedStatus: "off",
 			},
 		},
 	},
 },
-// === Stage 4: Compute — pause power-on gate ===
-{
-	ComponentType: devicetypes.ComponentTypeCompute,
-	Stage:         4,
-	MaxParallel:   0,
-	Timeout:       5 * time.Minute,
-	RetryPolicy: &RetryPolicy{
-		MaxAttempts:        3,
-		InitialInterval:    5 * time.Second,
-		BackoffCoefficient: 2.0,
-	},
-	MainOperation: ActionConfig{
-		Name: ActionPausePowerOnGate,
-	},
-},
```
📒 Files selected for processing (13)
- rla/internal/scheduler/taskschedule/template.go
- rla/internal/scheduler/taskschedule/template_test.go
- rla/internal/task/common/operation_codes.go
- rla/internal/task/operationrules/loader_test.go
- rla/internal/task/operationrules/resolver.go
- rla/internal/task/operationrules/resolver_defaults.go
- rla/internal/task/operationrules/sequences.go
- rla/internal/task/operationrules/sequences_test.go
- rla/internal/task/operations/operations.go
- rla/internal/task/operations/options.go
- rla/internal/task/operations/types.go
- rla/internal/task/operations/types_test.go
- rla/proto/v1/rla.proto
💤 Files with no reviewable changes (3)
- rla/internal/task/operationrules/resolver.go
- rla/internal/task/operations/types_test.go
- rla/internal/scheduler/taskschedule/template_test.go
✅ Files skipped from review due to trivial changes (1)
- rla/internal/task/operations/options.go
🚧 Files skipped from review as they are similar to previous changes (1)
- rla/proto/v1/rla.proto
```go
case taskcommon.TaskTypeBringDown:
	var taskInfo BringDownTaskInfo
	if err := json.Unmarshal(info, &taskInfo); err != nil {
		return nil, fmt.Errorf("failed to unmarshal bring-down task info: %w", err) //nolint
	}
	return &taskInfo, nil
```
Validate rule_id before accepting bring-down payloads.
ExtractRuleID() returns nil for an unparsable UUID, so a typo here silently drops the override and falls back to the hardcoded bring-down rule. Please reject malformed rule_id in BringDownTaskInfo.Validate() and call it from the new TaskTypeBringDown branch before returning.
Suggested fix
```diff
 case taskcommon.TaskTypeBringDown:
 	var taskInfo BringDownTaskInfo
 	if err := json.Unmarshal(info, &taskInfo); err != nil {
 		return nil, fmt.Errorf("failed to unmarshal bring-down task info: %w", err) //nolint
 	}
+	if err := taskInfo.Validate(); err != nil {
+		return nil, fmt.Errorf("invalid bring-down task info: %w", err)
+	}
 	return &taskInfo, nil
```

```diff
 func (t *BringDownTaskInfo) Validate() error {
+	if t.RuleID != "" {
+		if _, err := uuid.Parse(t.RuleID); err != nil {
+			return fmt.Errorf("invalid rule_id: %w", err)
+		}
+	}
 	return nil
 }
```

Also applies to: 218-247
```diff
 const (
-	FirmwareOperationUnknown FirmwareOperation = iota
-	FirmwareOperationUpgrade
-	FirmwareOperationDowngrade
-	FirmwareOperationRollback
-	FirmwareOperationVersion
+	FirmwareOperationUnknown   FirmwareOperation = 0
+	FirmwareOperationUpgrade   FirmwareOperation = 1
+	FirmwareOperationDowngrade FirmwareOperation = 2
+	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
+	// values for FirmwareOperationVersion remain stable.
+	FirmwareOperationVersion FirmwareOperation = 4
```
Do not let legacy rollback values serialize as upgrade.
Reserving enum value 3 suggests old persisted rollback values can still be read, but FirmwareOperation.CodeString() still falls back to "upgrade" for any unmapped value. A stored FirmwareOperation(3) will now be misrouted into the upgrade path instead of failing as unsupported.
Suggested direction
```diff
 const (
 	FirmwareOperationUnknown   FirmwareOperation = 0
 	FirmwareOperationUpgrade   FirmwareOperation = 1
 	FirmwareOperationDowngrade FirmwareOperation = 2
-	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
-	// values for FirmwareOperationVersion remain stable.
+	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
+	// values remain distinguishable from supported operations.
+	FirmwareOperationLegacyRollback FirmwareOperation = 3
 	FirmwareOperationVersion FirmwareOperation = 4
 )
```

```diff
 func (o FirmwareOperation) CodeString() string {
 	if code, ok := firmwareOperationCodes[o]; ok {
 		return code
 	}
-	return taskcommon.OpCodeFirmwareControlUpgrade // Default fallback
+	return ""
 }
```

If callers still need to detect old rollback records explicitly, keep the legacy constant and reject it during validation rather than defaulting to upgrade.
Description
Add BringDownRack to RLA — symmetric inverse of BringUpRack for safely taking a rack offline.

Sequence (6 stages)
1. Verify that no compute machine still has an allocated instance (via FindInstanceByMachineID).
2. Enter maintenance (rla-bring-down) so the machine is locked out of the allocator before its power state changes.
3. Power off compute and verify it reports off.
4. Pause the per-machine power-on gate via UpdatePowerOption(PowerStateDisabled) so a stale reconcile pass cannot bring the machine back up.
5. Power off NVLSwitch and verify.
6. Power off PowerShelf and verify.
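For readers new to the Temporal side, the sequence diagram near the top implies a workflow shape roughly like the following. Every identifier below other than the Temporal SDK calls is a placeholder: the PR's actual workflow/bringdown.go is not quoted in this conversation.

```go
// Hypothetical skeleton of a bring-down workflow: load the rule definition,
// build the targets, then run each stage's main action as an activity.
func BringDownWorkflow(ctx workflow.Context, req OperationRequest) error {
	rule, err := loadRuleDefinition(ctx, req) // placeholder helper
	if err != nil {
		return err
	}
	targets, err := buildTargets(ctx, req) // placeholder helper
	if err != nil {
		return err
	}
	for _, step := range rule.Steps {
		actCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
			StartToCloseTimeout: step.Timeout,
		})
		for _, target := range filterByType(targets, step.ComponentType) { // placeholder
			if err := workflow.ExecuteActivity(
				actCtx, step.MainOperation.Name, target,
			).Get(actCtx, nil); err != nil {
				return err // the failing stage becomes the task's final error
			}
		}
	}
	return nil
}
```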
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes