
feat: Add bring-down sequence in RLA #477

Open
kunzhao-nv wants to merge 6 commits into main from feat/bd-rla

Conversation

@kunzhao-nv
Contributor

@kunzhao-nv kunzhao-nv commented May 4, 2026

Description

Add BringDownRack to RLA — symmetric inverse of BringUpRack for safely taking a rack offline.

Sequence (6 stages)

  1. Compute — VerifyNoInstance: fail fast if any compute node still has an allocated instance (FindInstanceByMachineID).
  2. Compute — EnterMaintenance: long-lived health-report override with source rla-bring-down so the machine is locked out of the allocator before its power state changes.
  3. Compute — PowerOff: graceful power off + verify off.
  4. Compute — PausePowerOnGate: UpdatePowerOption(PowerStateDisabled) so a stale reconcile pass cannot bring the machine back up.
  5. NVLSwitch — PowerOff: graceful power off + verify off.
  6. PowerShelf — PowerOff: graceful power off + verify off.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@kunzhao-nv kunzhao-nv requested a review from a team as a code owner May 4, 2026 15:58
@copy-pr-bot

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

Summary by CodeRabbit

  • New Features

    • Rack bring-down API and multi-stage bring-down workflow (pre-checks, maintenance marking, staged power-off, pause power-on gating).
    • Precondition check to prevent bring-down when machines still have active instances.
  • Behavior Changes

    • Bring-down operations now conflict with other rack operations to avoid concurrency issues.
    • Firmware rollback option removed from available operation set.
  • Tests

    • Added tests for bring-down registration, activity wiring, and validation.

Walkthrough

Adds a complete "bring-down" operation: new BringDownRack RPC and request payload, TaskType/OpCode/operation payload, component-manager interfaces and Carbide client checks, Temporal activities/workflow and action wiring, hardcoded bring-down rule sequence, service handler to submit tasks, and related tests.

Changes

Bring-Down Feature (single cohort)

  • API & Types (rla/proto/v1/rla.proto, rla/internal/task/common/common.go, rla/internal/task/common/operation_codes.go, rla/internal/task/operations/operations.go): Adds the BringDownRack RPC and BringDownRackRequest; introduces TaskTypeBringDown, OpCodeBringDown, and the BringDownTaskInfo payload with optional RuleID.
  • Service Handler (rla/internal/service/server_impl.go): Implements RLAServerImpl.BringDownRack: validates the request, builds operations.BringDownTaskInfo and an internal operation.Request, sets the optional RuleID, submits via taskManager, and returns task IDs or an error.
  • Operation Rules / Sequencing (rla/internal/task/operationrules/sequences.go, rla/internal/task/operationrules/rules.go, rla/internal/task/operationrules/resolver_defaults.go): Adds SequenceBringDown, ActionPausePowerOnGate, and a hardcoded buildBringDownRule() with stages: verify no instance, enter maintenance, compute power-off + verify, pause power-on gate, and NVL and power-shelf power-offs. Registers the rule for TaskTypeBringDown.
  • Component Manager Interfaces (rla/internal/task/componentmanager/componentmanager.go): Adds optional interfaces: PowerOnGateController.PausePowerOnGate, InstanceVerifier.VerifyNoInstance, and MaintenanceController.EnterMaintenance.
  • Carbide Client & Mock (rla/internal/carbideapi/mod.go, rla/internal/carbideapi/grpc.go, rla/internal/carbideapi/mock.go): Adds MachineHasInstance(ctx, machineID) to Client; grpcClient.MachineHasInstance wraps FindInstanceByMachineID; the mock client stores machineInstances and exposes SetMachineHasInstance.
  • Carbide Compute Implementation (rla/internal/task/componentmanager/compute/carbide/carbide.go): Implements VerifyNoInstance (calls MachineHasInstance per component), EnterMaintenance (inserts a health-report override with a bring-down source tag), and PausePowerOnGate (updates the power option to disable the power-on gate; treats "already set" as idempotent).
  • Mock Component Manager (rla/internal/task/componentmanager/mock/mock.go): Adds mock implementations of PausePowerOnGate, VerifyNoInstance, and EnterMaintenance (log, configurable delay, return nil).
  • Temporal Activities (rla/internal/task/executor/temporalworkflow/activity/activity.go, rla/internal/task/executor/temporalworkflow/activity/registry.go, rla/internal/task/executor/temporalworkflow/activity/registry_test.go): Adds activity names and methods PausePowerOnGate, VerifyNoInstance, and EnterMaintenance. Each validates the target, ensures the component manager supports the interface, and delegates; activities are registered and tests updated.
  • Workflow Actions & Executor Wiring (rla/internal/task/executor/temporalworkflow/workflow/actions.go): Registers action executors and implements executePausePowerOnGateAction, executeVerifyNoInstanceAction, and executeEnterMaintenanceAction, which invoke the corresponding Temporal activities.
  • Bring-Down Temporal Workflow (rla/internal/task/executor/temporalworkflow/workflow/bringdown.go, rla/internal/task/executor/temporalworkflow/workflow/registry_test.go): Registers the BringDown workflow descriptor for TaskTypeBringDown, defines activity options and retry policy, implements the bringDown orchestration that runs the rule-based executor and updates task status; adds workflow registration/unmarshal tests.
  • Operation Options & Scheduler Summary (rla/internal/task/operations/options.go, rla/internal/scheduler/taskschedule/template.go): Adds a default operation-options entry for TaskTypeBringDown with Timeout = 120 minutes; SummaryFromTemplate maps TaskTypeBringDown to ("BRING_DOWN", "Bring Down").
  • Validation / Conflict / Tests (rla/internal/task/operationrules/sequences_test.go, rla/internal/task/common/common_test.go, rla/internal/task/executor/temporalworkflow/workflow/registry_test.go, rla/internal/task/conflict/conflict.go): Extends sequence validation and TaskType tests for bring-down, adds a built-in conflict rule that makes BringDown conflict with all other operations, and adds workflow registry/unmarshal tests for bring-down.
  • Firmware rollback removal (rla/internal/task/common/operation_codes.go, rla/internal/task/operations/types.go, rla/internal/task/operationrules/loader_test.go, rla/internal/scheduler/taskschedule/template_test.go): Removes the firmware rollback opcode/constant and associated test expectations (rollback mapping and test cases updated to exclude rollback).
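
The "treats 'already set' as idempotent" behavior noted for PausePowerOnGate can be sketched like this; the sentinel error and the `update` callback are hypothetical stand-ins for the real UpdatePowerOption call:

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadySet is a hypothetical sentinel for the "power option is
// already in the requested state" response from the power manager.
var errAlreadySet = errors.New("power option already set")

// pausePowerOnGate disables the power-on gate; a repeat call that hits
// the "already set" response is treated as success so that Temporal
// retries of this activity stay idempotent.
func pausePowerOnGate(update func(state string) error) error {
	err := update("PowerStateDisabled")
	if errors.Is(err, errAlreadySet) {
		return nil // already disabled: nothing to do
	}
	return err
}

func main() {
	// First call flips the state; the second hits the sentinel.
	calls := 0
	update := func(string) error {
		calls++
		if calls > 1 {
			return errAlreadySet
		}
		return nil
	}
	fmt.Println(pausePowerOnGate(update)) // <nil>
	fmt.Println(pausePowerOnGate(update)) // <nil> (idempotent repeat)
}
```

Idempotency matters here because Temporal may re-run the activity after a transient failure; a hard error on "already set" would turn a benign retry into a failed stage.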

Sequence Diagram

sequenceDiagram
    participant Client
    participant RLA as RLA Service
    participant TM as TaskManager
    participant Temporal
    participant Activities
    participant Carbide

    Client->>RLA: BringDownRack(target_spec, rule_id)
    RLA->>RLA: validate request, build BringDownTaskInfo
    RLA->>TM: SubmitTask(operation.Request)
    TM->>Temporal: schedule/start BringDown workflow

    Temporal->>Temporal: load rule definition, build targets
    Temporal->>Activities: VerifyNoInstance(target)
    Activities->>Carbide: InstanceVerifier.VerifyNoInstance(target)
    Carbide-->>Activities: result
    Activities-->>Temporal: result

    Temporal->>Activities: EnterMaintenance(target)
    Activities->>Carbide: MaintenanceController.EnterMaintenance(target)
    Carbide-->>Activities: result
    Activities-->>Temporal: result

    Temporal->>Activities: PowerControl OFF (compute)
    Activities-->>Temporal: result

    Temporal->>Activities: PausePowerOnGate(target)
    Activities->>Carbide: PowerOnGateController.PausePowerOnGate(target)
    Carbide-->>Activities: result
    Activities-->>Temporal: result

    Temporal->>Temporal: remaining stages (NVL, power shelf), update status
    Temporal-->>Client: final task status / error
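
The handler flow at the top of the diagram (validate, build task info, submit, return IDs) can be sketched with stand-in types; none of these signatures are the repository's real ones, which go through the generated gRPC types and RLAServerImpl.BringDownRack:

```go
package main

import (
	"errors"
	"fmt"
)

// bringDownRequest is a simplified stand-in for BringDownRackRequest.
type bringDownRequest struct {
	TargetRacks []string
	RuleID      string // optional rule override
}

// taskManager is a stand-in for the task-submission dependency.
type taskManager interface {
	SubmitTask(opCode string, targets []string) (taskID string, err error)
}

// bringDownRack validates the request and submits a bring-down task.
// The opcode steers the rule resolver to the bring-down rule rather
// than the default bring-up rule.
func bringDownRack(tm taskManager, req bringDownRequest) (string, error) {
	if len(req.TargetRacks) == 0 {
		return "", errors.New("target_spec must name at least one rack")
	}
	return tm.SubmitTask("bring_down", req.TargetRacks)
}

type fakeTM struct{}

func (fakeTM) SubmitTask(op string, targets []string) (string, error) {
	return fmt.Sprintf("task-%s-%d", op, len(targets)), nil
}

func main() {
	id, err := bringDownRack(fakeTM{}, bringDownRequest{TargetRacks: []string{"rack-1"}})
	fmt.Println(id, err == nil) // task-bring_down-1 true
}
```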

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): the title clearly and concisely summarizes the primary change (adding a bring-down sequence capability to RLA), which aligns with the changeset's focus.
  • Description check (✅ Passed): the description provides substantive context: the six-stage bring-down sequence, the rationale for each stage, the type of change, the affected services, and the testing approach.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.



@github-actions

github-actions Bot commented May 4, 2026

Test Results

  • 8 974 tests (+6): 8 974 ✅ (+6), 0 💤 (±0), 0 ❌ (±0)
  • 151 suites (±0) across 14 files (±0)
  • Runtime: 7m 46s (+26s)

Results for commit 9715fc1. ± Comparison against base commit bce7503.

This pull request removes 4 and adds 10 tests. Note that renamed tests count towards both.
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/scheduler/taskschedule ‑ TestSummaryFromTemplate/rollback_firmware
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_rollback
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestSequenceConstantsMatchSharedCodes/Rollback
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operations ‑ TestFirmwareOperationCodeString/Rollback
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Registered_BringDown
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown/invalid_JSON
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown/valid_empty_payload
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown/valid_with_rule_id
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/bring_down_rejected_under_bring_up_task_type
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_bring_down
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_bring_up
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_ingest
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestSequenceConstantsMatchSharedCodes/BringDown

♻️ This comment has been updated with latest results.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
rla/internal/task/operationrules/sequences_test.go (1)

95-100: ⚡ Quick win

Add bring_down to constant-parity coverage as well.

You added validity coverage for SequenceBringDown, but TestSequenceConstantsMatchSharedCodes still does not assert parity for this new code path.

Suggested test extension
 func TestSequenceConstantsMatchSharedCodes(t *testing.T) {
 	tests := []struct {
 		name          string
 		sequenceConst string
 		sharedCode    string
 	}{
@@
 		{"Rollback", SequenceRollback, common.OpCodeFirmwareControlRollback},
+		{"BringDown", SequenceBringDown, common.OpCodeBringDown},
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/operationrules/sequences_test.go` around lines 95 - 100,
The parity test TestSequenceConstantsMatchSharedCodes is missing the new
SequenceBringDown/bring_down case; update that test to include the bring_down
constant alongside the other sequence constants so the test asserts parity for
SequenceBringDown as well (e.g., add bring_down to the expected list/map or the
comparison loop used in TestSequenceConstantsMatchSharedCodes so the shared
codes mapping includes SequenceBringDown).
rla/internal/service/server_impl.go (1)

768-783: ⚡ Quick win

Align BringUpTaskInfo.OpCode with bring_down to avoid divergent metadata.

BringDownRack overrides the wrapper opcode, but the serialized BringUpTaskInfo still defaults to bring-up semantics. Setting info.OpCode keeps both representations consistent for any downstream reader of Operation.Info.

♻️ Suggested patch
 	info := &operations.BringUpTaskInfo{
 		RuleID: protobuf.UUIDStringFrom(req.GetRuleId()),
+		OpCode: taskcommon.OpCodeBringDown,
 	}
@@
-	// Override the operation code so the rule resolver picks the
-	// bring-down rule instead of the default bring-up rule.
-	opReq.Operation.Code = taskcommon.OpCodeBringDown
+	// Keep request wrapper and serialized task info aligned.
+	opReq.Operation.Code = info.CodeString()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/service/server_impl.go` around lines 768 - 783, Constructed
BringUpTaskInfo (info) currently defaults its OpCode to bring-up, causing
serialized Operation.Info to diverge from the overridden operation code; set
info.OpCode = taskcommon.OpCodeBringDown on the info struct (the BringUpTaskInfo
instance) so it matches the later override of opReq.Operation.Code =
taskcommon.OpCodeBringDown (ensure this assignment happens before calling
convertTargetSpecToOperationRequest or at least before opReq is serialized) to
keep Operation.Info and opReq.Operation.Code consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 20139a9e-7c12-43d7-9598-886d967b066c

📥 Commits

Reviewing files that changed from the base of the PR and between bce7503 and 2b39878.

⛔ Files ignored due to path filters (2)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • rla/pkg/proto/v1/rla_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (15)
  • rla/internal/service/server_impl.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/componentmanager/componentmanager.go
  • rla/internal/task/componentmanager/compute/carbide/carbide.go
  • rla/internal/task/componentmanager/mock/mock.go
  • rla/internal/task/executor/temporalworkflow/activity/activity.go
  • rla/internal/task/executor/temporalworkflow/activity/registry.go
  • rla/internal/task/executor/temporalworkflow/activity/registry_test.go
  • rla/internal/task/executor/temporalworkflow/workflow/actions.go
  • rla/internal/task/operationrules/resolver_defaults.go
  • rla/internal/task/operationrules/resolver_defaults_test.go
  • rla/internal/task/operationrules/rules.go
  • rla/internal/task/operationrules/sequences.go
  • rla/internal/task/operationrules/sequences_test.go
  • rla/proto/v1/rla.proto

Comment thread rla/proto/v1/rla.proto Outdated
Comment on lines +610 to +618
// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
// It powers off compute, pauses the per-machine power-on gate so the power
// manager will not automatically bring it back up, then powers off the
// remaining components (NVLSwitch and PowerShelf).
message BringDownRackRequest {
  OperationTargetSpec target_spec = 1; // Target racks for bring-down
  string description = 2;              // optional task description
  optional UUID rule_id = 3;           // optional: override rule resolution with a specific rule
}
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify default vs overridden bring-down behavior in request docs.

Lines 610-613 currently read as an unconditional sequence, but Line 617 permits a custom rule_id override. This can mislead API consumers about guaranteed behavior.

Suggested doc-only fix
-// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
-// It powers off compute, pauses the per-machine power-on gate so the power
-// manager will not automatically bring it back up, then powers off the
-// remaining components (NVLSwitch and PowerShelf).
+// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
+// By default, bring-down powers off compute, pauses the per-machine power-on
+// gate so the power manager will not automatically bring it back up, then
+// powers off remaining components (NVLSwitch and PowerShelf).
+// If rule_id is provided, execution follows the selected rule.
 message BringDownRackRequest {

As per coding guidelines **/*.proto: “Review the Protobuf definitions, point out issues relative to compatibility, and expressiveness.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/proto/v1/rla.proto` around lines 610 - 618, Update the
BringDownRackRequest message comment to explicitly state that the documented
sequence (power off compute, pause per-machine power-on gate, then power off
NVLSwitch and PowerShelf) is the default behavior applied when no rule override
is provided, and that supplying rule_id will override rule resolution and may
alter that sequence; reference the fields OperationTargetSpec target_spec,
string description, and optional UUID rule_id in the comment so API consumers
understand default vs overridden bring-down behavior.

@github-actions

github-actions Bot commented May 4, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-04 18:05:49 UTC | Commit: 99ae4f1

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rla/internal/service/server_impl.go`:
- Around line 751-754: Update the top-of-function comment for BringDownRack to
remove the incorrect statement that it “reuses the BringUp workflow” and instead
describe that this RPC invokes the new dedicated BringDown workflow; mention the
key steps performed (power off compute, pause per-machine power-on gate to
prevent auto-revive, then power off NVLSwitch and PowerShelf) and reference the
BringDown workflow by name so readers know this function maps to that workflow
rather than BringUp.

In `@rla/internal/task/operationrules/resolver_defaults.go`:
- Around line 1299-1428: The RuleDefinition default sequence for bring-down can
exceed the TaskTypeBringDown workflow timeout (60m); update either the workflow
timeout in options.go or reduce stage/retry/poll budgets so the worst-case sum
fits under the workflow window. Concretely, adjust RuleDefinition
(CurrentRuleDefinitionVersion) SequenceSteps timeouts/retry policies and
ActionConfig polling windows — e.g. lower Stage 3/5/6 Timeout and
RetryPolicy.MaxAttempts, reduce PostOperation VerifyPowerStatus Timeout and
PollInterval and the RetryPolicy.InitialInterval/BackoffCoefficient for stages
that use
ActionPowerControl/ActionVerifyPowerStatus/ActionVerifyNoInstance/ActionEnterMaintenance/ActionPausePowerOnGate
so the aggregated worst-case duration is safely below TaskTypeBringDown;
alternatively increase the TaskTypeBringDown timeout to a value greater than the
calculated worst-case sum.

In `@rla/internal/task/operations/operations.go`:
- Around line 212-253: New() currently lacks a branch for
taskcommon.TaskTypeBringDown, so generic deserialization returns "unsupported
task type: bring_down"; update the factory in New() to handle TaskTypeBringDown
by adding a switch/case (or equivalent) that instantiates a BringDownTaskInfo,
unmarshals into it, validates it, and returns it (use the existing pattern from
other cases for creation, calling BringDownTaskInfo.Unmarshal/Validate and
returning the resulting Task instance).
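
The missing New() branch described above could follow the existing per-type pattern; this is a simplified sketch with stand-in types, not the repository's actual factory:

```go
package main

import (
	"encoding/json"
	"fmt"
)

const TaskTypeBringDown = "bring_down"

// BringDownTaskInfo is a simplified stand-in for the real payload.
type BringDownTaskInfo struct {
	RuleID string `json:"rule_id,omitempty"`
}

// New deserializes a task payload by task type. Without a
// TaskTypeBringDown case, generic deserialization would fall through
// to the "unsupported task type" error the review flags.
func New(taskType string, raw []byte) (any, error) {
	switch taskType {
	case TaskTypeBringDown:
		info := &BringDownTaskInfo{}
		if err := json.Unmarshal(raw, info); err != nil {
			return nil, fmt.Errorf("unmarshal bring-down info: %w", err)
		}
		return info, nil
	default:
		return nil, fmt.Errorf("unsupported task type: %s", taskType)
	}
}

func main() {
	info, err := New(TaskTypeBringDown, []byte(`{"rule_id":"r-1"}`))
	fmt.Println(err == nil, info.(*BringDownTaskInfo).RuleID) // true r-1
}
```

In the real code the new case would also call the payload's Validate method before returning, mirroring the other branches.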
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 64302ded-b3ec-41ad-bc9e-cd88b6f0b80d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b39878 and 99ae4f1.

⛔ Files ignored due to path filters (2)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • rla/pkg/proto/v1/rla_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (25)
  • rla/internal/carbideapi/grpc.go
  • rla/internal/carbideapi/mock.go
  • rla/internal/carbideapi/mod.go
  • rla/internal/scheduler/taskschedule/template.go
  • rla/internal/service/server_impl.go
  • rla/internal/task/common/common.go
  • rla/internal/task/common/common_test.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/componentmanager/componentmanager.go
  • rla/internal/task/componentmanager/compute/carbide/carbide.go
  • rla/internal/task/componentmanager/mock/mock.go
  • rla/internal/task/conflict/conflict.go
  • rla/internal/task/executor/temporalworkflow/activity/activity.go
  • rla/internal/task/executor/temporalworkflow/activity/registry.go
  • rla/internal/task/executor/temporalworkflow/activity/registry_test.go
  • rla/internal/task/executor/temporalworkflow/workflow/actions.go
  • rla/internal/task/executor/temporalworkflow/workflow/bringdown.go
  • rla/internal/task/executor/temporalworkflow/workflow/registry_test.go
  • rla/internal/task/operationrules/resolver_defaults.go
  • rla/internal/task/operationrules/rules.go
  • rla/internal/task/operationrules/sequences.go
  • rla/internal/task/operationrules/sequences_test.go
  • rla/internal/task/operations/operations.go
  • rla/internal/task/operations/options.go
  • rla/proto/v1/rla.proto
✅ Files skipped from review due to trivial changes (4)
  • rla/internal/carbideapi/mod.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/executor/temporalworkflow/activity/registry.go
  • rla/internal/task/operationrules/rules.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • rla/internal/task/operationrules/sequences_test.go
  • rla/internal/task/executor/temporalworkflow/activity/registry_test.go
  • rla/proto/v1/rla.proto

Comment thread rla/internal/task/operationrules/resolver_defaults.go Outdated
Comment on lines +1299 to +1428
RuleDefinition: RuleDefinition{
Version: CurrentRuleDefinitionVersion,
Steps: []SequenceStep{
// === Stage 1: Compute — verify no allocated instance ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 1,
MaxParallel: 0,
Timeout: 5 * time.Minute,
// No retry: a "machine still has an instance" result
// is not transient; retrying just delays the failure.
MainOperation: ActionConfig{
Name: ActionVerifyNoInstance,
},
},
// === Stage 2: Compute — enter maintenance ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 2,
MaxParallel: 0,
Timeout: 5 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionEnterMaintenance,
},
},
// === Stage 3: Compute — power off, verify ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 3,
MaxParallel: 0,
Timeout: 20 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 1 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPowerControl,
Parameters: map[string]any{
ParamOperation: "power_off",
},
},
PostOperation: []ActionConfig{
{
Name: ActionVerifyPowerStatus,
Timeout: 10 * time.Minute,
PollInterval: 15 * time.Second,
Parameters: map[string]any{
ParamExpectedStatus: "off",
},
},
},
},
// === Stage 4: Compute — pause power-on gate ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 4,
MaxParallel: 0,
Timeout: 5 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPausePowerOnGate,
},
},
// === Stage 5: NVLSwitch — power off, verify ===
{
ComponentType: devicetypes.ComponentTypeNVLSwitch,
Stage: 5,
MaxParallel: 0,
Timeout: 15 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPowerControl,
Parameters: map[string]any{
ParamOperation: "power_off",
},
},
PostOperation: []ActionConfig{
{
Name: ActionVerifyPowerStatus,
Timeout: 10 * time.Minute,
PollInterval: 15 * time.Second,
Parameters: map[string]any{
ParamExpectedStatus: "off",
},
},
},
},
// === Stage 6: PowerShelf — power off, verify ===
{
ComponentType: devicetypes.ComponentTypePowerShelf,
Stage: 6,
MaxParallel: 0,
Timeout: 15 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPowerControl,
Parameters: map[string]any{
ParamOperation: "power_off",
},
},
PostOperation: []ActionConfig{
{
Name: ActionVerifyPowerStatus,
Timeout: 10 * time.Minute,
PollInterval: 15 * time.Second,
Parameters: map[string]any{
ParamExpectedStatus: "off",
},
},
},
},
},

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bring-down timeout budget can exceed the workflow timeout window.

The new default sequence can run beyond the configured TaskTypeBringDown workflow timeout (60m in rla/internal/task/operations/options.go), especially with per-step retries + verify polling. This risks timing out healthy runs mid-bring-down and leaving partial state.

Please align budgets before merge: either increase the workflow timeout for bring-down, or reduce per-stage/per-action timeout+retry envelopes so worst-case execution stays safely under the workflow timeout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rla/internal/task/operationrules/resolver_defaults.go` around lines 1299 -
1428, The RuleDefinition default sequence for bring-down can exceed the
TaskTypeBringDown workflow timeout (60m); update either the workflow timeout in
options.go or reduce stage/retry/poll budgets so the worst-case sum fits under
the workflow window. Concretely, adjust RuleDefinition
(CurrentRuleDefinitionVersion) SequenceSteps timeouts/retry policies and
ActionConfig polling windows — e.g. lower Stage 3/5/6 Timeout and
RetryPolicy.MaxAttempts, reduce PostOperation VerifyPowerStatus Timeout and
PollInterval and the RetryPolicy.InitialInterval/BackoffCoefficient for stages
that use
ActionPowerControl/ActionVerifyPowerStatus/ActionVerifyNoInstance/ActionEnterMaintenance/ActionPausePowerOnGate
so the aggregated worst-case duration is safely below TaskTypeBringDown;
alternatively increase the TaskTypeBringDown timeout to a value greater than the
calculated worst-case sum.

Comment thread rla/internal/task/operations/operations.go
@kunzhao-nv kunzhao-nv requested a review from jw-nvidia May 4, 2026 18:18

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rla/internal/task/operationrules/resolver_defaults.go`:
- Around line 1298-1324: Stage 1 (ActionVerifyNoInstance) proves emptiness only
briefly so add a second authoritative verification after maintenance: insert a
new Stage (after the ActionEnterMaintenance stage) that uses ComponentType
compute with MainOperation set to ActionVerifyNoInstance, a short Timeout (e.g.,
5m) and no retries so the maintenance lockout and allocator lock are in place
before any subsequent power actions; reference the existing symbols
ActionVerifyNoInstance and ActionEnterMaintenance to locate where to add this
stage and ensure stage ordering guarantees the extra check runs before any
power-off operations.
- Around line 1325-1366: The pause gate (ActionPausePowerOnGate) must be applied
before powering off compute to avoid a reconcile re-power window; modify the
stages so the pause gate runs prior to the power_off+verify sequence (or if you
keep it as its own Stage 4, insert a follow-up VerifyPowerStatus(off)
ActionConfig after ActionPausePowerOnGate) — add an ActionVerifyPowerStatus with
Parameters {ParamExpectedStatus: "off"} (as a PostOperation or next-stage
MainOperation) to ensure the node is still off before proceeding to Stage 5;
update the RetryPolicy/Timeout blocks around ActionPausePowerOnGate and
ActionVerifyPowerStatus as needed to match existing patterns.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 41c6fbc1-a2ea-4c68-afbb-34a6b282b3d2

📥 Commits

Reviewing files that changed from the base of the PR and between 99ae4f1 and fe0692c.

⛔ Files ignored due to path filters (2)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • rla/pkg/proto/v1/rla_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (2)
  • rla/internal/service/server_impl.go
  • rla/internal/task/operationrules/resolver_defaults.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • rla/internal/service/server_impl.go

Comment on lines +1298 to +1324
				// === Stage 1: Compute — verify no allocated instance ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         1,
					MaxParallel:   0,
					Timeout:       5 * time.Minute,
					// No retry: a "machine still has an instance" result
					// is not transient; retrying just delays the failure.
					MainOperation: ActionConfig{
						Name: ActionVerifyNoInstance,
					},
				},
				// === Stage 2: Compute — enter maintenance ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         2,
					MaxParallel:   0,
					Timeout:       5 * time.Minute,
					RetryPolicy: &RetryPolicy{
						MaxAttempts:        3,
						InitialInterval:    5 * time.Second,
						BackoffCoefficient: 2.0,
					},
					MainOperation: ActionConfig{
						Name: ActionEnterMaintenance,
					},
				},

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Close the verify/maintenance TOCTOU before any power action.

Stage 1 proves the compute is empty only at that instant. Because allocator lockout is not applied until Stage 2, a new instance can still land in the gap and Stage 3 can power off a machine that has become allocated.

Suggested minimal fix
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         2,
 					MaxParallel:   0,
 					Timeout:       5 * time.Minute,
 					RetryPolicy: &RetryPolicy{
 						MaxAttempts:        3,
 						InitialInterval:    5 * time.Second,
 						BackoffCoefficient: 2.0,
 					},
 					MainOperation: ActionConfig{
 						Name: ActionEnterMaintenance,
 					},
+					PostOperation: []ActionConfig{
+						{
+							Name: ActionVerifyNoInstance,
+						},
+					},
 				},

If you want to keep the early fail-fast behavior, a second VerifyNoInstance after maintenance is the smallest way to make the check authoritative.
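As a toy model of the race, here is a minimal in-process sketch; the `allocator` type and its methods are invented, and the mutex stands in for the distributed health-report lockout the real system uses:

```go
package main

import (
	"fmt"
	"sync"
)

// allocator models the check-then-act race the review flags: a verify that
// runs before the maintenance lockout can be invalidated before power-off.
type allocator struct {
	mu        sync.Mutex
	allocated map[string]bool
	locked    map[string]bool
}

// Allocate models instance placement; the maintenance lockout rejects it.
func (a *allocator) Allocate(id string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.locked[id] {
		return false
	}
	a.allocated[id] = true
	return true
}

// BringDown mirrors the fixed sequence: verify fast, apply the lockout,
// then verify again under the lockout so the second check is authoritative.
func (a *allocator) BringDown(id string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.allocated[id] { // Stage 1: fail fast
		return fmt.Errorf("%s still has an instance", id)
	}
	a.locked[id] = true // Stage 2: enter maintenance
	if a.allocated[id] { // post-op re-check closes the TOCTOU gap
		return fmt.Errorf("%s was allocated during bring-down", id)
	}
	return nil
}

func main() {
	a := &allocator{allocated: map[string]bool{}, locked: map[string]bool{}}
	fmt.Println(a.BringDown("node-1"))
	fmt.Println(a.Allocate("node-1")) // rejected: lockout holds
}
```

The point of the PostOperation re-check is exactly the second `allocated` test above: it runs only after the lockout is in force, so its answer cannot be invalidated by a later placement.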

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 				// === Stage 1: Compute — verify no allocated instance ===
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         1,
 					MaxParallel:   0,
 					Timeout:       5 * time.Minute,
 					// No retry: a "machine still has an instance" result
 					// is not transient; retrying just delays the failure.
 					MainOperation: ActionConfig{
 						Name: ActionVerifyNoInstance,
 					},
 				},
 				// === Stage 2: Compute — enter maintenance ===
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         2,
 					MaxParallel:   0,
 					Timeout:       5 * time.Minute,
 					RetryPolicy: &RetryPolicy{
 						MaxAttempts:        3,
 						InitialInterval:    5 * time.Second,
 						BackoffCoefficient: 2.0,
 					},
 					MainOperation: ActionConfig{
 						Name: ActionEnterMaintenance,
 					},
+					PostOperation: []ActionConfig{
+						{
+							Name: ActionVerifyNoInstance,
+						},
+					},
 				},

Comment on lines +1325 to +1366
				// === Stage 3: Compute — power off, verify ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         3,
					MaxParallel:   0,
					Timeout:       20 * time.Minute,
					RetryPolicy: &RetryPolicy{
						MaxAttempts:        3,
						InitialInterval:    1 * time.Second,
						BackoffCoefficient: 2.0,
					},
					MainOperation: ActionConfig{
						Name: ActionPowerControl,
						Parameters: map[string]any{
							ParamOperation: "power_off",
						},
					},
					PostOperation: []ActionConfig{
						{
							Name:         ActionVerifyPowerStatus,
							Timeout:      10 * time.Minute,
							PollInterval: 15 * time.Second,
							Parameters: map[string]any{
								ParamExpectedStatus: "off",
							},
						},
					},
				},
				// === Stage 4: Compute — pause power-on gate ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         4,
					MaxParallel:   0,
					Timeout:       5 * time.Minute,
					RetryPolicy: &RetryPolicy{
						MaxAttempts:        3,
						InitialInterval:    5 * time.Second,
						BackoffCoefficient: 2.0,
					},
					MainOperation: ActionConfig{
						Name: ActionPausePowerOnGate,
					},

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Pause the power-on gate before compute shutdown.

ActionPausePowerOnGate is the safeguard against automatic repower, but it currently runs only after compute has been powered off and verified off. That leaves a real window where a reconcile can turn the node back on, and the workflow never re-checks it before powering off the NVL switch and power shelf.

Suggested direction
-				// === Stage 3: Compute — power off, verify ===
+				// === Stage 3: Compute — pause gate, power off, verify ===
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         3,
 					MaxParallel:   0,
 					Timeout:       20 * time.Minute,
 					RetryPolicy: &RetryPolicy{
 						MaxAttempts:        3,
 						InitialInterval:    1 * time.Second,
 						BackoffCoefficient: 2.0,
 					},
+					PreOperation: []ActionConfig{
+						{
+							Name: ActionPausePowerOnGate,
+						},
+					},
 					MainOperation: ActionConfig{
 						Name: ActionPowerControl,
 						Parameters: map[string]any{
 							ParamOperation: "power_off",
 						},
 					},
 					PostOperation: []ActionConfig{
 						{
 							Name:         ActionVerifyPowerStatus,
 							Timeout:      10 * time.Minute,
 							PollInterval: 15 * time.Second,
 							Parameters: map[string]any{
 								ParamExpectedStatus: "off",
 							},
 						},
 					},
 				},
-				// === Stage 4: Compute — pause power-on gate ===
-				{
-					ComponentType: devicetypes.ComponentTypeCompute,
-					Stage:         4,
-					MaxParallel:   0,
-					Timeout:       5 * time.Minute,
-					RetryPolicy: &RetryPolicy{
-						MaxAttempts:        3,
-						InitialInterval:    5 * time.Second,
-						BackoffCoefficient: 2.0,
-					},
-					MainOperation: ActionConfig{
-						Name: ActionPausePowerOnGate,
-					},
-				},

If you keep a separate stage, it should at least be followed by another VerifyPowerStatus(off) before Stage 5.
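The ordering argument can be shown with a small sketch; `node` and `reconcile` are invented stand-ins for the power-on gate and the reconcile loop, not RLA code:

```go
package main

import "fmt"

// node models the two bits of state that matter for the race: whether the
// machine is powered on and whether its power-on gate is paused.
type node struct {
	poweredOn  bool
	gatePaused bool
}

// reconcile models the automatic repower the review warns about: a stale
// pass powers the node back on unless the gate is already paused.
func reconcile(n *node) {
	if !n.poweredOn && !n.gatePaused {
		n.poweredOn = true
	}
}

func main() {
	// Current ordering: power off (Stage 3), then pause gate (Stage 4).
	racy := &node{poweredOn: true}
	racy.poweredOn = false // Stage 3 power_off + verify succeeds
	reconcile(racy)        // reconcile races in before Stage 4
	fmt.Println("repowered without gate:", racy.poweredOn)

	// Suggested ordering: pause gate first, then power off.
	safe := &node{poweredOn: true}
	safe.gatePaused = true // pre-op pause
	safe.poweredOn = false
	reconcile(safe)
	fmt.Println("repowered with gate:", safe.poweredOn)
}
```

With the gate paused as a pre-operation, the reconcile branch can never fire, which is why moving `ActionPausePowerOnGate` ahead of the power-off removes the window rather than merely shrinking it.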



@jw-nvidia jw-nvidia left a comment


Between stages 1 and 2, a machine can be allocated again.
Between stages 3 and 4, a machine can be powered back on.
Stages [1 and 2] and [3 and 4] should each be atomic operations.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
rla/internal/task/operationrules/resolver_defaults.go (2)

1297-1323: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Re-check compute emptiness after maintenance.

Stage 1 only proves the node is empty before allocator lockout. A new instance can land between Line 1305 and Line 1321, and Stage 3 would still power that machine off. Add a second VerifyNoInstance after ActionEnterMaintenance so the authoritative check happens under maintenance.


1324-1365: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Pause the power-on gate before compute shutdown.

With ActionPausePowerOnGate delayed until Stage 4, a reconcile can repower compute after Line 1335 succeeds and before Line 1363 runs. The workflow never re-verifies compute is still off before proceeding to the switch and shelf. Move the gate pause into Stage 3 pre-ops, or at minimum add another "off" verification after Stage 4.

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rla/internal/task/operations/operations.go`:
- Around line 81-86: The bring-down task branch currently unmarshals into
BringDownTaskInfo but doesn't validate rule_id; update
BringDownTaskInfo.Validate() to reject malformed rule_id by calling
ExtractRuleID(info.RuleID) and returning an error if it returns nil, and then
call BringDownTaskInfo.Validate() in the TaskTypeBringDown case (before
returning) to ensure malformed UUIDs are rejected; apply the same validation
call to the other bring-down handling branch referenced in the comment so both
paths enforce rule_id validation.

In `@rla/internal/task/operations/types.go`:
- Around line 108-114: Add an explicit legacy constant for the reserved value
(e.g., FirmwareOperationRollback = 3) and update FirmwareOperation.CodeString()
so it maps only known values to their correct strings and does NOT fall back to
"upgrade" for unmapped/legacy values; instead return an explicit "rollback" (for
the legacy constant) or an "unsupported"/error indicator for any other unknown
values, and ensure callers validate/reject FirmwareOperationRollback during
input/validation paths rather than treating it as an upgrade.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5bfb9d32-6b61-457d-a086-c45096ee910e

📥 Commits

Reviewing files that changed from the base of the PR and between fe0692c and 9715fc1.

⛔ Files ignored due to path filters (1)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (13)
  • rla/internal/scheduler/taskschedule/template.go
  • rla/internal/scheduler/taskschedule/template_test.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/operationrules/loader_test.go
  • rla/internal/task/operationrules/resolver.go
  • rla/internal/task/operationrules/resolver_defaults.go
  • rla/internal/task/operationrules/sequences.go
  • rla/internal/task/operationrules/sequences_test.go
  • rla/internal/task/operations/operations.go
  • rla/internal/task/operations/options.go
  • rla/internal/task/operations/types.go
  • rla/internal/task/operations/types_test.go
  • rla/proto/v1/rla.proto
💤 Files with no reviewable changes (3)
  • rla/internal/task/operationrules/resolver.go
  • rla/internal/task/operations/types_test.go
  • rla/internal/scheduler/taskschedule/template_test.go
✅ Files skipped from review due to trivial changes (1)
  • rla/internal/task/operations/options.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • rla/proto/v1/rla.proto

Comment on lines +81 to +86
	case taskcommon.TaskTypeBringDown:
		var taskInfo BringDownTaskInfo
		if err := json.Unmarshal(info, &taskInfo); err != nil {
			return nil, fmt.Errorf("failed to unmarshal bring-down task info: %w", err) //nolint
		}
		return &taskInfo, nil

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate rule_id before accepting bring-down payloads.

ExtractRuleID() returns nil for an unparsable UUID, so a typo here silently drops the override and falls back to the hardcoded bring-down rule. Please reject malformed rule_id in BringDownTaskInfo.Validate() and call it from the new TaskTypeBringDown branch before returning.

Suggested fix
 	case taskcommon.TaskTypeBringDown:
 		var taskInfo BringDownTaskInfo
 		if err := json.Unmarshal(info, &taskInfo); err != nil {
 			return nil, fmt.Errorf("failed to unmarshal bring-down task info: %w", err) //nolint
 		}
+		if err := taskInfo.Validate(); err != nil {
+			return nil, fmt.Errorf("invalid bring-down task info: %w", err)
+		}
 		return &taskInfo, nil
 func (t *BringDownTaskInfo) Validate() error {
+	if t.RuleID != "" {
+		if _, err := uuid.Parse(t.RuleID); err != nil {
+			return fmt.Errorf("invalid rule_id: %w", err)
+		}
+	}
 	return nil
 }

Also applies to: 218-247
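A stdlib-only sketch of the suggested `Validate()`; the real fix would presumably call `uuid.Parse` from github.com/google/uuid, so the regex here is only an illustrative approximation of the RFC 4122 text form:

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidRe approximates the 8-4-4-4-12 hex UUID text form. Illustrative only;
// prefer a real UUID parser in production code.
var uuidRe = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

type BringDownTaskInfo struct {
	RuleID string `json:"rule_id"`
}

// Validate rejects a malformed rule_id instead of silently falling back to
// the default bring-down rule; an empty rule_id means "use the default" and
// is allowed.
func (t *BringDownTaskInfo) Validate() error {
	if t.RuleID != "" && !uuidRe.MatchString(t.RuleID) {
		return fmt.Errorf("invalid rule_id %q", t.RuleID)
	}
	return nil
}

func main() {
	fmt.Println((&BringDownTaskInfo{RuleID: "not-a-uuid"}).Validate())
	fmt.Println((&BringDownTaskInfo{}).Validate())
}
```

Calling this from the `TaskTypeBringDown` branch turns a typoed override into a hard error at unmarshal time rather than a silent fallback.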


Comment on lines 108 to +114
 const (
-	FirmwareOperationUnknown FirmwareOperation = iota
-	FirmwareOperationUpgrade
-	FirmwareOperationDowngrade
-	FirmwareOperationRollback
-	FirmwareOperationVersion
+	FirmwareOperationUnknown   FirmwareOperation = 0
+	FirmwareOperationUpgrade   FirmwareOperation = 1
+	FirmwareOperationDowngrade FirmwareOperation = 2
+	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
+	// values for FirmwareOperationVersion remain stable.
+	FirmwareOperationVersion FirmwareOperation = 4

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not let legacy rollback values serialize as upgrade.

Reserving enum value 3 suggests old persisted rollback values can still be read, but FirmwareOperation.CodeString() still falls back to "upgrade" for any unmapped value. A stored FirmwareOperation(3) will now be misrouted into the upgrade path instead of failing as unsupported.

Suggested direction
 const (
 	FirmwareOperationUnknown   FirmwareOperation = 0
 	FirmwareOperationUpgrade   FirmwareOperation = 1
 	FirmwareOperationDowngrade FirmwareOperation = 2
-	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
-	// values for FirmwareOperationVersion remain stable.
+	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
+	// values remain distinguishable from supported operations.
+	FirmwareOperationLegacyRollback FirmwareOperation = 3
 	FirmwareOperationVersion FirmwareOperation = 4
 )
 func (o FirmwareOperation) CodeString() string {
 	if code, ok := firmwareOperationCodes[o]; ok {
 		return code
 	}
-	return taskcommon.OpCodeFirmwareControlUpgrade // Default fallback
+	return ""
 }

If callers still need to detect old rollback records explicitly, keep the legacy constant and reject it during validation rather than defaulting to upgrade.
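A sketch of that direction, with `CodeString` surfacing an error instead of defaulting to upgrade; the constant values mirror the diff above, while the code strings and the error-returning signature are illustrative assumptions, not the actual RLA API:

```go
package main

import "fmt"

type FirmwareOperation int

const (
	FirmwareOperationUnknown        FirmwareOperation = 0
	FirmwareOperationUpgrade        FirmwareOperation = 1
	FirmwareOperationDowngrade      FirmwareOperation = 2
	FirmwareOperationLegacyRollback FirmwareOperation = 3 // reserved, no longer supported
	FirmwareOperationVersion        FirmwareOperation = 4
)

// firmwareOperationCodes maps only supported operations; the legacy rollback
// value is deliberately absent so it cannot be misrouted.
var firmwareOperationCodes = map[FirmwareOperation]string{
	FirmwareOperationUpgrade:   "firmware_upgrade",
	FirmwareOperationDowngrade: "firmware_downgrade",
	FirmwareOperationVersion:   "firmware_version",
}

// CodeString returns the code for known operations and an error for
// unknown or legacy values instead of falling back to upgrade.
func (o FirmwareOperation) CodeString() (string, error) {
	if code, ok := firmwareOperationCodes[o]; ok {
		return code, nil
	}
	return "", fmt.Errorf("unsupported firmware operation %d", o)
}

func main() {
	if _, err := FirmwareOperationLegacyRollback.CodeString(); err != nil {
		fmt.Println(err) // a persisted rollback record now fails loudly
	}
	code, _ := FirmwareOperationUpgrade.CodeString()
	fmt.Println(code)
}
```

With this shape, a stored `FirmwareOperation(3)` is rejected at the call site rather than quietly executed as an upgrade.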

