
feat: Add bring-down sequence in RLA #477

Open
kunzhao-nv wants to merge 6 commits into main from feat/bd-rla

Conversation

@kunzhao-nv
Contributor

@kunzhao-nv kunzhao-nv commented May 4, 2026

Description

Add BringDownRack to RLA — symmetric inverse of BringUpRack for safely taking a rack offline.

Sequence (6 stages)

  1. Compute — VerifyNoInstance: fail fast if any compute node still has an allocated instance (FindInstanceByMachineID).
  2. Compute — EnterMaintenance: long-lived health-report override with source rla-bring-down so the machine is locked out of the allocator before its power state changes.
  3. Compute — PowerOff: graceful power off + verify off.
  4. Compute — PausePowerOnGate: UpdatePowerOption(PowerStateDisabled) so a stale reconcile pass cannot bring the machine back up.
  5. NVLSwitch — PowerOff: graceful power off + verify off.
  6. PowerShelf — PowerOff: graceful power off + verify off.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@kunzhao-nv kunzhao-nv requested a review from a team as a code owner May 4, 2026 15:58
@copy-pr-bot

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026

Summary by CodeRabbit

  • New Features

    • Rack bring-down API and multi-stage bring-down workflow (pre-checks, maintenance marking, staged power-off, pause power-on gating).
    • Precondition check to prevent bring-down when machines still have active instances.
  • Behavior Changes

    • Bring-down operations now conflict with other rack operations to avoid concurrency issues.
    • Firmware rollback option removed from available operation set.
  • Tests

    • Added tests for bring-down registration, activity wiring, and validation.

Walkthrough

Adds a complete "bring-down" operation: new BringDownRack RPC and request payload, TaskType/OpCode/operation payload, component-manager interfaces and Carbide client checks, Temporal activities/workflow and action wiring, hardcoded bring-down rule sequence, service handler to submit tasks, and related tests.

Changes

Bring-Down Feature (single cohort)

  • API & Types (rla/proto/v1/rla.proto, rla/internal/task/common/common.go, rla/internal/task/common/operation_codes.go, rla/internal/task/operations/operations.go): Adds the BringDownRack RPC and BringDownRackRequest; introduces TaskTypeBringDown, OpCodeBringDown, and the BringDownTaskInfo payload with optional RuleID.
  • Service Handler (rla/internal/service/server_impl.go): Implements RLAServerImpl.BringDownRack: validates the request, builds operations.BringDownTaskInfo and an internal operation.Request, sets the optional RuleID, submits via taskManager, and returns task IDs or an error.
  • Operation Rules / Sequencing (rla/internal/task/operationrules/sequences.go, rla/internal/task/operationrules/rules.go, rla/internal/task/operationrules/resolver_defaults.go): Adds SequenceBringDown, ActionPausePowerOnGate, and a hardcoded buildBringDownRule() with stages: verify no instance, enter maintenance, compute power-off + verify, pause power-on gate, and NVL and power-shelf power-offs. Registers the rule for TaskTypeBringDown.
  • Component Manager Interfaces (rla/internal/task/componentmanager/componentmanager.go): Adds optional interfaces: PowerOnGateController.PausePowerOnGate, InstanceVerifier.VerifyNoInstance, and MaintenanceController.EnterMaintenance.
  • Carbide Client & Mock (rla/internal/carbideapi/mod.go, rla/internal/carbideapi/grpc.go, rla/internal/carbideapi/mock.go): Adds MachineHasInstance(ctx, machineID) to Client; grpcClient.MachineHasInstance wraps FindInstanceByMachineID; the mock client stores machineInstances and exposes SetMachineHasInstance.
  • Carbide Compute Implementation (rla/internal/task/componentmanager/compute/carbide/carbide.go): Implements VerifyNoInstance (calls MachineHasInstance per component), EnterMaintenance (inserts a health-report override with a bring-down source tag), and PausePowerOnGate (updates the power option to disable the power-on gate; treats "already set" as idempotent).
  • Mock Component Manager (rla/internal/task/componentmanager/mock/mock.go): Adds mock implementations of PausePowerOnGate, VerifyNoInstance, and EnterMaintenance (log, configurable delay, return nil).
  • Temporal Activities (rla/internal/task/executor/temporalworkflow/activity/activity.go, rla/internal/task/executor/temporalworkflow/activity/registry.go, rla/internal/task/executor/temporalworkflow/activity/registry_test.go): Adds activity names and methods PausePowerOnGate, VerifyNoInstance, and EnterMaintenance. Each validates the target, ensures the component manager supports the interface, and delegates; activities are registered and tests updated.
  • Workflow Actions & Executor Wiring (rla/internal/task/executor/temporalworkflow/workflow/actions.go): Registers action executors and implements executePausePowerOnGateAction, executeVerifyNoInstanceAction, and executeEnterMaintenanceAction, which invoke the corresponding Temporal activities.
  • Bring-Down Temporal Workflow (rla/internal/task/executor/temporalworkflow/workflow/bringdown.go, rla/internal/task/executor/temporalworkflow/workflow/registry_test.go): Registers the BringDown workflow descriptor for TaskTypeBringDown, defines activity options and retry policy, implements the bringDown orchestration that runs the rule-based executor and updates task status; adds workflow registration/unmarshal tests.
  • Operation Options & Scheduler Summary (rla/internal/task/operations/options.go, rla/internal/scheduler/taskschedule/template.go): Adds a default operation-options entry for TaskTypeBringDown with Timeout = 120 minutes; SummaryFromTemplate maps TaskTypeBringDown to ("BRING_DOWN", "Bring Down").
  • Validation / Conflict / Tests (rla/internal/task/operationrules/sequences_test.go, rla/internal/task/common/common_test.go, rla/internal/task/executor/temporalworkflow/workflow/registry_test.go, rla/internal/task/conflict/conflict.go): Extends sequence validation and TaskType tests for bring-down, adds a built-in conflict rule that makes BringDown conflict with all other operations, and adds workflow registry/unmarshal tests for bring-down.
  • Firmware rollback removal (rla/internal/task/common/operation_codes.go, rla/internal/task/operations/types.go, rla/internal/task/operationrules/loader_test.go, rla/internal/scheduler/taskschedule/template_test.go): Removes the firmware rollback opcode/constant and associated test expectations (rollback mapping and test cases updated to exclude rollback).
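
The "treats 'already set' as idempotent" behavior noted for PausePowerOnGate can be sketched like this; the sentinel error and the `update` callback are hypothetical stand-ins for the real UpdatePowerOption call:

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadySet is a hypothetical sentinel for the "power option is
// already in the requested state" response from the power manager.
var errAlreadySet = errors.New("power option already set")

// pausePowerOnGate disables the power-on gate; a repeat call that hits
// the "already set" response is treated as success so that Temporal
// retries of this activity stay idempotent.
func pausePowerOnGate(update func(state string) error) error {
	err := update("PowerStateDisabled")
	if errors.Is(err, errAlreadySet) {
		return nil // already disabled: nothing to do
	}
	return err
}

func main() {
	// First call flips the state; the second hits the sentinel.
	calls := 0
	update := func(string) error {
		calls++
		if calls > 1 {
			return errAlreadySet
		}
		return nil
	}
	fmt.Println(pausePowerOnGate(update)) // <nil>
	fmt.Println(pausePowerOnGate(update)) // <nil> (idempotent repeat)
}
```

Idempotency matters here because Temporal may re-run the activity after a transient failure; a hard error on "already set" would turn a benign retry into a failed stage.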

Sequence Diagram

sequenceDiagram
    participant Client
    participant RLA as RLA Service
    participant TM as TaskManager
    participant Temporal
    participant Activities
    participant Carbide

    Client->>RLA: BringDownRack(target_spec, rule_id)
    RLA->>RLA: validate request, build BringDownTaskInfo
    RLA->>TM: SubmitTask(operation.Request)
    TM->>Temporal: schedule/start BringDown workflow

    Temporal->>Temporal: load rule definition, build targets
    Temporal->>Activities: VerifyNoInstance(target)
    Activities->>Carbide: InstanceVerifier.VerifyNoInstance(target)
    Carbide-->>Activities: result
    Activities-->>Temporal: result

    Temporal->>Activities: EnterMaintenance(target)
    Activities->>Carbide: MaintenanceController.EnterMaintenance(target)
    Carbide-->>Activities: result
    Activities-->>Temporal: result

    Temporal->>Activities: PowerControl OFF (compute)
    Activities-->>Temporal: result

    Temporal->>Activities: PausePowerOnGate(target)
    Activities->>Carbide: PowerOnGateController.PausePowerOnGate(target)
    Carbide-->>Activities: result
    Activities-->>Temporal: result

    Temporal->>Temporal: remaining stages (NVL, power shelf), update status
    Temporal-->>Client: final task status / error
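
The handler flow at the top of the diagram (validate, build task info, submit, return IDs) can be sketched with stand-in types; none of these signatures are the repository's real ones, which go through the generated gRPC types and RLAServerImpl.BringDownRack:

```go
package main

import (
	"errors"
	"fmt"
)

// bringDownRequest is a simplified stand-in for BringDownRackRequest.
type bringDownRequest struct {
	TargetRacks []string
	RuleID      string // optional rule override
}

// taskManager is a stand-in for the task-submission dependency.
type taskManager interface {
	SubmitTask(opCode string, targets []string) (taskID string, err error)
}

// bringDownRack validates the request and submits a bring-down task.
// The opcode steers the rule resolver to the bring-down rule rather
// than the default bring-up rule.
func bringDownRack(tm taskManager, req bringDownRequest) (string, error) {
	if len(req.TargetRacks) == 0 {
		return "", errors.New("target_spec must name at least one rack")
	}
	return tm.SubmitTask("bring_down", req.TargetRacks)
}

type fakeTM struct{}

func (fakeTM) SubmitTask(op string, targets []string) (string, error) {
	return fmt.Sprintf("task-%s-%d", op, len(targets)), nil
}

func main() {
	id, err := bringDownRack(fakeTM{}, bringDownRequest{TargetRacks: []string{"rack-1"}})
	fmt.Println(id, err == nil) // task-bring_down-1 true
}
```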

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check (✅ Passed): the title clearly and concisely summarizes the primary change (adding a bring-down sequence capability to RLA), which aligns with the changeset's focus.
  • Description check (✅ Passed): the description provides substantive context: the six-stage bring-down sequence, the rationale for each stage, the type of change, the affected services, and the testing approach.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.



@github-actions

github-actions Bot commented May 4, 2026

Test Results

  • 8 974 tests (+6): 8 974 ✅ (+6), 0 💤 (±0), 0 ❌ (±0)
  • 151 suites (±0) across 14 files (±0)
  • Runtime: 7m 46s (+26s)

Results for commit 9715fc1. ± Comparison against base commit bce7503.

This pull request removes 4 and adds 10 tests. Note that renamed tests count towards both.
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/scheduler/taskschedule ‑ TestSummaryFromTemplate/rollback_firmware
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_rollback
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestSequenceConstantsMatchSharedCodes/Rollback
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operations ‑ TestFirmwareOperationCodeString/Rollback
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Registered_BringDown
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown/invalid_JSON
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown/valid_empty_payload
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/executor/temporalworkflow/workflow ‑ TestWorkflowDescriptor_Unmarshal_BringDown/valid_with_rule_id
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/bring_down_rejected_under_bring_up_task_type
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_bring_down
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_bring_up
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestIsValidOperation/valid_ingest
github.com/NVIDIA/ncx-infra-controller-rest/rla/internal/task/operationrules ‑ TestSequenceConstantsMatchSharedCodes/BringDown

♻️ This comment has been updated with latest results.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
rla/internal/task/operationrules/sequences_test.go (1)

95-100: ⚡ Quick win

Add bring_down to constant-parity coverage as well.

You added validity coverage for SequenceBringDown, but TestSequenceConstantsMatchSharedCodes still does not assert parity for this new code path.

Suggested test extension
 func TestSequenceConstantsMatchSharedCodes(t *testing.T) {
 	tests := []struct {
 		name          string
 		sequenceConst string
 		sharedCode    string
 	}{
@@
 		{"Rollback", SequenceRollback, common.OpCodeFirmwareControlRollback},
+		{"BringDown", SequenceBringDown, common.OpCodeBringDown},
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/operationrules/sequences_test.go` around lines 95 - 100,
The parity test TestSequenceConstantsMatchSharedCodes is missing the new
SequenceBringDown/bring_down case; update that test to include the bring_down
constant alongside the other sequence constants so the test asserts parity for
SequenceBringDown as well (e.g., add bring_down to the expected list/map or the
comparison loop used in TestSequenceConstantsMatchSharedCodes so the shared
codes mapping includes SequenceBringDown).
rla/internal/service/server_impl.go (1)

768-783: ⚡ Quick win

Align BringUpTaskInfo.OpCode with bring_down to avoid divergent metadata.

BringDownRack overrides the wrapper opcode, but the serialized BringUpTaskInfo still defaults to bring-up semantics. Setting info.OpCode keeps both representations consistent for any downstream reader of Operation.Info.

♻️ Suggested patch
 	info := &operations.BringUpTaskInfo{
 		RuleID: protobuf.UUIDStringFrom(req.GetRuleId()),
+		OpCode: taskcommon.OpCodeBringDown,
 	}
@@
-	// Override the operation code so the rule resolver picks the
-	// bring-down rule instead of the default bring-up rule.
-	opReq.Operation.Code = taskcommon.OpCodeBringDown
+	// Keep request wrapper and serialized task info aligned.
+	opReq.Operation.Code = info.CodeString()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/service/server_impl.go` around lines 768 - 783, Constructed
BringUpTaskInfo (info) currently defaults its OpCode to bring-up, causing
serialized Operation.Info to diverge from the overridden operation code; set
info.OpCode = taskcommon.OpCodeBringDown on the info struct (the BringUpTaskInfo
instance) so it matches the later override of opReq.Operation.Code =
taskcommon.OpCodeBringDown (ensure this assignment happens before calling
convertTargetSpecToOperationRequest or at least before opReq is serialized) to
keep Operation.Info and opReq.Operation.Code consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 20139a9e-7c12-43d7-9598-886d967b066c

📥 Commits

Reviewing files that changed from the base of the PR and between bce7503 and 2b39878.

⛔ Files ignored due to path filters (2)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • rla/pkg/proto/v1/rla_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (15)
  • rla/internal/service/server_impl.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/componentmanager/componentmanager.go
  • rla/internal/task/componentmanager/compute/carbide/carbide.go
  • rla/internal/task/componentmanager/mock/mock.go
  • rla/internal/task/executor/temporalworkflow/activity/activity.go
  • rla/internal/task/executor/temporalworkflow/activity/registry.go
  • rla/internal/task/executor/temporalworkflow/activity/registry_test.go
  • rla/internal/task/executor/temporalworkflow/workflow/actions.go
  • rla/internal/task/operationrules/resolver_defaults.go
  • rla/internal/task/operationrules/resolver_defaults_test.go
  • rla/internal/task/operationrules/rules.go
  • rla/internal/task/operationrules/sequences.go
  • rla/internal/task/operationrules/sequences_test.go
  • rla/proto/v1/rla.proto

Comment thread rla/proto/v1/rla.proto Outdated
Comment on lines +610 to +618
// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
// It powers off compute, pauses the per-machine power-on gate so the power
// manager will not automatically bring it back up, then powers off the
// remaining components (NVLSwitch and PowerShelf).
message BringDownRackRequest {
  OperationTargetSpec target_spec = 1; // Target racks for bring-down
  string description = 2;              // optional task description
  optional UUID rule_id = 3;           // optional: override rule resolution with a specific rule
}
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify default vs overridden bring-down behavior in request docs.

Lines 610-613 currently read as an unconditional sequence, but Line 617 permits a custom rule_id override. This can mislead API consumers about guaranteed behavior.

Suggested doc-only fix
-// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
-// It powers off compute, pauses the per-machine power-on gate so the power
-// manager will not automatically bring it back up, then powers off the
-// remaining components (NVLSwitch and PowerShelf).
+// BringDownRackRequest is the symmetric counterpart to BringUpRackRequest.
+// By default, bring-down powers off compute, pauses the per-machine power-on
+// gate so the power manager will not automatically bring it back up, then
+// powers off remaining components (NVLSwitch and PowerShelf).
+// If rule_id is provided, execution follows the selected rule.
 message BringDownRackRequest {

As per coding guidelines **/*.proto: “Review the Protobuf definitions, point out issues relative to compatibility, and expressiveness.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/proto/v1/rla.proto` around lines 610 - 618, Update the
BringDownRackRequest message comment to explicitly state that the documented
sequence (power off compute, pause per-machine power-on gate, then power off
NVLSwitch and PowerShelf) is the default behavior applied when no rule override
is provided, and that supplying rule_id will override rule resolution and may
alter that sequence; reference the fields OperationTargetSpec target_spec,
string description, and optional UUID rule_id in the comment so API consumers
understand default vs overridden bring-down behavior.

@github-actions

github-actions Bot commented May 4, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-04 18:05:49 UTC | Commit: 99ae4f1

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rla/internal/service/server_impl.go`:
- Around line 751-754: Update the top-of-function comment for BringDownRack to
remove the incorrect statement that it “reuses the BringUp workflow” and instead
describe that this RPC invokes the new dedicated BringDown workflow; mention the
key steps performed (power off compute, pause per-machine power-on gate to
prevent auto-revive, then power off NVLSwitch and PowerShelf) and reference the
BringDown workflow by name so readers know this function maps to that workflow
rather than BringUp.

In `@rla/internal/task/operationrules/resolver_defaults.go`:
- Around line 1299-1428: The RuleDefinition default sequence for bring-down can
exceed the TaskTypeBringDown workflow timeout (60m); update either the workflow
timeout in options.go or reduce stage/retry/poll budgets so the worst-case sum
fits under the workflow window. Concretely, adjust RuleDefinition
(CurrentRuleDefinitionVersion) SequenceSteps timeouts/retry policies and
ActionConfig polling windows — e.g. lower Stage 3/5/6 Timeout and
RetryPolicy.MaxAttempts, reduce PostOperation VerifyPowerStatus Timeout and
PollInterval and the RetryPolicy.InitialInterval/BackoffCoefficient for stages
that use
ActionPowerControl/ActionVerifyPowerStatus/ActionVerifyNoInstance/ActionEnterMaintenance/ActionPausePowerOnGate
so the aggregated worst-case duration is safely below TaskTypeBringDown;
alternatively increase the TaskTypeBringDown timeout to a value greater than the
calculated worst-case sum.

In `@rla/internal/task/operations/operations.go`:
- Around line 212-253: New() currently lacks a branch for
taskcommon.TaskTypeBringDown, so generic deserialization returns "unsupported
task type: bring_down"; update the factory in New() to handle TaskTypeBringDown
by adding a switch/case (or equivalent) that instantiates a BringDownTaskInfo,
unmarshals into it, validates it, and returns it (use the existing pattern from
other cases for creation, calling BringDownTaskInfo.Unmarshal/Validate and
returning the resulting Task instance).
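
The missing New() branch described above could follow the existing per-type pattern; this is a simplified sketch with stand-in types, not the repository's actual factory:

```go
package main

import (
	"encoding/json"
	"fmt"
)

const TaskTypeBringDown = "bring_down"

// BringDownTaskInfo is a simplified stand-in for the real payload.
type BringDownTaskInfo struct {
	RuleID string `json:"rule_id,omitempty"`
}

// New deserializes a task payload by task type. Without a
// TaskTypeBringDown case, generic deserialization would fall through
// to the "unsupported task type" error the review flags.
func New(taskType string, raw []byte) (any, error) {
	switch taskType {
	case TaskTypeBringDown:
		info := &BringDownTaskInfo{}
		if err := json.Unmarshal(raw, info); err != nil {
			return nil, fmt.Errorf("unmarshal bring-down info: %w", err)
		}
		return info, nil
	default:
		return nil, fmt.Errorf("unsupported task type: %s", taskType)
	}
}

func main() {
	info, err := New(TaskTypeBringDown, []byte(`{"rule_id":"r-1"}`))
	fmt.Println(err == nil, info.(*BringDownTaskInfo).RuleID) // true r-1
}
```

In the real code the new case would also call the payload's Validate method before returning, mirroring the other branches.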
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 64302ded-b3ec-41ad-bc9e-cd88b6f0b80d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b39878 and 99ae4f1.

⛔ Files ignored due to path filters (2)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • rla/pkg/proto/v1/rla_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (25)
  • rla/internal/carbideapi/grpc.go
  • rla/internal/carbideapi/mock.go
  • rla/internal/carbideapi/mod.go
  • rla/internal/scheduler/taskschedule/template.go
  • rla/internal/service/server_impl.go
  • rla/internal/task/common/common.go
  • rla/internal/task/common/common_test.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/componentmanager/componentmanager.go
  • rla/internal/task/componentmanager/compute/carbide/carbide.go
  • rla/internal/task/componentmanager/mock/mock.go
  • rla/internal/task/conflict/conflict.go
  • rla/internal/task/executor/temporalworkflow/activity/activity.go
  • rla/internal/task/executor/temporalworkflow/activity/registry.go
  • rla/internal/task/executor/temporalworkflow/activity/registry_test.go
  • rla/internal/task/executor/temporalworkflow/workflow/actions.go
  • rla/internal/task/executor/temporalworkflow/workflow/bringdown.go
  • rla/internal/task/executor/temporalworkflow/workflow/registry_test.go
  • rla/internal/task/operationrules/resolver_defaults.go
  • rla/internal/task/operationrules/rules.go
  • rla/internal/task/operationrules/sequences.go
  • rla/internal/task/operationrules/sequences_test.go
  • rla/internal/task/operations/operations.go
  • rla/internal/task/operations/options.go
  • rla/proto/v1/rla.proto
✅ Files skipped from review due to trivial changes (4)
  • rla/internal/carbideapi/mod.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/executor/temporalworkflow/activity/registry.go
  • rla/internal/task/operationrules/rules.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • rla/internal/task/operationrules/sequences_test.go
  • rla/internal/task/executor/temporalworkflow/activity/registry_test.go
  • rla/proto/v1/rla.proto

Comment thread rla/internal/task/operationrules/resolver_defaults.go Outdated
Comment on lines +1299 to +1428
RuleDefinition: RuleDefinition{
Version: CurrentRuleDefinitionVersion,
Steps: []SequenceStep{
// === Stage 1: Compute — verify no allocated instance ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 1,
MaxParallel: 0,
Timeout: 5 * time.Minute,
// No retry: a "machine still has an instance" result
// is not transient; retrying just delays the failure.
MainOperation: ActionConfig{
Name: ActionVerifyNoInstance,
},
},
// === Stage 2: Compute — enter maintenance ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 2,
MaxParallel: 0,
Timeout: 5 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionEnterMaintenance,
},
},
// === Stage 3: Compute — power off, verify ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 3,
MaxParallel: 0,
Timeout: 20 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 1 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPowerControl,
Parameters: map[string]any{
ParamOperation: "power_off",
},
},
PostOperation: []ActionConfig{
{
Name: ActionVerifyPowerStatus,
Timeout: 10 * time.Minute,
PollInterval: 15 * time.Second,
Parameters: map[string]any{
ParamExpectedStatus: "off",
},
},
},
},
// === Stage 4: Compute — pause power-on gate ===
{
ComponentType: devicetypes.ComponentTypeCompute,
Stage: 4,
MaxParallel: 0,
Timeout: 5 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPausePowerOnGate,
},
},
// === Stage 5: NVLSwitch — power off, verify ===
{
ComponentType: devicetypes.ComponentTypeNVLSwitch,
Stage: 5,
MaxParallel: 0,
Timeout: 15 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPowerControl,
Parameters: map[string]any{
ParamOperation: "power_off",
},
},
PostOperation: []ActionConfig{
{
Name: ActionVerifyPowerStatus,
Timeout: 10 * time.Minute,
PollInterval: 15 * time.Second,
Parameters: map[string]any{
ParamExpectedStatus: "off",
},
},
},
},
// === Stage 6: PowerShelf — power off, verify ===
{
ComponentType: devicetypes.ComponentTypePowerShelf,
Stage: 6,
MaxParallel: 0,
Timeout: 15 * time.Minute,
RetryPolicy: &RetryPolicy{
MaxAttempts: 3,
InitialInterval: 5 * time.Second,
BackoffCoefficient: 2.0,
},
MainOperation: ActionConfig{
Name: ActionPowerControl,
Parameters: map[string]any{
ParamOperation: "power_off",
},
},
PostOperation: []ActionConfig{
{
Name: ActionVerifyPowerStatus,
Timeout: 10 * time.Minute,
PollInterval: 15 * time.Second,
Parameters: map[string]any{
ParamExpectedStatus: "off",
},
},
},
},
},

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bring-down timeout budget can exceed the workflow timeout window.

The new default sequence can run beyond the configured TaskTypeBringDown workflow timeout (60m in rla/internal/task/operations/options.go), especially with per-step retries + verify polling. This risks timing out healthy runs mid-bring-down and leaving partial state.

Please align budgets before merge: either increase the workflow timeout for bring-down, or reduce per-stage/per-action timeout+retry envelopes so worst-case execution stays safely under the workflow timeout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rla/internal/task/operationrules/resolver_defaults.go` around lines 1299 -
1428, The RuleDefinition default sequence for bring-down can exceed the
TaskTypeBringDown workflow timeout (60m); update either the workflow timeout in
options.go or reduce stage/retry/poll budgets so the worst-case sum fits under
the workflow window. Concretely, adjust RuleDefinition
(CurrentRuleDefinitionVersion) SequenceSteps timeouts/retry policies and
ActionConfig polling windows — e.g. lower Stage 3/5/6 Timeout and
RetryPolicy.MaxAttempts, reduce PostOperation VerifyPowerStatus Timeout and
PollInterval and the RetryPolicy.InitialInterval/BackoffCoefficient for stages
that use
ActionPowerControl/ActionVerifyPowerStatus/ActionVerifyNoInstance/ActionEnterMaintenance/ActionPausePowerOnGate
so the aggregated worst-case duration is safely below TaskTypeBringDown;
alternatively increase the TaskTypeBringDown timeout to a value greater than the
calculated worst-case sum.

Comment thread rla/internal/task/operations/operations.go
@kunzhao-nv kunzhao-nv requested a review from jw-nvidia May 4, 2026 18:18

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rla/internal/task/operationrules/resolver_defaults.go`:
- Around line 1298-1324: Stage 1 (ActionVerifyNoInstance) proves emptiness only
briefly so add a second authoritative verification after maintenance: insert a
new Stage (after the ActionEnterMaintenance stage) that uses ComponentType
compute with MainOperation set to ActionVerifyNoInstance, a short Timeout (e.g.,
5m) and no retries so the maintenance lockout and allocator lock are in place
before any subsequent power actions; reference the existing symbols
ActionVerifyNoInstance and ActionEnterMaintenance to locate where to add this
stage and ensure stage ordering guarantees the extra check runs before any
power-off operations.
- Around line 1325-1366: The pause gate (ActionPausePowerOnGate) must be applied
before powering off compute to avoid a reconcile re-power window; modify the
stages so the pause gate runs prior to the power_off+verify sequence (or if you
keep it as its own Stage 4, insert a follow-up VerifyPowerStatus(off)
ActionConfig after ActionPausePowerOnGate) — add an ActionVerifyPowerStatus with
Parameters {ParamExpectedStatus: "off"} (as a PostOperation or next-stage
MainOperation) to ensure the node is still off before proceeding to Stage 5;
update the RetryPolicy/Timeout blocks around ActionPausePowerOnGate and
ActionVerifyPowerStatus as needed to match existing patterns.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 41c6fbc1-a2ea-4c68-afbb-34a6b282b3d2

📥 Commits

Reviewing files that changed from the base of the PR and between 99ae4f1 and fe0692c.

⛔ Files ignored due to path filters (2)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
  • rla/pkg/proto/v1/rla_grpc.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (2)
  • rla/internal/service/server_impl.go
  • rla/internal/task/operationrules/resolver_defaults.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • rla/internal/service/server_impl.go

Comment on lines +1298 to +1324
				// === Stage 1: Compute — verify no allocated instance ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         1,
					MaxParallel:   0,
					Timeout:       5 * time.Minute,
					// No retry: a "machine still has an instance" result
					// is not transient; retrying just delays the failure.
					MainOperation: ActionConfig{
						Name: ActionVerifyNoInstance,
					},
				},
				// === Stage 2: Compute — enter maintenance ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         2,
					MaxParallel:   0,
					Timeout:       5 * time.Minute,
					RetryPolicy: &RetryPolicy{
						MaxAttempts:        3,
						InitialInterval:    5 * time.Second,
						BackoffCoefficient: 2.0,
					},
					MainOperation: ActionConfig{
						Name: ActionEnterMaintenance,
					},
				},

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Close the verify/maintenance TOCTOU before any power action.

Stage 1 proves the compute is empty only at that instant. Because allocator lockout is not applied until Stage 2, a new instance can still land in the gap and Stage 3 can power off a machine that has become allocated.

Suggested minimal fix
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         2,
 					MaxParallel:   0,
 					Timeout:       5 * time.Minute,
 					RetryPolicy: &RetryPolicy{
 						MaxAttempts:        3,
 						InitialInterval:    5 * time.Second,
 						BackoffCoefficient: 2.0,
 					},
 					MainOperation: ActionConfig{
 						Name: ActionEnterMaintenance,
 					},
+					PostOperation: []ActionConfig{
+						{
+							Name: ActionVerifyNoInstance,
+						},
+					},
 				},

If you want to keep the early fail-fast behavior, a second VerifyNoInstance after maintenance is the smallest way to make the check authoritative.
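As a toy model of the race, here is a minimal in-process sketch; the `allocator` type and its methods are invented, and the mutex stands in for the distributed health-report lockout the real system uses:

```go
package main

import (
	"fmt"
	"sync"
)

// allocator models the check-then-act race the review flags: a verify that
// runs before the maintenance lockout can be invalidated before power-off.
type allocator struct {
	mu        sync.Mutex
	allocated map[string]bool
	locked    map[string]bool
}

// Allocate models instance placement; the maintenance lockout rejects it.
func (a *allocator) Allocate(id string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.locked[id] {
		return false
	}
	a.allocated[id] = true
	return true
}

// BringDown mirrors the fixed sequence: verify fast, apply the lockout,
// then verify again under the lockout so the second check is authoritative.
func (a *allocator) BringDown(id string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.allocated[id] { // Stage 1: fail fast
		return fmt.Errorf("%s still has an instance", id)
	}
	a.locked[id] = true // Stage 2: enter maintenance
	if a.allocated[id] { // post-op re-check closes the TOCTOU gap
		return fmt.Errorf("%s was allocated during bring-down", id)
	}
	return nil
}

func main() {
	a := &allocator{allocated: map[string]bool{}, locked: map[string]bool{}}
	fmt.Println(a.BringDown("node-1"))
	fmt.Println(a.Allocate("node-1")) // rejected: lockout holds
}
```

The point of the PostOperation re-check is exactly the second `allocated` test above: it runs only after the lockout is in force, so its answer cannot be invalidated by a later placement.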

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 				// === Stage 1: Compute — verify no allocated instance ===
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         1,
 					MaxParallel:   0,
 					Timeout:       5 * time.Minute,
 					// No retry: a "machine still has an instance" result
 					// is not transient; retrying just delays the failure.
 					MainOperation: ActionConfig{
 						Name: ActionVerifyNoInstance,
 					},
 				},
 				// === Stage 2: Compute — enter maintenance ===
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         2,
 					MaxParallel:   0,
 					Timeout:       5 * time.Minute,
 					RetryPolicy: &RetryPolicy{
 						MaxAttempts:        3,
 						InitialInterval:    5 * time.Second,
 						BackoffCoefficient: 2.0,
 					},
 					MainOperation: ActionConfig{
 						Name: ActionEnterMaintenance,
 					},
+					PostOperation: []ActionConfig{
+						{
+							Name: ActionVerifyNoInstance,
+						},
+					},
 				},

Comment on lines +1325 to +1366
				// === Stage 3: Compute — power off, verify ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         3,
					MaxParallel:   0,
					Timeout:       20 * time.Minute,
					RetryPolicy: &RetryPolicy{
						MaxAttempts:        3,
						InitialInterval:    1 * time.Second,
						BackoffCoefficient: 2.0,
					},
					MainOperation: ActionConfig{
						Name: ActionPowerControl,
						Parameters: map[string]any{
							ParamOperation: "power_off",
						},
					},
					PostOperation: []ActionConfig{
						{
							Name:         ActionVerifyPowerStatus,
							Timeout:      10 * time.Minute,
							PollInterval: 15 * time.Second,
							Parameters: map[string]any{
								ParamExpectedStatus: "off",
							},
						},
					},
				},
				// === Stage 4: Compute — pause power-on gate ===
				{
					ComponentType: devicetypes.ComponentTypeCompute,
					Stage:         4,
					MaxParallel:   0,
					Timeout:       5 * time.Minute,
					RetryPolicy: &RetryPolicy{
						MaxAttempts:        3,
						InitialInterval:    5 * time.Second,
						BackoffCoefficient: 2.0,
					},
					MainOperation: ActionConfig{
						Name: ActionPausePowerOnGate,
					},

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Pause the power-on gate before compute shutdown.

ActionPausePowerOnGate is the safeguard against automatic repower, but it currently runs only after compute has been powered off and verified off. That leaves a real window where a reconcile can turn the node back on, and the workflow never re-checks it before powering off the NVL switch and power shelf.

Suggested direction
-				// === Stage 3: Compute — power off, verify ===
+				// === Stage 3: Compute — pause gate, power off, verify ===
 				{
 					ComponentType: devicetypes.ComponentTypeCompute,
 					Stage:         3,
 					MaxParallel:   0,
 					Timeout:       20 * time.Minute,
 					RetryPolicy: &RetryPolicy{
 						MaxAttempts:        3,
 						InitialInterval:    1 * time.Second,
 						BackoffCoefficient: 2.0,
 					},
+					PreOperation: []ActionConfig{
+						{
+							Name: ActionPausePowerOnGate,
+						},
+					},
 					MainOperation: ActionConfig{
 						Name: ActionPowerControl,
 						Parameters: map[string]any{
 							ParamOperation: "power_off",
 						},
 					},
 					PostOperation: []ActionConfig{
 						{
 							Name:         ActionVerifyPowerStatus,
 							Timeout:      10 * time.Minute,
 							PollInterval: 15 * time.Second,
 							Parameters: map[string]any{
 								ParamExpectedStatus: "off",
 							},
 						},
 					},
 				},
-				// === Stage 4: Compute — pause power-on gate ===
-				{
-					ComponentType: devicetypes.ComponentTypeCompute,
-					Stage:         4,
-					MaxParallel:   0,
-					Timeout:       5 * time.Minute,
-					RetryPolicy: &RetryPolicy{
-						MaxAttempts:        3,
-						InitialInterval:    5 * time.Second,
-						BackoffCoefficient: 2.0,
-					},
-					MainOperation: ActionConfig{
-						Name: ActionPausePowerOnGate,
-					},
-				},

If you keep a separate stage, it should at least be followed by another VerifyPowerStatus(off) before Stage 5.
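The ordering argument can be shown with a small sketch; `node` and `reconcile` are invented stand-ins for the power-on gate and the reconcile loop, not RLA code:

```go
package main

import "fmt"

// node models the two bits of state that matter for the race: whether the
// machine is powered on and whether its power-on gate is paused.
type node struct {
	poweredOn  bool
	gatePaused bool
}

// reconcile models the automatic repower the review warns about: a stale
// pass powers the node back on unless the gate is already paused.
func reconcile(n *node) {
	if !n.poweredOn && !n.gatePaused {
		n.poweredOn = true
	}
}

func main() {
	// Current ordering: power off (Stage 3), then pause gate (Stage 4).
	racy := &node{poweredOn: true}
	racy.poweredOn = false // Stage 3 power_off + verify succeeds
	reconcile(racy)        // reconcile races in before Stage 4
	fmt.Println("repowered without gate:", racy.poweredOn)

	// Suggested ordering: pause gate first, then power off.
	safe := &node{poweredOn: true}
	safe.gatePaused = true // pre-op pause
	safe.poweredOn = false
	reconcile(safe)
	fmt.Println("repowered with gate:", safe.poweredOn)
}
```

With the gate paused as a pre-operation, the reconcile branch can never fire, which is why moving `ActionPausePowerOnGate` ahead of the power-off removes the window rather than merely shrinking it.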



@jw-nvidia jw-nvidia left a comment


Between stages 1 and 2, a machine can be allocated again.
Between stages 3 and 4, a machine can be powered back on.
Stages [1 and 2] and [3 and 4] should each be atomic operations.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
rla/internal/task/operationrules/resolver_defaults.go (2)

1297-1323: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Re-check compute emptiness after maintenance.

Stage 1 only proves the node is empty before allocator lockout. A new instance can land between Line 1305 and Line 1321, and Stage 3 would still power that machine off. Add a second VerifyNoInstance after ActionEnterMaintenance so the authoritative check happens under maintenance.


1324-1365: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Pause the power-on gate before compute shutdown.

With ActionPausePowerOnGate delayed until Stage 4, a reconcile can repower compute after Line 1335 succeeds and before Line 1363 runs. The workflow never re-verifies compute is still off before proceeding to the switch and shelf. Move the gate pause into Stage 3 pre-ops, or at minimum add another "off" verification after Stage 4.

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rla/internal/task/operations/operations.go`:
- Around line 81-86: The bring-down task branch currently unmarshals into
BringDownTaskInfo but doesn't validate rule_id; update
BringDownTaskInfo.Validate() to reject malformed rule_id by calling
ExtractRuleID(info.RuleID) and returning an error if it returns nil, and then
call BringDownTaskInfo.Validate() in the TaskTypeBringDown case (before
returning) to ensure malformed UUIDs are rejected; apply the same validation
call to the other bring-down handling branch referenced in the comment so both
paths enforce rule_id validation.

In `@rla/internal/task/operations/types.go`:
- Around line 108-114: Add an explicit legacy constant for the reserved value
(e.g., FirmwareOperationRollback = 3) and update FirmwareOperation.CodeString()
so it maps only known values to their correct strings and does NOT fall back to
"upgrade" for unmapped/legacy values; instead return an explicit "rollback" (for
the legacy constant) or an "unsupported"/error indicator for any other unknown
values, and ensure callers validate/reject FirmwareOperationRollback during
input/validation paths rather than treating it as an upgrade.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5bfb9d32-6b61-457d-a086-c45096ee910e

📥 Commits

Reviewing files that changed from the base of the PR and between fe0692c and 9715fc1.

⛔ Files ignored due to path filters (1)
  • rla/pkg/proto/v1/rla.pb.go is excluded by !**/*.pb.go, !**/*.pb.go
📒 Files selected for processing (13)
  • rla/internal/scheduler/taskschedule/template.go
  • rla/internal/scheduler/taskschedule/template_test.go
  • rla/internal/task/common/operation_codes.go
  • rla/internal/task/operationrules/loader_test.go
  • rla/internal/task/operationrules/resolver.go
  • rla/internal/task/operationrules/resolver_defaults.go
  • rla/internal/task/operationrules/sequences.go
  • rla/internal/task/operationrules/sequences_test.go
  • rla/internal/task/operations/operations.go
  • rla/internal/task/operations/options.go
  • rla/internal/task/operations/types.go
  • rla/internal/task/operations/types_test.go
  • rla/proto/v1/rla.proto
💤 Files with no reviewable changes (3)
  • rla/internal/task/operationrules/resolver.go
  • rla/internal/task/operations/types_test.go
  • rla/internal/scheduler/taskschedule/template_test.go
✅ Files skipped from review due to trivial changes (1)
  • rla/internal/task/operations/options.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • rla/proto/v1/rla.proto

Comment on lines +81 to +86
	case taskcommon.TaskTypeBringDown:
		var taskInfo BringDownTaskInfo
		if err := json.Unmarshal(info, &taskInfo); err != nil {
			return nil, fmt.Errorf("failed to unmarshal bring-down task info: %w", err) //nolint
		}
		return &taskInfo, nil

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate rule_id before accepting bring-down payloads.

ExtractRuleID() returns nil for an unparsable UUID, so a typo here silently drops the override and falls back to the hardcoded bring-down rule. Please reject malformed rule_id in BringDownTaskInfo.Validate() and call it from the new TaskTypeBringDown branch before returning.

Suggested fix
 	case taskcommon.TaskTypeBringDown:
 		var taskInfo BringDownTaskInfo
 		if err := json.Unmarshal(info, &taskInfo); err != nil {
 			return nil, fmt.Errorf("failed to unmarshal bring-down task info: %w", err) //nolint
 		}
+		if err := taskInfo.Validate(); err != nil {
+			return nil, fmt.Errorf("invalid bring-down task info: %w", err)
+		}
 		return &taskInfo, nil
 func (t *BringDownTaskInfo) Validate() error {
+	if t.RuleID != "" {
+		if _, err := uuid.Parse(t.RuleID); err != nil {
+			return fmt.Errorf("invalid rule_id: %w", err)
+		}
+	}
 	return nil
 }

Also applies to: 218-247
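A stdlib-only sketch of the suggested `Validate()`; the real fix would presumably call `uuid.Parse` from github.com/google/uuid, so the regex here is only an illustrative approximation of the RFC 4122 text form:

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidRe approximates the 8-4-4-4-12 hex UUID text form. Illustrative only;
// prefer a real UUID parser in production code.
var uuidRe = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

type BringDownTaskInfo struct {
	RuleID string `json:"rule_id"`
}

// Validate rejects a malformed rule_id instead of silently falling back to
// the default bring-down rule; an empty rule_id means "use the default" and
// is allowed.
func (t *BringDownTaskInfo) Validate() error {
	if t.RuleID != "" && !uuidRe.MatchString(t.RuleID) {
		return fmt.Errorf("invalid rule_id %q", t.RuleID)
	}
	return nil
}

func main() {
	fmt.Println((&BringDownTaskInfo{RuleID: "not-a-uuid"}).Validate())
	fmt.Println((&BringDownTaskInfo{}).Validate())
}
```

Calling this from the `TaskTypeBringDown` branch turns a typoed override into a hard error at unmarshal time rather than a silent fallback.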


Comment on lines 108 to +114
 const (
-	FirmwareOperationUnknown FirmwareOperation = iota
-	FirmwareOperationUpgrade
-	FirmwareOperationDowngrade
-	FirmwareOperationRollback
-	FirmwareOperationVersion
+	FirmwareOperationUnknown   FirmwareOperation = 0
+	FirmwareOperationUpgrade   FirmwareOperation = 1
+	FirmwareOperationDowngrade FirmwareOperation = 2
+	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
+	// values for FirmwareOperationVersion remain stable.
+	FirmwareOperationVersion FirmwareOperation = 4

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not let legacy rollback values serialize as upgrade.

Reserving enum value 3 suggests old persisted rollback values can still be read, but FirmwareOperation.CodeString() still falls back to "upgrade" for any unmapped value. A stored FirmwareOperation(3) will now be misrouted into the upgrade path instead of failing as unsupported.

Suggested direction
 const (
 	FirmwareOperationUnknown   FirmwareOperation = 0
 	FirmwareOperationUpgrade   FirmwareOperation = 1
 	FirmwareOperationDowngrade FirmwareOperation = 2
-	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
-	// values for FirmwareOperationVersion remain stable.
+	// 3 was previously FirmwareOperationRollback; kept reserved so persisted
+	// values remain distinguishable from supported operations.
+	FirmwareOperationLegacyRollback FirmwareOperation = 3
 	FirmwareOperationVersion FirmwareOperation = 4
 )
 func (o FirmwareOperation) CodeString() string {
 	if code, ok := firmwareOperationCodes[o]; ok {
 		return code
 	}
-	return taskcommon.OpCodeFirmwareControlUpgrade // Default fallback
+	return ""
 }

If callers still need to detect old rollback records explicitly, keep the legacy constant and reject it during validation rather than defaulting to upgrade.
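A sketch of that direction, with `CodeString` surfacing an error instead of defaulting to upgrade; the constant values mirror the diff above, while the code strings and the error-returning signature are illustrative assumptions, not the actual RLA API:

```go
package main

import "fmt"

type FirmwareOperation int

const (
	FirmwareOperationUnknown        FirmwareOperation = 0
	FirmwareOperationUpgrade        FirmwareOperation = 1
	FirmwareOperationDowngrade      FirmwareOperation = 2
	FirmwareOperationLegacyRollback FirmwareOperation = 3 // reserved, no longer supported
	FirmwareOperationVersion        FirmwareOperation = 4
)

// firmwareOperationCodes maps only supported operations; the legacy rollback
// value is deliberately absent so it cannot be misrouted.
var firmwareOperationCodes = map[FirmwareOperation]string{
	FirmwareOperationUpgrade:   "firmware_upgrade",
	FirmwareOperationDowngrade: "firmware_downgrade",
	FirmwareOperationVersion:   "firmware_version",
}

// CodeString returns the code for known operations and an error for
// unknown or legacy values instead of falling back to upgrade.
func (o FirmwareOperation) CodeString() (string, error) {
	if code, ok := firmwareOperationCodes[o]; ok {
		return code, nil
	}
	return "", fmt.Errorf("unsupported firmware operation %d", o)
}

func main() {
	if _, err := FirmwareOperationLegacyRollback.CodeString(); err != nil {
		fmt.Println(err) // a persisted rollback record now fails loudly
	}
	code, _ := FirmwareOperationUpgrade.CodeString()
	fmt.Println(code)
}
```

With this shape, a stored `FirmwareOperation(3)` is rejected at the call site rather than quietly executed as an upgrade.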

