
fix: Clean up related components on Site deletion#490

Open
nvlitagaki wants to merge 9 commits intoNVIDIA:mainfrom
nvlitagaki:fix/site-clean-up

Conversation

@nvlitagaki
Contributor

@nvlitagaki nvlitagaki commented May 6, 2026

Description

Ensure that when a site is deleted, we also delete any records that reference it.

Also clean up the database to handle any records that were orphaned when sites were deleted previously.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

Review Change Stack

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced site deletion process to thoroughly remove all associated resources including network interfaces, virtual network components, and expected inventory records. Added automatic migration to clean up orphaned data from previous deletion operations.
  • Tests

    • Added comprehensive test coverage for enhanced site cleanup operations and new test utilities for resource creation.

Walkthrough

Adds bulk soft-delete DAO methods for Interface/InfiniBand/NVLink, implements batching and tracing, adds unit/integration tests, introduces a migration to clean orphan site components, expands site purge orchestration to remove many site-scoped resources, and adds test builders and schema resets.
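
The batching mentioned in the walkthrough can be pictured with a small, self-contained sketch. It is not the PR's code: `maxBatchItems`, `chunkIDs`, and `deleteInBatches` are hypothetical names, string IDs stand in for UUIDs, and the `exec` callback stands in for whatever statement the DAO actually runs per batch.

```go
package main

import "fmt"

// maxBatchItems is a hypothetical stand-in for the db.MaxBatchItems limit.
const maxBatchItems = 3

// chunkIDs splits ids into consecutive slices of at most size elements.
// A non-positive size is treated as "everything in one batch".
func chunkIDs(ids []string, size int) [][]string {
	if size <= 0 {
		size = len(ids)
	}
	var batches [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

// deleteInBatches calls exec once per batch and stops at the first error.
// An empty input is a no-op: exec is never called.
func deleteInBatches(ids []string, size int, exec func(batch []string) error) error {
	for _, batch := range chunkIDs(ids, size) {
		if err := exec(batch); err != nil {
			return fmt.Errorf("deleting batch of %d ids: %w", len(batch), err)
		}
	}
	return nil
}

func main() {
	ids := []string{"a", "b", "c", "d", "e", "f", "g"}
	calls := 0
	err := deleteInBatches(ids, maxBatchItems, func(batch []string) error {
		calls++
		// Placeholder for the per-batch soft-delete statement.
		fmt.Println("would soft-delete interfaces for instances:", batch)
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Returning the first error keeps the caller's handling simple, at the cost of leaving earlier batches applied unless the caller runs the whole loop inside a transaction.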

Changes

Site Cleanup Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| DAO Interfaces: db/pkg/db/model/interface.go, db/pkg/db/model/infinibandinterface.go, db/pkg/db/model/nvlinkinterface.go | Adds DeleteAllByInstanceIDs and DeleteAllBySiteID methods to Interface, InfiniBandInterface, and NVLinkInterface DAO interfaces. |
| DAO Implementations: db/pkg/db/model/interface.go, db/pkg/db/model/infinibandinterface.go, db/pkg/db/model/nvlinkinterface.go | Implements bulk soft-delete logic with tracing; Interface deletion batches by db.MaxBatchItems, InfiniBand/NVLink delete by site_id. |
| DAO Unit Tests: db/pkg/db/model/interface_test.go, db/pkg/db/model/infinibandinterface_test.go, db/pkg/db/model/nvlinkinterface_test.go | Tests verify multi-site scoping, soft-delete semantics, no-op behavior for empty targets, and OTEL span propagation; updates an existing batch-size error assertion. |
| Orphan Cleanup Migration: db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go | New migration soft-deletes orphan interfaces and other site-scoped soft-deletable rows, and hard-deletes specific orphan tables inside a transaction. |
| Site Purge Activity: workflow/pkg/activity/site/site.go | Expands DeleteSiteComponentsFromDB to initialize additional DAOs, collect instance IDs, bulk-delete ethernet/InfiniBand/NVLink interfaces, and delete VPC prefixes/peerings, NVLink partitions, SSH key associations, NSGs, DPU deployments, SKUs, expected_* records, and OS associations before removing the Site. |
| Test Utilities & Schema: workflow/pkg/util/testing.go | Extends TestSetupSchema to reset additional FK tables and adds new test builders: TestBuildVpcPeering, TestBuildNetworkSecurityGroup, TestBuildExpectedMachine, TestBuildExpectedSwitch, TestBuildExpectedPowerShelf, TestBuildSku. |
| Integration Test: workflow/pkg/activity/site/site_test.go | TestManageSite_DeleteSiteComponentsFromDB_NewResources constructs paired site graphs and asserts comprehensive removal of site-1 resources while leaving site-2 intact. |

Sequence Diagram

sequenceDiagram
    actor Caller
    participant Activity as DeleteSiteComponentsFromDB Activity
    participant InstanceDAO
    participant InterfaceDAO as Interface DAO
    participant InfiniBandDAO
    participant NVLinkDAO
    participant VPCDAO
    participant OtherDAOs
    participant DB as Database

    Caller->>Activity: DeleteSiteComponentsFromDB(siteID)
    Activity->>InstanceDAO: Load instances by site
    InstanceDAO->>DB: SELECT instances WHERE site_id = ?
    DB-->>InstanceDAO: instances
    InstanceDAO-->>Activity: instance list

    Activity->>InterfaceDAO: DeleteAllByInstanceIDs(instanceIDs)
    InterfaceDAO->>DB: UPDATE interfaces SET deleted WHERE instance_id IN (?)
    DB-->>InterfaceDAO: deleted count

    Activity->>InfiniBandDAO: DeleteAllBySiteID(siteID)
    InfiniBandDAO->>DB: UPDATE ib_interfaces SET deleted WHERE site_id = ?
    DB-->>InfiniBandDAO: deleted count

    Activity->>NVLinkDAO: DeleteAllBySiteID(siteID)
    NVLinkDAO->>DB: UPDATE nvlink_interfaces SET deleted WHERE site_id = ?
    DB-->>NVLinkDAO: deleted count

    Activity->>VPCDAO: Delete VPC prefixes & peerings for site
    VPCDAO->>DB: DELETE/UPDATE vpc_prefixes, vpc_peerings
    DB-->>VPCDAO: success

    Activity->>OtherDAOs: Delete NVLink partitions, SSH key associations, NSGs, DPU deployments, SKUs, expected_*, OS associations
    OtherDAOs->>DB: DELETE/UPDATE respective tables
    DB-->>OtherDAOs: success

    Activity->>DB: DELETE/UPDATE site WHERE id = ?
    DB-->>Activity: success
    Activity-->>Caller: success
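
The ordering in the diagram (dependent rows first, the Site row last, stopping at the first failure) can be sketched as an ordered list of steps. This is a simplified illustration, not the activity's implementation: `cleanupStep`, `runCleanup`, and `errSimulated` are hypothetical names, and the step bodies are placeholders for the DAO calls.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated is a placeholder failure used only by this sketch.
var errSimulated = errors.New("simulated failure")

// cleanupStep pairs a human-readable name with the work to perform.
type cleanupStep struct {
	name string
	run  func() error
}

// runCleanup executes steps in order and stops at the first error.
// It returns the names of the steps that completed before the failure.
func runCleanup(steps []cleanupStep) ([]string, error) {
	done := make([]string, 0, len(steps))
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }
	steps := []cleanupStep{
		{"interfaces", ok},
		{"infiniband interfaces", ok},
		{"nvlink interfaces", ok},
		{"vpc prefixes and peerings", func() error { return errSimulated }},
		{"site", ok}, // the parent row is always last
	}
	done, err := runCleanup(steps)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

Because the steps in this sketch are independent calls, a failure midway leaves the earlier deletions applied and the Site row untouched, so a retry has to tolerate rows that are already gone.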

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the primary purpose of the changeset: adding comprehensive cleanup of site-related database components when a site is deleted. |
| Description check | ✅ Passed | The description clearly explains the motivation and scope of changes: ensuring deletion of records that reference a site and cleaning up orphaned records from previous deletions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@nvlitagaki
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@db/pkg/db/model/interface.go`:
- Around line 501-523: DeleteAllByInstanceIDs currently issues a single
unbounded WHERE "ifc.instance_id IN (?)" with all instanceIDs which can hit DB
parameter limits and cause latency; modify
InterfaceSQLDAO.DeleteAllByInstanceIDs to chunk instanceIDs into reasonably
sized batches (e.g., 500–1000 IDs) and run the
NewDelete().Model((*Interface)(nil)).Where("ifc.instance_id IN (?)",
bun.In(batch)).Exec(ctx) for each batch in a loop, returning the first non-nil
error; preserve the tracer span and update the "instance_id_count" attribute if
desired, and no-op early when len(instanceIDs)==0.

In `@workflow/pkg/activity/site/site.go`:
- Around line 86-110: The cleanup sequence creates many DAO instances (vpDAO,
subnetDAO, vpfxDAO, ..., emDAO) but executes deletes with tx=nil so failures
leave partial state; wrap the entire expanded site cleanup in a single DB
transaction by beginning a tx from mst.dbSession (e.g., tx :=
mst.dbSession.Begin()/BeginTx()), pass that tx into DAO methods/constructors or
use DAO methods that accept a tx instead of nil, ensure every delete call uses
that tx, and only call tx.Commit() after the final site delete succeeds while
calling tx.Rollback() on any error; apply the same transaction threading to the
other cleanup blocks referenced (lines ~191-211 and ~286-555) so all deletions
are atomic.
- Around line 488-503: The code builds allocationIDs (variable allocationIDs)
even when there are no allocations and then calls acDAO.GetAll(...), relying on
the DAO to handle an empty slice; add an explicit guard after the deletion loop
that checks len(allocationIDs) == 0 and returns immediately (e.g., return nil or
the appropriate success value) to skip the AllocationConstraint lookup and avoid
depending on acDAO.GetAll() handling of empty slices; modify the function
surrounding the existing deletion loop and the acDAO.GetAll call to include this
early return.
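
The atomicity concern raised for site.go (deletes issued with a nil transaction can leave partial state on failure) comes down to a commit-or-rollback wrapper. The following is a minimal sketch under stated assumptions: `txn`, `fakeTx`, `inTx`, and `errDeleteFailed` are hypothetical names invented for illustration and do not reflect the repository's session or DAO APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeleteFailed is a placeholder failure used only by this sketch.
var errDeleteFailed = errors.New("simulated delete failure")

// txn is a hypothetical minimal transaction interface.
type txn interface {
	Commit() error
	Rollback() error
}

// fakeTx records whether it was committed or rolled back.
type fakeTx struct {
	committed  bool
	rolledBack bool
}

func (t *fakeTx) Commit() error   { t.committed = true; return nil }
func (t *fakeTx) Rollback() error { t.rolledBack = true; return nil }

// inTx runs fn with tx, commits when fn succeeds, and rolls back otherwise.
func inTx(tx txn, fn func(tx txn) error) error {
	if err := fn(tx); err != nil {
		if rbErr := tx.Rollback(); rbErr != nil {
			return fmt.Errorf("%w (rollback also failed: %v)", err, rbErr)
		}
		return err
	}
	return tx.Commit()
}

func main() {
	good := &fakeTx{}
	err := inTx(good, func(tx txn) error { return nil })
	fmt.Println("success path:", err, "committed:", good.committed)

	bad := &fakeTx{}
	err = inTx(bad, func(tx txn) error { return errDeleteFailed })
	fmt.Println("failure path:", err, "rolled back:", bad.rolledBack)
}
```

In a real cleanup every delete would receive the same transaction handle inside `fn`, so that the final site delete and all dependent deletes either land together or not at all.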

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 69c23a47-04c4-4c48-92b0-5ac1d88f3e03

📥 Commits

Reviewing files that changed from the base of the PR and between fbd95ed and 360d273.

📒 Files selected for processing (11)
  • db/cmd/migrations/migrations
  • db/pkg/db/model/infinibandinterface.go
  • db/pkg/db/model/infinibandinterface_test.go
  • db/pkg/db/model/interface.go
  • db/pkg/db/model/interface_test.go
  • db/pkg/db/model/nvlinkinterface.go
  • db/pkg/db/model/nvlinkinterface_test.go
  • db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go
  • workflow/pkg/activity/site/site.go
  • workflow/pkg/activity/site/site_test.go
  • workflow/pkg/util/testing.go

Comment thread db/pkg/db/model/interface.go
Comment thread workflow/pkg/activity/site/site.go
Comment thread workflow/pkg/activity/site/site.go Outdated
@nvlitagaki nvlitagaki force-pushed the fix/site-clean-up branch from 460ec05 to 59ac498 Compare May 6, 2026 14:29
@thossain-nv thossain-nv changed the title fix: Clean-up related records on site deletion fix: Clean up related components on Site deletion May 7, 2026
Contributor

@thossain-nv thossain-nv left a comment


Thanks for initiating this, @nvlitagaki. I added some suggestions to simplify the PR.

Comment thread workflow/pkg/activity/site/site.go Outdated
Comment thread workflow/pkg/activity/site/site.go Outdated
Comment thread workflow/pkg/activity/site/site.go Outdated
Comment thread workflow/pkg/activity/site/site.go Outdated
return err
}
// Delete InfiniBand interfaces
err = ibiDAO.DeleteAllBySiteID(ctx, tx, siteID)
Contributor


Cleanup of Ethernet, InfiniBand, and NVLink Interfaces and of SSH Key Group Instance associations is already handled in the code that deletes the Instance (fixed recently: https://github.com/NVIDIA/infra-controller-rest/blob/main/workflow/pkg/activity/instance/instance.go#L961), so this might also be redundant.

Any lingering un-deleted items will be handled by the migration added in this PR.

Contributor Author


@thossain-nv Shouldn't we be worried about instances that were cleaned up earlier within this DeleteSiteComponentsFromDb activity? Those wouldn't be affected by the recent fix. My assumption is that the instances in question would invariably be those created via targeted instance creation, since any other instances would have had to be deleted to clear the way for allocation deletion, which is a prerequisite for deleting the site.

nvlitagaki added 5 commits May 6, 2026 21:40
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
@nvlitagaki nvlitagaki force-pushed the fix/site-clean-up branch from 63a1255 to fc6fc65 Compare May 7, 2026 15:35
@nvlitagaki
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
workflow/pkg/activity/site/site.go (2)

63-66: ⚡ Quick win

Name the SSH-key cleanup precisely.

This docstring currently implies that SSHKeyGroup records are deleted with the site, but the implementation only removes site/instance associations. Tightening the wording here will prevent future readers from assuming tenant-scoped key groups are purged as part of site deletion.

As per coding guidelines, "Document when you have intentionally omitted code that the reader might otherwise expect to be present".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workflow/pkg/activity/site/site.go` around lines 63 - 66, The docstring for
DeleteSiteComponentsFromDB is misleading about SSH-key cleanup; update the
comment to state explicitly that SSHKeyGroup records themselves are not deleted
and only site/instance associations are removed (e.g., “does not delete
tenant-scoped SSHKeyGroup records; only removes their associations to this
site/its instances”), and add a brief note per coding guidelines documenting the
intentional omission of full SSHKeyGroup deletion so future readers don’t assume
those records are purged.

419-426: ⚡ Quick win

The OS-cleanup comment no longer matches the code path.

This block now deletes any operating system that becomes orphaned after the site associations are removed, and the new tests cover non-image OSes as well. The comment still describes image-only cleanup, which is likely to mislead the next maintainer into reintroducing the wrong type check.

As per coding guidelines, "Document when you have intentionally omitted code that the reader might otherwise expect to be present".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workflow/pkg/activity/site/site.go` around lines 419 - 426, The comment above
the OS-cleanup block in workflow/pkg/activity/site/site.go is stale: update it
to state that after removing site associations the code deletes any operating
system that becomes orphaned (not just image-typed OSes), and explicitly note
that no type-based check (e.g., image vs iPXE) is performed; if any OS types
were intentionally excluded from deletion, document that omission and reference
the related tests that exercise non-image OS cleanup so future maintainers won't
reintroduce a type check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e2a5cbb8-efde-47f6-9a74-b3bee28a305a

📥 Commits

Reviewing files that changed from the base of the PR and between 360d273 and 557a8ce.

📒 Files selected for processing (11)
  • db/cmd/migrations/migrations
  • db/pkg/db/model/infinibandinterface.go
  • db/pkg/db/model/infinibandinterface_test.go
  • db/pkg/db/model/interface.go
  • db/pkg/db/model/interface_test.go
  • db/pkg/db/model/nvlinkinterface.go
  • db/pkg/db/model/nvlinkinterface_test.go
  • db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go
  • workflow/pkg/activity/site/site.go
  • workflow/pkg/activity/site/site_test.go
  • workflow/pkg/util/testing.go
✅ Files skipped from review due to trivial changes (1)
  • db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • db/pkg/db/model/nvlinkinterface.go
  • db/pkg/db/model/infinibandinterface_test.go
  • workflow/pkg/util/testing.go

Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
@nvlitagaki
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
workflow/pkg/activity/site/site.go (2)

190-196: 💤 Low value

Consider guarding against empty instanceIDs before invoking the bulk delete.

When no instances exist for the site, instanceIDs will be an empty slice passed to DeleteAllByInstanceIDs. While the DAO may handle this gracefully, an explicit guard improves clarity and avoids an unnecessary database call.

♻️ Suggested refinement
+	if len(instanceIDs) > 0 {
 		// Delete ethernet interfaces based on instances
 		// since they are not directly associated with the site
 		err = ifcDAO.DeleteAllByInstanceIDs(ctx, nil, instanceIDs)
 		if err != nil {
 			logger.Error().Err(err).Msg("error deleting Interfaces for Instances from DB")
 			return err
 		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workflow/pkg/activity/site/site.go` around lines 190 - 196, Guard against
calling ifcDAO.DeleteAllByInstanceIDs with an empty instanceIDs slice: before
invoking DeleteAllByInstanceIDs in site.go (the block that deletes ethernet
interfaces for instances), check if len(instanceIDs) == 0 and skip the DAO call
when empty to avoid an unnecessary DB call; keep the existing error handling for
the non-empty case (logger.Error().Err(err).Msg(...)) intact.

429-439: 💤 Low value

Minor: candidateOsIDs may contain duplicates.

Multiple site associations can reference the same OperatingSystemID, resulting in duplicates in candidateOsIDs. The query will function correctly, but passing a deduplicated list would be more efficient. Consider deriving the slice from the map keys after the loop completes.

♻️ Suggested refinement
 	candidateOsIDSet := make(map[uuid.UUID]struct{}, len(ossas))
-	candidateOsIDs := make([]uuid.UUID, 0, len(ossas))
 	for _, ossa := range ossas {
 		candidateOsIDSet[ossa.OperatingSystemID] = struct{}{}
-		candidateOsIDs = append(candidateOsIDs, ossa.OperatingSystemID)
 		serr := ossaDAO.Delete(ctx, nil, ossa.ID)
 		if serr != nil && serr != cdb.ErrDoesNotExist {
 			logger.Error().Err(serr).Str("Operating System Site Association ID", ossa.ID.String()).Msg("error deleting Operating System Site Association record in DB")
 			return serr
 		}
 	}
+	candidateOsIDs := make([]uuid.UUID, 0, len(candidateOsIDSet))
+	for osID := range candidateOsIDSet {
+		candidateOsIDs = append(candidateOsIDs, osID)
+	}
 	if len(candidateOsIDs) > 0 {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@workflow/pkg/activity/site/site.go` around lines 429 - 439, The
candidateOsIDs slice may contain duplicates because you append
ossa.OperatingSystemID for each ossa; instead, keep the current loop to populate
candidateOsIDSet and call ossaDAO.Delete for each ossa as you do, but do not
append into candidateOsIDs there; instead, after the loop completes, construct
candidateOsIDs by iterating over the keys of candidateOsIDSet
(map[uuid.UUID]struct{}) so the slice contains unique OperatingSystemIDs; update
any downstream use of candidateOsIDs to rely on this deduplicated slice
(referencing candidateOsIDSet, candidateOsIDs, ossas, and ossaDAO.Delete).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: df74def8-6816-461f-8dec-9b14d5ab2b5f

📥 Commits

Reviewing files that changed from the base of the PR and between 557a8ce and 6085f16.

📒 Files selected for processing (1)
  • workflow/pkg/activity/site/site.go

Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
@nvlitagaki nvlitagaki force-pushed the fix/site-clean-up branch from 6085f16 to aa4c0b2 Compare May 7, 2026 21:23
@nvlitagaki nvlitagaki marked this pull request as ready for review May 7, 2026 21:24
@nvlitagaki nvlitagaki requested a review from a team as a code owner May 7, 2026 21:24
@github-actions

github-actions Bot commented May 7, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-07 21:25:34 UTC | Commit: 5ba17db

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go`:
- Around line 54-104: Migration misses backfilling operating_system cleanup:
inside the same transaction (use tx and handleError like the other loops), first
DELETE FROM operating_system_site_association WHERE site_id NOT IN (SELECT id
FROM site WHERE deleted IS NULL) to remove stale site associations, then DELETE
FROM operating_system WHERE id NOT IN (SELECT operating_system_id FROM
operating_system_site_association) to remove OS rows that no longer have any
live site associations (mirror DeleteSiteComponentsFromDB behavior); use tx.Exec
for both statements and handle errors with handleError(tx, err).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9c5d5bc1-077e-445d-8836-4755298283a8

📥 Commits

Reviewing files that changed from the base of the PR and between 6085f16 and 5ba17db.

📒 Files selected for processing (4)
  • db/pkg/db/model/interface_test.go
  • db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go
  • workflow/pkg/activity/site/site.go
  • workflow/pkg/util/testing.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • db/pkg/db/model/interface_test.go
  • workflow/pkg/util/testing.go

Comment on lines +54 to +104
// Site-scoped tables with a soft_delete column: mark rows deleted
// when their site is missing or already soft-deleted.
softDeleteSiteScopedTables := []struct {
	table string
	alias string
}{
	{"vpc_prefix", "vp"},
	{"vpc_peering", "vp"},
	{"nvlink_logical_partition", "nvllp"},
	{"ssh_key_group_site_association", "skgsa"},
	{"ssh_key_group_instance_association", "skgia"},
	{"network_security_group", "nsg"},
	{"dpu_extension_service_deployment", "desd"},
}
for _, t := range softDeleteSiteScopedTables {
	stmt := fmt.Sprintf(`
		UPDATE %[1]s %[2]s
		SET deleted = CURRENT_TIMESTAMP, updated = CURRENT_TIMESTAMP
		WHERE %[2]s.deleted IS NULL
		AND %[2]s.site_id NOT IN (SELECT id FROM site WHERE deleted IS NULL)`,
		t.table, t.alias)
	_, err = tx.Exec(stmt)
	handleError(tx, err)
}

// Site-scoped tables without a soft_delete column: hard-delete rows
// whose site is missing or already soft-deleted. The matching DAO
// Delete methods on these tables are also hard deletes, so this
// mirrors the runtime cleanup.
hardDeleteSiteScopedTables := []string{
	"sku",
	"expected_machine",
	"expected_switch",
	"expected_power_shelf",
}
for _, table := range hardDeleteSiteScopedTables {
	stmt := fmt.Sprintf(`
		DELETE FROM %[1]s
		WHERE site_id NOT IN (SELECT id FROM site WHERE deleted IS NULL)`,
		table)
	_, err = tx.Exec(stmt)
	handleError(tx, err)
}

terr = tx.Commit()
if terr != nil {
	handlePanic(terr, "failed to commit transaction")
}

fmt.Print(" [up migration] Soft-deleted orphan site-scoped rows across interface, vpc_prefix, vpc_peering, nvlink_logical_partition, ssh_key_group_site_association, ssh_key_group_instance_association, network_security_group, and dpu_extension_service_deployment; hard-deleted orphan rows from sku, expected_machine, expected_switch, and expected_power_shelf. ")
return nil
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Backfill is still missing operating-system cleanup.

DeleteSiteComponentsFromDB now removes operating_system_site_association rows and then deletes operating_system rows that no longer have any live site association, but this migration never backfills those two steps. That leaves previously deleted sites with stale OSSA rows, and potentially orphaned OS records, even after the migration runs. Please mirror the new runtime cleanup here as well.

Suggested shape of the fix
 		softDeleteSiteScopedTables := []struct {
 			table string
 			alias string
 		}{
 			{"vpc_prefix", "vp"},
 			{"vpc_peering", "vp"},
 			{"nvlink_logical_partition", "nvllp"},
 			{"ssh_key_group_site_association", "skgsa"},
 			{"ssh_key_group_instance_association", "skgia"},
 			{"network_security_group", "nsg"},
 			{"dpu_extension_service_deployment", "desd"},
+			{"operating_system_site_association", "ossa"},
 		}
 		for _, t := range softDeleteSiteScopedTables {
 			stmt := fmt.Sprintf(`
 				UPDATE %[1]s %[2]s
 				SET deleted = CURRENT_TIMESTAMP, updated = CURRENT_TIMESTAMP
 				WHERE %[2]s.deleted IS NULL
 				AND %[2]s.site_id NOT IN (SELECT id FROM site WHERE deleted IS NULL)`,
 				t.table, t.alias)
 			_, err = tx.Exec(stmt)
 			handleError(tx, err)
 		}
+
+		_, err = tx.Exec(`
+			UPDATE operating_system os
+			SET deleted = CURRENT_TIMESTAMP, updated = CURRENT_TIMESTAMP
+			WHERE os.deleted IS NULL
+			AND NOT EXISTS (
+				SELECT 1
+				FROM operating_system_site_association ossa
+				WHERE ossa.operating_system_id = os.id
+				AND ossa.deleted IS NULL
+			)`)
+		handleError(tx, err)
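
The backfill rule described in this comment (an operating system is orphaned once none of its site associations point at a live site) can be checked on in-memory data. The sketch below is illustrative only: the `association` struct and `orphanedOS` function are hypothetical and use string IDs rather than the real models or SQL.

```go
package main

import (
	"fmt"
	"sort"
)

// association is a hypothetical stand-in for an operating system site association row.
type association struct {
	osID   string
	siteID string
}

// orphanedOS returns, in sorted order, the IDs of operating systems
// that have no association to any site present in liveSites.
func orphanedOS(osIDs []string, assocs []association, liveSites map[string]bool) []string {
	hasLive := make(map[string]bool, len(osIDs))
	for _, a := range assocs {
		if liveSites[a.siteID] {
			hasLive[a.osID] = true
		}
	}
	var orphans []string
	for _, id := range osIDs {
		if !hasLive[id] {
			orphans = append(orphans, id)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	live := map[string]bool{"site-2": true}
	assocs := []association{
		{"os-a", "site-1"}, // site-1 has been deleted
		{"os-b", "site-1"},
		{"os-b", "site-2"}, // still referenced by a live site
	}
	fmt.Println(orphanedOS([]string{"os-a", "os-b", "os-c"}, assocs, live))
}
```

Under this rule an operating system that never had any site association (os-c here) is selected as well; whether that is intended depends on the data model, which this sketch does not know.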

@github-actions

github-actions Bot commented May 7, 2026

🔍 Container Scan Summary

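| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| nico-nsm | 64 | 2 | 20 | 33 | 9 | 0 |
| nico-psm | 56 | 4 | 29 | 13 | 2 | 8 |
| nico-rest-api | 57 | 4 | 30 | 13 | 2 | 8 |
| nico-rest-cert-manager | 54 | 4 | 28 | 13 | 1 | 8 |
| nico-rest-db | 55 | 4 | 28 | 13 | 2 | 8 |
| nico-rest-site-agent | 54 | 4 | 28 | 13 | 1 | 8 |
| nico-rest-site-manager | 54 | 4 | 28 | 13 | 1 | 8 |
| nico-rest-workflow | 56 | 4 | 29 | 13 | 2 | 8 |
| nico-rla | 55 | 4 | 28 | 13 | 2 | 8 |
| TOTAL | 505 | 34 | 248 | 137 | 22 | 64 |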

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.


2 participants