fix: Clean up related components on Site deletion#490
fix: Clean up related components on Site deletion#490nvlitagaki wants to merge 9 commits intoNVIDIA:mainfrom
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Summary by CodeRabbitRelease Notes
WalkthroughAdds bulk soft-delete DAO methods for Interface/InfiniBand/NVLink, implements batching and tracing, adds unit/integration tests, introduces a migration to clean orphan site components, expands site purge orchestration to remove many site-scoped resources, and adds test builders and schema resets. ChangesSite Cleanup Infrastructure
Sequence DiagramsequenceDiagram
actor Caller
participant Activity as DeleteSiteComponentsFromDB Activity
participant InstanceDAO
participant InterfaceDAO as Interface DAO
participant InfiniBandDAO
participant NVLinkDAO
participant VPCDAO
participant OtherDAOs
participant DB as Database
Caller->>Activity: DeleteSiteComponentsFromDB(siteID)
Activity->>InstanceDAO: Load instances by site
InstanceDAO->>DB: SELECT instances WHERE site_id = ?
DB-->>InstanceDAO: instances
InstanceDAO-->>Activity: instance list
Activity->>InterfaceDAO: DeleteAllByInstanceIDs(instanceIDs)
InterfaceDAO->>DB: UPDATE interfaces SET deleted WHERE instance_id IN (?)
DB-->>InterfaceDAO: deleted count
Activity->>InfiniBandDAO: DeleteAllBySiteID(siteID)
InfiniBandDAO->>DB: UPDATE ib_interfaces SET deleted WHERE site_id = ?
DB-->>InfiniBandDAO: deleted count
Activity->>NVLinkDAO: DeleteAllBySiteID(siteID)
NVLinkDAO->>DB: UPDATE nvlink_interfaces SET deleted WHERE site_id = ?
DB-->>NVLinkDAO: deleted count
Activity->>VPCDAO: Delete VPC prefixes & peerings for site
VPCDAO->>DB: DELETE/UPDATE vpc_prefixes, vpc_peerings
DB-->>VPCDAO: success
Activity->>OtherDAOs: Delete NVLink partitions, SSH key associations, NSGs, DPU deployments, SKUs, expected_*, OS associations
OtherDAOs->>DB: DELETE/UPDATE respective tables
DB-->>OtherDAOs: success
Activity->>DB: DELETE/UPDATE site WHERE id = ?
DB-->>Activity: success
Activity-->>Caller: success
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@db/pkg/db/model/interface.go`:
- Around line 501-523: DeleteAllByInstanceIDs currently issues a single
unbounded WHERE "ifc.instance_id IN (?)" with all instanceIDs which can hit DB
parameter limits and cause latency; modify
InterfaceSQLDAO.DeleteAllByInstanceIDs to chunk instanceIDs into reasonably
sized batches (e.g., 500–1000 IDs) and run the
NewDelete().Model((*Interface)(nil)).Where("ifc.instance_id IN (?)",
bun.In(batch)).Exec(ctx) for each batch in a loop, returning the first non-nil
error; preserve the tracer span and update the "instance_id_count" attribute if
desired, and no-op early when len(instanceIDs)==0.
In `@workflow/pkg/activity/site/site.go`:
- Around line 86-110: The cleanup sequence creates many DAO instances (vpDAO,
subnetDAO, vpfxDAO, ..., emDAO) but executes deletes with tx=nil so failures
leave partial state; wrap the entire expanded site cleanup in a single DB
transaction by beginning a tx from mst.dbSession (e.g., tx :=
mst.dbSession.Begin()/BeginTx()), pass that tx into DAO methods/constructors or
use DAO methods that accept a tx instead of nil, ensure every delete call uses
that tx, and only call tx.Commit() after the final site delete succeeds while
calling tx.Rollback() on any error; apply the same transaction threading to the
other cleanup blocks referenced (lines ~191-211 and ~286-555) so all deletions
are atomic.
- Around line 488-503: The code builds allocationIDs (variable allocationIDs)
even when there are no allocations and then calls acDAO.GetAll(...), relying on
the DAO to handle an empty slice; add an explicit guard after the deletion loop
that checks len(allocationIDs) == 0 and returns immediately (e.g., return nil or
the appropriate success value) to skip the AllocationConstraint lookup and avoid
depending on acDAO.GetAll() handling of empty slices; modify the function
surrounding the existing deletion loop and the acDAO.GetAll call to include this
early return.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 69c23a47-04c4-4c48-92b0-5ac1d88f3e03
📒 Files selected for processing (11)
db/cmd/migrations/migrationsdb/pkg/db/model/infinibandinterface.godb/pkg/db/model/infinibandinterface_test.godb/pkg/db/model/interface.godb/pkg/db/model/interface_test.godb/pkg/db/model/nvlinkinterface.godb/pkg/db/model/nvlinkinterface_test.godb/pkg/migrations/20260505170000_cleanup_orphan_site_components.goworkflow/pkg/activity/site/site.goworkflow/pkg/activity/site/site_test.goworkflow/pkg/util/testing.go
460ec05 to
59ac498
Compare
thossain-nv
left a comment
There was a problem hiding this comment.
Thanks for initiating this @nvlitagaki Added some suggestion to simplify the PR.
| return err | ||
| } | ||
| // Delete InfiniBand interfaces | ||
| err = ibiDAO.DeleteAllBySiteID(ctx, tx, siteID) |
There was a problem hiding this comment.
Clean up of Ethernet, InfiniBand, NVLink Interfaces & SSH Key Group Instance associations are already handled in the code that deletes the Instance (fixed recently https://github.com/NVIDIA/infra-controller-rest/blob/main/workflow/pkg/activity/instance/instance.go#L961) so this might also be redundant.
Any lingering un-deleted items will be handled by the migration added in this PR.
There was a problem hiding this comment.
@thossain-nv Shouldn't we be worried about instances that were cleaned up earlier within thisDeleteSiteComponentsFromDb activity? Those wouldn't be affected by the recent fix... My assumption is that the instances in question would be invariably be those created via targeted instance creation, since any other instances would have had to have been deleted to clear the way for allocation deletion as a prerequisite for deleting the site
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
63a1255 to
fc6fc65
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
workflow/pkg/activity/site/site.go (2)
63-66: ⚡ Quick winName the SSH-key cleanup precisely.
This docstring currently implies that
SSHKeyGrouprecords are deleted with the site, but the implementation only removes site/instance associations. Tightening the wording here will prevent future readers from assuming tenant-scoped key groups are purged as part of site deletion.As per coding guidelines, "Document when you have intentionally omitted code that the reader might otherwise expect to be present".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workflow/pkg/activity/site/site.go` around lines 63 - 66, The docstring for DeleteSiteComponentsFromDB is misleading about SSH-key cleanup; update the comment to state explicitly that SSHKeyGroup records themselves are not deleted and only site/instance associations are removed (e.g., “does not delete tenant-scoped SSHKeyGroup records; only removes their associations to this site/its instances”), and add a brief note per coding guidelines documenting the intentional omission of full SSHKeyGroup deletion so future readers don’t assume those records are purged.
419-426: ⚡ Quick winThe OS-cleanup comment no longer matches the code path.
This block now deletes any operating system that becomes orphaned after the site associations are removed, and the new tests cover non-image OSes as well. The comment still describes image-only cleanup, which is likely to mislead the next maintainer into reintroducing the wrong type check.
As per coding guidelines, "Document when you have intentionally omitted code that the reader might otherwise expect to be present".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workflow/pkg/activity/site/site.go` around lines 419 - 426, The comment above the OS-cleanup block in workflow/pkg/activity/site/site.go is stale: update it to state that after removing site associations the code deletes any operating system that becomes orphaned (not just image-typed OSes), and explicitly note that no type-based check (e.g., image vs iPXE) is performed; if any OS types were intentionally excluded from deletion, document that omission and reference the related tests that exercise non-image OS cleanup so future maintainers won't reintroduce a type check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@workflow/pkg/activity/site/site.go`:
- Around line 63-66: The docstring for DeleteSiteComponentsFromDB is misleading
about SSH-key cleanup; update the comment to state explicitly that SSHKeyGroup
records themselves are not deleted and only site/instance associations are
removed (e.g., “does not delete tenant-scoped SSHKeyGroup records; only removes
their associations to this site/its instances”), and add a brief note per coding
guidelines documenting the intentional omission of full SSHKeyGroup deletion so
future readers don’t assume those records are purged.
- Around line 419-426: The comment above the OS-cleanup block in
workflow/pkg/activity/site/site.go is stale: update it to state that after
removing site associations the code deletes any operating system that becomes
orphaned (not just image-typed OSes), and explicitly note that no type-based
check (e.g., image vs iPXE) is performed; if any OS types were intentionally
excluded from deletion, document that omission and reference the related tests
that exercise non-image OS cleanup so future maintainers won't reintroduce a
type check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e2a5cbb8-efde-47f6-9a74-b3bee28a305a
📒 Files selected for processing (11)
db/cmd/migrations/migrationsdb/pkg/db/model/infinibandinterface.godb/pkg/db/model/infinibandinterface_test.godb/pkg/db/model/interface.godb/pkg/db/model/interface_test.godb/pkg/db/model/nvlinkinterface.godb/pkg/db/model/nvlinkinterface_test.godb/pkg/migrations/20260505170000_cleanup_orphan_site_components.goworkflow/pkg/activity/site/site.goworkflow/pkg/activity/site/site_test.goworkflow/pkg/util/testing.go
✅ Files skipped from review due to trivial changes (1)
- db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go
🚧 Files skipped from review as they are similar to previous changes (3)
- db/pkg/db/model/nvlinkinterface.go
- db/pkg/db/model/infinibandinterface_test.go
- workflow/pkg/util/testing.go
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
workflow/pkg/activity/site/site.go (2)
190-196: 💤 Low valueConsider guarding against empty
instanceIDsbefore invoking the bulk delete.When no instances exist for the site,
instanceIDswill be an empty slice passed toDeleteAllByInstanceIDs. While the DAO may handle this gracefully, an explicit guard improves clarity and avoids an unnecessary database call.♻️ Suggested refinement
+ if len(instanceIDs) > 0 { // Delete ethernet interfaces based on instances // since they are not directly associated with the site err = ifcDAO.DeleteAllByInstanceIDs(ctx, nil, instanceIDs) if err != nil { logger.Error().Err(err).Msg("error deleting Interfaces for Instances from DB") return err } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workflow/pkg/activity/site/site.go` around lines 190 - 196, Guard against calling ifcDAO.DeleteAllByInstanceIDs with an empty instanceIDs slice: before invoking DeleteAllByInstanceIDs in site.go (the block that deletes ethernet interfaces for instances), check if len(instanceIDs) == 0 and skip the DAO call when empty to avoid an unnecessary DB call; keep the existing error handling for the non-empty case (logger.Error().Err(err).Msg(...)) intact.
429-439: 💤 Low valueMinor:
candidateOsIDsmay contain duplicates.Multiple site associations can reference the same
OperatingSystemID, resulting in duplicates incandidateOsIDs. The query will function correctly, but passing a deduplicated list would be more efficient. Consider deriving the slice from the map keys after the loop completes.♻️ Suggested refinement
candidateOsIDSet := make(map[uuid.UUID]struct{}, len(ossas)) - candidateOsIDs := make([]uuid.UUID, 0, len(ossas)) for _, ossa := range ossas { candidateOsIDSet[ossa.OperatingSystemID] = struct{}{} - candidateOsIDs = append(candidateOsIDs, ossa.OperatingSystemID) serr := ossaDAO.Delete(ctx, nil, ossa.ID) if serr != nil && serr != cdb.ErrDoesNotExist { logger.Error().Err(serr).Str("Operating System Site Association ID", ossa.ID.String()).Msg("error deleting Operating System Site Association record in DB") return serr } } + candidateOsIDs := make([]uuid.UUID, 0, len(candidateOsIDSet)) + for osID := range candidateOsIDSet { + candidateOsIDs = append(candidateOsIDs, osID) + } if len(candidateOsIDs) > 0 {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workflow/pkg/activity/site/site.go` around lines 429 - 439, The candidateOsIDs slice may contain duplicates because you append ossa.OperatingSystemID for each ossa; instead, keep the current loop to populate candidateOsIDSet and call ossaDAO.Delete for each ossa as you do, but do not append into candidateOsIDs there—instead, after the loop complete, construct candidateOsIDs by iterating over the keys of candidateOsIDSet (map[uuid.UUID]struct{}) so the slice contains unique OperatingSystemIDs; update any downstream use of candidateOsIDs to rely on this deduplicated slice (referencing candidateOsIDSet, candidateOsIDs, ossas, and ossaDAO.Delete).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@workflow/pkg/activity/site/site.go`:
- Around line 190-196: Guard against calling ifcDAO.DeleteAllByInstanceIDs with
an empty instanceIDs slice: before invoking DeleteAllByInstanceIDs in site.go
(the block that deletes ethernet interfaces for instances), check if
len(instanceIDs) == 0 and skip the DAO call when empty to avoid an unnecessary
DB call; keep the existing error handling for the non-empty case
(logger.Error().Err(err).Msg(...)) intact.
- Around line 429-439: The candidateOsIDs slice may contain duplicates because
you append ossa.OperatingSystemID for each ossa; instead, keep the current loop
to populate candidateOsIDSet and call ossaDAO.Delete for each ossa as you do,
but do not append into candidateOsIDs there—instead, after the loop complete,
construct candidateOsIDs by iterating over the keys of candidateOsIDSet
(map[uuid.UUID]struct{}) so the slice contains unique OperatingSystemIDs; update
any downstream use of candidateOsIDs to rely on this deduplicated slice
(referencing candidateOsIDSet, candidateOsIDs, ossas, and ossaDAO.Delete).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: df74def8-6816-461f-8dec-9b14d5ab2b5f
📒 Files selected for processing (1)
workflow/pkg/activity/site/site.go
Signed-off-by: Leah Itagaki <litagaki@nvidia.com>
6085f16 to
aa4c0b2
Compare
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-05-07 21:25:34 UTC | Commit: 5ba17db |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go`:
- Around line 54-104: Migration misses backfilling operating_system cleanup:
inside the same transaction (use tx and handleError like the other loops), first
DELETE FROM operating_system_site_association WHERE site_id NOT IN (SELECT id
FROM site WHERE deleted IS NULL) to remove stale site associations, then DELETE
FROM operating_system WHERE id NOT IN (SELECT operating_system_id FROM
operating_system_site_association) to remove OS rows that no longer have any
live site associations (mirror DeleteSiteComponentsFromDB behavior); use tx.Exec
for both statements and handle errors with handleError(tx, err).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9c5d5bc1-077e-445d-8836-4755298283a8
📒 Files selected for processing (4)
db/pkg/db/model/interface_test.godb/pkg/migrations/20260505170000_cleanup_orphan_site_components.goworkflow/pkg/activity/site/site.goworkflow/pkg/util/testing.go
🚧 Files skipped from review as they are similar to previous changes (2)
- db/pkg/db/model/interface_test.go
- workflow/pkg/util/testing.go
| // Site-scoped tables with a soft_delete column: mark rows deleted | ||
| // when their site is missing or already soft-deleted. | ||
| softDeleteSiteScopedTables := []struct { | ||
| table string | ||
| alias string | ||
| }{ | ||
| {"vpc_prefix", "vp"}, | ||
| {"vpc_peering", "vp"}, | ||
| {"nvlink_logical_partition", "nvllp"}, | ||
| {"ssh_key_group_site_association", "skgsa"}, | ||
| {"ssh_key_group_instance_association", "skgia"}, | ||
| {"network_security_group", "nsg"}, | ||
| {"dpu_extension_service_deployment", "desd"}, | ||
| } | ||
| for _, t := range softDeleteSiteScopedTables { | ||
| stmt := fmt.Sprintf(` | ||
| UPDATE %[1]s %[2]s | ||
| SET deleted = CURRENT_TIMESTAMP, updated = CURRENT_TIMESTAMP | ||
| WHERE %[2]s.deleted IS NULL | ||
| AND %[2]s.site_id NOT IN (SELECT id FROM site WHERE deleted IS NULL)`, | ||
| t.table, t.alias) | ||
| _, err = tx.Exec(stmt) | ||
| handleError(tx, err) | ||
| } | ||
|
|
||
| // Site-scoped tables without a soft_delete column: hard-delete rows | ||
| // whose site is missing or already soft-deleted. The matching DAO | ||
| // Delete methods on these tables are also hard deletes, so this | ||
| // mirrors the runtime cleanup. | ||
| hardDeleteSiteScopedTables := []string{ | ||
| "sku", | ||
| "expected_machine", | ||
| "expected_switch", | ||
| "expected_power_shelf", | ||
| } | ||
| for _, table := range hardDeleteSiteScopedTables { | ||
| stmt := fmt.Sprintf(` | ||
| DELETE FROM %[1]s | ||
| WHERE site_id NOT IN (SELECT id FROM site WHERE deleted IS NULL)`, | ||
| table) | ||
| _, err = tx.Exec(stmt) | ||
| handleError(tx, err) | ||
| } | ||
|
|
||
| terr = tx.Commit() | ||
| if terr != nil { | ||
| handlePanic(terr, "failed to commit transaction") | ||
| } | ||
|
|
||
| fmt.Print(" [up migration] Soft-deleted orphan site-scoped rows across interface, vpc_prefix, vpc_peering, nvlink_logical_partition, ssh_key_group_site_association, ssh_key_group_instance_association, network_security_group, and dpu_extension_service_deployment; hard-deleted orphan rows from sku, expected_machine, expected_switch, and expected_power_shelf. ") | ||
| return nil |
There was a problem hiding this comment.
Backfill is still missing operating-system cleanup.
DeleteSiteComponentsFromDB now removes operating_system_site_association rows and then deletes operating_system rows that no longer have any live site association, but this migration never backfills those two steps. That leaves previously deleted sites with stale OSSA rows, and potentially orphaned OS records, even after the migration runs. Please mirror the new runtime cleanup here as well.
Suggested shape of the fix
softDeleteSiteScopedTables := []struct {
table string
alias string
}{
{"vpc_prefix", "vp"},
{"vpc_peering", "vp"},
{"nvlink_logical_partition", "nvllp"},
{"ssh_key_group_site_association", "skgsa"},
{"ssh_key_group_instance_association", "skgia"},
{"network_security_group", "nsg"},
{"dpu_extension_service_deployment", "desd"},
+ {"operating_system_site_association", "ossa"},
}
for _, t := range softDeleteSiteScopedTables {
stmt := fmt.Sprintf(`
UPDATE %[1]s %[2]s
SET deleted = CURRENT_TIMESTAMP, updated = CURRENT_TIMESTAMP
WHERE %[2]s.deleted IS NULL
AND %[2]s.site_id NOT IN (SELECT id FROM site WHERE deleted IS NULL)`,
t.table, t.alias)
_, err = tx.Exec(stmt)
handleError(tx, err)
}
+
+ _, err = tx.Exec(`
+ UPDATE operating_system os
+ SET deleted = CURRENT_TIMESTAMP, updated = CURRENT_TIMESTAMP
+ WHERE os.deleted IS NULL
+ AND NOT EXISTS (
+ SELECT 1
+ FROM operating_system_site_association ossa
+ WHERE ossa.operating_system_id = os.id
+ AND ossa.deleted IS NULL
+ )`)
+ handleError(tx, err)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@db/pkg/migrations/20260505170000_cleanup_orphan_site_components.go` around
lines 54 - 104, Migration misses backfilling operating_system cleanup: inside
the same transaction (use tx and handleError like the other loops), first DELETE
FROM operating_system_site_association WHERE site_id NOT IN (SELECT id FROM site
WHERE deleted IS NULL) to remove stale site associations, then DELETE FROM
operating_system WHERE id NOT IN (SELECT operating_system_id FROM
operating_system_site_association) to remove OS rows that no longer have any
live site associations (mirror DeleteSiteComponentsFromDB behavior); use tx.Exec
for both statements and handle errors with handleError(tx, err).
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Description
Ensure that when a site is deleted that we delete any records that may reference said site.
Also clean-up the database to deal with any records that were orphaned when sites were deleted previously.
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes