refactor: Streamline component manager provider bootstrap#504
refactor: Streamline component manager provider bootstrap#504jw-nvidia wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (33)
💤 Files with no reviewable changes (1)
✅ Files skipped from review due to trivial changes (6)
🚧 Files skipped from review as they are similar to previous changes (22)
Summary by CodeRabbit
WalkthroughThis PR restructures component-manager code into cmconfig and providerapi subpackages, implements strict YAML config loading with provider completion, introduces a nil-safe error-returning provider registry, provides builtin defaults and factory registration, and migrates startup, scheduler jobs, tests, and docs to the new APIs. ChangesComponent Manager Configuration Refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
rla/internal/task/componentmanager/builtin/config.go (1)
53-70: 💤 Low valueConsider replacing the IIFE with a direct conditional assignment for idiomatic Go.
The Immediately Invoked Function Expression exists only to reuse the already-declared
errvariable while introducingconfig. A plain conditional achieves the same with no cognitive overhead and eliminates the duplicatedpath != ""check across the IIFE body and the error-wrapping block.♻️ Proposed refactor
- config, err := func() (cmconfig.Config, error) { - if path != "" { - // Load config from file - return cmconfig.LoadConfig(path, decoders) - } - - // Use embedded service config - return cmconfig.New(defaultServiceComponentManagers(), decoders) - }() - - if err != nil { - if path != "" { - err = fmt.Errorf("load config from file: %w", err) - } else { - err = fmt.Errorf("get default config: %w", err) - } - return cmconfig.Config{}, err - } + var config cmconfig.Config + if path != "" { + config, err = cmconfig.LoadConfig(path, decoders) + err = wrapErr(err, "load config from file") + } else { + config, err = cmconfig.New(defaultServiceComponentManagers(), decoders) + err = wrapErr(err, "get default config") + } + if err != nil { + return cmconfig.Config{}, err + }Where
wrapErris a small inline helper, or simply inlinefmt.Errorfin each branch. Either way the singlepath != ""check drives both loading and error-message selection.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rla/internal/task/componentmanager/builtin/config.go` around lines 53 - 70, Replace the IIFE by a straightforward conditional: if path != "" call cmconfig.LoadConfig(path, decoders) and assign to config, err; else call cmconfig.New(defaultServiceComponentManagers(), decoders) and assign to config, err; then check err and wrap it with fmt.Errorf("load config from file: %w", err) when path != "" or fmt.Errorf("get default config: %w", err) otherwise—this removes the IIFE and the duplicated path != "" check while keeping the same calls to cmconfig.LoadConfig, cmconfig.New, defaultServiceComponentManagers, and the existing err wrapping.rla/internal/task/componentmanager/config/yaml.go (1)
116-140: 💤 Low valueConsider wrapping decoder errors with provider context.
When
decoder.DecodeYAMLfails at Line 131, the returned error may lack sufficient context about which provider configuration failed. Wrapping with the provider name would aid debugging.♻️ Optional: Add provider context to decode errors
providerConfig, err := decoder.DecodeYAML(rawNode) if err != nil { - return err + return fmt.Errorf("provider %q: %w", name, err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rla/internal/task/componentmanager/config/yaml.go` around lines 116 - 140, The decoder.DecodeYAML error returned in setYAMLProviderConfigs lacks provider context; update the error handling after calling decoder.DecodeYAML(rawNode) to wrap or annotate the returned error with the provider name (use the `name` from prepareProviderConfigForAdd) so the error message identifies which provider failed—e.g., replace the direct return of err with a wrapped error that includes `name` and the original err for decoder.DecodeYAML failures, leaving other logic (prepareProviderConfigForAdd, HasProvider, ProviderConfigs assignment) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rla/cmd/serve.go`:
- Around line 108-115: The code in the provider creation loop silently skips
providers when providerConfig.NewProvider returns an error (the block around
provider, err := providerConfig.NewProvider(ctx)), which can cause downstream
runtime failures; change this so you either treat creation failures as fatal
during startup (returning error / exiting) or, if non-fatal, record the failed
provider names in a skippedProviders set and after the provider loop validate
all configured component managers (e.g., wherever component managers are
constructed/registered) to ensure none depend on a skipped provider; if any do,
fail startup with a clear error referencing the missing provider and the
dependent component manager name.
In `@rla/docs/component-manager-architecture.md`:
- Around line 250-256: The import paths in the example are inconsistent: they
use github.com/NVIDIA/ncx-infra-controller-rest/... but the repository uses
github.com/NVIDIA/infra-controller-rest/; update the import strings in the
example (e.g., the imports referencing componentmanager, providerapi,
myapiprovider, common, operations, devicetypes) to use the correct base path
github.com/NVIDIA/infra-controller-rest/ so they match the rest of the codebase.
In `@rla/docs/component-manager-config.md`:
- Around line 37-38: The docs are ambiguous about whether an authoritative
service-loaded YAML's component_managers map can be partial; update the sentence
around "component_managers" to state the exact behavior: explicitly say whether
all three keys ("compute", "nvlswitch", "powershelf") are required when the file
is authoritative (for example: "All three component types must be present;
partial maps are not supported") or, if partial maps are merged with defaults,
state that and describe how missing keys are filled (e.g., defaults applied).
Reference the terms component_managers, compute, nvlswitch, powershelf and
"authoritative" in the new sentence so readers know which behavior applies.
---
Nitpick comments:
In `@rla/internal/task/componentmanager/builtin/config.go`:
- Around line 53-70: Replace the IIFE by a straightforward conditional: if path
!= "" call cmconfig.LoadConfig(path, decoders) and assign to config, err; else
call cmconfig.New(defaultServiceComponentManagers(), decoders) and assign to
config, err; then check err and wrap it with fmt.Errorf("load config from file:
%w", err) when path != "" or fmt.Errorf("get default config: %w", err)
otherwise—this removes the IIFE and the duplicated path != "" check while
keeping the same calls to cmconfig.LoadConfig, cmconfig.New,
defaultServiceComponentManagers, and the existing err wrapping.
In `@rla/internal/task/componentmanager/config/yaml.go`:
- Around line 116-140: The decoder.DecodeYAML error returned in
setYAMLProviderConfigs lacks provider context; update the error handling after
calling decoder.DecodeYAML(rawNode) to wrap or annotate the returned error with
the provider name (use the `name` from prepareProviderConfigForAdd) so the error
message identifies which provider failed—e.g., replace the direct return of err
with a wrapped error that includes `name` and the original err for
decoder.DecodeYAML failures, leaving other logic (prepareProviderConfigForAdd,
HasProvider, ProviderConfigs assignment) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6d6ffa7d-c6b8-407f-9691-95efcd4e8a3f
📒 Files selected for processing (33)
rla/cmd/serve.gorla/docs/component-manager-architecture.mdrla/docs/component-manager-config.mdrla/internal/scheduler/jobs/inventorysync/inventory.gorla/internal/scheduler/jobs/inventorysync/inventory_test.gorla/internal/scheduler/jobs/inventorysync/job.gorla/internal/scheduler/jobs/leakdetection/job.gorla/internal/service/config.gorla/internal/task/componentmanager/builtin/component_manager_factories.gorla/internal/task/componentmanager/builtin/component_manager_factories_test.gorla/internal/task/componentmanager/builtin/config.gorla/internal/task/componentmanager/builtin/config_test.gorla/internal/task/componentmanager/builtin/provider_config_decoders.gorla/internal/task/componentmanager/builtin/provider_config_decoders_test.gorla/internal/task/componentmanager/componentmanager.gorla/internal/task/componentmanager/componentmanager_test.gorla/internal/task/componentmanager/compute/nico/nico.gorla/internal/task/componentmanager/config.gorla/internal/task/componentmanager/config/config.gorla/internal/task/componentmanager/config/config_test.gorla/internal/task/componentmanager/config/doc.gorla/internal/task/componentmanager/config/errors.gorla/internal/task/componentmanager/config/yaml.gorla/internal/task/componentmanager/errors.gorla/internal/task/componentmanager/mock/mock.gorla/internal/task/componentmanager/nvlswitch/nico/nico.gorla/internal/task/componentmanager/nvlswitch/nvswitchmanager/nvswitchmanager.gorla/internal/task/componentmanager/powershelf/nico/nico.gorla/internal/task/componentmanager/powershelf/psm/psm.gorla/internal/task/componentmanager/providerapi/errors.gorla/internal/task/componentmanager/providerapi/providerapi.gorla/internal/task/componentmanager/providerapi/providerapi_test.gorla/internal/task/componentmanager/providerapi/registry.go
💤 Files with no reviewable changes (1)
- rla/internal/task/componentmanager/config.go
3eef9db to
0a819d4
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rla/internal/scheduler/jobs/inventorysync/job.go (1)
69-76:⚠️ Potential issue | 🟠 MajorReconsider unconditional NICo requirement for partial configurations.
The constructor disables the entire inventory sync job when the NICo provider is absent (lines 69–76), but
runInventoryOnedemonstrates capability-aware dispatch:syncNVSwitchesandsyncPowershelvesboth guard against nil clients and dispatch based oncmConfig. However,syncMachinesunconditionally requiresnicoClient. This creates an inconsistency: a deployment configured with PSM and NSM but intentionally without NICo would fail to sync switches and shelves despite having the necessary providers.Either document that machine sync is mandatory (making the entire job unsyncable without NICo), or permit graceful degradation by allowing the job to initialize when partial providers are available, with individual sync functions gating their operations on client availability—aligning with the existing
cmConfig-based dispatch pattern.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rla/internal/scheduler/jobs/inventorysync/job.go` around lines 69 - 76, The constructor currently bails out when providerapi.GetTyped for nicoprovider.ProviderName fails; instead, stop returning nil—log the warning (as you already do) but keep constructing the job with a nil nicoProvider so runInventoryOne can perform capability-aware dispatch; specifically, remove the early return in the block that calls providerapi.GetTyped[*nicoprovider.Provider], allow nicoProvider to be nil, and update syncMachines (and any other sync* helpers like syncNVSwitches and syncPowershelves if needed) to explicitly guard on the stored nicoProvider/nicoClient and the cmConfig before performing NICo-specific work so the job gracefully degrades when NICo is absent (or alternatively add a clear documented assertion in the constructor if machine sync must be mandatory).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rla/cmd/serve.go`:
- Around line 99-113: The loop over config.ProviderConfigs calls
providerConfig.Name() before checking for nil, which can panic; instead, first
check if providerConfig == nil and return the typed
providerapi.ProviderNotConfiguredError{Name: name} (the same error used below)
before calling providerConfig.Name() or providerConfig.NewProvider(ctx); update
the loop handling around providerConfig.Name(), providerConfig.NewProvider, and
the existing ProviderNotConfiguredError / ProviderConfigNameMismatchError
branches so nil entries yield the typed error rather than a panic.
In `@rla/docs/component-manager-config.md`:
- Around line 44-48: Update the documentation to include the NVSwitch
Manager-backed nvlswitch implementation: add the new implementation name (the
provider key used in the builtin registry for NVSwitch Manager) to the
`nvlswitch` row under "Available Implementations" and to any provider/completion
lists that currently show only `nico`/`psm` (e.g., the sections around the
existing nvlswitch/provider examples at the other noted ranges). Specifically,
edit the table row for the `nvlswitch` component and the provider completion
examples that mention `nico`/`psm` so they include the NVSwitch Manager provider
name exactly as it appears in the registry.
---
Outside diff comments:
In `@rla/internal/scheduler/jobs/inventorysync/job.go`:
- Around line 69-76: The constructor currently bails out when
providerapi.GetTyped for nicoprovider.ProviderName fails; instead, stop
returning nil—log the warning (as you already do) but keep constructing the job
with a nil nicoProvider so runInventoryOne can perform capability-aware
dispatch; specifically, remove the early return in the block that calls
providerapi.GetTyped[*nicoprovider.Provider], allow nicoProvider to be nil, and
update syncMachines (and any other sync* helpers like syncNVSwitches and
syncPowershelves if needed) to explicitly guard on the stored
nicoProvider/nicoClient and the cmConfig before performing NICo-specific work so
the job gracefully degrades when NICo is absent (or alternatively add a clear
documented assertion in the constructor if machine sync must be mandatory).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 11ee2044-5037-4867-9c56-9d15d4faae69
📒 Files selected for processing (33)
rla/cmd/serve.gorla/docs/component-manager-architecture.mdrla/docs/component-manager-config.mdrla/internal/scheduler/jobs/inventorysync/inventory.gorla/internal/scheduler/jobs/inventorysync/inventory_test.gorla/internal/scheduler/jobs/inventorysync/job.gorla/internal/scheduler/jobs/leakdetection/job.gorla/internal/service/config.gorla/internal/task/componentmanager/builtin/component_manager_factories.gorla/internal/task/componentmanager/builtin/component_manager_factories_test.gorla/internal/task/componentmanager/builtin/config.gorla/internal/task/componentmanager/builtin/config_test.gorla/internal/task/componentmanager/builtin/provider_config_decoders.gorla/internal/task/componentmanager/builtin/provider_config_decoders_test.gorla/internal/task/componentmanager/componentmanager.gorla/internal/task/componentmanager/componentmanager_test.gorla/internal/task/componentmanager/compute/nico/nico.gorla/internal/task/componentmanager/config.gorla/internal/task/componentmanager/config/config.gorla/internal/task/componentmanager/config/config_test.gorla/internal/task/componentmanager/config/doc.gorla/internal/task/componentmanager/config/errors.gorla/internal/task/componentmanager/config/yaml.gorla/internal/task/componentmanager/errors.gorla/internal/task/componentmanager/mock/mock.gorla/internal/task/componentmanager/nvlswitch/nico/nico.gorla/internal/task/componentmanager/nvlswitch/nvswitchmanager/nvswitchmanager.gorla/internal/task/componentmanager/powershelf/nico/nico.gorla/internal/task/componentmanager/powershelf/psm/psm.gorla/internal/task/componentmanager/providerapi/errors.gorla/internal/task/componentmanager/providerapi/providerapi.gorla/internal/task/componentmanager/providerapi/providerapi_test.gorla/internal/task/componentmanager/providerapi/registry.go
💤 Files with no reviewable changes (1)
- rla/internal/task/componentmanager/config.go
✅ Files skipped from review due to trivial changes (8)
- rla/internal/task/componentmanager/mock/mock.go
- rla/internal/task/componentmanager/config/doc.go
- rla/internal/task/componentmanager/builtin/component_manager_factories_test.go
- rla/internal/task/componentmanager/config/errors.go
- rla/internal/task/componentmanager/config/yaml.go
- rla/internal/task/componentmanager/builtin/config_test.go
- rla/internal/task/componentmanager/config/config.go
- rla/docs/component-manager-architecture.md
🚧 Files skipped from review as they are similar to previous changes (7)
- rla/internal/task/componentmanager/componentmanager_test.go
- rla/internal/task/componentmanager/powershelf/psm/psm.go
- rla/internal/task/componentmanager/compute/nico/nico.go
- rla/internal/service/config.go
- rla/internal/task/componentmanager/config/config_test.go
- rla/internal/task/componentmanager/providerapi/providerapi_test.go
- rla/internal/scheduler/jobs/inventorysync/inventory.go
| for name, providerConfig := range config.ProviderConfigs { | ||
| configName := providerConfig.Name() | ||
| if name != configName { | ||
| return nil, providerapi.ProviderConfigNameMismatchError{ | ||
| Name: name, | ||
| ConfigName: configName, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Initialize PSM provider if configured | ||
| if config.Providers.PSM != nil { | ||
| psmProvider, err := psm.New(*config.Providers.PSM) | ||
| provider, err := providerConfig.NewProvider(ctx) | ||
| if err != nil { | ||
| log.Warn().Err(err).Msg("Unable to create PSM client (powershelf operations may not work)") | ||
| } else { | ||
| providerRegistry.Register(psmProvider) | ||
| log.Info(). | ||
| Dur("timeout", config.Providers.PSM.Timeout). | ||
| Msg("Initialized PSM provider") | ||
| return nil, fmt.Errorf("create provider %q: %w", name, err) | ||
| } | ||
| if provider == nil { | ||
| return nil, providerapi.ProviderNotConfiguredError{Name: name} |
There was a problem hiding this comment.
Guard nil provider configs before calling Name().
providerConfig.Name() is dereferenced before any nil check. A nil entry in config.ProviderConfigs will panic during startup instead of returning the typed ProviderNotConfiguredError used elsewhere in this refactor.
Suggested fix
for name, providerConfig := range config.ProviderConfigs {
+ if providerConfig == nil {
+ return nil, providerapi.ProviderNotConfiguredError{Name: name}
+ }
+
configName := providerConfig.Name()
if name != configName {
return nil, providerapi.ProviderConfigNameMismatchError{
Name: name,
ConfigName: configName,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@rla/cmd/serve.go` around lines 99 - 113, The loop over config.ProviderConfigs
calls providerConfig.Name() before checking for nil, which can panic; instead,
first check if providerConfig == nil and return the typed
providerapi.ProviderNotConfiguredError{Name: name} (the same error used below)
before calling providerConfig.Name() or providerConfig.NewProvider(ctx); update
the loop handling around providerConfig.Name(), providerConfig.NewProvider, and
the existing ProviderNotConfiguredError / ProviderConfigNameMismatchError
branches so nil entries yield the typed error rather than a panic.
- split component manager config into dedicated config and builtin packages - move provider registry APIs under providerapi, and remove componentmanager provider re-export wrappers. - normalize provider config construction so YAML/default paths share config helpers while cmd/serve only wires provider creation. Signed-off-by: Jin Wang <jinwan@nvidia.com>
0a819d4 to
6b50d8f
Compare
Description
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes