Non-deterministic boot config selection when multiple configs match the same server
Problem
When multiple HTTPBootConfig or IPXEBootConfig resources match the same server (by UUID or IP), the boot server picks Items[0] from the informer cache query result. The informer cache stores objects in a Go map, so field index lookups return results in non-deterministic order. As a result, the server may receive a different boot config on each request.
This affects all four boot server handlers in server/bootserver.go:
- handleIPXE(), line 139: config := ipxeBootConfigList.Items[0]
- handleIgnitionIPXEBoot(), line 211: ipxeBootConfig := ipxeBootConfigList.Items[0]
- handleIgnitionHTTPBoot(), line 318: httpBootConfig := HTTPBootConfigList.Items[0]
- handleHTTPBoot(), line 470: httpBootConfig := httpBootConfigs.Items[0]
Each already has a TODO acknowledging the gap:
// TODO: Pick the first HttpBootConfig if multiple CRs are found.
// Implement better validation in the future.
When does this happen?
A server can have multiple matching boot configs when a ServerMaintenance with a ServerBootConfigurationTemplate is active. The ServerMaintenanceReconciler creates a maintenance ServerBootConfiguration and sets server.spec.maintenanceBootConfigurationRef. The existing workload ServerBootConfiguration (from server.spec.bootConfigurationRef) is not removed — both coexist during the maintenance window.
Each ServerBootConfiguration creates a child HTTPBootConfig or IPXEBootConfig via the ServerBootConfigurationHTTPReconciler or ServerBootConfigurationPXEReconciler. Both children are indexed by the same systemUUID and systemIPs/networkIdentifiers, so any lookup by those fields returns both.
The result is roughly a 50/50 chance of serving the wrong boot image on each request — either the workload image during maintenance, or the maintenance image after maintenance ends (until the maintenance SBC is cleaned up).
Expected behavior
When multiple boot configs match a server, the boot server should deterministically select the correct one based on the server's current state. During maintenance (when maintenanceBootConfigurationRef is set), the maintenance boot config should be served. Otherwise, the workload boot config should be served.
Suggested approach
Each HTTPBootConfig/IPXEBootConfig has an owner reference pointing to its parent ServerBootConfiguration (set via controllerutil.SetControllerReference). The parent ServerBootConfiguration name matches either server.spec.bootConfigurationRef or server.spec.maintenanceBootConfigurationRef.
When multiple configs match:
- Resolve each config's owning ServerBootConfiguration from the owner reference
- Look up the Server by the systemUUID or serverRef on the ServerBootConfiguration
- If server.spec.maintenanceBootConfigurationRef is set, prefer the config owned by that ServerBootConfiguration
- Otherwise, prefer the config owned by server.spec.bootConfigurationRef
boot-operator already depends on metalv1alpha1 (metal-operator's API types) and has access to Server resources via the k8sClient passed to the handlers.
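The selection steps above could look roughly like the following. This is a minimal sketch, not the boot-operator implementation: ownerRef, bootConfig, serverRefs, owningSBC and selectBootConfig are hypothetical stand-ins defined here so the example runs on its own. In the real handlers the owner would come from the object's owner references and the two refs from the Server fetched via k8sClient.

```go
package main

import "fmt"

// ownerRef is a hypothetical stand-in for an owner reference,
// reduced to the fields this sketch needs.
type ownerRef struct {
	Kind       string
	Name       string
	Controller bool
}

// bootConfig is a hypothetical stand-in for an HTTPBootConfig or IPXEBootConfig.
type bootConfig struct {
	Name   string
	Owners []ownerRef
}

// serverRefs is a hypothetical stand-in for the two references on server.spec.
// An empty string means the reference is not set.
type serverRefs struct {
	BootConfigurationRef            string
	MaintenanceBootConfigurationRef string
}

// owningSBC returns the name of the controlling ServerBootConfiguration, if any.
func owningSBC(c bootConfig) (string, bool) {
	for _, o := range c.Owners {
		if o.Controller && o.Kind == "ServerBootConfiguration" {
			return o.Name, true
		}
	}
	return "", false
}

// selectBootConfig picks the config owned by the maintenance
// ServerBootConfiguration when that ref is set, and otherwise the one owned
// by the workload ServerBootConfiguration. The result does not depend on the
// order of configs. It reports false when no config matches the wanted owner.
func selectBootConfig(configs []bootConfig, s serverRefs) (bootConfig, bool) {
	want := s.BootConfigurationRef
	if s.MaintenanceBootConfigurationRef != "" {
		want = s.MaintenanceBootConfigurationRef
	}
	if want == "" {
		return bootConfig{}, false
	}
	for _, c := range configs {
		if name, ok := owningSBC(c); ok && name == want {
			return c, true
		}
	}
	return bootConfig{}, false
}

func main() {
	configs := []bootConfig{
		{Name: "workload-http", Owners: []ownerRef{{Kind: "ServerBootConfiguration", Name: "workload", Controller: true}}},
		{Name: "maintenance-http", Owners: []ownerRef{{Kind: "ServerBootConfiguration", Name: "maintenance", Controller: true}}},
	}
	refs := serverRefs{BootConfigurationRef: "workload", MaintenanceBootConfigurationRef: "maintenance"}
	if c, ok := selectBootConfig(configs, refs); ok {
		fmt.Println("selected:", c.Name)
	}
}
```

The sketch deliberately does not fall back to the workload config when the maintenance ref is set but no matching config exists yet. Whether the handler should fall back or return an error in that case is a design choice left open here.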
Scope
This is a latent issue: it does not manifest today because no in-tree code creates a ServerMaintenance with a ServerBootConfigurationTemplate. It will become a problem when the template feature is used, or in any other scenario where multiple ServerBootConfiguration resources target the same server.
Related
- metal-operator#807: Boot Policy and Deterministic Boot Process (maintenance uses its own ServerBootConfiguration created from ServerBootConfigurationTemplate, coexisting with the workload SBC)