What would you like to be added?
Add a periodic model artifact integrity check to the model agent for BaseModel and ClusterBaseModel entries that are already marked Ready on a node.
The checker should run after initial download/readiness, inspect the model artifacts on the local node, and detect missing or corrupted files before the node continues serving inference workloads. When integrity validation fails, the model agent should stop advertising that node as Ready for the affected model by updating both the model-ready node label and the node-scoped model status ConfigMap. The BaseModel/ClusterBaseModel controller should then remove that node from status.nodesReady and add it to status.nodesFailed so InferenceService pods are not scheduled onto a node with bad model files.
The implementation should reuse the current model-agent domain flow where possible:
- Ready node labels are managed by the model agent.
- Node-scoped model status ConfigMaps are the source used by the BaseModel and ClusterBaseModel controllers.
- OCI downloads already have download-time verification through object size/MD5 checks.
- Hugging Face and local model paths should get an equivalent post-ready validation strategy, such as a persisted manifest or conservative file/config presence validation when full checksums are not available.
Why is this needed?
Today the model agent verifies some artifacts during the initial download path and then marks the model Ready for the node. After that point, the node may keep advertising the model as Ready even if files are later deleted, truncated, or corrupted by external factors such as disk cleanup, manual intervention, filesystem problems, or failed node maintenance.
InferenceService scheduling relies on the model-ready node label, so a pod can be placed on a node whose ClusterBaseModel is still reported Ready even though the actual model files are no longer valid. That can lead to model load failures, partial loads, or unexpected generation behavior.
Completion requirements
Can you help us implement this enhancement?
What would you like to be added?
Add a periodic model artifact integrity check to the model agent for BaseModel and ClusterBaseModel entries that are already marked Ready on a node.
The checker should run after initial download/readiness, inspect the model artifacts on the local node, and detect missing or corrupted files before the node continues serving inference workloads. When integrity validation fails, the model agent should stop advertising that node as Ready for the affected model by updating both the model-ready node label and the node-scoped model status ConfigMap. The BaseModel/ClusterBaseModel controller should then remove that node from
status.nodesReadyand add it tostatus.nodesFailedso InferenceService pods are not scheduled onto a node with bad model files.The implementation should reuse the current model-agent domain flow where possible:
Why is this needed?
Today the model agent verifies some artifacts during the initial download path and then marks the model Ready for the node. After that point, the node may keep advertising the model as Ready even if files are later deleted, truncated, or corrupted by external factors such as disk cleanup, manual intervention, filesystem problems, or failed node maintenance.
InferenceService scheduling relies on the model-ready node label, so a pod can be placed on a node whose ClusterBaseModel is still reported Ready even though the actual model files are no longer valid. That can lead to model load failures, partial loads, or unexpected generation behavior.
Completion requirements
nodesReadyand expose it innodesFailed.Can you help us implement this enhancement?