[ENHANCEMENT] Periodically verify ready model artifacts on nodes

## What would you like to be added?

Add a periodic model artifact integrity check to the model agent for BaseModel and ClusterBaseModel entries that are already marked Ready on a node.

The checker should run after initial download/readiness, inspect the model artifacts on the local node, and detect missing or corrupted files before the node continues serving inference workloads. When integrity validation fails, the model agent should stop advertising that node as Ready for the affected model by updating both the model-ready node label and the node-scoped model status ConfigMap. The BaseModel/ClusterBaseModel controller should then remove that node from `status.nodesReady` and add it to `status.nodesFailed` so InferenceService pods are not scheduled onto a node with bad model files.
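The failure flow described above can be sketched roughly as follows. This is an illustration only, not the model agent's or controller's actual code: the label key, the `Ready`/`Failed` state strings, and the dict-based stand-ins for the node labels and the node-scoped ConfigMap are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical state strings; the real ConfigMap schema may differ.
READY = "Ready"
FAILED = "Failed"


def label_key(model: str) -> str:
    # Hypothetical label key format, used only for this sketch.
    return f"example.invalid/model-ready-{model}"


def on_integrity_failure(
    node_labels: Dict[str, str],
    node_configmap: Dict[str, str],
    model: str,
) -> None:
    """Agent side: stop advertising the model as Ready on this node.

    Both the node label and the node-scoped ConfigMap entry are updated,
    so scheduling and controller status agree with each other.
    """
    node_labels[label_key(model)] = FAILED
    node_configmap[model] = FAILED


def aggregate_status(
    model: str,
    configmaps_by_node: Dict[str, Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Controller side: derive nodesReady and nodesFailed for one model
    from the per-node ConfigMap entries."""
    ready = sorted(
        node for node, cm in configmaps_by_node.items() if cm.get(model) == READY
    )
    failed = sorted(
        node for node, cm in configmaps_by_node.items() if cm.get(model) == FAILED
    )
    return ready, failed
```

Whether the agent should flip the label value or remove the label entirely is left open here; the sketch flips it only to keep the example small.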

The implementation should reuse the current model-agent domain flow where possible:

- Ready node labels are managed by the model agent.
- Node-scoped model status ConfigMaps are the source used by the BaseModel and ClusterBaseModel controllers.
- OCI downloads already have download-time verification through object size/MD5 checks.
- Hugging Face and local model paths should get an equivalent post-ready validation strategy, such as a persisted manifest or conservative file/config presence validation when full checksums are not available.
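One possible shape for the persisted-manifest strategy mentioned in the last bullet is sketched below. It is a minimal illustration under stated assumptions: the manifest layout (relative path mapped to size and SHA-256), the function names, and the `full` flag that toggles checksum comparison are all hypothetical and not part of the existing model agent.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(model_dir: Path) -> Dict[str, dict]:
    """Record size and checksum for every file, e.g. right after download."""
    manifest = {}
    for path in sorted(model_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(model_dir).as_posix()
            manifest[rel] = {
                "size": path.stat().st_size,
                "sha256": sha256_file(path),
            }
    return manifest


def verify_manifest(
    model_dir: Path, manifest: Dict[str, dict], full: bool = True
) -> List[str]:
    """Return a list of problems; an empty list means the check passed.

    With full=False only presence and size are checked, which is cheaper
    but cannot detect same-size corruption.
    """
    problems = []
    for rel, want in manifest.items():
        path = model_dir / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif path.stat().st_size != want["size"]:
            problems.append(f"size mismatch: {rel}")
        elif full and sha256_file(path) != want["sha256"]:
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

In this sketch the manifest would need to be stored outside the model directory (or excluded from the walk) so it does not list itself. A cheap presence/size pass could run on every interval, with the full checksum pass on a longer cadence, if hashing large models proves too expensive.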

## Why is this needed?

Today the model agent verifies some artifacts during the initial download path and then marks the model Ready for the node. After that point, the node may keep advertising the model as Ready even if files are later deleted, truncated, or corrupted by external factors such as disk cleanup, manual intervention, filesystem problems, or failed node maintenance.

InferenceService scheduling relies on the model-ready node label, so a pod can be placed on a node whose BaseModel or ClusterBaseModel is still reported Ready even though the model files on disk are no longer valid. That can lead to model load failures, partial loads, or unexpected generation behavior.

## Completion requirements

- [ ] Add a periodic model-agent integrity reconciliation loop with a configurable interval and startup jitter.
- [ ] Validate only models that are currently Ready on the local node, using the node-scoped model status ConfigMap and existing model storage metadata.
- [ ] Reuse existing OCI object validation logic for OCI-backed models where possible.
- [ ] Add a reliable validation path for Hugging Face and local models, including missing-file detection and a documented strategy for corruption detection.
- [ ] On validation failure, update the node label and node-scoped ConfigMap so the affected node is no longer selected as Ready for the model.
- [ ] Ensure BaseModel and ClusterBaseModel status updates remove the failed node from `nodesReady` and expose it in `nodesFailed`.
- [ ] Record logs and metrics for integrity check success/failure, duration, and failure reason.
- [ ] Add unit tests for the integrity checker and reconciliation behavior, plus controller tests covering Ready-to-Failed node status propagation.
- [ ] Update model-agent configuration documentation with the new interval and behavior.
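As a rough illustration of the first requirement (a configurable interval with startup jitter), the loop could look something like the sketch below. The function names, the `jitter_fraction` parameter, and the injectable `sleep` and `rng` arguments are assumptions for the example, not existing model-agent configuration.

```python
import logging
import random
import time
from typing import Callable, Optional

log = logging.getLogger("integrity-check")


def startup_delay(
    interval_s: float, jitter_fraction: float, rng: random.Random
) -> float:
    """Random delay in [0, interval * jitter_fraction] so agents on many
    nodes do not all start hashing at the same moment."""
    return rng.uniform(0.0, interval_s * jitter_fraction)


def run_integrity_loop(
    check_once: Callable[[], None],
    interval_s: float,
    jitter_fraction: float = 0.1,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> int:
    """Run check_once every interval_s seconds after a jittered start.

    max_runs, sleep, and rng exist so the loop can be tested without
    waiting; a real agent would run until shutdown.
    """
    rng = rng or random.Random()
    sleep(startup_delay(interval_s, jitter_fraction, rng))
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        try:
            check_once()
        except Exception:
            # One failing pass must not stop future passes.
            log.exception("integrity check pass failed")
        runs += 1
        elapsed = time.monotonic() - started
        log.info("integrity check pass finished in %.3fs", elapsed)
        # Keep the period roughly constant regardless of check duration.
        sleep(max(0.0, interval_s - elapsed))
    return runs
```

The measured `elapsed` value is the natural place to feed a duration metric, alongside counters for success and failure reason.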

## Can you help us implement this enhancement?

- [ ] Yes, I can contribute
- [ ] No, but I'm available for testing
- [ ] No

