[ENHANCEMENT] Periodically verify ready model artifacts on nodes #613

@YouNeedCryDear

What would you like to be added?

Add a periodic model artifact integrity check to the model agent for BaseModel and ClusterBaseModel entries that are already marked Ready on a node.

The checker should run after initial download/readiness, inspect the model artifacts on the local node, and detect missing or corrupted files before the node continues serving inference workloads. When integrity validation fails, the model agent should stop advertising that node as Ready for the affected model by updating both the model-ready node label and the node-scoped model status ConfigMap. The BaseModel/ClusterBaseModel controller should then remove that node from status.nodesReady and add it to status.nodesFailed so InferenceService pods are not scheduled onto a node with bad model files.
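The controller-side half of this flow, moving a node out of status.nodesReady and into status.nodesFailed, could look roughly like the sketch below. `ModelStatus` and `markNodeFailed` are hypothetical names used only for illustration; the real BaseModel/ClusterBaseModel status types carry more fields, and the controller may structure this differently.

```go
package main

import (
	"fmt"
	"slices"
)

// ModelStatus is a hypothetical stand-in that mirrors only the two status
// fields named in this issue (status.nodesReady and status.nodesFailed).
type ModelStatus struct {
	NodesReady  []string
	NodesFailed []string
}

// markNodeFailed (hypothetical helper) removes node from NodesReady and adds
// it to NodesFailed. Calling it repeatedly for the same node is a no-op.
func markNodeFailed(s *ModelStatus, node string) {
	s.NodesReady = slices.DeleteFunc(s.NodesReady, func(n string) bool {
		return n == node
	})
	if !slices.Contains(s.NodesFailed, node) {
		s.NodesFailed = append(s.NodesFailed, node)
	}
}

func main() {
	s := &ModelStatus{NodesReady: []string{"node-a", "node-b"}}
	markNodeFailed(s, "node-a")
	fmt.Println(s.NodesReady, s.NodesFailed) // prints: [node-b] [node-a]
}
```

Keeping the transition idempotent matters here because the integrity loop may report the same failed node on every pass until the artifacts are repaired.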

The implementation should reuse the current model-agent domain flow where possible:

  • Ready node labels are managed by the model agent.
  • Node-scoped model status ConfigMaps are the source used by the BaseModel and ClusterBaseModel controllers.
  • OCI downloads already have download-time verification through object size/MD5 checks.
  • Hugging Face and local model paths should get an equivalent post-ready validation strategy, such as a persisted manifest or conservative file/config presence validation when full checksums are not available.

Why is this needed?

Today the model agent verifies some artifacts during the initial download path and then marks the model Ready for the node. After that point, the node may keep advertising the model as Ready even if files are later deleted, truncated, or corrupted by external factors such as disk cleanup, manual intervention, filesystem problems, or failed node maintenance.

InferenceService scheduling relies on the model-ready node label, so a pod can be placed on a node whose BaseModel or ClusterBaseModel is still reported Ready even though the actual model files are no longer valid. That can lead to model load failures, partial loads, or unexpected generation behavior.

Completion requirements

  • Add a periodic model-agent integrity reconciliation loop with a configurable interval and startup jitter.
  • Validate only models that are currently Ready on the local node, using the node-scoped model status ConfigMap and existing model storage metadata.
  • Reuse existing OCI object validation logic for OCI-backed models where possible.
  • Add a reliable validation path for Hugging Face and local models, including missing-file detection and a documented strategy for corruption detection.
  • On validation failure, update the node label and node-scoped ConfigMap so the affected node is no longer selected as Ready for the model.
  • Ensure BaseModel and ClusterBaseModel status updates remove the failed node from nodesReady and expose it in nodesFailed.
  • Record logs and metrics for integrity check success/failure, duration, and failure reason.
  • Add unit tests for the integrity checker and reconciliation behavior, plus controller tests covering Ready-to-Failed node status propagation.
  • Update model-agent configuration documentation with the new interval and behavior.

Can you help us implement this enhancement?

  • Yes, I can contribute
  • No, but I'm available for testing
  • No
