fix: auto-heal corrupted OCI local store by forcing re-pull #1455
Signed-off-by: pnkcaht <samzoovsk19@gmail.com>
can't be merged until the linter is pleased and a maintainer re-approves
I will solve
Resolved guys @krissetto @simonferquel
Task lint
pnkcaht@pnkcaht:~/Documents/CagentDocker/cagent$ task lint
task: [lint] golangci-lint run
0 issues.
task: [lint] go mod tidy --diff >/dev/null || (echo "go.mod/go.sum files are not tidy" && exit 1)
pkg/config/sources.go
Outdated
    // 1. Try local first
    data, err := tryLoad()
    if err == nil {
        return data, nil
    }
This is a breaking change to this function and introduces new bugs.
This early return means we no longer check for updates to the OCI reference on the registry, which was intended behavior.
It also causes the --pull-interval <mins> feature of cagent api, which is intended to auto-update the image periodically if updates are available, to break. This function now never does that check unless there are store corruption issues.
Let's make sure any change in this PR only fixes potential corruption issues, without changing the underlying logical behavior.
done
What I did
Related issue
Fixes #1448
What was the bug
When an OCI artifact was already present in the local content store, the system assumed it was valid and tried to load it unconditionally.
If the local store became corrupted or partially written (for example, interrupted downloads, invalid tar layers, or missing metadata), the following happened:
[…] ErrStoreCorrupted.
[…] remote.Pull was not sufficient, because the reference already existed locally.
In practice, a broken local cache could brick the OCI source resolution entirely.
Explain With Diagrams
OCI source resolution — old behavior (bug)
Description
In the previous implementation, OCI-based agent configurations were loaded from the local content store without any reliable recovery mechanism.
When an agent was requested, the OCI source attempted to read the artifact directly from the local content.Store. If any file inside the store was missing or inconsistent (for example, a missing reference file, tarball, or metadata), the store returned ErrStoreCorrupted.
At this point, the error was treated as fatal.
What went wrong
Once ErrStoreCorrupted was returned: […]
This created a persistent failure mode where a transient or partial disk issue resulted in a permanently broken agent cache. The system had no mechanism to invalidate or repair a broken local reference.
Impact
From the user’s perspective, this surfaced as intermittent but unrecoverable errors such as:
“[…] <name>:latest not found”
Even though the remote artifact was valid and accessible, the local corruption prevented recovery.
New behavior (self-healing OCI store flow)
Description
With the new implementation, the local OCI content store is treated as a recoverable cache rather than a source of truth.
When an agent is requested from an OCI reference, the system follows a multi-step fallback strategy that guarantees recovery from partial or inconsistent local state.
Step-by-step flow
Local load attempt
Normal OCI pull (safe revalidation)
Corruption detection
[…] ErrStoreCorrupted, the store is considered inconsistent.
Forced re-pull (store repair)
Final retry
Key guarantees
Outcome
This change converts a hard failure scenario into a self-healing process, ensuring that agent execution remains reliable even in the presence of local cache corruption.
Failure scenarios and recovery boundaries
Description
This diagram highlights the different failure scenarios that can occur when loading agents from OCI artifacts and clearly defines where recovery is possible and where it must stop.
The goal is to avoid infinite retries while still guaranteeing automatic repair whenever feasible.
Failure scenarios covered
Missing reference link
[…] store/refs/ does not exist.
Missing or unreadable tarball
[…] <digest>.tar file is missing or cannot be parsed.
Invalid image structure
Empty or unreadable layers
Remote pull failure
Recovery boundaries
If local corruption is detected:
If remote pull fails and no valid local copy exists:
If remote pull fails but a valid local copy exists:
Safety guarantees
Outcome
This model ensures that the system aggressively heals local state when possible, while still failing fast and transparently when recovery is genuinely impossible.
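The recovery boundaries can be read as a small decision table. The sketch below is an interpretation, not the project's code: the outcome for each case is an assumption inferred from the stated goals (repair when feasible, no infinite retries, fail fast otherwise), and decide and the action constants are hypothetical names.

```go
package main

// action is the outcome chosen for one resolution attempt.
type action string

const (
	useLocal    action = "use local copy"
	forceRepull action = "force re-pull once"
	failFast    action = "return error"
)

// decide maps the observed state to an action. The mapping is an assumed
// reading of the recovery boundaries, not confirmed behavior.
func decide(localCorrupted, localValid, pullFailed, alreadyRepaired bool) action {
	switch {
	case localCorrupted && !alreadyRepaired:
		return forceRepull // one repair attempt is allowed
	case localCorrupted:
		return failFast // repair was already tried, stop here
	case pullFailed && localValid:
		return useLocal // registry unreachable, cached copy still good
	case pullFailed:
		return failFast // nothing to fall back to
	default:
		return useLocal // pull succeeded, local copy is current
	}
}
```

The alreadyRepaired flag is what bounds the retries: a second corruption result after a forced re-pull ends in an error instead of another repair attempt.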
Notes
[…] content.ErrStoreCorrupted.
Important
No automated tests were added as part of this change.
The fix was validated through manual builds and targeted reasoning over failure scenarios.
OS / System